---
license: mit
language:
- ru
base_model:
- t-tech/T-lite-it-1.0
pipeline_tag: text-generation
library_name: transformers
tags:
- pytorch
metrics:
- mae
- spearmanr
---
# pollux-judge-7b

<!-- Provide a quick summary of what the model is/does. -->

![banner](images/logo_pollux_horiz_short_WHITEBG.png)

pollux-judge-7b is a 7-billion parameter generative language model specifically designed to evaluate the quality of other language models' responses in Russian.
The model assesses answer quality given the input instruction, specific evaluation criteria, and scoring rubrics, providing automated LLM performance evaluation for Russian-language tasks.

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

pollux-judge-7b is an integral component of the POLLUX project, a comprehensive initiative dedicated to evaluating the generative capabilities of Large Language Models (LLMs). 
At the heart of this project lies the [POLLUX dataset](https://huggingface.co/datasets/ai-forever/POLLUX), which introduces systematic taxonomies for both generative tasks and evaluation criteria, providing quantitative and qualitative assessments of responses from top-tier LLMs.

Built upon the [t-tech/T-lite-it-1.0](https://huggingface.co/t-tech/T-lite-it-1.0) architecture, pollux-judge-7b is a 7-billion-parameter decoder model trained in a sequence-to-sequence fashion. 
The model is designed to predict both numerical scores and detailed textual rationales based on the original instruction, the LLM's response, specific evaluation criteria, scoring rubrics, and reference answers when available.

While the model is technically capable of processing any type of instruction and criterion when properly formatted, its training has been specifically optimized using the generative tasks and evaluation criteria derived from the taxonomies established within the [POLLUX dataset](https://huggingface.co/datasets/ai-forever/POLLUX). 


- **Model type:** decoder
- **Language(s) (NLP):** Russian
- **License:** MIT
- **Finetuned from model:** [t-tech/T-lite-it-1.0](https://huggingface.co/t-tech/T-lite-it-1.0)

### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** [POLLUX code base](https://github.com/ai-forever/POLLUX)
- **Paper:** [ArXiv preprint](https://arxiv.org/pdf/2505.24616)

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

pollux-judge-7b is specifically designed for assessing text responses against a single, predefined criterion per evaluation run. 
The model operates optimally when provided with all essential components: the source instruction, the response to be evaluated (typically generated by another LLM), the specific evaluation criterion, and its corresponding scoring rubrics.


### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

While the model may **technically** process multiple criteria simultaneously, such usage falls outside its intended design and may yield unpredictable results. 
Similarly, the model is not designed to function autonomously in determining appropriate evaluation criteria—it requires explicit criterion specification to perform reliable assessments.

For optimal performance and reliable results, users should structure each evaluation session around one criterion at a time, providing all necessary contextual components to enable the model's comprehensive scoring and rationale generation capabilities.


## MODEL OUTPUT DISCLAIMER AND LIMITATION OF LIABILITY

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

All content, responses, and outputs generated by pollux-judge-7b (the "Model") are produced through automated computational processes based on statistical patterns learned from pre-training data. 
Such outputs do not constitute statements, opinions, recommendations, or positions of the model developers, publishers, or affiliated entities (collectively, the "Developers").

The Model's outputs do not represent, reflect, or endorse any views, beliefs, policies, or positions held by the Developers. 
Generated content should not be interpreted as official statements, advice, or guidance from the Developers.

While the Developers employed appropriate data curation practices during fine-tuning and avoided the intentional inclusion of inappropriate content, the Model's responses may reflect patterns present in the underlying pre-training datasets, which were sourced from publicly available internet content and other large-scale text corpora.

The Developers expressly disclaim responsibility for any content generated by the Model. Users acknowledge that:
- Generated outputs are probabilistic and may contain inaccuracies, biases, or inappropriate content
- The Developers cannot guarantee the accuracy, completeness, or appropriateness of any Model output
- Users assume full responsibility for evaluating and using Model-generated content

Users are solely responsible for reviewing, validating, and determining the appropriateness of any Model-generated content before use or distribution.


## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

torch.manual_seed(42)

# Prompt fields (in Russian): task to evaluate, reference answer, answer under evaluation,
# criterion name, and scoring rubrics
PROMPT_TEMPLATE = '''### Задание для оценки:
{instruction}

### Эталонный ответ:
{reference_answer}

### Ответ для оценки:
{answer}

### Критерий оценки:
{criteria_name}

### Шкала оценивания по критерию:
{criteria_rubrics}
'''

instruction = 'Сколько будет 2+2?'
reference_answer = ''  # leave empty when no reference answer is available
answer = 'Будет 4'
criteria_name = 'Правильность ответа'
criteria_rubrics = '''0: Дан неправильный ответ или ответ отсутствует.

1: Ответ модели неполный (не на все вопросы задания получен ответ, в формулировке ответа отсутствует часть информации).

2: Ответ модели совпадает с эталонным или эквивалентен ему.'''

prompt = PROMPT_TEMPLATE.format(instruction=instruction,
                                reference_answer=reference_answer,
                                answer=answer,
                                criteria_name=criteria_name,
                                criteria_rubrics=criteria_rubrics)

MODEL_PATH = "ai-forever/pollux-judge-7b"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)

messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

sequence_ids = model.generate(
    **model_inputs,
    max_new_tokens=4096
)
# Keep only the newly generated tokens, dropping the echoed prompt
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, sequence_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(response)
```
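
Since each evaluation run covers a single criterion, scoring an answer against several criteria simply means issuing one request per criterion. Below is a minimal sketch that reuses `model`, `tokenizer`, `PROMPT_TEMPLATE`, `instruction`, `answer`, and `criteria_rubrics` from the snippet above; the `judge` helper and the second criterion in the `criteria` dictionary are made up for illustration.

```python
def judge(instruction, answer, criteria_name, criteria_rubrics, reference_answer=""):
    """Run one single-criterion evaluation and return the generated verdict text."""
    prompt = PROMPT_TEMPLATE.format(instruction=instruction,
                                    reference_answer=reference_answer,
                                    answer=answer,
                                    criteria_name=criteria_name,
                                    criteria_rubrics=criteria_rubrics)
    messages = [{"role": "user", "content": prompt}]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=4096)
    return tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)

# One call per criterion; the second entry ("Literacy") uses made-up rubrics purely for illustration.
criteria = {
    "Правильность ответа": criteria_rubrics,
    "Грамотность": "0: Ответ содержит ошибки.\n\n1: Ответ написан грамотно.",
}
for name, rubrics in criteria.items():
    print(name, "->", judge(instruction, answer, name, rubrics))
```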

## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

Given the substantial time investment required for manual dataset creation—approximately 24,447 hours for the [POLLUX dataset](https://huggingface.co/datasets/ai-forever/POLLUX)—we opted to employ synthetic data for training purposes, as acquiring a manually composed training set of comparable size was not feasible.

Our synthetic data generation process proceeded in several stages. 
Initially, we generated 78,000 instructions using three state-of-the-art language models: [DeepSeekV3](https://huggingface.co/deepseek-ai/DeepSeek-V3), [OpenAI GPT-4o](https://openai.com/index/hello-gpt-4o/), and [o3-mini](https://openai.com/index/openai-o3-mini/), with each model contributing equally to the instruction pool. 
These instructions were based on the POLLUX tasks taxonomy and complexity levels to ensure consistency with the original framework. 
The training data does not include the Recommendations, Applied Brainstorming, Literary Text Generation, Questions Generation, Style Transfer, Code Modification, and AI as a Character tasks, nor their corresponding Task-specific criteria, which enables out-of-domain evaluation of the resulting LM-as-a-Judge model.
To maintain data quality, we implemented a filtering procedure that removed instructions containing more than 5% non-Russian tokens as well as duplicate entries, ultimately yielding a refined set of 26,000 high-quality instructions.
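
As an illustration of this step, here is a simplified filter that approximates "non-Russian tokens" with a Cyrillic-character check and removes exact duplicates after whitespace normalisation; the actual tokenisation and deduplication used for the dataset are not reproduced here.

```python
import re

def non_russian_ratio(text: str) -> float:
    """Share of whitespace-separated tokens that contain no Cyrillic characters."""
    tokens = text.split()
    if not tokens:
        return 1.0
    non_russian = sum(1 for tok in tokens if not re.search(r"[а-яё]", tok, re.IGNORECASE))
    return non_russian / len(tokens)

def filter_instructions(instructions, max_non_russian=0.05):
    """Drop duplicates and instructions with more than 5% non-Russian tokens."""
    seen, kept = set(), []
    for text in instructions:
        key = " ".join(text.lower().split())  # cheap normalisation for duplicate detection
        if key in seen or non_russian_ratio(text) > max_non_russian:
            continue
        seen.add(key)
        kept.append(text)
    return kept
```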

Subsequently, we mapped these synthetic instructions to their corresponding evaluation criteria sets using the same algorithm employed in the original [POLLUX dataset](https://huggingface.co/datasets/ai-forever/POLLUX). 
Each criteria set comprised Critical, General, Subjective, and relevant Domain- and Task-specific criteria (for detailed methodology, see Section 2.3 in the [preprint](https://arxiv.org/pdf/2505.24616)). 
To generate diverse responses, we employed 15 open-source language models from various families, including Llama, Phi, Qwen, Mistral, and Gemma, with each model contributing equally to the answer generation process (for the complete list of models, see Appendix M.2 in the [preprint](https://arxiv.org/pdf/2505.24616)). 

For criteria annotation, we utilized [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1), which generated numerical scores based on established criterion rubrics along with corresponding rationales for each evaluation.
This systematic approach resulted in 8,000,000 samples, each containing the complete tuple of (instruction, answer, criterion, score, rationale). 
From this dataset, we performed stratified random sampling across tasks to obtain our final training set of 1,000,000 samples, ensuring balanced representation across different task categories.
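
For illustration, a stratified draw of this kind can be sketched with pandas as follows; the `task` column name and the proportional-allocation rule are assumptions, not the exact procedure used to build the training set.

```python
import pandas as pd

def stratified_sample(df: pd.DataFrame, n_total: int, strata_col: str = "task",
                      seed: int = 42) -> pd.DataFrame:
    """Sample roughly n_total rows while preserving the per-task proportions of df."""
    frac = n_total / len(df)
    return (
        df.groupby(strata_col, group_keys=False)
          .apply(lambda g: g.sample(frac=frac, random_state=seed))
          .reset_index(drop=True)
    )

# e.g. train_set = stratified_sample(full_synthetic_set, n_total=1_000_000)
```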


### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

The model was trained in a sequence-to-sequence fashion. 
The input consists of the source instruction, the LLM's answer, the criterion name, its rubrics, and a reference answer when one is available.
The output is expected to be a numerical score drawn from the provided rubrics together with a textual explanation. 

#### Training Hyperparameters

- **Training regime:** bf16 mixed precision; <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
- **Epochs:** 3;
- **Optimizer:** AdamW;
- **Learning rate:** from 1e-05 to 0 with a linear scheduler;
- **Batch size:** 256.
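
For reference only, these hyperparameters map roughly onto Hugging Face `TrainingArguments` as sketched below; the per-device batch size and gradient-accumulation split are assumptions (the actual run used FSDP on 64 GPUs, see Technical Specifications), and every unnamed setting is left at its default.

```python
from transformers import TrainingArguments

# Illustrative mapping of the reported hyperparameters; not the authors' exact configuration.
training_args = TrainingArguments(
    output_dir="pollux-judge-7b-sft",
    num_train_epochs=3,
    learning_rate=1e-5,
    lr_scheduler_type="linear",     # linear decay from 1e-05 down to 0
    bf16=True,                      # bf16 mixed precision
    optim="adamw_torch",
    per_device_train_batch_size=2,  # assumption: 2 per GPU x 64 GPUs x 2 accumulation steps = 256
    gradient_accumulation_steps=2,
)
```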

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

### Testing Data, Factors & Metrics

#### Testing Data

<!-- This should link to a Dataset Card if possible. -->

For testing data, we employed the [POLLUX dataset](https://huggingface.co/datasets/ai-forever/POLLUX). 
Note that this provides both in- and out-of-domain evaluation, as some of the tasks and criteria are absent from the training data. 

#### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

We employed **Spearman’s rank correlation** with expert judgements and **Mean Absolute Error (MAE)**, alongside Verdict Confidence, to assess the performance of pollux-judge-7b and compare it with that of the reference models.

MAE offers a high degree of interpretability, as it is measured on the same scale as the annotation, namely in points. 
Spearman’s rank correlation, in turn, quantifies the degree of monotonic association between two rankings of model outputs and shows how consistently the LLM-as-a-Judge reproduces the relative ordering of output quality established by human experts.

Verdict Confidence is computed as the maximum empirical probability among the assigned scores, i.e. the share of annotations that agree on the most frequent score.
We adopted Verdict Confidence as a measure of annotator agreement instead of Krippendorff’s alpha or the Dawid-Skene algorithm, as the latter are harder to interpret.
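
The three metrics follow directly from their definitions; a minimal sketch with NumPy/SciPy is given below (the helper names are ours, and the per-criterion aggregation used in the paper is not reproduced).

```python
import numpy as np
from scipy.stats import spearmanr
from collections import Counter

def mae(judge_scores, expert_scores):
    """Mean Absolute Error, in points, between judge and expert scores."""
    return float(np.mean(np.abs(np.asarray(judge_scores) - np.asarray(expert_scores))))

def spearman(judge_scores, expert_scores):
    """Monotonic association between the judge's and the experts' score rankings."""
    return float(spearmanr(judge_scores, expert_scores).correlation)

def verdict_confidence(scores_for_one_item):
    """Maximum empirical probability among the scores assigned to a single item."""
    counts = Counter(scores_for_one_item)
    return max(counts.values()) / len(scores_for_one_item)

# Toy numbers, purely for illustration
print(mae([2, 1, 0, 2], [2, 1, 1, 2]))       # 0.25
print(spearman([2, 1, 0, 2], [2, 1, 1, 2]))  # rank agreement between judge and experts
print(verdict_confidence([2, 2, 1]))         # 2 of 3 annotators agree -> 0.67
```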


### Results

For the reference models we took [OpenAI GPT-4o](https://openai.com/index/hello-gpt-4o/), [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) and [M-Prometheus-14B](https://huggingface.co/Unbabel/M-Prometheus-14B).
We report aggregate results averaged over the evaluated models (the LM-as-a-Judge predicts the scores assigned to the answers of a particular LLM) on the out-of-domain part of the [POLLUX dataset](https://huggingface.co/datasets/ai-forever/POLLUX).
For detailed evaluation results see Appendix D in the [preprint](https://arxiv.org/pdf/2505.24616).

Spearman’s rank correlation:

| Evaluated model | pollux-judge-7b | DeepSeek-R1 | M-Prometheus-14B | GPT-4o (2024-11-20) |
| --- | --- | --- | --- | --- | 
| Claude 3.5 Sonnet (2024-10-22) | 0.660 | 0.739 | -0.006 | 0.759 |
| GPT-4o (2024-08-06) |  0.596 |  0.627 | -0.033 | 0.643 |
| GigaChat-Max (1.0.26.20) |  0.596 |  0.640 | 0.027 | 0.649 |
| Llama-3.1-405B | 0.613 |  0.591 | 0.022 | 0.639 |
| T-pro-it-1.0 | 0.571 |  0.573 | -0.044 | 0.616 |
| YaGPT-4-Pro (2024-10-23) | 0.616 | 0.635 | 0.099 | 0.671 |
|o1 (2024-12-17) |  0.675 |  0.748 | -0.022 | 0.771 |
| Avg. | 0.619 |  0.647 | 0.019 | 0.674 |

MAE (measured in points; lower is better):

| Evaluated model                | pollux-judge-7b | DeepSeek-R1 | M-Prometheus-14B | GPT-4o (2024-11-20) |
| ---                            | --- | --- | --- | --- | 
| Claude 3.5 Sonnet (2024-10-22) | 0.501 | 0.245 | 2.697 | 0.236 |
| GPT-4o (2024-08-06)            | 0.484 | 0.349 | 2.676 | 0.339 |
| GigaChat-Max (1.0.26.20)       | 0.477 | 0.350 | 2.468 | 0.342 |
| Llama-3.1-405B                 | 0.517 | 0.448 | 1.912 | 0.405 |
| T-pro-it-1.0                   | 0.497 | 0.475 | 2.978 | 0.425 |
| YaGPT-4-Pro (2024-10-23)       | 0.511 | 0.387 | 1.793 | 0.369 |
|o1 (2024-12-17)                 | 0.438 | 0.244 | 2.873 | 0.229 |
| Avg.                           | 0.489 | 0.356 | 2.487 | 0.335 |

Verdict Confidence (calculated on the whole test sample):

| Evaluated model                | pollux-judge-7b | DeepSeek-R1 | M-Prometheus-14B | GPT-4o (2024-11-20) |
| ---                            | --- | --- | --- | --- | 
| Claude 3.5 Sonnet (2024-10-22) | 0.800 | 0.879 | 0.645 | 0.877 |
| GPT-4o (2024-08-06)            | 0.822 | 0.877 | 0.702 | 0.877 |
| GigaChat-Max (1.0.26.20)       | 0.824 | 0.878 | 0.715 | 0.879 |
| Llama-3.1-405B                 | 0.777 | 0.836 | 0.684 | 0.837 |
| T-pro-it-1.0                   | 0.791 | 0.838 | 0.644 | 0.842 |
| YaGPT-4-Pro (2024-10-23)       | 0.813 | 0.866 | 0.738 | 0.867 |
|o1 (2024-12-17)                 | 0.821 | 0.885 | 0.643 | 0.882 |
| Avg.                           | 0.808 | 0.866 | 0.684 | 0.867 |


## Technical Specifications

### Compute Infrastructure

#### Hardware

64 NVIDIA A100 80GB GPUs. 

#### Software

The model was trained with [FSDP](https://huggingface.co/docs/peft/accelerate/fsdp). 

## Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

```
@misc{martynov2025eyejudgementdissectingevaluation,
      title={Eye of Judgement: Dissecting the Evaluation of Russian-speaking LLMs with POLLUX}, 
      author={Nikita Martynov and Anastasia Mordasheva and Dmitriy Gorbetskiy and Danil Astafurov and Ulyana Isaeva and Elina Basyrova and Sergey Skachkov and Victoria Berestova and Nikolay Ivanov and Valeriia Zanina and Alena Fenogenova},
      year={2025},
      eprint={2505.24616},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.24616}, 
}
```