Benchmark evaluation

#3 by jglowa - opened

Please provide benchmark results, e.g. EuroEval or Eurolingua. Perplexity doesn't tell us much.

Tilde org

We are working on it right now.
We currently have issues with the LM Evaluation Harness, as the results we get from it do not match those we get from a plain HuggingFace reimplementation of the same tests.

Using lm-eval-harness or lighteval by HF allows for standardized evaluation that comes close to an industry standard and is battle-tested. A plain HuggingFace reimplementation (whatever is meant by that) will not allow a proper comparison, as it will be prone to errors that both of those frameworks have eliminated through many years of testing.

This is why we are waiting for results from lm-eval-harness instead of publishing what we got with HF. The problem turned out to be how the tokeniser is loaded in lm-eval-harness.

@TBergmanis Are you saying that if one loads your tokeniser normally (with AutoTokenizer.from_pretrained), then it should be alright? In that case I suppose vLLM works fine too, since it just uses that method as well.

Hi! When using plain Python HF, I think it should be fine. When using something that wraps HF, we have found that the tokenizer occasionally defaults to use_fast=True somewhere under the hood. This will not break inference, but it will (sometimes severely) degrade the performance of the model. Unfortunately, this was the case for our vLLM version, so we had to use the following setup:

python -m vllm.entrypoints.openai.api_server \
    --model "$MODEL_DIR" \
    --tokenizer "$MODEL_DIR" \
    --tokenizer-mode "slow"

Similarly, for some lm-evaluation-harness versions, we found that passing --model_args "pretrained=TildeAI/TildeOpen-30b,use_fast=False" would still fall back to the fast tokenizer, and we had to edit the source code.
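
A quick way to check whether a given setup has silently picked the fast tokenizer is to load both variants with plain transformers and compare their output on a sample sentence. A minimal sketch (the sample sentence is just a placeholder):

from transformers import AutoTokenizer

model_id = "TildeAI/TildeOpen-30b"
slow = AutoTokenizer.from_pretrained(model_id, use_fast=False)
fast = AutoTokenizer.from_pretrained(model_id, use_fast=True)

text = "This is a placeholder sentence for comparing tokenizations."
print("slow is_fast:", slow.is_fast, "ids:", slow.encode(text))
print("fast is_fast:", fast.is_fast, "ids:", fast.encode(text))
# If the id sequences differ, a wrapper that defaults to use_fast=True
# will not reproduce the intended (slow) tokenization.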

@mkronis Thanks for that.

I've now implemented support for the slow tokenizer in EuroEval (with the vLLM backend) via a new "#slow-tokenizer" parameter, i.e., euroeval -m TildeAI/TildeOpen-30b#slow-tokenizer, in this PR: https://github.com/EuroEval/EuroEval/pull/1257/

Running that evaluation now; it will take a day or two.

@saattrupdan, great to hear! Looking forward to your results.

Meanwhile, we have released some preliminary benchmark results obtained via lm-evaluation-harness, available in the updated README.
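
For reference, a run along these lines can be reproduced with the lm-evaluation-harness Python API; the sketch below is only illustrative (the task list, few-shot setting and batch size are placeholders rather than the exact configuration used, and note the use_fast caveat discussed above):

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    # use_fast=False may be ignored by some harness versions, as noted above
    model_args="pretrained=TildeAI/TildeOpen-30b,use_fast=False",
    tasks=["mmlu"],   # placeholder task; substitute the tasks listed in the README
    num_fewshot=5,
    batch_size=8,
)
print(results["results"])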

@mkronis Evaluation results with the slow tokenizer are live now: https://euroeval.com/leaderboards/Multilingual/european

It doesn't perform that well compared to SOTA 30B models, unfortunately:

(Screenshot: EuroEval leaderboard results)

Why does TildeOpen perform so poorly with its 30B params? Mistral Small 3.2 24B and Gemma 3 27B are much better, even though they are smaller. Heck, even Gemma 3 4B, Llama 3.1 8B and EuroLLM 9B are better in EuroEval!

It depends on how you train the model. EuroLLM included instruction data related to the test sets (such as MMLU train) in the model pretraining phase, and its results improved. I suppose what @saattrupdan evaluated was TildeOpen version 1.0, which was trained just on plain text without any instruction data. Version 1.1 has the MMLU train added and shows much better results. I am sure that seeing an extra 500M tokens of text + 10m tokens of instruction didn't make the 1.1 model much smarter than 1.0. Rather, it became better at doing the tests.

Another aspect that comes to mind is that all the tests are simply translated American tests. Performing well on those doesn't take much, let's say, Bulgarian, but rather loads of American English text. If you look at the performance on datasets like National Exams, which are 1) original tests from each respective country, and 2) in the original languages, you see that the same models perform worse than TildeOpen.

You can take a look at the results we get and replicate them if you wish.

Thanks for evaluating our model - I addressed some of my concerns in the comment above.

@TBergmanis Thank you for your explanation. But where can I find some independent benchmark results including TildeOpen 1.1? Maybe it could be included in https://huggingface.co/spaces/Eurolingua/european-llm-leaderboard?

It depends on how you train the model. EuroLLM included instruction data related to the test sets (such as MMLU train) in the model pretraining phase, and its results improved. I suppose what @saattrupdan evaluated was TildeOpen version 1.0, which was trained just on plain text without any instruction data. Version 1.1 has the MMLU train added and shows much better results. I am sure that seeing an extra 500M tokens of text + 10m tokens of instruction didn't make the 1.1 model much smarter than 1.0. Rather, it became better at doing the tests.

But even compared with other base decoder models it still underperforms though?

Another aspect that comes to mind is that all the tests are simply translated American tests. Performing well on those doesn't take much, let's say, Bulgarian, but rather loads of American English text. If you look at the performance on datasets like National Exams, which are 1) original tests from each respective country, and 2) in the original languages, you see that the same models perform worse than TildeOpen.

EuroEval consists of 7 diverse tasks, with mostly gold-standard datasets. Here is an overview of the number of translated datasets per language:

Language    #translated  #total
Danish      1            7
Dutch       3            7
English     0            7
Estonian    1            7
Faroese     0            4
Finnish     2            6
French      2            7
German      2            7
Icelandic   1            7
Italian     3            7
Latvian     2            7
Norwegian   0            7
Polish      1            7
Portuguese  0            7
Spanish     2            7

So I think it's unfair to simply say that it's only due to translated evaluation datasets.

As for Tilde v1.1, I only see a single model in your Hugging Face organisation, so I suppose it hasn't been publicly released yet?

But even compared with other base decoder models it still underperforms though?

@saattrupdan I don't think there is a simple answer to it. EuroLLM representatives have explicitly stated that they used training splits of the test sets for the foundational training of their models. Their models still count as foundational (base), as they are not instruction-tuned, right? So, when we added the same sets in going from TildeOpen 1.0 to 1.1, the performance on the specific tests improved and surpassed EuroLLM 22b preview. But this makes the whole evaluation a bit stupid, as the pattern goes like this: if I see a similar model performing much better than mine, I simply check whether I can improve the performance by adding an instruction dataset to the foundational model's training set. If that improves results, I bump the model version and update the weights. But doesn't that make the whole evaluation pointless? So, the model you evaluated was trained purely on plain language text. The models on your leaderboard are not. I am 100% sure of that because I have been told so by the EuroLLM and Gemma teams in private, in public, and in writing.

So I think it's unfair to simply say that it's only due to translated evaluation datasets.

Sure, you are right. Your datasets are different from what I had in mind (I mixed up your project with Eurolingua, https://huggingface.co/spaces/Eurolingua/european-llm-leaderboard, which just translated the American benchmarks). I should read your papers on how you test the models. I wonder, however: are summarization and NER really the right tasks for testing foundational (base) LLMs?
When building the model, we aimed for qualities like multilinguality and coherent language generation. Those are quite far from NER label matching in JSON, for example. If we had added NER training datasets during the foundational model training and the scores had improved, would our model have been any smarter?
And again - I am very grateful for your taking the time to evaluate our work!

As for v1.1: Commit 5ba1bebaca4d53d2b06d46bc4c24d480101b85f8 updated the model version to v1.1. I now realize that it is not as obvious as I hoped it would be from the model card/view.

@TBergmanis Thank you for your explanation. But where can I find some independent benchmark results including TildeOpen 1.1? Maybe it could be included in https://huggingface.co/spaces/Eurolingua/european-llm-leaderboard?

@jglowa I outlined some of my doubts above. I would rather not spend too much time on evaluating the base model - it is just a curiosity, an artefact we can either make useful or not. I genuinely believe that, while @saattrupdan is doing a great job at evaluating the LLM landscape, evaluating base models is complicated - as soon as the tests move away from the intrinsic modelling objectives, they can be manipulated by adding task-specific datasets. That is something we avoided doing, but when we tested it, we saw that it yields targeted improvements. It seems pointless, though, because we plan to publish instruction-tuned models, where instruction-tuning data should be used instead.
We are currently working on machine translation as an application. We also have in-context question answering and summarization in mind. The performance of downstream applications should be easier to evaluate. Is there a task or a problem you are particularly interested in?

@TBergmanis If you make new versions, could you please publish them in separate repositories? Updating the weights in the same repository breaks the reproducibility of scripts that load the model, and in the past I have had subtle bugs introduced in some pieces of software because the model had silently changed. Separate repositories also allow easier independent evaluation of different versions.

Git tags might also work as long as it is clearly documented in the readme that there are different versions with different tags.
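
In the meantime, one way to pin a specific snapshot in scripts is to pass the commit hash as the revision; a minimal sketch using the v1.1 commit mentioned above:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TildeAI/TildeOpen-30b"
revision = "5ba1bebaca4d53d2b06d46bc4c24d480101b85f8"  # v1.1 commit mentioned above

# Pinning the revision keeps scripts reproducible even if the main branch is updated.
tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_id, revision=revision)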
