Model Description

The HPLT Llama-2b collection comprises monolingual decoder-only language models pretrained by the HPLT team as part of the third (HPLT 3.0) release.

The models are released as artifacts of our ablation studies evaluating different corpora and sampling strategies across multiple languages.

Please find more details in our GitHub repository and pre-print.

Model Architecture

All models follow the Llama architecture with 24 layers, 32 attention heads, and a sequence length of 2048 tokens. The tokenizer is the Gemma-3 tokenizer with a vocabulary size of 262K tokens.
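
As a quick sanity check, these hyperparameters can be read from the model configuration with the Transformers AutoConfig class; the comments note the values stated above.

from transformers import AutoConfig

# Load the configuration shipped with the model repository
config = AutoConfig.from_pretrained("HPLT/hplt-2.0-glg_Latn-llama-2b-30bt")

print(config.num_hidden_layers)        # expected: 24 layers
print(config.num_attention_heads)      # expected: 32 attention heads
print(config.max_position_embeddings)  # expected: 2048 (training sequence length)
print(config.vocab_size)               # expected: ~262K entries (Gemma-3 tokenizer)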

Pretraining Corpus

This model is pretrained from scratch on 30B tokens from HPLT 2.0. For lower-resource languages with less than 30B tokens of available data, the dataset is uniformly upsampled (repeated) following Muennighoff et al. (2023). Pretraining is run with the Megatron-LM framework on the LUMI supercomputer, using 16 AMD MI250X nodes.
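
Uniform upsampling means every document in a low-resource corpus is repeated the same number of times until the 30B-token budget is reached. The sketch below illustrates how the effective repetition (epoch) factor can be derived; repetition_factor is a hypothetical helper for illustration, not part of the actual training code.

def repetition_factor(available_tokens: int, target_tokens: int = 30_000_000_000) -> float:
    # Uniform upsampling: every document is repeated equally often,
    # so the effective number of epochs is target / available (at least 1).
    return max(1.0, target_tokens / available_tokens)

# Example: a 10B-token corpus would be seen roughly three times during pretraining
print(repetition_factor(10_000_000_000))  # 3.0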

Intended Use

Intended Use Cases: The model is intended for research on Galician and for reproducibility purposes. Since the model is only pretrained, its performance on a variety of natural language understanding and generation tasks can potentially be improved with post-training data.

Out of Scope: Use of the model in languages other than those explicitly referenced as supported in this model card.

How to use

Due to storage quota limits, this repository contains only the following intermediate checkpoints:

  • 1B
  • 10B
  • 20B
  • main

The other checkpoints can be provided upon request.
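
The published revisions of the repository can also be listed programmatically with the huggingface_hub library, as in this short sketch:

from huggingface_hub import list_repo_refs

# List the branches (revisions) of the model repository,
# e.g. "main" plus the intermediate token-count checkpoints
refs = list_repo_refs("HPLT/hplt-2.0-glg_Latn-llama-2b-30bt")
print([branch.name for branch in refs.branches])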

Use with Transformers

You can run inference using the Transformers pipeline abstraction or by leveraging the Auto classes with the generate() function.

import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="HPLT/hplt-2.0-glg_Latn-llama-2b-30bt", 
    torch_dtype=torch.bfloat16, 
    device_map="auto"
)
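
Once the pipeline is created, text can be generated directly from a prompt; the Galician prompt and sampling settings below are illustrative choices only.

# Generate a continuation for a short Galician prompt (illustrative example)
outputs = pipe("A lingua galega", max_new_tokens=50, do_sample=True, temperature=0.7)
print(outputs[0]["generated_text"])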

A specific intermediate checkpoint can be loaded by passing the revision argument when loading the model.

from transformers import AutoModelForCausalLM
import torch

revision = "10B"

model = AutoModelForCausalLM.from_pretrained(
    "HPLT/hplt-2.0-glg_Latn-llama-2b-30bt",
    torch_dtype=torch.bfloat16,
    revision=revision,
    device_map="auto"
)
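
With the model loaded, generation via generate() also requires the tokenizer from the same repository; the prompt below is again only an illustrative example.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HPLT/hplt-2.0-glg_Latn-llama-2b-30bt")

# Tokenize a short Galician prompt and move it to the device holding the model weights
inputs = tokenizer("Santiago de Compostela é", return_tensors="pt").to(model.device)

# Greedy decoding of up to 50 new tokens
output_ids = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))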

Cite us

@article{oepen2025hplt,
  title={HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models},
  author={Oepen, Stephan and Arefev, Nikolay and Aulamo, Mikko and Ba{\~n}{\'o}n, Marta and Buljan, Maja and Burchell, Laurie and Charpentier, Lucas and Chen, Pinzhen and Fedorova, Mariya and de Gibert, Ona and others},
  journal={arXiv preprint arXiv:2511.01066},
  year={2025}
}