Model Description
This is a HuBERT Base model pre-trained using 2,000 hours of Iberian languages speech data (Spanish, Catalan, Basque, and Galician). The model architecture is the same as the original HuBERT Base model, which contains 12 transformer layers. Pre-training was done by Barcelona Supercomputing Center.
Intended Uses and Limitations
This pre-trained model generates Speech Representations that can be used for any Iberian speech-related task. This model does not have a tokenizer as it was pretrained on audio alone.
To use this model for Automatic Speech Recognition, a tokenizer should be created and the model should be fine-tuned on labeled text data. Check out this blog for a more detailed explanation of how to fine-tune the model for Speech Recognition. For an explanation of how to fine-tune the model for Audio Classification, check out this tutorial.
Pre-training Details
This model was pre-trained using code from the official repository, and the detailed training configuration can be found in the same repository and the original paper.
For pre-training, a 2,000-hour dataset was created from subsets of the training splits of the following datasets:
| Dataset | Language | Selected hours | Comments |
|---|---|---|---|
| Basque Parliament Speech Corpus 1.0 | Spanish | 191 | |
| VoxPopuli | Spanish | 152 | |
| CommonVoice 21 | Spanish | 120 | |
| VoxForge Spanish | Spanish | 37 | |
| Catalan Youtube Speech | Catalan | 170 | |
| 3CatParla | Catalan | 170 | This dataset is private and is planned to be published as public soon. |
| CommonVoice 21 | Catalan | 44 | |
| Corts Valencianes | Catalan | 44 | Only the anonymized version of the dataset is public. We trained the model with the non-anonymized version. |
| parlament_parla_v3 | Catalan | 44 | This dataset is private and is planned to be published as public soon. |
| IB3 - Speech Corpus for Catalan-varieties ASR | Catalan | 28 | This dataset is private and is planned to be published as public soon. |
| Basque Parliament Speech Corpus 1.0 | Basque | 334 | |
| CommonVoice 21 | Basque | 166 | |
| Nos_RG-Podcast-GL | Galician | 250 | |
| Nos_ParlaSpeech-GL | Galician | 100 | |
| CommonVoice 21 | Galician | 90 | |
| Nos_Transcrispeech-GL | Galician | 35 | |
| Nos_Celtia-GL | Galician | 25 | |
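As a quick sanity check on the table above, the selected hours can be summed per language. The sketch below simply copies the table rows into a list (the variable and function names are illustrative, not part of the official repository) and suggests that each language contributes 500 hours, for 2,000 hours in total.

```python
from collections import defaultdict

# (dataset, language, selected hours), copied from the table above.
SELECTED_HOURS = [
    ("Basque Parliament Speech Corpus 1.0", "Spanish", 191),
    ("VoxPopuli", "Spanish", 152),
    ("CommonVoice 21", "Spanish", 120),
    ("VoxForge Spanish", "Spanish", 37),
    ("Catalan Youtube Speech", "Catalan", 170),
    ("3CatParla", "Catalan", 170),
    ("CommonVoice 21", "Catalan", 44),
    ("Corts Valencianes", "Catalan", 44),
    ("parlament_parla_v3", "Catalan", 44),
    ("IB3 - Speech Corpus for Catalan-varieties ASR", "Catalan", 28),
    ("Basque Parliament Speech Corpus 1.0", "Basque", 334),
    ("CommonVoice 21", "Basque", 166),
    ("Nos_RG-Podcast-GL", "Galician", 250),
    ("Nos_ParlaSpeech-GL", "Galician", 100),
    ("CommonVoice 21", "Galician", 90),
    ("Nos_Transcrispeech-GL", "Galician", 35),
    ("Nos_Celtia-GL", "Galician", 25),
]


def hours_per_language(rows):
    """Sum the selected hours for each language."""
    totals = defaultdict(int)
    for _dataset, language, hours in rows:
        totals[language] += hours
    return dict(totals)


totals = hours_per_language(SELECTED_HOURS)
print(totals)                # 500 hours for each of the four languages
print(sum(totals.values()))  # 2000
```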
Indirect evaluation results
To assess the quality of the pre-trained Iberian Speech Representations, we evaluated them on two indirect tasks: Automatic Speech Recognition (ASR) and Language Identification (LID).
Automatic Speech Recognition
We created train and validation ASR-labelled datasets using a 400-hour subsample from the pre-training dataset split. For testing, we created a test split by concatenating the test splits from:
- CommonVoice 21
- Basque Parliament Speech Corpus 1.0
- VoxPopuli
- 3CatParla
- Corts Valencianes
- parlament_parla_v3
- Catalan Youtube Speech
- Nos_ParlaSpeech-GL
- Nos_RG-Podcast-GL
- Nos_Transcrispeech-GL
We fine-tuned the following models on this 400-hour ASR-labelled training split:
- Iberian pre-trained HuBERT: BSC-LT/hubert-base-los-2k (our model)
- English pre-trained HuBERT: facebook/hubert-base-ls960
- Multi-lingual pre-trained HuBERT: utter-project/mHuBERT-147
All of these models were fine-tuned using exactly the same configuration and trained for 20 epochs. During fine-tuning, we froze the convolutional feature encoder of each model using the freeze_feature_encoder() method. hubert-base-los-2k, hubert-base-ls960 and mHuBERT-147 each have 94M parameters, of which 95% were fine-tuned. The results were the following:
| Model | Train WER | Validation WER | Test WER ↑ |
|---|---|---|---|
| hubert-base-los-2k | 5.6% | 8.5% | 12.8% |
| mHuBERT-147 | 8.1% | 11.2% | 15.9% |
| hubert-base-ls960 | 11.6% | 15.2% | 20.7% |
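For readers unfamiliar with the metric in the table above, WER (word error rate) is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The following is a minimal sketch of that definition; the source does not state which tool or text normalization was used for the reported figures, so the function name and the example sentences are purely illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of four reference words.
print(word_error_rate("bon dia a tothom", "bon dia tothom"))  # 0.25
```

Corpus-level WER is usually computed by summing the edit distances and the reference lengths over all utterances before dividing, rather than averaging per-utterance values.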
Language Identification
We created train and validation Language Identification labelled datasets using a 200-hour subsample from the pre-training dataset split (excluding Common Voice splits). For testing, we created a test split by concatenating all the Spanish, Catalan, Basque, and Galician test splits from CommonVoice 21.
We fine-tuned the following models on this 200-hour labelled training split:
- Iberian pre-trained HuBERT: BSC-LT/hubert-base-los-2k (our model)
- English pre-trained HuBERT: facebook/hubert-base-ls960
- Multi-lingual pre-trained HuBERT: utter-project/mHuBERT-147
All of these models were fine-tuned using exactly the same configuration and trained for 10 epochs. During fine-tuning, we froze the whole base model using the freeze_base_model() method, so only the classification head was updated. hubert-base-los-2k, hubert-base-ls960 and mHuBERT-147 each have 94M parameters, of which 0.2% were fine-tuned.
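The two fine-tuned fractions quoted in this section (95% for ASR, 0.2% for LID) can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below is an estimate under stated assumptions, not the authors' accounting: it assumes the standard HuBERT Base convolutional feature encoder (seven bias-free 512-channel convolutions plus one group norm, as in the Hugging Face `HubertConfig` defaults) and a classification head made of a 768-to-256 projection followed by a 256-to-4 classifier. The ASR output layer is ignored because its size depends on the vocabulary, which the text does not give.

```python
# Assumed HuBERT Base layout (Hugging Face HubertConfig defaults).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_CHANNELS = 512
HIDDEN_SIZE = 768
TOTAL_PARAMS = 94_000_000  # "94M parameters", as stated in the text


def feature_encoder_params() -> int:
    """Parameters frozen by freeze_feature_encoder() under the assumptions above."""
    total = 0
    in_channels = 1  # raw waveform has a single channel
    for kernel in CONV_KERNELS:
        total += in_channels * CONV_CHANNELS * kernel  # bias-free Conv1d
        in_channels = CONV_CHANNELS
    total += 2 * CONV_CHANNELS  # group norm weight and bias on the first layer
    return total


def classification_head_params(num_labels: int, proj_size: int = 256) -> int:
    """Assumed head: Linear(768, proj_size) followed by Linear(proj_size, num_labels)."""
    projector = HIDDEN_SIZE * proj_size + proj_size
    classifier = proj_size * num_labels + num_labels
    return projector + classifier


# ASR: everything except the feature encoder is trainable.
trainable_asr = 1 - feature_encoder_params() / TOTAL_PARAMS

# LID: only the head is trainable; four languages means four labels.
head = classification_head_params(num_labels=4)
trainable_lid = head / (TOTAL_PARAMS + head)

print(f"ASR trainable fraction: {trainable_asr:.1%}")
print(f"LID trainable fraction: {trainable_lid:.2%}")
```

Under these assumptions the estimates land at about 95% and 0.2%, in line with the figures reported in the text.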
The results were the following:
| Model | Train f1-macro | Validation f1-macro | Test f1-macro ↓ |
|---|---|---|---|
| hubert-base-los-2k | 99.6% | 99.6% | 69.2% |
| mHuBERT-147 | 97.2% | 97.6% | 37.6% |
| hubert-base-ls960 | 91.6% | 92.1% | 20.3% |
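The f1-macro score in the table above is the unweighted mean of the per-class F1 scores, so each of the four languages counts equally regardless of how many test utterances it has. A minimal sketch of that definition follows; the function name and the toy labels are illustrative, and the source does not say which library computed the reported values.

```python
def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Toy example with language codes as labels.
y_true = ["es", "es", "ca", "ca", "eu", "gl"]
y_pred = ["es", "ca", "ca", "ca", "eu", "eu"]
print(f1_macro(y_true, y_pred))  # 8/15, roughly 0.533
```

Because every class weighs the same, a model that never predicts one language correctly loses a full quarter of the achievable score, which is one reason macro F1 can drop sharply on an out-of-domain test set.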
How to use the model
Speech Representations
To obtain Speech Representations (HuBERT outputs) from audio in Iberian languages using this model, you can follow this example:
(Using fsspec==2025.3.0, datasets==3.6.0 and transformers==4.52.2 is recommended.)
from datasets import load_dataset, Audio
import torch
from transformers import AutoFeatureExtractor, AutoModel

# Load the dataset
dataset = load_dataset("projecte-aina/ib3_ca_asr", split='train[:1%]', trust_remote_code=True)

# Downsample to 16kHz
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

# Hugging Face pre-trained model path
MODEL_NAME = "BSC-LT/hubert-base-los-2k"

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using {device} device.")

# Load feature extractor
feature_extractor = AutoFeatureExtractor.from_pretrained(MODEL_NAME)

# Load model and switch to inference mode
model = AutoModel.from_pretrained(MODEL_NAME)
model = model.to(device)
model.eval()

def map_to_speech_representations(batch):
    # Process a single example from the dataset
    audio = batch["audio"]
    input_features = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_values
    input_features = input_features.to(device)

    # Extract HuBERT's Speech Representations
    with torch.no_grad():
        outputs = model(
            input_features,
            output_hidden_states=True,
        )
    # Output of the last transformer layer
    speech_representations = outputs.last_hidden_state
    # Outputs of the embedding step and of every transformer layer
    hidden_states = outputs.hidden_states

    batch["speech_representations"] = speech_representations
    batch["hidden_states"] = hidden_states
    return batch

dataset = dataset.map(map_to_speech_representations)
print(dataset)
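It can help to know how many representation frames to expect for a given clip. The sketch below estimates this from the convolutional feature encoder, assuming the standard HuBERT Base kernel sizes and strides (the Hugging Face `HubertConfig` defaults); these values are an assumption here, although the text states that the architecture matches the original HuBERT Base. With a total stride of 320 samples at 16 kHz, each frame covers roughly 20 ms of audio.

```python
# Assumed HuBERT Base feature encoder layout (Hugging Face HubertConfig defaults).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(num_samples: int) -> int:
    """Number of frames produced for a waveform with num_samples samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # Output length of an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return length


print(num_output_frames(16_000))   # 1 second at 16 kHz  -> 49 frames
print(num_output_frames(160_000))  # 10 seconds at 16 kHz -> 499 frames
```

Under these assumptions, a one-second clip yields a `last_hidden_state` of shape (1, 49, 768), and `hidden_states` contains 13 tensors: the embedding output followed by the outputs of the 12 transformer layers.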
Discrete Speech Representations
Important remark: the k-means model available in this repo and used for extracting Discrete Speech Representations was trained using HuBERT's 6th layer.
To obtain Discrete Speech Representations (HuBERT's k-means centroids) from audio in Iberian languages using this model, you can follow this example:
(Using fsspec==2025.3.0, datasets==3.6.0 and transformers==4.52.2 is recommended.)
from datasets import load_dataset, Audio
import torch
from transformers import AutoFeatureExtractor, AutoModel
import joblib
import numpy as np
from huggingface_hub import hf_hub_download

# Load the dataset
dataset = load_dataset("projecte-aina/ib3_ca_asr", split='train[:1%]', trust_remote_code=True)

# Downsample to 16kHz
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

# Hugging Face pre-trained model path
MODEL_NAME = "BSC-LT/hubert-base-los-2k"

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using {device} device.")

# Load feature extractor
feature_extractor = AutoFeatureExtractor.from_pretrained(MODEL_NAME)

# Load model and switch to inference mode
model = AutoModel.from_pretrained(MODEL_NAME)
model = model.to(device)
model.eval()

# Load k-means model and its centroids
km_path = hf_hub_download(repo_id="BSC-LT/hubert-base-los-2k", filename="k_means.km")
km_model = joblib.load(km_path)
clusters = km_model.cluster_centers_

def map_to_discrete_units(batch):
    # Process a single example from the dataset
    audio = batch["audio"]
    input_features = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_values
    input_features = input_features.to(device)

    with torch.no_grad():
        outputs = model(
            input_features,
            output_hidden_states=True,
        )

    # Extract HuBERT's Speech Representations
    hidden_states = outputs.hidden_states

    # Extract 6th-layer features and drop the batch dimension
    k_means_input = hidden_states[5].squeeze(0)
    k_means_input = k_means_input.cpu()
    k_means_input = np.array(k_means_input, dtype='f')

    # Assign each frame to its nearest centroid
    labels = km_model.predict(k_means_input)
    batch["discrete_units"] = clusters[labels]
    return batch

dataset = dataset.map(map_to_discrete_units)
print(dataset)
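In the example above, `km_model.predict` assigns each frame to its nearest centroid, and `clusters[labels]` then looks up the centroid vectors. The toy sketch below illustrates that assignment in plain Python on made-up 2-D vectors. It also shows collapsing consecutive repeated unit ids, which is a common post-processing step in unit-based pipelines but is an addition here and is not described in this model card; all function names are hypothetical.

```python
def nearest_centroid(frame, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    best_index, best_distance = 0, float("inf")
    for index, centroid in enumerate(centroids):
        distance = sum((f - c) ** 2 for f, c in zip(frame, centroid))
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index


def to_unit_ids(frames, centroids):
    """Map every frame to the id of its nearest centroid."""
    return [nearest_centroid(frame, centroids) for frame in frames]


def collapse_repeats(unit_ids):
    """Merge runs of identical consecutive ids (optional post-processing)."""
    collapsed = []
    for unit in unit_ids:
        if not collapsed or collapsed[-1] != unit:
            collapsed.append(unit)
    return collapsed


# Toy data: three centroids and five frames in two dimensions.
centroids = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
frames = [[0.1, -0.1], [0.2, 0.0], [0.9, 1.2], [3.8, 4.1], [4.2, 3.9]]

unit_ids = to_unit_ids(frames, centroids)
print(unit_ids)                    # [0, 0, 1, 2, 2]
print(collapse_repeats(unit_ids))  # [0, 1, 2]
```

If integer unit ids are preferred over centroid vectors, the `labels` array in the example above already contains them.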
Automatic Speech Recognition
To use this model for Speech Recognition, a tokenizer should be created and the model should be fine-tuned on labeled text data. Check out this blog for a more detailed explanation of how to fine-tune the model for Speech Recognition.
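As an illustration of the tokenizer step, a character-level vocabulary can be built from the training transcripts. The following is a minimal sketch under assumptions, not the authors' recipe: the function name, the lowercasing, and the special tokens `[UNK]`, `[PAD]` and `|` (word delimiter) are choices made for this example, following a convention that is common for CTC fine-tuning.

```python
import json


def build_char_vocab(transcripts):
    """Build a character-to-id mapping from a list of transcripts."""
    characters = sorted({ch for text in transcripts for ch in text.lower()})
    vocab = {ch: index for index, ch in enumerate(characters)}

    # Use a visible word delimiter instead of the space character.
    if " " in vocab:
        vocab["|"] = vocab.pop(" ")

    # Special tokens: unknown characters and padding (also usable as CTC blank).
    vocab["[UNK]"] = len(vocab)
    vocab["[PAD]"] = len(vocab)
    return vocab


# Hypothetical transcripts, for illustration only.
vocab = build_char_vocab(["bon dia", "egun on"])
print(json.dumps(vocab, ensure_ascii=False))
```

The resulting dictionary can be saved as a JSON vocabulary file and loaded with a CTC tokenizer such as `Wav2Vec2CTCTokenizer` from transformers, passing the same special tokens.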
Audio Classification
For an explanation of how to fine-tune the model for Audio Classification, check out this tutorial.
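Audio classification fine-tuning needs a mapping between class names and integer ids. A minimal sketch for the four-language identification setting described earlier is shown below; the label names and their order are assumptions chosen for this example.

```python
# Hypothetical label set for Iberian language identification.
LANGUAGES = ["Spanish", "Catalan", "Basque", "Galician"]

label2id = {label: index for index, label in enumerate(LANGUAGES)}
id2label = {index: label for label, index in label2id.items()}


def encode_labels(labels):
    """Convert a list of language names to integer class ids."""
    return [label2id[label] for label in labels]


print(encode_labels(["Basque", "Spanish", "Galician"]))  # [2, 0, 3]
```

These mappings are typically passed as `label2id`, `id2label` and `num_labels=len(LANGUAGES)` when loading the model with `AutoModelForAudioClassification.from_pretrained`.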
Citation
If this model contributes to your research, please cite the work:
@misc{costa2025hubertbaseca2k,
  title={LOSHuBERT: the first full Iberian pre-trained HuBERT.},
  author={Costa, Federico and Messaoudi, Abir and Peiró-Lilja, Alex and Casals-Salvador, Marc and España-Bonet, Cristina},
  organization={Barcelona Supercomputing Center},
  url={https://huggingface.co/BSC-LT/hubert-base-los-2k},
  year={2025}
}
Additional Information
Author
The pre-training process was performed during 2025, in the Language Technologies Unit of the Barcelona Supercomputing Center.
Contact
For further information, please send an email to langtech@bsc.es.
Copyright
Copyright(c) 2025 by Language Technologies Unit, Barcelona Supercomputing Center.
License
Apache License, Version 2.0
Funding
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project Desarrollo de Modelos ALIA.
The training of the model was possible thanks to the computing time provided by Barcelona Supercomputing Center through MareNostrum 5.
Disclaimer
The model published in this repository is intended for a generalist purpose and is available to third parties under a permissive Apache License, Version 2.0.
Be aware that the model may have biases and/or any other undesirable distortions.
When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it) or become users of the model, they should note that it is their responsibility to mitigate the risks arising from its use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.
In no event shall the owner and creator of the model (Barcelona Supercomputing Center) be liable for any results arising from the use made by third parties.