---
language: en
license: apache-2.0
---

# BERT Hash Nano Models

This is a set of 3 Nano [BERT](https://arxiv.org/abs/1810.04805) models with a modified embeddings layer. The embeddings layer projects the same BERT vocabulary (30,522 tokens) to a smaller dimensional space, then re-encodes it to the hidden size.

This method is inspired by [MUVERA: Multi-Vector Retrieval via Fixed Dimensional Encodings](https://arxiv.org/abs/2405.19504). The number of projections determines the hash size: setting the projections parameter to 5 is like generating a 160-bit hash (5 x float32) for each token. That hash is then projected to the hidden size. This significantly reduces the number of parameters necessary for token embeddings. For example:

Standard token embeddings:

- 30,522 (vocab size) x 768 (hidden size) = 23,440,896 parameters
- 23,440,896 x 4 (float32) = 93,763,584 bytes

Hash token embeddings:

- 30,522 (vocab size) x 5 (hash buckets) + 5 x 768 (projection matrix) = 156,450 parameters
- 156,450 x 4 (float32) = 625,800 bytes

These models are pre-trained on the same training corpus as BERT (with a copy of Wikipedia from 2025), as recommended in the paper [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962).

Below is a subset of GLUE scores on the dev set using the [script provided by Hugging Face Transformers](https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification/run_glue.py) with the following parameters.

```bash
python run_glue.py --model_name_or_path --task_name --do_train --do_eval --max_seq_length 128 --per_device_train_batch_size 32 --learning_rate 1e-4 --num_train_epochs 4 --output_dir outputs --trust-remote-code True
```

| Model | Parameters | MNLI (acc m/mm) | MRPC (f1/acc) | SST-2 (acc) |
| ----- | ---------- | --------------- | ------------- | ----------- |
| [baseline (bert-tiny)](https://hf.co/google/bert_uncased_L-2_H-128_A-2) | 4.4M | 0.7114 / 0.7161 | 0.8318 / 0.7353 | 0.8222 |
| [**bert-hash-femto**](https://hf.co/neuml/bert-hash-femto) | **0.243M** | **0.5697 / 0.5750** | **0.8122 / 0.6838** | **0.7821** |
| [bert-hash-pico](https://hf.co/neuml/bert-hash-pico) | 0.448M | 0.6228 / 0.6363 | 0.8205 / 0.7083 | 0.7878 |
| [bert-hash-nano](https://hf.co/neuml/bert-hash-nano) | 0.969M | 0.6565 / 0.6670 | 0.8172 / 0.7083 | 0.8131 |

## Usage

These models can be loaded with Hugging Face Transformers as follows. Note that since this is a custom architecture, `trust_remote_code` must be set.

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("neuml/bert-hash-femto", trust_remote_code=True)
```
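
The sketch below extends the usage example above with a quick inference check. It assumes the stock `bert-base-uncased` tokenizer (these models reuse BERT's vocabulary) and that the custom model follows the standard `BertModel` forward signature and outputs; adjust as needed for your setup.

```python
import torch

from transformers import AutoModel, AutoTokenizer

# These models reuse BERT's WordPiece vocabulary, so the stock
# bert-base-uncased tokenizer is assumed to be compatible here
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("neuml/bert-hash-femto", trust_remote_code=True)

# Encode a sentence and inspect the hidden states
# (assumes the custom model mirrors the standard BertModel outputs)
inputs = tokenizer("Nano models can still pack a punch", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)

# Rough parameter count for comparison with the table above
# (exact totals may differ slightly depending on which heads are counted)
print("Parameters:", sum(p.numel() for p in model.parameters()))
```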
## Training

Training your own Nano model is simple. All you need is a Hugging Face dataset and the code below using [txtai](https://github.com/neuml/txtai).

```python
from datasets import concatenate_datasets, load_dataset
from transformers import AutoTokenizer

from txtai.pipeline import HFTrainer

from configuration_bert_hash import *
from modeling_bert_hash import *

# Load the training dataset and the standard BERT tokenizer
dataset = load_dataset("path to target HF dataset")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Model configuration - projections controls the token hash size
config = BertHashConfig(
    hidden_size=128,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=512,
    projections=16
)

model = BertHashForMaskedLM(config)

print(config)
print("Total parameters:", sum(p.numel() for p in model.bert.parameters()))

train = HFTrainer()

# Train using MLM
train(
    (model, tokenizer),
    dataset,
    task="language-modeling",
    output_dir="model",
    fp16=True,
    learning_rate=1e-3,
    per_device_train_batch_size=64,
    num_train_epochs=3,
    warmup_steps=2500,
    weight_decay=0.01,
    adam_epsilon=1e-6,
    tokenizers=True,
    dataloader_num_workers=20,
    save_strategy="steps",
    save_steps=5000,
    logging_steps=500,
)
```

## Future Work

These models demonstrate that smaller models can still be productive. The hope is that this work opens the door for many to build small encoder models that pack a punch. Models can be trained in a matter of hours using consumer GPUs. Imagine more specialized models like this for medical, legal, scientific and other domains.

## More Information

Read more about these models and how they were built in [this article](https://medium.com/neuml/training-tiny-language-models-with-token-hashing-b744aa7eb931).