---
language: en
license: apache-2.0
---

# BERT Hash Nano Models

This is a set of 3 Nano [BERT](https://arxiv.org/abs/1810.04805) models with a modified embeddings layer. The embeddings layer projects the same BERT vocabulary (30,522 tokens) to a smaller dimensional space, then re-encodes it to the hidden size.

This method is inspired by [MUVERA: Multi-Vector Retrieval via Fixed Dimensional Encodings](https://arxiv.org/abs/2405.19504). The number of projections determines the hash size: setting the projections parameter to 5 is like generating a 160-bit hash (5 x float32) for each token. That hash is then projected to the hidden size. This significantly reduces the number of parameters necessary for token embeddings. For example:

Standard token embeddings:

- 30,522 (vocab size) x 768 (hidden size) = 23,440,896 parameters
- 23,440,896 x 4 (float32) = 93,763,584 bytes

Hash token embeddings:

- 30,522 (vocab size) x 5 (hash buckets) + 5 x 768 (projection matrix) = 156,450 parameters
- 156,450 x 4 (float32) = 625,800 bytes

These models are pre-trained on the same training corpus as BERT (with a copy of Wikipedia from 2025), as recommended in the paper [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962).

Below is a subset of GLUE scores on the dev set using the [script provided by Hugging Face Transformers](https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification/run_glue.py) with the following parameters.

```bash
python run_glue.py --model_name_or_path --task_name --do_train --do_eval --max_seq_length 128 --per_device_train_batch_size 32 --learning_rate 1e-4 --num_train_epochs 4 --output_dir outputs --trust-remote-code True
```

| Model | Parameters | MNLI (acc m/mm) | MRPC (f1/acc) | SST-2 (acc) |
| ----- | ---------- | --------------- | ------------- | ----------- |
| [baseline (bert-tiny)](https://hf.co/google/bert_uncased_L-2_H-128_A-2) | 4.4M | 0.7114 / 0.7161 | 0.8318 / 0.7353 | 0.8222 |
| [**bert-hash-femto**](https://hf.co/neuml/bert-hash-femto) | **0.243M** | **0.5697 / 0.5750** | **0.8122 / 0.6838** | **0.7821** |
| [bert-hash-pico](https://hf.co/neuml/bert-hash-pico) | 0.448M | 0.6228 / 0.6363 | 0.8205 / 0.7083 | 0.7878 |
| [bert-hash-nano](https://hf.co/neuml/bert-hash-nano) | 0.969M | 0.6565 / 0.6670 | 0.8172 / 0.7083 | 0.8131 |

## Usage

These models can be loaded with Hugging Face Transformers as follows. Note that since this is a custom architecture, `trust_remote_code` must be set.

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("neuml/bert-hash-femto", trust_remote_code=True)
```
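
The sketch below extends the usage example above with a quick inference check. It assumes the stock `bert-base-uncased` tokenizer (these models reuse BERT's vocabulary) and that the custom model follows the standard `BertModel` forward signature and outputs; adjust as needed for your setup.

```python
import torch

from transformers import AutoModel, AutoTokenizer

# These models reuse BERT's WordPiece vocabulary, so the stock
# bert-base-uncased tokenizer is assumed to be compatible here
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("neuml/bert-hash-femto", trust_remote_code=True)

# Encode a sentence and inspect the hidden states
# (assumes the custom model mirrors the standard BertModel outputs)
inputs = tokenizer("Nano models can still pack a punch", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)

# Rough parameter count for comparison with the table above
# (exact totals may differ slightly depending on which heads are counted)
print("Parameters:", sum(p.numel() for p in model.parameters()))
```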
## Training

Training your own Nano model is simple. All you need is a Hugging Face dataset and the code below using [txtai](https://github.com/neuml/txtai).

```python
from datasets import concatenate_datasets, load_dataset
from transformers import AutoTokenizer

from txtai.pipeline import HFTrainer

from configuration_bert_hash import *
from modeling_bert_hash import *

# Load the training dataset and the standard BERT tokenizer
dataset = load_dataset("path to target HF dataset")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Model configuration - projections controls the token hash size
config = BertHashConfig(
    hidden_size=128,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=512,
    projections=16
)

model = BertHashForMaskedLM(config)

print(config)
print("Total parameters:", sum(p.numel() for p in model.bert.parameters()))

train = HFTrainer()

# Train using MLM
train(
    (model, tokenizer),
    dataset,
    task="language-modeling",
    output_dir="model",
    fp16=True,
    learning_rate=1e-3,
    per_device_train_batch_size=64,
    num_train_epochs=3,
    warmup_steps=2500,
    weight_decay=0.01,
    adam_epsilon=1e-6,
    tokenizers=True,
    dataloader_num_workers=20,
    save_strategy="steps",
    save_steps=5000,
    logging_steps=500,
)
```

## Future Work

These models demonstrate that smaller models can still be productive. The hope is that this work opens the door for many to build small encoder models that pack a punch. Models can be trained in a matter of hours using consumer GPUs. Imagine more specialized models like this for medical, legal, scientific and other domains.

## More Information

Read more about these models and how they were built in [this article](https://medium.com/neuml/training-tiny-language-models-with-token-hashing-b744aa7eb931).