---
language:
- en
license: apache-2.0
tags:
- pytorch
- causal-lm
- bitnet
- 8bit-quantization
- layer-skip
- early-exit
- safetensors
- fineweb-edu
datasets:
- HuggingFaceFW/fineweb-edu
model_type: bitnet
---

# bitnet-8bit-v3

This is a BitNet-style causal language model with 8-bit activation quantization, dynamic layer skipping, and early-exit support, trained on the FineWeb-EDU dataset.

## Architecture Overview

### Input Processing

- **Token Embeddings**: 128,256-token vocabulary
- **Position Embeddings**: up to 128 positions
- **Hidden Dimensions**: 1024-dimensional hidden states

### Transformer Layers (12 total)

Each layer contains:

- Layer normalization (eps = 1e-05)
- **Multi-Head Attention**: 16 heads
- Residual connections
- **Feed-Forward Network**: 1024 → 4096 → 1024
- Dropout (0.1) after attention and FFN
- Activation function: GELU

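For orientation only, the block below sketches one such layer in plain PyTorch with the dimensions listed above. The class name, the pre-LN ordering, and the omission of the 8-bit activation quantization are assumptions for illustration, not the model's actual implementation.

```python
import torch
import torch.nn as nn

class BitNetLayerSketch(nn.Module):
    """Hypothetical transformer layer matching the dimensions listed above."""

    def __init__(self, hidden=1024, heads=16, ffn=4096, dropout=0.1, eps=1e-5):
        super().__init__()
        self.ln1 = nn.LayerNorm(hidden, eps=eps)
        self.attn = nn.MultiheadAttention(hidden, heads, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(hidden, eps=eps)
        self.ffn = nn.Sequential(
            nn.Linear(hidden, ffn),  # 1024 -> 4096
            nn.GELU(),
            nn.Linear(ffn, hidden),  # 4096 -> 1024
        )
        self.drop = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None):
        # Attention block with residual connection
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + self.drop(attn_out)
        # Feed-forward block with residual connection
        x = x + self.drop(self.ffn(self.ln2(x)))
        return x
```
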
### Special Features

- **8-bit Quantization**: 8-bit activations for efficiency
- **Layer Skipping**: dynamic computation with a per-layer skip probability of 0.1
  - Minimum layers to keep: 4
- **Early Exit**: inference can terminate at any layer once prediction confidence exceeds 95% (see the sketch below)

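The snippet below sketches how layer skipping and early exit could interact in a forward pass, using the thresholds quoted above. The function name, the placement of the confidence check, and the use of the last position's next-token distribution are assumptions for illustration; the actual inference logic ships with the model code.

```python
import torch
import torch.nn.functional as F

def adaptive_forward(layers, lm_head, hidden, skip_prob=0.1, min_layers=4, exit_conf=0.95):
    """Hypothetical adaptive forward pass: random layer skipping plus
    confidence-based early exit (thresholds taken from the card above)."""
    executed = 0
    for i, layer in enumerate(layers):
        remaining = len(layers) - i
        # Only allow skipping while the minimum number of layers can still be met
        can_skip = executed + remaining > min_layers
        if can_skip and torch.rand(()).item() < skip_prob:
            continue  # skip this layer entirely
        hidden = layer(hidden)
        executed += 1
        # Early exit: stop once the next-token distribution is confident enough
        if executed >= min_layers:
            probs = F.softmax(lm_head(hidden[:, -1, :]), dim=-1)
            if probs.max().item() > exit_conf:
                break
    return hidden, executed
```

Checking confidence after every executed layer adds an extra LM-head projection per layer, which is itself nontrivial with a 128,256-token vocabulary, so a real implementation may restrict the check to a subset of layers.
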
## Model Configuration

```json
{
  "vocab_size": 128256,
  "hidden_size": 1024,
  "num_hidden_layers": 12,
  "num_attention_heads": 16,
  "intermediate_size": 4096,
  "max_position_embeddings": 128,
  "activation_bits": 8,
  "hidden_dropout_prob": 0.1,
  "attention_probs_dropout_prob": 0.1
}
```

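These values imply a model of roughly 280M parameters, dominated by the embedding table. The back-of-the-envelope estimate below assumes a standard attention/FFN parameterization with biases and a tied input/output embedding; an untied LM head or extra norms would add somewhat more.

```python
# Rough parameter count derived from the config above (assumptions noted in the text).
V, H, L, I, P = 128256, 1024, 12, 4096, 128

embeddings = V * H + P * H               # token + position embeddings
attention  = 4 * (H * H + H)             # Q, K, V and output projections (with biases)
ffn        = (H * I + I) + (I * H + H)   # 1024 -> 4096 -> 1024 (with biases)
layernorms = 2 * 2 * H                   # two LayerNorms per layer (weight + bias)
per_layer  = attention + ffn + layernorms

total = embeddings + L * per_layer
print(f"~{total / 1e6:.0f}M parameters")  # ~283M
```
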
## Training Details

- **Dataset**: FineWeb-EDU (sample-10BT subset)
- **Batch Size**: 64 (with gradient accumulation)
- **Learning Rate**: 5e-05
- **Weight Decay**: 0.01
- **Warmup Steps**: 1000
- **Max Gradient Norm**: 1.0
- **Gradient Accumulation Steps**: 4

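For reference, a minimal sketch of how these hyperparameters might be wired together is shown below. The optimizer choice (AdamW), the linear warmup schedule, and the total step count are assumptions not stated in this card, and `model` / `dataloader` are placeholders.

```python
import torch
from transformers import get_linear_schedule_with_warmup

def train(model, dataloader, total_steps=100_000, accum_steps=4):
    # Hyperparameters taken from the card; AdamW and total_steps are assumed.
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=1000, num_training_steps=total_steps
    )

    for step, batch in enumerate(dataloader):
        loss = model(**batch).loss / accum_steps          # scale for accumulation
        loss.backward()
        if (step + 1) % accum_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```
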
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer
# (if the repository ships custom modeling code, trust_remote_code=True may be required)
model = AutoModelForCausalLM.from_pretrained("bitnet-8bit-v3")
tokenizer = AutoTokenizer.from_pretrained("bitnet-8bit-v3")

# Basic sampled generation
inputs = tokenizer("The key to understanding BitNet is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=100,     # must stay within the 128-token position limit
    do_sample=True,     # required for temperature to take effect
    temperature=0.7,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Performance Characteristics

- **Memory Efficiency**: 8-bit activation quantization reduces the memory footprint
- **Adaptive Computation**: layer skipping reduces average layer compute by roughly 10% (see the check below)
- **Low Latency**: early exit can terminate computation once the model is sufficiently confident
- **Compact Size**: significantly smaller than comparable full-precision models

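The ~10% figure follows directly from the 0.1 per-layer skip probability; the quick simulation below (using the same skipping rule as the sketch under Special Features) suggests the minimum-layer constraint barely changes it, and any early exits would only reduce compute further.

```python
import random

# Monte Carlo estimate of average executed layers with per-layer skip
# probability 0.1 and at least 4 layers always kept (no early exit).
num_layers, skip_prob, min_layers, trials = 12, 0.1, 4, 100_000

total = 0
for _ in range(trials):
    executed = 0
    for i in range(num_layers):
        remaining = num_layers - i
        can_skip = executed + remaining > min_layers
        if can_skip and random.random() < skip_prob:
            continue
        executed += 1
    total += executed

avg = total / trials
print(f"average layers executed: {avg:.2f} / {num_layers} "
      f"(~{1 - avg / num_layers:.0%} less layer compute)")  # ≈ 10%
```
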
## Limitations

- Maximum sequence length is limited to 128 tokens
- This is an experimental BitNet implementation
- Early exit and layer skipping require compatible inference code
- Model performance may vary with the complexity of the input

## Technical Details

- **Initializer Range**: 0.02
- **Layer Norm Epsilon**: 1e-05
- **Tokenizer**: based on the Meta-Llama-3-8B-Instruct tokenizer
- **Format**: SafeTensors for fast and safe loading

## Citation

If you use this model, please cite:

```bibtex
@misc{bitnet2024,
  title={BitNet: 8-bit Quantized Transformer with Layer Skipping},
  author={Your Name},
  year={2024},
  url={https://huggingface.co/bitnet-8bit-v3}
}
```

## License

Apache 2.0. This model can be used for commercial purposes.

## Acknowledgments

- Training data from FineWeb-EDU by Hugging Face
- Tokenizer from Meta's Llama-3-8B-Instruct model