---
language:
- en
license: apache-2.0
tags:
- pytorch
- causal-lm
- bitnet
- quantized
- 8bit
- layer-skip
- early-exit
- rope
- safetensors
- fineweb-edu
datasets:
- HuggingFaceFW/fineweb-edu
---
# bitnet-8bit
This is a BitNet model with 8-bit quantization, layer skipping, and early exit capabilities, trained on the FineWeb-EDU dataset.
## Architecture Overview
### Input Processing
- **Token Embeddings**: 128,256 vocabulary size
- **Position Embeddings**: Up to 128 positions
- **Hidden Dimensions**: 1024-dimensional hidden states
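
A minimal sketch of this input stage at the sizes listed above; the class and attribute names are illustrative rather than taken from the repository, and RoPE (described under Special Features) is applied separately inside attention:

```python
import torch
import torch.nn as nn

VOCAB_SIZE, MAX_POSITIONS, HIDDEN_SIZE = 128_256, 128, 1024

class InputEmbeddings(nn.Module):
    """Token + absolute position embeddings; names are illustrative only."""
    def __init__(self, dropout: float = 0.1):
        super().__init__()
        self.token_emb = nn.Embedding(VOCAB_SIZE, HIDDEN_SIZE)   # 128,256 x 1,024
        self.pos_emb = nn.Embedding(MAX_POSITIONS, HIDDEN_SIZE)  # up to 128 positions
        self.dropout = nn.Dropout(dropout)

    def forward(self, input_ids: torch.LongTensor) -> torch.Tensor:
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        return self.dropout(self.token_emb(input_ids) + self.pos_emb(positions))

hidden = InputEmbeddings()(torch.randint(0, VOCAB_SIZE, (1, 16)))
print(hidden.shape)  # torch.Size([1, 16, 1024])
```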
### Transformer Layers (12 total)
Each layer contains:
- Layer normalization
- **BitNet Attention**: 8 heads, 64 dimensions per head
- Residual connections
- **BitNet Feed-Forward Network**: 1024 → 4096 → 1024
- Dropout (0.1) after attention and FFN
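
The sketch below shows one such block, with standard PyTorch modules standing in for the repository's 8-bit BitNet attention and FFN; the pre-norm ordering and GELU activation are assumptions, not confirmed details:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One of the 12 layers; nn.MultiheadAttention and nn.Linear stand in
    for the quantized BitNet modules implemented in the repository."""
    def __init__(self, hidden=1024, heads=8, ffn=4096, dropout=0.1):
        super().__init__()
        self.attn_norm = nn.LayerNorm(hidden)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)  # 8 heads x 64 dims
        self.ffn_norm = nn.LayerNorm(hidden)
        self.ffn = nn.Sequential(                                # 1024 -> 4096 -> 1024
            nn.Linear(hidden, ffn), nn.GELU(), nn.Linear(ffn, hidden))
        self.dropout = nn.Dropout(dropout)                       # 0.1 after attention and FFN

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.attn_norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + self.dropout(attn_out)                           # residual around attention
        x = x + self.dropout(self.ffn(self.ffn_norm(x)))         # residual around FFN
        return x
```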
### Special Features
- **8-bit Quantization**: Applied in the attention and FFN layers to reduce memory use and compute
- **Rotary Position Embeddings (RoPE)**: Used in attention with dimension 64
- **Layer Skipping**: Quadratic dropout schedule: p_l = p_max × (l/L)² (see the sketch after this list)
  - Maximum skip probability: 0.1
  - No minimum number of active layers is enforced
- **Early Exit**: Can terminate at any layer if confidence > 95%
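
A small sketch of both mechanisms, assuming layers are numbered 1..L and that "confidence" means the maximum next-token softmax probability (the repository's exact criterion may differ):

```python
import torch

P_MAX, NUM_LAYERS = 0.1, 12

def skip_probability(layer_idx: int, num_layers: int = NUM_LAYERS, p_max: float = P_MAX) -> float:
    """Quadratic schedule p_l = p_max * (l / L)^2, so deeper layers are skipped more often."""
    return p_max * (layer_idx / num_layers) ** 2

def should_exit_early(logits: torch.Tensor, threshold: float = 0.95) -> bool:
    """Exit once the most probable next token clears the confidence threshold."""
    probs = torch.softmax(logits[:, -1, :], dim=-1)   # (batch, vocab)
    return bool(probs.max().item() > threshold)

# Skip probabilities ramp from ~0.0007 at layer 1 up to 0.1 at layer 12.
print([round(skip_probability(l), 4) for l in range(1, NUM_LAYERS + 1)])
```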
## Model Details
| Parameter | Value |
|-----------|-------|
| Model Type | BitNet with Quantization |
| Vocabulary Size | 128,256 |
| Hidden Size | 1,024 |
| Number of Layers | 12 |
| Attention Heads | 8 |
| Head Dimension | 64 |
| FFN Intermediate Size | 4,096 |
| Max Sequence Length | 128 |
| Quantization Bits | 8 |
| Dropout Rate | 0.1 |
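
For reference, the same hyperparameters expressed as a config object; the field names here are hypothetical and need not match the repository's actual config class:

```python
from dataclasses import dataclass

@dataclass
class BitNetConfig:
    """Hyperparameters from the table above; field names are illustrative."""
    vocab_size: int = 128_256
    hidden_size: int = 1024
    num_layers: int = 12
    num_heads: int = 8
    head_dim: int = 64
    ffn_size: int = 4096
    max_seq_len: int = 128
    quant_bits: int = 8
    dropout: float = 0.1

config = BitNetConfig()
assert config.hidden_size == config.num_heads * config.head_dim  # 1,024 = 8 x 64
```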
## Training
- **Dataset**: FineWeb-EDU (sample-10BT subset)
- **Training Framework**: PyTorch with mixed precision (FP16)
- **Optimization**: Gradient checkpointing and a streaming dataset loader (training loop sketched below)
- **Hardware**: See the repository for hardware and training details
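
A hedged sketch of how such a loop could be wired up, loading the model as in the Usage section and streaming sample-10BT from the Hub; the optimizer, learning rate, batching, and device handling are illustrative, not the actual training recipe:

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # FP16 autocast as used here targets a GPU
model = AutoModelForCausalLM.from_pretrained("bitnet-8bit").to(device)
tokenizer = AutoTokenizer.from_pretrained("bitnet-8bit")
model.gradient_checkpointing_enable()  # trade recompute for activation memory (if supported)
model.train()

# Streaming avoids materializing the full sample-10BT subset on disk.
stream = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                      split="train", streaming=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # lr is illustrative
scaler = torch.cuda.amp.GradScaler()

for example in stream:
    batch = tokenizer(example["text"], truncation=True, max_length=128,
                      return_tensors="pt").to(device)
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast(dtype=torch.float16):       # FP16 mixed precision
        loss = model(**batch, labels=batch["input_ids"]).loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```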
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("bitnet-8bit")  # a custom architecture may need trust_remote_code=True
tokenizer = AutoTokenizer.from_pretrained("bitnet-8bit")
# Basic generation
inputs = tokenizer("The key to understanding BitNet is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# With early exit (only if the inference code supports this custom kwarg)
outputs = model.generate(
    **inputs,
    max_length=100,
    early_exit_threshold=0.95,  # exit once the model is 95% confident
    use_cache=True,
)
```
## Performance Characteristics
- **Memory Efficiency**: 8-bit weights take roughly a quarter of the FP32 footprint (see the estimate after this list)
- **Adaptive Computation**: Layer skipping and early exit reduce average computation
- **Inference Speed**: Variable depending on early exit and layer skipping activation
- **Quality**: Designed to stay close to full-precision models of similar size despite quantization
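
As a rough illustration of the memory claim, the configuration above implies about 282M weights (ignoring biases, norms, and position embeddings, and assuming a tied output head), so treat the figures below as a back-of-envelope estimate rather than a measured footprint:

```python
# Rough parameter count from the configuration: embeddings + 12 transformer layers.
vocab, hidden, layers, ffn = 128_256, 1024, 12, 4096

embedding_params = vocab * hidden                            # ~131M token-embedding weights
per_layer_params = 4 * hidden * hidden + 2 * hidden * ffn    # attention (Q, K, V, out) + FFN (up, down)
total_params = embedding_params + layers * per_layer_params  # ~282M weights

for name, bytes_per_weight in [("FP32", 4), ("FP16", 2), ("INT8", 1)]:
    print(f"{name}: {total_params * bytes_per_weight / 2**20:,.0f} MiB")
# INT8 weights occupy roughly a quarter of the FP32 footprint (~270 MiB vs ~1,080 MiB).
```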
## Limitations
- Maximum sequence length is limited to 128 tokens
- This is an experimental BitNet implementation with custom architecture
- Early exit and layer skipping require compatible inference code
- Quantization may affect performance on certain tasks
## Citation
If you use this model, please cite:
```bibtex
@misc{bitnet2024,
  title={BitNet with Layer Skipping and Early Exit},
  author={Your Name},
  year={2024},
  url={https://huggingface.co/bitnet-8bit}
}
```
## License
Apache 2.0 - This model can be used for commercial purposes.