---
language:
- en
license: apache-2.0
tags:
- pytorch
- causal-lm
- bitnet
- 8bit-quantization
- layer-skip
- early-exit
- safetensors
- fineweb-edu
datasets:
- HuggingFaceFW/fineweb-edu
model_type: bitnet
---
# bitnet-8bit-v3
This is a BitNet model with 8-bit quantization, layer skipping, and early exit capabilities, trained on the FineWeb-EDU dataset.
## Architecture Overview
### Input Processing
- **Token Embeddings**: 128,256 vocabulary size
- **Position Embeddings**: Up to 128 positions
- **Hidden Dimensions**: 1024-dimensional hidden states
### Transformer Layers (12 total)
Each layer contains:
- Layer normalization (eps=1e-05)
- **Multi-Head Attention**: 16 heads
- Residual connections
- **Feed-Forward Network**: 1024 → 4096 → 1024
- Dropout (0.1) after attention and FFN
- Activation function: GELU
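
The list above corresponds to a standard transformer block. The following is a minimal PyTorch sketch of one such layer; the class name, the pre-norm placement, and the omission of the 8-bit quantization and layer-skip hooks are illustrative assumptions, not the repository's actual implementation.

```python
import torch.nn as nn

class BitNetBlockSketch(nn.Module):
    """Illustrative block matching the layer description above (quantization
    and skipping logic omitted; norm placement is an assumption)."""

    def __init__(self, hidden_size=1024, num_heads=16, intermediate_size=4096, dropout=0.1):
        super().__init__()
        self.attn_norm = nn.LayerNorm(hidden_size, eps=1e-5)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, dropout=dropout, batch_first=True)
        self.attn_dropout = nn.Dropout(dropout)
        self.ffn_norm = nn.LayerNorm(hidden_size, eps=1e-5)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size),  # 1024 -> 4096
            nn.GELU(),
            nn.Linear(intermediate_size, hidden_size),  # 4096 -> 1024
        )
        self.ffn_dropout = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None):
        # Attention sub-layer with residual connection
        h = self.attn_norm(x)
        h, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + self.attn_dropout(h)
        # Feed-forward sub-layer with residual connection
        h = self.ffn(self.ffn_norm(x))
        return x + self.ffn_dropout(h)
```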
### Special Features
- **8-bit Quantization**: 8-bit activations for efficiency
- **Layer Skipping**: Dynamic computation with skip probability 0.1
- Minimum layers to keep: 4
- **Early Exit**: Can terminate at any layer if confidence > 95%
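
Layer skipping and early exit can be pictured as a single loop over the layer stack. The sketch below is only an illustration: the function name, the use of an intermediate LM head, and the max-softmax confidence measure are assumptions; only the numbers (skip probability 0.1, minimum of 4 layers, 95% confidence threshold) come from the description above.

```python
import torch

def forward_with_skip_and_exit(layers, lm_head, hidden, *, skip_prob=0.1,
                               min_layers=4, exit_threshold=0.95, training=False):
    """Hypothetical forward pass combining stochastic layer skipping (training)
    with confidence-based early exit (inference)."""
    for i, layer in enumerate(layers):
        # Layer skipping: randomly drop layers during training,
        # but always keep at least `min_layers`.
        if training and i >= min_layers and torch.rand(()) < skip_prob:
            continue
        hidden = layer(hidden)

        # Early exit: at inference time, stop once the intermediate prediction
        # for the last position is confident enough.
        if not training and i + 1 >= min_layers:
            logits = lm_head(hidden[:, -1, :])
            confidence = torch.softmax(logits, dim=-1).max()
            if confidence > exit_threshold:
                break
    return hidden
```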
## Model Configuration
```json
{
  "vocab_size": 128256,
  "hidden_size": 1024,
  "num_hidden_layers": 12,
  "num_attention_heads": 16,
  "intermediate_size": 4096,
  "max_position_embeddings": 128,
  "activation_bits": 8,
  "hidden_dropout_prob": 0.1,
  "attention_probs_dropout_prob": 0.1
}
```
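
A few quantities follow directly from this configuration. The snippet below computes them with plain Python and does not read anything from the repository.

```python
import json

config = json.loads("""
{
  "vocab_size": 128256,
  "hidden_size": 1024,
  "num_attention_heads": 16,
  "intermediate_size": 4096
}
""")

head_dim = config["hidden_size"] // config["num_attention_heads"]     # 64 dims per attention head
ffn_expansion = config["intermediate_size"] // config["hidden_size"]  # 4x feed-forward expansion
embedding_params = config["vocab_size"] * config["hidden_size"]       # 131,334,144 token-embedding weights
print(head_dim, ffn_expansion, f"{embedding_params:,}")
```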
## Training Details
- **Dataset**: FineWeb-EDU (sample-10BT subset)
- **Batch Size**: 64 (with gradient accumulation)
- **Learning Rate**: 5e-05
- **Weight Decay**: 0.01
- **Warmup Steps**: 1000
- **Max Gradient Norm**: 1.0
- **Gradient Accumulation Steps**: 4
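
The training script itself is not part of this repository. The sketch below shows one plausible way to wire the listed hyperparameters together; the optimizer (AdamW), the linear warmup schedule, and the placeholder names `model`, `dataloader`, and `total_steps` are assumptions.

```python
import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

# `model`, `dataloader`, and `total_steps` are placeholders; batches are
# assumed to include labels so the forward pass returns a loss.
optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=1000, num_training_steps=total_steps
)

accumulation_steps = 4  # gradient accumulation as listed above
for step, batch in enumerate(dataloader):
    loss = model(**batch).loss / accumulation_steps
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```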
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("bitnet-8bit-v3")
tokenizer = AutoTokenizer.from_pretrained("bitnet-8bit-v3")
# Basic generation
inputs = tokenizer("The key to understanding BitNet is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Performance Characteristics
- **Memory Efficiency**: 8-bit quantization reduces memory footprint
- **Adaptive Computation**: Layer skipping reduces average computation by ~10%
- **Low Latency**: Early exit can terminate computation when confident
- **Compact Size**: Significantly smaller than full-precision models
## Limitations
- Maximum sequence length is limited to 128 tokens
- This is an experimental BitNet implementation
- Early exit and layer skipping require compatible inference code
- Model performance may vary based on the complexity of the input
## Technical Details
- **Initializer Range**: 0.02
- **Layer Norm Epsilon**: 1e-05
- **Tokenizer**: Based on Meta-Llama-3-8B-Instruct tokenizer
- **Format**: SafeTensors for fast and safe loading
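
Because the weights ship as SafeTensors, they can be inspected without instantiating the model. The snippet below assumes the checkpoint uses the conventional `model.safetensors` filename.

```python
from safetensors.torch import load_file

# Load the raw tensors from a locally downloaded checkpoint
# ("model.safetensors" is the usual filename and is assumed here).
state_dict = load_file("model.safetensors")
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape), tensor.dtype)
```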
## Citation
If you use this model, please cite:
```bibtex
@misc{bitnet2024,
  title={BitNet: 8-bit Quantized Transformer with Layer Skipping},
  author={Your Name},
  year={2024},
  url={https://huggingface.co/bitnet-8bit-v3}
}
```
## License
Apache 2.0 - This model can be used for commercial purposes.
## Acknowledgments
- Training data from FineWeb-EDU by Hugging Face
- Tokenizer from Meta's Llama-3-8B-Instruct model |