---
language:
- en
license: apache-2.0
tags:
- pytorch
- causal-lm
- bitnet2
- h-bitlinear
- 8bit-quantization
- layer-skip
- early-exit
- safetensors
- fineweb-edu
datasets:
- HuggingFaceFW/fineweb-edu
model_type: bitnet2
---
# bitnetv2-model
This is a BitNetModel2 with H-BitLinear layers, 8-bit quantization, layer skipping, and early exit capabilities, trained on the FineWeb-EDU dataset.
## Architecture Overview
### Input Processing
- **Token Embeddings**: 128,256 vocabulary size
- **Position Embeddings**: Up to 128 positions
- **Hidden Dimensions**: 512-dimensional hidden states (power of 2 for H-BitLinear)
### Transformer Layers (12 total)
Each layer contains (see the sketch after this list):
- Layer normalization (eps=1e-05)
- **Multi-Head Attention**: 8 heads
- Residual connections
- **H-BitLinear Feed-Forward Network**: 512 → 2048 → 512
- Dropout (0.1) after attention and FFN
- Activation function: silu
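A rough PyTorch sketch of one such layer, assuming a pre-norm arrangement; the H-BitLinear projections are stood in for by plain `nn.Linear`, and the class name `BitNet2Block` is illustrative rather than the repository's actual implementation:
```python
import torch
import torch.nn as nn

class BitNet2Block(nn.Module):
    """Illustrative sketch of one transformer layer as described above."""
    def __init__(self, hidden=512, heads=8, ffn=2048, dropout=0.1, eps=1e-5):
        super().__init__()
        self.ln_attn = nn.LayerNorm(hidden, eps=eps)
        self.attn = nn.MultiheadAttention(hidden, heads, dropout=dropout, batch_first=True)
        self.ln_ffn = nn.LayerNorm(hidden, eps=eps)
        self.ffn = nn.Sequential(
            nn.Linear(hidden, ffn),   # stand-in for the H-BitLinear up-projection (512 -> 2048)
            nn.SiLU(),                # silu activation
            nn.Linear(ffn, hidden),   # stand-in for the H-BitLinear down-projection (2048 -> 512)
        )
        self.drop = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None):
        # Attention sub-layer with residual connection and dropout
        h = self.ln_attn(x)
        h, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + self.drop(h)
        # Feed-forward sub-layer with residual connection and dropout
        x = x + self.drop(self.ffn(self.ln_ffn(x)))
        return x
```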
### Special Features
- **H-BitLinear Layers**: Hadamard transform-based linear layers for efficiency
- **8-bit Quantization**: 8-bit activations for memory efficiency
- **Layer Skipping**: Dynamic computation with skip probability 0.1
- Minimum layers to keep: 4
- **Early Exit**: Can terminate at any layer if confidence > 95% (see the sketch after this list)
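The layer-skipping and early-exit behaviour can be illustrated with a minimal forward loop. This is a sketch under stated assumptions (a uniform Bernoulli skip decision per layer, an exit classifier shared with the LM head, the 4-layer minimum applied to both skipping and exiting, and a 0.95 softmax-confidence threshold on the last position); the repository's actual inference code may differ:
```python
import torch

def adaptive_forward(layers, lm_head, hidden, skip_prob=0.1, min_layers=4, exit_threshold=0.95):
    """Sketch of dynamic layer skipping and confidence-based early exit.

    `hidden` is assumed to be (batch, seq, hidden_size) with batch size 1.
    """
    kept = 0
    for i, layer in enumerate(layers):
        # Randomly skip a layer, but always keep at least `min_layers` overall
        remaining = len(layers) - i
        must_keep = (min_layers - kept) >= remaining
        if not must_keep and torch.rand(()).item() < skip_prob:
            continue
        hidden = layer(hidden)
        kept += 1

        # Early exit: stop once the prediction for the last position is confident enough
        if kept >= min_layers:
            probs = torch.softmax(lm_head(hidden[:, -1, :]), dim=-1)
            if probs.max().item() > exit_threshold:
                break
    return lm_head(hidden)
```
In practice the random skip decision would typically be disabled or annealed at evaluation time; the Bernoulli draw here simply mirrors the 0.1 skip probability and 4-layer minimum described above, and the loop can be driven with blocks like the `BitNet2Block` sketch from the previous section.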
## Model Configuration
```json
{
  "vocab_size": 128256,
  "hidden_size": 512,
  "num_hidden_layers": 12,
  "num_attention_heads": 8,
  "intermediate_size": 2048,
  "max_position_embeddings": 128,
  "activation_bits": 8,
  "use_h_bitlinear": true,
  "hidden_dropout_prob": 0.1,
  "attention_probs_dropout_prob": 0.1
}
```
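Of these fields, `activation_bits: 8` corresponds to the 8-bit activation quantization mentioned above. A common per-tensor absmax recipe is sketched below; the exact scheme used by this implementation is an assumption:
```python
import torch

def quantize_activations_8bit(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Symmetric per-tensor absmax quantization of activations to the int8 range [-127, 127]."""
    scale = 127.0 / x.abs().max().clamp(min=eps)
    x_q = (x * scale).round().clamp(-127, 127)
    return x_q / scale  # dequantized ("fake-quantized") values, as used in quantization-aware training
```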
## Training Details
- **Dataset**: FineWeb-EDU (sample-10BT subset)
- **Batch Size**: 8 (with gradient accumulation)
- **Learning Rate**: 5e-05
- **Weight Decay**: 0.01
- **Warmup Steps**: 1000
- **Max Gradient Norm**: 1.0
- **Gradient Accumulation Steps**: 4 (effective batch size of 32; see the sketch after this list)
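These hyperparameters map roughly onto a standard Hugging Face `TrainingArguments` configuration; the sketch below is illustrative rather than the actual training script (optimizer, scheduler, and logging choices are left at their defaults and are assumptions):
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="bitnetv2-model",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,  # effective batch size of 32
    learning_rate=5e-5,
    weight_decay=0.01,
    warmup_steps=1000,
    max_grad_norm=1.0,
)
```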
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer (a custom model_type such as bitnet2 typically needs trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("bitnetv2-model", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("bitnetv2-model")

# Basic generation (do_sample=True is required for the temperature setting to take effect)
inputs = tokenizer("The key to understanding BitNet is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Performance Characteristics
- **Memory Efficiency**: H-BitLinear layers and 8-bit quantization reduce memory footprint
- **Adaptive Computation**: Layer skipping reduces average computation by ~10%
- **Low Latency**: Early exit can terminate computation when confident
- **Compact Size**: Significantly smaller than full-precision models
- **H-BitLinear Benefits**: Hadamard transforms enable efficient matrix operations (sketched after this list)
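A minimal sketch of the H-BitLinear idea: rotate the activations with a fast Walsh-Hadamard transform (hence the power-of-two dimension requirement) and then apply the linear projection. The transform code is generic; the projection is left unquantized for brevity, so treat this as an assumption-laden illustration rather than the actual layer:
```python
import math
import torch
import torch.nn as nn

def hadamard_transform(x: torch.Tensor) -> torch.Tensor:
    """Fast Walsh-Hadamard transform along the last dimension (length must be a power of two)."""
    n = x.shape[-1]
    assert n & (n - 1) == 0, "H-BitLinear requires power-of-two dimensions"
    y, h = x, 1
    while h < n:
        # Split each block of size 2h into two halves and butterfly-combine them
        y = y.reshape(*y.shape[:-1], n // (2 * h), 2, h)
        a, b = y[..., 0, :], y[..., 1, :]
        y = torch.stack((a + b, a - b), dim=-2).reshape(*x.shape[:-1], n)
        h *= 2
    return y / math.sqrt(n)  # orthonormal scaling

class HBitLinearSketch(nn.Module):
    """Illustrative H-BitLinear: Hadamard-rotate the input, then apply a (here unquantized) linear layer."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(hadamard_transform(x))
```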
## Limitations
- Maximum sequence length is limited to 128 tokens; longer inputs must be truncated (see the example after this list)
- This is an experimental BitNetModel2 implementation with H-BitLinear layers
- Early exit and layer skipping require compatible inference code
- Model performance may vary based on the complexity of the input
- H-BitLinear layers require power-of-2 dimensions
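Because positions are capped at 128, longer prompts should be truncated explicitly. The snippet below uses only standard `transformers` tokenizer arguments:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bitnetv2-model")

long_text = "BitNet combines quantization with efficient transforms. " * 100
inputs = tokenizer(long_text, return_tensors="pt", truncation=True, max_length=128)
print(inputs["input_ids"].shape)  # sequence dimension is at most 128
```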
## Technical Details
- **Initializer Range**: 0.02
- **Layer Norm Epsilon**: 1e-05
- **Tokenizer**: Based on Meta-Llama-3-8B-Instruct tokenizer
- **Format**: SafeTensors for fast and safe loading (see the inspection example after this list)
- **H-BitLinear**: Uses Hadamard transforms for efficient linear operations
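Because the checkpoint ships as SafeTensors, the weights can be inspected without instantiating the model. The filename `model.safetensors` is the usual single-shard default and is assumed here:
```python
from safetensors.torch import load_file

# Load the raw tensors and print a few entries to check names, shapes, and dtypes
state_dict = load_file("model.safetensors")
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape), tensor.dtype)
```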
## Citation
If you use this model, please cite:
```bibtex
@misc{bitnet2_2024,
  title={BitNetModel2: H-BitLinear Transformer with Layer Skipping and Early Exit},
  author={Your Name},
  year={2024},
  url={https://huggingface.co/bitnetv2-model}
}
```
## License
Apache 2.0 - This model can be used for commercial purposes.
## Acknowledgments
- Training data from FineWeb-EDU by HuggingFace
- Tokenizer from Meta's Llama-3-8B-Instruct model
- H-BitLinear implementation for efficient matrix operations