---
language:
- en
license: apache-2.0
tags:
- pytorch
- causal-lm
- bitnet2
- h-bitlinear
- 8bit-quantization
- layer-skip
- early-exit
- safetensors
- fineweb-edu
datasets:
- HuggingFaceFW/fineweb-edu
model_type: bitnet2
---

# bitnetv2-model

This is a BitNetModel2 causal language model with H-BitLinear layers, 8-bit activation quantization, layer skipping, and early exit, trained on the FineWeb-EDU dataset.

## Architecture Overview

### Input Processing
- **Token Embeddings**: 128,256 vocabulary size
- **Position Embeddings**: Up to 128 positions
- **Hidden Dimensions**: 512-dimensional hidden states (power of 2 for H-BitLinear)

### Transformer Layers (12 total)
Each layer contains (see the sketch below):
- Layer normalization (eps=1e-05)
- **Multi-Head Attention**: 8 heads
- Residual connections
- **H-BitLinear Feed-Forward Network**: 512 → 2048 → 512
- Dropout (0.1) after attention and FFN
- Activation function: silu

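The per-layer structure above can be sketched in PyTorch as follows. This is a minimal illustration, not the repository's code: `nn.Linear` stands in for the H-BitLinear projections (an H-BitLinear sketch appears under Performance Characteristics), the class name `BitNet2Block` is made up for the example, and pre-norm ordering is assumed.

```python
import torch
import torch.nn as nn

class BitNet2Block(nn.Module):
    """Illustrative pre-norm transformer block using the dimensions listed above."""

    def __init__(self, hidden_size=512, num_heads=8, intermediate_size=2048,
                 dropout=0.1, eps=1e-5):
        super().__init__()
        self.ln_attn = nn.LayerNorm(hidden_size, eps=eps)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads,
                                          dropout=dropout, batch_first=True)
        self.ln_ffn = nn.LayerNorm(hidden_size, eps=eps)
        # In the actual model these two projections are H-BitLinear layers.
        self.ffn_up = nn.Linear(hidden_size, intermediate_size)
        self.ffn_down = nn.Linear(intermediate_size, hidden_size)
        self.act = nn.SiLU()
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None):
        # Multi-head attention (8 heads) with residual connection and dropout
        h = self.ln_attn(x)
        h, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + self.dropout(h)
        # Feed-forward network 512 -> 2048 -> 512 with SiLU, residual, and dropout
        h = self.ffn_down(self.act(self.ffn_up(self.ln_ffn(x))))
        return x + self.dropout(h)

# Quick shape check: (batch, seq_len, hidden) stays (1, 16, 512)
print(BitNet2Block()(torch.randn(1, 16, 512)).shape)
```
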
### Special Features
- **H-BitLinear Layers**: Hadamard transform-based linear layers for efficiency
- **8-bit Quantization**: 8-bit activations for memory efficiency
- **Layer Skipping**: Dynamic computation with skip probability 0.1
  - Minimum layers to keep: 4
- **Early Exit**: Can terminate at any layer if confidence > 95%

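The layer skipping and early exit described above can be sketched as a modified forward loop. The 0.1 skip probability, minimum of 4 layers, and 95% confidence threshold come from this card; the function name, the training-only skipping, and checking the exit condition only after the minimum layer count are illustrative assumptions.

```python
import torch

def forward_with_skip_and_exit(layers, lm_head, hidden, *, training,
                               skip_prob=0.1, min_layers=4, exit_threshold=0.95):
    """Illustrative forward pass with stochastic layer skipping and early exit."""
    kept = 0
    for i, layer in enumerate(layers):
        remaining = len(layers) - i
        must_keep = kept + remaining <= min_layers  # guarantee at least min_layers run
        if training and not must_keep and torch.rand(()).item() < skip_prob:
            continue  # layer skipping: drop this layer for this forward pass
        hidden = layer(hidden)
        kept += 1
        if not training and kept >= min_layers:
            # Early exit: stop once the most likely next token is confident enough.
            probs = torch.softmax(lm_head(hidden[:, -1]), dim=-1)
            if probs.max().item() > exit_threshold:
                break
    return lm_head(hidden)
```

In a real implementation the exit decision would typically be made per sequence in the batch; here the whole batch exits together to keep the sketch short.
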
## Model Configuration

```json
{
  "vocab_size": 128256,
  "hidden_size": 512,
  "num_hidden_layers": 12,
  "num_attention_heads": 8,
  "intermediate_size": 2048,
  "max_position_embeddings": 128,
  "activation_bits": 8,
  "use_h_bitlinear": True,
  "hidden_dropout_prob": 0.1,
  "attention_probs_dropout_prob": 0.1
}
```
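
Once the repository is available, the same configuration can be read back with `AutoConfig`; `trust_remote_code=True` is an assumption here, needed only if the custom `bitnet2` model type ships its own configuration class.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("bitnetv2-model", trust_remote_code=True)
print(config.hidden_size, config.num_hidden_layers)  # expected: 512 12
```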

## Training Details

- **Dataset**: FineWeb-EDU (sample-10BT subset)
- **Batch Size**: 8 (with gradient accumulation)
- **Learning Rate**: 5e-05
- **Weight Decay**: 0.01
- **Warmup Steps**: 1000
- **Max Gradient Norm**: 1.0
- **Gradient Accumulation Steps**: 4

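A hedged sketch of how these hyperparameters fit together. Only the values listed above come from this card; the use of AdamW, the total step count, and the loop structure are assumptions.

```python
import torch
from transformers import get_linear_schedule_with_warmup

def train_loop(model, dataloader, total_steps=100_000):
    # Hyperparameter values from the card; AdamW and total_steps are assumptions.
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=1000, num_training_steps=total_steps)
    accum_steps, max_grad_norm = 4, 1.0

    model.train()
    for step, batch in enumerate(dataloader):        # batches of size 8
        loss = model(**batch).loss / accum_steps     # scale loss for accumulation
        loss.backward()
        if (step + 1) % accum_steps == 0:            # effective batch size 8 * 4 = 32
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```
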
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer (trust_remote_code=True may be required for the custom bitnet2 model type)
model = AutoModelForCausalLM.from_pretrained("bitnetv2-model")
tokenizer = AutoTokenizer.from_pretrained("bitnetv2-model")

# Basic generation (do_sample=True so that temperature actually takes effect)
inputs = tokenizer("The key to understanding BitNet is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
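
Because the model supports at most 128 positions, longer prompts should be truncated at tokenization time (`long_text` is a placeholder):

```python
inputs = tokenizer(long_text, return_tensors="pt", truncation=True, max_length=128)
```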

## Performance Characteristics

- **Memory Efficiency**: H-BitLinear layers and 8-bit quantization reduce memory footprint
- **Adaptive Computation**: Layer skipping reduces average computation by ~10%
- **Low Latency**: Early exit can terminate computation when confident
- **Compact Size**: Significantly smaller than full-precision models
- **H-BitLinear Benefits**: Hadamard transforms enable efficient matrix operations

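To make the H-BitLinear idea above concrete, here is a generic sketch of a fast Walsh-Hadamard transform followed by 8-bit activation fake-quantization in front of a plain linear projection. It illustrates the technique only; the repository's actual H-BitLinear implementation, scaling, and weight quantization may differ, and the class name is invented for the example.

```python
import torch
import torch.nn as nn

def hadamard_transform(x):
    """Fast Walsh-Hadamard transform over the last dimension (must be a power of 2)."""
    n = x.shape[-1]
    assert n & (n - 1) == 0, "H-BitLinear requires power-of-2 dimensions"
    y, h = x.clone(), 1
    while h < n:
        y = y.view(*x.shape[:-1], n // (2 * h), 2, h)
        a, b = y[..., 0, :], y[..., 1, :]
        y = torch.stack((a + b, a - b), dim=-2).reshape(*x.shape[:-1], n)
        h *= 2
    return y / n ** 0.5  # orthonormal scaling

class HBitLinearSketch(nn.Module):
    """Hadamard rotation + 8-bit activation quantization in front of a linear layer."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features, bias=False)

    def forward(self, x):
        x = hadamard_transform(x)  # rotate activations into the Hadamard basis
        scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5) / 127.0
        x = torch.round(x / scale).clamp(-128, 127) * scale  # simulated 8-bit activations
        return self.linear(x)

# Example: 512 is a power of 2, as required.
print(HBitLinearSketch(512, 2048)(torch.randn(2, 16, 512)).shape)
```

A deployed kernel would keep the int8 values and fold the scale into the matrix multiplication; the fake-quantization form above keeps the sketch framework-agnostic.
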
## Limitations

- Maximum sequence length is limited to 128 tokens
- This is an experimental BitNetModel2 implementation with H-BitLinear layers
- Early exit and layer skipping require compatible inference code
- Model performance may vary based on the complexity of the input
- H-BitLinear layers require power-of-2 dimensions

## Technical Details

- **Initializer Range**: 0.02
- **Layer Norm Epsilon**: 1e-05
- **Tokenizer**: Based on Meta-Llama-3-8B-Instruct tokenizer
- **Format**: SafeTensors for fast and safe loading
- **H-BitLinear**: Uses Hadamard transforms for efficient linear operations

## Citation

If you use this model, please cite:

```bibtex
@misc{bitnet2_2024,
  title={BitNetModel2: H-BitLinear Transformer with Layer Skipping and Early Exit},
  author={Your Name},
  year={2024},
  url={https://huggingface.co/bitnetv2-model}
}
```

## License

Apache 2.0 - This model can be used for commercial purposes.

## Acknowledgments

- Training data from FineWeb-EDU by HuggingFace
- Tokenizer from Meta's Llama-3-8B-Instruct model
- H-BitLinear implementation for efficient matrix operations