---
language:
- en
license: apache-2.0
tags:
- pytorch
- causal-lm
- bitnet2
- h-bitlinear
- 8bit-quantization
- layer-skip
- early-exit
- safetensors
- fineweb-edu
datasets:
- HuggingFaceFW/fineweb-edu
model_type: bitnet2
---
# bitnetv2-model
This is a BitNetModel2 with H-BitLinear layers, 8-bit quantization, layer skipping, and early exit capabilities, trained on the FineWeb-EDU dataset.
## Architecture Overview
### Input Processing
- **Token Embeddings**: 128,256 vocabulary size
- **Position Embeddings**: Up to 128 positions
- **Hidden Dimensions**: 512-dimensional hidden states (power of 2 for H-BitLinear)
### Transformer Layers (12 total)
Each layer contains (see the sketch after this list):
- Layer normalization (eps=1e-05)
- **Multi-Head Attention**: 8 heads
- Residual connections
- **H-BitLinear Feed-Forward Network**: 512 → 2048 → 512
- Dropout (0.1) after attention and FFN
- Activation function: silu
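A rough PyTorch sketch of one such layer, assuming a pre-norm arrangement; the H-BitLinear projections are stood in for by plain `nn.Linear`, and the class name `BitNet2Block` is illustrative rather than the repository's actual implementation:
```python
import torch
import torch.nn as nn

class BitNet2Block(nn.Module):
    """Illustrative sketch of one transformer layer as described above."""
    def __init__(self, hidden=512, heads=8, ffn=2048, dropout=0.1, eps=1e-5):
        super().__init__()
        self.ln_attn = nn.LayerNorm(hidden, eps=eps)
        self.attn = nn.MultiheadAttention(hidden, heads, dropout=dropout, batch_first=True)
        self.ln_ffn = nn.LayerNorm(hidden, eps=eps)
        self.ffn = nn.Sequential(
            nn.Linear(hidden, ffn),   # stand-in for the H-BitLinear up-projection (512 -> 2048)
            nn.SiLU(),                # silu activation
            nn.Linear(ffn, hidden),   # stand-in for the H-BitLinear down-projection (2048 -> 512)
        )
        self.drop = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None):
        # Attention sub-layer with residual connection and dropout
        h = self.ln_attn(x)
        h, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + self.drop(h)
        # Feed-forward sub-layer with residual connection and dropout
        x = x + self.drop(self.ffn(self.ln_ffn(x)))
        return x
```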
### Special Features
- **H-BitLinear Layers**: Hadamard transform-based linear layers for efficiency
- **8-bit Quantization**: 8-bit activations for memory efficiency
- **Layer Skipping**: Dynamic computation with skip probability 0.1
- Minimum layers to keep: 4
- **Early Exit**: Can terminate at any layer if confidence > 95% (see the sketch after this list)
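The layer-skipping and early-exit behaviour can be illustrated with a minimal forward loop. This is a sketch under stated assumptions (a uniform Bernoulli skip decision per layer, an exit classifier shared with the LM head, the 4-layer minimum applied to both skipping and exiting, and a 0.95 softmax-confidence threshold on the last position); the repository's actual inference code may differ:
```python
import torch

def adaptive_forward(layers, lm_head, hidden, skip_prob=0.1, min_layers=4, exit_threshold=0.95):
    """Sketch of dynamic layer skipping and confidence-based early exit.

    `hidden` is assumed to be (batch, seq, hidden_size) with batch size 1.
    """
    kept = 0
    for i, layer in enumerate(layers):
        # Randomly skip a layer, but always keep at least `min_layers` overall
        remaining = len(layers) - i
        must_keep = (min_layers - kept) >= remaining
        if not must_keep and torch.rand(()).item() < skip_prob:
            continue
        hidden = layer(hidden)
        kept += 1

        # Early exit: stop once the prediction for the last position is confident enough
        if kept >= min_layers:
            probs = torch.softmax(lm_head(hidden[:, -1, :]), dim=-1)
            if probs.max().item() > exit_threshold:
                break
    return lm_head(hidden)
```
In practice the random skip decision would typically be disabled or annealed at evaluation time; the Bernoulli draw here simply mirrors the 0.1 skip probability and 4-layer minimum described above, and the loop can be driven with blocks like the `BitNet2Block` sketch from the previous section.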
## Model Configuration
```json
{
  "vocab_size": 128256,
  "hidden_size": 512,
  "num_hidden_layers": 12,
  "num_attention_heads": 8,
  "intermediate_size": 2048,
  "max_position_embeddings": 128,
  "activation_bits": 8,
  "use_h_bitlinear": true,
  "hidden_dropout_prob": 0.1,
  "attention_probs_dropout_prob": 0.1
}
```
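Of these fields, `activation_bits: 8` corresponds to the 8-bit activation quantization mentioned above. A common per-tensor absmax recipe is sketched below; the exact scheme used by this implementation is an assumption:
```python
import torch

def quantize_activations_8bit(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Symmetric per-tensor absmax quantization of activations to the int8 range [-127, 127]."""
    scale = 127.0 / x.abs().max().clamp(min=eps)
    x_q = (x * scale).round().clamp(-127, 127)
    return x_q / scale  # dequantized ("fake-quantized") values, as used in quantization-aware training
```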
## Training Details
- **Dataset**: FineWeb-EDU (sample-10BT subset)
- **Batch Size**: 8 (with gradient accumulation)
- **Learning Rate**: 5e-05
- **Weight Decay**: 0.01
- **Warmup Steps**: 1000
- **Max Gradient Norm**: 1.0
- **Gradient Accumulation Steps**: 4 (effective batch size of 32; see the sketch after this list)
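These hyperparameters map roughly onto a standard Hugging Face `TrainingArguments` configuration; the sketch below is illustrative rather than the actual training script (optimizer, scheduler, and logging choices are left at their defaults and are assumptions):
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="bitnetv2-model",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,  # effective batch size of 32
    learning_rate=5e-5,
    weight_decay=0.01,
    warmup_steps=1000,
    max_grad_norm=1.0,
)
```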
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer (a custom model_type such as bitnet2 typically needs trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("bitnetv2-model", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("bitnetv2-model")

# Basic generation (do_sample=True is required for the temperature setting to take effect)
inputs = tokenizer("The key to understanding BitNet is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Performance Characteristics
- **Memory Efficiency**: H-BitLinear layers and 8-bit quantization reduce memory footprint
- **Adaptive Computation**: Layer skipping reduces average computation by ~10%
- **Low Latency**: Early exit can terminate computation when confident
- **Compact Size**: Significantly smaller than full-precision models
- **H-BitLinear Benefits**: Hadamard transforms enable efficient matrix operations (sketched after this list)
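A minimal sketch of the H-BitLinear idea: rotate the activations with a fast Walsh-Hadamard transform (hence the power-of-two dimension requirement) and then apply the linear projection. The transform code is generic; the projection is left unquantized for brevity, so treat this as an assumption-laden illustration rather than the actual layer:
```python
import math
import torch
import torch.nn as nn

def hadamard_transform(x: torch.Tensor) -> torch.Tensor:
    """Fast Walsh-Hadamard transform along the last dimension (length must be a power of two)."""
    n = x.shape[-1]
    assert n & (n - 1) == 0, "H-BitLinear requires power-of-two dimensions"
    y, h = x, 1
    while h < n:
        # Split each block of size 2h into two halves and butterfly-combine them
        y = y.reshape(*y.shape[:-1], n // (2 * h), 2, h)
        a, b = y[..., 0, :], y[..., 1, :]
        y = torch.stack((a + b, a - b), dim=-2).reshape(*x.shape[:-1], n)
        h *= 2
    return y / math.sqrt(n)  # orthonormal scaling

class HBitLinearSketch(nn.Module):
    """Illustrative H-BitLinear: Hadamard-rotate the input, then apply a (here unquantized) linear layer."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(hadamard_transform(x))
```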
## Limitations
- Maximum sequence length is limited to 128 tokens; longer inputs must be truncated (see the example after this list)
- This is an experimental BitNetModel2 implementation with H-BitLinear layers
- Early exit and layer skipping require compatible inference code
- Model performance may vary based on the complexity of the input
- H-BitLinear layers require power-of-2 dimensions
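Because positions are capped at 128, longer prompts should be truncated explicitly. The snippet below uses only standard `transformers` tokenizer arguments:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bitnetv2-model")

long_text = "BitNet combines quantization with efficient transforms. " * 100
inputs = tokenizer(long_text, return_tensors="pt", truncation=True, max_length=128)
print(inputs["input_ids"].shape)  # sequence dimension is at most 128
```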
## Technical Details
- **Initializer Range**: 0.02
- **Layer Norm Epsilon**: 1e-05
- **Tokenizer**: Based on Meta-Llama-3-8B-Instruct tokenizer
- **Format**: SafeTensors for fast and safe loading (see the inspection example after this list)
- **H-BitLinear**: Uses Hadamard transforms for efficient linear operations
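Because the checkpoint ships as SafeTensors, the weights can be inspected without instantiating the model. The filename `model.safetensors` is the usual single-shard default and is assumed here:
```python
from safetensors.torch import load_file

# Load the raw tensors and print a few entries to check names, shapes, and dtypes
state_dict = load_file("model.safetensors")
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape), tensor.dtype)
```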
## Citation
If you use this model, please cite:
```bibtex
@misc{bitnet2_2024,
  title={BitNetModel2: H-BitLinear Transformer with Layer Skipping and Early Exit},
  author={Your Name},
  year={2024},
  url={https://huggingface.co/bitnetv2-model}
}
```
## License
Apache 2.0 - This model can be used for commercial purposes.
## Acknowledgments
- Training data from FineWeb-EDU by HuggingFace
- Tokenizer from Meta's Llama-3-8B-Instruct model
- H-BitLinear implementation for efficient matrix operations