---
language:
- en
license: apache-2.0
tags:
- pytorch
- causal-lm
- bitnet2
- h-bitlinear
- 8bit-quantization
- layer-skip
- early-exit
- safetensors
- fineweb-edu
datasets:
- HuggingFaceFW/fineweb-edu
model_type: bitnet2
---
# bitnetv2-model
This is a BitNetModel2 with H-BitLinear layers, 8-bit quantization, layer skipping, and early exit capabilities, trained on the FineWeb-EDU dataset.
## Architecture Overview
### Input Processing
- **Token Embeddings**: 128,256 vocabulary size
- **Position Embeddings**: Up to 128 positions
- **Hidden Dimensions**: 512-dimensional hidden states (power of 2 for H-BitLinear)
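
For orientation, here is a minimal sketch of this input path, assuming learned absolute position embeddings added to the token embeddings (the card does not state the exact embedding scheme):

```python
import torch
import torch.nn as nn

vocab_size, max_positions, hidden_size = 128256, 128, 512

token_embeddings = nn.Embedding(vocab_size, hidden_size)        # 128,256-entry vocabulary
position_embeddings = nn.Embedding(max_positions, hidden_size)  # up to 128 positions

input_ids = torch.randint(0, vocab_size, (1, 16))               # (batch, seq_len)
positions = torch.arange(input_ids.shape[1]).unsqueeze(0)
hidden_states = token_embeddings(input_ids) + position_embeddings(positions)  # (1, 16, 512)
```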
### Transformer Layers (12 total)
Each layer contains:
- Layer normalization (eps=1e-05)
- **Multi-Head Attention**: 8 heads
- Residual connections
- **H-BitLinear Feed-Forward Network**: 512 → 2048 → 512 (see the sketch after this list)
- Dropout (0.1) after attention and FFN
- Activation function: SiLU
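
The H-BitLinear feed-forward block listed above can be pictured roughly as follows. This is a hedged sketch, not the repository's modeling code: the Hadamard rotation, the per-token 8-bit activation quantization, and the names `fwht`, `HBitLinear`, and `HBitFFN` are assumptions made for illustration.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def fwht(x: torch.Tensor) -> torch.Tensor:
    """Fast Walsh-Hadamard transform over the last dimension (length must be a power of 2)."""
    n = x.shape[-1]
    out = x.reshape(-1, n)
    h = 1
    while h < n:
        out = out.reshape(-1, n // (2 * h), 2, h)
        a, b = out[:, :, 0, :], out[:, :, 1, :]
        out = torch.stack((a + b, a - b), dim=2).reshape(-1, n)
        h *= 2
    return out.reshape(x.shape)

def quantize_activations_8bit(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Symmetric per-token absmax fake-quantization of activations to 8 bits (assumed scheme)."""
    scale = 127.0 / x.abs().amax(dim=-1, keepdim=True).clamp(min=eps)
    return torch.round(x * scale).clamp(-128, 127) / scale

class HBitLinear(nn.Module):
    """Illustrative linear layer: Hadamard rotation + 8-bit activations before the matmul."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        assert in_features > 0 and (in_features & (in_features - 1)) == 0, \
            "H-BitLinear needs a power-of-2 input dimension"
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.normal_(self.weight, std=0.02)  # matches the card's initializer range

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = fwht(x) / math.sqrt(x.shape[-1])  # orthonormal Hadamard rotation of the activations
        x = quantize_activations_8bit(x)      # 8-bit activation quantization
        return F.linear(x, self.weight)

class HBitFFN(nn.Module):
    """The 512 -> 2048 -> 512 feed-forward block with SiLU and dropout 0.1."""
    def __init__(self, hidden_size: int = 512, intermediate_size: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.up = HBitLinear(hidden_size, intermediate_size)
        self.down = HBitLinear(intermediate_size, hidden_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dropout(self.down(F.silu(self.up(x))))
```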
### Special Features
- **H-BitLinear Layers**: Hadamard transform-based linear layers for efficiency
- **8-bit Quantization**: activations are quantized to 8 bits for memory efficiency
- **Layer Skipping**: Dynamic computation with skip probability 0.1
- Minimum layers to keep: 4
- **Early Exit**: Can terminate at any layer if confidence > 95%
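
A minimal sketch of how layer skipping and early exit might interact at generation time, interpreting "minimum layers to keep" as always running the first four layers. The confidence test (max softmax probability of an intermediate LM-head projection at the last position, for a single sequence) and the function name are assumptions; the actual behaviour lives in the model's inference code.

```python
import torch

def forward_with_skip_and_exit(layers, lm_head, hidden_states,
                               skip_prob=0.1, min_layers=4, exit_confidence=0.95):
    """Illustrative decode loop for one sequence: skip some layers, exit early when confident."""
    for i, layer in enumerate(layers):
        # Layer skipping: beyond the first `min_layers` layers, skip each layer with probability 0.1.
        if i >= min_layers and torch.rand(()) < skip_prob:
            continue
        hidden_states = layer(hidden_states)

        # Early exit: stop as soon as the intermediate prediction exceeds 95% confidence.
        probs = torch.softmax(lm_head(hidden_states[:, -1, :]), dim=-1)
        if probs.max() > exit_confidence:
            break
    return hidden_states
```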
## Model Configuration
```json
{
  "vocab_size": 128256,
  "hidden_size": 512,
  "num_hidden_layers": 12,
  "num_attention_heads": 8,
  "intermediate_size": 2048,
  "max_position_embeddings": 128,
  "activation_bits": 8,
  "use_h_bitlinear": true,
  "hidden_dropout_prob": 0.1,
  "attention_probs_dropout_prob": 0.1
}
```
## Training Details
- **Dataset**: FineWeb-EDU (sample-10BT subset)
- **Batch Size**: 8 (with gradient accumulation)
- **Learning Rate**: 5e-05
- **Weight Decay**: 0.01
- **Warmup Steps**: 1000
- **Max Gradient Norm**: 1.0
- **Gradient Accumulation Steps**: 4
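As a sketch, an optimizer and schedule consistent with these hyperparameters might look like the following. AdamW and a linear warmup schedule are assumptions (the card does not name them), and `model`, `dataloader`, and `total_steps` are placeholders for your own setup. With a per-device batch size of 8 and 4 accumulation steps, the effective batch size is 32 sequences per optimizer update.

```python
import torch
from transformers import get_linear_schedule_with_warmup

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=1000, num_training_steps=total_steps
)

accumulation_steps = 4
for step, batch in enumerate(dataloader):
    loss = model(**batch).loss / accumulation_steps                       # scale loss for accumulation
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # max gradient norm 1.0
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```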
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer (trust_remote_code is likely required for the custom bitnet2 architecture)
model = AutoModelForCausalLM.from_pretrained("bitnetv2-model", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("bitnetv2-model")

# Basic generation (enable sampling so that temperature takes effect)
inputs = tokenizer("The key to understanding BitNet is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
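Note that `max_position_embeddings` is 128, so the prompt plus generated tokens must stay within 128 positions; the `max_length=100` above fits within that budget.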
## Performance Characteristics
- **Memory Efficiency**: H-BitLinear layers and 8-bit quantization reduce memory footprint
- **Adaptive Computation**: Layer skipping reduces average computation by ~10%
- **Low Latency**: Early exit can terminate computation when confident
- **Compact Size**: Significantly smaller than full-precision models
- **H-BitLinear Benefits**: Hadamard transforms enable efficient matrix operations
## Limitations
- Maximum sequence length is limited to 128 tokens
- This is an experimental BitNetModel2 implementation with H-BitLinear layers
- Early exit and layer skipping require compatible inference code
- Model performance may vary based on the complexity of the input
- H-BitLinear layers require power-of-2 dimensions
## Technical Details
- **Initializer Range**: 0.02
- **Layer Norm Epsilon**: 1e-05
- **Tokenizer**: Based on Meta-Llama-3-8B-Instruct tokenizer
- **Format**: SafeTensors for fast and safe loading
- **H-BitLinear**: Uses Hadamard transforms for efficient linear operations
## Citation
If you use this model, please cite:
```bibtex
@misc{bitnet2_2024,
  title={BitNetModel2: H-BitLinear Transformer with Layer Skipping and Early Exit},
  author={Your Name},
  year={2024},
  url={https://huggingface.co/bitnetv2-model}
}
```
## License
Apache 2.0 - This model can be used for commercial purposes.
## Acknowledgments
- Training data from FineWeb-EDU by HuggingFace
- Tokenizer from Meta's Llama-3-8B-Instruct model
- H-BitLinear implementation for efficient matrix operations