---
language:
- en
license: apache-2.0
tags:
- pytorch
- causal-lm
- bitnet
- quantized
- 8bit
- layer-skip
- early-exit
- rope
- safetensors
- fineweb-edu
datasets:
- HuggingFaceFW/fineweb-edu
---

# bitnet-8bit

This is a BitNet model with 8-bit quantization, layer skipping, and early exit capabilities, trained on the FineWeb-EDU dataset.

## Architecture Overview

### Input Processing

- **Token Embeddings**: 128,256 vocabulary size
- **Position Embeddings**: Up to 128 positions
- **Hidden Dimensions**: 1,024-dimensional hidden states

### Transformer Layers (12 total)

Each layer contains (see the sketch after this list):

- Layer normalization
- **BitNet Attention**: 8 heads, 64 dimensions per head
- Residual connections
- **BitNet Feed-Forward Network**: 1024 → 4096 → 1024
- Dropout (0.1) after attention and FFN
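
The repository contains the actual modules; purely as an illustration of the layout above, here is a minimal PyTorch sketch of one layer, assuming a standard pre-norm ordering. `BitNetBlockSketch` and its plain `nn.MultiheadAttention`/`nn.Linear` sub-modules are placeholders for the quantized BitNet attention and FFN, not the classes shipped in this repository.

```python
import torch
import torch.nn as nn

class BitNetBlockSketch(nn.Module):
    """Illustrative stand-in for one layer: hidden=1024, 8 heads, FFN 1024 -> 4096 -> 1024."""

    def __init__(self, hidden=1024, heads=8, ffn=4096, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden)
        # Placeholder for the quantized BitNet attention (8 heads x 64 dims each)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(hidden)
        # Placeholder for the quantized feed-forward network
        self.ffn = nn.Sequential(nn.Linear(hidden, ffn), nn.GELU(), nn.Linear(ffn, hidden))
        self.drop = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None):
        # Attention sub-block: norm -> attention -> dropout -> residual
        h = self.norm1(x)
        h, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + self.drop(h)
        # FFN sub-block: norm -> FFN -> dropout -> residual
        return x + self.drop(self.ffn(self.norm2(x)))
```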

### Special Features

- **8-bit Quantization**: Applied in the attention and FFN layers to reduce memory and compute cost
- **Rotary Position Embeddings (RoPE)**: Used in attention with dimension 64
- **Layer Skipping**: Quadratic dropout schedule p_l = p_max × (l/L)²
  - Maximum skip probability: 0.1
  - No explicit minimum number of active layers
- **Early Exit**: Generation can terminate at any layer once next-token confidence exceeds 95% (both the skip schedule and this check are sketched below)
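
Purely for illustration, a small sketch of both mechanisms using the numbers above (p_max = 0.1, L = 12 layers, 95% threshold). The function names and the `lm_head` argument are hypothetical, not part of this repository's API.

```python
import torch

def skip_probabilities(num_layers: int = 12, p_max: float = 0.1) -> list[float]:
    """Quadratic schedule p_l = p_max * (l / L)**2 for layers l = 1..L."""
    return [p_max * (l / num_layers) ** 2 for l in range(1, num_layers + 1)]

probs = skip_probabilities()
# p_1 ≈ 0.0007, p_6 = 0.025, p_12 = 0.1 (the stated maximum skip probability)

def should_exit_early(hidden_state: torch.Tensor, lm_head: torch.nn.Module,
                      threshold: float = 0.95) -> bool:
    """Stop at the current layer once the most likely next token exceeds the threshold."""
    logits = lm_head(hidden_state[:, -1, :])                # last position only
    confidence = torch.softmax(logits, dim=-1).max().item()
    return confidence > threshold
```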

## Model Details

| Parameter | Value |
|-----------|-------|
| Model Type | BitNet with Quantization |
| Vocabulary Size | 128,256 |
| Hidden Size | 1,024 |
| Number of Layers | 12 |
| Attention Heads | 8 |
| Head Dimension | 64 |
| FFN Intermediate Size | 4,096 |
| Max Sequence Length | 128 |
| Quantization Bits | 8 |
| Dropout Rate | 0.1 |
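
The exact quantization scheme is defined by the repository code; as intuition for the 8-bit setting above, a common approach is symmetric absmax quantization, sketched here with hypothetical helper names.

```python
import torch

def absmax_quantize_int8(w: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Map a float weight tensor to int8 in [-127, 127] with a single per-tensor scale."""
    scale = w.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(1024, 4096)                            # e.g. one FFN projection
q, scale = absmax_quantize_int8(w)
print((dequantize_int8(q, scale) - w).abs().max())     # small reconstruction error
```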

## Training

- **Dataset**: FineWeb-EDU (sample-10BT subset)
- **Training Framework**: PyTorch with mixed precision (FP16)
- **Optimization**: Gradient checkpointing and a streaming dataset implementation (a training-loop sketch follows this list)
- **Hardware**: Training details are available in the repository
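
A minimal sketch of such a loop, assuming a CUDA device, the standard `datasets` streaming API, PyTorch AMP, and an HF-style model whose forward returns a `.loss` when given `labels`. The learning rate and other hyperparameters are illustrative, not the values used for this checkpoint.

```python
import torch
from datasets import load_dataset
from torch.cuda.amp import GradScaler, autocast
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("bitnet-8bit").cuda()
tokenizer = AutoTokenizer.from_pretrained("bitnet-8bit")

# Stream FineWeb-EDU (sample-10BT) instead of downloading it up front
stream = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                      split="train", streaming=True)

model.train()
model.gradient_checkpointing_enable()                         # assumes the HF-style helper exists
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)    # illustrative value
scaler = GradScaler()

for example in stream:
    batch = tokenizer(example["text"], return_tensors="pt",
                      truncation=True, max_length=128).to(model.device)
    with autocast(dtype=torch.float16):                       # FP16 mixed precision
        loss = model(**batch, labels=batch["input_ids"]).loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()
```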

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("bitnet-8bit")
tokenizer = AutoTokenizer.from_pretrained("bitnet-8bit")

# Basic generation (do_sample=True is needed for temperature to take effect)
inputs = tokenizer("The key to understanding BitNet is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# With early exit (requires the custom inference code;
# early_exit_threshold is not a standard generate() argument)
outputs = model.generate(
    **inputs,
    max_length=100,
    early_exit_threshold=0.95,  # exit once the model is 95% confident
    use_cache=True
)
```
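
Because the maximum sequence length is 128 tokens, longer prompts should be truncated at tokenization time, for example:

```python
prompt_text = "Some very long document ... " * 50   # anything longer than 128 tokens
# Truncate inputs that exceed the 128-token context window
inputs = tokenizer(prompt_text, return_tensors="pt", truncation=True, max_length=128)
```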

## Performance Characteristics

- **Memory Efficiency**: 8-bit weights need roughly half the memory of FP16 and a quarter of FP32 (a rough estimate follows this list)
- **Adaptive Computation**: Layer skipping and early exit reduce the average computation per generated token
- **Inference Speed**: Variable, depending on how often early exit and layer skipping trigger
- **Quality**: Comparable to full-precision models of similar size despite quantization
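
As a back-of-the-envelope illustration of the memory numbers, here is the parameter count implied by the Model Details table and the resulting weight footprint at different precisions (biases ignored and the output head assumed untied, so treat the results as rough estimates).

```python
vocab, hidden, layers, ffn, max_pos = 128_256, 1024, 12, 4096, 128

embeddings = vocab * hidden + max_pos * hidden        # token + position tables
per_layer  = 4 * hidden * hidden + 2 * hidden * ffn   # attention (q, k, v, o) + FFN
lm_head    = vocab * hidden                           # assumes an untied output projection

params = embeddings + layers * per_layer + lm_head    # ≈ 414M parameters
for bits in (32, 16, 8):
    print(f"{bits:>2}-bit weights: ~{params * bits / 8 / 2**20:.0f} MiB")
```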

## Limitations

- Maximum sequence length is limited to 128 tokens
- This is an experimental BitNet implementation with a custom architecture
- Early exit and layer skipping require compatible inference code
- Quantization may affect performance on certain tasks

## Citation

If you use this model, please cite:

```bibtex
@misc{bitnet2024,
  title={BitNet with Layer Skipping and Early Exit},
  author={Your Name},
  year={2024},
  url={https://huggingface.co/bitnet-8bit}
}
```

## License

Apache 2.0: this model may be used for commercial purposes under the terms of the license.