---
language:
- en
license: apache-2.0
tags:
- pytorch
- causal-lm
- bitnet
- 8bit-quantization
- layer-skip
- early-exit
- safetensors
- fineweb-edu
datasets:
- HuggingFaceFW/fineweb-edu
model_type: bitnet
---

# bitnet-8bit-v3

This is a BitNet-style causal language model with 8-bit activation quantization, dynamic layer skipping, and early-exit support, trained on the FineWeb-EDU dataset.

## Architecture Overview

### Input Processing

- **Token Embeddings**: 128,256-token vocabulary
- **Position Embeddings**: up to 128 positions
- **Hidden Dimensions**: 1024-dimensional hidden states

### Transformer Layers (12 total)

Each layer contains:

- Layer normalization (eps = 1e-05)
- **Multi-Head Attention**: 16 heads
- Residual connections
- **Feed-Forward Network**: 1024 → 4096 → 1024
- Dropout (0.1) after attention and FFN
- Activation function: GELU

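For orientation only, the block below sketches one such layer in plain PyTorch with the dimensions listed above. The class name, the pre-LN ordering, and the omission of the 8-bit activation quantization are assumptions for illustration, not the model's actual implementation.

```python
import torch
import torch.nn as nn

class BitNetLayerSketch(nn.Module):
    """Hypothetical transformer layer matching the dimensions listed above."""

    def __init__(self, hidden=1024, heads=16, ffn=4096, dropout=0.1, eps=1e-5):
        super().__init__()
        self.ln1 = nn.LayerNorm(hidden, eps=eps)
        self.attn = nn.MultiheadAttention(hidden, heads, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(hidden, eps=eps)
        self.ffn = nn.Sequential(
            nn.Linear(hidden, ffn),  # 1024 -> 4096
            nn.GELU(),
            nn.Linear(ffn, hidden),  # 4096 -> 1024
        )
        self.drop = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None):
        # Attention block with residual connection
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + self.drop(attn_out)
        # Feed-forward block with residual connection
        x = x + self.drop(self.ffn(self.ln2(x)))
        return x
```
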
### Special Features

- **8-bit Quantization**: 8-bit activations for efficiency
- **Layer Skipping**: dynamic computation with a per-layer skip probability of 0.1
  - Minimum layers to keep: 4
- **Early Exit**: inference can terminate at any layer once prediction confidence exceeds 95% (see the sketch below)

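The snippet below sketches how layer skipping and early exit could interact in a forward pass, using the thresholds quoted above. The function name, the placement of the confidence check, and the use of the last position's next-token distribution are assumptions for illustration; the actual inference logic ships with the model code.

```python
import torch
import torch.nn.functional as F

def adaptive_forward(layers, lm_head, hidden, skip_prob=0.1, min_layers=4, exit_conf=0.95):
    """Hypothetical adaptive forward pass: random layer skipping plus
    confidence-based early exit (thresholds taken from the card above)."""
    executed = 0
    for i, layer in enumerate(layers):
        remaining = len(layers) - i
        # Only allow skipping while the minimum number of layers can still be met
        can_skip = executed + remaining > min_layers
        if can_skip and torch.rand(()).item() < skip_prob:
            continue  # skip this layer entirely
        hidden = layer(hidden)
        executed += 1
        # Early exit: stop once the next-token distribution is confident enough
        if executed >= min_layers:
            probs = F.softmax(lm_head(hidden[:, -1, :]), dim=-1)
            if probs.max().item() > exit_conf:
                break
    return hidden, executed
```

Checking confidence after every executed layer adds an extra LM-head projection per layer, which is itself nontrivial with a 128,256-token vocabulary, so a real implementation may restrict the check to a subset of layers.
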
## Model Configuration

```json
{
  "vocab_size": 128256,
  "hidden_size": 1024,
  "num_hidden_layers": 12,
  "num_attention_heads": 16,
  "intermediate_size": 4096,
  "max_position_embeddings": 128,
  "activation_bits": 8,
  "hidden_dropout_prob": 0.1,
  "attention_probs_dropout_prob": 0.1
}
```

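These values imply a model of roughly 280M parameters, dominated by the embedding table. The back-of-the-envelope estimate below assumes a standard attention/FFN parameterization with biases and a tied input/output embedding; an untied LM head or extra norms would add somewhat more.

```python
# Rough parameter count derived from the config above (assumptions noted in the text).
V, H, L, I, P = 128256, 1024, 12, 4096, 128

embeddings = V * H + P * H               # token + position embeddings
attention  = 4 * (H * H + H)             # Q, K, V and output projections (with biases)
ffn        = (H * I + I) + (I * H + H)   # 1024 -> 4096 -> 1024 (with biases)
layernorms = 2 * 2 * H                   # two LayerNorms per layer (weight + bias)
per_layer  = attention + ffn + layernorms

total = embeddings + L * per_layer
print(f"~{total / 1e6:.0f}M parameters")  # ~283M
```
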
## Training Details

- **Dataset**: FineWeb-EDU (sample-10BT subset)
- **Batch Size**: 64 (with gradient accumulation)
- **Learning Rate**: 5e-05
- **Weight Decay**: 0.01
- **Warmup Steps**: 1000
- **Max Gradient Norm**: 1.0
- **Gradient Accumulation Steps**: 4

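For reference, a minimal sketch of how these hyperparameters might be wired together is shown below. The optimizer choice (AdamW), the linear warmup schedule, and the total step count are assumptions not stated in this card, and `model` / `dataloader` are placeholders.

```python
import torch
from transformers import get_linear_schedule_with_warmup

def train(model, dataloader, total_steps=100_000, accum_steps=4):
    # Hyperparameters taken from the card; AdamW and total_steps are assumed.
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=1000, num_training_steps=total_steps
    )

    for step, batch in enumerate(dataloader):
        loss = model(**batch).loss / accum_steps          # scale for accumulation
        loss.backward()
        if (step + 1) % accum_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```
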
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer
# (if the repository ships custom modeling code, trust_remote_code=True may be required)
model = AutoModelForCausalLM.from_pretrained("bitnet-8bit-v3")
tokenizer = AutoTokenizer.from_pretrained("bitnet-8bit-v3")

# Basic sampled generation
inputs = tokenizer("The key to understanding BitNet is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=100,     # must stay within the 128-token position limit
    do_sample=True,     # required for temperature to take effect
    temperature=0.7,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Performance Characteristics

- **Memory Efficiency**: 8-bit activation quantization reduces the memory footprint
- **Adaptive Computation**: layer skipping reduces average layer compute by roughly 10% (see the check below)
- **Low Latency**: early exit can terminate computation once the model is sufficiently confident
- **Compact Size**: significantly smaller than comparable full-precision models

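The ~10% figure follows directly from the 0.1 per-layer skip probability; the quick simulation below (using the same skipping rule as the sketch under Special Features) suggests the minimum-layer constraint barely changes it, and any early exits would only reduce compute further.

```python
import random

# Monte Carlo estimate of average executed layers with per-layer skip
# probability 0.1 and at least 4 layers always kept (no early exit).
num_layers, skip_prob, min_layers, trials = 12, 0.1, 4, 100_000

total = 0
for _ in range(trials):
    executed = 0
    for i in range(num_layers):
        remaining = num_layers - i
        can_skip = executed + remaining > min_layers
        if can_skip and random.random() < skip_prob:
            continue
        executed += 1
    total += executed

avg = total / trials
print(f"average layers executed: {avg:.2f} / {num_layers} "
      f"(~{1 - avg / num_layers:.0%} less layer compute)")  # ≈ 10%
```
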
## Limitations

- Maximum sequence length is limited to 128 tokens
- This is an experimental BitNet implementation
- Early exit and layer skipping require compatible inference code
- Model performance may vary with the complexity of the input

## Technical Details

- **Initializer Range**: 0.02
- **Layer Norm Epsilon**: 1e-05
- **Tokenizer**: based on the Meta-Llama-3-8B-Instruct tokenizer
- **Format**: SafeTensors for fast and safe loading

## Citation

If you use this model, please cite:

```bibtex
@misc{bitnet2024,
  title={BitNet: 8-bit Quantized Transformer with Layer Skipping},
  author={Your Name},
  year={2024},
  url={https://huggingface.co/bitnet-8bit-v3}
}
```

## License

Apache 2.0. This model can be used for commercial purposes.

## Acknowledgments

- Training data from FineWeb-EDU by Hugging Face
- Tokenizer from Meta's Llama-3-8B-Instruct model