# Dots MOCR: 4-bit Quantized (NF4)

## Introduction
This repository provides a 4-bit quantized version of dots.mocr, optimized using BitsAndBytes (NF4 precision) for efficient, low-memory inference.
The original model is a powerful multimodal OCR system capable of:
- Document parsing
- Layout understanding
- Multilingual OCR
- Structured outputs (JSON / Markdown / SVG)
This version enables deployment on low-VRAM GPUs while maintaining strong performance.
## Key Features
- 4-bit quantization (NF4)
- Reduced VRAM usage (~70–80%)
- Faster inference
- Compatible with Hugging Face Transformers
- Supports OCR and document parsing
- Suitable for edge and local deployments
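
The VRAM reduction quoted above follows directly from the storage cost per weight. The sketch below is a back-of-envelope estimate; the 3-billion-parameter figure is illustrative, not the actual dots.mocr size, and it counts weights only (activations and KV cache are extra):

```python
# Back-of-envelope VRAM estimate for model weights at different precisions.
# The 3B parameter count below is a hypothetical example, not the real
# dots.mocr size.

def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (ignores activations and KV cache)."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 3e9                           # hypothetical parameter count
fp16 = weight_memory_gb(n_params, 16)    # full half precision
nf4 = weight_memory_gb(n_params, 4.5)    # 4-bit weights + per-block scales

print(f"fp16: {fp16:.1f} GB, NF4: {nf4:.1f} GB "
      f"({(1 - nf4 / fp16) * 100:.0f}% smaller)")
```

The ~0.5 extra bits per parameter account for the per-block quantization scales that NF4 stores alongside the 4-bit weights, which is why the saving lands near 72% rather than a full 75%.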
## Installation (Base Setup)
> ⚠️ This model depends on the original dots.mocr repository.
```bash
conda create -n dots_mocr python=3.12
conda activate dots_mocr

git clone https://github.com/rednote-hilab/dots.mocr.git
cd dots.mocr
pip install -e .
pip install flash-attn==2.8.0.post2
```
## Usage (Quantized Inference)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "rednote-hilab/dots.mocr"

# 4-bit NF4 quantization with double quantization, computing in float16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Example usage (text-only prompt; see the dots.mocr repository for image inputs)
inputs = tokenizer("Extract text from image", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
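
The base model can emit structured outputs such as JSON. A minimal sketch of consuming such a response; the payload and field names here are hypothetical, and the real schema is defined by the dots.mocr repository:

```python
import json

# Hypothetical JSON response from the model; the actual schema is defined
# by the dots.mocr repository, not by this example.
raw_output = (
    '{"blocks": [{"type": "text", "bbox": [10, 20, 200, 40],'
    ' "text": "Invoice #123"}]}'
)

doc = json.loads(raw_output)
for block in doc["blocks"]:
    # Each block carries its layout type, bounding box, and recognized text
    print(block["type"], block["text"])
```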
## Quantization Details
| Parameter | Value |
|---|---|
| Precision | 4-bit |
| Quant Type | NF4 |
| Compute Dtype | float16 |
| Double Quant | Enabled |
| Library | BitsAndBytes |
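
The "Double Quant" row refers to quantizing the per-block scales themselves. A small sketch of the effective bits per weight, using the block sizes from the QLoRA scheme that bitsandbytes NF4 is based on (one fp32 scale per 64 weights; double quantization re-quantizes those scales to 8 bits in second-level blocks of 256):

```python
# Effective storage cost per weight under NF4, with and without double
# quantization. Block sizes follow the QLoRA scheme.

BLOCK = 64    # weights per first-level quantization block
BLOCK2 = 256  # first-level scales per second-level block

single = 4 + 32 / BLOCK                         # one fp32 scale per block
double = 4 + 8 / BLOCK + 32 / (BLOCK * BLOCK2)  # 8-bit scales + fp32 meta-scale

print(f"NF4 only:       {single:.3f} bits/param")
print(f"+ double quant: {double:.3f} bits/param")
```

Double quantization shaves roughly 0.37 bits per parameter, which adds up on multi-billion-parameter models at essentially no accuracy cost.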
## Use Cases
- Document OCR
- PDF parsing
- Layout detection
- Structured data extraction
- AI-powered document understanding
- Edge deployment of large OCR models
## Limitations
- Slight accuracy drop compared to full precision
- GPU recommended for optimal performance
- Some layers remain in higher precision
- Not fully optimized for CPU inference
## Future Work
- GGUF conversion for CPU inference
- FlashAttention optimization improvements
- Integration with full OCR pipelines
- Web UI (Gradio / Streamlit demo)
- Benchmark comparisons (VRAM vs accuracy)
## Acknowledgements
- Base model: rednote-hilab/dots.mocr
- Quantization: BitsAndBytes
- Framework: Hugging Face Transformers
## License
MIT License