Dots MOCR – 4-bit Quantized (NF4)

๐Ÿ” Introduction

This repository provides a 4-bit quantized version of dots.mocr, optimized using BitsAndBytes (NF4 precision) for efficient, low-memory inference.

The original model is a powerful multimodal OCR system capable of:

  • Document parsing
  • Layout understanding
  • Multilingual OCR
  • Structured outputs (JSON / Markdown / SVG)

This version enables deployment on low-VRAM GPUs while maintaining strong performance.


โš™๏ธ Key Features

  • 4-bit quantization (NF4)
  • Reduced VRAM usage (~70–80%)
  • Faster inference
  • Compatible with Hugging Face Transformers
  • Supports OCR and document parsing
  • Suitable for edge and local deployments
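
As a rough illustration of the VRAM savings, here is a back-of-the-envelope estimate assuming a ~3B-parameter model, 2 bytes per parameter in FP16 versus roughly 0.5 bytes per parameter in NF4 (the 4.1 bits/param figure is an approximation that accounts for double-quantization overhead; activation memory and the layers kept in higher precision are ignored):

```python
def model_memory_gb(num_params: int, bits_per_param: float) -> float:
    """Rough weight-memory estimate in GB, ignoring activations and overhead."""
    return num_params * bits_per_param / 8 / 1e9

params = 3_000_000_000  # ~3B parameters (assumption for illustration)
fp16 = model_memory_gb(params, 16)   # full half-precision weights
nf4 = model_memory_gb(params, 4.1)   # NF4 + double-quant overhead (approx.)

print(f"FP16: {fp16:.1f} GB, NF4: {nf4:.1f} GB, saved: {1 - nf4 / fp16:.0%}")
```

This lands in the same ballpark as the ~70–80% reduction claimed above; real savings vary with which layers stay in higher precision.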

๐Ÿ› ๏ธ Installation (Base Setup)

โš ๏ธ This model depends on the original dots.mocr repository.

conda create -n dots_mocr python=3.12
conda activate dots_mocr

git clone https://github.com/rednote-hilab/dots.mocr.git
cd dots.mocr

pip install -e .
pip install flash-attn==2.8.0.post2

🚀 Usage (Quantized Inference)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "rednote-hilab/dots.mocr"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Example usage (text-only prompt shown for brevity; for actual OCR,
# pass image inputs through the model's processor as described in the
# dots.mocr repository)
inputs = tokenizer("Extract text from image", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
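
Since the model can emit structured outputs (JSON / Markdown), generated text often wraps the JSON payload in extra prose or markdown code fences. A minimal post-processing helper for that case might look like the sketch below; `extract_json` is a hypothetical name, not part of the official dots.mocr API:

```python
import json
import re

def extract_json(generated: str):
    """Pull the first JSON object out of generated text, tolerating
    markdown code fences around the payload. Returns None if no
    valid JSON object is found."""
    # Strip ```json ... ``` fences if present
    fenced = re.search(r"```(?:json)?\s*(.*?)```", generated, re.DOTALL)
    candidate = fenced.group(1) if fenced else generated
    # Take the span from the first '{' to the last '}' and try to parse it
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(candidate[start:end + 1])
    except json.JSONDecodeError:
        return None

sample = 'Here is the layout:\n```json\n{"blocks": [{"type": "title", "text": "Invoice"}]}\n```'
print(extract_json(sample))
```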

📊 Quantization Details

| Parameter     | Value        |
|---------------|--------------|
| Precision     | 4-bit        |
| Quant Type    | NF4          |
| Compute Dtype | float16      |
| Double Quant  | Enabled      |
| Library       | BitsAndBytes |

📌 Use Cases

  • Document OCR
  • PDF parsing
  • Layout detection
  • Structured data extraction
  • AI-powered document understanding
  • Edge deployment of large OCR models

โš ๏ธ Limitations

  • Slight accuracy drop compared to full precision
  • GPU recommended for optimal performance
  • Some layers remain in higher precision
  • Not fully optimized for CPU inference

🔮 Future Work

  • GGUF conversion for CPU inference
  • FlashAttention optimization improvements
  • Integration with full OCR pipelines
  • Web UI (Gradio / Streamlit demo)
  • Benchmark comparisons (VRAM vs accuracy)

🙌 Acknowledgements

  • Base Model: rednote-hilab/dots.mocr
  • Quantization: BitsAndBytes
  • Framework: Hugging Face Transformers

📄 License

MIT License
