Qwen3-30M with GPT-2 Tokenizer (FP16)

A ~30M parameter version of Qwen3-0.6B that uses GPT-2's tokenizer for broader compatibility, stored in FP16 for memory efficiency.

Model Details

  • Base Model: Qwen/Qwen3-0.6B
  • Architecture: Qwen3 (8 layers, 224 hidden size)
  • Parameters: ~35M (reduced from 637M)
  • Tokenizer: GPT-2 (50,257-token vocabulary)
  • Vocabulary: Reduced from 151,936 to 50,257 tokens
  • Precision: FP16 (half precision for memory efficiency)
  • Model Size: ~60MB (vs ~120MB in FP32)
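
These details can be sanity-checked after download by inspecting the configuration and tokenizer directly; the expected values in the comments are the ones listed above:

from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("Mostafa8Mehrabi/qwen3-30m-fp16")
tokenizer = AutoTokenizer.from_pretrained("Mostafa8Mehrabi/qwen3-30m-fp16")

print(config.model_type)         # expected: "qwen3"
print(config.vocab_size)         # expected: 50257
print(config.num_hidden_layers)  # expected: 8
print(config.hidden_size)        # expected: 224
print(len(tokenizer))            # expected: 50257 (GPT-2 vocabulary)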

Architecture Specifications

  • Layers: 8 transformer layers
  • Hidden Size: 224
  • Intermediate Size: 896 (4x hidden_size)
  • Attention Heads: 8
  • Key-Value Heads: 8
  • Max Position Embeddings: 32,768
  • Activation: SiLU
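
For reference, these specs map onto transformers' Qwen3Config roughly as follows. This is a sketch for shape and parameter-count checks, not the shipped config file: Qwen3Config defaults head_dim to 128, and the untied embeddings here are an assumption that happens to match the reported ~35M total.

from transformers import Qwen3Config, Qwen3ForCausalLM

config = Qwen3Config(
    vocab_size=50257,               # GPT-2 vocabulary
    hidden_size=224,
    intermediate_size=896,          # 4x hidden_size
    num_hidden_layers=8,
    num_attention_heads=8,
    num_key_value_heads=8,
    max_position_embeddings=32768,
    hidden_act="silu",
    tie_word_embeddings=False,      # assumption: separate input/output embeddings
)
model = Qwen3ForCausalLM(config)    # randomly initialized, for counting only
print(sum(p.numel() for p in model.parameters()))  # roughly 34.7M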

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the tokenizer and the fp16 weights
tokenizer = AutoTokenizer.from_pretrained("Mostafa8Mehrabi/qwen3-30m-fp16")
model = AutoModelForCausalLM.from_pretrained(
    "Mostafa8Mehrabi/qwen3-30m-fp16",
    torch_dtype=torch.float16,  # load the weights in half precision
    device_map="auto",          # place the model on a GPU if one is available
)

# Tokenize and move the inputs to the model's device (a no-op on CPU)
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
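
The same checkpoint also works through the high-level pipeline API, shown here as an equivalent alternative to the manual loading above (generation settings are illustrative):

from transformers import pipeline
import torch

generator = pipeline(
    "text-generation",
    model="Mostafa8Mehrabi/qwen3-30m-fp16",
    torch_dtype=torch.float16,
    device_map="auto",
)
result = generator("Hello, how are you?", max_new_tokens=50, do_sample=True, temperature=0.7)
print(result[0]["generated_text"])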

Key Features

  • ✅ FP16 Precision: half the file size of FP32 and faster inference
  • ✅ ~35M Parameters: ultra-lightweight for edge deployment
  • ✅ 8 Layers: a balance between depth and speed
  • ✅ Standard GPT-2 tokenizer: no trust_remote_code required
  • ✅ Matching vocabulary: the model's vocab_size equals the tokenizer's 50,257 tokens
  • ✅ SafeTensors format for faster loading
  • ✅ Works like any other Hugging Face model
  • ✅ ~18x fewer parameters than the original Qwen3-0.6B (637M → ~35M)
  • ✅ Runs efficiently on GPU via native half-precision kernels

Architecture Comparison

Component           Original    This Model
Parameters          637M        ~35M
Vocabulary          151,936     50,257
Hidden Size         1024        224
Layers              28          8
Intermediate Size   3072        896
Attention Heads     16          8
Tokenizer           Qwen3       GPT-2
Precision           BF16        FP16
Model Size          ~1.2GB      ~60MB

Memory Requirements

  • FP16: ~60MB weights + ~30MB working memory ≈ 90MB total
  • FP32: ~120MB weights + ~60MB working memory ≈ 180MB total
  • Memory savings: ~50% reduction compared to FP32
  • Ultra-lightweight: well suited to mobile and edge devices
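
These figures are approximate; the actual weight footprint can be measured directly on your machine (a short check, using the same repo id as above):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Mostafa8Mehrabi/qwen3-30m-fp16", torch_dtype=torch.float16
)
param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"weights: {param_bytes / 1024**2:.1f} MiB")  # ~66 MiB for ~34.7M fp16 params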

Performance Notes

  • FP16 provides significant memory savings with minimal quality loss
  • ~35M parameters keep inference fast while maintaining coherent output
  • Well suited to deployment in resource-constrained environments
  • Compatible with both CPU and GPU inference (see the CPU note below)
  • Smaller files mean faster download and loading times
  • 8 layers strike a good balance between model capacity and speed
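
One caveat behind the CPU point above: PyTorch's CPU kernels are generally slower in fp16 than in float32, so for CPU-only inference it usually pays to upcast the weights at load time (a sketch):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Mostafa8Mehrabi/qwen3-30m-fp16")
model = AutoModelForCausalLM.from_pretrained(
    "Mostafa8Mehrabi/qwen3-30m-fp16",
    torch_dtype=torch.float32,  # upcast the fp16 weights for CPU execution
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))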