---
language:
- vi
- en
license: apache-2.0
base_model: Qwen/Qwen3-30B-A3B
tags:
- vietnamese
- qwen
- instruction-tuning
- lora
- text-generation
- conversational
- qlora
- unsloth
pipeline_tag: text-generation
library_name: transformers
datasets:
- 5CD-AI/Vietnamese-openorca-2
- vilm/viet-instruct-v2
- 5CD-AI/Vietnamese-UltraChat
- MBZUAI/Bactrian-X
model-index:
- name: Qwen3-30B Vietnamese Instruct
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: VMLU
      type: tridm/VMLU
    metrics:
    - type: accuracy
      name: VMLU Accuracy
      value: TBD
---

# Qwen3-30B Vietnamese Instruct

**Fine-tuned Qwen3-30B-A3B for Vietnamese instruction-following**

This model is a Vietnamese-optimized version of Qwen3-30B-A3B, fine-tuned on 327K high-quality Vietnamese instruction samples using LoRA (Low-Rank Adaptation).

## Model Description

- **Model Type**: Large Language Model (Mixture-of-Experts)
- **Base Model**: [Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B)
- **Language**: Vietnamese (primary), English (secondary)
- **Fine-tuning Method**: LoRA (rank=64, alpha=128) with 4-bit quantization
- **Training Data**: 327,113 Vietnamese instruction-response pairs
- **License**: Apache 2.0
- **Developed by**: Vietnamese LLM Project

## Intended Use

This model is designed for Vietnamese natural language processing tasks, including:

- **Question Answering**: Answer questions in Vietnamese
- **Instruction Following**: Execute tasks described in Vietnamese
- **Conversational AI**: Engage in multi-turn Vietnamese dialogues
- **Text Generation**: Generate Vietnamese text based on prompts
- **Educational Applications**: Tutoring, explanations, and knowledge sharing

### Direct Use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "danghuyhoang/qwen3-30b-vietnamese-instruct",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("danghuyhoang/qwen3-30b-vietnamese-instruct")

messages = [
    {"role": "system", "content": "Bạn là trợ lý AI thông minh và hữu ích."},
    {"role": "user", "content": "Giải thích khái niệm machine learning bằng tiếng Việt."}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

### Use with Unsloth (2-5x faster inference)

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="danghuyhoang/qwen3-30b-vietnamese-instruct",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # Enable inference mode

messages = [
    {"role": "user", "content": "Việt Nam có bao nhiêu tỉnh thành?"}
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to("cuda")

outputs = model.generate(input_ids=inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))
```

## Training Data

The model was fine-tuned on a diverse collection of Vietnamese instruction datasets:

| Dataset | Samples | Source | License |
|---------|---------|--------|---------|
| OpenOrca-Viet | 121,178 | 5CD-AI | Apache 2.0 |
| VILM Instruction | Subset | VILM Project | Open |
| Vietnamese UltraChat | Subset | 5CD-AI | MIT |
| Bactrian-X Vietnamese | Subset | MBZUAI | CC BY-NC 4.0 |
| Vietnamese MATH | 40,000 | 5CD-AI | Apache 2.0 |
| Multi-turn Chat | 12,697 | 5CD-AI | Apache 2.0 |

**Total**: 320,570 training samples + 6,543 validation samples

### Data Preprocessing

- All data converted to ChatML format (see the sketch below)
- Vietnamese content validation
- Token length filtering (max 3072 tokens)
- Quality filtering and deduplication
- 98/2 train/validation split

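The snippet below is a minimal sketch of what the ChatML conversion step looks like. The field names (`instruction`, `response`, `system`) and the helper `to_chatml` are illustrative assumptions, not the released preprocessing code.

```python
# Hypothetical sketch of the ChatML conversion step. Field names and the
# helper function are illustrative; the actual preprocessing pipeline used
# for this model is not reproduced here.
def to_chatml(example: dict) -> str:
    """Render one instruction-response pair in ChatML."""
    system = example.get("system", "Bạn là trợ lý AI thông minh và hữu ích.")
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{example['instruction']}<|im_end|>\n"
        f"<|im_start|>assistant\n{example['response']}<|im_end|>\n"
    )


sample = {
    "instruction": "Giải thích ngắn gọn khái niệm machine learning.",
    "response": "Machine learning là lĩnh vực giúp máy tính học từ dữ liệu thay vì được lập trình tường minh.",
}
print(to_chatml(sample))
```
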
## Training Details

### Training Hyperparameters

- **Base Model**: unsloth/qwen3-30b-a3b
- **Training Method**: LoRA fine-tuning with 4-bit quantization (QLoRA)
- **LoRA Configuration** (see the sketch after this list):
  - Rank (r): 64
  - Alpha: 128
  - Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  - Dropout: 0.0
- **Training Hyperparameters**:
  - Epochs: 1
  - Batch size: 36 (per device)
  - Gradient accumulation: 1
  - Learning rate: 1.2e-4
  - Optimizer: AdamW 8-bit
  - LR Scheduler: Linear with warmup (50 steps)
  - Max sequence length: 3072
  - Weight decay: 0.01
- **Hardware**: 1× NVIDIA A100 80GB
- **Training Time**: ~47 hours
- **Framework**: Unsloth + Hugging Face Transformers

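As a rough illustration, the sketch below expresses the LoRA/QLoRA configuration and hyperparameters listed above using the Unsloth API. The `output_dir` value is a placeholder, and the dataset loading and trainer wiring are omitted; the exact training script used for this model is not reproduced here.

```python
from unsloth import FastLanguageModel
from transformers import TrainingArguments

# Load the 4-bit quantized base model (QLoRA setup).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/qwen3-30b-a3b",
    max_seq_length=3072,
    load_in_4bit=True,
)

# Attach LoRA adapters with the configuration described above.
model = FastLanguageModel.get_peft_model(
    model,
    r=64,                       # LoRA rank
    lora_alpha=128,
    lora_dropout=0.0,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    use_gradient_checkpointing="unsloth",
)

# Training hyperparameters from the list above ("outputs" is a placeholder;
# dataset and trainer setup are omitted from this sketch).
args = TrainingArguments(
    output_dir="outputs",
    num_train_epochs=1,
    per_device_train_batch_size=36,
    gradient_accumulation_steps=1,
    learning_rate=1.2e-4,
    optim="adamw_8bit",          # AdamW 8-bit via bitsandbytes
    lr_scheduler_type="linear",
    warmup_steps=50,
    weight_decay=0.01,
    bf16=True,
)
```
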
### Training Loss

Training loss decreased steadily from ~1.5 to ~0.8 over 8,905 steps, indicating successful learning.

### Optimization Techniques

- **Unsloth**: 2-5x faster training with optimized CUDA kernels
- **Flash Attention 2**: Memory-efficient attention computation
- **Gradient Checkpointing**: Reduced memory usage
- **4-bit Quantization**: QLoRA for memory efficiency
- **Mixed Precision**: bfloat16 for numerical stability

## Evaluation

### VMLU Benchmark (Vietnamese Multitask Language Understanding)

The model is evaluated on VMLU, a comprehensive Vietnamese benchmark with 744 questions across multiple subjects.

**Evaluation Method**: Logit-based scoring (industry standard, same as MMLU)

| Metric | Score |
|--------|-------|
| Overall Accuracy | Coming Soon |
| STEM | Coming Soon |
| Humanities | Coming Soon |
| Social Sciences | Coming Soon |

**Note**: To reproduce the evaluation, use the evaluation script from the [GitHub repository](https://github.com/andreidhoang/vietnamese-llm-finetuning).

### Comparison with Base Model

| Model | VMLU Accuracy |
|-------|---------------|
| Qwen3-30B-A3B (Base) | Baseline |
| Qwen3-30B-Vietnamese (this model) | Coming Soon |

## Limitations and Biases

### Known Limitations

1. **Vietnamese-Specific**: Optimized for Vietnamese; performance on other languages may be reduced
2. **Instruction Bias**: Trained primarily on instruction-following data; may not excel at creative writing
3. **Factual Knowledge Cutoff**: Inherits Qwen3's training data cutoff (exact date unknown)
4. **Context Length**: Trained with a maximum of 3072 tokens; performance may degrade on longer contexts
5. **Mathematical Reasoning**: Improved, but may still struggle with complex multi-step math
6. **Code Generation**: Not specifically optimized for coding tasks

### Potential Biases

- Training data may reflect biases present in Vietnamese internet content
- Instruction datasets may have geographic/cultural biases
- The model may perform better on formal Vietnamese than colloquial speech
- Limited exposure to Vietnamese dialects and regional variations

### Ethical Considerations

- **Misinformation**: The model may generate plausible but incorrect information
- **Harmful Content**: Despite safety measures, the model may occasionally produce inappropriate content
- **Privacy**: Do not input personal or sensitive information
- **Transparency**: Always disclose when content is AI-generated

## Responsible Use

### Recommended Practices

- Verify factual claims against independent sources
- Use human review for high-stakes applications
- Implement content filtering for production deployments
- Monitor outputs for bias and harmful content
- Disclose AI involvement to users

### Not Recommended For

- Medical, legal, or financial advice without expert review
- Content moderation as the sole decision-maker
- High-stakes decision-making without human oversight
- Generating content intended to deceive

## Hardware Requirements

### For Inference

**Minimum**:
- GPU: RTX 4090 24GB (with 4-bit quantization)
- RAM: 32GB
- Disk: 100GB

**Recommended**:
- GPU: A100 40GB or equivalent
- RAM: 64GB
- Disk: 150GB

### For Fine-tuning

- GPU: A100 80GB (for LoRA fine-tuning)
- RAM: 128GB
- Disk: 500GB
- See the training guide for optimal configurations

## Technical Specifications

### Model Architecture

- **Type**: Mixture-of-Experts (MoE) Transformer
- **Experts**: Multiple expert networks (A3B: approximately 3B parameters activated per token)
- **Parameters**: ~30 billion total (sparse activation)
- **Hidden Size**: 4096
- **Attention Heads**: 32
- **Layers**: 40
- **Vocabulary Size**: 151,936 tokens
- **Context Length**: 32,768 tokens (training limited to 3072)

### File Sizes

- **Full Model**: ~60GB (bfloat16)
- **4-bit Quantized**: ~20GB
- **LoRA Adapters Only**: ~1-2GB

## Citation

If you use this model, please cite:

```bibtex
@software{qwen3_vietnamese_2025,
  title   = {Qwen3-30B Vietnamese Instruct},
  author  = {Vietnamese LLM Project},
  year    = {2025},
  url     = {https://huggingface.co/danghuyhoang/qwen3-30b-vietnamese-instruct},
  license = {Apache-2.0}
}
```

Also cite the base Qwen3 model:

```bibtex
@article{qwen3_2025,
  title   = {Qwen3 Technical Report},
  author  = {Qwen Team},
  journal = {arXiv preprint},
  year    = {2025}
}
```

## Acknowledgments

This model was created using:

- **[Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B)** by Alibaba Cloud
- **[Unsloth](https://github.com/unslothai/unsloth)** for optimized training
- **Vietnamese datasets** from 5CD-AI, VILM, MBZUAI, and the Vietnamese NLP community

## Model Card Contact

For questions or issues:

- **GitHub**: [vietnamese-llm-finetuning](https://github.com/andreidhoang/vietnamese-llm-finetuning)
- **Issues**: [GitHub Issues](https://github.com/andreidhoang/vietnamese-llm-finetuning/issues)
- **Discussions**: [GitHub Discussions](https://github.com/andreidhoang/vietnamese-llm-finetuning/discussions)

---

**License**: Apache 2.0
**Last Updated**: 2025-11-08