Yonghua Lin

Yonghua

Recent Activity

new activity 24 days ago

deepseek-ai/DeepSeek-V4-Flash: Run DeepSeek-V4-Flash on more hardware: FP8/BF16 adapted versions for 8 AI chips (ready to download)

posted an update 24 days ago

🚀 Run DeepSeek V4 on more AI GPUs with FlagOS

DeepSeek V4 just dropped with huge specs: 1.6T params, 1M context, MIT license. But there’s a catch: the official weights use FP4+FP8 mixed precision, which mainly targets NVIDIA Blackwell / B200-class GPUs.

So we built DeepSeek-V4-FlagOS. On Day 0, the FlagOS community completed multi-chip adaptation across 8 AI hardware platforms:

✅ NVIDIA H100/H20: FP8/BF16
✅ Huawei Ascend: BF16
✅ Hygon DCU: BF16
✅ MetaX GPU: BF16
✅ Moore Threads MTT S5000: FP8
✅ Kunlunxin XPU: BF16
✅ T-Head/Alibaba Zhenwu: BF16
✅ Iluvatar GPU: BF16

🔧 What makes it work?

1️⃣ FlagGems operator replacement
DeepSeek V4 operators (MoE routing, Attention, RMSNorm and more) are reimplemented with Triton, reducing dependency on CUDA-specific libraries. New V4 operators include: Act Quant, hc_split_sinkhorn, FP8 MatMul, Sparse Attention, Hadamard Transform.

2️⃣ Flexible tensor parallelism
DeepSeek V4 uses o_groups=8, which can limit TP. We added an independent communication group for o-groups, while allowing the rest of the model to scale to higher TP, enabling deployment on 32GB/64GB cards.

3️⃣ FP4 → BF16 conversion
For hardware without native FP4, we provide ready-to-use BF16 conversion and pre-converted model releases.

📦 Pre-converted models are available on Hugging Face:

V4-Pro:
FlagRelease/DeepSeek-V4-Pro-nvidia-FlagOS
FlagRelease/DeepSeek-V4-Pro-metax-FlagOS
FlagRelease/DeepSeek-V4-Pro-mthreads-FlagOS
FlagRelease/DeepSeek-V4-Pro-hygon-FlagOS
FlagRelease/DeepSeek-V4-Pro-ascend-FlagOS

V4-Flash:
FlagRelease/DeepSeek-V4-Flash-nvidia-FlagOS
FlagRelease/DeepSeek-V4-Flash-zhenwu-FlagOS
FlagRelease/DeepSeek-V4-Flash-kunlunxin-FlagOS
FlagRelease/DeepSeek-V4-Flash-iluvatar-FlagOS

⚡ Performance on NVIDIA H20, V4-Flash FP8:
FlagGems C++ Wrapper + Triton: 70.7 tok/s
DeepSeek TileLang: 62.99 tok/s
That’s 12.24% faster.

👉 Try it here: https://github.com/flagos-ai/DeepSeek-V4-FlagOS

Open models should run on open infrastructure.
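
As a quick check on the throughput comparison in the post, the minimal Python sketch below recomputes the relative speedup from the two reported figures (70.7 tok/s and 62.99 tok/s). The function name `speedup_percent` is illustrative only and is not part of FlagOS, FlagGems, or any DeepSeek tooling.

```python
def speedup_percent(new_tok_s: float, baseline_tok_s: float) -> float:
    """Relative throughput gain of new_tok_s over baseline_tok_s, in percent."""
    if baseline_tok_s <= 0:
        raise ValueError("baseline throughput must be positive")
    return (new_tok_s - baseline_tok_s) / baseline_tok_s * 100.0


# Figures reported in the post for V4-Flash FP8 on NVIDIA H20.
flaggems_tok_s = 70.7    # FlagGems C++ Wrapper + Triton
tilelang_tok_s = 62.99   # DeepSeek TileLang

gain = speedup_percent(flaggems_tok_s, tilelang_tok_s)
print(f"{gain:.2f}% faster")
```

Rounded to two decimal places, the result agrees with the 12.24% figure quoted in the post. The gain is measured relative to the TileLang baseline, which is the usual convention for "X% faster" claims.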

authored a paper 8 months ago

FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions

Paper • 2509.17177 • Published Sep 21, 2025 • 13

authored a paper 11 months ago

CCI4.0: A Bilingual Pretraining Dataset for Enhancing Reasoning in Large Language Models

Paper • 2506.07463 • Published Jun 9, 2025 • 12

authored a paper over 1 year ago

Emu3: Next-Token Prediction is All You Need

Paper • 2409.18869 • Published Sep 27, 2024 • 99