πŸ€— Hugging Face   |   πŸ€– ModelScope

Introduction

Today, Ling-flash-2.0 is officially open-sourced! πŸš€ Following the release of the language model Ling-mini-2.0 and the thinking model Ring-mini-2.0, we are now open-sourcing the third MoE LLM under the Ling 2.0 architecture: Ling-flash-2.0, a language model with 100B total parameters and 6.1B activated parameters (4.8B non-embedding). Trained on 20T+ tokens of high-quality data, together with supervised fine-tuning and multi-stage reinforcement learning, Ling-flash-2.0 achieves SOTA performance among dense models under 40B parameters, despite activating only ~6B parameters. Compared to MoE models with larger activation/total parameters, it also demonstrates strong competitiveness. Notably, it delivers outstanding performance in complex reasoning, code generation, and frontend development.

Powerful Complex Reasoning Abilities

We conducted a comprehensive evaluation of Ling-flash-2.0’s reasoning capabilities, reporting strong results on representative benchmarks:

  • Multi-disciplinary knowledge reasoning: GPQA-Diamond, MMLU-Pro
  • Advanced mathematical reasoning: AIME 2025, Omni-MATH, OptMATH (advanced mathematical optimization tasks)
  • Challenging code generation: LiveCodeBench v6, CodeForces-Elo
  • Logical reasoning: KOR-Bench, ARC-Prize
  • Key regulated industries (Finance, Healthcare): FinanceReasoning, HealthBench

Compared with dense models under 40B (e.g., Qwen3-32B-Non-Thinking, Seed-OSS-36B-Instruct (think budget=0)) and larger-activation/total-parameter MoE models (e.g., Hunyuan-A13B-Instruct, GPT-OSS-120B/low), Ling-flash-2.0 demonstrates stronger complex reasoning power. Moreover, it shows high competitiveness on creative tasks (Creative Writing v3).

Efficient Architecture, High-Speed Inference

Guided by Ling Scaling Laws, Ling 2.0 adopts a 1/32 activation-ratio MoE architecture, optimized across multiple design choices: expert granularity, shared-expert ratio, attention balance, aux-loss-free + sigmoid routing strategy, MTP layers, QK-Norm, Partial-RoPE, and more. These refinements enable small-activation MoE models to achieve 7Γ— efficiency gains over equivalent dense architectures. In other words, with just 6.1B activated parameters (4.8B non-embedding), Ling-flash-2.0 can match the performance of ~40B dense models. Thanks to its small activation size, it also delivers major inference speed advantages:

  • On H20 hardware, Ling-flash-2.0 achieves 200+ tokens/s, offering 3Γ— speedups compared to 36B dense models in everyday use.
  • With YaRN extrapolation, it supports 128K context length, and as output length grows, its relative speedup can reach 7Γ— or more.

Model Downloads

The table below lists the different stages of Ling-flash-2.0 models available for download. If you are located in mainland China, we also provide the models on ModelScope.cn to speed up the download process.

Model                | Context Length      | Download
Ling-flash-base-2.0  | 32K -> 128K (YaRN)  | πŸ€— HuggingFace Β· πŸ€– ModelScope
Ling-flash-2.0       | 32K -> 128K (YaRN)  | πŸ€— HuggingFace Β· πŸ€– ModelScope

Note: If you are interested in previous versions, please visit the past model collections on Hugging Face or ModelScope.

Quickstart

πŸ€— Hugging Face Transformers

Here is a code snippet to show you how to use the chat model with transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "inclusionAI/Ling-flash-2.0"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are Ling, an assistant created by inclusionAI"},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt", return_token_type_ids=False).to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
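
The decoded response can then be printed directly. If you prefer to see tokens as they are produced, Transformers' TextStreamer can optionally be passed to generate; a minimal sketch:

print(response)

# Optional: stream decoded tokens to stdout while generating.
from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(**model_inputs, max_new_tokens=512, streamer=streamer)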

πŸ€– ModelScope

If you're in mainland China, we strongly recommend using our model from πŸ€– ModelScope.
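
As a sketch, the weights can be fetched with ModelScope's snapshot_download and then loaded with the same Transformers code shown above (the model ID on ModelScope is assumed to mirror the Hugging Face one):

from modelscope import snapshot_download

# Download the weights from ModelScope and get the local directory path.
model_dir = snapshot_download("inclusionAI/Ling-flash-2.0")
# Pass model_dir to AutoModelForCausalLM.from_pretrained / AutoTokenizer.from_pretrained.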

Deployment

vLLM

vLLM supports both offline batched inference and launching an OpenAI-compatible API service for online inference.

Environment Preparation

Since the Pull Request (PR) has not been submitted to the vLLM community at this stage, please prepare the environment by following the steps below:

git clone -b v0.10.0 https://github.com/vllm-project/vllm.git
cd vllm
wget https://raw.githubusercontent.com/inclusionAI/Ling-V2/refs/heads/main/inference/vllm/bailing_moe_v2.patch
git apply bailing_moe_v2.patch
pip install -e .

Offline Inference:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

tokenizer = AutoTokenizer.from_pretrained("inclusionAI/Ling-flash-2.0")

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=16384)

llm = LLM(model="inclusionAI/Ling-flash-2.0", dtype='bfloat16')
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are Ling, an assistant created by inclusionAI"},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
outputs = llm.generate([text], sampling_params)
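
The generated text can then be read from the returned RequestOutput objects, for example:

# Each RequestOutput carries the prompt and its generated completions.
for output in outputs:
    print(output.outputs[0].text)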

Online Inference:

vllm serve inclusionAI/Ling-flash-2.0 \
              --tensor-parallel-size 2 \
              --pipeline-parallel-size 1 \
              --use-v2-block-manager \
              --gpu-memory-utilization 0.90
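
Once the server is up, it exposes an OpenAI-compatible API (port 8000 by default), so requests can be sent with any OpenAI-style client; for example, with curl (the prompt is just an illustration):

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "inclusionAI/Ling-flash-2.0", "messages": [{"role": "user", "content": "Give me a short introduction to large language models."}]}'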

To handle long context in vLLM using YaRN, we need to follow these two steps:

  1. Add a rope_scaling field to the model's config.json file, for example:
{
  ...,
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}
  2. Use an additional parameter --max-model-len to specify the desired maximum context length when starting the vLLM service, as in the example below.
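
For example, with the 4Γ— YaRN factor above (32,768 Γ— 4 = 131,072 tokens), the service could be launched roughly as follows; the parallelism settings are simply carried over from the earlier command:

vllm serve inclusionAI/Ling-flash-2.0 \
              --tensor-parallel-size 2 \
              --max-model-len 131072 \
              --gpu-memory-utilization 0.90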

For detailed guidance, please refer to the vLLM instructions.

SGLang

Environment Preparation

We will submit our model to the official SGLang release later; for now, the environment can be prepared with the following steps:

pip3 install sglang==0.5.2rc0 sgl-kernel==0.3.7.post1

You can use docker image as well:

docker pull lmsysorg/sglang:v0.5.2rc0-cu126

Then apply our patch to the SGLang installation:

# the `patch` command is required; run `yum install -y patch` to install it if missing
patch -d `python -c 'import sglang;import os; print(os.path.dirname(sglang.__file__))'` -p3 < inference/sglang/bailing_moe_v2.patch

Run Inference

Both BF16 and FP8 models are supported by SGLang now, depending on the dtype of the model in ${MODEL_PATH}. They share the same launch command:

  • Start server:
python -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --host 0.0.0.0 --port $PORT \
    --trust-remote-code \
    --attention-backend fa3

MTP is supported for the base model, but not yet for the chat model. To enable it, add the parameter --speculative-algorithm NEXTN to the start command.

  • Client:
curl -s http://localhost:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'

More usage examples can be found here.

Finetuning

We recommend using Llama-Factory to finetune Ling-flash-2.0.
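
As a rough sketch, a fine-tuning run is typically launched through LLaMA-Factory's CLI with a YAML config whose model_name_or_path points at the Ling-flash-2.0 weights (the config file name below is hypothetical; see the LLaMA-Factory documentation for the available options):

# ling_flash_sft.yaml is a user-written LLaMA-Factory training config (hypothetical name).
llamafactory-cli train ling_flash_sft.yaml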

License

This code repository is licensed under the MIT License.
