---
license: mit
pipeline_tag: text-generation
library_name: transformers
---

🤗 Hugging Face   |   🤖 ModelScope    |   🐙 Experience Now

Introduction

Ling-1T is the first flagship non-thinking model in the Ling 2.0 series, featuring 1 trillion total parameters with ≈ 50 billion active parameters per token. Built on the Ling 2.0 architecture, Ling-1T is designed to push the limits of efficient reasoning and scalable cognition.

Pre-trained on 20 trillion+ high-quality, reasoning-dense tokens, Ling-1T-base supports up to 128 K context length and adopts an evolutionary chain-of-thought (Evo-CoT) process across mid-training and post-training. This curriculum greatly enhances the model’s efficiency and reasoning depth, allowing Ling-1T to achieve state-of-the-art performance on multiple complex reasoning benchmarks—balancing accuracy and efficiency.

Flagship-Level Efficient Reasoning

We comprehensively evaluated Ling-1T against leading flagship models, including both open-source giants (e.g., DeepSeek-V3.1-Terminus, Kimi-K2-Instruct-0905) and closed-source APIs (GPT-5-main, Gemini-2.5-Pro). Across code generation, software development, competition-level mathematics, professional math, and logical reasoning, Ling-1T consistently demonstrates superior complex reasoning ability and overall advantage.

In the AIME 25 benchmark, Ling-1T extends the Pareto frontier of reasoning accuracy vs. reasoning length, showcasing its strength in “efficient thinking and precise reasoning.”

Aesthetic Understanding and Front-End Generation

Ling-1T excels in visual reasoning and front-end code generation tasks, combining deep semantic understanding with precise code synthesis. We introduce a hybrid Syntax–Function–Aesthetics reward mechanism, enabling the model to not only generate correct and functional code but also demonstrate a refined sense of visual aesthetics. On ArtifactsBench, Ling-1T ranks first among open-source models, and the benchmark visualizations in this card were, in fact, generated by Ling-1T itself.

Emergent Intelligence at Trillion-Scale

Scaling to the trillion-parameter level has revealed strong emergent reasoning and transfer capabilities. For example, in the BFCL V3 tool-use benchmark, Ling-1T achieves ≈ 70 % tool-call accuracy with only light instruction tuning—despite having seen no large-scale trajectory data during training. Ling-1T can:

  • Interpret complex natural-language instructions
  • Transform abstract logic into functional visual components
  • Generate cross-platform compatible front-end code
  • Create stylistically controlled marketing copy and multi-lingual text

These capabilities form the foundation for general, collaborative human–AI intelligence, which we aim to advance together with the open-source community through Ling-1T’s release.

Pre-Training at Trillion Scale

The Ling 2.0 architecture was designed from the ground up for trillion-scale efficiency, guided by the Ling Scaling Law (arXiv:2507.17702). This ensures architectural and hyperparameter scalability even under 10²⁵–10²⁶ FLOPs of compute.

Key architectural innovations include:

  • 1 T total / 50 B active parameters with a 1/32 MoE activation ratio
  • MTP layers for enhanced compositional reasoning
  • Aux-loss-free, sigmoid-scoring expert routing with zero-mean updates (see the sketch after this list)
  • QK Normalization for fully stable convergence
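As a rough illustration of the routing design listed above, here is a minimal sketch of sigmoid-scored, aux-loss-free expert routing. Names, shapes, and the bias-update rule are assumptions based on the description, not the released implementation:

import torch

def route_tokens(hidden, gate_weight, expert_bias, top_k=8):
    # hidden: [tokens, d_model]; gate_weight: [n_experts, d_model]; expert_bias: [n_experts]
    scores = torch.sigmoid(hidden @ gate_weight.T)         # sigmoid scoring instead of softmax
    biased = scores + expert_bias                          # bias influences expert selection only
    topk_idx = biased.topk(top_k, dim=-1).indices          # pick top-k experts per token
    weights = scores.gather(-1, topk_idx)                  # combine with the unbiased scores
    weights = weights / weights.sum(dim=-1, keepdim=True)  # normalize the mixture weights
    return topk_idx, weights

def update_expert_bias(expert_bias, topk_idx, n_experts, step_size=1e-3):
    # Aux-loss-free balancing: shift the selection bias toward under-used experts.
    # The update sums to zero across experts, so the overall bias level is unchanged.
    load = torch.bincount(topk_idx.flatten(), minlength=n_experts).float()
    return expert_bias + step_size * (load.mean() - load)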

Ling-1T is the largest FP8-trained foundation model known to date. FP8 mixed-precision training yields 15 %+ end-to-end speedup, improved memory efficiency, and maintains ≤ 0.1 % loss deviation from BF16 across 1 T tokens. A fine-grained, heterogeneous 1F1B interleaved pipeline further boosts utilization by 40 %+. System-level optimizations—fused kernels, communication scheduling, recomputation, checkpointing, simulation, and telemetry—ensure stable trillion-scale training.
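As a toy illustration of the loss-tracking claim (not the actual training telemetry), the reported figure corresponds to a running relative deviation between the FP8 and BF16 loss curves:

def max_relative_loss_deviation(fp8_losses, bf16_losses):
    # Largest relative gap between the two loss curves at matching steps.
    return max(abs(f - b) / b for f, b in zip(fp8_losses, bf16_losses))

# e.g. max_relative_loss_deviation([2.013, 1.752], [2.011, 1.751]) ≈ 0.001, i.e. 0.1%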

Pre-training used over 20 T high-quality tokens, with > 40 % reasoning-dense data in later stages. Mid-training introduced curated chain-of-thought corpora for “reasoning pre-activation”, improving downstream reasoning stability. A custom WSM (Warmup–Stable–Merge) LR scheduler with mid-train checkpoint merging simulates LR decay and boosts generalization.
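A minimal sketch of the WSM idea, based only on the description above (names and the uniform merge rule are assumptions, not the training code): the learning rate warms up and then stays flat, and periodically merged checkpoints stand in for an explicit decay phase.

def wsm_lr(step, peak_lr=3e-4, warmup_steps=2000):
    # Warmup then a stable plateau; no explicit decay phase.
    return peak_lr * min(1.0, step / warmup_steps)

def merge_checkpoints(state_dicts):
    # Uniform average of several stable-phase checkpoints; the merged weights
    # behave roughly as if the learning rate had been decayed.
    return {k: sum(sd[k] for sd in state_dicts) / len(state_dicts) for k in state_dicts[0]}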

Post-Training and Evo-CoT Optimization

Built upon mid-training reasoning activation, post-training adopts Evo-CoT (Evolutionary Chain-of-Thought) for progressive reasoning enhancement under controllable cost. This approach continually expands the Pareto frontier of reasoning accuracy vs. efficiency—ideal for reflexive non-thinking models.

For reinforcement learning, we introduce LPO (Linguistics-Unit Policy Optimization), a novel sentence-level policy optimization method. Unlike GRPO (token-level) or GSPO (sequence-level) algorithms, LPO treats sentences as the natural semantic action units, enabling precise alignment between rewards and reasoning behavior. Empirically, LPO offers superior training stability and generalization across reasoning tasks.
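A conceptual sketch of sentence-level credit assignment in the spirit of LPO (illustrative only; the splitter and the advantage rule are assumptions, not the published algorithm): every token inside a sentence receives that sentence's advantage, instead of a per-token (GRPO) or whole-sequence (GSPO) signal.

import re

def sentence_spans(text):
    # Hypothetical splitter: a sentence ends at ., ! or ? followed by whitespace.
    spans, start = [], 0
    for m in re.finditer(r"[.!?](?:\s+|$)", text):
        spans.append((start, m.end()))
        start = m.end()
    if start < len(text):
        spans.append((start, len(text)))
    return spans

def sentence_advantages(text, sentence_rewards, baseline):
    # One advantage per sentence, later broadcast to all tokens inside that sentence.
    return [(span, r - baseline) for span, r in zip(sentence_spans(text), sentence_rewards)]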

Evaluation

Ling-1T has been extensively evaluated across knowledge, code, math, reasoning, agent, and alignment benchmarks. It currently stands as the best open-source flagship non-thinking model, rivaling closed-source APIs in complex reasoning while maintaining exceptional efficiency and interpretability.


| Task | Benchmark | DeepSeek-V3.1-Terminus | Kimi-K2-Instruct-0905 | GPT-5-main (NonThinking) | Gemini 2.5 Pro (thinkBudget=128) | Ling-1T |
| --- | --- | --- | --- | --- | --- | --- |
| Knowledge (Professional Knowledge) | C-Eval | 91.76 | 91.12 | 83.59 | 88.77 | 92.19 |
| | MMLU-Redux (EM) | 92.37 | 91.58 | 92.75 | 94.67 | 92.25 |
| | MMLU-Pro | 83.25 | 81.03 | 81.94 | 82.13 | 82.04 |
| Knowledge (STEM) | MMLU-Pro-Stem | 87.91 | 85.30 | 73.45 | 88.60 | 88.5 |
| | OlympiadBench-stem | 87.83 | 79.13 | 78.26 | 89.57 | 91.3 |
| | GPQA-Diamond | 76.23 | 73.93 | 71.31 | 71.81 | 72.98 |
| Coding (Code Generation) | MultiPL-E | 77.68 | 73.76 | 76.66 | 71.48 | 77.91 |
| | MBPP | 90.69 | 89.96 | 91.72 | 91.01 | 96.87 |
| | LiveCodeBench (2408-2505) | 48.02 | 48.95 | 48.57 | 45.43 | 61.68 |
| | CodeForces-rating | 1582 | 1574 | 1120 | 1675 | 1901 |
| | BIRD_SQL | 44.88 | 46.45 | 43.97 | 54.76 | 52.38 |
| Coding (Software Development) | ArtifactsBench | 43.29 | 44.87 | 41.04 | 60.28 | 59.31 |
| | FullStack Bench | 55.48 | 54.00 | 50.92 | 48.19 | 56.55 |
| | Aider | 88.16 | 85.34 | 84.40 | 89.85 | 83.65 |
| Math (Competition Math) | CNMO 2024 | 73.78 | 68.92 | 63.11 | 74.65 | 79.25 |
| | AIME 2025 | 55.21 | 50.16 | 59.43 | 70.10 | 70.42 |
| | UGMathBench | 72.70 | 69.97 | 67.27 | 70.10 | 74.95 |
| | Omni-Math | 64.77 | 62.42 | 61.09 | 72.02 | 74.46 |
| Math (Professional Math) | FinanceReasoning | 86.44 | 84.83 | 86.28 | 86.65 | 87.45 |
| | Optibench | 64.30 | 60.83 | 40.06 | 68.76 | 74.71 |
| | OptMATH | 35.99 | 35.84 | 39.16 | 42.77 | 57.68 |
| General Reasoning | BBEH | 42.86 | 34.83 | 39.75 | 29.08 | 47.34 |
| | KOR-Bench | 73.76 | 73.20 | 70.56 | 59.68 | 76.00 |
| | ARC-AGI-1 | 14.69 | 22.19 | 14.06 | 18.94 | 43.81 |
| | ZebraLogic | 81.6 | 85.5 | 57.3 | 70.2 | 90.8 |
| Agent | BFCL-V3 | 52.67 | 71.05 | 50.27 | 63.31 | 69.64 |
| Alignment | Arena Hard V2 ELO | 54.09 | 76.95 | 68.37 | 65.37 | 76.26 |
| | Arena Hard V2 Win Rate | 63.24 | 69.88 | 65.06 | 74.46 | 75.83 |
| | writing_bench | 80.95 | 87.59 | 77.07 | 80.53 | 89.4 |
| | Creative Writing v3 | 85.18 | 87.01 | 80.93 | 84.99 | 89.24 |
| | MultiChallenge | 42.49 | 48.72 | 48.72 | 51.28 | 58.24 |

Model Downloads

You can download Ling-1T from the following table. If you are located in mainland China, we also provide the model on ModelScope.cn to speed up the download process.

| Model | Context Length | Download |
| --- | --- | --- |
| Ling-1T | 32K -> 128K (YaRN) | 🤗 HuggingFace · 🤖 ModelScope |
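For example, the weights can be fetched ahead of time with the Hugging Face CLI (the target directory below is just an illustration):

huggingface-cli download inclusionAI/Ling-1T --local-dir ./Ling-1T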

Note: If you are interested in previous versions, please visit the past model collections on Hugging Face or ModelScope.

Quickstart

🚀 Try Online

You can experience Ling-1T online at: ZenMux

🔌 API Usage

You can also use Ling-1T through API calls:

from openai import OpenAI

# 1. Initialize the OpenAI client
client = OpenAI(
    # 2. Point the base URL to the ZenMux endpoint
    base_url="https://zenmux.ai/api/v1",
    # 3. Replace with the API Key from your ZenMux user console
    api_key="<your ZENMUX_API_KEY>",
)

# 4. Make a request
completion = client.chat.completions.create(
    # 5. Specify the model to use in the format "provider/model-name"
    model="inclusionai/ling-1t",
    messages=[
        {
            "role": "user",
            "content": "What is the meaning of life?"
        }
    ]
)

print(completion.choices[0].message.content)

🤗 Hugging Face Transformers

Here is a code snippet to show you how to use the chat model with transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "inclusionAI/Ling-1T"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are Ling, an assistant created by inclusionAI"},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt", return_token_type_ids=False).to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

🤖 ModelScope

If you're in mainland China, we strongly recommend using our model from 🤖 ModelScope.
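For instance, the checkpoint can be pulled with the ModelScope Python SDK (assuming the modelscope package is installed; the cache directory is illustrative):

from modelscope import snapshot_download

# Downloads the checkpoint and returns the local directory it was stored in.
model_dir = snapshot_download("inclusionAI/Ling-1T", cache_dir="./Ling-1T")
print(model_dir)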

Deployment

vLLM

vLLM supports both offline batched inference and launching an OpenAI-compatible API service for online inference.

Environment Preparation

pip install vllm==0.11.0

Offline Inference:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

tokenizer = AutoTokenizer.from_pretrained("inclusionAI/Ling-1T")

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=16384)

llm = LLM(model="inclusionAI/Ling-1T", dtype='bfloat16', trust_remote_code=True)
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are Ling, an assistant created by inclusionAI"},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
outputs = llm.generate([text], sampling_params)
print(outputs[0].outputs[0].text)

Online Inference:

vllm serve inclusionAI/Ling-1T \
              --tensor-parallel-size 2 \
              --pipeline-parallel-size 1 \
              --trust-remote-code \
              --gpu-memory-utilization 0.90

# This is only an example, please adjust the model sharding strategy according to your actual environment.
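Once the server is up, it exposes an OpenAI-compatible endpoint that can be queried with any compatible client, for example (assuming the default port 8000):

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "inclusionAI/Ling-1T", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'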

To handle long context in vLLM using YaRN, we need to follow these two steps:

  1. Add a rope_scaling field to the model's config.json file, for example:
{
  ...,
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}
  2. Use an additional parameter --max-model-len to specify the desired maximum context length when starting the vLLM service, for example:
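With the factor shown above, the extended window is 4.0 × 32768 = 131072 tokens, so the serve command would look like this (illustrative only; adjust the sharding strategy to your hardware):

vllm serve inclusionAI/Ling-1T \
              --tensor-parallel-size 2 \
              --trust-remote-code \
              --max-model-len 131072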

For detailed guidance, please refer to the vLLM instructions.

SGLang

Environment Preparation

We plan to submit our model to the official SGLang release later; for now, prepare the environment with the following steps:

pip3 install sglang==0.5.2rc0 sgl-kernel==0.3.7.post1

You can use the Docker image as well:

docker pull lmsysorg/sglang:v0.5.2rc0-cu126

Then apply our patch to the SGLang installation:

# patch command is needed, run `yum install -y patch` if needed
patch -d `python -c 'import sglang;import os; print(os.path.dirname(sglang.__file__))'` -p3 < inference/sglang/bailing_moe_v2.patch

Run Inference

SGLang now supports both BF16 and FP8 models; which one runs depends on the dtype of the checkpoint in ${MODEL_PATH}. Both share the same command below:

  • Start server:
python -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --host 0.0.0.0 --port $PORT \
    --trust-remote-code \
    --attention-backend fa3

# This is only an example, please adjust the model sharding strategy according to your actual environment.

MTP is supported for the base model, but not yet for the chat model. You can add the parameter --speculative-algorithm NEXTN to the start command, for example:
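python -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --host 0.0.0.0 --port $PORT \
    --trust-remote-code \
    --attention-backend fa3 \
    --speculative-algorithm NEXTN

# Base model only; otherwise identical to the start command above.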

  • Client:
curl -s http://localhost:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'

More usage can be found here

Limitations & Future Plans

While Ling-1T has made strong progress in efficient reasoning, cross-domain generalization, and training efficiency, several limitations remain:

  • GQA-based attention: stable for long-context reasoning but relatively costly. Future versions will adopt hybrid attention to improve efficiency.
  • Limited agentic ability: the current model has room to grow in multi-turn interaction, long-term memory, and tool use.
  • Instruction and identity issues: occasional deviations or role confusion may occur; future updates will enhance alignment and consistency.

Ling-1T will continue to evolve in architecture, reasoning, and alignment, advancing the series toward more general intelligence.

License

This code repository is licensed under the MIT License.