# InCoder-32B-Thinking: Reasoning Code Model for Industrial Scenarios


## Model Summary

InCoder-32B-Thinking is the reasoning variant of the InCoder family. It extends InCoder-32B with chain-of-thought reasoning via `<think>...</think>` tags, enabling step-by-step problem decomposition before generating code. This is particularly effective for complex industrial tasks that require multi-step reasoning, such as debugging RTL modules, optimizing GPU kernels, or diagnosing embedded firmware issues.

For the instruction-tuned variant (without thinking), see IndustrialCoder. For the pre-trained base model, see IndustrialCoder-Base.


## Key Results

### General Code Benchmarks

| Benchmark | InCoder-32B | InCoder-32B-Thinking |
|---|---|---|
| HumanEval+ | 89.6 | 91.5 |
| MBPP+ | 78.3 | 80.1 |
| BigCodeBench (Full) | 49.8 | 51.2 |
| LiveCodeBench (Pass@1) | 49.14 | 52.3 |

### Industrial Code Benchmarks

| Benchmark | Domain | InCoder-32B | InCoder-32B-Thinking |
|---|---|---|---|
| VeriScope Score | Chip Design | 80.7 | 82.3 |
| CAD-Coder Compile (%) | 3D Modeling | 82.0 | 84.0 |
| KernelBench L1 (%) | GPU Optimization | 22.2 | 24.0 |

The thinking variant shows consistent improvements across both general and industrial benchmarks, with the largest gains on tasks requiring multi-step reasoning.


## Model Architecture

Same architecture as InCoder-32B, with thinking-aware post-training:

| Hyperparameter | Value |
|---|---|
| Parameters | ~32B |
| Layers | 64 |
| Hidden Size | 5,120 |
| Attention Heads | 40 (8 KV heads, GQA) |
| Max Context Length | 131,072 (128K) |
| Positional Encoding | RoPE (θ = 500,000) |
| Precision | BFloat16 |
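The RoPE base of θ = 500,000 in the table can be illustrated with a minimal, self-contained sketch of standard RoPE rotation frequencies. This is a generic illustration, not the model's actual implementation; the head dimension of 128 is derived from the table (5,120 hidden size / 40 heads):

```python
import math

def rope_frequencies(head_dim: int, theta: float = 500_000.0):
    """Inverse frequencies for standard RoPE: one per pair of dimensions."""
    return [theta ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def rotate_pair(x: float, y: float, pos: int, freq: float):
    """Rotate one (x, y) feature pair by pos * freq radians."""
    angle = pos * freq
    c, s = math.cos(angle), math.sin(angle)
    return x * c - y * s, x * s + y * c

# 5,120 hidden / 40 heads -> head_dim 128, i.e. 64 rotating pairs per head
freqs = rope_frequencies(128)

# A larger theta stretches the wavelengths of the slow-rotating pairs
# (the slowest pair completes one full cycle only after 2*pi*theta^(126/128)
# tokens), which is what makes a 128K context window representable.
```

Rotation is norm-preserving, so RoPE encodes position purely in the phase of each feature pair rather than in its magnitude.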

## How Thinking Mode Works

InCoder-32B-Thinking generates a reasoning trace inside `<think>...</think>` tags before producing the final answer. This allows the model to:

  1. Decompose complex problems into sub-tasks
  2. Reason about constraints, edge cases, and hardware semantics
  3. Plan the solution structure before writing code

Example output:

```text
<think>
The user wants a UART transmitter module. Let me think through the design:
1. Need a state machine: IDLE -> START_BIT -> DATA_BITS -> STOP_BIT
2. 8N1 means: 8 data bits, no parity, 1 stop bit
3. Need a baud rate counter derived from the clock frequency
4. Shift register to serialize the 8-bit data LSB first
</think>

module uart_tx (
    input wire clk,
    ...
```

You can disable thinking mode to get direct answers (behaves like the instruct variant):

```python
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=False
)
```

## Usage

### Installation

```shell
pip install transformers accelerate
```

### Thinking Mode (default)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "Multilingual-Multimodal-NLP/IndustrialCoder-Thinking"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "user", "content": "Optimize this CUDA kernel for better memory coalescing:\n__global__ void add(float *a, float *b, float *c, int N) {\n    int i = threadIdx.x;\n    if (i < N) c[i] = a[i] + b[i];\n}"}
]

# Thinking mode (default) — model reasons before answering
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# do_sample=True is required for temperature/top_p/top_k to take effect
with torch.no_grad():
    out = model.generate(
        **inputs, max_new_tokens=4096,
        do_sample=True, temperature=0.6, top_p=0.85, top_k=20,
    )

output = tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=False)

# Parse thinking and response
if "</think>" in output:
    thinking = output.split("</think>")[0].replace("<think>\n", "").strip()
    response = output.split("</think>")[1].strip()
    print(f"Thinking:\n{thinking}\n\nResponse:\n{response}")
else:
    print(output)
```
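The string handling above can be factored into a small helper that returns both parts and also copes with outputs where no trace was emitted (a sketch, independent of the model itself):

```python
def split_thinking(output: str):
    """Split a generation into (thinking, response).

    Returns ("", output) when no closing </think> tag is present.
    """
    marker = "</think>"
    if marker not in output:
        return "", output.strip()
    thinking, _, response = output.partition(marker)
    thinking = thinking.replace("<think>", "", 1).strip()
    return thinking, response.strip()

thinking, response = split_thinking("<think>\nplan steps\n</think>\nfinal code")
# thinking == "plan steps", response == "final code"
```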

### Non-Thinking Mode

```python
# Disable thinking — direct answer without reasoning trace
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=False
)
```

### With Tool Calls

```python
tools = [{
    "type": "function",
    "function": {
        "name": "run_verilog_sim",
        "description": "Run Verilog simulation with Icarus Verilog",
        "parameters": {
            "type": "object",
            "properties": {
                "code": {"type": "string", "description": "Verilog source code"},
                "testbench": {"type": "string", "description": "Testbench code"}
            }
        }
    }
}]

text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, tools=tools
)
```
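The card does not document the wire format the model uses to emit tool calls; that is defined by its chat template. Assuming a Hermes/Qwen-style `<tool_call>{JSON}</tool_call>` wrapper (an assumption to verify against the actual template), the response could be parsed like this:

```python
import json
import re

# ASSUMED format: <tool_call>{"name": ..., "arguments": {...}}</tool_call>.
# Check the model's chat template before relying on this pattern.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_calls(text: str):
    """Extract tool-call dicts from a model response (hypothetical format)."""
    return [json.loads(m) for m in TOOL_CALL_RE.findall(text)]

calls = parse_tool_calls(
    '<tool_call>{"name": "run_verilog_sim", '
    '"arguments": {"code": "module t; endmodule", "testbench": ""}}</tool_call>'
)
```

Each parsed call would then be dispatched to the matching function and its result appended to `messages` as a tool-role turn before re-prompting the model.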

### Deployment with vLLM

```shell
vllm serve Multilingual-Multimodal-NLP/IndustrialCoder-Thinking \
    --tensor-parallel-size 4 --max-model-len 32768 --trust-remote-code
```

### Recommended Sampling Parameters

| Use case | temperature | top_p | top_k | max_new_tokens |
|---|---|---|---|---|
| Thinking (default) | 0.6 | 0.85 | 20 | 8192 |
| Non-thinking / precise | 0.2 | 0.95 | — | 4096 |
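These settings can be kept as generation-kwarg presets for `model.generate` (values copied from the table; the preset names are ours, not part of the model card):

```python
# Presets mirroring the recommended sampling table above.
SAMPLING_PRESETS = {
    "thinking": {"temperature": 0.6, "top_p": 0.85, "top_k": 20, "max_new_tokens": 8192},
    "precise": {"temperature": 0.2, "top_p": 0.95, "max_new_tokens": 4096},  # top_k unset
}

def generation_kwargs(mode: str = "thinking") -> dict:
    """Return a fresh copy of the sampling preset for the given mode."""
    return dict(SAMPLING_PRESETS[mode])

# e.g. model.generate(**inputs, do_sample=True, **generation_kwargs("thinking"))
```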

## Model Family

| Model | Type | HuggingFace |
|---|---|---|
| InCoder-32B-Base | Pre-trained | 🤗 IndustrialCoder-Base |
| InCoder-32B | Instruct | 🤗 IndustrialCoder |
| InCoder-32B-Thinking | Reasoning | 🤗 IndustrialCoder-Thinking |
| InCoder-32B-FP8 | FP8 Quantized | 🤗 IndustrialCoder-32B-FP8 |
| InCoder-32B-AWQ-INT4 | AWQ INT4 | 🤗 IndustrialCoder-32B-AWQ-INT4 |
| InCoder-32B-GPTQ-INT4 | GPTQ INT4 | 🤗 IndustrialCoder-32B-GPTQ-INT4 |

## Limitations & Disclaimers

- The thinking trace may occasionally contain reasoning errors or hallucinated constraints; always verify the final code output.
- For simple tasks, thinking mode adds latency; use `enable_thinking=False` for straightforward generation.
- Based on failure analysis, the model may struggle with:
  - **API Knowledge:** linker errors from undefined HAL/CMSIS functions in embedded C.
  - **Functional Semantics:** producing compilable but functionally incorrect RTL under complex logic scenarios.
  - **Optimization:** correct but sub-optimal GPU kernel performance.

Always review and test generated code in a sandboxed environment. Industrial code (RTL, embedded firmware, GPU kernels) requires expert review before deployment.


## Citation

```bibtex
@article{yang2026incoder,
  title={InCoder-32B: Code Foundation Model for Industrial Scenarios},
  author={Yang, Jian and Zhang, Wei and Wu, Jiajun and Cheng, Junhang and Guo, Shawn
          and Wang, Haowen and Gu, Weicheng and Du, Yaxin and Li, Joseph and Xu, Fanglin
          and others},
  journal={arXiv preprint arXiv:2603.16790},
  year={2026}
}
```