# InCoder-32B-Thinking: Reasoning Code Model for Industrial Scenarios


## Model Summary

InCoder-32B-Thinking is the reasoning variant of the InCoder family. It extends InCoder-32B with chain-of-thought reasoning via `<think>...</think>` tags, enabling step-by-step problem decomposition before generating code. This is particularly effective for complex industrial tasks that require multi-step reasoning, such as debugging RTL modules, optimizing GPU kernels, or diagnosing embedded firmware issues.

For the instruction-tuned variant (without thinking), see IndustrialCoder. For the pre-trained base model, see IndustrialCoder-Base.


## Key Results

### General Code Benchmarks

| Benchmark | InCoder-32B | InCoder-32B-Thinking |
|---|---|---|
| HumanEval+ | 89.6 | 91.5 |
| MBPP+ | 78.3 | 80.1 |
| BigCodeBench (Full) | 49.8 | 51.2 |
| LiveCodeBench (Pass@1) | 49.14 | 52.3 |

### Industrial Code Benchmarks

| Benchmark | Domain | InCoder-32B | InCoder-32B-Thinking |
|---|---|---|---|
| VeriScope Score | Chip Design | 80.7 | 82.3 |
| CAD-Coder Compile (%) | 3D Modeling | 82.0 | 84.0 |
| KernelBench L1 (%) | GPU Optimization | 22.2 | 24.0 |

The thinking variant shows consistent improvements across both general and industrial benchmarks, with the largest gains on tasks requiring multi-step reasoning.


## Model Architecture

Same architecture as InCoder-32B, with thinking-aware post-training:

| Hyperparameter | Value |
|---|---|
| Parameters | ~32B |
| Layers | 64 |
| Hidden Size | 5,120 |
| Attention Heads | 40 (8 KV heads, GQA) |
| Max Context Length | 131,072 (128K) |
| Positional Encoding | RoPE (θ = 500,000) |
| Precision | BFloat16 |
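The RoPE base of θ = 500,000 in the table can be illustrated with a minimal, self-contained sketch of standard RoPE rotation frequencies. This is a generic illustration, not the model's actual implementation; the head dimension of 128 is derived from the table (5,120 hidden size / 40 heads):

```python
import math

def rope_frequencies(head_dim: int, theta: float = 500_000.0):
    """Inverse frequencies for standard RoPE: one per pair of dimensions."""
    return [theta ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def rotate_pair(x: float, y: float, pos: int, freq: float):
    """Rotate one (x, y) feature pair by pos * freq radians."""
    angle = pos * freq
    c, s = math.cos(angle), math.sin(angle)
    return x * c - y * s, x * s + y * c

# 5,120 hidden / 40 heads -> head_dim 128, i.e. 64 rotating pairs per head
freqs = rope_frequencies(128)

# A larger theta stretches the wavelengths of the slow-rotating pairs
# (the slowest pair completes one full cycle only after 2*pi*theta^(126/128)
# tokens), which is what makes a 128K context window representable.
```

Rotation is norm-preserving, so RoPE encodes position purely in the phase of each feature pair rather than in its magnitude.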

## How Thinking Mode Works

InCoder-32B-Thinking generates a reasoning trace inside `<think>...</think>` tags before producing the final answer. This allows the model to:

  1. Decompose complex problems into sub-tasks
  2. Reason about constraints, edge cases, and hardware semantics
  3. Plan the solution structure before writing code

Example output:

```text
<think>
The user wants a UART transmitter module. Let me think through the design:
1. Need a state machine: IDLE -> START_BIT -> DATA_BITS -> STOP_BIT
2. 8N1 means: 8 data bits, no parity, 1 stop bit
3. Need a baud rate counter derived from the clock frequency
4. Shift register to serialize the 8-bit data LSB first
</think>

module uart_tx (
    input wire clk,
    ...
```

You can disable thinking mode to get direct answers (behaves like the instruct variant):

```python
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=False
)
```

## Usage

### Installation

```shell
pip install transformers accelerate
```

### Thinking Mode (default)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "Multilingual-Multimodal-NLP/IndustrialCoder-Thinking"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "user", "content": "Optimize this CUDA kernel for better memory coalescing:\n__global__ void add(float *a, float *b, float *c, int N) {\n    int i = threadIdx.x;\n    if (i < N) c[i] = a[i] + b[i];\n}"}
]

# Thinking mode (default) — model reasons before answering
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# do_sample=True is required for temperature/top_p/top_k to take effect
with torch.no_grad():
    out = model.generate(
        **inputs, max_new_tokens=4096,
        do_sample=True, temperature=0.6, top_p=0.85, top_k=20,
    )

output = tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=False)

# Parse thinking and response
if "</think>" in output:
    thinking = output.split("</think>")[0].replace("<think>\n", "").strip()
    response = output.split("</think>")[1].strip()
    print(f"Thinking:\n{thinking}\n\nResponse:\n{response}")
else:
    print(output)
```
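The string handling above can be factored into a small helper that returns both parts and also copes with outputs where no trace was emitted (a sketch, independent of the model itself):

```python
def split_thinking(output: str):
    """Split a generation into (thinking, response).

    Returns ("", output) when no closing </think> tag is present.
    """
    marker = "</think>"
    if marker not in output:
        return "", output.strip()
    thinking, _, response = output.partition(marker)
    thinking = thinking.replace("<think>", "", 1).strip()
    return thinking, response.strip()

thinking, response = split_thinking("<think>\nplan steps\n</think>\nfinal code")
# thinking == "plan steps", response == "final code"
```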

### Non-Thinking Mode

```python
# Disable thinking — direct answer without reasoning trace
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=False
)
```

### With Tool Calls

```python
tools = [{
    "type": "function",
    "function": {
        "name": "run_verilog_sim",
        "description": "Run Verilog simulation with Icarus Verilog",
        "parameters": {
            "type": "object",
            "properties": {
                "code": {"type": "string", "description": "Verilog source code"},
                "testbench": {"type": "string", "description": "Testbench code"}
            }
        }
    }
}]

text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, tools=tools
)
```
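The card does not document the wire format the model uses to emit tool calls; that is defined by its chat template. Assuming a Hermes/Qwen-style `<tool_call>{JSON}</tool_call>` wrapper (an assumption to verify against the actual template), the response could be parsed like this:

```python
import json
import re

# ASSUMED format: <tool_call>{"name": ..., "arguments": {...}}</tool_call>.
# Check the model's chat template before relying on this pattern.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_calls(text: str):
    """Extract tool-call dicts from a model response (hypothetical format)."""
    return [json.loads(m) for m in TOOL_CALL_RE.findall(text)]

calls = parse_tool_calls(
    '<tool_call>{"name": "run_verilog_sim", '
    '"arguments": {"code": "module t; endmodule", "testbench": ""}}</tool_call>'
)
```

Each parsed call would then be dispatched to the matching function and its result appended to `messages` as a tool-role turn before re-prompting the model.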

### Deployment with vLLM

```shell
vllm serve Multilingual-Multimodal-NLP/IndustrialCoder-Thinking \
    --tensor-parallel-size 4 --max-model-len 32768 --trust-remote-code
```

### Recommended Sampling Parameters

| Use case | temperature | top_p | top_k | max_new_tokens |
|---|---|---|---|---|
| Thinking (default) | 0.6 | 0.85 | 20 | 8192 |
| Non-thinking / precise | 0.2 | 0.95 | — | 4096 |
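These settings can be kept as generation-kwarg presets for `model.generate` (values copied from the table; the preset names are ours, not part of the model card):

```python
# Presets mirroring the recommended sampling table above.
SAMPLING_PRESETS = {
    "thinking": {"temperature": 0.6, "top_p": 0.85, "top_k": 20, "max_new_tokens": 8192},
    "precise": {"temperature": 0.2, "top_p": 0.95, "max_new_tokens": 4096},  # top_k unset
}

def generation_kwargs(mode: str = "thinking") -> dict:
    """Return a fresh copy of the sampling preset for the given mode."""
    return dict(SAMPLING_PRESETS[mode])

# e.g. model.generate(**inputs, do_sample=True, **generation_kwargs("thinking"))
```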

## Model Family

| Model | Type | HuggingFace |
|---|---|---|
| InCoder-32B-Base | Pre-trained | 🤗 IndustrialCoder-Base |
| InCoder-32B | Instruct | 🤗 IndustrialCoder |
| InCoder-32B-Thinking | Reasoning | 🤗 IndustrialCoder-Thinking |
| InCoder-32B-FP8 | FP8 Quantized | 🤗 IndustrialCoder-32B-FP8 |
| InCoder-32B-AWQ-INT4 | AWQ INT4 | 🤗 IndustrialCoder-32B-AWQ-INT4 |
| InCoder-32B-GPTQ-INT4 | GPTQ INT4 | 🤗 IndustrialCoder-32B-GPTQ-INT4 |

## Limitations & Disclaimers

- The thinking trace may occasionally contain reasoning errors or hallucinated constraints; always verify the final code output.
- For simple tasks, thinking mode adds latency; use `enable_thinking=False` for straightforward generation.
- Based on failure analysis, the model may struggle with:
  - **API Knowledge:** linker errors from undefined HAL/CMSIS functions in embedded C.
  - **Functional Semantics:** producing compilable but functionally incorrect RTL under complex logic scenarios.
  - **Optimization:** correct but sub-optimal GPU kernel performance.

Always review and test generated code in a sandboxed environment. Industrial code (RTL, embedded firmware, GPU kernels) requires expert review before deployment.


## Citation

```bibtex
@article{yang2026incoder,
  title={InCoder-32B: Code Foundation Model for Industrial Scenarios},
  author={Yang, Jian and Zhang, Wei and Wu, Jiajun and Cheng, Junhang and Guo, Shawn
          and Wang, Haowen and Gu, Weicheng and Du, Yaxin and Li, Joseph and Xu, Fanglin
          and others},
  journal={arXiv preprint arXiv:2603.16790},
  year={2026}
}
```