Model Details
This is an INT4 model (group_size 64, symmetric quantization) of moonshotai/Kimi-K2-Instruct, generated with the intel/auto-round algorithm.
Please follow the license of the original model.
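For intuition, here is a minimal, self-contained sketch of what symmetric INT4 quantization with group_size 64 means for a weight tensor. This is illustrative round-to-nearest only, not the AutoRound algorithm itself (which tunes the rounding via signed gradient descent); the helper name and the [-8, 7] range convention are assumptions for the example.

```python
import torch

def quantize_int4_symmetric(w: torch.Tensor, group_size: int = 64):
    """Illustrative round-to-nearest symmetric INT4 quantization.

    Each contiguous group of `group_size` weights shares one scale.
    "Symmetric" means the zero-point is fixed at 0, so values map
    directly onto the signed 4-bit range [-8, 7] (one common convention).
    Assumes w.numel() is divisible by group_size.
    """
    orig_shape = w.shape
    groups = w.reshape(-1, group_size)                    # one row per group
    scales = groups.abs().amax(dim=1, keepdim=True) / 7   # max |w| maps to 7
    scales = scales.clamp(min=1e-12)                      # guard all-zero groups
    q = torch.clamp(torch.round(groups / scales), -8, 7)  # 4-bit integers
    dequant = (q * scales).reshape(orig_shape)            # reconstructed weights
    return q, scales, dequant

w = torch.randn(2, 128)
q, scales, w_hat = quantize_int4_symmetric(w)
print((w - w_hat).abs().max())  # per-group quantization error
```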
How To Use
Due to a kernel issue, this model can currently only run on CPU.
INT4 Inference (CPU)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Importing auto_round registers the AutoRound quantization format with transformers.
from auto_round import AutoRoundConfig

quantized_model_dir = "Intel/Kimi-K2-Instruct-int4-AutoRound-cpu"

model = AutoModelForCausalLM.from_pretrained(
    quantized_model_dir,
    torch_dtype=torch.bfloat16,
    device_map="cpu",
)
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, trust_remote_code=True)

prompts = [
    "9.11和9.8哪个数字大",
    "strawberry中有几个r?",
    "There is a girl who likes adventure,",
    "Please give a brief introduction of Moonshot AI",
]

# Build chat-formatted inputs for batched generation.
texts = []
for prompt in prompts:
    messages = [
        {"role": "system", "content": "You are Kimi, an AI assistant created by Moonshot AI."},
        {"role": "user", "content": [{"type": "text", "text": prompt}]},
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    texts.append(text)

inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)

outputs = model.generate(
    input_ids=inputs["input_ids"].to(model.device),
    attention_mask=inputs["attention_mask"].to(model.device),
    max_length=200,    # adjust to match the official usage
    num_return_sequences=1,
    do_sample=False,   # adjust to match the official usage
)

# Strip the prompt tokens so only the newly generated text is decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs["input_ids"], outputs)
]
decoded_outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

for i, prompt in enumerate(prompts):
    print(f"Prompt: {prompt}")
    print(f"Generated: {decoded_outputs[i]}")
    print("-" * 50)
"""
Prompt: 9.11和9.8哪个数字大
Generated: ### 第一步:理解题目
首先,我需要明确题目在问什么。题目问的是“9.11和9.8哪个数字大”,也就是比较这两个数字的大小。看起来这是一个简单的数值比较问题。
### 第二步:数字的表示
这两个数字都是小数:
- 9.11
- 9.8
### 第三步:对齐小数位
为了更清楚地比较,可以将它们的小数位对齐。9.8可以看作是9.80,因为:
- 9.8 = 9.80
这样:
- 9.11
- 9.80
### 第四步:逐位比较
从左到右逐位比较:
1. 整数部分:都是9,相同。
2. 小数部分:
- 第一位小数:1(来自
--------------------------------------------------
Prompt: strawberry中有几个r?
Generated: ### 问题重述
我们需要计算单词“strawberry”中有多少个字母“r”。
### 步骤分解
1. **理解题目**:明确要计算的是字母“r”的出现次数,不区分大小写(但“strawberry”全部是小写)。
2. **分解单词**:将“strawberry”拆分成单个字母,逐个检查。
3. **逐个检查**:从左到右或从右到左检查每个字母是否为“r”。
4. **计数**:每遇到一个“r”,计数器加1。
### 单词分解
“strawberry”可以分解为以下字母序列:
s, t, r, a, w, b, e, r, r, y
### 检查每个字母
让我们按顺序检查:
1.
--------------------------------------------------
Prompt: There is a girl who likes adventure,
Generated: There is a girl who likes adventure,
who keeps a compass in her coat pocket
and a map inked on the inside of her wrist.
She wakes before the sun,
laces boots that have crossed continents,
and steps out the door as if the day itself
were a question she is eager to answer.
She has learned the dialect of rivers,
can read the gossip of clouds,
and carries a stone from every mountain
she has ever kissed with her palms.
She speaks to strangers as if they were stories
she hasn’t finished reading,
and when she laughs,
it sounds like gravel crunching under tires—
a promise that she is going somewhere.
At night she writes by headlamp,
recording the way starlight tastes
on the tongue of the desert
--------------------------------------------------
Prompt: Please give a brief introduction of Moonshot AI
Generated: Moonshot AI is a cutting-edge artificial intelligence company based in China, established in 2023. It focuses on developing advanced large language models and innovative AI applications. The company is dedicated to pushing the boundaries of AI technology to create smarter, more efficient solutions for various industries.
--------------------------------------------------
"""
Generate the Model
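The recipe below was used to produce this model. It loads the BF16 checkpoint on CPU, builds a `device_map` that spreads each layer's routed experts across cuda:1 through cuda:4 in blocks of 96 expert indices (all other linear layers stay on cuda:0), and then runs AutoRound with group_size 64.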
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import transformers
from auto_round import AutoRound

model_name = "Kimi-K2-Instruct-BF16"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="cpu", torch_dtype="auto", trust_remote_code=True
)

# Assign the routed experts of every decoder layer to cuda:1-cuda:4
# (96 expert indices per GPU); all other linear layers go to cuda:0.
block = model.model.layers
device_map = {}
for n, m in block.named_modules():
    if isinstance(m, (torch.nn.Linear, transformers.modeling_utils.Conv1D)):
        if "experts" in n and ("shared_experts" not in n):
            expert_idx = int(n.split('.')[-2])
            if expert_idx < 96:
                device = "cuda:1"
            elif expert_idx < 192:
                device = "cuda:2"
            elif expert_idx < 288:
                device = "cuda:3"
            else:
                device = "cuda:4"
        else:
            device = "cuda:0"
        n = n[2:]  # drop the leading layer-index prefix so keys are block-relative
        device_map.update({n: device})

autoround = AutoRound(
    model=model, tokenizer=tokenizer, device_map=device_map, iters=50, lr=5e-3,
    nsamples=256, batch_size=4, low_gpu_mem_usage=True, seqlen=2048, device=0,
    group_size=64,
)
autoround.quantize_and_save(format="auto_round", output_dir="tmp_autoround")
```
Ethical Considerations and Limitations
The model can produce factually incorrect output and should not be relied on for factually accurate information. Because of the limitations of the pretrained model and the fine-tuning datasets, it is possible that this model could generate lewd, biased, or otherwise offensive outputs.
Therefore, before deploying any applications of the model, developers should perform safety testing.
Caveats and Recommendations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
Here is a useful link to learn more about Intel's AI software:
- Intel Neural Compressor: https://github.com/intel/neural-compressor
Disclaimer
The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.
Cite
```
@article{cheng2023optimize,
  title={Optimize weight rounding via signed gradient descent for the quantization of llms},
  author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal={arXiv preprint arXiv:2309.05516},
  year={2023}
}
```