weiranyao committed
Commit
d433b83
·
verified ·
1 Parent(s): 3986381

Update README.md

Files changed (1)
  1. README.md +161 -82
README.md CHANGED
@@ -4,106 +4,185 @@ language:
  - en
  pipeline_tag: text-generation
  tags:
- - diffusion
- - text generation
  - code generation
  ---
- # CoDA-v0-Base

- ## Overview 🎯
- CoDA is Salesforce AI Research's open diffusion language model.

- [Technical Report](https://github.com/SalesforceAIResearch/CoDA/blob/main/technical_report.pdf)

- [Code](https://github.com/SalesforceAIResearch/CoDA/)

- The code repository contains a unified training pipeline from pre-training through post-training, evaluation harnesses, and a simple FastAPI-based serving backend.

- ## Requirements 📦
- ```
- torch==2.8.0
- transformers>=4.47.1
- flash-attn==2.8.3
- ```

- ## Quickstart 🚀
- Here is a code snippet that loads the model and tokenizer and runs unmasking on a partially finished piece of code.
  ```python
- import torch
- from transformers import AutoModel, AutoTokenizer
-
- model_name = "Salesforce/CoDA-v0-Base"
- device = "cuda"
-
- model = AutoModel.from_pretrained(model_name, torch_dtype=torch.bfloat16, trust_remote_code=True).to(device)
- tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
- model.eval()
-
- prompt = """```python
- from typing import List
-
- class Solution:
-     def twoSum(self, nums: List[int], target: int) -> List[int]:
-         # Create a dictionary to store the numbers and their indices
-         num_to_index = {}
-
-         # Iterate over the list of numbers
-         for index, num in enumerate(nums):
-             # Calculate the complement
-             complement = target - num
-
-             # Check if the complement is already in the dictionary
-             if complement in num_to_index:
-                 # If found, return the indices of the complement and the current number
-                 return [num_to_index[complement], index]
-
-             # Otherwise, add the current number and its index to the dictionary
-             num_to_index[num] = index
- ```"""
- input_ids = tokenizer.encode(prompt, return_tensors="pt")
- mask = torch.rand(input_ids.shape) < 0.4
- masked_input_ids = input_ids.clone()
- masked_input_ids[mask] = tokenizer.mask_token_id
- generated_ids = model.diffusion_generate(
-     inputs=masked_input_ids.to(model.device),
-     max_new_tokens=1,
-     steps=128,
-     top_p=0.95,
-     temperature=0.2,
-     alg="entropy",
-     alg_temp=0.2,
  )
- generated_ids = [
-     output_ids[:-1] for output_ids in generated_ids
- ]
- unmasked_output = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
  ```

- ## Benchmark 📊
- Comparison of code-generation performance across standard and plus-enhanced benchmarks. EvalPlus is computed as the mean pass@1 on the enhanced variants. Bold marks results where CoDA produces the strongest diffusion-model performance.

- | Model | HumanEval Instruct | HumanEval Plus | MBPP Instruct | MBPP Plus | EvalPlus |
- | --- | --- | --- | --- | --- | --- |
- | CoDA-Base | 29.3 | 23.8 | 35.2 | 46.0 | 34.9 |
- | CoDA-Instruct | 54.3 | 47.6 | 47.2 | **63.2** | **55.4** |
- | Dream-Base | 56.7 | 50.0 | 68.7 | 57.4 | 53.7 |
- | Dream-7B-Instruct | 57.9 | 53.7 | 68.3 | 56.1 | 54.9 |
- | LLaDA-8B-Instruct | 35.4 | 31.7 | 31.5 | 28.6 | 30.2 |
- | Qwen3-1.7B | 66.5 | 61.6 | 46.2 | 65.9 | 63.8 |
- | Qwen2.5-Coder-1.5B | 43.9 | 36.6 | 69.2 | 58.6 | 47.6 |
- | Qwen2.5-Coder-1.5B-Instruct | 70.7 | 66.5 | 69.2 | 59.4 | 62.3 |
- | Gemma-3-1B-it | 39.6 | 35.4 | 39.4 | 63.5 | 49.5 |
- | LLaMA-3.2-1B-Instruct | 35.4 | 31.1 | 24.4 | 53.7 | 42.4 |

- ## Deployment 🛠️
- Check out our [Deployment Guide](https://github.com/SalesforceAIResearch/CoDA?tab=readme-ov-file#deployment-guide-%EF%B8%8F)!

- ## Citation 📚
  ```
- coming soon
- ```

  - en
  pipeline_tag: text-generation
  tags:
+ - text diffusion model
+ - language model
  - code generation
  ---
+ # CoDA: Coding LM via Diffusion Adaptation
+
+ **CoDA-1.7B** is a lightweight diffusion language model for code generation developed by Salesforce AI Research. Unlike traditional autoregressive models, CoDA leverages discrete diffusion processes to enable bidirectional context understanding and efficient code completion.
+
+ - 📄 [Technical Report](https://github.com/SalesforceAIResearch/CoDA/blob/main/technical_report.pdf)
+ - 💻 [Code Repository](https://github.com/SalesforceAIResearch/CoDA/)
+
+ ## 📊 Model Details
+
+ - **Model Size**: 1.7B parameters
+ - **Architecture**: Diffusion-based language model
+ - **Training**: TPU-based pre-training with GPU fine-tuning
+ - **Primary Use**: Code generation and completion tasks
+
+ ## ✨ Key Features
+
+ - **Bidirectional Context**: Diffusion modeling enables understanding of both past and future tokens
+ - **Confidence-Guided Sampling**: Maintains competitive inference latency by committing only the most confident predictions at each denoising step (see the sketch below this list)
+ - **Lightweight Design**: Achieves strong performance with fewer parameters than comparable models
+ - **Open Training Pipeline**: Fully reproducible training from pre-training to fine-tuning
+
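+ To make the confidence-guided sampling idea concrete, here is a minimal sketch of a decoding loop of this kind: every target position starts as a mask token, the model scores all positions in parallel, and only the most confident predictions are committed at each step. The helper names (`confidence_guided_unmask`, `MASK_ID`, the toy `fake_logits`) are illustrative assumptions, not part of the released CoDA API.
+
+ ```python
+ import torch
+
+ MASK_ID = -1  # sentinel id marking still-masked positions in this toy sketch
+
+ def confidence_guided_unmask(logits_fn, ids, steps=8):
+     """Iteratively fill mask tokens, committing the highest-confidence
+     predictions first (a generic sketch, not CoDA's exact scheduler)."""
+     ids = ids.clone()
+     for _ in range(steps):
+         masked = (ids == MASK_ID).nonzero(as_tuple=True)[0]
+         if masked.numel() == 0:
+             break
+         probs = logits_fn(ids).softmax(-1)       # [seq_len, vocab]
+         conf, pred = probs[masked].max(-1)       # best guess per masked slot
+         k = max(1, ids.numel() // steps)         # commit a few slots per step
+         keep = conf.topk(min(k, masked.numel())).indices
+         ids[masked[keep]] = pred[keep]           # keep only the confident tokens
+     return ids
+
+ # Toy usage with a random "model" so the sketch runs stand-alone.
+ vocab, seq_len = 50, 16
+ fake_logits = lambda ids: torch.randn(ids.shape[0], vocab)
+ print(confidence_guided_unmask(fake_logits, torch.full((seq_len,), MASK_ID, dtype=torch.long)))
+ ```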
 
+ ## 📈 Performance
+
+ CoDA-1.7B-Instruct demonstrates competitive performance on standard code generation benchmarks:
+
+ | Model | HumanEval | HumanEval+ | MBPP | MBPP+ | EvalPlus |
+ |-------|-----------|------------|------|-------|----------|
+ | **CoDA-Base** | 29.3 | 23.8 | 35.2 | 46.0 | 34.9 |
+ | **CoDA-Instruct** | **54.3** | **47.6** | 47.2 | **63.2** | **55.4** |
+ | Dream-Base | 56.7 | 50.0 | 68.7 | 57.4 | 53.7 |
+ | Dream-7B-Instruct | 57.9 | 53.7 | 68.3 | 56.1 | 54.9 |
+ | LLaDA-8B-Instruct | 35.4 | 31.7 | 31.5 | 28.6 | 30.2 |
+
+ **🎯 Key Finding**: CoDA-1.7B-Instruct matches or surpasses diffusion models up to 7B parameters while maintaining significantly lower computational requirements. CoDA offers an advantageous balance between inference speed and accuracy compared to larger diffusion models.
+
+ ## 🎓 Training Methodology
+
+ CoDA employs a three-stage training process:
+
+ *Three-stage training: (1) Pre-training with bidirectional masking, (2) Post-training with instruction format, (3) Inference with progressive denoising.*
+
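+ As a rough illustration of stage (1), the snippet below shows a generic masked-denoising objective: a random fraction of tokens is replaced by a mask id and the model is trained to recover only those positions using full bidirectional attention. The mask-ratio schedule, the `masked_denoising_loss` helper, and the toy stand-in model are assumptions for illustration, not CoDA's exact recipe.
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def masked_denoising_loss(model, input_ids, mask_id, min_ratio=0.1, max_ratio=0.9):
+     """One pre-training step: corrupt a random fraction of tokens with the
+     mask id, then predict the original tokens at the corrupted positions."""
+     ratio = torch.empty(input_ids.shape[0], 1).uniform_(min_ratio, max_ratio)
+     noise_mask = torch.rand_like(input_ids, dtype=torch.float) < ratio
+     corrupted = torch.where(noise_mask, torch.full_like(input_ids, mask_id), input_ids)
+     logits = model(corrupted)                     # [batch, seq, vocab], no causal mask
+     return F.cross_entropy(logits[noise_mask], input_ids[noise_mask])
+
+ # Toy check with an embedding + linear head standing in for the transformer.
+ toy = torch.nn.Sequential(torch.nn.Embedding(101, 32), torch.nn.Linear(32, 101))
+ print(masked_denoising_loss(toy, torch.randint(0, 100, (2, 16)), mask_id=100))
+ ```
+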
+ ## 🛠️ Usage
+
+ ### 🚀 Quick Start
+
  ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ model_name = "Salesforce/CoDA-v0-Instruct"
+ tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
+
+ # Generate code
+ prompt = "Write a Python function to calculate Fibonacci numbers"
+ inputs = tokenizer(prompt, return_tensors="pt")
+ outputs = model.generate(
+     **inputs,
+     max_tokens=256,
+     diffusion_steps=128,
+     temperature=0.0
  )
+ print(tokenizer.decode(outputs[0]))
+ ```
+
+ ### 🚀 Deployment
+
+ For production deployment, we provide a FastAPI serving backend with OpenAI-compatible APIs:
+
+ ```bash
+ # Clone the repository
+ git clone https://github.com/SalesforceAIResearch/CoDA
+ cd CoDA
+
+ # Set up environment
+ python3 -m venv .venv
+ source .venv/bin/activate
+ pip install -r serving/requirements.txt
+
+ # Export your Hugging Face token
+ export HF_TOKEN="hf_..."
+
+ # Start the server
+ bash serving/fast-api/start_server.sh
  ```

+ The server will listen on `http://localhost:8000`.
+
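+ Because the server follows the OpenAI API convention, any OpenAI-compatible client should work against it. Below is a minimal sketch using `requests` against the standard `/v1/chat/completions` route; the exact route and response fields are assumptions based on that convention, so check `serving/fast-api` in the repository for the precise schema.
+
+ ```python
+ import requests
+
+ # Assumes the FastAPI server above exposes the standard OpenAI chat-completions route.
+ resp = requests.post(
+     "http://localhost:8000/v1/chat/completions",
+     json={
+         "model": "Salesforce/CoDA-v0-Instruct",
+         "messages": [
+             {"role": "user", "content": "Write a Python function that reverses a string."}
+         ],
+         "max_tokens": 256,
+         "temperature": 0.7,
+     },
+     timeout=120,
+ )
+ print(resp.json()["choices"][0]["message"]["content"])
+ ```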
 
+ ### 💬 Interactive CLI
+
+ ```bash
+ python serving/fast-api/chat_cli.py \
+     --base-url http://localhost:8000 \
+     --model Salesforce/CoDA-v0-Instruct \
+     --stream \
+     --show-meta
+ ```
+
+ ### ⚙️ Generation Hyperparameters
+
+ Customize generation behavior with environment variables:
+
+ ```bash
+ export MAX_TOKENS=512      # Maximum tokens to generate
+ export TEMPERATURE=0.7     # Sampling temperature
+ export TOP_P=0.9           # Nucleus sampling threshold
+ export STEPS=128           # Number of diffusion steps
+ export ALG="entropy"       # Sampling algorithm
+ export ALG_TEMP=0.1        # Algorithm temperature
+ export BLOCK_LENGTH=32     # Block size for processing
  ```
+
+ **Recommended Settings**:
+ - **Fast inference**: `STEPS=64`, `TEMPERATURE=0.0`
+ - **Quality generation**: `STEPS=128`, `TEMPERATURE=0.7`, `TOP_P=0.9`
+ - **High quality**: `STEPS=256`, `TEMPERATURE=0.5`, `TOP_P=0.95`
+
+ ## 🔧 Training from Scratch
+
+ The complete training pipeline is available in our [repository](https://github.com/SalesforceAIResearch/CoDA):
+
+ ```bash
+ # Clone the repository
+ git clone https://github.com/SalesforceAIResearch/CoDA
+ cd CoDA
+ ```
+
+ ### 🧠 Pre-training on TPU
+ ```bash
+ # Configure TPU environment
+ cd pre-train
+ cp env.example .env  # Add your TPU metadata
+ bash setup_tpu.sh
+
+ # Launch pre-training
+ bash recipes/midtrain_v4_512.sh
+ ```
+
+ ### 🎯 Supervised Fine-tuning
+ ```bash
+ # Set up fine-tuning environment
+ cd post-train/LLaMA-Factory
+ pip install -r requirements.txt
+
+ # Configure dataset and run fine-tuning
+ bash ../../run_sft.sh
+ ```
+
+ ### 📊 Evaluation
+ ```bash
+ cd evaluation/lm_eval
+ bash eval_mbpp_humaneval.sh
+ ```
+
+ ## 📚 Citation
+
+ Technical report coming soon. For now, please cite:
+
+ ```bibtex
+ @misc{coda2025,
+   title={CoDA: Coding LM via Diffusion Adaptation},
+   author={Chen, Haolin and Wang, Shiyu and Qin, Can and Pang, Bo and Liu, Zuxin and Qiu, Jielin and Zhang, Jianguo and Zhou, Yingbo and Chen, Zeyuan and Xu, Ran and Heinecke, Shelby and Savarese, Silvio and Xiong, Caiming and Wang, Huan and Yao, Weiran},
+   year={2025},
+   publisher={Salesforce AI Research}
+ }
+ ```
+
+ ## 🔗 Resources
+
+ - 📄 **Technical Report**: [technical_report.pdf](https://github.com/SalesforceAIResearch/CoDA/blob/main/technical_report.pdf)
+ - 💻 **Code Repository**: [github.com/SalesforceAIResearch/CoDA](https://github.com/SalesforceAIResearch/CoDA)
+ - 🤗 **Model Hub**: [Salesforce CoDA collection](https://huggingface.co/collections/Salesforce/coda-68d627d87921c0e28a69e340)
+
+ ## 🙏 Acknowledgements
+
+ We thank Lingpeng Kong for insightful discussions and Jialei Chen for technical support with TPU infrastructure.
+
+ ---
+
+ *🏢 Developed by Salesforce AI Research*