|
--- |
|
license: apache-2.0 |
|
library_name: transformers |
|
tags: |
|
- code |
|
- jupyter |
|
- agent |
|
- data-science |
|
- qwen |
|
- instruct |
|
base_model: Qwen/Qwen3-4B-Instruct-2507 |
|
datasets: |
|
- jupyter-agent/jupyter-agent-dataset |
|
language: |
|
- en |
|
- code |
|
pipeline_tag: text-generation |
|
--- |
|
|
|
# Jupyter Agent Qwen3-4B Instruct |
|
|
|
|
|
|
**Jupyter Agent Qwen3-4B Instruct** is a fine-tuned version of [Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) specifically optimized for **data science agentic tasks** in Jupyter notebook environments. This model can execute Python code, analyze datasets, and provide clear reasoning to solve realistic data analysis problems. |
|
|
|
- **Model type:** Causal Language Model (Instruct) |
|
- **Language(s):** English, Python |
|
- **License:** Apache 2.0 |
|
- **Finetuned from:** [Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) |
|
|
|
## Key Features |
|
|
|
- **Jupyter-native agent** that lives inside notebook environments |
|
- **Code execution** with pandas, numpy, matplotlib, and other data science libraries |
|
- **Clear reasoning** with intermediate computations and explanations |
|
- **Dataset-grounded analysis** trained on real Kaggle notebook workflows |
|
- **Tool calling** for structured code execution and final answer generation |
|
|
|
## Performance |
|
|
|
On the [DABStep benchmark](https://huggingface.co/spaces/adyen/DABstep) for data science tasks: |
|
|
|
| Model | Easy Tasks | Hard Tasks |
|-------|------------|------------|
| Qwen3-4B-Instruct-2507 (Base) | 44.0% | 2.1% |
| **Jupyter Agent Qwen3-4B Instruct** | **70.8%** | **3.4%** |
|
|
|
This is **state-of-the-art performance** for small models on realistic data analysis tasks.
|
|
|
## Model Sources |
|
|
|
- **Repository:** [jupyter-agent](https://github.com/huggingface/jupyter-agent) |
|
- **Dataset:** [jupyter-agent-dataset](https://huggingface.co/datasets/jupyter-agent/jupyter-agent-dataset) |
|
- **Blog post:** [Jupyter Agents: training LLMs to reason with notebooks](https://huggingface.co/blog/jupyter-agent-2) |
|
- **Demo:** [Jupyter Agent 2](https://huggingface.co/spaces/lvwerra/jupyter-agent-2) |
|
|
|
## Usage |
|
|
|
### Basic Usage |
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "jupyter-agent/jupyter-agent-qwen3-4b-instruct"

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# Prepare input
prompt = "Analyze this sales dataset and find the top 3 performing products by revenue."
messages = [
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate response
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
```
|
|
|
### Decoding Response |
|
|
|
After generation, decode only the newly generated tokens to get the model's response:
|
|
|
```python
response = tokenizer.decode(output_ids, skip_special_tokens=True)
print("Response:", response)
```
|
|
|
### Agentic Usage with Tool Calling |
|
|
|
The model works best with proper scaffolding for tool calling: |
|
|
|
```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "execute_code",
            "description": "Execute Python code in a Jupyter environment",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {
                        "type": "string",
                        "description": "Python code to execute"
                    }
                },
                "required": ["code"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "final_answer",
            "description": "Provide the final answer to the question",
            "parameters": {
                "type": "object",
                "properties": {
                    "answer": {
                        "type": "string",
                        "description": "The final answer"
                    }
                },
                "required": ["answer"]
            }
        }
    }
]

# Include tools in the conversation
messages = [
    {
        "role": "system",
        "content": "You are a data science assistant. Use the available tools to analyze data and provide insights."
    },
    {"role": "user", "content": prompt}
]
```
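
The snippet above only defines the tool schema. A complete agent also needs a loop that passes the tools to the chat template, parses the model's tool calls, executes the requested code, and feeds the result back into the conversation. The sketch below is a minimal illustration of such a loop, not the official scaffolding from the [jupyter-agent](https://github.com/huggingface/jupyter-agent) repository: it assumes tool calls are emitted as Hermes-style `<tool_call>{...}</tool_call>` JSON blocks (the convention used by Qwen chat templates), and `execute_in_sandbox` is a hypothetical helper standing in for your own code executor.

```python
import json
import re

def run_agent(messages, tools, max_turns=10):
    """Minimal illustrative agent loop (sketch, not the official scaffolding)."""
    for _ in range(max_turns):
        # Render the conversation (including the tool schema) and generate a reply
        text = tokenizer.apply_chat_template(
            messages, tools=tools, tokenize=False, add_generation_prompt=True
        )
        inputs = tokenizer([text], return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=2048)
        reply = tokenizer.decode(
            out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
        )
        messages.append({"role": "assistant", "content": reply})

        # Assumption: tool calls appear as <tool_call>{...}</tool_call> JSON blocks
        # and these markers survive decoding
        match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", reply, re.DOTALL)
        if match is None:
            return reply  # plain-text answer, no tool call

        call = json.loads(match.group(1))
        if call["name"] == "final_answer":
            return call["arguments"]["answer"]

        # execute_in_sandbox is a hypothetical helper: run the code in an
        # isolated environment and return its output as a string
        result = execute_in_sandbox(call["arguments"]["code"])
        messages.append({"role": "tool", "content": str(result)})

    return None  # no final answer within max_turns
```

In practice, `execute_code` calls should be routed to a Jupyter kernel or an isolated sandbox such as [E2B](https://e2b.dev/) rather than executed in the host process.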
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
The model was fine-tuned on the [Jupyter Agent Dataset](https://huggingface.co/datasets/jupyter-agent/jupyter-agent-dataset), which contains: |
|
|
|
- **51,389 synthetic notebooks** (~0.2B tokens; the full dataset totals ~1B tokens)
|
- **Dataset-grounded QA pairs** from real Kaggle notebooks |
|
- **Executable reasoning traces** with intermediate computations |
|
- **High-quality educational content** filtered and scored by LLMs |
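
To inspect the training data yourself, it can be loaded with 🤗 Datasets. The snippet below is a minimal sketch; the split and column names are assumptions, so check the [dataset card](https://huggingface.co/datasets/jupyter-agent/jupyter-agent-dataset) for the exact schema.

```python
from datasets import load_dataset

# Stream to avoid downloading the full dataset up front;
# the "train" split is an assumption - see the dataset card for available configs.
ds = load_dataset("jupyter-agent/jupyter-agent-dataset", split="train", streaming=True)

example = next(iter(ds))
print(example.keys())  # inspect the available fields
```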
|
|
|
### Training Procedure |
|
|
|
- **Base Model:** Qwen3-4B-Instruct-2507 |
|
- **Training Method:** Full-parameter fine-tuning (not PEFT) |
|
- **Optimizer:** AdamW with cosine learning rate scheduling |
|
- **Learning Rate:** 5e-6 |
|
- **Epochs:** 5 (optimal based on ablation study) |
|
- **Context Length:** 32,768 tokens |
|
- **Batch Size:** Distributed across multiple GPUs |
|
- **Loss:** Assistant-only loss (`assistant_only_loss=True`)
|
- **Regularization:** NEFTune noise (α=7) for full-parameter training |
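
For reference, the recipe above maps naturally onto TRL's `SFTTrainer`. The configuration below is an approximate sketch under stated assumptions (argument names vary between TRL versions, and dataset preprocessing, DeepSpeed, and multi-node launch settings are omitted); the actual training scripts live in the [jupyter-agent](https://github.com/huggingface/jupyter-agent) repository.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Assumption: the "train" split is already in conversational (messages) format
train_dataset = load_dataset("jupyter-agent/jupyter-agent-dataset", split="train")

config = SFTConfig(
    output_dir="jupyter-agent-qwen3-4b-instruct",
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    num_train_epochs=5,
    max_length=32768,            # named max_seq_length in older TRL releases
    assistant_only_loss=True,    # compute loss on assistant turns only
    neftune_noise_alpha=7,       # NEFTune noise regularization
    bf16=True,
)

trainer = SFTTrainer(
    model="Qwen/Qwen3-4B-Instruct-2507",
    args=config,
    train_dataset=train_dataset,
)
trainer.train()
```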
|
|
|
### Training Infrastructure |
|
|
|
- **Framework:** [TRL](https://github.com/huggingface/trl) with [Transformers](https://github.com/huggingface/transformers) |
|
- **Distributed Training:** DeepSpeed ZeRO-2 across multiple nodes |
|
- **Hardware:** Multi-GPU setup with SLURM orchestration |
|
|
|
## Evaluation |
|
|
|
### Benchmark: DABStep |
|
|
|
The model was evaluated on [DABStep](https://huggingface.co/spaces/adyen/DABstep), a benchmark for data science agents with realistic tasks involving: |
|
|
|
- **Dataset analysis** with pandas and numpy |
|
- **Visualization** with matplotlib/seaborn |
|
- **Statistical analysis** and business insights |
|
- **Multi-step reasoning** with intermediate computations |
|
|
|
The model achieves **36.3% improvement** over the base model and **22.2% improvement** over scaffolding alone. |
|
|
|
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/jupyter-agent-2/training_dabstep_easy.png" alt="DABstep Easy Score"/> |
|
|
|
We can also see that the hard-task score increases as well, even though our dataset focuses on easier questions.
|
|
|
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/jupyter-agent-2/training_dabstep_hard.png" alt="DABstep Hard Score"/> |
|
|
|
## Limitations and Bias |
|
|
|
### Technical Limitations |
|
|
|
- **Context window:** Limited to 32K tokens; may struggle with very large notebooks
|
- **Tool calling format:** Requires specific scaffolding for optimal performance |
|
- **Dataset domains:** Primarily trained on Kaggle-style data science tasks |
|
- **Code execution:** Requires proper sandboxing for safe execution |
|
|
|
### Potential Biases |
|
|
|
- **Domain bias:** Trained primarily on Kaggle notebooks, may not generalize to all data science workflows |
|
- **Language bias:** Optimized for English and Python, limited multilingual support |
|
- **Task bias:** Focused on structured data analysis, may underperform on unstructured data tasks |
|
|
|
### Recommendations |
|
|
|
- Use in **sandboxed environments** like [E2B](https://e2b.dev/) for safe code execution (a minimal stand-in is sketched after this list)
|
- **Validate outputs** before using in production systems |
|
- **Review generated code** for security and correctness |
|
- Consider **domain adaptation** for specialized use cases |
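
As a concrete illustration of the first recommendation, the sketch below runs model-generated code in a separate process with a timeout. This is only a lightweight stand-in for demonstration: a subprocess is not a real security boundary, and untrusted code should run in an isolated sandbox such as [E2B](https://e2b.dev/) or a container.

```python
import subprocess
import sys
import tempfile

def run_generated_code(code: str, timeout: int = 30) -> str:
    """Run model-generated Python in a separate process with a timeout.

    NOTE: demo helper only - use a proper sandbox (E2B, Docker, etc.)
    for untrusted code.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(
        [sys.executable, path],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout if result.returncode == 0 else result.stderr
```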
|
|
|
## Ethical Considerations |
|
|
|
- **Code Safety:** Always execute generated code in secure, isolated environments |
|
- **Data Privacy:** Be cautious when analyzing sensitive datasets |
|
- **Verification:** Validate all analytical conclusions and insights |
|
- **Attribution:** Acknowledge model assistance in data analysis workflows |
|
|
|
## Citation |
|
|
|
```bibtex
@misc{jupyteragentqwen3instruct,
  title={Jupyter Agent Qwen3-4B Instruct},
  author={Baptiste Colle and Hanna Yukhymenko and Leandro von Werra},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/jupyter-agent/jupyter-agent-qwen3-4b-instruct}
}
```
|
|
|
## Related Work |
|
|
|
- **Dataset:** [jupyter-agent-dataset](https://huggingface.co/datasets/jupyter-agent/jupyter-agent-dataset) |
|
- **Thinking version:** [jupyter-agent-qwen3-4b-thinking](https://huggingface.co/jupyter-agent/jupyter-agent-qwen3-4b-thinking) |
|
- **Base model:** [Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) |
|
- **Benchmark:** [DABStep](https://huggingface.co/spaces/adyen/DABstep) |
|
|
|
*For more details, see our [blog post](https://huggingface.co/blog/jupyter-agent-2) and [GitHub repository](https://github.com/huggingface/jupyter-agent).* |