|
--- |
|
license: apache-2.0 |
|
library_name: transformers |
|
tags: |
|
- code |
|
- jupyter |
|
- agent |
|
- data-science |
|
- qwen |
|
- instruct |
|
base_model: Qwen/Qwen3-4B-Instruct-2507 |
|
datasets: |
|
- jupyter-agent/jupyter-agent-dataset |
|
language: |
|
- en |
|
- code |
|
pipeline_tag: text-generation |
|
--- |
|
|
|
# Jupyter Agent Qwen3-4B Instruct |
|
|
|
|
|
|
**Jupyter Agent Qwen3-4B Instruct** is a fine-tuned version of [Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) specifically optimized for **data science agentic tasks** in Jupyter notebook environments. This model can execute Python code, analyze datasets, and provide clear reasoning to solve realistic data analysis problems. |
|
|
|
- **Model type:** Causal Language Model (Instruct) |
|
- **Language(s):** English, Python |
|
- **License:** Apache 2.0 |
|
- **Finetuned from:** [Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) |
|
|
|
## Key Features |
|
|
|
- **Jupyter-native agent** that lives inside notebook environments |
|
- **Code execution** with pandas, numpy, matplotlib, and other data science libraries |
|
- **Clear reasoning** with intermediate computations and explanations |
|
- **Dataset-grounded analysis** trained on real Kaggle notebook workflows |
|
- **Tool calling** for structured code execution and final answer generation |
|
|
|
## Performance |
|
|
|
On the [DABStep benchmark](https://huggingface.co/spaces/adyen/DABstep) for data science tasks: |
|
|
|
| Model | Easy Tasks | Hard Tasks |
|-------|------------|------------|
| Qwen3-4B-Instruct-2507 (Base) | 44.0% | 2.1% |
| **Jupyter Agent Qwen3-4B Instruct** | **70.8%** | **3.4%** |
|
|
|
This is **state-of-the-art performance** for small models on realistic data analysis tasks.
|
|
|
## Model Sources |
|
|
|
- **Repository:** [jupyter-agent](https://github.com/huggingface/jupyter-agent) |
|
- **Dataset:** [jupyter-agent-dataset](https://huggingface.co/datasets/jupyter-agent/jupyter-agent-dataset) |
|
- **Blog post:** [Jupyter Agents: training LLMs to reason with notebooks](https://huggingface.co/blog/jupyter-agent-2) |
|
- **Demo:** [Jupyter Agent 2](https://huggingface.co/spaces/lvwerra/jupyter-agent-2) |
|
|
|
## Usage |
|
|
|
### Basic Usage |
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "jupyter-agent/jupyter-agent-qwen3-4b-instruct"

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# Prepare input
prompt = "Analyze this sales dataset and find the top 3 performing products by revenue."
messages = [
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate response
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
```
|
|
|
### Decoding Response |
|
|
|
After generation, decode only the newly generated tokens to get the model's response:
|
|
|
```python
response = tokenizer.decode(output_ids, skip_special_tokens=True)
print("Response:", response)
```
|
|
|
### Agentic Usage with Tool Calling |
|
|
|
The model works best with proper scaffolding for tool calling: |
|
|
|
```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "execute_code",
            "description": "Execute Python code in a Jupyter environment",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {
                        "type": "string",
                        "description": "Python code to execute"
                    }
                },
                "required": ["code"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "final_answer",
            "description": "Provide the final answer to the question",
            "parameters": {
                "type": "object",
                "properties": {
                    "answer": {
                        "type": "string",
                        "description": "The final answer"
                    }
                },
                "required": ["answer"]
            }
        }
    }
]

# Include tools in the conversation
messages = [
    {
        "role": "system",
        "content": "You are a data science assistant. Use the available tools to analyze data and provide insights."
    },
    {"role": "user", "content": prompt}
]
```
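
The snippet above only defines the tool schema. A complete agent also needs a loop that passes the tools to the chat template, parses the model's tool calls, executes the requested code, and feeds the result back into the conversation. The sketch below is a minimal illustration of such a loop, not the official scaffolding from the [jupyter-agent](https://github.com/huggingface/jupyter-agent) repository: it assumes tool calls are emitted as Hermes-style `<tool_call>{...}</tool_call>` JSON blocks (the convention used by Qwen chat templates), and `execute_in_sandbox` is a hypothetical helper standing in for your own code executor.

```python
import json
import re

def run_agent(messages, tools, max_turns=10):
    """Minimal illustrative agent loop (sketch, not the official scaffolding)."""
    for _ in range(max_turns):
        # Render the conversation (including the tool schema) and generate a reply
        text = tokenizer.apply_chat_template(
            messages, tools=tools, tokenize=False, add_generation_prompt=True
        )
        inputs = tokenizer([text], return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=2048)
        reply = tokenizer.decode(
            out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
        )
        messages.append({"role": "assistant", "content": reply})

        # Assumption: tool calls appear as <tool_call>{...}</tool_call> JSON blocks
        # and these markers survive decoding
        match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", reply, re.DOTALL)
        if match is None:
            return reply  # plain-text answer, no tool call

        call = json.loads(match.group(1))
        if call["name"] == "final_answer":
            return call["arguments"]["answer"]

        # execute_in_sandbox is a hypothetical helper: run the code in an
        # isolated environment and return its output as a string
        result = execute_in_sandbox(call["arguments"]["code"])
        messages.append({"role": "tool", "content": str(result)})

    return None  # no final answer within max_turns
```

In practice, `execute_code` calls should be routed to a Jupyter kernel or an isolated sandbox such as [E2B](https://e2b.dev/) rather than executed in the host process.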
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
The model was fine-tuned on the [Jupyter Agent Dataset](https://huggingface.co/datasets/jupyter-agent/jupyter-agent-dataset), which contains: |
|
|
|
- **51,389 synthetic notebooks** (~0.2B tokens; the full dataset totals ~1B tokens)
|
- **Dataset-grounded QA pairs** from real Kaggle notebooks |
|
- **Executable reasoning traces** with intermediate computations |
|
- **High-quality educational content** filtered and scored by LLMs |
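
To inspect the training data yourself, it can be loaded with 🤗 Datasets. The snippet below is a minimal sketch; the split and column names are assumptions, so check the [dataset card](https://huggingface.co/datasets/jupyter-agent/jupyter-agent-dataset) for the exact schema.

```python
from datasets import load_dataset

# Stream to avoid downloading the full dataset up front;
# the "train" split is an assumption - see the dataset card for available configs.
ds = load_dataset("jupyter-agent/jupyter-agent-dataset", split="train", streaming=True)

example = next(iter(ds))
print(example.keys())  # inspect the available fields
```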
|
|
|
### Training Procedure |
|
|
|
- **Base Model:** Qwen3-4B-Instruct-2507 |
|
- **Training Method:** Full-parameter fine-tuning (not PEFT) |
|
- **Optimizer:** AdamW with cosine learning rate scheduling |
|
- **Learning Rate:** 5e-6 |
|
- **Epochs:** 5 (optimal based on ablation study) |
|
- **Context Length:** 32,768 tokens |
|
- **Batch Size:** Distributed across multiple GPUs |
|
- **Loss:** Assistant-only loss (`assistant_only_loss=True`)
|
- **Regularization:** NEFTune noise (α=7) for full-parameter training |
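
For reference, the recipe above maps naturally onto TRL's `SFTTrainer`. The configuration below is an approximate sketch under stated assumptions (argument names vary between TRL versions, and dataset preprocessing, DeepSpeed, and multi-node launch settings are omitted); the actual training scripts live in the [jupyter-agent](https://github.com/huggingface/jupyter-agent) repository.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Assumption: the "train" split is already in conversational (messages) format
train_dataset = load_dataset("jupyter-agent/jupyter-agent-dataset", split="train")

config = SFTConfig(
    output_dir="jupyter-agent-qwen3-4b-instruct",
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    num_train_epochs=5,
    max_length=32768,            # named max_seq_length in older TRL releases
    assistant_only_loss=True,    # compute loss on assistant turns only
    neftune_noise_alpha=7,       # NEFTune noise regularization
    bf16=True,
)

trainer = SFTTrainer(
    model="Qwen/Qwen3-4B-Instruct-2507",
    args=config,
    train_dataset=train_dataset,
)
trainer.train()
```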
|
|
|
### Training Infrastructure |
|
|
|
- **Framework:** [TRL](https://github.com/huggingface/trl) with [Transformers](https://github.com/huggingface/transformers) |
|
- **Distributed Training:** DeepSpeed ZeRO-2 across multiple nodes |
|
- **Hardware:** Multi-GPU setup with SLURM orchestration |
|
|
|
## Evaluation |
|
|
|
### Benchmark: DABStep |
|
|
|
The model was evaluated on [DABStep](https://huggingface.co/spaces/adyen/DABstep), a benchmark for data science agents with realistic tasks involving: |
|
|
|
- **Dataset analysis** with pandas and numpy |
|
- **Visualization** with matplotlib/seaborn |
|
- **Statistical analysis** and business insights |
|
- **Multi-step reasoning** with intermediate computations |
|
|
|
The model achieves **36.3% improvement** over the base model and **22.2% improvement** over scaffolding alone. |
|
|
|
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/jupyter-agent-2/training_dabstep_easy.png" alt="DABstep Easy Score"/> |
|
|
|
We can also see that the hard-task score increases as well, even though our dataset focuses on easier questions.
|
|
|
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/jupyter-agent-2/training_dabstep_hard.png" alt="DABstep Hard Score"/> |
|
|
|
## Limitations and Bias |
|
|
|
### Technical Limitations |
|
|
|
- **Context window:** Limited to 32K tokens; may struggle with very large notebooks
|
- **Tool calling format:** Requires specific scaffolding for optimal performance |
|
- **Dataset domains:** Primarily trained on Kaggle-style data science tasks |
|
- **Code execution:** Requires proper sandboxing for safe execution |
|
|
|
### Potential Biases |
|
|
|
- **Domain bias:** Trained primarily on Kaggle notebooks, may not generalize to all data science workflows |
|
- **Language bias:** Optimized for English and Python, limited multilingual support |
|
- **Task bias:** Focused on structured data analysis, may underperform on unstructured data tasks |
|
|
|
### Recommendations |
|
|
|
- Use in **sandboxed environments** like [E2B](https://e2b.dev/) for safe code execution (a minimal stand-in is sketched after this list)
|
- **Validate outputs** before using in production systems |
|
- **Review generated code** for security and correctness |
|
- Consider **domain adaptation** for specialized use cases |
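
As a concrete illustration of the first recommendation, the sketch below runs model-generated code in a separate process with a timeout. This is only a lightweight stand-in for demonstration: a subprocess is not a real security boundary, and untrusted code should run in an isolated sandbox such as [E2B](https://e2b.dev/) or a container.

```python
import subprocess
import sys
import tempfile

def run_generated_code(code: str, timeout: int = 30) -> str:
    """Run model-generated Python in a separate process with a timeout.

    NOTE: demo helper only - use a proper sandbox (E2B, Docker, etc.)
    for untrusted code.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(
        [sys.executable, path],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout if result.returncode == 0 else result.stderr
```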
|
|
|
## Ethical Considerations |
|
|
|
- **Code Safety:** Always execute generated code in secure, isolated environments |
|
- **Data Privacy:** Be cautious when analyzing sensitive datasets |
|
- **Verification:** Validate all analytical conclusions and insights |
|
- **Attribution:** Acknowledge model assistance in data analysis workflows |
|
|
|
## Citation |
|
|
|
```bibtex
@misc{jupyteragentqwen3instruct,
  title={Jupyter Agent Qwen3-4B Instruct},
  author={Baptiste Colle and Hanna Yukhymenko and Leandro von Werra},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/jupyter-agent/jupyter-agent-qwen3-4b-instruct}
}
```
|
|
|
## Related Work |
|
|
|
- **Dataset:** [jupyter-agent-dataset](https://huggingface.co/datasets/jupyter-agent/jupyter-agent-dataset) |
|
- **Thinking version:** [jupyter-agent-qwen3-4b-thinking](https://huggingface.co/jupyter-agent/jupyter-agent-qwen3-4b-thinking) |
|
- **Base model:** [Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) |
|
- **Benchmark:** [DABStep](https://huggingface.co/spaces/adyen/DABstep) |
|
|
|
*For more details, see our [blog post](https://huggingface.co/blog/jupyter-agent-2) and [GitHub repository](https://github.com/huggingface/jupyter-agent).* |