---
license: other
tags:
- mcp
- tools quality
- tool quality selection
- tool selection
- TSQ
- TUQ
- sequence-classification
- tool-evaluation
- function-call
- limbic
- tool use
- tool quality
pipeline_tag: text-classification
language:
- en
base_model:
- Qwen/Qwen3-0.6B
---
## 🧠 Model Description
**mcp-tool-use-quality-ranger-0.6b** is a fine-tuned sequence classification model built to **evaluate the quality of function calls** in conversational AI systems, designed for the **Model Context Protocol (MCP) tools** setting.
It assesses whether a function call is correct, uses the wrong tool, has incorrect parameter names, or has incorrect parameter values.
**Max Context Length:** **32,768 Tokens**
<img src="Ranger_img.png" width="600px"/>
It determines if a given function call:
- Selects the correct tool
- Has correct parameter names and structure
- Contains correct parameter values
It produces one of four possible classification labels, illustrated after the table:
| Label | Meaning |
|-------|---------|
| **VALID_CALL** | βœ… The tool name, parameters, and values are all correct, or no suitable tool exists and no function call is made. |
| **TOOL_ERROR** | ❌ The tool name does not exist or does not match the user intent. |
| **PARAM_NAME_ERROR** | ❌ The correct tool is used, but parameter names are missing, extra, or incorrect. |
| **PARAM_VALUE_ERROR** | ❌ Tool and parameter names are correct, but parameter values are wrong or incorrectly formatted. |
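For intuition, the sketch below maps a few invented calls to each label; the `get_weather` tool and all calls are hypothetical, not taken from the benchmark.

```python
# Hypothetical tool for illustration: get_weather(city: str), required: ["city"]
# VALID_CALL:        get_weather(city="Paris")        -> correct tool, names, and values
# TOOL_ERROR:        get_stock_price(ticker="AAPL")   -> no such tool / wrong intent
# PARAM_NAME_ERROR:  get_weather(location="Paris")    -> right tool, wrong parameter name
# PARAM_VALUE_ERROR: get_weather(city=12345)          -> right tool and name, wrong value
```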
---
## πŸ“Š Benchmark Evaluation
The **mcp-tool-use-quality-ranger-0.6b** was evaluated in a binary classification setting,
where a prediction counts as **Correct** if the function call evaluation matches the gold label, and **Incorrect** otherwise.
| Model | #Params | Avg. Latency (sec) | Avg. Binary Accuracy | [Qualifire mcp-tool-use-quality Benchmark](https://huggingface.co/datasets/qualifire/mcp-tool-use-quality-benchmark) Binary Accuracy | [Limbic Benchmark](https://huggingface.co/datasets/quotientai/limbic-eval-tool-use-mcp) Binary Accuracy |
|-------|---------|--------------------|----------------------|------------------------------------------------------|------------------------------------------------------|
| qualifire/mcp-tool-use-quality-ranger-4b [private] | 4B | 0.30 | 0.978 | 0.997 | 0.960 |
| **qualifire/mcp-tool-use-quality-ranger-0.6b** | **0.6B** | **0.09** | **0.958** | **0.993** | **0.924** |
| gemini-2.5-flash | - | 4.87 | 0.890 | 0.936 | 0.845 |
| quotientai/limbic-tool-use-0.5B-32K | 0.5B | 0.79 | 0.818 | 0.749 | 0.887 |
### πŸ“Œ Metrics Definitions
- **Avg. Binary Accuracy** – Mean accuracy across all evaluated benchmarks,
  where predictions are mapped to binary outcomes as follows (see the sketch after this list):
- **Qualifire TUQ Benchmark**
- **Correct** β†’ `VALID_CALL`
- **Incorrect** β†’ `TOOL_ERROR`, `PARAM_NAME_ERROR` or `PARAM_VALUE_ERROR`
- **Limbic Benchmark**
- **Correct** β†’ `correct`
- **Incorrect** β†’ `incorrect_tool`, `incorrect_parameter_names` or `incorrect_parameter_values`
- **Qualifire TUQ Benchmark**: [Qualifire Tool Selection Quality Benchmark](https://huggingface.co/datasets/qualifire/mcp-tool-selection-quality-benchmark)
- **Limbic Benchmark**: [Limbic Eval Tool Use MCP Benchmark](https://huggingface.co/datasets/quotientai/limbic-eval-tool-use-mcp)
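As a concrete reference, here is a minimal sketch of this binarization; it is illustrative only, not the official evaluation harness, and the `binary_accuracy` helper is hypothetical.

```python
# Collapse the four-class labels to the binary outcomes used in the table above.
LABEL_TO_BINARY = {
    # Qualifire TUQ labels
    "VALID_CALL": "Correct",
    "TOOL_ERROR": "Incorrect",
    "PARAM_NAME_ERROR": "Incorrect",
    "PARAM_VALUE_ERROR": "Incorrect",
    # Limbic labels
    "correct": "Correct",
    "incorrect_tool": "Incorrect",
    "incorrect_parameter_names": "Incorrect",
    "incorrect_parameter_values": "Incorrect",
}

def binary_accuracy(predicted_labels, gold_labels):
    """Fraction of examples whose binarized prediction matches the binarized gold label."""
    hits = sum(
        LABEL_TO_BINARY[pred] == LABEL_TO_BINARY[gold]
        for pred, gold in zip(predicted_labels, gold_labels)
    )
    return hits / len(gold_labels)
```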
---
## πŸ“œ Evaluation Prompt Template
The model uses the following structured evaluation process (a code sketch follows the list):
1. **TOOL SELECTION**
- Check if the tool name exists in `available_tools`
- Check if tool purpose matches user intent
- Fail β†’ `TOOL_ERROR`❌
2. **PARAMETER STRUCTURE**
- All required parameters are present
- No extra parameters
- Parameter names exactly match the schema
- Fail β†’ `PARAM_NAME_ERROR`❌
3. **PARAMETER VALUES**
- Values have correct data types
- Values match user request
- No fabricated or incorrect values
- Fail β†’ `PARAM_VALUE_ERROR`❌
If all checks pass β†’ `VALID_CALL`βœ…
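The checklist above amounts to a decision cascade. The sketch below is illustrative only: the model performs this judgment implicitly from text, and the `intent_matches` and `values_correct` callables are hypothetical stand-ins for judgments it makes internally.

```python
# Illustrative decision cascade mirroring the three checks above.
# `intent_matches` and `values_correct` are hypothetical stand-ins for
# judgments the model makes from the conversation text.
def classify_call(call, available_tools, intent_matches, values_correct):
    fns = {t["function"]["name"]: t["function"] for t in available_tools}
    fn = fns.get(call["name"])
    # 1. TOOL SELECTION: the tool must exist and match the user's intent.
    if fn is None or not intent_matches(fn):
        return "TOOL_ERROR"
    schema_names = set(fn["parameters"]["properties"])
    required = set(fn["parameters"].get("required", []))
    supplied = set(call["arguments"])
    # 2. PARAMETER STRUCTURE: no missing required, extra, or misnamed parameters.
    if not required <= supplied or not supplied <= schema_names:
        return "PARAM_NAME_ERROR"
    # 3. PARAMETER VALUES: types and contents must match the user's request.
    if not values_correct(call["arguments"]):
        return "PARAM_VALUE_ERROR"
    return "VALID_CALL"
```

Applied to the `send-email` example in the Usage section below, this cascade would return `PARAM_NAME_ERROR`, since the required `to` parameter is missing.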
---
### πŸ“¦ Requirements
- `transformers>=4.51.0`
- `huggingface_hub`
- `torch`
---
## πŸ’» Usage
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
import torch
from huggingface_hub import hf_hub_download
# Model name
model_name = "qualifire/mcp-tool-use-quality-ranger-0.6b"
# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Create pipeline
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)
# Load prompt template
file_path = hf_hub_download(repo_id=model_name, filename="tsq_prompt_template.txt")
with open(file_path, encoding="utf-8") as f:
    PROMPT_TEMPLATE = f.read()
# Example inputs
example_tools_list = '''[
  {
    "type": "function",
    "function": {
      "name": "send-email",
      "description": "Send an email using Resend",
      "parameters": {
        "properties": {
          "to": {
            "type": "string",
            "format": "email",
            "description": "Recipient email address"
          },
          "content": {
            "type": "string",
            "description": "Plain text email content"
          },
          "subject": {
            "type": "string",
            "description": "Email subject line"
          },
          "scheduledAt": {
            "type": "string",
            "description": "Optional parameter to schedule the email. This uses natural language. Examples would be 'tomorrow at 10am' or 'in 2 hours' or 'next day at 9am PST' or 'Friday at 3pm ET'."
          }
        },
        "required": ["to", "subject", "content"]
      }
    }
  }
]'''

example_message_history = '''[
  {
    "role": "user",
    "content": "Please send an email to 'jane.doe@example.com' with the subject 'Meeting Follow-Up'. The content should be 'Hi Jane, just following up on our meeting from yesterday. Please find the attached notes.' and schedule it for tomorrow at 10am."
  },
  {
    "completion_message": {
      "content": {
        "type": "text",
        "text": ""
      },
      "role": "assistant",
      "stop_reason": "tool_calls",
      "tool_calls": [
        {
          "id": "call_le25efmhltxx9o7n4rfe",
          "function": {
            "name": "send-email",
            "arguments": {
              "subject": "Meeting Follow-Up",
              "content": "Hi Jane, just following up on our meeting from yesterday. Please find the attached notes.",
              "scheduledAt": "tomorrow at 10am"
            }
          }
        }
      ]
    }
  }
]'''
# Format input
example_input = PROMPT_TEMPLATE.format(
    message_history=example_message_history,
    available_tools=example_tools_list,
)
# Get prediction
result = pipe(example_input)[0]
print(result)
```
## ✨ Example Output
```
{'label': 'PARAM_NAME_ERROR', 'score': 0.9999843835830688}
```
The tool name and parameter values are correct, but the required parameter `to` is missing from the function call, so the label is `PARAM_NAME_ERROR`.
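As a hypothetical correction, supplying the missing `to` argument in the example call should flip the label to `VALID_CALL`:

```python
# Hypothetical corrected arguments for the example above; with the required
# "to" parameter supplied, the expected label is VALID_CALL.
corrected_arguments = {
    "to": "jane.doe@example.com",
    "subject": "Meeting Follow-Up",
    "content": "Hi Jane, just following up on our meeting from yesterday. Please find the attached notes.",
    "scheduledAt": "tomorrow at 10am",
}
```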