---
license: other
tags:
- mcp
- tools quality
- tool quality selection
- tool selection
- TSQ
- TUQ
- sequence-classification
- tool-evaluation
- function-call
- limbic
- tool use
- tool quality
pipeline_tag: text-classification
language:
- en
base_model:
- Qwen/Qwen3-0.6B
---
|
|
|
## 🧠 Model Description

The **mcp-tool-use-quality-ranger-0.6b** is a fine-tuned sequence classification model created to **evaluate the quality of function calls** in conversational AI systems.

It is designed for evaluating function calls in the context of **Model Context Protocol (MCP) tools**, and can assess whether a function call is correct, uses the wrong tool, has incorrect parameter names, or has incorrect parameter values.

**Max Context Length:** **32,768 tokens**

<img src="Ranger_img.png" width="600px"/>

It determines whether a given function call:

- Selects the correct tool
- Has correct parameter names and structure
- Contains correct parameter values

It produces one of four possible classification labels:

| Label | Meaning |
|-------|---------|
| **VALID_CALL** | ✅ The tool name, parameters, and values are all correct, or no suitable tool exists and no function call is made. |
| **TOOL_ERROR** | ❌ The tool name does not exist or does not match the user intent. |
| **PARAM_NAME_ERROR** | ❌ The correct tool is used, but parameter names are missing, extra, or incorrect. |
| **PARAM_VALUE_ERROR** | ❌ Tool and parameter names are correct, but parameter values are wrong or incorrectly formatted. |
|
|
|
|
|
---

## 📊 Benchmark Evaluation

The **mcp-tool-use-quality-ranger-0.6b** was evaluated in a binary classification setting, where the prediction is **Correct** if the function-call evaluation matched the gold label, and **Incorrect** otherwise.

| Model | #Params | Avg. Latency | Avg. Binary Accuracy | [Qualifire mcp-tool-use-quality Benchmark](https://huggingface.co/datasets/qualifire/mcp-tool-use-quality-benchmark) Binary Accuracy | [Limbic Benchmark](https://huggingface.co/datasets/quotientai/limbic-eval-tool-use-mcp) Binary Accuracy |
|---|---|---|---|---|---|
| qualifire/mcp-tool-use-quality-ranger-4b [private] | 4B | 0.30 sec | 0.978 | 0.997 | 0.960 |
| **qualifire/mcp-tool-use-quality-ranger-0.6b** | **0.6B** | **0.09 sec** | **0.958** | **0.993** | **0.924** |
| gemini-2.5-flash | - | 4.87 sec | 0.890 | 0.936 | 0.845 |
| quotientai/limbic-tool-use-0.5B-32K | 0.5B | 0.79 sec | 0.818 | 0.749 | 0.887 |
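As a concrete illustration of the binary setting used in the table, accuracy is simply the fraction of function-call evaluations that match the gold label. A minimal sketch (not the official evaluation harness):

```python
# Minimal sketch of the binary scoring described above: a prediction counts
# as Correct only when it matches the gold label exactly.
def binary_accuracy(predictions, gold_labels):
    matches = sum(p == g for p, g in zip(predictions, gold_labels))
    return matches / len(gold_labels)

preds = ["VALID_CALL", "TOOL_ERROR", "VALID_CALL", "PARAM_NAME_ERROR"]
golds = ["VALID_CALL", "TOOL_ERROR", "PARAM_VALUE_ERROR", "PARAM_NAME_ERROR"]
print(binary_accuracy(preds, golds))  # 3 of 4 predictions match: 0.75
```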
|
|
|
### 📏 Metrics Definitions

- **Avg. Binary Accuracy**: Mean accuracy across all evaluated benchmarks, where predictions are mapped to binary outcomes as follows:

  - **Qualifire TUQ Benchmark**
    - **Correct** → `VALID_CALL`
    - **Incorrect** → `TOOL_ERROR`, `PARAM_NAME_ERROR`, or `PARAM_VALUE_ERROR`

  - **Limbic Benchmark**
    - **Correct** → `correct`
    - **Incorrect** → `incorrect_tool`, `incorrect_parameter_names`, or `incorrect_parameter_values`

- **Qualifire TUQ Benchmark** link → [Qualifire Tool Selection Quality Benchmark](https://huggingface.co/datasets/qualifire/mcp-tool-selection-quality-benchmark)
- **Limbic Benchmark** link → [Limbic Eval Tool Use MCP Benchmark](https://huggingface.co/datasets/quotientai/limbic-eval-tool-use-mcp)
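The label-to-binary mapping for the Qualifire TUQ benchmark can be sketched in a few lines (assumption: any of the three error labels counts as Incorrect, exactly as listed above):

```python
# Map the four TUQ labels to the binary outcome used in the benchmark:
# VALID_CALL counts as Correct; every error label counts as Incorrect.
ERROR_LABELS = {"TOOL_ERROR", "PARAM_NAME_ERROR", "PARAM_VALUE_ERROR"}

def to_binary(label: str) -> str:
    if label == "VALID_CALL":
        return "Correct"
    if label in ERROR_LABELS:
        return "Incorrect"
    raise ValueError(f"unexpected label: {label}")

print(to_binary("VALID_CALL"))        # Correct
print(to_binary("PARAM_NAME_ERROR"))  # Incorrect
```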
|
|
|
---

## 📝 Evaluation Prompt Template

The model uses the following structured evaluation process:

1. **TOOL SELECTION**
   - Check if the tool name exists in `available_tools`
   - Check if the tool's purpose matches the user intent
   - Fail → `TOOL_ERROR` ❌

2. **PARAMETER STRUCTURE**
   - All required parameters are present
   - No extra parameters
   - Parameter names exactly match the schema
   - Fail → `PARAM_NAME_ERROR` ❌

3. **PARAMETER VALUES**
   - Values have correct data types
   - Values match the user request
   - No fabricated or incorrect values
   - Fail → `PARAM_VALUE_ERROR` ❌

If all checks pass → `VALID_CALL` ✅
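The three checks above can be sketched as a plain-Python decision procedure. This is an illustrative stand-in for the model's learned behavior, not its internals; `classify_call` and its type checks are hypothetical simplifications:

```python
# Hedged sketch of the three-stage check order described above.
def classify_call(call: dict, schema: dict, available_tools: set) -> str:
    # 1. TOOL SELECTION: the tool must exist in available_tools
    if call["name"] not in available_tools:
        return "TOOL_ERROR"
    # 2. PARAMETER STRUCTURE: all required names present, no extras
    props = set(schema["properties"])
    required = set(schema.get("required", []))
    args = set(call["arguments"])
    if not required <= args or not args <= props:
        return "PARAM_NAME_ERROR"
    # 3. PARAMETER VALUES: values must match the declared schema types
    type_map = {"string": str, "number": (int, float), "boolean": bool}
    for name, value in call["arguments"].items():
        expected = type_map.get(schema["properties"][name].get("type"))
        if expected and not isinstance(value, expected):
            return "PARAM_VALUE_ERROR"
    return "VALID_CALL"

# Demo with a schema shaped like the send-email example below,
# where the required "to" parameter is missing from the call.
schema = {
    "properties": {
        "to": {"type": "string"},
        "content": {"type": "string"},
        "subject": {"type": "string"},
        "scheduledAt": {"type": "string"},
    },
    "required": ["to", "subject", "content"],
}
call = {
    "name": "send-email",
    "arguments": {"subject": "Hi", "content": "Hello", "scheduledAt": "tomorrow at 10am"},
}
print(classify_call(call, schema, {"send-email"}))  # PARAM_NAME_ERROR
```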
|
|
|
|
---

### 📦 Requirements

- `transformers>=4.51.0`
- `huggingface_hub`
- `torch`
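The requirements above can be installed with pip, for example:

```shell
pip install "transformers>=4.51.0" huggingface_hub torch
```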
|
|
|
---

## 💻 Usage
|
|
|
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
import torch
from huggingface_hub import hf_hub_download

# Model name
model_name = "qualifire/mcp-tool-use-quality-ranger-0.6b"

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Create pipeline
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Load prompt template
file_path = hf_hub_download(repo_id=model_name, filename="tsq_prompt_template.txt")
with open(file_path, encoding="utf-8") as f:
    PROMPT_TEMPLATE = f.read()

# Example inputs
example_tools_list = '''[
  {
    "type": "function",
    "function": {
      "name": "send-email",
      "description": "Send an email using Resend",
      "parameters": {
        "properties": {
          "to": {
            "type": "string",
            "format": "email",
            "description": "Recipient email address"
          },
          "content": {
            "type": "string",
            "description": "Plain text email content"
          },
          "subject": {
            "type": "string",
            "description": "Email subject line"
          },
          "scheduledAt": {
            "type": "string",
            "description": "Optional parameter to schedule the email. This uses natural language. Examples would be 'tomorrow at 10am' or 'in 2 hours' or 'next day at 9am PST' or 'Friday at 3pm ET'."
          }
        },
        "required": ["to", "subject", "content"]
      }
    }
  }
]'''

example_message_history = '''[
  {
    "role": "user",
    "content": "Please send an email to 'jane.doe@example.com' with the subject 'Meeting Follow-Up'. The content should be 'Hi Jane, just following up on our meeting from yesterday. Please find the attached notes.' and schedule it for tomorrow at 10am."
  },
  {
    "completion_message": {
      "content": {
        "type": "text",
        "text": ""
      },
      "role": "assistant",
      "stop_reason": "tool_calls",
      "tool_calls": [
        {
          "id": "call_le25efmhltxx9o7n4rfe",
          "function": {
            "name": "send-email",
            "arguments": {
              "subject": "Meeting Follow-Up",
              "content": "Hi Jane, just following up on our meeting from yesterday. Please find the attached notes.",
              "scheduledAt": "tomorrow at 10am"
            }
          }
        }
      ]
    }
  }
]'''

# Format input
example_input = PROMPT_TEMPLATE.format(
    message_history=example_message_history,
    available_tools=example_tools_list,
)

# Get prediction
result = pipe(example_input)[0]
print(result)
```
|
|
|
## ✨ Example Output

```
{'label': 'PARAM_NAME_ERROR', 'score': 0.9999843835830688}
```

The tool name and parameter values are correct, but the required parameter `to` is missing from the function call, so the label is `PARAM_NAME_ERROR`.
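The verdict can be confirmed locally by checking the call's arguments against the schema's `required` list. A quick sanity check, separate from the model:

```python
# The flagged call omits "to", which the send-email schema marks as required.
required = ["to", "subject", "content"]
arguments = {
    "subject": "Meeting Follow-Up",
    "content": "Hi Jane, just following up on our meeting from yesterday. Please find the attached notes.",
    "scheduledAt": "tomorrow at 10am",
}
missing = [name for name in required if name not in arguments]
print(missing)  # ['to']

# Supplying the recipient would make the parameter structure complete.
arguments["to"] = "jane.doe@example.com"
print([name for name in required if name not in arguments])  # []
```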