---
license: other
tags:
- mcp
- tools quality
- tool quality selection
- tool selection
- TSQ
- TUQ
- sequence-classification
- tool-evaluation
- function-call
- limbic
- tool use
- tool quality
pipeline_tag: text-classification
language:
- en
base_model:
- Qwen/Qwen3-0.6B
---
|
|
|
## 🧠 Model Description

The **mcp-tool-use-quality-ranger-0.6b** is a fine-tuned sequence classification model created to **evaluate the quality of function calls** in conversational AI systems.

It is designed for evaluating function calls in the context of **Model Context Protocol (MCP) tools**, and can assess whether a function call is correct, uses the wrong tool, has incorrect parameter names, or has incorrect parameter values.

**Max Context Length:** **32,768 tokens**

<img src="Ranger_img.png" width="600px"/>

It determines whether a given function call:

- Selects the correct tool
- Has correct parameter names and structure
- Contains correct parameter values

It produces one of four possible classification labels:

| Label | Meaning |
|-------|---------|
| **VALID_CALL** | ✅ The tool name, parameters, and values are all correct, or no suitable tool exists and no function call is made. |
| **TOOL_ERROR** | ❌ The tool name does not exist or does not match the user intent. |
| **PARAM_NAME_ERROR** | ❌ The correct tool is used, but parameter names are missing, extra, or incorrect. |
| **PARAM_VALUE_ERROR** | ❌ Tool and parameter names are correct, but parameter values are wrong or incorrectly formatted. |
|
|
|
|
|
---

## 📊 Benchmark Evaluation

The **mcp-tool-use-quality-ranger-0.6b** was evaluated in a binary classification setting, where the prediction is **Correct** if the function-call evaluation matched the gold label, and **Incorrect** otherwise.

| Model | #Params | Avg. Latency | Avg. Binary Accuracy | [Qualifire mcp-tool-use-quality Benchmark](https://huggingface.co/datasets/qualifire/mcp-tool-use-quality-benchmark) Binary Accuracy | [Limbic Benchmark](https://huggingface.co/datasets/quotientai/limbic-eval-tool-use-mcp) Binary Accuracy |
|---|---|---|---|---|---|
| qualifire/mcp-tool-use-quality-ranger-4b [private] | 4B | 0.30 sec | 0.978 | 0.997 | 0.960 |
| **qualifire/mcp-tool-use-quality-ranger-0.6b** | **0.6B** | **0.09 sec** | **0.958** | **0.993** | **0.924** |
| gemini-2.5-flash | - | 4.87 sec | 0.890 | 0.936 | 0.845 |
| quotientai/limbic-tool-use-0.5B-32K | 0.5B | 0.79 sec | 0.818 | 0.749 | 0.887 |
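As a concrete illustration of the binary setting used in the table, accuracy is simply the fraction of function-call evaluations that match the gold label. A minimal sketch (not the official evaluation harness):

```python
# Minimal sketch of the binary scoring described above: a prediction counts
# as Correct only when it matches the gold label exactly.
def binary_accuracy(predictions, gold_labels):
    matches = sum(p == g for p, g in zip(predictions, gold_labels))
    return matches / len(gold_labels)

preds = ["VALID_CALL", "TOOL_ERROR", "VALID_CALL", "PARAM_NAME_ERROR"]
golds = ["VALID_CALL", "TOOL_ERROR", "PARAM_VALUE_ERROR", "PARAM_NAME_ERROR"]
print(binary_accuracy(preds, golds))  # 3 of 4 predictions match: 0.75
```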
|
|
|
### 📏 Metrics Definitions

- **Avg. Binary Accuracy**: Mean accuracy across all evaluated benchmarks, where predictions are mapped to binary outcomes as follows:

  - **Qualifire TUQ Benchmark**
    - **Correct** → `VALID_CALL`
    - **Incorrect** → `TOOL_ERROR`, `PARAM_NAME_ERROR`, or `PARAM_VALUE_ERROR`

  - **Limbic Benchmark**
    - **Correct** → `correct`
    - **Incorrect** → `incorrect_tool`, `incorrect_parameter_names`, or `incorrect_parameter_values`

- **Qualifire TUQ Benchmark** link → [Qualifire Tool Selection Quality Benchmark](https://huggingface.co/datasets/qualifire/mcp-tool-selection-quality-benchmark)
- **Limbic Benchmark** link → [Limbic Eval Tool Use MCP Benchmark](https://huggingface.co/datasets/quotientai/limbic-eval-tool-use-mcp)
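The label-to-binary mapping for the Qualifire TUQ benchmark can be sketched in a few lines (assumption: any of the three error labels counts as Incorrect, exactly as listed above):

```python
# Map the four TUQ labels to the binary outcome used in the benchmark:
# VALID_CALL counts as Correct; every error label counts as Incorrect.
ERROR_LABELS = {"TOOL_ERROR", "PARAM_NAME_ERROR", "PARAM_VALUE_ERROR"}

def to_binary(label: str) -> str:
    if label == "VALID_CALL":
        return "Correct"
    if label in ERROR_LABELS:
        return "Incorrect"
    raise ValueError(f"unexpected label: {label}")

print(to_binary("VALID_CALL"))        # Correct
print(to_binary("PARAM_NAME_ERROR"))  # Incorrect
```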
|
|
|
---

## 📝 Evaluation Prompt Template

The model uses the following structured evaluation process:

1. **TOOL SELECTION**
   - Check if the tool name exists in `available_tools`
   - Check if the tool's purpose matches the user intent
   - Fail → `TOOL_ERROR` ❌

2. **PARAMETER STRUCTURE**
   - All required parameters are present
   - No extra parameters
   - Parameter names exactly match the schema
   - Fail → `PARAM_NAME_ERROR` ❌

3. **PARAMETER VALUES**
   - Values have correct data types
   - Values match the user request
   - No fabricated or incorrect values
   - Fail → `PARAM_VALUE_ERROR` ❌

If all checks pass → `VALID_CALL` ✅
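The three checks above can be sketched as a plain-Python decision procedure. This is an illustrative stand-in for the model's learned behavior, not its internals; `classify_call` and its type checks are hypothetical simplifications:

```python
# Hedged sketch of the three-stage check order described above.
def classify_call(call: dict, schema: dict, available_tools: set) -> str:
    # 1. TOOL SELECTION: the tool must exist in available_tools
    if call["name"] not in available_tools:
        return "TOOL_ERROR"
    # 2. PARAMETER STRUCTURE: all required names present, no extras
    props = set(schema["properties"])
    required = set(schema.get("required", []))
    args = set(call["arguments"])
    if not required <= args or not args <= props:
        return "PARAM_NAME_ERROR"
    # 3. PARAMETER VALUES: values must match the declared schema types
    type_map = {"string": str, "number": (int, float), "boolean": bool}
    for name, value in call["arguments"].items():
        expected = type_map.get(schema["properties"][name].get("type"))
        if expected and not isinstance(value, expected):
            return "PARAM_VALUE_ERROR"
    return "VALID_CALL"

# Demo with a schema shaped like the send-email example below,
# where the required "to" parameter is missing from the call.
schema = {
    "properties": {
        "to": {"type": "string"},
        "content": {"type": "string"},
        "subject": {"type": "string"},
        "scheduledAt": {"type": "string"},
    },
    "required": ["to", "subject", "content"],
}
call = {
    "name": "send-email",
    "arguments": {"subject": "Hi", "content": "Hello", "scheduledAt": "tomorrow at 10am"},
}
print(classify_call(call, schema, {"send-email"}))  # PARAM_NAME_ERROR
```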
|
|
|
|
---

### 📦 Requirements

- `transformers>=4.51.0`
- `huggingface_hub`
- `torch`
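The requirements above can be installed with pip, for example:

```shell
pip install "transformers>=4.51.0" huggingface_hub torch
```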
|
|
|
---

## 💻 Usage
|
|
|
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
import torch
from huggingface_hub import hf_hub_download

# Model name
model_name = "qualifire/mcp-tool-use-quality-ranger-0.6b"

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Create pipeline
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Load prompt template
file_path = hf_hub_download(repo_id=model_name, filename="tsq_prompt_template.txt")
with open(file_path, encoding="utf-8") as f:
    PROMPT_TEMPLATE = f.read()

# Example inputs
example_tools_list = '''[
  {
    "type": "function",
    "function": {
      "name": "send-email",
      "description": "Send an email using Resend",
      "parameters": {
        "properties": {
          "to": {
            "type": "string",
            "format": "email",
            "description": "Recipient email address"
          },
          "content": {
            "type": "string",
            "description": "Plain text email content"
          },
          "subject": {
            "type": "string",
            "description": "Email subject line"
          },
          "scheduledAt": {
            "type": "string",
            "description": "Optional parameter to schedule the email. This uses natural language. Examples would be 'tomorrow at 10am' or 'in 2 hours' or 'next day at 9am PST' or 'Friday at 3pm ET'."
          }
        },
        "required": ["to", "subject", "content"]
      }
    }
  }
]'''

example_message_history = '''[
  {
    "role": "user",
    "content": "Please send an email to 'jane.doe@example.com' with the subject 'Meeting Follow-Up'. The content should be 'Hi Jane, just following up on our meeting from yesterday. Please find the attached notes.' and schedule it for tomorrow at 10am."
  },
  {
    "completion_message": {
      "content": {
        "type": "text",
        "text": ""
      },
      "role": "assistant",
      "stop_reason": "tool_calls",
      "tool_calls": [
        {
          "id": "call_le25efmhltxx9o7n4rfe",
          "function": {
            "name": "send-email",
            "arguments": {
              "subject": "Meeting Follow-Up",
              "content": "Hi Jane, just following up on our meeting from yesterday. Please find the attached notes.",
              "scheduledAt": "tomorrow at 10am"
            }
          }
        }
      ]
    }
  }
]'''

# Format input
example_input = PROMPT_TEMPLATE.format(
    message_history=example_message_history,
    available_tools=example_tools_list,
)

# Get prediction
result = pipe(example_input)[0]
print(result)
```
|
|
|
## ✨ Example Output

```
{'label': 'PARAM_NAME_ERROR', 'score': 0.9999843835830688}
```

The tool name and parameter values are correct, but the required parameter `to` is missing from the function call, so the label is `PARAM_NAME_ERROR`.
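The verdict can be confirmed locally by checking the call's arguments against the schema's `required` list. A quick sanity check, separate from the model:

```python
# The flagged call omits "to", which the send-email schema marks as required.
required = ["to", "subject", "content"]
arguments = {
    "subject": "Meeting Follow-Up",
    "content": "Hi Jane, just following up on our meeting from yesterday. Please find the attached notes.",
    "scheduledAt": "tomorrow at 10am",
}
missing = [name for name in required if name not in arguments]
print(missing)  # ['to']

# Supplying the recipient would make the parameter structure complete.
arguments["to"] = "jane.doe@example.com"
print([name for name in required if name not in arguments])  # []
```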