CATEGORY_DESCRIPTIONS = {
"Content Generation": "<p>Evaluates the model's ability to produce diverse written outputs across professional and creative domains. This category measures adaptability to linguistic, stylistic, and formatting constraints, as well as the effectiveness of prompt engineering.</p> <b>🏷️Email 🏷️ReportDrafting</b>",
"Editing": "<p>Evaluates refinement capabilities for optimizing given text. It focuses on queries related to rephrasing, revision, and correction, while preserving the rest of the content.</p> <b>🏷️QueryRephrase 🏷️DocumentRevision</b>",
"Data Analysis": "<p>Measures proficiency in processing structured and unstructured data. This category includes tasks related to information extraction and data processing.</p> <b>🏷️JSONFormatted 🏷️TableQuery</b>",
"Reasoning": "<p>Assesses logical problem-solving in coding, multiple-choice question answering, and mathematical operations. It also includes evaluation of rounding errors made by models in quantitative tasks.</p> <b>🏷️Logical 🏷️Mathematical</b>",
"Hallucination": "<p>Detects limitations in generating plausible but inaccurate responses when faced with ambiguous queries, insufficient context, hypothetical scenarios, or challenges in document interpretation.</p> <b>🏷️InsufficientContext 🏷️FalseQueries</b>",
"Safety": "<p>Verifies safeguards against harmful/inappropriate content. This category tests filtering of discriminatory, violent, or illegal material while upholding ethical standards.</p> <b>🏷️Illegal 🏷️Prejudice</b>",
"Repetition": "<p>Evaluates consistency in producing iterative content variations while maintaining quality and relevance across outputs.</p> <b>🏷️Listing</b>",
"Summarization": "<p>Measures ability to distill lengthy content into concise overviews preserving core concepts and eliminating redundancy. This category includes various constraints such as language, format, and output length.</p> <b>🏷️BulletPoints 🏷️N-lineSummary</b>",
"Translation": "<p>Tests the ability to accurately translate diverse real-world contexts while adhering to target language and specified constraints. Our benchmark includes linguistic conditions in 12 languages, ensuring comprehensive multilingual evaluation.</p> <b>🏷️Document 🏷️Line-by-line</b>",
"Multi-Turn": "<p>Assesses the model's ability to capture user intent in challenging scenarios where the context shifts or understanding of previous context is required.</p> <b>🏷️Consistency 🏷️Non-consistency</b>"
}
banner_url = "https://cdn-uploads.huggingface.co/production/uploads/6805a7222cbcd604c2e89cab/GIEbCbyNn7PjWBFftEgNm.png"
BANNER = f'<div style="display: flex; justify-content: flex-start; width: 100%;"> <img src="{banner_url}" alt="Banner" style="width: 100%; height: auto; object-fit: contain;"> </div> '
TITLE = """<html>
<body>
<p style="margin: 0; text-align: right">Leaderboards by Samsung Research for LLM evaluation.</p>
</body>
</html>"""
LINK = """
<h3 style="text-align: right; margin-top: 0;">
<span>✨</span>
<a href="https://research.samsung.com/" style="text-decoration: none;" rel="nofollow" target="_blank" onmouseover="this.style.textDecoration='underline'" onmouseout="this.style.textDecoration='none'">Samsung Research</a> |
<span>🌕</span>
<a href="https://github.com/samsung" style="text-decoration: none;" rel="nofollow" target="_blank" onmouseover="this.style.textDecoration='underline'" onmouseout="this.style.textDecoration='none'">GitHub</a> |
<span>🌎</span>
<a href="https://x.com/samsungresearch" style="text-decoration: none;" rel="nofollow" target="_blank" onmouseover="this.style.textDecoration='underline'" onmouseout="this.style.textDecoration='none'">X</a> |
<span>🌠</span>
<a href="https://huggingface.co/spaces/SamsungResearch/TRUEBench/discussions" style="text-decoration: none;" rel="nofollow" target="_blank" onmouseover="this.style.textDecoration='underline'" onmouseout="this.style.textDecoration='none'">Discussion</a> |
<span>🔭</span> Updated: 2025-09-16
</h3>
"""
INTRODUCTION_TEXT = """
<div style="margin-bottom: 20px; text-align: center !important;">
<h2 style="padding-bottom: 5px !important; text-align: center !important; font-size: 2.6em !important; font-weight: 900 !important; margin-top: 0.2em !important; margin-bottom: 0.3em !important;">
🏆 TRUEBench: A Benchmark for Assessing LLMs as Human Job Productivity Assistants
</h2>
<p style="font-size: 1.25em !important; line-height: 1.7 !important; margin: 14px 0 !important;">
TRUEBench (Trustworthy Real-world Usage Evaluation Benchmark) evaluates LLMs as productivity assistants. <br>
As LLMs become integral to tasks like report drafting and data analysis, existing benchmarks fall short of capturing real-world challenges. <br>
To address this gap, <strong>Samsung Research</strong> developed TRUEBench as a comprehensive evaluation framework for real-world LLM applications.
</p>
<p style="font-size: 1.25em !important; line-height: 1.7 !important; margin: 14px 0 !important;">
TRUEBench is a benchmark designed to evaluate the instruction-following capabilities of LLMs: each response receives a Pass (1 point) or a Fail (0 points) based on human-crafted checklists. <br> This scoring aligns with user satisfaction from the perspective of job productivity.
</p>
<h3 style="font-size: 2em; font-weight: 800; margin-top: 1.2em; margin-bottom: 0.5em; line-height: 1.3; letter-spacing: -0.01em;">
Main Features
</h3>
<div class="intro-feature-row">
<div class="intro-feature-box">
<div class="intro-feature-icon">📝</div>
<div class="intro-feature-title">2,400+ Productivity-Oriented User Inputs</div>
<div class="intro-feature-desc">A large-scale collection of complex, real-world user inputs designed to reflect productivity assistant scenarios.</div>
</div>
<div class="intro-feature-box">
<div class="intro-feature-icon">🌎</div>
<div class="intro-feature-title">Multilinguality in Real Tasks</div>
<div class="intro-feature-desc">Comprehensive 12-language coverage with intra-instance multilingual instructions.</div>
<div class="intro-feature-desc" style="font-style: italic; color: #888;">For multilingual aspects, it was created through local research institutes.</div>
</div>
<div class="intro-feature-box">
<div class="intro-feature-icon">🧩</div>
<div class="intro-feature-title">Beyond Explicit Constraints</div>
<div class="intro-feature-desc">Human-annotated implicit requirements validated by LLMs.</div>
</div>
<div class="intro-feature-box">
<div class="intro-feature-icon">🧭</div>
<div class="intro-feature-title">Dynamic Multi-Turn Contexts</div>
<div class="intro-feature-desc">Realistic dialogue flows with evolving constraints.</div>
</div>
</div>
<a class="intro-dataset-btn" href="https://huggingface.co/datasets/SamsungResearch/TRUEBench" target="_blank" rel="nofollow">
📂 Dataset Sample &rarr;
</a>
</div> """
MAIN_FEATURES_TEXT ="""
<div style="padding: 10px; border-radius: 8px; margin-bottom: 20px;">
<h2 style="color: #2c3e50; border-bottom: 2px solid #3498db; padding-bottom: 5px;">✨ Main Features</h2>
<ul style="list-style-type: none; padding-left: 0;">
<li style="margin-bottom: 10px; padding-left: 25px; position: relative;">
<span style="position: absolute; left: 0; color: #3498db;">✓</span>
Input prompts across 12 languages
</li>
<li style="margin-bottom: 10px; padding-left: 25px; position: relative;">
<span style="position: absolute; left: 0; color: #3498db;">✓</span>
Intra-instance multilingual instructions
</li>
<li style="margin-bottom: 10px; padding-left: 25px; position: relative;">
<span style="position: absolute; left: 0; color: #3498db;">✓</span>
Rigorous evaluation criteria for explicit and implicit constraints
</li>
<li style="margin-bottom: 10px; padding-left: 25px; position: relative;">
<span style="position: absolute; left: 0; color: #3498db;">✓</span>
Complex multi-turn dialogue scenarios
</li>
<li style="margin-bottom: 10px; padding-left: 25px; position: relative;">
<span style="position: absolute; left: 0; color: #3498db;">✓</span>
LLM-validated constraints for reliable evaluation
</li>
</ul>
<div style="margin: 20px 0 10px 0;">
<a href="https://huggingface.co/datasets/SamsungResearch/TRUEBench"
style="color: #3498db;
text-decoration: underline;
font-size: 1.2em;
font-weight: bold;"
rel="nofollow"
target="_blank"
onmouseover="this.style.textDecoration='none'; this.style.color='#2c3e50'"
onmouseout="this.style.textDecoration='underline'; this.style.color='#3498db'">
📂 Dataset Sample →
</a>
</div>
</div>
"""
LLM_BENCHMARKS_TEXT = f"""
## How it works
We utilize LLM Judge with human-crafted criteria to assess AI response.
"""
EVALUATION_QUEUE_TEXT = '''
<div style="font-size: 1.25em !important; line-height: 1.7 !important; margin: 14px 0 !important;">
## Submission Policy
- Submissions are limited to models that are registered on *HuggingFace Models*.
- Each model affiliation (individual or organization) may submit up to **3** times within **24** hours.
- The same model can only be submitted once per 24 hours.
- Duplicate submissions will be determined based on the full model name (i.e., {affiliation}/{model name}). Sampling parameters, dtype, etc. are not considered for duplicate checking.
- Submissions are only valid if the model's affiliation matches that of the submitter.
- If the same model is submitted multiple times, only the version with the highest overall score will be reflected on the leaderboard. (Note: A maximum of 3 submissions per model is allowed.)
**[NOTE]** Models under commercial (proprietary) licenses may be excluded from evaluation. We focus on evaluating openly licensed models, such as those released under Apache-2.0 or MIT. <br>
**[NOTE]** We use your user name (via **OAuthProfile**) and your list of registered organizations (via **OAuthToken**) solely to verify submission eligibility. **This information is never stored.**<br><br>
## Evaluation Environments
- Submitted models are run on our internal servers to generate inference outputs, which are then evaluated using an LLM judge.
- Models must be runnable on up to **32 H100 GPUs** to be eligible for submission.
- By default, we perform inference in a vLLM 0.10.1 environment. We recommend testing your model in this environment first. You may include additional requests in the requirements section in a free-form manner, but note that such requests may be rejected due to constraints of the inference environment.
- We serve models with vLLM and perform inference through the OpenAI-compatible chat completions API.<br><br>
## Evaluation Rules
- It might take more than 1 week for submitted models' scores to appear on the leaderboard.
- The maximum model length is limited to **64K** tokens.
- Please provide a valid contact email address in the submission form so we can send notifications related to evaluation.
**[CAUTION]** If inference fails or if inappropriate content is detected, the model might be excluded from evaluation.<br><br>
## Submission Rules
- For reasoning ("think") models, you must specify the sequence that separates the thinking process from the final response (e.g., &lt;/think&gt;) in the response_prefix field. We will use this prefix to extract the final response for evaluation. (NOTE: Models that fail to provide a proper response prefix might be excluded from evaluation.)
- Referring to the configuration section of the submission form, provide the following in YAML format, either directly or via an uploaded `.yaml` file (if both are provided, the file takes priority); an illustrative example is shown below:
- **Model serve arguments (llm_serve_args)**: vLLM-based model serving parameters ([Reference](https://docs.vllm.ai/en/latest/cli/serve.html))
- **Sampling parameters (sampling_params)**: Sampling parameters supported by the OpenAI API ([Reference](https://platform.openai.com/docs/api-reference/chat))
- **Extra body including chat template arguments (extra_body)**: `chat_template_kwargs` and sampling parameters supported by vLLM ([Reference](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#extra-parameters_1))
- Any additional specifications outside the configuration format should be written in the requirements section.
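
An illustrative configuration (all values are placeholders, not recommendations):
```yaml
llm_serve_args:
  tensor_parallel_size: 2
  max_model_len: 65536
sampling_params:
  temperature: 0.7
  top_p: 0.95
extra_body:
  chat_template_kwargs:
    enable_thinking: true  # placeholder; depends on the model's chat template
```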
**[NOTE]** If you need to use two or more H100 GPUs, be sure to specify `tensor_parallel_size` within `llm_serve_args`.<br><br>
</div>
'''
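
# Illustrative sketch of how a response_prefix (see "Submission Rules" above)
# could be used to strip the thinking process before evaluation. This is a
# hypothetical helper, not the actual extraction code.
def extract_final_response(raw_output: str, response_prefix: str) -> str:
    """Return the text after the last occurrence of response_prefix
    (e.g., "</think>"); if the prefix is absent, return the full output."""
    _, sep, final = raw_output.rpartition(response_prefix)
    return final.strip() if sep else raw_output.strip()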
EVALUATION_QUEUE_TEXT_OPTION1 = """
<div style="font-size: 1.25em !important; line-height: 1.7 !important; margin: 14px 0 !important;">
## Submission Form
1. Sign in using the log-in button below.
2. Fill in the required information: metadata, requirements, and configuration (fill in the textbox or upload a .yaml file).
3. Press the "Submit Eval" button to submit.
</div>
"""
EVALUATION_QUEUE_TEXT_OPTION2 = """
"""
CITATION_BUTTON_LABEL = "To be updated"
CITATION_BUTTON_TEXT = r"""
"""