| """ | |
| Task description components for the leaderboard application. | |
| """ | |
| import streamlit as st | |
| from src.utils.config import tasks_info | |
| from src.utils.task_mapping import get_display_name, get_original_name | |


def render_task_descriptions():
    """
    Render the benchmark details section.
    """
    # Display the MLRC-BENCH image
    st.image("Assests/MLRC_Bench_overview.png", use_column_width=True)

    # Display the MLRC-BENCH information
    st.markdown("""
## MLRC-BENCH: Can Language Agents Solve ML Research Challenges?

Recent advances in large language models (LLMs) have motivated a critical question in the machine learning community: can AI agents not only propose novel research ideas but also translate them into effective implementations? **MLRC-BENCH** is introduced as a new benchmark to investigate this question by rigorously evaluating the capacity of LLM-based research agents to address contemporary ML competition tasks.

---

### Benchmark Overview

MLRC-BENCH seeks to assess AI-driven research workflows in two primary dimensions:

- **Idea Proposal**: Generating plausible and potentially innovative methods for addressing current ML research problems.
- **Code Implementation**: Translating these ideas into executable solutions that measurably improve performance over a baseline.

This design contrasts with prior benchmarks that emphasize either (1) full end-to-end paper generation assessed by subjective human or LLM reviews, or (2) isolated code-generation tasks that focus on engineering challenges. By dividing the problem into idea proposal and implementation, MLRC-BENCH provides a clearer measure of how well agents can form and operationalize research insights.

---

### Evaluation Criteria

For each agent on a given task, MLRC-BENCH measures performance relative to a **baseline** method and a **top human** benchmark. We report two primary metrics, each computed from the best (maximum) result across all experimental runs for a task-model pair:

- **Relative Improvement to Human**: How far the agent closes the gap between the baseline and the best human solution.
- **Absolute Improvement to Baseline**: How much better the agent performs than the baseline, expressed as a percentage gain.

---

### Significance

MLRC-BENCH emphasizes rigorous and reproducible evaluations, focusing on tasks drawn from recent machine learning conferences and workshops to ensure that tested methods are both **meaningful** and **nontrivial**. This dynamic approach allows the benchmark to grow as new competition tasks arise, enabling continuous monitoring of progress in agent-driven research. Through its emphasis on objective success criteria, MLRC-BENCH fosters the development of AI agents that more effectively balance conceptual innovation with practical impact.

---

### Future Directions

While current results suggest that LLM-based research agents still fall short of human capabilities in creativity and code implementation, MLRC-BENCH provides a **scalable mechanism** to track and accelerate progress. As AI methods advance—and potentially branch into high-stakes domains such as healthcare and climate modeling—this benchmark could serve as a critical resource for aligning agent innovation with **reliability** and **safety**.
""")

    st.markdown("""
<div class="card">
    <div class="card-title"><span class="card-title-icon">🔍</span> Tasks in the Benchmark</div>
    <p style="margin-bottom: 20px;">
        Click on any task to learn more.
    </p>
</div>
""", unsafe_allow_html=True)

    # Task links mapping - using original task names
    original_task_links = {
        "Backdoor Trigger Recovery": "https://www.llmagentsafetycomp24.com/tracks/#backdoor_model",
        "Machine Unlearning": "https://unlearning-challenge.github.io/",
        "Perception Temporal Action Loc": "https://ptchallenge-workshop.github.io",
        "Product Recommendation": "https://www.aicrowd.com/challenges/amazon-kdd-cup-23-multilingual-recommendation-challenge",
        "Meta Learning": "https://metalearning.chalearn.org/",
        "Llm Merging": "https://llm-merging.github.io",
        "Rainfall Prediction": "https://weather4cast.net/neurips-2023/",
    }

    # Update links mapping to use display names as keys
    task_links = {get_display_name(task): link for task, link in original_task_links.items()}
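    # Note (assumption): get_display_name is expected to return the same display
    # names used as keys in tasks_info, e.g. it might map "Llm Merging" to
    # "LLM Merging"; the actual mapping lives in src.utils.task_mapping. This is
    # what lets task_links.get(task, "#") below find a link for each task card.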

    # Create two columns
    col1, col2 = st.columns(2)

    # Split tasks between the two columns
    task_items = list(tasks_info.items())
    mid_point = len(task_items) // 2

    def render_task_card(task, link):
        """Render one clickable task card that links to the competition page."""
        st.markdown(f"""
<a href="{link}" target="_blank" style="text-decoration: none; color: inherit;">
    <div class="task-card" style="cursor: pointer; transition: transform 0.2s, box-shadow 0.2s; padding: 12px; margin-bottom: 15px; height: auto;" onmouseover="this.style.transform='translateY(-5px)'; this.style.boxShadow='0 8px 15px rgba(0, 0, 0, 0.2)';" onmouseout="this.style.transform='translateY(0)'; this.style.boxShadow='0 4px 6px rgba(0, 0, 0, 0.15)';">
        <div class="task-title" style="text-align: center;">{task} <span style="font-size: 14px; opacity: 0.7;">🔗</span></div>
    </div>
</a>
""", unsafe_allow_html=True)

    with col1:
        for task, _description in task_items[:mid_point]:
            render_task_card(task, task_links.get(task, "#"))

    with col2:
        for task, _description in task_items[mid_point:]:
            render_task_card(task, task_links.get(task, "#"))