from dataclasses import dataclass
from enum import Enum


@dataclass
class Task:
    benchmark: str  # key identifying the task in the results data
    metric: str  # metric used for scoring (e.g. "acc")
    col_name: str  # column header displayed on the leaderboard


# Define our evaluation tasks
# ---------------------------------------------------
class Tasks(Enum):
    # task_key in the data, metric name, display name
    news = Task("news", "acc", "News")
    polymarket = Task("polymarket", "acc", "PolyMarket")
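

# Illustrative usage (an assumption about how the app consumes this enum, not
# part of the original module): leaderboard columns are typically derived by
# iterating over Tasks, e.g. to build the results-table header.
def task_columns() -> list[str]:
    """Return the display-column name of every evaluation task."""
    return [task.value.col_name for task in Tasks]  # ["News", "PolyMarket"]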


# Your leaderboard name
TITLE = """<h1 align="center" id="space-title" style="font-size: 4.375rem; font-weight: bold; margin-bottom: 1rem;">🔮 FutureBench Leaderboard</h1>"""

# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """<div class="section-card">
<h3 class="section-header"><span class="section-icon">🎯</span> About FutureBench</h3>
FutureBench is a benchmarking system that evaluates AI models on their ability to predict future events.
This leaderboard shows how well different models forecast real-world outcomes
across domains including news events, sports, and prediction markets.
<br><br>
๐Ÿ“ <a href="https://www.together.ai/blog/futurebench" target="_blank" style="color: #007acc; text-decoration: none;">Read our blog post</a> for more details about FutureBench.
</div>"""

# Additional information about the benchmark
ABOUT_TEXT = """
<div class="section-card fade-in-up">
<h2 class="section-header"><span class="section-icon">⚙️</span> How it works</h2>

FutureBench evaluates AI models on their ability to predict future events by:

- **Ingesting real-world events** from multiple sources (news, sports, prediction markets)
- **Collecting AI predictions** before events resolve
- **Measuring accuracy** once outcomes are known
- **Ranking models** based on their predictive performance
</div>

<div class="section-card fade-in-up stagger-1">
<h2 class="section-header"><span class="section-icon">📊</span> Event Types</h2>

- **News Events**: Predictions about political developments, economic changes, and current events
- **PolyMarket**: Predictions on various real-world events traded on prediction markets
</div>

<div class="section-card fade-in-up stagger-2">
<h2 class="section-header"><span class="section-icon">📈</span> Metrics</h2>

Models are evaluated using **accuracy**: the percentage of correct predictions.
The **Average** score shows overall performance across all event types.
</div>

<div class="section-card fade-in-up stagger-3">
<h2 class="section-header"><span class="section-icon">🔒</span> Data Integrity</h2>

All predictions are recorded before events resolve, ensuring a fair evaluation.
The leaderboard updates as new events resolve and model scores are recalculated.
</div>
"""