# FutureBench Dataset Processing

This directory contains tools for processing FutureBench datasets, both downloading from HuggingFace and transforming your own database into the standard format.
## Option 1: Download from HuggingFace (Original)

Use this to download the existing FutureBench dataset:

```bash
python download_data.py
```
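If you want to reproduce the download step manually, a minimal sketch using `huggingface_hub` looks like this; the repo id below is a placeholder, not the actual FutureBench repository (the real names live in `config.py`):

```python
# Minimal download sketch; the repo id is a placeholder (see config.py
# for the real repository names).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="your-org/futurebench",
    repo_type="dataset",
)
print(f"Dataset cached at: {local_dir}")
```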
## Option 2: Transform Your Own Database

Use this to transform your production database into HuggingFace format:

### Setup

- Install dependencies:

  ```bash
  pip install pandas sqlalchemy huggingface_hub
  ```

- Set up your HuggingFace token:

  ```bash
  export HF_TOKEN="your_huggingface_token_here"
  ```

- Configure your settings by editing `config_db.py` to match your needs (see the sketch after this list):
  - Update `HF_CONFIG` with your HuggingFace repository names
  - Adjust `PROCESSING_CONFIG` for data filtering preferences
  - Note: the database connection uses the same setup as the main FutureBench app
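As a rough illustration of the two settings, `config_db.py` might contain dicts like the following; every key and value here is an assumption for the sketch, since only the `HF_CONFIG` and `PROCESSING_CONFIG` names are given above:

```python
# Illustrative only -- all keys and values below are assumptions;
# check config_db.py for the actual structure.
HF_CONFIG = {
    "repo_id": "your-org/futurebench",  # hypothetical dataset repo
    "repo_type": "dataset",
}

PROCESSING_CONFIG = {
    "min_created_at": "2024-01-01",  # hypothetical filter: drop older predictions
    "include_unresolved": True,      # hypothetical filter: keep events with no result yet
}
```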
### Usage

```bash
# Transform your database and upload to HuggingFace
python db_to_hf.py

# Or run locally without uploading
HF_TOKEN="" python db_to_hf.py
```
## Database Schema

The script uses the same database schema as the main FutureBench application:

- `EventBase` model for events
- `Prediction` model for predictions
- Uses SQLAlchemy ORM (same as `convert_to_csv.py`)
No additional database configuration needed - it uses the existing FutureBench database connection.
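For orientation, the two models might look roughly like this; the column names are inferred from the dataset fields documented below, while the table names and column types are assumptions (the real definitions live in the main FutureBench app):

```python
# Illustrative SQLAlchemy models; table names and column types are assumptions.
from sqlalchemy import Column, DateTime, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class EventBase(Base):
    __tablename__ = "events"
    event_id = Column(Integer, primary_key=True)
    question = Column(String)
    event_type = Column(String)      # e.g. "polymarket", "soccer"
    answer_options = Column(String)  # JSON-encoded list of possible answers
    result = Column(String, nullable=True)
    open_to_bet_until = Column(DateTime)

class Prediction(Base):
    __tablename__ = "predictions"
    id = Column(Integer, primary_key=True)
    event_id = Column(Integer, ForeignKey("events.event_id"))
    algorithm_name = Column(String)
    actual_prediction = Column(String)
    prediction_created_at = Column(DateTime)
```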
## Output Format

The script produces data in the same format as the original FutureBench dataset:

```csv
event_id,question,event_type,algorithm_name,actual_prediction,result,open_to_bet_until,prediction_created_at
```
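A minimal sketch of producing that CSV, assuming the models sketched above and a hypothetical connection URL (the real pipeline is `db_to_hf.py`):

```python
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.orm import Session

# Join events to their predictions and emit the columns listed above.
# The connection URL is a placeholder; in practice the script reuses the
# app's existing connection. EventBase and Prediction are the models
# sketched in the Database Schema section.
engine = create_engine("postgresql://user:pass@localhost/futurebench")
with Session(engine) as session:
    rows = (
        session.query(EventBase, Prediction)
        .join(Prediction, Prediction.event_id == EventBase.event_id)
        .all()
    )

df = pd.DataFrame(
    [
        {
            "event_id": e.event_id,
            "question": e.question,
            "event_type": e.event_type,
            "algorithm_name": p.algorithm_name,
            "actual_prediction": p.actual_prediction,
            "result": e.result,
            "open_to_bet_until": e.open_to_bet_until,
            "prediction_created_at": p.prediction_created_at,
        }
        for e, p in rows
    ]
)
df.to_csv("futurebench.csv", index=False)
```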
## Automation

You can run this as a scheduled job:

```bash
# Add to crontab to run daily at 2 AM
0 2 * * * cd /path/to/your/project && python leaderboard/process_data/db_to_hf.py
```
## Files

- `download_data.py` - Downloads data from HuggingFace repositories
- `db_to_hf.py` - Transforms your database to HuggingFace format
- `config_db.py` - Configuration for database connection and HF settings
- `config.py` - HuggingFace repository configuration
- `requirements.txt` - Python dependencies
## Data Structure

The main dataset contains:

- `event_id`: Unique identifier for each event
- `question`: The prediction question
- `event_type`: Type of event (polymarket, soccer, etc.)
- `answer_options`: Possible answers in JSON format
- `result`: Actual outcome (if resolved)
- `algorithm_name`: AI model that made the prediction
- `actual_prediction`: The prediction made
- `open_to_bet_until`: Prediction window deadline
- `prediction_created_at`: When the prediction was made
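An example record, with every value invented for illustration:

```python
# All values below are made up to show the shape of a row.
sample_row = {
    "event_id": 1234,
    "question": "Will Team A beat Team B?",
    "event_type": "soccer",
    "answer_options": '["Team A", "Team B", "Draw"]',  # JSON-encoded string
    "result": "Team A",                                # empty until resolved
    "algorithm_name": "example-model",                 # hypothetical model name
    "actual_prediction": "Team A",
    "open_to_bet_until": "2025-02-28T23:59:59Z",
    "prediction_created_at": "2025-02-20T14:02:11Z",
}
```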
## Output

The script generates:

- Downloaded datasets in local cache folders
- `evaluation_queue.csv` with unique events for processing
- Console output with data statistics and summary
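Since `evaluation_queue.csv` holds unique events, one plausible way to derive it from the full CSV looks like this (file names and column selection are assumptions; the actual selection logic lives in `db_to_hf.py`):

```python
import pandas as pd

# Keep one row per event: predictions repeat event rows in the full CSV,
# so deduplicate on event_id to build the evaluation queue.
df = pd.read_csv("futurebench.csv")
event_cols = ["event_id", "question", "event_type", "result", "open_to_bet_until"]
queue = df[event_cols].drop_duplicates(subset="event_id")
queue.to_csv("evaluation_queue.csv", index=False)
```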