# FutureBench Dataset Processing

This directory contains tools for processing FutureBench datasets, both downloading from HuggingFace and transforming your own database into the standard format.
## Option 1: Download from HuggingFace (Original)

Use this to download the existing FutureBench dataset:

```bash
python download_data.py
```
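If you want to reproduce the download step manually, a minimal sketch using `huggingface_hub` looks like this; the repo id below is a placeholder, not the actual FutureBench repository (the real names live in `config.py`):

```python
# Minimal download sketch; the repo id is a placeholder (see config.py
# for the real repository names).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="your-org/futurebench",
    repo_type="dataset",
)
print(f"Dataset cached at: {local_dir}")
```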
## Option 2: Transform Your Own Database

Use this to transform your production database into HuggingFace format:

### Setup

- Install dependencies:

  ```bash
  pip install pandas sqlalchemy huggingface_hub
  ```

- Set up your HuggingFace token:

  ```bash
  export HF_TOKEN="your_huggingface_token_here"
  ```

- Configure your settings by editing `config_db.py` to match your needs (see the sketch after this list):
  - Update `HF_CONFIG` with your HuggingFace repository names
  - Adjust `PROCESSING_CONFIG` for data filtering preferences
  - Note: the database connection uses the same setup as the main FutureBench app
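As a rough illustration of the two settings, `config_db.py` might contain dicts like the following; every key and value here is an assumption for the sketch, since only the `HF_CONFIG` and `PROCESSING_CONFIG` names are given above:

```python
# Illustrative only -- all keys and values below are assumptions;
# check config_db.py for the actual structure.
HF_CONFIG = {
    "repo_id": "your-org/futurebench",  # hypothetical dataset repo
    "repo_type": "dataset",
}

PROCESSING_CONFIG = {
    "min_created_at": "2024-01-01",  # hypothetical filter: drop older predictions
    "include_unresolved": True,      # hypothetical filter: keep events with no result yet
}
```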
### Usage

```bash
# Transform your database and upload to HuggingFace
python db_to_hf.py

# Or run locally without uploading
HF_TOKEN="" python db_to_hf.py
```
## Database Schema

The script uses the same database schema as the main FutureBench application:

- `EventBase` model for events
- `Prediction` model for predictions
- Uses SQLAlchemy ORM (same as `convert_to_csv.py`)
No additional database configuration needed - it uses the existing FutureBench database connection.
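For orientation, the two models might look roughly like this; the column names are inferred from the dataset fields documented below, while the table names and column types are assumptions (the real definitions live in the main FutureBench app):

```python
# Illustrative SQLAlchemy models; table names and column types are assumptions.
from sqlalchemy import Column, DateTime, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class EventBase(Base):
    __tablename__ = "events"
    event_id = Column(Integer, primary_key=True)
    question = Column(String)
    event_type = Column(String)      # e.g. "polymarket", "soccer"
    answer_options = Column(String)  # JSON-encoded list of possible answers
    result = Column(String, nullable=True)
    open_to_bet_until = Column(DateTime)

class Prediction(Base):
    __tablename__ = "predictions"
    id = Column(Integer, primary_key=True)
    event_id = Column(Integer, ForeignKey("events.event_id"))
    algorithm_name = Column(String)
    actual_prediction = Column(String)
    prediction_created_at = Column(DateTime)
```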
## Output Format

The script produces data in the same format as the original FutureBench dataset:

```csv
event_id,question,event_type,algorithm_name,actual_prediction,result,open_to_bet_until,prediction_created_at
```
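A minimal sketch of producing that CSV, assuming the models sketched above and a hypothetical connection URL (the real pipeline is `db_to_hf.py`):

```python
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.orm import Session

# Join events to their predictions and emit the columns listed above.
# The connection URL is a placeholder; in practice the script reuses the
# app's existing connection. EventBase and Prediction are the models
# sketched in the Database Schema section.
engine = create_engine("postgresql://user:pass@localhost/futurebench")
with Session(engine) as session:
    rows = (
        session.query(EventBase, Prediction)
        .join(Prediction, Prediction.event_id == EventBase.event_id)
        .all()
    )

df = pd.DataFrame(
    [
        {
            "event_id": e.event_id,
            "question": e.question,
            "event_type": e.event_type,
            "algorithm_name": p.algorithm_name,
            "actual_prediction": p.actual_prediction,
            "result": e.result,
            "open_to_bet_until": e.open_to_bet_until,
            "prediction_created_at": p.prediction_created_at,
        }
        for e, p in rows
    ]
)
df.to_csv("futurebench.csv", index=False)
```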
## Automation

You can run this as a scheduled job:

```bash
# Add to crontab to run daily at 2 AM
0 2 * * * cd /path/to/your/project && python leaderboard/process_data/db_to_hf.py
```
## Files

- `download_data.py` - Downloads data from HuggingFace repositories
- `db_to_hf.py` - Transforms your database to HuggingFace format
- `config_db.py` - Configuration for database connection and HF settings
- `config.py` - HuggingFace repository configuration
- `requirements.txt` - Python dependencies
## Data Structure

The main dataset contains:

- `event_id`: Unique identifier for each event
- `question`: The prediction question
- `event_type`: Type of event (polymarket, soccer, etc.)
- `answer_options`: Possible answers in JSON format
- `result`: Actual outcome (if resolved)
- `algorithm_name`: AI model that made the prediction
- `actual_prediction`: The prediction made
- `open_to_bet_until`: Prediction window deadline
- `prediction_created_at`: When the prediction was made
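An example record, with every value invented for illustration:

```python
# All values below are made up to show the shape of a row.
sample_row = {
    "event_id": 1234,
    "question": "Will Team A beat Team B?",
    "event_type": "soccer",
    "answer_options": '["Team A", "Team B", "Draw"]',  # JSON-encoded string
    "result": "Team A",                                # empty until resolved
    "algorithm_name": "example-model",                 # hypothetical model name
    "actual_prediction": "Team A",
    "open_to_bet_until": "2025-02-28T23:59:59Z",
    "prediction_created_at": "2025-02-20T14:02:11Z",
}
```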
## Output

The script generates:

- Downloaded datasets in local cache folders
- `evaluation_queue.csv` with unique events for processing
- Console output with data statistics and summary
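Since `evaluation_queue.csv` holds unique events, one plausible way to derive it from the full CSV looks like this (file names and column selection are assumptions; the actual selection logic lives in `db_to_hf.py`):

```python
import pandas as pd

# Keep one row per event: predictions repeat event rows in the full CSV,
# so deduplicate on event_id to build the evaluation queue.
df = pd.read_csv("futurebench.csv")
event_cols = ["event_id", "question", "event_type", "result", "open_to_bet_until"]
queue = df[event_cols].drop_duplicates(subset="event_id")
queue.to_csv("evaluation_queue.csv", index=False)
```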