# FutureBench Dataset Processing
This directory contains tools for processing FutureBench datasets: downloading the existing dataset from HuggingFace, or transforming your own database into the standard format.
## Option 1: Download from HuggingFace (Original)
Use this to download the existing FutureBench dataset:
```bash
python download_data.py
```
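If you want to fetch the data directly instead, a minimal sketch using `huggingface_hub` follows; the repository name below is a placeholder (the actual repositories are configured in `config.py`):
```python
from huggingface_hub import snapshot_download

# Placeholder repo id -- the real repository names live in config.py.
snapshot_download(
    repo_id="your-org/futurebench",
    repo_type="dataset",
    local_dir="./futurebench_data",
)
```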
## Option 2: Transform Your Own Database
Use this to transform your production database into HuggingFace format:
### Setup
1. **Install dependencies:**
```bash
pip install pandas sqlalchemy huggingface_hub
```
2. **Set up HuggingFace token:**
```bash
export HF_TOKEN="your_huggingface_token_here"
```
3. **Configure your settings:**
Edit `config_db.py` to match your needs:
- Update `HF_CONFIG` with your HuggingFace repository names
- Adjust `PROCESSING_CONFIG` for data filtering preferences
- Note: the database connection uses the same setup as the main FutureBench app
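As a rough illustration, `config_db.py` might look like the following; the key names and values here are assumptions, so check the file itself for the real structure:
```python
# config_db.py -- illustrative sketch only; the real keys may differ.
HF_CONFIG = {
    "dataset_repo": "your-org/futurebench",          # hypothetical repo name
    "results_repo": "your-org/futurebench-results",  # hypothetical repo name
}

PROCESSING_CONFIG = {
    "resolved_events_only": False,  # hypothetical filtering flag
    "min_predictions": 1,           # hypothetical filtering flag
}
```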
### Usage
```bash
# Transform your database and upload to HuggingFace
python db_to_hf.py
# Or run locally without uploading
HF_TOKEN="" python db_to_hf.py
```
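The upload step inside `db_to_hf.py` presumably follows the standard `huggingface_hub` client pattern; a hedged sketch (file and repository names are placeholders, not the script's actual values):
```python
import os

from huggingface_hub import HfApi

# Skip the upload when no token is set (mirrors the HF_TOKEN="" trick above).
token = os.environ.get("HF_TOKEN")
if token:
    api = HfApi(token=token)
    api.upload_file(
        path_or_fileobj="futurebench_data.csv",  # placeholder local file
        path_in_repo="futurebench_data.csv",
        repo_id="your-org/futurebench",          # placeholder repo
        repo_type="dataset",
    )
```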
### Database Schema
The script uses the same database schema as the main FutureBench application:
- `EventBase` model for events
- `Prediction` model for predictions
- Uses SQLAlchemy ORM (same as `convert_to_csv.py`)
No additional database configuration is needed; the script uses the existing FutureBench database connection.
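A minimal sketch of the kind of ORM query involved (the import path and column names are assumptions inferred from the output schema, not the app's actual code):
```python
from sqlalchemy.orm import Session

# EventBase and Prediction are the app's existing models;
# the import path below is hypothetical.
from backend.models import EventBase, Prediction

def fetch_prediction_rows(session: Session):
    """Join each prediction to its event, as db_to_hf.py presumably does."""
    return (
        session.query(EventBase, Prediction)
        .join(Prediction, Prediction.event_id == EventBase.id)
        .all()
    )
```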
### Output Format
The script produces data in the same format as the original FutureBench dataset:
- `event_id`, `question`, `event_type`, `algorithm_name`, `actual_prediction`, `result`, `open_to_bet_until`, `prediction_created_at`
### Automation
You can run this as a scheduled job:
```bash
# Add to crontab to run daily at 2 AM
0 2 * * * cd /path/to/your/project && python leaderboard/process_data/db_to_hf.py
```
## Files
- `download_data.py` - Downloads data from HuggingFace repositories
- `db_to_hf.py` - Transforms your database to HuggingFace format
- `config_db.py` - Configuration for database connection and HF settings
- `config.py` - HuggingFace repository configuration
- `requirements.txt` - Python dependencies
## Data Structure
The main dataset contains:
- `event_id`: Unique identifier for each event
- `question`: The prediction question
- `event_type`: Type of event (polymarket, soccer, etc.)
- `answer_options`: Possible answers in JSON format
- `result`: Actual outcome (if resolved)
- `algorithm_name`: AI model that made the prediction
- `actual_prediction`: The prediction made
- `open_to_bet_until`: Prediction window deadline
- `prediction_created_at`: When prediction was made
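For orientation, a single record might look like this (all values are made up for illustration):
```python
example_record = {
    "event_id": 12345,                      # unique event identifier
    "question": "Will Team A win the match on 2025-07-20?",
    "event_type": "soccer",
    "answer_options": '["Team A", "Team B", "Draw"]',  # JSON-encoded string
    "result": "Team A",                     # empty until the event resolves
    "algorithm_name": "gpt-4o",             # hypothetical model name
    "actual_prediction": "Team A",
    "open_to_bet_until": "2025-07-20T18:00:00Z",
    "prediction_created_at": "2025-07-18T09:30:00Z",
}
```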
## Output
The script generates:
- Downloaded datasets in local cache folders
- `evaluation_queue.csv` with unique events for processing
- Console output with data statistics and summary
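Deriving `evaluation_queue.csv` amounts to deduplicating on `event_id`; a hedged pandas sketch (the input filename is a placeholder, and the script's actual logic may differ):
```python
import pandas as pd

# Placeholder input; in practice the frame comes from the database query
# or the downloaded dataset.
df = pd.read_csv("futurebench_data.csv")

# Keep one row per event for the evaluation queue.
queue = df.drop_duplicates(subset="event_id")
queue.to_csv("evaluation_queue.csv", index=False)
```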