# FutureBench Dataset Processing
This directory contains tools for processing FutureBench datasets: downloading the existing dataset from HuggingFace, or transforming your own database into the standard format.
## Option 1: Download from HuggingFace (Original)
Use this to download the existing FutureBench dataset:
```bash
python download_data.py
```
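If you want to fetch the data directly instead, a minimal sketch using `huggingface_hub` follows; the repository name below is a placeholder (the actual repositories are configured in `config.py`):
```python
from huggingface_hub import snapshot_download

# Placeholder repo id -- the real repository names live in config.py.
snapshot_download(
    repo_id="your-org/futurebench",
    repo_type="dataset",
    local_dir="./futurebench_data",
)
```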
## Option 2: Transform Your Own Database
Use this to transform your production database into HuggingFace format:
### Setup
1. **Install dependencies:**
```bash
pip install pandas sqlalchemy huggingface_hub
```
2. **Set up HuggingFace token:**
```bash
export HF_TOKEN="your_huggingface_token_here"
```
3. **Configure your settings:**
Edit `config_db.py` to match your needs:
- Update `HF_CONFIG` with your HuggingFace repository names
- Adjust `PROCESSING_CONFIG` for data filtering preferences
- Note: the database connection uses the same setup as the main FutureBench app
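As a rough illustration, `config_db.py` might look like the following; the key names and values here are assumptions, so check the file itself for the real structure:
```python
# config_db.py -- illustrative sketch only; the real keys may differ.
HF_CONFIG = {
    "dataset_repo": "your-org/futurebench",          # hypothetical repo name
    "results_repo": "your-org/futurebench-results",  # hypothetical repo name
}

PROCESSING_CONFIG = {
    "resolved_events_only": False,  # hypothetical filtering flag
    "min_predictions": 1,           # hypothetical filtering flag
}
```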
### Usage
```bash
# Transform your database and upload to HuggingFace
python db_to_hf.py
# Or run locally without uploading
HF_TOKEN="" python db_to_hf.py
```
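The upload step inside `db_to_hf.py` presumably follows the standard `huggingface_hub` client pattern; a hedged sketch (file and repository names are placeholders, not the script's actual values):
```python
import os

from huggingface_hub import HfApi

# Skip the upload when no token is set (mirrors the HF_TOKEN="" trick above).
token = os.environ.get("HF_TOKEN")
if token:
    api = HfApi(token=token)
    api.upload_file(
        path_or_fileobj="futurebench_data.csv",  # placeholder local file
        path_in_repo="futurebench_data.csv",
        repo_id="your-org/futurebench",          # placeholder repo
        repo_type="dataset",
    )
```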
### Database Schema
The script uses the same database schema as the main FutureBench application:
- `EventBase` model for events
- `Prediction` model for predictions
- Uses SQLAlchemy ORM (same as `convert_to_csv.py`)
No additional database configuration is needed; the script uses the existing FutureBench database connection.
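A minimal sketch of the kind of ORM query involved (the import path and column names are assumptions inferred from the output schema, not the app's actual code):
```python
from sqlalchemy.orm import Session

# EventBase and Prediction are the app's existing models;
# the import path below is hypothetical.
from backend.models import EventBase, Prediction

def fetch_prediction_rows(session: Session):
    """Join each prediction to its event, as db_to_hf.py presumably does."""
    return (
        session.query(EventBase, Prediction)
        .join(Prediction, Prediction.event_id == EventBase.id)
        .all()
    )
```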
### Output Format
The script produces data in the same format as the original FutureBench dataset:
- `event_id`, `question`, `event_type`, `algorithm_name`, `actual_prediction`, `result`, `open_to_bet_until`, `prediction_created_at`
### Automation
You can run this as a scheduled job:
```bash
# Add to crontab to run daily at 2 AM
0 2 * * * cd /path/to/your/project && python leaderboard/process_data/db_to_hf.py
```
## Files
- `download_data.py` - Downloads data from HuggingFace repositories
- `db_to_hf.py` - Transforms your database to HuggingFace format
- `config_db.py` - Configuration for database connection and HF settings
- `config.py` - HuggingFace repository configuration
- `requirements.txt` - Python dependencies
## Data Structure
The main dataset contains:
- `event_id`: Unique identifier for each event
- `question`: The prediction question
- `event_type`: Type of event (polymarket, soccer, etc.)
- `answer_options`: Possible answers in JSON format
- `result`: Actual outcome (if resolved)
- `algorithm_name`: AI model that made the prediction
- `actual_prediction`: The prediction made
- `open_to_bet_until`: Prediction window deadline
- `prediction_created_at`: When prediction was made
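For orientation, a single record might look like this (all values are made up for illustration):
```python
example_record = {
    "event_id": 12345,                      # unique event identifier
    "question": "Will Team A win the match on 2025-07-20?",
    "event_type": "soccer",
    "answer_options": '["Team A", "Team B", "Draw"]',  # JSON-encoded string
    "result": "Team A",                     # empty until the event resolves
    "algorithm_name": "gpt-4o",             # hypothetical model name
    "actual_prediction": "Team A",
    "open_to_bet_until": "2025-07-20T18:00:00Z",
    "prediction_created_at": "2025-07-18T09:30:00Z",
}
```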
## Output
The script generates:
- Downloaded datasets in local cache folders
- `evaluation_queue.csv` with unique events for processing
- Console output with data statistics and summary
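Deriving `evaluation_queue.csv` amounts to deduplicating on `event_id`; a hedged pandas sketch (the input filename is a placeholder, and the script's actual logic may differ):
```python
import pandas as pd

# Placeholder input; in practice the frame comes from the database query
# or the downloaded dataset.
df = pd.read_csv("futurebench_data.csv")

# Keep one row per event for the evaluation queue.
queue = df.drop_duplicates(subset="event_id")
queue.to_csv("evaluation_queue.csv", index=False)
```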