# FutureBench Dataset Processing

This directory contains tools for processing FutureBench datasets: downloading the existing dataset from HuggingFace, or transforming your own database into the standard format.

## Option 1: Download from HuggingFace (Original)

Use this to download the existing FutureBench dataset:

```bash
python download_data.py
```

## Option 2: Transform Your Own Database

Use this to transform your production database into HuggingFace format.

### Setup

1. **Install dependencies:**

   ```bash
   pip install pandas sqlalchemy huggingface_hub
   ```

2. **Set up your HuggingFace token:**

   ```bash
   export HF_TOKEN="your_huggingface_token_here"
   ```

3. **Configure your settings:**

   Edit `config_db.py` to match your needs (a hedged sketch appears under "Example Sketches" below):
   - Update `HF_CONFIG` with your HuggingFace repository names
   - Adjust `PROCESSING_CONFIG` for data filtering preferences
   - Note: the database connection uses the same setup as the main FutureBench app

### Usage

```bash
# Transform your database and upload to HuggingFace
python db_to_hf.py

# Or run locally without uploading
HF_TOKEN="" python db_to_hf.py
```

### Database Schema

The script uses the same database schema as the main FutureBench application:

- `EventBase` model for events
- `Prediction` model for predictions
- Uses the SQLAlchemy ORM (same as `convert_to_csv.py`)

No additional database configuration is needed; the script reuses the existing FutureBench database connection. (A query sketch appears under "Example Sketches" below.)

### Output Format

The script produces data in the same format as the original FutureBench dataset:

- `event_id`, `question`, `event_type`, `algorithm_name`, `actual_prediction`, `result`, `open_to_bet_until`, `prediction_created_at`

### Automation

You can run the transform as a scheduled job:

```bash
# Add to crontab to run daily at 2 AM
0 2 * * * cd /path/to/your/project && python leaderboard/process_data/db_to_hf.py
```

## Files

- `download_data.py` - Downloads data from HuggingFace repositories
- `db_to_hf.py` - Transforms your database to HuggingFace format
- `config_db.py` - Configuration for database connection and HF settings
- `config.py` - HuggingFace repository configuration
- `requirements.txt` - Python dependencies

## Data Structure

The main dataset contains:

- `event_id`: Unique identifier for each event
- `question`: The prediction question
- `event_type`: Type of event (polymarket, soccer, etc.)
- `answer_options`: Possible answers in JSON format
- `result`: Actual outcome (if resolved)
- `algorithm_name`: AI model that made the prediction
- `actual_prediction`: The prediction made
- `open_to_bet_until`: Deadline of the prediction window
- `prediction_created_at`: When the prediction was made

An illustrative record appears under "Example Sketches" below.

## Output

The script generates:

- Downloaded datasets in local cache folders
- `evaluation_queue.csv` with unique events for processing
- Console output with data statistics and a summary
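
## Example Sketches

The sketches below illustrate the pieces described above. None of them are the actual FutureBench source; any name that does not appear earlier in this README is an assumption.

### Configuration (`config_db.py`)

A minimal sketch of what `config_db.py` might contain. `HF_CONFIG` and `PROCESSING_CONFIG` are named in this README, but the keys inside each dictionary are illustrative assumptions, not the real schema:

```python
# Hypothetical contents of config_db.py -- the key names below are
# illustrative assumptions, not the actual FutureBench configuration.

# HuggingFace repository settings (HF_CONFIG is named in this README;
# the keys inside are assumed).
HF_CONFIG = {
    "dataset_repo_id": "your-org/futurebench-data",  # assumed key and value
    "repo_type": "dataset",                          # assumed key
}

# Data filtering preferences (PROCESSING_CONFIG is named in this README;
# the keys inside are assumed).
PROCESSING_CONFIG = {
    "include_unresolved_events": False,        # assumed key
    "event_types": ["polymarket", "soccer"],   # assumed key; types from this README
}
```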
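### Database-to-DataFrame transform

A hedged sketch of the query step inside `db_to_hf.py`, assuming the `EventBase` and `Prediction` models and a SQLAlchemy engine can be imported from the main application (the module path `futurebench.db` and the attribute names on the models are assumptions). The selected columns match the output format listed above:

```python
# Sketch only: the module path, engine, and model attribute names are assumptions.
import pandas as pd
from sqlalchemy.orm import Session

from futurebench.db import EventBase, Prediction, engine  # assumed module path

def export_predictions() -> pd.DataFrame:
    """Join events with their predictions and return rows in the README's output format."""
    with Session(engine) as session:
        rows = (
            session.query(
                EventBase.event_id,
                EventBase.question,
                EventBase.event_type,
                Prediction.algorithm_name,
                Prediction.actual_prediction,
                EventBase.result,
                EventBase.open_to_bet_until,
                Prediction.created_at.label("prediction_created_at"),  # assumed attribute
            )
            .join(Prediction, Prediction.event_id == EventBase.event_id)
            .all()
        )
    return pd.DataFrame(
        rows,
        columns=[
            "event_id", "question", "event_type", "algorithm_name",
            "actual_prediction", "result", "open_to_bet_until",
            "prediction_created_at",
        ],
    )
```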
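### Uploading to HuggingFace

The upload step could look like the following. `HfApi.upload_file` is a real `huggingface_hub` API; the repository id is an assumption (in practice it would come from `HF_CONFIG`). Leaving `HF_TOKEN` empty mirrors the local-only mode shown in the usage section:

```python
# Sketch of the upload step; the repo id is an assumed placeholder.
import os
from huggingface_hub import HfApi

def upload_csv(path: str = "evaluation_queue.csv") -> None:
    token = os.environ.get("HF_TOKEN", "")
    if not token:
        # Mirrors the README's HF_TOKEN="" usage: run locally, skip the upload.
        print(f"HF_TOKEN not set; wrote {path} locally without uploading.")
        return
    api = HfApi(token=token)
    api.upload_file(
        path_or_fileobj=path,
        path_in_repo=path,
        repo_id="your-org/futurebench-data",  # assumed; take from HF_CONFIG
        repo_type="dataset",
    )
```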
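### Example record

For orientation, one row of the main dataset might look like the record below. Every value is invented purely to illustrate the fields listed under "Data Structure":

```python
# A purely illustrative record -- all values are made up for this sketch.
example_row = {
    "event_id": "evt_00042",                          # unique identifier
    "question": "Will team A beat team B on Sunday?", # prediction question
    "event_type": "soccer",                           # e.g. polymarket, soccer
    "answer_options": '["yes", "no"]',                # JSON-encoded options
    "result": "yes",                                  # actual outcome, if resolved
    "algorithm_name": "some-model-name",              # AI model that predicted
    "actual_prediction": "yes",                       # the prediction made
    "open_to_bet_until": "2025-01-01T00:00:00Z",      # prediction window deadline
    "prediction_created_at": "2024-12-30T12:00:00Z",  # when the prediction was made
}
```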