# FutureBench Dataset Processing

This directory contains tools for processing FutureBench datasets: downloading the existing data from HuggingFace, or transforming your own database into the standard format.
## Option 1: Download from HuggingFace (Original)

Use this to download the existing FutureBench dataset:

```bash
python download_data.py
```
## Option 2: Transform Your Own Database

Use this to transform your production database into the HuggingFace format.

### Setup

1. **Install dependencies:**

```bash
pip install pandas sqlalchemy huggingface_hub
```

2. **Set up your HuggingFace token:**

```bash
export HF_TOKEN="your_huggingface_token_here"
```
3. **Configure your settings:**

Edit `config_db.py` to match your needs:

- Update `HF_CONFIG` with your HuggingFace repository names
- Adjust `PROCESSING_CONFIG` for data-filtering preferences
- Note: the database connection uses the same setup as the main FutureBench app
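The exact contents of `config_db.py` depend on the repository; a hypothetical sketch of what the two dictionaries might look like (all keys and values here are assumptions, not the real configuration):

```python
# Hypothetical sketch of config_db.py -- key names and values are
# illustrative assumptions; check the real file for the actual keys.
HF_CONFIG = {
    "repo_id": "your-org/futurebench-data",  # target HuggingFace dataset repo
    "repo_type": "dataset",
}

PROCESSING_CONFIG = {
    "only_resolved_events": False,   # whether to drop unresolved events
    "min_predictions_per_event": 1,  # filter out events with no predictions
}
```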
### Usage

```bash
# Transform your database and upload to HuggingFace
python db_to_hf.py

# Or run locally without uploading
HF_TOKEN="" python db_to_hf.py
```
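Since an empty `HF_TOKEN` switches the script to local-only mode, it can help to confirm which mode you are in before a run. A minimal sketch (this check is not part of `db_to_hf.py` itself):

```python
import os

# Read the same environment variable the export above sets; an empty or
# missing value means the transform runs locally without uploading.
token = os.environ.get("HF_TOKEN", "")
if token:
    print("HF_TOKEN is set: uploads to HuggingFace are enabled")
else:
    print("HF_TOKEN is empty: running locally without uploading")
```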
### Database Schema

The script uses the same database schema as the main FutureBench application:

- `EventBase` model for events
- `Prediction` model for predictions
- Uses the SQLAlchemy ORM (same as `convert_to_csv.py`)

No additional database configuration is needed; the script reuses the existing FutureBench database connection.
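The two ORM models named above might be declared along these lines. This is a hypothetical sketch: the model names come from this README, but every column, table name, and relationship is an assumption inferred from the output fields, not the real FutureBench schema.

```python
from sqlalchemy import Column, DateTime, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

# Hypothetical sketch of the two models; columns are assumptions.
class EventBase(Base):
    __tablename__ = "events"
    id = Column(Integer, primary_key=True)
    question = Column(String)
    event_type = Column(String)
    result = Column(String, nullable=True)       # empty until resolved
    open_to_bet_until = Column(DateTime)
    predictions = relationship("Prediction", back_populates="event")

class Prediction(Base):
    __tablename__ = "predictions"
    id = Column(Integer, primary_key=True)
    event_id = Column(Integer, ForeignKey("events.id"))
    algorithm_name = Column(String)              # which AI model predicted
    actual_prediction = Column(String)
    created_at = Column(DateTime)
    event = relationship("EventBase", back_populates="predictions")
```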
### Output Format

The script produces data in the same format as the original FutureBench dataset:

- `event_id`, `question`, `event_type`, `algorithm_name`, `actual_prediction`, `result`, `open_to_bet_until`, `prediction_created_at`
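One way this flattening could work is one output row per prediction, with the event fields joined onto each prediction. A sketch under that assumption (illustrative data only, not the actual `db_to_hf.py` logic):

```python
import pandas as pd

# Sketch: join event columns onto each prediction so every output row
# carries the full column list above (values are made up).
events = pd.DataFrame([
    {"event_id": 1, "question": "Will X happen?", "event_type": "polymarket",
     "result": "yes", "open_to_bet_until": "2025-01-01"},
])
predictions = pd.DataFrame([
    {"event_id": 1, "algorithm_name": "gpt-4", "actual_prediction": "yes",
     "prediction_created_at": "2024-12-01"},
])
flat = predictions.merge(events, on="event_id", how="left")
print(flat.columns.tolist())
```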
### Automation

You can run this as a scheduled job:

```bash
# Add to crontab to run daily at 2 AM
0 2 * * * cd /path/to/your/project && python leaderboard/process_data/db_to_hf.py
```
## Files

- `download_data.py` - Downloads data from HuggingFace repositories
- `db_to_hf.py` - Transforms your database into HuggingFace format
- `config_db.py` - Configuration for the database connection and HF settings
- `config.py` - HuggingFace repository configuration
- `requirements.txt` - Python dependencies
## Data Structure

The main dataset contains:

- `event_id`: Unique identifier for each event
- `question`: The prediction question
- `event_type`: Type of event (polymarket, soccer, etc.)
- `answer_options`: Possible answers, in JSON format
- `result`: Actual outcome (if resolved)
- `algorithm_name`: AI model that made the prediction
- `actual_prediction`: The prediction that was made
- `open_to_bet_until`: Deadline of the prediction window
- `prediction_created_at`: When the prediction was made
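Since `answer_options` is JSON-encoded, it needs to be parsed before use. An illustrative record (all values made up):

```python
import json

# Illustrative record -- note answer_options is a JSON string, not a list.
row = {
    "event_id": 42,
    "question": "Will the match end in a draw?",
    "event_type": "soccer",
    "answer_options": '["yes", "no"]',
    "result": None,  # unresolved event
}
options = json.loads(row["answer_options"])
print(options)  # -> ['yes', 'no']
```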
## Output

The script generates:

- Downloaded datasets in local cache folders
- `evaluation_queue.csv` with unique events for processing
- Console output with data statistics and a summary
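"Unique events" presumably means one row per `event_id`, deduplicated from the per-prediction table. A sketch under that assumption (not the actual script's logic; data and column selection are illustrative):

```python
import pandas as pd

# Sketch: derive an evaluation queue of unique events from the flat
# per-prediction table; drop_duplicates keeps the first row per event_id.
flat = pd.DataFrame([
    {"event_id": 1, "question": "Q1", "algorithm_name": "a"},
    {"event_id": 1, "question": "Q1", "algorithm_name": "b"},
    {"event_id": 2, "question": "Q2", "algorithm_name": "a"},
])
queue = flat.drop_duplicates(subset="event_id")[["event_id", "question"]]
queue.to_csv("evaluation_queue.csv", index=False)
print(len(queue))  # -> 2
```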