# FutureBench Dataset Processing

This directory contains tools for processing FutureBench datasets: downloading the existing dataset from HuggingFace, or transforming your own database into the standard format.

## Option 1: Download from HuggingFace (Original)

Use this to download the existing FutureBench dataset:

```bash
python download_data.py
```

## Option 2: Transform Your Own Database

Use this to transform your production database into HuggingFace format.

### Setup

1. **Install dependencies:**

   ```bash
   pip install pandas sqlalchemy huggingface_hub
   ```

2. **Set up your HuggingFace token:**

   ```bash
   export HF_TOKEN="your_huggingface_token_here"
   ```

3. **Configure your settings:**

   Edit `config_db.py` to match your needs (a hedged sketch appears under "Example Sketches" below):
   - Update `HF_CONFIG` with your HuggingFace repository names
   - Adjust `PROCESSING_CONFIG` for data filtering preferences
   - Note: the database connection uses the same setup as the main FutureBench app

### Usage

```bash
# Transform your database and upload to HuggingFace
python db_to_hf.py

# Or run locally without uploading
HF_TOKEN="" python db_to_hf.py
```

### Database Schema

The script uses the same database schema as the main FutureBench application:

- `EventBase` model for events
- `Prediction` model for predictions
- Uses the SQLAlchemy ORM (same as `convert_to_csv.py`)

No additional database configuration is needed; the script reuses the existing FutureBench database connection. (A query sketch appears under "Example Sketches" below.)

### Output Format

The script produces data in the same format as the original FutureBench dataset:

- `event_id`, `question`, `event_type`, `algorithm_name`, `actual_prediction`, `result`, `open_to_bet_until`, `prediction_created_at`

### Automation

You can run the transform as a scheduled job:

```bash
# Add to crontab to run daily at 2 AM
0 2 * * * cd /path/to/your/project && python leaderboard/process_data/db_to_hf.py
```

## Files

- `download_data.py` - Downloads data from HuggingFace repositories
- `db_to_hf.py` - Transforms your database to HuggingFace format
- `config_db.py` - Configuration for database connection and HF settings
- `config.py` - HuggingFace repository configuration
- `requirements.txt` - Python dependencies

## Data Structure

The main dataset contains:

- `event_id`: Unique identifier for each event
- `question`: The prediction question
- `event_type`: Type of event (polymarket, soccer, etc.)
- `answer_options`: Possible answers in JSON format
- `result`: Actual outcome (if resolved)
- `algorithm_name`: AI model that made the prediction
- `actual_prediction`: The prediction made
- `open_to_bet_until`: Deadline of the prediction window
- `prediction_created_at`: When the prediction was made

An illustrative record appears under "Example Sketches" below.

## Output

The script generates:

- Downloaded datasets in local cache folders
- `evaluation_queue.csv` with unique events for processing
- Console output with data statistics and a summary
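
## Example Sketches

The sketches below illustrate the pieces described above. None of them are the actual FutureBench source; any name that does not appear earlier in this README is an assumption.

### Configuration (`config_db.py`)

A minimal sketch of what `config_db.py` might contain. `HF_CONFIG` and `PROCESSING_CONFIG` are named in this README, but the keys inside each dictionary are illustrative assumptions, not the real schema:

```python
# Hypothetical contents of config_db.py -- the key names below are
# illustrative assumptions, not the actual FutureBench configuration.

# HuggingFace repository settings (HF_CONFIG is named in this README;
# the keys inside are assumed).
HF_CONFIG = {
    "dataset_repo_id": "your-org/futurebench-data",  # assumed key and value
    "repo_type": "dataset",                          # assumed key
}

# Data filtering preferences (PROCESSING_CONFIG is named in this README;
# the keys inside are assumed).
PROCESSING_CONFIG = {
    "include_unresolved_events": False,        # assumed key
    "event_types": ["polymarket", "soccer"],   # assumed key; types from this README
}
```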
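### Database-to-DataFrame transform

A hedged sketch of the query step inside `db_to_hf.py`, assuming the `EventBase` and `Prediction` models and a SQLAlchemy engine can be imported from the main application (the module path `futurebench.db` and the attribute names on the models are assumptions). The selected columns match the output format listed above:

```python
# Sketch only: the module path, engine, and model attribute names are assumptions.
import pandas as pd
from sqlalchemy.orm import Session

from futurebench.db import EventBase, Prediction, engine  # assumed module path

def export_predictions() -> pd.DataFrame:
    """Join events with their predictions and return rows in the README's output format."""
    with Session(engine) as session:
        rows = (
            session.query(
                EventBase.event_id,
                EventBase.question,
                EventBase.event_type,
                Prediction.algorithm_name,
                Prediction.actual_prediction,
                EventBase.result,
                EventBase.open_to_bet_until,
                Prediction.created_at.label("prediction_created_at"),  # assumed attribute
            )
            .join(Prediction, Prediction.event_id == EventBase.event_id)
            .all()
        )
    return pd.DataFrame(
        rows,
        columns=[
            "event_id", "question", "event_type", "algorithm_name",
            "actual_prediction", "result", "open_to_bet_until",
            "prediction_created_at",
        ],
    )
```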
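### Uploading to HuggingFace

The upload step could look like the following. `HfApi.upload_file` is a real `huggingface_hub` API; the repository id is an assumption (in practice it would come from `HF_CONFIG`). Leaving `HF_TOKEN` empty mirrors the local-only mode shown in the usage section:

```python
# Sketch of the upload step; the repo id is an assumed placeholder.
import os
from huggingface_hub import HfApi

def upload_csv(path: str = "evaluation_queue.csv") -> None:
    token = os.environ.get("HF_TOKEN", "")
    if not token:
        # Mirrors the README's HF_TOKEN="" usage: run locally, skip the upload.
        print(f"HF_TOKEN not set; wrote {path} locally without uploading.")
        return
    api = HfApi(token=token)
    api.upload_file(
        path_or_fileobj=path,
        path_in_repo=path,
        repo_id="your-org/futurebench-data",  # assumed; take from HF_CONFIG
        repo_type="dataset",
    )
```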
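### Example record

For orientation, one row of the main dataset might look like the record below. Every value is invented purely to illustrate the fields listed under "Data Structure":

```python
# A purely illustrative record -- all values are made up for this sketch.
example_row = {
    "event_id": "evt_00042",                          # unique identifier
    "question": "Will team A beat team B on Sunday?", # prediction question
    "event_type": "soccer",                           # e.g. polymarket, soccer
    "answer_options": '["yes", "no"]',                # JSON-encoded options
    "result": "yes",                                  # actual outcome, if resolved
    "algorithm_name": "some-model-name",              # AI model that predicted
    "actual_prediction": "yes",                       # the prediction made
    "open_to_bet_until": "2025-01-01T00:00:00Z",      # prediction window deadline
    "prediction_created_at": "2024-12-30T12:00:00Z",  # when the prediction was made
}
```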