Synthetic Data Generation Instructions

Step 1: Configure Parameters

  1. Go to the reproduce_results folder
  2. Open settings.sh and configure the hyperparameters

Step 2: Generate Synthetic Data

For a Single Method

  • Go to the home directory (do not run the script from inside the reproduce_results folder)
  • To generate data for a specific method (e.g., RCT), run the following bash script:
    bash reproduce_results/create_data/create_rct_data.sh
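
Because the scripts are meant to be launched from the home directory, a small guard can catch the mistake of running them from inside reproduce_results. The sketch below is not part of the repository: the function names in_home_dir and run_from_home are hypothetical, and it assumes that the presence of reproduce_results/settings.sh is a good enough sign that the current directory is the home directory.

```shell
# Hypothetical helpers (not shipped with the repository).

# Succeed only if the current directory looks like the home directory,
# i.e. it contains reproduce_results/settings.sh.
in_home_dir() {
  [ -f "reproduce_results/settings.sh" ]
}

# Run a script with bash, but only when called from the home directory.
run_from_home() {
  if ! in_home_dir; then
    echo "error: run this from the home directory, not from inside reproduce_results" >&2
    return 1
  fi
  bash "$@"
}
```

With these defined, `run_from_home reproduce_results/create_data/create_rct_data.sh` behaves like the plain bash call, except that it refuses to start from the wrong directory.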
    

Output

Note: The output locations below assume the default parameters in settings.sh. They may differ if the names are modified in settings.sh.

  • Datasets will be saved to: samples/synthetic/rct/data/
  • A metadata file will be created at: samples/synthetic/rct/metadata/rct.json
  • The metadata file contains the following information about the synthetic data:
      - True effects
      - Number of observations
      - Number of continuous covariates
      - Number of binary covariates
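
To confirm that the generation step wrote what is listed above, a quick existence check may help. This is a minimal sketch, not a script from the repository: check_rct_outputs is a hypothetical name, and the default base directory assumes the default parameters in settings.sh.

```shell
# Hypothetical helper: check that the RCT outputs described above exist.
# Usage: check_rct_outputs [base_dir]   (defaults to samples/synthetic/rct)
check_rct_outputs() {
  local base="${1:-samples/synthetic/rct}"
  local status=0

  # The data directory should exist and contain at least one entry.
  if [ -d "$base/data" ] && [ -n "$(ls -A "$base/data" 2>/dev/null)" ]; then
    echo "ok: datasets found in $base/data"
  else
    echo "missing: no datasets in $base/data" >&2
    status=1
  fi

  # The metadata file should exist and be non-empty.
  if [ -s "$base/metadata/rct.json" ]; then
    echo "ok: metadata found at $base/metadata/rct.json"
  else
    echo "missing: $base/metadata/rct.json" >&2
    status=1
  fi

  return "$status"
}
```

The function only checks that the files are present. It does not validate the contents of the metadata file.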

For All Methods

To generate synthetic data for all methods in one go:

bash reproduce_results/create_synthetic_data_all.sh

Step 3: Generate Contextual Information

For a Single Method

  1. Go to the home directory
  2. To generate column labels, backstory, and query for datasets related to a specific method (e.g., RCT), run:
    bash reproduce_results/create_context/create_context_rct.sh
    

Output: The GPT-generated column labels, backstory, and query will be saved to: samples/synthetic/rct/description/rct.json

For All Methods

To generate contextual information for all methods at once:

bash reproduce_results/create_context_all.sh

Step 4: Generate Summary Files

  • Go to the home directory
  • Then run the following command:
    bash reproduce_results/finalize_synthetic_dataset.sh
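
Steps 2 to 4 can also be chained for all methods, as long as each script finishes successfully before the next one starts. The wrapper below is a sketch under that assumption. The function name run_all_steps is hypothetical, while the three script paths are the ones given in this document. It must be called from the home directory.

```shell
# Hypothetical wrapper: run Steps 2-4 in order and stop at the first failure.
run_all_steps() {
  local script
  for script in \
    reproduce_results/create_synthetic_data_all.sh \
    reproduce_results/create_context_all.sh \
    reproduce_results/finalize_synthetic_dataset.sh
  do
    echo "running $script"
    # Abort the chain if a step exits with a non-zero status.
    bash "$script" || { echo "failed: $script" >&2; return 1; }
  done
}
```

Stopping early matters here because the final step renames columns using the names produced by the context step, so running it after a failed context step would not be meaningful.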
    

Output Files

The script generates two types of output files:

  1. CAIS Input Files
      - They contain all information needed to run CAIS on the synthetic dataset.
      - A separate file is created for each method (rct_info.csv for RCT).
      - Files are saved to reproduce_results/samples/synthetic/data_info

  2. Renamed Dataset Files
      - Original columns (X1, X2, ..., Y, D) are renamed with real-world variable names generated by GPT in the previous step.
      - Files are saved to reproduce_results/samples/synthetic/synthetic_data
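
One way to see whether the renaming took effect is to print the header row of a dataset file before and after this step. The helper below is a sketch only: list_columns is a hypothetical name, and it assumes the dataset files are comma-separated with a header on the first line and no quoted commas in the column names.

```shell
# Hypothetical helper: print the column names of a comma-separated file,
# one per line. Assumes the first line is the header row.
list_columns() {
  head -n 1 "$1" | tr ',' '\n'
}
```

Calling list_columns on a file in reproduce_results/samples/synthetic/synthetic_data should then show descriptive names rather than X1, X2, ..., Y, D.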

Sample Results

Example outputs can be found in the samples/synthetic directory.