Synthetic Data Generation Instructions
Step 1: Configure Parameters
- Go to the reproduce_results folder
- Open settings.sh and configure the hyperparameters
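Before editing, it can help to see which values are currently set. The following is a minimal sketch that assumes settings.sh stores its hyperparameters as plain NAME=value lines, optionally prefixed with export; that layout is an assumption, not something this guide confirms, and list_settings is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: assumes settings.sh holds plain NAME=value lines
# (optionally prefixed with "export"); adjust the pattern if it does not.
list_settings() {
    local file="$1"
    if [ ! -f "$file" ]; then
        echo "settings file not found: $file" >&2
        return 1
    fi
    # Print only the assignment lines, skipping comments and other commands.
    grep -E '^[[:space:]]*(export[[:space:]]+)?[A-Za-z_][A-Za-z0-9_]*=' "$file"
}

# Path is relative to the home directory of the repository.
list_settings reproduce_results/settings.sh || true
```

If the file uses a different layout, opening it directly in an editor is the safer option.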
Step 2: Generate Synthetic Data
For a Single Method
- Go to the home directory of the repository (do not run the script from inside the reproduce_results folder)
- To generate data for a specific method (e.g., RCT), run the following bash script:
bash reproduce_results/create_data/create_rct_data.sh
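Because the script path is relative, running it from the wrong directory is an easy mistake. The guard below is a hedged sketch: in_repo_root is a hypothetical helper, and it assumes the home directory can be recognized by the presence of reproduce_results/settings.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if the current directory contains
# reproduce_results/settings.sh, i.e. we appear to be in the home directory.
in_repo_root() {
    [ -d "reproduce_results" ] && [ -f "reproduce_results/settings.sh" ]
}

if in_repo_root; then
    bash reproduce_results/create_data/create_rct_data.sh
else
    echo "Run this from the home directory, not from inside reproduce_results." >&2
fi
```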
Output
Note: The output locations below assume the default parameters in settings.sh. They may differ if the names are modified in settings.sh.
- Datasets will be saved to:
samples/synthetic/rct/data/
- A metadata file will be created at:
samples/synthetic/rct/metadata/rct.json
- The metadata file contains the following information about the synthetic data:
  - True effects
  - Number of observations
  - Number of continuous covariates
  - Number of binary covariates
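To confirm that generation succeeded, the outputs listed above can be inspected from the home directory. This is a sketch under the default paths; show_json is a hypothetical helper, and it pretty-prints with python3 -m json.tool when python3 is available, falling back to cat otherwise. The key names inside rct.json are not documented here, so the sketch prints the whole file rather than selecting fields.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pretty-print a JSON file if it exists.
show_json() {
    local path="$1"
    if [ ! -f "$path" ]; then
        echo "not found: $path" >&2
        return 1
    fi
    if command -v python3 >/dev/null 2>&1; then
        python3 -m json.tool "$path"
    else
        cat "$path"   # fallback: print the file unformatted
    fi
}

# Default RCT output locations described in this guide.
ls samples/synthetic/rct/data/ 2>/dev/null || echo "no datasets generated yet"
show_json samples/synthetic/rct/metadata/rct.json || true
```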
For All Methods
To generate synthetic data for all methods in one go:
bash reproduce_results/create_synthetic_data_all.sh
Step 3: Generate Contextual Information
For a Single Method
- Go to the home directory
- To generate column labels, backstory, and query for datasets related to a specific method (e.g., RCT), run:
bash reproduce_results/create_context/create_context_rct.sh
Output: GPT-generated information will be saved to:
samples/synthetic/rct/description/rct.json
For All Methods
To generate contextual information for all methods at once:
bash reproduce_results/create_context_all.sh
Step 4: Generate Summary Files
- Go to the home directory
- Then run the following command:
bash reproduce_results/finalize_synthetic_dataset.sh
Output Files
The script generates two types of output files:
- CAIS Input Files: These contain all the information needed to run CAIS on the synthetic dataset. A separate file is created for each method (e.g., rct_info.csv for RCT). Files are saved to:
reproduce_results/samples/synthetic/data_info
- Renamed Dataset Files: The original columns (X1, X2, ..., Y, D) are renamed with the real-world variable names generated by GPT in Step 3. Files are saved to:
reproduce_results/samples/synthetic/synthetic_data
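Steps 2 through 4 can also be chained for all methods in one run. The wrapper below is a sketch, not part of the repository: run_pipeline is a hypothetical function that calls the three scripts named in this guide in order and stops at the first failure. It assumes settings.sh is already configured and that it is run from the home directory.

```shell
#!/usr/bin/env bash
# Sketch: run Steps 2-4 for all methods in order, stopping at the first failure.
run_pipeline() {
    local script
    for script in \
        reproduce_results/create_synthetic_data_all.sh \
        reproduce_results/create_context_all.sh \
        reproduce_results/finalize_synthetic_dataset.sh
    do
        if [ ! -f "$script" ]; then
            echo "missing script: $script" >&2
            return 1
        fi
        echo "==> $script"
        # Later steps depend on earlier outputs, so abort on any error.
        bash "$script" || return 1
    done
}

run_pipeline || echo "pipeline stopped early" >&2
```

Stopping early matters here because Step 3 reads the datasets from Step 2, and Step 4 reads the GPT-generated descriptions from Step 3.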
Sample Results
Example outputs can be found in the samples/synthetic directory.