Synthetic Data Generation Instructions

Step 1: Configure Parameters

Go to the home directory of the repository (do not run the scripts from inside reproduce_results).
To generate data for a specific method (e.g., RCT), run the following bash script:
```
bash reproduce_results/create_data/create_rct_data.sh
```
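Because the script path is relative, running it from the wrong directory fails. A minimal sketch of a guard that checks the working directory first; the function name `require_repo_root` is hypothetical and not part of the repository:

```bash
# Hypothetical helper: succeed only if the script path resolves from the
# current directory, i.e. we are in the home directory of the repository.
require_repo_root() {
  local script="${1:-reproduce_results/create_data/create_rct_data.sh}"
  if [ ! -f "$script" ]; then
    echo "Cannot find $script; run this from the home directory." >&2
    return 1
  fi
}
```

It could be used as `require_repo_root && bash reproduce_results/create_data/create_rct_data.sh`.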

Output

Note: The output locations below are described with respect to the default parameters in settings.sh. They may differ if the names are modified in settings.sh.

Datasets will be saved to: samples/synthetic/rct/data/
A metadata file will be created at: samples/synthetic/rct/metadata/rct.json
The metadata file contains the following information about the synthetic data:
- True effects
- Number of observations
- Number of continuous covariates
- Number of binary covariates
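
To inspect the metadata after generation, something like the following may help. It is a sketch only: the function name `show_metadata` is hypothetical, the default path is the one given above for the default settings, and no assumption is made about the key names inside the JSON file.

```bash
# Hypothetical helper: print the metadata file, pretty-printed when
# python3 is available, otherwise as-is.
show_metadata() {
  local meta_file="${1:-samples/synthetic/rct/metadata/rct.json}"
  if [ ! -f "$meta_file" ]; then
    echo "Metadata file not found: $meta_file" >&2
    return 1
  fi
  if command -v python3 >/dev/null 2>&1; then
    python3 -m json.tool "$meta_file"
  else
    cat "$meta_file"
  fi
}
```

Calling `show_metadata` with no argument reads the default RCT path; pass another path if the names were changed in settings.sh.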

To generate synthetic data for all methods in one go:

```
bash reproduce_results/create_synthetic_data_all.sh
```

Go to the home directory of the repository.
To generate column labels, backstory, and query for datasets related to a specific method (e.g., RCT), run:
```
bash reproduce_results/create_context/create_context_rct.sh
```

Output: The GPT-generated information will be saved to: samples/synthetic/rct/description/rct.json

To generate contextual information for all methods at once:

```
bash reproduce_results/create_context_all.sh
```

Then run the following command:

```
bash reproduce_results/finalize_synthetic_dataset.sh
```

The script generates two types of output files:

- CAIS Input Files: These contain all the information needed to run CAIS on the synthetic dataset. A separate file is created for each method (e.g., rct_info.csv for RCT). The files are saved to reproduce_results/samples/synthetic/data_info
- Renamed Dataset Files: The original columns (X1, X2, ..., Y, D) are renamed with the real-world variable names generated by GPT in the previous step. The files are saved to reproduce_results/samples/synthetic/synthetic_data
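
A quick way to confirm that both output directories were produced is sketched below. The function name `check_outputs` is hypothetical, and the default root assumes the output locations listed above.

```bash
# Hypothetical helper: report whether the two output directories exist
# and how many entries each one holds.
check_outputs() {
  local root="${1:-reproduce_results/samples/synthetic}"
  local missing=0
  local dir count
  for dir in "$root/data_info" "$root/synthetic_data"; do
    if [ -d "$dir" ]; then
      count=$(ls -1 "$dir" | wc -l | tr -d ' ')
      echo "OK: $dir ($count files)"
    else
      echo "MISSING: $dir" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Running `check_outputs` from the home directory returns a non-zero status if either directory is absent.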

Example outputs can be found in the samples/synthetic directory.
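
For convenience, the three "all methods" commands from this guide can be chained so that a failure stops the run. This is a sketch rather than a script shipped with the repository; the wrapper name `run_pipeline` is hypothetical, and it must be run from the home directory.

```bash
# Hypothetical wrapper: run data generation, context generation, and
# finalization in order, stopping at the first step that fails.
run_pipeline() {
  local step
  for step in \
    reproduce_results/create_synthetic_data_all.sh \
    reproduce_results/create_context_all.sh \
    reproduce_results/finalize_synthetic_dataset.sh
  do
    echo "Running $step"
    bash "$step" || { echo "Step failed: $step" >&2; return 1; }
  done
}
```

Stopping early matters here because the finalization step relies on the variable names produced by the context step.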