# Synthetic Data Generation Instructions
## Step 1: Configure Parameters
1. Go to the `reproduce_results` folder
2. Open `settings.sh` and configure the hyperparameters
## Step 2: Generate Synthetic Data
### For a Single Method
- Go to the home directory (do not run the script from inside `reproduce_results`)
- To generate data for a specific method (e.g., RCT), run the following bash script:
```bash
bash reproduce_results/create_data/create_rct_data.sh
```
**Output**
***Note:*** The output locations below assume the default parameters in `settings.sh`. They may differ if the names are modified in `settings.sh`.
- Datasets will be saved to: `samples/synthetic/rct/data/`
- A metadata file will be created at: `samples/synthetic/rct/metadata/rct.json`
- The metadata file contains the following information about the synthetic data:
  - True effects
  - Number of observations
  - Number of continuous covariates
  - Number of binary covariates
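
To check the generated metadata quickly, the file can be pretty-printed from the home directory. The following is a minimal sketch, not part of the repository: the helper name `show_metadata` is hypothetical, it assumes `python3` is available on the `PATH`, and it makes no assumption about the key names used inside the JSON file.

```bash
# Hypothetical helper: pretty-print a metadata JSON file if it exists.
# The default path is the RCT metadata location listed above.
show_metadata() {
  local path="${1:-samples/synthetic/rct/metadata/rct.json}"
  if [ ! -f "$path" ]; then
    echo "Metadata file not found: $path (are you in the home directory?)" >&2
    return 1
  fi
  # json.tool ships with the Python standard library; it validates and indents JSON.
  python3 -m json.tool "$path"
}
```

Calling `show_metadata` with no argument reads the default RCT path; pass a different path to inspect the metadata of another method.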
### For All Methods
To generate synthetic data for all methods in one go:
```bash
bash reproduce_results/create_synthetic_data_all.sh
```
## Step 3: Generate Contextual Information
### For a Single Method
1. Go to the home directory
2. To generate column labels, backstory, and query for datasets related to a specific method (e.g., RCT), run:
```bash
bash reproduce_results/create_context/create_context_rct.sh
```
**Output:** The GPT-generated information will be saved to: `samples/synthetic/rct/description/rct.json`
### For All Methods
To generate contextual information for all methods at once:
```bash
bash reproduce_results/create_context_all.sh
```
## Step 4: Generate Summary Files
- Go to the home directory
- Then run the following command:
```bash
bash reproduce_results/finalize_synthetic_dataset.sh
```
### Output Files
The script generates two types of output files:
1. **CAIS Input Files**
   - These contain all the information needed to run CAIS on the synthetic dataset. A separate file is created for each method (e.g., `rct_info.csv` for RCT). The files are saved to `reproduce_results/samples/synthetic/data_info`
2. **Renamed Dataset Files**
   - The original columns (`X1`, `X2`, ..., `Y`, `D`) are renamed with the real-world variable names generated by GPT in the previous step. The files are saved to `reproduce_results/samples/synthetic/synthetic_data`
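
Steps 2 to 4 for all methods can also be chained into a single run. The following is a minimal sketch, assuming the three scripts named in the steps above are present and `settings.sh` has already been configured. The function name `run_synthetic_pipeline` and the directory check are illustrative and not part of the repository.

```bash
# Hypothetical wrapper: run Steps 2-4 in order from the home directory.
# Usage: run_synthetic_pipeline [path-to-home-directory]
run_synthetic_pipeline() {
  local root="${1:-.}"
  # The scripts are meant to be launched from the home directory,
  # not from inside reproduce_results.
  if [ ! -d "$root/reproduce_results" ]; then
    echo "reproduce_results/ not found under '$root'" >&2
    return 1
  fi
  (
    # Run in a subshell so the caller's working directory is unchanged.
    # The && chain stops at the first step that fails.
    cd "$root" &&
    bash reproduce_results/create_synthetic_data_all.sh &&   # Step 2
    bash reproduce_results/create_context_all.sh &&          # Step 3
    bash reproduce_results/finalize_synthetic_dataset.sh     # Step 4
  )
}
```

Chaining with `&&` means the context and summary steps are skipped if data generation fails, so partial outputs are easier to spot.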
## Sample Results
Example outputs can be found in the `samples/synthetic` directory.