# Synthetic Data Generation Instructions

## Step 1: Configure Parameters

1. Go to the `reproduce_results` folder.
2. Open `settings.sh` and configure the hyperparameters.
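Before editing, it can help to see which variables the file currently defines. The following is a minimal sketch, assuming `settings.sh` uses plain `NAME=value` shell assignments; the helper name `list_settings` is hypothetical and not part of the repository.

```bash
# Hypothetical helper: print every NAME=value assignment in a settings file,
# prefixed with its line number. Assumes plain shell assignments, optionally
# preceded by `export`. Comments and other commands are skipped.
list_settings() {
  local file="${1:-reproduce_results/settings.sh}"
  grep -nE '^[[:space:]]*(export[[:space:]]+)?[A-Za-z_][A-Za-z0-9_]*=' "$file"
}
```

From the repository root, `list_settings` reads `reproduce_results/settings.sh`; from inside `reproduce_results`, pass the path explicitly with `list_settings settings.sh`.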
## Step 2: Generate Synthetic Data

### For a Single Method

- Go to the repository root, i.e. the project's top-level directory (do not run the script from inside `reproduce_results`).
- To generate data for a specific method (e.g., RCT), run the following bash script:

```bash
bash reproduce_results/create_data/create_rct_data.sh
```
**Output**

***Note:*** The output paths below assume the default parameters in `settings.sh`. They may differ if the names in `settings.sh` are modified.

- Datasets will be saved to: `samples/synthetic/rct/data/`
- A metadata file will be created at: `samples/synthetic/rct/metadata/rct.json`
- The metadata file contains the following information about the synthetic data:
  - True effects
  - Number of observations
  - Number of continuous covariates
  - Number of binary covariates
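To check what was actually written, the metadata file can be pretty-printed. This is a sketch, assuming `python3` is available on the `PATH` and the default output path; the helper name `show_metadata` is hypothetical, and the exact key names inside the JSON are not documented here, so the helper prints the whole file rather than selecting fields.

```bash
# Hypothetical helper: pretty-print a metadata JSON file.
# Defaults to the RCT metadata path produced with the default settings.
show_metadata() {
  local file="${1:-samples/synthetic/rct/metadata/rct.json}"
  if [ ! -f "$file" ]; then
    echo "metadata file not found: $file (are you in the repository root?)" >&2
    return 1
  fi
  # json.tool ships with the Python standard library and also validates the JSON.
  python3 -m json.tool "$file"
}
```

Run `show_metadata` from the repository root after the generation script finishes, or pass another path as the first argument.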
### For All Methods

To generate synthetic data for all methods in one go:

```bash
bash reproduce_results/create_synthetic_data_all.sh
```
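After the all-methods run, one way to see which methods produced output is to look for metadata files. The sketch below assumes every method follows the same layout documented for RCT (`samples/synthetic/<method>/metadata/*.json`), which is an assumption for methods other than RCT; `list_generated_methods` is a hypothetical helper name.

```bash
# Hypothetical helper: list each method directory that contains at least one
# metadata JSON file, assuming the layout <root>/<method>/metadata/*.json.
list_generated_methods() {
  local root="${1:-samples/synthetic}"
  local file
  for file in "$root"/*/metadata/*.json; do
    # Skip the literal pattern when the glob matches nothing.
    [ -f "$file" ] || continue
    # The method name is the directory two levels above the JSON file.
    basename "$(dirname "$(dirname "$file")")"
  done | sort -u
}
```

A method missing from the output of `list_generated_methods` is a hint that its generation script did not finish or wrote to a non-default location.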
## Step 3: Generate Contextual Information

### For a Single Method

1. Go to the repository root (the project's top-level directory).
2. To generate column labels, a backstory, and a query for the datasets of a specific method (e.g., RCT), run:

```bash
bash reproduce_results/create_context/create_context_rct.sh
```
**Output:** The GPT-generated information will be saved to `samples/synthetic/rct/description/rct.json`.
### For All Methods

To generate contextual information for all methods at once:

```bash
bash reproduce_results/create_context_all.sh
```
## Step 4: Generate Summary Files

- Go to the repository root (the project's top-level directory).
- Then run the following command:

```bash
bash reproduce_results/finalize_synthetic_dataset.sh
```
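Because every script in these steps is invoked with a path that starts at `reproduce_results/`, running from the wrong directory is an easy mistake to make. A small guard such as the following can catch it early. This is only a sketch; `require_repo_root` is a hypothetical helper, and it simply checks that a `reproduce_results` directory exists in the current working directory.

```bash
# Hypothetical guard: succeed only when the current directory contains
# a `reproduce_results` folder, i.e. it looks like the top-level directory.
require_repo_root() {
  if [ ! -d reproduce_results ]; then
    echo "error: run this from the project's top-level directory" >&2
    return 1
  fi
}
```

It can be chained in front of any step, for example `require_repo_root && bash reproduce_results/finalize_synthetic_dataset.sh`.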
### Output Files

The script generates two types of output files:

1. **CAIS Input Files**
   - These contain all the information needed to run CAIS on the synthetic datasets. A separate file is created for each method (e.g., `rct_info.csv` for RCT). The files are saved to `reproduce_results/samples/synthetic/data_info`.
2. **Renamed Dataset Files**
   - The original columns (X1, X2, ..., Y, D) are renamed with the real-world variable names generated by GPT in the previous step. The files are saved to `reproduce_results/samples/synthetic/synthetic_data`.
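A quick way to confirm that the renaming took effect is to print the header row of each renamed dataset. The sketch below assumes the renamed datasets are CSV files with a header on the first line, which is not stated above; `show_csv_headers` is a hypothetical helper name.

```bash
# Hypothetical helper: print "<file name>: <header row>" for every CSV file
# in a directory. Assumes the first line of each file holds the column names.
show_csv_headers() {
  local dir="${1:-reproduce_results/samples/synthetic/synthetic_data}"
  local file
  for file in "$dir"/*.csv; do
    # Skip the literal pattern when the directory has no CSV files.
    [ -f "$file" ] || continue
    printf '%s: %s\n' "$(basename "$file")" "$(head -n 1 "$file")"
  done
}
```

If the headers still read `X1, X2, ..., Y, D`, the contextual information from Step 3 was probably not available when the script ran.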
## Sample Results

Example outputs can be found in the `samples/synthetic` directory.