# Synthetic Data Generation Instructions
## Step 1: Configure Parameters
1. Go to the `reproduce_results` folder
2. Open `settings.sh` and configure the hyperparameters
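Before editing, it can help to see which parameters are currently set. The following is a minimal sketch, assuming `settings.sh` defines its hyperparameters as plain `NAME=value` lines (optionally prefixed with `export`); the helper `list_settings` is hypothetical and not part of the repository.

```bash
#!/usr/bin/env bash
# Print the variable assignments found in a settings file.
# Assumption: settings.sh uses plain NAME=value lines (optionally prefixed
# with "export"); the real file may be organised differently.
list_settings() {
  local settings_file="${1:-reproduce_results/settings.sh}"
  if [ ! -f "$settings_file" ]; then
    echo "settings file not found: $settings_file" >&2
    return 1
  fi
  # Keep only lines that look like assignments, skipping comments and blanks.
  grep -E '^[[:space:]]*(export[[:space:]]+)?[A-Za-z_][A-Za-z0-9_]*=' "$settings_file"
}
```

After sourcing this snippet from the home directory, `list_settings` reads `reproduce_results/settings.sh` by default; pass another path as the first argument to inspect a different file.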
## Step 2: Generate Synthetic Data
### For a Single Method
- Go to the project home directory (do not run the script from inside `reproduce_results`)
- To generate data for a specific method (e.g., RCT), run the following bash script:
```bash
bash reproduce_results/create_data/create_rct_data.sh
```
**Output**
***Note:*** The outputs below are described with respect to the default parameters in `settings.sh`. They may differ if the names are modified in `settings.sh`.
- Datasets will be saved to: `samples/synthetic/rct/data/`
- A metadata file will be created at: `samples/synthetic/rct/metadata/rct.json`
- The metadata file contains the following information about the synthetic data:
  - True effects
  - Number of observations
  - Number of continuous covariates
  - Number of binary covariates
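To confirm that the run produced the files listed above, a small check along these lines may be useful. It is a sketch under the default `settings.sh` names; the helper `check_outputs` is hypothetical, and the assumption that other methods use the same directory layout as RCT is not confirmed by this README.

```bash
#!/usr/bin/env bash
# Check that the expected outputs exist for one method.
# Assumption: other methods follow the same layout as RCT, i.e.
# samples/synthetic/<method>/data/ and
# samples/synthetic/<method>/metadata/<method>.json.
check_outputs() {
  local method="${1:-rct}"
  local base="${2:-samples/synthetic}"
  local data_dir="$base/$method/data"
  local metadata="$base/$method/metadata/$method.json"
  local status=0

  # The data directory should exist and contain at least one entry.
  if [ -d "$data_dir" ] && [ -n "$(ls -A "$data_dir")" ]; then
    echo "ok: datasets found in $data_dir"
  else
    echo "missing: no datasets in $data_dir" >&2
    status=1
  fi

  # The metadata file should exist as a regular file.
  if [ -f "$metadata" ]; then
    echo "ok: metadata found at $metadata"
  else
    echo "missing: $metadata" >&2
    status=1
  fi
  return "$status"
}
```

Run from the home directory, `check_outputs rct` returns a non-zero status if either the datasets or the metadata file are missing.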
### For All Methods
To generate synthetic data for all methods in one go:
```bash
bash reproduce_results/create_synthetic_data_all.sh
```
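If only a subset of methods is needed, the single-method scripts could be called in a loop instead. The sketch below assumes each method has a script named `create_<method>_data.sh`, mirroring `create_rct_data.sh`; only the RCT script is confirmed by this README, and `generate_for_methods` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Run the single-method data script for each method passed as an argument.
# Assumption: every method has a script named create_<method>_data.sh,
# mirroring create_rct_data.sh; only the RCT script is confirmed here.
generate_for_methods() {
  local method script
  for method in "$@"; do
    script="reproduce_results/create_data/create_${method}_data.sh"
    if [ ! -f "$script" ]; then
      echo "no script for method '$method': $script" >&2
      return 1
    fi
    echo "generating data for: $method"
    # Stop at the first failing script so errors are not hidden.
    bash "$script" || return 1
  done
}
```

Called from the home directory as `generate_for_methods rct`, it stops at the first missing or failing script.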
## Step 3: Generate Contextual Information
### For a Single Method
1. Go to the home directory
2. To generate column labels, backstory, and query for datasets related to a specific method (e.g., RCT), run:
```bash
bash reproduce_results/create_context/create_context_rct.sh
```
**Output:** GPT-generated information will be saved to: `samples/synthetic/rct/description/rct.json`
### For All Methods
To generate contextual information for all methods at once:
```bash
bash reproduce_results/create_context_all.sh
```
## Step 4: Generate Summary Files
- Go to the home directory
- Then run the following command:
```bash
bash reproduce_results/finalize_synthetic_dataset.sh
```
### Output Files
The script generates two types of output files:
1. **CAIS Input Files**
   - These contain all the information needed to run CAIS on the synthetic dataset. A separate file is created for each method (e.g., `rct_info.csv` for RCT). Files are saved to `reproduce_results/samples/synthetic/data_info`.
2. **Renamed Dataset Files**
   - The original columns (X1, X2, ..., Y, D) are renamed with the real-world variable names generated by GPT in the previous step. The files are saved in `reproduce_results/samples/synthetic/synthetic_data`.
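Steps 2 to 4 can also be chained into a single run. The following is a minimal sketch, assuming `settings.sh` has already been configured (Step 1) and that the three "all methods" scripts are meant to run in this order; `run_pipeline` is a hypothetical wrapper, not a script shipped with the repository.

```bash
#!/usr/bin/env bash
# Run Steps 2-4 in order from the project home directory.
# run_pipeline is a hypothetical wrapper around the scripts named above.
run_pipeline() {
  local root="${1:-.}"
  local step
  # Refuse to run from the wrong place: the scripts are expected
  # under reproduce_results/ relative to the home directory.
  if [ ! -d "$root/reproduce_results" ]; then
    echo "reproduce_results/ not found under '$root'; run from the home directory" >&2
    return 1
  fi
  # Use a subshell so the caller's working directory is left unchanged.
  (
    cd "$root" || exit 1
    for step in \
      reproduce_results/create_synthetic_data_all.sh \
      reproduce_results/create_context_all.sh \
      reproduce_results/finalize_synthetic_dataset.sh
    do
      echo "running: $step"
      # Abort on the first failure; later steps depend on earlier outputs.
      bash "$step" || { echo "step failed: $step" >&2; exit 1; }
    done
  )
}
```

After sourcing the snippet, `run_pipeline` runs against the current directory, and `run_pipeline /path/to/project` targets another checkout. Stopping at the first failed step avoids finalizing a dataset whose context files were never generated.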
## Sample Results
Example outputs can be found in the `samples/synthetic` directory.