# Synthetic Data Generation Instructions

## Step 1: Configure Parameters

1. Go to the `reproduce_results` folder
2. Open `settings.sh` and configure the hyperparameters

## Step 2: Generate Synthetic Data
### For a Single Method
- Go to the project home directory, the one that contains `reproduce_results` (do not run the script from inside `reproduce_results`)
- To generate data for a specific method (e.g., RCT), run the following bash script:
  ```bash
  bash reproduce_results/create_data/create_rct_data.sh
  ```
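
Because the script path is relative, the command only resolves from the home directory. A small guard along the following lines can make a wrong working directory obvious. The function name `run_from_home` is hypothetical and not part of the repository; treat this as a sketch rather than the project's own tooling.

```bash
# Hypothetical helper (not part of the repository): run a reproduce_results
# script only if its relative path resolves from the current directory.
run_from_home() {
  local script="$1"
  if [ ! -f "$script" ]; then
    echo "error: '$script' not found; cd to the project home directory first" >&2
    return 1
  fi
  bash "$script"
}
```

With the function defined, `run_from_home reproduce_results/create_data/create_rct_data.sh` behaves like the command above but fails early with a clear message when run from the wrong directory.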


**Output**

***Note:*** The output locations below assume the default parameters in `settings.sh`; they may differ if the names in `settings.sh` are modified.
- Datasets will be saved to: `samples/synthetic/rct/data/`
- A metadata file will be created at: `samples/synthetic/rct/metadata/rct.json`
- The metadata file contains the following information about the synthetic data:
  - True effects
  - Number of observations
  - Number of continuous covariates
  - Number of binary covariates

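One way to confirm that the run produced the outputs listed above is a quick existence check. This is a minimal sketch that assumes the default locations; `check_rct_outputs` is a hypothetical name, not a script shipped with the repository.

```bash
# Hypothetical helper (not part of the repository): report whether the RCT
# outputs exist at the default locations listed above.
check_rct_outputs() {
  local root="${1:-.}"  # project home directory; defaults to the current one
  local data_dir="$root/samples/synthetic/rct/data"
  local meta_file="$root/samples/synthetic/rct/metadata/rct.json"
  local status=0

  if [ -d "$data_dir" ]; then
    echo "found:   $data_dir"
  else
    echo "missing: $data_dir"
    status=1
  fi

  if [ -f "$meta_file" ]; then
    echo "found:   $meta_file"
  else
    echo "missing: $meta_file"
    status=1
  fi

  return "$status"   # 0 only when both outputs are present
}
```

Calling `check_rct_outputs` from the home directory returns a non-zero status if either path is missing. To read the metadata itself, `python3 -m json.tool samples/synthetic/rct/metadata/rct.json` pretty-prints the file, assuming Python 3 is installed.
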
### For All Methods

To generate synthetic data for all methods in one go:
```bash
bash reproduce_results/create_synthetic_data_all.sh
```
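
After the all-methods run, it can help to see which methods actually produced a metadata file. The sketch below assumes that every method follows the same layout documented for RCT, that is `samples/synthetic/<method>/metadata/<method>.json`; only the RCT paths are confirmed by this document, and `list_generated_methods` is a hypothetical name.

```bash
# Hypothetical helper (not part of the repository). ASSUMPTION: every method
# writes samples/synthetic/<method>/metadata/<method>.json, as RCT does.
list_generated_methods() {
  local root="${1:-.}"  # project home directory; defaults to the current one
  local dir method
  for dir in "$root"/samples/synthetic/*/; do
    [ -d "$dir" ] || continue          # the glob matched nothing
    method="$(basename "$dir")"
    if [ -f "$dir/metadata/$method.json" ]; then
      echo "$method"
    fi
  done
}
```

Running `list_generated_methods` from the home directory prints one method name per line; a method missing from the list either failed to generate or uses a different layout than assumed here.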

## Step 3: Generate Contextual Information

### For a Single Method

1. Go to the home directory
2. To generate column labels, backstory, and query for datasets related to a specific method (e.g., RCT), run:
   ```bash
   bash reproduce_results/create_context/create_context_rct.sh
   ```

**Output:** GPT-generated information will be saved to: `samples/synthetic/rct/description/rct.json`

### For All Methods

To generate contextual information for all methods at once:
```bash
bash reproduce_results/create_context_all.sh
```

## Step 4: Generate Summary Files
- Go to the home directory
- Then run the following command:
  ```bash
  bash reproduce_results/finalize_synthetic_dataset.sh
  ```

### Output Files

The script generates two types of output files:

1. **CAIS Input Files**
   - These contain all the information needed to run CAIS on the synthetic dataset. A separate file is created for each method (e.g., `rct_info.csv` for RCT). The files are saved to `reproduce_results/samples/synthetic/data_info`

2. **Renamed Dataset Files**
   - The original columns (`X1`, `X2`, ..., `Y`, `D`) are renamed with the real-world variable names generated by GPT in Step 3. The files are saved to `reproduce_results/samples/synthetic/synthetic_data`

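To illustrate what the renaming amounts to, the sketch below replaces the header line of a CSV and leaves the data rows unchanged. It is not the logic of `finalize_synthetic_dataset.sh`, which this document does not show; `rename_header` and the example variable names are hypothetical.

```bash
# Hypothetical helper (not part of the repository): print a CSV with its
# header line replaced, leaving the data rows untouched.
rename_header() {
  local csv="$1" new_header="$2"
  printf '%s\n' "$new_header"   # new header, e.g. names from the GPT step
  tail -n +2 "$csv"             # all rows after the original header
}
```

For a file whose header is `X1,X2,Y,D`, a call such as `rename_header data.csv "age,smoker,recovery_score,treated"` (made-up names and file) prints the same rows under the new header.
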
## Sample Results

Example outputs can be found in the `samples/synthetic` directory.