
	# Synthetic Data Generation Instructions

	## Step 1: Configure Parameters

	1. Go to the `reproduce_results` folder
	2. Open `settings.sh` and configure the hyperparameters

	## Step 2: Generate Synthetic Data
	### For a Single Method
	- Go to the home directory (do not run the script from inside `reproduce_results`)
	- To generate data for a specific method (e.g., RCT), run the following bash script:
	```bash
	bash reproduce_results/create_data/create_rct_data.sh
	```


	Output:

	*Note:* The paths below assume the default parameters in `settings.sh`. They may differ if the names are modified in `settings.sh`.

	- Datasets will be saved to: `samples/synthetic/rct/data/`
	- A metadata file will be created at: `samples/synthetic/rct/metadata/rct.json`
	- The metadata file contains the following information about the synthetic data:
	  - True effects
	  - Number of observations
	  - Number of continuous covariates
	  - Number of binary covariates
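
After the script finishes, a quick sanity check can confirm that both outputs exist. The sketch below assumes the default paths listed above. The function name `check_rct_outputs` and its optional root argument are illustrative and are not part of the repository.

```bash
# Hypothetical helper (not part of the repository): checks that the RCT
# outputs exist under the default paths from settings.sh.
check_rct_outputs() {
  local root="${1:-.}"   # home directory of the repository; defaults to "."
  local data_dir="$root/samples/synthetic/rct/data"
  local metadata="$root/samples/synthetic/rct/metadata/rct.json"
  local status=0

  # The data directory should exist and contain at least one entry.
  if [ -d "$data_dir" ] && [ -n "$(ls -A "$data_dir")" ]; then
    echo "ok: datasets found in $data_dir"
  else
    echo "missing: no datasets in $data_dir" >&2
    status=1
  fi

  # The metadata file should exist and be non-empty.
  if [ -s "$metadata" ]; then
    echo "ok: metadata found at $metadata"
  else
    echo "missing: $metadata" >&2
    status=1
  fi

  return "$status"
}
```

Calling `check_rct_outputs` from the home directory returns a non-zero status if either output is missing, so it can be used as a guard before moving on to Step 3.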

	### For All Methods

	To generate synthetic data for all methods in one go:
	```bash
	bash reproduce_results/create_synthetic_data_all.sh
	```
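
To see which methods actually produced output, the metadata files can be listed. This sketch assumes every method follows the same `samples/synthetic/<method>/metadata/<method>.json` layout shown above for RCT, which these instructions do not state explicitly. `list_generated_methods` is a hypothetical helper.

```bash
# Hypothetical helper: prints the name of each method that has a metadata file.
# Assumes the layout samples/synthetic/<method>/metadata/<method>.json,
# as shown for RCT; other methods may use a different layout.
list_generated_methods() {
  local root="${1:-.}"   # home directory of the repository; defaults to "."
  local dir method
  for dir in "$root"/samples/synthetic/*/; do
    [ -d "$dir" ] || continue            # the glob matched nothing
    method="$(basename "$dir")"
    # Report the method only if its metadata file exists and is non-empty.
    if [ -s "$dir/metadata/$method.json" ]; then
      echo "$method"
    fi
  done
}
```

Run from the home directory, `list_generated_methods` prints one method name per line and prints nothing if no metadata has been generated yet.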

	## Step 3: Generate Contextual Information

	### For a Single Method

	1. Go to the home directory
	2. To generate column labels, backstory, and query for datasets related to a specific method (e.g., RCT), run:
	```bash
	bash reproduce_results/create_context/create_context_rct.sh
	```

	Output: GPT-generated information will be saved to: `samples/synthetic/rct/description/rct.json`

	### For All Methods

	To generate contextual information for all methods at once:
	```bash
	bash reproduce_results/create_context_all.sh
	```

	## Step 4: Generate Summary Files
	- Go to the home directory
	- Then run the following command:
	```bash
	bash reproduce_results/finalize_synthetic_dataset.sh
	```

	### Output Files

	The script generates two types of output files:

	1. CAIS Input Files
	   - These contain all the information needed to run CAIS on the synthetic dataset. A separate file is created for each method (e.g., `rct_info.csv` for RCT). The files are saved to `reproduce_results/samples/synthetic/data_info`

	2. Renamed Dataset Files
	   - The original columns (X1, X2, ..., Y, D) are renamed with the real-world variable names generated by GPT in the previous step. The files are saved to `reproduce_results/samples/synthetic/synthetic_data`
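
Steps 2 to 4 can be chained into a single run for all methods. The following is a minimal sketch that uses only the three scripts named in this document. The wrapper function `run_all_steps` is hypothetical. It has to be called from the home directory, and it stops at the first script that fails so that later steps do not run on incomplete data.

```bash
# Hypothetical wrapper: runs Steps 2 to 4 for all methods, in order.
# Call it from the home directory of the repository.
run_all_steps() {
  local script
  for script in \
    reproduce_results/create_synthetic_data_all.sh \
    reproduce_results/create_context_all.sh \
    reproduce_results/finalize_synthetic_dataset.sh
  do
    echo "running: $script"
    # Stop at the first failure and report which script failed.
    if ! bash "$script"; then
      echo "failed: $script" >&2
      return 1
    fi
  done
}
```

Configure `settings.sh` (Step 1) before calling `run_all_steps`, since all three scripts are described relative to those parameters.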

	## Sample Results

	Example outputs can be found in the `samples/synthetic` directory.