metadata
title: Synthetic Sdk Demo
emoji: π
colorFrom: green
colorTo: green
sdk: docker
pinned: false
license: apache-2.0
short_description: The Synthetic Data SDK is a Python toolkit from MOSTLY AI
Synthetic Data SDK by MOSTLY AI Demo
Documentation | Technical White Paper | Usage Examples | Free Cloud Service
The Synthetic Data SDK is a Python toolkit for high-fidelity, privacy-safe Synthetic Data.
- LOCAL mode trains and generates synthetic data locally on your own compute resources.
- CLIENT mode connects to a remote MOSTLY AI platform for training & generating synthetic data there.
- Generators that were trained locally can easily be imported into a platform for further sharing.
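As a minimal sketch of how a script might choose between the two modes, the helper below picks the client arguments from environment variables. The keyword names `local`, `api_key`, and `base_url`, and the variable names `MOSTLY_API_KEY` and `MOSTLY_BASE_URL`, are assumptions made for illustration; check the SDK documentation before relying on them.

```python
import os


def sdk_client_kwargs(env=None):
    """Return keyword arguments for the SDK client.

    Assumption: the client accepts `local=True` for LOCAL mode and
    `api_key` / `base_url` for CLIENT mode. The environment variable
    names are likewise assumed, not taken from this README.
    """
    env = os.environ if env is None else env
    api_key = env.get("MOSTLY_API_KEY")
    if not api_key:
        # No credentials available: train and generate on local compute.
        return {"local": True}
    kwargs = {"api_key": api_key}
    base_url = env.get("MOSTLY_BASE_URL")
    if base_url:
        # Optional: point at a specific platform deployment.
        kwargs["base_url"] = base_url
    return kwargs
```

With the SDK installed, the result could then be passed along as `MostlyAI(**sdk_client_kwargs())`, assuming the client class is importable as `from mostlyai.sdk import MostlyAI`.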
Overview
The SDK allows you to programmatically create, browse and manage 3 key resources:
- Generators - Train a synthetic data generator on your existing tabular or language data assets
- Synthetic Datasets - Use a generator to create any number of synthetic samples to suit your needs
- Connectors - Connect to any data source within your organization, for reading and writing data
Intent | Primitive | API Reference |
---|---|---|
Train a Generator on tabular or language data | g = mostly.train(config) | mostly.train |
Generate any number of synthetic data records | sd = mostly.generate(g, config) | mostly.generate |
Live probe the generator on demand | df = mostly.probe(g, config) | mostly.probe |
Connect to any data source within your org | c = mostly.connect(config) | mostly.connect |
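The primitives in the table chain naturally: a trained generator feeds both `generate` and `probe`. The sketch below shows that flow using only the call shapes listed above. The config layout (`name`, `tables`, `data`), the `size` key, and the `DryRunClient` class are hypothetical placeholders, so the example runs without the SDK installed; the API reference has the authoritative schema.

```python
def make_train_config(generator_name, table_name, data):
    """Build a training config for a single-table generator.

    Assumption: the config is a dict with a generator "name" and a list
    of "tables", each carrying its own "name" and "data".
    """
    return {"name": generator_name, "tables": [{"name": table_name, "data": data}]}


def run_workflow(mostly, train_config, generate_config, probe_config):
    """Chain the primitives from the table: train, then generate and probe."""
    g = mostly.train(train_config)            # train a generator
    sd = mostly.generate(g, generate_config)  # create a synthetic dataset
    df = mostly.probe(g, probe_config)        # draw samples on demand
    return g, sd, df


class DryRunClient:
    """Hypothetical stand-in that records calls instead of training anything."""

    def __init__(self):
        self.calls = []

    def train(self, config):
        self.calls.append("train")
        return {"generator_for": config["name"]}

    def generate(self, generator, config):
        self.calls.append("generate")
        return {"generator": generator, "config": config}

    def probe(self, generator, config):
        self.calls.append("probe")
        return {"generator": generator, "config": config}


client = DryRunClient()
config = make_train_config("demo", "customers", data=[{"age": 42}])
# The {"size": ...} configs are placeholders, not the SDK's documented schema.
g, sd, df = run_workflow(client, config, {"size": 100}, {"size": 5})
print(client.calls)
```

Swapping `DryRunClient()` for a real SDK client would be the only change needed to run the same flow against actual data, provided the configs match the documented schema.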
https://github.com/user-attachments/assets/9e233213-a259-455c-b8ed-d1f1548b492f
Key Features
- Broad Data Support
  - Mixed-type data (categorical, numerical, geospatial, text, etc.)
  - Single-table, multi-table, and time-series
- Multiple Model Types
  - State-of-the-art performance via TabularARGN
  - Fine-tune Hugging Face hosted language models
  - Efficient LSTM for text synthesis from scratch
- Advanced Training Options
  - GPU/CPU support
  - Differential Privacy
  - Progress Monitoring
- Automated Quality Assurance
  - Quality metrics for fidelity and privacy
  - In-depth HTML reports for visual analysis
- Flexible Sampling
  - Up-sample to any data volume
  - Conditional simulations based on any columns
  - Re-balance underrepresented segments
  - Context-aware data imputation
  - Statistical fairness controls
  - Rule-adherence via temperature
- Seamless Integration
  - Connect to external data sources (DBs, cloud storage)
  - Fully permissive open-source license
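To give some intuition for "rule-adherence via temperature" in the list above, the sketch below applies a sampling temperature to a categorical distribution. This is the generic textbook technique, not the SDK's internal implementation, and the function name is hypothetical. A temperature below 1 concentrates probability on the most likely values, so rare combinations are sampled less often; a temperature above 1 flattens the distribution.

```python
import math


def apply_temperature(probs, temperature):
    """Rescale a categorical distribution by a sampling temperature.

    Each probability p is raised to the power 1/temperature and the
    result is renormalized. Values with zero probability stay at zero.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if not any(p > 0 for p in probs):
        raise ValueError("at least one probability must be positive")
    # Work in log space and subtract the peak for numerical stability.
    logits = [math.log(p) / temperature if p > 0 else float("-inf") for p in probs]
    peak = max(logits)
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    return [w / total for w in weights]


original = [0.7, 0.2, 0.1]
conservative = apply_temperature(original, 0.5)  # sharper than the original
exploratory = apply_temperature(original, 2.0)   # flatter than the original
```

At a temperature of exactly 1 the distribution is returned unchanged, which makes it a natural default.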
Citation
Please consider citing our project if you find it useful:
@misc{mostlyai,
title={Democratizing Tabular Data Access with an Open-Source Synthetic-Data SDK},
author={Ivona Krchova and Mariana Vargas Vieyra and Mario Scriminaci and Andrey Sidorenko},
year={2025},
eprint={2508.00718},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2508.00718},
}