title: Synthetic Sdk Demo
emoji: π
colorFrom: green
colorTo: green
sdk: docker
pinned: false
license: apache-2.0
short_description: The Synthetic Data SDK is a Python toolkit from MOSTLY AI

# Synthetic Data SDK by MOSTLY AI Demo

[Documentation](https://mostly-ai.github.io/mostlyai/) | [Technical White Paper](https://arxiv.org/abs/2508.00718) | [Usage Examples](https://mostly-ai.github.io/mostlyai/usage/) | [Free Cloud Service](https://app.mostly.ai/)

The **Synthetic Data SDK** is a Python toolkit for high-fidelity, privacy-safe **Synthetic Data**.

- **LOCAL** mode trains and generates synthetic data locally, on your own compute resources.
- **CLIENT** mode connects to a remote MOSTLY AI platform and trains and generates synthetic data there.
- Generators trained locally can easily be imported into a platform for further sharing.
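
The choice between the two modes can be sketched as a small helper that prepares the arguments for the SDK client. This is a minimal sketch, not the SDK's own code: `client_kwargs` is a hypothetical helper, and the keyword names `local`, `api_key`, and `base_url` are assumptions about the client constructor that should be checked against the API reference.

```python
from typing import Dict, Optional


def client_kwargs(mode: str, api_key: Optional[str] = None,
                  base_url: Optional[str] = None) -> Dict[str, object]:
    """Build keyword arguments for the SDK client (hypothetical helper).

    Assumption: the client constructor accepts `local`, `api_key`, `base_url`.
    """
    mode = mode.upper()
    if mode == "LOCAL":
        # Train and generate on this machine's own compute resources.
        return {"local": True}
    if mode == "CLIENT":
        # A remote platform needs credentials.
        if not api_key:
            raise ValueError("CLIENT mode needs an API key for the remote platform")
        kwargs: Dict[str, object] = {"api_key": api_key}
        if base_url:
            kwargs["base_url"] = base_url
        return kwargs
    raise ValueError(f"unknown mode: {mode!r}")


# With the SDK installed, the result would be passed on roughly like this
# (assumed import path and constructor, verify before use):
#   from mostlyai.sdk import MostlyAI
#   mostly = MostlyAI(**client_kwargs("LOCAL"))
```

Keeping the mode decision in one place means the rest of a script can use the same `mostly` object regardless of where training actually runs.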

## Overview

The SDK lets you programmatically create, browse, and manage three key resources:

1. **Generators** - Train a synthetic data generator on your existing tabular or language data assets
2. **Synthetic Datasets** - Use a generator to create as many synthetic samples as you need
3. **Connectors** - Connect to any data source within your organization, for reading and writing data

| Intent                                        | Primitive                         | API Reference                                                                                                 |
|-----------------------------------------------|-----------------------------------|---------------------------------------------------------------------------------------------------------------|
| Train a Generator on tabular or language data | `g = mostly.train(config)`        | [mostly.train](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.train)       |
| Generate any number of synthetic data records | `sd = mostly.generate(g, config)` | [mostly.generate](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.generate) |
| Live probe the generator on demand            | `df = mostly.probe(g, config)`    | [mostly.probe](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.probe)       |
| Connect to any data source within your org    | `c = mostly.connect(config)`      | [mostly.connect](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.connect)   |
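
The primitives in the table chain naturally: train a generator, generate a synthetic dataset from it, then probe it for a quick sample. The sketch below shows that order under stated assumptions. `train_generate_probe` is a hypothetical helper, the configs are passed through untouched (their keys are not defined here, see the API reference), and it assumes each method accepts a `config` keyword argument.

```python
from typing import Any, Dict


def train_generate_probe(mostly: Any,
                         train_config: Dict[str, Any],
                         generate_config: Dict[str, Any],
                         probe_config: Dict[str, Any]) -> Dict[str, Any]:
    """Run the train -> generate -> probe sequence on a client object.

    `mostly` is any object exposing `train`, `generate` and `probe`
    as listed in the table (assumed to accept a `config` keyword).
    """
    # 1. Train a generator on the data described by the config.
    g = mostly.train(config=train_config)
    # 2. Create a synthetic dataset from the trained generator.
    sd = mostly.generate(g, config=generate_config)
    # 3. Probe the same generator for an on-demand sample.
    df = mostly.probe(g, config=probe_config)
    return {"generator": g, "synthetic_dataset": sd, "probe_sample": df}
```

Because the helper only relies on the three method names, it can be exercised with a stand-in object in tests before pointing it at a real client.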

https://github.com/user-attachments/assets/9e233213-a259-455c-b8ed-d1f1548b492f

## Key Features

- **Broad Data Support**
  - Mixed-type data (categorical, numerical, geospatial, text, etc.)
  - Single-table, multi-table, and time-series
- **Multiple Model Types**
  - State-of-the-art performance via TabularARGN
  - Fine-tune Hugging Face hosted language models
  - Efficient LSTM for text synthesis from scratch
- **Advanced Training Options**
  - GPU/CPU support
  - Differential Privacy
  - Progress Monitoring
- **Automated Quality Assurance**
  - Quality metrics for fidelity and privacy
  - In-depth HTML reports for visual analysis
- **Flexible Sampling**
  - Up-sample to any data volume
  - Conditional simulations based on any columns
  - Re-balance underrepresented segments
  - Context-aware data imputation
  - Statistical fairness controls
  - Rule-adherence via temperature
- **Seamless Integration**
  - Connect to external data sources (databases, cloud storage)
  - Fully permissive open-source license

## Citation

Please consider citing our project if you find it useful:

```bibtex
@misc{mostlyai,
    title={Democratizing Tabular Data Access with an Open-Source Synthetic-Data SDK},
    author={Ivona Krchova and Mariana Vargas Vieyra and Mario Scriminaci and Andrey Sidorenko},
    year={2025},
    eprint={2508.00718},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    url={https://arxiv.org/abs/2508.00718},
}
```