---
title: Synthetic Sdk Demo
emoji: 🚀
colorFrom: green
colorTo: green
sdk: docker
pinned: false
license: apache-2.0
short_description: The Synthetic Data SDK is a Python toolkit from MOSTLY AI
---

# Synthetic Data SDK by MOSTLY AI Demo

[Documentation](https://mostly-ai.github.io/mostlyai/) | [Technical White Paper](https://arxiv.org/abs/2508.00718) | [Usage Examples](https://mostly-ai.github.io/mostlyai/usage/) | [Free Cloud Service](https://app.mostly.ai/)

The **Synthetic Data SDK** is a Python toolkit for high-fidelity, privacy-safe **Synthetic Data**.

- **LOCAL** mode trains generators and generates synthetic data on your own compute resources.
- **CLIENT** mode connects to a remote MOSTLY AI platform and trains and generates synthetic data there.
- Generators that were trained locally can easily be imported to a platform for further sharing.

## Overview

The SDK allows you to programmatically create, browse, and manage 3 key resources:

1. **Generators** - Train a synthetic data generator on your existing tabular or language data assets
2. **Synthetic Datasets** - Use a generator to create any number of synthetic samples to suit your needs
3. **Connectors** - Connect to any data source within your organization, for reading and writing data

| Intent | Primitive | API Reference |
|---|---|---|
| Train a Generator on tabular or language data | `g = mostly.train(config)` | [mostly.train](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.train) |
| Generate any number of synthetic data records | `sd = mostly.generate(g, config)` | [mostly.generate](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.generate) |
| Live probe the generator on demand | `df = mostly.probe(g, config)` | [mostly.probe](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.probe) |
| Connect to any data source within your org | `c = mostly.connect(config)` | [mostly.connect](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.connect) |

https://github.com/user-attachments/assets/9e233213-a259-455c-b8ed-d1f1548b492f

## Key Features

- **Broad Data Support**
  - Mixed-type data (categorical, numerical, geospatial, text, etc.)
  - Single-table, multi-table, and time-series
- **Multiple Model Types**
  - State-of-the-art performance via TabularARGN
  - Fine-tune Hugging Face hosted language models
  - Efficient LSTM for text synthesis from scratch
- **Advanced Training Options**
  - GPU/CPU support
  - Differential Privacy
  - Progress Monitoring
- **Automated Quality Assurance**
  - Quality metrics for fidelity and privacy
  - In-depth HTML reports for visual analysis
- **Flexible Sampling**
  - Up-sample to any data volume
  - Conditional simulations based on any columns
  - Re-balance underrepresented segments
  - Context-aware data imputation
  - Statistical fairness controls
  - Rule-adherence via temperature
- **Seamless Integration**
  - Connect to external data sources (DBs, cloud storage)
  - Fully permissive open-source license

## Citation

Please consider citing our project if you find it useful:

```bibtex
@misc{mostlyai,
      title={Democratizing Tabular Data Access with an Open-Source Synthetic-Data SDK},
      author={Ivona Krchova and Mariana Vargas Vieyra and Mario Scriminaci and Andrey Sidorenko},
      year={2025},
      eprint={2508.00718},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2508.00718},
}
```
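
## Workflow Sketch

The primitives listed in the Overview table chain naturally: train a generator, generate a synthetic dataset from it, then probe it for samples on demand. The snippet below is a minimal sketch of that chain, not the SDK's own code. It uses only the call shapes shown in the table (`mostly.train(config)`, `mostly.generate(g, config)`, `mostly.probe(g, config)`). The helper `train_generate_probe`, the `SyntheticDataClient` protocol, and the `RecordingClient` stand-in are hypothetical names introduced here for illustration, and the config dictionaries are placeholders rather than the real SDK schema.

```python
from typing import Any, List, Protocol, Tuple


class SyntheticDataClient(Protocol):
    """Structural type for the three primitives used below.

    Call shapes follow the Overview table; the real SDK may accept more arguments.
    """

    def train(self, config: Any) -> Any: ...
    def generate(self, generator: Any, config: Any) -> Any: ...
    def probe(self, generator: Any, config: Any) -> Any: ...


def train_generate_probe(
    mostly: SyntheticDataClient,
    train_config: dict,
    generate_config: dict,
    probe_config: dict,
) -> Tuple[Any, Any, Any]:
    """Run train -> generate -> probe and return all three results."""
    g = mostly.train(train_config)            # 1. train a generator
    sd = mostly.generate(g, generate_config)  # 2. create a synthetic dataset
    df = mostly.probe(g, probe_config)        # 3. draw samples on demand
    return g, sd, df


class RecordingClient:
    """Hypothetical stand-in (NOT part of the SDK) that records calls for a dry run."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def train(self, config: Any) -> dict:
        self.calls.append("train")
        return {"kind": "generator", "config": config}

    def generate(self, generator: Any, config: Any) -> dict:
        self.calls.append("generate")
        return {"kind": "synthetic_dataset", "generator": generator, "config": config}

    def probe(self, generator: Any, config: Any) -> dict:
        self.calls.append("probe")
        return {"kind": "probe", "generator": generator, "config": config}


if __name__ == "__main__":
    client = RecordingClient()
    # Placeholder configs; consult the API reference for the real schema.
    g, sd, df = train_generate_probe(client, {"name": "demo"}, {}, {})
    print(client.calls)
```

To run this against the SDK itself, the stand-in would be swapped for a real client. As an assumption to verify against the linked API reference, a client for LOCAL mode would be created with `from mostlyai.sdk import MostlyAI` and `MostlyAI(local=True)`, and the configs would follow the schemas documented there.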