title: Synthetic Sdk Demo
emoji: 🚀
colorFrom: green
colorTo: green
sdk: docker
pinned: false
license: apache-2.0
short_description: The Synthetic Data SDK is a Python toolkit from MOSTLY AI

Synthetic Data SDK by MOSTLY AI Demo

Documentation | Technical White Paper | Usage Examples | Free Cloud Service

The Synthetic Data SDK is a Python toolkit for high-fidelity, privacy-safe Synthetic Data.

  • LOCAL mode trains and generates synthetic data locally on your own compute resources.
  • CLIENT mode connects to a remote MOSTLY AI platform for training & generating synthetic data there.
  • Generators that were trained locally can easily be imported to a platform for further sharing.
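
As a rough illustration of the two modes, the sketch below builds the constructor arguments for each one. The import path `mostlyai.sdk.MostlyAI`, the `local`, `base_url`, and `api_key` arguments, the default URL, and the environment variable names are assumptions based on the SDK's public documentation and may differ between versions. `client_kwargs` and `make_client` are hypothetical helpers introduced only for this example.

```python
import os


def client_kwargs(mode: str) -> dict:
    """Build constructor arguments for the chosen mode (hypothetical helper)."""
    if mode == "local":
        # LOCAL mode: train and generate on your own compute resources.
        return {"local": True}
    if mode == "client":
        # CLIENT mode: talk to a remote MOSTLY AI platform.
        # The environment variable names and default URL are assumptions.
        return {
            "base_url": os.environ.get("MOSTLY_BASE_URL", "https://app.mostly.ai"),
            "api_key": os.environ.get("MOSTLY_API_KEY", ""),
        }
    raise ValueError(f"unknown mode: {mode!r}")


def make_client(mode: str = "local"):
    """Create an SDK client; requires the SDK to be installed."""
    # Imported lazily so client_kwargs() works without the SDK installed.
    from mostlyai.sdk import MostlyAI  # assumed import path

    return MostlyAI(**client_kwargs(mode))
```

With the SDK installed, `mostly = make_client("local")` would give a client for local training, while `make_client("client")` would read the platform URL and API key from the environment.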

Overview

The SDK allows you to programmatically create, browse, and manage three key resources:

  1. Generators - Train a synthetic data generator on your existing tabular or language data assets
  2. Synthetic Datasets - Use a generator to create as many synthetic samples as you need
  3. Connectors - Connect to any data source within your organization, for reading and writing data

| Intent | Primitive | API Reference |
|---|---|---|
| Train a Generator on tabular or language data | g = mostly.train(config) | mostly.train |
| Generate any number of synthetic data records | sd = mostly.generate(g, config) | mostly.generate |
| Live probe the generator on demand | df = mostly.probe(g, config) | mostly.probe |
| Connect to any data source within your org | c = mostly.connect(config) | mostly.connect |
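
The primitives above can be chained into a short workflow. The following is a minimal sketch, not a definitive recipe: the config layout (`name`, `tables`, `data`) and the `size` argument are assumptions drawn from the SDK's public examples, and `build_train_config` and `run_workflow` are hypothetical helpers. The `mostly` argument is expected to be an already initialized SDK client.

```python
def build_train_config(name: str, table_name: str, data) -> dict:
    """Assemble a single-table generator config (hypothetical helper).

    `data` is passed through unchanged, for example a pandas DataFrame.
    """
    return {
        "name": name,
        "tables": [{"name": table_name, "data": data}],
    }


def run_workflow(mostly, data, n_samples: int = 1000, n_probe: int = 10):
    """Train a generator, create a synthetic dataset, and probe the generator."""
    config = build_train_config("Demo generator", "demo", data)
    g = mostly.train(config=config)          # train a generator on the data
    sd = mostly.generate(g, size=n_samples)  # create a synthetic dataset
    df = mostly.probe(g, size=n_probe)       # draw a few samples on demand
    return g, sd, df
```

Keeping the config in a plain dictionary makes it easy to inspect or store alongside the generator before any training starts.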

https://github.com/user-attachments/assets/9e233213-a259-455c-b8ed-d1f1548b492f

Key Features

  • Broad Data Support
    • Mixed-type data (categorical, numerical, geospatial, text, etc.)
    • Single-table, multi-table, and time-series
  • Multiple Model Types
    • State-of-the-art performance via TabularARGN
    • Fine-tune Hugging Face hosted language models
    • Efficient LSTM for text synthesis from scratch
  • Advanced Training Options
    • GPU/CPU support
    • Differential Privacy
    • Progress Monitoring
  • Automated Quality Assurance
    • Quality metrics for fidelity and privacy
    • In-depth HTML reports for visual analysis
  • Flexible Sampling
    • Up-sample to any data volume
    • Conditional simulations based on any columns
    • Re-balance underrepresented segments
    • Context-aware data imputation
    • Statistical fairness controls
    • Rule-adherence via temperature
  • Seamless Integration
    • Connect to external data sources (DBs, cloud storage)
    • Fully permissive open-source license
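
To make the conditional simulation feature more concrete, here is a minimal sketch that fixes one column to chosen values and lets the generator fill in the rest. It assumes that `mostly.probe` accepts a `seed` argument holding a pandas DataFrame, which should be checked against the API reference for your SDK version. `build_seed_rows` and `conditional_probe` are hypothetical helpers, and the column name in the usage note is illustrative only.

```python
def build_seed_rows(column: str, values: list, repeats: int = 1) -> list:
    """Return seed rows that fix `column` to each value, `repeats` times each."""
    return [{column: value} for value in values for _ in range(repeats)]


def conditional_probe(mostly, generator, column: str, values: list, repeats: int = 1):
    """Probe a generator with one column fixed to the given values."""
    # Imported lazily so build_seed_rows() works without pandas installed.
    import pandas as pd

    seed = pd.DataFrame(build_seed_rows(column, values, repeats))
    # Assumption: `seed` is the parameter name for conditional generation.
    return mostly.probe(generator, seed=seed)
```

For example, `conditional_probe(mostly, g, "segment", ["A", "B"], repeats=50)` would request 50 synthetic records for each of the two segment values, given a trained generator `g` whose data contains a `segment` column.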

Citation

Please consider citing our project if you find it useful:

@misc{mostlyai,
      title={Democratizing Tabular Data Access with an Open-Source Synthetic-Data SDK},
      author={Ivona Krchova and Mariana Vargas Vieyra and Mario Scriminaci and Andrey Sidorenko},
      year={2025},
      eprint={2508.00718},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2508.00718},
}