---
title: Anonyspark
emoji: π
colorFrom: green
colorTo: gray
sdk: static
pinned: false
---
# anonyspark

`anonyspark` is a lightweight Python package for schema-driven **data masking and anonymization** in **PySpark DataFrames**. Designed for ML engineers, data analysts, and compliance teams working with sensitive data in big data environments, it helps enforce **data privacy**, **PII redaction**, and **regulatory compliance** (e.g., HIPAA, GDPR).

---

## Motivation

In enterprise data pipelines, personally identifiable information (PII) and sensitive fields are often left exposed in logs, training data, or staging zones. `anonyspark` solves this by enabling **deterministic and schema-aware masking** of such fields **directly in Spark**, without leaving the distributed environment.
---

## Key Features

- **Schema-driven masking** based on column types or names
- Supports **regex**, **nulling**, **hashing**, and **custom UDF-based** masking (see the sketch below)
- Designed for **PySpark DataFrames**, not pandas
- Lightweight, dependency-free beyond PySpark itself, and easy to integrate
- CLI-ready for pipeline integration (coming soon)
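
To make the masking modes concrete, here is a minimal sketch of deterministic hashing and nulling expressed in plain PySpark. It is illustrative only: the rule dictionary, column names, and loop are assumptions made for this example, not the `anonyspark` API.

```python
# Illustrative only: what schema-driven masking looks like in plain PySpark.
# anonyspark automates this pattern; its actual API may differ.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("masking-sketch").getOrCreate()

df = spark.createDataFrame(
    [("alice@example.com", "555-0100", 34)],
    ["email", "phone", "age"],
)

# Hypothetical masking rules keyed by column name: hash emails
# deterministically, null out phone numbers, leave age untouched.
rules = {"email": "hash", "phone": "null"}

for col, rule in rules.items():
    if rule == "hash":
        # SHA-256 is deterministic, so joins on the masked key still work.
        df = df.withColumn(col, F.sha2(F.col(col), 256))
    elif rule == "null":
        # Redact the value entirely while preserving the schema.
        df = df.withColumn(col, F.lit(None).cast("string"))

df.show(truncate=False)
```

Because everything runs as native Spark column expressions, the masking stays distributed and never pulls data to the driver.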
---

## Use Cases

- Mask PII fields in ETL pipelines before storage or ML training (see the redaction sketch below)
- Anonymize user data before model sharing or analytics
- Simulate production-like data in dev/test environments
- Help comply with HIPAA, GDPR, and internal audit policies
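
As one concrete example of the ETL use case, the sketch below redacts phone-number-shaped substrings from a free-text column using plain PySpark. The column name, regex pattern, and replacement token are assumptions chosen for illustration.

```python
# Illustrative only: regex-based PII redaction in plain PySpark.
# The column name, pattern, and replacement token are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("redaction-sketch").getOrCreate()

df = spark.createDataFrame(
    [("Call me at 555-123-4567 after 5pm",)],
    ["notes"],
)

# Replace anything shaped like a US phone number before the data
# lands in storage or a training set.
redacted = df.withColumn(
    "notes",
    F.regexp_replace("notes", r"\d{3}-\d{3}-\d{4}", "[REDACTED]"),
)

redacted.show(truncate=False)
```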
---

## Installation

```bash
pip install anonyspark
```

PyPI: https://pypi.org/project/anonyspark-core

---

## License

MIT License