---
title: Anonyspark
emoji: 📈
colorFrom: green
colorTo: gray
sdk: static
pinned: false
---
# anonyspark
`anonyspark` is a lightweight Python package for schema-driven **data masking and anonymization** in **PySpark DataFrames**. Designed for ML engineers, data analysts, and compliance teams working with sensitive data in big data environments, it helps enforce **data privacy**, **PII redaction**, and **regulatory compliance** (e.g., HIPAA, GDPR).
---
## Motivation
In enterprise data pipelines, personally identifiable information (PII) and sensitive fields are often left exposed in logs, training data, or staging zones. `anonyspark` solves this by enabling **deterministic and schema-aware masking** of such fields **directly in Spark**, without leaving the distributed environment.
---
## Key Features
- **Schema-driven masking** based on column types or names
- Supports **regex**, **nulling**, **hashing**, or **custom UDF-based** masking
- Designed for **PySpark DataFrames**, not pandas
- Lightweight, dependency-free, and easy to integrate
- CLI-ready for pipeline integration (coming soon)
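
To make the idea of schema-driven masking concrete, here is a minimal sketch in plain Python. It is not `anonyspark`'s API: the `MASK_SCHEMA` mapping, the rule names, and the helper functions are all hypothetical, chosen only to illustrate how a schema can select hashing, regex, or nulling per column.

```python
import hashlib
import re

# Hypothetical schema (an assumption, not anonyspark's format):
# column name -> name of the masking rule to apply.
MASK_SCHEMA = {
    "email": "hash",
    "ssn": "regex",
    "notes": "null",
}

# Matches SSN-shaped strings and captures the last four digits.
SSN_PATTERN = re.compile(r"\d{3}-\d{2}-(\d{4})")


def mask_hash(value):
    """Deterministic masking: the same input always yields the same digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def mask_regex(value):
    """Redact an SSN-shaped string, keeping only the last four digits."""
    return SSN_PATTERN.sub(r"***-**-\1", value)


def mask_null(value):
    """Drop the value entirely."""
    return None


RULES = {"hash": mask_hash, "regex": mask_regex, "null": mask_null}


def mask_row(row, schema=MASK_SCHEMA):
    """Apply the schema's rule to each listed column; pass others through."""
    masked = {}
    for column, value in row.items():
        rule = schema.get(column)
        if rule is None or value is None:
            masked[column] = value
        else:
            masked[column] = RULES[rule](value)
    return masked


row = {"id": 1, "email": "a@example.com", "ssn": "123-45-6789", "notes": "call back"}
masked = mask_row(row)
```

Because the hash is deterministic, masked columns can still serve as join keys across tables. In a Spark job the same rules would typically be expressed with built-in column functions such as `pyspark.sql.functions.sha2` and `regexp_replace`, or with a custom UDF, so that the data never leaves the cluster.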
---
## Use Cases
- Mask PII fields in ETL pipelines before storage or ML training
- Anonymize user data before model sharing or analytics
- Simulate production-like data in dev/test environments
- Help comply with HIPAA, GDPR, and internal audit policies
---
## Installation
```bash
pip install anonyspark
```
PyPI link: https://pypi.org/project/anonyspark-core

License: MIT License