---
title: Anonyspark
emoji: π
colorFrom: green
colorTo: gray
sdk: static
pinned: false
---
# anonyspark
`anonyspark` is a lightweight Python package for schema-driven **data masking and anonymization** in **PySpark DataFrames**. Designed for ML engineers, data analysts, and compliance teams working with sensitive data in big data environments, it helps enforce **data privacy**, **PII redaction**, and **regulatory compliance** (e.g., HIPAA, GDPR).
---
## Motivation
In enterprise data pipelines, personally identifiable information (PII) and other sensitive fields are often left exposed in logs, training data, or staging zones. `anonyspark` addresses this by enabling **deterministic and schema-aware masking** of such fields **directly in Spark**, so the data never has to leave the distributed environment.
---
## Key Features
- **Schema-driven masking** based on column types or names
- Supports **regex**, **nulling**, **hashing**, or **custom UDF-based** masking
- Designed for **PySpark DataFrames**, not pandas
- Lightweight, dependency-free, and easy to integrate
- CLI-ready for pipeline integration (coming soon)
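To make the idea of schema-driven masking concrete, here is a minimal sketch that turns a column-to-rule mapping into Spark SQL expressions. It does not use the `anonyspark` API: the rule names (`hash`, `null`, `digits`), the `MASK_RULES` table, and the `build_select_exprs` helper are all hypothetical and only illustrate the technique with Spark's built-in `sha2` and `regexp_replace` SQL functions. Custom UDF-based masking is not covered here.

```python
# Illustrative only: these rule names and helpers are assumptions,
# not part of the anonyspark package.
MASK_RULES = {
    # Deterministic: the same input always yields the same SHA-256 digest.
    "hash": "sha2(CAST(`{col}` AS STRING), 256)",
    # Nulling: drop the value entirely but keep the column.
    "null": "CAST(NULL AS STRING)",
    # Regex: replace every digit with an asterisk.
    "digits": "regexp_replace(CAST(`{col}` AS STRING), '[0-9]', '*')",
}


def build_select_exprs(columns, schema):
    """Return one Spark SQL expression per column.

    `columns` is the list of DataFrame column names; `schema` maps a
    column name to a rule name. Columns without a rule pass through.
    """
    exprs = []
    for col in columns:
        rule = schema.get(col)
        if rule is None:
            exprs.append(f"`{col}`")
            continue
        if rule not in MASK_RULES:
            raise ValueError(f"unknown masking rule {rule!r} for column {col!r}")
        exprs.append(f"{MASK_RULES[rule].format(col=col)} AS `{col}`")
    return exprs


masking_schema = {"email": "hash", "phone": "digits", "ssn": "null"}
exprs = build_select_exprs(["id", "email", "phone", "ssn"], masking_schema)
for e in exprs:
    print(e)
```

With a live Spark session, the expressions could be applied as `masked_df = df.selectExpr(*build_select_exprs(df.columns, masking_schema))`, which keeps the masking inside Spark rather than collecting rows to the driver.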
---
## Use Cases
- Mask PII fields in ETL pipelines before storage or ML training
- Anonymize user data before model sharing or analytics
- Simulate production-like data in dev/test environments
- Help comply with HIPAA, GDPR, and internal audit policies
---
## Installation
```bash
pip install anonyspark
```
PyPI link: https://pypi.org/project/anonyspark-core
License: MIT License