---
title: Anonyspark
emoji: 📈
colorFrom: green
colorTo: gray
sdk: static
pinned: false
---

# anonyspark

`anonyspark` is a lightweight Python package for schema-driven **data masking and anonymization** in **PySpark DataFrames**. Designed for ML engineers, data analysts, and compliance teams working with sensitive data in big data environments, it helps enforce **data privacy**, **PII redaction**, and **regulatory compliance** (e.g., HIPAA, GDPR).

---

## Motivation

In enterprise data pipelines, personally identifiable information (PII) and sensitive fields are often left exposed in logs, training data, or staging zones. `anonyspark` solves this by enabling **deterministic and schema-aware masking** of such fields **directly in Spark**, without leaving the distributed environment.

---

## Key Features

- **Schema-driven masking** based on column types or names
- Supports **regex**, **nulling**, **hashing**, or **custom UDF-based** masking
- Designed for **PySpark DataFrames**, not pandas
- Lightweight, dependency-free, and easy to integrate
- CLI-ready for pipeline integration (coming soon)

---

## Use Cases

- Mask PII fields in ETL pipelines before storage or ML training
- Anonymize user data before model sharing or analytics
- Simulate production-like data in dev/test environments
- Help comply with HIPAA, GDPR, and internal audit policies

---

## Installation

```bash
pip install anonyspark
```

PyPI link: https://pypi.org/project/anonyspark-core

License: MIT License
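
---

To illustrate the masking strategies listed under Key Features (hashing, regex, nulling, and UDF-based masking), here is a minimal sketch of schema-driven masking. It is not `anonyspark`'s API: every name in it (`mask_hash`, `mask_regex`, `mask_null`, `STRATEGIES`, `MASKING_SCHEMA`, `mask_row`, `mask_dataframe`) is hypothetical, and it uses only the Python standard library plus standard PySpark calls. It assumes the masked columns are strings.

```python
import hashlib
import re


def mask_hash(value):
    """Deterministic masking: equal inputs always yield the same SHA-256 digest."""
    if value is None:
        return None
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def mask_regex(value):
    """Regex masking: replace every digit except the last four with '*'."""
    if value is None:
        return None
    return re.sub(r"\d(?=(?:\D*\d){4})", "*", value)


def mask_null(value):
    """Nulling: drop the value entirely."""
    return None


# Hypothetical strategy registry and masking schema (column name -> strategy).
STRATEGIES = {"hash": mask_hash, "regex": mask_regex, "null": mask_null}
MASKING_SCHEMA = {"email": "hash", "ssn": "regex", "notes": "null"}


def mask_row(row, schema):
    """Apply the schema to a plain dict; columns not in the schema pass through."""
    return {
        name: STRATEGIES[schema[name]](value) if name in schema else value
        for name, value in row.items()
    }


def mask_dataframe(df, schema):
    """Apply the same schema to a PySpark DataFrame by wrapping each masker as a UDF.

    PySpark is imported lazily so the helpers above work without Spark installed.
    Assumes the masked columns are string-typed; each one comes back as a string column.
    """
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    for name, strategy in schema.items():
        if name in df.columns:
            masker = F.udf(STRATEGIES[strategy], StringType())
            df = df.withColumn(name, masker(F.col(name)))
    return df


# Local check on a single record, no Spark session required.
row = {"user_id": "42", "email": "a@example.com", "ssn": "123-45-6789", "notes": "call back"}
masked = mask_row(row, MASKING_SCHEMA)
# masked["ssn"] is "***-**-6789"; masked["notes"] is None; "user_id" is unchanged.
```

With a Spark session available, `mask_dataframe(df, MASKING_SCHEMA)` would apply the same rules column by column. Python UDFs are used here only to keep one code path for both cases; built-in column functions such as `sha2` and `regexp_replace` generally run faster on large datasets.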