---
title: Anonyspark
emoji: 📈
colorFrom: green
colorTo: gray
sdk: static
pinned: false
---

# anonyspark

`anonyspark` is a lightweight Python package for schema-driven **data masking and anonymization** in **PySpark DataFrames**. Designed for ML engineers, data analysts, and compliance teams working with sensitive data in big data environments, it helps enforce **data privacy**, **PII redaction**, and **regulatory compliance** (e.g., HIPAA, GDPR).

---

## Motivation

In enterprise data pipelines, personally identifiable information (PII) and sensitive fields are often left exposed in logs, training data, or staging zones. `anonyspark` solves this by enabling **deterministic and schema-aware masking** of such fields **directly in Spark**, without leaving the distributed environment.

---

## Key Features

- **Schema-driven masking** based on column types or names
- Supports **regex**, **nulling**, **hashing**, or **custom UDF-based** masking
- Designed for **PySpark DataFrames**, not pandas
- Lightweight, dependency-free, and easy to integrate
- CLI for pipeline integration (coming soon)
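
To make the masking strategies listed above concrete, here is a minimal sketch written against plain PySpark built-ins (`sha2`, `regexp_replace`, `lit`). It is not `anonyspark`'s own API, which this README does not show. The names `MASK_SCHEMA`, `validate_schema`, and `apply_masks` are hypothetical and exist only for illustration.

```python
# Illustrative sketch only: this is NOT anonyspark's API.
# It uses plain PySpark built-ins to show hashing, regex, and nulling strategies.

# Hypothetical schema: column name -> masking strategy.
MASK_SCHEMA = {
    "email": "hash",
    "ssn": "regex",
    "notes": "null",
}

SUPPORTED = {"hash", "regex", "null"}


def validate_schema(schema):
    """Return sorted (column, strategy) pairs, rejecting unknown strategies."""
    unknown = sorted(s for s in schema.values() if s not in SUPPORTED)
    if unknown:
        raise ValueError(f"unsupported strategies: {unknown}")
    return sorted(schema.items())


def apply_masks(df, schema):
    """Apply each strategy to a PySpark DataFrame using built-in column functions."""
    from pyspark.sql import functions as F  # imported lazily; requires pyspark

    for column, strategy in validate_schema(schema):
        if column not in df.columns:
            continue  # skip columns absent from this DataFrame
        if strategy == "hash":
            # Deterministic: the same input always yields the same SHA-256 digest.
            masked = F.sha2(F.col(column).cast("string"), 256)
        elif strategy == "regex":
            # Replace every digit with "X", keeping separators such as "-".
            masked = F.regexp_replace(F.col(column).cast("string"), r"\d", "X")
        else:  # "null"
            # Null the value while preserving the column's original type.
            masked = F.lit(None).cast(df.schema[column].dataType)
        df = df.withColumn(column, masked)
    return df
```

Calling `apply_masks(df, MASK_SCHEMA)` on a DataFrame returns a new DataFrame with the listed columns masked and all other columns unchanged. Note that an unsalted hash of a low-entropy value (such as a phone number) can be reversed by brute force, so a real pipeline would likely add a secret salt or key.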

---

## Use Cases

- Mask PII fields in ETL pipelines before storage or ML training
- Anonymize user data before model sharing or analytics
- Simulate production-like data in dev/test environments
- Help comply with HIPAA, GDPR, and internal audit policies

---

## Installation

```bash
pip install anonyspark
```

PyPI link: https://pypi.org/project/anonyspark-core

License: MIT License