---
title: Anonyspark
emoji: 📈
colorFrom: green
colorTo: gray
sdk: static
pinned: false
---

# anonyspark

`anonyspark` is a lightweight Python package for schema-driven **data masking and anonymization** in **PySpark DataFrames**. Designed for ML engineers, data analysts, and compliance teams working with sensitive data in big data environments, it helps enforce **data privacy**, **PII redaction**, and **regulatory compliance** (e.g., HIPAA, GDPR).

---

## Motivation

In enterprise data pipelines, personally identifiable information (PII) and sensitive fields are often left exposed in logs, training data, or staging zones. `anonyspark` solves this by enabling **deterministic and schema-aware masking** of such fields **directly in Spark**, without leaving the distributed environment.

---

## Key Features

-  **Schema-driven masking** based on column types or names  
-  Supports **regex**, **nulling**, **hashing**, or **custom UDF-based** masking  
-  Designed for **PySpark DataFrames**, not pandas  
-  Lightweight, dependency-free, and easy to integrate  
-  CLI-ready for pipeline integration (coming soon)
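
To make the idea of schema-driven masking concrete, here is a minimal pure-Python sketch of the strategies listed above (hashing, regex, nulling, and a custom callable). It is not `anonyspark`'s actual API: the names `SCHEMA`, `STRATEGIES`, `mask_row`, and the `demo-salt` value are hypothetical and exist only for illustration.

```python
import hashlib
import re

# Hypothetical schema (not anonyspark's real API): column name -> strategy.
# A strategy is either a key of STRATEGIES or a custom callable.
SCHEMA = {
    "email": "hash",
    "ssn": "regex",
    "notes": "null",
    "name": lambda value: value[0] + "***",  # custom masking function
}


def hash_value(value, salt="demo-salt"):
    """Deterministic SHA-256 digest: equal inputs always give equal outputs."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def regex_mask(value, pattern=r"\d", repl="*"):
    """Replace every match of `pattern` (digits by default) with `repl`."""
    return re.sub(pattern, repl, value)


def null_mask(value):
    """Drop the value entirely."""
    return None


STRATEGIES = {"hash": hash_value, "regex": regex_mask, "null": null_mask}


def mask_row(row, schema):
    """Return a copy of `row` with the columns named in `schema` masked."""
    masked = dict(row)
    for column, strategy in schema.items():
        if masked.get(column) is None:
            continue  # column absent or already null: nothing to mask
        fn = strategy if callable(strategy) else STRATEGIES[strategy]
        masked[column] = fn(masked[column])
    return masked


row = {
    "id": 1,
    "name": "Alice",
    "email": "alice@example.com",
    "ssn": "123-45-6789",
    "notes": "call back",
}
masked = mask_row(row, SCHEMA)
print(masked)
```

Because the hash is deterministic, the same email always maps to the same digest, so masked columns can still be used as join keys. In a Spark job the same dispatch could plausibly be expressed with built-in column functions from `pyspark.sql.functions`, such as `sha2`, `regexp_replace`, and `lit(None)`, so that the masking runs inside the cluster.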

---

## Use Cases

-  Mask PII fields in ETL pipelines before storage or ML training  
-  Anonymize user data before model sharing or analytics  
-  Simulate production-like data in dev/test environments  
-  Help comply with HIPAA, GDPR, and internal audit policies  

---

## Installation

```bash
pip install anonyspark
```

PyPI link: https://pypi.org/project/anonyspark-core

License: MIT License