metadata
title: Anonyspark
emoji: 📈
colorFrom: green
colorTo: gray
sdk: static
pinned: false

anonyspark

anonyspark is a lightweight Python package for schema-driven data masking and anonymization of PySpark DataFrames. It is designed for ML engineers, data analysts, and compliance teams who handle sensitive data in big data environments, and it supports data privacy, PII redaction, and regulatory compliance efforts (e.g., HIPAA, GDPR).


Motivation

In enterprise data pipelines, personally identifiable information (PII) and other sensitive fields are often left exposed in logs, training data, or staging zones. anonyspark addresses this by providing deterministic, schema-aware masking of such fields directly in Spark, so the data never has to leave the distributed environment.


Key Features

  • Schema-driven masking based on column types or names
  • Supports regex, nulling, hashing, or custom UDF-based masking
  • Designed for PySpark DataFrames, not pandas
  • Lightweight, dependency-free, and easy to integrate
  • CLI-ready for pipeline integration (coming soon)

Use Cases

  • Mask PII fields in ETL pipelines before storage or ML training
  • Anonymize user data before model sharing or analytics
  • Simulate production-like data in dev/test environments
  • Help comply with HIPAA, GDPR, and internal audit policies

Installation

pip install anonyspark

PyPI link: https://pypi.org/project/anonyspark-core

License: MIT License