deid-LONGFORMER-NemPII
HIPAA-compliant clinical de-identification that beats commercial solutions, at zero cost.
A fine-tuned Clinical-Longformer model for Protected Health Information (PHI) detection and replacement in clinical text, achieving 97.74% F1 on held-out test data.
Model Description
This model identifies 25 types of Protected Health Information (PHI) in clinical text using BILOU tagging (101 classes total). Unlike commercial solutions that simply redact PHI with [REDACTED], the accompanying replacement logic generates realistic surrogate data that preserves clinical meaning.
Performance Comparison
| Solution | F1 Score | Cost | Replacement Quality |
|---|---|---|---|
| AWS Comprehend Medical | ~83-93% | $14.5K/1M notes | Basic placeholders |
| John Snow Labs | 96-97% | Enterprise license | Basic placeholders |
| deid-LONGFORMER-NemPII | 97.74% | Free | Realistic surrogates |
Acknowledgments & Inspiration
This model builds directly on excellent prior work:
obi/deid_bert_i2b2: The Inspiration
This project was directly inspired by obi/deid_bert_i2b2 from the Open Biomedical Informatics team (Prajwal Kailas, Max Homilius, Shinichi Goto). Their pioneering work on ClinicalBERT-based de-identification using the I2B2 2014 dataset demonstrated the viability of transformer-based approaches for PHI detection. The robust-deid framework they developed provided invaluable reference for architecture decisions, BILOU tagging schemes, and evaluation methodology.
yikuan8/Clinical-Longformer: The Base Model
Built on yikuan8/Clinical-Longformer by Li, Yikuan et al. This clinical knowledge-enriched Longformer was pre-trained on MIMIC-III clinical notes and supports sequences up to 4,096 tokens, which is critical for processing real-world clinical documents that often exceed BERT's 512-token limit.
NVIDIA Nemotron-PII: The Training Data
Trained on the healthcare subset of nvidia/Nemotron-PII (3,630 records, CC BY 4.0). This synthetic dataset provides diverse PHI patterns without exposing real patient data.
Intended Uses
- Clinical research: De-identify notes for IRB-compliant research datasets
- Healthcare NLP: Prepare training data for downstream clinical NLP tasks
- Data sharing: Enable safe sharing of clinical text between institutions
- Quality improvement: Analyze clinical documentation without PHI exposure
Key Features
The replacement logic (in the GitHub repo) provides the following guarantees (a simplified sketch appears after this list):
- Age-preserving DOB: Fake DOBs keep patient age within ±2 years
- Name consistency: "Dr. Sarah Johnson" and "Sarah J." map to the same fake name
- Temporal consistency: All dates shift by the same offset (preserves intervals)
- Geographic consistency: City/state/ZIP combinations are coherent
- Format preservation: Phone numbers, SSNs, dates keep original format
- Medical term protection: Whitelist prevents terms like "Anion Gap" from being replaced with fake names
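A minimal sketch of how two of these guarantees (temporal consistency via a shared date offset, and name consistency via memoized mapping) might be implemented. The class and helper names here are illustrative only, not the repo's actual API:

import hashlib
from datetime import datetime, timedelta

# Illustrative sketch only; the real replacement logic lives in the GitHub repo.
class SurrogateMapper:
    def __init__(self, doc_id: str, fake_names: list[str]):
        # Derive a per-document offset so every date in a note shifts by the same amount.
        seed = int(hashlib.sha256(doc_id.encode()).hexdigest(), 16)
        self.date_offset = timedelta(days=(seed % 365) - 182)
        self.fake_names = fake_names
        self.name_map: dict[str, str] = {}

    def shift_date(self, text: str, fmt: str = "%m/%d/%Y") -> str:
        # A shared offset preserves intervals between dates (temporal consistency).
        shifted = datetime.strptime(text, fmt) + self.date_offset
        return shifted.strftime(fmt)

    def map_name(self, name: str) -> str:
        # Repeated mentions of the same person get the same surrogate (name consistency);
        # matching "Dr. Sarah Johnson" to "Sarah J." would need normalization upstream.
        key = name.lower().strip()
        if key not in self.name_map:
            self.name_map[key] = self.fake_names[len(self.name_map) % len(self.fake_names)]
        return self.name_map[key]

mapper = SurrogateMapper("note-001", ["Robert Johnson", "Maria Lopez"])
print(mapper.shift_date("01/15/1957"), mapper.map_name("Sarah Johnson"))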
Training Details
Architecture
| Parameter | Value |
|---|---|
| Base Model | yikuan8/Clinical-Longformer |
| Parameters | 148M |
| Max Length | 4,096 tokens |
| Task | Token Classification |
| Tagging | BILOU scheme |
| Classes | 101 (25 PHI types × 4 BILOU tags + O) |
PHI Categories (25 types)
NAME, FIRST_NAME, LAST_NAME, DATE, DATE_OF_BIRTH, DATE_TIME,
TIME, AGE, SSN, MEDICAL_RECORD_NUMBER, HEALTH_PLAN_BENEFICIARY_NUMBER,
ACCOUNT_NUMBER, CERTIFICATE_LICENSE_NUMBER, PHONE_NUMBER, FAX_NUMBER,
EMAIL, STREET_ADDRESS, CITY, STATE, POSTCODE, COUNTRY,
BIOMETRIC_IDENTIFIER, UNIQUE_ID, CUSTOMER_ID, EMPLOYEE_ID
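The 101-class label space in the table above follows directly from BILOU tagging over these 25 categories. A quick sanity check, assuming the conventional B-/I-/L-/U- prefix encoding (the exact prefix naming is not spelled out in this card):

phi_types = [
    "NAME", "FIRST_NAME", "LAST_NAME", "DATE", "DATE_OF_BIRTH", "DATE_TIME",
    "TIME", "AGE", "SSN", "MEDICAL_RECORD_NUMBER", "HEALTH_PLAN_BENEFICIARY_NUMBER",
    "ACCOUNT_NUMBER", "CERTIFICATE_LICENSE_NUMBER", "PHONE_NUMBER", "FAX_NUMBER",
    "EMAIL", "STREET_ADDRESS", "CITY", "STATE", "POSTCODE", "COUNTRY",
    "BIOMETRIC_IDENTIFIER", "UNIQUE_ID", "CUSTOMER_ID", "EMPLOYEE_ID",
]
# B = begin, I = inside, L = last, U = unit (single-token entity), O = outside
labels = ["O"] + [f"{prefix}-{t}" for t in phi_types for prefix in ("B", "I", "L", "U")]
print(len(labels))  # 25 types x 4 tags + 1 = 101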
Training Procedure
- Dataset: NVIDIA Nemotron-PII healthcare subset (3,630 records)
- Split: 80% train / 20% test
- Epochs: 10
- Batch size: 4
- Learning rate: 2e-5
- Optimizer: AdamW
- Hardware: NVIDIA T4 GPU
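A minimal fine-tuning sketch consistent with the hyperparameters above, assuming a standard Hugging Face Trainer workflow; the train_dataset and eval_dataset variables are placeholders for the tokenized, label-aligned Nemotron-PII splits and this is not the actual training script:

from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("yikuan8/Clinical-Longformer")
model = AutoModelForTokenClassification.from_pretrained(
    "yikuan8/Clinical-Longformer", num_labels=101)

args = TrainingArguments(
    output_dir="deid-longformer-nempii",
    num_train_epochs=10,
    per_device_train_batch_size=4,
    learning_rate=2e-5,  # AdamW is the Trainer's default optimizer
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # placeholder: tokenized 80% training split
    eval_dataset=eval_dataset,    # placeholder: tokenized 20% test split
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()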
Evaluation Results
| Metric | Score |
|---|---|
| F1 | 97.74% |
| Precision | 97.62% |
| Recall | 97.86% |
Usage
With Transformers
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
tokenizer = AutoTokenizer.from_pretrained("riggsmed/deid-LONGFORMER-NemPII")
model = AutoModelForTokenClassification.from_pretrained("riggsmed/deid-LONGFORMER-NemPII")
text = "Patient John Smith, DOB 01/15/1957, presented with chest pain."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=-1)
# Decode predictions to entity labels
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]
for token, label in zip(tokens, labels):
    if label != "O":
        print(f"{token}: {label}")
With Pipeline
from transformers import pipeline
pipe = pipeline(
    "token-classification",
    model="riggsmed/deid-LONGFORMER-NemPII",
    aggregation_strategy="simple",
)
text = "Contact Dr. Sarah Johnson at (405) 555-1234"
entities = pipe(text)
for ent in entities:
    print(f"{ent['word']}: {ent['entity_group']} ({ent['score']:.2f})")
Full De-identification (with surrogates)
For realistic surrogate replacement, use the full system from GitHub:
git clone https://github.com/Hrygt/deid-longformer-nempii.git
cd deid-longformer-nempii
pip install -r requirements.txt
from deid import deidentify_text
result = deidentify_text("Patient John Smith, DOB 01/15/1957")
print(result["deidentified_text"])
# Output: "Patient Robert Johnson, DOB 03/22/1955"
Limitations
- English only: Trained exclusively on English clinical text
- US-centric: PHI patterns (SSN format, US addresses) are US-focused
- Synthetic training data: May miss edge cases in real clinical notes
- Not a substitute for expert review: For high-stakes applications, human review is recommended
Ethical Considerations
- This model is intended to protect patient privacy, not circumvent it
- De-identified data should still be handled according to institutional policies
- The model may have biases from training data that could affect certain demographic groups
- Always validate de-identification quality on your specific data before production use
Live Demo
Try it at: https://deid.riggsmedai.com
Citation
@software{riggs2024deid,
  author = {Riggs, Gary},
  title = {deid-LONGFORMER-NemPII: Clinical De-identification with Realistic Surrogates},
  year = {2024},
  url = {https://huggingface.co/riggsmed/deid-LONGFORMER-NemPII}
}
Please also cite the foundational work:
@article{li2023comparative,
  title = {A comparative study of pretrained language models for long clinical text},
  author = {Li, Yikuan and Wehbe, Ramsey M and Ahmad, Faraz S and Wang, Hanyin and Luo, Yuan},
  journal = {Journal of the American Medical Informatics Association},
  volume = {30},
  number = {2},
  pages = {340--347},
  year = {2023}
}
@misc{obi_deid,
  author = {Kailas, Prajwal and Homilius, Max and Goto, Shinichi},
  title = {Robust De-ID: De-Identification of Medical Notes using Transformer Architectures},
  year = {2022},
  url = {https://github.com/obi-ml-public/ehr_deidentification}
}
Author
Gary Riggs, MD
Medical Director, Metro Physician Group
Master of Science in Data Science candidate, Northwestern University
Model Card Contact
For questions or issues: GitHub Issues