deid-LONGFORMER-NemPII
HIPAA-compliant clinical de-identification that beats commercial solutions, at zero cost.
A fine-tuned Clinical-Longformer model for Protected Health Information (PHI) detection and replacement in clinical text, achieving 97.74% F1 on held-out test data.
Model Description
This model identifies 25 types of Protected Health Information (PHI) in clinical text using BILOU tagging (101 classes total). Unlike commercial solutions that simply redact PHI with [REDACTED], the accompanying replacement logic generates realistic surrogate data that preserves clinical meaning.
Performance Comparison
| Solution | F1 Score | Cost | Replacement Quality |
|---|---|---|---|
| AWS Comprehend Medical | ~83-93% | $14.5K/1M notes | Basic placeholders |
| John Snow Labs | 96-97% | Enterprise license | Basic placeholders |
| deid-LONGFORMER-NemPII | 97.74% | Free | Realistic surrogates |
Acknowledgments & Inspiration
This model builds directly on excellent prior work:
obi/deid_bert_i2b2: The Inspiration
This project was directly inspired by obi/deid_bert_i2b2 from the Open Biomedical Informatics team (Prajwal Kailas, Max Homilius, Shinichi Goto). Their pioneering work on ClinicalBERT-based de-identification using the I2B2 2014 dataset demonstrated the viability of transformer-based approaches for PHI detection. The robust-deid framework they developed provided invaluable reference for architecture decisions, BILOU tagging schemes, and evaluation methodology.
yikuan8/Clinical-Longformer: The Base Model
Built on yikuan8/Clinical-Longformer by Li, Yikuan et al. This clinical knowledge-enriched Longformer was pre-trained on MIMIC-III clinical notes and supports sequences up to 4,096 tokens, which is critical for processing real-world clinical documents that often exceed BERT's 512-token limit.
NVIDIA Nemotron-PII: The Training Data
Trained on the healthcare subset of nvidia/Nemotron-PII (3,630 records, CC BY 4.0). This synthetic dataset provides diverse PHI patterns without exposing real patient data.
Intended Uses
- Clinical research: De-identify notes for IRB-compliant research datasets
- Healthcare NLP: Prepare training data for downstream clinical NLP tasks
- Data sharing: Enable safe sharing of clinical text between institutions
- Quality improvement: Analyze clinical documentation without PHI exposure
Key Features
The replacement logic (in the GitHub repo) provides the following guarantees (a simplified sketch appears after this list):
- Age-preserving DOB: Fake DOBs keep patient age within ±2 years
- Name consistency: "Dr. Sarah Johnson" and "Sarah J." map to the same fake name
- Temporal consistency: All dates shift by the same offset (preserves intervals)
- Geographic consistency: City/state/ZIP combinations are coherent
- Format preservation: Phone numbers, SSNs, dates keep original format
- Medical term protection: Whitelist prevents terms like "Anion Gap" from being replaced with fake names
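A minimal sketch of how two of these guarantees (temporal consistency via a shared date offset, and name consistency via memoized mapping) might be implemented. The class and helper names here are illustrative only, not the repo's actual API:

import hashlib
from datetime import datetime, timedelta

# Illustrative sketch only; the real replacement logic lives in the GitHub repo.
class SurrogateMapper:
    def __init__(self, doc_id: str, fake_names: list[str]):
        # Derive a per-document offset so every date in a note shifts by the same amount.
        seed = int(hashlib.sha256(doc_id.encode()).hexdigest(), 16)
        self.date_offset = timedelta(days=(seed % 365) - 182)
        self.fake_names = fake_names
        self.name_map: dict[str, str] = {}

    def shift_date(self, text: str, fmt: str = "%m/%d/%Y") -> str:
        # A shared offset preserves intervals between dates (temporal consistency).
        shifted = datetime.strptime(text, fmt) + self.date_offset
        return shifted.strftime(fmt)

    def map_name(self, name: str) -> str:
        # Repeated mentions of the same person get the same surrogate (name consistency);
        # matching "Dr. Sarah Johnson" to "Sarah J." would need normalization upstream.
        key = name.lower().strip()
        if key not in self.name_map:
            self.name_map[key] = self.fake_names[len(self.name_map) % len(self.fake_names)]
        return self.name_map[key]

mapper = SurrogateMapper("note-001", ["Robert Johnson", "Maria Lopez"])
print(mapper.shift_date("01/15/1957"), mapper.map_name("Sarah Johnson"))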
Training Details
Architecture
| Parameter | Value |
|---|---|
| Base Model | yikuan8/Clinical-Longformer |
| Parameters | 148M |
| Max Length | 4,096 tokens |
| Task | Token Classification |
| Tagging | BILOU scheme |
| Classes | 101 (25 PHI types × 4 BILOU tags + O) |
PHI Categories (25 types)
NAME, FIRST_NAME, LAST_NAME, DATE, DATE_OF_BIRTH, DATE_TIME,
TIME, AGE, SSN, MEDICAL_RECORD_NUMBER, HEALTH_PLAN_BENEFICIARY_NUMBER,
ACCOUNT_NUMBER, CERTIFICATE_LICENSE_NUMBER, PHONE_NUMBER, FAX_NUMBER,
EMAIL, STREET_ADDRESS, CITY, STATE, POSTCODE, COUNTRY,
BIOMETRIC_IDENTIFIER, UNIQUE_ID, CUSTOMER_ID, EMPLOYEE_ID
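The 101-class label space in the table above follows directly from BILOU tagging over these 25 categories. A quick sanity check, assuming the conventional B-/I-/L-/U- prefix encoding (the exact prefix naming is not spelled out in this card):

phi_types = [
    "NAME", "FIRST_NAME", "LAST_NAME", "DATE", "DATE_OF_BIRTH", "DATE_TIME",
    "TIME", "AGE", "SSN", "MEDICAL_RECORD_NUMBER", "HEALTH_PLAN_BENEFICIARY_NUMBER",
    "ACCOUNT_NUMBER", "CERTIFICATE_LICENSE_NUMBER", "PHONE_NUMBER", "FAX_NUMBER",
    "EMAIL", "STREET_ADDRESS", "CITY", "STATE", "POSTCODE", "COUNTRY",
    "BIOMETRIC_IDENTIFIER", "UNIQUE_ID", "CUSTOMER_ID", "EMPLOYEE_ID",
]
# B = begin, I = inside, L = last, U = unit (single-token entity), O = outside
labels = ["O"] + [f"{prefix}-{t}" for t in phi_types for prefix in ("B", "I", "L", "U")]
print(len(labels))  # 25 types x 4 tags + 1 = 101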
Training Procedure
- Dataset: NVIDIA Nemotron-PII healthcare subset (3,630 records)
- Split: 80% train / 20% test
- Epochs: 10
- Batch size: 4
- Learning rate: 2e-5
- Optimizer: AdamW
- Hardware: NVIDIA T4 GPU
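A minimal fine-tuning sketch consistent with the hyperparameters above, assuming a standard Hugging Face Trainer workflow; the train_dataset and eval_dataset variables are placeholders for the tokenized, label-aligned Nemotron-PII splits and this is not the actual training script:

from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("yikuan8/Clinical-Longformer")
model = AutoModelForTokenClassification.from_pretrained(
    "yikuan8/Clinical-Longformer", num_labels=101)

args = TrainingArguments(
    output_dir="deid-longformer-nempii",
    num_train_epochs=10,
    per_device_train_batch_size=4,
    learning_rate=2e-5,  # AdamW is the Trainer's default optimizer
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # placeholder: tokenized 80% training split
    eval_dataset=eval_dataset,    # placeholder: tokenized 20% test split
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()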
Evaluation Results
| Metric | Score |
|---|---|
| F1 | 97.74% |
| Precision | 97.62% |
| Recall | 97.86% |
Usage
With Transformers
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
tokenizer = AutoTokenizer.from_pretrained("riggsmed/deid-LONGFORMER-NemPII")
model = AutoModelForTokenClassification.from_pretrained("riggsmed/deid-LONGFORMER-NemPII")
text = "Patient John Smith, DOB 01/15/1957, presented with chest pain."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=-1)
# Decode predictions to entity labels
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]
for token, label in zip(tokens, labels):
    if label != "O":
        print(f"{token}: {label}")
With Pipeline
from transformers import pipeline
pipe = pipeline(
    "token-classification",
    model="riggsmed/deid-LONGFORMER-NemPII",
    aggregation_strategy="simple",
)
text = "Contact Dr. Sarah Johnson at (405) 555-1234"
entities = pipe(text)
for ent in entities:
    print(f"{ent['word']}: {ent['entity_group']} ({ent['score']:.2f})")
Full De-identification (with surrogates)
For realistic surrogate replacement, use the full system from GitHub:
git clone https://github.com/Hrygt/deid-longformer-nempii.git
cd deid-longformer-nempii
pip install -r requirements.txt
from deid import deidentify_text
result = deidentify_text("Patient John Smith, DOB 01/15/1957")
print(result["deidentified_text"])
# Output: "Patient Robert Johnson, DOB 03/22/1955"
Limitations
- English only: Trained exclusively on English clinical text
- US-centric: PHI patterns (SSN format, US addresses) are US-focused
- Synthetic training data: May miss edge cases in real clinical notes
- Not a substitute for expert review: For high-stakes applications, human review is recommended
Ethical Considerations
- This model is intended to protect patient privacy, not circumvent it
- De-identified data should still be handled according to institutional policies
- The model may have biases from training data that could affect certain demographic groups
- Always validate de-identification quality on your specific data before production use
Live Demo
Try it at: https://deid.riggsmedai.com
Citation
@software{riggs2024deid,
  author = {Riggs, Gary},
  title = {deid-LONGFORMER-NemPII: Clinical De-identification with Realistic Surrogates},
  year = {2024},
  url = {https://huggingface.co/riggsmed/deid-LONGFORMER-NemPII}
}
Please also cite the foundational work:
@article{li2023comparative,
  title = {A comparative study of pretrained language models for long clinical text},
  author = {Li, Yikuan and Wehbe, Ramsey M and Ahmad, Faraz S and Wang, Hanyin and Luo, Yuan},
  journal = {Journal of the American Medical Informatics Association},
  volume = {30},
  number = {2},
  pages = {340--347},
  year = {2023}
}
@misc{obi_deid,
  author = {Kailas, Prajwal and Homilius, Max and Goto, Shinichi},
  title = {Robust De-ID: De-Identification of Medical Notes using Transformer Architectures},
  year = {2022},
  url = {https://github.com/obi-ml-public/ehr_deidentification}
}
Author
Gary Riggs, MD
Medical Director, Metro Physician Group
Master of Science in Data Science candidate, Northwestern University
Model Card Contact
For questions or issues: GitHub Issues