deid-LONGFORMER-NemPII

HIPAA-compliant clinical de-identification that beats commercial solutions, at zero cost.

A fine-tuned Clinical-Longformer model for Protected Health Information (PHI) detection and replacement in clinical text, achieving 97.74% F1 on held-out test data.

Model Description

This model identifies 25 types of Protected Health Information (PHI) in clinical text using BILOU tagging (101 classes total). Unlike commercial solutions that simply redact PHI with [REDACTED], the accompanying replacement logic generates realistic surrogate data that preserves clinical meaning.
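For illustration, a BILOU-tagged sentence looks like this (a hypothetical example, not drawn from the training data):

# Hypothetical BILOU tagging: B- begins and L- ends a multi-token entity,
# I- is inside it, U- marks a single-token entity, O marks non-PHI tokens.
tokens = ["Patient", "Sarah",  "Johnson", "was", "seen", "in", "Boston", "."]
tags   = ["O",       "B-NAME", "L-NAME",  "O",   "O",    "O",  "U-CITY", "O"]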

Performance Comparison

| Solution | F1 Score | Cost | Replacement Quality |
|---|---|---|---|
| AWS Comprehend Medical | ~83-93% | $14.5K / 1M notes | Basic placeholders |
| John Snow Labs | 96-97% | Enterprise license | Basic placeholders |
| deid-LONGFORMER-NemPII | 97.74% | Free | Realistic surrogates |

Acknowledgments & Inspiration

This model builds directly on excellent prior work:

πŸ™ obi/deid_bert_i2b2 β€” The Inspiration

This project was directly inspired by obi/deid_bert_i2b2 from the Open Biomedical Informatics team (Prajwal Kailas, Max Homilius, Shinichi Goto). Their pioneering work on ClinicalBERT-based de-identification using the I2B2 2014 dataset demonstrated the viability of transformer-based approaches for PHI detection. The robust-deid framework they developed provided an invaluable reference for architecture decisions, BILOU tagging schemes, and evaluation methodology.

πŸ₯ yikuan8/Clinical-Longformer β€” The Base Model

Built on yikuan8/Clinical-Longformer by Li, Yikuan et al. This clinical knowledge-enriched Longformer was pre-trained on MIMIC-III clinical notes and supports sequences up to 4,096 tokens, which is critical for processing real-world clinical documents that often exceed BERT's 512-token limit.

📊 NVIDIA Nemotron-PII: The Training Data

Trained on the healthcare subset of nvidia/Nemotron-PII (3,630 records, CC BY 4.0). This synthetic dataset provides diverse PHI patterns without exposing real patient data.

Intended Uses

  • Clinical research: De-identify notes for IRB-compliant research datasets
  • Healthcare NLP: Prepare training data for downstream clinical NLP tasks
  • Data sharing: Enable safe sharing of clinical text between institutions
  • Quality improvement: Analyze clinical documentation without PHI exposure

Key Features

The replacement logic (in the GitHub repo) provides the following guarantees (a date-shifting sketch follows the list):

  • Age-preserving DOB: Fake DOBs keep patient age within ±2 years
  • Name consistency: "Dr. Sarah Johnson" and "Sarah J." map to the same fake name
  • Temporal consistency: All dates shift by the same offset (preserves intervals)
  • Geographic consistency: City/state/ZIP combinations are coherent
  • Format preservation: Phone numbers, SSNs, dates keep original format
  • Medical term protection: Whitelist prevents "Anion Gap" → fake name
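A minimal sketch of the temporal-consistency idea, assuming a single per-document offset (the function below is illustrative; the actual replacement logic lives in the GitHub repo):

import random
from datetime import datetime, timedelta

# Illustrative sketch: shift every detected date by one document-level offset,
# so intervals between dates (e.g., admission to discharge) are preserved.
def shift_dates(date_strings, fmt="%m/%d/%Y", max_days=365):
    offset = timedelta(days=random.randint(-max_days, max_days))  # one offset per document
    return [(datetime.strptime(s, fmt) + offset).strftime(fmt) for s in date_strings]

print(shift_dates(["01/15/1957", "02/01/1957"]))  # both dates move by the same amount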

Training Details

Architecture

| Parameter | Value |
|---|---|
| Base Model | yikuan8/Clinical-Longformer |
| Parameters | 148M |
| Max Length | 4,096 tokens |
| Task | Token Classification |
| Tagging | BILOU scheme |
| Classes | 101 (25 PHI types × 4 tags + O) |

PHI Categories (25 types)

NAME, FIRST_NAME, LAST_NAME, DATE, DATE_OF_BIRTH, DATE_TIME,
TIME, AGE, SSN, MEDICAL_RECORD_NUMBER, HEALTH_PLAN_BENEFICIARY_NUMBER,
ACCOUNT_NUMBER, CERTIFICATE_LICENSE_NUMBER, PHONE_NUMBER, FAX_NUMBER,
EMAIL, STREET_ADDRESS, CITY, STATE, POSTCODE, COUNTRY,
BIOMETRIC_IDENTIFIER, UNIQUE_ID, CUSTOMER_ID, EMPLOYEE_ID
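The full 101-class tag set follows directly from these categories. A minimal sketch of how such a label list can be built (variable names are illustrative, not the repo's actual code):

PHI_TYPES = [
    "NAME", "FIRST_NAME", "LAST_NAME", "DATE", "DATE_OF_BIRTH", "DATE_TIME",
    "TIME", "AGE", "SSN", "MEDICAL_RECORD_NUMBER", "HEALTH_PLAN_BENEFICIARY_NUMBER",
    "ACCOUNT_NUMBER", "CERTIFICATE_LICENSE_NUMBER", "PHONE_NUMBER", "FAX_NUMBER",
    "EMAIL", "STREET_ADDRESS", "CITY", "STATE", "POSTCODE", "COUNTRY",
    "BIOMETRIC_IDENTIFIER", "UNIQUE_ID", "CUSTOMER_ID", "EMPLOYEE_ID",
]

# BILOU prefixes for each PHI type, plus a single O (outside) tag.
labels = ["O"] + [f"{p}-{phi}" for phi in PHI_TYPES for p in ("B", "I", "L", "U")]
assert len(labels) == 101  # 25 types x 4 tags + O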

Training Procedure

  • Dataset: NVIDIA Nemotron-PII healthcare subset (3,630 records)
  • Split: 80% train / 20% test
  • Epochs: 10
  • Batch size: 4
  • Learning rate: 2e-5
  • Optimizer: AdamW
  • Hardware: NVIDIA T4 GPU
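A minimal fine-tuning sketch using these hyperparameters (dataset loading and label alignment are omitted; everything beyond the values listed above is an assumption):

from transformers import AutoModelForTokenClassification, AutoTokenizer, Trainer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("yikuan8/Clinical-Longformer")
model = AutoModelForTokenClassification.from_pretrained(
    "yikuan8/Clinical-Longformer", num_labels=101)

args = TrainingArguments(
    output_dir="deid-longformer-nempii",
    num_train_epochs=10,
    per_device_train_batch_size=4,
    learning_rate=2e-5,          # AdamW is the Trainer default optimizer
)

# With tokenized, label-aligned 80/20 splits of the Nemotron-PII healthcare
# subset prepared as train_ds and eval_ds:
# trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()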

Evaluation Results

| Metric | Score |
|---|---|
| F1 | 97.74% |
| Precision | 97.62% |
| Recall | 97.86% |
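Scores are computed on the 20% held-out split. A minimal sketch of computing such metrics from BILOU tag sequences with seqeval (the toy sequences are illustrative, and this is an assumed approach, not necessarily the repo's evaluation code):

from seqeval.metrics import f1_score, precision_score, recall_score
from seqeval.scheme import BILOU

# Gold and predicted tag sequences for two toy sentences (illustrative only)
y_true = [["B-NAME", "L-NAME", "O", "U-CITY"], ["O", "U-AGE", "O"]]
y_pred = [["B-NAME", "L-NAME", "O", "U-CITY"], ["O", "O", "O"]]

print("precision:", precision_score(y_true, y_pred, mode="strict", scheme=BILOU))
print("recall:   ", recall_score(y_true, y_pred, mode="strict", scheme=BILOU))
print("f1:       ", f1_score(y_true, y_pred, mode="strict", scheme=BILOU))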

Usage

With Transformers

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("riggsmed/deid-LONGFORMER-NemPII")
model = AutoModelForTokenClassification.from_pretrained("riggsmed/deid-LONGFORMER-NemPII")

text = "Patient John Smith, DOB 01/15/1957, presented with chest pain."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)

# Decode predictions to entity labels
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]

for token, label in zip(tokens, labels):
    if label != "O":
        print(f"{token}: {label}")

With Pipeline

from transformers import pipeline

pipe = pipeline("token-classification", 
                model="riggsmed/deid-LONGFORMER-NemPII",
                aggregation_strategy="simple")

text = "Contact Dr. Sarah Johnson at (405) 555-1234"
entities = pipe(text)

for ent in entities:
    print(f"{ent['word']}: {ent['entity_group']} ({ent['score']:.2f})")

Full De-identification (with surrogates)

For realistic surrogate replacement, use the full system from GitHub:

git clone https://github.com/Hrygt/deid-longformer-nempii.git
cd deid-longformer-nempii
pip install -r requirements.txt

Then, in Python:

from deid import deidentify_text

result = deidentify_text("Patient John Smith, DOB 01/15/1957")
print(result["deidentified_text"])
# Output: "Patient Robert Johnson, DOB 03/22/1955"

Limitations

  • English only: Trained exclusively on English clinical text
  • US-centric: PHI patterns (e.g., SSN format, US addresses) reflect US conventions
  • Synthetic training data: May miss edge cases in real clinical notes
  • Not a substitute for expert review: For high-stakes applications, human review is recommended

Ethical Considerations

  • This model is intended to protect patient privacy, not circumvent it
  • De-identified data should still be handled according to institutional policies
  • The model may have biases from training data that could affect certain demographic groups
  • Always validate de-identification quality on your specific data before production use

Live Demo

Try it at: https://deid.riggsmedai.com

Citation

@software{riggs2024deid,
  author = {Riggs, Gary},
  title = {deid-LONGFORMER-NemPII: Clinical De-identification with Realistic Surrogates},
  year = {2024},
  url = {https://huggingface.co/riggsmed/deid-LONGFORMER-NemPII}
}

Please also cite the foundational work:

@article{li2023comparative,
  title={A comparative study of pretrained language models for long clinical text},
  author={Li, Yikuan and Wehbe, Ramsey M and Ahmad, Faraz S and Wang, Hanyin and Luo, Yuan},
  journal={Journal of the American Medical Informatics Association},
  volume={30},
  number={2},
  pages={340--347},
  year={2023}
}

@misc{obi_deid,
  author = {Kailas, Prajwal and Homilius, Max and Goto, Shinichi},
  title = {Robust De-ID: De-Identification of Medical Notes using Transformer Architectures},
  year = {2022},
  url = {https://github.com/obi-ml-public/ehr_deidentification}
}

Author

Gary Riggs, MD
Medical Director, Metro Physician Group
Master of Science in Data Science candidate, Northwestern University

Model Card Contact

For questions or issues: GitHub Issues
