Model Card for parthesh111/layoutlmv3-finetune-bioes-new

This model is a fine-tuned version of microsoft/layoutlmv3-base for token classification on scanned medical lab reports. It uses BIOES tagging to extract structured entities such as patient name, doctor name, lab name, date, sex, and age. The model consumes words and bounding boxes from an external OCR engine (PaddleOCR v2.6) rather than running OCR itself, so scanned documents supplied as images can be processed.
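
The exact BIOES label set shipped with the checkpoint can be read from the model config; the snippet below is a minimal check (the label names in the comment are illustrative, the repository's id2label mapping is authoritative):

from transformers import LayoutLMv3ForTokenClassification

# Load the fine-tuned checkpoint and inspect its label space.
model = LayoutLMv3ForTokenClassification.from_pretrained("parthesh111/layoutlmv3-finetune-bioes-new")
print(model.config.num_labels)  # number of BIOES tags the classification head predicts
print(model.config.id2label)    # e.g. {0: 'O', 1: 'B-PATIENT_NAME', ...} (illustrative; see the repo for actual names)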

Model Details

Model Description

  • Developed by: Parthesh Ingale
  • Shared by: parthesh111
  • Model type: Token Classification (NER)
  • Language(s) (NLP): English
  • License: Apache-2.0
  • Finetuned from model: microsoft/layoutlmv3-base

Model Sources

  • Repository: https://huggingface.co/parthesh111/layoutlmv3-finetune-bioes-new

Uses

Direct Use

  • Extract named entities from medical lab reports (scanned images).
  • Automate structured data extraction from semi-structured medical documents.

Downstream Use

  • Preprocessing step in EHR (Electronic Health Records).
  • PII-aware document processing.
  • Indexing and summarization of medical records.

Out-of-Scope Use

  • Handwritten document recognition.
  • General NLP tasks not involving OCR.
  • Documents with non-medical layouts.

Bias, Risks, and Limitations

  • May fail on unseen or unfamiliar lab formats.
  • Possible misclassification of sensitive entities.
  • Does not anonymize data automatically.

Recommendations

Users (both direct and downstream) should:

  • Anonymize or redact extracted PII where required (a minimal redaction sketch follows this list).
  • Evaluate the model's performance on unseen document layouts.
  • Avoid using the model for unsupported languages or handwritten data.
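
The model only labels entities; it does not anonymize anything itself. Below is a minimal redaction sketch, assuming the pii_data dictionary produced by run_inference_logic in the Get Started code (entity label mapped to a list of extracted strings). It uses naive substring replacement and is illustrative only; a production pipeline would need more robust matching.

def redact_pii(text: str, pii_data: dict) -> str:
    """Replace each detected PII string in `text` with a [LABEL] placeholder."""
    for label, values in pii_data.items():
        for value in values:
            if value:
                text = text.replace(value, f"[{label}]")
    return text

# Hypothetical usage:
# redact_pii("Report for John Smith, age 42",
#            {"PATIENT_NAME": ["John Smith"], "AGE": ["42"]})
# -> "Report for [PATIENT_NAME], age [AGE]"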

How to Get Started with the Model

import streamlit as st
import torch
from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification
from PIL import Image
from paddleocr import PaddleOCR
import numpy as np
import os
from huggingface_hub import login

# Login to Hugging Face using the environment variable token
HF_TOKEN = os.environ.get("HF_TOKEN")
if not HF_TOKEN:
    st.error("Hugging Face token not found. Please set 'HF_TOKEN' as an environment variable.")
else:
    login(HF_TOKEN)

# Set Streamlit page configuration
st.set_page_config(layout="wide", page_title="Document Entity Extractor")

st.title("📄 Document Entity Extractor (LayoutLMv3 + PaddleOCR)")
st.markdown("Upload an image (e.g., a scanned document or a form) to extract structured information.")

@st.cache_resource
def load_paddle_ocr_model():
    st.info("Initializing PaddleOCR model...")
    return PaddleOCR(use_angle_cls=True, lang='en', show_log=False)

@st.cache_resource
def load_layoutlmv3_model_and_processor(model_name_or_path="parthesh111/layoutlmv3-finetune-bioes-new"):
    st.info(f"Loading LayoutLMv3 model from {model_name_or_path}...")
    processor = LayoutLMv3Processor.from_pretrained(model_name_or_path, apply_ocr=False)
    model = LayoutLMv3ForTokenClassification.from_pretrained(model_name_or_path)
    return model, processor

def normalize_box(box, width, height):
    return [
        int(1000 * (box[0] / width)),
        int(1000 * (box[1] / height)),
        int(1000 * (box[2] / width)),
        int(1000 * (box[3] / height)),
    ]

def generate_non_pii_summary_with_llm(text_data):
    if not text_data or text_data.strip() == "No 'O' (Outside) labeled text found.":
        return "No non-PII data available for summarization."

    return (
        f"Original Non-PII Text:\n\"{text_data}\"\n\n"
        f"This section would typically contain a summary or analysis generated by an external Large Language Model "
        f"(e.g., ChatGPT) based on the provided 'O' labeled text. For demonstration purposes, "
        f"this is a placeholder showing the input text. Implement your actual LLM API call here.")

def run_inference_logic(image: Image.Image, model, processor, ocr_engine):
    width, height = image.size

    try:
        paddle_ocr_result = ocr_engine.ocr(np.array(image), cls=True)
        words = []
        boxes = []
        if not paddle_ocr_result or not paddle_ocr_result[0]:
            st.warning("PaddleOCR did not detect any text in the image.")
            return {}, "No text detected by OCR."

        for line in paddle_ocr_result[0]:
            box_coords = line[0]
            text = line[1][0]
            x_min = min([point[0] for point in box_coords])
            y_min = min([point[1] for point in box_coords])
            x_max = max([point[0] for point in box_coords])
            y_max = max([point[1] for point in box_coords])
            words.append(text)
            boxes.append([x_min, y_min, x_max, y_max])

    except Exception as e:
        st.error(f"An error occurred during PaddleOCR processing: {e}")
        return {}, f"Error during OCR: {e}"

    if not words:
        return {}, "No extractable words found by OCR."

    normalized_boxes = [normalize_box(box, width, height) for box in boxes]

    try:
        if not words or not normalized_boxes:
            st.warning("No words or bounding boxes to encode for LayoutLMv3.")
            return {}, "No valid data for model inference."

        encoding = processor(image, text=words, boxes=normalized_boxes,
                             return_tensors="pt", truncation=True, padding="max_length", max_length=512)

        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        model.to(device)
        for key, val in encoding.items():
            encoding[key] = val.to(device)

        with torch.no_grad():
            outputs = model(**encoding)

        predictions = outputs.logits.argmax(-1).squeeze().tolist()
        word_ids = encoding.word_ids(batch_index=0)

        word_labels = {}
        for token_idx, word_idx in enumerate(word_ids):
            if word_idx is not None and token_idx < len(predictions):
                if word_idx not in word_labels:
                    word_labels[word_idx] = model.config.id2label[predictions[token_idx]]

        all_grouped_segments = {}
        current_segment_text = ""
        current_segment_label = ""

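        # Note: only the entity type (e.g. 'PATIENT_NAME') is kept here, and consecutive
        # words sharing that type are merged into one segment; BIOES boundary tags (B/E/S)
        # are not used to split back-to-back entities of the same type.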
        for i in range(len(words)):
            label = word_labels.get(i, "O")
            word = words[i]

            if label == "O":
                effective_segment_type = "O"
            else:
                _, _, entity_type = label.partition('-')
                effective_segment_type = entity_type

            if current_segment_text and effective_segment_type != current_segment_label:
                if current_segment_label not in all_grouped_segments:
                    all_grouped_segments[current_segment_label] = []
                all_grouped_segments[current_segment_label].append(current_segment_text.strip())
                current_segment_text = ""
                current_segment_label = ""

            if not current_segment_text:
                current_segment_text = word
                current_segment_label = effective_segment_type
            else:
                current_segment_text += " " + word

        if current_segment_text and current_segment_label:
            if current_segment_label not in all_grouped_segments:
                all_grouped_segments[current_segment_label] = []
            all_grouped_segments[current_segment_label].append(current_segment_text.strip())

        non_pii_segments = all_grouped_segments.pop("O", [])
        pii_data = all_grouped_segments
        non_pii_data_string = " ".join(non_pii_segments) if non_pii_segments else "No 'O' (Outside) labeled text found."
        return pii_data, non_pii_data_string

    except Exception as e:
        st.error(f"An error occurred during LayoutLMv3 model inference: {e}")
        return {}, f"Error during model inference: {e}"

# --- UI ---

uploaded_file = st.file_uploader("Upload an image (JPG, JPEG, PNG)", type=["jpg", "jpeg", "png"])

if uploaded_file is not None:
    image = Image.open(uploaded_file).convert("RGB")

    with st.spinner("Loading models and processing..."):
        ocr_engine = load_paddle_ocr_model()
        model, processor = load_layoutlmv3_model_and_processor()
        pii_data, non_pii_raw_text = run_inference_logic(image, model, processor, ocr_engine)

    st.success("Processing Complete!")

    with st.spinner("Generating summary for Non-PII data..."):
        non_pii_llm_output = generate_non_pii_summary_with_llm(non_pii_raw_text)

    col_pii, col_non_pii = st.columns(2)

    with col_pii:
        st.header("🔐 PII Data")
        if pii_data:
            sorted_pii_labels = sorted(pii_data.keys())
            for label in sorted_pii_labels:
                st.subheader(f"🏷️ {label.replace('-', ' ').title()}")
                for text in pii_data[label]:
                    st.markdown(f"- **{text}**")
        else:
            st.info("No PII entities were detected in the document.")

    with col_non_pii:
        st.header("📝 Non-PII Data")
        st.markdown(non_pii_llm_output)

# Optional CSS
st.markdown("""
<style>
    .stApp {
        background-color: #f0f2f6;
        color: #333333;
    }
    .stButton>button {
        background-color: #4CAF50;
        color: white;
        border-radius: 8px;
        padding: 10px 20px;
        font-weight: bold;
        box-shadow: 0 4px 6px rgba(0,0,0,0.1);
    }
    .stButton>button:hover {
        background-color: #45a049;
    }
    .stFileUploader {
        border: 2px dashed #a0aec0;
        border-radius: 10px;
        padding: 20px;
        background-color: #ffffff;
    }
    h1 {
        color: #1a73e8;
        text-align: center;
        font-size: 2.5em;
    }
    h2 {
        color: #3f51b5;
    }
    h3 {
        color: #5c6bc0;
    }
    ul {
        list-style-type: none;
        padding-left: 0;
    }
    ul li {
        margin-bottom: 5px;
        padding-left: 20px;
        position: relative;
    }
    ul li::before {
        content: '•';
        color: #4CAF50;
        font-weight: bold;
        display: inline-block;
        width: 1em;
        margin-left: -1em;
    }
</style>
""", unsafe_allow_html=True)

Training Details

Training Data

  • Custom annotated dataset of medical lab reports using BIOES tagging.

Training Procedure

Preprocessing

  • Images were preprocessed using PaddleOCR.
  • Bounding boxes normalized to 1000-scale.
  • Tokens, bounding boxes, and per-word labels aligned for LayoutLMv3 (a minimal encoding sketch follows this list).
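
A minimal sketch of how one training example can be encoded with the processor, assuming per-word BIOES label IDs are already available (the file name, words, boxes, and label IDs below are illustrative, not taken from the actual training data):

from PIL import Image
from transformers import LayoutLMv3Processor

processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)

image = Image.open("report.png").convert("RGB")   # hypothetical sample page
words = ["Patient", "Name:", "John", "Smith"]     # words from PaddleOCR
boxes = [[30, 40, 120, 60], [125, 40, 190, 60],
         [30, 70, 90, 90], [95, 70, 170, 90]]     # already normalized to the 0-1000 scale
word_labels = [0, 0, 1, 3]                        # per-word BIOES label IDs (illustrative)

encoding = processor(image, text=words, boxes=boxes, word_labels=word_labels,
                     truncation=True, padding="max_length", max_length=512,
                     return_tensors="pt")
# encoding now contains input_ids, attention_mask, bbox, pixel_values, and labels,
# with word-level labels expanded to sub-word tokens (extra sub-tokens receive -100).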

Training Hyperparameters

  • Training regime: fp16 mixed precision
  • Epochs: 20
  • Batch size: 1
  • Learning rate: 5e-5 (see the TrainingArguments sketch after this list)
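
Expressed as Hugging Face TrainingArguments, the settings above correspond roughly to the sketch below (output_dir and anything not listed in this card are assumptions or library defaults):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="layoutlmv3-finetune-bioes-new",  # assumed output path
    num_train_epochs=20,
    per_device_train_batch_size=1,
    learning_rate=5e-5,
    fp16=True,                                   # fp16 mixed precision
)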

Speeds, Sizes, Times

  • Checkpoint size: ~435 MB
  • Training time: ~2 hours on RTX 3060

Testing Data, Factors & Metrics

Testing Data

  • Held-out set of annotated lab report images.

Factors

  • Document layout structure.
  • Entity type variability.

Results

  • Accuracy: ~99.31%

Model Architecture and Objective

LayoutLMv3 with a token classification head; each input combines the document image, the OCR text, and the layout (normalized bounding boxes).

Compute Infrastructure

Software

  • PyTorch, Hugging Face Transformers, PaddleOCR, Streamlit

Citation

BibTeX:

@misc{parthesh2025layoutlmv3,
  title = {LayoutLMv3 Fine-Tuned on Lab Reports with BIOES Tags},
  author = {Parthesh Ingale},
  year = {2025},
  howpublished = {\url{https://huggingface.co/parthesh111/layoutlmv3-finetune-bioes-new}},
}

Glossary

  • BIOES: Beginning, Inside, Outside, End, Single tagging scheme used for NER.
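
For example (label name illustrative), a three-word patient name would be tagged B-PATIENT_NAME I-PATIENT_NAME E-PATIENT_NAME, a single-word entity such as a standalone date would receive S-DATE, and all remaining tokens are tagged O.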

Model Card Contact

  • parthesh111, via the model repository on the Hugging Face Hub.
