metadata
license: mit
datasets:
  - galileo-ai/ragbench
language:
  - en
base_model:
  - answerdotai/ModernBERT-base
pipeline_tag: text-classification

Verbatim RAG Extractor Model

This is a fine-tuned ModernBERT model for extracting relevant text spans from documents in a Verbatim RAG system. The model performs sentence-level binary classification to identify which sentences in a document are relevant to answering a given question.

Model Architecture

  • Base Model: ModernBERT-base (answerdotai/ModernBERT-base)
  • Architecture: ModernBERT encoder + linear classification head
  • Task: Sentence-level binary classification (relevant/not relevant)
  • Input Format: [CLS] question [SEP] sentence1 [SEP] sentence2 [SEP] ... [SEP]
  • Output: Binary classification scores for each sentence
  • Hidden Dimension: 768
  • Number of Labels: 2 (relevant/not relevant)
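
The input format above can be illustrated with a small, standard-library-only sketch. It is not the library's implementation: `build_input` is a hypothetical helper, and whitespace splitting stands in for the real ModernBERT subword tokenizer. It shows how the question and sentences are joined with `[SEP]` markers while a half-open `(start, end)` token range is recorded for each sentence.

```python
from typing import List, Tuple

CLS, SEP = "[CLS]", "[SEP]"


def build_input(question: str, sentences: List[str]) -> Tuple[List[str], List[Tuple[int, int]]]:
    """Return the token list and one (start, end) range per sentence.

    Hypothetical helper: whitespace splitting is an assumption that
    stands in for the real subword tokenizer.
    """
    tokens = [CLS] + question.split() + [SEP]
    boundaries = []
    for sentence in sentences:
        start = len(tokens)
        tokens.extend(sentence.split())
        boundaries.append((start, len(tokens)))  # end index is exclusive
        tokens.append(SEP)
    return tokens, boundaries


tokens, boundaries = build_input(
    "What causes climate change?",
    ["Carbon dioxide is the primary greenhouse gas.", "Solar power is renewable."],
)
print(" ".join(tokens))
print(boundaries)
```

Recording the boundaries while the sequence is built means no second pass is needed to locate each sentence in the encoder output.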

The model processes questions and documents by:

  1. Encoding the question and all sentences from retrieved documents in a single sequence
  2. Using sentence boundaries to extract sentence-level representations
  3. Averaging token embeddings within each sentence boundary
  4. Classifying each sentence independently for relevance
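
Steps 2 to 4 can be sketched in plain Python as follows. This is a toy illustration under stated assumptions, not the model's actual code: the function names are hypothetical, the embeddings are 2-dimensional instead of 768-dimensional, the linear head's weights are made up, and label index 1 is assumed to mean "relevant".

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_pool(token_embeddings: Sequence[Vector], start: int, end: int) -> Vector:
    """Average the token vectors in the half-open range [start, end)."""
    span = token_embeddings[start:end]
    dim = len(span[0])
    return [sum(vec[i] for vec in span) / len(span) for i in range(dim)]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_sentences(
    token_embeddings: Sequence[Vector],
    boundaries: Sequence[Tuple[int, int]],
    weights: Sequence[Vector],
    bias: Vector,
) -> List[float]:
    """Return P(relevant) for each sentence using a two-label linear head."""
    probs = []
    for start, end in boundaries:
        pooled = mean_pool(token_embeddings, start, end)
        logits = [sum(w * x for w, x in zip(row, pooled)) + b for row, b in zip(weights, bias)]
        probs.append(softmax(logits)[1])  # index 1 = "relevant" (assumed label order)
    return probs


# Toy data: six token vectors, two sentences, made-up head parameters.
embeddings = [[0, 0], [1, 0], [3, 0], [0, 0], [0, 2], [0, 4]]
sentence_boundaries = [(1, 3), (4, 6)]
head_weights = [[0.0, 1.0], [1.0, 0.0]]  # row 0: not relevant, row 1: relevant
head_bias = [0.0, 0.0]

print(classify_sentences(embeddings, sentence_boundaries, head_weights, head_bias))
```

Each sentence is scored from its own pooled vector, although the token embeddings themselves were computed with the question and all other sentences in context.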

Installation

Option 1: Install from PyPI (Recommended)

pip install verbatim-rag

Option 2: Install from Source

git clone https://github.com/krlabsorg/verbatim-rag.git
cd verbatim-rag
pip install -e .

Dependencies

The model requires the following key dependencies:

  • torch>=2.6.0
  • transformers==4.53.3
  • numpy>=1.24.3
  • scikit-learn==1.6.1
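
To see which of these packages are present in the current environment, a short check with the standard library's `importlib.metadata` may help. This is only a convenience sketch; `installed_versions` is a hypothetical helper, and it reports versions without enforcing the pins listed above.

```python
from importlib.metadata import PackageNotFoundError, version

# Package names copied from the dependency list above.
REQUIRED = ["torch", "transformers", "numpy", "scikit-learn"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(REQUIRED).items():
    print(f"{name}: {ver or 'not installed'}")
```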

Usage

Basic Usage

from verbatim_rag.vector_stores import SearchResult
from verbatim_core.extractors import ModelSpanExtractor

extractor = ModelSpanExtractor(
    model_path="KRLabsOrg/verbatim-rag-modern-bert-v1",
)

# Create search results (typically from your vector store)
search_results = [
    SearchResult(
        id="1",
        text="Climate change is caused by greenhouse gas emissions from human activities. "
             "Carbon dioxide is the primary greenhouse gas.",
        score=0.95,
        metadata={"source": "climate_doc.txt"}
    ),
    SearchResult(
        id="2",
        text="Renewable energy sources include solar, wind, and hydropower. "
             "These sources produce no direct emissions.",
        score=0.87,
        metadata={"source": "energy_doc.txt"}
    )
]

# Extract relevant spans
question = "What causes climate change?"
relevant_spans = extractor.extract_spans(question, search_results)

# Print results
for text, spans in relevant_spans.items():
    print(f"Document: {text[:50]}...")
    for span in spans:
        print(f"  - {span}")

Integration with Verbatim RAG System

from verbatim_rag.core import VerbatimRAG
from verbatim_rag.index import VerbatimIndex
from verbatim_rag.extractors import ModelSpanExtractor

# Load your trained extractor
extractor = ModelSpanExtractor("path/to/your/model")

# Create VerbatimRAG system with custom extractor
index = VerbatimIndex(
    sparse_model="naver/splade-v3", 
    db_path="./index.db"
)

rag_system = VerbatimRAG(
    index=index,
    extractor=extractor,
    k=5
)

# Query the system
response = rag_system.query("What are the main causes of climate change?")
print(response.answer)

Training Your Own Model

from verbatim_rag.extractor_models.model import QAModel
from verbatim_rag.extractor_models.dataset import QADataset
from transformers import AutoTokenizer

# Load your training data (list of QASample objects).
# `training_samples` must be defined before the dataset is built below, e.g.:
# training_samples = load_your_training_data()

# Initialize tokenizer and dataset
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
dataset = QADataset(training_samples, tokenizer, max_length=512)

# Initialize model
model = QAModel(
    model_name="answerdotai/ModernBERT-base",
    hidden_dim=768,
    num_labels=2
)

# Train your model (implement your training loop)
# model = train_model(model, dataset)

# Save the trained model
model.save_pretrained("path/to/save/model")
tokenizer.save_pretrained("path/to/save/model")

Sample Output

Example 1: Climate Change Question

Question: "What causes climate change?"

Input Documents:

  1. "Climate change is a significant and lasting change in weather patterns. Global warming is the observed increase in Earth's temperature. Greenhouse gases include water vapor, carbon dioxide, methane, and nitrous oxide. Human activities since the Industrial Revolution have increased greenhouse gas levels."

  2. "Renewable energy comes from naturally replenished sources. Solar power converts sunlight into electricity. Wind power uses wind for mechanical power. Hydropower generates electricity from falling water."

Extracted Relevant Spans:

  • Document 1:
    • ✅ "Greenhouse gases include water vapor, carbon dioxide, methane, and nitrous oxide"
    • ✅ "Human activities since the Industrial Revolution have increased greenhouse gas levels"
  • Document 2:
    • ❌ No relevant spans found

Example 2: Technology Question

Question: "How does solar power work?"

Input Documents:

  1. "Solar panels contain photovoltaic cells that convert sunlight directly into electricity. When photons hit the semiconductor material, they knock electrons loose, creating an electric current. This direct current is then converted to alternating current for use in homes."

  2. "Wind turbines use aerodynamic blades to capture wind energy. The rotation of the blades turns a generator that produces electricity. Modern wind farms can generate significant amounts of clean energy."

Extracted Relevant Spans:

  • Document 1:
    • ✅ "Solar panels contain photovoltaic cells that convert sunlight directly into electricity"
    • ✅ "When photons hit the semiconductor material, they knock electrons loose, creating an electric current"
    • ✅ "This direct current is then converted to alternating current for use in homes"
  • Document 2:
    • ❌ No relevant spans found

Confidence Scores

The model outputs confidence scores for each sentence. With the default threshold of 0.5:

  • Scores > 0.5: Sentence is marked as relevant
  • Scores ≤ 0.5: Sentence is marked as not relevant

You can adjust the threshold based on your precision/recall requirements:

  • Higher threshold (e.g., 0.7): More precise, fewer false positives
  • Lower threshold (e.g., 0.3): Higher recall, more comprehensive extraction
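
The trade-off can be made concrete with a small sketch. The scores and gold labels below are made up for illustration, and `select_relevant` and `precision_recall` are hypothetical helpers rather than part of the library; how the threshold is actually passed to the extractor is not shown here.

```python
from typing import List, Tuple


def select_relevant(scores: List[float], threshold: float = 0.5) -> List[bool]:
    """Mark a sentence relevant when its score is strictly above the threshold."""
    return [score > threshold for score in scores]


def precision_recall(predicted: List[bool], gold: List[bool]) -> Tuple[float, float]:
    """Compute precision and recall for the 'relevant' class."""
    tp = sum(p and g for p, g in zip(predicted, gold))
    fp = sum(p and not g for p, g in zip(predicted, gold))
    fn = sum(g and not p for p, g in zip(predicted, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


scores = [0.92, 0.64, 0.41, 0.08]  # made-up sentence scores
gold = [True, False, True, False]  # made-up relevance labels

for threshold in (0.3, 0.5, 0.7):
    p, r = precision_recall(select_relevant(scores, threshold), gold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```

On this toy data, raising the threshold from 0.3 to 0.7 raises precision from 0.67 to 1.00 and lowers recall from 1.00 to 0.50. In practice the threshold would be chosen on a held-out validation set.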

Citation

@inproceedings{kovacs-etal-2025-kr,
    title = "{KR} Labs at {A}rch{EHR}-{QA} 2025: A Verbatim Approach for Evidence-Based Question Answering",
    author = "Kovacs, Adam  and
      Schmitt, Paul  and
      Recski, Gabor",
    editor = "Soni, Sarvesh  and
      Demner-Fushman, Dina",
    booktitle = "Proceedings of the 24th Workshop on Biomedical Language Processing (Shared Tasks)",
    month = aug,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.bionlp-share.8/",
    pages = "69--74",
    ISBN = "979-8-89176-276-3",
    abstract = "We present a lightweight, domain{-}agnostic verbatim pipeline for evidence{-}grounded question answering. Our pipeline operates in two steps: first, a sentence-level extractor flags relevant note sentences using either zero-shot LLM prompts or supervised ModernBERT classifiers. Next, an LLM drafts a question-specific template, which is filled verbatim with sentences from the extraction step. This prevents hallucinations and ensures traceability. In the ArchEHR{-}QA 2025 shared task, our system scored 42.01{\%}, ranking top{-}10 in core metrics and outperforming the organiser{'}s 70B{-}parameter Llama{-}3.3 baseline. We publicly release our code and inference scripts under an MIT license."
}

License

MIT License - see the LICENSE file for details.