license: mit
datasets:
  - galileo-ai/ragbench
language:
  - en
base_model:
  - answerdotai/ModernBERT-base
pipeline_tag: text-classification
Verbatim RAG Extractor Model
This is a fine-tuned ModernBERT-based model for extracting relevant text spans from documents in a Verbatim RAG system. The model performs sentence-level binary classification to identify which sentences in a document are relevant to answering a given question.
Model Architecture
- Base Model: ModernBERT-base (answerdotai/ModernBERT-base)
- Architecture: ModernBERT encoder + linear classification head
- Task: Sentence-level binary classification (relevant/not relevant)
- Input Format: [CLS] question [SEP] sentence1 [SEP] sentence2 [SEP] ... [SEP]
- Output: Binary classification scores for each sentence
- Hidden Dimension: 768
- Number of Labels: 2 (relevant/not relevant)
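The input format above can be illustrated with a minimal sketch. It is not the library's own preprocessing code: the `build_input` helper is hypothetical, and a whitespace split stands in for the real ModernBERT tokenizer. It only shows how the question and sentences are laid out in one sequence and how each sentence's token range can be recorded for later use.

```python
from typing import List, Tuple


def build_input(question: str, sentences: List[str]) -> Tuple[List[str], List[Tuple[int, int]]]:
    """Lay out tokens as [CLS] question [SEP] s1 [SEP] s2 [SEP] ...

    Returns the token list and, for each sentence, a half-open
    (start, end) range of token indices covering that sentence.
    Hypothetical helper: whitespace splitting replaces the real tokenizer.
    """
    tokens = ["[CLS]"] + question.split() + ["[SEP]"]
    boundaries = []
    for sentence in sentences:
        start = len(tokens)
        tokens.extend(sentence.split())
        boundaries.append((start, len(tokens)))  # excludes the trailing [SEP]
        tokens.append("[SEP]")
    return tokens, boundaries


tokens, boundaries = build_input(
    "What causes climate change?",
    [
        "Carbon dioxide is the primary greenhouse gas.",
        "Solar power converts sunlight into electricity.",
    ],
)
print(tokens)
print(boundaries)
```

With a real subword tokenizer the ranges would be computed over subword tokens, and long inputs would need truncation or chunking, which this sketch does not cover.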
The model processes questions and documents by:
- Encoding the question and all sentences from retrieved documents in a single sequence
- Using sentence boundaries to extract sentence-level representations
- Averaging token embeddings within each sentence boundary
- Classifying each sentence independently for relevance
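The pooling and classification steps listed above can be sketched in plain Python. This is an illustration under stated assumptions, not the model's actual code: the helper names are hypothetical, the toy embeddings are 2-dimensional instead of 768-dimensional, and the sketch assumes that the two-label head is turned into a relevance score with a softmax and that label index 1 means "relevant".

```python
import math
from typing import List, Tuple

Vector = List[float]


def mean_pool(token_embeddings: List[Vector], boundaries: List[Tuple[int, int]]) -> List[Vector]:
    """Average the token embeddings inside each half-open (start, end) range."""
    pooled = []
    for start, end in boundaries:
        span = token_embeddings[start:end]
        dim = len(span[0])
        pooled.append([sum(vec[d] for vec in span) / len(span) for d in range(dim)])
    return pooled


def relevance_probability(sentence_vec: Vector, weights: List[Vector], bias: Vector) -> float:
    """Apply a 2-output linear head, then a softmax; return P(relevant).

    Assumption: row 0 of the head scores "not relevant", row 1 "relevant".
    """
    logits = [sum(w * x for w, x in zip(row, sentence_vec)) + b for row, b in zip(weights, bias)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    return exps[1] / sum(exps)


# Toy encoder output: 6 tokens, 2 dimensions each (made-up values).
token_embeddings = [[0.0, 0.0], [1.0, 0.0], [3.0, 2.0], [0.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
boundaries = [(1, 3), (4, 5)]  # sentence 1 covers tokens 1-2, sentence 2 covers token 4

# Toy classification head (made-up weights).
weights = [[0.0, 1.0], [1.0, 0.0]]
bias = [0.0, 0.0]

pooled = mean_pool(token_embeddings, boundaries)
probs = [relevance_probability(vec, weights, bias) for vec in pooled]
print(pooled)
print(probs)
```

Each sentence vector is scored on its own, which matches the "classifying each sentence independently" step: changing one sentence's pooled vector does not affect the score of another, although in the real model the encoder has already mixed context across the whole sequence.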
Installation
Option 1: Install from PyPI (Recommended)
pip install verbatim-rag
Option 2: Install from Source
git clone https://github.com/krlabsorg/verbatim-rag.git
cd verbatim-rag
pip install -e .
Dependencies
The model requires the following key dependencies:
- torch>=2.6.0
- transformers==4.53.3
- numpy>=1.24.3
- scikit-learn==1.6.1
Usage
Basic Usage
from verbatim_rag.vector_stores import SearchResult
from verbatim_core.extractors import ModelSpanExtractor
extractor = ModelSpanExtractor(
    model_path="KRLabsOrg/verbatim-rag-modern-bert-v1",
)
# Create search results (typically from your vector store)
search_results = [
    SearchResult(
        id="1",
        text="Climate change is caused by greenhouse gas emissions from human activities. "
             "Carbon dioxide is the primary greenhouse gas.",
        score=0.95,
        metadata={"source": "climate_doc.txt"}
    ),
    SearchResult(
        id="2",
        text="Renewable energy sources include solar, wind, and hydropower. "
             "These sources produce no direct emissions.",
        score=0.87,
        metadata={"source": "energy_doc.txt"}
    )
]
# Extract relevant spans
question = "What causes climate change?"
relevant_spans = extractor.extract_spans(question, search_results)
# Print results
for text, spans in relevant_spans.items():
    print(f"Document: {text[:50]}...")
    for span in spans:
        print(f"  - {span}")
Integration with Verbatim RAG System
from verbatim_rag.core import VerbatimRAG
from verbatim_rag.index import VerbatimIndex
from verbatim_rag.extractors import ModelSpanExtractor
# Load your trained extractor
extractor = ModelSpanExtractor("path/to/your/model")
# Create VerbatimRAG system with custom extractor
index = VerbatimIndex(
    sparse_model="naver/splade-v3", 
    db_path="./index.db"
)
rag_system = VerbatimRAG(
    index=index,
    extractor=extractor,
    k=5
)
# Query the system
response = rag_system.query("What are the main causes of climate change?")
print(response.answer)
Training Your Own Model
from verbatim_rag.extractor_models.model import QAModel
from verbatim_rag.extractor_models.dataset import QADataset
from transformers import AutoTokenizer
# Load your training data (a list of QASample objects).
# Placeholder: training_samples must be defined before the dataset is built below.
# training_samples = load_your_training_data()
# Initialize tokenizer and dataset
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
dataset = QADataset(training_samples, tokenizer, max_length=512)
# Initialize model
model = QAModel(
    model_name="answerdotai/ModernBERT-base",
    hidden_dim=768,
    num_labels=2
)
# Train your model (implement your training loop)
# model = train_model(model, dataset)
# Save the trained model
model.save_pretrained("path/to/save/model")
tokenizer.save_pretrained("path/to/save/model")
Sample Output
Example 1: Climate Change Question
Question: "What causes climate change?"
Input Documents:
- "Climate change is a significant and lasting change in weather patterns. Global warming is the observed increase in Earth's temperature. Greenhouse gases include water vapor, carbon dioxide, methane, and nitrous oxide. Human activities since the Industrial Revolution have increased greenhouse gas levels." 
- "Renewable energy comes from naturally replenished sources. Solar power converts sunlight into electricity. Wind power uses wind for mechanical power. Hydropower generates electricity from falling water." 
Extracted Relevant Spans:
- Document 1:
  - "Greenhouse gases include water vapor, carbon dioxide, methane, and nitrous oxide"
  - "Human activities since the Industrial Revolution have increased greenhouse gas levels"
- Document 2:
  - No relevant spans found
Example 2: Technology Question
Question: "How does solar power work?"
Input Documents:
- "Solar panels contain photovoltaic cells that convert sunlight directly into electricity. When photons hit the semiconductor material, they knock electrons loose, creating an electric current. This direct current is then converted to alternating current for use in homes." 
- "Wind turbines use aerodynamic blades to capture wind energy. The rotation of the blades turns a generator that produces electricity. Modern wind farms can generate significant amounts of clean energy." 
Extracted Relevant Spans:
- Document 1:
  - "Solar panels contain photovoltaic cells that convert sunlight directly into electricity"
  - "When photons hit the semiconductor material, they knock electrons loose, creating an electric current"
  - "This direct current is then converted to alternating current for use in homes"
- Document 2:
  - No relevant spans found
Confidence Scores
The model outputs confidence scores for each sentence. With the default threshold of 0.5:
- Scores > 0.5: Sentence is marked as relevant
- Scores ≤ 0.5: Sentence is marked as not relevant
You can adjust the threshold based on your precision/recall requirements:
- Higher threshold (e.g., 0.7): More precise, fewer false positives
- Lower threshold (e.g., 0.3): Higher recall, more comprehensive extraction
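The effect of the threshold can be seen in a short sketch. The `select_relevant` helper and the scores below are hypothetical and only illustrate the rule stated above (strictly greater than the threshold counts as relevant); they do not show how the extractor exposes the threshold in its own API.

```python
from typing import List


def select_relevant(sentences: List[str], scores: List[float], threshold: float = 0.5) -> List[str]:
    """Keep the sentences whose score is strictly above the threshold."""
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")
    return [sentence for sentence, score in zip(sentences, scores) if score > threshold]


sentences = ["sentence A", "sentence B", "sentence C", "sentence D"]
scores = [0.91, 0.62, 0.35, 0.08]  # made-up confidence scores

default_spans = select_relevant(sentences, scores)       # threshold 0.5
strict_spans = select_relevant(sentences, scores, 0.7)   # favors precision
lenient_spans = select_relevant(sentences, scores, 0.3)  # favors recall

print(default_spans)
print(strict_spans)
print(lenient_spans)
```

Raising the threshold drops "sentence B", and lowering it admits "sentence C", which is the precision/recall trade-off described above.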
Citation
@inproceedings{kovacs-etal-2025-kr,
    title = "{KR} Labs at {A}rch{EHR}-{QA} 2025: A Verbatim Approach for Evidence-Based Question Answering",
    author = "Kovacs, Adam  and
      Schmitt, Paul  and
      Recski, Gabor",
    editor = "Soni, Sarvesh  and
      Demner-Fushman, Dina",
    booktitle = "Proceedings of the 24th Workshop on Biomedical Language Processing (Shared Tasks)",
    month = aug,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.bionlp-share.8/",
    pages = "69--74",
    ISBN = "979-8-89176-276-3",
    abstract = "We present a lightweight, domain{-}agnostic verbatim pipeline for evidence{-}grounded question answering. Our pipeline operates in two steps: first, a sentence-level extractor flags relevant note sentences using either zero-shot LLM prompts or supervised ModernBERT classifiers. Next, an LLM drafts a question-specific template, which is filled verbatim with sentences from the extraction step. This prevents hallucinations and ensures traceability. In the ArchEHR{-}QA 2025 shared task, our system scored 42.01{\%}, ranking top{-}10 in core metrics and outperforming the organiser{'}s 70B{-}parameter Llama{-}3.3 baseline. We publicly release our code and inference scripts under an MIT license."
}
License
MIT License - see the LICENSE file for details.
