Implementation Plan: Maternal Health RAG Chatbot v3.0
1. Project Goal
To significantly enhance the quality, accuracy, and naturalness of the RAG chatbot by implementing a state-of-the-art document processing and retrieval pipeline. This version will address the shortcomings of v2, specifically the poor handling of complex document elements (tables, diagrams) and the rigid, templated nature of the LLM responses.
2. Core Problems to Solve
- Poor Data Quality: The current `pdfplumber`-based processor loses critical information from tables, flowcharts, and diagrams, leading to low-quality, out-of-context chunks in the vector store.
- Inaccurate Retrieval: As a result of the poor data quality, the retrieval system often fails to find the most relevant context, even when the information exists in the source PDFs.
- Robotic LLM Responses: The current system prompt is too restrictive, forcing the LLM into a fixed template and preventing natural, conversational answers.
3. The "Version 3.0" Plan
This plan is divided into three main phases, designed to be implemented sequentially.
Phase 1: Advanced Document Processing (Completed)
We have replaced our entire PDF processing pipeline with a modern, machine-learning-based tool to handle complex documents. Note: The AMA citation generation feature is deferred to focus on core functionality first.
- Technology: We are using the `unstructured.io` library for parsing. It is a robust, industry-standard tool for extracting text, tables, and other elements from complex PDFs.
- Why `unstructured.io`? After failed attempts with other libraries (`mineru`, `nougat-ocr`) due to performance and dependency issues, `unstructured.io` proved to be the most reliable and effective solution. It uses models like Detectron2 under the hood (via ONNX, simplifying installation) and provides the high-resolution extraction needed for quality results.
- Implementation Steps (Completed):
  - Create `src/enhanced_pdf_processor.py`: A new script built on the `unstructured.io` library. It processes a directory of PDFs and outputs structured Markdown files.
  - Use High-Resolution Strategy: The script leverages the `hi_res` strategy in `unstructured` to accurately parse document layouts, convert tables to HTML, and extract images.
  - Update Dependencies: Replaced all previous PDF processing dependencies with `unstructured[local-inference]` in `requirements.txt`.
  - Re-process All Documents: Ran the new script on all PDFs in the `Obs` directory, storing the resulting `.md` files and associated images in the `src/processed_markdown/` directory.
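The completed extraction step can be sketched as follows. The `partition_pdf` call and its parameters come from the `unstructured` library; the element-to-Markdown mapping (and the exact element categories handled) is a simplified illustration, not the full `src/enhanced_pdf_processor.py`.

```python
from typing import Optional

def element_to_markdown(category: str, text: str, html: Optional[str] = None) -> str:
    """Map one unstructured element onto a Markdown fragment."""
    if category == "Title":
        return f"## {text}"
    if category == "Table" and html:
        return html  # keep tables as HTML inside the Markdown file
    if category == "ListItem":
        return f"- {text}"
    return text  # NarrativeText and everything else pass through

def pdf_to_markdown(pdf_path: str) -> str:
    """Run unstructured's hi_res pipeline and stitch the elements into Markdown."""
    from unstructured.partition.pdf import partition_pdf  # heavy import, kept local
    elements = partition_pdf(
        filename=pdf_path,
        strategy="hi_res",           # layout-model-based parsing
        infer_table_structure=True,  # tables arrive as HTML in element metadata
        extract_images_in_pdf=True,  # write figure crops alongside the text
    )
    fragments = [
        element_to_markdown(
            el.category, el.text, getattr(el.metadata, "text_as_html", None)
        )
        for el in elements
    ]
    return "\n\n".join(f for f in fragments if f)
```

Keeping tables as HTML inside the Markdown preserves row/column structure that a plain-text dump would destroy, which is exactly the failure mode of the old `pdfplumber` pipeline.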
Phase 2: High-Precision Retrieval with Re-ranking (In Progress)
Once we have high-quality Markdown, we need to ensure our retrieval system can leverage it effectively.
- Technology: We will implement a Cross-Encoder Re-ranking strategy using the `sentence-transformers` library.
- Why Re-ranking? A simple vector search (like our current FAISS implementation) is fast but not always precise: it can retrieve documents that are semantically nearby but not the most relevant. A re-ranker adds a second, more powerful validation step to dramatically increase precision.
- Implementation Steps:
  - Update Chunking Strategy (Completed): In `src/groq_medical_rag.py`, the document loading was changed to read from the new `.md` files using `UnstructuredMarkdownLoader`. We now use a `RecursiveCharacterTextSplitter` to create semantically aware chunks.
  - Implement 2-Stage Retrieval (Completed):
    - Stage 1 (Recall): Use the existing FAISS vector store to retrieve a large number of candidate documents (e.g., top 20).
    - Stage 2 (Precision): Use a Cross-Encoder model (`cross-encoder/ms-marco-MiniLM-L-6-v2`) from the `sentence-transformers` library to score the relevance of these candidates against the user's query. We then select the top 5 highest-scoring documents to pass to the LLM.
  - Update the RAG System (Completed): The core logic in `src/groq_medical_rag.py` has been refactored to accommodate this new two-stage process. The confidence calculation has also been updated to use the re-ranked scores.
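The two-stage flow can be sketched with a small helper. The `rerank` function and its `score_fn` parameter are illustrative names, not the actual code in `src/groq_medical_rag.py`; `CrossEncoder.predict` from `sentence-transformers` is the real scorer that would be plugged in.

```python
from typing import Callable, List, Sequence, Tuple

def rerank(
    query: str,
    candidates: Sequence[str],
    score_fn: Callable[[List[Tuple[str, str]]], Sequence[float]],
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Stage 2: score every (query, candidate) pair and keep the top_k best."""
    pairs = [(query, doc) for doc in candidates]
    scores = score_fn(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return ranked[:top_k]

# In the real pipeline, Stage 1 is the FAISS search and the scorer is the
# cross-encoder, roughly:
#   from sentence_transformers import CrossEncoder
#   model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
#   top_docs = rerank(query, faiss_hits, model.predict, top_k=5)
```

Because the cross-encoder sees the query and each candidate together, its scores are far more discriminative than cosine distance between independently embedded vectors, at the cost of running one forward pass per candidate.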
Phase 3: Dynamic and Natural LLM Interaction
With high-quality context, we can "unleash" the LLM to provide more human-like responses.
- Technology: Advanced Prompt Engineering.
- Why a new prompt? To move the LLM from a "template filler" to a "reasoning engine." We will give it a persona and a goal, rather than a rigid set of formatting rules.
- Implementation Steps:
  - Rewrite the System Prompt: The `SYSTEM_PROMPT` in `src/groq_medical_rag.py` will be replaced with a new version.
  - Draft of New Prompt:
    "You are a world-class medical expert and a compassionate assistant for healthcare professionals in Sri Lanka. Your primary goal is to provide accurate, evidence-based clinical information based only on the provided context from Sri Lankan maternal health guidelines. Your tone should be professional, clear, and supportive. While accuracy is paramount, present the information in a natural, easy-to-understand manner. Feel free to use lists, bullet points, or paragraphs to best structure the answer for clarity. After providing the answer, you must cite the source using the AMA-formatted citation provided with the context. At the end of every response, include the following disclaimer: 'This information is for clinical reference based on Sri Lankan guidelines and does not replace professional medical judgment.'"
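As a sketch of how the new prompt would be wired in, the snippet below assembles an OpenAI-compatible chat payload (the message format Groq's API accepts). The `build_messages` helper and the `source`/`text` chunk keys are hypothetical illustrations, not the actual code in `src/groq_medical_rag.py`.

```python
from typing import Dict, List

def build_messages(
    system_prompt: str,
    context_chunks: List[Dict[str, str]],
    question: str,
) -> List[Dict[str, str]]:
    """Assemble a chat payload: persona in the system turn, retrieved
    context plus the question in the user turn."""
    context = "\n\n".join(
        f"[Source: {chunk['source']}]\n{chunk['text']}" for chunk in context_chunks
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
```

Keeping the persona and goal in the system turn, and the per-query context in the user turn, lets the prompt stay a "reasoning engine" instruction rather than a rigid output template.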
Phase 4: Standardized Citation Formatting
This phase will address the user's feedback on improving citation quality. The current citations are too long and not in a standard scientific format.
- Goal: To format the source citations in a consistent, professional, and standardized scientific style (e.g., AMA or Vancouver).
- Problem: The current `source` metadata is just a file path, which is not user-friendly. The LLM needs structured metadata to create proper citations.
- Implementation Steps:
  - Extract Citation Metadata: Modify the document processing script (`src/enhanced_pdf_processor.py`) to extract structured citation information (e.g., authors, title, journal, year, page numbers) from each document. This could involve looking for patterns or specific text on the first few pages of each PDF. If not available, we will use the filename as a fallback.
  - Store Metadata: Add the extracted metadata to the `metadata` field of each document chunk created in `src/groq_medical_rag.py`.
  - Create Citation Formatting Prompt: Develop a new system prompt, or enhance the existing one in `src/groq_medical_rag.py`, to instruct the LLM on how to format the citation using the provided metadata. We will ask it to generate a citation in a standard style like AMA.
  - Testing and Refinement: Test the new citation generation with various documents and queries to ensure it is robust and consistently produces well-formatted citations.
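A deterministic formatter could complement (or sanity-check) the LLM-based citation step, guaranteeing a well-formed citation even when the model drifts. This is a minimal sketch; the metadata keys (`authors`, `title`, `journal`, `year`, `pages`, `source`) are assumptions about what the extraction step will provide.

```python
from typing import Any, Dict

def format_ama(meta: Dict[str, Any]) -> str:
    """Best-effort AMA-style citation; falls back to the source filename
    when no structured metadata was extracted."""
    title = meta.get("title")
    if not title:
        return meta.get("source", "Unknown source")  # filename fallback
    pieces = []
    if meta.get("authors"):
        pieces.append(", ".join(meta["authors"]) + ".")
    pieces.append(title.rstrip(".") + ".")
    if meta.get("journal"):
        pieces.append(meta["journal"] + ".")
    if meta.get("year"):
        tail = str(meta["year"])
        if meta.get("pages"):
            tail += ":" + str(meta["pages"])
        pieces.append(tail + ".")
    return " ".join(pieces)
```

Feeding this pre-formatted string to the LLM (rather than raw metadata) would also shrink the prompt and remove one failure mode from testing.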
4. Expected Outcome
By the end of this implementation, the chatbot should be able to:
- Answer questions that require information from complex tables and flowcharts.
- Provide more accurate and relevant answers due to the high-precision retrieval pipeline.
- Include proper AMA-style citations for all retrieved information, enhancing trustworthiness.
- Interact with users in a more natural, helpful, and less robotic tone.
- Have a robust, state-of-the-art foundation for any future enhancements.
5. Project Status Board
- Phase 3: Dynamic and Natural LLM Interaction
  - Rewrite the `SYSTEM_PROMPT` in `src/groq_medical_rag.py`.
- Phase 4: Standardized Citation Formatting
  - Modify `src/enhanced_pdf_processor.py` to extract structured citation metadata.
  - Update `src/groq_medical_rag.py` to store this metadata in document chunks.
  - Enhance the system prompt in `src/groq_medical_rag.py` to instruct the LLM on citation formatting.
  - Test and refine the citation generation.
6. Executor's Feedback or Assistance Requests
No feedback at this time.
7. Branch Name
feature/standard-citations