Implementation Plan: Maternal Health RAG Chatbot v3.0

1. Project Goal

To significantly enhance the quality, accuracy, and naturalness of the RAG chatbot by implementing a state-of-the-art document processing and retrieval pipeline. This version will address the shortcomings of v2, specifically the poor handling of complex document elements (tables, diagrams) and the rigid, templated nature of the LLM responses.


2. Core Problems to Solve

  1. Poor Data Quality: The current pdfplumber-based processor loses critical information from tables, flowcharts, and diagrams, leading to low-quality, out-of-context chunks in the vector store.
  2. Inaccurate Retrieval: As a result of poor data quality, the retrieval system often fails to find the most relevant context, even when the information exists in the source PDFs.
  3. Robotic LLM Responses: The current system prompt is too restrictive, forcing the LLM into a fixed template and preventing natural, conversational answers.

3. The "Version 3.0" Plan

This plan is divided into three main phases, designed to be implemented sequentially.

Phase 1: Advanced Document Processing (Completed)

We have replaced our entire PDF processing pipeline with a modern, machine-learning-based tool to handle complex documents. Note: the AMA citation generation feature was deferred to focus on core functionality first; it is now addressed in Phase 4.

  • Technology: We are using the unstructured.io library for parsing. It is a robust, industry-standard tool for extracting text, tables, and other elements from complex PDFs.
  • Why unstructured.io? After failed attempts with other libraries (mineru, nougat-ocr) due to performance and dependency issues, unstructured.io proved to be the most reliable and effective solution. It uses models like Detectron2 under the hood (via ONNX, simplifying installation) and provides the high-resolution extraction needed for quality results.
  • Implementation Steps (Completed):
    1. Create src/enhanced_pdf_processor.py: A new script that uses the unstructured.io library to process a directory of PDFs and output structured Markdown files.
    2. Use High-Resolution Strategy: The script leverages unstructured's hi_res strategy to accurately parse document layouts, convert tables to HTML, and extract images (see the sketch after this list).
    3. Update Dependencies: Replaced all previous PDF processing dependencies with unstructured[local-inference] in requirements.txt.
    4. Re-process all documents: Ran the new script on all PDFs in the Obs directory, storing the resulting .md files and associated images in the src/processed_markdown/ directory.
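For reference, the core of step 2 might look like the sketch below. It assumes unstructured's partition_pdf entry point; the function name, output layout, and Markdown assembly loop are illustrative, not the exact contents of src/enhanced_pdf_processor.py. Image extraction is omitted here because the relevant unstructured parameter has changed across library versions.

```python
# A minimal sketch of hi_res PDF extraction with unstructured.
# Paths, the helper name, and the Markdown assembly are assumptions.
from pathlib import Path

from unstructured.partition.pdf import partition_pdf

def pdf_to_markdown(pdf_path: str, out_dir: str = "src/processed_markdown") -> Path:
    elements = partition_pdf(
        filename=pdf_path,
        strategy="hi_res",           # layout-aware, model-based parsing
        infer_table_structure=True,  # tables carry an HTML rendering in metadata
    )
    parts = []
    for el in elements:
        # Prefer the HTML rendering for tables; fall back to plain text.
        html = getattr(el.metadata, "text_as_html", None)
        parts.append(html or el.text)
    out = Path(out_dir) / f"{Path(pdf_path).stem}.md"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("\n\n".join(p for p in parts if p))
    return out
```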

Phase 2: High-Precision Retrieval with Re-ranking (In Progress)

Once we have high-quality Markdown, we need to ensure our retrieval system can leverage it effectively.

  • Technology: We will implement a Cross-Encoder Re-ranking strategy using the sentence-transformers library.
  • Why Re-ranking? A simple vector search (like our current FAISS implementation) is fast but not always precise: it can surface chunks that are semantically similar to the query without being the most relevant. A re-ranker adds a second, more powerful scoring step that dramatically increases precision.
  • Implementation Steps:
    1. Update Chunking Strategy (Completed): In src/groq_medical_rag.py, document loading was changed to read the new .md files via UnstructuredMarkdownLoader, and a RecursiveCharacterTextSplitter now produces chunks that respect paragraph and sentence boundaries (see the sketch after this list).
    2. Implement 2-Stage Retrieval (Completed):
      • Stage 1 (Recall): Use the existing FAISS vector store to retrieve a large number of candidate documents (e.g., top 20).
      • Stage 2 (Precision): Use a Cross-Encoder model (cross-encoder/ms-marco-MiniLM-L-6-v2) from the sentence-transformers library to score the relevance of these candidates against the user's query. We then select the top 5 highest-scoring documents to pass to the LLM.
    3. Update the RAG System (Completed): The core logic in src/groq_medical_rag.py has been refactored to accommodate this new two-stage process. The confidence calculation has also been updated to use the re-ranked scores.
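A condensed sketch of steps 1 and 2, assuming LangChain's loaders, splitters, and FAISS wrapper together with sentence-transformers' CrossEncoder. Import paths vary across LangChain versions, and the embedding model, chunk sizes, and variable names here are illustrative assumptions rather than the project's actual code.

```python
from langchain_community.document_loaders import UnstructuredMarkdownLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import CrossEncoder

# Step 1: load the processed Markdown and split it into overlapping chunks.
docs = UnstructuredMarkdownLoader("src/processed_markdown/example.md").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# Recall index for Stage 1 (the embedding model is an illustrative choice).
store = FAISS.from_documents(
    chunks, HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def two_stage_retrieve(query: str, recall_k: int = 20, top_k: int = 5):
    # Stage 1 (recall): fast approximate vector search over FAISS.
    candidates = store.similarity_search(query, k=recall_k)
    # Stage 2 (precision): the cross-encoder reads query and chunk together,
    # which is slower but far more accurate than embedding distance alone.
    scores = reranker.predict([(query, d.page_content) for d in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    # The re-ranked scores can also drive the confidence calculation (step 3).
    return ranked[:top_k]
```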

Phase 3: Dynamic and Natural LLM Interaction

With high-quality context, we can "unleash" the LLM to provide more human-like responses.

  • Technology: Advanced Prompt Engineering.
  • Why a new prompt? To move the LLM from a "template filler" to a "reasoning engine." We will give it a persona and a goal, rather than a rigid set of formatting rules.
  • Implementation Steps:
    1. Rewrite the System Prompt: The SYSTEM_PROMPT in src/groq_medical_rag.py will be replaced with a new version.
    2. Draft of New Prompt:

      "You are a world-class medical expert and a compassionate assistant for healthcare professionals in Sri Lanka. Your primary goal is to provide accurate, evidence-based clinical information based only on the provided context from Sri Lankan maternal health guidelines. Your tone should be professional, clear, and supportive. While accuracy is paramount, present the information in a natural, easy-to-understand manner. Feel free to use lists, bullet points, or paragraphs to best structure the answer for clarity. After providing the answer, you must cite the source using the AMA-formatted citation provided with the context. At the end of every response, include the following disclaimer: 'This information is for clinical reference based on Sri Lankan guidelines and does not replace professional medical judgment.'"

Phase 4: Standardized Citation Formatting

This phase will address the user's feedback on improving citation quality. The current citations are too long and not in a standard scientific format.

  • Goal: To format the source citations in a consistent, professional, and standardized scientific style (e.g., AMA or Vancouver).
  • Problem: The current source metadata is just a file path, which is not user-friendly. The LLM needs structured metadata to create proper citations.
  • Implementation Steps:
    1. Extract Citation Metadata: Modify the document processing script (src/enhanced_pdf_processor.py) to extract structured citation information (e.g., authors, title, journal, year, page numbers) from each document. This could involve looking for patterns or specific text on the first few pages of each PDF. If that fails, we will fall back to the filename (see the sketch after this list).
    2. Store Metadata: Add the extracted metadata to the metadata field of each document chunk created in src/groq_medical_rag.py.
    3. Create Citation Formatting Prompt: Develop a new system prompt or enhance the existing one in src/groq_medical_rag.py to instruct the LLM on how to format the citation using the provided metadata. We will ask it to generate a citation in a standard style like AMA.
    4. Testing and Refinement: Test the new citation generation with various documents and queries to ensure it is robust and consistently produces well-formatted citations.
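Steps 1 and 2 might look like the hypothetical sketch below. The regex pattern, field names, and fallback behaviour are assumptions for illustration; real guideline PDFs will need patterns tuned to their title pages.

```python
import re
from pathlib import Path

def extract_citation_metadata(pdf_path: str, first_page_text: str) -> dict:
    """Best-effort citation fields from a PDF's first page (step 1)."""
    # A four-digit year anywhere on the first page is a reasonable guess
    # at the publication year.
    year = re.search(r"\b(19|20)\d{2}\b", first_page_text)
    # Treat the first non-empty line as a candidate title.
    title = next(
        (line.strip() for line in first_page_text.splitlines() if line.strip()), None
    )
    return {
        "title": title or Path(pdf_path).stem,  # filename fallback per step 1
        "year": year.group(0) if year else "n.d.",
        "source_file": Path(pdf_path).name,
    }

# Step 2: attach the metadata to every chunk before indexing, so the LLM can
# build an AMA-style citation from structured fields rather than a file path.
# (`chunks` is the list of split documents from Phase 2.)
# for chunk in chunks:
#     chunk.metadata.update(extract_citation_metadata(pdf_path, first_page_text))
```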

4. Expected Outcome

By the end of this implementation, the chatbot should be able to:

  • Answer questions that require information from complex tables and flowcharts.
  • Provide more accurate and relevant answers due to the high-precision retrieval pipeline.
  • Include proper AMA-style citations for all retrieved information, enhancing trustworthiness.
  • Interact with users in a more natural, helpful, and less robotic tone.
  • Have a robust, state-of-the-art foundation for any future enhancements.

5. Project Status Board

  • Phase 3: Dynamic and Natural LLM Interaction
    • Rewrite the SYSTEM_PROMPT in src/groq_medical_rag.py.
  • Phase 4: Standardized Citation Formatting
    • Modify src/enhanced_pdf_processor.py to extract structured citation metadata.
    • Update src/groq_medical_rag.py to store this metadata in document chunks.
    • Enhance the system prompt in src/groq_medical_rag.py to instruct the LLM on citation formatting.
    • Test and refine the citation generation.

6. Executor's Feedback or Assistance Requests

No feedback at this time.


7. Branch Name

feature/standard-citations