# 🔄 Enhanced Memory System: STM + LTM + Hybrid Context Retrieval

## Overview
The Medical Chatbot now implements an advanced memory system with Short-Term Memory (STM) and Long-Term Memory (LTM) that manages conversation context and semantic knowledge while preserving conversational continuity. It goes beyond simple RAG to deliver contextually aware responses that remember and build on previous interactions.
## 🏗️ Architecture

### Memory Hierarchy
```text
User Query → Enhanced Memory System → Intelligent Context Selection → LLM Response
                                   ↓
        ┌───────────────────┬───────────────────┬───────────────────┐
        │   STM (5 items)   │  LTM (60 items)   │    RAG Search     │
        │ (Recent Summaries)│ (Semantic Store)  │ (Knowledge Base)  │
        └───────────────────┴───────────────────┴───────────────────┘
                                   ↓
                  Gemini Flash Lite Contextual Analysis
                                   ↓
                 Summarized Context + Semantic Knowledge
```
### Memory Types

#### 1. Short-Term Memory (STM)
- Capacity: 5 recent conversation summaries
- Content: Chunked and summarized LLM responses with enriched topics
- Features: Semantic deduplication, intelligent merging, topic enrichment
- Purpose: Maintain conversational continuity and immediate context
#### 2. Long-Term Memory (LTM)
- Capacity: 60 semantic chunks (~20 conversational rounds)
- Content: FAISS-indexed medical knowledge chunks
- Features: Semantic similarity search, usage tracking, smart eviction
- Purpose: Provide deep medical knowledge and historical context
#### 3. RAG Knowledge Base
- Content: External medical knowledge and guidelines
- Features: Real-time retrieval, semantic matching
- Purpose: Supplement with current medical information
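To make the hierarchy concrete, here is a minimal sketch of the per-user containers it implies. All names here (`LTMChunk`, `UserMemory`, `EMBED_DIM`) are hypothetical; the actual layout in memory.py may differ:

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List

import faiss  # FAISS backs LTM semantic search

EMBED_DIM = 384  # assumption: a small sentence-transformer embedding size

@dataclass
class LTMChunk:
    tag: str         # enriched topic (10-20 words)
    text: str        # summarized content
    created: float   # timestamp, used for recency scoring
    usage: int = 0   # incremented on retrieval; feeds smart eviction

@dataclass
class UserMemory:
    # STM: 5 most recent summaries; the deque evicts the oldest automatically
    stm: deque = field(default_factory=lambda: deque(maxlen=5))
    # LTM: up to 60 chunks plus a FAISS index over their embeddings
    ltm_chunks: List[LTMChunk] = field(default_factory=list)
    ltm_index: faiss.Index = field(default_factory=lambda: faiss.IndexFlatIP(EMBED_DIM))
```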
## 🔧 Key Components

### 1. Enhanced Memory Manager (memory.py)

#### STM Management
```python
def get_recent_chat_history(self, user_id: str, num_turns: int = 5) -> List[Dict]:
    """
    Get the most recent STM summaries (not raw Q/A).
    Returns: [{"user": "", "bot": "Topic: ...\n<summary>", "timestamp": time}, ...]
    """
```
**STM Features:**
- Capacity: 5 recent conversation summaries
- Content: Chunked and summarized LLM responses with enriched topics
- Deduplication: Semantic similarity-based merging (≥ 0.92 replaces the older entry, ≥ 0.75 merges; see the sketch below)
- Topic Enrichment: Uses the user's question context to generate detailed topics
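A minimal sketch of the dedupe/merge step, assuming unit-normalized embeddings so a dot product equals cosine similarity; `embed` and `merge_texts` are hypothetical helpers, not the actual memory.py internals:

```python
import numpy as np

IDENTICAL_THRESHOLD = 0.92  # near-duplicate: replace the older entry
MERGE_THRESHOLD = 0.75      # related content: fuse into one summary

def upsert_stm_sketch(stm, new_chunk, embed):
    """Insert a summarized chunk into STM with semantic dedupe/merge.

    `stm` holds {"tag", "text", "vec"} dicts; `embed` (hypothetical) returns
    a unit-normalized vector, so a dot product is cosine similarity.
    """
    new_vec = embed(new_chunk["text"])
    for item in stm:
        sim = float(np.dot(item["vec"], new_vec))
        if sim >= IDENTICAL_THRESHOLD:
            # newer wording replaces the older near-duplicate
            item.update(tag=new_chunk["tag"], text=new_chunk["text"], vec=new_vec)
            return
        if sim >= MERGE_THRESHOLD:
            # merge_texts is a hypothetical helper that fuses the two summaries
            item["text"] = merge_texts(item["text"], new_chunk["text"])
            item["vec"] = embed(item["text"])
            return
    stm.append({**new_chunk, "vec": new_vec})  # deque(maxlen=5) drops the oldest
```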
#### LTM Management

```python
def get_relevant_chunks(self, user_id: str, query: str, top_k: int = 3, min_sim: float = 0.30) -> List[str]:
    """Return texts of chunks whose cosine similarity ≥ min_sim."""
```
**LTM Features:**
- Capacity: 60 semantic chunks (~20 conversational rounds)
- Indexing: FAISS-based semantic search (see the retrieval sketch below)
- Smart Eviction: Usage-based decay and recency scoring
- Merging: Intelligent deduplication and content fusion
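A sketch of the retrieval side, assuming an inner-product FAISS index over unit-normalized vectors (so scores are cosine similarities); names follow the `UserMemory` sketch above:

```python
import numpy as np

def get_relevant_chunks_sketch(index, chunks, query_vec, top_k=3, min_sim=0.30):
    """Search LTM and return chunk texts that clear the similarity floor."""
    query = query_vec.reshape(1, -1).astype(np.float32)
    sims, ids = index.search(query, top_k)  # FAISS returns (scores, indices)
    results = []
    for sim, idx in zip(sims[0], ids[0]):
        if idx == -1 or sim < min_sim:
            continue  # -1 pads empty slots; low-similarity hits are dropped
        chunks[idx].usage += 1  # usage tracking feeds smart eviction
        results.append(chunks[idx].text)
    return results
```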
#### Enhanced Chunking

```python
def chunk_response(self, response: str, lang: str, question: str = "") -> List[Dict]:
    """
    Enhanced chunking with question context for richer topics.
    Returns: [{"tag": "detailed_topic", "text": "summary"}, ...]
    """
```
**Chunking Features:**
- Question Context: Incorporates the user's latest question for topic generation
- Rich Topics: Detailed topics (10-20 words) capturing context, condition, and action
- Medical Focus: Excludes disclaimers, includes exact medication names/doses
- Semantic Grouping: Groups by medical topic, symptom, assessment, plan, or instruction (see the sketch below)
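A hedged sketch of how `chunk_response` might prompt the summarizer and parse its output; the prompt wording, the JSON schema, and the `llm` callable are assumptions, not the exact internals of memory.py:

```python
import json

def chunk_response_sketch(llm, response, lang, question=""):
    """Ask the summarizer to split a bot answer into tagged chunks.

    `llm` is any callable prompt -> text; the JSON schema below is an
    assumption about what memory.py asks Gemini to produce.
    """
    prompt = (
        "Split the medical answer below into chunks grouped by topic, symptom, "
        "assessment, plan, or instruction. For each chunk write a detailed "
        "10-20 word topic using the user's question for context; keep exact "
        "medication names/doses and drop disclaimers.\n"
        'Reply as JSON: [{"tag": "...", "text": "..."}]\n'
        f"User question: {question}\nLanguage: {lang}\nAnswer: {response}"
    )
    try:
        return json.loads(llm(prompt))
    except Exception:
        # the documented fallback: store the raw response as a single chunk
        return [{"tag": "raw_response", "text": response}]
```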
### 2. Intelligent Context Retrieval

#### Contextual Summarization

```python
def get_contextual_chunks(self, user_id: str, current_query: str, lang: str = "EN") -> str:
    """
    Creates a single, coherent summary from STM + LTM + RAG.
    Returns: A single summary string for the main LLM.
    """
```
**Features:**
- Unified Summary: Combines STM (5 turns) + LTM (semantic) + RAG (knowledge)
- Gemini Analysis: Uses Gemini Flash Lite for intelligent context selection
- Conversational Flow: Maintains continuity while providing medical relevance
- Fallback Strategy: Graceful degradation if analysis fails
## 🚀 How It Works

### Step 1: Enhanced Memory Processing

```python
# Process a new exchange through STM and LTM
chunks = memory.chunk_response(response, lang, question=query)
for chunk in chunks:
    memory._upsert_stm(user_id, chunk, lang)  # STM with dedupe/merge
memory._upsert_ltm(user_id, chunks, lang)     # LTM with semantic storage
```
### Step 2: Context Retrieval

```python
# Get STM summaries (5 recent turns)
recent_history = memory.get_recent_chat_history(user_id, num_turns=5)

# Get LTM semantic chunks
rag_chunks = memory.get_relevant_chunks(user_id, current_query, top_k=3)

# Get external RAG knowledge
external_rag = retrieve_medical_info(current_query)
```
### Step 3: Intelligent Context Summarization
The system sends all context sources to Gemini Flash Lite for unified summarization:
```text
You are a medical assistant creating a concise summary of conversation context for continuity.

Current user query: "{current_query}"

Available context information:

Recent conversation history:
{recent_history}

Semantically relevant historical medical information:
{rag_chunks}

Task: Create a brief, coherent summary that captures the key points from the conversation history and relevant medical information that are important for understanding the current query.

Guidelines:
1. Focus on medical symptoms, diagnoses, treatments, or recommendations mentioned
2. Include any patient concerns or questions that are still relevant
3. Highlight any follow-up needs or pending clarifications
4. Keep the summary concise but comprehensive enough for context
5. Maintain conversational flow and continuity

Output: Provide a single, well-structured summary paragraph that can be used as context for the main LLM to provide a coherent response.
```
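A sketch of how `get_contextual_chunks` might fill this template and degrade gracefully. `PROMPT_TEMPLATE` abbreviates the prompt above, and the `llm` callable stands in for the Gemini Flash Lite wrapper; both are assumptions:

```python
# Abbreviated version of the template shown above (assumed to live as a constant)
PROMPT_TEMPLATE = (
    "You are a medical assistant creating a concise summary of conversation "
    'context for continuity.\nCurrent user query: "{current_query}"\n'
    "Recent conversation history:\n{recent_history}\n"
    "Semantically relevant historical medical information:\n{rag_chunks}\n"
    "Output: Provide a single, well-structured summary paragraph."
)

def get_contextual_chunks_sketch(llm, current_query, recent_history, rag_chunks):
    """Build the summarization prompt and fall back to raw context on failure."""
    history_txt = "\n".join(f"- {turn['bot']}" for turn in recent_history)
    chunks_txt = "\n".join(f"- {c}" for c in rag_chunks)
    prompt = PROMPT_TEMPLATE.format(
        current_query=current_query,
        recent_history=history_txt,
        rag_chunks=chunks_txt,
    )
    try:
        return llm(prompt).strip()  # primary: Gemini Flash Lite summary
    except Exception:
        # graceful degradation: hand the main LLM the raw pieces instead
        return "\n".join(filter(None, [history_txt, chunks_txt]))
```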
### Step 4: Unified Context Integration
The single, coherent summary is integrated into the main LLM prompt (a minimal splice is sketched after this list), providing:
- Conversational continuity (from STM summaries)
- Medical knowledge (from LTM semantic chunks)
- Current information (from external RAG)
- Unified narrative (single summary instead of multiple chunks)
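Purely illustrative: one way the summary could be spliced into the main prompt. The real prompt layout lives in the app code and may differ:

```python
def build_main_prompt_sketch(system_prompt, contextual_summary, user_query):
    """Illustrative only: splice the unified summary into the main prompt."""
    return (
        f"{system_prompt}\n\n"
        "Conversation context (summarized from STM + LTM + RAG):\n"
        f"{contextual_summary}\n\n"
        f"User: {user_query}"
    )
```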
## 📊 Benefits

### 1. Advanced Memory Management
- STM: Maintains 5 recent conversation summaries with intelligent deduplication
- LTM: Stores 60 semantic chunks (~20 rounds) with FAISS indexing
- Smart Merging: Combines similar content while preserving unique details
- Topic Enrichment: Detailed topics using user question context
### 2. Intelligent Context Summarization
- Unified Summary: Single coherent narrative instead of multiple chunks
- Gemini Analysis: AI-powered context selection and summarization
- Medical Focus: Prioritizes symptoms, diagnoses, treatments, and recommendations
- Conversational Flow: Maintains natural dialogue continuity
### 3. Enhanced Chunking & Topics
- Question Context: Incorporates user's latest question for richer topics
- Detailed Topics: 10-20 word descriptions capturing context, condition, and action
- Medical Precision: Includes exact medication names, doses, and clinical instructions
- Semantic Grouping: Organizes by medical topic, symptom, assessment, plan, or instruction
### 4. Robust Fallback Strategy
- Primary: Gemini Flash Lite contextual summarization
- Secondary: LTM semantic search with usage-based scoring
- Tertiary: STM recent summaries
- Final: External RAG knowledge base
### 5. Performance & Scalability
- Efficient Storage: Semantic deduplication reduces memory footprint
- Fast Retrieval: FAISS indexing for sub-millisecond LTM search
- Smart Eviction: Usage-based decay and recency scoring (one plausible scoring rule is sketched after this list)
- Minimal Latency: Optimized for real-time medical consultations
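The document states that eviction combines usage-based decay with recency scoring; below is one plausible scoring rule under that description, using the `LTMChunk` fields sketched earlier. The exponential curve and `half_life_s` are assumptions:

```python
import time

def eviction_score_sketch(chunk, half_life_s=3600.0):
    """Retention score: frequently used and recently created chunks win.

    The decay curve and half-life are assumptions; the doc only says that
    eviction combines usage-based decay with recency scoring.
    """
    age = time.time() - chunk.created
    recency = 0.5 ** (age / half_life_s)  # halves every half_life_s seconds
    return chunk.usage + recency

def evict_if_full_sketch(chunks, max_chunks=60):
    """Drop the lowest-scoring chunks once LTM exceeds its 60-chunk cap."""
    while len(chunks) > max_chunks:
        victim = min(chunks, key=eviction_score_sketch)
        chunks.remove(victim)  # the FAISS index is rebuilt from the survivors
```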
## 🧪 Example Scenarios

### Scenario 1: STM Deduplication & Merging
User: "I have chest pain"
Bot: "This could be angina. Symptoms include pressure, tightness, and shortness of breath."
User: "What about chest pain with shortness of breath?"
Bot: "Chest pain with shortness of breath is concerning for angina or heart attack..."
User: "Tell me more about the symptoms"
Bot: "Angina symptoms include chest pressure, tightness, shortness of breath, and may radiate to arms..."
**Result:** STM merges the similar responses into one comprehensive summary: "Patient has chest pain symptoms consistent with angina, including pressure, tightness, shortness of breath, and potential radiation to the arms. This represents a concerning cardiac presentation requiring immediate evaluation."
### Scenario 2: LTM Semantic Retrieval

```text
User: "What medications should I avoid with my condition?"
Bot: "Based on your previous discussion about hypertension and the medications mentioned..."
```

**Result:** LTM retrieves relevant information about hypertension medications and contraindications from previous conversations, even when it is no longer in recent STM.
### Scenario 3: Enhanced Topic Generation

```text
User: "I'm having trouble sleeping"
STM entry: "Topic: Sleep disturbance evaluation and management for adult patient with insomnia symptoms"
```

**Result:** The topic incorporates the user's question context to create a detailed, medical-specific description instead of just "Sleep problems."
### Scenario 4: Unified Context Summarization

```text
User: "Can you repeat the treatment plan?"
Bot: "Based on our conversation about your hypertension and sleep issues, your treatment plan includes..."
```

**Result:** The system creates a unified summary combining STM (the recent sleep discussion), LTM (hypertension history), and RAG (current treatment guidelines) into a single coherent narrative.
## ⚙️ Configuration

### Environment Variables

```bash
FlashAPI=your_gemini_api_key  # Used for both the main LLM and contextual analysis
```
### Enhanced Memory Settings

```python
memory = MemoryManager(
    max_users=1000,      # Maximum users held in memory
    history_per_user=5,  # STM capacity (5 recent summaries)
    max_chunks=60,       # LTM capacity (~20 conversational rounds)
)
```
### Memory Parameters

```python
# STM retrieval (5 recent turns)
recent_history = memory.get_recent_chat_history(user_id, num_turns=5)

# LTM semantic search
rag_chunks = memory.get_relevant_chunks(user_id, query, top_k=3, min_sim=0.30)

# Unified context summarization
contextual_summary = memory.get_contextual_chunks(user_id, current_query, lang)
```
### Similarity Thresholds

```python
# STM deduplication thresholds
IDENTICAL_THRESHOLD = 0.92  # Replace the older entry with the newer one
MERGE_THRESHOLD = 0.75      # Merge similar content

# LTM semantic search
MIN_SIMILARITY = 0.30  # Minimum similarity for retrieval
TOP_K = 3              # Number of chunks to retrieve
```
## 🔍 Monitoring & Debugging

### Enhanced Logging
The system provides comprehensive logging for all memory operations:
```python
# STM operations
logger.info(f"[Contextual] Retrieved {len(recent_history)} recent history items")
logger.info(f"[Contextual] Retrieved {len(rag_chunks)} RAG chunks")

# Chunking operations
logger.info(f"[Memory] 📦 Gemini summarized chunk output: {output}")
logger.warning(f"[Memory] ❌ Gemini chunking failed: {e}")

# Contextual summarization
logger.info(f"[Contextual] Gemini created summary: {summary[:100]}...")
logger.warning(f"[Contextual] Gemini summarization failed: {e}")
```
### Performance Metrics
- STM Operations: Deduplication rate, merge frequency, topic enrichment quality
- LTM Operations: FAISS search latency, semantic similarity scores, eviction patterns
- Context Summarization: Gemini response time, summary quality, fallback usage
- Memory Usage: Storage efficiency, retrieval hit rates, cache performance
## 🚨 Error Handling

### Enhanced Fallback Strategy
- Primary: Gemini Flash Lite contextual summarization
- Secondary: LTM semantic search with usage-based scoring
- Tertiary: STM recent summaries
- Final: External RAG knowledge base
- Emergency: No context (minimal response)
### Error Scenarios & Recovery
- Gemini API failure → Fall back to LTM semantic search
- LTM corruption → Rebuild FAISS index from remaining chunks
- STM corruption → Reset to empty STM, continue with LTM
- Memory corruption → Reset user session, clear all memory
- Chunking failure → Store the raw response as a fallback chunk (the full ladder is sketched below)
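The ladder above, written out as a hedged sketch. `retrieve_medical_info` is the external-RAG call named earlier; the rest uses only the documented MemoryManager API, though the real control flow may differ:

```python
def retrieve_context_sketch(memory, user_id, query, lang="EN"):
    """Walk the fallback ladder, degrading one rung at a time."""
    try:
        # Primary: Gemini Flash Lite contextual summarization
        return memory.get_contextual_chunks(user_id, query, lang)
    except Exception:
        pass
    # Secondary: LTM semantic search
    chunks = memory.get_relevant_chunks(user_id, query)
    if chunks:
        return "\n".join(chunks)
    # Tertiary: STM recent summaries
    recent = memory.get_recent_chat_history(user_id)
    if recent:
        return "\n".join(turn["bot"] for turn in recent)
    # Final: external RAG knowledge base; emergency: empty context
    return retrieve_medical_info(query) or ""
```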
## 🔮 Future Enhancements

### 1. Persistent Memory Storage
- Database Integration: Store LTM in PostgreSQL/SQLite with FAISS index persistence
- Session Recovery: Resume conversations after system restarts
- Memory Export: Allow users to export their conversation history
- Cross-device Sync: Synchronize memory across different devices
### 2. Advanced Memory Features
- Fact Store: Dedicated storage for critical medical facts (allergies, chronic conditions, medications)
- Memory Compression: Summarize older STM entries into LTM when STM overflows
- Contextual Tags: Add metadata tags (encounter type, modality, urgency) to bias retrieval
- Memory Analytics: Track memory usage patterns and optimize storage strategies
### 3. Intelligent Memory Management
- Adaptive Thresholds: Dynamically adjust similarity thresholds based on conversation context
- Memory Prioritization: Protect critical medical information from eviction
- Usage-based Retention: Keep frequently accessed information longer
- Semantic Clustering: Group related memories for better organization
### 4. Enhanced Medical Context
- Clinical Decision Support: Integrate with medical guidelines and protocols
- Risk Assessment: Track and alert on potential medical risks across conversations
- Medication Reconciliation: Maintain accurate medication lists across sessions
- Follow-up Scheduling: Track recommended follow-ups and reminders
### 5. Multi-modal Memory
- Image Memory: Store and retrieve medical images with descriptions
- Voice Memory: Convert voice interactions to text for memory storage
- Document Memory: Process and store medical documents and reports
- Temporal Memory: Track changes in symptoms and conditions over time
## 📝 Testing

### Memory System Testing

```bash
cd Medical-Chatbot
python test_memory_system.py
```
### Test Scenarios
- STM Deduplication Test: Verify similar responses are merged correctly
- LTM Semantic Search Test: Test FAISS retrieval with various queries
- Context Summarization Test: Validate unified summary generation
- Topic Enrichment Test: Check detailed topic generation with question context
- Memory Capacity Test: Verify STM (5 items) and LTM (60 items) limits
- Fallback Strategy Test: Test system behavior when the Gemini API fails (sample checks are sketched below)
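Hedged pytest-style checks for two of these scenarios, using only the MemoryManager calls documented above; the real test_memory_system.py may structure its assertions differently:

```python
from memory import MemoryManager  # assumption: importable from memory.py

def test_stm_capacity():
    memory = MemoryManager(max_users=10, history_per_user=5, max_chunks=60)
    for i in range(8):  # more turns than STM can hold
        memory._upsert_stm("user-1", {"tag": f"topic {i}", "text": f"summary {i}"}, "EN")
    history = memory.get_recent_chat_history("user-1", num_turns=5)
    assert len(history) <= 5  # dedupe/merge may shrink it further

def test_ltm_similarity_floor():
    memory = MemoryManager(max_users=10, history_per_user=5, max_chunks=60)
    # an extreme floor should filter out everything for an unrelated query
    results = memory.get_relevant_chunks("user-1", "unrelated query", min_sim=0.99)
    assert results == []
```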
### Expected Behaviors
- STM: Similar responses merge, unique details preserved
- LTM: Semantic search returns relevant chunks with usage tracking
- Topics: Detailed, medical-specific descriptions (10-20 words)
- Summaries: Coherent narratives combining STM + LTM + RAG
- Performance: Sub-second retrieval times for all operations
## 🎯 Summary
The enhanced memory system transforms the Medical Chatbot into a sophisticated, memory-aware medical assistant that:
✅ Maintains Short-Term Memory (STM) with 5 recent conversation summaries and intelligent deduplication
✅ Provides Long-Term Memory (LTM) with 60 semantic chunks and FAISS-based retrieval
✅ Generates Enhanced Topics using question context for detailed, medical-specific descriptions
✅ Creates Unified Summaries combining STM + LTM + RAG into coherent narratives
✅ Implements Smart Merging that preserves unique details while eliminating redundancy
✅ Ensures Conversational Continuity across extended medical consultations
✅ Optimizes Performance with sub-second retrieval and efficient memory management
This advanced memory system addresses the limitations of simple RAG systems by providing:
- Intelligent context management that remembers and builds upon previous interactions
- Medical precision with detailed topics and exact clinical information
- Scalable architecture that can handle extended conversations without performance degradation
- Robust fallback strategies ensuring system reliability in all scenarios
The result is a medical chatbot that truly understands conversation context, remembers patient history, and provides increasingly relevant and personalized medical guidance over time.