System Patterns: Morris Bot Architecture

System Architecture Overview

High-Level Architecture

Data Collection → Preprocessing → Enhancement → Fine-tuning → Inference → Web Interface
     ↓              ↓             ↓            ↓           ↓           ↓
  scraper.py → preprocess.py → enhance.py → finetune.py → model → app.py (Gradio)

Core Components

  1. Data Pipeline: Web scraping → JSON storage → Enhancement → Dataset preparation
  2. Enhancement Pipeline: System prompt improvement → Non-telecom examples → Style optimization
  3. Training Pipeline: Enhanced LoRA fine-tuning → Multiple checkpoints → Enhanced adapter storage
  4. Inference Pipeline: Enhanced model loading → Style-aware generation → Response formatting
  5. User Interface: Enhanced Gradio web app → Apple Silicon optimization → Real-time generation
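These stages can be chained into a single run; the sketch below is a minimal, hypothetical orchestrator that invokes each stage script in order (the invocation convention is an assumption, not the project's actual runner):

# Hypothetical pipeline runner: run each stage script in order, stop on failure.
import subprocess
import sys

STAGES = ["scraper.py", "preprocess.py", "enhance.py", "finetune.py"]

def run_pipeline() -> None:
    for stage in STAGES:
        print(f"Running {stage} ...")
        result = subprocess.run([sys.executable, stage])
        if result.returncode != 0:
            raise RuntimeError(f"{stage} failed with exit code {result.returncode}")

if __name__ == "__main__":
    run_pipeline()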

Key Technical Decisions

Model Selection: Zephyr-7B-Beta

Decision: Use HuggingFaceH4/zephyr-7b-beta as the base model.

Rationale:

  • Instruction-tuned for better following of generation prompts
  • No authentication required (unlike some Mistral variants)
  • 7B parameters: Good balance of capability vs. resource requirements
  • Strong performance on text generation tasks

Alternative Considered: Direct Mistral-7B

Why Rejected: Zephyr's instruction-tuning provides better prompt adherence.

Fine-tuning Approach: LoRA (Low-Rank Adaptation)

Decision: Use LoRA instead of full fine-tuning.

Rationale:

  • Memory Efficiency: Only 0.58% of parameters trainable (42.5M vs 7.24B)
  • Hardware Compatibility: Fits in 8GB RAM on Apple Silicon
  • Training Speed: ~18 minutes vs hours for full fine-tuning
  • Preservation: Keeps base model knowledge while adding specialization

Configuration:

LoRA Parameters:
- rank: 16 (balance of efficiency vs capacity)
- alpha: 32 (scaling factor)
- dropout: 0.1 (regularization)
- target_modules: All attention layers
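
A minimal PEFT sketch of this configuration; the target module names are the usual attention projections for Mistral/Zephyr-style models and are an assumption here, not necessarily the exact list used in finetune.py:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

lora_config = LoraConfig(
    r=16,                  # rank: balance of efficiency vs. capacity
    lora_alpha=32,         # scaling factor
    lora_dropout=0.1,      # regularization
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention layers
    task_type="CAUSAL_LM",
)

base_model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # should report well under 1% trainable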

Hardware Optimization: Apple Silicon MPS

Decision: Optimize for Apple M1/M2/M3 chips with the MPS backend.

Rationale:

  • Target Hardware: Many developers use MacBooks
  • Performance: MPS provides significant acceleration over CPU
  • Memory: Unified memory architecture efficient for ML workloads
  • Accessibility: Makes fine-tuning accessible without expensive GPUs

Implementation Pattern:

# Automatic device detection
import torch

if torch.backends.mps.is_available():
    device = "mps"
    dtype = torch.float16  # memory-efficient half precision on Apple Silicon
elif torch.cuda.is_available():
    device = "cuda"
    dtype = torch.float16
else:
    device = "cpu"
    dtype = torch.float32  # full-precision fallback on CPU

Design Patterns in Use

Data Processing Pipeline Pattern

Pattern: ETL (Extract, Transform, Load) with validation.

Implementation:

  1. Extract: Web scraping with rate limiting and error handling
  2. Transform: Text cleaning, format standardization, instruction formatting
  3. Load: JSON storage with validation and dataset splitting
  4. Validate: Content quality checks and format verification
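A condensed sketch of this pattern; the helper names and the quality threshold are hypothetical, for illustration only:

import json
import time

import requests
from bs4 import BeautifulSoup

def extract(url: str) -> dict:
    """Extract: fetch one article with error handling and a respectful delay."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    time.sleep(2)  # rate limiting between requests
    return {"title": soup.title.get_text(strip=True), "text": soup.get_text()}

def transform(article: dict) -> dict:
    """Transform: clean whitespace and wrap in instruction format."""
    text = " ".join(article["text"].split())
    return {"instruction": f"Write an article titled: {article['title']}", "output": text}

def load(examples: list, path: str) -> None:
    """Load + Validate: keep only examples that pass a basic quality check."""
    valid = [ex for ex in examples if len(ex["output"]) > 200]
    with open(path, "w") as f:
        json.dump(valid, f, indent=2)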

Model Adapter Pattern

Pattern: Adapter pattern for model extensions.

Implementation:

  • Base model remains unchanged
  • LoRA adapters provide specialization
  • Easy swapping between different fine-tuned versions
  • Preserves ability to use base model capabilities
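With PEFT this pattern amounts to loading an adapter over an unchanged base model; a sketch (the commented-out second adapter path is hypothetical):

from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

# Attach the fine-tuned LoRA adapter; the base weights stay untouched on disk.
model = PeftModel.from_pretrained(base, "models/iain-morris-model-enhanced")

# Swapping styles means loading a different adapter directory over the same base:
# model = PeftModel.from_pretrained(base, "models/some-other-adapter")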

Configuration Management Pattern

Pattern: Centralized configuration with environment-specific overrides.

Implementation:

# Training configuration centralized in finetune.py
TRAINING_CONFIG = {
    "learning_rate": 1e-4,
    "num_epochs": 2,
    "batch_size": 1,
    "gradient_accumulation_steps": 8
}

# Hardware-specific overrides
if device == "mps":
    TRAINING_CONFIG["fp16"] = False  # Not supported on MPS
    TRAINING_CONFIG["dataloader_num_workers"] = 0

Error Handling and Logging Pattern

Pattern: Comprehensive logging with graceful degradation.

Implementation:

  • Structured logging to morris_bot.log
  • Try-catch blocks with informative error messages
  • Fallback behaviors (CPU if MPS fails, etc.)
  • Progress tracking during long operations
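A minimal sketch of the logging-plus-fallback idea; the log file name comes from the project, while load_fn stands in for whatever model-loading callable is being wrapped (hypothetical):

import logging

logging.basicConfig(
    filename="morris_bot.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logger = logging.getLogger("morris_bot")

def load_with_fallback(load_fn):
    """Try the accelerated device first, then degrade gracefully to CPU."""
    try:
        return load_fn(device="mps")
    except Exception as exc:
        logger.warning("MPS load failed (%s); falling back to CPU", exc)
        return load_fn(device="cpu")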

Component Relationships

Enhanced Data Flow Architecture

Raw Articles → Enhanced Dataset → Style-Optimized Training → Enhanced Model
     ↓              ↓                      ↓                     ↓
Raw JSON → Improved Prompts → Non-telecom Examples → Enhanced LoRA Adapters
     ↓              ↓                      ↓                     ↓
Original → System Prompt Update → Topic Diversification → Multi-topic Capability
     ↓              ↓                      ↓                     ↓
Web Interface ← Enhanced Inference ← Enhanced Model ← Style-Aware Training

Enhanced Dependency Relationships

  • app.py depends on enhanced model in models/iain-morris-model-enhanced/
  • finetune.py depends on enhanced dataset in data/enhanced_train_dataset.json
  • update_system_prompt.py enhances training data with improved style guidance
  • add_non_telecom_examples.py expands dataset with topic diversity
  • test_enhanced_model.py validates enhanced model performance
  • ENHANCEMENT_SUMMARY.md documents all improvements and changes

Enhanced Model Architecture

Base Model: Zephyr-7B-Beta (7.24B parameters)
     ↓
Enhanced LoRA Adapters (42.5M trainable parameters)
     ↓
Style-Aware Generation with:
- Doom-laden openings
- Cynical wit and expertise  
- Signature phrases ("What could possibly go wrong?")
- Dark analogies and visceral metaphors
- British cynicism with parenthetical snark
- Multi-topic versatility (telecom + non-telecom)

State Management

  • Model State: Stored as LoRA adapter files (safetensors format)
  • Training State: Checkpoints saved during training for recovery
  • Data State: JSON files with versioning through filenames
  • Application State: Stateless web interface, model loaded on demand

Critical Implementation Paths

Training Pipeline Critical Path

  1. Data Validation: Ensure training examples meet quality standards
  2. Model Loading: Base model download and initialization
  3. LoRA Setup: Adapter configuration and parameter freezing
  4. Training Loop: Gradient computation and adapter updates
  5. Checkpoint Saving: Periodic saves for recovery
  6. Final Export: Adapter weights saved for inference
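A compressed sketch of this path using the Hugging Face Trainer; the hyperparameters mirror the configuration shown earlier, while save_steps and the model/dataset objects are assumptions carried over from the preceding sketches:

from transformers import Trainer, TrainingArguments

# `model` and `train_dataset` are assumed to come from the LoRA setup and data pipeline above.
training_args = TrainingArguments(
    output_dir="models/iain-morris-model-enhanced",
    learning_rate=1e-4,
    num_train_epochs=2,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    save_steps=50,       # periodic checkpoints for recovery (illustrative value)
    logging_steps=10,
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
trainer.save_model()     # final export: adapter weights for inference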

Inference Pipeline Critical Path

  1. Model Loading: Base model + LoRA adapter loading
  2. Prompt Formatting: User input → instruction format
  3. Generation: Model forward pass with sampling parameters
  4. Post-processing: Clean output, format for display
  5. Response: Return formatted article to user interface
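A sketch of the inference path, assuming Zephyr's chat template for prompt formatting; the sampling parameters are illustrative, not the app's actual settings:

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "HuggingFaceH4/zephyr-7b-beta"
ADAPTER = "models/iain-morris-model-enhanced"

tokenizer = AutoTokenizer.from_pretrained(BASE)
base_model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base_model, ADAPTER)

def generate_article(topic: str, max_new_tokens: int = 512) -> str:
    # Prompt formatting: user input -> chat/instruction format.
    messages = [{"role": "user", "content": f"Write an article about {topic}."}]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Generation: forward pass with sampling parameters.
    output = model.generate(**inputs, max_new_tokens=max_new_tokens,
                            do_sample=True, temperature=0.8, top_p=0.9)
    # Post-processing: drop the prompt tokens and clean up for display.
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()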

Error Recovery Patterns

  • Training Interruption: Resume from last checkpoint
  • Memory Overflow: Reduce batch size, enable gradient checkpointing
  • Model Loading Failure: Fallback to CPU, reduce precision
  • Generation Timeout: Implement timeout with partial results
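A hedged sketch of two of these recovery moves, reusing the trainer and TrainingArguments objects from the training sketch above:

# Training interruption: pick up from the latest checkpoint in output_dir.
trainer.train(resume_from_checkpoint=True)

# Memory overflow: rebuild the arguments with a smaller effective batch and
# gradient checkpointing, trading compute for memory.
training_args = TrainingArguments(
    output_dir="models/iain-morris-model-enhanced",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,
)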

Performance Optimization Patterns

Memory Management

  • Gradient Accumulation: Simulate larger batch sizes without memory increase
  • Mixed Precision: float16 where supported for memory efficiency
  • Model Sharding: LoRA adapters separate from base model
  • Garbage Collection: Explicit cleanup after training steps
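A sketch of the explicit-cleanup idea; torch.mps.empty_cache is only present in recent PyTorch releases, so its availability is an assumption about the installed version:

import gc

import torch

def cleanup_after_step() -> None:
    """Release Python objects and device caches between heavy steps."""
    gc.collect()
    if torch.backends.mps.is_available():
        torch.mps.empty_cache()   # recent PyTorch only
    elif torch.cuda.is_available():
        torch.cuda.empty_cache()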

Compute Optimization

  • Hardware Detection: Automatic selection of best available device
  • Batch Processing: Process multiple examples efficiently
  • Caching: Tokenized datasets cached for repeated training runs
  • Parallel Processing: Multi-threading where beneficial
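Dataset caching falls out of the datasets library, which fingerprints map() calls and reuses results on disk; a sketch, where the "text" field name and the tokenizer object are assumptions carried over from earlier sketches:

from datasets import load_dataset

dataset = load_dataset("json", data_files="data/enhanced_train_dataset.json")["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

# The result of map() is cached, so repeated training runs skip re-tokenization.
tokenized = dataset.map(tokenize, batched=True)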

User Experience Optimization

  • Lazy Loading: Model loaded only when needed
  • Progress Indicators: Real-time feedback during long operations
  • Parameter Validation: Input validation before expensive operations
  • Responsive Interface: Non-blocking UI during generation
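A sketch of the lazy-loading pattern in a Gradio app; load_model and run_generation are hypothetical helpers, and the real app.py interface is richer than this:

import gradio as gr

_model = None  # loaded on first request, not at import time

def get_model():
    """Lazy loading: defer the expensive base-model + adapter load."""
    global _model
    if _model is None:
        _model = load_model()  # hypothetical helper wrapping the inference setup
    return _model

def generate(topic, progress=gr.Progress()):
    progress(0.1, desc="Loading model")
    model = get_model()
    progress(0.6, desc="Generating article")
    return run_generation(model, topic)  # hypothetical generation helper

demo = gr.Interface(fn=generate, inputs="text", outputs="text")
demo.launch()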

Scalability Considerations

Current Limitations

  • Single Model: Only one fine-tuned model at a time
  • Local Deployment: No distributed inference capability
  • Memory Bound: Limited by single machine memory
  • Sequential Processing: One generation request at a time

Future Scalability Patterns

  • Model Versioning: Support multiple LoRA adapters
  • Distributed Inference: Model serving across multiple devices
  • Batch Generation: Process multiple requests simultaneously
  • Cloud Deployment: Container-based scaling patterns

Security and Ethics Patterns

Data Handling

  • Public Data Only: Scrape only publicly available articles
  • Rate Limiting: Respectful scraping with delays
  • Attribution: Clear marking of AI-generated content
  • Privacy: No personal data collection or storage

Model Safety

  • Content Filtering: Basic checks on generated content
  • Human Review: Emphasis on human oversight requirement
  • Educational Use: Clear guidelines for appropriate use
  • Transparency: Open documentation of training process