System Patterns: Morris Bot Architecture
System Architecture Overview
High-Level Architecture
Data Collection → Preprocessing → Enhancement → Fine-tuning → Inference → Web Interface
       ↓               ↓              ↓              ↓            ↓            ↓
   scraper.py     preprocess.py   enhance.py     finetune.py    model     app.py (Gradio)
Core Components
- Data Pipeline: Web scraping → JSON storage → Enhancement → Dataset preparation
- Enhancement Pipeline: System prompt improvement → Non-telecom examples → Style optimization
- Training Pipeline: Enhanced LoRA fine-tuning → Multiple checkpoints → Enhanced adapter storage
- Inference Pipeline: Enhanced model loading → Style-aware generation → Response formatting
- User Interface: Enhanced Gradio web app → Apple Silicon optimization → Real-time generation
Key Technical Decisions
Model Selection: Zephyr-7B-Beta
Decision: Use HuggingFaceH4/zephyr-7b-beta as the base model.
Rationale:
- Instruction-tuned for better following of generation prompts
- No authentication required (unlike some Mistral variants)
- 7B parameters: Good balance of capability vs. resource requirements
- Strong performance on text generation tasks
Alternative Considered: Direct Mistral-7B
Why Rejected: Zephyr's instruction-tuning provides better prompt adherence.
Fine-tuning Approach: LoRA (Low-Rank Adaptation)
Decision: Use LoRA instead of full fine-tuning.
Rationale:
- Memory Efficiency: Only 0.58% of parameters trainable (42.5M vs 7.24B)
- Hardware Compatibility: Fits in 8GB RAM on Apple Silicon
- Training Speed: ~18 minutes vs hours for full fine-tuning
- Preservation: Keeps base model knowledge while adding specialization
Configuration:
LoRA Parameters:
- rank: 16 (balance of efficiency vs capacity)
- alpha: 32 (scaling factor)
- dropout: 0.1 (regularization)
- target_modules: All attention layers
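For reference, a minimal sketch of how these parameters map onto a peft LoraConfig. The target_modules names follow Mistral/Zephyr attention-layer conventions and are an assumption rather than a copy of finetune.py.

# Sketch: expressing the LoRA parameters above with the peft library.
# target_modules names are assumed (Mistral/Zephyr-style attention projections).
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                 # rank: balance of efficiency vs capacity
    lora_alpha=32,        # scaling factor
    lora_dropout=0.1,     # regularization
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention layers
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # ~42.5M of 7.24B parameters (~0.58%) trainable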
Hardware Optimization: Apple Silicon MPS
Decision: Optimize for Apple M1/M2/M3 chips with the MPS backend.
Rationale:
- Target Hardware: Many developers use MacBooks
- Performance: MPS provides significant acceleration over CPU
- Memory: Unified memory architecture efficient for ML workloads
- Accessibility: Makes fine-tuning accessible without expensive GPUs
Implementation Pattern:
# Automatic device detection
import torch

if torch.backends.mps.is_available():
    device = "mps"
    dtype = torch.float16  # Memory efficient
elif torch.cuda.is_available():
    device = "cuda"
    dtype = torch.float16
else:
    device = "cpu"
    dtype = torch.float32
Design Patterns in Use
Data Processing Pipeline Pattern
Pattern: ETL (Extract, Transform, Load) with validation
Implementation:
- Extract: Web scraping with rate limiting and error handling
- Transform: Text cleaning, format standardization, instruction formatting
- Load: JSON storage with validation and dataset splitting
- Validate: Content quality checks and format verification
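A minimal sketch of these stages under illustrative assumptions: the function names, paths, and the trivial transform step are not taken from scraper.py or preprocess.py.

# Sketch of the ETL flow; function names and paths are illustrative.
import json
import time
import requests

def extract(urls, delay_s=2.0):
    """Extract: fetch pages with rate limiting and per-URL error handling."""
    pages = []
    for url in urls:
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
            pages.append({"url": url, "html": resp.text})
        except requests.RequestException as exc:
            print(f"skipping {url}: {exc}")
        time.sleep(delay_s)  # respectful scraping
    return pages

def transform(pages):
    """Transform: clean text and wrap it in instruction format."""
    examples = []
    for page in pages:
        text = page["html"]  # real code would strip HTML and normalize whitespace
        examples.append({"instruction": "Write an article in this style.", "output": text})
    return examples

def load(examples, path="data/train_dataset.json"):
    """Load + validate: drop empty examples, then persist as JSON."""
    valid = [ex for ex in examples if ex["output"].strip()]
    with open(path, "w") as f:
        json.dump(valid, f, indent=2)
    return len(valid)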
Model Adapter Pattern
Pattern: Adapter pattern for model extensions
Implementation:
- Base model remains unchanged
- LoRA adapters provide specialization
- Easy swapping between different fine-tuned versions
- Preserves ability to use base model capabilities
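A hedged sketch of this pattern with peft: the enhanced adapter path matches the directory listed under dependency relationships below, while the alternate "v1" path is purely hypothetical.

# Sketch: base model stays frozen; specialization lives in swappable LoRA adapters.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "HuggingFaceH4/zephyr-7b-beta"

tokenizer = AutoTokenizer.from_pretrained(BASE)
base_model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.float16)

# Attach the fine-tuned adapter...
model = PeftModel.from_pretrained(base_model, "models/iain-morris-model-enhanced")

# ...or swap in another version without touching the base weights (hypothetical path):
# model = PeftModel.from_pretrained(base_model, "models/iain-morris-model-v1")

# The base model's capabilities remain available by disabling the adapter.
with model.disable_adapter():
    pass  # generation inside this block uses the unmodified Zephyr weights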
Configuration Management Pattern
Pattern: Centralized configuration with environment-specific overrides
Implementation:
# Training configuration centralized in finetune.py
TRAINING_CONFIG = {
    "learning_rate": 1e-4,
    "num_epochs": 2,
    "batch_size": 1,
    "gradient_accumulation_steps": 8,
}

# Hardware-specific overrides
if device == "mps":
    TRAINING_CONFIG["fp16"] = False  # fp16 training not supported on MPS
    TRAINING_CONFIG["dataloader_num_workers"] = 0
Error Handling and Logging Pattern
Pattern: Comprehensive logging with graceful degradation
Implementation:
- Structured logging to morris_bot.log
- Try-catch blocks with informative error messages
- Fallback behaviors (CPU if MPS fails, etc.)
- Progress tracking during long operations
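A sketch of the logging-plus-fallback pattern using the standard library; only the morris_bot.log filename comes from the list above, the rest of the configuration is assumed.

# Sketch: structured logging and graceful device fallback.
import logging
import torch

logging.basicConfig(
    filename="morris_bot.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("morris_bot")

def select_device():
    """Prefer MPS, fall back to CUDA, then CPU, logging each decision."""
    try:
        if torch.backends.mps.is_available():
            logger.info("Using MPS backend")
            return "mps"
    except Exception as exc:  # graceful degradation if MPS probing fails
        logger.warning("MPS check failed, falling back: %s", exc)
    if torch.cuda.is_available():
        logger.info("Using CUDA backend")
        return "cuda"
    logger.info("Using CPU")
    return "cpu"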
Component Relationships
Enhanced Data Flow Architecture
Raw Articles → Enhanced Dataset → Style-Optimized Training → Enhanced Model
      ↓               ↓                     ↓                      ↓
Raw JSON → Improved Prompts → Non-telecom Examples → Enhanced LoRA Adapters
      ↓               ↓                     ↓                      ↓
Original → System Prompt Update → Topic Diversification → Multi-topic Capability
      ↓               ↓                     ↓                      ↓
Web Interface ← Enhanced Inference ← Enhanced Model ← Style-Aware Training
Enhanced Dependency Relationships
- app.py depends on the enhanced model in models/iain-morris-model-enhanced/
- finetune.py depends on the enhanced dataset in data/enhanced_train_dataset.json
- update_system_prompt.py enhances training data with improved style guidance
- add_non_telecom_examples.py expands dataset with topic diversity
- test_enhanced_model.py validates enhanced model performance
- ENHANCEMENT_SUMMARY.md documents all improvements and changes
Enhanced Model Architecture
Base Model: Zephyr-7B-Beta (7.24B parameters)
        ↓
Enhanced LoRA Adapters (42.5M trainable parameters)
        ↓
Style-Aware Generation with:
- Doom-laden openings
- Cynical wit and expertise
- Signature phrases ("What could possibly go wrong?")
- Dark analogies and visceral metaphors
- British cynicism with parenthetical snark
- Multi-topic versatility (telecom + non-telecom)
State Management
- Model State: Stored as LoRA adapter files (safetensors format)
- Training State: Checkpoints saved during training for recovery
- Data State: JSON files with versioning through filenames
- Application State: Stateless web interface, model loaded on demand
Critical Implementation Paths
Training Pipeline Critical Path
- Data Validation: Ensure training examples meet quality standards
- Model Loading: Base model download and initialization
- LoRA Setup: Adapter configuration and parameter freezing
- Training Loop: Gradient computation and adapter updates
- Checkpoint Saving: Periodic saves for recovery
- Final Export: Adapter weights saved for inference
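A sketch of this path using transformers.Trainer: the numeric values echo TRAINING_CONFIG above and the output directory matches the adapter path used elsewhere, while the save/logging settings are illustrative, not read from finetune.py.

# Sketch of the training critical path with periodic checkpointing and final export.
from transformers import Trainer, TrainingArguments

def train_adapter(model, train_dataset, output_dir="models/iain-morris-model-enhanced"):
    """model: LoRA-wrapped base model; train_dataset: tokenized dataset."""
    args = TrainingArguments(
        output_dir=output_dir,
        learning_rate=1e-4,
        num_train_epochs=2,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        save_steps=50,            # periodic checkpoints for recovery
        save_total_limit=3,
        logging_steps=10,
    )
    trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
    trainer.train()               # or trainer.train(resume_from_checkpoint=True) to resume
    model.save_pretrained(output_dir)   # final export: adapter weights only (safetensors)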
Inference Pipeline Critical Path
- Model Loading: Base model + LoRA adapter loading
- Prompt Formatting: User input → instruction format
- Generation: Model forward pass with sampling parameters
- Post-processing: Clean output, format for display
- Response: Return formatted article to user interface
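A sketch of the same steps in code; the chat-template call follows Zephyr's standard format, and the system prompt and sampling parameters are illustrative defaults rather than values read from app.py.

# Sketch of the inference critical path: format, generate, post-process.
def generate_article(topic, model, tokenizer, device="mps", max_new_tokens=512):
    # Prompt formatting: user input -> instruction format via the chat template
    messages = [
        {"role": "system", "content": "You write in the style of a cynical telecom columnist."},
        {"role": "user", "content": f"Write an article about: {topic}"},
    ]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # Generation: forward pass with sampling parameters
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,
    )

    # Post-processing: strip the prompt, return only the generated article text
    generated = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(generated, skip_special_tokens=True).strip()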
Error Recovery Patterns
- Training Interruption: Resume from last checkpoint
- Memory Overflow: Reduce batch size, enable gradient checkpointing
- Model Loading Failure: Fallback to CPU, reduce precision
- Generation Timeout: Implement timeout with partial results
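A minimal sketch of the model-loading fallback from the list above; the retry order and dtype choices are assumptions, not taken from the project code.

# Sketch: try the preferred device/precision first, degrade gracefully on failure.
import torch
from transformers import AutoModelForCausalLM

def load_with_fallback(model_name):
    attempts = [("mps", torch.float16), ("cpu", torch.float32)]
    for device, dtype in attempts:
        try:
            model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=dtype)
            return model.to(device), device
        except (RuntimeError, MemoryError) as exc:  # e.g. out-of-memory on preferred device
            print(f"loading on {device} failed: {exc}")
    raise RuntimeError("could not load model on any device")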
Performance Optimization Patterns
Memory Management
- Gradient Accumulation: Simulate larger batch sizes without memory increase
- Mixed Precision: float16 where supported for memory efficiency
- Model Sharding: LoRA adapters separate from base model
- Garbage Collection: Explicit cleanup after training steps
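A sketch of the explicit cleanup step; which empty_cache call applies depends on the backend, hence the feature checks.

# Sketch: release Python and backend allocator memory between steps.
import gc
import torch

def free_memory():
    gc.collect()
    if torch.backends.mps.is_available():
        torch.mps.empty_cache()    # release MPS-cached allocations
    elif torch.cuda.is_available():
        torch.cuda.empty_cache()   # release CUDA-cached allocations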
Compute Optimization
- Hardware Detection: Automatic selection of best available device
- Batch Processing: Process multiple examples efficiently
- Caching: Tokenized datasets cached for repeated training runs
- Parallel Processing: Multi-threading where beneficial
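A sketch of caching tokenized data between runs with the datasets library; the cache path, column name, and max length are illustrative assumptions.

# Sketch: tokenize once, reuse the cached dataset on subsequent training runs.
import os
from datasets import Dataset, load_from_disk

CACHE_DIR = "data/tokenized_cache"

def get_tokenized_dataset(raw_examples, tokenizer):
    if os.path.isdir(CACHE_DIR):
        return load_from_disk(CACHE_DIR)    # reuse previously tokenized data
    ds = Dataset.from_list(raw_examples)
    ds = ds.map(lambda ex: tokenizer(ex["output"], truncation=True, max_length=2048))
    ds.save_to_disk(CACHE_DIR)
    return ds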
User Experience Optimization
- Lazy Loading: Model loaded only when needed
- Progress Indicators: Real-time feedback during long operations
- Parameter Validation: Input validation before expensive operations
- Responsive Interface: Non-blocking UI during generation
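A sketch of lazy loading and progress feedback in Gradio; load_model_and_tokenizer is a hypothetical loader, and generate_article refers to the inference sketch above.

# Sketch: load the model on first request only, and report progress during generation.
import gradio as gr

_model_bundle = None  # (model, tokenizer), populated on first use

def get_model():
    global _model_bundle
    if _model_bundle is None:
        _model_bundle = load_model_and_tokenizer()  # hypothetical loader
    return _model_bundle

def generate(topic, progress=gr.Progress()):
    progress(0.1, desc="Loading model")
    model, tokenizer = get_model()
    progress(0.5, desc="Generating article")
    return generate_article(topic, model, tokenizer)

demo = gr.Interface(fn=generate, inputs=gr.Textbox(label="Topic"), outputs=gr.Textbox(label="Article"))
demo.launch()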
Scalability Considerations
Current Limitations
- Single Model: Only one fine-tuned model at a time
- Local Deployment: No distributed inference capability
- Memory Bound: Limited by single machine memory
- Sequential Processing: One generation request at a time
Future Scalability Patterns
- Model Versioning: Support multiple LoRA adapters
- Distributed Inference: Model serving across multiple devices
- Batch Generation: Process multiple requests simultaneously
- Cloud Deployment: Container-based scaling patterns
Security and Ethics Patterns
Data Handling
- Public Data Only: Scrape only publicly available articles
- Rate Limiting: Respectful scraping with delays
- Attribution: Clear marking of AI-generated content
- Privacy: No personal data collection or storage
Model Safety
- Content Filtering: Basic checks on generated content
- Human Review: Emphasis on human oversight requirement
- Educational Use: Clear guidelines for appropriate use
- Transparency: Open documentation of training process