# System Patterns: Morris Bot Architecture
## System Architecture Overview
### High-Level Architecture
```
Data Collection → Preprocessing → Enhancement → Fine-tuning → Inference → Web Interface
       ↓               ↓              ↓             ↓            ↓             ↓
  scraper.py  →  preprocess.py →  enhance.py →  finetune.py →  model  →  app.py (Gradio)
```
### Core Components
1. **Data Pipeline**: Web scraping → JSON storage → Enhancement → Dataset preparation (see the orchestration sketch below)
2. **Enhancement Pipeline**: System prompt improvement → Non-telecom examples → Style optimization
3. **Training Pipeline**: Enhanced LoRA fine-tuning → Multiple checkpoints → Enhanced adapter storage
4. **Inference Pipeline**: Enhanced model loading → Style-aware generation → Response formatting
5. **User Interface**: Enhanced Gradio web app → Apple Silicon optimization → Real-time generation
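
The stages run as standalone scripts, each reading the previous stage's output from disk. A minimal sketch of chaining them end to end (the project runs them by hand; this just makes the ordering explicit):

```python
import subprocess

# Run each pipeline stage in order; each script reads the previous stage's
# output from disk, so a plain sequential invocation is enough.
for script in ["scraper.py", "preprocess.py", "enhance.py", "finetune.py"]:
    subprocess.run(["python", script], check=True)  # abort on the first failure
```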
## Key Technical Decisions
### Model Selection: Zephyr-7B-Beta
**Decision**: Use HuggingFaceH4/zephyr-7b-beta as base model
**Rationale**:
- Instruction-tuned, so it follows generation prompts more reliably
- No authentication required (unlike some Mistral variants)
- 7B parameters: good balance of capability vs. resource requirements
- Strong performance on text generation tasks
**Alternative Considered**: Direct Mistral-7B
**Why Rejected**: Zephyr's instruction-tuning provides better prompt adherence
### Fine-tuning Approach: LoRA (Low-Rank Adaptation)
**Decision**: Use LoRA instead of full fine-tuning
**Rationale**:
- **Memory Efficiency**: Only 0.58% of parameters trainable (42.5M vs 7.24B)
- **Hardware Compatibility**: Fits in 8GB RAM on Apple Silicon
- **Training Speed**: ~18 minutes vs. hours for full fine-tuning
- **Preservation**: Keeps base model knowledge while adding specialization
**Configuration**:
```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,              # rank: balance of efficiency vs. capacity
    lora_alpha=32,     # scaling factor
    lora_dropout=0.1,  # regularization
    # "all attention layers" on the Mistral-style architecture:
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```
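
Applying this config with peft is what yields the 0.58% trainable-parameter figure quoted above; a minimal sketch:

```python
from transformers import AutoModelForCausalLM
from peft import get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
model = get_peft_model(base_model, lora_config)  # lora_config from above
model.print_trainable_parameters()
# -> roughly: trainable params: 42.5M || all params: 7.24B || trainable%: 0.58
```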
### Hardware Optimization: Apple Silicon MPS
**Decision**: Optimize for Apple M1/M2/M3 chips with MPS backend
**Rationale**:
- **Target Hardware**: Many developers use MacBooks
- **Performance**: MPS provides significant acceleration over CPU
- **Memory**: Unified memory architecture is efficient for ML workloads
- **Accessibility**: Makes fine-tuning accessible without expensive GPUs
**Implementation Pattern**:
```python
import torch

# Automatic device detection
if torch.backends.mps.is_available():
    device = "mps"
    dtype = torch.float16  # memory-efficient half precision
elif torch.cuda.is_available():
    device = "cuda"
    dtype = torch.float16
else:
    device = "cpu"
    dtype = torch.float32  # full precision on CPU
```
## Design Patterns in Use
### Data Processing Pipeline Pattern
**Pattern**: ETL (Extract, Transform, Load) with validation
**Implementation** (a condensed sketch follows the list):
1. **Extract**: Web scraping with rate limiting and error handling
2. **Transform**: Text cleaning, format standardization, instruction formatting
3. **Load**: JSON storage with validation and dataset splitting
4. **Validate**: Content quality checks and format verification
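
The three stages condensed into one sketch, with hypothetical helper names and a placeholder cleaning step standing in for the real preprocessing logic:

```python
import json
import time
import requests

def extract(urls, delay=2.0):
    """Fetch pages with a fixed delay between requests (rate limiting)."""
    pages = []
    for url in urls:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        pages.append(resp.text)
        time.sleep(delay)  # be respectful to the source site
    return pages

def transform(raw_pages):
    """Normalize whitespace and wrap each article in instruction format."""
    return [{"instruction": "Write an article in the author's style.",
             "output": " ".join(page.split())}  # placeholder cleaning step
            for page in raw_pages]

def load(examples, path="data/train_dataset.json"):
    """Drop empty examples (validation), then persist as JSON."""
    valid = [ex for ex in examples if ex["output"].strip()]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(valid, f, indent=2)
```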
### Model Adapter Pattern
**Pattern**: Adapter pattern for model extensions
**Implementation** (sketched below):
- Base model remains unchanged
- LoRA adapters provide specialization
- Easy swapping between different fine-tuned versions
- Preserves ability to use base model capabilities
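
In peft terms this looks roughly like the following (the second adapter path is hypothetical):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

# Attach the fine-tuned adapter; the base weights stay frozen and untouched.
model = PeftModel.from_pretrained(base, "models/iain-morris-model-enhanced")

# Swapping specializations is just loading and activating another adapter.
model.load_adapter("models/some-other-adapter", adapter_name="other")  # hypothetical path
model.set_adapter("other")
```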
### Configuration Management Pattern
**Pattern**: Centralized configuration with environment-specific overrides
**Implementation**:
```python
# Training configuration centralized in finetune.py
TRAINING_CONFIG = {
    "learning_rate": 1e-4,
    "num_epochs": 2,
    "batch_size": 1,
    "gradient_accumulation_steps": 8,
}

# Hardware-specific overrides
if device == "mps":
    TRAINING_CONFIG["fp16"] = False  # fp16 training not supported on MPS
    TRAINING_CONFIG["dataloader_num_workers"] = 0
```
### Error Handling and Logging Pattern
**Pattern**: Comprehensive logging with graceful degradation
**Implementation** (sketched below):
- Structured logging to `morris_bot.log`
- Try-except blocks with informative error messages
- Fallback behaviors (CPU if MPS fails, etc.)
- Progress tracking during long operations
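
A minimal sketch of the pattern, assuming the log format shown here:

```python
import logging
import torch

logging.basicConfig(
    filename="morris_bot.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("morris_bot")

def pick_device():
    """Prefer MPS but degrade gracefully to CPU, logging the reason."""
    try:
        if torch.backends.mps.is_available():
            return "mps"
    except RuntimeError as exc:
        logger.warning("MPS check failed (%s); falling back to CPU", exc)
    return "cpu"
```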
## Component Relationships
### Enhanced Data Flow Architecture
```
Raw Articles → Enhanced Dataset → Style-Optimized Training → Enhanced Model
     ↓                ↓                      ↓                      ↓
  Raw JSON   →  Improved Prompts  →  Non-telecom Examples  → Enhanced LoRA Adapters
     ↓                ↓                      ↓                      ↓
  Original   → System Prompt Update → Topic Diversification → Multi-topic Capability
     ↓                ↓                      ↓                      ↓
Web Interface ← Enhanced Inference  ←    Enhanced Model     ← Style-Aware Training
```
### Enhanced Dependency Relationships
- **app.py** depends on enhanced model in `models/iain-morris-model-enhanced/`
- **finetune.py** depends on enhanced dataset in `data/enhanced_train_dataset.json`
- **update_system_prompt.py** enhances training data with improved style guidance
- **add_non_telecom_examples.py** expands dataset with topic diversity
- **test_enhanced_model.py** validates enhanced model performance
- **ENHANCEMENT_SUMMARY.md** documents all improvements and changes
### Enhanced Model Architecture
```
Base Model: Zephyr-7B-Beta (7.24B parameters)
        ↓
Enhanced LoRA Adapters (42.5M trainable parameters)
        ↓
Style-Aware Generation with:
- Doom-laden openings
- Cynical wit and expertise
- Signature phrases ("What could possibly go wrong?")
- Dark analogies and visceral metaphors
- British cynicism with parenthetical snark
- Multi-topic versatility (telecom + non-telecom)
```
### State Management
- **Model State**: Stored as LoRA adapter files (safetensors format)
- **Training State**: Checkpoints saved during training for recovery
- **Data State**: JSON files with versioning through filenames
- **Application State**: Stateless web interface, model loaded on demand
## Critical Implementation Paths
### Training Pipeline Critical Path
1. **Data Validation**: Ensure training examples meet quality standards
2. **Model Loading**: Base model download and initialization
3. **LoRA Setup**: Adapter configuration and parameter freezing
4. **Training Loop**: Gradient computation and adapter updates
5. **Checkpoint Saving**: Periodic saves for recovery
6. **Final Export**: Adapter weights saved for inference (see the sketch below)
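
Steps 4-6 map onto the Hugging Face Trainer roughly as follows; `model` and `train_dataset` are assumed to exist, and the checkpoint interval is illustrative:

```python
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    output_dir="models/iain-morris-model-enhanced",
    num_train_epochs=2,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=1e-4,
    save_steps=50,          # periodic checkpoints for recovery (interval assumed)
    save_total_limit=3,     # keep only the most recent checkpoints
)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()                          # pass resume_from_checkpoint=True to resume
model.save_pretrained(args.output_dir)   # final adapter export (safetensors)
```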
### Inference Pipeline Critical Path
1. **Model Loading**: Base model + LoRA adapter loading
2. **Prompt Formatting**: User input → instruction format
3. **Generation**: Model forward pass with sampling parameters
4. **Post-processing**: Clean output, format for display
5. **Response**: Return formatted article to user interface (see the sketch below)
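
Steps 2-4 condensed into one function; the prompt template follows Zephyr's chat format, and the sampling parameters are illustrative:

```python
def generate_article(model, tokenizer, topic, max_new_tokens=512):
    """Format the prompt, generate, and clean the output."""
    prompt = f"<|user|>\nWrite an article about {topic}.</s>\n<|assistant|>\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(
        **inputs, max_new_tokens=max_new_tokens,
        do_sample=True, temperature=0.8, top_p=0.9,  # illustrative sampling values
    )
    new_tokens = output[0][inputs["input_ids"].shape[1]:]  # drop the echoed prompt
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
```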
### Error Recovery Patterns
- **Training Interruption**: Resume from last checkpoint
- **Memory Overflow**: Reduce batch size, enable gradient checkpointing (sketched below)
- **Model Loading Failure**: Fallback to CPU, reduce precision
- **Generation Timeout**: Implement timeout with partial results
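
One way the memory-overflow recovery could look, assuming a hypothetical `make_trainer` factory that builds a Trainer from the two knobs being adjusted:

```python
def train_with_recovery(make_trainer, batch_size=1):
    """Retry once with a smaller memory footprint if training hits OOM."""
    try:
        make_trainer(batch_size, gradient_checkpointing=False).train()
    except RuntimeError as exc:
        if "out of memory" not in str(exc).lower():
            raise
        # Halve the batch size and trade compute for memory on the retry.
        make_trainer(max(1, batch_size // 2), gradient_checkpointing=True).train()
```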
## Performance Optimization Patterns
### Memory Management
- **Gradient Accumulation**: Simulate larger batch sizes without memory increase
- **Mixed Precision**: float16 where supported for memory efficiency
- **Model Sharding**: LoRA adapters separate from base model
- **Garbage Collection**: Explicit cleanup after training steps (sketched below)
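
The explicit-cleanup step can be as simple as the following helper (the cache-emptying calls require a recent PyTorch):

```python
import gc
import torch

def free_memory():
    """Release Python garbage and cached MPS/CUDA tensors between heavy steps."""
    gc.collect()
    if torch.backends.mps.is_available():
        torch.mps.empty_cache()
    elif torch.cuda.is_available():
        torch.cuda.empty_cache()
```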
### Compute Optimization
- **Hardware Detection**: Automatic selection of best available device
- **Batch Processing**: Process multiple examples efficiently
- **Caching**: Tokenized datasets cached for repeated training runs (sketched below)
- **Parallel Processing**: Multi-threading where beneficial
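
The caching point comes for free with the `datasets` library: `.map()` results are written to an on-disk cache keyed by a hash of the function, so repeated runs skip re-tokenization. A sketch, assuming a `tokenizer` is already loaded:

```python
from datasets import load_dataset

ds = load_dataset("json", data_files="data/enhanced_train_dataset.json", split="train")

# Cached on disk after the first run; subsequent runs reuse the result.
tokenized = ds.map(lambda ex: tokenizer(ex["output"], truncation=True), batched=True)
```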
### User Experience Optimization
- **Lazy Loading**: Model loaded only when needed (sketched below)
- **Progress Indicators**: Real-time feedback during long operations
- **Parameter Validation**: Input validation before expensive operations
- **Responsive Interface**: Non-blocking UI during generation
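
Lazy loading in the Gradio app can be a module-level cache; `load_enhanced_model` is a hypothetical stand-in for the project's actual loader, and `generate_article` is the inference sketch above:

```python
import gradio as gr

_cache = None

def get_model():
    """Load the model and tokenizer on the first request, not at app start-up."""
    global _cache
    if _cache is None:
        _cache = load_enhanced_model()  # hypothetical loader returning (model, tokenizer)
    return _cache

def generate(topic):
    model, tokenizer = get_model()
    return generate_article(model, tokenizer, topic)  # from the inference sketch above

gr.Interface(fn=generate, inputs="text", outputs="text").launch()
```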
## Scalability Considerations
### Current Limitations
- **Single Model**: Only one fine-tuned model at a time
- **Local Deployment**: No distributed inference capability
- **Memory Bound**: Limited by single machine memory
- **Sequential Processing**: One generation request at a time
### Future Scalability Patterns
- **Model Versioning**: Support multiple LoRA adapters
- **Distributed Inference**: Model serving across multiple devices
- **Batch Generation**: Process multiple requests simultaneously
- **Cloud Deployment**: Container-based scaling patterns
## Security and Ethics Patterns
### Data Handling
- **Public Data Only**: Scrape only publicly available articles
- **Rate Limiting**: Respectful scraping with delays
- **Attribution**: Clear marking of AI-generated content
- **Privacy**: No personal data collection or storage
### Model Safety
- **Content Filtering**: Basic checks on generated content
- **Human Review**: Emphasis on human oversight requirement
- **Educational Use**: Clear guidelines for appropriate use
- **Transparency**: Open documentation of training process