# Backend Code Generation Model - Setup & Usage Guide
## 🛠️ Installation & Setup
### 1. Install Dependencies
```bash
pip install torch transformers datasets pandas numpy aiohttp requests
pip install accelerate # For faster training
```
### 2. Set Environment Variables
```bash
# Optional: GitHub token for collecting real repositories
export GITHUB_TOKEN="your_github_token_here"
# For GPU training (if available)
export CUDA_VISIBLE_DEVICES=0
```
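In Python, the pipeline can pick the token up at runtime instead of hard-coding it in the config. A minimal sketch (reading it via `os.environ` is an assumed convention, not something the pipeline requires):
```python
import os

# Read the optional GitHub token from the environment; stays None if unset.
# The GitHub collection step requires a token, so guard accordingly.
config = {
    'github_token': os.environ.get('GITHUB_TOKEN'),
}
```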
### 3. Directory Structure
```
backend-ai-trainer/
├── training_pipeline.py        # Main pipeline code
├── data/
│   ├── raw_dataset.json        # Collected training data
│   └── processed/              # Preprocessed data
├── models/
│   ├── backend_code_model/     # Trained model output
│   └── checkpoints/            # Training checkpoints
└── evaluation/
    ├── test_cases.json         # Test scenarios
    └── results/                # Evaluation results
```
## 🏃‍♂️ Quick Start
### Option A: Full Automated Pipeline
```python
import asyncio
from training_pipeline import TrainingPipeline
config = {
'base_model': 'microsoft/DialoGPT-medium',
'output_dir': './models/backend_code_model',
'github_token': 'your_token_here', # Optional
}
pipeline = TrainingPipeline(config)
asyncio.run(pipeline.run_full_pipeline())
```
### Option B: Step-by-Step Execution
#### Step 1: Collect Training Data
```python
from training_pipeline import DataCollector
import asyncio
collector = DataCollector()
# Collect from GitHub (requires token)
github_queries = [
'express api backend',
'fastapi python backend',
'django rest api',
'nodejs backend server',
'flask api backend'
]
asyncio.run(collector.collect_github_repositories(github_queries, max_repos=100))
# Generate synthetic examples
collector.generate_synthetic_examples(count=500)
# Save dataset
collector.save_dataset('training_data.json')
```
#### Step 2: Preprocess Data
```python
from training_pipeline import DataPreprocessor
preprocessor = DataPreprocessor()
processed_examples = preprocessor.preprocess_examples(collector.collected_examples)
training_dataset = preprocessor.create_training_dataset(processed_examples)
print(f"Created dataset with {len(training_dataset)} examples")
```
#### Step 3: Train Model
```python
from training_pipeline import CodeGenerationModel
model = CodeGenerationModel('microsoft/DialoGPT-medium')
model.fine_tune(training_dataset, output_dir='./trained_model')
```
#### Step 4: Generate Code
```python
# Generate a complete backend application
generated_code = model.generate_code(
description="E-commerce API with user authentication and product management",
framework="fastapi",
language="python"
)
print("Generated Backend Application:")
print("=" * 50)
print(generated_code)
```
## 🎯 Training Configuration Options
### Model Selection
```python
# Lightweight for testing
config['base_model'] = 'microsoft/DialoGPT-small'
# Balanced performance
config['base_model'] = 'microsoft/DialoGPT-medium'
# High quality (requires more resources)
config['base_model'] = 'microsoft/DialoGPT-large'
```
### Training Parameters
```python
training_config = {
'num_epochs': 5, # More epochs can improve fit, but too many overfit
'batch_size': 4, # Adjust based on GPU memory
'learning_rate': 5e-5, # Conservative learning rate
'max_length': 2048, # Maximum token length
'warmup_steps': 500, # Learning rate warmup
'save_steps': 1000, # Checkpoint frequency
}
```
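One way to apply these settings is to forward them into the fine-tuning call. This assumes `fine_tune` accepts them as keyword arguments, which is a guess; check the actual signature in `training_pipeline.py`:
```python
from training_pipeline import CodeGenerationModel

model = CodeGenerationModel('microsoft/DialoGPT-medium')
# Hypothetical: forwards num_epochs, batch_size, etc. to the trainer.
model.fine_tune(
    training_dataset,
    output_dir='./models/backend_code_model',
    **training_config,
)
```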
### Framework Coverage
The pipeline supports these backend frameworks (a registry sketch follows the list):
**Node.js Frameworks:**
- Express.js - Most popular Node.js framework
- NestJS - Enterprise-grade framework
- Koa.js - Lightweight alternative
**Python Frameworks:**
- FastAPI - Modern, high-performance API framework
- Django - Full-featured web framework
- Flask - Lightweight and flexible
**Go Frameworks:**
- Gin - HTTP web framework
- Fiber - Express-inspired framework
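One convenient way to keep generation requests consistent with this list is a framework-to-language registry. The sketch below is a hypothetical helper, not the pipeline's actual internals:
```python
# Hypothetical registry: framework name -> implementation language.
SUPPORTED_FRAMEWORKS = {
    'express': 'javascript', 'nestjs': 'typescript', 'koa': 'javascript',
    'fastapi': 'python', 'django': 'python', 'flask': 'python',
    'gin': 'go', 'fiber': 'go',
}

def validate_request(framework: str, language: str) -> None:
    """Reject framework/language pairs the pipeline does not cover."""
    if SUPPORTED_FRAMEWORKS.get(framework) != language:
        raise ValueError(f"Unsupported combination: {framework}/{language}")
```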
## 📊 Evaluation & Testing
### Automatic Quality Assessment
```python
from training_pipeline import ModelEvaluator
evaluator = ModelEvaluator()
# Test specific code generation
generated_code = model.generate_code(
description="User authentication API with JWT tokens",
framework="express",
language="javascript"
)
# Get quality scores
quality_scores = evaluator.evaluate_code_quality(generated_code, "javascript")
print(f"Syntax Correctness: {quality_scores['syntax_correctness']:.2f}")
print(f"Completeness: {quality_scores['completeness']:.2f}")
print(f"Best Practices: {quality_scores['best_practices']:.2f}")
```
### Comprehensive Benchmarking
```python
test_cases = [
{
'description': 'REST API for task management with user authentication',
'framework': 'express',
'language': 'javascript'
},
{
'description': 'GraphQL API for social media platform',
'framework': 'fastapi',
'language': 'python'
},
{
'description': 'Microservice for payment processing',
'framework': 'gin',
'language': 'go'
}
]
benchmark_results = evaluator.benchmark_model(model, test_cases)
print("Overall Performance:", benchmark_results)
```
## 🚀 Advanced Usage
### Custom Data Sources
```python
# Add your own training examples
from training_pipeline import CodeExample
custom_examples = [
{
'description': 'Custom API requirement',
'requirements': ['Custom feature 1', 'Custom feature 2'],
'framework': 'fastapi',
'language': 'python',
'code_files': {
'main.py': '# Your custom code here',
'requirements.txt': 'fastapi\nuvicorn'
}
}
]
# Add to training data
collector.collected_examples.extend([CodeExample(**ex) for ex in custom_examples])
```
### Fine-tuning on Specific Domains
```python
# Focus training on specific application types
domain_specific_queries = [
'microservices architecture',
'api gateway implementation',
'database orm integration',
'authentication middleware',
'rate limiting api'
]
asyncio.run(collector.collect_github_repositories(domain_specific_queries))
```
### Export Trained Model
```python
# Save model for deployment
model.model.save_pretrained('./production_model')
model.tokenizer.save_pretrained('./production_model')
# Load for inference
from transformers import AutoModelForCausalLM, AutoTokenizer
production_model = AutoModelForCausalLM.from_pretrained('./production_model')
production_tokenizer = AutoTokenizer.from_pretrained('./production_model')
```
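The exported model can then generate directly through the standard `transformers` generation API. A minimal sketch, assuming a `Description/Framework/Language` prompt format similar to what the preprocessor builds (verify against your actual training format):
```python
prompt = (
    "Description: REST API for notes with CRUD endpoints\n"
    "Framework: fastapi\n"
    "Language: python\n"
    "Code:\n"
)
inputs = production_tokenizer(prompt, return_tensors='pt')
outputs = production_model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    top_p=0.95,
    pad_token_id=production_tokenizer.eos_token_id,  # DialoGPT defines no pad token
)
print(production_tokenizer.decode(outputs[0], skip_special_tokens=True))
```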
## 🔧 Troubleshooting
### Common Issues
**1. Out of Memory Errors**
```python
# Reduce batch size
config['per_device_train_batch_size'] = 1
config['gradient_accumulation_steps'] = 4
# Use gradient checkpointing
config['gradient_checkpointing'] = True
```
**2. Slow Training**
```python
# Enable mixed precision (if GPU supports it)
config['fp16'] = True
# Parallelize data loading with multiple workers
config['dataloader_num_workers'] = 4
```
**3. Poor Code Quality**
```python
# Increase training data diversity
collector.generate_synthetic_examples(count=1000)
# Extend training duration
config['num_train_epochs'] = 10
```
### Performance Optimization
**For CPU Training:**
```python
config['dataloader_pin_memory'] = False
config['per_device_train_batch_size'] = 1
```
**For GPU Training:**
```python
config['fp16'] = True
config['dataloader_pin_memory'] = True
config['per_device_train_batch_size'] = 4
```
## 📈 Expected Results
After training on ~500-1000 examples, you should expect:
- **Syntax Correctness**: 85-95%
- **Code Completeness**: 80-90%
- **Best Practices**: 70-85%
- **Framework Coverage**: the Node.js, Python, and Go frameworks listed above
- **Generation Speed**: 2-5 seconds per application
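These targets can double as a regression gate after each training run. A small sketch, assuming `benchmark_model` returns averaged scores under the same keys as `evaluate_code_quality` (an assumption; check your evaluator's output):
```python
# Hypothetical quality gate: fail the run if averaged scores miss the targets.
THRESHOLDS = {
    'syntax_correctness': 0.85,
    'completeness': 0.80,
    'best_practices': 0.70,
}

def check_quality(results: dict) -> None:
    for metric, minimum in THRESHOLDS.items():
        score = results.get(metric, 0.0)
        if score < minimum:
            raise RuntimeError(f"{metric} = {score:.2f}, below target {minimum:.2f}")

check_quality(benchmark_results)  # from the benchmarking step above
```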
## 🔄 Continuous Improvement
### Regular Retraining
```python
# Schedule weekly data collection
import asyncio, time
import schedule

def update_training_data():
    asyncio.run(collector.collect_github_repositories(['new backend trends']))

# `schedule` only registers jobs; a loop must drive them
schedule.every().week.do(update_training_data)
while True:
    schedule.run_pending()
    time.sleep(3600)  # wake hourly to check for due jobs
```
### A/B Testing Different Models
```python
models_to_compare = [
'microsoft/DialoGPT-medium',
'microsoft/DialoGPT-large',
'gpt2-medium'
]
for base_model in models_to_compare:
model = CodeGenerationModel(base_model)
results = evaluator.benchmark_model(model, test_cases)
print(f"{base_model}: {results}")
```
## 🎯 Next Steps
1. **Start Small**: Begin with synthetic data and 100-200 examples
2. **Add Real Data**: Integrate GitHub repositories gradually
3. **Evaluate Regularly**: Monitor quality metrics after each training session
4. **Expand Frameworks**: Add support for new frameworks as needed
5. **Production Deploy**: Export the model for API deployment (see the serving sketch below)
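For step 5, here is a minimal serving sketch that wraps the trained model in a FastAPI endpoint. The route, request schema, and reuse of the `model` object from the training steps are illustrative choices, not a prescribed deployment:
```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    description: str
    framework: str = 'fastapi'
    language: str = 'python'

@app.post('/generate')
def generate(req: GenerateRequest) -> dict:
    # `model` is the trained CodeGenerationModel from the steps above.
    code = model.generate_code(
        description=req.description,
        framework=req.framework,
        language=req.language,
    )
    return {'code': code}

# Run with: uvicorn serve:app --port 8000  (assuming this file is serve.py)
```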
This pipeline provides a complete foundation for building your own backend code generation AI. The modular design allows you to customize and extend each component based on your specific needs.