|
# Backend Code Generation Model - Setup & Usage Guide |
|
|
|
## 🛠️ Installation & Setup
|
|
|
### 1. Install Dependencies |
|
```bash
pip install torch transformers datasets pandas numpy aiohttp requests
pip install accelerate  # For faster training
```
|
|
|
### 2. Set Environment Variables |
|
```bash
# Optional: GitHub token for collecting real repositories
export GITHUB_TOKEN="your_github_token_here"

# For GPU training (if available)
export CUDA_VISIBLE_DEVICES=0
```
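
If you prefer configuring these from Python (e.g., in a notebook), the same values can go through `os.environ`. Whether the pipeline actually falls back to the `GITHUB_TOKEN` environment variable when no token is passed in the config is an assumption; check `training_pipeline.py` to confirm.

```python
import os

# Assumption: the pipeline reads GITHUB_TOKEN from the environment
# when no github_token is supplied in the config dict.
os.environ["GITHUB_TOKEN"] = "your_github_token_here"

# Pin training to the first GPU (set before any CUDA initialization)
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```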
|
|
|
### 3. Directory Structure |
|
```
backend-ai-trainer/
├── training_pipeline.py        # Main pipeline code
├── data/
│   ├── raw_dataset.json        # Collected training data
│   └── processed/              # Preprocessed data
├── models/
│   ├── backend_code_model/     # Trained model output
│   └── checkpoints/            # Training checkpoints
└── evaluation/
    ├── test_cases.json         # Test scenarios
    └── results/                # Evaluation results
```
|
|
|
## 🏃‍♂️ Quick Start
|
|
|
### Option A: Full Automated Pipeline |
|
```python
import asyncio
from training_pipeline import TrainingPipeline

config = {
    'base_model': 'microsoft/DialoGPT-medium',
    'output_dir': './models/backend_code_model',
    'github_token': 'your_token_here',  # Optional
}

pipeline = TrainingPipeline(config)
asyncio.run(pipeline.run_full_pipeline())
```
|
|
|
### Option B: Step-by-Step Execution |
|
|
|
#### Step 1: Collect Training Data |
|
```python
import asyncio
from training_pipeline import DataCollector

collector = DataCollector()

# Collect from GitHub (requires token)
github_queries = [
    'express api backend',
    'fastapi python backend',
    'django rest api',
    'nodejs backend server',
    'flask api backend',
]
asyncio.run(collector.collect_github_repositories(github_queries, max_repos=100))

# Generate synthetic examples
collector.generate_synthetic_examples(count=500)

# Save dataset
collector.save_dataset('training_data.json')
```
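
Before moving on, it is worth sanity-checking what was collected. This sketch assumes `collected_examples` holds objects with a `description` attribute, as the custom-data example later in this guide suggests:

```python
print(f"Collected {len(collector.collected_examples)} examples")

# Peek at the first few descriptions to spot junk early
for example in collector.collected_examples[:3]:
    print("-", example.description)
```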
|
|
|
#### Step 2: Preprocess Data |
|
```python
from training_pipeline import DataPreprocessor

preprocessor = DataPreprocessor()
processed_examples = preprocessor.preprocess_examples(collector.collected_examples)
training_dataset = preprocessor.create_training_dataset(processed_examples)

print(f"Created dataset with {len(training_dataset)} examples")
```
|
|
|
#### Step 3: Train Model |
|
```python
from training_pipeline import CodeGenerationModel

model = CodeGenerationModel('microsoft/DialoGPT-medium')
model.fine_tune(training_dataset, output_dir='./trained_model')
```
|
|
|
#### Step 4: Generate Code |
|
```python
# Generate a complete backend application
generated_code = model.generate_code(
    description="E-commerce API with user authentication and product management",
    framework="fastapi",
    language="python"
)

print("Generated Backend Application:")
print("=" * 50)
print(generated_code)
```
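
To keep the output around for inspection, you can write it to disk; the location used here is just an example, not something the pipeline requires:

```python
from pathlib import Path

out = Path("generated/app.py")  # example path
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(generated_code)
print(f"Wrote {len(generated_code)} characters to {out}")
```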
|
|
|
## 🎯 Training Configuration Options
|
|
|
### Model Selection |
|
```python
# Lightweight for testing
config['base_model'] = 'microsoft/DialoGPT-small'

# Balanced performance
config['base_model'] = 'microsoft/DialoGPT-medium'

# High quality (requires more resources)
config['base_model'] = 'microsoft/DialoGPT-large'
```
|
|
|
### Training Parameters |
|
```python
training_config = {
    'num_epochs': 5,          # More epochs can improve fit but risk overfitting
    'batch_size': 4,          # Adjust based on GPU memory
    'learning_rate': 5e-5,    # Conservative learning rate for fine-tuning
    'max_length': 2048,       # Maximum token length per example
    'warmup_steps': 500,      # Learning rate warmup
    'save_steps': 1000,       # Checkpoint frequency
}
```
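
If `fine_tune` wraps Hugging Face's `Trainer` (an assumption about the pipeline's internals), these keys map directly onto `TrainingArguments`. Note that `max_length` is a tokenization setting, not a `TrainingArguments` field:

```python
from transformers import TrainingArguments

# Sketch of how the config above might translate; adjust key names to
# whatever fine_tune actually accepts.
args = TrainingArguments(
    output_dir="./models/checkpoints",
    num_train_epochs=training_config["num_epochs"],
    per_device_train_batch_size=training_config["batch_size"],
    learning_rate=training_config["learning_rate"],
    warmup_steps=training_config["warmup_steps"],
    save_steps=training_config["save_steps"],
)
```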
|
|
|
### Framework Coverage |
|
The pipeline supports these backend frameworks (a helper that pairs each one with its `generate_code` language argument follows the lists):
|
|
|
**Node.js Frameworks:** |
|
- Express.js - Most popular Node.js framework |
|
- NestJS - Enterprise-grade framework |
|
- Koa.js - Lightweight alternative |
|
|
|
**Python Frameworks:** |
|
- FastAPI - Modern, high-performance API framework |
|
- Django - Full-featured web framework |
|
- Flask - Lightweight and flexible |
|
|
|
**Go Frameworks:** |
|
- Gin - HTTP web framework |
|
- Fiber - Express-inspired framework |
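
The exact framework identifiers `generate_code` expects are defined in `training_pipeline.py`; assuming the lowercase names used throughout this guide, a small helper keeps the framework/language pairing consistent:

```python
# Illustrative only: pairs each supported framework with its
# implementation language so generate_code calls stay consistent.
FRAMEWORK_LANGUAGES = {
    "express": "javascript",
    "nestjs": "javascript",
    "koa": "javascript",
    "fastapi": "python",
    "django": "python",
    "flask": "python",
    "gin": "go",
    "fiber": "go",
}

def generate_for(model, description: str, framework: str) -> str:
    return model.generate_code(
        description=description,
        framework=framework,
        language=FRAMEWORK_LANGUAGES[framework],
    )
```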
|
|
|
## 📊 Evaluation & Testing
|
|
|
### Automatic Quality Assessment |
|
```python
from training_pipeline import ModelEvaluator

evaluator = ModelEvaluator()

# Test specific code generation
generated_code = model.generate_code(
    description="User authentication API with JWT tokens",
    framework="express",
    language="javascript"
)

# Get quality scores
quality_scores = evaluator.evaluate_code_quality(generated_code, "javascript")
print(f"Syntax Correctness: {quality_scores['syntax_correctness']:.2f}")
print(f"Completeness: {quality_scores['completeness']:.2f}")
print(f"Best Practices: {quality_scores['best_practices']:.2f}")
```
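
As an independent cross-check on the evaluator's syntax score, Python output can be validated with the standard-library `ast` module (JavaScript would need an external parser, so this sketch covers Python only; `python_syntax_ok` is a helper defined here, not part of the pipeline):

```python
import ast

def python_syntax_ok(code: str) -> bool:
    """Return True if the code parses as valid Python."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

sample = model.generate_code(
    description="Health-check endpoint",
    framework="fastapi",
    language="python",
)
print("Parses cleanly:", python_syntax_ok(sample))
```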
|
|
|
### Comprehensive Benchmarking |
|
```python
test_cases = [
    {
        'description': 'REST API for task management with user authentication',
        'framework': 'express',
        'language': 'javascript'
    },
    {
        'description': 'GraphQL API for social media platform',
        'framework': 'fastapi',
        'language': 'python'
    },
    {
        'description': 'Microservice for payment processing',
        'framework': 'gin',
        'language': 'go'
    },
]

benchmark_results = evaluator.benchmark_model(model, test_cases)
print("Overall Performance:", benchmark_results)
```
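
To keep runs comparable over time, you can persist each benchmark under `evaluation/results/` from the directory layout above; the timestamped file naming is just one option:

```python
import json
import time
from pathlib import Path

results_dir = Path("evaluation/results")
results_dir.mkdir(parents=True, exist_ok=True)

# Timestamped so repeated runs never overwrite each other;
# default=str hedges against non-JSON-serializable values.
out_file = results_dir / f"benchmark_{int(time.time())}.json"
out_file.write_text(json.dumps(benchmark_results, indent=2, default=str))
```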
|
|
|
## 🚀 Advanced Usage
|
|
|
### Custom Data Sources |
|
```python
from training_pipeline import CodeExample

# Add your own training examples
custom_examples = [
    {
        'description': 'Custom API requirement',
        'requirements': ['Custom feature 1', 'Custom feature 2'],
        'framework': 'fastapi',
        'language': 'python',
        'code_files': {
            'main.py': '# Your custom code here',
            'requirements.txt': 'fastapi\nuvicorn'
        }
    }
]

# Add to training data
collector.collected_examples.extend([CodeExample(**ex) for ex in custom_examples])
```
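
`CodeExample`'s exact fields live in `training_pipeline.py`; assuming the keys shown above are the required ones, a quick validation pass catches malformed entries before they reach the constructor:

```python
REQUIRED_KEYS = {"description", "requirements", "framework", "language", "code_files"}

for i, ex in enumerate(custom_examples):
    missing = REQUIRED_KEYS - ex.keys()
    if missing:
        raise ValueError(f"Example {i} is missing keys: {sorted(missing)}")
```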
|
|
|
### Fine-tuning on Specific Domains |
|
```python
# Focus training on specific application types
domain_specific_queries = [
    'microservices architecture',
    'api gateway implementation',
    'database orm integration',
    'authentication middleware',
    'rate limiting api',
]

asyncio.run(collector.collect_github_repositories(domain_specific_queries))
```
|
|
|
### Export Trained Model |
|
```python
# Save model for deployment
model.model.save_pretrained('./production_model')
model.tokenizer.save_pretrained('./production_model')

# Load for inference
from transformers import AutoModelForCausalLM, AutoTokenizer

production_model = AutoModelForCausalLM.from_pretrained('./production_model')
production_tokenizer = AutoTokenizer.from_pretrained('./production_model')
```
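
A quick smoke test of the reloaded model confirms the export worked. The prompt format here is a guess, since the real formatting lives inside `generate_code`:

```python
import torch

prompt = "# FastAPI service: user authentication API\n"  # hypothetical prompt format
inputs = production_tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = production_model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,
        temperature=0.7,
        pad_token_id=production_tokenizer.eos_token_id,  # GPT-2-family models lack a pad token
    )

print(production_tokenizer.decode(output_ids[0], skip_special_tokens=True))
```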
|
|
|
## 🔧 Troubleshooting
|
|
|
### Common Issues |
|
|
|
**1. Out of Memory Errors** |
|
```python
# Reduce the effective batch size
config['per_device_train_batch_size'] = 1
config['gradient_accumulation_steps'] = 4

# Trade compute for memory with gradient checkpointing
config['gradient_checkpointing'] = True
```
|
|
|
**2. Slow Training** |
|
```python |
|
# Enable mixed precision (if GPU supports it) |
|
config['fp16'] = True |
|
|
|
# Use multiple GPUs |
|
config['dataloader_num_workers'] = 4 |
|
``` |
|
|
|
**3. Poor Code Quality** |
|
```python |
|
# Increase training data diversity |
|
collector.generate_synthetic_examples(count=1000) |
|
|
|
# Extend training duration |
|
config['num_train_epochs'] = 10 |
|
``` |
|
|
|
### Performance Optimization |
|
|
|
**For CPU Training:** |
|
```python
config['dataloader_pin_memory'] = False
config['per_device_train_batch_size'] = 1
```
|
|
|
**For GPU Training:** |
|
```python
config['fp16'] = True
config['dataloader_pin_memory'] = True
config['per_device_train_batch_size'] = 4
```
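
Rather than maintaining the two configs by hand, you can branch on the detected hardware:

```python
import torch

# Pick sensible defaults based on whether a CUDA device is visible
use_gpu = torch.cuda.is_available()
config['fp16'] = use_gpu
config['dataloader_pin_memory'] = use_gpu
config['per_device_train_batch_size'] = 4 if use_gpu else 1
```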
|
|
|
## 📈 Expected Results
|
|
|
After training on ~500-1000 examples, you should expect: |
|
|
|
- **Syntax Correctness**: 85-95% |
|
- **Code Completeness**: 80-90% |
|
- **Best Practices**: 70-85% |
|
- **Framework Coverage**: All major Node.js and Python frameworks |
|
- **Generation Speed**: 2-5 seconds per application |
|
|
|
## 🔄 Continuous Improvement
|
|
|
### Regular Retraining |
|
```python
import time
import schedule

# Schedule weekly data collection
def update_training_data():
    asyncio.run(collector.collect_github_repositories(['new backend trends']))

schedule.every().week.do(update_training_data)

while True:              # schedule only fires jobs while this loop is running
    schedule.run_pending()
    time.sleep(3600)
```
|
|
|
### A/B Testing Different Models |
|
```python |
|
models_to_compare = [ |
|
'microsoft/DialoGPT-medium', |
|
'microsoft/DialoGPT-large', |
|
'gpt2-medium' |
|
] |
|
|
|
for base_model in models_to_compare: |
|
model = CodeGenerationModel(base_model) |
|
results = evaluator.benchmark_model(model, test_cases) |
|
print(f"{base_model}: {results}") |
|
``` |
|
|
|
## 🎯 Next Steps
|
|
|
1. **Start Small**: Begin with synthetic data and 100-200 examples |
|
2. **Add Real Data**: Integrate GitHub repositories gradually |
|
3. **Evaluate Regularly**: Monitor quality metrics after each training session |
|
4. **Expand Frameworks**: Add support for new frameworks as needed |
|
5. **Production Deploy**: Export the model for API deployment (see the serving sketch below)
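
For the last step, a minimal serving wrapper gives the exported model an HTTP interface. This is a sketch, not a production deployment (no auth, batching, or streaming); it assumes `fastapi`, `uvicorn`, and `pydantic` are installed, the model was exported to `./production_model` as shown earlier, and the prompt format inside the endpoint is hypothetical:

```python
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained("./production_model")
model = AutoModelForCausalLM.from_pretrained("./production_model")

class GenerateRequest(BaseModel):
    description: str
    framework: str = "fastapi"
    language: str = "python"

@app.post("/generate")
def generate(req: GenerateRequest):
    # Hypothetical prompt format; mirror whatever generate_code uses
    prompt = f"# {req.framework} ({req.language}): {req.description}\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=512,
        pad_token_id=tokenizer.eos_token_id,
    )
    return {"code": tokenizer.decode(output_ids[0], skip_special_tokens=True)}
```

Saved as `serve.py`, this runs with `uvicorn serve:app --port 8000`.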
|
|
|
This pipeline provides a complete foundation for building your own backend code generation AI. The modular design allows you to customize and extend each component based on your specific needs. |