# Project Structure

This document provides a detailed overview of the CodeLLaMA-Linux-BugFix project structure, explaining the purpose and organization of each component.

## 📁 Root Directory

```
CodeLLaMA-Linux-BugFix/
├── dataset_builder/       # Dataset creation and processing
├── dataset/               # Generated datasets and data files
├── train/                 # Model training scripts and outputs
├── evaluate/              # Model evaluation and testing
├── requirements.txt       # Python dependencies
├── README.md              # Project documentation
└── PROJECT_STRUCTURE.md   # This file
```

## 🔧 Dataset Builder (`dataset_builder/`)

The dataset builder extracts bug-fix data from the Linux kernel Git repository and converts it into a training-ready format.

### Files:

- **`extract_linux_bugfixes.py`** - Main dataset extraction script
  - Uses PyDriller to analyze the Linux kernel Git history
  - Filters commits using bug-fix keywords
  - Extracts code context around bug locations
  - Generates structured dataset entries
- **`extract_linux_bugfixes_parallel.py`** - Parallelized version of the dataset builder
  - Multi-process implementation for faster processing
  - Configurable worker count (default: 16 workers)
  - Test mode with limited commit processing
- **`format_for_training.py`** - Format conversion script
  - Converts structured data to prompt-completion pairs
  - Formats input for supervised fine-tuning
  - Creates a training-ready JSONL format

### Key Features:

- **Commit Filtering**: Identifies bug-fix commits using 17 keywords
- **Code Context**: Extracts 10 lines before/after the bug location
- **File Filtering**: Focuses on C source and header files (`.c`, `.h`)
- **Diff Extraction**: Captures Git diff patches for fixes
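To make the extraction step concrete, here is a minimal sketch of PyDriller-based commit mining. It assumes a local kernel checkout at `./linux`; the keyword list, output path, and helper names are illustrative stand-ins rather than the script's actual values, and the full pre-change file stands in for the 10-line context window.

```python
import json
from pydriller import Repository

# Illustrative subset of the bug-fix keywords (the real script uses 17).
BUGFIX_KEYWORDS = ["fix", "bug", "leak", "overflow", "null", "race", "crash"]

def is_bugfix(message: str) -> bool:
    """Return True if the commit message mentions any bug-fix keyword."""
    msg = message.lower()
    return any(kw in msg for kw in BUGFIX_KEYWORDS)

with open("bugfix_dataset.jsonl", "w") as out:
    for commit in Repository("./linux").traverse_commits():
        if not is_bugfix(commit.msg):
            continue
        for mod in commit.modified_files:
            # Keep only C source and header files.
            if not mod.filename.endswith((".c", ".h")):
                continue
            entry = {
                "input": {
                    "original code": mod.source_code_before or "",
                    "instruction": (commit.msg.splitlines() or [""])[0],
                },
                "output": {"diff codes": mod.diff},
            }
            out.write(json.dumps(entry) + "\n")
```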
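The conversion performed by `format_for_training.py` can then be pictured as follows; the prompt template and file paths are assumptions, while the record layout matches the Data Format shown in the Dataset section below.

```python
import json

def to_prompt_completion(record: dict) -> dict:
    """Flatten a structured record into a single prompt/completion pair."""
    prompt = (
        "### Instruction:\n" + record["input"]["instruction"] + "\n\n"
        "### Buggy code:\n" + record["input"]["original code"] + "\n\n"
        "### Fix:\n"
    )
    return {"prompt": prompt, "completion": record["output"]["diff codes"]}

with open("training_data_100k.jsonl") as src, \
        open("training_data_prompt_completion.jsonl", "w") as dst:
    for line in src:
        dst.write(json.dumps(to_prompt_completion(json.loads(line))) + "\n")
```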
## 📊 Dataset (`dataset/`)

Contains the generated datasets used for training and evaluation.

### Files:

- **`training_data_100k.jsonl`** - Main training dataset
  - 100,000 bug-fix samples
  - Structured format with input/output pairs
  - Stored using Git LFS for large-file handling
- **`training_data_prompt_completion.jsonl`** - Converted training format
  - Prompt-completion pairs for supervised learning
  - Optimized for transformer model training
  - Stored using Git LFS

### Data Format:

```json
{
  "input": {
    "original code": "C code snippet with bug",
    "instruction": "Bug fix instruction from commit message"
  },
  "output": {
    "diff codes": "Git diff showing the fix"
  }
}
```

## 🚀 Training (`train/`)

Contains all training-related scripts, configurations, and model outputs.

### Files:

- **`train_codellama_qlora_linux_bugfix.py`** - Main training script
  - QLoRA fine-tuning implementation
  - Optimized for the H200 GPU with bfloat16
  - Includes Weights & Biases integration
  - Comprehensive training configuration
- **`train_codellama_qlora_simple.py`** - Alternative training script
  - Simplified QLoRA implementation
  - Basic training setup without advanced features
  - Good for testing and development
- **`download_codellama_model.py`** - Model download utility
  - Downloads the base CodeLLaMA-7B-Instruct model
  - Ensures model availability before training

### Output Directory (`train/output/`):

- **`qlora-codellama-bugfix/`** - Main model output
  - **`adapter_model.safetensors`** - LoRA adapter weights
  - **`adapter_config.json`** - LoRA configuration
  - **`tokenizer.json`** - Tokenizer files
  - **`chat_template.jinja`** - Conversation template
  - **`checkpoint-500/`** - Training checkpoint at step 500
  - **`checkpoint-1000/`** - Training checkpoint at step 1000
  - **`README.md`** - Model card and documentation

### Training Configuration:

- **Base Model**: `codellama/CodeLlama-7b-Instruct-hf`
- **Method**: QLoRA with 4-bit quantization
- **LoRA Config**: r=64, alpha=16, dropout=0.1
- **Training**: 3 epochs, batch size 64, learning rate 2e-4
- **Hardware**: Optimized for the H200 GPU
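A QLoRA setup matching this configuration might look like the sketch below; the `target_modules` list and other unstated details are assumptions, not necessarily the training script's exact values.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantized base weights with bfloat16 compute, as used on the H200.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA adapter: r=64, alpha=16, dropout=0.1, per the configuration above.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only adapter weights are trainable
```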
## 📈 Evaluation (`evaluate/`)

Contains evaluation scripts and results for assessing model performance.

### Files:

- **`evaluate_linux_bugfix_model.py`** - Main evaluation script
  - Loads the fine-tuned model for inference
  - Generates predictions on test data
  - Computes BLEU and ROUGE metrics
  - Saves results in multiple formats
- **`test_samples.jsonl`** - Evaluation dataset
  - Test samples for model evaluation
  - Stored using Git LFS

### Output Directory (`evaluate/output/`):

- **`eval_results.json`** - Detailed evaluation results
  - Complete predictions and references
  - Stored using Git LFS
- **`eval_results.csv`** - Tabular evaluation results
  - CSV format for easy analysis
  - Stored using Git LFS

### Evaluation Metrics:

- **BLEU Score**: Measures n-gram overlap between generated and reference fixes
- **ROUGE Score**: Measures recall-oriented overlap with the reference text
- **Human Evaluation**: Qualitative assessment

An end-to-end scoring sketch appears at the end of this document.

## 🔧 Dependencies (`requirements.txt`)

Comprehensive list of Python packages required for the project:

### Core ML Libraries:

- `transformers==4.53.1` - Hugging Face Transformers
- `torch==2.7.1+cu128` - PyTorch with CUDA support
- `peft==0.16.0` - Parameter-efficient fine-tuning
- `accelerate==1.8.1` - Distributed training
- `bitsandbytes==0.46.1` - Quantization support

### Data Processing:

- `datasets==3.6.0` - Dataset handling
- `pandas==2.3.1` - Data manipulation
- `numpy==2.3.1` - Numerical computing

### Git Analysis:

- `pydriller` - Git repository mining
- `gitpython` - Git operations

### Utilities:

- `tqdm==4.67.1` - Progress bars
- `wandb` - Experiment tracking
- `evaluate==0.4.4` - Evaluation metrics

## 🔄 Workflow

### 1. Dataset Creation

```bash
cd dataset_builder
python extract_linux_bugfixes.py   # Extract bug-fix data
python format_for_training.py      # Convert format
```

### 2. Model Training

```bash
cd train
python train_codellama_qlora_linux_bugfix.py   # Train with QLoRA
```

### 3. Model Evaluation

```bash
cd evaluate
python evaluate_linux_bugfix_model.py   # Evaluate performance
```

## 🎯 Key Design Principles

### Modularity

- Each component has a single, specific responsibility
- Clear separation between data, training, and evaluation
- Easy to modify or extend individual components

### Efficiency

- QLoRA for memory-efficient training
- Parallel processing for dataset creation
- Optimized for modern GPU hardware

### Reproducibility

- Version-controlled dependencies
- Structured data formats
- Comprehensive logging and evaluation

### Scalability

- Configurable parameters for different hardware
- Support for distributed training
- Efficient handling of large files with Git LFS

## 🔍 File Naming Conventions

- **Scripts**: Descriptive names with a clear purpose
- **Datasets**: Include size/version information
- **Models**: Include architecture and method
- **Results**: Include timestamp or version
- **Configs**: Use `.json` or `.yaml` format

## 📝 Documentation

- **README.md**: Project overview and quick start
- **PROJECT_STRUCTURE.md**: This detailed structure guide
- **Model README**: Generated model cards in the output directories
- **Code Comments**: Inline documentation in all scripts

This structure keeps the project organized, maintainable, and easy to understand for both users and contributors.
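As a closing illustration of the evaluation step described above, the sketch below loads the trained LoRA adapter, generates a fix for each test sample, and scores the outputs with the `evaluate` library. Paths, the `prompt`/`completion` field names, and generation settings are assumptions rather than the evaluation script's exact behavior.

```python
import json
import evaluate
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the 7B base model and attach the fine-tuned LoRA adapter.
base = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf", device_map="auto"
)
model = PeftModel.from_pretrained(base, "train/output/qlora-codellama-bugfix")
tokenizer = AutoTokenizer.from_pretrained("train/output/qlora-codellama-bugfix")

predictions, references = [], []
with open("evaluate/test_samples.jsonl") as f:
    for line in f:
        sample = json.loads(line)  # field names assumed
        inputs = tokenizer(sample["prompt"], return_tensors="pt").to(model.device)
        output = model.generate(**inputs, max_new_tokens=256)
        # Decode only the newly generated tokens, not the prompt.
        generated = tokenizer.decode(
            output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
        predictions.append(generated)
        references.append(sample["completion"])

bleu = evaluate.load("bleu").compute(predictions=predictions, references=references)
rouge = evaluate.load("rouge").compute(predictions=predictions, references=references)
print(f"BLEU: {bleu['bleu']:.4f}  ROUGE-L: {rouge['rougeL']:.4f}")
```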