Text-to-Code Generator using CodeGen-350M-Multi
=============================================

This project provides a text-to-code generator using a fine-tuned Salesforce/codegen-350M-multi 
model, designed to run on low-end laptops (8GB RAM, CPU-only) for students to experiment with AI 
model development. The model is fine-tuned on a custom dataset and includes a Flask web interface 
for easy interaction. All resources are open-source under the Apache-2.0 license, with attribution
to the original model by Salesforce.



Do's and Setup Process
---------------------
1. **System Requirements**:
   - Laptop with at least 8GB RAM and 2GB free disk space.
   - Windows, macOS, or Linux (CPU-only, no GPU required).
   - Internet connection for initial model download.

2. **Install Python**:
   - Use Python 3.10.9. Download from https://www.python.org/downloads/release/python-3109/.
   - Verify installation: `python --version`.

3. **Clone or Download Repository**:
   - Download the project files from the Hugging Face repository: 
   https://huggingface.co/remiai3/text-to-code-using-codegen-project.
   - Extract files to a folder (e.g., `text-to-code-codegen`).

4. **Set Up Virtual Environment**:
   - Open a terminal in the project folder.
   - Create a virtual environment: `python -m venv venv`.
   - Activate it:
     - Windows: `venv\Scripts\activate`
     - macOS/Linux: `source venv/bin/activate`
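
To confirm that steps 2 and 4 worked before installing anything, a small check like the one below can help. It is a minimal sketch using only the standard library; the function names are hypothetical and are not part of this project's files.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside an environment created by venv."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def python_matches(major: int = 3, minor: int = 10) -> bool:
    """Return True when the running interpreter has the expected major.minor version."""
    return sys.version_info[:2] == (major, minor)


def describe_environment() -> str:
    """Build a one-line summary to print before installing packages."""
    version = ".".join(str(part) for part in sys.version_info[:3])
    venv_state = "active" if in_virtualenv() else "not active"
    return f"Python {version}, virtual environment {venv_state}"
```

Calling `print(describe_environment())` from the activated terminal should report a 3.10.x interpreter and an active virtual environment if the setup matches this guide.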

5. **Install Dependencies**:
   - Run: `pip install -r requirements.txt`.
   - Required libraries: torch, transformers, datasets, accelerate, protobuf, matplotlib, flask.
   - NOTE: If the pinned matplotlib version (3.7.2) is not compatible with your setup, remove that 
     version pin from `requirements.txt`. If any other library conflicts with your Python version 
     or with previously installed packages, remove all version pins and list the library names 
     only; pip then installs the latest version of each library that is compatible with your Python.
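
When sorting out version conflicts, it can help to see which of the required libraries are installed and at what version. The sketch below uses the standard-library `importlib.metadata` module. The package list is copied from this step, and the helper names are hypothetical.

```python
from importlib import metadata

# Library names as listed in this README; no specific versions are assumed.
REQUIRED = ["torch", "transformers", "datasets", "accelerate",
            "protobuf", "matplotlib", "flask"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def format_report(versions):
    """Render the version mapping as one 'name: version' line per package, sorted by name."""
    lines = []
    for name, version in sorted(versions.items()):
        shown = version if version is not None else "NOT INSTALLED"
        lines.append(f"{name}: {shown}")
    return "\n".join(lines)
```

Running `print(format_report(installed_versions(REQUIRED)))` inside the virtual environment lists each library with its installed version, which makes it easier to decide which pins to remove.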

6. **Prepare Custom Dataset**:
   - Ensure the `custom_dataset.jsonl` file exists in the project folder.
   - Format: Each line is a JSON object with `prompt` (natural language) and `code` (Python code).
   - Example:
     {"prompt": "Write a Python program to print 'Hello, World!'", "code": "print('Hello, World!')"}
     {"prompt": "Write a Python function to add two numbers.", "code": "def add_numbers(a, b):\n    return a + b"}
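
Before fine-tuning, it may be worth checking that every line of the dataset follows the format described above. The following is a minimal sketch, not part of the project's scripts. `validate_jsonl` is a hypothetical helper that reports lines that are not valid JSON objects or that lack a non-empty `prompt` or `code` string.

```python
import json


def validate_jsonl(path):
    """Return a list of (line_number, message) problems found in the dataset file."""
    problems = []
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as error:
                problems.append((number, f"invalid JSON: {error.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((number, "not a JSON object"))
                continue
            # Both fields must be present and hold non-empty strings.
            for key in ("prompt", "code"):
                value = record.get(key)
                if not isinstance(value, str) or not value.strip():
                    problems.append((number, f"missing or empty '{key}'"))
    return problems
```

For example, `validate_jsonl("custom_dataset.jsonl")` returns an empty list when every line is well formed.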

7. **Run the Model**:
   - Option 1: Run the full pipeline (download, fine-tune, test):
     - Update `run_all.py` with your Hugging Face token (`HF_TOKEN`).
     - Run: `python run_all.py`.
     - This downloads the model, fine-tunes it, tests it, and generates a loss plot.
   - Option 2: Test the fine-tuned model directly:
     - Run: `python test_codegen.py` to test with sample prompts.
   - Option 3: Use the web interface:
     - Run: `python app.py`.
     - Open a browser and go to `http://127.0.0.1:5000`.
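
For readers curious what testing the fine-tuned model involves, here is a rough sketch of loading the saved weights and generating code with the `transformers` library. It is not the content of `test_codegen.py`. It assumes the tokenizer was saved alongside the model in `./finetuned_codegen`, and that prompts are passed to the model as plain text, which may differ from the format used during fine-tuning. The function names and the `max_new_tokens` default are illustrative.

```python
def strip_prompt(decoded: str, prompt: str) -> str:
    """Remove the echoed prompt, since a causal LM returns prompt plus continuation."""
    if decoded.startswith(prompt):
        return decoded[len(prompt):].lstrip("\n")
    return decoded


def generate_code(prompt, model_dir="./finetuned_codegen", max_new_tokens=64):
    """Generate a completion for `prompt` on CPU (model_dir is an assumed location)."""
    # Imported lazily so strip_prompt stays usable without the heavy libraries.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir)
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy decoding for repeatable output
            # Reuse the end-of-sequence token for padding in case none is defined.
            pad_token_id=tokenizer.eos_token_id,
        )
    decoded = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return strip_prompt(decoded, prompt)
```

After fine-tuning has finished, `generate_code("Write a Python function to add two numbers.")` would return the generated text. Keeping `max_new_tokens` small helps keep CPU inference time manageable.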

8. **Using the AI Model**:
   - **Command Line Testing**: Use `test_codegen.py` to input prompts and generate Python code.
   - **Web Interface**: Use the Flask app (`app.py`) to enter prompts via a browser and view generated code.
   - Example prompts:
     - "Write a Python function to calculate factorial of a number"
     - "Write a Python function to check if a number is prime"
   - Fine-tuning outputs are saved to `./finetuned_codegen` (model weights) and 
     `./finetuned_codegen/loss_plot.png` (loss plot).

9. **Model Details**:
   - Model: Salesforce/codegen-350M-multi (Apache-2.0 license).
   - Source: https://huggingface.co/Salesforce/codegen-350M-multi.
   - Fine-tuned on a custom dataset for Python code generation.
   - Attribution: This project uses the Salesforce CodeGen model, fine-tuned by remiai3 for 
     educational purposes.

10. **Troubleshooting**:
    - Ensure ~2GB disk space for model weights.
    - If memory issues occur, reduce dataset size or batch size in `run_all.py`.
    - Check terminal output for errors and ensure all files (`custom_dataset.jsonl`, 
      `finetuned_codegen`) are in place.
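
The file and disk-space checks above can be automated with a short pre-flight script. This is a minimal sketch using only the standard library. `preflight` is a hypothetical helper, and the 2 GB threshold simply mirrors the figure given in this guide. Note that `finetuned_codegen` is only expected to exist after fine-tuning has been run.

```python
import os
import shutil


def preflight(project_dir=".", min_free_gb=2.0):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []

    dataset = os.path.join(project_dir, "custom_dataset.jsonl")
    if not os.path.isfile(dataset):
        problems.append("custom_dataset.jsonl is missing")

    model_dir = os.path.join(project_dir, "finetuned_codegen")
    if not os.path.isdir(model_dir):
        problems.append("finetuned_codegen directory is missing")

    # Free space on the drive that holds the project folder, in GiB.
    free_gb = shutil.disk_usage(project_dir).free / 1024**3
    if free_gb < min_free_gb:
        problems.append(f"only {free_gb:.1f} GB free, need about {min_free_gb:.0f} GB")

    return problems
```

Running `print(preflight())` from the project folder before `python app.py` or `python test_codegen.py` gives a quick summary of anything that is missing.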

11. **Contributing**:
    - Add more examples to `custom_dataset.jsonl` to improve model performance.
    - Share feedback or improvements via the Hugging Face repository: 
      https://huggingface.co/remiai3.

Attribution
-----------
This project is built using the Salesforce/codegen-350M-multi model, licensed under Apache-2.0. 
The fine-tuned model and resources are provided by remiai3 for free educational use to help students 
learn and experiment with AI models.