Text-to-Code Generator using CodeGen-350M-Multi
===============================================

This project provides a text-to-code generator using a fine-tuned Salesforce/codegen-350M-multi
model, designed to run on low-end laptops (8GB RAM, CPU-only) for students to experiment with AI
model development. The model is fine-tuned on a custom dataset and includes a Flask web interface
for easy interaction. All resources are open-source under the Apache-2.0 license, with attribution
to the original model by Salesforce.

Do's and Setup Process
----------------------
1. **System Requirements**:
   - Laptop with at least 8GB RAM and 2GB free disk space.
   - Windows, macOS, or Linux (CPU-only, no GPU required).
   - Internet connection for initial model download.

2. **Install Python**:
   - Use Python 3.10.9. Download from https://www.python.org/downloads/release/python-3109/.
   - Verify installation: `python --version`.

3. **Clone or Download Repository**:
   - Download the project files from the Hugging Face repository:
     https://huggingface.co/remiai3/text-to-code-using-codegen-project.
   - Extract files to a folder (e.g., `text-to-code-codegen`).

4. **Set Up Virtual Environment**:
   - Open a terminal in the project folder.
   - Create a virtual environment: `python -m venv venv`.
   - Activate it:
     - Windows: `venv\Scripts\activate`
     - macOS/Linux: `source venv/bin/activate`

5. **Install Dependencies**:
   - Run: `pip install -r requirements.txt`.
   - Required libraries: torch, transformers, datasets, accelerate, protobuf, matplotlib, flask.

   NOTE: If the pinned matplotlib version (3.7.2) is not compatible, remove that version pin. If
   any other library is also incompatible with your Python version or local machine (for example,
   because of previously installed libraries), remove the version pins from all of the libraries
   and list them by name only, so that a default version of each library is installed.

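Before running the scripts, it can help to confirm that the libraries listed in step 5 can be imported from the active virtual environment. The following is a minimal sketch and is not part of the project files: the helper name `find_missing` is hypothetical, and the import names (in particular `google.protobuf` for the protobuf package) are assumptions about how each library is imported.

```python
import importlib.util

# Import names for the libraries listed in step 5.
# Assumption: the protobuf package is imported as `google.protobuf`.
REQUIRED_MODULES = [
    "torch", "transformers", "datasets", "accelerate",
    "google.protobuf", "matplotlib", "flask",
]


def find_missing(modules):
    """Return the module names that cannot be found in the current environment."""
    missing = []
    for name in modules:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Raised when a parent package (e.g. `google`) is itself absent.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing libraries:", ", ".join(missing))
    else:
        print("All required libraries are importable.")
```

If the script reports missing libraries, re-running `pip install -r requirements.txt` inside the activated virtual environment is the first thing to try.
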
6. **Prepare Custom Dataset**:
   - Ensure the `custom_dataset.jsonl` file exists in the project folder.
   - Format: Each line is a JSON object with `prompt` (natural language) and `code` (Python code).
   - Example:
     {"prompt": "Write a Python program to print 'Hello, World!'", "code": "print('Hello, World!')"}
     {"prompt": "Write a Python function to add two numbers.", "code": "def add_numbers(a, b):\n return a + b"}

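Because a single malformed line can interrupt fine-tuning, the dataset format from step 6 can be checked before training. This is a minimal sketch under the assumption that every non-blank line must be a JSON object with non-empty string values for `prompt` and `code`; the function names are hypothetical and are not part of the project scripts.

```python
import json


def validate_jsonl_lines(lines):
    """Return (line_number, problem) tuples for records that break the format."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are ignored
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        for key in ("prompt", "code"):
            value = record.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append((number, f"missing or empty '{key}'"))
    return problems


def validate_dataset(path="custom_dataset.jsonl"):
    """Validate the dataset file line by line."""
    with open(path, encoding="utf-8") as handle:
        return validate_jsonl_lines(handle)
```

Calling `validate_dataset()` from the project folder returns an empty list when every record has both fields.
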
7. **Run the Model**:
   - Option 1: Run the full pipeline (download, fine-tune, test):
     - Update `run_all.py` with your Hugging Face token (`HF_TOKEN`).
     - Run: `python run_all.py`.
     - This downloads the model, fine-tunes it, tests it, and generates a loss plot.
   - Option 2: Test the fine-tuned model directly:
     - Run: `python test_codegen.py` to test with sample prompts.
   - Option 3: Use the web interface:
     - Run: `python app.py`.
     - Open a browser and go to `http://127.0.0.1:5000`.

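Option 1 asks for a Hugging Face token in `run_all.py`. One way to avoid typing the token into the file is to read it from an environment variable. The sketch below is an assumption about how this could be done, not a description of how `run_all.py` is written; the helper `load_hf_token` is hypothetical.

```python
import os


def load_hf_token(env_var="HF_TOKEN"):
    """Read the Hugging Face token from an environment variable.

    Assumption: the token has been exported in the shell beforehand,
    e.g. under the name HF_TOKEN.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Environment variable {env_var} is not set; "
            "export your Hugging Face token before running the pipeline."
        )
    return token
```

In `run_all.py`, the token assignment could then become `HF_TOKEN = load_hf_token()`, assuming the script stores the token in a variable of that name.
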
8. **Using the AI Model**:
   - **Command Line Testing**: Use `test_codegen.py` to input prompts and generate Python code.
   - **Web Interface**: Use the Flask app (`app.py`) to enter prompts via a browser and view generated code.
   - Example prompts:
     - "Write a Python function to calculate factorial of a number"
     - "Write a Python function to check if a number is prime"
   - Output is saved in `./finetuned_codegen/loss_plot.png` (loss plot) and `./finetuned_codegen`
     (model weights).

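For readers who want to call the fine-tuned model from their own script, the following sketch shows one way to load it and generate code with the `transformers` library. It assumes the weights and tokenizer in `./finetuned_codegen` were saved in the standard Hugging Face format; the function names and the `max_new_tokens` default are illustrative, and the project's own `test_codegen.py` may work differently.

```python
def extract_completion(prompt, decoded):
    """Drop the echoed prompt from the decoded text, if it is present."""
    if decoded.startswith(prompt):
        return decoded[len(prompt):].lstrip("\n")
    return decoded


def generate_code(prompt, model_dir="./finetuned_codegen", max_new_tokens=128):
    """Generate code for a natural-language prompt on CPU."""
    # Imported here so the helper above can be used without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        # The tokenizer may not define a pad token, so reuse end-of-sequence.
        pad_token_id=tokenizer.eos_token_id,
    )
    decoded = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return extract_completion(prompt, decoded)
```

For example, `generate_code("Write a Python function to check if a number is prime")` returns the generated text with the prompt removed. Loading the model on every call is slow, so a longer-running script would load it once and reuse it.
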
9. **Model Details**:
   - Model: Salesforce/codegen-350M-multi (Apache-2.0 license).
   - Source: https://huggingface.co/Salesforce/codegen-350M-multi.
   - Fine-tuned on a custom dataset for Python code generation.
   - Attribution: This project uses the Salesforce CodeGen model, fine-tuned by remiai3 for
     educational purposes.

10. **Troubleshooting**:
    - Ensure about 2GB of free disk space is available for the model weights.
    - If memory issues occur, reduce the dataset size or the batch size in `run_all.py`.
    - Check the terminal output for errors and ensure all files (`custom_dataset.jsonl`,
      `finetuned_codegen`) are in place.

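The file and disk-space checks from the troubleshooting list can be scripted. This is a minimal sketch with hypothetical helper names; it assumes the script is run from the project folder and that `finetuned_codegen` only exists after fine-tuning has completed at least once.

```python
import shutil
from pathlib import Path


def check_project_files(project_dir=".",
                        required=("custom_dataset.jsonl", "finetuned_codegen")):
    """Return the required files or folders that are missing from project_dir."""
    root = Path(project_dir)
    return [name for name in required if not (root / name).exists()]


def free_disk_gb(path="."):
    """Return the free disk space at path in gigabytes (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


if __name__ == "__main__":
    missing = check_project_files()
    print("Missing:", ", ".join(missing) if missing else "none")
    print(f"Free disk space: {free_disk_gb():.1f} GB (about 2 GB is needed)")
```
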
11. **Contributing**:
    - Add more examples to `custom_dataset.jsonl` to improve model performance.
    - Share feedback or improvements via the Hugging Face repository:
      https://huggingface.co/remiai3.

Attribution
-----------
This project is built using the Salesforce/codegen-350M-multi model, licensed under Apache-2.0.
The fine-tuned model and resources are provided by remiai3 for free educational use to help students
learn and experiment with AI models.