{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# TinyLlama Fine-Tuning Example\n", "\n", "This notebook demonstrates how to fine-tune a TinyLlama model on a custom dataset. We'll go through the following steps:\n", "\n", "1. Setting up the environment\n", "2. Loading the model and tokenizer\n", "3. Preparing the dataset\n", "4. Fine-tuning the model\n", "5. Evaluating the results\n", "6. Saving and using the fine-tuned model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Setting up the environment\n", "\n", "First, let's install the necessary libraries if they're not already installed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "source": [ "# Install required libraries\n", "!pip install torch transformers datasets accelerate tqdm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import required libraries:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "source": [ "import os\n", "import json\n", "import torch\n", "from transformers import (\n", " AutoModelForCausalLM, \n", " AutoTokenizer,\n", " Trainer, \n", " TrainingArguments,\n", " DataCollatorForLanguageModeling\n", ")\n", "from datasets import Dataset\n", "from tqdm.notebook import tqdm\n", "\n", "# Check if GPU is available\n", "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n", "print(f\"Using device: {device}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Loading the model and tokenizer\n", "\n", "We'll use the TinyLlama-1.1B-Chat-v1.0 model from Hugging Face." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "source": [ "model_name = \"TinyLlama/TinyLlama-1.1B-Chat-v1.0\"\n", "\n", "# Load tokenizer\n", "tokenizer = AutoTokenizer.from_pretrained(model_name)\n", "# Ensure the tokenizer has a padding token\n", "if tokenizer.pad_token is None:\n", " tokenizer.pad_token = tokenizer.eos_token\n", "\n", "# Load model with reduced precision to save memory\n", "model = AutoModelForCausalLM.from_pretrained(\n", " model_name,\n", " torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,\n", " low_cpu_mem_usage=True\n", ")\n", "model = model.to(device)\n", "\n", "print(f\"Model and tokenizer loaded: {model_name}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Preparing the dataset\n", "\n", "Let's load our example training data and format it properly for fine-tuning." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "source": [ "# Load example data\n", "with open('example_training_data.json', 'r', encoding='utf-8') as f:\n", " data = json.load(f)\n", "\n", "# Format data for instruction fine-tuning\n", "formatted_data = []\n", "for item in data:\n", " # Format as a chat-like conversation\n", " formatted_text = f\"<|im_start|>user\\n{item['instruction']}<|im_end|>\\n<|im_start|>assistant\\n{item['response']}<|im_end|>\"\n", " formatted_data.append({\"text\": formatted_text})\n", "\n", "# Create a Hugging Face dataset\n", "dataset = Dataset.from_list(formatted_data)\n", "print(f\"Dataset created with {len(dataset)} examples\")\n", "\n", "# Show an example\n", "print(\"\\nExample entry:\")\n", "print(dataset[0]['text'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "source": [ "# Tokenize the dataset\n", "def tokenize_function(examples):\n", " return tokenizer(examples[\"text\"], padding=\"max_length\", truncation=True, max_length=512)\n", "\n", "# Add labels for causal language modeling\n", "def add_labels(examples):\n", " examples[\"labels\"] = examples[\"input_ids\"].copy()\n", " return examples\n", "\n", "# Process dataset\n", "tokenized_dataset = dataset.map(tokenize_function, batched=True)\n", "tokenized_dataset = tokenized_dataset.map(add_labels, batched=True)\n", "tokenized_dataset = tokenized_dataset.remove_columns([\"text\"])\n", "\n", "# Split into training and evaluation sets\n", "tokenized_dataset = tokenized_dataset.train_test_split(test_size=0.1)\n", "print(f\"Training examples: {len(tokenized_dataset['train'])}\")\n", "print(f\"Evaluation examples: {len(tokenized_dataset['test'])}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Fine-tuning the model\n", "\n", "Now we'll set up the training configuration and fine-tune the model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "source": [ "# Set up training arguments\n", "output_dir = \"./fine_tuned_tinyllama\"\n", "\n", "training_args = TrainingArguments(\n", " output_dir=output_dir,\n", " overwrite_output_dir=True,\n", " num_train_epochs=3, # Adjust based on your dataset size\n", " per_device_train_batch_size=2, # Adjust based on your GPU memory\n", " per_device_eval_batch_size=2,\n", " gradient_accumulation_steps=4, # Accumulate gradients to simulate larger batch size\n", " learning_rate=2e-5,\n", " weight_decay=0.01,\n", " logging_dir=f\"{output_dir}/logs\",\n", " logging_steps=10,\n", " eval_steps=100,\n", " save_steps=100,\n", " save_total_limit=2, # Only keep the 2 best checkpoints\n", " evaluation_strategy=\"steps\",\n", " fp16=torch.cuda.is_available(), # Use mixed precision if GPU is available\n", " warmup_steps=100,\n", " report_to=\"none\", # Disable reporting to wandb, etc.\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "source": [ "# Set up data collator\n", "data_collator = DataCollatorForLanguageModeling(\n", " tokenizer=tokenizer,\n", " mlm=False # We're doing causal language modeling, not masked language modeling\n", ")\n", "\n", "# Set up trainer\n", "trainer = Trainer(\n", " model=model,\n", " args=training_args,\n", " train_dataset=tokenized_dataset[\"train\"],\n", " eval_dataset=tokenized_dataset[\"test\"],\n", " data_collator=data_collator,\n", ")\n", "\n", "# Train the model\n", "print(\"Starting fine-tuning...\")\n", "trainer.train()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. 
Evaluating the results\n", "\n", "Let's evaluate the fine-tuned model on some test prompts." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "source": [ "# Test the model with a few prompts\n", "test_prompts = [\n", "    \"Explain the concept of neural networks.\",\n", "    \"Write a short story about a robot that learns to feel emotions.\",\n", "    \"What are three sustainable energy sources and how do they work?\"\n", "]\n", "\n", "# Format prompts with the same chat tags used for fine-tuning\n", "formatted_prompts = [f\"<|im_start|>user\\n{prompt}<|im_end|>\\n<|im_start|>assistant\\n\" for prompt in test_prompts]\n", "\n", "# Generate responses\n", "for i, prompt in enumerate(formatted_prompts):\n", "    print(f\"\\n\\nPrompt {i+1}: {test_prompts[i]}\")\n", "    print(\"\\nGenerating response...\")\n", "\n", "    inputs = tokenizer(prompt, return_tensors=\"pt\").to(device)\n", "\n", "    with torch.no_grad():\n", "        outputs = model.generate(\n", "            inputs.input_ids,\n", "            attention_mask=inputs.attention_mask,\n", "            max_new_tokens=256,\n", "            temperature=0.7,\n", "            do_sample=True,\n", "            pad_token_id=tokenizer.eos_token_id\n", "        )\n", "\n", "    # Get only the newly generated text (not the prompt)\n", "    response_text = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)\n", "\n", "    print(f\"Response: {response_text}\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Saving and using the fine-tuned model\n", "\n", "Let's save the fine-tuned model, then see how you can load and use it in the future." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "source": [ "# Save the fine-tuned model\n", "trainer.save_model(output_dir)\n", "tokenizer.save_pretrained(output_dir)\n", "print(f\"Model saved to {output_dir}\")" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "source": [ "# Load the fine-tuned model and tokenizer\n", "def load_fine_tuned_model(model_path):\n", "    tokenizer = AutoTokenizer.from_pretrained(model_path)\n", "    model = AutoModelForCausalLM.from_pretrained(\n", "        model_path,\n", "        torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32\n", "    )\n", "    model = model.to(device)\n", "    return model, tokenizer\n", "\n", "# Example of loading the model (uncomment to run)\n", "# fine_tuned_model, fine_tuned_tokenizer = load_fine_tuned_model(output_dir)\n", "\n", "# Function to generate a response\n", "def generate_response(model, tokenizer, prompt, max_new_tokens=256, temperature=0.7):\n", "    # Format the prompt with the same chat tags used for fine-tuning\n", "    formatted_prompt = f\"<|im_start|>user\\n{prompt}<|im_end|>\\n<|im_start|>assistant\\n\"\n", "\n", "    # Tokenize\n", "    inputs = tokenizer(formatted_prompt, return_tensors=\"pt\").to(device)\n", "\n", "    # Generate\n", "    with torch.no_grad():\n", "        outputs = model.generate(\n", "            inputs.input_ids,\n", "            attention_mask=inputs.attention_mask,\n", "            max_new_tokens=max_new_tokens,\n", "            temperature=temperature,\n", "            do_sample=True,\n", "            pad_token_id=tokenizer.eos_token_id\n", "        )\n", "\n", "    # Decode\n", "    full_response = tokenizer.decode(outputs[0], skip_special_tokens=True)\n", "\n", "    # Extract the assistant's response\n", "    try:\n", "        assistant_response = full_response.split(\"<|im_start|>assistant\\n\")[1].split(\"<|im_end|>\")[0]\n", "    except IndexError:\n", "        assistant_response = full_response.replace(prompt, \"\").strip()\n", "\n", "    return assistant_response\n", "\n", "# Example usage (uncomment to run)\n", "# response = generate_response(fine_tuned_model, fine_tuned_tokenizer, \"Explain quantum computing.\")\n", "# print(response)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n",
"\n", "You've successfully fine-tuned a TinyLlama model on a custom dataset! You can now use this model for various applications:\n", "\n", "1. Integrate it into a chatbot or virtual assistant\n", "2. Use it for content generation\n", "3. Deploy it as part of a web application\n", "4. Fine-tune it further on more specific data\n", "\n", "You can also experiment with different hyperparameters and training strategies to improve results." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 4 }