{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# TinyLlama Fine-Tuning Example\n", "\n", "This notebook demonstrates how to fine-tune a TinyLlama model on a custom dataset. We'll go through the following steps:\n", "\n", "1. Setting up the environment\n", "2. Loading the model and tokenizer\n", "3. Preparing the dataset\n", "4. Fine-tuning the model\n", "5. Evaluating the results\n", "6. Saving and using the fine-tuned model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Setting up the environment\n", "\n", "First, let's install the necessary libraries if they're not already installed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "source": [ "# Install required libraries\n", "!pip install torch transformers datasets accelerate tqdm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import required libraries:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "source": [ "import os\n", "import json\n", "import torch\n", "from transformers import (\n", " AutoModelForCausalLM, \n", " AutoTokenizer,\n", " Trainer, \n", " TrainingArguments,\n", " DataCollatorForLanguageModeling\n", ")\n", "from datasets import Dataset\n", "from tqdm.notebook import tqdm\n", "\n", "# Check if GPU is available\n", "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n", "print(f\"Using device: {device}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Loading the model and tokenizer\n", "\n", "We'll use the TinyLlama-1.1B-Chat-v1.0 model from Hugging Face." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "source": [ "model_name = \"TinyLlama/TinyLlama-1.1B-Chat-v1.0\"\n", "\n", "# Load tokenizer\n", "tokenizer = AutoTokenizer.from_pretrained(model_name)\n", "# Ensure the tokenizer has a padding token\n", "if tokenizer.pad_token is None:\n", " tokenizer.pad_token = tokenizer.eos_token\n", "\n", "# Load model with reduced precision to save memory\n", "model = AutoModelForCausalLM.from_pretrained(\n", " model_name,\n", " torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,\n", " low_cpu_mem_usage=True\n", ")\n", "model = model.to(device)\n", "\n", "print(f\"Model and tokenizer loaded: {model_name}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Preparing the dataset\n", "\n", "Let's load our example training data and format it properly for fine-tuning." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "source": [ "# Load example data\n", "with open('example_training_data.json', 'r', encoding='utf-8') as f:\n", " data = json.load(f)\n", "\n", "# Format data for instruction fine-tuning\n", "formatted_data = []\n", "for item in data:\n", " # Format as a chat-like conversation\n", " formatted_text = f\"<|im_start|>user\\n{item['instruction']}<|im_end|>\\n<|im_start|>assistant\\n{item['response']}<|im_end|>\"\n", " formatted_data.append({\"text\": formatted_text})\n", "\n", "# Create a Hugging Face dataset\n", "dataset = Dataset.from_list(formatted_data)\n", "print(f\"Dataset created with {len(dataset)} examples\")\n", "\n", "# Show an example\n", "print(\"\\nExample entry:\")\n", "print(dataset[0]['text'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "source": [ "# Tokenize the dataset\n", "def tokenize_function(examples):\n", " return tokenizer(examples[\"text\"], padding=\"max_length\", truncation=True, max_length=512)\n", "\n", "# Add labels for causal language modeling\n", "def add_labels(examples):\n", " examples[\"labels\"] = examples[\"input_ids\"].copy()\n", " return examples\n", "\n", "# Process dataset\n", "tokenized_dataset = dataset.map(tokenize_function, batched=True)\n", "tokenized_dataset = tokenized_dataset.map(add_labels, batched=True)\n", "tokenized_dataset = tokenized_dataset.remove_columns([\"text\"])\n", "\n", "# Split into training and evaluation sets\n", "tokenized_dataset = tokenized_dataset.train_test_split(test_size=0.1)\n", "print(f\"Training examples: {len(tokenized_dataset['train'])}\")\n", "print(f\"Evaluation examples: {len(tokenized_dataset['test'])}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Fine-tuning the model\n", "\n", "Now we'll set up the training configuration and fine-tune the model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "source": [ "# Set up training arguments\n", "output_dir = \"./fine_tuned_tinyllama\"\n", "\n", "training_args = TrainingArguments(\n", " output_dir=output_dir,\n", " overwrite_output_dir=True,\n", " num_train_epochs=3, # Adjust based on your dataset size\n", " per_device_train_batch_size=2, # Adjust based on your GPU memory\n", " per_device_eval_batch_size=2,\n", " gradient_accumulation_steps=4, # Accumulate gradients to simulate larger batch size\n", " learning_rate=2e-5,\n", " weight_decay=0.01,\n", " logging_dir=f\"{output_dir}/logs\",\n", " logging_steps=10,\n", " eval_steps=100,\n", " save_steps=100,\n", " save_total_limit=2, # Only keep the 2 best checkpoints\n", " evaluation_strategy=\"steps\",\n", " fp16=torch.cuda.is_available(), # Use mixed precision if GPU is available\n", " warmup_steps=100,\n", " report_to=\"none\", # Disable reporting to wandb, etc.\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "source": [ "# Set up data collator\n", "data_collator = DataCollatorForLanguageModeling(\n", " tokenizer=tokenizer,\n", " mlm=False # We're doing causal language modeling, not masked language modeling\n", ")\n", "\n", "# Set up trainer\n", "trainer = Trainer(\n", " model=model,\n", " args=training_args,\n", " train_dataset=tokenized_dataset[\"train\"],\n", " eval_dataset=tokenized_dataset[\"test\"],\n", " data_collator=data_collator,\n", ")\n", "\n", "# Train the model\n", "print(\"Starting fine-tuning...\")\n", "trainer.train()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. 
Evaluating the results\n", "\n", "Let's evaluate the fine-tuned model on some test prompts." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "source": [ "# Test the model with a few prompts\n", "test_prompts = [\n", "    \"Explain the concept of neural networks.\",\n", "    \"Write a short story about a robot that learns to feel emotions.\",\n", "    \"What are three sustainable energy sources and how do they work?\"\n", "]\n", "\n", "# Format prompts with the same chat tags used for fine-tuning\n", "formatted_prompts = [f\"<|im_start|>user\\n{prompt}<|im_end|>\\n<|im_start|>assistant\\n\" for prompt in test_prompts]\n", "\n", "# Generate responses\n", "for i, prompt in enumerate(formatted_prompts):\n", "    print(f\"\\n\\nPrompt {i+1}: {test_prompts[i]}\")\n", "    print(\"\\nGenerating response...\")\n", "\n", "    inputs = tokenizer(prompt, return_tensors=\"pt\").to(device)\n", "\n", "    with torch.no_grad():\n", "        outputs = model.generate(\n", "            inputs.input_ids,\n", "            attention_mask=inputs.attention_mask,\n", "            max_new_tokens=256,\n", "            temperature=0.7,\n", "            do_sample=True,\n", "            pad_token_id=tokenizer.eos_token_id\n", "        )\n", "\n", "    # Get only the newly generated text (not the prompt)\n", "    response_text = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)\n", "\n", "    print(f\"Response: {response_text}\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Saving and using the fine-tuned model\n", "\n", "Let's save the fine-tuned model, then see how you can load and use it in the future." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "source": [ "# Save the fine-tuned model\n", "trainer.save_model(output_dir)\n", "tokenizer.save_pretrained(output_dir)\n", "print(f\"Model saved to {output_dir}\")" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "source": [ "# Load the fine-tuned model and tokenizer\n", "def load_fine_tuned_model(model_path):\n", "    tokenizer = AutoTokenizer.from_pretrained(model_path)\n", "    model = AutoModelForCausalLM.from_pretrained(\n", "        model_path,\n", "        torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32\n", "    )\n", "    model = model.to(device)\n", "    return model, tokenizer\n", "\n", "# Example of loading the model (uncomment to run)\n", "# fine_tuned_model, fine_tuned_tokenizer = load_fine_tuned_model(output_dir)\n", "\n", "# Function to generate a response\n", "def generate_response(model, tokenizer, prompt, max_new_tokens=256, temperature=0.7):\n", "    # Format the prompt with the same chat tags used for fine-tuning\n", "    formatted_prompt = f\"<|im_start|>user\\n{prompt}<|im_end|>\\n<|im_start|>assistant\\n\"\n", "\n", "    # Tokenize\n", "    inputs = tokenizer(formatted_prompt, return_tensors=\"pt\").to(device)\n", "\n", "    # Generate\n", "    with torch.no_grad():\n", "        outputs = model.generate(\n", "            inputs.input_ids,\n", "            attention_mask=inputs.attention_mask,\n", "            max_new_tokens=max_new_tokens,\n", "            temperature=temperature,\n", "            do_sample=True,\n", "            pad_token_id=tokenizer.eos_token_id\n", "        )\n", "\n", "    # Decode\n", "    full_response = tokenizer.decode(outputs[0], skip_special_tokens=True)\n", "\n", "    # Extract the assistant's response\n", "    try:\n", "        assistant_response = full_response.split(\"<|im_start|>assistant\\n\")[1].split(\"<|im_end|>\")[0]\n", "    except IndexError:\n", "        assistant_response = full_response.replace(prompt, \"\").strip()\n", "\n", "    return assistant_response\n", "\n", "# Example usage (uncomment to run)\n", "# response = generate_response(fine_tuned_model, fine_tuned_tokenizer, \"Explain quantum computing.\")\n", "# print(response)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n",
"\n", "You've successfully fine-tuned a TinyLlama model on a custom dataset! You can now use this model for various applications:\n", "\n", "1. Integrate it into a chatbot or virtual assistant\n", "2. Use it for content generation\n", "3. Deploy it as part of a web application\n", "4. Fine-tune it further on more specific data\n", "\n", "You can also experiment with different hyperparameters and training strategies to improve results." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 4 }