{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Welcome to the Second Lab - Week 1, Day 3\n", "\n", "Today we will work with lots of models! This is a way to get comfortable with APIs." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook extends the original by adding a reviewer pattern to evaluate the impact on model performance.\n", "\n", "In the new workflow, each model's answer is provided to a \"reviewer LLM\" who is prompted to \"Evaluate the response for clarity and strength of argument, and provide constructive suggestions for improving the answer.\" Each model is then given the chance to revise its answer based on the feedback but is also told, \"You are not required to take any of the feedback into account, but you want to win the competition.\"\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
**Results for Representative Run**

| Model | Original Rank | Exclusive Feedback | With Feedback (all models) |
| --- | --- | --- | --- |
| gpt-4o-mini | 2 | 3 | 4 |
| claude-3-7-sonnet-latest | 6 | 1 | 1 |
| gemini-2.0-flash | 1 | 1 | 2 |
| deepseek-chat | 3 | 2 | 3 |
| llama-3.3-70b-versatile | 4 | 3 | 5 |
| llama3.2 | 5 | 4 | 6 |
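
In outline, the loop that produced these rankings looks like the sketch below. The helper names (`ask`, `review`, `revise`, `judge`) are hypothetical stand-ins for the API calls implemented in the cells that follow, which also run the per-model calls concurrently.

```python
# Simplified, synchronous sketch of the review-and-revise workflow.
# ask / review / revise / judge are hypothetical placeholders for the real API calls below.
def ask(model: str, question: str) -> str: ...
def review(answer: str) -> str: ...
def revise(model: str, answer: str, feedback: str) -> str: ...
def judge(answers: list[str]) -> list[int]: ...

def run_competition(models: list[str], question: str) -> list[int]:
    answers = [ask(m, question) for m in models]          # 1. every model answers the question
    feedback = [review(a) for a in answers]               # 2. a reviewer LLM critiques each answer
    revised = [revise(m, a, f)                            # 3. each model may revise its own answer
               for m, a, f in zip(models, answers, feedback)]
    return judge(revised)                                 # 4. a judge LLM ranks the final answers
```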
\n", "\n", "The workflow is obviously non-deterministic and the results can vary greatly from run to run, but the introduction of a reviewer appeared to have a generaly positive impact on performance. The table above shows the results for a representative run. It compares each model's rank versus the other models when it exclusively received feedback. The table also shows the ranking when ALL models received feedback. Exclusive use of feedback improved a model's ranking for five out of six models and decreased it for one model.\n", "\n", "Inspired by some other contributions, this worksheet also makes LLM calls asyncrhonously to reduce wait time." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "# Start with imports - ask ChatGPT to explain any package that you don't know\n", "#!uv add prettytable\n", "\n", "import os\n", "import asyncio\n", "import json\n", "from dotenv import load_dotenv\n", "from openai import OpenAI, AsyncOpenAI\n", "from anthropic import AsyncAnthropic\n", "from IPython.display import display\n", "from pydantic import BaseModel, Field\n", "from string import Template\n", "from prettytable import PrettyTable\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "class LLMResult(BaseModel):\n", " model: str\n", " answer: str\n", " feedback: str | None =Field(\n", " default = None, \n", " description=\"Mutable field. This will be set by the reviewer.\")\n", " revised_answer: str | None =Field(\n", " default = None, \n", " description=\"Mutable field. This will be set by the answerer after the reviewer has provided feedback.\")\n", " original_rank: int | None =Field(\n", " default = None, \n", " description=\"Mutable field. Rank when no feedback is used by any models.\")\n", " exclusive_feedback: str | None =Field(\n", " default = None, \n", " description=\"Mutable field. Rank when only this model used feedback.\")\n", " revised_rank: int | None =Field(\n", " default = None, \n", " description=\"Mutable field. 
Rank when all models used feedback.\")\n", "\n", "results : list[LLMResult] = []\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Always remember to do this!\n", "load_dotenv(override=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Print the key prefixes to help with any debugging\n", "\n", "openai_api_key = os.getenv('OPENAI_API_KEY')\n", "anthropic_api_key = os.getenv('ANTHROPIC_API_KEY')\n", "google_api_key = os.getenv('GOOGLE_API_KEY')\n", "deepseek_api_key = os.getenv('DEEPSEEK_API_KEY')\n", "groq_api_key = os.getenv('GROQ_API_KEY')\n", "\n", "if openai_api_key:\n", " print(f\"OpenAI API Key exists and begins {openai_api_key[:8]}\")\n", "else:\n", " print(\"OpenAI API Key not set\")\n", " \n", "if anthropic_api_key:\n", " print(f\"Anthropic API Key exists and begins {anthropic_api_key[:7]}\")\n", "else:\n", " print(\"Anthropic API Key not set (and this is optional)\")\n", "\n", "if google_api_key:\n", " print(f\"Google API Key exists and begins {google_api_key[:2]}\")\n", "else:\n", " print(\"Google API Key not set (and this is optional)\")\n", "\n", "if deepseek_api_key:\n", " print(f\"DeepSeek API Key exists and begins {deepseek_api_key[:3]}\")\n", "else:\n", " print(\"DeepSeek API Key not set (and this is optional)\")\n", "\n", "if groq_api_key:\n", " print(f\"Groq API Key exists and begins {groq_api_key[:4]}\")\n", "else:\n", " print(\"Groq API Key not set (and this is optional)\")" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "request = \"Please come up with a challenging, nuanced question that I can ask a number of LLMs to evaluate their intelligence. \"\n", "request += \"Answer only with the question, no explanation.\"\n", "messages = [{\"role\": \"user\", \"content\": request}]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "messages" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "openai = OpenAI()\n", "response = openai.chat.completions.create(\n", " model=\"gpt-4o-mini\",\n", " messages=messages,\n", ")\n", "question = response.choices[0].message.content\n", "print(question)\n" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "competitors = []\n", "answers = []\n", "messages = [{\"role\": \"user\", \"content\": question}]" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "# The API we know well\n", "\n", "async def openai_answer(messages: list[dict[str, str]], model_name : str) -> str:\n", " openai = AsyncOpenAI()\n", " response = await openai.chat.completions.create(model=model_name, messages=messages)\n", " answer = response.choices[0].message.content\n", " print(f\"{model_name} answer: {answer[:50]}...\")\n", " return answer\n" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "# Anthropic has a slightly different API, and Max Tokens is required\n", "\n", "async def claude_anthropic_answer(messages: list[dict[str, str]], model_name : str) -> str:\n", " claude = AsyncAnthropic()\n", " response = await claude.messages.create(model=model_name, messages=messages, max_tokens=1000)\n", " answer = response.content[0].text\n", " print(f\"{model_name} answer: {answer[:50]}...\")\n", " return answer\n" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ 
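"# Gemini exposes an OpenAI-compatible endpoint, so we can reuse the AsyncOpenAI client and just point it at Google's base_url\n",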
"async def gemini_google_answer(messages: list[dict[str, str]], model_name : str) -> str: \n", " gemini = AsyncOpenAI(api_key=google_api_key, base_url=\"https://generativelanguage.googleapis.com/v1beta/openai/\")\n", " response = await gemini.chat.completions.create(model=model_name, messages=messages)\n", " answer = response.choices[0].message.content.strip()\n", " print(f\"{model_name} answer: {answer[:50]}...\")\n", " return answer\n" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "async def deepseek_answer(messages: list[dict[str, str]], model_name : str) -> str:\n", " deepseek = AsyncOpenAI(api_key=deepseek_api_key, base_url=\"https://api.deepseek.com/v1\")\n", " response = await deepseek.chat.completions.create(model=model_name, messages=messages)\n", " answer = response.choices[0].message.content\n", " print(f\"{model_name} answer: {answer[:50]}...\")\n", " return answer\n" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "async def groq_answer(messages: list[dict[str, str]], model_name : str) -> str:\n", " groq = AsyncOpenAI(api_key=groq_api_key, base_url=\"https://api.groq.com/openai/v1\")\n", " response = await groq.chat.completions.create(model=model_name, messages=messages)\n", " answer = response.choices[0].message.content\n", " print(f\"{model_name} answer: {answer[:50]}...\")\n", " return answer\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## For the next cell, we will use Ollama\n", "\n", "Ollama runs a local web service that gives an OpenAI compatible endpoint, \n", "and runs models locally using high performance C++ code.\n", "\n", "If you don't have Ollama, install it here by visiting https://ollama.com then pressing Download and following the instructions.\n", "\n", "After it's installed, you should be able to visit here: http://localhost:11434 and see the message \"Ollama is running\"\n", "\n", "You might need to restart Cursor (and maybe reboot). Then open a Terminal (control+\\`) and run `ollama serve`\n", "\n", "Useful Ollama commands (run these in the terminal, or with an exclamation mark in this notebook):\n", "\n", "`ollama pull ` downloads a model locally \n", "`ollama ls` lists all the models you've downloaded \n", "`ollama rm ` deletes the specified model from your downloads" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", " \n", " \n", " \n", " \n", "
\n", " \n", " \n", "

Super important - ignore me at your peril!

\n", " The model called llama3.3 is FAR too large for home computers - it's not intended for personal computing and will consume all your resources! Stick with the nicely sized llama3.2 or llama3.2:1b and if you want larger, try llama3.1 or smaller variants of Qwen, Gemma, Phi or DeepSeek. See the the Ollama models page for a full list of models and sizes.\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "#!ollama pull llama3.2" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "async def ollama_answer(messages: list[dict[str, str]], model_name : str) -> str:\n", " ollama = AsyncOpenAI(base_url='http://localhost:11434/v1', api_key='ollama')\n", " response = await ollama.chat.completions.create(model=model_name, messages=messages)\n", " answer = response.choices[0].message.content\n", " print(f\"{model_name} answer: {answer[:50]}...\")\n", " return answer\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "answerers = [openai_answer, claude_anthropic_answer, gemini_google_answer, deepseek_answer, groq_answer, ollama_answer]\n", "models = [\"gpt-4o-mini\", \"claude-3-7-sonnet-latest\", \"gemini-2.0-flash\", \"deepseek-chat\", \"llama-3.3-70b-versatile\", \"llama3.2\"]\n", "\n", "tasks = [ answerer(messages, model) for answerer, model in zip(answerers, models)]\n", "answers : list[str] = await asyncio.gather(*tasks)\n", "results : list[LLMResult] = [LLMResult(model=model, answer=answer) for model, answer in zip(models, answers)]\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "answers " ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "reviewer = f\"\"\"You are reviewing a submission for a writing competition. The particpant has been given this question to answer:\n", "\n", "{question}\n", "\n", "Your job is to evaluate the response for clarity and strength of argument, and provide constructive suggestions for improving the answer.\n", "Limit your feedback to 200 words.\n", "\n", "Here is the particpant's answer:\n", "{{answer}}\n", "\"\"\"\n", "\n", "async def review_answer(answer : str) -> str:\n", " openai = AsyncOpenAI()\n", " reviewer_messages = [{\"role\": \"user\", \"content\": reviewer.format(answer=answer)}]\n", " reviewer_response = await openai.chat.completions.create(\n", " model=\"gpt-4o-mini\",\n", " messages=reviewer_messages,\n", " )\n", " feedback = reviewer_response.choices[0].message.content\n", " print(f\"feedback: {feedback[:50]}...\")\n", " return feedback" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import asyncio\n", "\n", "tasks = [review_answer(answer) for answer in answers]\n", "feedback = await asyncio.gather(*tasks)\n", "\n", "for result, feedback in zip(results, feedback):\n", " result.feedback = feedback\n" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "revision_prompt = f\"\"\"You are revising a submission you wrote for a writing competition based on feedback from a reviewer.\n", "\n", "You are not required to take any of the feedback into account but you want to win the competition.\n", "\n", "The question was: \n", "{question}\n", "\n", "The feedback was:\n", "{{feedback}}\n", "\n", "And your original answer was:\n", "{{answer}}\n", "\n", "Please return your revised answer and nothing else.\n", "\"\"\"\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "messages = [{\"role\": \"user\", \"content\": revision_prompt.format(answer=answer, feedback=feedback)} for answer, feedback in zip(answers, feedback)]\n", "tasks = [ answerer(messages, model) for answerer, model in zip(answerers, models)]\n", "revised_answers = await 
asyncio.gather(*tasks)\n", "\n", "for revised_answer, result in zip(revised_answers, results):\n", " result.revised_answer = revised_answer\n", "\n" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "# need to use Template because we are making a later substitution for \"together\"\n", "judge = Template(f\"\"\"You are judging a competition between {len(results)} competitors.\n", "Each model has been given this question:\n", "\n", "{question}\n", "\n", "Your job is to evaluate each response for clarity and strength of argument, and rank them in order of best to worst.\n", "Respond with JSON, and only JSON, with the following format:\n", "{{\"results\": [\"best competitor number\", \"second best competitor number\", \"third best competitor number\", ...]}}\n", "\n", "Here are the responses from each competitor:\n", "\n", "$together\n", "\n", "Now respond with the JSON with the ranked order of the competitors, nothing else. Do not include markdown formatting or code blocks.\"\"\")\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "judge_messages = [{\"role\": \"user\", \"content\": judge}]" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "def come_together(results : list[LLMResult], revised_entry : int | None ) -> list[dict[str, str]]:\n", " # Use the revised answer for the competitor at index revised_entry only, for every competitor if revised_entry is None,\n", " # or for no competitors if revised_entry is out of range (e.g. -1)\n", " together = \"\"\n", " for index, result in enumerate(results):\n", " together += f\"# Response from competitor {index}\\n\\n\"\n", " together += (result.answer if (index != revised_entry and revised_entry is not None) else result.revised_answer) + \"\\n\\n\"\n", " return [{\"role\": \"user\", \"content\": judge.substitute(together=together)}]\n", "\n", "\n", "# Judgement time!\n", "async def judgement_time(results : list[LLMResult], revised_entry : int | None ) -> dict[int, int]:\n", " judge_messages = come_together(results, revised_entry)\n", "\n", " openai = AsyncOpenAI()\n", " response = await openai.chat.completions.create(\n", " model=\"o3-mini\",\n", " messages=judge_messages,\n", " )\n", " ranking_json = response.choices[0].message.content\n", " results_dict = json.loads(ranking_json)\n", " # Map competitor index -> rank (1 = best)\n", " ranks = { int(model) : int(rank) + 1 for rank, model in enumerate(results_dict[\"results\"]) }\n", " return ranks\n", "\n" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "# Evaluate the impact of feedback on model performance\n", "\n", "no_feedback = await judgement_time(results, -1)\n", "with_feedback = await judgement_time(results, None)\n", "\n", "tasks = [ judgement_time(results, i) for i in range(len(results))]\n", "model_specific_feedback = await asyncio.gather(*tasks)\n", "\n", "for index, result in enumerate(results):\n", " result.original_rank = no_feedback[index]\n", " result.exclusive_feedback = model_specific_feedback[index][index]\n", " result.revised_rank = with_feedback[index]\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "table = PrettyTable()\n", "table.field_names = [\"Model\", \"Original Rank\", \"Exclusive Feedback\", \"With Feedback (all models)\"]\n", "\n", "for result in results:\n", " table.add_row([result.model, result.original_rank, result.exclusive_feedback, result.revised_rank])\n", "\n", "print(table)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [
\n", " \n", " \n", "

Exercise

\n", " Which pattern(s) did this use? Try updating this to add another Agentic design pattern.\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", " \n", " \n", " \n", " \n", "
\n", " \n", " \n", "

Commercial implications

\n", " These kinds of patterns - to send a task to multiple models, and evaluate results,\n", " are common where you need to improve the quality of your LLM response. This approach can be universally applied\n", " to business projects where accuracy is critical.\n", " \n", "
" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.9" } }, "nbformat": 4, "nbformat_minor": 2 }