## Welcome to the Second Lab - Week 1, Day 3

Today we will work with lots of models! This is a way to get comfortable with APIs.

This notebook extends the original by adding a reviewer pattern to evaluate the impact on model performance.

In the new workflow, each model's answer is provided to a "reviewer LLM" who is prompted to "Evaluate the response for clarity and strength of argument, and provide constructive suggestions for improving the answer." Each model is then given the chance to revise its answer based on the feedback but is also told, "You are not required to take any of the feedback into account, but you want to win the competition."
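
The answer, review, revise loop described above can be sketched with stubbed calls. This is only an illustration: `answer_question`, `review`, and `revise` are hypothetical stand-ins that return canned strings rather than calling any model, and the dictionary keys simply mirror the fields used later in this notebook.

```python
import asyncio

# Hypothetical stand-ins for real LLM calls; each just returns a canned string.
async def answer_question(question: str) -> str:
    return f"Draft answer to: {question}"

async def review(answer: str) -> str:
    return f"Feedback on '{answer}': add a concrete example."

async def revise(question: str, answer: str, feedback: str) -> str:
    # A real answerer may ignore the feedback; this stub always applies it.
    return f"{answer} (revised after: {feedback})"

async def answer_review_revise(question: str) -> dict[str, str]:
    # Each step depends on the previous one, so the steps run in sequence.
    answer = await answer_question(question)
    feedback = await review(answer)
    revised = await revise(question, answer, feedback)
    return {"answer": answer, "feedback": feedback, "revised_answer": revised}

result = asyncio.run(answer_review_revise("Is a hot dog a sandwich?"))
print(result["revised_answer"])
```

Inside a notebook, where an event loop is already running, you would write `await answer_review_revise(...)` instead of `asyncio.run(...)`.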

<table>
  <caption style="font-size: 1.2em; margin-bottom: 10px;"><strong>Results for Representative Run</strong></caption>
  <thead>
    <tr>
      <th>Model</th>
      <th>Original Rank</th>
      <th>Exclusive Feedback</th>
      <th>With Feedback (all models)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>gpt-4o-mini</td>
      <td>2</td>
      <td>3</td>
      <td>4</td>
    </tr>
    <tr>
      <td>claude-3-7-sonnet-latest</td>
      <td>6</td>
      <td>1</td>
      <td>1</td>
    </tr>
    <tr>
      <td>gemini-2.0-flash</td>
      <td>1</td>
      <td>1</td>
      <td>2</td>
    </tr>
    <tr>
      <td>deepseek-chat</td>
      <td>3</td>
      <td>2</td>
      <td>3</td>
    </tr>
    <tr>
      <td>llama-3.3-70b-versatile</td>
      <td>4</td>
      <td>3</td>
      <td>5</td>
    </tr>
    <tr>
      <td>llama3.2</td>
      <td>5</td>
      <td>4</td>
      <td>6</td>
    </tr>
  </tbody>
</table>

The workflow is non-deterministic and results can vary greatly from run to run, but introducing a reviewer appeared to have a generally positive impact on performance. The table above shows the results for a representative run. The "Exclusive Feedback" column gives each model's rank against the other models when it alone received feedback; the last column gives the ranking when ALL models received feedback. Exclusive use of feedback improved the ranking for four of the six models, left it unchanged for one, and lowered it for one.
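
To tally the rank changes in the table, a small sketch like the following may help. The numbers are copied from the "Original Rank" and "Exclusive Feedback" columns above; the `classify` helper and `summary` dictionary are names made up for this illustration.

```python
# Ranks copied from the table above: (original rank, rank with exclusive feedback)
ranks = {
    "gpt-4o-mini": (2, 3),
    "claude-3-7-sonnet-latest": (6, 1),
    "gemini-2.0-flash": (1, 1),
    "deepseek-chat": (3, 2),
    "llama-3.3-70b-versatile": (4, 3),
    "llama3.2": (5, 4),
}

def classify(original: int, exclusive: int) -> str:
    # A lower rank number is better, so a smaller value means improvement.
    if exclusive < original:
        return "improved"
    if exclusive > original:
        return "worsened"
    return "unchanged"

summary = {"improved": 0, "unchanged": 0, "worsened": 0}
for model, (original, exclusive) in ranks.items():
    summary[classify(original, exclusive)] += 1

print(summary)
```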

Inspired by some other contributions, this worksheet also makes LLM calls asynchronously to reduce wait time.
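
As a rough illustration of why asynchronous calls reduce wait time, the sketch below runs three simulated calls concurrently with `asyncio.gather`. The `fake_llm_call` function and its delays are assumptions for the demo; `asyncio.sleep` stands in for network latency.

```python
import asyncio
import time

async def fake_llm_call(name: str, delay: float) -> str:
    # asyncio.sleep stands in for the network latency of a real API call.
    await asyncio.sleep(delay)
    return f"{name} done"

async def main() -> tuple[list[str], float]:
    start = time.perf_counter()
    # gather runs the coroutines concurrently and returns results in input order.
    results = await asyncio.gather(
        fake_llm_call("model-a", 0.2),
        fake_llm_call("model-b", 0.1),
        fake_llm_call("model-c", 0.2),
    )
    return list(results), time.perf_counter() - start

results, elapsed = asyncio.run(main())
print(results, f"{elapsed:.2f}s")
```

The total time should be close to the slowest single call rather than the sum of all three. In a notebook cell you can `await main()` directly instead of using `asyncio.run`.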

In [23]:
# Start with imports - ask ChatGPT to explain any package that you don't know
#!uv add prettytable

import os
import asyncio
import json
from dotenv import load_dotenv
from openai import OpenAI, AsyncOpenAI
from anthropic import AsyncAnthropic
from IPython.display import display
from pydantic import BaseModel, Field
from string import Template
from prettytable import PrettyTable




In [24]:
class LLMResult(BaseModel):
    model: str
    answer: str
    feedback: str | None =Field(
        default = None, 
        description="Mutable field.  This will be set by the reviewer.")
    revised_answer: str | None =Field(
        default = None, 
        description="Mutable field.  This will be set by the answerer after the reviewer has provided feedback.")
    original_rank: int | None =Field(
        default = None, 
        description="Mutable field.  Rank when no feedback is used by any models.")
    exclusive_feedback: int | None =Field(
        default = None, 
        description="Mutable field.  Rank when only this model used feedback.")
    revised_rank: int | None =Field(
        default = None, 
        description="Mutable field.  Rank when all models used feedback.")

results : list[LLMResult] = []


In [None]:
# Always remember to do this!
load_dotenv(override=True)

In [None]:
# Print the key prefixes to help with any debugging

openai_api_key = os.getenv('OPENAI_API_KEY')
anthropic_api_key = os.getenv('ANTHROPIC_API_KEY')
google_api_key = os.getenv('GOOGLE_API_KEY')
deepseek_api_key = os.getenv('DEEPSEEK_API_KEY')
groq_api_key = os.getenv('GROQ_API_KEY')

if openai_api_key:
    print(f"OpenAI API Key exists and begins {openai_api_key[:8]}")
else:
    print("OpenAI API Key not set")
    
if anthropic_api_key:
    print(f"Anthropic API Key exists and begins {anthropic_api_key[:7]}")
else:
    print("Anthropic API Key not set (and this is optional)")

if google_api_key:
    print(f"Google API Key exists and begins {google_api_key[:2]}")
else:
    print("Google API Key not set (and this is optional)")

if deepseek_api_key:
    print(f"DeepSeek API Key exists and begins {deepseek_api_key[:3]}")
else:
    print("DeepSeek API Key not set (and this is optional)")

if groq_api_key:
    print(f"Groq API Key exists and begins {groq_api_key[:4]}")
else:
    print("Groq API Key not set (and this is optional)")

In [27]:
request = "Please come up with a challenging, nuanced question that I can ask a number of LLMs to evaluate their intelligence. "
request += "Answer only with the question, no explanation."
messages = [{"role": "user", "content": request}]

In [None]:
messages

In [None]:
openai = OpenAI()
response = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
)
question = response.choices[0].message.content
print(question)


In [30]:
competitors = []
answers = []
messages = [{"role": "user", "content": question}]

In [31]:
# The API we know well

async def openai_answer(messages: list[dict[str, str]], model_name : str) -> str:
    openai = AsyncOpenAI()
    response = await openai.chat.completions.create(model=model_name, messages=messages)
    answer = response.choices[0].message.content
    print(f"{model_name} answer: {answer[:50]}...")
    return answer


In [32]:
# Anthropic has a slightly different API, and Max Tokens is required

async def claude_anthropic_answer(messages: list[dict[str, str]], model_name : str) -> str:
    claude = AsyncAnthropic()
    response = await claude.messages.create(model=model_name, messages=messages, max_tokens=1000)
    answer = response.content[0].text
    print(f"{model_name} answer: {answer[:50]}...")
    return answer


In [33]:
async def gemini_google_answer(messages: list[dict[str, str]], model_name : str) -> str:     
    gemini = AsyncOpenAI(api_key=google_api_key, base_url="https://generativelanguage.googleapis.com/v1beta/openai/")
    response = await gemini.chat.completions.create(model=model_name, messages=messages)
    answer = response.choices[0].message.content.strip()
    print(f"{model_name} answer: {answer[:50]}...")
    return answer


In [34]:
async def deepseek_answer(messages: list[dict[str, str]], model_name : str) -> str:
    deepseek = AsyncOpenAI(api_key=deepseek_api_key, base_url="https://api.deepseek.com/v1")
    response = await deepseek.chat.completions.create(model=model_name, messages=messages)
    answer = response.choices[0].message.content
    print(f"{model_name} answer: {answer[:50]}...")
    return answer


In [35]:
async def groq_answer(messages: list[dict[str, str]], model_name : str) -> str:
    groq = AsyncOpenAI(api_key=groq_api_key, base_url="https://api.groq.com/openai/v1")
    response = await groq.chat.completions.create(model=model_name, messages=messages)
    answer = response.choices[0].message.content
    print(f"{model_name} answer: {answer[:50]}...")
    return answer


## For the next cell, we will use Ollama

Ollama runs a local web service that exposes an OpenAI-compatible endpoint,  
and runs models locally using high-performance C++ code.

If you don't have Ollama, install it by visiting https://ollama.com, pressing Download and following the instructions.

After it's installed, you should be able to visit http://localhost:11434 and see the message "Ollama is running".

You might need to restart Cursor (and maybe reboot). Then open a Terminal (control+\`) and run `ollama serve`

Useful Ollama commands (run these in the terminal, or with an exclamation mark in this notebook):

`ollama pull <model_name>` downloads a model locally  
`ollama ls` lists all the models you've downloaded  
`ollama rm <model_name>` deletes the specified model from your downloads
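
If you would rather check from Python that the local server is up, a minimal sketch using only the standard library could look like this. The helper name `ollama_is_running` is hypothetical, and the check assumes the server answers at its root URL with the "Ollama is running" message mentioned above.

```python
import http.client
import urllib.error
import urllib.request

def ollama_is_running(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if the server at base_url answers with the Ollama banner."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure or malformed URL: treat as not running.
        return False
    return "Ollama is running" in body

print("Ollama reachable:", ollama_is_running())
```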

<table style="margin: 0; text-align: left; width:100%">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/stop.png" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#ff7800;">Super important - ignore me at your peril!</h2>
            <span style="color:#ff7800;">The model called <b>llama3.3</b> is FAR too large for home computers - it's not intended for personal computing and will consume all your resources! Stick with the nicely sized <b>llama3.2</b> or <b>llama3.2:1b</b> and if you want larger, try llama3.1 or smaller variants of Qwen, Gemma, Phi or DeepSeek. See <a href="https://ollama.com/models">the Ollama models page</a> for a full list of models and sizes.
            </span>
        </td>
    </tr>
</table>

In [36]:
#!ollama pull llama3.2

In [37]:
async def ollama_answer(messages: list[dict[str, str]], model_name : str) -> str:
    ollama = AsyncOpenAI(base_url='http://localhost:11434/v1', api_key='ollama')
    response = await ollama.chat.completions.create(model=model_name, messages=messages)
    answer = response.choices[0].message.content
    print(f"{model_name} answer: {answer[:50]}...")
    return answer


In [None]:
answerers  = [openai_answer, claude_anthropic_answer, gemini_google_answer, deepseek_answer, groq_answer, ollama_answer]
models = ["gpt-4o-mini", "claude-3-7-sonnet-latest", "gemini-2.0-flash", "deepseek-chat", "llama-3.3-70b-versatile", "llama3.2"]

tasks = [ answerer(messages, model) for answerer, model in zip(answerers, models)]
answers : list[str] = await asyncio.gather(*tasks)
results : list[LLMResult] = [LLMResult(model=model, answer=answer) for model, answer in zip(models, answers)]


In [None]:
answers 

In [40]:
reviewer = f"""You are reviewing a submission for a writing competition.  The participant has been given this question to answer:

{question}

Your job is to evaluate the response for clarity and strength of argument, and provide constructive suggestions for improving the answer.
Limit your feedback to 200 words.

Here is the participant's answer:
{{answer}}
"""

async def review_answer(answer : str) -> str:
    openai = AsyncOpenAI()
    reviewer_messages = [{"role": "user", "content": reviewer.format(answer=answer)}]
    reviewer_response = await openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=reviewer_messages,
    )
    feedback = reviewer_response.choices[0].message.content
    print(f"feedback: {feedback[:50]}...")
    return feedback

In [None]:
import asyncio

tasks = [review_answer(answer) for answer in answers]
feedback = await asyncio.gather(*tasks)

# Use a distinct loop variable so the `feedback` list is not overwritten by its last element
for result, review in zip(results, feedback):
    result.feedback = review


In [42]:
revision_prompt = f"""You are revising a submission you wrote for a writing competition based on feedback from a reviewer.

You are not required to take any of the feedback into account but you want to win the competition.

The question was: 
{question}

The feedback was:
{{feedback}}

And your original answer was:
{{answer}}

Please return your revised answer and nothing else.
"""
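
The prompts above are filled in two stages: the f-string inserts the question immediately, while the doubled braces leave a placeholder for a later `.format` call. A minimal sketch of that mechanism, using a made-up question and answer:

```python
question = "Why is the sky blue?"  # hypothetical question for illustration

# Stage 1: the f-string fills {question} now and turns {{answer}} into a literal {answer}.
prompt = f"Question: {question}\nAnswer: {{answer}}"

# Stage 2: str.format fills the remaining {answer} placeholder later.
filled = prompt.format(answer="Rayleigh scattering.")
print(filled)
```

One caveat of this approach: if the generated question itself contains curly braces, the later `.format` call will try to interpret them as placeholders.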


In [None]:
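# Build one single-message conversation per model from that model's own answer and feedback
revision_messages = [
    [{"role": "user", "content": revision_prompt.format(answer=result.answer, feedback=result.feedback)}]
    for result in results
]
tasks = [answerer(model_messages, model) for answerer, model_messages, model in zip(answerers, revision_messages, models)]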
revised_answers = await asyncio.gather(*tasks)

for revised_answer, result in zip(revised_answers, results):
    result.revised_answer = revised_answer



In [44]:
# need to use Template because we are making a later substitution for "together"
judge = Template(f"""You are judging a competition between {len(results)} competitors.
Each model has been given this question:

{question}

Your job is to evaluate each response for clarity and strength of argument, and rank them in order of best to worst.
Respond with JSON, and only JSON, with the following format:
{{"results": ["best competitor number", "second best competitor number", "third best competitor number", ...]}}

Here are the responses from each competitor:

$together

Now respond with the JSON with the ranked order of the competitors, nothing else. Do not include markdown formatting or code blocks.""")




In [45]:
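# Placeholder only: judge is a Template, so the real messages are built per run in come_together() below
judge_messages = [{"role": "user", "content": judge.template}]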

In [46]:
def come_together(results : list[LLMResult], revised_entry : int | None ) -> list[dict[str, str]]:
    # Use the revised answer for competitor `revised_entry`, or for every competitor if revised_entry is None.
    # An index that matches no competitor (e.g. -1) selects only original answers.
    together = ""
    for index, result in enumerate(results):
        use_revised = revised_entry is None or index == revised_entry
        together += f"# Response from competitor {index}\n\n"
        # Parenthesize the choice so the separator is appended to both kinds of answer
        together += (result.revised_answer if use_revised else result.answer) + "\n\n"
    return [{"role": "user", "content": judge.substitute(together=together)}]


# Judgement time!
async def judgement_time(results : list[LLMResult], revised_entry : int | None ) -> dict[int, int]:
    judge_messages = come_together(results, revised_entry)

    openai = AsyncOpenAI()
    response = await openai.chat.completions.create(
        model="o3-mini",
        messages=judge_messages,
    )
    # Parse the judge's JSON and map each competitor number to its 1-based rank
    ranking = json.loads(response.choices[0].message.content)
    return {int(competitor): rank + 1 for rank, competitor in enumerate(ranking["results"])}
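
The judge is asked for bare JSON, but a reply can still arrive wrapped in a Markdown code fence. The sketch below shows one tolerant way to turn the reply into a competitor-to-rank mapping. `parse_ranking` is a hypothetical helper, not part of the notebook, and the fence handling is an assumption about how a model might misformat its reply.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker, built this way to keep the example readable

def parse_ranking(raw: str) -> dict[int, int]:
    """Map competitor number -> rank (1 = best) from the judge's JSON reply."""
    text = raw.strip()
    # Assumption: some models wrap JSON in a code fence despite the instructions.
    if text.startswith(FENCE):
        text = text.strip("`")
        text = text.removeprefix("json").strip()
    ordered = json.loads(text)["results"]
    # The list is ordered best to worst, so position + 1 is the rank.
    return {int(competitor): rank for rank, competitor in enumerate(ordered, start=1)}

print(parse_ranking('{"results": ["2", "0", "1"]}'))
```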



In [47]:
# Evaluate the impact of feedback on model performance

no_feedback = await judgement_time(results, -1)  # -1 matches no competitor: all original answers
with_feedback = await judgement_time(results, None)  # None: all revised answers

# One judging run per model, in which only that model's answer is revised
tasks = [ judgement_time(results, i) for i in range(len(results))]
model_specific_feedback = await asyncio.gather(*tasks)

for index, result in enumerate(results):
    result.original_rank = no_feedback[index]
    result.exclusive_feedback = model_specific_feedback[index][index]
    result.revised_rank = with_feedback[index]



In [None]:

table = PrettyTable()
table.field_names = ["Model", "Original Rank", "Exclusive Feedback", "With Feedback (all models)"]

for result in results:
    table.add_row([result.model, result.original_rank, result.exclusive_feedback, result.revised_rank])

print(table)



<table style="margin: 0; text-align: left; width:100%">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/exercise.png" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#ff7800;">Exercise</h2>
            <span style="color:#ff7800;">Which pattern(s) did this use? Try updating this to add another Agentic design pattern.
            </span>
        </td>
    </tr>
</table>

<table style="margin: 0; text-align: left; width:100%">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/business.png" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#00bfff;">Commercial implications</h2>
            <span style="color:#00bfff;">These kinds of patterns - to send a task to multiple models, and evaluate results,
            are common where you need to improve the quality of your LLM response. This approach can be universally applied
            to business projects where accuracy is critical.
            </span>
        </td>
    </tr>
</table>