# **Explore GAIA Questions Data**

Explore the `metadata.jsonl` file to understand the structure and content of the dataset.

#### **Imports**

In [183]:
import os
import re
import json
import random
import psycopg2
import pandas as pd
from collections import Counter, OrderedDict

from dotenv import load_dotenv
from huggingface_hub import login

from langchain.schema import Document
from langchain_community.retrievers import BM25Retriever
from langchain.tools import Tool, StructuredTool
from langchain_core.tools import tool
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import SupabaseVectorStore

from supabase import Client, create_client
from supabase.client import ClientOptions

In [194]:
with open("metadata.jsonl") as dataset_file:
    json_list = list(dataset_file)

QAs = [json.loads(qa) for qa in json_list]
print(f"Number of QAs: {len(QAs)}")
QAs[0]

Number of QAs: 165


{'task_id': 'c61d22de-5f6c-4958-a7f6-5e9707bd3466',
 'Question': 'A paper about AI regulation that was originally submitted to arXiv.org in June 2022 shows a figure with three axes, where each axis has a label word at both ends. Which of these words is used to describe a type of society in a Physics and Society article submitted to arXiv.org on August 11, 2016?',
 'Level': 2,
 'Final answer': 'egalitarian',
 'file_name': '',
 'Annotator Metadata': {'Steps': '1. Go to arxiv.org and navigate to the Advanced Search page.\n2. Enter "AI regulation" in the search box and select "All fields" from the dropdown.\n3. Enter 2022-06-01 and 2022-07-01 into the date inputs, select "Submission date (original)", and submit the search.\n4. Go through the search results to find the article that has a figure with three axes and labels on each end of the axes, titled "Fairness in Agreement With European Values: An Interdisciplinary Perspective on AI Regulation".\n5. Note the six words used as labels: deon

In [89]:
random_samples = random.sample(QAs, 1)
for samp in random_samples:
    print(
        f"TaskId: {samp['task_id']}\nLevel: {samp['Level']}\n"
        f"Question: {samp['Question']}\nGround Truth: {samp['Final answer']}\n"
        f"Additional file: {samp['file_name']}"
    )
    print("Annotator Metadata:")
    print(" - Steps:")
    metadata = samp['Annotator Metadata']
    steps = metadata['Steps'].split("\n")
    for step in steps:
        print(f"    {step}")
    print(f" - Number of steps: {metadata['Number of steps']}")
    print(f" - How long did this take: {metadata['How long did this take?']}")
    tools = metadata['Tools'].split("\n")
    print(f" - Tools [{len(tools)}]:")
    for t in tools:
        print(f"    {t}")
    print(f" - Number of tools: {metadata['Number of tools']}")


TaskId: 7a4a336d-dcfa-45a0-b014-824c7619e8de
Level: 2
Question: At the two-minute mark in the YouTube video uploaded by the channel “GameGrumps” on May 14, 2017 as part of their playthrough of the game Mario Kart 8 Deluxe, the shows’ hosts are competing on one of the game’s racetracks. What was the world record time for that track in the game’s 150cc mode as of June 7, 2023? Express your answer in minutes and seconds, rounding the seconds to the nearest hundredth, e.g. 1:01.001.
Ground Truth: 1:41.614
Additional file: 
Annotator Metadata:
 - Steps:
    1. Search the web for “gamegrumps mario kart 8 deluxe may 14 2017”.
    2. Click on the YouTube video result.
    3. Navigate to two minutes into the video.
    4. Scroll further back until I see the name of the racecourse, Yoshi Circuit.
    5. Search the web for “mario kart 8 deluxe yoshi circuit world record 150cc”
    6. Scroll down until I find a reliable world record listing site.
    7. Navigate through the site until I find the r

As the samples above show, each record in the dataset contains:

- **task_id** : The unique identifier for the task

- **Level** : Difficulty level of the GAIA task

- **Question** : The specific GAIA task

- **Final answer** : The ground truth for the GAIA task

- **file_name** : The additional file related to the task

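- **Annotator Metadata** : Notes recorded by the human annotator

    - **Steps** : The **sequence** of steps followed to reach the correct answer

    - **Number of steps** : Total number of steps needed to reach the correct answer

    - **How long did this take?** : Time the annotator needed to answer the question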

    - **Tools** : The list of `tools` used to answer the question/task

    - **Number of tools** : Total number of tools used
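
Before relying on these fields, it can help to check that every record actually carries them. The following is a minimal sketch, not part of the original notebook: `REQUIRED_FIELDS`, `REQUIRED_METADATA_FIELDS`, and `missing_fields` are hypothetical names, and the two sample records are made up for illustration.

```python
# Field names are taken from the records printed above; the helper is illustrative.
REQUIRED_FIELDS = (
    "task_id", "Question", "Level", "Final answer", "file_name", "Annotator Metadata",
)
REQUIRED_METADATA_FIELDS = (
    "Steps", "Number of steps", "How long did this take?", "Tools", "Number of tools",
)


def missing_fields(record: dict) -> list[str]:
    """Return the names of expected fields that are absent from one QA record."""
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    metadata = record.get("Annotator Metadata")
    if isinstance(metadata, dict):
        # Nested fields are only checked when the metadata object itself is present.
        missing += [
            f"Annotator Metadata.{name}"
            for name in REQUIRED_METADATA_FIELDS
            if name not in metadata
        ]
    return missing


# Made-up records, used only to exercise the helper.
complete = {
    "task_id": "example-1",
    "Question": "What is 2 + 2?",
    "Level": 1,
    "Final answer": "4",
    "file_name": "",
    "Annotator Metadata": {
        "Steps": "1. Add the numbers.",
        "Number of steps": "1",
        "How long did this take?": "1 minute",
        "Tools": "1. Calculator",
        "Number of tools": "1",
    },
}
incomplete = {"task_id": "example-2", "Question": "An unfinished record"}

print(missing_fields(complete))
print(missing_fields(incomplete))
```

On the real data, the same check could be applied as `[qa["task_id"] for qa in QAs if missing_fields(qa)]`.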

The **GAIA Agent** is designed as an `Agentic RAG` system: in addition to its tools, the agent can query a retrieval system built over the QA `dataset`.

#### **Explore Dataset Tool Types**

Since the *`dataset`* lists, for each question, the `Tools` used to reach the final answer, it is useful to explore these tools in order to define an efficient and relevant set of tools for our agent:

In [169]:
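tools_qa = []
for qa in QAs:
    for t in qa["Annotator Metadata"]["Tools"].split("\n"):
        # Drop the first two characters (the "1." style list prefix) and normalize case
        tool_qa = t[2:].strip().upper()
        # Remove parenthetical qualifiers such as "(optional)"
        tool_qa = re.sub(r"\s*\([^)]*\)\s*", "", tool_qa)
    # Runs once per QA, after the inner loop, so it records the last tool listed
    tools_qa.append(tool_qa)
tools_counter = OrderedDict(Counter(tools_qa))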

print(f"Total number of Tools used in entire set: {len(tools_counter)}")
print("Tools used in QAs:")
df = pd.DataFrame(
    list(tools_counter.items()), columns = ["Tool", "Count"]
    ).sort_values("Count", ascending = False)\
    .reset_index(drop = True)
df.head(20)

Total number of Tools used in entire set: 55
Tools used in QAs:


|   | Tool | Count |
|---|------|-------|
| 0 | SEARCH ENGINE | 35 |
| 1 | CALCULATOR | 33 |
| 2 | WEB BROWSER | 12 |
| 3 | NE | 9 |
| 4 | IMAGE RECOGNITION TOOLS | 8 |
| 5 | PDF VIEWER | 6 |
| 6 | A CALCULATOR | 5 |
| 7 | OCR | 3 |
| 8 | VIDEO RECOGNITION TOOLS | 3 |
| 9 | MICROSOFT EXCEL | 2 |
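
The counts above come from a loop that appends once per question, so only the last tool listed for each question is tallied. The `t[2:]` slice also trims two characters even when a line has no numbered prefix, which is presumably why `None` shows up as `NE`. A minimal sketch of a stricter tally follows, assuming the `Tools` field is either a numbered list (`"1. Web browser\n2. Search engine"`) or a bare value such as `None`. `normalize_tool` and `count_tools` are hypothetical helpers, and the sample records are made up.

```python
import re
from collections import Counter


def normalize_tool(line: str) -> str:
    """Strip an optional 'N.' list prefix and parenthetical notes, then uppercase."""
    name = re.sub(r"^\s*\d+\.\s*", "", line)      # "2. Search engine" -> "Search engine"
    name = re.sub(r"\s*\([^)]*\)\s*", " ", name)  # drop "(optional)" style qualifiers
    return " ".join(name.split()).upper()


def count_tools(qas: list[dict]) -> Counter:
    """Count every tool mentioned in every QA, not just the last one per QA."""
    counter: Counter = Counter()
    for qa in qas:
        for line in qa["Annotator Metadata"]["Tools"].split("\n"):
            name = normalize_tool(line)
            if name:
                counter[name] += 1
    return counter


# Made-up records that mimic the shape of the Tools field.
sample_qas = [
    {"Annotator Metadata": {"Tools": "1. Web browser\n2. Search engine"}},
    {"Annotator Metadata": {"Tools": "1. Search engine\n2. Calculator (optional)"}},
    {"Annotator Metadata": {"Tools": "None"}},
]
tool_counts = count_tools(sample_qas)
print(tool_counts.most_common())
```

Near-duplicates such as `A CALCULATOR` and `CALCULATOR` would still need to be merged by hand or with an alias table.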


#### **Tools to be Implemented**

- `Search Engine` (arXiv, Wikipedia, DuckDuckGo)

- `Calculator` (add, subtract, divide, multiply, modulus, etc.)

- `Access` and `Download Files` from the Web

- `Excel`/`Google Sheets`: Process Downloaded files
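
As an illustration of the `Calculator` item, the operations can be sketched as plain typed functions with docstrings. This shape suits `StructuredTool.from_function`, which is used for the retriever later in this notebook. The function names below are hypothetical and are not taken from the project's `tools.py`.

```python
def add(a: float, b: float) -> float:
    """Return the sum of a and b."""
    return a + b


def subtract(a: float, b: float) -> float:
    """Return a minus b."""
    return a - b


def multiply(a: float, b: float) -> float:
    """Return the product of a and b."""
    return a * b


def divide(a: float, b: float) -> float:
    """Return a divided by b; raise ValueError when b is zero."""
    if b == 0:
        raise ValueError("Cannot divide by zero.")
    return a / b


def modulus(a: int, b: int) -> int:
    """Return the remainder of a divided by b; raise ValueError when b is zero."""
    if b == 0:
        raise ValueError("Cannot take a modulus by zero.")
    return a % b


print(add(2, 3), subtract(5, 8), multiply(4, 2.5), divide(7, 2), modulus(7, 3))
```

Raising `ValueError` with a readable message, rather than letting `ZeroDivisionError` propagate, is a design choice meant to give the agent an error it can act on.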

---

## **Project Structure: GAIA Agent**

To deploy the agent in a `Hugging Face Space` as a structured `Python` project, it is recommended to split the code into separate files by functionality, which keeps it clean and modular. For instance, the structure could be:

- `tools.py` - To provide the auxiliary tools for the GAIA Agent

- `retriever.py` - To implement the retrieval functions that support access to the knowledge base (*dataset*)

- `agent.py` - To implement the agent

- `app.py` - To integrate all the components into a fully functional agent

---

## **Dataset Loading and Document Creation**

In [4]:
docs = [
    Document(
        page_content = "\n".join([
            f"Question: {qa['Question']}",
            f"Final answer: {qa['Final answer']}",
            # f"file_name: {qa['file_name']}",
            # f"Annotator Metadata: {qa['Annotator Metadata']}"
        ]),
        metadata = {"task_id": qa["task_id"], "level": qa['Level']}
    )
    for qa in QAs
]

In [5]:
docs[0]

Document(metadata={'task_id': 'c61d22de-5f6c-4958-a7f6-5e9707bd3466', 'level': 2}, page_content='Question: A paper about AI regulation that was originally submitted to arXiv.org in June 2022 shows a figure with three axes, where each axis has a label word at both ends. Which of these words is used to describe a type of society in a Physics and Society article submitted to arXiv.org on August 11, 2016?\nFinal answer: egalitarian')

---

## **Retrieval Tool Creation**

There are $2$ options for this:

1. ***Semantic Search*** with `BM25Retriever`
2. ***Vector Search*** with `bge-base-en-v1.5` for embeddings and `Supabase` as the *vector store*

Note that BM25 ranks documents by term-frequency statistics, so it is strictly a keyword (lexical) method; the *semantic search* label is used loosely here. Let's explore both.
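
To make the behaviour of the BM25 option concrete, here is a simplified Okapi BM25 scorer written from the standard formula, using a non-negative IDF variant. It is a sketch only and is not the implementation used by `BM25Retriever`. The whitespace tokenization, the `k1` and `b` defaults, the `bm25_scores` name, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query: str, corpus: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document in `corpus` against `query` with a basic BM25 formula."""
    if not corpus:
        return []
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        term_freq = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # only exact term matches contribute
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


corpus = [
    "what is the capital of france",
    "how many moons does jupiter have",
    "which city is the french seat of government",
]
scores = bm25_scores("capital of france", corpus)
print(scores)
```

The third document paraphrases the first but shares only the word `of` with the query, so it scores far lower. This is the gap that the embedding-based vector search is meant to close.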

### **Retriever for Semantic Search**

In [6]:
bm25_retriever = BM25Retriever.from_documents(documents = docs)
bm25_retriever.k = 3

# @tool(parse_docstring = True)
def retrieve_semantic(query: str) -> str:
    """
    Retrieves information about QAs using BM25 keyword ranking over the question and answer text.

    Args:
        query (str): The user query.

    Returns:
        str: The result of the semantic search
    """
    res = bm25_retriever.invoke(query)
    if res:
        return "\n\n".join([doc.page_content for doc in res])
    else: 
        return "No matching information found."

tool_retrieve_semantic = StructuredTool.from_function(
    retrieve_semantic
)

In [7]:
# Comparing outputs
print(tool_retrieve_semantic.invoke(QAs[0]['Question']))
print(retrieve_semantic(QAs[0]['Question']))

Question: A paper about AI regulation that was originally submitted to arXiv.org in June 2022 shows a figure with three axes, where each axis has a label word at both ends. Which of these words is used to describe a type of society in a Physics and Society article submitted to arXiv.org on August 11, 2016?
Final answer: egalitarian

Question: An office held a Secret Santa gift exchange where each of its twelve employees was assigned one other employee in the group to present with a gift. Each employee filled out a profile including three likes or hobbies. On the day of the gift exchange, only eleven gifts were given, each one specific to one of the recipient's interests. Based on the information in the document, who did not give a gift?
Final answer: Fred

Question: On June 6, 2023, an article by Carolyn Collins Petersen was published in Universe Today. This article mentions a team that produced a paper about their observations, linked at the bottom of the article. Find this paper. Und

### **Retriever for Vector Search**

This requires:
- A **table** in `Supabase` with the `pgvector` extension enabled
- Row Level Security (RLS) policies to control access to the table

In [8]:
# Log in to Hugging Face to download the embedding model

load_dotenv()
hf_token = os.getenv("HF_API_TOKEN")
if hf_token:
    login(token = hf_token)
else:
    print("Warning: No Hugging Face token found.")

In [9]:
# MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"
MODEL_NAME = "BAAI/bge-base-en-v1.5"
embedding_model = HuggingFaceEmbeddings(model_name = MODEL_NAME)
model = embedding_model._client
dim = model.get_sentence_embedding_dimension()
dim

768

#### **Supabase (PostgreSQL) Table Creation**

In [170]:
# Create postgresql connection
conn = psycopg2.connect(
    host = os.getenv("SUPABASE_DB_HOST"),
    port = os.getenv("SUPABASE_DB_PORT"),
    dbname = os.getenv("SUPABASE_DB_NAME"),
    user = os.getenv("SUPABASE_DB_USER"),
    password = os.getenv("SUPABASE_DB_PASSWORD")
)
conn.autocommit = True
cursor = conn.cursor()

In [171]:
TBL_NAME = "documents_tbl"
create_table = f"""
DROP TABLE IF EXISTS {TBL_NAME};
CREATE TABLE IF NOT EXISTS {TBL_NAME} (
    id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    content TEXT,
    metadata JSONB,
    embedding VECTOR({dim})
);
"""
try:
    cursor.execute("CREATE SCHEMA IF NOT EXISTS extensions;")
    cursor.execute("CREATE EXTENSION IF NOT EXISTS vector WITH SCHEMA extensions;")
    cursor.execute(create_table)
    cursor.execute(f"ALTER TABLE {TBL_NAME} ENABLE ROW LEVEL SECURITY;")
    print(f"Table '{TBL_NAME}' successfully created and ready to insert embeddings.")
except Exception as e:
    conn.rollback()
    print(f"Couldn't create the PostgreSQL table. Error: {e}")
    raise

# Recreate the read policy so that any role can SELECT from the table
cursor.execute(f"""
    DROP POLICY IF EXISTS "Allow read to all" ON {TBL_NAME};
""")

cursor.execute(f"""
CREATE POLICY "Allow read to all"
ON {TBL_NAME}
FOR SELECT
USING (true);
""")
# cursor.close()
# conn.close()

Table 'documents_tbl' successfully created and ready to insert embeddings.


#### **Function to Search Documents in Supabase**

In [172]:
df_func_def = f"""
CREATE FUNCTION match_documents (
    query_embedding VECTOR({dim}),
    filter JSONB DEFAULT '{{}}',
    match_count INT DEFAULT 5
) RETURNS TABLE (
    id BIGINT,
    content TEXT,
    metadata JSONB,
    similarity FLOAT
) LANGUAGE plpgsql
SET search_path = 'extensions', 'public'
AS $$
BEGIN
    RETURN QUERY
    SELECT
        {TBL_NAME}.id,
        {TBL_NAME}.content,
        {TBL_NAME}.metadata,
        1 - ({TBL_NAME}.embedding <=> query_embedding) AS similarity
    FROM {TBL_NAME}
    WHERE {TBL_NAME}.metadata @> filter
    ORDER BY {TBL_NAME}.embedding <=> query_embedding
    LIMIT match_count;
END;
$$;
"""

cursor.execute("DROP FUNCTION IF EXISTS match_documents(VECTOR, JSONB, INT);")
cursor.execute(df_func_def)
cursor.execute(f"GRANT SELECT ON {TBL_NAME} TO anon;")
cursor.execute("GRANT EXECUTE ON FUNCTION match_documents(VECTOR, JSONB, INT) TO service_role;")
cursor.execute("GRANT EXECUTE ON FUNCTION match_documents(VECTOR, JSONB, INT) TO anon;")

cursor.close()
conn.close()
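
In `match_documents`, pgvector's `<=>` operator computes cosine distance, so `1 - (embedding <=> query_embedding)` is the cosine similarity, and ordering by ascending distance returns the most similar rows first. The same ranking can be sketched in plain Python. `cosine_similarity` and `match_rows` are hypothetical helpers, and the vectors are toy 3-dimensional examples rather than real 768-dimensional embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same dimension.")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("Cosine similarity is undefined for zero vectors.")
    return dot / (norm_a * norm_b)


def match_rows(query_embedding: list[float], rows: list[dict], match_count: int = 5) -> list[dict]:
    """Mimic the SQL function: score every row, sort by similarity, keep the top rows."""
    scored = [
        {
            "id": row["id"],
            "content": row["content"],
            "similarity": cosine_similarity(row["embedding"], query_embedding),
        }
        for row in rows
    ]
    scored.sort(key=lambda row: row["similarity"], reverse=True)
    return scored[:match_count]


# Toy rows standing in for the table contents.
rows = [
    {"id": 1, "content": "doc A", "embedding": [1.0, 0.0, 0.0]},
    {"id": 2, "content": "doc B", "embedding": [0.0, 1.0, 0.0]},
    {"id": 3, "content": "doc C", "embedding": [0.7, 0.7, 0.0]},
]
top = match_rows([1.0, 0.1, 0.0], rows, match_count=2)
print([(row["id"], round(row["similarity"], 3)) for row in top])
```

Unlike the SQL function, this sketch omits the `filter` argument, which restricts candidates to rows whose `metadata` contains the given JSON.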

#### **Data Insertion into Supabase Table**

In [173]:
docs_qa = []
for i, qa in enumerate(QAs):
    question = qa.get("Question", "").strip()
    final_answer = qa.get("Final answer", "").strip()
    # Treat a missing or null file name the same as an empty one
    additional_file = qa.get("file_name") or ""
    has_file = bool(additional_file)

    content = f"Question: {question}\n\nAdditional file: {additional_file}\n\nFinal answer: {final_answer}"
    embedding = embedding_model.embed_query(content)
    doc_qa = {
        "content": content,
        "metadata": {
            "task_id": qa.get("task_id"),
            "has_file": has_file
        },
        "embedding": embedding
    }

    if i == 0:
        print(f"Embedding first 5 dims: {embedding[:5]}")
    
    docs_qa.append(doc_qa)

Embedding first 5 dims: [0.006851373240351677, 0.019783932715654373, -0.005305973347276449, 0.04809008538722992, 0.03095371648669243]


Instantiate the **Supabase** `Client`:

In [174]:
supabase_url = os.environ.get("SUPABASE_URL")
supabase_key = os.environ.get("SUPABASE_KEY")
supabase_anon_key = os.environ.get("SUPABASE_ANON_KEY")
supabase: Client = create_client(
    supabase_url, supabase_key,
    options = ClientOptions(
        schema = "public"
    )
)

supabase_public: Client = create_client(
    supabase_url, supabase_anon_key,
    options = ClientOptions(
        schema = "public"
    )
)

Upload *Documents* to the `Vector Database` (*Supabase*):

In [175]:
try: 
    res = (
        supabase
        .table(TBL_NAME)
        .insert(docs_qa)
        .execute()
    )
    # Compare against the list that was actually sent to Supabase
    if len(res.data) != len(docs_qa):
        print(f"Warning: Only {len(res.data)} out of {len(docs_qa)} docs were inserted.")
except Exception as e:
    print(f"Error inserting documents into Supabase:\n{e}")
    raise

#### **Supabase Vector Store**

In [64]:
vector_store = SupabaseVectorStore(
    client = supabase_public,
    embedding = embedding_model,
    table_name = TBL_NAME,
    query_name = "match_documents"
)
vector_retriever = vector_store.as_retriever()

In [176]:
vector_store = SupabaseVectorStore(
    client = supabase,
    embedding = embedding_model,
    table_name = TBL_NAME,
    query_name = "match_documents"
)
vector_retriever = vector_store.as_retriever()

In [177]:
r_samp = random.sample(QAs, 1)[0]
query = r_samp['Question']
r_ans = r_samp['Final answer']

print(f"Question:\n{query}\n\nAnswer:\n{r_ans}")

Question:
What is the surname of the equine veterinarian mentioned in 1.E Exercises from the chemistry materials licensed by Marisa Alviar-Agnew & Henry Agnew under the CK-12 license in LibreText's Introductory Chemistry materials as compiled 08/21/2023?

Answer:
Louvrier


In [178]:
cntx = vector_retriever.invoke(query)
cntx[0]

Document(metadata={'task_id': 'cabe07ed-9eca-40ea-8ead-410ef5e83f91', 'has_file': False}, page_content="Question: What is the surname of the equine veterinarian mentioned in 1.E Exercises from the chemistry materials licensed by Marisa Alviar-Agnew & Henry Agnew under the CK-12 license in LibreText's Introductory Chemistry materials as compiled 08/21/2023?\n\nAdditional file: \n\nFinal answer: Louvrier")