|
Unit 1. Introduction to agents
|
|
Unit 1.ipynb
|
|
What are LLMs
|
|
https://huggingface.co/learn/agents-course/unit1/what-are-llms
|
|
Transformers
|
|
Most LLMs nowadays are built on the Transformer architecture—a deep learning architecture based on the “Attention” algorithm, that has gained significant interest since the release of BERT from Google in 2018.
|
|
Original transformer architecture:
|
|
There are 3 types of transformers:
|
|
1. Encoders
|
|
An encoder-based Transformer takes text (or other data) as input and outputs a dense representation (or embedding) of that text.
|
|
* Example: BERT from Google
|
|
* Use Cases: Text classification, semantic search, Named Entity Recognition
|
|
* Typical Size: Millions of parameters
|
|
2. Decoders
|
|
A decoder-based Transformer focuses on generating new tokens to complete a sequence, one token at a time.
|
|
* Example: Llama from Meta
|
|
* Use Cases: Text generation, chatbots, code generation
|
|
* Typical Size: Billions (in the US sense, i.e., 10^9) of parameters
|
|
3. Seq2Seq (Encoder–Decoder)
|
|
A sequence-to-sequence Transformer combines an encoder and a decoder. The encoder first processes the input sequence into a context representation, then the decoder generates an output sequence.
|
|
* Examples: T5, BART
|
|
* Use Cases: Translation, Summarization, Paraphrasing
|
|
* Typical Size: Millions of parameters
|
|
Although Large Language Models come in various forms, LLMs are typically decoder-based models with billions of parameters. Here are some of the most well-known LLMs:
|
|
Model (Provider):

* Deepseek-R1 (DeepSeek)
* GPT4 (OpenAI)
* Llama 3 (Meta / Facebook AI Research)
* SmolLM2 (Hugging Face)
* Gemma (Google)
* Mistral (Mistral)
|
|
|
|
|
|
Tokens
|
|
Each LLM has some special tokens specific to the model. The LLM uses these tokens to open and close the structured components of its generation. For example, to indicate the start or end of a sequence, message, or response. Moreover, the input prompts that we pass to the model are also structured with special tokens. The most important of those is the End of sequence token (EOS).
|
|
The forms of special tokens are highly diverse across model providers.
|
|
The table below illustrates the diversity of special tokens.
|
|
Model (Provider): EOS Token - Functionality

* GPT4 (OpenAI): <|endoftext|> - End of message text
* Llama 3 (Meta / Facebook AI Research): <|eot_id|> - End of sequence
* Deepseek-R1 (DeepSeek): <|end_of_sentence|> - End of message text
* SmolLM2 (Hugging Face): <|im_end|> - End of instruction or message
* Gemma (Google): <end_of_turn> - End of conversation turn
|
|
|
|
|
|
We do not expect you to memorize these special tokens, but it is important to appreciate their diversity and the role they play in the text generation of LLMs. If you want to know more about special tokens, you can check out the configuration of the model in its Hub repository. For example, you can find the special tokens of the SmolLM2 model in its tokenizer_config.json.
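For example, a quick way to inspect a model's special tokens programmatically (a minimal sketch; assumes the transformers library and the HuggingFaceTB/SmolLM2-135M-Instruct checkpoint):

from transformers import AutoTokenizer

# Load the SmolLM2 tokenizer and print the special tokens it defines
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")
print(tokenizer.special_tokens_map)            # e.g. eos_token: <|im_end|>
print(tokenizer.eos_token, tokenizer.eos_token_id)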
|
|
Understanding next token prediction.
|
|
LLMs are said to be autoregressive, meaning that the output from one pass becomes the input for the next one. This loop continues until the model predicts the next token to be the EOS token, at which point the model can stop.
|
|
In other words, an LLM will decode text until it reaches the EOS. But what happens during a single decoding loop?
|
|
While the full process can be quite technical for the purpose of learning agents, here’s a brief overview:
|
|
* Once the input text is tokenized, the model computes a representation of the sequence that captures information about the meaning and the position of each token in the input sequence.
|
|
* This representation goes into the model, which outputs scores that rank the likelihood of each token in its vocabulary as being the next one in the sequence.
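A minimal greedy-decoding loop with transformers makes these two steps concrete (a sketch; the small SmolLM2 checkpoint is just an illustrative choice):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"     # small model, purely illustrative
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
for _ in range(20):                                   # decode at most 20 new tokens
    with torch.no_grad():
        logits = model(input_ids).logits              # scores for every token in the vocabulary
    next_id = logits[0, -1].argmax()                  # greedy decoding: pick the most likely token
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)
    if next_id.item() == tokenizer.eos_token_id:      # the loop stops at the EOS token
        break
print(tokenizer.decode(input_ids[0]))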
|
|
|
|
|
|
https://huggingface.co/datasets/agents-course/course-images/resolve/main/en/unit1/DecodingFinal.gif
|
|
Beam decoding (beam search) is a heuristic search algorithm used in natural language processing (NLP) and other fields where probabilistic models generate outputs, such as machine translation, speech recognition, and image captioning. It aims to find the most likely output sequence by keeping track of multiple candidate sequences at each step; the number of candidates kept is called the beam width.

At each step, the algorithm extends every candidate sequence with each possible next token, then keeps only the top k most probable sequences, where k is the beam width.
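In transformers, beam decoding can be requested through generate (a minimal sketch; the model choice is illustrative):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The best thing about agents is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=4,             # the beam width 'k': keep the 4 most probable sequences at each step
    max_new_tokens=30,
    early_stopping=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))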
|
|
Beam Search Visualizer - a Hugging Face Space by m-ric
|
|
|
|
|
|
https://huggingface.co/datasets/agents-course/course-images/resolve/main/en/unit1/AttentionSceneFinal.gif
|
|
Chat templates
|
|
More: Chat Templates
|
|
Chat is an increasingly common use case for LLMs. Different models expect different input formats for chat.
|
|
Chat templates are part of the tokenizer for text-only LLMs or processor for multimodal LLMs. They specify how to convert conversations, represented as lists of messages, into a single tokenizable string in the format that the model expects.
|
|
For this input:
|
|
chat = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]
|
|
The output of the tokenizer will be something like:
|
|
<s> [INST] Hello, how are you? [/INST] I'm doing great. How can I help you today?</s> [INST] I'd like to show off how chat templating works! [/INST]
|
|
SmolLM2
|
|
<|im_start|>system
|
|
You are a helpful AI assistant named SmolLM, trained by Hugging Face<|im_end|>
|
|
<|im_start|>user
|
|
I need help with my order<|im_end|>
|
|
<|im_start|>assistant
|
|
I'd be happy to help. Could you provide your order number?<|im_end|>
|
|
<|im_start|>user
|
|
It's ORDER-123<|im_end|>
|
|
<|im_start|>assistant
|
|
|
|
|
|
Llama 3.2
|
|
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
|
|
|
|
|
|
Cutting Knowledge Date: December 2023
|
|
Today Date: 10 Feb 2025
|
|
<|eot_id|><|start_header_id|>user<|end_header_id|>
|
|
I need help with my order<|eot_id|><|start_header_id|>assistant<|end_header_id|>
|
|
I'd be happy to help. Could you provide your order number?<|eot_id|><|start_header_id|>user<|end_header_id|>
|
|
It's ORDER-123<|eot_id|><|start_header_id|>assistant<|end_header_id|>
|
|
Base Models vs. Instruct Models
|
|
Another point we need to understand is the difference between a Base Model vs. an Instruct Model:
|
|
* A Base Model is trained on raw text data to predict the next token.
|
|
* An Instruct Model is fine-tuned specifically to follow instructions and engage in conversations. For example, SmolLM2-135M is a base model, while SmolLM2-135M-Instruct is its instruction-tuned variant.
|
|
To make a Base Model behave like an instruct model, we need to format our prompts in a consistent way that the model can understand. This is where chat templates come in.
|
|
ChatML is one chat template format. A base model can be fine-tuned with different chat templates, so we need to be sure we are using the correct one.
|
|
Chat templates include Jinja code that transforms the list of JSON messages into the string format the model expects.
|
|
Below is a simplified version of the SmolLM2-135M-Instruct chat template:
|
|
{% for message in messages %}
    {% if loop.first and messages[0]['role'] != 'system' %}
        <|im_start|>system
        You are a helpful AI assistant named SmolLM, trained by Hugging Face
        <|im_end|>
    {% endif %}
    <|im_start|>{{ message['role'] }}
    {{ message['content'] }}<|im_end|>
{% endfor %}
|
|
|
|
|
|
Given these messages:
|
|
messages = [
    {"role": "system", "content": "You are a helpful assistant focused on technical topics."},
    {"role": "user", "content": "Can you explain what a chat template is?"},
    {"role": "assistant", "content": "A chat template structures conversations between users and AI models..."},
    {"role": "user", "content": "How do I use it ?"},
]
|
|
The previous chat template will produce the following string:
|
|
<|im_start|>system
|
|
You are a helpful assistant focused on technical topics.<|im_end|>
|
|
<|im_start|>user
|
|
Can you explain what a chat template is?<|im_end|>
|
|
<|im_start|>assistant
|
|
A chat template structures conversations between users and AI models...<|im_end|>
|
|
<|im_start|>user
|
|
"How do I use it ?<|im_end|>
|
|
|
|
|
|
Chat Template Viewer - a Hugging Face Space by Jofthomas
|
|
|
|
|
|
Chat Templates
|
|
What are tools
|
|
A tool is a function given to the LLM that should fulfill a clear objective. Examples: web search, image generation, retrieval, API interfaces. Tools can be made to complement the agent.
|
|
A tool should contain:
|
|
* A textual description of what the function does
|
|
* A callable (something to perform an action)
|
|
* Arguments with typings
|
|
* Outputs with typings (optional)
|
|
Because LLMs can only receive and generate text, we must describe the tool to the model and ask it to generate text that invokes the tool when needed. The agent is responsible for parsing the LLM's output, recognizing that a tool call is required, and invoking the tool on the LLM's behalf.
|
|
The output of the tool will be sent back to the LLM. It is another type of message in the conversation and it is typically not shown to the user.
|
|
1. Agent retrieves the conversation
|
|
2. Calls the tool(s)
|
|
3. Gets the output(s)
|
|
4. Adds the output(s) as a new conversation message
|
|
5. Sends the updated conversation to the LLM
|
|
|
|
|
|
Example:
|
|
|
|
|
|
This is how we pass the information to the LLM:
|
|
Tool Name: calculator, Description: Multiply two integers., Arguments: a: int, b: int, Outputs: int
|
|
|
|
|
|
If we want to pass more than one tool, we must be consistent and always use the same format.
|
|
* A descriptive name of what it does: calculator
|
|
* A longer description, provided by the function’s docstring comment: Multiply two integers.
|
|
* The inputs and their type: the function clearly expects two ints.
|
|
* The type of the output.
|
|
By just adding a decorator, we can use Python's introspection features to build a tool description automatically:
|
|
Unit 1.ipynb
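A minimal sketch of how such a decorator could build that description with the inspect module (the course's actual implementation is in the notebook above; this is just an illustration):

import inspect

def tool(func):
    """Attach a textual tool description to a function via introspection."""
    sig = inspect.signature(func)
    args = ", ".join(f"{name}: {p.annotation.__name__}" for name, p in sig.parameters.items())
    func.description = (
        f"Tool Name: {func.__name__}, Description: {func.__doc__}, "
        f"Arguments: {args}, Outputs: {sig.return_annotation.__name__}"
    )
    return func

@tool
def calculator(a: int, b: int) -> int:
    """Multiply two integers."""
    return a * b

print(calculator.description)
# Tool Name: calculator, Description: Multiply two integers., Arguments: a: int, b: int, Outputs: int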
|
|
|
|
|
|
Understanding AI Agents through the Thought-Action-Observation Cycle - Hugging Face Agents Course
|
|
Agent’s cycle: Thought->Act->Observe
|
|
|
|
|
|
|
|
|
|
https://huggingface.co/datasets/agents-course/course-images/resolve/main/en/unit1/AgentCycle.gif
|
|
|
|
|
|
Simplified system prompt, which defines:
|
|
|
|
|
|
* The Agent’s behavior
|
|
* The Tools our Agent has access to
|
|
* The Thought-Action-Observation Cycle, that we bake into the LLM instructions
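A condensed sketch of what such a system prompt can look like (paraphrased; the full version in the course notebook is longer and lists the available tools explicitly):

SYSTEM_PROMPT = """Answer the following questions as best you can. You have access to the following tools:

get_weather: Get the current weather in a given location

Use the following format:

Question: the input question you must answer
Thought: you should always think about one action to take
Action: the action to take, as a JSON blob with "action" and "action_input" keys
Observation: the result of the action
... (this Thought/Action/Observation can repeat N times)

You must always end your output with:

Final Answer: the final answer to the original input question
"""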
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Final answer : The current weather in New York is partly cloudy with a temperature of 15°C and 60% humidity.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Agents iterate through a loop until the goal is fulfilled.
|
|
Agents can use tools.
|
|
The agent's thought process adapts dynamically to the information provided by the tool.
|
|
Thought: Internal Reasoning and the ReAct Approach
|
|
|
|
|
|
ReAct approach: Concatenation of Reasoning + Acting
|
|
ReAct is a simple prompting technique that appends “Let’s think step by step” before letting the LLM decode the next tokens. This prompting guides the thinking process towards generating tokens that create a plan rather than providing a final answer, since the model is encouraged to divide the problem into sub-tasks.
|
|
|
|
|
|
|
|
|
|
https://huggingface.co/datasets/agents-course/course-images/resolve/main/en/unit1/ReAct.png
|
|
|
|
|
|
Models like Deepseek-R1 or OpenAI's o1 have been trained to include specific thinking sections between <think> and </think> special tokens. This is not a prompting technique, but a training method where the model learns to generate these tokens.
|
|
Actions: Enabling the Agent to Engage with Its Environment
|
|
Actions are the steps an agent takes to interact with its environment. Each action is a deliberate operation executed by the agent.
|
|
|
|
|
|
Type of Agent - Description:

* JSON Agent: the action to take is specified in JSON format.
* Code Agent: the agent writes a code block that is interpreted externally.
* Function-calling Agent: a subcategory of the JSON Agent, fine-tuned to generate a new message for each action.
|
|
Actions themselves can serve many purposes:
|
|
|
|
|
|
Type of Action - Description:

* Information Gathering: performing web searches, querying databases, or retrieving documents.
* Tool Usage: making API calls, running calculations, and executing code.
* Environment Interaction: manipulating digital interfaces or controlling physical devices.
* Communication: engaging with users via chat or collaborating with other agents.
|
|
|
|
|
|
One crucial aspect of an agent is its ability to stop producing tokens once an action is complete. This is true for all types of agents.
|
|
The LLM only handles text and uses it to describe the action it wants to take and the parameters to supply to the tool.
|
|
Stop and Parse Approach
|
|
1. Generation of the expected action in a structured format (JSON or code)
|
|
2. Halting further generation: Once the action is complete, the agent stops generating additional tokens
|
|
3. Parsing the output: an external parser reads the formatted action, determines which tool to call and extracts the parameters:
|
|
|
|
|
|
This machine-readable format minimizes errors and allows external tools to process the agent’s command.
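A toy illustration of stop and parse (the output format and get_weather tool here are hypothetical, just to show the parsing step):

import json

# Hypothetical raw LLM output that follows the stop-and-parse convention
llm_output = """Thought: I need to check the weather in New York.
Action:
{
  "action": "get_weather",
  "action_input": {"location": "New York"}
}"""

# The agent parses the JSON block, looks up the tool, and calls it on the LLM's behalf
action = json.loads(llm_output.split("Action:")[1])
tools = {"get_weather": lambda location: f"The weather in {location} is sunny"}
observation = tools[action["action"]](**action["action_input"])
print(observation)   # appended to the conversation as a new (tool) message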
|
|
Code Agents
|
|
An alternative is to use code agents, which, instead of JSON, generate executable code blocks in a high-level language like Python.
|
|
|
|
|
|
This provides greater expressiveness, debuggability, integration, and modularity.
|
|
|
|
|
|
A code agent could write the following code:
|
|
|
|
|
|
# Code Agent Example: Retrieve Weather Information
def get_weather(city):
    import requests
    api_url = f"https://api.weather.com/v1/location/{city}?apiKey=YOUR_API_KEY"
    response = requests.get(api_url)
    if response.status_code == 200:
        data = response.json()
        return data.get("weather", "No weather information available")
    else:
        return "Error: Unable to fetch weather data."

# Execute the function and prepare the final answer
result = get_weather("New York")
final_answer = f"The current weather in New York is: {result}"
print(final_answer)
|
|
|
|
|
|
This method also follows the stop and parse approach by clearly delimiting the code block and signaling when execution is complete (here, by printing the final_answer).
|
|
Observe: Integrating Feedback to Reflect and Adapt
|
|
Observation is how an agent perceives the consequences of its actions.
|
|
|
|
|
|
Agent’s memory is at the end of the prompt.
|
|
|
|
|
|
In this phase the agent:
|
|
* Collects feedback
|
|
* Appends results
|
|
* Adapts its strategy
|
|
How are the results appended:
|
|
1. Parse the action: identify the function to call and its arguments.
|
|
2. Execute the action
|
|
3. Append the result of the action as an observation.
|
|
Dummy Agent Library
|
|
The core of an agent library is to append information to the system prompt.
|
|
dummy_agent_library.ipynb · agents-course/notebooks at main
|
|
dummy_agent_library.ipynb (my copy)
|
|
|
|
|
|
|
|
|
|
This is an example prompt:
|
|
|
|
|
|
prompt="""<|begin_of_text|><|start_header_id|>user<|end_header_id|>
|
|
|
|
|
|
The capital of france is<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""
|
|
|
|
|
|
We can also use the chat method:
|
|
|
|
|
|
from huggingface_hub import InferenceClient

# client as used in the dummy_agent_library notebook (serverless Inference API)
client = InferenceClient("meta-llama/Llama-3.2-3B-Instruct")

output = client.chat.completions.create(
    messages=[
        {"role": "user", "content": "The capital of France is"},
    ],
    stream=False,
    max_tokens=1024,
)
print(output.choices[0].message.content)
|
|
|
|
|
|
Example prompt with system prompt:
|
|
prompt=f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>
|
|
{SYSTEM_PROMPT}
|
|
<|eot_id|><|start_header_id|>user<|end_header_id|>
|
|
What's the weather in London ?
|
|
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
|
|
"""
|
|
|
|
|
|
With chat method:
|
|
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "What's the weather in London ?"},
]
|
|
from transformers import AutoTokenizer
|
|
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
|
|
|
|
|
|
tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
|
|
Let’s Create Our First Agent Using smolagents
|
|
Introducing smolagents: simple agents that write actions in code.
|
|
An efficient AI system must provide LLMs with access to the real world (LLMs are good at writing, but bad at having all the facts) and agency.
|
|
Agency is not discrete, but runs through a spectrum:
|
|
|
|
|
|
Agency levels, from least to most agentic (name and example pattern):

* ☆☆☆ Simple processor - LLM output has no impact on program flow. Pattern: process_llm_output(llm_response)
* ★☆☆ Router - LLM output determines basic control flow. Pattern: if llm_decision(): path_a() else: path_b()
* ★★☆ Tool call - LLM output determines function execution. Pattern: run_function(llm_chosen_tool, llm_chosen_args)
* ★★★ Multi-step Agent - LLM output controls iteration and program continuation. Pattern: while llm_should_continue(): execute_next_step()
* ★★★ Multi-Agent - One agentic workflow can start another agentic workflow. Pattern: if llm_trigger(): execute_agent()
|
|
|
|
|
|
A multi-step agent can write actions as calls to external tools. A common format is JSON with tool names and the arguments to use.
|
|
A better way is to use code agents that write actions as code. The advantages are:
|
|
* Composability: code functions can be nested and reused
|
|
* Object management: code allows for storage of the function’s output
|
|
* Generality: code can express anything a computer can do
|
|
* Representation in training data: there is a lot of code in the training data
|
|
smolagents is the successor to transformers.agents, and will be replacing it as transformers.agents gets deprecated in the future.
|
|
smolagents.ipynb
|
|
Duplicate the original space.
|
|
Working on https://huggingface.co/spaces/leroidubuffet/AgentLeroi
|
|
It’s key to update agent.json:
|
|
"tools": ["web_search", "visit_webpage", "syntacticAnalizer", "final_answer"],
|
|
|
|
|
|
Bonus Unit 1. Fine-tuning an LLM for Function-calling
|
|
First we need to learn how to fine tune an LLM: Processing the data - Hugging Face NLP Course
|
|
Bonus.ipynb
|
|
Unit 2 Frameworks for AI agents
|
|
* smolagents
|
|
* LlamaIndex
|
|
* LangGraph
|
|
Introduction to smolagents
|
|
smolagents: 🤗 a barebones library for agents
|
|
* open-source
|
|
* CodeAgents: the primary type of agent in this framework. They produce Python code instead of JSON or text.

* ToolCallingAgents: the second type of agent. They write tool calls as text/JSON that the system must parse and interpret to execute actions.

* Tools: essential building blocks for agent behavior, defined with the Tool class or the @tool decorator. There is a default toolbox as well as community-contributed tools.
|
|
* Retrieval agents allow models to access knowledge bases using vector stores to implement RAG
|
|
* Multi-Agent Systems: having multiple agents is key for building more sophisticated solutions
|
|
* Vision and Browser agents: Vision Agents incorporate Vision-Language Models, and can be used to build browser agents that can browse the web.
|
|
Resources
|
|
* smolagents Documentation - Official docs for the smolagents library
|
|
* Building Effective Agents - Research paper on agent architectures
|
|
* Agent Guidelines - Best practices for building reliable agents
|
|
* LangGraph Agents - Additional examples of agent implementations
|
|
* Function Calling Guide - Understanding function calling in LLMs
|
|
* RAG Best Practices - Guide to implementing effective RAG
|
|
Why use smolagents
|
|
* Works with any LLM through HF tools integration and external APIs
|
|
* Code-first approach: no need for parsing thanks to code agents
|
|
* HF Hub integration. Gradio Spaces can be used as tools
|
|
* Lightweight and minimal
|
|
* Fast prototyping
|
|
Code Vs JSON Actions
|
|
|
|
|
|
https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/code_vs_json_actions.png
|
|
In smolagents, agents operate as multi-step agents, where each step performs:
|
|
* one thought
|
|
* one tool call and execution
|
|
Primary type > Code agent
|
|
Also > ToolCallingAgent
|
|
Integration in smolagents
|
|
smolagents supports flexible LLM integration, allowing you to use any callable model that meets certain criteria. The framework provides several predefined classes to simplify model connections:
|
|
* TransformersModel: Implements a local transformers pipeline for seamless integration.
|
|
* HfApiModel: Supports serverless inference calls through Hugging Face’s infrastructure, or via a growing number of third-party inference providers.
|
|
* LiteLLMModel: Leverages LiteLLM for lightweight model interactions.
|
|
* OpenAIServerModel: Connects to any service that offers an OpenAI API interface.
|
|
* AzureOpenAIServerModel: Supports integration with any Azure OpenAI deployment.
|
|
Building Agents That Use Code
|
|
Code agents are the default agent type in smolagents. They generate Python calls to perform actions, which gives efficient, expressive, and accurate action representations, reduces the number of required actions, simplifies complex operations, and enables reuse of existing code functions.
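A minimal CodeAgent, roughly following the smolagents quickstart (model and tool choices are illustrative):

from smolagents import CodeAgent, DuckDuckGoSearchTool, HfApiModel

agent = CodeAgent(
    tools=[DuckDuckGoSearchTool()],   # tools the generated Python code may call
    model=HfApiModel(),               # serverless model on the Hugging Face Hub
)
agent.run("Search for the best music recommendations for a party at Wayne's mansion.")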
|
|
Why code agents?
|
|
In a multi-step agent process, the LLM writes and executes actions, often involving external tool calls. With JSON, the LLM specifies tool names and string arguments that the system must parse to decide which tool to execute. Research shows that such LLMs work more effectively when they write their actions directly as code.
|
|
Using code has the following advantages:
|
|
* Can be reused and combined
|
|
* Can work with complex structures like images
|
|
* Generalizes to any task
|
|
* LLMs are presented with plenty of high-quality code
|
|
How does a code agent work?
|
|
|
|
|
|
https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolagents/codeagent_docs.png
|
|
|
|
|
|
The main abstraction for agents in smolagents is the MultiStepAgent, which serves as the main building block. A CodeAgent is a special kind of MultiStepAgent.
|
|
A CodeAgent performs actions through a cycle of steps, with existing variables and knowledge being incorporated into the agent’s context, kept in an execution log.
|
|
|
|
|
|
|
|
|
|
1. The system prompt is stored in a SystemPromptStep, and the user query is logged in a TaskStep.
|
|
2. Then, the following while loop is executed:
|
|
2.1 Method agent.write_memory_to_messages() writes the agent’s logs into a list of LLM-readable chat messages.
|
|
2.2 These messages are sent to a Model, which generates a completion.
|
|
2.3 The completion is parsed to extract the action, which, in our case, should be a code snippet since we’re working with a CodeAgent.
|
|
2.4 The action is executed.
|
|
2.5 The results are logged into memory in an ActionStep.
|
|
At the end of each step, if the agent includes any function calls (in agent.step_callback), they are executed.
|
|
code_agents.ipynb
|
|
Writing actions as code snippets or JSON blobs
|
|
ToolCallingAgents are the second type of agent in smolagents.

Code agents use Python snippets. Tool-calling agents instead generate tool calls as JSON structures, taking advantage of the built-in tool-calling capabilities that most LLM providers (OpenAI, Anthropic, and others) offer. ToolCallingAgents can be useful when there is no need for variable handling or complex tool calls.

ToolCallingAgents use the same multi-step workflow, but they generate JSON specifying tool names and arguments. The system then parses these instructions to execute the appropriate tools.

The main difference, code-wise, is that we instantiate a ToolCallingAgent instead of a CodeAgent.
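A minimal ToolCallingAgent looks almost identical (sketch, same illustrative tools and model as before):

from smolagents import ToolCallingAgent, DuckDuckGoSearchTool, HfApiModel

agent = ToolCallingAgent(
    tools=[DuckDuckGoSearchTool()],
    model=HfApiModel(),
)
agent.run("Search for the best music recommendations for a party at Wayne's mansion.")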
|
|
The agent’s trace will be something like:
|
|
╭──────────────────────────────────────────────────────────────────╮
│ Calling tool: 'web_search' with arguments: {'query': "best music  │
│ recommendations for a party at Wayne's mansion"}                  │
╰──────────────────────────────────────────────────────────────────╯
|
|
|
|
|
|
Instead of:
|
|
─ Executing parsed code: ────────────────────────────────────────────────────────────────────
|
|
result = web_search(query="Peggy Guggenheim artist friends")
|
|
print(result)
|
|
|
|
|
|
tool_calling_agents.ipynb (my copy)
|
|
Agents
|
|
Tools
|
|
Tools are treated as functions that an LLM can call within the agent system.
|
|
To interact with a tool, the LLM needs an interface description with these key components:
|
|
* Name: What the tool is called
|
|
* Tool description: What the tool does
|
|
* Input types and descriptions: What arguments the tool accepts
|
|
* Output type: What the tool returns
|
|
|
|
|
|
https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/Agent_ManimCE.gif
|
|
|
|
|
|
There are 2 ways of defining a tool:
|
|
1. Using the @tool decorator
|
|
2. Creating a subclass of Tool when we need more complex functionality
|
|
@tool
|
|
tools.ipynb
|
|
|
|
|
|
This is the recommended way to define simple tools. smolagents will parse basic information about the function from Python, so it must be written clearly with a good docstring (text between ‘’’ or “””).
|
|
|
|
from smolagents import tool

@tool
def greet(name: str) -> str:
    """
    Return a greeting message for a given name.

    Args:
        name: The name of the person to greet.
    """
    return f"Hello, {name}!"
|
|
|
|
|
|
* Clear function name
|
|
* Type hints for inputs and outputs
|
|
* A detailed description, including an Args: section where each argument is described.
|
|
Defining a tool as a Python class
|
|
For more complex tools, we can create a subclass of smolagents' Tool class. In this class we define:
|
|
* name: The tool’s name.
|
|
* description: A description used to populate the agent’s system prompt.
|
|
* inputs: A dictionary with keys type and description, providing information to help the Python interpreter process inputs.
|
|
* output_type: Specifies the expected output type.
|
|
* forward: The method containing the inference logic to execute.
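A minimal sketch of such a subclass (the tool itself is a toy example):

from smolagents import Tool

class GreetingTool(Tool):
    name = "greeting_tool"
    description = "Returns a short greeting for a given name."
    inputs = {
        "name": {"type": "string", "description": "The name of the person to greet."}
    }
    output_type = "string"

    def forward(self, name: str) -> str:
        return f"Hello, {name}!"

greeting_tool = GreetingTool()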
|
|
|
|
|
|
Examples at tools.ipynb
|
|
Default Toolbox
|
|
smolagents comes with a set of prebuilt tools that can be injected into the agent, including:
|
|
* PythonInterpreterTool
|
|
* FinalAnswerTool
|
|
* UserInputTool
|
|
* DuckDuckGoSearchTool
|
|
* GoogleSearchTool
|
|
* VisitWebPageTool
|
|
Tools can be shared and integrated, including connecting with HF Spaces and LangChain tools.
|
|
Retrieval Agents
|
|
retrieval_agents.ipynb (my copy)
|
|
Agentic RAG (Retrieval-Augmented Generation) extends traditional RAG systems by combining autonomous agents with dynamic knowledge retrieval.
|
|
While traditional RAG systems use an LLM to answer queries based on retrieved data, agentic RAG enables intelligent control of both retrieval and generation processes, improving efficiency and accuracy.
|
|
Agentic RAG can formulate search queries autonomously.
|
|
Advanced strategies in agentic RAG systems:
|
|
1. Query Reformulation: Instead of using the raw user query, the agent can craft optimized search terms that better match the target documents
|
|
2. Multi-Step Retrieval The agent can perform multiple searches, using initial results to inform subsequent queries
|
|
3. Source Integration Information can be combined from multiple sources like web search and local documentation
|
|
4. Result Validation Retrieved content can be analyzed for relevance and accuracy before being included in responses
|
|
Agentic RAG: turbocharge your RAG with query reformulation and self-query! 🚀 - Hugging Face Open-Source AI Cookbook
|
|
Multi-Agent Systems
|
|
multiagent_notebook.ipynb
|
|
Different agents collaborating to achieve a given task. Work is distributed among different agents for modularity, scalability and robustness.
|
|
A typical setup might include:
|
|
* A Manager Agent for task delegation
|
|
* A Code Interpreter Agent for code execution
|
|
* A Web Search Agent for information retrieval
|
|
The diagram below illustrates a simple multi-agent architecture where a Manager Agent coordinates a Code Interpreter Tool and a Web Search Agent, which in turn utilizes tools like the DuckDuckGoSearchTool and VisitWebpageTool to gather relevant information.
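A sketch of that setup in smolagents (assuming a recent version where sub-agents take a name and description and are passed via managed_agents):

from smolagents import CodeAgent, ToolCallingAgent, DuckDuckGoSearchTool, VisitWebpageTool, HfApiModel

model = HfApiModel()

# Web search agent the manager can delegate to
web_agent = ToolCallingAgent(
    tools=[DuckDuckGoSearchTool(), VisitWebpageTool()],
    model=model,
    name="web_search_agent",
    description="Runs web searches and visits web pages to gather information.",
)

# Manager agent: coordinates the work and can also run Python code itself
manager_agent = CodeAgent(tools=[], model=model, managed_agents=[web_agent])
manager_agent.run("Find good party music recommendations and summarize your sources.")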
|
|
|
|
|
|
|
|
|
|
https://mermaid.ink/img/pako:eNp1kc1qhTAQRl9FUiQb8wIpdNO76eKubrmFks1oRg3VSYgjpYjv3lFL_2hnMWQOJwn5sqgmelRWleUSKLAtFs09jqhtoWuYUFfFAa6QA9QDTnpzamheuhxn8pt40-6l13UtS0ddhtQXj6dbR4XUGQg6zEYasTF393KjeSDGnDJKNxzj8I_7hLW5IOSmP9CH9hv_NL-d94d4DVNg84p1EnK4qlIj5hGClySWbadT-6OdsrL02MI8sFOOVkciw8zx8kaNspxnrJQE0fXKtjBMMs3JA-MpgOQwftIE9Bzj14w-cMznI_39E9Z3p0uFoA?type=png
|
|
|
|
For example, a Multi-Agent RAG system can integrate:
|
|
* A Web Agent for browsing the internet.
|
|
* A Retriever Agent for fetching information from knowledge bases.
|
|
* An Image Generation Agent for producing visuals.
|
|
All of these agents operate under an orchestrator that manages task delegation and interaction.
|
|
Vision Agents with smolagents
|
|
vision_agents.ipynb
|
|
We can pass images and store them as task_images alongside the task prompt. The agent will then process these images during execution, as in the sketch below.
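A sketch of passing images at run time (model choice and image URL are illustrative; a vision-capable model is required):

import requests
from PIL import Image
from smolagents import CodeAgent, OpenAIServerModel

# Load an example image (the URL is a placeholder)
image = Image.open(requests.get("https://example.com/some_image.png", stream=True).raw)

model = OpenAIServerModel(model_id="gpt-4o")   # any vision-capable model
agent = CodeAgent(tools=[], model=model)
response = agent.run(
    "Describe what you see in the provided image.",
    images=[image],    # stored alongside the task prompt for the whole run
)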
|
|
Providing images with dynamic retrieval
|
|
In this approach, images are dynamically added to the agent’s memory during execution.
|
|
|
|
|
|
https://huggingface.co/agents-course/notebooks/blob/main/unit2/smolagents/vision_web_browser.py
|
|
|
|
|
|
|
|
|
|
https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/smolagents-can-see/diagram_adding_vlms_smolagents.png
|
|
|
|
|
|
Further Reading
|
|
* We just gave sight to smolagents - Blog describing the vision agent functionality.
|
|
* Web Browser Automation with Agents 🤖🌐 - Example for Web browsing using a vision agent.
|
|
* Web Browser Vision Agent Example - Example for Web browsing using a vision agent.
|
|
|
|
|
|
quiz 1.ipynb
|
|
Introduction to LlamaIndex
|
|
LlamaIndex is a toolkit for creating LLM powered agents using indexes and workflows.
|
|
LlamaIndex is built around several core components that enhance agent capabilities:
|
|
* Components: The fundamental building blocks of LlamaIndex, including prompts, models, and databases. These elements integrate LlamaIndex with external tools and libraries.
|
|
* Tools: Specialized components that provide specific functionalities, such as searching, computation, or external API access. Tools empower agents to perform tasks efficiently.
|
|
* Agents: Autonomous entities capable of using tools and making decisions. They coordinate tool usage to achieve complex objectives.
|
|
* Workflows: Structured, step-by-step processes that organize logic. Workflows enable agent-like behavior without requiring explicit agents.
|
|
This structure helps agents efficiently interact with data, perform tasks, and automate workflows.
|
|
Benefits of LlamaIndex
|
|
* Clear workflow system: uses an event-driven and async-first syntax to clearly compose and organize your logic
|
|
* Advanced document parsing with LlamaParse: paid feature
|
|
* Many components: works with different frameworks. These components are registered in LlamaHub.
|
|
LlamaHub
|
|
Llama Hub
|
|
|
|
|
|
Installation format
|
|
pip install llama-index-{component-type}-{framework-name}
|
|
|
|
|
|
For example
|
|
pip install llama-index-llms-huggingface-api
|
|
or
|
|
pip install llama-index-vector-stores-tablestore
|
|
|
|
|
|
Import paths follow the install command:
|
|
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI
|
|
|
|
|
|
llm = HuggingFaceInferenceAPI(
    model_name="Qwen/Qwen2.5-Coder-32B-Instruct",
    temperature=0.7,
    max_tokens=100,
    token="hf_xxx",
)
|
|
|
|
|
|
llm.complete("Hello, how are you?")
|
|
# I am good, how can I help you today?
|
|
What are components in LlamaIndex?
|
|
components.ipynb
|
|
While LlamaIndex has many components, we’ll focus specifically on the QueryEngine component. Why? Because it can be used as a Retrieval-Augmented Generation (RAG) tool for an agent.
|
|
QueryEngine is a key component for building agentic RAG workflows.
|
|
Creating a RAG pipeline using components
|
|
|
|
|
|
https://huggingface.co/datasets/agents-course/course-images/resolve/main/en/unit2/llama-index/rag.png
|
|
|
|
|
|
There are five key stages within RAG, which in turn will be a part of most larger applications you build. These are:
|
|
1. Loading: this refers to getting your data from where it lives — whether it’s text files, PDFs, another website, a database, or an API — into your workflow. LlamaHub provides hundreds of integrations to choose from.
|
|
2. Indexing: this means creating a data structure that allows for querying the data. For LLMs, this nearly always means creating vector embeddings, which are numerical representations of the meaning of the data. Indexing can also refer to numerous other metadata strategies to make it easy to accurately find contextually relevant data based on properties.
|
|
3. Index storing: once your data is indexed you will want to store your index, as well as other metadata, to avoid having to re-index it.
|
|
4. Querying: for any given indexing strategy there are many ways you can utilize LLMs and LlamaIndex data structures to query, including sub-queries, multi-step queries and hybrid strategies.
|
|
5. Evaluation: a critical step in any flow is checking how effective it is relative to other strategies, or when you make changes. Evaluation provides objective measures of how accurate, faithful and fast your responses to queries are.
|
|
Loading and embedding documents
|
|
There are three main ways to load data into LlamaIndex:
|
|
1. SimpleDirectoryReader: A built-in loader for various file types from a local directory.
|
|
2. LlamaParse: LlamaParse, LlamaIndex’s official tool for PDF parsing, available as a managed API.
|
|
3. LlamaHub: A registry of hundreds of data-loading libraries to ingest data from any source.
|
|
|
|
|
|
The simplest way is with SimpleDirectoryReader, which can load many file types and convert them into Document objects that LlamaIndex can work with.
|
|
|
|
|
|
from llama_index.core import SimpleDirectoryReader
|
|
|
|
|
|
reader = SimpleDirectoryReader(input_dir="path/to/directory")
|
|
documents = reader.load_data()
|
|
|
|
|
|
Once the documents are loaded, we need to transform them into Node objects: chunks of text that the AI can work with, which keep references to the original Document object.
|
|
|
|
|
|
The IngestionPipeline helps us create these nodes through two key transformations.
|
|
1. SentenceSplitter breaks down documents into manageable chunks by splitting them at natural sentence boundaries.
|
|
2. HuggingFaceInferenceAPIEmbedding converts each chunk into numerical embeddings - vector representations that capture the semantic meaning in a way AI can process efficiently.
|
|
|
|
|
|
from llama_index.core import Document
|
|
from llama_index.embeddings.huggingface_api import HuggingFaceInferenceAPIEmbedding
|
|
from llama_index.core.node_parser import SentenceSplitter
|
|
from llama_index.core.ingestion import IngestionPipeline
|
|
|
|
|
|
# create the pipeline with transformations
|
|
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_overlap=0),
        HuggingFaceInferenceAPIEmbedding(model_name="BAAI/bge-small-en-v1.5"),
    ]
)
|
|
|
|
|
|
nodes = await pipeline.arun(documents=[Document.example()])
|
|
Storing and indexing documents
|
|
After creating our Node objects we need to index them to make them searchable, but before we can do that, we need a place to store our data.
|
|
|
|
|
|
Since we are using an ingestion pipeline, we can directly attach a vector store to the pipeline to populate it. In this case, we will use Chroma to store our documents.
|
|
|
|
|
|
pip install llama-index-vector-stores-chroma
|
|
|
|
|
|
import chromadb
|
|
from llama_index.vector_stores.chroma import ChromaVectorStore
|
|
|
|
|
|
db = chromadb.PersistentClient(path="./alfred_chroma_db")
|
|
chroma_collection = db.get_or_create_collection("alfred")
|
|
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
|
|
|
|
|
|
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=25, chunk_overlap=0),
        HuggingFaceInferenceAPIEmbedding(model_name="BAAI/bge-small-en-v1.5"),
    ],
    vector_store=vector_store,
)
|
|
|
|
|
|
This is where vector embeddings come in - by embedding both the query and nodes in the same vector space, we can find relevant matches. The VectorStoreIndex handles this for us, using the same embedding model we used during ingestion to ensure consistency.
|
|
|
|
|
|
Now we create an index from the vector store:
|
|
|
|
|
|
from llama_index.core import VectorStoreIndex
|
|
from llama_index.embeddings.huggingface_api import HuggingFaceInferenceAPIEmbedding
|
|
|
|
|
|
embed_model = HuggingFaceInferenceAPIEmbedding(model_name="BAAI/bge-small-en-v1.5")
|
|
index = VectorStoreIndex.from_vector_store(vector_store, embed_model=embed_model)
|
|
|
|
|
|
All information is automatically persisted within the ChromaVectorStore object and the passed directory path.
|
|
Querying a VectorStoreIndex with prompts and LLMs
|
|
Before we can query our index, we need to convert it to a query interface. The most common conversion options are:
|
|
* as_retriever: For basic document retrieval, returning a list of NodeWithScore objects with similarity scores
|
|
* as_query_engine: For single question-answer interactions, returning a written response
|
|
* as_chat_engine: For conversational interactions that maintain memory across multiple messages, returning a written response using chat history and indexed context
|
|
as_query_engine is the most common:
|
|
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI
|
|
llm = HuggingFaceInferenceAPI(model_name="Qwen/Qwen2.5-Coder-32B-Instruct")
|
|
query_engine = index.as_query_engine(
    llm=llm,
    response_mode="tree_summarize",
)
|
|
query_engine.query("What is the meaning of life?")
|
|
# The meaning of life is 42
|
|
Response Processing
|
|
The query engine doesn’t only use the LLM to answer the question but also uses a ResponseSynthesizer as a strategy to process the response. This is fully customisable but there are three main strategies that work well out of the box:
|
|
|
|
|
|
* refine: create and refine an answer by sequentially going through each retrieved text chunk. This makes a separate LLM call per Node/retrieved chunk.
|
|
* compact (default): similar to refining but concatenating the chunks beforehand, resulting in fewer LLM calls.
|
|
* tree_summarize: create a detailed answer by going through each retrieved text chunk and creating a tree structure of the answer.
|
|
Take fine-grained control of your query workflows with the low-level composition API. This API lets you customize and fine-tune every step of the query process to match your exact needs, and it also pairs well with Workflows.
|
|
The language model won’t always perform in predictable ways, so we can’t be sure that the answer we get is always correct. We can deal with this by evaluating the quality of the answer.
|
|
Using tools in LlamaIndex
|
|
There are four types of tools:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
* FunctionTool: Convert any Python function into a tool that an agent can use. It automatically figures out how the function works.
|
|
* QueryEngineTool: A tool that lets agents use query engines. Since agents are built on query engines, they can also use other agents as tools.
|
|
* Toolspecs: Sets of tools created by the community, which often include tools for specific services like Gmail.
|
|
* Utility Tools: Special tools that help handle large amounts of data from other tools.
|
|
Creating a FunctionTool
|
|
tools.ipynb
|
|
|
|
|
|
from llama_index.core.tools import FunctionTool
|
|
|
|
|
|
def get_weather(location: str) -> str:
    """Useful for getting the weather for a given location."""
    print(f"Getting weather for {location}")
    return f"The weather in {location} is sunny"

tool = FunctionTool.from_defaults(
    get_weather,
    name="my_weather_tool",
    description="Useful for getting the weather for a given location.",
)
tool.call("New York")
|
|
|
|
|
|
Creating a QueryEngineTool
|
|
The QueryEngine we defined in the previous unit can be easily transformed into a tool using the QueryEngineTool class. Let’s see how to create a QueryEngineTool from a QueryEngine.
|
|
|
|
|
|
from llama_index.core import VectorStoreIndex
|
|
from llama_index.core.tools import QueryEngineTool
|
|
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI
|
|
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
|
|
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb
|
|
|
|
|
|
embed_model = HuggingFaceEmbedding("BAAI/bge-small-en-v1.5")
|
|
|
|
|
|
db = chromadb.PersistentClient(path="./alfred_chroma_db")
|
|
chroma_collection = db.get_or_create_collection("alfred")
|
|
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
|
|
|
|
|
|
index = VectorStoreIndex.from_vector_store(vector_store, embed_model=embed_model)
|
|
|
|
|
|
llm = HuggingFaceInferenceAPI(model_name="Qwen/Qwen2.5-Coder-32B-Instruct")
|
|
query_engine = index.as_query_engine(llm=llm)
|
|
tool = QueryEngineTool.from_defaults(query_engine, name="some useful name", description="some useful description")
|
|
Creating Toolspecs
|
|
Think of ToolSpecs as collections of tools that work together harmoniously - like a well-organized professional toolkit. Just as a mechanic’s toolkit contains complementary tools that work together for vehicle repairs, a ToolSpec combines related tools for specific purposes. For example, an accounting agent’s ToolSpec might elegantly integrate spreadsheet capabilities, email functionality, and calculation tools to handle financial tasks with precision and efficiency.
|
|
|
|
|
|
Google ToolSpec
|
|
pip install llama-index-tools-google
|
|
|
|
|
|
from llama_index.tools.google import GmailToolSpec
|
|
|
|
|
|
tool_spec = GmailToolSpec()
|
|
tool_spec_list = tool_spec.to_tool_list()
|
|
|
|
|
|
Utility Tools
|
|
Oftentimes, directly querying an API can return an excessive amount of data, some of which may be irrelevant, overflow the context window of the LLM, or unnecessarily increase the number of tokens that you are using. Let’s walk through our two main utility tools below.
|
|
|
|
|
|
* OnDemandToolLoader: This tool turns any existing LlamaIndex data loader (BaseReader class) into a tool that an agent can use. The tool can be called with all the parameters needed to trigger load_data from the data loader, along with a natural language query string. During execution, we first load data from the data loader, index it (for instance with a vector store), and then query it ‘on-demand’. All three of these steps happen in a single tool call.
|
|
* LoadAndSearchToolSpec: The LoadAndSearchToolSpec takes in any existing Tool as input. As a tool spec, it implements to_tool_list, and when that function is called, two tools are returned: a loading tool and then a search tool. The load Tool execution would call the underlying Tool and then index the output (by default with a vector index). The search Tool execution would take in a query string as input and call the underlying index.
|
|
Using Agents in LlamaIndex
|
|
agents.ipynb
|
|
|
|
|
|
LlamaIndex supports three types of agents:
|
|
|
|
|
|
https://huggingface.co/datasets/agents-course/course-images/resolve/main/en/unit2/llama-index/agents.png
|
|
|
|
|
|
* Function Calling Agents - These work with AI models that can call specific functions.
|
|
* ReAct Agents - These can work with any LLM that has a chat or text completion endpoint and can deal with complex reasoning tasks.
|
|
* Advanced Custom Agents - These use more complex methods to deal with more complex tasks and workflows.
|
|
https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/agent/workflow/base_agent.py
|
|
|
|
|
|
Agents are stateless by default; remembering past interactions is opt-in using a Context object. This is useful if you want an agent that needs to remember previous interactions, like a chatbot that maintains context across multiple messages or a task manager that needs to track progress over time.
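A minimal sketch of an agent plus opt-in memory via Context (built around the AgentWorkflow helper; run it inside an async context such as a notebook):

from llama_index.core.agent.workflow import AgentWorkflow
from llama_index.core.workflow import Context
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI

def multiply(a: int, b: int) -> int:
    """Multiply two integers and return the result."""
    return a * b

llm = HuggingFaceInferenceAPI(model_name="Qwen/Qwen2.5-Coder-32B-Instruct")
agent = AgentWorkflow.from_tools_or_functions([multiply], llm=llm)

ctx = Context(agent)                                    # opt-in memory
response = await agent.run("My name is Bob.", ctx=ctx)
response = await agent.run("What was my name again?", ctx=ctx)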
|
|
|
|
|
|
You’ll notice that agents in LlamaIndex are async because they use Python’s async/await syntax. If you are new to async code in Python, or need a refresher, they have an async guide.
|
|
|
|
|
|
Creating RAG Agents with QueryEngineTools
|
|
Agentic RAG is a powerful way to use agents to answer questions about your data.
|
|
It is easy to wrap QueryEngine as a tool for an agent. When doing so, we need to define a name and description. The LLM will use this information to correctly use the tool. Let’s see how to load in a QueryEngineTool using the QueryEngine we created in the component section.
|
|
Creating Multi-agent systems
|
|
The AgentWorkflow class also directly supports multi-agent systems. By giving each agent a name and description, the system maintains a single active speaker, with each agent having the ability to hand off to another agent.
|
|
|
|
|
|
By narrowing the scope of each agent, we can help increase their general accuracy when responding to user messages.
|
|
|
|
|
|
Agents in LlamaIndex can also directly be used as tools for other agents, for more complex and custom scenarios.
|
|
Creating agentic workflows in LlamaIndex
|
|
workflows.ipynb
|
|
|
|
|
|
A workflow in LlamaIndex provides a structured way to organize your code into sequential and manageable steps.
|
|
|
|
|
|
Such a workflow is created by defining Steps which are triggered by Events, and themselves emit Events to trigger further steps. Let’s take a look at Alfred showing a LlamaIndex workflow for a RAG task.
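The most basic single-step workflow looks like this (matches the pattern in the workflows notebook; run inside an async context):

from llama_index.core.workflow import StartEvent, StopEvent, Workflow, step

class MyWorkflow(Workflow):
    @step
    async def my_step(self, ev: StartEvent) -> StopEvent:
        # do the work here (e.g. retrieve, then generate) and return the result
        return StopEvent(result="Hello, world!")

w = MyWorkflow(timeout=10, verbose=False)
result = await w.run()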
|
|
|
|
|
|
|
|
https://huggingface.co/datasets/agents-course/course-images/resolve/main/en/unit2/llama-index/workflows.png
|
|
|
|
|
|
Workflows offer several key benefits:
|
|
* Clear organization of code into discrete steps
|
|
* Event-driven architecture for flexible control flow
|
|
* Type-safe communication between steps
|
|
* Built-in state management
|
|
* Support for both simple and complex agent interactions
|
|
As you might have guessed, workflows strike a great balance between the autonomy of agents and maintaining control over the overall workflow.
|
|
Introduction to LangGraph
|
|
This section is an introduction to LangGraph; more advanced topics can be explored in the free LangChain Academy course: Introduction to LangGraph.
|
|
|
|
|
|
https://langchain-ai.github.io/langgraph/
|
|
https://academy.langchain.com/courses/intro-to-langgraph
|
|
LangGraph is a framework to manage the control flow of applications that integrate an LLM.
|
|
LangChain provides a standard interface to interact with models and other components. LangChain and LangGraph are often used together, but not necessarily.
|
|
LangGraph is useful when you need control and predictability over your applications. Smolagents is on the other side of the spectrum, giving your agents freedom (and less predictability).
|
|
LangGraph is great when your application must follow a predetermined series of steps with decisions at each junction point.
|
|
The key scenarios where LangGraph excels include:
|
|
* Multi-step reasoning processes that need explicit control on the flow
|
|
* Applications requiring persistence of state between steps
|
|
* Systems that combine deterministic logic with AI capabilities
|
|
* Workflows that need human-in-the-loop interventions
|
|
* Complex agent architectures with multiple components working together
|
|
Whenever possible, design workflows that adapt dynamically, choosing the next action based on the outcome of the previous step. LangGraph is the ideal framework to implement such adaptive workflows.
|
|
At its core, LangGraph uses a directed graph structure to define the flow of your application:
|
|
* Nodes represent individual processing steps (like calling an LLM, using a tool, or making a decision).
|
|
* Edges define the possible transitions between steps.
|
|
* State is user-defined, maintained, and passed between nodes during execution. It is the current state that we look at when deciding which node to target next.
|
|
Building Blocks of LangGraph
|
|
|
|
|
|
https://huggingface.co/datasets/agents-course/course-images/resolve/main/en/unit2/LangGraph/application.png
|
|
1. State:
|
|
User defined. Represents the information that flows through the application. Must be carefully considered.
|
|
💡 Tip: Think carefully about what information your application needs to track between steps.
|
|
2. Nodes
|
|
Python functions that take the state as input, perform an operation, and return an updated state.
|
|
Nodes can contain:
|
|
* LLM calls: Generate text or make decisions
|
|
* Tool calls: Interact with external systems
|
|
* Conditional logic: Determine next steps
|
|
* Human intervention: Get input from users
|
|
💡 Info: Some nodes needed in every workflow, like START and END, are provided by LangGraph directly.
|
|
3. Edges
|
|
Connect nodes and define possible paths through the graph.
|
|
There are two kinds:
|
|
* Direct
|
|
* Conditional
|
|
4. StateGraph
|
|
Contains the whole agent workflow.
|
|
Building Your First LangGraph
|
|
mail_sorting.ipynb
|
|
|
|
|
|
Step 1: Define Our State
|
|
Let’s define what information the agent needs to track during the workflow.
|
|
💡 Tip: Make your state comprehensive enough to track all the important information, but avoid bloating it with unnecessary details.
|
|
|
|
|
|
Step 2: Define Our Nodes
|
|
Create the functions.
|
|
|
|
|
|
Step 3: Define Our Routing Logic
|
|
We need a function to determine which path to take after classification.
|
|
💡 Note: This routing function is called by LangGraph to determine which edge to follow after the classification node. The return value must match one of the keys in our conditional edges mapping.
|
|
|
|
|
|
Step 4: Create the StateGraph and Define Edges
|
|
We use the special END node provided by LangGraph. This indicates terminal states where the workflow completes.
|
|
|
|
|
|
Step 5: Run the Application
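Putting steps 1-5 together, a condensed sketch of the mail-sorting graph (simplified; the classifier here is a stub where the notebook calls an LLM):

from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class EmailState(TypedDict):
    email: str
    is_spam: bool

def classify_email(state: EmailState) -> EmailState:
    # stub: the notebook uses an LLM call here
    return {"email": state["email"], "is_spam": "free money" in state["email"].lower()}

def route(state: EmailState) -> str:
    return "spam" if state["is_spam"] else "legit"

def handle_spam(state: EmailState) -> EmailState:
    print("Discarding spam email.")
    return state

def draft_response(state: EmailState) -> EmailState:
    print("Drafting a reply...")
    return state

builder = StateGraph(EmailState)
builder.add_node("classify", classify_email)
builder.add_node("handle_spam", handle_spam)
builder.add_node("draft_response", draft_response)
builder.add_edge(START, "classify")
builder.add_conditional_edges("classify", route, {"spam": "handle_spam", "legit": "draft_response"})
builder.add_edge("handle_spam", END)
builder.add_edge("draft_response", END)
graph = builder.compile()

result = graph.invoke({"email": "You have won free money!", "is_spam": False})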
|
|
|
|
|
|
Step 6: Inspecting Our Agent with Langfuse
|
|
Document Analysis Graph
|
|
https://colab.research.google.com/drive/1ihDKrqlvEKld6tmsJ-LTXHuBcHvr4mvN
|
|
|
|
|
|
|
|
|
|
https://huggingface.co/datasets/agents-course/course-images/resolve/main/en/unit2/LangGraph/alfred_flow.png
|
|
|
|
|
|
|
|
|
|
https://huggingface.co/datasets/agents-course/course-images/resolve/main/en/unit2/LangGraph/Agent.png
|
|
|
|
|
|
This state is a little more complex than the previous ones we have seen. AnyMessage is a class from LangChain that defines messages, and add_messages is an operator that appends the latest message rather than overwriting the state with it.
|
|
Key Takeaways
|
|
Should you wish to create your own document analysis butler, here are key considerations:
|
|
* Define clear tools for specific document-related tasks
|
|
* Create a robust state tracker to maintain context between tool calls
|
|
* Consider error handling for tool failures
|
|
* Maintain contextual awareness of previous interactions (ensured by the operator add_messages)
|
|
|
|
|
|
GitHub - langchain-ai/langgraph: Build resilient language agents as graphs.
|
|
Introduction to LangGraph
|
|
________________
|
|
|
|
|
|
Unit 3 Use Case for Agentic RAG
|
|
|
|
|
|
https://huggingface.co/datasets/agents-course/course-images/resolve/main/en/unit2/llama-index/rag.png
|
|
|
|
|
|
LLMs are trained on enormous bodies of data to learn general knowledge. However, the LLM's world knowledge may not always be relevant or up to date. RAG solves this problem by finding and retrieving relevant information from your data and forwarding it to the LLM.
|
|
|
|
|
|
Agentic RAG is a powerful way to use agents to answer questions about your data.
|
|
Why RAG?
|
|
A traditional LLM might struggle to recall specific details because:
|
|
* The data is specific to your task and not in the model’s training data
|
|
* The data may change or be updated frequently
|
|
* The agent needs to retrieve precise details like email addresses
|
|
Setting up our application
|
|
Project Structure
|
|
* tools.py – Provides auxiliary tools for the agent.
|
|
* retriever.py – Implements retrieval functions to support knowledge access.
|
|
* app.py – Integrates all components into a fully functional agent, which we’ll finalize in the last part of this unit.
|
|
AgenticRAG - Course reference.
|
|
RAG for agents.ipynb - code snippets.
|
|
Steps:
|
|
1. Load and prepare the dataset
|
|
We need to transform our raw data into a format that’s optimized for retrieval.
|
|
2. Create the Retriever Tool (see the sketch after this list)
|
|
3. Integrate the Tool with the agent
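A sketch of step 2, in the spirit of the course's retriever.py (class and field names here are illustrative; the real documents come from the dataset prepared in step 1):

from langchain_community.retrievers import BM25Retriever
from langchain.docstore.document import Document
from smolagents import Tool

class GuestInfoRetrieverTool(Tool):
    name = "guest_info_retriever"
    description = "Retrieves detailed information about gala guests by name or relation."
    inputs = {
        "query": {"type": "string", "description": "The name or relation of the guest."}
    }
    output_type = "string"

    def __init__(self, docs, **kwargs):
        super().__init__(**kwargs)
        self.retriever = BM25Retriever.from_documents(docs)

    def forward(self, query: str) -> str:
        results = self.retriever.get_relevant_documents(query)
        if results:
            return "\n\n".join(doc.page_content for doc in results[:3])
        return "No matching guest information found."

# a dummy document for illustration; replace with the prepared dataset
docs = [Document(page_content="Ada Lovelace: mathematician, friend of the family.")]
guest_info_tool = GuestInfoRetrieverTool(docs)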
|
|
|
|
|
|
Building and Integrating Tools for Your Agent
|
|
Agent.ipynb
|
|
Creating Your Gala Agent
|
|
Agent.ipynb
|
|
Bonus Unit 2. Agent Observability and Evaluation
|
|
Agent Observability and Evaluation is perfect in the following scenarios:
|
|
* Develop and Deploy AI Agents: You want to ensure that your agents are performing reliably in production.
|
|
* Need Detailed Insights: You’re looking to diagnose issues, optimize performance, or understand the inner workings of your agent.
|
|
* Aim to Reduce Operational Overhead: By monitoring agent costs, latency, and execution details, you can efficiently manage resources.
|
|
* Seek Continuous Improvement: You’re interested in integrating both real-time user feedback and automated evaluation into your AI applications.
|
|
In short, for everyone who wants to bring their agents in front of users.
|
|
AI Agent Observability and Evaluation
|
|
Observability is about understanding what’s happening inside your AI agent by looking at external signals like logs, metrics, and traces. For AI agents, this means tracking actions, tool usage, model calls, and responses to debug and improve agent performance.
|
|
|
|
|
|
Without observability, AI agents are “black boxes.” Observability tools make agents transparent, enabling you to:
|
|
* Understand costs and accuracy trade-offs
|
|
* Measure latency
|
|
* Detect harmful language & prompt injection
|
|
* Monitor user feedback
|
|
In other words, it makes your demo agent ready for production.
|
|
Observability Tools
|
|
Common observability tools for AI agents include platforms like Langfuse and Arize. These tools help collect detailed traces and offer dashboards to monitor metrics in real-time, making it easy to detect problems and optimize performance.
|
|
|
|
|
|
Observability tools vary widely in their features and capabilities. Some tools are open source, benefiting from large communities that shape their roadmaps and extensive integrations. Additionally, certain tools specialize in specific aspects of LLMOps—such as observability, evaluations, or prompt management—while others are designed to cover the entire LLMOps workflow. We encourage you to explore the documentation of different options to pick a solution that works well for you.
|
|
|
|
|
|
Many agent frameworks such as smolagents use the OpenTelemetry standard to expose metadata to the observability tools. In addition to this, observability tools build custom instrumentations to allow for more flexibility in the fast moving world of LLMs. You should check the documentation of the tool you are using to see what is supported.
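As an example, instrumenting a smolagents run so its traces land in Langfuse looks roughly like this (a sketch based on the OpenInference instrumentation; the endpoint, env-var names, and packages may change, so check the tool's docs):

import os, base64

# Langfuse credentials (placeholders) encoded for the OTLP exporter
LANGFUSE_PUBLIC_KEY = "pk-lf-..."
LANGFUSE_SECRET_KEY = "sk-lf-..."
auth = base64.b64encode(f"{LANGFUSE_PUBLIC_KEY}:{LANGFUSE_SECRET_KEY}".encode()).decode()
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "https://cloud.langfuse.com/api/public/otel"
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = f"Authorization=Basic {auth}"

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from openinference.instrumentation.smolagents import SmolagentsInstrumentor

trace_provider = TracerProvider()
trace_provider.add_span_processor(SimpleSpanProcessor(OTLPSpanExporter()))
SmolagentsInstrumentor().instrument(tracer_provider=trace_provider)
# from here on, every agent.run() produces a trace with one span per step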
|
|
Traces and Spans
|
|
Observability tools usually represent agent runs as traces and spans.
|
|
|
|
|
|
* Traces represent a complete agent task from start to finish (like handling a user query).
|
|
* Spans are individual steps within the trace (like calling a language model or retrieving data).
|
|
|
|
|
|
Key Metrics to Monitor
|
|
* Latency: How quickly does the agent respond? Long waiting times negatively impact user experience. You should measure latency for tasks and individual steps by tracing agent runs. For example, an agent that takes 20 seconds for all model calls could be accelerated by using a faster model or by running model calls in parallel.
|
|
* Costs: What’s the expense per agent run? AI agents rely on LLM calls billed per token or external APIs. Frequent tool usage or multiple prompts can rapidly increase costs. For instance, if an agent calls an LLM five times for marginal quality improvement, you must assess if the cost is justified or if you could reduce the number of calls or use a cheaper model. Real-time monitoring can also help identify unexpected spikes (e.g., bugs causing excessive API loops).
|
|
* Request Errors: How many requests did the agent fail? This can include API errors or failed tool calls. To make your agent more robust against these in production, you can set up fallbacks or retries, e.g., if LLM provider A is down, switch to LLM provider B as a backup.
|
|
* User Feedback: Implementing direct user evaluations provides valuable insights. This can include explicit ratings (👍 thumbs-up / 👎 thumbs-down, ⭐ 1-5 stars) or textual comments. Consistent negative feedback should alert you, as it is a sign that the agent is not working as expected.
|
|
* Implicit User Feedback: User behaviors provide indirect feedback even without explicit ratings. This can include immediate question rephrasing, repeated queries, or clicking a retry button. For example, if users repeatedly ask the same question, that is a sign that the agent is not working as expected.
|
|
* Accuracy: How frequently does the agent produce correct or desirable outputs? Accuracy definitions vary (e.g., problem-solving correctness, information retrieval accuracy, user satisfaction). The first step is to define what success looks like for your agent. You can track accuracy via automated checks, evaluation scores, or task completion labels. For example, marking traces as “succeeded” or “failed”.
|
|
* Automated Evaluation Metrics: You can also set up automated evals. For instance, you can use an LLM to score the agent’s output, e.g., whether it is helpful and accurate. Several open-source libraries also help you score different aspects of an agent, e.g., RAGAS for RAG agents or LLM Guard for detecting harmful language and prompt injection. A hedged LLM-as-a-judge sketch follows this list.
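|
|
As an example of the last point, here is a hedged LLM-as-a-judge sketch built on huggingface_hub's InferenceClient; the judge model, rubric, and score parsing are illustrative choices rather than a standard recipe.
|
|
# A hedged LLM-as-a-judge sketch using huggingface_hub's InferenceClient;
# the judge model, rubric, and score parsing are illustrative, not a standard.
from huggingface_hub import InferenceClient

client = InferenceClient()  # relies on your HF token / default provider settings

def judge_helpfulness(question: str, answer: str, model: str = "Qwen/Qwen2.5-72B-Instruct") -> int:
    """Ask a judge LLM to rate an agent answer from 1 (poor) to 5 (excellent)."""
    prompt = (
        "Rate the following answer for helpfulness and accuracy on a 1-5 scale. "
        "Reply with a single digit.\n\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    resp = client.chat_completion(
        messages=[{"role": "user", "content": prompt}],
        model=model,
        max_tokens=4,
    )
    text = resp.choices[0].message.content.strip()
    digits = [c for c in text if c.isdigit()]
    return int(digits[0]) if digits else 0  # 0 means "could not parse a score"

score = judge_helpfulness("Who hosts the gala?", "The gala is hosted by ...")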
|
|
|
|
|
|
In practice, a combination of these metrics gives the best coverage of an AI agent’s health.
|
|
Evaluating AI Agents
|
|
Observability gives us metrics, but evaluation is the process of analyzing that data (and performing tests) to determine how well an AI agent is performing and how it can be improved. In other words, once you have those traces and metrics, how do you use them to judge the agent and make decisions?
|
|
|
|
|
|
Regular evaluation is important because AI agents are often non-deterministic and can evolve (through updates or drifting model behavior) – without evaluation, you wouldn’t know if your “smart agent” is actually doing its job well or if it’s regressed.
|
|
|
|
|
|
There are two categories of evaluations for AI agents: online evaluation and offline evaluation. Both are valuable, and they complement each other. We usually begin with offline evaluation, as this is the minimum necessary step before deploying any agent.
|
|
Offline Evaluation
|
|
|
|
|
|
Example evaluation dataset (image): https://huggingface.co/datasets/agents-course/course-images/resolve/main/en/bonus-unit2/example-dataset.png
|
|
|
|
|
|
This involves evaluating the agent in a controlled setting, typically using test datasets, not live user queries. You use curated datasets where you know what the expected output or correct behavior is, and then run your agent on those.
|
|
|
|
|
|
For instance, if you built a math word-problem agent, you might have a test dataset of 100 problems with known answers. Offline evaluation is often done during development (and can be part of CI/CD pipelines) to check improvements or guard against regressions. The benefit is that it’s repeatable and you can get clear accuracy metrics since you have ground truth. You might also simulate user queries and measure the agent’s responses against ideal answers or use automated metrics as described above.
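|
|
A minimal offline-evaluation loop might look like the sketch below; run_agent stands in for however you invoke your agent, and the tiny test set is a toy example.
|
|
# A minimal offline-evaluation loop; run_agent is a placeholder for however
# you invoke your agent, and the test set here is a hard-coded toy example.
from typing import Callable

test_set = [
    {"question": "What is 12 * 7?", "expected": "84"},
    {"question": "What is the capital of France?", "expected": "Paris"},
]

def evaluate(run_agent: Callable[[str], str]) -> float:
    """Return exact-match accuracy of the agent over the test set."""
    correct = 0
    for case in test_set:
        prediction = run_agent(case["question"]).strip()
        if prediction == case["expected"]:
            correct += 1
    return correct / len(test_set)

# Example usage with a trivial stand-in "agent":
accuracy = evaluate(lambda q: "84" if "12 * 7" in q else "Paris")
print(f"Exact-match accuracy: {accuracy:.0%}")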
|
|
|
|
|
|
The key challenge with offline eval is ensuring your test dataset is comprehensive and stays relevant – the agent might perform well on a fixed test set but encounter very different queries in production. Therefore, you should keep test sets updated with new edge cases and examples that reflect real-world scenarios. A mix of small “smoke test” cases and larger evaluation sets is useful: small sets for quick checks and larger ones for broader performance metrics.
|
|
Online Evaluation
|
|
This refers to evaluating the agent in a live, real-world environment, i.e. during actual usage in production. Online evaluation involves monitoring the agent’s performance on real user interactions and analyzing outcomes continuously.
|
|
|
|
|
|
For example, you might track success rates, user satisfaction scores, or other metrics on live traffic. The advantage of online evaluation is that it captures things you might not anticipate in a lab setting – you can observe model drift over time (if the agent’s effectiveness degrades as input patterns shift) and catch unexpected queries or situations that weren’t in your test data. It provides a true picture of how the agent behaves in the wild.
|
|
|
|
|
|
Online evaluation often involves collecting implicit and explicit user feedback, as discussed, and possibly running shadow tests or A/B tests (where a new version of the agent runs in parallel to compare against the old one). The challenge is that it can be tricky to get reliable labels or scores for live interactions, so you might rely on user feedback or downstream metrics (such as whether the user clicked the result).
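|
|
One simple way to start collecting explicit feedback is to log ratings keyed by trace ID, as in the generic sketch below; in practice you would attach these scores to traces via your observability tool's feedback or scoring API.
|
|
# A generic sketch for collecting explicit user feedback during online evaluation;
# most observability platforms expose a "score" or "feedback" API for this,
# so this local JSONL log is only a stand-in.
import json
import time
from pathlib import Path

FEEDBACK_LOG = Path("user_feedback.jsonl")

def record_feedback(trace_id: str, value: int, comment: str = "") -> None:
    """Append a thumbs-up (1) / thumbs-down (0) rating linked to an agent trace."""
    entry = {"trace_id": trace_id, "value": value, "comment": comment, "ts": time.time()}
    with FEEDBACK_LOG.open("a") as f:
        f.write(json.dumps(entry) + "\n")

record_feedback("trace-123", value=1, comment="Answer was correct and fast.")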
|
|
|
|
|
|
In practice, successful AI agent evaluation blends online and offline methods. You might run regular offline benchmarks to quantitatively score your agent on defined tasks and continuously monitor live usage to catch things the benchmarks miss. For example, offline tests can catch if a code-generation agent’s success rate on a known set of problems is improving, while online monitoring might alert you that users have started asking a new category of question that the agent struggles with. Combining both gives a more robust picture.
|
|
|
|
|
|
In fact, many teams adopt a loop: offline evaluation → deploy new agent version → monitor online metrics and collect new failure examples → add those examples to offline test set → iterate. This way, evaluation is continuous and ever-improving.
|
|
|
|
|
|
monitoring-and-evaluating-agents.ipynb
|
|
Unit 4. Welcome to the Final Unit
|
|
Challenge: You’ll create your own agent and evaluate its performance using a subset of the GAIA benchmark.
|
|
|
|
|
|
To successfully complete the course, your agent needs to score 30% or higher on the GAIA benchmark. Achieve that, and you’ll earn your Certificate of Completion, officially recognizing your expertise.
|
|
|
|
|
|
GAIA is carefully designed around the following pillars:
|
|
|
|
|
|
* Real-world difficulty: Tasks require multi-step reasoning, multimodal understanding, and tool interaction.
|
|
* Human interpretability: Despite their difficulty for AI, tasks remain conceptually simple and easy to follow for humans.
|
|
* Non-gameability: Correct answers demand full task execution, making brute-forcing ineffective.
|
|
* Simplicity of evaluation: Answers are concise, factual, and unambiguous—ideal for benchmarking.
|
|
Difficulty Levels
|
|
GAIA tasks are organized into three levels of increasing complexity, each testing specific skills:
|
|
|
|
|
|
* Level 1: Requires fewer than 5 steps and minimal tool usage.
|
|
* Level 2: Involves more complex reasoning, coordination between multiple tools, and 5-10 steps.
|
|
* Level 3: Demands long-term planning and advanced integration of various tools.
|
|
|
|
|
|
|
|
|
|
Example of a Hard GAIA Question
|
|
Which of the fruits shown in the 2008 painting “Embroidery from Uzbekistan” were served as part of the October 1949 breakfast menu for the ocean liner that was later used as a floating prop for the film “The Last Voyage”? Give the items as a comma-separated list, ordering them in clockwise order based on their arrangement in the painting starting from the 12 o’clock position. Use the plural form of each fruit.
|
|
|
|
|
|
As you can see, this question challenges AI systems in several ways:
|
|
* Requires a structured response format
|
|
* Involves multimodal reasoning (e.g., analyzing images)
|
|
* Demands multi-hop retrieval of interdependent facts:
|
|
    * Identifying the fruits in the painting
|
|
    * Discovering which ocean liner was used in The Last Voyage
|
|
    * Looking up the breakfast menu from October 1949 for that ship
|
|
* Needs correct sequencing and high-level planning to solve in the right order
|
|
This kind of task highlights where standalone LLMs often fall short, making GAIA an ideal benchmark for agent-based systems that can reason, retrieve, and execute over multiple steps and modalities.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Gala
|
|
GAIA Leaderboard
|
|
Read the full paper
|
|
Deep Research release post by OpenAI
|
|
Open-source DeepResearch – Freeing our search agents
|
|
Hands on
|
|
The dataset used in this leaderboard consists of 20 questions extracted from the Level 1 questions of the GAIA validation set. The questions were chosen based on the number of tools and steps needed to answer them.
|
|
|
|
|
|
The goal is to get 30% on level 1 questions from the GAIA benchmark.
|
|
The process
|
|
We created an API that allows you to get the questions and send your answers for scoring. Here is a summary of the routes (see the live documentation for interactive details):
|
|
|
|
|
|
* GET /questions: Retrieve the full list of filtered evaluation questions.
|
|
* GET /random-question: Fetch a single random question from the list.
|
|
* GET /files/{task_id}: Download a specific file associated with a given task ID.
|
|
* POST /submit: Submit agent answers, calculate the score, and update the leaderboard.
|
|
The submit function will compare the answer to the ground truth in an EXACT MATCH manner.
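|
|
A hedged client sketch is shown below; the API base URL and the /submit payload field names are assumptions, so confirm them against the live API documentation.
|
|
# A hedged client sketch for the evaluation API; API_BASE and the /submit payload
# field names are assumptions; confirm them against the live API documentation.
import requests

API_BASE = "https://<scoring-api-host>"  # placeholder; use the URL given in the course

def get_questions() -> list[dict]:
    resp = requests.get(f"{API_BASE}/questions", timeout=30)
    resp.raise_for_status()
    return resp.json()

def submit_answers(username: str, agent_code_url: str, answers: list[dict]) -> dict:
    """answers: e.g. [{"task_id": "...", "submitted_answer": "..."}] (field names assumed)."""
    payload = {"username": username, "agent_code": agent_code_url, "answers": answers}
    resp = requests.post(f"{API_BASE}/submit", json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()

questions = get_questions()
# ... run your agent over each question, remembering that scoring is EXACT MATCH ...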
|
|
|
|
|
|
—————————————————————————————————————————
|
|
Results can be submitted for both validation and test. Scores are expressed as the percentage of correct answers for a given split.
|
|
|
|
|
|
Each question calls for an answer that is either a string (one or a few words), a number, or a comma-separated list of strings or floats, unless specified otherwise. There is only one correct answer. Hence, evaluation is done via quasi-exact match between a model’s answer and the ground truth (up to some normalization tied to the “type” of the ground truth).
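|
|
The sketch below illustrates the idea of type-dependent normalization followed by exact matching; it is a simplification, not the official GAIA scorer.
|
|
# A simplified illustration of quasi-exact matching with type-dependent normalization;
# this is not the official GAIA scorer, just a sketch of the idea described above.
def normalize(value: str) -> str:
    value = value.strip().lower()
    # Numbers: drop thousands separators and units such as $ or %.
    cleaned = value.replace(",", "").replace("$", "").replace("%", "")
    try:
        number = float(cleaned)
        return str(int(number)) if number.is_integer() else str(number)
    except ValueError:
        pass
    # Strings: drop articles and trailing punctuation.
    words = [w for w in value.split() if w not in {"a", "an", "the"}]
    return " ".join(words).strip(".")

def quasi_exact_match(prediction: str, truth: str) -> bool:
    # Comma-separated lists: compare element by element after normalization.
    if "," in truth:
        pred_items = [normalize(p) for p in prediction.split(",")]
        true_items = [normalize(t) for t in truth.split(",")]
        return pred_items == true_items
    return normalize(prediction) == normalize(truth)

assert quasi_exact_match("The Apples, pears", "apples, pears")
assert quasi_exact_match("1,000", "1000")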
|
|
|
|
|
|
In our evaluation, we use a system prompt to instruct the model about the required format:
|
|
|
|
|
|
You are a general AI assistant. I will ask you a question. Report your thoughts, and finish your answer with the following template: FINAL ANSWER: [YOUR FINAL ANSWER]. YOUR FINAL ANSWER should be a number OR as few words as possible OR a comma separated list of numbers and/or strings. If you are asked for a number, don't use comma to write your number neither use units such as $ or percent sign unless specified otherwise. If you are asked for a string, don't use articles, neither abbreviations (e.g. for cities), and write the digits in plain text unless specified otherwise. If you are asked for a comma separated list, apply the above rules depending of whether the element to be put in the list is a number or a string.
|
|
We advise you to use the system prompt provided in the paper to ensure your agents answer in the correct, expected format. In practice, GPT4-level models follow it easily.
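|
|
A small helper like the one below can pull the final answer out of a response that follows this template; the regex is an assumption about how strictly the model sticks to the format.
|
|
# Extract the answer from a model response that follows the "FINAL ANSWER: ..."
# template above; the regex is an assumption about the model's formatting.
import re

def extract_final_answer(response: str) -> str:
    match = re.search(r"FINAL ANSWER:\s*(.+)", response, flags=re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else response.strip()

print(extract_final_answer("I checked the menu.\nFINAL ANSWER: apples, pears"))
# -> "apples, pears"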
|
|
|
|
|
|
We expect submissions to be JSON Lines (.jsonl) files with the following format. The first two fields are mandatory; reasoning_trace is optional:
|
|
|
|
|
|
{"task_id": "task_id_1", "model_answer": "Answer 1 from your model", "reasoning_trace": "The different steps by which your model reached answer 1"}
|
|
{"task_id": "task_id_2", "model_answer": "Answer 2 from your model", "reasoning_trace": "The different steps by which your model reached answer 2"}
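|
|
Writing such a file is straightforward; in the sketch below, the records list is a placeholder for your agent's actual outputs.
|
|
# Write answers in the JSON Lines format shown above; the records list is a
# placeholder for your agent's actual outputs.
import json

records = [
    {"task_id": "task_id_1", "model_answer": "Answer 1 from your model",
     "reasoning_trace": "The different steps by which your model reached answer 1"},
]

with open("submission.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")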
|
|
|
|
|
|
—————————————————————————————————————————
|
|
Template
|
|
Template Final Assignment - a Hugging Face Space by agents-course
|
|
Duplicated space
|
|
https://huggingface.co/spaces/leroidubuffet/Final_Assignment_Template
|
|
First agent template
|
|
https://huggingface.co/spaces/agents-course/First_agent_template/blob/main/prompts.yaml#L145
|
|
Student’s leaderboard
|
|
https://huggingface.co/spaces/agents-course/Students_leaderboard
|
|
|
|
|
|
Certificate page
|
|
Unit 4 Final Certificate
|
|
Deadline
|
|
July 1st |