GAIA is a benchmark that aims to evaluate next-generation LLMs (LLMs with augmented capabilities due to added tooling, efficient prompting, access to search, etc.).
Data
GAIA consists of more than 450 non-trivial questions, each with an unambiguous answer, that require different levels of tooling and autonomy to solve. It is therefore divided into 3 levels, where level 1 should be breakable by very good LLMs and level 3 indicates a strong jump in model capabilities. Each level is divided into a fully public dev set for validation and a test set with private answers and metadata.
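
To make the level and split layout concrete, here is a minimal Python sketch. It is an illustration only: the `Question` class, its field names, the split labels `"validation"` and `"test"`, and the toy questions are all assumptions made for this example, not the actual GAIA schema or data.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Question:
    """Hypothetical record for one benchmark question (not the real GAIA schema)."""
    text: str
    level: int             # difficulty level: 1, 2, or 3
    split: str             # "validation" (public dev set) or "test"
    answer: Optional[str]  # None when the answer is private (test set)

    def __post_init__(self):
        if self.level not in (1, 2, 3):
            raise ValueError(f"level must be 1, 2, or 3, got {self.level}")
        if self.split not in ("validation", "test"):
            raise ValueError(f"unknown split: {self.split!r}")


def count_by_level_and_split(questions):
    """Count questions per (level, split) pair."""
    return Counter((q.level, q.split) for q in questions)


def locally_scoreable(questions):
    """Keep only questions whose answers are public, i.e. the dev set."""
    return [q for q in questions if q.split == "validation" and q.answer is not None]


# Toy placeholder data, invented purely for illustration.
toy_questions = [
    Question("placeholder question A", level=1, split="validation", answer="42"),
    Question("placeholder question B", level=1, split="test", answer=None),
    Question("placeholder question C", level=3, split="validation", answer="Paris"),
    Question("placeholder question D", level=3, split="test", answer=None),
]

counts = count_by_level_and_split(toy_questions)
dev_only = locally_scoreable(toy_questions)
```

In this sketch, only the dev-set questions can be scored locally, which mirrors the description above: test-set answers stay private, so they are represented here as `None`.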