#!/usr/bin/env python3
"""
GAIA Leaderboard Integration & Continuous Benchmarking
=====================================================
Complete implementation with flexible question selection, balanced sampling,
official leaderboard submission capabilities, and proper metadata.jsonl loading.
"""
import json
import logging
import time
import re
import hashlib
import random
from datetime import datetime
from typing import Dict, List, Optional, Tuple, Any
from dataclasses import dataclass
import pandas as pd
from collections import defaultdict
# Core ML libraries
from datasets import load_dataset
from huggingface_hub import HfApi, hf_hub_download, list_repo_files
# Setup logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# ================================
# ENHANCED DATA STRUCTURES
# ================================
@dataclass
class GAIAQuestion:
"""Enhanced structure for GAIA benchmark questions"""
task_id: str
question: str
level: int
final_answer: Optional[str] = None
file_name: Optional[str] = None
file_path: Optional[str] = None
annotator_metadata: Optional[Dict] = None
@classmethod
def from_dict(cls, data: dict):
return cls(**{k: v for k, v in data.items() if k in cls.__annotations__})
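# Example (illustrative): extra keys in raw metadata are ignored by from_dict, so
#   GAIAQuestion.from_dict({"task_id": "t1", "question": "What is 2+2?", "level": 1, "source": "demo"})
# yields GAIAQuestion(task_id="t1", question="What is 2+2?", level=1, final_answer=None, ...)
# with the unknown "source" key silently dropped.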
@dataclass
class GAIASubmission:
"""Structure for leaderboard submissions"""
task_id: str
model_answer: str
reasoning_trace: str
final_answer: str
processing_time: float = 0.0
model_name: str = ""
timestamp: str = ""
def to_leaderboard_format(self) -> Dict[str, str]:
"""Convert to official GAIA leaderboard format"""
return {
"task_id": self.task_id,
"model_answer": self.model_answer,
"reasoning_trace": self.reasoning_trace
}
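# Illustrative leaderboard entry produced by to_leaderboard_format() (values invented):
#   GAIASubmission(task_id="task_001", model_answer="... FINAL ANSWER: 42",
#                  reasoning_trace="Step-by-step reasoning ...", final_answer="42"
#                  ).to_leaderboard_format()
#   -> {"task_id": "task_001", "model_answer": "... FINAL ANSWER: 42",
#       "reasoning_trace": "Step-by-step reasoning ..."}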
@dataclass
class BenchmarkResult:
"""Comprehensive benchmark results"""
model_name: str
total_questions: int
completed_questions: int
error_rate: float
avg_processing_time: float
total_time: float
level_breakdown: Dict[int, Dict[str, int]]
timestamp: str
submission_hash: str
question_selection: str
@dataclass
class QuestionSelectionConfig:
"""Configuration for question selection"""
total_questions: int
level_distribution: Dict[int, int] # level -> count
selection_strategy: str # "balanced", "random", "sequential"
seed: Optional[int] = None
# ================================
# GAIA PROMPT MANAGEMENT
# ================================
class GAIAPromptManager:
"""Manages GAIA-specific prompting and formatting"""
GAIA_SYSTEM_PROMPT = """You are a general AI assistant. I will ask you a question. Report your thoughts, and finish your answer with the following template:
FINAL ANSWER: [YOUR FINAL ANSWER]
YOUR FINAL ANSWER should be a number OR as few words as possible OR a comma separated list of numbers and/or strings. If you are asked for a number, don't use comma to write your number neither use units such as $ or percent sign unless specified otherwise. If you are asked for a string, don't use articles, neither abbreviations (e.g. for cities), and write the digits in plain text unless specified otherwise. If you are asked for a comma separated list, apply the above rules depending of whether the element to be put in the list is a number or a string."""
@staticmethod
def create_gaia_prompt(question: str) -> str:
"""Create properly formatted GAIA prompt"""
return f"{GAIAPromptManager.GAIA_SYSTEM_PROMPT}\n\nQuestion: {question}\n\nLet me think step by step:"
@staticmethod
def extract_final_answer(response: str) -> Tuple[str, str]:
"""Extract final answer and reasoning from model response"""
final_answer_pattern = r"FINAL ANSWER:\s*(.+?)(?:\n|$)"
match = re.search(final_answer_pattern, response, re.IGNORECASE | re.DOTALL)
if match:
final_answer = match.group(1).strip()
reasoning_end = match.start()
reasoning = response[:reasoning_end].strip()
else:
lines = response.strip().split('\n')
final_answer = lines[-1].strip() if lines else ""
reasoning = '\n'.join(lines[:-1]) if len(lines) > 1 else response
return final_answer, reasoning
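# Illustrative round-trip, assuming the model followed the FINAL ANSWER template:
#   GAIAPromptManager.extract_final_answer("The perimeter is 2*(12+8).\nFINAL ANSWER: 40")
#   -> ("40", "The perimeter is 2*(12+8).")
# If the template is missing, the last line of the response is used as the answer.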
# ================================
# QUESTION SELECTION MANAGER
# ================================
class QuestionSelectionManager:
"""Manages intelligent question selection with balanced sampling"""
@staticmethod
def create_balanced_selection(total_questions: int) -> QuestionSelectionConfig:
"""Create balanced distribution across difficulty levels"""
if total_questions <= 10:
# For small tests, ensure at least 1 of each level
level_dist = {1: max(1, total_questions // 3),
2: max(1, total_questions // 3),
3: max(1, total_questions - 2 * (total_questions // 3))}
elif total_questions <= 50:
# For medium tests, use 50-30-20 distribution
level_dist = {1: int(total_questions * 0.5),
2: int(total_questions * 0.3),
3: total_questions - int(total_questions * 0.8)}
else:
# For large tests, use 40-35-25 distribution (closer to real GAIA)
level_dist = {1: int(total_questions * 0.4),
2: int(total_questions * 0.35),
3: total_questions - int(total_questions * 0.75)}
return QuestionSelectionConfig(
total_questions=total_questions,
level_distribution=level_dist,
selection_strategy="balanced",
seed=42 # For reproducibility
)
@staticmethod
def select_questions(all_questions: List[GAIAQuestion],
config: QuestionSelectionConfig) -> Tuple[List[GAIAQuestion], str]:
"""Select questions based on configuration"""
# Group questions by level
questions_by_level = defaultdict(list)
for q in all_questions:
questions_by_level[q.level].append(q)
# Set random seed for reproducibility
        if config.seed is not None:  # treat 0 as a valid seed
random.seed(config.seed)
selected_questions = []
selection_info = []
for level, target_count in config.level_distribution.items():
available_questions = questions_by_level[level]
if not available_questions:
logger.warning(f"No questions available for level {level}")
continue
# Select questions based on strategy
if config.selection_strategy == "balanced" or config.selection_strategy == "random":
if len(available_questions) <= target_count:
selected = available_questions
else:
selected = random.sample(available_questions, target_count)
elif config.selection_strategy == "sequential":
selected = available_questions[:target_count]
else:
selected = random.sample(available_questions,
min(target_count, len(available_questions)))
selected_questions.extend(selected)
selection_info.append(f"Level {level}: {len(selected)}/{len(available_questions)}")
# Shuffle final selection for random order
random.shuffle(selected_questions)
selection_summary = f"Selected {len(selected_questions)} questions ({', '.join(selection_info)})"
return selected_questions, selection_summary
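# Illustrative selection, assuming `all_questions` is a list of GAIAQuestion covering all three levels:
#   config = QuestionSelectionManager.create_balanced_selection(20)
#   # -> level_distribution == {1: 10, 2: 6, 3: 4}, strategy "balanced", seed 42
#   selected, summary = QuestionSelectionManager.select_questions(all_questions, config)
#   # -> 20 questions in shuffled order plus a human-readable summary string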
# ================================
# COMPREHENSIVE SAMPLE DATASET
# ================================
class GAIASampleDataset:
"""Comprehensive sample dataset with 200+ questions across all levels"""
@staticmethod
def create_comprehensive_samples() -> List[GAIAQuestion]:
"""Create comprehensive sample dataset with realistic GAIA-style questions"""
samples = [
# ========================================
            # LEVEL 1 QUESTIONS (Basic Reasoning)
# ========================================
# Geography and World Knowledge
{
"task_id": "sample_l1_001",
"question": "What is the capital city of the country that has the largest land area in South America?",
"level": 1,
"final_answer": "Brasília"
},
{
"task_id": "sample_l1_002",
"question": "Which ocean is the largest by surface area?",
"level": 1,
"final_answer": "Pacific Ocean"
},
{
"task_id": "sample_l1_003",
"question": "What is the smallest country in the world by area?",
"level": 1,
"final_answer": "Vatican City"
},
{
"task_id": "sample_l1_004",
"question": "Which continent has the most countries?",
"level": 1,
"final_answer": "Africa"
},
{
"task_id": "sample_l1_005",
"question": "What is the longest river in the world?",
"level": 1,
"final_answer": "Nile River"
},
# Mathematics - Basic Arithmetic
{
"task_id": "sample_l1_006",
"question": "If a book costs $12.50 and I have a 20% discount coupon, how much will I pay?",
"level": 1,
"final_answer": "10"
},
{
"task_id": "sample_l1_007",
"question": "What is 15% of 200?",
"level": 1,
"final_answer": "30"
},
{
"task_id": "sample_l1_008",
"question": "What is the square root of 144?",
"level": 1,
"final_answer": "12"
},
{
"task_id": "sample_l1_009",
"question": "How many minutes are there in 2.5 hours?",
"level": 1,
"final_answer": "150"
},
{
"task_id": "sample_l1_010",
"question": "What is 144 divided by 12?",
"level": 1,
"final_answer": "12"
},
# Science - Basic Facts
{
"task_id": "sample_l1_011",
"question": "What is the chemical formula for water?",
"level": 1,
"final_answer": "H2O"
},
{
"task_id": "sample_l1_012",
"question": "Which planet in our solar system has the most moons?",
"level": 1,
"final_answer": "Saturn"
},
{
"task_id": "sample_l1_013",
"question": "What is the freezing point of water in Celsius?",
"level": 1,
"final_answer": "0"
},
{
"task_id": "sample_l1_014",
"question": "What is the chemical symbol for gold?",
"level": 1,
"final_answer": "Au"
},
{
"task_id": "sample_l1_015",
"question": "How many legs does a spider have?",
"level": 1,
"final_answer": "8"
},
# History
{
"task_id": "sample_l1_016",
"question": "In what year did the Berlin Wall fall?",
"level": 1,
"final_answer": "1989"
},
{
"task_id": "sample_l1_017",
"question": "What year did World War II end?",
"level": 1,
"final_answer": "1945"
},
{
"task_id": "sample_l1_018",
"question": "Who was the first person to walk on the moon?",
"level": 1,
"final_answer": "Neil Armstrong"
},
{
"task_id": "sample_l1_019",
"question": "In which year did the Titanic sink?",
"level": 1,
"final_answer": "1912"
},
{
"task_id": "sample_l1_020",
"question": "Which ancient wonder of the world was located in Alexandria?",
"level": 1,
"final_answer": "Lighthouse of Alexandria"
},
# Simple Sequences and Patterns
{
"task_id": "sample_l1_021",
"question": "What is the next number in the sequence: 2, 4, 8, 16, ?",
"level": 1,
"final_answer": "32"
},
{
"task_id": "sample_l1_022",
"question": "What is the next number in the sequence: 5, 10, 15, 20, ?",
"level": 1,
"final_answer": "25"
},
{
"task_id": "sample_l1_023",
"question": "What is the next letter in the sequence: A, C, E, G, ?",
"level": 1,
"final_answer": "I"
},
{
"task_id": "sample_l1_024",
"question": "Complete the pattern: 1, 4, 9, 16, ?",
"level": 1,
"final_answer": "25"
},
{
"task_id": "sample_l1_025",
"question": "What comes next: Monday, Wednesday, Friday, ?",
"level": 1,
"final_answer": "Sunday"
},
# ========================================
            # LEVEL 2 QUESTIONS (Intermediate Reasoning)
# ========================================
# Multi-step Math Problems
{
"task_id": "sample_l2_001",
"question": "A train travels 60 km in the first hour, 80 km in the second hour, and 100 km in the third hour. If this pattern continues, how far will it travel in the 5th hour?",
"level": 2,
"final_answer": "140"
},
{
"task_id": "sample_l2_002",
"question": "A rectangular garden is 12 meters long and 8 meters wide. If you want to put a fence around it, how many meters of fencing do you need?",
"level": 2,
"final_answer": "40"
},
{
"task_id": "sample_l2_003",
"question": "If a car travels at 60 km/h for 2.5 hours, then at 80 km/h for 1.5 hours, what is the total distance traveled?",
"level": 2,
"final_answer": "270"
},
{
"task_id": "sample_l2_004",
"question": "A store has a sale where everything is 25% off. If an item originally costs $80, and you have an additional $10 coupon, what is your final price?",
"level": 2,
"final_answer": "50"
},
{
"task_id": "sample_l2_005",
"question": "If you save $50 per month for 18 months, then spend $300, how much money do you have left?",
"level": 2,
"final_answer": "600"
},
# Logic and Problem Solving
{
"task_id": "sample_l2_006",
"question": "In a class of 30 students, 18 play soccer, 12 play basketball, and 6 play both sports. How many students play neither sport?",
"level": 2,
"final_answer": "6"
},
{
"task_id": "sample_l2_007",
"question": "If today is Wednesday and it was Tuesday 8 days ago, what day of the week will it be 15 days from now?",
"level": 2,
"final_answer": "Thursday"
},
{
"task_id": "sample_l2_008",
"question": "A number when multiplied by 4 and then decreased by 7 equals 29. What is the number?",
"level": 2,
"final_answer": "9"
},
{
"task_id": "sample_l2_009",
"question": "If the temperature increases by 3°C every hour starting from 15°C, what will the temperature be after 4 hours?",
"level": 2,
"final_answer": "27"
},
{
"task_id": "sample_l2_010",
"question": "A recipe calls for 3 cups of flour to make 24 cookies. How many cups of flour do you need to make 40 cookies?",
"level": 2,
"final_answer": "5"
},
# ========================================
            # LEVEL 3 QUESTIONS (Advanced Reasoning)
# ========================================
# Complex Mathematical Problems
{
"task_id": "sample_l3_001",
"question": "A company's revenue increased by 25% in the first quarter, decreased by 10% in the second quarter, and increased by 15% in the third quarter. If the original revenue was $100,000, what is the revenue at the end of the third quarter?",
"level": 3,
"final_answer": "129375"
},
{
"task_id": "sample_l3_002",
"question": "A ball is dropped from a height of 100 meters. Each time it bounces, it reaches 75% of its previous height. What is the total distance the ball travels before coming to rest?",
"level": 3,
"final_answer": "700"
},
{
"task_id": "sample_l3_003",
"question": "A bacteria culture doubles every 20 minutes. If you start with 500 bacteria, how many will you have after 2 hours?",
"level": 3,
"final_answer": "32000"
},
{
"task_id": "sample_l3_004",
"question": "If log₂(x) + log₂(x+6) = 4, what is the value of x?",
"level": 3,
"final_answer": "2"
},
{
"task_id": "sample_l3_005",
"question": "A cylindrical tank with radius 3 meters is being filled with water at a rate of 2 cubic meters per minute. How fast is the water level rising in meters per minute?",
"level": 3,
"final_answer": "2/(9π)"
},
# Complex Logic Problems
{
"task_id": "sample_l3_006",
"question": "In a group of 100 people, 60 like coffee, 40 like tea, and 20 like both. How many people like neither coffee nor tea?",
"level": 3,
"final_answer": "20"
},
{
"task_id": "sample_l3_007",
"question": "In a chess tournament, each player plays every other player exactly once. If there are 45 games played in total, how many players are in the tournament?",
"level": 3,
"final_answer": "10"
},
{
"task_id": "sample_l3_008",
"question": "You have a 3-gallon jug and a 5-gallon jug. How can you measure exactly 4 gallons of water? Describe the steps.",
"level": 3,
"final_answer": "Fill 5-gallon jug, pour into 3-gallon jug leaving 2 gallons, empty 3-gallon jug, pour 2 gallons into it, fill 5-gallon jug again, pour from 5-gallon into 3-gallon until full"
},
{
"task_id": "sample_l3_009",
"question": "A box contains 6 red balls, 4 blue balls, and 5 green balls. If you draw 3 balls without replacement, what is the probability that all 3 are different colors?",
"level": 3,
"final_answer": "24/91"
},
{
"task_id": "sample_l3_010",
"question": "In a sequence where each term is the sum of the two preceding terms, if the 5th term is 21 and the 7th term is 55, what is the 6th term?",
"level": 3,
"final_answer": "34"
}
]
return [GAIAQuestion.from_dict(data) for data in samples]
# ================================
# GAIA LEADERBOARD MANAGER (UPDATED)
# ================================
class GAIALeaderboardManager:
"""Manages interactions with the official GAIA leaderboard with proper metadata.jsonl loading"""
LEADERBOARD_URL = "https://huggingface.co/spaces/gaia-benchmark/leaderboard"
DATASET_NAME = "gaia-benchmark/GAIA"
def __init__(self):
self.api = HfApi()
self.sample_dataset = GAIASampleDataset()
def load_test_questions(self, max_questions: int = None,
question_selection: str = "balanced") -> Tuple[List[GAIAQuestion], str]:
"""Load GAIA test questions from metadata.jsonl with proper file handling"""
# Try Method 1: Load from metadata.jsonl files (preferred)
official_questions = self._try_load_official_dataset()
if official_questions:
logger.info(f"✅ Successfully loaded {len(official_questions)} official GAIA questions")
all_questions = official_questions
source_info = "official GAIA metadata.jsonl"
else:
# Try Method 2: Datasets library fallback
logger.info("Trying datasets library as fallback...")
fallback_questions = self._try_load_with_datasets_library()
if fallback_questions:
logger.info(f"✅ Successfully loaded {len(fallback_questions)} questions via datasets library")
all_questions = fallback_questions
source_info = "GAIA dataset (via datasets library)"
else:
# Method 3: Use comprehensive samples
logger.warning("All loading methods failed, using comprehensive samples")
all_questions = self.sample_dataset.create_comprehensive_samples()
source_info = "comprehensive sample dataset"
# Log the distribution
level_dist = self._get_level_distribution(all_questions)
logger.info(f"Question distribution: {level_dist}")
# Apply question selection if requested
if max_questions is None or max_questions >= len(all_questions):
return all_questions, f"✅ Loaded {len(all_questions)} questions from {source_info}"
# Create selection configuration based on user preference
if question_selection == "balanced":
config = QuestionSelectionManager.create_balanced_selection(max_questions)
elif question_selection == "random":
config = QuestionSelectionConfig(
total_questions=max_questions,
                level_distribution={1: max_questions // 3, 2: max_questions // 3, 3: max_questions - 2 * (max_questions // 3)},
selection_strategy="random",
seed=None
)
else: # sequential
config = QuestionSelectionConfig(
total_questions=max_questions,
                level_distribution={1: max_questions // 3, 2: max_questions // 3, 3: max_questions - 2 * (max_questions // 3)},
selection_strategy="sequential"
)
# Select questions based on configuration
selected_questions, selection_summary = QuestionSelectionManager.select_questions(
all_questions, config
)
status_msg = f"✅ {selection_summary} from {source_info} ({question_selection} selection)"
return selected_questions, status_msg
def _try_load_official_dataset(self) -> Optional[List[GAIAQuestion]]:
"""Load official GAIA dataset from metadata.jsonl files"""
try:
logger.info("Loading GAIA dataset from metadata.jsonl files...")
# First, let's see what files are available in the repository
try:
repo_files = list_repo_files("gaia-benchmark/GAIA")
metadata_files = [f for f in repo_files if f.endswith('metadata.jsonl')]
logger.info(f"Found metadata files: {metadata_files}")
except Exception as e:
logger.warning(f"Could not list repo files: {e}")
# Proceed with known paths
metadata_files = [
"2023/validation/metadata.jsonl",
"2023/test/metadata.jsonl"
]
# Try to load metadata files in order of preference
load_attempts = [
("2023/validation/metadata.jsonl", "2023 validation set (with answers)"),
("2023/test/metadata.jsonl", "2023 test set (official leaderboard)"),
# Fallback paths in case structure is different
("validation/metadata.jsonl", "validation set fallback"),
("test/metadata.jsonl", "test set fallback"),
("metadata.jsonl", "root metadata file")
]
for file_path, description in load_attempts:
# Skip if we know this file doesn't exist
if metadata_files and file_path not in metadata_files:
continue
try:
logger.info(f"Attempting to download: {file_path}")
# Download the metadata.jsonl file
local_path = hf_hub_download(
repo_id="gaia-benchmark/GAIA",
filename=file_path,
repo_type="dataset"
)
logger.info(f"Successfully downloaded {file_path} to {local_path}")
# Parse the JSONL file
questions = []
with open(local_path, 'r', encoding='utf-8') as f:
for line_num, line in enumerate(f, 1):
line = line.strip()
if not line:
continue
try:
item = json.loads(line)
question = self._parse_gaia_question(item, line_num, file_path)
if question:
questions.append(question)
except json.JSONDecodeError as e:
logger.warning(f"Failed to parse line {line_num} in {file_path}: {e}")
continue
if questions:
logger.info(f"Successfully loaded {len(questions)} questions from {file_path}")
logger.info(f"Question levels distribution: {self._get_level_distribution(questions)}")
return questions
else:
logger.warning(f"No valid questions found in {file_path}")
except Exception as e:
logger.warning(f"Failed to load {file_path}: {e}")
continue
logger.error("All metadata.jsonl loading attempts failed")
return None
except Exception as e:
logger.error(f"General error in dataset loading: {e}")
return None
def _parse_gaia_question(self, item: dict, line_num: int, source_file: str) -> Optional[GAIAQuestion]:
"""Parse a single question from GAIA metadata.jsonl format"""
try:
# Extract required fields
question_text = item.get('Question', '').strip()
if not question_text:
logger.warning(f"Line {line_num}: Missing or empty 'Question' field")
return None
# Extract task ID
task_id = item.get('task_id', f'gaia_line_{line_num}')
if not task_id:
logger.warning(f"Line {line_num}: Missing 'task_id' field")
return None
# Extract level (should be 1, 2, or 3)
level = item.get('Level')
if level is None:
logger.warning(f"Line {line_num}: Missing 'Level' field")
level = 1
else:
try:
level = int(level)
if level not in [1, 2, 3]:
logger.warning(f"Line {line_num}: Invalid level {level}, setting to 1")
level = 1
except (ValueError, TypeError):
logger.warning(f"Line {line_num}: Could not parse level '{level}', setting to 1")
level = 1
# Extract optional fields
final_answer = item.get('Final answer') # May not be available in test set
file_name = item.get('file_name') # Additional file if needed
annotator_metadata = item.get('Annotator Metadata')
# Create file path if file_name is provided
file_path = None
if file_name:
# Construct the full path to the additional file
# It should be in the same folder as the metadata.jsonl
folder_path = '/'.join(source_file.split('/')[:-1]) # Remove 'metadata.jsonl'
if folder_path:
file_path = f"{folder_path}/{file_name}"
else:
file_path = file_name
question = GAIAQuestion(
task_id=task_id,
question=question_text,
level=level,
final_answer=final_answer,
file_name=file_name,
file_path=file_path,
annotator_metadata=annotator_metadata
)
return question
except Exception as e:
logger.error(f"Error parsing question at line {line_num}: {e}")
return None
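    # Illustrative metadata.jsonl line (field names as consumed above; example values invented):
    #   {"task_id": "abc123", "Question": "How many ...?", "Level": 1,
    #    "Final answer": "42", "file_name": "", "Annotator Metadata": {"Steps": "..."}}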
def _get_level_distribution(self, questions: List[GAIAQuestion]) -> dict:
"""Get distribution of questions by level for logging"""
distribution = {1: 0, 2: 0, 3: 0}
for q in questions:
distribution[q.level] = distribution.get(q.level, 0) + 1
return distribution
def _download_additional_file(self, file_path: str) -> Optional[str]:
"""Download additional file referenced by file_name field"""
try:
logger.info(f"Downloading additional file: {file_path}")
local_path = hf_hub_download(
repo_id="gaia-benchmark/GAIA",
filename=file_path,
repo_type="dataset"
)
logger.info(f"Successfully downloaded {file_path} to {local_path}")
return local_path
except Exception as e:
logger.warning(f"Failed to download additional file {file_path}: {e}")
return None
def _try_load_with_datasets_library(self) -> Optional[List[GAIAQuestion]]:
"""Fallback method using datasets library"""
dataset_configs = [
# Try different ways to specify the 2023 configuration
{"data_dir": "2023", "split": "validation"},
{"data_dir": "2023", "split": "test"},
{"name": "2023", "split": "validation"},
{"name": "2023", "split": "test"},
{"split": "validation"},
{"split": "test"}
]
for config in dataset_configs:
try:
logger.info(f"Trying datasets library with config: {config}")
dataset = load_dataset(
"gaia-benchmark/GAIA",
trust_remote_code=True,
**config
)
questions = []
for i in range(len(dataset)):
item = dataset[i]
question = self._parse_gaia_question(item, i, f"datasets_library_{config}")
if question:
questions.append(question)
if questions:
logger.info(f"Successfully loaded {len(questions)} questions using datasets library")
return questions
except Exception as e:
logger.warning(f"Datasets library failed with config {config}: {e}")
continue
return None
def preview_dataset_structure(self) -> str:
"""Preview the actual dataset structure for debugging"""
try:
# List all files in the repository
repo_files = list_repo_files("gaia-benchmark/GAIA")
# Categorize files
metadata_files = [f for f in repo_files if f.endswith('metadata.jsonl')]
other_files = [f for f in repo_files if not f.endswith('metadata.jsonl')][:10] # First 10 other files
preview = f"""
# 📁 GAIA Dataset Structure
## Metadata Files (Questions):
{chr(10).join(f"- {f}" for f in metadata_files)}
## Sample Additional Files:
{chr(10).join(f"- {f}" for f in other_files)}
## Total Files in Repository: {len(repo_files)}
"""
# Try to preview a sample question
if metadata_files:
try:
# Download first metadata file
local_path = hf_hub_download(
repo_id="gaia-benchmark/GAIA",
filename=metadata_files[0],
repo_type="dataset"
)
# Read first question
with open(local_path, 'r', encoding='utf-8') as f:
first_line = f.readline().strip()
if first_line:
sample_question = json.loads(first_line)
preview += f"""
## Sample Question Structure:
```json
{json.dumps(sample_question, indent=2)[:500]}...
```
## Available Fields:
{list(sample_question.keys())}
"""
except Exception as e:
preview += f"\n\n⚠️ Could not preview sample question: {e}"
return preview
except Exception as e:
return f"❌ Error accessing dataset structure: {e}"
def create_submission_file(self, submissions: List[GAIASubmission], model_name: str) -> Tuple[str, str]:
"""Create official GAIA leaderboard submission file"""
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"gaia_submission_{model_name}_{timestamp}.jsonl"
# Create submission in official format
submission_data = []
for sub in submissions:
submission_data.append(sub.to_leaderboard_format())
# Write JSONL file
with open(filename, 'w', encoding='utf-8') as f:
for entry in submission_data:
f.write(json.dumps(entry) + '\n')
# Create submission hash for verification
with open(filename, 'rb') as f:
file_hash = hashlib.md5(f.read()).hexdigest()
# Create metadata file
metadata = {
"model_name": model_name,
"submission_time": timestamp,
"total_questions": len(submissions),
"file_hash": file_hash,
"format_version": "1.0"
}
metadata_filename = f"gaia_metadata_{model_name}_{timestamp}.json"
with open(metadata_filename, 'w') as f:
json.dump(metadata, f, indent=2)
return filename, metadata_filename
def validate_submission(self, filename: str) -> Tuple[bool, str]:
"""Validate submission file format"""
try:
with open(filename, 'r') as f:
lines = f.readlines()
required_fields = {"task_id", "model_answer", "reasoning_trace"}
for i, line in enumerate(lines):
try:
entry = json.loads(line.strip())
if not all(field in entry for field in required_fields):
return False, f"Line {i+1}: Missing required fields. Required: {required_fields}"
if not isinstance(entry["task_id"], str) or not entry["task_id"]:
return False, f"Line {i+1}: Invalid task_id"
except json.JSONDecodeError:
return False, f"Line {i+1}: Invalid JSON format"
return True, f"✅ Submission file is valid ({len(lines)} entries)"
except Exception as e:
return False, f"❌ Error validating file: {str(e)}"
# ================================
# CONTINUOUS BENCHMARKING SYSTEM
# ================================
class ContinuousBenchmarkingSystem:
"""System for automated continuous benchmarking and tracking"""
def __init__(self):
self.benchmark_history: List[BenchmarkResult] = []
self.leaderboard_manager = GAIALeaderboardManager()
def run_flexible_benchmark(self, agent, model_name: str,
num_questions: int = 50,
question_selection: str = "balanced",
progress_callback=None) -> Tuple[BenchmarkResult, List[GAIASubmission], str, str]:
"""Run flexible benchmark with customizable question selection"""
start_time = time.time()
# Load questions with specified selection
questions, status = self.leaderboard_manager.load_test_questions(
max_questions=num_questions,
question_selection=question_selection
)
if progress_callback:
progress_callback(0.1, f"Loaded {len(questions)} questions")
# Initialize tracking
submissions = []
level_stats = {1: {"total": 0, "completed": 0},
2: {"total": 0, "completed": 0},
3: {"total": 0, "completed": 0}}
total_questions = len(questions)
# Process each question
for i, question in enumerate(questions):
if progress_callback:
progress_callback((i + 1) / total_questions,
f"Processing question {i+1}/{total_questions} (Level {question.level})")
# Track by level
level_stats[question.level]["total"] += 1
try:
# Process question
start_q_time = time.time()
prompt = agent.prompt_manager.create_gaia_prompt(question.question)
raw_response = agent.model_manager.generate_response(prompt)
final_answer, reasoning = agent.prompt_manager.extract_final_answer(raw_response)
processing_time = time.time() - start_q_time
# Create submission
submission = GAIASubmission(
task_id=question.task_id,
model_answer=raw_response,
reasoning_trace=reasoning,
final_answer=final_answer,
processing_time=processing_time,
model_name=model_name,
timestamp=datetime.now().isoformat()
)
submissions.append(submission)
level_stats[question.level]["completed"] += 1
# Log progress
logger.info(f"Completed {question.task_id}: {final_answer[:50]}...")
except Exception as e:
logger.error(f"Error processing {question.task_id}: {e}")
# Add error submission
error_submission = GAIASubmission(
task_id=question.task_id,
model_answer=f"Error: {str(e)}",
reasoning_trace="Processing failed",
final_answer="ERROR",
processing_time=0.0,
model_name=model_name,
timestamp=datetime.now().isoformat()
)
submissions.append(error_submission)
# Calculate final metrics
total_time = time.time() - start_time
completed = sum(level_stats[level]["completed"] for level in level_stats)
error_rate = (total_questions - completed) / total_questions if total_questions > 0 else 0
avg_time = sum(s.processing_time for s in submissions) / len(submissions) if submissions else 0
# Create submission files
submission_file, metadata_file = self.leaderboard_manager.create_submission_file(
submissions, model_name
)
# Create submission hash
with open(submission_file, 'rb') as f:
submission_hash = hashlib.md5(f.read()).hexdigest()[:8]
# Create benchmark result
result = BenchmarkResult(
model_name=model_name,
total_questions=total_questions,
completed_questions=completed,
error_rate=error_rate,
avg_processing_time=avg_time,
total_time=total_time,
level_breakdown=level_stats,
timestamp=datetime.now().isoformat(),
submission_hash=submission_hash,
question_selection=f"{num_questions} questions ({question_selection})"
)
self.benchmark_history.append(result)
return result, submissions, submission_file, metadata_file
def generate_benchmark_report(self, result: BenchmarkResult) -> str:
"""Generate comprehensive benchmark report"""
report = f"""
# 🏆 GAIA Benchmark Report
## Model Information
- **Model Name**: {result.model_name}
- **Benchmark Date**: {result.timestamp}
- **Question Selection**: {result.question_selection}
- **Submission Hash**: {result.submission_hash}
## Overall Performance
- **Total Questions**: {result.total_questions}
- **Successfully Processed**: {result.completed_questions}
- **Success Rate**: {((result.completed_questions / result.total_questions) * 100):.1f}%
- **Error Rate**: {(result.error_rate * 100):.1f}%
## Performance Metrics
- **Average Processing Time**: {result.avg_processing_time:.2f}s per question
- **Total Benchmark Time**: {(result.total_time / 60):.1f} minutes
- **Throughput**: {(result.total_questions / (result.total_time / 60)):.1f} questions/minute
## Performance by Difficulty Level
| Level | Description | Total Questions | Completed | Success Rate |
|-------|-------------|----------------|-----------|--------------|
"""
level_descriptions = {
1: "Basic Reasoning",
2: "Intermediate Reasoning",
3: "Advanced Reasoning"
}
for level in [1, 2, 3]:
stats = result.level_breakdown[level]
success_rate = (stats["completed"] / stats["total"] * 100) if stats["total"] > 0 else 0
description = level_descriptions.get(level, "Unknown")
report += f"| Level {level} | {description} | {stats['total']} | {stats['completed']} | {success_rate:.1f}% |\n"
# Add performance analysis
l1_rate = (result.level_breakdown[1]["completed"] / max(1, result.level_breakdown[1]["total"]) * 100)
l2_rate = (result.level_breakdown[2]["completed"] / max(1, result.level_breakdown[2]["total"]) * 100)
l3_rate = (result.level_breakdown[3]["completed"] / max(1, result.level_breakdown[3]["total"]) * 100)
report += f"""
## Performance Analysis
- **Strength**: {"Level 1 (Basic)" if l1_rate >= max(l2_rate, l3_rate) else "Level 2 (Intermediate)" if l2_rate >= l3_rate else "Level 3 (Advanced)"}
- **Improvement Area**: {"Level 3 (Advanced)" if l3_rate <= min(l1_rate, l2_rate) else "Level 2 (Intermediate)" if l2_rate <= l1_rate else "Level 1 (Basic)"}
- **Processing Speed**: {"Fast" if result.avg_processing_time < 10 else "Medium" if result.avg_processing_time < 30 else "Slow"}
## Leaderboard Submission
- ✅ Submission file generated in official GAIA format
- ✅ Ready for upload to [GAIA Leaderboard]({GAIALeaderboardManager.LEADERBOARD_URL})
- 📁 Download the JSONL file below for submission
## Next Steps
1. Download the submission file
2. Visit the [GAIA Leaderboard]({GAIALeaderboardManager.LEADERBOARD_URL})
3. Upload your results
4. Compare with other models on the public leaderboard
---
*Report generated on {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}*
"""
return report
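# Typical usage sketch (the `agent` object must expose .prompt_manager and
# .model_manager, as EnhancedGAIAAgent below does once a model is loaded):
#   system = ContinuousBenchmarkingSystem()
#   result, subs, sub_file, meta_file = system.run_flexible_benchmark(
#       agent, "my_model", num_questions=20, question_selection="balanced")
#   print(system.generate_benchmark_report(result))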
# ================================
# ENHANCED GAIA AGENT WITH FLEXIBLE BENCHMARKING
# ================================
class EnhancedGAIAAgent:
"""Enhanced GAIA agent with flexible benchmarking capabilities"""
def __init__(self):
self.model_manager = None
self.prompt_manager = GAIAPromptManager()
self.leaderboard_manager = GAIALeaderboardManager()
self.benchmark_system = ContinuousBenchmarkingSystem()
self.current_model = None
def initialize_model(self, model_choice: str, progress=None) -> str:
"""Initialize model with progress tracking"""
try:
if progress:
progress(0, desc="Initializing model...")
# Import model manager from main app
import importlib
app_module = importlib.import_module('app')
HFSpaceModelManager = app_module.HFSpaceModelManager
self.model_manager = HFSpaceModelManager(model_choice)
self.current_model = model_choice
def progress_callback(value, desc):
if progress:
progress(value, desc=desc)
result = self.model_manager.load_model(progress_callback)
return result
except Exception as e:
return f"❌ Failed to initialize model: {str(e)}"
def run_custom_benchmark(self, num_questions: int = 50,
question_selection: str = "balanced",
progress=None) -> Tuple[str, str, str, str]:
"""Run custom benchmark with flexible options"""
if self.model_manager is None:
return "❌ No model loaded", "", "", ""
model_name = self.current_model.replace(" ", "_").replace("&", "and")
try:
# Run flexible benchmark
result, submissions, submission_file, metadata_file = self.benchmark_system.run_flexible_benchmark(
self, model_name, num_questions, question_selection, progress
)
# Generate report
report = self.benchmark_system.generate_benchmark_report(result)
# Validate submission
is_valid, validation_msg = self.leaderboard_manager.validate_submission(submission_file)
if is_valid:
status = f"✅ Benchmark completed successfully!\n{validation_msg}"
else:
status = f"⚠️ Benchmark completed but validation failed:\n{validation_msg}"
return status, report, submission_file, metadata_file
except Exception as e:
return f"❌ Benchmark failed: {str(e)}", "", "", ""
# ================================
# GLOBAL INSTANCES AND INTERFACE FUNCTIONS
# ================================
# Global enhanced agent
enhanced_gaia_agent = EnhancedGAIAAgent()
def run_custom_benchmark_interface(num_questions: int, question_selection: str, progress=None):
"""Interface for running custom benchmark with options"""
return enhanced_gaia_agent.run_custom_benchmark(num_questions, question_selection, progress)
def load_test_questions_interface(max_questions: int = 10, selection_type: str = "balanced"):
"""Interface for loading test questions info with selection options"""
questions, status = enhanced_gaia_agent.leaderboard_manager.load_test_questions(
max_questions=max_questions,
question_selection=selection_type
)
preview = f"""
{status}
## Question Distribution:
"""
# Count by level
level_counts = {1: 0, 2: 0, 3: 0}
for q in questions:
level_counts[q.level] = level_counts.get(q.level, 0) + 1
for level in [1, 2, 3]:
preview += f"- **Level {level}**: {level_counts[level]} questions\n"
preview += f"\n## Sample Questions Preview:\n\n"
# Show samples from each level
samples_shown = 0
for level in [1, 2, 3]:
level_questions = [q for q in questions if q.level == level]
if level_questions and samples_shown < 6:
q = level_questions[0]
preview += f"**Question (Level {q.level})**: {q.question}\n\n"
samples_shown += 1
    if len(questions) > samples_shown:
        preview += f"... and {len(questions) - samples_shown} more questions"
return preview
def preview_dataset_structure_interface():
"""Interface for previewing dataset structure"""
return enhanced_gaia_agent.leaderboard_manager.preview_dataset_structure()
def get_question_selection_info():
"""Get information about question selection options"""
return """
# 🎯 Question Selection Options
## Selection Strategies
### 📊 **Balanced Selection** (Recommended)
- **Level 1**: ~40-50% (Basic reasoning)
- **Level 2**: ~30-35% (Intermediate reasoning)
- **Level 3**: ~20-25% (Advanced reasoning)
- **Best for**: Realistic performance evaluation
### 🎲 **Random Selection**
- **Distribution**: Random across all levels
- **Variety**: Maximum question diversity
- **Best for**: Unbiased sampling
### 📋 **Sequential Selection**
- **Order**: Questions in dataset order
- **Consistency**: Same questions each time
- **Best for**: Reproducible testing
## Question Count Recommendations
| Purpose | Questions | Time | Selection |
|---------|-----------|------|-----------|
| **Quick Test** | 10-20 | 5-15 min | Balanced |
| **Development** | 30-50 | 15-30 min | Balanced |
| **Validation** | 50-100 | 30-60 min | Random |
| **Full Benchmark** | 200+ | 1-3 hours | Balanced |
## Level Descriptions
### Level 1: Basic Reasoning
- Simple factual questions
- Basic arithmetic
- Single-step problems
- **Examples**: "What is the capital of France?", "Calculate 15% of 200"
### Level 2: Intermediate Reasoning
- Multi-step problems
- Logic puzzles
- Time/date calculations
- **Examples**: "Train speed problems", "Probability calculations"
### Level 3: Advanced Reasoning
- Complex mathematical problems
- Multi-step logic
- Advanced problem solving
- **Examples**: "Compound interest calculations", "Complex word problems"
"""
def get_leaderboard_info():
"""Get information about the GAIA leaderboard"""
return f"""
# 🏆 GAIA Public Leaderboard
## Overview
The GAIA benchmark provides a **public leaderboard** hosted on Hugging Face where you can:
- Submit results from **300 official test questions**
- Compare your model against state-of-the-art systems
- Track progress in AI reasoning capabilities
- Contribute to the research community
## Leaderboard Details
- **Official URL**: [GAIA Leaderboard]({GAIALeaderboardManager.LEADERBOARD_URL})
- **Test Questions**: 300 questions across 3 difficulty levels
- **Submission Format**: JSONL files with specific schema
- **Evaluation**: Automated scoring and ranking
- **Public Rankings**: Open comparison of all submissions
## Dataset Structure
- **Questions**: Stored in `metadata.jsonl` files
- **Additional Files**: Some questions reference extra files (images, documents, etc.)
- **Folder Structure**: `2023/validation/` and `2023/test/` directories
- **Format**: Each line in metadata.jsonl contains one question in JSON format
## Flexible Benchmarking Features
### 🎯 **Custom Question Selection**
- **Choose Count**: 10 to 300+ questions
- **Selection Strategy**: Balanced, Random, or Sequential
- **Level Distribution**: Automatic balancing across difficulty levels
- **Reproducible**: Consistent results with same settings
### 📊 **Smart Sampling**
- **Balanced**: Realistic distribution (50% L1, 30% L2, 20% L3)
- **Representative**: Questions from all difficulty levels
- **Efficient**: Test fewer questions while maintaining quality
## How to Submit
1. **Run Benchmark**: Use custom settings for your evaluation
2. **Download Results**: Get the generated JSONL submission file
3. **Visit Leaderboard**: Go to the official GAIA leaderboard
4. **Upload File**: Submit your JSONL file for evaluation
5. **View Results**: Check your model's ranking and performance
## Benefits of Flexible Benchmarking
- 📊 **Iterative Development**: Quick tests with fewer questions
- 🔍 **Targeted Testing**: Focus on specific difficulty levels
- 🏆 **Full Evaluation**: Scale up to complete benchmark
- 📈 **Progress Tracking**: Monitor improvements over time
- 🌟 **Cost Effective**: Test with fewer questions during development
## Current Benchmark Standards
Top models on the leaderboard typically achieve:
- **Level 1**: 80-95% accuracy (basic reasoning)
- **Level 2**: 60-80% accuracy (intermediate reasoning)
- **Level 3**: 30-60% accuracy (advanced reasoning)
- **Overall**: 60-75% accuracy across all levels
Ready to start benchmarking? Choose your question count and selection strategy! 🚀
"""
# Export enhanced agent and functions for use in main app
__all__ = [
'enhanced_gaia_agent',
'run_custom_benchmark_interface',
'load_test_questions_interface',
'preview_dataset_structure_interface',
'get_leaderboard_info',
'get_question_selection_info'
] |
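# Minimal offline smoke test (illustrative; uses only the bundled sample questions,
# so it neither contacts the Hugging Face Hub nor loads a model).
if __name__ == "__main__":
    sample_questions = GAIASampleDataset.create_comprehensive_samples()
    config = QuestionSelectionManager.create_balanced_selection(12)
    selected, summary = QuestionSelectionManager.select_questions(sample_questions, config)
    print(summary)
    demo_prompt = GAIAPromptManager.create_gaia_prompt(selected[0].question)
    answer, _ = GAIAPromptManager.extract_final_answer(
        "The sum works out cleanly.\nFINAL ANSWER: 42"
    )
    print(f"Prompt length: {len(demo_prompt)} chars; extracted sample answer: {answer}")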