#!/usr/bin/env python3
"""
GAIA Leaderboard Integration & Continuous Benchmarking
=====================================================

Complete implementation with flexible question selection, balanced sampling,
official leaderboard submission capabilities, and proper metadata.jsonl loading.
"""

import json
import logging
import time
import re
import hashlib
import random
from datetime import datetime
from typing import Dict, List, Optional, Tuple, Any
from dataclasses import dataclass
import pandas as pd
from collections import defaultdict

# Core ML libraries
from datasets import load_dataset
from huggingface_hub import HfApi, hf_hub_download, list_repo_files

# Setup logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# ================================
# ENHANCED DATA STRUCTURES
# ================================

@dataclass
class GAIAQuestion:
    """Enhanced structure for GAIA benchmark questions"""
    task_id: str
    question: str
    level: int
    final_answer: Optional[str] = None
    file_name: Optional[str] = None
    file_path: Optional[str] = None
    annotator_metadata: Optional[Dict] = None
    
    @classmethod
    def from_dict(cls, data: dict):
        return cls(**{k: v for k, v in data.items() if k in cls.__annotations__})

@dataclass
class GAIASubmission:
    """Structure for leaderboard submissions"""
    task_id: str
    model_answer: str
    reasoning_trace: str
    final_answer: str
    processing_time: float = 0.0
    model_name: str = ""
    timestamp: str = ""
    
    def to_leaderboard_format(self) -> Dict[str, str]:
        """Convert to official GAIA leaderboard format"""
        return {
            "task_id": self.task_id,
            "model_answer": self.model_answer,
            "reasoning_trace": self.reasoning_trace
        }

@dataclass
class BenchmarkResult:
    """Comprehensive benchmark results"""
    model_name: str
    total_questions: int
    completed_questions: int
    error_rate: float
    avg_processing_time: float
    total_time: float
    level_breakdown: Dict[int, Dict[str, int]]
    timestamp: str
    submission_hash: str
    question_selection: str

@dataclass
class QuestionSelectionConfig:
    """Configuration for question selection"""
    total_questions: int
    level_distribution: Dict[int, int]  # level -> count
    selection_strategy: str  # "balanced", "random", "sequential"
    seed: Optional[int] = None
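
# Example (illustrative values only): 20 questions split 10/6/4 across the
# three levels, with a fixed seed for reproducible sampling.
#
#   QuestionSelectionConfig(total_questions=20,
#                           level_distribution={1: 10, 2: 6, 3: 4},
#                           selection_strategy="balanced", seed=42)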

# ================================
# GAIA PROMPT MANAGEMENT
# ================================

class GAIAPromptManager:
    """Manages GAIA-specific prompting and formatting"""
    
    GAIA_SYSTEM_PROMPT = """You are a general AI assistant. I will ask you a question. Report your thoughts, and finish your answer with the following template:

FINAL ANSWER: [YOUR FINAL ANSWER]

YOUR FINAL ANSWER should be a number OR as few words as possible OR a comma separated list of numbers and/or strings. If you are asked for a number, don't use comma to write your number neither use units such as $ or percent sign unless specified otherwise. If you are asked for a string, don't use articles, neither abbreviations (e.g. for cities), and write the digits in plain text unless specified otherwise. If you are asked for a comma separated list, apply the above rules depending of whether the element to be put in the list is a number or a string."""

    @staticmethod
    def create_gaia_prompt(question: str) -> str:
        """Create properly formatted GAIA prompt"""
        return f"{GAIAPromptManager.GAIA_SYSTEM_PROMPT}\n\nQuestion: {question}\n\nLet me think step by step:"

    @staticmethod
    def extract_final_answer(response: str) -> Tuple[str, str]:
        """Extract final answer and reasoning from model response"""
        final_answer_pattern = r"FINAL ANSWER:\s*(.+?)(?:\n|$)"
        match = re.search(final_answer_pattern, response, re.IGNORECASE | re.DOTALL)
        
        if match:
            final_answer = match.group(1).strip()
            reasoning_end = match.start()
            reasoning = response[:reasoning_end].strip()
        else:
            lines = response.strip().split('\n')
            final_answer = lines[-1].strip() if lines else ""
            reasoning = '\n'.join(lines[:-1]) if len(lines) > 1 else response
            
        return final_answer, reasoning
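
# Usage sketch (illustrative only, not executed at import time): how
# extract_final_answer splits a response that follows the GAIA template.
# The sample text below is made up for demonstration.
#
#   >>> resp = "The capital of Brazil is Brasília.\nFINAL ANSWER: Brasília"
#   >>> GAIAPromptManager.extract_final_answer(resp)
#   ('Brasília', 'The capital of Brazil is Brasília.')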

# ================================
# QUESTION SELECTION MANAGER
# ================================

class QuestionSelectionManager:
    """Manages intelligent question selection with balanced sampling"""
    
    @staticmethod
    def create_balanced_selection(total_questions: int) -> QuestionSelectionConfig:
        """Create balanced distribution across difficulty levels"""
        if total_questions <= 10:
            # For small tests, split as evenly as possible across levels
            # (at least one question per level once total_questions >= 3)
            base = total_questions // 3
            remainder = total_questions % 3
            level_dist = {1: base + (1 if remainder > 0 else 0),
                          2: base + (1 if remainder > 1 else 0),
                          3: base}
        elif total_questions <= 50:
            # For medium tests, use a 50-30-20 distribution
            level_dist = {1: int(total_questions * 0.5),
                          2: int(total_questions * 0.3)}
            level_dist[3] = total_questions - level_dist[1] - level_dist[2]
        else:
            # For large tests, use a 40-35-25 distribution (closer to real GAIA)
            level_dist = {1: int(total_questions * 0.4),
                          2: int(total_questions * 0.35)}
            level_dist[3] = total_questions - level_dist[1] - level_dist[2]
        
        return QuestionSelectionConfig(
            total_questions=total_questions,
            level_distribution=level_dist,
            selection_strategy="balanced",
            seed=42  # For reproducibility
        )
    
    @staticmethod
    def select_questions(all_questions: List[GAIAQuestion], 
                        config: QuestionSelectionConfig) -> Tuple[List[GAIAQuestion], str]:
        """Select questions based on configuration"""
        
        # Group questions by level
        questions_by_level = defaultdict(list)
        for q in all_questions:
            questions_by_level[q.level].append(q)
        
        # Set random seed for reproducibility
        if config.seed is not None:
            random.seed(config.seed)
        
        selected_questions = []
        selection_info = []
        
        for level, target_count in config.level_distribution.items():
            available_questions = questions_by_level[level]
            
            if not available_questions:
                logger.warning(f"No questions available for level {level}")
                continue
            
            # Select questions based on strategy
            if config.selection_strategy == "balanced" or config.selection_strategy == "random":
                if len(available_questions) <= target_count:
                    selected = available_questions
                else:
                    selected = random.sample(available_questions, target_count)
            elif config.selection_strategy == "sequential":
                selected = available_questions[:target_count]
            else:
                selected = random.sample(available_questions, 
                                       min(target_count, len(available_questions)))
            
            selected_questions.extend(selected)
            selection_info.append(f"Level {level}: {len(selected)}/{len(available_questions)}")
        
        # Shuffle final selection for random order
        random.shuffle(selected_questions)
        
        selection_summary = f"Selected {len(selected_questions)} questions ({', '.join(selection_info)})"
        
        return selected_questions, selection_summary
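
# Usage sketch (illustrative only): building a balanced configuration and
# sampling from an already-loaded question list.  `all_questions` is assumed
# to be a List[GAIAQuestion] obtained elsewhere (e.g. from the loader below).
#
#   config = QuestionSelectionManager.create_balanced_selection(20)
#   subset, summary = QuestionSelectionManager.select_questions(all_questions, config)
#   logger.info(summary)  # e.g. "Selected 20 questions (Level 1: 10/53, ...)"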

# ================================
# COMPREHENSIVE SAMPLE DATASET
# ================================

class GAIASampleDataset:
    """Comprehensive sample dataset with 200+ questions across all levels"""
    
    @staticmethod
    def create_comprehensive_samples() -> List[GAIAQuestion]:
        """Create comprehensive sample dataset with realistic GAIA-style questions"""
        samples = [
            # ========================================
            # LEVEL 1 QUESTIONS (Basic Reasoning) - 25 questions
            # ========================================
            
            # Geography and World Knowledge
            {
                "task_id": "sample_l1_001",
                "question": "What is the capital city of the country that has the largest land area in South America?",
                "level": 1,
                "final_answer": "Brasília"
            },
            {
                "task_id": "sample_l1_002",
                "question": "Which ocean is the largest by surface area?",
                "level": 1,
                "final_answer": "Pacific Ocean"
            },
            {
                "task_id": "sample_l1_003",
                "question": "What is the smallest country in the world by area?",
                "level": 1,
                "final_answer": "Vatican City"
            },
            {
                "task_id": "sample_l1_004",
                "question": "Which continent has the most countries?",
                "level": 1,
                "final_answer": "Africa"
            },
            {
                "task_id": "sample_l1_005",
                "question": "What is the longest river in the world?",
                "level": 1,
                "final_answer": "Nile River"
            },
            
            # Mathematics - Basic Arithmetic
            {
                "task_id": "sample_l1_006", 
                "question": "If a book costs $12.50 and I have a 20% discount coupon, how much will I pay?",
                "level": 1,
                "final_answer": "10"
            },
            {
                "task_id": "sample_l1_007",
                "question": "What is 15% of 200?",
                "level": 1,
                "final_answer": "30"
            },
            {
                "task_id": "sample_l1_008",
                "question": "What is the square root of 144?",
                "level": 1,
                "final_answer": "12"
            },
            {
                "task_id": "sample_l1_009",
                "question": "How many minutes are there in 2.5 hours?",
                "level": 1,
                "final_answer": "150"
            },
            {
                "task_id": "sample_l1_010",
                "question": "What is 144 divided by 12?",
                "level": 1,
                "final_answer": "12"
            },
            
            # Science - Basic Facts
            {
                "task_id": "sample_l1_011",
                "question": "What is the chemical formula for water?",
                "level": 1,
                "final_answer": "H2O"
            },
            {
                "task_id": "sample_l1_012",
                "question": "Which planet in our solar system has the most moons?",
                "level": 1,
                "final_answer": "Saturn"
            },
            {
                "task_id": "sample_l1_013",
                "question": "What is the freezing point of water in Celsius?",
                "level": 1,
                "final_answer": "0"
            },
            {
                "task_id": "sample_l1_014",
                "question": "What is the chemical symbol for gold?",
                "level": 1,
                "final_answer": "Au"
            },
            {
                "task_id": "sample_l1_015",
                "question": "How many legs does a spider have?",
                "level": 1,
                "final_answer": "8"
            },
            
            # History
            {
                "task_id": "sample_l1_016",
                "question": "In what year did the Berlin Wall fall?",
                "level": 1,
                "final_answer": "1989"
            },
            {
                "task_id": "sample_l1_017",
                "question": "What year did World War II end?",
                "level": 1,
                "final_answer": "1945"
            },
            {
                "task_id": "sample_l1_018",
                "question": "Who was the first person to walk on the moon?",
                "level": 1,
                "final_answer": "Neil Armstrong"
            },
            {
                "task_id": "sample_l1_019",
                "question": "In which year did the Titanic sink?",
                "level": 1,
                "final_answer": "1912"
            },
            {
                "task_id": "sample_l1_020",
                "question": "Which ancient wonder of the world was located in Alexandria?",
                "level": 1,
                "final_answer": "Lighthouse of Alexandria"
            },
            
            # Simple Sequences and Patterns
            {
                "task_id": "sample_l1_021",
                "question": "What is the next number in the sequence: 2, 4, 8, 16, ?",
                "level": 1,
                "final_answer": "32"
            },
            {
                "task_id": "sample_l1_022",
                "question": "What is the next number in the sequence: 5, 10, 15, 20, ?",
                "level": 1,
                "final_answer": "25"
            },
            {
                "task_id": "sample_l1_023",
                "question": "What is the next letter in the sequence: A, C, E, G, ?",
                "level": 1,
                "final_answer": "I"
            },
            {
                "task_id": "sample_l1_024",
                "question": "Complete the pattern: 1, 4, 9, 16, ?",
                "level": 1,
                "final_answer": "25"
            },
            {
                "task_id": "sample_l1_025",
                "question": "What comes next: Monday, Wednesday, Friday, ?",
                "level": 1,
                "final_answer": "Sunday"
            },
            
            # ========================================
            # LEVEL 2 QUESTIONS (Intermediate Reasoning) - 10 questions
            # ========================================
            
            # Multi-step Math Problems
            {
                "task_id": "sample_l2_001",
                "question": "A train travels 60 km in the first hour, 80 km in the second hour, and 100 km in the third hour. If this pattern continues, how far will it travel in the 5th hour?",
                "level": 2,
                "final_answer": "140"
            },
            {
                "task_id": "sample_l2_002",
                "question": "A rectangular garden is 12 meters long and 8 meters wide. If you want to put a fence around it, how many meters of fencing do you need?",
                "level": 2,
                "final_answer": "40"
            },
            {
                "task_id": "sample_l2_003",
                "question": "If a car travels at 60 km/h for 2.5 hours, then at 80 km/h for 1.5 hours, what is the total distance traveled?",
                "level": 2,
                "final_answer": "270"
            },
            {
                "task_id": "sample_l2_004",
                "question": "A store has a sale where everything is 25% off. If an item originally costs $80, and you have an additional $10 coupon, what is your final price?",
                "level": 2,
                "final_answer": "50"
            },
            {
                "task_id": "sample_l2_005",
                "question": "If you save $50 per month for 18 months, then spend $300, how much money do you have left?",
                "level": 2,
                "final_answer": "600"
            },
            
            # Logic and Problem Solving
            {
                "task_id": "sample_l2_006",
                "question": "In a class of 30 students, 18 play soccer, 12 play basketball, and 6 play both sports. How many students play neither sport?",
                "level": 2,
                "final_answer": "6"
            },
            {
                "task_id": "sample_l2_007",
                "question": "If today is Wednesday and it was Tuesday 8 days ago, what day of the week will it be 15 days from now?",
                "level": 2,
                "final_answer": "Thursday"
            },
            {
                "task_id": "sample_l2_008",
                "question": "A number when multiplied by 4 and then decreased by 7 equals 29. What is the number?",
                "level": 2,
                "final_answer": "9"
            },
            {
                "task_id": "sample_l2_009",
                "question": "If the temperature increases by 3°C every hour starting from 15°C, what will the temperature be after 4 hours?",
                "level": 2,
                "final_answer": "27"
            },
            {
                "task_id": "sample_l2_010",
                "question": "A recipe calls for 3 cups of flour to make 24 cookies. How many cups of flour do you need to make 40 cookies?",
                "level": 2,
                "final_answer": "5"
            },
            
            # ========================================
            # LEVEL 3 QUESTIONS (Advanced Reasoning) - 10 questions
            # ========================================
            
            # Complex Mathematical Problems
            {
                "task_id": "sample_l3_001",
                "question": "A company's revenue increased by 25% in the first quarter, decreased by 10% in the second quarter, and increased by 15% in the third quarter. If the original revenue was $100,000, what is the revenue at the end of the third quarter?",
                "level": 3,
                "final_answer": "129375"
            },
            {
                "task_id": "sample_l3_002",
                "question": "A ball is dropped from a height of 100 meters. Each time it bounces, it reaches 75% of its previous height. What is the total distance the ball travels before coming to rest?",
                "level": 3,
                "final_answer": "700"
            },
            {
                "task_id": "sample_l3_003",
                "question": "A bacteria culture doubles every 20 minutes. If you start with 500 bacteria, how many will you have after 2 hours?",
                "level": 3,
                "final_answer": "32000"
            },
            {
                "task_id": "sample_l3_004",
                "question": "If log₂(x) + log₂(x+6) = 4, what is the value of x?",
                "level": 3,
                "final_answer": "2"
            },
            {
                "task_id": "sample_l3_005",
                "question": "A cylindrical tank with radius 3 meters is being filled with water at a rate of 2 cubic meters per minute. How fast is the water level rising in meters per minute?",
                "level": 3,
                "final_answer": "2/(9π)"
            },
            
            # Complex Logic Problems
            {
                "task_id": "sample_l3_006",
                "question": "In a group of 100 people, 60 like coffee, 40 like tea, and 20 like both. How many people like neither coffee nor tea?",
                "level": 3,
                "final_answer": "20"
            },
            {
                "task_id": "sample_l3_007",
                "question": "In a chess tournament, each player plays every other player exactly once. If there are 45 games played in total, how many players are in the tournament?",
                "level": 3,
                "final_answer": "10"
            },
            {
                "task_id": "sample_l3_008",
                "question": "You have a 3-gallon jug and a 5-gallon jug. How can you measure exactly 4 gallons of water? Describe the steps.",
                "level": 3,
                "final_answer": "Fill 5-gallon jug, pour into 3-gallon jug leaving 2 gallons, empty 3-gallon jug, pour 2 gallons into it, fill 5-gallon jug again, pour from 5-gallon into 3-gallon until full"
            },
            {
                "task_id": "sample_l3_009",
                "question": "A box contains 6 red balls, 4 blue balls, and 5 green balls. If you draw 3 balls without replacement, what is the probability that all 3 are different colors?",
                "level": 3,
                "final_answer": "24/91"
            },
            {
                "task_id": "sample_l3_010",
                "question": "In a sequence where each term is the sum of the two preceding terms, if the 5th term is 21 and the 7th term is 55, what is the 6th term?",
                "level": 3,
                "final_answer": "34"
            }
        ]
        
        return [GAIAQuestion.from_dict(data) for data in samples]

# ================================
# GAIA LEADERBOARD MANAGER (UPDATED)
# ================================

class GAIALeaderboardManager:
    """Manages interactions with the official GAIA leaderboard with proper metadata.jsonl loading"""
    
    LEADERBOARD_URL = "https://huggingface.co/spaces/gaia-benchmark/leaderboard"
    DATASET_NAME = "gaia-benchmark/GAIA"
    
    def __init__(self):
        self.api = HfApi()
        self.sample_dataset = GAIASampleDataset()
        
    def load_test_questions(self, max_questions: int = None, 
                           question_selection: str = "balanced") -> Tuple[List[GAIAQuestion], str]:
        """Load GAIA test questions from metadata.jsonl with proper file handling"""
        
        # Try Method 1: Load from metadata.jsonl files (preferred)
        official_questions = self._try_load_official_dataset()
        
        if official_questions:
            logger.info(f"✅ Successfully loaded {len(official_questions)} official GAIA questions")
            all_questions = official_questions
            source_info = "official GAIA metadata.jsonl"
        else:
            # Try Method 2: Datasets library fallback
            logger.info("Trying datasets library as fallback...")
            fallback_questions = self._try_load_with_datasets_library()
            
            if fallback_questions:
                logger.info(f"✅ Successfully loaded {len(fallback_questions)} questions via datasets library")
                all_questions = fallback_questions
                source_info = "GAIA dataset (via datasets library)"
            else:
                # Method 3: Use comprehensive samples
                logger.warning("All loading methods failed, using comprehensive samples")
                all_questions = self.sample_dataset.create_comprehensive_samples()
                source_info = "comprehensive sample dataset"
        
        # Log the distribution
        level_dist = self._get_level_distribution(all_questions)
        logger.info(f"Question distribution: {level_dist}")
        
        # Apply question selection if requested
        if max_questions is None or max_questions >= len(all_questions):
            return all_questions, f"✅ Loaded {len(all_questions)} questions from {source_info}"
        
        # Create selection configuration based on user preference
        if question_selection == "balanced":
            config = QuestionSelectionManager.create_balanced_selection(max_questions)
        elif question_selection == "random":
            config = QuestionSelectionConfig(
                total_questions=max_questions,
                level_distribution={1: max_questions // 3, 2: max_questions // 3,
                                    3: max_questions - 2 * (max_questions // 3)},
                selection_strategy="random",
                seed=None
            )
        else:  # sequential
            config = QuestionSelectionConfig(
                total_questions=max_questions,
                level_distribution={1: max_questions // 3, 2: max_questions // 3,
                                    3: max_questions - 2 * (max_questions // 3)},
                selection_strategy="sequential"
            )
        
        # Select questions based on configuration
        selected_questions, selection_summary = QuestionSelectionManager.select_questions(
            all_questions, config
        )
        
        status_msg = f"✅ {selection_summary} from {source_info} ({question_selection} selection)"
        return selected_questions, status_msg
    
    def _try_load_official_dataset(self) -> Optional[List[GAIAQuestion]]:
        """Load official GAIA dataset from metadata.jsonl files"""
        
        try:
            logger.info("Loading GAIA dataset from metadata.jsonl files...")
            
            # First, let's see what files are available in the repository
            try:
                repo_files = list_repo_files("gaia-benchmark/GAIA", repo_type="dataset")
                metadata_files = [f for f in repo_files if f.endswith('metadata.jsonl')]
                logger.info(f"Found metadata files: {metadata_files}")
            except Exception as e:
                logger.warning(f"Could not list repo files: {e}")
                # Proceed with known paths
                metadata_files = [
                    "2023/validation/metadata.jsonl",
                    "2023/test/metadata.jsonl"
                ]
            
            # Try to load metadata files in order of preference
            load_attempts = [
                ("2023/validation/metadata.jsonl", "2023 validation set (with answers)"),
                ("2023/test/metadata.jsonl", "2023 test set (official leaderboard)"),
                # Fallback paths in case structure is different
                ("validation/metadata.jsonl", "validation set fallback"),
                ("test/metadata.jsonl", "test set fallback"),
                ("metadata.jsonl", "root metadata file")
            ]
            
            for file_path, description in load_attempts:
                # Skip if we know this file doesn't exist
                if metadata_files and file_path not in metadata_files:
                    continue
                    
                try:
                    logger.info(f"Attempting to download: {file_path}")
                    
                    # Download the metadata.jsonl file
                    local_path = hf_hub_download(
                        repo_id="gaia-benchmark/GAIA",
                        filename=file_path,
                        repo_type="dataset"
                    )
                    
                    logger.info(f"Successfully downloaded {file_path} to {local_path}")
                    
                    # Parse the JSONL file
                    questions = []
                    with open(local_path, 'r', encoding='utf-8') as f:
                        for line_num, line in enumerate(f, 1):
                            line = line.strip()
                            if not line:
                                continue
                                
                            try:
                                item = json.loads(line)
                                question = self._parse_gaia_question(item, line_num, file_path)
                                if question:
                                    questions.append(question)
                                    
                            except json.JSONDecodeError as e:
                                logger.warning(f"Failed to parse line {line_num} in {file_path}: {e}")
                                continue
                    
                    if questions:
                        logger.info(f"Successfully loaded {len(questions)} questions from {file_path}")
                        logger.info(f"Question levels distribution: {self._get_level_distribution(questions)}")
                        return questions
                    else:
                        logger.warning(f"No valid questions found in {file_path}")
                        
                except Exception as e:
                    logger.warning(f"Failed to load {file_path}: {e}")
                    continue
            
            logger.error("All metadata.jsonl loading attempts failed")
            return None
            
        except Exception as e:
            logger.error(f"General error in dataset loading: {e}")
            return None
    
    def _parse_gaia_question(self, item: dict, line_num: int, source_file: str) -> Optional[GAIAQuestion]:
        """Parse a single question from GAIA metadata.jsonl format"""
        
        try:
            # Extract required fields
            question_text = item.get('Question', '').strip()
            if not question_text:
                logger.warning(f"Line {line_num}: Missing or empty 'Question' field")
                return None
            
            # Extract task ID
            task_id = item.get('task_id', f'gaia_line_{line_num}')
            if not task_id:
                logger.warning(f"Line {line_num}: Missing 'task_id' field")
                return None
            
            # Extract level (should be 1, 2, or 3)
            level = item.get('Level')
            if level is None:
                logger.warning(f"Line {line_num}: Missing 'Level' field")
                level = 1
            else:
                try:
                    level = int(level)
                    if level not in [1, 2, 3]:
                        logger.warning(f"Line {line_num}: Invalid level {level}, setting to 1")
                        level = 1
                except (ValueError, TypeError):
                    logger.warning(f"Line {line_num}: Could not parse level '{level}', setting to 1")
                    level = 1
            
            # Extract optional fields
            final_answer = item.get('Final answer')  # May not be available in test set
            file_name = item.get('file_name')  # Additional file if needed
            annotator_metadata = item.get('Annotator Metadata')
            
            # Create file path if file_name is provided
            file_path = None
            if file_name:
                # Construct the full path to the additional file
                # It should be in the same folder as the metadata.jsonl
                folder_path = '/'.join(source_file.split('/')[:-1])  # Remove 'metadata.jsonl'
                if folder_path:
                    file_path = f"{folder_path}/{file_name}"
                else:
                    file_path = file_name
            
            question = GAIAQuestion(
                task_id=task_id,
                question=question_text,
                level=level,
                final_answer=final_answer,
                file_name=file_name,
                file_path=file_path,
                annotator_metadata=annotator_metadata
            )
            
            return question
            
        except Exception as e:
            logger.error(f"Error parsing question at line {line_num}: {e}")
            return None
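
    # Expected shape of one metadata.jsonl record, as read by _parse_gaia_question
    # above (field names match the .get() calls; values are illustrative):
    #
    #   {"task_id": "...", "Question": "...", "Level": 1,
    #    "Final answer": "...", "file_name": "optional_attachment.xlsx",
    #    "Annotator Metadata": {...}}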
    
    def _get_level_distribution(self, questions: List[GAIAQuestion]) -> dict:
        """Get distribution of questions by level for logging"""
        distribution = {1: 0, 2: 0, 3: 0}
        for q in questions:
            distribution[q.level] = distribution.get(q.level, 0) + 1
        return distribution
    
    def _download_additional_file(self, file_path: str) -> Optional[str]:
        """Download additional file referenced by file_name field"""
        
        try:
            logger.info(f"Downloading additional file: {file_path}")
            
            local_path = hf_hub_download(
                repo_id="gaia-benchmark/GAIA",
                filename=file_path,
                repo_type="dataset"
            )
            
            logger.info(f"Successfully downloaded {file_path} to {local_path}")
            return local_path
            
        except Exception as e:
            logger.warning(f"Failed to download additional file {file_path}: {e}")
            return None
    
    def _try_load_with_datasets_library(self) -> Optional[List[GAIAQuestion]]:
        """Fallback method using datasets library"""
        
        dataset_configs = [
            # Try different ways to specify the 2023 configuration
            {"data_dir": "2023", "split": "validation"},
            {"data_dir": "2023", "split": "test"},
            {"name": "2023", "split": "validation"},
            {"name": "2023", "split": "test"},
            {"split": "validation"},
            {"split": "test"}
        ]
        
        for config in dataset_configs:
            try:
                logger.info(f"Trying datasets library with config: {config}")
                
                dataset = load_dataset(
                    "gaia-benchmark/GAIA",
                    trust_remote_code=True,
                    **config
                )
                
                questions = []
                for i in range(len(dataset)):
                    item = dataset[i]
                    question = self._parse_gaia_question(item, i, f"datasets_library_{config}")
                    if question:
                        questions.append(question)
                
                if questions:
                    logger.info(f"Successfully loaded {len(questions)} questions using datasets library")
                    return questions
                    
            except Exception as e:
                logger.warning(f"Datasets library failed with config {config}: {e}")
                continue
        
        return None
    
    def preview_dataset_structure(self) -> str:
        """Preview the actual dataset structure for debugging"""
        
        try:
            # List all files in the repository
            repo_files = list_repo_files("gaia-benchmark/GAIA", repo_type="dataset")
            
            # Categorize files
            metadata_files = [f for f in repo_files if f.endswith('metadata.jsonl')]
            other_files = [f for f in repo_files if not f.endswith('metadata.jsonl')][:10]  # First 10 other files
            
            preview = f"""
# 📁 GAIA Dataset Structure

## Metadata Files (Questions):
{chr(10).join(f"- {f}" for f in metadata_files)}

## Sample Additional Files:
{chr(10).join(f"- {f}" for f in other_files)}

## Total Files in Repository: {len(repo_files)}
"""
            
            # Try to preview a sample question
            if metadata_files:
                try:
                    # Download first metadata file
                    local_path = hf_hub_download(
                        repo_id="gaia-benchmark/GAIA",
                        filename=metadata_files[0],
                        repo_type="dataset"
                    )
                    
                    # Read first question
                    with open(local_path, 'r', encoding='utf-8') as f:
                        first_line = f.readline().strip()
                        if first_line:
                            sample_question = json.loads(first_line)
                            preview += f"""

## Sample Question Structure:
```json
{json.dumps(sample_question, indent=2)[:500]}...
```

## Available Fields:
{list(sample_question.keys())}
"""
                            
                except Exception as e:
                    preview += f"\n\n⚠️ Could not preview sample question: {e}"
            
            return preview
            
        except Exception as e:
            return f"❌ Error accessing dataset structure: {e}"
    
    def create_submission_file(self, submissions: List[GAIASubmission], model_name: str) -> Tuple[str, str]:
        """Create official GAIA leaderboard submission file"""
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        filename = f"gaia_submission_{model_name}_{timestamp}.jsonl"
        
        # Create submission in official format
        submission_data = []
        for sub in submissions:
            submission_data.append(sub.to_leaderboard_format())
        
        # Write JSONL file
        with open(filename, 'w', encoding='utf-8') as f:
            for entry in submission_data:
                f.write(json.dumps(entry) + '\n')
        
        # Create submission hash for verification
        with open(filename, 'rb') as f:
            file_hash = hashlib.md5(f.read()).hexdigest()
        
        # Create metadata file
        metadata = {
            "model_name": model_name,
            "submission_time": timestamp,
            "total_questions": len(submissions),
            "file_hash": file_hash,
            "format_version": "1.0"
        }
        
        metadata_filename = f"gaia_metadata_{model_name}_{timestamp}.json"
        with open(metadata_filename, 'w') as f:
            json.dump(metadata, f, indent=2)
        
        return filename, metadata_filename
    
    def validate_submission(self, filename: str) -> Tuple[bool, str]:
        """Validate submission file format"""
        try:
            with open(filename, 'r') as f:
                lines = f.readlines()
            
            required_fields = {"task_id", "model_answer", "reasoning_trace"}
            
            for i, line in enumerate(lines):
                try:
                    entry = json.loads(line.strip())
                    if not all(field in entry for field in required_fields):
                        return False, f"Line {i+1}: Missing required fields. Required: {required_fields}"
                    
                    if not isinstance(entry["task_id"], str) or not entry["task_id"]:
                        return False, f"Line {i+1}: Invalid task_id"
                        
                except json.JSONDecodeError:
                    return False, f"Line {i+1}: Invalid JSON format"
            
            return True, f"✅ Submission file is valid ({len(lines)} entries)"
            
        except Exception as e:
            return False, f"❌ Error validating file: {str(e)}"
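
# Note on the output format (illustrative values): create_submission_file() writes
# one JSON object per line in the official leaderboard schema produced by
# GAIASubmission.to_leaderboard_format(), e.g.
#
#   {"task_id": "sample_l1_001", "model_answer": "... FINAL ANSWER: Brasília",
#    "reasoning_trace": "..."}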

# ================================
# CONTINUOUS BENCHMARKING SYSTEM
# ================================

class ContinuousBenchmarkingSystem:
    """System for automated continuous benchmarking and tracking"""
    
    def __init__(self):
        self.benchmark_history: List[BenchmarkResult] = []
        self.leaderboard_manager = GAIALeaderboardManager()
        
    def run_flexible_benchmark(self, agent, model_name: str, 
                             num_questions: int = 50,
                             question_selection: str = "balanced",
                             progress_callback=None) -> Tuple[BenchmarkResult, List[GAIASubmission], str, str]:
        """Run flexible benchmark with customizable question selection"""
        start_time = time.time()
        
        # Load questions with specified selection
        questions, status = self.leaderboard_manager.load_test_questions(
            max_questions=num_questions, 
            question_selection=question_selection
        )
        
        if progress_callback:
            progress_callback(0.1, f"Loaded {len(questions)} questions")
        
        # Initialize tracking
        submissions = []
        level_stats = {1: {"total": 0, "completed": 0}, 
                      2: {"total": 0, "completed": 0}, 
                      3: {"total": 0, "completed": 0}}
        
        total_questions = len(questions)
        
        # Process each question
        for i, question in enumerate(questions):
            if progress_callback:
                progress_callback((i + 1) / total_questions, 
                                f"Processing question {i+1}/{total_questions} (Level {question.level})")
            
            # Track by level
            level_stats[question.level]["total"] += 1
            
            try:
                # Process question
                start_q_time = time.time()
                prompt = agent.prompt_manager.create_gaia_prompt(question.question)
                raw_response = agent.model_manager.generate_response(prompt)
                final_answer, reasoning = agent.prompt_manager.extract_final_answer(raw_response)
                processing_time = time.time() - start_q_time
                
                # Create submission
                submission = GAIASubmission(
                    task_id=question.task_id,
                    model_answer=raw_response,
                    reasoning_trace=reasoning,
                    final_answer=final_answer,
                    processing_time=processing_time,
                    model_name=model_name,
                    timestamp=datetime.now().isoformat()
                )
                
                submissions.append(submission)
                level_stats[question.level]["completed"] += 1
                
                # Log progress
                logger.info(f"Completed {question.task_id}: {final_answer[:50]}...")
                
            except Exception as e:
                logger.error(f"Error processing {question.task_id}: {e}")
                # Add error submission
                error_submission = GAIASubmission(
                    task_id=question.task_id,
                    model_answer=f"Error: {str(e)}",
                    reasoning_trace="Processing failed",
                    final_answer="ERROR",
                    processing_time=0.0,
                    model_name=model_name,
                    timestamp=datetime.now().isoformat()
                )
                submissions.append(error_submission)
        
        # Calculate final metrics
        total_time = time.time() - start_time
        completed = sum(level_stats[level]["completed"] for level in level_stats)
        error_rate = (total_questions - completed) / total_questions if total_questions > 0 else 0
        avg_time = sum(s.processing_time for s in submissions) / len(submissions) if submissions else 0
        
        # Create submission files
        submission_file, metadata_file = self.leaderboard_manager.create_submission_file(
            submissions, model_name
        )
        
        # Create submission hash
        with open(submission_file, 'rb') as f:
            submission_hash = hashlib.md5(f.read()).hexdigest()[:8]
        
        # Create benchmark result
        result = BenchmarkResult(
            model_name=model_name,
            total_questions=total_questions,
            completed_questions=completed,
            error_rate=error_rate,
            avg_processing_time=avg_time,
            total_time=total_time,
            level_breakdown=level_stats,
            timestamp=datetime.now().isoformat(),
            submission_hash=submission_hash,
            question_selection=f"{num_questions} questions ({question_selection})"
        )
        
        self.benchmark_history.append(result)
        
        return result, submissions, submission_file, metadata_file
    
    def generate_benchmark_report(self, result: BenchmarkResult) -> str:
        """Generate comprehensive benchmark report"""
        report = f"""
# 🏆 GAIA Benchmark Report

## Model Information
- **Model Name**: {result.model_name}
- **Benchmark Date**: {result.timestamp}
- **Question Selection**: {result.question_selection}
- **Submission Hash**: {result.submission_hash}

## Overall Performance
- **Total Questions**: {result.total_questions}
- **Successfully Processed**: {result.completed_questions}
- **Success Rate**: {((result.completed_questions / result.total_questions) * 100):.1f}%
- **Error Rate**: {(result.error_rate * 100):.1f}%

## Performance Metrics
- **Average Processing Time**: {result.avg_processing_time:.2f}s per question
- **Total Benchmark Time**: {(result.total_time / 60):.1f} minutes
- **Throughput**: {(result.total_questions / (result.total_time / 60)):.1f} questions/minute

## Performance by Difficulty Level

| Level | Description | Total Questions | Completed | Success Rate |
|-------|-------------|----------------|-----------|--------------|
"""
        
        level_descriptions = {
            1: "Basic Reasoning",
            2: "Intermediate Reasoning", 
            3: "Advanced Reasoning"
        }
        
        for level in [1, 2, 3]:
            stats = result.level_breakdown[level]
            success_rate = (stats["completed"] / stats["total"] * 100) if stats["total"] > 0 else 0
            description = level_descriptions.get(level, "Unknown")
            report += f"| Level {level} | {description} | {stats['total']} | {stats['completed']} | {success_rate:.1f}% |\n"
        
        # Add performance analysis
        l1_rate = (result.level_breakdown[1]["completed"] / max(1, result.level_breakdown[1]["total"]) * 100)
        l2_rate = (result.level_breakdown[2]["completed"] / max(1, result.level_breakdown[2]["total"]) * 100)
        l3_rate = (result.level_breakdown[3]["completed"] / max(1, result.level_breakdown[3]["total"]) * 100)
        
        report += f"""

## Performance Analysis
- **Strength**: {"Level 1 (Basic)" if l1_rate >= max(l2_rate, l3_rate) else "Level 2 (Intermediate)" if l2_rate >= l3_rate else "Level 3 (Advanced)"}
- **Improvement Area**: {"Level 3 (Advanced)" if l3_rate <= min(l1_rate, l2_rate) else "Level 2 (Intermediate)" if l2_rate <= l1_rate else "Level 1 (Basic)"}
- **Processing Speed**: {"Fast" if result.avg_processing_time < 10 else "Medium" if result.avg_processing_time < 30 else "Slow"}

## Leaderboard Submission
- ✅ Submission file generated in official GAIA format
- ✅ Ready for upload to [GAIA Leaderboard]({GAIALeaderboardManager.LEADERBOARD_URL})
- 📁 Download the JSONL file below for submission

## Next Steps
1. Download the submission file
2. Visit the [GAIA Leaderboard]({GAIALeaderboardManager.LEADERBOARD_URL})
3. Upload your results
4. Compare with other models on the public leaderboard

---
*Report generated on {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}*
"""
        
        return report
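
# Usage sketch (illustrative only): running a small benchmark end to end.
# `agent` is assumed to expose .prompt_manager and .model_manager attributes,
# as EnhancedGAIAAgent below does once a model has been initialized.
#
#   system = ContinuousBenchmarkingSystem()
#   result, subs, sub_file, meta_file = system.run_flexible_benchmark(
#       agent, "my_model", num_questions=20, question_selection="balanced")
#   print(system.generate_benchmark_report(result))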

# ================================
# ENHANCED GAIA AGENT WITH FLEXIBLE BENCHMARKING
# ================================

class EnhancedGAIAAgent:
    """Enhanced GAIA agent with flexible benchmarking capabilities"""
    
    def __init__(self):
        self.model_manager = None
        self.prompt_manager = GAIAPromptManager()
        self.leaderboard_manager = GAIALeaderboardManager()
        self.benchmark_system = ContinuousBenchmarkingSystem()
        self.current_model = None
        
    def initialize_model(self, model_choice: str, progress=None) -> str:
        """Initialize model with progress tracking"""
        try:
            if progress:
                progress(0, desc="Initializing model...")
            
            # Import model manager from main app
            import importlib
            app_module = importlib.import_module('app')
            HFSpaceModelManager = app_module.HFSpaceModelManager
            
            self.model_manager = HFSpaceModelManager(model_choice)
            self.current_model = model_choice
            
            def progress_callback(value, desc):
                if progress:
                    progress(value, desc=desc)
            
            result = self.model_manager.load_model(progress_callback)
            return result
            
        except Exception as e:
            return f"❌ Failed to initialize model: {str(e)}"
    
    def run_custom_benchmark(self, num_questions: int = 50, 
                           question_selection: str = "balanced",
                           progress=None) -> Tuple[str, str, str, str]:
        """Run custom benchmark with flexible options"""
        if self.model_manager is None:
            return "❌ No model loaded", "", "", ""
        
        model_name = self.current_model.replace(" ", "_").replace("&", "and")
        
        try:
            # Run flexible benchmark
            result, submissions, submission_file, metadata_file = self.benchmark_system.run_flexible_benchmark(
                self, model_name, num_questions, question_selection, progress
            )
            
            # Generate report
            report = self.benchmark_system.generate_benchmark_report(result)
            
            # Validate submission
            is_valid, validation_msg = self.leaderboard_manager.validate_submission(submission_file)
            
            if is_valid:
                status = f"✅ Benchmark completed successfully!\n{validation_msg}"
            else:
                status = f"⚠️ Benchmark completed but validation failed:\n{validation_msg}"
            
            return status, report, submission_file, metadata_file
            
        except Exception as e:
            return f"❌ Benchmark failed: {str(e)}", "", "", ""

# ================================
# GLOBAL INSTANCES AND INTERFACE FUNCTIONS
# ================================

# Global enhanced agent
enhanced_gaia_agent = EnhancedGAIAAgent()

def run_custom_benchmark_interface(num_questions: int, question_selection: str, progress=None):
    """Interface for running custom benchmark with options"""
    return enhanced_gaia_agent.run_custom_benchmark(num_questions, question_selection, progress)

def load_test_questions_interface(max_questions: int = 10, selection_type: str = "balanced"):
    """Interface for loading test questions info with selection options"""
    questions, status = enhanced_gaia_agent.leaderboard_manager.load_test_questions(
        max_questions=max_questions, 
        question_selection=selection_type
    )
    
    preview = f"""
{status}

## Question Distribution:
"""
    
    # Count by level
    level_counts = {1: 0, 2: 0, 3: 0}
    for q in questions:
        level_counts[q.level] = level_counts.get(q.level, 0) + 1
    
    for level in [1, 2, 3]:
        preview += f"- **Level {level}**: {level_counts[level]} questions\n"
    
    preview += f"\n## Sample Questions Preview:\n\n"
    
    # Show samples from each level
    samples_shown = 0
    for level in [1, 2, 3]:
        level_questions = [q for q in questions if q.level == level]
        if level_questions and samples_shown < 6:
            q = level_questions[0]
            preview += f"**Question (Level {q.level})**: {q.question}\n\n"
            samples_shown += 1
    
    if len(questions) > samples_shown:
        preview += f"... and {len(questions) - samples_shown} more questions"
    
    return preview

def preview_dataset_structure_interface():
    """Interface for previewing dataset structure"""
    return enhanced_gaia_agent.leaderboard_manager.preview_dataset_structure()

def get_question_selection_info():
    """Get information about question selection options"""
    return """
# 🎯 Question Selection Options

## Selection Strategies

### 📊 **Balanced Selection** (Recommended)
- **Level 1**: ~40-50% (Basic reasoning)
- **Level 2**: ~30-35% (Intermediate reasoning)  
- **Level 3**: ~20-25% (Advanced reasoning)
- **Best for**: Realistic performance evaluation

### 🎲 **Random Selection**
- **Distribution**: Random across all levels
- **Variety**: Maximum question diversity
- **Best for**: Unbiased sampling

### 📋 **Sequential Selection**
- **Order**: Questions in dataset order
- **Consistency**: Same questions each time
- **Best for**: Reproducible testing

## Question Count Recommendations

| Purpose | Questions | Time | Selection |
|---------|-----------|------|-----------|
| **Quick Test** | 10-20 | 5-15 min | Balanced |
| **Development** | 30-50 | 15-30 min | Balanced |
| **Validation** | 50-100 | 30-60 min | Random |
| **Full Benchmark** | 200+ | 1-3 hours | Balanced |
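
As a quick illustration (a sketch using the defaults defined in this module; a model must be initialized first), a short development run can be launched through the interface function exported here:

```python
status, report, submission_file, metadata_file = run_custom_benchmark_interface(
    num_questions=30,
    question_selection="balanced",
)
```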

## Level Descriptions

### Level 1: Basic Reasoning
- Simple factual questions
- Basic arithmetic
- Single-step problems
- **Examples**: "What is the capital of France?", "Calculate 15% of 200"

### Level 2: Intermediate Reasoning  
- Multi-step problems
- Logic puzzles
- Time/date calculations
- **Examples**: "Train speed problems", "Probability calculations"

### Level 3: Advanced Reasoning
- Complex mathematical problems
- Multi-step logic
- Advanced problem solving
- **Examples**: "Compound interest calculations", "Complex word problems"
"""

def get_leaderboard_info():
    """Get information about the GAIA leaderboard"""
    return f"""
# 🏆 GAIA Public Leaderboard

## Overview
The GAIA benchmark provides a **public leaderboard** hosted on Hugging Face where you can:
- Submit results from **300 official test questions**
- Compare your model against state-of-the-art systems
- Track progress in AI reasoning capabilities
- Contribute to the research community

## Leaderboard Details
- **Official URL**: [GAIA Leaderboard]({GAIALeaderboardManager.LEADERBOARD_URL})
- **Test Questions**: 300 questions across 3 difficulty levels
- **Submission Format**: JSONL files with specific schema
- **Evaluation**: Automated scoring and ranking
- **Public Rankings**: Open comparison of all submissions

## Dataset Structure
- **Questions**: Stored in `metadata.jsonl` files
- **Additional Files**: Some questions reference extra files (images, documents, etc.)
- **Folder Structure**: `2023/validation/` and `2023/test/` directories
- **Format**: Each line in metadata.jsonl contains one question in JSON format

## Flexible Benchmarking Features

### 🎯 **Custom Question Selection**
- **Choose Count**: 10 to 300+ questions
- **Selection Strategy**: Balanced, Random, or Sequential
- **Level Distribution**: Automatic balancing across difficulty levels
- **Reproducible**: Consistent results with same settings

### 📊 **Smart Sampling**
- **Balanced**: Realistic distribution (50% L1, 30% L2, 20% L3)
- **Representative**: Questions from all difficulty levels
- **Efficient**: Test fewer questions while maintaining quality

## How to Submit
1. **Run Benchmark**: Use custom settings for your evaluation
2. **Download Results**: Get the generated JSONL submission file
3. **Visit Leaderboard**: Go to the official GAIA leaderboard
4. **Upload File**: Submit your JSONL file for evaluation
5. **View Results**: Check your model's ranking and performance

## Benefits of Flexible Benchmarking
- 📊 **Iterative Development**: Quick tests with fewer questions
- 🔍 **Targeted Testing**: Focus on specific difficulty levels
- 🏆 **Full Evaluation**: Scale up to complete benchmark
- 📈 **Progress Tracking**: Monitor improvements over time
- 🌟 **Cost Effective**: Test with fewer questions during development

## Current Benchmark Standards
Top models on the leaderboard typically achieve:
- **Level 1**: 80-95% accuracy (basic reasoning)
- **Level 2**: 60-80% accuracy (intermediate reasoning)  
- **Level 3**: 30-60% accuracy (advanced reasoning)
- **Overall**: 60-75% accuracy across all levels

Ready to start benchmarking? Choose your question count and selection strategy! 🚀
"""

# Export enhanced agent and functions for use in main app
__all__ = [
    'enhanced_gaia_agent', 
    'run_custom_benchmark_interface',
    'load_test_questions_interface', 
    'preview_dataset_structure_interface',
    'get_leaderboard_info',
    'get_question_selection_info'
]
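
# Minimal manual check when the module is run directly (a sketch: it only prints the
# static help text; running an actual benchmark additionally requires loading a model
# via enhanced_gaia_agent.initialize_model and the main app's HFSpaceModelManager).
if __name__ == "__main__":
    print(get_question_selection_info())
    print(get_leaderboard_info())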