Surfer 2: The Next Generation of Cross-Platform Computer Use Agents Paper • 2510.19949 • Published Oct 22 • 38
GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering Paper • 2409.06595 • Published Sep 10, 2024 • 38
GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering Paper • 2409.06595 • Published Sep 10, 2024 • 38
CroissantLLM: A Truly Bilingual French-English Language Model Paper • 2402.00786 • Published Feb 1, 2024 • 26
TIAM -- A Metric for Evaluating Alignment in Text-to-Image Generation Paper • 2307.05134 • Published Jul 11, 2023 • 3