WebDevJudge: Evaluating (M)LLMs as Critiques for Web Development Quality • Paper • 2510.18560 • Published 5 days ago • 1
NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents • Paper • 2510.07172 • Published 18 days ago • 27
Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training • Paper • 2508.00414 • Published Aug 1 • 91