Predicting Task Performance with Context-aware Scaling Laws • Paper • 2510.14919 • Published 3 days ago • 3
Budget-aware Test-time Scaling via Discriminative Verification • Paper • 2510.14913 • Published 3 days ago • 4
Re-Tuning: Overcoming the Compositionality Limits of Large Language Models with Recursive Tuning • Paper • 2407.04787 • Published Jul 5, 2024
JudgeBench: A Benchmark for Evaluating LLM-based Judges • Paper • 2410.12784 • Published Oct 16, 2024 • 48
Agent Instructs Large Language Models to be General Zero-Shot Reasoners • Paper • 2310.03710 • Published Oct 5, 2023 • 2