Reasoning Models Struggle to Control their Chains of Thought Paper • 2603.05706 • Published 6 days ago • 25
On Many-Shot In-Context Learning for Long-Context Evaluation Paper • 2411.07130 • Published Nov 11, 2024 • 7
ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists Paper • 2506.01241 • Published Jun 2, 2025 • 9