LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training Paper • 2406.16554 • Published Jun 24, 2024 • 1
Adaptive Fast-and-Slow Visual Program Reasoning for Long-Form VideoQA Paper • 2509.17743 • Published Sep 22, 2025
Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm Paper • 2511.04570 • Published Nov 6, 2025 • 242
OpenNovelty: An LLM-powered Agentic System for Verifiable Scholarly Novelty Assessment Paper • 2601.01576 • Published Jan 4, 2026 • 19
Beyond Scaling: Measuring and Predicting the Upper Bound of Knowledge Retention in Language Model Pre-Training Paper • 2502.04066 • Published Feb 6, 2025
LLMEval-Fair: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models Paper • 2508.05452 • Published Aug 7, 2025
LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation Paper • 2506.04078 • Published Jun 4, 2025 • 1
MOVA: Towards Scalable and Synchronized Video-Audio Generation Paper • 2602.08794 • Published Feb 9, 2026 • 156
SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents Paper • 2602.12984 • Published Feb 13, 2026 • 5
From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities Paper • 2401.15071 • Published Jan 26, 2024 • 37
RMB: Comprehensively Benchmarking Reward Models in LLM Alignment Paper • 2410.09893 • Published Oct 13, 2024
Code2Logic: Game-Code-Driven Data Synthesis for Enhancing VLMs General Reasoning Paper • 2505.13886 • Published May 20, 2025 • 8
Exploring the Compositional Deficiency of Large Language Models in Mathematical Reasoning Paper • 2405.06680 • Published May 5, 2024 • 1