---
title: README
emoji: 💻
colorFrom: green
colorTo: red
sdk: streamlit
pinned: false
---

# HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks

📄 Paper • 🏠 Home Page • 💻 GitHub Repository • 🏆 Leaderboard • 🤗 Dataset • 🤗 Dataset Viewer

HumanEval-V is a novel benchmark designed to evaluate the diagram understanding and reasoning capabilities of Large Multimodal Models (LMMs) in programming contexts. Unlike existing benchmarks, HumanEval-V focuses on coding tasks that require sophisticated visual reasoning over complex diagrams, pushing the boundaries of LMMs' ability to comprehend and process visual information. The dataset includes 253 human-annotated Python coding tasks, each featuring a critical, self-explanatory diagram with minimal textual clues. These tasks require LMMs to generate Python code based on the visual context and predefined function signatures.
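
As a quick-start illustration, the benchmark can be loaded with the Hugging Face `datasets` library. The repository id, split name, and field names in the sketch below are assumptions; check the 🤗 Dataset page linked above for the exact values.

```python
from datasets import load_dataset

# Assumed repository id and split name -- verify against the 🤗 Dataset page.
dataset = load_dataset("HumanEval-V/HumanEval-V-Benchmark", split="test")

sample = dataset[0]
# The schema below is illustrative; inspect the keys to see the actual fields
# (e.g. task id, diagram image, function signature, handcrafted test cases).
print(sample.keys())
```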

Key features:

- Complex diagram understanding that is indispensable for solving each coding task.
- Real-world problem contexts with diverse diagram types and spatial reasoning challenges.
- Code generation tasks that move beyond multiple-choice or short-answer questions to evaluate deeper visual and logical reasoning.
- A two-stage evaluation pipeline that separates diagram description generation from code implementation, enabling a more accurate assessment of visual reasoning.
- Handcrafted test cases for rigorous execution-based evaluation via the pass@k metric (see the sketch after this list).
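
Execution-based code benchmarks conventionally report pass@k with the unbiased estimator from the Codex paper (Chen et al., 2021). The sketch below assumes that convention; it is not taken verbatim from the HumanEval-V evaluation code.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated for a task
    c: samples that pass all handcrafted test cases
    k: evaluation budget
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    # 1 - C(n - c, k) / C(n, k), computed as a numerically stable product
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 20 samples generated per task, 3 of them pass the tests
print(pass_at_k(n=20, c=3, k=1))  # expected fraction solved with a budget of 1 attempt
```

Averaging this estimate over all 253 tasks gives the benchmark-level pass@k score.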