VisualToolBench (VTB)
Evaluating how MLLMs can dynamically interact with and reason about visual information
Overview
VisualToolBench (VTB) is the first benchmark designed to evaluate how well Multimodal Large Language Models (MLLMs) can dynamically interact with and reason about visual information. It shifts the paradigm from passively “thinking about images” to actively “thinking with images,” treating them as a manipulable cognitive workspace. To solve complex, multi-step problems, models must use tools to transform visual content by cropping, editing, or enhancing it to uncover critical details. The benchmark provides leaderboard results across 16 diverse MLLMs, including reasoning, non-reasoning, open-source, and closed-source models.
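The tool-use loop described above can be sketched in miniature. The sketch below is purely illustrative: the tool names (`crop`, `enhance`), their arguments, and the list-of-lists image representation are assumptions for demonstration, not the benchmark's actual API.

```python
# Hypothetical sketch of image-manipulation tools a model might invoke
# during a "think with images" task. Tool names, signatures, and the
# 2D-grid image format are illustrative assumptions, not VTB's real API.

def crop(image, top, left, height, width):
    """Return a rectangular sub-region of a 2D pixel grid (list of lists)."""
    return [row[left:left + width] for row in image[top:top + height]]

def enhance(image, factor):
    """Brighten pixels by a multiplicative factor, capped at 255."""
    return [[min(255, int(p * factor)) for p in row] for row in image]

# A toy 4x4 grayscale "image" with a faint detail in the lower-right corner.
img = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 10, 12],
    [0, 0, 14, 16],
]

# The model crops to the region of interest, then enhances it to
# uncover the critical detail -- the interactive loop in miniature.
region = crop(img, 2, 2, 2, 2)    # -> [[10, 12], [14, 16]]
detail = enhance(region, 10.0)    # -> [[100, 120], [140, 160]]
```

In the actual benchmark, a model would chain such tool calls across multiple reasoning steps, inspecting each transformed image before deciding on the next operation.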