VisualToolBench (VTB)
Evaluating how MLLMs can dynamically interact with and reason about visual information
Overview
VisualToolBench (VTB) is the first benchmark designed to evaluate how well Multimodal Large Language Models (MLLMs) can dynamically interact with and reason about visual information. It shifts the paradigm from passively “thinking about images” to actively “thinking with images,” treating them as a manipulable cognitive workspace. To solve complex, multi-step problems, models must use tools to transform visual content by cropping, editing, or enhancing it to uncover critical details. The benchmark provides leaderboard results across 16 diverse MLLMs, including reasoning, non-reasoning, open-source, and closed-source models.
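The tool-use loop described above can be sketched in miniature. The sketch below is purely illustrative: the tool names (`crop`, `enhance`), their arguments, and the list-of-lists image representation are assumptions for demonstration, not the benchmark's actual API.

```python
# Hypothetical sketch of image-manipulation tools a model might invoke
# during a "think with images" task. Tool names, signatures, and the
# 2D-grid image format are illustrative assumptions, not VTB's real API.

def crop(image, top, left, height, width):
    """Return a rectangular sub-region of a 2D pixel grid (list of lists)."""
    return [row[left:left + width] for row in image[top:top + height]]

def enhance(image, factor):
    """Brighten pixels by a multiplicative factor, capped at 255."""
    return [[min(255, int(p * factor)) for p in row] for row in image]

# A toy 4x4 grayscale "image" with a faint detail in the lower-right corner.
img = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 10, 12],
    [0, 0, 14, 16],
]

# The model crops to the region of interest, then enhances it to
# uncover the critical detail -- the interactive loop in miniature.
region = crop(img, 2, 2, 2, 2)    # -> [[10, 12], [14, 16]]
detail = enhance(region, 10.0)    # -> [[100, 120], [140, 160]]
```

In the actual benchmark, a model would chain such tool calls across multiple reasoning steps, inspecting each transformed image before deciding on the next operation.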