CHAIN: From Perception to Action

Yihuai Lan*, Maojia Song*, Yuhao Wu*#, Lei Wang, Zhiqiang Hu, Yao Xiao, Heng Zhou, Weihua Zheng, Dylan Raharja, Soujanya Poria, Roy Ka-Wei Lee

* Equal contribution. # Project Leader. Advisor.

An interactive, physics-driven 3D benchmark for evaluating whether vision-language and diffusion models can reason about physical structure and execute action sequences grounded in causal constraints.

CHAIN benchmark overview: (a) Traditional VQA relies on passive observation. (b) CHAIN requires multi-step interaction, enabling procedural evaluation of planning and structural understanding.

Abstract

Understanding physical structure is essential for real-world applications such as embodied agents, interactive design, and long-horizon manipulation. Yet prevailing Vision–Language Model (VLM) evaluations still center on structure-agnostic, single-turn setups (e.g., VQA), which fail to assess agents' ability to reason about how geometry, contact, and support relations jointly constrain what actions are possible in a dynamic environment. To address this gap, we introduce the Causal Hierarchy of Actions and Interactions (CHAIN) benchmark, an interactive, physics-driven 3D testbed designed to evaluate whether models can understand, plan, and execute structured action sequences grounded in physical constraints. CHAIN shifts evaluation from passive perception to active problem solving, spanning tasks such as interlocking mechanical puzzles and 3D stacking and packing. We conduct a comprehensive study of state-of-the-art VLMs and diffusion-based models under unified interactive settings. Our results show that even top-performing models struggle to internalize physical structure and causal constraints, often failing to produce reliable long-horizon plans or to robustly translate perceived structure into effective actions.

Showcase

VLM Agents

Diffusion Models

Models generate visually plausible motion but consistently fail to respect geometric constraints — pieces phase through each other, ignore collision boundaries, and produce physically impossible assembly sequences.

How Do Frontier Models Perform?

Overall accuracy remains low, with the largest gap appearing on interlocking puzzles. The chart below highlights the top models by overall Pass@1.

  • 22.9%: best model overall accuracy (GPT-5.2)
  • 0–3.1%: puzzle accuracy across all models
  • 109: interactive 3D levels across 2 task families

Task Overview

CHAIN comprises 109 distinct interactive levels across two task families, each stressing complementary aspects of structured physical reasoning.

Puzzle Interlocking Mechanical Structures

Puzzle task examples across Easy, Medium, and Hard difficulty levels, showing Kongming locks, Lu Ban locks, and burr puzzles.

Assemble or disassemble multi-piece structures (Kongming locks, Lu Ban locks, burr puzzles) through fine-grained mortise-and-tenon manipulation. Progress depends on executing steps in the correct order under hidden geometric constraints, contact-rich dependencies, and collision avoidance.

  • 32 instances
  • 10 Easy
  • 12 Medium
  • 10 Hard
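At its core, the puzzle task is an ordering problem: a piece can only be removed once nothing obstructs its escape path. A minimal sketch of that ordering logic is below; the piece names and the `blocked_by` relation are illustrative toy data, not CHAIN's actual representation, which involves full 3D geometry and collision checks.

```python
# Toy sketch of the removal-ordering problem behind interlocking puzzles.
# `blocked_by` maps each piece to the set of pieces that must be removed first.

def removal_order(blocked_by):
    """Greedily remove pieces whose blockers are all gone; None if stuck."""
    remaining = set(blocked_by)
    order = []
    while remaining:
        # A piece is free when none of its blockers is still in the structure.
        free = [p for p in remaining if not (blocked_by[p] & remaining)]
        if not free:
            return None  # deadlock: every remaining piece is blocked
        piece = sorted(free)[0]
        order.append(piece)
        remaining.remove(piece)
    return order

# A 3-piece toy lock: the key must come out before either arm.
toy = {"key": set(), "arm1": {"key"}, "arm2": {"key", "arm1"}}
print(removal_order(toy))  # ['key', 'arm1', 'arm2']
```

Even this simplified version shows why step order matters: removing pieces out of sequence leaves the greedy loop with no free piece, which mirrors the dead ends agents hit in the real task.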

Stacking 3D Spatial Packing

Stacking task examples across Easy, Medium, and Hard difficulty levels, showing 3D packing puzzles of increasing container sizes.

Pack multiple irregularly shaped 3D blocks into a fixed container by reasoning about shape compatibility, orientation constraints, and how early placements progressively restrict the remaining free space. Difficulty scales with container size and piece complexity.

  • 77 instances
  • 10 Easy
  • 20 Medium
  • 47 Hard
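The feasibility test underlying a packing step can be sketched on a voxel grid: a placement is valid only if the piece stays inside the container and overlaps no occupied cell. This is an illustrative simplification; CHAIN evaluates placements in a full physics simulation, not on a grid.

```python
# Minimal voxel-grid sketch of a packing feasibility check (illustrative only).
import numpy as np

def can_place(container, piece, x, y, z):
    """True iff `piece` fits entirely in the free space of `container` at (x, y, z)."""
    px, py, pz = piece.shape
    cx, cy, cz = container.shape
    if x + px > cx or y + py > cy or z + pz > cz:
        return False  # piece would stick out of the container
    region = container[x:x+px, y:y+py, z:z+pz]
    return not np.any(region & piece)  # no overlap with occupied voxels

container = np.zeros((2, 2, 2), dtype=bool)   # empty 2x2x2 container
slab = np.ones((1, 2, 2), dtype=bool)         # a flat 1x2x2 block
assert can_place(container, slab, 0, 0, 0)
container[0] |= slab[0]                        # commit the placement
assert not can_place(container, slab, 0, 0, 0)  # that half is now occupied
assert can_place(container, slab, 1, 0, 0)      # the other half still fits
```

The check itself is trivial; the hard part the benchmark probes is choosing placements whose cumulative effect does not fragment the remaining free volume.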

Leaderboard

Main evaluation results on CHAIN (Pass@1). Plan efficiency metrics are computed only on solved tasks (lower is better). Token and cost metrics are normalized by successful solves (higher is better). Solved/Tokens is reported per 1M tokens.

Even the best-performing model (GPT-5.2) solves only 22.9% of tasks overall. Interlocking puzzles remain at most 3.1% accuracy across all models, suggesting current VLMs lack the ability to internalize geometric constraints and plan multi-step physical manipulations.

Diagnosing Frontier Models on CHAIN

We use CHAIN's controlled interactive protocol to localize bottlenecks in perception, planning, and execution as physical constraints tighten.
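The interactive protocol can be sketched as a turn-based loop in which the agent proposes one action per step and receives state feedback; the `env`/`agent` interface below is hypothetical, not CHAIN's actual API.

```python
# Hedged sketch of a turn-based interactive evaluation loop.
# `env` and `agent` are assumed interfaces, invented for illustration.

def run_episode(env, agent, max_steps=50):
    """Run one interactive episode; return (solved, steps_used)."""
    obs = env.reset()
    for step in range(max_steps):
        action = agent.act(obs)                # e.g. "move piece 3 along +x"
        obs, solved, invalid = env.step(action)
        if invalid:
            continue                           # rejected move; agent may replan
        if solved:
            return True, step + 1              # success and plan length
    return False, max_steps
```

Counting steps on solved episodes is what makes plan-efficiency metrics possible, while the per-step feedback is exactly what the one-shot setting removes.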

Constraint Tightness (Difficulty Stratification)

Accuracy (%) by difficulty tier. Stacking–Easy is largely solved, but performance collapses at Mid/Hard. Puzzle–Easy peaks at 10%, while Puzzle–Mid/Hard remain at 0%.

Model               Puzzle Acc ↑ (Easy / Mid / Hard)   Stacking Acc ↑ (Easy / Mid / Hard)
GPT-5.2             10.0 / 0.0 / 0.0                   100.0 / 55.0 / 6.3
Gemini-3-Pro        10.0 / 0.0 / 0.0                   90.0 / 40.0 / 6.3
Claude-Sonnet-4.5   10.0 / 0.0 / 0.0                   100.0 / 20.0 / 0.0

Intermediate Feedback (Interactive vs. One-shot)

Multi-step interaction consistently outperforms one-shot solving. Δ = Interactive − One-shot on overall accuracy.

Model               Interactive (%) ↑ (Puzzle / Stack. / All)   One-shot (%) ↑ (Puzzle / Stack. / All)   Δ
GPT-5.2             3.1 / 31.2 / 22.9                           0.0 / 9.1 / 7.1                          +15.8
Claude-Sonnet-4.5   3.1 / 18.2 / 13.8                           0.0 / 10.3 / 8.1                         +5.7
Gemini-3-Pro        3.1 / 26.0 / 19.3                           0.0 / 9.1 / 7.1                          +12.2

Selection Signal (Reward Models vs. Verification)

Better selection helps, but gains saturate quickly. Reward-model reranking provides limited improvements relative to stronger verifier-style checks.

Strategy       All (%) ↑   Δ vs. Avg@4
Avg@4          9.3         (baseline)
Pass@1         9.4         +0.1
Pass@2         11.2        +1.9
Pass@4         11.2        +1.9
VLM Judge      10.3        +1.0
Reward Model   9.9         +0.6
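The Pass@k rows above can be estimated from n sampled attempts with c successes using the standard unbiased estimator; this is the general formula, not necessarily the exact aggregation used for this leaderboard.

```python
# Standard unbiased Pass@k estimator over n sampled attempts, c of them correct.
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k draws (without replacement) succeeds."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a success
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(4, 1, 1), 3))  # 0.25
print(round(pass_at_k(4, 1, 4), 3))  # 1.0
```

The saturation seen in the table (Pass@2 equal to Pass@4) is consistent with successes being concentrated on a few easy instances: extra samples add little once those are covered.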

World Models (Diffusion Video)

Video generators often follow instructions superficially while violating interlocking and collision constraints. As complexity increases, failures escalate to structural corruption and hallucinated piece additions or removals.

See the Diffusion Models previews in the Showcase above for representative failure cases.

Multi-step interactive evaluation consistently outperforms one-shot solving, yet it also exposes cascading failures in long-horizon plans. Performance drops steeply from Easy to Hard: Stacking collapses from near-100% to single digits, while Puzzle accuracy remains at 0% for Mid/Hard and never exceeds 10% even on Easy.

Failure Case Studies

Beyond aggregate metrics, case analysis reveals recurring patterns: trial-and-error exploration with little constraint-guided progress; global spatial planning failures that fragment remaining free volume and trigger costly backtracking; and early commitments that snowball into dead-end configurations.

"Constraint-Free Trial-and-Error"

When structural constraints are unclear, agents often probe by proposing many candidate placements without converging on a constraint-consistent plan, leading to little constraint-guided progress.

Auto-playing failure mode example

"Costly Backtracking"

In harder stacking regimes, locally reasonable placements can corner the solver into awkward residual space and fragment the remaining free volume. Agents then resort to costly removals and replanning, highlighting the need for global spatial planning and lookahead.

Auto-playing failure mode example

"Dead-End Configurations"

Hard instances require tightly coupled, long-horizon packing decisions. Even after several locally valid moves, early commitments can reduce clearance and leave residual space incompatible with the leftover pieces, producing dead-end states.

Auto-playing failure mode example

Citation

@inproceedings{chain2026,
  title     = {From Perception to Action: An Interactive Benchmark for Vision Reasoning},
  author    = {Anonymous},
  year      = {2026}
}