CHAIN: From Perception to Action

Yihuai Lan*, Maojia Song*, Yuhao Wu*#, Lei Wang, Zhiqiang Hu, Yao Xiao, Heng Zhou, Weihua Zheng, Dylan Raharja, Soujanya Poria, Roy Ka-Wei Lee

* Equal contribution. # Project Leader. Advisor.

An interactive, physics-driven 3D benchmark for evaluating whether vision-language and diffusion models can reason about physical structure and execute action sequences grounded in causal constraints.

CHAIN benchmark overview: (a) Traditional VQA relies on passive observation. (b) CHAIN requires multi-step interaction, enabling procedural evaluation of planning and structural understanding.

Abstract

Understanding physical structure is essential for real-world applications such as embodied agents, interactive design, and long-horizon manipulation. Yet prevailing Vision–Language Model (VLM) evaluations still center on structure-agnostic, single-turn setups (e.g., VQA), which fail to assess agents' ability to reason about how geometry, contact, and support relations jointly constrain what actions are possible in a dynamic environment. To address this gap, we introduce the Causal Hierarchy of Actions and Interactions (CHAIN) benchmark, an interactive, physics-driven 3D testbed designed to evaluate whether models can understand, plan, and execute structured action sequences grounded in physical constraints. CHAIN shifts evaluation from passive perception to active problem solving, spanning tasks such as interlocking mechanical puzzles and 3D stacking and packing. We conduct a comprehensive study of state-of-the-art VLMs and diffusion-based models under unified interactive settings. Our results show that even top-performing models struggle to internalize physical structure and causal constraints, often failing to produce reliable long-horizon plans or to robustly translate perceived structure into effective actions.

Showcase

VLM Agents

Diffusion Models

Models generate visually plausible motion but consistently fail to respect geometric constraints — pieces phase through each other, ignore collision boundaries, and produce physically impossible assembly sequences.

How Do Frontier Models Perform?

Overall accuracy remains low, with the largest gap appearing on interlocking puzzles. The chart below highlights the top models by overall Pass@1.

  • 22.9%: best model overall accuracy (GPT-5.2)
  • 0–3.1%: puzzle accuracy across all models
  • 109: interactive 3D levels across 2 task families

Task Overview

CHAIN comprises 109 distinct interactive levels across two task families, each stressing complementary aspects of structured physical reasoning.

Puzzle Interlocking Mechanical Structures

Puzzle task examples across Easy, Medium, and Hard difficulty levels, showing Kongming locks, Lu Ban locks, and burr puzzles.

Assemble or disassemble multi-piece structures (Kongming locks, Lu Ban locks, burr puzzles) through fine-grained mortise-and-tenon manipulation. Progress depends on executing steps in the correct order under hidden geometric constraints, contact-rich dependencies, and collision avoidance.

  • 32 instances
  • 10 Easy
  • 12 Medium
  • 10 Hard
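At its core, the puzzle task is an ordering problem: a piece can only be removed once nothing obstructs its escape path. A minimal sketch of that ordering logic is below; the piece names and the `blocked_by` relation are illustrative toy data, not CHAIN's actual representation, which involves full 3D geometry and collision checks.

```python
# Toy sketch of the removal-ordering problem behind interlocking puzzles.
# `blocked_by` maps each piece to the set of pieces that must be removed first.

def removal_order(blocked_by):
    """Greedily remove pieces whose blockers are all gone; None if stuck."""
    remaining = set(blocked_by)
    order = []
    while remaining:
        # A piece is free when none of its blockers is still in the structure.
        free = [p for p in remaining if not (blocked_by[p] & remaining)]
        if not free:
            return None  # deadlock: every remaining piece is blocked
        piece = sorted(free)[0]
        order.append(piece)
        remaining.remove(piece)
    return order

# A 3-piece toy lock: the key must come out before either arm.
toy = {"key": set(), "arm1": {"key"}, "arm2": {"key", "arm1"}}
print(removal_order(toy))  # ['key', 'arm1', 'arm2']
```

Even this simplified version shows why step order matters: removing pieces out of sequence leaves the greedy loop with no free piece, which mirrors the dead ends agents hit in the real task.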

Stacking 3D Spatial Packing

Stacking task examples across Easy, Medium, and Hard difficulty levels, showing 3D packing puzzles of increasing container sizes.

Pack multiple irregularly shaped 3D blocks into a fixed container by reasoning about shape compatibility, orientation constraints, and how early placements progressively restrict the remaining free space. Difficulty scales with container size and piece complexity.

  • 77 instances
  • 10 Easy
  • 20 Medium
  • 47 Hard
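The feasibility test underlying a packing step can be sketched on a voxel grid: a placement is valid only if the piece stays inside the container and overlaps no occupied cell. This is an illustrative simplification; CHAIN evaluates placements in a full physics simulation, not on a grid.

```python
# Minimal voxel-grid sketch of a packing feasibility check (illustrative only).
import numpy as np

def can_place(container, piece, x, y, z):
    """True iff `piece` fits entirely in the free space of `container` at (x, y, z)."""
    px, py, pz = piece.shape
    cx, cy, cz = container.shape
    if x + px > cx or y + py > cy or z + pz > cz:
        return False  # piece would stick out of the container
    region = container[x:x+px, y:y+py, z:z+pz]
    return not np.any(region & piece)  # no overlap with occupied voxels

container = np.zeros((2, 2, 2), dtype=bool)   # empty 2x2x2 container
slab = np.ones((1, 2, 2), dtype=bool)         # a flat 1x2x2 block
assert can_place(container, slab, 0, 0, 0)
container[0] |= slab[0]                        # commit the placement
assert not can_place(container, slab, 0, 0, 0)  # that half is now occupied
assert can_place(container, slab, 1, 0, 0)      # the other half still fits
```

The check itself is trivial; the hard part the benchmark probes is choosing placements whose cumulative effect does not fragment the remaining free volume.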

Leaderboard

Main evaluation results on CHAIN (Pass@1). Plan efficiency metrics are computed only on solved tasks (lower is better). Token and cost metrics are normalized by successful solves (higher is better). Solved/Tokens is reported per 1M tokens.

Even the best-performing model (GPT-5.2) solves only 22.9% of tasks overall. Interlocking puzzles remain at most 3.1% accuracy across all models, suggesting current VLMs lack the ability to internalize geometric constraints and plan multi-step physical manipulations.

Diagnosing Frontier Models on CHAIN

We use CHAIN's controlled interactive protocol to localize bottlenecks in perception, planning, and execution as physical constraints tighten.
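The interactive protocol can be sketched as a turn-based loop in which the agent proposes one action per step and receives state feedback; the `env`/`agent` interface below is hypothetical, not CHAIN's actual API.

```python
# Hedged sketch of a turn-based interactive evaluation loop.
# `env` and `agent` are assumed interfaces, invented for illustration.

def run_episode(env, agent, max_steps=50):
    """Run one interactive episode; return (solved, steps_used)."""
    obs = env.reset()
    for step in range(max_steps):
        action = agent.act(obs)                # e.g. "move piece 3 along +x"
        obs, solved, invalid = env.step(action)
        if invalid:
            continue                           # rejected move; agent may replan
        if solved:
            return True, step + 1              # success and plan length
    return False, max_steps
```

Counting steps on solved episodes is what makes plan-efficiency metrics possible, while the per-step feedback is exactly what the one-shot setting removes.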

Constraint Tightness (Difficulty Stratification)

Accuracy (%) by difficulty tier. Stacking–Easy is largely solved, but performance collapses at Mid/Hard. Puzzle–Easy peaks at 10%, while Puzzle–Mid/Hard remain at 0%.

Model               Puzzle Acc ↑ (Easy / Mid / Hard)   Stacking Acc ↑ (Easy / Mid / Hard)
GPT-5.2             10.0 / 0.0 / 0.0                   100.0 / 55.0 / 6.3
Gemini-3-Pro        10.0 / 0.0 / 0.0                   90.0 / 40.0 / 6.3
Claude-Sonnet-4.5   10.0 / 0.0 / 0.0                   100.0 / 20.0 / 0.0

Intermediate Feedback (Interactive vs. One-shot)

Multi-step interaction consistently outperforms one-shot solving. Δ = Interactive − One-shot on overall accuracy.

Model               Interactive (%) ↑ (Puzzle / Stack. / All)   One-shot (%) ↑ (Puzzle / Stack. / All)   Δ
GPT-5.2             3.1 / 31.2 / 22.9                           0.0 / 9.1 / 7.1                          +15.8
Claude-Sonnet-4.5   3.1 / 18.2 / 13.8                           0.0 / 10.3 / 8.1                         +5.7
Gemini-3-Pro        3.1 / 26.0 / 19.3                           0.0 / 9.1 / 7.1                          +12.2

Selection Signal (Reward Models vs. Verification)

Better selection helps, but gains saturate quickly. Reward-model reranking provides limited improvements relative to stronger verifier-style checks.

Strategy       All (%) ↑   Δ vs. Avg@4
Avg@4          9.3         (baseline)
Pass@1         9.4         +0.1
Pass@2         11.2        +1.9
Pass@4         11.2        +1.9
VLM Judge      10.3        +1.0
Reward Model   9.9         +0.6
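The Pass@k rows above can be estimated from n sampled attempts with c successes using the standard unbiased estimator; this is the general formula, not necessarily the exact aggregation used for this leaderboard.

```python
# Standard unbiased Pass@k estimator over n sampled attempts, c of them correct.
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k draws (without replacement) succeeds."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a success
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(4, 1, 1), 3))  # 0.25
print(round(pass_at_k(4, 1, 4), 3))  # 1.0
```

The saturation seen in the table (Pass@2 equal to Pass@4) is consistent with successes being concentrated on a few easy instances: extra samples add little once those are covered.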

World Models (Diffusion Video)

Video generators often follow instructions superficially while violating interlocking and collision constraints. As complexity increases, failures escalate to structural corruption and hallucinated piece additions or removals.

See the Diffusion Models previews in the Showcase above for representative failure cases.

Multi-step interactive evaluation consistently outperforms one-shot solving, yet it also exposes cascading failures in long-horizon plans. Performance drops steeply from Easy to Hard: Stacking collapses from near-100% to single digits, while Puzzle accuracy remains at 0% for Mid/Hard and never exceeds 10% even on Easy.

Failure Case Studies

Beyond aggregate metrics, case analysis reveals recurring patterns: trial-and-error exploration with little constraint-guided progress; global spatial planning failures that fragment remaining free volume and trigger costly backtracking; and early commitments that snowball into dead-end configurations.

"Constraint-Free Trial-and-Error"

When structural constraints are unclear, agents often probe by proposing many candidate placements without converging on a constraint-consistent plan, leading to little constraint-guided progress.

Auto-playing failure mode example

"Costly Backtracking"

In harder stacking regimes, locally reasonable placements can corner the solver into awkward residual space and fragment the remaining free volume. Agents then resort to costly removals and replanning, highlighting the need for global spatial planning and lookahead.

Auto-playing failure mode example

"Dead-End Configurations"

Hard instances require tightly coupled, long-horizon packing decisions. Even after several locally valid moves, early commitments can reduce clearance and leave residual space incompatible with the leftover pieces, producing dead-end states.

Auto-playing failure mode example

Citation

@inproceedings{chain2026,
  title     = {From Perception to Action: An Interactive Benchmark for Vision Reasoning},
  author    = {Anonymous},
  year      = {2026}
}