SketchVerify: Planning with Sketch-Guided Verification
for Physics-Aware Video Generation

¹UNC Chapel Hill, ²Field AI, ³Nanyang Technological University

Abstract

Recent video generation approaches increasingly rely on planning intermediate control signals, such as object trajectories, to improve temporal coherence and motion fidelity. However, these methods mostly either use single-shot plans, which are typically limited to simple motions, or rely on iterative refinement, which requires multiple calls to the video generator and incurs high computational cost. To overcome these limitations, we propose SketchVerify, a training-free, sketch-verification-based planning framework that introduces a test-time sampling and verification loop to produce dynamically coherent trajectories (i.e., physically plausible and instruction-consistent motions) before full video generation. Given a prompt and a reference image, our method predicts multiple candidate motion plans and ranks them using a vision–language verifier that jointly evaluates semantic alignment with the instruction and physical plausibility. To score candidate motion plans efficiently, we render each trajectory as a lightweight video sketch by compositing objects over a static background, which avoids expensive, repeated diffusion-based synthesis while achieving comparable performance. We iteratively refine the motion plan until a satisfactory one is identified and then pass it to a trajectory-conditioned generator for final synthesis. Experiments on WorldModelBench and PhyWorldBench demonstrate that SketchVerify significantly improves motion quality, physical realism, and long-term consistency compared to strong baselines, while being substantially more efficient. Ablations further show that scaling up the number of trajectory candidates and using multimodal sketch-based verification consistently enhance overall performance.

Figure: Illustration comparing (a) one-shot planning, (b) iterative generation, and (c) our SketchVerify framework. SketchVerify evaluates multiple lightweight video sketches using a multimodal verifier before committing to video synthesis, enabling efficient and physically plausible motion planning.

Method

Overview of SketchVerify. Given a text prompt and first frame, SketchVerify (1) decomposes the instruction and parses moving objects, (2) samples candidate object trajectories and renders them as lightweight video sketches on an inpainted static background, and (3) uses a multimodal verifier to score semantic alignment and physics plausibility, selecting a high-quality motion plan to guide a trajectory-conditioned video generator.
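
As an illustration only, the sample-and-verify loop in steps (2) and (3) can be sketched as a best-of-N search with an early exit. The names `select_plan`, `sample_fn`, and `score_fn`, as well as the `threshold` and `max_rounds` parameters, are hypothetical; how the verifier combines its semantic and physics judgments into one score, and the actual stopping rule, are not specified here and are assumed.

```python
from typing import Callable, List, Optional, Tuple

Box = Tuple[float, float, float, float]   # assumed (x0, y0, x1, y1) format
Trajectory = List[Box]                    # one box per key frame


def select_plan(
    sample_fn: Callable[[int], List[Trajectory]],
    score_fn: Callable[[Trajectory], float],
    num_candidates: int = 4,
    max_rounds: int = 3,
    threshold: float = 0.8,
) -> Tuple[Optional[Trajectory], float]:
    """Keep the best-scoring candidate plan seen so far.

    sample_fn(n) stands in for the planner proposing n candidate trajectories.
    score_fn(plan) stands in for the verifier's combined score (assumed to be
    a single float where higher is better).
    """
    best_plan, best_score = None, float("-inf")
    for _ in range(max_rounds):
        for plan in sample_fn(num_candidates):
            s = score_fn(plan)
            if s > best_score:
                best_plan, best_score = plan, s
        # Stop resampling once a plan is judged good enough (assumed rule).
        if best_score >= threshold:
            break
    return best_plan, best_score
```

Because scoring happens on the plan rather than on a generated video, the expensive generator would only be called once, on the returned plan.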

High-Level Planning & Object Parsing. An MLLM expands the prompt into a sequence of sub-instructions and predicts the objects expected to move. We then use GroundedSAM to segment these objects and Omnieraser to inpaint them out, obtaining a clean static background that serves as the canvas for sketch rendering.

Sketch-Based Test-Time Planning. For each sub-instruction, we sample multiple candidate trajectories as sequences of 2D bounding boxes and render them into video sketches by translating object crops across the static background. A multimodal verifier evaluates each sketch for instruction following and several physics criteria (e.g., gravity, penetration, deformation).
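
A minimal sketch of the rendering step, assuming NumPy arrays for the background, the object crop, and its segmentation mask. The function name `render_sketch` and the `(x0, y0, x1, y1)` box format are hypothetical, and this version only translates the crop (it ignores any change in box size), so it should be read as an illustration rather than the actual renderer.

```python
import numpy as np


def render_sketch(background, crop, mask, boxes):
    """Paste an object crop onto a static background along a box trajectory.

    background: (H, W, 3) uint8 inpainted background
    crop:       (h, w, 3) uint8 object pixels
    mask:       (h, w) bool, True where the pixel belongs to the object
    boxes:      sequence of (x0, y0, x1, y1); only the top-left corner is
                used here (assumption: pure translation, no rescaling)
    Returns an array of shape (num_frames, H, W, 3).
    """
    H, W, _ = background.shape
    h, w, _ = crop.shape
    frames = []
    for x, y, _, _ in boxes:
        x, y = int(round(x)), int(round(y))
        frame = background.copy()
        # Clip the paste region so off-screen parts of the object are dropped.
        x0, y0 = max(x, 0), max(y, 0)
        x1, y1 = min(x + w, W), min(y + h, H)
        if x0 < x1 and y0 < y1:
            cx, cy = x0 - x, y0 - y
            sub_crop = crop[cy:cy + (y1 - y0), cx:cx + (x1 - x0)]
            sub_mask = mask[cy:cy + (y1 - y0), cx:cx + (x1 - x0)]
            region = frame[y0:y1, x0:x1]          # view into `frame`
            region[sub_mask] = sub_crop[sub_mask]  # copy object pixels only
        frames.append(frame)
    return np.stack(frames)
```

Each rendered frame costs one array copy and one masked assignment, which is what makes scoring many candidates cheap relative to running a diffusion model per candidate.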

Trajectory-Conditioned Video Generation. We stitch together the selected trajectories, interpolate them over the full time horizon, and feed the resulting motion plan into a trajectory-conditioned image-to-video diffusion model. Since the generator receives a pre-verified motion trajectory, it can focus on visual fidelity while preserving physically coherent motion.
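
The stitching and interpolation step can be sketched as follows, assuming each selected trajectory is a short list of key-frame boxes and that key frames are evenly spaced in time. The function name `stitch_and_interpolate` is hypothetical, and linear interpolation via `numpy.interp` is an assumption; the interpolation scheme actually used is not stated here.

```python
import numpy as np


def stitch_and_interpolate(segments, num_frames):
    """Concatenate per-sub-instruction trajectories and resample to num_frames.

    segments:   list of sequences of (x0, y0, x1, y1) key-frame boxes,
                one sequence per sub-instruction, in temporal order
    num_frames: number of frames expected by the video generator
    Returns an array of shape (num_frames, 4).
    """
    keyframes = []
    for seg in segments:
        for box in seg:
            box = tuple(float(v) for v in box)
            # Drop a duplicated junction box where one segment ends and the
            # next begins at the same position.
            if keyframes and keyframes[-1] == box:
                continue
            keyframes.append(box)
    keyframes = np.asarray(keyframes, dtype=float)

    # Assumption: key frames are uniformly spaced over the clip.
    src_t = np.linspace(0.0, 1.0, len(keyframes))
    dst_t = np.linspace(0.0, 1.0, num_frames)
    # Interpolate each box coordinate independently.
    return np.stack(
        [np.interp(dst_t, src_t, keyframes[:, c]) for c in range(4)], axis=1
    )
```

For example, two segments that move an object right and then down, resampled to five frames, yield boxes at the midpoints of each leg as well as at the key frames themselves.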

Qualitative Results

We present qualitative comparisons across multiple domains. Each row below shows four videos side by side: SketchVerify and three baselines (Wan-2.1, Cosmos-predict2, and CogVideoX). Each row is captioned with its input prompt.

WorldModelBench

Prompt: The robotic arm put the toy carrot into the metal bowl

SketchVerify

Wan-2.1

Cosmos-predict2

CogVideoX

Prompt: Snow melts and slides down the mountain cliff in the scenic

SketchVerify

Wan-2.1

Cosmos-predict2

CogVideoX

Prompt: The robotic arm picks up the purple tool from the toolbox.

SketchVerify

Wan-2.1

Cosmos-predict2

CogVideoX

Prompt: Two football players tackle an opponent near the end zone in a football simulation game.

SketchVerify

Wan-2.1

Cosmos-predict2

CogVideoX

Prompt: The man in red shorts jumps forward on the beach.

SketchVerify

Wan-2.1

Cosmos-predict2

CogVideoX

Prompt: The robotic arm positions the engine under the car chassis during assembly.

SketchVerify

Wan-2.1

Cosmos-predict2

CogVideoX

Prompt: The character in the red outfit jumps onto a series of shipping containers.

SketchVerify

Wan-2.1

Cosmos-predict2

CogVideoX

Prompt: The herd of antelopes stands alert in the savannah.

SketchVerify

Wan-2.1

Cosmos-predict2

CogVideoX

Prompt: The robotic arm moves the rubber chicken into the metal bowl.

SketchVerify

Wan-2.1

Cosmos-predict2

CogVideoX

Prompt: The worker uses a press brake to cut the metal sheet.

SketchVerify

Wan-2.1

Cosmos-predict2

CogVideoX

PhyWorldBench

Prompt: A soccer player kicks the ball in a high lob.

SketchVerify

Wan-2.1

Cosmos-predict2

CogVideoX

Prompt: A tetherball swings around the pole after being hit.

SketchVerify

Wan-2.1

Cosmos-predict2

CogVideoX

Prompt: A child uses a top on the floor.

SketchVerify

Wan-2.1

Cosmos-predict2

CogVideoX

Prompt: A basketball falls into a hoop.

SketchVerify

Wan-2.1

Cosmos-predict2

CogVideoX

Prompt: A slack rope is used to pull a box.

SketchVerify

Wan-2.1

Cosmos-predict2

CogVideoX

Quantitative Results

WorldModelBench

WorldModelBench results. SketchVerify achieves the highest instruction-following score and better physics-law coherence than strong open-source image-to-video baselines, while being significantly more efficient than iterative refinement methods that require full video generation at each step.


PhyWorldBench

PhyWorldBench results. On fine-grained physics prompts, SketchVerify improves object–event realism and physical-standard adherence over base models, indicating better causal and physical reasoning in the generated videos.

BibTeX


@misc{huang2025sketchverify,
  title         = {SketchVerify: Planning with Sketch-Guided Verification for Physics-Aware Video Generation},
  author        = {Huang, Yidong and Wang, Zun and Lin, Han and Kim, Dong-Ki and Omidshafiei, Shayegan and Yoon, Jaehong and Zhang, Yue and Bansal, Mohit},
  year          = {2025},
  eprint        = {2511.17450},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV}
}