SketchVerify: Planning with Sketch-Guided Verification for Physics-Aware Video Generation

Recent video generation approaches increasingly rely on planning intermediate control signals, such as object trajectories, to improve temporal coherence and motion fidelity. However, these methods mostly rely on either single-shot plans, which are typically limited to simple motions, or iterative refinement, which requires multiple calls to the video generator and incurs high computational cost. To overcome these limitations, we propose SketchVerify, a training-free, sketch-verification-based planning framework that introduces a test-time sampling and verification loop to produce dynamically coherent trajectories (i.e., physically plausible and instruction-consistent motions) before full video generation. Given a prompt and a reference image, our method predicts multiple candidate motion plans and ranks them using a vision–language verifier that jointly evaluates semantic alignment with the instruction and physical plausibility. To score candidate motion plans efficiently, we render each trajectory as a lightweight video sketch by compositing objects over a static background, which bypasses the need for expensive, repeated diffusion-based synthesis while achieving comparable performance. We iteratively refine the motion plan until a satisfactory one is identified, which is then passed to a trajectory-conditioned generator for final synthesis. Experiments on WorldModelBench and PhyWorldBench demonstrate that SketchVerify significantly improves motion quality, physical realism, and long-term consistency compared to strong baselines, while being substantially more efficient. Ablations further show that scaling up the number of trajectory candidates and using multimodal sketch-based verification consistently enhance overall performance.

Figure: Illustration comparing (a) one-shot planning, (b) iterative generation, and (c) our SketchVerify framework. SketchVerify evaluates multiple lightweight video sketches using a multimodal verifier before committing to video synthesis, enabling efficient and physically plausible motion planning.

Overview of SketchVerify. Given a text prompt and a first frame, SketchVerify (1) decomposes the instruction and parses moving objects, (2) samples candidate object trajectories and renders them as lightweight video sketches on an inpainted static background, and (3) uses a multimodal verifier to score semantic alignment and physical plausibility, selecting a high-quality motion plan to guide a trajectory-conditioned video generator.
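
The sample-then-verify loop in this overview can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: the `sample` and `verify` callables stand in for the trajectory planner and the multimodal verifier, and the equal weighting of the two scores, the `threshold`, and the `max_rounds` cap are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # hypothetical layout: (x, y, width, height)
Trajectory = List[Box]


@dataclass
class Verdict:
    semantic: float  # instruction-alignment score, assumed to lie in [0, 1]
    physics: float   # physical-plausibility score, assumed to lie in [0, 1]

    @property
    def total(self) -> float:
        # Assumption: both criteria are weighted equally.
        return 0.5 * self.semantic + 0.5 * self.physics


def plan_with_verification(
    sample: Callable[[int], List[Trajectory]],
    verify: Callable[[Trajectory], Verdict],
    num_candidates: int = 4,
    max_rounds: int = 3,
    threshold: float = 0.8,
) -> Tuple[Trajectory, float]:
    """Sample candidates, score each one, and stop once one is good enough."""
    best: Optional[Trajectory] = None
    best_score = float("-inf")
    for _ in range(max_rounds):
        for candidate in sample(num_candidates):
            score = verify(candidate).total
            if score > best_score:
                best, best_score = candidate, score
        if best_score >= threshold:
            break  # a satisfactory plan was found, so no further sampling
    if best is None:
        raise ValueError("sampler returned no candidates")
    return best, best_score
```

Because the verifier only sees cheap sketches, raising `num_candidates` in a loop like this costs far less than calling the video generator once per candidate.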

High-Level Planning & Object Parsing. An MLLM expands the prompt into a sequence of sub-instructions and predicts the objects expected to move. We then use GroundedSAM to segment these objects and Omnieraser to inpaint them out, obtaining a clean static background that serves as the canvas for sketch rendering.

Sketch-Based Test-Time Planning. For each sub-instruction, we sample multiple candidate trajectories as sequences of 2D bounding boxes and render them into video sketches by translating object crops across the static background. A multimodal verifier evaluates each sketch for instruction following and several physics criteria (e.g., gravity, penetration, deformation).
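
To make the rendering step concrete, here is a dependency-free sketch of compositing an object crop over a static background along a trajectory. It is a simplification under stated assumptions: images are plain nested lists of grayscale values, each trajectory step is reduced to the top-left corner of its bounding box, and the crop is pasted as an opaque rectangle with no mask or resizing. The names `paste` and `render_sketch` are hypothetical.

```python
from typing import List, Sequence, Tuple

Image = List[List[int]]     # rows of grayscale pixel values
Position = Tuple[int, int]  # assumed: top-left (x, y) of the object's box


def paste(background: Image, crop: Image, x: int, y: int) -> Image:
    """Return a copy of `background` with `crop` pasted at top-left (x, y).

    Pixels of the crop that fall outside the canvas are clipped.
    """
    frame = [row[:] for row in background]  # copy so the background stays static
    height, width = len(frame), len(frame[0])
    for dy, crop_row in enumerate(crop):
        for dx, value in enumerate(crop_row):
            px, py = x + dx, y + dy
            if 0 <= px < width and 0 <= py < height:
                frame[py][px] = value
    return frame


def render_sketch(
    background: Image, crop: Image, positions: Sequence[Position]
) -> List[Image]:
    """Render one frame per trajectory step by translating the crop."""
    return [paste(background, crop, x, y) for x, y in positions]
```

For example, `render_sketch(bg, crop, [(0, 0), (1, 0), (2, 0)])` yields three frames in which the crop slides one pixel to the right per frame, which is the kind of lightweight clip a verifier could inspect without any diffusion call.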

Trajectory-Conditioned Video Generation. We stitch together the selected trajectories, interpolate them over the full time horizon, and feed the resulting motion plan into a trajectory-conditioned image-to-video diffusion model. Since the generator receives a pre-verified motion trajectory, it can focus on visual fidelity while preserving physically coherent motion.
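
The stitching and interpolation step might look like the sketch below. The source does not specify the interpolation scheme, so linear interpolation between consecutive boxes is an assumption here, as are the `(x, y, width, height)` box layout and the helper names `stitch` and `interpolate`.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x, y, width, height)


def stitch(segments: List[List[Box]]) -> List[Box]:
    """Concatenate per-sub-instruction trajectories into one plan.

    If a segment starts with the box the previous one ended on, the
    duplicated joint box is dropped.
    """
    plan: List[Box] = []
    for segment in segments:
        start = 1 if plan and segment and segment[0] == plan[-1] else 0
        plan.extend(segment[start:])
    return plan


def interpolate(keyframes: List[Box], num_frames: int) -> List[Box]:
    """Linearly resample `keyframes` to `num_frames` evenly spaced boxes."""
    if not keyframes:
        raise ValueError("need at least one keyframe")
    if num_frames < 1:
        raise ValueError("num_frames must be positive")
    if len(keyframes) == 1 or num_frames == 1:
        return [keyframes[0]] * num_frames
    last = len(keyframes) - 1
    dense: List[Box] = []
    for i in range(num_frames):
        t = i * last / (num_frames - 1)  # position along the keyframe axis
        lo = min(int(t), last - 1)       # index of the left-hand keyframe
        frac = t - lo
        a, b = keyframes[lo], keyframes[lo + 1]
        dense.append(tuple(p + frac * (q - p) for p, q in zip(a, b)))
    return dense
```

A typical use would be `interpolate(stitch(selected_segments), num_frames)`, where `selected_segments` holds the verified trajectory for each sub-instruction and `num_frames` matches the generator's clip length.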

We present qualitative comparisons across multiple domains. Each row below shows four videos arranged horizontally (our SketchVerify and baselines). Captions show the input prompt for each sample.

WorldModelBench results. SketchVerify achieves the highest instruction-following score and stronger physics-law coherence compared to strong open-source image-to-video baselines, while being significantly more efficient than iterative full-generation refinement methods.

PhyWorldBench results. On fine-grained physics prompts, SketchVerify improves object–event realism and physical-standard adherence over base models, indicating better causal and physical reasoning in the generated videos.

BibTeX


@misc{huang2025sketchverify,
  title         = {SketchVerify: Planning with Sketch-Guided Verification for Physics-Aware Video Generation},
  author        = {Huang, Yidong and Wang, Zun and Lin, Han and Kim, Dong-Ki and Omidshafiei, Shayegan and Yoon, Jaehong and Zhang, Yue and Bansal, Mohit},
  year          = {2025},
  eprint        = {2511.17450},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV}
}
