Recent video generation approaches increasingly rely on planning intermediate control signals, such as object trajectories, to improve temporal coherence and motion fidelity. However, these methods mostly rely on either single-shot plans, which are typically limited to simple motions, or iterative refinement, which requires multiple calls to the video generator and incurs high computational cost. To overcome these limitations, we propose SketchVerify, a training-free, sketch-verification-based planning framework that improves motion planning quality with dynamically coherent trajectories (i.e., physically plausible and instruction-consistent motions) prior to full video generation, by introducing a test-time sampling and verification loop. Given a prompt and a reference image, our method predicts multiple candidate motion plans and ranks them using a vision–language verifier that jointly evaluates semantic alignment with the instruction and physical plausibility. To efficiently score candidate motion plans, we render each trajectory as a lightweight video sketch by compositing objects over a static background, which bypasses the need for expensive, repeated diffusion-based synthesis while achieving comparable performance. We iteratively refine the motion plan until a satisfactory one is identified, which is then passed to a trajectory-conditioned generator for final synthesis. Experiments on WorldModelBench and PhyWorldBench demonstrate that SketchVerify significantly improves motion quality, physical realism, and long-term consistency compared to strong baselines, while being substantially more efficient. Ablations further show that scaling up the number of trajectory candidates and using multimodal sketch-based verification consistently improve overall performance.
Figure: Illustration comparing (a) one-shot planning, (b) iterative generation, and (c) our SketchVerify framework. SketchVerify evaluates multiple lightweight video sketches using a multimodal verifier before committing to video synthesis, enabling efficient and physically plausible motion planning.
High-Level Planning & Object Parsing. An MLLM expands the prompt into a sequence of sub-instructions and predicts the objects expected to move. We then use GroundedSAM to segment these objects and Omnieraser to inpaint them out, obtaining a clean static background that serves as the canvas for sketch rendering.
Sketch-Based Test-Time Planning. For each sub-instruction, we sample multiple candidate trajectories as sequences of 2D bounding boxes and render them into video sketches by translating object crops across the static background. A multimodal verifier evaluates each sketch for instruction following and several physics criteria (e.g., gravity, penetration, deformation).
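The sketch rendering and candidate selection described above can be illustrated with a minimal, pure-Python sketch. It is not the authors' implementation: the names `paste`, `render_sketch`, `select_best`, and `toy_verifier` are hypothetical, images are plain nested lists of grayscale values, the crop is only translated (never resized to the box), and the toy scoring rule merely stands in for the multimodal verifier.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels
Image = List[List[int]]          # grayscale image as rows of pixel values


def paste(background: Image, crop: Image, box: Box) -> Image:
    """Return a copy of `background` with `crop` pasted at the box's top-left corner."""
    frame = [row[:] for row in background]
    x0, y0, _, _ = box
    height, width = len(frame), len(frame[0])
    for dy, crop_row in enumerate(crop):
        for dx, value in enumerate(crop_row):
            y, x = y0 + dy, x0 + dx
            if 0 <= y < height and 0 <= x < width:  # clip pixels that leave the canvas
                frame[y][x] = value
    return frame


def render_sketch(background: Image, crop: Image, trajectory: Sequence[Box]) -> List[Image]:
    """Render one frame per box by translating the crop across the static background."""
    return [paste(background, crop, box) for box in trajectory]


def select_best(
    background: Image,
    crop: Image,
    candidates: Sequence[Sequence[Box]],
    verifier: Callable[[List[Image]], float],
) -> Tuple[int, List[float]]:
    """Score the sketch of every candidate trajectory; return (best index, all scores)."""
    scores = [verifier(render_sketch(background, crop, t)) for t in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


def toy_verifier(sketch: List[Image]) -> float:
    """Hypothetical stand-in for the verifier: fraction of frames with the object fully visible."""
    fully_visible = sum(
        1 for frame in sketch if sum(v == 9 for row in frame for v in row) == 4
    )
    return fully_visible / len(sketch)


# Toy data: a 6x6 empty background and a 2x2 object crop with pixel value 9.
background = [[0] * 6 for _ in range(6)]
crop = [[9, 9], [9, 9]]
candidates = [
    [(0, 0, 2, 2), (2, 2, 4, 4), (4, 4, 6, 6)],    # stays inside the canvas
    [(0, 0, 2, 2), (5, 5, 7, 7), (8, 8, 10, 10)],  # drifts out of the canvas
]
best_index, scores = select_best(background, crop, candidates, toy_verifier)
```

In this toy setup the first candidate wins because its object stays fully visible in every frame. The point of the design is that each candidate costs only a few array copies to render, so many trajectories can be scored before any diffusion-based synthesis is run.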
Trajectory-Conditioned Video Generation. We stitch together the selected trajectories, interpolate them over the full time horizon, and feed the resulting motion plan into a trajectory-conditioned image-to-video diffusion model. Since the generator receives a pre-verified motion trajectory, it can focus on visual fidelity while preserving physically coherent motion.
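As a rough illustration of the stitching and interpolation step, the sketch below concatenates per-sub-instruction box trajectories and resamples them to a fixed number of frames. The function names `stitch` and `interpolate` are hypothetical, and linear interpolation between keyframe boxes is an assumption; the text does not specify the interpolation scheme.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def stitch(segments: Sequence[Sequence[Box]]) -> List[Box]:
    """Concatenate segments, skipping a segment's first box if it repeats the previous end."""
    plan: List[Box] = []
    for segment in segments:
        boxes = list(segment)
        if plan and boxes and boxes[0] == plan[-1]:
            boxes = boxes[1:]  # avoid a duplicated keyframe at the seam
        plan.extend(boxes)
    return plan


def interpolate(keyframes: Sequence[Box], num_frames: int) -> List[Box]:
    """Linearly resample keyframe boxes to `num_frames` evenly spaced frames (assumed scheme)."""
    if not keyframes:
        raise ValueError("need at least one keyframe")
    if num_frames < 1:
        raise ValueError("num_frames must be positive")
    if len(keyframes) == 1 or num_frames == 1:
        return [tuple(keyframes[0])] * num_frames
    last = len(keyframes) - 1
    dense: List[Box] = []
    for i in range(num_frames):
        pos = i * last / (num_frames - 1)  # fractional keyframe index
        lo = min(int(pos), last - 1)       # left keyframe of the active interval
        t = pos - lo                       # blend weight in [0, 1]
        a, b = keyframes[lo], keyframes[lo + 1]
        dense.append(tuple((1 - t) * p + t * q for p, q in zip(a, b)))
    return dense


# Two selected segments: move right, then move down. The shared box is kept once.
segments = [
    [(0, 0, 10, 10), (10, 0, 20, 10)],
    [(10, 0, 20, 10), (10, 20, 20, 30)],
]
keyframes = stitch(segments)
motion_plan = interpolate(keyframes, num_frames=5)
```

Here the three stitched keyframes are expanded to five frames, with the intermediate boxes falling halfway between neighboring keyframes. A dense plan of this kind is what a trajectory-conditioned generator would consume, one box per output frame.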
We present qualitative comparisons across multiple domains. Each row below shows four videos arranged horizontally (our SketchVerify and baselines). Captions show the input prompt for each sample.
Prompt: The robotic arm put the toy carrot into the metal bowl
SketchVerify
Wan-2.1
Cosmos-predict2
CogVideoX
Prompt: Snow melts and slides down the mountain cliff in the scenic
SketchVerify
Wan-2.1
Cosmos-predict2
CogVideoX
Prompt: The robotic arm picks up the purple tool from the toolbox.
SketchVerify
Wan-2.1
Cosmos-predict2
CogVideoX
Prompt: Two football players tackle an opponent near the end zone in a football simulation game.
SketchVerify
Wan-2.1
Cosmos-predict2
CogVideoX
Prompt: The man in red shorts jumps forward on the beach.
SketchVerify
Wan-2.1
Cosmos-predict2
CogVideoX
Prompt: The robotic arm positions the engine under the car chassis during assembly.
SketchVerify
Wan-2.1
Cosmos-predict2
CogVideoX
Prompt: The character in the red outfit jumps onto a series of shipping containers.
SketchVerify
Wan-2.1
Cosmos-predict2
CogVideoX
Prompt: The herd of antelopes stands alert in the savannah.
SketchVerify
Wan-2.1
Cosmos-predict2
CogVideoX
Prompt: The robotic arm moves the rubber chicken into the metal bowl.
SketchVerify
Wan-2.1
Cosmos-predict2
CogVideoX
Prompt: The worker uses a press brake to cut the metal sheet.
SketchVerify
Wan-2.1
Cosmos-predict2
CogVideoX
Prompt: A soccer player kicks the ball in a high lob.
SketchVerify
Wan-2.1
Cosmos-predict2
CogVideoX
Prompt: A tetherball swings around the pole after being hit.
SketchVerify
Wan-2.1
Cosmos-predict2
CogVideoX
Prompt: A child uses a top on the floor.
SketchVerify
Wan-2.1
Cosmos-predict2
CogVideoX
Prompt: A basketball falls into a hoop.
SketchVerify
Wan-2.1
Cosmos-predict2
CogVideoX
Prompt: A slack rope is used to pull a box.
SketchVerify
Wan-2.1
Cosmos-predict2
CogVideoX
@misc{huang2025sketchverify,
title = {SketchVerify: Planning with Sketch-Guided Verification for Physics-Aware Video Generation},
author = {Huang, Yidong and Wang, Zun and Lin, Han and Kim, Dong-Ki and Omidshafiei, Shayegan and Yoon, Jaehong and Zhang, Yue and Bansal, Mohit},
year = {2025},
eprint = {2511.17450},
archivePrefix = {arXiv},
primaryClass = {cs.CV}
}