Newton: Agentic Planning for Physics-Following Video Generation

¹Zhejiang University  ²The Hong Kong Polytechnic University  ³IROOTECH Technology · Sany Group
Equal contribution  ·  *Corresponding authors
Agentic Video Generation · Plan–Execute–Verify · Flow-GRPO · Tool-Augmented Reasoning

Newton orchestrates external tools to ground video generation in physical laws.

Abstract

Video generation models produce visually compelling results but systematically violate physical commonsense: on VideoPhy-2, the best model achieves only 32.6% joint accuracy. We identify a specification bottleneck: text prompts are a lossy compression of the physical world that omits the parameters needed to fully determine dynamics, and no amount of model scaling can recover what was never specified. From this diagnosis we derive three properties that physics conditioning must satisfy (sufficiency, dynamism, and verifiability) and show that no existing approach satisfies all three.

We present Newton, in which video generation is demoted from the system output to one action inside an agent's toolbox: a learned planner orchestrates physics-aware tools (keyframe generation, scientific computation, prompt refinement) to construct rich conditioning, and a verifier closes the loop for iterative re-planning. The planner is the sole trainable component, optimized on-policy via Flow-GRPO inside the live multi-turn loop.

On VideoPhy-2, Newton improves joint accuracy from 21.4% to 29.7% on LTX-Video and from 30.7% to 37.4% on Veo-3.1, without modifying either generator.
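
As a quick check of the reported numbers, the short calculation below derives the absolute gain (in percentage points) and the relative gain for each generator. The four accuracy values come from the text above; the function and variable names are only illustrative.

```python
# Reported VideoPhy-2 joint accuracy (percent): (baseline, with Newton).
results = {
    "LTX-Video": (21.4, 29.7),
    "Veo-3.1": (30.7, 37.4),
}

def gains(baseline, with_newton):
    """Return (absolute gain in points, relative gain in percent), rounded to 0.1."""
    absolute = with_newton - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 1), round(relative, 1)

for name, (base, newton) in results.items():
    abs_gain, rel_gain = gains(base, newton)
    print(f"{name}: +{abs_gain} points ({rel_gain}% relative)")
```

The gains are 8.3 points on LTX-Video and 6.7 points on Veo-3.1. The weaker baseline gains more in both absolute and relative terms.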

Method

Newton Pipeline

Given a user prompt, the planner decides which tools to invoke; the video generator becomes one tool among many. The critic evaluates each draft for physical plausibility and returns natural-language feedback that informs the next plan–execute–verify cycle.
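
The control flow described above can be sketched as a small loop. This is a toy illustration, not Newton's implementation: the tool names, `toy_planner`, `toy_critic`, the scoring rule, and the stopping threshold are all hypothetical stand-ins for the learned planner, the real tools, and the multimodal critic.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

@dataclass
class State:
    prompt: str
    context: List[str] = field(default_factory=list)   # tool outputs gathered so far
    feedback: List[str] = field(default_factory=list)  # critic comments from earlier turns

# Hypothetical stand-ins for the real tools; each returns a short string.
TOOLS: Dict[str, Callable[[State], str]] = {
    "compute": lambda s: "physics notes from a scientific computation",
    "keyframe": lambda s: "keyframe image for conditioning",
    "refine_prompt": lambda s: s.prompt + " (with explicit physical details)",
    "generate_video": lambda s: f"video conditioned on {len(s.context)} items",
}

def toy_planner(state: State) -> List[str]:
    """Stand-in for the learned planner: choose this turn's tool calls."""
    if not state.feedback:
        return ["refine_prompt", "generate_video"]
    return ["compute", "keyframe", "generate_video"]

def toy_critic(video: str, state: State) -> Tuple[float, str]:
    """Stand-in for the critic: a scalar score plus language feedback."""
    score = min(1.0, 0.3 * len(state.context))
    return score, "motion looks implausible; add more physical conditioning"

def run(prompt: str, max_turns: int = 3, threshold: float = 0.8):
    state = State(prompt)
    video, score, turns_used = None, 0.0, 0
    for turn in range(1, max_turns + 1):
        turns_used = turn
        for name in toy_planner(state):            # plan
            output = TOOLS[name](state)            # execute
            if name == "generate_video":
                video = output                     # the generator is just another tool
            else:
                state.context.append(output)
        score, comment = toy_critic(video, state)  # verify
        if score >= threshold:
            break
        state.feedback.append(comment)             # feedback drives the next plan
    return video, score, turns_used

print(run("A ball is dropped onto a table and bounces."))
```

In this toy run the first draft is rejected, the critic's comment changes the planner's tool choice, and the second draft passes. This mirrors the self-correction that multi-turn rollouts are meant to allow.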

Plan–Execute–Verify Loop
  • Vision–language planner chooses tools per turn.
  • Multi-turn rollouts let the system self-correct.
  • Treats the video generator as one callable tool.
External Toolkit
  • Python interpreter for physics simulation.
  • Search engine for visual reference retrieval.
  • Image generator for physics-grounded conditioning.
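
To illustrate the kind of scientific computation a Python-interpreter tool could perform, here is a minimal sketch for a dropped-ball prompt. The drop height, restitution coefficient, function names, and output wording are all illustrative assumptions rather than values or code from Newton. The model is an ideal point mass with no air drag.

```python
import math

def drop_plan(h0=1.0, restitution=0.7, g=9.81, n_bounces=3):
    """Ideal ball dropped from rest at height h0 (metres), no air drag.

    h0 and restitution are assumed values: a text prompt rarely states them.
    """
    t_impact = math.sqrt(2.0 * h0 / g)   # from h0 = g * t^2 / 2
    v_impact = g * t_impact              # speed just before the first impact
    # Each bounce scales speed by `restitution`, so peak height scales by its square.
    peaks = [h0 * restitution ** (2 * k) for k in range(1, n_bounces + 1)]
    return t_impact, v_impact, peaks

def as_conditioning(t_impact, v_impact, peaks):
    """Turn the numbers into text that a refined prompt could carry."""
    heights = ", ".join(f"{p:.2f} m" for p in peaks)
    return (f"Ball reaches the ground after {t_impact:.2f} s at {v_impact:.2f} m/s; "
            f"successive rebound peaks: {heights}.")

print(as_conditioning(*drop_plan()))
```

The computed quantities (impact time, impact speed, rebound heights) are exactly the parameters a short text prompt leaves out. Making them explicit is one way to supply richer conditioning to the generator.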
Flow-GRPO Training
  • Multimodal critic produces language & scalar rewards.
  • Trajectory-level credit assignment across turns.
  • Planner self-evolves from live deployment traces.
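
Trajectory-level credit assignment can be sketched with the group-relative normalization used in GRPO-style methods. Several rollouts of the same prompt are scored, each final reward is normalized against the group, and every planner turn in a rollout inherits that rollout's advantage. This is an assumption-laden sketch: the page does not give Newton's exact objective, and the function names, reward values, and `eps` constant are hypothetical.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's scalar reward against its group (mean 0, unit scale)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def per_turn_advantages(rollouts):
    """rollouts: list of (num_turns, final_reward) for one prompt.

    Every turn of a rollout receives the same trajectory-level advantage.
    """
    advantages = group_relative_advantages([reward for _, reward in rollouts])
    return [[adv] * num_turns for (num_turns, _), adv in zip(rollouts, advantages)]

# Four hypothetical rollouts of one prompt: (turns taken, critic's scalar reward).
group = [(2, 1.0), (3, 0.0), (1, 1.0), (2, 0.0)]
for turns in per_turn_advantages(group):
    print([round(a, 3) for a in turns])
```

Rollouts that beat the group average push all of their planning decisions up, and the rest push theirs down, so no per-turn reward model is needed.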

Real-World Physics

Newton vs. open-source video generators on prompts that require fluid dynamics, deformable cutting, granular pouring, and rigid-object peeling.

Fluid: A bottle of beer is poured into a mug until it is full.
Cutting: A small knife digs a groove into a piece of wood.
Granular: Salt is poured from a shaker onto a plate, creating a layer of granules.
Peeling: A grapefruit is peeled with a knife; the thick rind separates.
Videos compared for each prompt: Newton (Ours), LTX, HunyuanVideo, Wan 2.2.

Animated World

Newton transfers its physics-following behavior into stylized domains — Studio Ghibli and LEGO — without sacrificing visual style.

Ghibli (Studio Ghibli): A girl on a park bench blows a bubble that drifts up into the sky.
LEGO: A quarterback hands off the football; the running back drops it and it bounces.
Videos compared for each prompt: Newton (Ours), HunyuanVideo, Wan 2.2.

Citation

@article{feng2026newton,
  title         = {Newton: Agentic Planning for Physics-Following Video Generation},
  author        = {Feng, Yuxiang and Wang, Juncheng and Xu, Chao and Qian, Yijie and Wang, Huihan and Hou, Wenlong and Liu, Yang and Sun, Baigui and Liu, Yong and Wang, Shujun},
  journal       = {arXiv preprint arXiv:2605.18396},
  year          = {2026},
  url           = {https://arxiv.org/abs/2605.18396}
}