Vision-Language-Action (VLA) models have emerged as a promising paradigm
for robot learning, but their representations are still largely inherited from static
image-text pretraining, leaving physical dynamics to be learned from comparatively limited action data.
Generative video models, by contrast, encode rich spatiotemporal structure and implicit physics, making
them a compelling foundation for robotic manipulation. Yet their potential remains largely
unexplored in the literature. To bridge this gap, we introduce DiT4DiT, an end-to-end Video-Action
Model that couples a video Diffusion Transformer with an action Diffusion Transformer in a unified
cascaded framework. Instead of relying on reconstructed future
frames, DiT4DiT extracts intermediate denoising features from the video generation process and uses them
as temporally grounded conditions for action prediction. We further propose a dual flow-matching objective
with decoupled timesteps
and noise scales for video prediction, hidden-state extraction, and action inference, enabling coherent
joint training of both modules. Across simulation and
real-world benchmarks, DiT4DiT achieves state-of-the-art results, reaching average success rates of
98.6% on LIBERO and 50.8% on RoboCasa GR1 while
using substantially less training data. On the Unitree G1 robot, it also delivers superior real-world
performance and strong zero-shot generalization. Importantly,
DiT4DiT improves sample efficiency by over 10x and speeds up convergence by
up to 7x, demonstrating that video generation can serve as an effective scaling
proxy for robot policy learning.
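The dual flow-matching objective with decoupled timesteps can be illustrated with a toy sketch. Everything here is a hypothetical placeholder (the shapes, the linear velocity predictors, the uniform timestep sampling, and the unweighted loss sum), not the paper's implementation:

```python
# Minimal sketch of a dual flow-matching objective with decoupled
# timesteps for a video branch and an action branch. All shapes and
# the toy linear "denoisers" are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(x0, velocity_fn, t):
    """Rectified-flow loss: interpolate x_t = (1 - t) * x0 + t * noise,
    then regress the predicted velocity toward (noise - x0)."""
    noise = rng.standard_normal(x0.shape)
    x_t = (1.0 - t) * x0 + t * noise
    target_v = noise - x0
    pred_v = velocity_fn(x_t, t)
    return np.mean((pred_v - target_v) ** 2)

# Toy stand-ins for the video DiT and action DiT velocity predictors.
video_model  = lambda x, t: 0.1 * x   # hypothetical
action_model = lambda x, t: 0.1 * x   # hypothetical

video_latents = rng.standard_normal((2, 8, 16))  # (batch, frames, dim)
action_chunk  = rng.standard_normal((2, 4, 7))   # (batch, horizon, dof)

# Decoupled timesteps: each branch samples its own t, so the video
# module can sit at an intermediate denoising step (where hidden
# states would be extracted) while the action module trains across
# its own noise schedule.
t_video  = rng.uniform(size=(2, 1, 1))
t_action = rng.uniform(size=(2, 1, 1))

loss = (flow_matching_loss(video_latents, video_model, t_video)
        + flow_matching_loss(action_chunk, action_model, t_action))
print(float(loss) >= 0.0)  # non-negative joint objective
```

In this reading, decoupling means the two branches never need to share a single noise level: the video branch can be queried at whatever denoising step yields useful hidden states, independently of where the action branch is in its own schedule.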