Vision-Language-Action (VLA) models have emerged as a promising paradigm
for robot learning, but their representations are still largely inherited from static
image-text pretraining, leaving physical dynamics to be learned from comparatively limited action data.
Generative video models, by contrast, encode rich spatiotemporal structure and implicit physics, making
them a compelling foundation for robotic manipulation. Yet their potential remains largely
unexplored in the literature. To bridge this gap, we introduce DiT4DiT, an end-to-end Video-Action
Model that couples a video Diffusion Transformer with an action Diffusion Transformer in a unified
cascaded framework. Instead of relying on reconstructed future
frames, DiT4DiT extracts intermediate denoising features from the video generation process and uses them
as temporally grounded conditions for action prediction. We further propose a dual flow-matching objective
with decoupled timesteps
and noise scales for video prediction, hidden-state extraction, and action inference, enabling coherent
joint training of both modules. Across simulation and
real-world benchmarks, DiT4DiT achieves state-of-the-art results, reaching average success rates of
98.6% on LIBERO and 50.8% on RoboCasa GR1 while
using substantially less training data. On the Unitree G1 robot, it also delivers superior real-world
performance and strong zero-shot generalization. Importantly,
DiT4DiT improves sample efficiency by over 10x and speeds up convergence by
up to 7x, demonstrating that video generation can serve as an effective scaling
proxy for robot policy learning.
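The dual flow-matching objective with decoupled timesteps can be illustrated with a toy sketch. Everything here is a hypothetical placeholder (the shapes, the linear velocity predictors, the uniform timestep sampling, and the unweighted loss sum), not the paper's implementation:

```python
# Minimal sketch of a dual flow-matching objective with decoupled
# timesteps for a video branch and an action branch. All shapes and
# the toy linear "denoisers" are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(x0, velocity_fn, t):
    """Rectified-flow loss: interpolate x_t = (1 - t) * x0 + t * noise,
    then regress the predicted velocity toward (noise - x0)."""
    noise = rng.standard_normal(x0.shape)
    x_t = (1.0 - t) * x0 + t * noise
    target_v = noise - x0
    pred_v = velocity_fn(x_t, t)
    return np.mean((pred_v - target_v) ** 2)

# Toy stand-ins for the video DiT and action DiT velocity predictors.
video_model  = lambda x, t: 0.1 * x   # hypothetical
action_model = lambda x, t: 0.1 * x   # hypothetical

video_latents = rng.standard_normal((2, 8, 16))  # (batch, frames, dim)
action_chunk  = rng.standard_normal((2, 4, 7))   # (batch, horizon, dof)

# Decoupled timesteps: each branch samples its own t, so the video
# module can sit at an intermediate denoising step (where hidden
# states would be extracted) while the action module trains across
# its own noise schedule.
t_video  = rng.uniform(size=(2, 1, 1))
t_action = rng.uniform(size=(2, 1, 1))

loss = (flow_matching_loss(video_latents, video_model, t_video)
        + flow_matching_loss(action_chunk, action_model, t_action))
print(float(loss) >= 0.0)  # non-negative joint objective
```

In this reading, decoupling means the two branches never need to share a single noise level: the video branch can be queried at whatever denoising step yields useful hidden states, independently of where the action branch is in its own schedule.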