This post presents a world model that predicts how humans manipulate objects: given a single starting frame and a 16-step action sequence, the model predicts the future frames of the manipulation.
Robotics lacks a cheap eval layer. Language has perplexity, vision has classification accuracy, but evaluating a manipulation policy still requires running a real robot or trusting a physics simulator that can’t model cloth. Video diffusion models pretrained on internet data already encode useful physical priors, but they generate plausible futures, not controllable ones. This work turns a pretrained video model into an action-conditioned world model for human bimanual manipulation of deformable objects.
The Core Insight
Standard video diffusion models generate frames that look plausible but ignore what you want to happen. They’re storytellers, not simulators. To make them useful for robotics, two modifications are needed:
- Causality: Frame 10 shouldn’t influence frame 5. The model must respect the arrow of time.
- Action conditioning: The model must understand “if I move my hand here, this happens.”
The first sounds trivial (just mask attention), but pretrained video models have bidirectional temporal convolutions baked in everywhere. The second requires the model to learn a new input modality (actions) while preserving its video-generation capabilities.
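The attention half of the causality constraint is just a lower-triangular mask over frames, so frame $i$ can attend only to frames $0..i$. A minimal sketch (the convolution half is the hard part, covered later):

```python
import numpy as np

def causal_frame_mask(num_frames: int) -> np.ndarray:
    """Lower-triangular boolean mask: frame i may attend to frames 0..i,
    never to frames i+1.. (the arrow of time)."""
    return np.tril(np.ones((num_frames, num_frames), dtype=bool))

mask = causal_frame_mask(4)
# mask[2] allows attention to frames 0, 1, 2 but blocks frame 3
```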
Architecture
Training
During training, the model has access to full video sequences with paired actions. The key trick: rather than applying uniform noise across all frames, each frame gets an independent noise level. Frame 3 might be nearly clean while frame 12 is heavily corrupted.
Why does this matter? At inference, the model generates autoregressively: it has clean past frames and must generate noisy future frames. By training with variable noise levels, the model learns to leverage clean context to reconstruct corrupted frames. Uniform noise would never teach this skill.
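The per-frame noise scheme can be sketched as follows. This is an illustrative implementation, not the paper's code; the schedule `alphas_bar` and the array shapes are placeholders:

```python
import numpy as np

# Toy diffusion schedule: \bar{\alpha}_t decreasing from ~1 (clean) to ~0 (pure noise)
alphas_bar = np.linspace(0.999, 0.001, 1000)

def add_per_frame_noise(frames, alphas_bar, rng):
    """Each frame independently samples its own timestep t_i, so within one
    training clip some frames are nearly clean and others heavily corrupted."""
    T = frames.shape[0]
    t = rng.integers(0, len(alphas_bar), size=T)               # independent per frame
    a = alphas_bar[t].reshape((T,) + (1,) * (frames.ndim - 1)) # broadcast over pixels
    eps = rng.standard_normal(frames.shape)
    noisy = np.sqrt(a) * frames + np.sqrt(1.0 - a) * eps
    return noisy, t, eps
```

Contrast this with standard diffusion training, which would draw a single `t` for the whole clip and never expose the model to mixed clean/corrupted contexts.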
Inference
At test time, only the first frame is available. Generation proceeds one frame at a time: encode the initial frame, generate frame 2 by denoising conditioned on frame 1, generate frame 3 conditioned on frames 1-2, and so on.
This is where causal attention pays off. Because the model never saw future frames during training, it learned to make predictions from past context alone. KV-caching stores attention keys and values from previously generated frames, avoiding recomputation of the entire sequence at each step.
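The autoregressive loop with a growing KV cache can be sketched with a toy stand-in for the denoiser. Everything here is invented for illustration (`ToyModel`, the denoising update, the step counts); only the control flow mirrors the described inference procedure:

```python
import numpy as np

class ToyModel:
    """Stand-in for the causal video diffusion model."""
    def kv(self, frame):
        return frame                        # pretend keys/values are the frame itself
    def denoise_step(self, x, action, kv_cache):
        ctx = np.mean(kv_cache, axis=0)     # "attend" only to cached past frames
        return x + 0.5 * (ctx + action - x)

def rollout(model, first_frame, actions, denoise_steps=4, seed=0):
    rng = np.random.default_rng(seed)
    frames, kv_cache = [first_frame], [model.kv(first_frame)]
    for a in actions:                       # one new frame per outer step
        x = rng.standard_normal(first_frame.shape)
        for _ in range(denoise_steps):      # inner denoising loop for this frame
            x = model.denoise_step(x, a, kv_cache)
        kv_cache.append(model.kv(x))        # cache grows; past frames never recomputed
        frames.append(x)
    return frames
```

The key point is structural: each new frame's denoising reads the cache but never rewrites it, which is what makes per-step cost independent of sequence length.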
Method
The model predicts velocity $v$ rather than noise $\epsilon$, a reparameterization that provides more stable gradients across timesteps[^1]: $v = \sqrt{\bar{\alpha}_t} \cdot \epsilon - \sqrt{1-\bar{\alpha}_t} \cdot x_0$.
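The velocity target is a direct transcription of that formula. Note its limiting behavior: at $\bar{\alpha}_t \to 1$ (no noise) the target is $\epsilon$, and at $\bar{\alpha}_t \to 0$ (pure noise) it is $-x_0$, so $v$-prediction smoothly interpolates between noise-prediction and (negated) data-prediction:

```python
import numpy as np

def v_target(x0, eps, alpha_bar_t):
    """v = sqrt(abar_t) * eps - sqrt(1 - abar_t) * x0."""
    return np.sqrt(alpha_bar_t) * eps - np.sqrt(1.0 - alpha_bar_t) * x0
```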
The loss sums over frames, each with its own sampled noise level:
\[ \mathcal{L} = \mathbb{E}_{t_1,\dots,t_{16}} \left[ \sum_{i=1}^{16} \|v_\theta(x_{t_i}, t_i, a_i) - v_{\text{target},i}\|^2 \right] \]

To make actions actually matter, the model uses classifier-free guidance. During training, actions are randomly dropped with 15% probability, but crucially, per-frame rather than per-sequence. This teaches the model fine-grained action-outcome relationships. At inference, action influence is amplified: $v_{\text{guided}} = v_{\text{uncond}} + s \cdot (v_{\text{cond}} - v_{\text{uncond}})$ where $s > 1$.
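Both halves of the guidance scheme fit in a few lines. The choice of a zero vector as the null action is an assumption for illustration; a learned null embedding is equally plausible:

```python
import numpy as np

def dropout_actions(actions, p=0.15, rng=None):
    """Per-frame CFG dropout: each frame's action is independently replaced
    by a null action (zeros here, an assumption) with probability p.
    Per-sequence dropout would instead zero all 16 actions at once."""
    rng = rng or np.random.default_rng()
    keep = (rng.random(actions.shape[0]) >= p).astype(actions.dtype)
    return actions * keep[:, None]

def guided_v(v_cond, v_uncond, s=3.0):
    """v_guided = v_uncond + s * (v_cond - v_uncond); s > 1 amplifies
    the action-conditioned direction at inference."""
    return v_uncond + s * (v_cond - v_uncond)
```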
The trickiest part was converting pretrained weights to causal. Video diffusion models like DynamiCrafter use symmetric temporal convolutions, kernels that look at frames before and after the current frame. Simply zeroing the future-looking weights destroys learned dynamics. Instead, an extrapolative transformation works: for a kernel $[w_0, w_1, w_2]$, the causal version becomes $[0, w_0 - w_2, w_1 + 2w_2]$. This preserves the effective temporal receptive field while enforcing strict causality.
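The transformation follows from linear extrapolation: if the signal is locally linear, the unseen future frame satisfies $x_{t+1} \approx 2x_t - x_{t-1}$, and substituting that into the symmetric kernel's output yields exactly the causal weights above. A quick numerical check:

```python
import numpy as np

def causalize(kernel):
    """Symmetric [w0, w1, w2] -> causal [0, w0 - w2, w1 + 2*w2]."""
    w0, w1, w2 = kernel
    return np.array([0.0, w0 - w2, w1 + 2.0 * w2])

# On a locally linear signal, fut = 2*cur - past, so both kernels agree
# even though the causal one never reads the future frame:
w = np.array([0.2, 0.5, 0.3])
past, cur = 1.0, 3.0
fut = 2.0 * cur - past                                # = 5.0
symmetric_out = w @ np.array([past, cur, fut])        # sees the future
causal_out = causalize(w) @ np.array([0.0, past, cur])  # past-only
```

On signals that are not locally linear the two outputs diverge, which is why the converted model still needs fine-tuning rather than working zero-shot.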
Adapting to Human Manipulation
Most world models target robot arms with 7-DOF action spaces. The goal here was to model human bimanual manipulation: two hands working together on deformable objects. This required designing a new action representation.
Each hand contributes 10 dimensions: 3D position delta, 6D rotation (two columns of the rotation matrix, more stable than Euler angles or quaternions for learning), and gripper width. The full 20D action captures the coordinated motion of both hands.
| Gripper | Dimensions | Encoding |
|---|---|---|
| Left | 10D | 3D position delta + 6D rotation + 1D width |
| Right | 10D | 3D position delta + 6D rotation + 1D width |
This precise action representation enables tighter action-outcome coupling than video-only approaches that infer actions from pixels. Ground truth end-effector poses allow the model to learn fine-grained control, essential for deformable object manipulation where subtle actions matter.
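Assembling the 20D vector is mechanical. One caveat: the post says "two columns of the rotation matrix" without specifying which or in what order, so taking the first two columns, stacked column-wise, is an assumption here:

```python
import numpy as np

def hand_action(pos_delta, R, width):
    """10D per-hand action: 3D position delta + 6D rotation + 1D gripper width.
    The 6D rotation is the first two columns of R stacked (column choice and
    ordering are assumptions; the post only says 'two columns')."""
    rot6d = np.concatenate([R[:, 0], R[:, 1]])
    return np.concatenate([np.asarray(pos_delta, dtype=float), rot6d, [width]])

def bimanual_action(left, right):
    """20D action: left hand's 10D followed by right hand's 10D."""
    return np.concatenate([hand_action(*left), hand_action(*right)])
```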
The training data consists of 5,248 episodes of bimanual t-shirt folding captured with UMI-style grippers. The fisheye camera provides a wide field of view that captures both hands throughout the manipulation. Training ran at 320×512 resolution with batch size 2, gradient accumulation, learning rate 1e-5, and FP16 mixed precision.
What the Model Learned
The model captures the broad strokes: hand trajectories follow commanded actions, cloth deforms plausibly, spatial relationships stay coherent across frames.
To verify the model learned generalizable dynamics rather than memorizing trajectories, an action cross-swap test was run: pair the starting frame of episode A with the action sequence of episode B. In the grid below, the diagonal shows ground truth and the off-diagonals show swapped actions; the same starting frame under different action sequences produces different outcomes.
This tracks with findings from 1X: video prediction quality correlates with downstream task success[^2]. If the world model can’t accurately predict what happens when you grasp a shirt corner, a policy trained on its rollouts will fail at the real task. Visual fidelity isn’t vanity; it’s a proxy for physical understanding.
Current limitations reveal what’s still hard:
- Fine details decay: Fingers blur, cloth texture simplifies over longer horizons. The model takes shortcuts when it can.
- Complex dynamics: Cloth folding involves self-collision, layering, contact transitions. The model sometimes produces physically impossible configurations.
- Depth ambiguity: Like 1X’s monocular system, a single fisheye camera provides weak 3D grounding. The model sometimes confuses depth ordering when hands cross.
- Large actions: Big displacements cause hallucination; the model hasn’t seen enough extreme motions to generalize.
What’s Next
This world model is infrastructure for the real goal: learning manipulation policies without expensive robot rollouts. The next steps:
- Inverse dynamics grounding: Following 1X’s architecture, add an inverse dynamics model that extracts action sequences from generated frames. This bridges visual prediction to actionable control.
- Model-based policy learning: Train diffusion policies that plan in imagination, using the world model as a simulator.
- Longer horizons: Current 16-frame prediction isn’t enough for complex tasks. Hierarchical action abstraction or best-of-N sampling at inference could extend temporal reach.
While this vision is aspirational and world models for robotics are still in early stages, recent work from 1X demonstrates the potential: they achieved strong generalization to novel tasks using only 70 hours of robot-specific data after massive video pretraining. The goal is not to eliminate physical robots, but to make robot learning dramatically more data-efficient by leveraging internet-scale video knowledge.
The vision: collect human demonstrations once, train a world model, then train thousands of policies in simulation. Real robot time becomes validation, not training.
References

[^1]: This work builds on video diffusion techniques for world modeling. See Vid2World: Crafting Video Diffusion Models to Interactive World Models (Chen et al., 2025).
[^2]: 1X Technologies demonstrated world models driving real humanoid robots with minimal robot-specific data. See World Model for Self-Learning (1X, 2025).