I trained a world model that predicts how humans manipulate objects from a single image and an action sequence.

Given the first frame and a 16-step action sequence, the model predicts the future manipulation frames.

The premise is simple: if you can accurately simulate what happens when a human performs an action, you don’t need a physical robot to learn manipulation. A policy can explore thousands of candidate action sequences in imagination, evaluating outcomes before committing to real-world execution. This isn’t just theoretical: 1X Technologies recently demonstrated that world models can drive real humanoid robots [1], using video diffusion models pretrained on internet-scale data that already understand physics, object permanence, and hand-object interactions. The question becomes: can we steer that knowledge with actions?

The Core Insight

Standard video diffusion models generate frames that look plausible but ignore what you want to happen. They’re storytellers, not simulators. To make them useful for robotics, I needed two modifications:

  1. Causality: Frame 10 shouldn’t influence frame 5. The model must respect the arrow of time.
  2. Action conditioning: The model must understand “if I move my hand here, this happens.”

The first sounds trivial—just mask attention—but pretrained video models have bidirectional temporal convolutions baked in everywhere. The second requires the model to learn a new input modality (actions) while preserving its video generation capabilities.
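For the attention half, the masking itself really is that simple. Here is a minimal PyTorch-style sketch of a causal temporal attention call; the function name and tensor layout are illustrative, not the actual implementation:

```python
import torch
import torch.nn.functional as F

def causal_temporal_attention(q, k, v):
    """q, k, v: (batch, num_frames, dim). Each frame attends only to itself
    and to earlier frames, enforcing the arrow of time."""
    num_frames = q.shape[1]
    # Strict upper triangle marks the future positions a frame must not see.
    future = torch.triu(
        torch.ones(num_frames, num_frames, dtype=torch.bool, device=q.device),
        diagonal=1,
    )
    # Boolean attn_mask convention: True = allowed to attend.
    return F.scaled_dot_product_attention(q, k, v, attn_mask=~future)
```

PyTorch’s `F.scaled_dot_product_attention` also accepts `is_causal=True`, which builds the same lower-triangular mask internally. The temporal convolutions are the part that needs real surgery, covered below.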

Architecture

Training

During training, I have access to full video sequences with paired actions. The key trick: rather than applying uniform noise across all frames, each frame gets an independent noise level. Frame 3 might be nearly clean while frame 12 is heavily corrupted.

Why does this matter? At inference, the model generates autoregressively—it has clean past frames and must generate noisy future frames. By training with variable noise levels, the model learns to leverage clean context to reconstruct corrupted frames. Uniform noise would never teach this skill.
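Here is a minimal sketch of that per-frame noising step, assuming a standard DDPM-style $\bar{\alpha}$ schedule; the tensor shapes and the `alphas_cumprod` buffer name are illustrative:

```python
import torch

def noise_frames_independently(x0, alphas_cumprod):
    """x0: clean video latents (batch, frames, c, h, w).
    alphas_cumprod: (num_train_timesteps,) cumulative product of the noise schedule.
    Each frame in each clip gets its own diffusion timestep, so a nearly clean
    frame can sit next to a heavily corrupted one in the same training example."""
    b, f = x0.shape[:2]
    t = torch.randint(0, len(alphas_cumprod), (b, f), device=x0.device)  # per-frame timesteps
    abar = alphas_cumprod.to(x0.device)[t].view(b, f, 1, 1, 1)           # broadcast over c, h, w
    eps = torch.randn_like(x0)
    x_t = abar.sqrt() * x0 + (1 - abar).sqrt() * eps                     # standard forward process, per frame
    return x_t, eps, t
```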

Training architecture

Inference

At test time, I only have the first frame. Generation proceeds one frame at a time: encode the initial frame, generate frame 2 by denoising conditioned on frame 1, generate frame 3 conditioned on frames 1-2, and so on.

This is where causal attention pays off. Because no frame could attend to future frames during training, the model learned to make predictions from past context alone. KV-caching stores attention keys and values from previously generated frames, so I don’t recompute the entire sequence at each step.
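In pseudocode, the rollout looks roughly like this. The `model.encode`, `model.denoise_frame`, and cache helpers are hypothetical stand-ins for the real interfaces, and the inner denoising loop is collapsed into a single call:

```python
import torch

@torch.no_grad()
def rollout(model, first_frame, actions, num_frames=16):
    """Autoregressive generation: frame i is denoised conditioned on all
    previously generated frames via cached attention keys/values.
    All model methods here are illustrative stand-ins, not the actual API."""
    kv_cache = model.init_kv_cache()
    frames = [model.encode(first_frame)]            # latent for frame 1
    model.append_to_cache(kv_cache, frames[0])      # cache keys/values for frame 1
    for i in range(1, num_frames):
        x = torch.randn_like(frames[0])             # start frame i from pure noise
        x = model.denoise_frame(x, action=actions[i], kv_cache=kv_cache)
        frames.append(x)
        model.append_to_cache(kv_cache, x)          # so frame i+1 reuses this frame's keys/values
    return torch.stack(frames, dim=0)
```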

Inference architecture

Method

The model predicts velocity $v$ rather than noise $\epsilon$—a reparameterization that provides more stable gradients across timesteps [2]: $v = \sqrt{\bar{\alpha}_t} \cdot \epsilon - \sqrt{1-\bar{\alpha}_t} \cdot x_0$.

The loss sums over frames, each with its own sampled noise level:

\[ \mathcal{L} = \mathbb{E}_{t_1,...,t_{16}} \left[ \sum_{i=1}^{16} \|v_\theta(x_{t_i}, t_i, a_i) - v_{\text{target},i}\|^2 \right] \]
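A sketch of how this loss could be computed, reusing the `noise_frames_independently` helper from the training sketch above; the `model(x_t, t, actions)` call signature is an assumption:

```python
import torch.nn.functional as F

def v_prediction_loss(model, x0, actions, alphas_cumprod):
    """x0: clean latents (batch, frames, c, h, w); actions: (batch, frames, action_dim)."""
    x_t, eps, t = noise_frames_independently(x0, alphas_cumprod)
    abar = alphas_cumprod.to(x0.device)[t].view(*t.shape, 1, 1, 1)
    v_target = abar.sqrt() * eps - (1 - abar).sqrt() * x0     # v-parameterization target
    v_pred = model(x_t, t, actions)                           # illustrative model signature
    per_frame = F.mse_loss(v_pred, v_target, reduction="none").flatten(2).mean(-1)  # (batch, frames)
    return per_frame.sum(dim=1).mean()                        # sum over frames, mean over batch
```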

To make actions actually matter, I use classifier-free guidance. During training, I randomly drop actions with 15% probability—but crucially, I drop them per-frame rather than per-sequence. This teaches the model fine-grained action-outcome relationships. At inference, I amplify action influence: $v_{\text{guided}} = v_{\text{uncond}} + s \cdot (v_{\text{cond}} - v_{\text{uncond}})$ where $s > 1$.
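A sketch of both sides of this: per-frame action dropout at training time and the guided velocity at inference. The `null_action` token (a learned null embedding or zeros) and the guidance scale of 3.0 are illustrative assumptions:

```python
import torch

def drop_actions_per_frame(actions, null_action, p=0.15):
    """Training-time conditioning dropout. actions: (batch, frames, action_dim).
    Each frame's action is independently replaced with the null token with
    probability p, rather than dropping the whole sequence at once."""
    drop = torch.rand(actions.shape[:2], device=actions.device) < p      # (batch, frames)
    return torch.where(drop[..., None], null_action.expand_as(actions), actions)

def guided_velocity(v_cond, v_uncond, scale=3.0):
    """Inference-time classifier-free guidance; scale > 1 amplifies action influence."""
    return v_uncond + scale * (v_cond - v_uncond)
```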

The trickiest part was converting pretrained weights to causal. Video diffusion models like DynamiCrafter use symmetric temporal convolutions—a kernel that looks at frames before and after the current frame. Simply zeroing the future-looking weights destroys learned dynamics. Instead, I used an extrapolative transformation: treat the unseen future frame as a linear extrapolation of the past, $x_{t+1} \approx 2x_t - x_{t-1}$, and fold its weight back into the causal taps. For a symmetric kernel $[w_0, w_1, w_2]$ over (previous, current, next) frames, the causal version becomes $[0, w_0 - w_2, w_1 + 2w_2]$ over (two frames back, previous, current). This preserves the kernel's three-frame span and the learned temporal dynamics while enforcing strict causality.
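A sketch of that weight conversion, treating the temporal convolution as a 1-D conv over time for clarity (the real layers may be 3-D convs with a (3, 1, 1) kernel, and grouped convolutions would need extra handling):

```python
import torch
import torch.nn as nn

def make_causal_temporal_conv(sym_conv: nn.Conv1d) -> nn.Conv1d:
    """Convert a symmetric kernel-3 temporal conv into a causal one using the
    extrapolative transform: with weights [w0, w1, w2] on (previous, current, next)
    and the linear guess x_{t+1} ~= 2*x_t - x_{t-1}, the causal kernel over
    (t-2, t-1, t) becomes [0, w0 - w2, w1 + 2*w2]."""
    w = sym_conv.weight.data                                  # (out_ch, in_ch, 3)
    causal = nn.Conv1d(sym_conv.in_channels, sym_conv.out_channels,
                       kernel_size=3, padding=0, bias=sym_conv.bias is not None)
    new_w = torch.zeros_like(w)
    new_w[..., 1] = w[..., 0] - w[..., 2]                     # weight on x_{t-1}
    new_w[..., 2] = w[..., 1] + 2 * w[..., 2]                 # weight on x_t
    causal.weight.data.copy_(new_w)
    if sym_conv.bias is not None:
        causal.bias.data.copy_(sym_conv.bias.data)
    return causal
```

At run time the input is left-padded by two frames, so each output frame depends only on itself and earlier frames while the sequence length is preserved.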

Adapting to Human Manipulation

Most world models target robot arms with 7-DOF action spaces. I wanted to model human bimanual manipulation—two hands working together on deformable objects. This required designing a new action representation.

Each hand contributes 10 dimensions: 3D position delta, 6D rotation (two columns of the rotation matrix—more stable than Euler angles or quaternions for learning), and gripper width. The full 20D action captures the coordinated motion of both hands.

| Gripper | Dimensions | Encoding |
| --- | --- | --- |
| Left | 10D | 3D position delta + 6D rotation + 1D width |
| Right | 10D | 3D position delta + 6D rotation + 1D width |
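A sketch of how the 20D vector could be assembled; the choice of rotation-matrix columns and the left-then-right ordering here are illustrative, not necessarily the exact convention used in training:

```python
import numpy as np

def hand_action(pos_delta, rot_mat, gripper_width):
    """10D per-hand action: 3D position delta, 6D rotation (two columns of the
    3x3 rotation matrix), and gripper width. rot_mat is a (3, 3) numpy array."""
    return np.concatenate([pos_delta, rot_mat[:, 0], rot_mat[:, 1], [gripper_width]])

def bimanual_action(left, right):
    """20D action: left-hand 10D followed by right-hand 10D.
    left and right are (pos_delta, rot_mat, gripper_width) tuples."""
    return np.concatenate([hand_action(*left), hand_action(*right)])
```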

I collected 5,248 episodes of bimanual t-shirt folding using UMI-style grippers (details in the SF Fold dataset post). The fisheye camera provides a wide field of view that captures both hands throughout the manipulation. Training ran at 320×512 resolution with batch size 2, gradient accumulation, learning rate 1e-5, and FP16 mixed precision.

What the Model Learned

The model captures the broad strokes: hand trajectories follow commanded actions, cloth deforms plausibly, spatial relationships stay coherent across frames.

To verify it learned generalizable dynamics rather than memorizing trajectories, I ran an action cross-swap: take the first frame of episode A with the actions from episode B. The grid below shows ground truth (diagonal) versus swapped actions (off-diagonal)—same starting frame, different action sequences producing different outcomes.
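A sketch of how such a grid can be assembled, reusing the `rollout` sketch from the inference section; the episode indexing and grid size are illustrative:

```python
def action_cross_swap(model, episodes, actions, n=4):
    """Build an n x n grid: row i uses the first frame of episode i, column j uses
    the action sequence of episode j. The diagonal reproduces ground-truth pairings;
    off-diagonal cells show the same start frame driven by different actions."""
    grid = []
    for i in range(n):
        row = [rollout(model, episodes[i][0], actions[j]) for j in range(n)]
        grid.append(row)
    return grid
```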

This tracks with findings from 1X: video prediction quality correlates with downstream task success [1]. If the world model can’t accurately predict what happens when you grasp a shirt corner, a policy trained on its rollouts will fail at the real task. Visual fidelity isn’t vanity—it’s a proxy for physical understanding.

Current limitations reveal what’s still hard:

  • Fine details decay: Fingers blur, cloth texture simplifies over longer horizons. The model takes shortcuts when it can.
  • Complex dynamics: Cloth folding involves self-collision, layering, contact transitions. The model sometimes produces physically impossible configurations.
  • Depth ambiguity: Like 1X’s monocular system, a single fisheye camera provides weak 3D grounding. The model sometimes confuses depth ordering when hands cross.
  • Large actions: Big displacements cause hallucination—the model hasn’t seen enough extreme motions to generalize.

What’s Next

This world model is infrastructure for the real goal: learning manipulation policies without expensive robot rollouts. The next steps:

  • Inverse dynamics grounding: Following 1X’s architecture, add an inverse dynamics model that extracts action sequences from generated frames. This bridges visual prediction to actionable control.
  • Model-based policy learning: Train diffusion policies that plan in imagination, using the world model as a simulator.
  • Longer horizons: Current 16-frame prediction isn’t enough for complex tasks. Hierarchical action abstraction or best-of-N sampling at inference could extend temporal reach.

The vision: collect human demonstrations once, train a world model, then train thousands of policies in simulation. Real robot time becomes validation, not training.


References


  1. 1X Technologies demonstrated world models driving real humanoid robots with minimal robot-specific data. See World Model for Self-Learning (1X, 2025).

  2. This work builds on video diffusion techniques for world modeling. See Vid2World: Crafting Video Diffusion Models to Interactive World Models (Chen et al., 2025). Project page.