DWM: Dexterous World Models

1Seoul National University, 2RLWRLD

TL;DR: A scene-action-conditioned video diffusion model that simulates embodied dexterous actions in a given static 3D scene.

Abstract

Recent progress in 3D reconstruction has made it easy to create realistic digital twins from everyday environments. However, current digital twins remain largely static—limited to navigation and view synthesis without embodied interactivity. To bridge this gap, we introduce Dexterous World Model (DWM), a scene-action-conditioned video diffusion framework that models how dexterous human actions induce dynamic changes in static 3D scenes. Given a static 3D scene rendering and an egocentric hand motion sequence, DWM generates temporally coherent videos depicting plausible human–scene interactions. Our approach conditions video generation on (1) static scene renderings following a specified camera trajectory to ensure spatial consistency, and (2) egocentric hand mesh renderings that encode both geometry and motion cues in the egocentric view to model action-conditioned dynamics directly. To train DWM, we construct a hybrid interaction video dataset: synthetic egocentric interactions provide fully aligned supervision for joint locomotion–manipulation learning, while fixed-camera real-world videos contribute diverse and realistic object dynamics. Experiments demonstrate that DWM enables realistic, physically plausible interactions, such as grasping, opening, or moving objects, while maintaining camera and scene consistency. This framework establishes the first step toward video diffusion-based interactive digital twins, enabling embodied simulation from egocentric actions.

Results

DWM Results in Synthetic Scenes

DWM Results in Real-world Scenes

Method Overview


We decompose embodied actions within a static 3D scene \(\mathbf{S}_{0}\) into an egocentric camera motion \(\mathcal{C}_{1:F}\) and a hand manipulation trajectory \(\mathcal{H}_{1:F}\). Given these components, we render the static scene video and the hand-only video by following the egocentric camera motion \(\mathcal{C}_{1:F}\). These two rendered videos serve as conditioning inputs to our video diffusion model, enabling it to generate egocentric visual simulations of the specified action within the given static 3D scene.
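The conditioning setup can be pictured with a short PyTorch-style sketch. The tensor shapes, the channel-wise fusion, and the commented-out backbone call are illustrative assumptions rather than the paper's actual architecture; the paper only specifies that the static-scene rendering and the hand rendering, both following \(\mathcal{C}_{1:F}\), condition the video diffusion model.

```python
# Minimal sketch (PyTorch) of assembling the two conditioning streams.
# Shapes and the fusion scheme are assumptions for illustration only.
import torch

F_frames, C, H, W = 16, 3, 256, 256              # F video frames

# (1) static scene S_0 rendered along the egocentric camera path C_{1:F}
static_video = torch.rand(1, F_frames, C, H, W)
# (2) hand meshes H_{1:F} rendered along the same camera path
hand_video = torch.rand(1, F_frames, C, H, W)

# Channel-wise concatenation is one simple way to fuse the two conditions.
condition = torch.cat([static_video, hand_video], dim=2)  # (1, F, 2C, H, W)

noisy_latent = torch.randn(1, F_frames, 4, H // 8, W // 8)  # latent video
timestep = torch.tensor([500])

# model = VideoDiffusionBackbone(...)             # hypothetical backbone
# pred_noise = model(noisy_latent, timestep, condition)
```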

💡 Key Insight #1: Hybrid Interaction-Static Paired Video Dataset

DWM learns joint locomotion-manipulation from synthetic egocentric video pairs, while absorbing diverse real-world dynamics from fixed-camera interaction videos.

Training DWM requires paired videos consisting of (i) an interaction video, (ii) a corresponding static-scene video, and (iii) an aligned hand video, all captured under the same camera trajectory. In real-world settings, acquiring such perfectly aligned pairs is challenging. To address this, we first leverage the synthetic 3D human–scene interaction dataset TRUMANS. We obtain egocentric interaction videos in TRUMANS by placing a virtual camera between the agent's eyes, yielding consistent egocentric viewpoints across sequences. Crucially, the synthetic setup allows us to disable object dynamics and re-render the scene along the identical egocentric camera trajectory, producing a clean static-scene video. In addition, we render only the agent's hand meshes to obtain perfectly aligned egocentric hand videos. Through this process, we construct aligned triplets of interaction videos, static-scene videos, and hand videos.

However, purely synthetic data covers a limited range of interactions and lacks the rich dynamics present in real-world environments. Since capturing fully paired egocentric videos in the real world remains impractical, we complement the synthetic data with fixed-camera real-world interaction videos from Taste-Rob. In the fixed-camera setup, we treat the first frame of each video as the static scene and replicate it across all frames to form a static-scene video. Hand videos are extracted by running HaMeR on the interaction sequences. Although camera motion is absent, this procedure yields real-world static–interaction video pairs that capture realistic object dynamics and contact behaviors.

By combining egocentric synthetic pairs (enabling joint locomotion–manipulation learning) with fixed-camera real-world pairs (providing diverse and realistic dynamics), our hybrid dataset design allows DWM to learn robust action-conditioned scene dynamics while sidestepping the prohibitive cost of real-world paired capture.
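The fixed-camera pairing procedure is simple enough to sketch directly. The function name, file-free tensor interface, and the use of precomputed HaMeR hand renders are assumptions for illustration; the synthetic TRUMANS branch instead yields fully aligned triplets rendered with and without object dynamics.

```python
# Minimal sketch: turning a fixed-camera real-world clip into a
# (static, hand, interaction) training triplet. Names and shapes are
# illustrative assumptions, not the authors' data pipeline.
import torch

def make_fixed_camera_triplet(interaction_video: torch.Tensor,
                              hand_video: torch.Tensor):
    """interaction_video, hand_video: (F, C, H, W) tensors from the same clip.
    hand_video is assumed to be rendered offline from HaMeR hand meshes."""
    F_frames = interaction_video.shape[0]
    # Fixed camera: treat the first frame as the static scene and
    # replicate it across all F frames to form the static-scene video.
    static_video = interaction_video[0:1].expand(F_frames, -1, -1, -1).clone()
    return static_video, hand_video, interaction_video

# Usage with dummy data standing in for a real clip and its hand renders:
clip = torch.rand(16, 3, 256, 256)
hands = torch.rand(16, 3, 256, 256)
static, hand, interaction = make_fixed_camera_triplet(clip, hands)
```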

💡 Key Insight #2: Inpainting Priors for Residual Dynamics Learning

A full-mask inpainting diffusion model becomes an identity function with generative priors.

When an inpainting video diffusion model is given a full mask (\(m = 1\)), it reproduces the input video, effectively behaving as an identity mapping with a generative prior. In our case, this allows the static scene video to serve as a valid identity input, as it already encodes egocentric navigation motion rendered along the camera trajectory \(\mathcal{C}_{1:F}\). Based on this observation, we initialize our model as a full-mask inpainting model, using the static scene video as a navigation-only baseline. This initialization encourages the model to preserve the scene appearance and egocentric camera motion, providing a stable reference before introducing any manipulation-induced changes.

To model interaction dynamics, we condition the model on the dexterous hand trajectory \(\mathcal{H}_{1:F}\). This additional signal guides the model to focus on residual dynamics driven by manipulation, rather than re-learning navigation effects already present in the static scene video. As a result, training becomes more stable and the learned dynamics naturally disentangle navigation from manipulation.
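The full-mask conditioning can be sketched as follows. The latent shapes, the channel layout of the denoiser input, and the commented-out denoiser call are assumptions for illustration, not the exact layout used by DWM; the sketch only shows why a full mask turns the inpainting prior into an identity over the static-scene video, leaving the hand stream as the sole driver of residual dynamics.

```python
# Minimal sketch of full-mask inpainting conditioning. Shapes and channel
# ordering are illustrative assumptions.
import torch

F_frames, C, H, W = 16, 4, 32, 32                  # latent-space video

noisy_latent = torch.randn(1, F_frames, C, H, W)
static_latent = torch.rand(1, F_frames, C, H, W)   # encoded static-scene video
hand_latent = torch.rand(1, F_frames, C, H, W)     # encoded hand video

# Full mask (m = 1): every location is "known", so an inpainting model that
# sees (noise, mask, masked video) can simply reproduce the static video.
mask = torch.ones(1, F_frames, 1, H, W)

# With m = 1 the masked video equals the static video itself; the hand latent
# is the only signal requesting changes, i.e. the manipulation residual.
model_input = torch.cat(
    [noisy_latent, mask, mask * static_latent, hand_latent], dim=2)
# pred_noise = denoiser(model_input, timestep)     # hypothetical denoiser call
```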

Citation

@article{kim2025dwm,
  title={Dexterous World Models},
  author={Kim, Byungjun and Kim, Taeksoo and Lee, Junyoung and Joo, Hanbyul},
  journal={arXiv preprint arXiv:2512.17907},
  year={2025}
}