A unified latent action model (LAM) and world model
DiLA learns latent actions and future dynamics end-to-end from observation sequences.
Latent Action Models · World Models · Disentangled Representation Learning
Learning abstract, reusable latent actions from unlabeled videos while preserving high-fidelity world-model predictions through content-structure disentanglement.
Peking University
Latent Action Models learn world models from unlabeled video by inferring abstract actions between frames, but they often face a trade-off between action abstraction and generation fidelity. DiLA resolves this tension through content-structure disentanglement: the predictive bottleneck drives motion-relevant spatial layout into a structure pathway, while a separate content pathway preserves appearance and scene details for generation. This co-evolution of action abstraction and content-structure separation yields continuous, semantically structured latent actions that transfer across embodiments, support interpretable manifold analysis, and improve downstream visual planning.
Overview
Latent Action Models (LAMs) infer action-like variables directly from unlabeled videos, avoiding the need for expensive action-labeled data. Yet existing LAMs face a persistent tension: strong bottlenecks encourage abstract, transferable actions but degrade generation fidelity, while weaker bottlenecks preserve generation quality yet entangle actions with visual appearance.
DiLA reframes this tension as a disentanglement problem. It separates video features into a structure pathway that captures dynamics-relevant spatial layouts and a content pathway that stores appearance, texture, and slowly revealed scene details. The predictive bottleneck in latent action learning drives structure-content separation, and that separation makes the latent actions more abstract.
Latent actions and future dynamics are learned jointly, end-to-end, from unlabeled observation sequences.
The structure pathway models motion, while the content pathway preserves visual details for generation.
Continuous latent actions support cross-embodiment transfer, visual planning, and interpretable analysis.
Method
DiLA predicts in the latent feature space: DINOv2 extracts visual embeddings, the structure pathway learns abstract latent actions, the content pathway maintains memory, and a fusion decoder recombines both streams for future-frame embedding prediction.
A structure encoder compresses video tokens into spatial layouts. An inverse dynamics model extracts latent actions from temporal differences, and a forward dynamics model predicts the next structure state.
A content encoder and a Mamba-based memory aggregate temporally invariant visual information, including occluded backgrounds and scene details that should not be stored in the latent actions.
A dual cross-attention decoder combines predicted structure, content memory, and the initial visual embedding to reconstruct target DINOv2 embeddings.
Rollouts are performed autoregressively in structure space, giving DiLA a compact dynamics model for transfer and planning.
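To make the data flow above concrete, here is a minimal sketch of the pipeline. It is not the released implementation: module names, dimensions, the linear encoders, the GRU stand-in for the Mamba memory, and the MLP stand-in for the dual cross-attention decoder are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DiLASketch(nn.Module):
    """Illustrative sketch of the DiLA pipeline; names and shapes are assumptions.

    Frames are assumed to be pre-encoded into DINOv2 embeddings of size d.
    """

    def __init__(self, d=768, d_struct=128, d_action=32):
        super().__init__()
        self.struct_enc = nn.Linear(d, d_struct)             # structure pathway
        self.content_enc = nn.Linear(d, d_struct)            # content pathway input
        self.content_mem = nn.GRU(d_struct, d_struct, batch_first=True)  # stand-in for the Mamba memory
        self.inv_dyn = nn.Linear(2 * d_struct, d_action)     # latent action from (s_t, s_{t+1})
        self.fwd_dyn = nn.Linear(d_struct + d_action, d_struct)  # predicts the next structure state
        self.decoder = nn.Sequential(                        # stand-in for the dual cross-attention decoder
            nn.Linear(2 * d_struct + d, d), nn.GELU(), nn.Linear(d, d)
        )

    def forward(self, z):                                    # z: (B, T, d) DINOv2 embeddings
        s = self.struct_enc(z)                                # structure states s_1..s_T
        c, _ = self.content_mem(self.content_enc(z))          # content memory up to each step
        a = self.inv_dyn(torch.cat([s[:, :-1], s[:, 1:]], dim=-1))  # latent actions from temporal differences
        s_next = self.fwd_dyn(torch.cat([s[:, :-1], a], dim=-1))    # predicted s_{t+1}
        z0 = z[:, :1].expand_as(z[:, 1:])                     # initial visual embedding, broadcast over time
        z_hat = self.decoder(torch.cat([s_next, c[:, :-1], z0], dim=-1))
        return F.mse_loss(z_hat, z[:, 1:]), a                 # next-embedding prediction loss
```

In this sketch, a training step simply optimizes the next-embedding loss over both pathways, e.g. `loss, actions = DiLASketch()(torch.randn(2, 8, 768)); loss.backward()`, mirroring prediction in latent space rather than pixels.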
Core design of latent action space
DiLA predicts and rolls out dynamics in a compact DINOv2 latent space, avoiding pixel-level reconstruction during training while preserving semantic visual structure.
Latent actions are learned from temporal differences in structure embeddings, forcing the action code to capture transition dynamics rather than static content.
Forward and backward transitions are constrained to form opposite latent action vectors, shaping a continuous and semantically meaningful action manifold.
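One way to express the opposite-action constraint is as a regularizer on the inverse dynamics model. The snippet below is a hypothetical formulation (the exact loss, weighting, and pairing used by DiLA may differ), reusing the `inv_dyn` module from the sketch above.

```python
import torch
import torch.nn.functional as F


def opposite_action_loss(inv_dyn, s_t, s_next):
    """Hypothetical regularizer: the action inferred for the reversed
    transition should be the negation of the forward action."""
    a_fwd = inv_dyn(torch.cat([s_t, s_next], dim=-1))   # action for t -> t+1
    a_bwd = inv_dyn(torch.cat([s_next, s_t], dim=-1))   # action for t+1 -> t
    return F.mse_loss(a_bwd, -a_fwd)                    # encourages a_bwd ≈ -a_fwd
```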
Results
The core experimental evidence is organized around action transfer and content-structure disentanglement. Quantitative ablations are shown separately below.
Action Transfer
DiLA extracts latent actions from a source video and applies them to a target context, including human-to-robot transfer, cross-object semantic transfer, intra-domain transfer, and navigation transfer between virtual and real scenes.
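A hypothetical transfer loop, continuing the sketch from the Method section: latent actions inferred from a source clip are rolled out in structure space from a target initial frame. Function and attribute names are assumptions, not the paper's API.

```python
import torch


@torch.no_grad()
def transfer_actions(model, z_src, z_tgt0):
    """Hypothetical action transfer: infer latent actions from a source clip
    and roll them out from a target context (names follow DiLASketch above)."""
    s_src = model.struct_enc(z_src)                       # source structure states (B, T, d_struct)
    actions = model.inv_dyn(torch.cat([s_src[:, :-1], s_src[:, 1:]], dim=-1))
    s = model.struct_enc(z_tgt0)                          # target context (B, 1, d_struct)
    rollout = []
    for t in range(actions.shape[1]):
        s = model.fwd_dyn(torch.cat([s, actions[:, t:t + 1]], dim=-1))
        rollout.append(s)
    return torch.cat(rollout, dim=1)                      # predicted target structure states
```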
Content-Structure Disentanglement
Structure from one sequence can be recombined with content from another. The output follows the source spatial layout while inheriting reference appearance, texture, and scene details.
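Under the same illustrative assumptions, recombination can be sketched by decoding structure states from one sequence against the content memory of another.

```python
import torch


@torch.no_grad()
def recombine(model, z_layout_src, z_appearance_ref):
    """Hypothetical recombination: layout from one sequence, appearance from
    another (module names follow the DiLASketch above)."""
    s = model.struct_enc(z_layout_src)                                  # spatial layout source
    c, _ = model.content_mem(model.content_enc(z_appearance_ref))       # appearance reference
    T = min(s.shape[1], c.shape[1])
    z0 = z_appearance_ref[:, :1].expand(-1, T, -1)                      # reference initial embedding
    return model.decoder(torch.cat([s[:, :T], c[:, :T], z0], dim=-1))   # recombined embeddings
```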
Ablation
Removing the content pathway or replacing DiLA's continuous bottleneck with a discrete or Gaussian one degrades disentanglement, generation fidelity, or cycle-transfer robustness.
Lower rollout and cycle-transfer LPIPS indicate stronger generation fidelity and transferability.
| Model | Rollouts↓ | Cycle transfer↓ | 10k MSE↓ |
|---|---|---|---|
| DiLA w/o content | 0.344 | 0.451 | 0.249 |
| Discrete z | 0.334 | 0.442 | 0.262 |
| Gaussian z | 0.346 | 0.434 | 0.265 |
| DiLA | 0.263 | 0.343 | 0.216 |
Analysis
In controlled and out-of-distribution settings, DiLA's continuous action space aligns with physical transformation parameters and downstream control signals.
Mean squared error (lower is better) on unseen robotic benchmarks.
| Method | Franka | Block | Push-T | LIBERO |
|---|---|---|---|---|
| Discrete z | 0.098 | 0.061 | 0.023 | 0.160 |
| Gaussian z | 0.125 | 0.102 | 0.041 | 0.190 |
| DiLA | 0.073 | 0.037 | 0.009 | 0.119 |
Visual Planning
After action adaptation, DiLA serves as the visual dynamics model for MPPI (model predictive path integral) planning and improves aggregate VP2 success over AdaWorld; a minimal planning sketch follows the table below.
Average over 4 independent runs per task. Aggregate success is normalized relative to the ground-truth simulator baseline.
| Method | Robosuite push | Open slide | Blue button | Green button | Red button | Upright block | Aggregate |
|---|---|---|---|---|---|---|---|
| AdaWorld | 63.50 | 5.83 | 29.17 | 10.83 | 10.00 | 5.00 | 21.54 |
| DiLA | 68.00 | 15.00 | 78.33 | 35.83 | 20.83 | 3.33 | 41.44 |
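For illustration, a toy MPPI loop over DiLA-style latent actions might look like the following. The cost function (distance to a goal structure state), sampling distribution, and hyper-parameters are placeholders, not the paper's planner configuration; `model` follows the hypothetical DiLASketch above.

```python
import torch


@torch.no_grad()
def mppi_plan(model, z0, z_goal, horizon=8, samples=256, sigma=0.5, temperature=1.0):
    """Toy MPPI over latent actions: sample action sequences, roll them out in
    structure space, and return the exponentially weighted average sequence."""
    s0 = model.struct_enc(z0)                               # current structure state (1, 1, d_struct)
    s_goal = model.struct_enc(z_goal)                       # goal structure state
    d_action = model.inv_dyn.out_features
    actions = sigma * torch.randn(samples, horizon, d_action)
    costs = torch.zeros(samples)
    for i in range(samples):
        s = s0
        for t in range(horizon):
            a = actions[i, t].view(1, 1, -1)
            s = model.fwd_dyn(torch.cat([s, a], dim=-1))    # latent rollout
        costs[i] = torch.norm(s - s_goal)                   # terminal cost only, for brevity
    weights = torch.softmax(-costs / temperature, dim=0)    # MPPI exponential weighting
    return (weights.view(-1, 1, 1) * actions).sum(dim=0)    # weighted-average action sequence
```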
Citation
@inproceedings{zhang2026dila,
  title     = {{DiLA}: Disentangled Latent Action World Models},
  author    = {Zhang, Tianqiu and Lyu, Muyang and Zhang, Yufan and Fang, Fang and Wu, Si},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  year      = {2026},
  series    = {Proceedings of Machine Learning Research},
  url       = {https://disentangled-latent-action-world-models.github.io}
}