LaDi-WM Significantly Improves the Success Rate and Cross-Scenario Generalization of Robot Manipulation Policies

Deep News
Aug 18

In robot manipulation tasks, predictive policies have attracted wide attention in embodied AI in recent years because they can leverage predicted future states to improve manipulation performance. However, getting a world model to predict precise future states of robot-object interaction remains a recognized challenge, particularly when generating high-quality pixel-level representations.

To address these issues, a team from the National University of Defense Technology, Peking University, and Shenzhen University proposed LaDi-WM (Latent Diffusion-based World Model), a world model that performs diffusion in a latent space to predict future states.

Specifically, LaDi-WM builds its latent representation with pre-trained vision foundation models, combining geometric features (extracted with DINOv2) and semantic features (extracted with SigLIP). This representation is broadly general, which benefits policy learning for robot manipulation and enables cross-task generalization.
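To make this concrete, here is a minimal sketch (not the authors' released code) of how such a dual geometric/semantic latent could be extracted with the Hugging Face `transformers` library; the specific checkpoints, token selections, and shapes are assumptions for illustration.

```python
# Sketch: extract geometric (DINOv2) and semantic (SigLIP) latents for one image.
# Checkpoint names and token-selection choices are illustrative assumptions.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, SiglipVisionModel

dino_proc = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
dino = AutoModel.from_pretrained("facebook/dinov2-base").eval()
sig_proc = AutoImageProcessor.from_pretrained("google/siglip-base-patch16-224")
sig = SiglipVisionModel.from_pretrained("google/siglip-base-patch16-224").eval()

@torch.no_grad()
def encode(image: Image.Image):
    # Geometric latent: DINOv2 patch tokens (index 0 is the CLS token, dropped here).
    geo = dino(**dino_proc(images=image, return_tensors="pt")).last_hidden_state[:, 1:]
    # Semantic latent: SigLIP vision-tower patch tokens.
    sem = sig(**sig_proc(images=image, return_tensors="pt")).last_hidden_state
    return geo, sem  # e.g. (1, 256, 768) and (1, 196, 768) for these checkpoints
```

The world model then learns dynamics over token sequences like these rather than over raw pixels, which is what gives the representation its task-agnostic character.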

Building on LaDi-WM, the team designed a diffusion policy that iteratively refines its output actions by incorporating the predicted states generated by the world model, producing more consistent and accurate actions.

In extensive experiments on simulated and real-world benchmarks, LaDi-WM significantly improves the success rate of robot manipulation tasks, most notably yielding a 27.9% improvement on the LIBERO-LONG benchmark and surpassing all previous methods.

**Paper Innovations:**

1. A latent-space diffusion world model: uses vision foundation models to construct general-purpose latent representations and learns generalizable dynamics modeling in that latent space.

2. A diffusion policy iteratively optimized with world-model predictions: the world model generates predicted future states, which are fed back into the policy model to iteratively refine its action outputs.

Figure 1: (Left) Learning the latent diffusion world model from task-agnostic trajectory segments; (Right) Optimizing the policy model with the world model's future-state predictions

**Technical Approach**

The team proposed a framework that uses the world model to optimize policy learning for robot manipulation skills. The framework consists of two stages: world model learning and policy learning.

**A. World Model Learning:**

(a) Latent space representation: geometric and semantic representations are extracted from observed images with pre-trained vision foundation models; geometric representations come from DINOv2 and semantic representations from SigLIP.

(b) Interactive diffusion: diffusion is applied to both latent representations simultaneously, and the two are allowed to interact fully during the diffusion process. The model thereby learns the dependencies between geometric and semantic features, which in turn promotes accurate dynamics prediction for both representations. A minimal sketch of such an interactive denoiser appears below.
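The article does not reproduce the paper's exact architecture, but the following PyTorch sketch illustrates the idea of interactive diffusion: a denoiser in which the geometric and semantic token streams cross-attend in both directions, so each stream's noise estimate depends on the other. All layer choices and dimensions are assumptions.

```python
# Sketch of "interactive diffusion": a denoiser whose geometric and semantic
# streams cross-attend so that each stream's noise prediction can depend on
# the other. Layer structure and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class InteractiveDenoiser(nn.Module):
    def __init__(self, dim=768, heads=8, num_timesteps=1000):
        super().__init__()
        self.geo_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.sem_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Cross-attention in both directions lets the two representations interact.
        self.geo_from_sem = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.sem_from_geo = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.geo_out = nn.Linear(dim, dim)  # predicts noise on geometric tokens
        self.sem_out = nn.Linear(dim, dim)  # predicts noise on semantic tokens
        self.t_embed = nn.Embedding(num_timesteps, dim)  # diffusion timestep embedding

    def forward(self, geo_noisy, sem_noisy, t):
        te = self.t_embed(t)[:, None, :]        # (B, 1, D), broadcast over tokens
        g = geo_noisy + te
        s = sem_noisy + te
        g, _ = self.geo_self(g, g, g)
        s, _ = self.sem_self(s, s, s)
        g2, _ = self.geo_from_sem(g, s, s)      # geometry attends to semantics
        s2, _ = self.sem_from_geo(s, g, g)      # semantics attends to geometry
        return self.geo_out(g + g2), self.sem_out(s + s2)  # per-stream noise estimates
```

During training, both latents would be noised with a shared diffusion schedule, and the network trained to recover the noise on both streams jointly.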

Figure 2: World model architecture based on interactive diffusion

**B. Policy Model Training and Iterative Optimization Inference**

(a) Guiding policy learning with the world model's future predictions: the future states predicted by the world model serve as additional input that guides the policy model toward accurate action prediction. The policy architecture follows diffusion policy models, which makes it easier to learn multimodal action distributions.

(b) Iteratively refining the policy output: within a single control step, the policy model can use the world model's future predictions as guidance multiple times, progressively refining its own action output. Experiments show that this gradually reduces the entropy of the policy's output distribution, yielding more accurate action predictions. A sketch of this inference loop follows.
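As a rough illustration of the inference-time loop (with assumed interfaces `policy.denoise_actions` and `world_model.predict`, and an assumed light re-noising step between rounds), the refinement could look like this sketch:

```python
# Sketch of inference-time iterative refinement: within one control step, the
# action plan is refined K times, each round conditioned on the world model's
# predicted future latents. Interfaces and the re-noising step are assumptions.
import torch

@torch.no_grad()
def refine_actions(policy, world_model, obs_latents, K=3, horizon=8, act_dim=7):
    # Start from pure noise, as in a standard diffusion policy.
    actions = torch.randn(1, horizon, act_dim)
    future = None  # no world-model guidance is available on the first pass
    for _ in range(K):
        # Lightly perturb the current plan, then run the reverse diffusion chain
        # conditioned on the observation latents and the predicted future (if any).
        actions = policy.denoise_actions(
            actions + 0.1 * torch.randn_like(actions),
            cond=obs_latents, future=future,
        )
        # Roll the world model forward under the refined plan to get new guidance.
        future = world_model.predict(obs_latents, actions)
    return actions
```

Each round trades extra world-model rollouts for a lower-entropy, more consistent action plan, which matches the entropy analysis reported later in the article.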

Figure 3: Policy model architecture based on future prediction guidance

**Experimental Results**

**Simulation Experiments:** On public simulation benchmarks (LIBERO-LONG, CALVIN D-D), the team validated the proposed framework on robot manipulation tasks. In these experiments, the world model's training data was kept separate from the policy model's training data in order to test the world model's generalization.

For LIBERO-LONG, the robot is given a language instruction, each task is executed multiple times, and the per-task success rate is recorded. For CALVIN D-D, five consecutive language instructions are given, execution is repeated multiple times, and the average number of completed tasks is recorded.
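For reference, the CALVIN-style chained metric described above can be expressed as the short sketch below; `env`, `episodes`, and `run_instruction` are hypothetical stand-ins for the benchmark's actual evaluation harness.

```python
def average_completed_tasks(env, policy, episodes, run_instruction):
    """Mean number of instructions completed, in order, out of a 5-step chain.

    All arguments are hypothetical stand-ins for the real evaluation harness;
    `run_instruction` is assumed to return True when a task succeeds.
    """
    totals = []
    for ep in episodes:
        env.reset(ep)
        done = 0
        for instruction in ep.instructions[:5]:
            if not run_instruction(env, policy, instruction):
                break  # the chain stops at the first failed task
            done += 1
        totals.append(done)
    return sum(totals) / len(totals)
```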

On LIBERO-LONG, to verify the world model's guiding effect on the policy, the team trained each task with only 10 trajectories. Comparison results are shown in Table 1. Because LaDi-WM provides precise future predictions and feeds them back to the policy to continually refine its actions, it reaches a 68.7% success rate with this minimal training data, significantly outperforming the other methods.

Table 1: LIBERO-LONG Performance Comparison

On the CALVIN D-D benchmark, LaDi-WM also performed strongly on long-horizon tasks (Table 2).

Table 2: CALVIN D-D Performance Comparison

The team further validated the scalability of the framework, as shown in Figure 4: (a) as world model training data increases, the model's prediction error decreases and policy performance improves; (b) as policy training data increases, the manipulation success rate improves; (c) as the policy model's parameter count increases, the manipulation success rate improves.

Figure 4: Scalability Experiments

To test LaDi-WM's cross-scenario generalization, the team trained the world model on LIBERO-LONG and applied it directly to policy learning on CALVIN D-D; results are shown in Table 3. Directly transferring the original policy trained on LIBERO-LONG to CALVIN D-D fails (first row). However, using the LIBERO-LONG world model to guide policy learning in the CALVIN environment achieves a score 0.61 higher than the original policy trained in the CALVIN environment (third row). This indicates that the world model generalizes better than the policy model.

Table 3: Cross-scenario experimental results. L represents LIBERO-LONG, C represents CALVIN D-D

The team further explored how iterative refinement with the world model works. They collected the policy's output actions at different numbers of refinement iterations and plotted their distributions, as shown in Figure 5. As refinement proceeds, the entropy of the output action distribution gradually decreases, meaning the policy's actions at each step become more stable, which improves the overall manipulation success rate.
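A simple way to reproduce this diagnostic, assuming one can sample a batch of actions from the policy at each refinement round, is to fit a Gaussian to the samples and compare differential entropies across rounds; the sampling interface is an assumption.

```python
# Sketch: estimate the differential entropy of the policy's action distribution
# at one refinement round by fitting a Gaussian to sampled actions. The entropy
# of a multivariate Gaussian is 0.5 * logdet(2 * pi * e * Sigma).
import numpy as np

def gaussian_entropy(actions: np.ndarray) -> float:
    # actions: (num_samples, act_dim) drawn from the policy at one iteration,
    # with act_dim >= 2 so the covariance matrix is 2-D.
    cov = np.cov(actions, rowvar=False)
    _, logdet = np.linalg.slogdet(2.0 * np.pi * np.e * cov)
    return 0.5 * logdet

# Usage (hypothetical sampler): plot gaussian_entropy(sample_actions(policy, k))
# for k = 0..K and check that the curve decreases with more refinement rounds.
```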

Figure 5: Action distribution comparison of iterative optimization

**Real Robot Experiments:** The team also validated the framework in real-world scenarios, on tasks including "stacking bowls," "opening a drawer," "closing a drawer," and "picking up an object and placing it in a basket," as shown in Figure 6.

Figure 6: (Left) Real scenario environment; (Right) Actual robot operation examples

In these real-world scenarios, LaDi-WM improved the success rate of the baseline imitation-learning policy by a significant 20% (Table 4).

Table 4: Real scenario performance comparison

Figure 7 shows the final policy's execution trajectories on different tasks. As the figure shows, the policy generalizes robustly across different lighting conditions and initial object positions.

Figure 7: Real scenario robot execution trajectories

**Summary**

The team from the National University of Defense Technology, Peking University, and Shenzhen University proposed LaDi-WM (Latent Diffusion-based World Model), a latent diffusion world model that uses vision foundation models to extract general latent representations and learns generalizable dynamics modeling in latent space. The team also proposed using the world model's future predictions to guide policy learning, iteratively refining policy outputs at inference time to further improve the accuracy of the policy's actions.

Extensive experiments in simulation and on real robots demonstrate the effectiveness of LaDi-WM: the proposed method significantly improves the performance of robot manipulation skills.

