Difficulty Reproducing LIBERO Results (Table 1) #22

@boss-server-ops

Description


Hi, thanks for the great work and for releasing the sim/libero branch!

I've been trying to reproduce the LIBERO results from Table 1 in the paper, specifically the VLASH (Async) rows on libero_spatial. I followed the instructions in the sim/libero branch README closely, but my results are significantly lower than reported, especially at higher delay values.

Setup

  • Branch: sim/libero
  • Environment: Python 3.10, PyTorch 2.7.1, lerobot 0.4.1 (as installed by pip install -e .)
  • Dataset: lerobot/libero (1693 episodes, 273K frames, v3.0 features)
  • Base model: lerobot/pi05_base
  • Training config: examples/train/pi05/libero.yaml (default, state_cond=false, max_delay_steps=4, shared_observation=false)
  • Eval: vlash eval-libero with --eval.method_type=vlash, --policy.n_action_steps=5, --eval.n_episodes=500
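Concretely, the eval invocation was of the form (flags as listed above; everything else left at repo defaults):

```shell
vlash eval-libero \
    --eval.method_type=vlash \
    --policy.n_action_steps=5 \
    --eval.n_episodes=500
```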

Training

I trained with two batch size settings, both for 30K steps on 4 GPUs:

| Setting | Effective batch size | Steps |
|---------|----------------------|-------|
| Run A | 8 × 4 = 32 (paper default) | 30K |
| Run B | 16 × 4 = 64 | 30K |

Final training loss converged to ~0.026 (Run A) and ~0.018 (Run B), which seems reasonable.

Results on libero_spatial (500 episodes)

| Delay | Run A (bsz=32) | Run B (bsz=64) | Paper Table 1 (Spatial) |
|-------|----------------|----------------|-------------------------|
| 0 | 94.3% | 95.6% | 98.8% |
| 1 | 93.6% | 95.0% | 98.8% |
| 2 | 88.6% | 89.6% | 97.5% |
| 3 | 83.2% | 86.0% | 94.4% |
| 4 | 74.4% | 71.2% | 92.5% |
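For reference, a quick script tabulating the per-delay gap to the paper numbers (values copied from the table above):

```python
# Success rates (%) on libero_spatial over 500 episodes, from the table above.
paper = {0: 98.8, 1: 98.8, 2: 97.5, 3: 94.4, 4: 92.5}
run_a = {0: 94.3, 1: 93.6, 2: 88.6, 3: 83.2, 4: 74.4}  # effective bsz=32
run_b = {0: 95.6, 1: 95.0, 2: 89.6, 3: 86.0, 4: 71.2}  # effective bsz=64

for d in sorted(paper):
    print(f"delay={d}: Run A -{paper[d] - run_a[d]:.1f} pts, "
          f"Run B -{paper[d] - run_b[d]:.1f} pts")
```

The gap grows with delay: about 3-4 points at delay=0 but 18-21 points at delay=4.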

Sync Baseline Verification

To verify the eval pipeline is correct, I also evaluated lerobot/pi05-libero (the official pretrained model) under synchronous inference (delay=0):

  • pi05-libero, delay=0: 96.6% (500 episodes on libero_spatial)

This is close to the paper's Sync baseline of 97.3% (averaged across suites), so the eval pipeline appears to be working correctly.

Questions

  1. Are there any additional training details not covered in the config/README that might explain the gap? (e.g., specific seeds, learning rate schedules, number of training epochs, data preprocessing)

  2. What numpy version was used? The README suggests 1.24.4, but this causes segfaults with mujoco in our environment. We used numpy 2.2.6 for evaluation.

  3. Were the paper results averaged over multiple seeds, or from a single run?

  4. Is shared_observation=true needed to reproduce the paper numbers? The default config has it set to false.

  5. The gap is especially large at delay=3 and delay=4 (roughly 8-21 percentage points). Could this be related to the use_state_ground_truth mechanism for LIBERO, or is there something else we might be missing in the state rollforward logic?
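To make question 5 concrete, here is a toy sketch of what I assume the state rollforward does (the function name and signature are my own guess at the mechanism, not code from the repo): the stale proprioceptive state is advanced through the actions already committed during the delay window before being fed to the policy.

```python
from collections import deque


def rollforward_state(stale_state, pending_actions, step_fn):
    """Advance a stale state estimate through actions that were committed
    during the delay window but are not yet reflected in the observation.
    Hypothetical helper -- my guess at the mechanism, not repo code."""
    state = stale_state
    for action in pending_actions:
        state = step_fn(state, action)  # forward model or ground-truth sim step
    return state


# Toy usage: scalar integrator state, delay=3 pending actions.
pending = deque([0.1, 0.2, -0.05])
rolled = rollforward_state(1.0, pending, lambda s, a: s + a)
print(rolled)  # 1.25
```

Is this roughly the right mental model, and does use_state_ground_truth change which step_fn is used at higher delays?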

Any guidance would be greatly appreciated. Happy to provide more details or logs if needed.
