Hi, thanks for the great work and for releasing the sim/libero branch!
I've been trying to reproduce the LIBERO results from Table 1 in the paper, specifically the VLASH (Async) rows on libero_spatial. I followed the instructions in the sim/libero branch README closely, but my results are significantly lower than reported, especially at higher delay values.
## Setup

- Branch: `sim/libero`
- Environment: Python 3.10, PyTorch 2.7.1, lerobot 0.4.1 (as installed by `pip install -e .`)
- Dataset: `lerobot/libero` (1693 episodes, 273K frames, v3.0 features)
- Base model: `lerobot/pi05_base`
- Training config: `examples/train/pi05/libero.yaml` (default: `state_cond=false`, `max_delay_steps=4`, `shared_observation=false`)
- Eval: `vlash eval-libero` with `--eval.method_type=vlash`, `--policy.n_action_steps=5`, `--eval.n_episodes=500`
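For concreteness, this is the eval invocation we ran, assembled from the flags above (we additionally pointed the policy at our trained checkpoint; that flag is omitted here since I'm not sure of its exact name in this branch):

```shell
vlash eval-libero \
  --eval.method_type=vlash \
  --policy.n_action_steps=5 \
  --eval.n_episodes=500
```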
## Training

I trained with two batch-size settings, both for 30K steps on 4 GPUs:

| Setting | Effective batch size | Steps |
|---------|----------------------|-------|
| Run A | 8 × 4 = 32 (paper default) | 30K |
| Run B | 16 × 4 = 64 | 30K |

Final training loss converged to ~0.026 (Run A) and ~0.018 (Run B), which seems reasonable.
## Results on `libero_spatial` (500 episodes)

| Delay | Run A (bsz=32) | Run B (bsz=64) | Paper Table 1 (Spatial) |
|-------|----------------|----------------|-------------------------|
| 0 | 94.3% | 95.6% | 98.8% |
| 1 | 93.6% | 95.0% | 98.8% |
| 2 | 88.6% | 89.6% | 97.5% |
| 3 | 83.2% | 86.0% | 94.4% |
| 4 | 74.4% | 71.2% | 92.5% |
## Sync Baseline Verification

To verify that the eval pipeline is correct, I also evaluated `lerobot/pi05-libero` (the official pretrained model) under synchronous inference (delay=0):

- `pi05-libero`, delay=0: 96.6% (500 episodes on `libero_spatial`)

This is close to the paper's Sync baseline of 97.3% (averaged across suites), so the eval pipeline appears to be working correctly.
## Questions

1. Are there any additional training details not covered in the config/README that might explain the gap (e.g., specific seeds, learning-rate schedule, number of training epochs, data preprocessing)?
2. What numpy version was used? The README suggests 1.24.4, but that causes segfaults with mujoco in our environment; we used numpy 2.2.6 for evaluation.
3. Were the paper results averaged over multiple seeds, or from a single run?
4. Is `shared_observation=true` needed to reproduce the paper numbers? The default config sets it to `false`.
5. The gap is especially large at delay=3,4 (~8–21 percentage points). Could this be related to the `use_state_ground_truth` mechanism for LIBERO, or is there something else we might be missing in the state rollforward logic?
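On the last question, here is a toy sketch of the delay semantics as we understand them, in case our mental model is wrong (this is purely illustrative, not the repo's actual rollforward code; `policy` and `env` are stand-ins):

```python
from collections import deque

def delayed_rollout(policy, env, delay, horizon):
    """Toy model of async inference with a fixed delay: the action executed
    at control step t was predicted from the observation at step t - delay."""
    obs = env.reset()
    # Pre-fill the queue so the first executed actions are already stale.
    queue = deque([policy(obs)] * (delay + 1), maxlen=delay + 1)
    executed = []
    for _ in range(horizon):
        action = queue.popleft()   # oldest prediction, `delay` steps stale
        executed.append(action)
        obs = env.step(action)
        queue.append(policy(obs))  # predicted now, executed `delay` steps later
    return executed
```

Under this model the controller at step t always acts on information that is `delay` steps old, which matches the monotone degradation we see from delay=0 to delay=4; our question is whether the released eval compensates for this staleness via the rolled-forward state.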
Any guidance would be greatly appreciated. Happy to provide more details or logs if needed.