Hi, thanks for the great work and for releasing the sim/libero branch!
I've been trying to reproduce the LIBERO results from Table 1 in the paper, specifically the VLASH (Async) rows on libero_spatial. I followed the instructions in the sim/libero branch README closely, but my results are significantly lower than reported, especially at higher delay values.
## Setup

- Branch: `sim/libero`
- Environment: Python 3.10, PyTorch 2.7.1, lerobot 0.4.1 (as installed by `pip install -e .`)
- Dataset: `lerobot/libero` (1693 episodes, 273K frames, v3.0 features)
- Base model: `lerobot/pi05_base`
- Training config: `examples/train/pi05/libero.yaml` (default: `state_cond=false`, `max_delay_steps=4`, `shared_observation=false`)
- Eval: `vlash eval-libero` with `--eval.method_type=vlash`, `--policy.n_action_steps=5`, `--eval.n_episodes=500`
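For concreteness, this is the eval invocation we ran, assembled from the flags above (we additionally pointed the policy at our trained checkpoint; that flag is omitted here since I'm not sure of its exact name in this branch):

```shell
vlash eval-libero \
  --eval.method_type=vlash \
  --policy.n_action_steps=5 \
  --eval.n_episodes=500
```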
## Training

I trained with two batch-size settings, both for 30K steps on 4 GPUs:

| Setting | Effective batch size | Steps |
|---------|----------------------|-------|
| Run A | 8 × 4 = 32 (paper default) | 30K |
| Run B | 16 × 4 = 64 | 30K |

Final training loss converged to ~0.026 (Run A) and ~0.018 (Run B), which seems reasonable.
## Results on `libero_spatial` (500 episodes)

| Delay | Run A (bsz=32) | Run B (bsz=64) | Paper Table 1 (Spatial) |
|-------|----------------|----------------|-------------------------|
| 0 | 94.3% | 95.6% | 98.8% |
| 1 | 93.6% | 95.0% | 98.8% |
| 2 | 88.6% | 89.6% | 97.5% |
| 3 | 83.2% | 86.0% | 94.4% |
| 4 | 74.4% | 71.2% | 92.5% |
## Sync Baseline Verification

To verify that the eval pipeline is correct, I also evaluated `lerobot/pi05-libero` (the official pretrained model) under synchronous inference (delay=0):

- `pi05-libero`, delay=0: 96.6% (500 episodes on `libero_spatial`)

This is close to the paper's Sync baseline of 97.3% (averaged across suites), so the eval pipeline appears to be working correctly.
## Questions

1. Are there any additional training details not covered in the config/README that might explain the gap (e.g., specific seeds, learning-rate schedule, number of training epochs, data preprocessing)?
2. What numpy version was used? The README suggests 1.24.4, but that causes segfaults with mujoco in our environment; we used numpy 2.2.6 for evaluation.
3. Were the paper results averaged over multiple seeds, or from a single run?
4. Is `shared_observation=true` needed to reproduce the paper numbers? The default config sets it to `false`.
5. The gap is especially large at delay=3,4 (~8–21 percentage points). Could this be related to the `use_state_ground_truth` mechanism for LIBERO, or is there something else we might be missing in the state rollforward logic?
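On the last question, here is a toy sketch of the delay semantics as we understand them, in case our mental model is wrong (this is purely illustrative, not the repo's actual rollforward code; `policy` and `env` are stand-ins):

```python
from collections import deque

def delayed_rollout(policy, env, delay, horizon):
    """Toy model of async inference with a fixed delay: the action executed
    at control step t was predicted from the observation at step t - delay."""
    obs = env.reset()
    # Pre-fill the queue so the first executed actions are already stale.
    queue = deque([policy(obs)] * (delay + 1), maxlen=delay + 1)
    executed = []
    for _ in range(horizon):
        action = queue.popleft()   # oldest prediction, `delay` steps stale
        executed.append(action)
        obs = env.step(action)
        queue.append(policy(obs))  # predicted now, executed `delay` steps later
    return executed
```

Under this model the controller at step t always acts on information that is `delay` steps old, which matches the monotone degradation we see from delay=0 to delay=4; our question is whether the released eval compensates for this staleness via the rolled-forward state.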
Any guidance would be greatly appreciated. Happy to provide more details or logs if needed.