Hi, thank you a lot for the impressive work, especially in terms of capturing corner case scenarios with the Impromptu dataset!
My question is more of a curiosity: many VLA works predict the future trajectory as waypoints in text format (as in your work or in EMMA, for example), but there are alternatives, such as:
- clustering the trajectories and expanding the VLM vocabulary to predict specialised action tokens that represent either waypoint trajectories or low-level controls (as in AutoVLA)
- using separate planning modules, such as a generative module (diffusion head, VAE, ...) as in the case of ORION
I was wondering whether either of these two alternatives was explored and, if so, how it performed after the fine-tuning stage with your dataset. Thanks!