Dependency on depth

Thanks for your excellent work! I noticed that your model takes depth as input and includes a specific stage to adapt the depth information. What would happen if the depth input were removed? Specifically, if I trained a model on your dataset using only RGB images, would its performance be comparable to the version that uses depth?