[Question] Action decoder

In the TrackVLA paper, a Diffusion Transformer is used as the action output head. However, in this implementation, it has been replaced with an MLP. Besides ease of implementation, does the MLP head offer any other advantages? Has the performance of the two output heads been compared?