epwalsh left a comment
No major concerns. I'm glad we're cleaning this up.
Why do we scale the embedding with the following factor if `scale_logits=True`?
`emb_std_factor = (0.5 * math.sqrt(self.config.d_model)) if self.config.scale_logits else 1.0`
This was another "trick" we heard works from someone else (not sure who).
Wouldn't this make more sense if we did this when |
Co-authored-by: Pete <epwalsh10@gmail.com>
Yea I'm guessing that's the only scenario where we tried it? It might have come from PaLM.
Simplifies our inscrutable initialization
- `init_weights` with its complex if-else logic.
- `init_normal` which only takes the module, the std, and optionally a cutoff_factor.
- `reset_parameters()`
- `kaiming_normal` and `fan_in` InitFnType as these aren't being used anywhere. Can be added later if needed.

Potential bugs found in initialization as a result of the refactoring (these will be fixed after feedback):
- `OLMoBlock.ff_out`'s `normal` initialization multiplies std by an extra factor of `1 / math.sqrt(2 * self.config.n_layers)`. This potentially came from trying to incorporate `full_megatron` into the same function.
- `mitchell` hardcodes a cutoff_factor of 3.0 (always `truncated_normal_` with 3.0).
- `full_megatron` hardcodes a default cutoff_factor of 3.0 (`truncated_normal_` with `config.init_cutoff_factor or 3.0`). Again, this may be a result of trying to incorporate multiple inits into the same function. Ideally, the cutoff_factor should always come from the configurable `config.init_cutoff_factor`; do we want to always set this value to 3.0 for mitchell and megatron?
- Why do we scale the embedding with the following factor if `scale_logits=True`? `emb_std_factor = (0.5 * math.sqrt(self.config.d_model)) if self.config.scale_logits else 1.0`
- `mitchell` init: due to supplying the factor at multiple places in the old code, std always ends up being 0.5 when `scale_logits=True`!
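
For anyone checking the numbers, here is a minimal sketch of the arithmetic behind the points above. It assumes the `mitchell` base std is `1 / sqrt(d_model)`, which is not stated in this thread. All function names are hypothetical helpers for illustration, not the actual OLMo code.

```python
import math


def emb_std_factor(d_model: int, scale_logits: bool) -> float:
    # Expression quoted in the PR description.
    return (0.5 * math.sqrt(d_model)) if scale_logits else 1.0


def mitchell_embedding_std(d_model: int, scale_logits: bool) -> float:
    # ASSUMPTION: the mitchell base std is 1 / sqrt(d_model).
    base_std = 1.0 / math.sqrt(d_model)
    # Multiplying by the embedding factor cancels the sqrt(d_model) term,
    # leaving 0.5 regardless of d_model when scale_logits is True.
    return base_std * emb_std_factor(d_model, scale_logits)


def ff_out_extra_factor(n_layers: int) -> float:
    # The extra factor reported for OLMoBlock.ff_out under `normal` init.
    return 1.0 / math.sqrt(2 * n_layers)


def resolve_cutoff_factor(init_cutoff_factor):
    # Mirrors the quoted `config.init_cutoff_factor or 3.0` fallback:
    # an unset (None) value falls back to 3.0.
    return init_cutoff_factor or 3.0


def truncation_bounds(std: float, cutoff_factor: float):
    # Truncating at +/- cutoff_factor standard deviations around a zero mean.
    return (-cutoff_factor * std, cutoff_factor * std)


if __name__ == "__main__":
    for d_model in (1024, 2048, 4096):
        std = mitchell_embedding_std(d_model, scale_logits=True)
        print(d_model, std, math.isclose(std, 0.5))
    print(ff_out_extra_factor(32))
    print(truncation_bounds(0.5, resolve_cutoff_factor(None)))
```

Under that assumption, the embedding std no longer depends on `d_model` once `scale_logits=True`, which matches the 0.5 reported in the last bullet.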