A simple, handwritten token transformer I wrote for some mathematics research has turned up a gradient bug in v0.21.0-pre.1. Loss plot for identical model settings, good under v0.20.1 and bad under v0.21.0-pre.1:
I'm still hunting down the root cause, but thought I should flag it early.
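One generic way to localize a gradient bug like this is a finite-difference check: compare the analytic gradient against central differences of the loss and see where they diverge. This is a minimal, framework-agnostic sketch using a toy quadratic loss, not the actual transformer or the library in question:

```python
import numpy as np

def loss(w, x, y):
    # Toy quadratic loss standing in for the real model's loss.
    return 0.5 * np.sum((x @ w - y) ** 2)

def analytic_grad(w, x, y):
    # Closed-form gradient of the loss above; in practice this is
    # whatever the framework's backward pass produces.
    return x.T @ (x @ w - y)

def finite_diff_grad(f, w, eps=1e-6):
    # Central differences: one pair of loss evaluations per parameter.
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 3))
y = rng.standard_normal(8)
w = rng.standard_normal(3)

g_exact = analytic_grad(w, x, y)
g_numeric = finite_diff_grad(lambda w_: loss(w_, x, y), w)

# The two should agree to roughly eps-level precision; a large
# discrepancy on some parameter flags a broken gradient there.
print(np.max(np.abs(g_exact - g_numeric)))
```

Running the same check against both versions' backward passes, per parameter group, narrows down which op's gradient regressed.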