[FEA] Support for NVIDIA_TF32_OVERRIDE environment variable + handle #1393
Description
Is your feature request related to a problem? Please describe.
I recently ran some brute-force KNN benchmarks with @tfeher, comparing the performance of 1 x tf32 against 3 x tf32. On a representative benchmark, using 1 x tf32 resulted in a 2.5x speedup (5 seconds -> 2 seconds). This can be significant for certain workloads, but it cannot be made the default because the effects of the reduced numerical accuracy are unknown.
We ran into the problem of how to support this use case in our current pairwise distance API. We already have two distance types for the L2 distance (expanded and unexpanded), and adding variants for every possible way of speeding up the computation could become prohibitive. cuBLAS supports the NVIDIA_TF32_OVERRIDE environment variable that can force fp32 computations to be performed in tfloat32 precision.
Describe the solution you'd like
Add support for the NVIDIA_TF32_OVERRIDE environment variable in the RAFT handle. This way, algorithms can query this option without having to continuously inspect the environment.
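To make the idea concrete, here is a minimal sketch of reading the variable once and caching the result on a handle-like object. It is written in Python only for brevity. The `Resources` class, its `tf32_override` property, and the `environ` parameter are hypothetical names, not part of the RAFT API. The mapping of "0" to disallow and "1" to allow is an assumption of this sketch and may not match how cuBLAS itself interprets the variable.

```python
import os


class Resources:
    """Hypothetical stand-in for a RAFT handle (not the real RAFT API)."""

    ENV_VAR = "NVIDIA_TF32_OVERRIDE"

    def __init__(self, environ=None):
        # Read the environment exactly once, at construction time.
        env = os.environ if environ is None else environ
        raw = env.get(self.ENV_VAR)

        # Assumed interpretation for this sketch:
        #   "0" -> TF32 disallowed, "1" -> TF32 allowed,
        #   unset or anything else -> no override (None).
        if raw is None:
            self._tf32_override = None
        else:
            self._tf32_override = {"0": False, "1": True}.get(raw.strip())

    @property
    def tf32_override(self):
        """Cached override; later changes to the environment are not seen."""
        return self._tf32_override


if __name__ == "__main__":
    handle = Resources()
    print("tf32 override:", handle.tf32_override)
```

An algorithm would then check `handle.tf32_override` instead of calling into the environment on every invocation, and fall back to its own default when the value is `None`.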
In addition, make it possible to set the tf32 override programmatically. For instance, PyTorch supports the following:
# The flag below controls whether to allow TF32 on matmul.
torch.backends.cuda.matmul.allow_tf32 = True
Describe alternatives you've considered
Adding another L2 distance type, which I think is unwise (and would not help in the case of cosine distance). Adding boolean flags to the pairwise distance API is another option, but it is going to be a mess.