add option to alm2cl to accumulate the cl sum at double precision #324
Currently, if an `alm` is single-precision, then `alm2cl` does all summation and accumulation at single precision. For reasonable data and lmax this should be OK, but it can begin to fail due to floating-point errors in edge cases (see the Slack thread here for more gory detail). I can reproduce the issue for a particular edge case, sketched below.
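A minimal, self-contained numpy sketch of the failure mode (not the exact script or data from the thread; the large leading term is just a stand-in for a running sum that has already grown big, as in a long per-ℓ sum over m):

```python
import numpy as np

# Hypothetical stand-in for the edge case: once a single-precision
# accumulator is large, subsequent small terms are silently dropped,
# here at the O(0.5%) level.
terms = np.concatenate([[1e8], np.ones(500_000)]).astype(np.float32)

acc = np.float32(0.0)
for t in terms:          # sequential accumulation, as with 1 thread
    acc += t             # stays float32 throughout; the 1.0 terms round away

exact = terms.astype(np.float64).sum()
print(abs(acc / exact - 1.0))  # ~5e-3: the 500k unit terms were lost
```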

In other words, for large lmax and using only 1 thread, the number of terms to be averaged can begin to accumulate O(0.5%)-level errors. In reality, lmax is much lower, and when we use N threads the number of terms in each sum is further reduced by a factor of N, but this is the kind of thing I would be paranoid about getting wrong if there are weird things in the data that are hard to mock up in a test.

A brute-force workaround is to cast the `alm` to double precision before doing `alm2cl`, but this has large computational disadvantages. Since the data is actually single-precision, it's unnecessary, and doing so:

- at `map2alm` time, will double the memory requirements of holding those maps in general
- at `alm2cl` time, can avoid the memory cost, but adds substantial time from the casting itself (this time can grow very large for a simulation ensemble, up to O(100) node-hours spent casting alone)

Fortunately, there is a near-zero-cost, simple workaround that eliminates this possibility: accumulate the terms at double precision (I could have also tried Kahan summation, but this is really simpler). In that case, the `alm` remains in single precision, while the math occurs at double precision, by changing the type of the accumulation buffer here. This relies on the type promotion to double happening in C, which is far faster than copying whole arrays in numpy.
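A loose pure-numpy analogue of that buffer-type change (the actual fix lives in the C accumulation loop, not here): `sum(dtype=np.float64)` promotes each float32 term on the fly inside the reduction, so no double-precision copy of the array is ever made.

```python
import numpy as np

terms = np.concatenate([[1e8], np.ones(500_000)]).astype(np.float32)

# Brute-force workaround: cast the whole array first (extra memory and time).
cast_sum = terms.astype(np.float64).sum()

# Buffer-type approach: keep the float32 data, but accumulate in a float64
# register; each term is promoted during the C reduction loop.
accum_sum = terms.sum(dtype=np.float64)

print(cast_sum == accum_sum)  # True: same result, with no float64 copy made
```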

In this implementation, one passes `double_accum=True` to `alm2cl`, and the floating-point error above is eliminated. The runtime penalty is between 5 and 10% per `alm2cl` call, which is negligible in most scripts, where other computation is far lengthier. For a typical single-precision `alm`, the change in the result from the current code to this one is, as expected, at the level of single precision; a rough sketch of such a check:
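This sketch assumes a healpy-style packed complex64 `alm`; the import path is purely hypothetical, and only the `double_accum` keyword comes from this PR.

```python
import numpy as np
from somelib import alm2cl  # hypothetical import path for the patched function

# Build a throwaway single-precision alm; any complex64 alm from the
# pipeline would do (healpy-style packing of size (lmax+1)(lmax+2)/2 assumed).
lmax = 2000
nalm = (lmax + 1) * (lmax + 2) // 2
rng = np.random.default_rng(0)
alm = (rng.standard_normal(nalm)
       + 1j * rng.standard_normal(nalm)).astype(np.complex64)

cl_single = alm2cl(alm)                     # single-precision accumulation
cl_double = alm2cl(alm, double_accum=True)  # float64 accumulation buffer

# For well-behaved data the two should agree to single precision; the
# double_accum path is the one protected against accumulation drift.
print(np.max(np.abs(cl_single / cl_double - 1)))
```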