Optimizes CKKS square by avoiding unnecessary allocation / copy.
On ICX with clang-12, running sealbench CKKS EvaluateSquare for 1000 iterations yields:
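| N | HEXL | Time before (us) | Time after (us) | Speedup |
|-------|------|------|------|-------|
| 1024 | OFF | 12.3 | 11.7 | 1.05x |
| 1024 | ON | 2.39 | 1.68 | 1.42x |
| 2048 | OFF | 24.4 | 22.5 | 1.08x |
| 2048 | ON | 8.68 | 6.52 | 1.33x |
| 4096 | OFF | 107 | 89.3 | 1.22x |
| 4096 | ON | 38.4 | 31.1 | 1.23x |
| 8192 | OFF | 429 | 375 | 1.14x |
| 8192 | ON | 183 | 106 | 1.75x |
| 16384 | OFF | 1878 | 1520 | 1.23x |
| 16384 | ON | 774 | 401 | 1.93x |
| 32768 | OFF | 6972 | 5798 | 1.20x |
| 32768 | ON | 3205 | 1990 | 1.61x |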
I didn't see significant additional speedup with a tiling approach similar to #346.
In case you'd like to try it out, I've pasted the code for a tiled version of this implementation below.
    // Prepare destination
    encrypted.resize(context_, context_data.parms_id(), dest_size);

    // Set up iterators for input ciphertext
    // auto encrypted_iter = iter(encrypted);
    size_t tile_size = min<size_t>(coeff_count, size_t(1024));
    size_t num_tiles = coeff_count / tile_size;
    #ifdef SEAL_DEBUG
    if (coeff_count % tile_size != 0)
    {
        throw invalid_argument("tile_size does not divide coeff_count");
    }
    #endif

    // Set up iterators for input ciphertexts
    PolyIter encrypted_iter = iter(encrypted);

    // Semantic misuse of RNSIter; each is really pointing to the data for each RNS factor in sequence
    RNSIter encrypted1_0_iter(*encrypted_iter[0], tile_size);
    RNSIter encrypted1_1_iter(*encrypted_iter[1], tile_size);
    RNSIter encrypted1_2_iter(*encrypted_iter[2], tile_size);

    // Computes the output tile_size coefficients at a time
    // Given input tuple of polynomials x = (x[0], x[1], x[2]), computes
    // x = (x[0] * x[0], 2 * x[0] * x[1] , x[1] * x[1])
    // with appropriate modular reduction
    SEAL_ITERATE(coeff_modulus, coeff_modulus_size, [&](auto I) {
        SEAL_ITERATE(iter(size_t(0)), num_tiles, [&](auto J) {
            // Compute third output polynomial, overwriting input
            // x[2] = x[1] * x[1]
            dyadic_product_coeffmod(encrypted1_1_iter[0], encrypted1_1_iter[0], tile_size, I, encrypted1_2_iter[0]);

            // Compute second output polynomial, overwriting input
            // x[1] = x[1] * x[0]
            dyadic_product_coeffmod(encrypted1_1_iter[0], encrypted1_0_iter[0], tile_size, I, encrypted1_1_iter[0]);
            // x[1] += x[1]
            add_poly_coeffmod(encrypted1_1_iter[0], encrypted1_1_iter[0], tile_size, I, encrypted1_1_iter[0]);

            // Compute first output polynomial, overwriting input
            // x[0] = x[0] * x[0]
            dyadic_product_coeffmod(encrypted1_0_iter[0], encrypted1_0_iter[0], tile_size, I, encrypted1_0_iter[0]);

            // Manually increment iterators
            ++encrypted1_0_iter;
            ++encrypted1_1_iter;
            ++encrypted1_2_iter;
        });
    });
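The order of the three updates in the snippet above is what lets the square run in place without temporaries: x[2] is written first while x[1] is still intact, x[1] is overwritten next while x[0] is still intact, and x[0] is squared last. A minimal standalone sketch of that ordering follows. It does not use SEAL; `dyadic_product_mod`, `add_mod`, and `square_in_place` are hypothetical helpers written for illustration, operating on a single modulus `q` that is assumed to be below 2^32 so that products of reduced values fit in 64 bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a SEAL function): out[i] = a[i] * b[i] mod q.
// Assumes a[i], b[i] < q < 2^32, so the product fits in 64 bits.
// out may alias a and/or b: each index is fully read before it is written.
inline void dyadic_product_mod(
    const std::uint64_t *a, const std::uint64_t *b, std::size_t n, std::uint64_t q, std::uint64_t *out)
{
    for (std::size_t i = 0; i < n; ++i)
    {
        out[i] = (a[i] * b[i]) % q;
    }
}

// Hypothetical helper (not a SEAL function): out[i] = a[i] + b[i] mod q.
// Assumes a[i], b[i] < q, so one conditional subtraction suffices.
inline void add_mod(
    const std::uint64_t *a, const std::uint64_t *b, std::size_t n, std::uint64_t q, std::uint64_t *out)
{
    for (std::size_t i = 0; i < n; ++i)
    {
        std::uint64_t sum = a[i] + b[i];
        out[i] = (sum >= q) ? sum - q : sum;
    }
}

// Overwrites (x0, x1) and fills x2 so that on return
// (x0, x1, x2) = (x0 * x0, 2 * x0 * x1, x1 * x1), element-wise mod q.
inline void square_in_place(
    std::vector<std::uint64_t> &x0, std::vector<std::uint64_t> &x1, std::vector<std::uint64_t> &x2,
    std::uint64_t q)
{
    assert(x0.size() == x1.size());
    const std::size_t n = x0.size();
    x2.resize(n);

    // x2 = x1 * x1 (must happen before x1 is overwritten)
    dyadic_product_mod(x1.data(), x1.data(), n, q, x2.data());
    // x1 = x1 * x0 (must happen before x0 is overwritten)
    dyadic_product_mod(x1.data(), x0.data(), n, q, x1.data());
    // x1 = 2 * x0 * x1
    add_mod(x1.data(), x1.data(), n, q, x1.data());
    // x0 = x0 * x0
    dyadic_product_mod(x0.data(), x0.data(), n, q, x0.data());
}
```

Swapping the first and last steps would square x[0] before it is used in the cross term and give a wrong x[1], which is why the sketch keeps the same order as the snippet.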
Testing with clang-10:
This PR is much faster than the original SEAL; the tiled version is not (much) faster than this PR.
Testing with gcc-9:
The tiled version is faster than this PR, which in turn is much faster than the original SEAL, except for N = 1024.
For N = 1024, the original SEAL built with gcc-9 is as fast as with clang-10, taking 22 us, while both new versions built with gcc-9 take 60+ us, which is slower than their N = 2048 case. I disassembled Evaluator::ckks_square in the original SEAL and in this PR, and they are almost identical. The issue is likely caused by poor/weird performance of SEAL_ITERATE with GCC, which is called in dyadic_product.
I'll merge this PR. I'll look into fixing the gcc, N = 1024 case in a future release; I don't think it affects too many users, and the speedup we get from this PR is more valuable. Thank you so much for this!