This adds a check that our manual counts are correct and opens the door to more accurate counts in the future. Note that getting accurate counts after compiler optimization is difficult, but in many cases manual optimization of the source can avoid double counting operations that the compiler would optimize away. The accuracy of some counts also depends on how the counts are defined, which leads to differences between the counted and estimated bytes and iterations for some kernels. Normally the LoopBytes*/Rep counters should match the estimated bytes, ParallelIterations/Rep should match the estimated iterations, and fp64Ops/rep should match the estimated flops.
Some kernels do not yet support counting and report -1 to indicate that they were not counted. In some cases a kernel's helper file may not have worked cleanly with the wrappers. Some kernels use library functions such as MPI or std::sort that may not work reliably with the wrappers.
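To make the comparison rule concrete, here is a minimal sketch of how a counted per-rep value could be checked against an estimate while treating -1 as "not counted". The names `kNotCounted`, `CheckResult`, and `check_against_estimate` are hypothetical and are not taken from this PR.

```cpp
#include <cstdint>

// Hypothetical sentinel mirroring the -1 reported for kernels
// that do not support counting yet.
constexpr std::int64_t kNotCounted = -1;

enum class CheckResult { Match, Mismatch, NotCounted };

// Compare a counted per-rep value (e.g. bytes, iterations, or fp64 ops)
// against the manually estimated per-rep value.
inline CheckResult check_against_estimate(std::int64_t counted_per_rep,
                                          std::int64_t estimated_per_rep) {
  if (counted_per_rep == kNotCounted) {
    return CheckResult::NotCounted;  // nothing to compare against
  }
  return counted_per_rep == estimated_per_rep ? CheckResult::Match
                                              : CheckResult::Mismatch;
}
```

A mismatch is not necessarily a bug in either number, since it can also come from a difference in how the count is defined.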
There was an idea to capture the memory operations on a per-iterate basis. They could then be visualized to see where memory accesses occur and how the cache might be used. We could also look at reordering iterates (cache blocking) to see how that affects cache use, or at how memory accesses behave at the GPU warp, block, or XCD level. This information could be captured by, for example, appending to a vector of events to build a timeline of the run: when the iteration variable was incremented, when memory accesses were performed, when arithmetic operations occurred, and what the operands of each of those operations were.
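As a rough sketch of the event-vector idea, the following records iterate, load, and store events for a 3-point stencil and then counts the distinct cache lines touched. Everything here is an assumption made for illustration: the `Event` layout, the `Timeline` class, the helper names, and the 64-byte line size are hypothetical, and arithmetic events and operands are left out to keep it short.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

enum class EventKind { Iterate, Load, Store };

struct Event {
  EventKind kind;
  std::size_t iterate;     // iterate that was active when the event occurred
  std::uintptr_t address;  // 0 for Iterate events
  std::size_t bytes;       // 0 for Iterate events
};

// Hypothetical recorder: every instrumented action appends one event.
class Timeline {
public:
  void begin_iterate(std::size_t i) {
    current_ = i;
    events_.push_back({EventKind::Iterate, i, 0, 0});
  }
  template <typename T>
  T load(const T* p) {
    events_.push_back({EventKind::Load, current_,
                       reinterpret_cast<std::uintptr_t>(p), sizeof(T)});
    return *p;
  }
  template <typename T>
  void store(T* p, T v) {
    events_.push_back({EventKind::Store, current_,
                       reinterpret_cast<std::uintptr_t>(p), sizeof(T)});
    *p = v;
  }
  const std::vector<Event>& events() const { return events_; }

private:
  std::size_t current_ = 0;
  std::vector<Event> events_;
};

// Number of events of one kind in the timeline.
inline std::size_t count_kind(const Timeline& t, EventKind kind) {
  std::size_t n = 0;
  for (const Event& e : t.events()) {
    if (e.kind == kind) ++n;
  }
  return n;
}

// Distinct cache lines touched by loads and stores (line size is assumed).
inline std::size_t distinct_lines(const Timeline& t,
                                  std::size_t line_bytes = 64) {
  std::set<std::uintptr_t> lines;
  for (const Event& e : t.events()) {
    if (e.kind != EventKind::Iterate) lines.insert(e.address / line_bytes);
  }
  return lines.size();
}

// Record a 3-point stencil over the interior points 1 .. n-2.
inline Timeline record_stencil(std::size_t n) {
  std::vector<double> in(n, 1.0), out(n, 0.0);
  Timeline t;
  for (std::size_t i = 1; i + 1 < n; ++i) {
    t.begin_iterate(i);
    double sum = t.load(&in[i - 1]) + t.load(&in[i]) + t.load(&in[i + 1]);
    t.store(&out[i], sum);
  }
  return t;
}
```

Because each event carries its iterate index, the same timeline could be regrouped by warp, block, or a reordered iterate sequence after the fact. The memory cost grows with the number of events, so this is probably only practical for small problem sizes.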
Summary
Add more kernel attributes by automatically counting operations using wrapper types.
This could either replace our manual counts or act as a check on their accuracy. It can also give accurate counts in some cases where we have been estimating. In other cases we may not be able to use it, for example in Algorithm_SORT, where we use std::sort. It's important to note that getting accurate counts after compiler optimization is still difficult, but in most cases manual optimization can still get us good counts.
Note that this requires C++20 at the moment, but it could be back-ported to C++17 with SFINAE instead of concepts.
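For illustration, here is a minimal sketch of what a counting wrapper could look like on the C++17 route, with `std::enable_if_t` standing in for a concept. This is not the code in this PR: `OpCounts`, `g_counts`, `Counted`, `CountedPtr`, `daxpy`, and `count_daxpy` are hypothetical names, and only `+`, `*`, loads, and stores are counted.

```cpp
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical tallies; the real counters in the PR may differ.
struct OpCounts {
  std::size_t loads = 0, stores = 0, flops = 0;
  std::size_t bytes_read = 0, bytes_written = 0;
};
inline OpCounts g_counts;

// Value wrapper: arithmetic on it bumps the flop counter.
// With C++20 the constraint could be `template <std::floating_point T>`.
template <typename T,
          typename = std::enable_if_t<std::is_floating_point_v<T>>>
class Counted {
public:
  Counted(T v = T{}) : value_(v) {}
  T raw() const { return value_; }
  friend Counted operator+(Counted a, Counted b) {
    ++g_counts.flops;
    return Counted(a.value_ + b.value_);
  }
  friend Counted operator*(Counted a, Counted b) {
    ++g_counts.flops;
    return Counted(a.value_ * b.value_);
  }

private:
  T value_;
};

// Pointer wrapper: indexing returns a proxy that counts loads and stores.
template <typename T,
          typename = std::enable_if_t<std::is_floating_point_v<T>>>
class CountedPtr {
public:
  class Ref {
  public:
    explicit Ref(T* p) : p_(p) {}
    operator Counted<T>() const {  // reading through the proxy is a load
      ++g_counts.loads;
      g_counts.bytes_read += sizeof(T);
      return Counted<T>(*p_);
    }
    Ref& operator=(Counted<T> v) {  // assigning through the proxy is a store
      ++g_counts.stores;
      g_counts.bytes_written += sizeof(T);
      *p_ = v.raw();
      return *this;
    }

  private:
    T* p_;
  };
  explicit CountedPtr(T* p) : p_(p) {}
  Ref operator[](std::size_t i) const { return Ref(p_ + i); }

private:
  T* p_;
};

// Kernel body written once; works with raw pointers or the wrappers.
template <typename Ptr, typename Scalar>
void daxpy(Ptr y, Ptr x, Scalar a, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    y[i] = a * x[i] + y[i];
  }
}

// Run the kernel with wrappers and return the tallies.
inline OpCounts count_daxpy(std::size_t n) {
  std::vector<double> x(n, 1.0), y(n, 2.0);
  g_counts = OpCounts{};
  daxpy(CountedPtr<double>(y.data()), CountedPtr<double>(x.data()),
        Counted<double>(2.0), n);
  return g_counts;
}
```

Each iterate of this `daxpy` performs two loads, one store, and two flops, so `count_daxpy(n)` should report `2n` loads, `n` stores, and `2n` flops. A global counter like `g_counts` is only suitable for a sequential run; a parallel or GPU variant would need per-thread or atomic counters.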
At the moment I'm interested in whether people think this is a reasonable direction to take.
If so, is there anything I'm missing that I could be capturing with wrapper types and instrumentation?
As an example I used the counters in Apps_PRESSURE, Apps_VOL3D, and Polybench_JACOBI_2D. By examining these counters and comparing them to the manual "Estimate" counters, I discovered opportunities to optimize redundant loads in the PRESSURE and VOL3D kernels and found a copy-paste error in VOL3D.
Below are the normal attributes followed by the counted attributes. After that is a breakdown of each kernel with counters for each section of the kernel. By using enough macros it was possible to capture the code of the entire kernel and print it out.