Woptim/caliper integration #291
Conversation
… in dedicated directory
@jonesholger @slabasan @kab163 This is a starting point for what’s next. In particular, I’ll be attempting to set things up differently for another demo now that I have something running. I would welcome improvements to the hatchet post-processing; I use a simple diff computation kindly provided by @slabasan, but I’m sure RAJA will want something more meaningful.
This reverts commit e3fc7d7.
@adrienbernede the hatchet script is perfectly fine when comparing across the same variant, like your build_and_test currently does. For the next steps I imagine you'll want to bring in the CUDA and HIP variants for those particular architectures, with a switch in the script when it detects where it's running. Also, at some point I imagine the script defining a pass/fail criterion for the hatchet output, where you use the subtract operator on the two trees and fail if the difference exceeds some threshold. To test the functionality we could artificially slow down a RAJAPerf run with command line options that increase the problem size (e.g., `--size`) while keeping the baselines intact, and in the OpenMP case by limiting OMP_NUM_THREADS. Conversely, an artificial speedup is viable too by running a tiny size, but this may be harder to catch since the default is already quite small. Thanks for diving into this. I think everyone is extremely appreciative.
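A minimal sketch of the pass/fail idea described above, reduced to plain Python dicts rather than the actual hatchet GraphFrames (the kernel names and timings are made up for illustration): subtract the baseline per-kernel times from the candidate's and fail on any slowdown beyond a threshold.

```python
def check_regressions(baseline, candidate, threshold):
    """Return kernels whose slowdown (candidate - baseline) exceeds threshold."""
    failures = {}
    for kernel, base_time in baseline.items():
        diff = candidate[kernel] - base_time
        if diff > threshold:
            failures[kernel] = diff
    return failures

# Made-up timings in seconds; not real RAJAPerf output.
baseline  = {"DAXPY": 1.00, "REDUCE_SUM": 2.00}
candidate = {"DAXPY": 1.05, "REDUCE_SUM": 2.60}

print(sorted(check_regressions(baseline, candidate, threshold=0.25)))
# -> ['REDUCE_SUM']  (only REDUCE_SUM exceeds the 0.25 s threshold)
```

In the real script the subtraction would come from hatchet's tree subtract operator; the dict stands in for the resulting dataframe.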
Someone will have to remain cognizant of when the trees differ, like when a kernel is added or removed. In this case the baseline should be rerun.
@jonesholger I agree we will need to add HIP and CUDA, as well as a threshold mechanism. Regarding changes in the baseline, I would like to suggest rerunning the reference each time, always comparing to the most recent develop ancestor commit. This will not protect against changes coming from the branch itself, but it seems more robust to rerun the reference rather than risk that something may have changed, making the baseline obsolete (e.g., a change in the machine config). I will present the idea the next time we meet, as we need to discuss the desired design anyway.
@adrienbernede running the develop baseline every time seems fine, and in your script you do have a dataframe node count, where I would test that the counts are equal before running the subtract operator. I do have some routines in my back pocket that make comparing different trees more benign (they only compare nodes that don't differ). @slabasan is aware of these too.
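The node-count guard plus "only compare nodes that don't differ" idea might look like this (a simplified sketch on plain dicts, not the actual routines mentioned above):

```python
def safe_diff(baseline, candidate):
    """Diff only the kernels common to both trees, warning on count mismatch.

    `baseline`/`candidate` are hypothetical {kernel_name: time} dicts standing
    in for the hatchet dataframes.
    """
    if len(baseline) != len(candidate):
        print(f"warning: node counts differ ({len(baseline)} vs {len(candidate)})")
    common = baseline.keys() & candidate.keys()
    return {k: candidate[k] - baseline[k] for k in sorted(common)}

# A kernel was added in one tree and removed in the other; only DAXPY is diffed.
diffs = safe_diff({"DAXPY": 1.0, "IF_QUAD": 3.0},
                  {"DAXPY": 1.5, "MULADDSUB": 0.5})
```

When the counts do match, the full subtract can proceed as usual.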
@adrienbernede what's going on with some of the older compilers on gitlab/lassen (clang9 pops a lot)? Should I check it out? Develop has this issue too.
@jonesholger the older compilers are related to shared configurations being run/tested in multiple projects: RAJA, RAJAPerf, Umpire, etc. David B. and I plan to meet soon to update the shared specs.
Isn't running the baseline each time automatic if we always run the Base and RAJA variants for each programming model, i.e., the Base variant will be the baseline? The most important thing to monitor for this suite is that the difference between the RAJA variant and the Base variant for each kernel doesn't grow when the RAJA variant is slower than the Base variant. It may also be a good idea to track differences between the Base variants of each run and the Base variants of the previous run. This may give us some insight into compiler regressions. However, there is probably enough run-to-run variation that it may be hard to interpret.
@rhornung67 yeah, I think comparing to the Base variant is the eventual intended outcome, say RAJA_CUDA vs Base_CUDA, and they are matched one to one with kernel-tunings; otherwise I can add routines to the script to fix up the differing trees, which we can discuss. I think this initial version just checks across the same variant, where the baseline is defined to be the develop branch ancestor and the PR is the version under test. I haven't looked closely enough to see if the baseline is just one fixed machine/compiler combo. I forgot to mention the root nodes will be different when comparing different variants. I have a routine that fixes this too.
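To illustrate the root-node issue: when comparing across variants, each tree is rooted at its variant name, so a naive diff sees two disjoint trees. A simplified stand-in for the kind of fix-up routine mentioned (plain nested dicts, hypothetical variant names, not the actual hatchet code):

```python
def unify_root(tree, new_root="root"):
    """Rename the single root of a {root_name: children} tree so two variant
    trees (e.g. 'Base_CUDA' vs 'RAJA_CUDA' -- hypothetical labels) become
    comparable node for node."""
    (old_root, children), = tree.items()  # expect exactly one root node
    return {new_root: children}

base = unify_root({"Base_CUDA": {"DAXPY": 1.0}})
raja = unify_root({"RAJA_CUDA": {"DAXPY": 1.2}})
# Both trees now share the root name 'root' and can be diffed node-wise.
```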
@rhornung67 I also like your idea of tracking regressions.
@jonesholger I think we should be in good shape if we can compare RAJA_* to Base_* for each relevant programming model. I believe these relate one to one across the Suite, with a couple of exceptions, so their trees should match. Adding the ability to track regressions could be done in a separate PR when we figure out a good way to do that. How close is this to being able to do the first part and be merged?
@rhornung67 I'll start on a patch to the hatchet script as a PR against this one and verify all the trees. If there are exceptions, the default fix-up I have in place is to propagate the minimum time in a set of tunings for a particular kernel.
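A minimal sketch of that min-propagation fix-up, assuming (kernel, tuning) timing pairs rather than the real dataframe layout (the data below is invented):

```python
def min_over_tunings(times):
    """Collapse {(kernel, tuning): time} to {kernel: min time across tunings}.

    Stand-in for the fix-up described above; the data layout is an assumption.
    """
    best = {}
    for (kernel, _tuning), t in times.items():
        best[kernel] = min(t, best.get(kernel, float("inf")))
    return best

times = {("DAXPY", "block_128"): 1.4,
         ("DAXPY", "block_256"): 1.1,
         ("REDUCE_SUM", "default"): 2.3}
print(min_over_tunings(times))  # DAXPY keeps its fastest tuning
```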
@adrienbernede @rhornung67 maybe add a threshold as a third argument so we can check v1 - v2 < +/- threshold. For now just record it; I'm curious about the run-to-run variance too.
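The third argument could be wired in along these lines (a sketch using argparse; the option and file names are assumptions, not the script's actual interface):

```python
import argparse

# Hypothetical CLI for the comparison script: two .cali files plus a threshold.
parser = argparse.ArgumentParser(description="Compare two Caliper runs")
parser.add_argument("baseline", help="baseline .cali file")
parser.add_argument("candidate", help="candidate .cali file")
parser.add_argument("--threshold", type=float, default=0.0,
                    help="allowed absolute slowdown before failing")

# Simulated invocation; file names are placeholders.
args = parser.parse_args(["develop.cali", "pr.cali", "--threshold", "0.05"])
print(args.threshold)  # -> 0.05
```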
Posted #298
@jonesholger if you think comparing RAJA to Lambda is not meaningful, especially since both are compared to the baseline already, I’m fine with removing that comparison. Is that what you’re saying? I’m reordering the parameters; I agree the baseline should be... the baseline.
I think all that we want is to compare different variants to the designated baseline variant. We can infer other comparisons from those.
@jonesholger what do you think of the current state of the PR?
@adrienbernede I think once you swap the parameters so --baseline == Base_(suffix), we're good.
lol, I shouldn’t code late... One sec.
@adrienbernede this has to be extremely frustrating (timeout); maybe you could prebuild caliper on corona:
That’s the first time it has timed out. But I have everything ready to optimize it. In fact, it is already optimized on ruby.
@jonesholger I just triggered a pipeline. I had been working on using Spack chaining (upstreams) to speed up the dependencies installation, but this work was paused in favor of more urgent matters. With the Caliper integration, and the many dependencies it involves, I revamped the implementation, which was actually not so hard. Presently, the tests are failing... that was not expected. In order not to be delayed by this, we have a joker card to play: we haven’t configured an external python install in radiuss-spack-configs. We install python as a dependency of Caliper where we could just use one already installed on the machines. Do you know which LC python install we should use to get things to run smoothly?
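For reference, the usual Spack way to use a machine's existing Python instead of building one is an `externals` entry in `packages.yaml`; a sketch along these lines (the version and prefix below are placeholders, not actual LC paths):

```yaml
# packages.yaml -- use a preinstalled Python rather than building it as a
# Caliper dependency (spec and prefix are placeholders).
packages:
  python:
    externals:
    - spec: python@3.8.2
      prefix: /path/to/system/python-3.8.2
    buildable: false
```

`buildable: false` forces Spack to use the external install and error out rather than silently rebuilding Python.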
@adrienbernede I just spotted your comment. Yeah, not having to build Python is a win/win. But you also rely on an LC install of Hatchet, which I think is geared to Python 3.9, or was recently upgraded for 3.9; anyhow, other versions may trigger a cython recompile. You could also pip install llnl-hatchet against your favorite python in a venv, but I would steer clear of Python 3.11. Anyhow, I'll help look into it.
On lassen, and I think the other platforms are analogous, the python is:

```shell
module load python/3.8.2
spack load /{installed hash}
```

If you're intrepid you can install llnl-hatchet against Python 3.8.2 like so: … then `source test-env/bin/activate` in build_and_test for your CI.
BTW you can link against the GCC version of Caliper for the Clang installs of RAJAPerf.
I had already added python 3.8.2 when I saw your answer. But TOSS4 does not have it, so I used 3.10.8 there.
There is no Hatchet package in Spack. A while back one did exist, but it was very difficult to configure, with a dependency chain involving Matplotlib, with various backends, and linking against mesa by default, which was super fragile. Hatchet is easy enough to install from a git clone. I did have a small request in to the team to trigger a cython recompile when switching Pythons in their install script, which I should revisit; essentially just doing a "real clean" in the install every time: `python setup.py clean --all`. Your gapps install is in a good position for gitlab, and maybe add Python 3.7.
I salute your attention in also prebuilding elfutils. A Caliper build with those prebuilts is done in less than a minute. It's almost nonsense that we can't find elfutils-dev on these important systems.
Maybe another soapbox point: deployment installs should not allow "latest" non-versioned packages, including the Caliper variant we're playing with now. So we should update the installation scripts as soon as feasible. Even in pip land, a requirements package list should have package==some_version, else be rejected for deployment. caliper@2.9.0+ should be our target. Anyhow, you are way more expert at this, but let me know if you need me to look at anything.
Those are good remarks. I’ll prepare an issue to keep track of the changes we want in the near future.
A proof of concept of running Caliper and comparing the results to a baseline.
This is really basic.