add patch to LLVM 20.1.5 to better support CUDA 13 and Blackwell GPUs #24040

Merged
ocaisa merged 1 commit into easybuilders:develop from Thyre:20250929090022_new_pr_LLVM2015
Oct 3, 2025

Thyre commented Sep 29, 2025

(created using eb --new-pr)

See also: #23940

Description from that PR:

With the introduction of Blackwell, NVIDIA changed its ELF format. LLVM was not prepared for these changes, so it could no longer verify the validity of offload images. As a result, any program with an offload image failed during execution if offloading was mandatory.

See llvm/llvm-project#148703 and the related LLVM Discourse thread.
Affected programs fail with an error message such as:

omptarget error: Consult https://openmp.llvm.org/design/Runtimes.html for debugging options.
omptarget error: No images found compatible with the installed hardware. 

The program will then crash, e.g. with the following stack trace:

Using host libthread_db library "/usr/lib/x86_64-linux-gnu/libthread_db.so.1".
omptarget error: Consult https://openmp.llvm.org/design/Runtimes.html for debugging options.
omptarget error: No images found compatible with the installed hardware. 
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff5286ccf in llvm::object::ELFObjectFileBase::getNVPTXCPUName() const () from /opt/EasyBuild/apps/software/LLVM/20.1.8-GCCcore-14.3.0/lib/libLLVM.so.20.1
(gdb) bt
#0  0x00007ffff5286ccf in llvm::object::ELFObjectFileBase::getNVPTXCPUName() const () from /opt/EasyBuild/apps/software/LLVM/20.1.8-GCCcore-14.3.0/lib/libLLVM.so.20.1
#1  0x00007ffff5286c53 in llvm::object::ELFObjectFileBase::tryGetCPUName() const () from /opt/EasyBuild/apps/software/LLVM/20.1.8-GCCcore-14.3.0/lib/libLLVM.so.20.1
#2  0x00007ffff7a9cca1 in handleTargetOutcome(bool, ident_t*) () from /opt/EasyBuild/apps/software/LLVM/20.1.8-GCCcore-14.3.0/lib/x86_64-unknown-linux-gnu/libomptarget.so.20.1
#3  0x00007ffff7a97f43 in checkDevice(long&, ident_t*) () from /opt/EasyBuild/apps/software/LLVM/20.1.8-GCCcore-14.3.0/lib/x86_64-unknown-linux-gnu/libomptarget.so.20.1
#4  0x00007ffff7a984e0 in void targetData<AsyncInfoTy>(ident_t*, long, int, void**, void**, long*, long*, void**, void**, int (*)(ident_t*, DeviceTy&, int, void**, void**, long*, long*, void**, void**, AsyncInfoTy&, bool), char const*, char const*) ()
   from /opt/EasyBuild/apps/software/LLVM/20.1.8-GCCcore-14.3.0/lib/x86_64-unknown-linux-gnu/libomptarget.so.20.1
#5  0x00007ffff7a980c4 in __tgt_target_data_begin_mapper () from /opt/EasyBuild/apps/software/LLVM/20.1.8-GCCcore-14.3.0/lib/x86_64-unknown-linux-gnu/libomptarget.so.20.1
#6  0x000055555555ae7f in main ()
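
To check whether a failing run hit this particular problem, one can look for the second `omptarget error` line in the captured output. The following is a minimal sketch, not part of LLVM or EasyBuild: the function name `hit_offload_image_error` and the sample log are illustrative assumptions, and only the message text is taken from the output quoted above.

```python
# Sketch: detect the failure signature quoted in this PR in captured stderr text.
# The helper name and the sample log are hypothetical; only the message text
# comes from the error output shown in the PR description.

SIGNATURE = "omptarget error: No images found compatible with the installed hardware."


def hit_offload_image_error(stderr_text: str) -> bool:
    """Return True if any line of the captured output contains the signature."""
    return any(SIGNATURE in line for line in stderr_text.splitlines())


sample_log = """\
omptarget error: Consult https://openmp.llvm.org/design/Runtimes.html for debugging options.
omptarget error: No images found compatible with the installed hardware.
Program received signal SIGSEGV, Segmentation fault.
"""

print(hit_offload_image_error(sample_log))
print(hit_offload_image_error("all tests passed"))
```

A match only indicates that the runtime rejected the offload image. It does not by itself confirm that the ELF changes described here are the cause.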

While this initially affected only Blackwell GPUs, CUDA 13.0 brought these ELF changes to prior generations as well.
When building offload code with CUDA 13 for any architecture, one sees the exact same error message. See also llvm/llvm-project#159088.

A fix was implemented for LLVM 22 in llvm/llvm-project#159354 and backported to LLVM 21 in llvm/llvm-project#159451. This PR brings the changes to LLVM 20.1.8 as well. The patch was slightly altered to handle differences between LLVM 20 and 21, but this should not affect the functionality itself. It might also apply cleanly to older LLVM versions, but this still needs to be verified.

@Thyre Thyre added 2023b 2024a issues & PRs related to 2024a common toolchains labels Sep 29, 2025

Thyre commented Sep 29, 2025

@boegelbot please test @ jsc-zen3-a100

@Thyre Thyre added bug fix and removed change labels Sep 29, 2025
@boegelbot

@Thyre: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=24040 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_24040 --ntasks=8 --partition=jsczen3g --gres=gpu:1 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 8085

Test results coming soon (I hope)...

- notification for comment with ID 3345291578 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

Micket left a comment

lgtm

@Micket Micket added this to the next release (5.2.0?) milestone Sep 29, 2025

Test report by @boegelbot
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
jsczen3g1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.6, x86_64, AMD EPYC-Milan Processor (zen3), 1 x NVIDIA NVIDIA A100 80GB PCIe, 580.82.07, Python 3.9.21
See https://gist.github.com/boegelbot/baa7585e284b1bfd54dbde0a383e579f for a full test report.

ocaisa commented Oct 3, 2025

Going in, thanks @Thyre
