With the introduction of Blackwell, NVIDIA changed their ELF format. Because LLVM was not prepared for these changes, it could no longer verify the validity of offload images. As a result, any program with an offload image fails at run time if offloading is mandatory, for example with an error message like:
omptarget error: Consult https://openmp.llvm.org/design/Runtimes.html for debugging options.
omptarget error: No images found compatible with the installed hardware.
The program will then crash, e.g. with the following stack trace:
Using host libthread_db library "/usr/lib/x86_64-linux-gnu/libthread_db.so.1".
omptarget error: Consult https://openmp.llvm.org/design/Runtimes.html for debugging options.
omptarget error: No images found compatible with the installed hardware.
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff5286ccf in llvm::object::ELFObjectFileBase::getNVPTXCPUName() const () from /opt/EasyBuild/apps/software/LLVM/20.1.8-GCCcore-14.3.0/lib/libLLVM.so.20.1
(gdb) bt
#0 0x00007ffff5286ccf in llvm::object::ELFObjectFileBase::getNVPTXCPUName() const () from /opt/EasyBuild/apps/software/LLVM/20.1.8-GCCcore-14.3.0/lib/libLLVM.so.20.1
#1 0x00007ffff5286c53 in llvm::object::ELFObjectFileBase::tryGetCPUName() const () from /opt/EasyBuild/apps/software/LLVM/20.1.8-GCCcore-14.3.0/lib/libLLVM.so.20.1
#2 0x00007ffff7a9cca1 in handleTargetOutcome(bool, ident_t*) () from /opt/EasyBuild/apps/software/LLVM/20.1.8-GCCcore-14.3.0/lib/x86_64-unknown-linux-gnu/libomptarget.so.20.1
#3 0x00007ffff7a97f43 in checkDevice(long&, ident_t*) () from /opt/EasyBuild/apps/software/LLVM/20.1.8-GCCcore-14.3.0/lib/x86_64-unknown-linux-gnu/libomptarget.so.20.1
#4 0x00007ffff7a984e0 in void targetData<AsyncInfoTy>(ident_t*, long, int, void**, void**, long*, long*, void**, void**, int (*)(ident_t*, DeviceTy&, int, void**, void**, long*, long*, void**, void**, AsyncInfoTy&, bool), char const*, char const*) ()
from /opt/EasyBuild/apps/software/LLVM/20.1.8-GCCcore-14.3.0/lib/x86_64-unknown-linux-gnu/libomptarget.so.20.1
#5 0x00007ffff7a980c4 in __tgt_target_data_begin_mapper () from /opt/EasyBuild/apps/software/LLVM/20.1.8-GCCcore-14.3.0/lib/x86_64-unknown-linux-gnu/libomptarget.so.20.1
#6 0x000055555555ae7f in main ()
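To check whether a given installation is affected, one option is to run an offloading binary with mandatory offloading and look for the error message shown above. The following is only a sketch: the binary name `./offload_test` is hypothetical, and it assumes the runtime honours the standard OpenMP environment variable `OMP_TARGET_OFFLOAD`.

```python
import os
import subprocess

# Error line copied from the output shown above.
NO_IMAGE_ERROR = (
    "omptarget error: No images found compatible with the installed hardware."
)


def shows_incompatible_image_error(stderr_text: str) -> bool:
    """Return True if the runtime reported that no offload image was usable."""
    return NO_IMAGE_ERROR in stderr_text


def run_with_mandatory_offload(command: list[str]) -> subprocess.CompletedProcess:
    """Run a command with OMP_TARGET_OFFLOAD=MANDATORY and capture its output."""
    env = dict(os.environ, OMP_TARGET_OFFLOAD="MANDATORY")
    return subprocess.run(
        command, env=env, capture_output=True, text=True, check=False
    )


if __name__ == "__main__":
    # "./offload_test" is a placeholder for any binary built with offloading.
    result = run_with_mandatory_offload(["./offload_test"])
    if shows_incompatible_image_error(result.stderr):
        print("affected: offload image rejected by the runtime")
    # On POSIX, a negative return code means the process died from a signal.
    print("return code:", result.returncode)
```

Matching on the exact error line keeps the check independent of whether the process later crashes or exits normally.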
While this initially affected only Blackwell GPUs, CUDA 13.0 brought these ELF changes to prior generations as well.
When offload code is built with CUDA 13 for any architecture, the exact same error message appears. See also llvm/llvm-project#159088.
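To compare device images built with different CUDA versions, it can help to dump the generic ELF header fields of each image. The sketch below reads only fields defined by the generic ELF specification (OS ABI, ABI version, machine, flags). It does not interpret the NVIDIA-specific bits, since the exact layout change is not described here. It assumes the device image has already been extracted to a file. The function name `read_elf_header` is illustrative.

```python
import struct
import sys

EM_CUDA = 190  # e_machine value registered for NVIDIA CUDA


def read_elf_header(data: bytes) -> dict:
    """Return a few generic ELF header fields without interpreting them."""
    if len(data) < 16 or data[:4] != b"\x7fELF":
        raise ValueError("not an ELF image")
    ei_class, ei_data = data[4], data[5]
    if ei_class not in (1, 2) or ei_data not in (1, 2):
        raise ValueError("unsupported ELF class or data encoding")
    endian = "<" if ei_data == 1 else ">"
    # Fields: e_type, e_machine, e_version, e_entry, e_phoff, e_shoff, e_flags
    layout = "HHIIIII" if ei_class == 1 else "HHIQQQI"
    fmt = endian + layout
    if len(data) < 16 + struct.calcsize(fmt):
        raise ValueError("truncated ELF header")
    _type, machine, _version, _entry, _phoff, _shoff, flags = struct.unpack_from(
        fmt, data, 16
    )
    return {
        "osabi": data[7],        # EI_OSABI
        "abiversion": data[8],   # EI_ABIVERSION
        "machine": machine,
        "is_cuda": machine == EM_CUDA,
        "flags": flags,          # vendor-specific, left uninterpreted
    }


if __name__ == "__main__":
    # Usage: python elf_header.py image_a.cubin image_b.cubin
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            header = read_elf_header(handle.read(64))
        print(path, header, hex(header["flags"]))
```

Printing the same fields for an image built with an older toolkit and one built with CUDA 13 shows which header values differ between the two.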
A fix was implemented for LLVM 22 in llvm/llvm-project#159354 and backported to LLVM 21 in llvm/llvm-project#159451. This PR brings the changes to LLVM 20.1.8 as well. The patch was slightly altered to handle differences between LLVM 20 and 21, but this should not affect the functionality itself. The patches might also apply cleanly to older LLVM versions, but this still needs to be verified.
Test report by @boegelbot SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
jsczen3g1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.6, x86_64, AMD EPYC-Milan Processor (zen3), 1 x NVIDIA NVIDIA A100 80GB PCIe, 580.82.07, Python 3.9.21
See https://gist.github.com/boegelbot/baa7585e284b1bfd54dbde0a383e579f for a full test report.
(created using eb --new-pr)
See also: #23940
The description from that PR is given above. The problem can also be seen in llvm/llvm-project#148703 and the related LLVM discourse.