Allow for compiler+accelerator specific MPI overrides #231

Open
ocaisa wants to merge 6 commits into EESSI:main from ocaisa:additional_rpath_fallbacks

@ocaisa

@ocaisa ocaisa commented May 14, 2026

Member

Alternative to #230 where we focus only on the potential need for CUDA/ROCm variants.

This also opens the door to other types of variants (but the options here would be multiplicative so I haven't included that until we hit a need for it).

@ocaisa

ocaisa commented May 14, 2026

Member Author

Example of the output

# Set things up
ocaisa@~/EESSI/software-layer-scripts(additional_rpath_fallbacks)$ export EESSI_ACCELERATOR_TARGET_OVERRIDE=accel/nvidia/cc86
ocaisa@~/EESSI/software-layer-scripts(additional_rpath_fallbacks)$ module load EESSI/2025.06
Module for EESSI/2025.06 loaded successfully
{EESSI/2025.06} ocaisa@~/EESSI/software-layer-scripts(additional_rpath_fallbacks)$ echo $MODULEPATH
/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/neoverse_n1/accel/nvidia/cc80/modules/all:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/neoverse_n1/modules/all:/cvmfs/software.eessi.io/versions/2025.06/software/linux/aarch64/neoverse_n1/accel/nvidia/cc80/modules/all:/cvmfs/software.eessi.io/versions/2025.06/software/linux/aarch64/neoverse_n1/modules/all:/cvmfs/software.eessi.io/init/modules
{EESSI/2025.06} ocaisa@~/EESSI/software-layer-scripts(additional_rpath_fallbacks)$ module load EESSI-extend
-- Using /tmp/$USER as a temporary working directory for installations, you can override this by setting the environment variable WORKING_DIR and reloading the module (e.g., /dev/shm is a common option)
Configuring for use of EESSI_USER_INSTALL under /home/ocaisa/eessi
-- To create installations for EESSI, you _must_ have write permissions to /home/ocaisa/eessi/versions/2025.06/software/linux/aarch64/neoverse_n1
-- You may wish to configure a sources directory for EasyBuild (for example, via setting the environment variable EASYBUILD_SOURCEPATH) to allow you to reuse existing sources for packages.

# Pretend to want to do a build
{EESSI/2025.06} ocaisa@~/EESSI/software-layer-scripts(additional_rpath_fallbacks)$ eb OSU-Micro-Benchmarks-7.5.1-gompi-2025b-CUDA-12.9.1.eb --stop prepare --rebuild --hooks=./eb_hooks.py
== Temporary log file in case of crash /tmp/eb-uflhewm6/easybuild-0ha7tv9j.log
== found valid index for /cvmfs/software.eessi.io/versions/2025.06/software/linux/aarch64/neoverse_n1/software/EasyBuild/5.3.0/easybuild/easyconfigs, so using it...
== Running parse hook for OSU-Micro-Benchmarks-7.5.1-gompi-2025b-CUDA-12.9.1.eb...
== found valid index for /cvmfs/software.eessi.io/versions/2025.06/software/linux/aarch64/neoverse_n1/software/EasyBuild/5.3.0/easybuild/easyconfigs, so using it...
== Running parse hook for gompi-2025b.eb...
...
== Running parse hook for lfbf-2025b.eb...
== processing EasyBuild easyconfig
/cvmfs/software.eessi.io/versions/2025.06/software/linux/aarch64/neoverse_n1/software/EasyBuild/5.3.0/easybuild/easyconfigs/o/OSU-Micro-Benchmarks/OSU-Micro-Benchmarks-7.5.1-gompi-2025b-CUDA-12.9.1.eb
== building and installing OSU-Micro-Benchmarks/7.5.1-gompi-2025b-CUDA-12.9.1...
  >> installation prefix: /home/ocaisa/eessi/versions/2025.06/software/linux/aarch64/neoverse_n1/software/OSU-Micro-Benchmarks/7.5.1-gompi-2025b-CUDA-12.9.1
== fetching files and verifying checksums...
== Running pre-fetch hook...
  >> sources:
  >> /tmp/ocaisa/easybuild/sources/o/OSU-Micro-Benchmarks/osu-micro-benchmarks-7.5.1.tar.gz [SHA256: 160d0d5e3c3cb022520ecb247e9875bb0973b1d3cadccd6c17624f8407c52e22]
== ... (took < 1 sec)
== creating build dir, resetting environment...
  >> build dir: /tmp/ocaisa/easybuild/build/OSUMicroBenchmarks/7.5.1/gompi-2025b-CUDA-12.9.1
== Running post-ready hook...

WARNING: Deprecated functionality, will no longer work in EasyBuild v6.0: Easyconfig parameter 'parallel' is deprecated, use 'max_parallel' or the parallel property instead.; see
https://docs.easybuild.io/deprecated-functionality/ for more information

== ... (took < 1 sec)
== unpacking...
  >> running shell command:
        tar xzf /tmp/ocaisa/easybuild/sources/o/OSU-Micro-Benchmarks/osu-micro-benchmarks-7.5.1.tar.gz
        [started at: 2026-05-14 16:03:37]
        [working dir: /tmp/ocaisa/easybuild/build/OSUMicroBenchmarks/7.5.1/gompi-2025b-CUDA-12.9.1]
        [output and state saved to /tmp/eb-uflhewm6/run-shell-cmd-output/tar-gfx7xw93]
  >> command completed: exit 0, ran in < 1s
== ... (took < 1 sec)
== patching...
== ... (took < 1 sec)
== preparing...
== Running pre-prepare hook...
== Updated rpath_override_dirs (to allow overriding MPI family OpenMPI):
/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/neoverse_n1/rpath_overrides/OpenMPI/system-CUDA-12.9.1/lib:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/neoverse_n1/rpath_overrides/OpenMPI/system-CUDA-12.9.1/lib64:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/neoverse_n1/rpath_overrides/OpenMPI/system/lib:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/neoverse_n1/rpath_overrides/OpenMPI/system/lib64
  >> loading toolchain module: gompi/2025b
== ... (took < 1 sec)
...

@ocaisa ocaisa changed the title from "Allow for accelerator-specific MPI overrides" to "Allow for compiler+accelerator specific MPI overrides" May 14, 2026
@ocaisa

ocaisa commented May 14, 2026

Member Author

Increased the complexity a bit but it might be necessary:

== Updated rpath_override_dirs (to allow overriding MPI family OpenMPI):
/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/neoverse_n1/rpath_overrides/OpenMPI/system-GCC-CUDA-12.9.1/lib:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/neoverse_n1/rpath_overrides/OpenMPI/system-GCC-CUDA-12.9.1/lib64:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/neoverse_n1/rpath_overrides/OpenMPI/system-GCC/lib:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/neoverse_n1/rpath_overrides/OpenMPI/system-GCC/lib64:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/neoverse_n1/rpath_overrides/OpenMPI/system/lib:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/neoverse_n1/rpath_overrides/OpenMPI/system/lib64
  >> loading toolchain module: gompi/2025b
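
The ordering in the output above (most specific directory first, falling back to the plain `system` directory) can be sketched roughly as follows. This is only an illustration of the layout seen in the log, not the code in `eb_hooks.py`: the helper name `build_rpath_override_dirs` and its parameters are hypothetical.

```python
import os


def build_rpath_override_dirs(base, mpi_family, compiler_family=None, gpu_dep=None):
    """Return candidate override directories, most specific first.

    Hypothetical helper: `base` is the rpath_overrides root and `gpu_dep`
    is a (name, version) tuple such as ("CUDA", "12.9.1").
    """
    stubs = []
    if compiler_family and gpu_dep:
        # Compiler + accelerator specific override
        stubs.append(f"system-{compiler_family}-{gpu_dep[0]}-{gpu_dep[1]}")
    if compiler_family:
        # Compiler specific override
        stubs.append(f"system-{compiler_family}")
    # Generic fallback, always present
    stubs.append("system")

    dirs = []
    for stub in stubs:
        for libdir in ("lib", "lib64"):
            dirs.append(os.path.join(base, mpi_family, stub, libdir))
    return dirs


base = ("/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/"
        "aarch64/neoverse_n1/rpath_overrides")
dirs = build_rpath_override_dirs(base, "OpenMPI", "GCC", ("CUDA", "12.9.1"))
print(":".join(dirs))
```

Because the list is consumed in order, a library found under the `system-GCC-CUDA-12.9.1` directory would be preferred over one under `system-GCC` or `system`.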

Comment thread eb_hooks.py
@laraPPr

laraPPr commented Jun 9, 2026

Collaborator

@ocaisa #230 is dependent on this one and also 2026. Can you let us know if you're still looking at this or if you need input?

@ocaisa

ocaisa commented Jun 9, 2026

Member Author

No feedback to date, I'm waiting on someone to review it

@laraPPr

laraPPr commented Jun 9, 2026

Collaborator

@TopRichard can you review it?

@casparvl

Contributor

Discussed in support meeting: @TopRichard said he already tested this for CUDA and that it works. We agreed he'll add a review here, including the steps taken by him to test it. I can then try to mimic that for ROCm and validate that it also works there.

@TopRichard

Collaborator

I have tested this locally by integrating the changes introduced in this PR into test_eb_hooks.py and running EasyBuild with the --hooks=test_eb_hooks.py option. Below is a sample result for CUDA-enabled software, from readelf -d /cluster/installations/eessi/default/eessi_local/aarch64-2025.06/software/PyTorch/2.9.1-foss-2025b-CUDA-12.9.1/lib/python3.13/site-packages/torch/lib/libtorch.so | less:

Dynamic section at offset 0x233b0 contains 25 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [/cvmfs/software.eessi.io/versions/2025.06/software/linux/aarch64/nvidia/grace/accel/nvidia/cc90/software/CUDA/12.9.1/lib64/libnvrtc.so.12]
 0x0000000000000001 (NEEDED)             Shared library: [libtorch_cpu.so]
 0x0000000000000001 (NEEDED)             Shared library: [libtorch_cuda.so]
 0x000000000000000e (SONAME)             Library soname: [libtorch.so]
 0x000000000000000f (RPATH)              Library rpath: [/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/nvidia/grace/rpath_overrides/OpenMPI/system-GCC-CUDA-12.9.1/lib:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/nvidia/grace/rpath_overrides/OpenMPI/system-GCC-CUDA-12.9.1/lib64:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/nvidia/grace/rpath_overrides/OpenMPI/system-GCC/lib:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/nvidia/grace/rpath_overrides/OpenMPI/system-GCC/lib64:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/nvidia/grace/rpath_overrides/OpenMPI/system/lib:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/nvidia/grace/rpath_overrides/OpenMPI/system/lib64:/cluster/installations/eessi/default/eessi_local/aarch64-2025.06/software/PyTorch/2.9.1-foss-2025b-CUDA-12.9.1/lib:/cluster/installations/eessi/default/eessi_local/aarch64-2025.06/software/PyTorch/2.9.1-foss-2025b-CUDA-12.9.1/lib64:$ORIGIN:$ORIGIN/../lib:$ORIGIN/../lib64:/cvmfs/software.eessi.io/versions/2025.06/software/linux/aarch64/nvidia/grace/software/ScaLAPACK/2.2.2-gompi-2025b-fb/lib64...
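
A check like this could also be scripted rather than read by eye. The sketch below parses `readelf -d` text and verifies that the override directories precede all other RPATH entries. The helper names are hypothetical, and the sample string is a shortened, made-up stand-in for the real output above.

```python
import re


def rpath_entries(readelf_output):
    """Extract the colon-separated RPATH/RUNPATH entries from `readelf -d` text."""
    match = re.search(
        r"\((?:RPATH|RUNPATH)\)\s+Library (?:rpath|runpath): \[(.*?)\]",
        readelf_output,
    )
    return match.group(1).split(":") if match else []


def overrides_come_first(entries, marker="/rpath_overrides/"):
    """True if override dirs exist and none of them follows a regular entry."""
    flags = [marker in entry for entry in entries]
    if not any(flags):
        return False
    first_regular = flags.index(False) if False in flags else len(flags)
    return not any(flags[first_regular:])


# Shortened, illustrative sample (not the full output shown above)
sample = (
    " 0x000000000000000f (RPATH)              Library rpath: "
    "[/host_injections/rpath_overrides/OpenMPI/system-GCC-CUDA-12.9.1/lib:"
    "/host_injections/rpath_overrides/OpenMPI/system/lib:"
    "/software/PyTorch/lib:$ORIGIN]"
)
entries = rpath_entries(sample)
print(overrides_come_first(entries))
```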

@casparvl

casparvl commented Jun 24, 2026

Contributor

For clarity: this is currently blocked by #228 for the ROCm side of things.

Comment thread eb_hooks.py Outdated
if dep[0] in top_level_accelerator_packages:
    # Store the dependency as a property for later potential use
    # (e.g., accelerator-specific MPI RPATH overrides)
    ec.eessi_gpu_dependency = dep

Contributor

Just as a reminder: this will need to be done for the ROCm side of things after #228 gets merged as well (you'll need to merge main into this feature branch, resolve any potential conflicts because they both touch this same part of the code, then add some ec.eessi_gpu_dependency = ... to the ROCm side of things).
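
As a rough illustration of that bookkeeping once both accelerator families are handled, the sketch below records the first accelerator dependency on an easyconfig-like object. It is not the code in `eb_hooks.py`: the contents of `TOP_LEVEL_ACCELERATOR_PACKAGES`, the helper name, and the `SimpleNamespace` stand-in for the easyconfig are all assumptions.

```python
from types import SimpleNamespace

# Assumption: illustrative names only; the real list is defined in eb_hooks.py.
TOP_LEVEL_ACCELERATOR_PACKAGES = ("CUDA", "ROCM")


def record_gpu_dependency(ec, dependencies):
    """Attach the first accelerator dependency found to the easyconfig-like object."""
    ec.eessi_gpu_dependency = None
    for dep in dependencies:
        if dep[0] in TOP_LEVEL_ACCELERATOR_PACKAGES:
            # Store the dependency for later use (e.g., MPI RPATH overrides)
            ec.eessi_gpu_dependency = dep
            break
    return ec


# Hypothetical dependency list of (name, version) tuples
ec = record_gpu_dependency(SimpleNamespace(), [("SomeLib", "1.0"), ("CUDA", "12.9.1")])
print(ec.eessi_gpu_dependency)
```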

@zerefwayne

Contributor

#228 is merged.

@casparvl casparvl left a comment

Contributor

I'm getting:

== Updated rpath_override_dirs (to allow overriding MPI family OpenMPI):
/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/x86_64/amd/zen2/rpath_overrides/OpenMPI/system-ROCm-ROCM-6.4.1/lib:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/x86_64/amd/zen2/rpath_overrides/OpenMPI/system-ROCm-ROCM-6.4.1/lib64:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/x86_64/amd/zen2/rpath_overrides/OpenMPI/system-ROCm/lib:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/x86_64/amd/zen2/rpath_overrides/OpenMPI/system-ROCm/lib64:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/x86_64/amd/zen2/rpath_overrides/OpenMPI/system/lib:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/x86_64/amd/zen2/rpath_overrides/OpenMPI/system/lib64

I have the feeling that the system-ROCm-ROCM-6.4.1 should really be system-GCC-ROCM-6.4.1.

Other than that, it seems that the most specific path (i.e. including the ROCm version) comes first, which is good.

For CUDA, I do see:

/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/x86_64/amd/zen2/rpath_overrides/OpenMPI/system-GCC-CUDA-12.9.1/lib:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/x86_64/amd/zen2/rpath_overrides/OpenMPI/system-GCC-CUDA-12.9.1/lib64:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/x86_64/amd/zen2/rpath_overrides/OpenMPI/system-GCC/lib:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/x86_64/amd/zen2/rpath_overrides/OpenMPI/system-GCC/lib64:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/x86_64/amd/zen2/rpath_overrides/OpenMPI/system/lib:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/x86_64/amd/zen2/rpath_overrides/OpenMPI/system/lib64

Which has the system-GCC-CUDA-<cudaver> as expected.

@casparvl

Contributor

Ah, maybe this does make sense, because rompi is a ROCm-based toolchain, whereas the CUDA OSU build uses a GCC-based toolchain. Since you construct the suffix as:

gpu_stub = f"{self.toolchain.COMPILER_FAMILY}-{self.cfg.eessi_gpu_dependency[0]}-{self.cfg.eessi_gpu_dependency[1]}"

self.toolchain.COMPILER_FAMILY probably resolves to ROCm for rompi?
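
If that is the case, the two observed suffixes fall out of the same format string. The sketch below uses a hypothetical `FakeToolchain` stand-in rather than a real EasyBuild toolchain object, and the `COMPILER_FAMILY` values and dependency tuples are inferred from the two logs above, not read from the EasyBuild source.

```python
class FakeToolchain:
    """Hypothetical stand-in for an EasyBuild toolchain object."""

    def __init__(self, compiler_family):
        self.COMPILER_FAMILY = compiler_family


def gpu_stub(toolchain, gpu_dependency):
    """Same format string as in the PR, applied to a (name, version) tuple."""
    return f"{toolchain.COMPILER_FAMILY}-{gpu_dependency[0]}-{gpu_dependency[1]}"


# Assumed values, inferred from the logged directory names
cuda_stub = gpu_stub(FakeToolchain("GCC"), ("CUDA", "12.9.1"))
rocm_stub = gpu_stub(FakeToolchain("ROCm"), ("ROCM", "6.4.1"))
print(f"system-{cuda_stub}")
print(f"system-{rocm_stub}")
```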

@casparvl

Contributor

I guess what I'm wondering about is... what should this string refer to?

I guess system-GCC-CUDA-<cudaver> basically means: any MPI lib provided in this directory has been built with the system GCC, and a (system-provided) CUDA with version cudaver?

But what does system-ROCm-ROCm-<rocmver> then mean? The 2nd part is likely "built with a ROCm with version <rocmver>", but the first... it's almost certainly not built with a system-ROCm. For compatibility, it may need to be compiled with a system-LLVM (that's ROCm-enabled?)? I'm really not sure what the compatibility requirement is here...

@ocaisa

ocaisa commented Jul 1, 2026

Member Author

> I guess system-GCC-CUDA-<cudaver> basically means: any MPI lib provided in this directory has been built with the system GCC, and a (system-provided) CUDA with version cudaver?

Not quite: it's not necessarily the system GCC, it's just meant to mean "provided by the system" (I don't have any idea what version of GCC compiled the library, I just know it was compiled by GCC and other programs compiled with GCC will happily talk to it). The CUDA part is also there since it must be linked against a specific CUDA version. In reality, it's another case where we allow for anything but recommend a specific workflow:

  • Use the same EESSI version (keep the same compat layer)
  • Update the OpenMPI recipe for latest toolchain in that EESSI version (no crazy compiler version jumps)
  • Rebuild OpenMPI more-or-less how we do it, but with the secret sauce for your system
  • Symlink the location in the RPATH to the $EBROOTOPENMPI of the new installation
  • Do that CPU only, and also for each CUDA version supported in the EESSI version
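
The symlink step in that workflow might look roughly like the sketch below. It assumes the link is made at the level of the `system-...` directory so that its `lib` and `lib64` subdirectories resolve into the new installation. The helper name `link_override` is hypothetical, and the demonstration runs in a scratch directory rather than under host_injections.

```python
import os
import tempfile


def link_override(override_root, stub, ebroot_openmpi):
    """Point <override_root>/OpenMPI/<stub> at an OpenMPI installation prefix."""
    target_dir = os.path.join(override_root, "OpenMPI")
    os.makedirs(target_dir, exist_ok=True)
    link = os.path.join(target_dir, stub)
    if os.path.islink(link):
        os.remove(link)  # replace a stale link from an earlier rebuild
    os.symlink(ebroot_openmpi, link)
    return link


# Demonstration in a scratch directory; `ebroot` plays the role of $EBROOTOPENMPI.
with tempfile.TemporaryDirectory() as scratch:
    ebroot = os.path.join(scratch, "software", "OpenMPI", "custom")
    os.makedirs(os.path.join(ebroot, "lib"))
    link = link_override(
        os.path.join(scratch, "rpath_overrides"), "system-GCC-CUDA-12.9.1", ebroot
    )
    print(os.path.isdir(os.path.join(link, "lib")))
```

Repeating the call with the stubs `system-GCC` and `system` would cover the less specific fallbacks.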

> But what does system-ROCm-ROCm-<rocmver> then mean? The 2nd part is likely "built with a ROCm with version <rocmver>", but the first... it's almost certainly not built with a system-ROCm. For compatibility, it may need to be compiled with a system-LLVM (that's ROCm-enabled?)? I'm really not sure what the compatibility requirement is here...

I thought it might appear as something else, but the same logic still applies. The ROCm version is the blocker; how you add the OpenMPI for that ROCm version is up to you, but I know that I would do it in a very specific way. The reality is that in this case the compiler version and the ROCm version are linked. So you would need to create an updated OpenMPI recipe for each ROCm version (which means each ROCm-LLVM toolchain) in the EESSI version. What's there right now supports that.

@casparvl

casparvl commented Jul 1, 2026

Contributor

But shouldn't this then say something like system-llvm-rocm-<rocmversion> instead, to indicate it'll talk to anything compiled with a (rocm-enabled) llvm, instead of system-rocm-rocm-<rocmversion>?

@ocaisa

ocaisa commented Jul 1, 2026

Member Author

That's what I was expecting to see, but EB is the one that decides via https://github.com/easybuilders/easybuild-framework/blob/4f19fd82f690c081a4e4f2484e2c2e0adab287f6/easybuild/toolchains/compiler/rocm_compilers.py#L49 (and that is fine with me)
