
End-to-end CUDA container, remove peacock, bump python to 3.13 #28114

Merged
loganharbour merged 49 commits into idaholab:next from loganharbour:namjae_gpu
Jun 14, 2025

Conversation

@loganharbour
Member

@loganharbour loganharbour commented Jul 11, 2024

  • Removes the peacock conda package
  • Bumps default python to 3.13, also in apptainer
  • Updates python source for 3.13 compatibility
  • Bumps apptainer clang to 19
  • Bumps apptainer min gcc to 9
  • Supports a full-stack CUDA build (from MPI on)
  • Builds pytorch within all apptainer builds except min gcc (pytorch bump to 2.6)
  • Updates base images for all apptainer builds (latest rocky 8)
  • Updates apptainer openmpi to 5.0.7
  • Updates moose-language-server extension in moose-dev apptainer

Closes #29374 (adds full cuda build)
Closes #28161 (removes moose-peacock from moose-dev)
Closes #30382 (removes moose-peacock)
Closes #30586 (removes extra vtk build that comes from moose-peacock; confirmed by @hugary1995)

@loganharbour loganharbour force-pushed the namjae_gpu branch 2 times, most recently from d7505dc to 9a80bb6 on September 5, 2024 04:15
@loganharbour loganharbour force-pushed the namjae_gpu branch 2 times, most recently from f44e95f to 627db04 on October 12, 2024 00:16
@moosebuild
Contributor

moosebuild commented Oct 12, 2024

Job Documentation, step Docs: sync website on 552c9ab wanted to post the following:

View the site here

This comment will be updated on new commits.

@moosebuild
Contributor

moosebuild commented Oct 12, 2024

Job Coverage, step Generate coverage on 1c9fc96 wanted to post the following:

Framework coverage

          05730e    #28114 @ 1c9fc9
          Total     Total     +/-      New
Rate      85.54%    85.54%    +0.00%   0.00%
Hits      113043    113044    +1       0
Misses    19107     19106     -1       1

Diff coverage report

Full coverage report

Modules coverage

Contact

          05730e    #28114 @ 1c9fc9
          Total     Total     +/-   New
Rate      90.37%    90.37%    -     100.00%
Hits      4879      4879      -     1
Misses    520       520       -     0

Diff coverage report

Full coverage report

Porous flow

          05730e    #28114 @ 1c9fc9
          Total     Total     +/-   New
Rate      95.34%    95.34%    -     0.00%
Hits      11386     11386     -     0
Misses    556       556       -     5

Diff coverage report

Full coverage report

Solid mechanics

          05730e    #28114 @ 1c9fc9
          Total     Total     +/-   New
Rate      86.06%    86.06%    -     0.00%
Hits      29418     29418     -     0
Misses    4764      4764      -     9

Diff coverage report

Full coverage report

Full coverage reports

Reports

Warnings

  • framework new line coverage rate 0.00% is less than the suggested 90.0%
  • porous_flow new line coverage rate 0.00% is less than the suggested 90.0%
  • solid_mechanics new line coverage rate 0.00% is less than the suggested 90.0%

This comment will be updated on new commits.

@lindsayad
Member

I'm pretty interested in this. With our recent PETSc update, we should be much closer to being able to run a clean test harness with a GPU-aware MPI/PETSc.

@loganharbour
Member Author

Sounds good. I'll revive this today.

@loganharbour
Member Author

@lindsayad can you try

oras://mooseharbor.hpc.inl.gov/moose-dev/moose-dev-cuda-openmpi-x86_64:pr-28114

when you have a moment?

@lindsayad
Member

All tests are failing with this message

misc/check_error.missing_active_section_test: terminate called after throwing an instance of 'blas::Error'
misc/check_error.missing_active_section_test:   what():  system has unsupported display driver / cuda driver combination, in function get_device_count

@loganharbour
Member Author

All tests are failing with this message

misc/check_error.missing_active_section_test: terminate called after throwing an instance of 'blas::Error'
misc/check_error.missing_active_section_test:   what():  system has unsupported display driver / cuda driver combination, in function get_device_count

What about if you run with the --nv option?

We shouldn't need to bind mount in the nvidia drivers here, but I think we're missing a flag with blas (or unsetting a variable) so that it doesn't try to run GPU code.
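A rough sketch of the "unsetting a variable" idea suggested above, assuming the standard CUDA convention that an empty `CUDA_VISIBLE_DEVICES` hides all devices from enumeration (the binary name below is hypothetical, and this is not necessarily the fix the PR ended up with):

```python
import os
import subprocess

# Hide all GPUs from CUDA device enumeration before launching, so calls
# like blas get_device_count() see zero devices instead of failing on an
# unsupported display-driver/CUDA-driver combination.
env = dict(os.environ, CUDA_VISIBLE_DEVICES="")

# Illustrative launch (binary and input names are made up):
# subprocess.run(["./moose_test-opt", "-i", "input.i"], env=env, check=True)

print(env["CUDA_VISIBLE_DEVICES"] == "")  # True: no visible CUDA devices
```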

@loganharbour loganharbour force-pushed the namjae_gpu branch 4 times, most recently from 1e3e71a to 8ccd004 on January 30, 2025 17:02
@lindsayad
Member

is there a draft civet recipe for executing in this cuda container? Would be nice to see what progress we are making in CI

@loganharbour
Member Author

is there a draft civet recipe for executing in this cuda container? Would be nice to see what progress we are making in CI

https://civet.inl.gov/job/2675477/

This guy. It's just libtorch stuff at this point

@lindsayad
Member

I would expect to see a lot more failures than I do

@loganharbour
Member Author

I would expect to see a lot more failures than I do

What would you expect to see? This is just libtorch moose.

@lindsayad
Member

Oh I didn't understand that basically all tests are getting excluded due to cuda not in libtorch_devices. Can we get all those tests running?

loganharbour and others added 12 commits June 13, 2025 09:18
Co-authored-by: Casey Icenhour <cticenho@ncsu.edu>
Solvers held by MFEM user objects make calls to GetDevicePtr. Consequently,
we have to make sure that the memory manager, which is destroyed in
the Device destructor held by the MFEMExecutioner, is not destroyed
before these user object calls.
- Remove peacock
- Bump python to 3.13
- Apptainer clang bump to 19
- Apptainer min gcc bump to 9
- Full stack cuda build from MPI on
- Manual apptainer libtorch build
@moosebuild
Contributor

Job Precheck, step Versioner verify on 552c9ab wanted to post the following:

Versioner templates

Found 14 templates, 0 failed

Versioner influential files

Found 58 influential files, 20 changed, 2 added, 0 removed

package      status  file
tools        CHANGE  conda/tools/conda_build_config.yaml
tools        CHANGE  conda/tools/meta.yaml
mpi          CHANGE  apptainer/mpi.def
mpi          CHANGE  conda/mpi/meta.yaml
wasp         CHANGE  conda/wasp/meta.yaml
pprof        CHANGE  conda/pprof/meta.yaml
seacas       CHANGE  conda/seacas/meta.yaml
libmesh-vtk  CHANGE  conda/libmesh-vtk/conda_build_config.yaml
libmesh-vtk  CHANGE  conda/libmesh-vtk/meta.yaml
petsc        CHANGE  apptainer/petsc.def
petsc        CHANGE  conda/petsc/conda_build_config.yaml
petsc        CHANGE  conda/petsc/meta.yaml
libmesh      CHANGE  apptainer/libmesh.def
libmesh      CHANGE  conda/libmesh/conda_build_config.yaml
libmesh      CHANGE  conda/libmesh/meta.yaml
moose-dev    CHANGE  apptainer/files/moose-dev
moose-dev    CHANGE  apptainer/moose-dev.def
moose-dev    CHANGE  apptainer/remove_channels.def
moose-dev    CHANGE  conda/moose-dev/conda_build_config.yaml
moose-dev    CHANGE  conda/moose-dev/meta.yaml
moose-dev    NEW     scripts/update_and_rebuild_conduit.sh
moose-dev    NEW     scripts/update_and_rebuild_mfem.sh

Versioner versions

Found 9 packages, 9 changed, 0 failed

package      status  hash (old -> new)   version (old -> new)
tools        CHANGE  827cc6f -> 075abd0  2025.04.17 -> 2025.06.13
mpi          CHANGE  a9161d1 -> fcdc117  2025.04.17 -> 2025.06.13
wasp         CHANGE  43d14da -> 3e37a23  2025.05.13 build 0 -> 2025.05.13 build 1
pprof        CHANGE  f8fd1e3 -> 3d296db  2025.04.17 build 0 -> 2025.06.13
seacas       CHANGE  5d0701b -> 1278418  2025.02.27 build 1 -> 2025.05.22 build 0
libmesh-vtk  CHANGE  042bd0b -> 16265eb  9.4.2 build 2 -> 9.4.2 build 3
petsc        CHANGE  ff9a7ec -> 9531950  3.23.0.6.gd9d7fd11dca build 1 -> 3.23.0.6.gd9d7fd11dca build 2
libmesh      CHANGE  982ea19 -> 44d69f1  2025.05.23 build 0 -> 2025.05.23 build 1
moose-dev    CHANGE  851e6ad -> 6416dfc  2025.06.09 -> 2025.06.13

@loganharbour
Member Author

Modules parallel failure is unrelated.

Libtorch CUDA recipe failure is related, but this recipe will be removed.

@loganharbour loganharbour merged commit f0af667 into idaholab:next Jun 14, 2025
100 of 103 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in CONNECT Jun 14, 2025
@loganharbour loganharbour deleted the namjae_gpu branch June 14, 2025 18:16
milljm pushed a commit to milljm/moose that referenced this pull request Jun 16, 2025
Due to: idaholab#28114, bumping several
minimums.

Closes idaholab#30753
milljm pushed a commit to milljm/moose that referenced this pull request Jun 17, 2025
Due to: idaholab#28114, bump several
packages version strings.

Turns out when you use yaml.safe_load(file), it will perform type
detection and set that type, thus dropping 3.10 to 3.1. Therefore
convert all entries in package_config.yml to strings, as that's all
we need them to be for documentation.

Closes idaholab#30753
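The commit message above describes a YAML 1.1 gotcha: `yaml.safe_load` resolves an unquoted `3.10` scalar as a float. The lossy part is ordinary float conversion, reproducible with the standard library alone (PyYAML is not needed for this sketch):

```python
# An unquoted YAML scalar like 3.10 is resolved to a float by
# yaml.safe_load, so {"python": 3.10} comes back as {"python": 3.1}.
version = float("3.10")
assert version == 3.1          # the trailing zero is not representable
assert str(version) == "3.1"   # the "3.10" spelling is gone

# The fix in the commit: keep version entries as strings (quote them in
# the YAML, or coerce after loading), which preserves them verbatim.
assert "3.10" != str(version)
print(str(version))  # prints "3.1"
```

Quoting the value in the YAML source (`python: "3.10"`) has the same effect as the post-load string coercion the commit applies.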
LP1012 pushed a commit to LP1012/moose-devel that referenced this pull request Jun 27, 2025
Due to: idaholab#28114, bump several
packages version strings.

Turns out when you use yaml.safe_load(file), it will perform type
detection and set that type, thus dropping 3.10 to 3.1. Therefore
convert all entries in package_config.yml to strings, as that's all
we need them to be for documentation.

Closes idaholab#30753
drebbel1z pushed a commit to drebbel1z/moose that referenced this pull request Jun 30, 2025
Due to: idaholab#28114, bump several
packages version strings.

Turns out when you use yaml.safe_load(file), it will perform type
detection and set that type, thus dropping 3.10 to 3.1. Therefore
convert all entries in package_config.yml to strings, as that's all
we need them to be for documentation.

Closes idaholab#30753
LP1012 pushed a commit to LP1012/moose-devel that referenced this pull request Jul 31, 2025
Due to: idaholab#28114, bump several
packages version strings.

Turns out when you use yaml.safe_load(file), it will perform type
detection and set that type, thus dropping 3.10 to 3.1. Therefore
convert all entries in package_config.yml to strings, as that's all
we need them to be for documentation.

Closes idaholab#30753
lindsayad added a commit to lindsayad/moose that referenced this pull request Feb 27, 2026
I'm guessing this did not matter because we pointed at the cuda
dir and so torch figured out we wanted cuda anyways

Refs introduction of these cmake arguments in idaholab#28114

Labels

  • PR: Ready for review/merge
  • PR: Updates packages (pull requests that update versioner packages)

Projects

No open projects
Status: Done

6 participants