
End-to-end CUDA container, remove peacock, bump python to 3.13 #28114

Merged
loganharbour merged 49 commits into idaholab:next from loganharbour:namjae_gpu
Jun 14, 2025

Conversation

@loganharbour
Member

@loganharbour loganharbour commented Jul 11, 2024

  • Removes the peacock conda package
  • Bumps default python to 3.13, also in apptainer
  • Updates python source for 3.13 compatibility
  • Bumps apptainer clang to 19
  • Bumps apptainer min gcc to 9
  • Supports a full-stack CUDA build (from MPI on)
  • Builds pytorch within all apptainer builds except min gcc (pytorch bump to 2.6)
  • Updates base images for all apptainer builds (latest rocky 8)
  • Updates apptainer openmpi to 5.0.7
  • Updates moose-language-server extension in moose-dev apptainer

Closes #29374 (adds full cuda build)
Closes #28161 (removes moose-peacock from moose-dev)
Closes #30382 (removes moose-peacock)
Closes #30586 (removes extra vtk build that comes from moose-peacock; confirmed by @hugary1995)

@loganharbour loganharbour force-pushed the namjae_gpu branch 2 times, most recently from d7505dc to 9a80bb6 on September 5, 2024 04:15
@loganharbour loganharbour force-pushed the namjae_gpu branch 2 times, most recently from f44e95f to 627db04 on October 12, 2024 00:16
@moosebuild
Contributor

moosebuild commented Oct 12, 2024

Job Documentation, step Docs: sync website on 552c9ab wanted to post the following:

View the site here

This comment will be updated on new commits.

@moosebuild
Contributor

moosebuild commented Oct 12, 2024

Job Coverage, step Generate coverage on 1c9fc96 wanted to post the following:

Framework coverage

          05730e    #28114 @ 1c9fc9
          Total     Total     +/-      New
Rate      85.54%    85.54%    +0.00%   0.00%
Hits      113043    113044    +1       0
Misses    19107     19106     -1       1

Diff coverage report

Full coverage report

Modules coverage

Contact

          05730e    #28114 @ 1c9fc9
          Total     Total     +/-   New
Rate      90.37%    90.37%    -     100.00%
Hits      4879      4879      -     1
Misses    520       520       -     0

Diff coverage report

Full coverage report

Porous flow

          05730e    #28114 @ 1c9fc9
          Total     Total     +/-   New
Rate      95.34%    95.34%    -     0.00%
Hits      11386     11386     -     0
Misses    556       556       -     5

Diff coverage report

Full coverage report

Solid mechanics

          05730e    #28114 @ 1c9fc9
          Total     Total     +/-   New
Rate      86.06%    86.06%    -     0.00%
Hits      29418     29418     -     0
Misses    4764      4764      -     9

Diff coverage report

Full coverage report

Full coverage reports

Reports

Warnings

  • framework new line coverage rate 0.00% is less than the suggested 90.0%
  • porous_flow new line coverage rate 0.00% is less than the suggested 90.0%
  • solid_mechanics new line coverage rate 0.00% is less than the suggested 90.0%

This comment will be updated on new commits.

@lindsayad
Member

I'm pretty interested in this. With our recent PETSc update, we should be much closer to being able to run a clean test harness with a GPU-aware MPI/PETSc.

@loganharbour
Member Author

Sounds good. I'll revive this today.

@loganharbour
Member Author

@lindsayad can you try

oras://mooseharbor.hpc.inl.gov/moose-dev/moose-dev-cuda-openmpi-x86_64:pr-28114

when you have a moment?

@lindsayad
Member

All tests are failing with this message

misc/check_error.missing_active_section_test: terminate called after throwing an instance of 'blas::Error'
misc/check_error.missing_active_section_test:   what():  system has unsupported display driver / cuda driver combination, in function get_device_count

@loganharbour
Member Author

All tests are failing with this message

misc/check_error.missing_active_section_test: terminate called after throwing an instance of 'blas::Error'
misc/check_error.missing_active_section_test:   what():  system has unsupported display driver / cuda driver combination, in function get_device_count

What about if you run with the --nv option?

We shouldn't need to bind mount in the nvidia drivers here, but I think we're missing a flag with blas (or unsetting a variable) so that it doesn't try to run GPU code.
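A rough sketch of the "unsetting a variable" idea suggested above, assuming the standard CUDA convention that an empty `CUDA_VISIBLE_DEVICES` hides all devices from enumeration (the binary name below is hypothetical, and this is not necessarily the fix the PR ended up with):

```python
import os
import subprocess

# Hide all GPUs from CUDA device enumeration before launching, so calls
# like blas get_device_count() see zero devices instead of failing on an
# unsupported display-driver/CUDA-driver combination.
env = dict(os.environ, CUDA_VISIBLE_DEVICES="")

# Illustrative launch (binary and input names are made up):
# subprocess.run(["./moose_test-opt", "-i", "input.i"], env=env, check=True)

print(env["CUDA_VISIBLE_DEVICES"] == "")  # True: no visible CUDA devices
```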

@loganharbour loganharbour force-pushed the namjae_gpu branch 4 times, most recently from 1e3e71a to 8ccd004 on January 30, 2025 17:02
@lindsayad
Member

is there a draft civet recipe for executing in this cuda container? Would be nice to see what progress we are making in CI

@loganharbour
Member Author

is there a draft civet recipe for executing in this cuda container? Would be nice to see what progress we are making in CI

https://civet.inl.gov/job/2675477/

This guy. It's just libtorch stuff at this point

@lindsayad
Member

I would expect to see a lot more failures than I do

@loganharbour
Member Author

I would expect to see a lot more failures than I do

What would you expect to see? This is just libtorch moose.

@lindsayad
Member

Oh I didn't understand that basically all tests are getting excluded due to cuda not in libtorch_devices. Can we get all those tests running?

loganharbour and others added 12 commits June 13, 2025 09:18
Co-authored-by: Casey Icenhour <cticenho@ncsu.edu>
Solvers held by MFEM user objects make calls to GetDevicePtr. Consequently,
we have to make sure that the memory manager, which is destroyed in
the Device destructor held by the MFEMExecutioner, is not destroyed
before these user object calls.
- Remove peacock
- Bump python to 3.13
- Apptainer clang bump to 19
- Apptainer min gcc bump to 9
- Full stack cuda build from MPI on
- Manual apptainer libtorch build
@moosebuild
Contributor

Job Precheck, step Versioner verify on 552c9ab wanted to post the following:

Versioner templates

Found 14 templates, 0 failed

Versioner influential files

Found 58 influential files, 20 changed, 2 added, 0 removed

package      status  file
tools        CHANGE  conda/tools/conda_build_config.yaml
tools        CHANGE  conda/tools/meta.yaml
mpi          CHANGE  apptainer/mpi.def
mpi          CHANGE  conda/mpi/meta.yaml
wasp         CHANGE  conda/wasp/meta.yaml
pprof        CHANGE  conda/pprof/meta.yaml
seacas       CHANGE  conda/seacas/meta.yaml
libmesh-vtk  CHANGE  conda/libmesh-vtk/conda_build_config.yaml
libmesh-vtk  CHANGE  conda/libmesh-vtk/meta.yaml
petsc        CHANGE  apptainer/petsc.def
petsc        CHANGE  conda/petsc/conda_build_config.yaml
petsc        CHANGE  conda/petsc/meta.yaml
libmesh      CHANGE  apptainer/libmesh.def
libmesh      CHANGE  conda/libmesh/conda_build_config.yaml
libmesh      CHANGE  conda/libmesh/meta.yaml
moose-dev    CHANGE  apptainer/files/moose-dev
moose-dev    CHANGE  apptainer/moose-dev.def
moose-dev    CHANGE  apptainer/remove_channels.def
moose-dev    CHANGE  conda/moose-dev/conda_build_config.yaml
moose-dev    CHANGE  conda/moose-dev/meta.yaml
moose-dev    NEW     scripts/update_and_rebuild_conduit.sh
moose-dev    NEW     scripts/update_and_rebuild_mfem.sh

Versioner versions

Found 9 packages, 9 changed, 0 failed

package      status  hash (old -> new)   version (old -> new)
tools        CHANGE  827cc6f -> 075abd0  2025.04.17 -> 2025.06.13
mpi          CHANGE  a9161d1 -> fcdc117  2025.04.17 -> 2025.06.13
wasp         CHANGE  43d14da -> 3e37a23  2025.05.13 build 0 -> 2025.05.13 build 1
pprof        CHANGE  f8fd1e3 -> 3d296db  2025.04.17 build 0 -> 2025.06.13
seacas       CHANGE  5d0701b -> 1278418  2025.02.27 build 1 -> 2025.05.22 build 0
libmesh-vtk  CHANGE  042bd0b -> 16265eb  9.4.2 build 2 -> 9.4.2 build 3
petsc        CHANGE  ff9a7ec -> 9531950  3.23.0.6.gd9d7fd11dca build 1 -> 3.23.0.6.gd9d7fd11dca build 2
libmesh      CHANGE  982ea19 -> 44d69f1  2025.05.23 build 0 -> 2025.05.23 build 1
moose-dev    CHANGE  851e6ad -> 6416dfc  2025.06.09 -> 2025.06.13

@loganharbour
Member Author

Modules parallel failure is unrelated.

Libtorch CUDA recipe failure is related, but this recipe will be removed.

@loganharbour loganharbour merged commit f0af667 into idaholab:next Jun 14, 2025
100 of 103 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in CONNECT Jun 14, 2025
@loganharbour loganharbour deleted the namjae_gpu branch June 14, 2025 18:16
milljm pushed a commit to milljm/moose that referenced this pull request Jun 16, 2025
Due to: idaholab#28114, bumping several
minimums.

Closes idaholab#30753
milljm pushed a commit to milljm/moose that referenced this pull request Jun 17, 2025
Due to: idaholab#28114, bump several
packages version strings.

Turns out when you use yaml.safe_load(file), it will perform type
detection and set that type, thus dropping 3.10 to 3.1. Therefore
convert all entries in package_config.yml to strings, as that's all
we need them to be for documentation.

Closes idaholab#30753
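The commit message above describes a YAML 1.1 gotcha: `yaml.safe_load` resolves an unquoted `3.10` scalar as a float. The lossy part is ordinary float conversion, reproducible with the standard library alone (PyYAML is not needed for this sketch):

```python
# An unquoted YAML scalar like 3.10 is resolved to a float by
# yaml.safe_load, so {"python": 3.10} comes back as {"python": 3.1}.
version = float("3.10")
assert version == 3.1          # the trailing zero is not representable
assert str(version) == "3.1"   # the "3.10" spelling is gone

# The fix in the commit: keep version entries as strings (quote them in
# the YAML, or coerce after loading), which preserves them verbatim.
assert "3.10" != str(version)
print(str(version))  # prints "3.1"
```

Quoting the value in the YAML source (`python: "3.10"`) has the same effect as the post-load string coercion the commit applies.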
LP1012 pushed a commit to LP1012/moose-devel that referenced this pull request Jun 27, 2025
Due to: idaholab#28114, bump several
packages version strings.

Turns out when you use yaml.safe_load(file), it will perform type
detection and set that type, thus dropping 3.10 to 3.1. Therefore
convert all entries in package_config.yml to strings, as that's all
we need them to be for documentation.

Closes idaholab#30753
drebbel1z pushed a commit to drebbel1z/moose that referenced this pull request Jun 30, 2025
Due to: idaholab#28114, bump several
packages version strings.

Turns out when you use yaml.safe_load(file), it will perform type
detection and set that type, thus dropping 3.10 to 3.1. Therefore
convert all entries in package_config.yml to strings, as that's all
we need them to be for documentation.

Closes idaholab#30753
LP1012 pushed a commit to LP1012/moose-devel that referenced this pull request Jul 31, 2025
Due to: idaholab#28114, bump several
packages version strings.

Turns out when you use yaml.safe_load(file), it will perform type
detection and set that type, thus dropping 3.10 to 3.1. Therefore
convert all entries in package_config.yml to strings, as that's all
we need them to be for documentation.

Closes idaholab#30753
lindsayad added a commit to lindsayad/moose that referenced this pull request Feb 27, 2026
I'm guessing this did not matter because we pointed at the cuda
dir and so torch figured out we wanted cuda anyways

Refs introduction of these cmake arguments in idaholab#28114

Labels

  • PR: Ready for review/merge
  • PR: Updates packages (pull requests that update versioner packages)

Projects

No open projects
Status: Done

6 participants