
Fix ptscotch test #32394

Draft

lindsayad wants to merge 18 commits into idaholab:next from lindsayad:fix-ptscotch-test

Conversation

@lindsayad
Member

We need to obey the PETSc documentation which requests that the adjacency list be sorted (per row). @roystgnr I checked to see whether we call MatCreateMPIAdj in libMesh and I see that we do in one place. AFAICT we also don't do any sorting there, so that could be a landmine waiting to happen

@lindsayad
Member Author

Hmm, this actually seems like it's going to have to involve libMesh, because I'm getting errors out of recent ptscotch if I attempt to sort the graph after build_graph. I'm guessing this is due to constructing things like _local_id_to_elem with the unsorted graph?

@lindsayad
Member Author

> Hmm this actually seems like it's going to have involve libMesh because I'm getting errors out of recent ptscotch if I attempt to sort the graph after build_graph. I'm guessing this is due to constructing things like _local_id_to_elem with the unsorted graph?

Hmm, no, I don't see anything obvious suggesting that sorting the _dual_graph rows after build_graph would break any other data structures

@moosebuild
Contributor

moosebuild commented Feb 25, 2026

Job Documentation, step Docs: sync website on c31db84 wanted to post the following:

View the site here

This comment will be updated on new commits.

@lindsayad
Member Author

@ChengHauYang your equal_value_boundary_constraint.error_multiple_primary_nodes test does not work with --distributed-mesh run in serial, which is kind of fascinating. Can you figure out what's going on there?

@ChengHauYang
Collaborator

ChengHauYang commented Mar 1, 2026

Thanks for catching the bugs, @lindsayad. In break_mesh_with_evbc.i, the test hard-codes secondary_node_ids = '1 2 3 4 5 6'. After BMBB duplicates the interface node, the duplicates get different IDs on a distributed mesh than on a replicated mesh. As a result, under distributed serial, that hard-coded node list no longer contains both coincident nodes at the requested primary_node_coord, so EqualValueBoundaryConstraint (EVBC) only sees one valid candidate and does not trigger the expected "Multiple nodes found" error.
In short: the failure is caused by mesh-dependent node numbering in the test setup, not by the EVBC or BMBB logic itself.
The short fix is to avoid using BMBB in the input file for this particular test, which would require hard-coding secondary_node_ids. I have tested this on multiple processor counts (including serial) on my local machine. I hope it can work. The fixing patch is in 945a65b.

Thanks for your help!!

@moosebuild
Contributor

moosebuild commented Mar 1, 2026

Job Coverage, step Generate coverage on c31db84 wanted to post the following:

Framework coverage

         801dc5   #32394 c31db8
         Total    Total    +/-      New
Rate     85.78%   85.78%   +0.00%   100.00%
Hits     128547   128554   +7       30
Misses   21316    21317    +1       0

Diff coverage report

Full coverage report

Modules coverage

Thermal hydraulics

         801dc5   #32394 c31db8
         Total    Total    +/-      New
Rate     88.88%   88.87%   -0.01%   -
Hits     15430    15429    -1       0
Misses   1931     1932     +1       0

Diff coverage report

Full coverage report

Full coverage reports

Reports

This comment will be updated on new commits.

@ChengHauYang
Collaborator

Hi @lindsayad,

I found a simpler fix than the previous fix patch: 51a4d1d.

The fix is to list all node IDs inside secondary_node_ids. I actually did this at the beginning, but I accidentally incremented each node ID by 1.

Thanks for your help! I hope this fixes the issue.

@lindsayad
Member Author

> The fixing patch is in 945a65b.

I just fetched from your repository and that commit is not visible to me. Do you have a branch tip that includes that commit?

@ChengHauYang
Collaborator

Sorry, @lindsayad. I think maybe you can go with this patch: 51a4d1d. The branch is "fix_evbc_simple".

Thanks for your help!

@lindsayad
Member Author

Thank you for being so responsive! It's awesome having you as a MOOSE contributor

@ChengHauYang
Collaborator

You are very welcome, @lindsayad! Thanks for your kind words!

@lindsayad
Member Author

Griffin patch at https://github.inl.gov/ncrc/griffin/pull/3000

@moosebuild
Contributor

Job Test, step Results summary on c31db84 wanted to post the following:

Framework test summary

Compared against 801dc58 in job civet.inl.gov/job/3617958.

Removed tests

Test Time (s) Memory (MB)
partitioners/petsc_partitioner.ptscotch_weight_elment 0.82 0.00

Added tests

Test Time (s) Memory (MB)
partitioners/petsc_partitioner.ptscotch_weight_element 0.85 0.00

Modules test summary

Compared against 801dc58 in job civet.inl.gov/job/3617958.

No change

@moosebuild
Contributor

Job Coverage, step Verify coverage on c31db84 wanted to post the following:

The following coverage requirement(s) failed:

  • Failed to generate richards coverage rate (required: 93.0%)

@lindsayad
Member Author

@loganharbour it seems like there are some CI problems with the sweeps for processor counts >= 11. Maybe resource issues? I do get failures when I run on my system locally, but they are definitely not the same

Failed Tests:
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
outputs/exodus.hdf5 .............................................................................................................................................................. FAILED (EXIT CODE 77 != 0)
vectorpostprocessors/work_balance.work_balance/replicated ..................................................................................................................... [max_cpus=2] FAILED (CSVDIFF)
multiapps/sub_cycling.group/test ................................................................................................................................................ FAILED (EXIT CODE 134 != 0)
multisystem/restore_multiapp.nl_sol ............................................................................................................................................... FAILED (EXIT CODE 1 != 0)
multisystem/picard/linearfv_nonlinearfv.tightly_coupled ........................................................................................................................... FAILED (EXIT CODE 1 != 0)
transfers/multiapp_copy_transfer/linear_sys_to_aux.test ........................................................................................................................... FAILED (EXIT CODE 1 != 0)
transfers/multiapp_variable_value_sample_transfer.array_sample_test/input_positions ............................................................................................... FAILED (EXIT CODE 1 != 0)
multiapps/linearfv_nonlinearfv.linearfv/linearfv_mainapp .......................................................................................................................... FAILED (EXIT CODE 1 != 0)
multiapps/linearfv_nonlinearfv.linearfv/nonlinearfv_mainapp ....................................................................................................................... FAILED (EXIT CODE 1 != 0)
variables/linearfv.basic_aux ...................................................................................................................................................... FAILED (EXIT CODE 1 != 0)
restart/kernel_restartable.thread_error/with_threads .................................................................................................. [min_threads=4,max_cpus=1] FAILED (EXIT CODE -6 != 0)
multiapps/steffensen_postprocessor.pp_transient/app_begin_transfers_begin_steffensen_sub ........................................................................................ FAILED (EXIT CODE 143 != 0)
meshgenerators/coarsen_block_generator.coarsen_hex/multiple_levels ......................................................................................................................... FAILED (TIMEOUT)
transfers/multiapp_variable_value_sample_transfer.block_restricted_primary/multiapp ............................................................................................... FAILED (EXIT CODE 1 != 0)
tag.controls-tagging ............................................................................................................................................................ FAILED (EXIT CODE 134 != 0)
tag.linear-fv ..................................................................................................................................................................... FAILED (EXIT CODE 1 != 0)
transfers/multiapp_variable_value_sample_transfer.block_restricted_primary/multiapp_and_var ....................................................................................... FAILED (EXIT CODE 1 != 0)
ics/random_ic_test.test_threaded ....................................................................................................................... [FINISHED,min_threads=2] FAILED (EXIT CODE 139 != 0)

@loganharbour
Member

> @loganharbour it seems like there are some CI problems with the sweeps for processor counts >= 11. Maybe resource issues? I do get failures when I run on my system locally, but they are definitely not the same

Signal 15 is SIGTERM (exit code 143 = 128 + 15), i.e. a kill. I bet they were killed due to the host being OOM...

@lindsayad
Member Author

In this case I can't blame my navier-stokes test 😆

@lindsayad
Member Author

Do you think that, once all the memory enforcement policies you're working on trickle through, we shouldn't run into recipe failures like this?

@loganharbour
Member

Eventually. There's a lot more to go... valgrind and apps. And then you have crap like this:

20 Heaviest Jobs (memory/slot):
--------------------------------------------------------------------------------------------------------------
[172.8s] [4634MB]       OK htgr/mhtgr/3D_mesh.3D_MHTGR_mesh [FINISHED]
[1014s ] [4143MB]       OK sfr/subchannel/multiple_SCM_assemblies/19assemblies.SCM19 [FINISHED,min_cpus=10]
[52.08s] [4027MB]       OK sfr/subchannel/multiple_SCM_assemblies/7assemblies.Master_app_syntax SYNTAX PASS
[128.8s] [3737MB]       OK htgr/mhtgr/3D_mesh.3D_MHTGR_syntax SYNTAX PASS [FINISHED]
[267.3s] [2664MB]       OK htgr/httf/inputs.core [FINISHED,min_cpus=16]
[467.3s] [2005MB]       OK sfr/subchannel/multiple_SCM_assemblies/7assemblies.SCM7 [min_cpus=7,FINISHED]
[196.3s] [1993MB]       OK msr/msre/pipe_cardinal.solid_mechanics [FINISHED]
[90.62s] [1640MB]       OK htgr/httf/inputs.core_syntax SYNTAX PASS
[44.42s] [1376MB]       OK research_reactors/agn.3D_AGN-201_mesh

@lindsayad
Member Author

that's vtb right?
