Description
I'm creating this issue directly from the text of an email sent on 2025-01-10 from @ZacharyWills titled "Update on SCHISM":
We (Jason, Mykel, etc.) have successfully compiled the NWMv3.0 and SCHISM models on the cloud sandbox using the standard suite of Intel compilers (intel-oneapi-compilers/2023.1.0-gcc-11.2.1-3a7dxu3; intel-oneapi-mpi/2021.9.0-intel-2021.9.0-egjrbfg; netcdf-c/4.9.2-intel-2021.9.0-vznmeik; netcdf-fortran/4.6.1-intel-2021.9.0-meeveoj; parallelio/2.6.2-intel-2021.9.0-csz55zr).

Given the MPI parallelization and the high-resolution domain configurations of these models, running comparable setups on the NOAA RDHPCS supercomputers (the Hera cluster) has required nodes with roughly 3-10 GB of RAM per CPU. On the cloud sandbox, we tested a small SCHISM coastal domain (700,000 elements): it executed successfully only on the x2idn.32xlarge node type (16 GB/CPU), while the hpc6a.48xlarge (4 GB/CPU) consistently threw Fortran allocation errors directly from the code base.

Any attempt to scale to large meshes (the CONUS domain) for the NWMv3 or SCHISM models has consistently failed with "BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES; KILLED BY SIGNAL: 9 (Killed)" during the model initialization phase, as the models attempt to load the mesh arrays. We also tried maximizing all "ulimit" settings in the launcher shell script, but this did not change the behavior. Overall, there appears to be an issue with the system environment settings for executing these particular coastal models that warrants further discussion on the cloudflow end.
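For reference, the "maximize all ulimits in the launcher" step usually looks something like the sketch below. The specific limits raised here are assumptions based on common MPI/Fortran practice, not copied from nwmv3_hindcastrun.sh, and the launch command is a hypothetical placeholder:

```shell
# Sketch of the ulimit maximization a launcher script typically performs
# before an MPI run; these specific limits are assumptions, not taken
# from the actual nwmv3_hindcastrun.sh.
ulimit -s unlimited 2>/dev/null  # stack size: Fortran automatic arrays live here
ulimit -v unlimited 2>/dev/null  # virtual memory per process
ulimit -l unlimited 2>/dev/null  # locked memory, used by some MPI fabrics
ulimit -a                        # log the effective limits for later debugging

# The MPI launch would follow, e.g. (placeholder, not the real command):
# mpirun -np "$NPROCS" ./model_executable ...
```

Note that ulimit only caps per-process resources; raising it cannot create memory the node does not physically have, so it will not help if the mesh arrays exceed the RAM available per rank.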
Jason has a block of code on the us-east-2b head node at /save/ec2-user/OWP/CoastalSandbox that demonstrates this.
Out of an abundance of caution, I created a branch from main of this repo that contains the entirety of Jason's tree (including some temp files that I could probably have filtered better).
That branch contains a test.out file describing the issue we're experiencing: memory cannot be allocated even though the ulimits appear to have been set.
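One detail worth checking when triaging this: a hard "KILLED BY SIGNAL: 9" is usually the kernel OOM killer (which leaves a record in the kernel log), whereas a ulimit hit inside the model normally surfaces as a Fortran allocation error instead. A quick triage sketch, assuming a standard Linux node (paths and tools are generic, not specific to the sandbox image):

```shell
# Count OOM-killer mentions in the kernel log; dmesg may be restricted
# for unprivileged users, hence the fallback and the error suppression.
oom_hits=$( (dmesg 2>/dev/null; cat /var/log/messages 2>/dev/null) \
            | grep -ci 'oom' )
echo "kernel log OOM mentions: ${oom_hits}"

# Headline physical memory on the node, in GB:
free -g 2>/dev/null || true
```

If the count is nonzero around the failure time, the node simply ran out of physical memory during mesh loading, and no ulimit setting will change the outcome.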
Per Jason:
/save/ec2-user/OWP/Cloud-Sandbox/cloudflow/workflows/nwmv3_hindcastrun.sh is the shell script executing the model run.