[GB200] Make IMEX prolog use local IMEX configurations + test fixes #7013
return 1 # Not Updated
fi

# Try to acquire lock with timeout
what if we do keep this as part of the prolog?
It's a deadlock prevention measure, even though we added it for the shared file.
If we keep it, we prevent any scenario where multiple processes access this file and end up deadlocked, and we have logs showing that a deadlock occurred.
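To make the point about the lock timeout concrete, here is a minimal sketch of acquiring a file lock with a timeout. It assumes `flock` from util-linux is available; the helper name `with_lock_timeout` and its arguments are hypothetical, and the actual prolog may implement this differently.

```shell
#!/bin/bash
# Hypothetical helper (not taken from the PR): run a command while holding
# an exclusive lock, giving up after a timeout instead of waiting forever.
# ASSUMPTION: flock (util-linux) is installed on the node.
with_lock_timeout() {
  local lock_file="$1" timeout_s="$2"
  shift 2
  (
    # flock -w exits non-zero if the lock is not acquired within the timeout.
    if ! flock -x -w "${timeout_s}" 9; then
      # Log the failure so a stuck lock is visible afterwards.
      echo "Timed out after ${timeout_s}s waiting for lock ${lock_file}" >&2
      exit 1
    fi
    # Lock held: run the protected command. The lock is released when
    # file descriptor 9 is closed at the end of the subshell.
    "$@"
  ) 9>"${lock_file}"
}
```

With a helper like this, a process that cannot get the lock fails fast and leaves a log line behind, rather than blocking the prolog indefinitely.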
IPS_FROM_CR=$(get_ips_from_node_names "${CR_NODES}")
IMEX_MAIN_CONFIG="/opt/parallelcluster/shared/nvidia-imex/config_${QUEUE_NAME}_${COMPUTE_RESOURCE_NAME}.cfg"
IMEX_NODES_CONFIG="/opt/parallelcluster/shared/nvidia-imex/nodes_config_${QUEUE_NAME}_${COMPUTE_RESOURCE_NAME}.cfg"
IMEX_MAIN_CONFIG="/etc/nvidia-imex/config.cfg"
You also need to change the nvidia-imex-status.job file, which still points to a specific config file.
good catch, done!
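The prolog fragments above hint at an "update only if changed" pattern (`return 1 # Not Updated`) applied to the IMEX nodes config. A minimal sketch of that pattern follows; the helper name `write_file_if_changed` and the one-entry-per-line content format are assumptions for illustration, not the PR's actual code.

```shell
#!/bin/bash
# Hypothetical helper (not taken from the PR): write new content to a file
# only when it differs from what is already there.
# Returns 0 if the file was updated, 1 if it was left untouched.
write_file_if_changed() {
  local target="$1" new_content="$2"
  # Command substitution strips trailing newlines, so the comparison is
  # consistent with the printf below, which appends exactly one newline.
  if [ -f "${target}" ] && [ "$(cat "${target}")" = "${new_content}" ]; then
    return 1 # Not Updated
  fi
  printf '%s\n' "${new_content}" > "${target}"
  return 0 # Updated
}

# Usage sketch (variable names follow the diff above):
#   if write_file_if_changed "${IMEX_NODES_CONFIG}" "${IPS_FROM_CR}"; then
#     echo "IMEX nodes config updated"
#   fi
```

Returning a distinct status for "not updated" lets the caller skip follow-up work, such as reloading a service, when nothing changed.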
…ocal IMEX nodes config, rather than the shared one.
…status by using the local IMEX config file rather than the shared one.
c611d14 to 94393f7
QUEUE_NAME=$(cat "/etc/chef/dna.json" | jq -r ".cluster.scheduler_queue_name")
COMPUTE_RES_NAME=$(cat "/etc/chef/dna.json" | jq -r ".cluster.scheduler_compute_resource_name")
IMEX_CONFIG_FILE="/opt/parallelcluster/shared/nvidia-imex/config_${QUEUE_NAME}_${COMPUTE_RES_NAME}.cfg"
IMEX_CONFIG_FILE="/etc/nvidia-imex/config.cfg"
We can remove this file. We should no longer specify the configuration file if not needed!
I remember it was necessary, but apparently it is not when the default location is used.
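As a rough illustration of the suggestion above, the status job could pass a config path only when it differs from the default. This is a sketch under assumptions: the helper `imex_status_args` is hypothetical, and the `-N` (node status) and `-c` (config path) options of `nvidia-imex-ctl` are assumed here and should be checked against the installed IMEX version.

```shell
#!/bin/bash
# Default config location, as it appears in the diff above.
IMEX_DEFAULT_CONFIG="/etc/nvidia-imex/config.cfg"

# Hypothetical helper (not taken from the PR): print the arguments for an
# IMEX status query, omitting the config option for the default location.
# ASSUMPTION: nvidia-imex-ctl accepts -N for node status and -c <path>
# to select a non-default config file.
imex_status_args() {
  local config="${1:-${IMEX_DEFAULT_CONFIG}}"
  if [ "${config}" = "${IMEX_DEFAULT_CONFIG}" ]; then
    echo "-N"
  else
    echo "-N -c ${config}"
  fi
}

# Usage sketch (requires IMEX to be installed, so it is not executed here):
#   nvidia-imex-ctl $(imex_status_args)
```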
…ws#7013)
* [GB200] Adapt the prolog used to configure IMEX so that it uses the local IMEX nodes config, rather than the shared one.
* [GB200] In test_ultraserver, fix the job script that checks for IMEX status by using the local IMEX config file rather than the shared one.
* [GB200] In test_ultraserver, fix assertion on imex logs.
* [GB200] In test_ultraserver, fix assert_imex_nodes_config_is_correct.
* [GB200] In test_ultraserver, fix job to check imex status.
* [GB200] In test_ultraserver, fix assert_no_errors_in_logs
Description of changes
In aws/aws-parallelcluster-cookbook#3029 we moved from shared IMEX configurations to local ones. This PR adapts the prolog accordingly.
We also fixed an assertion on IMEX logs: it used to check the logs on the head node, but it should check them on the compute nodes.
Tests
[ONGOING] test_gb200
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.