fix(eks): pods become CrashLoopBackOff when using INFERENTIA or TRAINIUM instance type #29651
wafuwafu13 wants to merge 4 commits into aws:main
aws-cdk-automation
left a comment
The pull request linter has failed. See the aws-cdk-automation comment below for failure reasons. If you believe this pull request should receive an exemption, please comment and provide a justification.
A comment requesting an exemption should contain the text Exemption Request. Additionally, if clarification is needed add Clarification Request to a comment.
private addNeuronDevicePluginRbac() {
  if (!this._neuronDevicePluginRbacClusterRole) {
    const clusterRoleFileContents = fs.readFileSync(path.join(__dirname, 'addons', 'neuron-device-plugin-rbac-cluster-role.yaml'), 'utf8');
    const sanitizedClusterRole = YAML.parse(clusterRoleFileContents);
If I used parseAllDocuments, I would not need to split k8s-neuron-device-plugin-rbac.yml into three files. However, the return type of parseAllDocuments differs from that of parse, so the addManifest function cannot handle the parsed YAML.
I think splitting k8s-neuron-device-plugin-rbac.yml into three files and using parse is the simplest solution.
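For illustration only, here is a minimal sketch of a third option that sits between the two above: keep one multi-document file and split it into per-document strings at runtime, so each piece could still be handed to `YAML.parse` as in the diff. The helper `splitYamlDocuments` and the sample manifest are hypothetical and not part of aws-cdk. Splitting on `---` lines is a simplification that ignores edge cases such as `---` inside block scalars.

```typescript
// Hypothetical helper (not part of aws-cdk or the `yaml` package):
// split a multi-document YAML string into one string per document,
// treating any line that consists solely of `---` as a separator.
function splitYamlDocuments(text: string): string[] {
  return text
    .split(/^---[ \t]*\r?$/m)
    .map((doc) => doc.trim())
    .filter((doc) => doc.length > 0);
}

// Made-up sample standing in for a three-document RBAC manifest.
const sampleManifest = [
  "kind: ClusterRole",
  "metadata:",
  "  name: example-role",
  "---",
  "kind: ServiceAccount",
  "metadata:",
  "  name: example-account",
  "---",
  "kind: ClusterRoleBinding",
  "metadata:",
  "  name: example-binding",
].join("\n");

// Each entry is a single-document YAML string.
const documents: string[] = splitYamlDocuments(sampleManifest);
console.log(documents.length);
```

Keeping three separate files, as the PR does, avoids this extra step and any separator edge cases, at the cost of maintaining more files.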
Exemption Request: I updated
This PR has been in the CHANGES REQUESTED state for 3 weeks, and looks abandoned. To keep this PR from being closed, please continue work on it. If not, it will automatically be closed in a week.
This PR has been deemed to be abandoned, and will be automatically closed. Please create a new PR for these changes if you think this decision has been made in error.
The pull request linter fails with the following errors: PRs must pass status checks before we can provide a meaningful review. If you would like to request an exemption from the status checks or clarification on feedback, please leave a comment on this PR containing
✅ An exemption request has been requested. Please wait for a maintainer's review.
Issue # (if applicable)
#29262
Reason for this change
When an INFERENTIA or TRAINIUM instance type is used, https://github.com/aws/aws-cdk/blob/main/packages/aws-cdk-lib/aws-eks/lib/addons/neuron-device-plugin.yaml is applied to the cluster, but the Pod goes into CrashLoopBackOff (detailed log: #29262 (comment)).
The upstream YAML currently referenced, https://github.com/aws-neuron/aws-neuron-sdk/blob/master/docs/neuron-container-tools/k8s-neuron-device-plugin.yml, now returns File not found.
aws-cdk/packages/aws-cdk-lib/aws-eks/lib/addons/neuron-device-plugin.yaml
Line 1 in dffedca
Description of changes
Download k8s-neuron-device-plugin.yml and k8s-neuron-device-plugin-rbac.yml from https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/k8s-setup.html and paste in their contents
Add a function to apply the YAML files for RBAC
Add unit tests
Update integ.eks-inference-nodegroup and integ.eks-inference
Description of how you validated changes
Checklist
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license