feat: enforce PSS restricted for user namespaces in CI#3434
feat: enforce PSS restricted for user namespaces in CI#3434abdullahpathan22 wants to merge 17 commits intokubeflow:masterfrom
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Pull request overview
This pull request automates the enforcement of Kubernetes Pod Security Standards (PSS) 'restricted' policies on user namespaces in CI, extending the approach established in PR #3190 to dynamic namespaces created by the Profile Controller. The changes include updating test manifests to comply with PSS restricted requirements, configuring the Profile Controller to apply PSS labels by default, and removing manual PSS labeling logic from test scripts.
Changes:
- Add security contexts to test YAML manifests (training_operator_job.yaml, trainer_job.yaml, notebook.test.kubeflow-user-example.com.yaml, katib_test.yaml) to comply with PSS restricted requirements
- Enable PSS restricted enforcement by default in Profile Controller's namespace-labels.yaml configuration
- Update PSS_enable.sh to apply PSS policies to user namespaces regardless of enforcement level
- Remove manual namespace PSS labeling from kubeflow_profile_install.sh script
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| tests/training_operator_job.yaml | Adds pod and container-level securityContext to Master and Worker replica specs |
| tests/trainer_job.yaml | Adds container-level securityContext to trainer specification |
| tests/notebook.test.kubeflow-user-example.com.yaml | Adds pod and container-level securityContext to Notebook spec |
| tests/katib_test.yaml | Adds pod and container-level securityContext to Experiment trial template |
| tests/PSS_enable.sh | Updates namespace list to always include kubeflow-user-example-com |
| tests/kubeflow_profile_install.sh | Removes manual PSS label application |
| applications/profiles/upstream/base/namespace-labels.yaml | Adds PSS restricted enforcement and warning labels |
605c163 to
e8500e3
Compare
|
Hello @juliusvonkohout, I investigated whether this file also needs to be updated to The To verify locally: # Check the current PSS level set in the overlay
cat applications/profiles/pss/namespace-labels.yaml
# Check whether it is actively referenced by any kustomization
grep -r "pss" applications/profiles/ --include="kustomization.yaml"Depending on the result:
|
|
Please rebase to master after #3421 |
996f8af to
b1b6def
Compare
|
Hello @juliusvonkohout Please review this PR. Thank you! |
Please fix the failing test. |
911634e to
7f04ae2
Compare
|
Hello @juliusvonkohout my apologies i will rollback my change. I will be more mindfull in Future. |
Signed-off-by: Abdullah Pathan <abdullahpathan22@gmail.com>
This reverts commit 27287d3. Signed-off-by: Abdullah Pathan <abdullahpathan22@gmail.com>
This commit automates the enforcement of Pod Security Standard (PSS) 'restricted' policies on user namespaces created by the Profile Controller. It also updates CI test manifests (Notebook, Katib, Training) to be PSS compliant by adding the required security contexts. Changes: - Added PSS restricted labels to Profile Controller's default labels. - Patched Notebook, Katib, and Training Operator test manifests. - Refactored profile installation scripts for consistent enforcement. Signed-off-by: Abdullah Pathan <abdullahpathan22@gmail.com>
Modified PSS_enable.sh to only include the user namespace when the PSS_LEVEL is set to 'restricted'. This prevents potential conflicts where the script might try to downgrade the security level to 'baseline' for namespaces that are now enforced to be 'restricted' by default. Signed-off-by: Abdullah Pathan <abdullahpathan22@gmail.com>
Updated applications/profiles/pss/namespace-labels.yaml to ensure consistency with the upstream configuration. The PSS enforcement level has been increased from baseline to restricted, and matching warning labels and versions have been added as required by the new security standards. Signed-off-by: Abdullah Pathan <abdullahpathan22@gmail.com>
Signed-off-by: Abdullah Pathan <abdullahpathan22@gmail.com>
Removed pod security enforcement labels from the namespace. Signed-off-by: Abdullah Pathan <abdullahpathan22@gmail.com>
Add pod security enforcement label to KF_PROFILE namespace Signed-off-by: Abdullah Pathan <abdullahpathan22@gmail.com>
- Remove spec.template and spec.trainer.securityContext from trainer_job.yaml as they are unknown fields in TrainJob v1alpha1 CRD, causing strict decoding rejection (BadRequest). - Update pytorch-mnist image in training_operator_job.yaml from the legacy docker.io/kubeflowkatib image (root-based) to ghcr.io/kubeflow/katib/pytorch-mnist-cpu:v0.19.0, which supports non-root execution and is consistent with katib_test.yaml. Signed-off-by: Abdullah Pathan <abdullahpathan22@gmail.com>
Signed-off-by: Abdullah Pathan <abdullahpathan22@gmail.com>
- training_operator_job.yaml: Add sidecar.istio.io/inject=false as pod annotation (not just label) so Istio CNI respects the opt-out and does not inject the istio-validation init container that requires NET_ADMIN capability blocked by PSS baseline. - katib_test.yaml: Add --no-cuda flag to trial command to ensure the training script exits quickly in CPU-only CI environments, fixing the 600s Succeeded timeout. - kubeflow_profile_install.sh: Re-add explicit kubectl label for baseline PSS as belt-and-suspenders alongside the Profile Controller declarative approach, ensuring the label is set before any pods are scheduled. - notebook.test.kubeflow-user-example.com.yaml: Use fsGroup=100 (jovyan group) instead of 1000 to match the notebook server image's expected group, fixing the readyReplicas=1 timeout. Signed-off-by: Abdullah Pathan <abdullahpathan22@gmail.com>
Signed-off-by: Abdullah Pathan <abdullahpathan22@gmail.com>
56373ff to
bd62d80
Compare
.github/workflows/pipeline_test.yaml
Outdated
| - tests/pipeline_v1_test.py | ||
| - tests/pipeline_v2_test.py | ||
| - tests/pipeline* | ||
| - experimental/security/PSS/* |
There was a problem hiding this comment.
The workflow references experimental/security/PSS/* as a trigger path, but this directory does not exist in the repository. The PSS configuration is located at applications/profiles/pss/. This path should be updated to ensure the workflow is triggered correctly when PSS configuration changes.
| - experimental/security/PSS/* | |
| - applications/profiles/pss/** |
| pull_request: | ||
| paths: | ||
| - tests/install_KinD_create_KinD_cluster_install_kustomize.sh | ||
| - tests/katib_install.sh | ||
| - tests/katib* | ||
| - .github/workflows/katib_test.yaml | ||
| - applications/katib/upstream/** | ||
| - common/istio*/** |
There was a problem hiding this comment.
The workflow references experimental/security/PSS/* as a trigger path, but this directory does not exist in the repository. The PSS configuration is located at applications/profiles/pss/. This path should be updated to ensure the workflow is triggered correctly when PSS configuration changes.
tests/training_operator_test.sh
Outdated
| # Wait for the PyTorchJob to be created by the operator | ||
| echo "Waiting for PyTorchJob status to be populated..." | ||
| if ! kubectl wait --for=condition=Created pytorchjob/pytorch-simple -n $KF_PROFILE --timeout=120s; then | ||
| echo "ERROR: Timeout waiting for PyTorchJob status. Collecting diagnostics..." | ||
| kubectl describe pytorchjob pytorch-simple -n $KF_PROFILE | ||
| kubectl get pods -n $KF_PROFILE -l training.kubeflow.org/job-name=pytorch-simple | ||
| kubectl get events -n $KF_PROFILE --sort-by=.metadata.creationTimestamp |
There was a problem hiding this comment.
The kubectl wait command has been changed from using jsonpath to using a condition. However, the condition "Created" may not be a valid condition for pytorchjob resources. Valid conditions are typically resource-specific. Please verify that "Created" is a valid condition for Kubeflow PyTorchJob resources, or consult the training-operator documentation for the correct condition name.
| # Wait for the PyTorchJob to be created by the operator | |
| echo "Waiting for PyTorchJob status to be populated..." | |
| if ! kubectl wait --for=condition=Created pytorchjob/pytorch-simple -n $KF_PROFILE --timeout=120s; then | |
| echo "ERROR: Timeout waiting for PyTorchJob status. Collecting diagnostics..." | |
| kubectl describe pytorchjob pytorch-simple -n $KF_PROFILE | |
| kubectl get pods -n $KF_PROFILE -l training.kubeflow.org/job-name=pytorch-simple | |
| kubectl get events -n $KF_PROFILE --sort-by=.metadata.creationTimestamp | |
| # Wait for the PyTorchJob status conditions to be populated by the operator. | |
| echo "Waiting for PyTorchJob status to be populated..." | |
| pytorch_job_status_timeout_seconds=120 | |
| pytorch_job_status_poll_interval_seconds=2 | |
| pytorch_job_status_is_populated=false | |
| for ((elapsed_seconds=0; elapsed_seconds<pytorch_job_status_timeout_seconds; elapsed_seconds+=pytorch_job_status_poll_interval_seconds)); do | |
| pytorch_job_condition_type=$(kubectl get pytorchjob pytorch-simple -n "$KF_PROFILE" -o jsonpath='{.status.conditions[0].type}' 2>/dev/null || true) | |
| if [[ -n "$pytorch_job_condition_type" ]]; then | |
| pytorch_job_status_is_populated=true | |
| break | |
| fi | |
| sleep "$pytorch_job_status_poll_interval_seconds" | |
| done | |
| if [[ "$pytorch_job_status_is_populated" != "true" ]]; then | |
| echo "ERROR: Timeout waiting for PyTorchJob status. Collecting diagnostics..." | |
| kubectl describe pytorchjob pytorch-simple -n "$KF_PROFILE" | |
| kubectl get pods -n "$KF_PROFILE" -l training.kubeflow.org/job-name=pytorch-simple | |
| kubectl get events -n "$KF_PROFILE" --sort-by=.metadata.creationTimestamp |
- Update PSS trigger paths in all workflows - Fix PyTorchJob Creation wait condition in training_operator_test.sh Signed-off-by: abdullahpathan22 <abdullahpathan22@users.noreply.github.com>
- tests/training_operator_job.yaml, tests/katib_test.yaml: revert broken ghcr image to docker.io - tests/notebook.test.kubeflow-user-example.com.yaml: opt out of Istio sidecar injection to bypass PSS baseline NET_ADMIN rejection Signed-off-by: abdullahpathan22 <abdullahpathan22@users.noreply.github.com>
tests/katib_test.yaml
Outdated
| containers: | ||
| - name: training-container | ||
| image: ghcr.io/kubeflow/katib/pytorch-mnist-cpu:v0.19.0 | ||
| image: docker.io/kubeflowkatib/pytorch-mnist-cpu:latest |
There was a problem hiding this comment.
Why are you moving away from ghcr? We have to use ghcr.
There was a problem hiding this comment.
Sorry about that, it was an oversight from a local test change. I have reverted all image references back to ghcr.io/kubeflow/katib/pytorch-mnist-cpu:v0.19.0 in the latest commit to ensure we stick with our registry standards.
|
You can also just test offline. That might be faster than with the CI. |
As per mentor feedback, we must avoid using docker.io and prefer ghcr.io for all container images in our manifests and tests. Signed-off-by: Abdullah Pathan <abdullahpathan22@gmail.com> Signed-off-by: abdullahpathan22 <abdullahpathan22@users.noreply.github.com>
|
Hello @juliusvonkohout i am testing it locally now. |
* Add global container securityContext to Argo workflow controller * Ensure kubeflow-trainer-webhook-cert creates correctly bypassing type validation * Inject emptyDir mounts for PyTorch/Katib training pods to resolve Permission Denied errors * Ensure consistent PSS enforcement in local tests Signed-off-by: abdullahpathan22 <abdullahpathan22@users.noreply.github.com>
…llbacks * Increase experiment wait timeout to 300s to accommodate KinD networking * Implement debug traps and docker image pre-loading Signed-off-by: abdullahpathan22 <abdullahpathan22@users.noreply.github.com>
This PR automates the enforcement of PSS 'restricted' policies on user namespaces in CI. It follows the pattern established in PR #3190 but extends it to dynamic namespaces.
Key Changes: