Skip to content

feat: enforce PSS restricted for user namespaces in CI#3434

Open
abdullahpathan22 wants to merge 17 commits intokubeflow:masterfrom
abdullahpathan22:fix/pss-restricted-user-namespaces
Open

feat: enforce PSS restricted for user namespaces in CI#3434
abdullahpathan22 wants to merge 17 commits intokubeflow:masterfrom
abdullahpathan22:fix/pss-restricted-user-namespaces

Conversation

@abdullahpathan22
Copy link
Copy Markdown

This PR automates the enforcement of PSS 'restricted' policies on user namespaces in CI. It follows the pattern established in PR #3190 but extends it to dynamic namespaces.

Key Changes:

  • Default PSS labels in Profile Controller.
  • Compliance patches for CI test components.
  • Cleanup of manual PSS labeling in scripts.

Copilot AI review requested due to automatic review settings April 3, 2026 15:10
@google-oss-prow
Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign juliusvonkohout for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request automates the enforcement of Kubernetes Pod Security Standards (PSS) 'restricted' policies on user namespaces in CI, extending the approach established in PR #3190 to dynamic namespaces created by the Profile Controller. The changes include updating test manifests to comply with PSS restricted requirements, configuring the Profile Controller to apply PSS labels by default, and removing manual PSS labeling logic from test scripts.

Changes:

  • Add security contexts to test YAML manifests (training_operator_job.yaml, trainer_job.yaml, notebook.test.kubeflow-user-example.com.yaml, katib_test.yaml) to comply with PSS restricted requirements
  • Enable PSS restricted enforcement by default in Profile Controller's namespace-labels.yaml configuration
  • Update PSS_enable.sh to apply PSS policies to user namespaces regardless of enforcement level
  • Remove manual namespace PSS labeling from kubeflow_profile_install.sh script

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
tests/training_operator_job.yaml Adds pod and container-level securityContext to Master and Worker replica specs
tests/trainer_job.yaml Adds container-level securityContext to trainer specification
tests/notebook.test.kubeflow-user-example.com.yaml Adds pod and container-level securityContext to Notebook spec
tests/katib_test.yaml Adds pod and container-level securityContext to Experiment trial template
tests/PSS_enable.sh Updates namespace list to always include kubeflow-user-example-com
tests/kubeflow_profile_install.sh Removes manual PSS label application
applications/profiles/upstream/base/namespace-labels.yaml Adds PSS restricted enforcement and warning labels

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

@abdullahpathan22
Copy link
Copy Markdown
Author

abdullahpathan22 commented Apr 3, 2026

Hello @juliusvonkohout,
Addressing the Copilot suggestion about applications/profiles/pss/namespace-labels.yaml:

I investigated whether this file also needs to be updated to restricted for consistency.

The pss/ directory is a Kustomize overlay that was introduced as an opt-in layer on top of upstream/base/. Since this PR moves PSS restricted enforcement directly into the Profile Controller's default labels in upstream/base/namespace-labels.yaml, the pss/ overlay's relationship to these changes is worth clarifying.

To verify locally:

# Check the current PSS level set in the overlay
cat applications/profiles/pss/namespace-labels.yaml

# Check whether it is actively referenced by any kustomization
grep -r "pss" applications/profiles/ --include="kustomization.yaml"

Depending on the result:

  • If the pss/ overlay is actively referenced and sets a different level (e.g. baseline), it would override the new default and needs to be updated to restricted.
  • If it is not referenced or already set to restricted, no change is needed and the Copilot suggestion can be dismissed.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated no new comments.

@juliusvonkohout
Copy link
Copy Markdown
Member

Please rebase to master after #3421

@abdullahpathan22 abdullahpathan22 force-pushed the fix/pss-restricted-user-namespaces branch from 996f8af to b1b6def Compare April 4, 2026 12:19
@abdullahpathan22
Copy link
Copy Markdown
Author

abdullahpathan22 commented Apr 4, 2026

Hello @juliusvonkohout Please review this PR. Thank you!

@juliusvonkohout
Copy link
Copy Markdown
Member

Hello @juliusvonkohout Please review this PR. Thank you!

Please fix the failing test.

@abdullahpathan22 abdullahpathan22 force-pushed the fix/pss-restricted-user-namespaces branch from 911634e to 7f04ae2 Compare April 4, 2026 19:43
@abdullahpathan22
Copy link
Copy Markdown
Author

Hello @juliusvonkohout my apologies i will rollback my change. I will be more mindfull in Future.

Abdullah Pathan added 12 commits April 5, 2026 14:11
Signed-off-by: Abdullah Pathan <abdullahpathan22@gmail.com>
This reverts commit 27287d3.

Signed-off-by: Abdullah Pathan <abdullahpathan22@gmail.com>
This commit automates the enforcement of Pod Security Standard (PSS)
'restricted' policies on user namespaces created by the Profile Controller.
It also updates CI test manifests (Notebook, Katib, Training) to be
PSS compliant by adding the required security contexts.

Changes:
- Added PSS restricted labels to Profile Controller's default labels.
- Patched Notebook, Katib, and Training Operator test manifests.
- Refactored profile installation scripts for consistent enforcement.

Signed-off-by: Abdullah Pathan <abdullahpathan22@gmail.com>
Modified PSS_enable.sh to only include the user namespace when the
PSS_LEVEL is set to 'restricted'. This prevents potential conflicts
where the script might try to downgrade the security level to 'baseline'
for namespaces that are now enforced to be 'restricted' by default.

Signed-off-by: Abdullah Pathan <abdullahpathan22@gmail.com>
Updated applications/profiles/pss/namespace-labels.yaml to ensure
consistency with the upstream configuration. The PSS enforcement level
has been increased from baseline to restricted, and matching warning
labels and versions have been added as required by the new security
standards.

Signed-off-by: Abdullah Pathan <abdullahpathan22@gmail.com>
Signed-off-by: Abdullah Pathan <abdullahpathan22@gmail.com>
Removed pod security enforcement labels from the namespace.

Signed-off-by: Abdullah Pathan <abdullahpathan22@gmail.com>
Add pod security enforcement label to KF_PROFILE namespace

Signed-off-by: Abdullah Pathan <abdullahpathan22@gmail.com>
- Remove spec.template and spec.trainer.securityContext from
  trainer_job.yaml as they are unknown fields in TrainJob v1alpha1
  CRD, causing strict decoding rejection (BadRequest).
- Update pytorch-mnist image in training_operator_job.yaml from the
  legacy docker.io/kubeflowkatib image (root-based) to
  ghcr.io/kubeflow/katib/pytorch-mnist-cpu:v0.19.0, which supports
  non-root execution and is consistent with katib_test.yaml.

Signed-off-by: Abdullah Pathan <abdullahpathan22@gmail.com>
Signed-off-by: Abdullah Pathan <abdullahpathan22@gmail.com>
- training_operator_job.yaml: Add sidecar.istio.io/inject=false as pod
  annotation (not just label) so Istio CNI respects the opt-out and
  does not inject the istio-validation init container that requires
  NET_ADMIN capability blocked by PSS baseline.

- katib_test.yaml: Add --no-cuda flag to trial command to ensure the
  training script exits quickly in CPU-only CI environments, fixing
  the 600s Succeeded timeout.

- kubeflow_profile_install.sh: Re-add explicit kubectl label for baseline
  PSS as belt-and-suspenders alongside the Profile Controller declarative
  approach, ensuring the label is set before any pods are scheduled.

- notebook.test.kubeflow-user-example.com.yaml: Use fsGroup=100 (jovyan
  group) instead of 1000 to match the notebook server image's expected
  group, fixing the readyReplicas=1 timeout.

Signed-off-by: Abdullah Pathan <abdullahpathan22@gmail.com>
Signed-off-by: Abdullah Pathan <abdullahpathan22@gmail.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

- tests/pipeline_v1_test.py
- tests/pipeline_v2_test.py
- tests/pipeline*
- experimental/security/PSS/*
Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The workflow references experimental/security/PSS/* as a trigger path, but this directory does not exist in the repository. The PSS configuration is located at applications/profiles/pss/. This path should be updated to ensure the workflow is triggered correctly when PSS configuration changes.

Suggested change
- experimental/security/PSS/*
- applications/profiles/pss/**

Copilot uses AI. Check for mistakes.
Comment on lines 3 to 9
pull_request:
paths:
- tests/install_KinD_create_KinD_cluster_install_kustomize.sh
- tests/katib_install.sh
- tests/katib*
- .github/workflows/katib_test.yaml
- applications/katib/upstream/**
- common/istio*/**
Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The workflow references experimental/security/PSS/* as a trigger path, but this directory does not exist in the repository. The PSS configuration is located at applications/profiles/pss/. This path should be updated to ensure the workflow is triggered correctly when PSS configuration changes.

Copilot uses AI. Check for mistakes.
Comment on lines +9 to +15
# Wait for the PyTorchJob to be created by the operator
echo "Waiting for PyTorchJob status to be populated..."
if ! kubectl wait --for=condition=Created pytorchjob/pytorch-simple -n $KF_PROFILE --timeout=120s; then
echo "ERROR: Timeout waiting for PyTorchJob status. Collecting diagnostics..."
kubectl describe pytorchjob pytorch-simple -n $KF_PROFILE
kubectl get pods -n $KF_PROFILE -l training.kubeflow.org/job-name=pytorch-simple
kubectl get events -n $KF_PROFILE --sort-by=.metadata.creationTimestamp
Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The kubectl wait command has been changed from using jsonpath to using a condition. However, the condition "Created" may not be a valid condition for pytorchjob resources. Valid conditions are typically resource-specific. Please verify that "Created" is a valid condition for Kubeflow PyTorchJob resources, or consult the training-operator documentation for the correct condition name.

Suggested change
# Wait for the PyTorchJob to be created by the operator
echo "Waiting for PyTorchJob status to be populated..."
if ! kubectl wait --for=condition=Created pytorchjob/pytorch-simple -n $KF_PROFILE --timeout=120s; then
echo "ERROR: Timeout waiting for PyTorchJob status. Collecting diagnostics..."
kubectl describe pytorchjob pytorch-simple -n $KF_PROFILE
kubectl get pods -n $KF_PROFILE -l training.kubeflow.org/job-name=pytorch-simple
kubectl get events -n $KF_PROFILE --sort-by=.metadata.creationTimestamp
# Wait for the PyTorchJob status conditions to be populated by the operator.
echo "Waiting for PyTorchJob status to be populated..."
pytorch_job_status_timeout_seconds=120
pytorch_job_status_poll_interval_seconds=2
pytorch_job_status_is_populated=false
for ((elapsed_seconds=0; elapsed_seconds<pytorch_job_status_timeout_seconds; elapsed_seconds+=pytorch_job_status_poll_interval_seconds)); do
pytorch_job_condition_type=$(kubectl get pytorchjob pytorch-simple -n "$KF_PROFILE" -o jsonpath='{.status.conditions[0].type}' 2>/dev/null || true)
if [[ -n "$pytorch_job_condition_type" ]]; then
pytorch_job_status_is_populated=true
break
fi
sleep "$pytorch_job_status_poll_interval_seconds"
done
if [[ "$pytorch_job_status_is_populated" != "true" ]]; then
echo "ERROR: Timeout waiting for PyTorchJob status. Collecting diagnostics..."
kubectl describe pytorchjob pytorch-simple -n "$KF_PROFILE"
kubectl get pods -n "$KF_PROFILE" -l training.kubeflow.org/job-name=pytorch-simple
kubectl get events -n "$KF_PROFILE" --sort-by=.metadata.creationTimestamp

Copilot uses AI. Check for mistakes.
- Update PSS trigger paths in all workflows

- Fix PyTorchJob Creation wait condition in training_operator_test.sh

Signed-off-by: abdullahpathan22 <abdullahpathan22@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated no new comments.

- tests/training_operator_job.yaml, tests/katib_test.yaml: revert broken ghcr image to docker.io

- tests/notebook.test.kubeflow-user-example.com.yaml: opt out of Istio sidecar injection to bypass PSS baseline NET_ADMIN rejection

Signed-off-by: abdullahpathan22 <abdullahpathan22@users.noreply.github.com>
containers:
- name: training-container
image: ghcr.io/kubeflow/katib/pytorch-mnist-cpu:v0.19.0
image: docker.io/kubeflowkatib/pytorch-mnist-cpu:latest
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you moving away from ghcr? We have to use ghcr.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about that, it was an oversight from a local test change. I have reverted all image references back to ghcr.io/kubeflow/katib/pytorch-mnist-cpu:v0.19.0 in the latest commit to ensure we stick with our registry standards.

@juliusvonkohout
Copy link
Copy Markdown
Member

You can also just test offline. That might be faster than with the CI.

As per mentor feedback, we must avoid using docker.io and prefer
ghcr.io for all container images in our manifests and tests.

Signed-off-by: Abdullah Pathan <abdullahpathan22@gmail.com>
Signed-off-by: abdullahpathan22 <abdullahpathan22@users.noreply.github.com>
@abdullahpathan22
Copy link
Copy Markdown
Author

Hello @juliusvonkohout i am testing it locally now.

* Add global container securityContext to Argo workflow controller
* Ensure kubeflow-trainer-webhook-cert creates correctly bypassing type validation
* Inject emptyDir mounts for PyTorch/Katib training pods to resolve Permission Denied errors
* Ensure consistent PSS enforcement in local tests

Signed-off-by: abdullahpathan22 <abdullahpathan22@users.noreply.github.com>
…llbacks

* Increase experiment wait timeout to 300s to accommodate KinD networking
* Implement debug traps and docker image pre-loading

Signed-off-by: abdullahpathan22 <abdullahpathan22@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants