
fix(tests): modernize trainer test to use Kubeflow SDK #3421

Merged

google-oss-prow[bot] merged 1 commit into kubeflow:master from Raakshass:fix/trainer-test-image on Apr 4, 2026
Conversation

@Raakshass
Contributor

@Raakshass Raakshass commented Mar 24, 2026

Summary

Replaces the legacy static trainer_job.yaml with a Python SDK test (trainer_test.py) that uses TrainerClient.train() to create a 2-node DDP training job on the fly.

Follow-up to #3413 (comment).

Root Cause

The CI test pinned kubeflowkatib/pytorch-mnist:v1beta1-45c5727 — a legacy Katib image unrelated to Trainer v2.2.0 — and ran a hardcoded /opt/pytorch-mnist/mnist.py entrypoint. The static YAML also relied on a blocking kubectl wait --timeout=300s, which could stall the integration test run for the full five minutes on failure.

Changes

  • tests/trainer_job.yaml — Deleted; static YAML replaced by the SDK
  • tests/trainer_test.py — New; TrainerClient.train() with CustomTrainer, 2-node DDP (gloo), synthetic data, timeout=120. Diagnostics via the kubernetes Python client. Istio sidecar opt-out via Python with try/finally restore.
  • tests/trainer_test.sh — Replaced kubectl apply/kubectl wait with the SDK invocation; added pip install kubeflow
  • .github/workflows/trainer_test.yaml — Removed the duplicate Install Dependencies step (handled by trainer_test.sh)

Design Decisions

  • SDK approach: TrainerClient.train() + wait_for_job_status(timeout=120) — follows pipeline_v2_test.py pattern (see the sketch after this list)
  • CPU-only torch: python:3.12 base image + pip_index_urls=["https://download.pytorch.org/whl/cpu"] — no GPU image, no preloads
  • Istio sidecar: Namespace label temporarily set to disabled via kubernetes.client.CoreV1Api.patch_namespace(), restored in finally block. CustomTrainer does not expose pod template annotations, making namespace-level labeling the only available mechanism.
  • Diagnostics: Timeout handler uses kubernetes.client.CustomObjectsApi and CoreV1Api — no subprocess calls
  • Resources: 500Mi / 500m — matches master
  • Naming: Full, expressive names throughout — distributed_training_function, training_dataset, torch.distributed (no aliases)
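
As a sketch of how these decisions compose (referenced in the first bullet above), assuming the Kubeflow SDK's TrainerClient/CustomTrainer API quoted later in this thread; the merged test's exact code may differ:

from kubeflow.trainer import CustomTrainer, TrainerClient

def distributed_training_function():
    # Runs on every training node; gloo backend, so CPU-only torch suffices.
    import torch
    import torch.distributed

    torch.distributed.init_process_group(backend="gloo")
    rank = torch.distributed.get_rank()
    training_dataset = torch.randn(64, 1, 28, 28)  # synthetic MNIST-shaped data
    print(f"rank {rank}: batch mean {training_dataset.mean().item():.4f}", flush=True)
    torch.distributed.destroy_process_group()

trainer_client = TrainerClient()
job_name = trainer_client.train(
    trainer=CustomTrainer(
        func=distributed_training_function,
        num_nodes=2,
        resources_per_node={"cpu": "500m", "memory": "500Mi"},
        packages_to_install=["torch"],
        pip_index_urls=["https://download.pytorch.org/whl/cpu"],
    ),
)
trainer_client.wait_for_job_status(job_name, timeout=120, polling_interval=5)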

Checklist

  • DCO sign-off on all commits
  • 2 training nodes per review feedback

Copilot AI review requested due to automatic review settings March 24, 2026 12:25
@github-actions

Welcome to the Kubeflow Manifests Repository

Thanks for opening your first PR. Your contribution means a lot to the Kubeflow community.

Before making more PRs:
Please ensure your PR follows our Contributing Guide.
Please also be aware that many components are synchronized from upstream via the scripts in /scripts, so in some cases you have to fix the problem in the upstream repositories first; you can use a PR against kubeflow/manifests to test the platform integration.

Community Resources:

Thanks again for helping to improve Kubeflow.

Contributor

Copilot AI left a comment


Pull request overview

Updates the Trainer CI smoke test manifest to stop relying on a legacy Katib MNIST image and instead run a minimal torch distributed (gloo) check using the Torch Distributed runtime’s PyTorch base image.

Changes:

  • Replaces the test container image with pytorch/pytorch:2.10.0-cuda12.8-cudnn9-runtime.
  • Replaces the previous MNIST entrypoint with an inline-generated Python script executed via torchrun --standalone.
  • Increases the per-node memory request from 500Mi to 2Gi.

Comment thread tests/trainer_job.yaml Outdated
Comment thread tests/trainer_job.yaml Outdated
@Raakshass Raakshass marked this pull request as draft March 24, 2026 12:46
@Raakshass Raakshass marked this pull request as ready for review March 24, 2026 13:57
@Raakshass
Contributor Author

/retest

@google-oss-prow

@Raakshass: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

Details

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@juliusvonkohout
Member

Is that really what the trainer Python SDK generates as YAML?

@Raakshass
Contributor Author

Is that really what the trainer Python SDK generates as YAML?

Not exactly — the SDK does not produce static YAML. TrainerClient.train() dynamically creates a TrainJob from a Python function. From the upstream notebook (examples/pytorch/image-classification/mnist.ipynb):

client.train(
    trainer=CustomTrainer(
        func=train_fashion_mnist,
        num_nodes=3,
        resources_per_node={"cpu": 3, "memory": "16Gi"},
    ),
    runtime=torch_runtime,
)

The upstream end-to-end test for torch-distributed (test/e2e/e2e_test.go) is even more minimal — it creates a TrainJob with only runtimeRef and no trainer block at all:

trainJob := testingutil.MakeTrainJobWrapper(ns.Name, "e2e-test-torch").
    RuntimeRef(..., torchRuntime).
    Obj()

Our test YAML sits between these two patterns: it uses runtimeRef: torch-distributed to inherit the image from the runtime, but overrides trainer.command with an inline DDP gloo smoke test and sets trainer.resourcesPerNode. All three fields (command, numNodes, resourcesPerNode) are optional overrides per the Trainer struct.
Happy to simplify to a bare runtimeRef-only TrainJob like the upstream end-to-end test if that is preferred — would just need to confirm the runtime's default command provides a sufficient smoke test on its own.
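
For what it's worth, a runtimeRef-only TrainJob could also be created from Python with the raw kubernetes client, roughly like this sketch (the trainer.kubeflow.org/v1alpha1 group/version and the kubeflow namespace are assumptions):

from kubernetes import client, config

config.load_kube_config()
client.CustomObjectsApi().create_namespaced_custom_object(
    group="trainer.kubeflow.org",
    version="v1alpha1",
    namespace="kubeflow",  # assumption: target namespace
    plural="trainjobs",
    body={
        "apiVersion": "trainer.kubeflow.org/v1alpha1",
        "kind": "TrainJob",
        "metadata": {"name": "e2e-test-torch"},
        "spec": {"runtimeRef": {"name": "torch-distributed"}},
    },
)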

@juliusvonkohout
Member

juliusvonkohout commented Mar 25, 2026

Yes, please just the minimal MNIST example with 2 nodes and way less CPU and memory requests, similar to what we request right now.

@google-oss-prow google-oss-prow Bot added size/M and removed size/S labels Mar 25, 2026
@Raakshass
Contributor Author

Yes, please just the minimal MNIST example with 2 nodes and way less CPU and memory requests, similar to what we request right now.

Updated to a minimal MNIST CNN with DDP across 2 nodes, resources matching master (500Mi, 500m). Uses the same CNN architecture as the upstream SDK notebook (https://github.com/kubeflow/trainer/blob/master/examples/pytorch/image-classification/mnist.ipynb) with synthetic MNIST-shaped data to avoid a network dependency in CI. Removed --standalone so torchrun picks up the PET_ environment variables the torch plugin injects for multi-node rendezvous.

@Raakshass Raakshass marked this pull request as draft March 25, 2026 17:12
@juliusvonkohout
Member

Yes, please just the minimal MNIST example with 2 nodes and way less CPU and memory requests, similar to what we request right now.

Updated to a minimal MNIST CNN with DDP across 2 nodes, resources matching master (500Mi, 500m). Uses the same CNN architecture as the upstream SDK notebook (https://github.com/kubeflow/trainer/blob/master/examples/pytorch/image-classification/mnist.ipynb) with synthetic MNIST-shaped data to avoid a network dependency in CI. Removed --standalone so torchrun picks up the PET_ environment variables the torch plugin injects for multi-node rendezvous.

You can also just use the trainer SDK and create it within the GHA on the fly.

@Raakshass
Contributor Author

Raakshass commented Mar 25, 2026

Yes, please just the minimal MNIST example with 2 nodes and way less CPU and memory requests, similar to what we request right now.

Updated to a minimal MNIST CNN with DDP across 2 nodes, resources matching master (500Mi, 500m). Uses the same CNN architecture as the upstream SDK notebook (https://github.com/kubeflow/trainer/blob/master/examples/pytorch/image-classification/mnist.ipynb) with synthetic MNIST-shaped data to avoid a network dependency in CI. Removed --standalone so torchrun picks up the PET_ environment variables the torch plugin injects for multi-node rendezvous.

You can also just use the trainer SDK and create it within the GHA on the fly.

Happy to switch to the SDK approach if you prefer. The current change is static YAML (2 test files). Should I redo it with TrainerClient.train() instead?

@Raakshass Raakshass marked this pull request as ready for review March 25, 2026 19:41
Comment thread tests/trainer_test.sh Outdated
@juliusvonkohout
Member

juliusvonkohout commented Mar 26, 2026

Yes, TrainerClient.train() is a good approach, but you have to stay within 60-120 seconds for the training run, similar to how we test KFP v1 and v2.

Raakshass added a commit to Raakshass/manifests that referenced this pull request Mar 27, 2026
Replace static TrainJob YAML with a Python script that uses
TrainerClient.train() to create and monitor the training job,
following the same pattern as pipeline_v2_test.py.

- Add tests/trainer_test.py: SDK-based test with minimal MNIST DDP,
  single-node, 120s timeout, synthetic data
- Update tests/trainer_test.sh: replace kubectl apply with SDK call,
  keep prerequisite checks for fail-fast diagnostics
- Update trainer_test.yaml: install kubeflow pip package
- Remove tests/trainer_job.yaml: no longer needed

Ref: kubeflow#3421

Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>
@Raakshass
Contributor Author

Switched to the SDK approach in 948a204: TrainerClient.train() with CustomTrainer, following the pipeline_v2_test.py pattern. wait_for_job_status(timeout=120) replaces the 300s kubectl wait.

Single node for now, since the Istio sidecar blocks gloo rendezvous in the kind cluster. Happy to bump to 2 when ambient mode lands.

Static YAML removed; the SDK creates the TrainJob on the fly.
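
For reference, the namespace-level Istio opt-out described in the PR summary could look roughly like this sketch; the istio-injection label key and the restore logic are assumptions based on that description:

from kubernetes import client, config

config.load_kube_config()
core_api = client.CoreV1Api()
namespace = "kubeflow"  # assumption: the namespace the test runs in

labels = core_api.read_namespace(namespace).metadata.labels or {}
previous_value = labels.get("istio-injection")
try:
    # Disable sidecar injection so the gloo rendezvous is not intercepted.
    core_api.patch_namespace(
        namespace, {"metadata": {"labels": {"istio-injection": "disabled"}}}
    )
    # ... create the TrainJob and wait for it here ...
finally:
    # Restore the previous label; patching a label to None removes it.
    core_api.patch_namespace(
        namespace, {"metadata": {"labels": {"istio-injection": previous_value}}}
    )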

Comment thread tests/trainer_test.sh Outdated
Comment thread tests/trainer_test.py Outdated
Comment thread tests/trainer_test.py Outdated
Comment thread tests/trainer_test.py Outdated
@Raakshass Raakshass changed the title fix(tests): update trainer test image to pytorch/pytorch:2.10.0 fix(tests): modernize trainer test to use Kubeflow SDK Mar 28, 2026
@Raakshass Raakshass marked this pull request as draft March 29, 2026 12:59
Comment thread tests/trainer_test.py Outdated
Comment thread tests/trainer_test.py Outdated
@Raakshass Raakshass marked this pull request as ready for review March 29, 2026 13:57
@abdullahpathan22

Hello @juliusvonkohout / @Raakshass,

I have reviewed the implementation and found a few items we need to address to prevent our CI from failing. Here are the specific code changes requested:

1. Incorrect SDK Package (tests/trainer_test.sh)

The script currently installs kubeflow, which is a legacy meta-package mostly used for Pipelines. The correct package that provides the TrainerClient is kubeflow-training.

diff
- pip install kubeflow
+ pip install kubeflow-training

2. Tight Timeout for Cold CI (tests/trainer_test.py)

A 120s timeout is very likely too short for a cold GitHub Actions runner. Pulling the python:3.12 image plus downloading and installing the CPU version of torch dynamically across 2 nodes will regularly exceed 2 minutes. Let's bump this to 10 minutes (600s) to guarantee stability.

diff
- trainer_client.wait_for_job_status(job_name, timeout=120, polling_interval=5)
+ trainer_client.wait_for_job_status(job_name, timeout=600, polling_interval=5)

3. Unpinned Dependencies (tests/trainer_test.py)

To protect our CI from breaking unexpectedly when new PyTorch versions drop, please pin the dependency in your CustomTrainer configuration to the version you developed this against.

diff
- packages_to_install=["torch"],
+ packages_to_install=["torch==2.2.1"],

4. Exception Logging (tests/trainer_test.py)

Regarding Julius's previous feedback on the TimeoutError exception—when you assign the variable, I highly recommend adding a sys.stdout.flush() right after print_diagnostic_info(...). CI environments often swallow the last few lines of stdout when a script crashes abruptly, so flushing ensures we actually see your awesome diagnostic logs.

diff
- except TimeoutError:
-     print_diagnostic_info(job_name, namespace)
-     raise
+ except TimeoutError as timeout_err:
+     print_diagnostic_info(job_name, namespace)
+     sys.stdout.flush()
+     raise timeout_err

Review Summary

  • trainer_test.sh: incorrect pip package, kubeflow-training required (High)
  • trainer_test.py: 120s timeout is unrealistic for cold image pulls (High)
  • trainer_test.py: unpinned torch dependency risks future CI breakages (Medium)
  • trainer_test.py: exception variable assignment and stdout flush needed (Low)
  • Git history: 9 duplicate commits that need to be squashed (Low)

@juliusvonkohout
Member

Hello @juliusvonkohout / @Raakshass,

I have reviewed the implementation and found a few items we need to address to prevent our CI from failing. Here are the specific code changes requested:

1. Incorrect SDK Package (tests/trainer_test.sh)

The script currently installs kubeflow, which is a legacy meta-package mostly used for Pipelines. The correct package that provides the TrainerClient is kubeflow-training.

diff
- pip install kubeflow
+ pip install kubeflow-training

Are you sure? https://sdk.kubeflow.org/en/latest/#quick-start looks correct regarding the python package.

2. Tight Timeout for Cold CI (tests/trainer_test.py)

A 120s timeout is very likely too short for a cold GitHub Actions runner. Pulling the python:3.12 image plus downloading and installing the CPU version of torch dynamically across 2 nodes will regularly exceed 2 minutes. Let's bump this to 10 minutes (600s) to guarantee stability.

diff
- trainer_client.wait_for_job_status(job_name, timeout=120, polling_interval=5)
+ trainer_client.wait_for_job_status(job_name, timeout=600, polling_interval=5)

1. It should be an image from the trainer registry. 2. 10 minutes is too long for a simple test; 5 minutes is already a lot.

3. Unpinned Dependencies (tests/trainer_test.py)

To protect our CI from breaking unexpectedly when new PyTorch versions drop, please pin the dependency in your CustomTrainer configuration to the version you developed this against.

diff
- packages_to_install=["torch"],
+ packages_to_install=["torch==2.2.1"],

But I want security updates, so maybe torch<3.

@juliusvonkohout
Member

Please check the tests in the trainer repository itself. In the best case you can just copy and adapt.

@juliusvonkohout
Member

See for example https://github.com/kubeflow/trainer/blob/35142e1a44be0e0fb7a388ba75363d125028f715/.github/workflows/test-e2e.yaml#L57. Can we just reproduce this? Try to stick as close as possible to it, since that is tested upstream. If we need bigger images for that, so be it :-D Or maybe https://github.com/kubeflow/trainer/blob/35142e1a44be0e0fb7a388ba75363d125028f715/.github/workflows/test-e2e.yaml#L61 is smaller/better.

@abdullahpathan22

abdullahpathan22 commented Apr 1, 2026

Hello @juliusvonkohout, L57 is mnist.ipynb (Kubernetes DDP) and L61 is local-training-mnist.ipynb (local) — I will go with L61 and try to reproduce it.

@Raakshass Raakshass force-pushed the fix/trainer-test-image branch from 5278f66 to 2aa40cc Compare April 1, 2026 20:55
Replaces legacy static trainer_job.yaml with a Python SDK test
(trainer_test.py) using TrainerClient.train() to create a single-node
training job on the fly.

- Use CustomTrainer(func=...) with num_nodes=1 (no DDP/Istio interference)
- Pin torch==2.2.1 for deterministic CI
- Set timeout=600s for cold CI runners
- Add print_diagnostic_information with sys.stdout.flush()
- Catch both RuntimeError and TimeoutError per SDK API
- pip install kubeflow (not kubeflow-training)
- No namespace-level Istio mutation
- No custom image (SDK default)

Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>
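
A sketch of the wait-and-diagnose flow this commit describes: print_diagnostic_information, the 600s timeout, and the RuntimeError/TimeoutError handling come from the commit message, while the trainer.kubeflow.org/v1alpha1 group/version and the surrounding names (trainer_client, job_name, namespace) are assumptions:

import sys

from kubernetes import client, config

def print_diagnostic_information(job_name: str, namespace: str) -> None:
    # Dump TrainJob status and pod phases via the kubernetes Python client
    # instead of shelling out to kubectl.
    config.load_kube_config()
    train_job = client.CustomObjectsApi().get_namespaced_custom_object(
        group="trainer.kubeflow.org",
        version="v1alpha1",
        namespace=namespace,
        plural="trainjobs",
        name=job_name,
    )
    print("TrainJob status:", train_job.get("status", {}))
    for pod in client.CoreV1Api().list_namespaced_pod(namespace).items:
        print("Pod:", pod.metadata.name, pod.status.phase)

try:
    trainer_client.wait_for_job_status(job_name, timeout=600, polling_interval=5)
except (RuntimeError, TimeoutError):
    print_diagnostic_information(job_name, namespace)
    sys.stdout.flush()  # ensure CI captures the diagnostics before the crash
    raise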
@Raakshass Raakshass force-pushed the fix/trainer-test-image branch from 2aa40cc to 9dd5c8b Compare April 2, 2026 14:21
@juliusvonkohout
Member

Thank you
/lgtm
/approve

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: juliusvonkohout

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow Bot merged commit c7bf926 into kubeflow:master Apr 4, 2026
8 checks passed
@Raakshass
Contributor Author

thank you
