
fix(tests): modernize trainer test to use Kubeflow SDK #3421

Merged

google-oss-prow[bot] merged 1 commit into kubeflow:master from Raakshass:fix/trainer-test-image on Apr 4, 2026
Conversation

@Raakshass
Contributor

@Raakshass Raakshass commented Mar 24, 2026

Summary

Replaces the legacy static trainer_job.yaml with a Python SDK test (trainer_test.py) that uses TrainerClient.train() to create a 2-node DDP training job on the fly.

Follow-up to #3413 (comment).

Root Cause

The CI test pinned kubeflowkatib/pytorch-mnist:v1beta1-45c5727 — a legacy Katib image unrelated to Trainer v2.2.0 — and ran a hardcoded /opt/pytorch-mnist/mnist.py entrypoint. The static YAML also relied on a blocking kubectl wait --timeout=300s, which could stall the integration test run for the full five minutes on failure.

Changes

  • tests/trainer_job.yaml — Deleted; static YAML replaced by the SDK
  • tests/trainer_test.py — New; TrainerClient.train() with CustomTrainer, 2-node DDP (gloo), synthetic data, timeout=120. Diagnostics via the kubernetes Python client. Istio sidecar opt-out via Python with try/finally restore.
  • tests/trainer_test.sh — Replaced kubectl apply/kubectl wait with the SDK invocation; added pip install kubeflow
  • .github/workflows/trainer_test.yaml — Removed the duplicate Install Dependencies step (handled by trainer_test.sh)

Design Decisions

  • SDK approach: TrainerClient.train() + wait_for_job_status(timeout=120) — follows pipeline_v2_test.py pattern (see the sketch after this list)
  • CPU-only torch: python:3.12 base image + pip_index_urls=["https://download.pytorch.org/whl/cpu"] — no GPU image, no preloads
  • Istio sidecar: Namespace label temporarily set to disabled via kubernetes.client.CoreV1Api.patch_namespace(), restored in finally block. CustomTrainer does not expose pod template annotations, making namespace-level labeling the only available mechanism.
  • Diagnostics: Timeout handler uses kubernetes.client.CustomObjectsApi and CoreV1Api — no subprocess calls
  • Resources: 500Mi / 500m — matches master
  • Naming: Full, expressive names throughout — distributed_training_function, training_dataset, torch.distributed (no aliases)
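
As a sketch of how these decisions compose (referenced in the first bullet above), assuming the Kubeflow SDK's TrainerClient/CustomTrainer API quoted later in this thread; the merged test's exact code may differ:

from kubeflow.trainer import CustomTrainer, TrainerClient

def distributed_training_function():
    # Runs on every training node; gloo backend, so CPU-only torch suffices.
    import torch
    import torch.distributed

    torch.distributed.init_process_group(backend="gloo")
    rank = torch.distributed.get_rank()
    training_dataset = torch.randn(64, 1, 28, 28)  # synthetic MNIST-shaped data
    print(f"rank {rank}: batch mean {training_dataset.mean().item():.4f}", flush=True)
    torch.distributed.destroy_process_group()

trainer_client = TrainerClient()
job_name = trainer_client.train(
    trainer=CustomTrainer(
        func=distributed_training_function,
        num_nodes=2,
        resources_per_node={"cpu": "500m", "memory": "500Mi"},
        packages_to_install=["torch"],
        pip_index_urls=["https://download.pytorch.org/whl/cpu"],
    ),
)
trainer_client.wait_for_job_status(job_name, timeout=120, polling_interval=5)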

Checklist

  • DCO sign-off on all commits
  • 2 training nodes per review feedback

Copilot AI review requested due to automatic review settings March 24, 2026 12:25
@github-actions

Welcome to the Kubeflow Manifests Repository

Thanks for opening your first PR. Your contribution means a lot to the Kubeflow community.

Before making more PRs:
Please ensure your PR follows our Contributing Guide.
Please also be aware that many components are synchronized from upstream via the scripts in /scripts, so in some cases you have to fix the problem in the upstream repositories first; you can use a PR against kubeflow/manifests to test the platform integration.

Community Resources:

Thanks again for helping to improve Kubeflow.

Contributor

Copilot AI left a comment


Pull request overview

Updates the Trainer CI smoke test manifest to stop relying on a legacy Katib MNIST image and instead run a minimal torch distributed (gloo) check using the Torch Distributed runtime’s PyTorch base image.

Changes:

  • Replaces the test container image with pytorch/pytorch:2.10.0-cuda12.8-cudnn9-runtime.
  • Replaces the previous MNIST entrypoint with an inline-generated Python script executed via torchrun --standalone.
  • Increases the per-node memory request from 500Mi to 2Gi.

Comment thread tests/trainer_job.yaml Outdated
Comment thread tests/trainer_job.yaml Outdated
@Raakshass Raakshass marked this pull request as draft March 24, 2026 12:46
@Raakshass Raakshass marked this pull request as ready for review March 24, 2026 13:57
@Raakshass
Contributor Author

/retest

@google-oss-prow

@Raakshass: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

Details

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@juliusvonkohout
Member

Is that really what the trainer Python SDK generates as YAML?

@Raakshass
Contributor Author

Is that really what the trainer Python SDK generates as YAML?

Not exactly — the SDK does not produce static YAML. TrainerClient.train() dynamically creates a TrainJob from a Python function. From the upstream notebook (examples/pytorch/image-classification/mnist.ipynb):

client.train(
    trainer=CustomTrainer(
        func=train_fashion_mnist,
        num_nodes=3,
        resources_per_node={"cpu": 3, "memory": "16Gi"},
    ),
    runtime=torch_runtime,
)

The upstream end-to-end test for torch-distributed (test/e2e/e2e_test.go) is even more minimal — it creates a TrainJob with only runtimeRef and no trainer block at all:

trainJob := testingutil.MakeTrainJobWrapper(ns.Name, "e2e-test-torch").
    RuntimeRef(..., torchRuntime).
    Obj()

Our test YAML sits between these two patterns: it uses runtimeRef: torch-distributed to inherit the image from the runtime, but overrides trainer.command with an inline DDP gloo smoke test and sets trainer.resourcesPerNode. All three fields (command, numNodes, resourcesPerNode) are optional overrides per the Trainer struct.
Happy to simplify to a bare runtimeRef-only TrainJob like the upstream end-to-end test if that is preferred — would just need to confirm the runtime's default command provides a sufficient smoke test on its own.
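
For what it's worth, a runtimeRef-only TrainJob could also be created from Python with the raw kubernetes client, roughly like this sketch (the trainer.kubeflow.org/v1alpha1 group/version and the kubeflow namespace are assumptions):

from kubernetes import client, config

config.load_kube_config()
client.CustomObjectsApi().create_namespaced_custom_object(
    group="trainer.kubeflow.org",
    version="v1alpha1",
    namespace="kubeflow",  # assumption: target namespace
    plural="trainjobs",
    body={
        "apiVersion": "trainer.kubeflow.org/v1alpha1",
        "kind": "TrainJob",
        "metadata": {"name": "e2e-test-torch"},
        "spec": {"runtimeRef": {"name": "torch-distributed"}},
    },
)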

@juliusvonkohout
Member

juliusvonkohout commented Mar 25, 2026

Yes, please just the minimal MNIST example with 2 nodes and way less CPU and memory requests, similar to what we request right now.

@google-oss-prow google-oss-prow Bot added size/M and removed size/S labels Mar 25, 2026
@Raakshass
Contributor Author

Yes, please just the minimal MNIST example with 2 nodes and way less CPU and memory requests, similar to what we request right now.

Updated to a minimal MNIST CNN with DDP across 2 nodes, resources matching master (500Mi, 500m). Uses the same CNN architecture as the upstream SDK notebook (https://github.com/kubeflow/trainer/blob/master/examples/pytorch/image-classification/mnist.ipynb) with synthetic MNIST-shaped data to avoid a network dependency in CI. Removed --standalone so torchrun picks up the PET_ environment variables the torch plugin injects for multi-node rendezvous.

@Raakshass Raakshass marked this pull request as draft March 25, 2026 17:12
@juliusvonkohout
Member

Yes, please just the minimal MNIST example with 2 nodes and way less CPU and memory requests, similar to what we request right now.

Updated to a minimal MNIST CNN with DDP across 2 nodes, resources matching master (500Mi, 500m). Uses the same CNN architecture as the upstream SDK notebook (https://github.com/kubeflow/trainer/blob/master/examples/pytorch/image-classification/mnist.ipynb) with synthetic MNIST-shaped data to avoid a network dependency in CI. Removed --standalone so torchrun picks up the PET_ environment variables the torch plugin injects for multi-node rendezvous.

You can also just use the trainer SDK and create it within the GHA on the fly.

@Raakshass
Contributor Author

Raakshass commented Mar 25, 2026

Yes, please just the minimal MNIST example with 2 nodes and way less CPU and memory requests, similar to what we request right now.

Updated to a minimal MNIST CNN with DDP across 2 nodes, resources matching master (500Mi, 500m). Uses the same CNN architecture as the upstream SDK notebook (https://github.com/kubeflow/trainer/blob/master/examples/pytorch/image-classification/mnist.ipynb) with synthetic MNIST-shaped data to avoid a network dependency in CI. Removed --standalone so torchrun picks up the PET_ environment variables the torch plugin injects for multi-node rendezvous.

You can also just use the trainer SDK and create it within the GHA on the fly.

Happy to switch to the SDK approach if you prefer. The current change is static YAML (2 test files). Should I redo it with TrainerClient.train() instead?

@Raakshass Raakshass marked this pull request as ready for review March 25, 2026 19:41
Comment thread tests/trainer_test.sh Outdated
@juliusvonkohout
Member

juliusvonkohout commented Mar 26, 2026

Yes, TrainerClient.train() is a good approach, but you have to stay within 60-120 seconds for the training run, similar to how we test KFP v1 and v2.

Raakshass added a commit to Raakshass/manifests that referenced this pull request Mar 27, 2026
Replace static TrainJob YAML with a Python script that uses
TrainerClient.train() to create and monitor the training job,
following the same pattern as pipeline_v2_test.py.

- Add tests/trainer_test.py: SDK-based test with minimal MNIST DDP,
  single-node, 120s timeout, synthetic data
- Update tests/trainer_test.sh: replace kubectl apply with SDK call,
  keep prerequisite checks for fail-fast diagnostics
- Update trainer_test.yaml: install kubeflow pip package
- Remove tests/trainer_job.yaml: no longer needed

Ref: kubeflow#3421

Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>
@Raakshass
Contributor Author

Switched to the SDK approach in 948a204: TrainerClient.train() with CustomTrainer, following the pipeline_v2_test.py pattern. wait_for_job_status(timeout=120) replaces the 300s kubectl wait.

Single node for now, since the Istio sidecar blocks gloo rendezvous in the kind cluster. Happy to bump to 2 when ambient mode lands.

Static YAML removed; the SDK creates the TrainJob on the fly.
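
For reference, the namespace-level Istio opt-out described in the PR summary could look roughly like this sketch; the istio-injection label key and the restore logic are assumptions based on that description:

from kubernetes import client, config

config.load_kube_config()
core_api = client.CoreV1Api()
namespace = "kubeflow"  # assumption: the namespace the test runs in

labels = core_api.read_namespace(namespace).metadata.labels or {}
previous_value = labels.get("istio-injection")
try:
    # Disable sidecar injection so the gloo rendezvous is not intercepted.
    core_api.patch_namespace(
        namespace, {"metadata": {"labels": {"istio-injection": "disabled"}}}
    )
    # ... create the TrainJob and wait for it here ...
finally:
    # Restore the previous label; patching a label to None removes it.
    core_api.patch_namespace(
        namespace, {"metadata": {"labels": {"istio-injection": previous_value}}}
    )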

Comment thread tests/trainer_test.sh Outdated
Comment thread tests/trainer_test.py Outdated
Comment thread tests/trainer_test.py Outdated
Comment thread tests/trainer_test.py Outdated
@Raakshass Raakshass changed the title fix(tests): update trainer test image to pytorch/pytorch:2.10.0 fix(tests): modernize trainer test to use Kubeflow SDK Mar 28, 2026
@Raakshass Raakshass marked this pull request as draft March 29, 2026 12:59
Comment thread tests/trainer_test.py Outdated
Comment thread tests/trainer_test.py Outdated
@Raakshass Raakshass marked this pull request as ready for review March 29, 2026 13:57
@abdullahpathan22

Hello @juliusvonkohout / @Raakshass,

I have reviewed the implementation and found a few items we need to address to prevent our CI from failing. Here are the specific code changes requested:

1. Incorrect SDK Package (tests/trainer_test.sh)

The script currently installs kubeflow, which is a legacy meta-package mostly used for Pipelines. The correct package that provides the TrainerClient is kubeflow-training.

diff
- pip install kubeflow
+ pip install kubeflow-training

2. Tight Timeout for Cold CI (tests/trainer_test.py)

A 120s timeout is very likely too short for a cold GitHub Actions runner. Pulling the python:3.12 image plus downloading and installing the CPU version of torch dynamically across 2 nodes will regularly exceed 2 minutes. Let's bump this to 10 minutes (600s) to guarantee stability.

diff
- trainer_client.wait_for_job_status(job_name, timeout=120, polling_interval=5)
+ trainer_client.wait_for_job_status(job_name, timeout=600, polling_interval=5)

3. Unpinned Dependencies (tests/trainer_test.py)

To protect our CI from breaking unexpectedly when new PyTorch versions drop, please pin the dependency in your CustomTrainer configuration to the version you developed this against.

diff
- packages_to_install=["torch"],
+ packages_to_install=["torch==2.2.1"],

4. Exception Logging (tests/trainer_test.py)

Regarding Julius's previous feedback on the TimeoutError exception—when you assign the variable, I highly recommend adding a sys.stdout.flush() right after print_diagnostic_info(...). CI environments often swallow the last few lines of stdout when a script crashes abruptly, so flushing ensures we actually see your awesome diagnostic logs.

diff
- except TimeoutError:
-     print_diagnostic_info(job_name, namespace)
-     raise
+ except TimeoutError as timeout_err:
+     print_diagnostic_info(job_name, namespace)
+     sys.stdout.flush()
+     raise timeout_err

Review Summary

  • trainer_test.sh: incorrect pip package, kubeflow-training required (High)
  • trainer_test.py: 120s timeout is unrealistic for cold image pulls (High)
  • trainer_test.py: unpinned torch dependency risks future CI breakages (Medium)
  • trainer_test.py: exception variable assignment and stdout flush needed (Low)
  • Git history: 9 duplicate commits that need to be squashed (Low)

@juliusvonkohout
Member

Hello @juliusvonkohout / @Raakshass,

I have reviewed the implementation and found a few items we need to address to prevent our CI from failing. Here are the specific code changes requested:

1. Incorrect SDK Package (tests/trainer_test.sh)

The script currently installs kubeflow, which is a legacy meta-package mostly used for Pipelines. The correct package that provides the TrainerClient is kubeflow-training.

diff
- pip install kubeflow
+ pip install kubeflow-training

Are you sure? https://sdk.kubeflow.org/en/latest/#quick-start looks correct regarding the python package.

2. Tight Timeout for Cold CI (tests/trainer_test.py)

A 120s timeout is very likely too short for a cold GitHub Actions runner. Pulling the python:3.12 image plus downloading and installing the CPU version of torch dynamically across 2 nodes will regularly exceed 2 minutes. Let's bump this to 10 minutes (600s) to guarantee stability.

diff
- trainer_client.wait_for_job_status(job_name, timeout=120, polling_interval=5)
+ trainer_client.wait_for_job_status(job_name, timeout=600, polling_interval=5)

1. It should be an image from the trainer registry. 2. 10 minutes is too long for a simple test; 5 minutes is already a lot.

3. Unpinned Dependencies (tests/trainer_test.py)

To protect our CI from breaking unexpectedly when new PyTorch versions drop, please pin the dependency in your CustomTrainer configuration to the version you developed this against.

diff
- packages_to_install=["torch"],
+ packages_to_install=["torch==2.2.1"],

But I want security updates, so maybe torch<3.

@juliusvonkohout
Member

Please check the tests in the trainer repository itself. In the best case you can just copy and adapt.

@juliusvonkohout
Member

See for example https://github.com/kubeflow/trainer/blob/35142e1a44be0e0fb7a388ba75363d125028f715/.github/workflows/test-e2e.yaml#L57. Can we just reproduce this? Try to stick as close as possible to it, since that is tested upstream. If we need bigger images for that, so be it :-D Or maybe https://github.com/kubeflow/trainer/blob/35142e1a44be0e0fb7a388ba75363d125028f715/.github/workflows/test-e2e.yaml#L61 is smaller/better.

@abdullahpathan22

abdullahpathan22 commented Apr 1, 2026

Hello @juliusvonkohout, L57 is mnist.ipynb (Kubernetes DDP) and L61 is local-training-mnist.ipynb (local) — I will go with L61 and try to reproduce it.

@Raakshass Raakshass force-pushed the fix/trainer-test-image branch from 5278f66 to 2aa40cc Compare April 1, 2026 20:55
Replaces legacy static trainer_job.yaml with a Python SDK test
(trainer_test.py) using TrainerClient.train() to create a single-node
training job on the fly.

- Use CustomTrainer(func=...) with num_nodes=1 (no DDP/Istio interference)
- Pin torch==2.2.1 for deterministic CI
- Set timeout=600s for cold CI runners
- Add print_diagnostic_information with sys.stdout.flush()
- Catch both RuntimeError and TimeoutError per SDK API
- pip install kubeflow (not kubeflow-training)
- No namespace-level Istio mutation
- No custom image (SDK default)

Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>
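
A sketch of the wait-and-diagnose flow this commit describes: print_diagnostic_information, the 600s timeout, and the RuntimeError/TimeoutError handling come from the commit message, while the trainer.kubeflow.org/v1alpha1 group/version and the surrounding names (trainer_client, job_name, namespace) are assumptions:

import sys

from kubernetes import client, config

def print_diagnostic_information(job_name: str, namespace: str) -> None:
    # Dump TrainJob status and pod phases via the kubernetes Python client
    # instead of shelling out to kubectl.
    config.load_kube_config()
    train_job = client.CustomObjectsApi().get_namespaced_custom_object(
        group="trainer.kubeflow.org",
        version="v1alpha1",
        namespace=namespace,
        plural="trainjobs",
        name=job_name,
    )
    print("TrainJob status:", train_job.get("status", {}))
    for pod in client.CoreV1Api().list_namespaced_pod(namespace).items:
        print("Pod:", pod.metadata.name, pod.status.phase)

try:
    trainer_client.wait_for_job_status(job_name, timeout=600, polling_interval=5)
except (RuntimeError, TimeoutError):
    print_diagnostic_information(job_name, namespace)
    sys.stdout.flush()  # ensure CI captures the diagnostics before the crash
    raise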
@Raakshass Raakshass force-pushed the fix/trainer-test-image branch from 2aa40cc to 9dd5c8b Compare April 2, 2026 14:21
@juliusvonkohout
Member

Thank you
/lgtm
/approve

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: juliusvonkohout

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow Bot merged commit c7bf926 into kubeflow:master Apr 4, 2026
8 checks passed
@Raakshass
Contributor Author

thank you
