fix(tests): modernize trainer test to use Kubeflow SDK #3421
google-oss-prow[bot] merged 1 commit into kubeflow:master
Conversation
Welcome to the Kubeflow Manifests Repository! Thanks for opening your first PR. Your contribution means a lot to the Kubeflow community.
Thanks again for helping to improve Kubeflow.
Pull request overview
Updates the Trainer CI smoke test manifest to stop relying on a legacy Katib MNIST image and instead run a minimal torch distributed (gloo) check using the Torch Distributed runtime’s PyTorch base image.
Changes:
- Replaces the test container image with `pytorch/pytorch:2.10.0-cuda12.8-cudnn9-runtime`.
- Replaces the previous MNIST entrypoint with an inline-generated Python script executed via `torchrun --standalone`.
- Increases the per-node memory request from `500Mi` to `2Gi`.
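For illustration, a minimal torch distributed (gloo) check of the kind described here could look like the following sketch; the actual inline script in the manifest may differ:

```python
# smoke_test.py: sketch of a minimal torch.distributed (gloo) check.
# Run with: torchrun --standalone --nproc_per_node=2 smoke_test.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")  # CPU-only rendezvous, no GPU needed
rank, world = dist.get_rank(), dist.get_world_size()

t = torch.ones(1)
dist.all_reduce(t, op=dist.ReduceOp.SUM)  # every rank contributes 1
assert int(t.item()) == world, f"all_reduce mismatch on rank {rank}"

print(f"rank {rank}/{world}: gloo all_reduce OK")
dist.destroy_process_group()
```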
/retest
@Raakshass: Cannot trigger testing until a trusted user reviews the PR and leaves an `/ok-to-test` message.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Is that really what the trainer Python SDK generates as YAML?
Not exactly: the SDK does not produce static YAML. The Python SDK builds the TrainJob programmatically, e.g.:

```python
client.train(
    trainer=CustomTrainer(
        func=train_fashion_mnist,
        num_nodes=3,
        resources_per_node={"cpu": 3, "memory": "16Gi"},
    ),
    runtime=torch_runtime,
)
```

The upstream end-to-end test constructs the TrainJob in Go:

```go
trainJob := testingutil.MakeTrainJobWrapper(ns.Name, "e2e-test-torch").
    RuntimeRef(..., torchRuntime).
    Obj()
```

Our test YAML sits between these two patterns: it uses `runtimeRef: torch-distributed` to inherit the image from the runtime, but overrides `trainer.command` with an inline DDP gloo smoke test and sets `trainer.resourcesPerNode`. All three fields (`command`, `numNodes`, `resourcesPerNode`) are optional overrides per the Trainer struct.
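To make that "in between" shape concrete, here is a hedged sketch of such a TrainJob expressed as the dict a `kubernetes.client.CustomObjectsApi` call would submit. The `trainer.kubeflow.org/v1alpha1` group/version and exact field names are assumptions based on the Trainer v2 API; the command path and resource values are illustrative:

```python
from kubernetes import client, config

config.load_kube_config()

train_job = {
    "apiVersion": "trainer.kubeflow.org/v1alpha1",
    "kind": "TrainJob",
    "metadata": {"name": "trainer-smoke-test"},
    "spec": {
        "runtimeRef": {"name": "torch-distributed"},  # image comes from the runtime
        "trainer": {  # all three fields below are optional overrides
            "command": ["torchrun", "--standalone", "/tmp/smoke_test.py"],
            "numNodes": 2,
            "resourcesPerNode": {"requests": {"cpu": "500m", "memory": "500Mi"}},
        },
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="trainer.kubeflow.org",
    version="v1alpha1",
    namespace="default",
    plural="trainjobs",
    body=train_job,
)
```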
Yes, please: just the minimal MNIST example with 2 nodes and much lower CPU and memory requests, similar to what we request right now.
Updated to a minimal MNIST CNN with DDP across 2 nodes, resources matching master (500Mi, 500m). Uses the same CNN architecture as the upstream SDK notebook (https://github.com/kubeflow/trainer/blob/master/examples/pytorch/image-classification/mnist.ipynb) with synthetic MNIST-shaped data to avoid a network dependency in CI. removed
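For illustration, the synthetic-data idea boils down to random tensors with MNIST dimensions (1x28x28 images, 10 classes), so CI never downloads a dataset; a minimal sketch:

```python
# Sketch: synthetic MNIST-shaped data so CI needs no network access.
import torch
from torch.utils.data import DataLoader, TensorDataset

images = torch.randn(256, 1, 28, 28)   # MNIST-shaped inputs
labels = torch.randint(0, 10, (256,))  # 10 class labels
loader = DataLoader(TensorDataset(images, labels), batch_size=32)
```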
You can also just use the trainer SDK and create it within the GHA on the fly.
Happy to switch to the SDK approach if you prefer. The current change is static YAML (2 test files). Should I redo it with TrainerClient.train() instead?
Yes, TrainerClient.train() is a good approach. But you have to stay within 60-120 seconds for the training run, similar to how we test KFP v1 and v2.
Replace static TrainJob YAML with a Python script that uses TrainerClient.train() to create and monitor the training job, following the same pattern as pipeline_v2_test.py.

- Add tests/trainer_test.py: SDK-based test with minimal MNIST DDP, single-node, 120s timeout, synthetic data
- Update tests/trainer_test.sh: replace kubectl apply with SDK call, keep prerequisite checks for fail-fast diagnostics
- Update trainer_test.yaml: install kubeflow pip package
- Remove tests/trainer_job.yaml: no longer needed

Ref: kubeflow#3421
Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>
Switched to the SDK approach in 948a204. Single node for now, since the Istio sidecar blocks gloo rendezvous in the kind cluster; happy to bump to 2 nodes when ambient mode lands. Static YAML removed; the SDK creates the TrainJob on the fly.
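Roughly, the SDK path looks like the sketch below, using the API names discussed in this thread (TrainerClient, CustomTrainer, wait_for_job_status); exact signatures may differ from the released kubeflow package, and `get_runtime` is assumed here as the runtime lookup:

```python
from kubeflow.trainer import CustomTrainer, TrainerClient


def train_func():
    # Tiny single-node "training" step so the job fits the CI time budget.
    import torch

    x = torch.randn(64, 10)
    w = torch.randn(10, 1, requires_grad=True)
    loss = (x @ w).pow(2).mean()
    loss.backward()
    print(f"synthetic step done, loss={loss.item():.4f}")


client = TrainerClient()
job_name = client.train(
    trainer=CustomTrainer(func=train_func, num_nodes=1),
    runtime=client.get_runtime("torch-distributed"),  # assumed lookup helper
)
client.wait_for_job_status(job_name, timeout=120)  # fail the test past the budget
print(f"TrainJob {job_name} completed")
```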
Hello @juliusvonkohout / @Raakshass, I have reviewed the implementation and found a few items we need to address to prevent our CI from failing. Here are the specific findings:
| Area | Finding | Severity |
|---|---|---|
| trainer_test.sh | Incorrect pip package (kubeflow-training required) | High |
| trainer_test.py | 120s timeout is unrealistic for cold image pulls | High |
| trainer_test.py | Unpinned torch dependency risks future CI breakages | Medium |
| trainer_test.py | Exception variable assignment and stdout flush needed | Low |
| Git History | 9 duplicate commits that need to be squashed | Low |
Are you sure? https://sdk.kubeflow.org/en/latest/#quick-start looks correct regarding the Python package.
But I want security updates, so maybe `torch<3`.
Please check the tests in the trainer repository itself. In the best case you can just copy and adapt.
See for example https://github.com/kubeflow/trainer/blob/35142e1a44be0e0fb7a388ba75363d125028f715/.github/workflows/test-e2e.yaml#L57 - can we just reproduce this? Try to stick as close as possible to it, since that is tested upstream. If we need bigger images for that, so be it :-D Or maybe https://github.com/kubeflow/trainer/blob/35142e1a44be0e0fb7a388ba75363d125028f715/.github/workflows/test-e2e.yaml#L61 is smaller/better.
Hello @juliusvonkohout, L57 is
Force-pushed from 5278f66 to 2aa40cc.
Replaces legacy static trainer_job.yaml with a Python SDK test (trainer_test.py) using TrainerClient.train() to create a single-node training job on the fly.

- Use CustomTrainer(func=...) with num_nodes=1 (no DDP/Istio interference)
- Pin torch==2.2.1 for deterministic CI
- Set timeout=600s for cold CI runners
- Add print_diagnostic_information with sys.stdout.flush()
- Catch both RuntimeError and TimeoutError per SDK API
- pip install kubeflow (not kubeflow-training)
- No namespace-level Istio mutation
- No custom image (SDK default)

Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>
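For illustration, the failure handling this commit describes might look roughly like the following sketch; the `print_diagnostic_information` body is a placeholder, the job name is illustrative, and the exception types follow the commit message rather than a verified SDK contract:

```python
import sys

from kubeflow.trainer import TrainerClient

client = TrainerClient()
job_name = "trainer-smoke-test"  # returned by client.train(...) in the real test


def print_diagnostic_information(name: str) -> None:
    # The real test dumps TrainJob status and pod logs via the kubernetes
    # Python client; a placeholder print stands in here.
    print(f"collecting diagnostics for TrainJob {name}")
    sys.stdout.flush()  # make sure CI captures output before the process exits


try:
    client.wait_for_job_status(job_name, timeout=600)  # cold runners pull slowly
except (RuntimeError, TimeoutError) as err:
    print_diagnostic_information(job_name)
    raise SystemExit(f"trainer smoke test failed: {err}")
```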
Force-pushed from 2aa40cc to 9dd5c8b.
Thank you
[APPROVALNOTIFIER] This PR is APPROVED.
This pull-request has been approved by: juliusvonkohout
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Thank you.
Summary
Replaces the legacy static `trainer_job.yaml` with a Python SDK test (`trainer_test.py`) that uses `TrainerClient.train()` to create a 2-node DDP training job on the fly.
Follow-up to #3413 (comment).
Root Cause
The CI test pinned `kubeflowkatib/pytorch-mnist:v1beta1-45c5727`, a legacy Katib image unrelated to Trainer v2.2.0, and ran a hardcoded `/opt/pytorch-mnist/mnist.py` entrypoint. The static YAML also used `kubectl wait --timeout=300s`, blocking integration tests.
Changes
- `tests/trainer_job.yaml`: removed, replaced by `tests/trainer_test.py` using `TrainerClient.train()` with `CustomTrainer`, 2-node DDP gloo, synthetic data, `timeout=120`. Diagnostics via the `kubernetes` Python client. Istio sidecar opt-out via Python with `try/finally` restore.
- `tests/trainer_test.sh`: replaced `kubectl apply`/`kubectl wait` with the SDK invocation; added `pip install kubeflow`.
- `trainer_test.yaml`: no separate Install Dependencies step (handled by trainer_test.sh).
Design Decisions
- `TrainerClient.train()` + `wait_for_job_status(timeout=120)`: follows the `pipeline_v2_test.py` pattern.
- `python:3.12` base image + `pip_index_urls=["https://download.pytorch.org/whl/cpu"]`: no GPU image, no preloads.
- `istio-injection: disabled` via `kubernetes.client.CoreV1Api.patch_namespace()`, restored in a `finally` block. `CustomTrainer` does not expose pod template annotations, making namespace-level labeling the only available mechanism.
- Diagnostics via `kubernetes.client.CustomObjectsApi` and `CoreV1Api`: no subprocess calls.
- `500Mi`/`500m`: matches master.
- `training_dataset`, `torch.distributed` (no aliases).
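A sketch of the namespace-level opt-out with `try/finally` restore described above; the namespace name is illustrative, and the label key/value follow the standard Istio injection convention:

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()
ns = "kubeflow-user-example-com"  # illustrative test namespace

original = (core.read_namespace(ns).metadata.labels or {}).get("istio-injection")
try:
    core.patch_namespace(
        ns, {"metadata": {"labels": {"istio-injection": "disabled"}}}
    )
    # ... create the TrainJob and wait for completion here ...
finally:
    # Restore the previous label; a None value deletes the key via merge patch.
    core.patch_namespace(
        ns, {"metadata": {"labels": {"istio-injection": original}}}
    )
```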