[BLD]: fix Tilt setup flakes #4720
Conversation
Reviewer Checklist
Please leverage this checklist to ensure your code review is thorough before approving:
- Testing, Bugs, Errors, Logs, Documentation
- System Compatibility
- Quality
```diff
 matrix:
   python: ${{fromJson(inputs.python_versions)}}
-  platform: ["blacksmith-16vcpu-ubuntu-2204"]
+  platform: ["blacksmith-8vcpu-ubuntu-2204"]
```
compute is not a bottleneck according to the profiling I did (assuming most Rust crates are cached)
Infrastructure Refactor: Make Tilt-based CI Cluster Setup Deterministic and Less Flaky

This PR overhauls the CI cluster setup logic for Tilt-based workflows, addressing persistent flakiness in CI by moving to a "bare-metal" minikube approach and decoupling image building/pulling from pod deployment. It introduces a dedicated pre-build step for all required Docker images using a new docker-bake.hcl definition, pre-pulls external images in parallel, and cleans up the workflow to run all infrastructure on the host Docker engine, avoiding Docker-in-Docker entirely. Significant logic in the Tiltfile is reworked to use prebuilt/tagged images, custom builds, and clearer image targeting for CI vs. development. GitHub Actions are reorganized, external image pulling is scripted, and both Go and Python test workflows are updated accordingly.

Potential Impact:
- Functionality: Greatly reduces CI flakiness by ensuring all images are built and available before pods schedule, and improves the reliability of cluster bring-up steps in CI. Behavior during local development is unchanged.
- Performance: Negligible increase in CI setup time (<10%); build and image pull steps are parallelized where possible. CI runs should see improved test reliability, with a possible slight increase in test throughput due to the removal of costly Docker-in-Docker overhead.
- Security: No new direct risks; running "bare-metal" with minikube should not weaken isolation for CI, though the operational security model shifts slightly now that there is no DinD.
- Scalability: This setup will scale better for pipelines with multiple and larger images; parallel image building and pulling reduces bottlenecks in large test matrices.

Testing Needed:
- Ensure full …

Code Quality Assessment:
- Tiltfile: Significant, well-commented rework; clearly differentiates between CI and dev flows, though now more complex.
- .github/actions/tilt/action.yaml: Concise, logical sequence with helpful comments; use of GNU parallel may merit cross-platform scrutiny.
- .github/actions/tilt/docker-bake.hcl: Declarative; roles and tags for each image are now explicit and versioned for CI.
- .github/actions/tilt/pull_external_images.sh: Robust script that efficiently identifies literal images while excluding false positives; no hardcoded paths aside from known custom images.
- .github/workflows/_python-tests.yml: Clear migration to the new action and prebuild logic; minimal-risk changes.
- .github/workflows/_go-tests.yml: Simple update, same as above.

Potential Issues:
- If referenced images in k8s manifests become dynamic or templated in the future, the exclusion logic in the pull script may require updates.

This summary was automatically generated by @propel-code-bot
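To make the "dedicated pre-build step" concrete: a bake-file-driven pre-build typically reduces to a single CLI call. The sketch below is illustrative only — the `ci` group name and the platform override are assumptions, not taken from this PR's bake file.

```sh
# Hypothetical pre-build step: build every image the Tiltfile will reference
# before any pod is scheduled. The `ci` group name and the platform override
# are assumptions for illustration.
docker buildx bake \
  -f .github/actions/tilt/docker-bake.hcl \
  --set '*.platform=linux/amd64' \
  ci
```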
eculver left a comment:
I don't have any blocking feedback. I feel like the pull_external_images.sh script is a little magical, but works for me if it works for you. It's pretty awesome that pre-pulling the images fixes things.
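For anyone who shares that "a little magical" impression: the core idea can be sketched in a few lines. This is not the actual pull_external_images.sh — the manifest directory (`k8s/`) and the `^chroma` exclusion pattern are assumptions — but it shows the shape of the approach: find literal image references, drop the locally built ones, and pull the rest in parallel.

```sh
#!/usr/bin/env bash
# Sketch only: pre-pull every literal image referenced by the k8s manifests.
# The manifest directory (k8s/) and the exclusion pattern are assumptions,
# not the contents of the real pull_external_images.sh.
set -euo pipefail

# Find `image: <ref>` occurrences, strip the key, drop locally built images,
# de-duplicate, then pull in parallel (GNU parallel, as the action uses).
grep -rhoE 'image:[[:space:]]*[^[:space:]]+' k8s/ |
  sed -E 's/^image:[[:space:]]*//' |
  grep -v '^chroma' |
  sort -u |
  parallel -j4 docker pull {}
```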
Description of changes
Our cluster bring-up in CI is the primary cause of flaky workflow runs. I initially ran a few experiments without changing the `Tiltfile`. I hypothesized that there was maybe something specific to `kind` causing errors, so I also experimented with `minikube`. For each scenario, I collected at least 200 runs.

Results:
- `kind`, all images fully cached: never flakes
- `minikube`, all images fully cached: never flakes
- `minikube`, `rust/types/src/collection.rs` changed between runs: frequently flakes

These results seem to indicate that building fresh images concurrently with pods being deployed is problematic. I theorized that this was caused by the image builder hogging disk I/O or CPU, but wasn't able to prove this using iostat/workflow-telemetry-action.
Regardless, I decided to go ahead and try splitting image building/pulling into a separate step, before pods are deployed to the cluster. And flakes are now magically gone.
This seems to slightly increase setup times (<10%) but the increased reliability is well worth it.
We are also now running "bare-metal", aka no more Docker-in-Docker. One of the advantages is that Blacksmith's pull-through registry cache is used automatically without any extra config. We also may get a small performance win in tests but it's likely negligible.
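Put together, the new bring-up order on the host Docker engine looks roughly like the following. This is a sketch under assumptions, not the action's exact steps; the minikube driver flag in particular is an assumption.

```sh
# Approximate CI bring-up order after this PR (exact flags are assumptions):
minikube start --driver=docker                               # host engine, no DinD
docker buildx bake -f .github/actions/tilt/docker-bake.hcl   # pre-build all images
.github/actions/tilt/pull_external_images.sh                 # pre-pull external images
tilt ci                                                      # only now deploy pods
```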
Test plan
How are these changes tested?
I ran the Tilt action 250 times with these changes without a single flake.
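The PR doesn't show how the 250 runs were driven; one plausible way, assuming a manually dispatchable workflow (the file name and branch below are made up), is to loop `gh workflow run`:

```sh
# Hypothetical flake hunt: dispatch the workflow repeatedly via the GitHub CLI.
# The workflow file name and branch are assumptions for illustration.
for i in $(seq 1 250); do
  gh workflow run tilt-test.yml --ref fix-tilt-flakes
done
```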