[BLD]: fix Tilt setup flakes #4720

codetheweb · 2025-06-02T23:45:26Z

Description of changes

Our cluster bring up in CI is the primary cause of flaky workflow runs. I initially ran a few experiments without changing the Tiltfile. I hypothesized that there was maybe something specific to kind causing errors, so I also experimented with minikube. For each scenario, I collected at least 200 runs.

Results:

kind, all images are fully cached: never flakes
minikube, all images are fully cached: never flakes
minikube, rust/types/src/collection.rs changed between runs: frequently flakes

These results seem to indicate that building fresh images concurrently with pods being deployed is problematic. I theroized that this was caused by the image builder hogging disk I/O or CPU, but wasn't able to prove this using iostat/workflow-telemetry-action.

Regardless, I decided to go ahead and try splitting image building/pulling into a separate step, before pods are deployed to the cluster. And flakes are now magically gone.

This seems to slightly increase setup times (<10%) but the increased reliability is well worth it.

We are also now running "bare-metal", aka no more Docker-in-Docker. One of the advantages is that Blacksmith's pull-through registry cache is used automatically without any extra config. We also may get a small performance win in tests but it's likely negligible.

Test plan

How are these changes tested?

I ran the Tilt action 250 times with these changes without a single flake.

github-actions · 2025-06-02T23:45:49Z

codetheweb · 2025-06-04T19:58:17Z

.github/workflows/_python-tests.yml

      matrix:
        python: ${{fromJson(inputs.python_versions)}}
-        platform: ["blacksmith-16vcpu-ubuntu-2204"]
+        platform: ["blacksmith-8vcpu-ubuntu-2204"]


compute is not a bottleneck according to the profiling I did (assuming most Rust crates are cached)

propel-code-bot · 2025-06-04T20:45:29Z

Infrastructure Refactor: Make Tilt-based CI Cluster Setup Deterministic and Less Flaky

This PR overhauls the CI cluster setup logic for Tilt-based workflows, addressing persistent flakiness in CI by moving to a 'bare-metal' minikube approach and decoupling image building/pulling from pod deployment. It introduces a dedicated pre-build step for all required Docker images using a new docker-bake.hcl definition, pre-pulls external images in parallel, and cleans up the workflow to run all infrastructure on the host Docker engine, avoiding Docker-in-Docker entirely. Significant logic in the Tiltfile is reworked to use prebuilt/tagged images, custom builds, and clearer image targeting for CI vs. development. GitHub Actions are reorganized, external image pulling is scripted, and both Go and Python test workflows are updated accordingly.

Key Changes:
• Switched CI cluster provisioning from kind/Docker-in-Docker to host-based minikube with driver=none.
• Introduced .github/actions/tilt/docker-bake.hcl for explicit multi-image Docker builds via buildx bake.
• Tiltfile refactored to use custom_build with docker image tag for CI, ensuring only prebuilt/tagged images are used during testing.
• New pull_external_images.sh script pre-pulls all external images referenced in k8s manifests using a parallelized docker-compose pull.
• GitHub Actions composite action (action.yaml) rewritten to orchestrate minikube, image baking, and external image pulls in parallel pre-steps.
• Tilt startup is now reliably sequenced after prebuilding images, eliminating flaky race conditions.
• Test workflows (_python-tests.yml, _go-tests.yml) updated to use new action flow and platforms adjusted.

Affected Areas:
• .github/actions/tilt/
• .github/workflows/_python-tests.yml
• .github/workflows/_go-tests.yml
• Tiltfile

Potential Impact:

Functionality: Greatly reduces CI flakiness by ensuring all images are built and available before pods schedule, and improves reliability of cluster bring-up steps in CI. Behavior during local development is unchanged.

Performance: Negligible increase in CI setup time (<10%), but build and image pull steps are parallelized where possible. CI runs should experience improved test reliability, with possible slight test throughput increase due to removal of Docker-in-Docker costly overhead.

Security: No new direct risks; running 'bare-metal' with minikube should not weaken isolation for CI, but operational security model shifts slightly since no DinD.

Scalability: This setup will scale better for pipelines with multiple and larger images; parallel image building and pulling reduces bottlenecks in large test matrices.

Review Focus:
• Check that image references/tags are correct and consistently mapped from build to pod deployment.
• Review the image exclusion/inclusion logic in pull_external_images.sh for edge cases.
• Ensure the new build and cluster setup sequence in action.yaml matches documented intent and doesn't skip any critical initialization.
• Validate Tiltfile logic, especially custom_build vs. docker_build branches, for correctness and CI/developer parity.
• Inspect for potential regressions in how Helm chart templating interacts with pre-pulled images.

Testing Needed

• Ensure full CI workflow passes repeatedly with no flakes across all platforms.
• Verify that external images are successfully pulled in parallel and are available to pods post-cluster bring-up.
• Test that developer local workflows (non-CI) are unaffected.
• Confirm workflow environment variables and port-forwards are being set up as expected for all services.

Code Quality Assessment

Tiltfile: Significant, well-commented rework; clearly differentiates between CI and dev flows, though now more complex.

.github/actions/tilt/action.yaml: Concise, logical sequence, helpful comments; use of GNU parallel may merit cross-platform scrutiny.

.github/actions/tilt/docker-bake.hcl: Declarative, roles and tags for each image now explicit and versioned for CI.

.github/actions/tilt/pull_external_images.sh: Script is robust, efficiently identifies literal images while excluding false positives; no hardcoded paths aside from known custom images.

.github/workflows/_python-tests.yml: Clear migration to new action and prebuild logic; minimal risk changes.

.github/workflows/_go-tests.yml: Simple update, same as above.

Best Practices

Infrastructure-As-Code:
• Explicit image tags and consistent naming

CI/CD:
• Prebuilding and pre-pulling images before orchestrating cluster/resources
• Eliminating Docker-in-Docker for simplicity/performance
• Using declarative build targets and grouping with docker-bake.hcl

Shell Scripting:
• Robust error handling (set -euo pipefail)
• Use of docker-compose for concurrent pulls

Potential Issues

• If referenced images in k8s manifests become dynamic or templated in the future, the exclusion logic in the pull script may require updates.
• Some environments or self-hosted runners may lack GNU parallel or Docker Compose v2, breaking parallel setup steps.
• If image tags/names change or are non-standardized across build and manifest, deployment may fail.
• Differences between developer/local and CI path in Tiltfile may slowly drift; needs ongoing maintenance.

This summary was automatically generated by @propel-code-bot

eculver

I don't have any blocking feedback. I feel like the pull_external_images.sh script is a little magical, but works for me if it works for you. It's pretty awesome that pre-pulling the images fixes things.

## Description of changes Our cluster bring up in CI is the primary cause of flaky workflow runs. I initially ran a few experiments without changing the `Tiltfile`. I hypothesized that there was maybe something specific to `kind` causing errors, so I also experimented with `minikube`. For each scenario, I collected at least 200 runs. Results: - `kind`, all images are fully cached: never flakes - `minikube`, all images are fully cached: never flakes - `minikube`, `rust/types/src/collection.rs` changed between runs: frequently flakes These results seem to indicate that building fresh images concurrently with pods being deployed is problematic. I theroized that this was caused by the image builder hogging disk I/O or CPU, but wasn't able to prove this using iostat/[workflow-telemetry-action](https://github.com/catchpoint/workflow-telemetry-action). Regardless, I decided to go ahead and try splitting image building/pulling into a separate step, before pods are deployed to the cluster. And flakes are now magically gone. This seems to slightly increase setup times (<10%) but the increased reliability is well worth it. We are also now running "bare-metal", aka no more Docker-in-Docker. One of the advantages is that Blacksmith's pull-through registry cache is used automatically without any extra config. We also may get a small performance win in tests but it's likely negligible. ## Test plan _How are these changes tested?_ I ran the Tilt action 250 times with these changes without a single flake.

[BLD]: debug Tilt setup flakes

246cb37

codetheweb force-pushed the bld-debug-tilt-setup-flakes branch from 5020131 to 246cb37 Compare June 2, 2025 23:48

codetheweb added 5 commits June 3, 2025 10:24

try minikube

ea64a0c

more jobs to stress test

c5abbab

use all resources

7eae1d7

test setup with rebuild

1faadfe

Cleanup

ccbccdb

codetheweb force-pushed the bld-debug-tilt-setup-flakes branch 2 times, most recently from a9f46c9 to 542c9b1 Compare June 3, 2025 21:42

override nproc

6231652

codetheweb force-pushed the bld-debug-tilt-setup-flakes branch from 542c9b1 to 6231652 Compare June 3, 2025 21:45

codetheweb added 2 commits June 3, 2025 15:28

use different mount for etcd data

3406def

prebake

040385d

codetheweb force-pushed the bld-debug-tilt-setup-flakes branch 5 times, most recently from cd79956 to f8aad48 Compare June 3, 2025 23:31

pre-pull external images

53c50be

codetheweb force-pushed the bld-debug-tilt-setup-flakes branch from f8aad48 to 53c50be Compare June 3, 2025 23:39

bake in-action

a1ec8cd

codetheweb force-pushed the bld-debug-tilt-setup-flakes branch from 32d8060 to a1ec8cd Compare June 3, 2025 23:50

parallelize steps

53e17b8

codetheweb force-pushed the bld-debug-tilt-setup-flakes branch from 1c367e3 to 188ef4f Compare June 3, 2025 23:59

fix workflow

056e351

codetheweb force-pushed the bld-debug-tilt-setup-flakes branch 4 times, most recently from 96a3313 to 5189703 Compare June 4, 2025 00:31

attempt custom action

21097d2

codetheweb force-pushed the bld-debug-tilt-setup-flakes branch from ed3e150 to 21097d2 Compare June 4, 2025 18:09

try to parallelize

f386c8f

codetheweb force-pushed the bld-debug-tilt-setup-flakes branch 2 times, most recently from fcacc53 to 68befed Compare June 4, 2025 18:40

test

a9bd4e8

codetheweb force-pushed the bld-debug-tilt-setup-flakes branch from 68befed to a9bd4e8 Compare June 4, 2025 18:44

codetheweb added 9 commits June 4, 2025 11:48

get telemetry

aec2e06

code change

ce4b4d5

code change 1

89fb708

code change 2

77cca16

code change 3

5841482

code change 4

06d6008

code change 5

19a5866

code change 6

e9ec495

code change 7

848e6aa

codetheweb commented Jun 4, 2025

View reviewed changes

codetheweb force-pushed the bld-debug-tilt-setup-flakes branch 2 times, most recently from deaa2b7 to f9d44e3 Compare June 4, 2025 20:36

codetheweb changed the title ~~[BLD]: debug Tilt setup flakes~~ [BLD]: fix Tilt setup flakes Jun 4, 2025

cleanup

1d0a456

codetheweb force-pushed the bld-debug-tilt-setup-flakes branch from f9d44e3 to 1d0a456 Compare June 4, 2025 20:41

codetheweb marked this pull request as ready for review June 4, 2025 20:44

codetheweb requested review from HammadB and eculver June 4, 2025 20:45

eculver approved these changes Jun 4, 2025

View reviewed changes

codetheweb merged commit 6eed2b4 into main Jun 4, 2025
71 checks passed

codetheweb deleted the bld-debug-tilt-setup-flakes branch June 4, 2025 21:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[BLD]: fix Tilt setup flakes #4720

[BLD]: fix Tilt setup flakes #4720

Uh oh!

codetheweb commented Jun 2, 2025 •

edited

Loading

Uh oh!

github-actions bot commented Jun 2, 2025

Uh oh!

codetheweb Jun 4, 2025

Uh oh!

propel-code-bot bot commented Jun 4, 2025

Uh oh!

eculver left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

[BLD]: fix Tilt setup flakes #4720

[BLD]: fix Tilt setup flakes #4720

Uh oh!

Conversation

codetheweb commented Jun 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description of changes

Test plan

Uh oh!

github-actions bot commented Jun 2, 2025

Reviewer Checklist

Testing, Bugs, Errors, Logs, Documentation

System Compatibility

Quality

Uh oh!

codetheweb Jun 4, 2025

Choose a reason for hiding this comment

Uh oh!

propel-code-bot bot commented Jun 4, 2025

Uh oh!

eculver left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codetheweb commented Jun 2, 2025 •

edited

Loading