Skip to content

Conversation

@codetheweb
Copy link
Contributor

@codetheweb codetheweb commented Jun 2, 2025

Description of changes

Our cluster bring up in CI is the primary cause of flaky workflow runs. I initially ran a few experiments without changing the Tiltfile. I hypothesized that there was maybe something specific to kind causing errors, so I also experimented with minikube. For each scenario, I collected at least 200 runs.

Results:

  • kind, all images are fully cached: never flakes
  • minikube, all images are fully cached: never flakes
  • minikube, rust/types/src/collection.rs changed between runs: frequently flakes

These results seem to indicate that building fresh images concurrently with pods being deployed is problematic. I theroized that this was caused by the image builder hogging disk I/O or CPU, but wasn't able to prove this using iostat/workflow-telemetry-action.

Regardless, I decided to go ahead and try splitting image building/pulling into a separate step, before pods are deployed to the cluster. And flakes are now magically gone.

This seems to slightly increase setup times (<10%) but the increased reliability is well worth it.

We are also now running "bare-metal", aka no more Docker-in-Docker. One of the advantages is that Blacksmith's pull-through registry cache is used automatically without any extra config. We also may get a small performance win in tests but it's likely negligible.

Test plan

How are these changes tested?

I ran the Tilt action 250 times with these changes without a single flake.

@github-actions
Copy link

github-actions bot commented Jun 2, 2025

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of a unexpectedly high quality (Readability, Modularity, Intuitiveness)

@codetheweb codetheweb force-pushed the bld-debug-tilt-setup-flakes branch from 5020131 to 246cb37 Compare June 2, 2025 23:48
@codetheweb codetheweb force-pushed the bld-debug-tilt-setup-flakes branch 2 times, most recently from a9f46c9 to 542c9b1 Compare June 3, 2025 21:42
@codetheweb codetheweb force-pushed the bld-debug-tilt-setup-flakes branch from 542c9b1 to 6231652 Compare June 3, 2025 21:45
@codetheweb codetheweb force-pushed the bld-debug-tilt-setup-flakes branch 5 times, most recently from cd79956 to f8aad48 Compare June 3, 2025 23:31
@codetheweb codetheweb force-pushed the bld-debug-tilt-setup-flakes branch from f8aad48 to 53c50be Compare June 3, 2025 23:39
@codetheweb codetheweb force-pushed the bld-debug-tilt-setup-flakes branch from 32d8060 to a1ec8cd Compare June 3, 2025 23:50
@codetheweb codetheweb force-pushed the bld-debug-tilt-setup-flakes branch from 1c367e3 to 188ef4f Compare June 3, 2025 23:59
@codetheweb codetheweb force-pushed the bld-debug-tilt-setup-flakes branch 4 times, most recently from 96a3313 to 5189703 Compare June 4, 2025 00:31
@codetheweb codetheweb force-pushed the bld-debug-tilt-setup-flakes branch from ed3e150 to 21097d2 Compare June 4, 2025 18:09
@codetheweb codetheweb force-pushed the bld-debug-tilt-setup-flakes branch 2 times, most recently from fcacc53 to 68befed Compare June 4, 2025 18:40
@codetheweb codetheweb force-pushed the bld-debug-tilt-setup-flakes branch from 68befed to a9bd4e8 Compare June 4, 2025 18:44
matrix:
python: ${{fromJson(inputs.python_versions)}}
platform: ["blacksmith-16vcpu-ubuntu-2204"]
platform: ["blacksmith-8vcpu-ubuntu-2204"]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

compute is not a bottleneck according to the profiling I did (assuming most Rust crates are cached)

@codetheweb codetheweb force-pushed the bld-debug-tilt-setup-flakes branch 2 times, most recently from deaa2b7 to f9d44e3 Compare June 4, 2025 20:36
@codetheweb codetheweb changed the title [BLD]: debug Tilt setup flakes [BLD]: fix Tilt setup flakes Jun 4, 2025
@codetheweb codetheweb force-pushed the bld-debug-tilt-setup-flakes branch from f9d44e3 to 1d0a456 Compare June 4, 2025 20:41
@codetheweb codetheweb marked this pull request as ready for review June 4, 2025 20:44
@codetheweb codetheweb requested review from HammadB and eculver June 4, 2025 20:45
@propel-code-bot
Copy link
Contributor

Infrastructure Refactor: Make Tilt-based CI Cluster Setup Deterministic and Less Flaky

This PR overhauls the CI cluster setup logic for Tilt-based workflows, addressing persistent flakiness in CI by moving to a 'bare-metal' minikube approach and decoupling image building/pulling from pod deployment. It introduces a dedicated pre-build step for all required Docker images using a new docker-bake.hcl definition, pre-pulls external images in parallel, and cleans up the workflow to run all infrastructure on the host Docker engine, avoiding Docker-in-Docker entirely. Significant logic in the Tiltfile is reworked to use prebuilt/tagged images, custom builds, and clearer image targeting for CI vs. development. GitHub Actions are reorganized, external image pulling is scripted, and both Go and Python test workflows are updated accordingly.

Key Changes:
• Switched CI cluster provisioning from kind/Docker-in-Docker to host-based minikube with driver=none.
• Introduced .github/actions/tilt/docker-bake.hcl for explicit multi-image Docker builds via buildx bake.
• Tiltfile refactored to use custom_build with docker image tag for CI, ensuring only prebuilt/tagged images are used during testing.
• New pull_external_images.sh script pre-pulls all external images referenced in k8s manifests using a parallelized docker-compose pull.
• GitHub Actions composite action (action.yaml) rewritten to orchestrate minikube, image baking, and external image pulls in parallel pre-steps.
• Tilt startup is now reliably sequenced after prebuilding images, eliminating flaky race conditions.
• Test workflows (_python-tests.yml, _go-tests.yml) updated to use new action flow and platforms adjusted.

Affected Areas:
• .github/actions/tilt/
• .github/workflows/_python-tests.yml
• .github/workflows/_go-tests.yml
• Tiltfile

Potential Impact:

Functionality: Greatly reduces CI flakiness by ensuring all images are built and available before pods schedule, and improves reliability of cluster bring-up steps in CI. Behavior during local development is unchanged.

Performance: Negligible increase in CI setup time (<10%), but build and image pull steps are parallelized where possible. CI runs should experience improved test reliability, with possible slight test throughput increase due to removal of Docker-in-Docker costly overhead.

Security: No new direct risks; running 'bare-metal' with minikube should not weaken isolation for CI, but operational security model shifts slightly since no DinD.

Scalability: This setup will scale better for pipelines with multiple and larger images; parallel image building and pulling reduces bottlenecks in large test matrices.

Review Focus:
• Check that image references/tags are correct and consistently mapped from build to pod deployment.
• Review the image exclusion/inclusion logic in pull_external_images.sh for edge cases.
• Ensure the new build and cluster setup sequence in action.yaml matches documented intent and doesn't skip any critical initialization.
• Validate Tiltfile logic, especially custom_build vs. docker_build branches, for correctness and CI/developer parity.
• Inspect for potential regressions in how Helm chart templating interacts with pre-pulled images.

Testing Needed

• Ensure full CI workflow passes repeatedly with no flakes across all platforms.
• Verify that external images are successfully pulled in parallel and are available to pods post-cluster bring-up.
• Test that developer local workflows (non-CI) are unaffected.
• Confirm workflow environment variables and port-forwards are being set up as expected for all services.

Code Quality Assessment

Tiltfile: Significant, well-commented rework; clearly differentiates between CI and dev flows, though now more complex.

.github/actions/tilt/action.yaml: Concise, logical sequence, helpful comments; use of GNU parallel may merit cross-platform scrutiny.

.github/actions/tilt/docker-bake.hcl: Declarative, roles and tags for each image now explicit and versioned for CI.

.github/actions/tilt/pull_external_images.sh: Script is robust, efficiently identifies literal images while excluding false positives; no hardcoded paths aside from known custom images.

.github/workflows/_python-tests.yml: Clear migration to new action and prebuild logic; minimal risk changes.

.github/workflows/_go-tests.yml: Simple update, same as above.

Best Practices

Infrastructure-As-Code:
• Explicit image tags and consistent naming

CI/CD:
• Prebuilding and pre-pulling images before orchestrating cluster/resources
• Eliminating Docker-in-Docker for simplicity/performance
• Using declarative build targets and grouping with docker-bake.hcl

Shell Scripting:
• Robust error handling (set -euo pipefail)
• Use of docker-compose for concurrent pulls

Potential Issues

• If referenced images in k8s manifests become dynamic or templated in the future, the exclusion logic in the pull script may require updates.
• Some environments or self-hosted runners may lack GNU parallel or Docker Compose v2, breaking parallel setup steps.
• If image tags/names change or are non-standardized across build and manifest, deployment may fail.
• Differences between developer/local and CI path in Tiltfile may slowly drift; needs ongoing maintenance.

This summary was automatically generated by @propel-code-bot

Copy link
Contributor

@eculver eculver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have any blocking feedback. I feel like the pull_external_images.sh script is a little magical, but works for me if it works for you. It's pretty awesome that pre-pulling the images fixes things.

@codetheweb codetheweb merged commit 6eed2b4 into main Jun 4, 2025
71 checks passed
@codetheweb codetheweb deleted the bld-debug-tilt-setup-flakes branch June 4, 2025 21:57
Inventrohyder pushed a commit to Inventrohyder/chroma that referenced this pull request Aug 5, 2025
## Description of changes

Our cluster bring up in CI is the primary cause of flaky workflow runs.
I initially ran a few experiments without changing the `Tiltfile`. I
hypothesized that there was maybe something specific to `kind` causing
errors, so I also experimented with `minikube`. For each scenario, I
collected at least 200 runs.

Results:

- `kind`, all images are fully cached: never flakes
- `minikube`, all images are fully cached:  never flakes
- `minikube`, `rust/types/src/collection.rs` changed between runs:
frequently flakes

These results seem to indicate that building fresh images concurrently
with pods being deployed is problematic. I theroized that this was
caused by the image builder hogging disk I/O or CPU, but wasn't able to
prove this using
iostat/[workflow-telemetry-action](https://github.com/catchpoint/workflow-telemetry-action).

Regardless, I decided to go ahead and try splitting image
building/pulling into a separate step, before pods are deployed to the
cluster. And flakes are now magically gone.

This seems to slightly increase setup times (<10%) but the increased
reliability is well worth it.

We are also now running "bare-metal", aka no more Docker-in-Docker. One
of the advantages is that Blacksmith's pull-through registry cache is
used automatically without any extra config. We also may get a small
performance win in tests but it's likely negligible.

## Test plan

_How are these changes tested?_

I ran the Tilt action 250 times with these changes without a single
flake.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants