Zilla can appear unhealthy to tcp health check mechanisms at full engine worker utilization #1495

@jfallows

Describe the bug
When engine workers reach full capacity, Zilla unbinds its server ports. This defends against exceeding capacity, but it also defeats TCP health check mechanisms for those ports. Zilla rebinds the ports only when a TCP server connection is closed, not when a TCP client connection is closed.

To Reproduce
Steps to reproduce the behavior:

  1. Start any Zilla example with Kafka cache and one or more bootstrap topics
  2. Count the TCP connections to and from Zilla (baseline connections)
  3. Restart the Zilla example with -Pzilla.engine.worker.capacity=N where N is the number of baseline connections plus 1
  4. Check metrics for engine.workers.utilization (verify almost full utilization)
  5. Make a connection via nc to the Zilla TCP server port
  6. Check metrics for engine.workers.utilization (verify now at full utilization)
  7. Verify the local Zilla server port is no longer bound (due to full capacity)
  8. Stop the Kafka container
  9. Check metrics for engine.workers.utilization (verify no longer at full utilization)
  10. Verify the local Zilla server port is still not bound (even though Zilla is no longer at full capacity)
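
Steps 7 and 10 can be checked the same way a TCP health check would: by attempting to connect. The following is a minimal sketch of such a probe, not part of Zilla; the function name `tcp_port_is_bound` and its parameters are hypothetical, and it treats any connection failure (refused, timed out, unreachable) as "not bound".

```python
import socket


def tcp_port_is_bound(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        # A successful three-way handshake is all a basic TCP health check needs.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

Calling the probe against the Zilla server port after step 6 and again after step 9 should show the behavior reported here: it returns False both times. Note that each successful probe briefly opens a connection of its own.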

Expected behavior
Local zilla server port should not unbind at full capacity.
Instead, when Zilla is at full capacity, it should accept new TCP connections and immediately close them cleanly.
This lets zilla interact cleanly with tcp health check mechanisms, without exceeding capacity.
Zilla would then need the ability to hand off the new connection to a different worker, distributing the load across engine workers.
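
The accept-then-close behavior described above can be illustrated with a small sketch. This is not Zilla's implementation and does not cover handing connections off between workers; the class `CapacityLimitedAcceptor` and its methods are hypothetical names, and capacity is modeled simply as a count of tracked connections.

```python
import socket


class CapacityLimitedAcceptor:
    """Hypothetical acceptor that stays bound and sheds connections at capacity."""

    def __init__(self, listener: socket.socket, capacity: int):
        self.listener = listener
        self.capacity = capacity
        self.active = []  # connections currently counted against capacity

    def accept_one(self):
        """Accept one pending connection; return it, or None if it was shed."""
        conn, _addr = self.listener.accept()
        if len(self.active) >= self.capacity:
            # At capacity: close right away instead of unbinding the port,
            # so the peer sees a completed handshake followed by a clean close.
            conn.close()
            return None
        self.active.append(conn)
        return conn

    def release(self, conn: socket.socket) -> None:
        """Close a tracked connection and free its capacity slot."""
        self.active.remove(conn)
        conn.close()
```

Because the listening socket is never closed, a TCP health check still completes its handshake at full capacity, and no rebind step is needed once capacity frees up.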

Labels: bug