feat(health): add check endpoint and loop control#2575
feat(health): add check endpoint and loop control#2575jfroy wants to merge 1 commit intoqdm12:masterfrom
Conversation
On Kubernetes, it makes more sense to use a liveness probe than the health server loop (i.e. only have one loop). This patch introduces a flag to disable the health server loop, and a new /check/ endpoint for such probes. When the connection is saturated, health checks can take a long time and therefore fail if the timeout is too short. Gradually increasing the timeout, as done in the health server loop, is not all that useful because the upper bound on the timeout is ultimately what you are willing to tolerate before declaring the connection unhealthy. So a static probe with a long timeout and a failure count, as implemented in Kubernetes, will be more stable (i.e. less flopping), especially if more than one sequential failure is allowed. The above argument aside, the two health/probe loops also do not work well together because they can get out of phase. Kubernetes probes usually must be used to sequence containers in a pod. Signed-off-by: Jean-Francois Roy <jf@devklog.net>
|
My intuition agrees, but I've seen this fail when 1000s of TCP connections are going through the VPN and the link is fairly loaded (80%+ of expected throughput).
The liveness endpoint uses the same code as the health loop on purpose, to avoid code duplication and maintain as much of the behavior of the loop as possible. As I wrote in the commit message, when using a K8S liveness probe, the gluetun health loop and the K8S probe loop sort of interfere with each other -- they can easily get out of phase. So before this patch, you can end up in a situation where the gluetun loop fails and sets the status error, the K8S liveness probe comes in and samples the error (thus failing the probe), the gluetun loop then succeeds and then fails again, the K8S probe comes in and samples an error again, etc. This can lead to K8S considering the container failed, even though the health server is flopping because the connection is loaded. Because the gluetun loop has an adaptive timeout, it's not possible to prevent the 2 loops from going out of phase. Even if you were to change the gluetun loop to have a fixed period and matched it to the K8S probe, they would still eventually get out of phase. Disabling the gluetun health loop only makes sense when using a K8S liveness probe. When using gluetun from just Docker, it is a bad idea to disable it, indeed. Perhaps the environment variable could be named something more specific ("disable for kubernetes", "disable I know what I am doing", "disable footguns are fun", etc). Alternatively, perhaps if the liveness endpoint is used it could disable the gluetun loop.
Ah yes, very good point. I will fix the code. I don't think cancellation is likely, but it's just better to handle it.
That's an interesting alternative, but it feels better, overall, not to have 2 loops.
Yeah, and I did try that before writing this patch, but I always got into the unstable regime I described above. |
27f74e4 to
fe3d4a9
Compare
d0247a1 to
0eeee5c
Compare
On Kubernetes, it makes more sense to use a liveness probe than the health server loop (i.e. only have one loop). This patch introduces a flag to disable the health server loop, and a new /check/ endpoint for such probes.
When the connection is saturated, health checks can take a long time and therefore fail if the timeout is too short. Gradually increasing the timeout, as done in the health server loop, is not all that useful because the upper bound on the timeout is ultimately what you are willing to tolerate before declaring the connection unhealthy. So a static probe with a long timeout and a failure count, as implemented in Kubernetes, will be more stable (i.e. less flopping), especially if more than one sequential failure is allowed.
The above argument aside, the two health/probe loops also do not work well together because they can get out of phase. Kubernetes probes usually must be used to sequence containers in a pod.