TCP Proxy: Fix corrupted hostname from partial connection read.#10454
TCP Proxy: Fix corrupted hostname from partial connection read.#10454chotiwat wants to merge 10 commits intokubernetes:mainfrom
Conversation
✅ Deploy Preview for kubernetes-ingress-nginx canceled.
|
|
Hi @chotiwat. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/retest |
|
@chotiwat: Cannot trigger testing until a trusted user reviews the PR and leaves an DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
Hi @cpanato @tao12345666333, is there anything I need to do on my end to get this PR reviewed? |
|
Hi @rikatz, would you be able to take a look at this PR when you get a chance? |
|
Hi @rikatz, ping on this issue 🙏 |
|
/retest |
|
|
Hi @longwuyuan, I've answered your questions below:
None. I couldn't find any issue describing the same problem.
Yes, as described in the PR body.
I wonder that myself. I can hazard a few guesses:
This doesn't mean the problem doesn't occur for other users of SSL passthrough though.
The connection handler for the proxy falls back to the default NGINX backend ( When this happens, the default backend will terminate the TLS and send a new request to the upstream per its generated NGINX configuration block, hence the HTTP-sent-to-HTTPS error. Setting backend-protocol to ingress-nginx/pkg/tcpproxy/tcp.go Lines 71 to 76 in 86f3af8 ingress-nginx/pkg/tcpproxy/tcp.go Lines 43 to 56 in 86f3af8
No, there isn't currently. I could add some though (see below).
I could modify the existing test suite if that sounds good to you. I didn't see it initially. It took some digging around the docs. |
|
|
|
Ok, I'll go ahead and try to add an e2e test for this.
I believe I've explained as clearly as I could in this PR. If you insist, I can create an issue, which would be a copy of this PR body and my previous answers to your questions, but I personally don't see any point in doing so. The mistake in the code should also be pretty clear from the developer's point of view as well. Hopefully, I'd be able to come up with an e2e test that would fail against the current
Yes, let's do that. |
There was a problem hiding this comment.
I've added the e2e tests that would have failed without this fix, and cherry-picked just the tests to #10988.
Unfortunately, it seems that this issue is harder to reproduce on a kind cluster, perhaps because the traffic doesn't go through many hops and there are no other workloads contending for I/O, so I had to introduce virtual throttling to the tests.
The throttling is done by wrapping the connection returned from the client's DialContext with a net.Conn that writes to the connection at most chunkSize bytes.
These tests would sometimes pass without the fix from this PR so I added retries with as well. Alternatively, we could make the ginkgo.MustPassRepeatedly decoratorchunkSize less than len(host) but I don't know if it would make the tests significantly slower.
| Name: name, | ||
| Image: image, | ||
| Env: []corev1.EnvVar{}, | ||
| Env: env, |
There was a problem hiding this comment.
There was a bug in the e2e framework as well. This was preventing httpbun from getting the environment variables so it always ran as an HTTP server even though HTTPBUN_SSL_CERT and HTTPBUN_SSL_KEY are specified.
| }) | ||
|
|
||
| ginkgo.It("should pass unknown traffic to default backend and handle known traffic", func() { | ||
| ginkgo.Context("when handling traffic", func() { |
There was a problem hiding this comment.
I grouped these tests with ginkgo.Context() so that we can reuse some of the common setup code.
| "nginx.ingress.kubernetes.io/ssl-passthrough": "true", | ||
| } | ||
|
|
||
| ingressDef := framework.NewSingleIngressWithTLS(host, |
There was a problem hiding this comment.
When SSL passthrough is working correctly, we shouldn't need TLS settings on the NGINX side. The upstream server should be the one who terminates the TLS.
|
|
||
| f.NewDeploymentWithOpts("echopass", | ||
| framework.HTTPBunImage, | ||
| 80, |
There was a problem hiding this comment.
This tripped me up a bit and caused me to spent some time trying to add an HTTPS port to the deployment. It turns out the httpbun image binds to port 80 by default no matter what the protocol it's listening for.
So this sets up an HTTPS server listening on port 80 which technically works for the purpose of the tests.
test/e2e/settings/ssl_passthrough.go
Outdated
| ForceResolve(f.GetNginxIP(), 443). | ||
| Expect(). | ||
| Status(http.StatusOK) | ||
| ginkgo.It("should handle known traffic with Host header", ginkgo.MustPassRepeatedly(tries), func() { |
There was a problem hiding this comment.
Failure without the fix:
Get "[https://testpassthrough.com:443/](https://testpassthrough.com/)": tls: failed to verify certificate: x509: certificate is valid for ingress.local, not testpassthrough.com
Reason:
The proxy directs the traffic to NGINX. The configuration for the host specified in the HOST header doesn't specify any TLS certificate, so NGINX tries to terminate TLS with the default certificate.
test/e2e/settings/ssl_passthrough.go
Outdated
| ForceResolve(f.GetNginxIP(), 443). | ||
| Expect(). | ||
| Status(http.StatusNotFound) | ||
| ginkgo.It("should handle insecure traffic with Host header", ginkgo.MustPassRepeatedly(tries), func() { |
There was a problem hiding this comment.
Failure without the fix:
400 Bad Request
Reason:
The proxy directs the traffic to NGINX. Unlike the should handle known traffic with Host header case above, NGINX doesn't verify certificate validity (InsecureSkipVerify: true) and is able to terminate TLS, but it tries to send the plain HTTP request to the HTTPS server because the configuration for the host specified in the HOST header doesn't specify HTTPS as the upstream protocol.
test/e2e/settings/ssl_passthrough.go
Outdated
| } | ||
| tries := 3 | ||
|
|
||
| ginkgo.It("should handle known traffic without Host header", ginkgo.MustPassRepeatedly(tries), func() { |
There was a problem hiding this comment.
Failure without the fix:
Get "[https://testpassthrough.com:443/](https://testpassthrough.com/)": tls: failed to verify certificate: x509: certificate is valid for ingress.local, not testpassthrough.com
Reason:
The proxy directs the traffic to NGINX. There's no HOST header so NGINX tries to terminate TLS with the default certificate because it doesn't know what server block to look for.
test/e2e/settings/ssl_passthrough.go
Outdated
| f.WaitForNginxServer(hostBad, | ||
| func(server string) bool { | ||
| return strings.Contains(server, "listen 442") | ||
| ginkgo.It("should handle insecure traffic without Host header", ginkgo.MustPassRepeatedly(tries), func() { |
There was a problem hiding this comment.
Failure without the fix:
404 Not Found
Reason:
The proxy directs the traffic to NGINX. Unlike the should handle known traffic without Host header case above, NGINX doesn't verify certificate validity (InsecureSkipVerify: true) and is able to terminate TLS, but it doesn't have any hostname for routing and returns 404.
Evidence:
There are no instances of this error in this workflow run so I ran it locally with make kind-e2e-test FOCUS='insecure traffic without Host' SKIP_INGRESS_IMAGE_CREATION=false SKIP_E2E_IMAGE_CREATION=false
[FAILED]
Error Trace: /go/src/k8s.io/ingress-nginx/test/e2e/framework/httpexpect/reporter.go:47
/go/src/k8s.io/ingress-nginx/test/e2e/framework/httpexpect/chain.go:37
/go/src/k8s.io/ingress-nginx/test/e2e/framework/httpexpect/response.go:263
/go/src/k8s.io/ingress-nginx/test/e2e/framework/httpexpect/response.go:79
/go/src/k8s.io/ingress-nginx/test/e2e/settings/ssl_passthrough.go:241
/go/src/k8s.io/ingress-nginx/.modcache/github.com/onsi/ginkgo/v2@v2.15.0/internal/node.go:463
/go/src/k8s.io/ingress-nginx/.modcache/github.com/onsi/ginkgo/v2@v2.15.0/internal/suite.go:889
/usr/local/go/src/runtime/asm_arm64.s:1197
Error:
expected status equal to:
"200 OK"
but got:
"404 Not Found"
Test: [Flag] enable-ssl-passthrough With enable-ssl-passthrough enabled when handling traffic on throttled connections should handle insecure traffic without Host header
| } | ||
| } | ||
|
|
||
| ginkgo.It("should handle known traffic without Host header", func() { |
There was a problem hiding this comment.
| f.WaitForNginxServer(hostBad, | ||
| func(server string) bool { | ||
| return strings.Contains(server, "listen 442") | ||
| ginkgo.It("should handle insecure traffic without Host header", func() { |
There was a problem hiding this comment.
| ForceResolve(f.GetNginxIP(), 443). | ||
| Expect(). | ||
| Status(http.StatusOK) | ||
| ginkgo.It("should handle known traffic with Host header", func() { |
There was a problem hiding this comment.
| ForceResolve(f.GetNginxIP(), 443). | ||
| Expect(). | ||
| Status(http.StatusNotFound) | ||
| ginkgo.It("should handle insecure traffic with Host header", func() { |
There was a problem hiding this comment.
|
@chotiwat thanks for the contribution. I am copy/pasting my comments on the other PR here as well
|
Of course. I can do that once the PR has been reviewed and I address all the comments.
Our use case is not unique at all. In fact, I believe it's one of the main reasons this SSL passthrough feature exists. Tools like
Here are a few resources that might be helpful for more productive conversations in the future:
|
|
/cherry-pick release-1.12 |
|
/cherry-pick release-1.11 |
|
@Gacko: once the present PR merges, I will cherry-pick it on top of release-1.12 in a new PR and assign it to you. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/cherry-pick release-1.10 |
|
@Gacko: once the present PR merges, I will cherry-pick it on top of release-1.11 in a new PR and assign it to you. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@Gacko: once the present PR merges, I will cherry-pick it on top of release-1.10 in a new PR and assign it to you. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
4aa461b to
1069c21
Compare
Co-authored-by: Marco Ebert <marco_ebert@icloud.com>
Gacko
left a comment
There was a problem hiding this comment.
/triage accepted
/kind bug
/priority backlog
|
/ok-to-test |
|
Any update on this? |
|
/uncc |
What this PR does / why we need it:
We have run into intermittent "HTTP request sent to an HTTPS server" errors from our SSL passthrough ingresses. We found that the passthrough proxy would sometimes read incomplete data from the connection and cause corrupted hostname.
When this happens, the connection handler would fall back to the default server at
127.0.0.1:442and cause the error if thenginx.ingress.kubernetes.io/backend-protocol: HTTPSannotation is not specified.This PR fixes the issue by making sure that we fully read the Client Hello data from the connection by getting the total length from the TLS header.
Types of changes
Which issue/s this PR fixes
fixes #11491
fixes #11424
How Has This Been Tested?
We built a custom image to debug the issue and added some debug logs. For example, we updated
ingress-nginx/pkg/tcpproxy/tcp.go
Line 74 in 4bac120
to
Example log of corrupted data before the fix:
We can reproduce the issue by running
curlin a loop. I've seen around 5% of errors from my machine with the number of requests as low as 10 enough to trigger the error. This fix eliminates the errors completely for us.Checklist: