Skip to content

RetryableFeignBlockingLoadBalancerClient loses the real root cause and presents a spurious UnknownHostException when max-retries-on-same-service-instance is > 0 #840

@oliverlockwood

Description

@oliverlockwood

Describe the bug

Background

The scenario is:

  • using a Feign client that is configured by name (e.g. @FeignClient(name = "downstream-service")) rather than by URL
  • using Spring-Cloud Loadbalancer with retry enabled (i.e. we're using the RetryableFeignBlockingLoadBalancerClient)
  • the downstream service instance(s) are taking longer to respond than the configured Feign client timeout used by the client application.

Correct behaviour

  • Occurs when spring.cloud.loadbalancer.retry.max-retries-on-same-service-instance is set to 0
  • The client application retries according to the configuration of max-retries-on-next-service-instance, and then at the end logs an error like this:
14:28:45.068 [http-nio-1234-exec-2] ERROR o.a.c.c.C.[.[.[.[dispatcherServlet] - Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is feign.RetryableException: Read timed out executing GET http://downstream-service/api/something] with root cause
java.net.SocketTimeoutException: Read timed out
	at java.base/sun.nio.ch.NioSocketImpl.timedRead(NioSocketImpl.java:283)
	...
	at java.base/java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:529)
	at feign.Client$Default.convertResponse(Client.java:110)
	at feign.Client$Default.execute(Client.java:106)
	at org.springframework.cloud.openfeign.loadbalancer.LoadBalancerUtils.executeWithLoadBalancerLifecycleProcessing(LoadBalancerUtils.java:57)
	at org.springframework.cloud.openfeign.loadbalancer.RetryableFeignBlockingLoadBalancerClient.lambda$execute$2(RetryableFeignBlockingLoadBalancerClient.java:168)
	at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:329)
	at org.springframework.retry.support.RetryTemplate.execute(RetryTemplate.java:225)
	at org.springframework.cloud.openfeign.loadbalancer.RetryableFeignBlockingLoadBalancerClient.execute(RetryableFeignBlockingLoadBalancerClient.java:114)
	at feign.SynchronousMethodHandler.executeAndDecode(SynchronousMethodHandler.java:121)
	at feign.SynchronousMethodHandler.invoke(SynchronousMethodHandler.java:91)
	at feign.ReflectiveFeign$FeignInvocationHandler.invoke(ReflectiveFeign.java:100)
	at jdk.proxy2/jdk.proxy2.$Proxy107.getMeSomething(Unknown Source)
	at com.oliverlockwood.retrybug.ClientController.hitTheDownstreamService(ClientController.java:20)
	...

which, in my opinion, is perfectly reasonable.

Incorrect behaviour (i.e. the bug)

  • Occurs when spring.cloud.loadbalancer.retry.max-retries-on-same-service-instance is set to 1 or higher
  • The client application retries according to the configuration of max-retries-on-same-service-instance, and max-retries-on-next-service-instance, and then at the end logs an error like this:
4:30:46.549 [http-nio-1234-exec-2] ERROR o.a.c.c.C.[.[.[.[dispatcherServlet] - Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is feign.RetryableException: downstream-service executing GET http://downstream-service/api/something] with root cause
java.net.UnknownHostException: downstream-service
	at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:567)
	...
	at java.base/java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:529)
	at feign.Client$Default.convertResponse(Client.java:110)
	at feign.Client$Default.execute(Client.java:106)
	at org.springframework.cloud.openfeign.loadbalancer.LoadBalancerUtils.executeWithLoadBalancerLifecycleProcessing(LoadBalancerUtils.java:57)
	at org.springframework.cloud.openfeign.loadbalancer.RetryableFeignBlockingLoadBalancerClient.lambda$execute$2(RetryableFeignBlockingLoadBalancerClient.java:168)
	at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:329)
	at org.springframework.retry.support.RetryTemplate.execute(RetryTemplate.java:225)
	at org.springframework.cloud.openfeign.loadbalancer.RetryableFeignBlockingLoadBalancerClient.execute(RetryableFeignBlockingLoadBalancerClient.java:114)
	at feign.SynchronousMethodHandler.executeAndDecode(SynchronousMethodHandler.java:121)
	at feign.SynchronousMethodHandler.invoke(SynchronousMethodHandler.java:91)
	at feign.ReflectiveFeign$FeignInvocationHandler.invoke(ReflectiveFeign.java:100)
	at jdk.proxy2/jdk.proxy2.$Proxy107.getMeSomething(Unknown Source)
	at com.oliverlockwood.retrybug.ClientController.hitTheDownstreamService(ClientController.java:20)
	...

which, in my opinion, is not acceptable. The actual underlying issue is lost, and in its place there's now a spurious, misleading UnknownHostException, because for some reason the implementation tries to hit the literal URL http://downstream-service/api/something, abandoning the whole concept of service discovery.

Workaround

Always set spring.cloud.loadbalancer.retry.max-retries-on-same-service-instance to 0.

Affected versions

I have explicitly proven that this bug reproduces on the following versions:

  • Spring-Boot 3.0.3 & Spring-Cloud 2022.0.1
  • Spring-Boot 2.7.9 & Spring-Cloud 2021.0.6
  • Spring-Boot 2.7.7 & Spring-Cloud 2021.0.5
  • Spring-Boot 2.6.2 & Spring-Cloud 2021.0.0

However, I suspect this bug has existed ever since the historical migration from Netflix Ribbon Loadbalancer to Spring-Cloud Loadbalancer.

Note: This is effectively a re-raise of #650, which was opened in December 2021, but closed by the maintainers because a detailed reproduction example was not provided at that time.

Sample

This is demonstrated in https://github.com/oliverlockwood/spring-cloud-openfeign-retry-sameserver-bug. Follow the README therein.

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

No type

Projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions