When multiple machines run Caddy behind a TCP load balancer or a DNS record with multiple IPs, certificates can take a while to issue. The ACME challenge request bounces between machines until it happens to hit the one that started the issuance. Moreover, if a Cloudflare proxied record is used with multiple addresses and one instance has successfully issued a certificate, Cloudflare may never send ACME challenge requests to the other instances, or may send them only very rarely. In practice, I have seen instances fail to issue certificates for days or weeks.
Possible solutions:
- Share the certificate storage (file system) between Caddy containers.
- Use an alternative storage module for certificates, or implement our own, e.g. one backed by Corrosion: https://caddyserver.com/docs/json/storage/
As a first step towards distributed storage, we can fork the default file_system storage module, which uses certmagic's file storage, and override only the Store and Load methods for keys that hold challenge tokens. Those tokens would be stored in Corrosion, which shares them among all Caddy instances, so any instance can answer a challenge started by another one. All other keys would still go to the local file system.
We can perhaps start even without a distributed lock. If multiple instances start issuing a certificate for the same domain at the same time, they may overwrite each other's tokens, but after a few retries they should all succeed. We could also add optimistic locking: fail the write if a record for the key already exists, and ignore the existing record once it is older than a reasonable timeout.