perf: use a constant number of ranch instances (#706)
Conversation
Instead of starting one ranch instance per pool, use the same two ranch instances for the whole application lifecycle. Each ranch instance was starting a minimum of 10 acceptors (and, consequently, 10 connection supervisors). With 10_000 pools, that is 200_000 processes, which consume a sizeable amount of memory and resources. They also added complexity in managing the separate ranch instances (they had to be started and stopped at the appropriate times, especially because they weren't linked to the pool). The ranch acceptors/connection supervisors aren't a bottleneck, and if they were, we could control the number of acceptors/supervisors through configuration. In short, there's no benefit in bringing up multiple ranch instances, only the resource consumption drawback.
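A minimal sketch of the before/after, using the ranch 2.x listener API (the refs, the handler module, and the port are illustrative, not the PR's actual code):

```elixir
# Before: one ranch listener per pool. Each listener starts its own
# acceptor set (10 by default), so 10_000 pools means ~100_000 acceptors
# plus as many connection supervisors.
for tenant <- tenants do
  {:ok, _pid} =
    :ranch.start_listener(
      {:pool, tenant},            # one ranch ref per pool
      :ranch_tcp,
      %{socket_opts: [port: 0]},  # defaults to 10 acceptors
      MyApp.ClientHandler,        # hypothetical protocol module
      %{tenant: tenant}
    )
end

# After: a constant number of listeners shared by all pools, with the
# acceptor count controlled in one place.
{:ok, _pid} =
  :ranch.start_listener(
    :session_proxy,
    :ranch_tcp,
    %{socket_opts: [port: 5432], num_acceptors: 100},
    MyApp.ClientHandler,
    %{mode: :session}
  )
```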
That's a great point! Will add sharding in the PR. (edit: done on 43da5f3)
The memory burden is considerable. In a sample production node, we have just 4700 pools, and
Force-pushed from 3d9a29a to 85c50e8
I went with a default of 4 shards per mode, which is probably enough for most workloads. If needed, we can increase it in production.
Instead of using a single Ranch listener per mode (session and transaction), create a configurable number of shards per mode to distribute connections across multiple ports. This prevents hitting the ~65k concurrent-connection limit per port without needing to maintain one ranch instance per pool. Additionally, remove the unused local_proxy_multiplier configuration.
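A hedged sketch of what per-mode sharding might look like (the config key, `base_port/1`, and the handler module are assumptions for illustration, not the PR's code):

```elixir
# Start N listeners (shards) per proxy mode on consecutive ports, so the
# ~65k concurrent-connection ceiling applies per shard rather than per mode.
shards = Application.get_env(:supavisor, :proxy_shards, 4)

for mode <- [:session, :transaction], shard <- 0..(shards - 1) do
  {:ok, _pid} =
    :ranch.start_listener(
      {:proxy, mode, shard},                           # unique ref per shard
      :ranch_tcp,
      %{socket_opts: [port: base_port(mode) + shard]}, # hypothetical base_port/1
      MyApp.ClientHandler,                             # hypothetical handler
      %{mode: mode}
    )
end
```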
Force-pushed from 367f024 to 43da5f3
lib/supavisor/application.ex
Outdated
  ) do
    {:ok, _pid} ->
-     Logger.notice("Proxy started #{mode} on port #{port}")
+     Logger.notice("Proxy started #{opts.mode}(local=#{opts.local}) on port #{port}")
Here we need to get the value from ranch by id, because the configured port is 0. Although, it might be better to start the port range from a defined value to make this behavior more deterministic.
Suggested change:
- Logger.notice("Proxy started #{opts.mode}(local=#{opts.local}) on port #{port}")
+ port = :ranch.get_port(key)
+ Logger.notice("Proxy started #{opts.mode}(local=#{opts.local}) on port #{port}")
Added the port!
I'd argue that starting on a range may be "less deterministic". Here the behaviour is more predictable: we listen on 0, and the OS gives us a port that is guaranteed to be free. If we try to pick it ourselves, we might run into conflicts, etc.
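For reference, a small sketch of the port-0 pattern under discussion: bind on port 0, then ask ranch which port the OS actually assigned. `:ranch.get_port/1` is the real ranch API mentioned in the suggestion; the ref and handler module here are illustrative:

```elixir
require Logger

ref = {:local_proxy, :session, 0}

{:ok, _pid} =
  :ranch.start_listener(
    ref,
    :ranch_tcp,
    %{socket_opts: [port: 0]},  # port 0: the OS picks a free port
    MyApp.ClientHandler,        # hypothetical protocol module
    %{mode: :session, local: true}
  )

# The listener is bound to an OS-assigned port, so query ranch for it.
port = :ranch.get_port(ref)
Logger.notice("Proxy started session(local=true) on port #{port}")
```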
we can choose something that we are sure will be free, for example, starting at 45_000
and we will always know that these ports are used by local proxies
There's no such thing as a surely free port, especially in the ephemeral range :P
well, there are a bunch of strictly "safe" examples that supavisor uses: 4000, session's, transaction's, etc 😜
But these aren't in the ephemeral range...
We could maybe pick something outside it though (like 12_000 or smth) 🤔
### Features
- **Authentication cleartext password support** - Added support for the cleartext password authentication method (#707)
- **Runtime-configurable connection retries** - Support for runtime configuration of connection retries and infinite retries (#705)
- **Enhanced health checks** - Check database and eRPC capabilities during health check operations (#691)
- **More consistency with Postgres on auth errors** - Improves errors in some client libraries (#711)

### Performance Improvements
- **Optimized ranch usage** - Supavisor now uses a constant number of ranch instances for improved performance and resource management when hosting a large number of pools (#706)

### Monitoring
- **New OS memory metrics** - Gives a more accurate picture of memory usage (#704)
- **Promex plugin for cluster metrics** - For tracking latency and connection status (#690)
- **Client connection lifetime metrics** - Adds a metric for how long each connection stays connected (#688)
- **Process monitoring** - Log processes with large heaps or long message queues (#689)

### Bug Fixes
- **Client handler query cancellation** - Fixed handling of `:cancel_query` when state is `:idle` (#692)

### Migration Notes
- Instances running a small number of pools may see an increase in memory usage. This can be mitigated by changing the ranch shard or acceptor counts.
- If any of the newly used ports conflict with existing services, the defaults may need to be changed.
- Review monitoring dashboards and include the new metrics.
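A hypothetical `runtime.exs` fragment illustrating the tuning knobs the migration notes mention; the exact configuration keys depend on the release and are assumptions here, not documented settings:

```elixir
import Config

config :supavisor,
  # Fewer shards for deployments with a small number of pools.
  session_proxy_shards: 2,
  transaction_proxy_shards: 2,
  # ranch acceptors per listener (ranch's default is 10).
  num_acceptors: 10
```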
