perf: use a constant number of ranch instances#706

Merged
v0idpwn merged 10 commits intomainfrom
feat/ranch-instances
Jul 22, 2025
Conversation

v0idpwn (Member) commented Jul 20, 2025

Instead of starting one ranch instance per pool, use the same N ranch instances for all pools.

Each ranch instance was starting a minimum of 10 acceptors (and consequently, 10 connection supervisors). With 10_000 pools, that's 200_000 processes, which consume a sizeable amount of memory and resources. They also added complexity in managing the separate ranch instances (they needed to be started and stopped at appropriate times, especially because they weren't linked to the pool).

The ranch acceptors/connection supervisors aren't a bottleneck, and if they ever were, we could control the number of acceptors/supervisors through configuration. In short, there's no benefit in starting multiple ranch instances, only the resource consumption drawback.

@v0idpwn v0idpwn requested a review from a team as a code owner July 20, 2025 17:59
@v0idpwn v0idpwn changed the title perf: constant number of ranch instances perf: use a constant number of ranch instances Jul 20, 2025
abc3 (Contributor) commented Jul 21, 2025

Yeah, a good initiative. Just adding some context for why it was originally done this way.

The initial approach aimed to provide tenant-level network isolation and reduce the risk of cross-tenant access through internal proxying. It also allowed us to observe TCP proxying behavior under these conditions.

We considered both unique ports per tenant and shared ones. For the shared setup, the idea was to maintain a pool of Ranch instances and select one via something like :erlang.phash2(tenant_id, ...) to avoid a single point of failure. We started with unique ports and planned to revisit these options later.
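The shard-selection idea abc3 describes can be sketched with Erlang's built-in hash. This is an illustrative sketch, not Supavisor's actual code; `ShardPicker` and the listener ref shape are hypothetical names:

```elixir
# Hypothetical sketch: pick one of a fixed set of shared Ranch
# listeners for a tenant, so each tenant deterministically maps
# to the same shard without a single point of failure.
defmodule ShardPicker do
  @num_shards 4

  # :erlang.phash2/2 hashes its argument into 0..@num_shards-1,
  # so the same tenant_id always yields the same shard index.
  def listener_ref(tenant_id) do
    {:proxy_listener, :erlang.phash2(tenant_id, @num_shards)}
  end
end
```

Because the hash is deterministic, no coordination or lookup table is needed to route a tenant to its shard.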

Note that with a single port the number of concurrent connections from a single client machine is limited by the number of available local ports on the client, which is typically around 64k.

Creating Ranch instances is resource-intensive mainly during initialization. When idle, they add negligible overhead.

Here are load charts with 10k local Ranch instances. Scheduler usage is about the same as without them, and memory stays around 1 GB.

[Screenshot, 2025-07-21: load charts with 10k local Ranch instances]

v0idpwn (Member, Author) commented Jul 21, 2025

> Note that with a single port the number of concurrent connections from a single client machine is limited by the number of available local ports on the client, which is typically around 64k.

That's a great point! Will add sharding in the PR. (edit: done on 43da5f3)

> Creating Ranch instances is resource-intensive mainly during initialization. When idle, they add negligible overhead.

The memory burden is considerable. In a sample production node, we have just 4700 pools, and :ranch_conns_sup and ranch_acceptor processes alone take over 1 GB of RAM (with 47000 acceptors and 47000 ranch_conns_sups). That's over a third of the node's total RAM consumption. Another option would be to reduce the acceptor count, but especially since "regular" connections use shared ranch instances, I generally think that these connections should too.
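A back-of-the-envelope calculation from the numbers in that comment (figures are approximations taken from the sample node, not measurements of any specific release):

```elixir
# Sample node above: 4700 pools, each with 10 acceptors and
# 10 connection supervisors, together using over 1 GB.
pools = 4_700
processes = pools * 10 * 2
bytes_per_process = div(1_000_000_000, processes)

# Extrapolating the same per-process cost to 10_000 pools:
projected_bytes = 10_000 * 10 * 2 * bytes_per_process
IO.inspect({processes, bytes_per_process, projected_bytes})
```

Roughly 10 KB per idle ranch process adds up to multiple gigabytes once the pool count reaches five figures, which is the core of the resource argument.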

@v0idpwn v0idpwn force-pushed the feat/ranch-instances branch from 3d9a29a to 85c50e8 Compare July 21, 2025 14:34
v0idpwn (Member, Author) commented Jul 21, 2025

I went with a default of 4 shards per mode, which is probably enough for most workloads. If we need, we can increase it in production.

Instead of using single Ranch listeners for session and transaction modes,
create configurable shards per mode to distribute connections across multiple
ports. This prevents hitting the ~65k connection limit per port, without
needing to maintain one ranch instance per pool.

Additionally, remove the unused local_proxy_multiplier configuration.
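The sizing logic behind "4 shards per mode" can be sketched as ceiling division against the per-client port limit mentioned earlier. `ShardSizing` and the 64k constant are illustrative assumptions, not Supavisor code:

```elixir
defmodule ShardSizing do
  # Approximate number of local ports one client machine can use
  # toward a single destination port (the ~65k limit noted above).
  @per_port_limit 64_000

  # Minimum shard count so a single client can hold `target`
  # concurrent connections (ceiling division).
  def shards_needed(target) do
    div(target + @per_port_limit - 1, @per_port_limit)
  end
end
```

Under these assumptions, 4 shards per mode supports roughly 256k concurrent connections from one client machine before the local-port limit bites.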
@v0idpwn v0idpwn force-pushed the feat/ranch-instances branch from 367f024 to 43da5f3 Compare July 21, 2025 14:41
```diff
 ) do
   {:ok, _pid} ->
-    Logger.notice("Proxy started #{mode} on port #{port}")
+    Logger.notice("Proxy started #{opts.mode}(local=#{opts.local}) on port #{port}")
```
abc3 (Contributor) commented Jul 22, 2025
Here we need to get the value from ranch by id, because the configured port is 0. Though it might be better to start the range from a defined value to make this behavior more deterministic.

Suggested change:

```diff
-    Logger.notice("Proxy started #{opts.mode}(local=#{opts.local}) on port #{port}")
+    port = :ranch.get_port(key)
+    Logger.notice("Proxy started #{opts.mode}(local=#{opts.local}) on port #{port}")
```

v0idpwn (Member, Author) commented Jul 22, 2025
Added the port!

I'd argue that starting from a defined range may be "less deterministic". Here the behaviour is more predictable: we listen on 0, and the OS gives us a port that is guaranteed to be free. If we try to pick it ourselves, we might get conflicts, etc.

abc3 (Contributor):

we can choose something that we are sure will be free, for example, starting at 45_000

abc3 (Contributor):

and we will always know that these ports are used by local proxies

v0idpwn (Member, Author):

There's no such thing as surely-free ports, especially in the ephemeral range :P

abc3 (Contributor):

well, there are a bunch of strictly "safe" examples that supavisor already uses: 4000, the session and transaction ports, etc 😜

v0idpwn (Member, Author):

But these aren't in the ephemeral range...

v0idpwn (Member, Author):

We could maybe pick something outside it though (like 12_000 or smth) 🤔

@v0idpwn v0idpwn enabled auto-merge (squash) July 22, 2025 13:01
@v0idpwn v0idpwn merged commit e88fb4b into main Jul 22, 2025
12 checks passed
@v0idpwn v0idpwn deleted the feat/ranch-instances branch July 22, 2025 13:11
@v0idpwn v0idpwn mentioned this pull request Jul 28, 2025
v0idpwn added a commit that referenced this pull request Jul 29, 2025
### Features
- **Authentication cleartext password support** - Added support for
cleartext password authentication method (#707)
- **Runtime-configurable connection retries** - Support for runtime
configuration of connection retries and infinite retries (#705)
- **Enhanced health checks** - Check database and eRPC capabilities
during health check operations (#691)
- **More consistency with postgres on auth errors** - Improves errors in
some client libraries (#711)

### Performance Improvements

- **Optimized ranch usage** - Supavisor now uses a constant number of
ranch instances for improved performance and resource management when
hosting a large number of pools (#706)

### Monitoring

- **New OS memory metrics** - gives a more accurate picture of memory
usage (#704)
- **Add a promex plugin for cluster metrics** - for tracking latency and
connection status (#690)
- **Client connection lifetime metrics** - adds a metric about how long
each connection is connected for (#688)
- **Process monitoring** - Log when processes have large heaps or long
message queues (#689)

### Bug Fixes

- **Client handler query cancellation** - Fixed handling of
`:cancel_query` when state is `:idle` (#692)

### Migration Notes

- Instances running a small number of pools may see an increase in
memory usage. This can be mitigated by reducing the ranch shard count
or the acceptor count.
- If any of the newly used ports conflict with your environment, you
may need to change the defaults
- Review monitoring dashboards and include the new metrics