
[Throttler] Constant throttling of vreplication flow when cluster has large number of replicas #18930


Description

@siddharth16396

Summary

We are experiencing constant throttling of vreplication flows when the Vitess tablet throttler is enabled on a large cluster (20 replicas across 2 regions). As soon as throttling is turned on via the UpdateThrottlerConfig gRPC endpoint, vreplication stalls and never recovers. The following error persists until throttling is disabled:

reason_throttled: vplayer:mv_fix_1:vreplication is denied access due to unexpected error: metric not collected yet

Environment

  • Large Vitess cluster with ~20 replicas
    • Region A: 10 replicas
    • Region B: 10 replicas (50–70 ms cross-region latency)
  • Reference keyspace replicating into the main keyspace
  • Throttler enabled via UpdateThrottlerConfig
  • activeCollectInterval: 250 ms

Observed Behavior

  • Once throttling is enabled, vreplication stops progressing. _vt.vreplication shows repeated errors:
reason_throttled: vplayer:mv_fix_1:vreplication is denied access due to unexpected error: metric not collected yet
  • Running vtctldclient GetThrottlerStatus shows:
    • Self-scope metrics are collected correctly.
    • Shard-scope metrics consistently fail with metric not collected yet.

The error never clears. Disabling throttling immediately restores vreplication to normal behavior.

Expected Behavior

Throttler metric collection should complete within the 250 ms collection window (perhaps a long shot for clusters of this size), or it should degrade gracefully without blocking vreplication. vreplication should not stall indefinitely.

Reproduction Steps

  • Run a cluster with 20 replicas across 2 regions.
  • Start vreplication into the main keyspace.
  • Enable the throttler using UpdateThrottlerConfig.

Observe that:

  • vreplication stalls,
  • shard-scope metrics fail permanently,
  • errors persist until the throttler is fully disabled.

Suspected Root Causes [we are still debugging whether the following points actually fix the issue]

1. Cluster size + cross-region latency exceeds collection window

  • Shard-scope metric collection requires the PRIMARY to reach out to all 20 replicas every 250 ms.
    activeCollectInterval = 250 * time.Millisecond // PRIMARY polls replicas
  • Cross-region replicas add 50–70 ms latency for half of these calls.
  • We suspect the 250 ms loop is insufficient for 20 RPCs, half of them cross-region, creating a constant backlog and continuous “metric not collected yet” failures (see the rough timing sketch below).
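
A rough timing sketch of that concern, written as standalone Go rather than Vitess code (the 60 ms per-call latency is an assumed mid-point of the 50–70 ms cross-region RTT mentioned above): if the CheckThrottler calls end up serialized, one collection cycle costs roughly 20 × 60 ms = 1.2 s, far beyond the 250 ms interval, whereas fully concurrent calls would cost only about one RTT.

// Standalone sketch, not Vitess code: compares a serialized vs. a fully
// concurrent collection cycle against the 250 ms activeCollectInterval.
package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		replicas        = 20
		perCallLatency  = 60 * time.Millisecond // assumed cross-region RTT
		collectInterval = 250 * time.Millisecond
	)

	serialized := time.Duration(replicas) * perCallLatency // calls queue behind each other
	concurrent := perCallLatency                           // calls fan out in parallel

	fmt.Printf("serialized cycle %v, fits in %v: %v\n", serialized, collectInterval, serialized <= collectInterval)
	fmt.Printf("concurrent cycle %v, fits in %v: %v\n", concurrent, collectInterval, concurrent <= collectInterval)
}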

2. grpctmclient concurrency bottleneck (mutex contention)

We suspect issues with go/vt/vttablet/grpctmclient/client.go:

  • A single mutex (client.mu) guards two unrelated structures:
    • rpcClientMap (the buffered channel pool)
    • rpcDialPoolMap (the dedicated connection pool)

// grpcClient implements both dialer and poolDialer.
type grpcClient struct {
	// This cache of connections is to maximize QPS for ExecuteFetchAs{Dba,App},
	// CheckThrottler and FullStatus. Note we'll keep the clients open and close them upon Close() only.
	// But that's OK because usually the tasks that use them are one-purpose only.
	// The map is protected by the mutex.
	mu             sync.Mutex
	rpcClientMap   map[string]chan *tmc
	rpcDialPoolMap map[DialPoolGroup]addrTmcMap
}

Implications:

  • During tablet inventory refresh or startup, dial-pool initialization can hold the mutex.
  • Meanwhile, throttler metric collection calls (CheckThrottler) block on this mutex.
  • These calls hit the 1 s timeout, causing repeated “metric not collected yet” errors.
  • Increasing tablet_manager_grpc_concurrency (e.g., from 8 → 64) may reduce pressure but does not fix the root problem.

Possible Fix:

  • In go/vt/vttablet/grpctmclient/client.go, replace the single client.mu with two mutexes:
	// poolMu protects rpcClientMap, dedicatedMu protects rpcDialPoolMap.
	poolMu      sync.Mutex
	dedicatedMu sync.Mutex
  • Use poolMu in dialPool and dedicatedMu in dialDedicatedPool, as sketched below.
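
A minimal sketch of what the struct could look like after the split (the field layout is our proposal, not current Vitess code; tmc, DialPoolGroup, and addrTmcMap are the existing types in that file):

// Sketch only: grpcClient with client.mu split in two, so dedicated-pool
// dialing cannot block the buffered-channel pool and vice versa.
type grpcClient struct {
	// poolMu protects rpcClientMap (the buffered channel pool used by dialPool).
	poolMu       sync.Mutex
	rpcClientMap map[string]chan *tmc

	// dedicatedMu protects rpcDialPoolMap (the dedicated connection pool used
	// by dialDedicatedPool).
	dedicatedMu    sync.Mutex
	rpcDialPoolMap map[DialPoolGroup]addrTmcMap
}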

3. Remaining bottleneck after separating poolMu and dedicatedMu

Even after separating the mutexes:

  • dialDedicatedPool still holds dedicatedMu while executing network I/O inside createTmc().
  • When the PRIMARY collects metrics from 20+ replicas, it spawns many goroutines.
  • All of them call dialDedicatedPool, and serialization on dedicatedMu causes:
    • wait times exceeding several seconds,
    • gRPC context deadline exceeded (1 s) errors,
    • continuous “metric not collected yet” throttler errors.
  • Since the collector runs every 250 ms but each cycle can take multiple seconds, this leads to a permanent backlog and never-ending throttling (see the sketch below).
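
A standalone sketch of that serialization effect, not Vitess code (the 60 ms figure is again an assumption drawn from the RTT above): 20 goroutines contend for a single mutex that is held across a simulated dial, so the last caller waits roughly 19 × 60 ms ≈ 1.14 s and blows through a 1 s CheckThrottler deadline.

// Standalone sketch: a mutex held across simulated network I/O serializes
// 20 concurrent callers; the slowest one waits longer than a 1 s deadline.
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var (
		dialMu  sync.Mutex // stands in for dedicatedMu
		resMu   sync.Mutex
		slowest time.Duration
		wg      sync.WaitGroup
	)
	start := time.Now()
	for i := 0; i < 20; i++ { // one goroutine per replica being checked
		wg.Add(1)
		go func() {
			defer wg.Done()
			dialMu.Lock()
			time.Sleep(60 * time.Millisecond) // stand-in for createTmc network I/O
			dialMu.Unlock()

			resMu.Lock()
			if d := time.Since(start); d > slowest {
				slowest = d
			}
			resMu.Unlock()
		}()
	}
	wg.Wait()
	fmt.Printf("slowest caller waited %v; exceeds 1 s deadline: %v\n", slowest, slowest > time.Second)
}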

Possible fix:

  • Add the following to grpctmclient/client.go:
// tmcEntry holds the lazily dialed connection for one address; once
// guarantees that createTmc runs at most one time per entry.
type tmcEntry struct {
	once sync.Once
	tmc  *tmc
	err  error
}

type addrTmcMap map[string]*tmcEntry
  • Change the code in dialDedicatedPool to:
func (client *grpcClient) dialDedicatedPool(ctx context.Context, dialPoolGroup DialPoolGroup, tablet *topodatapb.Tablet) (tabletmanagerservicepb.TabletManagerClient, invalidatorFunc, error) {
	addr := netutil.JoinHostPort(tablet.Hostname, int32(tablet.PortMap["grpc"]))
	opt, err := grpcclient.SecureDialOption(cert, key, ca, crl, name)
	if err != nil {
		return nil, nil, err
	}

	client.dedicatedMu.Lock()
	if client.rpcDialPoolMap == nil {
		client.rpcDialPoolMap = make(map[DialPoolGroup]addrTmcMap)
	}
	if _, ok := client.rpcDialPoolMap[dialPoolGroup]; !ok {
		client.rpcDialPoolMap[dialPoolGroup] = make(addrTmcMap)
	}
	m := client.rpcDialPoolMap[dialPoolGroup]
	entry, ok := m[addr]
	if !ok {
		entry = &tmcEntry{}
		m[addr] = entry
	}
	client.dedicatedMu.Unlock() // unlock before the createTmc network call below

	// Initialize connection exactly once, without holding the mutex
	entry.once.Do(func() {
		entry.tmc, entry.err = client.createTmc(ctx, addr, opt)
	})

	if entry.err != nil {
		// Don't cache a failed dial: drop the entry (if it is still the one
		// we created) so that the next caller retries the connection.
		client.dedicatedMu.Lock()
		if m[addr] == entry {
			delete(m, addr)
		}
		client.dedicatedMu.Unlock()
		return nil, nil, entry.err
	}

	invalidator := func() {
		client.dedicatedMu.Lock()
		defer client.dedicatedMu.Unlock()
		if entry.tmc != nil && entry.tmc.cc != nil {
			entry.tmc.cc.Close()
		}
		if m[addr] == entry {
			delete(m, addr)
		}
	}
	return entry.tmc.client, invalidator, nil
}
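
A side note on the design: sync.Once is what makes it safe to release dedicatedMu before dialing, because callers racing on the same tmcEntry still trigger exactly one createTmc call and all observe its result. A tiny standalone sketch of that property (not a Vitess test):

// Standalone sketch: 20 concurrent callers of once.Do perform exactly one
// "dial" and all see the same cached value afterwards.
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type entry struct {
	once sync.Once
	val  int
}

func main() {
	var dials atomic.Int32
	e := &entry{}
	var wg sync.WaitGroup
	for i := 0; i < 20; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			e.once.Do(func() {
				dials.Add(1) // stand-in for createTmc
				e.val = 42
			})
		}()
	}
	wg.Wait()
	fmt.Printf("dials=%d, val=%d\n", dials.Load(), e.val) // always prints dials=1, val=42
}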

Binary Version

Vitess V21

Operating System and Environment details

Darwin 24.5.0
arm64

Log Fragments
