
Option to disable Distributed table creation and use only ReplicatedMergeTree for fully replicated ClickHouse clusters #3957

@nareshntr

Description:
Currently, when PeerDB detects a ClickHouse cluster, it automatically creates two tables:

`sites_shard` — using the ReplicatedMergeTree engine
`sites` — using the Distributed engine as a proxy

Our Use Case / Requirement:
We are running a fully replicated ClickHouse cluster (2 Data Nodes + 3 Keeper Nodes) where 100% of data is available on every node. We do not need data to be sharded/distributed across nodes. Our application connects directly to a single node, and since all data is present locally, there is no need for the Distributed table overhead.
We want PeerDB to create only the ReplicatedMergeTree table without the accompanying Distributed table wrapper.

Expected Behavior:
PeerDB should provide a configuration option (e.g., `disable_distributed_table: true` or `replication_mode: replicated_only`) so that only the following is created:
```sql
-- Expected: Only this table should be created
CREATE TABLE sites ON CLUSTER '{cluster}'
(
    id UInt64,
    name String,
    url String,
    status UInt8,
    _peerdb_synced_at DateTime DEFAULT now(),
    _peerdb_is_deleted UInt8 DEFAULT 0,
    _peerdb_version Int64 DEFAULT 0
)
ENGINE = ReplicatedMergeTree(
    '/clickhouse/tables/{shard}/sites',
    '{replica}'
)
ORDER BY (id);
```


**Expected output when running `ON CLUSTER` DDL:**

```
Query id: fccf05f4-320e-424a-873a-a2cdfe2ff4f3

┌─host──┬─port─┬─status─┬─error─┬─num_hosts_remaining─┬─num_hosts_active─┐
│ test1 │ 9000 │      0 │       │                   1 │                1 │
│ test2 │ 9000 │      0 │       │                   0 │                0 │
└───────┴──────┴────────┴───────┴─────────────────────┴──────────────────┘
```

Both nodes confirm successful table creation with status `0` (no errors), and the table is replicated to both nodes automatically via ClickHouse Keeper, which coordinates replicas using Raft consensus.
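As a sanity check, replica state for the single table can be inspected from any node with a query like the following (database and table names here match the example above; adjust as needed):

```sql
-- Inspect replica state of the sites table across all replicas.
-- clusterAllReplicas fans the query out to every node in the cluster.
SELECT
    hostName() AS host,
    database,
    table,
    replica_name,
    is_readonly
FROM clusterAllReplicas('{cluster}', system.replicas)
WHERE table = 'sites';
```

Both rows should report `is_readonly = 0`, confirming each replica is healthy and writable.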

---

**Current Behavior:**

PeerDB creates **two tables** even in a fully replicated setup:

```
SHOW TABLES

┌─name──────────────────────┐
│ _peerdb_raw_testing       │
│ _peerdb_raw_testing_shard │
│ sites                     │
│ sites_shard               │
└───────────────────────────┘
```

This results in:

Unnecessary Distributed table overhead
A confusing dual-table setup for a non-sharding use case
Queries on `sites_shard` return local data only, while `sites` proxies through the Distributed engine — but since data is 100% replicated, the Distributed layer adds no value
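For reference, the Distributed wrapper PeerDB currently creates on top of the local table looks roughly like this (sharding key and exact DDL are illustrative, not copied from PeerDB's source):

```sql
-- Approximate proxy table PeerDB adds on top of sites_shard.
-- Every query against sites is routed through the Distributed engine,
-- even though each node already holds 100% of the data locally.
CREATE TABLE sites ON CLUSTER '{cluster}' AS sites_shard
ENGINE = Distributed('{cluster}', currentDatabase(), sites_shard, rand());
```

In a fully replicated cluster this indirection only adds a network fan-out step to queries that could be served entirely from local data.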

Cluster Architecture:


2 Data Nodes (`test1`, `test2`) — each holding a full replica of all data
3 ClickHouse Keeper Nodes — handling coordination via Raft consensus
Engine: ReplicatedMergeTree with Keeper path pattern `/clickhouse/tables/{shard}/sites`
Replication confirmed working — both nodes hold identical data
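That "identical data" claim can be verified with a row-count comparison across replicas (using `sites_shard`, the local table name PeerDB creates today, in the `default` database — adjust names to your setup):

```sql
-- Compare row counts of the replicated table on every node;
-- matching counts indicate both replicas hold the same data.
SELECT hostName() AS host, count() AS rows
FROM clusterAllReplicas('{cluster}', default.sites_shard)
GROUP BY host;
```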


Proposed Solution:
Add a PeerDB mirror/peer configuration option such as:
```yaml
clickhouse_config:
  table_engine: ReplicatedMergeTree   # default: Auto (creates Distributed)
  create_distributed_table: false
```
Or expose this as a toggle in the PeerDB UI when setting up a ClickHouse mirror.

Why This Matters:
For teams running ClickHouse in HA/failover mode (not sharding mode), the Distributed table is unnecessary complexity. Direct ReplicatedMergeTree queries are faster, simpler to manage, and indexes/projections are easier to maintain without the dual-table confusion.
