CDC Mirror stuck in infinite retry loop on table snapshot failure — need per-table fault isolation and pause/edit/resume capability #3973

@nareshntr

Description

When a CDC mirror is created with multiple tables (e.g. 5 tables), and one table fails during the initial snapshot phase, the entire mirror gets stuck in an infinite retry loop on the failed table. Successfully snapshotted tables are blocked, and there is no way to skip, pause, edit, or resume without deleting the entire mirror.

To Reproduce

Create a CDC mirror from MariaDB → ClickHouse with 5 tables, all with initial snapshot enabled.
Tables 1 and 2 complete initial snapshot successfully.
Table 3 has a misconfigured partition column: toYYYYMM(crt_dt), a ClickHouse function expression, was used as the watermark/partition key, which is invalid on the MariaDB source side.
Table 3 fails with:

failed to get partitions from source: ERROR 1054 (42S22): Unknown column 'toYYYYMM(crt_dt)' in 'SELECT'

The mirror does not proceed to Tables 4 and 5. It retries Table 3 in an infinite loop.
There is no option to pause the mirror, skip the failing table, edit the configuration, or resume from a checkpoint.

Expected Behavior
Option A — Fault isolation per table: If one table's initial snapshot fails, the mirror should skip that table (marking it as failed), continue snapshotting the remaining tables, and report a partial success with clear status per table.
Option B — Pause + Edit + Resume: The mirror should allow the user to:

Pause the mirror mid-snapshot
Edit the failing table's configuration (e.g. fix the partition/watermark column)
Resume only the failed table's snapshot without re-running already-completed tables

Either approach is acceptable. Ideally both should be supported.
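To make the two options concrete, here is a minimal sketch of how they could combine. It is not PeerDB's actual implementation; `run_snapshot`, `SnapshotState`, and the `snapshot_table` callback are hypothetical names, and the bounded `max_attempts` is an assumption standing in for whatever retry policy the mirror uses.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, Iterable, Optional


class TableStatus(Enum):
    COMPLETED = "Completed"
    FAILED = "Failed"


@dataclass
class SnapshotState:
    """Hypothetical checkpoint: per-table outcome plus the last error seen."""
    status: Dict[str, TableStatus] = field(default_factory=dict)
    errors: Dict[str, str] = field(default_factory=dict)


def run_snapshot(
    tables: Iterable[str],
    snapshot_table: Callable[[str], None],
    state: Optional[SnapshotState] = None,
    max_attempts: int = 3,
) -> SnapshotState:
    """Snapshot each table independently; one failure never blocks the rest."""
    if state is None:
        state = SnapshotState()
    for table in tables:
        # Resume: tables that already completed are not snapshotted again.
        if state.status.get(table) is TableStatus.COMPLETED:
            continue
        last_error: Optional[Exception] = None
        for _ in range(max_attempts):  # bounded retries instead of an infinite loop
            try:
                snapshot_table(table)
                last_error = None
                break
            except Exception as exc:
                last_error = exc
        if last_error is None:
            state.status[table] = TableStatus.COMPLETED
            state.errors.pop(table, None)
        else:
            # Option A: mark the table as failed and move on to the next one.
            state.status[table] = TableStatus.FAILED
            state.errors[table] = str(last_error)
    return state
```

Calling `run_snapshot` a second time with the returned `state` (after the table's configuration has been fixed) would re-run only the tables marked as failed, which is the resume behavior described in Option B.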

Current Behavior

Mirror is permanently stuck in a retry loop on the failed table
Already-completed tables (1 & 2) are not making CDC progress because the mirror never exits the snapshot phase
The only recovery option is to delete the entire mirror and start over, losing all snapshot progress on Tables 1 & 2
No UI option to pause, edit the table mapping, or resume a partial snapshot

Root Cause (in this case)
The partition/watermark expression toYYYYMM(crt_dt) is ClickHouse-specific syntax and was mistakenly used as the source-side partition column. PeerDB sends this expression directly to MariaDB in a SELECT, which MariaDB does not understand. Validation at mirror creation time should catch this and reject ClickHouse function expressions in the source partition column field.
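A rough sketch of the kind of check that could run at mirror creation time is below. The function names are hypothetical and not part of PeerDB, and the identifier rule (a bare identifier or a backtick-quoted one) is a simplifying assumption rather than MariaDB's full identifier grammar.

```python
import re

# Assumption: a "plain column name" is either an unquoted identifier made of
# letters, digits, "_" and "$", or a single backtick-quoted identifier.
_UNQUOTED = re.compile(r"[A-Za-z_][A-Za-z0-9_$]*")
_BACKTICKED = re.compile(r"`[^`]+`")


def is_plain_column_name(value: str) -> bool:
    """Return True if value looks like a single column name, not an expression."""
    value = value.strip()
    return bool(_UNQUOTED.fullmatch(value) or _BACKTICKED.fullmatch(value))


def validate_partition_column(table: str, value: str) -> None:
    """Raise ValueError before the mirror is created if the column is invalid."""
    if not is_plain_column_name(value):
        raise ValueError(
            f"table {table!r}: partition/watermark column {value!r} must be a "
            "plain source column name, not a function expression"
        )
```

With this check, `validate_partition_column("table3", "toYYYYMM(crt_dt)")` would raise at creation time, while `crt_dt` would be accepted. A stricter variant could also confirm that the column exists in the source table's schema.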

Suggested Fixes

Input validation: At mirror creation/edit time, validate that the watermark/partition column is a plain column name and not a function expression. Show an error before the mirror is created.
Per-table error isolation: Don't let one table's failure block the rest of the mirror's snapshot progress.
Pause/Edit/Resume: Allow pausing a mirror that is stuck in snapshot phase, editing individual table configurations, and resuming only failed tables.
Mirror status per table: Show per-table status in the UI (Completed / In Progress / Failed) so users can see exactly which table is the problem.
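As an illustration of the last point, per-table statuses could be rolled up into a single mirror-level status along these lines. This is only a sketch: the function names and the "Partial Success" label are assumptions, and the three per-table values are taken from the list above.

```python
from typing import Dict

VALID_STATUSES = {"Completed", "In Progress", "Failed"}


def summarize_mirror(table_status: Dict[str, str]) -> str:
    """Derive one mirror-level status from the per-table statuses."""
    values = set(table_status.values())
    unknown = values - VALID_STATUSES
    if unknown:
        raise ValueError(f"unknown table status: {sorted(unknown)}")
    if not values:
        return "Empty"
    if values == {"Completed"}:
        return "Completed"
    if "In Progress" in values:
        return "In Progress"
    if "Completed" in values:
        # At least one table completed and at least one failed.
        return "Partial Success"
    return "Failed"


def format_report(table_status: Dict[str, str]) -> str:
    """Render one line per table, followed by the mirror-level summary."""
    lines = [f"{name}: {status}" for name, status in sorted(table_status.items())]
    lines.append(f"mirror: {summarize_mirror(table_status)}")
    return "\n".join(lines)
```

For the scenario in this issue, Tables 1 and 2 would show as Completed and Table 3 as Failed, so the failing table is visible at a glance instead of being hidden behind a single mirror-wide retry state.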

Environment

PeerDB Version: 0.36.7
Source: MariaDB
Destination: ClickHouse
Deployment: Docker Compose
