Add Iceberg CDC source plugin #6554
Conversation
**Discussion: PeerForwarder extension for Source-layer shuffle**

The current implementation groups DELETED and ADDED files together at the Source layer so that carryover removal can happen locally on each node. This works well when bounds-based pairing succeeds, but falls back to single-node processing when it fails (e.g., when an UPDATE changes a column's min/max bounds).

A more scalable approach would be to shuffle rows between ChangelogWorkers by a hash of all data columns (similar to Spark's repartition). Data Prepper's PeerForwarder infrastructure provides useful building blocks for this (consistent hashing via HashRing, node discovery, Armeria HTTP transport). However, the PeerForwarder's orchestration layer is tightly coupled to the Processor execution model (synchronous send/receive within a batch processing cycle, ReceiveBuffer integration with the Pipeline, pluginId-based routing), so Source-layer shuffling would likely need to build its own send/receive flow on top of these lower-level components, or the PeerForwarder architecture would need to be generalized to support non-Processor use cases.

RFC #700 (Core Peer Forwarding) notes that the initial implementation targets Processor plugins only, and suggests opening a new issue to expand the functionality to Source or Sink plugins. I plan to open a separate issue proposing this extension, with Iceberg CDC as the first use case. I'd like to hear the maintainers' thoughts on the best approach for inter-node data exchange at the Source layer.

**2026-03-11 UPDATE: Additional benefit of Source-layer shuffle**

Source-layer shuffle would also enable processing multiple snapshots in a single batch. Currently, the LeaderScheduler processes each snapshot sequentially: plan one snapshot, then wait for all partitions to complete (including sink acknowledgement, when acknowledgements are enabled) before planning the next. Batch processing multiple snapshots would improve performance. This is particularly relevant for tables with frequent commits (e.g., Spark Streaming committing every few minutes), where dozens of snapshots can accumulate between polling intervals. Note that batch processing also requires ordering writes by …
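The hash-based routing described above can be sketched as follows. This is a minimal illustration under assumed names (`RowShuffleRouter` and `targetNode` are not part of the plugin or Data Prepper); a production version would use the PeerForwarder's consistent HashRing rather than a plain modulo so that nodes joining or leaving move fewer keys.

```java
import java.util.List;

// Hypothetical sketch: route each changelog row to a worker node by hashing
// all of its data columns. Carryover DELETE/INSERT pairs have identical data
// column values, so they always land on the same worker, which lets carryover
// removal run locally, mirroring the effect of Spark's repartition.
public class RowShuffleRouter {
    private final int nodeCount;

    public RowShuffleRouter(final int nodeCount) {
        this.nodeCount = nodeCount;
    }

    // List.hashCode() is content-based, so rows with identical data column
    // values always map to the same node index.
    public int targetNode(final List<Object> dataColumnValues) {
        return Math.floorMod(dataColumnValues.hashCode(), nodeCount);
    }
}
```

The key property is determinism: the two halves of a carryover pair may come from different files on different nodes, but they hash to the same destination.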
Force-pushed from bf1d396 to 345ca7f
dlvenable left a comment:
Thank you @lawofcycles for this great contribution!
I left a few initial comments. I want to look deeper at the logic as well.
One thing we will need before making this official is a set of integration tests that run against an actual Iceberg Docker image. We do this for our Kafka buffer and our OpenSearch sink.
```java
private String tableName;

@JsonProperty("catalog")
private Map<String, String> catalog = Collections.emptyMap();
```
Is there a model that we can use here instead of a map? The models provide the best experience for the community because they promote configuration consistency and help with documentation.
I looked into this. Iceberg does not provide a typed model for catalog configuration. All official implementations pass catalog properties as a string map:

- Java (reference): `CatalogUtil.buildIcebergCatalog(String name, Map<String, String> options, Object conf)`
- Python (PyIceberg): `Properties = dict[str, Any]`
- Rust: `HashMap<String, String>`
- Go: `Properties = map[string]string`
Kafka Connect's Iceberg integration also uses Map<String, String> in the same way.
I think this is an intentional design choice, because each catalog type (Glue, REST, Hive, Nessie, JDBC, etc.) accepts a completely different set of properties. A typed model would either couple the plugin to specific catalog types or still need a map fallback for catalog-specific properties.
That said, the catalog properties will need thorough documentation with examples for each supported catalog type so that users have clear guidance on what to configure.
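For illustration, here is what the catalog map might look like in pipeline YAML for a REST catalog. The `type`, `uri`, and `warehouse` keys are standard Iceberg catalog properties; the surrounding `source`/`iceberg` structure and the values are assumptions for this sketch, not the plugin's final schema.

```yaml
source:
  iceberg:
    # Passed through as Map<String, String> to CatalogUtil.buildIcebergCatalog().
    # Property names are standard Iceberg catalog properties; values are examples.
    catalog:
      type: "rest"
      uri: "http://localhost:8181"
      warehouse: "s3://example-bucket/warehouse"
```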
```java
@JsonProperty("initial_load")
private boolean initialLoad = true;
```

Suggested change:

```java
@JsonProperty("disable_export")
private boolean disableExport = false;
```
Our other CDC sources have both export and stream configurations. And we allow the following combinations:
- Export and stream
- Export only
- Stream only
Unless "initial load" is a specialized term in Iceberg, we should keep our existing conventions.
Also, we prefer to default boolean values to false.
We may eventually want to add disable_stream but that is not necessary now.
Changed from `initialLoad` to `disableExport`.
```java
@@ -0,0 +1,49 @@
/*
```
Please use the updated license headers.
https://github.com/opensearch-project/data-prepper/blob/main/CONTRIBUTING.md#license-headers
```java
public class ChangelogTaskProgressState {

    @JsonProperty("snapshot_id")
```
These are saved to DynamoDB for source coordination. We use camelCase for these keys. Please update accordingly.
Made them camelCase.
```java
@@ -0,0 +1,28 @@
plugins {
```
You don't need these three lines. You can remove them.
```java
import java.util.Objects;
import java.util.function.Function;

@DataPrepperPlugin(name = "iceberg", pluginType = Source.class, pluginConfigurationType = IcebergSourceConfig.class)
```
Let's add the @Experimental annotation here. I'd like to get this PR merged, but we do have a few other items to address before making this fully supported.
```java
    this.state = state;
}

public ChangelogTaskPartition(final SourcePartitionStoreItem sourcePartitionStoreItem) {
```
This code should have unit tests. Verify the getter results.
Added a unit test.
```java
}

public GlobalState(final SourcePartitionStoreItem sourcePartitionStoreItem) {
    setSourcePartitionStoreItem(sourcePartitionStoreItem);
```
This code should have unit tests. Verify the getter results.
Added a unit test.
```java
}

public InitialLoadTaskPartition(final SourcePartitionStoreItem sourcePartitionStoreItem) {
    setSourcePartitionStoreItem(sourcePartitionStoreItem);
```
This code should have unit tests. Verify the getter results.
```java
}

public LeaderPartition(final SourcePartitionStoreItem sourcePartitionStoreItem) {
    setSourcePartitionStoreItem(sourcePartitionStoreItem);
```
This code should have unit tests. Verify the getter results.
Signed-off-by: Sotaro Hikita <bering1814@gmail.com>
Force-pushed from ba2cd6f to 8ee1b65
Force-pushed from 3e8ae17 to 3530318
…ecordConverter Signed-off-by: Sotaro Hikita <bering1814@gmail.com>
@dlvenable Thank you for the review. Here is a summary of the updates.

**Review feedback**

**Integration tests**

```shell
docker-compose -f data-prepper-plugins/iceberg-source/docker/docker-compose.yml up -d
./gradlew :data-prepper-plugins:iceberg-source:integrationTest
docker-compose -f data-prepper-plugins/iceberg-source/docker/docker-compose.yml down
```

**Code improvements**

**Manual verification**

I am changing this PR from Draft to Ready for Review.
There are a couple of checkstyle errors. Build results:
When a CoW UPDATE produces a DELETE + INSERT pair with the same document_id after carryover removal, emit only the INSERT as INDEX. Since OpenSearch INDEX is an upsert, the DELETE is unnecessary. This also eliminates a potential issue where multiple ProcessWorker threads consuming from the buffer in parallel could reorder DELETE and INDEX operations for the same document, causing data loss. Signed-off-by: Sotaro Hikita <bering1814@gmail.com>
Found and fixed a correctness issue with UPDATE event handling. Previously, a CoW UPDATE produced two separate events after carryover removal: a DELETE (old row) followed by an INDEX (new row) for the same document_id. The fix merges UPDATE pairs into a single event: after carryover removal, when a DELETE and an INSERT share the same document_id, only the INSERT is emitted, as an INDEX action.
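The pair-merge logic described in this fix can be sketched roughly as follows. The `Row`/`Action` types, the string operation names, and the grouping approach are illustrative assumptions, not the plugin's actual classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: after carryover removal, collapse a DELETE + INSERT
// pair that shares a document_id into a single INDEX action. Because
// OpenSearch INDEX is an upsert, the preceding DELETE is unnecessary, and
// dropping it avoids DELETE/INDEX reordering across ProcessWorker threads.
public class UpdateMerger {
    public record Row(String documentId, String operation) {}
    public record Action(String documentId, String bulkAction) {}

    public static List<Action> merge(final List<Row> rows) {
        final Map<String, List<Row>> byDoc = new LinkedHashMap<>();
        for (final Row row : rows) {
            byDoc.computeIfAbsent(row.documentId(), k -> new ArrayList<>()).add(row);
        }
        final List<Action> actions = new ArrayList<>();
        for (final List<Row> group : byDoc.values()) {
            final boolean hasInsert = group.stream()
                    .anyMatch(r -> r.operation().equals("INSERT"));
            if (hasInsert) {
                // A lone INSERT, or a DELETE + INSERT update pair, becomes one
                // INDEX upsert carrying the new row.
                actions.add(new Action(group.get(0).documentId(), "INDEX"));
            } else {
                // Only DELETEs remain: the document was genuinely removed.
                actions.add(new Action(group.get(0).documentId(), "DELETE"));
            }
        }
        return actions;
    }
}
```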
Fixed in 3dc4a7d
@lawofcycles, I agree that this would be a great approach. The processor approach was what we needed at the time. We actually have some hard-coded, non-peer-forwarder support for source forwarding: it is in our OTel traces pipeline, only when used with the Kafka buffer. In that case, we put the traces on different partitions. The scenarios you outlined, as well as tracing, are both good examples of how we could make use of source-based peer-forwarding. Note that we do not generally recommend using …
```java
taskState.setDataFilePath(task.file().location());
taskState.setTotalRecords(task.file().recordCount());

final String partitionKey = tableName + "|initial|" + UUID.randomUUID();
```
Why do you use a UUID here? Could this cause duplicates?
The UUID was used for simplicity, but you are right that it can cause duplicate partitions if the leader crashes after creating partitions but before updating lastProcessedSnapshotId. Since tryCreatePartitionItem uses a conditional put that returns false for existing keys, switching to deterministic keys would prevent this.
For initial load: tableName|initial|snapshotId|filePath (one file per task, so the file path is unique).
For CDC: tableName|snapshotId|sha256(sorted file paths) (the hash of sorted file paths within each task group).
I will submit a follow-up PR for this change.
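The proposed deterministic keys could look roughly like this. This is a hedged sketch: the class and method names are hypothetical, and the exact separator/encoding choices are illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.List;

// Hypothetical sketch of deterministic partition keys. Because the keys are
// derived from stable inputs, a leader that crashes and replans the same
// snapshot regenerates identical keys, and the coordinator's conditional put
// (tryCreatePartitionItem) rejects the duplicates.
public class PartitionKeys {
    // Initial load: one file per task, so the file path is unique.
    public static String initialLoadKey(final String tableName, final long snapshotId,
                                        final String filePath) {
        return tableName + "|initial|" + snapshotId + "|" + filePath;
    }

    // CDC: hash the sorted file paths of the task group so the key is
    // independent of planning order.
    public static String cdcKey(final String tableName, final long snapshotId,
                                final List<String> filePaths) {
        final String joined = filePaths.stream().sorted()
                .reduce("", (a, b) -> a + "\n" + b);
        try {
            final byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(joined.getBytes(StandardCharsets.UTF_8));
            return tableName + "|" + snapshotId + "|" + HexFormat.of().formatHex(digest);
        } catch (final NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```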
```java
for (final ChangelogTaskProgressState taskState : taskGroups) {
    final String partitionKey = tableName + "|" + snapshot.snapshotId()
            + "|" + UUID.randomUUID();
```
Why do you use a UUID here? Could this cause duplicates?
Same as above; this will be addressed in the same follow-up PR.
```java
    data.put(field.name(), convertValue(value, field.type()));
}

final Event event = JacksonEvent.builder()
```
We should move toward using the EventFactory.
Here is a simple example of usage:
```java
    }
}

LOG.info("Planned {} task group(s) for table {} (snapshot {} -> {})",
```
This could become noisy quickly. We should make this a DEBUG log.
Another option here is to use a DistributionSummary metric for this. That would allow you to analyze the maximum and average task group sizes.
```java
        tableName, tableConfig.getIdentifierColumns());
final CarryoverRemover carryoverRemover = new CarryoverRemover();

LOG.info("Processing partition for table {} snapshot {} with {} file(s)",
```
This should become a debug-level log.
```java
    changelogRows.add(new CarryoverRemover.ChangelogRow(dataColumns, row.operation, i));
}
survivingIndices = carryoverRemover.removeCarryover(changelogRows);
LOG.info("Carryover removal: {} rows -> {} rows", allRows.size(), survivingIndices.size());
```
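For context on what carryover removal does: with copy-on-write, rewriting a data file re-emits every unchanged row as a DELETE from the old file plus an INSERT into the new file with identical data columns. Cancelling one DELETE against one INSERT per identical row leaves only genuine changes. A hedged sketch of that cancellation, using an illustrative `ChangelogRow` shape rather than the plugin's actual type:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of carryover removal: pair up DELETEs and INSERTs with
// identical data column values and return the indices of rows that survive.
public class CarryoverRemovalSketch {
    public record ChangelogRow(List<Object> dataColumns, String operation, int index) {}

    public static List<Integer> survivingIndices(final List<ChangelogRow> rows) {
        // Count DELETEs and INSERTs per distinct row content.
        final Map<List<Object>, Integer> deletes = new HashMap<>();
        final Map<List<Object>, Integer> inserts = new HashMap<>();
        for (final ChangelogRow row : rows) {
            final Map<List<Object>, Integer> counts =
                    row.operation().equals("DELETE") ? deletes : inserts;
            counts.merge(row.dataColumns(), 1, Integer::sum);
        }
        // A DELETE/INSERT pair with identical content is carryover: skip
        // min(deletes, inserts) occurrences of each operation per content key.
        final Map<List<Object>, Integer> toSkipDeletes = new HashMap<>();
        final Map<List<Object>, Integer> toSkipInserts = new HashMap<>();
        for (final List<Object> key : deletes.keySet()) {
            final int pairs = Math.min(deletes.get(key), inserts.getOrDefault(key, 0));
            toSkipDeletes.put(key, pairs);
            toSkipInserts.put(key, pairs);
        }
        final List<Integer> surviving = new ArrayList<>();
        for (final ChangelogRow row : rows) {
            final Map<List<Object>, Integer> skip =
                    row.operation().equals("DELETE") ? toSkipDeletes : toSkipInserts;
            final int remaining = skip.getOrDefault(row.dataColumns(), 0);
            if (remaining > 0) {
                skip.put(row.dataColumns(), remaining - 1);  // cancelled carryover
            } else {
                surviving.add(row.index());                  // genuine change
            }
        }
        return surviving;
    }
}
```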
Great work @lawofcycles, nice to see it merged!
@dlvenable Thank you for the feedback and for sharing the OTel traces precedent. That is very helpful context. I have been thinking about the implementation approach and would like to propose a two-track plan.

**Track 1: Framework (RFC)**

**Track 2: Iceberg Source (plugin-level implementation)**

**Convergence**

This approach keeps the Iceberg CDC work moving forward while ensuring the framework design benefits from a concrete, working implementation as a reference. Does this plan sound reasonable?
@dlvenable I need to correct my earlier proposal. After further analysis, I found that the PeerForwarder extension (Track 1) is not necessary. I am now thinking the shuffle should use a pull-based approach (similar to Spark's shuffle architecture), where writers partition data to local disk by hash key and readers pull their partitions from each node via HTTP. This would not require extending the PeerForwarder framework, which is push-based.

I also identified a correctness bug in the current implementation: when a partition column is updated, the DELETE (old partition) and INSERT (new partition) for the same document end up in separate tasks. Since the UPDATE merge only operates within a single task, it cannot detect this cross-partition update. The DELETE and INSERT are sent to the sink independently with no ordering guarantee, which can cause data loss if the DELETE arrives after the INSERT. I will open a detailed issue with the revised design shortly.
Description
Adds a new `iceberg` source plugin that captures row-level changes (CDC) from Apache Iceberg tables and ingests them into the Data Prepper pipeline. The plugin uses Iceberg's `IncrementalChangelogScan` API directly, without requiring Spark or Flink.

Key features

- `EnhancedSourceCoordinator`

Current limitations

- `identifier_columns` must be configured by the user for CDC correctness

Catalog support: any catalog supported by Iceberg's `CatalogUtil.buildIcebergCatalog()` (Glue, REST, Hive, Hadoop, Nessie, JDBC, etc.)
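To make the configuration surface concrete, here is a hedged sketch of a full pipeline configuration. The key names are inferred from this conversation (`catalog` map, `identifier_columns`, `disable_export`); `table_name`, the sink block, and all values are illustrative assumptions, and the final plugin schema may differ.

```yaml
iceberg-pipeline:
  source:
    iceberg:
      catalog:
        # Standard Iceberg catalog properties, passed through as a string map.
        type: "glue"
        warehouse: "s3://example-bucket/warehouse"
      table_name: "analytics.events"      # assumed key name for the table
      identifier_columns: ["event_id"]    # required for CDC correctness
      disable_export: false               # export (initial load) runs by default
  sink:
    - opensearch:
        hosts: ["https://localhost:9200"]
        index: "events"
```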
Resolves #6552
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.