35 changes: 35 additions & 0 deletions docs/streams/architecture.html
@@ -73,6 +73,41 @@ <h3 class="anchor-heading"><a id="streams_architecture_tasks" class="anchor-link
work.
</p>

<h4 class="anchor-heading"><a id="streams_architecture_subtopology" class="anchor-link"></a><a href="#streams_architecture_subtopology">Increasing Parallelism with Multiple Input Topics</a></h4>

<p>
When your application reads from multiple input topics, the way you structure your topology has a significant impact on parallelism.
If you use <code>builder.stream(Arrays.asList("topic-a", "topic-b", "topic-c"))</code> or a regex pattern to subscribe to multiple topics,
Kafka Streams creates a single sub-topology for all these topics. In this case, the maximum number of tasks is determined by the
<b>maximum</b> number of partitions across all the input topics, not the sum. This means that records with the same
partition number, taken from every subscribed topic, are funneled through a single task.
</p>

<p>
For example, if topic-a has 5 partitions, topic-b has 3 partitions, and topic-c has 4 partitions, subscribing to all three topics
in a single <code>stream()</code> call will result in only 5 tasks. All records from partition-0 of topic-a, topic-b, and topic-c
will be processed by the same task.
</p>

<p>
To achieve independent parallelism per topic, you can structure your topology to create separate sub-topologies by processing
each topic individually:
</p>

<pre class="line-numbers"><code class="language-java">List&lt;String&gt; topics = Arrays.asList("topic-a", "topic-b", "topic-c");

for (String topic : topics) {
    KStream&lt;String, String&gt; input = builder.stream(topic);
    input.filter(...).map(...).to("output-topic");
}</code></pre>

<p>
With this approach, each input topic gets its own sub-topology and its own independent set of tasks.
The total number of tasks becomes the <b>sum</b> of partitions across all topics (5 + 3 + 4 = 12 in the above example),
allowing for greater parallelism. Each topic's data is processed independently, which can significantly improve throughput
when you have multiple input topics with substantial data volumes.
</p>
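<p>
The two task counts from the example above can be checked with a short, self-contained sketch. This is plain arithmetic, not the Kafka Streams API; the class and method names are illustrative only:
</p>

```java
import java.util.stream.IntStream;

public class TaskCount {
    // Partition counts for topic-a, topic-b, topic-c from the example above.
    static final int[] PARTITIONS = {5, 3, 4};

    // Single sub-topology (one stream() call for all topics):
    // tasks = maximum partition count across the input topics.
    static int singleSubTopologyTasks() {
        return IntStream.of(PARTITIONS).max().getAsInt();
    }

    // One sub-topology per topic (one stream() call per topic):
    // tasks = sum of partition counts across the input topics.
    static int perTopicSubTopologyTasks() {
        return IntStream.of(PARTITIONS).sum();
    }

    public static void main(String[] args) {
        System.out.println(singleSubTopologyTasks());   // maximum: 5
        System.out.println(perTopicSubTopologyTasks()); // sum: 12
    }
}
```

<p>
You can confirm how Kafka Streams actually split your topology by inspecting the <code>TopologyDescription</code> returned from <code>builder.build().describe()</code>, which lists each sub-topology and its source topics.
</p>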

<p>
It is important to understand that Kafka Streams is not a resource manager, but a library that "runs" anywhere its stream processing application runs.
Multiple instances of the application are executed either on the same machine, or spread across multiple machines, and tasks can be distributed automatically
4 changes: 3 additions & 1 deletion docs/streams/developer-guide/dsl-api.html
@@ -261,7 +261,9 @@ <h4 class="anchor-heading"><a id="streams_concepts_globalktable" class="anchor-l
<p>You <strong>must specify Serdes explicitly</strong> if the key or value types of the records in the Kafka input
topics do not match the configured default Serdes. For information about configuring default Serdes, available
Serdes, and implementing your own custom Serdes see <a class="reference internal" href="datatypes.html#streams-developer-guide-serdes"><span class="std std-ref">Data Types and Serialization</span></a>.</p>
<p class="last">Several variants of <code class="docutils literal"><span class="pre">stream</span></code> exist. For example, you can specify a regex pattern for input topics to read from (note that all matching topics will be part of the same input topic group, and the work will not be parallelized for different topics if subscribed to in this way).</p>
<p class="last">Several variants of <code class="docutils literal"><span class="pre">stream</span></code> exist. For example, you can specify a regex pattern for input topics to read from (note that all matching topics will be part of the same input topic group, and the work will not be parallelized for different topics if subscribed to in this way).
To learn how to increase parallelism when reading from multiple topics, see
<a class="reference internal" href="../architecture.html#streams_architecture_subtopology"><span class="std std-ref">Increasing Parallelism with Multiple Input Topics</span></a>.</p>
</td>
</tr>
<tr class="row-odd"><td><p class="first"><strong>Table</strong></p>
4 changes: 3 additions & 1 deletion docs/streams/developer-guide/running-app.html
@@ -128,7 +128,9 @@ <h3><a class="toc-backref" href="#id8">Determining how many application instance
example, if your application reads from a single topic that has ten partitions, then you can run up to ten instances
of your application. You can run further instances, but these will be idle.</p>
<p>The number of topic partitions is the upper limit for the parallelism of your Kafka Streams application and for the
number of running instances of your application.</p>
number of running instances of your application. If your application reads from multiple topics, see
<a class="reference internal" href="../architecture.html#streams_architecture_subtopology"><span class="std std-ref">Increasing Parallelism with Multiple Input Topics</span></a>
for how to maximize parallelism.</p>
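<p>
The instance-sizing rule above can be sketched as a small helper. This helper is not part of the Kafka Streams API; it is a hypothetical illustration of the arithmetic:
</p>

```java
public class InstanceSizing {
    // Hypothetical helper: given the number of tasks (the partition count of the
    // input topic for a single-sub-topology application) and the number of running
    // instances, report how many instances will receive no tasks and sit idle.
    static int idleInstances(int tasks, int instances) {
        return Math.max(0, instances - tasks);
    }

    public static void main(String[] args) {
        // Ten input partitions but twelve instances: two instances stay idle.
        System.out.println(idleInstances(10, 12));
        // Ten input partitions and eight instances: none idle.
        System.out.println(idleInstances(10, 8));
    }
}
```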
<p>To achieve balanced workload processing across application instances and to prevent processing hotspots, you should
distribute data and processing workloads:</p>
<ul class="simple">