Extract append-only optimization from Engine #84771
nik9000 wants to merge 12 commits into elastic:main from
Conversation
Out of curiosity, what would the implementation look like for TSDB? Would it track the max timestamp for each timeseries or something like that?
I was thinking of keeping some number of max timestamps, yeah. Like grabbing the low nibble from the hashed tsid and storing 64 timestamps. Or something like that. Then I got distracted by that.
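The bucketed idea above could be sketched roughly as follows. All names here are hypothetical, not from the PR, and since the comment mentions both a "low nibble" (16 values) and 64 timestamps, this sketch uses the low 6 bits of the hashed tsid to match the count of 64:

```java
// Hypothetical sketch: hash the _tsid, use its low bits to pick one of 64
// buckets, and track the max timestamp seen per bucket. Distinct tsids can
// share a bucket, so "may have been indexed before" can be a false positive,
// but a false answer is always correct.
import java.util.Arrays;

class TimestampBuckets {
    private final long[] maxTimestamps = new long[64];

    TimestampBuckets() {
        Arrays.fill(maxTimestamps, Long.MIN_VALUE);
    }

    /**
     * Returns true if a document for this tsid hash and timestamp may have
     * been indexed before. Updates the bucket's max timestamp as a side
     * effect, mirroring the engine's existing auto-id timestamp check.
     */
    boolean mayHaveBeenIndexedBefore(int tsidHash, long timestamp) {
        int bucket = tsidHash & 0x3F; // low 6 bits -> 64 buckets
        if (timestamp > maxTimestamps[bucket]) {
            maxTimestamps[bucket] = timestamp;
            return false;
        }
        return true;
    }
}
```

The appeal is constant memory regardless of how many tsids a shard holds, at the cost of occasionally falling back to the slow "does this id exist" path when unrelated tsids collide in a bucket.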
I just had a quick chat with @henningandersen about this PR and I believe that we have two main options at a high level. The first one is the one that you are suggesting, which consists of extracting the timestamp from the ID to optimize writes in case this timestamp is greater than all other timestamps that have been seen before. The second one consists of generating IDs so that Elasticsearch and Lucene could do it mostly automatically, by having an ID that concatenates the timestamp in BE order and then the TSID. The trade-off I'm seeing is that your approach requires more maintenance and is specific to timeseries data streams, but it is likely more space-efficient thanks to better sharing of prefixes of IDs, and more CPU-efficient in case data comes in timestamp order on a per-timeseries basis but not globally (which is not unlikely?). I'm curious if you have any data about how much larger the index of the _id field would be if we generated the ID by putting the timestamp first instead of last?
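For reference, the second option (timestamp-first IDs) might look like the following sketch. This is an illustration of the encoding idea only, not Elasticsearch's actual _id format:

```java
// Illustrative encoding: an _id formed by concatenating the timestamp in
// big-endian byte order with the tsid bytes, so that lexicographic _id
// order matches timestamp order across the whole index.
import java.nio.ByteBuffer;

class TimestampFirstId {
    static byte[] encode(long timestampMillis, byte[] tsid) {
        ByteBuffer buf = ByteBuffer.allocate(Long.BYTES + tsid.length);
        buf.putLong(timestampMillis); // ByteBuffer writes big-endian by default
        buf.put(tsid);
        return buf.array();
    }
}
```

Because the big-endian timestamp comes first, IDs for later documents sort after earlier ones, which is what would let the existing append-only machinery kick in automatically. The downside raised in this thread is exactly the prefix-sharing point: IDs for the same time series no longer share a long common prefix, which likely inflates the inverted index of the _id field.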
Not yet. My plan is to see what changes we get from putting the timestamp at the front of the id. My semi-educated guess is that we won't be ok with the space expansion in the inverted index, and that it won't be fast enough. But I'm not going to do anything until I know. Are there any other things like this you think I should try?
If the query is fast enough but too big we can have a look at just plumbing that funny query in my other PR. If it's not fast enough we'd need something like the timestamp array as well.
I think we'll need both. But I'd be pleasantly surprised to need neither.
At the moment my gut feeling is that we'd need both too. I'm sort of hoping that data about disk usage would confirm that it's not ok, so that I feel more confident about moving forward in the direction you suggest.
👍 I'm glad our stomachs agree. I'll try and get data. Though I might be getting distracted in the short term.
henningandersen left a comment
A few comments from an initial read.
I was thinking of keeping some number of max timestamps, yeah. Like grabbing the low nibble from the hashed tsid and storing 64 timestamps.
IIRC, we could have millions of tsids. The way we make this safe today in failover cases is to ensure the replica knows the max unsafe timestamp. For tsdb, it would need to know the max timestamp, or we would have to bootstrap that from scratch on failover. Is the latter your thought here?
I wonder if that necessarily goes well, since then on failover, the first request for every tsid will have an extended duration. It might need the normal "does this id exist" check and/or also have to search the tsid to find the largest timestamp.
If we imagine a cluster maxed out (more or less) indexing with the optimization, it might fall over when a node dies? Or at least it might buffer up loads of data, reject some, and require retries from clients. This could even be the case for some relocations maybe, in particular if they have just one data stream with many tsids.
    EngineConfig engineConfig,
    int maxDocs,
    BiFunction<Long, Long, LocalCheckpointTracker> localCheckpointTrackerSupplier,
    MayHaveBeenIndexedBefore mayHaveBeenIndexedBefore
Can we add this new strategy object to the EngineConfig instead? I think we would need it there anyway to avoid having to extend EngineFactory? It also feels like it belongs there.
    @Override
    public final long getMaxSeenAutoIdTimestamp() {
    -   return maxSeenAutoIdTimestamp.get();
    +   return mayHaveBeenIndexedBefore.getMaxSeenAutoIdTimestamp();
This is used for safety after recovery. I wonder if you would not need something similar with a tsdb specific optimized append?
     */
    void bootstrap(Iterable<Map.Entry<String, String>> liveCommitData);

    long getMaxSeenAutoIdTimestamp();
I wonder if we should add a specific interface for this method (and the corresponding update method)? Otherwise I think you would have to return dummy data here in the tsdb version. I'd prefer the casting in InternalEngine, I think. Ideally we would turn this into something generic (like a recovery state) and pass a long. I wonder if we can do that without changing serialization; might be possible?
    void writerSegmentStats(SegmentsStats stats);

    void commitData(Map<String, String> commitData);
Can we name this updateCommitData or similar to signal that it is expected to update the commitData and not just be reading from it?
    void handleNonPrimary(Index index);

    void writerSegmentStats(SegmentsStats stats);
Can we name this updateSegmentsStats?
    /**
     * {@code true} if it's valid to call {@link #mayHaveBeenIndexedBefore}
     * on the provided {@link Index}, false otherwise. This should be fast
     * and only rely on state from the {@link Index} and not rely on any
     * internal state.
     */
    boolean canOptimizeAddDocument(Index index);

    /**
     * Returns {@code true} if the indexing operation may have already been
     * processed by the engine. Note that it is OK to rarely return true even
     * if this is not the case. However a {@code false} return value must
     * always be correct.
     * <p>
     * This relies on state internal to the implementation and may modify
     * that state.
     */
    boolean mayHaveBeenIndexedBefore(Index index);
Can we collapse those into one method? I see a conflict with the assertion in indexIntoLucene, but I think I would prefer to make that an explicit assertXYZ method instead then.
The index operation results for the tsdb track when running with this change: just attaching this here for the record. I ran with the benchmark defaults (indexing into a tsdb index, 8 clients indexing concurrently).
This one's so stale it won't go in. We might be able to reuse parts of it one day, but no need to keep it open.
This extracts the logic for the "append only" optimization from Engine into a pluggable behavior class so that we can override it in TSDB.
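As a rough illustration of that extraction, the pluggable strategy might look like the sketch below. Method names follow the diff hunks quoted above where possible; the `IndexOp` stand-in and the `MaxTimestampCheck` implementation are hypothetical simplifications, not the PR's actual code:

```java
// Sketch of the strategy object this PR extracts from InternalEngine.
import java.util.concurrent.atomic.AtomicLong;

class AppendOnlySketch {
    /** Simplified stand-in for Engine.Index: only the fields the check needs. */
    record IndexOp(long autoGeneratedIdTimestamp, boolean isRetry) {}

    interface MayHaveBeenIndexedBefore {
        /** Fast check relying only on the operation itself, not internal state. */
        boolean canOptimizeAddDocument(IndexOp op);

        /** May consult and mutate internal state; a false return must always be correct. */
        boolean mayHaveBeenIndexedBefore(IndexOp op);

        long getMaxSeenAutoIdTimestamp();
    }

    /** Hypothetical implementation mirroring the engine's existing max-timestamp check. */
    static class MaxTimestampCheck implements MayHaveBeenIndexedBefore {
        private final AtomicLong maxSeen = new AtomicLong(-1);

        public boolean canOptimizeAddDocument(IndexOp op) {
            // Only operations with an auto-generated id carry a timestamp.
            return op.autoGeneratedIdTimestamp() >= 0;
        }

        public boolean mayHaveBeenIndexedBefore(IndexOp op) {
            long ts = op.autoGeneratedIdTimestamp();
            boolean mayHaveBeen = op.isRetry() || ts <= maxSeen.get();
            maxSeen.accumulateAndGet(ts, Math::max);
            return mayHaveBeen;
        }

        public long getMaxSeenAutoIdTimestamp() {
            return maxSeen.get();
        }
    }
}
```

A TSDB engine could then swap in its own implementation (for example, the bucketed per-tsid timestamp tracker discussed earlier in the thread) without touching InternalEngine itself, which is the point of making the behavior pluggable.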