Extract append-only optimization from Engine #84771
nik9000 wants to merge 12 commits into elastic:main from
Conversation
Out of curiosity, what would the implementation look like for TSDB? Would it track the max timestamp for each timeseries or something like that?
I was thinking of keeping some number of max timestamps, yeah. Like grabbing the low nibble from the hashed tsid and storing 64 timestamps. Or something like that. Then I got distracted by that.
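The bucketed idea above could be sketched roughly as follows. All names here are hypothetical, not from the PR, and since the comment mentions both a "low nibble" (16 values) and 64 timestamps, this sketch uses the low 6 bits of the hashed tsid to match the count of 64:

```java
// Hypothetical sketch: hash the _tsid, use its low bits to pick one of 64
// buckets, and track the max timestamp seen per bucket. Distinct tsids can
// share a bucket, so "may have been indexed before" can be a false positive,
// but a false answer is always correct.
import java.util.Arrays;

class TimestampBuckets {
    private final long[] maxTimestamps = new long[64];

    TimestampBuckets() {
        Arrays.fill(maxTimestamps, Long.MIN_VALUE);
    }

    /**
     * Returns true if a document for this tsid hash and timestamp may have
     * been indexed before. Updates the bucket's max timestamp as a side
     * effect, mirroring the engine's existing auto-id timestamp check.
     */
    boolean mayHaveBeenIndexedBefore(int tsidHash, long timestamp) {
        int bucket = tsidHash & 0x3F; // low 6 bits -> 64 buckets
        if (timestamp > maxTimestamps[bucket]) {
            maxTimestamps[bucket] = timestamp;
            return false;
        }
        return true;
    }
}
```

The appeal is constant memory regardless of how many tsids a shard holds, at the cost of occasionally falling back to the slow "does this id exist" path when unrelated tsids collide in a bucket.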
I just had a quick chat with @henningandersen about this PR and I believe that we have two main options at a high level. The first one is the one that you are suggesting, which consists of extracting the timestamp from the ID to optimize writes in case this timestamp is greater than all other timestamps that have been seen before. The second one consists of generating IDs so that Elasticsearch and Lucene could do it mostly automatically, by having an ID that concatenates the timestamp in BE order and then the TSID. The trade-off I'm seeing is that your approach requires more maintenance and is specific to timeseries data streams, but it is likely more space-efficient thanks to better sharing of prefixes of IDs, and more CPU-efficient in case data comes in timestamp order on a per-timeseries basis but not globally (which is not unlikely?). I'm curious if you have any data about how much larger the index of the _id field would be if we generated the ID by putting the timestamp first instead of last?
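For reference, the second option (timestamp-first IDs) might look like the following sketch. This is an illustration of the encoding idea only, not Elasticsearch's actual _id format:

```java
// Illustrative encoding: an _id formed by concatenating the timestamp in
// big-endian byte order with the tsid bytes, so that lexicographic _id
// order matches timestamp order across the whole index.
import java.nio.ByteBuffer;

class TimestampFirstId {
    static byte[] encode(long timestampMillis, byte[] tsid) {
        ByteBuffer buf = ByteBuffer.allocate(Long.BYTES + tsid.length);
        buf.putLong(timestampMillis); // ByteBuffer writes big-endian by default
        buf.put(tsid);
        return buf.array();
    }
}
```

Because the big-endian timestamp comes first, IDs for later documents sort after earlier ones, which is what would let the existing append-only machinery kick in automatically. The downside raised in this thread is exactly the prefix-sharing point: IDs for the same time series no longer share a long common prefix, which likely inflates the inverted index of the _id field.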
Not yet. My plan is to see what changes we get from putting the timestamp at the front of the id. My semi-educated guess is that we won't be ok with the space expansion in the inverted index, and that it won't be fast enough. But I'm not going to do anything until I know. Are there any other things like this you think I should try?
If the query is fast enough but too big we can have a look at just plumbing that funny query in my other PR. If it's not fast enough we'd need something like the timestamp array as well.
I think we'll need both. But I'd be pleasantly surprised to need neither.
At the moment my gut feeling is that we'd need both too. I'm sort of hoping that data about disk usage would confirm that it's not ok, so that I feel more confident about moving forward in the direction you suggest.
👍 I'm glad our stomachs agree. I'll try and get data. Though I might be getting distracted in the short term.
henningandersen left a comment
A few comments from an initial read.
I was thinking of keeping some number of max timestamps, yeah. Like grabbing the low nibble from the hashed tsid and storing 64 timestamps.
IIRC, we could have millions of tsids. The way we make this safe today in failover cases is to ensure the replica knows the max unsafe timestamp. For tsdb, it would need to know the max timestamp, or we would have to bootstrap that from scratch on failover. Is the latter your thought here?
I wonder if that necessarily goes well, since then on failover, the first request for every tsid will have an extended duration. It might need the normal "does this id exist" check and/or also have to search the tsid to find the largest timestamp.
If we imagine a cluster maxed out (more or less) indexing with the optimization, it might fall over when a node dies? Or at least it might buffer up loads of data, reject some, and require retries from clients. This could even be the case for some relocations maybe, in particular if they have just one data stream with many tsids.
    EngineConfig engineConfig,
    int maxDocs,
    BiFunction<Long, Long, LocalCheckpointTracker> localCheckpointTrackerSupplier,
    MayHaveBeenIndexedBefore mayHaveBeenIndexedBefore
Can we add this new strategy object to the EngineConfig instead? I think we would need it there anyway to avoid having to extend EngineFactory? It also feels like it belongs there.
    @Override
    public final long getMaxSeenAutoIdTimestamp() {
    -   return maxSeenAutoIdTimestamp.get();
    +   return mayHaveBeenIndexedBefore.getMaxSeenAutoIdTimestamp();
This is used for safety after recovery. I wonder if you would not need something similar with a tsdb specific optimized append?
     */
    void bootstrap(Iterable<Map.Entry<String, String>> liveCommitData);

    long getMaxSeenAutoIdTimestamp();
I wonder if we should add a specific interface for this method (and the corresponding update method)? Otherwise I think you would have to return dummy data here in the tsdb version. I'd prefer the casting in InternalEngine, I think. Ideally we would turn this into something generic (like a recovery state) and pass a long. I wonder if we can do that without changing serialization; might be possible?
    void writerSegmentStats(SegmentsStats stats);

    void commitData(Map<String, String> commitData);
Can we name this updateCommitData or similar to signal that it is expected to update the commitData and not just be reading from it?
    void handleNonPrimary(Index index);

    void writerSegmentStats(SegmentsStats stats);
Can we name this updateSegmentsStats?
    /**
     * {@code true} if it's valid to call {@link #mayHaveBeenIndexedBefore}
     * on the provided {@link Index}, false otherwise. This should be fast
     * and only rely on state from the {@link Index} and not rely on any
     * internal state.
     */
    boolean canOptimizeAddDocument(Index index);

    /**
     * Returns {@code true} if the indexing operation may have already been
     * processed by the engine. Note that it is OK to rarely return true even
     * if this is not the case. However a {@code false} return value must
     * always be correct.
     * <p>
     * This relies on state internal to the implementation and may modify
     * that state.
     */
    boolean mayHaveBeenIndexedBefore(Index index);
Can we collapse those into one method? I see a conflict with the assertion in indexIntoLucene, but I think I would prefer to make that an explicit assertXYZ method instead then.
The index operation results for the tsdb track when running with this change: just attaching this here for the record. I ran with the benchmark defaults (indexing into a tsdb index, 8 clients indexing concurrently).
This one's so stale it won't go in. We might be able to reuse parts of it one day, but no need to keep it open.
This extracts the logic for the "append only" optimization from Engine into a pluggable behavior class so that we can override it in TSDB.
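As a rough illustration of that extraction, the pluggable strategy might look like the sketch below. Method names follow the diff hunks quoted above where possible; the `IndexOp` stand-in and the `MaxTimestampCheck` implementation are hypothetical simplifications, not the PR's actual code:

```java
// Sketch of the strategy object this PR extracts from InternalEngine.
import java.util.concurrent.atomic.AtomicLong;

class AppendOnlySketch {
    /** Simplified stand-in for Engine.Index: only the fields the check needs. */
    record IndexOp(long autoGeneratedIdTimestamp, boolean isRetry) {}

    interface MayHaveBeenIndexedBefore {
        /** Fast check relying only on the operation itself, not internal state. */
        boolean canOptimizeAddDocument(IndexOp op);

        /** May consult and mutate internal state; a false return must always be correct. */
        boolean mayHaveBeenIndexedBefore(IndexOp op);

        long getMaxSeenAutoIdTimestamp();
    }

    /** Hypothetical implementation mirroring the engine's existing max-timestamp check. */
    static class MaxTimestampCheck implements MayHaveBeenIndexedBefore {
        private final AtomicLong maxSeen = new AtomicLong(-1);

        public boolean canOptimizeAddDocument(IndexOp op) {
            // Only operations with an auto-generated id carry a timestamp.
            return op.autoGeneratedIdTimestamp() >= 0;
        }

        public boolean mayHaveBeenIndexedBefore(IndexOp op) {
            long ts = op.autoGeneratedIdTimestamp();
            boolean mayHaveBeen = op.isRetry() || ts <= maxSeen.get();
            maxSeen.accumulateAndGet(ts, Math::max);
            return mayHaveBeen;
        }

        public long getMaxSeenAutoIdTimestamp() {
            return maxSeen.get();
        }
    }
}
```

A TSDB engine could then swap in its own implementation (for example, the bucketed per-tsid timestamp tracker discussed earlier in the thread) without touching InternalEngine itself, which is the point of making the behavior pluggable.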