Do not expose hard-deleted docs in Lucene history#32333
Conversation
Today when reading operation history in Lucene, we read all documents. However, if indexing a document is aborted, IndexWriter will hard-delete it; we, therefore, need to exclude that document from Lucene history. This commit makes sure that we exclude aborted documents by using the hard liveDocs of a SegmentReader if there are deletes. Note that this wrapper does not work well with IndexWriter#tryDeleteDocument. We need to revisit the wrapper after LUCENE-8425 gets in.
This reverts commit 8e66a93.
|
Pinging @elastic/es-distributed |
| if (si.getDelCount() == 0) { | ||
| return new LeafReaderWithLiveDocs(segmentReader, null, segmentReader.maxDoc()); | ||
| } else { | ||
| Bits hardLiveBits = si.info.getCodec().liveDocsFormat().readLiveDocs(si.info.dir, si, IOContext.READ); |
There was a problem hiding this comment.
I tried to think this through and we might be subject to concurrent deletes if we do it this way. Can't we instead do something like this:
DocIdSetIterator soft_deletes = DocValuesFieldExistsQuery.getDocValuesDocIdSetIterator("soft_deletes", sr);
Bits liveDocs = sr.getLiveDocs();
FixedBitSet hardLiveDocs = new FixedBitSet(numDeletes);
hardLiveDocs.set(0, numDeletes);
for (int i = 0; i < liveDocs.length(); i++) {
if (liveDocs.get(i) == false) {
if (soft_deletes.docID() < i) {
int doc = soft_deletes.docID() == DocIdSetIterator.NO_MORE_DOCS ? DocIdSetIterator.NO_MORE_DOCS : soft_deletes.advance(i);
if (doc != i) {
hardLiveDocs.clear(i);
}
}
}
}note I didn't try this out.. just to provide an idea
|
@s1monw I've updated the PR using the hardLiveDocs exposed in Lucene. I wonder if we should expose "isNRT" property of a SegmentReader so that we know if we can use SegmentInfos to calculate numDocs instead of getting the cardinality of the liveDocs. Can you please have another look? Thank you. |
| return new LeafReaderWithLiveDocs(leaf, null, leaf.maxDoc()); | ||
| } | ||
| final int numDocs; | ||
| if (hardLiveDocs instanceof FixedBitSet) { |
There was a problem hiding this comment.
does it ever kick in? I would assume you get a FixedBits instance for live docs instead of a mutable FixedBitSet.
There was a problem hiding this comment.
Yes, we convert then return a read-only bits. I removed this 💯.
| assertThat(actualDocs, equalTo(liveDocs)); | ||
| } | ||
| IOUtils.close(writer, dir); | ||
| } |
There was a problem hiding this comment.
can we make it two tests? one that never inserts a document that fails indexing, and another one that always inserts one?
|
@jpountz I have addressed your feedbacks? Can you please have another look? Thank you. |
jpountz
left a comment
There was a problem hiding this comment.
LGTM. We should probably leave a TODO about avoiding to count live docs. Unfortunately we can't do it easily with a FilterDirectoryReader since composite readers count documents eagerly, but maybe we could find a different way to consume hard deletes that doesn't require to count them.
|
@jpountz Thanks for reviewing. I added a TODO. |
* elastic/ccr: (57 commits) ShardFollowNodeTask should fetch operation once (elastic#32455) Do not expose hard-deleted docs in Lucene history (elastic#32333) Tests: Fix convert error tests to use fixed value (elastic#32415) IndicesClusterStateService should replace an init. replica with an init. primary with the same aId (elastic#32374) REST high-level client: parse back _ignored meta field (elastic#32362) [CI] Mute DocumentSubsetReaderTests testSearch Reject follow request if following setting not enabled on follower (elastic#32448) TEST: testDocStats should always use forceMerge (elastic#32450) TEST: avoid merge in testSegmentMemoryTrackedInBreaker TEST: Avoid deletion in FlushIT AwaitsFix IndexShardTests#testDocStats Painless: Add method type to method. (elastic#32441) Remove reference to non-existent store type (elastic#32418) [TEST] Mute failing FlushIT test Fix ordering of bootstrap checks in docs (elastic#32417) [TEST] Mute failing InternalEngineTests#testSeqNoAndCheckpoints Validate source of an index in LuceneChangesSnapshot (elastic#32288) [TEST] Mute failing testConvertLongHexError bump lucene version after backport Upgrade to Lucene-7.5.0-snapshot-608f0277b0 (elastic#32390) ...
Today when reading operation history in Lucene, we read all documents. However, if indexing a document is aborted, IndexWriter will hard-delete it; we, therefore, need to exclude that document from Lucene history. This commit makes sure that we exclude aborted documents by using the hard liveDocs of a SegmentReader if there are deletes.
Today when reading operation history in Lucene, we read all documents.
However, if indexing a document is aborted, IndexWriter will hard-delete
it; we, therefore, need to exclude that document from Lucene history.
This commit makes sure that we exclude aborted documents by using the
hard liveDocs of a SegmentReader if there are deletes.
Closes #32269