Use soft-update to maintain document history#29458
Conversation
Today we can the soft-update feature from Lucene to maintain a history of document. This change simply replaces hard-update in the Engine by soft-update methods. Stale operations, delete, and no-ops will be handled in subsequent changes.
|
Pinging @elastic/es-distributed |
|
test this please. |
|
please test this. |
|
@elasticmachine test this please |
s1monw
left a comment
There was a problem hiding this comment.
looks great, I left some comments
| /** | ||
| * Returns a numeric docvalues which can be used to soft-delete documents. | ||
| */ | ||
| public static NumericDocValuesField getSoftDeleteDVMarker() { |
There was a problem hiding this comment.
call this newSoftDeletesField?
| * Returns <code>true</code> if soft-delete is enabled. | ||
| */ | ||
| public boolean isSoftDeleteEnabled() { | ||
| return getIndexVersionCreated().onOrAfter(Version.V_7_0_0_alpha1) && softDeleteEnabled; |
There was a problem hiding this comment.
maybe resolve this in the ctor?
| if (softDeleteEnabled) { | ||
| iwc.setSoftDeletesField(Lucene.SOFT_DELETE_FIELD); | ||
| // TODO: soft-delete retention policy | ||
| mergePolicy = new SoftDeletesRetentionMergePolicy(Lucene.SOFT_DELETE_FIELD, () -> new MatchAllDocsQuery(), mergePolicy); |
There was a problem hiding this comment.
we don't need this here since it's match all. Lets just omit it until we use it
| return delegate.useCompoundFile(segments, newSegment, writer); | ||
| } | ||
|
|
||
| @Override |
There was a problem hiding this comment.
instead of doing this I think we should extend MergePolicyWrapper in master and 6.x and then this is handled upstream
| if (flush) { | ||
| // we should have had just 1 merge, so last generation should be exact | ||
| assertEquals(gen2, store.readLastCommittedSegmentsInfo().getLastGeneration()); | ||
| assertEquals(gen2 + 1, store.readLastCommittedSegmentsInfo().getLastGeneration()); |
There was a problem hiding this comment.
hmm why does this need to change?
| @Override | ||
| public long softUpdateDocument(Term term, Iterable<? extends IndexableField> doc, Field... softDeletes) throws IOException { | ||
| maybeThrowFailure(); | ||
| return super.softUpdateDocument(term, doc, softDeletes); |
| public LeafReader wrap(LeafReader in) { | ||
| return new FilterLeafReader(in) { | ||
| @Override | ||
| public CacheHelper getCoreCacheHelper() { |
There was a problem hiding this comment.
this can return in.getCoreCacheHelper();
There was a problem hiding this comment.
I think you can just move this entire thing to a util method in Lucene.java then we can reuse it in other places WDYT? @martijnvg you need this too somewhere right?
There was a problem hiding this comment.
I removed the test and this wrapper in this PR, and will add it back to Lucene.java later. I prefer to keep this change as a cut-over.
| private void updateDocs(final Term uid, final List<ParseContext.Document> docs, final IndexWriter indexWriter) throws IOException { | ||
| if (docs.size() > 1) { | ||
| indexWriter.updateDocuments(uid, docs); | ||
| if (softDeleteEnabled) { |
There was a problem hiding this comment.
I think in the delete case we should also use indexWriter.updateDocValue(uid, Lucene.getSoftDeleteDVMarker()); instead of IndexWriter#deleteDocuments and then override this in the test to maybe throw an exception.
|
@s1monw I have addressed your comments. Can you please take a look? Thank you. |
There was a problem hiding this comment.
I think we can cache this in here in a constant so we don't need to re-create a new one every time.
|
Thanks @s1monw for reviewing. |
Since elastic#29458, we use a searcher to calculate the number of documents for a commit stats. Sadly, that approach is flawed. The searcher might no longer point to the last commit if it's refreshed. This commit uses SoftDeletesDirectoryReaderWrapper to exclude the soft-deleted documents from numDocs in a SegmentInfos. I chose to modify the method Luence#getNumDocs so that we can read a store metadata snapshot correctly without opening an engine. Relates elastic#29458
Today we can use the soft-update feature from Lucene to maintain a history of document. This change simply replaces hard-update in the Engine by soft-update methods. Stale operations, delete, and no-ops will be handled in subsequent changes. This change is just a cut-over from hard-update to soft-update, no new functionality has been introduced. Relates #29530
Since #29458, we use a searcher to calculate the number of documents for a commit stats. Sadly, that approach is flawed. The searcher might no longer point to the last commit if it's refreshed. As synced-flush requires an exact numDocs to work correctly, we have to exclude all soft-deleted docs. This commit makes synced-flush stop using CommitStats but read an exact numDocs directly from an index commit. Relates #29458 Relates #29530
Since #29458, we use a searcher to calculate the number of documents for a commit stats. Sadly, that approach is flawed. The searcher might no longer point to the last commit if it's refreshed. As synced-flush requires an exact numDocs to work correctly, we have to exclude all soft-deleted docs. This commit makes synced-flush stop using CommitStats but read an exact numDocs directly from an index commit. Relates #29458 Relates #29530
Today we can use the soft-update feature from Lucene to maintain a history of document. This change simply replaces hard-update in the Engine by soft-update methods. Stale operations, delete, and no-ops will be handled in subsequent changes. This change is just a cut-over from hard-update to soft-update, no new functionality has been introduced.