Fix IndexAuditTrail rolling upgrade on rollover edge - take 2 #38286
Pinging @elastic/es-security
  } else if (TemplateUtils.checkTemplateExistsAndVersionMatches(INDEX_TEMPLATE_NAME,
          SECURITY_VERSION_STRING, clusterStateResponse.getState(), logger,
-         Version.CURRENT::onOrAfter) == false) {
+         Version.CURRENT::onOrBefore) == false) {

  if (meta == null) {
      logger.info("Missing _meta field in mapping [{}] of index [{}]", docMapping.type(), index);
      throw new IllegalStateException("Cannot read security-version string in index " + index);
  }
Fixes #37062 (comment).
A non-master node detects an un-updated audit index and bails. Instead, it should hold off and retry. The index is un-updated because the master had updated the mapping only for the index that precedes the rollover edge ("the race": the template upgrade happened after the rollover edge, but audit events on the master came before it).
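To make the difference between the two predicates in the diff context concrete, here is a minimal sketch. `ToyVersion` is a hypothetical stand-in with plain integer ids, not the real `Version` class, and the ids are made up; the only assumption carried over is that `a.onOrAfter(b)` means `a >= b` and `a.onOrBefore(b)` means `a <= b`.

```java
import java.util.function.Predicate;

// Hypothetical stand-in for a version type; NOT the real Elasticsearch Version class.
final class ToyVersion {
    final int id;

    ToyVersion(int id) {
        this.id = id;
    }

    boolean onOrAfter(ToyVersion other) {
        return id >= other.id;
    }

    boolean onOrBefore(ToyVersion other) {
        return id <= other.id;
    }
}

final class TemplateVersionCheckSketch {
    // Made-up id for the version of the node doing the check.
    static final ToyVersion CURRENT = new ToyVersion(660);

    // Applies a version predicate to the version recorded in a template or mapping.
    static boolean acceptedBy(Predicate<ToyVersion> check, int recordedVersionId) {
        return check.test(new ToyVersion(recordedVersionId));
    }

    public static void main(String[] args) {
        Predicate<ToyVersion> onOrAfter = CURRENT::onOrAfter;   // CURRENT >= recorded
        Predicate<ToyVersion> onOrBefore = CURRENT::onOrBefore; // CURRENT <= recorded

        // A template written by an older node (650) passes onOrAfter but not onOrBefore.
        System.out.println("old template, onOrAfter:  " + acceptedBy(onOrAfter, 650));
        System.out.println("old template, onOrBefore: " + acceptedBy(onOrBefore, 650));
    }
}
```

With `CURRENT::onOrBefore`, only a recorded version at or above the node's own version counts as a match, so a template left behind by an older node is treated as needing an update.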
      innerStart();
  }, e2 -> {
      // best effort only
      logger.debug("Failed to update mappings on next audit index [{}]", nextIndex, e2);
The master tries to update the mapping for the next rollover index, just in case.
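A minimal sketch of the "next index, best effort" idea, for illustration only. It assumes a daily rollover and an index name of the form `.security_audit_log-yyyy.MM.dd`; the class and method names are hypothetical and this is not the code in the PR.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

final class NextAuditIndexSketch {
    // Assumed naming scheme: prefix plus a daily date suffix.
    static final String PREFIX = ".security_audit_log";
    static final DateTimeFormatter DAILY = DateTimeFormatter.ofPattern("yyyy.MM.dd");

    static String indexFor(LocalDate day) {
        return PREFIX + "-" + DAILY.format(day);
    }

    // With daily rollover, the index after the edge is simply tomorrow's index.
    static String nextIndexFor(LocalDate day) {
        return indexFor(day.plusDays(1));
    }

    // Best effort: a failure is logged and swallowed so that startup can continue.
    static boolean bestEffort(Runnable update) {
        try {
            update.run();
            return true;
        } catch (RuntimeException e) {
            System.out.println("Failed to update mappings on next audit index: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        String next = nextIndexFor(LocalDate.of(2019, 1, 31));
        // Simulate the common case where the next index has not been created yet.
        boolean updated = bestEffort(() -> {
            throw new IllegalStateException("index [" + next + "] does not exist yet");
        });
        System.out.println(next + " updated=" + updated);
    }
}
```

Swallowing the failure matches the "best effort only" comment in the diff: most of the time the next index does not exist yet, and that must not prevent the audit trail from starting.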
  setting 'xpack.security.enabled', 'true'
  setting 'xpack.security.transport.ssl.enabled', 'true'
  setting 'xpack.security.transport.ssl.keystore.path', 'testnode.jks'
  setting 'logger.org.elasticsearch.xpack.security.audit.index', 'DEBUG'
This should help diagnose possible future failures.
      assertAuditDocsExist();
      assertNumUniqueNodeNameBuckets(expectedNumUniqueNodeNameBuckets());
  });
  }, 30, TimeUnit.SECONDS);
This allows some slack for the old nodes to create and allocate a new audit index while the master is down for upgrade.
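For readers unfamiliar with the pattern, a simplified stand-in for "retry the assertions until they pass or the timeout expires" could look like the sketch below. `pollUntilPasses` is a hypothetical helper written for this note, not the test framework's own implementation, and the backoff values are arbitrary.

```java
import java.util.concurrent.TimeUnit;

final class PollSketch {
    // Re-runs the check until it stops throwing AssertionError or the time budget is spent.
    static void pollUntilPasses(Runnable check, long maxWait, TimeUnit unit) {
        long deadline = System.nanoTime() + unit.toNanos(maxWait);
        long sleepMillis = 1;
        while (true) {
            try {
                check.run();
                return; // all assertions passed
            } catch (AssertionError e) {
                if (System.nanoTime() - deadline >= 0) {
                    throw e; // out of time: surface the last failure
                }
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", ie);
            }
            sleepMillis = Math.min(sleepMillis * 2, 1000); // back off, capped at one second
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // The check fails twice, standing in for an audit index that is not allocated yet.
        pollUntilPasses(() -> {
            if (++calls[0] < 3) {
                throw new AssertionError("audit docs not visible yet");
            }
        }, 30, TimeUnit.SECONDS);
        System.out.println("passed after " + calls[0] + " attempts");
    }
}
```

A longer budget only costs time when the check keeps failing, so raising it to 30 seconds is cheap in the passing case.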
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-2/7058/console
@elasticmachine run elasticsearch-ci/2
Fixes a race during the rolling upgrade with the index audit output enabled. The race is that after the upgraded node is restarted, it installs the audit template and updates the mapping of the "current" (from its perspective) audit index. But the template might be installed after a new daily rolled-over index has been created by the other old nodes, using the old templates. However, the new node, even if it installs the template after the rollover edge, can accumulate audit events from before the edge, and will correctly try to update the mapping of the audit index before the edge. But this way, the mapping of the index after the edge remains un-updated, because only the master node does the mapping updates. The fix keeps the design of only allowing the master to update the mapping, but the master will also try, on a best-effort basis, to update the mapping of the next rollover audit index.
* 6.6: (121 commits)
  [DOCS] Add warning about bypassing ML PUT APIs (elastic#38608)
  fix dissect doc "ip" --> "clientip" (elastic#38512)
  bad formatted JSON object (elastic#38515)
  SQL: Fix issue with IN not resolving to underlying keyword field (elastic#38440)
  Update ilm-api.asciidoc, point to REMOVE policy (elastic#38235)
  Backport changes to the release notes script. (elastic#38347)
  Change the milliseconds precision to 3 digits for intervals. (elastic#38297)
  SecuritySettingsSource license.self_generated: trial (elastic#38233) (elastic#38398)
  Fix IndexAuditTrail rolling upgrade on rollover edge 2 (elastic#38286) (elastic#38381)
  Cleanup construction of interceptors (elastic#38388)
  Skip unsupported languages for tests (elastic#38328) (elastic#38385)
  [ILM][TEST] increase assertBusy timeout (elastic#36864) (elastic#38354)
  Docs: Drop inline callout from scroll example (elastic#38340) (elastic#38365)
  Preserve ILM operation mode when creating new lifecycles (elastic#38134) (elastic#38230)
  [ML] Add explanation so far to file structure finder exceptions (elastic#38337)
  ML: Fix error race condition on stop _all datafeeds and close _all jobs (elastic#38113) (elastic#38211) (elastic#38222)
  SQL: Generate relevant error message when grouping functions are not used in GROUP BY (elastic#38017)
  Fix NPE in Logfile Audit Filter (elastic#38120) (elastic#38273)
  Enable trace log in FollowerFailOverIT (elastic#38148)
  Replace awaitBusy with assertBusy in atLeastDocsIndexed (elastic#38190)
  ...
Fixes a race during the rolling upgrade with the index audit output enabled.
The race is that after the upgraded node is restarted, it installs the audit template and updates the mapping of the "current" (from its perspective) audit index. But the template might be installed after a new daily rolled-over index has been created by the other old nodes, using the old templates.
However, the new node, even if it installs the template after the rollover edge, can accumulate audit events from before the edge, and will correctly try to update the mapping of the audit index before the edge. But this way, the mapping of the index after the edge remains un-updated, because only the master node does the mapping updates.
The fix keeps the design of only allowing the master to update the mapping, but the master will also try, on a best-effort basis, to update the mapping of the next rollover audit index.
This can be judged as a shot in the dark because I don't have access to the failure data anymore, but I think the breadcrumbs point in this direction. Moreover, turning up debug logging will allow for easier future diagnosis.
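The timeline described above can be replayed with a toy model. This is only an illustration of the ordering problem under the assumptions in this description: index names `day1`/`day2` and the `old`/`new` labels are made up, and nothing here touches a real cluster.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class RolloverRaceSketch {
    // Returns the mapping "version" of each audit index after the upgraded master has started.
    static Map<String, String> simulate(boolean alsoUpdateNextIndex) {
        Map<String, String> mappingOf = new LinkedHashMap<>();
        String template = "old";

        // Before the upgrade: the pre-edge index was created from the old template.
        mappingOf.put("day1", template);

        // Rollover edge: an old node creates the post-edge index, still from the old template.
        mappingOf.put("day2", template);

        // Only now does the upgraded master install the new template (after the edge).
        template = "new";

        // The master holds audit events from before the edge, so its "current" index
        // is day1, and that is the mapping it updates.
        mappingOf.put("day1", template);

        // The fix: best effort, also update the next rollover index if it already exists.
        if (alsoUpdateNextIndex && mappingOf.containsKey("day2")) {
            mappingOf.put("day2", template);
        }
        return mappingOf;
    }

    public static void main(String[] args) {
        System.out.println("without fix: " + simulate(false));
        System.out.println("with fix:    " + simulate(true));
    }
}
```

Without the extra step, `day2` keeps the old mapping, which is the state a non-master node later trips over.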
Relates #35988
Closes #33867 #37062