Skip to content

feat(occurrences): New ingest#109180

Merged
thetruecpaul merged 1 commit intomasterfrom
cpaul/eap_occurrences/ingest_v_2
Mar 10, 2026
Merged

feat(occurrences): New ingest#109180
thetruecpaul merged 1 commit intomasterfrom
cpaul/eap_occurrences/ingest_v_2

Conversation

@thetruecpaul
Copy link
Contributor

This PR updates the ingest flow for Occurrences on EAP.

CHANGES:

  • Moves from a "everything in event_data" model to an allowlist model.
  • Unifies tags & contexts into attrs.
  • Processes exceptions in a series of related arrays, rather than as a thicket of nested mappings and arrays

SEE ALSO:

I tried to match the existing behavior of those processors as closely as possible.

TEST PLAN:

pytest -s tests/snuba/search/test_eap_occurrences.py

@thetruecpaul thetruecpaul requested a review from a team February 24, 2026 11:05
@thetruecpaul thetruecpaul requested a review from a team as a code owner February 24, 2026 11:05
@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Feb 24, 2026
Copy link
Member

@shashjar shashjar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly reviewed the high-level approach so far, and left some comments. I think the main point of open discussion I have is what to do with None values - I'm leaning towards omitting rather than stringifying them, which would also require some updates in the processors (mainly validating whether fields/values exist before encoding them into the result data). After we discuss these points more I can come back to review the processor changes.

encoded_data: dict[str, AnyValue] = {}

# 1: ALLOWLIST OF DIRECT COPIES
simple_allowlist = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we define a top-level constant for this allowlist?

Also, event_id was previously in the ignore list — is there a reason we're including it now?

def _encode_attributes(
event: Event | GroupEvent,
event_data: Mapping[str, Any],
ignore_fields: set[str] | None = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we actually still need the ability to explicitly ignore some fields by using this parameter? If not, we may be able to simplify this and just remove ignore_fields altogether and update the tests to match

attributes=_encode_attributes(
event, event_data, ignore_fields={"event_id", "timestamp", "tags", "spans", "'spans'"}
event,
event_data,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could we move _encode_attributes / _build_occurrence_attributes / _encode_value / _encode_value_recursive above all the extract helpers in this file? up to you but i think that's more readable to me

def _encode_attributes(
event: Event | GroupEvent, event_data: Mapping[str, Any], ignore_fields: set[str] | None = None
def _extract_from_event(
event: Event | GroupEvent, event_data: Mapping[str, Any]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

event_data param not used in the _extract_from_event function, can be removed

encoded_data[key] = _encode_value(event_data[key])

# 2: SIMPLE RENAMES FROM EXISTING DATA
renames: tuple[tuple[str, str], ...] = ( # tuple of old, new pairs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this might be good to extract out into a constant as well

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also: to confirm my understanding, these 3 keys are also effectively in the allowlist, they're just processed separately since they need the renaming step?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct.

)
elif value is None:
return AnyValue(
string_value="None"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on Pierre's response that EAP will not natively support null values, I wonder if we should just prevent None values from being encoded across the board

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, the problem here is primarily the exception arrays. Exceptions in the errors Snuba table today (link) are — essentially — maps of standard keys to nullable values.

We can't do maps in EAP. So thought it would be good to instead encode these as equally-sized arrays (so the item at index N is the same frame across all the arrays).

... but that runs into a problem when we can't encode a null value.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we just skip the key entirely if the value is none?

@@ -1,6 +1,8 @@
from collections.abc import Mapping
import ipaddress
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like some tests in tests/sentry/eventstream/test_item_helpers.py need to be updated, and is it possible to add some additional coverage for _encode_value_recursive and the processors?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, that's what I'm working on this morning.

if isinstance(value, Mapping):
out: dict[str, AnyValue] = {}
for subkey, subvalue in value.items():
out.update(_encode_value_recursive(".".join([key, subkey]), subvalue))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as warden comment: looks like we call this function as _encode_value_recursive("", contexts) from _extract_tags_and_contexts. So there will end up being a leading dot since the outermost key is considered to be ""

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we decide we want to consider None a real value and not ignore it, I wonder if we should do that across the board and remove this filter in the list/tuple case and also below in the dict case

out = {}
if event.group_id:
out["group_id"] = _encode_value(event.group_id)
if isinstance(event, GroupEvent):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we verify that event.group.first_seen.timestamp() exists here as well?

def _extract_time_data(event_data: Mapping[str, Any]) -> Mapping[str, AnyValue]:
if "timestamp" not in event_data:
return {}
return {"timestamp_ms": _encode_value(event_data["timestamp"] * 1000)}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the timestamp field an int or double in EAP? Do we need to do any casting here?

def _extract_tags_and_contexts(event_data: Mapping[str, Any]) -> Mapping[str, AnyValue]:
# These may be overwritten by promoted tags.
out = {
"release": _encode_value(event_data.get("release")),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see "release" is both here and in the simple allowlist - I think one can be removed?

pass

out["user_email"] = _encode_value(user_data["email"])
out["user_id"] = _encode_value(user_data["user_id"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure this key is "user_id"? I see it defined as "id" in src/sentry/interfaces/user.py

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 3 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

"level",
"resource_id",
"message",
"release",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate "release" in allowlist and processor

Low Severity

"release" appears in the simple_allowlist at step 1 and is also set by _extract_tags_and_contexts at step 3. The processor's value always overwrites the allowlist value via encoded_data.update(), making the allowlist entry redundant. One of these two sources for release can be removed.

Additional Locations (1)

Fix in Cursor Fix in Web

@thetruecpaul thetruecpaul force-pushed the cpaul/eap_occurrences/ingest_v_2 branch 2 times, most recently from 846aeec to aaf7a95 Compare February 27, 2026 17:49
Copy link
Member

@shashjar shashjar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think some of the bot review comments are valid but other than that LGTM!

Copy link
Member

@wedamija wedamija left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to make sure that this change won't break any attributes sent from here:

event_data = {
"contexts": {"monitor": get_monitor_environment_context(monitor_env)},
"environment": monitor_env.get_environment().name,
"event_id": occurrence.event_id,
"fingerprint": [incident.grouphash],
"platform": "other",
"project_id": monitor_env.monitor.project_id,
# This is typically the time that the checkin that triggered the
# occurrence was written to relay, otherwise it is when we detected a
# missed or timeout.
"received": received.isoformat(),
"sdk": None,
"tags": {
"monitor.id": str(monitor_env.monitor.guid),
"monitor.slug": str(monitor_env.monitor.slug),
"monitor.incident": str(incident.id),
},
"timestamp": current_timestamp.isoformat(),
}

if event.group_id:
attributes["group_id"] = AnyValue(int_value=event.group_id)
out["group_id"] = event.group_id
if isinstance(event, GroupEvent):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both Event and GroupEvent should have group available

Comment on lines +196 to 203
promotions = {
"sentry:release": "release",
"environment": "environment",
"sentry:user": "user",
"sentry:dist": "dist",
"profile.profile_id": "profile_id",
"replay.replay_id": "replay_id",
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These used to be explicit columns in the errors table, how is this promotion concept handled in EAP?

)
elif value is None:
return AnyValue(
string_value="None"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we just skip the key entirely if the value is none?

@github-actions
Copy link
Contributor

github-actions bot commented Mar 10, 2026

Backend Test Failures

Failures on 68a3d2b in this run:

tests/sentry/profiles/test_task.py::DeobfuscationViaSymbolicator::test_inline_resolvinglog
tests/sentry/profiles/test_task.py:683: in test_inline_resolving
    assert android_profile["profile"]["methods"] == [
E   AssertionError: assert [{'class_name...andler', ...}] == [{'class_name...andler', ...}]
E     
E     At index 0 diff: {'class_name': 'io.sentry.sample.-$$Lambda$r3Avcbztes2hicEObh02jjhQqd4', 'name': 'onClick', 'signature': '()', 'source_file': '-.java', 'source_line': 2, 'data': {'deobfuscation_status': 'deobfuscated'}} != {'class_name': 'io.sentry.sample.-$$Lambda$r3Avcbztes2hicEObh02jjhQqd4', 'data': {'deobfuscation_status': 'deobfuscated'}, 'name': 'onClick', 'signature': '()', 'source_file': None, 'source_line': 2}
E     
E     Full diff:
E       [
E           {
E               'class_name': 'io.sentry.sample.-$$Lambda$r3Avcbztes2hicEObh02jjhQqd4',
E               'data': {
E                   'deobfuscation_status': 'deobfuscated',
E               },
E               'name': 'onClick',
E               'signature': '()',
E     -         'source_file': None,
E     ?                        ^^^^
E     +         'source_file': '-.java',
E     ?                        ^^^^^^^^
E               'source_line': 2,
E           },
E           {
E               'class_name': 'io.sentry.sample.MainActivity',
E               'data': {
E                   'deobfuscation_status': 'deobfuscated',
E               },
E               'inline_frames': [
E                   {
E                       'class_name': 'io.sentry.sample.MainActivity',
E                       'data': {
E                           'deobfuscation_status': 'deobfuscated',
E                       },
E                       'name': 'onClickHandler',
E                       'signature': '()',
E                       'source_file': 'MainActivity.java',
E                       'source_line': 40,
E                   },
E                   {
E                       'class_name': 'io.sentry.sample.MainActivity',
E                       'data': {
E                           'deobfuscation_status': 'deobfuscated',
E                       },
E                       'name': 'foo',
E                       'signature': '()',
E                       'source_file': 'MainActivity.java',
E                       'source_line': 44,
E                   },
E                   {
E                       'class_name': 'io.sentry.sample.MainActivity',
E                       'data': {
... (14 more lines)
tests/sentry/profiles/test_task.py::DeobfuscationViaSymbolicator::test_basic_resolvinglog
tests/sentry/profiles/test_task.py:627: in test_basic_resolving
    assert android_profile["profile"]["methods"] == [
E   AssertionError: assert [{'class_name...oolean', ...}] == [{'class_name...oolean', ...}]
E     
E     At index 0 diff: {'class_name': 'org.slf4j.helpers.Util$ClassContextSecurityManager', 'name': 'getClassContext', 'signature': '()', 'source_file': 'Util.java', 'source_line': 67, 'data': {'deobfuscation_status': 'deobfuscated'}} != {'data': {'deobfuscation_status': 'deobfuscated'}, 'name': 'getClassContext', 'class_name': 'org.slf4j.helpers.Util$ClassContextSecurityManager', 'signature': '()', 'source_file': 'Something.java', 'source_line': 67}
E     
E     Full diff:
E       [
E           {
E               'class_name': 'org.slf4j.helpers.Util$ClassContextSecurityManager',
E               'data': {
E                   'deobfuscation_status': 'deobfuscated',
E               },
E               'name': 'getClassContext',
E               'signature': '()',
E     -         'source_file': 'Something.java',
E     ?                         ^^^^ - ^^
E     +         'source_file': 'Util.java',
E     ?                         ^  ^
E               'source_line': 67,
E           },
E           {
E               'class_name': 'org.slf4j.helpers.Util$ClassContextSecurityManager',
E               'data': {
E                   'deobfuscation_status': 'deobfuscated',
E               },
E               'name': 'getExtraClassContext',
E               'signature': '(): boolean',
E     -         'source_file': 'Else.java',
E     ?                         ^ --
E     +         'source_file': 'Util.java',
E     ?                         ^^^
E               'source_line': 69,
E           },
E       ]

This PR updates the ingest flow for Occurrences on EAP.

CHANGES:
* Moves from a "everything in event_data" model to an allowlist model.
* Unifies tags & contexts into attrs.
* Processes exceptions in a series of related arrays, rather than as a thicket of nested mappings and arrays

SEE ALSO:
* Search issues processor: https://github.com/getsentry/snuba/blob/master/snuba/datasets/processors/search_issues_processor.py
* Errors processor: https://github.com/getsentry/snuba/blob/bc66a617af47de8e0fe573c2ac17c9f378c523c1/rust_snuba/src/processors/errors.rs#L32

I tried to match the existing behavior of those processors as closely as possible.

TEST PLAN:
```
pytest -s tests/snuba/search/test_eap_occurrences.py
```
@thetruecpaul thetruecpaul force-pushed the cpaul/eap_occurrences/ingest_v_2 branch from fabc1bd to a9a11cd Compare March 10, 2026 18:02
@thetruecpaul thetruecpaul merged commit e2b45a5 into master Mar 10, 2026
76 checks passed
@thetruecpaul thetruecpaul deleted the cpaul/eap_occurrences/ingest_v_2 branch March 10, 2026 18:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Scope: Backend Automatically applied to PRs that change backend components

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants