[RFC] Internal Model Proposal #319
Description
What kind of business use case are you trying to solve? What are your requirements?
Existing preppers consume and emit serialized JSON strings. This wastes CPU cycles when chaining preppers due to excessive de/serialization. Users of Data Prepper have encountered runtime exceptions due to conflicting data requirements between preppers. Model definitions are duplicated throughout prepper plugins (e.g., TraceGroup in otel-trace-group-prepper and otel-trace-raw-prepper).
Requirements:
- Extensible - the model should scale beyond the existing trace analytics support.
- Type Safety - the model should be prescriptive enough to enable type safety checks throughout the pipelines in the future.
- Eliminate need to duplicate code between plugins
- Allow preppers to operate on internal data in a generic way
- Remove excessive serialization
What is the problem? What is preventing you from meeting the requirements?
Currently, data flows through Data Prepper as a Collection<Records>. Records are generic types that allow any type to flow through. Trace events have been defined as a Collection of Records of type String. The strings are serialized representations of JSON objects conforming to the OTEL spec.
What are you proposing? What do you suggest we do to solve the problem or improve the existing situation?
We will deprecate Records and define explicit object models for Traces and Logs. Traces and Logs will implement a new interface called Event. Events will be the new data type flowing through Data Prepper.
Source plugins will be responsible for translating external requests into Events. Sink plugins will be responsible for transforming Events into the correct output schema. Preppers will only accept Events, or subtypes of Event, as inputs and outputs. This will effectively create internal boundaries for our model between sources and sinks.
Event
Events will be managed through public put, get, and delete methods. An additional method for generating a JSON representation is included to support the sinks.
```java
/**
 * Adds or updates the key with the given value in the Event
 */
void put(String key, Object value);

/**
 * Retrieves the value of the given key from the Event
 */
<T> T get(String key, Class<T> type);

/**
 * Deletes the given key from the Event
 */
void delete(String key);

/**
 * Generates a JSON representation of the Event
 */
JsonNode toJsonNode();

/**
 * Retrieves the metadata for the Event
 */
EventMetadata getMetadata();
```
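To make the contract concrete, here is a minimal, dependency-free sketch of an Event implementation backed by a HashMap. The `MapEvent` class name is hypothetical and not part of the proposal; `toJsonNode()` and the metadata accessor are omitted to keep the sketch free of external dependencies, and a real implementation would delegate them to the chosen JSON library.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: a HashMap-backed Event.
// toJsonNode() and getMetadata() are omitted to keep this sketch dependency-free.
class MapEvent {
    private final Map<String, Object> data = new HashMap<>();

    /** Adds or updates the key with the given value. */
    public void put(String key, Object value) {
        data.put(key, value);
    }

    /** Retrieves the value of the given key, cast to the requested type. */
    public <T> T get(String key, Class<T> type) {
        return type.cast(data.get(key));
    }

    /** Deletes the given key. */
    public void delete(String key) {
        data.remove(key);
    }
}
```

A prepper could then enrich an event with `event.put("traceGroup", name)` and a downstream prepper could read it back with `event.get("traceGroup", String.class)` with no intermediate serialization.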
EventMetadata
EventMetadata will be a slight refactor of the existing RecordMetadata class. Currently, RecordMetadata maintains a map of attributes and has one required recordType attribute; however, the recordType has historically been ignored. In the new model, required attributes will be recategorized as POJO fields. The eventType will preserve the type (e.g., log, span) for casting and type validation. The EventMetadata class will still maintain a map for custom metadata defined in attributes.
```java
public class EventMetadata {
    private String eventType;
    private long timeReceived;
    private Map<String, Object> attributes;
}
```
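A fuller sketch of the POJO follows; the constructor and getters are assumptions for illustration, since the proposal only specifies the fields:

```java
import java.util.Map;

// Sketch only: the constructor and getters are assumed, not specified by the proposal.
class EventMetadata {
    private final String eventType;
    private final long timeReceived;
    private final Map<String, Object> attributes;

    public EventMetadata(String eventType, long timeReceived, Map<String, Object> attributes) {
        this.eventType = eventType;
        this.timeReceived = timeReceived;
        this.attributes = attributes;
    }

    public String getEventType() { return eventType; }
    public long getTimeReceived() { return timeReceived; }
    public Map<String, Object> getAttributes() { return attributes; }
}
```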
Span
Span will be a new model supporting traces. It will implement the Event interface and maintain the same attributes as the current RawSpan object. This will ensure backwards compatibility with our existing preppers.
Phased Approach
This design includes breaking changes, so it will be delivered in two phases. The first phase builds support for the new model and onboards log ingestion and trace analytics. The second phase deprecates the old model and will be part of the 2.0 release.
What are your assumptions or prerequisites?
The design and changes to the pipelines to enforce type safety are out of scope and should be addressed in a separate review. However, the output of this design should not hinder, but rather enable, type safety enforcement.
This aligns with the Log Ingestion RFC proposal.
What are remaining open questions?
- Which library should we use to support the underlying interfaces? (JsonPath or Jackson) JsonPath is a library for reading and updating JSON documents, and it natively supports dot notation. Jackson is a fast JSON library for parsing JSON documents, and it supports JSON Pointers for addressing nested values. Both libraries would work; Jackson is preferable to reduce dependencies.
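To illustrate the addressing difference this question raises: JsonPath reads dot-notation paths natively (e.g. `$.resource.serviceName`), while Jackson's JSON Pointers are slash-delimited (e.g. `/resource/serviceName`). If Jackson is chosen, dot-notation keys from user configuration could be translated with a one-line conversion. The sketch below shows that translation plus a dependency-free nested-map traversal; the `KeyResolver` helper and its method names are hypothetical, not part of either library.

```java
import java.util.Map;

// Hypothetical helper: translates a dot-notation key to a JSON Pointer
// and resolves a dot-notation key against nested maps. Names are illustrative only.
class KeyResolver {

    /** Converts "resource.serviceName" to "/resource/serviceName". */
    public static String toJsonPointer(String dotKey) {
        return "/" + dotKey.replace('.', '/');
    }

    /** Walks nested maps following a dot-notation key; returns null if any segment is absent. */
    public static Object resolve(Map<String, Object> root, String dotKey) {
        Object current = root;
        for (String part : dotKey.split("\\.")) {
            if (!(current instanceof Map)) {
                return null;
            }
            current = ((Map<?, ?>) current).get(part);
        }
        return current;
    }
}
```

With Jackson, the converted pointer could then be passed to `JsonNode.at(...)`; the traversal shown here is only a library-free stand-in for that lookup.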