
Support point-in-time-join on a given set of dataframes #1400

@jhasm-ck


Is your feature request related to a problem? Please describe.

What is a Point-in-Time (PiT) Correct Join?
A point-in-time correct join joins two tables so that each result row reflects the state of the joined table as of a specific point in time, rather than its latest state.

When is a Point-in-Time (PiT) Correct Join needed for creating training data?
When constructing a snapshot of data (e.g., training data or batch inference data) from precomputed features spread across different feature groups, we often need the feature values as they were at a specific point in time. For example, a training dataset for supervised ML is a snapshot in which each row holds the feature values as of the time that row's label was observed.

The difficulty in creating a point-in-time correct training data snapshot is that the underlying tables (feature groups) are typically updated at different cadences by different data pipelines, so their timestamps rarely line up. An exact-match join on timestamps therefore usually cannot produce the desired result. The solution is a point-in-time correct join that starts from the label timestamps and, for each label row, retrieves from every joined feature table the most recent feature values recorded at or before that timestamp.
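To make the semantics above concrete, here is a minimal pure-Python sketch of the lookup for a single feature table. The function name `pit_join`, the tuple layouts, and the sample values are all hypothetical and chosen only for illustration; this is not an existing Daft API.

```python
from bisect import bisect_right

def pit_join(labels, features):
    """Attach to each label row the latest feature value with feature_ts <= label_ts.

    labels:   iterable of (entity_id, label_ts, label)        -- hypothetical layout
    features: iterable of (entity_id, feature_ts, value)      -- hypothetical layout
    """
    # Index feature rows per entity, with timestamps in ascending order.
    by_entity = {}
    for entity, ts, value in sorted(features, key=lambda r: (r[0], r[1])):
        times, values = by_entity.setdefault(entity, ([], []))
        times.append(ts)
        values.append(value)

    joined = []
    for entity, ts, label in labels:
        times, values = by_entity.get(entity, ([], []))
        # Number of feature rows for this entity with timestamp <= label timestamp.
        i = bisect_right(times, ts)
        # No earlier feature row means the feature is unknown at label time.
        joined.append((entity, ts, label, values[i - 1] if i else None))
    return joined

labels = [("u1", 10, 1), ("u1", 20, 0), ("u2", 5, 1)]
features = [("u1", 8, 0.3), ("u1", 15, 0.7), ("u1", 25, 0.9), ("u2", 6, 0.1)]
result = pit_join(labels, features)
```

Note that the feature row for "u1" at time 25 is never used, because it was written after both labels were observed; picking it up would leak future information into the training data.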

Describe the solution you'd like
I want a solution similar to this one: https://www.hopsworks.ai/post/a-spark-join-operator-for-point-in-time-correct-joins
pandas has a merge_asof function that performs a similar operation, but it is not distributed.

Describe alternatives you've considered
There is no practical alternative for datasets too large to fit on a single node. For smaller datasets, pandas.merge_asof can be used.
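For reference, the single-node workaround might look like the sketch below, which chains pandas.merge_asof over a set of feature dataframes. The column names (`entity_id`, `event_time`, `clicks_7d`, `avg_spend`) and the data are made up for illustration; only `pd.merge_asof` and its `on`, `by`, and `direction` parameters are real pandas API.

```python
import pandas as pd

# Label table: one row per observation, timestamped at label time.
labels = pd.DataFrame({
    "entity_id": ["u1", "u1", "u2"],
    "event_time": pd.to_datetime(["2024-01-10", "2024-01-20", "2024-01-05"]),
    "label": [1, 0, 1],
})

# Two feature groups updated at different cadences (hypothetical data).
clicks = pd.DataFrame({
    "entity_id": ["u1", "u1", "u1", "u2"],
    "event_time": pd.to_datetime(["2024-01-08", "2024-01-15", "2024-01-25", "2024-01-06"]),
    "clicks_7d": [3.0, 7.0, 9.0, 1.0],
})
spend = pd.DataFrame({
    "entity_id": ["u1", "u2", "u1"],
    "event_time": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-18"]),
    "avg_spend": [20.0, 5.0, 35.0],
})

# merge_asof requires both sides to be sorted by the "on" key.
training = labels.sort_values("event_time")
for feature_group in (clicks, spend):
    training = pd.merge_asof(
        training,
        feature_group.sort_values("event_time"),
        on="event_time",       # timestamp column compared between the two sides
        by="entity_id",        # match rows only within the same entity
        direction="backward",  # latest feature row at or before the label time
    )
```

Each label row ends up with the most recent value from every feature group as of its own timestamp, and a missing value where no earlier feature row exists. Everything here runs in one process's memory, which is exactly the limitation this request is about.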

Additional context
This feature is critical for ML feature engineering: feature values are stored with timestamps, which makes every feature a time series. This might be the biggest missing feature blocking people from moving from Spark to Daft+Ray for feature engineering.

Labels

accepted: This issue has been accepted as a known bug, or important feature to include
enhancement: New feature or request
help wanted: Extra attention is needed
p1: Important to tackle soon, but preemptable by p0
