[RFC] User Behavior Insights #12084

@jzonthemtn

User Behavior Insights (UBI)

This RFC has been revised to describe an approach more integrated with OpenSearch. We now call this functionality "User Behavior Insights" (UBI).

Summary

This RFC is an evolution of 4619 to capture user behaviors and track queries through all steps of querying and website usage.

This RFC proposes functionality in OpenSearch to store application user behavior and corresponding queries in OpenSearch indexes. It also includes an analytics dashboard integrated with OpenSearch Dashboards for analyzing and visualizing the collected information.

UBI will link client-side actions with backend search actions, such as linking queries submitted by users with customer clients, scroll depth, and search result detail pages viewed.

What users have asked for this feature?

This functionality has been discussed at the OpenSearch Search Relevance Meetup, in individual conversations with users of OpenSearch, and with the larger community.

What problems are you trying to solve?

The key problem is that OpenSearch users are missing a holistic view of client-side, browser, and app events to enable a deeper understanding of search user behavior for the purposes of improving search relevance and user experience.

With this tooling, users of OpenSearch will be able to collect client-side events and link them with queries from their data stores. This will allow users to create a comprehensive view of users’ search journeys to improve the user experience.

What is the developer experience going to be?

Pre-Existing Work

The work described here has been successfully implemented as an OpenSearch plugin. Due to several factors such as maintaining a plugin, promoting adoption, and ease of use within OpenSearch, it has been determined that a plugin is not the optimal approach. This RFC has been updated to reflect this new direction.

For a description of the plugin's implementation, please see previous revisions of this issue or the plugin's repository.

Proposed Work

Core Contributions

  • All functionality will be implemented directly in the OpenSearch GitHub project. Queries performed against OpenSearch, along with the list of query results, will be persisted to an OpenSearch index.
  • Two indexes (described below) will be created to facilitate the persistence of client-side events.
  • Clients will be responsible for indexing client-side events in OpenSearch; this project will not add any endpoints to facilitate the indexing of events. This allows clients to use whatever method they prefer to index client-side events, whether it be directly indexing, using a custom pipeline, DataPrepper, OpenTelemetry, or other method of their choice.
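Because no ingestion endpoint is added, one option is for a client (or a relay service it controls) to send events through the existing OpenSearch _bulk API. The sketch below only builds the newline-delimited request body and does not send it. The helper name to_bulk_payload and the sample event values are illustrative assumptions; the index name .ubi_events is the one given in the persistence section of this RFC.

```python
import json

def to_bulk_payload(events, index=".ubi_events"):
    """Serialize events into the NDJSON body used by the _bulk API.

    Each event becomes two lines: an action line naming the target index,
    then the event document itself. A non-empty body ends with a newline.
    """
    lines = []
    for event in events:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(event))
    if not lines:
        return ""
    return "\n".join(lines) + "\n"

# Hypothetical events; field names follow the events mapping in this RFC.
sample_events = [
    {"action_name": "click", "query_id": "300d16cb-b6f1-4012-93eb-cc49cac90426"},
    {"action_name": "scroll", "query_id": "300d16cb-b6f1-4012-93eb-cc49cac90426"},
]
payload = to_bulk_payload(sample_events)
```

The same documents could equally be routed through a pipeline, Data Prepper, or OpenTelemetry; only the final document shape matters to the events index.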

Persistence of Queries and Client-Side Events

Queries, including their results, and client-side events will be indexed to two OpenSearch indices. One index will contain the queries, and the other will contain the client-side events.

These indices are .ubi_queries and .ubi_events. They will be created automatically and will store queries and events for all OpenSearch indices. (The plugin implementation had the concept of a "store", with a one-to-one correspondence between a store and an OpenSearch index. This is no longer necessary because the same result can be achieved with only these two indices.)

Schema of Queries Index

The queries index will contain all queries that were received by OpenSearch which include a top-level ubi block. The timestamp, query_id, and other information about the query will be indexed.

{
  "dynamic": false,
  "properties": {
    "timestamp": {
      "type": "date"
    },
    "index": { "type": "keyword", "ignore_above": 100 },
    "query_id": { "type": "keyword", "ignore_above": 100 },
    "query": {
      "type": "text"
    },
    "query_response_id": { "type": "keyword", "ignore_above": 100 },
    "query_response_hit_ids": { "type": "keyword" },
    "user_id": { "type": "keyword", "ignore_above": 100 },
    "session_id": { "type": "keyword", "ignore_above": 100 }
  }
}

Schema of Client-Side Events Index

The events index will contain the client-side events indexed into OpenSearch by the client. Some fields are standardized; most are optional. Others can be customized as needed.

{
  "properties": {
    "query_id": {
      "type": "keyword",
      "ignore_above": 100
    },
    "action_name": {
      "type": "keyword",
      "ignore_above": 100
    },
    "user_id": {
      "type": "keyword",
      "ignore_above": 100
    },
    "session_id": {
      "type": "keyword",
      "ignore_above": 100
    },
    "page_id": {
      "type": "keyword",
      "ignore_above": 256
    },
    "message": {
      "type": "keyword",
      "ignore_above": 256
    },
    "message_type": {
      "type": "keyword",
      "ignore_above": 100
    },
    "timestamp": {
      "type": "date",
      "doc_values": true
    },
    "event_attributes": {
      "properties": {
        "user_name": {
          "type": "keyword",
          "ignore_above": 256
        },
        "user_id": {
          "type": "keyword",
          "ignore_above": 100
        },
        "email": {
          "type": "keyword"
        },
        "price": {
          "type": "float"
        },
        "ip": {
          "type": "ip",
          "ignore_malformed": true
        },
        "browser": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "position": {
          "properties": {
            "ordinal": {
              "type": "integer"
            },
            "x": {
              "type": "integer"
            },
            "y": {
              "type": "integer"
            },
            "page_depth": {
              "type": "integer"
            },
            "scroll_depth": {
              "type": "integer"
            },
            "trail": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            }
          }
        },
        "object": {
          "properties": {
            "key_value": {
              "type": "keyword"
            },
            "object_id": {
              "type": "keyword",
              "ignore_above": 256
            },
            "object_type": {
              "type": "keyword",
              "ignore_above": 100
            },
            "transaction_id": {
              "type": "keyword",
              "ignore_above": 100
            },
            "name": {
              "type": "keyword",
              "ignore_above": 256
            },
            "description": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "to_user_id": {
              "type": "keyword",
              "ignore_above": 100
            },
            "object_detail": {
              "type": "object"
            }
          }
        }
      }
    }
  }
}
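
As an illustration of a document that fits this mapping, the following sketch assembles a single client-side event. The function name build_event, the action name item_click, and the sample IDs are assumptions made for the example; only the field names come from the mapping above. The timestamp is written as epoch milliseconds, one of the default formats accepted by the date field type.

```python
import time

def build_event(action_name, query_id, user_id, session_id,
                object_id=None, ordinal=None):
    """Build an event document using field names from the events mapping."""
    event = {
        "action_name": action_name,
        "query_id": query_id,  # links the event back to the search
        "user_id": user_id,
        "session_id": session_id,
        "timestamp": int(time.time() * 1000),  # epoch milliseconds
        "event_attributes": {},
    }
    # Optional attributes are added only when the client knows them.
    if ordinal is not None:
        event["event_attributes"]["position"] = {"ordinal": ordinal}
    if object_id is not None:
        event["event_attributes"]["object"] = {"object_id": object_id}
    return event

# Hypothetical click on the first result of a search.
click = build_event(
    "item_click",
    "300d16cb-b6f1-4012-93eb-cc49cac90426",
    user_id="user-42",
    session_id="session-7",
    object_id="1",
    ordinal=1,
)
```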

Query Requests and Query Responses

Assumption: the user is on a search-enabled website powered by OpenSearch containing the functionality described above.

When the user performs a search on the website, the query is sent to OpenSearch with a ubi block in the request. This ubi block provides information about the search and the presence of the block tells OpenSearch to persist this query and the query's results. An example ubi block is:

GET _search
{
  "ubi": {
    "query_id": "300d16cb-b6f1-4012-93eb-cc49cac90426",
    "options": {
      "robot": false,
      "mobile": true,
      "experiment_id": "exp00456"
    }
  },
  "query": {
    "query_string": {
      "query": "the wind AND (rises OR rising)"
    }
  }
}

The fields and their names in the ubi block may change, but the important part is the query_id value which uniquely identifies this search. This value is used to link client-side events with searches, and vice-versa. If the query_id value is not provided, OpenSearch will generate a random query_id and return its value in the search response.
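
A client that manages its own query IDs could assemble the request body along these lines. This is a minimal sketch: the function name build_search_request is hypothetical, and the ubi field names follow the example above, which the RFC notes may still change.

```python
import uuid

def build_search_request(query_text, query_id=None, options=None):
    """Return a search body with a ubi block.

    A random UUID is generated when the caller does not supply a query_id.
    """
    ubi = {"query_id": query_id or str(uuid.uuid4())}
    if options:
        ubi["options"] = options
    return {
        "ubi": ubi,
        "query": {"query_string": {"query": query_text}},
    }

request = build_search_request(
    "the wind AND (rises OR rising)",
    options={"robot": False, "mobile": True, "experiment_id": "exp00456"},
)
```

Generating the ID on the client means it is known before the response arrives, so events fired while the search is still in flight can already carry it.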

The presence of the ubi block in the search request causes OpenSearch to index the query and the query results.

Every search result has a unique ID. That result ID can be carried through the whole reporting system so that all actions are correlated with the result they came from. In many applications, there is additionally a unique item ID that identifies the underlying object referred to by the result ID. Many result IDs can map to one item ID (an N-to-1 relationship from result_ID to item_ID). That is, the same object may have been returned as result 2 of search 1234, and as result 7 of search 3456.
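
The relationship between one item and its many results can be illustrated by looking up the rank at which an item was returned across persisted queries. The sketch assumes records shaped like the queries index, that query_response_hit_ids preserves result order, and that hit IDs identify items; the function name ranks_of_item and the sample IDs are hypothetical.

```python
def ranks_of_item(query_records, item_id):
    """Map query_id -> 1-based rank at which item_id was returned."""
    ranks = {}
    for record in query_records:
        hit_ids = record.get("query_response_hit_ids", [])
        if item_id in hit_ids:
            ranks[record["query_id"]] = hit_ids.index(item_id) + 1
    return ranks

records = [
    {"query_id": "1234", "query_response_hit_ids": ["a", "item-9", "c"]},
    {"query_id": "3456",
     "query_response_hit_ids": ["d", "e", "f", "g", "h", "i", "item-9"]},
    {"query_id": "7890", "query_response_hit_ids": ["x", "y"]},
]
print(ranks_of_item(records, "item-9"))  # {'1234': 2, '3456': 7}
```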

Similarly, the search response will be modified to also include a ubi block:

{
  "took": 13,
  "timed_out": false,
  "ubi": {
    "query_id": "300d16cb-b6f1-4012-93eb-cc49cac90426"
  },
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.9808291,
    "hits": [
      {
        "_index": "students",
        "_id": "1",
        "_score": 0.9808291,
        "_source": {
          "name": "John Doe",
          "gpa": 3.89,
          "grad_year": 2022
        }
      }
    ]
  }
}

In the example above, the search response has been modified to include a ubi block which contains the query_id. If a query_id was provided in the query request, this will be the same value. If a query_id was not provided in the query request, the query_id in the response will be a random UUID. It is recommended that clients manage their own query IDs but OpenSearch will generate a random query ID when necessary to avoid any breaking behavior or undesired effects.
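
On the client, the ID to attach to later events can be read from the response. The sketch below is one possible way to do that, under the behavior described above; the function name resolve_query_id is hypothetical, and raising an error on a mismatch is a design choice of this example, not something the RFC specifies.

```python
def resolve_query_id(sent_query_id, response_body):
    """Return the query_id to attach to subsequent client-side events."""
    returned = response_body.get("ubi", {}).get("query_id")
    if returned is None:
        # No ubi block came back, e.g. the request did not carry one.
        return sent_query_id
    if sent_query_id is not None and returned != sent_query_id:
        raise ValueError("response query_id does not match the one sent")
    return returned

response = {
    "took": 13,
    "ubi": {"query_id": "300d16cb-b6f1-4012-93eb-cc49cac90426"},
}
# The client sent no ID, so the server-generated one is used.
query_id = resolve_query_id(None, response)
```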

Client-side JavaScript Reference Implementation

A reference implementation of the JavaScript client-side code to capture common events and index those events in OpenSearch will be provided. The code is not intended to be comprehensive or complete, but rather a starting point for users to modify to meet their unique needs.

Code Drops

The Code Drops described below were chosen to be atomic pieces of work suitable for pull requests and review/commit by OpenSearch maintainers. They were similarly selected to avoid any breaking changes. All Code Drops include the appropriate documentation and tests.

  • Code Drop 1 - Passing a ubi block in a query request and receiving back a search response with a ubi block containing the received query_id or a generated query_id if none was sent by the client.
  • Code Drop 2 - Automatically creating the indexes to store the queries and client-side events.
  • Code Drop 3 - Queries received containing a ubi block are persisted to the queries index.
  • Code Drop 4 - Addition of any user-configurable options for customizing operation.

Open Source and Best Practices

Research of currently available open source libraries under acceptable licenses will be conducted to discover which can be either utilized directly or customized to meet our needs.

We will “program to the interface” to permit future extensibility. For instance, while event data will be stored in OpenSearch, the design should not prevent a relational database from being used as the backend instead.

The development plan will evolve over time. Whenever possible, so as to not reinvent the wheel, priority will be given to the use of existing open source code as well as the application of existing standards.

Are there any security considerations?

  • Event data will be sent to OpenSearch for indexing and all communication needs to be over secure channels.
  • The client-side event capturing code must behave ethically and only track user activity when permitted.
  • Strict security parameters and constraints must be in place to connect the client code to the backend OpenSearch logging engine.

The community’s input around these items will be vital during development.

Are there any breaking changes to the API?

No breaking changes to the API are expected.

What is the user experience going to be?

The user will be able to analyze the collected events via a dashboard that is integrated with OpenSearch Dashboards. This functionality will likely be implemented as its own OpenSearch Dashboards plugin or integrated into the OpenSearch dashboards-search-relevance plugin.

The data will be queryable using SQL and/or DSL, and be exportable to an external data store for additional analysis or training machine learning models.
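
As one example of a DSL query over the collected data, the sketch below builds a search body that counts events per action_name for a single query_id. It assumes the events mapping described in this RFC, where both fields are keyword fields; the function name actions_for_query and the default bucket count are illustrative. The body would be sent to the _search endpoint of the events index.

```python
def actions_for_query(query_id, max_actions=20):
    """Build a DSL body that counts events per action for one query_id."""
    return {
        "size": 0,  # only the aggregation is needed, not the raw events
        "query": {"term": {"query_id": query_id}},
        "aggs": {
            "actions": {
                "terms": {"field": "action_name", "size": max_actions}
            }
        },
    }

body = actions_for_query("300d16cb-b6f1-4012-93eb-cc49cac90426")
```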

Are there breaking changes to the User Experience?

No breaking changes to the user experience are expected.

How is this different from other click-tracking applications?

  • It is focused on the highly granular data collection and analysis necessary for search relevance tuning, not on reporting on aggregates.
  • It gives developers full control of and access to their data.
  • It takes advantage of OpenSearch’s ability to log and analyze data, while leaving the client developer free to choose whichever JavaScript library they want, such as Snowplow.
  • This will provide a stronger “out of the box” search analytics focus than more general tools.
  • Stretch goal: near real time event tracking, with an eye to being able to provide data for personalization as the user is engaging with the search experience. Simply put, the ability to learn about an individual's preferences, not just focus on aggregated user preferences.
