Refactor BQ table listings to a side input #1841

@jklukas

One known failure mode for ingestion-beam is rate limiting from the BQ API when we list datasets and tables to check whether destination tables exist. See https://mozilla-hub.atlassian.net/browse/DSRE-194 and mozilla/bigquery-backfill#15.

Currently, every worker makes these API calls independently, which triggers rate limiting when we scale up the number of machines for backfills. I think it should be possible to express this table listing as a slowly updating global window side input, which would make the listing run on a single machine.

Currently, we look up the tables in a dataset only when we see a record with a destination table in that dataset. For the side input case, we'd need to list all datasets up front and then list all tables within each dataset, so the pipeline would need to be told which project to list datasets from.
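
To make the proposed listing step concrete, here is a rough plain-Java sketch of the "list all datasets, then all tables in each dataset" logic that such a side input could carry. It is only an illustration under stated assumptions: `TableCatalog`, `FakeCatalog`, and `TableListing` are hypothetical names, not part of ingestion-beam or the BigQuery client library, and the Beam wiring (periodic trigger, global window, side input view) is deliberately left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical abstraction over the BigQuery listing calls; not the real client API. */
interface TableCatalog {
  List<String> listDatasets(String project);

  List<String> listTables(String project, String dataset);
}

/** In-memory stand-in for the real API that counts how many listing calls were made. */
final class FakeCatalog implements TableCatalog {
  private final Map<String, List<String>> tablesByDataset;
  int calls = 0;

  FakeCatalog(Map<String, List<String>> tablesByDataset) {
    this.tablesByDataset = tablesByDataset;
  }

  @Override
  public List<String> listDatasets(String project) {
    calls++;
    return new ArrayList<>(tablesByDataset.keySet());
  }

  @Override
  public List<String> listTables(String project, String dataset) {
    calls++;
    return tablesByDataset.getOrDefault(dataset, Collections.emptyList());
  }
}

final class TableListing {
  private TableListing() {}

  /**
   * Lists every dataset in the given project, then every table in each dataset,
   * and returns an immutable set of "dataset.table" names. The project has to be
   * supplied explicitly because no incoming record drives the lookup.
   */
  static Set<String> snapshot(TableCatalog catalog, String project) {
    Set<String> existing = new HashSet<>();
    for (String dataset : catalog.listDatasets(project)) {
      for (String table : catalog.listTables(project, dataset)) {
        existing.add(dataset + "." + table);
      }
    }
    return Collections.unmodifiableSet(existing);
  }
}
```

With a snapshot like this produced in one place and refreshed occasionally, each worker would check `snapshot.contains("dataset.table")` instead of calling the API itself. The number of API calls per refresh is one dataset listing plus one table listing per dataset, regardless of how many workers are running.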
