Description
One known failure mode for ingestion-beam is rate limiting from the BigQuery API when we list datasets and tables to check whether destination tables exist. See https://mozilla-hub.atlassian.net/browse/DSRE-194 and mozilla/bigquery-backfill#15
Currently, every worker makes these API calls independently, which triggers rate limiting when we scale up the number of machines for backfills. I think it should be possible to express this table listing as a slowly updating global window side input, which would make the listing run on a single machine.
Currently, we look up the tables in a dataset only when we see a record with a destination table in that dataset. For the side input case, we'd need to list all datasets up front and then list all tables within each dataset, so we'd also need to provide information about which project to list datasets from.
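
A rough sketch of that two-level listing, in plain Java with no Beam or BigQuery dependencies. `TableCatalog`, `TableListing`, and `listAllTables` are hypothetical names made up for illustration; they are not existing ingestion-beam or BigQuery client APIs. A real version would presumably back the interface with the BigQuery client and deal with pagination and errors.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical abstraction over the dataset and table listing calls (an assumption, not a real API). */
interface TableCatalog {
  List<String> listDatasets(String project);

  List<String> listTables(String project, String dataset);
}

final class TableListing {
  private TableListing() {}

  /**
   * Lists every dataset in the project, then every table in each dataset,
   * and returns the result as an unmodifiable set of "dataset.table" names.
   */
  static Set<String> listAllTables(TableCatalog catalog, String project) {
    Set<String> tables = new HashSet<>();
    for (String dataset : catalog.listDatasets(project)) {
      for (String table : catalog.listTables(project, dataset)) {
        tables.add(dataset + "." + table);
      }
    }
    return Collections.unmodifiableSet(tables);
  }
}
```

The resulting set is the kind of value that could be published as the side input: each worker would check `tables.contains("dataset.table")` against the shared snapshot instead of calling the API itself. Note that this costs one dataset listing plus one table listing per dataset on every refresh, so the refresh interval would need to be chosen with the same rate limits in mind.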