fix: Handling Large S3 Files by jshlbrd · Pull Request #20 · brexhq/substation

jshlbrd · 2022-08-31T14:22:19Z

Description

- Fixes a regression introduced in #19 (refactor: Improve S3 Ingest Performance) when handling very large S3 files that contain a single line of text
- Refactors the Expand processor to speed up data processing for very large JSON objects that contain thousands of inner objects in an array

Motivation and Context

During migration of a data pipeline using #19, I discovered that it introduced a regression in an edge case: very large files downloaded from S3 that contain a single line of text (e.g., AWS CloudTrail files from very busy AWS accounts).

As a longer-term solution, the s3manager does a one-time check on initialization to determine how much memory can be allocated to a single token in a bufio scanner. It uses this pattern:

- by default, the token can be up to 100MB in size
  - in practice, it should be impossible to use this setting -- the s3manager is a private package that is only used by other private packages, but this is included as a default in case that changes
- when run in an AWS Lambda, the token can be up to half of the total memory of the Lambda
  - e.g., if the Lambda is 128MB, then the token can be up to 64MB; if the Lambda is 1024MB (1GB), then the token can be up to 512MB

This also refactors the Expand processor to handle very large JSON objects that contain many large inner objects.
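
To illustrate the shape of the problem, here is a minimal sketch of expanding an array of inner objects into separate documents. It is not the Expand processor's implementation, which this description does not show: the `expand` helper is hypothetical, and using `json.RawMessage` from the standard library to avoid decoding each inner object is an assumption about one reasonable approach.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// expand (hypothetical helper) takes a JSON object, finds the array stored
// under key, and returns each array element as its own raw JSON document.
func expand(data []byte, key string) ([][]byte, error) {
	// Decode only the top level; each value stays as undecoded bytes.
	var outer map[string]json.RawMessage
	if err := json.Unmarshal(data, &outer); err != nil {
		return nil, err
	}
	raw, ok := outer[key]
	if !ok {
		return nil, fmt.Errorf("key %q not found", key)
	}

	// Split the array into elements, again without decoding each element.
	var elems []json.RawMessage
	if err := json.Unmarshal(raw, &elems); err != nil {
		return nil, err
	}

	out := make([][]byte, 0, len(elems))
	for _, e := range elems {
		out = append(out, []byte(e))
	}
	return out, nil
}

func main() {
	// A CloudTrail-style payload: one object wrapping an array of records.
	data := []byte(`{"Records":[{"eventName":"GetObject"},{"eventName":"PutObject"}]}`)
	records, err := expand(data, "Records")
	if err != nil {
		panic(err)
	}
	for _, r := range records {
		fmt.Println(string(r))
	}
}
```

Keeping the inner objects as raw bytes means the cost of splitting grows with the size of the input rather than with the depth and field count of every inner object.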

How Has This Been Tested?

Tested in our development AWS account on AWS CloudTrail files uploaded to S3.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.

jshlbrd added 3 commits August 31, 2022 14:08

fix: corrects scanner regression

3f4f379

refactor: speed up processing large JSON objects

c0c208a

fix: halve lambda capacity

7c469bd

jshlbrd marked this pull request as ready for review August 31, 2022 14:24

jshlbrd requested a review from a team as a code owner August 31, 2022 14:24

julieagnessparks approved these changes Aug 31, 2022

jshlbrd merged commit 2791b91 into main Aug 31, 2022

jshlbrd deleted the jshlbrd/s3-large-files branch August 31, 2022 19:54

github-actions bot mentioned this pull request Aug 31, 2022

chore(main): release 0.4.0 #14

Merged
