
fix: Handling Large S3 Files #20

Merged
jshlbrd merged 3 commits into main from jshlbrd/s3-large-files
Aug 31, 2022


@jshlbrd jshlbrd commented Aug 31, 2022

Description

  • Fixes a regression introduced in #19 (refactor: Improve S3 Ingest Performance) that occurs when handling very large S3 files that contain a single line of text
  • Refactors the Expand processor to speed up processing of very large JSON objects that contain thousands of inner objects in an array

Motivation and Context

While migrating a data pipeline to #19, I discovered that it introduced a regression in an edge case: very large files downloaded from S3 that contain a single line of text (e.g., files from AWS CloudTrail in very busy AWS accounts).

As a longer-term solution, the s3manager does a one-time check on initialization to determine how much memory can be allocated to a single token in a bufio scanner, using this pattern:

  • by default, the token can be up to 100MB in size
    • in practice, it should be impossible to use this setting -- the s3manager is a private package that is only used by other private packages, but this is included as a default in case that changes
  • when run in an AWS Lambda, the token can be up to half of the total memory of the Lambda
    • e.g., if the Lambda is 128MB, then the token can be up to 64MB; if the Lambda is 1024MB (1GB), then the token can be up to 512MB
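The sizing rule above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the names `tokenCapacity` and `newScanner` are hypothetical, and reading the Lambda memory size from the `AWS_LAMBDA_FUNCTION_MEMORY_SIZE` environment variable (which the Lambda runtime sets, in MB) is an assumption about how the check might be done.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"strconv"
	"strings"
)

const mib = 1024 * 1024

// defaultCapacity is the fallback token limit (100MB) used outside of Lambda.
const defaultCapacity = 100 * mib

// tokenCapacity (hypothetical name) returns the maximum scanner token size
// in bytes. lambdaMemoryMB is the Lambda's configured memory in MB, or an
// empty string when not running in Lambda.
func tokenCapacity(lambdaMemoryMB string) int {
	mb, err := strconv.Atoi(lambdaMemoryMB)
	if err != nil || mb <= 0 {
		return defaultCapacity
	}
	// Half of the Lambda's total memory, e.g. 128MB -> 64MB.
	return mb * mib / 2
}

// newScanner (hypothetical name) returns a line scanner whose tokens may
// grow up to capacity bytes instead of the 64KB bufio default.
func newScanner(r io.Reader, capacity int) *bufio.Scanner {
	s := bufio.NewScanner(r)
	// The initial buffer is small; bufio grows it on demand up to capacity.
	s.Buffer(make([]byte, 0, 64*1024), capacity)
	return s
}

func main() {
	// Assumption: the memory size is read once, from the Lambda runtime's
	// environment variable; it is unset when running outside of Lambda.
	capacity := tokenCapacity(os.Getenv("AWS_LAMBDA_FUNCTION_MEMORY_SIZE"))

	// A single 1MB line, larger than bufio.MaxScanTokenSize (64KB).
	line := strings.Repeat("a", mib)
	s := newScanner(strings.NewReader(line+"\n"), capacity)
	for s.Scan() {
		fmt.Println("line length:", len(s.Bytes()))
	}
	if err := s.Err(); err != nil {
		fmt.Println("scan error:", err)
	}
}
```

Without the call to `Buffer`, a scanner stops with `bufio.ErrTooLong` on any line longer than 64KB, which matches the single-line failure described above.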

This also refactors the Expand processor to handle very large JSON objects that contain many large inner objects.

How Has This Been Tested?

Tested in our development AWS account on AWS CloudTrail files uploaded to S3.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

@jshlbrd jshlbrd marked this pull request as ready for review August 31, 2022 14:24
@jshlbrd jshlbrd requested a review from a team as a code owner August 31, 2022 14:24
@jshlbrd jshlbrd merged commit 2791b91 into main Aug 31, 2022
@jshlbrd jshlbrd deleted the jshlbrd/s3-large-files branch August 31, 2022 19:54