S3 file reader support #32

jankatins · 2020-04-14T10:02:06Z

Refactors the file reader and adds an S3 file reader as an alternative to local file reads.

New commands:

data_integration.parallel_tasks.files.ParallelReadS3File: reads in a whole bucket
data_integration.commands.files.ReadS3File: reads a single file from S3
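
For illustration only, here is a minimal sketch of how a single-file S3 read could be turned into a shell pipeline that streams the object instead of downloading it first. The helper name `build_s3_read_command` and the `DECOMPRESSORS` mapping are hypothetical and not taken from this PR; the sketch only assumes that the AWS CLI is installed and relies on `aws s3 cp <S3Uri> -` writing the object to stdout.

```python
import shlex

# Hypothetical mapping from a compression name to a decompression command.
# These names are assumptions for this sketch, not the PR's Compression enum.
DECOMPRESSORS = {
    "none": None,
    "gzip": "gunzip",
}


def build_s3_read_command(s3_uri: str, compression: str = "none") -> str:
    """Builds a shell command that streams one S3 object to stdout."""
    if compression not in DECOMPRESSORS:
        raise ValueError(f"unsupported compression: {compression!r}")
    # `aws s3 cp <S3Uri> -` writes the object to stdout instead of a local file
    command = f"aws s3 cp {shlex.quote(s3_uri)} -"
    decompressor = DECOMPRESSORS[compression]
    if decompressor:
        command += f" | {decompressor}"
    return command


if __name__ == "__main__":
    print(build_s3_read_command("s3://my-bucket/data/2020-04-14.csv.gz", "gzip"))
```

The resulting string could then be piped into whatever command loads the data into the target table; that part is left out here because it depends on the database in use.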

From initial testing, this is a lot slower than syncing and then reading from a local file system (both iterating over the bucket to get the file list and the individual reads), but syncing the bucket to a local file system also takes time. From my perspective, this is only worth it if you have to do a "sync to local" on every run (which we have to do, as there are no volumes in our ETL container :-(), so that the second run saves time compared to a sync plus an incremental read via the file system. That's at least the theory; so far I have only tested locally.
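
The "second run saves time" theory rests on reading only what changed since the last run. A rough sketch of that selection step, assuming the bucket listing is available as `(key, last_modified)` pairs and the previous run recorded what it read; `files_to_read` and both data structures are hypothetical and not part of this PR:

```python
from datetime import datetime
from typing import Dict, Iterable, List, Tuple


def files_to_read(listing: Iterable[Tuple[str, datetime]],
                  already_read: Dict[str, datetime]) -> List[str]:
    """Returns the keys that are new or were modified after they were last read."""
    return sorted(
        key for key, last_modified in listing
        if key not in already_read or last_modified > already_read[key]
    )


if __name__ == "__main__":
    listing = [
        ("data/a.csv", datetime(2020, 4, 1)),
        ("data/b.csv", datetime(2020, 4, 10)),
        ("data/c.csv", datetime(2020, 4, 14)),
    ]
    already_read = {
        "data/a.csv": datetime(2020, 4, 1),  # unchanged, skipped
        "data/b.csv": datetime(2020, 4, 5),  # modified since, read again
    }
    print(files_to_read(listing, already_read))
```

Note that this only avoids the individual reads; the iteration over the bucket to build the listing still has to happen on every run.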

The single file read will also come in handy as a replacement for Google Sheets imports.

WIP...

martin-loetzsch · 2020-04-15T15:57:42Z

data_integration/pipelines.py

+    initial_node: Task = None
+    final_node: Task = None

    def __init__(self, id: str,


One can also add a pipeline or ParallelTask as initial / final node

Not really: there are places that expect a Task; at least I had places where IntelliJ complained that a method wasn't available.

Fixed it in a different way
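
The trade-off discussed in this thread can be sketched with simplified stand-in classes. This is not the code in `data_integration/pipelines.py` and not necessarily how the PR resolved it: annotating the initial and final node with a common base type accepts tasks, parallel tasks and pipelines alike, but callers can then rely only on attributes of the base class, which is presumably where the IDE warnings came from.

```python
from typing import Optional


class Node:
    """Simplified stand-in for a common base class of everything in a pipeline."""

    def __init__(self, id: str) -> None:
        self.id = id


class Task(Node):
    pass


class ParallelTask(Node):
    pass


class Pipeline(Node):
    # Typed with the base class so that any kind of node is accepted
    initial_node: Optional[Node] = None
    final_node: Optional[Node] = None

    def add_initial(self, node: Node) -> "Pipeline":
        self.initial_node = node
        return self  # returning self allows chaining

    def add_final(self, node: Node) -> "Pipeline":
        self.final_node = node
        return self


if __name__ == "__main__":
    pipeline = (Pipeline("demo")
                .add_initial(ParallelTask("read_files"))
                .add_final(Task("cleanup")))
    print(type(pipeline.initial_node).__name__, pipeline.final_node.id)
```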

martin-loetzsch · 2020-04-15T15:57:59Z

data_integration/pipelines.py


-    def add_final(self, node: Node) -> 'Pipeline':
+    def add_final(self, node: Task) -> 'Pipeline':
        self.final_node = node


Fixed it in a different way.

martin-loetzsch

Looks very good otherwise. Please squash.

Let's wait with a release for the other PR


ghost · 2020-09-07T13:16:57Z

data_integration/commands/files.py

+class ReadS3File(_ReadFile):
+    """Reads data from a S3 file"""
+
+    def __init__(self, s3_url: str, compression: Compression, target_table: str,


I think the parameter s3_url should be called s3_uri, in line with the cp command. A URL is always a URI, but not all URIs are URLs. See also the Wikipedia article on URLs.
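
Whatever the parameter ends up being called, an `s3://bucket/key` string can be split with the standard library. A small sketch, where `split_s3_uri` is a hypothetical helper and not something this PR defines:

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(s3_uri: str) -> Tuple[str, str]:
    """Splits an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    # The path starts with the slash that separates bucket and key
    return parsed.netloc, parsed.path.lstrip("/")


if __name__ == "__main__":
    print(split_s3_uri("s3://my-bucket/path/to/file.csv"))
```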


martin-loetzsch · 2021-03-08T22:44:28Z

@jankatins is this running in production?

jankatins · 2021-03-08T23:01:14Z

@martin-loetzsch Nope. It should also be integrated into https://github.com/mara/mara-storage, where this looks much easier to do.

jankatins added 4 commits April 14, 2020 11:43

Fix typing information in pipeline

c39697d

Fix call in error case

98f18d1

Refactor out a base class for file reading

2e280bd

Add s3 file reader commands

22b6e0c

jankatins requested review from Tafkas and martin-loetzsch April 14, 2020 10:02

martin-loetzsch reviewed Apr 15, 2020


martin-loetzsch approved these changes Apr 15, 2020


Allow parallel tasks in final/intitial node

5696a0b

This reverts commit c39697d.

ghost suggested changes Sep 9, 2020


Merge remote-tracking branch 'master' into s3_reader

e1ddf1c

# Conflicts: # mara_pipelines/commands/files.py

ghost added the enhancement New feature or request label Sep 15, 2020
