Conversation
elijahbenizzy left a comment:
Looking good! Most curious about the parquet thing -- wonder if we can use duckdb:memory...
```python
try:
    import pyarrow as pa

    DATAFRAME_TYPES.extend([pa.Table, pa.RecordBatch])
except ModuleNotFoundError:
    pass
```
So this works; the problem is that the error message will be confusing. For now, however, I think we can just document it well.
Why would we want to throw an error? You mean if a node has annotations for pa.Table, but pyarrow is not installed and therefore finds no registered materializers?
Yeah, exactly. It's just a confusing case, but let's not worry about it (maybe add to docs).
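To make that failure less confusing, one option is an explicit check with a pointer at the missing optional dependency. This is only a sketch: `DATAFRAME_TYPES`, the function name, and the message wording are assumptions for illustration, not the actual Hamilton internals.

```python
# Sketch: clearer error when a node is annotated with a type whose
# optional dependency (e.g. pyarrow) was never installed, so no
# materializer got registered for it.
DATAFRAME_TYPES = []  # normally populated by optional-extension registration


def check_materializer_registered(annotated_type: type) -> None:
    """Raise an informative error if no materializer is registered for the type."""
    if annotated_type not in DATAFRAME_TYPES:
        raise TypeError(
            f"No materializer registered for {annotated_type.__name__!r}. "
            "If this is a pyarrow type, install the optional dependency: "
            "pip install pyarrow"
        )
```

This turns the "silently finds no materializer" case into a direct hint, which could complement the documentation note.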
```python
# TODO: use pyarrow directly to support different dataframe libraries
# ref: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html#pyarrow.parquet.ParquetDataset
df = pd.concat([pd.read_parquet(f) for f in partition_file_paths], ignore_index=True)
```
Parquet locally versus duckdb? Will the next `pipeline.drop()` erase it?
So the dlt implementation for in-memory duckdb is currently bugged (they're fixing it), but I found a workaround.
Using duckdb in-memory doesn't really provide additional value and only adds a dependency:

source -> extract (parquet) -> normalize (parquet) -> load (parquet) -> duckdb (memory) -> query db (memory) -> to pandas (memory)

At the end of the process, memory (duckdb, pandas) is freed and the dlt pipeline is cleaned up (extract, normalize, load).
My current implementation skips the duckdb steps:

source -> extract (parquet) -> normalize (parquet) -> read parquet partitions (memory) -> pandas (memory)
The related design decision is "how can I selectively reload dlt `Source`s?". For example, only load Slack messages once, then run Hamilton dataflows many times over that same data.
This would be an ELT use case (dlt -> Hamilton) where you want to refresh each independently. It probably makes more sense to have `run_dlt.py` and `run_hamilton.py`.
The current `Source` materializer (`from_`) with everything in-memory aims to enable ET, and the user is responsible for loading the data, potentially with the `Destination` (`to`) materializer, which does TL.
Not sure I'm following, but just to mention: we can just cache the result of this function if we don't want it to run more than once?
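Memoizing the load, as suggested, could be as simple as `functools.lru_cache` (a sketch; `load_slack_messages` and its return value are placeholders, and note this only avoids re-runs within one process, not across separate script invocations):

```python
import functools

CALLS = {"n": 0}  # only here to demonstrate the body runs once


@functools.lru_cache(maxsize=1)
def load_slack_messages() -> tuple:
    """Placeholder for an expensive dlt load, cached for the process lifetime."""
    CALLS["n"] += 1
    return ("msg1", "msg2")  # stand-in for the loaded data
```

Repeated calls within the same process reuse the cached result, so downstream Hamilton runs would not re-trigger the load.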
So yeah, the worry is that it fills up disk space. But I think we can switch this up as needed. Seems reasonable.
elijahbenizzy left a comment:
Let's ship, see where it goes from here!
The dlt library provides many `Source`s and `Destination`s to extract (E) and load (L) data. With Hamilton being a transform (T) tool, the goal is to build constructs to:

1. load data from a dlt `Resource` (`DltResourceLoader`)
2. save data to a dlt `Destination` (`DltDestinationSaver`)

These two features alone allow for the full flexibility of ETL, ELTL, etc.
For a good user experience, and to more easily integrate "Hamilton within dlt", it is valuable to have:

3. ETL: essentially a combination of 1. and 2.
4. ELT: no special integration required here (dlt does EL, then Hamilton does T), but the transition could be streamlined. We can showcase that via `pipeline.sql_client()`. This is where the dlt + Ibis integration would shine.

## Changes
- `DltResourceLoader` (1.)
- `DltDestinationSaver` materializer (2.)

## How I tested this

## Notes

- A `Source` has many `Resource`s that would map to Hamilton nodes. A `Resource` is closer to a `DataLoader`, but the `Source` construct remains needed because it is responsible for authentication and more.

## Checklist