Conversation
elijahbenizzy left a comment:
Looking good! Most curious about the parquet thing -- wonder if we can use duckdb:memory...
```python
try:
    import pyarrow as pa

    DATAFRAME_TYPES.extend([pa.Table, pa.RecordBatch])
except ModuleNotFoundError:
    pass
```
So this works; the problem is that the error message will be confusing. For now, however, I think we can just document it well.
Why would we want to throw an error? You mean if a node has annotations for pa.Table, but pyarrow is not installed and therefore finds no registered materializers?
Yeah, exactly. It's just a confusing case, but let's not worry about it (maybe add to docs).
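To make that failure less confusing, one option is an explicit check with a pointer at the missing optional dependency. This is only a sketch: `DATAFRAME_TYPES`, the function name, and the message wording are assumptions for illustration, not the actual Hamilton internals.

```python
# Sketch: clearer error when a node is annotated with a type whose
# optional dependency (e.g. pyarrow) was never installed, so no
# materializer got registered for it.
DATAFRAME_TYPES = []  # normally populated by optional-extension registration


def check_materializer_registered(annotated_type: type) -> None:
    """Raise an informative error if no materializer is registered for the type."""
    if annotated_type not in DATAFRAME_TYPES:
        raise TypeError(
            f"No materializer registered for {annotated_type.__name__!r}. "
            "If this is a pyarrow type, install the optional dependency: "
            "pip install pyarrow"
        )
```

This turns the "silently finds no materializer" case into a direct hint, which could complement the documentation note.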
```python
# TODO: use pyarrow directly to support different dataframe libraries
# ref: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html#pyarrow.parquet.ParquetDataset
df = pd.concat([pd.read_parquet(f) for f in partition_file_paths], ignore_index=True)
```
Parquet locally versus duckdb? Will the next `pipeline.drop()` erase it?
So the dlt implementation for in-memory duckdb is currently bugged (they're fixing it), but I found a workaround.
Using duckdb in-memory doesn't really provide additional value and only adds a dependency:

source -> extract (parquet) -> normalize (parquet) -> load (parquet) -> duckdb (memory) -> query db (memory) -> to pandas (memory)

At the end of the process, memory (duckdb, pandas) is freed and the dlt pipeline is cleaned up (extract, normalize, load).
My current implementation skips the duckdb steps:

source -> extract (parquet) -> normalize (parquet) -> read parquet partitions (memory) -> pandas (memory)
The related design decision is "how can I selectively reload dlt `Source`s?". For example, only load Slack messages once, then run Hamilton dataflows many times over that same data.
This would be an ELT use case (dlt -> Hamilton) where you want to refresh each independently. It probably makes more sense to have `run_dlt.py` and `run_hamilton.py`.
The current `Source` materializer (`from_`) with everything in-memory aims to enable ET, and the user is responsible for loading the data, potentially with the `Destination` (`to`) materializer, which does TL.
Not sure I'm following, but just to mention: we can just cache the result of this function if we don't want it to run more than once?
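Memoizing the load, as suggested, could be as simple as `functools.lru_cache` (a sketch; `load_slack_messages` and its return value are placeholders, and note this only avoids re-runs within one process, not across separate script invocations):

```python
import functools

CALLS = {"n": 0}  # only here to demonstrate the body runs once


@functools.lru_cache(maxsize=1)
def load_slack_messages() -> tuple:
    """Placeholder for an expensive dlt load, cached for the process lifetime."""
    CALLS["n"] += 1
    return ("msg1", "msg2")  # stand-in for the loaded data
```

Repeated calls within the same process reuse the cached result, so downstream Hamilton runs would not re-trigger the load.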
So yeah, the worry is that it fills up disk space. But I think we can switch this up as needed. Seems reasonable.
elijahbenizzy left a comment:
Let's ship, see where it goes from here!
The dlt library provides many `Source`s and `Destination`s to extract (E) and load (L) data. With Hamilton being a transform (T) tool, the goal is to build constructs to:

1. load data from a dlt `Resource` (`DltResourceLoader`)
2. save data to a dlt `Destination` (`DltDestinationSaver`)

These two features alone allow for the full flexibility of ETL, ELTL, etc.
For a good user experience, and to more easily integrate "Hamilton within dlt", it is valuable to have:

3. ETL: essentially a combination of 1. and 2.
4. ELT: no special integration required here (dlt does EL, then Hamilton does T), but the transition could be streamlined. We can showcase that via `pipeline.sql_client()`. This is where the dlt + Ibis integration would shine.

## Changes
- `DltResourceLoader` (1.)
- `DltDestinationSaver` materializer (2.)

## How I tested this

## Notes

- A `Source` has many `Resource`s that would map to Hamilton nodes. A `Resource` is closer to a `DataLoader`, but the `Source` construct remains needed because it is responsible for authentication and more.

## Checklist