Add a datasource field to the DesignWorkflow object (#968)
Conversation
```python
# TODO Figure out how to populate the column definitions
return CSVDataSource(
    file_link=FileLink(url=args[0], filename=args[1]),
    column_definitions={}
)
```
There is no obvious method to identify what the column definitions and identifiers would be from just the DataSourceId.
Would it be too disruptive to issue a warning here, so the user knows that the data source they're getting is incomplete? Actually, maybe the warning should come when they access the DesignWorkflow.data_source field, if it's a CSVDataSource with no column_definitions.
Either way, I'd like it if we could alert them that this isn't a completely accurate representation.
Yeah, using a warning makes sense. I don't think people are using CSVDataSources anymore, but...
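A minimal sketch of the warning idea discussed above, using a hypothetical `CSVDataSourceStub` and `warn_if_incomplete` helper (not the real citrine-python classes) to show where the `UserWarning` could fire:

```python
import warnings
from dataclasses import dataclass, field


@dataclass
class CSVDataSourceStub:
    """Hypothetical stand-in for CSVDataSource, for illustration only."""
    url: str
    filename: str
    column_definitions: dict = field(default_factory=dict)


def warn_if_incomplete(source: CSVDataSourceStub) -> CSVDataSourceStub:
    # A data source rebuilt from a bare DataSourceId has no column
    # definitions, so alert the user that it is not fully accurate.
    if not source.column_definitions:
        warnings.warn(
            "CSVDataSource was rebuilt from a data_source_id and has no "
            "column_definitions; it may not round-trip accurately.",
            UserWarning,
        )
    return source
```

The same check could instead live behind the `DesignWorkflow.data_source` accessor, as suggested above, so the warning fires on read rather than on construction.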
```python
return "::".join(str(x) for x in [self._data_source_type, self.datasource_id])
```

```python
class SnapshotDataSource(Serializable['SnapshotDataSource'], DataSource):
```
This will presumably be useful soon, since it's the next step after Multistep Materials tables.
Yeah, Bill and I were just talking about what's needed to properly support training, etc on snapshots.
```python
@classmethod
def _pre_build(cls, data: dict) -> dict:
    """Run data modification before building."""
    data_source_id = data.pop("data_source_id", None)
    if data_source_id is not None:
        data["data_source"] = DataSource.from_data_source_id(data_source_id).dump()
    return data


def _post_dump(self, data: dict) -> dict:
    """Run data modification after dumping."""
    data_source = data.pop("data_source", None)
    if data_source is not None:
        data["data_source_id"] = DataSource.build(data_source).to_data_source_id()
    else:
        data["data_source_id"] = None
    return data
```
This uses the existing serde hooks to filter out data_source_id fields and insert DataSource objects.
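The round-trip those hooks perform can be sketched with simplified, hypothetical versions that operate on plain dicts (the real hooks build and dump actual DataSource objects):

```python
def pre_build(data: dict) -> dict:
    # Replace the flat "type::id..." string the API returns with a
    # structured dict, standing in for a real DataSource object.
    data_source_id = data.pop("data_source_id", None)
    if data_source_id is not None:
        type_name, *ids = data_source_id.split("::")
        data["data_source"] = {"type": type_name, "ids": ids}
    return data


def post_dump(data: dict) -> dict:
    # Collapse the structured dict back to the flat string the API expects.
    data_source = data.pop("data_source", None)
    if data_source is not None:
        data["data_source_id"] = "::".join([data_source["type"], *data_source["ids"]])
    else:
        data["data_source_id"] = None
    return data
```

Applying `pre_build` then `post_dump` reproduces the original `data_source_id`, which is the invariant the serde hooks rely on.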
```python
@pytest.fixture
def design_workflow_dict(generic_entity):
    ret = generic_entity.copy()
    ret.update({
        "name": "Example Design Workflow",
        "description": "A description! Of the Design Workflow! So you know what it's for!",
        "design_space_id": str(uuid.uuid4()),
        "predictor_id": str(uuid.uuid4()),
        "predictor_version": random.randint(1, 10),
        "branch_id": str(uuid.uuid4()),
    })
    return ret
```
Having two different DesignWorkflow test objects (a fixture & a Factory) made testing on the updated object complicated.
```python
@pytest.fixture
def workflow_minimal(collection, workflow) -> DesignWorkflow:
    workflow.predictor_id = None
    workflow.predictor_version = None
    workflow.design_space_id = None
    return workflow
```
This became obsolete because it was only used twice, and so was better served by a Factory with arguments.
```python
types: List[Type[Serializable]] = [
    CSVDataSource, GemTableDataSource, ExperimentDataSourceRef, SnapshotDataSource
]
```
Nitpick: maybe lift this list to the class level, as it's duplicated above.
```python
res = next((x for x in types if x._data_source_type == terms[0]), None)
if res is None:
    raise ValueError("Unrecognized type: {}".format(terms[0]))
```
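The prefix-based dispatch can be demonstrated end to end with toy classes standing in for the real DataSource subclasses (`GemTableSourceStub` and `SnapshotSourceStub` are hypothetical names for illustration):

```python
from typing import List, Tuple, Type


class GemTableSourceStub:
    """Toy stand-in for GemTableDataSource."""
    _data_source_type = "gem_table"


class SnapshotSourceStub:
    """Toy stand-in for SnapshotDataSource."""
    _data_source_type = "snapshot"


TYPES: List[Type] = [GemTableSourceStub, SnapshotSourceStub]


def resolve(data_source_id: str) -> Tuple[Type, List[str]]:
    # Split "type::id1::id2..." and match the prefix against each
    # class's _data_source_type, mirroring the dispatch above.
    terms = data_source_id.split("::")
    res = next((x for x in TYPES if x._data_source_type == terms[0]), None)
    if res is None:
        raise ValueError("Unrecognized type: {}".format(terms[0]))
    return res, terms[1:]
```

Lifting the list to the class level, as suggested, would give both the parser and the serializer a single source of truth for the known prefixes.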
```python
return "::".join(
    str(x) for x in [self._data_source_type, self.table_id, self.table_version]
)
```
Nitpick: what's the benefit of this over an f-string?

```python
f"{self._data_source_type}::{self.table_id}::{self.table_version}"
```
My thought when writing it was that every case is a series of identifiers joined by `::`, but yes, it didn't end up reading well. I considered a `get_identifiers` method whose result gets joined at the parent level, but that seemed a bit absurd. I'll swap to an f-string.
```python
def to_data_source_id(self) -> str:
    """Generate the data_source_id for this DataSource."""
    return "::".join(str(x) for x in [self._data_source_type, self.datasource_id])
```
Same nitpick as above.
```python
def to_data_source_id(self) -> str:
    """Generate the data_source_id for this DataSource."""
    return "::".join(str(x) for x in [self._data_source_type, self.snapshot_id])
```
anoto-moniz left a comment:
Mostly looks good, just some nitpick comments which aren't worth blocking on.
My biggest concern is the one you call out in the TODOs regarding CSVDataSource. I'd like the sort of warning I mention, but I'm approving anyway since I don't think it's the most crucial thing, and we can always add it later if it becomes a problem.
Citrine Python PR
Description
When a user of the UI creates a branch with a data source but no predictor yet, the API DesignWorkflow object holds a direct reference to the data source. Previously, when a Predictor was referenced by the DesignWorkflow, that Predictor contained a reference to the DataSource, so an SDK user could ultimately recover the value; without a Predictor, there is no such path.
Unfortunately, the API returns a DataSourceId, which is a specially formatted string, not a DataSource object as with other API references to DataSources. To avoid creating a new class or requiring users to manually construct a specially formatted string, this PR also adds basic mapping tools between DataSourceId strings and DataSource objects, as well as a convenience method for generating a GemTableDataSource from a GemTable object.
I think this approach is ready, but happy to work through concerns or potential improvements.
Prerequisite PR for https://citrine.atlassian.net/browse/PNE-731
PR Type:
Adherence to team decisions