[NDT-561] Base orchestrator, PodRun, PodResult + LocalDockerOrchestrator implementation #44
…eck dictionary, and fix typos.
…or `get_info`, local docker orchestrator implementation, move model metadata regex to a lazy once_cell, default `compute_checksum` implementation (passthrough), add name generator utility, lazy set/access utility, update tests with a real style transfer example, and update tests.
…e tests, and clean up.
…ct fields without getters, skip delete if no run found, and remove unneeded error.
…r to share memory state between client/docker daemon.
…raise error if accessing purged pod, add datetime parsing utility, update orchestrator test.
…orchestrator tests.
…pdate docs, and add to spelling dictionary.
Title changed from "LocalDockerOrchestrator support" to "PodRun, PodResult and LocalDockerOrchestrator implementation"
Title changed from "PodRun, PodResult and LocalDockerOrchestrator implementation" to "PodRun, PodResult + LocalDockerOrchestrator implementation"
…ectory`, expose `PodRun` fields publicly for user access, replace `canonicalize` with `absolute` since `canonicalize` requires the path to exist, and add test for remote container image.
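The swap from `canonicalize` to `absolute` matters because `std::fs::canonicalize` asks the OS to resolve the path and fails if it does not exist yet, while `std::path::absolute` only joins the path with the current working directory. A minimal sketch of the difference, assuming Rust 1.79 or later (where `std::path::absolute` is stable); `resolve_output_dir` and the directory name are hypothetical:

```rust
use std::fs;
use std::path::{absolute, Path, PathBuf};

/// Hypothetical helper: resolve `path` against the current working directory
/// without touching the filesystem, so it also works for output directories
/// that will only be created later.
fn resolve_output_dir(path: &Path) -> std::io::Result<PathBuf> {
    absolute(path)
}

fn main() {
    // Hypothetical output location that does not exist yet.
    let missing = Path::new("orcapod-output-does-not-exist/run-1");

    // `canonicalize` resolves the path on disk, so a missing path is an error.
    assert!(fs::canonicalize(missing).is_err());

    // `absolute` succeeds even though nothing exists on disk.
    let resolved = resolve_output_dir(missing).expect("current dir is readable");
    assert!(resolved.is_absolute());
    assert!(resolved.ends_with("orcapod-output-does-not-exist/run-1"));
    println!("{}", resolved.display());
}
```

Note that `absolute` does not resolve symlinks, so two different absolute paths may still point at the same directory.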
src/orchestrator/mod.rs
```rust
}
/// Available states of a run.
#[derive(Serialize, Deserialize, Debug, PartialEq, Eq, Clone)]
pub enum RunState {
```
Should probably rename this to just `State`, since it is also used in `PodResult`. Also, perhaps move it to `utils` since it is used by multiple modules?
Also, is there no option for queuing? Sometimes an orchestrator may already be running its maximum number of jobs, as in K8s, and a new job has to wait its turn.
Make an issue about finding a way to deal with this via enum inheritance (though I don't think that exists either).
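The review suggestions in this thread (rename `RunState` to a shared `State`, add a queued state for orchestrators with a job cap) could look roughly like the sketch below. It is hypothetical: the variant set, the `i16` exit code on `Failed`, and `is_terminal` are illustrative, and the real type would keep its `Serialize`/`Deserialize` derives, which are dropped here so the sketch needs no external crates.

```rust
/// Hypothetical shared status enum for both `PodRun` and `PodResult`.
/// (The real type would also derive serde's `Serialize`/`Deserialize`.)
#[derive(Debug, PartialEq, Eq, Clone)]
pub enum State {
    /// Accepted by the orchestrator but waiting for capacity (e.g. a K8s job cap).
    Queued,
    /// Container is running.
    Running,
    /// Finished successfully.
    Completed,
    /// Finished with a hypothetical non-zero exit code.
    Failed(i16),
}

impl State {
    /// True once the run can no longer change state.
    pub fn is_terminal(&self) -> bool {
        matches!(self, State::Completed | State::Failed(_))
    }
}

fn main() {
    let states = [State::Queued, State::Running, State::Completed, State::Failed(1)];
    let terminal: Vec<bool> = states.iter().map(State::is_terminal).collect();
    assert_eq!(terminal, vec![false, false, true, true]);
}
```

Since Rust has no enum inheritance, a single shared enum like this is one way to avoid duplicating the variants across modules.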
src/store/filestore.rs
```rust
    }
}

#[expect(clippy::unwrap_used, reason = "Valid static regex")]
```
Hmm, nice, we don't have to redo this every time.
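The "build once, reuse afterwards" behaviour comes from lazy statics; a later commit in this PR moves from `once_cell` to `std::sync::LazyLock`. A minimal sketch, assuming Rust 1.80 or later: a std-only `HashSet` stands in for the compiled regex (in the PR this would be a `LazyLock` around the regex built from the static pattern), and the names `RESERVED_NAMES`, `INIT_CALLS`, and `is_reserved` are hypothetical.

```rust
use std::collections::HashSet;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::LazyLock;

// Counts how many times the initializer runs, to show it happens only once.
static INIT_CALLS: AtomicUsize = AtomicUsize::new(0);

// Stand-in for a compiled regex: built on first access, then reused by every
// later call. The word list is hypothetical.
static RESERVED_NAMES: LazyLock<HashSet<&'static str>> = LazyLock::new(|| {
    INIT_CALLS.fetch_add(1, Ordering::SeqCst);
    ["model", "pod", "pod_job"].into_iter().collect()
});

fn is_reserved(name: &str) -> bool {
    RESERVED_NAMES.contains(name)
}

fn main() {
    assert_eq!(INIT_CALLS.load(Ordering::SeqCst), 0); // nothing built yet
    assert!(is_reserved("pod"));
    assert!(!is_reserved("style_transfer"));
    assert_eq!(INIT_CALLS.load(Ordering::SeqCst), 1); // built exactly once
}
```

Because the initializer runs at most once, an `unwrap` inside it (as with a known-valid static regex) can only panic on first access, which is what the `#[expect(clippy::unwrap_used, ...)]` attribute documents.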
```rust
    .join(pod_job.output_stream_path.location.clone());
fs::create_dir_all(&host_output_directory)?;
// Prepare configuration
let container_name = Generator::with_naming(Name::Plain)
```
Need to look up how this works
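For reference, `Generator::with_naming(Name::Plain)` comes from the `names` crate: the generator is an iterator whose `next()` yields random `adjective-noun` names such as `pushy-pencil` (`Name::Numbered` appends a numeric suffix). The std-only sketch below mimics that behaviour to illustrate the idea. It is not the crate's implementation: `PlainNames`, the tiny word lists, and the xorshift pseudo-random source are all hypothetical.

```rust
/// Hypothetical stand-in for `names::Generator` with `Name::Plain`:
/// an iterator yielding `adjective-noun` container names.
struct PlainNames {
    state: u64, // xorshift64 state; must stay non-zero
}

const ADJECTIVES: [&str; 4] = ["brave", "calm", "eager", "witty"];
const NOUNS: [&str; 4] = ["otter", "pencil", "comet", "maple"];

impl PlainNames {
    fn with_seed(seed: u64) -> Self {
        // xorshift64 would get stuck at 0, so replace a zero seed with a constant.
        PlainNames { state: if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed } }
    }

    /// xorshift64: cheap pseudo-random numbers, not suitable for security.
    fn next_u64(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        x
    }
}

impl Iterator for PlainNames {
    type Item = String;

    fn next(&mut self) -> Option<String> {
        let adjective = ADJECTIVES[(self.next_u64() % ADJECTIVES.len() as u64) as usize];
        let noun = NOUNS[(self.next_u64() % NOUNS.len() as u64) as usize];
        Some(format!("{adjective}-{noun}"))
    }
}

fn main() {
    let names: Vec<String> = PlainNames::with_seed(42).take(3).collect();
    for name in &names {
        let (adjective, noun) = name.split_once('-').expect("plain names contain one dash");
        assert!(ADJECTIVES.contains(&adjective));
        assert!(NOUNS.contains(&noun));
        println!("{name}");
    }
}
```

With real word lists in the thousands, collisions between container names become unlikely but not impossible, so `Name::Numbered` or a uniqueness check may be worth considering.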
…s`, and fix typo in debugger.
… remove `new` for `PodRun`, and add minor improvement to orchestrator test.
… name to `PodRun`.
…_cell` since `LazyLock` is now part of the standard library, alphabetize cargo config, and add panic messages for invalid regex.
eywalker left a comment
Marking this as a change request, but it's really just to raise a few discussion points, which I've commented on.
… to return as-is or in `snake_case`, and update comment on `save_model`.
…up `get_type_name`.
Synicix left a comment
Looks good with the updates.
Depends on #42, #43
Features
- `PodRun` (see design below)
- `mod docker` that implements a local Docker orchestrator (see design below)
- `PodResult` model (see design below)
- `PodJob` memory state is shared via container labels, so it can be reconstructed independently across client sessions, e.g. calling `orchestrator.list()` or `orchestrator.get_result(pod_run)` on a different machine than the one you started on.
- `PodJob`
- `compute_checksum` definition for `BlobInterface` (passthrough)
- `once_cell` for setting regexes statically but with lazy evaluation (on first access)

Project Management
My goal with this PR was to complete an MVP version of Orcapod with enough features to be usable in alpha testing. That said, not everything we've discussed is implemented here, but I propose we split the remaining items into individual issues that can be addressed independently without blocking a release for testing.
Specifically, I'm proposing the following changes:
Update the following issues to match the scope of this PR:
- `Orchestrator` trait #18: Update to reflect a minimal orchestrator trait and Docker implementation.
- `PodRun` #12: Update to reflect a minimal `PodRun`.
- `PodResult` Model #14: Update to reflect a minimal `PodResult`.
- `list_model` is stable

Open separate issues for:
- `PodResult` -> Orchestrator logging interface + Docker implementation #50
- `BlobInterface`: Evaluating checksums of `Folder`s + iterating over `Collection`s -> Evaluate checksums for `Folder` and `Collection` #54
- `PodResult` -> Save compute output checksums in `PodResult` #55
- `PodResult` -> Orchestrator resource metrics interface + Docker implementation #51
- `load_image` interface + Docker implementation #52
- `PodJob` -> Add compute retry mechanism to `PodJob` #53

Update the spec for the following issues:
- `image@digest` on starting containers #9: Resolve the `start_with_alt` image tag to an actual SHA256 digest + use it to launch the container (`image@digest`).
- `thiserror` #40: Redefine for `thiserror` since it is stable and provides the features we were looking for in `anyhow`.

Design
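Issue #9 proposes resolving an image tag to its SHA256 digest and launching the container as `image@digest`. A minimal std-only sketch of the string-handling half, assuming the digest has already been resolved elsewhere (for example from the image's repo digests as reported by Docker); `pin_to_digest` is a hypothetical helper, not part of this PR.

```rust
/// Hypothetical helper for #9: given an image reference such as
/// `localhost:5000/style-transfer:latest` and the SHA256 digest it resolves to,
/// build a pinned `repo@sha256:<hex>` reference.
fn pin_to_digest(image: &str, digest: &str) -> Result<String, String> {
    let hex = digest
        .strip_prefix("sha256:")
        .ok_or_else(|| format!("unsupported digest algorithm: {digest}"))?;
    if hex.len() != 64 || !hex.chars().all(|c| matches!(c, '0'..='9' | 'a'..='f')) {
        return Err(format!("malformed SHA256 digest: {digest}"));
    }
    // Drop any existing digest, then any tag. A tag is a ':' after the last
    // '/', so a registry port such as `localhost:5000/` is left untouched.
    let without_digest = image.split_once('@').map_or(image, |(repo, _)| repo);
    let name_start = without_digest.rfind('/').map_or(0, |i| i + 1);
    let repo = match without_digest[name_start..].rfind(':') {
        Some(i) => &without_digest[..name_start + i],
        None => without_digest,
    };
    Ok(format!("{repo}@{digest}"))
}

fn main() {
    let digest = format!("sha256:{}", "ab".repeat(32));
    let pinned = pin_to_digest("localhost:5000/style-transfer:latest", &digest).unwrap();
    assert_eq!(pinned, format!("localhost:5000/style-transfer@{digest}"));
    assert!(pin_to_digest("alpine:3.19", "md5:1234").is_err());
    println!("{pinned}");
}
```

Pinning by digest means a later push to the same tag cannot silently change what a re-run executes, which is the reproducibility concern behind #9.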