Skip to content

[NDT-561] Base orchestrator, PodRun, PodResult + LocalDockerOrchestrator implementation#44

Merged
eywalker merged 30 commits intonauticalab:devfrom
guzman-raphael:orchestrator
Feb 18, 2025
Merged

[NDT-561] Base orchestrator, PodRun, PodResult + LocalDockerOrchestrator implementation#44
eywalker merged 30 commits intonauticalab:devfrom
guzman-raphael:orchestrator

Conversation

@guzman-raphael
Copy link
Copy Markdown
Contributor

@guzman-raphael guzman-raphael commented Nov 8, 2024

Depends on #42, #43

Features

  • Add orchestrator traits+ PodRun (see design below)
  • Add mod docker that implements a local docker orchestrator (see design below)
  • Add initial PodResult model (see design below). PodJob memory state is shared via container labels so it can reconstruct independently across client sessions e.g. calling orchestrator.list() or orchestrator.get_result(pod_run) on a different machine than you started on.
  • Add optional environment variables to PodJob
  • Add default compute_checksum definition for BlobInterface (passthrough)
  • Allow DevContainer to set CPU/Memory limits for containers
  • Use once_cell for setting RegEx's statically but with lazy evaluation (on first access)
  • Add fixture and test support for a simple, real, python-based style transfer example
  • Update dependencies
  • Update tests

Project Management

My goal of this PR was to complete a MVP version of Orcapod that has enough features to be usable in alpha testing. That said, not everything we've discussed is implemented here but I propose we separate them out into individual issues that can be addressed independently without blocking a release for testing.

Specifically, I'm proposing the following changes:

Update the following issues such that this PR:

Open separate issues for:

Update the spec for the following issues:

Design

mod model {
    pub struct PodResult {
        #[serde(skip)]
        pub annotation: Option<Annotation>,
        #[serde(skip)]
        pub hash: String,
        #[serde(
            serialize_with = "serialize_pod_job",
            deserialize_with = "deserialize_pod_job"
        )]
        pub pod_job: PodJob,
        /// Name given by orchestrator.
        pub assigned_name: String,
        /// Status of compute run when terminated.
        pub status: Status,
        /// Time in epoch when created in seconds.
        pub created: u64,
        /// Time in epoch when terminated in seconds.
        pub terminated: u64,
    }
}

mod orchestrator {
    pub enum ImageKind {
        Published(String),
        Tarball(PathBuf),
    }

    pub enum Status {
        Running,
        Completed,
        Failed(i16),
    }

    pub struct RunInfo {
        pub image: String,
        pub created: u64,
        pub terminated: Option<u64>,
        pub env_vars: HashMap<String, String>,
        pub command: String,
        pub status: Status,
        pub mounts: Vec<String>,
        pub labels: HashMap<String, String>,
        pub cpu_limit: f32,
        pub memory_limit: u64,
    }

    pub struct PodRun
    {
        pub pod_job: PodJob,
        pub orchestrator_source: String,
        pub assigned_name: String,
    }

    pub trait Orchestrator
    {
        fn start_with_altimage_blocking(&self, pod_job: &PodJob, image: &ImageKind) -> Result<PodRun>;
        fn start_blocking(&self, pod_job: &PodJob) -> Result<PodRun>;
        fn list_blocking(&self) -> Result<Vec<PodRun>>;
        fn delete_blocking(&self, pod_run: &PodRun) -> Result<()>;
        fn get_info_blocking(&self, pod_run: &PodRun) -> Result<RunInfo>;
        fn get_result_blocking(&self, pod_run: &PodRun) -> Result<PodResult>;
        fn start_with_altimage(
                &self,
                pod_job: &PodJob,
                image: &ImageKind,
            ) -> impl Future<Output = Result<PodRun>> + Send;
        fn start(&self, pod_job: &PodJob) -> impl Future<Output = Result<PodRun>> + Send;
        fn list(&self) -> impl Future<Output = Result<Vec<PodRun>>> + Send;
        fn delete(&self, pod_run: &PodRun) -> impl Future<Output = Result<()>> + Send;
        fn get_info(&self, pod_run: &PodRun) -> impl Future<Output = Result<RunInfo>> + Send;
        fn get_result(&self, pod_run: &PodRun) -> impl Future<Output = Result<PodResult>> + Send;
    }

    mod docker {
        pub struct LocalDockerOrchestrator {
            data_directory: PathBuf,
            api: Docker,
            async_driver: Runtime,
        }

        impl Orchestrator for LocalDockerOrchestrator {..}

        impl LocalDockerOrchestrator {
            pub fn new(data_directory: impl AsRef<Path>) -> Result<Self> {
                Ok(Self {
                    data_directory: fs::canonicalize(data_directory)?,
                    api: Docker::connect_with_local_defaults()?,
                    async_driver: Runtime::new()?,
                })
            }

            fn prepare_container_start_inputs(
                &self,
                pod_job: &PodJob,
                image: String,
            ) -> Result<(
                String,
                Option<CreateContainerOptions<String>>,
                Config<String>,
            )> {..}

            async fn list_containers(
                &self,
                filters: HashMap<String, Vec<String>>,
            ) -> Result<impl Iterator<Item = (String, RunInfo)>> {..}
        }
    }
}

@eywalker eywalker changed the title LocalDockerOrchestrator support [NDT-561] LocalDockerOrchestrator support Jan 24, 2025
…or `get_info`, local docker orchestrator implementation, move model metadata regex to a lazy once_cell, default `compute_checksum` implementation (passthrough), add name generator utility, lazy set/access utility, update tests with a real style transfer example, and update tests.
…ct fields without getters, skip delete if no run found, and remove unneeded error.
…r to share memory state between client/docker daemon.
…raise error if accessing purged pod, add datetime parsing utility, update orchestrator test.
…pdate docs, and add to spelling dictionary.
This was unlinked from issues Jan 28, 2025
@guzman-raphael guzman-raphael changed the title [NDT-561] LocalDockerOrchestrator support [NDT-561] Base orchestrator, PodRun, PodResult and LocalDockerOrchestrator implementation Jan 28, 2025
@guzman-raphael guzman-raphael changed the title [NDT-561] Base orchestrator, PodRun, PodResult and LocalDockerOrchestrator implementation [NDT-561] Base orchestrator, PodRun, PodResult + LocalDockerOrchestrator implementation Jan 28, 2025
…ectory`, expose `PodRun` fields publicly since for user access, replace `canonicalize` with `absolute` since requires existence, and add test for remote container image.
}
/// Available states of a run.
#[derive(Serialize, Deserialize, Debug, PartialEq, Eq, Clone)]
pub enum RunState {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably rename to just State due to the usage in pod result. Also perhaps move this to utils since it is used by mutiple modules?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also is there no option for queuing? Sometimes orcherstrator might have max jobs already running like in K8 and has to wait their turn

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make an issue about either finding a way to deal with it via enum inheratince (don't think there either)
Other resolved

}
}

#[expect(clippy::unwrap_used, reason = "Valid static regex")]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, nice we don't have to redo thsi every time

.join(pod_job.output_stream_path.location.clone());
fs::create_dir_all(&host_output_directory)?;
// Prepare configuration
let container_name = Generator::with_naming(Name::Plain)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to look up how this works

… remove `new` for `PodRun`, and add minor improvement to orchestrator test.
…_cell` since `LazyLock` now part of standard library, alphabetize cargo config, and add panic messages for invalid regex.
Copy link
Copy Markdown
Contributor

@eywalker eywalker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marking it as a change request but it's really to just raise a few discussion points on which I commented.

… to return as-is or in `snake_case`, and update comment on `save_model`.
Copy link
Copy Markdown
Contributor

@Synicix Synicix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good with the updates

@eywalker eywalker merged commit 4e092e5 into nauticalab:dev Feb 18, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Revisit List Model output Minimum Orchestrator trait Minimal PodResult Model Minimum PodRun

3 participants