
feat: design a generic memory space staging API#205

Draft
dssgabriel wants to merge 5 commits into kokkos:develop from dssgabriel:feature/host-staging

Conversation

@dssgabriel
Collaborator

@dssgabriel dssgabriel commented Jan 8, 2026

This PR introduces a minimal API that allows us to perform host staging whenever the underlying communication backend does not support GPU-aware operations (currently, this can only occur with MPI):

  • stage_for: creates a mirror view if the passed view type isn't host-accessible, otherwise returns the same view
  • copy_back: deep-copies the host mirror view back into the device view when the latter is not host-accessible (required for receive operations), otherwise does nothing

I am restricting this API to internal use for now, hence it lives behind the Impl:: namespace.
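For intuition, the dispatch these two helpers perform can be sketched in standalone C++ (no Kokkos dependency). The tag types, the View stand-in, and the bodies below are illustrative only, with std::vector copies standing in for Kokkos::deep_copy:

```cpp
#include <cassert>
#include <type_traits>
#include <vector>

// Hypothetical tag types and View stand-in; only the control flow mirrors
// the proposed stage_for / copy_back semantics.
struct HostSpace {};
struct DeviceSpace {};

template <class MemSpace>
struct View {
  using memory_space = MemSpace;
  std::vector<int> data;
};

// stage_for: return the view itself when it is host-accessible,
// otherwise produce a host mirror holding a copy of the data.
template <class MemSpace>
auto stage_for(const View<MemSpace> &v) {
  if constexpr (std::is_same_v<MemSpace, HostSpace>) {
    return v;  // already host-accessible: no staging needed
  } else {
    View<HostSpace> mirror;
    mirror.data = v.data;  // device -> host staging copy
    return mirror;
  }
}

// copy_back: deep-copy the host mirror into the device view when the
// destination is not host-accessible; otherwise there is nothing to do.
template <class MemSpace>
void copy_back(View<MemSpace> &dst, const View<HostSpace> &staged) {
  if constexpr (!std::is_same_v<MemSpace, HostSpace>) {
    dst.data = staged.data;  // host -> device copy-back
  }
}
```

Note that the return type of stage_for differs per branch (the original view vs. a host mirror), which is why the real implementation relies on `if constexpr` rather than a runtime branch.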

This is a critical feature: it lets us fall back to pure CPU-based communication for MPI routines that are not yet GPU-aware (e.g., non-blocking collectives in Open MPI).

Usage would look something like this (here in mpi::iallreduce):

template <KokkosView SView, KokkosView RView, KokkosExecutionSpace ExecSpace>
auto iallreduce(const ExecSpace &space, const SView sv, RView rv, MPI_Op op, MPI_Comm comm) -> Req<MpiSpace> {
  using ST = typename SView::non_const_value_type;
  using RT = typename RView::non_const_value_type;
  static_assert(std::is_same_v<ST, RT>, "KokkosComm::mpi::iallreduce: View value types must be identical");
  Kokkos::Tools::pushRegion("KokkosComm::mpi::iallreduce");

  fail_if(!is_contiguous(sv) || !is_contiguous(rv),
          "KokkosComm::mpi::iallreduce: unimplemented for non-contiguous views");

  // Sync: work enqueued on `space` may still be producing the view data.
  space.fence("fence before non-blocking all-reduce");

  Req<MpiSpace> req;
  if constexpr (Impl::is_gpu_aware()) {  // NOTE: this API does not exist (yet)
    MPI_Iallreduce(data_handle(sv), data_handle(rv), span(sv), datatype<MpiSpace, ST>(), op, comm, &req.mpi_request());
  } else {
    auto host_staged_sv = KokkosComm::Impl::stage_for(space, sv);
    auto host_staged_rv = KokkosComm::Impl::stage_for(space, rv);
    space.fence();
    MPI_Iallreduce(data_handle(host_staged_sv), data_handle(host_staged_rv), span(host_staged_sv), datatype<MpiSpace, ST>(), op, comm, &req.mpi_request());
    req.call_after_mpi_wait([=]() {
      KokkosComm::Impl::copy_back(space, rv, host_staged_rv);
      space.fence();
    });
  }
  req.extend_view_lifetime(sv);
  req.extend_view_lifetime(rv);

  Kokkos::Tools::popRegion();
  return req;
}

I added some unit tests to verify that it somewhat works as is, but I am not yet fully satisfied with the design.

Some open questions:

  • Should both functions take an execution space?
  • Do we need other functions besides these two?
  • Should this PR include GPU-awareness detection for the MPI implementation (as used in the example code above)? This may require some non-trivial build system logic to make it work at compile time, since not all MPI implementations provide a way to detect GPU-awareness then (looking at you, MPICH).
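On the last point, one possible shape for compile-time detection, as a hedged sketch: Open MPI exposes the macro MPIX_CUDA_AWARE_SUPPORT through <mpi-ext.h> (pulled in by <mpi.h>), while MPICH has no compile-time equivalent, so the probe must conservatively report "not GPU-aware" when the macro is absent. The Impl::is_gpu_aware() name matches the example above; the body is an assumption, not a proposal for the final logic:

```cpp
#include <type_traits>

// Only probe MPI headers when they are actually available, so this sketch
// compiles with or without an MPI installation.
#if __has_include(<mpi.h>)
#include <mpi.h>
#endif

namespace KokkosComm::Impl {
// Hypothetical compile-time GPU-awareness probe.
constexpr bool is_gpu_aware() {
#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
  return true;   // Open MPI built with CUDA support
#else
  return false;  // unknown at compile time (e.g. MPICH): assume host staging
#endif
}
}  // namespace KokkosComm::Impl
```

A runtime check via Open MPI's MPIX_Query_cuda_support() could complement this, but it cannot drive `if constexpr` dispatch.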

@dssgabriel dssgabriel self-assigned this Jan 8, 2026
@dssgabriel dssgabriel added C-enhancement Category: an enhancement or bug fix A-core Area: KokkosComm core library implementation E-hard Call for participation: hard - high experience level required A-mpi Area: KokkosComm MPI backend implementation labels Jan 8, 2026
Signed-off-by: Gabriel Dos Santos <gabriel.dossantos@cea.fr>
@dssgabriel dssgabriel force-pushed the feature/host-staging branch from 46c64da to e0a6a0b Compare January 8, 2026 16:29
Signed-off-by: Gabriel Dos Santos <gabriel.dossantos@cea.fr>
@dssgabriel dssgabriel force-pushed the feature/host-staging branch from e0a6a0b to 29eb2f2 Compare January 8, 2026 16:40
Signed-off-by: Gabriel Dos Santos <gabriel.dossantos@cea.fr>
Signed-off-by: Gabriel Dos Santos <gabriel.dossantos@cea.fr>
Signed-off-by: Gabriel Dos Santos <gabriel.dossantos@cea.fr>
@dssgabriel dssgabriel changed the title feat: add basic host staging MPI feat: add basic host staging API Jan 12, 2026
@cedricchevalier19 cedricchevalier19 added this to the Version 0.1 milestone Jan 15, 2026
Member

@cedricchevalier19 cedricchevalier19 left a comment


I think it is not easy to review this PR without the use case.

Do you mind doing another PR with the usage of the API?

@dssgabriel
Collaborator Author

Do you mind doing another PR with the usage of the API?

@cedricchevalier19 I have opened #208, which implements host staging for P2P interfaces in the MPI backend. I will do collectives next, but you can start to have a look already.

@dssgabriel dssgabriel marked this pull request as draft January 27, 2026 10:17
@dssgabriel
Collaborator Author

I am turning this into a draft.

My plan is to rework this PR to let users stage a View in any memory space.
I will close #208 and reimplement host staging solely using create_mirror_view_and_copy.
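As a non-compiling sketch (it requires Kokkos), the reworked, memory-space-generic stage_for might reduce to a single call; the body below is an assumption about the plan, not the final design:

```cpp
// Hypothetical generalized stage_for: stage a view into any target memory
// space. Kokkos::create_mirror_view_and_copy returns the view itself when it
// is already accessible from `target`, otherwise it allocates a mirror there
// and deep-copies the data -- which is exactly the stage_for contract.
template <class MemSpace, class View>
auto stage_for(const MemSpace &target, const View &v) {
  return Kokkos::create_mirror_view_and_copy(target, v);
}
```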

@cedricchevalier19
Member

That makes sense, thank you.

@dssgabriel dssgabriel changed the title feat: add basic host staging API feat: design a generic memory space staging API Feb 20, 2026
