In multi-modal single-cell experiments, we obtain data of different modalities (e.g., RNA, protein) from the same set of cells. Naturally, we would like to combine data from different modalities to increase the information available for each cell in further analyses. This is most relevant to analysis steps that operate on cells, e.g., clustering, visualization with t-SNE or UMAP. The simplest combining strategy is to just concatenate the per-modality data matrices together into a single matrix for further analysis. While convenient and compatible with many downstream procedures, this is complicated by the differences in the variance between modalities. Higher noise in one modality might drown out biological signal in another modality that has lower variance.
The mumosa algorithm scales the data from each modality to equalize the "uninteresting" noise prior to concatenation.
Each modality should be represented as a low-dimensional embedding (e.g., after PCA) for more efficient neighbor searches.
Given the embedding coordinates for multiple modalities, we compute the median distance to the k-th nearest neighbor within each modality.
```cpp
#include "knncolle/knncolle.hpp"
#include "mumosa/mumosa.hpp"

#include <memory>
#include <utility>
#include <vector>

// Mocking up some modalities. For each modality 'm', we have a column-major
// array 'embeddings[m]' of size 'dimensions[m] * nobs'.
int nobs = 1000;
std::vector<int> dimensions(3, 20);
std::vector<std::vector<double> > embeddings(3);
for (int m = 0; m < 3; ++m) {
    embeddings[m].resize(nobs * dimensions[m]);
}

// Configuring the neighbor search algorithm; here, we'll be using an exact
// search based on VP trees with a Euclidean distance metric.
knncolle::VptreeBuilder<int, double, double> vp_builder(
    std::make_shared<knncolle::EuclideanDistance<double, double> >()
);

// Computing distances per modality.
mumosa::Options opt;
opt.num_neighbors = 20;
opt.num_threads = 3;
std::vector<std::pair<double, double> > distances(3);
for (int m = 0; m < 3; ++m) {
    distances[m] = mumosa::compute_distance(
        dimensions[m],
        nobs,
        embeddings[m].data(),
        vp_builder,
        opt
    );
}
```

We compute scaling factors for each modality:
```cpp
auto scale = mumosa::compute_scale(distances);
```

And combine the scaled per-modality embeddings into a single matrix, which can be used for downstream steps like k-means clustering:
```cpp
#include <numeric>

std::size_t ntotal = std::accumulate(dimensions.begin(), dimensions.end(), static_cast<std::size_t>(0));
std::vector<double> combined(ntotal * nobs);
std::vector<const double*> inputs;
for (const auto& em : embeddings) {
    inputs.push_back(em.data());
}
mumosa::combine_scaled_embeddings(
    dimensions,
    nobs,
    inputs,
    scale,
    combined.data()
);
```

Check out the reference documentation for more details.
The premise of the mumosa approach is that the distance to the k-th nearest neighbor is a measure of the "uninteresting" noise within each modality.
Ideally, the median distance-to-neighbor would serve as a proxy for the average variance within subpopulations of at least k + 1 cells, such that equalizing the distances equalizes the noise across modalities. However, this interpretation comes with some caveats:
- Each modality may have a different subpopulation structure. A modality with a small number of large subpopulations will have a lower median distance-to-neighbor than a modality with a large number of small subpopulations, even if the variance within each subpopulation is the same - this would result in inappropriate upscaling of the former. In practice, this is not too problematic as the definition of a "subpopulation" is so vague that it's hard to say that our scaling is obviously wrong. For example, a big blob of cells may contain further interesting structure, in which case mumosa's upscaling would be appropriate. Users who know better (e.g., from control data) can adjust the scaling factors to give appropriate weights to each modality.
- The median distance-to-neighbor is not an accurate relative measure of the variance at lower dimensions. Even in the simplest cases of i.i.d. noise, the distance is not proportional to the standard deviation at lower dimensions (see analysis here). Nonetheless, mumosa can still be useful for downstream procedures that perform distance calculations between cells, as it ensures that each modality contributes equally to the distance between cells from the same subpopulation in the combined embedding.
One appeal of mumosa is its simplicity relative to other approaches, e.g., multi-modal factor analyses or the intersection of simplicial sets. No further transformations beyond scaling are performed, ensuring that the population structure within each modality is faithfully represented in the combined embedding. It is very easy to implement and the result is directly compatible with any downstream analysis step that can operate on an embedding matrix. In fact, we only care about the median distance, so we could save even more time by performing the neighbor search for only a subset of cells.
If you're using CMake, you just need to add something like this to your CMakeLists.txt:
```cmake
include(FetchContent)

FetchContent_Declare(
  mumosa
  GIT_REPOSITORY https://github.com/libscran/mumosa
  GIT_TAG master # replace with a pinned release
)

FetchContent_MakeAvailable(mumosa)
```

Then you can link to mumosa to make the headers available during compilation:
```cmake
# For executables:
target_link_libraries(myexe libscran::mumosa)

# For libraries:
target_link_libraries(mylib INTERFACE libscran::mumosa)
```

By default, this will use FetchContent to fetch all external dependencies.
Applications should consider pinning versions of all dependencies - see extern/CMakeLists.txt for suggested versions.
If you want to install them manually, use -DMUMOSA_FETCH_EXTERN=OFF.

Alternatively, if the library is already installed on the system, it can be found with find_package():

```cmake
find_package(libscran_mumosa CONFIG REQUIRED)
target_link_libraries(mylib INTERFACE libscran::mumosa)
```

To install the library, use:
```sh
mkdir build && cd build
cmake .. -DMUMOSA_TESTS=OFF
cmake --build . --target install
```

Again, this will use FetchContent to retrieve dependencies, see comments above.
If you're not using CMake, the simple approach is to just copy the files in include/ - either directly or with Git submodules - and include their path during compilation with, e.g., GCC's -I.
This also requires the external dependencies listed in extern/CMakeLists.txt.