-
Notifications
You must be signed in to change notification settings - Fork 61
PCA integration method implemented + tested #265
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
Changes from all commits
fbf9be2
c674e5e
ea18649
0959dba
b17692e
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,36 @@ | ||
| # EXPIMAP Pathway Database | ||
|
|
||
| ## Overview | ||
| This directory contains pathway databases used by the EXPIMAP module for interpretable embedding and pathway analysis. | ||
|
|
||
| ## Files | ||
|
|
||
| ### pathways.gmt | ||
| - **Format**: Gene Matrix Transposed (GMT) | ||
| - **Content**: Reactome pathway gene sets | ||
| - **Description**: Contains biological pathway gene sets from the Reactome database that are used to construct the latent space in EXPIMAP models. Each row represents a pathway with its associated genes. | ||
| - **Source**: [Reactome Database](https://reactome.org/) | ||
| - **Version/Date**: Please refer to Reactome documentation for version information | ||
| - **Usage**: Used by the EXPIMAP module to guide the learning of interpretable latent representations in single-cell transcriptomics data | ||
|
|
||
| ## Usage in Pipeline | ||
|
|
||
| The pathway file is automatically used by the EXPIMAP integration method. Users can specify a custom pathway file via the `expimap_gmt` parameter in the pipeline parameters, or use the default provided in this directory. | ||
|
|
||
| Example: | ||
| ```bash | ||
| nextflow run nf-core/scdownstream \ | ||
| --input samplesheet.csv \ | ||
| --outdir results \ | ||
| --integration_methods expimap \ | ||
| --expimap_gmt assets/databases/expimap/pathways.gmt | ||
| ``` | ||
|
|
||
| ## Notes | ||
|
|
||
| - The GMT format contains tab-separated values where each line represents one pathway | ||
| - First column: pathway name | ||
| - Second column: pathway URL/description | ||
| - Remaining columns: genes in the pathway | ||
| - The default Reactome pathways are suitable for most analyses of mammalian cells (especially human and mouse) | ||
| - For custom pathway analysis, users can provide their own GMT files using the `--expimap_gmt` parameter |
Large diffs are not rendered by default.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,9 @@ | ||
| channels: | ||
| - conda-forge | ||
| - bioconda | ||
| dependencies: | ||
| - python=3.14 | ||
| - pip | ||
| - pip: | ||
| - scArches==0.6.1 | ||
| - anndata==0.9.2 | ||
|
Comment on lines
+1
to
+9
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This environment uses an outdated version of anndata - is that necessary to get expimap working?
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Actually, I am trying to resolve an anndata problem. The first module test exposed an error with anndata.
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah you're talking about this
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That's a bit annoying in this context as we would need to install from the master branch, but with seqera containers we can only install published versions
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Exactly, for some reason for scarches=0.6.1 (which should be the most recent one) I receive anndata error
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The problem is that the older anndata version will cause some other compatibility issues |
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,35 @@ | ||
| process SCARCHES_EXPIMAP { | ||
| tag "${meta.id}" | ||
| label 'process_medium' | ||
|
|
||
| conda "${moduleDir}/environment.yml" | ||
| container "${workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container | ||
| ? 'https://wave.seqera.io/view/builds/bd-e532922f69f9a648_1' | ||
| : 'community.wave.seqera.io/library/pip_scarches:7e8c7e577326f6ad'}" | ||
|
|
||
| input: | ||
| tuple val(meta), path(h5ad, arity: 1) | ||
| tuple val(meta2), path(reference_model) | ||
| val(batch_col) | ||
| val(counts_layer) | ||
|
|
||
| output: | ||
| tuple val(meta), path("${prefix}.h5ad"), emit: h5ad | ||
| path "X_${prefix}.pkl", emit: obsm | ||
| path "versions.yml", emit: versions | ||
|
|
||
| script: | ||
| prefix = task.ext.prefix ?: "${meta.id}" | ||
| if ("${prefix}.h5ad" == "${h5ad}") { | ||
| error "Input and output names are the same, use \"task.ext.prefix\" to disambiguate!" | ||
| } | ||
| template('expimap.py') | ||
|
|
||
| stub: | ||
| prefix = task.ext.prefix ?: "${meta.id}" | ||
| """ | ||
| touch ${prefix}.h5ad | ||
| touch X_${prefix}.pkl | ||
| touch versions.yml | ||
| """ | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,69 @@ | ||
| name: "scarches_expimap" | ||
| description: "Train EXPIMAP model from scArches for interpretable embedding and pathway analysis" | ||
| keywords: | ||
| - scarches | ||
| - expimap | ||
| - interpretable embeddings | ||
| - pathway analysis | ||
| - single-cell | ||
| tools: | ||
| - "scarches": | ||
| description: "Transfer learning for single-cell genomics" | ||
| homepage: "https://github.com/theislab/scarches" | ||
| documentation: "https://scarches.readthedocs.io" | ||
| tool_dev_url: "https://github.com/theislab/scarches" | ||
| doi: "10.1038/s41592-021-01232-1" | ||
| licence: ["BSD-3-Clause"] | ||
|
|
||
| input: | ||
| - meta: | ||
| type: map | ||
| description: | | ||
| Groovy Map containing sample metadata | ||
| e.g. `[ id:'sample1' ]` | ||
| - h5ad: | ||
| type: file | ||
| pattern: "*.{h5ad}" | ||
| description: "Annotated data matrix (h5ad format)" | ||
|
|
||
| - meta2: | ||
| type: map | ||
| description: | | ||
| Groovy Map for optional reference model metadata | ||
| e.g. `[ id:'reference' ]` | ||
| - reference_model: | ||
| type: file | ||
| pattern: "*.{gmt,gmx}" | ||
| description: "Optional gene module reference file (GMD/GMT format) for biological knowledge" | ||
| required: false | ||
|
|
||
| - batch_col: | ||
| type: string | ||
| description: "Column name in adata.obs containing batch/condition information" | ||
|
|
||
| - counts_layer: | ||
| type: string | ||
| description: "Layer name containing count data. Default uses 'X' (expression matrix)" | ||
|
|
||
| output: | ||
| - h5ad: | ||
| type: file | ||
| pattern: "*.h5ad" | ||
| description: "Output h5ad file with EXPIMAP embedding in obsm['X_expimap_latent']" | ||
|
|
||
| - obsm: | ||
| type: file | ||
| pattern: "X_*.pkl" | ||
| description: "Pickle file containing EXPIMAP latent embedding matrix" | ||
|
|
||
| - versions: | ||
| type: file | ||
| pattern: "versions.yml" | ||
| description: "File containing software versions" | ||
|
|
||
| authors: | ||
| - "@your_github_username" | ||
| maintainers: | ||
| - "@your_github_username" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,69 @@ | ||
| #!/usr/bin/env python3 | ||
|
|
||
| import os | ||
| import platform | ||
| import yaml | ||
|
|
||
| os.environ["MPLCONFIGDIR"] = "./tmp/mpl" | ||
| os.environ["NUMBA_CACHE_DIR"] = "./tmp/numba" | ||
|
|
||
| import scarches as sca | ||
| import pandas as pd | ||
| import scanpy as sc | ||
| import torch | ||
|
|
||
| from threadpoolctl import threadpool_limits | ||
| threadpool_limits(int("${task.cpus}")) | ||
| torch.set_num_threads(int("${task.cpus}")) | ||
|
|
||
| adata = sc.read_h5ad("${h5ad}") | ||
|
|
||
| adata_processing = adata.copy() | ||
|
|
||
| if "${counts_layer}" != "X": | ||
| adata_processing.X = adata.layers["${counts_layer}"] | ||
|
|
||
| # Prior biological knowledge in form of gene programs | ||
| if "${reference_model}": | ||
| sca.utils.add_annotations(adata_processing, "${reference_model}", min_genes=12, clean=True) | ||
| else: | ||
| raise ValueError("Reference model is required for EXPIMAP. Please provide a path to the reference model.") | ||
|
|
||
| # Initialization of the model with the reference network | ||
| intr_cvae = sca.models.EXPIMAP( | ||
| adata=adata_processing, | ||
| condition_key="${batch_col}", | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Actually if you provide the The pipeline also has a |
||
| hidden_layer_sizes=[256, 256, 256], | ||
| recon_loss="nb" | ||
| ) | ||
|
|
||
| # Train the model | ||
| intr_cvae.train( | ||
| n_epochs=400, | ||
| alpha_epoch_anneal=100, | ||
| alpha=0.7, | ||
| alpha_kl=0.5, | ||
| use_early_stopping=True | ||
| ) | ||
|
|
||
| # Extract the interpretable latent representation | ||
| emb = intr_cvae.get_latent(only_active=True) | ||
| adata.obsm['X_emb'] = emb | ||
|
|
||
| adata.write_h5ad("${prefix}.h5ad") | ||
| df = pd.DataFrame(emb, index=adata.obs_names) | ||
| df.to_pickle("X_${prefix}.pkl") | ||
|
|
||
| # Versions | ||
| versions = { | ||
| "${task.process}": { | ||
| "python": platform.python_version(), | ||
| "scanpy": sc.__version__, | ||
| "pandas": pd.__version__, | ||
| "scarches": sca.__version__ | ||
| } | ||
| } | ||
|
|
||
| with open("versions.yml", "w") as f: | ||
| yaml.dump(versions, f) | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this is even relevant here, but might be interesting for you
If you remove python from the explicit (versioned) dependencies, then the python version might be inconsistent for users that use conda environments for running the pipeline. This is not really a problem, but if the python version is included in the version capture at the end of the script, then the tests start failing.
Thus, either pin the python version or remove python from the version capture