introdution

(take from abstract)

Find the preprint HERE
Find the data zenodo HERE (dump of /gpfs/gibbs/pi/reilly/tabula_data)
Find the github archival zenodo HERE

Skip down to "install" to get started!

This repository is organzied as follows:


├── docs					# documentation
├── notebooks				# jupyter notebooks
│   ├── object_creation		# creates objects from scMPRA data
│   ├── results				# figures and numerical results
│   └── preprocessing		# preprocesses emperical scMPRA data
└── scMPRAforge				# the package proper
    ├── vingettes			# notebooks demonstrating the use of the package
    ├── presets				# presets to be distributed with the package
    └── tests				# tests

"scMPRAforge" contains the actual pacakge

"Preprocessing" processes emperical data for "object_creation".

"object_creation" creates and saves several python objects (ortho, scMPRA_data, Bounds and similar) useful in multiple analyses run in "results". Intermediate values are not distributed. Simulated datasets & processed emperical datasets are distributed separately, as .scmpra files on a data zenodo LINK. Bounds objects are stored in presets directory in main package.

"results" contains notebooks generating results in the paper. This includes figure panels and numbers.

key assumptions

(move to paper)

Though ideally we would model each barcode (like the approach in MPRAmodel) for greater statistical power, the sparse nature of scMPRA data means that we won't be able to do this. Not that the barcodes are retained, just not modeled separaately.
Removal of "false zeroes" through clonotype analysis and transfection reporters is part of pre-processing.
We are interested in changes in CRE activity within and between cell-types

install

(temporary : replace with pip install scMPRAforge once we have actually )

conda create -n env_tensorzinb python=3.10
conda activate env_tensorzinb
echo "numpy <2" >> $CONDA_PREFIX/conda-meta/pinned
#navigate to tensor zinb
#install apple silicon deps as below if necessary
python setup.py install

For apple silicon, may need

conda install -c apple tensorflow-deps
python -m pip install tensorflow-macos==2.9.2
python -m pip install tensorflow-metal==0.5.1

conda install statsmodels

Additional required packages

conda install matplotlib seaborn bioconda::umi_tools formulaic dask dask-jobqueue

may need to install tensorflow keras if you didn't above

At this point, it may work fine. However some environments (such as mccleary gpu nodes) may require some more fnagling to get GPU accel working. Perform the following:

conda install cudatoolkit=11.2 cudnn=8.1.0
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/' > $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh

latter not necessary on bouchet

then deactivate, activate, and test with

python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

If you see a device, you're good to go.

(adapted from mccleary docs)

supplementary conda environments

The above is sufficient for you to use the package, but recapitulating some of the preprocessing steps & results from the paper require other conda environments. Fill in

biopython (used for some data carpentry and single cell processing steps)
parse_scanpy (used for scanpy : particular scanpy version particularly important to get same leiden clustering)
speedracer (used for some rapid text-processing steps)
doc (used to produce documentation)

Data formats

This section describes the various standard tabular formats used internally.

MPRA data formatting

Note that most of our code does not require that any of the barcode sequences should be actual nucleotide sequences. So you can replace them with, for example, numerical combinatorial barcoding sub-barcode ID strings with no consequence. The exception is barcode-deduplication.

(Collapsing cells, replicates, and MPRA barcodes is not meaningful for us, since this would collapse what we believe to be true biological samples. However we may frequently summarize (e.g. compute mean UMIs per cell, or model a distribution where all UMI count originating from one CRE in one cell-type (regardless of MPRA barcode) are considered to have come from one triplicate of zinb parameters).

In memory, dataframes use named columns with dummy row-indicies.

scMPRA data can be umi-wise or read-wise.

UMI = unique molecular identifier

Read-wise

Column name	Type	Description	Mandatory?
cell_bc	str (nucleotide)	cell barcode	T
rep_id	str	replicate id	T
cre_id	str	CRE id or name	T
cell_type	str	cell-type	T
mpra_bc	str (nucleotide)	MPRA reporter barcode	T
mpra_umi	str (nucleotide)	MPRA umi	T
reads_mpra	int	number mpra reads	T
transfection_bc	str (nucleotide)	transfection barcode	F
transfection_umi	str (nucleotide)	transfection reporter umi	F
reads_transfection_bc	int	number transfection reporter reads	F
reads_DNA	int	DNA library reads (for MPRA barcode)	F

Umi-wise

Column name	Type	Description	Mandatory?
cell_bc	str (nucleotide)	cell barcode	~T
rep_id	str	replicate id	T
cre_id	str	CRE id or name	T
cell_type	str	cell-type	T
mpra_bc	str (nucleotide)	MPRA reporter barcode	T
umis_mpra_bc	int	number of MPRA UMIs	T
reads_mpra_bc	int	number of MPRA reads, summed across all UMIs	F
transfection_bc	str (nucleotide)	transfection barcode	F
umis_transfection_bc	int	number transfection reporter UMIs	F
reads_transfection_bc	int	number transfection reporter reads, summed across all UMIs	F
reads_DNA	int	DNA library reads (for MPRA barcode)	F

(Though MPRA barcodes cannot be modeled individually, its undersierable to sum across to remove them : since this would inflate strength estimates of CREs with more MPRA barcodes)

Simulated UMI-wise data doesn't currently require cell barcode.

Note that DNA is treated as constant for a given CRE : it's never reduced on account of certain MPRA barcodes being present in certain cells. This keeps it as a totally exogenous source of information, far from the vicissitudes of single-cell sequencing...

Also note that all these strings are really factors / categorical data, and will be treated as such.

Hypothesis formatting

A hypothesis set is a table with the following format:

Column name	Type	Description	Mandatory?
comparison_CRE	str	CRE id or name	T
comparison_cell_type	str	cell-type	T
reference_CRE	str	CRE id or name	F
reference_cell_type	str	cell-type	F
meta	str	metadata	F

If no reference CRE is provided, the package will assume that we are comparing to zero (looking for any activity at all).
A row that contains only one of reference_CRE, reference_cell_type is considered malformed.
For assessing variant effects, we recommend the convention that reference columns refer to the reference allele, and comparison columns refer to the alternate allele.
The metadata column can also be used to paint plots, so a categorical like "negative_control", "positive_control", "emvar", "putative-brain-specific" or similar would work well. No strict requirements: modify names as suits your experimental design.

A result table is the same as a hypothesis table with the following additional columns:

Column name	Type	Description	Mandatory?
test_type	str	which test was performed	T
test_statistic	float	the test-statistic for that test	T
p_value	float	type 1 error probability	T
fold_change	float	between ref and comparison	T
bh_p	float	benjamini hochberg corrected p-value	T
flattened	bool	whether the CRE was flattened due to insufficient UMIs	T

NOTE: for tiny pvalues where floating point sizes require rounding p, the exact value of this epsilon will depend on the computer hardware/architecture that you are operating on.
```
 eps = np.finfo(float).tiny  # ~1e-308
 p = max(1.0 - chi2.cdf(z*z, 1), eps)
```

Ground-truth formatting

These tables store made-up ground truth, for the purposes of simulation.

Column name	Type	Description
cell_type	str	cell-type
cre_id	str	CRE id or name
true_mean	float	Ground-truth # MPRA barcode UMIs/cell

MPRA library table

Column name	Type	Description	Mandatory?
cre_id	string	CRE id or name	T
mpra_bc	string (nucleotide)	MPRA reporter barcode	T
abundance	float	Relative abundance in DNA library	T

Abundance column must sum to 1. It can be defined in different ways, but in most cases will be (reads MPRA barcode)/(total MPRA barcode reads)

Name		Name	Last commit message	Last commit date
Latest commit History 968 Commits
docs		docs
notebooks		notebooks
scMPRAforge		scMPRAforge
.gitignore		.gitignore
README.md		README.md
pydoc-markdown.yml		pydoc-markdown.yml
workbench.ipynb		workbench.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

introdution

key assumptions

install

supplementary conda environments

Data formats

MPRA data formatting

Hypothesis formatting

Ground-truth formatting

MPRA library table

About

Uh oh!

Releases

Packages

Contributors 3

Uh oh!

Languages

Reilly-Lab-Yale/tabula-rasa

Folders and files

Latest commit

History

Repository files navigation

introdution

key assumptions

install

supplementary conda environments

Data formats

MPRA data formatting

Hypothesis formatting

Ground-truth formatting

MPRA library table

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Uh oh!

Languages

Packages