\documentclass[11pt]{article}
\usepackage[english]{babel}
\usepackage[letterpaper,top=1in,bottom=1in,left=1in,right=1in]{geometry}
\usepackage[colorlinks=true, allcolors=blue]{hyperref}
\usepackage[T1]{fontenc}
\usepackage{amsmath, graphicx, amssymb, amsthm, dsfont, tabu, comment, color, caption, xcolor, mdframed, listings, subfig}
\usepackage[htt]{hyphenat} % allow hyphens in typewriter font
\usepackage{xurl} % allow breaks at more symbols
\usepackage{url} % ensures \url and \path are available
\def\UrlBreaks{\do\/\do\_\do\.\do\-} % explicitly let / _ . – break
\lstdefinestyle{mypython}{
language=Python,
basicstyle=\ttfamily\footnotesize,
keywordstyle=\color{blue}\bfseries,
commentstyle=\color{gray}\itshape,
stringstyle=\color{orange},
showstringspaces=false,
numbers=left,
numberstyle=\tiny\color{gray},
stepnumber=1,
numbersep=5pt,
frame=single,
breaklines=true,
breakatwhitespace=true,
tabsize=4,
}
\title{Final Report for MERFISH Analysis Pipeline Project \\ Focus on Neighbor-Dependent Expression}
\author{Arian Djahed}
\date{May 4, 2025}
\begin{document}
\maketitle
\begin{abstract}
This document recounts the process by which my project for the semester came to fruition: a MERFISH analysis pipeline and tool suite, with a specific focus on the portion involving neighbor-dependent gene expression analysis. It first details the motivation and input behind the project before describing its overall structure (with an emphasis on my contributions). The pipeline incorporates preprocessing, cell analysis, microenvironment clustering, differential expression, and cell-cell interaction studies. My contribution focuses on integrating and analyzing neighbor-dependent gene expression patterns using the \texttt{CellNeighborEX} library. The algorithmic approach, the integration with the greater pipeline, and the implementation of wrapper functions in Python are discussed, along with visual aids that illustrate the concepts explored herein.
\end{abstract}
\section{Project Overview}
\subsection{Motivation and MERFISH Outline}
The project aims to analyze spatial transcriptomics data acquired through MERFISH and related methods. MERFISH (Multiplexed Error-Robust Fluorescence In Situ Hybridization) is a state-of-the-art spatial transcriptomics imaging technique that enables the identification and quantification of thousands of RNA molecules within tissue samples at single-molecule resolution. Our pipeline was built to analyze the outputs of models imaged with MERFISH technology \cite{bonsai_merfish_2025}. Figures~\ref{fig:merfish} and~\ref{fig:merfish2} illustrate the rudiments of MERFISH \cite{vizgen_merfish_tech}:
\begin{figure}[ht]
\centering
\begin{mdframed}[backgroundcolor=gray!50,linecolor=gray!50]
\includegraphics[width=\textwidth]{merfish-desktop.png}
\caption{The translation of optical barcodes into identified mRNA transcripts}
\label{fig:merfish}
\end{mdframed}
\end{figure}
\begin{figure}[ht]
\centering
\includegraphics[width=\textwidth]{MERFISH-Technology-OverView.png}
\caption{The sequential process by which MERFISH identifies transcripts}
\label{fig:merfish2}
\end{figure}
The main motivation behind this project was to address the lack of user-friendly resources for the relatively new MERFISH technology. Before MERFISH, there existed a less detailed technology called seqFISH (Sequential Fluorescence In Situ Hybridization) for which more resources existed; part of our job involved adapting some of those to MERFISH. In addition, since the computational tools that exist to analyze different facets of spatial transcriptomics data are so disparate, our project also aimed to bring them together for more convenient use.
This project was also initiated to allow members of the labs of Dr. Roberta Brambilla and Dr. Daniel J. Liebl, both of whom are part of the Miami Project to Cure Paralysis, to better analyze their MERFISH data.
\subsection{Input Structure}
Before we could analyze the images procured from MERFISH, they needed to be converted into a more usable form, which required cell segmentation. The base MERFISH data came in the form of a slide containing 9 \textit{Mus musculus} hippocampi, all with early-stage EAE (experimental autoimmune encephalomyelitis, the most widely accepted animal model of multiple sclerosis). Some of the samples had a conditional knockout (cko), meaning that a specific gene had been selectively knocked out in them. Each hippocampus was labeled XXXcko or XXXctrl, where XXX is the primary identifier and cko/ctrl indicates whether the sample is a knockout or a control.
Following the imaging, cell segmentation was performed on the hippocampal regions using the Cellpose deep learning model. The segmentation was run twice: once using 1 z-layer and once using 3 fused z-layers \cite{bonsai_merfish_2025}. Figure~\ref{fig:segmentation} illustrates what the hippocampal regions look like before and after cell segmentation:
\begin{figure}[ht]
\centering
\includegraphics[width=\textwidth]{segmentation.png}
\caption{The MERFISH-imaged hippocampus regions and their segmented reconstructions}
\label{fig:segmentation}
\end{figure}
For each hippocampal region, we also received the following files \cite{bonsai_merfish_2025}:
\begin{itemize}
\item \textbf{cell\_by\_gene.csv}: The abundance of transcripts for each gene in a cell
\item \textbf{cell\_metadata.csv}: The metadata for each cell, including Euclidean coordinates and structural information such as anisotropy, solidity, and p/a ratio
\item \textbf{detected\_transcripts.csv}: The metadata for each transcript, including Euclidean coordinates, the gene name, and the cell it belongs to
\item \textbf{sum\_signals.csv}: The fluorescence intensity measurements from the imaging (unimportant for analysis purposes)
\item \textbf{segmentation\_specification.json}: The specs for the Cellpose segmentation (unimportant for analysis purposes)
\end{itemize}
\subsection{Pipeline \& Tool Suite Structure}
Figure~\ref{fig:pipeline} illustrates the general parts involved, while Figure~\ref{fig:pipeline_in_detail} shows each step in further detail:
\begin{figure}[ht]
\centering
\includegraphics[width=\textwidth]{pipeline_overview.png}
\caption{Overview of our project pipeline \& tool suite structure}
\label{fig:pipeline}
\end{figure}
\begin{figure}[ht]
\centering
\includegraphics[width=\textwidth]{pipeline_in_detail.png}
\caption{The steps required for each part of the process therein}
\label{fig:pipeline_in_detail}
\end{figure}
The data processing pipeline is divided into the following stages:
\begin{itemize}
\item \textbf{Dataset Preprocessing:} Data cleaning, filtering low-expressed genes, and quality control.
\item \textbf{Cell Analysis:} Leiden clustering (using the \texttt{Scanpy} library), reference mapping (i.e. comparing clusters against the reference from the Mouse Brain Atlas), and neighborhood enrichment (i.e. calculating which cell types tend to cluster together more than expected by chance).
\item \textbf{Microenvironment Analysis:} Spatial alignment of samples (i.e. lining up all of the regions so that they can be properly compared) and clustering of local microenvironments (i.e. categorizing each region into its own microenvironment).
\end{itemize}
The subsequent suite of data analysis tools then consists of the following:
\begin{itemize}
\item \textbf{Differential Expression:} Identification of cluster-dependent expression (i.e. finding differentially expressed genes within a certain cell type or microenvironment) and spatially variable genes (i.e. computing Moran's I spatial autocorrelation to detect genes whose expression is spatially structured rather than randomly distributed, and returning the 10 most spatially clustered genes).
\item \textbf{Cell-Cell Interaction:} Analysis of cell-cell communication including node-centric expression (i.e. modeling latent intercellular communication), neighbor-dependent expression (elaborated on below), and spatial variance components (i.e. determining how different sources of variation contribute to spatial gene expression patterns).
\end{itemize}
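To make the spatially variable gene step more concrete, Moran's I for a single gene can be written as
\[
I = \frac{N}{W}\,\frac{\sum_{i}\sum_{j} w_{ij}\,(x_i-\bar{x})(x_j-\bar{x})}{\sum_{i}(x_i-\bar{x})^2},
\]
where $x_i$ is the expression of the gene in cell $i$, $w_{ij}$ is a spatial weight, $N$ is the number of cells, and $W=\sum_{i}\sum_{j} w_{ij}$. Listing~\ref{lst:sketch_morans} is a minimal sketch of this formula in plain Python. The binary $k$-nearest-neighbor weights, the function names \texttt{morans\_i} and \texttt{top\_spatial\_genes}, and the toy data are all assumptions made for illustration; the pipeline's actual implementation and weighting scheme may differ.

\begin{lstlisting}[style=mypython,
caption={Illustrative sketch of Moran's I with binary k-nearest-neighbor weights},
label={lst:sketch_morans}]
import math


def morans_i(values, coords, k=2):
    """Moran's I with w_ij = 1 if j is one of the k nearest cells to i.

    Assumes the gene is not constant (otherwise the denominator is zero).
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    num = 0.0      # sum over i, j of w_ij * dev_i * dev_j
    w_total = 0.0  # sum over i, j of w_ij
    for i in range(n):
        # Other cells, ordered from nearest to farthest from cell i
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(coords[i], coords[j]))
        for j in order[:k]:
            num += dev[i] * dev[j]
            w_total += 1.0
    denom = sum(d * d for d in dev)
    return (n / w_total) * (num / denom)


def top_spatial_genes(expression_by_gene, coords, n_top=10, k=2):
    """Rank genes by Moran's I, most spatially clustered first."""
    scores = {gene: morans_i(values, coords, k)
              for gene, values in expression_by_gene.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n_top]


# Toy data: six cells on a line and two made-up genes
coords = [(float(x), 0.0) for x in range(6)]
genes = {
    "clustered": [0, 0, 0, 10, 10, 10],
    "alternating": [0, 10, 0, 10, 0, 10],
}
ranking = top_spatial_genes(genes, coords)
\end{lstlisting}

On this toy example, the gene whose high values sit next to each other receives a positive score and the alternating gene a negative one, so the clustered gene is ranked first.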
\section{Neighbor-Dependent Expression}
\subsection{Introduction to the \texttt{CellNeighborEX} Library}
As previously mentioned, my portion of this project for this semester (and the part I am using for my CSC411 project implementation) involves handling the neighbor-dependent expression segment of the pipeline. For this, I was first tasked with investigating the Python library \texttt{CellNeighborEX}, as it suited our purposes well.
The \texttt{CellNeighborEX} library, as outlined in the article by Kim et al.~\cite{https://doi.org/10.15252/msb.202311670}, deciphers gene expression changes induced by direct cell-cell interactions. It identifies the immediate neighbors of each cell using spatial algorithms such as Delaunay triangulation and k-nearest neighbors, and categorizes them as homotypic (same cell type) or heterotypic (different cell type) neighbors. Differential expression is then analyzed by comparing transcriptomes between these categories, which associates certain genes with each of the cell types. Within the code of the library itself, this process can be divided into two main steps:
\begin{itemize}
\item \textbf{Data Preparation}: First, the cell data, which comes as a ``.h5ad'' file, is read into an \texttt{AnnData} object. This is then preprocessed to generate spatial connectivity matrices and neighbor dataframes, which are in turn combined into a single dataframe containing all the detected interactions between different cell types.
\item \textbf{Differential Expression Analysis}: The library then performs its differential expression analysis to identify genes with significant neighbor-dependent expression patterns among the various cell-to-cell interactions that were previously identified, along with their expression levels.
\end{itemize}
Figure~\ref{fig:cellneighborex} then illustrates the sequential workflow of the whole library:
\begin{figure}[ht]
\centering
\includegraphics[width=\textwidth]{cellneighborex.png}
\caption{An under-the-hood look at what \texttt{CellNeighborEX} does \cite{https://doi.org/10.15252/msb.202311670}}
\label{fig:cellneighborex}
\end{figure}
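The homotypic/heterotypic categorization described above can be sketched in a few lines of plain Python. Listing~\ref{lst:sketch_neighbors} is a deliberately simplified illustration, not the library's code: it assumes that each cell is labeled only by its single nearest neighbor, whereas \texttt{CellNeighborEX} works from Delaunay triangulation or k-nearest neighbors and applies its own statistical tests. The function names and the \texttt{TypeA+TypeB} pair label are hypothetical.

\begin{lstlisting}[style=mypython,
caption={Illustrative sketch of homotypic and heterotypic neighbor labeling},
label={lst:sketch_neighbors}]
import math


def label_neighbors(coords, cell_types):
    """Label each cell by the type of its single nearest neighbor.

    Returns one (kind, pair) tuple per cell, where kind is 'homotypic'
    or 'heterotypic' and pair is 'OwnType+NeighborType'.
    """
    labels = []
    for i, (xy, ctype) in enumerate(zip(coords, cell_types)):
        # Nearest other cell by Euclidean distance
        nearest = min((j for j in range(len(coords)) if j != i),
                      key=lambda j: math.dist(xy, coords[j]))
        same = cell_types[nearest] == ctype
        kind = "homotypic" if same else "heterotypic"
        labels.append((kind, f"{ctype}+{cell_types[nearest]}"))
    return labels


def mean_by_kind(expression, labels, cell_types, ctype):
    """Mean expression of one gene in ctype cells, split by neighbor kind."""
    groups = {"homotypic": [], "heterotypic": []}
    for value, (kind, _), this_type in zip(expression, labels, cell_types):
        if this_type == ctype:
            groups[kind].append(value)
    return {kind: (sum(vals) / len(vals) if vals else None)
            for kind, vals in groups.items()}


# Toy data: four cells on a line and one made-up gene
coords = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (5.5, 0.0)]
cell_types = ["Astro", "Astro", "Oligo", "Astro"]
expression = [1.0, 3.0, 7.0, 9.0]

labels = label_neighbors(coords, cell_types)
astro_means = mean_by_kind(expression, labels, cell_types, "Astro")
\end{lstlisting}

Comparing the two means for a given cell type is the intuition behind neighbor-dependent expression: a gene is of interest when cells of that type express it differently depending on whether their neighbor is of the same or a different type.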
\subsection{Implementation in the Pipeline}
After investigating \texttt{CellNeighborEX} further, I found that when it came to image-based spatial transcriptomic data, the creators of the library had only ever tested it with seqFISH, the aforementioned predecessor to MERFISH. Since our pipeline is being built entirely around MERFISH, I had to find a way to make \texttt{CellNeighborEX} work with MERFISH. In addition, I found that a multitude of preliminary steps had to be completed before any of the library's functions could be called to get the desired results. Because these steps could be confusing for someone with little to no background in coding, it became clear to me that encapsulating each step of the process outlined by the \texttt{CellNeighborEX} library within wrapper functions was a necessity.
\subsection{Results and Visualizations}
Once all the data are analyzed and processed, \texttt{CellNeighborEX} outputs CSV files detailing the differentially expressed genes along with their associated cell types, and generates heatmaps and volcano plots for visualization. The user can also get a spatial plot of a single gene and cell type, which overlays the locations of both the homotypic and heterotypic neighbors for that cell type along with the expression level of the gene in question at each point. %An example of what such a spatial plot would look like is given in Figure~\ref{fig:example_results}.
\section{Code Overview}
\subsection{Input Structure}
Before I could proceed with any of the other steps of my part, I had to make sure that my input was in the right format. Because my part, along with all the other components of the tool suite, succeeds the data processing pipeline, I first had to use the scripts from those parts of the project to ensure that our data was ready to be used with \texttt{CellNeighborEX}. Figure~\ref{fig:input_structure} illustrates the steps involved in this process.
\begin{figure}[ht]
\centering
\includegraphics[width=\textwidth]{input.png}
\caption{The process by which the initial project input became the input for my part}
\label{fig:input_structure}
\end{figure}
As can be seen from Figure~\ref{fig:input_structure}, the final state of the input is an \texttt{AnnData} object. \texttt{AnnData} objects are the core of the \texttt{AnnData} Python library, which was designed specifically to handle annotated data matrices for cell and gene data, keeping all of the data organized within one object \cite{Virshup2021.12.16.473007}. The aforementioned ``.h5ad'' file is simply this object in file form, similar to how a ``.csv'' file can be thought of as a \texttt{pandas} dataframe in file form.
As can also be seen in Figure~\ref{fig:input_structure}, the \texttt{AnnData} object comprises various components, each of which has a list of strings known as ``keys'' that can be used to refer to specific aspects of that component. The components are as follows \cite{Virshup2021.12.16.473007}:
\begin{itemize}
\item \textbf{X}: the active data matrix
\begin{itemize}
\item Alternative versions of the matrix are stored as layers, each of which can be referred to by a key
\end{itemize}
\item \textbf{obs}: cell metadata for cell data in X
\begin{itemize}
\item Each of its rows corresponds to a row/cell in X
\end{itemize}
\item \textbf{var}: gene metadata for gene data in X
\begin{itemize}
\item Each of its rows corresponds to a column/gene in X
\end{itemize}
\item \textbf{obsm}: multidimensional cell annotations
\begin{itemize}
\item In our case, it stores information obtained from clustering
\end{itemize}
\item \textbf{obsp}: annotations for cell-cell pairs
\end{itemize}
\subsection{Testing \texttt{CellNeighborEX} with MERFISH}
The creators of \texttt{CellNeighborEX} left their instructions for using the library with seqFISH data in a Jupyter notebook within the library's dedicated GitHub repository. So, in order to test this library with our MERFISH data, I created a similar Jupyter notebook of my own, mimicking the structure of the original so I could more easily follow the steps.
Since they used \texttt{SquidPy}'s sample seqFISH data, I decided to first use \texttt{SquidPy}'s sample MERFISH data (the spatial plot for which is pictured below in Figure~\ref{fig:example_results}) before using our own sample MERFISH data \cite{Palla_Spitzer_Klein_Fischer_Schaar_Kuemmerle_Rybakov_Ibarra_Holmberg_Virshup_et_al._2022}. Here is where I found some initial incompatibilities that needed to be resolved in order for the data to work. I ended up needing to make 3 separate Jupyter notebooks before I found how to set each parameter of each \texttt{CellNeighborEX} function in order to produce the desired results. Each notebook is situated within its own dedicated directory, as \texttt{CellNeighborEX} also creates files that its other functions need, and all those files must be located in the same directory in order for the code to ``find'' them. Figures~\ref{fig:first_try} and~\ref{fig:last_try} illustrate what our spatial plots initially looked like with our data and what they looked like when I finally got it to work, respectively.
\begin{figure}[htp]
\centering
\includegraphics[width=\textwidth]{example_results.png}
\caption{Sample spatial plot from \texttt{CellNeighborEX} with \texttt{SquidPy}'s sample seqFISH data \cite{Palla_Spitzer_Klein_Fischer_Schaar_Kuemmerle_Rybakov_Ibarra_Holmberg_Virshup_et_al._2022}}
\label{fig:example_results}
\end{figure}
\begin{figure}[htbp]
\centering
\includegraphics[width=0.975\textwidth]{output.png}
\caption{The spatial plot from the initial \texttt{CellNeighborEX} test with our MERFISH data}
\label{fig:first_try}
\includegraphics[width=0.975\textwidth]{output2.png}
\caption{The spatial plot from the final \texttt{CellNeighborEX} test with our MERFISH data}
\label{fig:last_try}
\end{figure}
\subsubsection{Obstacles}
As previously hinted, this portion of the project was not without its challenges and setbacks. They were as follows:
\begin{itemize}
\item \textbf{Installation issues}: the sheer size of this library and its dependencies meant that installation ended up being an unexpected obstacle, especially since I was initially working on a far older, far less powerful computer.
\item \textbf{Dependency issues}: because of the large number of oddly specific requirements for this library, any deviation in the user's environment renders it unusable. I found this out the hard way. As such, I had to keep making clean environments until I got one where the library worked as intended.
\item \textbf{Data output issues}: during testing, I kept encountering instances in which the library would not detect any neighbor-dependent genes and others in which the spatial plots would not render properly. Both were ultimately fixed by eliminating the log-normalization step (since we already do that when we create the h5ad file) and by using the \texttt{AnnData} library's layer feature to create a copy of the data without the layer that was negatively affecting the visualizations. Listings~\ref{lst:testfix1},~\ref{lst:testfix2}, and~\ref{lst:testfix3} display what I needed to do in order to fix this issue.
\end{itemize}
\begin{lstlisting}[style=mypython,
caption={‘raw\_counts’ layer isolation in v2\_prepare\_datasets.py},%
label={lst:testfix1},
firstnumber=70]
# Create an AnnData object (scanpy) with the expression matrix, cell metadata, and gene metadata
ad_viz = sc.AnnData(X=cell_by_gene.values, obs=meta_cell, var=meta_gene)
ad_viz.layers['raw_counts'] = ad_viz.X.copy()
ann_array = ad_viz.layers['raw_counts']
\end{lstlisting}
\begin{lstlisting}[style=mypython,
caption={Log-normalization and ‘log\_counts’ layer isolation in v2\_cluster\_cells.py},%
label={lst:testfix2},
firstnumber=28]
print('[1/4] Preprocessing: Normalizing, Log Transformation, Scaling, Neighborhood Analysis...')
input_adata = ad.read_h5ad(adata_path)
# Normalize the total counts per cell so they sum to a fixed total
sc.pp.normalize_total(input_adata)
# Apply log transformation to expression values
sc.pp.log1p(input_adata)
input_adata.layers["log_counts"] = input_adata.X.copy()
input_adata.raw = input_adata.copy()
\end{lstlisting}
\begin{lstlisting}[style=mypython,
caption={Making a copy of the data without the ‘raw\_counts’ layer in the test notebook},%
label={lst:testfix3},
firstnumber=1]
# Save the data into dataframes.
index_name = adata.obs.index.name
df_cell_id = pd.DataFrame(adata.obs.index)
df_cell_id = df_cell_id.rename(columns={index_name: 0})
var_name = adata.var.index.name
df_gene_name = pd.DataFrame(adata.var.index)
df_gene_name = df_gene_name.rename(columns={var_name: 0})
log_adata = adata.copy()
log_adata.X = adata.layers["log_counts"]
df_log_data = log_adata.to_df().T
df_log_data = df_log_data.reset_index(drop=True) # row indices are represented as numbers.
assert len(df_cell_id) == len(df_processed), "The number of cells in the dataframes do not match."
\end{lstlisting}
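To illustrate why a redundant log-normalization pass can hurt downstream results, Listing~\ref{lst:sketch_lognorm} reimplements the two operations in plain Python and applies them twice to one toy cell. This is only a sketch of one plausible effect (loss of contrast between genes), not a diagnosis of what happened inside \texttt{CellNeighborEX}. The helper names, the toy counts, and the fixed \texttt{target\_sum} of 1000 are assumptions; by default, \texttt{sc.pp.normalize\_total} instead scales to the median of the total counts per cell.

\begin{lstlisting}[style=mypython,
caption={Illustrative sketch of the effect of log-normalizing twice},
label={lst:sketch_lognorm}]
import math


def normalize_total(counts, target_sum):
    """Scale one cell's counts so that they sum to target_sum."""
    total = sum(counts)
    return [c * target_sum / total for c in counts]


def log1p_transform(values):
    """Apply log(1 + x) element-wise."""
    return [math.log1p(v) for v in values]


def nonzero_spread(values):
    """Gap between the largest and smallest nonzero value."""
    nonzero = [v for v in values if v > 0]
    return max(nonzero) - min(nonzero)


# Made-up raw transcript counts for one cell (four genes)
cell = [0, 5, 45, 50]

# Intended preprocessing: normalize, then log-transform, once
once = log1p_transform(normalize_total(cell, target_sum=1000))

# Accidental second pass over data that was already log-normalized
twice = log1p_transform(normalize_total(once, target_sum=1000))

spread_once = nonzero_spread(once)
spread_twice = nonzero_spread(twice)
\end{lstlisting}

After the second pass, the gap between the most and least expressed genes shrinks considerably, which is consistent with a differential expression test finding fewer significant differences on doubly transformed data.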
\subsection{Wrapper Functions}
The wrapper functions that I created in the \texttt{ccc\_neighbors.py} file consolidate the myriad steps required before calling each \texttt{CellNeighborEX} function whilst also expanding upon \texttt{CellNeighborEX}'s original functionality to suit our needs:
\subsubsection{\texttt{prepare\_for\_ccc}}
This function combines all the data preparation steps into one function with a single input: the original ``.h5ad'' file. Everything else is handled therein, without the user having to perform the extra steps manually. It returns both the \texttt{AnnData} object from the original input and the processed dataframe (as both are needed for the next step).
\begin{figure}[ht]
\centering
\includegraphics[width=\textwidth]{prepare_for_ccc.png}
\caption{Part of what the dataframe looks like before and after \texttt{prepare\_for\_ccc} is called}
\label{fig:prepare_for_ccc}
\end{figure}
\subsubsection{\texttt{neighbor\_dependent\_expression}}
This function combines all of the differential expression analysis steps into one function whose parameters are the \texttt{AnnData} object and the final processed \texttt{pandas} dataframe produced by the previous function. It returns the list of differentially expressed genes as a \texttt{pandas} dataframe (as pictured in Figure~\ref{fig:neighbor_dependent_expression}).
\begin{figure}[!ht]
\centering
\includegraphics[width=\textwidth]{neighbor_dependent_expression.png}
\caption{Head of the output dataframe from the DE analysis}
\label{fig:neighbor_dependent_expression}
\end{figure}
\subsubsection{\texttt{get\_full\_gene\_list}}
This function gets all of the genes detected by the \texttt{CellNeighborEX} library, regardless of whether they are differentially expressed, along with the cell types associated with them. It returns them as a \texttt{pandas} dataframe (as pictured in Figure~\ref{fig:get_full_gene_list}).
\begin{figure}[ht]
\centering
\includegraphics[width=\textwidth]{get_full_gene_list.png}
\caption{The consolidation of different files into the final dataframe for \texttt{get\_full\_gene\_list}}
\label{fig:get_full_gene_list}
\end{figure}
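The consolidation step behind \texttt{get\_full\_gene\_list} amounts to concatenating one table per cell-cell interaction into a single table. Listing~\ref{lst:sketch_concat} sketches that idea using only the Python standard library. The folder layout, the file names, the single \texttt{gene} column, and the helper name \texttt{concat\_gene\_tables} are all hypothetical; the actual wrapper works on the files written by \texttt{CellNeighborEX} and returns a \texttt{pandas} dataframe.

\begin{lstlisting}[style=mypython,
caption={Illustrative sketch of concatenating per-interaction CSV files},
label={lst:sketch_concat}]
import csv
import os
import tempfile


def concat_gene_tables(folder):
    """Concatenate every CSV in folder, tagging rows with the file stem."""
    combined = []
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".csv"):
            continue
        interaction = os.path.splitext(name)[0]
        with open(os.path.join(folder, name), newline="") as handle:
            for row in csv.DictReader(handle):
                combined.append({"interaction": interaction, **row})
    return combined


# Build two tiny, made-up per-interaction files to exercise the helper
folder = tempfile.mkdtemp()
toy_files = {
    "Astro+Oligo": ["GeneA", "GeneB"],
    "Oligo+Astro": ["GeneC"],
}
for stem, gene_names in toy_files.items():
    path = os.path.join(folder, f"{stem}.csv")
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["gene"])
        writer.writerows([gene] for gene in gene_names)

full_list = concat_gene_tables(folder)
\end{lstlisting}

Tagging each row with the interaction it came from keeps the combined table filterable by cell-type pair after concatenation.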
\subsubsection{\texttt{cell\_cell\_degs}}
This function filters the genes associated with a specific cell-cell interaction (as inputted by the user), with an option to show the corresponding volcano plots and heatmaps (an example of which is pictured in Figure~\ref{fig:example}). It returns a \texttt{pandas} dataframe containing said genes.
\begin{figure}[!ht]%
\centering
\subfloat[\centering Volcano plot]{{\includegraphics[width=0.375\textwidth]{Choroid-epithelial-cells+Oligodendrocytes_volcano.png} }}%
\subfloat[\centering Heatmap]{{\includegraphics[width=0.625\textwidth]{Choroid-epithelial-cells+Oligodendrocytes_heatmap.png} }}%
\caption{Sample plots from \texttt{cell\_cell\_degs}; the color indicates expression level}%
\label{fig:example}%
\end{figure}
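As a rough illustration of the filtering that \texttt{cell\_cell\_degs} performs, Listing~\ref{lst:sketch_filter} selects the rows of a toy table that belong to one interaction and pass significance cutoffs, and computes the $-\log_{10}(p)$ value that a volcano plot places on its vertical axis. The column names, cutoff values, gene names, and the helper \texttt{degs\_for\_interaction} are hypothetical and are not taken from the \texttt{CellNeighborEX} output; the false discovery rate cutoff is omitted for brevity.

\begin{lstlisting}[style=mypython,
caption={Illustrative sketch of filtering genes for one cell-cell interaction},
label={lst:sketch_filter}]
import math

# Made-up rows; the column names are illustrative only
rows = [
    {"gene": "GeneA", "interaction": "Astro+Oligo",
     "log_ratio": 1.2, "p_value": 1e-4},
    {"gene": "GeneB", "interaction": "Astro+Oligo",
     "log_ratio": 0.1, "p_value": 1e-6},
    {"gene": "GeneC", "interaction": "Astro+Oligo",
     "log_ratio": -0.9, "p_value": 0.2},
    {"gene": "GeneD", "interaction": "Oligo+Astro",
     "log_ratio": 2.0, "p_value": 1e-8},
    {"gene": "GeneE", "interaction": "Astro+Oligo",
     "log_ratio": -1.5, "p_value": 1e-3},
]


def degs_for_interaction(rows, interaction, max_p=0.01,
                         min_abs_log_ratio=0.4):
    """Keep rows for one interaction that pass both cutoffs.

    Each kept row gains a 'neg_log10_p' entry, the y-coordinate of a
    volcano plot (the x-coordinate is the log ratio itself).
    """
    hits = []
    for row in rows:
        if row["interaction"] != interaction:
            continue
        significant = row["p_value"] <= max_p
        large_effect = abs(row["log_ratio"]) >= min_abs_log_ratio
        if significant and large_effect:
            hits.append({**row,
                         "neg_log10_p": -math.log10(row["p_value"])})
    # Most significant genes first
    return sorted(hits, key=lambda r: r["p_value"])


astro_oligo_degs = degs_for_interaction(rows, "Astro+Oligo")
\end{lstlisting}

Genes that are significant but barely change, or that change a lot without being significant, are both excluded, which is why a volcano plot highlights only the points in its upper-left and upper-right regions.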
\subsubsection{\texttt{is\_gene\_degs}}
This function gets all the cell-cell interactions associated with a specific gene (as inputted by the user), with an option to generate the spatial plots for each cell type. (The spatial plots will look similar to how they did in Figure~\ref{fig:last_try}.)
\subsubsection{Obstacles}
As before, there were some unexpected hurdles in this part that I was fortunately able to overcome, but not without great effort. They are as follows:
\begin{itemize}
\item \textbf{File-type issues}: the \texttt{CellNeighborEX} library saves all of its visual output as PDF files, which are not as easy to display through Python code. I attempted to change the \texttt{CellNeighborEX} source code so that the plots would be saved as image files instead, but that resulted in low-resolution renders when I tried displaying them again through my wrapper code. So, I had to use a library called \texttt{pdf2image} in order to get the original PDF files to display properly, which resulted in a slew of errors as I slowly learned how that library worked with all of its intricacies. (The final fix is shown below in Listing~\ref{lst:codefix1}; this code can be found in \texttt{ccc\_neighbor.py}.)
\begin{lstlisting}[style=mypython,
caption={The proper way to extract an image from a pdf to redisplay it},%
label={lst:codefix1},
firstnumber=276]
heatmap_path = os.path.join(path_deg, f'{celltype}/{celltype}_heatmap.pdf')
volcano_path = os.path.join(path_deg, f'{celltype}/{celltype}_volcano.pdf')
heatmap = pdf2image.convert_from_path(heatmap_path)[0]
volcano = pdf2image.convert_from_path(volcano_path)[0]
\end{lstlisting}
\item \textbf{Strange image croppings}: when attempting to display the images, I found that they were cropped oddly, with some having a lot of extra whitespace and others having parts cut off. This was because of a missing parameter in the \texttt{fig.savefig} call in the source code (\texttt{bbox\_inches='tight'}), which fixed the issue once added (shown below in Listings~\ref{lst:codefix2} and~\ref{lst:codefix3}; the code for both can be found in \texttt{DEanalysis.py}).
\begin{lstlisting}[style=mypython,
caption={Saving the heatmap figure to PDF without weird croppings},%
label={lst:codefix2},
firstnumber=857]
# Save the heatmap plot as a PDF file
fig.savefig(f"{folderName2}/{filename_heatmap}_heatmap.pdf", dpi='figure', bbox_inches='tight')
\end{lstlisting}
\begin{lstlisting}[style=mypython,
caption={Saving the volcano figure to PDF without weird croppings},%
label={lst:codefix3},
firstnumber=994]
# Save the volcano plot as a PDF file
fig.savefig(f"{folderName2}/{filename_volcano}_volcano.pdf", dpi='figure', bbox_inches='tight')
\end{lstlisting}
\item \textbf{Spatial plots not plotting}: as in the previous phase, the spatial plots would plot only the background points and not the colored points. However, the root cause ended up being different. This time, it was because I was keeping the processed \texttt{pandas} dataframe only as a variable within the code rather than also saving it as a csv file, since I did not want to save any files that were not strictly needed. It turned out that this variable was being altered at some point when a different function was called. I therefore needed to keep the processed dataframe as both a variable and a csv file so that, when the time came to call the function that made the spatial plots, I could build a new object from the csv file and get the clean dataframe as it was when the \texttt{neighbor\_dependent\_expression} function was first called. This finally made the spatial plots render properly. (The code for this, which can be found in \texttt{ccc\_neighbor.py}, is shown below in Listing~\ref{lst:codefix4}.)
\begin{lstlisting}[style=mypython,
caption={Redefining the processed dataframe from the saved file},%
label={lst:codefix4},
firstnumber=330]
# get the path for the plots and for the dataframe
rootpath = os.getcwd()
dataframe_path = os.path.join(rootpath, 'neighbor_info/df_processed.csv')
dfp = pd.read_csv(dataframe_path)
\end{lstlisting}
\item \textbf{Redundancies in functionality}: originally, our project manager wanted the \texttt{neighbor\_dependent\_expression} function to both perform the differential expression analysis and output a \texttt{pandas} dataframe containing all of the detected neighbor-dependent genes, regardless of whether they are differentially expressed. He then wanted another function (originally called \texttt{most\_deg\_neighbor\_dependent}) that filtered out certain genes based on set p-value, log-ratio, and false discovery rate cutoffs. However, I found that this was redundant, since the \texttt{CellNeighborEX} function that performs the differential expression analysis already applies these cutoffs, and it does not work if the parameters are set to values that would not filter out any genes. In addition, that same function also writes a csv file for each cell-cell interaction combination that lists the genes expressed for it, regardless of whether they were differentially expressed. So, I instead had all of the functionality of \texttt{most\_deg\_neighbor\_dependent} be handled by \texttt{neighbor\_dependent\_expression} and then made a new function called \texttt{get\_full\_gene\_list} that concatenates all of the aforementioned csv files to get the full list of genes detected.
\end{itemize}
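The dataframe problem described in the list above is an instance of a general hazard: a function that edits its argument in place silently changes the caller's copy. Listing~\ref{lst:sketch_snapshot} reproduces the hazard and the remedy with standard-library tools only. The table contents, the helper names, and the \texttt{careless\_analysis} stand-in are hypothetical; the actual wrapper works with \texttt{pandas} dataframes and the files written by \texttt{CellNeighborEX}.

\begin{lstlisting}[style=mypython,
caption={Illustrative sketch of snapshotting a table before an in-place edit},
label={lst:sketch_snapshot}]
import csv
import os
import tempfile


def save_table(rows, path):
    """Write a list of dict rows to a CSV file."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)


def load_table(path):
    """Read a CSV file back into a list of dict rows."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def careless_analysis(rows):
    """Stand-in for a call that edits its input in place."""
    for row in rows:
        row.pop("celltype", None)  # drops a column the plot needs


table = [
    {"barcode": "c1", "celltype": "Astro"},
    {"barcode": "c2", "celltype": "Oligo"},
]
path = os.path.join(tempfile.mkdtemp(), "df_processed.csv")

save_table(table, path)    # snapshot before any downstream call
careless_analysis(table)   # the in-memory table is now altered
clean = load_table(path)   # reload the untouched snapshot
\end{lstlisting}

Passing a copy of the dataframe to the downstream call (for example with \texttt{DataFrame.copy()}) would be an in-memory alternative, at the cost of holding two copies of the data at once.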
\section{Conclusion}
In spite of the unexpected hurdles that I encountered throughout the course of the semester, my team and I were able to complete the entire project in time for the end of the semester. Now, the research labs in the Miller School of Medicine for which we designed this project will have a comprehensive and easy-to-use toolset to analyze their terabytes of spatial transcriptomics data. With this in hand, they can move forward with their research.
As for myself, I found this project to be an enriching experience that bolstered my ability to integrate scientific concepts from outside of computing into a computational context. It also provided a valuable lesson in working with others on a computational project whose constituent parts are split between group members (which in turn means that everyone has to make sure that the individual pieces fit together). In spite of some minor communication issues, we were still able to collaborate effectively, due in part to our weekly meetings and our prior familiarity with one another. I believe that this was a contributing factor to our ultimate success.
\bibliographystyle{plain}
\bibliography{csp_19_}
\end{document}