---
title: "ProEnd Pipeline 1.0"
output: github_document
bibliography: References.bib
link-citations: true
csl: https://raw.githubusercontent.com/citation-style-language/styles/master/biomed-central.csl
---
```{r, include=FALSE}
#options(scipen=999) #non scientific ; options(scipen=0) # scientific notation
#BiocManager::install(c("biomaRt","org.At.tair.db","enrichplot","ggtree","GOSemSim","clusterProfiler","ggnewscale"))
#remotes::install_github('YuLab-SMU/ggtree')
library("biomaRt")
library(org.At.tair.db)
library(clusterProfiler)
library(GOSemSim)
library("ggnewscale")
library(enrichplot)
library(ggplot2)
```

# ProEnd Scripts

This project contains two Bash scripts for handling and analyzing multiple protein sequences. The scripts streamline the extraction and identification of C-terminal HbYX motifs from a given FASTA file. A third script shows how to download the proteomes from UniProt.

If you have just one protein sequence, or your file already has one protein sequence per line, go directly to script 2.

## Scripts

### 1. HbYX-First.bash

This script reformats a multi-entry FASTA file so that each entry's sequence occupies a single line, preparing it for further analysis. Use this script first if you have multiple sequences, a proteome, or alignments.

#### Usage

Ensure you have the necessary FASTA file in the same directory or specify the path to the file. Execute the script by running:

`./HbYX-First.bash`

or

`bash HbYX-First.bash`

This will output a file named `multifasta.ol.fa`, containing all the sequences from the original FASTA file, formatted for further processing.
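
As a rough illustration of what this linearization step does, the sketch below joins wrapped sequence lines with AWK. It is not the code of `HbYX-First.bash`; the function name `linearize_fasta` and the input name `proteins.fa` are hypothetical.

```bash
# Hypothetical helper (not part of ProEnd): rewrite a FASTA stream so that
# every record is one header line followed by one sequence line.
linearize_fasta() {
  awk '
    # New header: flush the sequence collected so far, then print the header
    /^>/ { if (seq != "") print seq; print; seq = ""; next }
    # Sequence line: drop spaces, tabs and carriage returns, then append
         { gsub(/[ \t\r]/, ""); seq = seq $0 }
    # End of input: flush the last sequence
    END  { if (seq != "") print seq }
  ' "$@"
}
```

It can read a file (`linearize_fasta proteins.fa > multifasta.ol.fa`) or standard input, which makes it usable after `zcat` for compressed downloads.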

### 2. HbYX-Second-Script.bash

This script searches for a specific HbYX-motif at the C-terminus of the protein sequences in the `multifasta.ol.fa` file created by the first script.

#### Usage

Run the script using:

`./HbYX-Second-Script.bash`

or

`bash HbYX-Second-Script.bash`

It will produce a file named `2aa.txt`, containing the motifs found along with the corresponding header from the FASTA file, if available.
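
The kind of search this step performs can be sketched as follows, assuming HbYX means a hydrophobic residue, then tyrosine, then any residue as the last three positions of the sequence. This is not the code of `HbYX-Second-Script.bash`: the function name `find_hbyx`, the `HB_SET` variable, and the default hydrophobic set are assumptions made for illustration.

```bash
# Hypothetical helper (not part of ProEnd): print every sequence whose last
# three residues match Hb-Y-X, preceded by its header when one is available.
# Input must already be linearized (one sequence per line).
# HB_SET is an assumed, adjustable set of hydrophobic residues.
find_hbyx() {
  awk -v hb="${HB_SET:-AVILMFWY}" '
    # Build the C-terminal pattern once: [hydrophobic] Y [any residue] end-of-line
    BEGIN { pattern = "[" hb "]Y[A-Z]$" }
    # Remember the most recent header
    /^>/  { header = $0; next }
    # Report the sequence if its C-terminus matches
    toupper($0) ~ pattern { if (header != "") print header; print }
  ' "$@"
}
```

For example, `find_hbyx multifasta.ol.fa > 2aa.txt` would write the matching records, and `HB_SET=AVILMF find_hbyx multifasta.ol.fa` would narrow the hydrophobic set.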

### Testing ProEnd with the *Arabidopsis* proteome

Download the reviewed (Swiss-Prot) Arabidopsis [proteome from UniProt](https://www.uniprot.org/uniprotkb?query=arabidopsis&facets=reviewed%3Atrue).

#### 1. Fasta linearization

```{bash, eval=TRUE}
zcat Arabidopsis_uniprot_proteome/uniprotkb_arabidopsis_AND_reviewed_true_2024_10_08.fasta.gz | head -n 2
```

```{bash, eval=TRUE}
bash HbYX-First.awk Arabidopsis_uniprot_proteome/uniprotkb_arabidopsis_AND_reviewed_true_2024_10_08.fasta.gz Arabidopsis_uniprot_proteome/arabidopsis_uniprot_proteome.ol.fa
```

```{bash, eval=TRUE, include=FALSE}
head -n 2 Arabidopsis_uniprot_proteome/arabidopsis_uniprot_proteome.ol.fa
#Removing non Arabidopsis sequences
grep -A 1 --no-group-separator "OS=Arabidopsis thaliana"  Arabidopsis_uniprot_proteome/arabidopsis_uniprot_proteome.ol.fa >  Arabidopsis_uniprot_proteome/arabidopsis_uniprot_proteome.ol_clean.fa
```

#### 2. HbYX motif prediction
```{bash, eval=TRUE}
bash HbYX-Second-Script.awk  Arabidopsis_uniprot_proteome/arabidopsis_uniprot_proteome.ol_clean.fa  Arabidopsis_uniprot_proteome/arabidopsis_HbyX_proteome.txt
```

Total number of HbYX motif candidates:
```{bash, eval=TRUE}
grep -c ">" Arabidopsis_uniprot_proteome/arabidopsis_HbyX_proteome.txt
```

#### 3. Exploring conserved HbYX proteasome regulatory proteins

```{bash, eval=TRUE}
grep --no-group-separator -A 1 -i "proteasome"  Arabidopsis_uniprot_proteome/arabidopsis_HbyX_proteome.txt
```

#### 4. Expanding to new regulatory candidates

Mapping the UniProt accessions of the HbYX candidates to Arabidopsis gene IDs using biomaRt

```{bash, eval=FALSE, include=FALSE}
grep ">"  Arabidopsis_uniprot_proteome/arabidopsis_HbyX_proteome.txt | awk -F'|' '{print $2}' > Arabidopsis_uniprot_proteome/uniprot_ids.txt
```
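
To make the accession extraction concrete, here is a small sketch on a made-up header that follows the UniProt FASTA layout (`>db|ACCESSION|ENTRY_NAME description`). The function name `extract_accessions` and the example record are hypothetical; the field split on `|` is the same idea as the command above.

```bash
# Hypothetical helper: print the accession (second "|"-separated field)
# of every FASTA header line read from the given files or from stdin.
extract_accessions() {
  awk -F'|' '/^>/ { print $2 }' "$@"
}

# Made-up record used only to show the header layout
printf '>sp|X00001|EXMPL_ARATH Example protein OS=Arabidopsis thaliana\nMSSFYA\n' |
  extract_accessions
# prints: X00001
```

Anchoring the match with `/^>/` restricts it to header lines, so a stray `>` elsewhere in the file is not picked up.
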
Retrieving Entrez gene IDs and running the GO term enrichment analysis

```{r,  include=TRUE}
# One UniProt accession per line, read into a single column (V1)
uniprot_ids <- read.csv("Arabidopsis_uniprot_proteome/uniprot_ids.txt", header = FALSE)
ensembl <- useMart("plants_mart", dataset = "athaliana_eg_gene", host = "https://plants.ensembl.org")
# Query Ensembl Plants with the accessions as a plain vector
mapping <- getBM(
  attributes = c("uniprotswissprot", "entrezgene_id"), # "tair_locus"),
  filters = "uniprotswissprot",
  values = uniprot_ids$V1,
  mart = ensembl
)

all_uniprot_df <- as.data.frame(uniprot_ids)
colnames(all_uniprot_df) <- c("uniprotswissprot")

# Left join keeps every UniProt ID; unmapped ones get NA in entrezgene_id
uniprot2entrez <- merge(all_uniprot_df, mapping, by = "uniprotswissprot", all.x = TRUE)
# Count mapped accessions, not mapping rows, because one accession can map to several genes
n_total <- nrow(uniprot_ids)
n_mapped <- length(unique(mapping$uniprotswissprot[!is.na(mapping$entrezgene_id)]))
print(paste0("Uniprot ids: ", n_total, " Mapped ids: ", n_mapped, " Non mapped ids: ", n_total - n_mapped))
#Non mapped IDs
#uniprot2entrez[is.na(uniprot2entrez$entrezgene_id),]

# enrichGO expects a character vector of gene IDs without NAs or duplicates
ego <- enrichGO(gene          = as.character(unique(na.omit(uniprot2entrez$entrezgene_id))),
                OrgDb         = org.At.tair.db,
                ont           = "BP", #"MF"
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.01,
                qvalueCutoff  = 0.05,
                readable      = TRUE) #library(clusterProfiler)

# Full result table (all tested terms), stored in the result slot
Whole_table <- ego@result
write.csv(Whole_table, "Arabidopsis_uniprot_proteome/GO_HbYX_Arabidopsis.csv")
d <- godata('org.At.tair.db', ont="BP") #library(GOSemSim)

# Pairwise semantic similarity between enriched terms, needed by treeplot()
ego2 <- pairwise_termsim(ego, method="Wang", semData = d) #library(enrichplot)
```
```{r,  include=FALSE}
gos_clustered <-  treeplot(ego2 )

ggsave(filename = "Arabidopsis_uniprot_proteome/ara_go_BP_tree.svg", plot = gos_clustered,device = "svg", width=13, height=7)
#ggsave(filename = "Arabidopsis_uniprot_proteome/ara_go_MF_tree.svg", plot = gos_clustered,device = "svg", width=15, height=7)
row.names(Whole_table) <- NULL
```
<div style="text-align: center;">
<figure>
<img src="Arabidopsis_uniprot_proteome/ara_go_BP_tree.svg"
style="width: 100%;height: auto">
<figcaption style="margin-top: 10px;"><strong>GO term enrichment of Arabidopsis HbYX-containing proteins</strong></figcaption>
</figure>
<a name="GO_terms_HbYX"></a>
</div>

#### 5. HbYX protein CDC48A as a potential proteasome regulator

One proposed, though controversial, candidate for 20S proteasome regulation is the CDC48 gene family, which is significantly enriched across several Gene Ontology categories, as demonstrated in:
<!-- CDC48A (At3g09840) from Arabidopsis
#head(Whole_table[grepl("proteasome",Whole_table$Description),c(2,3,8)])-->
```{r, eval=TRUE}
head(Whole_table[grepl("CDC48",Whole_table$geneID),c(2,3,8)])
```

The CDC48 complex is a critical component in *Arabidopsis thaliana* that mediates ubiquitin-dependent degradation of intra-chloroplast proteins and regulates substrates such as RbcL and AtpB via the proteasome pathway in response to oxidative stress [@li2022cdc48]. The presence of an HbYX motif in this protein, a motif that facilitates substrate recognition and processing in other AAA+ ATPases [@salcedo2024proend], may enable CDC48 to bind directly to the 20S proteasome, influencing protein homeostasis in plants.

Although the structure of the CDC48a homohexameric complex has already been solved by X-ray diffraction (PDB 6HD3) [@banchenko2019common], the crystallization process excluded the terminal residues, so [it lacks the HbYX tails in the structure](https://www.rcsb.org/structure/6HD3). Given its potential, we decided to proceed with an in-silico reconstruction of the whole CDC48a complex. CDC48 contains a well-characterized AAA+ domain, which is typically associated with the formation of a homo-oligomer, most commonly a hexamer in proteins of this type. For this reconstruction, AlphaFold V2 Multimer or AlphaFold V3 can be used; the latter allows the inclusion of ligands such as ATP [@abramson2024accurate].

<div style="display: flex; justify-content: center; align-items: center; flex-direction: column;">
<figure style="text-align: center;">
<img src="Arabidopsis_uniprot_proteome/CDC-48.png"
style="width: 100%;height: auto">
<figcaption style="margin-top: 10px;">
<strong>CDC-48 Hexamer prediction with and without ATP. HbYX motif in orange
</strong>
</figcaption>
</figure>
<a name="HbYX"></a>
</div>

As expected, the CDC-48 homohexamer predictions show pTM values above the 0.5 threshold: 0.61 in the absence of ATP and 0.55 in its presence. A pTM score above 0.5 indicates that the predicted overall structure of the complex is likely to resemble the true native structure. Interestingly, the HbYX motif shifts from the outer surface of the structure to the inner portion, a conformational change commonly seen upon substrate engagement in substrate-processing ATPases that interact with the 20S proteasome. This change facilitates interaction with the 20S proteasome to open the gate for substrate entry.

#### 6. Computational reconstruction of the potential interaction between CDC48 and the 20S proteasome

As shown for the archaeal AAA ATPase PAN-2 "MJ1494" [@salcedo2024proend], we proceeded with a molecular docking of the CDC48 complex with the 20S proteasome. ChimeraX or any preferred molecular docking tool can be used for this step.
<div style="text-align: center;">

<figure>
<img src="Arabidopsis_uniprot_proteome/CDC48-20S.png"
style="width: 100%;height: auto">
<figcaption style="margin-top: 10px;">
<strong>Potential CDC48-20S complex formation. HbYX motif in orange. 20S in gray
</strong>
</figcaption>
</figure>
<a name="_HbYX"></a>
</div>

Upon docking, the HbYX motif of CDC48, in the presence of ATP, is positioned optimally for interaction with the alpha pockets of the 20S proteasome, potentially facilitating gate opening and activation.

### 3. Conclusion and Future Potential

This example demonstrates how to generate and explore hypotheses using our tool, ProEnd. In this case, we focused on HbYX-containing proteins from *Arabidopsis thaliana*, one of the most extensively studied model organisms in biology. Using ProEnd, we successfully identified known HbYX-containing proteins, including the 19S-26S regulatory proteins, and discovered enriched candidates (CDC48/p97) with potential for novel interactions with the 20S proteasome, expanding our understanding of proteasome biology and proteostasis.

The formation of the CDC48-20S complex remains a topic of debate. Some researchers argue that the 19S-20S complexes, also referred to as 26S proteasomes, represent the vast majority of proteasomes in the cell. However, various cellular contexts demand alternative proteasome configurations. For instance, there are well-documented cases of 20S complexes interacting with other molecules, such as PA28, PA200, and PI31. CDC48 is particularly relevant in unique cellular environments that require the degradation of tightly folded substrates. These substrates, after unfolding, may directly engage the 20S proteasome without the need for 19S caps, providing a distinct scenario for CDC48-20S complex formation, potentially in the ER or chloroplast.

The conservation of the HbYX motif in CDC48 across different kingdoms (from archaea to eukaryotes) suggests a conserved mechanism for direct degradation through the 20S proteasome without intermediaries, further supporting the idea that CDC48 plays a crucial role in specific proteolytic pathways.

## Requirements

- Unix-like environment
- AWK installed

## Installation

No installation is required. Simply clone this repository or download the scripts to your local machine.

## Data Folder HbYX_data_tables

This folder contains the result data files for the ProEnd Scripts project.

## Cite

This code can now be cited through our BMC Genomics article [@ProEnd](https://rdcu.be/dW6ng).

## License

The code is freely available to download and run, and it is licensed under a [Creative Commons Attribution-NonCommercial 4.0 International License](https://creativecommons.org/licenses/by-nc/4.0/), meaning you can use it for non-commercial purposes as long as you cite its source.

[![License: CC BY-NC 4.0](https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)

## References