Releases: jpquast/protti
protti 1.0.0
New features
- `impute_randomforest()`: a new imputation method based on random forests, using `missForest`.
- The `fetch_pdb()` function now reads additional data about non-polymer ligands and the donor ligand positions (`nonpolymer_donor_label_seq_id`).
- The `find_peptide_in_structure()` function now has additional outputs:
  - `start_adjusted` and `end_adjusted`: the adjusted start and end positions of a peptide if it is only partially covered by the structure.
  - `percentage_covered_peptides`: the percentage of all provided peptides that are at least partially covered by the structure.
- `calculate_go_enrichment()` received the argument `label_size` that allows the user to specify the size of the labels in the plot.
- Added `fetch_interpro()`. The function fetches information from the InterPro database. There are two options. First, domain-level information about the proteins of interest can be retrieved. This also includes, for example, gene ontology terms of the domains as well as their positions within the protein. Second, residue-level information can be retrieved. These are any annotations of proteins that focus on residues or small stretches, such as active sites and binding sites.
- `fetch_alphafold_aligned_error()` received a `version` argument that lets the user directly control the AlphaFold database version used.
- `map_peptides_on_structure()` received new arguments (fixes issue #285):
  - `alphafold_version`: allows the user to control the AlphaFold database version used (useful if older versions no longer work).
  - `baseline_map_value` (optional): handles constant mappings. If all mapped values for a structure/protein (or the whole dataset when `scale_per_structure = FALSE`) equal this baseline, values are scaled to 50 instead of 100.
Bug fixes
- Fixed issue #193. This makes sure that information in retained columns can be propagated to newly created combinations, which were not present in the original data.
- Fixed issue #251. Sodium ions (Na) are no longer read as `NA` when using the `fetch_pdb()` function.
- `calculate_go_enrichment()` can now correctly handle groups that are of type factor.
- Fixed an issue in `map_peptides_on_structure()` where `scale_per_structure = TRUE` incorrectly scaled AlphaFold predictions together (because missing `pdb_id` values were grouped as `NA`). AlphaFold predictions are now scaled per UniProt ID, matching per-structure behavior.
- Fixed issue #279. The x-axis of `calculate_go_enrichment()` is now displayed correctly.
- `fetch_uniprot()` received a new default for `batchsize`, which is `100` and dictated by the new limit of UniProt.
Additional Changes
- IMPORTANT! There has been a change to the hyperparameter estimation of the limma package (3.61.8) in the `eBayes()` function. This changes the results when `method = "moderated_t-test"` is used in the `calculate_diff_abundance()` function. Therefore, we introduced the new argument `limma_legacy_estimation` that allows you to go back to the old method. The default behaviour is the new and improved estimation of parameters.
- `assign_peptide_type()` now takes the `start` argument, containing the start position of a peptide. If a protein does not have any peptide starting at position `1` and there is a peptide starting at position `2`, this peptide will be considered "tryptic" at the N-terminus. This is because the initial methionine is likely removed by processing in every copy of the protein, so position `2` is the true N-terminus.
- `extract_metal_binders()` now uses keywords from UniProt as well. In addition, only "enables" GO terms are considered now.
- `fetch_uniprot()` received another default column, "keyword".
- `calculate_go_enrichment()`:
  - got improved error handling and now checks if any significant proteins are present in the input data. If none are found (`is_significant == TRUE` for no rows), the function exits early.
  - is now more robust for edge cases. A plot, instead of an error, is returned if there is only one significant GO term and `plot_style = "heatmap"` is selected.
- `fetch_alphafold_prediction()` updated `version` to `"v6"`.
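The N-terminus rule described for `assign_peptide_type()` can be pictured with a minimal sketch. It is written in Python purely to illustrate the logic and is not protti's implementation; the function name `starts_at_protein_n_terminus` and its arguments are hypothetical.

```python
def starts_at_protein_n_terminus(start, peptide_starts):
    """Return True if a peptide beginning at `start` is treated as
    starting at the protein N-terminus.

    peptide_starts: start positions of all peptides detected for the
    same protein (hypothetical input for this sketch).
    """
    if start == 1:
        return True
    # Position 2 counts as the N-terminus only when no peptide of the
    # protein starts at position 1, i.e. the initial methionine appears
    # to be missing from every copy of the protein.
    return start == 2 and 1 not in peptide_starts


print(starts_at_protein_n_terminus(2, [2, 15, 40]))  # True
print(starts_at_protein_n_terminus(2, [1, 2, 15]))   # False
```

A peptide for which this check is true would not need a tryptic cleavage site in front of it to count as tryptic at its N-terminal end.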
protti 0.9.1
Bug fixes
- `try_query()` now correctly handles errors that don't return a response object. We also handle gzip decompression problems better, since compressed responses from some databases were not handled correctly.
protti 0.9.0
New features
- `calculate_go_enrichment()` got additional arguments.
  - `replace_long_name`: a logical argument that specifies if GO term names above 50 characters should be replaced by the GO ID instead for the plot. This ensures that the plotting area doesn't become too small due to the long name. The default is `TRUE`.
  - `label_move_frac`: a numeric argument between 0 and 1 that specifies which labels should be moved outside of the bar. The default is 0.2, which means that the labels of all bars that have a size of 20% or less of the largest bar are moved to the right of the bar. This prevents labels from overlapping with the bar boundaries.
- `fetch_alphafold_aligned_error()`, `fetch_alphafold_prediction()`, `fetch_mobidb()`, `fetch_quickgo()`, `fetch_uniprot()` and `fetch_uniprot_proteome()` got additional arguments:
  - `timeout`: a numeric value specifying the time in seconds until the download times out.
  - `max_tries`: a numeric value that specifies the number of times the function tries to download the data in case an error occurs.
- Enhanced flexibility in protein quantification: introduced the `min_n_peptides` parameter to the `calculate_protein_abundance()` function. This allows users to specify the minimum number of peptides per protein needed for analysis. The default is three peptides.
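The `label_move_frac` rule above can be sketched as follows. This is a Python illustration of the stated threshold only, not protti's plotting code; `labels_outside_bar` is a hypothetical helper and bar sizes are assumed to be non-negative.

```python
def labels_outside_bar(bar_sizes, label_move_frac=0.2):
    """Flag bars whose label would be drawn to the right of the bar.

    A label moves outside when the bar is at most `label_move_frac`
    times the size of the largest bar.
    """
    threshold = label_move_frac * max(bar_sizes)
    return [size <= threshold for size in bar_sizes]


# Largest bar is 10, so the threshold is 2: the bars of size 1.0 and 1.5
# are too short to hold their labels.
print(labels_outside_bar([10.0, 1.0, 1.5, 6.0]))  # [False, True, True, False]
```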
Bug fixes
- `fetch_uniprot()` previously had an issue where it incorrectly identified certain IDs as UniProt IDs, such as ENSEMBL IDs. For example, it would incorrectly interpret `"CON_ENSEMBL:ENSBTAP00000037665"` as `"P00000"`. To address this, the function now requires that UniProt IDs are not preceded or followed by letters or digits. This means that UniProt IDs are recognized only if they stand alone or are separated by non-alphanumeric characters. For instance, in the string `"P02545;P20700"`, both `"P02545"` and `"P20700"` are correctly identified as UniProt IDs because they are separated by a semicolon and not attached to any other letters or digits. Fixes issue #245.
- `calculate_go_enrichment()` now correctly uses the total number of provided proteins for the contingency table. Previously it falsely only considered proteins with a GO annotation for the enrichment analysis.
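The effect of the boundary requirement on UniProt ID detection can be reproduced with a small regular-expression sketch. It is written in Python for illustration and uses the accession format published in the UniProt documentation; the exact expression used inside `fetch_uniprot()` may differ, and the names `ACCESSION`, `BOUNDED`, `UNBOUNDED` and `extract_uniprot_ids` are hypothetical.

```python
import re

# Accession format as published in the UniProt documentation.
ACCESSION = (
    r"(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})"
)

# Without boundaries, an accession-like substring inside a longer ID matches.
UNBOUNDED = re.compile(ACCESSION)

# Lookarounds reject matches that touch another letter or digit.
BOUNDED = re.compile(r"(?<![A-Za-z0-9])" + ACCESSION + r"(?![A-Za-z0-9])")


def extract_uniprot_ids(text, pattern=BOUNDED):
    """Return all substrings of `text` that match `pattern`."""
    return pattern.findall(text)


ensembl = "CON_ENSEMBL:ENSBTAP00000037665"
print(extract_uniprot_ids(ensembl, UNBOUNDED))  # ['P00000']
print(extract_uniprot_ids(ensembl))             # []
print(extract_uniprot_ids("P02545;P20700"))     # ['P02545', 'P20700']
```

The semicolon is a non-alphanumeric separator, so both accessions in the last call satisfy the lookarounds.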
Additional Changes
- `fetch_uniprot()` and `fetch_uniprot_proteome()` are more resistant to database connection issues. They also give more informative messages as to why the data could not be retrieved. Fixes issue #252.
- `qc_cvs()` now properly works if the column supplied to the `condition` argument is a factor. Fixes issue #254.
- The `analyse_functional_network()` function now includes enhanced error handling to ensure it fails gracefully in case of any issues. Fixes issue #259.
- The default `version` parameter for `analyse_functional_network()` has been updated to 12.0, aligning with the latest STRINGdb version. Fixes issue #244.
protti 0.8.0
New features
- `calculate_treatment_enrichment()` received additional arguments.
  - `fill_colours`: a character value that can be used to provide custom colours to the plot.
  - `fill_by_group`: a logical value that specifies if the bars in the plot should be filled according to group.
  - `facet_n_col`: specifies the number of columns in the facet plot if a `group` column was provided.
- `calculate_go_enrichment()` got additional arguments.
  - `facet_n_col`: determines the number of columns the faceted plot should have if a group column is provided.
  - `plot_title`: specifies the title of the plot.
  - `min_n_detected_proteins_in_process`: an argument for plotting that specifies the minimum number of proteins that must be detected for a GO term.
  - `enrichment_type`: specifies what kind of enrichment should be calculated. It can be "all", "enriched" or "deenriched". This argument affects how `fisher.test()` calculates the enrichment. A two-sided test is used for "all", while a one-sided test in the specific direction is used for "enriched" or "deenriched".
  - `barplot_fill_colour`: specifies the colours used to fill the bars in the barplot. Always needs at least two values, one for deenriched and one for enriched.
  - `plot_style`: we added a new plot type to the function. The standard plot is still the default and is called "barplot", while the new plot type is "heatmap". The heatmap plot is especially useful for comparing GO enrichments of multiple groups.
  - `heatmap_fill_colour`: specifies the colours used for the colour gradient of heatmap plots.
  - `heatmap_fill_colour_rev`: a logical value that specifies if the colour gradient should be reversed.
  - `plot_cutoff`: is now more flexible. You can provide any number with the "top" cutoff, e.g. "top10" or "top5".
- `barcode_plot()` received additional arguments.
  - `facet_n_col`: determines the number of columns of the faceted plot.
  - `fill_colour_gradient`: specifies the colours used for the colour gradient if the `colouring` column is continuous.
  - `fill_colour_discrete`: specifies the colours used for the fill colours if the `colouring` column is discrete.
- Added `mako_colours` to the package. It contains 256 colours of the "mako" colour gradient.
- `drc_4p_plot()` received additional arguments.
  - `facet_title_size`: determines the size of the facet titles.
  - `export_height`: determines the output height of an exported plot in inches.
  - `export_width`: determines the output width of an exported plot in inches.
  - `x_axis_limits`: user-supplied x-axis limits for each plot.
  - `colours`: determines colours used for the plot.
- `fit_drc_4p()` and `parallel_fit_drc_4p()` have been updated in the latest version of protti, leading to slight adjustments in their computational results compared to previous versions.
  - We added new arguments:
    - `anova_cutoff` lets you define the ANOVA adjusted p-value cutoff (default 0.05).
    - `n_replicate_completeness` replaces `replicate_completeness`. We now encourage you to provide a discrete minimal number of replicates instead of a fraction that is multiplied by the total number of replicates. This is particularly important to ensure that thresholds are reproducible between different datasets and data completeness levels.
    - `n_condition_completeness` replaces `condition_completeness`. As above, we encourage you to provide the minimal number of conditions that need to meet the replicate completeness criteria as a number instead of a fraction.
    - `complete_doses` is a new optional argument that should be provided if the dataset is small and potentially incomplete. This ensures that the MNAR of the curve is calculated correctly regardless of whether any doses are missing from the provided data. We recommend always providing it to ensure proper reproducibility.
  - Curves that were previously annotated in the `dose_MNAR` column are now part of the hits. To get back to the old output you can exclude them again from the ranked results.
  - The major change to the function is that all provided features (e.g. peptides) are now part of the output, no matter if a curve was fit or not. To get back to the original output you can remove all features without a fit, but please note that statistics such as the ANOVA p-value adjustment were computed on the complete dataset and might need to be readjusted by running the p-value adjustment again.
  - Another major change to the function is the way the `filter` argument works. This argument controls if significance statistics should be annotated in the data.
    - `"pre"`: this previously filtered curves by completeness as well as by the ANOVA adjusted p-value prior to fitting curves. Now it only filters by completeness. This also allows it to be an option for the `parallel_fit_drc_4p()` function.
    - `"post"`: is still the default value and still just annotates the data without any filtering.
    - In general we would now recommend using `"pre"` to remove features with too few complete concentrations, which are usually not trustworthy, from the data before p-value adjustment and curve fitting. This increases confidence that the remaining features without a dose-response behavior are true negatives. Features with too few values are better excluded because they are potentially false negatives.
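The count-based completeness thresholds of `fit_drc_4p()` can be pictured with the sketch below. It is a Python illustration under the assumption that a feature is kept when at least `n_condition_completeness` conditions each have at least `n_replicate_completeness` observed replicates; the exact rule inside protti is not confirmed here, and `passes_completeness` with its input format is hypothetical.

```python
def passes_completeness(observed_replicates, n_replicate_completeness,
                        n_condition_completeness):
    """Check whether a feature meets count-based completeness thresholds.

    observed_replicates: dict mapping each condition (e.g. a dose) to the
    number of replicates with a measured value for this feature.
    """
    complete_conditions = sum(
        1 for n in observed_replicates.values()
        if n >= n_replicate_completeness
    )
    return complete_conditions >= n_condition_completeness


# Four doses; three of them have at least 2 observed replicates.
peptide = {0: 3, 1: 2, 10: 3, 100: 1}
print(passes_completeness(peptide, 2, 3))  # True
print(passes_completeness(peptide, 2, 4))  # False
```

Because the thresholds are absolute counts, the same call gives the same answer whether the experiment has three or four replicates per dose, which a fraction of the replicate number would not guarantee.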
Bug fixes
- `normalise()` now correctly works with grouped data. Previously it would only work correctly with ungrouped data frames. Now you can group the data to calculate group-specific normalisations. If you want to compute a global normalisation for the dataset, you need to ungroup the data before using the function as usual. This fixes issue #209.
- `qc_sequence_coverage()` now correctly displays medians in faceted plots. This fixes issues #202 and #213.
- `fit_drc_4p()` and `parallel_fit_drc_4p()` now correctly calculate the ANOVA p-value. Previously the number of observations for each concentration was not provided correctly.
- `fetch_uniprot()` now correctly retrieves information if an input ID was also part of a non-conform input ID combination. When e.g. `c("P02545", "P02545;P20700")` was provided, the `"P02545"` accession was previously dropped from the `input_id` column even though it is also present on its own and not only in combination with `"P20700"`. The new output contains 3 rows, one for each ID, with `"P02545"` having one row with the `input_id` `"P02545"` and one with the `input_id` `"P02545;P20700"`. This also means that the `input_id` column now always contains the provided input IDs and not only if they were non-conform input ID combinations.
Additional Changes
- For `fit_drc_4p()` and `parallel_fit_drc_4p()` the arguments `replicate_completeness` and `condition_completeness` are now deprecated. Please use `n_replicate_completeness` and `n_condition_completeness` instead.
- Improved label positions of `qc_charge_states()`, `qc_peptide_type()` and `qc_missed_cleavages()`. Also made appearance more uniform between methods `"count"` and `"intensity"`.
- `fetch_uniprot()` now returns nothing instead of a partial output if some of the requested batches could not be retrieved due to database issues (e.g. timeout because of too many requests). This addresses issue #203, which requests this change, because the warning message regarding the partial output can be easily missed and users might wrongly assume that all information was retrieved successfully from UniProt.
- `find_peptide()` now preserves the groups of the original data. This does not affect any of the calculations.
- `calculate_sequence_coverage()` now works on grouped data.
- Some typo fixes. Thank you Steffi!
protti 0.7.0
New features
- `correct_lip_for_abundance()` was added. It corrects LiP-peptides for changes in protein abundance and calculates their significance using a t-test. The function is based on the MSstatsLiP package developed by the Vitek Lab. Big thanks to @FehrAaron for implementing it!
- `qc_cvs()` received a new argument called `max_cv` that specifies the maximum CV that should be included in the plot.
- `peptide_profile_plot()` received a new argument called `complete_sample`. If set to `TRUE`, each protein gets assigned all sample names that are found in the input data. This ensures that the plot always contains all samples on the x-axis even if there are no measured intensities for a specific sample. The default is `FALSE`, which is the original behaviour of the function.
- `volcano_plot()` received the `colour` argument that allows the user to provide custom colours for points.
- Increased the speed of `find_peptide()` and `assign_peptide_type()` by only computing on the smallest possible subset of data before joining back to the original data frame.
- `calculate_treatment_enrichment()` can now be applied on data frames with multiple different groups. The enrichment will be calculated for each group separately. If the data is plotted, each group is displayed in a separate facet. The group is provided to the new `group` argument.
- `qc_pca()`: if the condition argument is numeric, a colour gradient is used instead.
Bug fixes
- `volcano_plot()` now also works interactively if there are no significant hits.
- `fetch_chebi()`: fixed an issue caused by `na_if()` that changed its behaviour after the recent `dplyr` update.
- `qc_proteome_coverage()`: fixed the label order of fractions of proteins detected and not detected in the proteome. Fixes issue #194.
- `calculate_protein_abundance()` now correctly retains columns if `for_plot = TRUE`. Previously the columns to retain were not joined considering the precursor column, which led to duplications of information where it did not belong. Fixes issue #197.
- `fetch_kegg()` now returns the pathway name correctly again.
- `qc_intensity_distribution()`, `qc_median_intensities()`, `qc_charge_states()`, `qc_contaminants()`, `qc_missed_cleavages()`, `qc_peptide_type()`, `qc_ids()`: if the provided sample column is of type factor, the level order won't be overwritten anymore.
- `fit_drc_4p()`: if there are no correlations, an empty data frame is returned to prevent errors in `parallel_fit_drc_4p()`.
- `calculate_sequence_coverage()` does not fail anymore if a protein only contains `NA` peptide sequences.
- `qc_sequence_coverage()` does not return a plot anymore if `plot = FALSE`. This fixes issue #207.
- `qc_data_completeness()`: if the sample column was of type `factor`, the function did not properly facet the data when the `digestion` argument was provided. Now we filter out all 0% completeness values that come from factor levels that are not present in subsetted data.
protti 0.6.0
New features
- `calculate_go_enrichment()` can now be applied on data frames with multiple different groups. The enrichment will be calculated for each group separately. If the data is plotted, each group is displayed in a separate facet. The group is provided to the new `group` argument. The `y_axis_free` argument determines if the y-axis of the faceted plot is "free" or "fixed".
- Added a `version` argument to `fetch_alphafold_prediction()` that specifies which version of the database should be retrieved. The default is currently the newest version `"v4"`.
- `qc_ranked_intensities()` was added. It ranks protein, peptide or precursor intensities from highest to lowest. Ranked intensities can also be plotted using the `plot` argument.
- `fetch_chebi()` received a `timeout` argument that specifies after how many seconds the connection to the database should time out. The default is 60 seconds as previously used.
Bug fixes
- `pval_distribution_plot()` facets now have the correct style.
- `calculate_protein_abundance()` requires at least three distinct peptides for quantification. The function now applies this rule to each sample independently, rather than only checking that the whole dataset contains at least three distinct peptides.
protti 0.5.0
New features
- `fetch_alphafold_aligned_error()` was added. It fetches the aligned error matrix for structure predictions from the AlphaFold EBI database.
- `predict_alphafold_domain()` was added. It uses a graph-based community clustering algorithm of AlphaFold predicted aligned errors in order to infer protein domains in AlphaFold predictions. The code is based on Python code by Tristan Croll.
Bug fixes
- `assign_missingness()` now correctly deals with unequal replicate numbers of comparisons. In addition, a message is returned if an unequal number of replicates is detected for a comparison.
- `fetch_chebi()`: fixed a bug that prevented the function from failing gracefully if there is a connection problem to the server.
- `extract_metal_binders()` now checks if the provided data frames are `NULL`. If yes, a message and `NULL` are returned.
- `fetch_mobidb()` was updated after the API changed.
Additional changes
- Updated the "Protein Structure Analysis Workflow" vignette to include the `fetch_alphafold_aligned_error()` and `predict_alphafold_domain()` functions.
protti 0.4.0
New features
- Reintroduced the functionalities relying on the `iq` package to `protti`. `calculate_protein_abundance()` now has the method `"iq"` again as an option.
- `fetch_pdb()` now also retrieves information on engineered mutations, non-standard monomers, secondary structure and binding interfaces of ligands.
- `extract_metal_binders()` was completely redone. This was in response to the UniProt update and rework of the binding column provided by UniProt. This function extracts and concatenates all metal binding information available for a protein based on the UniProt and QuickGO databases. Therefore, this function now also takes gene ontology (GO) information from QuickGO as input. Instead of providing column names to specific arguments of the function, you now only provide the data frames. This makes the function less flexible but reduces the number of arguments required to achieve the same result. You just need to make sure that the input data frames contain columns with the correct names as stated by the function documentation.
- `fetch_quickgo()` was added. It fetches gene ontology (GO) information from the QuickGO EBI database. The retrieved information can either be GO annotations for provided UniProt IDs or Taxon identifiers, a list of all GO terms or a "slims" subset of GO IDs that can be generated based on provided GO IDs.
- `fetch_chebi()` now has the `stars` argument with which one can select the evidence levels for which entries should be retrieved.
Bug fixes
- Fixed the `auth_seq_id` column that is part of the output of the `fetch_pdb()` function. Previously, the column could contain duplicated or missing positions. This was identified by comparing the number of positions within the `auth_seq_id` column and the number of residues in the deposited `pdb_sequence`. Positions are now correct. The original output can be found in the `auth_seq_id_original` column.
- In the `calculate_diff_abundance()` function the intensity column can now be retained with the `retain_columns` argument. This was previously not possible because this column was used to reduce the annotation dataset. However, after reassessing the benefit of this filter step, it did not seem necessary.
- We assumed that users would only retain columns in `calculate_diff_abundance()` that would not duplicate the data. However, this seems not to be the case, which can lead to wrong p-value adjustment. p-value adjustment was originally performed after the columns indicated in `retain_columns` were joined back to the data. Now p-value adjustment is performed prior to retaining columns as well as only on the subset of data that actually contains p-values. Previously we (by default `filter_NA_missingness = TRUE`) only filtered out `NA`s in the `missingness` column prior to p-value adjustment. However, it is possible that `missingness` is not `NA` but the p-value is `NA`. Now for all methods except for `"proDA"` we remove `NA` p-values before p-value adjustment. For `"proDA"` data is handled as previously since p-values are never `NA`.
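Why duplicated rows distort p-value adjustment can be seen in a small sketch. It is written in Python for illustration and assumes the Benjamini-Hochberg method, which is not named in the note above; `bh_adjust` is a hypothetical helper, not a protti function. Missing p-values (here `None`) are left out of the adjustment, mirroring the removal of `NA` p-values described above.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjustment; None entries are skipped and stay None."""
    ranked = sorted((p, i) for i, p in enumerate(pvalues) if p is not None)
    n = len(ranked)
    adjusted = [None] * len(pvalues)
    running_min = 1.0
    # Walk from the largest p-value down so the running minimum keeps
    # the adjusted values monotonic.
    for rank in range(n, 0, -1):
        p, i = ranked[rank - 1]
        running_min = min(running_min, p * n / rank)
        adjusted[i] = running_min
    return adjusted


# One row per feature: the smallest p-value is adjusted to 0.03.
print([round(p, 3) for p in bh_adjust([0.01, 0.04, 0.5])])
# [0.03, 0.06, 0.5]

# A retained column triplicates the last feature: the same feature now gets 0.05.
print([round(p, 3) for p in bh_adjust([0.01, 0.04, 0.5, 0.5, 0.5])])
# [0.05, 0.1, 0.5, 0.5, 0.5]
```

Adjusting before the retained columns are joined back keeps the number of tests equal to the number of features, so the result no longer depends on which columns are retained.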
Additional changes
- The default batchsize of `fetch_pdb()` was changed to 100 (from 200). This was done because more information is retrieved now, which slows the function down. Performance improves slightly when batch sizes are smaller.
- `try_query()` now only retries to retrieve information once if the returned message was "Timeout was reached". In addition, a `timeout` and an `accept` argument have been added.
- The UniProt database has changed its API, therefore column names have changed as well as the format of data. We adjusted the `fetch_uniprot()` and `fetch_uniprot_proteome()` functions accordingly. Please be aware that some column names might have changed and your code might throw error messages if you did not adjust it accordingly.
protti 0.3.1
Bug fixes
- Corrected the "Protein Structure Analysis Workflow" vignette. The example for `map_peptides_on_structure()` was still using "residue" as its input. We now use `find_peptide_in_structure()` to generate the correct input column.
- Fixed a bug in `fetch_uniprot()` and `fetch_uniprot_proteome()`. As UniProt has updated their website and their programmatic access, we now download the information from the legacy version temporarily. A real fix will follow.
- Fixed a bug in `fetch_kegg()`. The function did not retrieve any data after the API URL had changed.
protti 0.3.0
New features
- The "Protein Structure Analysis Workflow" vignette was added. It contains an example workflow for the analysis of structural proteomics data.
- `fetch_eco()` was added. It fetches evidence & conclusion ontology information from the EBI database.
- `qc_proteome_coverage()` now has the `reviewed` argument that specifies if only reviewed entries in UniProt should be considered as the proteome. The default is `TRUE` and stays the same as previously.
- `volcano_plot()` now has the `facet_scales` argument that specifies if the scales should be "free" or "fixed" when a faceted plot is created. The arguments that can be provided are the same that can be provided to the `scales` argument of `ggplot2::facet_wrap()`. The new default is `"fixed"`.
- `pval_distribution_plot()` now has the optional `facet_by` variable that allows faceting of the plot.
Bug fixes
- Fixed a bug in `map_peptides_on_structure()` that caused an error if the column provided to the `auth_seq_id` argument was called "residue".
- Fixed a bug in `volcano_plot()` that did not calculate the horizontal cutoff line correctly if there were multiple significance values that have the same adjusted significance value. Now it correctly uses the two p-values closest to the cutoff for the line position calculation. In addition, points were not correctly displayed if no horizontal cutoff line was created due to no significant values. Now all values are displayed correctly.
- Fixed a bug related to fetch functions not failing gracefully. The problem was that the internal `try_query()` function now returns errors as a character string if it encounters one. Functions using `try_query()`, however, still expected `NULL` if there was an error. Also adjusted additional fetch functions that do not use `try_query()` to fail gracefully and to return informative messages upon encountering errors.
Additional changes
- Improved test coverage for a few functions.