Reference Citation Analysis

For: Aniba MR, Poch O, Thompson JD. Issues in bioinformatics benchmarking: the case study of multiple sequence alignment. Nucleic Acids Res 2010;38:7353-63. [PMID: 20639539 PMCID: PMC2995051 DOI: 10.1093/nar/gkq625] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/27/2010] [Revised: 06/10/2010] [Accepted: 06/29/2010] [Indexed: 11/13/2022] Open

Cited by Other Article(s)

Brooks TG, Lahens NF, Mrčela A, Grant GR. Challenges and best practices in omics benchmarking. Nat Rev Genet 2024;25:326-339. [PMID: 38216661 DOI: 10.1038/s41576-023-00679-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 11/14/2023] [Indexed: 01/14/2024]

Riley AC, Ashlock DA, Graether SP. The difficulty of aligning intrinsically disordered protein sequences as assessed by conservation and phylogeny. PLoS One 2023;18:e0288388. [PMID: 37440576 DOI: 10.1371/journal.pone.0288388] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/23/2023] [Accepted: 06/26/2023] [Indexed: 07/15/2023] Open

Sonrel A, Luetge A, Soneson C, Mallona I, Germain PL, Knyazev S, Gilis J, Gerber R, Seurinck R, Paul D, Sonder E, Crowell HL, Fanaswala I, Al-Ajami A, Heidari E, Schmeing S, Milosavljevic S, Saeys Y, Mangul S, Robinson MD. Meta-analysis of (single-cell method) benchmarks reveals the need for extensibility and interoperability. Genome Biol 2023;24:119. [PMID: 37198712 DOI: 10.1186/s13059-023-02962-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/27/2022] [Accepted: 05/06/2023] [Indexed: 05/19/2023] Open

Affiliation(s)

Anthony Sonrel Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
Almut Luetge Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
Charlotte Soneson SIB Swiss Institute of Bioinformatics, Zurich, Switzerland Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland
Izaskun Mallona Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland SIB Swiss Institute of Bioinformatics, Zurich, Switzerland Department of Quantitative Biomedicine, University of Zurich, Zurich, Switzerland
Pierre-Luc Germain Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland SIB Swiss Institute of Bioinformatics, Zurich, Switzerland D-HEST Institute for Neuroscience, ETH Zürich, Zurich, Switzerland
Sergey Knyazev Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California, Los Angeles, CA, USA Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA, USA
Jeroen Gilis Department of Applied Mathematics, Computer Science & Statistics, Ghent University, Ghent, Belgium Data Mining and Modeling for Biomedicine, VIB Center for Inflammation Research, Ghent, Belgium Bioinformatics Institute Ghent, Ghent University, Ghent, Belgium
Reto Gerber Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
Ruth Seurinck Department of Applied Mathematics, Computer Science & Statistics, Ghent University, Ghent, Belgium Data Mining and Modeling for Biomedicine, VIB Center for Inflammation Research, Ghent, Belgium
Dominique Paul Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
Emanuel Sonder Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland SIB Swiss Institute of Bioinformatics, Zurich, Switzerland D-HEST Institute for Neuroscience, ETH Zürich, Zurich, Switzerland
Helena L Crowell Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
Imran Fanaswala Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
Ahmad Al-Ajami Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
Elyas Heidari Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
Stephan Schmeing Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
Stefan Milosavljevic Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland SIB Swiss Institute of Bioinformatics, Zurich, Switzerland Department of Evolutionary Biology and Environmental Studies, University of Zurich, Zurich, Switzerland
Yvan Saeys Department of Applied Mathematics, Computer Science & Statistics, Ghent University, Ghent, Belgium Data Mining and Modeling for Biomedicine, VIB Center for Inflammation Research, Ghent, Belgium
Serghei Mangul Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA, USA
Mark D Robinson Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland SIB Swiss Institute of Bioinformatics, Zurich, Switzerland

PSRTTCA: A new approach for improving the prediction and characterization of tumor T cell antigens using propensity score representation learning. Comput Biol Med 2023;152:106368. [PMID: 36481763 DOI: 10.1016/j.compbiomed.2022.106368] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/27/2022] [Revised: 10/19/2022] [Accepted: 11/25/2022] [Indexed: 11/27/2022]

Abstract

Despite the arsenal of existing cancer therapies, the ongoing recurrence and new cases of cancer pose a serious health concern that necessitates the development of new and effective treatments. Cancer immunotherapy, which uses the body's immune system to combat cancer, is a promising treatment option. As a result, in silico methods for identifying and characterizing tumor T cell antigens (TTCAs) would be useful for better understanding their functional mechanisms. Although a few computational methods for TTCA identification have been developed, their lack of model interpretability is a major drawback. Thus, developing computational methods for the effective identification and characterization of TTCAs is a critical endeavor. PSRTTCA, a new machine learning (ML)-based approach for improving the identification and characterization of TTCAs based on their primary sequences, is proposed in this study. Specifically, we introduce a new propensity score representation learning algorithm that generates various sets of propensity scores describing how likely amino acids, dipeptides, and g-gap dipeptides are to occur in TTCAs. To enhance the predictive performance, optimal sets of variant propensity scores were determined and fed into the final meta-predictor (PSRTTCA). Benchmarking results revealed that PSRTTCA was a more precise and promising tool for the identification and characterization of TTCAs than conventional ML classifiers and existing methods. Furthermore, PSR-derived propensities of amino acids in becoming TTCAs are used to reveal the relationship between TTCAs and their informative physicochemical properties in order to provide insights into TTCA characteristics. Finally, a user-friendly online computational platform of PSRTTCA is publicly available at http://pmlabstack.pythonanywhere.com/PSRTTCA. The PSRTTCA predictor is anticipated to facilitate community-wide efforts in accelerating the discovery of novel TTCAs for cancer immunotherapy and other clinical applications.
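
The abstract above mentions amino acids, dipeptides, and g-gap dipeptides as the units that receive propensity scores. As a rough illustration of what those units are, the sketch below counts g-gap dipeptides in a peptide sequence, taking a g-gap dipeptide to be a pair of residues separated by g intervening positions (so g = 0 gives ordinary adjacent dipeptides). The function names `g_gap_dipeptides` and `g_gap_frequencies` are hypothetical, and this is only the feature-extraction step under that assumed definition; it does not reproduce the PSRTTCA propensity score learning algorithm, whose details the abstract does not give.

```python
from collections import Counter


def g_gap_dipeptides(sequence, gap=0):
    """Count residue pairs separated by `gap` intervening positions.

    gap=0 yields ordinary (adjacent) dipeptides. Hypothetical helper,
    not taken from the PSRTTCA code base.
    """
    if gap < 0:
        raise ValueError("gap must be non-negative")
    step = gap + 1  # distance between the two residues of a pair
    return Counter(
        sequence[i] + sequence[i + step]
        for i in range(len(sequence) - step)
    )


def g_gap_frequencies(sequence, gap=0):
    """Normalize g-gap dipeptide counts to relative frequencies."""
    counts = g_gap_dipeptides(sequence, gap)
    total = sum(counts.values())
    if total == 0:  # sequence too short to contain any pair
        return {}
    return {pair: n / total for pair, n in counts.items()}
```

For example, `g_gap_dipeptides("ACDA", gap=1)` pairs each residue with the one two positions later. Frequencies of this kind are a common starting point for sequence-based peptide classifiers; how PSRTTCA turns them into propensity scores is described in the cited article, not here.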

Accuracy and Completeness of Long Read Metagenomic Assemblies. Microorganisms 2022;11:microorganisms11010096. [PMID: 36677391 PMCID: PMC9861289 DOI: 10.3390/microorganisms11010096] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/28/2022] [Revised: 12/22/2022] [Accepted: 12/28/2022] [Indexed: 01/03/2023] Open

Kańduła MM, Aldoshin AD, Singh S, Kolaczyk ED, Kreil D. ViLoN-a multi-layer network approach to data integration demonstrated for patient stratification. Nucleic Acids Res 2022;51:e6. [PMID: 36395816 PMCID: PMC9841426 DOI: 10.1093/nar/gkac988] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/30/2021] [Revised: 10/11/2022] [Accepted: 11/02/2022] [Indexed: 11/19/2022] Open

Usability evaluation of circRNA identification tools: Development of a heuristic-based framework and analysis. Comput Biol Med 2022;147:105785. [PMID: 35780604 DOI: 10.1016/j.compbiomed.2022.105785] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/25/2022] [Revised: 05/23/2022] [Accepted: 06/26/2022] [Indexed: 11/21/2022]

Hubley R, Wheeler TJ, Smit AFA. Accuracy of multiple sequence alignment methods in the reconstruction of transposable element families. NAR Genom Bioinform 2022;4:lqac040. [PMID: 35591887 PMCID: PMC9112768 DOI: 10.1093/nargab/lqac040] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/18/2021] [Revised: 03/29/2022] [Accepted: 04/29/2022] [Indexed: 02/06/2023] Open

Chao J, Tang F, Xu L. Developments in Algorithms for Sequence Alignment: A Review. Biomolecules 2022;12:biom12040546. [PMID: 35454135 PMCID: PMC9024764 DOI: 10.3390/biom12040546] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/11/2022] [Revised: 03/29/2022] [Accepted: 03/31/2022] [Indexed: 01/27/2023] Open

Shrestha B, Adhikari B. Scoring protein sequence alignments using deep Learning. Bioinformatics 2022;38:2988-2995. [PMID: 35385080 DOI: 10.1093/bioinformatics/btac210] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/11/2021] [Revised: 04/01/2022] [Accepted: 04/05/2022] [Indexed: 11/12/2022] Open

Kundu R, Chattopadhyay S, Cuevas E, Sarkar R. AltWOA: Altruistic Whale Optimization Algorithm for feature selection on microarray datasets. Comput Biol Med 2022;144:105349. [PMID: 35303580 DOI: 10.1016/j.compbiomed.2022.105349] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/02/2021] [Revised: 02/22/2022] [Accepted: 02/22/2022] [Indexed: 12/15/2022]

Bokulich NA, Ziemski M, Robeson MS, Kaehler BD. Measuring the microbiome: Best practices for developing and benchmarking microbiomics methods. Comput Struct Biotechnol J 2020;18:4048-4062. [PMID: 33363701 PMCID: PMC7744638 DOI: 10.1016/j.csbj.2020.11.049] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/25/2020] [Revised: 11/27/2020] [Accepted: 11/28/2020] [Indexed: 12/12/2022] Open

A unique data analysis framework and open source benchmark data set for the analysis of comprehensive two-dimensional gas chromatography software. J Chromatogr A 2020;1635:461721. [PMID: 33246680 DOI: 10.1016/j.chroma.2020.461721] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/26/2020] [Revised: 11/05/2020] [Accepted: 11/09/2020] [Indexed: 12/28/2022]

Durojaye OA, Mushiana T, Uzoeto HO, Cosmas S, Udowo VM, Osotuyi AG, Ibiang GO, Gonlepa MK. Potential therapeutic target identification in the novel 2019 coronavirus: insight from homology modeling and blind docking study. EGYPTIAN JOURNAL OF MEDICAL HUMAN GENETICS 2020;21:44. [PMID: 38624499 PMCID: PMC7529470 DOI: 10.1186/s43042-020-00081-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/13/2020] [Accepted: 07/03/2020] [Indexed: 12/13/2022] Open

Wright ES. RNAconTest: comparing tools for noncoding RNA multiple sequence alignment based on structural consistency. RNA (NEW YORK, N.Y.) 2020;26:531-540. [PMID: 32005745 PMCID: PMC7161358 DOI: 10.1261/rna.073015.119] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/20/2019] [Accepted: 01/28/2020] [Indexed: 05/05/2023]

Kreutz C. Guidelines for benchmarking of optimization-based approaches for fitting mathematical models. Genome Biol 2019;20:281. [PMID: 31842943 PMCID: PMC6915982 DOI: 10.1186/s13059-019-1887-9] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/01/2019] [Accepted: 11/13/2019] [Indexed: 11/10/2022] Open

Ballouz S, Dobin A, Gingeras TR, Gillis J. The fractured landscape of RNA-seq alignment: the default in our STARs. Nucleic Acids Res 2018;46:5125-5138. [PMID: 29718481 PMCID: PMC6007662 DOI: 10.1093/nar/gky325] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/08/2017] [Accepted: 04/16/2018] [Indexed: 12/28/2022] Open

Weber LM, Saelens W, Cannoodt R, Soneson C, Hapfelmeier A, Gardner PP, Boulesteix AL, Saeys Y, Robinson MD. Essential guidelines for computational method benchmarking. Genome Biol 2019;20:125. [PMID: 31221194 PMCID: PMC6584985 DOI: 10.1186/s13059-019-1738-8] [Citation(s) in RCA: 77] [Impact Index Per Article: 15.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/14/2022] Open

Nute M, Saleh E, Warnow T. Evaluating Statistical Multiple Sequence Alignment in Comparison to Other Alignment Methods on Protein Data Sets. Syst Biol 2019;68:396-411. [PMID: 30329135 PMCID: PMC6472439 DOI: 10.1093/sysbio/syy068] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2018] [Revised: 09/27/2018] [Accepted: 10/11/2018] [Indexed: 01/15/2023] Open

Mangul S, Martin LS, Hill BL, Lam AKM, Distler MG, Zelikovsky A, Eskin E, Flint J. Systematic benchmarking of omics computational tools. Nat Commun 2019;10:1393. [PMID: 30918265 PMCID: PMC6437167 DOI: 10.1038/s41467-019-09406-4] [Citation(s) in RCA: 82] [Impact Index Per Article: 16.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2018] [Accepted: 03/06/2019] [Indexed: 01/11/2023] Open

Wang Y, Wu H, Cai Y. A benchmark study of sequence alignment methods for protein clustering. BMC Bioinformatics 2018;19:529. [PMID: 30598070 PMCID: PMC6311937 DOI: 10.1186/s12859-018-2524-4] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022] Open

Saripella GV, Sonnhammer ELL, Forslund K. Benchmarking the next generation of homology inference tools. Bioinformatics 2016;32:2636-41. [PMID: 27256311 PMCID: PMC5013910 DOI: 10.1093/bioinformatics/btw305] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/02/2015] [Accepted: 05/05/2016] [Indexed: 12/21/2022] Open

Abstract

Motivation: Over the last decades, vast numbers of sequences were deposited in public databases. Bioinformatics tools allow homology and consequently functional inference for these sequences. New profile-based homology search tools have been introduced, allowing reliable detection of remote homologs, but have not been systematically benchmarked. To provide such a comparison, which can guide bioinformatics workflows, we extend and apply our previously developed benchmark approach to evaluate the ‘next generation’ of profile-based approaches, including CS-BLAST, HHSEARCH and PHMMER, in comparison with the non-profile based search tools NCBI-BLAST, USEARCH, UBLAST and FASTA.

Method: We generated challenging benchmark datasets based on protein domain architectures within either the PFAM + Clan, SCOP/Superfamily or CATH/Gene3D domain definition schemes. From each dataset, homologous and non-homologous protein pairs were aligned using each tool, and standard performance metrics calculated. We further measured congruence of domain architecture assignments in the three domain databases.

Results: CS-BLAST and PHMMER had the highest overall accuracy. FASTA, UBLAST and USEARCH showed large trade-offs of accuracy for speed optimization.

Conclusion: Profile methods are superior at inferring remote homologs but the difference in accuracy between methods is relatively small. PHMMER and CS-BLAST stand out with the highest accuracy, yet still at a reasonable computational cost. Additionally, we show that less than 0.1% of Swiss-Prot protein pairs considered homologous by one database are considered non-homologous by another, implying that these classifications represent equivalent underlying biological phenomena, differing mostly in coverage and granularity.

Availability and Implementation: Benchmark datasets and all scripts are placed at (http://sonnhammer.org/download/Homology_benchmark).

Contact: forslund@embl.de

Supplementary information: Supplementary data are available at Bioinformatics online.

Khushi M. Benchmarking database performance for genomic data. J Cell Biochem 2016;116:877-83. [PMID: 25560631 DOI: 10.1002/jcb.25049] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/17/2014] [Accepted: 12/16/2014] [Indexed: 01/01/2023]

Olsen LR, Simon C, Kudahl UJ, Bagger FO, Winther O, Reinherz EL, Zhang GL, Brusic V. A computational method for identification of vaccine targets from protein regions of conserved human leukocyte antigen binding. BMC Med Genomics 2015;8 Suppl 4:S1. [PMID: 26679766 PMCID: PMC4682376 DOI: 10.1186/1755-8794-8-s4-s1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open

Wright ES. DECIPHER: harnessing local sequence context to improve protein multiple sequence alignment. BMC Bioinformatics 2015;16:322. [PMID: 26445311 PMCID: PMC4595117 DOI: 10.1186/s12859-015-0749-z] [Citation(s) in RCA: 198] [Impact Index Per Article: 22.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/26/2015] [Accepted: 09/23/2015] [Indexed: 12/20/2022] Open

Abstract

BACKGROUND

Alignment of large and diverse sequence sets is a common task in biological investigations, yet there remains considerable room for improvement in alignment quality. Multiple sequence alignment programs tend to reach maximal accuracy when aligning only a few sequences, with accuracy then diminishing steadily as more sequences are added. This drop in accuracy can be partly attributed to a build-up of error and ambiguity as more sequences are aligned. Most high-throughput sequence alignment algorithms do not use contextual information under the assumption that sites are independent. This study examines the extent to which local sequence context can be exploited to improve the quality of large multiple sequence alignments.

RESULTS

Two predictors based on local sequence context were assessed: (i) single sequence secondary structure predictions, and (ii) modulation of gap costs according to the surrounding residues. The results indicate that context-based predictors have appreciable information content that can be utilized to create more accurate alignments. Furthermore, local context becomes more informative as the number of sequences increases, enabling more accurate protein alignments of large empirical benchmarks. These discoveries became the basis for DECIPHER, a new context-aware program for sequence alignment, which outperformed other programs on large sequence sets.

CONCLUSIONS

Predicting secondary structure based on local sequence context is an efficient means of breaking the independence assumption in alignment. Since secondary structure is more conserved than primary sequence, it can be leveraged to improve the alignment of distantly related proteins. Moreover, secondary structure predictions increase in accuracy as more sequences are used in the prediction. This enables the scalable generation of large sequence alignments that maintain high accuracy even on diverse sequence sets. The DECIPHER R package and source code are freely available for download at DECIPHER.cee.wisc.edu and from the Bioconductor repository.

Ndhlovu A, Hazelhurst S, Durand PM. Robust sequence alignment using evolutionary rates coupled with an amino acid substitution matrix. BMC Bioinformatics 2015;16:255. [PMID: 26269100 PMCID: PMC4535666 DOI: 10.1186/s12859-015-0688-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/14/2015] [Accepted: 07/29/2015] [Indexed: 11/27/2022] Open

Abstract

Background

Selective pressures at the DNA level shape genes into profiles consisting of patterns of rapidly evolving sites and sites withstanding change. These profiles remain detectable even when protein sequences become extensively diverged. A common task in molecular biology is to infer functional, structural or evolutionary relationships by querying a database using an algorithm. However, problems arise when sequence similarity is low. This study presents an algorithm that uses the evolutionary rate at codon sites, the dN/dS (ω) parameter, coupled to a substitution matrix as an alignment metric for detecting distantly related proteins. The algorithm, called BLOSUM-FIRE, couples a newer and improved version of the original FIRE (Functional Inference using Rates of Evolution) algorithm with an amino acid substitution matrix in a dynamic scoring function. The enigmatic hepatitis B virus X protein was used as a test case for BLOSUM-FIRE and its associated database EvoDB.

Results

The evolutionary rate based approach was coupled with a conventional BLOSUM substitution matrix. The two approaches are combined in a dynamic scoring function, which uses the selective pressure to score aligned residues. The dynamic scoring function is based on a coupled additive approach that scores aligned sites based on the level of conservation inferred from the ω values. Evaluation of the accuracy of this new implementation, BLOSUM-FIRE, using MAFFT alignments as reference alignments has shown that it is more accurate than its predecessor FIRE. Comparison of the alignment quality with widely used algorithms (MUSCLE, T-COFFEE, and CLUSTAL Omega) revealed that the BLOSUM-FIRE algorithm performs as well as conventional algorithms. Its main strength is that it provides greater potential for aligning divergent sequences and addresses the problem of low specificity inherent in the original FIRE algorithm. The utility of this algorithm is demonstrated using the Hepatitis B virus X (HBx) protein, a protein of unknown function, as a test case.

Conclusion

This study describes the utility of an evolutionary rate based approach coupled to the BLOSUM62 amino acid substitution matrix in inferring protein domain function. We demonstrate that such an approach is robust and performs as well as an array of conventional algorithms.

Morrison DA. Multiple Sequence Alignment Methods. — Edited by David J. Russell. Syst Biol 2015. [DOI: 10.1093/sysbio/syv018] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open

Kumar M. An enhanced algorithm for multiple sequence alignment of protein sequences using genetic algorithm. EXCLI JOURNAL 2015;14:1232-55. [PMID: 27065770 PMCID: PMC4820728 DOI: 10.17179/excli2015-302] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/01/2015] [Accepted: 11/19/2015] [Indexed: 11/10/2022]

Ma C, Zhang HH, Wang X. Machine learning for Big Data analytics in plants. TRENDS IN PLANT SCIENCE 2014;19:798-808. [PMID: 25223304 DOI: 10.1016/j.tplants.2014.08.004] [Citation(s) in RCA: 93] [Impact Index Per Article: 9.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/28/2014] [Revised: 07/30/2014] [Accepted: 08/20/2014] [Indexed: 05/19/2023]

Kultys M, Nicholas L, Schwarz R, Goldman N, King J. Sequence Bundles: a novel method for visualising, discovering and exploring sequence motifs. BMC Proc 2014;8:S8. [PMID: 25237395 PMCID: PMC4155607 DOI: 10.1186/1753-6561-8-s2-s8] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022] Open

Who watches the watchmen? An appraisal of benchmarks for multiple sequence alignment. Methods Mol Biol 2014;1079:59-73. [PMID: 24170395 DOI: 10.1007/978-1-62703-646-7_4] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/22/2022]

Chakraborty S, Rao BJ, Baker N, Asgeirsson B. Structural phylogeny by profile extraction and multiple superimposition using electrostatic congruence as a discriminator. INTRINSICALLY DISORDERED PROTEINS 2013;1. [PMID: 25364645 PMCID: PMC4212511 DOI: 10.4161/idp.25463] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/23/2022]

Abstract

Phylogenetic analysis of proteins using multiple sequence alignment (MSA) assumes an underlying evolutionary relationship in these proteins which occasionally remains undetected due to considerable sequence divergence. Structural alignment programs have been developed to unravel such fuzzy relationships. However, none of these structure-based methods have used electrostatic properties to discriminate between spatially equivalent residues. We present a methodology for MSA of a set of related proteins with known structures using electrostatic properties as an additional discriminator (STEEP). STEEP first extracts a profile, then generates a multiple structural superimposition providing a consolidated spatial framework for comparing residues and finally emits the MSA. Residues that are aligned differently by including or excluding electrostatic properties can be targeted by directed evolution experiments to transform the enzymatic properties of one protein into another. We have compared STEEP results to those obtained from an MSA program (ClustalW) and a structural alignment method (MUSTANG) for chymotrypsin serine proteases. Subsequently, we used PhyML to generate phylogenetic trees for the serine and metallo-β-lactamase superfamilies from the STEEP-generated MSA, and corroborated the accepted relationships in these superfamilies. We have observed that STEEP acts as a functional classifier when electrostatic congruence is used as a discriminator, and thus identifies potential targets for directed evolution experiments. In summary, STEEP is unique among phylogenetic methods for its ability to use electrostatic congruence to specify mutations that might be the source of the functional divergence in a protein family. Based on our results, we also hypothesize that the active site and its close vicinity contain enough information to infer the correct phylogeny for related proteins.

Warnow T. Large-Scale Multiple Sequence Alignment and Phylogeny Estimation. MODELS AND ALGORITHMS FOR GENOME EVOLUTION 2013. [DOI: 10.1007/978-1-4471-5298-9_6] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/30/2022]

Nair PS, Vihinen M. VariBench: A Benchmark Database for Variations. Hum Mutat 2012;34:42-9. [DOI: 10.1002/humu.22204] [Citation(s) in RCA: 106] [Impact Index Per Article: 8.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/20/2012] [Accepted: 07/31/2012] [Indexed: 12/21/2022]

Iwata H, Gotoh O. Benchmarking spliced alignment programs including Spaln2, an extended version of Spaln that incorporates additional species-specific features. Nucleic Acids Res 2012;40:e161. [PMID: 22848105 PMCID: PMC3488211 DOI: 10.1093/nar/gks708] [Citation(s) in RCA: 113] [Impact Index Per Article: 9.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open

Ajawatanawong P, Atkinson GC, Watson-Haigh NS, Mackenzie B, Baldauf SL. SeqFIRE: a web application for automated extraction of indel regions and conserved blocks from protein multiple sequence alignments. Nucleic Acids Res 2012;40:W340-7. [PMID: 22693213 PMCID: PMC3394284 DOI: 10.1093/nar/gks561] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/02/2012] [Revised: 05/14/2012] [Accepted: 05/18/2012] [Indexed: 11/16/2022] Open

Vihinen M. How to evaluate performance of prediction methods? Measures and their interpretation in variation effect analysis. BMC Genomics 2012;13 Suppl 4:S2. [PMID: 22759650 PMCID: PMC3303716 DOI: 10.1186/1471-2164-13-s4-s2] [Citation(s) in RCA: 175] [Impact Index Per Article: 14.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/26/2022] Open

Abstract

Background

Prediction methods are increasingly used in biosciences to forecast diverse features and characteristics. Binary two-state classifiers are the most common applications. They are usually based on machine learning approaches. For the end user it is often problematic to evaluate the true performance and applicability of computational tools as some knowledge about computer science and statistics would be needed.

Results

Instructions are given on how to interpret and compare method evaluation results. Systematic analysis of method performance requires established benchmark datasets that contain cases with known outcomes, together with suitable evaluation measures. The criteria for benchmark datasets are discussed along with their implementation in VariBench, a benchmark database for variations. No single measure can describe all aspects of method performance. Predictions of the effects of genetic variation at the DNA, RNA and protein levels are important because information about variants can be produced much faster than their disease relevance can be experimentally verified. Numerous prediction tools have therefore been developed; however, systematic analyses and comparisons of their performance have only just started to emerge.

Conclusions

The end users of prediction tools should be able to understand how evaluation is done and how to interpret the results. Six main performance evaluation measures are introduced. These include sensitivity, specificity, positive predictive value, negative predictive value, accuracy and Matthews correlation coefficient. Together with receiver operating characteristics (ROC) analysis they provide a good picture about the performance of methods and allow their objective and quantitative comparison. A checklist of items to look at is provided. Comparisons of methods for missense variant tolerance, protein stability changes due to amino acid substitutions, and effects of variations on mRNA splicing are presented.
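
The six measures named above can all be derived from the four cells of a binary confusion matrix. The following is a minimal sketch using the standard textbook definitions; the function name `binary_measures` and the example counts are illustrative assumptions and are not taken from the cited article.

```python
import math


def binary_measures(tp, tn, fp, fn):
    """Return six common performance measures for a two-state classifier.

    tp, tn, fp, fn are the counts of true positives, true negatives,
    false positives and false negatives. A measure whose denominator
    is zero is reported as NaN rather than raising an error.
    """
    def ratio(num, den):
        return num / den if den else float("nan")

    # Denominator of the Matthews correlation coefficient (MCC).
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "ppv": ratio(tp, tp + fp),           # positive predictive value
        "npv": ratio(tn, tn + fn),           # negative predictive value
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the function.
    for name, value in binary_measures(tp=40, tn=45, fp=5, fn=10).items():
        print(f"{name}: {value:.3f}")
```

Reporting all six values side by side reflects the abstract's point that no single measure describes every aspect of performance: accuracy, for example, can look high on an unbalanced dataset while MCC stays low.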

Jagadeesh Chandra Bose R, van der Aalst WM. Process diagnostics using trace alignment: Opportunities, issues, and challenges. INFORM SYST 2012. [DOI: 10.1016/j.is.2011.08.003] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/17/2022]

Astakhova TV, Lobanov MN, Poverennaya IV, Roytberg MA, Yacovlev VV. Verification of the PREFAB alignment database. Biophysics (Nagoya-shi) 2012. [DOI: 10.1134/s0006350912020030] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/23/2022] Open

Erb I, González-Vallinas JR, Bussotti G, Blanco E, Eyras E, Notredame C. Use of ChIP-Seq data for the design of a multiple promoter-alignment method. Nucleic Acids Res 2012;40:e52. [PMID: 22230796 PMCID: PMC3326335 DOI: 10.1093/nar/gkr1292] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/22/2023] Open

Benchmarks for flexible and rigid transcription factor-DNA docking. BMC STRUCTURAL BIOLOGY 2011;11:45. [PMID: 22044637 PMCID: PMC3262759 DOI: 10.1186/1472-6807-11-45] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/09/2011] [Accepted: 11/01/2011] [Indexed: 12/27/2022]

Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Söding J, Thompson JD, Higgins DG. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol 2011;7:539. [PMID: 21988835 PMCID: PMC3261699 DOI: 10.1038/msb.2011.75] [Citation(s) in RCA: 10158] [Impact Index Per Article: 781.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2011] [Accepted: 09/06/2011] [Indexed: 02/06/2023] Open

Mirarab S, Warnow T. FastSP: linear time calculation of alignment accuracy. Bioinformatics 2011;27:3250-8. [PMID: 21984754 DOI: 10.1093/bioinformatics/btr553] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open

The phylogeny of monkey beetles based on mitochondrial and ribosomal RNA genes (Coleoptera: Scarabaeidae: Hopliini). Mol Phylogenet Evol 2011;60:408-15. [DOI: 10.1016/j.ympev.2011.04.011] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/27/2010] [Revised: 04/05/2011] [Accepted: 04/18/2011] [Indexed: 11/16/2022]

Thompson JD, Linard B, Lecompte O, Poch O. A comprehensive benchmark study of multiple sequence alignment methods: current challenges and future perspectives. PLoS One 2011;6:e18093. [PMID: 21483869 PMCID: PMC3069049 DOI: 10.1371/journal.pone.0018093] [Citation(s) in RCA: 129] [Impact Index Per Article: 9.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2010] [Accepted: 02/21/2011] [Indexed: 12/18/2022] Open

Abstract

Multiple comparison or alignment of protein sequences has become a fundamental tool in many different domains in modern molecular biology, from evolutionary studies to prediction of 2D/3D structure, molecular function and inter-molecular interactions etc. By placing the sequence in the framework of the overall family, multiple alignments can be used to identify conserved features and to highlight differences or specificities. In this paper, we describe a comprehensive evaluation of many of the most popular methods for multiple sequence alignment (MSA), based on a new benchmark test set. The benchmark is designed to represent typical problems encountered when aligning the large protein sequence sets that result from today's high throughput biotechnologies. We show that alignment methods have significantly progressed and can now identify most of the shared sequence features that determine the broad molecular function(s) of a protein family, even for divergent sequences. However, we have identified a number of important challenges. First, the locally conserved regions, that reflect functional specificities or that modulate a protein's function in a given cellular context, are less well aligned. Second, motifs in natively disordered regions are often misaligned. Third, the badly predicted or fragmentary protein sequences, which make up a large proportion of today's databases, lead to a significant number of alignment errors. Based on this study, we demonstrate that the existing MSA methods can be exploited in combination to improve alignment accuracy, although novel approaches will still be needed to fully explore the most difficult regions. We then propose knowledge-enabled, dynamic solutions that will hopefully pave the way to enhanced alignment construction and exploitation in future evolutionary systems biology studies.
