1
|
Alves G, Yu YK. Robust Accurate Identification and Biomass Estimates of Microorganisms via Tandem Mass Spectrometry. JOURNAL OF THE AMERICAN SOCIETY FOR MASS SPECTROMETRY 2020; 31:85-102. [PMID: 32881514 PMCID: PMC10501333 DOI: 10.1021/jasms.9b00035] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
Rapid and accurate identification of microorganisms and estimation of their biomasses are of extreme importance to public health. Mass spectrometry has become an important technique for these purposes. Previously we published a workflow named Microorganism Classification and Identification (MiCId v.12.26.2017) that was shown to perform no worse than other workflows. This manuscript presents MiCId v.12.13.2018 that, in comparison with the earlier version v.12.26.2017, allows for biomass estimates, provides more accurate microorganism identifications (better controls the number of false positives), and is robust against database size increase. This significant advance is made possible by several new ingredients introduced: first, we apply a modified expectation-maximization method to compute for each taxon considered a prior probability, which can be used for biomass estimate; second, we introduce a new concept called ownership, through which the participation ratio is computed and use it as the number of taxa to be kept within a cluster of closely related taxa; third, based on confidently identified peptides, we calculate for each taxon its degree of independence from the rest of taxa considered to determine whether or not to split this taxon off the cluster. Using 270 data files, each containing a large number of MS/MS spectra, we show that, in comparison with v.12.26.2017, version v.12.13.2018 yields superior retrieval results. We also show that MiCId v.12.13.2018 can estimate species biomass reasonably well. The new MiCId v.12.13.2018, designed to run in Linux environment, is freely available for download at https://www.ncbi.nlm.nih.gov/CBBresearch/Yu/downloads.html.
Collapse
Affiliation(s)
- Gelio Alves
- National Center for Biotehnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, United States
| | - Yi-Kuo Yu
- National Center for Biotehnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, United States
| |
Collapse
|
2
|
Margelevičius M. Estimating statistical significance of local protein profile-profile alignments. BMC Bioinformatics 2019; 20:419. [PMID: 31409275 PMCID: PMC6693267 DOI: 10.1186/s12859-019-2913-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/23/2019] [Accepted: 05/23/2019] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Alignment of sequence families described by profiles provides a sensitive means for establishing homology between proteins and is important in protein evolutionary, structural, and functional studies. In the context of a steadily growing amount of sequence data, estimating the statistical significance of alignments, including profile-profile alignments, plays a key role in alignment-based homology search algorithms. Still, it is an open question as to what and whether one type of distribution governs profile-profile alignment score, especially when profile-profile substitution scores involve such terms as secondary structure predictions. RESULTS This study presents a methodology for estimating the statistical significance of this type of alignments. The methodology rests on a new algorithm developed for generating random profiles such that their alignment scores are distributed similarly to those obtained for real unrelated profiles. We show that improvements in statistical accuracy and sensitivity and high-quality alignment rate result from statistically characterizing alignments by establishing the dependence of statistical parameters on various measures associated with both individual and pairwise profile characteristics. Implemented in the COMER software, the proposed methodology yielded an increase of up to 34.2% in the number of true positives and up to 61.8% in the number of high-quality alignments with respect to the previous version of the COMER method. CONCLUSIONS The more accurate estimation of statistical significance is implemented in the COMER method, which is now more sensitive and provides an increased rate of high-quality profile-profile alignments. The results of the present study also suggest directions for future research.
Collapse
Affiliation(s)
- Mindaugas Margelevičius
- Institute of Biotechnology, Life Sciences Center, Vilnius University, Saulėtekio al. 7, Vilnius, 10257, Lithuania.
| |
Collapse
|
3
|
Carroll HD, Spouge JL, Gonzalez M. MultiDomainBenchmark: a multi-domain query and subject database suite. BMC Bioinformatics 2019; 20:77. [PMID: 30764761 PMCID: PMC6376684 DOI: 10.1186/s12859-019-2660-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/14/2018] [Accepted: 01/28/2019] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Genetic sequence database retrieval benchmarks play an essential role in evaluating the performance of sequence searching tools. To date, all phylogenetically diverse benchmarks known to the authors include only query sequences with single protein domains. Domains are the primary building blocks of protein structure and function. Independently, each domain can fulfill a single function, but most proteins (>80% in Metazoa) exist as multi-domain proteins. Multiple domain units combine in various arrangements or architectures to create different functions and are often under evolutionary pressures to yield new ones. Thus, it is crucial to create gold standards reflecting the multi-domain complexity of real proteins to more accurately evaluate sequence searching tools. DESCRIPTION This work introduces MultiDomainBenchmark (MDB), a database suite of 412 curated multi-domain queries and 227,512 target sequences, representing at least 5108 species and 1123 phylogenetically divergent protein families, their relevancy annotation, and domain location. Here, we use the benchmark to evaluate the performance of two commonly used sequence searching tools, BLAST/PSI-BLAST and HMMER. Additionally, we introduce a novel classification technique for multi-domain proteins to evaluate how well an algorithm recovers a domain architecture. CONCLUSION MDB is publicly available at http://csc.columbusstate.edu/carroll/MDB/ .
Collapse
Affiliation(s)
- Hyrum D. Carroll
- TSYS School of Computer Science, Columbus State University, 4225 University Avenue, Columbus, 31907 GA USA
| | - John L. Spouge
- National Center for Biotechnology Information, Bethesda, National Institutes of Health, 8600 Rockville Pike, Bethesda, 20894 MD USA
| | - Mileidy Gonzalez
- National Center for Biotechnology Information, Bethesda, National Institutes of Health, 8600 Rockville Pike, Bethesda, 20894 MD USA
| |
Collapse
|
4
|
Neuwald AF, Altschul SF. Statistical investigations of protein residue direct couplings. PLoS Comput Biol 2018; 14:e1006237. [PMID: 30596639 PMCID: PMC6329532 DOI: 10.1371/journal.pcbi.1006237] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2018] [Revised: 01/11/2019] [Accepted: 11/23/2018] [Indexed: 12/12/2022] Open
Abstract
Protein Direct Coupling Analysis (DCA), which predicts residue-residue contacts based on covarying positions within a multiple sequence alignment, has been remarkably effective. This suggests that there is more to learn from sequence correlations than is generally assumed, and calls for deeper investigations into DCA and perhaps into other types of correlations. Here we describe an approach that enables such investigations by measuring, as an estimated p-value, the statistical significance of the association between residue-residue covariance and structural interactions, either internal or homodimeric. Its application to thirty protein superfamilies confirms that direct coupling (DC) scores correlate with 3D pairwise contacts with very high significance. This method also permits quantitative assessment of the relative performance of alternative DCA methods, and of the degree to which they detect direct versus indirect couplings. We illustrate its use to assess, for a given protein, the biological relevance of alternative conformational states, to investigate the possible mechanistic implications of differences between these states, and to characterize subtle aspects of direct couplings. Our analysis indicates that direct pairwise correlations may be largely distinct from correlated patterns associated with functional specialization, and that the joint analysis of both types of correlations can yield greater power. Data, programs, and source code are freely available at http://evaldca.igs.umaryland.edu.
Collapse
Affiliation(s)
- Andrew F. Neuwald
- Institute for Genome Sciences and Department of Biochemistry & Molecular Biology, University of Maryland School of Medicine, Baltimore, Maryland, United States of America
| | - Stephen F. Altschul
- Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America
| |
Collapse
|
5
|
Alves G, Wang G, Ogurtsov AY, Drake SK, Gucek M, Sacks DB, Yu YK. Rapid Classification and Identification of Multiple Microorganisms with Accurate Statistical Significance via High-Resolution Tandem Mass Spectrometry. JOURNAL OF THE AMERICAN SOCIETY FOR MASS SPECTROMETRY 2018; 29:1721-1737. [PMID: 29873019 PMCID: PMC6061032 DOI: 10.1007/s13361-018-1986-y] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/13/2017] [Revised: 03/30/2018] [Accepted: 04/25/2018] [Indexed: 05/30/2023]
Abstract
Rapid and accurate identification and classification of microorganisms is of paramount importance to public health and safety. With the advance of mass spectrometry (MS) technology, the speed of identification can be greatly improved. However, the increasing number of microbes sequenced is complicating correct microbial identification even in a simple sample due to the large number of candidates present. To properly untwine candidate microbes in samples containing one or more microbes, one needs to go beyond apparent morphology or simple "fingerprinting"; to correctly prioritize the candidate microbes, one needs to have accurate statistical significance in microbial identification. We meet these challenges by using peptide-centric representations of microbes to better separate them and by augmenting our earlier analysis method that yields accurate statistical significance. Here, we present an updated analysis workflow that uses tandem MS (MS/MS) spectra for microbial identification or classification. We have demonstrated, using 226 MS/MS publicly available data files (each containing from 2500 to nearly 100,000 MS/MS spectra) and 4000 additional MS/MS data files, that the updated workflow can correctly identify multiple microbes at the genus and often the species level for samples containing more than one microbe. We have also shown that the proposed workflow computes accurate statistical significances, i.e., E values for identified peptides and unified E values for identified microbes. Our updated analysis workflow MiCId, a freely available software for Microorganism Classification and Identification, is available for download at https://www.ncbi.nlm.nih.gov/CBBresearch/Yu/downloads.html . Graphical Abstract ᅟ.
Collapse
Affiliation(s)
- Gelio Alves
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
| | - Guanghui Wang
- Proteomics Core, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, MD, 20892, USA
| | - Aleksey Y Ogurtsov
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
| | - Steven K Drake
- Critical Care Medicine Department, Clinical Center, National Institutes of Health, Bethesda, MD, 20892, USA
| | - Marjan Gucek
- Proteomics Core, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, MD, 20892, USA
| | - David B Sacks
- Department of Laboratory Medicine, Clinical Center, National Institutes of Health, Bethesda, MD, 20892, USA
| | - Yi-Kuo Yu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA.
| |
Collapse
|
6
|
Ladunga I. Finding Homologs in Amino Acid Sequences Using Network BLAST Searches. ACTA ACUST UNITED AC 2018; 59:3.4.1-3.4.24. [DOI: 10.1002/cpbi.34] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
Affiliation(s)
- Istvan Ladunga
- Departments of Statistics, Biochemistry and School of Biological Sciences, University of Nebraska–Lincoln Lincoln Nebraska
| |
Collapse
|
7
|
Pearson WR, Li W, Lopez R. Query-seeded iterative sequence similarity searching improves selectivity 5-20-fold. Nucleic Acids Res 2017; 45:e46. [PMID: 27923999 PMCID: PMC5605230 DOI: 10.1093/nar/gkw1207] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2016] [Accepted: 11/18/2016] [Indexed: 11/13/2022] Open
Abstract
Iterative similarity search programs, like psiblast, jackhmmer, and psisearch, are much more sensitive than pairwise similarity search methods like blast and ssearch because they build a position specific scoring model (a PSSM or HMM) that captures the pattern of sequence conservation characteristic to a protein family. But models are subject to contamination; once an unrelated sequence has been added to the model, homologs of the unrelated sequence will also produce high scores, and the model can diverge from the original protein family. Examination of alignment errors during psiblast PSSM contamination suggested a simple strategy for dramatically reducing PSSM contamination. psiblast PSSMs are built from the query-based multiple sequence alignment (MSA) implied by the pairwise alignments between the query model (PSSM, HMM) and the subject sequences in the library. When the original query sequence residues are inserted into gapped positions in the aligned subject sequence, the resulting PSSM rarely produces alignment over-extensions or alignments to unrelated sequences. This simple step, which tends to anchor the PSSM to the original query sequence and slightly increase target percent identity, can reduce the frequency of false-positive alignments more than 20-fold compared with psiblast and jackhmmer, with little loss in search sensitivity.
Collapse
Affiliation(s)
- William R Pearson
- Dept. of Biochemistry and Molecular Genetics, University of Virginia, School of Medicine, Charlottesville, VA 22908, USA
| | - Weizhong Li
- European Bioinformatics Institute, EMBL Outstation, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Rodrigo Lopez
- European Bioinformatics Institute, EMBL Outstation, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| |
Collapse
|
8
|
Sarajlić A, Malod-Dognin N, Yaveroğlu ÖN, Pržulj N. Graphlet-based Characterization of Directed Networks. Sci Rep 2016; 6:35098. [PMID: 27734973 PMCID: PMC5062067 DOI: 10.1038/srep35098] [Citation(s) in RCA: 47] [Impact Index Per Article: 5.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/29/2016] [Accepted: 09/26/2016] [Indexed: 01/22/2023] Open
Abstract
We are flooded with large-scale, dynamic, directed, networked data. Analyses requiring exact comparisons between networks are computationally intractable, so new methodologies are sought. To analyse directed networks, we extend graphlets (small induced sub-graphs) and their degrees to directed data. Using these directed graphlets, we generalise state-of-the-art network distance measures (RGF, GDDA and GCD) to directed networks and show their superiority for comparing directed networks. Also, we extend the canonical correlation analysis framework that enables uncovering the relationships between the wiring patterns around nodes in a directed network and their expert annotations. On directed World Trade Networks (WTNs), our methodology allows uncovering the core-broker-periphery structure of the WTN, predicting the economic attributes of a country, such as its gross domestic product, from its wiring patterns in the WTN for up-to ten years in the future. It does so by enabling us to track the dynamics of a country's positioning in the WTN over years. On directed metabolic networks, our framework yields insights into preservation of enzyme function from the network wiring patterns rather than from sequence data. Overall, our methodology enables advanced analyses of directed networked data from any area of science, allowing domain-specific interpretation of a directed network's topology.
Collapse
Affiliation(s)
- Anida Sarajlić
- Department of Computing, Imperial College London, SW7 2AZ London, UK
| | - Noël Malod-Dognin
- Department of Computer Science, University College London, WC1E 6BT London, UK
| | | | - Nataša Pržulj
- Department of Computer Science, University College London, WC1E 6BT London, UK
| |
Collapse
|
9
|
Alves G, Yu YK. Confidence assignment for mass spectrometry based peptide identifications via the extreme value distribution. Bioinformatics 2016; 32:2642-9. [PMID: 27153659 DOI: 10.1093/bioinformatics/btw225] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/05/2015] [Accepted: 04/16/2016] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION There is a growing trend for biomedical researchers to extract evidence and draw conclusions from mass spectrometry based proteomics experiments, the cornerstone of which is peptide identification. Inaccurate assignments of peptide identification confidence thus may have far-reaching and adverse consequences. Although some peptide identification methods report accurate statistics, they have been limited to certain types of scoring function. The extreme value statistics based method, while more general in the scoring functions it allows, demands accurate parameter estimates and requires, at least in its original design, excessive computational resources. Improving the parameter estimate accuracy and reducing the computational cost for this method has two advantages: it provides another feasible route to accurate significance assessment, and it could provide reliable statistics for scoring functions yet to be developed. RESULTS We have formulated and implemented an efficient algorithm for calculating the extreme value statistics for peptide identification applicable to various scoring functions, bypassing the need for searching large random databases. AVAILABILITY AND IMPLEMENTATION The source code, implemented in C ++ on a linux system, is available for download at ftp://ftp.ncbi.nlm.nih.gov/pub/qmbp/qmbp_ms/RAId/RAId_Linux_64Bit CONTACT yyu@ncbi.nlm.nih.gov SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Gelio Alves
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Yi-Kuo Yu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| |
Collapse
|
10
|
Alves G, Wang G, Ogurtsov AY, Drake SK, Gucek M, Suffredini AF, Sacks DB, Yu YK. Identification of Microorganisms by High Resolution Tandem Mass Spectrometry with Accurate Statistical Significance. JOURNAL OF THE AMERICAN SOCIETY FOR MASS SPECTROMETRY 2016; 27:194-210. [PMID: 26510657 PMCID: PMC4723618 DOI: 10.1007/s13361-015-1271-2] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/29/2015] [Revised: 09/04/2015] [Accepted: 09/05/2015] [Indexed: 05/13/2023]
Abstract
Correct and rapid identification of microorganisms is the key to the success of many important applications in health and safety, including, but not limited to, infection treatment, food safety, and biodefense. With the advance of mass spectrometry (MS) technology, the speed of identification can be greatly improved. However, the increasing number of microbes sequenced is challenging correct microbial identification because of the large number of choices present. To properly disentangle candidate microbes, one needs to go beyond apparent morphology or simple 'fingerprinting'; to correctly prioritize the candidate microbes, one needs to have accurate statistical significance in microbial identification. We meet these challenges by using peptidome profiles of microbes to better separate them and by designing an analysis method that yields accurate statistical significance. Here, we present an analysis pipeline that uses tandem MS (MS/MS) spectra for microbial identification or classification. We have demonstrated, using MS/MS data of 81 samples, each composed of a single known microorganism, that the proposed pipeline can correctly identify microorganisms at least at the genus and species levels. We have also shown that the proposed pipeline computes accurate statistical significances, i.e., E-values for identified peptides and unified E-values for identified microorganisms. The proposed analysis pipeline has been implemented in MiCId, a freely available software for Microorganism Classification and Identification. MiCId is available for download at http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/downloads.html . Graphical Abstract ᅟ.
Collapse
Affiliation(s)
- Gelio Alves
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
| | - Guanghui Wang
- Proteomics Core, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, MD, 20892, USA
| | - Aleksey Y Ogurtsov
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
| | - Steven K Drake
- Critical Care Medicine Department, Clinical Center, National Institutes of Health, Bethesda, MD, 20892, USA
| | - Marjan Gucek
- Proteomics Core, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, MD, 20892, USA
| | - Anthony F Suffredini
- Critical Care Medicine Department, Clinical Center, National Institutes of Health, Bethesda, MD, 20892, USA
| | - David B Sacks
- Department of Laboratory Medicine, Clinical Center, National Institutes of Health, Bethesda, MD, 20892, USA
| | - Yi-Kuo Yu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA.
| |
Collapse
|
11
|
Hoxha E, Campion SR. Structure-critical distribution of aromatic residues in the fibronectin type III protein family. Protein J 2014; 33:165-73. [PMID: 24563228 DOI: 10.1007/s10930-014-9549-y] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Abstract
Over a thousand individual Fibronectin type III (FnIII) domain sequences, extracted from more than 60 different FnIII-dependent protein super-structures, were downloaded from curated database resources. Three regions of extreme sequence conservation within the well-characterized FnIII β-sandwich structure were respectively defined by near absolute conservation of a tryptophan (Trp) in β-strand-B, tyrosines (Tyr) in both β-strand-C and β-strand-F, and a leucine (Leu) residue in the unstructured region immediately preceding β-strand-F. Employing these four conserved landmarks, the entire FnIII sequence dataset was vertically registered to align the three conserved regions, and the cumulative distribution of all other amino acid functionality was determined and plotted relative to these landmark residues. Conserved aromatic sites were each found to be flanked by aliphatic residues that assure localization of these sites to the inaccessible hydrophobic interface between major sheet structures. Mapping the location of conserved aromatic sites in numerous PDB structures demonstrated the consistent pair-wise co-localization of the indole side-chain of the conserved strand-B Trp site to within 0.35 nm of the phenolic side-chain of the strand-C Tyr site located 8-14 amino acids distal. Likewise, the side-chain of the strand-F Tyr site co-localized to within 0.45 nm of the aliphatic side-chain of the conserved Leu that uniformly precedes it by six residues. While classic hydropathy-based theories would deem the "burying" of Tyr and Trp side-chains and/or their association with hydrophobic FnIII core residues thermodynamically unnecessary, alternative contributions of conserved Trp and Tyr residues, and particularly the role of the absolutely conserved tyrosine phenolic -OH in native FnIII structure-function are considered. A more global role for conserved FnIII aromaticity is also discussed in light of the aromatic conservation observed in other well-established protein families.
Collapse
Affiliation(s)
- Ema Hoxha
- Department of Science and Mathematics, Alvernia University, Reading, PA, 19607, USA
| | | |
Collapse
|
12
|
Alves G, Yu YK. Mass spectrometry-based protein identification with accurate statistical significance assignment. ACTA ACUST UNITED AC 2014; 31:699-706. [PMID: 25362092 DOI: 10.1093/bioinformatics/btu717] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022]
Abstract
MOTIVATION Assigning statistical significance accurately has become increasingly important as metadata of many types, often assembled in hierarchies, are constructed and combined for further biological analyses. Statistical inaccuracy of metadata at any level may propagate to downstream analyses, undermining the validity of scientific conclusions thus drawn. From the perspective of mass spectrometry-based proteomics, even though accurate statistics for peptide identification can now be achieved, accurate protein level statistics remain challenging. RESULTS We have constructed a protein ID method that combines peptide evidences of a candidate protein based on a rigorous formula derived earlier; in this formula the database P-value of every peptide is weighted, prior to the final combination, according to the number of proteins it maps to. We have also shown that this protein ID method provides accurate protein level E-value, eliminating the need of using empirical post-processing methods for type-I error control. Using a known protein mixture, we find that this protein ID method, when combined with the Sorić formula, yields accurate values for the proportion of false discoveries. In terms of retrieval efficacy, the results from our method are comparable with other methods tested. AVAILABILITY AND IMPLEMENTATION The source code, implemented in C++ on a linux system, is available for download at ftp://ftp.ncbi.nlm.nih.gov/pub/qmbp/qmbp_ms/RAId/RAId_Linux_64Bit.
Collapse
Affiliation(s)
- Gelio Alves
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Yi-Kuo Yu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| |
Collapse
|
13
|
Ward N, Moreno-Hagelsieb G. Quickly finding orthologs as reciprocal best hits with BLAT, LAST, and UBLAST: how much do we miss? PLoS One 2014; 9:e101850. [PMID: 25013894 PMCID: PMC4094424 DOI: 10.1371/journal.pone.0101850] [Citation(s) in RCA: 107] [Impact Index Per Article: 10.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/12/2014] [Accepted: 06/11/2014] [Indexed: 11/30/2022] Open
Abstract
Reciprocal Best Hits (RBH) are a common proxy for orthology in comparative genomics. Essentially, a RBH is found when the proteins encoded by two genes, each in a different genome, find each other as the best scoring match in the other genome. NCBI's BLAST is the software most usually used for the sequence comparisons necessary to finding RBHs. Since sequence comparison can be time consuming, we decided to compare the number and quality of RBHs detected using algorithms that run in a fraction of the time as BLAST. We tested BLAT, LAST and UBLAST. All three programs ran in a hundredth to a 25th of the time required to run BLAST. A reduction in the number of homologs and RBHs found by the faster algorithms compared to BLAST becomes apparent as the genomes compared become more dissimilar, with BLAT, a program optimized for quickly finding very similar sequences, missing both the most homologs and the most RBHs. Though LAST produced the closest number of homologs and RBH to those produced with BLAST, UBLAST was very close, with either program producing between 0.6 and 0.8 of the RBHs as BLAST between dissimilar genomes, while in more similar genomes the differences were barely apparent. UBLAST ran faster than LAST, making it the best option among the programs tested.
Collapse
Affiliation(s)
- Natalie Ward
- Department of Biology, Wilfrid Laurier University, Waterloo, Ontario, Canada
| | | |
Collapse
|
14
|
Yaveroğlu ÖN, Malod-Dognin N, Davis D, Levnajic Z, Janjic V, Karapandza R, Stojmirovic A, Pržulj N. Revealing the hidden language of complex networks. Sci Rep 2014; 4:4547. [PMID: 24686408 PMCID: PMC3971399 DOI: 10.1038/srep04547] [Citation(s) in RCA: 78] [Impact Index Per Article: 7.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/17/2014] [Accepted: 03/13/2014] [Indexed: 12/03/2022] Open
Abstract
Sophisticated methods for analysing complex networks promise to be of great benefit to almost all scientific disciplines, yet they elude us. In this work, we make fundamental methodological advances to rectify this. We discover that the interaction between a small number of roles, played by nodes in a network, can characterize a network's structure and also provide a clear real-world interpretation. Given this insight, we develop a framework for analysing and comparing networks, which outperforms all existing ones. We demonstrate its strength by uncovering novel relationships between seemingly unrelated networks, such as Facebook, metabolic, and protein structure networks. We also use it to track the dynamics of the world trade network, showing that a country's role of a broker between non-trading countries indicates economic prosperity, whereas peripheral roles are associated with poverty. This result, though intuitive, has escaped all existing frameworks. Finally, our approach translates network topology into everyday language, bringing network analysis closer to domain scientists.
Collapse
Affiliation(s)
| | | | - Darren Davis
- Computer Science Department, University of California, Irvine, USA
| | - Zoran Levnajic
- 1] Department of Computing, Imperial College London, UK [2] Faculty of Information Studies in Novo mesto, Novo Mesto, Slovenia
| | - Vuk Janjic
- Department of Computing, Imperial College London, UK
| | - Rasa Karapandza
- Department of Finance, Accounting & Real Estate EBS Business School, Germany
| | - Aleksandar Stojmirovic
- 1] National Center for Biotechnology Information (NCBI), USA [2] Janssen Research and Development, LLC, Spring House, PA, USA
| | - Nataša Pržulj
- Department of Computing, Imperial College London, UK
| |
Collapse
|
15
|
Gruber A, Kroth PG. Deducing intracellular distributions of metabolic pathways from genomic data. Methods Mol Biol 2014; 1083:187-211. [PMID: 24218217 DOI: 10.1007/978-1-62703-661-0_12] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/29/2023]
Abstract
In the recent years, a large number of genomes from a variety of different organisms have been sequenced. Most of the sequence data has been publicly released and can be assessed by interested users. However, this wealth of information is currently underexploited by scientists not directly involved in genome annotation. This is partially because sequencing, assembly, and automated annotation can be done much faster than the identification, classification, and prediction of the intracellular localization of the gene products. This part of the annotation process still largely relies on manual curation and addition of contextual information. Users of genome databases who are unfamiliar with the types of data available from (whole) genomes might therefore find themselves either overwhelmed by the vast amount and multiple layers of data or dissatisfied with less-than-meaningful analyses of the data.In this chapter we present procedures and approaches to identify and characterize gene models of enzymes involved in metabolic pathways based on their similarity to known sequences. Furthermore we describe how to predict the subcellular location of the proteins using publicly available prediction servers and how to interpret the obtained results. The strategies we describe are generally applicable to organisms with primary plastids such as land plants or green algae. Additionally, we describe strategies suitable for those groups of algae with secondary plastids (for instance diatoms), which are characterized by a different cellular topology and a larger number of intracellular compartments compared to plants.
Collapse
Affiliation(s)
- Ansgar Gruber
- Fachbereich Biologie, Universität Konstanz, Konstanz, Germany
| | | |
Collapse
|
16
|
Wang R, Gao X. A Two-Layer Learning Architecture for Multi-Class Protein Folds Classification. Bioinformatics 2013. [DOI: 10.4018/978-1-4666-3604-0.ch041] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022] Open
Abstract
Classification of protein folds plays a very important role in the protein structure discovery process, especially when traditional sequence alignment methods fail to yield convincing structural homologies. In this chapter, we have developed a two-layer learning architecture, named TLLA, for multi-class protein folds classification. In the first layer, OET-KNN (Optimized Evidence-Theoretic K Nearest Neighbors) is used as the component classifier to find the most probable K-folds of the query protein. In the second layer, we use support vector machine (SVM) to build the multi-class classifier just on the K-folds, generated in the first layer, rather than on all the 27 folds. For multi-feature combination, ensemble strategy based on voting is selected to give the final classification result. The standard percentage accuracy of our method at ~63% is achieved on the independent testing dataset, where most of the proteins have <25% sequence identity with those in the training dataset. The experimental evaluation based on a widely used benchmark dataset has shown that our approach outperforms the competing methods, implying our approach might become a useful vehicle in the literature.
Collapse
|
17
|
Zhang Y, Misra S, Agrawal A, Patwary MMA, Liao WK, Qin Z, Choudhary A. Accelerating pairwise statistical significance estimation for local alignment by harvesting GPU's power. BMC Bioinformatics 2012; 13 Suppl 5:S3. [PMID: 22537007 PMCID: PMC3318904 DOI: 10.1186/1471-2105-13-s5-s3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/27/2022] Open
Abstract
Background Pairwise statistical significance has been recognized to be able to accurately identify related sequences, which is a very important cornerstone procedure in numerous bioinformatics applications. However, it is both computationally and data intensive, which poses a big challenge in terms of performance and scalability. Results We present a GPU implementation to accelerate pairwise statistical significance estimation of local sequence alignment using standard substitution matrices. By carefully studying the algorithm's data access characteristics, we developed a tile-based scheme that can produce a contiguous data access in the GPU global memory and sustain a large number of threads to achieve a high GPU occupancy. We further extend the parallelization technique to estimate pairwise statistical significance using position-specific substitution matrices, which has earlier demonstrated significantly better sequence comparison accuracy than using standard substitution matrices. The implementation is also extended to take advantage of dual-GPUs. We observe end-to-end speedups of nearly 250 (370) × using single-GPU Tesla C2050 GPU (dual-Tesla C2050) over the CPU implementation using Intel© Core™i7 CPU 920 processor. Conclusions Harvesting the high performance of modern GPUs is a promising approach to accelerate pairwise statistical significance estimation for local sequence alignment.
Collapse
Affiliation(s)
- Yuhong Zhang
- School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China.
| | | | | | | | | | | | | |
Collapse
|
18
|
Abstract
Tubulins are a family of GTPases that are key components of the cytoskeleton in all eukaryotes and are distantly related to the FtsZ GTPase that is involved in cell division in most bacteria and many archaea. Among prokaryotes, bona fide tubulins have been identified only in bacteria of the genus Prosthecobacter. These bacterial tubulin genes appear to have been horizontally transferred from eukaryotes. Here we describe tubulins encoded in the genomes of thaumarchaeota of the genus Nitrosoarchaeum that we denote artubulins Phylogenetic analysis results are compatible with the origin of eukaryotic tubulins from artubulins. These findings expand the emerging picture of the origin of key components of eukaryotic functional systems from ancestral forms that are scattered among the extant archaea.
Collapse
Affiliation(s)
- Natalya Yutin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | | |
Collapse
|
19
|
WONG WINGCHEONG, MAURER-STROH SEBASTIAN, EISENHABER FRANK. THE JANUS-FACED E-VALUES OF HMMER2: EXTREME VALUE DISTRIBUTION OR LOGISTIC FUNCTION? J Bioinform Comput Biol 2011. [DOI: 10.1142/s0219720011005264] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
E-value guided extrapolation of protein domain annotation from libraries such as Pfam with the HMMER suite is indispensable for hypothesizing about the function of experimentally uncharacterized protein sequences. Since the recent release of HMMER3 does not supersede all functions of HMMER2, the latter will remain relevant for ongoing research as well as for the evaluation of annotations that reside in databases and in the literature. In HMMER2, the E-value is computed from the score via a logistic function or via a domain model-specific extreme value distribution (EVD); the lower of the two is returned as E-value for the domain hit in the query sequence. We find that, for thousands of domain models, this treatment results in switching from the EVD to the statistical model with the logistic function when scores grow (for Pfam release 23, 99% in the global mode and 75% in the fragment mode). If the score corresponding to the breakpoint results in an E-value above a user-defined threshold (e.g. 0.1), a critical score region with conflicting E-values from the logistic function (below the threshold) and from EVD (above the threshold) does exist. Thus, this switch will affect E-value guided annotation decisions in an automated mode. To emphasize, switching in the fragment mode is of no practical relevance since it occurs only at E-values far below 0.1. Unfortunately, a critical score region does exist for 185 domain models in the hmmpfam and 1,748 domain models in the hmmsearch global-search mode. For 145 out the respective 185 models, the critical score region is indeed populated by actual sequences. In total, 24.4% of their hits have a logistic function-derived E-value < 0.1 when the EVD provides an E-value > 0.1. We provide examples of false annotations and critically discuss the appropriateness of a logistic function as alternative to the EVD.
Collapse
Affiliation(s)
- WING-CHEONG WONG
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A *STAR), 30 Biopolis Street, #07-01, Matrix, Singapore 138671, Singapore
| | - SEBASTIAN MAURER-STROH
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A *STAR), 30 Biopolis Street, #07-01, Matrix, Singapore 138671, Singapore
- School of Biological Sciences (SBS), Nanyang Technological University (NTU), 60 Nanyang Drive, Singapore 63755, Singapore
| | - FRANK EISENHABER
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A *STAR), 30 Biopolis Street, #07-01, Matrix, Singapore 138671, Singapore
- Department of Biological Sciences (DBS), National University of Singapore (NUS), 8 Medical Drive, Singapore 117597, Singapore
- School of Computer Engineering (SCE), Nanyang Technological University (NTU), 50 Nanyang Drive, Singapore 637553, Singapore
| |
Collapse
|
20
|
Abstract
Profile hidden Markov models (profile HMMs) and probabilistic inference methods have made important contributions to the theory of sequence database homology search. However, practical use of profile HMM methods has been hindered by the computational expense of existing software implementations. Here I describe an acceleration heuristic for profile HMMs, the "multiple segment Viterbi" (MSV) algorithm. The MSV algorithm computes an optimal sum of multiple ungapped local alignment segments using a striped vector-parallel approach previously described for fast Smith/Waterman alignment. MSV scores follow the same statistical distribution as gapped optimal local alignment scores, allowing rapid evaluation of significance of an MSV score and thus facilitating its use as a heuristic filter. I also describe a 20-fold acceleration of the standard profile HMM Forward/Backward algorithms using a method I call "sparse rescaling". These methods are assembled in a pipeline in which high-scoring MSV hits are passed on for reanalysis with the full HMM Forward/Backward algorithm. This accelerated pipeline is implemented in the freely available HMMER3 software package. Performance benchmarks show that the use of the heuristic MSV filter sacrifices negligible sensitivity compared to unaccelerated profile HMM searches. HMMER3 is substantially more sensitive and 100- to 1000-fold faster than HMMER2. HMMER3 is now about as fast as BLAST for protein searches.
Collapse
Affiliation(s)
- Sean R Eddy
- HHMI Janelia Farm Research Campus, Ashburn, Virginia, United States of America.
| |
Collapse
|
21
|
Alves G, Yu YK. Combining independent, weighted P-values: achieving computational stability by a systematic expansion with controllable accuracy. PLoS One 2011; 6:e22647. [PMID: 21912585 PMCID: PMC3166143 DOI: 10.1371/journal.pone.0022647] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/22/2011] [Accepted: 06/27/2011] [Indexed: 11/27/2022] Open
Abstract
Given the expanding availability of scientific data and tools to analyze them, combining different assessments of the same piece of information has become increasingly important for social, biological, and even physical sciences. This task demands, to begin with, a method-independent standard, such as the -value, that can be used to assess the reliability of a piece of information. Good's formula and Fisher's method combine independent -values with respectively unequal and equal weights. Both approaches may be regarded as limiting instances of a general case of combining -values from groups; -values within each group are weighted equally, while weight varies by group. When some of the weights become nearly degenerate, as cautioned by Good, numeric instability occurs in computation of the combined -values. We deal explicitly with this difficulty by deriving a controlled expansion, in powers of differences in inverse weights, that provides both accurate statistics and stable numerics. We illustrate the utility of this systematic approach with a few examples. In addition, we also provide here an alternative derivation for the probability distribution function of the general case and show how the analytic formula obtained reduces to both Good's and Fisher's methods as special cases. A C++ program, which computes the combined -values with equal numerical stability regardless of whether weights are (nearly) degenerate or not, is available for download at our group website http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/downloads/CoinedPValues.html.
Collapse
Affiliation(s)
- Gelio Alves
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America
| | - Yi-Kuo Yu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America
- * E-mail:
| |
Collapse
|
22
|
Agrawal A, Huang X. Pairwise statistical significance of local sequence alignment using sequence-specific and position-specific substitution matrices. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2011; 8:194-205. [PMID: 21071807 DOI: 10.1109/tcbb.2009.69] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/30/2023]
Abstract
Pairwise sequence alignment is a central problem in bioinformatics, which forms the basis of various other applications. Two related sequences are expected to have a high alignment score, but relatedness is usually judged by statistical significance rather than by alignment score. Recently, it was shown that pairwise statistical significance gives promising results as an alternative to database statistical significance for getting individual significance estimates of pairwise alignment scores. The improvement was mainly attributed to making the statistical significance estimation process more sequence-specific and database-independent. In this paper, we use sequence-specific and position-specific substitution matrices to derive the estimates of pairwise statistical significance, which is expected to use more sequence-specific information in estimating pairwise statistical significance. Experiments on a benchmark database with sequence-specific substitution matrices at different levels of sequence-specific contribution were conducted, and results confirm that using sequence-specific substitution matrices for estimating pairwise statistical significance is significantly better than using a standard matrix like BLOSUM62, and than database statistical significance estimates reported by popular database search programs like BLAST, PSI-BLAST (without pretrained PSSMs), and SSEARCH on a benchmark database, but with pretrained PSSMs, PSI-BLAST results are significantly better. Further, using position-specific substitution matrices for estimating pairwise statistical significance gives significantly better results even than PSI-BLAST using pretrained PSSMs.
Collapse
Affiliation(s)
- Ankit Agrawal
- Department of Computer Science, Iowa State University, 226 Atanasoff Hall, Ames, IA 50011-1041, USA.
| | | |
Collapse
|
23
|
RAId_aPS: MS/MS analysis with multiple scoring functions and spectrum-specific statistics. PLoS One 2010; 5:e15438. [PMID: 21103371 PMCID: PMC2982831 DOI: 10.1371/journal.pone.0015438] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/29/2010] [Accepted: 09/20/2010] [Indexed: 11/26/2022] Open
Abstract
Statistically meaningful comparison/combination of peptide identification results from various search methods is impeded by the lack of a universal statistical standard. Providing an -value calibration protocol, we demonstrated earlier the feasibility of translating either the score or heuristic -value reported by any method into the textbook-defined -value, which may serve as the universal statistical standard. This protocol, although robust, may lose spectrum-specific statistics and might require a new calibration when changes in experimental setup occur. To mitigate these issues, we developed a new MS/MS search tool, RAId_aPS, that is able to provide spectrum-specific-values for additive scoring functions. Given a selection of scoring functions out of RAId score, K-score, Hyperscore and XCorr, RAId_aPS generates the corresponding score histograms of all possible peptides using dynamic programming. Using these score histograms to assign -values enables a calibration-free protocol for accurate significance assignment for each scoring function. RAId_aPS features four different modes: (i) compute the total number of possible peptides for a given molecular mass range, (ii) generate the score histogram given a MS/MS spectrum and a scoring function, (iii) reassign -values for a list of candidate peptides given a MS/MS spectrum and the scoring functions chosen, and (iv) perform database searches using selected scoring functions. In modes (iii) and (iv), RAId_aPS is also capable of combining results from different scoring functions using spectrum-specific statistics. The web link is http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/raid_aps/index.html. Relevant binaries for Linux, Windows, and Mac OS X are available from the same page.
Collapse
|
24
|
Aniba MR, Poch O, Marchler-Bauer A, Thompson JD. AlexSys: a knowledge-based expert system for multiple sequence alignment construction and analysis. Nucleic Acids Res 2010; 38:6338-49. [PMID: 20530533 PMCID: PMC2965243 DOI: 10.1093/nar/gkq526] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/15/2023] Open
Abstract
Multiple sequence alignment (MSA) is a cornerstone of modern molecular biology and represents a unique means of investigating the patterns of conservation and diversity in complex biological systems. Many different algorithms have been developed to construct MSAs, but previous studies have shown that no single aligner consistently outperforms the rest. This has led to the development of a number of ‘meta-methods’ that systematically run several aligners and merge the output into one single solution. Although these methods generally produce more accurate alignments, they are inefficient because all the aligners need to be run first and the choice of the best solution is made a posteriori. Here, we describe the development of a new expert system, AlexSys, for the multiple alignment of protein sequences. AlexSys incorporates an intelligent inference engine to automatically select an appropriate aligner a priori, depending only on the nature of the input sequences. The inference engine was trained on a large set of reference multiple alignments, using a novel machine learning approach. Applying AlexSys to a test set of 178 alignments, we show that the expert system represents a good compromise between alignment quality and running time, making it suitable for high throughput projects. AlexSys is freely available from http://alnitak.u-strasbg.fr/∼aniba/alexsys.
Collapse
Affiliation(s)
- Mohamed Radhouene Aniba
- Department of Structural Biology and Genomics, Institut de Génétique et de Biologie Moléculaire et Cellulaire, Institut National de la Santé et de la Recherche Médicale, The Centre National de la Recherche Scientifique, UMR7104, F-67400 Illkirch, Université Louis Pasteur, F-67000 Strasbourg, France
| | | | | | | |
Collapse
|
25
|
Mizianty MJ, Kurgan L. Modular prediction of protein structural classes from sequences of twilight-zone identity with predicting sequences. BMC Bioinformatics 2009; 10:414. [PMID: 20003388 PMCID: PMC2805645 DOI: 10.1186/1471-2105-10-414] [Citation(s) in RCA: 79] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/06/2009] [Accepted: 12/13/2009] [Indexed: 11/13/2022] Open
Abstract
Background Knowledge of structural class is used by numerous methods for identification of structural/functional characteristics of proteins and could be used for the detection of remote homologues, particularly for chains that share twilight-zone similarity. In contrast to existing sequence-based structural class predictors, which target four major classes and which are designed for high identity sequences, we predict seven classes from sequences that share twilight-zone identity with the training sequences. Results The proposed MODular Approach to Structural class prediction (MODAS) method is unique as it allows for selection of any subset of the classes. MODAS is also the first to utilize a novel, custom-built feature-based sequence representation that combines evolutionary profiles and predicted secondary structure. The features quantify information relevant to the definition of the classes including conservation of residues and arrangement and number of helix/strand segments. Our comprehensive design considers 8 feature selection methods and 4 classifiers to develop Support Vector Machine-based classifiers that are tailored for each of the seven classes. Tests on 5 twilight-zone and 1 high-similarity benchmark datasets and comparison with over two dozens of modern competing predictors show that MODAS provides the best overall accuracy that ranges between 80% and 96.7% (83.5% for the twilight-zone datasets), depending on the dataset. This translates into 19% and 8% error rate reduction when compared against the best performing competing method on two largest datasets. The proposed predictor provides accurate predictions at 58% accuracy for membrane proteins class, which is not considered by majority of existing methods, in spite that this class accounts for only 2% of the data. Our predictive model is analyzed to demonstrate how and why the input features are associated with the corresponding classes. Conclusions The improved predictions stem from the novel features that express collocation of the secondary structure segments in the protein sequence and that combine evolutionary and secondary structure information. Our work demonstrates that conservation and arrangement of the secondary structure segments predicted along the protein chain can successfully predict structural classes which are defined based on the spatial arrangement of the secondary structures. A web server is available at http://biomine.ece.ualberta.ca/MODAS/.
Collapse
Affiliation(s)
- Marcin J Mizianty
- Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Canada.
| | | |
Collapse
|
26
|
Nguyen VH, Lavenier D. PLAST: parallel local alignment search tool for database comparison. BMC Bioinformatics 2009; 10:329. [PMID: 19821978 PMCID: PMC2770072 DOI: 10.1186/1471-2105-10-329] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/18/2009] [Accepted: 10/12/2009] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Sequence similarity searching is an important and challenging task in molecular biology and next-generation sequencing should further strengthen the need for faster algorithms to process such vast amounts of data. At the same time, the internal architecture of current microprocessors is tending towards more parallelism, leading to the use of chips with two, four and more cores integrated on the same die. The main purpose of this work was to design an effective algorithm to fit with the parallel capabilities of modern microprocessors. RESULTS A parallel algorithm for comparing large genomic banks and targeting middle-range computers has been developed and implemented in PLAST software. The algorithm exploits two key parallel features of existing and future microprocessors: the SIMD programming model (SSE instruction set) and the multithreading concept (multicore). Compared to multithreaded BLAST software, tests performed on an 8-processor server have shown speedup ranging from 3 to 6 with a similar level of accuracy. CONCLUSION A parallel algorithmic approach driven by the knowledge of the internal microprocessor architecture allows significant speedup to be obtained while preserving standard sensitivity for similarity search problems.
Collapse
Affiliation(s)
- Van Hoa Nguyen
- INRIA/IRISA, Campus de Beaulieu, 35042 Rennes Cedex, France.
| | | |
Collapse
|
27
|
Forslund K, Sonnhammer ELL. Benchmarking homology detection procedures with low complexity filters. ACTA ACUST UNITED AC 2009; 25:2500-5. [PMID: 19620098 DOI: 10.1093/bioinformatics/btp446] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022]
Abstract
BACKGROUND Low-complexity sequence regions present a common problem in finding true homologs to a protein query sequence. Several solutions to this have been suggested, but a detailed comparison between these on challenging data has so far been lacking. A common benchmark for homology detection procedures is to use SCOP/ASTRAL domain sequences belonging to the same or different superfamilies, but these contain almost no low complexity sequences. RESULTS We here introduce an alternative benchmarking strategy based around Pfam domains and clans on whole-proteome data sets. This gives a realistic level of low complexity sequences. We used it to evaluate all six built-in BLAST low complexity filter settings as well as a range of settings in the MSPcrunch post-processing filter. The effect on alignment length was also assessed. CONCLUSION Score matrix adjustment methods provide a low false positive rate at a relatively small loss in sensitivity relative to no filtering, across the range of test conditions we apply. MSPcrunch achieved even less loss in sensitivity, but at a higher false positive rate. A drawback of the score matrix adjustment methods is however that the alignments often become truncated. AVAILABILITY Perl scripts for MSPcrunch BLAST filtering and for generating the benchmark dataset are available at http://sonnhammer.sbc.su.se/download/software/MSPcrunch+Blixem/benchmark.tar.gz
Collapse
Affiliation(s)
- Kristoffer Forslund
- Stockholm Bioinformatics Center, Stockholm University, SE-10691 Stockholm, Sweden.
| | | |
Collapse
|
28
|
Agrawal A, Huang X. Pairwise statistical significance of local sequence alignment using multiple parameter sets and empirical justification of parameter set change penalty. BMC Bioinformatics 2009; 10 Suppl 3:S1. [PMID: 19344477 PMCID: PMC2665049 DOI: 10.1186/1471-2105-10-s3-s1] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Accurate estimation of statistical significance of a pairwise alignment is an important problem in sequence comparison. Recently, a comparative study of pairwise statistical significance with database statistical significance was conducted. In this paper, we extend the earlier work on pairwise statistical significance by incorporating with it the use of multiple parameter sets. RESULTS Results for a knowledge discovery application of homology detection reveal that using multiple parameter sets for pairwise statistical significance estimates gives better coverage than using a single parameter set, at least at some error levels. Further, the results of pairwise statistical significance using multiple parameter sets are shown to be significantly better than database statistical significance estimates reported by BLAST and PSI-BLAST, and comparable and at times significantly better than SSEARCH. Using non-zero parameter set change penalty values give better performance than zero penalty. CONCLUSION The fact that the homology detection performance does not degrade when using multiple parameter sets is a strong evidence for the validity of the assumption that the alignment score distribution follows an extreme value distribution even when using multiple parameter sets. Parameter set change penalty is a useful parameter for alignment using multiple parameter sets. Pairwise statistical significance using multiple parameter sets can be effectively used to determine the relatedness of a (or a few) pair(s) of sequences without performing a time-consuming database search.
Collapse
Affiliation(s)
- Ankit Agrawal
- Department of Computer Science, Iowa State University, 226 Atanasoff Hall, Ames, IA 50011-1041, USA.
| | | |
Collapse
|
29
|
Hopper-Borge EA, Nasto RE, Ratushny V, Weiner LM, Golemis EA, Astsaturov I. Mechanisms of tumor resistance to EGFR-targeted therapies. Expert Opin Ther Targets 2009; 13:339-62. [PMID: 19236156 PMCID: PMC2670612 DOI: 10.1517/14712590902735795] [Citation(s) in RCA: 66] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/11/2022]
Abstract
BACKGROUND Much effort has been devoted to development of cancer therapies targeting EGFR, based on its role in regulating cell growth. Small-molecule and antibody EGFR inhibitors have clinical roles based on their efficacy in a subset of cancers, generally as components of combination therapies. Many cancers are either initially resistant to EGFR inhibitors or become resistant during treatment, limiting the efficacy of these reagents. OBJECTIVE/METHODS To review cellular resistance mechanisms to EGFR-targeted therapies. RESULTS/CONCLUSIONS The best validated of these mechanisms include activation of classic ATP-binding casette (ABC) multidrug transporters; activation or mutation of EGFR; and overexpression or activation of signaling proteins operating in relation to EGFR. We discuss current efforts and potential strategies to override these sources of resistance. We describe emerging systems-biology-based concepts of alternative resistance to EGFR-targeted therapies, and discuss their implications for use of EGFR-targeted and other targeted therapies.
Collapse
Affiliation(s)
- Elizabeth A Hopper-Borge
- Fox Chase Cancer Center, W462, 333 Cottman Ave., Philadelphia, PA 19111, USA, Tel: (215) 728-2500; Fax: -3616; E-mail:
| | - Rochelle E Nasto
- Fox Chase Cancer Center, W462, 333 Cottman Ave., Philadelphia, PA 19111, USA, Tel: (215) 728-2500; Fax: -3616; E-mail:
- School of Biomedical Engineering, Science and Health Systems, Drexel University, Philadelphia, PA 19104, USA
| | - Vladimir Ratushny
- Fox Chase Cancer Center, W462, 333 Cottman Ave., Philadelphia, PA 19111, USA, Tel: (215) 728-2500; Fax: -3616; E-mail:
- Department of Biochemistry, Drexel University College of Medicine, Philadelphia, PA 19102, USA
| | - Louis M Weiner
- Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Washington, DC 20057-1468, USA
| | - Erica A Golemis
- Fox Chase Cancer Center, W462, 333 Cottman Ave., Philadelphia, PA 19111, USA, Tel: (215) 728-2500; Fax: -3616; E-mail:
| | - Igor Astsaturov
- Fox Chase Cancer Center, W462, 333 Cottman Ave., Philadelphia, PA 19111, USA, Tel: (215) 728-2500; Fax: -3616; E-mail:
| |
Collapse
|
30
|
Agrawal A, Huang X. PSIBLAST_PairwiseStatSig: reordering PSI-BLAST hits using pairwise statistical significance. Bioinformatics 2009; 25:1082-3. [DOI: 10.1093/bioinformatics/btp089] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
|
31
|
Stojmirović A, Gertz EM, Altschul SF, Yu YK. The effectiveness of position- and composition-specific gap costs for protein similarity searches. Bioinformatics 2008; 24:i15-23. [PMID: 18586708 PMCID: PMC2718649 DOI: 10.1093/bioinformatics/btn171] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022] Open
Abstract
Motivation: The flexibility in gap cost enjoyed by hidden Markov models (HMMs) is expected to afford them better retrieval accuracy than position-specific scoring matrices (PSSMs). We attempt to quantify the effect of more general gap parameters by separately examining the influence of position- and composition-specific gap scores, as well as by comparing the retrieval accuracy of the PSSMs constructed using an iterative procedure to that of the HMMs provided by Pfam and SUPERFAMILY, curated ensembles of multiple alignments. Results: We found that position-specific gap penalties have an advantage over uniform gap costs. We did not explore optimizing distinct uniform gap costs for each query. For Pfam, PSSMs iteratively constructed from seeds based on HMM consensus sequences perform equivalently to HMMs that were adjusted to have constant gap transition probabilities, albeit with much greater variance. We observed no effect of composition-specific gap costs on retrieval performance. These results suggest possible improvements to the PSI-BLAST protein database search program. Availability: The scripts for performing evaluations are available upon request from the authors. Contact:yyu@ncbi.nlm.nih.gov
Collapse
Affiliation(s)
- Aleksandar Stojmirović
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | | | | | | |
Collapse
|
32
|
Alves G, Wu WW, Wang G, Shen RF, Yu YK. Enhancing peptide identification confidence by combining search methods. J Proteome Res 2008; 7:3102-13. [PMID: 18558733 PMCID: PMC2658881 DOI: 10.1021/pr700798h] [Citation(s) in RCA: 62] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
Abstract
![]()
Confident peptide identification is one of the most important components in mass-spectrometry-based proteomics. We propose a method to properly combine the results from different database search methods to enhance the accuracy of peptide identifications. The database search methods included in our analysis are SEQUEST (v27 rev12), ProbID (v1.0), InsPecT (v20060505), Mascot (v2.1), X! Tandem (v2007.07.01.2), OMSSA (v2.0) and RAId_DbS. Using two data sets, one collected in profile mode and one collected in centroid mode, we tested the search performance of all 21 combinations of two search methods as well as all 35 possible combinations of three search methods. The results obtained from our study suggest that properly combining search methods does improve retrieval accuracy. In addition to performance results, we also describe the theoretical framework which in principle allows one to combine many independent scoring methods including de novo sequencing and spectral library searches. The correlations among different methods are also investigated in terms of common true positives, common false positives, and a global analysis. We find that the average correlation strength, between any pairwise combination of the seven methods studied, is usually smaller than the associated standard error. This indicates only weak correlation may be present among different methods and validates our approach in combining the search results. The usefulness of our approach is further confirmed by showing that the average cumulative number of false positive peptides agrees reasonably well with the combined E-value. The data related to this study are freely available upon request. Properly combining the results from different database search methods will enhance the accuracy of peptide identifications. This nevertheless requires, among different search methods, a common statistical standard, which can be achieved by statistical calibration of E-values. After transforming E-values into database P-values, we use the well-established Fisher’s formula to combine different P-values. Our protocol provides a statistically sound method for integration of search results.
Collapse
Affiliation(s)
- Gelio Alves
- National Center for Biotechnology Information, Library of Medicine, NIH, Bethesda, MD 20894, USA
| | | | | | | | | |
Collapse
|
33
|
Drees C, Stürmer CA, Möller HM, Fritz G. Expression and purification of neurolin immunoglobulin domain 2 from Carrassius auratus (goldfish) in Escherichia coli. Protein Expr Purif 2008; 59:47-54. [DOI: 10.1016/j.pep.2007.12.015] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/26/2007] [Revised: 12/25/2007] [Accepted: 12/28/2007] [Indexed: 10/22/2022]
|
34
|
SCPRED: accurate prediction of protein structural class for sequences of twilight-zone similarity with predicting sequences. BMC Bioinformatics 2008; 9:226. [PMID: 18452616 PMCID: PMC2391167 DOI: 10.1186/1471-2105-9-226] [Citation(s) in RCA: 119] [Impact Index Per Article: 7.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2007] [Accepted: 05/01/2008] [Indexed: 11/16/2022] Open
Abstract
Background Protein structure prediction methods provide accurate results when a homologous protein is predicted, while poorer predictions are obtained in the absence of homologous templates. However, some protein chains that share twilight-zone pairwise identity can form similar folds and thus determining structural similarity without the sequence similarity would be desirable for the structure prediction. The folding type of a protein or its domain is defined as the structural class. Current structural class prediction methods that predict the four structural classes defined in SCOP provide up to 63% accuracy for the datasets in which sequence identity of any pair of sequences belongs to the twilight-zone. We propose SCPRED method that improves prediction accuracy for sequences that share twilight-zone pairwise similarity with sequences used for the prediction. Results SCPRED uses a support vector machine classifier that takes several custom-designed features as its input to predict the structural classes. Based on extensive design that considers over 2300 index-, composition- and physicochemical properties-based features along with features based on the predicted secondary structure and content, the classifier's input includes 8 features based on information extracted from the secondary structure predicted with PSI-PRED and one feature computed from the sequence. Tests performed with datasets of 1673 protein chains, in which any pair of sequences shares twilight-zone similarity, show that SCPRED obtains 80.3% accuracy when predicting the four SCOP-defined structural classes, which is superior when compared with over a dozen recent competing methods that are based on support vector machine, logistic regression, and ensemble of classifiers predictors. Conclusion The SCPRED can accurately find similar structures for sequences that share low identity with sequence used for the prediction. The high predictive accuracy achieved by SCPRED is attributed to the design of the features, which are capable of separating the structural classes in spite of their low dimensionality. We also demonstrate that the SCPRED's predictions can be successfully used as a post-processing filter to improve performance of modern fold classification methods.
Collapse
|
35
|
Brendel V. DNA, Words and Models-Statistics of Exceptional Words by S. Robin, F. Rodolphe, and S. Schbath. Biometrics 2008. [DOI: 10.1111/j.1541-0420.2008.00962_4.x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
|
36
|
Sadreyev RI, Grishin NV. Accurate statistical model of comparison between multiple sequence alignments. Nucleic Acids Res 2008; 36:2240-8. [PMID: 18285364 PMCID: PMC2367703 DOI: 10.1093/nar/gkn065] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/04/2022] Open
Abstract
Comparison of multiple protein sequence alignments (MSA) reveals unexpected evolutionary relations between protein families and leads to exciting predictions of spatial structure and function. The power of MSA comparison critically depends on the quality of statistical model used to rank the similarities found in a database search, so that biologically relevant relationships are discriminated from spurious connections. Here, we develop an accurate statistical description of MSA comparison that does not originate from conventional models of single sequence comparison and captures essential features of protein families. As a final result, we compute E-values for the similarity between any two MSA using a mathematical function that depends on MSA lengths and sequence diversity. To develop these estimates of statistical significance, we first establish a procedure for generating realistic alignment decoys that reproduce natural patterns of sequence conservation dictated by protein secondary structure. Second, since similarity scores between these alignments do not follow the classic Gumbel extreme value distribution, we propose a novel distribution that yields statistically perfect agreement with the data. Third, we apply this random model to database searches and show that it surpasses conventional models in the accuracy of detecting remote protein similarities.
Collapse
|
37
|
Chen K, Kurgan L. PFRES: protein fold classification by using evolutionary information and predicted secondary structure. Bioinformatics 2007; 23:2843-50. [DOI: 10.1093/bioinformatics/btm475] [Citation(s) in RCA: 91] [Impact Index Per Article: 5.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
|
38
|
Gertz EM, Yu YK, Agarwala R, Schäffer AA, Altschul SF. Composition-based statistics and translated nucleotide searches: improving the TBLASTN module of BLAST. BMC Biol 2006; 4:41. [PMID: 17156431 PMCID: PMC1779365 DOI: 10.1186/1741-7007-4-41] [Citation(s) in RCA: 343] [Impact Index Per Article: 19.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/06/2006] [Accepted: 12/07/2006] [Indexed: 11/29/2022] Open
Abstract
Background TBLASTN is a mode of operation for BLAST that aligns protein sequences to a nucleotide database translated in all six frames. We present the first description of the modern implementation of TBLASTN, focusing on new techniques that were used to implement composition-based statistics for translated nucleotide searches. Composition-based statistics use the composition of the sequences being aligned to generate more accurate E-values, which allows for a more accurate distinction between true and false matches. Until recently, composition-based statistics were available only for protein-protein searches. They are now available as a command line option for recent versions of TBLASTN and as an option for TBLASTN on the NCBI BLAST web server. Results We evaluate the statistical and retrieval accuracy of the E-values reported by a baseline version of TBLASTN and by two variants that use different types of composition-based statistics. To test the statistical accuracy of TBLASTN, we ran 1000 searches using scrambled proteins from the mouse genome and a database of human chromosomes. To test retrieval accuracy, we modernize and adapt to translated searches a test set previously used to evaluate the retrieval accuracy of protein-protein searches. We show that composition-based statistics greatly improve the statistical accuracy of TBLASTN, at a small cost to the retrieval accuracy. Conclusion TBLASTN is widely used, as it is common to wish to compare proteins to chromosomes or to libraries of mRNAs. Composition-based statistics improve the statistical accuracy, and therefore the reliability, of TBLASTN results. The algorithms used by TBLASTN are not widely known, and some of the most important are reported here. The data used to test TBLASTN are available for download and may be useful in other studies of translated search algorithms.
Collapse
Affiliation(s)
- E Michael Gertz
- National Center for Biotechnology Information, National Institutes of Health, Department of Health and Human Services, Bethesda, MD, USA
| | - Yi-Kuo Yu
- National Center for Biotechnology Information, National Institutes of Health, Department of Health and Human Services, Bethesda, MD, USA
| | - Richa Agarwala
- National Center for Biotechnology Information, National Institutes of Health, Department of Health and Human Services, Bethesda, MD, USA
| | - Alejandro A Schäffer
- National Center for Biotechnology Information, National Institutes of Health, Department of Health and Human Services, Bethesda, MD, USA
| | - Stephen F Altschul
- National Center for Biotechnology Information, National Institutes of Health, Department of Health and Human Services, Bethesda, MD, USA
| |
Collapse
|