Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Yu YK, Gertz EM, Agarwala R, Schäffer AA, Altschul SF. Retrieval accuracy, statistical significance and compositional similarity in protein sequence database searches. Nucleic Acids Res 2006;34:5966-73. [PMID: 17068079 PMCID: PMC1635310 DOI: 10.1093/nar/gkl731] [Citation(s) in RCA: 47] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open

For:	Yu YK, Gertz EM, Agarwala R, Schäffer AA, Altschul SF. Retrieval accuracy, statistical significance and compositional similarity in protein sequence database searches. Nucleic Acids Res 2006;34:5966-73. [PMID: 17068079 PMCID: PMC1635310 DOI: 10.1093/nar/gkl731] [Citation(s) in RCA: 47] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open

Number

Cited by Other Article(s)

Alves G, Yu YK. Robust Accurate Identification and Biomass Estimates of Microorganisms via Tandem Mass Spectrometry. JOURNAL OF THE AMERICAN SOCIETY FOR MASS SPECTROMETRY 2020;31:85-102. [PMID: 32881514 PMCID: PMC10501333 DOI: 10.1021/jasms.9b00035] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]

Margelevičius M. Estimating statistical significance of local protein profile-profile alignments. BMC Bioinformatics 2019;20:419. [PMID: 31409275 PMCID: PMC6693267 DOI: 10.1186/s12859-019-2913-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/23/2019] [Accepted: 05/23/2019] [Indexed: 11/10/2022] Open

Carroll HD, Spouge JL, Gonzalez M. MultiDomainBenchmark: a multi-domain query and subject database suite. BMC Bioinformatics 2019;20:77. [PMID: 30764761 PMCID: PMC6376684 DOI: 10.1186/s12859-019-2660-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/14/2018] [Accepted: 01/28/2019] [Indexed: 11/10/2022] Open

Neuwald AF, Altschul SF. Statistical investigations of protein residue direct couplings. PLoS Comput Biol 2018;14:e1006237. [PMID: 30596639 PMCID: PMC6329532 DOI: 10.1371/journal.pcbi.1006237] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2018] [Revised: 01/11/2019] [Accepted: 11/23/2018] [Indexed: 12/12/2022] Open

Alves G, Wang G, Ogurtsov AY, Drake SK, Gucek M, Sacks DB, Yu YK. Rapid Classification and Identification of Multiple Microorganisms with Accurate Statistical Significance via High-Resolution Tandem Mass Spectrometry. JOURNAL OF THE AMERICAN SOCIETY FOR MASS SPECTROMETRY 2018;29:1721-1737. [PMID: 29873019 PMCID: PMC6061032 DOI: 10.1007/s13361-018-1986-y] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/13/2017] [Revised: 03/30/2018] [Accepted: 04/25/2018] [Indexed: 05/30/2023]

Ladunga I. Finding Homologs in Amino Acid Sequences Using Network BLAST Searches. ACTA ACUST UNITED AC 2018;59:3.4.1-3.4.24. [DOI: 10.1002/cpbi.34] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]

Pearson WR, Li W, Lopez R. Query-seeded iterative sequence similarity searching improves selectivity 5-20-fold. Nucleic Acids Res 2017;45:e46. [PMID: 27923999 PMCID: PMC5605230 DOI: 10.1093/nar/gkw1207] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2016] [Accepted: 11/18/2016] [Indexed: 11/13/2022] Open

Sarajlić A, Malod-Dognin N, Yaveroğlu ÖN, Pržulj N. Graphlet-based Characterization of Directed Networks. Sci Rep 2016;6:35098. [PMID: 27734973 PMCID: PMC5062067 DOI: 10.1038/srep35098] [Citation(s) in RCA: 47] [Impact Index Per Article: 5.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/29/2016] [Accepted: 09/26/2016] [Indexed: 01/22/2023] Open

Alves G, Yu YK. Confidence assignment for mass spectrometry based peptide identifications via the extreme value distribution. Bioinformatics 2016;32:2642-9. [PMID: 27153659 DOI: 10.1093/bioinformatics/btw225] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/05/2015] [Accepted: 04/16/2016] [Indexed: 11/14/2022] Open

Alves G, Wang G, Ogurtsov AY, Drake SK, Gucek M, Suffredini AF, Sacks DB, Yu YK. Identification of Microorganisms by High Resolution Tandem Mass Spectrometry with Accurate Statistical Significance. JOURNAL OF THE AMERICAN SOCIETY FOR MASS SPECTROMETRY 2016;27:194-210. [PMID: 26510657 PMCID: PMC4723618 DOI: 10.1007/s13361-015-1271-2] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/29/2015] [Revised: 09/04/2015] [Accepted: 09/05/2015] [Indexed: 05/13/2023]

Hoxha E, Campion SR. Structure-critical distribution of aromatic residues in the fibronectin type III protein family. Protein J 2014;33:165-73. [PMID: 24563228 DOI: 10.1007/s10930-014-9549-y] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]

Abstract

Over a thousand individual Fibronectin type III (FnIII) domain sequences, extracted from more than 60 different FnIII-dependent protein super-structures, were downloaded from curated database resources. Three regions of extreme sequence conservation within the well-characterized FnIII β-sandwich structure were respectively defined by near absolute conservation of a tryptophan (Trp) in β-strand-B, tyrosines (Tyr) in both β-strand-C and β-strand-F, and a leucine (Leu) residue in the unstructured region immediately preceding β-strand-F. Employing these four conserved landmarks, the entire FnIII sequence dataset was vertically registered to align the three conserved regions, and the cumulative distribution of all other amino acid functionality was determined and plotted relative to these landmark residues. Conserved aromatic sites were each found to be flanked by aliphatic residues that assure localization of these sites to the inaccessible hydrophobic interface between major sheet structures. Mapping the location of conserved aromatic sites in numerous PDB structures demonstrated the consistent pair-wise co-localization of the indole side-chain of the conserved strand-B Trp site to within 0.35 nm of the phenolic side-chain of the strand-C Tyr site located 8-14 amino acids distal. Likewise, the side-chain of the strand-F Tyr site co-localized to within 0.45 nm of the aliphatic side-chain of the conserved Leu that uniformly precedes it by six residues. While classic hydropathy-based theories would deem the "burying" of Tyr and Trp side-chains and/or their association with hydrophobic FnIII core residues thermodynamically unnecessary, alternative contributions of conserved Trp and Tyr residues, and particularly the role of the absolutely conserved tyrosine phenolic -OH in native FnIII structure-function are considered. A more global role for conserved FnIII aromaticity is also discussed in light of the aromatic conservation observed in other well-established protein families.

Collapse

Alves G, Yu YK. Mass spectrometry-based protein identification with accurate statistical significance assignment. ACTA ACUST UNITED AC 2014;31:699-706. [PMID: 25362092 DOI: 10.1093/bioinformatics/btu717] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022]

Ward N, Moreno-Hagelsieb G. Quickly finding orthologs as reciprocal best hits with BLAT, LAST, and UBLAST: how much do we miss? PLoS One 2014;9:e101850. [PMID: 25013894 PMCID: PMC4094424 DOI: 10.1371/journal.pone.0101850] [Citation(s) in RCA: 107] [Impact Index Per Article: 10.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/12/2014] [Accepted: 06/11/2014] [Indexed: 11/30/2022] Open

Yaveroğlu ÖN, Malod-Dognin N, Davis D, Levnajic Z, Janjic V, Karapandza R, Stojmirovic A, Pržulj N. Revealing the hidden language of complex networks. Sci Rep 2014;4:4547. [PMID: 24686408 PMCID: PMC3971399 DOI: 10.1038/srep04547] [Citation(s) in RCA: 78] [Impact Index Per Article: 7.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/17/2014] [Accepted: 03/13/2014] [Indexed: 12/03/2022] Open

Gruber A, Kroth PG. Deducing intracellular distributions of metabolic pathways from genomic data. Methods Mol Biol 2014;1083:187-211. [PMID: 24218217 DOI: 10.1007/978-1-62703-661-0_12] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/29/2023]

Wang R, Gao X. A Two-Layer Learning Architecture for Multi-Class Protein Folds Classification. Bioinformatics 2013. [DOI: 10.4018/978-1-4666-3604-0.ch041] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022] Open

Zhang Y, Misra S, Agrawal A, Patwary MMA, Liao WK, Qin Z, Choudhary A. Accelerating pairwise statistical significance estimation for local alignment by harvesting GPU's power. BMC Bioinformatics 2012;13 Suppl 5:S3. [PMID: 22537007 PMCID: PMC3318904 DOI: 10.1186/1471-2105-13-s5-s3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/27/2022] Open

Yutin N, Koonin EV. Archaeal origin of tubulin. Biol Direct 2012;7:10. [PMID: 22458654 PMCID: PMC3349469 DOI: 10.1186/1745-6150-7-10] [Citation(s) in RCA: 74] [Impact Index Per Article: 6.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/22/2011] [Accepted: 03/29/2012] [Indexed: 12/27/2022] Open

WONG WINGCHEONG, MAURER-STROH SEBASTIAN, EISENHABER FRANK. THE JANUS-FACED E-VALUES OF HMMER2: EXTREME VALUE DISTRIBUTION OR LOGISTIC FUNCTION? J Bioinform Comput Biol 2011. [DOI: 10.1142/s0219720011005264] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]

Abstract E-value guided extrapolation of protein domain annotation from libraries such as Pfam with the HMMER suite is indispensable for hypothesizing about the function of experimentally uncharacterized protein sequences. Since the recent release of HMMER3 does not supersede all functions of HMMER2, the latter will remain relevant for ongoing research as well as for the evaluation of annotations that reside in databases and in the literature. In HMMER2, the E-value is computed from the score via a logistic function or via a domain model-specific extreme value distribution (EVD); the lower of the two is returned as E-value for the domain hit in the query sequence. We find that, for thousands of domain models, this treatment results in switching from the EVD to the statistical model with the logistic function when scores grow (for Pfam release 23, 99% in the global mode and 75% in the fragment mode). If the score corresponding to the breakpoint results in an E-value above a user-defined threshold (e.g. 0.1), a critical score region with conflicting E-values from the logistic function (below the threshold) and from EVD (above the threshold) does exist. Thus, this switch will affect E-value guided annotation decisions in an automated mode. To emphasize, switching in the fragment mode is of no practical relevance since it occurs only at E-values far below 0.1. Unfortunately, a critical score region does exist for 185 domain models in the hmmpfam and 1,748 domain models in the hmmsearch global-search mode. For 145 out the respective 185 models, the critical score region is indeed populated by actual sequences. In total, 24.4% of their hits have a logistic function-derived E-value < 0.1 when the EVD provides an E-value > 0.1. We provide examples of false annotations and critically discuss the appropriateness of a logistic function as alternative to the EVD. Collapse

Eddy SR. Accelerated Profile HMM Searches. PLoS Comput Biol 2011;7:e1002195. [PMID: 22039361 PMCID: PMC3197634 DOI: 10.1371/journal.pcbi.1002195] [Citation(s) in RCA: 3786] [Impact Index Per Article: 291.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/27/2011] [Accepted: 07/29/2011] [Indexed: 11/18/2022] Open

Alves G, Yu YK. Combining independent, weighted P-values: achieving computational stability by a systematic expansion with controllable accuracy. PLoS One 2011;6:e22647. [PMID: 21912585 PMCID: PMC3166143 DOI: 10.1371/journal.pone.0022647] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/22/2011] [Accepted: 06/27/2011] [Indexed: 11/27/2022] Open

Agrawal A, Huang X. Pairwise statistical significance of local sequence alignment using sequence-specific and position-specific substitution matrices. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2011;8:194-205. [PMID: 21071807 DOI: 10.1109/tcbb.2009.69] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/30/2023]

RAId_aPS: MS/MS analysis with multiple scoring functions and spectrum-specific statistics. PLoS One 2010;5:e15438. [PMID: 21103371 PMCID: PMC2982831 DOI: 10.1371/journal.pone.0015438] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/29/2010] [Accepted: 09/20/2010] [Indexed: 11/26/2022] Open

Aniba MR, Poch O, Marchler-Bauer A, Thompson JD. AlexSys: a knowledge-based expert system for multiple sequence alignment construction and analysis. Nucleic Acids Res 2010;38:6338-49. [PMID: 20530533 PMCID: PMC2965243 DOI: 10.1093/nar/gkq526] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/15/2023] Open

Mizianty MJ, Kurgan L. Modular prediction of protein structural classes from sequences of twilight-zone identity with predicting sequences. BMC Bioinformatics 2009;10:414. [PMID: 20003388 PMCID: PMC2805645 DOI: 10.1186/1471-2105-10-414] [Citation(s) in RCA: 79] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/06/2009] [Accepted: 12/13/2009] [Indexed: 11/13/2022] Open

Abstract

Background

Knowledge of structural class is used by numerous methods for identification of structural/functional characteristics of proteins and could be used for the detection of remote homologues, particularly for chains that share twilight-zone similarity. In contrast to existing sequence-based structural class predictors, which target four major classes and which are designed for high identity sequences, we predict seven classes from sequences that share twilight-zone identity with the training sequences.

Results

The proposed MODular Approach to Structural class prediction (MODAS) method is unique as it allows for selection of any subset of the classes. MODAS is also the first to utilize a novel, custom-built feature-based sequence representation that combines evolutionary profiles and predicted secondary structure. The features quantify information relevant to the definition of the classes including conservation of residues and arrangement and number of helix/strand segments. Our comprehensive design considers 8 feature selection methods and 4 classifiers to develop Support Vector Machine-based classifiers that are tailored for each of the seven classes. Tests on 5 twilight-zone and 1 high-similarity benchmark datasets and comparison with over two dozens of modern competing predictors show that MODAS provides the best overall accuracy that ranges between 80% and 96.7% (83.5% for the twilight-zone datasets), depending on the dataset. This translates into 19% and 8% error rate reduction when compared against the best performing competing method on two largest datasets. The proposed predictor provides accurate predictions at 58% accuracy for membrane proteins class, which is not considered by majority of existing methods, in spite that this class accounts for only 2% of the data. Our predictive model is analyzed to demonstrate how and why the input features are associated with the corresponding classes.

Conclusions

The improved predictions stem from the novel features that express collocation of the secondary structure segments in the protein sequence and that combine evolutionary and secondary structure information. Our work demonstrates that conservation and arrangement of the secondary structure segments predicted along the protein chain can successfully predict structural classes which are defined based on the spatial arrangement of the secondary structures. A web server is available at http://biomine.ece.ualberta.ca/MODAS/.

Collapse

Nguyen VH, Lavenier D. PLAST: parallel local alignment search tool for database comparison. BMC Bioinformatics 2009;10:329. [PMID: 19821978 PMCID: PMC2770072 DOI: 10.1186/1471-2105-10-329] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/18/2009] [Accepted: 10/12/2009] [Indexed: 11/10/2022] Open

Forslund K, Sonnhammer ELL. Benchmarking homology detection procedures with low complexity filters. ACTA ACUST UNITED AC 2009;25:2500-5. [PMID: 19620098 DOI: 10.1093/bioinformatics/btp446] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022]

Agrawal A, Huang X. Pairwise statistical significance of local sequence alignment using multiple parameter sets and empirical justification of parameter set change penalty. BMC Bioinformatics 2009;10 Suppl 3:S1. [PMID: 19344477 PMCID: PMC2665049 DOI: 10.1186/1471-2105-10-s3-s1] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open

Hopper-Borge EA, Nasto RE, Ratushny V, Weiner LM, Golemis EA, Astsaturov I. Mechanisms of tumor resistance to EGFR-targeted therapies. Expert Opin Ther Targets 2009;13:339-62. [PMID: 19236156 PMCID: PMC2670612 DOI: 10.1517/14712590902735795] [Citation(s) in RCA: 66] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/11/2022]

Agrawal A, Huang X. PSIBLAST_PairwiseStatSig: reordering PSI-BLAST hits using pairwise statistical significance. Bioinformatics 2009;25:1082-3. [DOI: 10.1093/bioinformatics/btp089] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open

Stojmirović A, Gertz EM, Altschul SF, Yu YK. The effectiveness of position- and composition-specific gap costs for protein similarity searches. Bioinformatics 2008;24:i15-23. [PMID: 18586708 PMCID: PMC2718649 DOI: 10.1093/bioinformatics/btn171] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022] Open

Alves G, Wu WW, Wang G, Shen RF, Yu YK. Enhancing peptide identification confidence by combining search methods. J Proteome Res 2008;7:3102-13. [PMID: 18558733 PMCID: PMC2658881 DOI: 10.1021/pr700798h] [Citation(s) in RCA: 62] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]

Abstract

Confident peptide identification is one of the most important components in mass-spectrometry-based proteomics. We propose a method to properly combine the results from different database search methods to enhance the accuracy of peptide identifications. The database search methods included in our analysis are SEQUEST (v27 rev12), ProbID (v1.0), InsPecT (v20060505), Mascot (v2.1), X! Tandem (v2007.07.01.2), OMSSA (v2.0) and RAId_DbS. Using two data sets, one collected in profile mode and one collected in centroid mode, we tested the search performance of all 21 combinations of two search methods as well as all 35 possible combinations of three search methods. The results obtained from our study suggest that properly combining search methods does improve retrieval accuracy. In addition to performance results, we also describe the theoretical framework which in principle allows one to combine many independent scoring methods including de novo sequencing and spectral library searches. The correlations among different methods are also investigated in terms of common true positives, common false positives, and a global analysis. We find that the average correlation strength, between any pairwise combination of the seven methods studied, is usually smaller than the associated standard error. This indicates only weak correlation may be present among different methods and validates our approach in combining the search results. The usefulness of our approach is further confirmed by showing that the average cumulative number of false positive peptides agrees reasonably well with the combined E-value. The data related to this study are freely available upon request.

Properly combining the results from different database search methods will enhance the accuracy of peptide identifications. This nevertheless requires, among different search methods, a common statistical standard, which can be achieved by statistical calibration of E-values. After transforming E-values into database P-values, we use the well-established Fisher’s formula to combine different P-values. Our protocol provides a statistically sound method for integration of search results.

Collapse

Drees C, Stürmer CA, Möller HM, Fritz G. Expression and purification of neurolin immunoglobulin domain 2 from Carrassius auratus (goldfish) in Escherichia coli. Protein Expr Purif 2008;59:47-54. [DOI: 10.1016/j.pep.2007.12.015] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/26/2007] [Revised: 12/25/2007] [Accepted: 12/28/2007] [Indexed: 10/22/2022]

SCPRED: accurate prediction of protein structural class for sequences of twilight-zone similarity with predicting sequences. BMC Bioinformatics 2008;9:226. [PMID: 18452616 PMCID: PMC2391167 DOI: 10.1186/1471-2105-9-226] [Citation(s) in RCA: 119] [Impact Index Per Article: 7.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2007] [Accepted: 05/01/2008] [Indexed: 11/16/2022] Open

Abstract

Background

Protein structure prediction methods provide accurate results when a homologous protein is predicted, while poorer predictions are obtained in the absence of homologous templates. However, some protein chains that share twilight-zone pairwise identity can form similar folds and thus determining structural similarity without the sequence similarity would be desirable for the structure prediction. The folding type of a protein or its domain is defined as the structural class. Current structural class prediction methods that predict the four structural classes defined in SCOP provide up to 63% accuracy for the datasets in which sequence identity of any pair of sequences belongs to the twilight-zone. We propose SCPRED method that improves prediction accuracy for sequences that share twilight-zone pairwise similarity with sequences used for the prediction.

Results

SCPRED uses a support vector machine classifier that takes several custom-designed features as its input to predict the structural classes. Based on extensive design that considers over 2300 index-, composition- and physicochemical properties-based features along with features based on the predicted secondary structure and content, the classifier's input includes 8 features based on information extracted from the secondary structure predicted with PSI-PRED and one feature computed from the sequence. Tests performed with datasets of 1673 protein chains, in which any pair of sequences shares twilight-zone similarity, show that SCPRED obtains 80.3% accuracy when predicting the four SCOP-defined structural classes, which is superior when compared with over a dozen recent competing methods that are based on support vector machine, logistic regression, and ensemble of classifiers predictors.

Conclusion

The SCPRED can accurately find similar structures for sequences that share low identity with sequence used for the prediction. The high predictive accuracy achieved by SCPRED is attributed to the design of the features, which are capable of separating the structural classes in spite of their low dimensionality. We also demonstrate that the SCPRED's predictions can be successfully used as a post-processing filter to improve performance of modern fold classification methods.

Collapse

Brendel V. DNA, Words and Models-Statistics of Exceptional Words by S. Robin, F. Rodolphe, and S. Schbath. Biometrics 2008. [DOI: 10.1111/j.1541-0420.2008.00962_4.x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]

Sadreyev RI, Grishin NV. Accurate statistical model of comparison between multiple sequence alignments. Nucleic Acids Res 2008;36:2240-8. [PMID: 18285364 PMCID: PMC2367703 DOI: 10.1093/nar/gkn065] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/04/2022] Open

Chen K, Kurgan L. PFRES: protein fold classification by using evolutionary information and predicted secondary structure. Bioinformatics 2007;23:2843-50. [DOI: 10.1093/bioinformatics/btm475] [Citation(s) in RCA: 91] [Impact Index Per Article: 5.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open

Gertz EM, Yu YK, Agarwala R, Schäffer AA, Altschul SF. Composition-based statistics and translated nucleotide searches: improving the TBLASTN module of BLAST. BMC Biol 2006;4:41. [PMID: 17156431 PMCID: PMC1779365 DOI: 10.1186/1741-7007-4-41] [Citation(s) in RCA: 343] [Impact Index Per Article: 19.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/06/2006] [Accepted: 12/07/2006] [Indexed: 11/29/2022] Open

Abstract

Background

TBLASTN is a mode of operation for BLAST that aligns protein sequences to a nucleotide database translated in all six frames. We present the first description of the modern implementation of TBLASTN, focusing on new techniques that were used to implement composition-based statistics for translated nucleotide searches. Composition-based statistics use the composition of the sequences being aligned to generate more accurate E-values, which allows for a more accurate distinction between true and false matches. Until recently, composition-based statistics were available only for protein-protein searches. They are now available as a command line option for recent versions of TBLASTN and as an option for TBLASTN on the NCBI BLAST web server.

Results

We evaluate the statistical and retrieval accuracy of the E-values reported by a baseline version of TBLASTN and by two variants that use different types of composition-based statistics. To test the statistical accuracy of TBLASTN, we ran 1000 searches using scrambled proteins from the mouse genome and a database of human chromosomes. To test retrieval accuracy, we modernize and adapt to translated searches a test set previously used to evaluate the retrieval accuracy of protein-protein searches. We show that composition-based statistics greatly improve the statistical accuracy of TBLASTN, at a small cost to the retrieval accuracy.

Conclusion

TBLASTN is widely used, as it is common to wish to compare proteins to chromosomes or to libraries of mRNAs. Composition-based statistics improve the statistical accuracy, and therefore the reliability, of TBLASTN results. The algorithms used by TBLASTN are not widely known, and some of the most important are reported here. The data used to test TBLASTN are available for download and may be useful in other studies of translated search algorithms.

Collapse