1
Janaki C, Gowri VS, Srinivasan N. Master Blaster: an approach to sensitive identification of remotely related proteins. Sci Rep 2021; 11:8746. [PMID: 33888741 PMCID: PMC8062480 DOI: 10.1038/s41598-021-87833-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2019] [Accepted: 04/06/2021] [Indexed: 11/11/2022] Open
Abstract
Genome sequencing projects unearth all the protein sequences encoded in a genome. As the first step, homology detection is employed to obtain clues to the structure and function of these proteins. However, high evolutionary divergence between homologous proteins challenges our ability to detect distant relationships. In the past, an approach involving multiple Position Specific Scoring Matrices (PSSMs) was found to be more effective than traditional single PSSMs. Cascaded search is another successful approach, in which the hits of a search are themselves used as queries to detect more homologues. We propose a protocol, ‘Master Blaster’, which combines the principles adopted in these two approaches to enhance our ability to detect remote homologues even further. The approach was assessed using known relationships available in the SCOP70 database, and the results were compared against those of PSI-BLAST and HHblits, a hidden Markov model-based method. Compared to PSI-BLAST, Master Blaster resulted in a 10% improvement in the detection of cross-superfamily connections, nearly 35% improvement in cross-family connections and more than 80% improvement in intra-family connections. The results show that HHblits is more sensitive than Master Blaster in detecting remote homologues. However, there are true hits from 46 folds for which Master Blaster reported homologues that HHblits did not report even with optimal parameters, indicating that combining multiple methods based on different approaches can be more effective for detecting remote homologues. Master Blaster stand-alone code is available for download in the supplementary archive.
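The cascaded-search idea mentioned in this abstract (hits of one search are used as queries for further searches) can be illustrated with a minimal sketch. This is not the Master Blaster code and it leaves out the multiple-PSSM component entirely; the function `cascaded_search`, the `max_rounds` parameter and the toy hit table are hypothetical names and data introduced here for illustration only.

```python
from collections import deque

def cascaded_search(query, search, max_rounds=2):
    """Collect hits for `query`, then re-query each new hit.

    `search` is any callable returning the direct hits of a sequence id.
    `max_rounds` limits how many extra rounds of re-querying are allowed
    (0 means a plain, non-cascaded search).
    """
    found = set()
    seen_queries = {query}
    frontier = deque([(query, 0)])  # (sequence id, cascade depth)
    while frontier:
        current, depth = frontier.popleft()
        for hit in search(current):
            if hit == query:
                continue  # the original query is not a hit of itself
            found.add(hit)
            if depth < max_rounds and hit not in seen_queries:
                seen_queries.add(hit)
                frontier.append((hit, depth + 1))
    return found

# Hypothetical toy data: which sequences each sequence detects directly.
TOY_HITS = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C"],
}

def toy_search(seq_id):
    return TOY_HITS.get(seq_id, [])

direct = set(toy_search("A"))
cascaded = cascaded_search("A", toy_search, max_rounds=2)
print(sorted(direct), sorted(cascaded))
```

In this toy chain, sequence D is reachable from A only through the intermediates B and C, which is the kind of relationship a single direct search misses. A real protocol would also need score or E-value thresholds at each round to limit false positives, which this sketch omits.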
Affiliation(s)
- Chintalapati Janaki
  - Molecular Biophysics Unit, Indian Institute of Science, Bangalore, 560012, India; Centre for Development of Advanced Computing, Knowledge Park, Byappanahalli, Bangalore, 560038, India
- Venkatraman S Gowri
  - Molecular Biophysics Unit, Indian Institute of Science, Bangalore, 560012, India; Department of Chemistry, Auxilium College, Gandhinagar, Vellore, 632006, India
2
Tong J, Sadreyev RI, Pei J, Kinch LN, Grishin NV. Using homology relations within a database markedly boosts protein sequence similarity search. Proc Natl Acad Sci U S A 2015; 112:7003-8. [PMID: 26038555 PMCID: PMC4460465 DOI: 10.1073/pnas.1424324112] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 11/18/2022] Open
Abstract
Inference of homology from protein sequences provides an essential tool for analyzing protein structure, function, and evolution. Current sequence-based homology search methods are still unable to detect many similarities evident from protein spatial structures. In computer science a search engine can be improved by considering networks of known relationships within the search database. Here, we apply this idea to protein-sequence-based homology search and show that it dramatically enhances the search accuracy. Our new method, COMPADRE (COmparison of Multiple Protein sequence Alignments using Database RElationships) assesses the relationship between the query sequence and a hit in the database by considering the similarity between the query and hit's known homologs. This approach increases detection quality, boosting the precision rate from 18% to 83% at half-coverage of all database homologs. The increased precision rate allows detection of a large fraction of protein structural relationships, thus providing structure and function predictions for previously uncharacterized proteins. Our results suggest that this general approach is applicable to a wide variety of methods for detection of biological similarities. The web server is available at prodata.swmed.edu/compadre.
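The "precision at half-coverage" figure quoted in this abstract can be made concrete with a small sketch. The abstract does not give the exact evaluation procedure, so this shows one common reading: walk down the ranked hit list until half of all true homologs have been recovered, then report the fraction of hits so far that are true. The function name and the example labels are assumptions made here, not taken from the paper.

```python
import math

def precision_at_coverage(ranked_labels, total_true, coverage=0.5):
    """Precision at the rank where `coverage` of all true homologs is reached.

    ranked_labels: booleans ordered best score first (True = true homolog).
    total_true:    number of true homologs in the whole database.
    Returns None if the ranked list never reaches the requested coverage.
    """
    needed = math.ceil(coverage * total_true)
    if needed <= 0:
        raise ValueError("coverage * total_true must be positive")
    true_seen = 0
    for rank, is_true in enumerate(ranked_labels, start=1):
        if is_true:
            true_seen += 1
            if true_seen >= needed:
                return true_seen / rank
    return None

# Hypothetical ranked hit list: 4 true homologs exist, 3 appear in the list.
labels = [True, False, True, False, False, True]
p_half = precision_at_coverage(labels, total_true=4, coverage=0.5)
print(p_half)
```

Here half-coverage means two of the four true homologs; the second one appears at rank 3, so the precision at that point is 2/3.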
Affiliation(s)
- Jing Tong
  - Department of Molecular Biophysics, University of Texas Southwestern Medical Center, Dallas, TX 75390-9050
- Ruslan I Sadreyev
  - Department of Molecular Biology, Massachusetts General Hospital, Boston, MA 02114; Department of Pathology, Massachusetts General Hospital and Harvard Medical School, Boston, MA 02114
- Jimin Pei
  - Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, TX 75390-9050
- Lisa N Kinch
  - Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, TX 75390-9050
- Nick V Grishin
  - Department of Molecular Biophysics, University of Texas Southwestern Medical Center, Dallas, TX 75390-9050; Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, TX 75390-9050
3
Minami S, Sawada K, Chikenji G. How a spatial arrangement of secondary structure elements is dispersed in the universe of protein folds. PLoS One 2014; 9:e107959. [PMID: 25243952 PMCID: PMC4171485 DOI: 10.1371/journal.pone.0107959] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Received: 08/12/2014] [Accepted: 08/18/2014] [Indexed: 11/18/2022] Open
Abstract
It is known that topologically different proteins of the same class sometimes share the same spatial arrangement of secondary structure elements (SSEs). However, the frequency with which topologically different structures share the same spatial arrangement of SSEs is unclear. It is important to estimate this frequency because it provides both a deeper understanding of the geometry of protein folds and a valuable suggestion for predicting protein structures with novel folds. Here we clarified the frequency with which protein folds share the same SSE packing arrangement with other folds, the types of spatial arrangement of SSEs that are frequently observed across different folds, and the diversity of protein folds that share the same spatial arrangement of SSEs with a given fold, using MICAN, a protein structure alignment program that we have been developing. By performing a comprehensive structural comparison of SCOP fold representatives, we found that approximately 80% of protein folds share the same spatial arrangement of SSEs with other folds. We also observed that many protein pairs that share the same spatial arrangement of SSEs belong to different classes, often with an opposing N- to C-terminal direction of the polypeptide chain. The most frequently observed spatial arrangement of SSEs was the 2-layer α/β packing arrangement, and it was dispersed among as many as 27% of SCOP fold representatives. These results suggest that the same spatial arrangements of SSEs are adopted by a wide variety of different folds and that the spatial arrangement of SSEs is highly robust against the N- to C-terminal direction of the polypeptide chain.
Affiliation(s)
- Shintaro Minami
  - Department of Complex Systems Science, Nagoya University, Nagoya, Aichi, Japan
- Kengo Sawada
  - Department of Applied Physics, Nagoya University, Nagoya, Aichi, Japan
- George Chikenji
  - Department of Computational Science and Engineering, Nagoya University, Nagoya, Aichi, Japan
4
Simon NC, Aktories K, Barbieri JT. Novel bacterial ADP-ribosylating toxins: structure and function. Nat Rev Microbiol 2014; 12:599-611. [PMID: 25023120 PMCID: PMC5846498 DOI: 10.1038/nrmicro3310] [Citation(s) in RCA: 148] [Impact Index Per Article: 14.8] [Indexed: 12/29/2022]
Abstract
Bacterial ADP-ribosyltransferase toxins (bARTTs) transfer ADP-ribose to eukaryotic proteins to promote bacterial pathogenesis. In this Review, we use prototype bARTTs, such as diphtheria toxin and pertussis toxin, as references for the characterization of several new bARTTs from human, insect and plant pathogens, which were recently identified by bioinformatic analyses. Several of these toxins, including cholix toxin (ChxA) from Vibrio cholerae, SpyA from Streptococcus pyogenes, HopU1 from Pseudomonas syringae and the Tcc toxins from Photorhabdus luminescens, ADP-ribosylate novel substrates and have unique organizations, which distinguish them from the reference toxins. The characterization of these toxins increases our appreciation of the range of structural and functional properties that are possessed by bARTTs and their roles in bacterial pathogenesis.
Affiliation(s)
- Nathan C. Simon
  - Medical College of Wisconsin, Microbiology and Molecular Genetics, Milwaukee, WI, USA
- Klaus Aktories
  - Institute of Experimental and Clinical Pharmacology and Toxicology; Albert-Ludwigs-University Freiburg; Freiburg, Germany
- Joseph T. Barbieri
  - Medical College of Wisconsin, Microbiology and Molecular Genetics, Milwaukee, WI, USA
5
A comparative assessment and analysis of 20 representative sequence alignment methods for protein structure prediction. Sci Rep 2014; 3:2619. [PMID: 24018415 PMCID: PMC3965362 DOI: 10.1038/srep02619] [Citation(s) in RCA: 128] [Impact Index Per Article: 12.8] [Received: 07/01/2013] [Accepted: 08/22/2013] [Indexed: 11/08/2022] Open
Abstract
Protein sequence alignment is essential for template-based protein structure prediction and function annotation. We collect 20 sequence alignment algorithms, 10 published and 10 newly developed, which cover all representative sequence- and profile-based alignment approaches. These algorithms are benchmarked on 538 non-redundant proteins for protein fold-recognition on a uniform template library. Results demonstrate dominant advantage of profile-profile based methods, which generate models with average TM-score 26.5% higher than sequence-profile methods and 49.8% higher than sequence-sequence alignment methods. There is no obvious difference in results between methods with profiles generated from PSI-BLAST PSSM matrix and hidden Markov models. Accuracy of profile-profile alignments can be further improved by 9.6% or 21.4% when predicted or native structure features are incorporated. Nevertheless, TM-scores from profile-profile methods including experimental structural features are still 37.1% lower than that from TM-align, demonstrating that the fold-recognition problem cannot be solved solely by improving accuracy of structure feature predictions.
6
Cai H, Kuang R, Gu J, Wang Y. Proteases in malaria parasites - a phylogenomic perspective. Curr Genomics 2012; 12:417-27. [PMID: 22379395 PMCID: PMC3178910 DOI: 10.2174/138920211797248565] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.3] [Received: 05/20/2011] [Revised: 07/17/2011] [Accepted: 07/20/2011] [Indexed: 12/21/2022] Open
Abstract
Malaria continues to be one of the most devastating global health problems due to the high morbidity and mortality it causes in endemic regions. The search for new antimalarial targets is of high priority because of the increasing prevalence of drug resistance in malaria parasites. Malarial proteases constitute a class of promising therapeutic targets as they play important roles in the parasite life cycle and it is possible to design and screen for specific protease inhibitors. In this mini-review, we provide a phylogenomic overview of malarial proteases. An evolutionary perspective on the origin and divergence of these proteases will provide insights into the adaptive mechanisms of parasite growth, development, infection, and pathogenesis.
Affiliation(s)
- Hong Cai
  - Department of Biology, University of Texas at San Antonio, San Antonio, TX 78249, USA
7
Söding J, Remmert M. Protein sequence comparison and fold recognition: progress and good-practice benchmarking. Curr Opin Struct Biol 2011; 21:404-11. [PMID: 21458982 DOI: 10.1016/j.sbi.2011.03.005] [Citation(s) in RCA: 55] [Impact Index Per Article: 4.2] [Received: 01/31/2011] [Revised: 03/01/2011] [Accepted: 03/09/2011] [Indexed: 11/26/2022]
Abstract
Protein sequence comparison methods have grown increasingly sensitive during the last decade and can often identify distantly related proteins sharing a common ancestor some 3 billion years ago. Although cellular function is not conserved so long, molecular functions and structures of protein domains often are. In combination with a domain-centered approach to function and structure prediction, modern remote homology detection methods have a great and largely underexploited potential for elucidating protein functions and evolution. Advances during the last few years include nonlinear scoring functions combining various sequence features, the use of sequence context information, and powerful new software packages. Since progress depends on realistically assessing new and existing methods and published benchmarks are often hard to compare, we propose 10 rules of good-practice benchmarking.
Affiliation(s)
- Johannes Söding
  - Gene Center and Center for Integrated Protein Science, Ludwig-Maximilians-Universität München, Feodor-Lynen-Strasse 25, Munich, Germany
8
Abstract
Homology modeling is based on the observation that related protein sequences adopt similar three-dimensional structures. Hence, a homology model of a protein can be derived using related protein structure(s) as modeling template(s). A key step in this approach is the establishment of correspondence between residues of the protein to be modeled and those of the modeling template(s). This step, often referred to as sequence-structure alignment, is one of the major determinants of the accuracy of a homology model. This chapter gives an overview of methods for deriving sequence-structure alignments and discusses recent methodological developments leading to improved performance. However, no method is perfect. How to find alignment regions that may contain errors, and how to improve them, is another focus of this chapter. Finally, the chapter provides practical guidance on how to get the most out of the available tools to maximize the accuracy of sequence-structure alignments.
9
Fieldhouse RJ, Turgeon Z, White D, Merrill AR. Cholera- and anthrax-like toxins are among several new ADP-ribosyltransferases. PLoS Comput Biol 2010; 6:e1001029. [PMID: 21170356 PMCID: PMC3000352 DOI: 10.1371/journal.pcbi.1001029] [Citation(s) in RCA: 48] [Impact Index Per Article: 3.4] [Received: 06/09/2010] [Accepted: 11/10/2010] [Indexed: 11/19/2022] Open
Abstract
Chelt, a cholera-like toxin from Vibrio cholerae, and Certhrax, an anthrax-like toxin from Bacillus cereus, are among six new bacterial protein toxins we identified and characterized using in silico and cell-based techniques. We also uncovered medically relevant toxins from Mycobacterium avium and Enterococcus faecalis. We found agriculturally relevant toxins in Photorhabdus luminescens and Vibrio splendidus. These toxins belong to the ADP-ribosyltransferase family that has conserved structure despite low sequence identity. Therefore, our search for new toxins combined fold recognition with rules for filtering sequences, including a primary sequence pattern, to reduce reliance on sequence identity and identify toxins using structure. We used computers to build models and analyzed each new toxin to understand features including: structure, secretion, cell entry, activation, NAD+ substrate binding, intracellular target binding and the reaction mechanism. We confirmed activity using a yeast growth test. In this era, where an expanding protein structure library complements abundant protein sequence data and we need high-throughput validation, our approach provides insight into the newest toxin ADP-ribosyltransferases.
Affiliation(s)
- Robert J. Fieldhouse
  - Department of Molecular and Cellular Biology, University of Guelph, Guelph, Ontario, Canada
- Zachari Turgeon
  - Department of Molecular and Cellular Biology, University of Guelph, Guelph, Ontario, Canada
- Dawn White
  - Department of Molecular and Cellular Biology, University of Guelph, Guelph, Ontario, Canada
- A. Rod Merrill
  - Department of Molecular and Cellular Biology, University of Guelph, Guelph, Ontario, Canada
10
Jeong CS, Kim D. Linear predictive coding representation of correlated mutation for protein sequence alignment. BMC Bioinformatics 2010; 11 Suppl 2:S2. [PMID: 20406500 PMCID: PMC3165164 DOI: 10.1186/1471-2105-11-s2-s2] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/11/2022] Open
Abstract
Background: Although both conservation and correlated mutation (CM) carry important information reflecting different sorts of context in a multiple sequence alignment, most alignment methods use sequence profiles that represent only conservation. There is as yet no general way to represent correlated mutation and incorporate it into sequence alignment. Methods: We develop a novel method, CM profile, to represent correlated mutation as a spectral feature derived using linear predictive coding, where correlated mutations among different positions are represented by a fixed number of values. We combine CM profile with the conventional sequence profile to improve alignment quality. Results: For distantly related protein pairs, using CM profile improves the profile-profile alignment with or without predicted secondary structure. In particular, at the superfamily level, combining CM profile with sequence profile improves profile-profile alignment by 9.5%, while predicted secondary structure does so by 6.0%. More significantly, using both of them improves profile-profile alignment by 13.9%. We also exemplify the effectiveness of CM profile by demonstrating that the resulting alignment preserves shared coevolution and contacts. Conclusions: In this work, we introduce a novel method, CM profile, which represents correlated mutation information in a parallel form, and apply it to the protein sequence alignment problem. When combined with the conventional sequence profile, CM profile improves alignment quality significantly more than predicted secondary structure information does, which should be beneficial for target-template alignment in protein structure prediction. Because of the generality of CM profile, it can be used for other bioinformatics applications in the same way as a sequence profile.
Affiliation(s)
- Chan-seok Jeong
  - Department of Bio and Brain Engineering, KAIST, 373-1 Guseong-dong, Yuseong-gu, Daejeon, 305-701, Korea
11
Margelevicius M, Venclovas C. Detection of distant evolutionary relationships between protein families using theory of sequence profile-profile comparison. BMC Bioinformatics 2010; 11:89. [PMID: 20158924 PMCID: PMC2837030 DOI: 10.1186/1471-2105-11-89] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.7] [Received: 08/28/2009] [Accepted: 02/17/2010] [Indexed: 01/31/2023] Open
Abstract
Background: Detection of common evolutionary origin (homology) is a primary means of inferring protein structure and function. At present, comparison of protein families represented as sequence profiles is arguably the most effective homology detection strategy. However, finding the best way to represent evolutionary information of a protein sequence family in the profile, to compare profiles and to estimate the biological significance of such comparisons, remains an active area of research. Results: Here, we present a new homology detection method based on sequence profile-profile comparison. The method has a number of new features including position-dependent gap penalties and a global score system. Position-dependent gap penalties provide a more biologically relevant way to represent and align protein families as sequence profiles. The global score system enables an analytical solution of the statistical parameters needed to estimate the statistical significance of profile-profile similarities. The new method, together with other state-of-the-art profile-based methods (HHsearch, COMPASS and PSI-BLAST), is benchmarked in all-against-all comparison of a challenging set of SCOP domains that share at most 20% sequence identity. For benchmarking, we use a reference ("gold standard") free model-based evaluation framework. Evaluation results show that at the level of protein domains our method compares favorably to all other tested methods. We also provide examples of the new method outperforming structure-based similarity detection and alignment. The implementation of the new method both as a standalone software package and as a web server is available at http://www.ibt.lt/bioinformatics/coma. Conclusion: Due to a number of developments, the new profile-profile comparison method shows an improved ability to match distantly related protein domains. Therefore, the method should be useful for annotation and homology modeling of uncharacterized proteins.
12
Considering scores between unrelated proteins in the search database improves profile comparison. BMC Bioinformatics 2009; 10:399. [PMID: 19961610 PMCID: PMC3087343 DOI: 10.1186/1471-2105-10-399] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 06/13/2009] [Accepted: 12/04/2009] [Indexed: 12/02/2022] Open
Abstract
Background: Profile-based comparison of multiple sequence alignments is a powerful methodology for the detection of remote protein sequence similarity, which is essential for the inference and analysis of protein structure, function, and evolution. Accurate estimation of the statistical significance of detected profile similarities is essential for further development of this methodology. Here we analyze a novel approach to estimating the statistical significance of profile similarity: the explicit consideration of background score distributions for each database template (subject). Results: Using a simple scheme to combine and analytically approximate query- and subject-based distributions, we show that (i) inclusion of background distributions for the subjects increases the quality of homology detection; (ii) this increase is higher when the distributions are based on the scores to all known non-homologs of the subject rather than a small calibration subset of the database representatives; and (iii) these all-known-non-homolog distributions of scores for the subject make the dominant contribution to the improved performance: adding the calibration distribution of the query has a negligible additional effect. Conclusion: The construction of distributions based on the complete sets of non-homologs for each subject is particularly relevant in the setting of structure prediction, where the database consists of proteins with solved 3D structure (PDB, SCOP, CATH, etc.) and therefore structural relationships between proteins are known. These results point to a potential new direction in the development of more powerful methods for remote homology detection.
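The idea of judging a hit against the subject's own background score distribution can be sketched in a few lines. The abstract describes an analytical approximation of the distributions; the sketch below deliberately swaps in a simpler empirical tail estimate with add-one smoothing, so it illustrates the principle rather than the published method. The function name and the example scores are hypothetical.

```python
def empirical_p_value(score, background_scores):
    """Estimate how unusual `score` is for a given subject.

    background_scores: scores of this subject against its known
    non-homologs. Returns the smoothed fraction of background scores
    that are at least as high as `score` (add-one smoothing keeps the
    estimate strictly above zero).
    """
    if not background_scores:
        raise ValueError("background_scores must not be empty")
    at_least = sum(1 for b in background_scores if b >= score)
    return (at_least + 1) / (len(background_scores) + 1)

# Hypothetical background for one subject (scores against non-homologs).
background = [10, 12, 15, 20, 22, 30, 31, 35, 40]
p_moderate = empirical_p_value(33, background)   # two background scores are higher
p_extreme = empirical_p_value(100, background)   # no background score is as high
print(p_moderate, p_extreme)
```

A subject that tends to score highly against unrelated proteins has a high-scoring background, so the same raw score yields a weaker p-value for it than for a subject with a low-scoring background.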
13
Abstract
Sensitive and accurate detection of distant protein homology is essential for studies of protein structure, function and evolution. We recently developed PROCAIN, a method that is based on sequence profile comparison and involves the analysis of four signals: similarities of residue content at the profile positions combined with three types of assisting information (sequence motifs, residue conservation and predicted secondary structure). Here we present the PROCAIN web server, which allows the user to submit a query sequence or multiple sequence alignment and perform the search in a profile database of choice. The output is structured similarly to that of BLAST, with the list of detected homologs sorted by E-value and followed by profile-profile alignments. The front page allows the user to adjust multiple options of input processing and output formatting, as well as search settings, including the relative weights assigned to the three types of assisting information. Availability: http://prodata.swmed.edu/procain/.
Affiliation(s)
- Yong Wang
  - Biomedical Engineering Program, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd, Dallas, TX 75390-9050, USA
14
Sadreyev RI, Kim BH, Grishin NV. Discrete-continuous duality of protein structure space. Curr Opin Struct Biol 2009; 19:321-8. [PMID: 19482467 PMCID: PMC3688466 DOI: 10.1016/j.sbi.2009.04.009] [Citation(s) in RCA: 57] [Impact Index Per Article: 3.8] [Received: 04/08/2009] [Revised: 04/29/2009] [Accepted: 04/29/2009] [Indexed: 11/30/2022]
Abstract
Recently, the nature of protein structure space has been widely discussed in the literature. The traditional discrete view of the protein universe as a set of separate folds has been criticized in the light of growing evidence that almost any arrangement of secondary structures is possible and the whole protein space can be traversed through a path of similar structures. Here we argue that the discrete and continuous descriptions are not mutually exclusive, but complementary: the space is largely discrete in an evolutionary sense, but continuous geometrically when purely structural similarities are quantified. Evolutionary connections are mainly confined to separate structural prototypes corresponding to folds as islands of structural stability, with few remaining traceable links between the islands. However, for a geometric similarity measure, it is usually possible to find a reasonable cutoff that yields paths connecting any two structures through intermediates.
Affiliation(s)
- Ruslan I. Sadreyev
  - Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd, Dallas, TX 75390-9050, USA
- Bong-Hyun Kim
  - Department of Biochemistry, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd, Dallas, TX 75390-9050, USA
- Nick V. Grishin
  - Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd, Dallas, TX 75390-9050, USA
  - Department of Biochemistry, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd, Dallas, TX 75390-9050, USA
15
Hasegawa H, Holm L. Advances and pitfalls of protein structural alignment. Curr Opin Struct Biol 2009; 19:341-8. [PMID: 19481444 DOI: 10.1016/j.sbi.2009.04.003] [Citation(s) in RCA: 303] [Impact Index Per Article: 20.2] [Received: 04/02/2009] [Accepted: 04/16/2009] [Indexed: 11/30/2022]
Abstract
Structure comparison opens a window into the distant past of protein evolution, which has been unreachable by sequence comparison alone. With 55,000 entries in the Protein Data Bank and about 500 new structures added each week, automated processing, comparison, and classification are necessary. A variety of methods use different representations, scoring functions, and optimization algorithms, and they generate contradictory results even for moderately distant structures. Sequence mutations, insertions, and deletions are accommodated by plastic deformations of the common core, retaining the precise geometry of the active site, and peripheral regions may refold completely. Therefore structure comparison methods that allow for flexibility and plasticity generate the most biologically meaningful alignments. Active research directions include both the search for fold invariant features and the modeling of structural transitions in evolution. Advances have been made in algorithmic robustness, multiple alignment, and speeding up database searches.
Affiliation(s)
- Hitomi Hasegawa
  - Institute of Biotechnology, University of Helsinki, P.O. Box 56 (Viikinkaari 5), 00014 University of Helsinki, Finland
16
Sadreyev RI, Tang M, Kim BH, Grishin NV. COMPASS server for homology detection: improved statistical accuracy, speed and functionality. Nucleic Acids Res 2009; 37:W90-4. [PMID: 19435884 PMCID: PMC2703893 DOI: 10.1093/nar/gkp360] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.6] [Indexed: 11/25/2022] Open
Abstract
COMPASS is a profile-based method for the detection of remote sequence similarity and the prediction of protein structure. Here we describe a recently improved public web server of COMPASS, http://prodata.swmed.edu/compass. The server features three major developments: (i) improved statistical accuracy; (ii) increased speed from parallel implementation; and (iii) new functional features facilitating structure prediction. These features include visualization tools that allow the user to quickly and effectively analyze specific local structural region predictions suggested by COMPASS alignments. As an application example, we describe the structural, evolutionary and functional analysis of a protein with unknown function that served as a target in the recent CASP8 (Critical Assessment of Techniques for Protein Structure Prediction round 8).
Affiliation(s)
- Ruslan I Sadreyev
  - Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd, Dallas, TX 75390-9050, USA
17
Wang Y, Sadreyev RI, Grishin NV. PROCAIN: protein profile comparison with assisting information. Nucleic Acids Res 2009; 37:3522-30. [PMID: 19357092 PMCID: PMC2699500 DOI: 10.1093/nar/gkp212] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.5] [Indexed: 11/14/2022] Open
Abstract
Detection of remote sequence homology is essential for the accurate inference of protein structure, function and evolution. The most sensitive detection methods involve the comparison of evolutionary patterns reflected in multiple sequence alignments (MSAs) of protein families. We present PROCAIN, a new method for MSA comparison based on the combination of 'vertical' MSA context (substitution constraints at individual sequence positions) and 'horizontal' context (patterns of residue content at multiple positions). Based on a simple and tractable profile methodology and primitive measures for the similarity of horizontal MSA patterns, the method achieves a quality of homology detection comparable to that of a more complex advanced method employing hidden Markov models (HMMs) and secondary structure (SS) prediction. Adding SS information further improves PROCAIN performance beyond the capabilities of current state-of-the-art tools. The potential value of the method for structure/function predictions is illustrated by the detection of subtle homology between evolutionarily distant yet structurally similar protein domains. PROCAIN, relevant databases and tools can be downloaded from: http://prodata.swmed.edu/procain/download. The web server can be accessed at http://prodata.swmed.edu/procain/procain.php.
Affiliation(s)
- Yong Wang
  - Biomedical Engineering Program, University of Texas Southwestern Medical Center, Dallas, TX 75390-9050, USA
18
Wrabl JO, Grishin NV. Statistics of Random Protein Superpositions: p-Values for Pairwise Structure Alignment. J Comput Biol 2008; 15:317-55. [DOI: 10.1089/cmb.2007.0161] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 11/12/2022] Open
Affiliation(s)
- James O. Wrabl
  - Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas
- Nick V. Grishin
  - Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas
  - Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, Texas
19
Sadreyev RI, Grishin NV. Accurate statistical model of comparison between multiple sequence alignments. Nucleic Acids Res 2008; 36:2240-8. [PMID: 18285364 PMCID: PMC2367703 DOI: 10.1093/nar/gkn065] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Indexed: 12/04/2022] Open
Abstract
Comparison of multiple protein sequence alignments (MSAs) reveals unexpected evolutionary relations between protein families and leads to exciting predictions of spatial structure and function. The power of MSA comparison critically depends on the quality of the statistical model used to rank the similarities found in a database search, so that biologically relevant relationships are discriminated from spurious connections. Here, we develop an accurate statistical description of MSA comparison that does not originate from conventional models of single sequence comparison and captures essential features of protein families. As a final result, we compute E-values for the similarity between any two MSAs using a mathematical function that depends on MSA lengths and sequence diversity. To develop these estimates of statistical significance, we first establish a procedure for generating realistic alignment decoys that reproduce natural patterns of sequence conservation dictated by protein secondary structure. Second, since similarity scores between these alignments do not follow the classic Gumbel extreme value distribution, we propose a novel distribution that yields statistically perfect agreement with the data. Third, we apply this random model to database searches and show that it surpasses conventional models in the accuracy of detecting remote protein similarities.
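For orientation, the classic Gumbel extreme value form mentioned in this abstract can be written out as below. This is the standard textbook expression used as the conventional baseline, not the novel distribution proposed by the authors, which the abstract does not specify. The symbols are notation assumed here: μ is the location parameter, β the scale parameter, and N the number of comparisons performed in the database search. Taking the E-value as N times the tail probability is a common convention rather than something stated in the abstract.

```latex
P(S \ge x) = 1 - \exp\!\left(-e^{-(x-\mu)/\beta}\right),
\qquad
E(x) = N \, P(S \ge x)
```

Under this model a score x far above μ gives a small tail probability, and the E-value estimates how many hits with a score of at least x would be expected by chance in a search of size N.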