1. Dunham AS, Beltrao P. Exploring amino acid functions in a deep mutational landscape. Mol Syst Biol 2021; 17:e10305. [PMID: 34292650 PMCID: PMC8297461 DOI: 10.15252/msb.202110305] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 02/19/2021] [Revised: 06/29/2021] [Accepted: 06/30/2021] [Indexed: 12/21/2022] Open
Abstract
Amino acids fulfil a diverse range of roles in proteins, each utilising its chemical properties in different ways in different contexts to create required functions. For example, cysteines form disulphide or hydrogen bonds in different circumstances and charged amino acids do not always make use of their charge. The repertoire of amino acid functions and the frequency at which they occur in proteins remains understudied. Measuring large numbers of mutational consequences, which can elucidate the role an amino acid plays, was prohibitively time-consuming until recent developments in deep mutational scanning. In this study, we gathered data from 28 deep mutational scanning studies, covering 6,291 positions in 30 proteins, and used the consequences of mutation at each position to define a mutational landscape. We demonstrated rich relationships between this landscape and biophysical or evolutionary properties. Finally, we identified 100 functional amino acid subtypes with a data-driven clustering analysis and studied their features, including their frequencies and chemical properties such as tolerating polarity, hydrophobicity or being intolerant of charge or specific amino acids. The mutational landscape and amino acid subtypes provide a foundational catalogue of amino acid functional diversity, which will be refined as the number of studied protein positions increases.
Affiliation(s)
- Alistair S Dunham
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
- Pedro Beltrao
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
2. Trivedi R, Nagarajaram HA. Substitution scoring matrices for proteins - An overview. Protein Sci 2020; 29:2150-2163. [PMID: 32954566 DOI: 10.1002/pro.3954] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 08/17/2020] [Revised: 09/17/2020] [Accepted: 09/18/2020] [Indexed: 01/17/2023]
Abstract
Sequence analysis is the primary and simplest approach to discover structural, functional and evolutionary details of related proteins. All alignment-based approaches to sequence analysis make use of amino acid substitution matrices, and the accuracy of the results largely depends on the type of scoring matrices used to perform alignment tasks. An amino acid substitution matrix is a 20 × 20 matrix in which the individual elements encapsulate the rates at which each of the 20 amino acid residues in proteins is substituted by other amino acid residues over time. In contrast to most globular/ordered proteins, whose amino acid composition is considered standard, there are several classes of proteins (e.g., transmembrane proteins) in which certain types of amino acid (e.g., hydrophobic residues) are enriched. These compositional differences among various classes of proteins are manifested in their underlying residue substitution frequencies. Therefore, each compositionally distinct class of proteins or protein segments should be studied using specific scoring matrices that reflect its distinct residue substitution pattern. In this review, we describe the development and application of various substitution scoring matrices peculiar to proteins with standard and biased compositions. Along with the most commonly used standard matrices (PAM, BLOSUM, MD and VTML) that act as default parameters in various homolog search and alignment tools, different substitution scoring matrices specific to compositionally distinct classes of proteins are discussed in detail.
Affiliation(s)
- Rakesh Trivedi
  - Laboratory of Computational Biology, Centre for DNA Fingerprinting and Diagnostics, Uppal, Hyderabad, Telangana, India
  - Graduate School, Manipal Academy of Higher Education, Manipal, Karnataka, India
- Hampapathalu Adimurthy Nagarajaram
  - Laboratory of Computational Biology, Department of Systems and Computational Biology, School of Life Sciences, University of Hyderabad, Hyderabad, Telangana, India
  - Centre for Modelling, Simulation and Design, University of Hyderabad, Hyderabad, Telangana, India
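The abstract of entry 2 describes a substitution matrix as a 20 × 20 table of residue substitution tendencies. A minimal sketch of how such a table could be tabulated as log-odds scores from aligned residue pairs follows. It is written in the spirit of BLOSUM-style matrices but is not the procedure of any matrix named in the review; the function name `log_odds_matrix`, the pseudocount and the toy pairs are illustrative assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def log_odds_matrix(aligned_pairs, pseudocount=1.0):
    """Return {(a, b): log2(q_ab / (p_a * p_b))} for all 20 x 20 residue pairs.

    aligned_pairs: iterable of (residue, residue) tuples taken from alignment
    columns (hypothetical input format). Each pair is counted in both
    orientations so the resulting table is symmetric.
    """
    counts = Counter()
    for a, b in aligned_pairs:
        counts[(a, b)] += 1
        counts[(b, a)] += 1

    n = len(AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * n * n
    # q[(a, b)]: smoothed joint frequency of observing a aligned with b
    q = {(a, b): (counts[(a, b)] + pseudocount) / total
         for a in AMINO_ACIDS for b in AMINO_ACIDS}
    # p[a]: background frequency of a, obtained as the marginal of q
    p = {a: sum(q[(a, b)] for b in AMINO_ACIDS) for a in AMINO_ACIDS}
    return {(a, b): math.log2(q[(a, b)] / (p[a] * p[b]))
            for a in AMINO_ACIDS for b in AMINO_ACIDS}


# Toy data: L and I are often aligned, D and E are often aligned.
toy_pairs = ([("L", "I")] * 50 + [("L", "L")] * 50
             + [("D", "E")] * 30 + [("D", "D")] * 30)
toy_matrix = log_odds_matrix(toy_pairs)
```

With these toy pairs, frequently co-aligned residues such as L and I receive a positive score, while pairs never observed together, such as L and D, receive a negative one. A class-specific matrix of the kind the review discusses would be obtained by feeding in pairs drawn only from that protein class.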
3. Niu G, Shao Z, Liu C, Chen T, Jiao Q, Hong Z. Comparative and evolutionary analyses of the divergence of plant oligosaccharyltransferase STT3 isoforms. FEBS Open Bio 2020; 10:468-483. [PMID: 32011067 PMCID: PMC7050244 DOI: 10.1002/2211-5463.12804] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 09/24/2019] [Revised: 01/11/2020] [Accepted: 01/30/2020] [Indexed: 11/08/2022] Open
Abstract
STT3 is a catalytic subunit of hetero-oligomeric oligosaccharyltransferase (OST), which is important for asparagine-linked glycosylation. In mammals and plants, OSTs with different STT3 isoforms exhibit distinct levels of enzymatic efficiency or different responses to stressors. Although two different STT3 isoforms have been identified in both plants and animals, it remains unclear whether these isoforms result from gene duplication in an ancestral eukaryote. Furthermore, the molecular mechanisms underlying the functional divergence between the two STT3 isoforms in plants have not been well elucidated. Here, we conducted phylogenetic analysis of the major evolutionary node species and suggested that gene duplications of STT3 may have occurred independently in animals and plants. Across land plants, the exon-intron structure differed between the two STT3 isoforms, but was highly conserved for each isoform. Most angiosperm STT3a genes had 23 exons with intron phase 0, while STT3b genes had 6 exons with intron phase 2. Characteristic motifs (motif 18 and 19) of STT3s were mapped to different structural domains in the plant STT3 proteins. These two motifs overlap with regions of high nonsynonymous-to-synonymous substitution rates, suggesting the regions may be related to functional differences between STT3a and STT3b. In addition, promoter elements and gene expression profiles were different between the two isoforms, indicating expression pattern divergence of the two genes. Collectively, the identified differences may result in the functional divergence of plant STT3s.
Affiliation(s)
- Guanting Niu
  - State Key Laboratory of Pharmaceutical Biotechnology, NJU Advanced Institute for Life Sciences (NAILS), School of Life Sciences, Nanjing University, China
- Zhuqing Shao
  - State Key Laboratory of Pharmaceutical Biotechnology, NJU Advanced Institute for Life Sciences (NAILS), School of Life Sciences, Nanjing University, China
- Chuanfa Liu
  - Department of Biology, Institute of Plant and Food Science, Southern University of Science and Technology, Shenzhen, China
- Tianshu Chen
  - State Key Laboratory of Pharmaceutical Biotechnology, NJU Advanced Institute for Life Sciences (NAILS), School of Life Sciences, Nanjing University, China
- Qingsong Jiao
  - State Key Laboratory of Pharmaceutical Biotechnology, NJU Advanced Institute for Life Sciences (NAILS), School of Life Sciences, Nanjing University, China
- Zhi Hong
  - State Key Laboratory of Pharmaceutical Biotechnology, NJU Advanced Institute for Life Sciences (NAILS), School of Life Sciences, Nanjing University, China
4. Depth dependent amino acid substitution matrices and their use in predicting deleterious mutations. Progress in Biophysics and Molecular Biology 2017; 128:14-23. [DOI: 10.1016/j.pbiomolbio.2017.02.004] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 09/13/2016] [Revised: 01/06/2017] [Accepted: 02/07/2017] [Indexed: 12/31/2022]
5. Chrysostomou C, Seker H. Novel protein weight matrix generated from amino acid indices. Annual International Conference of the IEEE Engineering in Medicine and Biology Society 2016; 2015:8181-4. [PMID: 26738193 DOI: 10.1109/embc.2015.7320293] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/07/2022]
Abstract
In recent years, numerous protein weight matrices have been developed that include physical characteristics of proteins, such as local sequence-structure information, alpha-helix information, secondary structure information and solvent accessibility states. These protein weight matrices have been shown to generally improve protein sequence alignments over classical protein weight matrices, such as the Point Accepted Mutation (PAM), Blocks of Amino Acid Substitution (BLOSUM), and GONNET matrices, for which important limitations have been observed in recent works. In this paper, a novel protein weight matrix is constructed and presented. This protein weight matrix is not based on the mutation rate, like the PAM or BLOSUM matrices, but on the physicochemical properties of each amino acid. In the literature, over 500 amino acid indices exist, each one representing a unique biological protein feature. For this study, 25 amino acid indices were selected. These amino acid indices represent general and widely accepted features of the amino acids. The proposed protein weight matrix offers the following advantages over the classical protein weight matrices. It is not biased towards specific groups of protein sequences, as the values are calculated from the amino acid indices and not from the protein sequences. Additionally, the same matrix can be used regardless of the homology of the protein sequences to be aligned or the mutation rate present. A correlation with the physical characteristics of the amino acids from which the protein weight matrix is derived can be achieved. Different similarity matrices can be generated when different physical characteristics of amino acids are considered.
6. Three-dimensional protein structure prediction: Methods and computational strategies. Comput Biol Chem 2014; 53PB:251-276. [DOI: 10.1016/j.compbiolchem.2014.10.001] [Citation(s) in RCA: 121] [Impact Index Per Article: 12.1] [Received: 06/12/2014] [Revised: 10/03/2014] [Accepted: 10/07/2014] [Indexed: 01/01/2023]
7. Joseph AP, de Brevern AG. From local structure to a global framework: recognition of protein folds. J R Soc Interface 2014; 11:20131147. [PMID: 24740960 DOI: 10.1098/rsif.2013.1147] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Indexed: 12/21/2022] Open
Abstract
Protein folding has been a major area of research for many years. Nonetheless, the mechanisms leading to the formation of an active biological fold are still not fully understood. The huge amount of available sequence and structural information provides hints to identify the putative fold for a given sequence. Indeed, protein structures prefer a limited number of local backbone conformations, some being characterized by preferences for certain amino acids. These preferences largely depend on the local structural environment. The prediction of local backbone conformations has become an important factor in correctly identifying the global protein fold. Here, we review the developments in the field of local structure prediction and especially their implications for protein fold recognition.
Affiliation(s)
- Agnel Praveen Joseph
  - Science and Technology Facilities Council, Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11 0QX, UK
8. Improvement in low-homology template-based modeling by employing a model evaluation method with focus on topology. PLoS One 2014; 9:e89935. [PMID: 24587135 PMCID: PMC3935967 DOI: 10.1371/journal.pone.0089935] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Received: 10/22/2013] [Accepted: 01/24/2014] [Indexed: 01/22/2023] Open
Abstract
Many template-based modeling (TBM) methods have been developed in recent years that allow for protein structure prediction and for the study of structure-function relationships for proteins. One major problem all TBM algorithms face, however, is their unsatisfactory performance when the proteins under consideration are low-homology targets. To improve the performance of TBM methods for such targets, a novel model evaluation method, named MEFTop, was developed here. Our method focuses on evaluating the topology by using two novel groups of features: secondary structure element (SSE) contact information and 3-dimensional topology information. By combining the MEFTop algorithm with FR-t5, a threading program developed by our group, we found that this modified TBM program, which was named FR-t5-M, exhibited significant improvements in predictive ability for low-homology protein targets. We further showed that MEFTop could be a generalized method to improve threading programs for low-homology protein targets. The software (FR-t5-M and MEFTop) is available to non-commercial users at our website: http://jianglab.ibp.ac.cn/lims/FRt5M/FRt5M.html.
9. Angermüller C, Biegert A, Söding J. Discriminative modelling of context-specific amino acid substitution probabilities. Bioinformatics 2012; 28:3240-7. [PMID: 23080114 DOI: 10.1093/bioinformatics/bts622] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.6] [Indexed: 11/14/2022]
Abstract
MOTIVATION: Protein sequence searching and alignment are fundamental tools of modern biology. Alignments are assessed using their similarity scores, essentially the sum of substitution matrix scores over all pairs of aligned amino acids. We previously proposed a generative probabilistic method that yields scores that take the sequence context around each aligned residue into account. This method showed drastically improved sensitivity and alignment quality compared with standard substitution matrix-based alignment. RESULTS: Here, we develop an alternative discriminative approach to predict sequence context-specific substitution scores. We applied our approach to compute context-specific sequence profiles for Basic Local Alignment Search Tool (BLAST) and compared the new tool (CS-BLASTdis) to BLAST and the previous context-specific version (CS-BLASTgen). On a dataset filtered to 20% maximum sequence identity, CS-BLASTdis was 51% more sensitive than BLAST and 17% more sensitive than CS-BLASTgen in detecting remote homologues at 10% false discovery rate. At 30% maximum sequence identity, its alignments contain 21% and 12% more correct residue pairs than those of BLAST and CS-BLASTgen, respectively. Clear improvements are also seen when the approach is combined with PSI-BLAST and HHblits. We believe the context-specific approach should replace substitution matrices wherever sensitivity and alignment quality are critical.
Affiliation(s)
- Christof Angermüller
  - Gene Center Munich and Department of Biochemistry, Ludwig-Maximilians-Universität München, 81377 Munich, Germany
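The abstract of entry 9 characterises an alignment's similarity score as essentially the sum of substitution matrix scores over all aligned residue pairs. A minimal sketch of that sum is given below. The scoring function `toy_score` is a made-up stand-in for a real matrix such as BLOSUM62, and skipping gap columns rather than penalising them is a simplifying assumption that the abstract does not address.

```python
HYDROPHOBIC = set("AILMVFWC")  # coarse grouping used only by the toy scores


def toy_score(a, b):
    """Hypothetical substitution score: reward identity and like-for-like swaps."""
    if a == b:
        return 4
    if a in HYDROPHOBIC and b in HYDROPHOBIC:
        return 1
    return -2


def alignment_score(row_a, row_b, score=toy_score, gap="-"):
    """Sum substitution scores over aligned residue pairs of two alignment rows.

    Columns containing a gap character are skipped (an assumption; real tools
    apply gap penalties instead).
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    return sum(score(a, b)
               for a, b in zip(row_a, row_b)
               if a != gap and b != gap)


example = alignment_score("ACDL", "ACEI")  # 4 + 4 - 2 + 1
```

A context-specific method of the kind the abstract proposes would replace the fixed `score(a, b)` lookup with a score that also depends on the residues surrounding each aligned position.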
10. Söding J, Remmert M. Protein sequence comparison and fold recognition: progress and good-practice benchmarking. Curr Opin Struct Biol 2011; 21:404-11. [PMID: 21458982 DOI: 10.1016/j.sbi.2011.03.005] [Citation(s) in RCA: 55] [Impact Index Per Article: 4.2] [Received: 01/31/2011] [Revised: 03/01/2011] [Accepted: 03/09/2011] [Indexed: 11/26/2022]
Abstract
Protein sequence comparison methods have grown increasingly sensitive during the last decade and can often identify distantly related proteins sharing a common ancestor some 3 billion years ago. Although cellular function is not conserved so long, molecular functions and structures of protein domains often are. In combination with a domain-centered approach to function and structure prediction, modern remote homology detection methods have a great and largely underexploited potential for elucidating protein functions and evolution. Advances during the last few years include nonlinear scoring functions combining various sequence features, the use of sequence context information, and powerful new software packages. Since progress depends on realistically assessing new and existing methods and published benchmarks are often hard to compare, we propose 10 rules of good-practice benchmarking.
Affiliation(s)
- Johannes Söding
  - Gene Center and Center for Integrated Protein Science, Ludwig-Maximilians-Universität München, Feodor-Lynen-Strasse 25, Munich, Germany
11. Hu Y, Dong X, Wu A, Cao Y, Tian L, Jiang T. Incorporation of local structural preference potential improves fold recognition. PLoS One 2011; 6:e17215. [PMID: 21365008 PMCID: PMC3041821 DOI: 10.1371/journal.pone.0017215] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 11/01/2010] [Accepted: 01/25/2011] [Indexed: 11/19/2022] Open
Abstract
Fold recognition, or threading, is a popular protein structure modeling approach that uses templates of known structure to build structures for sequences whose structure is unknown. The key to the success of fold recognition methods lies in the proper integration of sequence, physicochemical and structural information. Here we introduce another type of information, local structural preference potentials of 3-residue and 9-residue fragments, for fold recognition. By combining the two local structural preference potentials with the widely used sequence profile, secondary structure information and hydrophobic score, we have developed a new threading method called FR-t5 (fold recognition by use of 5 terms). In benchmark tests, we have found that the consideration of local structural preference potentials in FR-t5 not only greatly enhances the alignment accuracy and recognition sensitivity, but also significantly improves the quality of prediction models.
Affiliation(s)
- Yun Hu
  - National Laboratory of Biomacromolecules, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
  - Graduate University of Chinese Academy of Sciences, Beijing, China
- Xiaoxi Dong
  - National Laboratory of Biomacromolecules, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
  - Graduate University of Chinese Academy of Sciences, Beijing, China
- Aiping Wu
  - National Laboratory of Biomacromolecules, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
- Yang Cao
  - National Laboratory of Biomacromolecules, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
  - Graduate University of Chinese Academy of Sciences, Beijing, China
- Liqing Tian
  - National Laboratory of Biomacromolecules, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
  - Graduate University of Chinese Academy of Sciences, Beijing, China
- Taijiao Jiang
  - National Laboratory of Biomacromolecules, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
12. Xu HS, Ren WK, Liu XH, Li XQ. Aligning protein sequence and analysing substitution pattern using a class-specific matrix. J Biosci 2011; 35:295-314. [PMID: 20689185 DOI: 10.1007/s12038-010-0033-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/30/2022]
Abstract
Aligning protein sequences using a score matrix has become a routine but valuable method in modern biological research. However, alignment in the 'twilight zone' remains an open issue. It is feasible and necessary to construct new score matrices as more protein structures are resolved. Three structural class-specific score matrices (all-alpha, all-beta and alpha/beta) were constructed based on the structure alignment of low-identity proteins of the corresponding structural classes. The class-specific score matrices were significantly better than a structure-derived matrix (HSDM) and three other generalized matrices (BLOSUM30, BLOSUM60 and Gonnet250) in alignment performance tests. The optimized gap penalties presented here also improve alignment performance. The results indicate that different protein classes have distinct amino acid substitution patterns, and that an amino acid score matrix should be constructed based on different structural classes. The class-specific score matrices could also be used in profile construction to improve homology detection.
Affiliation(s)
- Hai Song Xu
  - College of Life Science and Bioengineering, Beijing University of Technology, Beijing 100124, China
13. Shen HD, Tam MF, Huang CH, Chou H, Tai HY, Chen YS, Sheu SY, Thomas WR. Homology modeling and monoclonal antibody binding of the Der f 7 dust mite allergen. Immunol Cell Biol 2010; 89:225-30. [DOI: 10.1038/icb.2010.77] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.1] [Indexed: 11/09/2022]
Affiliation(s)
- Horng-Der Shen
  - Department of Medical Research and Education, Taipei Veterans General Hospital, Taipei, Taiwan, ROC
- Ming F Tam
  - Institute of Molecular Biology, Academia Sinica, Taipei, Taiwan, ROC
- Chao-Hsien Huang
  - Department of Life Sciences and Institute of Genome Sciences, National Yang-Ming University, Taipei, Taiwan, ROC
- Hong Chou
  - Department of Medical Research and Education, Taipei Veterans General Hospital, Taipei, Taiwan, ROC
- Hsiao-Yun Tai
  - Department of Medical Research and Education, Taipei Veterans General Hospital, Taipei, Taiwan, ROC
- Yu-Sen Chen
  - Department of Medical Research and Education, Taipei Veterans General Hospital, Taipei, Taiwan, ROC
- Sheh-Yi Sheu
  - Department of Life Sciences and Institute of Genome Sciences, National Yang-Ming University, Taipei, Taiwan, ROC
- Wayne R Thomas
  - Centre for Child Health Research, University of Western Australia, Telethon Institute for Child Health Research, West Perth, Western Australia, Australia
14. Chen CC, Hwang JK, Yang JM. (PS)2-v2: template-based protein structure prediction server. BMC Bioinformatics 2009; 10:366. [PMID: 19878598 PMCID: PMC2775752 DOI: 10.1186/1471-2105-10-366] [Citation(s) in RCA: 93] [Impact Index Per Article: 6.2] [Received: 06/29/2009] [Accepted: 10/31/2009] [Indexed: 03/11/2024] Open
Abstract
Background: Template selection and target-template alignment are critical steps for template-based modeling (TBM) methods. Identifying the template in the twilight zone of 15~25% sequence similarity between targets and templates is still difficult for template-based protein structure prediction. This study presents the (PS)2-v2 server, based on our original server with numerous enhancements and modifications, to improve reliability and applicability. Results: To detect homologous proteins with remote similarity, the (PS)2-v2 server utilizes the S2A2 matrix, which is a 60 × 60 substitution matrix using the secondary structure propensities of 20 amino acids, and the position-specific sequence profile (PSSM) generated by PSI-BLAST. In addition, our server uses multiple templates and multiple models to build and assess models. Our method was evaluated on the Lindahl benchmark for fold recognition and the ProSup benchmark for sequence alignment. Evaluation results indicated that our method outperforms sequence-profile approaches and has comparable performance to that of structure-based methods on these benchmarks. Finally, we tested our method using the 154 TBM targets of the CASP8 (Critical Assessment of Techniques for Protein Structure Prediction) dataset. Experimental results show that (PS)2-v2 is ranked 6th among 72 servers and is faster than the five top-ranked servers, which utilize ab initio methods. Conclusion: Experimental results demonstrate that (PS)2-v2 with the S2A2 matrix is useful for template selection and target-template alignment by blending the amino acid and structural propensities. The multiple-template and multiple-model strategies are able to significantly improve the accuracy of target-template alignments in the twilight zone. We believe that this server is useful in structure prediction and modeling, especially in detecting homologous templates with sequence similarity in the twilight zone.
Affiliation(s)
- Chih-Chieh Chen
  - Institute of Bioinformatics, National Chiao Tung University, Hsinchu 30050, Taiwan, Republic of China
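The abstract of entry 14 mentions a 60 × 60 substitution matrix built from the secondary structure propensities of 20 amino acids. One way to picture the size is 20 residues × 3 secondary structure states = 60 combined states. The sketch below shows only that indexing idea; the three-state alphabet `HEC` (helix, strand, coil) and the row ordering are assumptions for illustration and are not taken from the S2A2 paper.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues
SS_STATES = "HEC"  # assumed three-state alphabet: helix, strand, coil


def state_index(residue, ss_state):
    """Map a (residue, secondary structure) pair to a row/column in 0..59.

    Hypothetical layout: all helix states first, then strand, then coil.
    """
    return SS_STATES.index(ss_state) * len(AMINO_ACIDS) + AMINO_ACIDS.index(residue)


def pair_score(matrix, res_a, ss_a, res_b, ss_b):
    """Look up a score in a 60 x 60 nested list indexed by combined states."""
    return matrix[state_index(res_a, ss_a)][state_index(res_b, ss_b)]


# Every (residue, state) combination receives its own index.
all_indices = {state_index(r, s) for r in AMINO_ACIDS for s in SS_STATES}
```

Under this layout an alanine in a helix and an alanine in a strand occupy different rows, which is what lets such a matrix score the same residue differently depending on its structural context.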
15. Tångrot JE, Kågström B, Sauer UH. Accurate domain identification with structure-anchored hidden Markov models, saHMMs. Proteins 2009; 76:343-52. [PMID: 19173309 DOI: 10.1002/prot.22349] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/11/2022]
Abstract
The ever increasing speed of DNA sequencing widens the discrepancy between the number of known gene products, and the knowledge of their function and structure. Proper annotation of protein sequences is therefore crucial if the missing information is to be deduced from sequence-based similarity comparisons. These comparisons become exceedingly difficult as the pairwise identities drop to very low values. To improve the accuracy of domain identification, we exploit the fact that the three-dimensional structures of domains are much more conserved than their sequences. Based on structure-anchored multiple sequence alignments of low identity homologues we constructed 850 structure-anchored hidden Markov models (saHMMs), each representing one domain family. Since the saHMMs are highly family specific, they can be used to assign a domain to its correct family and clearly distinguish it from domains belonging to other families, even within the same superfamily. This task is not trivial and becomes particularly difficult if the unknown domain is distantly related to the rest of the domain sequences within the family. In a search with full length protein sequences, harbouring at least one domain as defined by the structural classification of proteins database (SCOP), version 1.71, versus the saHMM database based on SCOP version 1.69, we achieve an accuracy of 99.0%. All of the few hits outside the family fall within the correct superfamily. Compared to Pfam_ls HMMs, the saHMMs obtain about 11% higher coverage. A comparison with BLAST and PSI-BLAST demonstrates that the saHMMs have consistently fewer errors per query at a given coverage. Within our recommended E-value range, the same is true for a comparison with SUPERFAMILY. Furthermore, we are able to annotate 232 proteins with 530 nonoverlapping domains belonging to 102 different domain families among human proteins labelled "unknown" in the NCBI protein database. 
Our results demonstrate that the saHMM database represents a versatile and reliable tool for identification of domains in protein sequences. With the aid of saHMMs, homology on the family level can be assigned, even for distantly related sequences. Due to the construction of the saHMMs, the hits they provide are always associated with high quality crystal structures. The saHMM database can be accessed via the FISH server at http://babel.ucmp.umu.se/fish/.
16. Song CM, Lim SJ, Tong JC. Recent advances in computer-aided drug design. Brief Bioinform 2009; 10:579-91. [PMID: 19433475 DOI: 10.1093/bib/bbp023] [Citation(s) in RCA: 152] [Impact Index Per Article: 10.1] [Indexed: 01/28/2023] Open
Abstract
Modern drug discovery is characterized by the production of vast quantities of compounds and the need to examine these huge libraries in short periods of time. The need to store, manage and analyze these rapidly increasing resources has given rise to the field known as computer-aided drug design (CADD). CADD represents computational methods and resources that are used to facilitate the design and discovery of new therapeutic solutions. Digital repositories, containing detailed information on drugs and other useful compounds, are goldmines for the study of chemical reaction capabilities. Design libraries, with the potential to generate molecular variants in their entirety, allow the selection and sampling of chemical compounds with diverse characteristics. Fold recognition, for studying sequence-structure homology between protein sequences and structures, is helpful for inferring binding sites and molecular functions. Virtual screening, the in silico analog of high-throughput screening, offers great promise for systematic evaluation of huge chemical libraries to identify potential lead candidates that can be synthesized and tested. In this article, we present an overview of the most important data sources and computational methods for the discovery of new molecular entities. The workflow of the entire virtual screening campaign is discussed, from data collection through to post-screening analysis.
Affiliation(s)
- Chun Meng Song
  - Institute for Infocomm Research, Connexis South Tower, Singapore 138632
17. Xia X, Zhang S, Su Y, Sun Z. MICAlign: a sequence-to-structure alignment tool integrating multiple sources of information in conditional random fields. Bioinformatics 2009; 25:1433-4. [DOI: 10.1093/bioinformatics/btp251] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 11/13/2022] Open
18.
Abstract
Sequence alignment and database searching are essential tools in biology because a protein's function can often be inferred from homologous proteins. Standard sequence comparison methods use substitution matrices to find the alignment with the best sum of similarity scores between aligned residues. These similarity scores do not take the local sequence context into account. Here, we present an approach that derives context-specific amino acid similarities from short windows centered on each query sequence residue. Our results demonstrate that the sequence context contains much more information about the expected mutations than just the residue itself. By employing our context-specific similarities (CS-BLAST) in combination with NCBI BLAST, we increase the sensitivity more than 2-fold on a difficult benchmark set, without loss of speed. Alignment quality is likewise improved significantly. Furthermore, we demonstrate considerable improvements when applying this paradigm to sequence profiles: Two iterations of CSI-BLAST, our context-specific version of PSI-BLAST, are more sensitive than 5 iterations of PSI-BLAST. The paradigm for biological sequence comparison presented here is very general. It can replace substitution matrices in sequence- and profile-based alignment and search methods for both protein and nucleotide sequences.
|
19
|
PROSIGN: A method for protein secondary structure assignment based on three-dimensional coordinates of consecutive Cα atoms. Comput Biol Chem 2008; 32:406-11. [DOI: 10.1016/j.compbiolchem.2008.07.027] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Received: 10/21/2007] [Revised: 07/15/2008] [Accepted: 07/24/2008] [Indexed: 11/18/2022]
|
20
|
Improved scoring function for comparative modeling using the M4T method. J Struct Funct Genomics 2008; 10:95-9. [PMID: 18985440 DOI: 10.1007/s10969-008-9044-9] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.8] [Received: 08/14/2008] [Accepted: 10/16/2008] [Indexed: 10/21/2022]
Abstract
Improvements in comparative protein structure modeling for the remote target-template sequence similarity cases are possible through the optimal combination of multiple template structures and by improving the quality of target-template alignment. Recently developed MMM and M4T methods were designed to address these problems. Here we describe new developments in both the alignment generation and the template selection parts of the modeling algorithms. We set up a new scoring function in MMM to deliver more accurate target-template alignments. This was achieved by developing and incorporating into the composite scoring function a novel statistical pairwise potential that combines local and non-local terms. The non-local term of the statistical potential utilizes a shuffled reference state definition that helped to eliminate most of the false positive signal from the background distribution of pairwise contacts. The accuracy of the scoring function was further increased by using BLOSUM mutation table scores.
|
21
|
Gong S, Blundell TL. Discarding functional residues from the substitution table improves predictions of active sites within three-dimensional structures. PLoS Comput Biol 2008; 4:e1000179. [PMID: 18833291 PMCID: PMC2527532 DOI: 10.1371/journal.pcbi.1000179] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.0] [Received: 03/25/2008] [Accepted: 08/07/2008] [Indexed: 11/21/2022]
Abstract
Substitutions of individual amino acids in proteins may be under very different evolutionary restraints depending on their structural and functional roles. The Environment Specific Substitution Table (ESST) describes the pattern of substitutions in terms of amino acid location within elements of secondary structure, solvent accessibility, and the existence of hydrogen bonds between side chains and neighbouring amino acid residues. Clearly amino acids that have very different local environments in their functional state compared to those in the protein analysed will give rise to inconsistencies in the calculation of amino acid substitution tables. Here, we describe how the calculation of ESSTs can be improved by discarding the functional residues from the calculation of substitution tables. Four categories of functions are examined in this study: protein–protein interactions, protein–nucleic acid interactions, protein–ligand interactions, and catalytic activity of enzymes. Their contributions to residue conservation are measured and investigated. We test our new ESSTs using the program CRESCENDO, designed to predict functional residues by exploiting knowledge of amino acid substitutions, and compare the benchmark results with proteins whose functions have been defined experimentally. The new methodology increases the Z-score by 98% at the active site residues and finds 16% more active sites compared with the old ESST. We also find that discarding amino acids responsible for protein–protein interactions helps in the prediction of those residues although they are not as conserved as the residues of active sites. Our methodology can make the substitution tables better reflect and describe the substitution patterns of amino acids that are under structural restraints only. Identification of residues responsible for a specific function of a protein can provide clues about the mechanism of action. 
Computational approaches to identifying functional residues have emerged as low-cost alternatives to experimental methods by providing fast and large-scale analyses. Moreover, the demand for such approaches is increasing as more sequences become available from genome sequencing projects. Here, we focus on the use of CRESCENDO to identify functional residues in proteins of known structure by comparing the amino acid substitutions observed in a family of proteins with those predicted on the basis of the protein structure. CRESCENDO uses Environment Specific Substitution Tables, or ESSTs, which define the way that accepted amino acid substitutions are influenced by the local structural environment. We describe how the calculation of ESSTs can be improved by using only amino acids that are not involved in catalytic activity, metal or ligand binding, nucleic acid or protein interactions, and other molecular functions. Our new substitution table can better describe the degree of amino acids substitutions that are under structural restraints. It should be of value in all applications of ESSTs, including their use in sequence–structure homology recognition, structure validation, and structure prediction in addition to their use in the identification of functional residues. These approaches should enhance the understanding of protein structure and function, which is critically important in the postgenomic era.
Affiliation(s)
- Sungsam Gong
- Biocomputing Group, Department of Biochemistry, University of Cambridge, Cambridge, United Kingdom
- Tom L. Blundell
- Biocomputing Group, Department of Biochemistry, University of Cambridge, Cambridge, United Kingdom
|
22
|
Bernsel A, Viklund H, Elofsson A. Remote homology detection of integral membrane proteins using conserved sequence features. Proteins 2008; 71:1387-99. [PMID: 18076048 DOI: 10.1002/prot.21825] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 11/05/2022]
Abstract
Compared with globular proteins, transmembrane proteins are surrounded by a more intricate environment and, consequently, amino acid composition varies between the different compartments. Existing algorithms for homology detection are generally developed with globular proteins in mind and may not be optimal to detect distant homology between transmembrane proteins. Here, we introduce a new profile-profile based alignment method for remote homology detection of transmembrane proteins in a hidden Markov model framework that takes advantage of the sequence constraints placed by the hydrophobic interior of the membrane. We expect that, for distant membrane protein homologs, even if the sequences have diverged too far to be recognized, the hydrophobicity pattern and the transmembrane topology are better conserved. By using this information in parallel with sequence information, we show that both sensitivity and specificity can be substantially improved for remote homology detection in two independent test sets. In addition, we show that alignment quality can be improved for the most distant homologs in a public dataset of membrane protein structures. Applying the method to the Pfam domain database, we are able to suggest new putative evolutionary relationships for a few relatively uncharacterized protein domain families, of which several are confirmed by other methods. The method is called Searcher for Homology Relationships of Integral Membrane Proteins (SHRIMP) and is available for download at http://www.sbc.su.se/shrimp/.
Affiliation(s)
- Andreas Bernsel
- Center for Biomembrane Research, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden
|
23
|
Lee MM, Bundschuh R, Chan MK. Distant homology detection using a LEngth and STructure-based sequence Alignment Tool (LESTAT). Proteins 2007; 71:1409-19. [DOI: 10.1002/prot.21830] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022]
|
24
|
Goonesekere NCW, Lee B. Context-specific amino acid substitution matrices and their use in the detection of protein homologs. Proteins 2007; 71:910-9. [DOI: 10.1002/prot.21775] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.9] [Indexed: 12/26/2022]
|
25
|
Smith RE, Lovell SC, Burke DF, Montalvao RW, Blundell TL. Andante: reducing side-chain rotamer search space during comparative modeling using environment-specific substitution probabilities. Bioinformatics 2007; 23:1099-105. [PMID: 17341496 DOI: 10.1093/bioinformatics/btm073] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.6] [Indexed: 11/13/2022]
Abstract
MOTIVATION: The accurate placement of side chains in computational protein modeling and design involves the searching of vast numbers of rotamer combinations. RESULTS: We have applied the information contained within structurally aligned homologous families, in the form of chi angle conservation rules, to the problem of comparative modeling. This allows the accurate borrowing of entire side-chain conformations and/or the restriction to high probability rotamer bins. The application of these rules consistently reduces the number of rotamer combinations that need to be searched to trivial values and also reduces the overall side-chain root mean square deviation (rmsd) of the final model. The approach is complementary to current side-chain placement algorithms that use the decomposition of interacting clusters to increase the speed of the placement process.
Affiliation(s)
- Richard E Smith
- Department of Biochemistry, University of Cambridge, Cambridge, UK.
|
26
|
Baussand J, Deremble C, Carbone A. Periodic distributions of hydrophobic amino acids allows the definition of fundamental building blocks to align distantly related proteins. Proteins 2007; 67:695-708. [PMID: 17299747 DOI: 10.1002/prot.21319] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.7] [Indexed: 11/09/2022]
Abstract
Several studies on large and small families of proteins proved in a general manner that hydrophobic amino acids are globally conserved even if they are subjected to a high substitution rate. Statistical analysis of amino acid evolution within blocks of hydrophobic amino acids detected in sequences suggests their usage as a basic structural pattern to align pairs of proteins of less than 25% sequence identity, with no need of knowing their 3D structure. The authors present a new global alignment method and an automatic tool for Proteins with HYdrophobic Blocks ALignment (PHYBAL) based on the combinatorics of overlapping hydrophobic blocks. Two substitution matrices modeling a different selective pressure inside and outside hydrophobic blocks are constructed, the Inside Hydrophobic Blocks Matrix and the Outside Hydrophobic Blocks Matrix, and a 4D space of gap values is explored. PHYBAL performance is evaluated against the Needleman and Wunsch algorithm run with Blosum 30, Blosum 45, Blosum 62, Gonnet, HSDM, PAM250, Johnson and Remote Homo matrices. PHYBAL behavior is analyzed on eight randomly selected pairs of proteins of <30% sequence identity that cover a large spectrum of structural properties. It is also validated on two large datasets, the 127 pairs of the Domingues dataset with <30% sequence identity, and 181 pairs issued from BAliBASE 2.0 and ranked by percentage of identity from 7 to 25%. Results confirm the importance of considering substitution matrices modeling hydrophobic contexts and a 4D space of gap values in aligning distantly related proteins. Two new notions of local and global stability are defined to assess the robustness of an alignment algorithm and the accuracy of PHYBAL. A new notion, the SAD-coefficient, to assess the difficulty of structural alignment is also introduced. PHYBAL has been compared with Hydrophobic Cluster Analysis and HMMSUM methods.
Affiliation(s)
- J Baussand
- Génomique Analytique, INSERM UMRS511, Université Pierre et Marie Curie-Paris 6, 91, Bd de l'Hôpital, 75013 Paris, France
|
27
|
Abstract
MOTIVATION: Accurate alignment of a target sequence to a template structure continues to be a bottleneck in producing good quality comparative protein structure models. RESULTS: Multiple Mapping Method (MMM) is a comparative protein structure modeling server with an emphasis on a novel alignment optimization protocol. MMM takes inputs from five profile-to-profile based alignment methods. The alternatively aligned regions from the input alignment set are combined according to their fit in the structural environment of the template structure. The resulting, optimally spliced MMM alignment is used as input to an automated comparative modeling module to produce a full atom model. AVAILABILITY: The MMM server is freely accessible at http://www.fiserlab.org/servers/mmm
Affiliation(s)
- Brajesh K Rai
- Department of Biochemistry and Seaver Center for Bioinformatics, Albert Einstein College of Medicine 1300 Morris Park Avenue, Bronx, NY 10461, USA
|
28
|
Rai BK, Fiser A. Multiple mapping method: a novel approach to the sequence-to-structure alignment problem in comparative protein structure modeling. Proteins 2006; 63:644-61. [PMID: 16437570 DOI: 10.1002/prot.20835] [Citation(s) in RCA: 59] [Impact Index Per Article: 3.3] [Indexed: 11/11/2022]
Abstract
A major bottleneck in comparative protein structure modeling is the quality of input alignment between the target sequence and the template structure. A number of alignment methods are available, but none of these techniques produce consistently good solutions for all cases. Alignments produced by alternative methods may be superior in certain segments but inferior in others when compared to each other; therefore, an accurate solution often requires an optimal combination of them. To address this problem, we have developed a new approach, Multiple Mapping Method (MMM). The algorithm first identifies the alternatively aligned regions from a set of input alignments. These alternatively aligned segments are scored using a composite scoring function, which determines their fitness within the structural environment of the template. The best scoring regions from a set of alternative segments are combined with the core part of the alignments to produce the final MMM alignment. The algorithm was tested on a dataset of 1400 protein pairs using 11 combinations of two to four alignment methods. In all cases MMM showed statistically significant improvement by reducing alignment errors in the range of 3 to 17%. MMM also compared favorably over two alignment meta-servers. The algorithm is computationally efficient; therefore, it is a suitable tool for genome scale modeling studies.
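The splicing step described above can be sketched as follows. This is only a schematic under stated assumptions: the alignment is represented as an ordered list of regions, core regions carry a single candidate segment, and alternatively aligned regions carry several. The scoring function `toy_score` is a hypothetical placeholder (fraction of gap-free columns) and stands in for the composite, structure-aware scoring function that MMM actually uses.

```python
def splice_alignment(regions, score):
    """Pick the best-scoring candidate segment for every region.

    regions: ordered list; each item is a list of candidate segments.
             Core regions have one candidate, variable regions several.
    score:   callable mapping a segment to a number (higher is better).
    """
    return [max(candidates, key=score) for candidates in regions]


def toy_score(segment):
    """Hypothetical stand-in score: fraction of alignment columns without a gap."""
    target, template = segment
    columns = list(zip(target, template))
    ungapped = sum(1 for t, s in columns if t != "-" and s != "-")
    return ungapped / len(columns)


# Each segment is a (target, template) pair of equal-length aligned strings.
regions = [
    [("MKV", "MRV")],                           # core region
    [("A-GT", "ASGT"), ("--AGT", "ASG-T")],     # two alternative alignments
    [("LLE", "LIE")],                           # core region
]
final = splice_alignment(regions, toy_score)
```

With a structure-based score in place of `toy_score`, the same selection loop would favour segments that fit the template environment rather than segments with fewer gaps.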
Affiliation(s)
- Brajesh K Rai
- Department of Biochemistry and Seaver Center for Bioinformatics, Albert Einstein College of Medicine, Bronx, New York 10461, USA
|
29
|
Abstract
Homology modeling plays a central role in determining protein structure in the structural genomics project. The importance of homology modeling has been steadily increasing because of the large gap that exists between the overwhelming number of available protein sequences and experimentally solved protein structures, and also, more importantly, because of the increasing reliability and accuracy of the method. In fact, a protein sequence with over 30% identity to a known structure can often be predicted with an accuracy equivalent to a low-resolution X-ray structure. The recent advances in homology modeling, especially in detecting distant homologues, aligning sequences with template structures, modeling of loops and side chains, as well as detecting errors in a model, have contributed to reliable prediction of protein structure, which was not possible even several years ago. The ongoing efforts in solving protein structures, which can be time-consuming and often difficult, will continue to spur the development of a host of new computational methods that can fill in the gap and further contribute to understanding the relationship between protein structure and function.
Affiliation(s)
- Zhexin Xiang
- Center for Molecular Modeling, Center for Information Technology, National Institutes of Health, Building 12A Room 2051, 12 South Drive, Bethesda, Maryland 20892-5624, USA.
|
30
|
Abstract
It has long been recognized that knowledge of the 3D structures of proteins has the potential to accelerate drug discovery, but recent developments in genome sequencing, robotics and bioinformatics have radically transformed the opportunities. Many new protein targets have been identified from genome analyses and studied by X-ray analysis or NMR spectroscopy. Structural biology has been instrumental in directing not only lead optimization and target identification, where it has well-established roles, but also lead discovery, now that high-throughput methods of structure determination can provide powerful approaches to screening.
Affiliation(s)
- Miles Congreve
- Astex Technology, 436 Cambridge Science Park, Milton Road, Cambridge CB4 0QA, UK
|
31
|
Dunbrack RL. Sequence comparison and protein structure prediction. Curr Opin Struct Biol 2006; 16:374-84. [PMID: 16713709 DOI: 10.1016/j.sbi.2006.05.006] [Citation(s) in RCA: 119] [Impact Index Per Article: 6.6] [Received: 02/23/2006] [Revised: 03/22/2006] [Accepted: 05/08/2006] [Indexed: 10/24/2022]
Abstract
Sequence comparison is a major step in the prediction of protein structure from existing templates in the Protein Data Bank. The identification of potentially remote homologues to be used as templates for modeling target sequences of unknown structure and their accurate alignment remain challenges, despite many years of study. The most recent advances have been in combining as many sources of information as possible--including amino acid variation in the form of profiles or hidden Markov models for both the target and template families, known and predicted secondary structures of the template and target, respectively, the combination of structure alignment for distant homologues and sequence alignment for close homologues to build better profiles, and the anchoring of certain regions of the alignment based on existing biological data. Newer technologies have been applied to the problem, including the use of support vector machines to tackle the fold classification problem for a target sequence and the alignment of hidden Markov models. Finally, using the consensus of many fold recognition methods, whether based on profile-profile alignments, threading or other approaches, continues to be one of the most successful strategies for both recognition and alignment of remote homologues. Although there is still room for improvement in identification and alignment methods, additional progress may come from model building and refinement methods that can compensate for large structural changes between remotely related targets and templates, as well as for regions of misalignment.
Affiliation(s)
- Roland L Dunbrack
- Institute for Cancer Research, Fox Chase Cancer Center, 333 Cottman Avenue, Philadelphia, PA 19111, USA.
|
32
|
Blundell TL, Sibanda BL, Montalvão RW, Brewerton S, Chelliah V, Worth CL, Harmer NJ, Davies O, Burke D. Structural biology and bioinformatics in drug design: opportunities and challenges for target identification and lead discovery. Philos Trans R Soc Lond B Biol Sci 2006; 361:413-23. [PMID: 16524830 PMCID: PMC1609333 DOI: 10.1098/rstb.2005.1800] [Citation(s) in RCA: 98] [Impact Index Per Article: 5.4] [Indexed: 11/12/2022]
Abstract
Impressive progress in genome sequencing, protein expression and high-throughput crystallography and NMR has radically transformed the opportunities to use protein three-dimensional structures to accelerate drug discovery, but the quantity and complexity of the data have ensured a central place for informatics. Structural biology and bioinformatics have assisted in lead optimization and target identification where they have well established roles; they can now contribute to lead discovery, exploiting high-throughput methods of structure determination that provide powerful approaches to screening of fragment binding.
Affiliation(s)
- Tom L Blundell
- Department of Biochemistry, University of Cambridge 80 Tennis Court Road, Cambridge CB2 1GA, UK.
|
33
|
Litvinov II, Lobanov MY, Mironov AA, Finkelshtein AV, Roytberg MA. Information on the secondary structure improves the quality of protein sequence alignment. Mol Biol 2006. [DOI: 10.1134/s0026893306030149] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/23/2022]
|
34
|
Fernandez-Fuentes N, Oliva B, Fiser A. A supersecondary structure library and search algorithm for modeling loops in protein structures. Nucleic Acids Res 2006; 34:2085-97. [PMID: 16617149 PMCID: PMC1440879 DOI: 10.1093/nar/gkl156] [Citation(s) in RCA: 61] [Impact Index Per Article: 3.4] [Indexed: 11/14/2022]
Abstract
We present a fragment-search based method for predicting loop conformations in protein models. A hierarchical and multidimensional database has been set up that currently classifies 105 950 loop fragments and loop flanking secondary structures. Besides the length of the loops and types of bracing secondary structures, the database is organized along four internal coordinates, a distance and three types of angles characterizing the geometry of stem regions. Candidate fragments are selected from this library by matching the length and the types of bracing secondary structures of the query and by satisfying the geometrical restraints of the stems. They are subsequently inserted in the query protein framework, where their fit is assessed by the root mean square deviation (r.m.s.d.) of stem regions and by the number of rigid body clashes with the environment. In the final step, remaining candidate loops are ranked by a Z-score that combines information on sequence similarity and fit of predicted and observed ϕ/ψ main chain dihedral angle propensities. Confidence Z-score cut-offs were determined for each loop length that identify those predicted fragments that outperform a competitive ab initio method. A web server implements the method, regularly updates the fragment library and performs prediction. Predicted segments are returned, or optionally, these can be completed with side chain reconstruction and subsequently annealed in the environment of the query protein by conjugate gradient minimization. The prediction method was tested on artificially prepared search datasets where all trivial sequence similarities on the SCOP superfamily level were removed. Under these conditions it is possible to predict loops of length 4, 8 and 12 with coverage of 98, 78 and 28% and an accuracy of at least 0.22, 1.38 and 2.47 Å r.m.s.d., respectively. In a head-to-head comparison on loops extracted from freshly deposited new protein folds, the current method outperformed an earlier developed database search method in a ∼5:1 ratio.
Affiliation(s)
- Baldomero Oliva
- Structural Bioinformatics Group (GRIB), Universitat Pompeu Fabra, C/Doctor Aiguader, 80, 08003 Barcelona, Catalonia, Spain
- András Fiser
- To whom correspondence should be addressed. Tel: +1 718 430 3233; Fax: +1 718 430 856;
|
35
|
Qiu J, Elber R. SSALN: an alignment algorithm using structure-dependent substitution matrices and gap penalties learned from structurally aligned protein pairs. Proteins 2006; 62:881-91. [PMID: 16385554 DOI: 10.1002/prot.20854] [Citation(s) in RCA: 68] [Impact Index Per Article: 3.8] [Indexed: 11/07/2022]
Abstract
In template-based modeling of protein structures, the generation of the alignment between the target and the template is a critical step that significantly affects the accuracy of the final model. This paper proposes an alignment algorithm SSALN that learns substitution matrices and position-specific gap penalties from a database of structurally aligned protein pairs. In addition to the amino acid sequence information, secondary structure and solvent accessibility information of a position are used to derive substitution scores and position-specific gap penalties. In a test set of CASP5 targets, SSALN outperforms sequence alignment methods such as a Smith-Waterman algorithm with BLOSUM50 and PSI_BLAST. SSALN also generates better alignments than PSI_BLAST in the CASP6 test set. LOOPP server prediction based on an SSALN alignment is ranked the best for target T0280_1 in CASP6. SSALN is also compared with several threading methods and sequence alignment methods on the ProSup benchmark. SSALN has the highest alignment accuracy among the methods compared. On Fischer's benchmark, SSALN performs better than CLUSTALW and GenTHREADER, and generates more alignments with accuracy >50%, >60% or >70% than FUGUE, but fewer alignments with accuracy >80% than FUGUE. All the supplemental materials can be found at http://www.cs.cornell.edu/~jianq/research.htm.
Affiliation(s)
- Jian Qiu
- Department of Computer Science, Cornell University, Ithaca, New York 14853, USA
|
36
|
Chu CK, Feng LL, Wouters MA. Comparison of sequence and structure-based datasets for nonredundant structural data mining. Proteins 2006; 60:577-83. [PMID: 16001417 DOI: 10.1002/prot.20505] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/09/2022]
Abstract
Structural data mining studies attempt to deduce general principles of protein structure from solved structures deposited in the protein data bank (PDB). The entire database is unsuitable for such studies because it is not representative of the ensemble of protein folds. Given that novel folds continue to be unearthed, some folds are currently unrepresented in the PDB while other folds are overrepresented. Overrepresentation can easily be avoided by filtering the dataset. PDB_SELECT is a well-used representative subset of the PDB that has been deduced by sequence comparison. Specifically, structures with sequences that exhibit a pairwise sequence identity above a threshold value are weeded from the dataset. Although length criteria for pairwise alignments have a structural basis, this automated method of pruning is essentially sequence-based and runs into problems in the twilight zone, possibly resulting in some folds being overrepresented. The value-added structure databases SCOP and CATH are also a potential source of a nonredundant dataset. Here we compare the sequence-derived dataset PDB_SELECT with the structural databases SCOP (Structural Classification Of Proteins) and CATH (Class-Architecture-Topology-Homology). We show that some folds remain overrepresented in the PDB_SELECT dataset while other folds are not represented at all. However, SCOP and CATH also have their own problems such as the labor-intensiveness of the update process and the problem of determining whether all folds are equally or sufficiently distant. We discuss areas where further work is required.
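The identity-threshold pruning described above can be illustrated with a minimal greedy filter. This is a sketch under simplifying assumptions, not the PDB_SELECT procedure itself: sequences are taken to be pre-aligned and of equal length, identity is a plain fraction of matching positions, and the 25% threshold and function names are hypothetical choices for the example.

```python
def pairwise_identity(a, b):
    """Fraction of identical positions between two equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def filter_redundant(sequences, threshold):
    """Greedily keep a sequence only if its identity to every kept sequence is <= threshold."""
    kept = []
    for seq in sequences:
        if all(pairwise_identity(seq, k) <= threshold for k in kept):
            kept.append(seq)
    return kept


seqs = ["ACDEFGHIKL", "ACDEFGHIKV", "WYTSRQPNML"]
# The second sequence is 90% identical to the first and is weeded out;
# the third shares only one position (10%) with the first and is kept.
representatives = filter_redundant(seqs, threshold=0.25)
```

Because the filter only sees sequence identity, two structures with the same fold but divergent sequences would both survive, which is the overrepresentation problem the abstract raises for the twilight zone.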
Affiliation(s)
- Carmen K Chu
- Computational Biology and Bioinformatics Program, Victor Chang Cardiac Research Institute, Sydney, NSW, Australia
|
37
|
Skolnick J. In quest of an empirical potential for protein structure prediction. Curr Opin Struct Biol 2006; 16:166-71. [PMID: 16524716 DOI: 10.1016/j.sbi.2006.02.004] [Citation(s) in RCA: 112] [Impact Index Per Article: 6.2] [Received: 12/01/2005] [Revised: 02/10/2006] [Accepted: 02/23/2006] [Indexed: 11/19/2022]
Abstract
Key to successful protein structure prediction is a potential that recognizes the native state from misfolded structures. Recent advances in empirical potentials based on known protein structures include improved reference states for assessing random interactions, sidechain-orientation-dependent pair potentials, potentials for describing secondary or supersecondary structural preferences and, most importantly, optimization protocols that sculpt the energy landscape to enhance the correlation between native-like features and the energy. Improved clustering algorithms that select native-like structures on the basis of cluster density also resulted in greater prediction accuracy. For template-based modeling, these advances allowed improvement in predicted structures relative to their initial template alignments over a wide range of target-template homology. This represents significant progress and suggests applications to proteome-scale structure prediction.
Affiliation(s)
- Jeffrey Skolnick
- Center of Excellence in Bioinformatics, University at Buffalo, 901 Washington Street, Buffalo, NY 14203, USA.
|
38
|
Chen Y, Crippen GM. A novel approach to structural alignment using realistic structural and environmental information. Protein Sci 2005; 14:2935-46. [PMID: 16260755 PMCID: PMC2253243 DOI: 10.1110/ps.051428205] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.7] [Indexed: 10/25/2022]
Abstract
In the era of structural genomics, it is necessary to generate accurate structural alignments in order to build good templates for homology modeling. Although a great number of structural alignment algorithms have been developed, most of them ignore intermolecular interactions during the alignment procedure. Therefore, structures in different oligomeric states are barely distinguishable, and it is very challenging to find correct alignment in coil regions. Here we present a novel approach to structural alignment using a clique finding algorithm and environmental information (SAUCE). In this approach, we build the alignment based on not only structural coordinate information but also realistic environmental information extracted from biological unit files provided by the Protein Data Bank (PDB). At first, we eliminate all environmentally unfavorable pairings of residues. Then we identify alignments in core regions via a maximal clique finding algorithm. Two extreme value distribution (EVD) form statistics have been developed to evaluate core region alignments. With an optional extension step, global alignment can be derived based on environment-based dynamic programming linking. We show that our method is able to differentiate three-dimensional structures in different oligomeric states, and is able to find flexible alignments between multidomain structures without predetermined hinge regions. The overall performance is also evaluated on a large scale by comparisons to current structural classification databases as well as to other alignment methods.
Affiliation(s)
- Yu Chen
- College of Pharmacy, University of Michigan, 428 Church Street, Ann Arbor, MI 48109-1065, USA
|
39
|
Johnston CR, Shields DC. A sequence sub-sampling algorithm increases the power to detect distant homologues. Nucleic Acids Res 2005; 33:3772-8. [PMID: 16006623 PMCID: PMC1174907 DOI: 10.1093/nar/gki687] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/14/2022]
Abstract
Searching databases for distant homologues using alignments instead of individual sequences increases the power of detection. However, most methods assume that protein evolution proceeds in a regular fashion, with the inferred tree of sequences providing a good estimation of the evolutionary process. We investigated the combined HMMER search results from random alignment subsets (with three sequences each) drawn from the parent alignment (Rand-shuffle algorithm), using the SCOP structural classification to determine true similarities. At false-positive rates of 5%, the Rand-shuffle algorithm improved HMMER's sensitivity, with a 37.5% greater sensitivity compared with HMMER alone, when easily identified similarities (identifiable by BLAST) were excluded from consideration. An extension of the Rand-shuffle algorithm (Ali-shuffle) weighted towards more informative sequence subsets. This approach improved the performance over HMMER alone and PSI-BLAST, particularly at higher false-positive rates. The improvements in performance of these sequence sub-sampling methods may reflect lower sensitivity to alignment error and irregular evolutionary patterns. The Ali-shuffle and Rand-shuffle sequence homology search programs are available by request from the authors.
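The sub-sampling step described in this abstract can be illustrated with a minimal sketch. It only draws random three-sequence subsets from a parent alignment; the abstract does not say how the per-subset HMMER results are combined, so that step is left out. The function name, the dictionary representation of the alignment, and the toy sequences are all assumptions made for illustration.

```python
import random


def random_alignment_subsets(alignment, n_subsets, subset_size=3, seed=0):
    """Draw random subsets of aligned sequences from a parent alignment.

    `alignment` maps sequence names to aligned (gapped) strings.
    The default subset size of three follows the abstract; the
    representation and the seed handling are illustrative assumptions.
    """
    names = sorted(alignment)
    if subset_size > len(names):
        raise ValueError("subset_size exceeds the number of sequences")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    subsets = []
    for _ in range(n_subsets):
        chosen = rng.sample(names, subset_size)  # sampling without replacement
        subsets.append({name: alignment[name] for name in chosen})
    return subsets


# Toy parent alignment (hypothetical sequences, all the same aligned length).
parent = {
    "seqA": "MKV-LLAG",
    "seqB": "MKI-LVAG",
    "seqC": "MRVALLSG",
    "seqD": "M-VALISG",
    "seqE": "MKVALLAG",
}
subsets = random_alignment_subsets(parent, n_subsets=4)
```

Each subset would then be used to build its own profile for searching; pooling the hits across subsets is the part of the method this sketch does not attempt to reproduce.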
Affiliation(s)
- Catrióna R Johnston
- Department of Clinical Pharmacology, Bioinformatics Group, Royal College of Surgeons in Ireland, 123 St Stephens Green, Dublin 2, Ireland.
|
40
|
Sillitoe I, Dibley M, Bray J, Addou S, Orengo C. Assessing strategies for improved superfamily recognition. Protein Sci 2005; 14:1800-10. [PMID: 15937274 PMCID: PMC2253352 DOI: 10.1110/ps.041056105] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Indexed: 10/25/2022]
Abstract
There are more than 200 completed genomes and over 1 million nonredundant sequences in public repositories. Although the structural data are more sparse (approximately 13,000 nonredundant structures solved to date), several powerful sequence-based methodologies now allow these structures to be mapped onto related regions in a significant proportion of genome sequences. We review a number of publicly available strategies for providing structural annotations for genome sequences, and we describe the protocol adopted to provide CATH structural annotations for completed genomes. In particular, we assess the performance of several sequence-based protocols employing Hidden Markov model (HMM) technologies for superfamily recognition, including a new approach (SAMOSA [sequence augmented models of structure alignments]) that exploits multiple structural alignments from the CATH domain structure database when building the models. Using a data set of remote homologs detected by structure comparison and manually validated in CATH, a single-seed HMM library was able to recognize 76% of the data set. Including the SAMOSA models in the HMM library showed little gain in homolog recognition, although a slight improvement in alignment quality was observed for very remote homologs. However, using an expanded 1D-HMM library, CATH-ISL increased the coverage to 86%. The single-seed HMM library has been used to annotate the protein sequences of 120 genomes from all three major kingdoms, allowing up to 70% of the genes or partial genes to be assigned to CATH superfamilies. It has also been used to recruit sequences from Swiss-Prot and TrEMBL into CATH domain superfamilies, expanding the CATH database eightfold.
Affiliation(s)
- Ian Sillitoe
- Biomolecular Structure and Modelling Unit, Department of Biochemistry and Molecular Biology, University College London, UK
|
41
|
Hamelryck T. An amino acid has two sides: A new 2D measure provides a different view of solvent exposure. Proteins 2005; 59:38-48. [DOI: 10.1002/prot.20379] [Citation(s) in RCA: 111] [Impact Index Per Article: 5.8] [Indexed: 11/09/2022]
|
42
|
EvDTree: structure-dependent substitution profiles based on decision tree classification of 3D environments. BMC Bioinformatics 2005; 6:4. [PMID: 15638949 PMCID: PMC545998 DOI: 10.1186/1471-2105-6-4] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Received: 10/22/2004] [Accepted: 01/10/2005] [Indexed: 12/04/2022]
Abstract
Background: Structure-dependent substitution matrices increase the accuracy of sequence alignments when the 3D structure of one sequence is known, and are successful e.g. in fold recognition. We propose a new automated method, EvDTree, based on a decision tree algorithm, for automatic derivation of amino acid substitution probabilities from a set of sequence-structure alignments. The main advantage over other approaches is an unbiased automatic selection of the most informative structural descriptors and associated values or thresholds. This feature allows automatic derivation of structure-dependent substitution scores for any specific set of structures, without the need to empirically determine best descriptors and parameters. Results: Decision trees for residue substitutions were constructed for each residue type from sequence-structure alignments extracted from the HOMSTRAD database. For each tree cluster, environment-dependent substitution profiles were derived. The resulting structure-dependent substitution scores were assessed using a criterion based on the mean ranking of observed substitution among all possible substitutions and in sequence-structure alignments. The automatically built EvDTree substitution scores provide significantly better results than conventional matrices and similar or slightly better results than other structure-dependent matrices. EvDTree has been applied to small disulfide-rich proteins as a test case to automatically derive specific substitutions scores providing better results than non-specific substitution scores. Analyses of the decision tree classifications provide useful information on the relative importance of different structural descriptors. Conclusions: We propose a fully automatic method for the classification of structural environments and inference of structure-dependent substitution profiles. We show that this approach is more accurate than existing methods for various applications. The easy adaptation of EvDTree to any specific data set opens the way for class-specific structure-dependent substitution scores which can be used in threading-based remote homology searches.
|
43
|
Pei J, Grishin NV. Combining evolutionary and structural information for local protein structure prediction. Proteins 2004; 56:782-94. [PMID: 15281130 DOI: 10.1002/prot.20158] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.1] [Indexed: 11/09/2022]
Abstract
We study the effects of various factors in representing and combining evolutionary and structural information for local protein structural prediction based on fragment selection. We prepare databases of fragments from a set of non-redundant protein domains. For each fragment, evolutionary information is derived from homologous sequences and represented as estimated effective counts and frequencies of amino acids (evolutionary frequencies) at each position. Position-specific amino acid preferences called structural frequencies are derived from statistical analysis of discrete local structural environments in database structures. Our method for local structure prediction is based on ranking and selecting database fragments that are most similar to a target fragment. Using secondary structure type as a local structural property, we test our method in a number of settings. The major findings are: (1) the COMPASS-type scoring function for fragment similarity comparison gives better prediction accuracy than three other tested scoring functions for profile-profile comparison. We show that the COMPASS-type scoring function can be derived both in the probabilistic framework and in the framework of statistical potentials. (2) Using the evolutionary frequencies of database fragments gives better prediction accuracy than using structural frequencies. (3) Finer definition of local environments, such as including more side-chain solvent accessibility classes and considering the backbone conformations of neighboring residues, gives increasingly better prediction accuracy using structural frequencies. (4) Combining evolutionary and structural frequencies of database fragments, either in a linear fashion or using a pseudocount mixture formula, results in improvement of prediction accuracy. Combination at the log-odds score level is not as effective as combination at the frequency level. This suggests that there might be better ways of combining sequence and structural information than the commonly used linear combination of log-odds scores. Our method of fragment selection and frequency combination gives reasonable results of secondary structure prediction tested on 56 CASP5 targets (average SOV score 0.77), suggesting that it is a valid method for local protein structure prediction. A mixture of predicted structural frequencies and evolutionary frequencies improves the quality of local profile-to-profile alignment by COMPASS.
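Finding (4) distinguishes combination at the frequency level from combination at the log-odds level. The sketch below shows only the linear, frequency-level variant followed by conversion to log-odds scores. The abstract does not give the mixing weight, the pseudocount mixture formula, or the background distribution, so the weight of 0.5, the three-letter toy alphabet, and the uniform background are assumptions, as are the function names.

```python
import math


def mix_frequencies(evo, struct, weight=0.5):
    """Linearly combine two per-position amino acid frequency vectors.

    `weight` is the share given to the evolutionary frequencies; its value
    here is an arbitrary assumption, not one taken from the paper.
    """
    residues = set(evo) | set(struct)
    mixed = {
        aa: weight * evo.get(aa, 0.0) + (1.0 - weight) * struct.get(aa, 0.0)
        for aa in residues
    }
    total = sum(mixed.values())
    return {aa: value / total for aa, value in mixed.items()}  # renormalise


def log_odds(freqs, background):
    """Convert frequencies to natural-log odds scores against a background."""
    return {
        aa: math.log(freqs[aa] / background[aa])
        for aa in freqs
        if freqs[aa] > 0.0
    }


# Toy three-letter alphabet (hypothetical values for one position).
evo = {"A": 0.6, "L": 0.3, "G": 0.1}
struct = {"A": 0.2, "L": 0.7, "G": 0.1}
background = {"A": 1 / 3, "L": 1 / 3, "G": 1 / 3}

mixed = mix_frequencies(evo, struct, weight=0.5)
scores = log_odds(mixed, background)
```

Mixing first and scoring afterwards is the frequency-level order the abstract reports as more effective; averaging two separately computed log-odds vectors would be the log-odds-level alternative.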
Affiliation(s)
- Jimin Pei
- Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, Texas 75390-9050, USA
|
44
|
Massé N, Ainouze M, Néel B, Wild TF, Buckland R, Langedijk JPM. Measles virus (MV) hemagglutinin: evidence that attachment sites for MV receptors SLAM and CD46 overlap on the globular head. J Virol 2004; 78:9051-63. [PMID: 15308701 PMCID: PMC506930 DOI: 10.1128/jvi.78.17.9051-9063.2004] [Citation(s) in RCA: 73] [Impact Index Per Article: 3.7] [Received: 02/19/2004] [Accepted: 04/20/2004] [Indexed: 11/20/2022]
Abstract
Measles virus hemagglutinin (MVH) residues potentially responsible for attachment to the wild-type (wt) MV receptor SLAM (CD150) have been identified and localized on the MVH globular head by reference to a revised hypothetical structural model for MVH (www.pepscan.nl/downloads/measlesH.pdb). We show that the mutation of five charged MVH residues which are conserved among morbillivirus H proteins has major effects on both SLAM downregulation and SLAM-dependent fusion. In the three-dimensional surface representation of the structural model, three of these residues (D505, D507, and R533) align the rim on one side of the cavity on the top surface of the MVH globular head and form the basis of a single continuous site that overlaps with the 546-548-549 CD46 binding site. We show that the overlapping sites fall within the footprint of an anti-MVH monoclonal antibody that neutralizes both wt and laboratory-vaccine MV strains and whose epitope contains R533. Our study does not exclude the possibility that Y481 binds CD46 directly but suggests that the N481Y mutation of wt MVH could influence, at a distance, the conformation of the overlapping sites so that affinity to CD46 increases. The relevance of these results to present concepts of MV receptor usage is discussed, and an explanation is proposed as to why morbillivirus attachment proteins are H, whereas those from the other paramyxoviruses are HN (hemagglutinin-neuraminidase).
MESH Headings
- Amino Acid Sequence
- Animals
- Antibodies, Monoclonal/immunology
- Antibodies, Viral/immunology
- Antigens, CD/metabolism
- Binding Sites
- Cell Line
- Down-Regulation
- Epitopes/immunology
- Glycoproteins/metabolism
- HeLa Cells
- Hemagglutinins, Viral/chemistry
- Hemagglutinins, Viral/genetics
- Hemagglutinins, Viral/immunology
- Hemagglutinins, Viral/metabolism
- Humans
- Immunoglobulins/metabolism
- Measles virus/metabolism
- Membrane Cofactor Protein
- Membrane Fusion
- Membrane Glycoproteins/metabolism
- Models, Molecular
- Molecular Sequence Data
- Mutation/genetics
- Neutralization Tests
- Protein Binding
- Protein Structure, Tertiary
- Receptors, Cell Surface
- Receptors, Virus/metabolism
- Signaling Lymphocytic Activation Molecule Family Member 1
Affiliation(s)
- Nicolas Massé
- Molecular Basis of Paramyxovirus Entry, INSERM U404, Immunité et Vaccination, CERVI, IFR 128 Biosciences Lyon-Gerland, Lyon, France
|
45
|
Sadreyev RI, Grishin NV. Estimates of statistical significance for comparison of individual positions in multiple sequence alignments. BMC Bioinformatics 2004; 5:106. [PMID: 15296518 PMCID: PMC516024 DOI: 10.1186/1471-2105-5-106] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Received: 04/01/2004] [Accepted: 08/05/2004] [Indexed: 11/17/2022]
Abstract
Background: Profile-based analysis of multiple sequence alignments (MSA) allows for accurate comparison of protein families. Here, we address the problems of detecting statistically confident dissimilarities between (1) MSA position and a set of predicted residue frequencies, and (2) between two MSA positions. These problems are important for (i) evaluation and optimization of methods predicting residue occurrence at protein positions; (ii) detection of potentially misaligned regions in automatically produced alignments and their further refinement; and (iii) detection of sites that determine functional or structural specificity in two related families. Results: For problems (1) and (2), we propose analytical estimates of P-value and apply them to the detection of significant positional dissimilarities in various experimental situations. (a) We compare structure-based predictions of residue propensities at a protein position to the actual residue frequencies in the MSA of homologs. (b) We evaluate our method by the ability to detect erroneous position matches produced by an automatic sequence aligner. (c) We compare MSA positions that correspond to residues aligned by automatic structure aligners. (d) We compare MSA positions that are aligned by high-quality manual superposition of structures. Detected dissimilarities reveal shortcomings of the automatic methods for residue frequency prediction and alignment construction. For the high-quality structural alignments, the dissimilarities suggest sites of potential functional or structural importance. Conclusion: The proposed computational method is of significant potential value for the analysis of protein families.
Affiliation(s)
- Ruslan I Sadreyev
- Howard Hughes Medical Institute, and Department of Biochemistry, University of Texas Southwestern Medical Center, 5323, Harry Hines Blvd, Dallas, TX 75390-9050, USA
- Nick V Grishin
- Howard Hughes Medical Institute, and Department of Biochemistry, University of Texas Southwestern Medical Center, 5323, Harry Hines Blvd, Dallas, TX 75390-9050, USA
|
46
|
Przybylski D, Rost B. Improving Fold Recognition Without Folds. J Mol Biol 2004; 341:255-69. [PMID: 15312777 DOI: 10.1016/j.jmb.2004.05.041] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.1] [Received: 02/03/2004] [Revised: 05/18/2004] [Accepted: 05/18/2004] [Indexed: 11/21/2022]
Abstract
The most reliable way to align two proteins of unknown structure is through sequence-profile and profile-profile alignment methods. If the structure for one of the two is known, fold recognition methods outperform purely sequence-based alignments. Here, we introduced a novel method that aligns generalised sequence and predicted structure profiles. Using predicted 1D structure (secondary structure and solvent accessibility) significantly improved over sequence-only methods, both in terms of correctly recognising pairs of proteins with different sequences and similar structures and in terms of correctly aligning the pairs. The scores obtained by our generalised scoring matrix followed an extreme value distribution; this yielded accurate estimates of the statistical significance of our alignments. We found that mistakes in 1D structure predictions correlated between proteins from different sequence-structure families. The impact of this surprising result was that our method succeeded in significantly out-performing sequence-only methods even without explicitly using structural information from either of the two. Since AGAPE also outperformed established methods that rely on 3D information, we made it available through. If we solved the problem of CPU-time required to apply AGAPE on millions of proteins, our results could also impact everyday database searches.
Affiliation(s)
- Dariusz Przybylski
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA.
|
47
|
Zhang Z, Kochhar S, Grigorov M. Exploring the sequence-structure protein landscape in the glycosyltransferase family. Protein Sci 2004; 12:2291-302. [PMID: 14500887 PMCID: PMC2366918 DOI: 10.1110/ps.03131303] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Indexed: 10/27/2022]
Abstract
To understand the molecular basis of glycosyltransferases' (GTFs) catalytic mechanism, extensive structural information is required. Here, fold recognition methods were employed to assign 3D protein shapes (folds) to the currently known GTF sequences, available in public databases such as GenBank and Swissprot. First, GTF sequences were retrieved and classified into clusters, based on sequence similarity only. Intracluster sequence similarity was chosen sufficiently high to ensure that the same fold is found within a given cluster. Then, a representative sequence from each cluster was selected to compose a subset of GTF sequences. The members of this reduced set were processed by three different fold recognition methods: 3D-PSSM, FUGUE, and GeneFold. Finally, the results from different fold recognition methods were analyzed and compared to sequence-similarity search methods (i.e., BLAST and PSI-BLAST). It was established that the folds of about 70% of all currently known GTF sequences can be confidently assigned by fold recognition methods, a value which is higher than the fold identification rate based on sequence comparison alone (48% for BLAST and 64% for PSI-BLAST). The identified folds were submitted to 3D clustering, and we found that most of the GTF sequences adopt the typical GTF A or GTF B folds. Our results indicate a lack of evidence that new GTF folds (i.e., folds other than GTF A and B) exist. Based on cases where fold identification was not possible, we suggest several sequences as the most promising targets for a structural genomics initiative focused on the GTF protein family.
Affiliation(s)
- Ziding Zhang
- Nestlé Research Center, CH-1000 Lausanne 26, Switzerland.
|
48
|
Cherkasov A, Jones SJM. An approach to large scale identification of non-obvious structural similarities between proteins. BMC Bioinformatics 2004; 5:61. [PMID: 15147578 PMCID: PMC434491 DOI: 10.1186/1471-2105-5-61] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 02/02/2004] [Accepted: 05/17/2004] [Indexed: 11/13/2022]
Abstract
Background: A new sequence-independent bioinformatics approach allowing genome-wide search for proteins with similar three-dimensional structures has been developed. By utilizing the numerical output of sequence threading, it establishes putative non-obvious structural similarities between proteins. When applied to a test set of proteins with known three-dimensional structures, the approach was able to recognize structurally similar proteins with high accuracy. Results: The method has been developed to identify pathogenic proteins with low sequence identity and high structural similarity to host analogues. Such protein structure relationships would be hypothesized to arise through convergent evolution or through ancient horizontal gene transfer events, now undetectable using current sequence alignment techniques. The pathogen proteins, which could mimic or interfere with host activities, would represent candidate virulence factors. The approach utilizes the numerical outputs from sequence-structure threading. It identifies the potential structural similarity between a pair of proteins by correlating the threading scores of the corresponding two primary sequences against the library of standard folds. This approach allowed up to 64% sensitivity and 99.9% specificity in distinguishing protein pairs with high structural similarity. Conclusion: Preliminary results obtained by comparison of the genomes of Homo sapiens and several strains of Chlamydia trachomatis have demonstrated the potential usefulness of the method in the identification of bacterial proteins with known or potential roles in virulence.
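The core idea in this abstract, comparing two proteins through the correlation of their threading scores against a common fold library, can be sketched as follows. The abstract does not state which correlation measure is used, so the Pearson coefficient here is an assumption, and the score vectors are invented toy values rather than real threading output.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length score vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical threading scores of three proteins against the same
# five-fold library (one score per library fold, same fold order).
protein_a = [12.1, 3.4, 8.8, 1.0, 5.5]
protein_b = [11.7, 2.9, 9.3, 1.4, 5.1]   # profile similar to protein_a
protein_c = [1.2, 9.8, 2.0, 10.5, 6.0]   # roughly inverted profile

similar = pearson(protein_a, protein_b)
dissimilar = pearson(protein_a, protein_c)
```

A pair whose score profiles correlate strongly across the library would be flagged as a candidate structural match even when the two sequences themselves share little identity; any decision threshold would have to be calibrated on proteins of known structure.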
Affiliation(s)
- Artem Cherkasov
- Division of Infectious Diseases, Department of Medicine, Faculty of Medicine, University of British Columbia, Vancouver, British Columbia, Canada
- Steven JM Jones
- Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, Canada
|
49
|
Cherkasov A, Jones SJM. Structural characterization of genomes by large scale sequence-structure threading. BMC Bioinformatics 2004; 5:37. [PMID: 15061866 PMCID: PMC419331 DOI: 10.1186/1471-2105-5-37] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 10/09/2003] [Accepted: 04/03/2004] [Indexed: 12/02/2022]
Abstract
Background: Using sequence-structure threading we have conducted structural characterization of complete proteomes of 37 archaeal, bacterial and eukaryotic organisms (including worm, fly, mouse and human) totaling 167,888 genes. Results: The reported data represent the first general evaluation of the performance of full sequence-structure threading on multiple genomes, providing an opportunity to evaluate its general applicability to large-scale studies. According to the estimated results, sequence-structure threading has assigned protein folds to more than 60% of eukaryotic, 68% of archaeal and 70% of bacterial proteomes. The repertoires of protein classes, architectures, topologies and homologous superfamilies (according to the CATH 2.4 classification) have been established for distant organisms and superkingdoms. It has been found that the average abundance of CATH classes decreases from "alpha and beta" to "mainly beta", followed by "mainly alpha" and "few secondary structures". 3-Layer (aba) Sandwich has been characterized as the most abundant protein architecture and the Rossmann fold as the most common topology. Conclusion: The analysis of genomic occurrences of CATH 2.4 protein homologous superfamilies and topologies has revealed the power-law character of their distributions. The corresponding double logarithmic "frequency – genomic occurrence" dependences characteristic of scale-free systems have been established for individual organisms and for three superkingdoms. Supplementary materials to this work are available at [1].
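A power-law distribution appears as a straight line on a double logarithmic plot, which is what the conclusion refers to. The sketch below fits that line by ordinary least squares on log-transformed values. The abstract does not describe the fitting procedure, so this choice is an assumption, and the data are synthetic points generated from an exact power law rather than CATH counts.

```python
import math


def fit_power_law(occurrences, frequencies):
    """Least-squares fit of log(frequency) = intercept + slope * log(occurrence).

    Returns (slope, intercept); the slope is the power-law exponent.
    """
    if len(occurrences) != len(frequencies) or len(occurrences) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    xs = [math.log(x) for x in occurrences]
    ys = [math.log(y) for y in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all occurrence values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic data following frequency = 100 * occurrence^-2 exactly.
occurrences = [1, 2, 4, 8, 16]
frequencies = [100.0 * k ** -2 for k in occurrences]
slope, intercept = fit_power_law(occurrences, frequencies)
```

On real counts, a log-log regression of this kind is only a first check; it recovers the exponent exactly here because the toy data contain no noise.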
Affiliation(s)
- Artem Cherkasov
- Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, Canada
- Faculty of Medicine, University of British Columbia, Vancouver, British Columbia, Canada
- Steven JM Jones
- Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, Canada
|
50
|
Wallner B, Fang H, Ohlson T, Frey-Skött J, Elofsson A. Using evolutionary information for the query and target improves fold recognition. Proteins 2004; 54:342-50. [PMID: 14696196 DOI: 10.1002/prot.10565] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.3] [Indexed: 11/07/2022]
Abstract
In this study, we show that it is possible to increase the performance over PSI-BLAST by using evolutionary information for both query and target sequences. This information can be used in three different ways: by sequence linking, by profile-profile alignments, and by combining sequence-profile and profile-sequence searches. If only PSI-BLAST is used, 16% of superfamily-related protein domains can be detected at 90% specificity. If a sequence-profile and a profile-sequence search are combined, this increases to 20%; profile-profile searches detect 19%, whereas a linking procedure identifies 22% of these proteins. All three methods show equal performance, but the best combination of speed and accuracy seems to be obtained by the combined searches, because this method shows a good performance even at high specificity and the lowest computational cost. In addition, we show that the E-values reported by all these methods, including PSI-BLAST, underestimate the true rate of false positives. This behavior is seen even if a very strict E-value cutoff and a limited number of iterations are used. However, the difference is more pronounced with a looser E-value cutoff and more iterations.
Affiliation(s)
- Björn Wallner
- Stockholm Bioinformatics Center, Stockholm University, Stockholm, Sweden
|