1
Sánchez IE, Galpern EA, Garibaldi MM, Ferreiro DU. Molecular Information Theory Meets Protein Folding. J Phys Chem B 2022; 126:8655-8668. [PMID: 36282961 DOI: 10.1021/acs.jpcb.2c04532] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 01/11/2023]
Abstract
We propose an application of molecular information theory to analyze the folding of single domain proteins. We analyze results from various areas of protein science, such as sequence-based potentials, reduced amino acid alphabets, backbone configurational entropy, secondary structure content, residue burial layers, and mutational studies of protein stability changes. We found that the average information contained in the sequences of evolved proteins is very close to the average information needed to specify a fold, ∼2.2 ± 0.3 bits/(site·operation). The effective alphabet size in evolved proteins equals the effective number of conformations of a residue in the compact unfolded state, at around 5. We calculated an energy-to-information conversion efficiency upon folding of around 50%, lower than the theoretical limit of 70% but much higher than that of human-built macroscopic machines. We propose a simple mapping between molecular information theory and energy landscape theory and explore the connections between sequence evolution, configurational entropy, and the energetics of protein folding.
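One way to read the two headline numbers together is sketched below. It assumes that an "effective alphabet size" is the number of equiprobable symbols carrying a given entropy, i.e. 2 raised to the entropy in bits. That relation is an assumption made here for illustration; the abstract does not state how the authors computed their value, and the function names are hypothetical.

```python
import math

def shannon_entropy_bits(probabilities):
    """Shannon entropy H = -sum(p * log2(p)), in bits."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def effective_alphabet_size(entropy_bits):
    """Number of equiprobable symbols with the same entropy: 2**H (assumed definition)."""
    return 2 ** entropy_bits

# A uniform 20-letter amino acid alphabet carries log2(20) bits per site.
h_uniform = shannon_entropy_bits([1 / 20] * 20)

# Plugging in the abstract's central value of 2.2 bits gives a size between 4 and 5;
# the quoted uncertainty of 0.3 bits widens that range on both sides.
n_eff = effective_alphabet_size(2.2)
n_eff_low = effective_alphabet_size(2.2 - 0.3)
n_eff_high = effective_alphabet_size(2.2 + 0.3)
print(f"{h_uniform:.2f} bits, n_eff = {n_eff:.2f} ({n_eff_low:.2f} to {n_eff_high:.2f})")
```

Under this assumed definition, 2.2 ± 0.3 bits spans an effective size that brackets the "around 5" quoted in the abstract.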
Affiliation(s)
- Ignacio E Sánchez
  - Facultad de Ciencias Exactas y Naturales, Laboratorio de Fisiología de Proteínas, Consejo Nacional de Investigaciones Científicas y Técnicas, Instituto de Química Biológica de la Facultad de Ciencias Exactas y Naturales (IQUIBICEN), Universidad de Buenos Aires, Buenos Aires CP1428, Argentina
- Ezequiel A Galpern
  - Facultad de Ciencias Exactas y Naturales, Laboratorio de Fisiología de Proteínas, Consejo Nacional de Investigaciones Científicas y Técnicas, Instituto de Química Biológica de la Facultad de Ciencias Exactas y Naturales (IQUIBICEN), Universidad de Buenos Aires, Buenos Aires CP1428, Argentina
- Martín M Garibaldi
  - Facultad de Ciencias Exactas y Naturales, Laboratorio de Fisiología de Proteínas, Consejo Nacional de Investigaciones Científicas y Técnicas, Instituto de Química Biológica de la Facultad de Ciencias Exactas y Naturales (IQUIBICEN), Universidad de Buenos Aires, Buenos Aires CP1428, Argentina
- Diego U Ferreiro
  - Facultad de Ciencias Exactas y Naturales, Laboratorio de Fisiología de Proteínas, Consejo Nacional de Investigaciones Científicas y Técnicas, Instituto de Química Biológica de la Facultad de Ciencias Exactas y Naturales (IQUIBICEN), Universidad de Buenos Aires, Buenos Aires CP1428, Argentina
2
Nguyen Q, Tran HV, Nguyen BP, Do TTT. Identifying Transcription Factors That Prefer Binding to Methylated DNA Using Reduced G-Gap Dipeptide Composition. ACS Omega 2022; 7:32322-32330. [PMID: 36119976 PMCID: PMC9475634 DOI: 10.1021/acsomega.2c03696] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 06/14/2022] [Accepted: 08/23/2022] [Indexed: 06/15/2023]
Abstract
Transcription factors (TFs) play an important role in gene expression and in the regulation of 3D genome conformation. TFs can bind to specific DNA fragments called enhancers and promoters. Some TFs bind to promoter DNA fragments near the transcription initiation site and form complexes that allow polymerase enzymes to bind and initiate transcription. Previous studies showed that DNA methylation can inhibit or prevent TFs from binding to DNA fragments. However, recent studies have found that some TFs can bind to methylated DNA fragments. The identification of these TFs is an important stepping stone to a better understanding of cellular gene expression mechanisms. Because experimental methods are often time-consuming and labor-intensive, developing computational methods is essential. In this study, we propose two machine learning methods for two problems: (1) identifying TFs and (2) identifying TFs that prefer binding to methylated DNA targets (TFPMs). For the TF identification problem, the proposed method uses the position-specific scoring matrix for data representation and a deep convolutional neural network for modeling. This method achieved 90.56% sensitivity, 83.96% specificity, and an area under the receiver operating characteristic curve (AUC) of 0.9596 on an independent test set. For the TFPM identification problem, we propose to use the reduced g-gap dipeptide composition for data representation and the support vector machine algorithm for modeling. This method achieved 82.61% sensitivity, 64.86% specificity, and an AUC of 0.8486 on another independent test set. These results are higher than those of other studies on the same problems.
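To make the feature representation concrete, here is a minimal sketch of a reduced g-gap dipeptide composition. The five-group alphabet below is hypothetical and chosen only for illustration; the abstract does not say which reduced alphabet or gap value the authors used, and the function names are not taken from their code.

```python
from itertools import product

# Hypothetical 5-group reduction of the 20 amino acids (illustration only).
REDUCED_GROUPS = {
    "a": "AVLIMC",  # aliphatic / sulfur-containing
    "b": "FWYH",    # aromatic
    "c": "STNQ",    # polar uncharged
    "d": "KRDE",    # charged
    "e": "GP",      # special backbone
}

def reduce_sequence(seq, groups=REDUCED_GROUPS):
    """Map each residue to its group label; unknown residues raise KeyError."""
    lookup = {aa: label for label, members in groups.items() for aa in members}
    return "".join(lookup[aa] for aa in seq)

def g_gap_dipeptide_composition(seq, g, groups=REDUCED_GROUPS):
    """Frequency of every ordered pair of group labels separated by g residues."""
    reduced = reduce_sequence(seq, groups)
    labels = sorted(groups)
    counts = {a + b: 0 for a, b in product(labels, repeat=2)}
    n_pairs = max(len(reduced) - g - 1, 0)
    for i in range(n_pairs):
        counts[reduced[i] + reduced[i + g + 1]] += 1
    total = max(n_pairs, 1)  # avoid division by zero for very short sequences
    return {pair: count / total for pair, count in counts.items()}

features = g_gap_dipeptide_composition("AKGAK", g=1)
```

With five groups the feature vector has 25 entries instead of the 400 needed for the full 20-letter alphabet, which is the dimensionality reduction such representations aim for.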
Affiliation(s)
- Quang H. Nguyen
  - School of Information and Communication Technology, Hanoi University of Science and Technology, 1 Dai Co Viet, Hanoi 100000, Vietnam
- Hoang V. Tran
  - School of Information and Communication Technology, Hanoi University of Science and Technology, 1 Dai Co Viet, Hanoi 100000, Vietnam
- Binh P. Nguyen
  - School of Mathematics and Statistics, Victoria University of Wellington, Kelburn Parade, Wellington 6140, New Zealand
- Trang T. T. Do
  - School of Innovation, Design and Technology, Wellington Institute of Technology, 21 Kensington Avenue, Lower Hutt 5012, New Zealand
3
Liang Y, Yang S, Zheng L, Wang H, Zhou J, Huang S, Yang L, Zuo Y. Research progress of reduced amino acid alphabets in protein analysis and prediction. Comput Struct Biotechnol J 2022; 20:3503-3510. [PMID: 35860409 PMCID: PMC9284397 DOI: 10.1016/j.csbj.2022.07.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/14/2022] [Revised: 06/30/2022] [Accepted: 07/01/2022] [Indexed: 11/29/2022] Open
Abstract
A comprehensive summary of the literature on the reduced amino acid alphabets. A systematic review of the development history of reduced amino acid alphabets. Rich application cases of amino acid reduction alphabets are described in the article. A detailed analysis of the properties and uses of the reduced amino acid alphabets.
Proteins are the executors of cellular physiological activities, and accurate elucidation of their structure and function is crucial for the refined mapping of proteins. As a feature engineering method, the reduction of amino acid composition is not only an important method for protein structure and function analysis but also opens a broad horizon for the complex field of machine learning. Representing sequences with fewer amino acid types greatly reduces the dimensionality, complexity, and noise of traditional feature engineering, and provides more interpretable predictive models for machine learning to capture key features. In this paper, we systematically review the strategies and methods behind reduced amino acid (RAA) alphabets and summarize the main research on them in protein sequence alignment, functional classification, and prediction of structural properties. Finally, we give a comprehensive analysis of 672 RAA alphabets from 74 reduction methods.
Affiliation(s)
- Yuchao Liang
  - State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Siqi Yang
  - State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Lei Zheng
  - State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Hao Wang
  - State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Jian Zhou
  - State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Shenghui Huang
  - State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Lei Yang (corresponding author)
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
- Yongchun Zuo (corresponding author)
  - State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
4
Kebabci N, Timucin AC, Timucin E. Toward Compilation of Balanced Protein Stability Data Sets: Flattening the ΔΔG Curve through Systematic Enrichment. J Chem Inf Model 2022; 62:1345-1355. [PMID: 35201762 DOI: 10.1021/acs.jcim.2c00054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
Abstract
Often studies analyzing stability data sets and/or predictors ignore neutral mutations and use a binary classification scheme labeling only destabilizing and stabilizing mutations. Recognizing that highly concentrated neutral mutations interfere with data set quality, we have explored three protein stability data sets: S2648, PON-tstab, and the symmetric Ssym that differ in size and quality. A characteristic leptokurtic shape in the ΔΔG distributions of all three data sets including the curated and symmetric ones was reported due to concentrated neutral mutations. To further investigate the impact of neutral mutations on ΔΔG predictions, we have comprehensively assessed the performance of 11 predictors on the PON-tstab data set. Correlation and error analyses showed that all of the predictors performed the best on the neutral mutations, while their performance became gradually worse as the ΔΔG of the mutations departed further from the neutral zone regardless of the direction, implying a bias toward dense mutations. To this end, after unraveling the role of concentrated neutral mutations in biases of stability data sets, we described a systematic enrichment approach to balance the ΔΔG distributions. Before enrichment, mutations were clustered based on their biochemical and/or structural features, and then three mutations were selected from every 2 kcal/mol of each cluster. Upon implementation of this approach by distinct clustering schemes, we generated five subsets varying in size and ΔΔG distributions. All subsets showed improved ΔΔG and frequency distributions. We ultimately reported that the errors toward enriched subsets were higher than those toward the parent data sets, confirming the enrichment of difficult-to-predict mutations in the subsets. In summary, we elaborated the prediction bias toward a concentrated neutral zone and also implemented a rational strategy to tackle this and other forms of biases. 
Ultimately, by giving an extended view of the shortcomings of stability data sets, this study is a step toward the development of an unbiased predictor.
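The enrichment step described in the abstract (up to three mutations from every 2 kcal/mol of each cluster) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the bins are assumed to be aligned at multiples of 2 kcal/mol, the picks within a bin are assumed to be random, and the function name and record layout are hypothetical.

```python
import math
import random
from collections import defaultdict

def enrich(mutations, per_bin=3, bin_width=2.0, seed=0):
    """Select up to `per_bin` mutations from each ΔΔG bin of each cluster.

    mutations: iterable of (mutation_id, cluster_label, ddg_kcal_per_mol).
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    buckets = defaultdict(list)
    for mutation_id, cluster, ddg in mutations:
        bin_index = math.floor(ddg / bin_width)  # e.g. [0, 2) -> 0, [-2, 0) -> -1
        buckets[(cluster, bin_index)].append(mutation_id)

    selected = []
    for key in sorted(buckets):
        members = buckets[key]
        selected.extend(rng.sample(members, min(per_bin, len(members))))
    return selected

# Toy data: ten near-neutral mutations crowd one bin, a few others sit elsewhere.
toy = [(f"n{i}", "c1", 0.1 * i) for i in range(10)]
toy += [("d1", "c1", 3.0), ("s1", "c2", -2.5), ("s2", "c2", -2.9)]
subset = enrich(toy)
```

In the toy data the crowded neutral bin contributes only three of its ten mutations, while the sparse bins are kept whole, which flattens the ΔΔG distribution of the resulting subset.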
Affiliation(s)
- Narod Kebabci
  - Department of Biostatistics and Bioinformatics, Institute of Health Sciences, Acibadem University, Istanbul 34752, Turkey
- Ahmet Can Timucin
  - Department of Molecular Biology and Genetics, Faculty of Arts and Sciences, Acibadem University, Istanbul 34752, Turkey
- Emel Timucin
  - Department of Biostatistics and Medical Informatics, School of Medicine, Acibadem University, Istanbul 34752, Turkey
5
Ou J, Liu H, Nirala NK, Stukalov A, Acharya U, Green MR, Zhu LJ. dagLogo: An R/Bioconductor package for identifying and visualizing differential amino acid group usage in proteomics data. PLoS One 2020; 15:e0242030. [PMID: 33156866 PMCID: PMC7647101 DOI: 10.1371/journal.pone.0242030] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 09/23/2020] [Accepted: 10/23/2020] [Indexed: 11/18/2022] Open
Abstract
Sequence logos have been widely used as graphical representations of conserved nucleic acid and protein motifs. Due to the complexity of the amino acid (AA) alphabet, rich post-translational modification, and diverse subcellular localization of proteins, few versatile tools are available for effective identification and visualization of protein motifs. In addition, various reduced AA alphabets based on physicochemical, structural, or functional properties have been valuable in the study of protein alignment, folding, structure prediction, and evolution. However, there is a lack of tools for applying reduced AA alphabets to the identification and visualization of statistically significant motifs. To fill this gap, we developed an R/Bioconductor package, dagLogo, which has several advantages over existing tools. First, dagLogo allows various formats for input sets and provides comprehensive options to build optimal background models. Second, it implements different reduced AA alphabets to group AAs of similar properties. Furthermore, dagLogo provides statistical and visual solutions for differential AA (or AA group) usage analysis of both large and small data sets. Case studies showed that dagLogo can better identify and visualize conserved protein sequence patterns from different types of inputs and can potentially reveal biological patterns that could be missed by other logo generators.
Affiliation(s)
- Jianhong Ou
  - Department of Molecular, Cell, and Cancer Biology, University of Massachusetts Medical School, Worcester, Massachusetts, United States of America
  - Regeneration NEXT, Duke University School of Medicine, Duke University, Durham, North Carolina, United States of America
- Haibo Liu
  - Department of Molecular, Cell, and Cancer Biology, University of Massachusetts Medical School, Worcester, Massachusetts, United States of America
- Niraj K. Nirala
  - Program in Molecular Medicine, University of Massachusetts Medical School, Worcester, Massachusetts, United States of America
- Alexey Stukalov
  - Institute of Virology, Technical University of Munich, Munich, Germany
- Usha Acharya
  - Department of Molecular, Cell, and Cancer Biology, University of Massachusetts Medical School, Worcester, Massachusetts, United States of America
- Michael R. Green
  - Department of Molecular, Cell, and Cancer Biology, University of Massachusetts Medical School, Worcester, Massachusetts, United States of America
- Lihua Julie Zhu
  - Department of Molecular, Cell, and Cancer Biology, University of Massachusetts Medical School, Worcester, Massachusetts, United States of America
  - Program in Molecular Medicine, University of Massachusetts Medical School, Worcester, Massachusetts, United States of America
  - Program in Bioinformatics and Integrative Biology, University of Massachusetts Medical School, Worcester, Massachusetts, United States of America
6
Caldararu O, Mehra R, Blundell TL, Kepp KP. Systematic Investigation of the Data Set Dependency of Protein Stability Predictors. J Chem Inf Model 2020; 60:4772-4784. [DOI: 10.1021/acs.jcim.0c00591] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Indexed: 02/06/2023]
Affiliation(s)
- Octav Caldararu
  - DTU Chemistry, Technical University of Denmark, Building 206, 2800 Kgs. Lyngby, Denmark
- Rukmankesh Mehra
  - DTU Chemistry, Technical University of Denmark, Building 206, 2800 Kgs. Lyngby, Denmark
- Tom L. Blundell
  - Department of Biochemistry, University of Cambridge, Cambridge CB2 1GA, United Kingdom
- Kasper P. Kepp
  - DTU Chemistry, Technical University of Denmark, Building 206, 2800 Kgs. Lyngby, Denmark
7
Zheng L, Huang S, Mu N, Zhang H, Zhang J, Chang Y, Yang L, Zuo Y. RAACBook: a web server of reduced amino acid alphabet for sequence-dependent inference by using Chou's five-step rule. Database: The Journal of Biological Databases and Curation 2020; 2019:5650975. [PMID: 31802128 PMCID: PMC6893003 DOI: 10.1093/database/baz131] [Citation(s) in RCA: 41] [Impact Index Per Article: 10.3] [Received: 06/20/2019] [Revised: 10/16/2019] [Accepted: 10/17/2019] [Indexed: 12/12/2022]
Abstract
Reducing the amino acid alphabet can significantly simplify protein complexity, which could improve computational efficiency, decrease information redundancy, and reduce the chance of overfitting. Although some reduced alphabets have been proposed, different classification rules can produce distinct results for protein sequence analysis. Thus, a systematic framework for reduced alphabets is urgently needed. In this work, we constructed a comprehensive web server called RAACBook for protein sequence analysis and machine learning applications by integrating reduced alphabets. The web server contains three parts: (i) 74 types of reduced amino acid alphabet were manually extracted to generate 673 reduced amino acid clusters (RAACs) for dealing with unique protein problems. Users can easily select desired RAACs from a multilayer browser tool. (ii) An online tool was developed to analyze the primary sequence of a protein. The tool can produce K-tuple reduced amino acid composition by defining three correlation parameters (K-tuple, g-gap, λ-correlation). The results are visualized as sequence alignment, mergence of RAA composition, feature distribution, and logo of the reduced sequence. (iii) A machine learning server is provided to train protein classification models based on K-tuple RAAC. The optimal model can be selected according to the evaluation indexes (ROC, AUC, MCC, etc.). In conclusion, RAACBook presents a powerful and user-friendly service for protein sequence analysis and computational proteomics. RAACBook is freely available at http://bioinfor.imu.edu.cn/raacbook. Database URL: http://bioinfor.imu.edu.cn/raacbook
Affiliation(s)
- Lei Zheng
  - State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Zhaojun Road No.24, Hohhot, 010070, China
- Shenghui Huang
  - State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Zhaojun Road No.24, Hohhot, 010070, China
- Nengjiang Mu
  - State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Zhaojun Road No.24, Hohhot, 010070, China
- Haoyue Zhang
  - State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Zhaojun Road No.24, Hohhot, 010070, China
- Jiayu Zhang
  - State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Zhaojun Road No.24, Hohhot, 010070, China
- Yu Chang
  - State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Zhaojun Road No.24, Hohhot, 010070, China
- Lei Yang
  - College of Bioinformatics Science and Technology, Harbin Medical University, Baojian Road No.157, Harbin 150081, China
- Yongchun Zuo
  - State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Zhaojun Road No.24, Hohhot, 010070, China
8
Laine E, Karami Y, Carbone A. GEMME: a simple and fast global epistatic model predicting mutational effects. Mol Biol Evol 2019; 36:2604-2619. [PMID: 31406981 PMCID: PMC6805226 DOI: 10.1093/molbev/msz179] [Citation(s) in RCA: 50] [Impact Index Per Article: 10.0] [Received: 02/14/2019] [Revised: 06/03/2019] [Accepted: 08/02/2019] [Indexed: 12/15/2022] Open
Abstract
The systematic and accurate description of protein mutational landscapes is a question of utmost importance in biology, bioengineering, and medicine. Recent progress has been achieved by leveraging the increasing wealth of genomic data and by modeling intersite dependencies within biological sequences. However, state-of-the-art methods remain time-consuming. Here, we present Global Epistatic Model for predicting Mutational Effects (GEMME) (www.lcqb.upmc.fr/GEMME), an original and fast method that predicts mutational outcomes by explicitly modeling the evolutionary history of natural sequences. This allows accounting for all positions in a sequence when estimating the effect of a given mutation. GEMME uses only a few biologically meaningful and interpretable parameters. Assessed against 50 high- and low-throughput mutational experiments, it performs similarly to or better than existing methods overall. It accurately predicts the mutational landscapes of a wide range of protein families, including viral ones and, more generally, highly conserved families. Given an input alignment, it generates the full mutational landscape of a protein in a matter of minutes. It is freely available as a package and a webserver at www.lcqb.upmc.fr/GEMME/.
Affiliation(s)
- Elodie Laine
  - Sorbonne Université, UPMC University Paris 06, CNRS, IBPS, UMR 7238, Laboratoire de Biologie Computationnelle et Quantitative (LCQB), 75005 Paris, France
- Yasaman Karami
  - Sorbonne Université, UPMC University Paris 06, CNRS, IBPS, UMR 7238, Laboratoire de Biologie Computationnelle et Quantitative (LCQB), 75005 Paris, France
  - Sorbonne Université, UPMC-Univ P6, Institut du Calcul et de la Simulation
- Alessandra Carbone
  - Sorbonne Université, UPMC University Paris 06, CNRS, IBPS, UMR 7238, Laboratoire de Biologie Computationnelle et Quantitative (LCQB), 75005 Paris, France
  - Institut Universitaire de France
9
Burdukiewicz M, Sobczyk P, Rödiger S, Duda-Madej A, Mackiewicz P, Kotulska M. Amyloidogenic motifs revealed by n-gram analysis. Sci Rep 2017; 7:12961. [PMID: 29021608 PMCID: PMC5636826 DOI: 10.1038/s41598-017-13210-9] [Citation(s) in RCA: 38] [Impact Index Per Article: 5.4] [Received: 03/27/2017] [Accepted: 09/21/2017] [Indexed: 12/24/2022] Open
Abstract
Amyloids are proteins associated with several clinical disorders, including Alzheimer's and Creutzfeldt-Jakob diseases. Despite their diversity, all amyloid proteins can undergo aggregation initiated by short segments called hot spots. To find the patterns defining the hot spots, we trained predictors of amyloidogenicity using n-grams and random forest classifiers. Since amyloidogenicity may depend not on the exact sequence of amino acids but on their more general properties, we tested 524,284 reduced amino acid alphabets of different lengths (three to six letters) to find the alphabet providing the best performance in cross-validation. The predictor based on this alphabet, called AmyloGram, was benchmarked against the most popular tools for the detection of amyloid peptides using an external data set and obtained the highest values of performance measures (AUC: 0.90, MCC: 0.63). Our results showed sequential patterns in the amyloids that are strongly correlated with hydrophobicity, a tendency to form β-sheets, and lower flexibility of amino acid residues. Among the most informative n-grams of AmyloGram, we identified 15 that were previously confirmed experimentally. AmyloGram is available as a web server at http://smorfland.uni.wroc.pl/shiny/AmyloGram/ and as the R package AmyloGram. R scripts and data used to produce the results of this manuscript are available at http://github.com/michbur/AmyloGramAnalysis.
Affiliation(s)
- Piotr Sobczyk
  - Faculty of Pure and Applied Mathematics, Wrocław University of Science and Technology, Wrocław, Poland
- Stefan Rödiger
  - Institute of Biotechnology, Brandenburg University of Technology Cottbus-Senftenberg, Senftenberg, Germany
- Anna Duda-Madej
  - Department of Microbiology, Wrocław Medical University, Wrocław, Poland
- Małgorzata Kotulska
  - Faculty of Fundamental Problems of Technology, Department of Biomedical Engineering, Wrocław University of Science and Technology, Wrocław, Poland
10
Childs LM, Baskerville EB, Cobey S. Trade-offs in antibody repertoires to complex antigens. Philos Trans R Soc Lond B Biol Sci 2016; 370:rstb.2014.0245. [PMID: 26194759 PMCID: PMC4528422 DOI: 10.1098/rstb.2014.0245] [Citation(s) in RCA: 41] [Impact Index Per Article: 5.1] [Indexed: 01/18/2023] Open
Abstract
Pathogens vary in their antigenic complexity. While some pathogens such as measles present a few relatively invariant targets to the immune system, others such as malaria display considerable antigenic diversity. How the immune response copes in the presence of multiple antigens, and whether a trade-off exists between the breadth and efficacy of antibody (Ab)-mediated immune responses, are unsolved problems. We present a theoretical model of affinity maturation of B-cell receptors (BCRs) during a primary infection and examine how variation in the number of accessible antigenic sites alters the Ab repertoire. Naive B cells with randomly generated receptor sequences initiate the germinal centre (GC) reaction. The binding affinity of a BCR to an antigen is quantified via a genotype-phenotype map, based on a random energy landscape, that combines local and distant interactions between residues. In the presence of numerous antigens or epitopes, B-cell clones with different specificities compete for stimulation during rounds of mutation within GCs. We find that the availability of many epitopes reduces the affinity and relative breadth of the Ab repertoire. Despite the stochasticity of somatic hypermutation, patterns of immunodominance are strongly shaped by chance selection of naive B cells with specificities for particular epitopes. Our model provides a mechanistic basis for the diversity of Ab repertoires and the evolutionary advantage of antigenically complex pathogens.
Affiliation(s)
- Lauren M Childs
  - Center for Communicable Disease Dynamics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Sarah Cobey
  - Ecology and Evolution, University of Chicago, Chicago, IL, USA
11
Solis AD. Amino acid alphabet reduction preserves fold information contained in contact interactions in proteins. Proteins 2015; 83:2198-216. [DOI: 10.1002/prot.24936] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.4] [Received: 05/06/2015] [Revised: 09/04/2015] [Accepted: 09/04/2015] [Indexed: 12/14/2022]
Affiliation(s)
- Armando D. Solis
  - Biological Sciences Department, New York City College of Technology; the City University of New York (CUNY); Brooklyn New York 11201
12
Huang JT, Wang T, Huang SR, Li X. Reduced alphabet for protein folding prediction. Proteins 2015; 83:631-9. [PMID: 25641420 DOI: 10.1002/prot.24762] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 09/15/2014] [Revised: 11/07/2014] [Accepted: 12/21/2014] [Indexed: 01/17/2023]
Abstract
What are the key building blocks that would have been needed to construct complex protein folds? This is an important issue for understanding protein folding mechanisms and guiding de novo protein design. Twenty naturally occurring amino acids and eight secondary structures constitute a 28-letter alphabet that determines folding kinetics and mechanism. Here we predict the folding kinetic rates of proteins from many reduced alphabets. We find that a reduced alphabet of 10 letters achieves good correlation with folding rates, close to the one achieved by the full 28-letter alphabet. Many other reduced alphabets are not significantly correlated with folding rates. The finding suggests that not all amino acids and secondary structures are equally important for protein folding. The foldable sequence of a protein could be designed using at least 10 folding units, which can either promote or inhibit protein folding. Reducing alphabet cardinality without losing key folding kinetic information opens the door to potentially faster machine learning and data mining applications in protein structure prediction, sequence alignment, and protein design.
Affiliation(s)
- Jitao T Huang
  - Department of Chemistry and National Laboratory of Elemento-Organic Chemistry, Nankai University, Tianjin, 300071, People's Republic of China
13
Suzuki S, Kakuta M, Ishida T, Akiyama Y. Faster sequence homology searches by clustering subsequences. Bioinformatics 2014; 31:1183-90. [PMID: 25432166 PMCID: PMC4393512 DOI: 10.1093/bioinformatics/btu780] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.2] [Received: 07/07/2014] [Accepted: 11/12/2014] [Indexed: 01/17/2023]
Abstract
Motivation: Sequence homology searches are used in various fields. New sequencing technologies produce huge amounts of sequence data, which continuously increase the size of sequence databases. As a result, homology searches require large amounts of computational time, especially for metagenomic analysis. Results: We developed a fast homology search method based on database subsequence clustering, and implemented it as GHOSTZ. This method clusters similar subsequences from a database to perform an efficient seed search and ungapped extension by reducing alignment candidates based on the triangle inequality. The database subsequence clustering technique achieved an ∼2-fold increase in speed without a large decrease in search sensitivity. When measured with metagenomic data, GHOSTZ was ∼2.2–2.8 times faster than RAPSearch and ∼185–261 times faster than BLASTX. Availability and implementation: The source code is freely available for download at http://www.bi.cs.titech.ac.jp/ghostz/ Contact: akiyama@cs.titech.ac.jp Supplementary information: Supplementary data are available at Bioinformatics online.
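The idea of reducing alignment candidates with the triangle inequality can be illustrated with a generic sketch. It is not GHOSTZ's implementation: Hamming distance on equal-length strings stands in for whatever distance the tool uses, and the function names and toy clusters are hypothetical. For a query q, a cluster representative r, and a member m, the bound d(q, m) ≥ |d(q, r) − d(r, m)| lets a search skip members that cannot be within the threshold.

```python
def hamming(a, b):
    """Hamming distance between two equal-length strings (a true metric)."""
    return sum(x != y for x, y in zip(a, b))

def build_clusters(groups):
    """Precompute d(representative, member) for every cluster member."""
    return [(rep, [(m, hamming(rep, m)) for m in members])
            for rep, members in groups.items()]

def search(query, clusters, max_dist):
    """Return members within max_dist of the query and how many were pruned."""
    hits, skipped = [], 0
    for rep, members in clusters:
        d_query_rep = hamming(query, rep)  # one distance per cluster
        for member, d_rep_member in members:
            # Lower bound on d(query, member) from the triangle inequality.
            if abs(d_query_rep - d_rep_member) > max_dist:
                skipped += 1
                continue
            if hamming(query, member) <= max_dist:
                hits.append(member)
    return hits, skipped

clusters = build_clusters({
    "AAAAAB": ["AAAAAB", "AAAABB"],
    "CCCCCC": ["CCCCCC", "CCCCCA"],
})
hits, skipped = search("AAAAAA", clusters, max_dist=2)
```

Because the bound never overestimates the true distance, pruning cannot discard a genuine hit under this metric; the savings come from clusters whose representative is far from the query.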
Affiliation(s)
- Shuji Suzuki
  - Graduate School of Information Science and Engineering, Tokyo Institute of Technology and Education Academy of Computational Life Sciences (ACLS), Tokyo Institute of Technology, Tokyo 152-8550, Japan
- Masanori Kakuta
  - Graduate School of Information Science and Engineering, Tokyo Institute of Technology and Education Academy of Computational Life Sciences (ACLS), Tokyo Institute of Technology, Tokyo 152-8550, Japan
- Takashi Ishida
  - Graduate School of Information Science and Engineering, Tokyo Institute of Technology and Education Academy of Computational Life Sciences (ACLS), Tokyo Institute of Technology, Tokyo 152-8550, Japan
- Yutaka Akiyama
  - Graduate School of Information Science and Engineering, Tokyo Institute of Technology and Education Academy of Computational Life Sciences (ACLS), Tokyo Institute of Technology, Tokyo 152-8550, Japan
14
|
An information-theoretic classification of amino acids for the assessment of interfaces in protein-protein docking. J Mol Model 2013; 19:3901-10. [PMID: 23828247 DOI: 10.1007/s00894-013-1916-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 05/04/2013] [Accepted: 06/09/2013] [Indexed: 12/28/2022]
Abstract
Docking represents a versatile and powerful method to predict the geometry of protein-protein complexes. However, despite significant methodical advances, the identification of good docking solutions among a large number of false solutions still remains a difficult task. We have previously demonstrated that the formalism of mutual information (MI) from information theory can be adapted to protein docking, and we have now extended this approach to enhance its robustness and applicability. A large dataset consisting of 22,934 docking decoys derived from 203 different protein-protein complexes was used for an MI-based optimization of reduced amino acid alphabets representing the protein-protein interfaces. This optimization relied on a clustering analysis that allows one to estimate the mutual information of whole amino acid alphabets by considering all structural features simultaneously, rather than by treating them individually. This clustering approach is fast and can be applied in a similar fashion to the generation of reduced alphabets for other biological problems like fold recognition, sequence data mining, or secondary structure prediction. The reduced alphabets derived from the present work were converted into a scoring function for the evaluation of docking solutions, which is available for public use via the web service score-MI: http://score-MI.biochem.uni-erlangen.de.
|
15
|
Stephenson JD, Freeland SJ. Unearthing the root of amino acid similarity. J Mol Evol 2013; 77:159-69. [PMID: 23743923 PMCID: PMC6763418 DOI: 10.1007/s00239-013-9565-0] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.5] [Received: 03/22/2013] [Accepted: 05/08/2013] [Indexed: 12/31/2022]
Abstract
Similarities and differences between amino acids define the rates at which they substitute for one another within protein sequences and the patterns by which these sequences form protein structures. However, there exist many ways to measure similarity, whether one considers the molecular attributes of individual amino acids, the roles that they play within proteins, or some nuanced contribution of each. One popular approach to representing these relationships is to divide the 20 amino acids of the standard genetic code into groups, thereby forming a simplified amino acid alphabet. Here, we develop a method to compare or combine different simplified alphabets, and apply it to 34 simplified alphabets from the scientific literature. We use this method to show that while different suggestions vary and agree in non-intuitive ways, they combine to reveal a consensus view of amino acid similarity that is clearly rooted in physico-chemistry.
Affiliation(s)
- James D Stephenson
- NASA Astrobiology Institute, University of Hawaii, Honolulu, HI, 96822, USA,
|
16
|
Affiliation(s)
- Rachel Kolodny
- Department of Computer Science, University of Haifa, Haifa 31905, Israel
- Leonid Pereyaslavets
- Department of Structural Biology, Stanford University, Stanford, California 94305
- Michael Levitt
- Department of Structural Biology, Stanford University, Stanford, California 94305
|
17
|
Hod R, Kohen R, Mandel-Gutfreund Y. Searching for protein signatures using a multilevel alphabet. Proteins 2013; 81:1058-68. [PMID: 23386227 DOI: 10.1002/prot.24261] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/07/2012] [Revised: 01/09/2013] [Accepted: 01/11/2013] [Indexed: 11/07/2022]
Abstract
Short motifs are known to play diverse roles in proteins, such as in mediating the interactions with other molecules, binding to membranes, or conducting a specific biological function. Standard approaches currently employed to detect short motifs in proteins search for enrichment of amino acid motifs considering mostly the sequence information. Here, we present a new approach to search for common motifs (protein signatures) which share both physicochemical and structural properties, looking simultaneously at different features. Our method takes as an input an amino acid sequence and translates it to a new alphabet that reflects its intrinsic structural and chemical properties. Using the MEME search algorithm, we identified the protein signatures within subsets of proteins which encompass common sequence and structural information. We demonstrated that we can detect enriched structural motifs, such as the amphipathic helix, from large datasets of linear sequences, as well as predict common structural properties (such as disorder, surface accessibility, or secondary structures) of known functional motifs. Finally, we applied the method to the yeast protein interactome and identified novel putative interacting motifs. We propose that our approach can be applied for de novo protein function prediction given either sequence or structural information.
Affiliation(s)
- Ronit Hod
- Faculty of Biology, Technion-Israel Institute of Technology, Haifa 32000, Israel
|
18
|
Masso M. Generation of atomic four-body statistical potentials derived from the Delaunay tessellation of protein structures. Annu Int Conf IEEE Eng Med Biol Soc 2012; 2012:6321-6324. [PMID: 23367374 DOI: 10.1109/embc.2012.6347439] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/01/2023]
Abstract
Delaunay tessellation of the atomic coordinates for a crystallographic protein structure yields an aggregate of non-overlapping and space-filling irregular tetrahedral simplices. The vertices of each simplex objectively identify a quadruplet of nearest neighbor atoms in the protein. Here we apply Delaunay tessellation to 1417 high-resolution structures of single chains that share low sequence identity, for the purpose of determining the relative frequencies of occurrence for all possible nearest neighbor atomic quadruplet types. Alternative distributions are explored by varying two fundamental parameters: atomic alphabet selection and cutoff length for admissible simplex edges. The distributions are then converted to four-body potential functions by implementing the inverted Boltzmann principle, which requires calculating the distribution of the reference state. Two alternative definitions for the reference state are presented, which introduces a third parameter, and we derive and compare an array of such potential functions. These knowledge-based statistical potentials based on higher-order interactions complement and generalize the more commonly encountered atom-pair potentials, for which a number of approaches are described in the literature.
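The inverted Boltzmann step named in this abstract converts observed and reference frequencies into pseudo-energies, E(q) = -kT ln(f_obs(q) / f_ref(q)). The following is a minimal sketch under stated assumptions: energies are reported in units of kT, a pseudocount guards against empty bins, and the quadruplet labels and counts are invented for illustration. The paper's own reference-state definitions are not reproduced here.

```python
import math


def inverse_boltzmann(observed, reference, kT=1.0, pseudocount=1.0):
    """Convert observed/reference counts into pseudo-energies per type.

    E(q) = -kT * ln(f_obs(q) / f_ref(q)), with additive smoothing (an assumption).
    """
    types = sorted(set(observed) | set(reference))
    n_obs = sum(observed.values()) + pseudocount * len(types)
    n_ref = sum(reference.values()) + pseudocount * len(types)
    energies = {}
    for t in types:
        f_obs = (observed.get(t, 0) + pseudocount) / n_obs
        f_ref = (reference.get(t, 0) + pseudocount) / n_ref
        energies[t] = -kT * math.log(f_obs / f_ref)
    return energies


# Hypothetical counts for two atomic quadruplet types (labels are made up).
observed = {"CCCN": 30, "CCNO": 10}
reference = {"CCCN": 20, "CCNO": 20}
energies = inverse_boltzmann(observed, reference)
print(energies)
```

Types seen more often than the reference state predicts get negative (favorable) energies, and under-represented types get positive ones.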
Affiliation(s)
- Majid Masso
- Laboratory for Structural Bioinformatics, School of Systems Biology, George Mason University, Manassas, VA 20110, USA.
|
19
|
Solis AD, Rackovsky SR. Fold homology detection using sequence fragment composition profiles of proteins. Proteins 2011; 78:2745-56. [PMID: 20635424 DOI: 10.1002/prot.22788] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/05/2022]
Abstract
The effectiveness of sequence alignment in detecting structural homology among protein sequences decreases markedly when pairwise sequence identity is low (the so-called "twilight zone" problem of sequence alignment). Alternative sequence comparison strategies able to detect structural kinship among highly divergent sequences are necessary to address this need. Among them are alignment-free methods, which use global sequence properties (such as amino acid composition) to identify structural homology in a rapid and straightforward way. We explore the viability of using tetramer sequence fragment composition profiles in finding structural relationships that lie undetected by traditional alignment. We establish a strategy to recast any given protein sequence into a tetramer sequence fragment composition profile, using a series of amino acid clustering steps that have been optimized for mutual information. Our method has the effect of compressing the set of 160,000 unique tetramers (if using the 20-letter amino acid alphabet) into a more tractable number of reduced tetramers (approximately 15-30), so that a meaningful tetramer composition profile can be constructed. We test remote homology detection at the topology and fold superfamily levels using a comprehensive set of fold homologs, culled from the CATH database that share low pairwise sequence similarity. Using the receiver-operating characteristic measure, we demonstrate potentially significant improvement in using information-optimized reduced tetramer composition, over methods relying only on the raw amino acid composition or on traditional sequence alignment, in homology detection at or below the "twilight zone".
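The figure of 160,000 unique tetramers is 20^4, and grouping amino acids shrinks that space quickly. The sketch below builds a tetramer composition profile over a reduced alphabet. The four-group alphabet is a hypothetical illustration, not the mutual-information-optimized clustering described in the abstract, and the input sequence is invented.

```python
from collections import Counter

# Hypothetical 4-letter reduced alphabet (not the clustering from the paper):
# h = hydrophobic, p = polar/small, b = basic, a = acidic.
GROUPS = {"h": "AVLIMFWC", "p": "STNQYGP", "b": "KRH", "a": "DE"}
LETTER_TO_GROUP = {aa: g for g, aas in GROUPS.items() for aa in aas}


def tetramer_profile(seq, k=4):
    """Normalized composition of overlapping k-mers in the reduced alphabet."""
    reduced = "".join(LETTER_TO_GROUP[aa] for aa in seq)
    counts = Counter(reduced[i:i + k] for i in range(len(reduced) - k + 1))
    total = sum(counts.values())
    return {frag: n / total for frag, n in counts.items()}


print(20 ** 4)           # size of the full tetramer space
print(len(GROUPS) ** 4)  # size of the reduced tetramer space for this alphabet
profile = tetramer_profile("MKVLAAGDE")
print(profile)
```

Two sequences can then be compared through their profiles (for example with a vector distance) without computing an alignment, which is the alignment-free idea the abstract describes.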
Affiliation(s)
- Armando D Solis
- Department of Biological Sciences, New York City College of Technology, The City University of New York, Brooklyn, New York 11201, USA.
|
20
|
Capriotti E, Norambuena T, Marti-Renom MA, Melo F. All-atom knowledge-based potential for RNA structure prediction and assessment. Bioinformatics 2011; 27:1086-93. [PMID: 21349865 DOI: 10.1093/bioinformatics/btr093] [Citation(s) in RCA: 57] [Impact Index Per Article: 4.4] [Indexed: 12/27/2022]
Abstract
MOTIVATION Over the recent years, the vision that RNA simply serves as information transfer molecule has dramatically changed. The study of the sequence/structure/function relationships in RNA is becoming more important. As a direct consequence, the total number of experimentally solved RNA structures has dramatically increased and new computer tools for predicting RNA structure from sequence are rapidly emerging. Therefore, new and accurate methods for assessing the accuracy of RNA structure models are clearly needed. RESULTS Here, we introduce an all-atom knowledge-based potential for the assessment of RNA three-dimensional (3D) structures. We have benchmarked our new potential, called Ribonucleic Acids Statistical Potential (RASP), with two different decoy datasets composed of near-native RNA structures. In one of the benchmark sets, RASP was able to rank the closest model to the X-ray structure as the best and within the top 10 models for ∼93 and ∼95% of decoys, respectively. The average correlation coefficient between model accuracy, calculated as the root mean square deviation and global distance test-total score (GDT-TS) measures of C3' atoms, and the RASP score was 0.85 and 0.89, respectively. Based on a recently released benchmark dataset that contains hundreds of 3D models for 32 RNA motifs with non-canonical base pairs, RASP scoring function compared favorably to ROSETTA FARFAR force field in the selection of accurate models. Finally, using the self-splicing group I intron and the stem-loop IIIc from hepatitis C virus internal ribosome entry site as test cases, we show that RASP is able to discriminate between known structure-destabilizing mutations and compensatory mutations. AVAILABILITY RASP can be readily applied to assess all-atom or coarse-grained RNA structures and thus should be of interest to both developers and end-users of RNA structure prediction methods. 
The computer software and knowledge-based potentials are freely available at http://melolab.org/supmat.html. CONTACT fmelo@bio.puc.cl; mmarti@cipf.es SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Emidio Capriotti
- Structural Genomics Unit, Bioinformatics and Genomics Department, Centro de Investigación Principe Felipe, 46012 Valencia, Spain
|
21
|
Slama P, Geman D. Identification of family-determining residues in PHD fingers. Nucleic Acids Res 2010; 39:1666-79. [PMID: 21059680 PMCID: PMC3061080 DOI: 10.1093/nar/gkq947] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Indexed: 12/13/2022]
Abstract
Histone modifications are fundamental to chromatin structure and transcriptional regulation, and are recognized by a limited number of protein folds. Among these folds are PHD fingers, which are present in most chromatin modification complexes. To date, about 15 PHD finger domains have been structurally characterized, whereas hundreds of different sequences have been identified. Consequently, an important open problem is to predict structural features of a PHD finger knowing only its sequence. Here, we classify PHD fingers into different groups based on the analysis of residue–residue co-evolution in their sequences. We measure the degree to which fixing the amino acid type at one position modifies the frequencies of amino acids at other positions. We then detect those position/amino acid combinations, or ‘conditions’, which have the strongest impact on other sequence positions. Clustering these strong conditions yields four families, providing informative labels for PHD finger sequences. Existing experimental results, as well as docking calculations performed here, reveal that these families indeed show discrepancies at the functional level. Our method should facilitate the functional characterization of new PHD fingers, as well as other protein families, solely based on sequence information.
Affiliation(s)
- Patrick Slama
- Institute for Computational Medicine and Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD, USA.
|
22
|
Molecular characterization of Neospora caninum MAG1, a dense granule protein secreted into the parasitophorous vacuole, and associated with the cyst wall and the cyst matrix. Parasitology 2010; 137:1605-19. [PMID: 20444303 DOI: 10.1017/s0031182010000442] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Indexed: 11/07/2022]
Abstract
SUMMARY In Neospora caninum and Toxoplasma gondii, the parasitophorous vacuole (PV) is synthesized at the time of infection. During tachyzoite-to-bradyzoite stage conversion, the PV is later transformed into a tissue cyst that allows parasites to survive in their host for extended periods of time. We report on the characterization of NcMAG1, the N. caninum orthologue of T. gondii MAG1 (matrix antigen 1; TgMAG1). The 456 amino acid predicted NcMAG1 protein is 54% identical to TgMAG1. By immunoblotting, a rabbit antiserum raised against recombinant NcMAG1 detected a major product of approximately 67 kDa in extracts of N. caninum tachyzoite-infected Vero cells, which was stained more prominently in extracts of infected Vero cells treated to induce in vitro bradyzoite conversion. Immunofluorescence and TEM localized the protein mainly within the cyst wall and the cyst matrix. In both tachyzoites and bradyzoites, NcMAG1 was associated with the parasite dense granules. Comparison between NcMAG1 and TgMAG1 amino acid sequences revealed that the C-terminal conserved regions exhibit 66% identity, while the N-terminal variable regions exhibit only 32% identity. Antibodies against NcMAG1-conserved region cross-reacted with the orthologous protein in T. gondii but those against the variable region did not. This indicates that the variable region possesses unique antigenic characteristics.
|
23
|
Rykunov D, Fiser A. New statistical potential for quality assessment of protein models and a survey of energy functions. BMC Bioinformatics 2010; 11:128. [PMID: 20226048 PMCID: PMC2853469 DOI: 10.1186/1471-2105-11-128] [Citation(s) in RCA: 72] [Impact Index Per Article: 5.1] [Received: 10/06/2009] [Accepted: 03/12/2010] [Indexed: 11/30/2022]
Abstract
Background Scoring functions, such as molecular mechanic forcefields and statistical potentials are fundamentally important tools in protein structure modeling and quality assessment. Results The performances of a number of publicly available scoring functions are compared with statistical rigor, with an emphasis on knowledge-based potentials. We explored the effect on accuracy of alternative choices for representing interaction center types and other features of scoring functions, such as using information on solvent accessibility, on torsion angles, accounting for secondary structure preferences and side chain orientation. Partially based on the observations made, we present a novel residue-based statistical potential, which employs a shuffled reference state definition and takes into account the mutual orientation of residue side chains. Atom- and residue-level statistical potentials and Linux executables to calculate the energy of a given protein proposed in this work can be downloaded from http://www.fiserlab.org/potentials. Conclusions Among the most influential terms we observed a critical role of a proper reference state definition and the benefits of including information about the microenvironment of interaction centers. Molecular mechanical potentials were also tested and found to be over-sensitive to small local imperfections in a structure, requiring unfeasibly long energy relaxation before energy scores started to correlate with model quality.
Affiliation(s)
- Dmitry Rykunov
- Department of Systems and Computational Biology, Albert Einstein College of Medicine, 1300 Morris Park Ave, Bronx, NY 10461, USA
|
24
|
Abstract
Empirical or knowledge-based potentials have many applications in structural biology such as the prediction of protein structure, protein-protein, and protein-ligand interactions and in the evaluation of stability for mutant proteins, the assessment of errors in experimentally solved structures, and the design of new proteins. Here, we describe a simple procedure to derive and use pairwise distance-dependent potentials that rely on the definition of effective atomic interactions, which attempt to capture interactions that are more likely to be physically relevant. Based on a difficult benchmark test composed of proteins with different secondary structure composition and representing many different folds, we show that the use of effective atomic interactions significantly improves the performance of potentials at discriminating between native and near-native conformations. We also found that, in agreement with previous reports, the potentials derived from the observed effective atomic interactions in native protein structures contain a larger amount of mutual information. A detailed analysis of the effective energy functions shows that atom connectivity effects, which mostly arise when deriving the potential by the incorporation of those indirect atomic interactions occurring beyond the first atomic shell, are clearly filtered out. The shape of the energy functions for direct atomic interactions representing hydrogen bonding and disulfide and salt bridges formation is almost unaffected when effective interactions are taken into account. On the contrary, the shape of the energy functions for indirect atom interactions (i.e., those describing the interaction between two atoms bound to a direct interacting pair) is clearly different when effective interactions are considered. Effective energy functions for indirect interacting atom pairs are not influenced by the shape or the energy minimum observed for the corresponding direct interacting atom pair. 
Our results suggest that the dependency between the signals in different energy functions is a key aspect that needs to be addressed when empirical energy functions are derived and used, and also highlight the importance of additivity assumptions in the use of potential energy functions.
Affiliation(s)
- Evandro Ferrada
- Departamento de Genética Molecular y Microbiología, Facultad de Ciencias Biológicas, Pontificia Universidad Católica de Chile, Alameda 340, Santiago, Chile
|
25
|
Peterson EL, Kondev J, Theriot JA, Phillips R. Reduced amino acid alphabets exhibit an improved sensitivity and selectivity in fold assignment. Bioinformatics 2009; 25:1356-62. [PMID: 19351620 DOI: 10.1093/bioinformatics/btp164] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.3] [Indexed: 11/15/2022]
Abstract
MOTIVATION Many proteins with vastly dissimilar sequences are found to share a common fold, as evidenced in the wealth of structures now available in the Protein Data Bank. One idea that has found success in various applications is the concept of a reduced amino acid alphabet, wherein similar amino acids are clustered together. Given the structural similarity exhibited by many apparently dissimilar sequences, we undertook this study looking for improvements in fold recognition by comparing protein sequences written in a reduced alphabet. RESULTS We tested over 150 of the amino acid clustering schemes proposed in the literature with all-versus-all pairwise sequence alignments of sequences in the Distance mAtrix aLIgnment database. We combined several metrics from information retrieval popular in the literature: mean precision, area under the Receiver Operating Characteristic curve and recall at a fixed error rate and found that, in contrast to previous work, reduced alphabets in many cases outperform full alphabets. We find that reduced alphabets can perform at a level comparable to full alphabets in correct pairwise alignment of sequences and can show increased sensitivity to pairs of sequences with structural similarity but low-sequence identity. Based on these results, we hypothesize that reduced alphabets may also show performance gains with more sophisticated methods such as profile and pattern searches. AVAILABILITY A table of results as well as the substitution matrices and residue groupings from this study can be downloaded from (http://www.rpgroup.caltech.edu/publications/supplements/alphabets).
Affiliation(s)
- Eric L Peterson
- Department of Physics, California Institute of Technology, Pasadena, CA 91125, USA
|
26
|
Bacardit J, Stout M, Hirst JD, Valencia A, Smith RE, Krasnogor N. Automated alphabet reduction for protein datasets. BMC Bioinformatics 2009; 10:6. [PMID: 19126227 PMCID: PMC2646702 DOI: 10.1186/1471-2105-10-6] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.0] [Received: 08/08/2008] [Accepted: 01/06/2009] [Indexed: 11/10/2022]
Abstract
BACKGROUND We investigate automated and generic alphabet reduction techniques for protein structure prediction datasets. Reducing alphabet cardinality without losing key biochemical information opens the door to potentially faster machine learning, data mining and optimization applications in structural bioinformatics. Furthermore, reduced but informative alphabets often result in, e.g., more compact and human-friendly classification/clustering rules. In this paper we propose a robust and sophisticated alphabet reduction protocol based on mutual information and state-of-the-art optimization techniques. RESULTS We applied this protocol to the prediction of two protein structural features: contact number and relative solvent accessibility. For both features we generated alphabets of two, three, four and five letters. The five-letter alphabets gave prediction accuracies statistically similar to that obtained using the full amino acid alphabet. Moreover, the automatically designed alphabets were compared against other reduced alphabets taken from the literature or human-designed, outperforming them. The differences between our alphabets and the alphabets taken from the literature were quantitatively analyzed. All of the above was performed using a primary sequence representation of proteins. As a final experiment, we extrapolated the obtained five-letter alphabet to reduce a much richer protein representation based on evolutionary information for the prediction of the same two features. Again, the performance gap between the full representation and the reduced representation was small, showing that the results of our automated alphabet reduction protocol, even if they were obtained using a simple representation, are also able to capture the crucial information needed for state-of-the-art protein representations. CONCLUSION Our automated alphabet reduction protocol generates competent reduced alphabets tailored specifically for a variety of protein datasets. 
This process is done without any domain knowledge, using information theory metrics instead. The reduced alphabets contain some unexpected (but sound) groups of amino acids, thus suggesting new ways of interpreting the data.
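The mutual-information criterion behind such alphabet reduction can be sketched as follows. This is a plug-in estimate of I(X;Y) in bits, compared between the full letters and a reduced alphabet. The residue string, the buried/exposed labels, and the two-letter grouping are hypothetical toy data built so that grouping loses no information; the protocol in the paper additionally involves an optimization step that is not shown.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Plug-in estimate of I(X;Y) in bits from a list of (x, y) observations."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(
        (n_xy / n) * math.log2(n_xy * n / (px[x] * py[y]))
        for (x, y), n_xy in joint.items()
    )


# Hypothetical toy data: residue letters and a buried (B) / exposed (E) label.
residues = "AVLKDEAVKD"
labels = "BBBEEEBBEE"
# Hypothetical two-letter alphabet: h = hydrophobic, c = charged.
reduce_map = {"A": "h", "V": "h", "L": "h", "K": "c", "D": "c", "E": "c"}

mi_full = mutual_information(list(zip(residues, labels)))
mi_reduced = mutual_information(
    [(reduce_map[r], y) for r, y in zip(residues, labels)]
)
print(mi_full, mi_reduced)
```

Because a reduced letter is a deterministic function of the full letter, the reduced estimate can never exceed the full one on the same data; a good reduction is one that keeps the two close.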
Affiliation(s)
- Jaume Bacardit
- ASAP research group, School of Computer Science, University of Nottingham, Jubilee Campus, Wollaton Road, Nottingham, NG8 1BB, UK.
|
27
|
|
28
|
Niv MY, Skrabanek L, Roberts RJ, Scheraga HA, Weinstein H. Identification of GATC- and CCGG-recognizing Type II REases and their putative specificity-determining positions using Scan2S--a novel motif scan algorithm with optional secondary structure constraints. Proteins 2008; 71:631-40. [PMID: 17972284 DOI: 10.1002/prot.21777] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 01/30/2023]
Abstract
Restriction endonucleases (REases) are DNA-cleaving enzymes that have become indispensable tools in molecular biology. Type II REases are highly divergent in sequence despite their common structural core, function and, in some cases, common specificities towards DNA sequences. This makes it difficult to identify and classify them functionally based on sequence, and has hampered the efforts of specificity-engineering. Here, we define novel REase sequence motifs, which extend beyond the PD-(D/E)XK hallmark, and incorporate secondary structure information. The automated search using these motifs is carried out with a newly developed fast regular expression matching algorithm that accommodates long patterns with optional secondary structure constraints. Using this new tool, named Scan2S, motifs derived from REases with specificity towards GATC- and CCGG-containing DNA sequences successfully identify REases of the same specificity. Notably, some of these sequences are not identified by standard sequence detection tools. The new motifs highlight potential specificity-determining positions that do not fully overlap for the GATC- and the CCGG-recognizing REases and are candidates for specificity re-engineering.
Affiliation(s)
- Masha Y Niv
- Department of Physiology and Biophysics, Weill Medical College of Cornell University, 1300 York Ave., New York, New York 10021, USA.
|
29
|
Panjkovich A, Melo F, Marti-Renom MA. Evolutionary potentials: structure specific knowledge-based potentials exploiting the evolutionary record of sequence homologs. Genome Biol 2008; 9:R68. [PMID: 18397517 PMCID: PMC2643939 DOI: 10.1186/gb-2008-9-4-r68] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Received: 03/13/2008] [Revised: 04/02/2008] [Accepted: 04/08/2008] [Indexed: 11/10/2022]
Abstract
We introduce a new type of knowledge-based potentials for protein structure prediction, called 'evolutionary potentials', which are derived using a single experimental protein structure and all three-dimensional models of its homologous sequences. The new potentials have been benchmarked against other knowledge-based potentials, resulting in a significant increase in accuracy for model assessment. In contrast to standard knowledge-based potentials, we propose that evolutionary potentials capture key determinants of thermodynamic stability and specific sequence constraints required for fast folding.
Affiliation(s)
- Alejandro Panjkovich
- Departamento de Genética Molecular y Microbiología, Facultad de Ciencias Biológicas, Pontificia Universidad Católica de Chile, Alameda 340, Santiago, Chile
|
30
|
Dahl DB, Bohannan Z, Mo Q, Vannucci M, Tsai J. Assessing side-chain perturbations of the protein backbone: a knowledge-based classification of residue Ramachandran space. J Mol Biol 2008; 378:749-58. [PMID: 18377931 DOI: 10.1016/j.jmb.2008.02.043] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.8] [Received: 11/08/2007] [Revised: 02/20/2008] [Accepted: 02/21/2008] [Indexed: 11/25/2022]
Abstract
Grouping the 20 residues is a classic strategy to discover ordered patterns and insights about the fundamental nature of proteins, their structure, and how they fold. Usually, this categorization is based on the biophysical and/or structural properties of a residue's side-chain group. We extend this approach to understand the effects of side chains on backbone conformation and to perform a knowledge-based classification of amino acids by comparing their backbone phi, psi distributions in different types of secondary structure. At this finer, more specific resolution, torsion angle data are often sparse and discontinuous (especially for nonhelical classes) even though a comprehensive set of protein structures is used. To ensure the precision of Ramachandran plot comparisons, we applied a rigorous Bayesian density estimation method that produces continuous estimates of the backbone phi, psi distributions. Based on this statistical modeling, a robust hierarchical clustering was performed using a divergence score to measure the similarity between plots. There were seven general groups based on the clusters from the complete Ramachandran data: nonpolar/beta-branched (Ile and Val), AsX (Asn and Asp), long (Met, Gln, Arg, Glu, Lys, and Leu), aromatic (Phe, Tyr, His, and Cys), small (Ala and Ser), bulky (Thr and Trp), and, lastly, the singletons of Gly and Pro. At the level of secondary structure (helix, sheet, turn, and coil), these groups remain somewhat consistent, although there are a few significant variations. Besides the expected uniqueness of the Gly and Pro distributions, the nonpolar/beta-branched and AsX clusters were very consistent across all types of secondary structure. Effectively, this consistency across the secondary structure classes implies that side-chain steric effects strongly influence a residue's backbone torsion angle conformation. 
These results help to explain the plasticity of amino acid substitutions on protein structure and should help in protein design and structure evaluation.
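The grouping reported in this abstract can be written down as a reduced alphabet. The following is a minimal sketch, assuming Gly and Pro are each given their own symbol; the lowercase group symbols and the function name `reduce_sequence` are hypothetical labels chosen for illustration and do not come from the paper.

```python
# Residue groups as listed in the abstract (one-letter amino acid codes).
# The lowercase group symbols are hypothetical labels, not the authors' notation.
DAHL_GROUPS = {
    "b": "IV",      # nonpolar/beta-branched: Ile, Val
    "x": "ND",      # AsX: Asn, Asp
    "l": "MQREKL",  # long: Met, Gln, Arg, Glu, Lys, Leu
    "a": "FYHC",    # aromatic: Phe, Tyr, His, Cys
    "s": "AS",      # small: Ala, Ser
    "k": "TW",      # bulky: Thr, Trp
    "g": "G",       # singleton: Gly
    "p": "P",       # singleton: Pro
}

# Invert the table: one-letter residue code -> group symbol.
RESIDUE_TO_GROUP = {
    residue: symbol
    for symbol, members in DAHL_GROUPS.items()
    for residue in members
}


def reduce_sequence(sequence):
    """Rewrite a one-letter amino acid sequence in the reduced alphabet."""
    return "".join(RESIDUE_TO_GROUP[residue] for residue in sequence.upper())


if __name__ == "__main__":
    print(reduce_sequence("MKVLAG"))
```

Under these assumptions, every one of the 20 standard residues maps to exactly one of eight symbols, so a sequence keeps its length but uses a smaller alphabet.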
Affiliation(s)
- David B Dahl
- Department of Statistics, Texas A&M University, College Station, TX 77843, USA
31
Abstract
Identification and classification of G-protein-coupled receptors (GPCRs) from protein sequences is an important computational challenge, given that experimental screening of thousands of ligands is an expensive proposition. There are two distinct but complementary approaches to GPCR classification: machine learning and sequence motif analysis. Machine learning methodologies typically suffer from problems of class imbalance and lack of multi-class classification. Many sequence motif methods, meanwhile, are too dependent on the similarity of the primary sequence alignments. It is desirable to have a motif discovery and application methodology that is not strongly dependent on primary sequence similarity and that also overcomes the limitations of machine learning. We propose and evaluate the effectiveness of a simple methodology that uses a reduced protein functional alphabet representation, in which functionally similar residues have similar symbols. Regular expression motifs can then be obtained by ClustalW-based multiple sequence alignment, using an identity matrix. Since evolutionary matrices such as BLOSUM and PAM are not used, this method can be useful for sets of sequences that do not necessarily share a common ancestry. Reduced alphabet motifs can accurately classify known GPCR proteins, and the results are comparable to those of PRINTS and PROSITE. For well-known GPCR proteins from SWISSPROT, there were no false negatives and only a few false positives. This methodology covers most currently known classes of GPCRs, even if there are very few representative sequences. It also predicts more than one class for certain sequences, thus overcoming the limitation of machine learning methods. We also annotated 695 orphan receptors, of which 121 were identified as belonging to Family A. A simple JavaScript-based web interface has been developed to predict GPCR families and subfamilies (www.insilico-consulting.com/gpcrmotif.html).
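The idea of deriving regular expression motifs from reduced-alphabet alignments can be illustrated with a small sketch. The abstract does not list the functional alphabet, so the grouping below is a hypothetical one, and the sequences are assumed to be already aligned (the alignment step itself, done with ClustalW in the paper, is not reproduced here). The names `FUNCTIONAL_GROUPS`, `reduce_sequence`, and `motif_from_alignment` are illustrative only.

```python
import re

# Hypothetical functional grouping for illustration; not the alphabet used in the paper.
FUNCTIONAL_GROUPS = {
    "h": "AVLIMC",  # hydrophobic
    "a": "FWY",     # aromatic
    "p": "STNQ",    # polar
    "k": "KRH",     # positively charged
    "d": "DE",      # negatively charged
    "g": "G",       # glycine
    "o": "P",       # proline
}
RESIDUE_TO_SYMBOL = {
    residue: symbol
    for symbol, members in FUNCTIONAL_GROUPS.items()
    for residue in members
}


def reduce_sequence(sequence):
    """Rewrite a sequence in the reduced alphabet; alignment gaps ('-') are kept."""
    return "".join(RESIDUE_TO_SYMBOL.get(residue, "-") for residue in sequence.upper())


def motif_from_alignment(aligned_reduced):
    """Build a regular expression from equal-length, aligned reduced sequences.

    Identical columns become a literal symbol, mixed columns become a
    character class, and columns containing a gap become a wildcard.
    """
    parts = []
    for column in zip(*aligned_reduced):
        symbols = sorted(set(column))
        if "-" in symbols:
            parts.append(".")
        elif len(symbols) == 1:
            parts.append(symbols[0])
        else:
            parts.append("[" + "".join(symbols) + "]")
    return "".join(parts)


if __name__ == "__main__":
    block = [reduce_sequence(s) for s in ("DRYLAI", "ERYVAV", "DRFLSI")]
    motif = motif_from_alignment(block)
    query = reduce_sequence("AADRWISVKK")
    print(motif, bool(re.search(motif, query)))
```

Because residues in the same group collapse to one symbol before matching, the motif can match a query whose primary sequence differs from every sequence in the alignment block.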
Affiliation(s)
- Rajeev Gangal
- Insilico Consulting, 402, Citi Centre, 39/2, Erandwane, Karve Road, Pune, Maharashtra, India
32
Hu J, Yan C. HMM_RA: an improved method for alpha-helical transmembrane protein topology prediction. Bioinform Biol Insights 2008; 2:67-74. [PMID: 19812766 PMCID: PMC2735969 DOI: 10.4137/bbi.s358] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/13/2022] Open
Abstract
Alpha-helical transmembrane (TM) proteins play important and diverse functional roles in cells. The ability to predict the topology of these proteins is important for identifying functional sites and inferring the function of membrane proteins. This paper presents a Hidden Markov Model (referred to as HMM_RA) that can predict the topology of alpha-helical transmembrane proteins with improved performance. HMM_RA adopts the same structure as the HMMTOP method, which has five modules: inside loop, inside helix tail, membrane helix, outside helix tail and outside loop. Each module consists of one or multiple states. HMM_RA allows the use of reduced alphabets to encode protein sequences. Thus, each state of HMM_RA is associated with n emission probabilities, where n is the size of the reduced alphabet set. Direct comparisons using two standard data sets show that HMM_RA consistently outperforms HMMTOP and TMHMM in topology prediction. Specifically, on a high-quality data set of 83 proteins, HMM_RA outperforms HMMTOP by up to 7.6% in topology accuracy and 6.4% in alpha-helix location accuracy. On the same data set, HMM_RA outperforms TMHMM by up to 6.4% in topology accuracy and 2.9% in location accuracy. Comparison also shows that HMM_RA achieves performance comparable to that of Phobius, a recently published method.
Affiliation(s)
- Jing Hu
- Department of Computer Science, Utah State University, Logan, UT 84322 U.S.A
- Changhui Yan
- Department of Computer Science, Utah State University, Logan, UT 84322 U.S.A
33
Lu Y, Freeland SJ. A quantitative investigation of the chemical space surrounding amino acid alphabet formation. J Theor Biol 2007; 250:349-61. [PMID: 18005995 DOI: 10.1016/j.jtbi.2007.10.007] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.2] [Received: 05/24/2007] [Revised: 09/21/2007] [Accepted: 10/08/2007] [Indexed: 11/29/2022]
Abstract
To date, explanations for the origin and emergence of the alphabet of amino acids encoded by the standard genetic code have been largely qualitative and speculative. Here, with the help of computational chemistry, we present the first quantitative exploration of nature's "choices" set against various models for plausible alternatives. Specifically, we consider the chemical space defined by three fundamental biophysical properties (size, charge, and hydrophobicity) to ask whether the amino acids that entered the genetic code exhibit a higher diversity than random samples of similar size drawn from several different definitions of amino acid possibility space. We found that in terms of the properties studied, the full, standard set of 20 biologically encoded amino acids is indeed significantly more diverse than an equivalently sized group drawn at random from the set of plausible, prebiotic alternatives (using the Murchison meteorite as a model for pre-biotic plausibility). However, when the set of possible amino acids is enlarged to include those that are produced by standard biosynthetic pathways (reflecting the widespread idea that many members of the standard alphabet were recruited in this way), then the genetically encoded amino acids can no longer be distinguished as more diverse than a random sample. Finally, if we turn to consider the overlap between biologically encoded amino acids and those that are prebiotically plausible, then we find that the biologically encoded subset are no more diverse as a group than would be expected from a random sample, unless the definition of "random sample" is adjusted to reflect possible prebiotic abundance (again, using the contents of the Murchison meteorite as our estimator). 
This final result is contingent on the accuracy of our computational estimates for amino acid properties, and prebiotic abundances, and an exploration of the likely effect of errors in our estimation reveals that our results should be treated with caution. We thus present this work as a first step in quantifying and thus testing various origin-of-life hypotheses regarding the origin and evolution of life's amino acid alphabet, and advocate the progress that would add valuable information in the future.
Affiliation(s)
- Yi Lu
- Department of Biological Sciences, University of Maryland, Baltimore County, 1000 Hilltop Circle, Baltimore, MD 21250, USA
34
Etchebest C, Benros C, Bornot A, Camproux AC, de Brevern AG. A reduced amino acid alphabet for understanding and designing protein adaptation to mutation. Eur Biophys J 2007; 36:1059-69. [PMID: 17565494 DOI: 10.1007/s00249-007-0188-5] [Citation(s) in RCA: 61] [Impact Index Per Article: 3.6] [Received: 02/13/2007] [Revised: 05/05/2007] [Accepted: 05/07/2007] [Indexed: 10/23/2022]
Abstract
The protein sequence world is considerably larger than the structure world. Consequently, numerous unrelated sequences may adopt similar 3D folds, and different kinds of amino acids may thus be found in similar 3D structures. By grouping the 20 amino acids into a smaller number of representative residues with similar features, the sequence world can be simplified. This clustering defines a reduced amino acid alphabet (reduced AAA). Numerous works have shown that protein 3D structures are composed of a limited number of building blocks, defining a structural alphabet. We previously identified such an alphabet, composed of 16 representative structural motifs (five residues in length) called Protein Blocks (PBs). This alphabet allows a structure (3D) to be translated into a sequence of PBs (1D). Based on these two concepts, reduced AAA and PBs, we analyzed the distributions of the different kinds of amino acids and their equivalences in the structural context. Different reduced sets were considered. Recurrent amino acid associations were found in all the local structures, while others were specific to some local structures (PBs) (e.g., cysteine, histidine, threonine and serine for the alpha-helix N-cap). Some similar associations are found in other reduced AAAs, e.g., Ile with Val, or the hydrophobic aromatic residues Trp with Phe and Tyr. We also identified interesting alternative associations. This highlights the dependence on the information considered (sequence or structure). This approach, equivalent to a substitution matrix, could be useful for designing protein sequences with different features (for instance, adaptation to the environment) while largely preserving the 3D fold.
Affiliation(s)
- C Etchebest
- Equipe de Bioinformatique Génomique et Moléculaire (EBGM), INSERM UMR-S 726, Université Denis DIDEROT, Paris 7, case 7113, 2, place Jussieu, 75251, Paris, France
35
Solis AD, Rackovsky S. Property-based sequence representations do not adequately encode local protein folding information. Proteins 2007; 67:785-8. [PMID: 17387739 DOI: 10.1002/prot.21434] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 11/11/2022]
Abstract
We examine the informatic characteristics of amino acid representations based on physical properties. We demonstrate that sequences rewritten using contracted alphabets based on physical properties do not encode local folding information well. The best four-character alphabet can only encode approximately 57% of the maximum possible amount of structural information. This result suggests that property-based representations that operate on a local length scale are not likely to be useful in homology searches and fold-recognition exercises.
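The notion of a contracted alphabet encoding only a fraction of the available structural information can be made concrete with a toy calculation. This is a sketch under stated assumptions: it uses plug-in mutual information between a residue symbol and a local structure label as the information measure, which may differ from the measure used by the authors, and both the observations and the residue grouping are invented for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(pairs):
    """Plug-in mutual information, in bits, between the two items of each pair."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(x for x, _ in pairs)
    right = Counter(y for _, y in pairs)
    # p(x, y) / (p(x) p(y)) simplifies to count * n / (count_x * count_y).
    return sum(
        (count / n) * log2(count * n / (left[x] * right[y]))
        for (x, y), count in joint.items()
    )


def encoded_fraction(pairs, grouping):
    """Fraction of the full-alphabet information retained after regrouping residues."""
    reduced = [(grouping[residue], state) for residue, state in pairs]
    return mutual_information(reduced) / mutual_information(pairs)


if __name__ == "__main__":
    # Hypothetical (residue, local structure) observations: H helix, E strand, C coil.
    observations = [
        ("A", "H"), ("L", "H"), ("E", "H"), ("A", "H"), ("L", "E"),
        ("V", "E"), ("I", "E"), ("V", "E"), ("G", "C"), ("P", "C"),
        ("G", "C"), ("E", "C"), ("I", "H"), ("P", "C"), ("A", "C"),
    ]
    # Hypothetical property-based grouping into three symbols.
    grouping = {"A": "h", "L": "h", "V": "h", "I": "h", "E": "c", "G": "s", "P": "s"}
    print(round(encoded_fraction(observations, grouping), 3))
```

Because the reduced symbol is a deterministic function of the residue, the retained fraction computed this way cannot exceed 1; how far below 1 it falls depends on how well the grouping tracks the structural preferences in the data.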
Affiliation(s)
- A D Solis
- Department of Pharmacology and Biological Chemistry, Mount Sinai School of Medicine, One Gustave L. Levy Place, New York, New York 10029, USA