1
|
Wang B, Lei X, Tian W, Perez-Rathke A, Tseng YY, Liang J. Structure-based pathogenicity relationship identifier for predicting effects of single missense variants and discovery of higher-order cancer susceptibility clusters of mutations. Brief Bioinform 2023; 24:bbad206. [PMID: 37332013 PMCID: PMC10359089 DOI: 10.1093/bib/bbad206] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2023] [Revised: 04/19/2023] [Accepted: 05/13/2023] [Indexed: 06/20/2023]
Abstract
We report the structure-based pathogenicity relationship identifier (SPRI), a novel computational tool for accurate evaluation of pathological effects of missense single mutations and prediction of higher-order spatially organized units of mutational clusters. SPRI can effectively extract properties determining pathogenicity encoded in protein structures, and can identify deleterious missense mutations of germ line origin associated with Mendelian diseases, as well as mutations of somatic origin associated with cancer drivers. It compares favorably to other methods in predicting deleterious mutations. Furthermore, SPRI can discover spatially organized pathogenic higher-order spatial clusters (patHOS) of deleterious mutations, including those of low recurrence, and can be used for discovery of candidate cancer driver genes and driver mutations. We further demonstrate that SPRI can take advantage of AlphaFold2 predicted structures and can be deployed for saturation mutation analysis of the whole human proteome.
Affiliation(s)
- Boshen Wang
  - Center for Bioinformatics and Quantitative Biology, Richard and Loan Hill Department of Biomedical Engineering, University of Illinois at Chicago, W103 Suite, 820 S Wood St, 60612 IL, USA
- Xue Lei
  - Center for Bioinformatics and Quantitative Biology, Richard and Loan Hill Department of Biomedical Engineering, University of Illinois at Chicago, W103 Suite, 820 S Wood St, 60612 IL, USA
- Wei Tian
  - Center for Bioinformatics and Quantitative Biology, Richard and Loan Hill Department of Biomedical Engineering, University of Illinois at Chicago, W103 Suite, 820 S Wood St, 60612 IL, USA
- Alan Perez-Rathke
  - Center for Bioinformatics and Quantitative Biology, Richard and Loan Hill Department of Biomedical Engineering, University of Illinois at Chicago, W103 Suite, 820 S Wood St, 60612 IL, USA
- Yan-Yuan Tseng
  - Center for Molecular Medicine and Genetics, Biochemistry and Molecular Biology Department, School of Medicine, Wayne State University, 540 E. Canfield Avenue, 48201 MI, USA
- Jie Liang
  - Center for Bioinformatics and Quantitative Biology, Richard and Loan Hill Department of Biomedical Engineering, University of Illinois at Chicago, W103 Suite, 820 S Wood St, 60612 IL, USA
|
2
|
Ye B, Wang B, Liang J. Predicting Pathology of Missense Mutations through Protein-Specific Evolutionary Pattern. Annual International Conference of the IEEE Engineering in Medicine and Biology Society 2023; 2023:1-4. [PMID: 38082878 PMCID: PMC10984725 DOI: 10.1109/embc40787.2023.10339993] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/18/2023]
Abstract
Missense mutations, which are single-base-pair genetic alterations resulting in a different amino acid, are among the most commonly occurring variants in exon regions of the human genome and may lead to disease. To assess the effects of missense mutations, it is therefore essential to investigate the evolutionary history of the protein under selection pressure. In this study, we employ a continuous-time Markov model to investigate the evolutionary patterns in protein sequences and a Bayesian Markov chain Monte Carlo method to estimate the substitution rates for the protein of interest, from which we obtain scoring matrices. Specifically, we examined the evolutionary patterns of protein sequences containing missense mutations using a species tree to define the phylogeny of the protein of interest. We thoroughly studied the evolutionary pattern of human muscle glycogen phosphorylase containing 127 known missense mutations, and identified characteristic evolutionary patterns in 63 proteins with 2,238 missense mutations, including both deleterious and neutral effects. Our results show that the estimated protein-specific evolutionary pattern-based scoring matrices (PSM) lead to higher sensitivity in detecting the pathological effects of missense mutations than the general evolutionary pattern-based scoring matrix BLOSUM62 (BL62). Incorporating PSM further improves the performance of SPRI, a recently released structure-based model for evaluating missense mutations.
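For readers unfamiliar with the model class named in this abstract, the standard relationship behind a continuous-time Markov model of substitution can be written as follows. This is generic textbook notation assumed here for illustration (a rate matrix $Q$ with entries $q_{ij}$ over the 20 amino acids, and a transition-probability matrix $P(t)$ over elapsed time $t$); the abstract does not state the authors' exact parametrization or how their scoring matrices are derived from the estimated rates.

```latex
P(t) = e^{Qt} = \sum_{k=0}^{\infty} \frac{(Qt)^{k}}{k!},
\qquad
q_{ij} \ge 0 \ \text{for } i \ne j,
\qquad
\sum_{j} q_{ij} = 0 \ \text{for every } i .
```

Here entry $P_{ij}(t)$ is the probability that residue type $i$ has been replaced by type $j$ after time $t$; estimating $Q$ for a single protein family is what makes the resulting pattern protein-specific.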
|
3
|
Liang Y, Yang S, Zheng L, Wang H, Zhou J, Huang S, Yang L, Zuo Y. Research progress of reduced amino acid alphabets in protein analysis and prediction. Comput Struct Biotechnol J 2022; 20:3503-3510. [PMID: 35860409 PMCID: PMC9284397 DOI: 10.1016/j.csbj.2022.07.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/14/2022] [Revised: 06/30/2022] [Accepted: 07/01/2022] [Indexed: 11/29/2022]
Abstract
A comprehensive summary of the literature on the reduced amino acid alphabets. A systematic review of the development history of reduced amino acid alphabets. Rich application cases of amino acid reduction alphabets are described in the article. A detailed analysis of the properties and uses of the reduced amino acid alphabets.
Proteins are the executors of cellular physiological activities, and accurate elucidation of their structure and function is crucial for the refined mapping of proteins. As a feature engineering method, the reduction of amino acid composition is not only an important method for protein structure and function analysis, but also opens a broad horizon for the complex field of machine learning. Representing sequences with fewer amino acid types greatly reduces the dimensionality and noise of traditional feature engineering, and provides more interpretable predictive models in which machine learning can capture key features. In this paper, we systematically review studies of strategies and methods for reduced amino acid (RAA) alphabets, and summarize their main applications in protein sequence alignment, functional classification, and prediction of structural properties. Finally, we give a comprehensive analysis of 672 RAA alphabets from 74 reduction methods.
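The idea of representing a sequence with fewer amino acid types can be sketched in a few lines of Python. The six-group scheme below is a hypothetical grouping by rough physicochemical class, chosen only for illustration; it is not claimed to be one of the 672 alphabets analyzed in the review, and the names `GROUPS`, `reduce_sequence`, and `reduced_composition` are assumptions of this sketch.

```python
from collections import Counter

# Hypothetical six-group reduction (illustrative assumption, not taken
# from the review). Key = representative letter, value = member residues.
GROUPS = {
    "A": "AVLIMC",  # aliphatic / sulfur-containing
    "F": "FWYH",    # aromatic
    "S": "STNQ",    # polar, uncharged
    "K": "KR",      # positively charged
    "D": "DE",      # negatively charged
    "G": "GP",      # conformationally special
}

# Invert the table: each of the 20 residues maps to its group letter.
RESIDUE_TO_GROUP = {
    residue: letter for letter, members in GROUPS.items() for residue in members
}


def reduce_sequence(seq):
    """Rewrite a protein sequence in the reduced alphabet ('X' if unknown)."""
    return "".join(RESIDUE_TO_GROUP.get(r, "X") for r in seq.upper())


def reduced_composition(seq):
    """Fraction of each group letter: a 6-dimensional feature vector."""
    reduced = reduce_sequence(seq)
    counts = Counter(reduced)
    return {letter: counts[letter] / len(reduced) for letter in GROUPS}


if __name__ == "__main__":
    print(reduce_sequence("MKVLE"))
    print(reduced_composition("MKVLE"))
```

With this grouping a composition feature vector shrinks from 20 dimensions to 6, which is the kind of dimensionality reduction the abstract refers to.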
Affiliation(s)
- Yuchao Liang
  - State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Siqi Yang
  - State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Lei Zheng
  - State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Hao Wang
  - State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Jian Zhou
  - State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Shenghui Huang
  - State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Lei Yang
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
  - Corresponding authors.
- Yongchun Zuo
  - State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
  - Corresponding authors.
|
4
|
Lei X, Wang B, Perez-Rathke A, Tian W, Chou CY, Tseng YY, Liang J. Predicting Oncogenic Missense Mutations. ... IEEE-EMBS International Conference on Biomedical and Health Informatics 2019; 2019:10.1109/bhi.2019.8834553. [PMID: 35261984 PMCID: PMC8901086 DOI: 10.1109/bhi.2019.8834553] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
With the rapid progress of cancer genome studies, many missense mutations in populations of somatic cells of different cancer types and at different stages have been identified. However, it is challenging to understand the implications of these cancer-related variants. We have developed a computational method that integrates structural, topographical, and evolutionary information for assessments of biochemical effects and the extent of deleteriousness of the cancer-related variants. We have mapped somatic missense mutations from the Catalogue of Somatic Mutations In Cancer (COSMIC) to 3D structures in the Protein Data Bank (PDB). Our results show that a large portion of these missense mutations is located on protein surface pockets, which often serve as a structural and functional unit of cancer variants. We provide detailed analysis of several examples and assessment on the importance of these variants, including prediction of previously unreported cancer-variants, along with independent evidence from the literature. Furthermore, we show our predictions can inform on the functional roles and the mechanism of predicted cancer variants.
Affiliation(s)
- Xue Lei
  - Bioinformatics Program, Department of Bioengineering, University of Illinois at Chicago, Chicago, IL 60607, USA
- Boshen Wang
  - Bioinformatics Program, Department of Bioengineering, University of Illinois at Chicago, Chicago, IL 60607, USA
- Alan Perez-Rathke
  - Bioinformatics Program, Department of Bioengineering, University of Illinois at Chicago, Chicago, IL 60607, USA
- Wei Tian
  - Bioinformatics Program, Department of Bioengineering, University of Illinois at Chicago, Chicago, IL 60607, USA
- Chia-Yi Chou
  - Molecular Medicine and Genetics, Wayne State University, Detroit, MI 48201, USA
- Yan Yuan Tseng
  - Molecular Medicine and Genetics, Wayne State University, Detroit, MI 48201, USA
- Jie Liang
  - Bioinformatics Program, Department of Bioengineering, University of Illinois at Chicago, Chicago, IL 60607, USA
|
5
|
Wang C, Wei Y, Zhang H, Kong L, Sun S, Zheng WM, Bu D. Constructing effective energy functions for protein structure prediction through broadening attraction-basin and reverse Monte Carlo sampling. BMC Bioinformatics 2019; 20:135. [PMID: 30925867 PMCID: PMC6439974 DOI: 10.1186/s12859-019-2652-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2022]
Abstract
BACKGROUND The ab initio approaches to protein structure prediction usually employ the Monte Carlo technique to search the structural conformation that has the lowest energy. However, the widely-used energy functions are usually ineffective for conformation search. How to construct an effective energy function remains a challenging task. RESULTS Here, we present a framework to construct effective energy functions for protein structure prediction. Unlike existing energy functions only requiring the native structure to be the lowest one, we attempt to maximize the attraction-basin where the native structure lies in the energy landscape. The underlying rationale is that each energy function determines a specific energy landscape together with a native attraction-basin, and the larger the attraction-basin is, the more likely for the Monte Carlo search procedure to find the native structure. Following this rationale, we constructed effective energy functions as follows: i) To explore the native attraction-basin determined by a certain energy function, we performed reverse Monte Carlo sampling starting from the native structure, identifying the structural conformations on the edge of attraction-basin. ii) To broaden the native attraction-basin, we smoothened the edge points of attraction-basin through tuning weights of energy terms, thus acquiring an improved energy function. Our framework alternates the broadening attraction-basin and reverse sampling steps (thus called BARS) until the native attraction-basin is sufficiently large. We present extensive experimental results to show that using the BARS framework, the constructed energy functions could greatly facilitate protein structure prediction in improving the quality of predicted structures and speeding up conformation search. 
CONCLUSION Using the BARS framework, we constructed effective energy functions for protein structure prediction, which could improve the quality of predicted structures and speed up conformation search as well.
Affiliation(s)
- Chao Wang
  - Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, 6, Kexueyuan South Road, Zhongguancun, Beijing, 100190 China
  - University of Chinese Academy of Sciences, 19-1, Yuquan Road, Shijingshan, Beijing, 100049 China
- Yi Wei
  - Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, 6, Kexueyuan South Road, Zhongguancun, Beijing, 100190 China
  - University of Chinese Academy of Sciences, 19-1, Yuquan Road, Shijingshan, Beijing, 100049 China
- Haicang Zhang
  - Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, 6, Kexueyuan South Road, Zhongguancun, Beijing, 100190 China
  - University of Chinese Academy of Sciences, 19-1, Yuquan Road, Shijingshan, Beijing, 100049 China
- Lupeng Kong
  - Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, 6, Kexueyuan South Road, Zhongguancun, Beijing, 100190 China
  - University of Chinese Academy of Sciences, 19-1, Yuquan Road, Shijingshan, Beijing, 100049 China
- Shiwei Sun
  - Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, 6, Kexueyuan South Road, Zhongguancun, Beijing, 100190 China
  - University of Chinese Academy of Sciences, 19-1, Yuquan Road, Shijingshan, Beijing, 100049 China
- Wei-Mou Zheng
  - University of Chinese Academy of Sciences, 19-1, Yuquan Road, Shijingshan, Beijing, 100049 China
  - Institute of Theoretical Physics, Chinese Academy of Sciences, 55, Zhongguancun East Road, Beijing, 100190 China
- Dongbo Bu
  - Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, 6, Kexueyuan South Road, Zhongguancun, Beijing, 100190 China
  - University of Chinese Academy of Sciences, 19-1, Yuquan Road, Shijingshan, Beijing, 100049 China
|
6
|
Elhefnawy W, Chen L, Han Y, Li Y. ICOSA: A Distance-Dependent, Orientation-Specific Coarse-Grained Contact Potential for Protein Structure Modeling. J Mol Biol 2015; 427:2562-2576. [DOI: 10.1016/j.jmb.2015.05.022] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 04/10/2015] [Accepted: 05/21/2015] [Indexed: 11/16/2022]
|
7
|
Tang K, Wong SWK, Liu JS, Zhang J, Liang J. Conformational sampling and structure prediction of multiple interacting loops in soluble and β-barrel membrane proteins using multi-loop distance-guided chain-growth Monte Carlo method. Bioinformatics 2015; 31:2646-52. [PMID: 25861965 DOI: 10.1093/bioinformatics/btv198] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 11/11/2014] [Accepted: 04/03/2015] [Indexed: 11/13/2022]
Abstract
MOTIVATION Loops in proteins are often involved in biochemical functions. Their irregularity and flexibility make experimental structure determination and computational modeling challenging. Most current loop modeling methods focus on modeling single loops. In protein structure prediction, multiple loops often need to be modeled simultaneously. As interactions among loops in spatial proximity can be rather complex, sampling the conformations of multiple interacting loops is a challenging task. RESULTS In this study, we report a new method called multi-loop Distance-guided Sequential chain-Growth Monte Carlo (M-DiSGro) for prediction of the conformations of multiple interacting loops in proteins. Our method achieves an average RMSD of 1.93 Å for lowest energy conformations of 36 pairs of interacting protein loops with the total length ranging from 12 to 24 residues. We further constructed a data set containing proteins with 2, 3 and 4 interacting loops. For the most challenging target proteins with four loops, the average RMSD of the lowest energy conformations is 2.35 Å. Our method is also tested for predicting multiple loops in β-barrel membrane proteins. For outer-membrane protein G, the lowest energy conformation has a RMSD of 2.62 Å for the three extracellular interacting loops with a total length of 34 residues (12, 12 and 10 residues in each loop). AVAILABILITY AND IMPLEMENTATION The software is freely available at: tanto.bioe.uic.edu/m-DiSGro. CONTACT jinfeng@stat.fsu.edu or jliang@uic.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
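The RMSD figures quoted in this abstract measure how far a predicted loop lies from the native one. A minimal sketch of the underlying calculation is given below, assuming two equal-length lists of already-superimposed `(x, y, z)` coordinates in Å. The function name `rmsd` and the toy coordinates are hypothetical; which atoms are included and how structures are aligned before comparison are choices the abstract does not specify.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched coordinate lists.

    Each argument is a sequence of (x, y, z) tuples in the same atom order.
    No superposition is performed here; the inputs are assumed to share a
    common frame already.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    if not coords_a:
        raise ValueError("coordinate lists must not be empty")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


if __name__ == "__main__":
    # Toy coordinates (hypothetical, not from the paper).
    native = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    model = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
    print(round(rmsd(native, model), 3))
```

Averaging this quantity over the lowest-energy conformation of each target would give a summary figure of the same kind as the per-benchmark averages reported above.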
Affiliation(s)
- Ke Tang
  - Richard and Loan Hill Department of Bioengineering, University of Illinois at Chicago, Chicago, IL
- Samuel W K Wong
  - Department of Statistics, University of Florida, Gainesville, FL
- Jun S Liu
  - Department of Statistics, Harvard University, Science Center, Cambridge, MA
- Jinfeng Zhang
  - Department of Statistics, Florida State University, Tallahassee, FL, USA
- Jie Liang
  - Richard and Loan Hill Department of Bioengineering, University of Illinois at Chicago, Chicago, IL
|
8
|
Liang J, Cao Y, Gürsoy G, Naveed H, Terebus A, Zhao J. Multiscale Modeling of Cellular Epigenetic States: Stochasticity in Molecular Networks, Chromatin Folding in Cell Nuclei, and Tissue Pattern Formation of Cells. Crit Rev Biomed Eng 2015; 43:323-46. [PMID: 27480462 PMCID: PMC4976639 DOI: 10.1615/critrevbiomedeng.2016016559] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/13/2022]
Abstract
Genome sequences provide the overall genetic blueprint of cells, but cells possessing the same genome can exhibit diverse phenotypes. A multitude of mechanisms control cellular epigenetic states and dictate the behavior of cells. Among these, networks of interacting molecules, often under stochastic control, can have different landscapes of cellular states depending on the specific wiring of molecular components and the physiological conditions. In addition, chromosome folding in three-dimensional space provides another important control mechanism for selective activation and repression of gene expression. Fully differentiated cells with different properties grow, divide, and interact through mechanical forces and communicate through signal transduction, resulting in the formation of complex tissue patterns. Studying these multi-scale phenomena quantitatively and identifying opportunities for improving human health requires the development of theoretical models, algorithms, and computational tools. Here we review recent progress made in these important directions.
Affiliation(s)
- Jie Liang
  - Program in Bioinformatics, Department of Bioengineering, University of Illinois at Chicago, IL, 60612, USA
- Youfang Cao
  - Theoretical Biology and Biophysics (T-6) and Center for Nonlinear Studies (CNLS), Los Alamos National Laboratory, Los Alamos, NM, 87545, USA
- Gamze Gürsoy
  - Program in Bioinformatics, Department of Bioengineering, University of Illinois at Chicago, IL, 60612, USA
- Hammad Naveed
  - Toyota Technological Institute at Chicago, 6045 S. Kenwood Ave. Chicago, Illinois 60637, USA
- Anna Terebus
  - Program in Bioinformatics, Department of Bioengineering, University of Illinois at Chicago, IL, 60612, USA
- Jieling Zhao
  - Program in Bioinformatics, Department of Bioengineering, University of Illinois at Chicago, IL, 60612, USA
|
9
|
On simplified global nonlinear function for fitness landscape: a case study of inverse protein folding. PLoS One 2014; 9:e104403. [PMID: 25110986 PMCID: PMC4128808 DOI: 10.1371/journal.pone.0104403] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 12/26/2013] [Accepted: 07/14/2014] [Indexed: 11/19/2022]
Abstract
The construction of fitness landscapes has broad implications for understanding molecular evolution, cellular epigenetic states, and protein structures. We studied the problem of constructing a fitness landscape for inverse protein folding, or protein design, with the aim of generating amino acid sequences that would fold into an a priori determined structural fold, which would enable engineering of novel or enhanced biochemistry. For this task, an effective fitness function should allow identification of correct sequences that would fold into the desired structure. In this study, we showed that a nonlinear fitness function for protein design can be constructed using a rectangular kernel with a basis set of proteins and decoys chosen a priori. The full landscape for a large number of protein folds can be captured using only 480 native proteins and 3,200 non-protein decoys via a finite Newton method. A blind test of a simplified version of the fitness function for sequence design was carried out to discriminate simultaneously 428 native sequences not homologous to any training proteins from 11 million challenging protein-like decoys. This simplified function correctly classified 408 native sequences (20 misclassifications, 95% correct rate), outperforming several other statistical linear scoring functions and an optimized linear function. Our results further suggested that for the task of global sequence design of 428 selected proteins, the search space of protein shape and sequence can be effectively parametrized with a basis set of just about 3,680 carefully chosen proteins and decoys, and we showed in addition that the overall landscape is not overly sensitive to the specific choice of this set. Our results can be generalized to construct other types of fitness landscapes.
|
10
|
Tang K, Zhang J, Liang J. Fast protein loop sampling and structure prediction using distance-guided sequential chain-growth Monte Carlo method. PLoS Comput Biol 2014; 10:e1003539. [PMID: 24763317 PMCID: PMC3998890 DOI: 10.1371/journal.pcbi.1003539] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.5] [Received: 08/29/2013] [Accepted: 02/01/2014] [Indexed: 11/18/2022]
Abstract
Loops in proteins are flexible regions connecting regular secondary structures. They are often involved in protein functions through interacting with other molecules. The irregularity and flexibility of loops make their structures difficult to determine experimentally and challenging to model computationally. Conformation sampling and energy evaluation are the two key components in loop modeling. We have developed a new method for loop conformation sampling and prediction based on a chain growth sequential Monte Carlo sampling strategy, called Distance-guided Sequential chain-Growth Monte Carlo (DISGRO). With an energy function designed specifically for loops, our method can efficiently generate high quality loop conformations with low energy that are enriched with near-native loop structures. The average minimum global backbone RMSD for 1,000 conformations of 12-residue loops is 1.53 Å, with a lowest energy RMSD of 2.99 Å and an average ensemble RMSD of 5.23 Å. A novel geometric criterion is applied to speed up calculations. The computational cost of generating 1,000 conformations for each of the x loops in a benchmark dataset is only about 10 CPU minutes for 12-residue loops, compared to ca. 180 CPU minutes using the FALCm method. Test results on benchmark datasets show that DISGRO performs comparably to or better than previous successful methods, while requiring far less computing time. DISGRO is especially effective in modeling longer loops (10-17 residues).
Affiliation(s)
- Ke Tang
  - Department of Bioengineering, University of Illinois at Chicago, Chicago, Illinois, United States of America
- Jinfeng Zhang
  - Department of Statistics, Florida State University, Tallahassee, Florida, United States of America
  - * E-mail: (JZ); (JL)
- Jie Liang
  - Department of Bioengineering, University of Illinois at Chicago, Chicago, Illinois, United States of America
  - * E-mail: (JZ); (JL)
|
11
|
Computational structure analysis of biomacromolecule complexes by interface geometry. Comput Biol Chem 2013; 47:16-23. [DOI: 10.1016/j.compbiolchem.2013.06.003] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 11/18/2012] [Revised: 06/11/2013] [Accepted: 06/12/2013] [Indexed: 11/18/2022]
|
12
|
Andreani J, Faure G, Guerois R. InterEvScore: a novel coarse-grained interface scoring function using a multi-body statistical potential coupled to evolution. Bioinformatics 2013; 29:1742-9. [PMID: 23652426 DOI: 10.1093/bioinformatics/btt260] [Citation(s) in RCA: 69] [Impact Index Per Article: 6.3] [Indexed: 12/22/2022]
Abstract
MOTIVATION Structural prediction of protein interactions currently remains a challenging but fundamental goal. In particular, progress in scoring functions is critical for the efficient discrimination of near-native interfaces among large sets of decoys. Many functions have been developed using knowledge-based potentials, but few make use of multi-body interactions or evolutionary information, although multi-residue interactions are crucial for protein-protein binding and protein interfaces undergo significant selection pressure to maintain their interactions. RESULTS This article presents InterEvScore, a novel scoring function using a coarse-grained statistical potential including two- and three-body interactions, which provides each residue with the opportunity to contribute in its most favorable local structural environment. Combination of this potential with evolutionary information considerably improves scoring results on the 54 test cases from the widely used protein docking benchmark for which evolutionary information can be collected. We analyze how our way to include evolutionary information gradually increases the discriminative power of InterEvScore. Comparison with several previously published scoring functions (ZDOCK, ZRANK and SPIDER) shows the significant progress brought by InterEvScore. AVAILABILITY http://biodev.cea.fr/interevol/interevscore CONTACT guerois@cea.fr SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jessica Andreani
  - CEA, iBiTecS, Service de Bioenergetique Biologie Structurale et Mecanismes SB2SM, Laboratoire de Biologie Structurale et Radiobiologie LBSR, F-91191 Gif sur Yvette, France
|
13
|
Zhou W, Yan H. Alpha shape and Delaunay triangulation in studies of protein-related interactions. Brief Bioinform 2012. [PMID: 23193202 DOI: 10.1093/bib/bbs077] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.8] [Indexed: 01/09/2023]
Abstract
In recent years, more 3D protein structures have become available, which has made the analysis of large molecular structures much easier. There is a strong demand for geometric models for the study of protein-related interactions. Alpha shape and Delaunay triangulation are powerful tools to represent protein structures and have advantages in characterizing the surface curvature and atom contacts. This review presents state-of-the-art applications of alpha shape and Delaunay triangulation in the studies on protein-DNA, protein-protein, protein-ligand interactions and protein structure analysis.
Affiliation(s)
- Weiqiang Zhou
  - Department of Electronic Engineering, City University of Hong Kong, Tat Chee Avenue 83, Hong Kong.
|
14
|
Basu S, Bhattacharyya D, Banerjee R. Self-complementarity within proteins: bridging the gap between binding and folding. Biophys J 2012; 102:2605-14. [PMID: 22713576 DOI: 10.1016/j.bpj.2012.04.029] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.9] [Received: 12/08/2011] [Revised: 03/30/2012] [Accepted: 04/17/2012] [Indexed: 01/09/2023]
Abstract
Complementarity, in terms of both shape and electrostatic potential, has been quantitatively estimated at protein-protein interfaces and used extensively to predict the specific geometry of association between interacting proteins. In this work, we attempted to place both binding and folding on a common conceptual platform based on complementarity. To that end, we estimated (for the first time to our knowledge) electrostatic complementarity (Em) for residues buried within proteins. Em measures the correlation of surface electrostatic potential at protein interiors. The results show fairly uniform and significant values for all amino acids. Interestingly, hydrophobic side chains also attain appreciable complementarity primarily due to the trajectory of the main chain. Previous work from our laboratory characterized the surface (or shape) complementarity (Sm) of interior residues, and both of these measures have now been combined to derive two scoring functions to identify the native fold amid a set of decoys. These scoring functions are somewhat similar to functions that discriminate among multiple solutions in a protein-protein docking exercise. The performances of both of these functions on state-of-the-art databases were comparable if not better than most currently available scoring functions. Thus, analogously to interfacial residues of protein chains associated (docked) with specific geometry, amino acids found in the native interior have to satisfy fairly stringent constraints in terms of both Sm and Em. The functions were also found to be useful for correctly identifying the same fold for two sequences with low sequence identity. Finally, inspired by the Ramachandran plot, we developed a plot of Sm versus Em (referred to as the complementarity plot) that identifies residues with suboptimal packing and electrostatics which appear to be correlated to coordinate errors.
Affiliation(s)
- Sankar Basu
- Crystallography and Molecular Biology Division, Saha Institute of Nuclear Physics, Kolkata, India
|
15
|
Zhou W, Yan H. Prediction of DNA-binding protein based on statistical and geometric features and support vector machines. Proteome Sci 2011; 9 Suppl 1:S1. [PMID: 22166014 PMCID: PMC3289070 DOI: 10.1186/1477-5956-9-s1-s1] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.3] [Indexed: 11/21/2022] Open
Abstract
Background Previous studies on protein-DNA interaction mostly focused on the bound structure of DNA-binding proteins, but few paid enough attention to the unbound structures. As more new proteins are discovered, it is useful and imperative to develop algorithms for the functional prediction of unbound proteins. In our work, we apply an alpha shape model to represent the surface structure of the protein-DNA complex and extract useful statistical and geometric features, and use structural alignment and support vector machines for the prediction of unbound DNA-binding proteins. Results The performance of our method is evaluated by discriminating a set of 104 DNA-binding proteins from 401 non-DNA-binding proteins. In the same test, the proposed method outperforms the other method using conditional probability. The results achieved by our proposed method (precision, 83.33%; accuracy, 86.53%; MCC, 0.5368) demonstrate its good performance. Conclusions In this study we develop an effective method for the prediction of protein-DNA interactions based on statistical and geometric features and support vector machines. Our results show that interface surface features play an important role in protein-DNA interaction. Our technique is able to predict unbound DNA-binding proteins and to discriminate DNA-binding proteins from proteins that bind with other molecules.
Affiliation(s)
- Weiqiang Zhou
- Department of Electronic Engineering, City University of Hong Kong, Kowloon, Hong Kong.
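The Zhou and Yan abstract above reports precision, accuracy, and the Matthews correlation coefficient (MCC) for a 104-versus-401 classification test. As a reminder of how these three metrics relate to a confusion matrix, here is a minimal sketch. The function name and the example counts are hypothetical; the counts were chosen only so that positives sum to 104 and negatives to 401, and they are not the paper's actual confusion matrix, which the abstract does not give.

```python
import math


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Return precision, accuracy and MCC from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    # MCC denominator is zero when any row or column of the matrix is empty;
    # by convention the coefficient is then reported as 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "accuracy": accuracy, "mcc": mcc}


# Hypothetical counts (NOT from the paper): 104 positives, 401 negatives.
m = classification_metrics(tp=50, fp=10, tn=391, fn=54)
print(m)
```

On an imbalanced set like this one, accuracy can stay high even when many positives are missed, which is one reason MCC is usually reported alongside it.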
|
16
|
Mahdavi S, Mohades A, Salehzadeh Yazdi A, Jahandideh S, Masoudi-Nejad A. Computational analysis of RNA-protein interaction interfaces via the Voronoi diagram. J Theor Biol 2011; 293:55-64. [PMID: 22004995 DOI: 10.1016/j.jtbi.2011.09.033] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 03/01/2011] [Revised: 07/09/2011] [Accepted: 09/30/2011] [Indexed: 10/16/2022]
Abstract
Cellular functions are mediated by various biological processes including biomolecular interactions, such as protein-protein, DNA-protein and RNA-protein interactions, of which RNA-protein interactions are indispensable for many biological processes like cell development and viral replication. Unlike the protein-protein and protein-DNA interactions, accurate mechanisms and structures of the RNA-protein complexes are not fully understood. A large amount of theoretical evidence has shown during the past several years that computational geometry is the first step in understanding the binding profiles and plays a key role in the study of intricate biological structures, interactions and complexes. In this paper, the RNA-protein interaction interface surface is computed via the weighted Voronoi diagram of atoms. Using two filter operations provides a natural definition for interface atoms, as in classic methods. Unbounded parts of Voronoi facets that are far from the complex are trimmed using a modified convex hull of atom centers. This algorithm is applied to a database of different RNA-protein complexes extracted from the Protein Data Bank (PDB). Afterward, the features of the interfaces are computed and compared with the classic method. The results show high correlation coefficients between interface size in the Voronoi model and the classical model based on solvent accessibility, as well as high accuracy and precision in comparison to the classical model.
Affiliation(s)
- Sedigheh Mahdavi
- Laboratory of Systems Biology and Bioinformatics, Institute of Biochemistry and Biophysics and COE in Biomathematics, University of Tehran, Tehran, Iran
|
17
|
Tian Y, Deutsch C, Krishnamoorthy B. Scoring function to predict solubility mutagenesis. Algorithms Mol Biol 2010; 5:33. [PMID: 20929563 PMCID: PMC2958853 DOI: 10.1186/1748-7188-5-33] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.3] [Received: 06/01/2010] [Accepted: 10/07/2010] [Indexed: 11/16/2022] Open
Abstract
BACKGROUND Mutagenesis is commonly used to engineer proteins with desirable properties not present in the wild type (WT) protein, such as increased or decreased stability, reactivity, or solubility. Experimentalists often have to choose a small subset of mutations from a large number of candidates to obtain the desired change, and computational techniques are invaluable to make the choices. While several such methods have been proposed to predict stability and reactivity mutagenesis, solubility has not received much attention. RESULTS We use concepts from computational geometry to define a three body scoring function that predicts the change in protein solubility due to mutations. The scoring function captures both sequence and structure information. By exploring the literature, we have assembled a substantial database of 137 single- and multiple-point solubility mutations. Our database is the largest such collection with structural information known so far. We optimize the scoring function using linear programming (LP) methods to derive its weights based on training. Starting with default values of 1, we find weights in the range [0,2] so that predictions of increase or decrease in solubility are optimized. We compare the LP method to the standard machine learning techniques of support vector machines (SVM) and the Lasso. Using statistics for leave-one-out (LOO), 10-fold, and 3-fold cross validations (CV) for training and prediction, we demonstrate that the LP method performs the best overall. For the LOOCV, the LP method has an overall accuracy of 81%. AVAILABILITY Executables of programs, tables of weights, and datasets of mutants are available from the following web page: http://www.wsu.edu/~kbala/OptSolMut.html.
Affiliation(s)
- Ye Tian
- Department of Mathematics, Washington State University, Pullman, WA 99164, USA
- Bala Krishnamoorthy
- Department of Mathematics, Washington State University, Pullman, WA 99164, USA
|
18
|
Zhou W, Yan H. A discriminatory function for prediction of protein-DNA interactions based on alpha shape modeling. Bioinformatics 2010; 26:2541-8. [DOI: 10.1093/bioinformatics/btq478] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.4] [Indexed: 11/14/2022] Open
|
19
|
Arab S, Sadeghi M, Eslahchi C, Pezeshk H, Sheari A. A pairwise residue contact area-based mean force potential for discrimination of native protein structure. BMC Bioinformatics 2010; 11:16. [PMID: 20064218 PMCID: PMC2821318 DOI: 10.1186/1471-2105-11-16] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Received: 06/13/2009] [Accepted: 01/09/2010] [Indexed: 11/21/2022] Open
Abstract
Background An energy function that can distinguish a correct protein fold from incorrect ones is very important for protein structure prediction and protein folding. Knowledge-based mean force potentials are certainly the most popular type of interaction function for protein threading. They are derived from statistical analyses of interacting groups in experimentally determined protein structures. These potentials are developed at the atom or the amino acid level. Based on orientation-dependent contact area, a new type of knowledge-based mean force potential has been developed. Results We developed a new approach to calculate a knowledge-based potential of mean force, using pairwise residue contact area. To test the performance of our approach, we applied it to several decoy sets to measure its ability to discriminate native structures from decoys. This potential was able to distinguish native structures from the decoys in most cases. Further, the calculated Z-scores were quite high for all protein datasets. Conclusions This knowledge-based potential of mean force can be used in protein structure prediction, fold recognition, comparative modelling and molecular recognition. The program is available at http://www.bioinf.cs.ipm.ac.ir/softwares/surfield
Affiliation(s)
- Shahriar Arab
- Department of Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran
|
20
|
Albou LP, Schwarz B, Poch O, Wurtz JM, Moras D. Defining and characterizing protein surface using alpha shapes. Proteins 2009; 76:1-12. [DOI: 10.1002/prot.22301] [Citation(s) in RCA: 60] [Impact Index Per Article: 4.0] [Indexed: 11/08/2022]
|
21
|
Lin M, Lu HM, Chen R, Liang J. Generating properly weighted ensemble of conformations of proteins from sparse or indirect distance constraints. J Chem Phys 2009; 129:094101. [PMID: 19044859 DOI: 10.1063/1.2968605] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Indexed: 11/15/2022] Open
Abstract
Inferring three-dimensional structural information of biomacromolecules such as proteins from limited experimental data is an important and challenging task. Nuclear Overhauser effect measurements based on nuclear magnetic resonance, disulfide linking, and electron paramagnetic resonance labeling studies can all provide useful partial distance constraints characteristic of the conformations of proteins. In this study, we describe a general approach for reconstructing conformations of biomolecules that are consistent with given distance constraints. Such constraints can be in the form of upper bounds and lower bounds of distances between residue pairs, contact maps based on specific contact distance cutoff values, or indirect distance constraints such as experimental phi-value measurements. Our approach is based on the framework of the sequential Monte Carlo method, a chain growth-based method. We have developed a novel growth potential function to guide the generation of conformations that satisfy given distance constraints. This potential function incorporates not only the distance information of the current residue during growth but also the distance information of future residues by introducing global distance upper bounds between residue pairs and the placement of reference points. To obtain protein conformations from indirect distance constraints in the form of experimental phi-values, we first generate properly weighted contact maps satisfying phi-value constraints and then generate conformations from these contact maps. We show that our approach can faithfully generate conformations that satisfy the given constraints, which approach the native structures when distance constraints for all residue pairs are given.
Affiliation(s)
- Ming Lin
- Department of Information and Decision Science, University of Illinois at Chicago, 845 S. Morgan St., Chicago, Illinois 60607, USA
|
22
|
Ngan SC, Hung LH, Liu T, Samudrala R. Scoring functions for de novo protein structure prediction revisited. Methods Mol Biol 2008; 413:243-81. [PMID: 18075169 DOI: 10.1007/978-1-59745-574-9_10] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 04/08/2023]
Abstract
De novo protein structure prediction methods attempt to predict tertiary structures from sequences based on general principles that govern protein folding energetics and/or statistical tendencies of conformational features that native structures acquire, without the use of explicit templates. A general paradigm for de novo prediction involves sampling the conformational space, guided by scoring functions and other sequence-dependent biases, such that a large set of candidate ("decoy") structures are generated, and then selecting native-like conformations from those decoys using scoring functions as well as conformer clustering. High-resolution refinement is sometimes used as a final step to fine-tune native-like structures. There are two major classes of scoring functions. Physics-based functions are based on mathematical models describing aspects of the known physics of molecular interaction. Knowledge-based functions are formed with statistical models capturing aspects of the properties of native protein conformations. We discuss the implementation and use of some of the scoring functions from these two classes for de novo structure prediction in this chapter.
Affiliation(s)
- Shing-Chung Ngan
- Department of Microbiology, University of Washington School of Medicine, Seattle, WA, USA
|
23
|
Ouyang Z, Liang J. Predicting protein folding rates from geometric contact and amino acid sequence. Protein Sci 2008; 17:1256-63. [PMID: 18434498 DOI: 10.1110/ps.034660.108] [Citation(s) in RCA: 76] [Impact Index Per Article: 4.8] [Indexed: 10/22/2022]
Abstract
Protein folding speeds are known to vary over more than eight orders of magnitude. Plaxco, Simons, and Baker (see References) first showed a correlation of folding speed with the topology of the native protein. That and subsequent studies showed that, if the native structure of a protein is known, its folding speed can be predicted reasonably well through a correlation with the "localness" of the contacts in the protein. In the present work, we develop a related measure, the geometric contact number, Nα, which is the number of nonlocal contacts that are well-packed, by a Voronoi criterion. We find, first, that in 80 proteins, the largest such database of proteins yet studied, Nα is a consistently excellent predictor of folding speeds of both two-state fast folders and more complex multistate folders. Second, we show that folding rates can also be predicted from amino acid sequences directly, without the need to know the native topology or other structural properties.
Affiliation(s)
- Zheng Ouyang
- Department of Bioengineering, University of Illinois at Chicago, Chicago, Illinois 60607, USA
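The geometric contact number in the Ouyang and Liang abstract above counts nonlocal contacts that pass a Voronoi packing criterion. The sketch below illustrates only the "count nonlocal contacts" idea and deliberately swaps the Voronoi criterion for a plain distance cutoff, so it is not the paper's measure. The function name, the 6.5 Å cutoff, and the minimum sequence separation of 4 are all assumptions made for illustration.

```python
import math


def nonlocal_contact_count(coords, cutoff=6.5, min_separation=4):
    """Count residue pairs (i, j) that are at least `min_separation`
    apart in sequence and within `cutoff` of each other in space.

    `coords` is a list of (x, y, z) tuples, one representative point
    per residue (for example the C-alpha atom). A distance cutoff
    stands in here for the Voronoi-based criterion.
    """
    n = len(coords)
    count = 0
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                count += 1
    return count


# Toy hairpin: two antiparallel strands of three residues each.
hairpin = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
           (7.6, 5.0, 0.0), (3.8, 5.0, 0.0), (0.0, 5.0, 0.0)]
print(nonlocal_contact_count(hairpin))
```

A Voronoi or alpha-shape criterion would additionally reject pairs that are close in space but separated by intervening atoms, which a distance cutoff cannot detect.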
|
24
|
Zhang J, Chen R, Liang J. Potential function of simplified protein models for discriminating native proteins from decoys: combining contact interaction and local sequence-dependent geometry. Conf Proc IEEE Eng Med Biol Soc 2007; 2004:2976-9. [PMID: 17270903 DOI: 10.1109/iembs.2004.1403844] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 05/13/2023]
Abstract
An effective potential function is critical for protein structure prediction and folding simulation. For simplified models of proteins where coordinates of only Cα atoms need to be specified, an accurate potential function is important. Such a simplified model is essential for efficient search of conformational space. In this work, we present a formulation of a potential function for simplified representations of protein structures. It is based on the combination of descriptors derived from residue-residue contact and sequence-dependent local geometry. The optimal weight coefficients for contact and local geometry are obtained through optimization by maximizing margins among native and decoy structures. The latter are generated by chain growth and by gapless threading. The performance of the potential function in a blind test of discriminating native protein structures from decoys is evaluated using several benchmark decoy sets. This potential function has comparable or better performance than several residue-based potential functions that require in addition coordinates of side chain centers or coordinates of all side chain atoms.
|
25
|
Hu C, Li X, Liang J. Optimal nonlinear scoring function for global fitness landscape of protein design. Conf Proc IEEE Eng Med Biol Soc 2007; 2004:2828-31. [PMID: 17270866 DOI: 10.1109/iembs.2004.1403807] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/13/2023]
Abstract
Protein design aims to identify sequences compatible with a given protein fold but incompatible with any alternative folds. To select the correct sequences and to guide the search process, a design scoring function is critically important. It is also important that a design scoring function can characterize the global fitness landscape of many proteins simultaneously. We describe how finding optimal design scoring functions can be understood from two geometric viewpoints, and propose a formulation using a mixture of Gaussian kernel functions. We give results of distinguishing native sequences for a major portion of representative protein structures from a large number of alternative decoy sequences. We succeeded in deriving a nonlinear scoring function that perfectly discriminates a set of 440 representative native proteins of known protein structures from 14 million sequence decoys. We show that no linear scoring function can have perfect discrimination. In an independent blind test using 194 unrelated proteins, our scoring function misclassifies only 13 native proteins. This compares favorably with 37 or 51 misclassifications when optimal linear functions reported in the literature are used.
Affiliation(s)
- Changyu Hu
- Dept. of Bioeng., Illinois Univ., Chicago, IL, USA
|
26
|
Abstract
Protein–DNA interactions are vital for many processes in living cells, especially transcriptional regulation and DNA modification. To further our understanding of these important processes on the microscopic level, it is necessary that theoretical models describe the macromolecular interaction energetics accurately. While several methods have been proposed, there has not been a careful comparison of how well the different methods are able to predict biologically important quantities such as the correct DNA binding sequence, total binding free energy and free energy changes caused by DNA mutation. In addition to carrying out the comparison, we present two important theoretical models developed initially in protein folding that have not yet been tried on protein–DNA interactions. In the process, we find that the results of these knowledge-based potentials show a strong dependence on the interaction distance and the derivation method. Finally, we present a knowledge-based potential that gives comparable or superior results to the best of the other methods, including the molecular mechanics force field AMBER99.
Affiliation(s)
- Jason E Donald
- Department of Chemistry and Chemical Biology, Harvard University, 12 Oxford St. Cambridge, MA 02138, USA.
|
27
|
Zhang J, Lin M, Chen R, Liang J, Liu JS. Monte Carlo sampling of near-native structures of proteins with applications. Proteins 2006; 66:61-8. [PMID: 17039507 DOI: 10.1002/prot.21203] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.1] [Indexed: 12/29/2022]
Abstract
Since a protein's dynamic fluctuation inside cells affects the protein's biological properties, we present a novel method to study the ensemble of near-native structures (NNS) of proteins, namely, the conformations that are very similar to the experimentally determined native structure. We show that this method enables us to (i) quantify the difficulty of predicting a protein's structure, (ii) choose appropriate simplified representations of protein structures, and (iii) assess the effectiveness of knowledge-based potential functions. We found that well-designed simple representations of protein structures are likely as accurate as those more complex ones for certain potential functions. We also found that the widely used contact potential functions stabilize NNS poorly, whereas potential functions incorporating local structure information significantly increase the stability of NNS.
Affiliation(s)
- Jinfeng Zhang
- Department of Statistics, Harvard University, Cambridge, Massachusetts, USA
|
28
|
Gu X. A simple statistical method for estimating type-II (cluster-specific) functional divergence of protein sequences. Mol Biol Evol 2006; 23:1937-45. [PMID: 16864604 DOI: 10.1093/molbev/msl056] [Citation(s) in RCA: 166] [Impact Index Per Article: 9.2] [Indexed: 11/14/2022] Open
Abstract
Predicting functional amino acid residues in silico is important for comparative genomics. In this paper, we focus on the issue of how to statistically identify cluster-specific amino acid residues that are related to the functional divergence after gene duplication. We approach this problem using a framework based on site-specific shift of amino acid property (type-II functional divergence), as opposed to site-specific shift of evolutionary rate (type-I functional divergence). An efficient statistical procedure is implemented to facilitate the development of phylogenomic database for cluster-specific residues of large-scale protein families. Our method has the following features: 1) statistical testing of the type-II functional divergence and 2) the site-specific Bayesian profile to measure how amino acid residues contribute to type-II (cluster-specific) functional divergence. Consequently, one may obtain the posterior probability for "functional" cluster-specific residues. Case studies are presented and indicate that radical cluster-specific residues are responsible for most of inferred type-II functional divergence, whereas conserved cluster-specific residues appear less than even those imperfect radical cluster-specific residues to this type of functional divergence.
Affiliation(s)
- Xun Gu
- Department of Genetics, Development and Cell Biology, Center for Bioinformatics and Biological Statistics, Iowa State University, USA.
|
29
|
Zhang J, Chen R, Liang J. Empirical potential function for simplified protein models: combining contact and local sequence-structure descriptors. Proteins 2006; 63:949-60. [PMID: 16477624 DOI: 10.1002/prot.20809] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.7] [Indexed: 11/07/2022]
Abstract
An effective potential function is critical for protein structure prediction and folding simulation. Simplified protein models such as those requiring only Cα or backbone atoms are attractive because they enable efficient search of the conformational space. We show that residue-specific reduced discrete-state models can represent the backbone conformations of proteins with small RMSD values. However, no potential functions exist that are designed for such simplified protein models. In this study, we develop optimal potential functions by combining contact interaction descriptors and local sequence-structure descriptors. The form of the potential function is a weighted linear sum of all descriptors, and the optimal weight coefficients are obtained through optimization using both native and decoy structures. The performance of the potential function in a test of discriminating native protein structures from decoys is evaluated using several benchmark decoy sets. Our potential function requiring only backbone atoms or Cα atoms has comparable or better performance than several residue-based potential functions that require additional coordinates of side-chain centers or coordinates of all side-chain atoms. By reducing the residue alphabets down to size 10 for contact descriptors, the performance of the potential function can be further improved. Our results also suggest that local sequence-structure correlation may play an important role in reducing the entropic cost of protein folding.
Affiliation(s)
- Jinfeng Zhang
- Department of Bioengineering, University of Illinois, Chicago, Illinois, USA
|
30
|
Jackups R, Liang J. Interstrand Pairing Patterns in β-Barrel Membrane Proteins: The Positive-outside Rule, Aromatic Rescue, and Strand Registration Prediction. J Mol Biol 2005; 354:979-93. [PMID: 16277990 DOI: 10.1016/j.jmb.2005.09.094] [Citation(s) in RCA: 59] [Impact Index Per Article: 3.1] [Received: 07/26/2005] [Revised: 09/23/2005] [Accepted: 09/27/2005] [Indexed: 10/25/2022]
Abstract
β-Barrel membrane proteins are found in the outer membrane of Gram-negative bacteria, mitochondria, and chloroplasts. Little is known about how residues in membrane β-barrels interact preferentially with other residues on adjacent strands. We have developed probabilistic models to quantify propensities of residues for different spatial locations and for interstrand pairwise contact interactions involving strong H-bonds, side-chain interactions, and weak H-bonds. Using the reference state of exhaustive permutation of residues within the same β-strand, the propensity values and p-values measuring statistical significance are calculated exactly by analytical formulae we have developed. Our findings show that there are characteristic preferences of residues for different membrane locations. Contrary to the "positive-inside" rule for helical membrane proteins, β-barrel membrane proteins follow a significant albeit weaker "positive-outside" rule, in that the basic residues Arg and Lys are disproportionately favored in the extracellular cap region and disfavored in the periplasmic cap region. We find that different residue pairs prefer strong backbone H-bonded interstrand pairings (e.g. Gly-aromatic) or non-H-bonded pairings (e.g. aromatic-aromatic). In addition, we find that Tyr and Phe participate in aromatic rescue by shielding Gly from polar environments. We also show that these propensities can be used to predict the registration of strand pairs, an important task for the structure prediction of β-barrel membrane proteins. Our accuracy of 44% is considerably better than random (7%). It also significantly outperforms a comparable registration prediction for soluble β-sheets under similar conditions. Our results imply several experiments that can help to elucidate the mechanisms of in vitro and in vivo folding of β-barrel membrane proteins. The propensity scales developed in this study will also be useful for computational structure prediction and for folding simulations.
Affiliation(s)
- Ronald Jackups
- Department of Bioengineering, SEO, MC-063, University of Illinois at Chicago, 851 S. Morgan Street, Room 218, Chicago, IL 60607-7052, USA
|
31
|
Tseng YY, Liang J. Estimation of amino acid residue substitution rates at local spatial regions and application in protein function inference: a Bayesian Monte Carlo approach. Mol Biol Evol 2005; 23:421-36. [PMID: 16251508 DOI: 10.1093/molbev/msj048] [Citation(s) in RCA: 53] [Impact Index Per Article: 2.8] [Indexed: 11/15/2022] Open
Abstract
The amino acid sequences of proteins provide rich information for inferring distant phylogenetic relationships and for predicting protein functions. Estimating the rate matrix of residue substitutions from amino acid sequences is also important because the rate matrix can be used to develop scoring matrices for sequence alignment. Here we use a continuous time Markov process to model the substitution rates of residues and develop a Bayesian Markov chain Monte Carlo method for rate estimation. We validate our method using simulated artificial protein sequences. Because different local regions such as binding surfaces and the protein interior core experience different selection pressures due to functional or stability constraints, we use our method to estimate the substitution rates of local regions. Our results show that the substitution rates are very different for residues in the buried core and residues on the solvent-exposed surfaces. In addition, residues on the binding surfaces also have very different substitution rates from residues in the rest of the protein. Based on these findings, we further develop a method for protein function prediction by surface matching using scoring matrices derived from estimated substitution rates for residues located on the binding surfaces. We show with examples that our method is effective in identifying functionally related proteins that have overall low sequence identity, a task known to be very challenging.
Affiliation(s)
- Yan Y Tseng
- Department of Bioengineering, Science and Engineering Offices, MC-063, University of Illinois at Chicago, USA
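The Tseng and Liang abstract above models residue substitution as a continuous time Markov process. For such a process with rate matrix Q (off-diagonal entries non-negative, each row summing to zero), the transition probabilities over time t are given by the matrix exponential P(t) = exp(Qt). The sketch below computes this for a toy two-state matrix using a Taylor series with scaling and squaring in pure Python. The function names and the toy rates are assumptions for illustration; the Bayesian MCMC estimation of Q described in the abstract is not shown.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(q, t, terms=30, squarings=10):
    """Approximate P(t) = exp(Q t) by scaling and squaring.

    Q t is divided by 2**squarings so the Taylor series converges
    quickly, then the result is squared `squarings` times.
    """
    n = len(q)
    scale = t / (2 ** squarings)
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term holds a**k / k! after this step
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


# Toy two-state rate matrix (hypothetical rates); rows sum to zero.
q = [[-0.3, 0.3],
     [0.1, -0.1]]
p = transition_matrix(q, t=2.0)
print(p)
```

Each row of P(t) sums to one, and as t grows the rows converge to the stationary distribution of the process. A residue-level model would use a 20 by 20 rate matrix in place of the toy one.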
|
32
|
Poupon A. Voronoi and Voronoi-related tessellations in studies of protein structure and interaction. Curr Opin Struct Biol 2005; 14:233-41. [PMID: 15093839 DOI: 10.1016/j.sbi.2004.03.010] [Citation(s) in RCA: 102] [Impact Index Per Article: 5.4] [Indexed: 11/25/2022]
Abstract
The three-dimensional structure of a protein can be modeled by a set of polyhedra drawn around its atoms or residues. The tessellation invented by Voronoi in 1908, and other tessellations of space derived from it, provide versatile representations of three-dimensional structures. In recent years, they have been used to investigate a series of issues relating to proteins: atom and residue volumes, packing, folding, interactions and binding.
Affiliation(s)
- Anne Poupon
- Laboratoire d'Enzymologie et Biochimie Structurales, CNRS Bat 34, 91198 Gif-sur-Yvette, France.
|
33
|
Li X, Liang J. Geometric cooperativity and anticooperativity of three-body interactions in native proteins. Proteins 2005; 60:46-65. [PMID: 15849756 DOI: 10.1002/prot.20438] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.5] [Indexed: 11/10/2022]
Abstract
Characterizing multibody interactions of hydrophobic, polar, and ionizable residues in protein is important for understanding the stability of protein structures. We introduce a geometric model for quantifying 3-body interactions in native proteins. With this model, empirical propensity values for many types of 3-body interactions can be reliably estimated from a database of native protein structures, despite the overwhelming presence of pairwise contacts. In addition, we define a nonadditive coefficient that characterizes cooperativity and anticooperativity of residue interactions in native proteins by measuring the deviation of 3-body interactions from 3 independent pairwise interactions. It compares the 3-body propensity value from what would be expected if only pairwise interactions were considered, and highlights the distinction of propensity and cooperativity of 3-body interaction. Based on the geometric model, and what can be inferred from statistical analysis of such a model, we find that hydrophobic interactions and hydrogen-bonding interactions make nonadditive contributions to protein stability, but the nonadditive nature depends on whether such interactions are located in the protein interior or on the protein surface. When located in the interior, many hydrophobic interactions such as those involving alkyl residues are anticooperative. Salt-bridge and regular hydrogen-bonding interactions, such as those involving ionizable residues and polar residues, are cooperative. When located on the protein surface, these salt-bridge and regular hydrogen-bonding interactions are anticooperative, and hydrophobic interactions involving alkyl residues become cooperative. We show with examples that incorporating 3-body interactions improves discrimination of protein native structures against decoy conformations. In addition, analysis of cooperative 3-body interaction may reveal spatial motifs that can suggest specific protein functions.
Affiliation(s)
- Xiang Li
- Department of Bioengineering, SEO, MC-063, University of Illinois at Chicago, Chicago, Illinois 60607-7052, USA
|
34
|
Zhang C, Liu S, Zhou H, Zhou Y. An accurate, residue-level, pair potential of mean force for folding and binding based on the distance-scaled, ideal-gas reference state. Protein Sci 2004; 13:400-11. [PMID: 14739325 PMCID: PMC2286718 DOI: 10.1110/ps.03348304] [Citation(s) in RCA: 116] [Impact Index Per Article: 5.8] [Indexed: 10/26/2022]
Abstract
Structure prediction on a genomic scale requires a simplified energy function that can efficiently sample the conformational space of polypeptide chains. A good energy function at minimum should discriminate native structures against decoys. Here, we show that a recently developed, residue-specific, all-atom knowledge-based potential (167 atomic types) based on distance-scaled, finite ideal-gas reference state (DFIRE-all-atom) can be substantially simplified to 20 residue types located at side-chain center of mass (DFIRE-SCM) without a significant change in its capability of structure discrimination. Using 96 standard multiple decoy sets, we show that there is only a small reduction (from 80% to 78%) in success rate of ranking native structures as the top 1. The success rate is higher than two previously developed, all-atom distance-dependent statistical pair potentials. Applied to structure selections of 21 docking decoys without modification, the DFIRE-SCM potential is 29% more successful in recognizing native complex structures than an all-atom statistical potential trained by a database of dimeric interfaces. The potential also achieves 92% accuracy in distinguishing true dimeric interfaces from artificial crystal interfaces. In addition, the DFIRE potential with the Cα positions as the interaction centers recognizes 123 native structures out of a comprehensive 125-protein TOUCHSTONE decoy set in which each protein has 24,000 decoys with only Cα positions. Furthermore, the performance by DFIRE-SCM on newly established 25 monomeric and 31 docking Rosetta-decoy sets is comparable to (or better than in the case of monomeric decoy sets) that of a recently developed, all-atom Rosetta energy function enhanced with an orientation-dependent hydrogen bonding potential.
Affiliation(s)
- Chi Zhang
- Howard Hughes Medical Institute Center for Single Molecule Biophysics, SUNY Buffalo, 124 Sherman Hall, Buffalo, NY 14214, USA
|