1
Li C, Zhao J, Wang C, Yao Y. Protein Sequence Comparison and DNA-binding Protein Identification with Generalized PseAAC and Graphical Representation. Comb Chem High Throughput Screen 2019; 21:100-110. [PMID: 29380690 PMCID: PMC5930480 DOI: 10.2174/1386207321666180130100838] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 11/22/2017] [Revised: 01/24/2018] [Accepted: 01/26/2018] [Indexed: 11/22/2022]
Abstract
AIM AND OBJECTIVE: The rapid increase in the amount of protein sequence data available leads to an urgent need for novel computational algorithms to analyze and compare these sequences. This study is undertaken to develop an efficient computational approach for timely encoding of protein sequences and extraction of the hidden information. METHODS: Based on two physicochemical properties of amino acids, a protein primary sequence was converted into a three-letter sequence, and then a graph without loops and multiple edges and its geometric line adjacency matrix were obtained. A generalized PseAAC (pseudo amino acid composition) model was thus constructed to characterize a protein sequence numerically. RESULTS: By using the proposed mathematical descriptor of a protein sequence, similarity comparisons among β-globin proteins of 17 species and 72 spike proteins of coronaviruses were made, respectively. The resulting clusters agreed well with the established taxonomic groups. In addition, a generalized PseAAC-based SVM (support vector machine) model was developed to identify DNA-binding proteins. Experimental results showed that our method performed better than DNAbinder, DNA-Prot, iDNA-Prot and enDNA-Prot by 3.29-10.44% in terms of ACC, 0.056-0.206 in terms of MCC, and 1.45-15.76% in terms of F1M. When the benchmark dataset was expanded with negative samples, the presented approach outperformed the four previous methods with improvement in the range of 2.49-19.12% in terms of ACC, 0.05-0.32 in terms of MCC, and 3.82-33.85% in terms of F1M. CONCLUSION: These results suggested that the generalized PseAAC model was very efficient for comparison and analysis of protein sequences, and very competitive in identifying DNA-binding proteins.
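The ACC, MCC, and F1M figures quoted in this abstract are confusion-matrix metrics. The sketch below shows how such scores are usually computed, assuming F1M denotes the ordinary F1 measure. The function name `binary_metrics` and the example counts are hypothetical and are not taken from the paper.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Return (ACC, MCC, F1) from binary confusion-matrix counts."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one sample is required")
    # Accuracy: fraction of all predictions that are correct.
    acc = (tp + tn) / total
    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    # F1: harmonic mean of precision and recall.
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0
    return acc, mcc, f1


# Hypothetical counts, for illustration only.
acc, mcc, f1 = binary_metrics(tp=40, tn=45, fp=5, fn=10)
print(f"ACC={acc:.3f} MCC={mcc:.3f} F1={f1:.3f}")
```

ACC and F1 are fractions that the abstract reports as percentages, whereas MCC lies in [-1, 1], which is why its improvements are quoted as absolute differences.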
Affiliation(s)
- Chun Li
  - School of Mathematics and Statistics, Hainan Normal University, Haikou 571158, China; Department of Mathematics, Bohai University, Jinzhou 121013, China; Research Institute of Food Science, Bohai University, Jinzhou 121013, China
- Jialing Zhao
  - Department of Mathematics, Bohai University, Jinzhou 121013, China
- Changzhong Wang
  - Department of Mathematics, Bohai University, Jinzhou 121013, China
- Yuhua Yao
  - School of Mathematics and Statistics, Hainan Normal University, Haikou 571158, China
2
Huang JT, Wang T, Huang SR, Li X. Prediction of protein folding rates from simplified secondary structure alphabet. J Theor Biol 2015; 383:1-6. [PMID: 26247139 DOI: 10.1016/j.jtbi.2015.07.024] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 12/30/2014] [Revised: 06/20/2015] [Accepted: 07/23/2015] [Indexed: 10/23/2022]
Abstract
Protein folding is a very complicated and highly cooperative dynamic process. However, the folding kinetics is likely to depend more on a few key structural features. Here we find that secondary structures can determine folding rates of only large, multi-state folding proteins and fail to predict those of small, two-state proteins. The importance of secondary structures for protein folding is ordered as: extended β strand > α helix > bend > turn > undefined secondary structure > 3₁₀ helix > isolated β strand > π helix. Only the first three secondary structures, extended β strand, α helix and bend, can achieve a good correlation with folding rates. This suggests that the rate-limiting step of protein folding would depend upon the formation of regular secondary structures and the buckling of the chain. The reduced secondary structure alphabet provides a simplified description for machine learning applications in protein design.
Affiliation(s)
- Jitao T Huang
  - Department of Chemistry and National Laboratory of Elemento-Organic Chemistry, Nankai University, Tianjin 300071, China
- Titi Wang
  - Department of Chemistry and National Laboratory of Elemento-Organic Chemistry, Nankai University, Tianjin 300071, China
- Shanran R Huang
  - Department of Chemistry and National Laboratory of Elemento-Organic Chemistry, Nankai University, Tianjin 300071, China
- Xin Li
  - Department of Chemistry and National Laboratory of Elemento-Organic Chemistry, Nankai University, Tianjin 300071, China
3
Huang JT, Wang T, Huang SR, Li X. Reduced alphabet for protein folding prediction. Proteins 2015; 83:631-9. [PMID: 25641420 DOI: 10.1002/prot.24762] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 09/15/2014] [Revised: 11/07/2014] [Accepted: 12/21/2014] [Indexed: 01/17/2023]
Abstract
What are the key building blocks that would have been needed to construct complex protein folds? This is an important issue for understanding the protein folding mechanism and guiding de novo protein design. Twenty naturally occurring amino acids and eight secondary structures constitute a 28-letter alphabet that determines folding kinetics and mechanism. Here we predict folding kinetic rates of proteins from many reduced alphabets. We find that a reduced alphabet of 10 letters achieves good correlation with folding rates, close to the one achieved by the full 28-letter alphabet. Many other reduced alphabets are not significantly correlated to folding rates. The finding suggests that not all amino acids and secondary structures are equally important for protein folding. The foldable sequence of a protein could be designed using at least 10 folding units, which can either promote or inhibit protein folding. Reducing alphabet cardinality without losing key folding kinetic information opens the door to potentially faster machine learning and data mining applications in protein structure prediction, sequence alignment and protein design.
Affiliation(s)
- Jitao T Huang
  - Department of Chemistry and National Laboratory of Elemento-Organic Chemistry, Nankai University, Tianjin, 300071, People's Republic of China
4
Xu SC, Li Z, Zhang SP, Hu JL. Primary structure similarity analysis of proteins sequences by a new graphical representation. SAR QSAR Environ Res 2014; 25:791-803. [PMID: 25242152 DOI: 10.1080/1062936x.2014.955055] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 06/03/2023]
Abstract
A new graphical description of the primary structure of protein sequences is introduced. First, a three-dimensional space discrete point set of a protein sequence is created based on the three main physicochemical properties of the amino acids. Secondly, a continuous cubic B-spline curve interpolating the amino acid points is constructed to represent the shape of the protein sequence. Then the geometric properties (curvature and torsion) of the continuous curve are extracted for the purpose of analyzing the similarity between protein sequences. Finally, an improved Canberra distance comparison is introduced for the similarity analysis of protein sequences with different lengths. Experimental results show that our method is effective for the similarity comparison of protein sequences.
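For orientation, the standard Canberra distance between two equal-length feature vectors can be sketched as follows. This is the textbook form only. The abstract does not describe the "improved" variant for sequences of different lengths, so it is not reproduced here, and the function name `canberra` is an illustrative choice.

```python
def canberra(x, y):
    """Standard Canberra distance: sum of |a - b| / (|a| + |b|) over coordinates."""
    if len(x) != len(y):
        # Assumption: the textbook form requires equal-length vectors.
        raise ValueError("vectors must have the same length")
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom > 0:  # a 0/0 term is conventionally treated as 0
            total += abs(a - b) / denom
    return total


# Hypothetical curvature-like feature vectors, for illustration only.
print(canberra([0.2, 0.5, 0.1], [0.3, 0.4, 0.1]))
```

Each term is bounded by 1, so the distance is sensitive to relative rather than absolute differences, which suits features whose magnitudes vary widely.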
Affiliation(s)
- S C Xu
  - College of Science, Zhejiang Sci-Tech University, Hangzhou, China
5
DV-curve representation of protein sequences and its application. Comput Math Methods Med 2014; 2014:203871. [PMID: 24899916 PMCID: PMC4034481 DOI: 10.1155/2014/203871] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 01/07/2014] [Revised: 03/10/2014] [Accepted: 04/03/2014] [Indexed: 11/17/2022]
Abstract
Based on the detailed hydrophobic-hydrophilic (HP) model of amino acids, we propose a dual-vector curve (DV-curve) representation of protein sequences, which uses two vectors to represent each letter of a protein sequence. This graphical representation not only avoids degeneracy, but also has good visualization no matter how long the sequences are, and can reflect the length of a protein sequence. Then we transform the 2D graphical representation into a numerical characterization that can facilitate quantitative comparison of protein sequences. The utility of this approach is illustrated by two examples: one is a similarity/dissimilarity comparison among different ND6 protein sequences based on their DV-curve figures; the other is a phylogenetic analysis among coronaviruses based on their spike proteins.
6
Arnold Emerson I, Gothandam KM. Residue centrality in alpha helical polytopic transmembrane protein structures. J Theor Biol 2012; 309:78-87. [PMID: 22721996 DOI: 10.1016/j.jtbi.2012.06.002] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 12/23/2011] [Revised: 04/16/2012] [Accepted: 06/04/2012] [Indexed: 10/28/2022]
Abstract
Transmembrane proteins serve as receptors, transporters or as enzymes. They mediate a broad range of fundamental cellular activities including signal transduction, cell trafficking and photosynthesis. In this study, we analyzed the significance of central residues in the polytopic transmembrane proteins. Each protein is represented as an undirected graph, where residues represent nodes and inter-residue interactions as the edges. Residue centrality was calculated by removing the nodes and its corresponding edges from the protein contact network. Results revealed that 80% of the predicted central residues had normalized conservation values below the mean since they were slowly evolving conserved sites. We also found that 56% of amino acids were interacting with the ligand molecules and metal ions. Predicted central residues in the polytopic transmembrane proteins were found to account for 84% of binding and active site amino acids. From mutation sensitivity analysis, it was observed that 89% of central residues had deleterious mutations whose probabilities were greater than their mean value. Interestingly, we find that z-score values of each amino acid positively correlate with the conservation scores and also with the degrees of each node. Results show that 87% of central residues are hub residues.
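One common way to score a residue by node removal is to measure how much the network's average shortest path length changes when that node and its edges are deleted. The sketch below illustrates this idea on a toy contact graph. It assumes an unweighted, undirected adjacency dictionary and averages over reachable pairs only. The function names and the toy graph are hypothetical, and the paper's exact scoring and z-score normalization are not given in the abstract.

```python
from collections import deque


def avg_path_length(adj, skip=None):
    """Mean shortest-path length over reachable ordered pairs, ignoring node `skip`."""
    total = 0
    pairs = 0
    for src in adj:
        if src == skip:
            continue
        # Breadth-first search from src on the graph with `skip` removed.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v != skip and v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


def removal_impact(adj):
    """Change in average path length caused by deleting each node in turn."""
    base = avg_path_length(adj)
    return {node: avg_path_length(adj, skip=node) - base for node in adj}


# Toy contact network: hub residue "h" touches every residue on a four-residue ring.
toy = {
    "h": ["a", "b", "c", "d"],
    "a": ["h", "b", "d"],
    "b": ["h", "a", "c"],
    "c": ["h", "b", "d"],
    "d": ["h", "c", "a"],
}
impact = removal_impact(toy)
print(max(impact, key=impact.get))
```

In the toy graph, removing the hub lengthens paths between the remaining residues the most, so it receives the largest score. This matches the abstract's observation that central residues tend to be high-degree hubs.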
Affiliation(s)
- I Arnold Emerson
  - School of Bio Sciences and Technology, VIT University, Vellore-632014, Tamil Nadu, India
7
Randić M, Zupan J, Balaban AT, Vikić-Topić D, Plavšić D. Graphical Representation of Proteins. Chem Rev 2010; 111:790-862. [PMID: 20939561 DOI: 10.1021/cr800198j] [Citation(s) in RCA: 93] [Impact Index Per Article: 6.6] [Indexed: 11/30/2022]
Affiliation(s)
- Milan Randić
  - National Institute of Chemistry, P.O. Box 3430, 1001 Ljubljana, Slovenia; NMR Center, Ruđer Bošković Institute, P.O. Box 180, HR-10002 Zagreb, Croatia; and Texas A&M University at Galveston, Galveston, Texas 77553
- Jure Zupan
  - National Institute of Chemistry, P.O. Box 3430, 1001 Ljubljana, Slovenia; NMR Center, Ruđer Bošković Institute, P.O. Box 180, HR-10002 Zagreb, Croatia; and Texas A&M University at Galveston, Galveston, Texas 77553
- Alexandru T. Balaban
  - National Institute of Chemistry, P.O. Box 3430, 1001 Ljubljana, Slovenia; NMR Center, Ruđer Bošković Institute, P.O. Box 180, HR-10002 Zagreb, Croatia; and Texas A&M University at Galveston, Galveston, Texas 77553
- Dražen Vikić-Topić
  - National Institute of Chemistry, P.O. Box 3430, 1001 Ljubljana, Slovenia; NMR Center, Ruđer Bošković Institute, P.O. Box 180, HR-10002 Zagreb, Croatia; and Texas A&M University at Galveston, Galveston, Texas 77553
- Dejan Plavšić
  - National Institute of Chemistry, P.O. Box 3430, 1001 Ljubljana, Slovenia; NMR Center, Ruđer Bošković Institute, P.O. Box 180, HR-10002 Zagreb, Croatia; and Texas A&M University at Galveston, Galveston, Texas 77553
8
Randić M, Vracko M, Novic M, Plavsić D. Spectral representation of reduced protein models. SAR QSAR Environ Res 2009; 20:415-427. [PMID: 19916107 DOI: 10.1080/10629360903278685] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Indexed: 05/28/2023]
Abstract
We consider a spectrum-like two-dimensional graphical representation of proteins based on a reduced protein model in which 20 amino acids are grouped into five classes. This particular grouping of amino acids was suggested by Riddle and co-workers in 1997. The graphical representation is based on depicting sequentially the amino acids on five horizontal lines at equal separations. One-letter codes, B, O, U, X and Y, to which numerical values 1 to 5 have been assigned, are suggested as labels for the fictional amino acids that represent all the amino acids within each group. The approach is illustrated on ND6 proteins of eight species having from 168 to 175 amino acids. While visual inspection of the novel spectral graphical representations of proteins may reveal local similarities and dissimilarities of protein sequences, arithmetic manipulations of spectra offer an elegant route to graphic visualization of the degree of similarity for selected pairs of proteins.
Affiliation(s)
- M Randić
  - National Institute of Chemistry, Ljubljana, Hajdrihova 19, Slovenia
9
Li C, Xing L, Wang X. 2-D graphical representation of protein sequences and its application to coronavirus phylogeny. BMB Rep 2008; 41:217-22. [DOI: 10.5483/bmbrep.2008.41.3.217] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.6] [Indexed: 11/20/2022]
10
Yang JY, Yu ZG, Anh V. Correlations between designability and various structural characteristics of protein lattice models. J Chem Phys 2007; 126:195101. [PMID: 17523837 DOI: 10.1063/1.2737042] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 11/14/2022]
Abstract
Using six kinds of lattice types (4 x 4, 5 x 5, and 6 x 6 square lattices; 3 x 3 x 3 cubic lattice; and 2+3+4+3+2 and 4+5+6+5+4 triangular lattices), three different size alphabets (HP, HNUP, and 20 letters), and two energy functions, the designability of protein structures is calculated based on random samplings of structures and common biased sampling (CBS) of protein sequence space. Then three quantities stability (average energy gap), foldability, and partnum of the structure, which are defined to elucidate the designability, are calculated. The authors find that whatever the type of lattice, alphabet size, and energy function used, there will be an emergence of highly designable (preferred) structure. For all cases considered, the local interactions reduce degeneracy and make the designability higher. The designability is sensitive to the lattice type, alphabet size, energy function, and sampling method of the sequence space. Compared with the random sampling method, both the CBS and the Metropolis Monte Carlo sampling methods make the designability higher. The correlation coefficients between the designability, stability, and foldability are mostly larger than 0.5, which demonstrate that they have strong correlation relationship. But the correlation relationship between the designability and the partnum is not so strong because the partnum is independent of the energy. The results are useful in practical use of the designability principle, such as to predict the protein tertiary structure.
Affiliation(s)
- Jian-Yi Yang
  - School of Mathematics and Computing Science, Xiangtan University, Hunan 411105, China
11
Yu ZG, Anh VV, Lau KS, Zhou LQ. Clustering of protein structures using hydrophobic free energy and solvent accessibility of proteins. Phys Rev E Stat Nonlin Soft Matter Phys 2006; 73:031920. [PMID: 16605571 DOI: 10.1103/physreve.73.031920] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Received: 11/22/2005] [Revised: 01/18/2006] [Indexed: 05/08/2023]
Abstract
The hydrophobic free energy and solvent accessibility of amino acids are used to study the relationship between the primary structure and structural classification of large proteins. A measure representation and a Z curve representation of protein sequences are proposed. Fractal analysis of the measure and Z curve representations of proteins and multifractal analysis of their hydrophobic free energy and solvent accessibility sequences indicate that the protein sequences possess correlations and multifractal scaling. The parameters from the fractal and multifractal analyses on these sequences are used to construct some parameter spaces. Each protein is represented by a point in these spaces. A method is proposed to distinguish and cluster proteins from the alpha, beta, alpha + beta, and alpha/beta structural classes in these parameter spaces. Fisher's linear discriminant algorithm is used to give a quantitative assessment of our clustering on the selected proteins. Numerical results indicate that the discriminant accuracies are satisfactory. In particular, they reach 94.12% and 88.89% in separating proteins from {alpha, alpha + beta, alpha/beta} proteins in a three-dimensional space.
Affiliation(s)
- Z G Yu
  - Program in Statistics and Operations Research, Queensland University of Technology, GPO Box 2434, Brisbane, Queensland 4001, Australia
12
Yu ZG, Anh V, Lau KS. Chaos game representation of protein sequences based on the detailed HP model and their multifractal and correlation analyses. J Theor Biol 2004; 226:341-8. [PMID: 14643648 DOI: 10.1016/j.jtbi.2003.09.009] [Citation(s) in RCA: 106] [Impact Index Per Article: 5.3] [Indexed: 11/26/2022]
Abstract
Similar to the chaos game representation (CGR) of DNA sequences proposed by Jeffrey (Nucleic Acid Res. 18 (1990) 2163), a new CGR of protein sequences based on the detailed HP model is proposed. Multifractal and correlation analyses of the measures based on the CGR of protein sequences from complete genomes are performed. The Dq spectra of all organisms studied are multifractal-like and sufficiently smooth for the Cq curves to be meaningful. The Cq curves of bacteria resemble a classical phase transition at a critical point. The correlation distance of the difference between the measure based on the CGR of protein sequences and its fractal background is also proposed to construct a more precise phylogenetic tree of bacteria.
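A chaos game representation places each sequence element at the midpoint between the previous point and the corner of the unit square assigned to that element's class. The sketch below applies this rule to a four-class grouping of amino acids (non-polar, negative polar, uncharged polar, positive polar). The grouping shown, the corner assignment, and the names `CLASS_OF`, `CORNER`, and `cgr_points` are assumptions made for illustration and should be checked against the paper before reuse.

```python
# Assumed four-class grouping of the 20 amino acids (one-letter codes).
CLASS_OF = {}
for letters, label in (
    ("AILMFPWV", "nonpolar"),
    ("DE", "negative"),
    ("NCQGSTY", "uncharged"),
    ("RHK", "positive"),
):
    for aa in letters:
        CLASS_OF[aa] = label

# Arbitrary assignment of classes to corners of the unit square.
CORNER = {
    "nonpolar": (0.0, 0.0),
    "negative": (0.0, 1.0),
    "uncharged": (1.0, 1.0),
    "positive": (1.0, 0.0),
}


def cgr_points(sequence):
    """Return the CGR trajectory of a protein sequence as a list of (x, y) points."""
    x, y = 0.5, 0.5  # start at the centre of the square
    points = []
    for aa in sequence:
        cx, cy = CORNER[CLASS_OF[aa]]
        # Move halfway from the current point toward the class corner.
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


# Hypothetical short peptide, for illustration only.
for point in cgr_points("MKDE"):
    print(point)
```

Counting how many points fall in each cell of a fine grid over the square gives the kind of measure on which multifractal and correlation analyses can then be run.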
Affiliation(s)
- Zu-Guo Yu
  - Program in Statistics and Operations Research, Queensland University of Technology, G.P.O. Box 2434, QLD 4001, Brisbane, Australia
13
Abstract
Selecting a protein sequence that corresponds to a specific three-dimensional protein structure is known as the protein design problem. One principal bottleneck in solving this problem is our lack of knowledge of precise atomic interactions. Using a simple model of amino acid interactions, we determine three crucial factors that are important for solving the protein design problem. Among these factors is the protein alphabet-a set of sequence elements that encodes protein structure. Our model predicts that alphabet size is independent of protein length, suggesting the possibility of designing a protein of arbitrary length with the natural protein alphabet. We also find that protein alphabet size is governed by protein structural properties and the energetic properties of the protein alphabet units. We discover that the usage of average types of amino acid in proteins is less than expected if amino acids were chosen randomly with naturally occurring frequencies. We propose three possible scenarios that account for amino acid underusage in proteins. These scenarios suggest the possibility that amino acids themselves might not constitute the alphabet of natural proteins.
Affiliation(s)
- Nikolay V Dokholyan
  - Department of Biochemistry and Biophysics, School of Medicine, University of North Carolina at Chapel Hill, 27599, USA
14
Xue B, Wang J, Wang W. Collapse of homopolymer chains with two fixed terminals. J Chem Phys 2003. [DOI: 10.1063/1.1605732] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.3] [Indexed: 11/14/2022]
15
Wang J, Wang W. Folding transition of model protein chains characterized by partition function zeros. J Chem Phys 2003. [DOI: 10.1063/1.1536162] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.3] [Indexed: 11/14/2022]
16
Dickinson E, Krishna S. Aggregation in a concentrated model protein system: a mesoscopic simulation of β-casein self-assembly. Food Hydrocoll 2001. [DOI: 10.1016/s0268-005x(00)00057-6] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.5] [Indexed: 11/28/2022]