1
Xi B, Tao J, Liu X, Xu X, He P, Dai Q. RaaMLab: A MATLAB toolbox that generates amino acid groups and reduced amino acid modes. Biosystems 2019; 180:38-45. [PMID: 30904554 DOI: 10.1016/j.biosystems.2019.03.002] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 07/25/2018] [Revised: 12/25/2018] [Accepted: 03/06/2019] [Indexed: 01/31/2023]
Abstract
Amino acid (AA) classification and its different biophysical and chemical characteristics have been widely applied to analyze and predict the structural, functional, expression and interaction profiles of proteins and peptides. We present RaaMLab, a free and open-source MATLAB toolbox, to facilitate studies on proteins and peptides, to generate AA groups and to extract the structural and physicochemical features of reduced AAs (RedAA). This toolbox offers 4 kinds of databases, including the physicochemical properties of AAs and their groupings, 49 AA classification methods and 5 types of biophysicochemical features of RedAAs. These features can be computed easily from a user-defined alphabet size and the AA properties used for grouping. RaaMLab is open source and freely available at https://github.com/bioinfo0706/RaaMLab. This website also contains a tutorial, extensive documentation and examples.
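To make the idea of a reduced amino acid alphabet concrete, the following is a minimal Python sketch, not RaaMLab code (the toolbox itself is written in MATLAB). The six-group scheme `EXAMPLE_GROUPS` and the function names are hypothetical and chosen only for illustration; they are not claimed to match any of the 49 classification methods in the toolbox.

```python
from collections import Counter

# Hypothetical 6-group reduction, keyed by a representative residue.
# This grouping is an assumption for illustration, not a RaaMLab scheme.
EXAMPLE_GROUPS = {
    "A": "AVLIMC",  # aliphatic / hydrophobic
    "F": "FWYH",    # aromatic
    "S": "STNQ",    # polar, uncharged
    "K": "KR",      # positively charged
    "D": "DE",      # negatively charged
    "G": "GP",      # conformationally special
}


def build_lookup(groups):
    """Map every residue letter to the symbol of the group containing it."""
    lookup = {}
    for symbol, members in groups.items():
        for residue in members:
            lookup[residue] = symbol
    return lookup


def reduce_sequence(seq, groups=EXAMPLE_GROUPS, unknown="X"):
    """Rewrite a protein sequence in the reduced alphabet."""
    lookup = build_lookup(groups)
    return "".join(lookup.get(residue, unknown) for residue in seq.upper())


def group_composition(seq, groups=EXAMPLE_GROUPS):
    """Return the fraction of residues that fall into each group."""
    reduced = reduce_sequence(seq, groups)
    counts = Counter(reduced)
    total = len(reduced) or 1  # avoid division by zero for empty input
    return {symbol: counts.get(symbol, 0) / total for symbol in groups}


if __name__ == "__main__":
    print(reduce_sequence("MKVLDE"))    # AKAADD
    print(group_composition("MKVLDE"))
```

Changing the alphabet size amounts to passing a different `groups` dictionary; the composition vector is one simple example of a feature that can be derived from the reduced sequence.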
Affiliation(s)
- Baohang Xi
- College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, People's Republic of China
- Jin Tao
- College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, People's Republic of China
- Xiaoqing Liu
- College of Sciences, Hangzhou Dianzi University, Hangzhou 310018, People's Republic of China
- Xinnan Xu
- College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, People's Republic of China
- Pingan He
- College of Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, People's Republic of China
- Qi Dai
- College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, People's Republic of China
2
Yu Z, Deng Z, Wong HS, Tan L. Identifying Protein-Kinase-Specific Phosphorylation Sites Based on the Bagging–AdaBoost Ensemble Approach. IEEE Trans Nanobioscience 2010; 9:132-43. [DOI: 10.1109/tnb.2010.2043682] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.4] [Indexed: 11/10/2022]
3
Abstract
Identification and classification of G-protein coupled receptors (GPCRs) using protein sequences is an important computational challenge, given that experimental screening of thousands of ligands is an expensive proposition. There are two distinct but complementary approaches to GPCR classification: machine learning and sequence motif analysis. Machine learning methodologies typically suffer from problems of class imbalance and lack of multi-class classification. Many sequence motif methods, meanwhile, are too dependent on the similarity of the primary sequence alignments. It is desirable to have a motif discovery and application methodology that is not strongly dependent on primary sequence similarity. It should also overcome limitations of machine learning. We propose and evaluate the effectiveness of a simple methodology that uses a reduced protein functional alphabet representation, where similar functional residues have similar symbols. Regular expression motifs can then be obtained by ClustalW-based multiple sequence alignment, using an identity matrix. Since evolutionary matrices such as BLOSUM and PAM are not used, this method can be useful for any set of sequences that do not necessarily share a common ancestry. Reduced alphabet motifs can accurately classify known GPCR proteins and the results are comparable to PRINTS and PROSITE. For well-known GPCR proteins from SWISSPROT, there were no false negatives and only a few false positives. This methodology covers most currently known classes of GPCRs, even if there are very few representative sequences. It also predicts more than one class for certain sequences, thus overcoming the limitation of machine learning methods. We also annotated 695 orphan receptors, of which 121 were identified as belonging to Family A. A simple JavaScript-based web interface has been developed to predict GPCR families and subfamilies (www.insilico-consulting.com/gpcrmotif.html).
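The motif approach described in this abstract can be illustrated with a short Python sketch. It assumes the segments are already aligned, gap-free and of equal length (the alignment step itself, done with ClustalW in the paper, is not reproduced here). The four-group functional alphabet `FUNCTIONAL_GROUPS`, the lowercase group symbols and all function names are hypothetical and are not the alphabet used by the authors.

```python
import re

# Hypothetical functional alphabet; lowercase symbols are used so that
# group labels cannot be confused with one-letter residue codes.
FUNCTIONAL_GROUPS = {
    "h": "AVLIMFWC",  # hydrophobic
    "p": "STNQYGP",   # polar / small
    "b": "KRH",       # basic
    "a": "DE",        # acidic
}


def build_lookup(groups):
    """Map every residue letter to its group symbol."""
    return {res: sym for sym, members in groups.items() for res in members}


def reduce_sequence(seq, lookup, unknown="x"):
    """Rewrite a sequence in the reduced functional alphabet."""
    return "".join(lookup.get(res, unknown) for res in seq.upper())


def motif_from_aligned(reduced_segments):
    """Build a regular expression from aligned, equal-length reduced segments.

    A column with a single symbol is emitted literally; a column with
    several symbols becomes a character class such as [hp].
    """
    if len({len(seg) for seg in reduced_segments}) != 1:
        raise ValueError("segments must be aligned to the same length")
    parts = []
    for column in zip(*reduced_segments):
        symbols = sorted(set(column))
        if len(symbols) == 1:
            parts.append(re.escape(symbols[0]))
        else:
            parts.append("[" + "".join(re.escape(s) for s in symbols) + "]")
    return "".join(parts)


if __name__ == "__main__":
    lookup = build_lookup(FUNCTIONAL_GROUPS)
    family = [reduce_sequence(s, lookup) for s in ("DRYLA", "ERYVS")]
    motif = motif_from_aligned(family)
    query = reduce_sequence("MKDRYIT", lookup)
    print(motif, bool(re.search(motif, query)))
```

Because residues are compared only through their group symbols, two segments with low primary sequence identity can still yield a compact motif, which is the property the abstract relies on.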
Affiliation(s)
- Rajeev Gangal
- Insilico Consulting, 402, Citi Centre, 39/2, Erandwane, Karve Road, Pune, Maharashtra, India
4
Fitzgerald JE, Jha AK, Colubri A, Sosnick TR, Freed KF. Reduced C(beta) statistical potentials can outperform all-atom potentials in decoy identification. Protein Sci 2007; 16:2123-39. [PMID: 17893359 PMCID: PMC2204143 DOI: 10.1110/ps.072939707] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.6] [Indexed: 10/22/2022]
Abstract
We developed a series of statistical potentials to recognize the native protein from decoys, particularly when using only a reduced representation in which each side chain is treated as a single C(beta) atom. Beginning with a highly successful all-atom statistical potential, the Discrete Optimized Protein Energy function (DOPE), we considered the implications of including additional information in the all-atom statistical potential and subsequently reducing to the C(beta) representation. One of the potentials includes interaction energies conditional on backbone geometries. A second potential separates sequence local from sequence nonlocal interactions and introduces a novel reference state for the sequence local interactions. The resultant potentials perform better than the original DOPE statistical potential in decoy identification. Moreover, even upon passing to a reduced C(beta) representation, these statistical potentials outscore the original (all-atom) DOPE potential in identifying native states for sets of decoys. Interestingly, the backbone-dependent statistical potential is shown to retain nearly all of the information content of the all-atom representation in the C(beta) representation. In addition, these new statistical potentials are combined with existing potentials to model hydrogen bonding, torsion energies, and solvation energies to produce even better performing potentials. The ability of the C(beta) statistical potentials to accurately represent protein interactions bodes well for computational efficiency in protein folding calculations using reduced backbone representations, while the extensions to DOPE illustrate general principles for improving knowledge-based potentials.
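As background for the term "statistical potential", the sketch below shows the generic inverse-Boltzmann construction in Python: an energy per distance bin is taken as the negative log ratio of observed to reference probabilities. This is an illustration under stated assumptions only. It does not reproduce the DOPE reference state or the backbone-dependent potentials of the paper, and the function names, the pseudocount and the toy counts are hypothetical.

```python
import math


def statistical_potential(observed_counts, reference_counts, pseudocount=1.0):
    """Return one energy per bin, in units of kT: E = -ln(P_obs / P_ref).

    observed_counts: how often a residue pair is seen in each distance bin
                     in native structures.
    reference_counts: counts expected under the chosen reference state.
    pseudocount: added to every bin so that empty bins stay finite.
    """
    n_bins = len(observed_counts)
    if n_bins != len(reference_counts):
        raise ValueError("both histograms must have the same number of bins")
    obs_total = sum(observed_counts) + pseudocount * n_bins
    ref_total = sum(reference_counts) + pseudocount * n_bins
    energies = []
    for obs, ref in zip(observed_counts, reference_counts):
        p_obs = (obs + pseudocount) / obs_total
        p_ref = (ref + pseudocount) / ref_total
        energies.append(-math.log(p_obs / p_ref))
    return energies


def score_structure(bin_indices, energies):
    """Sum the bin energies of all pair distances found in a structure."""
    return sum(energies[i] for i in bin_indices)


if __name__ == "__main__":
    # Toy histograms: bin 0 is enriched in native structures, bin 1 depleted.
    energies = statistical_potential([30, 10], [20, 20], pseudocount=0.0)
    native_like = score_structure([0, 0, 0], energies)
    decoy_like = score_structure([1, 1, 0], energies)
    print(native_like < decoy_like)
```

In decoy identification, the structure with the lowest total score is taken as the native candidate; in a C(beta) representation the distances would be measured between C(beta) atoms rather than between all atoms.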
Affiliation(s)
- James E Fitzgerald
- Department of Physics, The University of Chicago, Chicago, Illinois 60637, USA
5
Melo F, Marti-Renom MA. Accuracy of sequence alignment and fold assessment using reduced amino acid alphabets. Proteins 2006; 63:986-95. [PMID: 16506243 DOI: 10.1002/prot.20881] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.3] [Indexed: 11/08/2022]
Abstract
Reduced or simplified amino acid alphabets group the 20 naturally occurring amino acids into a smaller number of representative protein residues. To date, several reduced amino acid alphabets have been proposed, which have been derived and optimized by a variety of methods. The resulting reduced amino acid alphabets have been applied to pattern recognition, generation of consensus sequences from multiple alignments, protein folding, and protein structure prediction. In this work, amino acid substitution matrices and statistical potentials were derived based on several reduced amino acid alphabets and their performance assessed in a large benchmark for the tasks of sequence alignment and fold assessment of protein structure models, using as a reference frame the standard alphabet of 20 amino acids. The results showed that a large reduction in the total number of residue types does not necessarily translate into a significant loss of discriminative power for sequence alignment and fold assessment. Therefore, some definitions of a few residue types are able to encode most of the relevant sequence/structure information that is present in the 20 standard amino acids. Based on these results, we suggest that the use of reduced amino acid alphabets may help increase the accuracy of current substitution matrices and statistical potentials for predicting the protein structure of remote homologs.
Affiliation(s)
- Francisco Melo
- Departamento de Genética Molecular y Microbiología, Facultad de Ciencias Biológicas, Pontificia Universidad Católica de Chile, Santiago, Chile.
6
Ozer N, Haliloglu T, Schiffer CA. Substrate specificity in HIV-1 protease by a biased sequence search method. Proteins 2006; 64:444-56. [PMID: 16741993 DOI: 10.1002/prot.21023] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Indexed: 11/11/2022]
Abstract
Drug resistance in HIV-1 protease can also occasionally confer a change in the substrate specificity. Through the use of computational techniques, a relationship can be determined between the substrate sequence and three-dimensional structure of HIV-1 protease, and be utilized to predict substrate specificity. In this study, we introduce a biased sequence search threading (BSST) methodology to analyze the preferences of substrate positions and correlations between them that might also identify which positions within known substrates can likely tolerate sequence variability and which cannot. The potential sequence space was efficiently explored using a low-resolution knowledge-based scoring function. The low-energy substrate sequences generated by the biased search are correlated with the natural substrates. Octameric sequences were predicted using the probabilities of residue positions in the sequences generated by BSST in three ways: considering each position in the substrate independently, considering pairwise interdependency, and considering triple-wise interdependency. The prediction of octameric sequences using the triple-wise conditional probabilities produces the most accurate results, reproducing most of the sequences for five of the nine natural substrates and implying that there is a complex interdependence between the different substrate residue positions. This likely reflects that HIV-1 protease recognizes the overall shape of the substrate more than its specific sequence.
Affiliation(s)
- Nevra Ozer
- Polymer Research Center and Chemical Engineering Department, Bogazici University, Bebek, Istanbul, Turkey
7
Wrabl JO, Grishin NV. Grouping of amino acid types and extraction of amino acid properties from multiple sequence alignments using variance maximization. Proteins 2006; 61:523-34. [PMID: 16184599 DOI: 10.1002/prot.20648] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Indexed: 11/06/2022]
Abstract
Understanding of amino acid type co-occurrence in trusted multiple sequence alignments is a prerequisite for improved sequence alignment and remote homology detection algorithms. Two objective approaches were used to investigate co-occurrence, both based on variance maximization of the weighted residue frequencies in columns taken from a large alignment database. The first approach discretely grouped amino acid types, and the second approach extracted orthogonal properties of amino acids using principal components analysis. The grouping results corresponded to amino acid physical properties such as side chain hydrophobicity, size, or backbone flexibility, and an optimal arrangement of approximately eight groups was observed. However, interpretation of the orthogonal properties was more complex. Although the principal components accounting for the largest variances exhibited modest correlations with hydrophobicity and conservation of glycine, in general principal components did not correspond to physical properties of amino acids. Although not intuitive, these amino acid mathematical properties were demonstrated to be robust and to improve local pairwise alignment accuracy, relative to 20 amino acid frequencies alone, for a simple test case.
Affiliation(s)
- James O Wrabl
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas 75390-9050, USA
8
Solis AD, Rackovsky S. Improvement of statistical potentials and threading score functions using information maximization. Proteins 2006; 62:892-908. [PMID: 16395676 DOI: 10.1002/prot.20501] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.3] [Indexed: 11/10/2022]
Abstract
We show that statistical potentials and threading score functions, derived from finite data sets, are informatic functions, and that their performance depends on the manner in which data are classified and compressed. The choice of sequence and structural parameters affects estimates of the conditional probabilities P(C|S), the quantification of the effect of sequence S on conformation C, and determines the amount of information extracted from the data set, as measured by information gain. The mathematical link between information gain and mean conformational energy, established in this work using the local backbone potential as model, demonstrates that manipulation of descriptive parameters also alters the "energy" values assigned to native conformation and to decoy structures in the test pool, and consequently, the performance of such statistical potential functions in fold recognition exercises. We show that sequence and structural partitions that maximize information gain also minimize the mean energy of the ensemble of native conformations. Moreover, we establish an informatic basis for the placement of the native score within an energy spectrum given by the decoy pool in a threading exercise. We discover that, among all informatic quantities, information gain is the best predictor of threading success, even better than the standard Z-score. Consequently, the choices of sequence and structural descriptors, extent of compression, and levels of discretization that maximize information gain must also produce the best potential functions. Strategies to optimize these parameters with respect to information extraction are therefore relevant to building better statistical potentials. Last, we demonstrate that the backbone torsion potential, defined by the trimer sequence, can be an effective tool in greatly reducing the set of possible conformations from a vast decoy pool.
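The role of information gain can be made concrete with a small Python sketch. It assumes that information gain is measured as the mutual information between a sequence descriptor S and a conformation class C, estimated from paired observations; the exact estimator and descriptors used in the paper may differ, and the function name and toy data below are hypothetical.

```python
import math
from collections import Counter


def information_gain(pairs):
    """Estimate I(C;S) in bits from (sequence_class, conformation_class) pairs.

    I(C;S) = sum over (s, c) of P(s, c) * log2( P(c|s) / P(c) )
    Probabilities are plain relative frequencies, with no small-sample
    correction.
    """
    pairs = list(pairs)
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    seq_counts = Counter(s for s, _ in pairs)
    conf_counts = Counter(c for _, c in pairs)
    gain = 0.0
    for (s, c), n_sc in joint.items():
        p_sc = n_sc / n                   # joint probability P(s, c)
        p_c_given_s = n_sc / seq_counts[s]  # conditional P(c | s)
        p_c = conf_counts[c] / n          # marginal P(c)
        gain += p_sc * math.log2(p_c_given_s / p_c)
    return gain


if __name__ == "__main__":
    # Sequence class fully determines conformation: 1 bit of information.
    informative = [("A", "helix"), ("A", "helix"), ("G", "coil"), ("G", "coil")]
    # Sequence class says nothing about conformation: 0 bits.
    uninformative = [("A", "helix"), ("A", "coil"), ("G", "helix"), ("G", "coil")]
    print(information_gain(informative), information_gain(uninformative))
```

Comparing this quantity across alternative groupings of residues or discretizations of backbone angles is one simple way to rank descriptor choices, in the spirit of the optimization strategy the abstract describes.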
Affiliation(s)
- Armando D Solis
- Department of Pharmacology and Biological Chemistry, Mount Sinai School of Medicine, Box 1215, New York, New York 10029, USA
9
Kuznetsov IB, Rackovsky S. On the properties and sequence context of structurally ambivalent fragments in proteins. Protein Sci 2003; 12:2420-33. [PMID: 14573856 PMCID: PMC2366964 DOI: 10.1110/ps.03209703] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.5] [Indexed: 10/26/2022]
Abstract
The goal of this work is to characterize structurally ambivalent fragments in proteins. We have searched the Protein Data Bank and identified all structurally ambivalent peptides (SAPs) of length five or greater that exist in two different backbone conformations. The SAPs were classified in five distinct categories based on their structure. We propose a novel index that provides a quantitative measure of conformational variability of a sequence fragment. It measures the context-dependent width of the distribution of (phi, psi) dihedral angles associated with each amino acid type. This index was used to analyze the local structural propensity of both SAPs and the sequence fragments contiguous to them. We also analyzed type-specific amino acid composition, solvent accessibility, and overall structural properties of SAPs and their sequence context. We show that each type of SAP has an unusual, type-specific amino acid composition and, as a result, simultaneous intrinsic preferences for two distinct types of backbone conformation. All types of SAPs have lower sequence complexity than average. Fragments that adopt helical conformation in one protein and sheet conformation in another have the lowest sequence complexity and are sampled from a relatively limited repertoire of possible residue combinations. A statistically significant difference between two distinct conformations of the same SAP is observed not only in the overall structural properties of proteins harboring the SAP but also in the properties of its flanking regions and in the pattern of solvent accessibility. These results have implications for protein design and structure prediction.
Affiliation(s)
- Igor B Kuznetsov
- Department of Biomathematical Sciences, Mount Sinai School of Medicine, New York, New York 10029, USA