1
Dayhoff GW, Uversky VN. Rapid prediction and analysis of protein intrinsic disorder. Protein Sci 2022; 31:e4496. [PMID: 36334049 PMCID: PMC9679974 DOI: 10.1002/pro.4496] [Citation(s) in RCA: 30] [Impact Index Per Article: 15.0] [Received: 08/15/2022] [Revised: 10/28/2022] [Accepted: 11/02/2022] [Indexed: 11/07/2022]
Abstract
Protein intrinsic disorder is found in all kingdoms of life and is known to underpin numerous physiological and pathological processes. Computational methods play an important role in characterizing and identifying intrinsically disordered proteins and protein regions. Herein, we present a new high-efficiency web-based disorder predictor named Rapid Intrinsic Disorder Analysis Online (RIDAO) that is designed to facilitate the application of protein intrinsic disorder analysis in genome-scale structural bioinformatics and comparative genomics/proteomics. RIDAO integrates six established disorder predictors into a single, unified platform that reproduces the results of individual predictors with near-perfect fidelity. To demonstrate the potential applications, we construct a test set containing more than one million sequences from one hundred organisms comprising over 420 million residues. Using this test set, we compare the efficiency and accessibility (i.e., ease of use) of RIDAO to five well-known and popular disorder predictors, namely: AUCpreD, IUPred3, metapredict V2, flDPnn, and SPOT-Disorder2. We show that RIDAO yields per-residue predictions at a rate two to six orders of magnitude greater than the other predictors and completely processes the test set in under an hour. RIDAO can be accessed free of charge at https://ridao.app.
Affiliation(s)
- Guy W. Dayhoff
- Department of Chemistry, University of South Florida, Tampa, Florida, USA
- Vladimir N. Uversky
- Department of Molecular Medicine and USF Health Byrd Alzheimer's Research Institute, University of South Florida, Tampa, Florida, USA
2
Yang Y, Gao J, Wang J, Heffernan R, Hanson J, Paliwal K, Zhou Y. Sixty-five years of the long march in protein secondary structure prediction: the final stretch? Brief Bioinform 2018; 19:482-494. [PMID: 28040746 PMCID: PMC5952956 DOI: 10.1093/bib/bbw129] [Citation(s) in RCA: 84] [Impact Index Per Article: 14.0] [Received: 10/10/2016] [Revised: 11/15/2016] [Indexed: 11/13/2022] Open
Abstract
Protein secondary structure prediction began in 1951 when Pauling and Corey predicted helical and sheet conformations for the protein polypeptide backbone even before the first protein structure was determined. Sixty-five years later, powerful new methods breathe new life into this field. The highest three-state accuracy without relying on structure templates is now at 82-84%, a number unthinkable just a few years ago. These improvements came from increasingly larger databases of protein sequences and structures for training, the use of template secondary structure information and more powerful deep learning techniques. As we approach the theoretical limit of three-state prediction (88-90%), alternatives to secondary structure prediction (prediction of backbone torsion angles and Cα-atom-based angles and torsion angles) not only have more room for further improvement but also allow direct prediction of three-dimensional fragment structures with constantly improved accuracy. About 20% of all 40-residue fragments in a database of 1199 non-redundant proteins have <6 Å root-mean-squared distance from the native conformations by SPIDER2. More powerful deep learning methods with improved capability of capturing long-range interactions begin to emerge as the next generation of techniques for secondary structure prediction. The time has come to finish off the final stretch of the long march towards protein secondary structure prediction.
Affiliation(s)
- Yuedong Yang
- Institute for Glycomics and School of Information and Communication Technology, Griffith University, Parklands Drive, Southport, QLD, Australia
- Jianzhao Gao
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin, China
- Jihua Wang
- Shandong Provincial Key Laboratory of Biophysics, Institute of Biophysics, Dezhou University, Dezhou, China
- Rhys Heffernan
- Signal Processing Laboratory, Griffith University, Brisbane, Australia
- Jack Hanson
- Signal Processing Laboratory, Griffith University, Brisbane, Australia
- Kuldip Paliwal
- Signal Processing Laboratory, Griffith University, Brisbane, Australia
- Yaoqi Zhou
- Institute for Glycomics and School of Information and Communication Technology, Griffith University, Parklands Drive, Southport, QLD, Australia
- Shandong Provincial Key Laboratory of Biophysics, Institute of Biophysics, Dezhou University, Dezhou, China
3
Dunker AK, Oldfield CJ. Back to the Future: Nuclear Magnetic Resonance and Bioinformatics Studies on Intrinsically Disordered Proteins. Advances in Experimental Medicine and Biology 2015; 870:1-34. [PMID: 26387098 DOI: 10.1007/978-3-319-20164-1_1] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Indexed: 01/08/2023]
Abstract
From the 1970s to the present, regions of missing electron density in protein structures determined by X-ray diffraction and the characterization of the functions of these regions have suggested that not all protein regions depend on prior 3D structure to carry out function. Motivated by these observations, in early 1996 we began to use bioinformatics approaches to study these intrinsically disordered proteins (IDPs) and IDP regions. At just about the same time, several laboratory groups began to study a collection of IDPs and IDP regions using nuclear magnetic resonance. The temporal overlap of the bioinformatics and NMR studies played a significant role in the development of our understanding of IDPs. Here the goal is to recount some of this history and to project from this experience possible directions for future work.
Affiliation(s)
- A Keith Dunker
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, 46202, Indianapolis, IN, USA.
- Christopher J Oldfield
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, 46202, Indianapolis, IN, USA.
4
A methodological review of data mining techniques in predictive medicine: An application in hemodynamic prediction for abdominal aortic aneurysm disease. Biocybern Biomed Eng 2014. [DOI: 10.1016/j.bbe.2014.03.003] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Indexed: 11/17/2022]
5
Cheng Y, Oldfield CJ, Meng J, Romero P, Uversky VN, Dunker AK. Mining alpha-helix-forming molecular recognition features with cross species sequence alignments. Biochemistry 2007; 46:13468-77. [PMID: 17973494 DOI: 10.1021/bi7012273] [Citation(s) in RCA: 268] [Impact Index Per Article: 15.8] [Indexed: 12/16/2022]
Abstract
Previously described algorithms for mining alpha-helix-forming molecular recognition elements (MoREs), described by Oldfield et al. (Oldfield, C. J., Cheng, Y., Cortese, M. S., Brown, C. J., Uversky, V. N., and Dunker, A. K. (2005) Comparing and combining predictors of mostly disordered proteins, Biochemistry 44, 1989-2000), also known as molecular recognition features (MoRFs) (Mohan, A., Oldfield, C. J., Radivojac, P., Vacic, V., Cortese, M. S., Dunker, A. K., and Uversky, V. N. (2006) Analysis of Molecular Recognition Features (MoRFs), J. Mol. Biol. 362, 1043-1059), revealed that regions undergoing disorder-to-order transition are involved in many molecular recognition events and are crucial for protein-protein interactions. However, these algorithms were developed using a training data set of a limited size. Here we propose to improve the prediction algorithms by (1) including additional alpha-MoRF examples and their cross species homologues in the positive training set, (2) carefully extracting monomer structure chains from the Protein Data Bank (PDB) as the negative training set, (3) including attributes from recently developed disorder predictors, secondary structure predictions, and amino acid indices, and (4) constructing neural network based predictors and performing validation. Over 50 regions which undergo disorder-to-order transition that were identified in the PDB together with a set of corresponding cross species homologues of each structure-based example were included in a new positive training set. Over 1500 attributes, including disorder predictions, secondary structure predictions, and amino acid indices, were evaluated by the conditional probability method. The top attributes, including VSL2 and VL3 disorder predictions and several physicochemical propensities of amino acid residues, were used to develop the feed forward neural networks. 
The sensitivity, specificity, and accuracy of the resulting predictor, alpha-MoRF-PredII, were 0.87 +/- 0.10, 0.87 +/- 0.11, and 0.87 +/- 0.08 over 10 cross validations, respectively. We present the results of these analyses and validation examples to discuss the potential improvement of the alpha-MoRF-PredII prediction accuracy.
Affiliation(s)
- Yugong Cheng
- Center for Computational Biology and Bioinformatics, Department of Biochemistry and Molecular Biology, Indiana University School of Medicine, Indianapolis, Indiana 46202, USA
6
Dor O, Zhou Y. Achieving 80% ten-fold cross-validated accuracy for secondary structure prediction by large-scale training. Proteins 2007; 66:838-45. [PMID: 17177203 DOI: 10.1002/prot.21298] [Citation(s) in RCA: 97] [Impact Index Per Article: 5.7] [Indexed: 11/06/2022]
Abstract
An integrated system of neural networks, called SPINE, is established and optimized for predicting structural properties of proteins. SPINE is applied to three-state secondary-structure and residue-solvent-accessibility (RSA) prediction in this paper. The integrated neural networks are carefully trained with a large dataset of 2640 chains, sequence profiles generated from multiple sequence alignment, representative amino acid properties, a slow learning rate, overfitting protection, and an optimized sliding-window size. More than 200,000 weights in SPINE are optimized by maximizing the accuracy measured by Q(3) (the percentage of correctly classified residues). SPINE yields a 10-fold cross-validated accuracy of 79.5% (80.0% for chains of length between 50 and 300) in secondary-structure prediction after one-month (CPU time) training on 22 processors. An accuracy of 87.5% is achieved for exposed residues (RSA >95%). The latter approaches the theoretical upper limit of 88-90% accuracy in assigning secondary structures. An accuracy of 73% for three-state solvent-accessibility prediction (25%/75% cutoff) and 79.3% for two-state prediction (25% cutoff) is also obtained.
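The abstract defines Q(3) as the percentage of correctly classified residues. A minimal sketch of that measure follows, assuming predictions and observations are given as equal-length strings over the three states H (helix), E (strand) and C (coil). The function name `q3_accuracy` and the toy strings are illustrative assumptions and are not taken from SPINE.

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Return the percentage of residues whose three-state label matches.

    Both arguments are per-residue label strings (for example 'H', 'E', 'C');
    the encoding is an assumption made for this sketch.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("at least one residue is required")
    # Count positions where the predicted state equals the observed state.
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Hypothetical 17-residue example with two misclassified residues.
observed = "CCHHHHHHCCEEEEECC"
predicted = "CCHHHHHCCCEEEECCC"
print(f"Q3 = {q3_accuracy(predicted, observed):.1f}%")
```

A whole-dataset figure such as the one reported above would pool the counts over all residues of all chains rather than average per-chain percentages; which convention the authors used is not stated in the abstract.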
Affiliation(s)
- Ofer Dor
- Department of Physiology and Biophysics, Center for Single Molecule Biophysics, Howard Hughes Medical Institute, State University of New York at Buffalo, Buffalo, New York 14214, USA
7
Liu H, Wong L. Data mining tools for biological sequences. J Bioinform Comput Biol 2004; 1:139-67. [PMID: 15290785 DOI: 10.1142/s0219720003000216] [Citation(s) in RCA: 54] [Impact Index Per Article: 2.7] [Received: 04/04/2002] [Revised: 04/07/2003] [Accepted: 04/07/2003] [Indexed: 11/18/2022]
Abstract
We describe a methodology, as well as some related data mining tools, for analyzing sequence data. The methodology comprises three steps: (a) generating candidate features from the sequences, (b) selecting relevant features from the candidates, and (c) integrating the selected features to build a system to recognize specific properties in sequence data. We also give relevant techniques for each of these three steps. For generating candidate features, we present various types of features based on the idea of k-grams. For selecting relevant features, we discuss signal-to-noise, t-statistics, and entropy measures, as well as a correlation-based feature selection method. For integrating selected features, we use machine learning methods, including C4.5, SVM, and Naive Bayes. We illustrate this methodology on the problem of recognizing translation initiation sites. We discuss how to generate and select features that are useful for understanding the distinction between ATG sites that are translation initiation sites and those that are not. We also discuss how to use such features to build reliable systems for recognizing translation initiation sites in DNA sequences.
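The first two steps described in this abstract (generating k-gram features, then ranking them) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the toy sequences, and the particular signal-to-noise form used here (absolute difference of class means divided by the sum of class standard deviations) are all assumptions.

```python
from collections import Counter
from statistics import mean, pstdev


def kgram_counts(sequence: str, k: int) -> Counter:
    """Count every overlapping substring of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def signal_to_noise(pos_values, neg_values) -> float:
    """Score one feature: |mean difference| / (sum of class std deviations)."""
    diff = abs(mean(pos_values) - mean(neg_values))
    spread = pstdev(pos_values) + pstdev(neg_values)
    if spread == 0:
        # No within-class variation: uninformative if the means agree,
        # otherwise the classes are perfectly separated by this feature.
        return 0.0 if diff == 0 else float("inf")
    return diff / spread


# Hypothetical toy windows; real inputs would be sequence windows around ATG sites.
positives = ["GCCACCATGG", "ACCACCATGG"]
negatives = ["TTTTTTATGT", "TTTATTATGC"]

# Step (a): candidate features are the counts of each 3-gram.
# Step (b): rank each 3-gram by its signal-to-noise score.
grams = set().union(*(kgram_counts(s, 3) for s in positives + negatives))
scores = {
    g: signal_to_noise(
        [kgram_counts(s, 3)[g] for s in positives],
        [kgram_counts(s, 3)[g] for s in negatives],
    )
    for g in grams
}
for gram, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(gram, score)
```

The top-ranked k-grams would then feed step (c), training a classifier such as the C4.5, SVM, or Naive Bayes methods named in the abstract.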
Affiliation(s)
- Huiqing Liu
- Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613, Singapore.
8
Dunker AK, Brown CJ, Obradovic Z. Identification and functions of usefully disordered proteins. Advances in Protein Chemistry 2004; 62:25-49. [PMID: 12418100 DOI: 10.1016/s0065-3233(02)62004-2] [Citation(s) in RCA: 285] [Impact Index Per Article: 14.3] [Indexed: 12/13/2022]
Affiliation(s)
- A Keith Dunker
- School of Molecular Biosciences, Washington State University, Pullman, Washington 99164, USA
9
Li Y, Rosal RV, Brandt-Rauf PW, Fine RL. Correlation between hydrophobic properties and efficiency of carrier-mediated membrane transduction and apoptosis of a p53 C-terminal peptide. Biochem Biophys Res Commun 2002; 298:439-49. [PMID: 12413961 DOI: 10.1016/s0006-291x(02)02470-1] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.2] [Indexed: 10/27/2022]
Abstract
Two membrane transporters, the 17 amino acid (aa) oligopeptide penetratin derived from the homeodomain of Antennapedia (Ant) and an analogue of the basic domain of TAT (aa 47-57) (TAT-a) from HIV-1, were tested as carriers for a p53 C-terminal peptide (aa 361-382) into human breast cancer cells. The studies were performed to determine whether the membrane-transduction efficiency of membrane carriers: Ant, TAT or TAT analogue (TAT-a) correlated with peptide hydrophobic features. Peptide-sequence analysis clearly demonstrated that the Ant sequence and p53 peptide sequence (p53p) together created a peptide with enhanced hydrophobic characteristics; while the TAT or TAT analogue (TAT-a) and p53p sequence together created a peptide with significantly less hydrophobic qualities. The degree of hydrophobic moment and helical wheel plots for these peptides correlated directly with their ability to transduce the p53 peptide. Western blot analysis revealed that Ant was able to transduce p53 C-terminal peptide into human breast cancer cells as a highly efficient membrane transporter. Compared to Ant, TAT-a fused to the C-terminus of p53 peptide (p53p-TAT-a) was a less efficient carrier into these cells under the conditions of our study. Additionally, N-terminal linked TAT-a to p53p (TAT-a-p53p) showed even lower efficiency as a transporter than p53-TAT-a. Apoptosis assays showed that the p53 peptide, fused at its C-terminus to Ant (p53p-Ant), induced a higher percentage of apoptotic cells in human breast cancer cell lines expressing mutant or wild-type p53 as compared to p53 peptide fused at its C-terminus to the TAT-a sequence (p53p-TAT-a) or when fused at the N-terminus to TAT-a (TAT-a-p53p). These data suggested a direct correlation between hydrophobic characteristics and efficiency as a transporter. Sequence study, using hydrophobic moment and helical wheel analyses, may be useful predictive tools for choosing the best carrier for a peptide.
Affiliation(s)
- Yin Li
- Experimental Therapeutics Program, Division of Medical Oncology, Herbert Irving Comprehensive Cancer Center, College of Physicians and Surgeons of Columbia University, New York, NY 10032-3702, USA
10
Juretić D, Lučić B, Zucić D, Trinajstić N. Protein transmembrane structure: recognition and prediction by using hydrophobicity scales through preference functions. Theoretical and Computational Chemistry 1998. [DOI: 10.1016/s1380-7323(98)80015-0] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.7] [Indexed: 01/15/2023]
11
Lathrop RH, Rogers RG, White JV, Gaitatzes C, Smith TF, Bienkowska J, Bryant BK, Buturović LJ, Nambudripad R. Analysis and algorithms for protein sequence–structure alignment. Computational Methods in Molecular Biology 1998. [DOI: 10.1016/s0167-7306(08)60469-x] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.4] [Indexed: 12/31/2022]
12
Eisenhaber F, Persson B, Argos P. Protein structure prediction: recognition of primary, secondary, and tertiary structural features from amino acid sequence. Crit Rev Biochem Mol Biol 1995; 30:1-94. [PMID: 7587278 DOI: 10.3109/10409239509085139] [Citation(s) in RCA: 96] [Impact Index Per Article: 3.3] [Indexed: 01/26/2023]
Abstract
This review attempts a critical stock-taking of the current state of the science aimed at predicting structural features of proteins from their amino acid sequences. At the primary structure level, methods are considered for detection of remotely related sequences and for recognizing amino acid patterns to predict posttranslational modifications and binding sites. The techniques involving secondary structural features include prediction of secondary structure, membrane-spanning regions, and secondary structural class. At the tertiary structural level, methods for threading a sequence into a mainchain fold, homology modeling and assigning sequences to protein families with similar folds are discussed. A literature analysis suggests that, to date, threading techniques are not able to show their superiority over sequence pattern recognition methods. Recent progress in the state of ab initio structure calculation is reviewed in detail. The analysis shows that many structural features can be predicted from the amino acid sequence much better than just a few years ago and with attendant utility in experimental research. Best prediction can be achieved for new protein sequences that can be assigned to well-studied protein families. For single sequences without homologues, the folding problem has not yet been solved.
Affiliation(s)
- F Eisenhaber
- Institut für Biochemie der Charité, Medizinische Fakultät, Humboldt-Universität zu Berlin, Fed. Rep. Germany