101
|
Computational Approaches for Predicting Binding Partners, Interface Residues, and Binding Affinity of Protein-Protein Complexes. Methods Mol Biol 2017; 1484:237-253. [PMID: 27787830 DOI: 10.1007/978-1-4939-6406-2_16] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/20/2022]
Abstract
Studying protein-protein interactions leads to a better understanding of the underlying principles of several biological pathways. Cost and labor-intensive experimental techniques suggest the need for computational methods to complement them. Several such state-of-the-art methods have been reported for analyzing diverse aspects such as predicting binding partners, interface residues, and binding affinity for protein-protein complexes with reliable performance. However, there are specific drawbacks for different methods that indicate the need for their improvement. This review highlights various available computational algorithms for analyzing diverse aspects of protein-protein interactions and endorses the necessity for developing new robust methods for gaining deep insights about protein-protein interactions.
Collapse
|
102
|
|
103
|
Liu L, Tang L, Dong W, Yao S, Zhou W. An overview of topic modeling and its current applications in bioinformatics. SPRINGERPLUS 2016; 5:1608. [PMID: 27652181 PMCID: PMC5028368 DOI: 10.1186/s40064-016-3252-8] [Citation(s) in RCA: 94] [Impact Index Per Article: 10.4] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/24/2016] [Accepted: 09/08/2016] [Indexed: 11/10/2022]
Abstract
BACKGROUND With the rapid accumulation of biological datasets, machine learning methods designed to automate data analysis are urgently needed. In recent years, so-called topic models that originated from the field of natural language processing have been receiving much attention in bioinformatics because of their interpretability. Our aim was to review the application and development of topic models for bioinformatics. DESCRIPTION This paper starts with the description of a topic model, with a focus on the understanding of topic modeling. A general outline is provided on how to build an application in a topic model and how to develop a topic model. Meanwhile, the literature on application of topic models to biological data was searched and analyzed in depth. According to the types of models and the analogy between the concept of document-topic-word and a biological object (as well as the tasks of a topic model), we categorized the related studies and provided an outlook on the use of topic models for the development of bioinformatics applications. CONCLUSION Topic modeling is a useful method (in contrast to the traditional means of data reduction in bioinformatics) and enhances researchers' ability to interpret biological information. Nevertheless, due to the lack of topic models optimized for specific biological data, the studies on topic modeling in biological data still have a long and challenging road ahead. We believe that topic models are a promising method for various applications in bioinformatics research.
Collapse
Affiliation(s)
- Lin Liu
- School of Information, Yunnan University, Kunming, 650091 Yunnan China
- School of Information (Key Laboratory of Educational Informatization for Nationalities Ministry of Education), Yunnan Normal University, Kunming, 650092 Yunnan China
| | - Lin Tang
- Key Laboratory of Educational Informatization for Nationalities Ministry of Education, Yunnan Normal University, Kunming, 650092 Yunnan China
| | - Wen Dong
- School of Information, Yunnan University, Kunming, 650091 Yunnan China
| | - Shaowen Yao
- National Pilot School of Software, Yunnan University, Kunming, 650091 Yunnan China
| | - Wei Zhou
- National Pilot School of Software, Yunnan University, Kunming, 650091 Yunnan China
| |
Collapse
|
104
|
Pan X, Fan YX, Yan J, Shen HB. IPMiner: hidden ncRNA-protein interaction sequential pattern mining with stacked autoencoder for accurate computational prediction. BMC Genomics 2016; 17:582. [PMID: 27506469 PMCID: PMC4979166 DOI: 10.1186/s12864-016-2931-8] [Citation(s) in RCA: 91] [Impact Index Per Article: 10.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/06/2016] [Accepted: 07/12/2016] [Indexed: 01/18/2023] Open
Abstract
BACKGROUND Non-coding RNAs (ncRNAs) play crucial roles in many biological processes, such as post-transcription of gene regulation. ncRNAs mainly function through interaction with RNA binding proteins (RBPs). To understand the function of a ncRNA, a fundamental step is to identify which protein is involved into its interaction. Therefore it is promising to computationally predict RBPs, where the major challenge is that the interaction pattern or motif is difficult to be found. RESULTS In this study, we propose a computational method IPMiner (Interaction Pattern Miner) to predict ncRNA-protein interactions from sequences, which makes use of deep learning and further improves its performance using stacked ensembling. One of the IPMiner's typical merits is that it is able to mine the hidden sequential interaction patterns from sequence composition features of protein and RNA sequences using stacked autoencoder, and then the learned hidden features are fed into random forest models. Finally, stacked ensembling is used to integrate different predictors to further improve the prediction performance. The experimental results indicate that IPMiner achieves superior performance on the tested lncRNA-protein interaction dataset with an accuracy of 0.891, sensitivity of 0.939, specificity of 0.831, precision of 0.945 and Matthews correlation coefficient of 0.784, respectively. We further comprehensively investigate IPMiner on other RNA-protein interaction datasets, which yields better performance than the state-of-the-art methods, and the performance has an increase of over 20 % on some tested benchmarked datasets. In addition, we further apply IPMiner for large-scale prediction of ncRNA-protein network, that achieves promising prediction performance. CONCLUSION By integrating deep neural network and stacked ensembling, from simple sequence composition features, IPMiner can automatically learn high-level abstraction features, which had strong discriminant ability for RNA-protein detection. IPMiner achieved high performance on our constructed lncRNA-protein benchmark dataset and other RNA-protein datasets. IPMiner tool is available at http://www.csbio.sjtu.edu.cn/bioinf/IPMiner .
Collapse
Affiliation(s)
- Xiaoyong Pan
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Dongchuan Road, Shanghai, China
- Present Address: Department of Veterinary Clinical and Animal Sciences, University of Copenhagen, Copenhagen, Denmark
| | - Yong-Xian Fan
- Guangxi Key Laboratory of Trusted Software, Guangxi Colleges and Universities Key Laboratory of Intelligent Processing of Computer Images and Graphics, Guilin University of Electronic Technology, Guilin, China
| | - Junchi Yan
- Institute of Software Engineering, East China Normal University, Shanghai, China
| | - Hong-Bin Shen
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Dongchuan Road, Shanghai, China
| |
Collapse
|
105
|
Prediction of human protein–protein interaction by a domain-based approach. J Theor Biol 2016; 396:144-53. [PMID: 26925814 DOI: 10.1016/j.jtbi.2016.02.026] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/07/2015] [Revised: 01/29/2016] [Accepted: 02/20/2016] [Indexed: 02/04/2023]
|
106
|
Li Z, Tang J, Guo F. Identification of 14-3-3 Proteins Phosphopeptide-Binding Specificity Using an Affinity-Based Computational Approach. PLoS One 2016; 11:e0147467. [PMID: 26828594 PMCID: PMC4734684 DOI: 10.1371/journal.pone.0147467] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/16/2015] [Accepted: 01/04/2016] [Indexed: 11/17/2022] Open
Abstract
The 14-3-3 proteins are a highly conserved family of homodimeric and heterodimeric molecules, expressed in all eukaryotic cells. In human cells, this family consists of seven distinct but highly homologous 14-3-3 isoforms. 14-3-3σ is the only isoform directly linked to cancer in epithelial cells, which is regulated by major tumor suppressor genes. For each 14-3-3 isoform, we have 1,000 peptide motifs with experimental binding affinity values. In this paper, we present a novel method for identifying peptide motifs binding to 14-3-3σ isoform. First, we propose a sampling criteria to build a predictor for each new peptide sequence. Then, we select nine physicochemical properties of amino acids to describe each peptide motif. We also use auto-cross covariance to extract correlative properties of amino acids in any two positions. Finally, we consider elastic net to predict affinity values of peptide motifs, based on ridge regression and least absolute shrinkage and selection operator (LASSO). Our method tests on the 1,000 known peptide motifs binding to seven 14-3-3 isoforms. On the 14-3-3σ isoform, our method has overall pearson-product-moment correlation coefficient (PCC) and root mean squared error (RMSE) values of 0.84 and 252.31 for N-terminal sublibrary, and 0.77 and 269.13 for C-terminal sublibrary. We predict affinity values of 16,000 peptide sequences and relative binding ability across six permutated positions similar with experimental values. We identify phosphopeptides that preferentially bind to 14-3-3σ over other isoforms. Several positions on peptide motifs are in the same amino acid category with experimental substrate specificity of phosphopeptides binding to 14-3-3σ. Our method is fast and reliable and is a general computational method that can be used in peptide-protein binding identification in proteomics research.
Collapse
Affiliation(s)
- Zhao Li
- School of Computer Science and Technology, Tianjin University, 92 Weijin Road, Nankai District, Tianjin, P.R. China
| | - Jijun Tang
- School of Computer Science and Technology, Tianjin University, 92 Weijin Road, Nankai District, Tianjin, P.R. China.,School of Computational Science and Engineering, University of South Carolina, Columbia, United States of America
| | - Fei Guo
- School of Computer Science and Technology, Tianjin University, 92 Weijin Road, Nankai District, Tianjin, P.R. China
| |
Collapse
|
107
|
You ZH, Chan KCC, Hu P. Predicting protein-protein interactions from primary protein sequences using a novel multi-scale local feature representation scheme and the random forest. PLoS One 2015; 10:e0125811. [PMID: 25946106 PMCID: PMC4422660 DOI: 10.1371/journal.pone.0125811] [Citation(s) in RCA: 98] [Impact Index Per Article: 9.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2014] [Accepted: 03/04/2015] [Indexed: 11/18/2022] Open
Abstract
The study of protein-protein interactions (PPIs) can be very important for the understanding of biological cellular functions. However, detecting PPIs in the laboratories are both time-consuming and expensive. For this reason, there has been much recent effort to develop techniques for computational prediction of PPIs as this can complement laboratory procedures and provide an inexpensive way of predicting the most likely set of interactions at the entire proteome scale. Although much progress has already been achieved in this direction, the problem is still far from being solved. More effective approaches are still required to overcome the limitations of the current ones. In this study, a novel Multi-scale Local Descriptor (MLD) feature representation scheme is proposed to extract features from a protein sequence. This scheme can capture multi-scale local information by varying the length of protein-sequence segments. Based on the MLD, an ensemble learning method, the Random Forest (RF) method, is used as classifier. The MLD feature representation scheme facilitates the mining of interaction information from multi-scale continuous amino acid segments, making it easier to capture multiple overlapping continuous binding patterns within a protein sequence. When the proposed method is tested with the PPI data of Saccharomyces cerevisiae, it achieves a prediction accuracy of 94.72% with 94.34% sensitivity at the precision of 98.91%. Extensive experiments are performed to compare our method with existing sequence-based method. Experimental results show that the performance of our predictor is better than several other state-of-the-art predictors also with the H. pylori dataset. The reason why such good results are achieved can largely be credited to the learning capabilities of the RF model and the novel MLD feature representation scheme. The experiment results show that the proposed approach can be very promising for predicting PPIs and can be a useful tool for future proteomic studies.
Collapse
Affiliation(s)
- Zhu-Hong You
- Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China; School of Electronics and Information Engineering, Tongji University, Shanghai, China
| | - Keith C C Chan
- Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China
| | - Pengwei Hu
- Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China
| |
Collapse
|
108
|
Detecting protein-protein interactions with a novel matrix-based protein sequence representation and support vector machines. BIOMED RESEARCH INTERNATIONAL 2015; 2015:867516. [PMID: 26000305 PMCID: PMC4426769 DOI: 10.1155/2015/867516] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 10/01/2014] [Revised: 01/09/2015] [Accepted: 01/09/2015] [Indexed: 11/27/2022]
Abstract
Proteins and their interactions lie at the heart of most underlying biological processes. Consequently, correct detection of protein-protein interactions (PPIs) is of fundamental importance to understand the molecular mechanisms in biological systems. Although the convenience brought by high-throughput experiment in technological advances makes it possible to detect a large amount of PPIs, the data generated through these methods is unreliable and may not be completely inclusive of all possible PPIs. Targeting at this problem, this study develops a novel computational approach to effectively detect the protein interactions. This approach is proposed based on a novel matrix-based representation of protein sequence combined with the algorithm of support vector machine (SVM), which fully considers the sequence order and dipeptide information of the protein primary sequence. When performed on yeast PPIs datasets, the proposed method can reach 90.06% prediction accuracy with 94.37% specificity at the sensitivity of 85.74%, indicating that this predictor is a useful tool to predict PPIs. Achieved results also demonstrate that our approach can be a helpful supplement for the interactions that have been detected experimentally.
Collapse
|
109
|
Pan X, Xiong K. PredcircRNA: computational classification of circular RNA from other long non-coding RNA using hybrid features. MOLECULAR BIOSYSTEMS 2015; 11:2219-26. [DOI: 10.1039/c5mb00214a] [Citation(s) in RCA: 51] [Impact Index Per Article: 5.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/15/2022]
Abstract
PredcircRNA presents computational classification of circularRNA from other lncRNA using hybrid features based on multiple kernel learning.
Collapse
Affiliation(s)
- Xiaoyong Pan
- Department of Veterinary Clinical and Animal Sciences
- University of Copenhagen
- Denmark
| | - Kai Xiong
- Department of Veterinary Clinical and Animal Sciences
- University of Copenhagen
- Denmark
| |
Collapse
|
110
|
You ZH, Zhu L, Zheng CH, Yu HJ, Deng SP, Ji Z. Prediction of protein-protein interactions from amino acid sequences using a novel multi-scale continuous and discontinuous feature set. BMC Bioinformatics 2014; 15 Suppl 15:S9. [PMID: 25474679 PMCID: PMC4271571 DOI: 10.1186/1471-2105-15-s15-s9] [Citation(s) in RCA: 84] [Impact Index Per Article: 7.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND Identifying protein-protein interactions (PPIs) is essential for elucidating protein functions and understanding the molecular mechanisms inside the cell. However, the experimental methods for detecting PPIs are both time-consuming and expensive. Therefore, computational prediction of protein interactions are becoming increasingly popular, which can provide an inexpensive way of predicting the most likely set of interactions at the entire proteome scale, and can be used to complement experimental approaches. Although much progress has already been achieved in this direction, the problem is still far from being solved and new approaches are still required to overcome the limitations of the current prediction models. RESULTS In this work, a sequence-based approach is developed by combining a novel Multi-scale Continuous and Discontinuous (MCD) feature representation and Support Vector Machine (SVM). The MCD representation gives adequate consideration to the interactions between sequentially distant but spatially close amino acid residues, thus it can sufficiently capture multiple overlapping continuous and discontinuous binding patterns within a protein sequence. An effective feature selection method mRMR was employed to construct an optimized and more discriminative feature set by excluding redundant features. Finally, a prediction model is trained and tested based on SVM algorithm to predict the interaction probability of protein pairs. CONCLUSIONS When performed on the yeast PPIs data set, the proposed approach achieved 91.36% prediction accuracy with 91.94% precision at the sensitivity of 90.67%. Extensive experiments are conducted to compare our method with the existing sequence-based method. Experimental results show that the performance of our predictor is better than several other state-of-the-art predictors, whose average prediction accuracy is 84.91%, sensitivity is 83.24%, and precision is 86.12%. Achieved results show that the proposed approach is very promising for predicting PPI, so it can be a useful supplementary tool for future proteomics studies. The source code and the datasets are freely available at http://csse.szu.edu.cn/staff/youzh/MCDPPI.zip for academic use.
Collapse
|
111
|
You ZH, Yu JZ, Zhu L, Li S, Wen ZK. A MapReduce based parallel SVM for large-scale predicting protein–protein interactions. Neurocomputing 2014. [DOI: 10.1016/j.neucom.2014.05.072] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/25/2022]
|
112
|
Predicting protein-RNA interaction amino acids using random forest based on submodularity subset selection. Comput Biol Chem 2014; 53PB:324-330. [PMID: 25462339 DOI: 10.1016/j.compbiolchem.2014.11.002] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/02/2014] [Revised: 10/31/2014] [Accepted: 11/08/2014] [Indexed: 12/25/2022]
Abstract
Protein-RNA interaction plays a very crucial role in many biological processes, such as protein synthesis, transcription and post-transcription of gene expression and pathogenesis of disease. Especially RNAs always function through binding to proteins. Identification of binding interface region is especially useful for cellular pathways analysis and drug design. In this study, we proposed a novel approach for binding sites identification in proteins, which not only integrates local features and global features from protein sequence directly, but also constructed a balanced training dataset using sub-sampling based on submodularity subset selection. Firstly we extracted local features and global features from protein sequence, such as evolution information and molecule weight. Secondly, the number of non-interaction sites is much more than interaction sites, which leads to a sample imbalance problem, and hence biased machine learning model with preference to non-interaction sites. To better resolve this problem, instead of previous randomly sub-sampling over-represented non-interaction sites, a novel sampling approach based on submodularity subset selection was employed, which can select more representative data subset. Finally random forest were trained on optimally selected training subsets to predict interaction sites. Our result showed that our proposed method is very promising for predicting protein-RNA interaction residues, it achieved an accuracy of 0.863, which is better than other state-of-the-art methods. Furthermore, it also indicated the extracted global features have very strong discriminate ability for identifying interaction residues from random forest feature importance analysis.
Collapse
|
113
|
Large-scale protein-protein interactions detection by integrating big biosensing data with computational model. BIOMED RESEARCH INTERNATIONAL 2014; 2014:598129. [PMID: 25215285 PMCID: PMC4151593 DOI: 10.1155/2014/598129] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 06/23/2014] [Accepted: 07/24/2014] [Indexed: 01/12/2023]
Abstract
Protein-protein interactions are the basis of biological functions, and studying these interactions on a molecular level is of crucial importance for understanding the functionality of a living cell. During the past decade, biosensors have emerged as an important tool for the high-throughput identification of proteins and their interactions. However, the high-throughput experimental methods for identifying PPIs are both time-consuming and expensive. On the other hand, high-throughput PPI data are often associated with high false-positive and high false-negative rates. Targeting at these problems, we propose a method for PPI detection by integrating biosensor-based PPI data with a novel computational model. This method was developed based on the algorithm of extreme learning machine combined with a novel representation of protein sequence descriptor. When performed on the large-scale human protein interaction dataset, the proposed method achieved 84.8% prediction accuracy with 84.08% sensitivity at the specificity of 85.53%. We conducted more extensive experiments to compare the proposed method with the state-of-the-art techniques, support vector machine. The achieved results demonstrate that our approach is very promising for detecting new PPIs, and it can be a helpful supplement for biosensor-based PPI data detection.
Collapse
|
114
|
Nanni L, Lumini A, Brahnam S. An empirical study of different approaches for protein classification. ScientificWorldJournal 2014; 2014:236717. [PMID: 25028675 PMCID: PMC4084589 DOI: 10.1155/2014/236717] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/24/2014] [Revised: 05/05/2014] [Accepted: 05/07/2014] [Indexed: 01/05/2023] Open
Abstract
Many domains would benefit from reliable and efficient systems for automatic protein classification. An area of particular interest in recent studies on automatic protein classification is the exploration of new methods for extracting features from a protein that work well for specific problems. These methods, however, are not generalizable and have proven useful in only a few domains. Our goal is to evaluate several feature extraction approaches for representing proteins by testing them across multiple datasets. Different types of protein representations are evaluated: those starting from the position specific scoring matrix of the proteins (PSSM), those derived from the amino-acid sequence, two matrix representations, and features taken from the 3D tertiary structure of the protein. We also test new variants of proteins descriptors. We develop our system experimentally by comparing and combining different descriptors taken from the protein representations. Each descriptor is used to train a separate support vector machine (SVM), and the results are combined by sum rule. Some stand-alone descriptors work well on some datasets but not on others. Through fusion, the different descriptors provide a performance that works well across all tested datasets, in some cases performing better than the state-of-the-art.
Collapse
Affiliation(s)
- Loris Nanni
- Dipartimento di Ingegneria dell'Informazione, Via Gradenigo 6/A, 35131 Padova, Italy
| | | | - Sheryl Brahnam
- Computer Information Systems, Missouri State University, 901 South National, Springfield, MO 65804, USA
| |
Collapse
|
115
|
Yugandhar K, Gromiha MM. Feature selection and classification of protein-protein complexes based on their binding affinities using machine learning approaches. Proteins 2014; 82:2088-96. [PMID: 24648146 DOI: 10.1002/prot.24564] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/13/2014] [Accepted: 03/14/2014] [Indexed: 12/16/2022]
Abstract
Protein-protein interactions are intrinsic to virtually every cellular process. Predicting the binding affinity of protein-protein complexes is one of the challenging problems in computational and molecular biology. In this work, we related sequence features of protein-protein complexes with their binding affinities using machine learning approaches. We set up a database of 185 protein-protein complexes for which the interacting pairs are heterodimers and their experimental binding affinities are available. On the other hand, we have developed a set of 610 features from the sequences of protein complexes and utilized Ranker search method, which is the combination of Attribute evaluator and Ranker method for selecting specific features. We have analyzed several machine learning algorithms to discriminate protein-protein complexes into high and low affinity groups based on their Kd values. Our results showed a 10-fold cross-validation accuracy of 76.1% with the combination of nine features using support vector machines. Further, we observed accuracy of 83.3% on an independent test set of 30 complexes. We suggest that our method would serve as an effective tool for identifying the interacting partners in protein-protein interaction networks and human-pathogen interactions based on the strength of interactions.
Collapse
Affiliation(s)
- K Yugandhar
- Department of Biotechnology, Indian Institute of Technology Madras, Chennai, 600036, Tamil Nadu, India
| | | |
Collapse
|
116
|
Prediction of protein-protein interaction with pairwise kernel support vector machine. Int J Mol Sci 2014; 15:3220-33. [PMID: 24566145 PMCID: PMC3958907 DOI: 10.3390/ijms15023220] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/01/2014] [Revised: 01/27/2014] [Accepted: 01/29/2014] [Indexed: 11/17/2022] Open
Abstract
Protein–protein interactions (PPIs) play a key role in many cellular processes. Unfortunately, the experimental methods currently used to identify PPIs are both time-consuming and expensive. These obstacles could be overcome by developing computational approaches to predict PPIs. Here, we report two methods of amino acids feature extraction: (i) distance frequency with PCA reducing the dimension (DFPCA) and (ii) amino acid index distribution (AAID) representing the protein sequences. In order to obtain the most robust and reliable results for PPI prediction, pairwise kernel function and support vector machines (SVM) were employed to avoid the concatenation order of two feature vectors generated with two proteins. The highest prediction accuracies of AAID and DFPCA were 94% and 93.96%, respectively, using the 10 CV test, and the results of pairwise radial basis kernel function are considerably improved over those based on radial basis kernel function. Overall, the PPI prediction tool, termed PPI-PKSVM, which is freely available at http://159.226.118.31/PPI/index.html, promises to become useful in such areas as bio-analysis and drug development.
Collapse
|
117
|
Fan CY, Bai YH, Huang CY, Yao TJ, Chiang WH, Chang DTH. PRASA: an integrated web server that analyzes protein interaction types. Gene 2013; 518:78-83. [PMID: 23276706 DOI: 10.1016/j.gene.2012.11.083] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/27/2012] [Accepted: 11/27/2012] [Indexed: 11/16/2022]
Abstract
This work presents the Protein Association Analyzer (PRASA) (http://zoro.ee.ncku.edu.tw/prasa/) that predicts protein interactions as well as interaction types. Protein interactions are essential to most biological functions. The existence of diverse interaction types, such as physically contacted or functionally related interactions, makes protein interactions complex. Different interaction types are distinct and should not be confused. However, most existing tools focus on a specific interaction type or mix different interaction types. This work collected 7234058 associations with experimentally verified interaction types from five databases and compiled individual probabilistic models for different interaction types. The PRASA result page shows predicted associations and their related references by interaction type. Experimental results demonstrate the performance difference when distinguishing between different interaction types. The PRASA provides a centralized and organized platform for easy browsing, downloading and comparing of interaction types, which helps reveal insights into the complex roles that proteins play in organisms.
Collapse
Affiliation(s)
- Chen-Yu Fan
- Department of Electrical Engineering, National Cheng Kung University, Tainan 70101, Taiwan
| | | | | | | | | | | |
Collapse
|
118
|
|
119
|
Use of fast conformational sampling to improve the characterization of VEGF A–peptide interactions. J Theor Biol 2013; 317:293-300. [DOI: 10.1016/j.jtbi.2012.10.021] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/03/2012] [Revised: 10/12/2012] [Accepted: 10/15/2012] [Indexed: 01/25/2023]
|
120
|
An empirical study on the matrix-based protein representations and their combination with sequence-based approaches. Amino Acids 2012; 44:887-901. [PMID: 23108592 DOI: 10.1007/s00726-012-1416-6] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/15/2012] [Accepted: 10/03/2012] [Indexed: 10/27/2022]
Abstract
Many domains have a stake in the development of reliable systems for automatic protein classification. Of particular interest in recent studies of automatic protein classification is the exploration of new methods for extracting features from a protein that enhance classification for specific problems. These methods have proven very useful in one or two domains, but they have failed to generalize well across several domains (i.e. classification problems). In this paper, we evaluate several feature extraction approaches for representing proteins with the aim of sequence-based protein classification. Several protein representations are evaluated, those starting from: the position specific scoring matrix (PSSM) of the proteins; the amino-acid sequence; a matrix representation of the protein, of dimension (length of the protein) ×20, obtained using the substitution matrices for representing each amino-acid as a vector. A valuable result is that a texture descriptor can be extracted from the PSSM protein representation which improves the performance of standard descriptors based on the PSSM representation. Experimentally, we develop our systems by comparing several protein descriptors on nine different datasets. Each descriptor is used to train a support vector machine (SVM) or an ensemble of SVM. Although different stand-alone descriptors work well on some datasets (but not on others), we have discovered that fusion among classifiers trained using different descriptors obtains a good performance across all the tested datasets. Matlab code/Datasets used in the proposed paper are available at http://www.bias.csr.unibo.it\nanni\PSSM.rar.
Collapse
|
121
|
Abstract
Recent advances in next-generation sequencing technologies have resulted in an exponential increase in the rate at which protein sequence data are being acquired. The k-gram feature representation, commonly used for protein sequence classification, usually results in prohibitively high dimensional input spaces, for large values of k. Applying data mining algorithms to these input spaces may be intractable due to the large number of dimensions. Hence, using dimensionality reduction techniques can be crucial for the performance and the complexity of the learning algorithms. In this paper, we study the applicability of feature hashing to protein sequence classification, where the original high-dimensional space is "reduced" by hashing the features into a low-dimensional space, using a hash function, i.e., by mapping features into hash keys, where multiple features can be mapped (at random) to the same hash key, and "aggregating" their counts. We compare feature hashing with the "bag of k-grams" approach. Our results show that feature hashing is an effective approach to reducing dimensionality on protein sequence classification tasks.
Collapse
Affiliation(s)
- Cornelia Caragea
- Information Sciences and Technology, Pennsylvania State University, University Park, PA, USA
| | | | - Prasenjit Mitra
- Information Sciences and Technology, Pennsylvania State University, University Park, PA, USA
- Naviance Inc., Oakland, CA, USA
| |
Collapse
|
122
|
Chen Y, Xu J, Yang B, Zhao Y, He W. A novel method for prediction of protein interaction sites based on integrated RBF neural networks. Comput Biol Med 2012; 42:402-7. [PMID: 22226645 DOI: 10.1016/j.compbiomed.2011.12.007] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/09/2011] [Revised: 11/27/2011] [Accepted: 12/08/2011] [Indexed: 11/18/2022]
Abstract
Protein interactions are very important for control life activities. If we want to study the principle of protein interactions, we have to find the seats of a protein which are involved in the interactions called interaction sites firstly. In this paper, a novel method based on an integrated RBF neural networks is proposed for prediction of protein interaction sites. At first, a number of features were extracted, i.e., sequence profiles, entropy, relative entropy, conservation weight, accessible surface area and sequence variability. Then 6 sliding windows about these features were made, and they contained 1, 3, 5, 7, 9 and 11 amino acid residues respectively. These sliding windows were put into the input layers of six radial basis functional neural networks that were optimized by Particle Swarm Optimization. Thus, six group results were obtained. Finally, these six group results were integrated by decision fusion (DF) and Genetic Algorithm based Selective Ensemble (GASEN). The experimental results show that the proposed method performs better than the other related methods such as neural networks and support vector machine.
Collapse
Affiliation(s)
- Yuehui Chen
- Shandong Provincial Key Laboratory of Network Based Intelligent Computing, School of Information Science and Engineering, University of Jinan, Jiwei road 106, Jinan 250022, Shandong, PR China.
| | | | | | | | | |
Collapse
|
123
|
Schleker S, Sun J, Raghavan B, Srnec M, Müller N, Koepfinger M, Murthy L, Zhao Z, Klein-Seetharaman J. The current Salmonella-host interactome. Proteomics Clin Appl 2011; 6:117-33. [PMID: 22213674 DOI: 10.1002/prca.201100083] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/01/2011] [Revised: 09/26/2011] [Accepted: 09/27/2011] [Indexed: 11/11/2022]
Abstract
Salmonella bacteria cause millions of infections and thousands of deaths every year. This pathogen has an unusually broad host range including humans, animals, and even plants. During infection, Salmonella expresses a variety of virulence factors and effectors that are delivered into the host cell triggering cellular responses through protein-protein interactions (PPI) with host cell proteins which make the pathogen's invasion and replication possible. To speed up proteomic efforts in elucidating Salmonella-host interactomes, we carried out a survey of the currently published Salmonella-host PPI. Such a list can serve as the gold standard for computational models aimed at predicting Salmonella-host interactomes through integration of large-scale biological data sources. Manual literature and database search of >2200 journal articles and >100 databases resulted in a gold standard list of currently 62 PPI, including primarily interactions of Salmonella proteins with human and mouse proteins. Only six of these interactions were directly retrievable from PPI databases and 16 were highlighted in databases featuring literature extracts. Thus, the literature survey resulted in the most complete interactome available to date for Salmonella. Pathway analysis using Ingenuity and Broad Gene Set Enrichment Analysis (GSEA) software revealed among general pathways such as MAPK signaling in particular those related to cell death as well as cell morphology, turnover, and interactions, in addition to response to not only Salmonella but also other pathogenic - viral and bacterial - infections. The list of interactions is available at http://www.shiprec.org/indicationslist.htm.
Collapse
Affiliation(s)
- Sylvia Schleker
- Forschungszentrum Jülich, Institute of Complex Systems, Jülich, Germany
| | | | | | | | | | | | | | | | | |
Collapse
|
124
|
Jessulat M, Pitre S, Gui Y, Hooshyar M, Omidi K, Samanfar B, Tan LH, Alamgir M, Green J, Dehne F, Golshani A. Recent advances in protein-protein interaction prediction: experimental and computational methods. Expert Opin Drug Discov 2011; 6:921-35. [PMID: 22646215 DOI: 10.1517/17460441.2011.603722] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/05/2022]
Abstract
INTRODUCTION Proteins within the cell act as part of complex networks, which allow pathways and processes to function. Therefore, understanding how proteins interact is a significant area of current research. AREAS COVERED This review aims to present an overview of key experimental techniques (yeast two-hybrid, tandem affinity purification and protein microarrays) used to discover protein-protein interactions (PPIs), as well as to briefly discuss certain computational methods for predicting protein interactions based on gene localization, phylogenetic information, 3D structural modeling or primary protein sequence data. Due to the large-scale applicability of primary sequence-based methods, the authors have chosen to focus on this strategy for our review. There is an emphasis on a recent algorithm called Protein Interaction Prediction Engine (PIPE) that can predict global PPIs. The readers will discover recent advances both in the practical determination of protein interaction and the strategies that are available to attempt to anticipate interactions without the time and costs of experimental work. EXPERT OPINION Global PPI maps can help understand the biology of complex diseases and facilitate the identification of novel drug target sites. This study describes different techniques used for PPI prediction that we believe will significantly impact the development of the field in a new future. We expect to see a growing number of similar techniques capable of large-scale PPI predictions.
Collapse
Affiliation(s)
- Matthew Jessulat
- Carleton University , Department of Biology , 209 Nesbitt Building, 1125 Colonel By Drive, Ottawa, Ontario K1S 5B6 , Canada
| | | | | | | | | | | | | | | | | | | | | |
Collapse
|
125
|
Xue LC, Dobbs D, Honavar V. HomPPI: a class of sequence homology based protein-protein interface prediction methods. BMC Bioinformatics 2011; 12:244. [PMID: 21682895 PMCID: PMC3213298 DOI: 10.1186/1471-2105-12-244] [Citation(s) in RCA: 78] [Impact Index Per Article: 5.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/25/2010] [Accepted: 06/17/2011] [Indexed: 01/12/2023] Open
Abstract
BACKGROUND Although homology-based methods are among the most widely used methods for predicting the structure and function of proteins, the question as to whether interface sequence conservation can be effectively exploited in predicting protein-protein interfaces has been a subject of debate. RESULTS We studied more than 300,000 pair-wise alignments of protein sequences from structurally characterized protein complexes, including both obligate and transient complexes. We identified sequence similarity criteria required for accurate homology-based inference of interface residues in a query protein sequence.Based on these analyses, we developed HomPPI, a class of sequence homology-based methods for predicting protein-protein interface residues. We present two variants of HomPPI: (i) NPS-HomPPI (Non partner-specific HomPPI), which can be used to predict interface residues of a query protein in the absence of knowledge of the interaction partner; and (ii) PS-HomPPI (Partner-specific HomPPI), which can be used to predict the interface residues of a query protein with a specific target protein.Our experiments on a benchmark dataset of obligate homodimeric complexes show that NPS-HomPPI can reliably predict protein-protein interface residues in a given protein, with an average correlation coefficient (CC) of 0.76, sensitivity of 0.83, and specificity of 0.78, when sequence homologs of the query protein can be reliably identified. NPS-HomPPI also reliably predicts the interface residues of intrinsically disordered proteins. Our experiments suggest that NPS-HomPPI is competitive with several state-of-the-art interface prediction servers including those that exploit the structure of the query proteins. The partner-specific classifier, PS-HomPPI can, on a large dataset of transient complexes, predict the interface residues of a query protein with a specific target, with a CC of 0.65, sensitivity of 0.69, and specificity of 0.70, when homologs of both the query and the target can be reliably identified. The HomPPI web server is available at http://homppi.cs.iastate.edu/. CONCLUSIONS Sequence homology-based methods offer a class of computationally efficient and reliable approaches for predicting the protein-protein interface residues that participate in either obligate or transient interactions. For query proteins involved in transient interactions, the reliability of interface residue prediction can be improved by exploiting knowledge of putative interaction partners.
Collapse
Affiliation(s)
- Li C Xue
- Department of Computer Science, Iowa State University, Ames, IA 50011, USA.
| | | | | |
Collapse
|
126
|
Zhang YN, Pan XY, Huang Y, Shen HB. Adaptive compressive learning for prediction of protein-protein interactions from primary sequence. J Theor Biol 2011; 283:44-52. [PMID: 21635901 DOI: 10.1016/j.jtbi.2011.05.023] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/11/2010] [Revised: 04/20/2011] [Accepted: 05/16/2011] [Indexed: 12/11/2022]
Abstract
Protein-protein interactions (PPIs) play an important role in biological processes. Although much effort has been devoted to the identification of novel PPIs by integrating experimental biological knowledge, there are still many difficulties because of lacking enough protein structural and functional information. It is highly desired to develop methods based only on amino acid sequences for predicting PPIs. However, sequence-based predictors are often struggling with the high-dimensionality causing over-fitting and high computational complexity problems, as well as the redundancy of sequential feature vectors. In this paper, a novel computational approach based on compressed sensing theory is proposed to predict yeast Saccharomyces cerevisiae PPIs from primary sequence and has achieved promising results. The key advantage of the proposed compressed sensing algorithm is that it can compress the original high-dimensional protein sequential feature vector into a much lower but more condensed space taking the sparsity property of the original signal into account. What makes compressed sensing much more attractive in protein sequence analysis is its compressed signal can be reconstructed from far fewer measurements than what is usually considered necessary in traditional Nyquist sampling theory. Experimental results demonstrate that proposed compressed sensing method is powerful for analyzing noisy biological data and reducing redundancy in feature vectors. The proposed method represents a new strategy of dealing with high-dimensional protein discrete model and has great potentiality to be extended to deal with many other complicated biological systems.
Collapse
Affiliation(s)
- Ya-Nan Zhang
- Department of Automation, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
| | | | | | | |
Collapse
|