1. Gao Q, Zhang C, Li M, Yu T. Protein-Protein Interaction Prediction Model Based on ProtBert-BiGRU-Attention. J Comput Biol 2024; 31:797-814. [PMID: 39069885 DOI: 10.1089/cmb.2023.0297] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/30/2024]
Abstract
The physiological activities within cells are mainly regulated through protein-protein interactions (PPI). Therefore, studying protein interactions has become an essential part of researching protein function and mechanisms. Traditional biological experiments required for PPI prediction are expensive and time consuming. For this reason, many methods based on predicting PPI from protein sequences have been proposed in recent years. However, existing computational methods usually require the combination of evolutionary feature information of proteins to predict PPI docking situations. Because different relevant features of selected proteins are chosen, there may be differences in the predicted results for PPI. This article proposes a PPI prediction method based on the pretrained protein sequence model ProtBert, combined with the Bidirectional Gated Recurrent Unit (BiGRU) and attention mechanism. Only using protein sequence information and leveraging ProtBert's powerful ability to capture amino acid feature information, BiGRU is used for further feature extraction of the amino acid vectors output by ProtBert. The attention mechanism is then applied to enhance the focus on different amino acid features and improve the expression ability of protein sequence features, ultimately obtaining binary classification results for protein interactions. Experimental results show that our proposed ProtBert-BiGRU-Attention model has good predictive performance for PPI. Through relevant comparative experiments, it has been proven that our model performs well in protein binary prediction. Furthermore, through the ablation experiment of the model, different deep learning modules' contributions to the prediction have been demonstrated.
Affiliation(s)
- Qian Gao
  - College of Computer and Control Engineering, Qiqihar University, Qiqihar, China
- Chi Zhang
  - College of Computer and Control Engineering, Qiqihar University, Qiqihar, China
- Ming Li
  - College of Computer and Control Engineering, Qiqihar University, Qiqihar, China
- Tianfei Yu
  - College of Life Science and Agriculture Forestry, Qiqihar University, Qiqihar, China

2. Cao MY, Zainudin S, Daud KM. Protein features fusion using attributed network embedding for predicting protein-protein interaction. BMC Genomics 2024; 25:466. [PMID: 38741045 DOI: 10.1186/s12864-024-10361-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2024] [Accepted: 04/29/2024] [Indexed: 05/16/2024]
Abstract
BACKGROUND Protein-protein interactions (PPIs) hold significant importance in biology, with precise PPI prediction as a pivotal factor in comprehending cellular processes and facilitating drug design. However, experimental determination of PPIs is laborious, time-consuming, and often constrained by technical limitations. METHODS We introduce a new node representation method based on initial information fusion, called FFANE, which amalgamates PPI networks and protein sequence data to enhance the precision of PPIs' prediction. A Gaussian kernel similarity matrix is initially established by leveraging protein structural resemblances. Concurrently, protein sequence similarities are gauged using the Levenshtein distance, enabling the capture of diverse protein attributes. Subsequently, to construct an initial information matrix, these two feature matrices are merged by employing weighted fusion to achieve an organic amalgamation of structural and sequence details. To gain a more profound understanding of the amalgamated features, a Stacked Autoencoder (SAE) is employed for encoding learning, thereby yielding more representative feature representations. Ultimately, classification models are trained to predict PPIs by using the well-learned fusion feature. RESULTS When employing 5-fold cross-validation experiments on SVM, our proposed method achieved average accuracies of 94.28%, 97.69%, and 84.05% in terms of Saccharomyces cerevisiae, Homo sapiens, and Helicobacter pylori datasets, respectively. CONCLUSION Experimental findings across various authentic datasets validate the efficacy and superiority of this fusion feature representation approach, underscoring its potential value in bioinformatics.
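The fusion step described in this abstract (a Gaussian kernel similarity, a Levenshtein-based sequence similarity, and a weighted combination of the two) can be illustrated with a minimal sketch. This is not the FFANE implementation: the function names, the `gamma` and `alpha` parameters, the use of per-protein profile vectors as the "structural" input, and the normalisation of edit distance into a similarity are all assumptions made for illustration only.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (row-by-row dynamic programme)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def sequence_similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1 - distance / length of the longer sequence."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def gaussian_kernel(u, v, gamma: float = 1.0) -> float:
    """Gaussian (RBF) similarity between two profile vectors."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(u, v))
    return math.exp(-gamma * sq_dist)


def fuse(seqs, profiles, alpha: float = 0.5):
    """Weighted fusion of the two similarity matrices (alpha is hypothetical)."""
    n = len(seqs)
    return [[alpha * gaussian_kernel(profiles[i], profiles[j])
             + (1.0 - alpha) * sequence_similarity(seqs[i], seqs[j])
             for j in range(n)]
            for i in range(n)]
```

In the abstract, the fused matrix is then passed to a stacked autoencoder and a classifier; those stages are outside the scope of this sketch.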
Affiliation(s)
- Mei-Yuan Cao
  - Center for Artificial Intelligence Technology (CAIT), Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi, 43600, Selangor, Malaysia
- Suhaila Zainudin
  - Center for Artificial Intelligence Technology (CAIT), Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi, 43600, Selangor, Malaysia
- Kauthar Mohd Daud
  - Center for Artificial Intelligence Technology (CAIT), Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi, 43600, Selangor, Malaysia

3. Jang YH, Han J, Shim SK, Cheong S, Lee SH, Han JK, Hwang CS. Cross-Wired Memristive Crossbar Array for Effective Graph Data Analysis. Advanced Materials (Deerfield Beach, Fla.) 2023:e2311040. [PMID: 38145578 DOI: 10.1002/adma.202311040] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/22/2023] [Revised: 12/06/2023] [Indexed: 12/27/2023]
Abstract
Graphs adequately represent the enormous interconnections among numerous entities in big data, incurring high computational costs in analyzing them with conventional hardware. Physical graph representation (PGR) is an approach that replicates the graph within a physical system, allowing for efficient analysis. This study introduces a cross-wired crossbar array (cwCBA), uniquely connecting diagonal and non-diagonal components in a CBA by a cross-wiring process. The cross-wired diagonal cells enable cwCBA to achieve precise PGR and dynamic node state control. For this purpose, a cwCBA is fabricated using Pt/Ta2O5/HfO2/TiN (PTHT) memristor with high on/off and self-rectifying characteristics. The structural and device benefits of PTHT cwCBA for enhanced PGR precision are highlighted, and the practical efficacy is demonstrated for two applications. First, it executes a dynamic path-finding algorithm, identifying the shortest paths in a dynamic graph. PTHT cwCBA shows a more accurate inferred distance and ≈1/3800 lower processing complexity than the conventional method. Second, it analyzes the protein-protein interaction (PPI) networks containing self-interacting proteins, which possess intricate characteristics compared to typical graphs. The PPI prediction results exhibit an average of 30.5% and 21.3% improvement in area under the curve and F1-score, respectively, compared to existing algorithms.
Affiliation(s)
- Yoon Ho Jang
  - Department of Materials Science and Engineering and Inter-university Semiconductor Research Center, College of Engineering, Seoul National University, Seoul, 08826, Republic of Korea
- Janguk Han
  - Department of Materials Science and Engineering and Inter-university Semiconductor Research Center, College of Engineering, Seoul National University, Seoul, 08826, Republic of Korea
- Sung Keun Shim
  - Department of Materials Science and Engineering and Inter-university Semiconductor Research Center, College of Engineering, Seoul National University, Seoul, 08826, Republic of Korea
- Sunwoo Cheong
  - Department of Materials Science and Engineering and Inter-university Semiconductor Research Center, College of Engineering, Seoul National University, Seoul, 08826, Republic of Korea
- Soo Hyung Lee
  - Department of Materials Science and Engineering and Inter-university Semiconductor Research Center, College of Engineering, Seoul National University, Seoul, 08826, Republic of Korea
- Joon-Kyu Han
  - Department of Materials Science and Engineering and Inter-university Semiconductor Research Center, College of Engineering, Seoul National University, Seoul, 08826, Republic of Korea
- Cheol Seong Hwang
  - Department of Materials Science and Engineering and Inter-university Semiconductor Research Center, College of Engineering, Seoul National University, Seoul, 08826, Republic of Korea

4. Halsana AA, Chakroborty T, Halder AK, Basu S. DensePPI: A Novel Image-Based Deep Learning Method for Prediction of Protein-Protein Interactions. IEEE Trans Nanobioscience 2023; 22:904-911. [PMID: 37028059 DOI: 10.1109/tnb.2023.3251192] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/05/2023]
Abstract
Protein-protein interactions (PPI) are crucial for understanding the behaviour of living organisms and identifying disease associations. This paper proposes DensePPI, a novel deep convolution strategy applied to the 2D image map generated from the interacting protein pairs for PPI prediction. A colour encoding scheme has been introduced to embed the bigram interaction possibilities of amino acids into RGB colour space to enhance the learning and prediction task. The DensePPI model is trained on 5.5 million sub-images of size 128×128 generated from nearly 36,000 interacting and 36,000 non-interacting benchmark protein pairs. The performance is evaluated on independent datasets from five different organisms: Caenorhabditis elegans, Escherichia coli, Helicobacter pylori, Homo sapiens and Mus musculus. The proposed model achieves an average prediction accuracy score of 99.95% on these datasets, considering inter-species and intra-species interactions. The performance of DensePPI is compared with the state-of-the-art methods and outperforms those approaches in different evaluation metrics. Improved performance of DensePPI indicates the efficiency of the image-based encoding strategy of sequence information with the deep learning architecture in PPI prediction. The enhanced performance on diverse test sets shows that DensePPI is significant for both intra-species and cross-species interaction prediction. The dataset, supplementary file, and the developed models are available at https://github.com/Aanzil/DensePPI for academic use only.

5. Madugula SS, Pandey S, Amalapurapu S, Bozdag S. NRPreTo: A Machine Learning-Based Nuclear Receptor and Subfamily Prediction Tool. ACS Omega 2023; 8:20379-20388. [PMID: 37323377 PMCID: PMC10268018 DOI: 10.1021/acsomega.3c00286] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/15/2023] [Accepted: 05/09/2023] [Indexed: 06/17/2023]
Abstract
The nuclear receptor (NR) superfamily includes phylogenetically related ligand-activated proteins, which play a key role in various cellular activities. NR proteins are subdivided into seven subfamilies based on their function, mechanism, and nature of the interacting ligand. Developing robust tools to identify NR could give insights into their functional relationships and involvement in disease pathways. Existing NR prediction tools only use a few types of sequence-based features and are tested on relatively similar independent datasets; thus, they may suffer from overfitting when extended to new genera of sequences. To address this problem, we developed Nuclear Receptor Prediction Tool (NRPreTo), a two-level NR prediction tool with a unique training approach where in addition to the sequence-based features used by existing NR prediction tools, six additional feature groups depicting various physiochemical, structural, and evolutionary features of proteins were utilized. The first level of NRPreTo allows for the successful prediction of a query protein as NR or non-NR and further subclassifies the protein into one of the seven NR subfamilies in the second level. We developed Random Forest classifiers to test on benchmark datasets, as well as the entire human protein datasets from RefSeq and Human Protein Reference Database (HPRD). We observed that using additional feature groups improved the performance. We also observed that NRPreTo achieved high performance on the external datasets and predicted 59 novel NRs in the human proteome. The source code of NRPreTo is publicly available at https://github.com/bozdaglab/NRPreTo.
Affiliation(s)
- Sita Sirisha Madugula
  - Department of Computer Science & Engineering, University of North Texas, Denton, Texas 76203, United States
- Suman Pandey
  - Department of Computer Science & Engineering, University of North Texas, Denton, Texas 76203, United States
- Shreya Amalapurapu
  - Department of Computer Science & Engineering, University of North Texas, Denton, Texas 76203, United States
  - The Texas Academy of Mathematics and Science, University of North Texas, Denton, Texas 76203, United States
- Serdar Bozdag
  - Department of Computer Science & Engineering, University of North Texas, Denton, Texas 76203, United States
  - Department of Mathematics, University of North Texas, Denton, Texas 76203, United States
  - BioDiscovery Institute, University of North Texas, Denton, Texas 76203, United States

6. Jha K, Karmakar S, Saha S. Graph-BERT and language model-based framework for protein-protein interaction identification. Sci Rep 2023; 13:5663. [PMID: 37024543 PMCID: PMC10079975 DOI: 10.1038/s41598-023-31612-w] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 11/28/2022] [Accepted: 03/14/2023] [Indexed: 04/08/2023]
Abstract
Identification of protein-protein interactions (PPI) is among the critical problems in the domain of bioinformatics. Previous studies have utilized different AI-based models for PPI classification with advances in artificial intelligence (AI) techniques. The input to these models is the features extracted from different sources of protein information, mainly sequence-derived features. In this work, we present an AI-based PPI identification model utilizing a PPI network and protein sequences. The PPI network is represented as a graph where each node is a protein pair, and an edge is defined between two nodes if there exists a common protein between these nodes. Each node in a graph has a feature vector. In this work, we have used the language model to extract feature vectors directly from protein sequences. The feature vectors for protein in pairs are concatenated and used as a node feature vector of a PPI network graph. Finally, we have used the Graph-BERT model to encode the PPI network graph with sequence-based features and learn the hidden representation of the feature vector for each node. The next step involves feeding the learned representations of nodes to the fully connected layer, the output of which is fed into the softmax layer to classify the protein interactions. To assess the efficacy of the proposed PPI model, we have performed experiments on several PPI datasets. The experimental results demonstrate that the proposed approach surpasses the existing PPI works and designed baselines in classifying PPI.
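The graph construction described in this abstract (each node is a protein pair, two nodes are linked when they share a protein, and a node's feature vector is the concatenation of its two protein embeddings) can be sketched as follows. The function names and the `embed` lookup table are hypothetical; in the paper the embeddings come from a protein language model, which is replaced here by a plain dictionary for illustration.

```python
from itertools import combinations


def build_pair_graph(pairs):
    """Nodes are protein pairs; an edge joins two nodes that share a protein."""
    nodes = [tuple(p) for p in pairs]
    edges = set()
    for i, j in combinations(range(len(nodes)), 2):
        if set(nodes[i]) & set(nodes[j]):  # at least one common protein
            edges.add((i, j))
    return nodes, edges


def node_features(nodes, embed):
    """Concatenate the embeddings of the two proteins in each pair."""
    return [list(embed[a]) + list(embed[b]) for a, b in nodes]
```

For example, the pairs `("A", "B")` and `("B", "C")` are connected because both contain protein `B`, while `("D", "E")` stays isolated. The resulting nodes, edges, and features are what a graph encoder such as Graph-BERT would consume.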
Affiliation(s)
- Kanchan Jha
  - Department of Computer Science and Engineering, Indian Institute of Technology Patna, Patna, Bihar, 801103, India
- Sourav Karmakar
  - Department of Computer Science and Engineering, National Institute of Technology Durgapur, Durgapur, West Bengal, 713209, India
- Sriparna Saha
  - Department of Computer Science and Engineering, Indian Institute of Technology Patna, Patna, Bihar, 801103, India

7. Wang X, Yang W, Yang Y, He Y, Zhang J, Wang L, Hu L. PPISB: A Novel Network-Based Algorithm of Predicting Protein-Protein Interactions With Mixed Membership Stochastic Blockmodel. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:1606-1612. [PMID: 35939453 DOI: 10.1109/tcbb.2022.3196336] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 05/04/2023]
Abstract
Protein-protein interactions (PPIs) play an essential role for most of biological processes in cells. Many computational algorithms have thus been proposed to predict PPIs. However, most of them heavily rest on the biological information of proteins while ignoring the latent structural features of proteins presented in a PPI network. In this paper, we propose an efficient network-based prediction algorithm, namely PPISB, based on a mixed membership stochastic blockmodel. By simulating the generative process of a PPI network, PPISB is able to capture the latent community structures. The inference procedure adopted by PPISB further optimizes the membership distributions of proteins over different complexes. After that, a distance measure is designed to compute the similarity between two proteins in terms of their likelihoods of being in the same complex, thus verifying whether they interact with each other or not. To evaluate the performance of PPISB, a series of extensive experiments have been conducted with five PPI networks collected from different species and the results demonstrate that PPISB has a promising performance when applied to predict PPIs in terms of several evaluation metrics. Hence, we reason that PPISB is preferred over state-of-the-art network-based prediction algorithms especially for predicting potential PPIs.

8. Soleymani F, Paquet E, Viktor HL, Michalowski W, Spinello D. ProtInteract: A deep learning framework for predicting protein-protein interactions. Comput Struct Biotechnol J 2023; 21:1324-1348. [PMID: 36817951 PMCID: PMC9929211 DOI: 10.1016/j.csbj.2023.01.028] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 12/05/2022] [Revised: 01/20/2023] [Accepted: 01/20/2023] [Indexed: 01/26/2023]
Abstract
Proteins mainly perform their functions by interacting with other proteins. Protein-protein interactions underpin various biological activities such as metabolic cycles, signal transduction, and immune response. However, due to the sheer number of proteins, experimental methods for finding interacting and non-interacting protein pairs are time-consuming and costly. We therefore developed the ProtInteract framework to predict protein-protein interaction. ProtInteract comprises two components: first, a novel autoencoder architecture that encodes each protein's primary structure to a lower-dimensional vector while preserving its underlying sequence attributes. This leads to faster training of the second network, a deep convolutional neural network (CNN) that receives encoded proteins and predicts their interaction under three different scenarios. In each scenario, the deep CNN predicts the class of a given encoded protein pair. Each class indicates different ranges of confidence scores corresponding to the probability of whether a predicted interaction occurs or not. The proposed framework features significantly low computational complexity and relatively fast response. The contributions of this work are twofold. First, ProtInteract assimilates the protein's primary structure into a pseudo-time series. Therefore, we leverage the nature of the time series of proteins and their physicochemical properties to encode a protein's amino acid sequence into a lower-dimensional vector space. This approach enables extracting highly informative sequence attributes while reducing computational complexity. Second, the ProtInteract framework utilises this information to identify protein interactions with other proteins based on its amino acid configuration. Our results suggest that the proposed framework performs with high accuracy and efficiency in predicting protein-protein interactions.
Affiliation(s)
- Farzan Soleymani
  - Department of Mechanical Engineering, University of Ottawa, Ottawa, ON K1N 6N5, Canada
- Eric Paquet
  - National Research Council, 1200 Montreal Road, Ottawa, ON K1A 0R6, Canada (corresponding author)
- Herna Lydia Viktor
  - School of Electrical Engineering and Computer Science, University of Ottawa, ON K1N 6N5, Canada
- Davide Spinello
  - Department of Mechanical Engineering, University of Ottawa, Ottawa, ON K1N 6N5, Canada

9. Jha K, Saha S. Analyzing Effect of Multi-Modality in Predicting Protein-Protein Interactions. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:162-173. [PMID: 35259112 DOI: 10.1109/tcbb.2022.3157531] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 05/04/2023]
Abstract
Nowadays, multiple sources of information about proteins are available, such as protein sequences, 3D structures, Gene Ontology (GO), etc. Most work on protein-protein interaction (PPI) identification has used these sources of information, mainly sequence-based, but individually. New advances in deep learning techniques allow us to leverage multiple sources/modalities of proteins, which complement each other. Some recent works have shown that multi-modal PPI models perform better than uni-modal approaches. This paper aims to investigate whether the performance of multi-modal PPI models is always consistent or depends on other factors such as dataset distribution, algorithms used to learn features, etc. We have used three modalities for this study: protein sequence, 3D structure, and GO. Various techniques, including deep learning algorithms, are employed to extract features from multiple sources of proteins. These feature vectors from different modalities are then integrated in several combinations (bi-modal and tri-modal) to predict PPI. To conduct this study, we have used Human and S. cerevisiae PPI datasets. The obtained results demonstrate the potential of a multi-modal approach and deep learning techniques in predicting protein interactions. However, the predictive capability of a model for PPI depends on feature extraction methods as well. Also, increasing the number of modalities does not always ensure performance improvement. In this study, the PPI model integrating two modalities outperforms the designed uni-modal and tri-modal PPI models.

10. Soleymani F, Paquet E, Viktor H, Michalowski W, Spinello D. Protein-protein interaction prediction with deep learning: A comprehensive review. Comput Struct Biotechnol J 2022; 20:5316-5341. [PMID: 36212542 PMCID: PMC9520216 DOI: 10.1016/j.csbj.2022.08.070] [Citation(s) in RCA: 34] [Impact Index Per Article: 11.3] [Received: 06/22/2022] [Revised: 08/29/2022] [Accepted: 08/30/2022] [Indexed: 11/15/2022]
Abstract
Most proteins perform their biological function by interacting with themselves or other molecules. Thus, one may obtain biological insights into protein functions, disease prevalence, and therapy development by identifying protein-protein interactions (PPI). However, finding the interacting and non-interacting protein pairs through experimental approaches is labour-intensive and time-consuming, owing to the variety of proteins. Hence, protein-protein interaction and protein-ligand binding problems have drawn attention in the fields of bioinformatics and computer-aided drug discovery. Deep learning methods paved the way for scientists to predict the 3-D structure of proteins from genomes, predict the functions and attributes of a protein, and modify and design new proteins to provide desired functions. This review focuses on recent deep learning methods applied to problems including predicting protein functions, protein-protein interaction and their sites, protein-ligand binding, and protein design.
Affiliation(s)
- Farzan Soleymani
  - Department of Mechanical Engineering, University of Ottawa, Ottawa, ON, Canada
- Eric Paquet
  - National Research Council, 1200 Montreal Road, Ottawa, ON K1A 0R6, Canada
- Herna Viktor
  - School of Electrical Engineering and Computer Science, University of Ottawa, ON, Canada
- Davide Spinello
  - Department of Mechanical Engineering, University of Ottawa, Ottawa, ON, Canada

11. Protein-protein interaction and non-interaction predictions using gene sequence natural vector. Commun Biol 2022; 5:652. [PMID: 35780196 PMCID: PMC9250521 DOI: 10.1038/s42003-022-03617-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 02/04/2022] [Accepted: 06/21/2022] [Indexed: 12/02/2022]
Abstract
Predicting protein–protein interaction and non-interaction are two important different aspects of multi-body structure predictions, which provide vital information about protein function. Some computational methods have recently been developed to complement experimental methods, but still cannot effectively detect real non-interacting protein pairs. We proposed a gene sequence-based method, named NVDT (Natural Vector combine with Dinucleotide and Triplet nucleotide), for the prediction of interaction and non-interaction. For protein–protein non-interactions (PPNIs), the proposed method obtained accuracies of 86.23% for Homo sapiens and 85.34% for Mus musculus, and it performed well on three types of non-interaction networks. For protein-protein interactions (PPIs), we obtained accuracies of 99.20, 94.94, 98.56, 95.41, and 94.83% for Saccharomyces cerevisiae, Drosophila melanogaster, Helicobacter pylori, Homo sapiens, and Mus musculus, respectively. Furthermore, NVDT outperformed established sequence-based methods and demonstrated high prediction results for cross-species interactions. NVDT is expected to be an effective approach for predicting PPIs and PPNIs. Protein-protein non-interactions and interactions are distinguished and predicted by gene sequence using single nucleotide and contiguous nucleotides combined with machine learning models.

12. Wang Y, Wang LL, Wong L, Li Y, Wang L, You ZH. SIPGCN: A Novel Deep Learning Model for Predicting Self-Interacting Proteins from Sequence Information Using Graph Convolutional Networks. Biomedicines 2022; 10:1543. [PMID: 35884848 PMCID: PMC9313220 DOI: 10.3390/biomedicines10071543] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2022] [Revised: 06/24/2022] [Accepted: 06/24/2022] [Indexed: 11/16/2022]
Abstract
Protein is the basic organic substance that constitutes the cell and is the material condition for the life activity and the guarantee of the biological function activity. Elucidating the interactions and functions of proteins is a central task in exploring the mysteries of life. As an important protein interaction, self-interacting protein (SIP) has a critical role. The fast growth of high-throughput experimental techniques among biomolecules has led to a massive influx of available SIP data. How to conduct scientific research using the massive amount of SIP data has become a new challenge that is being faced in related research fields such as biology and medicine. In this work, we design an SIP prediction method SIPGCN using a deep learning graph convolutional network (GCN) based on protein sequences. First, protein sequences are characterized using a position-specific scoring matrix, which is able to describe the biological evolutionary message, then their hidden features are extracted by the deep learning method GCN, and, finally, the random forest is utilized to predict whether there are interrelationships between proteins. In the cross-validation experiment, SIPGCN achieved 93.65% accuracy and 99.64% specificity in the human data set. SIPGCN achieved 90.69% and 99.08% of these two indicators in the yeast data set, respectively. Compared with other feature models and previous methods, SIPGCN showed excellent results. These outcomes suggest that SIPGCN may be a suitable instrument for predicting SIP and may be a reliable candidate for future wet experiments.
Affiliation(s)
- Ying Wang
  - College of Information Science and Engineering, Zaozhuang University, Zaozhuang 277160, China
- Lin-Lin Wang
  - College of Information Science and Engineering, Zaozhuang University, Zaozhuang 277160, China
  - Corresponding author
- Leon Wong
  - Big Data and Intelligent Computing Research Center, Guangxi Academy of Sciences, Nanning 530007, China
- Yang Li
  - School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China
- Lei Wang
  - College of Information Science and Engineering, Zaozhuang University, Zaozhuang 277160, China
  - Big Data and Intelligent Computing Research Center, Guangxi Academy of Sciences, Nanning 530007, China
  - Corresponding author
- Zhu-Hong You
  - Big Data and Intelligent Computing Research Center, Guangxi Academy of Sciences, Nanning 530007, China
  - School of Computer Science, Northwestern Polytechnical University, Xi’an 710129, China

13. Song B, Luo X, Luo X, Liu Y, Niu Z, Zeng X. Learning spatial structures of proteins improves protein-protein interaction prediction. Brief Bioinform 2022; 23:6501351. [PMID: 35018418 DOI: 10.1093/bib/bbab558] [Citation(s) in RCA: 56] [Impact Index Per Article: 18.7] [Received: 08/27/2021] [Revised: 12/07/2021] [Accepted: 12/07/2021] [Indexed: 01/09/2023]
Abstract
Spatial structures of proteins are closely related to protein functions. Integrating protein structures improves the performance of protein-protein interaction (PPI) prediction. However, the limited quantity of known protein structures restricts the application of structure-based prediction methods. Utilizing the predicted protein structure information is a promising method to improve the performance of sequence-based prediction methods. We propose a novel end-to-end framework, TAGPPI, to predict PPIs using protein sequence alone. TAGPPI extracts multi-dimensional features by employing 1D convolution operation on protein sequences and graph learning method on contact maps constructed from AlphaFold. A contact map contains abundant spatial structure information, which is difficult to obtain from 1D sequence data directly. We further demonstrate that the spatial information learned from contact maps improves the ability of TAGPPI in PPI prediction tasks. We compare the performance of TAGPPI with those of nine state-of-the-art sequence-based methods, and TAGPPI outperforms such methods in all metrics. To the best of our knowledge, this is the first method to use the predicted protein topology structure graph for sequence-based PPI prediction. More importantly, our proposed architecture could be extended to other prediction tasks related to proteins.
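A contact map, as used in this abstract, is a binary residue-by-residue matrix that marks which residues lie close together in 3D space. The sketch below shows one common way to derive it from coordinates. It is an illustration only: the abstract does not state the distance cutoff or which atoms are used, so the 8 Å threshold, the use of one coordinate per residue, and the function name `contact_map` are all assumptions.

```python
import math


def contact_map(coords, threshold=8.0):
    """Binary contact map from one 3D coordinate per residue.

    coords    -- list of (x, y, z) tuples, e.g. C-alpha positions (assumed)
    threshold -- cutoff in angstroms (assumed value, not from the paper)
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < threshold:
                cmap[i][j] = cmap[j][i] = 1  # symmetric; diagonal left at 0
    return cmap
```

The matrix can be read as the adjacency matrix of a residue graph, which is the kind of input a graph learning method operates on.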
Affiliation(s)
- Bosheng Song
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410012, Hunan, China
- Xiaoyan Luo
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410012, Hunan, China; MindRank AI ltd., Hangzhou, 311113, Zhejiang, China
- Xiaoli Luo
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410012, Hunan, China; BioMap, Haidian, 100089, Beijing, China
- Yuansheng Liu
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410012, Hunan, China
- Xiangxiang Zeng
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410012, Hunan, China
|
14
|
Hu L, Yang S, Luo X, Yuan H, Sedraoui K, Zhou M. A Distributed Framework for Large-scale Protein-protein Interaction Data Analysis and Prediction Using MapReduce. IEEE/CAA Journal of Automatica Sinica 2022; 9:160-172. [DOI: 10.1109/jas.2021.1004198] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Indexed: 01/03/2025]
|
15
|
Hu L, Zhao BW, Yang S, Luo X, Zhou M. Predicting Large-scale Protein-protein Interactions by Extracting Coevolutionary Patterns with MapReduce Paradigm. 2021 IEEE International Conference on Systems, Man, and Cybernetics (SMC) 2021:939-944. [DOI: 10.1109/smc52423.2021.9658839] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/03/2025]
|
16
|
Hu L, Wang X, Huang YA, Hu P, You ZH. A Novel Network-Based Algorithm for Predicting Protein-Protein Interactions Using Gene Ontology. Front Microbiol 2021; 12:735329. [PMID: 34512614 PMCID: PMC8425590 DOI: 10.3389/fmicb.2021.735329] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 07/02/2021] [Accepted: 08/02/2021] [Indexed: 11/24/2022] Open
Abstract
Proteins are among the most significant components of living organisms, and their main role in cells is to carry out various physiological functions by interacting with each other. Thus, the prediction of protein-protein interactions (PPIs) is crucial for understanding the molecular basis of biological processes, such as chronic infections. Because laboratory-based experiments are normally time-consuming and labor-intensive, computational prediction algorithms have become popular. However, few of them simultaneously consider both the structural information of PPI networks and the biological information of proteins to improve accuracy. To do so, we assume that the prior information of functional modules is known in advance and then simulate the generative process of a PPI network associated with the biological information of proteins, i.e., Gene Ontology, by using an established Bayesian model. To indicate the extent to which two proteins are likely to interact with each other, we propose a novel scoring function that combines the membership distributions of proteins with network paths. Experimental results show that our algorithm performs promisingly on several independent metrics when compared with state-of-the-art prediction algorithms, and also reveal that considering modularity in PPI networks provides an alternative, yet much more flexible, way to accurately predict PPIs.
Affiliation(s)
- Lun Hu
- Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, China
- Xiaojuan Wang
- School of Computer Science and Technology, Wuhan University of Technology, Wuhan, China
- Yu-An Huang
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
- Pengwei Hu
- Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, China
- Zhu-Hong You
- School of Computer Science, Northwestern Polytechnical University, Xi'an, China
|
17
|
Czibula G, Albu AI, Bocicor MI, Chira C. AutoPPI: An Ensemble of Deep Autoencoders for Protein-Protein Interaction Prediction. Entropy 2021; 23:e23060643. [PMID: 34064042 PMCID: PMC8223997 DOI: 10.3390/e23060643] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 04/15/2021] [Revised: 05/08/2021] [Accepted: 05/19/2021] [Indexed: 01/06/2023]
Abstract
Proteins are essential molecules that must correctly perform their roles for the good health of living organisms. The majority of proteins operate in complexes, and the way they interact has a pivotal influence on the proper functioning of such organisms. In this study we address the problem of protein–protein interaction, and we propose and investigate a method based on an ensemble of autoencoders. Our approach, entitled AutoPPI, adopts a strategy based on two autoencoders, one for each type of interaction (positive and negative), and we advance three types of neural network architectures for the autoencoders. Experiments were performed on several data sets comprising proteins from four different species. The results indicate good performance of our proposed model, with accuracy and AUC values of over 0.97 in all cases. The best performing model relies on a Siamese architecture in both the encoder and the decoder, which advantageously captures common features in protein pairs. Comparisons with other machine learning techniques applied to the same problem show that AutoPPI outperforms most of its contenders on the considered data sets.
|
18
|
Hu L, Wang X, Huang YA, Hu P, You ZH. A survey on computational models for predicting protein-protein interactions. Brief Bioinform 2021; 22:6159365. [PMID: 33693513 DOI: 10.1093/bib/bbab036] [Citation(s) in RCA: 58] [Impact Index Per Article: 14.5] [Received: 11/20/2020] [Revised: 12/31/2020] [Indexed: 12/24/2022] Open
Abstract
Proteins interact with each other to play critical roles in many biological processes in cells. Although promising, laboratory experiments usually suffer from the disadvantages of being time-consuming and labor-intensive. The results obtained are often not robust and considerably uncertain. Owing to recent advances in high-throughput technologies, a large amount of proteomics data has been collected, and this presents both a significant opportunity and a challenge to develop computational models to predict protein-protein interactions (PPIs) based on these data. In this paper, we present a comprehensive survey of the recent efforts that have been made towards the development of effective computational models for PPI prediction. The survey introduces the algorithms that can be used to learn computational models for predicting PPIs, and it classifies these models into different categories. To understand their relative merits, the paper discusses different validation schemes and metrics to evaluate the prediction performance. Biological databases that are commonly used in different experiments for performance comparison are also described, and their use in a series of extensive experiments to compare different prediction models is discussed. Finally, we present some open issues in PPI prediction for future work. We explain how the performance of PPI prediction can be improved if these issues are effectively tackled.
Affiliation(s)
- Lun Hu
- Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, 830011, Urumqi, China
- Xiaojuan Wang
- School of Computer Science and Technology, Wuhan University of Technology, 430070, Wuhan, China
- Yu-An Huang
- College of Computer Science and Software Engineering, Shenzhen University, 518060, Shenzhen, China
- Zhu-Hong You
- Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, 830011, Urumqi, China
|
19
|
Wang Z, Li Y, You ZH, Li LP, Zhan XK, Pan J. Prediction of Protein-Protein Interactions from Protein Sequences by Combining MatPCA Feature Extraction Algorithms and Weighted Sparse Representation Models. Mathematical Problems in Engineering 2020; 2020:1-11. [DOI: 10.1155/2020/5764060] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 01/03/2025]
Abstract
Identifying protein-protein interactions (PPIs) plays a vital role in a number of biological activities such as signal transduction, transcriptional regulation, and apoptosis. Although advances in high-throughput technologies have generated large amounts of PPI data for different species, they only cover a small part of the entire PPI network. Furthermore, traditional experimental methods are generally expensive, time-consuming, tedious, and prone to high false-positive rates. Therefore, to overcome this problem, it is necessary to develop a novel computational method for predicting PPIs. In this article, we propose an efficient computational method to detect protein-protein interactions using only protein sequence information, which integrates the MatPCA feature extraction algorithm and the weighted sparse representation classifier. As a result, when predicting PPIs on yeast, human, and H. pylori datasets, the proposed method achieves superior prediction performance with an average accuracy of 94.55%, 97.48%, and 83.64%, respectively. These experimental results further illustrate that the proposed method is reliable and robust in predicting PPIs, which can be regarded as a useful complement to the experimental method.
Affiliation(s)
- Zheng Wang
- School of Information Engineering, Xijing University, Xi’an 710123, China
- Yang Li
- School of Information Engineering, Xijing University, Xi’an 710123, China
- Zhu-Hong You
- School of Information Engineering, Xijing University, Xi’an 710123, China
- Li-Ping Li
- School of Information Engineering, Xijing University, Xi’an 710123, China
- Xin-Ke Zhan
- School of Information Engineering, Xijing University, Xi’an 710123, China
- Jie Pan
- School of Information Engineering, Xijing University, Xi’an 710123, China
|
20
|
Cervantes J, Garcia-Lamont F, Rodríguez-Mazahua L, Lopez A. A comprehensive survey on support vector machine classification: Applications, challenges and trends. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2019.10.118] [Citation(s) in RCA: 312] [Impact Index Per Article: 62.4] [Indexed: 01/27/2023]
|
21
|
Li J, Shi X, You ZH, Yi HC, Chen Z, Lin Q, Fang M. Using Weighted Extreme Learning Machine Combined With Scale-Invariant Feature Transform to Predict Protein-Protein Interactions From Protein Evolutionary Information. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:1546-1554. [PMID: 31940546 DOI: 10.1109/tcbb.2020.2965919] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 06/10/2023]
Abstract
Protein-Protein Interactions (PPIs) play an irreplaceable role in the biological activities of organisms. Although many high-throughput methods are used to identify PPIs in different kinds of organisms, they have some shortcomings, such as being costly and time-consuming. To solve these problems, computational methods have been developed to predict PPIs. In this paper, we present a method to predict PPIs using protein sequences. First, protein sequences are transformed into a Position Weight Matrix (PWM), from which the Scale-Invariant Feature Transform (SIFT) algorithm is used to extract features. Then Principal Component Analysis (PCA) is applied to reduce the dimension of the features. Finally, a Weighted Extreme Learning Machine (WELM) classifier is employed to predict PPIs, and a series of evaluation results are obtained. Since SIFT and WELM are used to extract features and classify, respectively, we call the proposed method SIFT-WELM. When the proposed method is applied to three well-known PPI datasets of Yeast, Human and Helicobacter pylori, the average accuracies obtained using five-fold cross validation are as high as 94.83, 97.60 and 83.64 percent, respectively. To evaluate the proposed approach properly, we compare it with the Support Vector Machine (SVM) classifier and other recently developed methods in different aspects. Moreover, the training time of our method is greatly shortened compared with previous methods, such as SVM, ACC, PCVMZM and so on.
|
22
|
Zhan XK, You ZH, Li LP, Li Y, Wang Z, Pan J. Using Random Forest Model Combined With Gabor Feature to Predict Protein-Protein Interaction From Protein Sequence. Evol Bioinform Online 2020; 16:1176934320934498. [PMID: 32655275 PMCID: PMC7328357 DOI: 10.1177/1176934320934498] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 03/05/2020] [Accepted: 05/20/2020] [Indexed: 12/12/2022] Open
Abstract
Protein-protein interactions (PPIs) play a crucial role in the life cycles of living cells. Thus, it is important to understand the underlying mechanisms of PPIs. Although many high-throughput technologies have generated large amounts of PPI data in different organisms, the experiments for detecting PPIs are still costly and time-consuming. Therefore, novel computational methods are urgently needed for predicting PPIs. For this reason, developing a new computational method for predicting PPIs is drawing more and more attention. In this study, we propose a novel computational method for predicting PPIs based on texture features of protein sequences. Specifically, the Gabor feature is used to extract texture features and protein evolutionary information from the Position-Specific Scoring Matrix, which is generated by the Position-Specific Iterated Basic Local Alignment Search Tool. Then, random forest–based classifiers are used to infer the protein interactions. When the method was applied to PPI data sets of yeast, human, and Helicobacter pylori, we obtained good results with average accuracies of 92.10%, 97.03%, and 86.45%, respectively. To better evaluate the proposed method, we compared the Gabor feature, Discrete Cosine Transform, and Local Phase Quantization. Our results show that the proposed method is both feasible and stable and that the Gabor feature descriptor is reliable in extracting protein sequence information. Furthermore, additional experiments were conducted to predict PPIs on data sets of 4 other species. The promising results indicate that our proposed method is both powerful and robust.
Affiliation(s)
- Xin-Ke Zhan
- School of Information Engineering, Xijing University, Xi'an, China
- Zhu-Hong You
- School of Information Engineering, Xijing University, Xi'an, China
- Li-Ping Li
- School of Information Engineering, Xijing University, Xi'an, China
- Yang Li
- School of Information Engineering, Xijing University, Xi'an, China
- Zheng Wang
- School of Information Engineering, Xijing University, Xi'an, China
- Jie Pan
- School of Information Engineering, Xijing University, Xi'an, China
|
23
|
Improvement in Hadoop performance using integrated feature extraction and machine learning algorithms. Soft Comput 2019. [DOI: 10.1007/s00500-019-04453-x] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Indexed: 11/25/2022]
|
24
|
Analysis of big data for prediction of provider-initiated preterm birth and spontaneous premature deliveries and ranking the predictive features. Arch Gynecol Obstet 2019; 300:1565-1582. [PMID: 31650230 DOI: 10.1007/s00404-019-05325-3] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 02/05/2019] [Accepted: 10/09/2019] [Indexed: 12/13/2022]
Abstract
PURPOSE The high rate of preterm birth (birth before 37 weeks of gestation) worldwide and its negative outcomes for pregnant women and newborns make it necessary to predict preterm birth and identify its main risk factors. In previous studies, premature deliveries have been divided into provider-initiated (with medical intervention to terminate the pregnancy early) and spontaneous preterm birth (without any intervention) categories. The main aim of this study is to propose methods for predicting provider-initiated preterm birth and spontaneous premature deliveries and for ranking the predictive features. METHODS Data from the national databank of maternal and neonatal records (IMAN registry) are used in the study. The collected data contain information about more than 1,400,000 deliveries with 112 features. Among them, 116,080 preterm births occurred (of which 11,799 and 104,281 cases are provider-initiated preterm births and spontaneous premature deliveries, respectively). The data can be considered big data because of the large number of records, the large number of features, and the unbalanced distribution of the data among the three classes of term, provider-initiated, and spontaneous preterm birth. Therefore, the data need to be analyzed with big data algorithms. In this paper, MapReduce-based machine learning algorithms named MR-PB-PFS are proposed for this purpose. The Map phase uses parallel feature selection and classification methods to score the features. The Reduce phase aggregates the feature scores obtained in the Map phase and assigns final scores to the features. Moreover, the classifiers trained in the Map phase are aggregated based on two different ensemble rules in the Reduce phase. RESULTS Experimental results show that the best performance of the proposed models for preterm birth prediction is an accuracy of 81% and an area under the receiver operating characteristic curve (AUC) of 68%. Top features for predicting term, provider-initiated preterm and spontaneous premature birth identified in this study are having pregnancy risk factors, having gestational diabetes, having cardiovascular disease, maternal underlying diseases, and maternal age. Chronic blood pressure is a high-ranking feature for preterm birth prediction, and father's nationality is highly important for discriminating provider-initiated from spontaneous premature delivery. CONCLUSIONS Identifying pregnant women at high risk of spontaneous premature or therapeutic preterm delivery with our proposed model can help to (1) reduce the probability of premature birth through monitoring and management of the main risk factors and/or (2) educate them to care for the premature newborn. Managing and monitoring the top features discriminating term, provider-initiated preterm and spontaneous premature birth, or their associated factors, can reduce preterm labor or its negative outcomes.
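The Map and Reduce roles described in the METHODS of this abstract can be illustrated with a small single-process sketch. It is not the MR-PB-PFS implementation: the scoring function (absolute difference of class means), the toy partitions, and averaging as the aggregation rule are all assumptions made for illustration, and the function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def map_score_features(partition):
    """Map phase: score every feature on one data partition.

    Assumed scorer (illustrative only): absolute difference between the
    feature's mean in class 1 and its mean in class 0. Each partition must
    contain at least one row of each class.
    """
    rows, labels = partition
    scores = []
    for j in range(len(rows[0])):
        pos = [r[j] for r, y in zip(rows, labels) if y == 1]
        neg = [r[j] for r, y in zip(rows, labels) if y == 0]
        scores.append((j, abs(mean(pos) - mean(neg))))
    return scores


def reduce_scores(mapped):
    """Reduce phase: aggregate per-partition scores into one score per feature.

    Assumed aggregation rule: the mean of the partition-level scores.
    """
    collected = defaultdict(list)
    for partition_scores in mapped:
        for j, score in partition_scores:
            collected[j].append(score)
    return {j: mean(values) for j, values in collected.items()}


# Two toy partitions, each a (rows, labels) pair with two features per row.
partitions = [
    ([[1.0, 5.0], [0.0, 5.0], [1.0, 4.0], [0.0, 4.0]], [1, 0, 1, 0]),
    ([[0.9, 3.0], [0.1, 3.0], [1.0, 2.0], [0.0, 2.0]], [1, 0, 1, 0]),
]

final_scores = reduce_scores(map_score_features(p) for p in partitions)
ranking = sorted(final_scores, key=final_scores.get, reverse=True)
```

In a real MapReduce deployment each call to the map function would run on a separate worker over its own data split; here the generator expression stands in for that parallelism.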
|
25
|
Chen ZH, Li LP, He Z, Zhou JR, Li Y, Wong L. An Improved Deep Forest Model for Predicting Self-Interacting Proteins From Protein Sequence Using Wavelet Transformation. Front Genet 2019; 10:90. [PMID: 30881376 PMCID: PMC6405691 DOI: 10.3389/fgene.2019.00090] [Citation(s) in RCA: 30] [Impact Index Per Article: 5.0] [Received: 10/11/2018] [Accepted: 01/29/2019] [Indexed: 12/23/2022] Open
Abstract
Self-interacting proteins (SIPs), whose more than two identities can interact with each other, play significant roles in the understanding of cellular processes and cell functions. Although a number of experimental methods have been designed to detect SIPs, they remain extremely time-consuming, expensive, and challenging even nowadays. Therefore, there is an urgent need to develop computational methods for predicting SIPs. In this study, we propose a deep forest based predictor for accurate prediction of SIPs using protein sequence information. More specifically, a novel feature representation method, which integrates the position-specific scoring matrix (PSSM) with the wavelet transform, is introduced. To evaluate the performance of the proposed method, cross-validation tests are performed on two widely used benchmark datasets. The experimental results show that the proposed model achieved high accuracies of 95.43% and 93.65% on human and yeast datasets, respectively. The AUC value for evaluating the performance of the proposed method was also reported. The AUC values for the yeast and human datasets are 0.9203 and 0.9586, respectively. To further show the advantage of the proposed method, it is compared with several existing methods. The results demonstrate that the proposed model is better than other SIP prediction methods. This work can offer an effective architecture to biologists for detecting new SIPs.
Affiliation(s)
- Zhan-Heng Chen
- The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, China
- University of Chinese Academy of Sciences, Beijing, China
- Li-Ping Li
- The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, China
- Zhou He
- College of Engineering and Applied Science, University of Colorado Boulder, Boulder, CO, United States
- Ji-Ren Zhou
- The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, China
- Yangming Li
- ECTET, Rochester Institute of Technology, Rochester, NY, United States
- Leon Wong
- The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, China
- University of Chinese Academy of Sciences, Beijing, China
|
26
|
Inference of Large-scale Time-delayed Gene Regulatory Network with Parallel MapReduce Cloud Platform. Sci Rep 2018; 8:17787. [PMID: 30542062 PMCID: PMC6290780 DOI: 10.1038/s41598-018-36180-y] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 08/09/2018] [Accepted: 11/16/2018] [Indexed: 02/06/2023] Open
Abstract
Inference of gene regulatory networks (GRNs) is crucial to understanding intracellular physiological activity and biological function. The identification of large-scale GRNs has been a difficult and hot topic in systems biology in recent years. To reduce the computational load of large-scale GRN identification, a parallel algorithm based on restricted gene expression programming (RGEP), namely MPRGEP, is proposed to infer instantaneous and time-delayed regulatory relationships between transcription factors and target genes. In MPRGEP, the structure and parameters of the time-delayed S-system (TDSS) model are encoded into one chromosome. An original hybrid optimization approach based on a genetic algorithm (GA) and gene expression programming (GEP) is proposed to optimize the TDSS model within the MapReduce framework. Time-delayed GRNs (TDGRNs) with hundreds of genes are used to test the performance of MPRGEP. The experimental results reveal that MPRGEP infers gene regulatory networks more accurately than other state-of-the-art methods and obtains a convincing speedup.
|
27
|
You ZH, Huang W, Zhang S, Huang YA, Yu CQ, Li LP. An Efficient Ensemble Learning Approach for Predicting Protein-Protein Interactions by Integrating Protein Primary Sequence and Evolutionary Information. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2018; 16:809-817. [PMID: 30475726 DOI: 10.1109/tcbb.2018.2882423] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Indexed: 06/09/2023]
Abstract
Protein-protein interactions (PPIs) perform a very important function in many cellular processes, including signal transduction, post-translational modifications, apoptosis, and cell growth. Deregulation of PPIs results in many diseases, including cancer and pernicious anemia. Although many high-throughput methods have been applied to generate a large amount of PPI data, they are generally expensive, inefficient and labor-intensive. Hence, there is an urgent need to develop a computational method that detects PPIs accurately and rapidly. In this article, we propose a highly efficient approach to predict PPIs by integrating a new protein sequence substitution matrix feature representation and an ensemble weighted sparse representation model classifier. The proposed method is demonstrated on the Saccharomyces cerevisiae dataset, where it achieved 99.26% prediction accuracy with 98.53% sensitivity at a precision of 100%, which is much higher predictive accuracy than current state-of-the-art algorithms. Extensive experiments on the benchmark data sets from Human and Helicobacter pylori show that the proposed method achieves markedly better success rates than other existing approaches to this problem. Experimental results illustrate that our proposed method presents an economical approach for the computational building of PPI networks, which can be a helpful supplementary method for future proteomics research.
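As a reading aid for the three figures reported in this abstract, the sketch below shows how accuracy, sensitivity, and precision are computed from confusion-matrix counts. The counts are hypothetical and assume a balanced test set with equal numbers of interacting and non-interacting pairs, which the abstract does not state; under that assumption the three reported values are mutually consistent.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (accuracy, sensitivity, precision) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)  # also called recall or true-positive rate
    precision = tp / (tp + fp)
    return accuracy, sensitivity, precision


# Hypothetical counts (an assumption, not data from the paper):
# 10,000 positive pairs and 10,000 negative pairs, no false positives.
acc, sens, prec = classification_metrics(tp=9853, fp=0, tn=10000, fn=147)
print(f"accuracy={acc:.4f} sensitivity={sens:.4f} precision={prec:.4f}")
```

With no false positives, precision is exactly 1, and accuracy falls halfway between sensitivity and 1 on a balanced set, which is why 98.53% sensitivity is compatible with an accuracy near 99.26%.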
|
28
|
Using Two-dimensional Principal Component Analysis and Rotation Forest for Prediction of Protein-Protein Interactions. Sci Rep 2018; 8:12874. [PMID: 30150728 PMCID: PMC6110764 DOI: 10.1038/s41598-018-30694-1] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Received: 03/15/2018] [Accepted: 07/17/2018] [Indexed: 11/09/2022] Open
Abstract
The interaction among proteins is essential in all life activities, and it is the basis of all the metabolic activities of cells. By studying protein-protein interactions (PPIs), researchers can better interpret the function of proteins and decode the phenomena of life, which is of great practical value, especially in the design of new drugs. Although many high-throughput techniques have been devised for large-scale detection of PPIs, these methods are still expensive and time-consuming. For this reason, there is a pressing need to develop computational methods for predicting PPIs at the entire proteome scale. In this article, we propose a new approach to predict PPIs using a Rotation Forest (RF) classifier combined with a matrix-based protein sequence representation. We apply the Position-Specific Scoring Matrix (PSSM), which contains biological evolution information, to represent protein sequences and extract the features through the two-dimensional Principal Component Analysis (2DPCA) algorithm. The descriptors are then sent to the rotation forest classifier for classification. We obtained 97.43% prediction accuracy with 94.92% sensitivity at a precision of 99.93% when the proposed method was applied to the PPI data of yeast. To evaluate the performance of the proposed method, we compared it with other methods on the same dataset and validated it on independent data. The results obtained show that the proposed method is an appropriate and promising method for predicting PPIs.
|
29
|
Deep Neural Network Based Predictions of Protein Interactions Using Primary Sequences. Molecules 2018; 23:molecules23081923. [PMID: 30071670 PMCID: PMC6222503 DOI: 10.3390/molecules23081923] [Citation(s) in RCA: 66] [Impact Index Per Article: 9.4] [Received: 06/13/2018] [Revised: 07/16/2018] [Accepted: 07/28/2018] [Indexed: 01/01/2023] Open
Abstract
Machine learning based predictions of protein–protein interactions (PPIs) could provide valuable insights into protein functions, disease occurrence, and therapy design on a large scale. The intensive feature engineering in most of these methods makes the prediction task more tedious and trivial. The emerging deep learning technology enabling automatic feature engineering is gaining great success in various fields. However, the over-fitting and generalization of its models are not yet well investigated in most scenarios. Here, we present a deep neural network framework (DNN-PPI) for predicting PPIs using features learned automatically only from protein primary sequences. Within the framework, the sequences of two interacting proteins are sequentially fed into the encoding, embedding, convolution neural network (CNN), and long short-term memory (LSTM) neural network layers. Then, a concatenated vector of the two outputs from the previous layer is wired as the input of the fully connected neural network. Finally, the Adam optimizer is applied to learn the network weights in a back-propagation fashion. The different types of features, including semantic associations between amino acids, position-related sequence segments (motif), and their long- and short-term dependencies, are captured in the embedding, CNN and LSTM layers, respectively. When the model was trained on Pan’s human PPI dataset, it achieved a prediction accuracy of 98.78% at the Matthew’s correlation coefficient (MCC) of 97.57%. The prediction accuracies for six external datasets ranged from 92.80% to 97.89%, making them superior to those achieved with previous methods. When performed on Escherichia coli, Drosophila, and Caenorhabditis elegans datasets, DNN-PPI obtained prediction accuracies of 95.949%, 98.389%, and 98.669%, respectively. The performances in cross-species testing among the four species above coincided in their evolutionary distances. 
However, when testing Mus musculus using the models from those species, they all obtained prediction accuracies of over 92.43%, which is difficult to achieve and worthy of note for further study. These results suggest that DNN-PPI has remarkable generalization ability and is a promising tool for identifying protein interactions.
|
30
|
Pashazadeh A, Navimipour NJ. Big data handling mechanisms in the healthcare applications: A comprehensive and systematic literature review. J Biomed Inform 2018; 82:47-62. [PMID: 29655946 DOI: 10.1016/j.jbi.2018.03.014] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Received: 06/22/2017] [Revised: 11/19/2017] [Accepted: 03/23/2018] [Indexed: 01/08/2023]
Abstract
Healthcare provides many services such as diagnosis, treatment, and prevention of diseases, illnesses, injuries, and other physical and mental disorders. Large-scale distributed data processing applications in healthcare, as a basic concept, operate on large amounts of data. Therefore, big data application functions are the main part of healthcare operations, but there has not been any comprehensive and systematic survey studying and evaluating the important techniques in this field. Therefore, this paper aims to provide a comprehensive, detailed, and systematic study of the state-of-the-art mechanisms in big data related to healthcare applications in five categories, including machine learning, cloud-based, heuristic-based, agent-based, and hybrid mechanisms. Also, this paper presents a systematic literature review (SLR) of the big data applications in the healthcare literature up to the end of 2016. Initially, 205 papers were identified, but a paper selection process reduced the number of papers to 29 important studies.
Affiliation(s)
- Asma Pashazadeh
- Department of Computer Engineering, Tabriz Branch, Islamic Azad University, Tabriz, Iran
- Nima Jafari Navimipour
- Department of Computer Engineering, Tabriz Branch, Islamic Azad University, Tabriz, Iran
|
31
|
Prediction of Protein-Protein Interactions from Amino Acid Sequences Based on Continuous and Discrete Wavelet Transform Features. Molecules 2018; 23:molecules23040823. [PMID: 29617272 PMCID: PMC6017726 DOI: 10.3390/molecules23040823] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 03/06/2018] [Revised: 03/25/2018] [Accepted: 03/29/2018] [Indexed: 12/12/2022] Open
Abstract
Protein-protein interactions (PPIs) play important roles in various aspects of the structural and functional organization of cells; thus, detecting PPIs is one of the most important issues in current molecular biology. Although much effort has been devoted to using high-throughput techniques to identify protein-protein interactions, the experimental methods are both time-consuming and costly, and they yield high rates of false positive and false negative results. Moreover, most of the proposed computational methods are limited in information about protein homology or the interaction marks of the protein partners. In this paper, we report a computational method that uses only the information from protein sequences. The main improvements come from a novel protein sequence representation combining the continuous and discrete wavelet transforms and from adopting a weighted sparse representation-based classifier (WSRC). The proposed method was used to predict PPIs from three different datasets: yeast, human and H. pylori. In addition, we employed the prediction model trained on the yeast PPI dataset to predict the PPIs of six datasets of other species. To further evaluate the performance of the prediction model, we compared WSRC with the state-of-the-art support vector machine classifier. When predicting PPIs on the yeast, human and H. pylori datasets, we obtained high average prediction accuracies of 97.38%, 98.92% and 93.93%, respectively. In the cross-species experiments, most of the prediction accuracies are over 94%. These promising results show that the proposed method is indeed capable of obtaining higher performance in PPI detection.
|
32
|
Wang J, Yang J, Zhang J, Wang X, Zhang W(C. Big data driven cycle time parallel prediction for production planning in wafer manufacturing. ENTERP INF SYST-UK 2018. [DOI: 10.1080/17517575.2018.1450998] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Indexed: 10/17/2022]
Affiliation(s)
- Junliang Wang
- College of Mechanical Engineering, Donghua University, Shanghai, China
- Jungang Yang
- School of Mechanical Engineering, Shanghai Jiao Tong University, Shanghai, China
- Jie Zhang
- College of Mechanical Engineering, Donghua University, Shanghai, China
- Xiaoxi Wang
- College of Mechanical Engineering, Donghua University, Shanghai, China
|
33
|
Chen Q, Cao F. Distributed support vector machine in master-slave mode. Neural Netw 2018; 101:94-100. [PMID: 29494875 DOI: 10.1016/j.neunet.2018.02.006] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 10/21/2017] [Revised: 01/31/2018] [Accepted: 02/06/2018] [Indexed: 11/16/2022]
Abstract
It is well known that the support vector machine (SVM) is an effective learning algorithm. The alternating direction method of multipliers (ADMM) algorithm has emerged as a powerful technique for solving distributed optimisation models. This paper proposes a distributed SVM algorithm in a master-slave mode (MS-DSVM), which integrates a distributed SVM and ADMM acting in a master-slave configuration where the master node and slave nodes are connected, meaning the results can be broadcast. The distributed SVM is regarded as a regularised optimisation problem and modelled as a series of convex optimisation sub-problems that are solved by ADMM. Additionally, the over-relaxation technique is utilised to accelerate the convergence rate of the proposed MS-DSVM. Our theoretical analysis demonstrates that the proposed MS-DSVM has linear convergence, meaning it possesses the fastest convergence rate among existing standard distributed ADMM algorithms. Numerical examples demonstrate that the convergence and accuracy of the proposed MS-DSVM are superior to those of existing methods under the ADMM framework.
Affiliation(s)
- Qingguo Chen
- Department of Applied Mathematics, College of Sciences, China Jiliang University, Hangzhou 310018, Zhejiang Province, PR China
- Feilong Cao
- Department of Applied Mathematics, College of Sciences, China Jiliang University, Hangzhou 310018, Zhejiang Province, PR China
|
34
|
A contiguous column coherent evolution biclustering algorithm for time-series gene expression data. INT J MACH LEARN CYB 2018. [DOI: 10.1007/s13042-015-0487-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 10/22/2022]
|
35
|
Raja MAZ, Asma K, Aslam MS. Bio-inspired computational heuristics to study models of HIV infection of CD4+ T-cell. INT J BIOMATH 2018. [DOI: 10.1142/s1793524518500195] [Citation(s) in RCA: 31] [Impact Index Per Article: 4.4] [Indexed: 11/18/2022]
Abstract
In this work, a biologically inspired computing framework is developed for the model of HIV infection of CD4+ T-cells using feed-forward artificial neural networks (ANNs), genetic algorithms (GAs), sequential quadratic programming (SQP) and a hybrid approach based on GA-SQP. The mathematical model for HIV infection of CD4+ T-cells is represented with the help of initial value problems (IVPs) based on a system of ordinary differential equations (ODEs). The ANN model for the system is constructed by exploiting its strength of universal approximation. An objective function is developed for the system through unsupervised error using ANNs in the mean square sense. Training of the ANN weights is carried out with GAs for effective global search, supported by SQP for efficient local search. The proposed scheme is evaluated on a number of scenarios for the HIV infection model by taking different levels of infected cells, natural substitution rates of uninfected cells, and virus particles. The approximate solutions are compared with the results of the Adams numerical solver to establish the correctness of the proposed scheme. Accuracy and convergence of the approach are validated through statistical analysis based on a sufficiently large number of independent runs.
Affiliation(s)
- Muhammad Asif Zahoor Raja
- Department of Electrical Engineering, COMSATS Institute of Information Technology, Attock Campus, Attock, Pakistan
- Kiran Asma
- Department of Computer Sciences, COMSATS Institute of Information Technology, Attock Campus, Attock, Pakistan
- Muhammad Saeed Aslam
- Pakistan Institute of Engineering and Applied Sciences, Nilore Islamabad, Pakistan
|
36
|
An Ensemble Classifier with Random Projection for Predicting Protein–Protein Interactions Using Sequence and Evolutionary Information. APPLIED SCIENCES-BASEL 2018. [DOI: 10.3390/app8010089] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Indexed: 01/31/2023]
|
37
|
Li J, Shi X, You Z, Chen Z, Lin Q, Fang M. Using Weighted Extreme Learning Machine Combined with Scale-Invariant Feature Transform to Predict Protein-Protein Interactions from Protein Evolutionary Information. INTELLIGENT COMPUTING THEORIES AND APPLICATION 2018:527-532. [DOI: 10.1007/978-3-319-95930-6_49] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/30/2023]
|
38
|
Yuan Y, Luo X, Shang MS. Effects of preprocessing and training biases in latent factor models for recommender systems. Neurocomputing 2018. [DOI: 10.1016/j.neucom.2017.10.040] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Indexed: 11/17/2022]
|
39
|
Li JQ, You ZH, Li X, Ming Z, Chen X. PSPEL: In Silico Prediction of Self-Interacting Proteins from Amino Acids Sequences Using Ensemble Learning. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2017; 14:1165-1172. [PMID: 28092572 DOI: 10.1109/tcbb.2017.2649529] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.1] [Indexed: 06/06/2023]
Abstract
Self-interacting proteins (SIPs) play an important role in various aspects of the structural and functional organization of the cell. Detecting SIPs is one of the most important issues in current molecular biology. Although a large amount of SIP data has been generated by experimental methods, wet laboratory approaches are both time-consuming and costly. In addition, they yield high false negative and false positive rates. Thus, there is a great need for in silico methods to predict SIPs accurately and efficiently. In this study, a new sequence-based method is proposed to predict SIPs. The evolutionary information contained in the Position-Specific Scoring Matrix (PSSM) is extracted from proteins with known sequences. Then, the features are fed to an ensemble classifier to distinguish self-interacting from non-self-interacting proteins. When performed on Saccharomyces cerevisiae and Human SIP data sets, the proposed method achieves high accuracies of 86.86 and 91.30 percent, respectively. Our method also shows good performance when compared with the SVM classifier and previous methods. Consequently, the proposed method can be considered a novel and promising tool to predict SIPs.
|
40
|
Chen X, Gong Y, Zhang DH, You ZH, Li ZW. DRMDA: deep representations-based miRNA-disease association prediction. J Cell Mol Med 2017; 22:472-485. [PMID: 28857494 PMCID: PMC5742725 DOI: 10.1111/jcmm.13336] [Citation(s) in RCA: 57] [Impact Index Per Article: 7.1] [Received: 03/06/2017] [Accepted: 07/01/2017] [Indexed: 12/22/2022] Open
Abstract
Recently, microRNAs (miRNAs) have been confirmed to be important molecules within many crucial biological processes and are therefore related to various complex human diseases. However, previous methods of predicting miRNA–disease associations have their own deficiencies. Under this circumstance, we developed a prediction method called deep representations‐based miRNA–disease association (DRMDA) prediction. The original miRNA–disease association data were extracted from the HMDD database. A stacked auto‐encoder, a greedy layer‐wise unsupervised pre‐training algorithm and a support vector machine were implemented to predict potential associations. We compared DRMDA with five previous classical prediction models (HGIMDA, RLSMDA, HDMP, WBSMDA and RWRMDA) in global leave‐one‐out cross‐validation (LOOCV), local LOOCV and fivefold cross‐validation. The AUCs achieved by DRMDA were 0.9177, 0.8339 and 0.9156 ± 0.0006 in the three tests above, respectively. In further case studies, we predicted the top 50 potential miRNAs for colon neoplasms, lymphoma and prostate neoplasms, and 88%, 90% and 86% of the predicted miRNAs could be verified by experimental evidence, respectively. In conclusion, DRMDA is a promising prediction method which could identify potential and novel miRNA–disease associations.
Affiliation(s)
- Xing Chen
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, China
- Yao Gong
- School of Life Science, Peking University, Beijing, China
- De-Hong Zhang
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, China
- Zhu-Hong You
- Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Science, Ürümqi, China
- Zheng-Wei Li
- School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, China
|
41
|
Hu L, Yuan X, Hu P, Chan KC. Efficiently predicting large-scale protein-protein interactions using MapReduce. Comput Biol Chem 2017; 69:202-206. [PMID: 28396055 DOI: 10.1016/j.compbiolchem.2017.03.009] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 03/25/2017] [Accepted: 03/27/2017] [Indexed: 10/19/2022]
|
42
|
Sun T, Zhou B, Lai L, Pei J. Sequence-based prediction of protein protein interaction using a deep-learning algorithm. BMC Bioinformatics 2017; 18:277. [PMID: 28545462 PMCID: PMC5445391 DOI: 10.1186/s12859-017-1700-2] [Citation(s) in RCA: 190] [Impact Index Per Article: 23.8] [Received: 03/06/2017] [Accepted: 05/18/2017] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Protein-protein interactions (PPIs) are critical for many biological processes. It is therefore important to develop accurate high-throughput methods for identifying PPI to better understand protein function, disease occurrence, and therapy design. Though various computational methods for predicting PPI have been developed, their robustness for prediction with external datasets is unknown. Deep-learning algorithms have achieved successful results in diverse areas, but their effectiveness for PPI prediction has not been tested. RESULTS We used a stacked autoencoder, a type of deep-learning algorithm, to study the sequence-based PPI prediction. The best model achieved an average accuracy of 97.19% with 10-fold cross-validation. The prediction accuracies for various external datasets ranged from 87.99% to 99.21%, which are superior to those achieved with previous methods. CONCLUSIONS To our knowledge, this research is the first to apply a deep-learning algorithm to sequence-based PPI prediction, and the results demonstrate its potential in this field.
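For readers unfamiliar with the evaluation protocol, a 10-fold cross-validation split can be sketched as follows. This is a generic illustration rather than the authors' code: the helper name, the shuffling seed, and the round-robin fold assignment are all assumptions.

```python
import random


def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Round-robin assignment keeps fold sizes within one sample of each other.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


# With 100 samples and k=10, every fold has 90 training and 10 test indices.
for train, test in k_fold_splits(100, k=10):
    print(len(train), len(test))
```

The reported accuracy would then be the average of the per-fold test accuracies, with each sample used for testing exactly once.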
Affiliation(s)
- Tanlin Sun
- Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, China
- Bo Zhou
- Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, China
- Luhua Lai
- Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, China; Beijing National Laboratory for Molecular Science, State Key Laboratory for Structural Chemistry of Unstable and Stable Species, College of Chemistry and Molecular Engineering, Peking University, Beijing, 100871, China; Peking-Tsinghua Center for Life Sciences, Peking University, Beijing, 100871, China
- Jianfeng Pei
- Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, China
|
43
|
Wang Y, You Z, Li X, Chen X, Jiang T, Zhang J. PCVMZM: Using the Probabilistic Classification Vector Machines Model Combined with a Zernike Moments Descriptor to Predict Protein-Protein Interactions from Protein Sequences. Int J Mol Sci 2017; 18:ijms18051029. [PMID: 28492483 PMCID: PMC5454941 DOI: 10.3390/ijms18051029] [Citation(s) in RCA: 46] [Impact Index Per Article: 5.8] [Received: 03/24/2017] [Revised: 04/24/2017] [Accepted: 04/29/2017] [Indexed: 01/08/2023] Open
Abstract
Protein–protein interactions (PPIs) are essential for most processes in living organisms. Thus, detecting PPIs is extremely important for understanding the molecular mechanisms of biological systems. Although much PPI data has been generated by high-throughput technologies for a variety of organisms, the whole interactome is still far from complete. In addition, the high-throughput technologies for detecting PPIs have some unavoidable defects, including time consumption, high cost, and high error rates. In recent years, with the development of machine learning, computational methods have been broadly used to predict PPIs and can achieve good prediction rates. In this paper, we present PCVMZM, a computational method based on a Probabilistic Classification Vector Machines (PCVM) model and a Zernike moments (ZM) descriptor for predicting PPIs from protein amino acid sequences. Specifically, the ZM descriptor is used to extract protein evolutionary information from the Position-Specific Scoring Matrix (PSSM) generated by the Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST). Then, the PCVM classifier is used to infer the interactions among proteins. When performed on PPI datasets of Yeast and H. pylori, the proposed method achieves average prediction accuracies of 94.48% and 91.25%, respectively. To further evaluate the performance of the proposed method, the state-of-the-art support vector machine (SVM) classifier is used and compared with the PCVM model. Experimental results on the Yeast dataset show that the PCVM classifier performs better than the SVM classifier. The experimental results indicate that our proposed method is robust, powerful and feasible, and can be used as a helpful tool for proteomics research.
Affiliation(s)
- Yanbin Wang
- Xinjiang Technical Institutes of Physics and Chemistry, Chinese Academy of Science, Urumqi 830011, China
- Zhuhong You
- Xinjiang Technical Institutes of Physics and Chemistry, Chinese Academy of Science, Urumqi 830011, China
- Xiao Li
- Xinjiang Technical Institutes of Physics and Chemistry, Chinese Academy of Science, Urumqi 830011, China
- Xing Chen
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
- Tonghai Jiang
- Xinjiang Technical Institutes of Physics and Chemistry, Chinese Academy of Science, Urumqi 830011, China
- Jingting Zhang
- Department of Mathematics and Statistics, Henan University, Kaifeng 100190, China
|
44
|
|
45
|
You ZH, Zhou M, Luo X, Li S. Highly Efficient Framework for Predicting Interactions Between Proteins. IEEE TRANSACTIONS ON CYBERNETICS 2017; 47:731-743. [PMID: 28113829 DOI: 10.1109/tcyb.2016.2524994] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.5] [Indexed: 06/06/2023]
Abstract
Protein-protein interactions (PPIs) play a central role in many biological processes. Although a large amount of human PPI data has been generated by high-throughput experimental techniques, it is very limited compared to the estimated 130 000 protein interactions in humans. Hence, automatic methods for human PPI detection are highly desired. This work proposes a novel framework, i.e., Low-rank approximation-kernel Extreme Learning Machine (LELM), for automatically detecting human PPIs from protein primary sequences. It has three main steps: 1) mapping each protein sequence into a matrix built on all kinds of adjacent amino acids; 2) applying the low-rank approximation model to the obtained matrix to solve its lowest rank representation, which reflects its true subspace structures; and 3) utilizing a powerful kernel extreme learning machine to predict the probability for PPI based on this lowest rank representation. Experimental results on a large-scale human PPI dataset demonstrate that the proposed LELM has significant advantages in accuracy and efficiency over the state-of-the-art approaches. Hence, this work establishes a new and effective way for the automatic detection of PPI.
|
46
|
You ZH, Li X, Chan KCC. An improved sequence-based prediction protocol for protein-protein interactions using amino acids substitution matrix and rotation forest ensemble classifiers. Neurocomputing 2017. [DOI: 10.1016/j.neucom.2016.10.042] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Indexed: 10/20/2022]
|
47
|
|
48
|
|
49
|
Huang YA, You ZH, Chen X, Yan GY. Improved protein-protein interactions prediction via weighted sparse representation model combining continuous wavelet descriptor and PseAA composition. BMC SYSTEMS BIOLOGY 2016; 10:120. [PMID: 28155718 PMCID: PMC5260127 DOI: 10.1186/s12918-016-0360-6] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Indexed: 12/03/2022]
Abstract
Background Protein-protein interactions (PPIs) are essential to most biological processes. Since bioscience has entered the era of the genome and proteome, there is a growing demand for knowledge about PPI networks. High-throughput biological technologies can be used to identify new PPIs, but they are expensive, time-consuming, and tedious. Therefore, computational methods for predicting PPIs have an important role. In recent years, an increasing number of computational methods, such as protein structure-based approaches, have been proposed for predicting PPIs. The major limitation of these methods is that they require prior information about the protein to infer PPIs. Therefore, it is of great significance to develop computational methods that use only protein amino acid sequence information. Results Here, we report a highly efficient approach for predicting PPIs. The main improvements come from the use of a novel protein sequence representation combining a continuous wavelet descriptor and Chou’s pseudo amino acid composition (PseAAC), and from adopting a weighted sparse representation based classifier (WSRC). This method, cross-validated on the PPI datasets of Saccharomyces cerevisiae, Human and H. pylori, achieves excellent results, with accuracies as high as 92.50%, 95.54% and 84.28%, respectively, significantly better than previously proposed methods. Extensive experiments are performed to compare the proposed method with the state-of-the-art Support Vector Machine (SVM) classifier. Conclusions The outstanding results yielded by our model show that the proposed feature extraction method, combining two kinds of descriptors, has strong expression ability and is expected to provide comprehensive and effective information for machine learning-based classification models. In addition, the prediction performance in the comparison experiments shows that the combined feature and WSRC work well together. Thus, the proposed method is a very efficient method to predict PPIs and may be a useful supplementary tool for future proteomics studies.
Affiliation(s)
- Yu-An Huang
- Department of Computing, Hong Kong Polytechnic University, Hung Hom, Hong Kong, China
- Zhu-Hong You
- Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, 830011, China
- Xing Chen
- School of Information and Electrical Engineering, China University of Mining and Technology, Xuzhou, 221116, China
- Gui-Ying Yan
- Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100010, China
|
50
|
An JY, You ZH, Chen X, Huang DS, Li ZW, Liu G, Wang Y. Identification of self-interacting proteins by exploring evolutionary information embedded in PSI-BLAST-constructed position specific scoring matrix. Oncotarget 2016; 7:82440-82449. [PMID: 27732957 PMCID: PMC5347703 DOI: 10.18632/oncotarget.12517] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Received: 06/29/2016] [Accepted: 09/28/2016] [Indexed: 01/31/2023] Open
Abstract
Self-interacting proteins (SIPs) play an essential role in a wide range of biological processes, such as gene expression regulation, signal transduction, enzyme activation and immune response. Because of the limitations of experimental identification of self-interacting proteins, developing an effective computational method based on protein sequence to detect SIPs is very important. In this study, we proposed a novel computational approach called RVMBIGP that combines the Relevance Vector Machine (RVM) model and Bi-gram probability (BIGP) to predict SIPs based on protein sequence. The proposed prediction model includes the following steps: (1) an effective feature extraction method named BIGP is used to represent protein sequences on the Position Specific Scoring Matrix (PSSM); (2) the Principal Component Analysis (PCA) method is employed to integrate the useful information and reduce the influence of noise; (3) the robust Relevance Vector Machine (RVM) classifier is used to carry out classification. When performed on yeast and human datasets, the proposed RVMBIGP model achieves very high accuracies of 95.48% and 98.80%, respectively. The experimental results show that our proposed method is very promising and may provide a cost-effective alternative for SIP identification. In addition, to facilitate extensive studies for future proteomics research, the RVMBIGP server is freely available for academic use at http://219.219.62.123:8888/RVMBIGP.
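The abstract does not spell out the BIGP formula. A minimal sketch of step (1), assuming the commonly used bi-gram definition on a PSSM, B[m][n] = sum over positions i of P[i][m] * P[i+1][n], is given below. For a 20-column PSSM this yields a fixed-length 400-dimensional vector regardless of sequence length. The function name and the toy matrix are hypothetical, any normalisation of the PSSM is assumed to have been done beforehand, and the PCA and RVM steps are not shown.

```python
def bigram_features(pssm):
    """Flatten B[m][n] = sum_i pssm[i][m] * pssm[i+1][n] into one vector.

    `pssm` is a list of rows (one per residue), each with the same number
    of columns (20 for a real PSSM).
    """
    n_cols = len(pssm[0])
    feats = [0.0] * (n_cols * n_cols)
    # Walk over consecutive residue pairs (i, i+1).
    for row, next_row in zip(pssm, pssm[1:]):
        for m, p_m in enumerate(row):
            for n, p_n in enumerate(next_row):
                feats[m * n_cols + n] += p_m * p_n
    return feats


# Toy 3-residue "PSSM" with 2 columns instead of 20 (hypothetical values).
toy = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
print(bigram_features(toy))  # [0.5, 1.0, 0.0, 0.5]
```

The appeal of this kind of descriptor is that proteins of different lengths map to vectors of the same size, which is what a downstream PCA and classifier require.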
Affiliation(s)
- Ji-Yong An
- School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 21116, China
- Zhu-Hong You
- Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Science, Urumqi 830011, China
- Xing Chen
- School of Information and Electrical Engineering, China University of Mining and Technology, Xuzhou 221116, China
- De-Shuang Huang
- School of Electronics and Information Engineering, Tongji University, Shanghai 201804, China
- Zheng-Wei Li
- School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 21116, China
- Gang Liu
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, Guangdong 518060, China
- Yin Wang
- School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 21116, China
|