1
Liu Y, Liu Y, Li Z. Protein-Protein Interaction Prediction via Structure-Based Deep Learning. Proteins 2024. [PMID: 38923590 DOI: 10.1002/prot.26721] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2023] [Revised: 05/04/2024] [Accepted: 06/04/2024] [Indexed: 06/28/2024]
Abstract
Protein-protein interactions (PPIs) play an essential role in life activities. Many artificial intelligence algorithms based on protein sequence information have been developed to predict PPIs. However, these models have difficulty handling variable sequence lengths and suffer from low generalization and prediction accuracy. In this study, we propose a novel end-to-end deep learning framework, RSPPI, which combines a residual neural network (ResNet) with spatial pyramid pooling (SPP) to predict PPIs from the physicochemical properties of the protein sequence and spatial structural information. In the RSPPI model, ResNet extracts structural and physicochemical information from the protein three-dimensional structure and primary sequence, and the SPP layer transforms the feature maps into a single vector, avoiding the fixed-length requirement. The RSPPI model showed excellent cross-species performance and outperformed several state-of-the-art methods based on either protein sequence or gene ontology in most evaluation metrics. The RSPPI model provides a novel strategy for developing AI-based PPI prediction algorithms.
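How an SPP layer removes the fixed-length requirement can be illustrated with a minimal sketch. It assumes max pooling over a hypothetical three-level pyramid (1x1, 2x2 and 4x4 grids); the abstract does not state the pyramid levels or pooling operator used in RSPPI, and the function name `spatial_pyramid_pool` is illustrative only.

```python
import math


def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a 2-D feature map (list of rows) into a fixed-length vector.

    For each pyramid level n the map is divided into an n x n grid and the
    maximum of each cell is kept. The output length is sum(n * n over levels),
    whatever the height and width of the input.
    """
    height = len(feature_map)
    width = len(feature_map[0])
    pooled = []
    for n in levels:  # assumed pyramid levels, not taken from the paper
        for i in range(n):
            # Row range of grid cell i; floor/ceil bounds keep it non-empty.
            r0 = (i * height) // n
            r1 = max(r0 + 1, math.ceil((i + 1) * height / n))
            for j in range(n):
                c0 = (j * width) // n
                c1 = max(c0 + 1, math.ceil((j + 1) * width / n))
                pooled.append(max(feature_map[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled


# Two maps of different sizes yield vectors of the same length (1 + 4 + 16).
small = [[1, 2, 3], [4, 5, 6]]
large = [[r * 7 + c for c in range(7)] for r in range(5)]
print(len(spatial_pyramid_pool(small)), len(spatial_pyramid_pool(large)))  # 21 21
```

Because the grid is defined relative to the input size rather than in absolute pixels, proteins of different lengths would map to vectors of identical length, which is what allows a fully connected classifier to follow the convolutional layers.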
Affiliation(s)
- Yucong Liu
  - Shanghai Key Laboratory of Mechanics in Energy Engineering, Shanghai Institute of Applied Mathematics and Mechanics, School of Mechanics and Engineering Science, Shanghai University, Shanghai, China
- Yijun Liu
  - Shanghai Key Laboratory of Mechanics in Energy Engineering, Shanghai Institute of Applied Mathematics and Mechanics, School of Mechanics and Engineering Science, Shanghai University, Shanghai, China
- Zhenhai Li
  - Shanghai Key Laboratory of Mechanics in Energy Engineering, Shanghai Institute of Applied Mathematics and Mechanics, School of Mechanics and Engineering Science, Shanghai University, Shanghai, China
2
Lannelongue L, Inouye M. Pitfalls of machine learning models for protein-protein interaction networks. Bioinformatics 2024; 40:btae012. [PMID: 38200587 PMCID: PMC10868344 DOI: 10.1093/bioinformatics/btae012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/16/2023] [Revised: 11/24/2023] [Accepted: 01/09/2024] [Indexed: 01/12/2024] Open
Abstract
MOTIVATION Protein-protein interactions (PPIs) are essential to understanding biological pathways as well as their roles in development and disease. Computational tools, based on classic machine learning, have been successful at predicting PPIs in silico, but the lack of consistent and reliable frameworks for this task has led to network models that are difficult to compare and discrepancies between algorithms that remain unexplained. RESULTS To better understand the underlying inference mechanisms that underpin these models, we designed an open-source framework for benchmarking that accounts for a range of biological and statistical pitfalls while facilitating reproducibility. We use it to shed light on the impact of network topology and how different algorithms deal with highly connected proteins. By studying functional genomics-based and sequence-based models on human PPIs, we show their complementarity as the former performs best on lone proteins while the latter specializes in interactions involving hubs. We also show that algorithm design has little impact on performance with functional genomic data. We replicate our results between both human and S. cerevisiae data and demonstrate that models using functional genomics are better suited to PPI prediction across species. With rapidly increasing amounts of sequence and functional genomics data, our study provides a principled foundation for future construction, comparison, and application of PPI networks. AVAILABILITY AND IMPLEMENTATION The code and data are available on GitHub: https://github.com/Llannelongue/B4PPI.
Affiliation(s)
- Loïc Lannelongue
  - Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, CB2 0BB Cambridge, United Kingdom
  - British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, CB2 0BB Cambridge, United Kingdom
  - Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, CB2 0BB Cambridge, United Kingdom
  - Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge, United Kingdom
- Michael Inouye
  - Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, CB2 0BB Cambridge, United Kingdom
  - British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, CB2 0BB Cambridge, United Kingdom
  - Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, CB2 0BB Cambridge, United Kingdom
  - Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge, United Kingdom
  - Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, 3004 Victoria, Australia
  - British Heart Foundation Centre of Research Excellence, University of Cambridge, CB2 0BB Cambridge, United Kingdom
3
Wang P, Nie J, Yang L, Zhao J, Wang X, Zhang Y, Zang H, Yang Y, Zeng Z. Plant growth stages covered the legacy effect of rotation systems on microbial community structure and function in wheat rhizosphere. Environmental Science and Pollution Research International 2023; 30:59632-59644. [PMID: 37012567 DOI: 10.1007/s11356-023-26703-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/21/2022] [Accepted: 03/24/2023] [Indexed: 05/10/2023]
Abstract
Legume-based crop rotation is conducive to improving soil multifunctionality, but how the legacy effect of previous legumes influences the rhizosphere microbial community of the following crop across growth stages remains unclear. Here, the wheat rhizosphere microbial community was assessed at the regreening and filling stages following four previous legumes (mungbean, adzuki bean, soybean, and peanut), with cereal maize as a control. The composition and structure of both bacterial and fungal communities varied dramatically between the two growth stages. Differences in fungal community structure among rotation systems were observed at both the regreening and filling stages, while differences in bacterial community structure among rotation systems were observed only at the filling stage. The complexity and centrality of the microbial network decreased as the crop grew. Species associations were stronger in the legume-based rotation systems than in the cereal-based rotation system at the filling stage. The abundance of KEGG orthologs (KOs) associated with carbon, nitrogen, phosphorus, and sulfur metabolism in the bacterial community decreased from the regreening stage to the filling stage. However, there was no difference in the abundance of KOs among rotation systems. Together, our results showed that plant growth stages had a stronger impact than the legacy effect of rotation systems in shaping the wheat rhizosphere microbial community, and the differences among rotation systems were more obvious at the late growth stage. Such compositional, structural, and functional changes may have predictable consequences for crop growth and soil nutrient cycling.
Affiliation(s)
- Peixin Wang
  - College of Agronomy and Biotechnology/Key Laboratory of Farming System of Ministry of Agriculture and Rural Affairs, China Agricultural University, Beijing, 100193, China
  - Institute of Grassland, Flowers and Ecology, Beijing Academy of Agriculture and Forestry Sciences, Beijing, 100097, China
- Jiangwen Nie
  - College of Agronomy and Biotechnology/Key Laboratory of Farming System of Ministry of Agriculture and Rural Affairs, China Agricultural University, Beijing, 100193, China
- Lei Yang
  - College of Agronomy and Biotechnology/Key Laboratory of Farming System of Ministry of Agriculture and Rural Affairs, China Agricultural University, Beijing, 100193, China
- Jie Zhao
  - College of Agronomy and Biotechnology/Key Laboratory of Farming System of Ministry of Agriculture and Rural Affairs, China Agricultural University, Beijing, 100193, China
- Xiquan Wang
  - College of Agronomy and Biotechnology/Key Laboratory of Farming System of Ministry of Agriculture and Rural Affairs, China Agricultural University, Beijing, 100193, China
  - Institute of Agricultural Sources and Regional Planning, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
- Yudan Zhang
  - College of Agronomy and Biotechnology/Key Laboratory of Farming System of Ministry of Agriculture and Rural Affairs, China Agricultural University, Beijing, 100193, China
  - Jining Academy of Agricultural Sciences, Jining, 272000, China
- Huadong Zang
  - College of Agronomy and Biotechnology/Key Laboratory of Farming System of Ministry of Agriculture and Rural Affairs, China Agricultural University, Beijing, 100193, China
- Yadong Yang
  - College of Agronomy and Biotechnology/Key Laboratory of Farming System of Ministry of Agriculture and Rural Affairs, China Agricultural University, Beijing, 100193, China
- Zhaohai Zeng
  - College of Agronomy and Biotechnology/Key Laboratory of Farming System of Ministry of Agriculture and Rural Affairs, China Agricultural University, Beijing, 100193, China
4
Fernandez G, Yubero D, Palau F, Armstrong J. Molecular Modelling Hurdle in the Next-Generation Sequencing Era. Int J Mol Sci 2022; 23:ijms23137176. [PMID: 35806177 PMCID: PMC9266691 DOI: 10.3390/ijms23137176] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2022] [Revised: 06/24/2022] [Accepted: 06/27/2022] [Indexed: 12/10/2022] Open
Abstract
There are challenges in the genetic diagnosis of rare diseases, and pursuing an optimal strategy to identify the cause of the disease is one of the main objectives of any clinical genomics unit. A range of techniques are currently used to characterize the genomic variability within the human genome to detect causative variants of specific disorders. With the introduction of next-generation sequencing (NGS) in the clinical setting, geneticists can study single-nucleotide variants (SNVs) throughout the entire exome/genome. In turn, the number of variants to be evaluated per patient has increased significantly, and more information has to be processed and analyzed to determine a proper diagnosis. Roughly 50% of patients with a Mendelian genetic disorder are diagnosed using NGS, but a fair number of patients still suffer a diagnostic odyssey. Due to the inherent diversity of the human population, as more exomes or genomes are sequenced, variants of uncertain significance (VUSs) will increase exponentially. Thus, assigning relevance to a VUS (non-synonymous as well as synonymous) in an undiagnosed patient becomes crucial to assess the proper diagnosis. Multiple algorithms have been used to predict how a specific mutation might affect the protein’s function, but they are far from accurate enough to be conclusive. In this work, we highlight the difficulties of genomic variability determined by NGS that have arisen in diagnosing rare genetic diseases, and how molecular modelling has to be a key component to elucidate the relevance of a specific mutation in the protein’s loss of function or malfunction. We suggest that the creation of a multi-omics data model should improve the classification of pathogenicity for a significant amount of the detected genomic variability. Moreover, we argue how it should be incorporated systematically in the process of variant evaluation to be useful in the clinical setting and the diagnostic pipeline.
Affiliation(s)
- Guerau Fernandez
  - Department of Genetic and Molecular Medicine—IPER, Hospital Sant Joan de Déu, Institut de Recerca Sant Joan de Déu, 08950 Barcelona, Spain
  - Center for Biomedical Research Network on Rare Diseases (CIBERER), ISCIII, 08950 Barcelona, Spain
- Dèlia Yubero
  - Department of Genetic and Molecular Medicine—IPER, Hospital Sant Joan de Déu, Institut de Recerca Sant Joan de Déu, 08950 Barcelona, Spain
  - Center for Biomedical Research Network on Rare Diseases (CIBERER), ISCIII, 08950 Barcelona, Spain
  - Correspondence: Tel.: +34-93-600-9451; Fax: +34-93-600-9760
- Francesc Palau
  - Department of Genetic and Molecular Medicine—IPER, Hospital Sant Joan de Déu, Institut de Recerca Sant Joan de Déu, 08950 Barcelona, Spain
  - Center for Biomedical Research Network on Rare Diseases (CIBERER), ISCIII, 08950 Barcelona, Spain
  - Division of Pediatrics, University of Barcelona School of Medicine and Health Sciences, 08007 Barcelona, Spain
- Judith Armstrong
  - Department of Genetic and Molecular Medicine—IPER, Hospital Sant Joan de Déu, Institut de Recerca Sant Joan de Déu, 08950 Barcelona, Spain
  - Center for Biomedical Research Network on Rare Diseases (CIBERER), ISCIII, 08950 Barcelona, Spain
5
Chen T, Wu X, Li L, Li J, Feng S. Extraction of entity relations from Chinese medical literature based on multi-scale CRNN. Annals of Translational Medicine 2022; 10:520. [PMID: 35928762 PMCID: PMC9347033 DOI: 10.21037/atm-22-1226] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/16/2022] [Accepted: 04/07/2022] [Indexed: 11/12/2022]
Abstract
Background Entity relation extraction technology can be used to extract entities and relations from medical literature, and automatically establish professional mapping knowledge domains. The classical text classification model, convolutional neural networks for sentence classification (TEXTCNN), has been shown to have good classification performance, but also has a long-distance dependency problem, which is a common problem of convolutional neural networks (CNNs). Recurrent neural networks (RNN) address the long-distance dependency problem but cannot capture text features at a specific scale in the text. Methods To solve these problems, this study sought to establish a model with a multi-scale convolutional recurrent neural network for Sentence Classification (TEXTCRNN) to address the deficiencies in the 2 neural network structures. In entity relation extraction, the entity pair is generally composed of a subject and an object, but as the subject in the entity pair of medical literature is always omitted, it is difficult to use this coding method to obtain general entity position information. Thus, we proposed a new coding method to obtain entity position information to re-establish the relationship between subject and object and complete the entity relation extraction. Results By comparing the benchmark neural network model and 2 typical multi-scale TEXTCRNN models, the TEXTCRNN [bidirectional long- and short-term memory (BiLSTM)] and TEXTCRNN [double-layer stacking gated recurrent unit (GRU)], the results showed that the multi-scale CRNN model had the best F1 value performance, and the TEXTCRNN (double-layer stacking GRU) was more capable of entity relation classification when the same entity word did not belong to the same entity relation. 
Conclusions The experimental results of the entity relation extraction from Pharmacopoeia of the People's Republic of China-Guidelines for Clinical Drug Use-Volume of Chemical Drugs and Biological Products showed that entity relation extraction could effectively proceed using the new labeling method. Additionally, compared to typical neural network models, including the TEXTCNN, GRU, and BiLSTM, the multi-scale convolutional recurrent neural network structure had advantages across several evaluation indicators.
Affiliation(s)
- Tingyin Chen
  - Department of Network and Information, Xiangya Hospital, Central South University, Changsha, China
  - National Clinical Research Center for Geriatric Disorders (Xiangya Hospital), Changsha, China
- Xuehong Wu
  - Hunan Creator Information Technology Co. Ltd, Changsha, China
  - School of Computer Science and Engineering, Central South University, Changsha, China
- Linyi Li
  - Hunan Creator Information Technology Co. Ltd, Changsha, China
  - School of Computer Science and Engineering, Central South University, Changsha, China
- Jianhua Li
  - Hunan Creator Information Technology Co. Ltd, Changsha, China
- Song Feng
  - Department of Network and Information, Xiangya Hospital, Central South University, Changsha, China
  - National Clinical Research Center for Geriatric Disorders (Xiangya Hospital), Changsha, China
6
Xu K, Lin C, Lee SY, Mao L, Meng K. Comparative analysis of complete Ilex (Aquifoliaceae) chloroplast genomes: insights into evolutionary dynamics and phylogenetic relationships. BMC Genomics 2022; 23:203. [PMID: 35287585 PMCID: PMC8922745 DOI: 10.1186/s12864-022-08397-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/23/2021] [Accepted: 02/17/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Ilex (Aquifoliaceae) are of great horticultural importance throughout the world for their foliage and decorative berries, yet a dearth of genetic information has hampered our understanding of phylogenetic relationships and evolutionary history. Here, we compare chloroplast genomes from across Ilex and estimate phylogenetic relationships. RESULTS We sequenced the chloroplast genomes of seven Ilex species and compared them with 34 previously published Ilex plastomes. The length of the seven newly sequenced Ilex chloroplast genomes ranged from 157,182 bp to 158,009 bp, and contained a total of 118 genes, including 83 protein-coding, 31 tRNA, and four rRNA genes. GC content ranged from 37.6 to 37.69%. Comparative analysis showed shared genomic structures and gene rearrangements. Expansion and contraction of the inverted repeat regions at the LSC/IRa and IRa/SSC junctions were observed in 22 and 26 taxa, respectively; in contrast, the IRb boundary was largely invariant. A total of 2146 simple sequence repeats and 2843 large repeats were detected in the 41 Ilex plastomes. Additionally, six genes (psaC, rbcL, trnQ, trnR, trnT, and ycf1) and two intergenic spacer regions (ndhC-trnV and petN-psbM) were identified as hypervariable, and thus potentially useful for future phylogenetic studies and DNA barcoding. We recovered consistent phylogenetic relationships regardless of inference methodology or choice of loci. We recovered five distinct, major clades, which were inconsistent with traditional taxonomic systems. CONCLUSION Our findings challenge traditional circumscriptions of the genus Ilex and provide new insights into the evolutionary history of this important clade. Furthermore, we detail hypervariable and repetitive regions that will be useful for future phylogenetic and population genetic studies.
Affiliation(s)
- Kewang Xu
  - Co-Innovation Center for Sustainable Forestry in Southern China, College of Biology and the Environment, Nanjing Forestry University, Nanjing, 510275, China
- Chenxue Lin
  - Co-Innovation Center for Sustainable Forestry in Southern China, College of Biology and the Environment, Nanjing Forestry University, Nanjing, 510275, China
- Shiou Yih Lee
  - Faculty of Health and Life Sciences, INTI International University, 71800, Nilai, Malaysia
- Lingfeng Mao
  - Co-Innovation Center for Sustainable Forestry in Southern China, College of Biology and the Environment, Nanjing Forestry University, Nanjing, 510275, China
- Kaikai Meng
  - State Key Laboratory of Biocontrol and Guangdong Provincial Key Laboratory of Plant Resources, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
7
Martins YC, Ziviani A, Nicolás MF, de Vasconcelos ATR. Large-Scale Protein Interactions Prediction by Multiple Evidence Analysis Associated With an In-Silico Curation Strategy. Frontiers in Bioinformatics 2021; 1:731345. [PMID: 36303787 PMCID: PMC9581021 DOI: 10.3389/fbinf.2021.731345] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/26/2021] [Accepted: 08/23/2021] [Indexed: 11/17/2022] Open
Abstract
Predicting physical or functional associations through protein-protein interactions (PPIs) is an integral approach for inferring novel protein functions and discovering new drug targets during repositioning analysis. Recent advances in high-throughput data generation and multi-omics techniques have enabled large-scale PPI predictions, thus promoting several computational methods based on different levels of biological evidence. However, integrating multiple results and strategies to optimize the process, extract interaction features automatically, and scale up the entire PPI prediction process is still challenging. Most procedures do not offer an in-silico validation process to evaluate the predicted PPIs. In this context, this paper presents the PredPrIn scientific workflow, which enables PPI prediction based on multiple lines of evidence, including the structure, sequence, and functional annotation categories, by combining boosting and stacking machine learning techniques. We also present a pipeline (PPIVPro) for the validation process based on cellular co-localization filtering and a focused search for PPI evidence in scientific publications. Thus, our combined approach provides a means for large-scale training or prediction of new PPIs and a strategy to evaluate the prediction quality. PredPrIn and PPIVPro are publicly available at https://github.com/YasCoMa/predprin and https://github.com/YasCoMa/ppi_validation_process.
Affiliation(s)
- Yasmmin Côrtes Martins
  - Bioinformatics Laboratory, National Laboratory of Scientific Computing, Petrópolis, Brazil
- Artur Ziviani
  - Data Extreme Lab (DEXL), National Laboratory of Scientific Computing, Petrópolis, Brazil
- Marisa Fabiana Nicolás
  - Bioinformatics Laboratory, National Laboratory of Scientific Computing, Petrópolis, Brazil
- Ana Tereza Ribeiro de Vasconcelos
  - Bioinformatics Laboratory, National Laboratory of Scientific Computing, Petrópolis, Brazil
  - *Correspondence: Ana Tereza Ribeiro de Vasconcelos
8
Li J, Zhang L, Li H, Ping Y, Xu Q, Wang R, Tan R, Wang Z, Liu B, Wang Y. Integrated entropy-based approach for analyzing exons and introns in DNA sequences. BMC Bioinformatics 2019; 20:283. [PMID: 31182012 PMCID: PMC6557737 DOI: 10.1186/s12859-019-2772-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 12/29/2022] Open
Abstract
BACKGROUND Numerous essential algorithms and methods, including entropy-based quantitative methods, have been developed over the last decade to analyze complex DNA sequences. Exons and introns are the most notable components of DNA, and their identification and prediction are a constant focus of state-of-the-art research. RESULTS In this study, we designed an integrated entropy-based analysis approach, which involves a modified topological entropy calculation, a genomic signal processing (GSP) method and singular value decomposition (SVD), to investigate exons and introns in DNA sequences. We optimized and implemented the topological entropy and the generalized topological entropy to calculate the complexity of DNA sequences, highlighting the characteristics of repetitive sequences. By comparing the digitized entropy values of exons and introns, we observed that they are significantly different. After converting DNA data to numerical topological entropy values, we applied the SVD method to effectively investigate exon and intron regions on a single gene sequence. Additionally, several genes across five species were used for exon prediction. CONCLUSIONS Our approach not only helps to explore the complexity of DNA sequences and their functional elements, but also provides an entropy-based GSP method to analyze exon and intron regions. Our work is feasible across different species and extendable to other components in both coding and noncoding regions of DNA sequences.
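As a rough illustration of the kind of quantity involved, the sketch below computes the standard finite-sequence topological entropy in the sense of Koslicki (2011): the base-4 logarithm of the number of distinct length-n subwords in a prefix of length 4^n + n - 1, divided by n. This is an assumption about the starting point only; the modified and generalized variants used in the paper are not described in the abstract, and the function name `topological_entropy` is hypothetical.

```python
import math


def topological_entropy(seq, alphabet_size=4):
    """Topological entropy of a finite sequence (standard definition).

    n is the largest subword length with alphabet_size**n + n - 1 <= len(seq).
    Only the first alphabet_size**n + n - 1 letters are used, so sequences of
    different lengths remain comparable. The result lies in [0, 1].
    """
    seq = seq.upper()
    n = 0
    while alphabet_size ** (n + 1) + n <= len(seq):
        n += 1
    if n == 0:
        raise ValueError("sequence is too short for subword length 1")
    prefix = seq[: alphabet_size ** n + n - 1]
    # Count the distinct subwords of length n in the prefix.
    distinct = {prefix[i:i + n] for i in range(len(prefix) - n + 1)}
    return math.log(len(distinct), alphabet_size) / n


# A repetitive sequence scores 0; a sequence containing every dinucleotide
# exactly once scores (numerically close to) 1.
print(topological_entropy("A" * 17))
print(topological_entropy("AACAGATCCGCTGGTTA"))
```

Sliding such a function along a gene in fixed-size windows would give one possible numerical signal for downstream steps such as SVD, although the windowing scheme here is likewise an assumption rather than the authors' procedure.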
Affiliation(s)
- Junyi Li
  - School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
- Li Zhang
  - School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
- Huinian Li
  - School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
- Yuan Ping
  - School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
- Qingzhe Xu
  - School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
- Rongjie Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, 150001 China
- Renjie Tan
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, 150001 China
- Zhen Wang
  - CAS Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031 China
- Bo Liu
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, 150001 China
- Yadong Wang
  - School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, 150001 China
9
Chen KH, Wang TF, Hu YJ. Protein-protein interaction prediction using a hybrid feature representation and a stacked generalization scheme. BMC Bioinformatics 2019; 20:308. [PMID: 31182027 PMCID: PMC6558856 DOI: 10.1186/s12859-019-2907-1] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Received: 03/10/2019] [Accepted: 05/17/2019] [Indexed: 12/11/2022] Open
Abstract
BACKGROUND Although various machine learning-based predictors have been developed for estimating protein-protein interactions, their performances vary with dataset and species, and are affected by two primary aspects: choice of learning algorithm, and the representation of protein pairs. To improve the performance of predicting protein-protein interactions, we exploit the synergy of multiple learning algorithms, and utilize the expressiveness of different protein-pair features. RESULTS We developed a stacked generalization scheme that integrates five learning algorithms. We also designed three types of protein-pair features based on the physicochemical properties of amino acids, gene ontology annotations, and interaction network topologies. When tested on 19 published datasets collected from eight species, the proposed approach achieved a significantly higher or comparable overall performance, compared with seven competitive predictors. CONCLUSION We introduced an ensemble learning approach for PPI prediction that integrated multiple learning algorithms and different protein-pair representations. The extensive comparisons with other state-of-the-art prediction tools demonstrated the feasibility and superiority of the proposed method.
Affiliation(s)
- Kuan-Hsi Chen
  - College of Computer Science, National Chiao Tung University, Hsinchu, 300, Taiwan
- Tsai-Feng Wang
  - Institute of Data Science and Engineering, National Chiao Tung University, Hsinchu, 300, Taiwan
- Yuh-Jyh Hu
  - Institute of Biomedical Engineering, College of Computer Science, National Chiao Tung University, Hsinchu, 300, Taiwan