1
Payra AK, Saha B, Ghosh A. MEM-FET: Essential protein prediction using membership feature and machine learning approach. Proteins 2024; 92:60-75. [PMID: 37638618 DOI: 10.1002/prot.26577] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/20/2022] [Revised: 02/21/2023] [Accepted: 08/08/2023] [Indexed: 08/29/2023]
Abstract
Proteins play key roles in many biological functions. The functional roles of a protein are enhanced when it interacts with other proteins, compared with when it acts alone. Identifying the essential proteins of an organism in the wet lab is time-consuming and costly, although wet-lab observations ensure high biological reliability and accuracy. Computational prediction of essential proteins is an alternative that has quickly proved its value and effectively reduces wet-lab experimental costs. Existing computational methods have been implemented using protein-protein interaction networks (PPIN), sequence, gene expression datasets (GED), Gene Ontology (GO), orthologous groups, and subcellular localization datasets. Machine learning offers diverse categories of features that make it possible to model and predict essential macromolecules of understudied organisms. A novel methodology, MEM-FET (membership feature), is proposed that predicts essential proteins from features, namely the edge clustering coefficient, average clustering coefficient, subcellular localization, and Gene Ontology within a compartment of common neighbors. The accuracy (ACC) values of the predicted true positive (TP) essential proteins are 0.79, 0.74, 0.78, and 0.71 for the YHQ, YMIPS, YDIP, and YMBD datasets, respectively. An enriched set of essential proteins is also predicted using the MEM-FET algorithm. Ensemble ML also validated the proposed model with an accuracy of 60%. The MEM-FET algorithm is reported to outperform other existing algorithms, with an ACC value of 80% for the yeast dataset.
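One of the topological features named in this abstract, the edge clustering coefficient, can be illustrated with a short sketch. The definition used here (number of common neighbors of an edge divided by the smaller of the two endpoint degrees minus one) is a commonly used form and is an assumption; the abstract does not state which variant MEM-FET uses. The function names and the toy network are hypothetical.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map from (u, v) pairs, ignoring self-loops."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def edge_clustering_coefficient(adj, u, v):
    """ECC(u, v) = |N(u) & N(v)| / min(deg(u) - 1, deg(v) - 1).

    Assumed definition: the numerator counts triangles that contain the
    edge (u, v); the denominator is the maximum number of such triangles.
    Returns 0.0 when no triangle is possible.
    """
    common = len(adj[u] & adj[v])
    max_triangles = min(len(adj[u]) - 1, len(adj[v]) - 1)
    return common / max_triangles if max_triangles > 0 else 0.0


# Hypothetical toy interaction network: a triangle A-B-C plus a pendant node D.
toy_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
toy_adj = build_adjacency(toy_edges)
ecc_ab = edge_clustering_coefficient(toy_adj, "A", "B")
ecc_cd = edge_clustering_coefficient(toy_adj, "C", "D")
```

In this toy network the edge A-B lies in the only triangle it could belong to, while the edge C-D cannot lie in any triangle because D has a single neighbor.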
Affiliation(s)
- Anjan Kumar Payra
- Department of Computer Science and Engineering, Dr. Sudhir Chandra Sur Degree Engineering College, Kolkata, India
- Banani Saha
- Department of Computer Science and Engineering, University of Calcutta, Kolkata, India
- Anupam Ghosh
- Department of Computer Science and Engineering, Netaji Subhash Engineering College, Kolkata, India
2
Devi SB, Kumar S. Designing a multi-epitope chimeric protein from different potential targets: A potential vaccine candidate against Plasmodium. Mol Biochem Parasitol 2023; 255:111560. [PMID: 37084957 DOI: 10.1016/j.molbiopara.2023.111560] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2022] [Revised: 03/30/2023] [Accepted: 04/03/2023] [Indexed: 04/23/2023]
Abstract
Malaria is an infectious disease that has been a continuous threat to mankind since time immemorial. Owing to the complex multi-stage life cycle of the Plasmodium parasite, an effective malaria vaccine that is fully protective against parasite infection is urgently needed. In the present study, essential parasite proteins were identified and a chimeric protein with multivalent epitopes was generated. The designed chimeric protein consists of the best potential B and T cell epitopes from five different essential parasite proteins. Physicochemical studies of the chimeric protein showed that the modeled vaccine construct was thermostable, hydrophilic, and antigenic in nature. The binding of the vaccine construct to Toll-like receptor-4 (TLR-4), as revealed by molecular docking, suggests a possible interaction and a role for the vaccine construct in activating the innate immune response. Because the constructed vaccine is a chimeric protein containing epitopes from different potential candidates, it could target different stages or pathways of the parasite. Moreover, the approach used in this study is time- and cost-effective and can be applied to the discovery of new potential vaccine targets for other pathogens.
Affiliation(s)
- Sanasam Bijara Devi
- Department of Life science & Bioinformatics, Assam University, Silchar 788011 India
- Sanjeev Kumar
- Department of Life science & Bioinformatics, Assam University, Silchar 788011 India
3
Hossain MA, Rahman MH, Sultana H, Ahsan A, Rayhan SI, Hasan MI, Sohel M, Somadder PD, Moni MA. An integrated in-silico Pharmaco-BioInformatics approaches to identify synergistic effects of COVID-19 to HIV patients. Comput Biol Med 2023; 155:106656. [PMID: 36805222 PMCID: PMC9911982 DOI: 10.1016/j.compbiomed.2023.106656] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 09/10/2022] [Revised: 01/18/2023] [Accepted: 02/08/2023] [Indexed: 02/12/2023]
Abstract
BACKGROUND High inflammatory states arising from both COVID-19 and HIV further result in complications. The ongoing confrontation between these two viral infections can be avoided by adopting suitable management measures. PURPOSE The aim of this study was to determine the pharmacological mechanism behind apigenin's role in the synergetic effects of COVID-19 on the progression of HIV. METHOD We employed computer-aided methods to uncover shared biological targets and signaling pathways associated with COVID-19 and HIV, bioinformatics and network pharmacology techniques to assess the synergetic effects of apigenin on COVID-19 and the progression of HIV, and pharmacokinetic analysis to examine apigenin's safety in the human body. RESULT Stress-responsive, membrane receptor, and induction pathways were the most represented among gene ontology (GO) terms, whereas apoptosis and inflammatory pathways were significantly associated in the Kyoto Encyclopedia of Genes and Genomes (KEGG). The top 20 hub genes were detected using the shortest-path, degree-ranked method on the protein-protein interaction (PPI) network. Molecular docking and molecular dynamics simulation were also performed, revealing apigenin's strong interaction with hub proteins (MAPK3, RELA, MAPK1, EP300, and AKT1). Moreover, the pharmacokinetic features of apigenin revealed that it is an effective therapeutic agent with minimal adverse effects, for instance, hepatotoxicity. CONCLUSION Synergetic effects of COVID-19 on the progression of HIV may still be a danger to global public health. Consequently, advanced solutions are required to provide valid information regarding apigenin as a suitable therapeutic agent for the management of COVID-19 and HIV synergetic effects. However, the findings have yet to be confirmed in patients, which calls for more in vitro and in vivo studies.
Affiliation(s)
- Md Arju Hossain
- Department of Biotechnology and Genetic Engineering, Mawlana Bhashani Science and Technology University, Santosh, Tangail, 1902, Bangladesh
- Md Habibur Rahman
- Department of Computer Science and Engineering, Islamic University, Kushtia, 7003, Bangladesh; Center for Advanced Bioinformatics and Artificial Intelligent Research, Islamic University, Kushtia, 7003, Bangladesh
- Habiba Sultana
- Department of Biotechnology and Genetic Engineering, Mawlana Bhashani Science and Technology University, Santosh, Tangail, 1902, Bangladesh
- Asif Ahsan
- Department of Biotechnology and Genetic Engineering, Mawlana Bhashani Science and Technology University, Santosh, Tangail, 1902, Bangladesh
- Saiful Islam Rayhan
- Department of Biochemistry and Molecular Biology, Mawlana Bhashani Science and Technology University, Santosh, Tangail, 1902, Bangladesh
- Md Imran Hasan
- Department of Computer Science and Engineering, Islamic University, Kushtia, 7003, Bangladesh
- Md Sohel
- Department of Biochemistry and Molecular Biology, Mawlana Bhashani Science and Technology University, Santosh, Tangail, 1902, Bangladesh
- Pratul Dipta Somadder
- Department of Biotechnology and Genetic Engineering, Mawlana Bhashani Science and Technology University, Santosh, Tangail, 1902, Bangladesh
- Mohammad Ali Moni
- School of Health and Rehabilitation Sciences, Faculty of Health and Behavioural Sciences, The University of Queensland, St Lucia, QLD, 4072, Australia
4
Payra AK, Saha B, Ghosh A. MM-CCNB: Essential protein prediction using MAX-MIN strategies and compartment of common neighboring approach. Comput Methods Programs Biomed 2023; 228:107247. [PMID: 36427433 DOI: 10.1016/j.cmpb.2022.107247] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2022] [Revised: 10/16/2022] [Accepted: 11/14/2022] [Indexed: 06/16/2023]
Abstract
BACKGROUND AND OBJECTIVE Proteins are indispensable to the life of living organisms. Interacting protein pairs exhibit more functional activity than individual proteins, and this activity has been considered an essential measure in predicting their essentiality. Neighborhood approaches have frequently been used to compute essentiality scores. All paired neighbors of the essential proteins are nominated as suitable candidate seeds for prediction. Until now, Jaccard's coefficient has been limited to predicting functions, homologous groups, sequence analysis, etc. This motivated us to predict essential proteins efficiently using different computational approaches. METHODS We propose a novel methodology for predicting essential proteins using MAX-MIN strategies and a modified Jaccard's coefficient. RESULTS The performance of the proposed methodology was analyzed on Saccharomyces cerevisiae datasets, with an accuracy of more than 80%. The proposed algorithm achieves accuracies of 0.78, 0.74, 0.79, and 0.862 for the YDIP, YMIPS, YHQ, and YMBD datasets, respectively. CONCLUSIONS Several computational approaches exist among state-of-the-art models of essential protein prediction. Our proposed methodology outperforms other existing models, viz. different centralities, local interaction density combined with protein complexes, the modified monkey algorithm, and the ortho_sim_loc method.
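For orientation, the standard (unmodified) Jaccard's coefficient between the neighbor sets of two proteins can be sketched as below. This is not the modified coefficient proposed in the paper, whose details the abstract does not give; the function name and the toy adjacency map are hypothetical.

```python
def jaccard_neighbors(adj, u, v):
    """Standard Jaccard coefficient |N(u) & N(v)| / |N(u) | N(v)|.

    adj maps each protein to the set of its interaction partners.
    Returns 0.0 when both neighbor sets are empty.
    """
    n_u = adj.get(u, set())
    n_v = adj.get(v, set())
    union = n_u | n_v
    return len(n_u & n_v) / len(union) if union else 0.0


# Hypothetical toy network: a triangle A-B-C plus a pendant node D attached to C.
toy_adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
jc_ab = jaccard_neighbors(toy_adj, "A", "B")  # shared neighbor C, union {A, B, C}
jc_ad = jaccard_neighbors(toy_adj, "A", "D")  # shared neighbor C, union {B, C}
```

A pair of proteins scores higher when a larger share of their combined neighborhood is shared, which is the intuition behind neighborhood-based essentiality scores.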
Affiliation(s)
- Anjan Kumar Payra
- Department of Computer Science & Engineering, Dr. Sudhir Chandra Sur Degree Engineering College, 540, Dum Dum Road, Near Dum Dum Jn. Station, Surermath, Kolkata 700074, India
- Banani Saha
- Department of Computer Science & Engineering, University of Calcutta, Saltlake City Kolkata 700073, India
- Anupam Ghosh
- Department of Computer Science & Engineering, Netaji Subhash Engineering College, Techno City, Panchpota, Garia, Kolkata 700152, India
5
Li Y, Zeng M, Zhang F, Wu FX, Li M. DeepCellEss: cell line-specific essential protein prediction with attention-based interpretable deep learning. Bioinformatics 2022; 39:6865030. [PMID: 36458923 PMCID: PMC9825760 DOI: 10.1093/bioinformatics/btac779] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 06/17/2022] [Revised: 11/25/2022] [Accepted: 12/01/2022] [Indexed: 12/05/2022] Open
Abstract
MOTIVATION Protein essentiality is usually accepted to be a conditional trait and strongly affected by cellular environments. However, existing computational methods often do not take such characteristics into account, preferring to incorporate all available data and train a general model for all cell lines. In addition, the lack of model interpretability limits further exploration and analysis of essential protein predictions. RESULTS In this study, we proposed DeepCellEss, a sequence-based interpretable deep learning framework for cell line-specific essential protein predictions. DeepCellEss utilizes a convolutional neural network and bidirectional long short-term memory to learn short- and long-range latent information from protein sequences. Further, a multi-head self-attention mechanism is used to provide residue-level model interpretability. For model construction, we collected extremely large-scale benchmark datasets across 323 cell lines. Extensive computational experiments demonstrate that DeepCellEss yields effective prediction performance for different cell lines and outperforms existing sequence-based methods as well as network-based centrality measures. Finally, we conducted some case studies to illustrate the necessity of considering specific cell lines and the superiority of DeepCellEss. We believe that DeepCellEss can serve as a useful tool for predicting essential proteins across different cell lines. AVAILABILITY AND IMPLEMENTATION The DeepCellEss web server is available at http://csuligroup.com:8000/DeepCellEss. The source code and data underlying this study can be obtained from https://github.com/CSUBioGroup/DeepCellEss. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yiming Li
- Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Min Zeng
- Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Fuhao Zhang
- Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Fang-Xiang Wu
- Division of Biomedical Engineering, Department of Computer Science, Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A9, Canada
- Min Li
- To whom correspondence should be addressed.
6
Huang X, Zhou H, Yang X, Shi W, Hu L, Wang J, Zhang F, Shao F, Zhang M, Jiang F, Wang Y. Construction and analysis of expression profile of exosomal lncRNAs in pleural effusion in lung adenocarcinoma. J Clin Lab Anal 2022; 36:e24777. [PMID: 36426920 PMCID: PMC9756994 DOI: 10.1002/jcla.24777] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/09/2022] [Revised: 10/15/2022] [Accepted: 10/29/2022] [Indexed: 11/27/2022] Open
Abstract
BACKGROUND Lung adenocarcinoma (LUAD) is a highly malignant tumor with a very low five-year survival rate. In this study, we aimed to identify differentially expressed long non-coding RNAs (lncRNAs) and mRNAs from benign and malignant pleural effusion exosomes. METHODS We used gene microarray and quantitative real-time reverse transcription polymerase chain reaction (RT-qPCR) to detect and verify differentially expressed mRNAs and lncRNAs in benign and malignant pleural effusion exosomes. Gene Ontology (GO) functional enrichment and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses were performed to identify differences in biological processes and functions among the differentially expressed mRNAs. We selected the lncRNA ZBED5-AS1, which was upregulated with a fold change of 3.003, and conducted a preliminary study of its cellular function. RESULTS Gene microarray results revealed that 177 differentially expressed lncRNAs were upregulated and 215 were downregulated. The top 10 upregulated were FMN1, AL118505.1, LINC00452, AL109811.2, CATG00000040683.1, AC137932.1, AC008619.1, AL450344.1, AC092718.6, and ZBED5-AS1. The top 10 downregulated were TEX41, G067726, JAZF1-AS1, AC027328.1, AL445645.1, AL022345.4, AC008572.1, AC123777.1, AC093714.1, and PHKG1. For the mRNAs, 79 were upregulated and 123 were notably downregulated. GO analysis revealed that the upregulated differential mRNAs were mainly involved in "cellular response to acidic pH" (biological processes), "endoplasmic reticulum part" (cellular components), and "at DNA binding, cyclase activity" (molecular functions). KEGG pathways were found to be related to Vibrio cholerae infection, Parkinson's disease, and cell adhesion molecules. RT-qPCR showed that ZBED5-AS1 was highly expressed in LUAD tissues, cells, and benign and malignant pleural fluid exosomes. Overexpression of ZBED5-AS1 significantly promoted the proliferation, migration, invasion, and colony formation of LUAD cells, and knockdown had the opposite effect. CONCLUSION The pleural effusion exosomes from patients with LUAD include several aberrantly expressed genes, and lncRNA ZBED5-AS1 is a new biomarker that aids our understanding of the occurrence and progression of LUAD.
Affiliation(s)
- Xiaolu Huang
- Department of Laboratory Medicine, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou, China
- Huixin Zhou
- Department of Laboratory Medicine, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou, China
- Xiang Yang
- Department of Laboratory Medicine, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou, China
- Wenjing Shi
- Department of Laboratory Medicine, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou, China
- Lijuan Hu
- Department of Laboratory Medicine, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou, China
- Junjun Wang
- Department of Laboratory Medicine, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou, China
- Fan Zhang
- Department of Laboratory Medicine, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou, China
- Fanggui Shao
- Department of Laboratory Medicine, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou, China
- Meijuan Zhang
- Department of Laboratory Medicine, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou, China
- Feng Jiang
- Department of Laboratory Medicine, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou, China
- Yumin Wang
- Department of Laboratory Medicine, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou, China
7
Wang L, Peng J, Kuang L, Tan Y, Chen Z. Identification of Essential Proteins Based on Local Random Walk and Adaptive Multi-View Multi-Label Learning. IEEE/ACM Trans Comput Biol Bioinform 2022; 19:3507-3516. [PMID: 34788220 DOI: 10.1109/tcbb.2021.3128638] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
Accumulating evidence indicates that essential proteins play vital roles in human physiological processes. Although research on the prediction of essential proteins has developed rapidly in recent years, various limitations remain, such as unsatisfactory data suitability and low accuracy of predictive results. In this manuscript, a novel method called RWAMVL is proposed to predict essential proteins based on random walk and adaptive multi-view multi-label learning. In RWAMVL, considering that inherent noise is ubiquitous in existing datasets of known protein-protein interactions (PPIs), a variety of features, including biological features of proteins and topological features of PPI networks, are first obtained by adaptive multi-view multi-label learning. An improved random walk method is then designed to detect essential proteins based on these features. Finally, to verify the predictive performance of RWAMVL, intensive experiments were done to compare it with multiple state-of-the-art predictive methods under different experimental frameworks. The results show that RWAMVL achieves better prediction accuracy than all of the competing methods, which suggests that RWAMVL may be a potential tool for the prediction of key proteins in the future.
8
Li Y, Zeng M, Wu Y, Li Y, Li M. Accurate Prediction of Human Essential Proteins Using Ensemble Deep Learning. IEEE/ACM Trans Comput Biol Bioinform 2022; 19:3263-3271. [PMID: 34699365 DOI: 10.1109/tcbb.2021.3122294] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/13/2023]
Abstract
Essential proteins are considered the foundation of life as they are indispensable for the survival of living organisms. Computational methods for essential protein discovery provide a fast way to identify essential proteins. But most of them heavily rely on various biological information, especially protein-protein interaction networks, which limits their practical applications. With the rapid development of high-throughput sequencing technology, sequencing data has become the most accessible biological data. However, using only protein sequence information to predict essential proteins has limited accuracy. In this paper, we propose EP-EDL, an ensemble deep learning model using only protein sequence information to predict human essential proteins. EP-EDL integrates multiple classifiers to alleviate the class imbalance problem and to improve prediction accuracy and robustness. In each base classifier, we employ multi-scale text convolutional neural networks to extract useful features from protein sequence feature matrices with evolutionary information. Our computational results show that EP-EDL outperforms the state-of-the-art sequence-based methods. Furthermore, EP-EDL provides a more practical and flexible way for biologists to accurately predict essential proteins. The source code and datasets can be downloaded from https://github.com/CSUBioGroup/EP-EDL.
9
Wang C, Zhang H, Ma H, Wang Y, Cai K, Guo T, Yang Y, Li Z, Zhu Y. Inference of pan-cancer related genes by orthologs matching based on enhanced LSTM model. Front Microbiol 2022; 13:963704. [PMID: 36267181 PMCID: PMC9577021 DOI: 10.3389/fmicb.2022.963704] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2022] [Accepted: 08/16/2022] [Indexed: 11/13/2022] Open
Abstract
Many disease-related genes have been found to be associated with cancer diagnosis, which is useful for understanding the pathophysiology of cancer, generating targeted drugs, and developing new diagnostic and treatment techniques. With the development of the pan-cancer project and the ongoing expansion of sequencing technology, many scientists are focusing on mining common genes from The Cancer Genome Atlas (TCGA) across various cancer types. In this study, motivated by the benefits of reverse genetics, we attempted to infer pan-cancer associated genes by examining the microbial model organism Saccharomyces cerevisiae (yeast) through homology matching. First, a background network of protein-protein interactions and a pathogenic gene set involving several cancer types in humans and yeast were created. The homology between human genes and yeast genes was then identified by homology matching, and the corresponding interaction sub-network was obtained. This followed the principle that homologous genes of a common ancestor may have similarities in expression. Then, using bidirectional long short-term memory (BiLSTM) in combination with adaptive integration of heterogeneous information, we further explored the topological characteristics of the yeast protein interaction network and presented a node representation score to evaluate the node ability in graphs. Finally, homologous mapping for human genes matched the important genes identified by ensemble classifiers for yeast, which may be thought of as genes connected to all types of cancer. The performance of the BiLSTM model can be assessed through experiments on the database, while enrichment analysis, survival analysis, and other outcomes can be used to confirm the biological importance of the prediction results. The complete experimental protocols and programs are available at https://github.com/zhuyuan-cug/AI-BiLSTM/tree/master.
Affiliation(s)
- Chao Wang
- Department of Surgery, Hepatic Surgery Center, Institute of Hepato-Pancreato-Biliary Surgery, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
- Houwang Zhang
- Department of Electrical Engineering, City University of Hong Kong, Kowloon, Hong Kong SAR, China
- Haishu Ma
- School of Automation, China University of Geosciences, Wuhan, China
- Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan, China
- Engineering Research Center of Intelligent Technology for Geo-Exploration, Wuhan, China
- Yawen Wang
- School of Mathematics and Physics, China University of Geosciences, Wuhan, China
- Ke Cai
- School of Automation, China University of Geosciences, Wuhan, China
- Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan, China
- Engineering Research Center of Intelligent Technology for Geo-Exploration, Wuhan, China
- Tingrui Guo
- School of Automation, China University of Geosciences, Wuhan, China
- Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan, China
- Engineering Research Center of Intelligent Technology for Geo-Exploration, Wuhan, China
- Yuanhang Yang
- School of Mathematics and Physics, China University of Geosciences, Wuhan, China
- Zhen Li
- School of Mathematics and Physics, China University of Geosciences, Wuhan, China
- Yuan Zhu
- School of Automation, China University of Geosciences, Wuhan, China
- Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan, China
- Engineering Research Center of Intelligent Technology for Geo-Exploration, Wuhan, China
- Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, Shanghai, China
- *Correspondence: Yuan Zhu
10
Identifying essential proteins from protein-protein interaction networks based on influence maximization. BMC Bioinformatics 2022; 23:339. [PMID: 35974329 PMCID: PMC9380286 DOI: 10.1186/s12859-022-04874-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2022] [Accepted: 08/03/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Essential proteins are indispensable to the development and survival of cells. The identification of essential proteins not only helps in understanding the minimal requirements for cell survival, but also has practical significance in disease diagnosis, drug design, and medical treatment. With the rapid accumulation of protein-protein interaction (PPI) data, computationally identifying essential proteins from protein-protein interaction networks (PINs) has become increasingly popular. To date, a number of approaches for essential protein identification based on PINs have been developed. RESULTS In this paper, we propose a new and effective approach called iMEPP to identify essential proteins from PINs by fusing multiple types of biological data and applying the influence maximization mechanism to the PINs. Concretely, we first integrate PPI data, gene expression data, and Gene Ontology to construct weighted PINs, to alleviate the impact of the high false-positive rate in the raw PPI data. Then, we define the influence scores of nodes in PINs using both orthology data and PIN topological information. Finally, we develop an influence discount algorithm to identify essential proteins based on the influence maximization mechanism. CONCLUSIONS We applied our method to identifying essential proteins from the Saccharomyces cerevisiae PIN. Experiments show that our iMEPP method outperforms the existing methods, which validates its effectiveness and advantage.
11
Schapke J, Tavares A, Recamonde-Mendoza M. EPGAT: Gene Essentiality Prediction With Graph Attention Networks. IEEE/ACM Trans Comput Biol Bioinform 2022; 19:1615-1626. [PMID: 33497339 DOI: 10.1109/tcbb.2021.3054738] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/12/2023]
Abstract
Identifying essential genes and proteins is a critical step towards a better understanding of human biology and pathology. Computational approaches helped to mitigate experimental constraints by exploring machine learning (ML) methods and the correlation of essentiality with biological information, especially protein-protein interaction (PPI) networks, to predict essential genes. Nonetheless, their performance is still limited, as network-based centralities are not exclusive proxies of essentiality, and traditional ML methods are unable to learn from non-euclidean domains such as graphs. Given these limitations, we proposed EPGAT, an approach for Essentiality Prediction based on Graph Attention Networks (GATs), which are attention-based Graph Neural Networks (GNNs), operating on graph-structured data. Our model directly learns gene essentiality patterns from PPI networks, integrating additional evidence from multiomics data encoded as node attributes. We benchmarked EPGAT for four organisms, including humans, accurately predicting gene essentiality with ROC AUC score ranging from 0.78 to 0.97. Our model significantly outperformed network-based and shallow ML-based methods and achieved a very competitive performance against the state-of-the-art node2vec embedding method. Notably, EPGAT was the most robust approach in scenarios with limited and imbalanced training data. Thus, the proposed approach offers a powerful and effective way to identify essential genes and proteins.
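The ROC AUC metric reported in this abstract can be computed directly from labels and prediction scores. The sketch below uses the pairwise-ranking interpretation of AUC (the probability that a randomly chosen essential gene is scored above a randomly chosen non-essential one, with ties counted as one half). The function name, labels, and scores are hypothetical and are not taken from the EPGAT benchmarks.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 1 (essential) or 0 (non-essential).
    scores: predicted essentiality scores, higher means more likely essential.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0   # positive ranked above negative
            elif p == n:
                correct += 0.5   # tie counts as half
    return correct / (len(positives) * len(negatives))


# Hypothetical predictions for six genes.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
toy_auc = roc_auc(toy_labels, toy_scores)  # 8 of 9 pairs are ordered correctly
```

The quadratic pairwise loop is chosen for readability; for large gene sets a rank-based computation gives the same value more efficiently.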
12
Zhang Z, Luo Y, Jiang M, Wu D, Zhang W, Yan W, Zhao B. An efficient strategy for identifying essential proteins based on homology, subcellular location and protein-protein interaction information. Math Biosci Eng 2022; 19:6331-6343. [PMID: 35603404 DOI: 10.3934/mbe.2022296] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
High-throughput biological experiments are expensive and time-consuming. In the past few years, many computational methods based on biological information have been proposed and widely used to understand the biological background. However, the processing of biological information inevitably produces false positive and false negative data, such as the noise in protein-protein interaction (PPI) networks and the noise generated by integrating a variety of biological information. Solving these noise problems is key to essential protein prediction. This paper proposes an Identifying Essential Proteins model based on non-negative Matrix Symmetric tri-Factorization and multiple biological information (IEPMSF). The model uses only the common-neighbor characteristics of proteins in the PPI network to build a weighted network, and uses the non-negative matrix symmetric tri-factorization method to find more potential interactions between proteins so as to optimize the weighted network. Then, the initial score of each protein is determined from subcellular location and lineal homology information, and a random walk with restart algorithm is applied to the optimized network to score and rank each protein. We tested the proposed prediction model against current representative approaches using a public database. Experiments show that the new method identifies essential proteins with high efficiency. The effectiveness of this method shows that it can substantially reduce the noise problems that exist in the multi-source biological information itself and that are caused by integrating it.
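The random walk with restart step mentioned in this abstract can be sketched generically as follows. This is a textbook power-iteration form, assuming a weighted adjacency map and a restart vector built from prior scores; it does not reproduce the IEPMSF weighting, the matrix tri-factorization, or the paper's parameter choices. The names `random_walk_with_restart`, `alpha`, and the toy network are hypothetical.

```python
def random_walk_with_restart(weights, restart, alpha=0.5, tol=1e-10, max_iter=1000):
    """Iterate p <- alpha * r + (1 - alpha) * W p until convergence.

    weights: dict node -> dict neighbor -> non-negative edge weight
             (every node must appear as a key, possibly with no neighbors).
    restart: dict node -> non-negative prior score; normalised to sum to 1.
    alpha:   restart probability (assumed value, not taken from the paper).
    """
    nodes = sorted(weights)
    total = sum(restart.get(n, 0.0) for n in nodes)
    if total <= 0:
        raise ValueError("restart vector must contain a positive score")
    r = {n: restart.get(n, 0.0) / total for n in nodes}
    out_weight = {n: sum(weights[n].values()) for n in nodes}
    p = dict(r)
    for _ in range(max_iter):
        nxt = {n: alpha * r[n] for n in nodes}
        for src in nodes:
            if out_weight[src] == 0:
                # Isolated node: send its mass back to the restart distribution.
                for n in nodes:
                    nxt[n] += (1 - alpha) * p[src] * r[n]
                continue
            for dst, w in weights[src].items():
                nxt[dst] += (1 - alpha) * p[src] * w / out_weight[src]
        change = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if change < tol:
            break
    return p


# Hypothetical path network A - B - C with unit weights, restarting only at A.
toy_weights = {"A": {"B": 1.0}, "B": {"A": 1.0, "C": 1.0}, "C": {"B": 1.0}}
toy_scores = random_walk_with_restart(toy_weights, {"A": 1.0})
toy_ranking = sorted(toy_scores, key=toy_scores.get, reverse=True)
```

Proteins closer to the restart set accumulate more probability mass, so sorting by the stationary scores gives a ranking of candidates.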
Affiliation(s)
- Zhihong Zhang
  - College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, Hunan 410022, China
- Yingchun Luo
  - Department of Ultrasound, Hunan Provincial Maternal and Child Health Care Hospital, Changsha, Hunan 410008, China
- Meiping Jiang
  - Department of Ultrasound, Hunan Provincial Maternal and Child Health Care Hospital, Changsha, Hunan 410008, China
- Dongjie Wu
  - Department of Banking and Finance, Monash University, Clayton, Victoria 3168, Australia
- Wang Zhang
  - Department of Optoelectronic Engineering, Jinan University, Guangzhou, Guangdong 510632, China
- Wei Yan
  - College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, Hunan 410022, China
- Bihai Zhao
  - College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, Hunan 410022, China
|
13
|
Zhu X, Zhu Y, Tan Y, Chen Z, Wang L. An Iterative Method for Predicting Essential Proteins Based on Multifeature Fusion and Linear Neighborhood Similarity. Front Aging Neurosci 2022; 13:799500. [PMID: 35140599 PMCID: PMC8819145 DOI: 10.3389/fnagi.2021.799500] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/21/2021] [Accepted: 12/02/2021] [Indexed: 11/13/2022] Open
Abstract
Growing evidence has demonstrated that many biological processes are inseparable from the participation of key proteins. In this paper, a novel iterative method called linear neighborhood similarity-based protein multifeatures fusion (LNSPF) is proposed to identify potential key proteins based on multifeature fusion. In LNSPF, an original protein-protein interaction (PPI) network is first constructed from known protein-protein interaction data downloaded from benchmark databases, and topological features are extracted from it. Next, gene expression data of proteins are used to transform the original PPI network into a weighted PPI network based on the linear neighborhood similarity. After that, subcellular localization and homologous information of proteins are integrated to extract functional features for proteins and, based on both the functional and topological features obtained above, an iterative method is designed and carried out to predict potential key proteins. Finally, to evaluate the predictive performance of LNSPF, extensive experiments were conducted, and comparisons between LNSPF and 15 state-of-the-art competing methods demonstrate that LNSPF achieves satisfactory recognition accuracy, markedly better than that achieved by each competing method.
Affiliation(s)
- Xianyou Zhu
  - College of Computer Science and Technology, Hengyang Normal University, Hengyang, China
- Yaocan Zhu
  - College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, China
- Yihong Tan
  - College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, China
- Zhiping Chen
  - College of Computer Science and Technology, Hengyang Normal University, Hengyang, China
  - College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, China
- Lei Wang
  - College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, China
|
14
|
Zhang Z, Jiang M, Wu D, Zhang W, Yan W, Qu X. A Novel Method for Identifying Essential Proteins Based on Non-negative Matrix Tri-Factorization. Front Genet 2021; 12:709660. [PMID: 34422014 PMCID: PMC8378176 DOI: 10.3389/fgene.2021.709660] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/14/2021] [Accepted: 07/06/2021] [Indexed: 11/29/2022] Open
Abstract
Identification of essential proteins is very important for understanding the basic requirements to sustain a living organism. In recent years, there has been increasing interest in using computational methods to predict essential proteins based on protein–protein interaction (PPI) networks or on the fusion of multiple kinds of biological information. However, existing PPI data have been observed to contain false negatives and false positives. Fusing multiple kinds of biological information can reduce the influence of false data in PPI networks, but it inevitably introduces more noise at the same time. In this article, we propose a novel non-negative matrix tri-factorization (NMTF)-based model (NTMEP) to predict essential proteins. First, a weighted PPI network is established using only the topological features of the network, so as to avoid introducing more noise. To reduce the influence of false data in the PPI network on the performance of essential protein identification, the NMTF technique, a widely used recommendation algorithm, is applied to reconstruct an optimized PPI network with more potential protein–protein interactions. Then, we use the PageRank algorithm to compute the final ranking score of each protein, with subcellular localization and homologous information of proteins used to calculate the initial scores. Extensive experiments on publicly available datasets indicate that our NTMEP model predicts essential proteins better than state-of-the-art methods. In this investigation, we demonstrate that introducing non-negative matrix tri-factorization can effectively improve the condition of the protein–protein interaction network and thereby reduce the negative impact of noise on prediction. This finding also offers a new perspective for other applications based on protein–protein interaction networks.
Affiliation(s)
- Zhihong Zhang
  - College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, China
  - School of Information Technology and Management, Hunan University of Finance and Economics, Changsha, China
- Meiping Jiang
  - Department of Ultrasound, Hunan Provincial Maternal and Child Health Care Hospital, Changsha, China
- Dongjie Wu
  - Department of Banking and Finance, Monash University, Clayton, VIC, Australia
- Wang Zhang
  - Department of Optoelectronic Engineering, Jinan University, Guangzhou, China
- Wei Yan
  - College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, China
- Xilong Qu
  - School of Information Technology and Management, Hunan University of Finance and Economics, Changsha, China
  - Hunan Provincial Key Laboratory of Finance and Economics Big Data Science and Technology, Hunan University of Finance and Economics, Changsha, China
|
15
|
Zhong J, Tang C, Peng W, Xie M, Sun Y, Tang Q, Xiao Q, Yang J. A novel essential protein identification method based on PPI networks and gene expression data. BMC Bioinformatics 2021; 22:248. [PMID: 33985429 PMCID: PMC8120700 DOI: 10.1186/s12859-021-04175-8] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/07/2020] [Accepted: 05/06/2021] [Indexed: 02/08/2023] Open
Abstract
Background: Some proposed methods for identifying essential proteins achieve better results by using biological information. Gene expression data are commonly used for this purpose. However, gene expression data are prone to fluctuations, which may affect the accuracy of essential protein identification. Therefore, we propose an essential protein identification method based on gene expression and PPI network data that calculates the similarity of the "active" and "inactive" states of gene expression in a cluster of the PPI network. Our experiments show that the method can improve the accuracy of essential protein prediction. Results: In this paper, we propose a new measure named JDC, which is based on PPI network data and gene expression data. The JDC method offers a dynamic threshold method to binarize gene expression data. It then combines degree centrality and the Jaccard similarity index to calculate the JDC score for each protein in the PPI network. We benchmark the JDC method on four organisms and evaluate it using ROC analysis, modular analysis, jackknife analysis, overlapping analysis, top analysis, and accuracy analysis. The results show that JDC performs better than DC, IC, EC, SC, BC, CC, NC, PeC, and WDC. We also compare JDC with the NF-PIN and TS-PIN methods, which predict essential proteins through active PPI networks constructed from dynamic gene expression. Conclusions: We demonstrate that the new centrality measure, JDC, is more efficient than state-of-the-art prediction methods given the same input. The main ideas behind JDC are as follows: (1) Essential proteins are generally found in densely connected clusters of the PPI network. (2) Binarizing gene expression data can screen out fluctuations in gene expression profiles. (3) The essentiality of a protein depends on the similarity of the "active" and "inactive" states of gene expression in a cluster of the PPI network.
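The idea of binarizing expression profiles and combining Jaccard similarity with degree can be illustrated with a rough sketch. The abstract does not give the JDC formula or its dynamic threshold, so everything below is an assumption: the per-gene mean stands in for the threshold, the score is simply the sum of Jaccard similarities over a protein's edges, and all function names and data are hypothetical.

```python
import numpy as np

def binarize_expression(profile):
    """Mark a time point as active (True) when expression exceeds the gene's own mean.

    The per-gene mean is an assumed stand-in for the dynamic threshold,
    which the abstract does not specify.
    """
    profile = np.asarray(profile, dtype=float)
    return profile > profile.mean()

def jaccard(a, b):
    """Jaccard similarity of two boolean activity vectors."""
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(a, b).sum() / union)

def jaccard_weighted_degree(edges, expression):
    """Sum, for each protein, the Jaccard similarity of activity with each neighbor.

    With every similarity equal to 1 this reduces to plain degree centrality.
    """
    active = {gene: binarize_expression(p) for gene, p in expression.items()}
    scores = {gene: 0.0 for gene in expression}
    for u, v in edges:
        sim = jaccard(active[u], active[v])
        scores[u] += sim
        scores[v] += sim
    return scores

# Toy PPI network and expression profiles over four time points.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
expression = {
    "A": [1.0, 5.0, 6.0, 1.0],
    "B": [0.5, 4.0, 5.0, 0.5],
    "C": [2.0, 8.0, 1.0, 1.0],
    "D": [9.0, 1.0, 1.0, 1.0],
}
scores = jaccard_weighted_degree(edges, expression)
print(scores)
```

In this toy data protein C has the highest degree but a lower score than A and B, because its activity pattern only partly overlaps with its neighbors.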
Affiliation(s)
- Jiancheng Zhong
  - School of Information Science and Engineering, Hunan Normal University, Changsha, 410081, China
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Hunan Provincial Key Laboratory of Intelligent Computing and Language Information Processing, Changsha, 410083, China
- Chao Tang
  - School of Information Science and Engineering, Hunan Normal University, Changsha, 410081, China
- Wei Peng
  - College of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, 650500, Yunnan, China
- Minzhu Xie
  - School of Information Science and Engineering, Hunan Normal University, Changsha, 410081, China
- Yusui Sun
  - School of Information Science and Engineering, Hunan Normal University, Changsha, 410081, China
- Qiang Tang
  - College of Engineering and Design, Hunan Normal University, Changsha, 410081, China
- Qiu Xiao
  - School of Information Science and Engineering, Hunan Normal University, Changsha, 410081, China
- Jiahong Yang
  - School of Information Science and Engineering, Hunan Normal University, Changsha, 410081, China
|
16
|
Payra AK, Saha B, Ghosh A. Ortho_Sim_Loc: Essential protein prediction using orthology and priority-based similarity approach. Comput Biol Chem 2021; 92:107503. [PMID: 33962168 DOI: 10.1016/j.compbiolchem.2021.107503] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/15/2020] [Revised: 04/02/2021] [Accepted: 04/21/2021] [Indexed: 10/21/2022]
Abstract
Proteins are the essential macromolecules of living organisms. However, not all proteins can be considered essential in different relevant studies. The essentiality of a protein is therefore computed by computational methods rather than biological experiments, which in turn saves both time and effort. Researchers have already proposed different computational approaches that successfully select essential proteins with different biological significance. Most of the experimental approaches return higher false negative outcomes with respect to others. In order to retain the prediction accuracy level, a novel methodology, "Ortho_Sim_Loc", has been proposed, which combines Orthology, Similarity (using clustering and priority-based GO annotation), and Subcellular localization. Ortho_Sim_Loc can predict a functionally enriched set of essential proteins. The predicted results are validated against other existing methods, such as different centrality measures and LIDC. The validation results show that Ortho_Sim_Loc performs better than other existing computational approaches.
Affiliation(s)
- Anjan Kumar Payra
  - Department of Computer Science & Engineering, Dr. Sudhir Chandra Sur Degree Engineering College, 540, Dum Dum Road, Near Dum Dum Jn. Station, Surermath, Kolkata, 700074, India
- Banani Saha
  - Department of Computer Science & Engineering, University of Calcutta, Saltlake City, Kolkata, 700073, India
- Anupam Ghosh
  - Department of Computer Science & Engineering, Netaji Subhash Engineering College, Techno City, Panchpota, Garia, Kolkata, 700152, India
|
17
|
CEGSO: Boosting Essential Proteins Prediction by Integrating Protein Complex, Gene Expression, Gene Ontology, Subcellular Localization and Orthology Information. Interdiscip Sci 2021; 13:349-361. [PMID: 33772722 DOI: 10.1007/s12539-021-00426-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/02/2020] [Revised: 02/04/2021] [Accepted: 03/05/2021] [Indexed: 01/13/2023]
Abstract
Essential proteins are assumed to be indispensable for sustaining normal physiological function and crucial to drug design and disease diagnosis. The discovery of essential proteins is of great importance in revealing molecular mechanisms and biological processes. Because biological experiments are tedious, many numerical methods have been developed to discover key proteins by mining the features of high-throughput data. Appropriate integration of different biological information based on the protein-protein interaction (PPI) network has proven useful in predicting essential proteins. The main intention of this research is to provide a comprehensive study and review of essential protein identification by integrating multi-source data, and to provide guidance for researchers. Current essential protein prediction algorithms are analyzed and compared in detail and tested on benchmark PPI networks. In addition, building on the previous method TEGS (short for the network Topology, gene Expression, Gene ontology, and Subcellular localization), we improve the performance of essential protein prediction by incorporating known protein complex information, the gene expression profile, Gene Ontology (GO) term information, subcellular localization information, and protein orthology data into the PPI network; the resulting method is named CEGSO. The simulation results show that CEGSO achieves more accurate and robust results than the other compared methods on different test datasets with various evaluation measures.
|
18
|
Meng Z, Kuang L, Chen Z, Zhang Z, Tan Y, Li X, Wang L. Method for Essential Protein Prediction Based on a Novel Weighted Protein-Domain Interaction Network. Front Genet 2021; 12:645932. [PMID: 33815480 PMCID: PMC8010314 DOI: 10.3389/fgene.2021.645932] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/24/2020] [Accepted: 02/15/2021] [Indexed: 01/04/2023] Open
Abstract
In recent years, a number of computational models based on protein-protein interaction (PPI) networks have been proposed. However, due to false positives, false negatives, and the incompleteness of PPI networks, many challenges still affect the design of computational models with satisfactory predictive accuracy when inferring key proteins. This study proposes a prediction model called WPDINM for detecting key proteins based on a novel weighted protein-domain interaction (PDI) network. In WPDINM, a weighted PPI network is constructed first by combining the gene expression data of proteins with topological information extracted from the original PPI network. Simultaneously, a weighted domain-domain interaction (DDI) network is constructed based on the original PDI network. Next, by integrating the newly obtained weighted PPI network and weighted DDI network with the original PDI network, a weighted PDI network is constructed. Then, based on topological features and biological information, including the subcellular localization and orthologous information of proteins, a novel PageRank-based iterative algorithm is designed and applied to the newly constructed weighted PDI network to estimate the criticality of proteins. Finally, to assess the prediction performance of WPDINM, we compared it with 12 competing measures. Experimental results show that WPDINM achieves predictive accuracy rates of 90.19, 81.96, 70.72, 62.04, 55.83, and 51.13% in the top 1%, top 5%, top 10%, top 15%, top 20%, and top 25%, respectively, which exceeds the prediction accuracy achieved by traditional state-of-the-art competing measures. Owing to its satisfactory identification performance, the WPDINM measure may contribute to the further development of key protein identification.
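Accuracy "in the top k%" of a ranked list, as reported in this abstract, is usually the fraction of known essential proteins among the highest-scoring k% of candidates. A minimal sketch of that evaluation follows; the function name, the rounding-up of the cut-off, and the toy scores and labels are assumptions for illustration and are not taken from the paper.

```python
import math

def top_percent_accuracy(scores, essential, percents=(1, 5, 10, 15, 20, 25)):
    """Fraction of known essential proteins among the top-ranked candidates.

    scores    : dict mapping protein name to predicted score (higher = more essential)
    essential : set of proteins known to be essential
    percents  : cut-offs, expressed as percentages of all ranked proteins
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    result = {}
    for pct in percents:
        # Round the cut-off up so that even the top 1% contains at least one protein.
        k = max(1, math.ceil(len(ranked) * pct / 100))
        hits = sum(1 for protein in ranked[:k] if protein in essential)
        result[pct] = hits / k
    return result

# Toy data: 20 proteins scored in descending order P0 > P1 > ... > P19.
scores = {f"P{i}": 20 - i for i in range(20)}
essential = {"P0", "P1", "P3", "P4", "P10"}

accuracy = top_percent_accuracy(scores, essential)
for pct, value in accuracy.items():
    print(f"top {pct}%: {value:.2f}")
```

Accuracy typically falls as the cut-off widens, which matches the decreasing sequence of values quoted in the abstract.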
Affiliation(s)
- Zixuan Meng
  - College of Computer, Xiangtan University, Xiangtan, China
- Linai Kuang
  - College of Computer, Xiangtan University, Xiangtan, China
- Zhiping Chen
  - College of Computer Engineering & Applied Mathematics, Changsha University, Changsha, China
- Zhen Zhang
  - College of Computer Engineering & Applied Mathematics, Changsha University, Changsha, China
- Yihong Tan
  - College of Computer Engineering & Applied Mathematics, Changsha University, Changsha, China
- Xueyong Li
  - College of Computer Engineering & Applied Mathematics, Changsha University, Changsha, China
- Lei Wang
  - College of Computer, Xiangtan University, Xiangtan, China
  - College of Computer Engineering & Applied Mathematics, Changsha University, Changsha, China
|
19
|
Dai W, Chen B, Peng W, Li X, Zhong J, Wang J. A Novel Multi-Ensemble Method for Identifying Essential Proteins. J Comput Biol 2021; 28:637-649. [PMID: 33439753 DOI: 10.1089/cmb.2020.0527] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/28/2022] Open
Abstract
Essential proteins possess critical functions for cell survival. Identifying essential proteins improves our understanding of how a cell works and also plays a vital role in research on disease treatment and drug development. Recently, some machine-learning and ensemble learning methods have been proposed to identify essential proteins by introducing effective protein features. However, existing ensemble learning methods have focused only on the choice of base classifiers. In this article, we propose a novel ensemble learning framework called multi-ensemble to integrate different base classifiers. The multi-ensemble method adopts the idea of multi-view learning: it selects multiple base classifiers and trains them by continually adding the samples that are predicted correctly by the other base classifiers. We applied multi-ensemble to yeast data and Escherichia coli data. The results show that our approach achieves better performance than both individual classifiers and other ensemble learning methods.
Affiliation(s)
- Wei Dai
  - Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China
  - Computer Technology Application Key Lab of Yunnan Province, Kunming University of Science and Technology, Kunming, China
- Bingxi Chen
  - Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China
- Wei Peng
  - Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China
  - Computer Technology Application Key Lab of Yunnan Province, Kunming University of Science and Technology, Kunming, China
- Xia Li
  - Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China
- Jiancheng Zhong
  - School of Information Science and Engineering, Hunan Normal University, Changsha, China
- Jianxin Wang
  - School of Computer Science and Engineering, Central South University, Changsha, China
|
20
|
Zeng M, Li M, Fei Z, Wu FX, Li Y, Pan Y, Wang J. A Deep Learning Framework for Identifying Essential Proteins by Integrating Multiple Types of Biological Information. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:296-305. [PMID: 30736002 DOI: 10.1109/tcbb.2019.2897679] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
Computational methods, including centrality and machine learning-based methods, have been proposed to identify essential proteins for understanding the minimum requirements of the survival and evolution of a cell. In centrality methods, researchers are required to design a score function based on prior knowledge, which is usually not sufficient to capture the complexity of biological information. In machine learning-based methods, the selected biological features cannot represent the complete properties of biological information, because these methods lack a computational framework to select features automatically. To tackle these problems, we propose a deep learning framework that automatically learns biological features without prior knowledge. We use the node2vec technique to automatically learn a richer representation of protein-protein interaction (PPI) network topologies than a score function provides. Bidirectional long short-term memory cells are applied to capture non-local relationships in gene expression data. For subcellular localization information, we use a high-dimensional indicator vector to characterize its features. To evaluate the performance of our method, we tested it on the PPI network of S. cerevisiae. Our experimental results demonstrate that our method performs better than traditional centrality methods and is superior to existing machine learning-based methods. To explore which of the three types of biological information is the most vital element, we conduct an ablation study by removing each component in turn. Our results show that the PPI network embedding contributes most to the improvement. In addition, gene expression profiles and subcellular localization information also help improve performance in identifying essential proteins.
|
21
|
A novel scheme for essential protein discovery based on multi-source biological information. J Theor Biol 2020; 504:110414. [PMID: 32712150 DOI: 10.1016/j.jtbi.2020.110414] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/24/2019] [Revised: 02/14/2020] [Accepted: 07/15/2020] [Indexed: 02/06/2023]
Abstract
Mining essential proteins is crucial for discovering the process of cellular organization and viability. At present, there are many computational methods for detecting essential proteins. However, these existing methods focus only on the topological information of the networks and ignore the biological information of proteins, which leads to low accuracy in essential protein identification. Therefore, this paper presents a new essential protein prediction strategy called DEP-MSB, which integrates a variety of biological information including gene expression profiles, GO annotations, and domain interaction strength. To evaluate the performance of DEP-MSB, we conduct a series of experiments on the yeast PPI network. The experimental results show that the proposed algorithm DEP-MSB is superior to other existing traditional methods and clearly improves prediction accuracy.
|
22
|
Zhang W, Xu J, Zou X. Predicting Essential Proteins by Integrating Network Topology, Subcellular Localization Information, Gene Expression Profile and GO Annotation Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:2053-2061. [PMID: 31095490 DOI: 10.1109/tcbb.2019.2916038] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
Essential proteins are indispensable for maintaining normal cellular functions. Identification of essential proteins from protein-protein interaction (PPI) networks has become a hot topic in recent years. Traditional approaches based on biological experiments are time-consuming and expensive. Although many computational methods have been developed in recent years, their prediction accuracy is still unsatisfactory. In this research, by introducing protein subcellular localization information, we define a new measure that characterizes a protein's subcellular localization essentiality, and we develop a new data fusion-based method for identifying essential proteins, named TEGS, which integrates network topology, gene expression profiles, GO annotation information, and protein subcellular localization information. To demonstrate the efficiency of the proposed method TEGS, we evaluate its performance on two Saccharomyces cerevisiae datasets and compare it with seven other state-of-the-art methods (DC, BC, NC, PeC, WDC, SON, and TEO) in terms of the number of true predictions, the jackknife curve, and the precision-recall curve. Simulation results show that TEGS outperforms the other compared methods in identifying essential proteins. The source code of TEGS is freely available at https://github.com/wzhangwhu/TEGS.
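A jackknife curve, one of the evaluation tools named in this abstract, is commonly drawn by plotting the cumulative number of true essential proteins against the number of top-ranked candidates. The sketch below shows that computation under this common definition; the function name and the toy ranking are hypothetical and unrelated to the TEGS source code.

```python
from itertools import accumulate

def jackknife_curve(ranked_proteins, essential):
    """Cumulative count of true essential proteins along a ranked list.

    ranked_proteins : proteins ordered from highest to lowest predicted score
    essential       : set of proteins known to be essential
    Returns a list whose i-th entry is the number of essential proteins
    among the top (i + 1) candidates.
    """
    return list(accumulate(1 if p in essential else 0 for p in ranked_proteins))

# Toy ranking produced by some hypothetical method.
ranked = ["P1", "P2", "P3", "P4", "P5"]
essential = {"P1", "P3", "P4"}

curve = jackknife_curve(ranked, essential)
area = sum(curve)  # a larger area means essential proteins are ranked earlier
print(curve, area)
```

Comparing the areas under the curves of two methods on the same protein set gives a single-number summary of which method ranks essential proteins earlier.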
|
23
|
Khorsand B, Savadi A, Naghibzadeh M. Comprehensive host-pathogen protein-protein interaction network analysis. BMC Bioinformatics 2020; 21:400. [PMID: 32912135 PMCID: PMC7488060 DOI: 10.1186/s12859-020-03706-z] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/01/2020] [Accepted: 07/31/2020] [Indexed: 12/27/2022] Open
Abstract
BACKGROUND: Infectious diseases are a cruel assassin with millions of victims around the world each year. Understanding the infection mechanisms of viruses is indispensable for inhibiting them. One of the best ways to unveil these mechanisms is to investigate the host-pathogen protein-protein interaction network. In this paper we try to disclose many properties of this network. We focus on human as the host and integrate 32,859 experimentally determined interactions between human proteins and virus proteins from several databases. We investigate different properties of human proteins targeted by virus proteins and find that most of them have considerably high centrality scores in the human intra protein-protein interaction network. Investigating the network properties of human proteins that are targeted by different virus proteins can help us design multipurpose drugs. RESULTS: Because the host-pathogen protein-protein interaction network is a bipartite network and centrality measures for this type of network are scarce, we propose seven new centrality measures for analyzing bipartite networks. Applying them to different virus strains reveals the non-randomness of the attack strategies of virus proteins, which could help in drug design and hence improve quality of life. They could also be used to detect host essential proteins. Essential proteins are those whose functions are critical for the survival of their host. One of the proposed centralities, named diversity of predators, outperforms the other existing centralities in detecting essential proteins and could be used as an optimal essential protein marker. CONCLUSIONS: Different centralities were applied to analyze the human protein-protein interaction network and to detect the characteristics of human proteins targeted by virus proteins. Moreover, seven new centralities were proposed to analyze the host-pathogen protein-protein interaction network and to detect the host proteins that pathogens target most. Comparing different centralities in detecting essential proteins reveals that diversity of predators (one of the proposed centralities) is the best essential protein marker.
Affiliation(s)
- Babak Khorsand
  - Computer Engineering Department, Faculty of Engineering, Ferdowsi University of Mashhad, Mashhad, Iran
- Abdorreza Savadi
  - Computer Engineering Department, Faculty of Engineering, Ferdowsi University of Mashhad, Mashhad, Iran
  - Ferdowsi University of Mashhad, Azadi Square, Mashhad, 9177948974 Iran
|
24
|
Zhang X, Xiao W, Xiao W. DeepHE: Accurately predicting human essential genes based on deep learning. PLoS Comput Biol 2020; 16:e1008229. [PMID: 32936825 PMCID: PMC7521708 DOI: 10.1371/journal.pcbi.1008229] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/02/2020] [Revised: 09/28/2020] [Accepted: 08/09/2020] [Indexed: 11/19/2022] Open
Abstract
Accurately predicting essential genes using computational methods can greatly reduce the time and resources needed to find them via wet-lab experiments, and can further accelerate the process of drug discovery. Several computational methods have been proposed for predicting essential genes in model organisms by integrating multiple biological data sources via either centrality measures or machine learning-based methods. However, methods aiming to predict human essential genes are still limited and their performance still needs improvement. In addition, most machine learning-based essential gene prediction methods lack the ability to handle the imbalanced learning issue inherent in the essential gene prediction problem, which might be one factor affecting their performance. We propose a deep learning-based method, DeepHE, to predict human essential genes by integrating features derived from sequence data and the protein-protein interaction (PPI) network. A deep learning-based network embedding method is used to automatically learn features from the PPI network. In addition, 89 sequence features were derived from the DNA sequence and protein sequence of each gene. These two types of features are integrated to train a multilayer neural network. A cost-sensitive technique is used to address the imbalanced learning problem when training the deep neural network. The experimental results for predicting human essential genes show that our proposed method, DeepHE, can accurately predict human gene essentiality, with an average AUC higher than 94%, an area under the precision-recall curve (AP) higher than 90%, and an accuracy higher than 90%. We also compare DeepHE with several widely used traditional machine learning models (SVM, Naïve Bayes, Random Forest, and Adaboost) using the same features and the same cost-sensitive technique to counter the imbalanced learning issue. The experimental results show that DeepHE significantly outperforms the compared machine learning models. We have demonstrated that human essential genes can be accurately predicted by designing an effective machine learning algorithm and integrating representative features captured from available biological data. The proposed deep learning framework is effective for this task.
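Cost-sensitive training for imbalanced classes, as mentioned in this abstract, is often implemented by weighting each sample's loss by the inverse frequency of its class. The abstract does not say which weighting DeepHE uses, so the sketch below is only one common choice: the "balanced" heuristic, the function names, and the toy labels are all assumptions.

```python
import numpy as np

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    This is the widely used 'balanced' heuristic, assumed here for illustration.
    """
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}

def weighted_binary_cross_entropy(y_true, y_prob, class_weights, eps=1e-12):
    """Mean binary cross-entropy where each sample is scaled by its class weight."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    sample_weights = np.where(y_true == 1, class_weights[1], class_weights[0])
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(sample_weights * losses))

# Toy labels: one essential gene (1) among seven non-essential genes (0).
labels = [1, 0, 0, 0, 0, 0, 0, 0]
weights = balanced_class_weights(labels)

# An uninformative classifier that outputs 0.5 for every gene.
loss = weighted_binary_cross_entropy(labels, [0.5] * len(labels), weights)
print(weights, loss)
```

With these weights, misclassifying the single essential gene costs seven times as much as misclassifying one non-essential gene, so the minority class is not ignored during training.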
Affiliation(s)
- Xue Zhang
  - Faculty of Computer and Software Engineering, Huaiyin Institute of Technology, Huai’an, Jiangsu, China
  - School of Medicine, Tufts University, Boston, Massachusetts, United States of America
- Wangxin Xiao
  - Faculty of Transportation Engineering, Huaiyin Institute of Technology, Huai’an, Jiangsu, China
- Weijia Xiao
  - Boston Latin School, Boston, Massachusetts, United States of America
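The cost-sensitive training mentioned in the DeepHE abstract can be illustrated with a minimal sketch. The abstract does not state which weighting scheme DeepHE uses, so the inverse-frequency weights and the weighted binary cross-entropy below are assumptions chosen for illustration, and the names `class_weights` and `weighted_bce` are hypothetical.

```python
import math
from collections import Counter


def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).

    Assumption: this common 'balanced' scheme stands in for whatever
    cost-sensitive weighting the cited paper actually uses.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * m) for c, m in counts.items()}


def weighted_bce(y_true, y_prob, weights, eps=1e-12):
    """Binary cross-entropy where each sample is scaled by its class weight."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        total += weights[y] * loss
    return total / len(y_true)


# Toy labels: 2 essential genes (1) among 10, so the positive class is rare.
labels = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
weights = class_weights(labels)
loss = weighted_bce(labels, [0.5] * len(labels), weights)
```

With these weights, an error on a rare essential gene costs four times as much as an error on a non-essential gene, which is the general effect a cost-sensitive loss is meant to have.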
|
25
|
Li G, Li M, Wang J, Li Y, Pan Y. United Neighborhood Closeness Centrality and Orthology for Predicting Essential Proteins. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1451-1458. [PMID: 30596582 DOI: 10.1109/tcbb.2018.2889978] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Indexed: 05/06/2023]
Abstract
Identifying essential proteins plays an important role in disease study, drug design, and understanding the minimal requirement for cellular life. Computational methods for essential protein discovery overcome the disadvantages of biological experimental methods, which are often time-consuming, expensive, and inefficient. The topological features of protein-protein interaction (PPI) networks are often used to design computational prediction methods, such as Degree Centrality (DC), Betweenness Centrality (BC), Closeness Centrality (CC), Subgraph Centrality (SC), Eigenvector Centrality (EC), Information Centrality (IC), and Neighborhood Centrality (NC). However, the prediction accuracy of these individual methods still has room for improvement. Studies show that additional information, such as orthologous relations, helps discover essential proteins. Many researchers have proposed methods that combine multiple information sources to improve prediction accuracy. In this study, we find that essential proteins appear in triangular structures in the PPI network significantly more often than nonessential ones. Based on this phenomenon, we propose a novel pure centrality measure called Neighborhood Closeness Centrality (NCC). Accordingly, we develop a new combination model, the Extended Pareto Optimality Consensus model (EPOC), to fuse NCC and orthology information, and propose a novel essential protein identification method, NCCO. Compared with seven existing classic centrality methods (DC, BC, IC, CC, SC, EC, and NC) and three consensus methods (PeC, ION, and CSC), our results on S. cerevisiae and E. coli datasets show that NCCO has clear advantages. As a consensus method, EPOC also yields better performance than the random walk model.
|
26
|
Li M, Meng X, Zheng R, Wu FX, Li Y, Pan Y, Wang J. Identification of Protein Complexes by Using a Spatial and Temporal Active Protein Interaction Network. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:817-827. [PMID: 28885159 DOI: 10.1109/tcbb.2017.2749571] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 06/07/2023]
Abstract
The rapid development of proteomics and high-throughput technologies has produced a large amount of Protein-Protein Interaction (PPI) data, which makes it possible to consider the dynamic properties of protein interaction networks (PINs) instead of only their static properties. Identification of protein complexes from dynamic PINs has become a vital scientific problem for understanding cellular life in the post-genome era. Up to now, many models and methods have been proposed for constructing dynamic PINs to identify protein complexes. However, most of the constructed dynamic PINs focus only on temporal dynamic information and thus overlook the spatial dynamic information of complex biological systems. To address this limitation of existing dynamic PIN analysis approaches, we propose a new model-based scheme for constructing the Spatial and Temporal Active Protein Interaction Network (ST-APIN) by integrating time-course gene expression data and subcellular location information. To evaluate the efficiency of ST-APIN, the commonly used classical clustering algorithm MCL is adopted to identify protein complexes from ST-APIN and three other dynamic PINs: NF-APIN, DPIN, and TC-PIN. The experimental results show that MCL performs better on ST-APIN than on the other three dynamic PINs in terms of matching with known complexes, sensitivity, specificity, and f-measure. Furthermore, we evaluate the identified protein complexes by Gene Ontology (GO) function enrichment analysis. The validation shows that the protein complexes identified from ST-APIN are more biologically significant. This study provides a general paradigm for constructing ST-APINs, which is essential for further understanding of molecular systems and the biomedical mechanisms of complex diseases.
|
27
|
Jia K, Zhou Y, Cui Q. Quantifying Gene Essentiality Based on the Context of Cellular Components. Front Genet 2020; 10:1342. [PMID: 32038710 PMCID: PMC6985572 DOI: 10.3389/fgene.2019.01342] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/12/2019] [Accepted: 12/09/2019] [Indexed: 11/26/2022] Open
Abstract
Different genes have their protein products localized in various subcellular compartments. The diversity in protein localization may serve as a gene characteristic, revealing gene essentiality from a subcellular perspective. To measure this diversity, we introduced a Subcellular Diversity Index (SDI) based on the Gene Ontology-Cellular Component Ontology (GO-CCO) and a semantic similarity measure of GO terms. Analyses revealed that SDI of human genes was well correlated with some known measures of gene essentiality, including protein–protein interaction (PPI) network topology measurements, dN/dS ratio, homologous gene number, expression level and tissue specificity. In addition, SDI had a good performance in predicting human essential genes (AUC = 0.702) and drug target genes (AUC = 0.704), and drug targets with higher SDI scores tended to cause more side-effects. The results suggest that SDI could be used to identify novel drug targets and to guide the filtering of drug targets with fewer potential side effects. Finally, we developed a user-friendly online database for querying SDI score for genes across eight species, and the predicted probabilities of human drug target based on SDI. The online database of SDI is available at: http://www.cuilab.cn/sdi.
Affiliation(s)
- Kaiwen Jia
  - Department of Biomedical Informatics, Department of Physiology and Pathophysiology, Center for Noncoding RNA Medicine, MOE Key Lab of Cardiovascular Sciences, School of Basic Medical Sciences, Peking University, Beijing, China
- Yuan Zhou
  - Department of Biomedical Informatics, Department of Physiology and Pathophysiology, Center for Noncoding RNA Medicine, MOE Key Lab of Cardiovascular Sciences, School of Basic Medical Sciences, Peking University, Beijing, China
- Qinghua Cui
  - Department of Biomedical Informatics, Department of Physiology and Pathophysiology, Center for Noncoding RNA Medicine, MOE Key Lab of Cardiovascular Sciences, School of Basic Medical Sciences, Peking University, Beijing, China
|
28
|
Zeng M, Li M, Wu FX, Li Y, Pan Y. DeepEP: a deep learning framework for identifying essential proteins. BMC Bioinformatics 2019; 20:506. [PMID: 31787076 PMCID: PMC6886168 DOI: 10.1186/s12859-019-3076-y] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Indexed: 11/20/2022] Open
Abstract
Background: Essential proteins are crucial for cellular life, and thus the identification of essential proteins is an important topic and a challenging problem for researchers. Recently, many computational approaches have been proposed to handle this problem. However, traditional centrality methods cannot fully represent the topological features of biological networks. In addition, identifying essential proteins is an imbalanced learning problem, but few current shallow machine learning-based methods are designed to handle the imbalanced characteristics.
Results: We develop DeepEP based on a deep learning framework that uses the node2vec technique, multi-scale convolutional neural networks and a sampling technique to identify essential proteins. In DeepEP, the node2vec technique is applied to automatically learn topological and semantic features for each protein in the protein-protein interaction (PPI) network. Gene expression profiles are treated as images, and multi-scale convolutional neural networks are applied to extract their patterns. In addition, DeepEP uses a sampling method to alleviate the imbalanced characteristics. The sampling method draws the same number of majority and minority samples in each training epoch, so training is not biased toward either class. The experimental results show that DeepEP outperforms traditional centrality methods. Moreover, DeepEP is better than shallow machine learning-based methods. Detailed analyses show that the dense vectors generated by the node2vec technique contribute substantially to the improved performance. The node2vec technique effectively captures the topological and semantic properties of the PPI network. The sampling method also improves the performance of identifying essential proteins.
Conclusion: We demonstrate that DeepEP improves prediction performance by integrating multiple deep learning techniques and a sampling method. DeepEP is more effective than existing methods.
Affiliation(s)
- Min Zeng
  - School of Computer Science and Engineering, Central South University, Changsha, 410083, People's Republic of China
- Min Li
  - School of Computer Science and Engineering, Central South University, Changsha, 410083, People's Republic of China.
- Fang-Xiang Wu
  - Division of Biomedical Engineering and Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A9, Canada
- Yaohang Li
  - Department of Computer Science, Old Dominion University, Norfolk, VA 23529, USA
- Yi Pan
  - Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
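The per-epoch balanced sampling described in the DeepEP abstract can be sketched as follows. This is only an illustration of the idea of drawing equal numbers of majority and minority samples each epoch. The function name `balanced_epoch` is hypothetical, and the choice to reuse every minority sample in each epoch is an assumption that the abstract does not confirm.

```python
import random


def balanced_epoch(pos_idx, neg_idx, rng):
    """Return sample indices for one epoch with equal class counts.

    Assumption: all minority-class samples are used every epoch, and an
    equally sized random subset of the majority class is redrawn each time.
    """
    minority, majority = sorted((pos_idx, neg_idx), key=len)
    sampled_majority = rng.sample(majority, len(minority))
    batch = list(minority) + sampled_majority
    rng.shuffle(batch)
    return batch


# Toy index sets: 3 essential proteins versus 17 non-essential ones.
pos = list(range(3))
neg = list(range(3, 20))
rng = random.Random(0)
epoch_indices = balanced_epoch(pos, neg, rng)
```

Because the majority subset is redrawn on every call, different epochs see different non-essential proteins while the class ratio within each epoch stays at one to one.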
|
29
|
Li G, Li M, Peng W, Li Y, Pan Y, Wang J. A novel extended Pareto Optimality Consensus model for predicting essential proteins. J Theor Biol 2019; 480:141-149. [PMID: 31398315 DOI: 10.1016/j.jtbi.2019.08.005] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 02/07/2019] [Revised: 08/02/2019] [Accepted: 08/06/2019] [Indexed: 12/11/2022]
Abstract
Essential proteins have vital functions: when they are destroyed in cells, the cells die or stop reproducing. Therefore, it is very important to identify essential proteins among a large number of other proteins. Because biological experimental methods are time-consuming, expensive, and inefficient, computational methods have become increasingly popular for recognizing them. In the early stages, these methods relied mainly on protein-protein interaction (PPI) information, which limited their discovery capacity. Researchers have since developed novel methods that fuse multiple information sources to improve prediction accuracy. Essential proteins are more important and more conserved during evolution, their neighbors in PPI networks are likely to be essential, and PPI data contain many false positives. Based on these features, whether a protein is essential can be assessed by the importance of the protein itself, the relevance of its neighbors, and the reliability of its PPIs. The importance of neighbors and the reliability of PPIs can be further integrated into a neighborhood feature. In this study, orthologous information, edge-clustering coefficient, and gene expression information are used to measure the importance of a protein itself, the importance of its neighbors, and the reliability of PPIs, respectively. Then, a novel extended POC model, E_POC, is proposed to fuse this information, and a weighted PPI network is constructed. The proteins ranked highest by weight are treated as candidate essential proteins. E_POC outperforms the existing classical methods on S. cerevisiae and E. coli data.
Affiliation(s)
- Gaoshi Li
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China; Guangxi Key Lab of Multi-source Information Mining & Security, Guangxi Normal University, Guilin, Guangxi 541004, China.
- Min Li
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China.
- Wei Peng
  - Computer Center/Faculty of Information Engineering and Automation of Kunming University of Science and Technology, Kunming, Yunnan 650093, China
- Yaohang Li
  - Department of Computer Science, Old Dominion University, Norfolk, VA 23529, USA.
- Yi Pan
  - Department of Computer Science, Georgia State University, Atlanta, GA 30302-4110, USA.
- Jianxin Wang
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China.
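The edge-clustering coefficient named in the abstract above can be illustrated with a short sketch. The abstract does not give the formula, so the code assumes the definition commonly used in the network literature: the number of triangles containing an edge divided by the maximum number possible, min(deg(u) - 1, deg(v) - 1). The toy network and the helper names are hypothetical.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map: node -> set of neighbors."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def edge_clustering_coefficient(adj, u, v):
    """Triangles through edge (u, v) over the maximum possible count.

    Assumption: ECC(u, v) = |N(u) & N(v)| / min(deg(u) - 1, deg(v) - 1),
    defined as 0 when the denominator is 0.
    """
    triangles = len(adj[u] & adj[v])
    max_triangles = min(len(adj[u]) - 1, len(adj[v]) - 1)
    return triangles / max_triangles if max_triangles > 0 else 0.0


# Toy PPI network: A, B, C form a triangle; D hangs off C.
adj = build_adjacency([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")])
ecc_ab = edge_clustering_coefficient(adj, "A", "B")
ecc_cd = edge_clustering_coefficient(adj, "C", "D")
```

An edge inside a densely connected neighborhood scores close to 1, while a pendant edge such as C-D scores 0, which is why the measure is used as a proxy for how tightly two interacting proteins are embedded in the same module.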
|
30
|
Li M, Ni P, Chen X, Wang J, Wu FX, Pan Y. Construction of Refined Protein Interaction Network for Predicting Essential Proteins. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:1386-1397. [PMID: 28186903 DOI: 10.1109/tcbb.2017.2665482] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Indexed: 06/06/2023]
Abstract
Identification of essential proteins based on the protein interaction network (PIN) is an important and active topic in the post-genome era. Up to now, a number of network-based essential protein discovery methods have been proposed. Generally, a static protein interaction network is constructed from the protein-protein interactions obtained from different experiments or databases. Unfortunately, most network-based essential protein discovery methods are sensitive to the reliability of the constructed PIN. In this paper, we propose a new method for constructing a refined PIN by using gene expression profiles and subcellular location information. The basic idea behind refining the PIN is that two proteins are more likely to physically interact with each other if they appear together at the same subcellular location and are active together at at least one time point in the cell cycle. The original static PIN is denoted by S-PIN, while the final PIN refined by our method is denoted by TS-PIN. To evaluate whether the constructed TS-PIN is more suitable for identifying essential proteins, 10 network-based essential protein discovery methods (DC, EC, SC, BC, CC, IC, LAC, NC, BN, and DMNC) are applied to it. TS-PIN is compared with two other networks, S-PIN and NF-APIN (a noise-filtered active PIN constructed by using gene expression data and S-PIN), on the prediction of essential proteins using these ten network-based methods. The comparison results show that all 10 network-based methods achieve better results when applied to TS-PIN than when applied to S-PIN or NF-APIN.
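The refinement rule stated in the abstract above (keep an interaction only if the two proteins share a subcellular location and are active together at one or more time points) can be sketched as below. The abstract does not say how "active" is decided, so the single fixed expression threshold is an assumption, and the function name, data layout, and toy values are all hypothetical.

```python
def refine_network(edges, locations, expression, threshold):
    """Keep edges whose endpoints co-localize and are co-active at some time.

    Assumption: a protein counts as active at a time point when its
    expression value is at least `threshold` (a simplification).
    """
    refined = []
    for u, v in edges:
        same_place = bool(locations.get(u, set()) & locations.get(v, set()))
        co_active = any(
            a >= threshold and b >= threshold
            for a, b in zip(expression.get(u, []), expression.get(v, []))
        )
        if same_place and co_active:
            refined.append((u, v))
    return refined


# Toy static network with three time points of expression per protein.
static_edges = [("P1", "P2"), ("P1", "P3"), ("P2", "P3")]
locations = {
    "P1": {"nucleus"},
    "P2": {"nucleus", "cytoplasm"},
    "P3": {"cytoplasm"},
}
expression = {
    "P1": [0.2, 0.9, 0.1],
    "P2": [0.1, 0.8, 0.7],
    "P3": [0.9, 0.1, 0.6],
}
refined_edges = refine_network(static_edges, locations, expression, threshold=0.5)
```

In this toy case the P1-P3 edge is dropped because the two proteins never share a compartment, while the other two edges survive because each pair is co-located and co-active at some time point.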
|
31
|
Zhang Z, Ruan J, Gao J, Wu FX. Predicting essential proteins from protein-protein interactions using order statistics. J Theor Biol 2019; 480:274-283. [PMID: 31251944 DOI: 10.1016/j.jtbi.2019.06.022] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 07/26/2018] [Revised: 03/24/2019] [Accepted: 06/24/2019] [Indexed: 12/11/2022]
Abstract
Many computational methods have been proposed to predict essential proteins from protein-protein interaction (PPI) networks. However, it is still challenging to improve the prediction accuracy. In this study, we propose a new method, esPOS (essential proteins Predictor using Order Statistics) to predict essential proteins from PPI networks. Firstly, we refine the networks by using gene expression information and subcellular localization information. Secondly, we design some new features, which combine the protein predicted secondary structure with PPI network. We show that these new features are useful to predict essential proteins. Thirdly, we optimize these features by using a greedy method, and combine the optimized features by order statistic method. Our method achieves the prediction accuracy of 0.76-0.79 on two network datasets. The proposed method is available at https://sourceforge.net/projects/espos/.
Affiliation(s)
- Zhaopeng Zhang
  - School of Mathematical Sciences and LPMC, Nankai University, Tianjin, 300071, China.
- Jishou Ruan
  - School of Mathematical Sciences and LPMC, Nankai University, Tianjin, 300071, China.
- Jianzhao Gao
  - School of Mathematical Sciences and LPMC, Nankai University, Tianjin, 300071, China.
- Fang-Xiang Wu
  - Department of Mechanical Engineering and Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A9, Canada.
|
32
|
Sanasam BD, Kumar S. PRE-binding protein of Plasmodium falciparum is a potential candidate for vaccine design and development: An in silico evaluation of the hypothesis. Med Hypotheses 2019; 125:119-123. [DOI: 10.1016/j.mehy.2019.01.006] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 10/12/2018] [Revised: 12/14/2018] [Accepted: 01/10/2019] [Indexed: 11/29/2022]
|
33
|
Abstract
Background: Essential proteins play important roles in the survival or reproduction of an organism and support the stability of the system. Essential proteins are the minimum set of proteins absolutely required to maintain a living cell. The identification of essential proteins is a very important topic, not only for a better comprehension of the minimal requirements for cellular life, but also for a more efficient discovery of human disease genes and drug targets. Traditionally, because the experimental identification of essential proteins is complex, it usually requires great time and expense. With the accumulation of high-throughput experimental data, many computational methods that usefully complement experimental methods have been proposed to identify essential proteins. In addition, the ability to rapidly and precisely identify essential proteins is of great significance for discovering disease genes and for drug design, and has great potential for applications in basic and synthetic biology research.
Objective: The aim of this paper is to review the identification of essential proteins and genes, focusing on current developments in different types of computational methods, to point out the progress and limitations of existing methods, and to discuss the challenges and directions for further research.
Affiliation(s)
- Ming Fang
  - School of Computer Science, Shaanxi Normal University, Xi'an 710119, China
- Xiujuan Lei
  - School of Computer Science, Shaanxi Normal University, Xi'an 710119, China
- Ling Guo
  - College of Life Sciences, Shaanxi Normal University, Xi'an 710119, China
|
34
|
Lei X, Yang X, Fujita H. Random walk based method to identify essential proteins by integrating network topology and biological characteristics. Knowl Based Syst 2019. [DOI: 10.1016/j.knosys.2019.01.012] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Indexed: 12/12/2022]
|
35
|
Lei X, Wang S, Wu F. Identification of Essential Proteins Based on Improved HITS Algorithm. Genes (Basel) 2019; 10:E177. [PMID: 30823614 PMCID: PMC6409685 DOI: 10.3390/genes10020177] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 11/27/2018] [Revised: 02/09/2019] [Accepted: 02/19/2019] [Indexed: 11/16/2022] Open
Abstract
Essential proteins are critical to the development and survival of cells. Identifying and analyzing essential proteins is vital to understanding the molecular mechanisms of living cells and designing new drugs. With the development of high-throughput technologies, a large amount of protein-protein interaction (PPI) data has become available, which facilitates the study of essential proteins at the network level. Although various computational methods have been proposed, prediction precision still needs to be improved. In this paper, we propose a novel method, named HSEP, that applies Hyperlink-Induced Topic Search (HITS) to weighted PPI networks to detect essential proteins. First, the original undirected PPI network is transformed into a bidirectional PPI network. Then, both biological information and network topological characteristics are used to weight the PPI network. The biological information includes gene expression data, Gene Ontology (GO) annotation and subcellular localization. The edge clustering coefficient serves as the network topological characteristic and measures the closeness of two connected nodes. We conducted experiments on two species, namely Saccharomyces cerevisiae and Drosophila melanogaster, and the experimental results show that HSEP outperformed some state-of-the-art essential protein detection techniques.
Affiliation(s)
- Xiujuan Lei
  - School of Computer Science, Shaanxi Normal University, Xi'an 710119, China.
- Siguo Wang
  - School of Computer Science, Shaanxi Normal University, Xi'an 710119, China.
- Fangxiang Wu
  - Department of Mechanical Engineering and Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A9, Canada.
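The idea of running HITS on a weighted, bidirectional PPI network, as described in the HSEP abstract, can be sketched with the standard hub and authority power iteration. This is the textbook HITS update with edge weights, not the improved variant of the cited paper, whose details the abstract does not give. The function name, the toy weights, and the choice to rank by authority score are assumptions.

```python
import math


def weighted_hits(weights, iterations=50):
    """Standard HITS power iteration on a weighted directed graph.

    weights: dict mapping (source, target) -> positive edge weight.
    Returns (authority, hub) score dicts, each L2-normalized.
    """
    nodes = sorted({n for edge in weights for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: weighted sum of hub scores of in-neighbors.
        new_auth = {n: 0.0 for n in nodes}
        for (s, t), w in weights.items():
            new_auth[t] += w * hub[s]
        # Hub update: weighted sum of authority scores of out-neighbors.
        new_hub = {n: 0.0 for n in nodes}
        for (s, t), w in weights.items():
            new_hub[s] += w * new_auth[t]
        for scores in (new_auth, new_hub):
            norm = math.sqrt(sum(x * x for x in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
        auth, hub = new_auth, new_hub
    return auth, hub


# Toy undirected weighted network, made bidirectional as in the abstract.
undirected = {("A", "B"): 1.0, ("B", "C"): 1.0, ("B", "D"): 0.5}
bidirectional = dict(undirected)
bidirectional.update({(t, s): w for (s, t), w in undirected.items()})

auth, hub = weighted_hits(bidirectional)
top_protein = max(auth, key=auth.get)
```

In this toy star-shaped network the central node B receives the highest authority score, which matches the intuition that strongly connected proteins rank first.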
|
36
|
Li X, Li W, Zeng M, Zheng R, Li M. Network-based methods for predicting essential genes or proteins: a survey. Brief Bioinform 2019; 21:566-583. [DOI: 10.1093/bib/bbz017] [Citation(s) in RCA: 55] [Impact Index Per Article: 11.0] [Received: 10/31/2018] [Revised: 01/21/2019] [Accepted: 01/22/2019] [Indexed: 12/14/2022] Open
Abstract
Genes that are thought to be critical for the survival of organisms or cells are called essential genes. The prediction of essential genes and their products (essential proteins) is of great value in exploring the mechanisms of complex diseases, studying the minimal genome required for living cells, and developing new drug targets. As laboratory methods are often complicated, costly and time-consuming, and with the deepening understanding of network biology and the rapid development of biotechnologies, a great many computational methods have been proposed to identify essential genes/proteins at the network level. By analyzing the topological characteristics of essential genes/proteins in protein–protein interaction networks (PINs), integrating biological information and considering the dynamic features of PINs, network-based methods have proved effective in identifying essential genes/proteins. In this paper, we survey the advanced methods for network-based prediction of essential genes/proteins and present the challenges and directions for future research.
Affiliation(s)
- Xingyi Li
  - School of Computer Science and Engineering, Central South University, Changsha, Hunan, China
- Wenkai Li
  - School of Computer Science and Engineering, Central South University, Changsha, Hunan, China
- Min Zeng
  - School of Computer Science and Engineering, Central South University, Changsha, Hunan, China
- Ruiqing Zheng
  - School of Computer Science and Engineering, Central South University, Changsha, Hunan, China
- Min Li
  - School of Computer Science and Engineering, Central South University, Changsha, Hunan, China
|
37
|
Zhang F, Peng W, Yang Y, Dai W, Song J. A Novel Method for Identifying Essential Genes by Fusing Dynamic Protein-Protein Interactive Networks. Genes (Basel) 2019; 10:genes10010031. [PMID: 30626157 PMCID: PMC6356314 DOI: 10.3390/genes10010031] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 11/13/2018] [Revised: 12/24/2018] [Accepted: 01/02/2019] [Indexed: 11/16/2022] Open
Abstract
Essential genes play an indispensable role in supporting the life of an organism. Identification of essential genes helps us to understand the underlying mechanisms of cell life. The essential genes of bacteria are potential drug targets for some diseases. Recently, several computational methods have been proposed to detect essential genes based on static protein-protein interactive (PPI) networks. However, these methods have ignored the fact that essential genes play essential roles under certain conditions. In this work, a novel method, called FDP, was proposed for the identification of essential proteins by fusing the dynamic PPI networks of different time points. First, the active PPI network of each time point was constructed, and the networks were then fused into a final network according to their similarities. Finally, a novel centrality method was designed to assign each gene in the final network a ranking score, considering its orthologous property and its global and local topological properties in the network. This model was applied to two different yeast data sets. The results showed that FDP achieved better performance in essential gene prediction than other existing methods based on either static PPI networks or dynamic networks.
Affiliation(s)
- Fengyu Zhang
  - Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650093, China.
- Wei Peng
  - Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650093, China.
  - Computer Center of Kunming University of Science and Technology, Kunming 650093, China.
- Yunfei Yang
  - Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650093, China.
- Wei Dai
  - Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650093, China.
- Junrong Song
  - Faculty of Management and Economics, Kunming University of Science and Technology, Kunming 650093, China.
|
38
|
Ijaq J, Malik G, Kumar A, Das PS, Meena N, Bethi N, Sundararajan VS, Suravajhala P. A model to predict the function of hypothetical proteins through a nine-point classification scoring schema. BMC Bioinformatics 2019; 20:14. [PMID: 30621574 PMCID: PMC6325861 DOI: 10.1186/s12859-018-2554-y] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 06/05/2018] [Accepted: 11/30/2018] [Indexed: 12/14/2022] Open
Abstract
Background: Hypothetical proteins [HP] are those that are predicted to be expressed in an organism but for which no evidence of existence is known. In the recent past, annotation and curation efforts have helped overcome the challenge of understanding their diverse functions. Researchers have developed techniques to decipher the sequence-structure-function relationship, especially in terms of functional modelling of the HPs, but using the features as classifiers for HPs has not been attempted. With the rise in the number of annotation strategies, next-generation sequencing methods have provided further understanding of the functions of HPs.
Results: In our previous work, we developed a six-point classification scoring schema with annotation pertaining to protein family scores, orthology, protein interaction/association studies, bidirectional best BLAST hits, sorting signals, known databases and visualizers, which were used to validate protein interactions. In this study, we introduced three more classifiers to our annotation system, viz. pseudogenes linked to HPs, homology modelling and non-coding RNAs associated with HPs. We discuss the challenges and performance of these classifiers using machine learning heuristics, with improved accuracy for Perceptron (81.08 to 97.67), Naive Bayes (54.05 to 96.67), Decision tree J48 (67.57 to 97.00), and SMO_npolyk (59.46 to 96.67).
Conclusion: With the introduction of three new classification features, the nine-point classification scoring schema functionally annotates the HPs with improved accuracy.
Affiliation(s)
- Johny Ijaq
  - Department of Biotechnology, Osmania University, Hyderabad, 500007 India
  - Bioclues.org, Kukatpally, Hyderabad, 500072 India
- Girik Malik
  - Department of Pediatrics, The Battelle Center for Mathematical Medicine, The Research Institute at Nationwide Children’s Hospital, The Ohio State University, Columbus, OH USA
  - Bioclues.org, Kukatpally, Hyderabad, 500072 India
  - Labrynthe, New Delhi, India
- Anuj Kumar
  - Bioclues.org, Kukatpally, Hyderabad, 500072 India
  - Advanced Center for Computational and Applied Biotechnology, Uttarakhand Council for Biotechnology, Dehradun, 248007 India
- Partha Sarathi Das
  - Bioclues.org, Kukatpally, Hyderabad, 500072 India
  - Department of Microbiology, Bioinformatics Infrastructure Facility, Vidyasagar University, Midnapore, India
- Narendra Meena
  - Department of Biotechnology and Bioinformatics, Birla Institute of Scientific Research, Statue Circle, RJ 302001 India
- Neeraja Bethi
  - Department of Biotechnology, Osmania University, Hyderabad, 500007 India
- Prashanth Suravajhala
  - Bioclues.org, Kukatpally, Hyderabad, 500072 India
  - Department of Biotechnology and Bioinformatics, Birla Institute of Scientific Research, Statue Circle, RJ 302001 India
|
39
|
Dong C, Jin YT, Hua HL, Wen QF, Luo S, Zheng WX, Guo FB. Comprehensive review of the identification of essential genes using computational methods: focusing on feature implementation and assessment. Brief Bioinform 2018; 21:171-181. [PMID: 30496347 DOI: 10.1093/bib/bby116] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Received: 06/13/2018] [Revised: 11/01/2018] [Accepted: 11/02/2018] [Indexed: 02/06/2023] Open
Abstract
Essential genes have attracted increasing attention in recent years due to the important functions of these genes in organisms. Among the methods used to identify the essential genes, accurate and efficient computational methods can make up for the deficiencies of expensive and time-consuming experimental technologies. In this review, we have collected researches on essential gene predictions in prokaryotes and eukaryotes and summarized the five predominant types of features used in these studies. The five types of features include evolutionary conservation, domain information, network topology, sequence component and expression level. We have described how to implement the useful forms of these features and evaluated their performance based on the data of Escherichia coli MG1655, Bacillus subtilis 168 and human. The prerequisite and applicable range of these features is described. In addition, we have investigated the techniques used to weight features in various models. To facilitate researchers in the field, two available online tools, which are accessible for free and can be directly used to predict gene essentiality in prokaryotes and humans, were referred. This article provides a simple guide for the identification of essential genes in prokaryotes and eukaryotes.
Affiliation(s)
- Chuan Dong
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Yan-Ting Jin
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Hong-Li Hua
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Qing-Feng Wen
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Sen Luo
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Wen-Xin Zheng
- School of Biomedical Engineering, Capital Medical University, Beijing, China
- Feng-Biao Guo
- School of Life Science and Technology, Center for Informational Biology, Intelligent Learning Institute for Science and Application, University of Electronic Science and Technology of China, Chengdu, China
|
40
|
Lei X, Zhao J, Fujita H, Zhang A. Predicting essential proteins based on RNA-Seq, subcellular localization and GO annotation datasets. Knowl Based Syst 2018. [DOI: 10.1016/j.knosys.2018.03.027] [Citation(s) in RCA: 45] [Impact Index Per Article: 7.5] [Indexed: 11/24/2022]
|
41
|
Feature Selection via Swarm Intelligence for Determining Protein Essentiality. Molecules 2018; 23:1569. [PMID: 29958434 PMCID: PMC6100311 DOI: 10.3390/molecules23071569] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 05/25/2018] [Revised: 06/22/2018] [Accepted: 06/25/2018] [Indexed: 01/24/2023]
Abstract
Protein essentiality is fundamental to comprehending the function and evolution of genes. The prediction of protein essentiality is pivotal in identifying disease genes and potential drug targets. Since experimental methods require substantial investments of time and funds, it is of great value to predict protein essentiality with high accuracy using computational methods. In this study, we present a novel feature selection method named Elite Search mechanism-based Flower Pollination Algorithm (ESFPA) to determine protein essentiality. Unlike other protein essentiality prediction methods, ESFPA uses an improved swarm intelligence-based algorithm for feature selection and selects optimal features for protein essentiality prediction. The first step is to collect numerous features that are highly predictive of essentiality. The second step is to develop a feature selection strategy based on a swarm intelligence algorithm to obtain the optimal feature subset. Furthermore, an elite search mechanism is adopted to further improve the quality of the feature subset. Subsequently, a hybrid classifier is applied to evaluate the essentiality of each protein. Finally, the experimental results show that our method is competitive with some well-known feature selection methods. The proposed method aims to provide a new perspective for protein essentiality determination.
|
42
|
Lei X, Fang M, Wu FX, Chen L. Improved flower pollination algorithm for identifying essential proteins. BMC Syst Biol 2018; 12:46. [PMID: 29745838 PMCID: PMC5998882 DOI: 10.1186/s12918-018-0573-y] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Indexed: 01/30/2023]
Abstract
Background: Essential proteins are necessary for the survival and development of cells. The identification of essential proteins can help to understand the minimal requirements for cellular life, and it also plays an important role in the study of disease genes and in drug design. With the development of high-throughput techniques, a large amount of protein-protein interaction data is available to predict essential proteins at the network level. Although a number of essential protein discovery methods have been proposed, the prediction precision still needs to be improved. Methods: In this paper, we propose a new algorithm, an improved Flower Pollination Algorithm (FPA) for identifying Essential proteins, named FPE. Unlike other existing essential protein discovery methods, we apply FPA, a new intelligent algorithm that imitates the pollination behavior of flowering plants in nature, to identify essential proteins. Just as flower pollination seeks optimal reproduction from the perspective of biological evolution, the identification of essential proteins seeks a candidate essential protein set; we analyze the correspondence between FPA and the prediction of essential proteins and redefine the positions of flowers and the specific pollination process. Moreover, it has been shown that the integration of biological and topological properties can improve the precision of essential protein identification. Consequently, we develop a GSC measurement to judge the essentiality of proteins, which takes into account not only Gene expression data, Subcellular localization and protein Complexes information, but also the network topology.
Results: The experimental results show that FPE performs better than the state-of-the-art methods (DC, SC, IC, EC, LAC, NC, PeC, WDC, UDoNC and SON) in terms of prediction precision, precision-recall curve and jackknife curve for identifying essential proteins, and that it also has high stability. Conclusions: We confirm that FPE can be used to effectively identify essential proteins through the use of the nature-inspired algorithm FPA and the combination of network topology with gene expression data, subcellular localization and protein complexes information. The experimental results have shown the superiority of FPE for the prediction of essential proteins.
Affiliation(s)
- Xiujuan Lei
- School of Computer Science, Shaanxi Normal University, Xi'an, 710119, China
- Ming Fang
- School of Computer Science, Shaanxi Normal University, Xi'an, 710119, China
- Fang-Xiang Wu
- Division of Biomedical Engineering and Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, Canada
- Luonan Chen
- Key Laboratory of Systems Biology, CAS center for Excellence in Molecular Cell Science, Innovation Center for Cell Signaling Network, Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, 200031, China
|
43
|
Predicting essential proteins by integrating orthology, gene expressions, and PPI networks. PLoS One 2018; 13:e0195410. [PMID: 29634727 PMCID: PMC5892885 DOI: 10.1371/journal.pone.0195410] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Received: 11/26/2017] [Accepted: 03/21/2018] [Indexed: 12/04/2022] Open
Abstract
Identifying essential proteins is very important for understanding the minimal requirements of cellular life and for finding human disease genes as well as potential drug targets. Experimental methods for identifying essential proteins are often costly, time-consuming, and laborious. Many computational methods for this task have been proposed based on the topological properties of protein-protein interaction networks (PINs). However, most of these methods have limited prediction accuracy due to the noisy and incomplete nature of PINs and the fact that protein essentiality may relate to multiple biological factors. In this work, we propose a new centrality measure, OGN, that integrates orthologous information, gene expressions, and PINs. OGN determines a protein’s essentiality by capturing its co-clustering and co-expression properties, as well as its conservation during evolution. The performance of OGN was tested on Saccharomyces cerevisiae. Compared with several published centrality measures, OGN achieves higher prediction accuracy both when used alone and when used in an ensemble.
|
44
|
Li M, Li W, Wu FX, Pan Y, Wang J. Identifying essential proteins based on sub-network partition and prioritization by integrating subcellular localization information. J Theor Biol 2018; 447:65-73. [PMID: 29571709 DOI: 10.1016/j.jtbi.2018.03.029] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Received: 08/12/2017] [Revised: 03/19/2018] [Accepted: 03/20/2018] [Indexed: 01/07/2023]
Abstract
Essential proteins are important participants in various life activities and play a vital role in the survival and reproduction of living organisms. Identification of essential proteins from protein-protein interaction (PPI) networks has great significance for facilitating the study of complex human diseases, the design of drugs and the development of bioinformatics and computational science. Studies have shown that highly connected proteins in a PPI network tend to be essential. A series of computational methods have been proposed to identify essential proteins by analyzing the topological structures of PPI networks. However, the high noise in PPI data can degrade the accuracy of essential protein prediction. Moreover, proteins must be located in the appropriate subcellular localization to perform their functions, and proteins can interact with each other only when they are located in the same subcellular localization. In this paper, we propose a new network-based essential protein discovery method based on sub-network partition and prioritization by integrating subcellular localization information, named SPP. SPP was tested on two different yeast PPI networks obtained from the DIP database and the BioGRID database. The experimental results show that SPP can effectively reduce the effect of false positives in PPI networks and predict essential proteins more accurately than the existing computational methods DC, BC, CC, SC, EC, IC, and NC.
Affiliation(s)
- Min Li
- School of Information Science and Engineering, Central South University, Changsha 410083, China.
- Wenkai Li
- School of Information Science and Engineering, Central South University, Changsha 410083, China.
- Fang-Xiang Wu
- Division of Biomedical Engineering and Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A9, Canada.
- Yi Pan
- Department of Computer Science, Georgia State University, Atlanta, GA 30302-4110, USA.
- Jianxin Wang
- School of Information Science and Engineering, Central South University, Changsha 410083, China.
|
45
|
Chen L, Zhang YH, Wang S, Zhang Y, Huang T, Cai YD. Prediction and analysis of essential genes using the enrichments of gene ontology and KEGG pathways. PLoS One 2017; 12:e0184129. [PMID: 28873455 PMCID: PMC5584762 DOI: 10.1371/journal.pone.0184129] [Citation(s) in RCA: 173] [Impact Index Per Article: 24.7] [Received: 06/16/2017] [Accepted: 08/18/2017] [Indexed: 12/20/2022] Open
Abstract
Identifying essential genes in a given organism is important for research on their fundamental roles in organism survival. Furthermore, where possible, uncovering the links between core functions or pathways and these essential genes will help us obtain deeper insight into the key roles of these genes. In this study, we investigated the essential and non-essential genes reported in a previous study and extracted gene ontology (GO) terms and biological pathways that are important for the determination of essential genes. Through the enrichment theory of GO and KEGG pathways, we encoded each essential/non-essential gene into a vector in which each component represented the relationship between the gene and one GO term or KEGG pathway. To analyze these relationships, the maximum relevance minimum redundancy (mRMR) method was adopted. Then, incremental feature selection (IFS) and a support vector machine (SVM) were employed to extract important GO terms and KEGG pathways. A prediction model was built simultaneously using the extracted GO terms and KEGG pathways, which yielded nearly perfect performance, with a Matthews correlation coefficient of 0.951, for distinguishing essential and non-essential genes. To fully investigate the key factors influencing the fundamental roles of essential genes, the 21 most important GO terms and three KEGG pathways were analyzed in detail. In addition, several genes that were predicted to be essential by our prediction model are provided in this study. We suggest that this study provides more functional and pathway information on the essential genes and provides a new way to investigate related problems.
Affiliation(s)
- Lei Chen
- School of Life Sciences, Shanghai University, Shanghai, People’s Republic of China
- College of Information Engineering, Shanghai Maritime University, Shanghai, People’s Republic of China
- Yu-Hang Zhang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, People’s Republic of China
- ShaoPeng Wang
- School of Life Sciences, Shanghai University, Shanghai, People’s Republic of China
- YunHua Zhang
- Anhui province key lab of farmland ecological conversation and pollution prevention, School of Resources and Environment, Anhui Agricultural University, Hefei, People’s Republic of China
- Tao Huang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, People’s Republic of China
- Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, People’s Republic of China
|
46
|
Qin C, Sun Y, Dong Y. A new computational strategy for identifying essential proteins based on network topological properties and biological information. PLoS One 2017; 12:e0182031. [PMID: 28753682 PMCID: PMC5533339 DOI: 10.1371/journal.pone.0182031] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 02/15/2017] [Accepted: 07/11/2017] [Indexed: 12/26/2022] Open
Abstract
Essential proteins are the proteins that are indispensable to the survival and development of an organism. Deleting a single essential protein will cause lethality or infertility. Identifying and analysing essential proteins are key to understanding the molecular mechanisms of living cells. There are two types of methods for predicting essential proteins: experimental methods, which require considerable time and resources, and computational methods, which overcome the shortcomings of experimental methods. However, the prediction accuracy of computational methods for essential proteins requires further improvement. In this paper, we propose a new computational strategy named CoTB for identifying essential proteins based on a combination of topological properties, subcellular localization information and orthologous protein information. First, we introduce several topological properties of the protein-protein interaction (PPI) network. Second, we propose new methods for measuring orthologous information and subcellular localization and a new computational strategy that uses a random forest prediction model to obtain a probability score for the proteins being essential. Finally, we conduct experiments on four different Saccharomyces cerevisiae datasets. The experimental results demonstrate that our strategy for identifying essential proteins outperforms traditional computational methods and the most recently developed method, SON. In particular, our strategy improves the prediction accuracy to 89, 78, 79, and 85 percent on the YDIP, YMIPS, YMBD and YHQ datasets at the top 100 level, respectively.
Affiliation(s)
- Chao Qin
- Beijing Key Lab of Traffic Data Analysis and Mining, School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China
- Yongqi Sun
- Beijing Key Lab of Traffic Data Analysis and Mining, School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China
- Yadong Dong
- Beijing Key Lab of Traffic Data Analysis and Mining, School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China
|
47
|
Zhang W, Xu J, Li X, Zou X. A New Method for Identifying Essential Proteins by Measuring Co-Expression and Functional Similarity. IEEE Trans Nanobioscience 2016; 15:939-945. [DOI: 10.1109/tnb.2016.2625460] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Indexed: 11/06/2022]
|