1
Lu P, Tian J. ACDMBI: A deep learning model based on community division and multi-source biological information fusion predicts essential proteins. Comput Biol Chem 2024; 112:108115. [PMID: 38865861 DOI: 10.1016/j.compbiolchem.2024.108115] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/03/2024] [Revised: 05/15/2024] [Accepted: 05/28/2024] [Indexed: 06/14/2024]
Abstract
Accurately identifying essential proteins is vital for drug research and disease diagnosis. Traditional centrality methods and machine learning approaches often face challenges in accurately discerning essential proteins, primarily relying on information derived from protein-protein interaction (PPI) networks. Despite attempts by some researchers to integrate biological data and PPI networks for predicting essential proteins, designing effective integration methods remains a challenge. In response to these challenges, this paper presents the ACDMBI model, specifically designed to overcome the aforementioned issues. ACDMBI is comprised of two key modules: feature extraction and classification. In terms of capturing relevant information, we draw insights from three distinct data sources. Initially, structural features of proteins are extracted from the PPI network through community division. Subsequently, these features are further optimized using Graph Convolutional Networks (GCN) and Graph Attention Networks (GAT). Moving forward, protein features are extracted from gene expression data utilizing Bidirectional Long Short-Term Memory networks (BiLSTM) and a multi-head self-attention mechanism. Finally, protein features are derived by mapping subcellular localization data to a one-dimensional vector and processing it through fully connected layers. In the classification phase, we integrate features extracted from three different data sources, crafting a multi-layer deep neural network (DNN) for protein classification prediction. Experimental results on brewing yeast data showcase the ACDMBI model's superior performance, with AUC reaching 0.9533 and AUPR reaching 0.9153. Ablation experiments further reveal that the effective integration of features from diverse biological information significantly boosts the model's performance.
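The abstract says that structural features from the PPI network are refined with Graph Convolutional Networks. As a rough illustration only, the sketch below applies one step of the widely used GCN propagation rule, ReLU(D^-1/2 (A + I) D^-1/2 H W), to a toy two-protein network. The function names, the toy graph, and the weight values are assumptions made for this example; the layer sizes and training setup of ACDMBI are not given in the abstract.

```python
import math

def matmul(left, right):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(l * r for l, r in zip(row, col)) for col in zip(*right)]
        for row in left
    ]

def gcn_layer(adjacency, features, weights):
    """One GCN propagation step: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(adjacency)
    # Add self-loops so each protein keeps its own features.
    a_hat = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    degree = [sum(row) for row in a_hat]
    # Symmetric degree normalization of the self-looped adjacency matrix.
    normalized = [[a_hat[i][j] / math.sqrt(degree[i] * degree[j])
                   for j in range(n)] for i in range(n)]
    hidden = matmul(matmul(normalized, features), weights)
    # ReLU non-linearity.
    return [[max(0.0, value) for value in row] for row in hidden]

# Toy PPI network (hypothetical): two interacting proteins, one feature each.
adjacency = [[0.0, 1.0], [1.0, 0.0]]
features = [[1.0], [3.0]]
weights = [[1.0, -1.0]]  # illustrative values, not learned parameters
output = gcn_layer(adjacency, features, weights)
print(output)
```

Each protein's new representation is a degree-normalized average over itself and its interaction partners, which is why both rows of the toy output coincide.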
Affiliation(s)
- Pengli Lu
- School of Computer and Communication, Lanzhou University of Technology, Lanzhou 730050, China.
- Jialong Tian
- School of Computer and Communication, Lanzhou University of Technology, Lanzhou 730050, China.
2
Yang W, Zou S, Gao H, Wang L, Ni W. A Novel Method for Targeted Identification of Essential Proteins by Integrating Chemical Reaction Optimization and Naive Bayes Model. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2024; 21:1274-1286. [PMID: 38536675 DOI: 10.1109/tcbb.2024.3382392] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/10/2024]
Abstract
Targeted identification of essential proteins is of great significance for species identification, drug manufacturing, and disease treatment. It is a challenge to analyze the binding mechanism between essential proteins and to improve the identification speed while ensuring the accuracy of the identification. This paper proposes a novel method called EPCRO for identifying essential proteins, which incorporates the chemical reaction optimization (CRO) algorithm and the naive Bayes model to effectively detect essential proteins. In EPCRO, the naive Bayes model is employed to analyze the homogeneity between proteins. To improve the identification rate and speed of essential proteins, the protein homogeneity rate is integrated into the CRO algorithm to balance local and global searches. EPCRO is experimentally compared with 17 existing methods (DC, SC, IC, EC, LAC, NC, PeC, WDC, EPD-RW, RWHN, TEGS, CFMM, BSPM, AFSO-EP, CVIM, RWEP, and EPPSO-DC) on biological datasets. The results show that EPCRO is superior to these methods in identification accuracy and speed.
3
Pan L, Wang H, Yang B, Li W. A protein network refinement method based on module discovery and biological information. BMC Bioinformatics 2024; 25:157. [PMID: 38643108 PMCID: PMC11031909 DOI: 10.1186/s12859-024-05772-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/21/2024] [Accepted: 04/10/2024] [Indexed: 04/22/2024] Open
Abstract
BACKGROUND: The identification of essential proteins can help in understanding the minimum requirements for cell survival and development, and in turn in discovering drug targets and preventing disease. Node ranking methods are now a common way to identify essential proteins, but the poor data quality of the underlying protein interaction network (PIN) has hindered their identification accuracy. Researchers have therefore constructed refined networks by considering certain biological properties of interacting protein pairs to improve the performance of node ranking methods in the PIN. Studies show that proteins in a complex are more likely to be essential than proteins not present in the complex. However, modularity is usually ignored by refinement methods for PINs.
METHODS: Based on this, we proposed a network refinement method based on module discovery and biological information. The idea is, first, to extract the maximal connected subgraph in the PIN and divide it into different modules by using the Fast-unfolding algorithm; then, to detect critical modules according to the orthologous information, subcellular localization information and topology information within each module; finally, to construct a more refined network (CM-PIN) by using the identified critical modules.
RESULTS: To evaluate the effectiveness of the proposed method, we used 12 typical node ranking methods (LAC, DC, DMNC, NC, TP, LID, CC, BC, PR, LR, PeC, WDC) to compare their overall performance on the CM-PIN with that on the S-PIN, D-PIN and RD-PIN. The experimental results showed that the CM-PIN was optimal in terms of the number of essential proteins identified, the precision-recall curve, the Jackknifing method and other criteria, and can help to identify essential proteins more accurately.
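The first step of the method described in this abstract, extracting the maximal connected subgraph of the PIN, can be sketched with a plain breadth-first search. This is an illustrative sketch under assumptions: the function names and the toy edge list are invented for the example, and the subsequent module division (Fast-unfolding, also known as the Louvain method) is not implemented here.

```python
from collections import deque

def largest_connected_component(edges):
    """Return the node set of the largest connected component (undirected)."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    seen = set()
    best = set()
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        if len(component) > len(best):
            best = component
    return best

def induced_edges(edges, nodes):
    """Keep only the interactions whose two endpoints are both in `nodes`."""
    return [(u, v) for u, v in edges if u in nodes and v in nodes]

# Hypothetical toy network: one 3-protein component and one isolated pair.
toy_edges = [("A", "B"), ("B", "C"), ("D", "E")]
core_nodes = largest_connected_component(toy_edges)
core_edges = induced_edges(toy_edges, core_nodes)
print(sorted(core_nodes), core_edges)
```

The resulting subgraph would then be handed to a community detection routine to obtain the modules that the abstract refers to.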
Affiliation(s)
- Li Pan
- Hunan Institute of Science and Technology, Yueyang, 414006, China
- Hunan Engineering Research Center of Multimodal Health Sensing and Intelligent Analysis, Yueyang, 414006, China
- Haoyue Wang
- Hunan Institute of Science and Technology, Yueyang, 414006, China.
- Bo Yang
- Hunan Institute of Science and Technology, Yueyang, 414006, China
- Hunan Engineering Research Center of Multimodal Health Sensing and Intelligent Analysis, Yueyang, 414006, China
- Wenbin Li
- Hunan Institute of Science and Technology, Yueyang, 414006, China.
4
Ye C, Wu Q, Chen S, Zhang X, Xu W, Wu Y, Zhang Y, Yue Y. ECDEP: identifying essential proteins based on evolutionary community discovery and subcellular localization. BMC Genomics 2024; 25:117. [PMID: 38279081 PMCID: PMC10821549 DOI: 10.1186/s12864-024-10019-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/07/2023] [Accepted: 01/15/2024] [Indexed: 01/28/2024] Open
Abstract
BACKGROUND: In cellular activities, essential proteins play a vital role and are instrumental in comprehending fundamental biological necessities and identifying pathogenic genes. Current deep learning approaches for predicting essential proteins underutilize the potential of gene expression data and are inadequate for the exploration of dynamic networks, with limited evaluation across diverse species.
RESULTS: We introduce ECDEP, an essential protein identification model based on evolutionary community discovery. ECDEP integrates temporal gene expression data with a protein-protein interaction (PPI) network and employs the 3-Sigma rule to eliminate outliers at each time point, constructing a dynamic network. Next, we utilize edge birth and death information to establish an interaction streaming source to feed into the evolutionary community discovery algorithm and then identify overlapping communities during the evolution of the dynamic network. SVM recursive feature elimination (RFE) is applied to extract the most informative communities, which are combined with subcellular localization data for classification predictions. We assess the performance of ECDEP by comparing it against ten centrality methods, four shallow machine learning methods with RFE, and two deep learning methods that incorporate multiple biological data sources on Saccharomyces cerevisiae (S. cerevisiae), Homo sapiens (H. sapiens), Mus musculus, and Caenorhabditis elegans. ECDEP achieves an AP value of 0.86 on the H. sapiens dataset and the contribution ratio of community features in classification reaches 0.54 on the S. cerevisiae (Krogan) dataset.
CONCLUSIONS: Our proposed method adeptly integrates network dynamics and yields outstanding results across various datasets. Furthermore, the incorporation of evolutionary community discovery algorithms amplifies the capacity of gene expression data in classification.
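The abstract mentions turning edge birth and death information into an interaction stream. One simple way to picture this, assuming the dynamic network is available as a list of per-time-point edge snapshots, is to take set differences between consecutive snapshots. The function name, the event tuple layout, and the toy snapshots below are assumptions for illustration, not the data format used by ECDEP.

```python
def edge_events(snapshots):
    """List (time, kind, edge) events between consecutive undirected snapshots."""
    events = []
    previous = set()
    for time_point, snapshot in enumerate(snapshots):
        # Sort endpoints so (A, B) and (B, A) count as the same interaction.
        current = {tuple(sorted(edge)) for edge in snapshot}
        for edge in sorted(current - previous):
            events.append((time_point, "birth", edge))
        for edge in sorted(previous - current):
            events.append((time_point, "death", edge))
        previous = current
    return events

# Hypothetical dynamic network with three time points.
toy_snapshots = [
    [("A", "B")],
    [("B", "A"), ("B", "C")],
    [("B", "C")],
]
stream = edge_events(toy_snapshots)
for event in stream:
    print(event)
```

An interaction that persists across time points produces no event, so the stream only records where the network actually changes.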
Affiliation(s)
- Chen Ye
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui, 230036, China
- Anhui Beidou Precision Agriculture Information Engineering Research Center, Anhui Agricultural University, Hefei, 230036, China
- Qi Wu
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui, 230036, China
- Anhui Beidou Precision Agriculture Information Engineering Research Center, Anhui Agricultural University, Hefei, 230036, China
- Shuxia Chen
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui, 230036, China
- Anhui Beidou Precision Agriculture Information Engineering Research Center, Anhui Agricultural University, Hefei, 230036, China
- Xuemei Zhang
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui, 230036, China
- Anhui Beidou Precision Agriculture Information Engineering Research Center, Anhui Agricultural University, Hefei, 230036, China
- Wenwen Xu
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui, 230036, China
- Anhui Beidou Precision Agriculture Information Engineering Research Center, Anhui Agricultural University, Hefei, 230036, China
- Yunzhi Wu
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui, 230036, China
- Anhui Beidou Precision Agriculture Information Engineering Research Center, Anhui Agricultural University, Hefei, 230036, China
- Youhua Zhang
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui, 230036, China
- Anhui Beidou Precision Agriculture Information Engineering Research Center, Anhui Agricultural University, Hefei, 230036, China
- Yi Yue
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui, 230036, China.
- Anhui Beidou Precision Agriculture Information Engineering Research Center, Anhui Agricultural University, Hefei, 230036, China.
5
Li G, Luo X, Hu Z, Wu J, Peng W, Liu J, Zhu X. Essential proteins discovery based on dominance relationship and neighborhood similarity centrality. Health Inf Sci Syst 2023; 11:55. [PMID: 37981988 PMCID: PMC10654316 DOI: 10.1007/s13755-023-00252-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2023] [Accepted: 10/13/2023] [Indexed: 11/21/2023] Open
Abstract
Essential proteins play a vital role in the development and reproduction of cells. The identification of essential proteins helps to understand the basic survival of cells. Because biological experimental methods for discovering essential proteins are time-consuming, costly and inefficient, computational methods have gained increasing attention. In the initial stage, essential proteins were mainly identified by centralities based on protein-protein interaction (PPI) networks, whose identification rate is limited by the many false positives in PPI networks. In this study, a purified PPI network is first introduced to reduce the impact of false positives in the PPI network. Secondly, by analyzing the similarity relationship between a protein and its neighbors in the PPI network, a new centrality called neighborhood similarity centrality (NSC) is proposed. Thirdly, based on the subcellular localization and orthologous data, the protein subcellular localization score and ortholog score are calculated, respectively. Fourthly, by analyzing a large number of methods based on multi-feature fusion, it is found that there is a special relationship among features, called the dominance relationship, and a novel model based on this dominance relationship is proposed. Finally, NSC, subcellular localization score, and ortholog score are fused by the dominance relationship model, and a new method called NSO is proposed. To verify the performance of NSO, it is compared with seven representative methods (ION, NCCO, E_POC, SON, JDC, PeC, WDC) on yeast datasets. The experimental results show that the NSO method has a higher identification rate than the other methods.
Affiliation(s)
- Gaoshi Li
- Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin, 541004 China
- Guangxi Key Lab of Multi-source Information Mining & Security, Guangxi Normal University, Guilin, 541004 Guangxi China
- College of Computer Science and Engineering, Guangxi Normal University, Guilin, 541004 Guangxi China
- Xinlong Luo
- Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin, 541004 China
- Guangxi Key Lab of Multi-source Information Mining & Security, Guangxi Normal University, Guilin, 541004 Guangxi China
- College of Computer Science and Engineering, Guangxi Normal University, Guilin, 541004 Guangxi China
- Zhipeng Hu
- Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin, 541004 China
- Guangxi Key Lab of Multi-source Information Mining & Security, Guangxi Normal University, Guilin, 541004 Guangxi China
- College of Computer Science and Engineering, Guangxi Normal University, Guilin, 541004 Guangxi China
- Jingli Wu
- Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin, 541004 China
- Guangxi Key Lab of Multi-source Information Mining & Security, Guangxi Normal University, Guilin, 541004 Guangxi China
- College of Computer Science and Engineering, Guangxi Normal University, Guilin, 541004 Guangxi China
- Wei Peng
- Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, 650500 Yunnan China
- Jiafei Liu
- Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin, 541004 China
- Guangxi Key Lab of Multi-source Information Mining & Security, Guangxi Normal University, Guilin, 541004 Guangxi China
- College of Computer Science and Engineering, Guangxi Normal University, Guilin, 541004 Guangxi China
- Xiaoshu Zhu
- Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin, 541004 China
- Guangxi Key Lab of Multi-source Information Mining & Security, Guangxi Normal University, Guilin, 541004 Guangxi China
- College of Computer Science and Engineering, Guangxi Normal University, Guilin, 541004 Guangxi China
- School of Computer and Information Security & School of Software Engineering, Guilin University of Electronic Science and Technology, Guilin, China
6
Zhao H, Liu G, Cao X. A seed expansion-based method to identify essential proteins by integrating protein-protein interaction sub-networks and multiple biological characteristics. BMC Bioinformatics 2023; 24:452. [PMID: 38036960 PMCID: PMC10688502 DOI: 10.1186/s12859-023-05583-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/08/2023] [Accepted: 11/24/2023] [Indexed: 12/02/2023] Open
Abstract
BACKGROUND: The identification of essential proteins is of great significance in biology and pathology. However, protein-protein interaction (PPI) data obtained through high-throughput technology include a high number of false positives. To overcome this limitation, numerous computational algorithms based on biological characteristics and topological features have been proposed to identify essential proteins.
RESULTS: In this paper, we propose a novel method named SESN for identifying essential proteins. It is a seed expansion method based on PPI sub-networks and multiple biological characteristics. Firstly, SESN utilizes gene expression data to construct PPI sub-networks. Secondly, seed expansion is performed simultaneously in each sub-network, and the expansion process is based on the topological features of predicted essential proteins. Thirdly, the error correction mechanism is based on multiple biological characteristics and the entire PPI network. Finally, SESN analyzes the impact of each biological characteristic, including protein complex, gene expression data, GO annotations, and subcellular localization, and adopts the biological data with the best experimental results. The output of SESN is a set of predicted essential proteins.
CONCLUSIONS: The analysis of each component of SESN indicates the effectiveness of all components. We conduct comparison experiments using three datasets from two species, and the experimental results demonstrate that SESN achieves superior performance compared to other methods.
Affiliation(s)
- He Zhao
- College of Computer Science and Technology, Jilin University, Changchun, China
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China
- Guixia Liu
- College of Computer Science and Technology, Jilin University, Changchun, China.
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China.
- Xintian Cao
- College of Computer Science and Technology, Jilin University, Changchun, China
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China
7
Mount RA, Athif M, O’Connor M, Saligrama A, Tseng HA, Sridhar S, Zhou C, Bortz E, San Antonio E, Kramer MA, Man HY, Han X. The autism spectrum disorder risk gene NEXMIF over-synchronizes hippocampal CA1 network and alters neuronal coding. Front Neurosci 2023; 17:1277501. [PMID: 37965217 PMCID: PMC10641898 DOI: 10.3389/fnins.2023.1277501] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/14/2023] [Accepted: 10/10/2023] [Indexed: 11/16/2023] Open
Abstract
Mutations in autism spectrum disorder (ASD) risk genes disrupt neural network dynamics that ultimately lead to abnormal behavior. To understand how ASD-risk genes influence neural circuit computation during behavior, we analyzed the hippocampal network by performing large-scale cellular calcium imaging from hundreds of individual CA1 neurons simultaneously in transgenic mice with total knockout of the X-linked ASD-risk gene NEXMIF (neurite extension and migration factor). As NEXMIF knockout in mice led to profound learning and memory deficits, we examined the CA1 network during voluntary locomotion, a fundamental component of spatial memory. We found that NEXMIF knockout does not alter the overall excitability of individual neurons but exaggerates movement-related neuronal responses. To quantify network functional connectivity changes, we applied closeness centrality analysis from graph theory to our large-scale calcium imaging datasets, in addition to using the conventional pairwise correlation analysis. Closeness centrality analysis considers both the number of connections and the connection strength between neurons within a network. We found that in wild-type mice the CA1 network desynchronizes during locomotion, consistent with increased network information coding during active behavior. Upon NEXMIF knockout, CA1 network is over-synchronized regardless of behavioral state and fails to desynchronize during locomotion, highlighting how perturbations in ASD-implicated genes create abnormal network synchronization that could contribute to ASD-related behaviors.
Affiliation(s)
- Rebecca A. Mount
- Department of Biomedical Engineering, Boston University, Boston, MA, United States
- Mohamed Athif
- Department of Biomedical Engineering, Boston University, Boston, MA, United States
- Amith Saligrama
- Department of Biomedical Engineering, Boston University, Boston, MA, United States
- Commonwealth School, Boston, MA, United States
- Hua-an Tseng
- Department of Biomedical Engineering, Boston University, Boston, MA, United States
- Sudiksha Sridhar
- Department of Biomedical Engineering, Boston University, Boston, MA, United States
- Chengqian Zhou
- Department of Biomedical Engineering, Boston University, Boston, MA, United States
- Emma Bortz
- Department of Biomedical Engineering, Boston University, Boston, MA, United States
- Erynne San Antonio
- Department of Biomedical Engineering, Boston University, Boston, MA, United States
- Mark A. Kramer
- Department of Mathematics, Boston University, Boston, MA, United States
- Heng-Ye Man
- Department of Biology, Boston University, Boston, MA, United States
- Xue Han
- Department of Biomedical Engineering, Boston University, Boston, MA, United States
8
Sun J, Pan L, Li B, Wang H, Yang B, Li W. A Construction Method of Dynamic Protein Interaction Networks by Using Relevant Features of Gene Expression Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:2790-2801. [PMID: 37030714 DOI: 10.1109/tcbb.2023.3264241] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/19/2023]
Abstract
Essential proteins play an important role in various life activities and are considered to be a vital part of the organism. Gene expression data are an important dataset to construct dynamic protein-protein interaction networks (DPIN). The existing methods for the construction of DPINs generally utilize all features (or the features in a cycle) of the gene expression data. However, the features observed from successive time points tend to be highly correlated, and thus there are some redundant and irrelevant features in the gene expression data, which will influence the quality of the constructed network and the predictive performance of essential proteins. To address this problem, we propose a construction method of DPINs by using selected relevant features rather than continuous and periodic features. We adopt an improved unsupervised feature selection method based on Laplacian algorithm to remove irrelevant and redundant features from gene expression data, then integrate the chosen relevant features into the static protein-protein interaction network (SPIN) to construct a more concise and effective DPIN (FS-DPIN). To evaluate the effectiveness of the FS-DPIN, we apply 15 network-based centrality methods on the FS-DPIN and compare the results with those on the SPIN and the existing DPINs. Then the predictive performance of the 15 centrality methods is validated in terms of sensitivity, specificity, positive predictive value, negative predictive value, F-measure, accuracy, Jackknife and AUPRC. The experimental results show that the FS-DPIN is superior to the existing DPINs in the identification accuracy of essential proteins.
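Several of the evaluation criteria listed in this abstract (sensitivity, specificity, positive predictive value, negative predictive value, F-measure and accuracy) follow directly from confusion-matrix counts. The sketch below uses the standard textbook definitions; the function name and the example counts are made up for illustration and are not results from the paper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard metrics from true/false positive/negative counts."""
    def ratio(numerator, denominator):
        # Return 0.0 instead of failing when a category is empty.
        return numerator / denominator if denominator else 0.0

    sensitivity = ratio(tp, tp + fn)   # fraction of essential proteins recovered
    specificity = ratio(tn, tn + fp)   # fraction of non-essential proteins rejected
    ppv = ratio(tp, tp + fp)           # positive predictive value (precision)
    npv = ratio(tn, tn + fn)           # negative predictive value
    f_measure = ratio(2 * sensitivity * ppv, sensitivity + ppv)
    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }

# Hypothetical counts for a ranked list cut at some threshold.
metrics = classification_metrics(tp=40, fp=10, tn=40, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In studies of this kind, the counts are typically obtained by labelling the top-ranked proteins as predicted essential and comparing them with a reference set of known essential proteins.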
9
Žalik KR, Žalik M. Density-Based Entropy Centrality for Community Detection in Complex Networks. Entropy (Basel, Switzerland) 2023; 25:1196. [PMID: 37628226 PMCID: PMC10453840 DOI: 10.3390/e25081196] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/08/2023] [Revised: 07/31/2023] [Accepted: 08/02/2023] [Indexed: 08/27/2023]
Abstract
One of the most important problems in complex networks is the location of nodes that are essential or play a main role in the network. Nodes with main local roles are the centers of real communities. Communities are sets of nodes of complex networks and are densely connected internally. Choosing the right nodes as seeds of the communities is crucial in determining real communities. We propose a new centrality measure named density-based entropy centrality for the local identification of the most important nodes. It measures the entropy of the sum of the sizes of the maximal cliques to which each node and its neighbor nodes belong. The proposed centrality is a local measure for explaining the local influence of each node, which provides an efficient way to locally identify the most important nodes and for community detection because communities are local structures. It can be computed independently for individual vertices, for large networks, and for not well-specified networks. The use of the proposed density-based entropy centrality for community seed selection and community detection outperforms other centrality measures.
Affiliation(s)
- Krista Rizman Žalik
- Faculty of Electrical Engineering and Computer Science, University of Maribor, 2000 Maribor, Slovenia
10
Bröhl T, Lehnertz K. A perturbation-based approach to identifying potentially superfluous network constituents. Chaos (Woodbury, N.Y.) 2023; 33:2894464. [PMID: 37276550 DOI: 10.1063/5.0152030] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 03/27/2023] [Accepted: 05/16/2023] [Indexed: 06/07/2023]
Abstract
Constructing networks from empirical time-series data is often faced with the as yet unsolved issue of how to avoid potentially superfluous network constituents. Such constituents can result, e.g., from spatial and temporal oversampling of the system's dynamics, and neglecting them can lead to severe misinterpretations of network characteristics ranging from global to local scale. We derive a perturbation-based method to identify potentially superfluous network constituents that makes use of vertex and edge centrality concepts. We investigate the suitability of our approach through analyses of weighted small-world, scale-free, random, and complete networks.
Affiliation(s)
- Timo Bröhl
- Department of Epileptology, University of Bonn Medical Centre, Venusberg Campus 1, 53127 Bonn, Germany
- Helmholtz Institute for Radiation and Nuclear Physics, University of Bonn, Nussallee 14-16, 53115 Bonn, Germany
- Klaus Lehnertz
- Department of Epileptology, University of Bonn Medical Centre, Venusberg Campus 1, 53127 Bonn, Germany
- Helmholtz Institute for Radiation and Nuclear Physics, University of Bonn, Nussallee 14-16, 53115 Bonn, Germany
- Interdisciplinary Center for Complex Systems, University of Bonn, Brühler Straße 7, 53175 Bonn, Germany
11
Liu P, Liu C, Mao Y, Guo J, Liu F, Cai W, Zhao F. Identification of essential proteins based on edge features and the fusion of multiple-source biological information. BMC Bioinformatics 2023; 24:203. [PMID: 37198530 DOI: 10.1186/s12859-023-05315-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 02/10/2023] [Accepted: 04/30/2023] [Indexed: 05/19/2023] Open
Abstract
BACKGROUND: A major current focus in the analysis of protein-protein interaction (PPI) data is how to identify essential proteins. As massive PPI data are available, this warrants the design of efficient computing methods for identifying essential proteins. Previous studies have achieved considerable performance. However, because of the high noise and structural complexity of PPI data, it is still a challenge to further improve the performance of the identification methods.
METHODS: This paper proposes an identification method, named CTF, which identifies essential proteins based on edge features, including h-quasi-cliques and uv-triangle graphs, and the fusion of multiple-source information. We first design an edge-weight function, named EWCT, for computing the topological scores of proteins based on quasi-cliques and triangle graphs. Then, we generate an edge-weighted PPI network using EWCT and dynamic PPI data. Finally, we compute the essentiality of proteins by the fusion of topological scores and three scores of biological information.
RESULTS: We evaluated the performance of the CTF method by comparison with 16 other methods, such as MON, PeC, TEGS, and LBCC. The experimental results on three datasets of Saccharomyces cerevisiae show that CTF outperforms the state-of-the-art methods. Moreover, our method indicates that the fusion of other biological information is beneficial for improving the accuracy of identification.
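The abstract does not define the EWCT edge-weight function, so it is not reproduced here. As a loosely related illustration of an edge feature built from triangles, the sketch below counts, for each interaction (u, v), how many triangles contain that edge, which equals the number of neighbors shared by u and v. The function name and the toy network are assumptions for this example only.

```python
def edge_triangle_counts(edges):
    """Map each undirected edge to the number of triangles that contain it."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    counts = {}
    for u, v in edges:
        # A triangle on edge (u, v) needs a third node adjacent to both ends.
        shared = adjacency[u] & adjacency[v]
        counts[tuple(sorted((u, v)))] = len(shared)
    return counts

# Hypothetical toy network: a triangle A-B-C with a pendant protein D.
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
triangle_counts = edge_triangle_counts(toy_edges)
for edge, count in sorted(triangle_counts.items()):
    print(edge, count)
```

Edges embedded in many triangles sit in densely connected neighborhoods, which is the general intuition behind weighting interactions by local clique-like structure.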
Affiliation(s)
- Peiqiang Liu
- School of Computer Science and Technology, Shandong Technology and Business University, Yantai, China.
- Chang Liu
- School of Computer Science and Technology, Shandong Technology and Business University, Yantai, China
- Yanyan Mao
- School of Computer Science and Technology, Shandong Technology and Business University, Yantai, China
- College of Oceanography and Space Informatics, China University of Petroleum (East China), Qingdao, China
- Junhong Guo
- School of Computer Science and Technology, Shandong Technology and Business University, Yantai, China
- Fanshu Liu
- School of Computer Science and Technology, Shandong Technology and Business University, Yantai, China
- Wangmin Cai
- School of Computer Science and Technology, Shandong Technology and Business University, Yantai, China
- Feng Zhao
- School of Computer Science and Technology, Shandong Technology and Business University, Yantai, China
12
Cueno ME, Taketsuna K, Saito M, Inoue S, Imai K. Network analysis of the autophagy biochemical network in relation to various autophagy-targeted proteins found among SARS-CoV-2 variants of concern. J Mol Graph Model 2023; 119:108396. [PMID: 36549224 PMCID: PMC9749836 DOI: 10.1016/j.jmgm.2022.108396] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/28/2022] [Revised: 10/28/2022] [Accepted: 12/09/2022] [Indexed: 12/15/2022]
Abstract
Autophagy is an important cellular process that triggers a coordinated action involving multiple individual proteins and protein complexes while SARS-CoV-2 (SARS2) was found to both hinder autophagy to evade host defense and utilize autophagy for viral replication. Interestingly, the possible significant stages of the autophagy biochemical network in relation to the corresponding autophagy-targeted SARS2 proteins from the different variants of concern (VOC) were never established. In this study, we performed the following: autophagy biochemical network design and centrality analyses; generated autophagy-targeted SARS2 protein models; and superimposed protein models for structural comparison. We identified 2 significant biochemical pathways (one starts from the ULK complex and the other starts from the PI3P complex) within the autophagy biochemical network. Similarly, we determined that the autophagy-targeted SARS2 proteins (Nsp15, M, ORF7a, ORF3a, and E) are structurally conserved throughout the different SARS2 VOC suggesting that the function of each protein is preserved during SARS2 evolution. Interestingly, among the autophagy-targeted SARS2 proteins, the M protein coincides with the 2 significant biochemical pathways we identified within the autophagy biochemical network. In this regard, we propose that the SARS2 M protein is the main determinant that would influence autophagy outcome in regard to SARS2 infection.
Affiliation(s)
- Marni E. Cueno
- Department of Microbiology and Immunology, Nihon University School of Dentistry, Tokyo, 101-8310, Japan
- Immersion Biology Class, Department of Science, Tokyo Gakugei University International Secondary School, Tokyo, 178-0063, Japan
- Corresponding author. Department of Microbiology and Immunology, Nihon University School of Dentistry, 1-8-13 Kanda-Surugadai, Chiyoda-ku, Tokyo, 101-8310, Japan
- Keiichi Taketsuna
- Immersion Biology Class, Department of Science, Tokyo Gakugei University International Secondary School, Tokyo, 178-0063, Japan
- Mitsuki Saito
- Immersion Biology Class, Department of Science, Tokyo Gakugei University International Secondary School, Tokyo, 178-0063, Japan
- Sara Inoue
- Immersion Biology Class, Department of Science, Tokyo Gakugei University International Secondary School, Tokyo, 178-0063, Japan
- Kenichi Imai
- Department of Microbiology and Immunology, Nihon University School of Dentistry, Tokyo, 101-8310, Japan
|
13
|
Zelenkovski K, Sandev T, Metzler R, Kocarev L, Basnarkov L. Random Walks on Networks with Centrality-Based Stochastic Resetting. Entropy (Basel, Switzerland) 2023; 25:293. [PMID: 36832659 PMCID: PMC9955709 DOI: 10.3390/e25020293] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 12/26/2022] [Revised: 01/19/2023] [Accepted: 02/02/2023] [Indexed: 06/18/2023]
Abstract
We introduce a refined way to diffusively explore complex networks with stochastic resetting, where the resetting site is derived from node centrality measures. This approach differs from previous ones: rather than only allowing the random walker to jump, with a certain probability, from the current node to a deliberately chosen resetting node, it enables the walker to jump to the node that can reach all other nodes faster. Following this strategy, we consider the resetting site to be the geometric center, the node that minimizes the average travel time to all the other nodes. Using established Markov chain theory, we calculate the Global Mean First Passage Time (GMFPT) to determine the search performance of the random walk with resetting for each resetting node candidate individually. Furthermore, we determine which nodes are better resetting sites by comparing the GMFPT for each node. We study this approach for different topologies of generic and real-life networks. We show that, for directed networks extracted from real-life relationships, this centrality-focused resetting can improve the search to a greater extent than for the generated undirected networks. The resetting to the center advocated here can minimize the average travel time to all other nodes in real networks as well. We also present a relationship between the longest shortest path (the diameter), the average node degree and the GMFPT when the starting node is the center. We show that, for undirected scale-free networks, stochastic resetting is effective only for networks that are extremely sparse with tree-like structures, as they have larger diameters and smaller average node degrees. For directed networks, the resetting is beneficial even for networks that have loops. The numerical results are confirmed by analytic solutions. Our study demonstrates that the proposed random walk approach with resetting based on centrality measures reduces the memoryless search time for targets in the examined network topologies.
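The GMFPT calculation described in this abstract can be illustrated with a minimal Python sketch. It rests on assumptions that are not taken from the paper: the walker jumps to the resetting node with a fixed probability `gamma` at every step and otherwise moves to a uniformly chosen neighbor, the graph is undirected and connected, and the GMFPT is taken as the plain average of the mean first passage times over all ordered (start, target) pairs. The function names are hypothetical, and the linear system is solved by simple iteration rather than by the analytic formulas the authors mention.

```python
def transition_matrix(adj, reset_node, gamma):
    """Row-stochastic matrix: with probability gamma jump to reset_node,
    otherwise step to a uniformly chosen neighbor (assumed resetting rule)."""
    n = len(adj)
    P = [[0.0] * n for _ in range(n)]
    for i, neighbors in enumerate(adj):
        for k in neighbors:
            P[i][k] += (1.0 - gamma) / len(neighbors)
        P[i][reset_node] += gamma
    return P


def mfpt_to_target(P, target, tol=1e-12, max_iter=100_000):
    """Solve T_i = 1 + sum_{k != target} P[i][k] * T_k by Gauss-Seidel sweeps.
    Converges when the target is reachable from every node (gamma < 1)."""
    n = len(P)
    T = [0.0] * n  # T[target] stays 0 by definition
    for _ in range(max_iter):
        delta = 0.0
        for i in range(n):
            if i == target:
                continue
            new = 1.0 + sum(P[i][k] * T[k] for k in range(n) if k != target)
            delta = max(delta, abs(new - T[i]))
            T[i] = new
        if delta < tol:
            break
    return T


def gmfpt(adj, reset_node, gamma):
    """Average MFPT over all ordered (start, target) pairs with start != target."""
    P = transition_matrix(adj, reset_node, gamma)
    n = len(adj)
    total = 0.0
    for target in range(n):
        T = mfpt_to_target(P, target)
        total += sum(T[i] for i in range(n) if i != target)
    return total / (n * (n - 1))


# Toy path graph 0 - 1 - 2 given as adjacency lists (illustrative data only).
path = [[1], [0, 2], [1]]
scores = {node: gmfpt(path, reset_node=node, gamma=0.2) for node in range(3)}
print(scores)
```

Comparing the returned values across candidate resetting nodes mirrors the per-node comparison the abstract describes; a lower value means a faster search under these assumptions.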
Affiliation(s)
- Kiril Zelenkovski
- Research Center for Computer Science and Information Technologies, Macedonian Academy of Sciences and Arts, Bul. Krste Misirkov 2, 1000 Skopje, Macedonia
- Trifce Sandev
- Research Center for Computer Science and Information Technologies, Macedonian Academy of Sciences and Arts, Bul. Krste Misirkov 2, 1000 Skopje, Macedonia
- Institute of Physics, Faculty of Natural Sciences and Mathematics, Ss. Cyril and Methodius University, Arhimedova 3, 1000 Skopje, Macedonia
- Institute of Physics & Astronomy, University of Potsdam, D-14776 Potsdam, Germany
- Ralf Metzler
- Institute of Physics & Astronomy, University of Potsdam, D-14776 Potsdam, Germany
- Asia Pacific Center for Theoretical Physics, Pohang 37673, Republic of Korea
- Ljupco Kocarev
- Research Center for Computer Science and Information Technologies, Macedonian Academy of Sciences and Arts, Bul. Krste Misirkov 2, 1000 Skopje, Macedonia
- Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, 1000 Skopje, Macedonia
- Lasko Basnarkov
- Research Center for Computer Science and Information Technologies, Macedonian Academy of Sciences and Arts, Bul. Krste Misirkov 2, 1000 Skopje, Macedonia
- Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, 1000 Skopje, Macedonia
|
14
|
Chen S, Huang C, Wang L, Zhou S. A disease-related essential protein prediction model based on the transfer neural network. Front Genet 2023; 13:1087294. [PMID: 36685976 PMCID: PMC9845409 DOI: 10.3389/fgene.2022.1087294] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2022] [Accepted: 12/14/2022] [Indexed: 01/06/2023] Open
Abstract
Essential proteins play important roles in the development and survival of organisms, and their mutations have been shown to drive common internal diseases with high prevalence rates. Because traditional biological experiments are costly, this paper first designs an improved Transfer Neural Network (TNN) to extract raw features from multiple kinds of biological information about proteins and then, based on this newly constructed network, proposes a novel computational model called TNNM to infer essential proteins. Unlike a traditional Markov chain, the Transfer Neural Network uses the gradient descent algorithm to obtain the transition probability matrix automatically, which greatly improved the prediction accuracy of TNNM. Moreover, an additional antecedent memory coefficient and a bias term were introduced into the Transfer Neural Network, which further enhanced both the robustness and the non-linear expression ability of TNNM. Finally, to evaluate the identification performance of TNNM, intensive experiments were run separately on two well-known public databases, and the results show that TNNM achieves better performance than representative state-of-the-art prediction models in terms of both predictive accuracy and the decline rate of accuracy. Therefore, TNNM may play an important role in key protein prediction in the future.
Affiliation(s)
- Sisi Chen
- The First Hospital of Hunan University of Chinese Medicine, Changsha, Hunan, China
- Chiguo Huang
- Big Data Innovation and Entrepreneurship Education Center of Hunan Province, Changsha University, Changsha, China
- *Correspondence: Chiguo Huang; Lei Wang; Shunxian Zhou
- Lei Wang
- The First Hospital of Hunan University of Chinese Medicine, Changsha, Hunan, China
- Big Data Innovation and Entrepreneurship Education Center of Hunan Province, Changsha University, Changsha, China
- *Correspondence: Chiguo Huang; Lei Wang; Shunxian Zhou
- Shunxian Zhou
- The First Hospital of Hunan University of Chinese Medicine, Changsha, Hunan, China
- Big Data Innovation and Entrepreneurship Education Center of Hunan Province, Changsha University, Changsha, China
- College of Information Science and Engineering, Hunan Women’s University, Changsha, Hunan, China
- *Correspondence: Chiguo Huang; Lei Wang; Shunxian Zhou
|
15
|
Xue X, Zhang W, Fan A. Comparative analysis of gene ontology-based semantic similarity measurements for the application of identifying essential proteins. PLoS One 2023; 18:e0284274. [PMID: 37083829 PMCID: PMC10121005 DOI: 10.1371/journal.pone.0284274] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/09/2022] [Accepted: 03/28/2023] [Indexed: 04/22/2023] Open
Abstract
Identifying key proteins from protein-protein interaction (PPI) networks is one of the most fundamental and important tasks for computational biologists. However, the protein interactions obtained by high-throughput technology are characterized by a high false positive rate, which severely hinders the prediction accuracy of the current computational methods. In this paper, we propose a novel strategy to identify key proteins by constructing reliable PPI networks. Five Gene Ontology (GO)-based semantic similarity measurements (Jiang, Lin, Rel, Resnik, and Wang) are used to calculate the confidence scores for protein pairs under three annotation terms (Molecular function (MF), Biological process (BP), and Cellular component (CC)). The protein pairs with low similarity values are assumed to be low-confidence links, and the refined PPI networks are constructed by filtering the low-confidence links. Six topology-based centrality methods (the BC, DC, EC, NC, SC, and aveNC) are applied to test the performance of the measurements under the original network and refined network. We systematically compare the performance of the five semantic similarity metrics with the three GO annotation terms on four benchmark datasets, and the simulation results show that the performance of these centrality methods under refined PPI networks is relatively better than that under the original networks. Resnik with a BP annotation term performs best among all five metrics with the three annotation terms. These findings suggest the importance of semantic similarity metrics in measuring the reliability of the links between proteins and highlight the Resnik metric with the BP annotation term as a favourable choice.
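The refinement step in this abstract (drop low-confidence links, then rank proteins by a topology-based centrality) can be sketched in Python as follows. This is only an illustration under stated assumptions: the similarity scores are made-up numbers rather than real GO-based values, the threshold of 0.5 is arbitrary, and degree centrality (DC) stands in for the six centrality methods the authors compare. The helper names `refine_network` and `degree_centrality` are hypothetical.

```python
def refine_network(edges, similarity, threshold):
    """Keep only the edges whose similarity score reaches the threshold."""
    return [(u, v) for (u, v) in edges
            if similarity[frozenset((u, v))] >= threshold]


def degree_centrality(edges):
    """Degree centrality (DC): the number of interactions of each protein."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return degree


# Toy PPI network with made-up confidence scores (not real GO similarities).
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("C", "D")]
similarity = {
    frozenset(("A", "B")): 0.9,
    frozenset(("A", "C")): 0.2,  # treated as a low-confidence link
    frozenset(("A", "D")): 0.7,
    frozenset(("B", "C")): 0.8,
    frozenset(("C", "D")): 0.1,  # treated as a low-confidence link
}

refined = refine_network(edges, similarity, threshold=0.5)
original_rank = degree_centrality(edges)
refined_rank = degree_centrality(refined)
print(sorted(refined_rank.items(), key=lambda item: -item[1]))
```

Running the same centrality on `edges` and on `refined` gives the original-versus-refined comparison that the study performs at scale on benchmark datasets.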
Affiliation(s)
- Xiaoli Xue
- School of Science, East China Jiaotong University, Nanchang, China
- Wei Zhang
- School of Science, East China Jiaotong University, Nanchang, China
- Anjing Fan
- School of Computer and Information Engineering, Anyang Normal University, Anyang, China
|
16
|
Wang L, Peng J, Kuang L, Tan Y, Chen Z. Identification of Essential Proteins Based on Local Random Walk and Adaptive Multi-View Multi-Label Learning. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:3507-3516. [PMID: 34788220 DOI: 10.1109/tcbb.2021.3128638] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
Accumulating evidence indicates that essential proteins play vital roles in human physiological processes. Although research on the prediction of essential proteins has developed rapidly in recent years, limitations remain, such as unsatisfactory data suitability and low accuracy of predictive results. In this manuscript, a novel method called RWAMVL is proposed to predict essential proteins based on the Random Walk and Adaptive Multi-View multi-label Learning. In RWAMVL, because inherent noise is ubiquitous in existing datasets of known protein-protein interactions (PPIs), a variety of features, including biological features of proteins and topological features of PPI networks, are first obtained by adaptive multi-view multi-label learning. An improved random walk method is then designed to detect essential proteins based on these features. Finally, to verify the predictive performance of RWAMVL, intensive experiments were done to compare it with multiple state-of-the-art predictive methods under different expeditionary frameworks. RWAMVL achieved better prediction accuracy than all of the competing methods, which suggests that it may be a potential tool for the prediction of key proteins in the future.
|
17
|
Wang C, Zhang H, Ma H, Wang Y, Cai K, Guo T, Yang Y, Li Z, Zhu Y. Inference of pan-cancer related genes by orthologs matching based on enhanced LSTM model. Front Microbiol 2022; 13:963704. [PMID: 36267181 PMCID: PMC9577021 DOI: 10.3389/fmicb.2022.963704] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2022] [Accepted: 08/16/2022] [Indexed: 11/13/2022] Open
Abstract
Many disease-related genes have been found to be associated with cancer diagnosis, which is useful for understanding the pathophysiology of cancer, generating targeted drugs, and developing new diagnostic and treatment techniques. With the development of the pan-cancer project and the ongoing expansion of sequencing technology, many scientists are focusing on mining common genes from The Cancer Genome Atlas (TCGA) across various cancer types. In this study, motivated by the benefits of reverse genetics, we attempted to infer pan-cancer associated genes by examining the microbial model organism Saccharomyces cerevisiae (yeast) through homology matching. First, a background network of protein-protein interactions and a pathogenic gene set involving several cancer types in humans and yeast were created. The homology between the human gene and the yeast gene was then discovered by homology matching, and its interaction sub-network was obtained. This was undertaken following the principle that the homologous genes of the common ancestor may have similarities in expression. Then, using bidirectional long short-term memory (BiLSTM) in combination with adaptive integration of heterogeneous information, we further explored the topological characteristics of the yeast protein interaction network and presented a node representation score to evaluate the node ability in graphs. Finally, homologous mapping for human genes matched the important genes identified by ensemble classifiers for yeast, which may be thought of as genes connected to all types of cancer. One way to assess the performance of the BiLSTM model is through experiments on the database. In addition, enrichment analysis, survival analysis, and other outcomes can be used to confirm the biological importance of the prediction results. The complete experimental protocols and programs are available at https://github.com/zhuyuan-cug/AI-BiLSTM/tree/master.
Affiliation(s)
- Chao Wang
- Department of Surgery, Hepatic Surgery Center, Institute of Hepato-Pancreato-Biliary Surgery, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
- Houwang Zhang
- Department of Electrical Engineering, City University of Hong Kong, Kowloon, Hong Kong SAR, China
- Haishu Ma
- School of Automation, China University of Geosciences, Wuhan, China
- Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan, China
- Engineering Research Center of Intelligent Technology for Geo-Exploration, Wuhan, China
- Yawen Wang
- School of Mathematics and Physics, China University of Geosciences, Wuhan, China
- Ke Cai
- School of Automation, China University of Geosciences, Wuhan, China
- Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan, China
- Engineering Research Center of Intelligent Technology for Geo-Exploration, Wuhan, China
- Tingrui Guo
- School of Automation, China University of Geosciences, Wuhan, China
- Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan, China
- Engineering Research Center of Intelligent Technology for Geo-Exploration, Wuhan, China
- Yuanhang Yang
- School of Mathematics and Physics, China University of Geosciences, Wuhan, China
- Zhen Li
- School of Mathematics and Physics, China University of Geosciences, Wuhan, China
- Yuan Zhu
- School of Automation, China University of Geosciences, Wuhan, China
- Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan, China
- Engineering Research Center of Intelligent Technology for Geo-Exploration, Wuhan, China
- Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, Shanghai, China
- *Correspondence: Yuan Zhu
|
18
|
Saha S, Chatterjee P, Halder AK, Nasipuri M, Basu S, Plewczynski D. ML-DTD: Machine Learning-Based Drug Target Discovery for the Potential Treatment of COVID-19. Vaccines (Basel) 2022; 10:1643. [PMID: 36298508 PMCID: PMC9607653 DOI: 10.3390/vaccines10101643] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2022] [Revised: 09/11/2022] [Accepted: 09/14/2022] [Indexed: 11/05/2022] Open
Abstract
Recent research has highlighted that a large section of druggable protein targets in the human interactome remains unexplored for various diseases. Exploring them might support drug repurposing studies and help in the in-silico prediction of new drug-human protein target interactions. The same applies to the current COVID-19 pandemic. It is highly desirable to identify potential human drug targets for COVID-19 using a machine learning approach, since it saves time and labor compared to traditional experimental methods. Structure-based drug discovery, where druggability is determined by molecular docking, is only appropriate for proteins whose three-dimensional structures are available. With machine learning algorithms, features that differentiate targets from non-targets can be used for proteins whose three-dimensional structures are unavailable. In this research, a Machine Learning-based Drug Target Discovery (ML-DTD) approach is proposed, in which a machine learning model is initially built and tested on a curated dataset of COVID-19 human drug targets and non-targets, formed from the Therapeutic Target Database (TTD) and the human interactome, using several classifiers: XGBoost Classifier, AdaBoost Classifier, Logistic Regression, Support Vector Classification, Decision Tree Classifier, Random Forest Classifier, Naive Bayes Classifier, and K-Nearest Neighbour Classifier (KNN). In this method, protein features include Gene Set Enrichment Analysis (GSEA) ranking, properties derived from the protein sequence, and encoded protein network centrality-based measures. Among all these, the XGBoost, KNN, and Random Forest models are satisfactory and consistent. The model is further used to predict novel COVID-19 human drug targets, which are then validated by target pathway analysis, the emergence of allied repurposed drugs, and their subsequent docking study.
Affiliation(s)
- Sovan Saha
- Department of Computer Science & Engineering, Institute of Engineering & Management, Salt Lake Electronics Complex, Kolkata 700091, India
- Piyali Chatterjee
- Department of Computer Science & Engineering, Netaji Subhash Engineering College, Techno City, Panchpota, Garia, Kolkata 700152, India
- Anup Kumar Halder
- Faculty of Mathematics and Information Sciences, Warsaw University of Technology, Koszykowa 75, 00-662 Warsaw, Poland
- Laboratory of Functional and Structural Genomics, Centre of New Technologies, University of Warsaw, Banacha 2c Street, 02-097 Warsaw, Poland
- Mita Nasipuri
- Department of Computer Science & Engineering, Jadavpur University, 188, Raja S.C. Mallick Road, Kolkata 700032, India
- Subhadip Basu
- Department of Computer Science & Engineering, Jadavpur University, 188, Raja S.C. Mallick Road, Kolkata 700032, India
- Dariusz Plewczynski
- Faculty of Mathematics and Information Sciences, Warsaw University of Technology, Koszykowa 75, 00-662 Warsaw, Poland
- Laboratory of Functional and Structural Genomics, Centre of New Technologies, University of Warsaw, Banacha 2c Street, 02-097 Warsaw, Poland
|
19
|
Rule-Based Pruning and In Silico Identification of Essential Proteins in Yeast PPIN. Cells 2022; 11:cells11172648. [PMID: 36078056 PMCID: PMC9454873 DOI: 10.3390/cells11172648] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/28/2022] [Revised: 08/18/2022] [Accepted: 08/22/2022] [Indexed: 11/25/2022] Open
Abstract
Proteins are vital for the significant cellular activities of living organisms. However, not all of them are essential. Identifying essential proteins through different biological experiments is relatively more laborious and time-consuming than the computational approaches used in recent times. However, practical implementation of conventional scientific methods sometimes becomes challenging due to poor performance impact in specific scenarios. Thus, more developed and efficient computational prediction models are required for essential protein identification. An effective methodology is proposed in this research, capable of predicting essential proteins in a refined yeast protein–protein interaction network (PPIN). The rule-based refinement is done using protein complex and local interaction density information derived from the neighborhood properties of proteins in the network. Identification and pruning of non-essential proteins are equally crucial here. In the initial phase, careful assessment is performed by applying node and edge weights to identify and discard the non-essential proteins from the interaction network. Three cut-off levels are considered for each node and edge weight for pruning the non-essential proteins. Once the PPIN has been filtered out, the second phase starts with two centralities-based approaches: (1) local interaction density (LID) and (2) local interaction density with protein complex (LIDC), which are successively implemented to identify the essential proteins in the yeast PPIN. Our proposed methodology achieves better performance in comparison to the existing state-of-the-art techniques.
|
20
|
Identifying essential proteins from protein-protein interaction networks based on influence maximization. BMC Bioinformatics 2022; 23:339. [PMID: 35974329 PMCID: PMC9380286 DOI: 10.1186/s12859-022-04874-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2022] [Accepted: 08/03/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Essential proteins are indispensable to the development and survival of cells. The identification of essential proteins not only helps in understanding the minimal requirements for cell survival, but also has practical significance in disease diagnosis, drug design and medical treatment. With the rapid accumulation of protein-protein interaction (PPI) data, computationally identifying essential proteins from protein-protein interaction networks (PINs) has become increasingly popular. To date, a number of approaches for essential protein identification based on PINs have been developed. RESULTS: In this paper, we propose a new and effective approach called iMEPP to identify essential proteins from PINs by fusing multiple types of biological data and applying the influence maximization mechanism to the PINs. Concretely, we first integrate PPI data, gene expression data and Gene Ontology to construct weighted PINs, to alleviate the impact of the high false-positive rate in the raw PPI data. Then, we define the influence scores of nodes in PINs with both orthology data and PIN topological information. Finally, we develop an influence discount algorithm to identify essential proteins based on the influence maximization mechanism. CONCLUSIONS: We applied our method to identifying essential proteins from the Saccharomyces cerevisiae PIN. Experiments show that our iMEPP method outperforms the existing methods, which validates its effectiveness and advantage.
|
21
|
Yue Y, Ye C, Peng PY, Zhai HX, Ahmad I, Xia C, Wu YZ, Zhang YH. A deep learning framework for identifying essential proteins based on multiple biological information. BMC Bioinformatics 2022; 23:318. [PMID: 35927611 PMCID: PMC9351218 DOI: 10.1186/s12859-022-04868-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 06/07/2022] [Accepted: 07/29/2022] [Indexed: 11/15/2022] Open
Abstract
Background: Essential proteins exert vital functions in cellular processes and are indispensable for the survival and reproduction of the organism. Traditional centrality methods perform poorly on complex protein–protein interaction (PPI) networks. Machine learning approaches based on high-throughput data do not exploit the temporal and spatial dimensions of biological information. Results: We put forward a deep learning framework to predict essential proteins by integrating features obtained from the PPI network, subcellular localization, and gene expression profiles. In our model, the node2vec method is applied to learn continuous feature representations for proteins in the PPI network, which capture the diversity of connectivity patterns in the network. Depthwise separable convolution is applied to gene expression profiles to extract properties and observe the trends of gene expression over time under different experimental conditions. Subcellular localization information is mapped into a long one-dimensional vector to capture its characteristics. Additionally, we use a sampling method to mitigate the impact of imbalanced learning when training the model. In experiments on Saccharomyces cerevisiae data, our model outperforms traditional centrality methods and machine learning methods. Likewise, the comparative experiments show that our processing of the various biological information is preferable. Conclusions: Our proposed deep learning framework effectively identifies essential proteins by integrating multiple biological data, showing that a broader selection of subcellular localization information significantly improves the prediction results and that depthwise separable convolution applied to gene expression profiles enhances performance.
Affiliation(s)
- Yi Yue
- Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information, Anhui Agricultural University, Hefei, 230036, China
- School of Information and Computer, Anhui Agricultural University, Hefei, 230036, China
- School of Life Sciences, Anhui Agricultural University, Hefei, 230036, China
- State Key Laboratory of Tea Plant Biology and Utilization, Anhui Agricultural University, Hefei, 230036, China
- Chen Ye
- Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information, Anhui Agricultural University, Hefei, 230036, China
- School of Information and Computer, Anhui Agricultural University, Hefei, 230036, China
- Pei-Yun Peng
- Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information, Anhui Agricultural University, Hefei, 230036, China
- School of Information and Computer, Anhui Agricultural University, Hefei, 230036, China
- Hui-Xin Zhai
- Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information, Anhui Agricultural University, Hefei, 230036, China
- School of Information and Computer, Anhui Agricultural University, Hefei, 230036, China
- Iftikhar Ahmad
- Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information, Anhui Agricultural University, Hefei, 230036, China
- School of Information and Computer, Anhui Agricultural University, Hefei, 230036, China
- Chuan Xia
- Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information, Anhui Agricultural University, Hefei, 230036, China
- School of Information and Computer, Anhui Agricultural University, Hefei, 230036, China
- Yun-Zhi Wu
- Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information, Anhui Agricultural University, Hefei, 230036, China
- School of Information and Computer, Anhui Agricultural University, Hefei, 230036, China
- State Key Laboratory of Tea Plant Biology and Utilization, Anhui Agricultural University, Hefei, 230036, China
- You-Hua Zhang
- Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information, Anhui Agricultural University, Hefei, 230036, China
- School of Information and Computer, Anhui Agricultural University, Hefei, 230036, China
- School of Life Sciences, Anhui Agricultural University, Hefei, 230036, China
|
22
|
Sagulkoo P, Suratanee A, Plaimas K. Immune-Related Protein Interaction Network in Severe COVID-19 Patients toward the Identification of Key Proteins and Drug Repurposing. Biomolecules 2022; 12:biom12050690. [PMID: 35625619 PMCID: PMC9138873 DOI: 10.3390/biom12050690] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/08/2022] [Revised: 05/07/2022] [Accepted: 05/09/2022] [Indexed: 02/05/2023] Open
Abstract
Coronavirus disease 2019 (COVID-19) is still an active global public health issue. Although vaccines and therapeutic options are available, some patients experience severe conditions and need critical care support. Hence, identifying key genes or proteins involved in immune-related severe COVID-19 is necessary to find or develop the targeted therapies. This study proposed a novel construction of an immune-related protein interaction network (IPIN) in severe cases with the use of a network diffusion technique on a human interactome network and transcriptomic data. Enrichment analysis revealed that the IPIN was mainly associated with antiviral, innate immune, apoptosis, cell division, and cell cycle regulation signaling pathways. Twenty-three proteins were identified as key proteins to find associated drugs. Finally, poly (I:C), mitomycin C, decitabine, gemcitabine, hydroxyurea, tamoxifen, and curcumin were the potential drugs interacting with the key proteins to heal severe COVID-19. In conclusion, IPIN can be a good representative network for the immune system that integrates the protein interaction network and transcriptomic data. Thus, the key proteins and target drugs in IPIN help to find a new treatment with the use of existing drugs to treat the disease apart from vaccination and conventional antiviral therapy.
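The abstract does not say which network diffusion technique was used. As one common choice, a random walk with restart (RWR) from a set of seed proteins can be sketched in Python as below. Everything specific here is an assumption made for illustration: the toy graph, the seed choice (in practice seeds might come from transcriptomic data), the restart probability of 0.5, and the function name `random_walk_with_restart`.

```python
def random_walk_with_restart(adj, seeds, restart=0.5, tol=1e-12, max_iter=10_000):
    """Iterate p <- (1 - restart) * W p + restart * p0 until convergence.

    W spreads each node's score evenly over its neighbors. Every node is
    assumed to have at least one neighbor (no isolated nodes)."""
    n = len(adj)
    p0 = [0.0] * n
    for s in seeds:
        p0[s] = 1.0 / len(seeds)
    p = p0[:]
    for _ in range(max_iter):
        nxt = [restart * p0[i] for i in range(n)]
        for j, neighbors in enumerate(adj):
            share = (1.0 - restart) * p[j] / len(neighbors)
            for i in neighbors:
                nxt[i] += share
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Toy interactome: path 0 - 1 - 2, with node 0 as the only seed.
toy_adj = [[1], [0, 2], [1]]
diffusion_scores = random_walk_with_restart(toy_adj, seeds=[0], restart=0.5)
print(diffusion_scores)
```

Nodes with high steady-state scores are close to the seeds in the network; thresholding such scores is one way a disease-related subnetwork could be extracted, though the study's actual procedure may differ.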
Affiliation(s)
- Pakorn Sagulkoo
- Program in Bioinformatics and Computational Biology, Graduate School, Chulalongkorn University, Bangkok 10330, Thailand
- Center of Biomedical Informatics, Department of Family Medicine, Faculty of Medicine, Chiang Mai University, Chiang Mai 50200, Thailand
- Apichat Suratanee
- Department of Mathematics, Faculty of Applied Science, King Mongkut’s University of Technology North Bangkok, Bangkok 10800, Thailand
- Intelligent and Nonlinear Dynamics Innovations Research Center, Science and Technology Research Institute, King Mongkut’s University of Technology North Bangkok, Bangkok 10800, Thailand
- Kitiporn Plaimas
- Advance Virtual and Intelligent Computing (AVIC) Center, Department of Mathematics and Computer Science, Faculty of Science, Chulalongkorn University, Bangkok 10330, Thailand
- Omics Science and Bioinformatics Center, Faculty of Science, Chulalongkorn University, Bangkok 10330, Thailand
- Correspondence:
| |
23
Wu C, Feng Z, Zheng J, Zhang H, Cao J, Yan H. Star topology convolution for graph representation learning. COMPLEX INTELL SYST 2022. [DOI: 10.1007/s40747-022-00744-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
Abstract
We present a novel graph convolutional method called star topology convolution (STC). This method makes graph convolution more similar to conventional convolutional neural networks (CNNs) in Euclidean feature spaces. STC learns subgraphs that have a star topology rather than learning a fixed graph like most spectral methods. Due to the properties of a star topology, STC is graph-scale free (without a fixed graph size constraint). It has fewer parameters in its convolutional filter and is inductive, so it is more flexible and can be applied to large and evolving graphs. The convolutional filter is learnable and localized, similar to CNNs in Euclidean feature spaces, and can share weights across graphs. To test the method, STC was compared with state-of-the-art graph convolutional methods in a supervised learning setting on nine node property prediction benchmark datasets: Cora, Citeseer, Pubmed, PPI, Arxiv, MAG, ACM, DBLP, and IMDB. The experimental results showed that STC achieved state-of-the-art performance on all these datasets and maintained good robustness. In an essential protein identification task, STC outperformed state-of-the-art essential protein identification methods. An application that uses pretrained STC as the embedding for feature extraction in downstream classification tasks was also introduced. The experimental results showed that STC can share weights across different graphs and be used as the embedding to improve the performance of downstream tasks.
24
Zhang Z, Luo Y, Jiang M, Wu D, Zhang W, Yan W, Zhao B. An efficient strategy for identifying essential proteins based on homology, subcellular location and protein-protein interaction information. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2022; 19:6331-6343. [PMID: 35603404 DOI: 10.3934/mbe.2022296] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
High-throughput biological experiments are expensive and time consuming. In the past few years, many computational methods based on biological information have been proposed and widely used to understand the biological background. However, the processing of biological information data inevitably produces false positive and false negative data, such as the noise in Protein-Protein Interaction (PPI) networks and the noise generated by integrating a variety of biological information. Solving these noise problems is key to essential protein prediction. This paper proposes an Identifying Essential Proteins model based on non-negative Matrix Symmetric tri-Factorization and multiple biological information (IEPMSF). The model uses only the common-neighbor characteristics of proteins in the PPI network to build a weighted network, and applies non-negative matrix symmetric tri-factorization to find more potential interactions between proteins so as to optimize the weighted network. Then, the starting score of each protein is determined from subcellular location and lineal homology information, and a random walk with restart algorithm is applied to the optimized network to score and rank each protein. We tested the proposed prediction model against current representative approaches using a public database. The experiments show that the new method is highly efficient at identifying essential proteins. The effectiveness of this method shows that it can largely solve the noise problems that exist in the multi-source biological information itself and that are caused by integrating it.
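Illustrative sketch (not the authors' implementation): the abstract above ranks proteins with a random walk with restart seeded by per-protein starting scores. The minimal Python version below shows that general technique only. The function name, the restart probability of 0.3, the dictionary-based graph format, and the toy star network are all assumptions made for illustration; the matrix factorization step and the way IEPMSF derives its starting scores are not reproduced.

```python
def random_walk_with_restart(adj, start, restart=0.3, tol=1e-10, max_iter=1000):
    """Rank nodes by the stationary distribution of a random walk with restart.

    adj   : dict node -> dict neighbor -> non-negative edge weight (symmetric)
    start : dict node -> non-negative starting score (hypothetical input format)
    """
    nodes = sorted(adj)
    total = sum(start.get(n, 0.0) for n in nodes)
    if total <= 0:
        raise ValueError("starting scores must have a positive sum")
    # Normalized restart distribution built from the starting scores.
    p0 = {n: start.get(n, 0.0) / total for n in nodes}
    strength = {n: sum(adj[n].values()) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        # With probability `restart` the walker jumps back to p0.
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            if strength[u] == 0:
                # Isolated node: send its walking mass back to p0.
                for n in nodes:
                    nxt[n] += (1 - restart) * p[u] * p0[n]
                continue
            # Otherwise the walker moves to a neighbor in proportion to edge weight.
            for v, w in adj[u].items():
                nxt[v] += (1 - restart) * p[u] * w / strength[u]
        diff = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if diff < tol:
            break
    return p


# Toy star network: hub "A" connected to three leaves, uniform starting scores.
star = {
    "A": {"B": 1.0, "C": 1.0, "D": 1.0},
    "B": {"A": 1.0},
    "C": {"A": 1.0},
    "D": {"A": 1.0},
}
scores = random_walk_with_restart(star, {n: 1.0 for n in star})
ranking = sorted(scores, key=lambda n: (-scores[n], n))
```

In this toy network the hub collects the largest share of the probability mass, which is the behavior a ranking of candidate essential proteins relies on.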
Affiliation(s)
- Zhihong Zhang
  - College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, Hunan 410022, China
- Yingchun Luo
  - Department of Ultrasound, Hunan Provincial Maternal and Child Health Care Hospital, Changsha, Hunan 410008, China
- Meiping Jiang
  - Department of Ultrasound, Hunan Provincial Maternal and Child Health Care Hospital, Changsha, Hunan 410008, China
- Dongjie Wu
  - Department of Banking and Finance, Monash University, Clayton, Victoria 3168, Australia
- Wang Zhang
  - Department of Optoelectronic Engineering, Jinan University, Guangzhou, Guangdong 510632, China
- Wei Yan
  - College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, Hunan 410022, China
- Bihai Zhao
  - College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, Hunan 410022, China
25
Xin XH, Zhang YY, Gao CQ, Min H, Wang L, Du PF. SGII: Systematic Identification of Essential lncRNAs in Mouse and Human Genome With lncRNA-Protein-Protein Heterogeneous Interaction Network. Front Genet 2022; 13:864564. [PMID: 35386279 PMCID: PMC8978670 DOI: 10.3389/fgene.2022.864564] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/28/2022] [Accepted: 03/02/2022] [Indexed: 12/25/2022] [Open access]
Abstract
Long noncoding RNAs (lncRNAs) play important roles in a variety of biological processes. Knocking out or knocking down some lncRNA genes can lead to death or infertility. These lncRNAs are called essential lncRNAs. Identifying the essential lncRNA is of importance for complex disease diagnosis and treatments. However, experimental methods for identifying essential lncRNAs are always costly and time consuming. Therefore, computational methods can be considered as an alternative approach. We propose a method to identify essential lncRNAs by combining network centrality measures and lncRNA sequence information. By constructing a lncRNA-protein-protein interaction network, we measure the essentiality of lncRNAs from their role in the network and their sequence together. We name our method as the systematic gene importance index (SGII). As far as we can tell, this is the first attempt to identify essential lncRNAs by combining sequence and network information together. The results of our method indicated that essential lncRNAs have similar roles in the LPPI network as the essential coding genes in the PPI network. Another encouraging observation is that the network information can significantly boost the predictive performance of sequence-based method. All source code and dataset of SGII have been deposited in a GitHub repository (https://github.com/ninglolo/SGII).
Affiliation(s)
- Xiao-Hong Xin
  - College of Intelligence and Computing, Tianjin University, Tianjin, China
- Ying-Ying Zhang
  - College of Intelligence and Computing, Tianjin University, Tianjin, China
- Chu-Qiao Gao
  - College of Intelligence and Computing, Tianjin University, Tianjin, China
- Hui Min
  - College of Intelligence and Computing, Tianjin University, Tianjin, China
- Likun Wang
  - Institute of Systems Biomedicine, Department of Pathology, School of Basic Medical Sciences, Beijing Key Laboratory of Tumor Systems Biology, Peking-Tsinghua Center of Life Sciences, Peking University Health Science Center, Beijing, China
- Pu-Feng Du
  - College of Intelligence and Computing, Tianjin University, Tianjin, China
26
Lu H, Shang C, Zou S, Cheng L, Yang S, Wang L. A Novel Method for Predicting Essential Proteins by Integrating Multidimensional Biological Attribute Information and Topological Properties. Curr Bioinform 2022. [DOI: 10.2174/1574893617666220304201507] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
Abstract
Background:
Essential proteins are indispensable to the maintenance of life activities and play essential roles in the areas of synthetic biology. Identification of essential proteins by computational methods has become a hot topic in recent years because of its efficiency.
Objective:
Identification of essential proteins is of important significance and practical use in the areas of synthetic biology, drug targets, and human disease genes.
Method:
In this paper, a method called EOP (Edge clustering coefficient-Orthologous-Protein) is proposed to infer potential essential proteins by combining multidimensional biological attribute information of proteins with topological properties of the protein-protein interaction network.
Results:
Simulation results on the yeast protein interaction network show that this method identifies more essential proteins than 12 other methods (DC, IC, EC, SC, BC, CC, NC, LAC, PEC, CoEWC, POEM, DWE). Compared with DC (Degree Centrality) in particular, the sensitivity (SN) is 9% higher; the recognition rate is 34% higher when the top 1% of proteins are taken as candidates, and 36%, 22%, 15%, 11%, and 8% higher for the top 5%, 10%, 15%, 20%, and 25%, respectively.
Conclusion:
Experimental results show that our method can achieve satisfactory prediction results, which may provide references for future research.
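Illustrative sketch (not the authors' code): the results above compare methods by how many known essential proteins appear among the top-ranked candidates, with Degree Centrality (DC) as a baseline. The Python below shows one plausible way to compute that kind of count for a DC ranking. The function names, the tie-breaking rule, the rounding of the candidate count, and the toy network are assumptions; EOP itself is not reproduced.

```python
import math


def degree_centrality(edges):
    """Count distinct neighbors per node in an undirected edge list."""
    neighbors = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-interactions
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    return {node: len(nbrs) for node, nbrs in neighbors.items()}


def top_fraction_hits(scores, essential, fraction):
    """Return (hits, k): known essential proteins among the top `fraction` of the ranking.

    Ties are broken alphabetically and k is rounded up; both are assumptions.
    """
    k = max(1, math.ceil(fraction * len(scores)))
    ranked = sorted(scores, key=lambda n: (-scores[n], n))
    hits = sum(1 for n in ranked[:k] if n in essential)
    return hits, k


# Toy network and a hypothetical set of known essential proteins.
toy_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]
dc = degree_centrality(toy_edges)
hits, k = top_fraction_hits(dc, essential={"A", "D"}, fraction=0.5)
```

Dividing `hits` by `k` gives the fraction of top-ranked candidates that are truly essential, and the same function can score any other ranking for a side-by-side comparison.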
Affiliation(s)
- Hanyu Lu
  - College of Big Data and Information Engineering, Guizhou University, Guizhou, China
- Chen Shang
  - College of Big Data and Information Engineering, Guizhou University, Guizhou, China
- Sai Zou
  - College of Big Data and Information Engineering, Guizhou University, Guizhou, China
- Lihong Cheng
  - College of Foreign Languages, Dalian Jiaotong University, China
- Shikong Yang
  - College of Big Data and Information Engineering, Guizhou University, Guizhou, China
- Lei Wang
  - College of Computer Engineering and Applied Mathematics, Changsha University, China
27
Zhu X, Zhu Y, Tan Y, Chen Z, Wang L. An Iterative Method for Predicting Essential Proteins Based on Multifeature Fusion and Linear Neighborhood Similarity. Front Aging Neurosci 2022; 13:799500. [PMID: 35140599 PMCID: PMC8819145 DOI: 10.3389/fnagi.2021.799500] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/21/2021] [Accepted: 12/02/2021] [Indexed: 11/13/2022] [Open access]
Abstract
Growing evidence has demonstrated that many biological processes are inseparable from the participation of key proteins. In this paper, a novel iterative method called linear neighborhood similarity-based protein multifeatures fusion (LNSPF) is proposed to identify potential key proteins based on multifeature fusion. In LNSPF, an original protein-protein interaction (PPI) network is first constructed from known protein-protein interaction data downloaded from benchmark databases, and topological features are extracted from it. Next, gene expression data of proteins are used to transform the original PPI network into a weighted PPI network based on the linear neighborhood similarity. After that, subcellular localization and homologous information of proteins are integrated to extract functional features for proteins, and, based on both the functional and topological features obtained above, an iterative method is designed and carried out to predict potential key proteins. Finally, to evaluate the predictive performance of LNSPF, extensive experiments have been done, and comparison results between LNSPF and 15 state-of-the-art competitive methods have demonstrated that LNSPF can achieve satisfactory recognition accuracy, which is markedly better than that achieved by each competing method.
Affiliation(s)
- Xianyou Zhu
  - College of Computer Science and Technology, Hengyang Normal University, Hengyang, China
- Yaocan Zhu
  - College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, China
- Yihong Tan
  - College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, China
- Zhiping Chen
  - College of Computer Science and Technology, Hengyang Normal University, Hengyang, China
  - College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, China
- Lei Wang
  - College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, China
28
Predicting Essential Proteins Based on Integration of Local Fuzzy Fractal Dimension and Subcellular Location Information. Genes (Basel) 2022; 13:genes13020173. [PMID: 35205217 PMCID: PMC8872415 DOI: 10.3390/genes13020173] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 11/18/2021] [Revised: 01/08/2022] [Accepted: 01/12/2022] [Indexed: 11/17/2022] [Open access]
Abstract
Essential proteins are indispensable to cells’ survival and development. Prediction and analysis of essential proteins are crucial for uncovering the mechanisms of cells. With the help of computer science and high-throughput technologies, forecasting essential proteins from protein–protein interaction (PPI) networks has become more efficient than traditional approaches, which generally rely on expensive experimental methods. Many computational algorithms have been employed to predict essential proteins; however, they have various restrictions. To improve the prediction accuracy, we introduce the Local Fuzzy Fractal Dimension (LFFD) of complex networks into the analysis of the PPI network and propose a novel algorithm named LDS, which combines the LFFD of the PPI network with protein subcellular location information. Tests of the proposed LDS algorithm on three different yeast PPI networks show that LDS outperforms some state-of-the-art essential protein-prediction techniques.
29
Liu Y, Liang H, Zou Q, He Z. Significance-Based Essential Protein Discovery. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:633-642. [PMID: 32750873 DOI: 10.1109/tcbb.2020.3004364] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
The identification of essential proteins is an important problem in bioinformatics. During the past decades, many centrality measures and algorithms have been proposed to address this issue. However, existing methods still suffer from the following drawbacks: (1) the lack of a context-free and readily interpretable quantification of their centrality values; (2) the difficulty of specifying a proper threshold for their centrality values; (3) the inability to control the quality of reported essential proteins in a statistically sound manner. To overcome the limitations of existing solutions, we tackle the essential protein discovery problem from a significance testing perspective. More precisely, the essential protein discovery problem is formulated as a multiple hypothesis testing problem, where the null hypothesis is that each protein is not an essential protein. To quantify the statistical significance of each protein, we present a p-value calculation method in which both the degree and the local clustering coefficient are used as the test statistic and the Erdős-Rényi model is employed as the random graph model. After calculating the p-value for each protein, the false discovery rate is used as the error rate in the multiple testing correction procedure. Our significance-based essential protein discovery method is named SigEP and is tested on both simulated networks and real PPI networks. The experimental results show that our method is able to achieve better performance than competing algorithms.
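Illustrative sketch (not SigEP itself): the abstract above frames essential protein discovery as multiple hypothesis testing against an Erdős-Rényi null model with false discovery rate control. The Python below shows a simplified version of that idea under loudly stated assumptions: it uses the degree alone as the test statistic (the paper also uses the local clustering coefficient), estimates the edge probability as the observed network density, and applies the Benjamini-Hochberg procedure as one common way to control the false discovery rate. All function names are hypothetical.

```python
from math import comb


def binomial_upper_tail(n_trials, p, k):
    """P(X >= k) for X ~ Binomial(n_trials, p)."""
    return sum(
        comb(n_trials, i) * p**i * (1 - p) ** (n_trials - i)
        for i in range(k, n_trials + 1)
    )


def degree_p_values(edges):
    """p-value of each node's degree under an Erdős-Rényi G(n, p) null model.

    Assumption: p is estimated as the observed density 2m / (n(n-1)), so a
    node's degree under the null is Binomial(n - 1, p).
    """
    neighbors = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-interactions
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    n = len(neighbors)
    if n < 2:
        raise ValueError("need at least two connected nodes")
    m = sum(len(nbrs) for nbrs in neighbors.values()) // 2
    p_edge = 2 * m / (n * (n - 1))
    return {
        node: binomial_upper_tail(n - 1, p_edge, len(nbrs))
        for node, nbrs in neighbors.items()
    }


def benjamini_hochberg(p_values, alpha=0.05):
    """Return the set of names rejected by the Benjamini-Hochberg procedure."""
    ranked = sorted(p_values.items(), key=lambda kv: kv[1])
    m = len(ranked)
    cutoff = 0
    for i, (_, p) in enumerate(ranked, start=1):
        if p <= alpha * i / m:
            cutoff = i  # largest rank that passes its threshold
    return {name for name, _ in ranked[:cutoff]}


# Toy star network: one hub with ten leaves.
star_edges = [("hub", f"leaf{i}") for i in range(10)]
p_vals = degree_p_values(star_edges)
significant = benjamini_hochberg(p_vals, alpha=0.05)
```

The appeal of this formulation is that the cutoff comes from a chosen error rate rather than from an arbitrary centrality threshold, which is the second drawback the abstract lists.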
30
Zhu X, He X, Kuang L, Chen Z, Lancine C. A Novel Collaborative Filtering Model-Based Method for Identifying Essential Proteins. Front Genet 2021; 12:763153. [PMID: 34745230 PMCID: PMC8566338 DOI: 10.3389/fgene.2021.763153] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2021] [Accepted: 09/13/2021] [Indexed: 11/19/2022] [Open access]
Abstract
Considering that traditional biological experiments are expensive and time consuming, it is important to develop effective computational models to infer potential essential proteins. In this manuscript, a novel collaborative filtering model-based method called CFMM is proposed. An updated protein–domain interaction (PDI) network is first constructed by applying a collaborative filtering algorithm to the original PDI network; then, by integrating topological features of PDI networks with biological features of proteins, a calculation method based on an improved PageRank algorithm is designed to infer potential essential proteins. The novelties of CFMM lie in the construction of an updated PDI network, the application of the commodity-customer-based collaborative filtering algorithm, and the introduction of the calculation method based on an improved PageRank algorithm, which ensure that CFMM can be applied to predict essential proteins without relying entirely on known protein–domain associations. Simulation results showed that CFMM can achieve reliable prediction accuracies of 92.16%, 83.14%, 71.37%, 63.87%, 55.84%, and 52.43% in the top 1%, 5%, 10%, 15%, 20%, and 25% of predicted candidate key proteins based on the DIP database, which are, as a whole, remarkably higher than those of 14 competitive state-of-the-art predictive models. In addition, CFMM can achieve satisfactory predictive performance on different databases under various evaluation measures, which further indicates that CFMM may be a useful tool for the identification of essential proteins in the future.
Affiliation(s)
- Xianyou Zhu
  - College of Computer Science and Technology, Hengyang Normal University, Hengyang, China
  - Hunan Provincial Key Laboratory of Intelligent Information Processing and Application, Hengyang, China
- Xin He
  - College of Computer, Xiangtan University, Xiangtan, China
- Linai Kuang
  - College of Computer, Xiangtan University, Xiangtan, China
- Zhiping Chen
  - College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, China
- Camara Lancine
  - The Social Sciences and Management University of Bamako, Bamako, Mali
31
Liu Y, Chen W, He Z. Essential Protein Recognition via Community Significance. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:2788-2794. [PMID: 34347602 DOI: 10.1109/tcbb.2021.3102018] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/13/2023]
Abstract
Essential proteins play a vital role in understanding cellular life. With the advance of high-throughput technologies, a number of protein-protein interaction (PPI) networks have been constructed such that essential proteins can be identified from a systems biology perspective. Although a series of network-based essential protein discovery methods have been proposed, these existing methods still have some drawbacks. Recently, it has been shown that the significance-based method SigEP is promising for overcoming the defects that are inherent in currently available essential protein identification methods. However, the SigEP method is developed under the unrealistic Erdős-Rényi (E-R) model and its time complexity is very high. Hence, we propose a new significance-based essential protein recognition method named EPCS, in which the essential protein discovery problem is formulated as a community significance testing problem. Experimental results on four PPI networks show that EPCS performs better than nine state-of-the-art essential protein identification methods and the only other significance-based essential protein identification method, SigEP.
32
Li S, Zhang Z, Li X, Tan Y, Wang L, Chen Z. An iteration model for identifying essential proteins by combining comprehensive PPI network with biological information. BMC Bioinformatics 2021; 22:430. [PMID: 34496745 PMCID: PMC8425031 DOI: 10.1186/s12859-021-04300-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 08/04/2020] [Accepted: 07/08/2021] [Indexed: 11/10/2022] [Open access]
Abstract
Background:
Essential proteins have great impacts on cell survival and development, and play important roles in disease analysis and new drug design. However, since it is inefficient and costly to identify essential proteins through biological experiments, there is an urgent need for automated and accurate detection methods. In recent years, the recognition of essential proteins in protein-protein interaction (PPI) networks has become a research hotspot, and many computational models for predicting essential proteins have been proposed.
Results:
To achieve higher prediction performance, a new prediction model called TGSO is proposed in this paper. In TGSO, a protein aggregation degree network is first constructed by adopting the node density measurement method for complex networks. Simultaneously, a protein co-expression interaction network is constructed by combining gene expression information with network connectivity, and a protein co-localization interaction network is constructed based on subcellular localization data. Then, by integrating these three newly constructed networks, a comprehensive protein–protein interaction network is obtained. Finally, based on homology information, scores are calculated iteratively for the proteins, which can be used to estimate the importance of proteins effectively. Moreover, to evaluate the identification performance of TGSO, we compared TGSO with 13 recent competitive methods on three yeast databases. Experimental results show that TGSO can achieve identification accuracies of 94%, 82% and 72% among the top 1%, 5% and 10% of candidate proteins respectively, which are to some degree superior to these state-of-the-art competitive models.
Conclusions:
We constructed a comprehensive interaction network based on multi-source data to reduce the noise and errors in the initial PPI network, and combined it with an iterative method to improve the accuracy of essential protein prediction, which means that TGSO may also be conducive to the future development of essential protein recognition.
Affiliation(s)
- Shiyuan Li
  - College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, 410022, China
  - Hunan Province Key Laboratory of Industrial Internet Technology and Security, Changsha University, Changsha, 410022, China
- Zhen Zhang
  - College of Electronic Information and Electrical Engineering, Changsha University, Changsha, 410022, China
- Xueyong Li
  - College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, 410022, China
  - Hunan Province Key Laboratory of Industrial Internet Technology and Security, Changsha University, Changsha, 410022, China
- Yihong Tan
  - College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, 410022, China
  - Hunan Province Key Laboratory of Industrial Internet Technology and Security, Changsha University, Changsha, 410022, China
- Lei Wang
  - College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, 410022, China
  - Hunan Province Key Laboratory of Industrial Internet Technology and Security, Changsha University, Changsha, 410022, China
- Zhiping Chen
  - College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, 410022, China
  - Hunan Province Key Laboratory of Industrial Internet Technology and Security, Changsha University, Changsha, 410022, China
33
Saha S, Chatterjee P, Nasipuri M, Basu S. Detection of spreader nodes in human-SARS-CoV protein-protein interaction network. PeerJ 2021; 9:e12117. [PMID: 34567845 PMCID: PMC8428263 DOI: 10.7717/peerj.12117] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 06/18/2021] [Accepted: 08/15/2021] [Indexed: 12/20/2022] [Open access]
Abstract
The entire world is witnessing the coronavirus pandemic (COVID-19), caused by a novel coronavirus (n-CoV) known as Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2). SARS-CoV-2 promotes fatal chronic respiratory disease followed by multiple organ failure, ultimately putting an end to human life. The International Committee on Taxonomy of Viruses (ICTV) has reached a consensus that SARS-CoV-2 is highly genetically similar (up to 89%) to the Severe Acute Respiratory Syndrome Coronavirus (SARS-CoV), which had an outbreak in 2003. With this hypothesis, the current work focuses on identifying the spreader nodes in the SARS-CoV-human protein-protein interaction network (PPIN) to find possible lineage with the disease propagation pattern of the current pandemic. Various PPIN characteristics such as edge ratio, neighborhood density, and node weight have been explored to define a new feature, the spreadability index, by which spreader proteins and protein-protein interactions (in the form of network edges) are identified. Top spreader nodes with a high spreadability index have been validated by the Susceptible-Infected-Susceptible (SIS) disease model, first using a synthetic PPIN and then a SARS-CoV-human PPIN. The ranked edges highlight the path of entire disease propagation from SARS-CoV to the human PPIN (up to the level-2 neighborhood). The developed network attribute, the spreadability index, and the generated SIS model perform better than existing state-of-the-art network centrality-based methodologies.
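Illustrative sketch (not the authors' model): the abstract above validates candidate spreader nodes with a Susceptible-Infected-Susceptible (SIS) simulation. The Python below shows a minimal discrete-time SIS process on a network, in which each infected node infects each susceptible neighbor with probability `beta` and recovers to the susceptible state with probability `gamma` at every step. The function name, the parameters, the synchronous update rule, and the toy path network are assumptions; the spreadability index is not defined in the abstract and is not implemented here.

```python
import random


def simulate_sis(adj, seeds, beta, gamma, steps, rng=None):
    """Run a discrete-time SIS process and return the infected count per step.

    adj   : dict node -> list of neighbors (undirected network)
    seeds : initially infected nodes, e.g. candidate spreader nodes
    beta  : per-contact infection probability (assumed parameter)
    gamma : per-step recovery probability (assumed parameter)
    """
    rng = rng or random.Random(0)
    infected = set(seeds)
    history = [len(infected)]
    for _ in range(steps):
        newly_infected = set()
        recovered = set()
        # Sorted iteration keeps runs reproducible for a fixed seed.
        for node in sorted(infected):
            for nb in adj[node]:
                if nb not in infected and rng.random() < beta:
                    newly_infected.add(nb)
            if rng.random() < gamma:
                recovered.add(node)
        # Synchronous update: recoveries and infections apply together.
        infected = (infected - recovered) | newly_infected
        history.append(len(infected))
    return history


# Toy path network A - B - C - D with a single seed node.
path = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
full_spread = simulate_sis(path, seeds=["A"], beta=1.0, gamma=0.0, steps=3)
no_spread = simulate_sis(path, seeds=["A"], beta=0.0, gamma=1.0, steps=1)
```

Running the simulation from different seed sets and comparing the resulting infection curves is one way to check whether highly ranked nodes really spread further than nodes chosen by other centrality measures.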
Affiliation(s)
- Sovan Saha
  - Computer Science and Engineering, Institute of Engineering and Management, Kolkata, West Bengal, India
- Piyali Chatterjee
  - Computer Science and Engineering, Netaji Subhash Engineering College, Kolkata, West Bengal, India
- Mita Nasipuri
  - Computer Science and Engineering, Jadavpur University, Kolkata, West Bengal, India
- Subhadip Basu
  - Computer Science and Engineering, Jadavpur University, Kolkata, West Bengal, India
34
Xie J, Zhao C, Sun J, Li J, Yang F, Wang J, Nie Q. Prediction of Essential Genes in Comparison States Using Machine Learning. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:1784-1792. [PMID: 32991286 DOI: 10.1109/tcbb.2020.3027392] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
Identifying essential genes in comparison states (EGS) is vital to understanding cell differentiation, performing drug discovery, and identifying disease causes. Here, we present a machine learning method termed Prediction of Essential Genes in Comparison States (PreEGS). To capture the alteration of the network in comparison states, PreEGS extracts topological and gene expression features of each gene in a five-dimensional vector. PreEGS also recruits a positive sample expansion method to address the problem of unbalanced positive and negative samples, which is often encountered in practical applications. Different classifiers are applied to the simulated datasets, and the PreEGS based on the random forests model (PreEGSRF) was chosen for optimal performance. PreEGSRF was then compared with six other methods, including three machine learning methods, to predict EGS in a specific state. On real datasets with four gene regulatory networks, PreEGSRF predicted five essential genes related to leukemia and five enriched KEGG pathways. Four of the predicted essential genes and all predicted pathways were consistent with previous studies and highly correlated with leukemia. With high prediction accuracy and generalization ability, PreEGSRF is broadly applicable for the discovery of disease-causing genes, driver genes for cell fate decisions, and complex biomarkers of biological systems.
35
Ali T, Borah C. Analysis of amino acids network based on mutation and base positions. GENE REPORTS 2021. [DOI: 10.1016/j.genrep.2021.101291] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
36
Zhang Z, Jiang M, Wu D, Zhang W, Yan W, Qu X. A Novel Method for Identifying Essential Proteins Based on Non-negative Matrix Tri-Factorization. Front Genet 2021; 12:709660. [PMID: 34422014 PMCID: PMC8378176 DOI: 10.3389/fgene.2021.709660] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/14/2021] [Accepted: 07/06/2021] [Indexed: 11/29/2022] [Open access]
Abstract
Identification of essential proteins is very important for understanding the basic requirements to sustain a living organism. In recent years, there has been an increasing interest in using computational methods to predict essential proteins based on protein–protein interaction (PPI) networks or by fusing multiple kinds of biological information. However, it has been observed that existing PPI data contain false-negative and false-positive data. The fusion of multiple kinds of biological information can reduce the influence of false data in PPI networks, but inevitably produces more noise at the same time. In this article, we propose a novel non-negative matrix tri-factorization (NMTF)-based model (NTMEP) to predict essential proteins. First, a weighted PPI network is established using only the topology features of the network, so as to avoid more noise. To reduce the influence of false data (existing in the PPI network) on the performance of essential protein identification, the NMTF technique, a widely used recommendation algorithm, is applied to reconstruct an optimized PPI network with more potential protein–protein interactions. Then, we use the PageRank algorithm to compute the final ranking score of each protein, in which subcellular localization and homologous information of proteins are used to calculate the initial scores. In addition, extensive experiments are performed on publicly available datasets, and the results indicate that our NTMEP model performs better in predicting essential proteins than the state-of-the-art method. In this investigation, we demonstrated that the introduction of non-negative matrix tri-factorization can effectively improve the condition of the protein–protein interaction network, so as to reduce the negative impact of noise on the prediction. At the same time, this finding provides a new perspective for other applications based on protein–protein interaction networks.
Affiliation(s)
- Zhihong Zhang: College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, China; School of Information Technology and Management, Hunan University of Finance and Economics, Changsha, China
- Meiping Jiang: Department of Ultrasound, Hunan Provincial Maternal and Child Health Care Hospital, Changsha, China
- Dongjie Wu: Department of Banking and Finance, Monash University, Clayton, VIC, Australia
- Wang Zhang: Department of Optoelectronic Engineering, Jinan University, Guangzhou, China
- Wei Yan: College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, China
- Xilong Qu: School of Information Technology and Management, Hunan University of Finance and Economics, Changsha, China; Hunan Provincial Key Laboratory of Finance and Economics Big Data Science and Technology, Hunan University of Finance and Economics, Changsha, China

37
Peng J, Kuang L, Zhang Z, Tan Y, Chen Z, Wang L. A Novel Model for Identifying Essential Proteins Based on Key Target Convergence Sets. Front Genet 2021; 12:721486. [PMID: 34394201 PMCID: PMC8358660 DOI: 10.3389/fgene.2021.721486] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2021] [Accepted: 06/30/2021] [Indexed: 11/20/2022]
Abstract
In recent years, many computational models have been designed to detect essential proteins based on protein-protein interaction (PPI) networks. However, because PPI networks are incomplete, the prediction accuracy of these models is still unsatisfactory. In this manuscript, a novel key target convergence sets based prediction model (KTCSPM) is proposed to identify essential proteins. In KTCSPM, a weighted PPI network and a weighted domain-domain interaction (DDI) network are first constructed from known PPIs and PDIs downloaded from benchmark databases. Then, by integrating these two kinds of networks, a novel weighted PDI network is built. Next, by assigning a unique key target convergence set (KTCS) to each node in the weighted PDI network, an improved method based on the random walk with restart is designed to identify essential proteins. Finally, to evaluate its predictive performance, KTCSPM is compared with 12 competitive state-of-the-art models, and experimental results show that KTCSPM achieves better prediction accuracy. This satisfactory predictive performance indicates that KTCSPM may be a useful supplement to future research on the prediction of essential proteins.
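The abstract above builds on the random walk with restart. As a rough illustration of that generic technique only (not the KTCSPM algorithm itself, whose key target convergence sets are not described here), a minimal pure-Python sketch follows. The function name, the toy network, and the restart probability of 0.3 are all assumptions made for the example.

```python
def random_walk_with_restart(weights, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Generic random walk with restart on a weighted, undirected network.

    weights: dict mapping node -> {neighbour: positive edge weight}; every
             node must appear as a key and edges must be listed both ways.
    seeds:   set of nodes the walker jumps back to when it restarts.
    Returns a dict mapping node -> steady-state visiting probability.
    """
    nodes = sorted(weights)
    # Restart distribution: uniform over the seed nodes.
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    strength = {n: sum(weights[n].values()) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        # With probability `restart` the walker returns to a seed node.
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            if strength[u] == 0:
                # Isolated node: send its mass back to the seeds.
                for n in nodes:
                    nxt[n] += (1 - restart) * p[u] * p0[n]
                continue
            # Otherwise move to a neighbour in proportion to edge weight.
            for v, w in weights[u].items():
                nxt[v] += (1 - restart) * p[u] * w / strength[u]
        change = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if change < tol:
            break
    return p


# Toy path network A - B - C - D with unit weights (hypothetical data).
toy_network = {
    "A": {"B": 1.0},
    "B": {"A": 1.0, "C": 1.0},
    "C": {"B": 1.0, "D": 1.0},
    "D": {"C": 1.0},
}
toy_scores = random_walk_with_restart(toy_network, seeds={"A"})
```

In this toy case the steady-state probability decreases with distance from the seed, which is the proximity signal that ranking methods of this kind rely on.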
Affiliation(s)
- Jiaxin Peng: College of Computer, Xiangtan University, Xiangtan, China; College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, China
- Linai Kuang: College of Computer, Xiangtan University, Xiangtan, China
- Zhen Zhang: College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, China
- Yihong Tan: College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, China
- Zhiping Chen: College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, China
- Lei Wang: College of Computer, Xiangtan University, Xiangtan, China; College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, China

38
He X, Kuang L, Chen Z, Tan Y, Wang L. Method for Identifying Essential Proteins by Key Features of Proteins in a Novel Protein-Domain Network. Front Genet 2021; 12:708162. [PMID: 34267785 PMCID: PMC8276041 DOI: 10.3389/fgene.2021.708162] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/11/2021] [Accepted: 05/31/2021] [Indexed: 11/21/2022]
Abstract
In recent years, owing to the low accuracy and high cost of traditional biological experiments, more and more computational models have been proposed to infer potential essential proteins. In this paper, a novel prediction method called KFPM is proposed. First, a novel protein-domain heterogeneous network is established by combining known protein-protein interactions with known associations between proteins and domains. Next, based on key topological characteristics extracted from the newly constructed protein-domain network and functional characteristics extracted from multiple kinds of biological information about proteins, a new computational method built on an improved PageRank algorithm is designed to integrate these biological features and infer potential essential proteins. Finally, to evaluate the performance of KFPM, we compared it with 13 state-of-the-art prediction methods. Experimental results show that, among the top 1, 5, and 10% of candidate proteins predicted by KFPM, the prediction accuracy reaches 96.08, 83.14, and 70.59%, respectively, which significantly outperforms all 13 competing methods. KFPM may therefore be a useful tool for predicting potential essential proteins in the future.
Affiliation(s)
- Xin He: College of Computer, Xiangtan University, Xiangtan, China
- Linai Kuang: College of Computer, Xiangtan University, Xiangtan, China
- Zhiping Chen: College of Computer Engineering & Applied Mathematics, Changsha University, Changsha, China
- Yihong Tan: College of Computer Engineering & Applied Mathematics, Changsha University, Changsha, China
- Lei Wang: College of Computer, Xiangtan University, Xiangtan, China; College of Computer Engineering & Applied Mathematics, Changsha University, Changsha, China

39
Wang N, Zeng M, Li Y, Wu FX, Li M. Essential Protein Prediction Based on node2vec and XGBoost. J Comput Biol 2021; 28:687-700. [PMID: 34152838 DOI: 10.1089/cmb.2020.0543] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 11/12/2022]
Abstract
Essential proteins are vital to the survival of organisms and cells. Identification of essential proteins lays a solid foundation for understanding protein functions and discovering drug targets. Traditional biological experiments are expensive and time-consuming. Recently, many computational methods have been proposed. However, noise in protein-protein interaction (PPI) networks affects the efficiency of essential protein prediction. It is necessary to construct a credible PPI network by using other useful biological information to reduce the effects of this noise. In this article, we propose a model, Ess-NEXG, to identify essential proteins, which integrates biological information, including orthologous information, subcellular localization information, RNA-Seq information, and the PPI network. In our model, first, we construct a credible weighted PPI network by using different types of biological information. Second, we extract the topological features of proteins in the constructed weighted PPI network by using the node2vec technique. Last, we use eXtreme Gradient Boosting (XGBoost) to predict essential proteins from the topological features of proteins. Extensive results show that our model performs better than other computational methods.
Affiliation(s)
- Nian Wang: School of Computer Science and Engineering, Central South University, Changsha, P.R. China
- Min Zeng: School of Computer Science and Engineering, Central South University, Changsha, P.R. China
- Yiming Li: School of Computer Science and Engineering, Central South University, Changsha, P.R. China
- Fang-Xiang Wu: Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, Canada; Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, Canada
- Min Li: School of Computer Science and Engineering, Central South University, Changsha, P.R. China

40
Zhong J, Tang C, Peng W, Xie M, Sun Y, Tang Q, Xiao Q, Yang J. A novel essential protein identification method based on PPI networks and gene expression data. BMC Bioinformatics 2021; 22:248. [PMID: 33985429 PMCID: PMC8120700 DOI: 10.1186/s12859-021-04175-8] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Received: 08/07/2020] [Accepted: 05/06/2021] [Indexed: 02/08/2023]
Abstract
Background: Some proposed methods for identifying essential proteins achieve better results by using biological information. Gene expression data is generally used to identify essential proteins. However, gene expression data is prone to fluctuations, which may affect the accuracy of essential protein identification. Therefore, we propose an essential protein identification method based on gene expression and PPI network data that calculates the similarity of the "active" and "inactive" states of gene expression in a cluster of the PPI network. Our experiments show that the method can improve the accuracy of predicting essential proteins.

Results: In this paper, we propose a new measure named JDC, which is based on PPI network data and gene expression data. The JDC method offers a dynamic threshold method to binarize gene expression data. It then combines the degree centrality and the Jaccard similarity index to calculate the JDC score for each protein in the PPI network. We benchmark the JDC method on four organisms and evaluate it using ROC analysis, modular analysis, jackknife analysis, overlapping analysis, top analysis, and accuracy analysis. The results show that JDC performs better than DC, IC, EC, SC, BC, CC, NC, PeC, and WDC. We also compare JDC with the NF-PIN and TS-PIN methods, which predict essential proteins through active PPI networks constructed from dynamic gene expression.

Conclusions: We demonstrate that the new centrality measure, JDC, is more efficient than state-of-the-art prediction methods given the same input. The main ideas behind JDC are as follows: (1) Essential proteins generally form densely connected clusters in the PPI network. (2) Binarizing gene expression data can screen out fluctuations in gene expression profiles. (3) The essentiality of a protein depends on the similarity of the "active" and "inactive" states of gene expression in a cluster of the PPI network.
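To make the idea of combining binarized expression, Jaccard similarity, and degree more concrete, here is a small sketch. It is not the published JDC formula: the mean-based threshold stands in for the paper's dynamic threshold, and the score (the sum of Jaccard similarities over a protein's edges) is only one plausible way to weight degree by expression similarity. All function names and data are hypothetical.

```python
from statistics import mean


def binarize(profile, threshold=None):
    """Mark each time point as active (1) or inactive (0).

    Assumption: the per-protein mean is used as the threshold; the paper's
    dynamic threshold is not specified in the abstract.
    """
    t = mean(profile) if threshold is None else threshold
    return [1 if x > t else 0 for x in profile]


def jaccard(a, b):
    """Jaccard index of the sets of active time points in two binary profiles."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


def jaccard_degree_scores(edges, expression):
    """Score each protein by the summed Jaccard similarity to its neighbours.

    If every neighbour has an identical activity pattern, the score equals
    the protein's degree; dissimilar neighbours contribute less.
    """
    states = {p: binarize(v) for p, v in expression.items()}
    scores = {p: 0.0 for p in states}
    for u, v in edges:
        s = jaccard(states[u], states[v])
        scores[u] += s
        scores[v] += s
    return scores


# Hypothetical toy data: three proteins, four time points each.
toy_expression = {
    "P1": [1, 5, 1, 5],
    "P2": [2, 6, 2, 6],
    "P3": [5, 1, 5, 1],
}
toy_edges = [("P1", "P2"), ("P1", "P3")]
toy_jd = jaccard_degree_scores(toy_edges, toy_expression)
```

Here P1 has degree 2, but only its edge to P2 counts, because P3 is active at exactly the time points where P1 is inactive.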
Affiliation(s)
- Jiancheng Zhong: School of Information Science and Engineering, Hunan Normal University, Changsha, 410081, China; Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Hunan Provincial Key Laboratory of Intelligent Computing and Language Information Processing, Changsha, 410083, China
- Chao Tang: School of Information Science and Engineering, Hunan Normal University, Changsha, 410081, China
- Wei Peng: College of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, 650500, Yunnan, China
- Minzhu Xie: School of Information Science and Engineering, Hunan Normal University, Changsha, 410081, China
- Yusui Sun: School of Information Science and Engineering, Hunan Normal University, Changsha, 410081, China
- Qiang Tang: College of Engineering and Design, Hunan Normal University, Changsha, 410081, China
- Qiu Xiao: School of Information Science and Engineering, Hunan Normal University, Changsha, 410081, China
- Jiahong Yang: School of Information Science and Engineering, Hunan Normal University, Changsha, 410081, China

41
Aromolaran O, Aromolaran D, Isewon I, Oyelade J. Machine learning approach to gene essentiality prediction: a review. Brief Bioinform 2021; 22:6219158. [PMID: 33842944 DOI: 10.1093/bib/bbab128] [Citation(s) in RCA: 47] [Impact Index Per Article: 15.7] [Received: 01/18/2021] [Revised: 03/04/2021] [Accepted: 03/17/2021] [Indexed: 12/17/2022]
Abstract
Essential genes are critical for the growth and survival of any organism. The machine learning approach complements the experimental methods to minimize the resources required for essentiality assays. Previous studies revealed the need to discover relevant features that significantly classify essential genes, improve on the generalizability of prediction models across organisms, and construct a robust gold standard as the class label for the train data to enhance prediction. Findings also show that a significant limitation of the machine learning approach is predicting conditionally essential genes. The essentiality status of a gene can change due to a specific condition of the organism. This review examines various methods applied to essential gene prediction task, their strengths, limitations and the factors responsible for effective computational prediction of essential genes. We discussed categories of features and how they contribute to the classification performance of essentiality prediction models. Five categories of features, namely, gene sequence, protein sequence, network topology, homology and gene ontology-based features, were generated for Caenorhabditis elegans to perform a comparative analysis of their essentiality prediction capacity. Gene ontology-based feature category outperformed other categories of features majorly due to its high correlation with the genes' biological functions. However, the topology feature category provided the highest discriminatory power making it more suitable for essentiality prediction. The major limiting factor of machine learning to predict essential genes conditionality is the unavailability of labeled data for interest conditions that can train a classifier. Therefore, cooperative machine learning could further exploit models that can perform well in conditional essentiality predictions. 
SHORT ABSTRACT Identification of essential genes is imperative because it provides an understanding of the core structure and function, accelerating drug targets' discovery, among other functions. Recent studies have applied machine learning to complement the experimental identification of essential genes. However, several factors are limiting the performance of machine learning approaches. This review aims to present the standard procedure and resources available for predicting essential genes in organisms, and also highlight the factors responsible for the current limitation in using machine learning for conditional gene essentiality prediction. The choice of features and ML technique was identified as an important factor to predict essential genes effectively.
Affiliation(s)
- Olufemi Aromolaran: Department of Computer and Information Sciences, Covenant University, Ota, Ogun State, Nigeria; Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun State, Nigeria
- Damilare Aromolaran: Department of Computer and Information Sciences, Covenant University, Ota, Ogun State, Nigeria; Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun State, Nigeria
- Itunuoluwa Isewon: Department of Computer and Information Sciences, Covenant University, Ota, Ogun State, Nigeria; Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun State, Nigeria
- Jelili Oyelade: Department of Computer and Information Sciences, Covenant University, Ota, Ogun State, Nigeria; Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun State, Nigeria

42
CEGSO: Boosting Essential Proteins Prediction by Integrating Protein Complex, Gene Expression, Gene Ontology, Subcellular Localization and Orthology Information. Interdiscip Sci 2021; 13:349-361. [PMID: 33772722 DOI: 10.1007/s12539-021-00426-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 09/02/2020] [Revised: 02/04/2021] [Accepted: 03/05/2021] [Indexed: 01/13/2023]
Abstract
Essential proteins are assumed to be an indispensable element in sustaining normal physiological function and crucial to drug design and disease diagnosis. The discovery of essential proteins is of great importance in revealing the molecular mechanisms and biological processes. Owing to the tedious biological experiment, many numerical methods have been developed to discover key proteins by mining the features of the high throughput data. Appropriate integration of differential biological information based on protein-protein interaction (PPI) network has been proven useful in predicting essential proteins. The main intention of this research is to provide a comprehensive study and a review on identifying essential proteins by integrating multi-source data and provide guidance for researchers. Detailed analysis and comparison of current essential protein prediction algorithms have been carried out and tested on benchmark PPI networks. In addition, based on the previous method TEGS (short for the network Topology, gene Expression, Gene ontology, and Subcellular localization), we improve the performance of predicting essential proteins by incorporating known protein complex information, the gene expression profile, Gene Ontology (GO) terms information, subcellular localization information, and protein's orthology data into the PPI network, named CEGSO. The simulation results show that CEGSO achieves more accurate and robust results than other compared methods under different test datasets with various evaluation measurements.

43
Meng Z, Kuang L, Chen Z, Zhang Z, Tan Y, Li X, Wang L. Method for Essential Protein Prediction Based on a Novel Weighted Protein-Domain Interaction Network. Front Genet 2021; 12:645932. [PMID: 33815480 PMCID: PMC8010314 DOI: 10.3389/fgene.2021.645932] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 12/24/2020] [Accepted: 02/15/2021] [Indexed: 01/04/2023]
Abstract
In recent years, a number of computational models based on protein-protein interaction (PPI) networks have been proposed. However, owing to false positives, false negatives, and the incompleteness of PPI networks, designing computational models that infer key proteins with satisfactory predictive accuracy remains challenging. This study proposes a prediction model called WPDINM for detecting key proteins based on a novel weighted protein-domain interaction (PDI) network. In WPDINM, a weighted PPI network is constructed first by combining the gene expression data of proteins with topological information extracted from the original PPI network. Simultaneously, a weighted domain-domain interaction (DDI) network is constructed based on the original PDI network. Next, by integrating the newly obtained weighted PPI network and weighted DDI network with the original PDI network, a weighted PDI network is constructed. Then, based on topological features and biological information, including the subcellular localization and orthologous information of proteins, a novel PageRank-based iterative algorithm is designed and applied to the newly constructed weighted PDI network to estimate the criticality of proteins. Finally, to assess the prediction performance of WPDINM, we compared it with 12 competitive measures. Experimental results show that WPDINM achieves predictive accuracy rates of 90.19, 81.96, 70.72, 62.04, 55.83, and 51.13% in the top 1%, 5%, 10%, 15%, 20%, and 25% of ranked proteins, respectively, which exceeds the prediction accuracy achieved by state-of-the-art competing measures. Owing to this satisfactory identification performance, the WPDINM measure may contribute to the further development of key protein identification.
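The "top k%" accuracy figures reported above can be read as the fraction of known essential proteins among the highest-ranked candidates. The sketch below shows one way to compute such a figure; the function name is hypothetical, and rounding the cutoff up with a minimum of one protein is an assumption, since the abstract does not state how the cutoff is derived.

```python
import math


def top_percent_accuracy(scores, essential, percents=(1, 5, 10, 15, 20, 25)):
    """Fraction of true essential proteins among the top-ranked candidates.

    scores:    dict mapping protein -> predicted score (higher = more essential).
    essential: set of proteins known to be essential.
    Returns a dict mapping each percentage to the accuracy at that cutoff.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    result = {}
    for pct in percents:
        # Assumption: round the cutoff up and keep at least one protein.
        k = max(1, math.ceil(len(ranked) * pct / 100))
        hits = sum(1 for protein in ranked[:k] if protein in essential)
        result[pct] = hits / k
    return result


# Hypothetical toy ranking of ten proteins, two of which are essential.
toy_ranking = {f"p{i}": 10 - i for i in range(10)}
toy_essential = {"p0", "p2"}
toy_accuracy = top_percent_accuracy(toy_ranking, toy_essential, percents=(10, 20, 50))
```

Accuracy at a cutoff typically falls as the cutoff widens, as in the figures quoted in the abstract, because lower-ranked candidates are less likely to be essential.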
Affiliation(s)
- Zixuan Meng: College of Computer, Xiangtan University, Xiangtan, China
- Linai Kuang: College of Computer, Xiangtan University, Xiangtan, China
- Zhiping Chen: College of Computer Engineering & Applied Mathematics, Changsha University, Changsha, China
- Zhen Zhang: College of Computer Engineering & Applied Mathematics, Changsha University, Changsha, China
- Yihong Tan: College of Computer Engineering & Applied Mathematics, Changsha University, Changsha, China
- Xueyong Li: College of Computer Engineering & Applied Mathematics, Changsha University, Changsha, China
- Lei Wang: College of Computer, Xiangtan University, Xiangtan, China; College of Computer Engineering & Applied Mathematics, Changsha University, Changsha, China

44
Cueno ME, Ueno M, Iguchi R, Harada T, Miki Y, Yasumaru K, Kiso N, Wada K, Baba K, Imai K. Insights on the Structural Variations of the Furin-Like Cleavage Site Found Among the December 2019-July 2020 SARS-CoV-2 Spike Glycoprotein: A Computational Study Linking Viral Evolution and Infection. Front Med (Lausanne) 2021; 8:613412. [PMID: 33777970 PMCID: PMC7987684 DOI: 10.3389/fmed.2021.613412] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 10/02/2020] [Accepted: 02/16/2021] [Indexed: 12/17/2022]
Abstract
The SARS-CoV-2 (SARS2) is the cause of the coronavirus disease 2019 (COVID-19) pandemic. One unique structural feature of the SARS2 spike protein is the presence of a furin-like cleavage site (FLC) which is associated with both viral pathogenesis and host tropism. Specifically, SARS2 spike protein binds to the host ACE-2 receptor which in-turn is cleaved by furin proteases at the FLC site, suggesting that SARS2 FLC structural variations may have an impact on viral infectivity. However, this has not yet been fully elucidated. This study designed and analyzed a COVID-19 genomic epidemiology network for December 2019 to July 2020, and subsequently generated and analyzed representative SARS2 spike protein models from significant node clusters within the network. To distinguish possible structural variations, a model quality assessment was performed before further protein model analyses and superimposition of the protein models, particularly in both the receptor-binding domain (RBD) and FLC. Mutant spike models were generated with the unique 681PRRA684 amino acid sequence found within the deleted FLC. We found 9 SARS2 FLC structural patterns that could potentially correspond to nine node clusters encompassing various countries found within the COVID-19 genomic epidemiology network. Similarly, we associated this with the rapid evolution of the SARS2 genome. Furthermore, we observed that either in the presence or absence of the unique 681PRRA684 amino acid sequence no structural changes occurred within the SARS2 RBD, which we believe would mean that the SARS2 FLC has no structural influence on SARS2 RBD and may explain why host tropism was maintained.
Affiliation(s)
- Marni E Cueno: Department of Microbiology, Nihon University School of Dentistry, Tokyo, Japan; Immersion Biology Class, Department of Science, Tokyo Gakugei University International Secondary School, Tokyo, Japan; Immersion Physics Class, Department of Science, Tokyo Gakugei University International Secondary School, Tokyo, Japan
- Miu Ueno: Immersion Physics Class, Department of Science, Tokyo Gakugei University International Secondary School, Tokyo, Japan
- Rinako Iguchi: Immersion Physics Class, Department of Science, Tokyo Gakugei University International Secondary School, Tokyo, Japan
- Tsubasa Harada: Immersion Physics Class, Department of Science, Tokyo Gakugei University International Secondary School, Tokyo, Japan
- Yoshifumi Miki: Immersion Biology Class, Department of Science, Tokyo Gakugei University International Secondary School, Tokyo, Japan
- Kanae Yasumaru: Immersion Biology Class, Department of Science, Tokyo Gakugei University International Secondary School, Tokyo, Japan
- Natsumi Kiso: Immersion Biology Class, Department of Science, Tokyo Gakugei University International Secondary School, Tokyo, Japan
- Kanta Wada: Immersion Biology Class, Department of Science, Tokyo Gakugei University International Secondary School, Tokyo, Japan
- Koki Baba: Immersion Biology Class, Department of Science, Tokyo Gakugei University International Secondary School, Tokyo, Japan
- Kenichi Imai: Department of Microbiology, Nihon University School of Dentistry, Tokyo, Japan

45
Modelling Cell Metabolism: A Review on Constraint-Based Steady-State and Kinetic Approaches. Processes (Basel) 2021. [DOI: 10.3390/pr9020322] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Indexed: 12/11/2022]
Abstract
Studying cell metabolism serves a plethora of objectives such as the enhancement of bioprocess performance, and advancement in the understanding of cell biology, of drug target discovery, and in metabolic therapy. Remarkable successes in these fields emerged from heuristics approaches, for instance, with the introduction of effective strategies for genetic modifications, drug developments and optimization of bioprocess management. However, heuristics approaches have showed significant shortcomings, such as to describe regulation of metabolic pathways and to extrapolate experimental conditions. In the specific case of bioprocess management, such shortcomings limit their capacity to increase product quality, while maintaining desirable productivity and reproducibility levels. For instance, since heuristics approaches are not capable of prediction of the cellular functions under varying experimental conditions, they may lead to sub-optimal processes. Also, such approaches used for bioprocess control often fail in regulating a process under unexpected variations of external conditions. Therefore, methodologies inspired by the systematic mathematical formulation of cell metabolism have been used to address such drawbacks and achieve robust reproducible results. Mathematical modelling approaches are effective for both the characterization of the cell physiology, and the estimation of metabolic pathways utilization, thus allowing to characterize a cell population metabolic behavior. In this article, we present a review on methodology used and promising mathematical modelling approaches, focusing primarily to investigate metabolic events and regulation. Proceeding from a topological representation of the metabolic networks, we first present the metabolic modelling approaches that investigate cell metabolism at steady state, complying to the constraints imposed by mass conservation law and thermodynamics of reactions reversibility. 
Constraint-based models (CBMs) are reviewed highlighting the set of assumed optimality functions for reaction pathways. We explore models simulating cell growth dynamics, by expanding flux balance models developed at steady state. Then, discussing a change of metabolic modelling paradigm, we describe dynamic kinetic models that are based on the mathematical representation of the mechanistic description of nonlinear enzyme activities. In such approaches metabolic pathway regulations are considered explicitly as a function of the activity of other components of metabolic networks and possibly far from the metabolic steady state. We have also assessed the significance of metabolic model parameterization in kinetic models, summarizing a standard parameter estimation procedure frequently employed in kinetic metabolic modelling literature. Finally, some optimization practices used for the parameter estimation are reviewed.

46
Dai W, Chen B, Peng W, Li X, Zhong J, Wang J. A Novel Multi-Ensemble Method for Identifying Essential Proteins. J Comput Biol 2021; 28:637-649. [PMID: 33439753 DOI: 10.1089/cmb.2020.0527] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 12/28/2022]
Abstract
Essential proteins possess critical functions for cell survival. Identifying essential proteins improves our understanding of how a cell works and also plays a vital role in research on disease treatment and drug development. Recently, some machine-learning methods and ensemble learning methods have been proposed to identify essential proteins by introducing effective protein features. However, existing ensemble learning methods have focused only on the choice of base classifiers. In this article, we propose a novel ensemble learning framework called multi-ensemble to integrate different base classifiers. The multi-ensemble method adopts the idea of multi-view learning: it selects multiple base classifiers and trains them by continually adding the samples that are predicted correctly by the other base classifiers. We applied multi-ensemble to yeast data and Escherichia coli data. The results show that our approach achieves better performance than both individual classifiers and other ensemble learning methods.
Affiliation(s)
- Wei Dai: Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China; Computer Technology Application Key Lab of Yunnan Province, Kunming University of Science and Technology, Kunming, China
- Bingxi Chen: Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China
- Wei Peng: Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China; Computer Technology Application Key Lab of Yunnan Province, Kunming University of Science and Technology, Kunming, China
- Xia Li: Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China
- Jiancheng Zhong: School of Information Science and Engineering, Hunan Normal University, Changsha, China
- Jianxin Wang: School of Computer Science and Engineering, Central South University, Changsha, China

47
Zeng M, Li M, Fei Z, Wu FX, Li Y, Pan Y, Wang J. A Deep Learning Framework for Identifying Essential Proteins by Integrating Multiple Types of Biological Information. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:296-305. [PMID: 30736002 DOI: 10.1109/tcbb.2019.2897679] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Indexed: 06/09/2023]
Abstract
Computational methods including centrality and machine learning-based methods have been proposed to identify essential proteins for understanding the minimum requirements of the survival and evolution of a cell. In centrality methods, researchers are required to design a score function which is based on prior knowledge, yet is usually not sufficient to capture the complexity of biological information. In machine learning-based methods, some selected biological features cannot represent the complete properties of biological information as they lack a computational framework to automatically select features. To tackle these problems, we propose a deep learning framework to automatically learn biological features without prior knowledge. We use node2vec technique to automatically learn a richer representation of protein-protein interaction (PPI) network topologies than a score function. Bidirectional long short term memory cells are applied to capture non-local relationships in gene expression data. For subcellular localization information, we exploit a high dimensional indicator vector to characterize their feature. To evaluate the performance of our method, we tested it on PPI network of S. cerevisiae. Our experimental results demonstrate that the performance of our method is better than traditional centrality methods and is superior to existing machine learning-based methods. To explore which of the three types of biological information is the most vital element, we conduct an ablation study by removing each component in turn. Our results show that the PPI network embedding contributes most to the improvement. In addition, gene expression profiles and subcellular localization information are also helpful to improve the performance in identification of essential proteins.
|
48
|
A novel scheme for essential protein discovery based on multi-source biological information. J Theor Biol 2020; 504:110414. [PMID: 32712150 DOI: 10.1016/j.jtbi.2020.110414] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/24/2019] [Revised: 02/14/2020] [Accepted: 07/15/2020] [Indexed: 02/06/2023]
Abstract
Mining essential proteins is crucial for understanding the processes of cellular organization and viability. Many computational methods for detecting essential proteins already exist. However, these methods focus only on the topological information of the networks and ignore the biological information of proteins, which leads to low accuracy in essential protein identification. Therefore, this paper presents a new essential protein prediction strategy, called DEP-MSB, which integrates a variety of biological information including gene expression profiles, GO annotations, and domain interaction strength. To evaluate the performance of DEP-MSB, we conduct a series of experiments on the yeast PPI network; the experimental results show that DEP-MSB outperforms existing traditional methods and clearly improves prediction accuracy.
|
49
|
Zhang W, Xu J, Zou X. Predicting Essential Proteins by Integrating Network Topology, Subcellular Localization Information, Gene Expression Profile and GO Annotation Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:2053-2061. [PMID: 31095490 DOI: 10.1109/tcbb.2019.2916038] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Indexed: 06/09/2023]
Abstract
Essential proteins are indispensable for maintaining normal cellular functions. Identification of essential proteins from protein-protein interaction (PPI) networks has become a hot topic in recent years. Traditional approaches based on biological experiments are time-consuming and expensive. Although many computational methods have been developed in recent years, their prediction accuracy is still unsatisfactory. In this research, by introducing protein subcellular localization information, we define a new measure for characterizing a protein's subcellular localization essentiality and develop a new data fusion-based method for identifying essential proteins, named TEGS, which integrates network topology, gene expression profile, GO annotation information, and protein subcellular localization information. To demonstrate the efficiency of TEGS, we evaluate its performance on two Saccharomyces cerevisiae datasets and compare it with seven other state-of-the-art methods (DC, BC, NC, PeC, WDC, SON, and TEO) in terms of true predicted number, jackknife curve, and precision-recall curve. Simulation results show that TEGS outperforms the compared methods in identifying essential proteins. The source code of TEGS is freely available at https://github.com/wzhangwhu/TEGS.
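To make the evaluation setting concrete, here is a minimal sketch of one of the baselines named above, degree centrality (DC), together with a "true predicted number" count: how many known essential proteins appear among the top-k ranked candidates. It is not the TEGS method, and the toy network, labels, and function names are made up for illustration.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Count the distinct neighbours of each protein in an undirected PPI edge list."""
    neighbours = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-interactions
            neighbours[u].add(v)
            neighbours[v].add(u)
    return {protein: len(adjacent) for protein, adjacent in neighbours.items()}


def true_predicted_number(scores, essential, k):
    """Number of known essential proteins among the k highest-scoring proteins."""
    # Ties are broken alphabetically so that the ranking is deterministic.
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return sum(1 for p in ranked[:k] if p in essential)


# Toy, made-up network and essentiality labels.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]
essential = {"A", "C"}

scores = degree_centrality(edges)
print(true_predicted_number(scores, essential, k=2))
```

Any other scoring method can be dropped in by replacing `scores` with its own protein-to-score mapping, which is how several methods can be compared at the same k.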
|
50
|
Athira K, Gopakumar G. An integrated method for identifying essential proteins from multiplex network model of protein-protein interactions. J Bioinform Comput Biol 2020; 18:2050020. [PMID: 32795133 DOI: 10.1142/s0219720020500201] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 12/12/2022]
Abstract
Cell survival requires the presence of essential proteins. Detection of essential proteins is relevant not only because of the critical biological functions they perform but also because of their role as drug targets against pathogens. Several computational techniques exist to identify essential proteins based on protein-protein interaction (PPI) networks. Essential protein detection using only physical interaction data is challenging because of the inherent uncertainty of such data. Hence, in this work, we propose a multiplex network-based framework that incorporates multiple types of protein interaction data from physical, coexpression, and phylogenetic profiles. An extended version of eigenvector centrality, termed multiplex eigenvector centrality (MEC), is used to identify essential proteins from this network. The methodology integrates the score obtained from the multiplex analysis with subcellular localization and Gene Ontology information and is implemented using Saccharomyces cerevisiae datasets. The proposed method outperformed many recent essential protein prediction techniques in the literature.
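The sketch below illustrates the general idea of scoring proteins by eigenvector centrality over several interaction layers. It is a simplified stand-in, not the authors' MEC formulation: it assumes the layers are merged by summing edge counts, and the layer contents and function names are hypothetical.

```python
def aggregate_layers(layers):
    """Sum edge counts across layers into one symmetric weighted adjacency map."""
    adj = {}
    for edges in layers:
        for u, v in edges:
            if u == v:
                continue  # skip self-interactions
            adj.setdefault(u, {})
            adj.setdefault(v, {})
            adj[u][v] = adj[u].get(v, 0.0) + 1.0
            adj[v][u] = adj[v].get(u, 0.0) + 1.0
    return adj


def eigenvector_centrality(adj, iterations=200, tol=1e-10):
    """Power iteration on (A + I); the identity shift avoids oscillation on bipartite graphs."""
    x = {n: 1.0 for n in adj}
    for _ in range(iterations):
        new = {n: x[n] + sum(w * x[m] for m, w in adj[n].items()) for n in adj}
        norm = sum(value * value for value in new.values()) ** 0.5
        new = {n: value / norm for n, value in new.items()}
        if max(abs(new[n] - x[n]) for n in adj) < tol:
            return new
        x = new
    return x


# Toy, made-up layers: physical, coexpression, and phylogenetic interactions.
physical = [("A", "B"), ("B", "C")]
coexpression = [("A", "B"), ("B", "D")]
phylogenetic = [("B", "C")]

adj = aggregate_layers([physical, coexpression, phylogenetic])
centrality = eigenvector_centrality(adj)
print(max(centrality, key=centrality.get))
```

Summing layers is only one possible aggregation; a true multiplex measure typically also models how centrality in one layer influences the others, which this sketch deliberately leaves out.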
Affiliation(s)
- K Athira, Department of Computer Science and Engineering, National Institute of Technology Calicut, Kozhikode, Kerala 673601, India
- G Gopakumar, Department of Computer Science and Engineering, National Institute of Technology Calicut, Kozhikode, Kerala 673601, India
|