1
|
Hu W, Li M, Xiao H, Guan L. Essential genes identification model based on sequence feature map and graph convolutional neural network. BMC Genomics 2024; 25:47. [PMID: 38200437 PMCID: PMC10777564 DOI: 10.1186/s12864-024-09958-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/18/2023] [Accepted: 01/01/2024] [Indexed: 01/12/2024] Open
Abstract
BACKGROUND Essential genes encode functions that play a vital role in the life activities of organisms, encompassing growth, development, immune system functioning, and cell structure maintenance. Conventional experimental techniques for identifying essential genes are resource-intensive and time-consuming, and the accuracy of current machine learning models needs further enhancement. Therefore, it is crucial to develop a robust computational model to accurately predict essential genes. RESULTS In this study, we introduce GCNN-SFM, a computational model for identifying essential genes in organisms, based on graph convolutional neural networks (GCNN). GCNN-SFM integrates a graph convolutional layer, a convolutional layer, and a fully connected layer to model and extract features from gene sequences of essential genes. Initially, the gene sequence is transformed into a feature map using coding techniques. Subsequently, a multi-layer GCN is employed to perform graph convolution operations, effectively capturing both local and global features of the gene sequence. Further feature extraction is performed, followed by integrating convolution and fully-connected layers to generate prediction results for essential genes. The gradient descent algorithm is utilized to iteratively update the cross-entropy loss function, thereby enhancing the accuracy of the prediction results. Meanwhile, model parameters are tuned to determine the optimal parameter combination that yields the best prediction performance during training. CONCLUSIONS Experimental evaluation demonstrates that GCNN-SFM surpasses various advanced essential gene prediction models and achieves an average accuracy of 94.53%. This study presents a novel and effective approach for identifying essential genes, which has significant implications for biology and genomics research.
Collapse
Affiliation(s)
- Wenxing Hu
- College of Physics and Electronic Information, Gannan Normal University, Ganzhou, Jiangxi, 341000, China
| | - Mengshan Li
- College of Physics and Electronic Information, Gannan Normal University, Ganzhou, Jiangxi, 341000, China.
| | - Haiyang Xiao
- College of Physics and Electronic Information, Gannan Normal University, Ganzhou, Jiangxi, 341000, China
| | - Lixin Guan
- College of Physics and Electronic Information, Gannan Normal University, Ganzhou, Jiangxi, 341000, China
| |
Collapse
|
2
|
Beder T, Aromolaran O, Dönitz J, Tapanelli S, Adedeji E, Adebiyi E, Bucher G, Koenig R. Identifying essential genes across eukaryotes by machine learning. NAR Genom Bioinform 2021; 3:lqab110. [PMID: 34859210 PMCID: PMC8634067 DOI: 10.1093/nargab/lqab110] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/21/2021] [Revised: 10/09/2021] [Accepted: 11/29/2021] [Indexed: 02/07/2023] Open
Abstract
Identifying essential genes on a genome scale is resource intensive and has been performed for only a few eukaryotes. For less studied organisms essentiality might be predicted by gene homology. However, this approach cannot be applied to non-conserved genes. Additionally, divergent essentiality information is obtained from studying single cells or whole, multi-cellular organisms, and particularly when derived from human cell line screens and human population studies. We employed machine learning across six model eukaryotes and 60 381 genes, using 41 635 features derived from the sequence, gene function information and network topology. Within a leave-one-organism-out cross-validation, the classifiers showed high generalizability with an average accuracy close to 80% in the left-out species. As a case study, we applied the method to Tribolium castaneum and Bombyx mori and validated predictions experimentally yielding similar performances. Finally, using the classifier based on the studied model organisms enabled linking the essentiality information of human cell line screens and population studies.
Collapse
Affiliation(s)
- Thomas Beder
- Integrated Research and Treatment Center, Center for Sepsis Control and Care (CSCC), Jena University Hospital, Am Klinikum 1, 07747 Jena, Germany
- Institute of Infectious Diseases and Infection Control, Jena University Hospital, Am Klinikum 1, 07747 Jena, Germany
- Department of Internal Medicine II, University Medical Center Schleswig-Holstein, Campus Kiel, 24105 Kiel, Germany
| | - Olufemi Aromolaran
- Department of Computer & Information Sciences, Covenant University, Ota, Ogun State, Nigeria
- Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun State, Nigeria
| | - Jürgen Dönitz
- Department of Evolutionary Developmental Genetics, GZMB, University of Göttingen, Justus-von-Liebig-Weg 11, 37077 Göttingen, Germany
- Department of Medical Bioinformatics, University Medical Center Göttingen (UMG), 37099 Göttingen, Germany
| | - Sofia Tapanelli
- Department of Life Sciences, Imperial College London, London SW7 2AZ, UK
| | - Eunice O Adedeji
- Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun State, Nigeria
- Department of Biochemistry, Covenant University, Ota, Ogun State, Nigeria
| | - Ezekiel Adebiyi
- Department of Computer & Information Sciences, Covenant University, Ota, Ogun State, Nigeria
- Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun State, Nigeria
| | - Gregor Bucher
- Department of Evolutionary Developmental Genetics, GZMB, University of Göttingen, Justus-von-Liebig-Weg 11, 37077 Göttingen, Germany
| | - Rainer Koenig
- Integrated Research and Treatment Center, Center for Sepsis Control and Care (CSCC), Jena University Hospital, Am Klinikum 1, 07747 Jena, Germany
- Institute of Infectious Diseases and Infection Control, Jena University Hospital, Am Klinikum 1, 07747 Jena, Germany
| |
Collapse
|
3
|
Senthamizhan V, Ravindran B, Raman K. NetGenes: A Database of Essential Genes Predicted Using Features From Interaction Networks. Front Genet 2021; 12:722198. [PMID: 34630517 PMCID: PMC8495214 DOI: 10.3389/fgene.2021.722198] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/08/2021] [Accepted: 08/16/2021] [Indexed: 11/26/2022] Open
Abstract
Essential gene prediction models built so far are heavily reliant on sequence-based features, and the scope of network-based features has been narrow. Previous work from our group demonstrated the importance of using network-based features for predicting essential genes with high accuracy. Here, we apply our approach for the prediction of essential genes to organisms from the STRING database and host the results in a standalone website. Our database, NetGenes, contains essential gene predictions for 2,700+ bacteria predicted using features derived from STRING protein-protein functional association networks. Housing a total of over 2.1 million genes, NetGenes offers various features like essentiality scores, annotations, and feature vectors for each gene. NetGenes database is available from https://rbc-dsai-iitm.github.io/NetGenes/.
Collapse
Affiliation(s)
- Vimaladhasan Senthamizhan
- Centre for Integrative Biology and Systems mEdicine (IBSE), Indian Institute of Technology (IIT) Madras, Chennai, India
- Robert Bosch Center for Data Science and Artificial Intelligence (RBCDSAI), IIT Madras, Chennai, India
| | - Balaraman Ravindran
- Centre for Integrative Biology and Systems mEdicine (IBSE), Indian Institute of Technology (IIT) Madras, Chennai, India
- Robert Bosch Center for Data Science and Artificial Intelligence (RBCDSAI), IIT Madras, Chennai, India
- Department of Computer Science and Engineering, IIT Madras, Chennai, India
| | - Karthik Raman
- Centre for Integrative Biology and Systems mEdicine (IBSE), Indian Institute of Technology (IIT) Madras, Chennai, India
- Robert Bosch Center for Data Science and Artificial Intelligence (RBCDSAI), IIT Madras, Chennai, India
- Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, IIT Madras, Chennai, India
| |
Collapse
|
4
|
DELEAT: gene essentiality prediction and deletion design for bacterial genome reduction. BMC Bioinformatics 2021; 22:444. [PMID: 34537011 PMCID: PMC8449488 DOI: 10.1186/s12859-021-04348-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/12/2021] [Accepted: 08/26/2021] [Indexed: 11/10/2022] Open
Abstract
Background The study of gene essentiality is fundamental to understand the basic principles of life, as well as for applications in many fields. In recent decades, dozens of sets of essential genes have been determined using different experimental and bioinformatics approaches, and this information has been useful for genome reduction of model organisms. Multiple in silico strategies have been developed to predict gene essentiality, but no optimal algorithm or set of gene features has been found yet, especially for non-model organisms with incomplete functional annotation. Results We have developed DELEAT v0.1 (DELetion design by Essentiality Analysis Tool), an easy-to-use bioinformatic tool which integrates an in silico gene essentiality classifier in a pipeline allowing automatic design of large-scale deletions in any bacterial genome. The essentiality classifier consists of a novel logistic regression model based on only six gene features which are not dependent on experimental data or functional annotation. As a proof of concept, we have applied this pipeline to the determination of dispensable regions in the genome of Bartonella quintana str. Toulouse. In this already reduced genome, 35 possible deletions have been delimited, spanning 29% of the genome. Conclusions Built on in silico gene essentiality predictions, we have developed an analysis pipeline which assists researchers throughout multiple stages of bacterial genome reduction projects, and created a novel classifier which is simple, fast, and universally applicable to any bacterial organism with a GenBank annotation file. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04348-5.
Collapse
|
5
|
Campos TL, Korhonen PK, Hofmann A, Gasser RB, Young ND. Harnessing model organism genomics to underpin the machine learning-based prediction of essential genes in eukaryotes - Biotechnological implications. Biotechnol Adv 2021; 54:107822. [PMID: 34461202 DOI: 10.1016/j.biotechadv.2021.107822] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/09/2021] [Revised: 08/17/2021] [Accepted: 08/24/2021] [Indexed: 12/17/2022]
Abstract
The availability of high-quality genomes and advances in functional genomics have enabled large-scale studies of essential genes in model eukaryotes, including the 'elegant worm' (Caenorhabditis elegans; Nematoda) and the 'vinegar fly' (Drosophila melanogaster; Arthropoda). However, this is not the case for other, much less-studied organisms, such as socioeconomically important parasites, for which functional genomic platforms usually do not exist. Thus, there is a need to develop innovative techniques or approaches for the prediction, identification and investigation of essential genes. A key approach that could enable the prediction of such genes is machine learning (ML). Here, we undertake an historical review of experimental and computational approaches employed for the characterisation of essential genes in eukaryotes, with a particular focus on model ecdysozoans (C. elegans and D. melanogaster), and discuss the possible applicability of ML-approaches to organisms such as socioeconomically important parasites. We highlight some recent results showing that high-performance ML, combined with feature engineering, allows a reliable prediction of essential genes from extensive, publicly available 'omic data sets, with major potential to prioritise such genes (with statistical confidence) for subsequent functional genomic validation. These findings could 'open the door' to fundamental and applied research areas. Evidence of some commonality in the essential gene-complement between these two organisms indicates that an ML-engineering approach could find broader applicability to ecdysozoans such as parasitic nematodes or arthropods, provided that suitably large and informative data sets become/are available for proper feature engineering, and for the robust training and validation of algorithms. This area warrants detailed exploration to, for example, facilitate the identification and characterisation of essential molecules as novel targets for drugs and vaccines against parasitic diseases. This focus is particularly important, given the substantial impact that such diseases have worldwide, and the current challenges associated with their prevention and control and with drug resistance in parasite populations.
Collapse
Affiliation(s)
- Tulio L Campos
- Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia; Bioinformatics Core Facility, Instituto Aggeu Magalhães, Fundação Oswaldo Cruz (IAM-Fiocruz), Recife, Pernambuco, Brazil
| | - Pasi K Korhonen
- Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia
| | - Andreas Hofmann
- Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia
| | - Robin B Gasser
- Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia.
| | - Neil D Young
- Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia.
| |
Collapse
|
6
|
Le NQK, Do DT, Hung TNK, Lam LHT, Huynh TT, Nguyen NTK. A Computational Framework Based on Ensemble Deep Neural Networks for Essential Genes Identification. Int J Mol Sci 2020; 21:E9070. [PMID: 33260643 PMCID: PMC7730808 DOI: 10.3390/ijms21239070] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/04/2020] [Revised: 11/25/2020] [Accepted: 11/26/2020] [Indexed: 01/13/2023] Open
Abstract
Essential genes contain key information of genomes that could be the key to a comprehensive understanding of life and evolution. Because of their importance, studies of essential genes have been considered a crucial problem in computational biology. Computational methods for identifying essential genes have become increasingly popular to reduce the cost and time-consumption of traditional experiments. A few models have addressed this problem, but performance is still not satisfactory because of high dimensional features and the use of traditional machine learning algorithms. Thus, there is a need to create a novel model to improve the predictive performance of this problem from DNA sequence features. This study took advantage of a natural language processing (NLP) model in learning biological sequences by treating them as natural language words. To learn the NLP features, a supervised learning model was consequentially employed by an ensemble deep neural network. Our proposed method could identify essential genes with sensitivity, specificity, accuracy, Matthews correlation coefficient (MCC), and area under the receiver operating characteristic curve (AUC) values of 60.2%, 84.6%, 76.3%, 0.449, and 0.814, respectively. The overall performance outperformed the single models without ensemble, as well as the state-of-the-art predictors on the same benchmark dataset. This indicated the effectiveness of the proposed method in determining essential genes, in particular, and other sequencing problems, in general.
Collapse
Affiliation(s)
- Nguyen Quoc Khanh Le
- Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei 106, Taiwan
- Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei 106, Taiwan
- Translational Imaging Research Center, Taipei Medical University Hospital, Taipei 110, Taiwan
| | - Duyen Thi Do
- Graduate Institute of Biomedical Informatics, Taipei Medical University, Taipei 106, Taiwan;
| | - Truong Nguyen Khanh Hung
- International Master/Ph.D. Program in Medicine, College of Medicine, Taipei Medical University, Taipei 110, Taiwan; (T.N.K.H.); (L.H.T.L.)
- Department of Orthopedic and Trauma, Cho Ray Hospital, Ho Chi Minh 70000, Vietnam
| | - Luu Ho Thanh Lam
- International Master/Ph.D. Program in Medicine, College of Medicine, Taipei Medical University, Taipei 110, Taiwan; (T.N.K.H.); (L.H.T.L.)
- Intensive Care Unit, Children’s Hospital 2, Ho Chi Minh 70000, Vietnam
| | - Tuan-Tu Huynh
- Department of Electrical Engineering, Yuan Ze University, Taoyuan 320, Taiwan;
- Department of Electrical Electronic and Mechanical Engineering, Lac Hong University, Dong Nai 76120, Vietnam
| | - Ngan Thi Kim Nguyen
- School of Nutrition and Health Sciences, Taipei Medical University, Taipei 110, Taiwan;
| |
Collapse
|
7
|
Abstract
BACKGROUND Essential genes are those genes that are critical for the survival of an organism. The prediction of essential genes in bacteria can provide targets for the design of novel antibiotic compounds or antimicrobial strategies. RESULTS We propose a deep neural network for predicting essential genes in microbes. Our architecture called DEEPLYESSENTIAL makes minimal assumptions about the input data (i.e., it only uses gene primary sequence and the corresponding protein sequence) to carry out the prediction thus maximizing its practical application compared to existing predictors that require structural or topological features which might not be readily available. We also expose and study a hidden performance bias that effected previous classifiers. Extensive results show that DEEPLYESSENTIAL outperform existing classifiers that either employ down-sampling to balance the training set or use clustering to exclude multiple copies of orthologous genes. CONCLUSION Deep neural network architectures can efficiently predict whether a microbial gene is essential (or not) using only its sequence information.
Collapse
Affiliation(s)
- Md Abid Hasan
- Department of Computer Science and Engineering, University of California Riverside, 900 University Ave, Riverside, 92507 CA USA
| | - Stefano Lonardi
- Department of Computer Science and Engineering, University of California Riverside, 900 University Ave, Riverside, 92507 CA USA
| |
Collapse
|
8
|
Xu L, Guo Z, Liu X. Prediction of essential genes in prokaryote based on artificial neural network. Genes Genomics 2019; 42:97-106. [DOI: 10.1007/s13258-019-00884-w] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/25/2019] [Accepted: 10/30/2019] [Indexed: 12/12/2022]
|
9
|
Azhagesan K, Ravindran B, Raman K. Network-based features enable prediction of essential genes across diverse organisms. PLoS One 2018; 13:e0208722. [PMID: 30543651 PMCID: PMC6292609 DOI: 10.1371/journal.pone.0208722] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2018] [Accepted: 11/21/2018] [Indexed: 12/19/2022] Open
Abstract
Machine learning approaches to predict essential genes have gained a lot of traction in recent years. These approaches predominantly make use of sequence and network-based features to predict essential genes. However, the scope of network-based features used by the existing approaches is very narrow. Further, many of these studies focus on predicting essential genes within the same organism, which cannot be readily used to predict essential genes across organisms. Therefore, there is clearly a need for a method that is able to predict essential genes across organisms, by leveraging network-based features. In this study, we extract several sets of network-based features from protein-protein association networks available from the STRING database. Our network features include some common measures of centrality, and also some novel recursive measures recently proposed in social network literature. We extract hundreds of network-based features from networks of 27 diverse organisms to predict the essentiality of 87000+ genes. Our results show that network-based features are statistically significantly better at classifying essential genes across diverse bacterial species, compared to the current state-of-the-art methods, which use mostly sequence and a few 'conventional' network-based features. Our diverse set of network properties gave an AUROC of 0.847 and a precision of 0.320 across 27 organisms. When we augmented the complete set of network features with sequence-derived features, we achieved an improved AUROC of 0.857 and a precision of 0.335. We also constructed a reduced set of 100 sequence and network features, which gave a comparable performance. Further, we show that our features are useful for predicting essential genes in new organisms by using leave-one-species-out validation. Our network features capture the local, global and neighbourhood properties of the network and are hence effective for prediction of essential genes across diverse organisms, even in the absence of other complex biological knowledge. Our approach can be readily exploited to predict essentiality for organisms in interactome databases such as the STRING, where both network and sequence are readily available. All codes are available at https://github.com/RamanLab/nbfpeg.
Collapse
Affiliation(s)
- Karthik Azhagesan
- Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology (IIT) Madras, Chennai – 600 036, India
- Initiative for Biological Systems Engineering (IBSE), IIT Madras, Chennai – 600 036, India
- Robert Bosch Centre for Data Science and Artificial Intelligence (RBCDSAI), IIT Madras, Chennai – 600 036, India
| | - Balaraman Ravindran
- Department of Computer Science and Engineering, IIT Madras, Chennai – 600 036, India
- Initiative for Biological Systems Engineering (IBSE), IIT Madras, Chennai – 600 036, India
- Robert Bosch Centre for Data Science and Artificial Intelligence (RBCDSAI), IIT Madras, Chennai – 600 036, India
- * E-mail: (BR); (KR)
| | - Karthik Raman
- Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology (IIT) Madras, Chennai – 600 036, India
- Initiative for Biological Systems Engineering (IBSE), IIT Madras, Chennai – 600 036, India
- Robert Bosch Centre for Data Science and Artificial Intelligence (RBCDSAI), IIT Madras, Chennai – 600 036, India
- * E-mail: (BR); (KR)
| |
Collapse
|
10
|
Dong C, Jin YT, Hua HL, Wen QF, Luo S, Zheng WX, Guo FB. Comprehensive review of the identification of essential genes using computational methods: focusing on feature implementation and assessment. Brief Bioinform 2018; 21:171-181. [PMID: 30496347 DOI: 10.1093/bib/bby116] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/13/2018] [Revised: 11/01/2018] [Accepted: 11/02/2018] [Indexed: 02/06/2023] Open
Abstract
Essential genes have attracted increasing attention in recent years due to the important functions of these genes in organisms. Among the methods used to identify the essential genes, accurate and efficient computational methods can make up for the deficiencies of expensive and time-consuming experimental technologies. In this review, we have collected researches on essential gene predictions in prokaryotes and eukaryotes and summarized the five predominant types of features used in these studies. The five types of features include evolutionary conservation, domain information, network topology, sequence component and expression level. We have described how to implement the useful forms of these features and evaluated their performance based on the data of Escherichia coli MG1655, Bacillus subtilis 168 and human. The prerequisite and applicable range of these features is described. In addition, we have investigated the techniques used to weight features in various models. To facilitate researchers in the field, two available online tools, which are accessible for free and can be directly used to predict gene essentiality in prokaryotes and humans, were referred. This article provides a simple guide for the identification of essential genes in prokaryotes and eukaryotes.
Collapse
Affiliation(s)
- Chuan Dong
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
| | - Yan-Ting Jin
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
| | - Hong-Li Hua
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
| | - Qing-Feng Wen
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
| | - Sen Luo
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
| | - Wen-Xin Zheng
- School of Biomedical Engineering, Capital Medical University, Beijing, China
| | - Feng-Biao Guo
- School of Life Science and Technology, Center for Informational Biology, Intelligent Learning Institute for Science and Application, University of Electronic Science and Technology of China, Chengdu, China
| |
Collapse
|
11
|
Peng C, Lin Y, Luo H, Gao F. A Comprehensive Overview of Online Resources to Identify and Predict Bacterial Essential Genes. Front Microbiol 2017; 8:2331. [PMID: 29230204 PMCID: PMC5711816 DOI: 10.3389/fmicb.2017.02331] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/20/2017] [Accepted: 11/13/2017] [Indexed: 12/15/2022] Open
Abstract
Genes critical for the survival or reproduction of an organism in certain circumstances are classified as essential genes. Essential genes play a significant role in deciphering the survival mechanism of life. They may be greatly applied to pharmaceutics and synthetic biology. The continuous progress of experimental method for essential gene identification has accelerated the accumulation of gene essentiality data which facilitates the study of essential genes in silico. In this article, we present some available online resources related to gene essentiality, including bioinformatic software tools for transposon sequencing (Tn-seq) analysis, essential gene databases and online services to predict bacterial essential genes. We review several computational approaches that have been used to predict essential genes, and summarize the features used for gene essentiality prediction. In addition, we evaluate the available online bacterial essential gene prediction servers based on the experimentally validated essential gene sets of 30 bacteria from DEG. This article is intended to be a quick reference guide for the microbiologists interested in the essential genes.
Collapse
Affiliation(s)
- Chong Peng
- Department of Physics, School of Science, Tianjin University, Tianjin, China
| | - Yan Lin
- Department of Physics, School of Science, Tianjin University, Tianjin, China
| | - Hao Luo
- Department of Physics, School of Science, Tianjin University, Tianjin, China
| | - Feng Gao
- Department of Physics, School of Science, Tianjin University, Tianjin, China
- Key Laboratory of Systems Bioengineering (Ministry of Education), Tianjin University, Tianjin, China
- SynBio Research Platform, Collaborative Innovation Center of Chemical Science and Engineering (Tianjin), Tianjin University, Tianjin, China
| |
Collapse
|
12
|
Khan S, Somvanshi P, Bhardwaj T, Mandal RK, Dar SA, Wahid M, Jawed A, Lohani M, Khan M, Areeshi MY, Haque S. Aspartate‐β‐semialdeyhyde dehydrogenase as a potential therapeutic target of
Mycobacterium tuberculosis
H37Rv: Evidence from in silico elementary mode analysis of biological network model. J Cell Biochem 2017; 119:2832-2842. [DOI: 10.1002/jcb.26458] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2017] [Accepted: 10/24/2017] [Indexed: 02/06/2023]
Affiliation(s)
- Saif Khan
- Department of Clinical NutritionCollege of Applied Medical Sciences University of Ha'ilHa'ilSaudi Arabia
| | | | | | - Raju K. Mandal
- Research and Scientific Studies UnitCollege of Nursing and Allied Health SciencesJazan UniversityJazanSaudi Arabia
| | - Sajad A. Dar
- Research and Scientific Studies UnitCollege of Nursing and Allied Health SciencesJazan UniversityJazanSaudi Arabia
| | - Mohd Wahid
- Research and Scientific Studies UnitCollege of Nursing and Allied Health SciencesJazan UniversityJazanSaudi Arabia
| | - Arshad Jawed
- Research and Scientific Studies UnitCollege of Nursing and Allied Health SciencesJazan UniversityJazanSaudi Arabia
| | - Mohtashim Lohani
- Research and Scientific Studies UnitCollege of Nursing and Allied Health SciencesJazan UniversityJazanSaudi Arabia
| | - Mahvish Khan
- Department of Clinical NutritionCollege of Applied Medical Sciences University of Ha'ilHa'ilSaudi Arabia
| | - Mohammed Y. Areeshi
- Research and Scientific Studies UnitCollege of Nursing and Allied Health SciencesJazan UniversityJazanSaudi Arabia
| | - Shafiul Haque
- Research and Scientific Studies UnitCollege of Nursing and Allied Health SciencesJazan UniversityJazanSaudi Arabia
| |
Collapse
|
13
|
Nigatu D, Sobetzko P, Yousef M, Henkel W. Sequence-based information-theoretic features for gene essentiality prediction. BMC Bioinformatics 2017; 18:473. [PMID: 29121868 PMCID: PMC5679510 DOI: 10.1186/s12859-017-1884-5] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/22/2017] [Accepted: 10/26/2017] [Indexed: 11/10/2022] Open
Abstract
Background Identification of essential genes is not only useful for our understanding of the minimal gene set required for cellular life but also aids the identification of novel drug targets in pathogens. In this work, we present a simple and effective gene essentiality prediction method using information-theoretic features that are derived exclusively from the gene sequences. Results We developed a Random Forest classifier and performed an extensive model performance evaluation among and within 15 selected bacteria. In intra-organism predictions, where training and testing sets are taken from the same organism, AUC (Area Under the Curve) scores ranging from 0.73 to 0.90, 0.84 on average, were obtained. Cross-organism predictions using 5-fold cross-validation, pairwise, leave-one-species-out, leave-one-taxon-out, and cross-taxon yielded average AUC scores of 0.88, 0.75, 0.80, 0.82, and 0.78, respectively. To further show the applicability of our method in other domains of life, we predicted the essential genes of the yeast Schizosaccharomyces pombe and obtained a similar accuracy (AUC 0.84). Conclusions The proposed method enables a simple and reliable identification of essential genes without searching in databases for orthologs and demanding further experimental data such as network topology and gene-expression. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1884-5) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Dawit Nigatu
- Transmission Systems Group, Jacobs University Bremen, Campus Ring 1, Bremen, D-28759, Germany.
| | - Patrick Sobetzko
- Philipps-Universität Marburg, LOEWE-Zentrum für Synthetische Mikrobiologie, Hans-Meerwein-Straße, Mehrzweckgebäude, Marburg, 35043, Germany
| | - Malik Yousef
- Community Information Systems, Zefat Academic College, Zefat, 13206, Israel
| | - Werner Henkel
- Transmission Systems Group, Jacobs University Bremen, Campus Ring 1, Bremen, D-28759, Germany
| |
Collapse
|
14
|
Liu X, Wang BJ, Xu L, Tang HL, Xu GQ. Selection of key sequence-based features for prediction of essential genes in 31 diverse bacterial species. PLoS One 2017; 12:e0174638. [PMID: 28358836 PMCID: PMC5373589 DOI: 10.1371/journal.pone.0174638] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/20/2016] [Accepted: 03/12/2017] [Indexed: 01/02/2023] Open
Abstract
Genes that are indispensable for survival are essential genes. Many features have been proposed for computational prediction of essential genes. In this paper, the least absolute shrinkage and selection operator method was used to screen key sequence-based features related to gene essentiality. To assess the effects, the selected features were used to predict the essential genes from 31 bacterial species based on a support vector machine classifier. For all 31 bacterial objects (21 Gram-negative objects and ten Gram-positive objects), the features in the three datasets were reduced from 57, 59, and 58, to 40, 37, and 38, respectively, without loss of prediction accuracy. Results showed that some features were redundant for gene essentiality, so could be eliminated from future analyses. The selected features contained more complex (or key) biological information for gene essentiality, and could be of use in related research projects, such as gene prediction, synthetic biology, and drug design.
Collapse
Affiliation(s)
- Xiao Liu
- College of Communication Engineering, Chongqing University, Chongqing, China
- Key Laboratory of Chongqing for Bio-perception and Intelligent Information Processing, Chongqing, China
- * E-mail:
| | - Bao-Jin Wang
- College of Communication Engineering, Chongqing University, Chongqing, China
| | - Luo Xu
- College of Communication Engineering, Chongqing University, Chongqing, China
| | | | - Guo-Qing Xu
- College of Communication Engineering, Chongqing University, Chongqing, China
| |
Collapse
|
15
|
Nandi S, Subramanian A, Sarkar RR. An integrative machine learning strategy for improved prediction of essential genes in Escherichia coli metabolism using flux-coupled features. MOLECULAR BIOSYSTEMS 2017; 13:1584-1596. [DOI: 10.1039/c7mb00234c] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/24/2023]
Abstract
We propose an integrated machine learning process to predict gene essentiality in Escherichia coli K-12 MG1655 metabolism that outperforms known methods.
Collapse
Affiliation(s)
- Sutanu Nandi
- Chemical Engineering and Process Development
- CSIR-National Chemical Laboratory
- Pune-411008
- India
- Academy of Scientific & Innovative Research (AcSIR)
| | - Abhishek Subramanian
- Chemical Engineering and Process Development
- CSIR-National Chemical Laboratory
- Pune-411008
- India
- Academy of Scientific & Innovative Research (AcSIR)
| | - Ram Rup Sarkar
- Chemical Engineering and Process Development
- CSIR-National Chemical Laboratory
- Pune-411008
- India
- Academy of Scientific & Innovative Research (AcSIR)
| |
Collapse
|
16
|
An Approach for Predicting Essential Genes Using Multiple Homology Mapping and Machine Learning Algorithms. BIOMED RESEARCH INTERNATIONAL 2016; 2016:7639397. [PMID: 27660763 PMCID: PMC5021884 DOI: 10.1155/2016/7639397] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/30/2016] [Revised: 07/25/2016] [Accepted: 08/04/2016] [Indexed: 11/17/2022]
Abstract
Investigation of essential genes is significant to comprehend the minimal gene sets of cell and discover potential drug targets. In this study, a novel approach based on multiple homology mapping and machine learning method was introduced to predict essential genes. We focused on 25 bacteria which have characterized essential genes. The predictions yielded the highest area under receiver operating characteristic (ROC) curve (AUC) of 0.9716 through tenfold cross-validation test. Proper features were utilized to construct models to make predictions in distantly related bacteria. The accuracy of predictions was evaluated via the consistency of predictions and known essential genes of target species. The highest AUC of 0.9552 and average AUC of 0.8314 were achieved when making predictions across organisms. An independent dataset from Synechococcus elongatus, which was released recently, was obtained for further assessment of the performance of our model. The AUC score of predictions is 0.7855, which is higher than other methods. This research presents that features obtained by homology mapping uniquely can achieve quite great or even better results than those integrated features. Meanwhile, the work indicates that machine learning-based method can assign more efficient weight coefficients than using empirical formula based on biological knowledge.
Collapse
|