1
Hasibi R, Michoel T, Oyarzún DA. Integration of graph neural networks and genome-scale metabolic models for predicting gene essentiality. NPJ Syst Biol Appl 2024; 10:24. [PMID: 38448436 PMCID: PMC10917767 DOI: 10.1038/s41540-024-00348-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2023] [Accepted: 02/08/2024] [Indexed: 03/08/2024]
Abstract
Genome-scale metabolic models are powerful tools for understanding cellular physiology. Flux balance analysis (FBA), in particular, is an optimization-based approach widely employed for predicting metabolic phenotypes. In model microbes such as Escherichia coli, FBA has been successful at predicting essential genes, i.e. those genes that impair survival when deleted. A central assumption in this approach is that both wild type and deletion strains optimize the same fitness objective. Although the optimality assumption may hold for the wild type metabolic network, deletion strains are not subject to the same evolutionary pressures and knock-out mutants may steer their metabolism to meet other objectives for survival. Here, we present FlowGAT, a hybrid FBA-machine learning strategy for predicting essentiality directly from wild type metabolic phenotypes. The approach is based on graph-structured representation of metabolic fluxes predicted by FBA, where nodes correspond to enzymatic reactions and edges quantify the propagation of metabolite mass flow between a reaction and its neighbours. We integrate this information into a graph neural network that can be trained on knock-out fitness assay data. Comparisons across different model architectures reveal that FlowGAT predictions for E. coli are close to those of FBA for several growth conditions. This suggests that essentiality of enzymatic genes can be predicted by exploiting the inherent network structure of metabolism. Our approach demonstrates the benefits of combining the mechanistic insights afforded by genome-scale models with the ability of deep learning to infer patterns from complex datasets.
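The graph representation described in this abstract (reactions as nodes, edges weighted by metabolite mass flow) can be illustrated with a minimal sketch. It assumes a stoichiometric matrix `S` (metabolites by reactions) and a non-negative FBA flux vector `v`, and it splits each metabolite's produced mass among its consumers in proportion to their consumption. The function name `mass_flow_graph`, the toy pathway, and this particular weighting are illustrative assumptions and may differ from the construction used by the authors.

```python
def mass_flow_graph(S, v):
    """Build a directed reaction graph weighted by metabolite mass flow.

    S: stoichiometric matrix as a list of rows (one row per metabolite,
       one column per reaction); positive entries mean production.
    v: flux of each reaction, assumed non-negative (irreversible reactions).
    Returns a dict mapping (producer, consumer) reaction indices to flow.
    """
    n_rxn = len(v)
    edges = {}
    for row in S:
        # Mass of this metabolite produced / consumed by each reaction.
        produced = [max(row[i], 0.0) * v[i] for i in range(n_rxn)]
        consumed = [max(-row[j], 0.0) * v[j] for j in range(n_rxn)]
        total = sum(produced)
        if total == 0:
            continue  # metabolite carries no flow in this flux state
        for i in range(n_rxn):
            if produced[i] == 0:
                continue
            for j in range(n_rxn):
                if consumed[j] == 0:
                    continue
                # Share of reaction i's output that reaches reaction j.
                flow = produced[i] * consumed[j] / total
                edges[(i, j)] = edges.get((i, j), 0.0) + flow
    return edges


# Toy linear pathway (hypothetical): R0: -> A, R1: A -> B, R2: B ->
S = [
    [1, -1, 0],   # metabolite A
    [0, 1, -1],   # metabolite B
]
v = [2.0, 2.0, 2.0]
edges = mass_flow_graph(S, v)
```

In a hybrid pipeline of the kind the abstract describes, such a weighted edge list would be the input graph for a graph neural network, with knock-out fitness labels attached to the reaction nodes.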
Affiliation(s)
- Ramin Hasibi
  - Computational Biology Unit, Department of Informatics, University of Bergen, Bergen, Norway
- Tom Michoel
  - Computational Biology Unit, Department of Informatics, University of Bergen, Bergen, Norway
- Diego A Oyarzún
  - School of Biological Sciences, University of Edinburgh, Edinburgh, UK
  - School of Informatics, University of Edinburgh, Edinburgh, UK
2
Liang Y, Luo H, Lin Y, Gao F. Recent advances in the characterization of essential genes and development of a database of essential genes. IMETA 2024; 3:e157. [PMID: 38868518 PMCID: PMC10989110 DOI: 10.1002/imt2.157] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/29/2023] [Accepted: 10/09/2023] [Indexed: 06/14/2024]
Abstract
Over the past few decades, there has been a significant interest in the study of essential genes, which are crucial for the survival of an organism under specific environmental conditions and thus have practical applications in the fields of synthetic biology and medicine. An increasing amount of experimental data on essential genes has been obtained with the continuous development of technological methods. Meanwhile, various computational prediction methods, related databases and web servers have emerged accordingly. To facilitate the study of essential genes, we have established a database of essential genes (DEG), which has become popular with continuous updates to facilitate essential gene feature analysis and prediction, drug and vaccine development, as well as artificial genome design and construction. In this article, we summarized the studies of essential genes, overviewed the relevant databases, and discussed their practical applications. Furthermore, we provided an overview of the main applications of DEG and conducted comprehensive analyses based on its latest version. However, it should be noted that the essential gene is a dynamic concept instead of a binary one, which presents both opportunities and challenges for their future development.
Affiliation(s)
- Hao Luo
  - Department of Physics, Tianjin University, Tianjin, China
- Yan Lin
  - Department of Physics, Tianjin University, Tianjin, China
- Feng Gao
  - Department of Physics, Tianjin University, Tianjin, China
  - Frontiers Science Center for Synthetic Biology and Key Laboratory of Systems Bioengineering (Ministry of Education), Tianjin University, Tianjin, China
  - SynBio Research Platform, Collaborative Innovation Center of Chemical Science and Engineering (Tianjin), Tianjin, China
3
Hu W, Li M, Xiao H, Guan L. Essential genes identification model based on sequence feature map and graph convolutional neural network. BMC Genomics 2024; 25:47. [PMID: 38200437 PMCID: PMC10777564 DOI: 10.1186/s12864-024-09958-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2023] [Accepted: 01/01/2024] [Indexed: 01/12/2024]
Abstract
BACKGROUND: Essential genes encode functions that play a vital role in the life activities of organisms, encompassing growth, development, immune system functioning, and cell structure maintenance. Conventional experimental techniques for identifying essential genes are resource-intensive and time-consuming, and the accuracy of current machine learning models needs further enhancement. Therefore, it is crucial to develop a robust computational model to accurately predict essential genes. RESULTS: In this study, we introduce GCNN-SFM, a computational model for identifying essential genes in organisms, based on graph convolutional neural networks (GCNN). GCNN-SFM integrates a graph convolutional layer, a convolutional layer, and a fully connected layer to model and extract features from gene sequences of essential genes. Initially, the gene sequence is transformed into a feature map using coding techniques. Subsequently, a multi-layer GCN is employed to perform graph convolution operations, effectively capturing both local and global features of the gene sequence. Further feature extraction is performed, followed by integrating convolution and fully-connected layers to generate prediction results for essential genes. The gradient descent algorithm is utilized to iteratively update the cross-entropy loss function, thereby enhancing the accuracy of the prediction results. Meanwhile, model parameters are tuned to determine the optimal parameter combination that yields the best prediction performance during training. CONCLUSIONS: Experimental evaluation demonstrates that GCNN-SFM surpasses various advanced essential gene prediction models and achieves an average accuracy of 94.53%. This study presents a novel and effective approach for identifying essential genes, which has significant implications for biology and genomics research.
Affiliation(s)
- Wenxing Hu
  - College of Physics and Electronic Information, Gannan Normal University, Ganzhou, Jiangxi, 341000, China
- Mengshan Li
  - College of Physics and Electronic Information, Gannan Normal University, Ganzhou, Jiangxi, 341000, China
- Haiyang Xiao
  - College of Physics and Electronic Information, Gannan Normal University, Ganzhou, Jiangxi, 341000, China
- Lixin Guan
  - College of Physics and Electronic Information, Gannan Normal University, Ganzhou, Jiangxi, 341000, China
4
Aromolaran OT, Isewon I, Adedeji E, Oswald M, Adebiyi E, Koenig R, Oyelade J. Heuristic-enabled active machine learning: A case study of predicting essential developmental stage and immune response genes in Drosophila melanogaster. PLoS One 2023; 18:e0288023. [PMID: 37556452 PMCID: PMC10411809 DOI: 10.1371/journal.pone.0288023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/03/2023] [Accepted: 06/18/2023] [Indexed: 08/11/2023]
Abstract
Computational prediction of absolute essential genes using machine learning has gained wide attention in recent years. However, essential genes are mostly conditional and not absolute. Experimental techniques provide a reliable approach for identifying conditionally essential genes; however, experimental methods are laborious and time- and resource-consuming, hence computational techniques have been used to complement the experimental methods. Computational techniques such as supervised machine learning or flux balance analysis are grossly limited due to the unavailability of the data required for training the model or simulating the conditions for gene essentiality. This study developed a heuristic-enabled active machine learning method based on a light gradient boosting model to predict essential immune response and embryonic developmental genes in Drosophila melanogaster. We proposed a new sampling selection technique and introduced a heuristic function which replaces the human component in traditional active learning models. The heuristic function dynamically selects the unlabelled samples to improve the performance of the classifier in the next iteration. When tested on four benchmark datasets, the proposed model showed superior performance compared to traditional active learning models (random sampling and uncertainty sampling). Applying the model to identify conditionally essential genes, four novel essential immune response genes and a list of 48 novel genes that are essential in the embryonic developmental condition were identified. We performed functional enrichment analysis of the predicted genes to elucidate their biological processes, and the results support our predictions. Immune response and embryonic development related processes were significantly enriched in the essential immune response and embryonic developmental genes, respectively. Finally, we propose the predicted essential genes for future experimental studies and use of the developed tool accessible at http://heal.covenantuniversity.edu.ng for conditional essentiality predictions.
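For context, the uncertainty-sampling baseline mentioned in this abstract can be sketched in a few lines. This is the traditional baseline only, not the heuristic function proposed by the authors, whose details are not given here. The function name `uncertainty_sampling` and the example probabilities are hypothetical.

```python
def uncertainty_sampling(probabilities, k):
    """Pick the k unlabelled samples the classifier is least sure about.

    probabilities: predicted probability of the positive class
                   (e.g. "essential") for each unlabelled sample.
    Returns the indices of the k samples closest to 0.5.
    """
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return ranked[:k]


# Hypothetical classifier outputs for four unlabelled genes.
selected = uncertainty_sampling([0.9, 0.55, 0.1, 0.48], k=2)
```

In an active learning loop, the selected samples would be labelled, added to the training set, and the classifier retrained before the next round of selection.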
Affiliation(s)
- Olufemi Tony Aromolaran
  - Department of Computer & Information Sciences, Covenant University, Ota, Ogun State, Nigeria
  - Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun State, Nigeria
- Itunu Isewon
  - Department of Computer & Information Sciences, Covenant University, Ota, Ogun State, Nigeria
  - Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun State, Nigeria
- Eunice Adedeji
  - Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun State, Nigeria
  - Department of Biochemistry, Covenant University, Ota, Ogun State, Nigeria
- Marcus Oswald
  - Integrated Research and Treatment Center, Center for Sepsis Control and Care (CSCC), Jena University Hospital, Am Klinikum, Jena, Germany
  - Institute of Infectious Diseases and Infection Control, Jena University Hospital, Am Klinikum, Jena, Germany
- Ezekiel Adebiyi
  - Department of Computer & Information Sciences, Covenant University, Ota, Ogun State, Nigeria
  - Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun State, Nigeria
- Rainer Koenig
  - Integrated Research and Treatment Center, Center for Sepsis Control and Care (CSCC), Jena University Hospital, Am Klinikum, Jena, Germany
  - Institute of Infectious Diseases and Infection Control, Jena University Hospital, Am Klinikum, Jena, Germany
- Jelili Oyelade
  - Department of Computer & Information Sciences, Covenant University, Ota, Ogun State, Nigeria
  - Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun State, Nigeria
5
Rout RK, Umer S, Khandelwal M, Pati S, Mallik S, Balabantaray BK, Qin H. Identification of discriminant features from stationary pattern of nucleotide bases and their application to essential gene classification. Front Genet 2023; 14:1154120. [PMID: 37152988 PMCID: PMC10156977 DOI: 10.3389/fgene.2023.1154120] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 01/30/2023] [Accepted: 04/04/2023] [Indexed: 05/09/2023]
Abstract
Introduction: Essential genes are essential for the survival of various species. These genes are a family linked to critical cellular activities for species survival. These genes are coded for proteins that regulate central metabolism, gene translation, deoxyribonucleic acid replication, and fundamental cellular structure and facilitate intracellular and extracellular transport. Essential genes preserve crucial genomics information that may hold the key to a detailed knowledge of life and evolution. Essential gene studies have long been regarded as a vital topic in computational biology due to their relevance. An essential gene is composed of adenine, guanine, cytosine, and thymine and its various combinations. Methods: This paper presents a novel method of extracting information on the stationary patterns of nucleotides such as adenine, guanine, cytosine, and thymine in each gene. For this purpose, some co-occurrence matrices are derived that provide the statistical distribution of stationary patterns of nucleotides in the genes, which is helpful in establishing the relationship between the nucleotides. For extracting discriminant features from each co-occurrence matrix, energy, entropy, homogeneity, contrast, and dissimilarity features are computed, which are extracted from all co-occurrence matrices and then concatenated to form a feature vector representing each essential gene. Finally, supervised machine learning algorithms are applied for essential gene classification based on the extracted fixed-dimensional feature vectors. Results: For comparison, some existing state-of-the-art feature representation techniques such as Shannon entropy (SE), Hurst exponent (HE), fractal dimension (FD), and their combinations have been utilized. Discussion: An extensive experiment has been performed for classifying the essential genes of five species that show the robustness and effectiveness of the proposed methodology.
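The Methods part of this abstract (co-occurrence matrices of nucleotides, then energy, entropy, homogeneity, contrast, and dissimilarity) can be illustrated with a short sketch. It assumes a single co-occurrence matrix built from nucleotide pairs at a fixed offset, an arbitrary ordinal coding A=0, C=1, G=2, T=3, and the usual texture-feature definitions borrowed from image analysis. The names `cooccurrence` and `texture_features` are hypothetical, and the authors' exact matrices and definitions may differ.

```python
import math

NUCLEOTIDES = "ACGT"  # ordinal coding A=0, C=1, G=2, T=3 is an assumption


def cooccurrence(seq, offset=1):
    """Normalised 4x4 matrix of nucleotide pairs that are `offset` apart."""
    index = {base: i for i, base in enumerate(NUCLEOTIDES)}
    counts = [[0] * 4 for _ in range(4)]
    for a, b in zip(seq, seq[offset:]):
        if a in index and b in index:  # skip ambiguous symbols such as N
            counts[index[a]][index[b]] += 1
    total = sum(sum(row) for row in counts)
    if total == 0:
        return [[0.0] * 4 for _ in range(4)]
    return [[c / total for c in row] for row in counts]


def texture_features(P):
    """Five summary statistics of a normalised co-occurrence matrix."""
    cells = [(i, j, P[i][j]) for i in range(4) for j in range(4)]
    return {
        "energy": sum(p * p for _, _, p in cells),
        "entropy": -sum(p * math.log2(p) for _, _, p in cells if p > 0),
        "homogeneity": sum(p / (1 + (i - j) ** 2) for i, j, p in cells),
        "contrast": sum(p * (i - j) ** 2 for i, j, p in cells),
        "dissimilarity": sum(p * abs(i - j) for i, j, p in cells),
    }


features = texture_features(cooccurrence("ACGT"))
```

Concatenating such feature dictionaries over several offsets would give a fixed-length vector per gene, which is the kind of input a supervised classifier expects.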
Affiliation(s)
- Ranjeet Kumar Rout
  - National Institute of Technology Srinagar, Hazratbal, Jammu and Kashmir, India
- Saiyed Umer
  - Aliah University, Kolkata, West Bengal, India
- Monika Khandelwal
  - National Institute of Technology Srinagar, Hazratbal, Jammu and Kashmir, India
- Smitarani Pati
  - Dr. B R Ambedkar National Institute of Technology Jalandhar, Jalandhar, Punjab, India
- Saurav Mallik
  - Harvard T H Chan School of Public Health, Boston, United States
  - Department of Pharmacology and Toxicology, University of Arizona, Tucson, AZ, United States
  - *Correspondence: Saurav Mallik; Hong Qin
- Hong Qin
  - Department of Computer Science and Engineering, University of Tennessee at Chattanooga, Chattanooga, TN, United States
  - *Correspondence: Saurav Mallik; Hong Qin
6
Feng Z, Xu M, Yang J, Zhang R, Geng Z, Mao T, Sheng Y, Wang L, Zhang J, Zhang H. Molecular characterization of a novel strain of Bacillus halotolerans protecting wheat from sheath blight disease caused by Rhizoctonia solani Kühn. FRONTIERS IN PLANT SCIENCE 2022; 13:1019512. [PMID: 36325560 PMCID: PMC9618607 DOI: 10.3389/fpls.2022.1019512] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 08/18/2022] [Accepted: 10/03/2022] [Indexed: 06/16/2023]
Abstract
Rhizoctonia solani Kühn naturally infects and causes sheath blight disease in cereal crops such as wheat, rice and maize, leading to severe reduction in grain yield and quality. In this work, a new bacterial strain Bacillus halotolerans LDFZ001 showing efficient antagonistic activity against the pathogenic strain Rhizoctonia solani Kühn sh-1 was isolated. Antagonistic, phylogenetic and whole genome sequencing analyses demonstrate that Bacillus halotolerans LDFZ001 strongly suppressed the growth of Rhizoctonia solani Kühn sh-1, showed a close evolutionary relationship with B. halotolerans F41-3, and possessed a 3,965,118 bp circular chromosome. Bioinformatic analysis demonstrated that the genome of Bacillus halotolerans LDFZ001 contained ten secondary metabolite biosynthetic gene clusters (BGCs) encoding five non-ribosomal peptide synthases, two polyketide synthases, two terpene synthases and one bacteriocin synthase, and a new kijanimicin biosynthetic gene cluster which might be responsible for the biosynthesis of novel compounds. Gene-editing experiments revealed that functional expression of phosphopantetheinyl transferase (SFP) and major facilitator superfamily (MFS) transporter genes in Bacillus halotolerans LDFZ001 was essential for its antifungal activity against R. solani Kühn sh-1. Moreover, the existence of two identical chitosanases may also contribute to the antipathogen activity of Bacillus halotolerans LDFZ001. Our findings will provide fundamental information for the identification and isolation of new sheath blight resistant genes and bacterial strains which have great potential to be used for the production of bacterial control agents. IMPORTANCE: A new Bacillus halotolerans strain, Bacillus halotolerans LDFZ001, resistant to sheath blight in wheat is isolated. Bacillus halotolerans LDFZ001 harbors a new kijanimicin biosynthetic gene cluster, and the functional expression of SFP and MFS contributes to its antipathogen ability.
Affiliation(s)
- Zhibin Feng
  - College of Life Science, Ludong University, Yantai, China
- Mingzhi Xu
  - The Engineering Research Institute of Agriculture and Forestry, Ludong University, Yantai, China
  - College of Agriculture, Ludong University, Yantai, China
- Jin Yang
  - The Engineering Research Institute of Agriculture and Forestry, Ludong University, Yantai, China
  - College of Agriculture, Ludong University, Yantai, China
- Renhong Zhang
  - The Engineering Research Institute of Agriculture and Forestry, Ludong University, Yantai, China
  - College of Agriculture, Ludong University, Yantai, China
- Zigui Geng
  - The Engineering Research Institute of Agriculture and Forestry, Ludong University, Yantai, China
  - College of Agriculture, Ludong University, Yantai, China
- Tingting Mao
  - The Engineering Research Institute of Agriculture and Forestry, Ludong University, Yantai, China
  - College of Agriculture, Ludong University, Yantai, China
  - Key Laboratory of Molecular Module-Based Breeding of High Yield and Abiotic Resistant Plants in Universities of Shandong (Ludong University), Ludong University, Yantai, China
- Yuting Sheng
  - The Engineering Research Institute of Agriculture and Forestry, Ludong University, Yantai, China
  - College of Agriculture, Ludong University, Yantai, China
  - Key Laboratory of Molecular Module-Based Breeding of High Yield and Abiotic Resistant Plants in Universities of Shandong (Ludong University), Ludong University, Yantai, China
- Limin Wang
  - The Engineering Research Institute of Agriculture and Forestry, Ludong University, Yantai, China
  - College of Agriculture, Ludong University, Yantai, China
  - Key Laboratory of Molecular Module-Based Breeding of High Yield and Abiotic Resistant Plants in Universities of Shandong (Ludong University), Ludong University, Yantai, China
- Juan Zhang
  - The Engineering Research Institute of Agriculture and Forestry, Ludong University, Yantai, China
  - College of Agriculture, Ludong University, Yantai, China
  - Key Laboratory of Molecular Module-Based Breeding of High Yield and Abiotic Resistant Plants in Universities of Shandong (Ludong University), Ludong University, Yantai, China
- Hongxia Zhang
  - The Engineering Research Institute of Agriculture and Forestry, Ludong University, Yantai, China
  - Key Laboratory of Molecular Module-Based Breeding of High Yield and Abiotic Resistant Plants in Universities of Shandong (Ludong University), Ludong University, Yantai, China
  - Shandong Institute of Sericulture, Shandong Academy of Agricultural Sciences, Yantai, China
7
LeBlanc N, Charles TC. Bacterial genome reductions: Tools, applications, and challenges. Front Genome Ed 2022; 4:957289. [PMID: 36120530 PMCID: PMC9473318 DOI: 10.3389/fgeed.2022.957289] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 05/30/2022] [Accepted: 07/29/2022] [Indexed: 11/16/2022]
Abstract
Bacterial cells are widely used to produce value-added products due to their versatility, ease of manipulation, and the abundance of genome engineering tools. However, the efficiency of producing these desired biomolecules is often hindered by the cells’ own metabolism, genetic instability, and the toxicity of the product. To overcome these challenges, genome reductions have been performed, making strains with the potential of serving as chassis for downstream applications. Here we review the current technologies that enable the design and construction of such reduced-genome bacteria as well as the challenges that limit their assembly and applicability. While genomic reductions have shown improvement of many cellular characteristics, a major challenge still exists in constructing these cells efficiently and rapidly. Computational tools have been created in attempts at minimizing the time needed to design these organisms, but gaps still exist in modelling these reductions in silico. Genomic reductions are a promising avenue for improving the production of value-added products, constructing chassis cells, and for uncovering cellular function but are currently limited by their time-consuming construction methods. With improvements to and the creation of novel genome editing tools and in silico models, these approaches could be combined to expedite this process and create more streamlined and efficient cell factories.
Affiliation(s)
- Nicole LeBlanc
  - Department of Biology, University of Waterloo, Waterloo, ON, Canada
  - *Correspondence: Nicole LeBlanc
- Trevor C. Charles
  - Department of Biology, University of Waterloo, Waterloo, ON, Canada
  - Metagenom Bio Life Science Inc., Waterloo, ON, Canada
8
Srivastava N, Sarethy IP, Jeevanandam J, Danquah M. Emerging strategies for microbial screening of novel chemotherapeutics. J Mol Struct 2022. [DOI: 10.1016/j.molstruc.2022.132419] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
9
Kania A. Harnessing the information theory and chaos game representation for pattern searching among essential and non-essential genes in Bacteria. J Theor Biol 2021; 531:110917. [PMID: 34563550 DOI: 10.1016/j.jtbi.2021.110917] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/28/2021] [Revised: 08/19/2021] [Accepted: 09/21/2021] [Indexed: 11/29/2022]
Abstract
Proteins encoded by genes are engaged in most of the processes within a cell. Typing a minimal set of genes required for survival is still a challenging task. Essential genes seem to be more conserved and are usually responsible for basic functions, for instance, genetic information flow or energy production. Despite persistent advances in experimental methods, computer predictions may constitute an important part of this investigation. Firstly, they may embrace a huge amount of data and provide some characteristic patterns. Furthermore, they enable scientists to build models for predicting essential genes which are not yet verified experimentally. Some papers indicate interesting dependencies within essential gene sequences using different computer models. In this paper, the author performs a three-step analysis for a deeper understanding of the fundamentals of essential and non-essential genes. Beginning with simple nucleotide composition and finishing with long-range correlations, the analysis presents some characteristic patterns that are expected to be developed in future studies.
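The chaos game representation named in the title of this entry maps a nucleotide sequence to points in the unit square by repeatedly moving halfway towards the corner assigned to each base. A minimal sketch follows, assuming the commonly used corner layout (A bottom-left, C top-left, G top-right, T bottom-right) and a start at the centre of the square. The name `chaos_game` is hypothetical, and the paper's own conventions may differ.

```python
# Corner assignment is a convention; this layout is an assumption.
CORNERS = {
    "A": (0.0, 0.0),
    "C": (0.0, 1.0),
    "G": (1.0, 1.0),
    "T": (1.0, 0.0),
}


def chaos_game(seq):
    """Return the chaos game representation points of a DNA sequence."""
    x, y = 0.5, 0.5  # start at the centre of the unit square
    points = []
    for base in seq.upper():
        corner = CORNERS.get(base)
        if corner is None:
            continue  # skip ambiguous symbols such as N
        # Move halfway from the current point towards the base's corner.
        x = (x + corner[0]) / 2
        y = (y + corner[1]) / 2
        points.append((x, y))
    return points


points = chaos_game("AG")
```

Because each point encodes the preceding subsequence, counting points per grid cell gives k-mer frequencies, which is one way such plots are compared between essential and non-essential genes.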
Affiliation(s)
- Adrian Kania
  - Department of Computational Biophysics and Bioinformatics, Faculty of Biochemistry, Biophysics and Biotechnology, Jagiellonian University, Gronostajowa 7, Cracow 30-387, Poland
10
DELEAT: gene essentiality prediction and deletion design for bacterial genome reduction. BMC Bioinformatics 2021; 22:444. [PMID: 34537011 PMCID: PMC8449488 DOI: 10.1186/s12859-021-04348-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/12/2021] [Accepted: 08/26/2021] [Indexed: 11/10/2022]
Abstract
Background: The study of gene essentiality is fundamental to understand the basic principles of life, as well as for applications in many fields. In recent decades, dozens of sets of essential genes have been determined using different experimental and bioinformatics approaches, and this information has been useful for genome reduction of model organisms. Multiple in silico strategies have been developed to predict gene essentiality, but no optimal algorithm or set of gene features has been found yet, especially for non-model organisms with incomplete functional annotation. Results: We have developed DELEAT v0.1 (DELetion design by Essentiality Analysis Tool), an easy-to-use bioinformatic tool which integrates an in silico gene essentiality classifier in a pipeline allowing automatic design of large-scale deletions in any bacterial genome. The essentiality classifier consists of a novel logistic regression model based on only six gene features which are not dependent on experimental data or functional annotation. As a proof of concept, we have applied this pipeline to the determination of dispensable regions in the genome of Bartonella quintana str. Toulouse. In this already reduced genome, 35 possible deletions have been delimited, spanning 29% of the genome. Conclusions: Built on in silico gene essentiality predictions, we have developed an analysis pipeline which assists researchers throughout multiple stages of bacterial genome reduction projects, and created a novel classifier which is simple, fast, and universally applicable to any bacterial organism with a GenBank annotation file. Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04348-5.
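The two stages described in this abstract (a logistic regression score per gene, then delimiting dispensable regions) can be sketched as follows. This is an illustration only and not the DELEAT implementation: the weights, threshold, and minimum run length are made-up values, the six gene features are not named in the abstract so the feature vector is left generic, and the function names are hypothetical.

```python
import math


def essentiality_probability(features, weights, bias):
    """Logistic regression: probability that a gene is essential."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def candidate_deletions(probabilities, threshold=0.5, min_genes=2):
    """Find runs of consecutive genes predicted to be non-essential.

    probabilities: essentiality probability per gene, in genome order.
    Returns (first_index, last_index) pairs for each run that contains
    at least `min_genes` genes scoring below `threshold`.
    """
    runs = []
    start = None
    for i, p in enumerate(probabilities):
        if p < threshold:
            if start is None:
                start = i  # a new dispensable run begins here
        else:
            if start is not None and i - start >= min_genes:
                runs.append((start, i - 1))
            start = None
    # Close a run that extends to the end of the gene list.
    if start is not None and len(probabilities) - start >= min_genes:
        runs.append((start, len(probabilities) - 1))
    return runs


# Hypothetical per-gene scores in genome order.
scores = [0.9, 0.1, 0.2, 0.8, 0.3, 0.1, 0.05]
deletions = candidate_deletions(scores)
```

A real pipeline would also need to handle the circular chromosome, map gene indices back to genomic coordinates, and avoid regions whose removal would disrupt neighbouring essential genes.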
11
Semi-Supervised Learning Using Hierarchical Mixture Models: Gene Essentiality Case Study. MATHEMATICAL AND COMPUTATIONAL APPLICATIONS 2021. [DOI: 10.3390/mca26020040] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/26/2022]
Abstract
Integrating gene-level data is useful for predicting the role of genes in biological processes. This problem has typically focused on supervised classification, which requires large training sets of positive and negative examples. However, training data sets that are too small for supervised approaches can still provide valuable information. We describe a hierarchical mixture model that uses limited positively labeled gene training data for semi-supervised learning. We focus on the problem of predicting essential genes, where a gene is required for the survival of an organism under particular conditions. We applied cross-validation and found that the inclusion of positively labeled samples in a semi-supervised learning framework with the hierarchical mixture model improves the detection of essential genes compared to unsupervised, supervised, and other semi-supervised approaches. There was also improved prediction performance when genes are incorrectly assumed to be non-essential. Our comparisons indicate that the incorporation of even small amounts of existing knowledge improves the accuracy of prediction and decreases variability in predictions. Although we focused on gene essentiality, the hierarchical mixture model and semi-supervised framework is standard for problems focused on prediction of genes or other features, with multiple data types characterizing the feature, and a small set of positive labels.
12
Aromolaran O, Aromolaran D, Isewon I, Oyelade J. Machine learning approach to gene essentiality prediction: a review. Brief Bioinform 2021; 22:6219158. [PMID: 33842944 DOI: 10.1093/bib/bbab128] [Citation(s) in RCA: 47] [Impact Index Per Article: 15.7] [Received: 01/18/2021] [Revised: 03/04/2021] [Accepted: 03/17/2021] [Indexed: 12/17/2022]
Abstract
Essential genes are critical for the growth and survival of any organism. The machine learning approach complements the experimental methods to minimize the resources required for essentiality assays. Previous studies revealed the need to discover relevant features that significantly classify essential genes, improve on the generalizability of prediction models across organisms, and construct a robust gold standard as the class label for the train data to enhance prediction. Findings also show that a significant limitation of the machine learning approach is predicting conditionally essential genes. The essentiality status of a gene can change due to a specific condition of the organism. This review examines various methods applied to essential gene prediction task, their strengths, limitations and the factors responsible for effective computational prediction of essential genes. We discussed categories of features and how they contribute to the classification performance of essentiality prediction models. Five categories of features, namely, gene sequence, protein sequence, network topology, homology and gene ontology-based features, were generated for Caenorhabditis elegans to perform a comparative analysis of their essentiality prediction capacity. Gene ontology-based feature category outperformed other categories of features majorly due to its high correlation with the genes' biological functions. However, the topology feature category provided the highest discriminatory power making it more suitable for essentiality prediction. The major limiting factor of machine learning to predict essential genes conditionality is the unavailability of labeled data for interest conditions that can train a classifier. Therefore, cooperative machine learning could further exploit models that can perform well in conditional essentiality predictions. 
SHORT ABSTRACT Identification of essential genes is imperative because it provides an understanding of the core structure and function, accelerating drug targets' discovery, among other functions. Recent studies have applied machine learning to complement the experimental identification of essential genes. However, several factors are limiting the performance of machine learning approaches. This review aims to present the standard procedure and resources available for predicting essential genes in organisms, and also highlight the factors responsible for the current limitation in using machine learning for conditional gene essentiality prediction. The choice of features and ML technique was identified as an important factor to predict essential genes effectively.
Affiliation(s)
- Olufemi Aromolaran
- Department of Computer and Information Sciences, Covenant University, Ota, Ogun State, Nigeria.,Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun State, Nigeria
| | - Damilare Aromolaran
- Department of Computer and Information Sciences, Covenant University, Ota, Ogun State, Nigeria.,Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun State, Nigeria
| | - Itunuoluwa Isewon
- Department of Computer and Information Sciences, Covenant University, Ota, Ogun State, Nigeria.,Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun State, Nigeria
| | - Jelili Oyelade
- Department of Computer and Information Sciences, Covenant University, Ota, Ogun State, Nigeria.,Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun State, Nigeria
| |
|
13
|
Tang J, Wu X, Mou M, Wang C, Wang L, Li F, Guo M, Yin J, Xie W, Wang X, Wang Y, Ding Y, Xue W, Zhu F. GIMICA: host genetic and immune factors shaping human microbiota. Nucleic Acids Res 2021; 49:D715-D722. [PMID: 33045729 PMCID: PMC7779047 DOI: 10.1093/nar/gkaa851] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Received: 08/14/2020] [Revised: 09/09/2020] [Accepted: 10/08/2020] [Indexed: 01/09/2023] Open
Abstract
Besides the environmental factors that have tremendous impacts on the composition of the microbial community, host factors have recently gained extensive attention for their roles in shaping the human microbiota. There are two major types of host factors: host genetic factors (HGFs) and host immune factors (HIFs). Factors of each type are essential for defining the chemical and physical landscapes inhabited by microbiota, and the collective consideration of both types has great implications for comprehensive health management. However, no database was available to provide comprehensive factors of both types. Herein, a database entitled 'Host Genetic and Immune Factors Shaping Human Microbiota (GIMICA)' was constructed. Based on the 4257 microbes confirmed to inhabit nine sites of the human body, 2851 HGFs (1368 single nucleotide polymorphisms (SNPs), 186 copy number variations (CNVs), and 1297 non-coding ribonucleic acids (RNAs)) modulating the expression of 370 microbes were collected, and 549 HIFs (126 lymphocytes and phagocytes, 387 immune proteins, and 36 immune pathways) regulating the abundance of 455 microbes were also provided. All in all, GIMICA enables the collective consideration not only of different types of host factors but also of host and environmental ones, and is freely accessible without login requirement at: https://idrblab.org/gimica/.
Affiliation(s)
- Jing Tang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China.,College of Basic Medicine, Chongqing Medical University, Chongqing 400016, China
| | - Xianglu Wu
- Joint International Research Lab of Reproductive and Development, Department of Reproductive Biology, School of Public Health, Chongqing Medical University, Chongqing 400016, China
| | - Minjie Mou
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
| | - Chuan Wang
- College of Basic Medicine, Chongqing Medical University, Chongqing 400016, China
| | - Lidan Wang
- College of Basic Medicine, Chongqing Medical University, Chongqing 400016, China
| | - Fengcheng Li
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
| | - Maiyuan Guo
- College of Basic Medicine, Chongqing Medical University, Chongqing 400016, China
| | - Jiayi Yin
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
| | - Wenqin Xie
- College of Basic Medicine, Chongqing Medical University, Chongqing 400016, China
| | - Xiaona Wang
- School of Pharmaceutical Sciences, Chongqing University, Chongqing 401331, China
| | - Yingxiong Wang
- College of Basic Medicine, Chongqing Medical University, Chongqing 400016, China.,Joint International Research Lab of Reproductive and Development, Department of Reproductive Biology, School of Public Health, Chongqing Medical University, Chongqing 400016, China
| | - Yubin Ding
- Joint International Research Lab of Reproductive and Development, Department of Reproductive Biology, School of Public Health, Chongqing Medical University, Chongqing 400016, China
| | - Weiwei Xue
- School of Pharmaceutical Sciences, Chongqing University, Chongqing 401331, China
| | - Feng Zhu
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
| |
|
14
|
Le NQK, Do DT, Hung TNK, Lam LHT, Huynh TT, Nguyen NTK. A Computational Framework Based on Ensemble Deep Neural Networks for Essential Genes Identification. Int J Mol Sci 2020; 21:E9070. [PMID: 33260643 PMCID: PMC7730808 DOI: 10.3390/ijms21239070] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Received: 10/04/2020] [Revised: 11/25/2020] [Accepted: 11/26/2020] [Indexed: 01/13/2023] Open
Abstract
Essential genes contain key genomic information that could be central to a comprehensive understanding of life and evolution. Because of their importance, the study of essential genes is considered a crucial problem in computational biology. Computational methods for identifying essential genes have become increasingly popular as a way to reduce the cost and time of traditional experiments. A few models have addressed this problem, but performance is still not satisfactory because of high-dimensional features and the use of traditional machine learning algorithms. Thus, there is a need for a novel model that improves predictive performance on this problem using DNA sequence features. This study took advantage of a natural language processing (NLP) model to learn from biological sequences by treating them as natural language words. To learn from the NLP features, a supervised learning model based on an ensemble deep neural network was subsequently employed. Our proposed method could identify essential genes with sensitivity, specificity, accuracy, Matthews correlation coefficient (MCC), and area under the receiver operating characteristic curve (AUC) values of 60.2%, 84.6%, 76.3%, 0.449, and 0.814, respectively. The method outperformed the single models without ensemble, as well as the state-of-the-art predictors on the same benchmark dataset. This indicates the effectiveness of the proposed method in determining essential genes in particular, and in other sequence-based problems in general.
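As a reading aid for the metrics reported in this abstract, the following minimal sketch shows how sensitivity, specificity, accuracy and MCC are computed from the four cells of a binary confusion matrix. The function name `binary_metrics` and the example counts are hypothetical illustrations; they are not the authors' code and do not reproduce the paper's data.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return standard binary classification metrics from confusion-matrix counts.

    Assumes at least one positive and one negative example, so the
    sensitivity and specificity denominators are non-zero.
    """
    sensitivity = tp / (tp + fn)          # fraction of essential genes recovered
    specificity = tn / (tn + fp)          # fraction of non-essential genes recovered
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # Matthews correlation coefficient; defined as 0 when any marginal is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts chosen only to illustrate the calculation.
metrics = binary_metrics(tp=60, fp=15, tn=85, fn=40)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

On imbalanced essentiality datasets, MCC is often reported next to accuracy because it uses all four cells of the confusion matrix, which is one reason the two values can diverge.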
Affiliation(s)
- Nguyen Quoc Khanh Le
- Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei 106, Taiwan
- Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei 106, Taiwan
- Translational Imaging Research Center, Taipei Medical University Hospital, Taipei 110, Taiwan
| | - Duyen Thi Do
- Graduate Institute of Biomedical Informatics, Taipei Medical University, Taipei 106, Taiwan;
| | - Truong Nguyen Khanh Hung
- International Master/Ph.D. Program in Medicine, College of Medicine, Taipei Medical University, Taipei 110, Taiwan; (T.N.K.H.); (L.H.T.L.)
- Department of Orthopedic and Trauma, Cho Ray Hospital, Ho Chi Minh 70000, Vietnam
| | - Luu Ho Thanh Lam
- International Master/Ph.D. Program in Medicine, College of Medicine, Taipei Medical University, Taipei 110, Taiwan; (T.N.K.H.); (L.H.T.L.)
- Intensive Care Unit, Children’s Hospital 2, Ho Chi Minh 70000, Vietnam
| | - Tuan-Tu Huynh
- Department of Electrical Engineering, Yuan Ze University, Taoyuan 320, Taiwan;
- Department of Electrical Electronic and Mechanical Engineering, Lac Hong University, Dong Nai 76120, Vietnam
| | - Ngan Thi Kim Nguyen
- School of Nutrition and Health Sciences, Taipei Medical University, Taipei 110, Taiwan;
| |
|
15
|
Kong X, Zhu B, Stone VN, Ge X, El-Rami FE, Donghai H, Xu P. ePath: an online database towards comprehensive essential gene annotation for prokaryotes. Sci Rep 2019; 9:12949. [PMID: 31506471 PMCID: PMC6737131 DOI: 10.1038/s41598-019-49098-w] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 02/26/2019] [Accepted: 08/15/2019] [Indexed: 02/01/2023] Open
Abstract
Experimental techniques for identification of essential genes (EGs) in prokaryotes are usually expensive, time-consuming and sometimes unrealistic. Emerging in silico methods provide alternatives for EG prediction, but often have limitations including heavy computational requirements and lack of biological explanation. Here we propose a new computational algorithm for EG prediction in prokaryotes with an online database (ePath) for quick access to the EG prediction results of over 4,000 prokaryotes (https://www.pubapps.vcu.edu/epath/). In ePath, gene essentiality is linked to biological functions annotated by KEGG Ortholog (KO). Two new scoring systems, namely E_score and P_score, are proposed for each KO as the EG evaluation criteria. E_score represents the appearance and essentiality of a given KO in existing experimental results of gene essentiality, while P_score denotes gene essentiality based on the principle that a gene is essential if it plays a role in genetic information processing, cell envelope maintenance or energy production. The new EG prediction algorithm shows prediction accuracy ranging from 75% to 91% based on validation against five new experimental studies on EG identification. Our overall goal with ePath is to provide a comprehensive and reliable reference for gene essentiality annotation, facilitating the study of those prokaryotes without experimentally derived gene essentiality information.
Affiliation(s)
- Xiangzhen Kong
- Philips Institute for Oral Health Research, Virginia Commonwealth University, Richmond, Virginia, 23298, United States of America
| | - Bin Zhu
- Philips Institute for Oral Health Research, Virginia Commonwealth University, Richmond, Virginia, 23298, United States of America
| | - Victoria N Stone
- Philips Institute for Oral Health Research, Virginia Commonwealth University, Richmond, Virginia, 23298, United States of America
| | - Xiuchun Ge
- Philips Institute for Oral Health Research, Virginia Commonwealth University, Richmond, Virginia, 23298, United States of America
| | - Fadi E El-Rami
- Philips Institute for Oral Health Research, Virginia Commonwealth University, Richmond, Virginia, 23298, United States of America
| | - Huangfu Donghai
- Application Services, Virginia Commonwealth University, Richmond, Virginia, United States of America
| | - Ping Xu
- Philips Institute for Oral Health Research, Virginia Commonwealth University, Richmond, Virginia, 23298, United States of America.
- Department of Microbiology and Immunology, Virginia Commonwealth University, Richmond, Virginia, United States of America.
- Center for Biological Data Science, Virginia Commonwealth University, Richmond, Virginia, United States of America.
| |
|
16
|
Wen QF, Liu S, Dong C, Guo HX, Gao YZ, Guo FB. Geptop 2.0: An Updated, More Precise, and Faster Geptop Server for Identification of Prokaryotic Essential Genes. Front Microbiol 2019; 10:1236. [PMID: 31214154 PMCID: PMC6558110 DOI: 10.3389/fmicb.2019.01236] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Received: 03/02/2019] [Accepted: 05/17/2019] [Indexed: 12/16/2022] Open
Abstract
Geptop has performed effectively in the identification of prokaryotic essential genes since its first release in 2013. It estimates gene essentiality for prokaryotes based on orthology and phylogeny. Genome-scale essentiality data are now available for more prokaryotic species, and the information has been collected into public essential gene repositories such as DEG and OGEE. A faster and more accurate toolkit is needed to keep pace with the growing volume of prokaryotic genome data. We updated Geptop by adding more validated essentiality data to the reference set (from 19 to 37 species) and by introducing multi-process technology to accelerate computation. Compared with Geptop 1.0 and other gene essentiality prediction models, Geptop 2.0 generates more stable predictions and finishes the computation in a shorter time. The software is available both as an online server and as a downloadable standalone application. We hope that the improved Geptop 2.0 will facilitate research in gene essentiality and the development of novel antibacterial drugs. The gene essentiality prediction tool is available at http://cefg.uestc.cn/geptop.
Affiliation(s)
- Qing-Feng Wen
- School of Life Sciences and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Shuo Liu
- School of Life Sciences and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Chuan Dong
- School of Life Sciences and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Hai-Xia Guo
- School of Life Sciences and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Yi-Zhou Gao
- School of Life Sciences and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Feng-Biao Guo
- School of Life Sciences and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| |
|
17
|
Abstract
Background: Essential proteins play important roles in the survival or reproduction of an organism and support the stability of the system. Essential proteins are the minimum set of proteins absolutely required to maintain a living cell. The identification of essential proteins is a very important topic, not only for a better comprehension of the minimal requirements for cellular life, but also for a more efficient discovery of human disease genes and drug targets. Traditionally, the experimental identification of essential proteins is complex and usually requires great time and expense. With the accumulation of high-throughput experimental data, many computational methods that usefully complement experimental methods have been proposed to identify essential proteins. In addition, the ability to rapidly and precisely identify essential proteins is of great significance for discovering disease genes and for drug design, and has great potential for applications in basic and synthetic biology research.
Objective: The aim of this paper is to review the identification of essential proteins and genes, focusing on current developments in different types of computational methods, to point out the progress and limitations of existing methods, and to discuss the challenges and directions for further research.
Affiliation(s)
- Ming Fang
- School of Computer Science, Shaanxi Normal University, Xi'an 710119, China
| | - Xiujuan Lei
- School of Computer Science, Shaanxi Normal University, Xi'an 710119, China
| | - Ling Guo
- College of Life Sciences, Shaanxi Normal University, Xi'an 710119, China
| |
|
18
|
Guzmán GI, Olson CA, Hefner Y, Phaneuf PV, Catoiu E, Crepaldi LB, Micas LG, Palsson BO, Feist AM. Reframing gene essentiality in terms of adaptive flexibility. BMC SYSTEMS BIOLOGY 2018; 12:143. [PMID: 30558585 PMCID: PMC6296033 DOI: 10.1186/s12918-018-0653-z] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 12/16/2017] [Accepted: 11/13/2018] [Indexed: 12/17/2022]
Abstract
BACKGROUND Essentiality assays are important tools commonly used to discover gene functions. Growth/no growth screens of single gene knockout strain collections are also often used to test the predictive power of genome-scale models. False positive predictions occur when computational analysis predicts a gene to be non-essential but experimental screens deem the gene to be essential. One explanation for this inconsistency is that the model contains the wrong information, possibly an incorrectly annotated alternative pathway or isozyme reaction. Inconsistencies could also be attributed to experimental limitations, such as growth tests with arbitrary time cut-offs. The focus of this study was to resolve such inconsistencies to better understand isozyme activities and gene essentiality. RESULTS In this study, we explored the definition of conditional essentiality from a phenotypic and genomic perspective. Gene-deletion strains associated with false positive predictions of gene essentiality on defined minimal medium for Escherichia coli were targeted for extended growth tests followed by population sequencing and transcriptome analysis. Of the 20 false positive strains available and confirmed from the Keio single gene knock-out collection, 11 strains were shown to grow with longer incubation periods, making these actual true positives. These strains grew reproducibly with a diverse range of growth phenotypes. The lag phase observed for these strains ranged from less than 1 day to more than 7 days. It was found that 9 out of 11 of the false positive strains that grew acquired mutations in at least one replicate experiment, and the types of mutations ranged from SNPs and small indels associated with regulatory or metabolic elements to large regions of genome duplication. Comparison of the detected adaptive mutations, modeling predictions of alternate pathways and isozymes, and transcriptome analysis of KO strains suggested agreement for the observed growth phenotype for 6 out of the 9 cases where mutations were observed. CONCLUSIONS Longer-term growth experiments followed by whole genome sequencing and transcriptome analysis can provide a better understanding of conditional gene essentiality and mechanisms of adaptation to such perturbations. Compensatory mutations are largely reproducible mechanisms and agree with genome-scale modeling predictions of the response to loss-of-function gene deletion events.
Affiliation(s)
- Gabriela I Guzmán
- Department of Bioengineering, University of California, San Diego, La Jolla, 92093, CA, USA
| | - Connor A Olson
- Department of Bioengineering, University of California, San Diego, La Jolla, 92093, CA, USA
| | - Ying Hefner
- Department of Bioengineering, University of California, San Diego, La Jolla, 92093, CA, USA
| | - Patrick V Phaneuf
- Department of Bioinformatics and Systems Biology, University of California, San Diego, 92093, La Jolla, CA, USA
| | - Edward Catoiu
- Department of Bioengineering, University of California, San Diego, La Jolla, 92093, CA, USA
| | - Lais B Crepaldi
- Department of Bioengineering, University of California, San Diego, La Jolla, 92093, CA, USA.,Department of Chemical Engineering, University of Ribeirão Preto, São Paulo, Brazil
| | - Lucas Goldschmidt Micas
- Department of Bioengineering, University of California, San Diego, La Jolla, 92093, CA, USA.,Department of Chemical and Petroleum Engineering, Fluminense Federal University, Niterói, Rio de Janeiro, Brazil
| | - Bernhard O Palsson
- Department of Bioengineering, University of California, San Diego, La Jolla, 92093, CA, USA.,Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Lyngby, Denmark.,Department of Pediatrics, University of California, San Diego, La Jolla, 92093, CA, USA
| | - Adam M Feist
- Department of Bioengineering, University of California, San Diego, La Jolla, 92093, CA, USA. .,Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Lyngby, Denmark.
| |
|
19
|
Azhagesan K, Ravindran B, Raman K. Network-based features enable prediction of essential genes across diverse organisms. PLoS One 2018; 13:e0208722. [PMID: 30543651 PMCID: PMC6292609 DOI: 10.1371/journal.pone.0208722] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Received: 04/18/2018] [Accepted: 11/21/2018] [Indexed: 12/19/2022] Open
Abstract
Machine learning approaches to predict essential genes have gained a lot of traction in recent years. These approaches predominantly make use of sequence and network-based features to predict essential genes. However, the scope of network-based features used by the existing approaches is very narrow. Further, many of these studies focus on predicting essential genes within the same organism, so they cannot be readily used to predict essential genes across organisms. Therefore, there is clearly a need for a method that is able to predict essential genes across organisms by leveraging network-based features. In this study, we extract several sets of network-based features from protein-protein association networks available from the STRING database. Our network features include some common measures of centrality, and also some novel recursive measures recently proposed in the social network literature. We extract hundreds of network-based features from networks of 27 diverse organisms to predict the essentiality of 87000+ genes. Our results show that network-based features are statistically significantly better at classifying essential genes across diverse bacterial species, compared to the current state-of-the-art methods, which use mostly sequence and a few 'conventional' network-based features. Our diverse set of network properties gave an AUROC of 0.847 and a precision of 0.320 across 27 organisms. When we augmented the complete set of network features with sequence-derived features, we achieved an improved AUROC of 0.857 and a precision of 0.335. We also constructed a reduced set of 100 sequence and network features, which gave a comparable performance. Further, we show that our features are useful for predicting essential genes in new organisms by using leave-one-species-out validation. Our network features capture the local, global and neighbourhood properties of the network and are hence effective for prediction of essential genes across diverse organisms, even in the absence of other complex biological knowledge. Our approach can be readily exploited to predict essentiality for organisms in interactome databases such as STRING, where both network and sequence are readily available. All code is available at https://github.com/RamanLab/nbfpeg.
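Two ideas in this abstract can be illustrated with a short sketch: a simple centrality measure as a network feature, and leave-one-species-out validation. The code below is a minimal illustration under stated assumptions, not the authors' pipeline (which is in the repository linked above). The function names, the toy edge list and the species labels are all hypothetical, and degree centrality stands in for the much larger feature set the abstract describes.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Normalized degree centrality for an undirected network given as (u, v) pairs."""
    neighbours = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops in this sketch
        neighbours[u].add(v)
        neighbours[v].add(u)
    n = len(neighbours)
    if n < 2:
        return {node: 0.0 for node in neighbours}
    return {node: len(nb) / (n - 1) for node, nb in neighbours.items()}


def leave_one_species_out(species_labels):
    """Yield (held_out_species, train_indices, test_indices) for each species.

    All genes from the held-out species form the test set, so the model is
    always evaluated on an organism it has not seen during training.
    """
    by_species = defaultdict(list)
    for idx, species in enumerate(species_labels):
        by_species[species].append(idx)
    for held_out in sorted(by_species):
        test_idx = by_species[held_out]
        train_idx = sorted(
            i for sp, idxs in by_species.items() if sp != held_out for i in idxs
        )
        yield held_out, train_idx, test_idx


# Hypothetical toy data: a four-protein network and five genes from three species.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
print(degree_centrality(toy_edges))

toy_species = ["eco", "eco", "bsu", "mtu", "bsu"]
for held_out, train_idx, test_idx in leave_one_species_out(toy_species):
    print(held_out, train_idx, test_idx)
```

Splitting by species rather than by gene keeps all genes of one organism on the same side of the split, which is what makes the evaluation a test of cross-organism generalization.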
Affiliation(s)
- Karthik Azhagesan
- Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology (IIT) Madras, Chennai – 600 036, India
- Initiative for Biological Systems Engineering (IBSE), IIT Madras, Chennai – 600 036, India
- Robert Bosch Centre for Data Science and Artificial Intelligence (RBCDSAI), IIT Madras, Chennai – 600 036, India
| | - Balaraman Ravindran
- Department of Computer Science and Engineering, IIT Madras, Chennai – 600 036, India
- Initiative for Biological Systems Engineering (IBSE), IIT Madras, Chennai – 600 036, India
- Robert Bosch Centre for Data Science and Artificial Intelligence (RBCDSAI), IIT Madras, Chennai – 600 036, India
| | - Karthik Raman
- Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology (IIT) Madras, Chennai – 600 036, India
- Initiative for Biological Systems Engineering (IBSE), IIT Madras, Chennai – 600 036, India
- Robert Bosch Centre for Data Science and Artificial Intelligence (RBCDSAI), IIT Madras, Chennai – 600 036, India
| |
|
20
|
Dong C, Jin YT, Hua HL, Wen QF, Luo S, Zheng WX, Guo FB. Comprehensive review of the identification of essential genes using computational methods: focusing on feature implementation and assessment. Brief Bioinform 2018; 21:171-181. [PMID: 30496347 DOI: 10.1093/bib/bby116] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Received: 06/13/2018] [Revised: 11/01/2018] [Accepted: 11/02/2018] [Indexed: 02/06/2023] Open
Abstract
Essential genes have attracted increasing attention in recent years due to the important functions of these genes in organisms. Among the methods used to identify essential genes, accurate and efficient computational methods can make up for the deficiencies of expensive and time-consuming experimental technologies. In this review, we have collected research on essential gene prediction in prokaryotes and eukaryotes and summarized the five predominant types of features used in these studies. The five types of features are evolutionary conservation, domain information, network topology, sequence component and expression level. We have described how to implement the useful forms of these features and evaluated their performance based on the data of Escherichia coli MG1655, Bacillus subtilis 168 and human. The prerequisites and applicable range of these features are described. In addition, we have investigated the techniques used to weight features in various models. To assist researchers in the field, we reference two freely accessible online tools that can be used directly to predict gene essentiality in prokaryotes and humans. This article provides a simple guide for the identification of essential genes in prokaryotes and eukaryotes.
Affiliation(s)
- Chuan Dong
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
| | - Yan-Ting Jin
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
| | - Hong-Li Hua
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
| | - Qing-Feng Wen
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
| | - Sen Luo
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
| | - Wen-Xin Zheng
- School of Biomedical Engineering, Capital Medical University, Beijing, China
| | - Feng-Biao Guo
- School of Life Science and Technology, Center for Informational Biology, Intelligent Learning Institute for Science and Application, University of Electronic Science and Technology of China, Chengdu, China
| |
|
21
|
Peng C, Lin Y, Luo H, Gao F. A Comprehensive Overview of Online Resources to Identify and Predict Bacterial Essential Genes. Front Microbiol 2017; 8:2331. [PMID: 29230204 PMCID: PMC5711816 DOI: 10.3389/fmicb.2017.02331] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.6] [Received: 07/20/2017] [Accepted: 11/13/2017] [Indexed: 12/15/2022] Open
Abstract
Genes critical for the survival or reproduction of an organism in certain circumstances are classified as essential genes. Essential genes play a significant role in deciphering the survival mechanisms of life, and they have broad applications in pharmaceutics and synthetic biology. The continuous progress of experimental methods for essential gene identification has accelerated the accumulation of gene essentiality data, which facilitates the study of essential genes in silico. In this article, we present some available online resources related to gene essentiality, including bioinformatic software tools for transposon sequencing (Tn-seq) analysis, essential gene databases and online services to predict bacterial essential genes. We review several computational approaches that have been used to predict essential genes, and summarize the features used for gene essentiality prediction. In addition, we evaluate the available online bacterial essential gene prediction servers based on the experimentally validated essential gene sets of 30 bacteria from DEG. This article is intended to be a quick reference guide for microbiologists interested in essential genes.
Affiliation(s)
- Chong Peng
- Department of Physics, School of Science, Tianjin University, Tianjin, China
| | - Yan Lin
- Department of Physics, School of Science, Tianjin University, Tianjin, China
| | - Hao Luo
- Department of Physics, School of Science, Tianjin University, Tianjin, China
| | - Feng Gao
- Department of Physics, School of Science, Tianjin University, Tianjin, China
- Key Laboratory of Systems Bioengineering (Ministry of Education), Tianjin University, Tianjin, China
- SynBio Research Platform, Collaborative Innovation Center of Chemical Science and Engineering (Tianjin), Tianjin University, Tianjin, China
| |
|
22
|
An Approach for Predicting Essential Genes Using Multiple Homology Mapping and Machine Learning Algorithms. BIOMED RESEARCH INTERNATIONAL 2016; 2016:7639397. [PMID: 27660763 PMCID: PMC5021884 DOI: 10.1155/2016/7639397] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Received: 05/30/2016] [Revised: 07/25/2016] [Accepted: 08/04/2016] [Indexed: 11/17/2022]
Abstract
Investigation of essential genes is important for understanding the minimal gene sets of cells and for discovering potential drug targets. In this study, a novel approach based on multiple homology mapping and machine learning was introduced to predict essential genes. We focused on 25 bacteria which have characterized essential genes. The predictions yielded a highest area under the receiver operating characteristic (ROC) curve (AUC) of 0.9716 in a tenfold cross-validation test. Proper features were utilized to construct models to make predictions in distantly related bacteria. The accuracy of predictions was evaluated via the consistency between predictions and known essential genes of the target species. The highest AUC of 0.9552 and an average AUC of 0.8314 were achieved when making predictions across organisms. An independent dataset from Synechococcus elongatus, which was released recently, was obtained for further assessment of the performance of our model. The AUC score of these predictions is 0.7855, which is higher than that of other methods. This research shows that features obtained by homology mapping alone can achieve results as good as or even better than those of integrated features. Meanwhile, the work indicates that machine learning-based methods can assign more efficient weight coefficients than an empirical formula based on biological knowledge.
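For readers unfamiliar with the AUC values quoted here, the sketch below computes the area under the ROC curve directly from its rank interpretation: the probability that a randomly chosen essential gene receives a higher score than a randomly chosen non-essential gene, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are hypothetical; this is not the paper's code or data.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: iterable of 1 (essential) or 0 (non-essential)
    scores: iterable of predicted scores, higher meaning more likely essential
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical toy example: two essential and two non-essential genes.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.6, 0.1]
print(roc_auc(toy_labels, toy_scores))
```

The pairwise loop is quadratic in the number of genes, which is fine for a small illustration; a rank-based formulation would be the usual choice for genome-scale data.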
|