1
|
Kim K, Choe D, Cho S, Palsson B, Cho BK. Reduction-to-synthesis: the dominant approach to genome-scale synthetic biology. Trends Biotechnol 2024; 42:1048-1063. [PMID: 38423803 DOI: 10.1016/j.tibtech.2024.02.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/14/2023] [Revised: 02/09/2024] [Accepted: 02/12/2024] [Indexed: 03/02/2024]
Abstract
Advances in systems and synthetic biology have propelled the construction of reduced bacterial genomes. Genome reduction was initially focused on exploring properties of minimal genomes, but more recently it has been deployed as an engineering strategy to enhance strain performance. This review provides the latest updates on reduced genomes, focusing on dual-track approaches of top-down reduction and bottom-up synthesis for their construction. Using cases from studies that are based on established industrial workhorse strains, we discuss the construction of a series of synthetic phenotypes that are candidates for biotechnological applications. Finally, we address the possible uses of reduced genomes for biotechnological applications and the needed future research directions that may ultimately lead to the total synthesis of rationally designed genomes.
Collapse
Affiliation(s)
- Kangsan Kim
- Department of Biological Sciences, Korea Advanced Institute of Science and Technology, Daejeon 34141, Republic of Korea; KI for the BioCentury, Korea Advanced Institute of Science and Technology, Daejeon 34141, Republic of Korea
| | - Donghui Choe
- Department of Bioengineering, University of California San Diego, La Jolla, CA 92093, USA
| | - Suhyung Cho
- KI for the BioCentury, Korea Advanced Institute of Science and Technology, Daejeon 34141, Republic of Korea
| | - Bernhard Palsson
- Department of Bioengineering, University of California San Diego, La Jolla, CA 92093, USA; Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kemitorvet, Kongens, Lyngby, Denmark
| | - Byung-Kwan Cho
- Department of Biological Sciences, Korea Advanced Institute of Science and Technology, Daejeon 34141, Republic of Korea; KI for the BioCentury, Korea Advanced Institute of Science and Technology, Daejeon 34141, Republic of Korea; Graduate School of Engineering Biology, Korea Advanced Institute of Science and Technology, Daejeon, 34141, Republic of Korea.
| |
Collapse
|
2
|
Liang Y, Luo H, Lin Y, Gao F. Recent advances in the characterization of essential genes and development of a database of essential genes. IMETA 2024; 3:e157. [PMID: 38868518 PMCID: PMC10989110 DOI: 10.1002/imt2.157] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 09/29/2023] [Accepted: 10/09/2023] [Indexed: 06/14/2024]
Abstract
Over the past few decades, there has been a significant interest in the study of essential genes, which are crucial for the survival of an organism under specific environmental conditions and thus have practical applications in the fields of synthetic biology and medicine. An increasing amount of experimental data on essential genes has been obtained with the continuous development of technological methods. Meanwhile, various computational prediction methods, related databases and web servers have emerged accordingly. To facilitate the study of essential genes, we have established a database of essential genes (DEG), which has become popular with continuous updates to facilitate essential gene feature analysis and prediction, drug and vaccine development, as well as artificial genome design and construction. In this article, we summarized the studies of essential genes, overviewed the relevant databases, and discussed their practical applications. Furthermore, we provided an overview of the main applications of DEG and conducted comprehensive analyses based on its latest version. However, it should be noted that the essential gene is a dynamic concept instead of a binary one, which presents both opportunities and challenges for their future development.
Collapse
Affiliation(s)
| | - Hao Luo
- Department of PhysicsTianjin UniversityTianjinChina
| | - Yan Lin
- Department of PhysicsTianjin UniversityTianjinChina
| | - Feng Gao
- Department of PhysicsTianjin UniversityTianjinChina
- Frontiers Science Center for Synthetic Biology and Key Laboratory of Systems Bioengineering (Ministry of Education)Tianjin UniversityTianjinChina
- SynBio Research PlatformCollaborative Innovation Center of Chemical Science and Engineering (Tianjin)TianjinChina
| |
Collapse
|
3
|
LaPolice TM, Huang YF. An unsupervised deep learning framework for predicting human essential genes from population and functional genomic data. BMC Bioinformatics 2023; 24:347. [PMID: 37723435 PMCID: PMC10506225 DOI: 10.1186/s12859-023-05481-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/13/2022] [Accepted: 09/13/2023] [Indexed: 09/20/2023] Open
Abstract
BACKGROUND The ability to accurately predict essential genes intolerant to loss-of-function (LOF) mutations can dramatically improve the identification of disease-associated genes. Recently, there have been numerous computational methods developed to predict human essential genes from population genomic data. While the existing methods are highly predictive of essential genes of long length, they have limited power in pinpointing short essential genes due to the sparsity of polymorphisms in the human genome. RESULTS Motivated by the premise that population and functional genomic data may provide complementary evidence for gene essentiality, here we present an evolution-based deep learning model, DeepLOF, to predict essential genes in an unsupervised manner. Unlike previous population genetic methods, DeepLOF utilizes a novel deep learning framework to integrate both population and functional genomic data, allowing us to pinpoint short essential genes that can hardly be predicted from population genomic data alone. Compared with previous methods, DeepLOF shows unmatched performance in predicting ClinGen haploinsufficient genes, mouse essential genes, and essential genes in human cell lines. Notably, at a false positive rate of 5%, DeepLOF detects 50% more ClinGen haploinsufficient genes than previous methods. Furthermore, DeepLOF discovers 109 novel essential genes that are too short to be identified by previous methods. CONCLUSION The predictive power of DeepLOF shows that it is a compelling computational method to aid in the discovery of essential genes.
Collapse
Affiliation(s)
- Troy M LaPolice
- Department of Biology, Pennsylvania State University, University Park, PA, 16802, USA.
- Bioinformatics and Genomics Graduate Program, Pennsylvania State University, University Park, PA, 16802, USA.
- Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, PA, 16802, USA.
| | - Yi-Fei Huang
- Department of Biology, Pennsylvania State University, University Park, PA, 16802, USA.
- Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, PA, 16802, USA.
| |
Collapse
|
4
|
Li Y, Zeng M, Zhang F, Wu FX, Li M. DeepCellEss: cell line-specific essential protein prediction with attention-based interpretable deep learning. Bioinformatics 2023; 39:btac779. [PMID: 36458923 PMCID: PMC9825760 DOI: 10.1093/bioinformatics/btac779] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/17/2022] [Revised: 11/25/2022] [Accepted: 12/01/2022] [Indexed: 12/05/2022] Open
Abstract
MOTIVATION Protein essentiality is usually accepted to be a conditional trait and strongly affected by cellular environments. However, existing computational methods often do not take such characteristics into account, preferring to incorporate all available data and train a general model for all cell lines. In addition, the lack of model interpretability limits further exploration and analysis of essential protein predictions. RESULTS In this study, we proposed DeepCellEss, a sequence-based interpretable deep learning framework for cell line-specific essential protein predictions. DeepCellEss utilizes a convolutional neural network and bidirectional long short-term memory to learn short- and long-range latent information from protein sequences. Further, a multi-head self-attention mechanism is used to provide residue-level model interpretability. For model construction, we collected extremely large-scale benchmark datasets across 323 cell lines. Extensive computational experiments demonstrate that DeepCellEss yields effective prediction performance for different cell lines and outperforms existing sequence-based methods as well as network-based centrality measures. Finally, we conducted some case studies to illustrate the necessity of considering specific cell lines and the superiority of DeepCellEss. We believe that DeepCellEss can serve as a useful tool for predicting essential proteins across different cell lines. AVAILABILITY AND IMPLEMENTATION The DeepCellEss web server is available at http://csuligroup.com:8000/DeepCellEss. The source code and data underlying this study can be obtained from https://github.com/CSUBioGroup/DeepCellEss. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Yiming Li
- Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, China
| | - Min Zeng
- Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, China
| | - Fuhao Zhang
- Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, China
| | - Fang-Xiang Wu
- Division of Biomedical Engineering, Department of Computer Science, Department of Mechanical Engineering University of Saskatchewan, Saskatoon, SK S7N 5A9, Canada
| | - Min Li
- Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, China
| |
Collapse
|
5
|
Manzo M, Giordano M, Maddalena L, Guarracino MR, Granata I. Novel Data Science Methodologies for Essential Genes Identification Based on Network Analysis. STUDIES IN COMPUTATIONAL INTELLIGENCE 2023:117-145. [DOI: 10.1007/978-3-031-24453-7_7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 09/02/2023]
|
6
|
Mukul Das M, Sarkar K. Evaluation of machine learning classifiers for predicting essential genes in Mycobacterium tuberculosis strains. Bioinformation 2022; 18:1126-1130. [PMID: 37701504 PMCID: PMC10492903 DOI: 10.6026/973206300181126] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/01/2022] [Revised: 12/20/2022] [Accepted: 12/31/2022] [Indexed: 09/14/2023] Open
Abstract
Accurate investigation and prediction of essential genes from bacterial genome is very important as it might be explored in effective targets for antimicrobial drugs and understanding biological mechanism of a cell. A subset of key features data obtained from 14 genome sequence-based features of 20 strains of Mycobacterium tuberculosis bacteria whose essential gene information was downloaded from ePath and NCBI database for mapping and matching essential genes by using a genome extraction program. The selection of key features was performed by using Genetic Algorithm. For each of three classifiers, 80%, 10% and 10% of subset key features were used for training, validation and testing, respectively. Experimental results (10-f-cv) illustrated that DNN (proposed), DT, and SVM achieved AUC of 0.98, 0.88 and 0.82, respectively. DNN (proposed) outperformed DT and SVM. The higher prediction accuracy of classifiers was observed because of using only key features which also justified better generalizability of classifiers and efficiency of key features related to gene essentiality. Besides, DNN (proposed) also showed best prediction performance while compared with other predictors used in previous studies. The genome extraction program was developed for mapping and matching of essential genes between ePath and NCBI database.
Collapse
Affiliation(s)
- Monish Mukul Das
- Department of Computer Science and Engineering, University of Kalyani, Kalyani, Nadia - 741235
| | - Keka Sarkar
- Department of Microbiology, University of Kalyani, Kalyani, Nadia - 741235
| |
Collapse
|
7
|
LeBlanc N, Charles TC. Bacterial genome reductions: Tools, applications, and challenges. Front Genome Ed 2022; 4:957289. [PMID: 36120530 PMCID: PMC9473318 DOI: 10.3389/fgeed.2022.957289] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/30/2022] [Accepted: 07/29/2022] [Indexed: 11/16/2022] Open
Abstract
Bacterial cells are widely used to produce value-added products due to their versatility, ease of manipulation, and the abundance of genome engineering tools. However, the efficiency of producing these desired biomolecules is often hindered by the cells’ own metabolism, genetic instability, and the toxicity of the product. To overcome these challenges, genome reductions have been performed, making strains with the potential of serving as chassis for downstream applications. Here we review the current technologies that enable the design and construction of such reduced-genome bacteria as well as the challenges that limit their assembly and applicability. While genomic reductions have shown improvement of many cellular characteristics, a major challenge still exists in constructing these cells efficiently and rapidly. Computational tools have been created in attempts at minimizing the time needed to design these organisms, but gaps still exist in modelling these reductions in silico. Genomic reductions are a promising avenue for improving the production of value-added products, constructing chassis cells, and for uncovering cellular function but are currently limited by their time-consuming construction methods. With improvements to and the creation of novel genome editing tools and in silico models, these approaches could be combined to expedite this process and create more streamlined and efficient cell factories.
Collapse
Affiliation(s)
- Nicole LeBlanc
- Department of Biology, University of Waterloo, Waterloo, ON, Canada
- *Correspondence: Nicole LeBlanc,
| | - Trevor C. Charles
- Department of Biology, University of Waterloo, Waterloo, ON, Canada
- Metagenom Bio Life Science Inc., Waterloo, ON, Canada
| |
Collapse
|
8
|
Panditrao G, Bhowmick R, Meena C, Sarkar RR. Emerging landscape of molecular interaction networks: Opportunities, challenges and prospects. J Biosci 2022. [PMID: 36210749 PMCID: PMC9018971 DOI: 10.1007/s12038-022-00253-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022]
Abstract
Network biology finds application in interpreting molecular interaction networks and providing insightful inferences using graph theoretical analysis of biological systems. The integration of computational bio-modelling approaches with different hybrid network-based techniques provides additional information about the behaviour of complex systems. With increasing advances in high-throughput technologies in biological research, attempts have been made to incorporate this information into network structures, which has led to a continuous update of network biology approaches over time. The newly minted centrality measures accommodate the details of omics data and regulatory network structure information. The unification of graph network properties with classical mathematical and computational modelling approaches and technologically advanced approaches like machine-learning- and artificial intelligence-based algorithms leverages the potential application of these techniques. These computational advances prove beneficial and serve various applications such as essential gene prediction, identification of drug–disease interaction and gene prioritization. Hence, in this review, we have provided a comprehensive overview of the emerging landscape of molecular interaction networks using graph theoretical approaches. With the aim to provide information on the wide range of applications of network biology approaches in understanding the interaction and regulation of genes, proteins, enzymes and metabolites at different molecular levels, we have reviewed the methods that utilize network topological properties, emerging hybrid network-based approaches and applications that integrate machine learning techniques to analyse molecular interaction networks. Further, we have discussed the applications of these approaches in biomedical research with a note on future prospects.
Collapse
Affiliation(s)
- Gauri Panditrao
- Chemical Engineering and Process Development Division, CSIR-National Chemical Laboratory, Pune, 411008 India
| | - Rupa Bhowmick
- Chemical Engineering and Process Development Division, CSIR-National Chemical Laboratory, Pune, 411008 India
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, 201002 India
| | - Chandrakala Meena
- Chemical Engineering and Process Development Division, CSIR-National Chemical Laboratory, Pune, 411008 India
| | - Ram Rup Sarkar
- Chemical Engineering and Process Development Division, CSIR-National Chemical Laboratory, Pune, 411008 India
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, 201002 India
| |
Collapse
|
9
|
Baisya D, Ramesh A, Schwartz C, Lonardi S, Wheeldon I. Genome-wide functional screens enable the prediction of high activity CRISPR-Cas9 and -Cas12a guides in Yarrowia lipolytica. Nat Commun 2022; 13:922. [PMID: 35177617 PMCID: PMC8854577 DOI: 10.1038/s41467-022-28540-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/01/2021] [Accepted: 02/01/2022] [Indexed: 12/15/2022] Open
Abstract
Genome-wide functional genetic screens have been successful in discovering genotype-phenotype relationships and in engineering new phenotypes. While broadly applied in mammalian cell lines and in E. coli, use in non-conventional microorganisms has been limited, in part, due to the inability to accurately design high activity CRISPR guides in such species. Here, we develop an experimental-computational approach to sgRNA design that is specific to an organism of choice, in this case the oleaginous yeast Yarrowia lipolytica. A negative selection screen in the absence of non-homologous end-joining, the dominant DNA repair mechanism, was used to generate single guide RNA (sgRNA) activity profiles for both SpCas9 and LbCas12a. This genome-wide data served as input to a deep learning algorithm, DeepGuide, that is able to accurately predict guide activity. DeepGuide uses unsupervised learning to obtain a compressed representation of the genome, followed by supervised learning to map sgRNA sequence, genomic context, and epigenetic features with guide activity. Experimental validation, both genome-wide and with a subset of selected genes, confirms DeepGuide's ability to accurately predict high activity sgRNAs. DeepGuide provides an organism specific predictor of CRISPR guide activity that with retraining could be applied to other fungal species, prokaryotes, and other non-conventional organisms.
Collapse
Affiliation(s)
- Dipankar Baisya
- Department of Computer Science and Engineering, University of California, Riverside, CA, 92521, USA
| | - Adithya Ramesh
- Department of Chemical and Environmental Engineering, University of California, Riverside, CA, 92521, USA
| | - Cory Schwartz
- Department of Chemical and Environmental Engineering, University of California, Riverside, CA, 92521, USA
- iBio Inc., San Diego, CA, USA
| | - Stefano Lonardi
- Department of Computer Science and Engineering, University of California, Riverside, CA, 92521, USA.
- Integrative Institute for Genome Biology, University of California, Riverside, CA, 92521, USA.
| | - Ian Wheeldon
- Department of Chemical and Environmental Engineering, University of California, Riverside, CA, 92521, USA.
- Integrative Institute for Genome Biology, University of California, Riverside, CA, 92521, USA.
- Center for Industrial Biotechnology, University of California, Riverside, CA, 92521, USA.
| |
Collapse
|
10
|
Liu X, Luo Y, He T, Ren M, Xu Y. Predicting essential genes of 37 prokaryotes by combining information-theoretic features. J Microbiol Methods 2021; 188:106297. [PMID: 34343487 DOI: 10.1016/j.mimet.2021.106297] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2021] [Revised: 07/30/2021] [Accepted: 07/30/2021] [Indexed: 10/20/2022]
Abstract
Essential genes are required for the reproduction and survival of an organism. Rapid identification of essential genes has practical application value in biomedicine. Information theory is a discipline that studies information transmission. Based on the similarity between heredity and information transmission, measures derived from information theory can be applied to genetic sequence analysis on different scales. In this study, we employed 114 features extracted by information theory methods to construct an essential gene prediction model. We applied a backpropagation neural network to construct a classifier and employed it to predict essential genes of 37 prokaryotes. The performance of the classifier was evaluated by applying intra-organism prediction and leave-one-species-out prediction. Among 37 prokaryotes, intra-organism prediction and leave-one-species-out prediction yielded average AUC scores of 0.791 and 0.717, respectively. Considering the potential redundancy in the feature set, we performed feature selection and constructed a key feature subset. In the above two prediction methods, the average AUC scores of 37 organisms obtained by using key features were 0.786 and 0.714, respectively. The results show the potential and universality of information-theoretic features in the study of prokaryotic essential gene prediction.
Collapse
Affiliation(s)
- Xiao Liu
- School of Microelectronics and Communication Engineering, Chongqing University, 174 ShaPingBa District, Chongqing 400044, China.
| | - Yachuan Luo
- School of Microelectronics and Communication Engineering, Chongqing University, 174 ShaPingBa District, Chongqing 400044, China
| | - Ting He
- School of Microelectronics and Communication Engineering, Chongqing University, 174 ShaPingBa District, Chongqing 400044, China
| | - Meixiang Ren
- School of Microelectronics and Communication Engineering, Chongqing University, 174 ShaPingBa District, Chongqing 400044, China
| | - Yuqiao Xu
- School of Microelectronics and Communication Engineering, Chongqing University, 174 ShaPingBa District, Chongqing 400044, China
| |
Collapse
|
11
|
Hybrid Deep Learning Models with Sparse Enhancement Technique for Detection of Newly Grown Tree Leaves. SENSORS 2021; 21:s21062077. [PMID: 33809537 PMCID: PMC8001602 DOI: 10.3390/s21062077] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 01/21/2021] [Revised: 03/04/2021] [Accepted: 03/12/2021] [Indexed: 12/21/2022]
Abstract
The life cycle of leaves, from sprout to senescence, is the phenomenon of regular changes such as budding, branching, leaf spreading, flowering, fruiting, leaf fall, and dormancy due to seasonal climate changes. It is the effect of temperature and moisture in the life cycle on physiological changes, so the detection of newly grown leaves (NGL) is helpful for the estimation of tree growth and even climate change. This study focused on the detection of NGL based on deep learning convolutional neural network (CNN) models with sparse enhancement (SE). As the NGL areas found in forest images have similar sparse characteristics, we used a sparse image to enhance the signal of the NGL. The difference between the NGL and the background could be further improved. We then proposed hybrid CNN models that combined U-net and SegNet features to perform image segmentation. As the NGL in the image were relatively small and tiny targets, in terms of data characteristics, they also belonged to the problem of imbalanced data. Therefore, this paper further proposed 3-Layer SegNet, 3-Layer U-SegNet, 2-Layer U-SegNet, and 2-Layer Conv-U-SegNet architectures to reduce the pooling degree of traditional semantic segmentation models, and used a loss function to increase the weight of the NGL. According to the experimental results, our proposed algorithms were indeed helpful for the image segmentation of NGL and could achieve better kappa results by 0.743.
Collapse
|
12
|
Nandi S, Ganguli P, Sarkar RR. Essential gene prediction using limited gene essentiality information-An integrative semi-supervised machine learning strategy. PLoS One 2020; 15:e0242943. [PMID: 33253254 PMCID: PMC7703937 DOI: 10.1371/journal.pone.0242943] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/10/2020] [Accepted: 11/12/2020] [Indexed: 11/24/2022] Open
Abstract
Essential gene prediction helps to find minimal genes indispensable for the survival of any organism. Machine learning (ML) algorithms have been useful for the prediction of gene essentiality. However, currently available ML pipelines perform poorly for organisms with limited experimental data. The objective is the development of a new ML pipeline to help in the annotation of essential genes of less explored disease-causing organisms for which minimal experimental data is available. The proposed strategy combines unsupervised feature selection technique, dimension reduction using the Kamada-Kawai algorithm, and semi-supervised ML algorithm employing Laplacian Support Vector Machine (LapSVM) for prediction of essential and non-essential genes from genome-scale metabolic networks using very limited labeled dataset. A novel scoring technique, Semi-Supervised Model Selection Score, equivalent to area under the ROC curve (auROC), has been proposed for the selection of the best model when supervised performance metrics calculation is difficult due to lack of data. The unsupervised feature selection followed by dimension reduction helped to observe a distinct circular pattern in the clustering of essential and non-essential genes. LapSVM then created a curve that dissected this circle for the classification and prediction of essential genes with high accuracy (auROC > 0.85) even with 1% labeled data for model training. After successful validation of this ML pipeline on both Eukaryotes and Prokaryotes that show high accuracy even when the labeled dataset is very limited, this strategy is used for the prediction of essential genes of organisms with inadequate experimentally known data, such as Leishmania sp. Using a graph-based semi-supervised machine learning scheme, a novel integrative approach has been proposed for essential gene prediction that shows universality in application to both Prokaryotes and Eukaryotes with limited labeled data. The essential genes predicted using the pipeline provide an important lead for the prediction of gene essentiality and identification of novel therapeutic targets for antibiotic and vaccine development against disease-causing parasites.
Collapse
Affiliation(s)
- Sutanu Nandi
- Chemical Engineering and Process Development, CSIR-National Chemical Laboratory, Pune, Maharashtra, India
- Academy of Scientific & Innovative Research (AcSIR), Ghaziabad, India
| | - Piyali Ganguli
- Chemical Engineering and Process Development, CSIR-National Chemical Laboratory, Pune, Maharashtra, India
- Academy of Scientific & Innovative Research (AcSIR), Ghaziabad, India
| | - Ram Rup Sarkar
- Chemical Engineering and Process Development, CSIR-National Chemical Laboratory, Pune, Maharashtra, India
- Academy of Scientific & Innovative Research (AcSIR), Ghaziabad, India
| |
Collapse
|
13
|
Le NQK, Do DT, Hung TNK, Lam LHT, Huynh TT, Nguyen NTK. A Computational Framework Based on Ensemble Deep Neural Networks for Essential Genes Identification. Int J Mol Sci 2020; 21:E9070. [PMID: 33260643 PMCID: PMC7730808 DOI: 10.3390/ijms21239070] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/04/2020] [Revised: 11/25/2020] [Accepted: 11/26/2020] [Indexed: 01/13/2023] Open
Abstract
Essential genes contain key information of genomes that could be the key to a comprehensive understanding of life and evolution. Because of their importance, studies of essential genes have been considered a crucial problem in computational biology. Computational methods for identifying essential genes have become increasingly popular to reduce the cost and time-consumption of traditional experiments. A few models have addressed this problem, but performance is still not satisfactory because of high dimensional features and the use of traditional machine learning algorithms. Thus, there is a need to create a novel model to improve the predictive performance of this problem from DNA sequence features. This study took advantage of a natural language processing (NLP) model in learning biological sequences by treating them as natural language words. To learn the NLP features, a supervised learning model was consequentially employed by an ensemble deep neural network. Our proposed method could identify essential genes with sensitivity, specificity, accuracy, Matthews correlation coefficient (MCC), and area under the receiver operating characteristic curve (AUC) values of 60.2%, 84.6%, 76.3%, 0.449, and 0.814, respectively. The overall performance outperformed the single models without ensemble, as well as the state-of-the-art predictors on the same benchmark dataset. This indicated the effectiveness of the proposed method in determining essential genes, in particular, and other sequencing problems, in general.
Collapse
Affiliation(s)
- Nguyen Quoc Khanh Le
- Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei 106, Taiwan
- Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei 106, Taiwan
- Translational Imaging Research Center, Taipei Medical University Hospital, Taipei 110, Taiwan
| | - Duyen Thi Do
- Graduate Institute of Biomedical Informatics, Taipei Medical University, Taipei 106, Taiwan;
| | - Truong Nguyen Khanh Hung
- International Master/Ph.D. Program in Medicine, College of Medicine, Taipei Medical University, Taipei 110, Taiwan; (T.N.K.H.); (L.H.T.L.)
- Department of Orthopedic and Trauma, Cho Ray Hospital, Ho Chi Minh 70000, Vietnam
| | - Luu Ho Thanh Lam
- International Master/Ph.D. Program in Medicine, College of Medicine, Taipei Medical University, Taipei 110, Taiwan; (T.N.K.H.); (L.H.T.L.)
- Intensive Care Unit, Children’s Hospital 2, Ho Chi Minh 70000, Vietnam
| | - Tuan-Tu Huynh
- Department of Electrical Engineering, Yuan Ze University, Taoyuan 320, Taiwan;
- Department of Electrical Electronic and Mechanical Engineering, Lac Hong University, Dong Nai 76120, Vietnam
| | - Ngan Thi Kim Nguyen
- School of Nutrition and Health Sciences, Taipei Medical University, Taipei 110, Taiwan;
| |
Collapse
|
14
|
Yoon BJ, Qian X, Kahveci T, Pal R. Selected Research Articles from the 2019 International Workshop on Computational Network Biology: Modeling, Analysis, and Control (CNB-MAC). BMC Genomics 2020; 21:584. [PMID: 32900374 PMCID: PMC7487676 DOI: 10.1186/s12864-020-06934-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022] Open
Affiliation(s)
- Byung-Jun Yoon
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, 77843, USA.
- TEES-AgriLife Center for Bioinformatics and Genomic Systems Engineering (CBGSE), College Station, TX, 77845, USA.
- Computational Science Initiative, Brookhaven National Laboratory, Upton, NY, 11973, USA.
| | - Xiaoning Qian
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, 77843, USA
- TEES-AgriLife Center for Bioinformatics and Genomic Systems Engineering (CBGSE), College Station, TX, 77845, USA
| | - Tamer Kahveci
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, 32611-6125, USA
| | - Ranadip Pal
- Department of Electrical and Computer Engineering, Texas Tech University, Lubbock, TX, 79409-3102, USA
| |
Collapse
|