1
Xu T, Wang S, Ma T, Dong Y, Ashby CR, Hao GF. The identification of essential cellular genes is critical for validating drug targets. Drug Discov Today 2024; 29:104215. [PMID: 39428084 DOI: 10.1016/j.drudis.2024.104215] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2024] [Revised: 10/06/2024] [Accepted: 10/15/2024] [Indexed: 10/22/2024]
Abstract
Accurately identifying biological targets is crucial for advancing treatment options. Essential genes, vital for cell or organism survival, hold promise as potential drug targets in disease treatment. Although many studies have sought to identify essential genes as therapeutic targets in medicine and bioinformatics, systematic reviews on their relationship with drug targets are relatively rare. This work presents a comprehensive analysis to aid in identifying essential genes as potential targets for drug discovery, encompassing their relevance, identification methods, successful case studies, and challenges. This work will facilitate the identification of essential genes as therapeutic targets, thereby boosting new drug development.
Affiliation(s)
- Ting Xu
- School of Pharmaceutical Sciences, Guizhou Engineering Laboratory for Synthetic Drugs, Guizhou University, Guiyang 550025, China
- Shuang Wang
- State Key Laboratory of Green Pesticide, Key Laboratory of Green Pesticide and Agricultural Bioengineering, Ministry of Education, Center for R&D of Fine Chemicals, Guizhou University, Guiyang 550025, China
- Tingting Ma
- School of Pharmaceutical Sciences, Guizhou Engineering Laboratory for Synthetic Drugs, Guizhou University, Guiyang 550025, China
- Yawen Dong
- School of Pharmaceutical Sciences, Guizhou Engineering Laboratory for Synthetic Drugs, Guizhou University, Guiyang 550025, China
- Charles R Ashby
- Department of Pharmaceutical Sciences, St. John's University, New York, NY, USA
- Ge-Fei Hao
- State Key Laboratory of Green Pesticide, Key Laboratory of Green Pesticide and Agricultural Bioengineering, Ministry of Education, Center for R&D of Fine Chemicals, Guizhou University, Guiyang 550025, China
2
Granata I, Maddalena L, Manzo M, Guarracino MR, Giordano M. HELP: A computational framework for labelling and predicting human common and context-specific essential genes. PLoS Comput Biol 2024; 20:e1012076. [PMID: 39331694 PMCID: PMC11463781 DOI: 10.1371/journal.pcbi.1012076] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/12/2024] [Revised: 10/09/2024] [Accepted: 08/19/2024] [Indexed: 09/29/2024] Open
Abstract
Machine learning-based approaches are particularly suitable for identifying essential genes as they allow the generation of predictive models trained on features from multi-source data. Gene essentiality is neither binary nor static but determined by the context. The databases for essential gene annotation do not permit the personalisation of the context, and their update can be slower than the publication of new experimental data. We propose HELP (Human Gene Essentiality Labelling & Prediction), a computational framework for labelling and predicting essential genes. Its double scope allows for identifying genes based on dependency or not on experimental data. The effectiveness of the labelling method was demonstrated by comparing it with other approaches in overlapping the reference sets of essential gene annotations, where HELP demonstrated the best compromise between false and true positive rates. The gene attributes, including multi-omics and network embedding features, lead to high-performance prediction of essential genes while confirming the existence of essentiality nuances.
Affiliation(s)
- Ilaria Granata
- Institute for High-Performance Computing and Networking, National Research Council, Naples, Italy
- Lucia Maddalena
- Institute for High-Performance Computing and Networking, National Research Council, Naples, Italy
- Mario Manzo
- Information Technology Services, University of Naples “L’Orientale”, Naples, Italy
- Mario Rosario Guarracino
- Laboratory of Algorithms and Technologies for Network Analysis, National Research University Higher School of Economics, Nizhny Novgorod, Russia
- Department of Economics and Law, University of Cassino and Southern Lazio, Cassino, Frosinone, Italy
- Maurizio Giordano
- Institute for High-Performance Computing and Networking, National Research Council, Naples, Italy
3
Hasibi R, Michoel T, Oyarzún DA. Integration of graph neural networks and genome-scale metabolic models for predicting gene essentiality. NPJ Syst Biol Appl 2024; 10:24. [PMID: 38448436 PMCID: PMC10917767 DOI: 10.1038/s41540-024-00348-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2023] [Accepted: 02/08/2024] [Indexed: 03/08/2024] Open
Abstract
Genome-scale metabolic models are powerful tools for understanding cellular physiology. Flux balance analysis (FBA), in particular, is an optimization-based approach widely employed for predicting metabolic phenotypes. In model microbes such as Escherichia coli, FBA has been successful at predicting essential genes, i.e. those genes that impair survival when deleted. A central assumption in this approach is that both wild type and deletion strains optimize the same fitness objective. Although the optimality assumption may hold for the wild type metabolic network, deletion strains are not subject to the same evolutionary pressures and knock-out mutants may steer their metabolism to meet other objectives for survival. Here, we present FlowGAT, a hybrid FBA-machine learning strategy for predicting essentiality directly from wild type metabolic phenotypes. The approach is based on graph-structured representation of metabolic fluxes predicted by FBA, where nodes correspond to enzymatic reactions and edges quantify the propagation of metabolite mass flow between a reaction and its neighbours. We integrate this information into a graph neural network that can be trained on knock-out fitness assay data. Comparisons across different model architectures reveal that FlowGAT predictions for E. coli are close to those of FBA for several growth conditions. This suggests that essentiality of enzymatic genes can be predicted by exploiting the inherent network structure of metabolism. Our approach demonstrates the benefits of combining the mechanistic insights afforded by genome-scale models with the ability of deep learning to infer patterns from complex datasets.
Affiliation(s)
- Ramin Hasibi
- Computational Biology Unit, Department of Informatics, University of Bergen, Bergen, Norway
- Tom Michoel
- Computational Biology Unit, Department of Informatics, University of Bergen, Bergen, Norway
- Diego A Oyarzún
- School of Biological Sciences, University of Edinburgh, Edinburgh, UK
- School of Informatics, University of Edinburgh, Edinburgh, UK
4
Liang Y, Luo H, Lin Y, Gao F. Recent advances in the characterization of essential genes and development of a database of essential genes. IMETA 2024; 3:e157. [PMID: 38868518 PMCID: PMC10989110 DOI: 10.1002/imt2.157] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/29/2023] [Accepted: 10/09/2023] [Indexed: 06/14/2024]
Abstract
Over the past few decades, there has been a significant interest in the study of essential genes, which are crucial for the survival of an organism under specific environmental conditions and thus have practical applications in the fields of synthetic biology and medicine. An increasing amount of experimental data on essential genes has been obtained with the continuous development of technological methods. Meanwhile, various computational prediction methods, related databases and web servers have emerged accordingly. To facilitate the study of essential genes, we have established a database of essential genes (DEG), which has become popular with continuous updates to facilitate essential gene feature analysis and prediction, drug and vaccine development, as well as artificial genome design and construction. In this article, we summarized the studies of essential genes, overviewed the relevant databases, and discussed their practical applications. Furthermore, we provided an overview of the main applications of DEG and conducted comprehensive analyses based on its latest version. However, it should be noted that the essential gene is a dynamic concept instead of a binary one, which presents both opportunities and challenges for their future development.
Affiliation(s)
- Hao Luo
- Department of Physics, Tianjin University, Tianjin, China
- Yan Lin
- Department of Physics, Tianjin University, Tianjin, China
- Feng Gao
- Department of Physics, Tianjin University, Tianjin, China
- Frontiers Science Center for Synthetic Biology and Key Laboratory of Systems Bioengineering (Ministry of Education), Tianjin University, Tianjin, China
- SynBio Research Platform, Collaborative Innovation Center of Chemical Science and Engineering (Tianjin), Tianjin, China
5
Ye C, Wu Q, Chen S, Zhang X, Xu W, Wu Y, Zhang Y, Yue Y. ECDEP: identifying essential proteins based on evolutionary community discovery and subcellular localization. BMC Genomics 2024; 25:117. [PMID: 38279081 PMCID: PMC10821549 DOI: 10.1186/s12864-024-10019-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/07/2023] [Accepted: 01/15/2024] [Indexed: 01/28/2024] Open
Abstract
BACKGROUND In cellular activities, essential proteins play a vital role and are instrumental in comprehending fundamental biological necessities and identifying pathogenic genes. Current deep learning approaches for predicting essential proteins underutilize the potential of gene expression data and are inadequate for the exploration of dynamic networks with limited evaluation across diverse species. RESULTS We introduce ECDEP, an essential protein identification model based on evolutionary community discovery. ECDEP integrates temporal gene expression data with a protein-protein interaction (PPI) network and employs the 3-Sigma rule to eliminate outliers at each time point, constructing a dynamic network. Next, we utilize edge birth and death information to establish an interaction streaming source to feed into the evolutionary community discovery algorithm and then identify overlapping communities during the evolution of the dynamic network. SVM recursive feature elimination (RFE) is applied to extract the most informative communities, which are combined with subcellular localization data for classification predictions. We assess the performance of ECDEP by comparing it against ten centrality methods, four shallow machine learning methods with RFE, and two deep learning methods that incorporate multiple biological data sources on Saccharomyces cerevisiae (S. cerevisiae), Homo sapiens (H. sapiens), Mus musculus, and Caenorhabditis elegans. ECDEP achieves an AP value of 0.86 on the H. sapiens dataset and the contribution ratio of community features in classification reaches 0.54 on the S. cerevisiae (Krogan) dataset. CONCLUSIONS Our proposed method adeptly integrates network dynamics and yields outstanding results across various datasets. Furthermore, the incorporation of evolutionary community discovery algorithms amplifies the capacity of gene expression data in classification.
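As an illustration of the dynamic-network step summarised in the abstract above, the following minimal Python sketch marks a gene as active at a time point when its expression exceeds its own mean plus k standard deviations (k = 3 by default), keeps a PPI edge at a time point only when both endpoints are active, and derives edge birth and death events from consecutive snapshots. The abstract does not give the exact thresholding used by ECDEP, so this activity rule, the function names, and the toy data are assumptions and not the authors' implementation.

```python
from statistics import mean, pstdev


def active_timepoints(expression, k=3.0):
    """Time indices where a gene counts as active.

    Assumption: active means expression > mean + k * std of the gene's
    own time series (a simplified k-sigma rule, not ECDEP's exact rule).
    """
    threshold = mean(expression) + k * pstdev(expression)
    return {t for t, x in enumerate(expression) if x > threshold}


def dynamic_network(edges, expression_by_gene, k=3.0):
    """One edge list per time point; an edge survives at time t only
    when both of its endpoints are active at t."""
    active = {g: active_timepoints(x, k) for g, x in expression_by_gene.items()}
    n_times = len(next(iter(expression_by_gene.values())))
    snapshots = [[] for _ in range(n_times)]
    for u, v in edges:
        for t in sorted(active.get(u, set()) & active.get(v, set())):
            snapshots[t].append((u, v))
    return snapshots


def edge_events(snapshots):
    """Edge birth/death events between consecutive snapshots."""
    events, previous = [], set()
    for t, edges in enumerate(snapshots):
        current = set(edges)
        events += [(t, "birth", e) for e in sorted(current - previous)]
        events += [(t, "death", e) for e in sorted(previous - current)]
        previous = current
    return events


if __name__ == "__main__":
    # Toy data (hypothetical): genes A and B spike together at the last
    # time point, gene C is flat and therefore never active.
    expression = {
        "A": [1.0] * 11 + [20.0],
        "B": [2.0] * 11 + [30.0],
        "C": [5.0] * 12,
    }
    snapshots = dynamic_network([("A", "B"), ("A", "C")], expression)
    print(edge_events(snapshots))
```

With population standard deviation, a single spike in a series of n points has a z-score of at most sqrt(n - 1), so a strict 3-sigma threshold can only fire on series longer than ten points; shorter series would need a smaller k.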
Affiliation(s)
- Chen Ye
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui, 230036, China
- Anhui Beidou Precision Agriculture Information Engineering Research Center, Anhui Agricultural University, Hefei, 230036, China
- Qi Wu
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui, 230036, China
- Anhui Beidou Precision Agriculture Information Engineering Research Center, Anhui Agricultural University, Hefei, 230036, China
- Shuxia Chen
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui, 230036, China
- Anhui Beidou Precision Agriculture Information Engineering Research Center, Anhui Agricultural University, Hefei, 230036, China
- Xuemei Zhang
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui, 230036, China
- Anhui Beidou Precision Agriculture Information Engineering Research Center, Anhui Agricultural University, Hefei, 230036, China
- Wenwen Xu
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui, 230036, China
- Anhui Beidou Precision Agriculture Information Engineering Research Center, Anhui Agricultural University, Hefei, 230036, China
- Yunzhi Wu
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui, 230036, China
- Anhui Beidou Precision Agriculture Information Engineering Research Center, Anhui Agricultural University, Hefei, 230036, China
- Youhua Zhang
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui, 230036, China
- Anhui Beidou Precision Agriculture Information Engineering Research Center, Anhui Agricultural University, Hefei, 230036, China
- Yi Yue
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui, 230036, China
- Anhui Beidou Precision Agriculture Information Engineering Research Center, Anhui Agricultural University, Hefei, 230036, China
6
Rout RK, Umer S, Khandelwal M, Pati S, Mallik S, Balabantaray BK, Qin H. Identification of discriminant features from stationary pattern of nucleotide bases and their application to essential gene classification. Front Genet 2023; 14:1154120. [PMID: 37152988 PMCID: PMC10156977 DOI: 10.3389/fgene.2023.1154120] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 01/30/2023] [Accepted: 04/04/2023] [Indexed: 05/09/2023] Open
Abstract
Introduction: Essential genes are essential for the survival of various species. These genes are a family linked to critical cellular activities for species survival. These genes are coded for proteins that regulate central metabolism, gene translation, deoxyribonucleic acid replication, and fundamental cellular structure and facilitate intracellular and extracellular transport. Essential genes preserve crucial genomics information that may hold the key to a detailed knowledge of life and evolution. Essential gene studies have long been regarded as a vital topic in computational biology due to their relevance. An essential gene is composed of adenine, guanine, cytosine, and thymine and its various combinations. Methods: This paper presents a novel method of extracting information on the stationary patterns of nucleotides such as adenine, guanine, cytosine, and thymine in each gene. For this purpose, some co-occurrence matrices are derived that provide the statistical distribution of stationary patterns of nucleotides in the genes, which is helpful in establishing the relationship between the nucleotides. For extracting discriminant features from each co-occurrence matrix, energy, entropy, homogeneity, contrast, and dissimilarity features are computed, which are extracted from all co-occurrence matrices and then concatenated to form a feature vector representing each essential gene. Finally, supervised machine learning algorithms are applied for essential gene classification based on the extracted fixed-dimensional feature vectors. Results: For comparison, some existing state-of-the-art feature representation techniques such as Shannon entropy (SE), Hurst exponent (HE), fractal dimension (FD), and their combinations have been utilized. Discussion: An extensive experiment has been performed for classifying the essential genes of five species that show the robustness and effectiveness of the proposed methodology.
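The feature pipeline described in the Methods portion of the abstract above (co-occurrence matrices, then energy, entropy, homogeneity, contrast, and dissimilarity, concatenated into one vector per gene) can be sketched in Python as follows. The abstract does not define how the paper derives its "stationary patterns", so this sketch assumes plain nucleotide pair counts at a fixed offset, the index order A, C, G, T, and the standard grey-level co-occurrence (Haralick-style) feature formulas; all function names are illustrative.

```python
from math import log2

ALPHABET = "ACGT"  # assumed index order for the matrix rows and columns
FEATURE_NAMES = ("energy", "entropy", "homogeneity", "contrast", "dissimilarity")


def cooccurrence(seq, offset=1):
    """Normalised 4x4 matrix of nucleotide pairs (seq[i], seq[i + offset]).

    Pairs containing a symbol outside ACGT (for example N) are skipped.
    """
    index = {base: i for i, base in enumerate(ALPHABET)}
    counts = [[0] * 4 for _ in range(4)]
    total = 0
    for a, b in zip(seq, seq[offset:]):
        if a in index and b in index:
            counts[index[a]][index[b]] += 1
            total += 1
    if total == 0:
        return [[0.0] * 4 for _ in range(4)]
    return [[c / total for c in row] for row in counts]


def texture_features(p):
    """Five standard co-occurrence features of a normalised matrix p."""
    cells = [(i, j, p[i][j]) for i in range(4) for j in range(4)]
    return {
        "energy": sum(v * v for _, _, v in cells),
        "entropy": -sum(v * log2(v) for _, _, v in cells if v > 0),
        "homogeneity": sum(v / (1 + (i - j) ** 2) for i, j, v in cells),
        "contrast": sum((i - j) ** 2 * v for i, j, v in cells),
        "dissimilarity": sum(abs(i - j) * v for i, j, v in cells),
    }


def feature_vector(seq, offsets=(1, 2, 3)):
    """Concatenate the five features over several offsets (one matrix each)."""
    vector = []
    for d in offsets:
        feats = texture_features(cooccurrence(seq.upper(), d))
        vector.extend(feats[name] for name in FEATURE_NAMES)
    return vector


if __name__ == "__main__":
    # Hypothetical short sequence; real genes would be far longer.
    print(feature_vector("ATGGCGTACGTTAGC"))
```

The resulting fixed-length vector (five features per offset) is the kind of input a supervised classifier would consume; the choice of offsets here is arbitrary.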
Affiliation(s)
- Ranjeet Kumar Rout
- National Institute of Technology Srinagar, Hazratbal, Jammu and Kashmir, India
- Saiyed Umer
- Aliah University, Kolkata, West Bengal, India
- Monika Khandelwal
- National Institute of Technology Srinagar, Hazratbal, Jammu and Kashmir, India
- Smitarani Pati
- Dr. B R Ambedkar National Institute of Technology Jalandhar, Jalandhar, Punjab, India
- Saurav Mallik
- Harvard T H Chan School of Public Health, Boston, United States
- Department of Pharmacology and Toxicology, University of Arizona, Tucson, AZ, United States
- *Correspondence: Saurav Mallik; Hong Qin
- Hong Qin
- Department of Computer Science and Engineering, University of Tennessee at Chattanooga, Chattanooga, TN, United States
7
Li Y, Zeng M, Zhang F, Wu FX, Li M. DeepCellEss: cell line-specific essential protein prediction with attention-based interpretable deep learning. Bioinformatics 2023; 39:btac779. [PMID: 36458923 PMCID: PMC9825760 DOI: 10.1093/bioinformatics/btac779] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 06/17/2022] [Revised: 11/25/2022] [Accepted: 12/01/2022] [Indexed: 12/05/2022] Open
Abstract
MOTIVATION Protein essentiality is usually accepted to be a conditional trait and strongly affected by cellular environments. However, existing computational methods often do not take such characteristics into account, preferring to incorporate all available data and train a general model for all cell lines. In addition, the lack of model interpretability limits further exploration and analysis of essential protein predictions. RESULTS In this study, we proposed DeepCellEss, a sequence-based interpretable deep learning framework for cell line-specific essential protein predictions. DeepCellEss utilizes a convolutional neural network and bidirectional long short-term memory to learn short- and long-range latent information from protein sequences. Further, a multi-head self-attention mechanism is used to provide residue-level model interpretability. For model construction, we collected extremely large-scale benchmark datasets across 323 cell lines. Extensive computational experiments demonstrate that DeepCellEss yields effective prediction performance for different cell lines and outperforms existing sequence-based methods as well as network-based centrality measures. Finally, we conducted some case studies to illustrate the necessity of considering specific cell lines and the superiority of DeepCellEss. We believe that DeepCellEss can serve as a useful tool for predicting essential proteins across different cell lines. AVAILABILITY AND IMPLEMENTATION The DeepCellEss web server is available at http://csuligroup.com:8000/DeepCellEss. The source code and data underlying this study can be obtained from https://github.com/CSUBioGroup/DeepCellEss. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yiming Li
- Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Min Zeng
- Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Fuhao Zhang
- Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Fang-Xiang Wu
- Division of Biomedical Engineering, Department of Computer Science, Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A9, Canada
- Min Li
- Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, China
8
Manzo M, Giordano M, Maddalena L, Guarracino MR, Granata I. Novel Data Science Methodologies for Essential Genes Identification Based on Network Analysis. STUDIES IN COMPUTATIONAL INTELLIGENCE 2023:117-145. [DOI: 10.1007/978-3-031-24453-7_7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/02/2023]
9
Ismail E, Gad W, Hashem M. HEC-ASD: a hybrid ensemble-based classification model for predicting autism spectrum disorder disease genes. BMC Bioinformatics 2022; 23:554. [PMID: 36544099 PMCID: PMC9768984 DOI: 10.1186/s12859-022-05099-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2022] [Accepted: 12/06/2022] [Indexed: 12/24/2022] Open
Abstract
PURPOSE Autism spectrum disorder (ASD) is the most prevalent disease today. The causes of its infection may be attributed to genetic causes by 80% and environmental causes by 20%. In spite of this, the majority of the current research is concerned with environmental causes, and the least proportion with the genetic causes of the disease. Autism is a complex disease, which makes it difficult to identify the genes that cause the disease. METHODS Hybrid ensemble-based classification (HEC-ASD) model for predicting ASD genes using gradient boosting machines is proposed. The proposed model utilizes gene ontology (GO) to construct a gene functional similarity matrix using hybrid gene similarity (HGS) method. HGS measures the semantic similarity between genes effectively. It combines the graph-based method, such as Wang method with the number of directed children's nodes of gene term from GO. Moreover, an ensemble gradient boosting classifier is adapted to enhance the prediction of genes forming a robust classification model. RESULTS The proposed model is evaluated using the Simons Foundation Autism Research Initiative (SFARI) gene database. The experimental results are promising as they improve the classification performance for predicting ASD genes. The results are compared with other approaches that used gene regulatory network (GRN), protein to protein interaction network (PPI), or GO. The HEC-ASD model reaches the highest prediction accuracy of 0.88% using ensemble learning classifiers. CONCLUSION The proposed model demonstrates that ensemble learning technique using gradient boosting is effective in predicting autism spectrum disorder genes. Moreover, the HEC-ASD model utilized GO rather than using PPI network and GRN.
Affiliation(s)
- Eman Ismail
- Information Systems Department, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt
- Walaa Gad
- Information Systems Department, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt
- Mohamed Hashem
- Information Systems Department, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt
10
Benstead-Hume G, Wooller SK, Renaut J, Dias S, Woodbine L, Carr AM, Pearl FMG. Biological network topology features predict gene dependencies in cancer cell-lines. BIOINFORMATICS ADVANCES 2022; 2:vbac084. [PMID: 36699394 PMCID: PMC9681200 DOI: 10.1093/bioadv/vbac084] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/24/2022] [Revised: 10/02/2022] [Accepted: 11/08/2022] [Indexed: 11/12/2022]
Abstract
Motivation Protein-protein interaction (PPI) networks have been shown to successfully predict essential proteins. However, such networks are derived generically from experiments on many thousands of different cells. Consequently, conventional PPI networks cannot capture the variation of genetic dependencies that exists across different cell types, let alone those that emerge as a result of the massive cell restructuring that occurs during carcinogenesis. Predicting cell-specific dependencies is of considerable therapeutic benefit, facilitating the use of drugs to inhibit those proteins on which the cancer cells have become specifically dependent. In order to go beyond the limitations of the generic PPI, we have attempted to personalise PPI networks to reflect cell-specific patterns of gene expression and mutation. By using 12 topological features of the resulting PPIs, together with matched gene dependency data from DepMap, we trained random-forest classifiers (DependANT) to predict novel gene dependencies. Results We found that DependANT improves the power of the baseline generic PPI models in predicting common gene dependencies, by up to 10.8% and is more sensitive than the baseline generic model when predicting genes on which only a small number of cell types are dependent. Availability and implementation Software available at https://bitbucket.org/bioinformatics_lab_sussex/dependant2. Supplementary information Supplementary data are available at Bioinformatics Advances online.
Affiliation(s)
- Joanna Renaut
- Bioinformatics Lab, School of Life Sciences, University of Sussex, Brighton BN1 9QJ, UK
- Samantha Dias
- Genome Damage and Stability Centre, University of Sussex, Brighton BN1 9RQ, UK
- Lisa Woodbine
- Genome Damage and Stability Centre, University of Sussex, Brighton BN1 9RQ, UK
- Antony M Carr
- Genome Damage and Stability Centre, University of Sussex, Brighton BN1 9RQ, UK
11
Yue Y, Ye C, Peng PY, Zhai HX, Ahmad I, Xia C, Wu YZ, Zhang YH. A deep learning framework for identifying essential proteins based on multiple biological information. BMC Bioinformatics 2022; 23:318. [PMID: 35927611 PMCID: PMC9351218 DOI: 10.1186/s12859-022-04868-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 06/07/2022] [Accepted: 07/29/2022] [Indexed: 11/15/2022] Open
Abstract
Background Essential Proteins are demonstrated to exert vital functions on cellular processes and are indispensable for the survival and reproduction of the organism. Traditional centrality methods perform poorly on complex protein–protein interaction (PPI) networks. Machine learning approaches based on high-throughput data lack the exploitation of the temporal and spatial dimensions of biological information. Results We put forward a deep learning framework to predict essential proteins by integrating features obtained from the PPI network, subcellular localization, and gene expression profiles. In our model, the node2vec method is applied to learn continuous feature representations for proteins in the PPI network, which capture the diversity of connectivity patterns in the network. The concept of depthwise separable convolution is employed on gene expression profiles to extract properties and observe the trends of gene expression over time under different experimental conditions. Subcellular localization information is mapped into a long one-dimensional vector to capture its characteristics. Additionally, we use a sampling method to mitigate the impact of imbalanced learning when training the model. With experiments carried out on the data of Saccharomyces cerevisiae, results show that our model outperforms traditional centrality methods and machine learning methods. Likewise, the comparative experiments have manifested that our process of various biological information is preferable. Conclusions Our proposed deep learning framework effectively identifies essential proteins by integrating multiple biological data, proving a broader selection of subcellular localization information significantly improves the results of prediction and depthwise separable convolution implemented on gene expression profiles enhances the performance.
Affiliation(s)
- Yi Yue
- Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information, Anhui Agricultural University, Hefei, 230036, China; School of Information and Computer, Anhui Agricultural University, Hefei, 230036, China; School of Life Sciences, Anhui Agricultural University, Hefei, 230036, China; State Key Laboratory of Tea Plant Biology and Utilization, Anhui Agricultural University, Hefei, 230036, China
- Chen Ye
- Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information, Anhui Agricultural University, Hefei, 230036, China; School of Information and Computer, Anhui Agricultural University, Hefei, 230036, China
- Pei-Yun Peng
- Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information, Anhui Agricultural University, Hefei, 230036, China; School of Information and Computer, Anhui Agricultural University, Hefei, 230036, China
- Hui-Xin Zhai
- Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information, Anhui Agricultural University, Hefei, 230036, China; School of Information and Computer, Anhui Agricultural University, Hefei, 230036, China
- Iftikhar Ahmad
- Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information, Anhui Agricultural University, Hefei, 230036, China; School of Information and Computer, Anhui Agricultural University, Hefei, 230036, China
- Chuan Xia
- Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information, Anhui Agricultural University, Hefei, 230036, China; School of Information and Computer, Anhui Agricultural University, Hefei, 230036, China
- Yun-Zhi Wu
- Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information, Anhui Agricultural University, Hefei, 230036, China; School of Information and Computer, Anhui Agricultural University, Hefei, 230036, China; State Key Laboratory of Tea Plant Biology and Utilization, Anhui Agricultural University, Hefei, 230036, China
- You-Hua Zhang
- Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information, Anhui Agricultural University, Hefei, 230036, China; School of Information and Computer, Anhui Agricultural University, Hefei, 230036, China; School of Life Sciences, Anhui Agricultural University, Hefei, 230036, China
12
Schapke J, Tavares A, Recamonde-Mendoza M. EPGAT: Gene Essentiality Prediction With Graph Attention Networks. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:1615-1626. [PMID: 33497339 DOI: 10.1109/tcbb.2021.3054738] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/12/2023]
Abstract
Identifying essential genes and proteins is a critical step towards a better understanding of human biology and pathology. Computational approaches helped to mitigate experimental constraints by exploring machine learning (ML) methods and the correlation of essentiality with biological information, especially protein-protein interaction (PPI) networks, to predict essential genes. Nonetheless, their performance is still limited, as network-based centralities are not exclusive proxies of essentiality, and traditional ML methods are unable to learn from non-euclidean domains such as graphs. Given these limitations, we proposed EPGAT, an approach for Essentiality Prediction based on Graph Attention Networks (GATs), which are attention-based Graph Neural Networks (GNNs), operating on graph-structured data. Our model directly learns gene essentiality patterns from PPI networks, integrating additional evidence from multiomics data encoded as node attributes. We benchmarked EPGAT for four organisms, including humans, accurately predicting gene essentiality with ROC AUC score ranging from 0.78 to 0.97. Our model significantly outperformed network-based and shallow ML-based methods and achieved a very competitive performance against the state-of-the-art node2vec embedding method. Notably, EPGAT was the most robust approach in scenarios with limited and imbalanced training data. Thus, the proposed approach offers a powerful and effective way to identify essential genes and proteins.
|
13
|
Panditrao G, Bhowmick R, Meena C, Sarkar RR. Emerging landscape of molecular interaction networks: Opportunities, challenges and prospects. J Biosci 2022. [PMID: 36210749 PMCID: PMC9018971 DOI: 10.1007/s12038-022-00253-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Indexed: 11/24/2022]
Abstract
Network biology finds application in interpreting molecular interaction networks and providing insightful inferences using graph theoretical analysis of biological systems. The integration of computational bio-modelling approaches with different hybrid network-based techniques provides additional information about the behaviour of complex systems. With increasing advances in high-throughput technologies in biological research, attempts have been made to incorporate this information into network structures, which has led to a continuous update of network biology approaches over time. The newly minted centrality measures accommodate the details of omics data and regulatory network structure information. The unification of graph network properties with classical mathematical and computational modelling approaches and technologically advanced approaches like machine-learning- and artificial intelligence-based algorithms leverages the potential application of these techniques. These computational advances prove beneficial and serve various applications such as essential gene prediction, identification of drug–disease interaction and gene prioritization. Hence, in this review, we have provided a comprehensive overview of the emerging landscape of molecular interaction networks using graph theoretical approaches. With the aim to provide information on the wide range of applications of network biology approaches in understanding the interaction and regulation of genes, proteins, enzymes and metabolites at different molecular levels, we have reviewed the methods that utilize network topological properties, emerging hybrid network-based approaches and applications that integrate machine learning techniques to analyse molecular interaction networks. Further, we have discussed the applications of these approaches in biomedical research with a note on future prospects.
Affiliation(s)
- Gauri Panditrao
- Chemical Engineering and Process Development Division, CSIR-National Chemical Laboratory, Pune, 411008 India
| | - Rupa Bhowmick
- Chemical Engineering and Process Development Division, CSIR-National Chemical Laboratory, Pune, 411008 India
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, 201002 India
| | - Chandrakala Meena
- Chemical Engineering and Process Development Division, CSIR-National Chemical Laboratory, Pune, 411008 India
| | - Ram Rup Sarkar
- Chemical Engineering and Process Development Division, CSIR-National Chemical Laboratory, Pune, 411008 India
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, 201002 India
| |
|
14
|
Hong C, Cao Q, Zhang Z, Tsui SKW, Yip KY. Reusability report: Capturing properties of biological objects and their relationships using graph neural networks. Nat Mach Intell 2022. [DOI: 10.1038/s42256-022-00454-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 02/07/2023]
|
15
|
Kania A. Harnessing the information theory and chaos game representation for pattern searching among essential and non-essential genes in Bacteria. J Theor Biol 2021; 531:110917. [PMID: 34563550 DOI: 10.1016/j.jtbi.2021.110917] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/28/2021] [Revised: 08/19/2021] [Accepted: 09/21/2021] [Indexed: 11/29/2022]
Abstract
Proteins encoded by genes are engaged in most of the processes within a cell. Identifying a minimal set of genes required for survival is still a challenging task. Essential genes seem to be more conserved and are usually responsible for basic functions, for instance, genetic information flow or energy production. Despite persistent advances in experimental methods, computer predictions may constitute an important part of this investigation. Firstly, they can handle a huge amount of data and reveal characteristic patterns. Furthermore, they enable scientists to build models for predicting essential genes that have not yet been verified experimentally. Some papers report interesting dependencies within essential gene sequences using different computer models. In this paper, the author performs a three-step analysis for a deeper understanding of the fundamentals of essential and non-essential genes. Beginning with simple nucleotide composition and finishing with long-range correlations, the analysis presents some characteristic patterns that are expected to be developed further in future studies.
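The chaos game representation named in the title maps a nucleotide sequence to points in the unit square, so that sequence patterns become spatial patterns. A minimal sketch follows, assuming a commonly used corner assignment (A bottom-left, C top-left, G top-right, T bottom-right); the corner layout used in the paper and the example sequence are assumptions, not taken from the source.

```python
# Corner of the unit square assigned to each nucleotide (assumed layout)
CORNERS = {
    "A": (0.0, 0.0),
    "C": (0.0, 1.0),
    "G": (1.0, 1.0),
    "T": (1.0, 0.0),
}


def cgr_points(seq):
    """Return the chaos game representation of a DNA sequence.

    Starting from the centre of the square, each nucleotide moves the
    current point halfway towards that nucleotide's corner. Symbols
    other than A, C, G, T (for example N) are skipped.
    """
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


# Hypothetical gene fragment, for illustration only
points = cgr_points("ATGGCGAAT")
```

Binning these points on a grid gives k-mer frequency maps, which is one way composition and longer-range structure can be compared between gene classes.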
Affiliation(s)
- Adrian Kania
- Department of Computational Biophysics and Bioinformatics, Faculty of Biochemistry, Biophysics and Biotechnology, Jagiellonian University, Gronostajowa 7, Cracow 30-387, Poland
| |
|
16
|
Tsugawa H, Rai A, Saito K, Nakabayashi R. Metabolomics and complementary techniques to investigate the plant phytochemical cosmos. Nat Prod Rep 2021; 38:1729-1759. [PMID: 34668509 DOI: 10.1039/d1np00014d] [Citation(s) in RCA: 27] [Impact Index Per Article: 9.0] [Indexed: 12/14/2022]
Abstract
Covering: up to 2021. Plants and their associated microbial communities are known to produce millions of metabolites, a majority of which are still not characterized and are speculated to possess novel bioactive properties. In addition to their role in plant physiology, these metabolites are also relevant as existing and next-generation medicine candidates. Elucidation of the plant metabolite diversity is thus valuable for the successful exploitation of natural resources for humankind. Herein, we present a comprehensive review on recent metabolomics approaches to illuminate molecular networks in plants, including chemical isolation and enzymatic production as well as the modern metabolomics approaches such as stable isotope labeling, ultrahigh-resolution mass spectrometry, metabolome imaging (spatial metabolomics), single-cell analysis, cheminformatics, and computational mass spectrometry. Mass spectrometry-based strategies to characterize plant metabolomes through metabolite identification and annotation are described in detail. We also highlight the use of phytochemical genomics to mine genes associated with specialized metabolites' biosynthesis. Understanding the metabolic diversity through biotechnological advances is fundamental to elucidate the functions of the plant-derived specialized metabolome.
Affiliation(s)
- Hiroshi Tsugawa
- RIKEN Center for Sustainable Resource Science, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan; RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan; Department of Biotechnology and Life Science, Tokyo University of Agriculture and Technology, 2-24-16 Nakamachi, Koganei, Tokyo 184-8588, Japan; Graduate School of Medical Life Science, Yokohama City University, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama 230-0045, Japan
| | - Amit Rai
- RIKEN Center for Sustainable Resource Science, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan; Plant Molecular Science Center, Chiba University, 1-8-1 Inohana, Chuo-ku, Chiba 260-8675, Japan
| | - Kazuki Saito
- RIKEN Center for Sustainable Resource Science, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan; Plant Molecular Science Center, Chiba University, 1-8-1 Inohana, Chuo-ku, Chiba 260-8675, Japan
| | - Ryo Nakabayashi
- RIKEN Center for Sustainable Resource Science, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan.
| |
|
17
|
Li S, Zhang Z, Li X, Tan Y, Wang L, Chen Z. An iteration model for identifying essential proteins by combining comprehensive PPI network with biological information. BMC Bioinformatics 2021; 22:430. [PMID: 34496745 PMCID: PMC8425031 DOI: 10.1186/s12859-021-04300-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 08/04/2020] [Accepted: 07/08/2021] [Indexed: 11/10/2022] Open
Abstract
Background: Essential proteins have a great impact on cell survival and development, and play important roles in disease analysis and new drug design. However, because identifying essential proteins through biological experiments is inefficient and costly, there is an urgent need for automated and accurate detection methods. In recent years, the recognition of essential proteins in protein–protein interaction (PPI) networks has become a research hotspot, and many computational models for predicting essential proteins have been proposed. Results: To achieve higher prediction performance, this paper proposes a new prediction model called TGSO. In TGSO, a protein aggregation degree network is first constructed by adopting the node density measurement method for complex networks. In parallel, a protein co-expression interaction network is constructed by combining gene expression information with network connectivity, and a protein co-localization interaction network is constructed from subcellular localization data. These three newly constructed networks are then integrated to obtain a comprehensive protein–protein interaction network. Finally, based on homology information, scores are calculated iteratively for the proteins and used to estimate their importance. To evaluate its identification performance, TGSO was compared with 13 recent competitive methods on three yeast databases. The experimental results show that TGSO achieves identification accuracies of 94%, 82% and 72% among the top 1%, 5% and 10% of candidate proteins, respectively, which is to some degree superior to these state-of-the-art models.
Conclusions: A comprehensive interaction network was constructed from multi-source data to reduce the noise and errors in the initial PPI network, and was combined with an iterative method to improve the accuracy of essential protein prediction. TGSO may therefore be conducive to the future development of essential protein recognition.
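The accuracies reported for the top 1%, 5% and 10% of candidates correspond to a ranking-based evaluation: proteins are sorted by predicted score and the share of known essential proteins among the top-ranked fraction is counted. A minimal sketch of that evaluation is given below; the function name, protein identifiers and scores are hypothetical, and the rounding of the cut-off (ceiling, at least one protein) is an assumption rather than the authors' stated procedure.

```python
import math


def top_fraction_precision(scores, essential, fraction):
    """Share of known essential proteins among the top-ranked candidates.

    scores:    dict protein -> predicted importance score
    essential: set of proteins known to be essential (gold standard)
    fraction:  portion of the ranked list to inspect, e.g. 0.01 for top 1%
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Assumed rounding rule: ceiling of the fraction, at least one protein
    k = max(1, math.ceil(len(ranked) * fraction))
    top = ranked[:k]
    return sum(protein in essential for protein in top) / k


# Hypothetical scores and gold-standard labels
scores = {"P1": 0.9, "P2": 0.8, "P3": 0.4, "P4": 0.1}
essential = {"P1", "P3"}
for frac in (0.25, 0.5, 1.0):
    print(frac, top_fraction_precision(scores, essential, frac))
```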
Affiliation(s)
- Shiyuan Li
- College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, 410022, China; Hunan Province Key Laboratory of Industrial Internet Technology and Security, Changsha University, Changsha, 410022, China
| | - Zhen Zhang
- College of Electronic Information and Electrical Engineering, Changsha University, Changsha, 410022, China
| | - Xueyong Li
- College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, 410022, China; Hunan Province Key Laboratory of Industrial Internet Technology and Security, Changsha University, Changsha, 410022, China
| | - Yihong Tan
- College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, 410022, China; Hunan Province Key Laboratory of Industrial Internet Technology and Security, Changsha University, Changsha, 410022, China.
| | - Lei Wang
- College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, 410022, China; Hunan Province Key Laboratory of Industrial Internet Technology and Security, Changsha University, Changsha, 410022, China
| | - Zhiping Chen
- College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, 410022, China; Hunan Province Key Laboratory of Industrial Internet Technology and Security, Changsha University, Changsha, 410022, China.
| |
|
18
|
Campos TL, Korhonen PK, Hofmann A, Gasser RB, Young ND. Harnessing model organism genomics to underpin the machine learning-based prediction of essential genes in eukaryotes - Biotechnological implications. Biotechnol Adv 2021; 54:107822. [PMID: 34461202 DOI: 10.1016/j.biotechadv.2021.107822] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 04/09/2021] [Revised: 08/17/2021] [Accepted: 08/24/2021] [Indexed: 12/17/2022]
Abstract
The availability of high-quality genomes and advances in functional genomics have enabled large-scale studies of essential genes in model eukaryotes, including the 'elegant worm' (Caenorhabditis elegans; Nematoda) and the 'vinegar fly' (Drosophila melanogaster; Arthropoda). However, this is not the case for other, much less-studied organisms, such as socioeconomically important parasites, for which functional genomic platforms usually do not exist. Thus, there is a need to develop innovative techniques or approaches for the prediction, identification and investigation of essential genes. A key approach that could enable the prediction of such genes is machine learning (ML). Here, we undertake an historical review of experimental and computational approaches employed for the characterisation of essential genes in eukaryotes, with a particular focus on model ecdysozoans (C. elegans and D. melanogaster), and discuss the possible applicability of ML-approaches to organisms such as socioeconomically important parasites. We highlight some recent results showing that high-performance ML, combined with feature engineering, allows a reliable prediction of essential genes from extensive, publicly available 'omic data sets, with major potential to prioritise such genes (with statistical confidence) for subsequent functional genomic validation. These findings could 'open the door' to fundamental and applied research areas. Evidence of some commonality in the essential gene-complement between these two organisms indicates that an ML-engineering approach could find broader applicability to ecdysozoans such as parasitic nematodes or arthropods, provided that suitably large and informative data sets become/are available for proper feature engineering, and for the robust training and validation of algorithms. 
This area warrants detailed exploration to, for example, facilitate the identification and characterisation of essential molecules as novel targets for drugs and vaccines against parasitic diseases. This focus is particularly important, given the substantial impact that such diseases have worldwide, and the current challenges associated with their prevention and control and with drug resistance in parasite populations.
Affiliation(s)
- Tulio L Campos
- Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia; Bioinformatics Core Facility, Instituto Aggeu Magalhães, Fundação Oswaldo Cruz (IAM-Fiocruz), Recife, Pernambuco, Brazil
| | - Pasi K Korhonen
- Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia
| | - Andreas Hofmann
- Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia
| | - Robin B Gasser
- Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia.
| | - Neil D Young
- Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia.
| |
|
19
|
Zhang Z, van Dijk F, de Klein N, van Gijn ME, Franke LH, Sinke RJ, Swertz MA, van der Velde KJ. Feasibility of predicting allele specific expression from DNA sequencing using machine learning. Sci Rep 2021; 11:10606. [PMID: 34012022 PMCID: PMC8134421 DOI: 10.1038/s41598-021-89904-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 01/21/2021] [Accepted: 05/04/2021] [Indexed: 11/09/2022] Open
Abstract
Allele specific expression (ASE) concerns divergent expression quantity of alternative alleles and is measured by RNA sequencing. Multiple studies show that ASE plays a role in hereditary diseases by modulating penetrance or phenotype severity. However, genome diagnostics is based on DNA sequencing and therefore neglects gene expression regulation such as ASE. To take advantage of ASE in absence of RNA sequencing, it must be predicted using only DNA variation. We have constructed ASE models from BIOS (n = 3432) and GTEx (n = 369) that predict ASE using DNA features. These models are highly reproducible and comprise many different feature types, highlighting the complex regulation that underlies ASE. We applied the BIOS-trained model to population variants in three genes in which ASE plays a clinically relevant role: BRCA2, RET and NF1. This resulted in predicted ASE effects for 27 variants, of which 10 were known pathogenic variants. We demonstrated that ASE can be predicted from DNA features using machine learning. Future efforts may improve sensitivity and translate these models into a new type of genome diagnostic tool that prioritizes candidate pathogenic variants or regulators thereof for follow-up validation by RNA sequencing. All used code and machine learning models are available at GitHub and Zenodo.
Affiliation(s)
- Zhenhua Zhang
- Genomics Coordination Center, University of Groningen and University Medical Center Groningen, Antonius Deusinglaan 1, 9713 AV, Groningen, The Netherlands
- Department of Genetics, University of Groningen and University Medical Center Groningen, Antonius Deusinglaan 1, 9713 AV, Groningen, The Netherlands
| | - Freerk van Dijk
- Genomics Coordination Center, University of Groningen and University Medical Center Groningen, Antonius Deusinglaan 1, 9713 AV, Groningen, The Netherlands
- Department of Genetics, University of Groningen and University Medical Center Groningen, Antonius Deusinglaan 1, 9713 AV, Groningen, The Netherlands
- Prinses Maxima Center for Child Oncology, Heidelberglaan 25, 3584 CS, Utrecht, The Netherlands
| | - Niek de Klein
- Department of Genetics, University of Groningen and University Medical Center Groningen, Antonius Deusinglaan 1, 9713 AV, Groningen, The Netherlands
| | - Mariëlle E van Gijn
- Department of Genetics, University of Groningen and University Medical Center Groningen, Antonius Deusinglaan 1, 9713 AV, Groningen, The Netherlands
| | - Lude H Franke
- Department of Genetics, University of Groningen and University Medical Center Groningen, Antonius Deusinglaan 1, 9713 AV, Groningen, The Netherlands
| | - Richard J Sinke
- Department of Genetics, University of Groningen and University Medical Center Groningen, Antonius Deusinglaan 1, 9713 AV, Groningen, The Netherlands
| | - Morris A Swertz
- Genomics Coordination Center, University of Groningen and University Medical Center Groningen, Antonius Deusinglaan 1, 9713 AV, Groningen, The Netherlands
- Department of Genetics, University of Groningen and University Medical Center Groningen, Antonius Deusinglaan 1, 9713 AV, Groningen, The Netherlands
| | - K Joeri van der Velde
- Genomics Coordination Center, University of Groningen and University Medical Center Groningen, Antonius Deusinglaan 1, 9713 AV, Groningen, The Netherlands.
- Department of Genetics, University of Groningen and University Medical Center Groningen, Antonius Deusinglaan 1, 9713 AV, Groningen, The Netherlands.
| |
|
20
|
Schonfeld E, Vendrow E, Vendrow J, Schonfeld E. On the relation of gene essentiality to intron structure: a computational and deep learning approach. Life Sci Alliance 2021; 4:e202000951. [PMID: 33906938 PMCID: PMC8127325 DOI: 10.26508/lsa.202000951] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/01/2020] [Revised: 04/12/2021] [Accepted: 04/15/2021] [Indexed: 11/24/2022] Open
Abstract
Essential genes have been studied by copy number variants and deletions, both associated with introns. The premise of our work is that introns of essential genes have distinct characteristic properties. We provide support for this by training a deep learning model and demonstrating that introns alone can be used to classify essentiality. The model, limited to first introns, performs at an increased level, implicating first introns in essentiality. We identify unique properties of introns of essential genes, finding that their structure protects against deletion and intron-loss events, especially centered on the first intron. We show that GC density is increased in the first introns of essential genes, allowing for increased enhancer activity, protection against deletions, and improved splice site recognition. We find that first introns of essential genes are of remarkably smaller size than their nonessential counterparts, and to protect against common 3' end deletion events, essential genes carry an increased number of (smaller) introns. To demonstrate the importance of the seven features we identified, we train a feature-based model using only these features and achieve high performance.
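GC density, one of the intron features discussed above, is simply the fraction of G and C bases in a sequence. The following minimal sketch computes it for two sequences; both sequences and the function name are hypothetical, and ambiguous bases such as N are excluded from the denominator as an assumed convention.

```python
def gc_fraction(seq):
    """Fraction of G/C among the unambiguous bases (A, C, G, T) of seq.

    Returns 0.0 when the sequence contains no unambiguous base.
    """
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    return sum(b in "GC" for b in bases) / len(bases)


# Hypothetical first-intron fragments, for illustration only
intron_a = "GTGAGCGGCCGCGGGCAG"
intron_b = "GTAAGTATTTTAATATAG"
print(gc_fraction(intron_a), gc_fraction(intron_b))
```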
Affiliation(s)
| | | | - Joshua Vendrow
- University of California, Los Angeles, Los Angeles, CA, USA
| | | |
|
21
|
Le NQK, Do DT, Hung TNK, Lam LHT, Huynh TT, Nguyen NTK. A Computational Framework Based on Ensemble Deep Neural Networks for Essential Genes Identification. Int J Mol Sci 2020; 21:E9070. [PMID: 33260643 PMCID: PMC7730808 DOI: 10.3390/ijms21239070] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Received: 10/04/2020] [Revised: 11/25/2020] [Accepted: 11/26/2020] [Indexed: 01/13/2023] Open
Abstract
Essential genes contain key information of genomes that could be the key to a comprehensive understanding of life and evolution. Because of their importance, the study of essential genes is considered a crucial problem in computational biology. Computational methods for identifying essential genes have become increasingly popular because they reduce the cost and time of traditional experiments. A few models have addressed this problem, but their performance is still not satisfactory because of high-dimensional features and the use of traditional machine learning algorithms. Thus, there is a need for a novel model that improves predictive performance using DNA sequence features. This study took advantage of a natural language processing (NLP) model to learn from biological sequences by treating them as natural language words. To learn from the NLP features, an ensemble deep neural network was then employed as the supervised learning model. The proposed method identified essential genes with sensitivity, specificity, accuracy, Matthews correlation coefficient (MCC), and area under the receiver operating characteristic curve (AUC) values of 60.2%, 84.6%, 76.3%, 0.449, and 0.814, respectively. It outperformed the single models without ensembling, as well as the state-of-the-art predictors on the same benchmark dataset. This indicates the effectiveness of the proposed method for determining essential genes in particular, and for other sequence analysis problems in general.
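The threshold-based metrics quoted above (sensitivity, specificity, accuracy and MCC) all derive from the four cells of a binary confusion matrix. The sketch below shows the standard formulas; the counts are hypothetical and are not the confusion matrix of the study, and AUC is omitted because it requires the ranked prediction scores rather than counts.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts.

    tp/fn: essential genes predicted as essential / non-essential
    tn/fp: non-essential genes predicted as non-essential / essential
    """
    sensitivity = tp / (tp + fn)            # recall on essential genes
    specificity = tn / (tn + fp)            # recall on non-essential genes
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts for an imbalanced test set (10 essential, 20 not)
metrics = binary_metrics(tp=6, fp=3, tn=17, fn=4)
print(metrics)
```

On imbalanced data such as essential-gene benchmarks, MCC is informative because it uses all four cells, whereas accuracy can be dominated by the majority class.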
Affiliation(s)
- Nguyen Quoc Khanh Le
- Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei 106, Taiwan
- Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei 106, Taiwan
- Translational Imaging Research Center, Taipei Medical University Hospital, Taipei 110, Taiwan
| | - Duyen Thi Do
- Graduate Institute of Biomedical Informatics, Taipei Medical University, Taipei 106, Taiwan
| | - Truong Nguyen Khanh Hung
- International Master/Ph.D. Program in Medicine, College of Medicine, Taipei Medical University, Taipei 110, Taiwan
- Department of Orthopedic and Trauma, Cho Ray Hospital, Ho Chi Minh 70000, Vietnam
| | - Luu Ho Thanh Lam
- International Master/Ph.D. Program in Medicine, College of Medicine, Taipei Medical University, Taipei 110, Taiwan
- Intensive Care Unit, Children’s Hospital 2, Ho Chi Minh 70000, Vietnam
| | - Tuan-Tu Huynh
- Department of Electrical Engineering, Yuan Ze University, Taoyuan 320, Taiwan
- Department of Electrical Electronic and Mechanical Engineering, Lac Hong University, Dong Nai 76120, Vietnam
| | - Ngan Thi Kim Nguyen
- School of Nutrition and Health Sciences, Taipei Medical University, Taipei 110, Taiwan
| |
|