1
Durge AR, Shrimankar DD. DHFS-ECM: Design of a Dual Heuristic Feature Selection-based Ensemble Classification Model for the Identification of Bamboo Species from Genomic Sequences. Curr Genomics 2024; 25:185-201. [PMID: 39087000 PMCID: PMC11288165 DOI: 10.2174/0113892029268176240125055419] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2023] [Revised: 01/16/2024] [Accepted: 01/16/2024] [Indexed: 08/02/2024]
Abstract
Background: Analyzing genomic sequences plays a crucial role in understanding biological diversity and classifying Bamboo species. Existing methods for genomic sequence analysis suffer from limitations such as complexity, low accuracy, and the need for constant reconfiguration in response to evolving genomic datasets.
Aim: This study addresses these limitations by introducing a novel Dual Heuristic Feature Selection-based Ensemble Classification Model (DHFS-ECM) for the precise identification of Bamboo species from genomic sequences.
Methods: The proposed DHFS-ECM method employs a Genetic Algorithm to perform dual heuristic feature selection. The first step maximizes inter-class variance, leading to the selection of informative N-gram feature sets. The second step uses intra-class variance levels to create optimal training and validation sets, ensuring comprehensive coverage of class-specific features. The selected features are then processed through an ensemble classification layer, which combines multiple stratification models for species-specific categorization.
Results: Comparative analysis with state-of-the-art methods demonstrates that DHFS-ECM improves accuracy by 9.5%, precision by 5.9%, recall by 8.5%, and AUC by 4.5%. Importantly, the model maintains its performance even with an increased number of species classes, owing to the continuous learning facilitated by the Dual Heuristic Genetic Algorithm Model.
Conclusion: DHFS-ECM offers several key advantages, including efficient feature extraction, reduced model complexity, enhanced interpretability, and increased robustness and accuracy through the ensemble classification layer. These attributes make DHFS-ECM a promising tool for real-time clinical applications and a valuable contribution to the field of genomic sequence analysis.
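The inter-class variance criterion mentioned in the abstract above can be illustrated with a small sketch. This is not the authors' method: it replaces the Genetic Algorithm with a plain ranking, and the function names (`ngram_counts`, `inter_class_variance`, `top_features`), the use of relative frequencies, and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, pvariance


def ngram_counts(seq, n=3):
    """Count overlapping n-grams in a nucleotide sequence."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def inter_class_variance(samples, labels, n=3):
    """Score each n-gram by the variance of its per-class mean frequency.

    Assumption: features are relative n-gram frequencies per sequence.
    """
    freqs = []
    for seq in samples:
        counts = ngram_counts(seq, n)
        total = sum(counts.values()) or 1
        freqs.append({g: c / total for g, c in counts.items()})

    vocab = sorted({g for f in freqs for g in f})
    classes = sorted(set(labels))
    scores = {}
    for g in vocab:
        # Mean frequency of this n-gram within each class.
        class_means = [
            mean(f.get(g, 0.0) for f, y in zip(freqs, labels) if y == c)
            for c in classes
        ]
        # A high variance across class means marks a discriminative n-gram.
        scores[g] = pvariance(class_means)
    return scores


def top_features(samples, labels, n=3, k=5):
    """Return the k n-grams with the highest inter-class variance."""
    scores = inter_class_variance(samples, labels, n)
    return sorted(scores, key=lambda g: (-scores[g], g))[:k]


# Toy, made-up sequences for two hypothetical classes.
toy_samples = ["AAAAAA", "AAAAAA", "CCCCCC", "CCCCCC"]
toy_labels = ["a", "a", "b", "b"]
toy_scores = inter_class_variance(toy_samples, toy_labels, n=2)
```

A search heuristic such as a genetic algorithm would explore subsets of n-grams rather than rank them one at a time; the ranking here only shows what the scoring criterion measures.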
Affiliation(s)
- Aditi R Durge
  - Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology (VNIT), Nagpur, India
- Deepti D Shrimankar
  - Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology (VNIT), Nagpur, India
2
Perea-Jacobo R, Paredes-Gutiérrez GR, Guerrero-Chevannier MÁ, Flores DL, Muñiz-Salazar R. Machine Learning of the Whole Genome Sequence of Mycobacterium tuberculosis: A Scoping PRISMA-Based Review. Microorganisms 2023; 11:1872. [PMID: 37630431 PMCID: PMC10456961 DOI: 10.3390/microorganisms11081872] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/16/2023] [Revised: 07/13/2023] [Accepted: 07/14/2023] [Indexed: 08/27/2023]
Abstract
Tuberculosis (TB) remains one of the most significant global health problems and a major challenge to public health systems worldwide. Despite the development of new TB diagnostic tools, diagnosing drug-resistant tuberculosis (DR-TB) has become increasingly difficult because of the rising number of multidrug-resistant (MDR-TB) cases. Even the World Health Organization-recommended methods such as Xpert MTB/XDR or Truenat are unable to detect all the Mycobacterium tuberculosis genome mutations associated with drug resistance. While Whole Genome Sequencing offers a more precise DR profile, the lack of user-friendly bioinformatics analysis applications hinders its widespread use. This review explores artificial intelligence models for predicting DR-TB profiles, analyzing relevant English-language articles using the PRISMA methodology through the Covidence platform. Our findings indicate that an Artificial Neural Network is the most commonly employed method, with non-statistical dimensionality reduction techniques preferred over traditional statistical approaches such as Principal Component Analysis or t-distributed Stochastic Neighbor Embedding.
Affiliation(s)
- Ricardo Perea-Jacobo
  - Facultad de Ingeniería Arquitectura y Diseño, Universidad Autónoma de Baja California, Campus Ensenada, Ensenada 22860, Mexico
  - Escuela de Ciencias de la Salud, Universidad Autónoma de Baja California, Campus Ensenada, Ensenada 22890, Mexico
- Guillermo René Paredes-Gutiérrez
  - Facultad de Ingeniería Arquitectura y Diseño, Universidad Autónoma de Baja California, Campus Ensenada, Ensenada 22860, Mexico
- Miguel Ángel Guerrero-Chevannier
  - Facultad de Ingeniería Arquitectura y Diseño, Universidad Autónoma de Baja California, Campus Ensenada, Ensenada 22860, Mexico
- Dora-Luz Flores
  - Facultad de Ingeniería Arquitectura y Diseño, Universidad Autónoma de Baja California, Campus Ensenada, Ensenada 22860, Mexico
- Raquel Muñiz-Salazar
  - Escuela de Ciencias de la Salud, Universidad Autónoma de Baja California, Campus Ensenada, Ensenada 22890, Mexico
3
Durge AR, Shrimankar DD, Sawarkar AD. Heuristic Analysis of Genomic Sequence Processing Models for High Efficiency Prediction: A Statistical Perspective. Curr Genomics 2022; 23:299-317. [PMID: 36778194 PMCID: PMC9878859 DOI: 10.2174/1389202923666220927105311] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/25/2022] [Revised: 08/29/2022] [Accepted: 09/01/2022] [Indexed: 11/22/2022]
Abstract
Genome sequences indicate a wide variety of characteristics, including species and sub-species type, genotype, diseases, growth indicators, and yield quality. To analyze the characteristics of genome sequences across different species, researchers have proposed various deep learning models, such as Convolutional Neural Networks (CNNs), Deep Belief Networks (DBNs), and Multilayer Perceptrons (MLPs), which vary in evaluation performance, area of application, and the species processed. Because the algorithmic implementations differ so widely, it is difficult for research programmers to select the best genome processing model for their application. To facilitate this selection, the paper reviews a wide variety of such models and compares their performance in terms of accuracy, area of application, computational complexity, processing delay, precision, and recall. The reviewed deep learning and machine learning models achieve different accuracies for different applications. For multiple genomic data, Repeated Incremental Pruning to Produce Error Reduction with Support Vector Machine (Ripper SVM) achieves 99.7% accuracy; for cancer genomic data, the CNN Bayesian method achieves 99.27% accuracy; and for Covid genome analysis, Bidirectional Long Short-Term Memory with CNN (BiLSTM CNN) achieves the highest accuracy of 99.95%. The precision and recall of the different models are reviewed in a similar way. Finally, the paper concludes with observations on the genomic processing models and recommends applications for their efficient use.
Affiliation(s)
- Aditi R. Durge
  - Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology (VNIT), Nagpur, India
- Deepti D. Shrimankar
  - Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology (VNIT), Nagpur, India (corresponding author; Tel: 9860606477)
- Ankush D. Sawarkar
  - Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology (VNIT), Nagpur, India
|
4
Cui ZJ, Zhang WT, Zhu Q, Zhang QY, Zhang HY. Using a Heat Diffusion Model to Detect Potential Drug Resistance Genes of Mycobacterium tuberculosis. Protein Pept Lett 2020; 27:711-717. [PMID: 32167422 DOI: 10.2174/0929866527666200313113157] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 03/15/2019] [Revised: 12/01/2019] [Accepted: 12/21/2019] [Indexed: 01/01/2023]
Abstract
Background: Tuberculosis (TB), caused by Mycobacterium tuberculosis (Mtb), is one of the oldest known and most dangerous diseases. Although the spread of TB was controlled in the early 20th century using antibiotics and vaccines, TB has again become a threat because of increased drug resistance. There is still a lack of effective treatment regimens for a person who is already infected with multidrug-resistant Mtb (MDR-Mtb) or extensively drug-resistant Mtb (XDR-Mtb). In the past decades, many research groups have explored the drug resistance profiles of Mtb based on sequence data by GWAS, which identified some mutations that were significantly linked with drug resistance, and attempted to explain the resistance mechanisms. However, they mainly focused on several significant mutations in drug targets (e.g. rpoB, katG). Some genes that are potentially associated with drug resistance may be overlooked by the GWAS analysis.
Objective: To detect potential drug resistance genes of Mtb using a heat diffusion model.
Methods: The sequencing data comprised 127 Mtb samples, including 34 ethambutol-, 65 isoniazid-, 53 rifampicin- and 45 streptomycin-resistant strains. The raw sequence data were preprocessed using Trimmomatic software and aligned to the Mtb H37Rv reference genome using Bowtie2. From the resulting alignments, SAMtools and VarScan were used to filter sequences and call SNPs. The GWAS was performed with the PLINK package to obtain the significant SNPs, which were mapped to genes. The P-values of genes calculated by GWAS were converted into a heat vector. The heat vector and the Mtb protein-protein interactions (PPI) derived from the STRING database were input into the heat diffusion model to obtain significant subnetworks with HotNet2. Finally, the most significant (P < 0.05) subnetworks associated with different phenotypes were obtained. To verify the change in binding energy between the drug and target before and after mutation, molecular dynamics simulation was performed using the AMBER software.
Results: We identified significant subnetworks in rifampicin-resistant samples. Excitingly, we found rpoB and rpoC, which are drug targets of rifampicin. In the protein structure of rpoB, the mutation location was extremely close to the drug binding site, at a distance of only 3.97 Å. Molecular dynamics simulation revealed that the binding energy of rpoB and rifampicin decreased after the D435V mutation. To a large extent, this mutation can influence the affinity of drug-target binding. In addition, topA and pyrG were reported to be linked with drug resistance, and might be new TB drug targets. Other genes that have not yet been reported are worth further study.
Conclusion: Using a heat diffusion model in combination with GWAS results and protein-protein interactions, the significantly mutated subnetworks in rifampicin-resistant samples were found. The subnetwork not only contained the known targets of rifampicin (rpoB, rpoC), but also included topA and pyrG, which are potentially associated with drug resistance. Together, these results offer deeper insights into drug resistance of Mtb and provide potential drug targets for finding new antituberculosis drugs.
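The diffusion step described in the abstract above (gene-level P-values turned into heat, then spread over a protein-protein interaction network) can be sketched as follows. This is a minimal illustration and not the HotNet2 implementation: the `-log10(p)` transform, the restart parameter `beta`, the fixed iteration count, and the toy gene names, edges, and P-values are all assumptions.

```python
import math


def heat_from_pvalues(pvalues):
    """Map gene P-values to heat scores; -log10(p) is an assumed transform."""
    return {gene: -math.log10(p) for gene, p in pvalues.items()}


def diffuse(edges, heat, beta=0.4, iters=200):
    """Iterate h <- beta * h0 + (1 - beta) * W h on an undirected network.

    W is the column-normalised adjacency matrix: each node passes its heat
    to its neighbours in equal shares, and a fraction beta of the original
    heat is re-injected at every step (a random-walk-with-restart scheme).
    """
    nodes = sorted({n for edge in edges for n in edge} | set(heat))
    neighbours = {n: set() for n in nodes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    h0 = {n: heat.get(n, 0.0) for n in nodes}
    h = dict(h0)
    for _ in range(iters):
        h = {
            n: beta * h0[n]
            + (1 - beta) * sum(h[m] / len(neighbours[m]) for m in neighbours[n])
            for n in nodes
        }
    return h


# Toy, made-up network and P-values; g1..g4 are hypothetical gene names.
toy_edges = [("g1", "g2"), ("g2", "g3"), ("g3", "g4")]
toy_heat = heat_from_pvalues({"g1": 1e-6, "g2": 1e-3})
toy_result = diffuse(toy_edges, toy_heat)
```

In this toy run, `g3` and `g4` start with no heat and receive some from their neighbours, which is how a gene without a significant GWAS signal can still surface in a "hot" subnetwork.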
Affiliation(s)
- Ze-Jia Cui
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Wei-Tong Zhang
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Qiang Zhu
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Qing-Ye Zhang
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Hong-Yu Zhang
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
|
5
Gabrielian A, Engle E, Harris M, Wollenberg K, Juarez-Espinosa O, Glogowski A, Long A, Patti L, Hurt DE, Rosenthal A, Tartakovsky M. TB DEPOT (Data Exploration Portal): A multi-domain tuberculosis data analysis resource. PLoS One 2019; 14:e0217410. [PMID: 31120982 PMCID: PMC6532897 DOI: 10.1371/journal.pone.0217410] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 11/28/2018] [Accepted: 05/10/2019] [Indexed: 02/06/2023]
Abstract
The NIAID TB Portals Program (TBPP) established a unique and growing database repository of socioeconomic, geographic, clinical, laboratory, radiological, and genomic data from patient cases of drug-resistant tuberculosis (DR-TB). Currently, there are 2,428 total cases from nine country sites (Azerbaijan, Belarus, Moldova, Georgia, Romania, China, India, Kazakhstan, and South Africa), 1,611 (66%) of which are multidrug- or extensively-drug resistant and 1,185 (49%), 863 (36%), and 952 (39%) of which contain X-ray, computed tomography (CT) scan, and genomic data, respectively. We introduce the Data Exploration Portal (TB DEPOT, https://depot.tbportals.niaid.nih.gov) to visualize and analyze these multi-domain data. The TB DEPOT leverages the TBPP integration of clinical, socioeconomic, genomic, and imaging data into standardized formats and enables user-driven, repeatable, and reproducible analyses. It furthers the TBPP goals to provide a web-enabled analytics platform to countries with a high burden of multidrug-resistant TB (MDR-TB) but limited IT resources and inaccessible data, and enables the reusability of data, in conformity with the NIH's Findable, Accessible, Interoperable, and Reusable (FAIR) principles. TB DEPOT provides access to "analysis-ready" data and the ability to generate and test complex clinically-oriented hypotheses instantaneously with minimal statistical background and data processing skills. TB DEPOT is also promising for enhancing medical training and furnishing well annotated, hard to find, MDR-TB patient cases. TB DEPOT, as part of TBPP, further fosters collaborative research efforts to better understand drug-resistant tuberculosis and aid in the development of novel diagnostics and personalized treatment regimens.
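As a quick arithmetic check, the percentages quoted in the abstract above can be recomputed from the case counts it gives. The counts come from the abstract; the dictionary keys and the `share` helper are illustrative names only.

```python
TOTAL_CASES = 2428  # total cases reported in the abstract

# Subset sizes as stated in the abstract.
subsets = {
    "mdr_or_xdr": 1611,
    "xray": 1185,
    "ct": 863,
    "genomic": 952,
}


def share(part, total):
    """Return part/total as a percentage rounded to the nearest integer."""
    return round(100 * part / total)


percentages = {name: share(count, TOTAL_CASES) for name, count in subsets.items()}
```

The rounded shares match the figures in the abstract (66%, 49%, 36%, and 39%).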
Affiliation(s)
- Andrei Gabrielian
  - Office of Cyber Infrastructure & Computational Biology, National Institute of Allergy and Infectious Disease, National Institutes of Health, Bethesda, MD, United States of America
- Eric Engle
  - Office of Cyber Infrastructure & Computational Biology, National Institute of Allergy and Infectious Disease, National Institutes of Health, Bethesda, MD, United States of America
- Michael Harris
  - Office of Cyber Infrastructure & Computational Biology, National Institute of Allergy and Infectious Disease, National Institutes of Health, Bethesda, MD, United States of America
- Kurt Wollenberg
  - Office of Cyber Infrastructure & Computational Biology, National Institute of Allergy and Infectious Disease, National Institutes of Health, Bethesda, MD, United States of America
- Octavio Juarez-Espinosa
  - Office of Cyber Infrastructure & Computational Biology, National Institute of Allergy and Infectious Disease, National Institutes of Health, Bethesda, MD, United States of America
- Alexander Glogowski
  - Office of Cyber Infrastructure & Computational Biology, National Institute of Allergy and Infectious Disease, National Institutes of Health, Bethesda, MD, United States of America
- Alyssa Long
  - Office of Cyber Infrastructure & Computational Biology, National Institute of Allergy and Infectious Disease, National Institutes of Health, Bethesda, MD, United States of America
- Lisa Patti
  - Office of Cyber Infrastructure & Computational Biology, National Institute of Allergy and Infectious Disease, National Institutes of Health, Bethesda, MD, United States of America
- Darrell E. Hurt
  - Office of Cyber Infrastructure & Computational Biology, National Institute of Allergy and Infectious Disease, National Institutes of Health, Bethesda, MD, United States of America
- Alex Rosenthal
  - Office of Cyber Infrastructure & Computational Biology, National Institute of Allergy and Infectious Disease, National Institutes of Health, Bethesda, MD, United States of America
- Mike Tartakovsky
  - Office of Cyber Infrastructure & Computational Biology, National Institute of Allergy and Infectious Disease, National Institutes of Health, Bethesda, MD, United States of America
6
Disease Diagnosis in Smart Healthcare: Innovation, Technologies and Applications. Sustainability 2017. [DOI: 10.3390/su9122309] [Citation(s) in RCA: 72] [Impact Index Per Article: 10.3] [Indexed: 11/16/2022]