1
Tanaka M. From Serendipity to Precision: Integrating AI, Multi-Omics, and Human-Specific Models for Personalized Neuropsychiatric Care. Biomedicines 2025; 13:167. [PMID: 39857751 PMCID: PMC11761901 DOI: 10.3390/biomedicines13010167] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/09/2024] [Revised: 01/04/2025] [Accepted: 01/10/2025] [Indexed: 01/27/2025] Open
Abstract
Background/Objectives: The dual forces of structured inquiry and serendipitous discovery have long shaped neuropsychiatric research, with groundbreaking treatments such as lithium and ketamine resulting from unexpected discoveries. However, relying on chance is becoming increasingly insufficient to address the rising prevalence of mental health disorders like depression and schizophrenia, which necessitate precise, innovative approaches. Emerging technologies like artificial intelligence, induced pluripotent stem cells, and multi-omics have the potential to transform this field by allowing for predictive, patient-specific interventions. Despite these advancements, traditional methodologies such as animal models and single-variable analyses continue to be used, frequently failing to capture the complexities of human neuropsychiatric conditions. Summary: This review critically evaluates the transition from serendipity to precision-based methodologies in neuropsychiatric research. It focuses on key innovations such as dynamic systems modeling and network-based approaches that use genetic, molecular, and environmental data to identify new therapeutic targets. Furthermore, it emphasizes the importance of interdisciplinary collaboration and human-specific models in overcoming the limitations of traditional approaches. Conclusions: We highlight precision psychiatry's transformative potential for revolutionizing mental health care. This paradigm shift, which combines cutting-edge technologies with systematic frameworks, promises increased diagnostic accuracy, reproducibility, and efficiency, paving the way for tailored treatments and better patient outcomes in neuropsychiatric care.
Affiliation(s)
- Masaru Tanaka
- HUN-REN-SZTE Neuroscience Research Group, Hungarian Research Network, University of Szeged (HUN-REN-SZTE), Danube Neuroscience Research Laboratory, Tisza Lajos krt. 113, H-6725 Szeged, Hungary
2
Moulaei K, Afshari L, Moulaei R, Sabet B, Mousavi SM, Afrash MR. Explainable artificial intelligence for stroke prediction through comparison of deep learning and machine learning models. Sci Rep 2024; 14:31392. [PMID: 39733046 DOI: 10.1038/s41598-024-82931-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/07/2024] [Accepted: 12/10/2024] [Indexed: 12/30/2024] Open
Abstract
Failure to predict stroke promptly may lead to delayed treatment, causing severe consequences like permanent neurological damage or death. Early detection using deep learning (DL) and machine learning (ML) models can enhance patient outcomes and mitigate the long-term effects of strokes. The aim of this study is to compare these models, exploring their efficacy in predicting stroke. This study analyzed a dataset comprising 663 records from patients hospitalized at Hazrat Rasool Akram Hospital in Tehran, Iran, including 401 healthy individuals and 262 stroke patients. A total of eight established ML (SVM, XGB, KNN, RF) and DL (DNN, FNN, LSTM, CNN) models were utilized to predict stroke. Techniques such as 10-fold cross-validation and hyperparameter tuning were implemented to prevent overfitting. The study also focused on interpretability through Shapley Additive Explanations (SHAP). The evaluation of the models' performance was based on accuracy, specificity, sensitivity, F1-score, and ROC curve metrics. Among DL models, LSTM showed the highest sensitivity (96.15%), while FNN exhibited the best specificity (96.0%), accuracy (96.0%), F1-score (95.0%), and ROC (98.0%). For ML models, RF displayed the highest sensitivity (99.9%), accuracy (99.0%), specificity (100%), F1-score (99.0%), and ROC (99.9%). Overall, RF outperformed all models, while DL models surpassed the remaining ML models in most metrics. DL models (CNN, LSTM, DNN, FNN) achieved sensitivities from 93.0 to 96.15%, specificities from 80.0 to 96.0%, accuracies from 92.0 to 96.0%, F1-scores from 87.34 to 95.0%, and ROC scores from 95.0 to 98.0%. In contrast, ML models (KNN, XGB, SVM) showed sensitivities between 29.0% and 94.0%, specificities between 89.47% and 96.0%, accuracies between 71.0% and 95.0%, F1-scores between 44.0% and 95.0%, and ROC scores between 64.0% and 95.0%.
This study demonstrates the efficacy of DL and ML models in predicting stroke, with the RF model outperforming all others in key metrics. While DL models generally surpassed ML models, RF's exceptional performance highlights the potential of combining these technologies for early stroke detection, significantly improving patient outcomes by preventing severe consequences like permanent neurological damage or death.
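The metrics named in this abstract (sensitivity, specificity, accuracy, F1-score) all derive from the four confusion-matrix counts. The following is a minimal sketch of those formulas, not the authors' code; the function name `classification_metrics` and the example counts are hypothetical and are not taken from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common binary-classification metrics from confusion-matrix counts.

    tp/fp/tn/fn are true positives, false positives, true negatives and
    false negatives. A ratio with a zero denominator is reported as 0.0.
    """
    def safe_div(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    sensitivity = safe_div(tp, tp + fn)   # recall: share of actual positives found
    specificity = safe_div(tn, tn + fp)   # share of actual negatives found
    precision = safe_div(tp, tp + fp)     # share of predicted positives that are correct
    accuracy = safe_div(tp + tn, tp + fp + tn + fn)
    f1 = safe_div(2 * precision * sensitivity, precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts for illustration only (not data from the study).
metrics = classification_metrics(tp=50, fp=4, tn=76, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Reporting sensitivity and specificity side by side, as the abstract does, matters when classes are imbalanced, because accuracy alone can look high while one class is poorly detected.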
Affiliation(s)
- Khadijeh Moulaei
- Health Management and Economics Research Center, Health Management Research Institute, Iran University of Medical Sciences, Tehran, Iran
- Artificial Intelligence in Medical Sciences Research Center, Smart University of Medical Sciences, Tehran, Iran
- Lida Afshari
- Department of Health Management and Information Sciences, Iran University of Medical Sciences, Tehran, Iran
- Reza Moulaei
- Department of Orthopedic and Trauma Surgery, Shariati Hospital and School of Medicine, Tehran University of Medical Sciences, Tehran, Iran
- Babak Sabet
- Artificial Intelligence in Medical Sciences Research Center, Smart University of Medical Sciences, Tehran, Iran
- Department of Surgery, Faculty of Medicine, Shahid Beheshti University of Medical Sciences, Tehran, Iran
- Seyed Mohammad Mousavi
- Medical Informatics Research Center, Institute for Futures Studies in Health, Kerman University of Medical Sciences, Kerman, Iran
- Mohammad Reza Afrash
- Artificial Intelligence in Medical Sciences Research Center, Smart University of Medical Sciences, Tehran, Iran.
- Department of Artificial Intelligence in Medical Sciences Research Center, Smart University of Medical Sciences, Tehran, Iran.
3
Santucci K, Cheng Y, Xu SM, Janitz M. Enhancing novel isoform discovery: leveraging nanopore long-read sequencing and machine learning approaches. Brief Funct Genomics 2024; 23:683-694. [PMID: 39158328 DOI: 10.1093/bfgp/elae031] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/22/2024] [Revised: 07/29/2024] [Accepted: 07/31/2024] [Indexed: 08/20/2024] Open
Abstract
Long-read sequencing technologies can capture entire RNA transcripts in a single sequencing read, reducing the ambiguity in constructing and quantifying transcript models in comparison to more common and earlier methods, such as short-read sequencing. Recent improvements in the accuracy of long-read sequencing technologies have expanded the scope for novel splice isoform detection and have also enabled a far more accurate reconstruction of complex splicing patterns and transcriptomes. Additionally, the incorporation and advancements of machine learning and deep learning algorithms in bioinformatic software have significantly improved the reliability of long-read sequencing transcriptomic studies. However, there is a lack of consensus on what bioinformatic tools and pipelines produce the most precise and consistent results. Thus, this review aims to discuss and compare the performance of available methods for novel isoform discovery with long-read sequencing technologies, with 25 tools being presented. Furthermore, this review intends to demonstrate the need for developing standard analytical pipelines, tools, and transcript model conventions for novel isoform discovery and transcriptomic studies.
Affiliation(s)
- Kristina Santucci
- School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW 2052, Australia
- Yuning Cheng
- School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW 2052, Australia
- Si-Mei Xu
- School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW 2052, Australia
- Michael Janitz
- School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW 2052, Australia
4
Moeckel C, Mareboina M, Konnaris MA, Chan CS, Mouratidis I, Montgomery A, Chantzi N, Pavlopoulos GA, Georgakopoulos-Soares I. A survey of k-mer methods and applications in bioinformatics. Comput Struct Biotechnol J 2024; 23:2289-2303. [PMID: 38840832 PMCID: PMC11152613 DOI: 10.1016/j.csbj.2024.05.025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2024] [Revised: 05/14/2024] [Accepted: 05/15/2024] [Indexed: 06/07/2024] Open
Abstract
The rapid progression of genomics and proteomics has been driven by the advent of advanced sequencing technologies, large, diverse, and readily available omics datasets, and the evolution of computational data processing capabilities. The vast amount of data generated by these advancements necessitates efficient algorithms to extract meaningful information. K-mers serve as a valuable tool when working with large sequencing datasets, offering several advantages in computational speed and memory efficiency and carrying the potential for intrinsic biological functionality. This review provides an overview of the methods, applications, and significance of k-mers in genomic and proteomic data analyses, as well as the utility of absent sequences, including nullomers and nullpeptides, in disease detection, vaccine development, therapeutics, and forensic science. Therefore, the review highlights the pivotal role of k-mers in addressing current genomic and proteomic problems and underscores their potential for future breakthroughs in research.
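As a rough illustration of the two ideas this abstract mentions, counting k-mers and finding sequences that are absent, here is a minimal sketch. It is not taken from the review: the function names `count_kmers` and `absent_kmers` and the toy sequence are hypothetical, and the "absent" set is computed only for this one short string, whereas nullomers are usually defined against a whole genome.

```python
from collections import Counter
from itertools import product


def count_kmers(sequence, k):
    """Count every overlapping substring of length k in the sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def absent_kmers(sequence, k, alphabet="ACGT"):
    """List all possible k-mers over the alphabet that never occur in the sequence."""
    present = count_kmers(sequence, k)
    candidates = ("".join(letters) for letters in product(alphabet, repeat=k))
    return sorted(kmer for kmer in candidates if kmer not in present)


# Toy DNA string for illustration only.
toy_sequence = "ACGTACGTGA"
kmer_counts = count_kmers(toy_sequence, 2)
missing = absent_kmers(toy_sequence, 2)

print(kmer_counts.most_common(3))
print(missing)
```

Enumerating all 4^k candidates is only practical for small k; for larger k, tools typically rely on hashing or compact data structures instead.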
Affiliation(s)
- Camille Moeckel
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Manvita Mareboina
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Maxwell A. Konnaris
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Candace S.Y. Chan
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
- Ioannis Mouratidis
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
- Austin Montgomery
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Nikol Chantzi
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Ilias Georgakopoulos-Soares
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
5
Ben Ncir CE, Ben HajKacem MA, Alattas M. Enhancing intrusion detection performance using explainable ensemble deep learning. PeerJ Comput Sci 2024; 10:e2289. [PMID: 39314740 PMCID: PMC11419647 DOI: 10.7717/peerj-cs.2289] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/21/2023] [Accepted: 08/06/2024] [Indexed: 09/25/2024]
Abstract
Given the exponential growth of available data in large networks, an accurate and explainable intrusion detection system has become essential for effectively discovering attacks in such networks. To deal with this challenge, we propose a two-phase Explainable Ensemble deep learning-based method (EED) for intrusion detection. In the first phase, a new ensemble intrusion detection model using three one-dimensional long short-term memory networks (LSTM) is designed for accurate attack identification. The outputs of the three classifiers are aggregated using a meta-learner algorithm, resulting in refined and improved results. In the second phase, the interpretability and explainability of EED outputs are enhanced by leveraging the capabilities of SHapley Additive exPlanations (SHAP). Factors contributing to the identification and classification of attacks are highlighted, which allows security experts to understand and interpret the attack behavior and then implement effective response strategies to improve network security. Experiments conducted on real datasets have shown the effectiveness of EED compared to conventional intrusion detection methods in terms of both accuracy and explainability. The EED method exhibits high accuracy in identifying and classifying attacks while providing transparency and interpretability.
Affiliation(s)
- Mohammed Alattas
- MIS Department, College of Business, University of Jeddah, Jeddah, Jeddah, Saudi Arabia
6
Miller C, Portlock T, Nyaga DM, O'Sullivan JM. A review of model evaluation metrics for machine learning in genetics and genomics. Front Bioinform 2024; 4:1457619. [PMID: 39318760 PMCID: PMC11420621 DOI: 10.3389/fbinf.2024.1457619] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2024] [Accepted: 08/27/2024] [Indexed: 09/26/2024] Open
Abstract
Machine learning (ML) has shown great promise in genetics and genomics where large and complex datasets have the potential to provide insight into many aspects of disease risk, pathogenesis of genetic disorders, and prediction of health and wellbeing. However, with this possibility there is a responsibility to exercise caution against biases and inflation of results that can have harmful unintended impacts. Therefore, researchers must understand the metrics used to evaluate ML models which can influence the critical interpretation of results. In this review we provide an overview of ML metrics for clustering, classification, and regression and highlight the advantages and disadvantages of each. We also detail common pitfalls that occur during model evaluation. Finally, we provide examples of how researchers can assess and utilise the results of ML models, specifically from a genomics perspective.
Affiliation(s)
- Catriona Miller
- The Liggins Institute, The University of Auckland, Auckland, New Zealand
- Theo Portlock
- The Liggins Institute, The University of Auckland, Auckland, New Zealand
- Denis M Nyaga
- The Liggins Institute, The University of Auckland, Auckland, New Zealand
- Justin M O'Sullivan
- The Liggins Institute, The University of Auckland, Auckland, New Zealand
- The Maurice Wilkins Centre, The University of Auckland, Auckland, New Zealand
- MRC Lifecourse Epidemiology Unit, University of Southampton, Southampton, United Kingdom
- Singapore Institute for Clinical Sciences, Agency for Science Technology and Research, Singapore, Singapore
7
Hakami MA. Harnessing machine learning potential for personalised drug design and overcoming drug resistance. J Drug Target 2024; 32:918-930. [PMID: 38842417 DOI: 10.1080/1061186x.2024.2365934] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/09/2024] [Revised: 06/01/2024] [Accepted: 06/04/2024] [Indexed: 06/07/2024]
Abstract
Drug resistance in cancer treatment presents a significant challenge, necessitating innovative approaches to improve therapeutic efficacy. Integrating machine learning (ML) in cancer research is promising, as ML algorithms excel at analysing complex datasets, identifying patterns, and predicting treatment outcomes. Leveraging diverse data sources such as genomic profiles, clinical records, and drug response assays, ML uncovers molecular mechanisms of drug resistance, enabling personalised treatment, maximising efficacy and minimising adverse effects. Various ML algorithms contribute to the drug discovery process: Random Forest and Decision Trees predict drug-target interactions and aid in virtual screening, and SVMs classify leads based on bioactivity data. Neural Networks model QSAR to optimise lead compounds, and K-means clustering groups compounds with similar chemical properties, aiding compound selection. Gaussian Processes predict drug responses, Bayesian Networks infer causal relationships, Autoencoders generate novel compounds, and Genetic Algorithms optimise molecular structures. These algorithms collectively enhance efficiency and success rates in drug design, from lead identification to optimisation, and are cost-effective, empowering clinicians with real-time treatment monitoring and improving patient outcomes. This review highlights the immense potential of ML in revolutionising cancer care through effective drug design to reduce drug resistance, and also discusses various limitations and research gaps.
Affiliation(s)
- Mohammed Ageeli Hakami
- Department of Clinical Laboratory Sciences, College of Applied Medical Sciences, Shaqra University, Al-Quwayiyah, Riyadh, Saudi Arabia
8
Álvarez-Machancoses Ó, Faraggi E, deAndrés-Galiana EJ, Fernández-Martínez JL, Kloczkowski A. Prediction of Deleterious Single Amino Acid Polymorphisms with a Consensus Holdout Sampler. Curr Genomics 2024; 25:171-184. [PMID: 39086995 PMCID: PMC11288160 DOI: 10.2174/0113892029236347240308054538] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/03/2023] [Revised: 08/03/2023] [Accepted: 09/22/2023] [Indexed: 08/02/2024] Open
Abstract
Background: Single Amino Acid Polymorphisms (SAPs) or nonsynonymous Single Nucleotide Variants (nsSNVs) are the most common genetic variations. They result from missense mutations where a single base pair substitution changes the genetic code in such a way that the triplet of bases (codon) at a given position codes for a different amino acid. Since genetic mutations sometimes cause genetic diseases, it is important to understand and predict which variations are harmful and which ones are neutral (not causing changes in the phenotype). This can be posed as a classification problem. Methods: Computational methods using machine intelligence are gradually replacing repetitive and exceedingly expensive mutagenic tests. By and large, the uneven quality, deficiencies, and irregularities of nsSNV datasets degrade the usefulness of artificial intelligence-based methods. Consequently, more robust and more accurate approaches are needed to address these problems. In the present paper, we show a consensus classifier built on the holdout sampler, which appears robust and precise and outperforms all other popular methods. Results: We produced 100 holdouts to test the structures and diverse classification variables of diverse classifiers during the training phase. The best-performing holdouts were chosen to develop a consensus classifier and tested using a k-fold (1 ≤ k ≤ 5) cross-validation method. We also examined which protein properties have the biggest impact on the precise prediction of the effects of nsSNVs. Conclusion: Our Consensus Holdout Sampler outperforms other popular algorithms and gives excellent results, highly accurate with low standard deviation. The advantage of our method emerges from using a tree of holdouts, where diverse LM/AI-based programs are sampled in diverse ways.
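One common way to combine several classifiers into a consensus is majority voting over their per-sample predictions. The sketch below shows only that generic idea and is an assumption on our part: the abstract does not specify how the Consensus Holdout Sampler aggregates its models, and the function name `consensus_vote` and the example labels are hypothetical.

```python
from collections import Counter


def consensus_vote(predictions):
    """Majority vote across classifiers.

    predictions is a list with one label list per classifier; all label
    lists must have the same length. With an odd number of classifiers and
    two classes there are no ties; otherwise a tie resolves to the label
    that appears first among the votes for that sample.
    """
    if not predictions:
        raise ValueError("at least one classifier's predictions are required")
    n_samples = len(predictions[0])
    if any(len(labels) != n_samples for labels in predictions):
        raise ValueError("all classifiers must predict the same number of samples")
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


# Hypothetical outputs of three classifiers on four variants (illustration only).
classifier_a = ["deleterious", "neutral", "neutral", "deleterious"]
classifier_b = ["deleterious", "deleterious", "neutral", "neutral"]
classifier_c = ["neutral", "deleterious", "neutral", "deleterious"]

consensus = consensus_vote([classifier_a, classifier_b, classifier_c])
print(consensus)
```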
Affiliation(s)
- Óscar Álvarez-Machancoses
- Group of Inverse Problems, Optimization and Machine Learning, Department of Mathematics, University of Oviedo, C. Federico García Lorca, 18, 33007, Oviedo, Spain
- Eshel Faraggi
- School of Science, Indiana University-Purdue University Indianapolis, IN, USA
- Enrique J deAndrés-Galiana
- Group of Inverse Problems, Optimization and Machine Learning, Department of Mathematics, University of Oviedo, C. Federico García Lorca, 18, 33007, Oviedo, Spain
- Department of Computer Science, University of Oviedo, C. Federico García Lorca, 18, 33007, Oviedo, Spain
- Juan L Fernández-Martínez
- Group of Inverse Problems, Optimization and Machine Learning, Department of Mathematics, University of Oviedo, C. Federico García Lorca, 18, 33007, Oviedo, Spain
- Andrzej Kloczkowski
- Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, USA
- Department of Pediatrics, The Ohio State University, Columbus, OH, USA
9
Nadal E, Benito E, Ródenas-Navarro AM, Palanca A, Martinez-Hervas S, Civera M, Ortega J, Alabadi B, Piqueras L, Ródenas JJ, Real JT. Machine Learning Model in Obesity to Predict Weight Loss One Year after Bariatric Surgery: A Pilot Study. Biomedicines 2024; 12:1175. [PMID: 38927382 PMCID: PMC11200726 DOI: 10.3390/biomedicines12061175] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/29/2024] [Revised: 05/13/2024] [Accepted: 05/21/2024] [Indexed: 06/28/2024] Open
Abstract
Roux-en-Y gastric bypass (RYGB) is a treatment for severe obesity. However, many patients have insufficient total weight loss (TWL) after RYGB. Although multiple factors have been implicated, their influence is incompletely known. The aim of this exploratory study was to evaluate the feasibility and reliability of using machine learning (ML) techniques to estimate the success in weight loss after RYGB, based on clinical, anthropometric and biochemical data, in order to identify morbidly obese patients with poor weight responses. We retrospectively analyzed 118 patients who underwent RYGB at the Hospital Clínico Universitario of Valencia (Spain) between 2013 and 2017. We applied an ML approach using local linear embedding (LLE) as a tool for the evaluation and classification of the main parameters, in conjunction with evolutionary algorithms for the optimization and adjustment of the parameter model. The variables associated with one-year postoperative %TWL were obstructive sleep apnea, osteoarthritis, insulin treatment, preoperative weight, insulin resistance index, apolipoprotein A, uric acid, complement component 3, and vitamin B12. The model correctly classified 71.4% of subjects with TWL < 30%, although 36.4% of those with TWL ≥ 30% were incorrectly classified as "unsuccessful procedures". The ML model showed moderate discriminatory precision in the validation set. Thus, in severe obesity, ML models can be useful to assist in the selection of patients before bariatric surgery.
Affiliation(s)
- Enrique Nadal
- Instituto Universitario de Ingeniería Mecánica y Biomecánica (I2MB), Universitat Politècnica de València, 46022 Valencia, Spain;
- Esther Benito
- CIBER de Diabetes y Enfermedades Metabólicas Asociadas (CIBERDEM), Instituto de Salud Carlos III (ISCIII), 28040 Madrid, Spain; (E.B.); (B.A.); (L.P.); (J.T.R.)
- Ana María Ródenas-Navarro
- Endocrinology and Nutrition Service, Clinical University Hospital of Valencia, 46010 Valencia, Spain; (A.M.R.-N.); (A.P.); (M.C.)
- Ana Palanca
- Endocrinology and Nutrition Service, Clinical University Hospital of Valencia, 46010 Valencia, Spain; (A.M.R.-N.); (A.P.); (M.C.)
- INCLIVA Biomedical Research Institute, 46010 Valencia, Spain;
- Sergio Martinez-Hervas
- CIBER de Diabetes y Enfermedades Metabólicas Asociadas (CIBERDEM), Instituto de Salud Carlos III (ISCIII), 28040 Madrid, Spain; (E.B.); (B.A.); (L.P.); (J.T.R.)
- Endocrinology and Nutrition Service, Clinical University Hospital of Valencia, 46010 Valencia, Spain; (A.M.R.-N.); (A.P.); (M.C.)
- INCLIVA Biomedical Research Institute, 46010 Valencia, Spain;
- Department of Medicine, University of Valencia, 46010 Valencia, Spain
- Miguel Civera
- Endocrinology and Nutrition Service, Clinical University Hospital of Valencia, 46010 Valencia, Spain; (A.M.R.-N.); (A.P.); (M.C.)
- INCLIVA Biomedical Research Institute, 46010 Valencia, Spain;
- Joaquín Ortega
- INCLIVA Biomedical Research Institute, 46010 Valencia, Spain;
- General Surgery Service, University Hospital of Valencia, 46010 Valencia, Spain
- Department of Surgery, University of Valencia, 46010 Valencia, Spain
- Blanca Alabadi
- CIBER de Diabetes y Enfermedades Metabólicas Asociadas (CIBERDEM), Instituto de Salud Carlos III (ISCIII), 28040 Madrid, Spain; (E.B.); (B.A.); (L.P.); (J.T.R.)
- Endocrinology and Nutrition Service, Clinical University Hospital of Valencia, 46010 Valencia, Spain; (A.M.R.-N.); (A.P.); (M.C.)
- INCLIVA Biomedical Research Institute, 46010 Valencia, Spain;
- Laura Piqueras
- CIBER de Diabetes y Enfermedades Metabólicas Asociadas (CIBERDEM), Instituto de Salud Carlos III (ISCIII), 28040 Madrid, Spain; (E.B.); (B.A.); (L.P.); (J.T.R.)
- INCLIVA Biomedical Research Institute, 46010 Valencia, Spain;
- Department of Pharmacology, University of Valencia, 46010 Valencia, Spain
- Juan José Ródenas
- Instituto Universitario de Ingeniería Mecánica y Biomecánica (I2MB), Universitat Politècnica de València, 46022 Valencia, Spain;
- José T. Real
- CIBER de Diabetes y Enfermedades Metabólicas Asociadas (CIBERDEM), Instituto de Salud Carlos III (ISCIII), 28040 Madrid, Spain; (E.B.); (B.A.); (L.P.); (J.T.R.)
- Endocrinology and Nutrition Service, Clinical University Hospital of Valencia, 46010 Valencia, Spain; (A.M.R.-N.); (A.P.); (M.C.)
- INCLIVA Biomedical Research Institute, 46010 Valencia, Spain;
- Department of Medicine, University of Valencia, 46010 Valencia, Spain
10
Gündüz HA, Mreches R, Moosbauer J, Robertson G, To XY, Franzosa EA, Huttenhower C, Rezaei M, McHardy AC, Bischl B, Münch PC, Binder M. Optimized model architectures for deep learning on genomic data. Commun Biol 2024; 7:516. [PMID: 38693292 PMCID: PMC11063068 DOI: 10.1038/s42003-024-06161-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2023] [Accepted: 04/08/2024] [Indexed: 05/03/2024] Open
Abstract
The success of deep learning in various applications depends on task-specific architecture design choices, including the types, hyperparameters, and number of layers. In computational biology, there is no consensus on the optimal architecture design, and decisions are often made using insights from more well-established fields such as computer vision. These may not consider the domain-specific characteristics of genome sequences, potentially limiting performance. Here, we present GenomeNet-Architect, a neural architecture design framework that automatically optimizes deep learning models for genome sequence data. It optimizes the overall layout of the architecture, with a search space specifically designed for genomics. Additionally, it optimizes hyperparameters of individual layers and the model training procedure. On a viral classification task, GenomeNet-Architect reduced the read-level misclassification rate by 19%, with 67% faster inference and 83% fewer parameters, and achieved similar contig-level accuracy with ~100 times fewer parameters compared to the best-performing deep learning baselines.
Affiliation(s)
- Hüseyin Anil Gündüz
- Department of Statistics, LMU Munich, Munich, Germany
- Munich Center for Machine Learning, Munich, Germany
- René Mreches
- Department for Computational Biology of Infection Research, Helmholtz Center for Infection Research, 38124, Braunschweig, Germany
- Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany
- Julia Moosbauer
- Department of Statistics, LMU Munich, Munich, Germany
- Munich Center for Machine Learning, Munich, Germany
- Gary Robertson
- Department for Computational Biology of Infection Research, Helmholtz Center for Infection Research, 38124, Braunschweig, Germany
- Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany
- Xiao-Yin To
- Department of Statistics, LMU Munich, Munich, Germany
- Munich Center for Machine Learning, Munich, Germany
- Department for Computational Biology of Infection Research, Helmholtz Center for Infection Research, 38124, Braunschweig, Germany
- Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany
- Eric A Franzosa
- Department of Biostatistics, Harvard School of Public Health, Boston, MA, USA
- Curtis Huttenhower
- Department of Biostatistics, Harvard School of Public Health, Boston, MA, USA
- Mina Rezaei
- Department of Statistics, LMU Munich, Munich, Germany
- Munich Center for Machine Learning, Munich, Germany
- Alice C McHardy
- Department for Computational Biology of Infection Research, Helmholtz Center for Infection Research, 38124, Braunschweig, Germany
- Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany
- German Centre for Infection Research (DZIF), partner site Hannover Braunschweig, Braunschweig, Germany
- Bernd Bischl
- Department of Statistics, LMU Munich, Munich, Germany
- Munich Center for Machine Learning, Munich, Germany
- Philipp C Münch
- Department for Computational Biology of Infection Research, Helmholtz Center for Infection Research, 38124, Braunschweig, Germany.
- Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany.
- Department of Biostatistics, Harvard School of Public Health, Boston, MA, USA.
- German Centre for Infection Research (DZIF), partner site Hannover Braunschweig, Braunschweig, Germany.
- Martin Binder
- Department of Statistics, LMU Munich, Munich, Germany.
- Munich Center for Machine Learning, Munich, Germany.
11
Lac L, Leung CK, Hu P. Computational frameworks integrating deep learning and statistical models in mining multimodal omics data. J Biomed Inform 2024; 152:104629. [PMID: 38552994 DOI: 10.1016/j.jbi.2024.104629] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/03/2024] [Revised: 02/26/2024] [Accepted: 03/25/2024] [Indexed: 04/04/2024]
Abstract
BACKGROUND In health research, multimodal omics data analysis is widely used to address important clinical and biological questions. Traditional statistical methods, such as hypothesis testing and differential expression analysis, are commonly used in omics analysis but rely on strong distributional assumptions. Deep learning, on the other hand, is powerful in mining high-dimensional omics data for prediction tasks. Recently, integrative frameworks or methods have been developed for omics studies that combine statistical models and deep learning algorithms. METHODS AND RESULTS The aim of these integrative frameworks is to combine the strengths of both statistical methods and deep learning algorithms to improve prediction accuracy while also providing interpretability and explainability. This review discusses the current state-of-the-art integrative frameworks, their limitations, and potential future directions in survival and time-to-event longitudinal analysis, dimension reduction and clustering, regression and classification, feature selection, and causal and transfer learning.
Affiliation(s)
- Leann Lac
- Department of Computer Science, University of Manitoba, Winnipeg, Manitoba, Canada; Department of Statistics, University of Manitoba, Winnipeg, Manitoba, Canada
| | - Carson K Leung
- Department of Computer Science, University of Manitoba, Winnipeg, Manitoba, Canada
| | - Pingzhao Hu
- Department of Computer Science, University of Manitoba, Winnipeg, Manitoba, Canada; Department of Biochemistry, Western University, London, Ontario, Canada; Department of Computer Science, Western University, London, Ontario, Canada; Department of Oncology, Western University, London, Ontario, Canada; Department of Epidemiology and Biostatistics, Western University, London, Ontario, Canada; The Children's Health Research Institute, Lawson Health Research Institute, London, Ontario, Canada.
| |
|
12
|
Chakraborty C, Bhattacharya M, Sharma AR, Chatterjee S, Agoramoorthy G, Lee SS. Structural Landscape of nsp Coding Genomic Regions of SARS-CoV-2-ssRNA Genome: A Structural Genomics Approach Toward Identification of Druggable Genome, Ligand-Binding Pockets, and Structure-Based Druggability. Mol Biotechnol 2024; 66:641-662. [PMID: 36463562 PMCID: PMC9735222 DOI: 10.1007/s12033-022-00605-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/05/2022] [Accepted: 11/07/2022] [Indexed: 12/05/2022]
Abstract
SARS-CoV-2 has a single-stranded RNA genome (+ssRNA) and synthesizes structural and non-structural proteins (nsps). All 16 nsps are synthesized from the ORF1a and ORF1b regions and are associated with different life cycle processes, including replication. ORF1a encodes nsp1 to 11, and ORF1b encodes nsp12 to 16. In this paper, we have predicted the secondary structure conformations, entropy and mountain plots, RNA secondary structure in a linear fashion, and 3D structure of the nsp coding genes of the SARS-CoV-2 genome. We have also analyzed the A, T, G, C, A+T, and G+C contents and the GC-profiling of these genes, showing that the GC content ranges from 34.23 to 48.52%. We have observed that the GC-profile value of the nsp coding genomic regions was lower (about 0.375) than that of the whole genome (about 0.38). Additionally, druggable pockets were identified from the secondary structure-guided 3D structural conformations. To generate secondary structures for all the nsp coding genes (nsp 1-16), we used a recent deep learning-based tool along with conventional algorithms (centroid and MFE-based), and we found stem-loop, multi-branch loop, pseudoknot, and bulge structural components, among others. The 3D model shows bound and unbound forms, branched structures, duplex structures, three-way junctions, four-way junctions, etc. Finally, we identified binding pockets of nsp coding genes, which will serve as a fundamental resource for future researchers developing RNA-targeted therapeutics using the druggable genome.
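The G+C content figures quoted in this abstract can be illustrated with a minimal Python sketch. The function names `gc_content` and `gc_profile` are hypothetical, and the sliding-window profile is only an assumed simplification; it is not the GC-profiling tool the authors used.

```python
def gc_content(seq: str) -> float:
    """Return the G+C fraction (0.0 to 1.0) of a nucleotide sequence."""
    seq = seq.upper()
    # Count only unambiguous bases so that N or gap characters do not dilute the ratio.
    counts = {base: seq.count(base) for base in "ACGTU"}
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T/U bases")
    return (counts["G"] + counts["C"]) / total


def gc_profile(seq: str, window: int, step: int) -> list[float]:
    """Return G+C fractions over sliding windows (an assumed, simplified profile)."""
    return [
        gc_content(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, step)
    ]
```

Multiplying the returned fraction by 100 gives a percentage comparable to the 34.23 to 48.52% range reported above.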
Affiliation(s)
- Chiranjib Chakraborty
- Department of Biotechnology, School of Life Science and Biotechnology, Adamas University, Kolkata, West Bengal, 700126, India.
| | - Manojit Bhattacharya
- Department of Zoology, Fakir Mohan University, Vyasa Vihar, Balasore, Odisha, 756020, India
| | - Ashish Ranjan Sharma
- Institute for Skeletal Aging & Orthopaedic Surgery, Hallym University-Chuncheon Sacred Heart Hospital, Chuncheon-si, Gangwon-do, 24252, Republic of Korea
| | - Srijan Chatterjee
- Department of Biotechnology, School of Life Science and Biotechnology, Adamas University, Kolkata, West Bengal, 700126, India
| | | | - Sang-Soo Lee
- Institute for Skeletal Aging & Orthopaedic Surgery, Hallym University-Chuncheon Sacred Heart Hospital, Chuncheon-si, Gangwon-do, 24252, Republic of Korea
| |
|
13
|
Dotan E, Jaschek G, Pupko T, Belinkov Y. Effect of tokenization on transformers for biological sequences. Bioinformatics 2024; 40:btae196. [PMID: 38608190 PMCID: PMC11055402 DOI: 10.1093/bioinformatics/btae196] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2023] [Revised: 02/20/2024] [Accepted: 04/11/2024] [Indexed: 04/14/2024] Open
Abstract
MOTIVATION Deep-learning models are transforming biological research, including many bioinformatics and comparative genomics algorithms, such as sequence alignments, phylogenetic tree inference, and automatic classification of protein functions. Among these deep-learning algorithms, models for processing natural languages, developed in the natural language processing (NLP) community, were recently applied to biological sequences. However, biological sequences are different from natural languages, such as English, and French, in which segmentation of the text to separate words is relatively straightforward. Moreover, biological sequences are characterized by extremely long sentences, which hamper their processing by current machine-learning models, notably the transformer architecture. In NLP, one of the first processing steps is to transform the raw text to a list of tokens. Deep-learning applications to biological sequence data mostly segment proteins and DNA to single characters. In this work, we study the effect of alternative tokenization algorithms on eight different tasks in biology, from predicting the function of proteins and their stability, through nucleotide sequence alignment, to classifying proteins to specific families. RESULTS We demonstrate that applying alternative tokenization algorithms can increase accuracy and at the same time, substantially reduce the input length compared to the trivial tokenizer in which each character is a token. Furthermore, applying these tokenization algorithms allows interpreting trained models, taking into account dependencies among positions. Finally, we trained these tokenizers on a large dataset of protein sequences containing more than 400 billion amino acids, which resulted in over a 3-fold decrease in the number of tokens. We then tested these tokenizers trained on large-scale data on the above specific tasks and showed that for some tasks it is highly beneficial to train database-specific tokenizers. 
Our study suggests that tokenizers are likely to be a critical component in future deep-network analysis of biological sequence data. AVAILABILITY AND IMPLEMENTATION Code, data, and trained tokenizers are available on https://github.com/technion-cs-nlp/BiologicalTokenizers.
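To make the contrast between the trivial character tokenizer and a learned tokenizer concrete, here is a minimal byte-pair-encoding (BPE) sketch in plain Python. It is an illustration only: BPE is assumed as one example of an "alternative tokenization algorithm", all function names are hypothetical, and this is not the code from the authors' repository.

```python
from collections import Counter


def char_tokenize(seq: str) -> list[str]:
    """Trivial tokenizer: every residue or nucleotide is its own token."""
    return list(seq)


def apply_merge(tokens: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_bpe(sequences: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merge rules by repeatedly fusing the most frequent adjacent pair."""
    corpus = [list(s) for s in sequences]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for tokens in corpus:
            pairs.update(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # Break frequency ties lexicographically so training is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        corpus = [apply_merge(tokens, best) for tokens in corpus]
    return merges


def bpe_tokenize(seq: str, merges: list[tuple[str, str]]) -> list[str]:
    """Tokenize a sequence by applying the learned merges in training order."""
    tokens = list(seq)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens
```

Comparing `len(bpe_tokenize(seq, merges))` with `len(char_tokenize(seq))` shows the kind of input-length reduction the abstract describes; the size of the reduction depends entirely on the training data and the number of merges.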
Affiliation(s)
- Edo Dotan
- The Henry and Marilyn Taub Faculty of Computer Science, Technion – Israel Institute of Technology, Haifa 3200003, Israel
- The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
| | - Gal Jaschek
- Department of Genetics, Yale University School of Medicine, New Haven, CT 06510, United States
| | - Tal Pupko
- The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
| | - Yonatan Belinkov
- The Henry and Marilyn Taub Faculty of Computer Science, Technion – Israel Institute of Technology, Haifa 3200003, Israel
| |
|
14
|
Mota LFM, Arikawa LM, Santos SWB, Fernandes Júnior GA, Alves AAC, Rosa GJM, Mercadante MEZ, Cyrillo JNSG, Carvalheiro R, Albuquerque LG. Benchmarking machine learning and parametric methods for genomic prediction of feed efficiency-related traits in Nellore cattle. Sci Rep 2024; 14:6404. [PMID: 38493207 PMCID: PMC10944497 DOI: 10.1038/s41598-024-57234-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2023] [Accepted: 03/15/2024] [Indexed: 03/18/2024] Open
Abstract
Genomic selection (GS) offers a promising opportunity for selecting animals that use consumed energy more efficiently for maintenance and growth functions, impacting profitability and environmental sustainability. Here, we compared the prediction accuracy of multi-layer neural network (MLNN) and support vector regression (SVR) against single-trait (STGBLUP) and multi-trait genomic best linear unbiased prediction (MTGBLUP), and Bayesian regression (BayesA, BayesB, BayesC, BRR, and BLasso) for feed efficiency (FE) traits. FE-related traits were measured in 1156 Nellore cattle from an experimental breeding program genotyped for ~300 K markers after quality control. Prediction accuracy (Acc) was evaluated using forward validation, splitting the dataset based on birth year and using the phenotypes adjusted for fixed effects and covariates as pseudo-phenotypes. The MLNN and SVR approaches were trained by randomly splitting the training population into five folds to select the best hyperparameters. The results show that the machine learning methods (MLNN and SVR) and MTGBLUP outperformed STGBLUP and the Bayesian regression approaches, increasing the Acc by approximately 8.9%, 14.6%, and 13.7% using MLNN, SVR, and MTGBLUP, respectively. Acc for SVR and MTGBLUP differed only slightly, ranging from 0.62 to 0.69 and 0.62 to 0.68, respectively, and both models were empirically unbiased (0.97 and 1.09). Our results indicated that the SVR and MTGBLUP approaches were more accurate in predicting FE-related traits than Bayesian regression and STGBLUP and seemed competitive for GS of complex phenotypes with various degrees of inheritance.
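The validation design described in this abstract (a forward split by birth year, with random five-fold splitting of the training population for hyperparameter selection) can be sketched as follows. The record layout, the `birth_year` key, the cutoff year, and both function names are assumptions made for illustration; the abstract does not give the actual cutoff or data format.

```python
import random


def forward_validation_split(records: list[dict], cutoff_year: int):
    """Train on animals born before `cutoff_year`, validate on those born in or after it."""
    train = [r for r in records if r["birth_year"] < cutoff_year]
    test = [r for r in records if r["birth_year"] >= cutoff_year]
    return train, test


def k_fold_indices(n: int, k: int = 5, seed: int = 0) -> list[list[int]]:
    """Randomly partition indices 0..n-1 into k folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Striding over the shuffled list gives folds whose sizes differ by at most one.
    return [indices[i::k] for i in range(k)]
```

Splitting by birth year keeps younger animals out of the training set, so the reported accuracy reflects prediction of future generations rather than interpolation among contemporaries.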
Affiliation(s)
- Lucio F M Mota
- School of Agricultural and Veterinarian Sciences, São Paulo State University (UNESP), Jaboticabal, SP, 14884-900, Brazil.
| | - Leonardo M Arikawa
- School of Agricultural and Veterinarian Sciences, São Paulo State University (UNESP), Jaboticabal, SP, 14884-900, Brazil
| | - Samuel W B Santos
- School of Agricultural and Veterinarian Sciences, São Paulo State University (UNESP), Jaboticabal, SP, 14884-900, Brazil
| | - Gerardo A Fernandes Júnior
- School of Agricultural and Veterinarian Sciences, São Paulo State University (UNESP), Jaboticabal, SP, 14884-900, Brazil
| | - Anderson A C Alves
- School of Agricultural and Veterinarian Sciences, São Paulo State University (UNESP), Jaboticabal, SP, 14884-900, Brazil
| | - Guilherme J M Rosa
- Department of Animal and Dairy Sciences, University of Wisconsin, Madison, WI, 53706, USA
| | - Maria E Z Mercadante
- Institute of Animal Science, Beef Cattle Research Center, Sertãozinho, SP, 14174-000, Brazil
- National Council for Science and Technological Development, Brasilia, DF, 71605-001, Brazil
| | - Joslaine N S G Cyrillo
- Institute of Animal Science, Beef Cattle Research Center, Sertãozinho, SP, 14174-000, Brazil
| | - Roberto Carvalheiro
- School of Agricultural and Veterinarian Sciences, São Paulo State University (UNESP), Jaboticabal, SP, 14884-900, Brazil
- National Council for Science and Technological Development, Brasilia, DF, 71605-001, Brazil
| | - Lucia G Albuquerque
- School of Agricultural and Veterinarian Sciences, São Paulo State University (UNESP), Jaboticabal, SP, 14884-900, Brazil.
- National Council for Science and Technological Development, Brasilia, DF, 71605-001, Brazil.
| |
|
15
|
Chafai N, Bonizzi L, Botti S, Badaoui B. Emerging applications of machine learning in genomic medicine and healthcare. Crit Rev Clin Lab Sci 2024; 61:140-163. [PMID: 37815417 DOI: 10.1080/10408363.2023.2259466] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/19/2023] [Accepted: 09/12/2023] [Indexed: 10/11/2023]
Abstract
The integration of artificial intelligence technologies has propelled the progress of clinical and genomic medicine in recent years. The significant increase in computing power has facilitated the ability of artificial intelligence models to analyze and extract features from extensive medical data and images, thereby contributing to the advancement of intelligent diagnostic tools. Artificial intelligence (AI) models have been utilized in the field of personalized medicine to integrate clinical data and genomic information of patients. This integration allows for the identification of customized treatment recommendations, ultimately leading to enhanced patient outcomes. Notwithstanding the notable advancements, the application of artificial intelligence (AI) in the field of medicine is impeded by various obstacles such as the limited availability of clinical and genomic data, the diversity of datasets, ethical implications, and the inconclusive interpretation of AI models' results. In this review, a comprehensive evaluation of multiple machine learning algorithms utilized in the fields of clinical and genomic medicine is conducted. Furthermore, we present an overview of the implementation of artificial intelligence (AI) in the fields of clinical medicine, drug discovery, and genomic medicine. Finally, a number of constraints pertaining to the implementation of artificial intelligence within the healthcare industry are examined.
Affiliation(s)
- Narjice Chafai
- Laboratory of Biodiversity, Ecology, and Genome, Faculty of Sciences, Department of Biology, Mohammed V University in Rabat, Rabat, Morocco
| | - Luigi Bonizzi
- Department of Biomedical, Surgical and Dental Science, University of Milan, Milan, Italy
| | - Sara Botti
- PTP Science Park, Via Einstein - Loc. Cascina Codazza, Lodi, Italy
| | - Bouabid Badaoui
- Laboratory of Biodiversity, Ecology, and Genome, Faculty of Sciences, Department of Biology, Mohammed V University in Rabat, Rabat, Morocco
- African Sustainable Agriculture Research Institute (ASARI), Mohammed VI Polytechnic University (UM6P), Laâyoune, Morocco
| |
|
16
|
Danishuddin, Khan S, Kim JJ. From cancer big data to treatment: Artificial intelligence in cancer research. J Gene Med 2024; 26:e3629. [PMID: 37940369 DOI: 10.1002/jgm.3629] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/25/2023] [Revised: 09/12/2023] [Accepted: 10/18/2023] [Indexed: 11/10/2023] Open
Abstract
In recent years, the idea of "cancer big data" has emerged as a result of the significant expansion of various fields such as clinical research, genomics, proteomics and public health records. Advances in omics technologies are making a significant contribution to cancer big data in biomedicine and disease diagnosis. The increasing availability of extensive cancer big data has set the stage for the development of multimodal artificial intelligence (AI) frameworks. These frameworks aim to analyze high-dimensional multi-omics data, extracting meaningful information that is challenging to obtain manually. Although interpretability and data quality remain critical challenges, these methods hold great promise for advancing our understanding of cancer biology and improving patient care and clinical outcomes. Here, we provide an overview of cancer big data and explore the applications of both traditional machine learning and deep learning approaches in cancer genomic and proteomic studies. We briefly discuss the challenges and potential of AI techniques in the integrated analysis of omics data, as well as the future direction of personalized treatment options in cancer.
Affiliation(s)
- Danishuddin
- Department of Biotechnology, Yeungnam University, Gyeongsan, Gyeongbuk, South Korea
| | - Shawez Khan
- National Center for Cancer Immune Therapy (CCIT-DK), Department of Oncology, Copenhagen University Hospital, Herlev, Denmark
| | - Jong Joo Kim
- Department of Biotechnology, Yeungnam University, Gyeongsan, Gyeongbuk, South Korea
| |
|
17
|
Nisar S, Haris M. Neuroimaging genetics approaches to identify new biomarkers for the early diagnosis of autism spectrum disorder. Mol Psychiatry 2023; 28:4995-5008. [PMID: 37069342 PMCID: PMC11041805 DOI: 10.1038/s41380-023-02060-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 08/01/2022] [Revised: 03/23/2023] [Accepted: 03/28/2023] [Indexed: 04/19/2023]
Abstract
Autism-spectrum disorders (ASDs) are developmental disabilities that manifest in early childhood and are characterized by qualitative abnormalities in social behaviors, communication skills, and restrictive or repetitive behaviors. To explore the neurobiological mechanisms in ASD, extensive research has been done to identify potential diagnostic biomarkers through a neuroimaging genetics approach. Neuroimaging genetics helps to identify ASD-risk genes that contribute to structural and functional variations in brain circuitry and validate biological changes by elucidating the mechanisms and pathways that confer genetic risk. Integrating artificial intelligence models with neuroimaging data lays the groundwork for accurate diagnosis and facilitates the identification of early diagnostic biomarkers for ASD. This review discusses the significance of neuroimaging genetics approaches to gaining a better understanding of the perturbed neurochemical system and molecular pathways in ASD and how these approaches can detect structural, functional, and metabolic changes and lead to the discovery of novel biomarkers for the early diagnosis of ASD.
Affiliation(s)
- Sabah Nisar
- Laboratory of Molecular and Metabolic Imaging, Sidra Medicine, Doha, Qatar
- Department of Diagnostic Imaging, St Jude Children's Research Hospital, Memphis, TN, USA
| | - Mohammad Haris
- Laboratory of Molecular and Metabolic Imaging, Sidra Medicine, Doha, Qatar.
- Center for Advanced Metabolic Imaging in Precision Medicine, Department of Radiology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA.
- Laboratory Animal Research Center, Qatar University, Doha, Qatar.
| |
|
18
|
Halawani R, Buchert M, Chen YPP. Deep learning exploration of single-cell and spatially resolved cancer transcriptomics to unravel tumour heterogeneity. Comput Biol Med 2023; 164:107274. [PMID: 37506451 DOI: 10.1016/j.compbiomed.2023.107274] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 01/02/2023] [Revised: 07/03/2023] [Accepted: 07/16/2023] [Indexed: 07/30/2023]
Abstract
Tumour heterogeneity is one of the critical confounding aspects in decoding tumour growth. Malignant cells display variations in their gene transcription profiles and mutation spectra even when originating from a single progenitor cell. Single-cell and spatial transcriptomics sequencing have recently emerged as key technologies for unravelling tumour heterogeneity. Single-cell sequencing promotes individual cell-type identification through transcriptome-wide gene expression measurements of each cell. Spatial transcriptomics facilitates identification of cell-cell interactions and the structural organization of heterogeneous cells within a tumour tissue through associating spatial RNA abundance of cells at distinct spots in the tissue section. However, extracting features and analyzing single-cell and spatial transcriptomics data poses challenges. Single-cell transcriptome data is extremely noisy and its sparse nature and dropouts can lead to misinterpretation of gene expression and the misclassification of cell types. Deep learning predictive power can overcome data challenges, provide high-resolution analysis and enhance precision oncology applications that involve early cancer prognosis, diagnosis, patient survival estimation and anti-cancer therapy planning. In this paper, we provide a background to and review of the recent progress of deep learning frameworks to investigate tumour heterogeneity using both single-cell and spatial transcriptomics data types.
Affiliation(s)
- Raid Halawani
- Department of Computer Science and Information Technology, La Trobe University, Melbourne, Australia
| | - Michael Buchert
- School of Cancer Medicine, La Trobe University, Melbourne, Victoria, Australia; Olivia Newton-John Cancer Research Institute, Melbourne, Victoria, Australia
| | - Yi-Ping Phoebe Chen
- Department of Computer Science and Information Technology, La Trobe University, Melbourne, Australia.
| |
|
19
|
Alatrany AS, Khan W, Hussain AJ, Mustafina J, Al-Jumeily D. Transfer Learning for Classification of Alzheimer's Disease Based on Genome Wide Data. IEEE/ACM Trans Comput Biol Bioinform 2023; 20:2700-2711. [PMID: 37018274 DOI: 10.1109/tcbb.2022.3233869] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/19/2023]
Abstract
Alzheimer's disease (AD) is a brain disorder regarded as degenerative because its symptoms worsen over time. Single nucleotide polymorphisms (SNPs) have been identified as relevant biomarkers for this condition. This study aims to identify SNP biomarkers associated with AD in order to perform a reliable classification of AD. In contrast to existing related works, we utilize deep transfer learning with varying experimental analysis for reliable classification of AD. For this purpose, convolutional neural networks (CNNs) are first trained on the genome-wide association studies (GWAS) dataset requested from the Alzheimer's Disease Neuroimaging Initiative. We then employ deep transfer learning to further train our CNN (as the base model) on a different AD GWAS dataset, to extract the final set of features. The extracted features are then fed into a Support Vector Machine for classification of AD. Detailed experiments are performed using multiple datasets and varying experimental configurations. The statistical outcomes indicate an accuracy of 89%, which is a significant improvement when benchmarked against existing related works.
|
20
|
Morabito F, Adornetto C, Monti P, Amaro A, Reggiani F, Colombo M, Rodriguez-Aldana Y, Tripepi G, D’Arrigo G, Vener C, Torricelli F, Rossi T, Neri A, Ferrarini M, Cutrona G, Gentile M, Greco G. Genes selection using deep learning and explainable artificial intelligence for chronic lymphocytic leukemia predicting the need and time to therapy. Front Oncol 2023; 13:1198992. [PMID: 37719021 PMCID: PMC10501728 DOI: 10.3389/fonc.2023.1198992] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 04/02/2023] [Accepted: 07/31/2023] [Indexed: 09/19/2023] Open
Abstract
Analyzing gene expression profiles (GEP) through artificial intelligence provides meaningful insight into cancer disease. This study introduces DeepSHAP Autoencoder Filter for Genes Selection (DSAF-GS), a novel deep learning and explainable artificial intelligence-based approach for feature selection in genomics-scale data. DSAF-GS exploits the autoencoder's reconstruction capabilities without changing the original feature space, enhancing the interpretation of the results. Explainable artificial intelligence is then used to select the informative genes for chronic lymphocytic leukemia prognosis of 217 cases from a GEP database comprising roughly 20,000 genes. The model for prognosis prediction achieved an accuracy of 86.4%, a sensitivity of 85.0%, and a specificity of 87.5%. According to the proposed approach, predictions were strongly influenced by CEACAM19 and PIGP, moderately influenced by MKL1 and GNE, and poorly influenced by other genes. The 10 most influential genes were selected for further analysis. Among them, FADD, FIBP, GNE, IGF1R, MKL1, PIGP, and SLC39A6 were identified in the Reactome pathway database as involved in signal transduction, transcription, protein metabolism, immune system, cell cycle, and apoptosis. Moreover, according to the network model of the 3D protein-protein interaction (PPI) explored using the NetworkAnalyst tool, FADD, FIBP, IGF1R, QTRT1, GNE, SLC39A6, and MKL1 appear coupled into a complex network. Finally, all 10 selected genes showed a predictive power on time to first treatment (TTFT) in univariate analyses on a basic prognostic model including IGHV mutational status, del(11q) and del(17p), NOTCH1 mutations, β2-microglobulin, Rai stage, and B-lymphocytosis known to predict TTFT in CLL. 
However, only the IGF1R (hazard ratio [HR] 1.41, 95% CI 1.08-1.84, P=0.013), COL28A1 (HR 0.32, 95% CI 0.10-0.97, P=0.045), and QTRT1 (HR 7.73, 95% CI 2.48-24.04, P<0.001) genes were significantly associated with TTFT in multivariable analyses when combined with the prognostic factors of the basic model, ultimately increasing the Harrell's c-index and the explained variation to 78.6% (versus 76.5% of the basic prognostic model) and 52.6% (versus 42.2% of the basic prognostic model), respectively. Also, the goodness of model fit was enhanced (χ2 = 20.1, P=0.002), indicating improved performance over the basic prognostic model. In conclusion, DSAF-GS identified a group of significant genes for CLL prognosis, suggesting future directions for bio-molecular research.
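For readers unfamiliar with Harrell's c-index, which the abstract reports rising from 76.5% to 78.6%, the following minimal Python sketch shows how the statistic is computed for right-censored data. The function name and argument layout are hypothetical, tied event times are simply skipped, and this is not the authors' analysis code.

```python
def harrell_c_index(times, events, risk_scores):
    """Fraction of comparable patient pairs whose predicted risks are ordered correctly.

    times:       observed follow-up times (e.g. time to first treatment)
    events:      1 (or True) if the event was observed, 0 (or False) if censored
    risk_scores: model output, where a higher score means earlier expected event
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable only when the earlier time is an observed event.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs; c-index is undefined")
    return concordant / comparable
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, so the reported values are this fraction expressed as a percentage.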
Affiliation(s)
| | - Carlo Adornetto
- Department of Mathematics and Computer Science, University of Calabria, Cosenza, Italy
| | - Paola Monti
- Mutagenesis and Cancer Prevention Unit, Istituto di Ricovero e Cura a Carattere Scientifico (IRCCS) Ospedale Policlinico San Martino, Genoa, Italy
| | - Adriana Amaro
- Tumor Epigenetics Unit, Istituto di Ricovero e Cura a Carattere Scientifico (IRCCS) Ospedale Policlinico San Martino, Genoa, Italy
| | - Francesco Reggiani
- Tumor Epigenetics Unit, Istituto di Ricovero e Cura a Carattere Scientifico (IRCCS) Ospedale Policlinico San Martino, Genoa, Italy
| | - Monica Colombo
- Molecular Pathology Unit, Istituto di Ricovero e Cura a Carattere Scientifico (IRCCS) Ospedale Policlinico San Martino, Genoa, Italy
| | | | - Giovanni Tripepi
- Consiglio Nazionale delle Ricerche, Istituto di Fisiologia Clinica del Consiglio Nazionale delle Ricerche (CNR), Reggio Calabria, Italy
| | - Graziella D’Arrigo
- Consiglio Nazionale delle Ricerche, Istituto di Fisiologia Clinica del Consiglio Nazionale delle Ricerche (CNR), Reggio Calabria, Italy
| | - Claudia Vener
- Department of Oncology and Hemato-Oncology, University of Milan, Milan, Italy
| | - Federica Torricelli
- Laboratory of Translational Research, Azienda Unità Sanitaria Locale - Istituto di Ricovero e Cura a Carattere Scientifico (USL-IRCCS) of Reggio Emilia, Reggio Emilia, Italy
| | - Teresa Rossi
- Laboratory of Translational Research, Azienda Unità Sanitaria Locale - Istituto di Ricovero e Cura a Carattere Scientifico (USL-IRCCS) of Reggio Emilia, Reggio Emilia, Italy
| | - Antonino Neri
- Scientific Directorate, Azienda Unità Sanitaria Locale - Istituto di Ricovero e Cura a Carattere Scientifico (USL-IRCCS) of Reggio Emilia, Reggio Emilia, Italy
| | - Manlio Ferrarini
- Unità Operativa (UO) Molecular Pathology, Ospedale Policlinico San Martino Istituto di Ricovero e Cura a Carattere Scientifico (IRCCS), Genoa, Italy
| | - Giovanna Cutrona
- Molecular Pathology Unit, Istituto di Ricovero e Cura a Carattere Scientifico (IRCCS) Ospedale Policlinico San Martino, Genoa, Italy
| | - Massimo Gentile
- Hematology Unit, Department of Onco-Hematology, Azienda Ospedaliera (A.O.) of Cosenza, Cosenza, Italy
- Department of Pharmacy and Health and Nutritional Sciences, University of Calabria, Cosenza, Italy
| | - Gianluigi Greco
- Department of Mathematics and Computer Science, University of Calabria, Cosenza, Italy
| |
|
21
|
Jahanyar B, Tabatabaee H, Rowhanimanesh A. MS-ACGAN: A modified auxiliary classifier generative adversarial network for schizophrenia's samples augmentation based on microarray gene expression data. Comput Biol Med 2023; 162:107024. [PMID: 37263150 DOI: 10.1016/j.compbiomed.2023.107024] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 12/17/2022] [Revised: 05/01/2023] [Accepted: 05/09/2023] [Indexed: 06/03/2023]
Abstract
Artificial intelligence-based models and robust computational methods have expedited the data-to-knowledge trajectory in precision medicine. Although machine learning models have been widely applied in medical data analysis, some barriers remain challenging, such as the shortage of available biosamples, prohibitive costs, rare diseases, and ethical considerations. Transcriptomics, an omics approach that studies gene activity and provides gene expression data such as microarray and RNA sequencing data, faces difficulties in biospecimen collection, particularly for mental disorders, as some psychiatric patients avoid medical care. Microarray data suffers from the low number of available samples, making it challenging to apply machine learning models. However, the generative adversarial network (GAN), the hottest paradigm in deep learning, has created unprecedented momentum in data augmentation and efficiently expands datasets. This paper proposes a novel model termed MS-ACGAN, where the generator feeds on a bordered Gaussian distribution. In machine learning, calibration is of utmost importance: it gives insight into model uncertainty and is considered a crucial step toward improving the robustness and reliability of models. Therefore, we apply calibration techniques to classifiers and focus on estimating their probabilities as accurately as possible. Additionally, we present trustworthy outputs by using confidence intervals, which address the limitations of point estimates by reporting a range of expected values for performance metrics. Both concepts statistically describe the implemented model's reliability in this study. Furthermore, we employ two quantitative measures, GAN-train and GAN-test, to demonstrate that the artificial data generated by our robust approach closely resembles the original data characteristics.
Affiliation(s)
- Bahareh Jahanyar
- Department of Computer Engineering, Mashhad Branch, Islamic Azad University, Mashhad, Iran
| | - Hamid Tabatabaee
- Department of Computer Engineering, Mashhad Branch, Islamic Azad University, Mashhad, Iran.
| | | |
Collapse
|
22
|
Khodadadi A, Ghanbari Bousejin N, Molaei S, Kumar Chauhan V, Zhu T, Clifton DA. Improving Diagnostics with Deep Forest Applied to Electronic Health Records. Sensors (Basel) 2023; 23:6571. [PMID: 37514865 PMCID: PMC10384165 DOI: 10.3390/s23146571] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 05/30/2023] [Revised: 07/08/2023] [Accepted: 07/14/2023] [Indexed: 07/30/2023]
Abstract
An electronic health record (EHR) is a vital, high-dimensional source of medical concepts. Discovering implicit correlations in these data can improve treatment and management. The main challenge is that limited data sources make it hard to find a stable model that relates medical concepts and exploits the connections between them. This paper presents Patient Forest, a novel end-to-end approach for learning patient representations from tree-structured data for readmission and mortality prediction tasks. By leveraging statistical features, the proposed model provides an accurate and reliable classifier for predicting readmission and mortality. Experiments on the MIMIC-III and eICU datasets demonstrate that Patient Forest outperforms existing machine learning models, especially when the training data are limited. Additionally, a qualitative evaluation of Patient Forest is conducted by visualising the learnt representations in 2D space using t-SNE, which further confirms the effectiveness of the proposed model in learning EHR representations.
Affiliation(s)
- Atieh Khodadadi
  - Institute of Applied Informatics and Formal Description Methods, Karlsruhe Institute of Technology, 76133 Karlsruhe, Germany
- Soheila Molaei
  - Department of Engineering Science, University of Oxford, Oxford OX1 3AZ, UK
- Vinod Kumar Chauhan
  - Department of Engineering Science, University of Oxford, Oxford OX1 3AZ, UK
- Tingting Zhu
  - Department of Engineering Science, University of Oxford, Oxford OX1 3AZ, UK
- David A. Clifton
  - Department of Engineering Science, University of Oxford, Oxford OX1 3AZ, UK
  - Oxford-Suzhou Centre for Advanced Research (OSCAR), Suzhou 215123, China
|
23
|
Raudenska M, Vicar T, Gumulec J, Masarik M. Johann Gregor Mendel: the victory of statistics over human imagination. Eur J Hum Genet 2023; 31:744-748. [PMID: 36755104 PMCID: PMC9909140 DOI: 10.1038/s41431-023-01303-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/10/2022] [Revised: 01/11/2023] [Accepted: 01/24/2023] [Indexed: 02/10/2023] Open
Abstract
In 2022, we celebrated 200 years since the birth of Johann Gregor Mendel. Although his contributions to science went unrecognized during his lifetime, Mendel not only described the principles of monogenic inheritance but also pioneered the modern way of doing science based on precise experimental data acquisition and evaluation. Novel statistical and algorithmic approaches are now at the center of scientific work, showing that work that is considered marginal in one era can become a mainstream research approach in the next era. The onset of data-driven science caused a shift from hypothesis-testing to hypothesis-generating approaches in science. Mendel is remembered here as a promoter of this approach, and the benefits of big data and statistical approaches are discussed.
Affiliation(s)
- Martina Raudenska
  - Department of Physiology, Faculty of Medicine, Masaryk University/Kamenice 5, CZ-625 00, Brno, Czech Republic
  - Department of Pathological Physiology, Faculty of Medicine, Masaryk University/Kamenice 5, CZ-625 00, Brno, Czech Republic
- Tomas Vicar
  - Department of Physiology, Faculty of Medicine, Masaryk University/Kamenice 5, CZ-625 00, Brno, Czech Republic
  - Department of Biomedical Engineering, Faculty of Electrical Engineering and Communication, Brno University of Technology, Technicka 3058/10, Brno, Czech Republic
- Jaromir Gumulec
  - Department of Physiology, Faculty of Medicine, Masaryk University/Kamenice 5, CZ-625 00, Brno, Czech Republic
  - Department of Pathological Physiology, Faculty of Medicine, Masaryk University/Kamenice 5, CZ-625 00, Brno, Czech Republic
- Michal Masarik
  - Department of Physiology, Faculty of Medicine, Masaryk University/Kamenice 5, CZ-625 00, Brno, Czech Republic
  - Department of Pathological Physiology, Faculty of Medicine, Masaryk University/Kamenice 5, CZ-625 00, Brno, Czech Republic
  - BIOCEV, First Faculty of Medicine, Charles University, Prumyslova 595, CZ-252 50, Vestec, Czech Republic
|
24
|
Lacan A, Sebag M, Hanczar B. GAN-based data augmentation for transcriptomics: survey and comparative assessment. Bioinformatics 2023; 39:i111-i120. [PMID: 37387181 DOI: 10.1093/bioinformatics/btad239] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 07/01/2023] Open
Abstract
MOTIVATION: Transcriptomics data are becoming more accessible due to high-throughput and less costly sequencing methods. However, data scarcity prevents exploiting deep learning models' full predictive power for phenotype prediction. Artificially enhancing the training sets, namely data augmentation, is suggested as a regularization strategy. Data augmentation corresponds to label-invariant transformations of the training set (e.g. geometric transformations on images and syntax parsing on text data). Such transformations are, unfortunately, unknown in the transcriptomic field. Therefore, deep generative models such as generative adversarial networks (GANs) have been proposed to generate additional samples. In this article, we analyze GAN-based data augmentation strategies with respect to performance indicators and the classification of cancer phenotypes. RESULTS: This work highlights a significant boost in binary and multiclass classification performance due to augmentation strategies. Without augmentation, training a classifier on only 50 RNA-seq samples yields accuracies of 94% and 70% for binary and tissue classification, respectively. In comparison, we achieved 98% and 94% accuracy when adding 1000 augmented samples. Richer architectures and more expensive training of the GAN yield better augmentation performance and generated data quality overall. Further analysis of the generated data shows that several performance indicators are needed to assess its quality correctly. AVAILABILITY AND IMPLEMENTATION: All data used for this research are publicly available and come from The Cancer Genome Atlas. Reproducible code is available on the GitLab repository: https://forge.ibisc.univ-evry.fr/alacan/GANs-for-transcriptomics.
Affiliation(s)
- Alice Lacan
  - IBISC, University Paris-Saclay (Univ. Evry), Evry 91000, France
- Michèle Sebag
  - TAU, CNRS-INRIA-LISN, University Paris-Saclay, Gif-sur-Yvette 91190, France
- Blaise Hanczar
  - IBISC, University Paris-Saclay (Univ. Evry), Evry 91000, France
|
25
|
Zabardast A, Tamer EG, Son YA, Yılmaz A. An automated framework for evaluation of deep learning models for splice site predictions. Sci Rep 2023; 13:10221. [PMID: 37353532 PMCID: PMC10290104 DOI: 10.1038/s41598-023-34795-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2022] [Accepted: 05/08/2023] [Indexed: 06/25/2023] Open
Abstract
A novel framework for the automated evaluation of various deep learning-based splice site detectors is presented. The framework eliminates time-consuming development and experimentation across different codebases, architectures, and configurations to obtain the best models for a given RNA splice site dataset. RNA splicing is a cellular process in which pre-mRNAs are processed into mature mRNAs and used to produce multiple mRNA transcripts from a single gene sequence. With the advancement of sequencing technologies, many splice site variants have been identified and associated with diseases. RNA splice site prediction is therefore essential for gene finding, genome annotation, detection of disease-causing variants, and identification of potential biomarkers. Recently, deep learning models have classified genomic signals with high accuracy. Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM) and its bidirectional version (BLSTM), and Gated Recurrent Unit (GRU) and its bidirectional version (BGRU) are promising models. In genomic data analysis, CNN's locality feature helps because each nucleotide correlates with other bases in its vicinity. In contrast, BLSTM can be trained bidirectionally, allowing sequential data to be processed in both forward and reverse directions. Therefore, it can process 1-D encoded genomic data effectively. Even though both methods have been used in the literature, a performance comparison was missing. To compare selected models under similar conditions, we have created a blueprint for a series of networks with five different levels. As a case study, we compared CNN and BLSTM models' learning capabilities as building blocks for RNA splice site prediction in two different datasets.
Overall, CNN performed better with [Formula: see text] accuracy ([Formula: see text] improvement), [Formula: see text] F1 score ([Formula: see text] improvement), and [Formula: see text] AUC-PR ([Formula: see text] improvement) in human splice site prediction. Likewise, superior performance with [Formula: see text] accuracy ([Formula: see text] improvement), [Formula: see text] F1 score ([Formula: see text] improvement), and [Formula: see text] AUC-PR ([Formula: see text] improvement) is achieved in C. elegans splice site prediction. Overall, our results showed that CNN learns faster than BLSTM and BGRU. Moreover, CNN performs better at extracting sequence patterns than BLSTM and BGRU. To our knowledge, no other framework has been developed explicitly for evaluating splice detection models and selecting the best possible model in an automated manner. The proposed framework and blueprint should therefore help in selecting among deep learning models, such as CNN vs. BLSTM and BGRU, for splice site analysis and similar classification tasks.
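The abstract above refers to 1-D encoded genomic data and to the locality of convolutional filters. As a rough illustration of both ideas, and not the authors' framework, the sketch below one-hot encodes a DNA string and slides a single hand-set filter over it. The filter matching the dinucleotide "GT" (the canonical start of most introns) and the example sequence are assumptions chosen for illustration; a trained CNN would learn its filter weights from data.

```python
NUCLEOTIDES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as rows of four 0/1 values in A, C, G, T order."""
    return [[1.0 if base == n else 0.0 for n in NUCLEOTIDES] for base in seq.upper()]

def conv1d(encoded, kernel):
    """Slide one filter along the sequence (valid padding, stride 1).

    Each output depends only on `len(kernel)` neighbouring positions,
    which is the locality property mentioned in the abstract.
    """
    width = len(kernel)
    out = []
    for start in range(len(encoded) - width + 1):
        window = encoded[start:start + width]
        out.append(sum(x * w
                       for row, krow in zip(window, kernel)
                       for x, w in zip(row, krow)))
    return out

# Hypothetical hand-set filter that responds maximally to "GT".
gt_filter = one_hot("GT")
seq = "CAGGTAAGT"  # illustrative sequence, not taken from any dataset
scores = conv1d(one_hot(seq), gt_filter)
hits = [i for i, s in enumerate(scores) if s == 2.0]
print(hits)  # positions where the window is exactly "GT"
```

A recurrent model such as a BLSTM would consume the same one-hot rows one position at a time in both directions, instead of through fixed-width windows.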
Affiliation(s)
- Amin Zabardast
  - Department of Health Informatics, Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
- Elif Güney Tamer
  - Department of Health Informatics, Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
- Yeşim Aydın Son
  - Department of Health Informatics, Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
- Arif Yılmaz
  - Institute of Data Science, Maastricht University, Maastricht, The Netherlands
|
26
|
Yang S, Kim SH, Kang M, Joo JY. Harnessing deep learning into hidden mutations of neurological disorders for therapeutic challenges. Arch Pharm Res 2023:10.1007/s12272-023-01450-5. [PMID: 37261600 DOI: 10.1007/s12272-023-01450-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/26/2023] [Accepted: 05/26/2023] [Indexed: 06/02/2023]
Abstract
The study of transcriptome-wide variations in neurological disorders is on the rise in the evolving field of genomic data science. Deep learning, which applies algorithms to massive amounts of data in a human-like manner, is expected to predict the dependency or druggability of hidden mutations within the genome. An enormous number of mutational variants in coding and noncoding transcripts have been discovered across the genome so far, despite the fine-tuned genetic proofreading machinery. These variants can induce various pathological conditions, including neurological disorders, which require lifelong care. Several limitations and questions emerge, including the use of conventional processes via limited patient-driven sequence acquisitions and decoding-based inferences, as well as how rare variants can be deduced as a population-specific etiology. These puzzles require harnessing advanced systems for precise disease prediction, drug development and drug applications. In this review, we summarize the pathophysiological discoveries of pathogenic variants in both coding and noncoding transcripts in neurological disorders, and the current advantages of deep learning applications. In addition, we discuss the challenges encountered and how to overcome them with advancing interpretation.
Affiliation(s)
- Sumin Yang
  - Department of Pharmacy, College of Pharmacy, Hanyang University, Rm 407, Bldg.42, 55 Hanyangdaehak-Ro, Sangnok-Gu Ansan, Ansan, Gyeonggi-Do, 15588, Republic of Korea
- Sung-Hyun Kim
  - Department of Pharmacy, College of Pharmacy, Hanyang University, Rm 407, Bldg.42, 55 Hanyangdaehak-Ro, Sangnok-Gu Ansan, Ansan, Gyeonggi-Do, 15588, Republic of Korea
- Mingon Kang
  - Department of Computer Science, University of Nevada, Las Vegas, NV, 89154, USA
- Jae-Yeol Joo
  - Department of Pharmacy, College of Pharmacy, Hanyang University, Rm 407, Bldg.42, 55 Hanyangdaehak-Ro, Sangnok-Gu Ansan, Ansan, Gyeonggi-Do, 15588, Republic of Korea
|
27
|
Akyüz K, Goisauf M, Chassang G, Kozera Ł, Mežinska S, Tzortzatou-Nanopoulou O, Mayrhofer MT. Post-identifiability in changing sociotechnological genomic data environments. BioSocieties 2023:1-28. [PMID: 37359141 PMCID: PMC10042674 DOI: 10.1057/s41292-023-00299-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Accepted: 01/13/2023] [Indexed: 03/30/2023]
Abstract
Data practices in biomedical research often rely on standards that build on normative assumptions regarding privacy and involve 'ethics work.' In an increasingly datafied research environment, identifiability gains a new temporal and spatial dimension, especially in regard to genomic data. In this paper, we analyze how genomic identifiability is considered as a specific data issue in a recent controversial case: publication of the genome sequence of the HeLa cell line. Considering developments in the sociotechnological and data environment, such as big data, biomedical, recreational, and research uses of genomics, our analysis highlights what it means to be (re-)identifiable in the postgenomic era. By showing how the risk of genomic identifiability is not a specificity of the HeLa controversy, but rather a systematic data issue, we argue that a new conceptualization is needed. With the notion of post-identifiability as a sociotechnological situation, we show how past assumptions and ideas about future possibilities come together in the case of genomic identifiability. We conclude by discussing how kinship, temporality, and openness are subject to renewed negotiations along with the changing understandings and expectations of identifiability and status of genomic data.
Affiliation(s)
- Kaya Akyüz
  - Department of Science and Technology Studies, University of Vienna, Universitätsstraße 7/Stiege II/6, Stock (NIG), 1010 Vienna, Austria
  - BBMRI-ERIC, Graz, Austria
- Melanie Goisauf
  - Department of Science and Technology Studies, University of Vienna, Universitätsstraße 7/Stiege II/6, Stock (NIG), 1010 Vienna, Austria
  - BBMRI-ERIC, Graz, Austria
- Gauthier Chassang
  - CERPOP, Université de Toulouse, Inserm, Université Paul Sabatier, Toulouse, France
  - Plateforme GenoToul Societal “Ethique et Biosciences”, Toulouse, France
- Signe Mežinska
  - Institute of Clinical and Preventive Medicine, University of Latvia, Riga, Latvia
  - BBMRI.LV, Riga, Latvia
|
28
|
Alharbi F, Vakanski A. Machine Learning Methods for Cancer Classification Using Gene Expression Data: A Review. Bioengineering (Basel) 2023; 10:173. [PMID: 36829667 PMCID: PMC9952758 DOI: 10.3390/bioengineering10020173] [Citation(s) in RCA: 23] [Impact Index Per Article: 11.5] [Received: 12/22/2022] [Revised: 01/24/2023] [Accepted: 01/26/2023] [Indexed: 01/31/2023] Open
Abstract
Cancer is a term that denotes a group of diseases caused by the abnormal growth of cells that can spread in different parts of the body. According to the World Health Organization (WHO), cancer is the second major cause of death after cardiovascular diseases. Gene expression can play a fundamental role in the early detection of cancer, as it is indicative of the biochemical processes in tissue and cells, as well as the genetic characteristics of an organism. Deoxyribonucleic acid (DNA) microarrays and ribonucleic acid (RNA)-sequencing methods for gene expression data allow quantifying the expression levels of genes and produce valuable data for computational analysis. This study reviews recent progress in gene expression analysis for cancer classification using machine learning methods. Both conventional and deep learning-based approaches are reviewed, with an emphasis on the application of deep learning models due to their comparative advantages for identifying gene patterns that are distinctive for various types of cancers. Relevant works that employ the most commonly used deep neural network architectures are covered, including multi-layer perceptrons, as well as convolutional, recurrent, graph, and transformer networks. This survey also presents an overview of the data collection methods for gene expression analysis and lists important datasets that are commonly used for supervised machine learning for this task. Furthermore, we review pertinent techniques for feature engineering and data preprocessing that are typically used to handle the high dimensionality of gene expression data, caused by a large number of genes present in data samples. The paper concludes with a discussion of future research directions for machine learning-based gene expression analysis for cancer classification.
|
29
|
Abstract
Advancements in high-throughput sequencing have yielded vast amounts of genomic data, which are studied using genome-wide association study (GWAS)/phenome-wide association study (PheWAS) methods to identify associations between the genotype and phenotype. The associated findings have contributed to pharmacogenomics and improved clinical decision support at the point of care in many healthcare systems. However, the accumulation of genomic data from sequencing and clinical data from electronic health records (EHRs) poses significant challenges for data scientists. Following the rise of artificial intelligence (AI) technology such as machine learning and deep learning, an increasing number of GWAS/PheWAS studies have successfully leveraged this technology to overcome the aforementioned challenges. In this review, we focus on the application of data science and AI technology in three areas, including risk prediction and identification of causal single-nucleotide polymorphisms, EHR-based phenotyping and CRISPR guide RNA design. Additionally, we highlight a few emerging AI technologies, such as transfer learning and multi-view learning, which will or have started to benefit genomic studies.
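The abstract above describes GWAS as a way to identify associations between genotype and phenotype. As a minimal sketch of what a single-SNP association test can look like, the code below runs an allelic chi-square test on a 2x2 table of allele counts in cases and controls. The counts are hypothetical, and real GWAS pipelines typically use regression models with covariates and a stringent genome-wide significance threshold; none of this is taken from the reviewed article.

```python
import math

def allelic_chi2(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele-count table."""
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    n = sum(sum(r) for r in table)
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom the chi-square tail probability has a closed form.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical allele counts for one SNP: 100 cases and 100 controls,
# two alleles per person.
chi2, p_value = allelic_chi2(case_alt=60, case_ref=140, ctrl_alt=40, ctrl_ref=160)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}")
```

In a genome-wide scan this test would be repeated for every SNP, which is why multiple-testing correction matters far more than the nominal p-value of any single marker.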
Affiliation(s)
- Jing Lin
  - NUHS Corporate Office, National University Health System, Singapore
- Kee Yuan Ngiam
  - NUHS Corporate Office, National University Health System, Singapore
  - Department of Surgery, National University of Singapore, Singapore
  - Correspondence: A/Prof Kee Yuan Ngiam, Group Chief Technology Officer, NUHS Corporate Office, National University Health System, 1E Kent Ridge Road, 119228, Singapore
|
30
|
Chandra A, Tünnermann L, Löfstedt T, Gratz R. Transformer-based deep learning for predicting protein properties in the life sciences. eLife 2023; 12:e82819. [PMID: 36651724 PMCID: PMC9848389 DOI: 10.7554/elife.82819] [Citation(s) in RCA: 37] [Impact Index Per Article: 18.5] [Received: 08/22/2022] [Accepted: 01/06/2023] [Indexed: 01/19/2023] Open
Abstract
Recent developments in deep learning, coupled with an increasing number of sequenced proteins, have led to a breakthrough in life science applications, in particular in protein property prediction. There is hope that deep learning can close the gap between the number of sequenced proteins and proteins with known properties based on lab experiments. Language models from the field of natural language processing have gained popularity for protein property predictions and have led to a new computational revolution in biology, where old prediction results are being improved regularly. Such models can learn useful multipurpose representations of proteins from large open repositories of protein sequences and can be used, for instance, to predict protein properties. The field of natural language processing is growing quickly because of developments in a class of models based on one particular architecture, the Transformer. We review recent developments and the use of large-scale Transformer models in applications for predicting protein characteristics and how such models can be used to predict, for example, post-translational modifications. We review shortcomings of other deep learning models and explain how the Transformer models have quickly proven to be a very promising way to unravel information hidden in the sequences of amino acids.
Affiliation(s)
- Abel Chandra
  - Department of Computing Science, Umeå University, Umeå, Sweden
- Laura Tünnermann
  - Umeå Plant Science Centre (UPSC), Department of Forest Genetics and Plant Physiology, Swedish University of Agricultural Sciences, Umeå, Sweden
- Tommy Löfstedt
  - Department of Computing Science, Umeå University, Umeå, Sweden
- Regina Gratz
  - Umeå Plant Science Centre (UPSC), Department of Forest Genetics and Plant Physiology, Swedish University of Agricultural Sciences, Umeå, Sweden
  - Department of Forest Ecology and Management, Swedish University of Agricultural Sciences, Umeå, Sweden
|
31
|
Qi R, Zou Q. Trends and Potential of Machine Learning and Deep Learning in Drug Study at Single-Cell Level. Research (Washington, D.C.) 2023; 6:0050. [PMID: 36930772 PMCID: PMC10013796 DOI: 10.34133/research.0050] [Citation(s) in RCA: 23] [Impact Index Per Article: 11.5] [Received: 09/22/2022] [Accepted: 12/27/2022] [Indexed: 01/12/2023]
Abstract
Cancer treatments face challenging problems, particularly drug resistance due to tumor cell heterogeneity. Existing datasets capture the relationship between gene expression and drug sensitivity; however, the majority are based on tissue-level studies. Studying drugs at the single-cell level is a promising way to overcome minimal residual disease caused by subclonal resistant cancer cells retained after initial curative therapy. Fortunately, machine learning techniques can help us understand how different types of cells respond to different cancer drugs from the perspective of single-cell gene expression. Good modeling using single-cell data and drug response information will not only improve machine learning for cell-drug outcome prediction but also facilitate the discovery of drugs for specific cancer subgroups and specific cancer treatments. In this paper, we review machine learning and deep learning approaches in drug research. By analyzing the application of these methods to cancer cell lines and single-cell data and comparing the technical gap between single-cell sequencing data analysis and single-cell drug sensitivity analysis, we hope to explore the trends and potential of drug research at the single-cell level and provide more inspiration for it. We anticipate that this review will stimulate the innovative use of machine learning methods to address new challenges in precision medicine more broadly.
Affiliation(s)
- Ren Qi
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Quan Zou
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
|
32
|
Small RNA Targets: Advances in Prediction Tools and High-Throughput Profiling. Biology (Basel) 2022; 11:1798. [PMID: 36552307 PMCID: PMC9775672 DOI: 10.3390/biology11121798] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 11/14/2022] [Revised: 11/27/2022] [Accepted: 12/08/2022] [Indexed: 12/14/2022]
Abstract
MicroRNAs (miRNAs) are an abundant class of small non-coding RNAs that regulate gene expression at the post-transcriptional level. They are suggested to be involved in most biological processes of the cell primarily by targeting messenger RNAs (mRNAs) for cleavage or translational repression. Their binding to their target sites is mediated by the Argonaute (AGO) family of proteins. Thus, miRNA target prediction is pivotal for research and clinical applications. Moreover, transfer-RNA-derived fragments (tRFs) and other types of small RNAs have been found to be potent regulators of Ago-mediated gene expression. Their role in mRNA regulation is still to be fully elucidated, and advancements in the computational prediction of their targets are in their infancy. To shed light on these complex RNA-RNA interactions, the availability of good quality high-throughput data and reliable computational methods is of utmost importance. Even though the arsenal of computational approaches in the field has been enriched in the last decade, there is still a degree of discrepancy between the results they yield. This review offers an overview of the relevant advancements in the field of bioinformatics and machine learning and summarizes the key strategies utilized for small RNA target prediction. Furthermore, we report the recent development of high-throughput sequencing technologies, and explore the role of non-miRNA AGO driver sequences.
|
33
|
Srinivasu PN, Shafi J, Krishna TB, Sujatha CN, Praveen SP, Ijaz MF. Using Recurrent Neural Networks for Predicting Type-2 Diabetes from Genomic and Tabular Data. Diagnostics (Basel) 2022; 12:3067. [PMID: 36553074 PMCID: PMC9776641 DOI: 10.3390/diagnostics12123067] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 11/04/2022] [Revised: 12/01/2022] [Accepted: 12/04/2022] [Indexed: 12/12/2022] Open
Abstract
The development of genomic technology for smart diagnosis and therapies for various diseases has lately been the most demanding area for computer-aided diagnostic and treatment research. Exponential breakthroughs in artificial intelligence and machine intelligence technologies could pave the way for identifying challenges afflicting the healthcare industry. Genomics is paving the way for predicting future illnesses, including cancer, Alzheimer's disease, and diabetes. Machine learning advancements have expedited the pace of biomedical informatics research and inspired new branches of computational biology. Furthermore, knowledge of gene relationships has led to more accurate models that can effectively detect patterns in vast volumes of data, making classification models important in various domains. Recurrent Neural Network models have a memory that allows them to retain information from previous steps while processing genetic data. The present work focuses on type 2 diabetes prediction using gene sequences derived from genomic DNA fragments through automated feature selection and feature extraction procedures for matching gene patterns with training data. The suggested model was tested using tabular data to predict type 2 diabetes based on several parameters. The performance of neural networks incorporating Recurrent Neural Network (RNN) components, Long Short-Term Memory (LSTM), and Gated Recurrent Units (GRU) was tested in this research. The model's efficiency is assessed using evaluation metrics such as Sensitivity, Specificity, Accuracy, F1-Score, and Matthews Correlation Coefficient (MCC). The suggested technique predicted future illnesses with fair accuracy. Furthermore, our research showed that the suggested model could be used in real-world scenarios and that input risk variables from an end-user Android application could be kept and evaluated on a secure remote server.
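The abstract above lists Sensitivity, Specificity, Accuracy, F1-Score, and Matthews Correlation Coefficient as evaluation metrics. The sketch below computes all five from the four cells of a binary confusion matrix using their standard definitions. The function name `binary_metrics` and the counts are hypothetical and do not come from the paper.

```python
import math

def _ratio(num, den):
    """Return num/den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0

def binary_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    sensitivity = _ratio(tp, tp + fn)   # recall on the positive class
    specificity = _ratio(tn, tn + fp)   # recall on the negative class
    accuracy = _ratio(tp + tn, tp + fp + tn + fn)
    precision = _ratio(tp, tp + fp)
    f1 = _ratio(2 * precision * sensitivity, precision + sensitivity)
    mcc = _ratio(tp * tn - fp * fn,
                 math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "f1": f1, "mcc": mcc}

# Hypothetical counts for a diabetic / non-diabetic classifier.
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

MCC uses all four cells of the confusion matrix, so it stays informative when the two classes are imbalanced, a situation in which accuracy alone can look misleadingly high.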
Affiliation(s)
- Parvathaneni Naga Srinivasu
  - Department of Computer Science and Engineering, Prasad V. Potluri Siddhartha Institute of Technology, Vijayawada 520007, Andhra Pradesh, India
- Jana Shafi
  - Department of Computer Science, College of Arts and Science, Prince Sattam bin Abdul Aziz University, Wadi Ad-Dawasir 11991, Saudi Arabia
- T Balamurali Krishna
  - Department of Computer Science and Engineering, Dhanekula Institute of Engineering and Technology, Vijayawada 521139, Andhra Pradesh, India
- Canavoy Narahari Sujatha
  - Department of Electronics and Communication Engineering, Sreenidhi Institute of Science and Technology, Hyderabad 501301, Telangana, India
- S Phani Praveen
  - Department of Computer Science and Engineering, Prasad V. Potluri Siddhartha Institute of Technology, Vijayawada 520007, Andhra Pradesh, India
- Muhammad Fazal Ijaz
  - Department of Intelligent Mechatronics Engineering, Sejong University, Seoul 05006, Republic of Korea
|
34
|
Muneeb M, Feng S, Henschel A. Transfer learning for genotype-phenotype prediction using deep learning models. BMC Bioinformatics 2022; 23:511. [PMID: 36447153 PMCID: PMC9710151 DOI: 10.1186/s12859-022-05036-8] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 06/30/2022] [Accepted: 11/05/2022] [Indexed: 12/03/2022] Open
Abstract
BACKGROUND: For some understudied populations, the genotype data available for genotype-phenotype prediction are minimal. However, we can use data from other, larger populations to learn about disease-causing SNPs and apply that knowledge to genotype-phenotype prediction in small populations. This manuscript illustrates that transfer learning is applicable to genotype data and genotype-phenotype prediction. RESULTS: Using HAPGEN2 and PhenotypeSimulator, we generated eight phenotypes for 500 cases/500 controls (CEU, large population) and 100 cases/100 controls (YRI, small population). We considered 5 (4 phenotypes) and 10 (4 phenotypes) different risk SNPs for each phenotype to evaluate the proposed method. The improvement in accuracy with transfer learning for the eight phenotypes was between 2 and 14.2 percent. The two-tailed p-value between the classification accuracies for all phenotypes without transfer learning and with transfer learning was 0.0306 for the five-risk-SNP phenotypes and 0.0478 for the ten-risk-SNP phenotypes. CONCLUSION: The proposed pipeline transfers knowledge for the case/control classification of the small population. In addition, we argue that this method can also be used in the realm of endangered species and personalized medicine. If the large-population data are extensive compared to the small-population data, transfer learning results can be expected to improve significantly. We show that transfer learning can create powerful models for genotype-phenotype prediction in large, well-studied populations and fine-tune these models for populations where data are sparse.
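The abstract above describes training on a large population and fine-tuning on a small one. The sketch below shows that warm-start idea in its simplest form. It is not the authors' pipeline: a plain logistic regression is swapped in for their deep learning models, and the genotypes come from a toy simulator (not HAPGEN2 or PhenotypeSimulator). All function names, sample sizes, and hyperparameters are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def simulate(n, n_snps, risk, rng):
    """Toy cohort: genotypes coded 0/1/2, phenotype driven by the risk SNPs."""
    X, y = [], []
    for _ in range(n):
        g = [rng.choice((0, 1, 2)) for _ in range(n_snps)]
        score = sum(g[j] for j in risk) - len(risk)  # centred risk-allele count
        y.append(1 if rng.random() < sigmoid(score) else 0)
        X.append(g)
    return X, y

def train(X, y, weights=None, epochs=50, lr=0.1):
    """Logistic regression by batch gradient descent.

    Passing `weights` warm-starts training from an earlier model, which is
    the knowledge-transfer step; the last weight is the bias term.
    """
    w = list(weights) if weights is not None else [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + w[-1]) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def accuracy(w, X, y):
    preds = [1 if sum(wj * xj for wj, xj in zip(w, xi)) + w[-1] > 0 else 0 for xi in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)

rng = random.Random(0)
risk = range(5)                                     # five hypothetical risk SNPs
X_large, y_large = simulate(1000, 20, risk, rng)    # well-studied population
X_small, y_small = simulate(200, 20, risk, rng)     # understudied population
X_test, y_test = simulate(200, 20, risk, rng)       # held-out small-population data

w_pre = train(X_large, y_large)                           # pre-train on large cohort
w_ft = train(X_small, y_small, weights=w_pre, epochs=10)  # fine-tune on small cohort
w_scratch = train(X_small, y_small, epochs=10)            # baseline without transfer
print("fine-tuned:", accuracy(w_ft, X_test, y_test))
print("from scratch:", accuracy(w_scratch, X_test, y_test))
```

In this toy setup both cohorts share the same risk SNPs and effect sizes, which favours transfer. Real populations differ in allele frequencies and linkage structure, so any gain has to be measured on held-out data, as the abstract does.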
Affiliation(s)
- Muhammad Muneeb
- Department of Electrical Engineering and Computer Science, Khalifa University of Science and Technology, Al Saada St - Zone 1, Abu Dhabi, United Arab Emirates (GRID: grid.440568.b; ISNI: 0000 0004 1762 9729)
- Samuel Feng
- Department of Science and Engineering, Sorbonne University Abu Dhabi, PO Box 38044, Abu Dhabi, United Arab Emirates (GRID: grid.449223.a; ISNI: 0000 0004 1754 9534)
- Andreas Henschel
- Department of Electrical Engineering and Computer Science, Khalifa University of Science and Technology, Al Saada St - Zone 1, Abu Dhabi, United Arab Emirates (GRID: grid.440568.b; ISNI: 0000 0004 1762 9729)
|
35
|
Wang K, Yang B, Li Q, Liu S. Systematic Evaluation of Genomic Prediction Algorithms for Genomic Prediction and Breeding of Aquatic Animals. Genes (Basel) 2022; 13:2247. [PMID: 36553514 PMCID: PMC9778314 DOI: 10.3390/genes13122247] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/24/2022] [Revised: 11/18/2022] [Accepted: 11/25/2022] [Indexed: 12/04/2022]
Abstract
The extensive use of genomic selection (GS) in livestock and crops has led to a series of genomic-prediction (GP) algorithms despite the lack of a single algorithm that can suit all the species and traits. A systematic evaluation of available GP algorithms is thus necessary to identify the optimal GP algorithm for selective breeding in aquaculture species. In this study, a systematic comparison of ten GP algorithms, including both traditional and machine-learning algorithms, was conducted using publicly available genotype and phenotype data of eight traits, including weight and disease resistance traits, from five aquaculture species. The study aimed to provide insights into the optimal algorithm for GP in aquatic animals. Notably, no algorithm showed the best performance in all traits. However, reproducing kernel Hilbert space (RKHS) and support-vector machine (SVM) algorithms achieved relatively high prediction accuracies in most of the tested traits. Bayes A and random forest (RF) better prevented noise interference in the phenotypic data compared to the other algorithms. The prediction performances of GP algorithms in the Crassostrea gigas dataset were improved by using a genome-wide association study (GWAS) to select subsets of significant SNPs. An R package, "ASGS," which integrates the commonly used traditional and machine-learning algorithms for efficiently finding the optimal algorithm, was developed to assist the application of genomic selection breeding of aquaculture species. This work provides valuable information and a tool for optimizing algorithms for GP, aiding genetic breeding in aquaculture species.
Affiliation(s)
- Kuiqin Wang
- Key Laboratory of Mariculture, Ministry of Education, College of Fisheries, Ocean University of China, Qingdao 266003, China
- Ben Yang
- Key Laboratory of Mariculture, Ministry of Education, College of Fisheries, Ocean University of China, Qingdao 266003, China
- Qi Li
- Key Laboratory of Mariculture, Ministry of Education, College of Fisheries, Ocean University of China, Qingdao 266003, China
- Laboratory for Marine Fisheries Science and Food Production Processes, Qingdao National Laboratory for Marine Science and Technology, Qingdao 266237, China
- Shikai Liu
- Key Laboratory of Mariculture, Ministry of Education, College of Fisheries, Ocean University of China, Qingdao 266003, China
- Laboratory for Marine Fisheries Science and Food Production Processes, Qingdao National Laboratory for Marine Science and Technology, Qingdao 266237, China
- Correspondence: Tel.: +86-0532-82032595
|
36
|
Using model explanations to guide deep learning models towards consistent explanations for EHR data. Sci Rep 2022; 12:19899. [PMID: 36400825 PMCID: PMC9674624 DOI: 10.1038/s41598-022-24356-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2022] [Accepted: 11/14/2022] [Indexed: 11/19/2022]
Abstract
It has been shown that identical deep learning (DL) architectures will produce distinct explanations when trained with different hyperparameters that are orthogonal to the task (e.g. random seed, training set order). In domains such as healthcare and finance, where transparency and explainability are paramount, this can be a significant barrier to DL adoption. In this study we present a further analysis of explanation (in)consistency on 6 tabular datasets/tasks, with a focus on Electronic Health Records data. We propose a novel deep learning ensemble architecture that trains its sub-models to produce consistent explanations, improving explanation consistency by as much as 315% (e.g. from 0.02433 to 0.1011 on MIMIC-IV), and on average by 124% (e.g. from 0.12282 to 0.4450 on the BCW dataset). We evaluate the effectiveness of our proposed technique and discuss the implications our results have for both industrial applications of DL and explainability as well as future methodological work.
|
37
|
Towards a better understanding of TF-DNA binding prediction from genomic features. Comput Biol Med 2022; 149:105993. [DOI: 10.1016/j.compbiomed.2022.105993] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2022] [Revised: 07/12/2022] [Accepted: 08/14/2022] [Indexed: 11/17/2022]
|
38
|
Piernik M, Brzezinski D, Sztromwasser P, Pacewicz K, Majer-Burman W, Gniot M, Sielski D, Bryzghalov O, Wozna A, Zawadzki P. DBFE: distribution-based feature extraction from structural variants in whole-genome data. Bioinformatics 2022; 38:4466-4473. [PMID: 35929780 DOI: 10.1093/bioinformatics/btac513] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/09/2022] [Revised: 07/12/2022] [Indexed: 12/24/2022]
Abstract
MOTIVATION Whole-genome sequencing has revolutionized biosciences by providing tools for constructing complete DNA sequences of individuals. With entire genomes at hand, scientists can pinpoint DNA fragments responsible for oncogenesis and predict patient responses to cancer treatments. Machine learning plays a paramount role in this process. However, the sheer volume of whole-genome data makes it difficult to encode the characteristics of genomic variants as features for learning algorithms. RESULTS In this article, we propose three feature extraction methods that facilitate classifier learning from sets of genomic variants. The core contributions of this work include: (i) strategies for determining features using variant length binning, clustering and density estimation; (ii) a programming library for automating distribution-based feature extraction in machine learning pipelines. The proposed methods have been validated on five real-world datasets using four different classification algorithms and a clustering approach. Experiments on genomes of 219 ovarian, 61 lung and 929 breast cancer patients show that the proposed approaches automatically identify genomic biomarkers associated with cancer subtypes and clinical response to oncological treatment. Finally, we show that the extracted features can be used alongside unsupervised learning methods to analyze genomic samples. AVAILABILITY AND IMPLEMENTATION The source code of the presented algorithms and reproducible experimental scripts are available on GitHub at https://github.com/MNMdiagnostics/dbfe. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Maciej Piernik
- Institute of Computing Science, Faculty of Computing and Telecommunications, Poznan University of Technology, 60-965 Poznan, Poland; MNM Bioscience Inc., Cambridge, MA 02142, USA
- Dariusz Brzezinski
- Institute of Computing Science, Faculty of Computing and Telecommunications, Poznan University of Technology, 60-965 Poznan, Poland; MNM Bioscience Inc., Cambridge, MA 02142, USA; Institute of Bioorganic Chemistry of the Polish Academy of Sciences, 61-704 Poznan, Poland
- Michal Gniot
- MNM Bioscience Inc., Cambridge, MA 02142, USA; Department of Hematology and Bone Marrow Transplantation, Poznan University of Medical Sciences, 60-569 Poznan, Poland
- Alicja Wozna
- MNM Bioscience Inc., Cambridge, MA 02142, USA; Faculty of Physics, Adam Mickiewicz University, 61-614 Poznan, Poland
- Pawel Zawadzki
- MNM Bioscience Inc., Cambridge, MA 02142, USA; Faculty of Physics, Adam Mickiewicz University, 61-614 Poznan, Poland
|
39
|
Zoghi S, Masoudi MS, Taheri R. The Evolving Role of Next Generation Sequencing in Pediatric Neurosurgery: a Call for Action for Research, Clinical Practice, and Optimization of Care. World Neurosurg 2022; 168:232-242. [PMID: 36122859 DOI: 10.1016/j.wneu.2022.09.056] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/10/2022] [Revised: 09/12/2022] [Accepted: 09/13/2022] [Indexed: 11/29/2022]
Abstract
Next-generation sequencing (NGS) is one of the most promising technologies to have revolutionized many aspects of clinical practice in recent years. It is increasingly applied in many disciplines of medicine; however, pediatric neurosurgery, despite the technology's great potential, has not yet embraced it and remains hesitant to employ it in routine practice and guidelines. In this review, we briefly summarize the developments that led to the establishment of NGS technology, review the current applications and potential of NGS in the disorders treated by pediatric neurosurgeons, and lastly discuss the steps needed to better harness NGS in pediatric neurosurgery.
Affiliation(s)
- Sina Zoghi
- Department of Neurosurgery, Shiraz University of Medical Sciences, Shiraz, Iran; Student Research Committee, Shiraz University of Medical Sciences, Shiraz, Iran
- Reza Taheri
- Department of Neurosurgery, Shiraz University of Medical Sciences, Shiraz, Iran
|
40
|
Bouzinier MA, Etin D, Trifonov SI, Evdokimova VN, Ulitin V, Shen J, Kokorev A, Ghazani AA, Chekaluk Y, Albertyn Z, Giersch A, Morton CC, Abraamyan F, Bendapudi PK, Sunyaev S, Undiagnosed Diseases Network, Brigham Genomic Medicine, SEQuencing A Baby For An Optimal Outcome, Quantori, Krier JB. AnFiSA: An open-source computational platform for the analysis of sequencing data for rare genetic disease. J Biomed Inform 2022; 133:104174. [PMID: 35998814 DOI: 10.1016/j.jbi.2022.104174] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2021] [Revised: 07/23/2022] [Accepted: 08/15/2022] [Indexed: 11/28/2022]
Abstract
Despite genomic sequencing rapidly transforming from being a bench-side tool to a routine procedure in a hospital, there is a noticeable lack of genomic analysis software that supports both clinical and research workflows as well as crowdsourcing. Furthermore, most existing software packages are not forward-compatible in regards to supporting ever-changing diagnostic rules adopted by the genetics community. Regular updates of genomics databases pose challenges for reproducible and traceable automated genetic diagnostics tools. Lastly, most of the software tools score low on explainability amongst clinicians. We have created a fully open-source variant curation tool, AnFiSA, with the intention to invite and accept contributions from clinicians, researchers, and professional software developers. The design of AnFiSA addresses the aforementioned issues via the following architectural principles: using a multidimensional database management system (DBMS) for genomic data to address reproducibility, curated decision trees adaptable to changing clinical rules, and a crowdsourcing-friendly interface to address difficult-to-diagnose cases. We discuss how we have chosen our technology stack and describe the design and implementation of the software. Finally, we show in detail how selected workflows can be implemented using the current version of AnFiSA by a medical geneticist.
Affiliation(s)
- M A Bouzinier
- Division of Genetics, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- D Etin
- Forome Association, Boston, MA, USA; Oracle Corporation, USA
- V N Evdokimova
- Forome Association, Boston, MA, USA; SBCS Scientific Biomedical Consulting Services, London, UK
- V Ulitin
- Forome Association, Boston, MA, USA
- J Shen
- Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- A Kokorev
- ITMO University, St. Petersburg, Russian Federation
- A A Ghazani
- Division of Genetics, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA; Harvard Medical School, Boston, MA, USA; Brigham Genomic Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Y Chekaluk
- Division of Genetics, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Z Albertyn
- Division of Genetics, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- A Giersch
- Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- C C Morton
- Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA; Department of Obstetrics and Gynecology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA; Broad Institute of MIT and Harvard, Cambridge, MA, USA; Manchester Centre for Audiology and Deafness (ManCAD), School of Health Sciences, University of Manchester, UK
- F Abraamyan
- Division of Genetics, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- P K Bendapudi
- Division of Hemostasis and Thrombosis, Beth Israel Deaconess Medical Center, Boston, MA, USA; Division of Hematology and Blood Transfusion Service, Massachusetts General Hospital, Boston, MA, USA
- S Sunyaev
- Division of Genetics, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA; Harvard Medical School, Boston, MA, USA
- J B Krier
- Division of Genetics, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
|
41
|
Assessing comparative importance of DNA sequence and epigenetic modifications on gene expression using a deep convolutional neural network. Comput Struct Biotechnol J 2022; 20:3814-3823. [PMID: 35891778 PMCID: PMC9307602 DOI: 10.1016/j.csbj.2022.07.014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/14/2022] [Revised: 07/05/2022] [Accepted: 07/05/2022] [Indexed: 11/26/2022]
Abstract
Gene expression is regulated at both transcriptional and post-transcriptional levels. DNA sequence and epigenetic modifications are key factors that regulate gene transcription. Understanding their complex interactions and their respective contributions to gene expression regulation remains a challenge in biological studies. We have developed iSEGnet, a deep convolutional neural network framework that predicts mRNA abundance using information on DNA sequences as well as epigenetic modifications within genes and their cis-regulatory regions. We demonstrate that our framework outperforms other machine learning models in predicting mRNA abundance using transcriptional and epigenetic profiles from six distinct cell lines/types chosen from ENCODE. Analysis of the learned models also reveals that specific regions around promoters and transcription termination sites are most important for gene expression regulation. Using the method of Integrated Gradients, we identify narrow segments in these regions which are most likely to impact gene expression for a specific epigenetic modification. We further show that these identified segments are enriched in known active regulatory regions by comparing them with transcription factor binding sites obtained via ChIP-seq. Moreover, we demonstrate how iSEGnet can uncover potential transcription factors that have regulatory functions in cancer using two cancer multi-omics datasets.
|
42
|
Alharbi WS, Rashid M. A review of deep learning applications in human genomics using next-generation sequencing data. Hum Genomics 2022; 16:26. [PMID: 35879805 PMCID: PMC9317091 DOI: 10.1186/s40246-022-00396-x] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 11/24/2021] [Accepted: 07/12/2022] [Indexed: 12/02/2022]
Abstract
Genomics is advancing towards data-driven science. With the advent of high-throughput data-generating technologies in human genomics, we are overwhelmed by the volume of genomic data. Artificial intelligence, especially deep learning, has been instrumental in extracting knowledge and patterns from these data. In the current review, we address the development and application of deep learning methods/models in different subareas of human genomics. We assess which areas of genomics are over- and under-charted by deep learning techniques. The deep learning algorithms underlying the genomic tools are discussed briefly in the later part of this review. Finally, we briefly discuss recent applications of deep learning tools in genomics. In conclusion, this review is timely for biotechnology and genomics scientists, guiding them on why, when and how to use deep learning methods to analyse human genomic data.
Affiliation(s)
- Wardah S Alharbi
- Department of AI and Bioinformatics, King Abdullah International Medical Research Center (KAIMRC), King Saud Bin Abdulaziz University for Health Sciences (KSAU-HS), King Abdulaziz Medical City, Ministry of National Guard Health Affairs, P.O. Box 22490, Riyadh, 11426, Saudi Arabia
- Mamoon Rashid
- Department of AI and Bioinformatics, King Abdullah International Medical Research Center (KAIMRC), King Saud Bin Abdulaziz University for Health Sciences (KSAU-HS), King Abdulaziz Medical City, Ministry of National Guard Health Affairs, P.O. Box 22490, Riyadh, 11426, Saudi Arabia
|
43
|
Abstract
The tremendous amount of biological sequence data available, combined with the recent methodological breakthrough in deep learning in domains such as computer vision or natural language processing, is leading today to the transformation of bioinformatics through the emergence of deep genomics, the application of deep learning to genomic sequences. We review here the new applications that the use of deep learning enables in the field, focusing on three aspects: the functional annotation of genomes, the sequence determinants of the genome functions and the possibility to write synthetic genomic sequences.
|
44
|
Pocevičiūtė M, Eilertsen G, Jarkman S, Lundström C. Generalisation effects of predictive uncertainty estimation in deep learning for digital pathology. Sci Rep 2022; 12:8329. [PMID: 35585087 PMCID: PMC9117245 DOI: 10.1038/s41598-022-11826-0] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 02/18/2022] [Accepted: 04/27/2022] [Indexed: 01/20/2023]
Abstract
Deep learning (DL) has shown great potential in digital pathology applications. The robustness of a diagnostic DL-based solution is essential for safe clinical deployment. In this work we evaluate if adding uncertainty estimates for DL predictions in digital pathology could result in increased value for the clinical applications, by boosting the general predictive performance or by detecting mispredictions. We compare the effectiveness of model-integrated methods (MC dropout and Deep ensembles) with a model-agnostic approach (Test time augmentation, TTA). Moreover, four uncertainty metrics are compared. Our experiments focus on two domain shift scenarios: a shift to a different medical center and to an underrepresented subtype of cancer. Our results show that uncertainty estimates increase reliability by reducing a model’s sensitivity to classification threshold selection as well as by detecting between 70 and 90% of the mispredictions done by the model. Overall, the deep ensembles method achieved the best performance closely followed by TTA.
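As a rough illustration of how an ensemble-based uncertainty estimate can flag likely mispredictions, the sketch below averages the class probabilities of several ensemble members and thresholds the entropy of the averaged distribution. It is a minimal sketch under stated assumptions, not the method of the cited paper: predictive entropy is only one possible uncertainty metric (the abstract does not name the four it compares), and the function names and the threshold value are hypothetical.

```python
import math


def mean_prediction(member_probs):
    """Average per-class probabilities over ensemble members.

    member_probs: list of probability vectors, one per ensemble member,
    all of the same length (one entry per class).
    """
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    return [sum(p[c] for p in member_probs) / n_members for c in range(n_classes)]


def predictive_entropy(probs):
    """Shannon entropy (in nats) of a probability vector; zero terms are skipped."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def flag_uncertain(member_probs, threshold):
    """Return (predicted_class, entropy, is_flagged) for one sample.

    `threshold` is a hypothetical cut-off; in practice it would be tuned
    on validation data.
    """
    mean = mean_prediction(member_probs)
    label = max(range(len(mean)), key=mean.__getitem__)
    entropy = predictive_entropy(mean)
    return label, entropy, entropy > threshold


# Three ensemble members that agree, and three that disagree, on a binary task.
agreeing = [[0.90, 0.10], [0.95, 0.05], [0.92, 0.08]]
disagreeing = [[0.90, 0.10], [0.20, 0.80], [0.45, 0.55]]

print(flag_uncertain(agreeing, threshold=0.5))
print(flag_uncertain(disagreeing, threshold=0.5))
```

When the members disagree, the averaged distribution is close to uniform, so its entropy approaches ln 2 for two classes and the sample is flagged for review.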
Affiliation(s)
- Milda Pocevičiūtė
- Department of Science and Technology, Linköping University, Linköping, Sweden; Center for Medical Image Science and Visualization, Linköping University, Linköping, Sweden
- Gabriel Eilertsen
- Department of Science and Technology, Linköping University, Linköping, Sweden; Center for Medical Image Science and Visualization, Linköping University, Linköping, Sweden
- Sofia Jarkman
- Center for Medical Image Science and Visualization, Linköping University, Linköping, Sweden; Department of Clinical Pathology, and Department of Biomedical and Clinical Sciences, Linköping University, Linköping, Sweden
- Claes Lundström
- Department of Science and Technology, Linköping University, Linköping, Sweden; Center for Medical Image Science and Visualization, Linköping University, Linköping, Sweden; Sectra AB, Linköping, Sweden
|
45
|
Bhat GR, Sethi I, Rah B, Kumar R, Afroze D. Innovative in Silico Approaches for Characterization of Genes and Proteins. Front Genet 2022; 13:865182. [PMID: 35664302 PMCID: PMC9159363 DOI: 10.3389/fgene.2022.865182] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 01/29/2022] [Accepted: 04/11/2022] [Indexed: 11/13/2022]
Abstract
Bioinformatics is an amalgamation of biology, mathematics and computer science. It is a science which gathers molecular information from biology and applies informatics techniques to it in order to understand and organize the data in a useful manner. With the help of bioinformatics, experimentally generated data are stored in several databases available online, such as nucleotide databases, protein databases, GenBank and others. The data stored in these databases are used as a reference for experimental evaluation and validation. To date, several online tools have been developed to analyze genomic, transcriptomic, proteomic, epigenomic and metabolomic data. Some of them include Human Splicing Finder (HSF), Exonic Splicing Enhancer Mutation taster, and others. A number of SNPs are observed in the non-coding, intronic regions and play a role in the regulation of genes, which may or may not directly affect protein expression. Many mutations are thought to influence the splicing mechanism by affecting existing splice sites or creating new sites. HSF was developed to predict the effect of a mutation (SNP) on the splicing mechanism/signal. Thus, the tool is helpful in predicting the effect of mutations on splicing signals and can provide data for a better understanding of intronic mutations, which can be further validated experimentally. Additionally, rapid advances in proteomics have steered researchers to organize the study of protein structure, function, relationships, and dynamics in space and time. The effective integration of all of these technologies will eventually drive next-generation systems biology, which will provide valuable biological insights for research, diagnostics, therapeutics and the development of personalized medicine.
Affiliation(s)
- Gh. Rasool Bhat
- Advanced Centre for Human Genetics, Sher-I- Kashmir Institute of Medical Sciences, Soura, India
- Itty Sethi
- Institute of Human Genetics, University of Jammu, Jammu, India
- Bilal Rah
- Advanced Centre for Human Genetics, Sher-I- Kashmir Institute of Medical Sciences, Soura, India
- Rakesh Kumar
- School of Biotechnology, Shri Mata Vaishno Devi University, Katra, India
- Dil Afroze
- Advanced Centre for Human Genetics, Sher-I- Kashmir Institute of Medical Sciences, Soura, India
|
46
|
Saravi B, Hassel F, Ülkümen S, Zink A, Shavlokhova V, Couillard-Despres S, Boeker M, Obid P, Lang GM. Artificial Intelligence-Driven Prediction Modeling and Decision Making in Spine Surgery Using Hybrid Machine Learning Models. J Pers Med 2022; 12:509. [PMID: 35455625 PMCID: PMC9029065 DOI: 10.3390/jpm12040509] [Citation(s) in RCA: 46] [Impact Index Per Article: 15.3] [Received: 02/26/2022] [Revised: 03/18/2022] [Accepted: 03/19/2022] [Indexed: 12/22/2022]
Abstract
Healthcare systems worldwide generate vast amounts of data from many different sources. Although of high complexity for a human being, it is essential to determine the patterns and minor variations in the genomic, radiological, laboratory, or clinical data that reliably differentiate phenotypes or allow high predictive accuracy in health-related tasks. Convolutional neural networks (CNN) are increasingly applied to image data for various tasks. Its use for non-imaging data becomes feasible through different modern machine learning techniques, converting non-imaging data into images before inputting them into the CNN model. Considering also that healthcare providers do not solely use one data modality for their decisions, this approach opens the door for multi-input/mixed data models which use a combination of patient information, such as genomic, radiological, and clinical data, to train a hybrid deep learning model. Thus, this reflects the main characteristic of artificial intelligence: simulating natural human behavior. The present review focuses on key advances in machine and deep learning, allowing for multi-perspective pattern recognition across the entire information set of patients in spine surgery. This is the first review of artificial intelligence focusing on hybrid models for deep learning applications in spine surgery, to the best of our knowledge. This is especially interesting as future tools are unlikely to use solely one data modality. The techniques discussed could become important in establishing a new approach to decision-making in spine surgery based on three fundamental pillars: (1) patient-specific, (2) artificial intelligence-driven, (3) integrating multimodal data. The findings reveal promising research that already took place to develop multi-input mixed-data hybrid decision-supporting models. Their implementation in spine surgery may hence be only a matter of time.
Affiliation(s)
- Babak Saravi
- Department of Orthopedics and Trauma Surgery, Medical Center-University of Freiburg, Faculty of Medicine, University of Freiburg, 79108 Freiburg, Germany
- Department of Spine Surgery, Loretto Hospital, 79100 Freiburg, Germany
- Institute of Experimental Neuroregeneration, Spinal Cord Injury and Tissue Regeneration Center Salzburg (SCI-TReCS), Paracelsus Medical University, 5020 Salzburg, Austria
- Frank Hassel
- Department of Spine Surgery, Loretto Hospital, 79100 Freiburg, Germany
- Sara Ülkümen
- Department of Orthopedics and Trauma Surgery, Medical Center-University of Freiburg, Faculty of Medicine, University of Freiburg, 79108 Freiburg, Germany
- Department of Spine Surgery, Loretto Hospital, 79100 Freiburg, Germany
- Alisia Zink
- Department of Spine Surgery, Loretto Hospital, 79100 Freiburg, Germany
- Veronika Shavlokhova
- Department of Oral and Maxillofacial Surgery, University Hospital Heidelberg, 69120 Heidelberg, Germany
- Sebastien Couillard-Despres
- Institute of Experimental Neuroregeneration, Spinal Cord Injury and Tissue Regeneration Center Salzburg (SCI-TReCS), Paracelsus Medical University, 5020 Salzburg, Austria
- Austrian Cluster for Tissue Regeneration, 1200 Vienna, Austria
- Martin Boeker
- Intelligence and Informatics in Medicine, Medical Center Rechts der Isar, School of Medicine, Technical University of Munich, 81675 Munich, Germany
- Peter Obid
- Department of Orthopedics and Trauma Surgery, Medical Center-University of Freiburg, Faculty of Medicine, University of Freiburg, 79108 Freiburg, Germany
- Gernot Michael Lang
- Department of Orthopedics and Trauma Surgery, Medical Center-University of Freiburg, Faculty of Medicine, University of Freiburg, 79108 Freiburg, Germany
|
47
|
Bio-Constrained Codes with Neural Network for Density-Based DNA Data Storage. Mathematics 2022. [DOI: 10.3390/math10050845] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Indexed: 11/16/2022]
Abstract
DNA has evolved as a cutting-edge medium for digital information storage due to its extremely high density and durable preservation to accommodate the data explosion. However, the strings of DNA are prone to errors during the hybridization process. In addition, DNA synthesis and sequences come with a cost that depends on the number of nucleotides present. An efficient model to store a large amount of data in a small number of nucleotides is essential, and it must control the hybridization errors among the base pairs. In this paper, a novel computational model is presented to design large DNA libraries of oligonucleotides. It is established by integrating a neural network (NN) with combinatorial biological constraints, including constant GC-content and satisfying Hamming distance and reverse-complement constraints. We develop a simple and efficient implementation of NNs to produce the optimal DNA codes, which opens the door to applying neural networks for DNA-based data storage. Further, the combinatorial bio-constraints are introduced to improve the lower bounds and to avoid the occurrence of errors in the DNA codes. Our goal is to compute large DNA codes in shorter sequences, which should avoid non-specific hybridization errors by satisfying the bio-constrained coding. The proposed model yields a significant improvement in the DNA library by explicitly constructing larger codes than the prior published codes.
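The three combinatorial constraints named in the abstract above (constant GC-content, minimum Hamming distance, and the reverse-complement constraint) can be checked for a candidate code as in the following sketch. It is a minimal illustration under stated assumptions and not the neural-network construction of the cited paper: the function names are hypothetical, GC-content is expressed as an exact count per codeword, and the reverse-complement constraint is taken in its common form, requiring every codeword to be at least `min_distance` away from the reverse complement of every codeword, itself included.

```python
from itertools import combinations

# Watson-Crick base pairing.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def gc_count(seq):
    """Number of G or C bases in a DNA string."""
    return sum(base in "GC" for base in seq)


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def reverse_complement(seq):
    """Reverse the sequence and replace each base by its complement."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def satisfies_constraints(codewords, min_distance, required_gc):
    """Check the three constraints for a list of equal-length codewords."""
    # Constant GC-content: every codeword has exactly `required_gc` G/C bases.
    if any(gc_count(word) != required_gc for word in codewords):
        return False
    # Hamming constraint: distinct codewords differ in >= min_distance positions.
    if any(hamming(a, b) < min_distance for a, b in combinations(codewords, 2)):
        return False
    # Reverse-complement constraint, including each codeword against itself.
    return all(
        hamming(a, reverse_complement(b)) >= min_distance
        for a in codewords
        for b in codewords
    )


print(satisfies_constraints(["ACGA", "CAAG"], min_distance=2, required_gc=2))
print(satisfies_constraints(["ACGA", "CATG"], min_distance=2, required_gc=2))
```

The second call fails because `CATG` equals its own reverse complement, so it would hybridize with copies of itself; this is the kind of non-specific hybridization the reverse-complement constraint is meant to exclude.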
|
48
|
Carvalho E, Morais M, Ferreira H, Silva M, Guimarães S, Pêgo A. A paradigm shift: Bioengineering meets mechanobiology towards overcoming remyelination failure. Biomaterials 2022; 283:121427. [DOI: 10.1016/j.biomaterials.2022.121427] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/28/2021] [Revised: 01/31/2022] [Accepted: 02/17/2022] [Indexed: 12/14/2022]
|
49
|
Kaczmarek E, Nanayakkara J, Sedghi A, Pesteie M, Tuschl T, Renwick N, Mousavi P. Topology preserving stratification of tissue neoplasticity using Deep Neural Maps and microRNA signatures. BMC Bioinformatics 2022; 23:38. [PMID: 35026982 PMCID: PMC8756719 DOI: 10.1186/s12859-022-04559-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/05/2021] [Accepted: 12/30/2021] [Indexed: 11/14/2022]
Abstract
Background: Accurate cancer classification is essential for correct treatment selection and better prognostication. microRNAs (miRNAs) are small RNA molecules that negatively regulate gene expression, and their dysregulation is a common disease mechanism in many cancers. Through a clearer understanding of miRNA dysregulation in cancer, improved mechanistic knowledge and better treatments can be sought. Results: We present a topology-preserving deep learning framework to study miRNA dysregulation in cancer. Our study comprises miRNA expression profiles from 3685 cancer and non-cancer tissue samples and hierarchical annotations on organ and neoplasticity status. Using unsupervised learning, a two-dimensional topological map is trained to cluster similar tissue samples. Labelled samples are used after training to assess clustering accuracy in terms of tissue-of-origin and neoplasticity status. In addition, an approach using activation gradients is developed to determine the attention of the networks to the miRNAs that drive the clustering. Using this deep learning framework, we classify the neoplasticity status of held-out test samples with an accuracy of 91.07%, the tissue-of-origin with 86.36%, and combined neoplasticity status and tissue-of-origin with an accuracy of 84.28%. The topological maps display the ability of miRNAs to recognize tissue types and neoplasticity status. Importantly, when our approach identifies samples that do not cluster well with their respective classes, activation gradients provide further insight into cancer subtypes or grades. Conclusions: An unsupervised deep learning approach is developed for cancer classification and interpretation. This work provides an intuitive approach for understanding molecular properties of cancer and has significant potential for cancer classification and treatment selection.
|
50
|
R E, Jain DK, Kotecha K, Pandya S, Reddy SS, E R, Varadarajan V, Mahanti A, V S. Hybrid Deep Neural Network for Handling Data Imbalance in Precursor MicroRNA. Front Public Health 2022; 9:821410. [PMID: 35004605 PMCID: PMC8733243 DOI: 10.3389/fpubh.2021.821410] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 11/24/2021] [Accepted: 12/03/2021] [Indexed: 11/13/2022] Open
Abstract
Over the last decade, the field of bioinformatics has grown rapidly, and robust bioinformatics tools will play a vital role in future progress. Scientists working in the field conduct extensive research to extract knowledge from the available biological data. Several bioinformatics problems have arisen from the creation of massive amounts of imbalanced data. The classification of precursor microRNA (pre-miRNA) from imbalanced RNA genome data is one such problem. Studies have shown that pre-miRNAs can serve as oncogenes or tumor suppressors in various cancer types. This paper introduces a Hybrid Deep Neural Network framework (H-DNN) for the classification of pre-miRNA in imbalanced data. The proposed H-DNN framework integrates Deep Artificial Neural Networks (Deep ANN) and Deep Decision Tree Classifiers. The Deep ANN in the proposed H-DNN extracts meaningful features, and the Deep Decision Tree Classifier classifies the pre-miRNA accurately. H-DNN was tested on genomes of animals, plants, humans, and Arabidopsis with an imbalance ratio of up to 1:5000, and on virus genomes with a ratio of 1:400. Experimental results showed an accuracy of more than 99% in all cases, and the time complexity of the proposed H-DNN is also much lower than that of existing approaches.
Affiliation(s)
- Elakkiya R, School of Computing, SASTRA Deemed University, Thanjavur, India
- Deepak Kumar Jain, College of Automation, Chongqing University of Posts and Telecommunications, Chongqing, China
- Ketan Kotecha, Symbiosis Centre for Applied Artificial Intelligence, Symbiosis International (Deemed University), Pune, India
- Sharnil Pandya, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, India
- Rajalakshmi E, School of Computing, SASTRA Deemed University, Thanjavur, India
- Vijayakumar Varadarajan, School of Computer Science and Engineering, University of New South Wales, Sydney, NSW, Australia
|