1. Shirguppe S, Gapinske M, Swami D, Gosstola N, Acharya P, Miskalis A, Joulani D, Szkwarek MG, Bhattacharjee A, Elias G, Stilger M, Winter J, Woods WS, Anand D, Lim CKW, Gaj T, Perez-Pinera P. In vivo CRISPR base editing for treatment of Huntington's disease. bioRxiv 2024:2024.07.05.602282. [PMID: 39005280 PMCID: PMC11245100 DOI: 10.1101/2024.07.05.602282] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/16/2024]
Abstract
Huntington's disease (HD) is an inherited and ultimately fatal neurodegenerative disorder caused by an expanded polyglutamine-encoding CAG repeat within exon 1 of the huntingtin (HTT) gene, which produces a mutant protein that destroys striatal and cortical neurons. Importantly, a critical event in the pathogenesis of HD is the proteolytic cleavage of the mutant HTT protein by caspase-6, which generates fragments of the N-terminal domain of the protein that form highly toxic aggregates. Given the role that proteolysis of the mutant HTT protein plays in HD, strategies for preventing this process hold potential for treating the disorder. By screening 141 CRISPR base editor variants targeting splice elements in the HTT gene, we identified platforms capable of producing HTT protein isoforms resistant to caspase-6-mediated proteolysis via editing of the splice acceptor sequence for exon 13. When delivered to the striatum of a rodent HD model, these base editors induced efficient exon skipping and decreased the formation of the N-terminal fragments, which in turn reduced HTT protein aggregation and attenuated striatal and cortical atrophy. Collectively, these results illustrate the potential for CRISPR base editing to decrease the toxicity of the mutant HTT protein for HD.
2. Babu HWS, Elangovan A, Iyer M, Kirola L, Muthusamy S, Jeeth P, Muthukumar S, Vanlalpeka H, Gopalakrishnan AV, Kadhirvel S, Kumar NS, Vellingiri B. Association Study Between Kynurenine 3-Monooxygenase (KMO) Gene and Parkinson's Disease Patients. Mol Neurobiol 2024; 61:3867-3881. [PMID: 38040995 DOI: 10.1007/s12035-023-03815-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2023] [Accepted: 11/18/2023] [Indexed: 12/03/2023]
Abstract
The influence of various risk factors such as aging, intricate cellular molecular processes, and lifestyle factors like smoking, alcohol consumption, caffeine intake, and occupational factors has received increased focus in relation to the risk and development of Parkinson's disease (PD). Limited research has been conducted on the assessment of lifestyle impact on kynurenine 3-monooxygenase (KMO) gene in PD. A total of 164 subjects, including 82 PD cases and 82 healthy individuals, were recruited based on specific inclusion and exclusion criteria. The severity of PD and clinical assessment were evaluated using the Unified Parkinson's Disease Rating Scale (UPDRS) and Hoehn and Yahr (HY) scaling. Sanger sequencing was performed to analyse the KMO gene in the recruited subjects, and case-control studies were conducted. The UPDRS assessment revealed significant impairments in smell, tremors, walking, and posture instability in the late-onset PD cohorts. The HY scaling indicated a higher proportion of late-onset cohorts in stage 2. Moreover, both alcoholic and non-alcoholic groups showed significantly increased levels of 3-HK in late-onset PD. Gene analysis identified missense variants at position g.241593373 T > A (rs752312199) and intronic variants at positions g.241592623A > G (rs640718), g.241592800C > A (rs990388262), g.241592802A > C (rs1350160268), g.241592808 T > C (rs1478255936), and g.241592812G > T (rs948928931). The alterations in the KMO gene were found to influence the levels of kynurenic acid (KYNA) and 3-hydroxykynurenine (3-HK). Genomic analysis revealed a high prevalence of missense mutations in the late-onset PD groups, leading to a decline in 3-HK levels in patients. This leads to the reduction of the progression of disease in late-onset groups which shows that this mutation may lead to the protective effect on the PD subjects. This study suggests the use of KYNA and 3-HK as potential biomarkers in analysing the progression of disease. 
This study is limited by its small sample size. To overcome this limitation, a larger study involving a greater number of participants is needed to thoroughly investigate the KMO gene and KP metabolites, to improve our understanding of Parkinson's disease progression, and to enhance diagnostic capabilities.
Affiliation(s)
- Harysh Winster Suresh Babu
  - Human Molecular Cytogenetics and Stem Cell Laboratory, Department of Human Genetics and Molecular Biology, Bharathiar University, Coimbatore, 641 046, Tamil Nadu, India
  - Stem Cell and Regenerative Medicine, Translational Research, Department of Zoology, School of Basic Sciences, Central University of Punjab, Bathinda, 151401, Punjab, India
- Ajay Elangovan
  - Stem Cell and Regenerative Medicine, Translational Research, Department of Zoology, School of Basic Sciences, Central University of Punjab, Bathinda, 151401, Punjab, India
- Mahalaxmi Iyer
  - Department of Microbiology, School of Basic Sciences, Central University of Punjab, Bathinda, 151401, Punjab, India
  - Centre for Neuroscience, Department of Biotechnology, Karpagam Academy of Higher Education (Deemed to be University), Coimbatore, India
- Laxmi Kirola
  - Amity Institute of Biotechnology, Amity University, Noida, 201301, India
  - Department of Biotechnology, School of Health Sciences and Technology (SoHST), UPES University, Dehradun, 248007, Uttarakhand, India
- Sureshan Muthusamy
  - School of Chemical & Biotechnology, SASTRA Deemed University, Thanjavur, 613401, India
- Priyanka Jeeth
  - Structural and Computational Biology Laboratory, Department of Computational Sciences, Central University of Punjab, 151401, Bathinda, Punjab, India
- Sindduja Muthukumar
  - Stem Cell and Regenerative Medicine, Translational Research, Department of Zoology, School of Basic Sciences, Central University of Punjab, Bathinda, 151401, Punjab, India
- Harvey Vanlalpeka
  - Department of Obstetrics and Gynaecology, Zoram Medical College, Falkawn, 796005, India
- Abilash Valsala Gopalakrishnan
  - Department of Biomedical Sciences, School of Biosciences and Technology, Vellore Institute of Technology, Tamil Nadu, Vellore, 632 014, India
- Saraboji Kadhirvel
  - Structural and Computational Biology Laboratory, Department of Computational Sciences, Central University of Punjab, 151401, Bathinda, Punjab, India
- Balachandar Vellingiri
  - Stem Cell and Regenerative Medicine, Translational Research, Department of Zoology, School of Basic Sciences, Central University of Punjab, Bathinda, 151401, Punjab, India.
3. Hwang H, Jeon H, Yeo N, Baek D. Big data and deep learning for RNA biology. Exp Mol Med 2024; 56:1293-1321. [PMID: 38871816 PMCID: PMC11263376 DOI: 10.1038/s12276-024-01243-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 01/31/2024] [Revised: 02/27/2024] [Accepted: 03/05/2024] [Indexed: 06/15/2024] Open
Abstract
The exponential growth of big data in RNA biology (RB) has led to the development of deep learning (DL) models that have driven crucial discoveries. As constantly evidenced by DL studies in other fields, the successful implementation of DL in RB depends heavily on the effective utilization of large-scale datasets from public databases. In achieving this goal, data encoding methods, learning algorithms, and techniques that align well with biological domain knowledge have played pivotal roles. In this review, we provide guiding principles for applying these DL concepts to various problems in RB by demonstrating successful examples and associated methodologies. We also discuss the remaining challenges in developing DL models for RB and suggest strategies to overcome these challenges. Overall, this review aims to illuminate the compelling potential of DL for RB and ways to apply this powerful technology to investigate the intriguing biology of RNA more effectively.
Affiliation(s)
- Hyeonseo Hwang
  - School of Biological Sciences, Seoul National University, Seoul, Republic of Korea
- Hyeonseong Jeon
  - Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea
  - Genome4me Inc., Seoul, Republic of Korea
- Nagyeong Yeo
  - School of Biological Sciences, Seoul National University, Seoul, Republic of Korea
- Daehyun Baek
  - School of Biological Sciences, Seoul National University, Seoul, Republic of Korea.
  - Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea.
  - Genome4me Inc., Seoul, Republic of Korea.
4. Zhao J, Li J, Yao J, Lin G, Chen C, Ye H, He X, Qu S, Chen Y, Wang D, Liang Y, Gao Z, Wu F. Enhanced PSO feature selection with Runge-Kutta and Gaussian sampling for precise gastric cancer recurrence prediction. Comput Biol Med 2024; 175:108437. [PMID: 38669732 DOI: 10.1016/j.compbiomed.2024.108437] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/23/2024] [Revised: 03/14/2024] [Accepted: 04/07/2024] [Indexed: 04/28/2024]
Abstract
Gastric cancer (GC), characterized by its inconspicuous initial symptoms and rapid invasiveness, presents a formidable challenge. Overlooking postoperative intervention opportunities may result in the dissemination of tumors to adjacent areas and distant organs, thereby substantially diminishing prospects for patient survival. Consequently, the prompt recognition and management of GC postoperative recurrence emerge as a matter of paramount urgency to mitigate the deleterious implications of the ailment. This study proposes an enhanced feature selection model, bRSPSO-FKNN, integrating boosted particle swarm optimization (RSPSO) with fuzzy k-nearest neighbor (FKNN), for predicting GC. It incorporates the Runge-Kutta search, for improved model accuracy, and Gaussian sampling, enhancing the search performance and helping to avoid locally optimal solutions. It outperforms the sophisticated variants of particle swarm optimization when evaluated in the CEC 2014 test suite. Furthermore, the bRSPSO-FKNN feature selection model was introduced for GC recurrence prediction analysis, achieving up to 82.082 % and 86.185 % accuracy and specificity, respectively. In summation, this model attains a notable level of precision, poised to ameliorate the early warning system for GC recurrence and, in turn, advance therapeutic options for afflicted patients.
Affiliation(s)
- Jungang Zhao
  - Department of Hepatobiliary Surgery, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou, Zhejiang, China.
- JiaCheng Li
  - Department of Hepatobiliary Surgery, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou, Zhejiang, China.
- Jiangqiao Yao
  - Department of Hepatobiliary Surgery, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou, Zhejiang, China.
- Ganglian Lin
  - Department of Hepatobiliary Surgery, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou, Zhejiang, China.
- Chao Chen
  - Department of Gastroenterology, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou, Zhejiang, China.
- Huajun Ye
  - Department of Gastroenterology, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou, Zhejiang, China.
- Xixi He
  - Department of Gastroenterology, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou, Zhejiang, China.
- Shanghu Qu
  - Department of Urology, Yunnan Tumor Hospital and the Third Affiliated Hospital of Kunming Medical University, Kunming, Yunnan, China.
- Yuxin Chen
  - Department of Gastroenterology, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou, Zhejiang, China.
- Danhong Wang
  - Wenzhou Medical University, Wenzhou, Zhejiang, China.
- Yingqi Liang
  - School of Pharmaceutical Sciences, Wenzhou Medical University, Wenzhou, Zhejiang, China.
- Zhihong Gao
  - Zhejiang Engineering Research Center of Intelligent Medicine, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou, Zhejiang, China.
- Fang Wu
  - Department of Gastroenterology, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou, Zhejiang, China.
5. Bhattacharyya N, Chai N, Hafford-Tear NJ, Sadan AN, Szabo A, Zarouchlioti C, Jedlickova J, Leung SK, Liao T, Dudakova L, Skalicka P, Parekh M, Moghul I, Jeffries AR, Cheetham ME, Muthusamy K, Hardcastle AJ, Pontikos N, Liskova P, Tuft SJ, Davidson AE. Deciphering novel TCF4-driven mechanisms underlying a common triplet repeat expansion-mediated disease. PLoS Genet 2024; 20:e1011230. [PMID: 38713708 PMCID: PMC11101122 DOI: 10.1371/journal.pgen.1011230] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/03/2023] [Revised: 05/17/2024] [Accepted: 03/19/2024] [Indexed: 05/09/2024] Open
Abstract
Fuchs endothelial corneal dystrophy (FECD) is an age-related cause of vision loss, and the most common repeat expansion-mediated disease in humans characterised to date. Up to 80% of European FECD cases have been attributed to expansion of a non-coding CTG repeat element (termed CTG18.1) located within the ubiquitously expressed transcription factor encoding gene, TCF4. The non-coding nature of the repeat and the transcriptomic complexity of TCF4 have made it extremely challenging to experimentally decipher the molecular mechanisms underlying this disease. Here we comprehensively describe CTG18.1 expansion-driven molecular components of disease within primary patient-derived corneal endothelial cells (CECs), generated from a large cohort of individuals with CTG18.1-expanded (Exp+) and CTG 18.1-independent (Exp-) FECD. We employ long-read, short-read, and spatial transcriptomic techniques to interrogate expansion-specific transcriptomic biomarkers. Interrogation of long-read sequencing and alternative splicing analysis of short-read transcriptomic data together reveals the global extent of altered splicing occurring within Exp+ FECD, and unique transcripts associated with CTG18.1-expansions. Similarly, differential gene expression analysis highlights the total transcriptomic consequences of Exp+ FECD within CECs. Furthermore, differential exon usage, pathway enrichment and spatial transcriptomics reveal TCF4 isoform ratio skewing solely in Exp+ FECD with potential downstream functional consequences. Lastly, exome data from 134 Exp- FECD cases identified rare (minor allele frequency <0.005) and potentially deleterious (CADD>15) TCF4 variants in 7/134 FECD Exp- cases, suggesting that TCF4 variants independent of CTG18.1 may increase FECD risk. In summary, our study supports the hypothesis that at least two distinct pathogenic mechanisms, RNA toxicity and TCF4 isoform-specific dysregulation, both underpin the pathophysiology of FECD. 
We anticipate these data will inform and guide the development of translational interventions for this common triplet-repeat mediated disease.
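The variant filter described in this abstract (minor allele frequency < 0.005 and CADD > 15 within TCF4) can be illustrated with a minimal sketch. The `Variant` record, its field names, and the function names are hypothetical and are not taken from the study; only the two thresholds and the gene name come from the abstract.

```python
from dataclasses import dataclass

# Thresholds quoted in the abstract; everything else here is an assumption.
MAF_MAX = 0.005   # "rare": minor allele frequency strictly below this value
CADD_MIN = 15.0   # "potentially deleterious": CADD score strictly above this value


@dataclass(frozen=True)
class Variant:
    """Hypothetical annotated variant record (not the study's data model)."""
    variant_id: str
    gene: str
    maf: float
    cadd: float


def is_rare_deleterious(variant, gene="TCF4"):
    """Return True if the variant lies in `gene` and passes both thresholds."""
    return (
        variant.gene == gene
        and variant.maf < MAF_MAX
        and variant.cadd > CADD_MIN
    )


def filter_variants(variants, gene="TCF4"):
    """Keep only rare, potentially deleterious variants in `gene`."""
    return [v for v in variants if is_rare_deleterious(v, gene)]
```

Both comparisons are strict, matching the "<" and ">" signs in the abstract; a pipeline that treats boundary values as passing would need `<=` or `>=` instead.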
Affiliation(s)
- Nihar Bhattacharyya
  - University College London Institute of Ophthalmology, London, United Kingdom
- Niuzheng Chai
  - University College London Institute of Ophthalmology, London, United Kingdom
- Amanda N. Sadan
  - University College London Institute of Ophthalmology, London, United Kingdom
- Anita Szabo
  - University College London Institute of Ophthalmology, London, United Kingdom
- Jana Jedlickova
  - Department of Paediatrics and Inherited Metabolic Disorders, First Faculty of Medicine, Charles University and General University Hospital in Prague, Prague, Czech Republic
- Szi Kay Leung
  - Faculty of Health and Life Sciences, University of Exeter, Exeter, United Kingdom
- Tianyi Liao
  - University College London Institute of Ophthalmology, London, United Kingdom
- Lubica Dudakova
  - Department of Paediatrics and Inherited Metabolic Disorders, First Faculty of Medicine, Charles University and General University Hospital in Prague, Prague, Czech Republic
- Pavlina Skalicka
  - Department of Paediatrics and Inherited Metabolic Disorders, First Faculty of Medicine, Charles University and General University Hospital in Prague, Prague, Czech Republic
  - Department of Ophthalmology, First Faculty of Medicine, Charles University and General University Hospital in Prague, Prague, Czech Republic
- Mohit Parekh
  - University College London Institute of Ophthalmology, London, United Kingdom
- Ismail Moghul
  - University College London Institute of Ophthalmology, London, United Kingdom
  - Moorfields Eye Hospital, London, United Kingdom
- Aaron R. Jeffries
  - Faculty of Health and Life Sciences, University of Exeter, Exeter, United Kingdom
- Michael E. Cheetham
  - University College London Institute of Ophthalmology, London, United Kingdom
- Alison J. Hardcastle
  - University College London Institute of Ophthalmology, London, United Kingdom
  - Moorfields Eye Hospital, London, United Kingdom
- Nikolas Pontikos
  - University College London Institute of Ophthalmology, London, United Kingdom
  - Moorfields Eye Hospital, London, United Kingdom
- Petra Liskova
  - Department of Paediatrics and Inherited Metabolic Disorders, First Faculty of Medicine, Charles University and General University Hospital in Prague, Prague, Czech Republic
  - Department of Ophthalmology, First Faculty of Medicine, Charles University and General University Hospital in Prague, Prague, Czech Republic
- Stephen J. Tuft
  - University College London Institute of Ophthalmology, London, United Kingdom
  - Moorfields Eye Hospital, London, United Kingdom
- Alice E. Davidson
  - University College London Institute of Ophthalmology, London, United Kingdom
  - Moorfields Eye Hospital, London, United Kingdom
6. Lee H, Ozbulak U, Park H, Depuydt S, De Neve W, Vankerschaver J. Assessing the reliability of point mutation as data augmentation for deep learning with genomic data. BMC Bioinformatics 2024; 25:170. [PMID: 38689247 PMCID: PMC11059627 DOI: 10.1186/s12859-024-05787-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/03/2024] [Accepted: 04/15/2024] [Indexed: 05/02/2024] Open
Abstract
BACKGROUND Deep neural networks (DNNs) have the potential to revolutionize our understanding and treatment of genetic diseases. An inherent limitation of deep neural networks, however, is their high demand for data during training. To overcome this challenge, other fields, such as computer vision, use various data augmentation techniques to artificially increase the available training data for DNNs. Unfortunately, most data augmentation techniques used in other domains do not transfer well to genomic data. RESULTS Most genomic data possesses peculiar properties and data augmentations may significantly alter the intrinsic properties of the data. In this work, we propose a novel data augmentation technique for genomic data inspired by biology: point mutations. By employing point mutations as substitutes for codons, we demonstrate that our newly proposed data augmentation technique enhances the performance of DNNs across various genomic tasks that involve coding regions, such as translation initiation and splice site detection. CONCLUSION Silent and missense mutations are found to positively influence effectiveness, while nonsense mutations and random mutations in non-coding regions generally lead to degradation. Overall, point mutation-based augmentations in genomic datasets present valuable opportunities for improving the accuracy and reliability of predictive models for DNA sequences.
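The idea of augmenting coding sequences with silent point mutations can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the paper: the function names, the per-codon `rate` parameter, and the choice to restrict substitutions to synonymous codons that differ by exactly one nucleotide are all hypothetical. The codon table is the standard genetic code.

```python
import random
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate an in-frame DNA sequence; a trailing partial codon is ignored."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


def silent_neighbors(codon):
    """Codons one nucleotide away from `codon` that encode the same residue."""
    residue = CODON_TABLE[codon]
    neighbors = []
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            candidate = codon[:pos] + base + codon[pos + 1:]
            if CODON_TABLE[candidate] == residue:
                neighbors.append(candidate)
    return neighbors


def augment_silent(cds, rate=0.1, rng=None):
    """Return a copy of `cds` with silent point mutations applied per codon.

    Each complete codon is mutated with probability `rate` (an assumed
    hyperparameter). Codons without a synonymous one-step neighbor, such as
    ATG and TGG, are always left unchanged.
    """
    rng = rng or random.Random()
    out = []
    for i in range(0, len(cds), 3):
        codon = cds[i:i + 3]
        options = silent_neighbors(codon) if len(codon) == 3 else []
        if options and rng.random() < rate:
            out.append(rng.choice(options))
        else:
            out.append(codon)
    return "".join(out)
```

Because every substitution is synonymous, `translate(augment_silent(cds))` equals `translate(cds)` for any in-frame input, so the label attached to the original sequence can be reused for the augmented copy. Missense augmentation, which the abstract also reports as beneficial, would need a different neighbor rule and is not covered by this sketch.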
Affiliation(s)
- Utku Ozbulak
  - Center for Biosystems and Biotech Data Science, Ghent University Global Campus, Incheon, South Korea
- Homin Park
  - Center for Biosystems and Biotech Data Science, Ghent University Global Campus, Incheon, South Korea
  - IDLab, Department of Electronics and Information Systems, Ghent University, Ghent, Belgium
- Stephen Depuydt
  - Erasmus Brussels University of Applied Sciences and Arts, Brussels, Belgium
- Wesley De Neve
  - Center for Biosystems and Biotech Data Science, Ghent University Global Campus, Incheon, South Korea
  - IDLab, Department of Electronics and Information Systems, Ghent University, Ghent, Belgium
- Joris Vankerschaver
  - Center for Biosystems and Biotech Data Science, Ghent University Global Campus, Incheon, South Korea.
  - Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Ghent, Belgium.
7. Xu C, Bao S, Chen H, Jiang T, Zhang C. Reference-informed prediction of alternative splicing and splicing-altering mutations from sequences. bioRxiv 2024:2024.03.22.586363. [PMID: 38586002 PMCID: PMC10996483 DOI: 10.1101/2024.03.22.586363] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/09/2024]
Abstract
Alternative splicing plays a crucial role in protein diversity and gene expression regulation in higher eukaryotes and mutations causing dysregulated splicing underlie a range of genetic diseases. Computational prediction of alternative splicing from genomic sequences not only provides insight into gene-regulatory mechanisms but also helps identify disease-causing mutations and drug targets. However, the current methods for the quantitative prediction of splice site usage still have limited accuracy. Here, we present DeltaSplice, a deep neural network model optimized to learn the impact of mutations on quantitative changes in alternative splicing from the comparative analysis of homologous genes. The model architecture enables DeltaSplice to perform "reference-informed prediction" by incorporating the known splice site usage of a reference gene sequence to improve its prediction on splicing-altering mutations. We benchmarked DeltaSplice and several other state-of-the-art methods on various prediction tasks, including evolutionary sequence divergence on lineage-specific splicing and splicing-altering mutations in human populations and neurodevelopmental disorders, and demonstrated that DeltaSplice outperformed consistently. DeltaSplice predicted ~15% of splicing quantitative trait loci (sQTLs) in the human brain as causal splicing-altering variants. It also predicted splicing-altering de novo mutations outside the splice sites in a subset of patients affected by autism and other neurodevelopmental disorders, including 19 genes with recurrent splicing-altering mutations. Among the new candidate disease risk genes, MFN1 is involved in mitochondria fusion, which is frequently disrupted in autism patients. Our work expanded the capacity of in silico splicing models with potential applications in genetic diagnosis and the development of splicing-based precision medicine.
Affiliation(s)
- Chencheng Xu
  - Bioinformatics Division, BNRIST, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
  - Present address: Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Suying Bao
  - Department of Systems Biology, Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY 10032, USA
  - Present address: Regeneron Pharmaceuticals, Terrytown, NY 10591, USA
- Hao Chen
  - Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA
  - Present address: Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Tao Jiang
  - Bioinformatics Division, BNRIST, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
  - Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA
- Chaolin Zhang
  - Department of Systems Biology, Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY 10032, USA
8. Speakman E, Gunaratne GH. On a kneading theory for gene-splicing. Chaos 2024; 34:043125. [PMID: 38579148 DOI: 10.1063/5.0199364] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/22/2024] [Accepted: 03/05/2024] [Indexed: 04/07/2024]
Abstract
Two well-known facets in protein synthesis in eukaryotic cells are transcription of DNA to pre-RNA in the nucleus and the translation of messenger-RNA (mRNA) to proteins in the cytoplasm. A critical intermediate step is the removal of segments (introns) containing ∼97% of the nucleic-acid sites in pre-RNA and sequential alignment of the retained segments (exons) to form mRNA through a process referred to as splicing. Alternative forms of splicing enrich the proteome while abnormal splicing can enhance the likelihood of a cell developing cancer or other diseases. Mechanisms for splicing and origins of splicing errors are only partially deciphered. Our goal is to determine if rules on splicing can be inferred from data analytics on nucleic-acid sequences. Toward that end, we represent a nucleic-acid site as a point in a plane defined in terms of the anterior and posterior sub-sequences of the site. The "point-set" representation expands analytical approaches, including the use of statistical tools, to characterize genome sequences. It is found that point-sets for exons and introns are visually different, and that the differences can be quantified using a family of generalized moments. We design a machine-learning algorithm that can recognize individual exons or introns with 91% accuracy. Point-set distributions and generalized moments are found to differ between organisms.
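One way to picture a "point-set" representation is to map each site to a point whose coordinates are built from the bases before and after it. The sketch below is only an illustration of that general idea: the base-to-digit assignment, the base-4 fractional expansion, and the `depth` cutoff are assumptions chosen for simplicity and may differ from the construction used in the paper.

```python
# Assumed digit assignment; the paper may use a different encoding.
DIGIT = {"A": 0, "C": 1, "G": 2, "T": 3}


def site_point(seq, i, depth=8):
    """Map site `i` of `seq` to a point (x, y) in the unit square.

    x is a base-4 fraction built from the `depth` bases preceding the site
    (nearest base = most significant digit); y is built the same way from
    the bases following the site. Positions that run past either end of
    the sequence contribute nothing.
    """
    x = 0.0
    y = 0.0
    for k in range(1, depth + 1):
        if i - k >= 0:
            x += DIGIT[seq[i - k]] / 4 ** k
        if i + k < len(seq):
            y += DIGIT[seq[i + k]] / 4 ** k
    return x, y


def point_set(seq, depth=8):
    """Return the list of points for every site in `seq`."""
    return [site_point(seq, i, depth) for i in range(len(seq))]
```

With this encoding, sites that share long stretches of upstream or downstream context land close together in the plane, which is what makes scatter plots and moment statistics over the point set informative.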
Affiliation(s)
- Ethan Speakman
  - Department of Physics, University of Houston, Houston, Texas 77204, USA
9. Liu X, Zhang H, Zeng Y, Zhu X, Zhu L, Fu J. DRANetSplicer: A Splice Site Prediction Model Based on Deep Residual Attention Networks. Genes (Basel) 2024; 15:404. [PMID: 38674339 PMCID: PMC11048956 DOI: 10.3390/genes15040404] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2024] [Revised: 03/20/2024] [Accepted: 03/23/2024] [Indexed: 04/28/2024] Open
Abstract
The precise identification of splice sites is essential for unraveling the structure and function of genes, constituting a pivotal step in the gene annotation process. In this study, we developed a novel deep learning model, DRANetSplicer, that integrates residual learning and attention mechanisms for enhanced accuracy in capturing the intricate features of splice sites. We constructed multiple datasets using the most recent versions of genomic data from three different organisms, Oryza sativa japonica, Arabidopsis thaliana and Homo sapiens. This approach allows us to train models with a richer set of high-quality data. DRANetSplicer outperformed benchmark methods on donor and acceptor splice site datasets, achieving an average accuracy of (96.57%, 95.82%) across the three organisms. Comparative analyses with benchmark methods, including SpliceFinder, Splice2Deep, Deep Splicer, EnsembleSplice, and DNABERT, revealed DRANetSplicer's superior predictive performance, resulting in at least a (4.2%, 11.6%) relative reduction in average error rate. We utilized the DRANetSplicer model trained on O. sativa japonica data to predict splice sites in A. thaliana, achieving accuracies for donor and acceptor sites of (94.89%, 94.25%). These results indicate that DRANetSplicer possesses excellent cross-organism predictive capabilities, with its performance in cross-organism predictions even surpassing that of benchmark methods in non-cross-organism predictions. Cross-organism validation showcased DRANetSplicer's excellence in predicting splice sites across similar organisms, supporting its applicability in gene annotation for understudied organisms. We employed multiple methods to visualize the decision-making process of the model. The visualization results indicate that DRANetSplicer can learn and interpret well-known biological features, further validating its overall performance. 
Our study systematically examined and confirmed the predictive ability of DRANetSplicer from various levels and perspectives, indicating that its practical application in gene annotation is justified.
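The abstract reports improvements as a "relative reduction in average error rate" rather than as a difference in accuracy. The arithmetic behind that metric can be sketched as below; the function names are hypothetical, and the baseline figures in the usage note are invented for illustration rather than taken from the paper.

```python
def error_rate(accuracy_pct):
    """Convert an accuracy in percent to an error rate in percent."""
    return 100.0 - accuracy_pct


def relative_error_reduction(baseline_acc_pct, model_acc_pct):
    """Relative reduction in error rate, in percent, of a model over a baseline.

    Defined as (baseline error - model error) / baseline error * 100.
    """
    baseline_err = error_rate(baseline_acc_pct)
    model_err = error_rate(model_acc_pct)
    if baseline_err <= 0:
        raise ValueError("baseline error must be positive")
    return (baseline_err - model_err) / baseline_err * 100.0
```

For example, moving from a hypothetical baseline accuracy of 90% to 95% halves the error rate (10% to 5%), which is a 50% relative reduction even though accuracy rose by only five points. This is why small accuracy gains near the top of the scale can correspond to large relative error reductions.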
Affiliation(s)
- Xueyan Liu
  - College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China
- Hongyan Zhang
  - College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China
- Ying Zeng
  - School of Computer and Communication, Hunan Institute of Engineering, Xiangtan 411104, China
- Xinghui Zhu
  - College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China
- Lei Zhu
  - College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China
- Jiahui Fu
  - College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China
10. Ferese R, Scala S, Suppa A, Campopiano R, Asci F, Zampogna A, Chiaravalloti MA, Griguoli A, Storto M, Pardo AD, Giardina E, Zampatti S, Fornai F, Novelli G, Fanelli M, Zecca C, Logroscino G, Centonze D, Gambardella S. Cohort analysis of novel SPAST variants in SPG4 patients and implementation of in vitro and in vivo studies to identify the pathogenic mechanism caused by splicing mutations. Front Neurol 2023; 14:1296924. [PMID: 38145127 PMCID: PMC10748595 DOI: 10.3389/fneur.2023.1296924] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2023] [Accepted: 11/14/2023] [Indexed: 12/26/2023] Open
Abstract
Introduction: Pure hereditary spastic paraplegia (SPG) type 4 (SPG4) is caused by mutations of the SPAST gene. This study aimed to analyze SPAST variants in SPG4 patients to highlight the occurrence of splicing mutations and combine functional studies to assess the relevance of these variants in the molecular mechanisms of the disease. Methods: We performed an NGS panel in 105 patients, in silico analysis for splicing mutations, and in vitro minigene assay. Results and discussion: The NGS panel was applied to screen 105 patients carrying a clinical phenotype corresponding to upper motor neuron syndrome (UMNS), selectively affecting motor control of lower limbs. Pathogenic mutations in SPAST were identified in 12 patients (11.42%): 5 missense, 3 frameshift, and 4 splicing variants. Then, we focused on the patients carrying splicing variants using a combined approach of in silico and in vitro analysis through minigene assay and RNA, if available. For two splicing variants (i.e., c.1245+1G>A and c.1414-2A>T), functional assays confirmed the types of molecular alterations suggested by the in silico analysis (loss of exon 9 and exon 12). In contrast, for the splicing variant c.1005-1delG, the outcome differed from what was predicted (skipping of exon 7): the functional study indicated a loss of frame and formation of a premature stop codon. The present study evidenced the high frequency of splice variants in SPG4 patients and indicated the relevance of functional assays added to in silico analysis to decipher the pathogenic mechanism.
Affiliation(s)
- Antonio Suppa
- IRCCS Neuromed, Pozzilli, Italy
- Department of Human Neurosciences, Sapienza University of Rome, Rome, Italy
- Emiliano Giardina
- Genomic Medicine Laboratory, IRCCS Fondazione Santa Lucia, Rome, Italy
- Stefania Zampatti
- Genomic Medicine Laboratory, IRCCS Fondazione Santa Lucia, Rome, Italy
- Francesco Fornai
- IRCCS Neuromed, Pozzilli, Italy
- Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, Pisa, Italy
- Giuseppe Novelli
- IRCCS Neuromed, Pozzilli, Italy
- Department of Biomedicine and Prevention, University of Rome “Tor Vergata”, Rome, Italy
- Mirco Fanelli
- Department of Biomolecular Sciences, University of Urbino “Carlo Bo”, Urbino, Italy
- Chiara Zecca
- Center for Neurodegenerative Diseases and the Aging Brain, Department of Clinical Research in Neurology of the University of Bari “Aldo Moro” at “Pia Fondazione Card G. Panico” Hospital Tricase, Lecce, Italy
- Giancarlo Logroscino
- Center for Neurodegenerative Diseases and the Aging Brain, Department of Clinical Research in Neurology of the University of Bari “Aldo Moro” at “Pia Fondazione Card G. Panico” Hospital Tricase, Lecce, Italy
- Diego Centonze
- IRCCS Neuromed, Pozzilli, Italy
- Department of Systems Medicine, Tor Vergata University, Rome, Italy
- Stefano Gambardella
- IRCCS Neuromed, Pozzilli, Italy
- Department of Biomolecular Sciences, University of Urbino “Carlo Bo”, Urbino, Italy
|
11
|
Toussaint PA, Leiser F, Thiebes S, Schlesner M, Brors B, Sunyaev A. Explainable artificial intelligence for omics data: a systematic mapping study. Brief Bioinform 2023; 25:bbad453. [PMID: 38113073 PMCID: PMC10729786 DOI: 10.1093/bib/bbad453] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/23/2022] [Revised: 07/28/2023] [Accepted: 11/08/2023] [Indexed: 12/21/2023] Open
Abstract
Researchers increasingly turn to explainable artificial intelligence (XAI) to analyze omics data and gain insights into the underlying biological processes. Yet, given the interdisciplinary nature of the field, many findings have only been shared in their respective research community. An overview of XAI for omics data is needed to highlight promising approaches and help detect common issues. Toward this end, we conducted a systematic mapping study. To identify relevant literature, we queried Scopus, PubMed, Web of Science, BioRxiv, MedRxiv and arXiv. Based on keywording, we developed a coding scheme with 10 facets regarding the studies' AI methods, explainability methods and omics data. Our mapping study resulted in 405 included papers published between 2010 and 2023. The inspected papers analyze DNA-based (mostly genomic), transcriptomic, proteomic or metabolomic data by means of neural networks, tree-based methods, statistical methods and further AI methods. The preferred post-hoc explainability methods are feature relevance (n = 166) and visual explanation (n = 52), while papers using interpretable approaches often resort to the use of transparent models (n = 83) or architecture modifications (n = 72). With many research gaps still apparent for XAI for omics data, we deduced eight research directions and discuss their potential for the field. We also provide exemplary research questions for each direction. Many problems with the adoption of XAI for omics data in clinical practice are yet to be resolved. This systematic mapping study outlines extant research on the topic and provides research directions for researchers and practitioners.
Affiliation(s)
- Philipp A Toussaint
- Department of Economics and Management, Karlsruhe Institute of Technology, Karlsruhe, Germany
- HIDSS4Health – Helmholtz Information and Data Science School for Health, Karlsruhe, Heidelberg, Germany
- Florian Leiser
- Department of Economics and Management, Karlsruhe Institute of Technology, Karlsruhe, Germany
- Scott Thiebes
- Department of Economics and Management, Karlsruhe Institute of Technology, Karlsruhe, Germany
- Matthias Schlesner
- Biomedical Informatics, Data Mining and Data Analytics, Faculty of Applied Computer Science and Medical Faculty, University of Augsburg, Augsburg, Germany
- Benedikt Brors
- Division of Applied Bioinformatics, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Translational Oncology, National Center for Tumor Diseases, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Ali Sunyaev
- Department of Economics and Management, Karlsruhe Institute of Technology, Karlsruhe, Germany
|
12
|
Weklak D, Tisborn J, Mangold MH, Scheu R, Wodrich H, Hagedorn C, Jönsson F, Kreppel F. Insights from the Construction of Adenovirus-Based Vaccine Candidates against SARS-CoV-2: Expecting the Unexpected. Viruses 2023; 15:2155. [PMID: 38005833 PMCID: PMC10675337 DOI: 10.3390/v15112155] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/29/2023] [Revised: 10/20/2023] [Accepted: 10/23/2023] [Indexed: 11/26/2023] Open
Abstract
To contain the spread of the SARS-CoV-2 pandemic, rapid development of vaccines was required in 2020. Rational design, international efforts, and a lot of hard work yielded the market approval of novel SARS-CoV-2 vaccines based on diverse platforms such as mRNA or adenovirus vectors. The great success of these technologies contributed significantly to controlling the pandemic. Consequently, most scientific literature available in the public domain discloses the results of clinical trials and reports efficacy data. However, a description of the processes and rationales that led to specific vaccine designs is only partially available, in particular for adenovirus vectors, even though it could prove helpful for future developments. Here, we disclose our insights from the endeavors to design compatible, functional expression cassettes for the SARS-CoV-2 spike protein in an adenoviral vector platform. We observed that transferring genes from an ssRNA virus into the context of a DNA virus poses significant challenges. Besides affecting physical titers, the expression cassette design of adenoviral vaccine candidates can affect viral propagation and spike protein expression. Splicing of mRNAs was affected, and fusogenicity of the spike protein in ACE2-overexpressing cells was enhanced when the ER retention signal was deleted.
Affiliation(s)
- Denice Weklak
- Chair of Biochemistry and Molecular Medicine, Center for Biomedical Education and Research (ZBAF), Witten/Herdecke University, Stockumer Str. 10, 58453 Witten, Germany; (D.W.); (J.T.); (M.H.M.); (R.S.); (C.H.)
- Julian Tisborn
- Chair of Biochemistry and Molecular Medicine, Center for Biomedical Education and Research (ZBAF), Witten/Herdecke University, Stockumer Str. 10, 58453 Witten, Germany; (D.W.); (J.T.); (M.H.M.); (R.S.); (C.H.)
- Maurin Helen Mangold
- Chair of Biochemistry and Molecular Medicine, Center for Biomedical Education and Research (ZBAF), Witten/Herdecke University, Stockumer Str. 10, 58453 Witten, Germany; (D.W.); (J.T.); (M.H.M.); (R.S.); (C.H.)
- Raphael Scheu
- Chair of Biochemistry and Molecular Medicine, Center for Biomedical Education and Research (ZBAF), Witten/Herdecke University, Stockumer Str. 10, 58453 Witten, Germany; (D.W.); (J.T.); (M.H.M.); (R.S.); (C.H.)
- Harald Wodrich
- Microbiologie Fondamentale et Pathogénicité, MFP CNRS UMR 5234, Université de Bordeaux, 33076 Bordeaux, France;
- Claudia Hagedorn
- Chair of Biochemistry and Molecular Medicine, Center for Biomedical Education and Research (ZBAF), Witten/Herdecke University, Stockumer Str. 10, 58453 Witten, Germany; (D.W.); (J.T.); (M.H.M.); (R.S.); (C.H.)
- Franziska Jönsson
- Chair of Biochemistry and Molecular Medicine, Center for Biomedical Education and Research (ZBAF), Witten/Herdecke University, Stockumer Str. 10, 58453 Witten, Germany; (D.W.); (J.T.); (M.H.M.); (R.S.); (C.H.)
- Florian Kreppel
- Chair of Biochemistry and Molecular Medicine, Center for Biomedical Education and Research (ZBAF), Witten/Herdecke University, Stockumer Str. 10, 58453 Witten, Germany; (D.W.); (J.T.); (M.H.M.); (R.S.); (C.H.)
|
13
|
Ditz JC, Reuter B, Pfeifer N. Inherently interpretable position-aware convolutional motif kernel networks for biological sequencing data. Sci Rep 2023; 13:17216. [PMID: 37821530 PMCID: PMC10567796 DOI: 10.1038/s41598-023-44175-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2023] [Accepted: 10/04/2023] [Indexed: 10/13/2023] Open
Abstract
Artificial neural networks show promising performance in detecting correlations within data that are associated with specific outcomes. However, the black-box nature of such models can hinder the advancement of knowledge in research fields by obscuring the decision process and preventing scientists from fully conceptualizing predicted outcomes. Furthermore, domain experts such as healthcare providers need explainable predictions to assess whether a predicted outcome can be trusted in high-stakes scenarios and to help them integrate a model into their own routine. Therefore, interpretable models play a crucial role in the incorporation of machine learning into high-stakes scenarios like healthcare. In this paper we introduce Convolutional Motif Kernel Networks, a neural network architecture that involves learning a feature representation within a subspace of the reproducing kernel Hilbert space of the position-aware motif kernel function. The resulting model enables direct interpretation and evaluation of prediction outcomes by providing a biologically and medically meaningful explanation without the need for additional post-hoc analysis. We show that our model is able to learn robustly on small datasets and reaches state-of-the-art performance on relevant healthcare prediction tasks. Our proposed method can be applied to DNA and protein sequences. Furthermore, we show that the proposed method learns biologically meaningful concepts directly from data using an end-to-end learning scheme.
Affiliation(s)
- Jonas C Ditz
- Methods in Medical Informatics, Department of Computer Science, University of Tübingen, Sand 14, Tübingen, 72076, Germany.
- Bernhard Reuter
- Methods in Medical Informatics, Department of Computer Science, University of Tübingen, Sand 14, Tübingen, 72076, Germany
- Nico Pfeifer
- Methods in Medical Informatics, Department of Computer Science, University of Tübingen, Sand 14, Tübingen, 72076, Germany.
|
14
|
Shen F, Hu C, Huang X, He H, Yang D, Zhao J, Yang X. Advances in alternative splicing identification: deep learning and pantranscriptome. Front Plant Sci 2023; 14:1232466. [PMID: 37790793 PMCID: PMC10544900 DOI: 10.3389/fpls.2023.1232466] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2023] [Accepted: 08/28/2023] [Indexed: 10/05/2023]
Abstract
In plants, alternative splicing is a crucial mechanism for regulating gene expression at the post-transcriptional level; it generates multiple mature mRNA isoforms, which leads to diverse proteins and diversifies gene regulation. Due to the complexity and variability of this process, accurate identification of splicing events is a vital step in studying alternative splicing. This article presents the application of alternative splicing algorithms with or without reference genomes in plants, as well as the integration of advanced deep learning techniques for improved detection accuracy. In addition, we discuss alternative splicing studies in the pan-genomic background and the usefulness of integrated strategies for fully profiling alternative splicing.
Affiliation(s)
- Fei Shen
- Institute of Biotechnology, Beijing Academy of Agriculture and Forestry Sciences, Beijing, China
- Chenyang Hu
- Institute of Biotechnology, Beijing Academy of Agriculture and Forestry Sciences, Beijing, China
- Shanxi Key Lab of Chinese Jujube, College of Life Science, Yan’an University, Yan’an, Shanxi, China
- Xin Huang
- Institute of Biotechnology, Beijing Academy of Agriculture and Forestry Sciences, Beijing, China
- Hao He
- Institute of Biotechnology, Beijing Academy of Agriculture and Forestry Sciences, Beijing, China
- Deng Yang
- Institute of Biotechnology, Beijing Academy of Agriculture and Forestry Sciences, Beijing, China
- Jirong Zhao
- Shanxi Key Lab of Chinese Jujube, College of Life Science, Yan’an University, Yan’an, Shanxi, China
- Xiaozeng Yang
- Institute of Biotechnology, Beijing Academy of Agriculture and Forestry Sciences, Beijing, China
|
15
|
Chao KH, Mao A, Salzberg SL, Pertea M. Splam: a deep-learning-based splice site predictor that improves spliced alignments. bioRxiv 2023:2023.07.27.550754. [PMID: 37546880 PMCID: PMC10402160 DOI: 10.1101/2023.07.27.550754] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/08/2023]
Abstract
The process of splicing messenger RNA to remove introns plays a central role in creating genes and gene variants. Here we describe Splam, a novel method for predicting splice junctions in DNA based on deep residual convolutional neural networks. Unlike some previous models, Splam looks at a relatively limited window of 400 base pairs flanking each splice site, motivated by the observation that the biological process of splicing relies primarily on signals within this window. Additionally, Splam introduces the idea of training the network on donor and acceptor pairs together, based on the principle that the splicing machinery recognizes both ends of each intron at once. We compare Splam's accuracy to recent state-of-the-art splice site prediction methods, particularly SpliceAI, another method that uses deep neural networks. Our results show that Splam is consistently more accurate than SpliceAI, with an overall accuracy of 96% at predicting human splice junctions. Splam generalizes even to non-human species, including distant ones like the flowering plant Arabidopsis thaliana. Finally, we demonstrate the use of Splam on a novel application: processing the spliced alignments of RNA-seq data to identify and eliminate errors. We show that when used in this manner, Splam yields substantial improvements in the accuracy of downstream transcriptome analysis of both poly(A) and ribo-depleted RNA-seq libraries. Overall, Splam offers a faster and more accurate approach to detecting splice junctions, while also providing a reliable and efficient solution for cleaning up erroneous spliced alignments.
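The two design ideas named in this abstract, a limited sequence window around each splice site and joint input of donor and acceptor, can be illustrated with a small input-preparation sketch. This is not Splam's actual preprocessing: the function names, the 200-nucleotide flank on each side (giving 400 bases per site), the `N` padding, and the coordinate convention are all assumptions made for the example.

```python
def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows (A, C, G, T).

    Unknown bases such as N become all-zero rows.
    """
    table = {
        "A": [1, 0, 0, 0],
        "C": [0, 1, 0, 0],
        "G": [0, 0, 1, 0],
        "T": [0, 0, 0, 1],
    }
    return [table.get(base, [0, 0, 0, 0]) for base in seq.upper()]


def junction_windows(genome, donor, acceptor, flank=200):
    """Return the donor and acceptor windows of one intron, concatenated.

    Assumed convention: `donor` is the 0-based index of the first intron
    base and `acceptor` the 0-based index of the last intron base.
    Each window spans `flank` bases on either side of the exon/intron
    boundary and is padded with N where it runs off the sequence.
    """
    def window(center):
        start, end = center - flank, center + flank
        left_pad = "N" * max(0, -start)
        right_pad = "N" * max(0, end - len(genome))
        return left_pad + genome[max(0, start):min(len(genome), end)] + right_pad

    # The acceptor boundary lies just after the last intron base.
    return window(donor) + window(acceptor + 1)


if __name__ == "__main__":
    # Toy sequence: exon, intron starting with GT and ending with AG, exon.
    genome = "A" * 300 + "GT" + "C" * 100 + "AG" + "T" * 300
    pair = junction_windows(genome, donor=300, acceptor=403)
    encoded = one_hot(pair)
    print(len(pair), len(encoded), len(encoded[0]))
```

Feeding both windows as one input lets a model score the two ends of an intron together instead of treating each site independently, which is the pairing principle the abstract describes.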
Affiliation(s)
- Kuan-Hao Chao
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA
- Alan Mao
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
- Steven L Salzberg
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD 21211, USA
- Mihaela Pertea
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
|
16
|
Zabardast A, Tamer EG, Son YA, Yılmaz A. An automated framework for evaluation of deep learning models for splice site predictions. Sci Rep 2023; 13:10221. [PMID: 37353532 PMCID: PMC10290104 DOI: 10.1038/s41598-023-34795-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2022] [Accepted: 05/08/2023] [Indexed: 06/25/2023] Open
Abstract
A novel framework for the automated evaluation of various deep learning-based splice site detectors is presented. The framework eliminates time-consuming development and experimenting activities for different codebases, architectures, and configurations to obtain the best models for a given RNA splice site dataset. RNA splicing is a cellular process in which pre-mRNAs are processed into mature mRNAs and used to produce multiple mRNA transcripts from a single gene sequence. Since the advancement of sequencing technologies, many splice site variants have been identified and associated with the diseases. So, RNA splice site prediction is essential for gene finding, genome annotation, disease-causing variants, and identification of potential biomarkers. Recently, deep learning models performed highly accurately for classifying genomic signals. Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM) and its bidirectional version (BLSTM), Gated Recurrent Unit (GRU), and its bidirectional version (BGRU) are promising models. During genomic data analysis, CNN's locality feature helps where each nucleotide correlates with other bases in its vicinity. In contrast, BLSTM can be trained bidirectionally, allowing sequential data to be processed from forward and reverse directions. Therefore, it can process 1-D encoded genomic data effectively. Even though both methods have been used in the literature, a performance comparison was missing. To compare selected models under similar conditions, we have created a blueprint for a series of networks with five different levels. As a case study, we compared CNN and BLSTM models' learning capabilities as building blocks for RNA splice site prediction in two different datasets. 
Overall, CNN performed better with [Formula: see text] accuracy ([Formula: see text] improvement), [Formula: see text] F1 score ([Formula: see text] improvement), and [Formula: see text] AUC-PR ([Formula: see text] improvement) in human splice site prediction. Likewise, better performance with [Formula: see text] accuracy ([Formula: see text] improvement), [Formula: see text] F1 score ([Formula: see text] improvement), and [Formula: see text] AUC-PR ([Formula: see text] improvement) was achieved in C. elegans splice site prediction. Overall, our results showed that CNN learns faster than BLSTM and BGRU. Moreover, CNN performs better at extracting sequence patterns than BLSTM and BGRU. To our knowledge, no other framework has been developed explicitly for evaluating splice detection models to select the best possible model in an automated manner. The proposed framework and blueprint should therefore help in selecting among deep learning models, such as CNN vs. BLSTM and BGRU, for splice site analysis or similar classification tasks in other problems.
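The comparison above rests on accuracy and F1 score (alongside AUC-PR). As a reminder of how the first two are derived from binary splice-site predictions, here is a minimal sketch; the function name `binary_metrics` and the toy labels are illustrative assumptions and do not reproduce the paper's evaluation code or its results.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true splice sites found
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # decoys rejected
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # decoys called as sites
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # true sites missed

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy labels: 1 = true splice site, 0 = decoy.
    y_true = [1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 1, 0]
    print(binary_metrics(y_true, y_pred))
```

On imbalanced splice-site datasets, where decoys far outnumber true sites, accuracy alone can look high for a model that misses most sites, which is one reason F1 and AUC-PR are reported next to it.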
Affiliation(s)
- Amin Zabardast
- Department of Health Informatics, Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
- Elif Güney Tamer
- Department of Health Informatics, Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
- Yeşim Aydın Son
- Department of Health Informatics, Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
- Arif Yılmaz
- Institute of Data Science, Maastricht University, Maastricht, The Netherlands.
|
17
|
McBeath E, Fujiwara K, Hofmann MC. Evidence-Based Guide to Using Artificial Introns for Tissue-Specific Knockout in Mice. Int J Mol Sci 2023; 24:10258. [PMID: 37373404 PMCID: PMC10299402 DOI: 10.3390/ijms241210258] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2023] [Revised: 06/09/2023] [Accepted: 06/10/2023] [Indexed: 06/29/2023] Open
Abstract
Until recently, methods for generating floxed mice either conventionally or by CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats)-Cas9 (CRISPR-associated protein 9) editing have been technically challenging, expensive and error-prone, or time-consuming. To circumvent these issues, several labs have started successfully using a small artificial intron to conditionally knock out (KO) a gene of interest in mice. However, many other labs are having difficulty getting the technique to work. The key problem appears to be either a failure to achieve correct splicing after the introduction of the artificial intron into the gene or, just as crucial, insufficient functional KO of the gene's protein after Cre-induced removal of the intron's branchpoint. Presented here is a guide on how to choose an appropriate exon and where to place the recombinase-regulated artificial intron (rAI) in that exon to prevent disrupting normal gene splicing while maximizing mRNA degradation after recombinase treatment. The reasoning behind each step in the guide is also discussed. Following these recommendations should increase the success rate of this easy, new, and alternative technique for producing tissue-specific KO mice.
Affiliation(s)
- Elena McBeath
- Department of Endocrine Neoplasia & Hormonal Disorders, MD Anderson Cancer Center, Houston, TX 77030, USA;
- Keigi Fujiwara
- National Coalition of Independent Scholars, Brattleboro, VT 05301, USA;
- Marie-Claude Hofmann
- Department of Endocrine Neoplasia & Hormonal Disorders, MD Anderson Cancer Center, Houston, TX 77030, USA;
|
18
|
Akpokiro V, Chowdhury HMAM, Olowofila S, Nusrat R, Oluwadare O. CNNSplice: Robust models for splice site prediction using convolutional neural networks. Comput Struct Biotechnol J 2023; 21:3210-3223. [PMID: 37304005 PMCID: PMC10250157 DOI: 10.1016/j.csbj.2023.05.031] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2023] [Revised: 05/25/2023] [Accepted: 05/28/2023] [Indexed: 06/13/2023] Open
Abstract
The identification of splice sites, the segments of an RNA gene where noncoding and coding sequences are connected in the 5' and 3' directions, is an essential post-transcriptional step for the annotation of functional genes and is required for the study and analysis of biological function in eukaryotic organisms through protein production and gene expression. Splice site detection tools have been proposed for this purpose; however, the models of these tools have a specific use case and are inefficient or typically not transferable between organisms. Here, we present CNNSplice, a set of deep convolutional neural network models for splice site prediction. Using the five-fold cross-validation model selection technique, we explore several models based on typical machine learning applications and propose five high-performing models to efficiently predict true and false splice sites (SS) in balanced and imbalanced datasets. Our evaluation results indicate that CNNSplice's models achieve better performance than existing methods across datasets from five organisms. In addition, our generality test shows the ability of CNNSplice's models to predict and annotate splice sites in new or poorly trained genome datasets, indicating a broad application spectrum. CNNSplice demonstrates improved model prediction, interpretability, and generalizability on genomic datasets compared to existing splice site prediction tools. We have developed a web server for the CNNSplice algorithm which can be publicly accessed here: http://www.cnnsplice.online.
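Five-fold cross-validation, the model selection technique mentioned in this abstract, partitions the samples into five folds and uses each fold once for validation while training on the other four. The sketch below shows only the index bookkeeping; the function name `k_fold_indices`, the shuffling seed, and the sample count are assumptions for illustration, not details taken from CNNSplice.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Shuffle once with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(indices)
    # Striding assigns every index to exactly one fold;
    # fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


if __name__ == "__main__":
    for fold_number, (train, validation) in enumerate(k_fold_indices(23), start=1):
        print(fold_number, len(train), len(validation))
```

A candidate architecture would be trained on each `train` set and scored on the matching `validation` set, and the five scores averaged before comparing architectures. For imbalanced datasets a stratified variant that preserves the class ratio in every fold is a common refinement.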
|
19
|
Lin BC, Katneni U, Jankowska KI, Meyer D, Kimchi-Sarfaty C. In silico methods for predicting functional synonymous variants. Genome Biol 2023; 24:126. [PMID: 37217943 PMCID: PMC10204308 DOI: 10.1186/s13059-023-02966-1] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 08/03/2022] [Accepted: 05/10/2023] [Indexed: 05/24/2023] Open
Abstract
Single nucleotide variants (SNVs) contribute to human genomic diversity. Synonymous SNVs were previously considered to be "silent," but mounting evidence has revealed that these variants can cause RNA and protein changes and are implicated in over 85 human diseases and cancers. Recent improvements in computational platforms have led to the development of numerous machine-learning tools, which can be used to advance synonymous SNV research. In this review, we discuss tools that should be used to investigate synonymous variants. We provide supportive examples from seminal studies that demonstrate how these tools have driven new discoveries of functional synonymous SNVs.
Affiliation(s)
- Brian C Lin
- Hemostasis Branch 1, Division of Hemostasis, Office of Plasma Protein Therapeutics CMC, Office of Therapeutic Products, Center for Biologics Evaluation and Research, US FDA, Silver Spring, MD, USA
- Upendra Katneni
- Hemostasis Branch 1, Division of Hemostasis, Office of Plasma Protein Therapeutics CMC, Office of Therapeutic Products, Center for Biologics Evaluation and Research, US FDA, Silver Spring, MD, USA
- Katarzyna I Jankowska
- Hemostasis Branch 1, Division of Hemostasis, Office of Plasma Protein Therapeutics CMC, Office of Therapeutic Products, Center for Biologics Evaluation and Research, US FDA, Silver Spring, MD, USA
- Douglas Meyer
- Hemostasis Branch 1, Division of Hemostasis, Office of Plasma Protein Therapeutics CMC, Office of Therapeutic Products, Center for Biologics Evaluation and Research, US FDA, Silver Spring, MD, USA
- Chava Kimchi-Sarfaty
- Hemostasis Branch 1, Division of Hemostasis, Office of Plasma Protein Therapeutics CMC, Office of Therapeutic Products, Center for Biologics Evaluation and Research, US FDA, Silver Spring, MD, USA.
|
20
|
Rogalska ME, Vivori C, Valcárcel J. Regulation of pre-mRNA splicing: roles in physiology and disease, and therapeutic prospects. Nat Rev Genet 2023; 24:251-269. [PMID: 36526860 DOI: 10.1038/s41576-022-00556-8] [Citation(s) in RCA: 50] [Impact Index Per Article: 50.0] [Accepted: 11/10/2022] [Indexed: 12/23/2022]
Abstract
The removal of introns from mRNA precursors and its regulation by alternative splicing are key for eukaryotic gene expression and cellular function, as evidenced by the numerous pathologies induced or modified by splicing alterations. Major recent advances have been made in understanding the structures and functions of the splicing machinery, in the description and classification of physiological and pathological isoforms and in the development of the first therapies for genetic diseases based on modulation of splicing. Here, we review this progress and discuss important remaining challenges, including predicting splice sites from genomic sequences, understanding the variety of molecular mechanisms and logic of splicing regulation, and harnessing this knowledge for probing gene function and disease aetiology and for the design of novel therapeutic approaches.
Affiliation(s)
- Malgorzata Ewa Rogalska
- Genome Biology Program, Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
- Claudia Vivori
- Genome Biology Program, Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
- Department of Medicine and Life Sciences, Universitat Pompeu Fabra (UPF), Barcelona, Spain
- The Francis Crick Institute, London, UK
- Juan Valcárcel
- Genome Biology Program, Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain.
- Department of Medicine and Life Sciences, Universitat Pompeu Fabra (UPF), Barcelona, Spain.
- Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain.
|
21
|
Patterson A, Elbasir A, Tian B, Auslander N. Computational Methods Summarizing Mutational Patterns in Cancer: Promise and Limitations for Clinical Applications. Cancers (Basel) 2023; 15:1958. [PMID: 37046619 PMCID: PMC10093138 DOI: 10.3390/cancers15071958] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2022] [Revised: 02/24/2023] [Accepted: 03/09/2023] [Indexed: 03/29/2023] Open
Abstract
Since the rise of next-generation sequencing technologies, the catalogue of mutations in cancer has been continuously expanding. To address the complexity of the cancer-genomic landscape and extract meaningful insights, numerous computational approaches have been developed over the last two decades. In this review, we survey the current leading computational methods to derive intricate mutational patterns in the context of clinical relevance. We begin with mutation signatures, explaining first how mutation signatures were developed and then examining the utility of studies using mutation signatures to correlate environmental effects on the cancer genome. Next, we examine current clinical research that employs mutation signatures and discuss the potential use cases and challenges of mutation signatures in clinical decision-making. We then examine computational studies developing tools to investigate complex patterns of mutations beyond the context of mutational signatures. We survey methods to identify cancer-driver genes, from single-driver studies to pathway and network analyses. In addition, we review methods inferring complex combinations of mutations for clinical tasks and using mutations integrated with multi-omics data to better predict cancer phenotypes. We examine the use of these tools for either discovery or prediction, including prediction of tumor origin, treatment outcomes, prognosis, and cancer typing. We further discuss the main limitations preventing widespread clinical integration of computational tools for the diagnosis and treatment of cancer. We end by proposing solutions to address these challenges using recent advances in machine learning.
Affiliation(s)
- Andrew Patterson
- Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- The Wistar Institute, Philadelphia, PA 19104, USA
- Bin Tian
- The Wistar Institute, Philadelphia, PA 19104, USA
- Noam Auslander
- The Wistar Institute, Philadelphia, PA 19104, USA
- Department of Cancer Biology, University of Pennsylvania, Philadelphia, PA 19104, USA
|
22
|
A deep intronic TCTN2 variant activating a cryptic exon predicted by SpliceRover in a patient with Joubert syndrome. J Hum Genet 2023:10.1038/s10038-023-01143-3. [PMID: 36894704 DOI: 10.1038/s10038-023-01143-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 09/14/2022] [Revised: 01/26/2023] [Accepted: 02/27/2023] [Indexed: 03/11/2023]
Abstract
The recent introduction of genome sequencing in genetic analysis has led to the identification of pathogenic variants located in deep introns. Recently, several new tools have emerged to predict the impact of variants on splicing. Here, we present a Japanese boy with Joubert syndrome and biallelic TCTN2 variants. Exome sequencing identified only a heterozygous maternal nonsense TCTN2 variant (NM_024809.5:c.916C>T, p.(Gln306Ter)). Subsequent genome sequencing identified a deep intronic variant (c.1033+423G>A) inherited from his father. The machine learning algorithms SpliceAI, Squirls, and Pangolin were unable to predict alterations in splicing by the c.1033+423G>A variant. SpliceRover, a tool for splice site prediction using FASTA sequence, was able to detect a cryptic exon that was 85 bp away from the variant and within the inverted Alu sequence, while SpliceRover scores for these splice sites showed a slight increase (donor) or decrease (acceptor) between the reference and mutant sequences. RNA sequencing and RT-PCR using urinary cells confirmed inclusion of the cryptic exon. The patient showed major symptoms of TCTN2-related disorders such as developmental delay, dysmorphic facial features and polydactyly. He also showed uncommon features such as retinal dystrophy, exotropia, abnormal pattern of respiration, and periventricular heterotopia, confirming these as features of TCTN2-related disorders. Our study highlights the usefulness of genome sequencing and RNA sequencing using urinary cells for molecular diagnosis of genetic disorders and suggests that a database of cryptic splice sites predicted in introns by SpliceRover using the reference sequences can be helpful in extracting candidate variants from large numbers of intronic variants in genome sequencing.
|
23
|
Barbosa P, Savisaar R, Carmo-Fonseca M, Fonseca A. Computational prediction of human deep intronic variation. Gigascience 2022; 12:giad085. [PMID: 37878682 PMCID: PMC10599398 DOI: 10.1093/gigascience/giad085] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/21/2023] [Revised: 06/07/2023] [Accepted: 09/20/2023] [Indexed: 10/27/2023]
Abstract
BACKGROUND The adoption of whole-genome sequencing in genetic screens has facilitated the detection of genetic variation in the intronic regions of genes, far from annotated splice sites. However, selecting an appropriate computational tool to discriminate functionally relevant genetic variants from those with no effect is challenging, particularly for deep intronic regions where independent benchmarks are scarce. RESULTS In this study, we have provided an overview of the computational methods available and the extent to which they can be used to analyze deep intronic variation. We leveraged diverse datasets to extensively evaluate tool performance across different intronic regions, distinguishing between variants that are expected to disrupt splicing through different molecular mechanisms. Notably, we compared the performance of SpliceAI, a widely used sequence-based deep learning model, with that of more recent methods that extend its original implementation. We observed considerable differences in tool performance depending on the region considered, with variants generating cryptic splice sites being better predicted than those that potentially affect splicing regulatory elements. Finally, we devised a novel quantitative assessment of tool interpretability and found that tools providing mechanistic explanations of their predictions are often correct with respect to the ground truth information, but the use of these tools results in decreased predictive power when compared to black box methods. CONCLUSIONS Our findings translate into practical recommendations for tool usage and provide a reference framework for applying prediction tools in deep intronic regions, enabling more informed decision-making by practitioners.
Affiliation(s)
- Pedro Barbosa
- LASIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, 1749-016, Lisboa, Portugal
- Instituto de Medicina Molecular João Lobo Antunes, Faculdade de Medicina, Universidade de Lisboa, 1649-028, Lisboa, Portugal
| | | | - Maria Carmo-Fonseca
- Instituto de Medicina Molecular João Lobo Antunes, Faculdade de Medicina, Universidade de Lisboa, 1649-028, Lisboa, Portugal
| | - Alcides Fonseca
- LASIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, 1749-016, Lisboa, Portugal
| |
|
24
|
Ding W, Abdel-Basset M, Hawash H, Ali AM. Explainability of artificial intelligence methods, applications and challenges: A comprehensive survey. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2022.10.013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
|
25
|
Comparison of In Silico Tools for Splice-Altering Variant Prediction Using Established Spliceogenic Variants: An End-User’s Point of View. Int J Genomics 2022; 2022:5265686. [PMID: 36275637 PMCID: PMC9584665 DOI: 10.1155/2022/5265686] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 02/28/2022] [Revised: 07/18/2022] [Accepted: 08/10/2022] [Indexed: 11/18/2022]
Abstract
Assessing the impact of variants of unknown significance on splicing has become a critical issue and a bottleneck, especially with the widespread implementation of whole-genome or exome sequencing. Although multiple in silico tools are available, the interpretation and application of these tools are difficult and practical guidelines are still lacking. A streamlined decision-making process can facilitate the downstream RNA analysis in a more efficient manner. Therefore, we evaluated the performance of 8 in silico tools (Splice Site Finder, MaxEntScan, Splice-site prediction by neural network, GeneSplicer, Human Splicing Finder, SpliceAI, Splicing Predictions in Consensus Elements, and SpliceRover) using 114 NF1 spliceogenic variants, experimentally validated at the mRNA level. The change in the predicted score incurred by the variant of the nearest wild-type splice site was analyzed, and for type II, III, and IV splice variants, the change in the prediction score of de novo or cryptic splice site was also analyzed. SpliceAI and SpliceRover, tools based on deep learning, outperformed all other tools, with AUCs of 0.972 and 0.924, respectively. For de novo and cryptic splice sites, SpliceAI outperformed all other tools and showed a sensitivity of 95.7% at an optimal cut-off of 0.02 score change. Our results show that deep learning algorithms, especially those of SpliceAI, are validated at a significantly higher rate than other in silico tools for clinically relevant NF1 variants. This suggests that deep learning algorithms outperform traditional probabilistic approaches and classical machine learning tools in predicting the de novo and cryptic splice sites.
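The cut-off evaluation described in this abstract can be illustrated with a minimal sketch. It assumes each variant is summarised by a (reference score, variant score) pair and that sensitivity is the fraction of known spliceogenic variants whose absolute score change reaches the cut-off. The function names and the toy scores are hypothetical and are not taken from the study or from any of the tools it evaluates.

```python
from typing import List, Tuple


def score_change(ref_score: float, alt_score: float) -> float:
    """Absolute change in a splice-site score between reference and variant."""
    return abs(alt_score - ref_score)


def sensitivity_at_cutoff(pairs: List[Tuple[float, float]], cutoff: float) -> float:
    """Fraction of known spliceogenic variants whose score change meets the cutoff."""
    if not pairs:
        raise ValueError("need at least one variant")
    hits = sum(1 for ref, alt in pairs if score_change(ref, alt) >= cutoff)
    return hits / len(pairs)


# Hypothetical (reference, variant) scores for four spliceogenic variants.
toy_pairs = [(0.01, 0.40), (0.00, 0.05), (0.10, 0.11), (0.02, 0.90)]
print(sensitivity_at_cutoff(toy_pairs, cutoff=0.02))
```

With these made-up scores, three of the four variants change by at least 0.02, so the sketch reports a sensitivity of 0.75. A real evaluation would also need non-spliceogenic controls to estimate specificity and an AUC.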
|
26
|
Park HM, Park Y, Berani U, Bang E, Vankerschaver J, Van Messem A, De Neve W, Shim H. In silico optimization of RNA-protein interactions for CRISPR-Cas13-based antimicrobials. Biol Direct 2022; 17:27. [PMID: 36207756 PMCID: PMC9547417 DOI: 10.1186/s13062-022-00339-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/04/2022] [Accepted: 09/19/2022] [Indexed: 12/04/2022]
Abstract
RNA–protein interactions are crucial for diverse biological processes. In prokaryotes, RNA–protein interactions enable adaptive immunity through CRISPR-Cas systems. These defence systems utilize CRISPR RNA (crRNA) templates acquired from past infections to destroy foreign genetic elements through crRNA-mediated nuclease activities of Cas proteins. Thanks to the programmability and specificity of CRISPR-Cas systems, CRISPR-based antimicrobials have the potential to be repurposed as new types of antibiotics. Unlike traditional antibiotics, these CRISPR-based antimicrobials can be designed to target specific bacteria and minimize detrimental effects on the human microbiome during antibacterial therapy. In this study, we explore the potential of CRISPR-based antimicrobials by optimizing the RNA–protein interactions of crRNAs and Cas13 proteins. CRISPR-Cas13 systems are unique as they degrade specific foreign RNAs using the crRNA template, which leads to non-specific RNase activities and cell cycle arrest. We show that a high proportion of the Cas13 systems have no colocalized CRISPR arrays, and the lack of direct association between crRNAs and Cas proteins may result in suboptimal RNA–protein interactions in the current tools. Here, we investigate the RNA–protein interactions of the Cas13-based systems by curating the validation dataset of Cas13 protein and CRISPR repeat pairs that are experimentally validated to interact, and the candidate dataset of CRISPR repeats that reside on the same genome as the currently known Cas13 proteins. To find optimal CRISPR-Cas13 interactions, we first validate the 3-D structure prediction of crRNAs based on their experimental structures. Next, we test a number of RNA–protein interaction programs to optimize the in silico docking of crRNAs with the Cas13 proteins. From this optimized pipeline, we find a number of candidate crRNAs that have comparable or better in silico docking with the Cas13 proteins of the current tools. 
This study fully automatizes the in silico optimization of RNA–protein interactions as an efficient preliminary step for designing effective CRISPR-Cas13-based antimicrobials.
Affiliation(s)
- Ho-Min Park
- Center for Biosystems and Biotech Data Science, Ghent University Global Campus, Incheon, South Korea; Department of Electronics and Information Systems, Ghent University, Ghent, Belgium
| | - Yunseol Park
- Center for Biosystems and Biotech Data Science, Ghent University Global Campus, Incheon, South Korea
| | - Urta Berani
- Center for Biosystems and Biotech Data Science, Ghent University Global Campus, Incheon, South Korea
| | - Eunkyu Bang
- Center for Biosystems and Biotech Data Science, Ghent University Global Campus, Incheon, South Korea
| | - Joris Vankerschaver
- Center for Biosystems and Biotech Data Science, Ghent University Global Campus, Incheon, South Korea; Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Ghent, Belgium
| | | | - Wesley De Neve
- Center for Biosystems and Biotech Data Science, Ghent University Global Campus, Incheon, South Korea; Department of Electronics and Information Systems, Ghent University, Ghent, Belgium
| | - Hyunjin Shim
- Center for Biosystems and Biotech Data Science, Ghent University Global Campus, Incheon, South Korea.
| |
|
27
|
Akpokiro V, Martin T, Oluwadare O. EnsembleSplice: ensemble deep learning model for splice site prediction. BMC Bioinformatics 2022; 23:413. [PMID: 36203144 PMCID: PMC9535948 DOI: 10.1186/s12859-022-04971-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/31/2022] [Accepted: 09/29/2022] [Indexed: 11/10/2022]
Abstract
BACKGROUND Identifying splice site regions is an important step in the genomic DNA sequencing pipelines of biomedical and pharmaceutical research. Within this research purview, efficient and accurate splice site detection is highly desirable, and a variety of computational models have been developed toward this end. Neural network architectures have recently been shown to outperform classical machine learning approaches for the task of splice site prediction. Despite these advances, there is still considerable potential for improvement, especially regarding model prediction accuracy and error rate. RESULTS Given these deficits, we propose EnsembleSplice, an ensemble learning architecture that combines four distinct convolutional neural network (CNN) model architectures and outperforms existing splice site detection methods on the evaluation metrics considered, including accuracy and error rate. We trained and tested a variety of ensembles made up of CNNs and DNNs using five-fold cross-validation to identify the model that performed best across the evaluation and diversity metrics. As a result, we developed our diverse and highly effective splice site (SS) detection model, which we evaluated using two genomic Homo sapiens datasets and an Arabidopsis thaliana dataset. For one of the Homo sapiens datasets, EnsembleSplice achieved accuracies of 94.16% for acceptor splice sites and 95.97% for donor splice sites, corresponding to error rates of 5.84% and 4.03%, respectively. CONCLUSIONS Our five-fold cross-validation ensured that the prediction accuracy of our models is consistent. For reproducibility, all the datasets used, models generated, and results in our work are publicly available in our GitHub repository: https://github.com/OluwadareLab/EnsembleSplice.
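The relationship between the accuracies and error rates reported above, and the idea of five-fold cross-validation, can be sketched as follows. This is an illustrative sketch only: the helper names are hypothetical, the fold assignment is a plain contiguous split, and nothing here reproduces the EnsembleSplice code or its actual data partitioning.

```python
from typing import List, Tuple


def error_rate_pct(accuracy_pct: float) -> float:
    """Error rate as the complement of accuracy, both in percent."""
    return round(100.0 - accuracy_pct, 2)


def k_fold_splits(n_samples: int, k: int = 5) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, test) index pairs; each sample is tested exactly once."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    splits = []
    start = 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        splits.append((train, test))
        start = stop
    return splits


print(error_rate_pct(94.16), error_rate_pct(95.97))
for train_idx, test_idx in k_fold_splits(12, k=5):
    print(len(train_idx), test_idx)
```

The two error rates printed match the 5.84% and 4.03% quoted in the abstract, which are simply 100% minus the corresponding accuracies. In practice the samples would usually be shuffled and stratified by class before splitting.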
Affiliation(s)
- Victor Akpokiro
- Department of Computer Science, University of Colorado, Colorado Springs, CO, 80918, USA
| | - Trevor Martin
- Department of Mathematics, Oberlin College, Oberlin, OH, 44074, USA
| | - Oluwatosin Oluwadare
- Department of Computer Science, University of Colorado, Colorado Springs, CO, 80918, USA.
| |
|
28
|
Liu Q, Fang H, Wang X, Wang M, Li S, Coin LJM, Li F, Song J. DeepGenGrep: a general deep learning-based predictor for multiple genomic signals and regions. Bioinformatics 2022; 38:4053-4061. [PMID: 35799358 DOI: 10.1093/bioinformatics/btac454] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2022] [Revised: 04/11/2022] [Accepted: 07/06/2022] [Indexed: 12/24/2022]
Abstract
MOTIVATION Accurate annotation of different genomic signals and regions (GSRs) from DNA sequences is fundamentally important for understanding gene structure, regulation and function. Numerous efforts have been made to develop machine learning-based predictors for in silico identification of GSRs. However, it remains a great challenge to identify GSRs as the performance of most existing approaches is unsatisfactory. As such, it is highly desirable to develop more accurate computational methods for GSRs prediction. RESULTS In this study, we propose a general deep learning framework termed DeepGenGrep, a general predictor for the systematic identification of multiple different GSRs from genomic DNA sequences. DeepGenGrep leverages the power of hybrid neural networks comprising a three-layer convolutional neural network and a two-layer long short-term memory to effectively learn useful feature representations from sequences. Benchmarking experiments demonstrate that DeepGenGrep outperforms several state-of-the-art approaches on identifying polyadenylation signals, translation initiation sites and splice sites across four eukaryotic species including Homo sapiens, Mus musculus, Bos taurus and Drosophila melanogaster. Overall, DeepGenGrep represents a useful tool for the high-throughput and cost-effective identification of potential GSRs in eukaryotic genomes. AVAILABILITY AND IMPLEMENTATION The webserver and source code are freely available at http://bigdata.biocie.cn/deepgengrep/home and Github (https://github.com/wx-cie/DeepGenGrep/). SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Quanzhong Liu
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling 712100, China
| | - Honglin Fang
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling 712100, China
| | - Xiao Wang
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling 712100, China
| | - Miao Wang
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling 712100, China
| | - Shuqin Li
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling 712100, China
| | - Lachlan J M Coin
- Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, VIC 3000, Australia
| | - Fuyi Li
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling 712100, China; Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, VIC 3000, Australia
| | - Jiangning Song
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia; Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
| |
|
29
|
Alharbi WS, Rashid M. A review of deep learning applications in human genomics using next-generation sequencing data. Hum Genomics 2022; 16:26. [PMID: 35879805 PMCID: PMC9317091 DOI: 10.1186/s40246-022-00396-x] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 11/24/2021] [Accepted: 07/12/2022] [Indexed: 12/02/2022]
Abstract
Genomics is advancing towards data-driven science. With the advent of high-throughput data-generating technologies in human genomics, we are overwhelmed by the volume of genomic data. Artificial intelligence, especially deep learning methods, has been instrumental in extracting knowledge and patterns from these data. In the current review, we address the development and application of deep learning methods/models in different subareas of human genomics. We assess which areas of genomics have been over- and under-charted by deep learning techniques. The deep learning algorithms underlying the genomic tools are discussed briefly in the later part of this review. Finally, we briefly discuss recent applications of deep learning tools in genomics. In conclusion, this review is timely for biotechnology and genomic scientists, guiding them on why, when and how to use deep learning methods to analyse human genomic data.
Affiliation(s)
- Wardah S Alharbi
- Department of AI and Bioinformatics, King Abdullah International Medical Research Center (KAIMRC), King Saud Bin Abdulaziz University for Health Sciences (KSAU-HS), King Abdulaziz Medical City, Ministry of National Guard Health Affairs, P.O. Box 22490, Riyadh, 11426, Saudi Arabia
| | - Mamoon Rashid
- Department of AI and Bioinformatics, King Abdullah International Medical Research Center (KAIMRC), King Saud Bin Abdulaziz University for Health Sciences (KSAU-HS), King Abdulaziz Medical City, Ministry of National Guard Health Affairs, P.O. Box 22490, Riyadh, 11426, Saudi Arabia.
| |
|
30
|
Lee J, Jeong H, Won D, Shin S, Lee ST, Choi JR, Byeon SH, Kuht HJ, Thomas MG, Han J. Noncanonical Splice Site and Deep Intronic FRMD7 Variants Activate Cryptic Exons in X-linked Infantile Nystagmus. Transl Vis Sci Technol 2022; 11:25. [PMID: 35762937 PMCID: PMC9251792 DOI: 10.1167/tvst.11.6.25] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 12/26/2022]
Abstract
Purpose We aim to report noncoding pathogenic variants in patients with FRMD7-related infantile nystagmus (FIN). Methods Genome sequencing (n = 2 families) and reanalysis of targeted panel next-generation sequencing (n = 2 families) were performed in genetically unsolved cases of suspected FIN. Previous sequence analysis showed no pathogenic coding variants in genes associated with infantile nystagmus. SpliceAI, SpliceRover, and Alamut consensus programs were used to annotate noncoding variants. A minigene splicing assay was performed to confirm aberrant splicing. In silico analysis of exonic splicing enhancers and silencers was also performed. Results FRMD7 intronic variants were identified based on genome sequencing and targeted next-generation sequencing analysis. These included c.285-12A>G (pedigree 1), c.284+63T>A (pedigrees 2 and 3), and c.383-1368A>G (pedigree 4). All variants were absent in gnomAD, and both the c.285-12A>G and c.284+63T>A variants were predicted to produce new splice acceptor gains by the SpliceAI, SpliceRover, and Alamut consensus approaches. However, the c.383-1368A>G variant had a significant impact score only in the SpliceRover program. The c.383-1368A>G variant was predicted to promote pseudoexon inclusion through exonic splicing enhancer binding. Aberrant exonizations were validated through minigene constructs, and all variants segregated in the families. Conclusions Deep learning–based annotation of noncoding variants facilitates the discovery of hidden genetic variations in patients with FIN. This study provides evidence of the effectiveness of combined deep learning–based splicing tools in identifying hidden pathogenic variants in previously unsolved patients with infantile nystagmus. Translational Relevance These results demonstrate that robust analysis using two deep learning splicing predictions and an in vitro functional study can lead to finding hidden genetic variations in unsolved patients.
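The intronic variants named in this abstract follow the HGVS coding-DNA pattern of a coding position, a signed intronic offset, and a substitution. As a minimal sketch, assuming only simple single-nucleotide substitutions of that form, the notation can be split into its parts with a regular expression. The class and function names are hypothetical, and this is not a full HGVS parser.

```python
import re
from typing import NamedTuple


class IntronicVariant(NamedTuple):
    anchor: int  # nearest exonic coding position (the number after "c.")
    offset: int  # signed distance into the intron (+ downstream, - upstream)
    ref: str     # reference base
    alt: str     # variant base


# Matches strings such as "c.284+63T>A" or "c.383-1368A>G" only.
_INTRONIC_SNV = re.compile(r"^c\.(\d+)([+-]\d+)([ACGT])>([ACGT])$")


def parse_intronic_snv(hgvs: str) -> IntronicVariant:
    """Parse a simple intronic substitution; stray spaces are ignored."""
    match = _INTRONIC_SNV.match(hgvs.replace(" ", ""))
    if match is None:
        raise ValueError(f"not a simple intronic substitution: {hgvs!r}")
    anchor, offset, ref, alt = match.groups()
    return IntronicVariant(int(anchor), int(offset), ref, alt)


for name in ("c.285-12A>G", "c.284+63T>A", "c.383-1368A>G"):
    variant = parse_intronic_snv(name)
    print(name, "->", abs(variant.offset), "bp from the nearest exon boundary")
```

The absolute offset gives the distance from the nearest exon boundary, which is what separates the near-splice-site variant (12 bp) from the deeper intronic ones (63 bp and 1368 bp) in this report.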
Affiliation(s)
- Junwon Lee
- Institute of Vision Research, Department of Ophthalmology, Gangnam Severance Hospital, Yonsei University College of Medicine, Seoul, South Korea
| | - Han Jeong
- Brain Korea 21 Project for Medical Science, Yonsei University, Seoul, South Korea; Institute of Vision Research, Department of Ophthalmology, Severance Hospital, Yonsei University College of Medicine, Seoul, South Korea
| | - Dongju Won
- Department of Laboratory Medicine, Severance Hospital, Yonsei University College of Medicine, Seoul, South Korea
| | - Saeam Shin
- Department of Laboratory Medicine, Severance Hospital, Yonsei University College of Medicine, Seoul, South Korea
| | - Seung-Tae Lee
- Department of Laboratory Medicine, Severance Hospital, Yonsei University College of Medicine, Seoul, South Korea; Dxome Co., Ltd. Seongnam-si, Gyeonggi-do, South Korea
| | - Jong Rak Choi
- Department of Laboratory Medicine, Severance Hospital, Yonsei University College of Medicine, Seoul, South Korea; Dxome Co., Ltd. Seongnam-si, Gyeonggi-do, South Korea
| | - Suk Ho Byeon
- Brain Korea 21 Project for Medical Science, Yonsei University, Seoul, South Korea; Institute of Vision Research, Department of Ophthalmology, Severance Hospital, Yonsei University College of Medicine, Seoul, South Korea
| | - Helen J Kuht
- The University of Leicester Ulverscroft Eye Unit, Department of Neuroscience, Psychology and Behaviour, University of Leicester, RKCSB, PO Box 65, Leicester LE2 7LX, UK
| | - Mervyn G Thomas
- The University of Leicester Ulverscroft Eye Unit, Department of Neuroscience, Psychology and Behaviour, University of Leicester, RKCSB, PO Box 65, Leicester LE2 7LX, UK
| | - Jinu Han
- Institute of Vision Research, Department of Ophthalmology, Gangnam Severance Hospital, Yonsei University College of Medicine, Seoul, South Korea
| |
|
31
|
Fernandez-Castillo E, Barbosa-Santillán LI, Falcon-Morales L, Sánchez-Escobar JJ. Deep Splicer: A CNN Model for Splice Site Prediction in Genetic Sequences. Genes (Basel) 2022; 13:907. [PMID: 35627292 PMCID: PMC9141016 DOI: 10.3390/genes13050907] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 04/06/2022] [Revised: 05/12/2022] [Accepted: 05/13/2022] [Indexed: 02/05/2023]
Abstract
Many living organisms have DNA in their cells that is responsible for their biological features. DNA is an organic molecule of two complementary strands of four different nucleotides wound up in a double helix. These nucleotides are adenine (A), thymine (T), guanine (G), and cytosine (C). Genes are DNA sequences containing the information to synthesize proteins. The genes of higher eukaryotic organisms contain coding sequences, known as exons, and non-coding sequences, known as introns, which are removed at splice sites after the DNA is transcribed into RNA. Genome annotation is the process of identifying the location of coding regions and determining their function. This process is fundamental for understanding gene structure; however, it is time-consuming and expensive when done by biochemical methods. With technological advances, splice site detection can be done computationally. Although various software tools have been developed to predict splice sites, they need to improve accuracy and reduce false-positive rates. The main goal of this research was to generate Deep Splicer, a deep learning model to identify splice sites in the genomes of humans and other species. This model has good performance metrics and a lower false-positive rate than the currently existing tools. Deep Splicer achieved an accuracy between 93.55% and 99.66% on the genetic sequences of different organisms, while Splice2Deep, another splice site detection tool, had an accuracy between 90.52% and 98.08%. Splice2Deep surpassed Deep Splicer on the accuracy obtained after evaluating C. elegans genomic sequences (97.88% vs. 93.62%) and A. thaliana (95.40% vs. 94.93%); however, Deep Splicer's accuracy was better for H. sapiens (98.94% vs. 97.15%) and D. melanogaster (97.14% vs. 92.30%). The rate of false positives was 0.11% for human genetic sequences and 0.25% for other species' genetic sequences. Another splice prediction tool, Splice Finder, had a false-positive rate between 1% and 3% for human sequences and between roughly 4% and 10% for other species' sequences.
Affiliation(s)
- Elisa Fernandez-Castillo
- School of Engineering and Sciences, Monterrey Institute of Technology and Higher Education, Guadalajara 45201, Mexico
| | - Liliana Ibeth Barbosa-Santillán
- School of Engineering and Sciences, Monterrey Institute of Technology and Higher Education, Guadalajara 45201, Mexico
| | - Luis Falcon-Morales
- School of Engineering and Sciences, Monterrey Institute of Technology and Higher Education, Guadalajara 45201, Mexico
| | | |
|
32
|
A systems genomics approach to uncover patient-specific pathogenic pathways and proteins in ulcerative colitis. Nat Commun 2022; 13:2299. [PMID: 35484353 PMCID: PMC9051123 DOI: 10.1038/s41467-022-29998-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 06/20/2019] [Accepted: 04/06/2022] [Indexed: 12/11/2022]
Abstract
We describe a precision medicine workflow, the integrated single nucleotide polymorphism network platform (iSNP), designed to determine the mechanisms by which SNPs affect cellular regulatory networks, and how SNP co-occurrences contribute to disease pathogenesis in ulcerative colitis (UC). Using SNP profiles of 378 UC patients we map the regulatory effects of the SNPs to a human signalling network containing protein-protein, miRNA-mRNA and transcription factor binding interactions. With unsupervised clustering algorithms we group these patient-specific networks into four distinct clusters driven by PRKCB, HLA, SNAI1/CEBPB/PTPN1 and VEGFA/XPO5/POLH hubs. The pathway analysis identifies calcium homeostasis, wound healing and cell motility as key processes in UC pathogenesis. Using transcriptomic data from an independent patient cohort, with three complementary validation approaches focusing on the SNP-affected genes, the patient specific modules and affected functions, we confirm the regulatory impact of non-coding SNPs. iSNP identified regulatory effects for disease-associated non-coding SNPs, and by predicting the patient-specific pathogenic processes, we propose a systems-level way to stratify patients. Single Nucleotide Polymorphisms (SNPs) affect cellular regulatory networks, and SNP co-occurrences contribute to disease pathogenesis in ulcerative colitis (UC). Here the authors introduce iSNP, a precision medicine pipeline that combines genomics and network biology approaches to uncover patient specific pathways affected in complex diseases.
|
33
|
Jankovic B, Gojobori T. From shallow to deep: some lessons learned from application of machine learning for recognition of functional genomic elements in human genome. Hum Genomics 2022; 16:7. [PMID: 35180894 PMCID: PMC8855580 DOI: 10.1186/s40246-022-00376-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/26/2021] [Accepted: 01/02/2022] [Indexed: 11/25/2022]
Abstract
Identification of genomic signals as indicators for functional genomic elements is one of the areas that received early and widespread application of machine learning methods. With time, the methods applied grew in variety and generally exhibited a tendency to improve their ability to identify some major genomic and transcriptomic signals. The evolution of machine learning in genomics followed a similar path to applications of machine learning in other fields. These were impacted in a major way by three dominant developments, namely an enormous increase in availability and quality of data, a significant increase in computational power available to machine learning applications, and finally, new machine learning paradigms, of which deep learning is the most well-known example. It is not easy in general to distinguish factors leading to improvements in results of applications of machine learning. This is even more so in the field of genomics, where the advent of next-generation sequencing and the increased ability to perform functional analysis of raw data have had a major effect on the applicability of machine learning in OMICS fields. In this paper, we survey the results from a subset of published work on the application of machine learning to the recognition of genomic signals and regions in the human genome and summarize some lessons learnt from this endeavor. There is no doubt that significant progress has been made in terms of both accuracy and reliability of models. Questions remain, however, as to whether the progress has been sufficient and what these developments bring to the field of genomics in general and human genomics in particular. Improving usability, interpretability and accuracy of models remains an important open challenge for current and future research in application of machine learning and more generally of artificial intelligence methods in genomics.
Affiliation(s)
- Boris Jankovic
- Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| | - Takashi Gojobori
- Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia; Division of Biological and Environmental Sciences and Engineering, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia.
| |
Collapse
|
34
|
Oligonucleotide correction of an intronic TIMMDC1 variant in cells of patients with severe neurodegenerative disorder. NPJ Genom Med 2022; 7:9. [PMID: 35091571 PMCID: PMC8799713 DOI: 10.1038/s41525-021-00277-7] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 07/17/2021] [Accepted: 12/09/2021] [Indexed: 11/08/2022] Open
Abstract
TIMMDC1 encodes the Translocase of Inner Mitochondrial Membrane Domain-Containing protein 1 (TIMMDC1) subunit of complex I of the electron transport chain responsible for ATP production. We studied a consanguineous family with two affected children, now deceased, who presented with failure to thrive in the early postnatal period, poor feeding, hypotonia, peripheral neuropathy and drug-resistant epilepsy. Genome sequencing data revealed a known, deep intronic pathogenic variant TIMMDC1 c.597-1340A>G, also present in gnomAD (~1/5000 frequency), that enhances aberrant splicing. Using RNA and protein analysis we show almost complete loss of TIMMDC1 protein and compromised mitochondrial complex I function. We have designed and applied two different splice-switching antisense oligonucleotides (SSO) to restore normal TIMMDC1 mRNA processing and protein levels in patients' cells. Quantitative proteomics and real-time metabolic analysis of mitochondrial function on patient fibroblasts treated with SSOs showed restoration of complex I subunit abundance and function. SSO-mediated therapy of this inevitably fatal TIMMDC1 neurologic disorder is an attractive possibility.
|
35
|
Kowarz E, Krutzke L, Külp M, Streb P, Larghero P, Reis J, Bracharz S, Engler T, Kochanek S, Marschalek R. Vaccine-induced COVID-19 mimicry syndrome. eLife 2022; 11:74974. [PMID: 35084333 PMCID: PMC8846585 DOI: 10.7554/elife.74974] [Citation(s) in RCA: 36] [Impact Index Per Article: 18.0] [Received: 10/25/2021] [Accepted: 01/21/2022] [Indexed: 12/02/2022] Open
Abstract
To fight the COVID-19 pandemic caused by the RNA virus SARS-CoV-2, a global vaccination campaign is in progress to achieve the immunization of billions of people mainly with adenoviral vector- or mRNA-based vaccines, all of which encode the SARS-CoV-2 Spike protein. In some rare cases, cerebral venous sinus thromboses (CVST) have been reported as a severe side effect occurring 4–14 days after the first vaccination and were often accompanied by thrombocytopenia. Besides CVST, splanchnic vein thromboses (SVT) and other thromboembolic events have been observed. These events only occurred following vaccination with adenoviral vector-based vaccines but not following vaccination with mRNA-based vaccines. Meanwhile, scientists have proposed an immune-based pathomechanism and the condition has been coined vaccine-induced immune thrombotic thrombocytopenia (VITT). Here, we describe an unexpected mechanism that could explain thromboembolic events occurring with DNA-based but not with RNA-based vaccines. We show that DNA-encoded mRNA coding for Spike protein can be spliced in a way that the transmembrane anchor of Spike is lost, so that nearly full-length Spike is secreted from cells. Secreted Spike variants could potentially initiate severe side effects when binding to cells via the ACE2 receptor. Avoiding such splicing events should become part of a rational vaccine design to increase safety of prospective vaccines.
Affiliation(s)
- Eric Kowarz
- Institute of Pharmaceutical Biology, Goethe-University, Frankfurt/Main, Germany
- Lea Krutzke
- Department of Gene Therapy, University of Ulm, Ulm, Germany
- Marius Külp
- Institute of Pharmaceutical Biology, Goethe-University, Frankfurt/Main, Germany
- Patrick Streb
- Institute of Pharmaceutical Biology, Goethe-University, Frankfurt/Main, Germany
- Patrizia Larghero
- Institute of Pharmaceutical Biology, Goethe-University, Frankfurt/Main, Germany
- Jennifer Reis
- Institute of Pharmaceutical Biology, Goethe-University, Frankfurt/Main, Germany
- Silvia Bracharz
- Institute of Pharmaceutical Biology, Goethe-University, Frankfurt/Main, Germany
- Tatjana Engler
- Department of Gene Therapy, University of Ulm, Ulm, Germany
- Rolf Marschalek
- Institute of Pharmaceutical Biology, Goethe-University, Frankfurt/Main, Germany
|
36
|
Abstract
In Eukarya, immature mRNA transcripts (pre-mRNA) often contain coding sequences, or exons, interleaved by non-coding sequences, or introns. Introns are removed upon splicing, and further regulation of the retained exons leads to alternatively spliced mRNA. The splicing reaction requires the stepwise assembly of the spliceosome, a macromolecular machine composed of small nuclear ribonucleoproteins (snRNPs). This review focuses on the early stage of spliceosome assembly, when U1 snRNP defines each intron 5’-splice site (5ʹss) in the pre-mRNA. We first introduce the splicing reaction and the impact of alternative splicing on gene expression regulation. Thereafter, we extensively discuss splicing descriptors that influence the 5ʹss selection by U1 snRNP, such as sequence determinants, and interactions mediated by U1-specific proteins or U1 small nuclear RNA (U1 snRNA). We also include examples of diseases that affect the 5ʹss selection by U1 snRNP, and discuss recent therapeutic advances that manipulate U1 snRNP 5ʹss selectivity with antisense oligonucleotides and small-molecule splicing switches.
Affiliation(s)
- Florian Malard
- Inserm U1212, CNRS UMR5320, ARNA Laboratory, University of Bordeaux, Bordeaux Cedex, France
- Cameron D Mackereth
- Inserm U1212, CNRS UMR5320, ARNA Laboratory, University of Bordeaux, Bordeaux Cedex, France
- Sébastien Campagne
- Inserm U1212, CNRS UMR5320, ARNA Laboratory, University of Bordeaux, Bordeaux Cedex, France
|
37
|
Yang G, Ye Q, Xia J. Unbox the black-box for the medical explainable AI via multi-modal and multi-centre data fusion: A mini-review, two showcases and beyond. Inf Fusion 2022; 77:29-52. [PMID: 34980946 PMCID: PMC8459787 DOI: 10.1016/j.inffus.2021.07.016] [Citation(s) in RCA: 140] [Impact Index Per Article: 70.0] [Received: 01/27/2021] [Revised: 05/25/2021] [Accepted: 07/25/2021] [Indexed: 05/04/2023]
Abstract
Explainable Artificial Intelligence (XAI) is an emerging research topic of machine learning aimed at unboxing how AI systems' black-box choices are made. This research field inspects the measures and models involved in decision-making and seeks solutions to explain them explicitly. Many of the machine learning algorithms cannot manifest how and why a decision has been cast. This is particularly true of the most popular deep neural network approaches currently in use. Consequently, our confidence in AI systems can be hindered by the lack of explainability in these black-box models. The XAI becomes more and more crucial for deep learning powered applications, especially for medical and healthcare studies, although in general these deep neural networks can return an arresting dividend in performance. The insufficient explainability and transparency in most existing AI systems can be one of the major reasons that successful implementation and integration of AI tools into routine clinical practice are uncommon. In this study, we first surveyed the current progress of XAI and in particular its advances in healthcare applications. We then introduced our solutions for XAI leveraging multi-modal and multi-centre data fusion, and subsequently validated in two showcases following real clinical scenarios. Comprehensive quantitative and qualitative analyses can prove the efficacy of our proposed XAI solutions, from which we can envisage successful applications in a broader range of clinical questions.
Affiliation(s)
- Guang Yang
- National Heart and Lung Institute, Imperial College London, London, UK
- Royal Brompton Hospital, London, UK
- Imperial Institute of Advanced Technology, Hangzhou, China
- Qinghao Ye
- Hangzhou Ocean’s Smart Boya Co., Ltd, China
- University of California, San Diego, La Jolla, CA, USA
- Jun Xia
- Radiology Department, Shenzhen Second People’s Hospital, Shenzhen, China
|
38
|
Scalzitti N, Kress A, Orhand R, Weber T, Moulinier L, Jeannin-Girardon A, Collet P, Poch O, Thompson JD. Spliceator: multi-species splice site prediction using convolutional neural networks. BMC Bioinformatics 2021; 22:561. [PMID: 34814826 PMCID: PMC8609763 DOI: 10.1186/s12859-021-04471-3] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Received: 07/04/2021] [Accepted: 11/09/2021] [Indexed: 12/14/2022] Open
Abstract
Background: Ab initio prediction of splice sites is an essential step in eukaryotic genome annotation. Recent predictors have exploited Deep Learning algorithms and reliable gene structures from model organisms. However, Deep Learning methods for non-model organisms are lacking. Results: We developed Spliceator to predict splice sites in a wide range of species, including model and non-model organisms. Spliceator uses a convolutional neural network and is trained on carefully validated data from over 100 organisms. We show that Spliceator achieves consistently high accuracy (89–92%) compared to existing methods on independent benchmarks from human, fish, fly, worm, plant and protist organisms. Conclusions: Spliceator is a new Deep Learning method trained on high-quality data, which can be used to predict splice sites in diverse organisms, ranging from human to protists, with consistently high accuracy.
Affiliation(s)
- Nicolas Scalzitti
- Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
- Arnaud Kress
- Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France; BiGEst-ICube Platform, ICube Laboratory, UMR7357, 1 rue Eugène Boeckel, 67000, Strasbourg, France
- Romain Orhand
- Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
- Thomas Weber
- Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
- Luc Moulinier
- Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France; BiGEst-ICube Platform, ICube Laboratory, UMR7357, 1 rue Eugène Boeckel, 67000, Strasbourg, France
- Anne Jeannin-Girardon
- Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
- Pierre Collet
- Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
- Olivier Poch
- Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
- Julie D Thompson
- Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
|
39
|
Abstract
Interpreting the effects of genetic variants is key to understanding individual susceptibility to disease and designing personalized therapeutic approaches. Modern experimental technologies are enabling the generation of massive compendia of human genome sequence data and associated molecular and phenotypic traits, together with genome-scale expression, epigenomics and other functional genomic data. Integrative computational models can leverage these data to understand variant impact, elucidate the effect of dysregulated genes on biological pathways in specific disease and tissue contexts, and interpret disease risk beyond what is feasible with experiments alone. In this Review, we discuss recent developments in machine learning algorithms for genome interpretation and for integrative molecular-level modelling of cells, tissues and organs relevant to disease. More specifically, we highlight existing methods and key challenges and opportunities in identifying specific disease-causing genetic variants and linking them to molecular pathways and, ultimately, to disease phenotypes.
|
40
|
Riepe TV, Khan M, Roosing S, Cremers FPM, 't Hoen PAC. Benchmarking deep learning splice prediction tools using functional splice assays. Hum Mutat 2021; 42:799-810. [PMID: 33942434 PMCID: PMC8360004 DOI: 10.1002/humu.24212] [Citation(s) in RCA: 50] [Impact Index Per Article: 16.7] [Received: 09/21/2020] [Revised: 03/16/2021] [Accepted: 04/17/2021] [Indexed: 12/21/2022]
Abstract
Hereditary disorders are frequently caused by genetic variants that affect pre-messenger RNA splicing. Though genetic variants in the canonical splice motifs almost always disrupt splicing, the pathogenicity of variants in the noncanonical splice sites (NCSS) and deep intronic (DI) regions is difficult to predict. Multiple splice prediction tools have been developed for this purpose, with the latest tools employing deep learning algorithms. We benchmarked established and deep learning splice prediction tools on published gold standard sets of 71 NCSS and 81 DI variants in the ABCA4 gene and 61 NCSS variants in the MYBPC3 gene with functional assessment in midigene and minigene splice assays. The selection of splice prediction tools included CADD, DSSP, GeneSplicer, MaxEntScan, MMSplice, NNSPLICE, SPIDEX, SpliceAI, SpliceRover, and SpliceSiteFinder-like. Based on the area under the receiver operating characteristic curve, the best-performing splice prediction tool was SpliceRover for ABCA4 NCSS variants, SpliceAI for ABCA4 DI variants, and the Alamut 3/4 consensus approach (GeneSplicer, MaxEntScan, NNSPLICE and SpliceSiteFinder-like) for NCSS variants in MYBPC3. Overall, the performance in a real-time clinical setting is much more modest than reported by the developers of the tools.
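The ranking criterion named in this abstract, the area under the receiver operating characteristic curve, can be illustrated with a minimal sketch. The function name `roc_auc` and the toy labels and scores are assumptions made for illustration; they are not data or code from the study, and any real benchmark would use the tools' actual prediction scores against the functional assay outcomes.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney U statistic.

    labels: 1 = variant affects splicing in the assay, 0 = no effect.
    scores: tool prediction scores (higher = more likely to affect splicing).
    Tied positive/negative pairs count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: six variants scored by one imaginary tool.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of splice-affecting from neutral variants, which is why the measure allows tools with different score scales to be compared on the same variant set.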
Affiliation(s)
- Tabea V. Riepe
- Centre for Molecular and Biomolecular Informatics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Nijmegen, The Netherlands
- Department of Human Genetics and Donders Institute for Brain, Cognition and Behavior, Radboud University Medical Center, Nijmegen, The Netherlands
- Mubeen Khan
- Department of Human Genetics and Donders Institute for Brain, Cognition and Behavior, Radboud University Medical Center, Nijmegen, The Netherlands
- Susanne Roosing
- Department of Human Genetics and Donders Institute for Brain, Cognition and Behavior, Radboud University Medical Center, Nijmegen, The Netherlands
- Frans P. M. Cremers
- Department of Human Genetics and Donders Institute for Brain, Cognition and Behavior, Radboud University Medical Center, Nijmegen, The Netherlands
- Peter A. C. 't Hoen
- Centre for Molecular and Biomolecular Informatics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Nijmegen, The Netherlands
|
41
|
Dasari CM, Bhukya R. Explainable deep neural networks for novel viral genome prediction. Appl Intell 2021; 52:3002-3017. [PMID: 34764607 PMCID: PMC8232563 DOI: 10.1007/s10489-021-02572-3] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Accepted: 05/26/2021] [Indexed: 11/27/2022]
Abstract
Viral infection causes a wide variety of human diseases including cancer and COVID-19. Viruses invade host cells and associate with host molecules, potentially disrupting the normal function of hosts in ways that lead to fatal diseases. Novel viral genome prediction is crucial for understanding complex viral diseases such as AIDS and Ebola. While most existing computational techniques classify viral genomes, the efficiency of the classification depends solely on the structural features extracted. The state-of-the-art DNN models achieved excellent performance by automatic extraction of classification features, but the degree of model explainability is relatively poor. During model training for viral prediction, the proposed CNN and CNN-LSTM based methods (EdeepVPP, EdeepVPP-hybrid) automatically extract features. EdeepVPP also performs model interpretability in order to extract the most important patterns that cause viral genomes through learned filters. It is an interpretable CNN model that extracts vital biologically relevant patterns (features) from feature maps of viral sequences. The EdeepVPP-hybrid predictor outperforms all the existing methods by achieving 0.992 mean AUC-ROC and 0.990 AUC-PR on 19 human metagenomic contig experiment datasets using 10-fold cross-validation. We evaluate the ability of CNN filters to detect patterns across high average activation values. To further assess the robustness of the EdeepVPP model, we perform leave-one-experiment-out cross-validation. It can work as a recommendation system to further analyze the raw sequences labeled as ‘unknown’ by alignment-based methods. We show that our interpretable model can extract patterns that are considered to be the most important features for predicting virus sequences through learned filters.
Affiliation(s)
- Raju Bhukya
- National Institute of Technology, Warangal, Telangana 506004 India
|
42
|
Zrimec J, Buric F, Kokina M, Garcia V, Zelezniak A. Learning the Regulatory Code of Gene Expression. Front Mol Biosci 2021; 8:673363. [PMID: 34179082 PMCID: PMC8223075 DOI: 10.3389/fmolb.2021.673363] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 02/27/2021] [Accepted: 05/24/2021] [Indexed: 11/13/2022] Open
Abstract
Data-driven machine learning is the method of choice for predicting molecular phenotypes from nucleotide sequence, modeling gene expression events including protein-DNA binding, chromatin states as well as mRNA and protein levels. Deep neural networks automatically learn informative sequence representations and interpreting them enables us to improve our understanding of the regulatory code governing gene expression. Here, we review the latest developments that apply shallow or deep learning to quantify molecular phenotypes and decode the cis-regulatory grammar from prokaryotic and eukaryotic sequencing data. Our approach is to build from the ground up, first focusing on the initiating protein-DNA interactions, then specific coding and non-coding regions, and finally on advances that combine multiple parts of the gene and mRNA regulatory structures, achieving unprecedented performance. We thus provide a quantitative view of gene expression regulation from nucleotide sequence, concluding with an information-centric overview of the central dogma of molecular biology.
Affiliation(s)
- Jan Zrimec
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Filip Buric
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Mariia Kokina
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kongens Lyngby, Denmark
- Victor Garcia
- School of Life Sciences and Facility Management, Zurich University of Applied Sciences, Wädenswil, Switzerland
- Aleksej Zelezniak
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Science for Life Laboratory, Stockholm, Sweden
|
43
|
Dutta A, Singh KK, Anand A. SpliceViNCI: Visualizing the splicing of non-canonical introns through recurrent neural networks. J Bioinform Comput Biol 2021; 19:2150014. [PMID: 34088258 DOI: 10.1142/s0219720021500141] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/18/2022]
Abstract
Most of the current computational models for splice junction prediction are based on the identification of canonical splice junctions. However, it is observed that junctions lacking the consensus dimers GT and AG also undergo splicing. Identification of such splice junctions, called non-canonical splice junctions, is also essential for a comprehensive understanding of the splicing phenomenon. This work focuses on the identification of non-canonical splice junctions through the application of a bidirectional long short-term memory (BLSTM) network. Furthermore, we apply a back-propagation-based (integrated gradient) and a perturbation-based (occlusion) visualization technique to extract the non-canonical splicing features learned by the model. The features obtained are validated with the existing knowledge from the literature. Integrated gradient extracts features that comprise contiguous nucleotides, whereas occlusion extracts features that are individual nucleotides distributed across the sequence.
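The perturbation-based (occlusion) idea mentioned in this abstract can be sketched in a few lines: mask one nucleotide at a time and record how much the model score drops. This is only an illustrative sketch; `occlusion_scores`, the mask symbol `N` and the `toy_score` function are assumptions introduced here, and `toy_score` is a hand-written stand-in, not the BLSTM model described by the authors.

```python
def occlusion_scores(seq, score_fn, mask="N"):
    """Per-position importance: score drop when that position is masked."""
    base = score_fn(seq)
    return [
        base - score_fn(seq[:i] + mask + seq[i + 1:])
        for i in range(len(seq))
    ]


def toy_score(seq):
    # Hypothetical stand-in for a trained classifier: returns 1.0 only
    # when the dimer "GT" sits at 0-based positions 3-4 of the window.
    return 1.0 if seq[3:5] == "GT" else 0.0


seq = "CAGGTAAG"
importance = occlusion_scores(seq, toy_score)
for pos, (base, imp) in enumerate(zip(seq, importance)):
    print(pos, base, imp)
```

Because each position is perturbed independently, this kind of attribution naturally highlights individual nucleotides, which is consistent with the contrast the abstract draws between occlusion and integrated gradients.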
Affiliation(s)
- Aparajita Dutta
- Department of CSE, Indian Institute of Technology, Guwahati, India
- Ashish Anand
- Department of CSE, Indian Institute of Technology, Guwahati, India
|
44
|
MET Exon 14 Skipping: A Case Study for the Detection of Genetic Variants in Cancer Driver Genes by Deep Learning. Int J Mol Sci 2021; 22:ijms22084217. [PMID: 33921709 PMCID: PMC8072630 DOI: 10.3390/ijms22084217] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 03/25/2021] [Revised: 04/13/2021] [Accepted: 04/17/2021] [Indexed: 11/17/2022] Open
Abstract
BACKGROUND: Disruption of alternative splicing (AS) is frequently observed in cancer and might represent an important signature for tumor progression and therapy. Exon skipping (ES) represents one of the most frequent AS events, and in non-small cell lung cancer (NSCLC) MET exon 14 skipping was shown to be targetable. METHODS: We constructed neural networks (NN/CNN) specifically designed to detect MET exon 14 skipping events using RNAseq data. Furthermore, for discovery purposes we also developed a sparsely connected autoencoder to identify uncharacterized MET isoforms. RESULTS: The neural networks had a MET exon 14 skipping detection rate greater than 94% when tested on a manually curated set of 690 TCGA bronchus and lung samples. When globally applied to 2605 TCGA samples, we observed that the majority of false positives was characterized by a blurry coverage of exon 14, but interestingly they share a common coverage peak in the second intron and we speculate that this event could be the transcription signature of a LINE1 (Long Interspersed Nuclear Element 1)-MET (Mesenchymal Epithelial Transition receptor tyrosine kinase) fusion. CONCLUSIONS: Taken together, our results indicate that neural networks can be an effective tool to provide a quick classification of pathological transcription events, and sparsely connected autoencoders could represent the basis for the development of an effective discovery tool.
|
45
|
Clauwaert J, Menschaert G, Waegeman W. Explainability in transformer models for functional genomics. Brief Bioinform 2021; 22:6214646. [PMID: 33834200 PMCID: PMC8425421 DOI: 10.1093/bib/bbab060] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 10/28/2020] [Revised: 01/28/2021] [Accepted: 02/05/2021] [Indexed: 11/16/2022] Open
Abstract
The effectiveness of deep learning methods can be largely attributed to the automated extraction of relevant features from raw data. In the field of functional genomics, this generally concerns the automatic selection of relevant nucleotide motifs from DNA sequences. To benefit from automated learning methods, new strategies are required that unveil the decision-making process of trained models. In this paper, we present a new approach that has been successful in gathering insights on the transcription process in Escherichia coli. This work builds upon a transformer-based neural network framework designed for prokaryotic genome annotation purposes. We find that the majority of subunits (attention heads) of the model are specialized towards identifying transcription factors and are able to successfully characterize both their binding sites and consensus sequences, uncovering both well-known and potentially novel elements involved in the initiation of the transcription process. With the specialization of the attention heads occurring automatically, we believe transformer models to be of high interest towards the creation of explainable neural networks in this field.
Affiliation(s)
- Jim Clauwaert
- Department of Data Analysis and Mathematical Modelling, Ghent University, Coupure Links 653, 9000 Gent, Belgium
- Gerben Menschaert
- Department of Data Analysis and Mathematical Modelling, Ghent University, Coupure Links 653, 9000 Gent, Belgium
- Willem Waegeman
- Department of Data Analysis and Mathematical Modelling, Ghent University, Coupure Links 653, 9000 Gent, Belgium
|
46
|
Kong L, Chen Y, Xu F, Xu M, Li Z, Fang J, Zhang L, Pian C. Mining influential genes based on deep learning. BMC Bioinformatics 2021; 22:27. [PMID: 33482718 PMCID: PMC7821411 DOI: 10.1186/s12859-021-03972-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 07/29/2020] [Accepted: 01/15/2021] [Indexed: 11/17/2022] Open
Abstract
Background: Currently, large-scale gene expression profiling has been successfully applied to the discovery of functional connections among diseases, genetic perturbation, and drug action. To address the cost of an ever-expanding gene expression profile, a new, low-cost, high-throughput reduced representation expression profiling method called L1000 was proposed, with which one million profiles were produced. Although a set of ~1000 carefully chosen landmark genes that can capture ~80% of information from the whole genome has been identified for use in L1000, the robustness of using these landmark genes to infer target genes is not satisfactory. Therefore, more efficient computational methods are still needed to deep mine the influential genes in the genome. Results: Here, we propose a computational framework based on deep learning to mine a subset of genes that can cover more genomic information. Specifically, an AutoEncoder framework is first constructed to learn the non-linear relationship between genes, and then DeepLIFT is applied to calculate gene importance scores. Using this data-driven approach, we have re-obtained a landmark gene set. The result shows that our landmark genes can predict target genes more accurately and robustly than those of L1000 based on two metrics [mean absolute error (MAE) and Pearson correlation coefficient (PCC)]. This reveals that the landmark genes detected by our method contain more genomic information. Conclusions: We believe that our proposed framework is very suitable for the analysis of biological big data to reveal the mysteries of life. Furthermore, the landmark genes inferred from this study can be used for the explosive amplification of gene expression profiles to facilitate research into functional connections.
Affiliation(s)
- Lingpeng Kong
- College of Agriculture, Nanjing Agricultural University, Jiangsu, 210095, Nanjing, China
- Yuanyuan Chen
- Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, 210095, China
- Fengjiao Xu
- Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, 210095, China
- Mingmin Xu
- College of Agriculture, Nanjing Agricultural University, Jiangsu, 210095, Nanjing, China
- Zutan Li
- College of Agriculture, Nanjing Agricultural University, Jiangsu, 210095, Nanjing, China
- Jingya Fang
- College of Agriculture, Nanjing Agricultural University, Jiangsu, 210095, Nanjing, China
- Liangyun Zhang
- Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, 210095, China
- Cong Pian
- Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, 210095, China
|
47
|
Wei C, Zhang J, Yuan X, He Z, Liu G, Wu J. NeuroTIS: Enhancing the prediction of translation initiation sites in mRNA sequences via a hybrid dependency network and deep learning framework. Knowl Based Syst 2021. [DOI: 10.1016/j.knosys.2020.106459] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 12/20/2022]
|
48
|
Nonsense-associated altered splicing of MAP3K1 in two siblings with 46,XY disorders of sex development. Sci Rep 2020; 10:17375. [PMID: 33060765 PMCID: PMC7567082 DOI: 10.1038/s41598-020-74405-1] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 06/10/2020] [Accepted: 09/29/2020] [Indexed: 01/31/2023] Open
Abstract
Although splicing errors due to single nucleotide variants represent a common cause of monogenic disorders, only a few variants have been shown to create new splice sites in exons. Here, we report an MAP3K1 splice variant identified in two siblings with 46,XY disorder of sex development. The patients carried a maternally derived c.2254C>T variant. The variant was initially recognized as a nonsense substitution leading to nonsense-mediated mRNA decay (p.Gln752Ter); however, RT-PCR for lymphoblastoid cell lines showed that this variant created a new splice donor site and caused 39 amino acid deletion (p.Gln752_Arg790del). All transcripts from the variant allele appeared to undergo altered splicing. The two patients exhibited undermasculinized genitalia with and without hypergonadotropism. Testosterone enanthate injections and dihydrotestosterone ointment applications yielded only slight increase in their penile length. Dihydrotestosterone-induced APOD transactivation was less significant in patients’ genital skin fibroblasts compared with that in control samples. This study provides an example of nonsense-associated altered splicing, in which a highly potent exonic splice site was created. Furthermore, our data, in conjunction with the previous data indicating the association between MAP3K1 and androgen receptor signaling, imply that the combination of testicular dysgenesis and androgen insensitivity may be a unique phenotype of MAP3K1 abnormalities.
|
49
|
Thanapattheerakul T, Engchuan W, Chan JH. Predicting the effect of variants on splicing using Convolutional Neural Networks. PeerJ 2020; 8:e9470. [PMID: 32704450 PMCID: PMC7346860 DOI: 10.7717/peerj.9470] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 11/25/2019] [Accepted: 06/11/2020] [Indexed: 11/23/2022] Open
Abstract
Mutations that cause an error in the splicing of a messenger RNA (mRNA) can lead to diseases in humans. Various computational models have been developed to recognize the sequence pattern of the splice sites. In recent studies, Convolutional Neural Network (CNN) architectures were shown to outperform other existing models in predicting the splice sites. However, insufficient effort has been put into extending the CNN model to predict the effect of the genomic variants on the splicing of mRNAs. This study proposes a framework to elaborate on the utility of CNNs to assess the effect of splice variants on the identification of potential disease-causing variants that disrupt the RNA splicing process. Five models, including three CNN-based and two non-CNN machine learning based, were trained and compared using two existing splice site datasets, Genome Wide Human splice sites (GWH) and a dataset provided at the Deep Learning and Artificial Intelligence winter school 2018 (DLAI). The donor sites were also used to test on the HSplice tool to evaluate the predictive models. To improve the effectiveness of predictive models, the two datasets were combined. The CNN model with four convolutional layers showed the best splice site prediction performance with an AUPRC of 93.4% and 88.8% for donor and acceptor sites, respectively. The effects of variants on splicing were estimated by applying the best model on variant data from the ClinVar database. Based on the estimation, the framework could effectively differentiate pathogenic variants from the benign variants (p = 5.9 × 10⁻⁷). These promising results support that the proposed framework could be applied in future genetic studies to identify disease-causing loci involving the splicing mechanism. The datasets and Python scripts used in this study are available on the GitHub repository at https://github.com/smiile8888/rna-splice-sites-recognition.
Affiliation(s)
- Worrawat Engchuan
  - Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada
  - The Centre for Applied Genomics, The Hospital for Sick Children, Toronto, Ontario, Canada
- Jonathan H Chan
  - School of Information Technology, King Mongkut's University of Technology Thonburi, Bangkok, Thailand
  - IC2-DLab, School of Information Technology, King Mongkut's University of Technology Thonburi, Bangkok, Thailand
|
50
|
Payrovnaziri SN, Chen Z, Rengifo-Moreno P, Miller T, Bian J, Chen JH, Liu X, He Z. Explainable artificial intelligence models using real-world electronic health record data: a systematic scoping review. J Am Med Inform Assoc 2020; 27:1173-1185. [PMID: 32417928 PMCID: PMC7647281 DOI: 10.1093/jamia/ocaa053] [Citation(s) in RCA: 87] [Impact Index Per Article: 21.8] [Received: 01/22/2020] [Revised: 04/01/2020] [Accepted: 04/07/2020] [Indexed: 01/08/2023] Open
Abstract
OBJECTIVE: To conduct a systematic scoping review of explainable artificial intelligence (XAI) models that use real-world electronic health record data, categorize these techniques according to different biomedical applications, identify gaps of current studies, and suggest future research directions. MATERIALS AND METHODS: We searched MEDLINE, IEEE Xplore, and the Association for Computing Machinery (ACM) Digital Library to identify relevant papers published between January 1, 2009 and May 1, 2019. We summarized these studies based on the year of publication, prediction tasks, machine learning algorithm, dataset(s) used to build the models, the scope, category, and evaluation of the XAI methods. We further assessed the reproducibility of the studies in terms of the availability of data and code and discussed open issues and challenges. RESULTS: Forty-two articles were included in this review. We reported the research trend and most-studied diseases. We grouped XAI methods into 5 categories: knowledge distillation and rule extraction (N = 13), intrinsically interpretable models (N = 9), data dimensionality reduction (N = 8), attention mechanism (N = 7), and feature interaction and importance (N = 5). DISCUSSION: XAI evaluation is an open issue that requires a deeper focus in the case of medical applications. We also discuss the importance of reproducibility of research work in this field, as well as the challenges and opportunities of XAI from 2 medical professionals' point of view. CONCLUSION: Based on our review, we found that XAI evaluation in medicine has not been adequately and formally practiced. Reproducibility remains a critical concern. Ample opportunities exist to advance XAI research in medicine.
Affiliation(s)
- Zhaoyi Chen
  - Department of Health Outcomes and Biomedical Informatics, University of Florida, Gainesville, Florida, USA
- Pablo Rengifo-Moreno
  - College of Medicine, Florida State University, Tallahassee, Florida, USA
  - Tallahassee Memorial Hospital, Tallahassee, Florida, USA
- Tim Miller
  - School of Computing and Information Systems, The University of Melbourne, Melbourne, Victoria, Australia
- Jiang Bian
  - Department of Health Outcomes and Biomedical Informatics, University of Florida, Gainesville, Florida, USA
- Jonathan H Chen
  - Center for Biomedical Informatics Research, Department of Medicine, Stanford University, Stanford, California, USA
  - Division of Hospital Medicine, Department of Medicine, Stanford University, Stanford, California, USA
- Xiuwen Liu
  - Department of Computer Science, Florida State University, Tallahassee, Florida, USA
- Zhe He
  - School of Information, Florida State University, Tallahassee, Florida, USA
|