1. Ng AYE, Chan SN, Pek JW. Genetic compensation between ribosomal protein paralogs mediated by a cognate circular RNA. Cell Rep 2024; 43:114228. [PMID: 38735045 DOI: 10.1016/j.celrep.2024.114228] [Received: 03/01/2024] [Revised: 04/19/2024] [Accepted: 04/26/2024] [Indexed: 05/14/2024] Open
Abstract
Inter-regulation between related genes, such as ribosomal protein (RP) paralogs, has been observed to be important for genetic compensation and paralog-specific functions. However, how paralogs communicate to modulate their expression levels is unknown. Here, we report a circular RNA involved in the inter-regulation between RP paralogs RpL22 and RpL22-like during Drosophila spermatogenesis. Both paralogs are mutually regulated by the circular stable intronic sequence RNA (sisRNA) circRpL22(NE,3S) produced from the RpL22 locus. RpL22 represses itself and RpL22-like. Interestingly, circRpL22 binds to RpL22 to repress RpL22-like, but not RpL22, suggesting that circRpL22 modulates RpL22's function. circRpL22 is in turn controlled by RpL22-like, which regulates RpL22 binding to circRpL22 to indirectly modulate RpL22. This circRpL22-centric inter-regulatory circuit enables the loss of RpL22-like to be genetically compensated by RpL22 upregulation to ensure robust male germline development. Thus, our study identifies sisRNA as a possible mechanism of genetic crosstalk between paralogous genes.
Affiliation(s)
- Amanda Yunn Ee Ng
  - Temasek Life Sciences Laboratory, 1 Research Link, National University of Singapore, Singapore 117604, Singapore; Department of Biological Sciences, National University of Singapore, 14 Science Drive, Singapore 117543, Singapore
- Seow Neng Chan
  - Temasek Life Sciences Laboratory, 1 Research Link, National University of Singapore, Singapore 117604, Singapore
- Jun Wei Pek
  - Temasek Life Sciences Laboratory, 1 Research Link, National University of Singapore, Singapore 117604, Singapore; Department of Biological Sciences, National University of Singapore, 14 Science Drive, Singapore 117543, Singapore
2. Pulli K, Saarimäki-Vire J, Ahonen P, Liu X, Ibrahim H, Chandra V, Santambrogio A, Wang Y, Vaaralahti K, Iivonen AP, Känsäkoski J, Tommiska J, Kemkem Y, Varjosalo M, Vuoristo S, Andoniadou CL, Otonkoski T, Raivio T. A splice site variant in MADD affects hormone expression in pancreatic β cells and pituitary gonadotropes. JCI Insight 2024; 9:e167598. [PMID: 38775154 PMCID: PMC11141940 DOI: 10.1172/jci.insight.167598] [Received: 06/27/2023] [Accepted: 04/12/2024] [Indexed: 06/02/2024] Open
Abstract
MAPK activating death domain (MADD) is a multifunctional protein regulating small GTPases RAB3 and RAB27, MAPK signaling, and cell survival. Polymorphisms in the MADD locus are associated with glycemic traits, but patients with biallelic variants in MADD manifest a complex syndrome affecting nervous, endocrine, exocrine, and hematological systems. We identified a homozygous splice site variant in MADD in 2 siblings with developmental delay, diabetes, congenital hypogonadotropic hypogonadism, and growth hormone deficiency. This variant led to skipping of exon 30 and in-frame deletion of 36 amino acids. To elucidate how this mutation causes pleiotropic endocrine phenotypes, we generated relevant cellular models with deletion of MADD exon 30 (dex30). We observed reduced numbers of β cells, decreased insulin content, and increased proinsulin-to-insulin ratio in dex30 human embryonic stem cell-derived pancreatic islets. Concordantly, dex30 led to decreased insulin expression in human β cell line EndoC-βH1. Furthermore, dex30 resulted in decreased luteinizing hormone expression in mouse pituitary gonadotrope cell line LβT2 but did not affect ontogeny of stem cell-derived GnRH neurons. Protein-protein interactions of wild-type and dex30 MADD revealed changes affecting multiple signaling pathways, while the GDP/GTP exchange activity of dex30 MADD remained intact. Our results suggest MADD-specific processes regulate hormone expression in pancreatic β cells and pituitary gonadotropes.
Affiliation(s)
- Kristiina Pulli
  - Stem Cells and Metabolism Research Program (STEMM), Research Programs Unit, Faculty of Medicine
- Jonna Saarimäki-Vire
  - Stem Cells and Metabolism Research Program (STEMM), Research Programs Unit, Faculty of Medicine
- Pekka Ahonen
  - Stem Cells and Metabolism Research Program (STEMM), Research Programs Unit, Faculty of Medicine
- Xiaonan Liu
  - Institute of Biotechnology, Helsinki Institute of Life Science (HiLIFE), University of Helsinki, Helsinki, Finland
- Hazem Ibrahim
  - Stem Cells and Metabolism Research Program (STEMM), Research Programs Unit, Faculty of Medicine
- Vikash Chandra
  - Stem Cells and Metabolism Research Program (STEMM), Research Programs Unit, Faculty of Medicine
- Alice Santambrogio
  - Centre for Craniofacial and Regenerative Biology, King’s College London, London, United Kingdom
  - Department of Medicine III, University Hospital Carl Gustav Carus, Technische Universität Dresden, Dresden, Germany
- Yafei Wang
  - Stem Cells and Metabolism Research Program (STEMM), Research Programs Unit, Faculty of Medicine
- Kirsi Vaaralahti
  - Stem Cells and Metabolism Research Program (STEMM), Research Programs Unit, Faculty of Medicine
- Anna-Pauliina Iivonen
  - Stem Cells and Metabolism Research Program (STEMM), Research Programs Unit, Faculty of Medicine
- Johanna Känsäkoski
  - Stem Cells and Metabolism Research Program (STEMM), Research Programs Unit, Faculty of Medicine
  - Department of Physiology, Faculty of Medicine
- Johanna Tommiska
  - Stem Cells and Metabolism Research Program (STEMM), Research Programs Unit, Faculty of Medicine
  - Department of Physiology, Faculty of Medicine
- Yasmine Kemkem
  - Centre for Craniofacial and Regenerative Biology, King’s College London, London, United Kingdom
- Markku Varjosalo
  - Institute of Biotechnology, Helsinki Institute of Life Science (HiLIFE), University of Helsinki, Helsinki, Finland
- Sanna Vuoristo
  - Stem Cells and Metabolism Research Program (STEMM), Research Programs Unit, Faculty of Medicine
  - Department of Obstetrics and Gynecology
  - HiLIFE, University of Helsinki, Helsinki, Finland
- Cynthia L. Andoniadou
  - Centre for Craniofacial and Regenerative Biology, King’s College London, London, United Kingdom
  - Department of Medicine III, University Hospital Carl Gustav Carus, Technische Universität Dresden, Dresden, Germany
- Timo Otonkoski
  - Stem Cells and Metabolism Research Program (STEMM), Research Programs Unit, Faculty of Medicine
  - New Children’s Hospital, Helsinki University Hospital, Pediatric Research Center, Helsinki, Finland
- Taneli Raivio
  - Stem Cells and Metabolism Research Program (STEMM), Research Programs Unit, Faculty of Medicine
  - Department of Physiology, Faculty of Medicine
  - New Children’s Hospital, Helsinki University Hospital, Pediatric Research Center, Helsinki, Finland
3. Lee H, Ozbulak U, Park H, Depuydt S, De Neve W, Vankerschaver J. Assessing the reliability of point mutation as data augmentation for deep learning with genomic data. BMC Bioinformatics 2024; 25:170. [PMID: 38689247 PMCID: PMC11059627 DOI: 10.1186/s12859-024-05787-6] [Received: 01/03/2024] [Accepted: 04/15/2024] [Indexed: 05/02/2024] Open
Abstract
BACKGROUND: Deep neural networks (DNNs) have the potential to revolutionize our understanding and treatment of genetic diseases. An inherent limitation of deep neural networks, however, is their high demand for data during training. To overcome this challenge, other fields, such as computer vision, use various data augmentation techniques to artificially increase the available training data for DNNs. Unfortunately, most data augmentation techniques used in other domains do not transfer well to genomic data.
RESULTS: Most genomic data possesses peculiar properties and data augmentations may significantly alter the intrinsic properties of the data. In this work, we propose a novel data augmentation technique for genomic data inspired by biology: point mutations. By employing point mutations as substitutes for codons, we demonstrate that our newly proposed data augmentation technique enhances the performance of DNNs across various genomic tasks that involve coding regions, such as translation initiation and splice site detection.
CONCLUSION: Silent and missense mutations are found to positively influence effectiveness, while nonsense mutations and random mutations in non-coding regions generally lead to degradation. Overall, point mutation-based augmentations in genomic datasets present valuable opportunities for improving the accuracy and reliability of predictive models for DNA sequences.
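The codon-level point mutations described in this abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`classify`, `point_mutate`, `translate`), the per-codon mutation `rate`, and the choice to leave stop codons untouched are all assumptions made for illustration. Only the standard genetic code table is taken as fact.

```python
import random
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify(old_codon, new_codon):
    """Label a codon change as 'silent', 'missense', or 'nonsense'."""
    old_aa, new_aa = CODON_TABLE[old_codon], CODON_TABLE[new_codon]
    if old_aa == new_aa:
        return "silent"
    return "nonsense" if new_aa == "*" else "missense"


def translate(cds):
    """Translate complete in-frame codons; unknown codons become 'X'."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


def point_mutate(cds, rate, allowed=("silent", "missense"), rng=None):
    """Return a copy of `cds` with single-base changes in some codons.

    Hypothetical helper: each sense codon is considered with probability
    `rate`; a proposed change is kept only if its class is in `allowed`.
    """
    rng = rng or random.Random()
    usable = len(cds) - len(cds) % 3
    out = []
    for i in range(0, usable, 3):
        codon = cds[i:i + 3]
        is_sense = CODON_TABLE.get(codon, "*") != "*"  # skip stops and unknowns
        if is_sense and rng.random() < rate:
            pos = rng.randrange(3)
            base = rng.choice([b for b in BASES if b != codon[pos]])
            candidate = codon[:pos] + base + codon[pos + 1:]
            if classify(codon, candidate) in allowed:
                codon = candidate
        out.append(codon)
    return "".join(out) + cds[usable:]  # keep any trailing partial codon
```

Restricting `allowed` to `("silent",)` yields augmented sequences that encode the same protein, which makes the silent/missense/nonsense distinction drawn in the abstract easy to test in isolation.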
Affiliation(s)
- Utku Ozbulak
  - Center for Biosystems and Biotech Data Science, Ghent University Global Campus, Incheon, South Korea
- Homin Park
  - Center for Biosystems and Biotech Data Science, Ghent University Global Campus, Incheon, South Korea
  - IDLab, Department of Electronics and Information Systems, Ghent University, Ghent, Belgium
- Stephen Depuydt
  - Erasmus Brussels University of Applied Sciences and Arts, Brussels, Belgium
- Wesley De Neve
  - Center for Biosystems and Biotech Data Science, Ghent University Global Campus, Incheon, South Korea
  - IDLab, Department of Electronics and Information Systems, Ghent University, Ghent, Belgium
- Joris Vankerschaver
  - Center for Biosystems and Biotech Data Science, Ghent University Global Campus, Incheon, South Korea
  - Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Ghent, Belgium
4. Chen K, Zhou Y, Ding M, Wang Y, Ren Z, Yang Y. Self-supervised learning on millions of primary RNA sequences from 72 vertebrates improves sequence-based RNA splicing prediction. Brief Bioinform 2024; 25:bbae163. [PMID: 38605640 PMCID: PMC11009468 DOI: 10.1093/bib/bbae163] [Received: 01/04/2024] [Revised: 02/22/2024] [Accepted: 03/19/2024] [Indexed: 04/13/2024] Open
Abstract
Language models pretrained by self-supervised learning (SSL) have been widely utilized to study protein sequences, while few models were developed for genomic sequences and were limited to single species. Due to the lack of genomes from different species, these models cannot effectively leverage evolutionary information. In this study, we have developed SpliceBERT, a language model pretrained on primary ribonucleic acid (RNA) sequences from 72 vertebrates by masked language modeling, and applied it to sequence-based modeling of RNA splicing. Pretraining SpliceBERT on diverse species enables effective identification of evolutionarily conserved elements. Meanwhile, the learned hidden states and attention weights can characterize the biological properties of splice sites. As a result, SpliceBERT was shown to be effective on several downstream tasks: zero-shot prediction of variant effects on splicing, prediction of branchpoints in humans, and cross-species prediction of splice sites. Our study highlighted the importance of pretraining genomic language models on a diverse range of species and suggested that SSL is a promising approach to enhance our understanding of the regulatory logic underlying genomic sequences.
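Masked language modeling, the pretraining objective named in this abstract, can be sketched in a few lines of Python. The sketch below is a generic illustration, not SpliceBERT's code: the single-nucleotide tokens, the `[MASK]` symbol, the 15% masking probability (the convention from the original BERT recipe), and the helper name `mask_tokens` are all assumptions.

```python
import random

MASK = "[MASK]"  # assumed mask symbol, following the BERT convention


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Hide a random subset of tokens and record what the model must recover.

    Returns (inputs, labels): `inputs` has masked positions replaced by MASK,
    and `labels` holds the original token at masked positions, None elsewhere.
    """
    rng = rng or random.Random()
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)    # the training loss is computed only here
        else:
            inputs.append(tok)
            labels.append(None)   # unmasked positions are ignored by the loss
    return inputs, labels


# Usage: one token per nucleotide of a short, made-up sequence.
inputs, labels = mask_tokens(list("ACGTTGCAAGGTAAGT"), rng=random.Random(1))
```

A model trained to predict the hidden nucleotides from their context needs no annotation, which is what allows pretraining on unlabeled sequences from many species.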
Affiliation(s)
- Ken Chen
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
- Yue Zhou
  - Peng Cheng Laboratory, Shenzhen, China
- Maolin Ding
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
- Yu Wang
  - Peng Cheng Laboratory, Shenzhen, China
- Yuedong Yang
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
  - Key Laboratory of Machine Intelligence and Advanced Computing (Sun Yat-sen University), Ministry of Education, China
5. Liu X, Zhang H, Zeng Y, Zhu X, Zhu L, Fu J. DRANetSplicer: A Splice Site Prediction Model Based on Deep Residual Attention Networks. Genes (Basel) 2024; 15:404. [PMID: 38674339 PMCID: PMC11048956 DOI: 10.3390/genes15040404] [Received: 02/20/2024] [Revised: 03/20/2024] [Accepted: 03/23/2024] [Indexed: 04/28/2024] Open
Abstract
The precise identification of splice sites is essential for unraveling the structure and function of genes, constituting a pivotal step in the gene annotation process. In this study, we developed a novel deep learning model, DRANetSplicer, that integrates residual learning and attention mechanisms for enhanced accuracy in capturing the intricate features of splice sites. We constructed multiple datasets using the most recent versions of genomic data from three different organisms, Oryza sativa japonica, Arabidopsis thaliana and Homo sapiens. This approach allows us to train models with a richer set of high-quality data. DRANetSplicer outperformed benchmark methods on donor and acceptor splice site datasets, achieving an average accuracy of (96.57%, 95.82%) across the three organisms. Comparative analyses with benchmark methods, including SpliceFinder, Splice2Deep, Deep Splicer, EnsembleSplice, and DNABERT, revealed DRANetSplicer's superior predictive performance, resulting in at least a (4.2%, 11.6%) relative reduction in average error rate. We utilized the DRANetSplicer model trained on O. sativa japonica data to predict splice sites in A. thaliana, achieving accuracies for donor and acceptor sites of (94.89%, 94.25%). These results indicate that DRANetSplicer possesses excellent cross-organism predictive capabilities, with its performance in cross-organism predictions even surpassing that of benchmark methods in non-cross-organism predictions. Cross-organism validation showcased DRANetSplicer's excellence in predicting splice sites across similar organisms, supporting its applicability in gene annotation for understudied organisms. We employed multiple methods to visualize the decision-making process of the model. The visualization results indicate that DRANetSplicer can learn and interpret well-known biological features, further validating its overall performance. 
Our study systematically examined and confirmed the predictive ability of DRANetSplicer from various levels and perspectives, indicating that its practical application in gene annotation is justified.
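The "relative reduction in average error rate" quoted in this abstract is simple arithmetic on accuracies, and a short Python sketch makes the relationship explicit. The helper names are hypothetical, and the calculation assumes the reported reductions (4.2%, 11.6%) were computed from the same average accuracies (96.57%, 95.82%) given in the abstract, which the abstract suggests but does not state outright.

```python
def error_rate(accuracy_pct):
    """Error rate in percent, given accuracy in percent."""
    return 100.0 - accuracy_pct


def relative_error_reduction(model_acc, baseline_acc):
    """Fraction by which the model's error is lower than the baseline's."""
    e_model, e_base = error_rate(model_acc), error_rate(baseline_acc)
    return (e_base - e_model) / e_base


def max_baseline_accuracy(model_acc, min_rel_reduction):
    """Highest baseline accuracy consistent with a minimum relative reduction.

    Rearranges (e_base - e_model) / e_base >= r into
    e_base >= e_model / (1 - r).
    """
    return 100.0 - error_rate(model_acc) / (1.0 - min_rel_reduction)


# Accuracies and minimum reductions as quoted in the abstract.
donor_bound = max_baseline_accuracy(96.57, 0.042)
acceptor_bound = max_baseline_accuracy(95.82, 0.116)
```

Under that assumption, the best competing method would average no more than about 96.42% accuracy on donor sites and about 95.27% on acceptor sites, so a double-digit relative error reduction can correspond to an absolute accuracy gap of well under one percentage point.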
Affiliation(s)
- Xueyan Liu
  - College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China
- Hongyan Zhang
  - College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China
- Ying Zeng
  - School of Computer and Communication, Hunan Institute of Engineering, Xiangtan 411104, China
- Xinghui Zhu
  - College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China
- Lei Zhu
  - College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China
- Jiahui Fu
  - College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China
6. Früh SP, Früh MA, Kaufer BB, Göbel TW. Unraveling the chicken T cell repertoire with enhanced genome annotation. Front Immunol 2024; 15:1359169. [PMID: 38550579 PMCID: PMC10972964 DOI: 10.3389/fimmu.2024.1359169] [Received: 12/20/2023] [Accepted: 02/23/2024] [Indexed: 04/02/2024] Open
Abstract
T cell receptor (TCR) repertoire sequencing has emerged as a powerful tool for understanding the diversity and functionality of T cells within the host immune system. Yet, the chicken TCR repertoire remains poorly understood due to incomplete genome annotation of the TCR loci, despite the importance of chickens in agriculture and as an immunological model. Here, we addressed this critical issue by employing 5' rapid amplification of complementary DNA ends (5'RACE) TCR repertoire sequencing with molecular barcoding of complementary DNA (cDNA) molecules. Simultaneously, we enhanced the genome annotation of TCR Variable (V), Diversity (D, only present in β and δ loci) and Joining (J) genes in the chicken genome. To enhance the efficiency of TCR annotations, we developed VJ-gene-finder, an algorithm designed to extract VJ gene candidates from deoxyribonucleic acid (DNA) sequences. Using this tool, we achieved a comprehensive annotation of all known chicken TCR loci, including the α/δ locus on chromosome 27. Evolutionary analysis revealed that each locus evolved separately by duplication of long homology units. To define the baseline TCR diversity in healthy chickens and to demonstrate the feasibility of the approach, we characterized the splenic α/β/γ/δ TCR repertoire. Analysis of the repertoires revealed preferential usage of specific V and J combinations in all chains, while the overall features were characteristic of unbiased repertoires. We observed moderate levels of shared complementarity-determining region 3 (CDR3) clonotypes among individual birds within the α and γ chain repertoires, including the most frequently occurring clonotypes. However, the β and δ repertoires were predominantly unique to each bird. Taken together, our TCR repertoire analysis allowed us to decipher the composition, diversity, and functionality of T cells in chickens. 
This work not only represents a significant step towards understanding avian T cell biology, but will also shed light on host-pathogen interactions, vaccine development, and the evolutionary history of avian immunology.
Affiliation(s)
- Simon P. Früh
  - Department of Veterinary Sciences, Ludwig-Maximilians-Universität München, Munich, Germany
  - Institute of Virology, Freie Universität Berlin, Berlin, Germany
- Thomas W. Göbel
  - Department of Veterinary Sciences, Ludwig-Maximilians-Universität München, Munich, Germany
7. Bagger FO, Borgwardt L, Jespersen AS, Hansen AR, Bertelsen B, Kodama M, Nielsen FC. Whole genome sequencing in clinical practice. BMC Med Genomics 2024; 17:39. [PMID: 38287327 PMCID: PMC10823711 DOI: 10.1186/s12920-024-01795-w] [Received: 08/14/2023] [Accepted: 01/01/2024] [Indexed: 01/31/2024] Open
Abstract
Whole genome sequencing (WGS) is becoming the preferred method for molecular genetic diagnosis of rare and unknown diseases and for identification of actionable cancer drivers. Compared to other molecular genetic methods, WGS captures most genomic variation and eliminates the need for sequential genetic testing. Whereas the laboratory requirements are similar to those of conventional molecular genetics, the amount of data is large, and WGS requires a comprehensive computational and storage infrastructure in order to facilitate data processing within a clinically relevant timeframe. The output of a single WGS analysis is roughly 5 million variants, and data interpretation involves specialized staff collaborating with the clinical specialists in order to provide standard-of-care reports. Although the field is continuously refining the standards for variant classification, there are still unresolved issues associated with the clinical application. The review provides an overview of WGS in clinical practice - describing the technology and current applications as well as challenges connected with data processing, interpretation and clinical reporting.
Affiliation(s)
- Frederik Otzen Bagger
  - Center for Genomic Medicine, Rigshospitalet, University of Copenhagen, Copenhagen, Denmark
- Line Borgwardt
  - Center for Genomic Medicine, Rigshospitalet, University of Copenhagen, Copenhagen, Denmark
- Andreas Sand Jespersen
  - Center for Genomic Medicine, Rigshospitalet, University of Copenhagen, Copenhagen, Denmark
- Anna Reimer Hansen
  - Center for Genomic Medicine, Rigshospitalet, University of Copenhagen, Copenhagen, Denmark
- Birgitte Bertelsen
  - Center for Genomic Medicine, Rigshospitalet, University of Copenhagen, Copenhagen, Denmark
- Miyako Kodama
  - Center for Genomic Medicine, Rigshospitalet, University of Copenhagen, Copenhagen, Denmark
- Finn Cilius Nielsen
  - Center for Genomic Medicine, Rigshospitalet, University of Copenhagen, Copenhagen, Denmark
8. Samanta A, George N, Arnaoutova I, Chen HD, Mansfield BC, Hart C, Carlo T, Chou JY. CRISPR/Cas9-based double-strand oligonucleotide insertion strategy corrects metabolic abnormalities in murine glycogen storage disease type-Ia. J Inherit Metab Dis 2023; 46:1147-1158. [PMID: 37467014 PMCID: PMC10796839 DOI: 10.1002/jimd.12660] [Received: 03/11/2023] [Revised: 06/23/2023] [Accepted: 07/17/2023] [Indexed: 07/20/2023]
Abstract
Glycogen storage disease type-Ia (GSD-Ia), characterized by impaired blood glucose homeostasis, is caused by a deficiency in glucose-6-phosphatase-α (G6Pase-α or G6PC). Using the G6pc-R83C mouse model of GSD-Ia, we explored a CRISPR/Cas9-based double-strand DNA oligonucleotide (dsODN) insertional strategy that uses the nonhomologous end-joining repair mechanism to correct the pathogenic p.R83C variant in G6pc exon-2. The strategy is based on the insertion of a short dsODN into G6pc exon-2 to disrupt the native exon and to introduce an additional splice acceptor site and the correcting sequence. When transcribed and spliced, the edited gene would generate a wild-type mRNA encoding the native G6Pase-α protein. The editing reagents formulated in lipid nanoparticles (LNPs) were delivered to the liver. Mice were treated either with one dose of LNP-dsODN at age 4 weeks or with two doses of LNP-dsODN at ages 2 and 4 weeks. The successfully edited G6pc-R83C mice expressed ~4% of normal hepatic G6Pase-α activity, maintained glucose homeostasis, lacked hypoglycemic seizures, and displayed a normalized blood metabolite profile. The outcomes are consistent with preclinical studies supporting previous gene augmentation therapy, which is currently in clinical trials. This editing strategy may offer the basis for a therapeutic approach with an earlier clinical intervention than gene augmentation, with the additional benefit of a potentially permanent correction of the GSD-Ia phenotype.
Affiliation(s)
- Ananya Samanta
  - Section on Cellular Differentiation, Division of Translational Medicine, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, MD 20892, USA
- Nelson George
  - Section on Cellular Differentiation, Division of Translational Medicine, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, MD 20892, USA
- Irina Arnaoutova
  - Section on Cellular Differentiation, Division of Translational Medicine, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, MD 20892, USA
- Hung-Dar Chen
  - Section on Cellular Differentiation, Division of Translational Medicine, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, MD 20892, USA
- Brian C. Mansfield
  - Section on Cellular Differentiation, Division of Translational Medicine, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, MD 20892, USA
- Christopher Hart
  - Current affiliation: Prime Medicine Inc, Cambridge, MA 02139, USA
- Troy Carlo
  - Current affiliation: Prime Medicine Inc, Cambridge, MA 02139, USA
- Janice Y. Chou
  - Section on Cellular Differentiation, Division of Translational Medicine, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, MD 20892, USA
9. Sridhar A, More AS, Jadhav AR, Patil K, Mavlankar A, Dixit VM, Bapat SA. Pattern recognition in the landscape of seemingly random chimeric transcripts. Comput Struct Biotechnol J 2023; 21:5153-5164. [PMID: 37920814 PMCID: PMC10618115 DOI: 10.1016/j.csbj.2023.10.028] [Received: 05/16/2023] [Revised: 10/13/2023] [Accepted: 10/16/2023] [Indexed: 11/04/2023] Open
Abstract
The molecular and functional diversity generated by chimeric transcripts (CTs) that are derived from two genes is indicated to contribute to tumor cell survival. Several gaps remain. The present research is a systematic study of the spectrum of CTs identified in RNA sequencing datasets of 160 ovarian cancer samples in The Cancer Genome Atlas (TCGA) (https://portal.gdc.cancer.gov). Structural annotation revealed complexities emerging from chromosomal localization of partner genes, differential splicing and inclusion of regulatory, untranslated regions. Identification of phenotype-specific associations further resolved a dynamically modulated mesenchymal signature during transformation. On an evolutionary background, protein-coding CTs were indicated to be highly conserved, while non-coding CTs may have evolved more recently. We also realized that the current premise, which postulates that structural alterations or neighbouring gene readthrough generate CTs, is not valid in instances wherein the parental genes are genomically distanced. In addressing this lacuna, we identified the essentiality of gene proximities in 3D space, mediated by specific spatiotemporal arrangements, for the generation of CTs. All these features together suggest non-random mechanisms towards increasing the molecular diversity in a cell through chimera formation, either in parallel or with cross-talk with the indigenous regulatory network.
Affiliation(s)
- Aksheetha Sridhar
  - Open Health Systems Laboratory, 9601 Medical Centre Drive, Rockville, MD 20850, US
- Ankita S. More
  - National Centre for Cell Science, Savitribai Phule Pune University, Ganeshkhind, Pune 411 007, Maharashtra, India
- Amruta R. Jadhav
  - National Centre for Cell Science, Savitribai Phule Pune University, Ganeshkhind, Pune 411 007, Maharashtra, India
- Komal Patil
  - National Centre for Cell Science, Savitribai Phule Pune University, Ganeshkhind, Pune 411 007, Maharashtra, India
- Anuj Mavlankar
  - National Centre for Cell Science, Savitribai Phule Pune University, Ganeshkhind, Pune 411 007, Maharashtra, India
- Vaishnavi M. Dixit
  - National Centre for Cell Science, Savitribai Phule Pune University, Ganeshkhind, Pune 411 007, Maharashtra, India
- Sharmila A. Bapat
  - Open Health Systems Laboratory, 9601 Medical Centre Drive, Rockville, MD 20850, US
  - National Centre for Cell Science, Savitribai Phule Pune University, Ganeshkhind, Pune 411 007, Maharashtra, India
10. Liao SE, Sudarshan M, Regev O. Deciphering RNA splicing logic with interpretable machine learning. Proc Natl Acad Sci U S A 2023; 120:e2221165120. [PMID: 37796983 PMCID: PMC10576025 DOI: 10.1073/pnas.2221165120] [Received: 01/06/2023] [Accepted: 08/29/2023] [Indexed: 10/07/2023] Open
Abstract
Machine learning methods, particularly neural networks trained on large datasets, are transforming how scientists approach scientific discovery and experimental design. However, current state-of-the-art neural networks are limited by their uninterpretability: Despite their excellent accuracy, they cannot describe how they arrived at their predictions. Here, using an "interpretable-by-design" approach, we present a neural network model that provides insights into RNA splicing, a fundamental process in the transfer of genomic information into functional biochemical products. Although we designed our model to emphasize interpretability, its predictive accuracy is on par with state-of-the-art models. To demonstrate the model's interpretability, we introduce a visualization that, for any given exon, allows us to trace and quantify the entire decision process from input sequence to output splicing prediction. Importantly, the model revealed uncharacterized components of the splicing logic, which we experimentally validated. This study highlights how interpretable machine learning can advance scientific discovery.
Affiliation(s)
- Susan E. Liao
  - Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, NY 10012
- Mukund Sudarshan
  - Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, NY 10012
- Oded Regev
  - Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, NY 10012
11. Cook S, Hooser BN, Williams DC, Kortz G, Aleman M, Minor K, Koziol J, Friedenberg SG, Cullen JN, Shelton GD, Ekenstedt KJ. Canine models of Charcot-Marie-Tooth: MTMR2, MPZ, and SH3TC2 variants in golden retrievers with congenital hypomyelinating polyneuropathy. Neuromuscul Disord 2023; 33:677-691. [PMID: 37400349 PMCID: PMC10530471 DOI: 10.1016/j.nmd.2023.06.007] [Received: 03/23/2023] [Revised: 06/06/2023] [Accepted: 06/19/2023] [Indexed: 07/05/2023]
Abstract
Congenital hypomyelinating polyneuropathy (HPN) restricted to the peripheral nervous system was reported in 1989 in two Golden Retriever (GR) littermates. Recently, four additional cases of congenital HPN in young, unrelated GRs were diagnosed via neurological examination, electrodiagnostic evaluation, and peripheral nerve pathology. Whole-genome sequencing was performed on all four GRs, and variants from each dog were compared to variants found across >1,000 other dogs, all presumably unaffected by HPN. Likely causative variants were identified for each HPN-affected GR. Two cases shared a homozygous splice donor site variant in MTMR2, with a stop codon introduced within six codons following the inclusion of the intron. One case had a heterozygous MPZ isoleucine to threonine substitution. The last case had a homozygous SH3TC2 nonsense variant predicted to truncate approximately one-half of the protein. Haplotype analysis using 524 GRs established the novelty of the identified variants. Each variant occurs within genes that are associated with the human Charcot-Marie-Tooth (CMT) group of heterogeneous diseases, affecting the peripheral nervous system. Testing a large GR population (n > 200) did not identify any dogs with these variants. Although these variants are rare within the general GR population, breeders should be cautious to avoid propagating these alleles.
Affiliation(s)
- Shawna Cook
- Department of Basic Medical Sciences, College of Veterinary Medicine, Purdue University, West Lafayette, IN, USA.
| | - Blair N Hooser
- Department of Basic Medical Sciences, College of Veterinary Medicine, Purdue University, West Lafayette, IN, USA
| | - D Colette Williams
- The William R. Pritchard Veterinary Medical Teaching Hospital, University of California, Davis, Davis, CA, USA
| | - Gregg Kortz
- VCA Sacramento Veterinary Referral Center, Sacramento CA, USA
| | - Monica Aleman
- The William R. Pritchard Veterinary Medical Teaching Hospital, University of California, Davis, Davis, CA, USA
| | - Katie Minor
- Department of Veterinary Clinical Sciences, College of Veterinary Medicine, University of Minnesota, Saint Paul, MN, USA
| | - Jennifer Koziol
- School of Veterinary Medicine, Texas Tech University, Amarillo, TX, USA
| | - Steven G Friedenberg
- Department of Veterinary Clinical Sciences, College of Veterinary Medicine, University of Minnesota, Saint Paul, MN, USA
| | - Jonah N Cullen
- Department of Veterinary Clinical Sciences, College of Veterinary Medicine, University of Minnesota, Saint Paul, MN, USA
| | - G Diane Shelton
- Department of Pathology, School of Medicine, University of California San Diego, La Jolla, CA, USA
| | - Kari J Ekenstedt
- Department of Basic Medical Sciences, College of Veterinary Medicine, Purdue University, West Lafayette, IN, USA
12
McBeath E, Fujiwara K, Hofmann MC. Evidence-Based Guide to Using Artificial Introns for Tissue-Specific Knockout in Mice. Int J Mol Sci 2023; 24:10258. [PMID: 37373404 PMCID: PMC10299402 DOI: 10.3390/ijms241210258] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2023] [Revised: 06/09/2023] [Accepted: 06/10/2023] [Indexed: 06/29/2023] Open
Abstract
Until recently, methods for generating floxed mice either conventionally or by CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats)-Cas9 (CRISPR-associated protein 9) editing have been technically challenging, expensive and error-prone, or time-consuming. To circumvent these issues, several labs have started successfully using a small artificial intron to conditionally knock out (KO) a gene of interest in mice. However, many other labs are having difficulty getting the technique to work. The key problem appears to be either a failure in achieving correct splicing after the introduction of the artificial intron into the gene or, just as crucial, insufficient functional KO of the gene's protein after Cre-induced removal of the intron's branchpoint. Presented here is a guide on how to choose an appropriate exon and where to place the recombinase-regulated artificial intron (rAI) in that exon to prevent disrupting normal gene splicing while maximizing mRNA degradation after recombinase treatment. The reasoning behind each step in the guide is also discussed. Following these recommendations should increase the success rate of this easy, new, and alternative technique for producing tissue-specific KO mice.
Affiliation(s)
- Elena McBeath
- Department of Endocrine Neoplasia & Hormonal Disorders, MD Anderson Cancer Center, Houston, TX 77030, USA
| | - Keigi Fujiwara
- National Coalition of Independent Scholars, Brattleboro, VT 05301, USA
| | - Marie-Claude Hofmann
- Department of Endocrine Neoplasia & Hormonal Disorders, MD Anderson Cancer Center, Houston, TX 77030, USA
13
Scholl CL, Holmstrup M, Graham LA, Davies PL. Polyproline type II helical antifreeze proteins are widespread in Collembola and likely originated over 400 million years ago in the Ordovician Period. Sci Rep 2023; 13:8880. [PMID: 37264058 DOI: 10.1038/s41598-023-35983-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2023] [Accepted: 05/26/2023] [Indexed: 06/03/2023] Open
Abstract
Antifreeze proteins (AFPs) bind to ice crystals to prevent organisms from freezing. A diversity of AFP folds has been found in fish and insects, including alpha helices, globular proteins, and several different beta solenoids. But the variety of AFPs in flightless arthropods, like Collembola, has not yet been adequately assessed. Here, antifreeze activity was shown to be present in 18 of the 22 species of Collembola from cold or temperate zones. Several methods were used to characterize these AFPs, including isolation by ice affinity purification, MALDI mass spectrometry, amino acid composition analysis, tandem mass spectrometry sequencing, transcriptome sequencing, and bioinformatic investigations of sequence databases. All of these AFPs had a high glycine content and were predicted to have the same polyproline type II helical bundle fold, a fold unique to Collembola. These Hexapods arose in the Ordovician Period with the two orders known to produce AFPs diverging around 400 million years ago during the Andean-Saharan Ice Age. Therefore, it is likely that the AFP arose then and persisted in many lineages through the following two ice ages and intervening warm periods, unlike the AFPs of fish which arose independently during the Cenozoic Ice Age beginning ~ 30 million years ago.
Affiliation(s)
- Connor L Scholl
- Department of Biomedical and Molecular Sciences, Queen's University, 18 Stuart Street, Kingston, ON, K7L3N6, Canada
| | - Martin Holmstrup
- Section of Terrestrial Ecology, Department of Ecoscience, Aarhus University, C.F. Møllers Allé 4, 8000, Aarhus C, Denmark
- Arctic Research Center, Aarhus University, Ny Munkegade 114, 8000, Aarhus C, Denmark
| | - Laurie A Graham
- Department of Biomedical and Molecular Sciences, Queen's University, 18 Stuart Street, Kingston, ON, K7L3N6, Canada
| | - Peter L Davies
- Department of Biomedical and Molecular Sciences, Queen's University, 18 Stuart Street, Kingston, ON, K7L3N6, Canada.
14
Takeda A, Ueki M, Abe J, Maeta K, Horiguchi T, Yamazawa H, Izumi G, Chida-Nagai A, Sasaki D, Tsujioka T, Sato I, Shiraishi M, Matsuo M. A case of infantile Barth syndrome with severe heart failure: Importance of splicing variants in the TAZ gene. Mol Genet Genomic Med 2023:e2190. [PMID: 37186429 DOI: 10.1002/mgg3.2190] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/09/2022] [Revised: 03/30/2023] [Accepted: 04/11/2023] [Indexed: 05/17/2023] Open
Abstract
Barth syndrome (BTHS) is an X-linked disorder characterized by cardiomyopathy, skeletal myopathy, and 3-methylglutaconic aciduria. The causative pathogenic variants for BTHS are in TAZ, which encodes a putative acyltransferase named tafazzin and is involved in the remodeling of cardiolipin in the inner mitochondrial membranes. Pathogenic variants in TAZ result in mitochondrial structural and functional abnormalities. We report a case of infantile BTHS with severe heart failure, left ventricular noncompaction, and lactic acidosis, having a missense c.640C>T (p.His214Tyr) variant in TAZ, which is considered a pathogenic variant based on the previously reported amino acid substitution at the same site (c.641A>G, p.His214Arg). However, in this previously reported case, heart function was compensated and not entirely similar to the present case. In silico prediction analysis suggested that c.640C>T could alter the TAZ messenger RNA (mRNA) splicing process. Analysis of TAZ mRNAs in isolated peripheral mononuclear cells from the patient and in vitro splicing analysis using minigenes of TAZ found an 8 bp deletion at the 3' end of exon 8, which resulted in the formation of a termination codon in the coding region of exon 9 (H214Nfs*3). These findings suggest that splicing abnormalities should always be considered in BTHS.
Affiliation(s)
- Atsuhito Takeda
- Department of Pediatrics, Faculty of Medicine, Hokkaido University, Sapporo, Japan
| | - Masahiro Ueki
- Department of Pediatrics, Faculty of Medicine, Hokkaido University, Sapporo, Japan
| | - Jiro Abe
- MRC Mitochondrial Biology Unit, University of Cambridge, Cambridge, UK
| | - Kazuhiro Maeta
- KNC Department of Nucleic Acid Drug Discovery, Faculty of Rehabilitation, Kobe Gakuin University, Kobe, Japan
- Research Center for Locomotion Biology, Kobe Gakuin University, Kobe, Japan
| | - Tomoko Horiguchi
- KNC Department of Nucleic Acid Drug Discovery, Faculty of Rehabilitation, Kobe Gakuin University, Kobe, Japan
- Research Center for Locomotion Biology, Kobe Gakuin University, Kobe, Japan
| | - Hirokuni Yamazawa
- Department of Pediatrics, Faculty of Medicine, Hokkaido University, Sapporo, Japan
| | - Gaku Izumi
- Department of Pediatrics, Faculty of Medicine, Hokkaido University, Sapporo, Japan
| | - Ayako Chida-Nagai
- Department of Pediatrics, Faculty of Medicine, Hokkaido University, Sapporo, Japan
| | - Daisuke Sasaki
- Department of Pediatrics, Faculty of Medicine, Hokkaido University, Sapporo, Japan
| | - Takao Tsujioka
- Department of Pediatrics, Faculty of Medicine, Hokkaido University, Sapporo, Japan
| | - Itsumi Sato
- Department of Pediatrics, Faculty of Medicine, Hokkaido University, Sapporo, Japan
| | - Masahiro Shiraishi
- Department of Pediatrics, Faculty of Medicine, Hokkaido University, Sapporo, Japan
| | - Masafumi Matsuo
- KNC Department of Nucleic Acid Drug Discovery, Faculty of Rehabilitation, Kobe Gakuin University, Kobe, Japan
- Research Center for Locomotion Biology, Kobe Gakuin University, Kobe, Japan
- Faculty of Health Sciences, Kobe Tokiwa University, Kobe, Japan
15
Cvrčková F, Bezvoda R. Gaining Insight into Large Gene Families with the Aid of Bioinformatic Tools. Methods Mol Biol 2023; 2604:173-191. [PMID: 36773233 DOI: 10.1007/978-1-0716-2867-6_13] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/12/2023]
Abstract
Proteins participating in plant cell morphogenesis are often encoded by large gene families, in some cases comprising paralogs with variable (modular) domain organization, as in the case of the formin (FH2 protein) family of actin nucleators that can have also additional functions. Unravelling the phylogeny of such a complex gene family brings a number of specific challenges but may be crucial for predictions of protein function and for experimental design. Here we present an overview of our "cottage industry" semi-manual bioinformatic approach, based mostly, though not exclusively, on freely available software tools, which we used to obtain insight into the evolutionary history of plant FH2 proteins and some other components of the plant cell morphogenesis apparatus.
Affiliation(s)
- Fatima Cvrčková
- Department of Experimental Plant Biology, Faculty of Science, Charles University, CZ, Prague, Czechia.
| | - Radek Bezvoda
- Department of Experimental Plant Biology, Faculty of Science, Charles University, CZ, Prague, Czechia
16
Zhong V, Archibald BN, Brophy JAN. Transcriptional and post-transcriptional controls for tuning gene expression in plants. Curr Opin Plant Biol 2023; 71:102315. [PMID: 36462457 DOI: 10.1016/j.pbi.2022.102315] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 09/22/2022] [Revised: 10/22/2022] [Accepted: 10/27/2022] [Indexed: 06/17/2023]
Abstract
Plant biotechnologists seek to modify plants through genetic reprogramming, but our ability to precisely control gene expression in plants is still limited. Here, we review transcription and translation in the model plants Arabidopsis thaliana and Nicotiana benthamiana with an eye toward control points that may be used to predictably modify gene expression. We highlight differences in gene expression requirements between these plants and other species, and discuss the ways in which our understanding of gene expression has been used to engineer plants. This review is intended to serve as a resource for plant scientists looking to achieve precise control over gene expression.
Affiliation(s)
- Vivian Zhong
- Department of Bioengineering, Stanford University, Stanford, CA, USA
| | - Bella N Archibald
- Department of Bioengineering, Stanford University, Stanford, CA, USA
17
Barbosa P, Savisaar R, Carmo-Fonseca M, Fonseca A. Computational prediction of human deep intronic variation. Gigascience 2022; 12:giad085. [PMID: 37878682 PMCID: PMC10599398 DOI: 10.1093/gigascience/giad085] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/21/2023] [Revised: 06/07/2023] [Accepted: 09/20/2023] [Indexed: 10/27/2023] Open
Abstract
BACKGROUND The adoption of whole-genome sequencing in genetic screens has facilitated the detection of genetic variation in the intronic regions of genes, far from annotated splice sites. However, selecting an appropriate computational tool to discriminate functionally relevant genetic variants from those with no effect is challenging, particularly for deep intronic regions where independent benchmarks are scarce. RESULTS In this study, we have provided an overview of the computational methods available and the extent to which they can be used to analyze deep intronic variation. We leveraged diverse datasets to extensively evaluate tool performance across different intronic regions, distinguishing between variants that are expected to disrupt splicing through different molecular mechanisms. Notably, we compared the performance of SpliceAI, a widely used sequence-based deep learning model, with that of more recent methods that extend its original implementation. We observed considerable differences in tool performance depending on the region considered, with variants generating cryptic splice sites being better predicted than those that potentially affect splicing regulatory elements. Finally, we devised a novel quantitative assessment of tool interpretability and found that tools providing mechanistic explanations of their predictions are often correct with respect to the ground truth information, but the use of these tools results in decreased predictive power when compared to black box methods. CONCLUSIONS Our findings translate into practical recommendations for tool usage and provide a reference framework for applying prediction tools in deep intronic regions, enabling more informed decision-making by practitioners.
Affiliation(s)
- Pedro Barbosa
- LASIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, 1749-016, Lisboa, Portugal
- Instituto de Medicina Molecular João Lobo Antunes, Faculdade de Medicina, Universidade de Lisboa, 1649-028, Lisboa, Portugal
| | | | - Maria Carmo-Fonseca
- Instituto de Medicina Molecular João Lobo Antunes, Faculdade de Medicina, Universidade de Lisboa, 1649-028, Lisboa, Portugal
| | - Alcides Fonseca
- LASIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, 1749-016, Lisboa, Portugal
18
Mobilome of the Rhus Gall Aphid Schlechtendalia chinensis Provides Insight into TE Insertion-Related Inactivation of Functional Genes. Int J Mol Sci 2022; 23:ijms232415967. [PMID: 36555609 PMCID: PMC9783078 DOI: 10.3390/ijms232415967] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/29/2022] [Revised: 12/07/2022] [Accepted: 12/12/2022] [Indexed: 12/23/2022] Open
Abstract
Transposable elements (TEs) comprise a considerable proportion of insect genomic DNA; how they contribute to genome structure and organization is still poorly understood. Here, we present an analysis of the TE repertoire in the chromosome-level genome assembly of the Rhus gall aphid Schlechtendalia chinensis. The TE fractions are composed of at least 32 different superfamilies, and many TEs from different families were transcriptionally active in the S. chinensis genome. Furthermore, different types of transposase-derived proteins were also found in the S. chinensis genome. We also provide insight into TE-related insertional inactivation and exogenization of TEs in functional genes. We considered that the presence of TE fragments in the introns of functional genes could impact the activity of those genes, and that a large number of TE fragments in introns could lead to their indirect inactivation. The present study will be beneficial in understanding the role and impact of TEs in the genomic evolution of their hosts.
19
Akpokiro V, Martin T, Oluwadare O. EnsembleSplice: ensemble deep learning model for splice site prediction. BMC Bioinformatics 2022; 23:413. [PMID: 36203144 PMCID: PMC9535948 DOI: 10.1186/s12859-022-04971-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/31/2022] [Accepted: 09/29/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Identifying splice site regions is an important step in the genomic DNA sequencing pipelines of biomedical and pharmaceutical research. Within this research purview, efficient and accurate splice site detection is highly desirable, and a variety of computational models have been developed toward this end. Neural network architectures have recently been shown to outperform classical machine learning approaches for the task of splice site prediction. Despite these advances, there is still considerable potential for improvement, especially regarding model prediction accuracy and error rate. RESULTS Given these deficits, we propose EnsembleSplice, an ensemble learning architecture that combines four (4) distinct convolutional neural network (CNN) models and outperforms existing splice site detection methods on the experimental evaluation metrics considered, including accuracy and error rate. We trained and tested a variety of ensembles made up of CNNs and DNNs using the five-fold cross-validation method to identify the model that performed best across the evaluation and diversity metrics. As a result, we developed our diverse and highly effective splice site (SS) detection model, which we evaluated using two (2) genomic Homo sapiens datasets and the Arabidopsis thaliana dataset. The results showed that, for one of the Homo sapiens datasets, EnsembleSplice achieved accuracies of 94.16% for acceptor splice sites and 95.97% for donor splice sites, with error rates on the same dataset of 5.84% for acceptor splice sites and 4.03% for donor splice sites. CONCLUSIONS Our five-fold cross-validation ensured that the prediction accuracy of our models is consistent. For reproducibility, all the datasets used, models generated, and results in our work are publicly available in our GitHub repository here: https://github.com/OluwadareLab/EnsembleSplice.
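The reported accuracies and error rates can be checked with a short calculation. The Python sketch below assumes that error rate is simply 100% minus accuracy, which is consistent with the figures quoted in the abstract. The `soft_vote` function is a hypothetical illustration of one way several classifiers' outputs could be combined; the abstract does not state that EnsembleSplice combines its CNNs this way, and the probabilities shown are made up.

```python
def error_rate(accuracy_pct):
    """Error rate in percent, assuming error = 100 - accuracy."""
    return round(100.0 - accuracy_pct, 2)


def soft_vote(probabilities, threshold=0.5):
    """Average per-model probabilities for one candidate site and threshold the mean.

    Hypothetical combination rule, for illustration only.
    """
    mean_prob = sum(probabilities) / len(probabilities)
    return mean_prob >= threshold


# Accuracies quoted in the abstract for one Homo sapiens dataset.
acceptor_error = error_rate(94.16)
donor_error = error_rate(95.97)
print(acceptor_error, donor_error)

# Four made-up model outputs for a single candidate splice site.
print(soft_vote([0.9, 0.6, 0.4, 0.7]))
```

Averaging probabilities is only one possible design; majority voting or a trained meta-classifier would be equally plausible choices for an ensemble of this kind.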
Affiliation(s)
- Victor Akpokiro
- Department of Computer Science, University of Colorado, Colorado Springs, CO, 80918, USA
| | - Trevor Martin
- Department of Mathematics, Oberlin College, Oberlin, OH, 44074, USA
| | - Oluwatosin Oluwadare
- Department of Computer Science, University of Colorado, Colorado Springs, CO, 80918, USA.
20
Fernandez-Castillo E, Barbosa-Santillán LI, Falcon-Morales L, Sánchez-Escobar JJ. Deep Splicer: A CNN Model for Splice Site Prediction in Genetic Sequences. Genes (Basel) 2022; 13:907. [PMID: 35627292 PMCID: PMC9141016 DOI: 10.3390/genes13050907] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 04/06/2022] [Revised: 05/12/2022] [Accepted: 05/13/2022] [Indexed: 02/05/2023] Open
Abstract
Many living organisms have DNA in their cells that is responsible for their biological features. DNA is an organic molecule of two complementary strands of four different nucleotides wound up in a double helix. These nucleotides are adenine (A), thymine (T), guanine (G), and cytosine (C). Genes are DNA sequences containing the information to synthesize proteins. The genes of higher eukaryotic organisms contain coding sequences, known as exons, and non-coding sequences, known as introns, which are removed at splice sites after the DNA is transcribed into RNA. Genome annotation is the process of identifying the location of coding regions and determining their function. This process is fundamental for understanding gene structure; however, it is time-consuming and expensive when done by biochemical methods. With technological advances, splice site detection can be done computationally. Although various software tools have been developed to predict splice sites, they need to improve accuracy and reduce false-positive rates. The main goal of this research was to generate Deep Splicer, a deep learning model to identify splice sites in the genomes of humans and other species. This model has good performance metrics and a lower false-positive rate than the currently existing tools. Deep Splicer achieved an accuracy between 93.55% and 99.66% on the genetic sequences of different organisms, while Splice2Deep, another splice site detection tool, had an accuracy between 90.52% and 98.08%. Splice2Deep surpassed Deep Splicer on the accuracy obtained after evaluating C. elegans genomic sequences (97.88% vs. 93.62%) and A. thaliana (95.40% vs. 94.93%); however, Deep Splicer's accuracy was better for H. sapiens (98.94% vs. 97.15%) and D. melanogaster (97.14% vs. 92.30%). The rate of false positives was 0.11% for human genetic sequences and 0.25% for other species' genetic sequences. Another splice prediction tool, Splice Finder, had between 1% and 3% false positives for human sequences and between 4% and 10% for other species' sequences.
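As a rough illustration of how a nucleotide sequence can be presented to a convolutional network, the Python sketch below one-hot encodes the four bases named above. The encoding itself, the A/C/G/T column order, and the function name `one_hot` are assumptions made for illustration; the abstract does not describe Deep Splicer's actual input representation.

```python
BASES = "ACGT"  # column order is an arbitrary choice for this sketch


def one_hot(sequence):
    """Return one row per nucleotide with a single 1 in the column of that base.

    Characters other than A, C, G, T (for example N) become an all-zero row.
    """
    rows = []
    for base in sequence.upper():
        rows.append([1 if base == b else 0 for b in BASES])
    return rows


# Made-up six-base example sequence.
encoded = one_hot("GTAAGN")
for base, row in zip("GTAAGN", encoded):
    print(base, row)
```

Mapping unknown bases to an all-zero row keeps the matrix shape fixed, which is convenient when sequences of equal length are stacked into a batch.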
Affiliation(s)
- Elisa Fernandez-Castillo
- School of Engineering and Sciences, Monterrey Institute of Technology and Higher Education, Guadalajara 45201, Mexico
| | - Liliana Ibeth Barbosa-Santillán
- School of Engineering and Sciences, Monterrey Institute of Technology and Higher Education, Guadalajara 45201, Mexico
| | - Luis Falcon-Morales
- School of Engineering and Sciences, Monterrey Institute of Technology and Higher Education, Guadalajara 45201, Mexico
21
Zhu L, Li W. Roles of Physicochemical and Structural Properties of RNA-Binding Proteins in Predicting the Activities of Trans-Acting Splicing Factors with Machine Learning. Int J Mol Sci 2022; 23:ijms23084426. [PMID: 35457243 PMCID: PMC9030803 DOI: 10.3390/ijms23084426] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/04/2022] [Revised: 04/13/2022] [Accepted: 04/14/2022] [Indexed: 02/06/2023] Open
Abstract
Trans-acting splicing factors play a pivotal role in modulating alternative splicing by specifically binding to cis-elements in pre-mRNAs. There are approximately 1500 RNA-binding proteins (RBPs) in the human genome, but the activities of these RBPs in alternative splicing are unknown. Since determining RBP activities through experimental methods is expensive and time consuming, the development of an efficient computational method for predicting the activities of RBPs in alternative splicing from their sequences is of great practical importance. Recently, a machine learning model for predicting the activities of splicing factors was built based on features of single and dual amino acid compositions. Here, we explored the role of physicochemical and structural properties in predicting their activities in alternative splicing using machine learning approaches and found that the prediction performance is significantly improved by including these properties. By combining the minimum redundancy–maximum relevance (mRMR) method and forward feature searching strategy, a promising feature subset with 24 features was obtained to predict the activities of RBPs. The feature subset consists of 16 dual amino acid compositions, 5 physicochemical features, and 3 structural features. The physicochemical and structural properties were as important as the sequence composition features for an accurate prediction of the activities of splicing factors. The hydrophobicity and distribution of coil are suggested to be the key physicochemical and structural features, respectively.
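The idea behind combining minimum redundancy–maximum relevance with forward selection can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it uses absolute Pearson correlation as a stand-in for the mutual-information terms usually used in mRMR, it ranks features greedily without evaluating a classifier at each step, and the function names and toy data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def mrmr_select(features, target, k):
    """Greedy forward selection: pick the feature maximizing relevance minus mean redundancy."""
    selected = []
    candidates = list(range(len(features)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = abs(pearson(features[i], target))
            if not selected:
                return relevance
            redundancy = sum(
                abs(pearson(features[i], features[j])) for j in selected
            ) / len(selected)
            return relevance - redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data: feature 1 duplicates feature 0, feature 2 is weaker but less redundant.
target = [0, 0, 0, 0, 1, 1, 1, 1]
features = [
    [0, 0, 0, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 1, 0, 1],
]
print(mrmr_select(features, target, 2))
```

In the toy data the duplicated feature is passed over in the second round because its redundancy with the already selected feature outweighs its relevance, which is the behavior the redundancy penalty is meant to produce.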
Affiliation(s)
| | - Wenjin Li
- Correspondence: ; Tel.: +86-0755-26942336