51
Kathail P, Shuai RW, Chung R, Ye CJ, Loeb GB, Ioannidis NM. Current genomic deep learning models display decreased performance in cell type-specific accessible regions. Genome Biol 2024; 25:202. [PMID: 39090688; PMCID: PMC11293111; DOI: 10.1186/s13059-024-03335-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2023] [Accepted: 07/10/2024] [Indexed: 08/04/2024] Open
Abstract
BACKGROUND: A number of deep learning models have been developed to predict epigenetic features such as chromatin accessibility from DNA sequence. Model evaluations commonly report performance genome-wide; however, cis regulatory elements (CREs), which play critical roles in gene regulation, make up only a small fraction of the genome. Furthermore, cell type-specific CREs contain a large proportion of complex disease heritability. RESULTS: We evaluate genomic deep learning models in chromatin accessibility regions with varying degrees of cell type specificity. We assess two modeling directions in the field: general purpose models trained across thousands of outputs (cell types and epigenetic marks) and models tailored to specific tissues and tasks. We find that the accuracy of genomic deep learning models, including two state-of-the-art general purpose models (Enformer and Sei), varies across the genome and is reduced in cell type-specific accessible regions. Using accessibility models trained on cell types from specific tissues, we find that increasing model capacity to learn cell type-specific regulatory syntax, through single-task learning or high capacity multi-task models, can improve performance in cell type-specific accessible regions. We also observe that improving reference sequence predictions does not consistently improve variant effect predictions, indicating that novel strategies are needed to improve performance on variants. CONCLUSIONS: Our results provide a new perspective on the performance of genomic deep learning models, showing that performance varies across the genome and is particularly reduced in cell type-specific accessible regions. We also identify strategies to maximize performance in cell type-specific accessible regions.
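The evaluation idea in this abstract, scoring a model separately in regions of differing cell type specificity instead of reporting a single genome-wide number, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the choice of Pearson correlation as the metric, the stratum labels, and the function names `pearson` and `stratified_pearson` are all hypothetical.

```python
from collections import defaultdict
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def stratified_pearson(observed, predicted, strata):
    """Compute the correlation separately within each stratum of regions.

    `strata` holds one label per region (hypothetical labels such as
    "shared" or "specific"); the metric choice is an assumption.
    """
    groups = defaultdict(lambda: ([], []))
    for obs, pred, label in zip(observed, predicted, strata):
        groups[label][0].append(obs)
        groups[label][1].append(pred)
    return {label: pearson(obs, pred) for label, (obs, pred) in groups.items()}


# Toy data: the model tracks the signal in shared regions but not in specific ones.
observed = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 3.0, 3.0, 2.0, 1.0]
strata = ["shared"] * 3 + ["specific"] * 3
per_stratum = stratified_pearson(observed, predicted, strata)
```

Pooling all six toy regions would hide the contrast that the per-stratum values expose, which is the point the abstract makes about genome-wide reporting.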
Affiliation(s)
- Pooja Kathail
  - Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
- Richard W Shuai
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA
- Ryan Chung
  - Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
- Chun Jimmie Ye
  - Division of Rheumatology, Department of Medicine, University of California, San Francisco, CA, USA
  - Institute for Human Genetics, University of California, San Francisco, CA, USA
  - Department of Epidemiology and Biostatistics, University of California, San Francisco, CA, USA
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, CA, USA
  - Parker Institute for Cancer Immunotherapy, San Francisco, CA, USA
  - Chan Zuckerberg Biohub, San Francisco, CA, USA
- Gabriel B Loeb
  - Division of Nephrology, Department of Medicine, University of California, San Francisco, CA, USA
  - Cardiovascular Research Institute, University of California, San Francisco, CA, USA
- Nilah M Ioannidis
  - Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA
  - Chan Zuckerberg Biohub, San Francisco, CA, USA
52
Wang X, Li F, Zhang Y, Imoto S, Shen HH, Li S, Guo Y, Yang J, Song J. Deep learning approaches for non-coding genetic variant effect prediction: current progress and future prospects. Brief Bioinform 2024; 25:bbae446. [PMID: 39276327; PMCID: PMC11401448; DOI: 10.1093/bib/bbae446] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2024] [Revised: 08/08/2024] [Accepted: 08/27/2024] [Indexed: 09/16/2024] Open
Abstract
Recent advancements in high-throughput sequencing technologies have significantly enhanced our ability to unravel the intricacies of gene regulatory processes. A critical challenge in this endeavor is the identification of variant effects, a key factor in comprehending the mechanisms underlying gene regulation. Non-coding variants, constituting over 90% of all variants, have garnered increasing attention in recent years. The exploration of gene variant impacts and regulatory mechanisms has spurred the development of various deep learning approaches, providing new insights into the global regulatory landscape through the analysis of extensive genetic data. Here, we provide a comprehensive overview of the development of the non-coding variants models based on bulk and single-cell sequencing data and their model-based interpretation and downstream tasks. This review delineates the popular sequencing technologies for epigenetic profiling and deep learning approaches for discerning the effects of non-coding variants. Additionally, we summarize the limitations of current approaches in variant effect prediction research and outline opportunities for improvement. We anticipate that our study will offer a practical and useful guide for the bioinformatic community to further advance the unraveling of genetic variant effects.
Affiliation(s)
- Xiaoyu Wang
  - Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
  - Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
- Fuyi Li
  - South Australian immunoGENomics Cancer Institute (SAiGENCI), Faculty of Health and Medical Sciences, The University of Adelaide, Adelaide, SA 5005, Australia
- Yiwen Zhang
  - School of Public Health and Preventive Medicine, Monash University, Melbourne, VIC 3004, Australia
- Seiya Imoto
  - Genome Center, Institute of Medical Science, The University of Tokyo, Minato-ku, Tokyo 108-8639, Japan
  - Collaborative Research Institute for Innovative Microbiology, The University of Tokyo, Bunkyo-ku, Tokyo 113-8657, Japan
- Hsin-Hui Shen
  - Department of Materials Science and Engineering, Faculty of Engineering, Monash University, Clayton, VIC 3800, Australia
- Shanshan Li
  - School of Public Health and Preventive Medicine, Monash University, Melbourne, VIC 3004, Australia
- Yuming Guo
  - School of Public Health and Preventive Medicine, Monash University, Melbourne, VIC 3004, Australia
- Jian Yang
  - School of Life Sciences, Westlake University, Hangzhou, Zhejiang 310030, China
  - Westlake Laboratory of Life Sciences and Biomedicine, Hangzhou, Zhejiang 310024, China
- Jiangning Song
  - Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
  - Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
53
Alagarswamy K, Shi W, Boini A, Messaoudi N, Grasso V, Cattabiani T, Turner B, Croner R, Kahlert UD, Gumbs A. Should AI-Powered Whole-Genome Sequencing Be Used Routinely for Personalized Decision Support in Surgical Oncology—A Scoping Review. BioMedInformatics 2024; 4:1757-1772. [DOI: 10.3390/biomedinformatics4030096] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 01/12/2025]
Abstract
In this scoping review, we delve into the transformative potential of artificial intelligence (AI) in addressing challenges inherent in whole-genome sequencing (WGS) analysis, with a specific focus on its implications in oncology. Unveiling the limitations of existing sequencing technologies, the review illuminates how AI-powered methods emerge as innovative solutions to surmount these obstacles. The evolution of DNA sequencing technologies, progressing from Sanger sequencing to next-generation sequencing, sets the backdrop for AI’s emergence as a potent ally in processing and analyzing the voluminous genomic data generated. Particularly, deep learning methods play a pivotal role in extracting knowledge and discerning patterns from the vast landscape of genomic information. In the context of oncology, AI-powered methods exhibit considerable potential across diverse facets of WGS analysis, including variant calling, structural variation identification, and pharmacogenomic analysis. This review underscores the significance of multimodal approaches in diagnoses and therapies, highlighting the importance of ongoing research and development in AI-powered WGS techniques. Integrating AI into the analytical framework empowers scientists and clinicians to unravel the intricate interplay of genomics within the realm of multi-omics research, paving the way for more successful personalized and targeted treatments.
Affiliation(s)
- Wenjie Shi
  - Department of General-, Visceral-, Vascular and Transplantation Surgery, University of Magdeburg, Haus 60a, Leipziger Str. 44, 39120 Magdeburg, Germany
- Aishwarya Boini
  - Davao Medical School Foundation, Davao City 8000, Philippines
- Nouredin Messaoudi
  - Department of Hepatopancreatobiliary Surgery, Vrije Universiteit Brussel (VUB), Universitair Ziekenhuis Brussel (UZ Brussel), Europe Hospitals, 1090 Brussels, Belgium
- Vincent Grasso
  - Department of Electrical and Computer Engineering, University of New Mexico, Albuquerque, NM 87131, USA
- Roland Croner
  - Department of General-, Visceral-, Vascular and Transplantation Surgery, University of Magdeburg, Haus 60a, Leipziger Str. 44, 39120 Magdeburg, Germany
- Ulf D. Kahlert
  - Department of General-, Visceral-, Vascular and Transplantation Surgery, University of Magdeburg, Haus 60a, Leipziger Str. 44, 39120 Magdeburg, Germany
- Andrew Gumbs
  - Department of General-, Visceral-, Vascular and Transplantation Surgery, University of Magdeburg, Haus 60a, Leipziger Str. 44, 39120 Magdeburg, Germany
  - Talos Surgical, Inc., New Castle, DE 19720, USA
  - Department of Surgery, American Hospital of Tbilisi, 0102 Tbilisi, Georgia
54
Kathail P, Shuai RW, Chung R, Ye CJ, Loeb GB, Ioannidis NM. Current genomic deep learning models display decreased performance in cell type specific accessible regions. bioRxiv 2024:2024.07.05.602265. [PMID: 39026761; PMCID: PMC11257480; DOI: 10.1101/2024.07.05.602265] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/20/2024]
Abstract
Background: A number of deep learning models have been developed to predict epigenetic features such as chromatin accessibility from DNA sequence. Model evaluations commonly report performance genome-wide; however, cis regulatory elements (CREs), which play critical roles in gene regulation, make up only a small fraction of the genome. Furthermore, cell type specific CREs contain a large proportion of complex disease heritability. Results: We evaluate genomic deep learning models in chromatin accessibility regions with varying degrees of cell type specificity. We assess two modeling directions in the field: general purpose models trained across thousands of outputs (cell types and epigenetic marks), and models tailored to specific tissues and tasks. We find that the accuracy of genomic deep learning models, including two state-of-the-art general purpose models (Enformer and Sei), varies across the genome and is reduced in cell type specific accessible regions. Using accessibility models trained on cell types from specific tissues, we find that increasing model capacity to learn cell type specific regulatory syntax, through single-task learning or high capacity multi-task models, can improve performance in cell type specific accessible regions. We also observe that improving reference sequence predictions does not consistently improve variant effect predictions, indicating that novel strategies are needed to improve performance on variants. Conclusions: Our results provide a new perspective on the performance of genomic deep learning models, showing that performance varies across the genome and is particularly reduced in cell type specific accessible regions. We also identify strategies to maximize performance in cell type specific accessible regions.
Affiliation(s)
- Pooja Kathail
  - Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
- Richard W. Shuai
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA
- Ryan Chung
  - Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
- Chun Jimmie Ye
  - Division of Rheumatology, Department of Medicine, University of California, San Francisco, San Francisco, CA, USA
  - Institute for Human Genetics, University of California, San Francisco, San Francisco, CA, USA
  - Department of Epidemiology and Biostatistics, University of California, San Francisco, San Francisco, CA, USA
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA
  - Parker Institute for Cancer Immunotherapy, San Francisco, CA, USA
  - Chan Zuckerberg Biohub, San Francisco, CA, USA
- Gabriel B. Loeb
  - Division of Nephrology, Department of Medicine, University of California, San Francisco, San Francisco, CA, USA
  - Cardiovascular Research Institute, University of California, San Francisco, San Francisco, CA, USA
- Nilah M. Ioannidis
  - Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA
  - Chan Zuckerberg Biohub, San Francisco, CA, USA
55
Duttke SH, Guzman C, Chang M, Delos Santos NP, McDonald BR, Xie J, Carlin AF, Heinz S, Benner C. Position-dependent function of human sequence-specific transcription factors. Nature 2024; 631:891-898. [PMID: 39020164; PMCID: PMC11269187; DOI: 10.1038/s41586-024-07662-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/29/2022] [Accepted: 06/04/2024] [Indexed: 07/19/2024]
Abstract
Patterns of transcriptional activity are encoded in our genome through regulatory elements such as promoters or enhancers that, paradoxically, contain similar assortments of sequence-specific transcription factor (TF) binding sites [1-3]. Knowledge of how these sequence motifs encode multiple, often overlapping, gene expression programs is central to understanding gene regulation and how mutations in non-coding DNA manifest in disease [4,5]. Here, by studying gene regulation from the perspective of individual transcription start sites (TSSs), using natural genetic variation, perturbation of endogenous TF protein levels and massively parallel analysis of natural and synthetic regulatory elements, we show that the effect of TF binding on transcription initiation is position dependent. Analysing TF-binding-site occurrences relative to the TSS, we identified several motifs with highly preferential positioning. We show that these patterns are a combination of a TF's distinct functional profiles: many TFs, including canonical activators such as NRF1, NFY and Sp1, activate or repress transcription initiation depending on their precise position relative to the TSS. As such, TFs and their spacing collectively guide the site and frequency of transcription initiation. More broadly, these findings reveal how similar assortments of TF binding sites can generate distinct gene regulatory outcomes depending on their spatial configuration and how DNA sequence polymorphisms may contribute to transcription variation and disease, and underscore a critical role for TSS data in decoding the regulatory information of our genome.
Affiliation(s)
- Sascha H Duttke
  - School of Molecular Biosciences, College of Veterinary Medicine, Washington State University, Pullman, WA, USA
- Carlos Guzman
  - Department of Medicine, Division of Endocrinology, U.C. San Diego School of Medicine, La Jolla, CA, USA
- Max Chang
  - Department of Medicine, Division of Endocrinology, U.C. San Diego School of Medicine, La Jolla, CA, USA
- Nathaniel P Delos Santos
  - Department of Medicine, Division of Endocrinology, U.C. San Diego School of Medicine, La Jolla, CA, USA
- Bayley R McDonald
  - School of Molecular Biosciences, College of Veterinary Medicine, Washington State University, Pullman, WA, USA
- Jialei Xie
  - Department of Pathology and Medicine, U.C. San Diego School of Medicine, La Jolla, CA, USA
- Aaron F Carlin
  - Department of Pathology and Medicine, U.C. San Diego School of Medicine, La Jolla, CA, USA
- Sven Heinz
  - Department of Medicine, Division of Endocrinology, U.C. San Diego School of Medicine, La Jolla, CA, USA
- Christopher Benner
  - Department of Medicine, Division of Endocrinology, U.C. San Diego School of Medicine, La Jolla, CA, USA
56
Moeckel C, Mouratidis I, Chantzi N, Uzun Y, Georgakopoulos-Soares I. Advances in computational and experimental approaches for deciphering transcriptional regulatory networks: Understanding the roles of cis-regulatory elements is essential, and recent research utilizing MPRAs, STARR-seq, CRISPR-Cas9, and machine learning has yielded valuable insights. Bioessays 2024; 46:e2300210. [PMID: 38715516; PMCID: PMC11444527; DOI: 10.1002/bies.202300210] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2023] [Revised: 04/22/2024] [Accepted: 04/23/2024] [Indexed: 05/16/2024]
Abstract
Understanding the influence of cis-regulatory elements on gene regulation poses numerous challenges given complexities stemming from variations in transcription factor (TF) binding, chromatin accessibility, structural constraints, and cell-type differences. This review discusses the role of gene regulatory networks in enhancing understanding of transcriptional regulation and covers construction methods ranging from expression-based approaches to supervised machine learning. Additionally, key experimental methods, including MPRAs and CRISPR-Cas9-based screening, which have significantly contributed to understanding TF binding preferences and cis-regulatory element functions, are explored. Lastly, the potential of machine learning and artificial intelligence to unravel cis-regulatory logic is analyzed. These computational advances have far-reaching implications for precision medicine, therapeutic target discovery, and the study of genetic variations in health and disease.
Affiliation(s)
- Camille Moeckel
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Ioannis Mouratidis
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
  - Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
- Nikol Chantzi
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Yasin Uzun
  - Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
  - Department of Pediatrics, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Ilias Georgakopoulos-Soares
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
  - Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
57
Ghoreishifar M, Chamberlain AJ, Xiang R, Prowse-Wilkins CP, Lopdell TJ, Littlejohn MD, Pryce JE, Goddard ME. Allele-specific binding variants causing ChIP-seq peak height of histone modification are not enriched in expression QTL annotations. Genet Sel Evol 2024; 56:50. [PMID: 38937662; PMCID: PMC11212393; DOI: 10.1186/s12711-024-00916-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2023] [Accepted: 06/04/2024] [Indexed: 06/29/2024] Open
Abstract
BACKGROUND Genome sequence variants affecting complex traits (quantitative trait loci, QTL) are enriched in functional regions of the genome, such as those marked by certain histone modifications. These variants are believed to influence gene expression. However, due to the linkage disequilibrium among nearby variants, pinpointing the precise location of QTL is challenging. We aimed to identify allele-specific binding (ASB) QTL (asbQTL) that cause variation in the level of histone modification, as measured by the height of peaks assayed by ChIP-seq (chromatin immunoprecipitation sequencing). We identified DNA sequences that predict the difference between alleles in ChIP-seq peak height in H3K4me3 and H3K27ac histone modifications in the mammary glands of cows. RESULTS We used a gapped k-mer support vector machine, a novel best linear unbiased prediction model, and a multiple linear regression model that combines the other two approaches to predict variant impacts on peak height. For each method, a subset of 1000 sites with the highest magnitude of predicted ASB was considered as candidate asbQTL. The accuracy of this prediction was measured by the proportion where the predicted direction matched the observed direction. Prediction accuracy ranged between 0.59 and 0.74, suggesting that these 1000 sites are enriched for asbQTL. Using independent data, we investigated functional enrichment in the candidate asbQTL set and three control groups, including non-causal ASB sites, non-ASB variants under a peak, and SNPs (single nucleotide polymorphisms) not under a peak. For H3K4me3, a higher proportion of the candidate asbQTL were confirmed as ASB when compared to the non-causal ASB sites (P < 0.01). However, these candidate asbQTL did not enrich for the other annotations, including expression QTL (eQTL), allele-specific expression QTL (aseQTL) and sites conserved across mammals (P > 0.05). 
CONCLUSIONS We identified putatively causal sites for asbQTL using the DNA sequence surrounding these sites. Our results suggest that many sites influencing histone modifications may not directly affect gene expression. However, it is important to acknowledge that distinguishing between putative causal ASB sites and other non-causal ASB sites in high linkage disequilibrium with the causal sites regarding their impact on gene expression may be challenging due to limitations in statistical power.
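The accuracy measure described in this abstract, the proportion of top-ranked sites where the predicted direction of the allelic effect matches the observed direction, can be sketched as below. This is an illustrative sketch only: the function names, the toy values, and the decision to skip sites with a zero effect are assumptions, not details taken from the paper.

```python
def top_sites_by_magnitude(predicted, n):
    """Return indices of the n sites with the largest absolute predicted effect."""
    order = sorted(range(len(predicted)), key=lambda i: abs(predicted[i]), reverse=True)
    return order[:n]


def direction_accuracy(predicted, observed):
    """Proportion of sites whose predicted sign matches the observed sign.

    Sites where either value is exactly zero carry no direction and are
    skipped (an assumption made for this sketch).
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    pairs = [(p, o) for p, o in zip(predicted, observed) if p != 0 and o != 0]
    if not pairs:
        raise ValueError("no sites with a defined direction")
    matches = sum(1 for p, o in pairs if (p > 0) == (o > 0))
    return matches / len(pairs)


# Toy example: keep the 3 sites with the largest predicted effect, then score them.
predicted = [0.9, -0.4, 0.1, -1.2, 0.7]
observed = [0.5, -0.2, -0.3, 0.8, 0.6]
top = top_sites_by_magnitude(predicted, 3)
accuracy = direction_accuracy([predicted[i] for i in top], [observed[i] for i in top])
```

A value of 0.5 corresponds to chance agreement for a balanced set of sites, which is the baseline against which the reported range of 0.59 to 0.74 would be read.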
Affiliation(s)
- Mohammad Ghoreishifar
  - Agriculture Victoria Research, AgriBio Centre for AgriBioscience, Bundoora, VIC, 3083, Australia
  - School of Applied Systems Biology, La Trobe University, Bundoora, VIC, 3083, Australia
- Amanda J Chamberlain
  - Agriculture Victoria Research, AgriBio Centre for AgriBioscience, Bundoora, VIC, 3083, Australia
  - School of Applied Systems Biology, La Trobe University, Bundoora, VIC, 3083, Australia
- Ruidong Xiang
  - Agriculture Victoria Research, AgriBio Centre for AgriBioscience, Bundoora, VIC, 3083, Australia
  - Faculty of Veterinary & Agricultural Science, University of Melbourne, Parkville, VIC, 3010, Australia
- Claire P Prowse-Wilkins
  - Agriculture Victoria Research, AgriBio Centre for AgriBioscience, Bundoora, VIC, 3083, Australia
  - Faculty of Veterinary & Agricultural Science, University of Melbourne, Parkville, VIC, 3010, Australia
- Thomas J Lopdell
  - Research and Development, Livestock Improvement Corporation, Private Bag 3016, Hamilton, 3240, New Zealand
- Mathew D Littlejohn
  - Research and Development, Livestock Improvement Corporation, Private Bag 3016, Hamilton, 3240, New Zealand
- Jennie E Pryce
  - Agriculture Victoria Research, AgriBio Centre for AgriBioscience, Bundoora, VIC, 3083, Australia
  - School of Applied Systems Biology, La Trobe University, Bundoora, VIC, 3083, Australia
- Michael E Goddard
  - Agriculture Victoria Research, AgriBio Centre for AgriBioscience, Bundoora, VIC, 3083, Australia
  - Faculty of Veterinary & Agricultural Science, University of Melbourne, Parkville, VIC, 3010, Australia
58
Qiu C, Su K, Luo Z, Tian Q, Zhao L, Wu L, Deng H, Shen H. Developing and comparing deep learning and machine learning algorithms for osteoporosis risk prediction. Front Artif Intell 2024; 7:1355287. [PMID: 38919268; PMCID: PMC11196804; DOI: 10.3389/frai.2024.1355287] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/14/2023] [Accepted: 05/31/2024] [Indexed: 06/27/2024] Open
Abstract
Introduction: Osteoporosis, characterized by low bone mineral density (BMD), is an increasingly serious public health issue. So far, several traditional regression models and machine learning (ML) algorithms have been proposed for predicting osteoporosis risk. However, these models have shown relatively low accuracy in clinical implementation. Recently proposed deep learning (DL) approaches, such as deep neural networks (DNNs), which can discover knowledge from complex hidden interactions, offer a new opportunity to improve predictive performance. In this study, we aimed to assess whether a DNN can achieve better performance in osteoporosis risk prediction. Methods: Using hip BMD and extensive demographic and routine clinical data of 8,134 subjects older than 40 years from the Louisiana Osteoporosis Study (LOS), we developed a novel DNN framework for predicting osteoporosis risk and compared its performance with four conventional ML models, namely random forest (RF), artificial neural network (ANN), k-nearest neighbor (KNN), and support vector machine (SVM), as well as a traditional regression model termed the osteoporosis self-assessment tool (OST). Model performance was assessed by the area under the receiver operating characteristic curve (AUC) and accuracy. Results: Using 16 discriminative variables, we observed that the DNN approach achieved the best predictive performance (AUC = 0.848) in classifying osteoporosis (hip BMD T-score ≤ -1.0) and non-osteoporosis risk (hip BMD T-score > -1.0) subjects, compared to the other approaches. Feature importance analysis showed that the top 10 most important variables identified by the DNN model were weight, age, gender, grip strength, height, beer drinking, diastolic pressure, alcohol drinking, smoke years, and economic level.
Furthermore, we performed subsampling analysis to assess the effects of varying the sample size and the number of variables on the predictive performance of the tested models. Notably, we observed that the DNN model performed comparably well (AUC = 0.846) even when using only the top 10 most important variables for osteoporosis risk prediction. Meanwhile, the DNN model still achieved a high predictive performance (AUC = 0.826) when the sample size was reduced to 50% of the original dataset. Conclusion: We developed a novel DNN model which was considered to be an effective algorithm for early diagnosis and intervention of osteoporosis in the aging population.
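To make the evaluation described in this abstract concrete, the sketch below labels subjects by the stated hip BMD T-score cut-off and computes AUC with the pairwise rank (Mann-Whitney) formulation. It is an illustration, not the study's code: the function names and the toy scores are hypothetical, and only the T-score threshold of -1.0 comes from the abstract.

```python
def label_from_t_score(t_score, threshold=-1.0):
    """Return 1 (at risk) when the hip BMD T-score is at or below the threshold."""
    return 1 if t_score <= threshold else 0


def auc(scores, labels):
    """AUC as the probability that a positive outranks a negative.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data: T-scores define the labels; risk scores stand in for model output.
t_scores = [-2.1, -1.0, -0.4, 0.3, -1.6, 0.9]
risk_scores = [0.9, 0.6, 0.7, 0.2, 0.8, 0.1]
labels = [label_from_t_score(t) for t in t_scores]
toy_auc = auc(risk_scores, labels)
```

The pairwise form is quadratic in the number of subjects, which is fine for a sketch; a sort-based rank computation would be the usual choice at the scale of thousands of subjects.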
Affiliation(s)
- Hongwen Deng
  - Tulane Center for Biomedical Informatics and Genomics, Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, United States
- Hui Shen
  - Tulane Center for Biomedical Informatics and Genomics, Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, United States
59
Gjoni K, Pollard KS. SuPreMo: a computational tool for streamlining in silico perturbation using sequence-based predictive models. Bioinformatics 2024; 40:btae340. [PMID: 38796686; PMCID: PMC11153836; DOI: 10.1093/bioinformatics/btae340] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2023] [Revised: 05/04/2024] [Accepted: 05/24/2024] [Indexed: 05/28/2024] Open
Abstract
SUMMARY: The increasing development of sequence-based machine learning models has raised the demand for manipulating sequences for this application. However, existing approaches to edit and evaluate genome sequences using models have limitations, such as incompatibility with structural variants, challenges in identifying responsible sequence perturbations, and the need for VCF file inputs and phased data. To address these bottlenecks, we present Sequence Mutator for Predictive Models (SuPreMo), a scalable and comprehensive tool for performing and supporting in silico mutagenesis experiments. We then demonstrate how pairs of reference and perturbed sequences can be used with machine learning models to prioritize pathogenic variants or discover new functional sequences. AVAILABILITY AND IMPLEMENTATION: SuPreMo was written in Python, and can be run using only one line of code to generate both sequences and 3D genome disruption scores. The codebase, instructions for installation and use, and tutorials are on the GitHub page: https://github.com/ketringjoni/SuPreMo.
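The general idea of in silico mutagenesis with paired reference and perturbed sequences can be sketched as follows. This is not SuPreMo's API or scoring method: `apply_variant`, `gc_fraction`, and `disruption_score` are hypothetical names, and GC content is only a toy stand-in for a sequence-based predictive model.

```python
def apply_variant(reference, pos, ref_allele, alt_allele):
    """Return the perturbed sequence for a variant at 0-based position `pos`.

    Handles substitutions, insertions, and deletions by replacing the
    reference allele with the alternate allele.
    """
    end = pos + len(ref_allele)
    if reference[pos:end] != ref_allele:
        raise ValueError("reference allele does not match the sequence")
    return reference[:pos] + alt_allele + reference[end:]


def gc_fraction(sequence):
    """Toy 'model': fraction of G and C bases (a stand-in, not a real predictor)."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    return sum(1 for base in sequence if base in "GC") / len(sequence)


def disruption_score(reference, perturbed, model=gc_fraction):
    """Absolute difference between model outputs on the sequence pair."""
    return abs(model(perturbed) - model(reference))


# Single-nucleotide substitution and a one-base deletion on a toy sequence.
reference = "ACGTACGTAC"
snv = apply_variant(reference, 3, "T", "G")
deletion = apply_variant(reference, 4, "AC", "A")
snv_score = disruption_score(reference, snv)
```

Passing the model as a parameter keeps sequence generation separate from scoring, so the same reference and perturbed pair could be handed to any predictor that accepts a sequence.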
Affiliation(s)
- Ketrin Gjoni
  - Institute of Data Science and Biotechnology, Gladstone Institutes, 1650 Owens Street, San Francisco, CA 94158, United States
  - Department of Epidemiology & Biostatistics, University of California, San Francisco, CA 94158, United States
- Katherine S Pollard
  - Institute of Data Science and Biotechnology, Gladstone Institutes, 1650 Owens Street, San Francisco, CA 94158, United States
  - Department of Epidemiology & Biostatistics, University of California, San Francisco, CA 94158, United States
  - Chan Zuckerberg Biohub, San Francisco, CA 94158, United States
60
Iurlaro M, Masoni F, Flyamer IM, Wirbelauer C, Iskar M, Burger L, Giorgetti L, Schübeler D. Systematic assessment of ISWI subunits shows that NURF creates local accessibility for CTCF. Nat Genet 2024; 56:1203-1212. [PMID: 38816647; PMCID: PMC11176080; DOI: 10.1038/s41588-024-01767-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/07/2023] [Accepted: 04/23/2024] [Indexed: 06/01/2024]
Abstract
Catalytic activity of the imitation switch (ISWI) family of remodelers is critical for nucleosomal organization and DNA binding of certain transcription factors, including the insulator protein CTCF. Here we define the contribution of individual subcomplexes by deriving a panel of isogenic mouse stem cell lines, each lacking one of six ISWI accessory subunits. Individual deletions of subunits of either CERF, RSF, ACF, WICH or NoRC subcomplexes only moderately affect the chromatin landscape, while removal of the NURF-specific subunit BPTF leads to a strong reduction in chromatin accessibility and SNF2H ATPase localization around CTCF sites. This affects adjacent nucleosome occupancy and CTCF binding. At a group of sites with reduced chromatin accessibility, CTCF binding persists but cohesin occupancy is reduced, resulting in decreased insulation. These results suggest that CTCF binding can be separated from its function as an insulator in nuclear organization and identify a specific role for NURF in mediating SNF2H localization and chromatin opening at bound CTCF sites.
Affiliation(s)
- Mario Iurlaro
- Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland
- Disease Area Oncology, Novartis Biomedical Research, Basel, Switzerland
| | - Francesca Masoni
- Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland
- Faculty of Science, University of Basel, Basel, Switzerland
| | - Ilya M Flyamer
- Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland
| | | | - Murat Iskar
- Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland
| | - Lukas Burger
- Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland
- Swiss Institute of Bioinformatics, Basel, Switzerland
| | - Luca Giorgetti
- Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland
| | - Dirk Schübeler
- Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland.
- Faculty of Science, University of Basel, Basel, Switzerland.
| |
|
61
|
Zhu S, Yuan S, Niu R, Zhou Y, Wang Z, Xu G. RNAirport: a deep neural network-based database characterizing representative gene models in plants. J Genet Genomics 2024; 51:652-664. [PMID: 38518981 DOI: 10.1016/j.jgg.2024.03.004] [Received: 03/03/2024] [Revised: 03/15/2024] [Accepted: 03/16/2024] [Indexed: 03/24/2024]
Abstract
A 5'-leader, known initially as the 5'-untranslated region, exists in multiple isoforms owing to alternative splicing (aS) and alternative transcription start sites (aTSS). A representative 5'-leader is therefore needed to examine the embedded RNA regulatory elements that control translation efficiency. Here, we develop a ranking algorithm and a deep-learning model to annotate representative 5'-leaders for five plant species. We rank the intra-sample and inter-sample frequency of aS-mediated transcript isoforms using a Kruskal-Wallis test-based algorithm and identify the representative aS-5'-leader. To further assign a representative 5'-end, we train the deep-learning model 5'leaderP to learn aTSS-mediated 5'-end distribution patterns from cap-analysis gene expression data. The model accurately predicts the 5'-end, as confirmed experimentally in Arabidopsis and rice. The gene models containing representative 5'-leaders and 5'leaderP can be accessed at RNAirport (http://www.rnairport.com/leader5P/). The Stage 1 annotation of 5'-leaders records 5'-leader diversity and will pave the way to Ribo-Seq open-reading frame annotation, identical to the project recently initiated by human GENCODE.
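The abstract names a Kruskal-Wallis test as the basis of the isoform ranking but does not say how the statistic enters the ranking. As a minimal sketch of the test statistic alone, assuming each group holds the observed frequencies of one isoform across samples (the function name and the input layout are hypothetical, and no tie correction is applied):

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for a list of samples."""
    # Pool all values, remembering which group each one came from.
    pooled = sorted((value, g) for g, sample in enumerate(groups) for value in sample)
    n_total = len(pooled)
    rank_sums = [0.0] * len(groups)
    i = 0
    while i < n_total:
        # Find the run of tied values starting at position i.
        j = i
        while j < n_total and pooled[j][0] == pooled[i][0]:
            j += 1
        mean_rank = (i + 1 + j) / 2.0  # average of the 1-based ranks i+1 .. j
        for _, g in pooled[i:j]:
            rank_sums[g] += mean_rank
        i = j
    spread = sum(r * r / len(s) for r, s in zip(rank_sums, groups))
    return 12.0 / (n_total * (n_total + 1)) * spread - 3.0 * (n_total + 1)
```

A larger H indicates that the groups differ more in their rank distributions; turning H into a p-value would need a chi-squared tail probability, which is left out here.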
Affiliation(s)
- Sitao Zhu
- State Key Laboratory of Hybrid Rice, Institute for Advanced Studies (IAS), Wuhan University, Wuhan, Hubei 430072, China
| | - Shu Yuan
- State Key Laboratory of Hybrid Rice, Institute for Advanced Studies (IAS), Wuhan University, Wuhan, Hubei 430072, China
| | - Ruixia Niu
- State Key Laboratory of Hybrid Rice, Institute for Advanced Studies (IAS), Wuhan University, Wuhan, Hubei 430072, China
| | - Yulu Zhou
- State Key Laboratory of Hybrid Rice, Institute for Advanced Studies (IAS), Wuhan University, Wuhan, Hubei 430072, China
| | - Zhao Wang
- State Key Laboratory of Hybrid Rice, Institute for Advanced Studies (IAS), Wuhan University, Wuhan, Hubei 430072, China
| | - Guoyong Xu
- State Key Laboratory of Hybrid Rice, Institute for Advanced Studies (IAS), Wuhan University, Wuhan, Hubei 430072, China; Hubei Hongshan Laboratory, Wuhan, Hubei 430070, China.
| |
|
62
|
Zhao L, Hao R, Chai Z, Fu W, Yang W, Li C, Liu Q, Jiang Y. DeepOCR: A multi-species deep-learning framework for accurate identification of open chromatin regions in livestock. Comput Biol Chem 2024; 110:108077. [PMID: 38691895 DOI: 10.1016/j.compbiolchem.2024.108077] [Received: 01/11/2024] [Revised: 03/27/2024] [Accepted: 04/16/2024] [Indexed: 05/03/2024]
Abstract
A wealth of experimental evidence has suggested that open chromatin regions (OCRs) are involved in many critical biological activities, such as DNA replication, enhancer activity, and gene transcription. Accurately identifying OCRs in livestock species can provide critical insights into the distribution and characteristics of OCRs for disease treatment in livestock, thereby improving animal welfare. However, most current machine-learning methods for OCR prediction were originally designed for a limited number of species, such as humans and some model organisms, and thus their performance on non-model organisms, specifically livestock, is often unsatisfactory. To bridge this gap, we propose DeepOCR, a lightweight depth-separable residual network model for predicting OCRs in livestock, including chicken, cattle, and sheep. DeepOCR integrates a single convolution layer and two improved residual structure blocks to extract and learn important features from the input DNA sequences. A fully connected layer was also employed to further process the extracted features and improve the robustness of the entire network. Our benchmarking experiments demonstrated superior prediction performance of DeepOCR compared to state-of-the-art approaches on testing datasets of the three species. The source code of DeepOCR is freely available for academic purposes at https://github.com/jasonzhao371/DeepOCR/. We anticipate that DeepOCR will serve as a practical and reliable computational tool for OCR-related studies in livestock species.
Affiliation(s)
- Liangwei Zhao
- College of Information Engineering, Northwest A&F University, Yangling 712100, China
| | - Ran Hao
- College of Information Engineering, Northwest A&F University, Yangling 712100, China
| | - Ziyi Chai
- College of Information Engineering, Northwest A&F University, Yangling 712100, China
| | - Weiwei Fu
- College of Pastoral Agriculture Science and Technology, Lanzhou University, Lanzhou, Gansu 730020, China
| | - Wei Yang
- National Clinical Research Center for Infectious Diseases, Shenzhen Third People's Hospital, Shenzhen 518112, China
| | - Chen Li
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia.
| | - Quanzhong Liu
- College of Information Engineering, Northwest A&F University, Yangling 712100, China.
| | - Yu Jiang
- Key Laboratory of Animal Genetics, Breeding and Reproduction of Shaanxi Province, College of Animal Science and Technology, Northwest A&F University, Yangling 712100, China; Key Laboratory of Livestock Biology, Northwest A&F University, Yangling, Shaanxi 712100, China.
| |
|
63
|
Zhang L, Bartosovic M. Single-cell mapping of cell-type specific chromatin architecture in the central nervous system. Curr Opin Struct Biol 2024; 86:102824. [PMID: 38723561 DOI: 10.1016/j.sbi.2024.102824] [Received: 12/13/2023] [Revised: 03/22/2024] [Accepted: 04/08/2024] [Indexed: 05/19/2024]
Abstract
Determining how chromatin is structured in the nucleus is critical to studying its role in gene regulation. Recent advances in the analysis of single-cell chromatin architecture have considerably improved our understanding of cell-type-specific chromosome conformation and nuclear architecture. In this review, we discuss the methods used for analysis of 3D chromatin conformation, including sequencing-based methods, imaging-based techniques, and computational approaches. We further review the application of these methods in the study of the role of chromatin topology in neural development and disorders.
Affiliation(s)
- Letian Zhang
- Department of Biochemistry and Biophysics, Svante Arrhenius väg 16C, 162 53, Stockholm, Sweden. https://twitter.com/LetianZHANG_
| | - Marek Bartosovic
- Department of Biochemistry and Biophysics, Svante Arrhenius väg 16C, 162 53, Stockholm, Sweden.
| |
|
64
|
Naqvi S, Kim S, Tabatabaee S, Pampari A, Kundaje A, Pritchard JK, Wysocka J. Transfer learning reveals sequence determinants of the quantitative response to transcription factor dosage. bioRxiv 2024:2024.05.28.596078. [PMID: 38853998 PMCID: PMC11160683 DOI: 10.1101/2024.05.28.596078] [Indexed: 06/11/2024]
Abstract
Deep learning approaches have made significant advances in predicting cell type-specific chromatin patterns from the identity and arrangement of transcription factor (TF) binding motifs. However, most models have been applied in unperturbed contexts, precluding a predictive understanding of how chromatin state responds to TF perturbation. Here, we used transfer learning to train and interpret deep learning models that use DNA sequence to predict, with accuracy approaching experimental reproducibility, how the concentration of two dosage-sensitive TFs (TWIST1, SOX9) affects regulatory element (RE) chromatin accessibility in facial progenitor cells. High-affinity motifs that allow for heterotypic TF co-binding and are concentrated at the center of REs buffer against quantitative changes in TF dosage and strongly predict unperturbed accessibility. In contrast, motifs with low-affinity or homotypic binding distributed throughout REs lead to sensitive responses with minimal contributions to unperturbed accessibility. Both buffering and sensitizing features show signatures of purifying selection. We validated these predictive sequence features using reporter assays and showed that a biophysical model of TF-nucleosome competition can explain the sensitizing effect of low-affinity motifs. Our approach of combining transfer learning and quantitative measurements of the chromatin response to TF dosage therefore represents a powerful method to reveal additional layers of the cis-regulatory code.
Affiliation(s)
- Sahin Naqvi
- Departments of Chemical and Systems Biology and Developmental Biology, Stanford University School of Medicine, Stanford, CA, USA
- Department of Genetics, Stanford University, Stanford, California, USA
- Division of Gastroenterology, Hepatology, and Nutrition, Boston Children’s Hospital, Boston, MA, USA
- Department of Pediatrics, Harvard Medical School, Boston, MA, USA
- Lead contact
| | - Seungsoo Kim
- Departments of Chemical and Systems Biology and Developmental Biology, Stanford University School of Medicine, Stanford, CA, USA
- Howard Hughes Medical Institute, Stanford University School of Medicine, Stanford, CA, USA
- These authors contributed equally
| | - Saman Tabatabaee
- Departments of Chemical and Systems Biology and Developmental Biology, Stanford University School of Medicine, Stanford, CA, USA
- These authors contributed equally
| | - Anusri Pampari
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | - Anshul Kundaje
- Department of Genetics, Stanford University, Stanford, California, USA
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | - Jonathan K Pritchard
- Department of Genetics, Stanford University, Stanford, California, USA
- Department of Biology, Stanford University, Stanford, CA, USA
| | - Joanna Wysocka
- Departments of Chemical and Systems Biology and Developmental Biology, Stanford University School of Medicine, Stanford, CA, USA
- Howard Hughes Medical Institute, Stanford University School of Medicine, Stanford, CA, USA
| |
|
65
|
Qi T, Zhou Y, Sheng Y, Li Z, Yang Y, Liu Q, Ge Q. Prediction of Transcription Factor Binding Sites on Cell-Free DNA Based on Deep Learning. J Chem Inf Model 2024; 64:4002-4008. [PMID: 38798191 DOI: 10.1021/acs.jcim.4c00047] [Indexed: 05/29/2024]
Abstract
Transcription factors (TFs) are important regulatory elements for vital cellular activities, and the identification of transcription factor binding sites (TFBSs) can help to explore gene regulatory mechanisms. Studies have shown that cell-free DNA (cfDNA) has relatively higher coverage at TFBSs because bound TFs protect it from degradation by nucleases, and that short cfDNA fragments are enriched at TFBSs. However, the noninvasive identification of TFBSs by experimental techniques remains very difficult. In this study, we propose a deep learning-based approach that can noninvasively predict TFBSs in cfDNA by learning sequence information from known TFBSs through convolutional neural networks. With the addition of long short-term memory, our model achieved an area under the curve of 84%. Applying this model to cfDNA, we found consistent motifs in cfDNA fragments and lower coverage upstream and downstream of these fragments, which is consistent with a previous study. We also found that the binding sites of the same TF differ in different cell lines. TF-specific target genes were detected from cfDNA and were enriched in cancer-related pathways. In summary, our method of locating TFBSs from plasma has the potential to reflect the intrinsic regulatory mechanism from a noninvasive perspective and provide technical guidance for dynamic monitoring of disease in clinical practice.
Affiliation(s)
- Ting Qi
- State Key Laboratory of Digital Medical Engineering, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, People's Republic of China
| | - Ying Zhou
- State Key Laboratory of Digital Medical Engineering, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, People's Republic of China
| | - Yuqi Sheng
- State Key Laboratory of Digital Medical Engineering, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, People's Republic of China
| | - Zhihui Li
- State Key Laboratory of Digital Medical Engineering, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, People's Republic of China
| | - Yuwei Yang
- State Key Laboratory of Digital Medical Engineering, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, People's Republic of China
| | - Quanjun Liu
- State Key Laboratory of Digital Medical Engineering, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, People's Republic of China
| | - Qinyu Ge
- State Key Laboratory of Digital Medical Engineering, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, People's Republic of China
| |
|
66
|
Lally P, Gómez-Romero L, Tierrafría VH, Aquino P, Rioualen C, Zhang X, Kim S, Baniulyte G, Plitnick J, Smith C, Babu M, Collado-Vides J, Wade JT, Galagan JE. Predictive Biophysical Neural Network Modeling of a Compendium of in vivo Transcription Factor DNA Binding Profiles for Escherichia coli. bioRxiv 2024:2024.05.23.594371. [PMID: 38826350 PMCID: PMC11142182 DOI: 10.1101/2024.05.23.594371] [Indexed: 06/04/2024]
Abstract
The DNA binding of most Escherichia coli Transcription Factors (TFs) has not been comprehensively mapped, and few have models that can quantitatively predict binding affinity. We report the global mapping of in vivo DNA binding for 139 E. coli TFs using ChIP-Seq. We used these data to train BoltzNet, a novel neural network that predicts TF binding energy from DNA sequence. BoltzNet mirrors a quantitative biophysical model and provides directly interpretable predictions genome-wide at nucleotide resolution. We used BoltzNet to quantitatively design novel binding sites, which we validated with biophysical experiments on purified protein. We have generated models for 125 TFs that provide insight into global features of TF binding, including clustering of sites, the role of accessory bases, the relevance of weak sites, and the background affinity of the genome. Our paper provides new paradigms for studying TF-DNA binding and for the development of biophysically motivated neural networks.
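The abstract describes a network that predicts binding energy from sequence and mirrors a quantitative biophysical model, without giving the equations. The following is a minimal sketch of the general kind of relation such models build on, not BoltzNet's architecture: an additive per-position energy matrix and a two-state occupancy formula. The matrix values, the chemical potential `mu`, and all function names are hypothetical.

```python
import math

# Hypothetical additive energy matrix (units of kT) for a 4-bp site;
# lower energy means tighter binding. Values are illustrative only.
ENERGY = [
    {"A": 0.0, "C": 2.0, "G": 2.0, "T": 1.5},
    {"A": 2.0, "C": 0.0, "G": 1.0, "T": 2.0},
    {"A": 1.5, "C": 2.0, "G": 0.0, "T": 2.0},
    {"A": 2.0, "C": 1.0, "G": 2.0, "T": 0.0},
]

def site_energy(site):
    """Sum the per-position energy contributions for one candidate site."""
    return sum(ENERGY[i][base] for i, base in enumerate(site))

def occupancy(energy, mu=2.0):
    """Two-state occupancy 1 / (1 + exp(E - mu)), with E and mu in units of kT."""
    return 1.0 / (1.0 + math.exp(energy - mu))

def scan(sequence, mu=2.0):
    """Return (offset, energy, occupancy) for every window of the sequence."""
    width = len(ENERGY)
    results = []
    for i in range(len(sequence) - width + 1):
        e = site_energy(sequence[i:i + width])
        results.append((i, e, occupancy(e, mu)))
    return results
```

Calling `scan("ACGTAC")` scores three overlapping windows; the window with the lowest energy has the highest predicted occupancy, which is what makes a per-nucleotide energy model directly interpretable.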
Affiliation(s)
- Patrick Lally
- Department of Biomedical Engineering, Boston University, 44 Cummington Mall, Boston, MA 02215
| | - Laura Gómez-Romero
- Instituto Nacional de Medicina Genómica, Periférico Sur 4809, Arenal Tepepan, Ciudad de México 14610, México
- Escuela de Medicina y Ciencias de la Salud, Tecnológico de Monterrey, Ciudad de México, México
| | - Víctor H. Tierrafría
- Department of Biomedical Engineering, Boston University, 44 Cummington Mall, Boston, MA 02215
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n, Cuernavaca 62210, Morelos, México
| | - Patricia Aquino
- Department of Biomedical Engineering, Boston University, 44 Cummington Mall, Boston, MA 02215
| | - Claire Rioualen
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n, Cuernavaca 62210, Morelos, México
| | - Xiaoman Zhang
- Department of Biomedical Engineering, Boston University, 44 Cummington Mall, Boston, MA 02215
| | - Sunyoung Kim
- Department of Biochemistry, University of Regina, Regina, Saskatchewan, SK S4S 0A2, Canada
| | | | - Jonathan Plitnick
- Wadsworth Center, New York State Department of Health, Albany, NY, USA
| | - Carol Smith
- Wadsworth Center, New York State Department of Health, Albany, NY, USA
| | - Mohan Babu
- Department of Biochemistry, University of Regina, Regina, Saskatchewan, SK S4S 0A2, Canada
| | - Julio Collado-Vides
- Department of Biomedical Engineering, Boston University, 44 Cummington Mall, Boston, MA 02215
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n, Cuernavaca 62210, Morelos, México
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona 08003, Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Joseph T. Wade
- Wadsworth Center, New York State Department of Health, Albany, NY, USA
- Department of Biomedical Sciences, University at Albany, SUNY, Albany, NY, USA
| | - James E. Galagan
- Department of Biomedical Engineering, Boston University, 44 Cummington Mall, Boston, MA 02215
- Bioinformatics Program, Boston University, 24 Cummington Mall, Boston, MA 02215
| |
|
67
|
Xie Z, Xu X, Li L, Wu C, Ma Y, He J, Wei S, Wang J, Feng X. Residual networks without pooling layers improve the accuracy of genomic predictions. Theor Appl Genet 2024; 137:138. [PMID: 38771334 DOI: 10.1007/s00122-024-04649-2] [Received: 08/24/2023] [Accepted: 05/10/2024] [Indexed: 05/22/2024]
Abstract
KEY MESSAGE Residual neural network genomic selection is the first GS algorithm to reach 35 layers, and its prediction accuracy surpasses previous algorithms. With the decrease in DNA sequencing costs and the development of deep learning, phenotype prediction accuracy by genomic selection (GS) continues to improve. Residual networks, a widely validated deep learning technique, are introduced to deep learning for GS. Since each locus has a different weighted impact on the phenotype, strided convolutions are more suitable for GS problems than pooling layers. Through the above technological innovations, we propose a GS deep learning algorithm, residual neural network for genomic selection (ResGS). ResGS is the first neural network to reach 35 layers in GS. In 15 cases from four public datasets, the prediction accuracy of ResGS is higher than that of ridge-regression best linear unbiased prediction, support vector regression, random forest, gradient boosting regressor, and deep neural network genomic prediction in most cases. ResGS performs well in dealing with gene-environment interaction. Phenotypes from other environments are imported into ResGS along with genetic data. The prediction results are much better than when only genetic data are provided as input, which demonstrates the effectiveness of GS multi-modal learning. Standard deviation is recommended as an auxiliary GS evaluation metric, which could improve the distribution of predicted results. Deep learning for GS, such as ResGS, is becoming more accurate in phenotype prediction.
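The argument that strided convolutions suit GS better than pooling can be illustrated with a toy one-dimensional example. This is a sketch under stated assumptions, not ResGS code: genotypes are coded 0/1/2, the kernel weights are fixed by hand where a network would learn them, and both function names are hypothetical.

```python
def max_pool1d(x, size=2):
    """Downsample by keeping the maximum of each non-overlapping window."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]

def strided_conv1d(x, kernel, stride=2):
    """Downsample with a weighted sum per window; weights would be learned in training."""
    k = len(kernel)
    return [
        sum(w * v for w, v in zip(kernel, x[i:i + k]))
        for i in range(0, len(x) - k + 1, stride)
    ]

# Genotypes coded 0/1/2 at six loci (illustrative values).
genotypes = [0, 2, 1, 1, 2, 0]
pooled = max_pool1d(genotypes)                    # [2, 1, 2]
weighted = strided_conv1d(genotypes, [0.9, 0.1])  # each locus keeps its own weight
```

Max pooling returns the same value for the windows `[0, 2]` and `[2, 0]`, so it cannot tell which locus carried the allele. The strided convolution gives different outputs for the two windows because each position has its own weight, while still halving the sequence length.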
Affiliation(s)
| | - Xiaogang Xu
- School of Computer Science and Technology, Zhejiang Gongshang University, Hangzhou, 310012, China.
| | - Ling Li
- Zhejiang Laboratory, Hangzhou, 311100, China
| | - Cuiling Wu
- Zhejiang Laboratory, Hangzhou, 311100, China
| | - Yinxing Ma
- Zhejiang Laboratory, Hangzhou, 311100, China
| | - Jingjing He
- Zhejiang Laboratory, Hangzhou, 311100, China
| | - Sidi Wei
- Zhejiang Laboratory, Hangzhou, 311100, China
| | - Jun Wang
- Zhejiang Laboratory, Hangzhou, 311100, China
| | - Xianzhong Feng
- Key Laboratory of Soybean Molecular Design Breeding, Northeast Institute of Geography and Agroecology, Chinese Academy of Sciences, Changchun, 130102, China
| |
|
68
|
Baumgarten N, Rumpf L, Kessler T, Schulz MH. A statistical approach for identifying single nucleotide variants that affect transcription factor binding. iScience 2024; 27:109765. [PMID: 38736546 PMCID: PMC11088338 DOI: 10.1016/j.isci.2024.109765] [Received: 07/14/2023] [Revised: 01/30/2024] [Accepted: 04/15/2024] [Indexed: 05/14/2024]
Abstract
Non-coding variants located within regulatory elements may alter gene expression by modifying transcription factor (TF) binding sites, thereby leading to functional consequences. Different TF models are being used to assess the effect of DNA sequence variants, such as single nucleotide variants (SNVs). Existing methods are often slow and do not assess the statistical significance of results. We investigated the distribution of absolute maximal differential TF binding scores for general computational models that affect TF binding. We find that a modified Laplace distribution can adequately approximate the empirical distributions. A benchmark on in vitro and in vivo datasets showed that our approach improves upon an existing method in terms of performance and speed. Applications on eQTLs and on a genome-wide association study illustrate the usefulness of our statistics by highlighting cell type-specific regulators and target genes. An implementation of our approach is freely available on GitHub and as a Bioconda package.
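The abstract reports that a modified Laplace distribution approximates the score distribution but does not give the modification. As a rough sketch of the underlying idea only, the example below uses a plain zero-centred Laplace distribution (an assumption, as are the function names and the made-up background scores): the scale is fitted by maximum likelihood and the tail probability of an observed score follows in closed form.

```python
import math

def fit_laplace_scale(diff_scores):
    """MLE of the scale b for a Laplace distribution centred at zero: mean(|x|)."""
    if not diff_scores:
        raise ValueError("need at least one background score")
    return sum(abs(x) for x in diff_scores) / len(diff_scores)

def two_sided_p_value(score, scale):
    """P(|X| >= |score|) for X ~ Laplace(0, scale), which equals exp(-|score| / scale)."""
    return math.exp(-abs(score) / scale)

background = [-1.0, 0.5, -0.5, 2.0, 1.0]  # made-up differential binding scores
b = fit_laplace_scale(background)          # 1.0
p = two_sided_p_value(3.0, b)              # exp(-3)
```

Because the tail probability is a single exponential, scoring a variant costs almost nothing once the scale is known, which is one plausible reason a closed-form approximation can be faster than sampling-based significance estimates.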
Affiliation(s)
- Nina Baumgarten
- Institute of Cardiovascular Regeneration, Goethe University, 60590 Frankfurt am Main, Germany
- Institute for Computational Genomic Medicine, Goethe University, 60590 Frankfurt am Main, Germany
- Institute for Computer Science, Goethe University, 60590 Frankfurt am Main, Germany
- German Center for Cardiovascular Research, Partner Site Rhein-Main, 60590 Frankfurt am Main, Germany
| | - Laura Rumpf
- Institute of Cardiovascular Regeneration, Goethe University, 60590 Frankfurt am Main, Germany
- Institute for Computational Genomic Medicine, Goethe University, 60590 Frankfurt am Main, Germany
- Institute for Computer Science, Goethe University, 60590 Frankfurt am Main, Germany
- German Center for Cardiovascular Research, Partner Site Rhein-Main, 60590 Frankfurt am Main, Germany
| | - Thorsten Kessler
- German Heart Centre Munich, Department of Cardiology, School of Medicine and Health, Technical University of Munich, 80636 Munich, Germany
- German Centre for Cardiovascular Research, Partner Site Munich Heart Alliance, 80636 Munich, Germany
| | - Marcel H. Schulz
- Institute of Cardiovascular Regeneration, Goethe University, 60590 Frankfurt am Main, Germany
- Institute for Computational Genomic Medicine, Goethe University, 60590 Frankfurt am Main, Germany
- Institute for Computer Science, Goethe University, 60590 Frankfurt am Main, Germany
- German Center for Cardiovascular Research, Partner Site Rhein-Main, 60590 Frankfurt am Main, Germany
| |
|
69
|
García Sánchez N, Ugarte Carro E, Prieto-Santamaría L, Rodríguez-González A. Protein sequence analysis in the context of drug repurposing. BMC Med Inform Decis Mak 2024; 24:122. [PMID: 38741115 DOI: 10.1186/s12911-024-02531-1] [Received: 12/01/2023] [Accepted: 05/08/2024] [Indexed: 05/16/2024]
Abstract
MOTIVATION Drug repurposing speeds up the development of new treatments, being less costly, risky, and time consuming than de novo drug discovery. There are numerous biological elements that contribute to the development of diseases and, as a result, to the repurposing of drugs. METHODS In this article, we analysed the potential role of protein sequences in drug repurposing scenarios. For this purpose, we embedded the protein sequences using four state-of-the-art methods and validated their capacity to encapsulate essential biological information through visualization. Then, we compared the sequence distances between protein-drug target pairs in drug repurposing and non-drug-repurposing data. Thus, we were able to uncover patterns that define protein sequences in repurposing cases. RESULTS We found statistically significant sequence distance differences between protein pairs in the repurposing data and the rest of the protein pairs in non-repurposing data. In this manner, we verified the potential of using numerical representations of sequences to generate repurposing hypotheses in the future.
Affiliation(s)
- Natalia García Sánchez
- Centro de Tecnología Biomédica, Universidad Politécnica de Madrid, Pozuelo de Alarcón, Madrid, 28223, Spain
| | - Esther Ugarte Carro
- Centro de Tecnología Biomédica, Universidad Politécnica de Madrid, Pozuelo de Alarcón, Madrid, 28223, Spain
| | - Lucía Prieto-Santamaría
- Centro de Tecnología Biomédica, Universidad Politécnica de Madrid, Pozuelo de Alarcón, Madrid, 28223, Spain
- ETS de Ingenieros Informáticos, Universidad Politécnica de Madrid, Boadilla del Monte, Madrid, 28660, Spain
| | - Alejandro Rodríguez-González
- Centro de Tecnología Biomédica, Universidad Politécnica de Madrid, Pozuelo de Alarcón, Madrid, 28223, Spain.
- ETS de Ingenieros Informáticos, Universidad Politécnica de Madrid, Boadilla del Monte, Madrid, 28660, Spain.
| |
|
70
|
Foroozandeh Shahraki M, Farahbod M, Libbrecht MW. Robust chromatin state annotation. Genome Res 2024; 34:469-483. [PMID: 38514204 PMCID: PMC11067878 DOI: 10.1101/gr.278343.123] [Received: 07/28/2023] [Accepted: 03/19/2024] [Indexed: 03/23/2024]
Abstract
With the goal of mapping genomic activity, international projects have recently measured epigenetic activity in hundreds of cell and tissue types. Chromatin state annotations produced by segmentation and genome annotation (SAGA) methods have emerged as the predominant way to summarize these epigenomic data sets in order to annotate the genome. These chromatin state annotations are essential for many genomic tasks, including identifying active regulatory elements and interpreting disease-associated genetic variation. However, despite the widespread applications of SAGA methods, no principled approach exists to evaluate the statistical significance of chromatin state assignments. Here, we propose the first method for assigning calibrated confidence scores to chromatin state annotations. Toward this goal, we performed a comprehensive evaluation of the reproducibility of the two most widely used existing SAGA methods, ChromHMM and Segway. We found that their predictions are frequently irreproducible. For example, when applying the same SAGA method on two sets of experimental replicates, 27%-69% of predicted enhancers fail to replicate. This suggests that a substantial fraction of predicted elements in existing chromatin state annotations cannot be relied upon. To remedy this problem, we introduce SAGAconf, a method for assigning a measure of confidence (r-value) to chromatin state annotations. SAGAconf works with any SAGA method and assigns an r-value to each genomic bin of a chromatin state annotation that represents the probability that the label of this bin will be reproduced in a replicated experiment. Thus, SAGAconf allows a researcher to select only the reliable predictions from a chromatin annotation for use in downstream analyses.
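The abstract defines the r-value as the probability that a bin's label is reproduced in a replicate, but does not describe how it is calibrated. The sketch below is a deliberately naive stand-in, not the SAGAconf method: it estimates one reproducibility score per label as the fraction of bins that keep that label in a second replicate. The function names, the label strings, and the 0.9 threshold are all hypothetical.

```python
from collections import Counter

def label_reproducibility(rep1, rep2):
    """For each label in rep1, the fraction of its bins carrying the same label in rep2."""
    if len(rep1) != len(rep2):
        raise ValueError("annotations must cover the same bins")
    total = Counter(rep1)
    agree = Counter(a for a, b in zip(rep1, rep2) if a == b)
    return {label: agree[label] / n for label, n in total.items()}

def reliable_bins(rep1, rep2, threshold=0.9):
    """Indices of rep1 bins whose label-level reproducibility meets the threshold."""
    scores = label_reproducibility(rep1, rep2)
    return [i for i, label in enumerate(rep1) if scores[label] >= threshold]

# Toy chromatin state annotations of six genomic bins from two replicates.
rep1 = ["Enh", "Enh", "Prom", "Quies", "Quies", "Enh"]
rep2 = ["Enh", "Quies", "Prom", "Quies", "Quies", "Quies"]
```

In this toy input only one of three enhancer bins replicates, so `reliable_bins(rep1, rep2)` drops all enhancer bins and keeps the promoter and quiescent ones. A per-bin score, as the abstract describes, would be finer grained than this per-label estimate.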
Affiliation(s)
| | - Marjan Farahbod
- School of Computing Science, Simon Fraser University, Burnaby, British Columbia V5A 1S6, Canada
| | - Maxwell W Libbrecht
- School of Computing Science, Simon Fraser University, Burnaby, British Columbia V5A 1S6, Canada
| |
|
71
|
Zou G, Huang Y, Zhang S, Ko KP, Kim B, Zhang J, Venkatesan V, Pizzi MP, Fan Y, Jun S, Niu N, Wang H, Song S, Ajani JA, Park JI. E-cadherin loss drives diffuse-type gastric tumorigenesis via EZH2-mediated reprogramming. J Exp Med 2024; 221:e20230561. [PMID: 38411616 PMCID: PMC10899090 DOI: 10.1084/jem.20230561] [Received: 04/03/2023] [Revised: 09/27/2023] [Accepted: 01/29/2024] [Indexed: 02/28/2024]
Abstract
Diffuse-type gastric adenocarcinoma (DGAC) is a deadly cancer often diagnosed late and resistant to treatment. While hereditary DGAC is linked to CDH1 mutations, the role of CDH1/E-cadherin inactivation in sporadic DGAC tumorigenesis remains elusive. We discovered CDH1 inactivation in a subset of DGAC patient tumors. Analyzing single-cell transcriptomes in malignant ascites, we identified two DGAC subtypes: DGAC1 (CDH1 loss) and DGAC2 (lacking immune response). DGAC1 displayed distinct molecular signatures, activated DGAC-related pathways, and an abundance of exhausted T cells in ascites. Genetically engineered murine gastric organoids showed that Cdh1 knock-out (KO), KrasG12D, Trp53 KO (EKP) accelerates tumorigenesis with immune evasion compared with KrasG12D, Trp53 KO (KP). We also identified EZH2 as a key mediator promoting CDH1 loss-associated DGAC tumorigenesis. These findings highlight DGAC's molecular diversity and potential for personalized treatment in CDH1-inactivated patients.
Affiliation(s)
- Gengyi Zou
- Division of Radiation Oncology, Department of Experimental Radiation Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
| | - Yuanjian Huang
- Division of Radiation Oncology, Department of Experimental Radiation Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Department of General Surgery, The First Affiliated Hospital of Nanjing Medical University, Nanjing, China
| | - Shengzhe Zhang
- Division of Radiation Oncology, Department of Experimental Radiation Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
| | - Kyung-Pil Ko
- Division of Radiation Oncology, Department of Experimental Radiation Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
| | - Bongjun Kim
- Division of Radiation Oncology, Department of Experimental Radiation Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
| | - Jie Zhang
- Division of Radiation Oncology, Department of Experimental Radiation Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
| | - Vishwa Venkatesan
- Division of Radiation Oncology, Department of Experimental Radiation Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
| | - Melissa P. Pizzi
- Department of GI Medical Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
| | - Yibo Fan
- Department of GI Medical Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
| | - Sohee Jun
- Division of Radiation Oncology, Department of Experimental Radiation Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
| | - Na Niu
- Department of Pathology, Yale School of Medicine, New Haven, CT, USA
| | - Huamin Wang
- Division of Pathology/Lab Medicine, Department of Pathology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
| | - Shumei Song
- Department of GI Medical Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
| | - Jaffer A. Ajani
- Department of GI Medical Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
| | - Jae-Il Park
- Division of Radiation Oncology, Department of Experimental Radiation Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Graduate School of Biomedical Sciences, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Program in Genetics and Epigenetics, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
| |
Collapse
|
72
|
Duncan AG, Mitchell JA, Moses AM. Improving the performance of supervised deep learning for regulatory genomics using phylogenetic augmentation. Bioinformatics 2024; 40:btae190. [PMID: 38588559 PMCID: PMC11042905 DOI: 10.1093/bioinformatics/btae190] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/18/2023] [Revised: 01/12/2024] [Accepted: 04/05/2024] [Indexed: 04/10/2024] Open
Abstract
MOTIVATION Supervised deep learning is used to model the complex relationship between genomic sequence and regulatory function. Understanding how these models make predictions can provide biological insight into regulatory functions. Given the complexity of the sequence to regulatory function mapping (the cis-regulatory code), it has been suggested that the genome contains insufficient sequence variation to train models with suitable complexity. Data augmentation is a widely used approach to increase the data variation available for model training; however, current data augmentation methods for genomic sequence data are limited. RESULTS Inspired by the success of comparative genomics, we show that augmenting genomic sequences with evolutionarily related sequences from other species, which we term phylogenetic augmentation, improves the performance of deep learning models trained on regulatory genomic sequences to predict high-throughput functional assay measurements. Additionally, we show that phylogenetic augmentation can rescue model performance when the training set is down-sampled and permits deep learning on a real-world small dataset, demonstrating that this approach improves data efficiency. Overall, this data augmentation method represents a solution for improving model performance that is applicable to many supervised deep-learning problems in genomics. AVAILABILITY AND IMPLEMENTATION The open-source GitHub repository agduncan94/phylogenetic_augmentation_paper includes the code for rerunning the analyses here and recreating the figures.
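The augmentation idea in this abstract can be illustrated with a minimal sketch. It assumes each training sequence comes with a list of evolutionarily related sequences that inherit its label. The names `phylo_augment`, `orthologs`, and `p_swap`, as well as the toy sequences, are hypothetical and are not taken from the authors' repository.

```python
import random

def phylo_augment(seq, label, orthologs, rng, p_swap=1.0):
    """Return a (sequence, label) training pair.

    With probability p_swap the reference sequence is replaced by a
    randomly chosen related sequence from another species. The label is
    kept unchanged, on the assumption that regulatory function is conserved.
    """
    if orthologs and rng.random() < p_swap:
        return rng.choice(orthologs), label
    return seq, label

# Toy example: one reference sequence with two made-up orthologs.
rng = random.Random(0)
reference = "ACGTTGCA"
orthologs = ["ACGTTGTA", "ACATTGCA"]
batch = [phylo_augment(reference, 1.0, orthologs, rng) for _ in range(4)]
unchanged = phylo_augment(reference, 1.0, orthologs, rng, p_swap=0.0)
```

Applying the swap at batch-construction time means the model may see a different species' sequence for the same label in each epoch, which is one plausible way to increase sequence variation without new measurements.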
Affiliation(s)
- Andrew G Duncan
  - Cell & Systems Biology, University of Toronto, Toronto, ON M5S 3G5, Canada
- Alan M Moses
  - Cell & Systems Biology, University of Toronto, Toronto, ON M5S 3G5, Canada
|
73
|
Chen Z, Ain NU, Zhao Q, Zhang X. From tradition to innovation: conventional and deep learning frameworks in genome annotation. Brief Bioinform 2024; 25:bbae138. [PMID: 38581418 PMCID: PMC10998533 DOI: 10.1093/bib/bbae138] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2023] [Revised: 03/08/2024] [Accepted: 03/10/2024] [Indexed: 04/08/2024] Open
Abstract
Following the milestone success of the Human Genome Project, the 'Encyclopedia of DNA Elements (ENCODE)' initiative was launched in 2003 to unearth information about the numerous functional elements within the genome. This endeavor coincided with the emergence of numerous novel technologies and with the availability of vast amounts of whole-genome sequence and high-throughput data such as ChIP-Seq and RNA-Seq. Extracting biologically meaningful information from this massive dataset has become a critical aspect of many recent studies, particularly in annotating and predicting the functions of unknown genes. The core idea behind genome annotation is to identify genes and various functional elements within the genome sequence and infer their biological functions. Traditional wet-lab experimental methods still rely on extensive efforts for functional verification. However, early bioinformatics algorithms and software primarily employed shallow learning techniques; thus, their ability to characterize data and learn features was limited. With the widespread adoption of RNA-Seq technology, scientists from the biological community began to harness the potential of machine learning and deep learning approaches for gene structure prediction and functional annotation. In this context, we reviewed both conventional methods and contemporary deep learning frameworks, and highlighted novel perspectives on the challenges arising during annotation, underscoring the dynamic nature of this evolving scientific landscape.
Affiliation(s)
- Zhaojia Chen
  - National Key Laboratory for Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangzhou 518120, China
  - College of Biomedical Engineering, Taiyuan University of Technology, Jinzhong 030600, China
- Noor ul Ain
  - National Key Laboratory for Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangzhou 518120, China
- Qian Zhao
  - State Key Laboratory for Ecological Pest Control of Fujian/Taiwan Crops and College of Life Science, Fujian Agriculture and Forestry University, Fuzhou, 350002, China
- Xingtan Zhang
  - National Key Laboratory for Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangzhou 518120, China
|
74
|
Oktavian MR, Nistor J, Gruenwald JT, Xu Y. Integrating core physics and machine learning for improved parameter prediction in boiling water reactor operations. Sci Rep 2024; 14:5835. [PMID: 38461347 PMCID: PMC10924948 DOI: 10.1038/s41598-024-56388-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/21/2023] [Accepted: 03/06/2024] [Indexed: 03/11/2024] Open
Abstract
This study introduces a novel method for enhancing Boiling Water Reactor (BWR) operation simulations by integrating machine learning (ML) models with conventional simulation techniques. The ML model is trained to identify and correct errors in low-fidelity simulation outputs, traditionally derived from core physics computations. These corrections aim to align the low-fidelity results closely with high-fidelity data. Precise predictions of nuclear reactor parameters like core eigenvalue and power distribution are crucial for efficient fuel management and adherence to technical specifications. Current high-fidelity transport calculations, while accurate, are impractical for real-time predictions due to extensive computational demands. Our approach, therefore, utilizes the standard two-step simulation process (assembly-level lattice physics calculations followed by whole-core nodal diffusion computations) to generate initial results, which are then refined using the ML-based error correction model. The methodology focuses on improving simulation accuracy in regular BWR operations rather than developing a universal ML predictor for reactor physics. By training an advanced neural network model on the difference between high-fidelity and low-fidelity simulations, the model can reduce the nodal power error of low-fidelity simulations to around 1% on average and the core eigenvalue error to under 100 pcm. This result holds under the normal variations of control rod pattern and core flow rate in standard BWR operations that were used in the training and evaluation of the machine learning model. This work suggests a promising approach for achieving more accurate, computationally feasible simulation solutions in nuclear reactor operation and management.
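The error-correction scheme described in this abstract (train on the difference between high- and low-fidelity results, then add the predicted difference back to the low-fidelity output) can be sketched as follows. This is an illustration under stated assumptions, not the authors' model: a one-variable least-squares line stands in for the neural network, and the feature (`flow`) and all numbers are made-up toy values.

```python
def fit_residual_model(features, low, high):
    """Fit a straight line to (feature, high - low) pairs by least squares."""
    residuals = [h - l for h, l in zip(high, low)]
    n = len(features)
    mean_x = sum(features) / n
    mean_r = sum(residuals) / n
    sxx = sum((x - mean_x) ** 2 for x in features)
    sxr = sum((x - mean_x) * (r - mean_r) for x, r in zip(features, residuals))
    slope = sxr / sxx
    intercept = mean_r - slope * mean_x
    return slope, intercept

def correct(low_value, feature, model):
    """Corrected prediction = low-fidelity value + predicted residual."""
    slope, intercept = model
    return low_value + slope * feature + intercept

# Toy data: the "high-fidelity" values differ from the low-fidelity ones
# by an error that depends linearly on a single operating feature.
flow = [0.8, 0.9, 1.0, 1.1]
low = [1.00, 1.02, 1.05, 1.07]
high = [l + 0.02 * f - 0.01 for l, f in zip(low, flow)]

model = fit_residual_model(flow, low, high)
corrected = [correct(l, f, model) for l, f in zip(low, flow)]
max_error = max(abs(c - h) for c, h in zip(corrected, high))
```

Learning the residual rather than the full quantity keeps the physics-based simulation as the baseline, so the learned component only has to capture the comparatively small discrepancy.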
Affiliation(s)
- M R Oktavian
  - Blue Wave AI Labs, 1281 Win Hentschel Blvd, West Lafayette, IN, 47906, USA
  - School of Nuclear Engineering, Purdue University, 363 North Grant Street, #5281, West Lafayette, IN, 47907, USA
- J Nistor
  - Blue Wave AI Labs, 1281 Win Hentschel Blvd, West Lafayette, IN, 47906, USA
  - Department of Physics and Astronomy, Purdue University, 525 Northwestern Avenue, West Lafayette, IN, 47907, USA
- J T Gruenwald
  - Blue Wave AI Labs, 1281 Win Hentschel Blvd, West Lafayette, IN, 47906, USA
- Y Xu
  - School of Nuclear Engineering, Purdue University, 363 North Grant Street, #5281, West Lafayette, IN, 47907, USA
|
75
|
Michielsen L, Reinders MJT, Mahfouz A. Predicting cell population-specific gene expression from genomic sequence. Front Bioinform 2024; 4:1347276. [PMID: 38501113 PMCID: PMC10944912 DOI: 10.3389/fbinf.2024.1347276] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2023] [Accepted: 01/23/2024] [Indexed: 03/20/2024] Open
Abstract
Most regulatory elements, especially enhancer sequences, are cell population-specific. One could even argue that a distinct set of regulatory elements is what defines a cell population. However, discovering which non-coding regions of the DNA are essential in which context, and as a result, which genes are expressed, is a difficult task. Some computational models tackle this problem by predicting gene expression directly from the genomic sequence. These models are currently limited to predicting bulk measurements and mainly make tissue-specific predictions. Here, we present a model that leverages single-cell RNA-sequencing data to predict gene expression. We show that cell population-specific models outperform tissue-specific models, especially when the expression profile of a cell population and the corresponding tissue are dissimilar. Further, we show that our model can prioritize GWAS variants and learn motifs of transcription factor binding sites. We envision that our model can be useful for delineating cell population-specific regulatory elements.
Affiliation(s)
- Lieke Michielsen
  - Department of Human Genetics, Leiden University Medical Center, Leiden, Netherlands
  - Leiden Computational Biology Center, Leiden University Medical Center, Leiden, Netherlands
  - Delft Bioinformatics Lab, Delft University of Technology, Delft, Netherlands
- Marcel J. T. Reinders
  - Department of Human Genetics, Leiden University Medical Center, Leiden, Netherlands
  - Leiden Computational Biology Center, Leiden University Medical Center, Leiden, Netherlands
  - Delft Bioinformatics Lab, Delft University of Technology, Delft, Netherlands
- Ahmed Mahfouz
  - Department of Human Genetics, Leiden University Medical Center, Leiden, Netherlands
  - Leiden Computational Biology Center, Leiden University Medical Center, Leiden, Netherlands
  - Delft Bioinformatics Lab, Delft University of Technology, Delft, Netherlands
|
76
|
Gong M, Yu Y, Wang Z, Zhang J, Wang X, Fu C, Zhang Y, Wang X. scAuto as a comprehensive framework for single-cell chromatin accessibility data analysis. Comput Biol Med 2024; 171:108230. [PMID: 38442554 DOI: 10.1016/j.compbiomed.2024.108230] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/11/2023] [Revised: 02/06/2024] [Accepted: 02/25/2024] [Indexed: 03/07/2024]
Abstract
Interpreting single-cell chromatin accessibility data is crucial for understanding the regulation of intercellular heterogeneity. Despite progress in computational methods for analyzing these data, a comprehensive analytical framework and a user-friendly online analysis tool are still lacking. To fill this gap, we developed a pre-trained deep learning-based framework, single-cell auto-correlation transformers (scAuto). Following DNABERT's methodology of pre-training and fine-tuning, scAuto learns the general grammar of DNA sequence through self-supervised pre-training on the unlabeled human genome; it is then transferred to the single-cell chromatin accessibility analysis task on scATAC-seq data for supervised fine-tuning. We extensively validated scAuto on the Buenrostro2018 dataset, demonstrating its superior performance on chromatin accessibility prediction, single-cell clustering, and data denoising. Based on scAuto, we further developed an interactive web server for single-cell chromatin accessibility data analysis. It integrates tutorial-style interfaces for those with limited programming skills. The platform is accessible at http://zhanglab.icaup.cn. We expect this work to help analyze single-cell chromatin accessibility data and facilitate the development of precision medicine.
Affiliation(s)
- Meiqin Gong
  - Department of Obstetrics and Gynecology, West China Second University Hospital, Sichuan University, Chengdu, 610041, China
- Yun Yu
  - School of Computer Science, Chengdu University of Information Technology, Chengdu, 610225, China
- Zixuan Wang
  - College of Electronics and Information Engineering, Sichuan University, Chengdu, 610065, China
- Junming Zhang
  - School of Computer Science, Chengdu University of Information Technology, Chengdu, 610225, China
- Xiongyi Wang
  - School of Computer Science, Chengdu University of Information Technology, Chengdu, 610225, China
- Cheng Fu
  - School of Computer Science, Chengdu University of Information Technology, Chengdu, 610225, China
- Yongqing Zhang
  - School of Computer Science, Chengdu University of Information Technology, Chengdu, 610225, China
- Xiaodong Wang
  - Department of Obstetrics and Gynecology, West China Second University Hospital, Sichuan University, Chengdu, 610041, China
|
77
|
Kwak IY, Kim BC, Lee J, Kang T, Garry DJ, Zhang J, Gong W. Proformer: a hybrid macaron transformer model predicts expression values from promoter sequences. BMC Bioinformatics 2024; 25:81. [PMID: 38378442 PMCID: PMC10877777 DOI: 10.1186/s12859-024-05645-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/24/2023] [Accepted: 01/08/2024] [Indexed: 02/22/2024] Open
Abstract
The breakthrough high-throughput measurement of the cis-regulatory activity of millions of randomly generated promoters provides an unprecedented opportunity to systematically decode the cis-regulatory logic that determines the expression values. We developed an end-to-end transformer encoder architecture named Proformer to predict the expression values from DNA sequences. Proformer used a Macaron-like Transformer encoder architecture, where two half-step feed-forward network (FFN) layers were placed at the beginning and the end of each encoder block, and a separable 1D convolution layer was inserted after the first FFN layer and in front of the multi-head attention layer. The sliding k-mers from one-hot encoded sequences were mapped onto a continuous embedding, combined with the learned positional embedding and strand embedding (forward strand vs. reverse complemented strand) as the sequence input. Moreover, Proformer introduced multiple expression heads with mask filling to prevent the transformer models from collapsing when training on a relatively small amount of data. We empirically determined that this design performed significantly better than conventional designs, such as using a global pooling layer as the output layer for the regression task. These analyses support the notion that Proformer provides a novel method of learning and enhances our understanding of how cis-regulatory sequences determine the expression values.
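The input encoding mentioned in this abstract (sliding k-mers plus a strand indicator) can be sketched in a few lines. This is only an illustration of the general idea: the base-4 integer index per k-mer is an assumption made here so that each k-mer could address a row of an embedding table, and the function names are hypothetical rather than taken from Proformer.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

def sliding_kmer_ids(seq, k):
    """Map each overlapping k-mer to an integer in [0, 4**k)."""
    ids = []
    for start in range(len(seq) - k + 1):
        value = 0
        for base in seq[start:start + k]:
            value = value * 4 + BASE_INDEX[base]
        ids.append(value)
    return ids

def encode_both_strands(seq, k):
    """Return (k-mer ids, strand flag) for the forward and reverse-complement strands."""
    return [
        (sliding_kmer_ids(seq, k), 0),                      # 0 = forward strand
        (sliding_kmer_ids(reverse_complement(seq), k), 1),  # 1 = reverse complement
    ]

encoded = encode_both_strands("ACGTA", k=3)
# Forward k-mers ACG, CGT, GTA -> [6, 27, 44]
# Reverse complement TACGT: TAC, ACG, CGT -> [49, 6, 27]
```

In a full model, the k-mer ids, their positions, and the strand flag would each be looked up in a learned embedding and combined to form the sequence input.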
Affiliation(s)
- Il-Youp Kwak
  - Department of Applied Statistics, Chung-Ang University, Seoul, Republic of Korea
- Byeong-Chan Kim
  - Department of Applied Statistics, Chung-Ang University, Seoul, Republic of Korea
- Juhyun Lee
  - Department of Applied Statistics, Chung-Ang University, Seoul, Republic of Korea
- Taein Kang
  - Department of Applied Statistics, Chung-Ang University, Seoul, Republic of Korea
- Daniel J Garry
  - Cardiovascular Division, Department of Medicine, Lillehei Heart Institute, University of Minnesota, 2231 6th St SE, Minneapolis, MN, 55455, USA
  - Stem Cell Institute, University of Minnesota, Minneapolis, MN, 55455, USA
  - Paul and Sheila Wellstone Muscular Dystrophy Center, University of Minnesota, Minneapolis, MN, 55455, USA
- Jianyi Zhang
  - Department of Biomedical Engineering, The University of Alabama at Birmingham, Birmingham, AL, 35233, USA
- Wuming Gong
  - Cardiovascular Division, Department of Medicine, Lillehei Heart Institute, University of Minnesota, 2231 6th St SE, Minneapolis, MN, 55455, USA
|
78
|
Rafi AM, Nogina D, Penzar D, Lee D, Lee D, Kim N, Kim S, Kim D, Shin Y, Kwak IY, Meshcheryakov G, Lando A, Zinkevich A, Kim BC, Lee J, Kang T, Vaishnav ED, Yadollahpour P, Kim S, Albrecht J, Regev A, Gong W, Kulakovskiy IV, Meyer P, de Boer C. Evaluation and optimization of sequence-based gene regulatory deep learning models. bioRxiv 2024:2023.04.26.538471. [PMID: 38405704 PMCID: PMC10888977 DOI: 10.1101/2023.04.26.538471] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 02/27/2024]
Abstract
Neural networks have emerged as immensely powerful tools in predicting functional genomic regions, notably evidenced by recent successes in deciphering gene regulatory logic. However, a systematic evaluation of how model architectures and training strategies impact genomics model performance is lacking. To address this gap, we held a DREAM Challenge where competitors trained models on a dataset of millions of random promoter DNA sequences and corresponding expression levels, experimentally determined in yeast, to best capture the relationship between regulatory DNA and gene expression. For a robust evaluation of the models, we designed a comprehensive suite of benchmarks encompassing various sequence types. While some benchmarks produced similar results across the top-performing models, others differed substantially. All top-performing models used neural networks, but diverged in architectures and novel training strategies, tailored to genomics sequence data. To dissect how architectural and training choices impact performance, we developed the Prix Fixe framework to divide any given model into logically equivalent building blocks. We tested all possible combinations for the top three models and observed performance improvements for each. The DREAM Challenge models not only achieved state-of-the-art results on our comprehensive yeast dataset but also consistently surpassed existing benchmarks on Drosophila and human genomic datasets. Overall, we demonstrate that high-quality gold-standard genomics datasets can drive significant progress in model development.
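The "all possible combinations" experiment described in this abstract is a Cartesian product over building blocks. The sketch below shows only that enumeration step. The abstract does not name the blocks, so the three slot names and the team labels are placeholders, and no training or evaluation is implied.

```python
from itertools import product

# Placeholders: the abstract does not name the models or their blocks.
teams = ["team_1", "team_2", "team_3"]
slots = ["slot_a", "slot_b", "slot_c"]

def all_combinations(teams, slots):
    """Yield one dict per hybrid model: slot name -> team supplying that block."""
    for choice in product(teams, repeat=len(slots)):
        yield dict(zip(slots, choice))

combos = list(all_combinations(teams, slots))
n_combos = len(combos)  # 3 teams ** 3 slots = 27 candidate models
```

The count grows as the number of teams raised to the number of slots, which is why dividing models into a small, fixed set of interchangeable blocks keeps an exhaustive comparison tractable.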
Affiliation(s)
- Daria Nogina
  - Lomonosov Moscow State University, Moscow, Russia
- Dmitry Penzar
  - Lomonosov Moscow State University, Moscow, Russia
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia
- Dohoon Lee
  - Seoul National University, Seoul, South Korea
- Nayeon Kim
  - Seoul National University, Seoul, South Korea
- Dohyeon Kim
  - Seoul National University, Seoul, South Korea
- Yeojin Shin
  - Seoul National University, Seoul, South Korea
- Arsenii Zinkevich
  - Lomonosov Moscow State University, Moscow, Russia
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia
- Juhyun Lee
  - Chung-Ang University, Seoul, South Korea
- Taein Kang
  - Chung-Ang University, Seoul, South Korea
- Sun Kim
  - Seoul National University, Seoul, South Korea
- Aviv Regev
  - Broad Institute of MIT and Harvard, Massachusetts, United States
  - Genentech, South San Francisco, CA, USA
- Wuming Gong
  - University of Minnesota, Minneapolis, United States
- Ivan V Kulakovskiy
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia
  - Institute of Protein Research, Russian Academy of Sciences, Pushchino, Russia
- Carl de Boer
  - University of British Columbia, Vancouver, BC, Canada
|
79
|
Hassan J, Saeed SM, Deka L, Uddin MJ, Das DB. Applications of Machine Learning (ML) and Mathematical Modeling (MM) in Healthcare with Special Focus on Cancer Prognosis and Anticancer Therapy: Current Status and Challenges. Pharmaceutics 2024; 16:260. [PMID: 38399314 PMCID: PMC10892549 DOI: 10.3390/pharmaceutics16020260] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2023] [Revised: 01/29/2024] [Accepted: 02/07/2024] [Indexed: 02/25/2024] Open
Abstract
The use of data-driven high-throughput analytical techniques, which has given rise to computational oncology, is undisputed. The widespread use of machine learning (ML) and mathematical modeling (MM)-based techniques is widely acknowledged. These two approaches have fueled the advancement in cancer research and eventually led to the uptake of telemedicine in cancer care. For diagnostic, prognostic, and treatment purposes concerning different types of cancer research, vast databases of varied information with manifold dimensions are required, and indeed, all this information can only be managed by an automated system developed utilizing ML and MM. In addition, MM is being used to probe the relationship between the pharmacokinetics and pharmacodynamics (PK/PD interactions) of anti-cancer substances to improve cancer treatment, and also to refine the quality of existing treatment models by being incorporated at all steps of research and development related to cancer and in routine patient care. This review will serve as a consolidation of the advancement and benefits of ML and MM techniques with a special focus on the area of cancer prognosis and anticancer therapy, leading to the identification of challenges (data quantity, ethical consideration, and data privacy) which are yet to be fully addressed in current studies.
Affiliation(s)
- Jasmin Hassan
  - Drug Delivery & Therapeutics Lab, Dhaka 1212, Bangladesh (J.H., S.M.S.)
- Lipika Deka
  - Faculty of Computing, Engineering and Media, De Montfort University, Leicester LE1 9BH, UK
- Md Jasim Uddin
  - Department of Pharmaceutical Technology, Faculty of Pharmacy, Universiti Malaya, Kuala Lumpur 50603, Malaysia
- Diganta B. Das
  - Department of Chemical Engineering, Loughborough University, Loughborough LE11 3TU, UK
|
80
|
Taskiran II, Spanier KI, Dickmänken H, Kempynck N, Pančíková A, Ekşi EC, Hulselmans G, Ismail JN, Theunis K, Vandepoel R, Christiaens V, Mauduit D, Aerts S. Cell-type-directed design of synthetic enhancers. Nature 2024; 626:212-220. [PMID: 38086419 PMCID: PMC10830415 DOI: 10.1038/s41586-023-06936-2] [Citation(s) in RCA: 38] [Impact Index Per Article: 38.0] [Received: 07/06/2022] [Accepted: 12/05/2023] [Indexed: 01/19/2024]
Abstract
Transcriptional enhancers act as docking stations for combinations of transcription factors and thereby regulate spatiotemporal activation of their target genes [1]. It has been a long-standing goal in the field to decode the regulatory logic of an enhancer and to understand the details of how spatiotemporal gene expression is encoded in an enhancer sequence. Here we show that deep learning models [2-6] can be used to efficiently design synthetic, cell-type-specific enhancers, starting from random sequences, and that this optimization process allows detailed tracing of enhancer features at single-nucleotide resolution. We evaluate the function of fully synthetic enhancers to specifically target Kenyon cells or glial cells in the fruit fly brain using transgenic animals. We further exploit enhancer design to create 'dual-code' enhancers that target two cell types and minimal enhancers smaller than 50 base pairs that are fully functional. By examining the state space searches towards local optima, we characterize enhancer codes through the strength, combination and arrangement of transcription factor activator and transcription factor repressor motifs. Finally, we apply the same strategies to successfully design human enhancers, which adhere to enhancer rules similar to those of Drosophila enhancers. Enhancer design guided by deep learning leads to better understanding of how enhancers work and shows that their code can be exploited to manipulate cell states.
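One way to picture the design process in this abstract (start from a random sequence and search towards a local optimum of a model's score) is a greedy single-nucleotide hill-climb. The sketch below is an assumption-laden illustration, not the authors' procedure: `toy_score` merely counts a made-up motif and stands in for a trained deep learning model, and all names are hypothetical.

```python
import random

BASES = "ACGT"

def toy_score(seq):
    """Stand-in for a trained model's prediction: counts a made-up motif."""
    motif = "GATA"
    return sum(seq[i:i + len(motif)] == motif
               for i in range(len(seq) - len(motif) + 1))

def best_single_mutation(seq, score):
    """Try every single-nucleotide substitution; return the best sequence and score."""
    best_seq, best_val = seq, score(seq)
    for pos in range(len(seq)):
        for base in BASES:
            if base == seq[pos]:
                continue
            candidate = seq[:pos] + base + seq[pos + 1:]
            val = score(candidate)
            if val > best_val:
                best_seq, best_val = candidate, val
    return best_seq, best_val

def evolve(seq, score, max_steps):
    """Greedy hill-climb that records the score after each accepted mutation."""
    trace = [score(seq)]
    for _ in range(max_steps):
        new_seq, new_val = best_single_mutation(seq, score)
        if new_seq == seq:  # local optimum: no single mutation improves the score
            break
        seq = new_seq
        trace.append(new_val)
    return seq, trace

rng = random.Random(1)
start = "".join(rng.choice(BASES) for _ in range(20))
designed, trace = evolve(start, toy_score, max_steps=10)
```

Because each step changes exactly one nucleotide, the sequence of accepted mutations can be inspected afterwards, which mirrors the abstract's point that the optimization path can be traced at single-nucleotide resolution.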
Affiliation(s)
- Ibrahim I Taskiran
  - Laboratory of Computational Biology, VIB Center for AI & Computational Biology (VIB.AI), Leuven, Belgium
  - VIB-KULeuven Center for Brain & Disease Research, Leuven, Belgium
  - Department of Human Genetics, KU Leuven, Leuven, Belgium
- Katina I Spanier
  - Laboratory of Computational Biology, VIB Center for AI & Computational Biology (VIB.AI), Leuven, Belgium
  - VIB-KULeuven Center for Brain & Disease Research, Leuven, Belgium
  - Department of Human Genetics, KU Leuven, Leuven, Belgium
- Hannah Dickmänken
  - Laboratory of Computational Biology, VIB Center for AI & Computational Biology (VIB.AI), Leuven, Belgium
  - VIB-KULeuven Center for Brain & Disease Research, Leuven, Belgium
  - Department of Human Genetics, KU Leuven, Leuven, Belgium
- Niklas Kempynck
  - Laboratory of Computational Biology, VIB Center for AI & Computational Biology (VIB.AI), Leuven, Belgium
  - VIB-KULeuven Center for Brain & Disease Research, Leuven, Belgium
  - Department of Human Genetics, KU Leuven, Leuven, Belgium
- Alexandra Pančíková
  - Laboratory of Computational Biology, VIB Center for AI & Computational Biology (VIB.AI), Leuven, Belgium
  - VIB-KULeuven Center for Brain & Disease Research, Leuven, Belgium
  - Department of Human Genetics, KU Leuven, Leuven, Belgium
  - VIB-KULeuven Center for Cancer Biology, Leuven, Belgium
- Eren Can Ekşi
  - Laboratory of Computational Biology, VIB Center for AI & Computational Biology (VIB.AI), Leuven, Belgium
  - VIB-KULeuven Center for Brain & Disease Research, Leuven, Belgium
  - Department of Human Genetics, KU Leuven, Leuven, Belgium
- Gert Hulselmans
  - Laboratory of Computational Biology, VIB Center for AI & Computational Biology (VIB.AI), Leuven, Belgium
  - VIB-KULeuven Center for Brain & Disease Research, Leuven, Belgium
  - Department of Human Genetics, KU Leuven, Leuven, Belgium
- Joy N Ismail
  - Laboratory of Computational Biology, VIB Center for AI & Computational Biology (VIB.AI), Leuven, Belgium
  - Department of Human Genetics, KU Leuven, Leuven, Belgium
  - UK Dementia Research Institute at Imperial College London, London, UK
- Koen Theunis
  - Laboratory of Computational Biology, VIB Center for AI & Computational Biology (VIB.AI), Leuven, Belgium
  - VIB-KULeuven Center for Brain & Disease Research, Leuven, Belgium
  - Department of Human Genetics, KU Leuven, Leuven, Belgium
- Roel Vandepoel
  - Laboratory of Computational Biology, VIB Center for AI & Computational Biology (VIB.AI), Leuven, Belgium
  - VIB-KULeuven Center for Brain & Disease Research, Leuven, Belgium
  - Department of Human Genetics, KU Leuven, Leuven, Belgium
- Valerie Christiaens
  - Laboratory of Computational Biology, VIB Center for AI & Computational Biology (VIB.AI), Leuven, Belgium
  - VIB-KULeuven Center for Brain & Disease Research, Leuven, Belgium
  - Department of Human Genetics, KU Leuven, Leuven, Belgium
- David Mauduit
  - Laboratory of Computational Biology, VIB Center for AI & Computational Biology (VIB.AI), Leuven, Belgium
  - VIB-KULeuven Center for Brain & Disease Research, Leuven, Belgium
  - Department of Human Genetics, KU Leuven, Leuven, Belgium
- Stein Aerts
  - Laboratory of Computational Biology, VIB Center for AI & Computational Biology (VIB.AI), Leuven, Belgium
  - VIB-KULeuven Center for Brain & Disease Research, Leuven, Belgium
  - Department of Human Genetics, KU Leuven, Leuven, Belgium
|
81
|
Ye F, Wang J, Li J, Mei Y, Guo G. Mapping Cell Atlases at the Single-Cell Level. Adv Sci (Weinh) 2024; 11:e2305449. [PMID: 38145338 PMCID: PMC10885669 DOI: 10.1002/advs.202305449] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/07/2023] [Revised: 12/01/2023] [Indexed: 12/26/2023]
Abstract
Recent advancements in single-cell technologies have led to rapid developments in the construction of cell atlases. These atlases have the potential to provide detailed information about every cell type in different organisms, enabling the characterization of cellular diversity at the single-cell level. Global efforts in developing comprehensive cell atlases have profound implications for both basic research and clinical applications. This review provides a broad overview of the cellular diversity and dynamics across various biological systems. In addition, the incorporation of machine learning techniques into cell atlas analyses opens up exciting prospects for the field of integrative biology.
Collapse
Affiliation(s)
- Fang Ye
- Bone Marrow Transplantation Center of the First Affiliated Hospital, and Center for Stem Cell and Regenerative Medicine, Zhejiang University School of Medicine, Hangzhou, Zhejiang 310000, China
- Liangzhu Laboratory, Zhejiang University, Hangzhou, Zhejiang 311121, China
| | - Jingjing Wang
- Bone Marrow Transplantation Center of the First Affiliated Hospital, and Center for Stem Cell and Regenerative Medicine, Zhejiang University School of Medicine, Hangzhou, Zhejiang 310000, China
- Liangzhu Laboratory, Zhejiang University, Hangzhou, Zhejiang 311121, China
| | - Jiaqi Li
- Bone Marrow Transplantation Center of the First Affiliated Hospital, and Center for Stem Cell and Regenerative Medicine, Zhejiang University School of Medicine, Hangzhou, Zhejiang 310000, China
| | - Yuqing Mei
- Bone Marrow Transplantation Center of the First Affiliated Hospital, and Center for Stem Cell and Regenerative Medicine, Zhejiang University School of Medicine, Hangzhou, Zhejiang 310000, China
| | - Guoji Guo
- Bone Marrow Transplantation Center of the First Affiliated Hospital, and Center for Stem Cell and Regenerative Medicine, Zhejiang University School of Medicine, Hangzhou, Zhejiang 310000, China
- Liangzhu Laboratory, Zhejiang University, Hangzhou, Zhejiang 311121, China
- Zhejiang Provincial Key Lab for Tissue Engineering and Regenerative Medicine, Dr. Li Dak Sum & Yip Yio Chin Center for Stem Cell and Regenerative Medicine, Hangzhou, Zhejiang 310058, China
- Institute of Hematology, Zhejiang University, Hangzhou, Zhejiang 310000, China
| |
|
82
|
de Almeida BP, Schaub C, Pagani M, Secchia S, Furlong EEM, Stark A. Targeted design of synthetic enhancers for selected tissues in the Drosophila embryo. Nature 2024; 626:207-211. [PMID: 38086418 PMCID: PMC10830412 DOI: 10.1038/s41586-023-06905-9] [Citation(s) in RCA: 23] [Impact Index Per Article: 23.0] [Received: 06/19/2023] [Accepted: 11/28/2023] [Indexed: 01/19/2024]
Abstract
Enhancers control gene expression and have crucial roles in development and homeostasis (refs. 1-3). However, the targeted de novo design of enhancers with tissue-specific activities has remained challenging. Here we combine deep learning and transfer learning to design tissue-specific enhancers for five tissues in the Drosophila melanogaster embryo: the central nervous system, epidermis, gut, muscle and brain. We first train convolutional neural networks using genome-wide single-cell assay for transposase-accessible chromatin with sequencing (ATAC-seq) datasets and then fine-tune the convolutional neural networks with smaller-scale data from in vivo enhancer activity assays, yielding models with 13% to 76% positive predictive value according to cross-validation. We designed and experimentally assessed 40 synthetic enhancers (8 per tissue) in vivo, of which 31 (78%) were active and 27 (68%) functioned in the target tissue (100% for central nervous system and muscle). The strategy of combining genome-wide and small-scale functional datasets by transfer learning is generally applicable and should enable the design of tissue-, cell type- and cell state-specific enhancers in any system.
Affiliation(s)
- Bernardo P de Almeida
- Research Institute of Molecular Pathology (IMP), Vienna BioCenter (VBC), Vienna, Austria
- Vienna BioCenter PhD Program, Doctoral School of the University of Vienna and Medical University of Vienna, Vienna, Austria
- InstaDeep, Paris, France
| | - Christoph Schaub
- European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Heidelberg, Germany
| | - Michaela Pagani
- Research Institute of Molecular Pathology (IMP), Vienna BioCenter (VBC), Vienna, Austria
| | - Stefano Secchia
- European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Heidelberg, Germany
| | - Eileen E M Furlong
- European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Heidelberg, Germany
| | - Alexander Stark
- Research Institute of Molecular Pathology (IMP), Vienna BioCenter (VBC), Vienna, Austria
- Medical University of Vienna, Vienna BioCenter (VBC), Vienna, Austria
| |
|
83
|
Cho HJ, Wang Z, Cong Y, Bekiranov S, Zhang A, Zang C. DARDN: A Deep-Learning Approach for CTCF Binding Sequence Classification and Oncogenic Regulatory Feature Discovery. Genes (Basel) 2024; 15:144. [PMID: 38397134 PMCID: PMC10888155 DOI: 10.3390/genes15020144] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/24/2023] [Revised: 01/16/2024] [Accepted: 01/18/2024] [Indexed: 02/25/2024] Open
Abstract
Characterization of gene regulatory mechanisms in cancer is a key task in cancer genomics. CCCTC-binding factor (CTCF), a DNA binding protein, exhibits specific binding patterns in the genome of cancer cells and has a non-canonical function to facilitate oncogenic transcription programs by cooperating with transcription factors bound at flanking distal regions. Identification of DNA sequence features from a broad genomic region that distinguish cancer-specific CTCF binding sites from regular CTCF binding sites can help find oncogenic transcription factors in a cancer type. However, the presence of long DNA sequences without localization information makes it difficult to perform conventional motif analysis. Here, we present DNAResDualNet (DARDN), a computational method that utilizes convolutional neural networks (CNNs) for predicting cancer-specific CTCF binding sites from long DNA sequences and employs DeepLIFT, a method for interpretability of deep learning models that explains the model's output in terms of the contributions of its input features. The method is used for identifying DNA sequence features associated with cancer-specific CTCF binding. Evaluation on DNA sequences associated with CTCF binding sites in T-cell acute lymphoblastic leukemia (T-ALL) and other cancer types demonstrates DARDN's ability in classifying DNA sequences surrounding cancer-specific CTCF binding from control constitutive CTCF binding and identifying sequence motifs for transcription factors potentially active in each specific cancer type. We identify potential oncogenic transcription factors in T-ALL, acute myeloid leukemia (AML), breast cancer (BRCA), colorectal cancer (CRC), lung adenocarcinoma (LUAD), and prostate cancer (PRAD). Our work demonstrates the power of advanced machine learning and feature discovery approach in finding biologically meaningful information from complex high-throughput sequencing data.
Affiliation(s)
- Hyun Jae Cho
- Department of Computer Science, University of Virginia, Charlottesville, VA 22903, USA
| | - Zhenjia Wang
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA 22903, USA
| | - Yidan Cong
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA 22903, USA
| | - Stefan Bekiranov
- Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA 22903, USA
| | - Aidong Zhang
- Department of Computer Science, University of Virginia, Charlottesville, VA 22903, USA
| | - Chongzhi Zang
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA 22903, USA
- Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA 22903, USA
| |
|
84
|
Han D, Li Y, Wang L, Liang X, Miao Y, Li W, Wang S, Wang Z. Comparative analysis of models in predicting the effects of SNPs on TF-DNA binding using large-scale in vitro and in vivo data. Brief Bioinform 2024; 25:bbae110. [PMID: 38517697 PMCID: PMC10959158 DOI: 10.1093/bib/bbae110] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2023] [Revised: 02/22/2024] [Accepted: 02/26/2024] [Indexed: 03/24/2024] Open
Abstract
Non-coding variants associated with complex traits can alter the motifs of transcription factor (TF)-deoxyribonucleic acid binding. Although many computational models have been developed to predict the effects of non-coding variants on TF binding, their predictive power lacks systematic evaluation. Here we have evaluated 14 different models built on position weight matrices (PWMs), support vector machines, ordinary least squares and deep neural networks (DNNs), using large-scale in vitro (i.e. SNP-SELEX) and in vivo (i.e. allele-specific binding, ASB) TF binding data. Our results show that the accuracy of each model in predicting SNP effects in vitro significantly exceeds that achieved in vivo. For in vitro variant impact prediction, kmer/gkm-based machine learning methods (deltaSVM_HT-SELEX, QBiC-Pred) trained on in vitro datasets exhibit the best performance. For in vivo ASB variant prediction, DNN-based multitask models (DeepSEA, Sei, Enformer) trained on the ChIP-seq dataset exhibit relatively superior performance. Among the PWM-based methods, tRap demonstrates better performance in both in vitro and in vivo evaluations. In addition, we find that TF classes such as basic leucine zipper factors could be predicted more accurately, whereas those such as C2H2 zinc finger factors are predicted less accurately, aligning with the evolutionary conservation of these TF classes. We also underscore the significance of non-sequence factors such as cis-regulatory element type, TF expression, interactions and post-translational modifications in influencing the in vivo predictive performance of TFs. Our research provides valuable insights into selecting prioritization methods for non-coding variants and further optimizing such models.
Affiliation(s)
- Dongmei Han
- CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, 320 Yueyang Road, Shanghai, 200031, China
| | - Yurun Li
- CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, 320 Yueyang Road, Shanghai, 200031, China
| | - Linxiao Wang
- CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, 320 Yueyang Road, Shanghai, 200031, China
| | - Xuan Liang
- CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, 320 Yueyang Road, Shanghai, 200031, China
| | - Yuanyuan Miao
- CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, 320 Yueyang Road, Shanghai, 200031, China
| | - Wenran Li
- CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, 320 Yueyang Road, Shanghai, 200031, China
| | - Sijia Wang
- CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, 320 Yueyang Road, Shanghai, 200031, China
| | - Zhen Wang
- CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, 320 Yueyang Road, Shanghai, 200031, China
| |
|
85
|
Zhou J, Weinberger DR, Han S. Deep learning predicts DNA methylation regulatory variants in specific brain cell types and enhances fine mapping for brain disorders. bioRxiv 2024:2024.01.18.576319. [PMID: 38293210 PMCID: PMC10827166 DOI: 10.1101/2024.01.18.576319] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/01/2024]
Abstract
DNA methylation (DNAm) is essential for brain development and function and potentially mediates the effects of genetic risk variants underlying brain disorders. We present INTERACT, a transformer-based deep learning model to predict regulatory variants impacting DNAm levels in specific brain cell types, leveraging existing single-nucleus DNAm data from the human brain. We show that INTERACT accurately predicts cell type-specific DNAm profiles, achieving an average area under the Receiver Operating Characteristic curve of 0.98 across cell types. Furthermore, INTERACT predicts cell type-specific DNAm regulatory variants, which reflect cellular context and enrich the heritability of brain-related traits in relevant cell types. Importantly, we demonstrate that incorporating predicted variant effects and DNAm levels of CpG sites enhances the fine mapping for three brain disorders (schizophrenia, depression, and Alzheimer's disease) and facilitates mapping causal genes to particular cell types. Our study highlights the power of deep learning in identifying cell type-specific regulatory variants, which will enhance our understanding of the genetics of complex traits.
Affiliation(s)
- Jiyun Zhou
- Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21287, USA
| | - Daniel R. Weinberger
- Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21287, USA
- Department of Psychiatry and Behavioral Sciences, Johns Hopkins University School of Medicine, Baltimore, MD, 21287, USA
- Department of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
- Department of Neuroscience, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
- Department of Neurology, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
| | - Shizhong Han
- Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21287, USA
- Department of Psychiatry and Behavioral Sciences, Johns Hopkins University School of Medicine, Baltimore, MD, 21287, USA
- Department of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
| |
|
86
|
Kang CK, Kim AR. Deep molecular learning of transcriptional control of a synthetic CRE enhancer and its variants. iScience 2024; 27:108747. [PMID: 38222110 PMCID: PMC10784702 DOI: 10.1016/j.isci.2023.108747] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/20/2023] [Revised: 08/29/2023] [Accepted: 12/12/2023] [Indexed: 01/16/2024] Open
Abstract
Massively parallel reporter assay measures transcriptional activities of various cis-regulatory modules (CRMs) in a single experiment. We developed a thermodynamic computational model framework that calculates quantitative levels of gene expression directly from regulatory DNA sequences. Using the framework, we investigated the molecular mechanisms of cis-regulatory mutations of a synthetic enhancer that cause abnormal gene expression. We found that, in a human cell line, competitive binding between family transcription factors (TFs) with slightly different binding preferences significantly increases the accuracy of recapitulating the transcriptional effects of thousands of single- or multi-mutations. We also discovered that even if various harmful mutations occurred in an activator binding site, CRM could stably maintain or even increase gene expression through a certain form of competitive binding between family TFs. These findings enhance understanding the effect of SNPs and indels on CRMs and would help building robust custom-designed CRMs for biologics production and gene therapy.
Affiliation(s)
- Chan-Koo Kang
- School of Life Science, Handong Global University, Pohang, Gyeong-Buk 37554, South Korea
- Department of Advanced Convergence, Handong Global University, Pohang, Gyeong-Buk 37554, South Korea
| | - Ah-Ram Kim
- School of Life Science, Handong Global University, Pohang, Gyeong-Buk 37554, South Korea
- Department of Advanced Convergence, Handong Global University, Pohang, Gyeong-Buk 37554, South Korea
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- School of Applied Artificial Intelligence, Handong Global University, Pohang, Gyeong-Buk 37554, South Korea
| |
|
87
|
Retel JS, Poehlmann A, Chiou J, Steffen A, Clevert DA. A fast machine learning dataloader for epigenetic tracks from BigWig files. Bioinformatics 2024; 40:btad767. [PMID: 38175786 PMCID: PMC10782802 DOI: 10.1093/bioinformatics/btad767] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2023] [Revised: 12/12/2023] [Indexed: 01/06/2024] Open
Abstract
SUMMARY We created bigwig-loader, a data-loader for epigenetic profiles from BigWig files that decompresses and processes information for multiple intervals from multiple BigWig files in parallel. This is an access pattern needed to create training batches for typical machine learning models on epigenetics data. Using a new codec, the decompression can be done on a graphical processing unit (GPU) making it fast enough to create the training batches during training, mitigating the need for saving preprocessed training examples to disk. AVAILABILITY AND IMPLEMENTATION The bigwig-loader installation instructions and source code can be accessed at https://github.com/pfizer-opensource/bigwig-loader.
Affiliation(s)
- Joren Sebastian Retel
- Machine Learning Research, Pfizer Worldwide Research Development and Medical, Friedrichstraße 110, Berlin 10117, Germany
| | - Andreas Poehlmann
- Machine Learning Research, Pfizer Worldwide Research Development and Medical, Friedrichstraße 110, Berlin 10117, Germany
| | - Josh Chiou
- Machine Learning Research, Pfizer Worldwide Research Development and Medical, Friedrichstraße 110, Berlin 10117, Germany
| | - Andreas Steffen
- Machine Learning Research, Pfizer Worldwide Research Development and Medical, Friedrichstraße 110, Berlin 10117, Germany
| | - Djork-Arné Clevert
- Machine Learning Research, Pfizer Worldwide Research Development and Medical, Friedrichstraße 110, Berlin 10117, Germany
| |
|
88
|
Yang Z, Li X, Sheng L, Zhu M, Lan X, Gu F. Multiomics-integrated deep language model enables in silico genome-wide detection of transcription factor binding site in unexplored biosamples. Bioinformatics 2024; 40:btae013. [PMID: 38216534 PMCID: PMC10812877 DOI: 10.1093/bioinformatics/btae013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/24/2023] [Revised: 12/07/2023] [Accepted: 01/11/2024] [Indexed: 01/14/2024] Open
Abstract
MOTIVATION Transcription factor binding sites (TFBS) are regulatory elements that have significant impact on transcription regulation and cell fate determination. Canonical motifs, biological experiments, and computational methods have made it possible to discover TFBS. However, most existing in silico TFBS prediction models are solely DNA-based, and are trained and utilized within the same biosample, which fail to infer TFBS in experimentally unexplored biosamples. RESULTS Here, we propose TFBS prediction by modified TransFormer (TFTF), a multimodal deep language architecture which integrates multiomics information in epigenetic studies. In comparison to existing computational techniques, TFTF has state-of-the-art accuracy, and is also the first approach to accurately perform genome-wide detection for cell-type and species-specific TFBS in experimentally unexplored biosamples. Compared to peak calling methods, TFTF consistently discovers true TFBS in threshold tuning-free way, with higher recalled rates. The underlying mechanism of TFTF reveals greater attention to the targeted TF's motif region in TFBS, and general attention to the entire peak region in non-TFBS. TFTF can benefit from the integration of broader and more diverse data for improvement and can be applied to multiple epigenetic scenarios. AVAILABILITY AND IMPLEMENTATION We provide a web server (https://tftf.ibreed.cn/) for users to utilize TFTF model. Users can train TFTF model and discover TFBS with their own data.
Affiliation(s)
- Zikun Yang
- Damo Academy, Alibaba Group, Hangzhou 310023, China
- Hupan Lab, Hangzhou 310023, China
| | - Xin Li
- Damo Academy, Alibaba Group, Hangzhou 310023, China
- Hupan Lab, Hangzhou 310023, China
| | - Lele Sheng
- Damo Academy, Alibaba Group, Hangzhou 310023, China
- Hupan Lab, Hangzhou 310023, China
| | - Ming Zhu
- Department of Basic Medical Science, School of Medicine, Tsinghua University, Beijing 100084, China
- Tsinghua-Peking Joint Center for Life Sciences, Tsinghua University, Beijing 100084, China
- MOE Key Laboratory of Bioinformatics, Tsinghua University, Beijing 100084, China
| | - Xun Lan
- Department of Basic Medical Science, School of Medicine, Tsinghua University, Beijing 100084, China
- Tsinghua-Peking Joint Center for Life Sciences, Tsinghua University, Beijing 100084, China
- MOE Key Laboratory of Bioinformatics, Tsinghua University, Beijing 100084, China
| | - Fei Gu
- Damo Academy, Alibaba Group, Hangzhou 310023, China
- Hupan Lab, Hangzhou 310023, China
| |
|
89
|
Gao L, Behrens A, Rodschinka G, Forcelloni S, Wani S, Strasser K, Nedialkova DD. Selective gene expression maintains human tRNA anticodon pools during differentiation. Nat Cell Biol 2024; 26:100-112. [PMID: 38191669 PMCID: PMC10791582 DOI: 10.1038/s41556-023-01317-3] [Citation(s) in RCA: 15] [Impact Index Per Article: 15.0] [Received: 05/19/2023] [Accepted: 11/16/2023] [Indexed: 01/10/2024]
Abstract
Transfer RNAs are essential for translating genetic information into proteins. The human genome contains hundreds of predicted tRNA genes, many in multiple copies. How their expression is regulated to control tRNA repertoires is unknown. Here we combined quantitative tRNA profiling and chromatin immunoprecipitation with sequencing to measure tRNA expression following the differentiation of human induced pluripotent stem cells into neuronal and cardiac cells. We find that tRNA transcript levels vary substantially, whereas tRNA anticodon pools, which govern decoding rates, are more stable among cell types. Mechanistically, RNA polymerase III transcribes a wide range of tRNA genes in human induced pluripotent stem cells but on differentiation becomes constrained to a subset we define as housekeeping tRNAs. This shift is mediated by decreased mTORC1 signalling, which activates the RNA polymerase III repressor MAF1. Our data explain how tRNA anticodon pools are buffered to maintain decoding speed across cell types and reveal that mTORC1 drives selective tRNA expression during differentiation.
Affiliation(s)
- Lexi Gao
- Mechanisms of Protein Biogenesis, Max Planck Institute of Biochemistry, Martinsried, Germany
| | - Andrew Behrens
- Mechanisms of Protein Biogenesis, Max Planck Institute of Biochemistry, Martinsried, Germany
| | - Geraldine Rodschinka
- Mechanisms of Protein Biogenesis, Max Planck Institute of Biochemistry, Martinsried, Germany
| | - Sergio Forcelloni
- Mechanisms of Protein Biogenesis, Max Planck Institute of Biochemistry, Martinsried, Germany
| | - Sascha Wani
- Mechanisms of Protein Biogenesis, Max Planck Institute of Biochemistry, Martinsried, Germany
| | - Katrin Strasser
- Mechanisms of Protein Biogenesis, Max Planck Institute of Biochemistry, Martinsried, Germany
| | - Danny D Nedialkova
- Mechanisms of Protein Biogenesis, Max Planck Institute of Biochemistry, Martinsried, Germany
- Department of Bioscience, TUM School of Natural Sciences, Technical University of Munich, Garching, Germany
| |
|
90
|
Sun M, Gao AX, Li A, Ledesma-Amaro R, Wang P, Chen W, Bai Z, Liu X. Hyper-production of porcine contagious pleuropneumonia subunit vaccine proteins in Escherichia coli by developing a bicistronic T7 expression system. Biotechnol J 2024; 19:e2300187. [PMID: 38178735 DOI: 10.1002/biot.202300187] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/28/2023] [Revised: 12/15/2023] [Accepted: 12/20/2023] [Indexed: 01/06/2024]
Abstract
The ApxII toxin and the outer membrane lipoprotein (Oml) of Actinobacillus pleuropneumoniae are important vaccine antigens against porcine contagious pleuropneumonia (PCP), a prevalent infectious disease affecting the swine industry worldwide. Previous studies have reported the recombinant expression of ApxII and Oml in Escherichia coli; however, their yields were not satisfactory. Here, we aimed to enhance the production of ApxII and Oml by constructing a bicistronic expression system based on the widely used T7 promoter. To create efficient T7 bicistronic expression cassettes, 16 different fore-cistron sequences were introduced downstream of the T7 promoter. The expression of three vaccine antigens, Oml1, Oml7, and ApxII, in the four strongest bicistronic vectors was enhanced compared to the monocistronic control. Further optimization of the fermentation conditions in micro-well plates (MWP) led to improved production. Finally, the production yields reached unprecedented levels of 2.43 g L⁻¹ of Oml1, 2.59 g L⁻¹ of Oml7, and 1.21 g L⁻¹ of ApxII in a 5 L bioreactor. These three antigens also demonstrated well-protective immunity against A. pleuropneumoniae infection. In conclusion, this study establishes an efficient bicistronic T7 expression system that can be used to express recombinant proteins in E. coli and achieves the hyper-production of PCP vaccine proteins.
Affiliation(s)
- Manman Sun
- National Engineering Research Center of Cereal Fermentation and Food Biomanufacturing, Jiangnan University, Wuxi, China
- Department of Bioengineering and Imperial College Centre for Synthetic Biology, Imperial College London, London, UK
- Key Laboratory of High Magnetic Field and Ion Beam Physical Biology, Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei, China
| | - Alex Xiong Gao
- Division of Life Science, The Hong Kong University of Science and Technology, Hong Kong, China
| | - An Li
- National Engineering Research Center of Cereal Fermentation and Food Biomanufacturing, Jiangnan University, Wuxi, China
| | - Rodrigo Ledesma-Amaro
- Department of Bioengineering and Imperial College Centre for Synthetic Biology, Imperial College London, London, UK
| | - Peng Wang
- Key Laboratory of High Magnetic Field and Ion Beam Physical Biology, Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei, China
| | - Wenchao Chen
- Oil Crops Research Institute of the Chinese Academy of Agricultural Sciences, Wuhan, Hubei, China
| | - Zhonghu Bai
- National Engineering Research Center of Cereal Fermentation and Food Biomanufacturing, Jiangnan University, Wuxi, China
| | - Xiuxia Liu
- National Engineering Research Center of Cereal Fermentation and Food Biomanufacturing, Jiangnan University, Wuxi, China
| |
|
91
|
Bajwa A, Rastogi R, Kathail P, Shuai RW, Ioannidis NM. Characterizing uncertainty in predictions of genomic sequence-to-activity models. bioRxiv 2023:2023.12.21.572730. [PMID: 38187742 PMCID: PMC10769392 DOI: 10.1101/2023.12.21.572730] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/09/2024]
Abstract
Genomic sequence-to-activity models are increasingly utilized to understand gene regulatory syntax and probe the functional consequences of regulatory variation. Current models make accurate predictions of relative activity levels across the human reference genome, but their performance is more limited for predicting the effects of genetic variants, such as explaining gene expression variation across individuals. To better understand the causes of these shortcomings, we examine the uncertainty in predictions of genomic sequence-to-activity models using an ensemble of Basenji2 model replicates. We characterize prediction consistency on four types of sequences: reference genome sequences, reference genome sequences perturbed with TF motifs, eQTLs, and personal genome sequences. We observe that models tend to make high-confidence predictions on reference sequences, even when incorrect, and low-confidence predictions on sequences with variants. For eQTLs and personal genome sequences, we find that model replicates make inconsistent predictions in >50% of cases. Our findings suggest strategies to improve performance of these models.
Affiliation(s)
- Ayesha Bajwa
- Department of Electrical Engineering and Computer Sciences, University of California Berkeley, Berkeley, CA, USA
| | - Ruchir Rastogi
- Department of Electrical Engineering and Computer Sciences, University of California Berkeley, Berkeley, CA, USA
| | - Pooja Kathail
- Center for Computational Biology, University of California Berkeley, Berkeley, CA, USA
| | - Richard W Shuai
- Department of Electrical Engineering and Computer Sciences, University of California Berkeley, Berkeley, CA, USA
| | - Nilah M Ioannidis
- Department of Electrical Engineering and Computer Sciences, University of California Berkeley, Berkeley, CA, USA
- Center for Computational Biology, University of California Berkeley, Berkeley, CA, USA
- Chan Zuckerberg Biohub, San Francisco, CA, USA
| |
|
92
|
Wei J, Lotfy P, Faizi K, Baungaard S, Gibson E, Wang E, Slabodkin H, Kinnaman E, Chandrasekaran S, Kitano H, Durrant MG, Duffy CV, Pawluk A, Hsu PD, Konermann S. Deep learning and CRISPR-Cas13d ortholog discovery for optimized RNA targeting. Cell Syst 2023; 14:1087-1102.e13. [PMID: 38091991 DOI: 10.1016/j.cels.2023.11.006] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 08/05/2022] [Revised: 05/03/2023] [Accepted: 11/20/2023] [Indexed: 12/23/2023]
Abstract
Effective and precise mammalian transcriptome engineering technologies are needed to accelerate biological discovery and RNA therapeutics. Despite the promise of programmable CRISPR-Cas13 ribonucleases, their utility has been hampered by an incomplete understanding of guide RNA design rules and cellular toxicity resulting from off-target or collateral RNA cleavage. Here, we quantified the performance of over 127,000 RfxCas13d (CasRx) guide RNAs and systematically evaluated seven machine learning models to build a guide efficiency prediction algorithm orthogonally validated across multiple human cell types. Deep learning model interpretation revealed preferred sequence motifs and secondary features for highly efficient guides. We next identified and screened 46 novel Cas13d orthologs, finding that DjCas13d achieves low cellular toxicity and high specificity, even when targeting abundant transcripts in sensitive cell types, including stem cells and neurons. Our Cas13d guide efficiency model was successfully generalized to DjCas13d, illustrating the power of combining machine learning with ortholog discovery to advance RNA targeting in human cells.
Affiliation(s)
- Jingyi Wei
- Department of Bioengineering, Stanford University, Stanford, CA, USA; Department of Biochemistry, Stanford University, Stanford, CA, USA; Arc Institute, Palo Alto, CA, USA
- Peter Lotfy
- Laboratory of Molecular and Cell Biology, Salk Institute for Biological Studies, La Jolla, CA, USA
- Kian Faizi
- Laboratory of Molecular and Cell Biology, Salk Institute for Biological Studies, La Jolla, CA, USA
- Eleanor Wang
- Laboratory of Molecular and Cell Biology, Salk Institute for Biological Studies, La Jolla, CA, USA; Department of Bioengineering, University of California, Berkeley, Berkeley, CA, USA; Innovative Genomics Institute, University of California, Berkeley, Berkeley, CA, USA
- Hannah Slabodkin
- Department of Biochemistry, Stanford University, Stanford, CA, USA; Arc Institute, Palo Alto, CA, USA
- Emily Kinnaman
- Department of Biochemistry, Stanford University, Stanford, CA, USA; Arc Institute, Palo Alto, CA, USA
- Sita Chandrasekaran
- Arc Institute, Palo Alto, CA, USA; Department of Bioengineering, University of California, Berkeley, Berkeley, CA, USA; Innovative Genomics Institute, University of California, Berkeley, Berkeley, CA, USA
- Hugo Kitano
- Department of Computer Science, Stanford University, Stanford, CA, USA
- Matthew G Durrant
- Arc Institute, Palo Alto, CA, USA; Department of Bioengineering, University of California, Berkeley, Berkeley, CA, USA; Innovative Genomics Institute, University of California, Berkeley, Berkeley, CA, USA
- Connor V Duffy
- Arc Institute, Palo Alto, CA, USA; Department of Genetics, Stanford University, Stanford, CA, USA
- Patrick D Hsu
- Arc Institute, Palo Alto, CA, USA; Department of Bioengineering, University of California, Berkeley, Berkeley, CA, USA; Innovative Genomics Institute, University of California, Berkeley, Berkeley, CA, USA
- Silvana Konermann
- Department of Biochemistry, Stanford University, Stanford, CA, USA; Arc Institute, Palo Alto, CA, USA
93
Zu S, Li YE, Wang K, Armand EJ, Mamde S, Amaral ML, Wang Y, Chu A, Xie Y, Miller M, Xu J, Wang Z, Zhang K, Jia B, Hou X, Lin L, Yang Q, Lee S, Li B, Kuan S, Liu H, Zhou J, Pinto-Duarte A, Lucero J, Osteen J, Nunn M, Smith KA, Tasic B, Yao Z, Zeng H, Wang Z, Shang J, Behrens MM, Ecker JR, Wang A, Preissl S, Ren B. Single-cell analysis of chromatin accessibility in the adult mouse brain. Nature 2023; 624:378-389. [PMID: 38092917 PMCID: PMC10719105 DOI: 10.1038/s41586-023-06824-9] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Received: 03/31/2023] [Accepted: 11/01/2023] [Indexed: 12/17/2023]
Abstract
Recent advances in single-cell technologies have led to the discovery of thousands of brain cell types; however, our understanding of the gene regulatory programs in these cell types is far from complete [1-4]. Here we report a comprehensive atlas of candidate cis-regulatory DNA elements (cCREs) in the adult mouse brain, generated by analysing chromatin accessibility in 2.3 million individual brain cells from 117 anatomical dissections. The atlas includes approximately 1 million cCREs and their chromatin accessibility across 1,482 distinct brain cell populations, adding over 446,000 cCREs to the most recent such annotation in the mouse genome. The mouse brain cCREs are moderately conserved in the human brain. The mouse-specific cCREs (specifically, those identified from a subset of cortical excitatory neurons) are strongly enriched for transposable elements, suggesting a potential role for transposable elements in the emergence of new regulatory programs and neuronal diversity. Finally, we infer the gene regulatory networks in over 260 subclasses of mouse brain cells and develop deep-learning models to predict the activities of gene regulatory elements in different brain cell types from the DNA sequence alone. Our results provide a resource for the analysis of cell-type-specific gene regulation programs in both mouse and human brains.
Affiliation(s)
- Songpeng Zu
- Department of Cellular and Molecular Medicine, University of California San Diego, School of Medicine, La Jolla, CA, USA
- Yang Eric Li
- Department of Cellular and Molecular Medicine, University of California San Diego, School of Medicine, La Jolla, CA, USA
- Department of Neurosurgery and Genetics, Washington University School of Medicine, St Louis, MO, USA
- Kangli Wang
- Department of Cellular and Molecular Medicine, University of California San Diego, School of Medicine, La Jolla, CA, USA
- Ethan J Armand
- Department of Cellular and Molecular Medicine, University of California San Diego, School of Medicine, La Jolla, CA, USA
- Sainath Mamde
- Department of Cellular and Molecular Medicine, University of California San Diego, School of Medicine, La Jolla, CA, USA
- Maria Luisa Amaral
- Department of Cellular and Molecular Medicine, University of California San Diego, School of Medicine, La Jolla, CA, USA
- Yuelai Wang
- Department of Cellular and Molecular Medicine, University of California San Diego, School of Medicine, La Jolla, CA, USA
- Andre Chu
- Department of Cellular and Molecular Medicine, University of California San Diego, School of Medicine, La Jolla, CA, USA
- Yang Xie
- Department of Cellular and Molecular Medicine, University of California San Diego, School of Medicine, La Jolla, CA, USA
- Michael Miller
- Center for Epigenomics, University of California San Diego, School of Medicine, La Jolla, CA, USA
- Jie Xu
- Department of Cellular and Molecular Medicine, University of California San Diego, School of Medicine, La Jolla, CA, USA
- Zhaoning Wang
- Department of Cellular and Molecular Medicine, University of California San Diego, School of Medicine, La Jolla, CA, USA
- Kai Zhang
- Department of Cellular and Molecular Medicine, University of California San Diego, School of Medicine, La Jolla, CA, USA
- Bojing Jia
- Department of Cellular and Molecular Medicine, University of California San Diego, School of Medicine, La Jolla, CA, USA
- Xiaomeng Hou
- Center for Epigenomics, University of California San Diego, School of Medicine, La Jolla, CA, USA
- Lin Lin
- Center for Epigenomics, University of California San Diego, School of Medicine, La Jolla, CA, USA
- Qian Yang
- Center for Epigenomics, University of California San Diego, School of Medicine, La Jolla, CA, USA
- Seoyeon Lee
- Department of Cellular and Molecular Medicine, University of California San Diego, School of Medicine, La Jolla, CA, USA
- Bin Li
- Department of Cellular and Molecular Medicine, University of California San Diego, School of Medicine, La Jolla, CA, USA
- Samantha Kuan
- Department of Cellular and Molecular Medicine, University of California San Diego, School of Medicine, La Jolla, CA, USA
- Hanqing Liu
- Genomic Analysis Laboratory, The Salk Institute for Biological Studies, La Jolla, CA, USA
- Jingtian Zhou
- Genomic Analysis Laboratory, The Salk Institute for Biological Studies, La Jolla, CA, USA
- Jacinta Lucero
- The Salk Institute for Biological Studies, La Jolla, CA, USA
- Julia Osteen
- The Salk Institute for Biological Studies, La Jolla, CA, USA
- Michael Nunn
- Howard Hughes Medical Institute, The Salk Institute for Biological Studies, La Jolla, CA, USA
- Zizhen Yao
- Allen Institute for Brain Science, Seattle, WA, USA
- Hongkui Zeng
- Allen Institute for Brain Science, Seattle, WA, USA
- Zihan Wang
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
- Jingbo Shang
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
- Joseph R Ecker
- Howard Hughes Medical Institute, The Salk Institute for Biological Studies, La Jolla, CA, USA
- Allen Wang
- Center for Epigenomics, University of California San Diego, School of Medicine, La Jolla, CA, USA
- Sebastian Preissl
- Center for Epigenomics, University of California San Diego, School of Medicine, La Jolla, CA, USA
- Institute of Experimental and Clinical Pharmacology and Toxicology, Faculty of Medicine, University of Freiburg, Freiburg, Germany
- Bing Ren
- Department of Cellular and Molecular Medicine, University of California San Diego, School of Medicine, La Jolla, CA, USA
- Center for Epigenomics, University of California San Diego, School of Medicine, La Jolla, CA, USA
94
Sasse A, Ng B, Spiro AE, Tasaki S, Bennett DA, Gaiteri C, De Jager PL, Chikina M, Mostafavi S. Benchmarking of deep neural networks for predicting personal gene expression from DNA sequence highlights shortcomings. Nat Genet 2023; 55:2060-2064. [PMID: 38036778 DOI: 10.1038/s41588-023-01524-6] [Citation(s) in RCA: 29] [Impact Index Per Article: 14.5] [Received: 03/13/2023] [Accepted: 09/08/2023] [Indexed: 12/02/2023]
Abstract
Deep learning methods have recently become the state of the art in a variety of regulatory genomic tasks [1-6], including the prediction of gene expression from genomic DNA. As such, these methods promise to serve as important tools in interpreting the full spectrum of genetic variation observed in personal genomes. Previous evaluation strategies have assessed their predictions of gene expression across genomic regions; however, systematic benchmarking is lacking to assess their predictions across individuals, which would directly evaluate their utility as personal DNA interpreters. We used paired whole genome sequencing and gene expression from 839 individuals in the ROSMAP study [7] to evaluate the ability of current methods to predict gene expression variation across individuals at varied loci. Our approach identifies a limitation of current methods to correctly predict the direction of variant effects. We show that this limitation stems from insufficiently learned sequence motif grammar and suggest new model training strategies to improve performance.
Affiliation(s)
- Alexander Sasse
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA
- Bernard Ng
- Rush Alzheimer's Disease Center, Rush University Medical Center, Chicago, IL, USA
- Anna E Spiro
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA
- Shinya Tasaki
- Rush Alzheimer's Disease Center, Rush University Medical Center, Chicago, IL, USA
- David A Bennett
- Rush Alzheimer's Disease Center, Rush University Medical Center, Chicago, IL, USA
- Christopher Gaiteri
- Rush Alzheimer's Disease Center, Rush University Medical Center, Chicago, IL, USA
- Department of Psychiatry, SUNY Upstate Medical University, Syracuse, NY, USA
- Philip L De Jager
- Center for Translational & Computational Neuroimmunology, Department of Neurology, and the Taub Institute for the Study of Alzheimer's Disease and the Aging Brain, Columbia University Irving Medical Center, New York, NY, USA
- Maria Chikina
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
- Sara Mostafavi
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA
- Canadian Institute for Advanced Research, Toronto, Ontario, Canada
95
Fernandez ME, Martinez-Romero J, Aon MA, Bernier M, Price NL, de Cabo R. How is Big Data reshaping preclinical aging research? Lab Anim (NY) 2023; 52:289-314. [PMID: 38017182 DOI: 10.1038/s41684-023-01286-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/24/2023] [Accepted: 10/10/2023] [Indexed: 11/30/2023]
Abstract
The exponential scientific and technological progress during the past 30 years has favored the comprehensive characterization of aging processes with their multivariate nature, leading to the advent of Big Data in preclinical aging research. Spanning from molecular omics to organism-level deep phenotyping, Big Data demands large computational resources for storage and analysis, as well as new analytical tools and conceptual frameworks to gain novel insights leading to discovery. Systems biology has emerged as a paradigm that utilizes Big Data to gain insightful information enabling a better understanding of living organisms, visualized as multilayered networks of interacting molecules, cells, tissues and organs at different spatiotemporal scales. In this framework, where aging, health and disease represent emergent states from an evolving dynamic complex system, context given by, for example, strain, sex and feeding times, becomes paramount for defining the biological trajectory of an organism. Using bioinformatics and artificial intelligence, the systems biology approach is leading to remarkable advances in our understanding of the underlying mechanism of aging biology and assisting in creative experimental study designs in animal models. Future in-depth knowledge acquisition will depend on the ability to fully integrate information from different spatiotemporal scales in organisms, which will probably require the adoption of theories and methods from the field of complex systems. Here we review state-of-the-art approaches in preclinical research, with a focus on rodent models, that are leading to conceptual and/or technical advances in leveraging Big Data to understand basic aging biology and its full translational potential.
Affiliation(s)
- Maria Emilia Fernandez
- Experimental Gerontology Section, Translational Gerontology Branch, National Institute on Aging, National Institutes of Health, Baltimore, MD, USA
- Jorge Martinez-Romero
- Experimental Gerontology Section, Translational Gerontology Branch, National Institute on Aging, National Institutes of Health, Baltimore, MD, USA
- Laboratory of Epidemiology and Population Science, National Institute on Aging, National Institutes of Health, Baltimore, MD, USA
- Miguel A Aon
- Experimental Gerontology Section, Translational Gerontology Branch, National Institute on Aging, National Institutes of Health, Baltimore, MD, USA
- Laboratory of Cardiovascular Science, National Institute on Aging, National Institutes of Health, Baltimore, MD, USA
- Michel Bernier
- Experimental Gerontology Section, Translational Gerontology Branch, National Institute on Aging, National Institutes of Health, Baltimore, MD, USA
- Nathan L Price
- Experimental Gerontology Section, Translational Gerontology Branch, National Institute on Aging, National Institutes of Health, Baltimore, MD, USA
- Rafael de Cabo
- Experimental Gerontology Section, Translational Gerontology Branch, National Institute on Aging, National Institutes of Health, Baltimore, MD, USA
96
Huang C, Shuai RW, Baokar P, Chung R, Rastogi R, Kathail P, Ioannidis NM. Personal transcriptome variation is poorly explained by current genomic deep learning models. Nat Genet 2023; 55:2056-2059. [PMID: 38036790 PMCID: PMC10703684 DOI: 10.1038/s41588-023-01574-w] [Citation(s) in RCA: 25] [Impact Index Per Article: 12.5] [Received: 04/22/2023] [Accepted: 10/18/2023] [Indexed: 12/02/2023]
Abstract
Genomic deep learning models can predict genome-wide epigenetic features and gene expression levels directly from DNA sequence. While current models perform well at predicting gene expression levels across genes in different cell types from the reference genome, their ability to explain expression variation between individuals due to cis-regulatory genetic variants remains largely unexplored. Here, we evaluate four state-of-the-art models on paired personal genome and transcriptome data and find limited performance when explaining variation in expression across individuals. In addition, models often fail to predict the correct direction of effect of cis-regulatory genetic variation on expression.
Affiliation(s)
- Connie Huang
- Department of Electrical Engineering and Computer Sciences, University of California Berkeley, Berkeley, CA, USA
- Richard W Shuai
- Department of Electrical Engineering and Computer Sciences, University of California Berkeley, Berkeley, CA, USA
- Parth Baokar
- Department of Electrical Engineering and Computer Sciences, University of California Berkeley, Berkeley, CA, USA
- Ryan Chung
- Center for Computational Biology, University of California Berkeley, Berkeley, CA, USA
- Ruchir Rastogi
- Department of Electrical Engineering and Computer Sciences, University of California Berkeley, Berkeley, CA, USA
- Pooja Kathail
- Center for Computational Biology, University of California Berkeley, Berkeley, CA, USA
- Nilah M Ioannidis
- Department of Electrical Engineering and Computer Sciences, University of California Berkeley, Berkeley, CA, USA
- Center for Computational Biology, University of California Berkeley, Berkeley, CA, USA
- Chan Zuckerberg Biohub, San Francisco, CA, USA
97
Lee D, Han SK, Yaacov O, Berk-Rauch H, Mathiyalagan P, Ganesh SK, Chakravarti A. Tissue-specific and tissue-agnostic effects of genome sequence variation modulating blood pressure. Cell Rep 2023; 42:113351. [PMID: 37910504 PMCID: PMC10726310 DOI: 10.1016/j.celrep.2023.113351] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/25/2022] [Revised: 09/21/2023] [Accepted: 10/11/2023] [Indexed: 11/03/2023] Open
Abstract
Genome-wide association studies (GWASs) have identified numerous variants associated with polygenic traits and diseases. However, with few exceptions, a mechanistic understanding of which variants affect which genes in which tissues to modulate trait variation is lacking. Here, we present genomic analyses to explain trait heritability of blood pressure (BP) through the genetics of transcriptional regulation using GWASs, multiomics data from different tissues, and machine learning approaches. Approximately 500,000 predicted regulatory variants across four tissues explain 33.4% of variant heritability: 2.5%, 5.3%, 7.7%, and 11.8% for kidney-, adrenal-, heart-, and artery-specific variants, respectively. Variation in the enhancers involved shows greater tissue specificity than in the genes they regulate, suggesting that gene regulatory networks perturbed by enhancer variants in a tissue relevant to a phenotype are the major source of interindividual variation in BP. Thus, our study provides an approach to scan human tissue and cell types for their physiological contribution to any trait.
Affiliation(s)
- Dongwon Lee
- Department of Pediatrics, Division of Nephrology, Boston Children's Hospital, Boston & Harvard Medical School, Boston, MA, USA
- Seong Kyu Han
- Department of Pediatrics, Division of Nephrology, Boston Children's Hospital, Boston & Harvard Medical School, Boston, MA, USA
- Or Yaacov
- Center for Human Genetics and Genomics, New York University Grossman School of Medicine, New York, NY, USA
- Hanna Berk-Rauch
- Center for Human Genetics and Genomics, New York University Grossman School of Medicine, New York, NY, USA
- Prabhu Mathiyalagan
- Center for Human Genetics and Genomics, New York University Grossman School of Medicine, New York, NY, USA
- Santhi K Ganesh
- Department of Internal Medicine & Department of Human Genetics, University of Michigan, Ann Arbor, MI, USA
- Aravinda Chakravarti
- Center for Human Genetics and Genomics, New York University Grossman School of Medicine, New York, NY, USA
98
Chen Y, Paramo MI, Zhang Y, Yao L, Shah SR, Jin Y, Zhang J, Pan X, Yu H. Finding Needles in the Haystack: Strategies for Uncovering Noncoding Regulatory Variants. Annu Rev Genet 2023; 57:201-222. [PMID: 37562413 DOI: 10.1146/annurev-genet-030723-120717] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/12/2023]
Abstract
Despite accumulating evidence implicating noncoding variants in human diseases, unraveling their functionality remains a significant challenge. Systematic annotations of the regulatory landscape and the growth of sequence variant data sets have fueled the development of tools and methods to identify causal noncoding variants and evaluate their regulatory effects. Here, we review the latest advances in the field and discuss potential future research avenues to gain a more in-depth understanding of noncoding regulatory variants.
Affiliation(s)
- You Chen
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York, USA
- Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York, USA
- Mauricio I Paramo
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York, USA
- Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York, USA
- Yingying Zhang
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York, USA
- Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York, USA
- Li Yao
- Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York, USA
- Department of Computational Biology, Cornell University, Ithaca, New York, USA
- Sagar R Shah
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York, USA
- Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York, USA
- Yiyang Jin
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York, USA
- Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York, USA
- Junke Zhang
- Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York, USA
- Department of Computational Biology, Cornell University, Ithaca, New York, USA
- Xiuqi Pan
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York, USA
- Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York, USA
- Haiyuan Yu
- Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York, USA
- Department of Computational Biology, Cornell University, Ithaca, New York, USA
99
Bhogale S, Seward C, Stubbs L, Sinha S. SEAMoD: A fully interpretable neural network for cis-regulatory analysis of differentially expressed genes. bioRxiv 2023:2023.11.09.565900. [PMID: 38014229 PMCID: PMC10680628 DOI: 10.1101/2023.11.09.565900] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2023]
Abstract
A common way to investigate gene regulatory mechanisms is to identify differentially expressed genes using transcriptomics, find their candidate enhancers using epigenomics, and search for over-represented transcription factor (TF) motifs in these enhancers using bioinformatics tools. A related follow-up task is to model gene expression as a function of enhancer sequences and rank TF motifs by their contribution to such models, thus prioritizing among regulators. We present a new computational tool called SEAMoD that performs the above tasks of motif finding and sequence-to-expression modeling simultaneously. It trains a convolutional neural network model to relate enhancer sequences to differential expression in one or more biological conditions. The model uses TF motifs to interpret the sequences, learning these motifs and their relative importance to each biological condition from data. It also utilizes epigenomic information in the form of activity scores of putative enhancers and automatically searches for the most promising enhancer for each gene. Compared to existing neural network models of non-coding sequences, SEAMoD uses far fewer parameters, requires far less training data, and emphasizes biological interpretability. We used SEAMoD to understand regulatory mechanisms underlying the differentiation of neural stem cells (NSCs) derived from mouse forebrain. We profiled gene expression and histone modifications in NSCs and three differentiated cell types and used SEAMoD to model differential expression of nearly 12,000 genes with an accuracy of 81%, in the process identifying Olig2, E2f family TFs, Foxo3, and Tcf4 as key transcriptional regulators of the differentiation process.
100
Pagnamenta AT, Camps C, Giacopuzzi E, Taylor JM, Hashim M, Calpena E, Kaisaki PJ, Hashimoto A, Yu J, Sanders E, Schwessinger R, Hughes JR, Lunter G, Dreau H, Ferla M, Lange L, Kesim Y, Ragoussis V, Vavoulis DV, Allroggen H, Ansorge O, Babbs C, Banka S, Baños-Piñero B, Beeson D, Ben-Ami T, Bennett DL, Bento C, Blair E, Brasch-Andersen C, Bull KR, Cario H, Cilliers D, Conti V, Davies EG, Dhalla F, Dacal BD, Dong Y, Dunford JE, Guerrini R, Harris AL, Hartley J, Hollander G, Javaid K, Kane M, Kelly D, Kelly D, Knight SJL, Kreins AY, Kvikstad EM, Langman CB, Lester T, Lines KE, Lord SR, Lu X, Mansour S, Manzur A, Maroofian R, Marsden B, Mason J, McGowan SJ, Mei D, Mlcochova H, Murakami Y, Németh AH, Okoli S, Ormondroyd E, Ousager LB, Palace J, Patel SY, Pentony MM, Pugh C, Rad A, Ramesh A, Riva SG, Roberts I, Roy N, Salminen O, Schilling KD, Scott C, Sen A, Smith C, Stevenson M, Thakker RV, Twigg SRF, Uhlig HH, van Wijk R, Vona B, Wall S, Wang J, Watkins H, Zak J, Schuh AH, Kini U, Wilkie AOM, Popitsch N, Taylor JC. Structural and non-coding variants increase the diagnostic yield of clinical whole genome sequencing for rare diseases. Genome Med 2023; 15:94. [PMID: 37946251 PMCID: PMC10636885 DOI: 10.1186/s13073-023-01240-0] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 12/14/2022] [Accepted: 09/27/2023] [Indexed: 11/12/2023] Open
Abstract
BACKGROUND Whole genome sequencing is increasingly being used for the diagnosis of patients with rare diseases. However, the diagnostic yields of many studies, particularly those conducted in a healthcare setting, are often disappointingly low, at 25-30%. This is in part because although entire genomes are sequenced, analysis is often confined to in silico gene panels or coding regions of the genome. METHODS We undertook WGS on a cohort of 122 unrelated rare disease patients and their relatives (300 genomes) who had been pre-screened by gene panels or arrays. Patients were recruited from a broad spectrum of clinical specialties. We applied a bioinformatics pipeline that would allow comprehensive analysis of all variant types. We combined established bioinformatics tools for phenotypic and genomic analysis with our novel algorithms (SVRare, ALTSPLICE and GREEN-DB) to detect and annotate structural, splice site and non-coding variants. RESULTS Our diagnostic yield was 43/122 cases (35%), although 47/122 cases (39%) were considered solved when novel candidate genes with supporting functional data were taken into account. Structural, splice site and deep intronic variants contributed to 20/47 (43%) of our solved cases. Five genes that are novel, or were novel at the time of discovery, were identified, whilst a further three genes are putative novel disease genes with evidence of causality. We identified variants of uncertain significance in a further fourteen candidate genes. The phenotypic spectrum associated with RMND1 was expanded to include polymicrogyria. Two patients with secondary findings in FBN1 and KCNQ1 were confirmed to have previously unidentified Marfan and long QT syndromes, respectively, and were referred for further clinical interventions. Clinical diagnoses were changed in six patients and treatment adjustments made for eight individuals, which for five patients was considered life-saving.
CONCLUSIONS Genome sequencing is increasingly being considered as a first-line genetic test in routine clinical settings and can make a substantial contribution to rapidly identifying a causal aetiology for many patients, shortening their diagnostic odyssey. We have demonstrated that structural, splice site and intronic variants make a significant contribution to diagnostic yield and that comprehensive analysis of the entire genome is essential to maximise the value of clinical genome sequencing.
Affiliation(s)
- Alistair T Pagnamenta
- Wellcome Centre for Human Genetics, University of Oxford, Old Road Campus, Roosevelt Drive, Oxford, OX3 7BN, UK
- NIHR Oxford Biomedical Research Centre, John Radcliffe Hospital, Oxford University Hospitals NHS Foundation Trust, Oxford, OX3 9DU, UK
- Carme Camps
- Wellcome Centre for Human Genetics, University of Oxford, Old Road Campus, Roosevelt Drive, Oxford, OX3 7BN, UK
- NIHR Oxford Biomedical Research Centre, John Radcliffe Hospital, Oxford University Hospitals NHS Foundation Trust, Oxford, OX3 9DU, UK
- Edoardo Giacopuzzi
- Wellcome Centre for Human Genetics, University of Oxford, Old Road Campus, Roosevelt Drive, Oxford, OX3 7BN, UK
- NIHR Oxford Biomedical Research Centre, John Radcliffe Hospital, Oxford University Hospitals NHS Foundation Trust, Oxford, OX3 9DU, UK
- Human Technopole, Viale Rita Levi Montalcini 1, 20157, Milan, Italy
- John M Taylor
- NIHR Oxford Biomedical Research Centre, John Radcliffe Hospital, Oxford University Hospitals NHS Foundation Trust, Oxford, OX3 9DU, UK
- Oxford Genetics Laboratories, Oxford University Hospitals NHS Foundation Trust, Churchill Hospital, Old Road, Oxford, OX3 7LE, UK
- Mona Hashim
- Wellcome Centre for Human Genetics, University of Oxford, Old Road Campus, Roosevelt Drive, Oxford, OX3 7BN, UK
- NIHR Oxford Biomedical Research Centre, John Radcliffe Hospital, Oxford University Hospitals NHS Foundation Trust, Oxford, OX3 9DU, UK
- Eduardo Calpena
- NIHR Oxford Biomedical Research Centre, John Radcliffe Hospital, Oxford University Hospitals NHS Foundation Trust, Oxford, OX3 9DU, UK
- MRC Weatherall Institute of Molecular Medicine, University of Oxford, John Radcliffe Hospital, Oxford, OX3 9DS, UK
- Pamela J Kaisaki
- Wellcome Centre for Human Genetics, University of Oxford, Old Road Campus, Roosevelt Drive, Oxford, OX3 7BN, UK
- NIHR Oxford Biomedical Research Centre, John Radcliffe Hospital, Oxford University Hospitals NHS Foundation Trust, Oxford, OX3 9DU, UK
- Akiko Hashimoto
- MRC Weatherall Institute of Molecular Medicine, University of Oxford, John Radcliffe Hospital, Oxford, OX3 9DS, UK
- Jing Yu
- Wellcome Centre for Human Genetics, University of Oxford, Old Road Campus, Roosevelt Drive, Oxford, OX3 7BN, UK
- NIHR Oxford Biomedical Research Centre, John Radcliffe Hospital, Oxford University Hospitals NHS Foundation Trust, Oxford, OX3 9DU, UK
- Edward Sanders
- MRC Weatherall Institute of Molecular Medicine, University of Oxford, John Radcliffe Hospital, Oxford, OX3 9DS, UK
- Ron Schwessinger
- MRC Weatherall Institute of Molecular Medicine, University of Oxford, John Radcliffe Hospital, Oxford, OX3 9DS, UK
- Jim R Hughes
- MRC Weatherall Institute of Molecular Medicine, University of Oxford, John Radcliffe Hospital, Oxford, OX3 9DS, UK
- Gerton Lunter
- MRC Weatherall Institute of Molecular Medicine, University of Oxford, John Radcliffe Hospital, Oxford, OX3 9DS, UK
- University Medical Center Groningen, Groningen University, PO Box 72, 9700 AB, Groningen, The Netherlands
- Helene Dreau
- NIHR Oxford Biomedical Research Centre, John Radcliffe Hospital, Oxford University Hospitals NHS Foundation Trust, Oxford, OX3 9DU, UK
- Department of Oncology, Oxford Molecular Diagnostics Centre, University of Oxford, Level 4, John Radcliffe Hospital, Headley Way, Oxford, OX3 9DU, UK
- Matteo Ferla
- Wellcome Centre for Human Genetics, University of Oxford, Old Road Campus, Roosevelt Drive, Oxford, OX3 7BN, UK
- NIHR Oxford Biomedical Research Centre, John Radcliffe Hospital, Oxford University Hospitals NHS Foundation Trust, Oxford, OX3 9DU, UK
- Lukas Lange
- Wellcome Centre for Human Genetics, University of Oxford, Old Road Campus, Roosevelt Drive, Oxford, OX3 7BN, UK
- NIHR Oxford Biomedical Research Centre, John Radcliffe Hospital, Oxford University Hospitals NHS Foundation Trust, Oxford, OX3 9DU, UK
- Yesim Kesim
- Wellcome Centre for Human Genetics, University of Oxford, Old Road Campus, Roosevelt Drive, Oxford, OX3 7BN, UK
- NIHR Oxford Biomedical Research Centre, John Radcliffe Hospital, Oxford University Hospitals NHS Foundation Trust, Oxford, OX3 9DU, UK
- Vassilis Ragoussis
- Wellcome Centre for Human Genetics, University of Oxford, Old Road Campus, Roosevelt Drive, Oxford, OX3 7BN, UK
- NIHR Oxford Biomedical Research Centre, John Radcliffe Hospital, Oxford University Hospitals NHS Foundation Trust, Oxford, OX3 9DU, UK
- Dimitrios V Vavoulis
- Wellcome Centre for Human Genetics, University of Oxford, Old Road Campus, Roosevelt Drive, Oxford, OX3 7BN, UK
- NIHR Oxford Biomedical Research Centre, John Radcliffe Hospital, Oxford University Hospitals NHS Foundation Trust, Oxford, OX3 9DU, UK
- Department of Oncology, Oxford Molecular Diagnostics Centre, University of Oxford, Level 4, John Radcliffe Hospital, Headley Way, Oxford, OX3 9DU, UK
- Holger Allroggen
- Neurosciences Department, UHCW NHS Trust, Clifford Bridge Road, Coventry, CV2 2DX, UK
| | - Olaf Ansorge
- Nuffield Department of Clinical Neurosciences, University of Oxford, Oxford, OX3 9DU, UK
| | - Christian Babbs
- MRC Weatherall Institute of Molecular Medicine, University of Oxford, John Radcliffe Hospital, Oxford, OX3 9DS, UK
| | - Siddharth Banka
- Division of Evolution, Infection and Genomics, School of Biological Sciences, Faculty of Biology, Medicine and Health, University of Manchester, Manchester, UK
- Manchester Centre for Genomic Medicine, Saint Mary's Hospital, Oxford Road, Manchester, M13 9WL, UK
| | - Benito Baños-Piñero
- Oxford Genetics Laboratories, Oxford University Hospitals NHS Foundation Trust, Churchill Hospital, Old Road, Oxford, OX3 7LE, UK
| | - David Beeson
- MRC Weatherall Institute of Molecular Medicine, University of Oxford, John Radcliffe Hospital, Oxford, OX3 9DS, UK
- Nuffield Department of Clinical Neurosciences, University of Oxford, Oxford, OX3 9DU, UK
| | - Tal Ben-Ami
- Pediatric Hematology-Oncology Unit, Kaplan Medical Center, Rehovot, Israel
| | - David L Bennett
- Nuffield Department of Clinical Neurosciences, University of Oxford, Oxford, OX3 9DU, UK
| | - Celeste Bento
- Hematology Department, Hospitais da Universidade de Coimbra, Coimbra, Portugal
| | - Edward Blair
- NIHR Oxford Biomedical Research Centre, John Radcliffe Hospital, Oxford University Hospitals NHS Foundation Trust, Oxford, OX3 9DU, UK
- Oxford Centre for Genomic Medicine, Oxford University Hospitals NHS Foundation Trust, Oxford, OX3 7LE, UK
| | - Charlotte Brasch-Andersen
- Department of Clinical Genetics, Odense University Hospital and Department of Clinical Research, University of Southern Denmark, Odense, Denmark
| | - Katherine R Bull
- Wellcome Centre for Human Genetics, University of Oxford, Old Road Campus, Roosevelt Drive, Oxford, OX3 7BN, UK
- Nuffield Department of Medicine, University of Oxford, Oxford, OX3 7BN, UK
| | - Holger Cario
- Department of Pediatrics and Adolescent Medicine, University Medical Center, Eythstrasse 24, 89075, Ulm, Germany
| | - Deirdre Cilliers
- Oxford Centre for Genomic Medicine, Oxford University Hospitals NHS Foundation Trust, Oxford, OX3 7LE, UK
| | - Valerio Conti
- Neuroscience Department, Meyer Children's Hospital IRCCS, Viale Pieraccini 24, 50139, Florence, Italy
| | - E Graham Davies
- Department of Immunology, Great Ormond Street Hospital for Children NHS Trust and UCL Great Ormond Street Institute of Child Health, Zayed Centre for Research, 2nd Floor, 20C Guilford Street, London, WC1N 1DZ, UK
| | - Fatima Dhalla
- Department of Paediatrics, Institute of Developmental and Regenerative Medicine, IMS-Tetsuya Nakamura Building, Old Road Campus, Roosevelt Drive, Oxford, OX3 7TY, UK
| | - Beatriz Diez Dacal
- Oxford Genetics Laboratories, Oxford University Hospitals NHS Foundation Trust, Churchill Hospital, Old Road, Oxford, OX3 7LE, UK
| | - Yin Dong
- MRC Weatherall Institute of Molecular Medicine, University of Oxford, John Radcliffe Hospital, Oxford, OX3 9DS, UK
- Nuffield Department of Clinical Neurosciences, University of Oxford, Oxford, OX3 9DU, UK
| | - James E Dunford
- Oxford NIHR Musculoskeletal BRC and Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, Nuffield Orthopaedic Centre, Old Road, Oxford, OX3 7HE, UK
| | - Renzo Guerrini
- Neuroscience Department, Meyer Children's Hospital IRCCS, Viale Pieraccini 24, 50139, Florence, Italy
| | - Adrian L Harris
- Department of Oncology, University of Oxford, Old Road Campus Research Building, Oxford, OX3 7DQ, UK
| | - Jane Hartley
- Liver Unit, Birmingham Women's & Children's Hospital and University of Birmingham, Steelhouse Lane, Birmingham, B4 6NH, UK
| | - Georg Hollander
- Department of Paediatrics, University of Oxford, Level 2, Children's Hospital, John Radcliffe Hospital, Oxford, OX3 9DU, UK
| | - Kassim Javaid
- Oxford NIHR Musculoskeletal BRC and Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, Nuffield Orthopaedic Centre, Old Road, Oxford, OX3 7HE, UK
| | - Maureen Kane
- Department of Pharmaceutical Sciences, School of Pharmacy, University of Maryland, Pharmacy Hall North, Room 731, 20 N. Pine Street, Baltimore, MD, 21201, USA
| | - Deirdre Kelly
- Liver Unit, Birmingham Women's & Children's Hospital and University of Birmingham, Steelhouse Lane, Birmingham, B4 6NH, UK
| | - Dominic Kelly
- Children's Hospital, OUH NHS Foundation Trust, NIHR Oxford BRC, Headley Way, Oxford, OX3 9DU, UK
| | - Samantha J L Knight
- Wellcome Centre for Human Genetics, University of Oxford, Old Road Campus, Roosevelt Drive, Oxford, OX3 7BN, UK
- NIHR Oxford Biomedical Research Centre, John Radcliffe Hospital, Oxford University Hospitals NHS Foundation Trust, Oxford, OX3 9DU, UK
| | - Alexandra Y Kreins
- Department of Immunology, Great Ormond Street Hospital for Children NHS Trust and UCL Great Ormond Street Institute of Child Health, Zayed Centre for Research, 2nd Floor, 20C Guilford Street, London, WC1N 1DZ, UK
| | - Erika M Kvikstad
- Wellcome Centre for Human Genetics, University of Oxford, Old Road Campus, Roosevelt Drive, Oxford, OX3 7BN, UK
- NIHR Oxford Biomedical Research Centre, John Radcliffe Hospital, Oxford University Hospitals NHS Foundation Trust, Oxford, OX3 9DU, UK
| | - Craig B Langman
- Feinberg School of Medicine, Northwestern University, 211 E Chicago Avenue, Chicago, IL, MS37, USA
| | - Tracy Lester
- Oxford Genetics Laboratories, Oxford University Hospitals NHS Foundation Trust, Churchill Hospital, Old Road, Oxford, OX3 7LE, UK
| | - Kate E Lines
- NIHR Oxford Biomedical Research Centre, John Radcliffe Hospital, Oxford University Hospitals NHS Foundation Trust, Oxford, OX3 9DU, UK
- University of Oxford, Academic Endocrine Unit, OCDEM, Churchill Hospital, Oxford, OX3 7LJ, UK
| | - Simon R Lord
- Early Phase Clinical Trials Unit, Department of Oncology, University of Oxford, Cancer and Haematology Centre, Level 2 Administration Area, Churchill Hospital, Oxford, OX3 7LJ, UK
| | - Xin Lu
- Nuffield Department of Clinical Medicine, Ludwig Institute for Cancer Research, University of Oxford, Old Road Campus Research Building, Oxford, OX3 7DQ, UK
| | - Sahar Mansour
- St George's University Hospitals NHS Foundation Trust, Blackshaw Road, Tooting, London, SW17 0QT, UK
| | - Adnan Manzur
- MRC Centre for Neuromuscular Diseases, National Hospital for Neurology and Neurosurgery, Queen Square, London, WC1N 3BG, UK
| | - Reza Maroofian
- Department of Neuromuscular Diseases, UCL Queen Square Institute of Neurology and The National Hospital for Neurology and Neurosurgery, London, WC1N 3BG, UK
| | - Brian Marsden
- Nuffield Department of Medicine, Kennedy Institute, University of Oxford, Oxford, OX3 7BN, UK
| | - Joanne Mason
- Yourgene Health Headquarters, Skelton House, Lloyd Street North, Manchester Science Park, Manchester, M15 6SH, UK
| | - Simon J McGowan
- MRC Weatherall Institute of Molecular Medicine, University of Oxford, John Radcliffe Hospital, Oxford, OX3 9DS, UK
| | - Davide Mei
- Neuroscience Department, Meyer Children's Hospital IRCCS, Viale Pieraccini 24, 50139, Florence, Italy
| | - Hana Mlcochova
- MRC Weatherall Institute of Molecular Medicine, University of Oxford, John Radcliffe Hospital, Oxford, OX3 9DS, UK
| | - Yoshiko Murakami
- Research Institute for Microbial Diseases, Osaka University, 3-1 Yamadaoka, Suita, Osaka, 565-0871, Japan
| | - Andrea H Németh
- Nuffield Department of Clinical Neurosciences, University of Oxford, Oxford, OX3 9DU, UK
- Oxford Centre for Genomic Medicine, Oxford University Hospitals NHS Foundation Trust, Oxford, OX3 7LE, UK
| | - Steven Okoli
- Imperial College NHS Trust, Department of Haematology, Hammersmith Hospital, Du Cane Road, London, W12 0HS, UK
| | - Elizabeth Ormondroyd
- NIHR Oxford Biomedical Research Centre, John Radcliffe Hospital, Oxford University Hospitals NHS Foundation Trust, Oxford, OX3 9DU, UK
- University of Oxford, Level 6 West Wing, Oxford, OX3 9DU, JR, UK
| | - Lilian Bomme Ousager
- Department of Clinical Genetics, Odense University Hospital and Department of Clinical Research, University of Southern Denmark, Odense, Denmark
| | - Jacqueline Palace
- Nuffield Department of Clinical Neurosciences, University of Oxford, Oxford, OX3 9DU, UK
| | - Smita Y Patel
- Clinical Immunology, John Radcliffe Hospital, Level 4A, Oxford, OX3 9DU, UK
| | - Melissa M Pentony
- Wellcome Centre for Human Genetics, University of Oxford, Old Road Campus, Roosevelt Drive, Oxford, OX3 7BN, UK
- NIHR Oxford Biomedical Research Centre, John Radcliffe Hospital, Oxford University Hospitals NHS Foundation Trust, Oxford, OX3 9DU, UK
| | - Chris Pugh
- Nuffield Department of Medicine, University of Oxford, Oxford, OX3 7BN, UK
| | - Aboulfazl Rad
- Department of Otolaryngology-Head & Neck Surgery, Tübingen Hearing Research Centre, Eberhard Karls University, Elfriede-Aulhorn-Str. 5, 72076, Tübingen, Germany
| | - Archana Ramesh
- Wellcome Centre for Human Genetics, University of Oxford, Old Road Campus, Roosevelt Drive, Oxford, OX3 7BN, UK
- Nuffield Department of Clinical Neurosciences, University of Oxford, Oxford, OX3 9DU, UK
| | - Simone G Riva
- MRC Weatherall Institute of Molecular Medicine, University of Oxford, John Radcliffe Hospital, Oxford, OX3 9DS, UK
| | - Irene Roberts
- MRC Weatherall Institute of Molecular Medicine, University of Oxford, John Radcliffe Hospital, Oxford, OX3 9DS, UK
- Department of Paediatrics, University of Oxford, Level 2, Children's Hospital, John Radcliffe Hospital, Oxford, OX3 9DU, UK
| | - Noémi Roy
- Department of Haematology, Oxford University Hospitals NHS Foundation Trust, Level 4, Haematology, John Radcliffe Hospital, Oxford, OX3 9DU, UK
| | - Outi Salminen
- NIHR Oxford Biomedical Research Centre, John Radcliffe Hospital, Oxford University Hospitals NHS Foundation Trust, Oxford, OX3 9DU, UK
- Department of Oncology, Oxford Molecular Diagnostics Centre, University of Oxford, Level 4, John Radcliffe Hospital, Headley Way, Oxford, OX3 9DU, UK
| | - Kyleen D Schilling
- Ann & Robert H. Lurie Children's Hospital of Chicago, 225 E Chicago Avenue, Chicago, IL, 60611, USA
| | - Caroline Scott
- MRC Weatherall Institute of Molecular Medicine, University of Oxford, John Radcliffe Hospital, Oxford, OX3 9DS, UK
| | - Arjune Sen
- Nuffield Department of Clinical Neurosciences, University of Oxford, Oxford, OX3 9DU, UK
| | - Conrad Smith
- Oxford Genetics Laboratories, Oxford University Hospitals NHS Foundation Trust, Churchill Hospital, Old Road, Oxford, OX3 7LE, UK
| | - Mark Stevenson
- University of Oxford, Academic Endocrine Unit, OCDEM, Churchill Hospital, Oxford, OX3 7LJ, UK
| | - Rajesh V Thakker
- University of Oxford, Academic Endocrine Unit, OCDEM, Churchill Hospital, Oxford, OX3 7LJ, UK
| | - Stephen R F Twigg
- MRC Weatherall Institute of Molecular Medicine, University of Oxford, John Radcliffe Hospital, Oxford, OX3 9DS, UK
| | - Holm H Uhlig
- NIHR Oxford Biomedical Research Centre, John Radcliffe Hospital, Oxford University Hospitals NHS Foundation Trust, Oxford, OX3 9DU, UK
- Department of Paediatrics, University of Oxford, Level 2, Children's Hospital, John Radcliffe Hospital, Oxford, OX3 9DU, UK
- Translational Gastroenterology Unit, John Radcliffe Hospital, Oxford, OX3 9DU, UK
| | - Richard van Wijk
- UMC Utrecht, Heidelberglaan 100, 3584 CX, Utrecht, The Netherlands
| | - Barbara Vona
- Department of Otolaryngology-Head & Neck Surgery, Tübingen Hearing Research Centre, Eberhard Karls University, Elfriede-Aulhorn-Str. 5, 72076, Tübingen, Germany
- Institute of Human Genetics, University Medical Center Göttingen, Heinrich-Düker-Weg 12, 37073, Göttingen, Germany
- Institute for Auditory Neuroscience and InnerEarLab, University Medical Center Göttingen, Robert-Koch-Str. 40, 37075, Göttingen, Germany
| | - Steven Wall
- Oxford Craniofacial Unit, John Radcliffe Hospital, Level LG1, West Wing, Oxford, OX3 9DU, UK
| | - Jing Wang
- Nuffield Department of Clinical Neurosciences, University of Oxford, Oxford, OX3 9DU, UK
| | - Hugh Watkins
- NIHR Oxford Biomedical Research Centre, John Radcliffe Hospital, Oxford University Hospitals NHS Foundation Trust, Oxford, OX3 9DU, UK
- University of Oxford, Level 6 West Wing, Oxford, OX3 9DU, JR, UK
| | - Jaroslav Zak
- Nuffield Department of Clinical Medicine, Ludwig Institute for Cancer Research, University of Oxford, Old Road Campus Research Building, Oxford, OX3 7DQ, UK
- Department of Immunology and Microbiology, The Scripps Research Institute, 10550 North Torrey Pines Road, La Jolla, CA, 92037, USA
| | - Anna H Schuh
- Department of Oncology, Oxford Molecular Diagnostics Centre, University of Oxford, Level 4, John Radcliffe Hospital, Headley Way, Oxford, OX3 9DU, UK
| | - Usha Kini
- NIHR Oxford Biomedical Research Centre, John Radcliffe Hospital, Oxford University Hospitals NHS Foundation Trust, Oxford, OX3 9DU, UK
- Oxford Centre for Genomic Medicine, Oxford University Hospitals NHS Foundation Trust, Oxford, OX3 7LE, UK
| | - Andrew O M Wilkie
- NIHR Oxford Biomedical Research Centre, John Radcliffe Hospital, Oxford University Hospitals NHS Foundation Trust, Oxford, OX3 9DU, UK
- MRC Weatherall Institute of Molecular Medicine, University of Oxford, John Radcliffe Hospital, Oxford, OX3 9DS, UK
| | - Niko Popitsch
- Wellcome Centre for Human Genetics, University of Oxford, Old Road Campus, Roosevelt Drive, Oxford, OX3 7BN, UK
- NIHR Oxford Biomedical Research Centre, John Radcliffe Hospital, Oxford University Hospitals NHS Foundation Trust, Oxford, OX3 9DU, UK
- Department of Biochemistry and Cell Biology, Max Perutz Labs, University of Vienna, Vienna BioCenter (VBC), Dr.-Bohr-Gasse 9, 1030, Vienna, Austria
| | - Jenny C Taylor
- Wellcome Centre for Human Genetics, University of Oxford, Old Road Campus, Roosevelt Drive, Oxford, OX3 7BN, UK
- NIHR Oxford Biomedical Research Centre, John Radcliffe Hospital, Oxford University Hospitals NHS Foundation Trust, Oxford, OX3 9DU, UK
| |
|