1
Pugalenthi PV, He B, Xie L, Nho K, Saykin AJ, Yan J. Deciphering the tissue-specific functional effect of Alzheimer risk SNPs with deep genome annotation. BioData Min 2024; 17:50. [PMID: 39538253 PMCID: PMC11558841 DOI: 10.1186/s13040-024-00400-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/17/2024] [Accepted: 10/14/2024] [Indexed: 11/16/2024]
Abstract
Alzheimer's disease (AD) is a highly heritable brain dementia, along with substantial failure of cognitive function. Large-scale genome-wide association studies (GWASs) have led to a set of SNPs significantly associated with AD and related traits. GWAS hits usually emerge as clusters where a lead SNP with the highest significance is surrounded by other less significant neighboring SNPs. Although functionality is not guaranteed even with the strongest associations in GWASs, lead SNPs have historically been the focus of the field, with the remaining associations inferred to be redundant. Recent deep genome annotation tools enable the prediction of function from a segment of a DNA sequence with significantly improved precision, which allows in-silico mutagenesis to interrogate the functional effect of SNP alleles. In this project, we explored the impact of top AD GWAS hits around APOE region on chromatin functions and whether it will be altered by the genetic context (i.e., alleles of neighboring SNPs). Our results showed that highly correlated SNPs in the same LD block could have distinct impacts on downstream functions. Although some GWAS lead SNPs showed dominant functional effects regardless of the neighborhood SNP alleles, several other SNPs did exhibit enhanced loss or gain of function under certain genetic contexts, suggesting potential additional information hidden in the LD blocks.
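The in-silico mutagenesis idea in this abstract can be illustrated with a minimal sketch. It assumes a hypothetical scoring function (`toy_score`, which simply counts CG dinucleotides) in place of a real deep genome annotation model; the function names, the toy sequence, and the SNP positions are illustrative assumptions, not the authors' pipeline.

```python
from itertools import product


def apply_alleles(ref_seq, alleles):
    """Return a copy of ref_seq with the {position: base} substitutions applied."""
    seq = list(ref_seq)
    for pos, base in alleles.items():
        seq[pos] = base
    return "".join(seq)


def toy_score(seq):
    """Hypothetical stand-in for a sequence-to-function model: count CG dinucleotides."""
    return seq.count("CG")


def context_effects(ref_seq, target, neighbors, score=toy_score):
    """Compute the alt-minus-ref score of a target SNP under every neighbor-allele context.

    target:    (position, ref_base, alt_base)
    neighbors: list of (position, ref_base, alt_base) for nearby SNPs
    Returns a dict mapping each tuple of neighbor alleles to the target's effect.
    """
    pos, ref_base, alt_base = target
    effects = {}
    for choice in product(*[(r, a) for _, r, a in neighbors]):
        context = {p: b for (p, _, _), b in zip(neighbors, choice)}
        ref = apply_alleles(ref_seq, {**context, pos: ref_base})
        alt = apply_alleles(ref_seq, {**context, pos: alt_base})
        effects[choice] = score(alt) - score(ref)
    return effects


# Toy example: the target T>C at index 2 only creates a CG site
# when the neighboring SNP at index 3 carries its G allele.
effects = context_effects("AATATAA", (2, "T", "C"), [(3, "A", "G")])
print(effects)  # {('A',): 0, ('G',): 1}
```

In this toy case the target allele has no effect on its own and a nonzero effect only in one neighbor context, which is the kind of context dependence the abstract describes for SNPs within an LD block.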
Affiliation(s)
- Pradeep Varathan Pugalenthi
  - Department of Biomedical Engineering and Informatics, Indiana University Indianapolis, 420 University Blvd, Indianapolis, IN, 46202, USA
- Bing He
  - Department of Biomedical Engineering and Informatics, Indiana University Indianapolis, 420 University Blvd, Indianapolis, IN, 46202, USA
- Linhui Xie
  - Department of Electrical and Computer Engineering, Purdue University Indianapolis, 420 University Blvd, Indianapolis, IN, 46202, USA
- Kwangsik Nho
  - Department of Radiology and Imaging Sciences, Indiana University School of Medicine, 550 University Blvd, Indianapolis, IN, 46202, USA
- Andrew J Saykin
  - Department of Radiology and Imaging Sciences, Indiana University School of Medicine, 550 University Blvd, Indianapolis, IN, 46202, USA
- Jingwen Yan
  - Department of Biomedical Engineering and Informatics, Indiana University Indianapolis, 420 University Blvd, Indianapolis, IN, 46202, USA
  - Department of Radiology and Imaging Sciences, Indiana University School of Medicine, 550 University Blvd, Indianapolis, IN, 46202, USA
2
Zhang Q, Wang S, Li Z, Pan Y, Huang D. Cross-Species Prediction of Transcription Factor Binding by Adversarial Training of a Novel Nucleotide-Level Deep Neural Network. Advanced Science (Weinheim, Baden-Wurttemberg, Germany) 2024; 11:e2405685. [PMID: 39076052 PMCID: PMC11423150 DOI: 10.1002/advs.202405685] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2024] [Indexed: 07/31/2024]
Abstract
Cross-species prediction of TF binding remains a major challenge due to the rapid evolutionary turnover of individual TF binding sites, resulting in cross-species predictive performance being consistently worse than within-species performance. In this study, a novel Nucleotide-Level Deep Neural Network (NLDNN) is first proposed to predict TF binding within or across species. NLDNN regards the task of TF binding prediction as a nucleotide-level regression task, which takes DNA sequences as input and directly predicts experimental coverage values. Beyond predictive performance, it also assesses model performance by locating potential TF binding regions, discriminating TF-specific single-nucleotide polymorphisms (SNPs), and identifying causal disease-associated SNPs. The experimental results show that NLDNN outperforms the competing methods in these tasks. Then, a dual-path framework is designed for adversarial training of NLDNN to further improve the cross-species prediction performance by pulling the domain space of human and mouse species closer. Through comparison and analysis, it finds that adversarial training not only can improve the cross-species prediction performance between humans and mice but also enhance the ability to locate TF binding regions and discriminate TF-specific SNPs. By visualizing the predictions, it is figured out that the framework corrects some mispredictions by amplifying the coverage values of incorrectly predicted peaks.
Affiliation(s)
- Qinhu Zhang
  - Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo 315201, China
  - Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei 230021, China
  - Big Data and Intelligent Computing Research Center, Guangxi Academy of Science, Nanning 530007, China
- Siguo Wang
  - Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo 315201, China
- Zhipeng Li
  - Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo 315201, China
- Yijie Pan
  - Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo 315201, China
- De-Shuang Huang
  - Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo 315201, China
  - Institute for Regenerative Medicine, Shanghai East Hospital, Tongji University, Shanghai 200092, China
3
Kathail P, Shuai RW, Chung R, Ye CJ, Loeb GB, Ioannidis NM. Current genomic deep learning models display decreased performance in cell type-specific accessible regions. Genome Biol 2024; 25:202. [PMID: 39090688 PMCID: PMC11293111 DOI: 10.1186/s13059-024-03335-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2023] [Accepted: 07/10/2024] [Indexed: 08/04/2024]
Abstract
BACKGROUND: A number of deep learning models have been developed to predict epigenetic features such as chromatin accessibility from DNA sequence. Model evaluations commonly report performance genome-wide; however, cis regulatory elements (CREs), which play critical roles in gene regulation, make up only a small fraction of the genome. Furthermore, cell type-specific CREs contain a large proportion of complex disease heritability. RESULTS: We evaluate genomic deep learning models in chromatin accessibility regions with varying degrees of cell type specificity. We assess two modeling directions in the field: general purpose models trained across thousands of outputs (cell types and epigenetic marks) and models tailored to specific tissues and tasks. We find that the accuracy of genomic deep learning models, including two state-of-the-art general purpose models (Enformer and Sei), varies across the genome and is reduced in cell type-specific accessible regions. Using accessibility models trained on cell types from specific tissues, we find that increasing model capacity to learn cell type-specific regulatory syntax, through single-task learning or high capacity multi-task models, can improve performance in cell type-specific accessible regions. We also observe that improving reference sequence predictions does not consistently improve variant effect predictions, indicating that novel strategies are needed to improve performance on variants. CONCLUSIONS: Our results provide a new perspective on the performance of genomic deep learning models, showing that performance varies across the genome and is particularly reduced in cell type-specific accessible regions. We also identify strategies to maximize performance in cell type-specific accessible regions.
Affiliation(s)
- Pooja Kathail
  - Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
- Richard W Shuai
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA
- Ryan Chung
  - Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
- Chun Jimmie Ye
  - Division of Rheumatology, Department of Medicine, University of California, San Francisco, CA, USA
  - Institute for Human Genetics, University of California, San Francisco, CA, USA
  - Department of Epidemiology and Biostatistics, University of California, San Francisco, CA, USA
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, CA, USA
  - Parker Institute for Cancer Immunotherapy, San Francisco, CA, USA
  - Chan Zuckerberg Biohub, San Francisco, CA, USA
- Gabriel B Loeb
  - Division of Nephrology, Department of Medicine, University of California, San Francisco, CA, USA
  - Cardiovascular Research Institute, University of California, San Francisco, CA, USA
- Nilah M Ioannidis
  - Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA
  - Chan Zuckerberg Biohub, San Francisco, CA, USA
4
Kathail P, Shuai RW, Chung R, Ye CJ, Loeb GB, Ioannidis NM. Current genomic deep learning models display decreased performance in cell type specific accessible regions. bioRxiv 2024:2024.07.05.602265. [PMID: 39026761 PMCID: PMC11257480 DOI: 10.1101/2024.07.05.602265] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/20/2024]
Abstract
Background: A number of deep learning models have been developed to predict epigenetic features such as chromatin accessibility from DNA sequence. Model evaluations commonly report performance genome-wide; however, cis regulatory elements (CREs), which play critical roles in gene regulation, make up only a small fraction of the genome. Furthermore, cell type specific CREs contain a large proportion of complex disease heritability. Results: We evaluate genomic deep learning models in chromatin accessibility regions with varying degrees of cell type specificity. We assess two modeling directions in the field: general purpose models trained across thousands of outputs (cell types and epigenetic marks), and models tailored to specific tissues and tasks. We find that the accuracy of genomic deep learning models, including two state-of-the-art general purpose models - Enformer and Sei - varies across the genome and is reduced in cell type specific accessible regions. Using accessibility models trained on cell types from specific tissues, we find that increasing model capacity to learn cell type specific regulatory syntax - through single-task learning or high capacity multi-task models - can improve performance in cell type specific accessible regions. We also observe that improving reference sequence predictions does not consistently improve variant effect predictions, indicating that novel strategies are needed to improve performance on variants. Conclusions: Our results provide a new perspective on the performance of genomic deep learning models, showing that performance varies across the genome and is particularly reduced in cell type specific accessible regions. We also identify strategies to maximize performance in cell type specific accessible regions.
Affiliation(s)
- Pooja Kathail
  - Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
- Richard W. Shuai
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA
- Ryan Chung
  - Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
- Chun Jimmie Ye
  - Division of Rheumatology, Department of Medicine, University of California, San Francisco, San Francisco, CA, USA
  - Institute for Human Genetics, University of California, San Francisco, San Francisco, CA, USA
  - Department of Epidemiology and Biostatistics, University of California, San Francisco, San Francisco, CA, USA
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA
  - Parker Institute for Cancer Immunotherapy, San Francisco, CA, USA
  - Chan Zuckerberg Biohub, San Francisco, CA, USA
- Gabriel B. Loeb
  - Division of Nephrology, Department of Medicine, University of California, San Francisco, San Francisco, CA, USA
  - Cardiovascular Research Institute, University of California, San Francisco, San Francisco, CA, USA
- Nilah M. Ioannidis
  - Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA
  - Chan Zuckerberg Biohub, San Francisco, CA, USA
5
Dorans E, Jagadeesh K, Dey K, Price AL. Linking regulatory variants to target genes by integrating single-cell multiome methods and genomic distance. medRxiv 2024:2024.05.24.24307813. [PMID: 38826240 PMCID: PMC11142273 DOI: 10.1101/2024.05.24.24307813] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/04/2024]
Abstract
Methods that analyze single-cell paired RNA-seq and ATAC-seq multiome data have shown great promise in linking regulatory elements to genes. However, existing methods differ in their modeling assumptions and approaches to account for biological and technical noise (leading to low concordance in their linking scores) and do not capture the effects of genomic distance. We propose pgBoost, an integrative modeling framework that trains a non-linear combination of existing linking strategies (including genomic distance) on fine-mapped eQTL data to assign a probabilistic score to each candidate SNP-gene link. We applied pgBoost to single-cell multiome data from 85k cells representing 6 major immune/blood cell types. pgBoost attained higher enrichment for fine-mapped eSNP-eGene pairs (e.g. 21x at distance >10kb) than existing methods (1.2-10x; p-value for difference = 5e-13 vs. distance-based method and < 4e-35 for each other method), with larger improvements at larger distances (e.g. 35x vs. 0.89-6.6x at distance >100kb; p-value for difference < 0.002 vs. each other method). pgBoost also outperformed existing methods in enrichment for CRISPR-validated links (e.g. 4.8x vs. 1.6-4.1x at distance >10kb; p-value for difference = 0.25 vs. distance-based method and < 2e-5 for each other method), with larger improvements at larger distances (e.g. 15x vs. 1.6-2.5x at distance >100kb; p-value for difference < 0.009 for each other method). Similar improvements in enrichment were observed for links derived from Activity-By-Contact (ABC) scores and GWAS data. We further determined that restricting pgBoost to features from a focal cell type improved the identification of SNP-gene links relevant to that cell type. We highlight several examples where pgBoost linked fine-mapped GWAS variants to experimentally validated or biologically plausible target genes that were not implicated by other methods.
In conclusion, a non-linear combination of linking strategies, including genomic distance, improves power to identify target genes underlying GWAS associations.
6
Robson ES, Ioannidis NM. GUANinE v1.0: Benchmark Datasets for Genomic AI Sequence-to-Function Models. bioRxiv 2024:2023.10.12.562113. [PMID: 37904945 PMCID: PMC10614795 DOI: 10.1101/2023.10.12.562113] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/01/2023]
Abstract
Computational genomics increasingly relies on machine learning methods for genome interpretation, and the recent adoption of neural sequence-to-function models highlights the need for rigorous model specification and controlled evaluation, problems familiar to other fields of AI. Research strategies that have greatly benefited other fields - including benchmarking, auditing, and algorithmic fairness - are also needed to advance the field of genomic AI and to facilitate model development. Here we propose a genomic AI benchmark, GUANinE, for evaluating model generalization across a number of distinct genomic tasks. Compared to existing task formulations in computational genomics, GUANinE is large-scale, de-noised, and suitable for evaluating pretrained models. GUANinE v1.0 primarily focuses on functional genomics tasks such as functional element annotation and gene expression prediction, and it also draws upon connections to evolutionary biology through sequence conservation tasks. The current GUANinE tasks provide insight into the performance of existing genomic AI models and non-neural baselines, with opportunities to be refined, revisited, and broadened as the field matures. Finally, the GUANinE benchmark allows us to evaluate new self-supervised T5 models and explore the tradeoffs between tokenization and model performance, while showcasing the potential for self-supervision to complement existing pretraining procedures.
Affiliation(s)
- Eyes S Robson
  - Center for Computational Biology, UC Berkeley, Berkeley, CA 94720
- Nilah M Ioannidis
  - Department of Electrical Engineering and Computer Sciences, UC Berkeley, Berkeley, CA 94720
7
Pugalenthi PV, He B, Xie L, Nho K, Saykin AJ, Yan J. Deciphering the tissue-specific functional effect of Alzheimer risk SNPs with deep genome annotation. Research Square 2024:rs.3.rs-3871665. [PMID: 38405816 PMCID: PMC10889055 DOI: 10.21203/rs.3.rs-3871665/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/27/2024]
Abstract
Alzheimer's disease (AD) is a highly heritable brain dementia, along with substantial failure of cognitive function. Large-scale genome-wide association studies (GWASs) have led to a significant set of SNPs associated with AD and related traits. GWAS hits usually emerge as clusters where a lead SNP with the highest significance is surrounded by other less significant neighboring SNPs. Although functionality is not guaranteed even with the strongest associations in GWASs, lead SNPs have historically been the focus of the field, with the remaining associations inferred to be redundant. Recent deep genome annotation tools enable the prediction of function from a segment of a DNA sequence with significantly improved precision, which allows in-silico mutagenesis to interrogate the functional effect of SNP alleles. In this project, we explored the impact of top AD GWAS hits on chromatin functions and whether it will be altered by the genetic context (i.e., alleles of neighboring SNPs). Our results showed that highly correlated SNPs in the same LD block could have distinct impacts on downstream functions. Although some GWAS lead SNPs showed dominant functional effects regardless of the neighborhood SNP alleles, several other SNPs did exhibit enhanced loss or gain of function under certain genetic contexts, suggesting potential additional information hidden in the LD blocks.
Affiliation(s)
- Pradeep Varathan Pugalenthi
  - Department of BioHealth Informatics, Indiana University-Purdue University Indianapolis, 420 University Blvd, Indianapolis, 46202, Indiana, United States
- Bing He
  - Department of BioHealth Informatics, Indiana University-Purdue University Indianapolis, 420 University Blvd, Indianapolis, 46202, Indiana, United States
- Linhui Xie
  - Department of Electrical and Computer Engineering, Indiana University-Purdue University Indianapolis, 420 University Blvd, Indianapolis, 46202, Indiana, United States
- Kwangsik Nho
  - Department of Radiology and Imaging Sciences, Indiana University School of Medicine, 550 University Blvd, Indianapolis, 46202, Indiana, United States
- Andrew J Saykin
  - Department of Radiology and Imaging Sciences, Indiana University School of Medicine, 550 University Blvd, Indianapolis, 46202, Indiana, United States
- Jingwen Yan
  - Department of BioHealth Informatics, Indiana University-Purdue University Indianapolis, 420 University Blvd, Indianapolis, 46202, Indiana, United States
  - Department of Radiology and Imaging Sciences, Indiana University School of Medicine, 550 University Blvd, Indianapolis, 46202, Indiana, United States
8
Stricker M, Zhang W, Cheng WY, Gazal S, Dendrou C, Nahkuri S, Palamara PF. Genome-wide classification of epigenetic activity reveals regions of enriched heritability in immune-related traits. Cell Genomics 2024; 4:100469. [PMID: 38190103 PMCID: PMC10794845 DOI: 10.1016/j.xgen.2023.100469] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2022] [Revised: 07/04/2023] [Accepted: 11/29/2023] [Indexed: 01/09/2024]
Abstract
Epigenetics underpins the regulation of genes known to play a key role in the adaptive and innate immune system (AIIS). We developed a method, EpiNN, that leverages epigenetic data to detect AIIS-relevant genomic regions and used it to detect 2,765 putative AIIS loci. Experimental validation of one of these loci, DNMT1, provided evidence for a novel AIIS-specific transcription start site. We built a genome-wide AIIS annotation and used linkage disequilibrium (LD) score regression to test whether it predicts regional heritability using association statistics for 176 traits. We detected significant heritability effects (average |τ∗|=1.65) for 20 out of 26 immune-relevant traits. In a meta-analysis, immune-relevant traits and diseases were 4.45× more enriched for heritability than other traits. The EpiNN annotation was also depleted of trans-ancestry genetic correlation, indicating ancestry-specific effects. These results underscore the effectiveness of leveraging supervised learning algorithms and epigenetic data to detect loci implicated in specific classes of traits and diseases.
Affiliation(s)
- Weijiao Zhang
  - Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK
- Wei-Yi Cheng
  - Data & Analytics, Roche Pharma Research & Early Development, Roche Innovation Center New York, Little Falls, NJ, USA
- Steven Gazal
  - Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA; Center for Genetic Epidemiology, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Calliope Dendrou
  - Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK
- Satu Nahkuri
  - Data & Analytics, Roche Pharma Research & Early Development, Roche Innovation Center Zürich, Zürich, Switzerland
- Pier Francesco Palamara
  - Department of Statistics, University of Oxford, Oxford, UK; Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK
9
Baranger DAA, Hatoum AS, Polimanti R, Gelernter J, Edenberg HJ, Bogdan R, Agrawal A. Multi-omics cannot replace sample size in genome-wide association studies. Genes, Brain, and Behavior 2023; 22:e12846. [PMID: 36977197 PMCID: PMC10733567 DOI: 10.1111/gbb.12846] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/24/2022] [Revised: 02/20/2023] [Accepted: 03/13/2023] [Indexed: 03/30/2023]
Abstract
The integration of multi-omics information (e.g., epigenetics and transcriptomics) can be useful for interpreting findings from genome-wide association studies (GWAS). It has been suggested that multi-omics could circumvent or greatly reduce the need to increase GWAS sample sizes for novel variant discovery. We tested whether incorporating multi-omics information in earlier and smaller-sized GWAS boosts true-positive discovery of genes that were later revealed by larger GWAS of the same/similar traits. We applied 10 different analytic approaches to integrating multi-omics data from 12 sources (e.g., Genotype-Tissue Expression project) to test whether earlier and smaller GWAS of 4 brain-related traits (alcohol use disorder/problematic alcohol use, major depression/depression, schizophrenia, and intracranial volume/brain volume) could detect genes that were revealed by a later and larger GWAS. Multi-omics data did not reliably identify novel genes in earlier less-powered GWAS (PPV <0.2; 80% false-positive associations). Machine learning predictions marginally increased the number of identified novel genes, correctly identifying 1-8 additional genes, but only for well-powered early GWAS of highly heritable traits (i.e., intracranial volume and schizophrenia). Although multi-omics, particularly positional mapping (i.e., fastBAT, MAGMA, and H-MAGMA), can help to prioritize genes within genome-wide significant loci (PPVs = 0.5-1.0) and translate them into information about disease biology, it does not reliably increase novel gene discovery in brain-related GWAS. To increase power for discovery of novel genes and loci, increasing sample size is required.
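The positive predictive values quoted in this abstract follow from simple counting: PPV is the share of flagged genes that the later, larger GWAS confirmed. The sketch below shows that arithmetic; the function names and the example counts are illustrative assumptions, not figures from the study.

```python
def ppv(true_positives, false_positives):
    """Positive predictive value: confirmed calls divided by all positive calls."""
    total = true_positives + false_positives
    if total == 0:
        raise ValueError("PPV is undefined when there are no positive calls")
    return true_positives / total


def false_positive_share(true_positives, false_positives):
    """Share of positive calls that were not confirmed (equals 1 - PPV)."""
    total = true_positives + false_positives
    if total == 0:
        raise ValueError("share is undefined when there are no positive calls")
    return false_positives / total


# Hypothetical counts: 2 of 10 genes flagged by an early GWAS were later confirmed.
print(ppv(2, 8))                   # 0.2
print(false_positive_share(2, 8))  # 0.8
```

A PPV of 0.2 therefore corresponds to 80% of the flagged genes being false positives, and any PPV below 0.2 implies a false-positive share above 80%.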
Affiliation(s)
- David A. A. Baranger
  - Department of Psychological & Brain Sciences, Washington University in St. Louis Medical School, Saint Louis, Missouri, USA
- Alexander S. Hatoum
  - Department of Psychiatry, Washington University School of Medicine, Saint Louis, Missouri, USA
- Renato Polimanti
  - Department of Psychiatry, Division of Human Genetics, Yale School of Medicine, New Haven, Connecticut, USA
  - Psychiatry, Veterans Affairs Connecticut Healthcare System, West Haven, Connecticut, USA
- Joel Gelernter
  - Department of Psychiatry, Division of Human Genetics, Yale School of Medicine, New Haven, Connecticut, USA
  - Psychiatry, Veterans Affairs Connecticut Healthcare System, West Haven, Connecticut, USA
  - Department of Genetics, Yale School of Medicine, New Haven, Connecticut, USA
  - Department of Neuroscience, Yale School of Medicine, New Haven, Connecticut, USA
- Howard J. Edenberg
  - Department of Biochemistry and Molecular Biology, Indiana University School of Medicine, Indianapolis, Indiana, USA
  - Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, Indiana, USA
- Ryan Bogdan
  - Department of Psychological & Brain Sciences, Washington University in St. Louis Medical School, Saint Louis, Missouri, USA
- Arpana Agrawal
  - Department of Psychiatry, Washington University School of Medicine, Saint Louis, Missouri, USA
10
Varathan P, Xie L, He B, Saykin AJ, Nho K, Yan J. Deciphering the tissue-specific functional effect of Alzheimer risk SNPs with deep genome annotation. medRxiv 2023:2023.10.23.23297399. [PMID: 37961458 PMCID: PMC10635176 DOI: 10.1101/2023.10.23.23297399] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2023]
Abstract
Alzheimer's disease (AD) is a highly heritable brain dementia, along with substantial failure of cognitive function. Large-scale genome-wide association studies (GWAS) have led to a significant set of SNPs associated with AD and related traits. GWAS hits usually emerge as clusters where a lead SNP with the highest significance is surrounded by other less significant neighboring SNPs. Although functionality is not guaranteed with even the strongest associations in the GWAS, the lead SNPs have been historically the focus of the field, with the remaining associations inferred as redundant. Recent deep genome annotation tools enable the prediction of function from a segment of DNA sequence with significantly improved precision, which allows in-silico mutagenesis to interrogate the functional effect of SNP alleles. In this project, we explored the impact of top AD GWAS hits on the chromatin functions, and whether it will be altered by the genomic context (i.e., alleles of neighborhood SNPs). Our results showed that highly correlated SNPs in the same LD block could have distinct impact on the downstream functions. Although some GWAS lead SNPs showed dominating functional effect regardless of the neighborhood SNP alleles, several other ones do get enhanced loss or gain of function under certain genomic context, suggesting potential extra information hidden in the LD blocks.
11
Majdandzic A, Rajesh C, Koo PK. Correcting gradient-based interpretations of deep neural networks for genomics. Genome Biol 2023; 24:109. [PMID: 37161475 PMCID: PMC10169356 DOI: 10.1186/s13059-023-02956-3] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 11/04/2022] [Accepted: 04/28/2023] [Indexed: 05/11/2023]
Abstract
Post hoc attribution methods can provide insights into the learned patterns from deep neural networks (DNNs) trained on high-throughput functional genomics data. However, in practice, their resultant attribution maps can be challenging to interpret due to spurious importance scores for seemingly arbitrary nucleotides. Here, we identify a previously overlooked attribution noise source that arises from how DNNs handle one-hot encoded DNA. We demonstrate this noise is pervasive across various genomic DNNs and introduce a statistical correction that effectively reduces it, leading to more reliable attribution maps. Our approach represents a promising step towards gaining meaningful insights from DNNs in regulatory genomics.
Affiliation(s)
- Antonio Majdandzic
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY, USA
- Chandana Rajesh
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY, USA
- Peter K Koo
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY, USA
12
Lee NK, Tang Z, Toneyan S, Koo PK. EvoAug: improving generalization and interpretability of genomic deep neural networks with evolution-inspired data augmentations. Genome Biol 2023; 24:105. [PMID: 37143118 PMCID: PMC10161416 DOI: 10.1186/s13059-023-02941-w] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 12/12/2022] [Accepted: 04/17/2023] [Indexed: 05/06/2023]
Abstract
Deep neural networks (DNNs) hold promise for functional genomics prediction, but their generalization capability may be limited by the amount of available data. To address this, we propose EvoAug, a suite of evolution-inspired augmentations that enhance the training of genomic DNNs by increasing genetic variation. Random transformation of DNA sequences can potentially alter their function in unknown ways, so we employ a fine-tuning procedure using the original non-transformed data to preserve functional integrity. Our results demonstrate that EvoAug substantially improves the generalization and interpretability of established DNNs across prominent regulatory genomics prediction tasks, offering a robust solution for genomic DNNs.
Affiliation(s)
- Nicholas Keone Lee
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY, USA
- Ziqi Tang
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY, USA
- Shushan Toneyan
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY, USA
- Peter K Koo
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY, USA.
13
Bucholc M, James C, Al Khleifat A, Badhwar A, Clarke N, Dehsarvi A, Madan CR, Marzi SJ, Shand C, Schilder BM, Tamburin S, Tantiangco HM, Lourida I, Llewellyn DJ, Ranson JM. Artificial Intelligence for Dementia Research Methods Optimization. arXiv 2023; arXiv:2303.01949v1. [PMID: 36911275 PMCID: PMC10002770] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/14/2023]
Abstract
INTRODUCTION: Machine learning (ML) has been extremely successful in identifying key features from high-dimensional datasets and executing complicated tasks with human expert levels of accuracy or greater. METHODS: We summarize and critically evaluate current applications of ML in dementia research and highlight directions for future research. RESULTS: We present an overview of ML algorithms most frequently used in dementia research and highlight future opportunities for the use of ML in clinical practice, experimental medicine, and clinical trials. We discuss issues of reproducibility, replicability and interpretability and how these impact the clinical applicability of dementia research. Finally, we give examples of how state-of-the-art methods, such as transfer learning, multi-task learning, and reinforcement learning, may be applied to overcome these issues and aid the translation of research to clinical practice in the future. DISCUSSION: ML-based models hold great promise to advance our understanding of the underlying causes and pathological mechanisms of dementia.
Affiliation(s)
- Magda Bucholc
- Cognitive Analytics Research Lab, School of Computing, Engineering & Intelligent Systems, Ulster University, Derry, UK
- Charlotte James
- NIHR Bristol Biomedical Research Centre, University Hospitals Bristol and Weston NHS Foundation Trust and University of Bristol, Bristol, UK
- Ahmad Al Khleifat
- Department of Basic and Clinical Neuroscience, Institute of Psychiatry, Psychology & Neuroscience, King’s College London, London, United Kingdom
- AmanPreet Badhwar
- Multiomics Investigation of Neurodegenerative Diseases (MIND) Lab, Centre de Recherche de l’Institut Universitaire de Gériatrie de Montréal, Montréal, Canada
- Institut de génie biomédical, Université de Montréal, Montréal, Canada
- Département de Pharmacologie et Physiologie, Université de Montréal, Montréal, Canada
- Natasha Clarke
- Multiomics Investigation of Neurodegenerative Diseases (MIND) Lab, Centre de Recherche de l’Institut Universitaire de Gériatrie de Montréal, Montréal, Canada
- Amir Dehsarvi
- Aberdeen Biomedical Imaging Centre, School of Medicine, Medical Sciences, and Nutrition, University of Aberdeen, Aberdeen, UK
- Sarah J. Marzi
- UK Dementia Research Institute, Imperial College London, London, UK
- Department of Brain Sciences, Imperial College London, London, UK
- Cameron Shand
- Centre for Medical Image Computing, Department of Computer Science, University College London, London, UK
- Brian M. Schilder
- UK Dementia Research Institute, Imperial College London, London, UK
- Department of Brain Sciences, Imperial College London, London, UK
- Stefano Tamburin
- Department of Neurosciences, Biomedicine and Movement Sciences, University of Verona, Verona, Italy
- David J. Llewellyn
- University of Exeter Medical School, Exeter, UK
- The Alan Turing Institute, London, UK
14
Toneyan S, Tang Z, Koo PK. Evaluating deep learning for predicting epigenomic profiles. Nat Mach Intell 2022. [DOI: 10.1038/s42256-022-00570-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/12/2022]
15
Exploration of Tools for the Interpretation of Human Non-Coding Variants. Int J Mol Sci 2022; 23:ijms232112977. [PMID: 36361767 PMCID: PMC9654743 DOI: 10.3390/ijms232112977] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/23/2022] [Revised: 10/17/2022] [Accepted: 10/23/2022] [Indexed: 02/01/2023] Open
Abstract
The advent of Whole Genome Sequencing (WGS) broadened the genetic variation detection range, revealing the presence of variants even in non-coding regions of the genome, which would have been missed using targeted approaches. One of the most challenging issues in WGS analysis regards the interpretation of annotated variants. This review focuses on tools suitable for the functional annotation of variants falling into non-coding regions. It couples the description of non-coding genomic areas with the results and performance of existing tools for a functional interpretation of the effect of variants in these regions. Tools were tested in a controlled genomic scenario, representing the ground-truth and allowing us to determine software performance.
16
Dey KK, Gazal S, van de Geijn B, Kim SS, Nasser J, Engreitz JM, Price AL. SNP-to-gene linking strategies reveal contributions of enhancer-related and candidate master-regulator genes to autoimmune disease. Cell Genomics 2022; 2:100145. [PMID: 35873673 PMCID: PMC9306342 DOI: 10.1016/j.xgen.2022.100145] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 07/23/2020] [Revised: 04/03/2021] [Accepted: 05/27/2022] [Indexed: 12/11/2022]
Abstract
We assess contributions to autoimmune disease of genes whose regulation is driven by enhancer regions (enhancer-related) and genes that regulate other genes in trans (candidate master-regulator). We link these genes to SNPs using several SNP-to-gene (S2G) strategies and apply heritability analyses to draw three conclusions about 11 autoimmune/blood-related diseases/traits. First, several characterizations of enhancer-related genes using functional genomics data are informative for autoimmune disease heritability after conditioning on a broad set of regulatory annotations. Second, candidate master-regulator genes defined using trans-eQTL in blood are also conditionally informative for autoimmune disease heritability. Third, integrating enhancer-related and master-regulator gene sets with protein-protein interaction (PPI) network information magnified their disease signal. The resulting PPI-enhancer gene score produced >2-fold stronger heritability signal and >2-fold stronger enrichment for drug targets, compared with the recently proposed enhancer domain score. In each case, functionally informed S2G strategies produced 4.1- to 13-fold stronger disease signals than conventional window-based strategies.
Affiliation(s)
- Kushal K. Dey
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
- Steven Gazal
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
- Bryce van de Geijn
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
- Genentech, South San Francisco, CA 94080, USA
- Samuel Sungil Kim
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Joseph Nasser
- Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Jesse M. Engreitz
- Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- BASE Initiative, Betty Irene Moore Children’s Heart Center, Lucile Packard Children’s Hospital, Stanford University School of Medicine, Stanford, CA 94304, USA
- Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Alkes L. Price
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
- Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
17
Yang M, Huang L, Huang H, Tang H, Zhang N, Yang H, Wu J, Mu F. Integrating convolution and self-attention improves language model of human genome for interpreting non-coding regions at base-resolution. Nucleic Acids Res 2022; 50:e81. [PMID: 35536244 PMCID: PMC9371931 DOI: 10.1093/nar/gkac326] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 09/22/2021] [Revised: 02/22/2022] [Accepted: 05/09/2022] [Indexed: 12/12/2022] Open
Abstract
Interpretation of the non-coding genome remains an unsolved challenge in human genetics because exhaustively annotating biochemically active elements in all conditions is impractical. Deep learning-based computational approaches have recently emerged to help interpret non-coding regions. Here, we present LOGO (Language of Genome), a self-attention-based contextualized pre-trained language model with a substantially light architecture, containing only two self-attention layers and 1 million parameters, that applies self-supervision techniques to learn bidirectional representations of the unlabelled human reference genome. LOGO is then fine-tuned for a sequence labelling task and further extended to a variant prioritization task via a special input encoding scheme for alternative alleles, followed by the addition of a convolutional module. Experiments show that LOGO achieves a 15% absolute improvement for promoter identification and up to a 4.5% absolute improvement for enhancer-promoter interaction prediction. LOGO exhibits state-of-the-art multi-task predictive power on thousands of chromatin features with only 3% of the parameters of the fully supervised model DeepSEA and 1% of the parameters of a recent BERT-based DNA language model. For allelic-effect prediction, the locality introduced by one-dimensional convolution improves sensitivity and specificity for prioritizing non-coding variants associated with human diseases. In addition, we apply LOGO to interpret type 2 diabetes (T2D) GWAS signals and infer underlying regulatory mechanisms. We draw a conceptual analogy between natural language and the human genome and demonstrate that LOGO is an accurate, fast, scalable, and robust framework for interpreting non-coding regions, both for global sequence labelling and for variant prioritization at base resolution.
Affiliation(s)
- Meng Yang
- MGI, BGI-Shenzhen, Shenzhen 518083, China
- Department of Biology, University of Copenhagen, Copenhagen DK-2200, Denmark
- Hui Tang
- MGI, BGI-Shenzhen, Shenzhen 518083, China
- Nan Zhang
- MGI, BGI-Shenzhen, Shenzhen 518083, China
- Huanming Yang
- BGI-Shenzhen, Shenzhen 518083, China
- Guangdong Provincial Academician Workstation of BGI Synthetic Genomics, BGI-Shenzhen, Shenzhen, 518120, China
- Jihong Wu
- Department of Ophthalmology, Eye & ENT Hospital, Shanghai Medical College, Fudan University, Shanghai, China
- Shanghai Key Laboratory of Visual Impairment and Restoration, Science and Technology Commission of Shanghai Municipality, Shanghai, China
- Key Laboratory of Myopia (Fudan University), Chinese Academy of Medical Sciences, National Health Commission, Shanghai, China
- Feng Mu
- MGI, BGI-Shenzhen, Shenzhen 518083, China
18
Schilder BM, Raj T. Fine-mapping of Parkinson's disease susceptibility loci identifies putative causal variants. Hum Mol Genet 2022; 31:888-900. [PMID: 34617105 PMCID: PMC8947317 DOI: 10.1093/hmg/ddab294] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Received: 09/05/2021] [Revised: 09/05/2021] [Accepted: 09/11/2021] [Indexed: 12/13/2022] Open
Abstract
Recent genome-wide association studies have identified 78 loci associated with Parkinson's disease susceptibility, but the underlying mechanisms remain largely unclear. To identify likely causal variants for disease risk, we fine-mapped these Parkinson's-associated loci using four different fine-mapping methods. We then integrated multi-assay cell type-specific epigenomic profiles to pinpoint the likely mechanism of action of each variant, allowing us to identify consensus single-nucleotide polymorphisms (SNPs) that disrupt LRRK2 and FCGR2A regulatory elements in microglia, an MBNL2 enhancer in oligodendrocytes, and a DYRK1A enhancer in neurons. This genome-wide functional fine-mapping investigation of Parkinson's disease substantially advances our understanding of the causal mechanisms underlying this complex disease while avoiding focus on spurious, non-causal mechanisms. Together, these results provide a robust, comprehensive list of the likely causal variants, genes and cell types underlying Parkinson's disease risk, as demonstrated by consistently greater enrichment of our fine-mapped SNPs relative to lead GWAS SNPs across independent functional impact annotations. In addition, our approach prioritized an average of 3/85 variants per locus as putatively causal, making downstream experimental studies both more tractable and more likely to yield disease-relevant, actionable results.

Large-scale studies comparing individuals with Parkinson's disease to age-matched controls have identified many regions of the genome associated with the disease. However, there is widespread correlation between different parts of the genome, making it difficult to tell which genetic variants cause Parkinson's and which are simply co-inherited with causal variants. We therefore applied a suite of statistical models to identify the most likely causal genetic variants (i.e. fine-mapping). We then linked these genetic variants with epigenomic and gene expression signatures across a wide variety of tissues and cell types to identify how these variants cause disease. Therefore, this study provides a comprehensive and robust list of cellular and molecular mechanisms that may serve as targets in the development of more effective Parkinson's therapeutics.
Affiliation(s)
- Brian M Schilder
- Nash Family Department of Neuroscience & Friedman Brain Institute, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Ronald M. Loeb Center for Alzheimer’s disease, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Icahn Institute for Data Science and Genomic Technology, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Estelle and Daniel Maggin Department of Neurology, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Towfique Raj
- Nash Family Department of Neuroscience & Friedman Brain Institute, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Ronald M. Loeb Center for Alzheimer’s disease, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Icahn Institute for Data Science and Genomic Technology, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Estelle and Daniel Maggin Department of Neurology, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
19
Schilder BM, Humphrey J, Raj T. echolocatoR: an automated end-to-end statistical and functional genomic fine-mapping pipeline. Bioinformatics 2022; 38:536-539. [PMID: 34529038 PMCID: PMC10060715 DOI: 10.1093/bioinformatics/btab658] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 11/07/2020] [Revised: 09/06/2021] [Accepted: 09/13/2021] [Indexed: 02/03/2023] Open
Abstract
SUMMARY: echolocatoR integrates a diverse suite of statistical and functional fine-mapping tools to identify, test enrichment in, and visualize high-confidence causal consensus variants in any phenotype. It requires minimal input from users (a summary statistics file), can be run in a single R function, and provides extensive access to relevant datasets (e.g. reference linkage disequilibrium panels, quantitative trait loci, genome-wide annotations, cell-type-specific epigenomics), thereby enabling rapid, robust and scalable end-to-end fine-mapping investigations. AVAILABILITY AND IMPLEMENTATION: echolocatoR is an open-source R package available through GitHub under the GNU General Public License (Version 3): https://github.com/RajLabMSSM/echolocatoR. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Brian M Schilder
- Nash Family Department of Neuroscience & Friedman Brain Institute, Icahn School of Medicine at Mount Sinai, New York 10029, NY, USA
- Ronald M. Loeb Center for Alzheimer’s Disease, Icahn School of Medicine at Mount Sinai, New York 10029, NY, USA
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York 10029, NY, USA
- Icahn Institute for Data Science and Genomic Technology, Icahn School of Medicine at Mount Sinai, New York 10029, NY, USA
- Estelle and Daniel Maggin Department of Neurology, Icahn School of Medicine at Mount Sinai, New York 10029, NY, USA
- Jack Humphrey
- Nash Family Department of Neuroscience & Friedman Brain Institute, Icahn School of Medicine at Mount Sinai, New York 10029, NY, USA
- Ronald M. Loeb Center for Alzheimer’s Disease, Icahn School of Medicine at Mount Sinai, New York 10029, NY, USA
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York 10029, NY, USA
- Icahn Institute for Data Science and Genomic Technology, Icahn School of Medicine at Mount Sinai, New York 10029, NY, USA
- Estelle and Daniel Maggin Department of Neurology, Icahn School of Medicine at Mount Sinai, New York 10029, NY, USA
- Towfique Raj
- Nash Family Department of Neuroscience & Friedman Brain Institute, Icahn School of Medicine at Mount Sinai, New York 10029, NY, USA
- Ronald M. Loeb Center for Alzheimer’s Disease, Icahn School of Medicine at Mount Sinai, New York 10029, NY, USA
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York 10029, NY, USA
- Icahn Institute for Data Science and Genomic Technology, Icahn School of Medicine at Mount Sinai, New York 10029, NY, USA
- Estelle and Daniel Maggin Department of Neurology, Icahn School of Medicine at Mount Sinai, New York 10029, NY, USA
20
Schilder BM, Navarro E, Raj T. Multi-omic insights into Parkinson's Disease: From genetic associations to functional mechanisms. Neurobiol Dis 2021; 163:105580. [PMID: 34871738 PMCID: PMC10101343 DOI: 10.1016/j.nbd.2021.105580] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 08/23/2021] [Revised: 11/17/2021] [Accepted: 12/02/2021] [Indexed: 02/07/2023] Open
Abstract
Genome-Wide Association Studies (GWAS) have elucidated the genetic components of Parkinson's Disease (PD). However, because the vast majority of GWAS association signals fall within non-coding regions, translating these results into an interpretable, mechanistic understanding of the disease etiology remains a major challenge in the field. In this review, we provide an overview of the approaches to prioritize putative causal variants and genes as well as summarise the primary findings of previous studies. We then discuss recent efforts to integrate multi-omics data to identify likely pathogenic cell types and biological pathways implicated in PD pathogenesis. We have compiled full summary statistics of cell-type, tissue, and phenotype enrichment analyses from multiple studies of PD GWAS and provided them in a standardized format as a resource for the research community (https://github.com/RajLabMSSM/PD_omics_review). Finally, we discuss the experimental, computational, and conceptual advances that will be necessary to fully elucidate the effects of functional variants and genes on cellular dysregulation and disease risk.
Affiliation(s)
- Brian M Schilder
- Nash Family Department of Neuroscience & Friedman Brain Institute, Icahn School of Medicine at Mount Sinai, New York, NY, United States; Ronald M. Loeb Center for Alzheimer's disease, Icahn School of Medicine at Mount Sinai, New York, NY, United States; Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, United States; Icahn Institute for Data Science and Genomic Technology, Icahn School of Medicine at Mount Sinai, New York, NY, United States; Estelle and Daniel Maggin Department of Neurology, Icahn School of Medicine at Mount Sinai, New York, NY, United States; Department of Brain Sciences, Faculty of Medicine, Imperial College London, London, United Kingdom; UK Dementia Research Institute at Imperial College London, London, United Kingdom.
- Elisa Navarro
- Nash Family Department of Neuroscience & Friedman Brain Institute, Icahn School of Medicine at Mount Sinai, New York, NY, United States; Ronald M. Loeb Center for Alzheimer's disease, Icahn School of Medicine at Mount Sinai, New York, NY, United States; Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, United States; Icahn Institute for Data Science and Genomic Technology, Icahn School of Medicine at Mount Sinai, New York, NY, United States; Estelle and Daniel Maggin Department of Neurology, Icahn School of Medicine at Mount Sinai, New York, NY, United States; Sección Departamental de Bioquímica y Biología Molecular, Facultad de Medicina, Universidad Complutense de Madrid, Madrid, Spain
- Towfique Raj
- Nash Family Department of Neuroscience & Friedman Brain Institute, Icahn School of Medicine at Mount Sinai, New York, NY, United States; Ronald M. Loeb Center for Alzheimer's disease, Icahn School of Medicine at Mount Sinai, New York, NY, United States; Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, United States; Icahn Institute for Data Science and Genomic Technology, Icahn School of Medicine at Mount Sinai, New York, NY, United States; Estelle and Daniel Maggin Department of Neurology, Icahn School of Medicine at Mount Sinai, New York, NY, United States.
21
Lorkowski J, Kolaszyńska O, Pokorski M. Artificial Intelligence and Precision Medicine: A Perspective. Adv Exp Med Biol 2021; 1375:1-11. [PMID: 34138457 DOI: 10.1007/5584_2021_652] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 10/19/2022]
Abstract
This article aims to present how the advanced solutions of artificial intelligence and precision medicine work together to refine medical management. Multi-omics seems the most suitable approach for biological analysis of data on precision medicine and artificial intelligence. We searched PubMed and Google Scholar databases to collect pertinent articles appearing up to 5 March 2021. Genetics, oncology, radiology, and the recent coronavirus disease (COVID-19) pandemic were chosen as representative fields addressing the cross-compliance of artificial intelligence (AI) and precision medicine based on the highest number of articles, topicality, and interconnectedness of the issue. Overall, we identified and perused 1572 articles. AI is a breakthrough that takes part in shaping the Fourth Industrial Revolution in medicine and health care, changing the long-time accepted diagnostic and treatment regimens and approaches. AI-based link prediction models may be outstandingly helpful in the literature search for drug repurposing or finding new therapeutical modalities in rapidly erupting wide-scale diseases such as the recent COVID-19.
Affiliation(s)
- Jacek Lorkowski
- Department of Orthopedics, Traumatology and Sports Medicine, Central Clinical Hospital of the Ministry of Internal Affairs and Administration, Warsaw, Poland
- Faculty of Health Sciences, Medical University of Mazovia, Warsaw, Poland
- Oliwia Kolaszyńska
- Department of Cardiology, Independent Public Regional Hospital, Szczecin, Poland
- Mieczysław Pokorski
- Institute of Health Sciences, Opole University, Opole, Poland
- Faculty of Health Sciences, The Jan Długosz University in Częstochowa, Częstochowa, Poland
22
Kim SS, Dey KK, Weissbrod O, Márquez-Luna C, Gazal S, Price AL. Improving the informativeness of Mendelian disease-derived pathogenicity scores for common disease. Nat Commun 2020; 11:6258. [PMID: 33288751 PMCID: PMC7721881 DOI: 10.1038/s41467-020-20087-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 07/09/2020] [Accepted: 11/09/2020] [Indexed: 02/08/2023] Open
Abstract
Despite considerable progress on pathogenicity scores prioritizing variants for Mendelian disease, little is known about the utility of these scores for common disease. Here, we assess the informativeness of Mendelian disease-derived pathogenicity scores for common disease and improve upon existing scores. We first apply stratified linkage disequilibrium (LD) score regression to evaluate published pathogenicity scores across 41 common diseases and complex traits (average N = 320K). Several of the resulting annotations are informative for common disease, even after conditioning on a broad set of functional annotations. We then improve upon published pathogenicity scores by developing AnnotBoost, a machine learning framework to impute and denoise pathogenicity scores using a broad set of functional annotations. AnnotBoost substantially increases the informativeness for common disease of both previously uninformative and previously informative pathogenicity scores, implying that Mendelian and common disease variants share similar properties. The boosted scores also produce improvements in heritability model fit and in classifying disease-associated, fine-mapped SNPs. Our boosted scores may improve fine-mapping and candidate gene discovery for common disease.
Affiliation(s)
- Samuel S Kim
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, 02142, USA.
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA.
- Kushal K Dey
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA
- Omer Weissbrod
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA
- Carla Márquez-Luna
- Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA
- Steven Gazal
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA
- Alkes L Price
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA.
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA.
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA.
23
Amariuta T, Ishigaki K, Sugishita H, Ohta T, Koido M, Dey KK, Matsuda K, Murakami Y, Price AL, Kawakami E, Terao C, Raychaudhuri S. Improving the trans-ancestry portability of polygenic risk scores by prioritizing variants in predicted cell-type-specific regulatory elements. Nat Genet 2020; 52:1346-1354. [PMID: 33257898 PMCID: PMC8049522 DOI: 10.1038/s41588-020-00740-8] [Citation(s) in RCA: 107] [Impact Index Per Article: 21.4] [Received: 02/21/2020] [Accepted: 10/19/2020] [Indexed: 12/15/2022]
Abstract
Poor trans-ancestry portability of polygenic risk scores is a consequence of Eurocentric genetic studies and limited knowledge of shared causal variants. Leveraging regulatory annotations may improve portability by prioritizing functional over tagging variants. We constructed a resource of 707 cell-type-specific IMPACT regulatory annotations by aggregating 5,345 epigenetic datasets to predict binding patterns of 142 transcription factors across 245 cell types. We then partitioned the common SNP heritability of 111 genome-wide association study summary statistics of European (average n ≈ 189,000) and East Asian (average n ≈ 157,000) origin. IMPACT annotations captured consistent SNP heritability between populations, suggesting prioritization of shared functional variants. Variant prioritization using IMPACT resulted in increased trans-ancestry portability of polygenic risk scores from Europeans to East Asians across all 21 phenotypes analyzed (49.9% mean relative increase in R²). Our study identifies a crucial role for functional annotations such as IMPACT to improve the trans-ancestry portability of genetic data.
Affiliation(s)
- Tiffany Amariuta
- Center for Data Sciences, Harvard Medical School, Boston, MA, USA
- Divisions of Genetics and Rheumatology, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Graduate School of Arts and Sciences, Harvard University, Cambridge, MA, USA
- Kazuyoshi Ishigaki
- Center for Data Sciences, Harvard Medical School, Boston, MA, USA
- Divisions of Genetics and Rheumatology, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Laboratory for Statistical and Translational Genetics, RIKEN Center for Integrative Medical Sciences, Kanagawa, Japan
- Hiroki Sugishita
- Laboratory for Developmental Genetics, RIKEN Center for Integrative Medical Sciences (IMS), Kanagawa, Japan
- Tazro Ohta
- Medical Sciences Innovation Hub Program, RIKEN, Kanagawa, Japan
- Database Center for Life Science, Joint Support-Center for Data Science Research, Research Organization of Information and Systems, Shizuoka, Japan
- Masaru Koido
- Laboratory for Statistical and Translational Genetics, RIKEN Center for Integrative Medical Sciences, Kanagawa, Japan
- Division of Molecular Pathology, Institute of Medical Science, The University of Tokyo, Tokyo, Japan
- Kushal K Dey
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Koichi Matsuda
- Laboratory of Genome Technology, Human Genome Center, Institute of Medical Science, The University of Tokyo, Tokyo, Japan
- Laboratory of Clinical Genome Sequencing, Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Tokyo, Japan
- Yoshinori Murakami
- Division of Molecular Pathology, Institute of Medical Science, The University of Tokyo, Tokyo, Japan
- Alkes L Price
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Eiryo Kawakami
- Medical Sciences Innovation Hub Program, RIKEN, Kanagawa, Japan
- Artificial Intelligence Medicine, Graduate School of Medicine, Chiba University, Chiba, Japan
- Chikashi Terao
- Laboratory for Statistical and Translational Genetics, RIKEN Center for Integrative Medical Sciences, Kanagawa, Japan
- Clinical Research Center, Shizuoka General Hospital, Shizuoka, Japan
- Department of Applied Genetics, The School of Pharmaceutical Sciences, University of Shizuoka, Shizuoka, Japan
- Soumya Raychaudhuri
- Center for Data Sciences, Harvard Medical School, Boston, MA, USA.
- Divisions of Genetics and Rheumatology, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA.
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
- Centre for Genetics and Genomics Versus Arthritis, Centre for Musculoskeletal Research, Manchester Academic Health Science Centre, The University of Manchester, Manchester, UK.