1
Ogloblinsky MSC, Bocher O, Aloui C, Leutenegger AL, Ozisik O, Baudot A, Tournier-Lasserve E, Castillo-Madeen H, Lewinsohn D, Conrad DF, Génin E, Marenne G. PSAP-Genomic-Regions: A Method Leveraging Population Data to Prioritize Coding and Non-Coding Variants in Whole Genome Sequencing for Rare Disease Diagnosis. Genet Epidemiol 2025; 49:e22593. [PMID: 39318036 DOI: 10.1002/gepi.22593] [Received: 06/05/2024] [Revised: 07/30/2024] [Accepted: 09/03/2024] [Indexed: 09/26/2024]
Abstract
The introduction of Next-Generation Sequencing technologies in the clinics has improved rare disease diagnosis. Nonetheless, for very heterogeneous or very rare diseases, more than half of cases still lack molecular diagnosis. Novel strategies are needed to prioritize variants within a single individual. The Population Sampling Probability (PSAP) method was developed to meet this aim but only for coding variants in exome data. Here, we propose an extension of the PSAP method to the non-coding genome called PSAP-genomic-regions. In this extension, instead of considering genes as testing units (PSAP-genes strategy), we use genomic regions defined over the whole genome that pinpoint potential functional constraints. We conceived an evaluation protocol for our method using artificially generated disease exomes and genomes, by inserting coding and non-coding pathogenic ClinVar variants in large data sets of exomes and genomes from the general population. PSAP-genomic-regions significantly improves the ranking of these variants compared to using a pathogenicity score alone. Using PSAP-genomic-regions, more than 50% of non-coding ClinVar variants were among the top 10 variants of the genome. On real sequencing data from six patients with Cerebral Small Vessel Disease and nine patients with male infertility, all causal variants were ranked in the top 100 variants with PSAP-genomic-regions. By revisiting the testing units used in the PSAP method to include non-coding variants, we have developed PSAP-genomic-regions, an efficient whole-genome prioritization tool which offers promising results for the diagnosis of unresolved rare diseases.
Affiliation(s)
- Ozvan Bocher
  - Univ Brest, Inserm, EFS, UMR 1078, GGB, Brest, France
  - Institute of Translational Genomics, Helmholtz Zentrum München, Munich, Germany
- Chaker Aloui
  - Inserm, NeuroDiderot, Unité Mixte de Recherche, Université Paris Cité, Paris, France
- Ozan Ozisik
  - INSERM, Marseille Medical Genetics (MMG), Aix Marseille University, Marseille, France
- Anaïs Baudot
  - INSERM, Marseille Medical Genetics (MMG), Aix Marseille University, Marseille, France
- Elisabeth Tournier-Lasserve
  - Inserm, NeuroDiderot, Unité Mixte de Recherche, Université Paris Cité, Paris, France
  - Assistance Publique-Hôpitaux de Paris, Service de Génétique Moléculaire Neurovasculaire, Hôpital Saint-Louis, Paris, France
- Helen Castillo-Madeen
  - Division of Genetics, Oregon National Primate Research Center, Oregon Health and Science University, Portland, Oregon, USA
- Daniel Lewinsohn
  - Division of Genetics, Oregon National Primate Research Center, Oregon Health and Science University, Portland, Oregon, USA
- Donald F Conrad
  - Division of Genetics, Oregon National Primate Research Center, Oregon Health and Science University, Portland, Oregon, USA
- Emmanuelle Génin
  - Univ Brest, Inserm, EFS, UMR 1078, GGB, Brest, France
  - Centre Hospitalier Régional Universitaire de Brest, Brest, France
2
Wang Y, Liang N, Gao G. Quantifying the regulatory potential of genetic variants via a hybrid sequence-oriented model with SVEN. Nat Commun 2024; 15:10917. [PMID: 39738063 DOI: 10.1038/s41467-024-55392-7] [Received: 05/10/2024] [Accepted: 12/06/2024] [Indexed: 01/01/2025] Open
Abstract
Deciphering how noncoding DNA determines gene expression is critical for decoding the functional genome. Understanding the transcription effects of noncoding genetic variants is still a major unsolved problem, and one that is critical for downstream applications in human genetics and precision medicine. Here, we integrate regulatory-specific neural networks and tissue-specific gradient-boosting trees to build SVEN: a hybrid sequence-oriented architecture that can accurately predict tissue-specific gene expression level and quantify the tissue-specific transcriptomic impacts of structural variants across more than 350 tissues and cell lines. We further systematically screen a large-scale structural variants dataset derived from 3622 individuals and clinical structural variants from ClinVar, and provide an overview of transcriptomic impacts of structural variants in population. As a sequence-oriented model, SVEN is also able to predict regulatory effects for small noncoding variants. We expect that SVEN will enable more effective in silico analysis and interpretation of human genome-wide disease-related genetic variants.
Affiliation(s)
- Yu Wang
  - State Key Laboratory of Protein and Plant Gene Research, School of Life Sciences, Biomedical Pioneering Innovative Center (BIOPIC) and Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI), Peking University, 100871, Beijing, China
  - Changping Laboratory, 102206, Beijing, China
- Nan Liang
  - State Key Laboratory of Protein and Plant Gene Research, School of Life Sciences, Biomedical Pioneering Innovative Center (BIOPIC) and Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI), Peking University, 100871, Beijing, China
- Ge Gao
  - State Key Laboratory of Protein and Plant Gene Research, School of Life Sciences, Biomedical Pioneering Innovative Center (BIOPIC) and Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI), Peking University, 100871, Beijing, China
  - Changping Laboratory, 102206, Beijing, China
3
Villani RM, McKenzie ME, Davidson AL, Spurdle AB. Regional-specific calibration enables application of computational evidence for clinical classification of 5' cis-regulatory variants in Mendelian disease. Am J Hum Genet 2024; 111:1301-1315. [PMID: 38815586 PMCID: PMC11267523 DOI: 10.1016/j.ajhg.2024.05.002] [Received: 12/21/2023] [Revised: 05/02/2024] [Accepted: 05/03/2024] [Indexed: 06/01/2024] Open
Abstract
To date, clinical genetic testing for Mendelian disease variants has focused heavily on exonic coding and intronic gene regions. This multi-step study was undertaken to provide an evidence base for selecting and applying computational approaches for use in clinical classification of 5' cis-regulatory region variants. Curated datasets of clinically reported disease-causing 5' cis-regulatory region variants and variants from matched genomic regions in population controls were used to calibrate six bioinformatic tools as predictors of variant pathogenicity. Likelihood ratio estimates were aligned to code weights following ClinGen recommendations for application of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology (ACMG/AMP) classification scheme. Considering code assignment across all reference dataset variants, performance was best for CADD (81.2%) and REMM (81.5%). Optimized thresholds provided moderate evidence toward pathogenicity (CADD, REMM) and moderate (CADD) or supporting (REMM) evidence against pathogenicity. Both sensitivity and specificity of prediction were improved when further categorizing variants based on location in an EPDnew-defined promoter region. Combining predictions (CADD, REMM, and location in a promoter region) increased specificity at the expense of sensitivity. Importantly, the optimal CADD thresholds for assigning ACMG/AMP codes PP3 (≥10) and BP4 (≤8) were vastly different from recommendations for protein-coding variants (PP3 ≥25.3; BP4 ≤22.7); CADD <22.7 would incorrectly assign BP4 for >90% of reported disease-causing cis-regulatory region variants. Our results demonstrate the need to consider a tiered approach and tailored score thresholds to optimize bioinformatic impact prediction for clinical classification of 5' cis-regulatory region variants.
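The threshold comparison reported in this abstract can be illustrated with a minimal Python sketch. The function name `assign_code` and the constant names are hypothetical and do not come from the paper; only the four numeric cut-offs are taken from the abstract, and the scores are assumed to be PHRED-scaled CADD values. Scores falling between the two cut-offs receive no code.

```python
from typing import Optional

# Cut-offs quoted in the abstract (assumed to be PHRED-scaled CADD scores).
REGULATORY_PP3_MIN = 10.0   # 5' cis-regulatory variants: PP3 if CADD >= 10
REGULATORY_BP4_MAX = 8.0    # 5' cis-regulatory variants: BP4 if CADD <= 8
CODING_PP3_MIN = 25.3       # protein-coding recommendation: PP3 if CADD >= 25.3
CODING_BP4_MAX = 22.7       # protein-coding recommendation: BP4 if CADD <= 22.7


def assign_code(cadd: float, pp3_min: float, bp4_max: float) -> Optional[str]:
    """Return 'PP3', 'BP4', or None for a CADD score under the given cut-offs.

    Hypothetical helper for illustration only; it ignores evidence strength
    and the promoter-location tiering that the study also considers.
    """
    if cadd >= pp3_min:
        return "PP3"   # computational evidence toward pathogenicity
    if cadd <= bp4_max:
        return "BP4"   # computational evidence against pathogenicity
    return None        # indeterminate zone between the two cut-offs


if __name__ == "__main__":
    score = 15.0  # made-up score for a regulatory variant
    print(assign_code(score, REGULATORY_PP3_MIN, REGULATORY_BP4_MAX))
    print(assign_code(score, CODING_PP3_MIN, CODING_BP4_MAX))
```

With the made-up score of 15, the regulatory cut-offs yield PP3 while the protein-coding cut-offs yield BP4, which is the kind of misassignment the abstract warns about.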
Affiliation(s)
- Rehan M Villani
  - Population Health Program, QIMR Berghofer Medical Research Institute, Brisbane, Queensland, Australia
- Maddison E McKenzie
  - Population Health Program, QIMR Berghofer Medical Research Institute, Brisbane, Queensland, Australia
- Aimee L Davidson
  - Population Health Program, QIMR Berghofer Medical Research Institute, Brisbane, Queensland, Australia
- Amanda B Spurdle
  - Population Health Program, QIMR Berghofer Medical Research Institute, Brisbane, Queensland, Australia; University of Queensland, Brisbane, Queensland, Australia
4
Dorans E, Jagadeesh K, Dey K, Price AL. Linking regulatory variants to target genes by integrating single-cell multiome methods and genomic distance. medRxiv 2024:2024.05.24.24307813. [PMID: 38826240 PMCID: PMC11142273 DOI: 10.1101/2024.05.24.24307813] [Indexed: 06/04/2024]
Abstract
Methods that analyze single-cell paired RNA-seq and ATAC-seq multiome data have shown great promise in linking regulatory elements to genes. However, existing methods differ in their modeling assumptions and approaches to account for biological and technical noise (leading to low concordance in their linking scores) and do not capture the effects of genomic distance. We propose pgBoost, an integrative modeling framework that trains a non-linear combination of existing linking strategies (including genomic distance) on fine-mapped eQTL data to assign a probabilistic score to each candidate SNP-gene link. We applied pgBoost to single-cell multiome data from 85k cells representing 6 major immune/blood cell types. pgBoost attained higher enrichment for fine-mapped eSNP-eGene pairs (e.g. 21x at distance >10kb) than existing methods (1.2-10x; p-value for difference = 5e-13 vs. distance-based method and < 4e-35 for each other method), with larger improvements at larger distances (e.g. 35x vs. 0.89-6.6x at distance >100kb; p-value for difference < 0.002 vs. each other method). pgBoost also outperformed existing methods in enrichment for CRISPR-validated links (e.g. 4.8x vs. 1.6-4.1x at distance >10kb; p-value for difference = 0.25 vs. distance-based method and < 2e-5 for each other method), with larger improvements at larger distances (e.g. 15x vs. 1.6-2.5x at distance >100kb; p-value for difference < 0.009 for each other method). Similar improvements in enrichment were observed for links derived from Activity-By-Contact (ABC) scores and GWAS data. We further determined that restricting pgBoost to features from a focal cell type improved the identification of SNP-gene links relevant to that cell type. We highlight several examples where pgBoost linked fine-mapped GWAS variants to experimentally validated or biologically plausible target genes that were not implicated by other methods.
In conclusion, a non-linear combination of linking strategies, including genomic distance, improves power to identify target genes underlying GWAS associations.
5
Ding M, Chen K, Yang Y, Zhao H. Prioritizing genomic variants pathogenicity via DNA, RNA, and protein-level features based on extreme gradient boosting. Hum Genet 2024:10.1007/s00439-024-02667-0. [PMID: 38575818 DOI: 10.1007/s00439-024-02667-0] [Received: 05/11/2023] [Accepted: 03/05/2024] [Indexed: 04/06/2024]
Abstract
Genetic diseases are mostly implicated with genetic variants, including missense, synonymous, non-sense, and copy number variants. These different kinds of variants are indicated to affect phenotypes in various ways from previous studies. It remains essential but challenging to understand the functional consequences of these genetic variants, especially the noncoding ones, due to the lack of corresponding annotations. While many computational methods have been proposed to identify the risk variants, most of them have only curated DNA-level and protein-level annotations to predict the pathogenicity of the variants, and others have been restricted to missense variants exclusively. In this study, we have curated DNA-, RNA-, and protein-level features to discriminate disease-causing variants in both coding and noncoding regions, where the features of protein sequences and protein structures have been shown essential for analyzing missense variants in coding regions while the features related to RNA-splicing and RBP binding are significant for variants in noncoding regions and synonymous variants in coding regions. Through the integration of these features, we have formulated the Multi-level feature Genomic Variants Predictor (ML-GVP) using the gradient boosting tree. The method has been trained on more than 400,000 variants in the Sherloc-training set from the 6th critical assessment of genome interpretation with superior performance. The method is one of the two best-performing predictors on the blind test in the Sherloc assessment, and is further confirmed by another independent test dataset of de novo variants.
Affiliation(s)
- Maolin Ding
  - School of Data and Computer Science, Sun Yat-Sen University, Guangzhou, 510000, China
- Ken Chen
  - School of Data and Computer Science, Sun Yat-Sen University, Guangzhou, 510000, China
- Yuedong Yang
  - School of Data and Computer Science, Sun Yat-Sen University, Guangzhou, 510000, China
  - Key Laboratory of Machine Intelligence and Advanced Computing (Sun Yat-Sen University), Ministry of Education, Guangzhou, China
- Huiying Zhao
  - Sun Yat-Sen Memorial Hospital, Sun Yat-Sen University, Guangzhou, 510000, China
6
Chen SF, Loguercio S, Chen KY, Lee SE, Park JB, Liu S, Sadaei HJ, Torkamani A. Artificial Intelligence for Risk Assessment on Primary Prevention of Coronary Artery Disease. Current Cardiovascular Risk Reports 2023; 17:215-231. [DOI: 10.1007/s12170-023-00731-4] [Accepted: 10/09/2023] [Indexed: 01/04/2025]
Abstract
Purpose of Review
Coronary artery disease (CAD) is a common and etiologically complex disease worldwide. Current guidelines for primary prevention, or the prevention of a first acute event, include relatively simple risk assessment and leave substantial room for improvement both for risk ascertainment and selection of prevention strategies. Here, we review how advances in big data and predictive modeling foreshadow a promising future of improved risk assessment and precision medicine for CAD.
Recent Findings
Artificial intelligence (AI) has improved the utility of high dimensional data, providing an opportunity to better understand the interplay between numerous CAD risk factors. Beyond applications of AI in cardiac imaging, the vanguard application of AI in healthcare, recent translational research is also revealing a promising path for AI in multi-modal risk prediction using standard biomarkers, genetic and other omics technologies, a variety of biosensors, and unstructured data from electronic health records (EHRs). However, gaps remain in clinical validation of AI models, most notably in the actionability of complex risk prediction for more precise therapeutic interventions.
Summary
The recent availability of nation-scale biobank datasets has provided a tremendous opportunity to richly characterize longitudinal health trajectories using health data collected at home, at laboratories, and through clinic visits. The ever-growing availability of deep genotype-phenotype data is poised to drive a transition from simple risk prediction algorithms to complex, “data-hungry,” AI models in clinical decision-making. While AI models provide the means to incorporate essentially all risk factors into comprehensive risk prediction frameworks, there remains a need to wrap these predictions in interpretable frameworks that map to our understanding of underlying biological mechanisms and associated personalized intervention. This review explores recent advances in the role of machine learning and AI in CAD primary prevention and highlights current strengths as well as limitations mediating potential future applications.
7
Chen Y, Paramo MI, Zhang Y, Yao L, Shah SR, Jin Y, Zhang J, Pan X, Yu H. Finding Needles in the Haystack: Strategies for Uncovering Noncoding Regulatory Variants. Annu Rev Genet 2023; 57:201-222. [PMID: 37562413 DOI: 10.1146/annurev-genet-030723-120717] [Indexed: 08/12/2023]
Abstract
Despite accumulating evidence implicating noncoding variants in human diseases, unraveling their functionality remains a significant challenge. Systematic annotations of the regulatory landscape and the growth of sequence variant data sets have fueled the development of tools and methods to identify causal noncoding variants and evaluate their regulatory effects. Here, we review the latest advances in the field and discuss potential future research avenues to gain a more in-depth understanding of noncoding regulatory variants.
Affiliation(s)
- You Chen
  - Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York, USA
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York, USA
- Mauricio I Paramo
  - Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York, USA
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York, USA
- Yingying Zhang
  - Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York, USA
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York, USA
- Li Yao
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York, USA
  - Department of Computational Biology, Cornell University, Ithaca, New York, USA
- Sagar R Shah
  - Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York, USA
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York, USA
- Yiyang Jin
  - Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York, USA
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York, USA
- Junke Zhang
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York, USA
  - Department of Computational Biology, Cornell University, Ithaca, New York, USA
- Xiuqi Pan
  - Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York, USA
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York, USA
- Haiyuan Yu
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York, USA
  - Department of Computational Biology, Cornell University, Ithaca, New York, USA
8
Chu X, Guan B, Dai L, Liu JX, Li F, Shang J. Network embedding framework for driver gene discovery by combining functional and structural information. BMC Genomics 2023; 24:426. [PMID: 37516822 PMCID: PMC10386255 DOI: 10.1186/s12864-023-09515-x] [Received: 10/02/2022] [Accepted: 07/13/2023] [Indexed: 07/31/2023] Open
Abstract
Comprehensive analysis of multiple data sets can identify potential driver genes for various cancers. In recent years, driver gene discovery based on massive mutation data and gene interaction networks has attracted increasing attention, but there is still a need to explore combining functional and structural information of genes in protein interaction networks to identify driver genes. Therefore, we propose a network embedding framework combining functional and structural information to identify driver genes. Firstly, we combine the mutation data and gene interaction networks to construct mutation integration network using network propagation algorithm. Secondly, the struc2vec model is used for extracting gene features from the mutation integration network, which contains both gene's functional and structural information. Finally, machine learning algorithms are utilized to identify the driver genes. Compared with the previous four excellent methods, our method can find gene pairs that are distant from each other through structural similarities and has better performance in identifying driver genes for 12 cancers in the cancer genome atlas. At the same time, we also conduct a comparative analysis of three gene interaction networks, three gene standard sets, and five machine learning algorithms. Our framework provides a new perspective for feature selection to identify novel driver genes.
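The abstract does not say which network propagation algorithm is used. As a hedged illustration of the general idea only, the sketch below implements random walk with restart, one common form of network propagation, on a toy gene interaction graph. The function name `propagate`, the restart weight `alpha`, and the example graph are all assumptions for illustration, not the authors' method.

```python
def propagate(adj, scores, alpha=0.5, iters=100, tol=1e-9):
    """Random walk with restart: F <- alpha * W @ F + (1 - alpha) * F0.

    adj    : dict mapping each gene to a list of neighbours (undirected graph)
    scores : dict of initial per-gene mutation scores (missing genes get 0)
    W is the degree-normalised adjacency, so each gene splits its current
    score equally among its neighbours at every step.
    """
    nodes = list(adj)
    degree = {n: len(adj[n]) for n in nodes}
    f0 = {n: float(scores.get(n, 0.0)) for n in nodes}
    f = dict(f0)
    for _ in range(iters):
        new = {}
        for n in nodes:
            # Score flowing into n from each neighbour m.
            inflow = sum(f[m] / degree[m] for m in adj[n])
            new[n] = alpha * inflow + (1 - alpha) * f0[n]
        delta = max(abs(new[n] - f[n]) for n in nodes)
        f = new
        if delta < tol:  # stop once the scores have stabilised
            break
    return f


if __name__ == "__main__":
    # Toy path graph A - B - C with a single mutated gene A (made-up data).
    graph = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
    smoothed = propagate(graph, {"A": 1.0})
    for gene, value in smoothed.items():
        print(gene, round(value, 4))
```

In this toy run the mutation signal from gene A spreads to its neighbours with decreasing strength, which is the smoothing effect that lets downstream feature extraction see network context as well as raw mutation counts.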
Affiliation(s)
- Xin Chu
  - School of Computer Science, Qufu Normal University, Rizhao, 27826, China
- Boxin Guan
  - School of Computer Science, Qufu Normal University, Rizhao, 27826, China
- Lingyun Dai
  - School of Computer Science, Qufu Normal University, Rizhao, 27826, China
- Jin-Xing Liu
  - School of Computer Science, Qufu Normal University, Rizhao, 27826, China
- Feng Li
  - School of Computer Science, Qufu Normal University, Rizhao, 27826, China
- Junliang Shang
  - School of Computer Science, Qufu Normal University, Rizhao, 27826, China
9
Nicolle R, Altin N, Siquier-Pernet K, Salignac S, Blanc P, Munnich A, Bole-Feysot C, Malan V, Caron B, Nitschké P, Desguerre I, Boddaert N, Rio M, Rausell A, Cantagrel V. A non-coding variant in the Kozak sequence of RARS2 strongly decreases protein levels and causes pontocerebellar hypoplasia. BMC Med Genomics 2023; 16:143. [PMID: 37344844 DOI: 10.1186/s12920-023-01582-z] [Received: 10/21/2022] [Accepted: 06/16/2023] [Indexed: 06/23/2023] Open
Abstract
Bi-allelic variants in the mitochondrial arginyl-transfer RNA synthetase (RARS2) gene have been involved in early-onset encephalopathies classified as pontocerebellar hypoplasia (PCH) type 6 and in epileptic encephalopathy. A variant (NM_020320.3:c.-2A>G) in the promoter and 5'UTR of the RARS2 gene has been previously identified in a family with PCH. Only a mild impact of this variant on the mRNA level has been detected. As RARS2 is non-dosage-sensitive, this observation is not conclusive with regard to the pathogenicity of the variant. We report and describe here a new patient with the same variant in the RARS2 gene, in the homozygous state. This patient presents with a clinical phenotype consistent with PCH6 although in the absence of lactic acidosis. In agreement with the previous study, we measured RARS2 mRNA levels in the patient's fibroblasts and detected a partially preserved gene expression compared to control. Importantly, this variant is located in the Kozak sequence that controls translation initiation. Therefore, we investigated the impact on protein translation using a bioinformatic approach and western blotting. We show here that this variant, in addition to its effect on transcription, also disrupts the consensus Kozak sequence and has a major impact on RARS2 protein translation. Through the identification of this additional case and the characterization of the molecular consequences, we clarified the involvement of this Kozak variant in PCH and on protein synthesis. This work also points to the current limitation in the pathogenicity prediction of variants located in the translation initiation region.
Affiliation(s)
- Romain Nicolle
  - Developmental Brain Disorders Laboratory, Université Paris Cité, INSERM UMR1163, Imagine Institute, 75015, Paris, France
  - Clinical Bioinformatics Laboratory, Université Paris Cité, INSERM UMR 1163, Imagine Institute, Paris, 75015, France
- Nami Altin
  - Developmental Brain Disorders Laboratory, Université Paris Cité, INSERM UMR1163, Imagine Institute, 75015, Paris, France
- Karine Siquier-Pernet
  - Developmental Brain Disorders Laboratory, Université Paris Cité, INSERM UMR1163, Imagine Institute, 75015, Paris, France
- Sherlina Salignac
  - Developmental Brain Disorders Laboratory, Université Paris Cité, INSERM UMR1163, Imagine Institute, 75015, Paris, France
- Pierre Blanc
  - Developmental Brain Disorders Laboratory, Université Paris Cité, INSERM UMR1163, Imagine Institute, 75015, Paris, France
  - Fédération de Génétique et Médecine Génomique, Service de Médecine Génomique des Maladies Rares, AP-HP, Necker Hospital for Sick Children, Paris, 75015, France
- Arnold Munnich
  - Fédération de Génétique et Médecine Génomique, Service de Médecine Génomique des Maladies Rares, AP-HP, Necker Hospital for Sick Children, Paris, 75015, France
- Christine Bole-Feysot
  - Genomics Platform, Université Paris Cité, INSERM UMR 1163, Imagine Institute, Paris, 75015, France
- Valérie Malan
  - Developmental Brain Disorders Laboratory, Université Paris Cité, INSERM UMR1163, Imagine Institute, 75015, Paris, France
  - Fédération de Génétique et Médecine Génomique, Service de Médecine Génomique des Maladies Rares, AP-HP, Necker Hospital for Sick Children, Paris, 75015, France
- Barthélémy Caron
  - Clinical Bioinformatics Laboratory, Université Paris Cité, INSERM UMR 1163, Imagine Institute, Paris, 75015, France
- Patrick Nitschké
  - Bioinformatics Core Facility, Université Paris Cité, INSERM UMR 1163, Imagine Institute, 75015, Paris, France
- Isabelle Desguerre
  - Département de Neurologie Pédiatrique, AP-HP, Necker Hospital for Sick Children, 75015, Paris, France
- Nathalie Boddaert
  - Département de Radiologie Pédiatrique, AP-HP, Necker Hospital for Sick Children and Université Paris Cité, INSERM UMR 1163 and INSERM U1299, Imagine Institute, Paris, 75015, France
- Marlène Rio
  - Developmental Brain Disorders Laboratory, Université Paris Cité, INSERM UMR1163, Imagine Institute, 75015, Paris, France
  - Fédération de Génétique et Médecine Génomique, Service de Médecine Génomique des Maladies Rares, AP-HP, Necker Hospital for Sick Children, Paris, 75015, France
- Antonio Rausell
  - Clinical Bioinformatics Laboratory, Université Paris Cité, INSERM UMR 1163, Imagine Institute, Paris, 75015, France
  - Fédération de Génétique et Médecine Génomique, Service de Médecine Génomique des Maladies Rares, AP-HP, Necker Hospital for Sick Children, Paris, 75015, France
- Vincent Cantagrel
  - Developmental Brain Disorders Laboratory, Université Paris Cité, INSERM UMR1163, Imagine Institute, 75015, Paris, France
10
Babushkina NP, Kucher AN. Regulatory Potential of SNP Markers in Genes of DNA Repair Systems. Mol Biol 2023. [DOI: 10.1134/s002689332301003x] [Indexed: 04/03/2023]
11
Yang H, Chen R, Wang Q, Wei Q, Ji Y, Zhong X, Li B. TVAR: assessing tissue-specific functional effects of non-coding variants with deep learning. Bioinformatics 2022; 38:4697-4704. [PMID: 36063453 PMCID: PMC9563698 DOI: 10.1093/bioinformatics/btac608] [Received: 10/23/2021] [Revised: 07/29/2022] [Accepted: 09/02/2022] [Indexed: 11/13/2022] Open
Abstract
Motivation
Analysis of whole-genome sequencing (WGS) for genetics is still a challenge due to the lack of accurate functional annotation of non-coding variants, especially the rare ones. As eQTLs have been extensively implicated in the genetics of human diseases, we hypothesize that rare non-coding variants discovered in WGS play a regulatory role in predisposing disease risk.
Results
With thousands of tissue- and cell-type-specific epigenomic features, we propose TVAR. This multi-label learning-based deep neural network predicts the functionality of non-coding variants in the genome based on eQTLs across 49 human tissues in the GTEx project. TVAR learns the relationships between high-dimensional epigenomics and eQTLs across tissues, taking the correlation among tissues into account to understand shared and tissue-specific eQTL effects. As a result, TVAR outputs tissue-specific annotations, with an average AUROC of 0.77 across these tissues. We evaluate TVAR's performance on four complex diseases (coronary artery disease, breast cancer, Type 2 diabetes and Schizophrenia), using TVAR's tissue-specific annotations, and observe its superior performance in predicting functional variants for both common and rare variants, compared with five existing state-of-the-art tools. We further evaluate TVAR's G-score, a scoring scheme across all tissues, on ClinVar, fine-mapped GWAS loci, Massive Parallel Reporter Assay (MPRA) validated variants and observe the consistently better performance of TVAR compared with other competing tools.
Availability and Implementation
The TVAR source code and its scores on the ClinVar catalog, fine mapped GWAS Loci, high confidence eQTLs from GTEx dataset, and MPRA validated functional variants are available at https://github.com/haiyang1986/TVAR.
Supplementary Information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Hai Yang
  - Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
  - Department of Molecular Physiology & Biophysics, Vanderbilt University, Nashville, TN 37232, USA
- Rui Chen
  - Department of Molecular Physiology & Biophysics, Vanderbilt University, Nashville, TN 37232, USA
  - Vanderbilt Genetics Institute, Vanderbilt University, Nashville, TN 37232, USA
- Quan Wang
  - Department of Molecular Physiology & Biophysics, Vanderbilt University, Nashville, TN 37232, USA
  - Vanderbilt Genetics Institute, Vanderbilt University, Nashville, TN 37232, USA
- Qiang Wei
  - Department of Molecular Physiology & Biophysics, Vanderbilt University, Nashville, TN 37232, USA
  - Vanderbilt Genetics Institute, Vanderbilt University, Nashville, TN 37232, USA
- Ying Ji
  - Department of Molecular Physiology & Biophysics, Vanderbilt University, Nashville, TN 37232, USA
  - Vanderbilt Genetics Institute, Vanderbilt University, Nashville, TN 37232, USA
- Xue Zhong
  - Vanderbilt Genetics Institute, Vanderbilt University, Nashville, TN 37232, USA
  - Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37232, USA
- Bingshan Li
  - Department of Molecular Physiology & Biophysics, Vanderbilt University, Nashville, TN 37232, USA
  - Vanderbilt Genetics Institute, Vanderbilt University, Nashville, TN 37232, USA
Collapse
|
12
|
Downes DJ, Hughes JR. Natural and Experimental Rewiring of Gene Regulatory Regions. Annu Rev Genomics Hum Genet 2022; 23:73-97. [PMID: 35472292 DOI: 10.1146/annurev-genom-112921-010715] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
The successful development and ongoing functioning of complex organisms depend on the faithful execution of the genetic code. A critical step in this process is the correct spatial and temporal expression of genes. The highly orchestrated transcription of genes is controlled primarily by cis-regulatory elements: promoters, enhancers, and insulators. The medical importance of this key biological process can be seen by the frequency with which mutations and inherited variants that alter cis-regulatory elements lead to monogenic and complex diseases and cancer. Here, we provide an overview of the methods available to characterize and perturb gene regulatory circuits. We then highlight mechanisms through which regulatory rewiring contributes to disease, and conclude with a perspective on how our understanding of gene regulation can be used to improve human health.
Affiliation(s)
- Damien J Downes
  - MRC Molecular Haematology Unit, MRC Weatherall Institute of Molecular Medicine, Radcliffe Department of Medicine, University of Oxford, Oxford, United Kingdom
- Jim R Hughes
  - MRC Molecular Haematology Unit, MRC Weatherall Institute of Molecular Medicine, Radcliffe Department of Medicine, University of Oxford, Oxford, United Kingdom
  - MRC WIMM Centre for Computational Biology, MRC Weatherall Institute of Molecular Medicine, Radcliffe Department of Medicine, University of Oxford, Oxford, United Kingdom
13
Ellingford JM, Ahn JW, Bagnall RD, Baralle D, Barton S, Campbell C, Downes K, Ellard S, Duff-Farrier C, FitzPatrick DR, Greally JM, Ingles J, Krishnan N, Lord J, Martin HC, Newman WG, O'Donnell-Luria A, Ramsden SC, Rehm HL, Richardson E, Singer-Berk M, Taylor JC, Williams M, Wood JC, Wright CF, Harrison SM, Whiffin N. Recommendations for clinical interpretation of variants found in non-coding regions of the genome. Genome Med 2022; 14:73. [PMID: 35850704 PMCID: PMC9295495 DOI: 10.1186/s13073-022-01073-3] [Citation(s) in RCA: 95] [Impact Index Per Article: 31.7] [Received: 02/09/2022] [Accepted: 06/16/2022] [Indexed: 01/28/2023]
Abstract
BACKGROUND The majority of clinical genetic testing focuses almost exclusively on regions of the genome that directly encode proteins. The important role of variants in non-coding regions in penetrant disease is, however, increasingly being demonstrated, and the use of whole genome sequencing in clinical diagnostic settings is rising across a large range of genetic disorders. Despite this, there is no existing guidance on how current guidelines designed primarily for variants in protein-coding regions should be adapted for variants identified in other genomic contexts. METHODS We convened a panel of nine clinical and research scientists with wide-ranging expertise in clinical variant interpretation, with specific experience in variants within non-coding regions. This panel discussed and refined an initial draft of the guidelines which were then extensively tested and reviewed by external groups. RESULTS We discuss considerations specifically for variants in non-coding regions of the genome. We outline how to define candidate regulatory elements, highlight examples of mechanisms through which non-coding region variants can lead to penetrant monogenic disease, and outline how existing guidelines can be adapted for the interpretation of these variants. CONCLUSIONS These recommendations aim to increase the number and range of non-coding region variants that can be clinically interpreted, which, together with a compatible phenotype, can lead to new diagnoses and catalyse the discovery of novel disease mechanisms.
Affiliation(s)
- Jamie M Ellingford
  - Division of Evolution, Infection and Genomic Sciences, School of Biological Sciences, Faculty of Biology, Medicines and Health, University of Manchester, Manchester, M13 9PT, UK
  - Manchester Centre for Genomic Medicine, St Mary's Hospital, Manchester University NHS Foundation Trust, Manchester, M13 9WL, UK
  - Genomics England, London, UK
- Joo Wook Ahn
  - Cambridge Genomics Laboratory, Cambridge University Hospitals NHS Foundation Trust, Cambridge Biomedical Campus, Cambridge, UK
- Richard D Bagnall
  - Agnes Ginges Centre for Molecular Cardiology at Centenary Institute, University of Sydney, Sydney, Australia
- Diana Baralle
  - School of Human Development and Health, Faculty of Medicine, University of Southampton, Southampton, UK
  - Wessex Clinical Genetics Service, University Hospital Southampton NHS Foundation Trust, Southampton, UK
- Stephanie Barton
  - Manchester Centre for Genomic Medicine, St Mary's Hospital, Manchester University NHS Foundation Trust, Manchester, M13 9WL, UK
- Chris Campbell
  - Manchester Centre for Genomic Medicine, St Mary's Hospital, Manchester University NHS Foundation Trust, Manchester, M13 9WL, UK
- Kate Downes
  - Cambridge Genomics Laboratory, Cambridge University Hospitals NHS Foundation Trust, Cambridge Biomedical Campus, Cambridge, UK
- Sian Ellard
  - Institute of Biomedical and Clinical Science, University of Exeter Medical School, Exeter, UK
  - South West Genomic Laboratory Hub, Exeter Genomic Laboratory, Royal Devon and Exeter NHS Foundation Trust, Exeter, UK
- Celia Duff-Farrier
  - South West NHS Genomic Laboratory Hub, Bristol Genetics Laboratory, North Bristol NHS Trust, Bristol, UK
- David R FitzPatrick
  - MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Western General Hospital, Edinburgh, UK
- John M Greally
  - Department of Pediatrics, Division of Pediatric Genetic Medicine, Children's Hospital at Montefiore/Montefiore Medical Center/Albert Einstein College of Medicine, Bronx, NY, USA
- Jodie Ingles
  - Centre for Population Genomics, Garvan Institute of Medical Research, and UNSW Sydney, Sydney, Australia
  - Centre for Population Genomics, Murdoch Children's Research Institute, Melbourne, Australia
- Neesha Krishnan
  - Centre for Population Genomics, Garvan Institute of Medical Research, and UNSW Sydney, Sydney, Australia
  - Centre for Population Genomics, Murdoch Children's Research Institute, Melbourne, Australia
- Jenny Lord
  - School of Human Development and Health, Faculty of Medicine, University of Southampton, Southampton, UK
- Hilary C Martin
  - Human Genetics Programme, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
- William G Newman
  - Division of Evolution, Infection and Genomic Sciences, School of Biological Sciences, Faculty of Biology, Medicines and Health, University of Manchester, Manchester, M13 9PT, UK
  - Manchester Centre for Genomic Medicine, St Mary's Hospital, Manchester University NHS Foundation Trust, Manchester, M13 9WL, UK
- Anne O'Donnell-Luria
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Division of Genetics and Genomics, Boston Children's Hospital, Boston, MA, USA
  - Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
- Simon C Ramsden
  - Manchester Centre for Genomic Medicine, St Mary's Hospital, Manchester University NHS Foundation Trust, Manchester, M13 9WL, UK
- Heidi L Rehm
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
- Ebony Richardson
  - Centre for Population Genomics, Garvan Institute of Medical Research, and UNSW Sydney, Sydney, Australia
  - Centre for Population Genomics, Murdoch Children's Research Institute, Melbourne, Australia
- Moriel Singer-Berk
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Jenny C Taylor
  - National Institute for Health Research Oxford Biomedical Research Centre, Wellcome Centre for Human Genetics, University of Oxford, Oxford, OX3 7BN, UK
  - Wellcome Centre for Human Genetics, University of Oxford, Oxford, OX3 7BN, UK
- Maggie Williams
  - South West NHS Genomic Laboratory Hub, Bristol Genetics Laboratory, North Bristol NHS Trust, Bristol, UK
- Jordan C Wood
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Caroline F Wright
  - Institute of Biomedical and Clinical Science, University of Exeter Medical School, Exeter, UK
- Steven M Harrison
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Ambry Genetics, Aliso Viejo, CA, USA
- Nicola Whiffin
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Wellcome Centre for Human Genetics, University of Oxford, Oxford, OX3 7BN, UK
14
Brown A, Mead ME, Steenwyk JL, Goldman GH, Rokas A. Extensive non-coding sequence divergence between the major human pathogen Aspergillus fumigatus and its relatives. Front Fungal Biol 2022; 3:802494. [PMID: 36866034 PMCID: PMC9977105 DOI: 10.3389/ffunb.2022.802494] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 10/26/2021] [Accepted: 06/09/2022] [Indexed: 11/13/2022]
Abstract
Invasive aspergillosis is a deadly fungal disease; more than 400,000 patients are infected worldwide each year and the mortality rate can be as high as 50-95%. Of the ~450 species in the genus Aspergillus only a few are known to be clinically relevant, with the major pathogen Aspergillus fumigatus being responsible for ~50% of all invasive mold infections. Genomic comparisons between A. fumigatus and other Aspergillus species have historically focused on protein-coding regions. However, most A. fumigatus genes, including those that modulate its virulence, are also present in other pathogenic and non-pathogenic closely related species. Our hypothesis is that differential gene regulation - mediated through the non-coding regions upstream of genes' first codon - contributes to A. fumigatus pathogenicity. To begin testing this, we compared non-coding regions upstream of the first codon of single-copy orthologous genes from the two A. fumigatus reference strains Af293 and A1163 and eight closely related Aspergillus section Fumigati species. We found that these non-coding regions showed extensive sequence variation and lack of homology across species. By examining the evolutionary rates of both protein-coding and non-coding regions in a subset of orthologous genes with highly conserved non-coding regions across the phylogeny, we identified 418 genes, including 25 genes known to modulate A. fumigatus virulence, whose non-coding regions exhibit a different rate of evolution in A. fumigatus. Examination of sequence alignments of these non-coding regions revealed numerous instances of insertions, deletions, and other types of mutations of at least a few nucleotides in A. fumigatus compared to its close relatives. These results show that closely related Aspergillus species that vary greatly in their pathogenicity exhibit extensive non-coding sequence variation and identify numerous changes in non-coding regions of A. fumigatus genes known to contribute to virulence.
Affiliation(s)
- Alec Brown
  - Department of Biological Sciences, Vanderbilt University, Nashville, TN, United States
  - Vanderbilt Evolutionary Studies Initiative, Vanderbilt University, Nashville, TN, United States
- Matthew E. Mead
  - Department of Biological Sciences, Vanderbilt University, Nashville, TN, United States
  - Vanderbilt Evolutionary Studies Initiative, Vanderbilt University, Nashville, TN, United States
- Jacob L. Steenwyk
  - Department of Biological Sciences, Vanderbilt University, Nashville, TN, United States
  - Vanderbilt Evolutionary Studies Initiative, Vanderbilt University, Nashville, TN, United States
- Gustavo H. Goldman
  - Faculdade de Ciências Farmacêuticas de Ribeirão Preto, Universidade de São Paulo, Ribeirão Preto, São Paulo, Brazil
- Antonis Rokas
  - Department of Biological Sciences, Vanderbilt University, Nashville, TN, United States
  - Vanderbilt Evolutionary Studies Initiative, Vanderbilt University, Nashville, TN, United States
15
Classification of non-coding variants with high pathogenic impact. PLoS Genet 2022; 18:e1010191. [PMID: 35486646 PMCID: PMC9094564 DOI: 10.1371/journal.pgen.1010191] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 06/27/2021] [Revised: 05/11/2022] [Accepted: 04/05/2022] [Indexed: 01/22/2023]
Abstract
Whole genome sequencing is increasingly used to diagnose medical conditions of genetic origin. While both coding and non-coding DNA variants contribute to a wide range of diseases, most patients who receive a WGS-based diagnosis today harbour a protein-coding mutation. Functional interpretation and prioritization of non-coding variants represents a persistent challenge, and disease-causing non-coding variants remain largely unidentified. Depending on the disease, WGS fails to identify a candidate variant in 20–80% of patients, severely limiting the usefulness of sequencing for personalised medicine. Here we present FINSURF, a machine-learning approach to predict the functional impact of non-coding variants in regulatory regions. FINSURF outperforms state-of-the-art methods, owing in particular to optimized control variants selection during training. In addition to ranking candidate variants, FINSURF breaks down the score for each variant into contributions from individual annotations, facilitating the evaluation of their functional relevance. We applied FINSURF to a diverse set of 30 diseases with described causative non-coding mutations, and correctly identified the disease-causative non-coding variant within the ten top hits in 22 cases. FINSURF is implemented as an online server as well as custom browser tracks, and provides a quick and efficient solution to prioritize candidate non-coding variants in realistic clinical settings.
16
Effects of Multi-Omics Characteristics on Identification of Driver Genes Using Machine Learning Algorithms. Genes (Basel) 2022; 13:genes13050716. [PMID: 35627101 PMCID: PMC9141966 DOI: 10.3390/genes13050716] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/16/2022] [Revised: 04/16/2022] [Accepted: 04/18/2022] [Indexed: 12/19/2022]
Abstract
Cancer is a complex disease caused by genomic and epigenetic alterations; hence, identifying meaningful cancer drivers is an important and challenging task. Most studies have detected cancer drivers with mutated traits, while few studies consider multiple omics characteristics as important factors. In this study, we present a framework to analyze the effects of multi-omics characteristics on the identification of driver genes. We utilize four machine learning algorithms within this framework to detect cancer driver genes in pan-cancer data, including 75 characteristics among 19,636 genes. The 75 features are divided into four types and analyzed using Kullback–Leibler divergence based on CGC genes and non-CGC genes. We detect cancer driver genes in two different ways. One is to detect driver genes from a single feature type, while the other is from the top N features. The first analysis denotes that the mutational features are the best characteristics. The second analysis reveals that the top 45 features are the most effective feature combinations and superior to the mutational features. The top 45 features not only contain mutational features but also three other types of features. Therefore, our study extends the detection of cancer driver genes and provides a more comprehensive understanding of cancer mechanisms.
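As a rough illustration of the Kullback–Leibler comparison mentioned in this abstract, the sketch below bins one feature for two gene groups and computes the divergence between the resulting histograms. It is not the authors' implementation: the binning, the pseudocount, the function names and the feature values are all assumptions made for the example.

```python
import math

def histogram(values, bins, lo, hi, pseudocount=1.0):
    """Smoothed probability histogram over `bins` equal-width bins (assumed binning)."""
    counts = [pseudocount] * bins
    width = (hi - lo) / bins
    for v in values:
        # Clamp so values at or beyond the edges fall into the first/last bin.
        idx = max(0, min(int((v - lo) / width), bins - 1))
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q):
    """D_KL(P || Q) in nats for two discrete distributions on the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical values of one feature for the two gene groups (not real data).
cgc_values = [0.8, 0.9, 0.7, 0.85, 0.6]
non_cgc_values = [0.1, 0.2, 0.15, 0.3, 0.25]

p = histogram(cgc_values, bins=5, lo=0.0, hi=1.0)
q = histogram(non_cgc_values, bins=5, lo=0.0, hi=1.0)
score = kl_divergence(p, q)  # larger values mean the feature separates the groups more
```

The pseudocount keeps every bin non-zero so the logarithm is always defined; a feature whose distribution is identical in both groups would score 0.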
17
Giacopuzzi E, Popitsch N, Taylor JC. GREEN-DB: a framework for the annotation and prioritization of non-coding regulatory variants from whole-genome sequencing data. Nucleic Acids Res 2022; 50:2522-2535. [PMID: 35234913 PMCID: PMC8934622 DOI: 10.1093/nar/gkac130] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 11/26/2021] [Revised: 02/02/2022] [Accepted: 02/14/2022] [Indexed: 11/25/2022]
Abstract
Non-coding variants have long been recognized as important contributors to common disease risks, but with the expansion of clinical whole genome sequencing, examples of rare, high-impact non-coding variants are also accumulating. Despite recent advances in the study of regulatory elements and the availability of specialized data collections, the systematic annotation of non-coding variants from genome sequencing remains challenging. Here, we propose a new framework for the prioritization of non-coding regulatory variants that integrates information about regulatory regions with prediction scores and HPO-based prioritization. Firstly, we created a comprehensive collection of annotations for regulatory regions including a database of 2.4 million regulatory elements (GREEN-DB) annotated with controlled gene(s), tissue(s) and associated phenotype(s) where available. Secondly, we calculated a variation constraint metric and showed that constrained regulatory regions associate with disease-associated genes and essential genes from mouse knock-outs. Thirdly, we compared 19 non-coding impact prediction scores providing suggestions for variant prioritization. Finally, we developed a VCF annotation tool (GREEN-VARAN) that can integrate all these elements to annotate variants for their potential regulatory impact. In our evaluation, we show that GREEN-DB can capture previously published disease-associated non-coding variants as well as identify additional candidate disease genes in trio analyses.
Affiliation(s)
- Edoardo Giacopuzzi
  - Wellcome Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK
  - National Institute for Health Research Oxford Biomedical Research Centre, Oxford OX4 2PG, UK
- Niko Popitsch
  - Wellcome Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK
  - Max Perutz Labs, University of Vienna, Dr. Bohr-Gasse 9, 1030 Vienna, Austria
- Jenny C Taylor
  - Wellcome Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK
  - National Institute for Health Research Oxford Biomedical Research Centre, Oxford OX4 2PG, UK
18
Kundu K, Tardaguila M, Mann AL, Watt S, Ponstingl H, Vasquez L, Von Schiller D, Morrell NW, Stegle O, Pastinen T, Sawcer SJ, Anderson CA, Walter K, Soranzo N. Genetic associations at regulatory phenotypes improve fine-mapping of causal variants for 12 immune-mediated diseases. Nat Genet 2022; 54:251-262. [PMID: 35288711 DOI: 10.1038/s41588-022-01025-y] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Received: 12/20/2019] [Accepted: 01/31/2022] [Indexed: 12/11/2022]
Abstract
The resolution of causal genetic variants informs understanding of disease biology. We used regulatory quantitative trait loci (QTLs) from the BLUEPRINT, GTEx and eQTLGen projects to fine-map putative causal variants for 12 immune-mediated diseases. We identify 340 unique loci that colocalize with high posterior probability (≥98%) with regulatory QTLs and apply Bayesian frameworks to fine-map associations at each locus. We show that fine-mapping credible sets derived from regulatory QTLs are smaller compared to disease summary statistics. Further, they are enriched for more functionally interpretable candidate causal variants and for putatively causal insertion/deletion (INDEL) polymorphisms. Finally, we use massively parallel reporter assays to evaluate candidate causal variants at the ITGA4 locus associated with inflammatory bowel disease. Overall, our findings suggest that fine-mapping applied to disease-colocalizing regulatory QTLs can enhance the discovery of putative causal disease variants and enhance insights into the underlying causal genes and molecular mechanisms.
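To make the notion of a fine-mapping credible set concrete, here is a minimal sketch of the standard construction: rank variants by posterior probability and keep adding them until the cumulative probability reaches the chosen coverage. The variant identifiers, posterior values and the 95% coverage are hypothetical, and the code does not reproduce the Bayesian frameworks used in the study.

```python
def credible_set(posteriors, coverage=0.95):
    """Smallest set of variants whose summed posterior probability reaches `coverage`."""
    ranked = sorted(posteriors.items(), key=lambda kv: kv[1], reverse=True)
    selected, cumulative = [], 0.0
    for variant, prob in ranked:
        selected.append(variant)
        cumulative += prob
        if cumulative >= coverage:
            break
    return selected

# Hypothetical posteriors at one locus (each dictionary sums to 1).
diffuse_signal = {"var_a": 0.70, "var_b": 0.20, "var_c": 0.06, "var_d": 0.04}
sharp_signal = {"var_a": 0.97, "var_b": 0.01, "var_c": 0.01, "var_d": 0.01}

diffuse_set = credible_set(diffuse_signal)  # needs three variants to reach 95%
sharp_set = credible_set(sharp_signal)      # a single variant already exceeds 95%
```

A more concentrated posterior yields a smaller set, which is the sense in which credible sets can be "smaller" when the underlying association signal is sharper.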
Affiliation(s)
- Kousik Kundu
  - Human Genetics, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
  - Department of Haematology, University of Cambridge, Cambridge Biomedical Campus, Cambridge, UK
- Manuel Tardaguila
  - Human Genetics, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Alice L Mann
  - Human Genetics, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Stephen Watt
  - Human Genetics, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Hannes Ponstingl
  - Human Genetics, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Louella Vasquez
  - Human Genetics, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Dominique Von Schiller
  - Human Genetics, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Nicholas W Morrell
  - Division of Respiratory Medicine, Department of Medicine, University of Cambridge School of Clinical Medicine, Addenbrooke's and Papworth Hospitals, Cambridge, UK
- Oliver Stegle
  - Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
  - Division of Computational Genomics and Systems Genetics, German Cancer Research Center, Heidelberg, Germany
  - Cellular Genetics, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Tomi Pastinen
  - Genomic Medicine Center, Children's Mercy Kansas City and Children's Mercy Research Institute, Kansas City, MO, USA
- Stephen J Sawcer
  - Department of Clinical Neurosciences, University of Cambridge, Cambridge Biomedical Campus, Cambridge, UK
- Carl A Anderson
  - Human Genetics, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Klaudia Walter
  - Human Genetics, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Nicole Soranzo
  - Human Genetics, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
  - Department of Haematology, University of Cambridge, Cambridge Biomedical Campus, Cambridge, UK
  - British Heart Foundation Centre of Research Excellence, University of Cambridge, Cambridge, UK
  - National Institute for Health Research Blood and Transplant Research Unit in Donor Health and Genomics, University of Cambridge, Cambridge, UK
  - Genomics Research Centre, Human Technopole, Milan, Italy
19
Caron B, Patin E, Rotival M, Charbit B, Albert ML, Quintana-Murci L, Duffy D, Rausell A. Integrative genetic and immune cell analysis of plasma proteins in healthy donors identifies novel associations involving primary immune deficiency genes. Genome Med 2022; 14:28. [PMID: 35264221 PMCID: PMC8905727 DOI: 10.1186/s13073-022-01032-y] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 06/10/2021] [Accepted: 02/15/2022] [Indexed: 12/12/2022]
Abstract
BACKGROUND Blood plasma proteins play an important role in immune defense against pathogens, including cytokine signaling, the complement system, and the acute-phase response. Recent large-scale studies have reported genetic (i.e., protein quantitative trait loci, pQTLs) and non-genetic factors, such as age and sex, as major determinants of inter-individual variability in immune response. However, the contribution of blood-cell composition to plasma protein heterogeneity has not been fully characterized and may act as a mediating factor in association studies. METHODS Here, we evaluated plasma protein levels from 400 unrelated healthy individuals of western European ancestry, who were stratified by sex and two decades of life (20-29 and 60-69 years), from the Milieu Intérieur cohort. We quantified 229 proteins by Luminex in a clinically certified laboratory and their levels of variation were analyzed together with 5.2 million single-nucleotide polymorphisms. With respect to non-genetic variables, we included 254 lifestyle and biochemical factors, as well as counts of seven circulating immune cell populations measured by hemogram and standardized flow cytometry. RESULTS Collectively, we found 152 significant associations involving 49 proteins and 20 non-genetic variables. Consistent with previous studies, age and sex showed a global, pervasive impact on plasma protein heterogeneity, while body mass index and other health status variables were among the non-genetic factors with the highest number of associations. After controlling for these covariates, we identified 100 and 12 pQTLs acting in cis and trans, respectively, collectively associated with 87 plasma proteins and including 19 novel genetic associations. Genetic factors explained the largest fraction of the variability of plasma protein levels, as compared to non-genetic factors. In addition, blood-cell fractions, including leukocytes, lymphocytes, monocytes, neutrophils, eosinophils, basophils, and platelets, had a larger contribution to inter-individual variability than age and sex and appeared as confounders of specific genetic associations. Finally, we identified new genetic associations with plasma protein levels of five monogenic Mendelian disease genes including two primary immunodeficiency genes (Ficolin-3 and FAS). CONCLUSIONS Our study identified novel genetic and non-genetic factors associated with plasma protein levels which may inform health status and disease management.
Affiliation(s)
- Barthelemy Caron
  - Université de Paris, INSERM UMR1163, Imagine Institute, Clinical Bioinformatics Laboratory, F-75006, Paris, France
- Etienne Patin
  - Human Evolutionary Genetics Unit, Institut Pasteur, UMR2000, CNRS, Université de Paris, F-75015, Paris, France
- Maxime Rotival
  - Human Evolutionary Genetics Unit, Institut Pasteur, UMR2000, CNRS, Université de Paris, F-75015, Paris, France
- Bruno Charbit
  - Cytometry and Biomarkers UTechS, CRT, Institut Pasteur, Université de Paris, F-75015, Paris, France
- Lluis Quintana-Murci
  - Human Evolutionary Genetics Unit, Institut Pasteur, UMR2000, CNRS, Université de Paris, F-75015, Paris, France
  - Human Genomics and Evolution, Collège de France, F-75005, Paris, France
- Darragh Duffy
  - Cytometry and Biomarkers UTechS, CRT, Institut Pasteur, Université de Paris, F-75015, Paris, France
  - Translational Immunology Unit, Institut Pasteur, Université de Paris, F-75015, Paris, France
- Antonio Rausell
  - Université de Paris, INSERM UMR1163, Imagine Institute, Clinical Bioinformatics Laboratory, F-75006, Paris, France
  - Service de Médecine Génomique des Maladies Rares, AP-HP, Necker Hospital for Sick Children, F-75015, Paris, France
20
Danis D, Jacobsen JOB, Carmody LC, Gargano MA, McMurry JA, Hegde A, Haendel MA, Valentini G, Smedley D, Robinson PN. Interpretable prioritization of splice variants in diagnostic next-generation sequencing. Am J Hum Genet 2021; 108:1564-1577. [PMID: 34289339 DOI: 10.1016/j.ajhg.2021.06.014] [Citation(s) in RCA: 34] [Impact Index Per Article: 8.5] [Received: 02/05/2021] [Accepted: 06/18/2021] [Indexed: 12/11/2022]
Abstract
A critical challenge in genetic diagnostics is the computational assessment of candidate splice variants, specifically the interpretation of nucleotide changes located outside of the highly conserved dinucleotide sequences at the 5' and 3' ends of introns. To address this gap, we developed the Super Quick Information-content Random-forest Learning of Splice variants (SQUIRLS) algorithm. SQUIRLS generates a small set of interpretable features for machine learning by calculating the information-content of wild-type and variant sequences of canonical and cryptic splice sites, assessing changes in candidate splicing regulatory sequences, and incorporating characteristics of the sequence such as exon length, disruptions of the AG exclusion zone, and conservation. We curated a comprehensive collection of disease-associated splice-altering variants at positions outside of the highly conserved AG/GT dinucleotides at the termini of introns. SQUIRLS trains two random-forest classifiers for the donor and for the acceptor and combines their outputs by logistic regression to yield a final score. We show that SQUIRLS transcends previous state-of-the-art accuracy in classifying splice variants as assessed by rank analysis in simulated exomes, and is significantly faster than competing methods. SQUIRLS provides tabular output files for incorporation into diagnostic pipelines for exome and genome analysis, as well as visualizations that contextualize predicted effects of variants on splicing to make it easier to interpret splice variants in diagnostic settings.
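The final step described in this abstract, combining a donor score and an acceptor score through logistic regression, can be sketched as a weighted sum passed through a sigmoid. The function name, intercept and weights below are invented for illustration; in a real model they would be fitted to training data, and this is not SQUIRLS code.

```python
import math

def combine_scores(donor_score, acceptor_score,
                   intercept=-3.0, w_donor=4.0, w_acceptor=4.0):
    """Logistic-regression combination of two classifier outputs.

    The intercept and weights are hypothetical placeholders, not fitted values.
    """
    z = intercept + w_donor * donor_score + w_acceptor * acceptor_score
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical random-forest outputs in [0, 1] for two variants.
likely_donor_disrupting = combine_scores(donor_score=0.9, acceptor_score=0.1)
likely_neutral = combine_scores(donor_score=0.1, acceptor_score=0.1)
```

With positive weights, a high score from either classifier raises the combined score, so a variant only needs to look damaging at one of the two site types to rank highly.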
Affiliation(s)
- Daniel Danis
  - The Jackson Laboratory for Genomic Medicine, 10 Discovery Drive, Farmington, CT 06032, USA
- Julius O B Jacobsen
  - William Harvey Research Institute, Charterhouse Square, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, EC1M 6BQ London, UK
- Leigh C Carmody
  - The Jackson Laboratory for Genomic Medicine, 10 Discovery Drive, Farmington, CT 06032, USA
- Michael A Gargano
  - The Jackson Laboratory for Genomic Medicine, 10 Discovery Drive, Farmington, CT 06032, USA
- Julie A McMurry
  - University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Ayushi Hegde
  - The Jackson Laboratory for Genomic Medicine, 10 Discovery Drive, Farmington, CT 06032, USA
- Giorgio Valentini
  - Anacleto Lab - Dipartimento di Informatica and DSRC, Università degli Studi di Milano, Via Celoria 18, 20133 Milan, Italy
  - CINI National Laboratory in Artificial Intelligence and Intelligent Systems-AIIS, Rome, Italy
- Damian Smedley
  - William Harvey Research Institute, Charterhouse Square, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, EC1M 6BQ London, UK
- Peter N Robinson
  - The Jackson Laboratory for Genomic Medicine, 10 Discovery Drive, Farmington, CT 06032, USA
  - Institute for Systems Genomics, University of Connecticut, Farmington, CT 06032, USA
21
Délot EC, Vilain E. Towards improved genetic diagnosis of human differences of sex development. Nat Rev Genet 2021; 22:588-602. [PMID: 34083777 PMCID: PMC10598994 DOI: 10.1038/s41576-021-00365-5] [Citation(s) in RCA: 41] [Impact Index Per Article: 10.3] [Accepted: 04/14/2021] [Indexed: 02/05/2023]
Abstract
Despite being collectively among the most frequent congenital developmental conditions worldwide, differences of sex development (DSD) lack recognition and research funding. As a result, what constitutes optimal management remains uncertain. Identification of the individual conditions under the DSD umbrella is challenging and molecular genetic diagnosis is frequently not achieved, which has psychosocial and health-related repercussions for patients and their families. New genomic approaches have the potential to resolve this impasse through better detection of protein-coding variants and ascertainment of under-recognized aetiology, such as mosaic, structural, non-coding or epigenetic variants. Ultimately, it is hoped that better outcomes data, improved understanding of the molecular causes and greater public awareness will bring an end to the stigma often associated with DSD.
Affiliation(s)
- Emmanuèle C Délot
- Center for Genetic Medicine Research, Children's Research Institute, Children's National Hospital, Washington, DC, USA
- Department of Genomics and Precision Medicine, School of Medicine and Health Sciences, George Washington University, Washington, DC, USA
| | - Eric Vilain
- Center for Genetic Medicine Research, Children's Research Institute, Children's National Hospital, Washington, DC, USA.
- Department of Genomics and Precision Medicine, School of Medicine and Health Sciences, George Washington University, Washington, DC, USA.
|
22
|
Requena F, Abdallah HH, García A, Nitschké P, Romana S, Malan V, Rausell A. CNVxplorer: a web tool to assist clinical interpretation of CNVs in rare disease patients. Nucleic Acids Res 2021; 49:W93-W103. [PMID: 34019647 PMCID: PMC8262689 DOI: 10.1093/nar/gkab347] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 03/07/2021] [Revised: 04/12/2021] [Accepted: 05/20/2021] [Indexed: 12/20/2022] Open
Abstract
Copy Number Variants (CNVs) are an important cause of rare diseases. Array-based Comparative Genomic Hybridization tests yield a ∼12% diagnostic rate, with ∼8% of patients presenting CNVs of unknown significance. CNV interpretation is particularly challenging in genomic regions outside of those overlapping with previously reported structural variants or disease-associated genes. Recent studies showed that a more comprehensive evaluation of CNV features, leveraging both coding and non-coding impacts, can significantly improve diagnostic rates. However, currently available CNV interpretation tools are mostly gene-centric or provide only non-interactive annotations that are difficult to assess in clinical practice. Here, we present CNVxplorer, a web server suited for the functional assessment of CNVs in a clinical diagnostic setting. CNVxplorer mines a comprehensive set of clinical, genomic, and epigenomic features associated with CNVs. It provides sequence constraint metrics, impact on regulatory elements and topologically associating domains, as well as expression patterns. The analyses offered cover (a) agreement with patient phenotypes; (b) visualizations of associations among genes, regulatory elements and transcription factors; (c) enrichment in functional and pathway annotations; and (d) co-occurrence of terms across PubMed publications related to the query CNVs. A flexible evaluation workflow allows dynamic re-interrogation in clinical sessions. CNVxplorer is publicly available at http://cnvxplorer.com.
Affiliation(s)
- Francisco Requena
- Université de Paris, Institut Imagine, F-75006 Paris, France
- Clinical Bioinformatics Laboratory, Imagine Institute, INSERM UMR1163, F-75015 Paris, France
| | - Hamza Hadj Abdallah
- Université de Paris, Institut Imagine, F-75006 Paris, France
- Service de Cytogénétique, Hôpital Necker-Enfants Malades, APHP, F-75015 Paris, France
| | - Alejandro García
- Université de Paris, Institut Imagine, F-75006 Paris, France
- Clinical Bioinformatics Laboratory, Imagine Institute, INSERM UMR1163, F-75015 Paris, France
| | - Patrick Nitschké
- Université de Paris, Institut Imagine, F-75006 Paris, France
- Plateforme de Bioinformatique, Université Paris Descartes, F-75015 Paris, France
| | - Sergi Romana
- Université de Paris, Institut Imagine, F-75006 Paris, France
- Service de Cytogénétique, Hôpital Necker-Enfants Malades, APHP, F-75015 Paris, France
| | - Valérie Malan
- Université de Paris, Institut Imagine, F-75006 Paris, France
- Service de Cytogénétique, Hôpital Necker-Enfants Malades, APHP, F-75015 Paris, France
| | - Antonio Rausell
- Université de Paris, Institut Imagine, F-75006 Paris, France
- Clinical Bioinformatics Laboratory, Imagine Institute, INSERM UMR1163, F-75015 Paris, France
- Service de Génétique Moleculaire, Hôpital Necker-Enfants Malades, APHP, F-75015, Paris, France
|
23
|
Kim SS, Dey KK, Weissbrod O, Márquez-Luna C, Gazal S, Price AL. Improving the informativeness of Mendelian disease-derived pathogenicity scores for common disease. Nat Commun 2020; 11:6258. [PMID: 33288751 PMCID: PMC7721881 DOI: 10.1038/s41467-020-20087-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 07/09/2020] [Accepted: 11/09/2020] [Indexed: 02/08/2023] Open
Abstract
Despite considerable progress on pathogenicity scores prioritizing variants for Mendelian disease, little is known about the utility of these scores for common disease. Here, we assess the informativeness of Mendelian disease-derived pathogenicity scores for common disease and improve upon existing scores. We first apply stratified linkage disequilibrium (LD) score regression to evaluate published pathogenicity scores across 41 common diseases and complex traits (average N = 320K). Several of the resulting annotations are informative for common disease, even after conditioning on a broad set of functional annotations. We then improve upon published pathogenicity scores by developing AnnotBoost, a machine learning framework to impute and denoise pathogenicity scores using a broad set of functional annotations. AnnotBoost substantially increases the informativeness for common disease of both previously uninformative and previously informative pathogenicity scores, implying that Mendelian and common disease variants share similar properties. The boosted scores also produce improvements in heritability model fit and in classifying disease-associated, fine-mapped SNPs. Our boosted scores may improve fine-mapping and candidate gene discovery for common disease.
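For orientation, the stratified LD score regression model that this abstract applies is commonly written in the form below. The symbols are not defined in the abstract and follow the usual presentation of the method: N is the GWAS sample size, τ_C the per-SNP heritability contribution of annotation C, ℓ(j, C) the LD score of SNP j with respect to annotation C, a_C(k) the value of annotation C at SNP k, r_jk the correlation between SNPs j and k, and a a term capturing confounding.

```latex
\mathbb{E}\left[\chi_j^2\right] = N \sum_{C} \tau_C \, \ell(j, C) + N a + 1,
\qquad
\ell(j, C) = \sum_{k} a_C(k) \, r_{jk}^2
```

Under this model, an annotation built from a pathogenicity score is "informative after conditioning" when its τ_C remains significantly different from zero with the other functional annotations included in the sum.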
Affiliation(s)
- Samuel S Kim
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, 02142, USA.
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA.
| | - Kushal K Dey
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA
| | - Omer Weissbrod
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA
| | - Carla Márquez-Luna
- Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA
| | - Steven Gazal
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA
| | - Alkes L Price
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA.
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA.
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA.
|
24
|
Choudhuri A, Trompouki E, Abraham BJ, Colli LM, Kock KH, Mallard W, Yang ML, Vinjamur DS, Ghamari A, Sporrij A, Hoi K, Hummel B, Boatman S, Chan V, Tseng S, Nandakumar SK, Yang S, Lichtig A, Superdock M, Grimes SN, Bowman TV, Zhou Y, Takahashi S, Joehanes R, Cantor AB, Bauer DE, Ganesh SK, Rinn J, Albert PS, Bulyk ML, Chanock SJ, Young RA, Zon LI. Common variants in signaling transcription-factor-binding sites drive phenotypic variability in red blood cell traits. Nat Genet 2020; 52:1333-1345. [PMID: 33230299 DOI: 10.1038/s41588-020-00738-2] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 04/15/2019] [Accepted: 10/14/2020] [Indexed: 12/13/2022]
Abstract
Genome-wide association studies identify genomic variants associated with human traits and diseases. Most trait-associated variants are located within cell-type-specific enhancers, but the molecular mechanisms governing phenotypic variation are less well understood. Here, we show that many enhancer variants associated with red blood cell (RBC) traits map to enhancers that are co-bound by lineage-specific master transcription factors (MTFs) and signaling transcription factors (STFs) responsive to extracellular signals. The majority of enhancer variants reside on STF and not MTF motifs, perturbing DNA binding by various STFs (BMP/TGF-β-directed SMADs or WNT-induced TCFs) and affecting target gene expression. Analyses of engineered human blood cells and expression quantitative trait loci verify that disrupted STF binding leads to altered gene expression. Our results propose that the majority of the RBC-trait-associated variants that reside on transcription-factor-binding sequences fall in STF target sequences, suggesting that the phenotypic variation of RBC traits could stem from altered responsiveness to extracellular stimuli.
Affiliation(s)
- Avik Choudhuri
- Harvard Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, MA, USA.,Stem Cell Program and Division of Hematology/Oncology, Boston Children's Hospital, Boston, MA, USA
| | - Eirini Trompouki
- Stem Cell Program and Division of Hematology/Oncology, Boston Children's Hospital, Boston, MA, USA.,Department of Cellular and Molecular Immunology, Max Planck Institute of Immunobiology and Epigenetics, Freiburg, Germany.,CIBSS Centre for Integrative Biological Signaling Studies, University of Freiburg, Freiburg, Germany
| | - Brian J Abraham
- Whitehead Institute for Biomedical Research, Cambridge, MA, USA.,Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN, USA
| | - Leandro M Colli
- Division of Cancer Epidemiology & Genetics, National Cancer Institute, Bethesda, MD, USA.,Department of Medical Imaging, Hematology, and Medical Oncology, Ribeirão Preto Medical School, University of São Paulo, Ribeirão Preto, Brazil
| | - Kian Hong Kock
- Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA.,Program in Biological and Biomedical Sciences, Harvard University, Cambridge, MA, USA
| | - William Mallard
- Harvard Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, MA, USA.,The Broad Institute of the Massachusetts Institute of Technology and Harvard, Cambridge, MA, USA
| | - Min-Lee Yang
- Division of Cardiovascular Medicine, Department of Internal Medicine and Department of Human Genetics, University of Michigan, Ann Arbor, MI, USA
| | - Divya S Vinjamur
- Division of Hematology and Oncology, Boston Children's Hospital, Harvard Medical School, Boston, MA, USA
| | - Alireza Ghamari
- Division of Pediatric Hematology-Oncology, Boston Children's Hospital and Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA
| | - Audrey Sporrij
- Harvard Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, MA, USA
| | - Karen Hoi
- Harvard Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, MA, USA
| | - Barbara Hummel
- Department of Cellular and Molecular Immunology, Max Planck Institute of Immunobiology and Epigenetics, Freiburg, Germany
| | - Sonja Boatman
- Stem Cell Program and Division of Hematology/Oncology, Boston Children's Hospital, Boston, MA, USA
| | - Victoria Chan
- Harvard Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, MA, USA
| | - Sierra Tseng
- Harvard Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, MA, USA
| | - Satish K Nandakumar
- Division of Hematology and Oncology, Boston Children's Hospital, Harvard Medical School, Boston, MA, USA
| | - Song Yang
- Stem Cell Program and Division of Hematology/Oncology, Boston Children's Hospital, Boston, MA, USA
| | - Asher Lichtig
- Stem Cell Program and Division of Hematology/Oncology, Boston Children's Hospital, Boston, MA, USA
| | - Michael Superdock
- Stem Cell Program and Division of Hematology/Oncology, Boston Children's Hospital, Boston, MA, USA
| | - Seraj N Grimes
- Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA.,Summer Institute in Biomedical Informatics, Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Teresa V Bowman
- Stem Cell Program and Division of Hematology/Oncology, Boston Children's Hospital, Boston, MA, USA.,Albert Einstein College of Medicine, Bronx, NY, USA
| | - Yi Zhou
- Stem Cell Program and Division of Hematology/Oncology, Boston Children's Hospital, Boston, MA, USA
| | | | - Roby Joehanes
- Hebrew SeniorLife, Harvard Medical School, Boston, MA, USA.,Framingham Heart Study, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, MD, USA
| | - Alan B Cantor
- Division of Pediatric Hematology-Oncology, Boston Children's Hospital and Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA
| | - Daniel E Bauer
- Division of Hematology and Oncology, Boston Children's Hospital, Harvard Medical School, Boston, MA, USA
| | - Santhi K Ganesh
- Division of Cardiovascular Medicine, Department of Internal Medicine and Department of Human Genetics, University of Michigan, Ann Arbor, MI, USA
| | - John Rinn
- Harvard Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, MA, USA.,Department of Biochemistry, University of Colorado Boulder, Boulder, CO, USA
| | - Paul S Albert
- Division of Cancer Epidemiology & Genetics, National Cancer Institute, Bethesda, MD, USA
| | - Martha L Bulyk
- Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA.,Program in Biological and Biomedical Sciences, Harvard University, Cambridge, MA, USA.,The Broad Institute of the Massachusetts Institute of Technology and Harvard, Cambridge, MA, USA.,Summer Institute in Biomedical Informatics, Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.,Department of Pathology, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
| | - Stephen J Chanock
- Division of Cancer Epidemiology & Genetics, National Cancer Institute, Bethesda, MD, USA
| | - Richard A Young
- Whitehead Institute for Biomedical Research, Cambridge, MA, USA.,Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Leonard I Zon
- Harvard Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, MA, USA.,Stem Cell Program and Division of Hematology/Oncology, Children's Hospital Boston, Harvard Stem Cell Institute, Harvard Medical School and Howard Hughes Medical Institute, Boston, MA, USA.
|
25
|
Sarkar A, Yang Y, Vihinen M. Variation benchmark datasets: update, criteria, quality and applications. Database (Oxford) 2020; 2020:5710862. [PMID: 32016318 PMCID: PMC6997940 DOI: 10.1093/database/baz117] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 04/12/2019] [Revised: 06/03/2019] [Accepted: 07/01/2019] [Indexed: 02/07/2023]
Abstract
Development of new computational methods and testing their performance has to be carried out using experimental data. Only in comparison to existing knowledge can method performance be assessed. For that purpose, benchmark datasets with known and verified outcome are needed. High-quality benchmark datasets are valuable and may be difficult, laborious and time consuming to generate. VariBench and VariSNP are the two existing databases for sharing variation benchmark datasets used mainly for variation interpretation. They have been used for training and benchmarking predictors for various types of variations and their effects. VariBench was updated with 419 new datasets from 109 papers containing altogether 329 014 152 variants; however, there is plenty of redundancy between the datasets. VariBench is freely available at http://structure.bmc.lu.se/VariBench/. The contents of the datasets vary depending on information in the original source. The available datasets have been categorized into 20 groups and subgroups. There are datasets for insertions and deletions, substitutions in coding and non-coding regions, structure mapped, synonymous and benign variants. Effect-specific datasets include DNA regulatory elements, RNA splicing, and protein property for aggregation, binding free energy, disorder and stability. Then there are several datasets for molecule-specific and disease-specific applications, as well as one dataset for variation phenotype effects. Variants are often described at three molecular levels (DNA, RNA and protein) and sometimes also at the protein structural level including relevant cross references and variant descriptions. The updated VariBench facilitates development and testing of new methods and comparison of obtained performances to previously published methods. We compared the performance of the pathogenicity/tolerance predictor PON-P2 to several benchmark studies, and show that such comparisons are feasible and useful; however, there may be limitations due to a lack of provided details and shared data. Database URL: http://structure.bmc.lu.se/VariBench
Affiliation(s)
- Anasua Sarkar
- Department of Experimental Medical Science, BMC B13, Lund University, SE-22 184 Lund, Sweden
| | - Yang Yang
- School of Computer Science and Technology, Soochow University, No1. Shizi Street, Suzhou, 215006 Jiangsu, China.,Provincial Key Laboratory for Computer Information Processing Technology, No1. Shizi Street, Soochow University, Suzhou, 215006 Jiangsu, China
| | - Mauno Vihinen
- Department of Experimental Medical Science, BMC B13, Lund University, SE-22 184 Lund, Sweden
|
26
|
Lyu J, Li JJ, Su J, Peng F, Chen YE, Ge X, Li W. DORGE: Discovery of Oncogenes and tumoR suppressor genes using Genetic and Epigenetic features. Sci Adv 2020; 6:eaba6784. [PMID: 33177077 PMCID: PMC7673741 DOI: 10.1126/sciadv.aba6784] [Citation(s) in RCA: 29] [Impact Index Per Article: 5.8] [Received: 12/23/2019] [Accepted: 09/29/2020] [Indexed: 05/09/2023]
Abstract
Data-driven discovery of cancer driver genes, including tumor suppressor genes (TSGs) and oncogenes (OGs), is imperative for cancer prevention, diagnosis, and treatment. Although epigenetic alterations are important for tumor initiation and progression, most known driver genes were identified based on genetic alterations alone. Here, we developed an algorithm, DORGE (Discovery of Oncogenes and tumor suppressoR genes using Genetic and Epigenetic features), to identify TSGs and OGs by integrating comprehensive genetic and epigenetic data. DORGE identified histone modifications as strong predictors for TSGs, and it found missense mutations, super enhancers, and methylation differences as strong predictors for OGs. We extensively validated DORGE-predicted cancer driver genes using independent functional genomics data. We also found that DORGE-predicted dual-functional genes (both TSGs and OGs) are enriched at hubs in protein-protein interaction and drug-gene networks. Overall, our study has deepened the understanding of epigenetic mechanisms in tumorigenesis and revealed previously undetected cancer driver genes.
Affiliation(s)
- Jie Lyu
- Division of Computational Biomedicine, Department of Biological Chemistry, School of Medicine, University of California, Irvine, Irvine, CA 92697, USA
| | - Jingyi Jessica Li
- Department of Statistics, University of California, Los Angeles, Los Angeles, CA 90095, USA.
| | - Jianzhong Su
- Department of Molecular and Cellular Biology, Baylor College of Medicine, Houston, TX 77030, USA
| | - Fanglue Peng
- Department of Molecular and Cellular Biology, Baylor College of Medicine, Houston, TX 77030, USA
| | - Yiling Elaine Chen
- Department of Statistics, University of California, Los Angeles, Los Angeles, CA 90095, USA
| | - Xinzhou Ge
- Department of Statistics, University of California, Los Angeles, Los Angeles, CA 90095, USA
| | - Wei Li
- Division of Computational Biomedicine, Department of Biological Chemistry, School of Medicine, University of California, Irvine, Irvine, CA 92697, USA.
|
27
|
Huang D, Yi X, Zhou Y, Yao H, Xu H, Wang J, Zhang S, Nong W, Wang P, Shi L, Xuan C, Li M, Wang J, Li W, Kwan HS, Sham PC, Wang K, Li MJ. Ultrafast and scalable variant annotation and prioritization with big functional genomics data. Genome Res 2020; 30:1789-1801. [PMID: 33060171 PMCID: PMC7706736 DOI: 10.1101/gr.267997.120] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 06/28/2020] [Accepted: 09/22/2020] [Indexed: 02/06/2023]
Abstract
The advances of large-scale genomics studies have enabled compilation of cell type–specific, genome-wide DNA functional elements at high resolution. With the growing volume of functional annotation data and sequencing variants, existing variant annotation algorithms lack the efficiency and scalability to process big genomic data, particularly when annotating whole-genome sequencing variants against a huge database with billions of genomic features. Here, we develop VarNote to rapidly annotate genome-scale variants in large and complex functional annotation resources. Equipped with a novel index system and a parallel random-sweep searching algorithm, VarNote shows substantial performance improvements (two to three orders of magnitude) over existing algorithms at different scales. It supports both region-based and allele-specific annotations and introduces advanced functions for the flexible extraction of annotations. By integrating massive base-wise and context-dependent annotations in the VarNote framework, we introduce three efficient and accurate pipelines to prioritize the causal regulatory variants for common diseases, Mendelian disorders, and cancers.
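To make the idea of region-based annotation concrete, here is a minimal sketch of looking up which annotated region a variant position falls in. It is not VarNote's index system or its parallel random-sweep search, neither of which is described in the abstract; a plain binary search over sorted intervals is swapped in for illustration. The function names, the half-open coordinate convention, and the assumption that regions do not overlap are all hypothetical.

```python
from bisect import bisect_right


def build_index(regions):
    """Sort (start, end, label) regions and keep their starts for binary search.

    Assumes half-open intervals [start, end) on one chromosome that do not overlap.
    """
    ordered = sorted(regions)
    starts = [region[0] for region in ordered]
    return starts, ordered


def annotate(pos, index):
    """Return the label of the region containing pos, or None if there is none."""
    starts, ordered = index
    i = bisect_right(starts, pos) - 1  # last region starting at or before pos
    if i >= 0:
        start, end, label = ordered[i]
        if start <= pos < end:
            return label
    return None


if __name__ == "__main__":
    index = build_index([(500, 650, "promoter"), (100, 200, "enhancer")])
    for position in (150, 200, 500):
        print(position, annotate(position, index))
```

Each lookup costs O(log n) after a one-time sort. Handling overlapping features or allele-specific annotations, as the abstract describes, would need a richer structure such as an interval tree or a dedicated on-disk index.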
Affiliation(s)
- Dandan Huang
- The Province and Ministry Co-sponsored Collaborative Innovation Center for Medical Epigenetics, National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin 300070, China.,Department of Pharmacology, Tianjin Key Laboratory of Inflammation Biology, School of Basic Medical Sciences, Tianjin Medical University, Tianjin 300070, China.,Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Tianjin Medical University, Tianjin 300070, China
| | - Xianfu Yi
- School of Biomedical Engineering, Tianjin Medical University, Tianjin 300070, China
| | - Yao Zhou
- Department of Pharmacology, Tianjin Key Laboratory of Inflammation Biology, School of Basic Medical Sciences, Tianjin Medical University, Tianjin 300070, China
| | - Hongcheng Yao
- School of Biomedical Sciences, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong SAR 999077, China
| | - Hang Xu
- The Province and Ministry Co-sponsored Collaborative Innovation Center for Medical Epigenetics, National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin 300070, China.,School of Biomedical Sciences, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong SAR 999077, China
| | - Jianhua Wang
- Department of Pharmacology, Tianjin Key Laboratory of Inflammation Biology, School of Basic Medical Sciences, Tianjin Medical University, Tianjin 300070, China
| | - Shijie Zhang
- Department of Pharmacology, Tianjin Key Laboratory of Inflammation Biology, School of Basic Medical Sciences, Tianjin Medical University, Tianjin 300070, China
| | - Wenyan Nong
- School of Life Sciences, The Chinese University of Hong Kong, Hong Kong SAR 999077, China
| | - Panwen Wang
- Department of Health Sciences Research and Center for Individualized Medicine, Mayo Clinic, Scottsdale, Arizona 85259, USA
| | - Lei Shi
- Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Tianjin Medical University, Tianjin 300070, China
| | - Chenghao Xuan
- Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Tianjin Medical University, Tianjin 300070, China
| | - Miaoxin Li
- Center for Genome Research, Center for Precision Medicine, Zhongshan School of Medicine, First Affiliated Hospital, Sun Yat-Sen University, Guangzhou 510080, China
| | - Junwen Wang
- Department of Health Sciences Research and Center for Individualized Medicine, Mayo Clinic, Scottsdale, Arizona 85259, USA
| | - Weidong Li
- Department of Genetics, School of Basic Medical Sciences, Tianjin Medical University, Tianjin 300070, China
| | - Hoi Shan Kwan
- School of Life Sciences, The Chinese University of Hong Kong, Hong Kong SAR 999077, China
| | - Pak Chung Sham
- Centre of Genomics Sciences, Departments of Psychiatry, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong SAR 999077, China
| | - Kai Wang
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, Pennsylvania 19104, USA
| | - Mulin Jun Li
- The Province and Ministry Co-sponsored Collaborative Innovation Center for Medical Epigenetics, National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin 300070, China.,Department of Pharmacology, Tianjin Key Laboratory of Inflammation Biology, School of Basic Medical Sciences, Tianjin Medical University, Tianjin 300070, China.,Department of Epidemiology and Biostatistics, Tianjin Key Laboratory of Molecular Cancer Epidemiology, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin 300070, China
|
28
|
Evaluating the informativeness of deep learning annotations for human complex diseases. Nat Commun 2020; 11:4703. [PMID: 32943643 PMCID: PMC7499261 DOI: 10.1038/s41467-020-18515-4] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 04/16/2020] [Accepted: 08/25/2020] [Indexed: 12/12/2022] Open
Abstract
Deep learning models have shown great promise in predicting regulatory effects from DNA sequence, but their informativeness for human complex diseases is not fully understood. Here, we evaluate genome-wide SNP annotations from two previous deep learning models, DeepSEA and Basenji, by applying stratified LD score regression to 41 diseases and traits (average N = 320K), conditioning on a broad set of coding, conserved and regulatory annotations. We aggregated annotations across all (respectively blood or brain) tissues/cell-types in meta-analyses across all (respectively 11 blood or 8 brain) traits. The annotations were highly enriched for disease heritability, but produced only limited conditionally significant results: non-tissue-specific and brain-specific Basenji-H3K4me3 for all traits and brain traits respectively. We conclude that deep learning models have yet to achieve their full potential to provide considerable unique information for complex disease, and that their conditional informativeness for disease cannot be inferred from their accuracy in predicting regulatory annotations. Deep learning models have shown great promise in predicting regulatory effects from DNA sequence. Here the authors evaluate sequence-based epigenomic deep learning models and conclude that these models are not yet ready to inform our knowledge of human disease.
|
29
|
van der Lee R, Correard S, Wasserman WW. Deregulated Regulators: Disease-Causing cis Variants in Transcription Factor Genes. Trends Genet 2020; 36:523-539. [DOI: 10.1016/j.tig.2020.04.006] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 02/21/2020] [Revised: 04/15/2020] [Accepted: 04/16/2020] [Indexed: 12/12/2022]
|
30
|
Zhang S, He Y, Liu H, Zhai H, Huang D, Yi X, Dong X, Wang Z, Zhao K, Zhou Y, Wang J, Yao H, Xu H, Yang Z, Sham PC, Chen K, Li MJ. regBase: whole genome base-wise aggregation and functional prediction for human non-coding regulatory variants. Nucleic Acids Res 2020; 47:e134. [PMID: 31511901 PMCID: PMC6868349 DOI: 10.1093/nar/gkz774] [Citation(s) in RCA: 39] [Impact Index Per Article: 7.8] [Received: 08/01/2019] [Accepted: 08/29/2019] [Indexed: 12/19/2022] Open
Abstract
Predicting the functional or pathogenic regulatory variants in the human non-coding genome facilitates the interpretation of disease causation. While numerous prediction methods are available, their performance is inconsistent or restricted to specific tasks, which creates demand for a comprehensive integration of those methods. Here, we compile whole genome base-wise aggregations, regBase, that incorporate largest prediction scores. Building on different assumptions of causality, we train three composite models to score functional, pathogenic and cancer driver non-coding regulatory variants respectively. We demonstrate the superior and stable performance of our models using independent benchmarks and show great success in fine-mapping causal regulatory variants at specific loci or at base-wise resolution. We believe that the regBase database, together with the three composite models, will be useful in different areas of human genetic studies, such as annotation-based causal variant fine-mapping, pathogenic variant discovery and cancer driver mutation identification. regBase is freely available at https://github.com/mulinlab/regBase.
Affiliation(s)
- Shijie Zhang
- Department of Pharmacology, School of Basic Medical Sciences, Tianjin Key Laboratory of Inflammation Biology, National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin, China
| | - Yukun He
- Department of Pharmacology, School of Basic Medical Sciences, Tianjin Key Laboratory of Inflammation Biology, National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin, China
| | - Huanhuan Liu
- Department of Pharmacology, School of Basic Medical Sciences, Tianjin Key Laboratory of Inflammation Biology, National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin, China
- Haoyu Zhai
- Department of Computer Science, University of Illinois Urbana-Champaign, IL, USA
- Dandan Huang
- Department of Pharmacology, School of Basic Medical Sciences, Tianjin Key Laboratory of Inflammation Biology, National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin, China.,Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Tianjin Medical University, Tianjin, China
- Xianfu Yi
- School of Biomedical Engineering, Tianjin Medical University, Tianjin, China
- Xiaobao Dong
- Department of Genetics, School of Basic Medical Sciences, Tianjin Medical University, Tianjin, China
- Zhao Wang
- Department of Pharmacology, School of Basic Medical Sciences, Tianjin Key Laboratory of Inflammation Biology, National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin, China
- Ke Zhao
- Department of Pharmacology, School of Basic Medical Sciences, Tianjin Key Laboratory of Inflammation Biology, National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin, China
- Yao Zhou
- Department of Pharmacology, School of Basic Medical Sciences, Tianjin Key Laboratory of Inflammation Biology, National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin, China
- Jianhua Wang
- Department of Pharmacology, School of Basic Medical Sciences, Tianjin Key Laboratory of Inflammation Biology, National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin, China
- Hongcheng Yao
- School of Biomedical Sciences, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong, China
- Hang Xu
- School of Biomedical Sciences, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong, China
- Zhenglu Yang
- College of Computer Science, Nankai University, Tianjin, China
- Pak Chung Sham
- Centre of Genomics Sciences, State Key Laboratory of Brain and Cognitive Sciences, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
- Kexin Chen
- Department of Epidemiology and Biostatistics, Tianjin Key Laboratory of Molecular Cancer Epidemiology, National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin, China
- Mulin Jun Li
- Department of Pharmacology, School of Basic Medical Sciences, Tianjin Key Laboratory of Inflammation Biology, National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin, China.,Department of Epidemiology and Biostatistics, Tianjin Key Laboratory of Molecular Cancer Epidemiology, National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin, China
|
31
|
Petrini A, Mesiti M, Schubach M, Frasca M, Danis D, Re M, Grossi G, Cappelletti L, Castrignanò T, Robinson PN, Valentini G. parSMURF, a high-performance computing tool for the genome-wide detection of pathogenic variants. Gigascience 2020; 9:giaa052. [PMID: 32444882 PMCID: PMC7244787 DOI: 10.1093/gigascience/giaa052] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 04/13/2019] [Revised: 10/31/2019] [Accepted: 04/28/2020] [Indexed: 12/11/2022] Open
Abstract
BACKGROUND Several prediction problems in computational biology and genomic medicine are characterized by both big data and a high imbalance between examples to be learned, whereby positive examples can represent a tiny minority with respect to negative examples. For instance, deleterious or pathogenic variants are overwhelmed by the sea of neutral variants in the non-coding regions of the genome: thus, the prediction of deleterious variants is a challenging, highly imbalanced classification problem, and classical prediction tools fail to detect the rare pathogenic examples among the huge amount of neutral variants or undergo severe restrictions in managing big genomic data.
RESULTS To overcome these limitations we propose parSMURF, a method that adopts a hyper-ensemble approach and oversampling and undersampling techniques to deal with imbalanced data, and parallel computational techniques to both manage big genomic data and substantially speed up the computation. The synergy between Bayesian optimization techniques and the parallel nature of parSMURF enables efficient and user-friendly automatic tuning of the hyper-parameters of the algorithm, and allows specific learning problems in genomic medicine to be easily fit. Moreover, by using MPI parallel and machine learning ensemble techniques, parSMURF can manage big data by partitioning them across the nodes of a high-performance computing cluster. Results with synthetic data and with single-nucleotide variants associated with Mendelian diseases and with genome-wide association study hits in the non-coding regions of the human genome, involving millions of examples, show that parSMURF achieves state-of-the-art results and an 80-fold speed-up with respect to the sequential version.
CONCLUSIONS parSMURF is a parallel machine learning tool that can be trained to learn different genomic problems, and its multiple levels of parallelization and high scalability allow us to efficiently fit problems characterized by big and imbalanced genomic data. The C++ OpenMP multi-core version tailored to a single workstation and the C++ MPI/OpenMP hybrid multi-core and multi-node parSMURF version tailored to a High Performance Computing cluster are both available at https://github.com/AnacletoLAB/parSMURF.
Affiliation(s)
- Alessandro Petrini
- Università degli Studi di Milano, AnacletoLab - Dipartimento di Informatica, via Giovanni Celoria 18, 20135 Milano, Italy
- Marco Mesiti
- Università degli Studi di Milano, AnacletoLab - Dipartimento di Informatica, via Giovanni Celoria 18, 20135 Milano, Italy
- Max Schubach
- Berlin Institute of Health (BIH), Anna-Louisa-Karsch-Straße 2, 10178 Berlin, Germany
- Charité – Universitätsmedizin Berlin, Chariteplatz 1, 10117 Berlin, Germany
- Marco Frasca
- Università degli Studi di Milano, AnacletoLab - Dipartimento di Informatica, via Giovanni Celoria 18, 20135 Milano, Italy
- Daniel Danis
- The Jackson Laboratory for Genomic Medicine, 10 Discovery Drive, Farmington (CT) - 06032, United States of America
- Matteo Re
- Università degli Studi di Milano, AnacletoLab - Dipartimento di Informatica, via Giovanni Celoria 18, 20135 Milano, Italy
- Giuliano Grossi
- Università degli Studi di Milano, AnacletoLab - Dipartimento di Informatica, via Giovanni Celoria 18, 20135 Milano, Italy
- Luca Cappelletti
- Università degli Studi di Milano, AnacletoLab - Dipartimento di Informatica, via Giovanni Celoria 18, 20135 Milano, Italy
- Tiziana Castrignanò
- CINECA, SCAI SuperComputing Applications and Innovation Department, Via dei Tizii 6, 00185 Roma, Italy
- University of Tuscia, Department of Ecological and Biological Sciences (DEB), Largo dell'Università snc, 01100 Viterbo, Italy
- Peter N Robinson
- The Jackson Laboratory for Genomic Medicine, 10 Discovery Drive, Farmington (CT) - 06032, United States of America
- Giorgio Valentini
- Università degli Studi di Milano, AnacletoLab - Dipartimento di Informatica, via Giovanni Celoria 18, 20135 Milano, Italy
- CINI National Laboratory in Artificial Intelligence and Intelligent Systems - AIIS, Università di Roma, Via Ariosto 25, 00185 Roma, Italy
|
32
|
Xu D, Wang C, Kiryluk K, Buxbaum JD, Ionita-Laza I. Co-localization between Sequence Constraint and Epigenomic Information Improves Interpretation of Whole-Genome Sequencing Data. Am J Hum Genet 2020; 106:513-524. [PMID: 32243819 PMCID: PMC7118583 DOI: 10.1016/j.ajhg.2020.03.003] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 12/04/2019] [Accepted: 03/09/2020] [Indexed: 10/24/2022] Open
Abstract
The identification of functional regions in the noncoding human genome is difficult but critical in order to gain understanding of the role noncoding variation plays in gene regulation in human health and disease. We describe here a co-localization approach that aims to identify constrained sequences that co-localize with tissue- or cell-type-specific regulatory regions, and we show that the resulting score is particularly well suited for the identification of rare regulatory variants. For 127 tissues and cell types in the ENCODE/Roadmap Epigenomics Project, we provide catalogs of putative tissue- or cell-type-specific regulatory regions under sequence constraint. We use the newly developed co-localization score for brain tissues to score de novo mutations in whole genomes from 1,902 individuals affected with autism spectrum disorder (ASD) and their unaffected siblings in the Simons Simplex Collection. We show that noncoding de novo mutations near genes co-expressed in midfetal brain with high confidence ASD risk genes, and near FMRP gene targets are more likely to be in co-localized regions if they occur in ASD probands versus in their unaffected siblings. We also observed a similar enrichment for mutations near lincRNAs, previously shown to co-express with ASD risk genes. Additionally, we provide strong evidence that prioritized de novo mutations in autism probands point to a small set of well-known ASD genes, the disruption of which produces relevant mouse phenotypes such as abnormal social investigation and abnormal discrimination/associative learning, unlike the de novo mutations in unaffected siblings. The genome-wide co-localization results are available online.
Affiliation(s)
- Danqing Xu
- Department of Biostatistics, Columbia University, New York, NY 10032, USA
- Chen Wang
- Department of Biostatistics, Columbia University, New York, NY 10032, USA; Department of Medicine, Columbia University, New York, NY 10032, USA
- Krzysztof Kiryluk
- Department of Medicine, Columbia University, New York, NY 10032, USA
- Joseph D Buxbaum
- Departments of Psychiatry, Neuroscience, and Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; Friedman Brain Institute and Mindich Child Health and Development Institute, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
|
33
|
Regulatory genome variants in human susceptibility to infection. Hum Genet 2019; 139:759-768. [PMID: 31807864 DOI: 10.1007/s00439-019-02091-9] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 05/04/2019] [Accepted: 11/18/2019] [Indexed: 12/20/2022]
Abstract
Genome studies have accelerated the discovery of common and rare genetic variants associated with susceptibility to infection and with disease severity. Genome-wide association studies identified many common genetic variants associated with modest risk for infection. Over 80% of these common variants map to the non-coding genome and are thought to modulate the regulatory networks. Exome sequencing has rapidly expanded the number of recognized primary immunodeficiencies through the identification of rare coding variants. In contrast, less than 29 primary immunodeficiencies have causative rare variation mapped outside protein-coding regions. In the future, whole genome sequencing will accelerate the identification of rare variants of substantial phenotypic impact that disrupt essential regulatory elements and the three-dimensional structure of chromatin.
|
34
|
Wells A, Heckerman D, Torkamani A, Yin L, Sebat J, Ren B, Telenti A, di Iulio J. Ranking of non-coding pathogenic variants and putative essential regions of the human genome. Nat Commun 2019; 10:5241. [PMID: 31748530 PMCID: PMC6868241 DOI: 10.1038/s41467-019-13212-3] [Citation(s) in RCA: 49] [Impact Index Per Article: 8.2] [Received: 05/30/2019] [Accepted: 10/28/2019] [Indexed: 12/20/2022] Open
Abstract
A gene is considered essential if loss of function results in loss of viability, fitness or in disease. This concept is well established for coding genes; however, non-coding regions are thought less likely to be determinants of critical functions. Here we train a machine learning model using functional, mutational and structural features, including new genome essentiality metrics, 3D genome organization and enhancer reporter data to identify deleterious variants in non-coding regions. We assess the model for functional correlates by using data from tiling-deletion-based and CRISPR interference screens of activity of cis-regulatory elements in over 3 Mb of genome sequence. Finally, we explore two user cases that involve indels and the disruption of enhancers associated with a developmental disease. We rank variants in the non-coding genome according to their predicted deleteriousness. The model prioritizes non-coding regions associated with regulation of important genes and with cell viability, an in vitro surrogate of essentiality. Whole genome sequencing (WGS) holds promise to solve a subset of Mendelian disease cases for which exome sequencing did not provide a genetic diagnosis. Here, Wells et al. report a supervised machine learning model trained on functional, mutational and structural features for rank-scoring and interpreting variants in non-coding regions from WGS.
Affiliation(s)
- Alex Wells
- Stanford University, Stanford, CA, 94305, USA
- David Heckerman
- Department of Computer Sciences, University of California Los Angeles, Los Angeles, CA, 90024, USA
- Ali Torkamani
- Scripps Research Translational Institute, La Jolla, CA, 92037, USA
- Li Yin
- Scripps Research Translational Institute, La Jolla, CA, 92037, USA
- Jonathan Sebat
- Beyster Institute for Psychiatric Genomics, Department of Psychiatry, University of California San Diego, La Jolla, CA, 92093, USA.,Department of Cellular and Molecular Medicine, University of California San Diego, La Jolla, CA, 92093, USA.,Department of Pediatrics, University of California San Diego, La Jolla, CA, 92093, USA
- Bing Ren
- Ludwig Institute for Cancer Research, La Jolla, CA, 92093, USA
- Amalio Telenti
- Scripps Research Translational Institute, La Jolla, CA, 92037, USA.,Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA.,Vir Biotechnology, Inc., San Francisco, CA, 94158, USA
- Julia di Iulio
- Scripps Research Translational Institute, La Jolla, CA, 92037, USA.,Vir Biotechnology, Inc., San Francisco, CA, 94158, USA
|