1
Zhou Y, Cosentino J, Yun T, Biradar MI, Shreibati J, Lai D, Schwantes-An TH, Luben R, McCaw Z, Engmann J, Providencia R, Schmidt AF, Munroe P, Yang H, Carroll A, Khawaja AP, McLean CY, Behsaz B, Hormozdiari F. Utilizing multimodal AI to improve genetic analyses of cardiovascular traits. medRxiv 2024:2024.03.19.24304547. [PMID: 38562791 PMCID: PMC10984061 DOI: 10.1101/2024.03.19.24304547] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2024]
Abstract
Electronic health records, biobanks, and wearable biosensors contain multiple high-dimensional clinical data (HDCD) modalities (e.g., ECG, photoplethysmography (PPG), and MRI) for each individual. Access to multimodal HDCD provides a unique opportunity for genetic studies of complex traits because different modalities relevant to a single physiological system (e.g., the circulatory system) encode complementary and overlapping information. We propose a novel multimodal deep learning method, M-REGLE, for discovering genetic associations from a joint representation of multiple complementary HDCD modalities. We showcase the effectiveness of this model by applying it to several cardiovascular modalities. M-REGLE jointly learns a lower-dimensional representation (i.e., latent factors) of multimodal HDCD using a convolutional variational autoencoder, performs genome-wide association studies (GWAS) on each latent factor, then combines the results to study the genetics of the underlying system. To validate the advantages of M-REGLE and multimodal learning, we apply it to common cardiovascular modalities (PPG and ECG) and compare its results to unimodal learning methods in which representations are learned from each data modality separately, but the downstream genetic analyses are performed on the combined unimodal representations. M-REGLE identifies 19.3% more loci on the 12-lead ECG dataset and 13.0% more loci on the ECG lead I + PPG dataset, and its genetic risk score significantly outperforms the unimodal risk score at predicting cardiac phenotypes, such as atrial fibrillation (Afib), in multiple biobanks.
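The three steps named in this abstract (joint representation, per-factor GWAS, combination) can be sketched on simulated data. This is only an illustration under assumptions that are not from the paper: PCA stands in for the convolutional variational autoencoder, the association statistic is a simple n·r² score, per-factor results are combined by summing statistics, and every function name and parameter is hypothetical.

```python
import numpy as np

def joint_latents(modalities, k):
    """Concatenate per-modality feature matrices and project onto the top-k
    principal components (a linear stand-in for the convolutional VAE)."""
    x = np.concatenate(modalities, axis=1)
    x = x - x.mean(axis=0)
    # SVD of the centered matrix; the columns of u * s are uncorrelated scores.
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    return u[:, :k] * s[:k]

def gwas_chi2(genotypes, phenotype):
    """Per-variant 1-df score statistic: n * r^2 between genotype and phenotype."""
    g = (genotypes - genotypes.mean(axis=0)) / genotypes.std(axis=0)
    y = (phenotype - phenotype.mean()) / phenotype.std()
    r = g.T @ y / len(y)
    return len(y) * r ** 2

def combined_chi2(genotypes, latents):
    """Sum the per-latent statistics for each variant. With uncorrelated
    latents the sum has k degrees of freedom under the null."""
    return sum(gwas_chi2(genotypes, latents[:, j]) for j in range(latents.shape[1]))

# Simulated cohort (assumption): variant 0 shifts every feature of both modalities.
rng = np.random.default_rng(0)
n, m = 500, 20
genotypes = rng.binomial(2, 0.3, size=(n, m)).astype(float)
signal = genotypes[:, [0]]
ecg = signal + rng.normal(size=(n, 8))  # hypothetical "ECG" features
ppg = signal + rng.normal(size=(n, 6))  # hypothetical "PPG" features

latents = joint_latents([ecg, ppg], k=4)
stats = combined_chi2(genotypes, latents)
top_variant = int(np.argmax(stats))
```

A unimodal baseline in the same spirit would call `joint_latents` once per modality and stack the resulting factors before the GWAS step; the comparison reported in the abstract is between those two designs.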
Affiliation(s)
- Mahantesh I Biradar
  - NIHR Biomedical Research Centre at Moorfields Eye Hospital & UCL Institute of Ophthalmology, London EC1V 9EL, UK
  - MRC Epidemiology Unit, University of Cambridge, Cambridge CB2 0SL, UK
- Dongbing Lai
  - Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN 46202, USA
- Tae-Hwi Schwantes-An
  - Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN 46202, USA
- Robert Luben
  - NIHR Biomedical Research Centre at Moorfields Eye Hospital & UCL Institute of Ophthalmology, London EC1V 9EL, UK
  - MRC Epidemiology Unit, University of Cambridge, Cambridge CB2 0SL, UK
- Zachary McCaw
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Jorgen Engmann
  - Center for Translational Genomics, Population Science and Experimental Medicine, Institute of Cardiovascular Science, University College London, UK
- Rui Providencia
  - Institute of Health Informatics Research, University College London, London, UK
  - Electrophysiology Department, Barts Heart Centre, St. Bartholomew's Hospital, London, UK
- Amand Floriaan Schmidt
  - Department of Cardiology, Amsterdam University Medical Centres, Amsterdam, The Netherlands
  - Institute of Cardiovascular Science, University College London, London, UK
  - Division of Heart and Lungs, University Medical Center Utrecht, Utrecht, Netherlands
- Patricia Munroe
  - William Harvey Research Institute, Barts and the London Faculty of Medicine and Dentistry, Queen Mary University of London, London, UK
- Howard Yang
  - Google Research, San Francisco CA, 94105 USA
- Anthony P Khawaja
  - NIHR Biomedical Research Centre at Moorfields Eye Hospital & UCL Institute of Ophthalmology, London EC1V 9EL, UK
  - MRC Epidemiology Unit, University of Cambridge, Cambridge CB2 0SL, UK
2
Yun T, Cosentino J, Behsaz B, McCaw ZR, Hill D, Luben R, Lai D, Bates J, Yang H, Schwantes-An TH, Zhou Y, Khawaja AP, Carroll A, Hobbs BD, Cho MH, McLean CY, Hormozdiari F. Unsupervised representation learning improves genomic discovery and risk prediction for respiratory and circulatory functions and diseases. medRxiv 2023:2023.04.28.23289285. [PMID: 37163049 PMCID: PMC10168505 DOI: 10.1101/2023.04.28.23289285] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/11/2023]
Abstract
High-dimensional clinical data are becoming more accessible in biobank-scale datasets. However, effectively utilizing high-dimensional clinical data for genetic discovery remains challenging. Here we introduce a general deep learning-based framework, REpresentation learning for Genetic discovery on Low-dimensional Embeddings (REGLE), for discovering associations between genetic variants and high-dimensional clinical data. REGLE uses convolutional variational autoencoders to compute a non-linear, low-dimensional, disentangled embedding of the data with highly heritable individual components. REGLE can incorporate expert-defined or clinical features and provides a framework to create accurate disease-specific polygenic risk scores (PRS) in datasets which have minimal expert phenotyping. We apply REGLE to both respiratory and circulatory systems: spirograms which measure lung function and photoplethysmograms (PPG) which measure blood volume changes. Genome-wide association studies on REGLE embeddings identify more genome-wide significant loci than existing methods and replicate known loci for both spirograms and PPG, demonstrating the generality of the framework. Furthermore, these embeddings are associated with overall survival. Finally, we construct a set of PRSs that improve predictive performance of asthma, chronic obstructive pulmonary disease, hypertension, and systolic blood pressure in multiple biobanks. Thus, REGLE embeddings can quantify clinically relevant features that are not currently captured in a standardized or automated way.
Affiliation(s)
- Davin Hill
  - Department of Electrical and Computer Engineering, Northeastern University, Boston, MA 94304, USA
  - Channing Division of Network Medicine, Brigham and Women’s Hospital, Boston, MA 02115, USA
- Robert Luben
  - NIHR Biomedical Research Centre at Moorfields Eye Hospital & UCL Institute of Ophthalmology, London EC1V 9EL, UK
  - MRC Epidemiology Unit, University of Cambridge, Cambridge CB2 0SL, UK
- Dongbing Lai
  - Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN 46202, USA
- John Bates
  - Verily Life Sciences, South San Francisco, CA 94080, USA
- Tae-Hwi Schwantes-An
  - Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN 46202, USA
  - Division of Cardiology, Department of Medicine, Indiana University School of Medicine, Indianapolis, IN 46202, USA
- Anthony P. Khawaja
  - NIHR Biomedical Research Centre at Moorfields Eye Hospital & UCL Institute of Ophthalmology, London EC1V 9EL, UK
  - MRC Epidemiology Unit, University of Cambridge, Cambridge CB2 0SL, UK
- Brian D. Hobbs
  - Channing Division of Network Medicine, Brigham and Women’s Hospital, Boston, MA 02115, USA
  - Division of Pulmonary and Critical Care Medicine, Brigham and Women’s Hospital, Boston, MA 02115, USA
  - Harvard Medical School, Boston, MA 02115, USA
- Michael H. Cho
  - Channing Division of Network Medicine, Brigham and Women’s Hospital, Boston, MA 02115, USA
  - Division of Pulmonary and Critical Care Medicine, Brigham and Women’s Hospital, Boston, MA 02115, USA
  - Harvard Medical School, Boston, MA 02115, USA
3
Hill D, Torop M, Masoomi A, Castaldi PJ, Silverman EK, Bodduluri S, Bhatt SP, Yun T, McLean CY, Hormozdiari F, Dy J, Cho MH, Hobbs BD. Deep Learning Utilizing Suboptimal Spirometry Data to Improve Lung Function and Mortality Prediction in the UK Biobank. medRxiv 2023:2023.04.28.23289178. [PMID: 37162978 PMCID: PMC10168495 DOI: 10.1101/2023.04.28.23289178] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/11/2023]
Abstract
Background: Spirometry measures lung function by selecting the best of multiple efforts meeting pre-specified quality control (QC), and reporting two key metrics: forced expiratory volume in 1 second (FEV1) and forced vital capacity (FVC). We hypothesize that discarded submaximal and QC-failing data meaningfully contribute to the prediction of airflow obstruction and all-cause mortality.
Methods: We evaluated volume-time spirometry data from the UK Biobank. We identified "best" spirometry efforts as those passing QC with the maximum FVC. "Discarded" efforts were either submaximal or failed QC. To create a combined representation of lung function we implemented a contrastive learning approach, Spirogram-based Contrastive Learning Framework (Spiro-CLF), which utilized all recorded volume-time curves per participant and applied different transformations (e.g. flow-volume, flow-time). In a held-out 20% testing subset we applied the Spiro-CLF representation of a participant's overall lung function to 1) binary predictions of FEV1/FVC < 0.7 and FEV1 Percent Predicted (FEV1PP) < 80%, indicative of airflow obstruction, and 2) Cox regression for all-cause mortality.
Findings: We included 940,705 volume-time curves from 352,684 UK Biobank participants with 2-3 spirometry efforts per individual (66.7% with 3 efforts) and at least one QC-passing spirometry effort. Of all spirometry efforts, 24.1% failed QC and 37.5% were submaximal. Spiro-CLF prediction of FEV1/FVC < 0.7 utilizing discarded spirometry efforts had an Area under the Receiver Operating Characteristics (AUROC) of 0.981 (0.863 for FEV1PP prediction). Incorporating discarded spirometry efforts in all-cause mortality prediction was associated with a concordance index (c-index) of 0.654, which exceeded the c-indices from FEV1 (0.590), FVC (0.559), or FEV1/FVC (0.599) from each participant's single best effort.
Interpretation: A contrastive learning model using raw spirometry curves can accurately predict lung function using submaximal and QC-failing efforts. This model also has superior prediction of all-cause mortality compared to standard lung function measurements.
Funding: MHC is supported by NIH R01HL137927, R01HL135142, HL147148, and HL089856. BDH is supported by NIH K08HL136928, U01 HL089856, and an Alpha-1 Foundation Research Grant. DH is supported by NIH 2T32HL007427-41. EKS is supported by NIH R01 HL152728, R01 HL147148, U01 HL089856, R01 HL133135, P01 HL132825, and P01 HL114501. PJC is supported by NIH R01HL124233 and R01HL147326. SPB is supported by NIH R01HL151421 and UH3HL155806. TY, FH, and CYM are employees of Google LLC.
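The two evaluation targets named in this abstract, the airflow-obstruction thresholds and the concordance index, can be written out in a few lines. This is a minimal sketch, not the study's code: the function names are hypothetical, the c-index follows the standard Harrell definition, and pairs with tied survival times are simply skipped.

```python
def airflow_obstruction(fev1, fvc, fev1_pp):
    """Apply the two binary thresholds quoted in the abstract.
    fev1 and fvc are in litres; fev1_pp is FEV1 percent predicted."""
    return {
        "ratio_below_0.7": (fev1 / fvc) < 0.7,
        "fev1pp_below_80": fev1_pp < 80.0,
    }

def concordance_index(times, events, risk):
    """Harrell's c-index. A pair (i, j) is comparable when participant i died
    (events[i] is truthy) strictly before time j. The pair is concordant when
    i also has the higher predicted risk; tied risks count as half."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored participants cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy follow-up data (assumption): four participants, the third is censored.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
c_perfect = concordance_index(times, events, [0.9, 0.7, 0.2, 0.1])
c_mixed = concordance_index(times, events, [0.9, 0.1, 0.5, 0.7])
flags = airflow_obstruction(fev1=2.0, fvc=3.2, fev1_pp=85.0)
```

A c-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported 0.654 and 0.590 to 0.599 values should be read.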
Affiliation(s)
- Davin Hill
  - Department of Electrical and Computer Engineering, Northeastern University, Boston, MA, USA
  - Channing Division of Network Medicine, Brigham and Women’s Hospital, Boston, MA, USA
- Max Torop
  - Department of Electrical and Computer Engineering, Northeastern University, Boston, MA, USA
- Aria Masoomi
  - Department of Electrical and Computer Engineering, Northeastern University, Boston, MA, USA
- Peter J. Castaldi
  - Channing Division of Network Medicine, Brigham and Women’s Hospital, Boston, MA, USA
  - Division of General Medicine and Primary Care, Brigham and Women’s Hospital, Boston, MA, USA
  - Harvard Medical School, Boston, MA, USA
- Edwin K. Silverman
  - Channing Division of Network Medicine, Brigham and Women’s Hospital, Boston, MA, USA
  - Harvard Medical School, Boston, MA, USA
  - Division of Pulmonary and Critical Care Medicine, Brigham and Women’s Hospital, Boston, MA, USA
- Sandeep Bodduluri
  - Division of Pulmonary, Allergy and Critical Care Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
- Surya P. Bhatt
  - Division of Pulmonary, Allergy and Critical Care Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
- Jennifer Dy
  - Department of Electrical and Computer Engineering, Northeastern University, Boston, MA, USA
- Michael H. Cho
  - Channing Division of Network Medicine, Brigham and Women’s Hospital, Boston, MA, USA
  - Harvard Medical School, Boston, MA, USA
  - Division of Pulmonary and Critical Care Medicine, Brigham and Women’s Hospital, Boston, MA, USA
- Brian D. Hobbs
  - Channing Division of Network Medicine, Brigham and Women’s Hospital, Boston, MA, USA
  - Harvard Medical School, Boston, MA, USA
  - Division of Pulmonary and Critical Care Medicine, Brigham and Women’s Hospital, Boston, MA, USA
4
Cosentino J, Behsaz B, Alipanahi B, McCaw ZR, Hill D, Schwantes-An TH, Lai D, Carroll A, Hobbs BD, Cho MH, McLean CY, Hormozdiari F. Inference of chronic obstructive pulmonary disease with deep learning on raw spirograms identifies new genetic loci and improves risk models. Nat Genet 2023; 55:787-795. [PMID: 37069358 DOI: 10.1038/s41588-023-01372-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/02/2022] [Accepted: 03/14/2023] [Indexed: 04/19/2023]
Abstract
Chronic obstructive pulmonary disease (COPD), the third leading cause of death worldwide, is highly heritable. While COPD is clinically defined by applying thresholds to summary measures of lung function, a quantitative liability score has more power to identify genetic signals. Here we train a deep convolutional neural network on noisy self-reported and International Classification of Diseases labels to predict COPD case-control status from high-dimensional raw spirograms and use the model's predictions as a liability score. The machine-learning-based (ML-based) liability score accurately discriminates COPD cases and controls, and predicts COPD-related hospitalization without any domain-specific knowledge. Moreover, the ML-based liability score is associated with overall survival and exacerbation events. A genome-wide association study on the ML-based liability score replicates existing COPD and lung function loci and also identifies 67 new loci. Lastly, our method provides a general framework to use ML methods and medical-record-based labels that does not require domain knowledge or expert curation to improve disease prediction and genomic discovery for drug design.
Affiliation(s)
- Davin Hill
  - Department of Electrical and Computer Engineering, Northeastern University, Boston, MA, USA
  - Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Tae-Hwi Schwantes-An
  - Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN, USA
  - Division of Cardiology, Department of Medicine, Indiana University School of Medicine, Indianapolis, IN, USA
- Dongbing Lai
  - Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN, USA
- Brian D Hobbs
  - Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, MA, USA
  - Division of Pulmonary and Critical Care Medicine, Brigham and Women's Hospital, Boston, MA, USA
  - Harvard Medical School, Boston, MA, USA
- Michael H Cho
  - Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, MA, USA
  - Division of Pulmonary and Critical Care Medicine, Brigham and Women's Hospital, Boston, MA, USA
  - Harvard Medical School, Boston, MA, USA
5
Gazal S, Weissbrod O, Hormozdiari F, Dey KK, Nasser J, Jagadeesh KA, Weiner DJ, Shi H, Fulco CP, O'Connor LJ, Pasaniuc B, Engreitz JM, Price AL. Combining SNP-to-gene linking strategies to identify disease genes and assess disease omnigenicity. Nat Genet 2022; 54:827-836. [PMID: 35668300 PMCID: PMC9894581 DOI: 10.1038/s41588-022-01087-y] [Citation(s) in RCA: 48] [Impact Index Per Article: 24.0] [Received: 07/29/2021] [Accepted: 04/27/2022] [Indexed: 02/04/2023]
Abstract
Disease-associated single-nucleotide polymorphisms (SNPs) generally do not implicate target genes, as most disease SNPs are regulatory. Many SNP-to-gene (S2G) linking strategies have been developed to link regulatory SNPs to the genes that they regulate in cis. Here, we developed a heritability-based framework for evaluating and combining different S2G strategies to optimize their informativeness for common disease risk. Our optimal combined S2G strategy (cS2G) included seven constituent S2G strategies and achieved a precision of 0.75 and a recall of 0.33, more than doubling the recall of any individual strategy. We applied cS2G to fine-mapping results for 49 UK Biobank diseases/traits to predict 5,095 causal SNP-gene-disease triplets (with S2G-derived functional interpretation) with high confidence. We further applied cS2G to provide an empirical assessment of disease omnigenicity; we determined that the top 1% of genes explained roughly half of the SNP heritability linked to all genes and that gene-level architectures vary with variant allele frequency.
Affiliation(s)
- Steven Gazal
  - Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
  - Center for Genetic Epidemiology, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Omer Weissbrod
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Farhad Hormozdiari
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Kushal K Dey
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Joseph Nasser
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Karthik A Jagadeesh
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Huwenbo Shi
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Charles P Fulco
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Department of Systems Biology, Harvard Medical School, Boston, MA, USA
  - Bristol Myers Squibb, Cambridge, MA, USA
- Bogdan Pasaniuc
  - Departments of Computational Medicine, Human Genetics, Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
- Jesse M Engreitz
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
  - BASE Initiative, Betty Irene Moore Children's Heart Center, Lucile Packard Children's Hospital, Stanford University School of Medicine, Stanford, CA, USA
- Alkes L Price
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
6
LaPierre N, Taraszka K, Huang H, He R, Hormozdiari F, Eskin E. Identifying causal variants by fine mapping across multiple studies. PLoS Genet 2021; 17:e1009733. [PMID: 34543273 PMCID: PMC8491908 DOI: 10.1371/journal.pgen.1009733] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Received: 08/31/2020] [Revised: 10/05/2021] [Accepted: 07/21/2021] [Indexed: 11/18/2022]
Abstract
Increasingly large Genome-Wide Association Studies (GWAS) have yielded numerous variants associated with many complex traits, motivating the development of "fine mapping" methods to identify which of the associated variants are causal. Additionally, GWAS of the same trait for different populations are increasingly available, raising the possibility of refining fine mapping results further by leveraging different linkage disequilibrium (LD) structures across studies. Here, we introduce multiple study causal variants identification in associated regions (MsCAVIAR), a method that extends the popular CAVIAR fine mapping framework to a multiple study setting using a random effects model. MsCAVIAR only requires summary statistics and LD as input, accounts for uncertainty in association statistics using a multivariate normal model, allows for multiple causal variants at a locus, and explicitly models the possibility of different SNP effect sizes in different populations. We demonstrate the efficacy of MsCAVIAR in both a simulation study and a trans-ethnic, trans-biobank fine mapping analysis of High Density Lipoprotein (HDL).
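The ingredients listed in this abstract (summary statistics and LD as the only inputs, a multivariate normal model, several causal variants per locus) can be illustrated with a single-study, CAVIAR-style posterior computed by brute-force enumeration. This is a hedged sketch and not MsCAVIAR itself: it omits the multi-study random-effects model entirely, and the function names and the values of `sigma2`, `gamma`, and `max_causal` are assumptions chosen for illustration.

```python
import itertools
import numpy as np

def log_mvn_zero_mean(z, cov):
    """Log density of a zero-mean multivariate normal evaluated at z."""
    _, logdet = np.linalg.slogdet(cov)
    quad = z @ np.linalg.solve(cov, z)
    return -0.5 * (len(z) * np.log(2 * np.pi) + logdet + quad)

def posterior_inclusion(z, ld, sigma2=25.0, gamma=0.01, max_causal=2):
    """Enumerate causal configurations with at most max_causal variants.
    Under configuration c, z ~ N(0, LD + sigma2 * LD diag(c) LD); each
    variant is causal a priori with probability gamma."""
    m = len(z)
    configs, log_post = [], []
    for k in range(max_causal + 1):
        for causal in itertools.combinations(range(m), k):
            c = np.zeros(m)
            for i in causal:
                c[i] = 1.0
            cov = ld + sigma2 * ld @ np.diag(c) @ ld
            log_prior = k * np.log(gamma) + (m - k) * np.log(1.0 - gamma)
            configs.append(c)
            log_post.append(log_mvn_zero_mean(z, cov) + log_prior)
    log_post = np.array(log_post)
    post = np.exp(log_post - log_post.max())  # subtract max for stability
    post /= post.sum()
    pip = post @ np.array(configs)  # per-variant posterior inclusion probability
    return pip, post

# Toy locus (assumption): variants 0 and 1 are in LD, variant 2 is independent.
ld = np.array([[1.0, 0.5, 0.0],
               [0.5, 1.0, 0.0],
               [0.0, 0.0, 1.0]])
z = np.array([5.0, 2.5, 0.3])
pip, post = posterior_inclusion(z, ld)
```

Here the z-score at variant 1 is exactly what LD with variant 0 would produce, so the posterior concentrates on variant 0. A multi-study method can sharpen such results further when the LD between the two variants differs across populations.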
Affiliation(s)
- Nathan LaPierre
  - Department of Computer Science, University of California, Los Angeles, California, United States
- Kodi Taraszka
  - Department of Computer Science, University of California, Los Angeles, California, United States
- Helen Huang
  - Bioinformatics Interdepartmental Program, University of California, Los Angeles, California, United States
- Rosemary He
  - Department of Mathematics, University of California, Los Angeles, California, United States
- Farhad Hormozdiari
  - Harvard T.H. Chan School of Public Health, Boston, Massachusetts, United States
- Eleazar Eskin
  - Department of Computer Science, University of California, Los Angeles, California, United States
  - Department of Human Genetics, University of California, Los Angeles, California, United States
  - Department of Computational Medicine, University of California, Los Angeles, California, United States
7
Alipanahi B, Hormozdiari F, Behsaz B, Cosentino J, McCaw ZR, Schorsch E, Sculley D, Dorfman EH, Foster PJ, Peng LH, Phene S, Hammel N, Carroll A, Khawaja AP, McLean CY. Large-scale machine-learning-based phenotyping significantly improves genomic discovery for optic nerve head morphology. Am J Hum Genet 2021; 108:1217-1230. [PMID: 34077760 PMCID: PMC8322934 DOI: 10.1016/j.ajhg.2021.05.004] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Received: 12/07/2020] [Accepted: 05/10/2021] [Indexed: 02/06/2023]
Abstract
Genome-wide association studies (GWASs) require accurate cohort phenotyping, but expert labeling can be costly, time intensive, and variable. Here, we develop a machine learning (ML) model to predict glaucomatous optic nerve head features from color fundus photographs. We used the model to predict vertical cup-to-disc ratio (VCDR), a diagnostic parameter and cardinal endophenotype for glaucoma, in 65,680 Europeans in the UK Biobank (UKB). A GWAS of ML-based VCDR identified 299 independent genome-wide significant (GWS; p ≤ 5 × 10⁻⁸) hits in 156 loci. The ML-based GWAS replicated 62 of 65 GWS loci from a recent VCDR GWAS in the UKB for which two ophthalmologists manually labeled images for 67,040 Europeans. The ML-based GWAS also identified 93 novel loci, significantly expanding our understanding of the genetic etiologies of glaucoma and VCDR. Pathway analyses support the biological significance of the novel hits to VCDR: select loci near genes involved in neuronal and synaptic biology or harboring variants are known to cause severe Mendelian ophthalmic disease. Finally, the ML-based GWAS results significantly improve polygenic prediction of VCDR and primary open-angle glaucoma in the independent EPIC-Norfolk cohort.
Affiliation(s)
- D Sculley
  - Google Health, Cambridge, MA 02142, USA
- Paul J Foster
  - NIHR Biomedical Research Centre at Moorfields Eye Hospital and UCL Institute of Ophthalmology, London EC1V 9EL, UK
- Anthony P Khawaja
  - NIHR Biomedical Research Centre at Moorfields Eye Hospital and UCL Institute of Ophthalmology, London EC1V 9EL, UK
  - MRC Epidemiology Unit, University of Cambridge, Cambridge CB2 0SL, UK
8
Abstract
In standard genome-wide association studies (GWAS), the standard association test is underpowered to detect associations between loci with multiple causal variants with small effect sizes. We propose a statistical method, Model-based Association test Reflecting causal Status (MARS), that finds associations between variants in risk loci and a phenotype, considering the causal status of variants, only requiring the existing summary statistics to detect associated risk loci. Utilizing extensive simulated data and real data, we show that MARS increases the power of detecting true associated risk loci compared to previous approaches that consider multiple variants, while controlling the type I error.
Affiliation(s)
- Farhad Hormozdiari
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, 02115 MA USA
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA USA
- Junghyun Jung
  - Department of Life Science, Dongguk University-Seoul, Seoul, 04620 South Korea
- Eleazar Eskin
  - Department of Computer Science, University of California, Los Angeles, Los Angeles, 90095 CA USA
  - Department of Human Genetics, University of California, Los Angeles, Los Angeles, 90095 CA USA
- Jong Wha J. Joo
  - Department of Computer Science and Engineering, Dongguk University-Seoul, Seoul, 04620 South Korea
9
Barbeira AN, Bonazzola R, Gamazon ER, Liang Y, Park Y, Kim-Hellmuth S, Wang G, Jiang Z, Zhou D, Hormozdiari F, Liu B, Rao A, Hamel AR, Pividori MD, Aguet F, Bastarache L, Jordan DM, Verbanck M, Do R, Stephens M, Ardlie K, McCarthy M, Montgomery SB, Segrè AV, Brown CD, Lappalainen T, Wen X, Im HK. Exploiting the GTEx resources to decipher the mechanisms at GWAS loci. Genome Biol 2021; 22:49. [PMID: 33499903 PMCID: PMC7836161 DOI: 10.1186/s13059-020-02252-4] [Citation(s) in RCA: 108] [Impact Index Per Article: 36.0] [Received: 06/06/2020] [Accepted: 12/18/2020] [Indexed: 12/12/2022]
Abstract
The resources generated by the GTEx consortium offer unprecedented opportunities to advance our understanding of the biology of human diseases. Here, we present an in-depth examination of the phenotypic consequences of transcriptome regulation and a blueprint for the functional interpretation of genome-wide association study-discovered loci. Across a broad set of complex traits and diseases, we demonstrate widespread dose-dependent effects of RNA expression and splicing. We develop a data-driven framework to benchmark methods that prioritize causal genes and find no single approach outperforms the combination of multiple approaches. Using colocalization and association approaches that take into account the observed allelic heterogeneity of gene expression, we propose potential target genes for 47% (2519 out of 5385) of the GWAS loci examined.
Affiliation(s)
- Alvaro N Barbeira
- Section of Genetic Medicine, Department of Medicine, The University of Chicago, Chicago, IL, USA
| | - Rodrigo Bonazzola
- Section of Genetic Medicine, Department of Medicine, The University of Chicago, Chicago, IL, USA
| | - Eric R Gamazon
- Division of Genetic Medicine, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
- Data Science Institute, Vanderbilt University, Nashville, TN, USA
- Clare Hall, University of Cambridge, Cambridge, UK
- MRC Epidemiology Unit, University of Cambridge, Cambridge, UK
| | - Yanyu Liang
- Section of Genetic Medicine, Department of Medicine, The University of Chicago, Chicago, IL, USA
| | - YoSon Park
- Department of Genetics, University of Pennsylvania, Perelman School of Medicine, Philadelphia, PA, USA
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Perelman School of Medicine, Philadelphia, PA, USA
| | - Sarah Kim-Hellmuth
- Statistical Genetics, Max Planck Institute of Psychiatry, Munich, Germany
- New York Genome Center, New York, NY, USA
- Department of Systems Biology, Columbia University, New York, NY, USA
| | - Gao Wang
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
| | - Zhuoxun Jiang
- Section of Genetic Medicine, Department of Medicine, The University of Chicago, Chicago, IL, USA
| | - Dan Zhou
- Division of Genetic Medicine, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
| | - Farhad Hormozdiari
- The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Boxiang Liu
- Department of Biology, Stanford University, Stanford, 94305, CA, USA
| | - Abhiram Rao
- Department of Biology, Stanford University, Stanford, 94305, CA, USA
| | - Andrew R Hamel
- The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Ocular Genomics Institute, Massachusetts Eye and Ear, Harvard Medical School, Boston, MA, USA
| | - Milton D Pividori
- Section of Genetic Medicine, Department of Medicine, The University of Chicago, Chicago, IL, USA
| | - François Aguet
- The Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Lisa Bastarache
- Department of Biomedical Informatics, Department of Medicine, Vanderbilt University, Nashville, TN, USA
- Center for Human Genetics Research, Department of Molecular Physiology and Biophysics, Vanderbilt University School of Medicine, Nashville, TN, USA
| | - Daniel M Jordan
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Marie Verbanck
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Université de Paris - EA 7537 BIOSTM, Paris, France
| | - Ron Do
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Matthew Stephens
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
| | - Kristin Ardlie
- The Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Stephen B Montgomery
- Department of Genetics, Stanford University, Stanford, CA, USA
- Department of Pathology, Stanford University, Stanford, CA, USA
| | - Ayellet V Segrè
- The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Ocular Genomics Institute, Massachusetts Eye and Ear, Harvard Medical School, Boston, MA, USA
| | - Christopher D Brown
- Department of Genetics, University of Pennsylvania, Perelman School of Medicine, Philadelphia, PA, USA
| | - Tuuli Lappalainen
- New York Genome Center, New York, NY, USA
- Department of Systems Biology, Columbia University, New York, NY, USA
| | - Xiaoquan Wen
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
| | - Hae Kyung Im
- Section of Genetic Medicine, Department of Medicine, The University of Chicago, Chicago, IL, USA.
|
10
|
Weissbrod O, Hormozdiari F, Benner C, Cui R, Ulirsch J, Gazal S, Schoech AP, van de Geijn B, Reshef Y, Márquez-Luna C, O'Connor L, Pirinen M, Finucane HK, Price AL. Functionally informed fine-mapping and polygenic localization of complex trait heritability. Nat Genet 2020; 52:1355-1363. [PMID: 33199916 PMCID: PMC7710571 DOI: 10.1038/s41588-020-00735-5] [Citation(s) in RCA: 136] [Impact Index Per Article: 34.0] [Received: 10/28/2019] [Accepted: 10/02/2020] [Indexed: 01/16/2023]
Abstract
Fine-mapping aims to identify causal variants impacting complex traits. We propose PolyFun, a computationally scalable framework to improve fine-mapping accuracy by leveraging functional annotations across the entire genome (not just genome-wide-significant loci) to specify prior probabilities for fine-mapping methods such as SuSiE or FINEMAP. In simulations, PolyFun + SuSiE and PolyFun + FINEMAP were well calibrated and identified >20% more variants with a posterior causal probability >0.95 than their non-functionally informed counterparts. In analyses of 49 UK Biobank traits (average n = 318,000), PolyFun + SuSiE identified 3,025 fine-mapped variant-trait pairs with posterior causal probability >0.95, a >32% improvement versus SuSiE. We used posterior mean per-SNP heritabilities from PolyFun + SuSiE to perform polygenic localization, constructing minimal sets of common SNPs causally explaining 50% of common SNP heritability; these sets ranged in size from 28 (hair color) to 3,400 (height) to 2 million (number of children). In conclusion, PolyFun prioritizes variants for functional follow-up and provides insights into complex trait architectures.
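To illustrate the polygenic localization idea in the abstract above, the following Python sketch shows one plausible way to build a minimal set of SNPs that reaches a target share of total heritability: sort the per-SNP heritabilities in decreasing order and accumulate until the target is met. The function name `minimal_snp_set` and the toy values are hypothetical, and this is not PolyFun's implementation.

```python
def minimal_snp_set(per_snp_h2, fraction=0.5):
    """Return indices of a smallest set of SNPs whose summed per-SNP
    heritability reaches `fraction` of the total (greedy, largest first)."""
    target = fraction * sum(per_snp_h2)
    # Taking the largest contributions first minimizes the set size.
    order = sorted(range(len(per_snp_h2)),
                   key=lambda i: per_snp_h2[i], reverse=True)
    chosen, running = [], 0.0
    for i in order:
        if running >= target:
            break
        chosen.append(i)
        running += per_snp_h2[i]
    return chosen


# Toy per-SNP heritabilities (illustrative values only).
toy_h2 = [0.05, 0.30, 0.10, 0.25, 0.30]
selected = minimal_snp_set(toy_h2, fraction=0.5)
```

Under this reading, a trait whose heritability is concentrated in a few SNPs yields a small set, while a highly polygenic trait yields a very large one.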
Affiliation(s)
- Omer Weissbrod
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA.
| | - Farhad Hormozdiari
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Christian Benner
- Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland
| | - Ran Cui
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Jacob Ulirsch
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Program in Biological and Biomedical Sciences, Harvard Medical School, Cambridge, MA, USA
| | - Steven Gazal
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Armin P Schoech
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Bryce van de Geijn
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Yakir Reshef
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Carla Márquez-Luna
- The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Luke O'Connor
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Matti Pirinen
- Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland
- Department of Public Health, University of Helsinki, Helsinki, Finland
- Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland
| | - Hilary K Finucane
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Department of Medicine, Massachusetts General Hospital, Boston, MA, USA
| | - Alkes L Price
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA.
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
|
11
|
Gay NR, Gloudemans M, Antonio ML, Abell NS, Balliu B, Park Y, Martin AR, Musharoff S, Rao AS, Aguet F, Barbeira AN, Bonazzola R, Hormozdiari F, Ardlie KG, Brown CD, Im HK, Lappalainen T, Wen X, Montgomery SB. Impact of admixture and ancestry on eQTL analysis and GWAS colocalization in GTEx. Genome Biol 2020; 21:233. [PMID: 32912333 PMCID: PMC7488497 DOI: 10.1186/s13059-020-02113-0] [Citation(s) in RCA: 47] [Impact Index Per Article: 11.8] [Received: 11/08/2019] [Accepted: 07/19/2020] [Indexed: 02/06/2023]
Abstract
BACKGROUND: Population structure among study subjects may confound genetic association studies, and lack of proper correction can lead to spurious findings. The Genotype-Tissue Expression (GTEx) project largely contains individuals of European ancestry, but the v8 release also includes up to 15% of individuals of non-European ancestry. Assessing ancestry-based adjustments in GTEx improves portability of this research across populations and further characterizes the impact of population structure on GWAS colocalization.
RESULTS: Here, we identify a subset of 117 individuals in GTEx (v8) with a high degree of population admixture and estimate genome-wide local ancestry. We perform genome-wide cis-eQTL mapping using admixed samples in seven tissues, adjusted by either global or local ancestry. Consistent with previous work, we observe improved power with local ancestry adjustment. At loci where the two adjustments produce different lead variants, we observe 31 loci (0.02%) where a significant colocalization is called only with one eQTL ancestry adjustment method. Notably, both adjustments produce similar numbers of significant colocalizations within each of two different colocalization methods, COLOC and FINEMAP. Finally, we identify a small subset of eQTL-associated variants highly correlated with local ancestry, providing a resource to enhance functional follow-up.
CONCLUSIONS: We provide a local ancestry map for admixed individuals in the GTEx v8 release and describe the impact of ancestry and admixture on gene expression, eQTLs, and GWAS colocalization. While the majority of the results are concordant between local and global ancestry-based adjustments, we identify distinct advantages and disadvantages to each approach.
Affiliation(s)
- Nicole R. Gay
- Department of Genetics, Stanford University, Stanford, CA USA
| | - Nathan S. Abell
- Department of Genetics, Stanford University, Stanford, CA USA
| | - Brunilda Balliu
- Department of Biomathematics, University of California, Los Angeles, Los Angeles, CA USA
| | - YoSon Park
- Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA USA
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA USA
| | - Alicia R. Martin
- Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA USA
- Stanley Center for Psychiatric Research, Broad Institute, Cambridge, MA USA
| | - Abhiram S. Rao
- Department of Bioengineering, Stanford University, Stanford, CA USA
| | - François Aguet
- The Broad Institute of MIT and Harvard, Cambridge, MA USA
| | - Alvaro N. Barbeira
- Section of Genetic Medicine, Department of Medicine, The University of Chicago, Chicago, IL USA
| | - Rodrigo Bonazzola
- Section of Genetic Medicine, Department of Medicine, The University of Chicago, Chicago, IL USA
| | - Farhad Hormozdiari
- The Broad Institute of MIT and Harvard, Cambridge, MA USA
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA USA
| | - GTEx Consortium
- Department of Genetics, Stanford University, Stanford, CA USA
- Biomedical Informatics, Stanford University, Stanford, CA USA
- Department of Biomathematics, University of California, Los Angeles, Los Angeles, CA USA
- Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA USA
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA USA
- Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA USA
- Stanley Center for Psychiatric Research, Broad Institute, Cambridge, MA USA
- Department of Bioengineering, Stanford University, Stanford, CA USA
- The Broad Institute of MIT and Harvard, Cambridge, MA USA
- Section of Genetic Medicine, Department of Medicine, The University of Chicago, Chicago, IL USA
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA USA
- New York Genome Center, New York, NY USA
- Department of Systems Biology, Columbia University, New York, NY USA
- Department of Biostatistics, University of Michigan, Ann Arbor, MI USA
- Department of Pathology, Stanford University, Stanford, CA USA
| | - Christopher D. Brown
- Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA USA
| | - Hae Kyung Im
- Section of Genetic Medicine, Department of Medicine, The University of Chicago, Chicago, IL USA
| | - Tuuli Lappalainen
- New York Genome Center, New York, NY USA
- Department of Systems Biology, Columbia University, New York, NY USA
| | - Xiaoquan Wen
- Department of Biostatistics, University of Michigan, Ann Arbor, MI USA
| | - Stephen B. Montgomery
- Department of Genetics, Stanford University, Stanford, CA USA
- Department of Pathology, Stanford University, Stanford, CA USA
|
12
|
van de Geijn B, Finucane H, Gazal S, Hormozdiari F, Amariuta T, Liu X, Gusev A, Loh PR, Reshef Y, Kichaev G, Raychauduri S, Price AL. Annotations capturing cell type-specific TF binding explain a large fraction of disease heritability. Hum Mol Genet 2020; 29:1057-1067. [PMID: 31595288 PMCID: PMC7206853 DOI: 10.1093/hmg/ddz226] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 05/31/2019] [Revised: 08/12/2019] [Accepted: 09/10/2019] [Indexed: 12/21/2022]
Abstract
Regulatory variation plays a major role in complex disease, and cell type-specific binding of transcription factors (TF) is critical to gene regulation. However, assessing the contribution of genetic variation in TF-binding sites to disease heritability is challenging, as binding is often cell type-specific and annotations from directly measured TF binding are not currently available for most cell type-TF pairs. We investigate approaches to annotate TF binding, including directly measured chromatin data and sequence-based predictions. We find that TF-binding annotations constructed by intersecting sequence-based TF-binding predictions with cell type-specific chromatin data explain a large fraction of heritability across a broad set of diseases and corresponding cell types; this strategy of constructing annotations addresses both the limitation that identical sequences may be bound or unbound depending on surrounding chromatin context and the limitation that sequence-based predictions are generally not cell type-specific. We partitioned the heritability of 49 diseases and complex traits using stratified linkage disequilibrium (LD) score regression with the baseline-LD model (which is not cell type-specific) plus the new annotations. We determined that 100 bp windows around MotifMap sequence-based TF-binding predictions intersected with a union of six cell type-specific chromatin marks (imputed using ChromImpute) performed best, with a 58% increase in heritability enrichment compared to the chromatin marks alone (11.6× vs. 7.3×, P = 9 × 10^-14 for difference) and a 20% increase in cell type-specific signal conditional on annotations from the baseline-LD model (P = 8 × 10^-11 for difference). Our results show that TF-binding annotations explain substantial disease heritability and can help refine genome-wide association signals.
Affiliation(s)
- Bryce van de Geijn
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston 02115, MA, USA
| | - Hilary Finucane
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Steven Gazal
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston 02115, MA, USA
| | - Farhad Hormozdiari
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston 02115, MA, USA
| | - Tiffany Amariuta
- Center for Data Sciences, Harvard Medical School, Boston, MA 02215, USA
- Divisions of Genetics, Rheumatology, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02215, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02215, USA
- Graduate School of Arts and Sciences, Harvard University, Boston, MA 02215, USA
| | - Xuanyao Liu
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston 02115, MA, USA
| | - Po-Ru Loh
- Brigham and Women’s Hospital, Boston, MA 02215, USA
| | - Yakir Reshef
- Department of Computer Science, Harvard University, Cambridge, MA 02138, USA
- Harvard/MIT MD/PhD Program, Boston, MA 02215, USA
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02215, USA
| | - Gleb Kichaev
- Bioinformatics Interdepartmental Program, University of California Los Angeles, Los Angeles, CA 90095, USA
| | - Soumya Raychauduri
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston 02115, MA, USA
- Center for Data Sciences, Harvard Medical School, Boston, MA 02215, USA
- Divisions of Genetics, Rheumatology, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02215, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02215, USA
- Graduate School of Arts and Sciences, Harvard University, Boston, MA 02215, USA
| | - Alkes L Price
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston 02115, MA, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02215, USA
|
13
|
Zou J, Hormozdiari F, Jew B, Castel SE, Lappalainen T, Ernst J, Sul JH, Eskin E. Leveraging allelic imbalance to refine fine-mapping for eQTL studies. PLoS Genet 2019; 15:e1008481. [PMID: 31834882 PMCID: PMC6952111 DOI: 10.1371/journal.pgen.1008481] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 02/20/2019] [Revised: 01/09/2020] [Accepted: 10/15/2019] [Indexed: 11/18/2022]
Abstract
Many disease risk loci identified in genome-wide association studies are present in non-coding regions of the genome. Previous studies have found enrichment of expression quantitative trait loci (eQTLs) in disease risk loci, indicating that identifying causal variants for gene expression is important for elucidating the genetic basis of not only gene expression but also complex traits. However, detecting causal variants is challenging due to complex genetic correlation among variants known as linkage disequilibrium (LD) and the presence of multiple causal variants within a locus. Although several fine-mapping approaches have been developed to overcome these challenges, they may produce large sets of putative causal variants when true causal variants are in high LD with many non-causal variants. In eQTL studies, there is an additional source of information that can be used to improve fine-mapping called allelic imbalance (AIM) that measures imbalance in gene expression on two chromosomes of a diploid organism. In this work, we develop a novel statistical method that leverages both AIM and total expression data to detect causal variants that regulate gene expression. We illustrate through simulations and application to 10 tissues of the Genotype-Tissue Expression (GTEx) dataset that our method identifies the true causal variants with higher specificity than an approach that uses only eQTL information. Across all tissues and genes, our method achieves a median reduction rate of 11% in the number of putative causal variants. We use chromatin state data from the Roadmap Epigenomics Consortium to show that the putative causal variants identified by our method are enriched for active regions of the genome, providing orthogonal support that our method identifies causal variants with increased specificity.
In recent years, many studies have identified genetic variants that are associated with the expression of genes (eQTLs). While thousands of eQTLs have been identified, not all associated variants cause changes in gene expression. This is in part due to the complex patterns of genetic correlation in the human genome. If a region of the genome contains many genetic variants that are highly correlated with each other, non-causal genetic variants close to a causal variant are also correlated with gene expression. Statistical fine-mapping is the process of identifying true causal variants from a set of candidate variants. In regions with high genetic correlation, previous fine-mapping methods may not be able to differentiate causal variants from nearby variants. We propose a method that utilizes a complementary source of information called allelic imbalance (AIM). We show that by combining eQTL and AIM data, we can identify the true causal variants more efficiently and substantially decrease the number of putative causal variants for downstream analysis.
Affiliation(s)
- Jennifer Zou
- Computer Science Department, University of California Los Angeles, Los Angeles, California, United States of America
| | - Farhad Hormozdiari
- Genetic Epidemiology and Statistical Genetics Program, Harvard University, Cambridge, Massachusetts, United States of America
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
| | - Brandon Jew
- Bioinformatics Interdepartmental Program, University of California Los Angeles, Los Angeles, California, United States of America
| | - Stephane E. Castel
- New York Genome Center, New York, New York, United States of America
- Department of Systems Biology, Columbia University, New York, New York, United States of America
| | - Tuuli Lappalainen
- New York Genome Center, New York, New York, United States of America
- Department of Systems Biology, Columbia University, New York, New York, United States of America
| | - Jason Ernst
- Computer Science Department, University of California Los Angeles, Los Angeles, California, United States of America
- Department of Biological Chemistry, University of California Los Angeles, Los Angeles, California, United States of America
| | - Jae Hoon Sul
- Department of Psychiatry and Biobehavioral Sciences, University of California Los Angeles, Los Angeles, California, United States of America
- * E-mail: (JHS); (EE)
| | - Eleazar Eskin
- Computer Science Department, University of California Los Angeles, Los Angeles, California, United States of America
- Department of Human Genetics, University of California Los Angeles, Los Angeles, California, United States of America
- * E-mail: (JHS); (EE)
|
14
|
Kim SS, Dai C, Hormozdiari F, van de Geijn B, Gazal S, Park Y, O’Connor L, Amariuta T, Loh PR, Finucane H, Raychaudhuri S, Price AL. Genes with High Network Connectivity Are Enriched for Disease Heritability. Am J Hum Genet 2019; 105:1302. [PMID: 31809749 DOI: 10.1016/j.ajhg.2019.11.009] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 10/25/2022]
|
15
|
Hormozdiari F, van de Geijn B, Nasser J, Weissbrod O, Gazal S, Ju CJT, O'Connor L, Hujoel MLA, Engreitz J, Hormozdiari F, Price AL. Functional disease architectures reveal unique biological role of transposable elements. Nat Commun 2019; 10:4054. [PMID: 31492842 PMCID: PMC6731302 DOI: 10.1038/s41467-019-11957-5] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 12/04/2018] [Accepted: 08/08/2019] [Indexed: 12/19/2022]
Abstract
Transposable elements (TE) comprise roughly half of the human genome. Though initially derided as junk DNA, they have been widely hypothesized to contribute to the evolution of gene regulation. However, the contribution of TE to the genetic architecture of diseases remains unknown. Here, we analyze data from 41 independent diseases and complex traits to draw three conclusions. First, TE are uniquely informative for disease heritability. Despite overall depletion for heritability (54% of SNPs, 39 ± 2% of heritability), TE explain substantially more heritability than expected based on their depletion for known functional annotations. This implies that TE acquire function in ways that differ from known functional annotations. Second, older TE contribute more to disease heritability, consistent with acquiring biological function. Third, Short Interspersed Nuclear Elements (SINE) are far more enriched for blood traits than for other traits. Our results can help elucidate the biological roles that TE play in the genetic architecture of diseases.
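The depletion quoted in the abstract above (54% of SNPs, 39% of heritability) can be read through the usual definition of heritability enrichment: the proportion of heritability explained by an annotation divided by the proportion of SNPs it contains. A minimal Python sketch of that arithmetic follows; the helper name `enrichment` is hypothetical and is not taken from the paper.

```python
def enrichment(prop_h2: float, prop_snps: float) -> float:
    """Heritability enrichment: share of heritability / share of SNPs.
    Values below 1 indicate depletion, values above 1 enrichment."""
    if prop_snps <= 0:
        raise ValueError("prop_snps must be positive")
    return prop_h2 / prop_snps


# Figures quoted in the abstract: TE cover 54% of SNPs and 39% of heritability.
te_enrichment = enrichment(0.39, 0.54)
print(f"{te_enrichment:.2f}")  # 0.72
```

A ratio of about 0.72 is below 1, which matches the abstract's description of TE as depleted for heritability overall.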
Affiliation(s)
- Farhad Hormozdiari
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Bryce van de Geijn
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Joseph Nasser
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Omer Weissbrod
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Steven Gazal
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Chelsea J-T Ju
- Department of Computer Science, University of California, Los Angeles, CA, 90095, USA
| | - Luke O'Connor
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA
- Program in Bioinformatics and Integrative Genomics, Harvard Graduate School of Arts and Sciences, Boston, MA, USA
| | - Margaux L A Hujoel
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA
| | - Jesse Engreitz
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Fereydoun Hormozdiari
- Department of Biochemistry and Molecular Medicine, University of California, Davis, CA, 95616, USA
- MIND Institute and UC-Davis Genome Center, Davis, CA, 95616, USA
| | - Alkes L Price
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA
|
16
|
O'Connor LJ, Schoech AP, Hormozdiari F, Gazal S, Patterson N, Price AL. Extreme Polygenicity of Complex Traits Is Explained by Negative Selection. Am J Hum Genet 2019; 105:456-476. [PMID: 31402091 PMCID: PMC6732528 DOI: 10.1016/j.ajhg.2019.07.003] [Citation(s) in RCA: 103] [Impact Index Per Article: 20.6] [Received: 12/19/2018] [Accepted: 07/03/2019] [Indexed: 12/16/2022]
Abstract
Complex traits and common diseases are extremely polygenic, their heritability spread across thousands of loci. One possible explanation is that thousands of genes and loci have similarly important biological effects when mutated. However, we hypothesize that for most complex traits, relatively few genes and loci are critical, and negative selection, by purging large-effect mutations in these regions, leaves behind common-variant associations in thousands of less critical regions instead. We refer to this phenomenon as flattening. To quantify its effects, we introduce a mathematical definition of polygenicity, the effective number of independently associated SNPs (Me), which describes how evenly the heritability of a trait is spread across the genome. We developed a method, stratified LD fourth moments regression (S-LD4M), to estimate Me, validating that it produces robust estimates in simulations. Analyzing 33 complex traits (average N = 361k), we determined that heritability is spread ∼4× more evenly among common SNPs than among low-frequency SNPs. This difference, together with evolutionary modeling of new mutations, suggests that complex traits would be orders of magnitude less polygenic if not for the influence of negative selection. We also determined that heritability is spread more evenly within functionally important regions in proportion to their heritability enrichment; functionally important regions do not harbor common SNPs with greatly increased causal effect sizes, due to selective constraint. Our results suggest that for most complex traits, the genes and loci with the most critical biological effects often differ from those with the strongest common-variant associations.
Affiliation(s)
- Luke J O'Connor
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA; Program in Bioinformatics and Integrative Genomics, Harvard Graduate School of Arts and Sciences, Boston, MA 02115, USA.
| | - Armin P Schoech
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
| | - Farhad Hormozdiari
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
| | - Steven Gazal
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
| | - Nick Patterson
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Alkes L Price
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA; Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
|
17
|
Kim SS, Dai C, Hormozdiari F, van de Geijn B, Gazal S, Park Y, O'Connor L, Amariuta T, Loh PR, Finucane H, Raychaudhuri S, Price AL. Genes with High Network Connectivity Are Enriched for Disease Heritability. Am J Hum Genet 2019; 104:896-913. [PMID: 31051114 PMCID: PMC6506868 DOI: 10.1016/j.ajhg.2019.03.020] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Received: 10/23/2018] [Accepted: 03/20/2019] [Indexed: 12/13/2022]
Abstract
Recent studies have highlighted the role of gene networks in disease biology. To formally assess this, we constructed a broad set of pathway, network, and pathway+network annotations and applied stratified LD score regression to 42 diseases and complex traits (average N = 323K) to identify enriched annotations. First, we analyzed 18,119 biological pathways. We identified 156 pathway-trait pairs whose disease enrichment was statistically significant (FDR < 5%) after conditioning on all genes and 75 known functional annotations (from the baseline-LD model), a stringent step that greatly reduced the number of pathways detected; most significant pathway-trait pairs were previously unreported. Next, for each of four published gene networks, we constructed probabilistic annotations based on network connectivity. For each gene network, the network connectivity annotation was strongly significantly enriched. Surprisingly, the enrichments were fully explained by excess overlap between network annotations and regulatory annotations from the baseline-LD model, validating the informativeness of the baseline-LD model and emphasizing the importance of accounting for regulatory annotations in gene network analyses. Finally, for each of the 156 enriched pathway-trait pairs, for each of the four gene networks, we constructed pathway+network annotations by annotating genes with high network connectivity to the input pathway. For each gene network, these pathway+network annotations were strongly significantly enriched for the corresponding traits. Once again, the enrichments were largely explained by the baseline-LD model. In conclusion, gene network connectivity is highly informative for disease architectures, but the information in gene networks may be subsumed by regulatory annotations, emphasizing the importance of accounting for known annotations.
Affiliation(s)
- Samuel S Kim
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02142, USA; Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA.
| | - Chengzhen Dai
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02142, USA
| | - Farhad Hormozdiari
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
| | - Bryce van de Geijn
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
| | - Steven Gazal
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
| | - Yongjin Park
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02142, USA
| | - Luke O'Connor
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA; Program in Bioinformatics and Integrative Genomics, Harvard University, Cambridge, MA 02138, USA
| | - Tiffany Amariuta
- Program in Bioinformatics and Integrative Genomics, Harvard University, Cambridge, MA 02138, USA
| | - Po-Ru Loh
- Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115, USA
| | - Hilary Finucane
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Soumya Raychaudhuri
- Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115, USA
| | - Alkes L Price
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA; Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA.
| |
Collapse
18
Hujoel MLA, Gazal S, Hormozdiari F, van de Geijn B, Price AL. Disease Heritability Enrichment of Regulatory Elements Is Concentrated in Elements with Ancient Sequence Age and Conserved Function across Species. Am J Hum Genet 2019; 104:611-624. [PMID: 30905396 PMCID: PMC6451699 DOI: 10.1016/j.ajhg.2019.02.008] [Citation(s) in RCA: 40] [Impact Index Per Article: 8.0] [Received: 09/17/2018] [Accepted: 02/05/2019] [Indexed: 02/06/2023]
Abstract
Regulatory elements, e.g., enhancers and promoters, have been widely reported to be enriched for disease and complex trait heritability. We investigated how this enrichment varies with the age of the underlying genome sequence, the conservation of regulatory function across species, and the target gene of the regulatory element. We estimated heritability enrichment by applying stratified LD score regression to summary statistics from 41 independent diseases and complex traits (average N = 320K) and meta-analyzing results across traits. Enrichment of human putative enhancers and promoters was larger in elements with older sequence age, assessed via alignment with other species irrespective of conserved functionality: putative enhancer elements with ancient sequence age (older than the split between marsupial and placental mammals) were 8.8× enriched (versus 2.5× for all putative enhancers; p = 3e-14), and promoter elements with ancient sequence age were 13.5× enriched (versus 5.1× for all promoters; p = 5e-16). Enrichment of human putative enhancers and promoters was also larger in elements whose regulatory function was conserved across species, e.g., human putative enhancers that were enhancers in ≥5 of 9 other mammals were 4.6× enriched (p = 5e-12 versus all putative enhancers). Enrichment of human promoters was larger in promoters of loss-of-function intolerant genes: 12.0× enrichment (p = 8e-15 versus all promoters). The mean value of several measures of negative selection within these genomic annotations mirrored all of these findings. Notably, the annotations with these excess heritability enrichments were jointly significant conditional on each other and on our baseline-LD model, which includes a broad set of coding, conserved, regulatory, and LD-related annotations.
Affiliation(s)
- Margaux L A Hujoel
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA; Division of Biostatistics, Dana-Farber Cancer Institute, Boston, MA 02215, USA.
- Steven Gazal
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA; Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Farhad Hormozdiari
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA; Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Bryce van de Geijn
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA; Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Alkes L Price
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA; Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA; Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
19
Reshef YA, Finucane HK, Kelley DR, Gusev A, Kotliar D, Ulirsch JC, Hormozdiari F, Nasser J, O'Connor L, van de Geijn B, Loh PR, Grossman SR, Bhatia G, Gazal S, Palamara PF, Pinello L, Patterson N, Adams RP, Price AL. Detecting genome-wide directional effects of transcription factor binding on polygenic disease risk. Nat Genet 2018; 50:1483-1493. [PMID: 30177862 PMCID: PMC6202062 DOI: 10.1038/s41588-018-0196-7] [Citation(s) in RCA: 41] [Impact Index Per Article: 6.8] [Received: 10/19/2017] [Accepted: 07/11/2018] [Indexed: 12/19/2022]
Abstract
Biological interpretation of genome-wide association study data frequently involves assessing whether SNPs linked to a biological process, for example, binding of a transcription factor, show unsigned enrichment for disease signal. However, signed annotations quantifying whether each SNP allele promotes or hinders the biological process can enable stronger statements about disease mechanism. We introduce a method, signed linkage disequilibrium profile regression, for detecting genome-wide directional effects of signed functional annotations on disease risk. We validate the method via simulations and application to molecular quantitative trait loci in blood, recovering known transcriptional regulators. We apply the method to expression quantitative trait loci in 48 Genotype-Tissue Expression tissues, identifying 651 transcription factor-tissue associations including 30 with robust evidence of tissue specificity. We apply the method to 46 diseases and complex traits (average n = 290 K), identifying 77 annotation-trait associations representing 12 independent transcription factor-trait associations, and characterize the underlying transcriptional programs using gene-set enrichment analyses. Our results implicate new causal disease genes and new disease mechanisms.
Affiliation(s)
- Yakir A Reshef
  - Department of Computer Science, Harvard University, Cambridge, MA, USA.
  - Harvard/MIT MD/PhD Program, Boston, MA, USA.
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA.
- David R Kelley
  - California Life Sciences LLC, South San Francisco, CA, USA
- Dylan Kotliar
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Jacob C Ulirsch
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Dana Farber Cancer Institute, Boston, MA, USA
  - Boston Children's Hospital, Boston, MA, USA
- Farhad Hormozdiari
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Joseph Nasser
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Luke O'Connor
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - Program in Bioinformatics and Integrative Genomics, Harvard University, Cambridge, MA, USA
- Bryce van de Geijn
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Po-Ru Loh
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
- Sharon R Grossman
  - Harvard/MIT MD/PhD Program, Boston, MA, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Gaurav Bhatia
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Steven Gazal
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Pier Francesco Palamara
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - Department of Statistics, University of Oxford, Oxford, UK
- Luca Pinello
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Massachusetts General Hospital, Charlestown, MA, USA
  - Department of Pathology, Harvard Medical School, Boston, MA, USA
- Ryan P Adams
  - Google Brain, New York, NY, USA
  - Department of Computer Science, Princeton University, Princeton, NJ, USA
- Alkes L Price
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA.
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA.
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA.
20
Wu Y, Hormozdiari F, Joo JWJ, Eskin E. Improving Imputation Accuracy by Inferring Causal Variants in Genetic Studies. J Comput Biol 2018; 26:1203-1213. [PMID: 30272994 DOI: 10.1089/cmb.2018.0139] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022]
Abstract
Genotype imputation has been widely utilized for two reasons in the analysis of genome-wide association studies (GWAS). One reason is to increase the power for association studies when causal single nucleotide polymorphisms are not collected in the GWAS. The second reason is to aid the interpretation of a GWAS result by predicting the association statistics at untyped variants. In this article, we show that prediction of association statistics at untyped variants that have an influence on the trait is overly conservative. Current imputation methods assume that none of the variants in a region (a locus consisting of multiple variants) affect the trait, which is often inconsistent with the observed data. In this article, we propose a new method, CAUSAL-Imp, which can impute the association statistics at untyped variants while taking into account variants in the region that may affect the trait. Our method builds on recent methods that impute the marginal statistics for GWAS by utilizing the fact that marginal statistics follow a multivariate normal distribution. We utilize both simulated and real data sets to assess the performance of our method. We show that traditional imputation approaches underestimate the association statistics for variants involved in the trait, and our results demonstrate that our approach provides less biased estimates of these association statistics.
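As a rough illustration of the multivariate normal idea the abstract builds on, the sketch below imputes z-scores at untyped variants as the conditional mean given the typed variants, assuming no variant in the region affects the trait (the traditional baseline, not CAUSAL-Imp itself). The function names `solve` and `impute_z`, the LD matrix layout, and the optional `ridge` term are assumptions made for this example and do not come from the article.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def impute_z(ld_tt, ld_ut, z_typed, ridge=0.0):
    """Conditional mean E[z_u | z_t] = ld_ut (ld_tt + ridge*I)^-1 z_t.

    ld_tt: LD (correlation) matrix among typed variants.
    ld_ut: LD between each untyped variant (rows) and the typed variants (columns).
    ridge: hypothetical regularisation added to the diagonal (an assumption here).
    """
    n = len(z_typed)
    reg = [[ld_tt[i][j] + (ridge if i == j else 0.0) for j in range(n)]
           for i in range(n)]
    weights = solve(reg, z_typed)
    return [sum(r * w for r, w in zip(row, weights)) for row in ld_ut]


# One untyped variant in LD (0.6, 0.3) with two typed variants.
imputed = impute_z([[1.0, 0.5], [0.5, 1.0]], [[0.6, 0.3]], [2.0, 1.0])
```

Because the imputed value is a weighted combination of the typed z-scores, with weights shrinking as LD with the typed variants weakens, it tends to be pulled toward zero, which is consistent with the underestimation the abstract describes for variants that do affect the trait.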
Affiliation(s)
- Yue Wu
  - Department of Computer Science, University of California Los Angeles, Los Angeles, California
- Farhad Hormozdiari
  - Department of Computer Science, University of California Los Angeles, Los Angeles, California
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, Massachusetts
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts
- Jong Wha J Joo
  - Department of Computer Science and Engineering, Dongguk University, Seoul, South Korea
- Eleazar Eskin
  - Department of Computer Science, University of California Los Angeles, Los Angeles, California
  - Department of Human Genetics, University of California Los Angeles, Los Angeles, California
21
Hormozdiari F, Gazal S, van de Geijn B, Finucane HK, Ju CJT, Loh PR, Schoech A, Reshef Y, Liu X, O'Connor L, Gusev A, Eskin E, Price AL. Leveraging molecular quantitative trait loci to understand the genetic architecture of diseases and complex traits. Nat Genet 2018; 50:1041-1047. [PMID: 29942083 PMCID: PMC6030458 DOI: 10.1038/s41588-018-0148-2] [Citation(s) in RCA: 100] [Impact Index Per Article: 16.7] [Received: 09/14/2017] [Accepted: 04/27/2018] [Indexed: 12/20/2022]
Abstract
There is increasing evidence that many risk loci found using genome-wide association studies are molecular quantitative trait loci (QTLs). Here we introduce a new set of functional annotations based on causal posterior probabilities of fine-mapped molecular cis-QTLs, using data from the Genotype-Tissue Expression (GTEx) and BLUEPRINT consortia. We show that these annotations are more strongly enriched for heritability (5.84× for eQTLs; P = 1.19 × 10^-31) across 41 diseases and complex traits than annotations containing all significant molecular QTLs (1.80× for expression (e)QTLs). eQTL annotations obtained by meta-analyzing all GTEx tissues generally performed best, whereas tissue-specific eQTL annotations produced stronger enrichments for blood- and brain-related diseases and traits. eQTL annotations restricted to loss-of-function intolerant genes were even more enriched for heritability (17.06×; P = 1.20 × 10^-35). All molecular QTLs except splicing QTLs remained significantly enriched in joint analysis, indicating that each of these annotations is uniquely informative for disease and complex trait architectures.
Affiliation(s)
- Farhad Hormozdiari
  - Department of Epidemiology, Harvard T. H. Chan School of Public Health, Boston, MA, USA.
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
- Steven Gazal
  - Department of Epidemiology, Harvard T. H. Chan School of Public Health, Boston, MA, USA
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Bryce van de Geijn
  - Department of Epidemiology, Harvard T. H. Chan School of Public Health, Boston, MA, USA
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Hilary K Finucane
  - Department of Epidemiology, Harvard T. H. Chan School of Public Health, Boston, MA, USA
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Chelsea J-T Ju
  - Department of Computer Science, University of California, Los Angeles, CA, USA
- Po-Ru Loh
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
- Armin Schoech
  - Department of Epidemiology, Harvard T. H. Chan School of Public Health, Boston, MA, USA
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Yakir Reshef
  - Department of Epidemiology, Harvard T. H. Chan School of Public Health, Boston, MA, USA
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Xuanyao Liu
  - Department of Epidemiology, Harvard T. H. Chan School of Public Health, Boston, MA, USA
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Luke O'Connor
  - Department of Epidemiology, Harvard T. H. Chan School of Public Health, Boston, MA, USA
  - Program in Bioinformatics and Integrative Genomics, Harvard Graduate School of Arts and Sciences, Boston, MA, USA
- Alexander Gusev
  - Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
  - Dana Farber Cancer Institute, Harvard Medical School, Boston, MA, USA
- Eleazar Eskin
  - Department of Computer Science, University of California, Los Angeles, CA, USA
  - Department of Human Genetics, University of California, Los Angeles, CA, USA
- Alkes L Price
  - Department of Epidemiology, Harvard T. H. Chan School of Public Health, Boston, MA, USA.
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
22
Gamazon ER, Segrè AV, van de Bunt M, Wen X, Xi HS, Hormozdiari F, Ongen H, Konkashbaev A, Derks EM, Aguet F, Quan J, Nicolae DL, Eskin E, Kellis M, Getz G, McCarthy MI, Dermitzakis ET, Cox NJ, Ardlie KG. Using an atlas of gene regulation across 44 human tissues to inform complex disease- and trait-associated variation. Nat Genet 2018; 50:956-967. [PMID: 29955180 PMCID: PMC6248311 DOI: 10.1038/s41588-018-0154-4] [Citation(s) in RCA: 273] [Impact Index Per Article: 45.5] [Received: 07/04/2017] [Accepted: 05/08/2018] [Indexed: 12/27/2022]
Abstract
We apply integrative approaches to expression quantitative trait loci (eQTLs) from 44 tissues from the Genotype-Tissue Expression project and genome-wide association study data. About 60% of known trait-associated loci are in linkage disequilibrium with a cis-eQTL, over half of which were not found in previous large-scale whole blood studies. Applying polygenic analyses to metabolic, cardiovascular, anthropometric, autoimmune, and neurodegenerative traits, we find that eQTLs are significantly enriched for trait associations in relevant pathogenic tissues and explain a substantial proportion of the heritability (40-80%). For most traits, tissue-shared eQTLs underlie a greater proportion of trait associations, although tissue-specific eQTLs have a greater contribution to some traits, such as blood pressure. By integrating information from biological pathways with eQTL target genes and applying a gene-based approach, we validate previously implicated causal genes and pathways, and propose new variant and gene associations for several complex traits, which we replicate in the UK Biobank and BioVU.
Affiliation(s)
- Eric R Gamazon
  - Division of Genetic Medicine, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
  - Clare Hall, University of Cambridge, Cambridge, UK
- Ayellet V Segrè
  - The Broad Institute of Massachusetts Institute of Technology and Harvard University, Cambridge, MA, USA
  - Department of Ophthalmology and Ocular Genomics Institute, Massachusetts Eye and Ear, Harvard Medical School, Boston, MA, USA
- Martijn van de Bunt
  - Wellcome Centre for Human Genetics, Nuffield Department of Medicine, University of Oxford, Oxford, UK
  - Oxford Centre for Diabetes, Endocrinology and Metabolism, University of Oxford, Churchill Hospital, Oxford, UK
- Xiaoquan Wen
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
- Hualin S Xi
  - Computational Sciences, Pfizer Inc, Cambridge, MA, USA
- Farhad Hormozdiari
  - Department of Computer Science, University of California, Los Angeles, CA, USA
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Halit Ongen
  - Department of Genetic Medicine and Development, University of Geneva Medical School, Geneva, Switzerland
  - Institute for Genetics and Genomics in Geneva (iG3), University of Geneva, Geneva, Switzerland
  - Swiss Institute of Bioinformatics, Geneva, Switzerland
- Anuar Konkashbaev
  - Division of Genetic Medicine, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
- Eske M Derks
  - Translational Neurogenomics Group, QIMR Berghofer, Brisbane, Queensland, Australia
- François Aguet
  - The Broad Institute of Massachusetts Institute of Technology and Harvard University, Cambridge, MA, USA
- Jie Quan
  - Computational Sciences, Pfizer Inc, Cambridge, MA, USA
- Dan L Nicolae
  - Section of Genetic Medicine, Department of Medicine, The University of Chicago, Chicago, IL, USA
  - Department of Statistics, The University of Chicago, Chicago, IL, USA
  - Department of Human Genetics, The University of Chicago, Chicago, IL, USA
- Eleazar Eskin
  - Department of Computer Science, University of California, Los Angeles, CA, USA
- Manolis Kellis
  - The Broad Institute of Massachusetts Institute of Technology and Harvard University, Cambridge, MA, USA
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
- Gad Getz
  - The Broad Institute of Massachusetts Institute of Technology and Harvard University, Cambridge, MA, USA
  - Massachusetts General Hospital Cancer Center and Department of Pathology, Massachusetts General Hospital, Boston, MA, USA
- Mark I McCarthy
  - Wellcome Centre for Human Genetics, Nuffield Department of Medicine, University of Oxford, Oxford, UK
  - Oxford Centre for Diabetes, Endocrinology and Metabolism, University of Oxford, Churchill Hospital, Oxford, UK
- Emmanouil T Dermitzakis
  - Department of Genetic Medicine and Development, University of Geneva Medical School, Geneva, Switzerland
  - Institute for Genetics and Genomics in Geneva (iG3), University of Geneva, Geneva, Switzerland
  - Swiss Institute of Bioinformatics, Geneva, Switzerland
- Nancy J Cox
  - Division of Genetic Medicine, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
- Kristin G Ardlie
  - The Broad Institute of Massachusetts Institute of Technology and Harvard University, Cambridge, MA, USA
23
Crawford NG, Kelly DE, Hansen MEB, Beltrame MH, Fan S, Bowman SL, Jewett E, Ranciaro A, Thompson S, Lo Y, Pfeifer SP, Jensen JD, Campbell MC, Beggs W, Hormozdiari F, Mpoloka SW, Mokone GG, Nyambo T, Meskel DW, Belay G, Haut J, Rothschild H, Zon L, Zhou Y, Kovacs MA, Xu M, Zhang T, Bishop K, Sinclair J, Rivas C, Elliot E, Choi J, Li SA, Hicks B, Burgess S, Abnet C, Watkins-Chow DE, Oceana E, Song YS, Eskin E, Brown KM, Marks MS, Loftus SK, Pavan WJ, Yeager M, Chanock S, Tishkoff SA. Loci associated with skin pigmentation identified in African populations. Science 2017; 358:eaan8433. [PMID: 29025994 PMCID: PMC5759959 DOI: 10.1126/science.aan8433] [Citation(s) in RCA: 191] [Impact Index Per Article: 27.3] [Received: 05/27/2017] [Accepted: 10/03/2017] [Indexed: 12/13/2022]
Abstract
Despite the wide range of skin pigmentation in humans, little is known about its genetic basis in global populations. Examining ethnically diverse African genomes, we identify variants in or near SLC24A5, MFSD12, DDB1, TMEM138, OCA2, and HERC2 that are significantly associated with skin pigmentation. Genetic evidence indicates that the light pigmentation variant at SLC24A5 was introduced into East Africa by gene flow from non-Africans. At all other loci, variants associated with dark pigmentation in Africans are identical by descent in South Asian and Australo-Melanesian populations. Functional analyses indicate that MFSD12 encodes a lysosomal protein that affects melanogenesis in zebrafish and mice, and that mutations in melanocyte-specific regulatory regions near DDB1/TMEM138 correlate with expression of ultraviolet response genes under selection in Eurasians.
Collapse
Affiliation(s)
- Nicholas G Crawford
- Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Derek E Kelly
- Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Genomics and Computational Biology Graduate Program, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Matthew E B Hansen
- Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Marcia H Beltrame
- Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Shaohua Fan
- Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Shanna L Bowman
- Department of Pathology and Laboratory Medicine, Children's Hospital of Philadelphia Research Institute, Philadelphia, PA 19104, USA
- Department of Pathology and Laboratory Medicine and Department of Physiology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Ethan Jewett
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA 94704, USA
- Department of Statistics, University of California, Berkeley, Berkeley, CA 94704, USA
| | - Alessia Ranciaro
- Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Simon Thompson
- Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Yancy Lo
- Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Susanne P Pfeifer
- School of Life Sciences, Arizona State University, Tempe, AZ 85287, USA
| | - Jeffrey D Jensen
- School of Life Sciences, Arizona State University, Tempe, AZ 85287, USA
| | - Michael C Campbell
- Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Department of Biology, Howard University, Washington, DC 20059, USA
| | - William Beggs
- Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Farhad Hormozdiari
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
- Program in Medical and Population Genetics, Broad Institute of Massachusetts Institute of Technology (MIT) and Harvard, Cambridge, MA 02142, USA
| | | | - Gaonyadiwe George Mokone
- Department of Biomedical Sciences, University of Botswana School of Medicine, Gaborone, Botswana
| | - Thomas Nyambo
- Department of Biochemistry, Muhimbili University of Health and Allied Sciences, Dar es Salaam, Tanzania
| | | | - Gurja Belay
- Department of Biology, Addis Ababa University, Addis Ababa, Ethiopia
| | - Jake Haut
- Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Harriet Rothschild
- Stem Cell Program, Division of Hematology and Oncology, Pediatric Hematology Program, Boston Children's Hospital and Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA 02115, USA
| | - Leonard Zon
- Stem Cell Program, Division of Hematology and Oncology, Pediatric Hematology Program, Boston Children's Hospital and Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA 02115, USA
- Howard Hughes Medical Institute, Harvard Medical School, Boston, MA 02115, USA
| | - Yi Zhou
- Stem Cell Program, Division of Hematology and Oncology, Pediatric Hematology Program, Boston Children's Hospital and Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA 02115, USA
- Harvard Stem Cell Institute, Harvard University, Cambridge, MA 02138, USA
| | - Michael A Kovacs
- Laboratory of Translational Genomics, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA
| | - Mai Xu
- Laboratory of Translational Genomics, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA
| | - Tongwu Zhang
- Laboratory of Translational Genomics, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA
| | - Kevin Bishop
- Translational and Functional Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
| | - Jason Sinclair
- Translational and Functional Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
| | - Cecilia Rivas
- Genetic Disease Research Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
| | - Eugene Elliot
- Genetic Disease Research Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
| | - Jiyeon Choi
- Laboratory of Translational Genomics, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA
| | - Shengchao A Li
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Rockville, MD 20892, USA
- Frederick National Laboratory for Cancer Research, Leidos Biomedical Research Inc., Frederick, MD 21701, USA
| | - Belynda Hicks
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Rockville, MD 20892, USA
- Frederick National Laboratory for Cancer Research, Leidos Biomedical Research Inc., Frederick, MD 21701, USA
| | - Shawn Burgess
- Translational and Functional Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
| | - Christian Abnet
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Rockville, MD 20892, USA
| | - Dawn E Watkins-Chow
- Genetic Disease Research Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
| | - Elena Oceana
- Department of Molecular Pharmacology, Physiology and Biotechnology, Brown University, Providence, RI 02912, USA
| | - Yun S Song
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA 94704, USA
- Department of Statistics, University of California, Berkeley, Berkeley, CA 94704, USA
- Chan Zuckerberg Biohub, San Francisco, CA 94158, USA
- Department of Biology, School of Arts and Sciences, University of Pennsylvania, Philadelphia, PA 19104, USA
- Department of Mathematics, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Eleazar Eskin
- Department of Computer Science and Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA 90095, USA
| | - Kevin M Brown
- Laboratory of Translational Genomics, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA
| | - Michael S Marks
- Department of Pathology and Laboratory Medicine, Children's Hospital of Philadelphia Research Institute, Philadelphia, PA 19104, USA
- Department of Pathology and Laboratory Medicine and Department of Physiology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Stacie K Loftus
- Genetic Disease Research Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
| | - William J Pavan
- Genetic Disease Research Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
| | - Meredith Yeager
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Rockville, MD 20892, USA
- Frederick National Laboratory for Cancer Research, Leidos Biomedical Research Inc., Frederick, MD 21701, USA
| | - Stephen Chanock
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Rockville, MD 20892, USA
| | - Sarah A Tishkoff
- Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA.
- Department of Biology, School of Arts and Sciences, University of Pennsylvania, Philadelphia, PA 19104, USA
24
Abstract
Motivation: Expression quantitative trait loci (eQTLs) are genetic variants that affect gene expression. In eQTL studies, one important task is to find eGenes, that is, genes whose expression is associated with at least one eQTL. The standard statistical method for determining whether a gene is an eGene requires association testing at all nearby variants and a permutation test to correct for multiple testing. The standard method, however, does not consider the genomic annotation of the variants. In practice, variants near gene transcription start sites (TSSs) or certain histone modifications are likely to regulate gene expression. In this article, we introduce a novel eGene detection method that considers this empirical evidence and thereby increases statistical power. Results: We applied our method to the liver Genotype-Tissue Expression (GTEx) data using distance from TSSs, DNase hypersensitivity sites, and six histone modifications as the genomic annotations for the variants. Each of these annotations helped us detect more candidate eGenes. Distance from TSS appears to be the most important annotation; specifically, using this annotation, our method discovered 50% more candidate eGenes than the standard permutation method. Contact: buhm.han@amc.seoul.kr or eeskin@cs.ucla.edu
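The standard permutation-based eGene test described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the annotation-aware method the article introduces: the function name `egene_permutation_pvalue`, the choice of the maximum absolute genotype-expression correlation as the gene-level statistic, and the simulated data are all hypothetical.

```python
import numpy as np

def egene_permutation_pvalue(genotypes, expression, n_perm=1000, seed=0):
    """Gene-level p-value from a max-statistic permutation test (illustrative).

    genotypes:  (n_individuals, n_variants) array of nearby-variant genotypes
    expression: (n_individuals,) array of expression values for one gene
    """
    rng = np.random.default_rng(seed)
    G = np.asarray(genotypes, dtype=float)
    y = np.asarray(expression, dtype=float)
    # Standardize each variant once; assumes no variant is constant.
    Gz = (G - G.mean(axis=0)) / G.std(axis=0)

    def max_abs_corr(v):
        vz = (v - v.mean()) / v.std()
        # Pearson correlation of v with every variant; keep the strongest.
        return np.abs(Gz.T @ vz / len(vz)).max()

    observed = max_abs_corr(y)
    # Permuting expression breaks the genotype link but keeps the
    # correlation (LD) structure among variants intact.
    null = np.array([max_abs_corr(rng.permutation(y)) for _ in range(n_perm)])
    # Add-one correction so the estimate is never exactly zero.
    return (1 + np.sum(null >= observed)) / (1 + n_perm)

# Simulated example (assumed data): variant 0 drives expression.
rng = np.random.default_rng(1)
G = rng.binomial(2, 0.3, size=(100, 5)).astype(float)
y = G[:, 0] + 0.1 * rng.normal(size=100)
print(egene_permutation_pvalue(G, y, n_perm=200))
```

Taking the maximum over all nearby variants inside each permutation is what corrects for testing many correlated variants per gene.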
Affiliation(s)
- Jae Hoon Sul
- Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Jason Ernst
- Department of Computer Science Department of Biological Chemistry
| | - Buhm Han
- Department of Convergence Medicine, University of Ulsan College of Medicine & Asan Institute for Life Sciences, Asan Medical Center, Seoul, Republic of Korea
| | - Eleazar Eskin
- Department of Computer Science and Department of Human Genetics, University of California, Los Angeles, CA 90095, USA
25
Buckley MT, Racimo F, Allentoft ME, Jensen MK, Jonsson A, Huang H, Hormozdiari F, Sikora M, Marnetto D, Eskin E, Jørgensen ME, Grarup N, Pedersen O, Hansen T, Kraft P, Willerslev E, Nielsen R. Selection in Europeans on Fatty Acid Desaturases Associated with Dietary Changes. Mol Biol Evol 2017; 34:1307-1318. [PMID: 28333262 PMCID: PMC5435082 DOI: 10.1093/molbev/msx103] [Citation(s) in RCA: 63] [Impact Index Per Article: 9.0] [Indexed: 12/12/2022] Open
Abstract
FADS genes encode fatty acid desaturases that are important for the conversion of short chain polyunsaturated fatty acids (PUFAs) to long chain fatty acids. Prior studies indicate that the FADS genes have been subjected to strong positive selection in Africa, South Asia, Greenland, and Europe. By comparing FADS sequencing data from present-day and Bronze Age (5-3k years ago) Europeans, we identify possible targets of selection in the European population, which suggest that selection has targeted different alleles in the FADS genes in Europe than it has in South Asia or Greenland. The alleles showing the strongest changes in allele frequency since the Bronze Age show associations with expression changes and multiple lipid-related phenotypes. Furthermore, the selected alleles are associated with a decrease in linoleic acid and an increase in arachidonic and eicosapentaenoic acids among Europeans; this is an opposite effect of that observed for selected alleles in Inuit from Greenland. We show that multiple SNPs in the region affect expression levels and PUFA synthesis. Additionally, we find evidence for a gene-environment interaction influencing low-density lipoprotein (LDL) levels between alleles affecting PUFA synthesis and PUFA dietary intake: carriers of the derived allele display lower LDL cholesterol levels with a higher intake of PUFAs. We hypothesize that the selective patterns observed in Europeans were driven by a change in dietary composition of fatty acids following the transition to agriculture, resulting in a lower intake of arachidonic acid and eicosapentaenoic acid, but a higher intake of linoleic acid and α-linolenic acid.
Affiliation(s)
- Matthew T. Buckley
- Departments of Integrative Biology and Statistics, University of California Berkeley, Berkeley, CA
| | - Fernando Racimo
- Departments of Integrative Biology and Statistics, University of California Berkeley, Berkeley, CA
| | - Morten E. Allentoft
- Natural History Museum of Denmark, University of Copenhagen, Copenhagen K, Denmark
| | - Majken K. Jensen
- Department of Nutrition, Harvard T.H. Chan School of Public Health & Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA
| | - Anna Jonsson
- The Novo Nordisk Foundation Center for Basic Metabolic Research, Section of Metabolic Genetics, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen Ø, Denmark
| | - Hongyan Huang
- Program in Genetic Epidemiology and Statistical Genetics, Harvard T.H. Chan School of Public Health, Boston, MA
| | - Farhad Hormozdiari
- Department of Computer Science, University of California, Los Angeles, CA, USA
| | - Martin Sikora
- Natural History Museum of Denmark, University of Copenhagen, Copenhagen K, Denmark
| | - Davide Marnetto
- Department of Molecular Biotechnology and Health Sciences, University of Torino, Turin, Italy
| | - Eleazar Eskin
- Department of Computer Science, University of California, Los Angeles, CA, USA
- Department of Human Genetics, University of California, Los Angeles, CA, USA
| | - Marit E. Jørgensen
- National Institute of Public Health, University of Southern Denmark, Copenhagen, Denmark
- Steno Diabetes Center Copenhagen, Gentofte, Denmark
| | - Niels Grarup
- The Novo Nordisk Foundation Center for Basic Metabolic Research, Section of Metabolic Genetics, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen Ø, Denmark
| | - Oluf Pedersen
- The Novo Nordisk Foundation Center for Basic Metabolic Research, Section of Metabolic Genetics, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen Ø, Denmark
| | - Torben Hansen
- The Novo Nordisk Foundation Center for Basic Metabolic Research, Section of Metabolic Genetics, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen Ø, Denmark
| | - Peter Kraft
- Program in Genetic Epidemiology and Statistical Genetics, Harvard T.H. Chan School of Public Health, Boston, MA
| | - Eske Willerslev
- Natural History Museum of Denmark, University of Copenhagen, Copenhagen K, Denmark
| | - Rasmus Nielsen
- Departments of Integrative Biology and Statistics, University of California Berkeley, Berkeley, CA
- Natural History Museum of Denmark, University of Copenhagen, Copenhagen K, Denmark
26
Hormozdiari F, Zhu A, Kichaev G, Ju CJT, Segrè AV, Joo JWJ, Won H, Sankararaman S, Pasaniuc B, Shifman S, Eskin E. Widespread Allelic Heterogeneity in Complex Traits. Am J Hum Genet 2017; 100:789-802. [PMID: 28475861 PMCID: PMC5420356 DOI: 10.1016/j.ajhg.2017.04.005] [Citation(s) in RCA: 51] [Impact Index Per Article: 7.3] [Received: 12/27/2016] [Accepted: 04/07/2017] [Indexed: 12/24/2022] Open
Abstract
Recent successes in genome-wide association studies (GWASs) make it possible to address important questions about the genetic architecture of complex traits, such as allele frequency and effect size. One lesser-known aspect of complex traits is the extent of allelic heterogeneity (AH) arising from multiple causal variants at a locus. We developed a computational method to infer the probability of AH and applied it to three GWASs and four expression quantitative trait loci (eQTL) datasets. We identified a total of 4,152 loci with strong evidence of AH. The proportion of all loci with identified AH is 4%-23% in eQTLs, 35% in GWASs of high-density lipoprotein (HDL), and 23% in GWASs of schizophrenia. For eQTLs, we observed a strong correlation between sample size and the proportion of loci with AH (R² = 0.85, p = 2.2 × 10⁻¹⁶), indicating that statistical power prevents identification of AH in other loci. Understanding the extent of AH may guide the development of new methods for fine mapping and association mapping of complex traits.
Affiliation(s)
- Farhad Hormozdiari
- Department of Computer Science, University of California, Los Angeles, CA 90095, USA; Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
| | - Anthony Zhu
- Department of Computer Science, University of California, Los Angeles, CA 90095, USA
| | - Gleb Kichaev
- Bioinformatics IDP, University of California, Los Angeles, CA 90095, USA
| | - Chelsea J-T Ju
- Department of Computer Science, University of California, Los Angeles, CA 90095, USA
| | - Ayellet V Segrè
- Cancer Program, The Broad Institute of Massachusetts Institute of Technology and Harvard University, Cambridge, MA 02142, USA
| | - Jong Wha J Joo
- Department of Computer Science, University of California, Los Angeles, CA 90095, USA; Department of Computer Science Engineering, Dongguk University-Seoul, 04620 Seoul, South Korea
| | - Hyejung Won
- Neurogenetics Program, Department of Neurology, David Geffen School of Medicine, University of California, Los Angeles, CA 90095, USA
| | - Sriram Sankararaman
- Department of Computer Science, University of California, Los Angeles, CA 90095, USA; Department of Human Genetics, University of California, Los Angeles, CA 90095, USA
| | - Bogdan Pasaniuc
- Department of Human Genetics, University of California, Los Angeles, CA 90095, USA; Department of Pathology and Laboratory Medicine, University of California, Los Angeles, CA 90095, USA
| | - Sagiv Shifman
- Department of Genetics, The Institute of Life Sciences, The Hebrew University of Jerusalem, Jerusalem 91904, Israel.
| | - Eleazar Eskin
- Department of Computer Science, University of California, Los Angeles, CA 90095, USA; Department of Human Genetics, University of California, Los Angeles, CA 90095, USA.
27
Mangul S, Yang TH, Hormozdiari F, Dainis AM, Tseng E, Ashley EA, Zelikovsky A, Eskin E. HapIso: An Accurate Method for the Haplotype-Specific Isoforms Reconstruction From Long Single-Molecule Reads. IEEE Trans Nanobioscience 2017; 16:108-115. [PMID: 28328508 DOI: 10.1109/tnb.2017.2675981] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 11/08/2022]
Abstract
Sequencing of RNA provides the possibility to study an individual's transcriptome landscape and determine allelic expression ratios. Single-molecule protocols generate multi-kilobase reads longer than most transcripts, allowing sequencing of complete haplotype isoforms. This allows partitioning the reads into two parental haplotypes. While the read length of the single-molecule protocols is long, the relatively high error rate limits the ability to accurately detect the genetic variants and assemble them into the haplotype-specific isoforms. In this paper, we present Haplotype-specific Isoform reconstruction (HapIso), a method able to tolerate the relatively high error rate of the single-molecule platform and partition the isoform reads into the parental alleles. Phasing the reads according to the allele of origin allows our method to efficiently distinguish between the read errors and the true biological mutations. HapIso uses a k-means clustering algorithm aiming to group the reads into two meaningful clusters, maximizing the similarity of the reads within a cluster and minimizing the similarity of the reads from different clusters. Each cluster corresponds to a parental haplotype. We used the family pedigree information to evaluate our approach. Experimental validation suggests that HapIso is able to tolerate the relatively high error rate and accurately partition the reads into the parental alleles of the isoform transcripts. We also applied HapIso to novel clinical single-molecule RNA-Seq data to estimate allele-specific expression of genes of interest. Our method was able to correct reads and determine a Glu1883Lys point mutation of clinical significance validated by the GeneDx HCM panel. Furthermore, our method is the first method able to reconstruct the haplotype-specific isoforms from long single-molecule reads.
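The two-cluster k-means step mentioned in this abstract can be illustrated with a small sketch. It is not the HapIso implementation: the encoding of each read as a 0/1 vector over candidate variant positions, the function name `two_means`, the deterministic initialization, and the toy reads are assumptions made for this example.

```python
import numpy as np

def two_means(reads, n_iter=20):
    """Partition reads into two clusters with plain k-means (k = 2).

    reads: (n_reads, n_sites) array; each row is assumed to hold the
    allele (0 or 1) a read carries at each candidate variant site.
    """
    X = np.asarray(reads, dtype=float)
    # Deterministic start: the first read, and the read farthest from it.
    c0 = X[0]
    c1 = X[np.argmax(((X - c0) ** 2).sum(axis=1))]
    centroids = np.stack([c0, c1])
    for _ in range(n_iter):
        # Squared Euclidean distance of every read to both centroids.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid; keep the old one if a cluster is empty.
        updated = np.stack([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(2)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return labels, centroids

# Toy reads (assumed data): two haplotypes plus isolated sequencing errors.
reads = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],  # one error relative to haplotype A
    [1, 1, 0, 0],
    [1, 1, 0, 1],  # one error relative to haplotype B
    [0, 0, 1, 1],
]
labels, centroids = two_means(reads)
print(labels.tolist())
```

Rounding each centroid to 0 or 1 gives a consensus allele per site, which is one simple way a cluster could be read back as a haplotype and isolated errors outvoted.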
28
Hormozdiari F, van de Bunt M, Segrè AV, Li X, Joo JWJ, Bilow M, Sul JH, Sankararaman S, Pasaniuc B, Eskin E. Colocalization of GWAS and eQTL Signals Detects Target Genes. Am J Hum Genet 2016; 99:1245-1260. [PMID: 27866706 DOI: 10.1016/j.ajhg.2016.10.003] [Citation(s) in RCA: 396] [Impact Index Per Article: 49.5] [Received: 06/24/2016] [Accepted: 10/03/2016] [Indexed: 01/01/2023] Open
Abstract
The vast majority of genome-wide association study (GWAS) risk loci fall in non-coding regions of the genome. One possible hypothesis is that these GWAS risk loci alter the individual's disease risk through their effect on gene expression in different tissues. In order to understand the mechanisms driving a GWAS risk locus, it is helpful to determine which gene is affected in specific tissue types. For example, the relevant gene and tissue could play a role in the disease mechanism if the same variant responsible for a GWAS locus also affects gene expression. Identifying whether or not the same variant is causal in both GWASs and expression quantitative trait locus (eQTL) studies is challenging because of the uncertainty induced by linkage disequilibrium and the fact that some loci harbor multiple causal variants. However, current methods that address this problem assume that each locus contains a single causal variant. In this paper, we present eCAVIAR, a probabilistic method that has several key advantages over existing methods. First, our method can account for more than one causal variant in any given locus. Second, it can leverage summary statistics without accessing the individual genotype data. We use both simulated and real datasets to demonstrate the utility of our method. Using publicly available eQTL data on 45 different tissues, we demonstrate that eCAVIAR can prioritize likely relevant tissues and target genes for a set of glucose- and insulin-related trait loci.
Affiliation(s)
- Farhad Hormozdiari
- Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, USA
| | - Martijn van de Bunt
- Oxford Centre for Diabetes, Endocrinology, & Metabolism, University of Oxford, Oxford OX3 7LJ, UK; Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK
| | - Ayellet V Segrè
- Cancer Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Xiao Li
- Cancer Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Jong Wha J Joo
- Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, USA
| | - Michael Bilow
- Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, USA
| | - Jae Hoon Sul
- Department of Psychiatry and Biobehavioral Sciences, University of California, Los Angeles, Los Angeles, CA 90095, USA; Semel Center for Informatics and Personalized Genomics, University of California, Los Angeles, Los Angeles, CA 90095, USA
| | - Sriram Sankararaman
- Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, USA; Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA 90095, USA
| | - Bogdan Pasaniuc
- Department of Pathology and Laboratory Medicine, University of California, Los Angeles, Los Angeles, CA 90095, USA; Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA 90095, USA
| | - Eleazar Eskin
- Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, USA; Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA 90095, USA.
29
Won H, de la Torre-Ubieta L, Stein JL, Parikshak NN, Huang J, Opland CK, Gandal MJ, Sutton GJ, Hormozdiari F, Lu D, Lee C, Eskin E, Voineagu I, Ernst J, Geschwind DH. Chromosome conformation elucidates regulatory relationships in developing human brain. Nature 2016; 538:523-527. [PMID: 27760116 DOI: 10.1038/nature19847] [Citation(s) in RCA: 356] [Impact Index Per Article: 44.5] [Received: 04/20/2016] [Accepted: 09/15/2016] [Indexed: 12/18/2022]
Abstract
Three-dimensional physical interactions within chromosomes dynamically regulate gene expression in a tissue-specific manner. However, the 3D organization of chromosomes during human brain development and its role in regulating gene networks dysregulated in neurodevelopmental disorders, such as autism or schizophrenia, are unknown. Here we generate high-resolution 3D maps of chromatin contacts during human corticogenesis, permitting large-scale annotation of previously uncharacterized regulatory relationships relevant to the evolution of human cognition and disease. Our analyses identify hundreds of genes that physically interact with enhancers gained on the human lineage, many of which are under purifying selection and associated with human cognitive function. We integrate chromatin contacts with non-coding variants identified in schizophrenia genome-wide association studies (GWAS), highlighting multiple candidate schizophrenia risk genes and pathways, including transcription factors involved in neurogenesis, and cholinergic signalling molecules, several of which are supported by independent expression quantitative trait loci and gene expression analyses. Genome editing in human neural progenitors suggests that one of these distal schizophrenia GWAS loci regulates FOXG1 expression, supporting its potential role as a schizophrenia risk gene. This work provides a framework for understanding the effect of non-coding regulatory elements on human brain development and the evolution of cognition, and highlights novel mechanisms underlying neuropsychiatric disorders.
Affiliation(s)
- Hyejung Won
- Department of Neurology, Center for Autism Research and Treatment, Semel Institute, David Geffen School of Medicine, University of California Los Angeles, California 90095, USA
| | - Luis de la Torre-Ubieta
- Department of Neurology, Center for Autism Research and Treatment, Semel Institute, David Geffen School of Medicine, University of California Los Angeles, California 90095, USA
| | - Jason L Stein
- Department of Neurology, Center for Autism Research and Treatment, Semel Institute, David Geffen School of Medicine, University of California Los Angeles, California 90095, USA
| | - Neelroop N Parikshak
- Program in Neurobehavioral Genetics, Semel Institute, David Geffen School of Medicine, University of California Los Angeles, California 90095, USA
| | - Jerry Huang
- Department of Neurology, Center for Autism Research and Treatment, Semel Institute, David Geffen School of Medicine, University of California Los Angeles, California 90095, USA
| | - Carli K Opland
- Department of Neurology, Center for Autism Research and Treatment, Semel Institute, David Geffen School of Medicine, University of California Los Angeles, California 90095, USA
| | - Michael J Gandal
- Department of Neurology, Center for Autism Research and Treatment, Semel Institute, David Geffen School of Medicine, University of California Los Angeles, California 90095, USA
| | - Gavin J Sutton
- School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, New South Wales 2052, Australia
| | - Farhad Hormozdiari
- Department of Computer Science, University of California Los Angeles, California 90095, USA
| | - Daning Lu
- Department of Neurology, Center for Autism Research and Treatment, Semel Institute, David Geffen School of Medicine, University of California Los Angeles, California 90095, USA
| | - Changhoon Lee
- Department of Neurology, Center for Autism Research and Treatment, Semel Institute, David Geffen School of Medicine, University of California Los Angeles, California 90095, USA
| | - Eleazar Eskin
- Department of Computer Science, University of California Los Angeles, California 90095, USA; Department of Human Genetics, David Geffen School of Medicine, University of California Los Angeles, California 90095, USA
| | - Irina Voineagu
- School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, New South Wales 2052, Australia
| | - Jason Ernst
- Department of Computer Science, University of California Los Angeles, California 90095, USA; Department of Biological Chemistry, David Geffen School of Medicine, University of California Los Angeles, California 90095, USA
| | - Daniel H Geschwind
- Department of Neurology, Center for Autism Research and Treatment, Semel Institute, David Geffen School of Medicine, University of California Los Angeles, California 90095, USA; Program in Neurobehavioral Genetics, Semel Institute, David Geffen School of Medicine, University of California Los Angeles, California 90095, USA; Department of Human Genetics, David Geffen School of Medicine, University of California Los Angeles, California 90095, USA
30
Hasin-Brumshtein Y, Khan AH, Hormozdiari F, Pan C, Parks BW, Petyuk VA, Piehowski PD, Brümmer A, Pellegrini M, Xiao X, Eskin E, Smith RD, Lusis AJ, Smith DJ. Hypothalamic transcriptomes of 99 mouse strains reveal trans eQTL hotspots, splicing QTLs and novel non-coding genes. eLife 2016; 5. [PMID: 27623010 PMCID: PMC5053804 DOI: 10.7554/elife.15614] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Received: 02/27/2016] [Accepted: 09/12/2016] [Indexed: 12/19/2022] Open
Abstract
Previous studies had shown that the integration of genome wide expression profiles, in metabolic tissues, with genetic and phenotypic variance, provided valuable insight into the underlying molecular mechanisms. We used RNA-Seq to characterize hypothalamic transcriptome in 99 inbred strains of mice from the Hybrid Mouse Diversity Panel (HMDP), a reference resource population for cardiovascular and metabolic traits. We report numerous novel transcripts supported by proteomic analyses, as well as novel non coding RNAs. High resolution genetic mapping of transcript levels in HMDP, reveals both local and trans expression Quantitative Trait Loci (eQTLs) demonstrating 2 trans eQTL 'hotspots' associated with expression of hundreds of genes. We also report thousands of alternative splicing events regulated by genetic variants. Finally, comparison with about 150 metabolic and cardiovascular traits revealed many highly significant associations. Our data provide a rich resource for understanding the many physiologic functions mediated by the hypothalamus and their genetic regulation. DOI: http://dx.doi.org/10.7554/eLife.15614.001
Metabolism is a term that describes all the chemical reactions that are involved in keeping a living organism alive. Diseases related to metabolism – such as obesity, heart disease and diabetes – are a major health problem in the Western world. The causes of these diseases are complex and include both environmental factors, such as diet and exercise, and genetics. Indeed, many genetic variants that contribute to obesity have been uncovered in both humans and mice. However, it is only dimly understood how these genetic variants affect the underlying networks of interacting genes that cause metabolic disorders. Measuring gene activity or expression, and tracing how genetic instructions are carried from DNA into RNA and proteins, can reliably identify groups of genes that correlate with metabolic traits in specific organs. This strategy was successfully used in previous studies to reveal new information about abnormalities linked to obesity in specific tissues such as the liver and fat tissues. It was also shown that this approach might suggest new molecules that could be targeted to treat metabolic disorders. A brain region called the hypothalamus is key to the control of metabolism, including feeding behavior and obesity. Hasin-Brumshtein et al. set out to explore gene expression in the hypothalamus of 99 different strains of mice, in the hope that the data will help identify new connections between gene expression and metabolism. This approach showed that thousands of new and known genes are expressed in the mouse hypothalamus, some of which coded for proteins, and some of which did not. Hasin-Brumshtein et al. uncovered two genetic variants that controlled the expression of hundreds of other genes. Further analysis then revealed thousands of genetic variants that regulated the expression of, and type of RNA (so-called "spliceforms") produced from neighboring genes. Also, the expression of many individual genes showed significant similarities with about 150 metabolic measurements that had been evaluated previously in the mice. This new dataset is a unique resource that can be coupled with different approaches to test existing ideas and develop new ones about the role of particular genes or genetic mechanisms in obesity. Future studies will likely focus on new genes that show strong associations with attributes that are relevant to metabolic disorders, such as insulin levels, weight and fat mass. DOI: http://dx.doi.org/10.7554/eLife.15614.002
Affiliation(s)
- Yehudit Hasin-Brumshtein
- Department of Human Genetics, University of California, Los Angeles, Los Angeles, United States; David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, United States; Department of Microbiology, University of California, Los Angeles, Los Angeles, United States; Department of Immunology and Molecular Genetics, University of California, Los Angeles, Los Angeles, United States
| | - Arshad H Khan
- Department of Molecular and Medical Pharmacology, University of California, Los Angeles, Los Angeles, United States
| | - Farhad Hormozdiari
- Department of Computer Science, University of California, Los Angeles, Los Angeles, United States
| | - Calvin Pan
- Department of Human Genetics, University of California, Los Angeles, Los Angeles, United States; David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, United States
| | - Brian W Parks
- Department of Human Genetics, University of California, Los Angeles, Los Angeles, United States; David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, United States; Department of Microbiology, University of California, Los Angeles, Los Angeles, United States; Department of Immunology and Molecular Genetics, University of California, Los Angeles, Los Angeles, United States
| | - Vladislav A Petyuk
- Biological Sciences Division, Pacific Northwest National Laboratory, Richland, United States
| | - Paul D Piehowski
- Biological Sciences Division, Pacific Northwest National Laboratory, Richland, United States
| | - Anneke Brümmer
- Department of Integrative Biology and Physiology, University of California, Los Angeles, Los Angeles, United States
| | - Matteo Pellegrini
- Department of Molecular, Cell, and Developmental Biology, University of California, Los Angeles, Los Angeles, United States
| | - Xinshu Xiao
- Department of Integrative Biology and Physiology, University of California, Los Angeles, Los Angeles, United States
| | - Eleazar Eskin
- Department of Computer Science, University of California, Los Angeles, Los Angeles, United States
| | - Richard D Smith
- Biological Sciences Division, Pacific Northwest National Laboratory, Richland, United States
| | - Aldons J Lusis
- Department of Human Genetics, University of California, Los Angeles, Los Angeles, United States; David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, United States; Department of Microbiology, University of California, Los Angeles, Los Angeles, United States; Department of Immunology and Molecular Genetics, University of California, Los Angeles, Los Angeles, United States
| | - Desmond J Smith
- Department of Molecular and Medical Pharmacology, University of California, Los Angeles, Los Angeles, United States
31
Hormozdiari F, Kang EY, Bilow M, Ben-David E, Vulpe C, McLachlan S, Lusis AJ, Han B, Eskin E. Imputing Phenotypes for Genome-wide Association Studies. Am J Hum Genet 2016; 99:89-103. [PMID: 27292110 PMCID: PMC5005435 DOI: 10.1016/j.ajhg.2016.04.013] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.1] [Received: 01/20/2016] [Accepted: 04/28/2016] [Indexed: 01/23/2023] Open
Abstract
Genome-wide association studies (GWASs) have been successful in detecting variants correlated with phenotypes of clinical interest. However, the power to detect these variants depends on the number of individuals whose phenotypes are collected, and for phenotypes that are difficult to collect, the sample size might be insufficient to achieve the desired statistical power. The phenotype of interest is often difficult to collect, whereas surrogate phenotypes or related phenotypes are easier to collect and have already been collected in very large samples. This paper demonstrates how we take advantage of these additional related phenotypes to impute the phenotype of interest or target phenotype and then perform association analysis. Our approach leverages the correlation structure between phenotypes to perform the imputation. The correlation structure can be estimated from a smaller complete dataset for which both the target and related phenotypes have been collected. Under some assumptions, the statistical power can be computed analytically given the correlation structure of the phenotypes used in imputation. In addition, our method can impute the summary statistic of the target phenotype as a weighted linear combination of the summary statistics of related phenotypes. Thus, our method is applicable to datasets for which we have access only to summary statistics and not to the raw genotypes. We illustrate our approach by analyzing associated loci to triglycerides (TGs), body mass index (BMI), and systolic blood pressure (SBP) in the Northern Finland Birth Cohort dataset.
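The idea of imputing a target-phenotype summary statistic as a weighted linear combination of related-phenotype statistics can be sketched as below. The abstract does not state how the weights are chosen; this example assumes the usual conditional-mean weights of a multivariate normal, obtained from the phenotype correlation structure. The function names and the numbers are hypothetical.

```python
import numpy as np

def imputation_weights(corr_related, corr_with_target):
    """Weights w solving corr_related @ w = corr_with_target.

    corr_related:     (k, k) correlation matrix among the related phenotypes
    corr_with_target: (k,) correlations of each related phenotype with the target
    Both are assumed to come from a smaller complete dataset.
    """
    A = np.asarray(corr_related, dtype=float)
    b = np.asarray(corr_with_target, dtype=float)
    return np.linalg.solve(A, b)

def impute_target_statistic(related_stats, weights):
    """Imputed statistic: weighted linear combination of related statistics."""
    return float(np.dot(weights, np.asarray(related_stats, dtype=float)))

# Illustrative numbers (assumed): two related phenotypes at one variant.
corr_related = [[1.0, 0.5],
                [0.5, 1.0]]
corr_with_target = [0.6, 0.3]
w = imputation_weights(corr_related, corr_with_target)
print(w)                                        # approximately [0.6, 0.0]
print(impute_target_statistic([3.0, 1.0], w))   # approximately 1.8
```

In this toy case the second phenotype receives zero weight because its correlation with the target is fully explained by its correlation with the first phenotype.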
Affiliation(s)
- Farhad Hormozdiari
- Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, USA
| | - Eun Yong Kang
- Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, USA
| | - Michael Bilow
- Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, USA
| | - Eyal Ben-David
- Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA 90095, USA
| | - Chris Vulpe
- Department of Nutritional Science and Toxicology, University of California, Berkeley, Berkeley, CA 94720, USA
| | - Stela McLachlan
- Centre for Population Health Sciences, Usher Institute of Population Health Sciences and Informatics, University of Edinburgh, Edinburgh EH8 9AG, UK
| | - Aldons J Lusis
- Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA 90095, USA; Department of Medicine, University of California, Los Angeles, Los Angeles, CA 90095, USA
| | - Buhm Han
- Department of Convergence Medicine, University of Ulsan College of Medicine & Asan Institute for Life Sciences, Asan Medical Center, Seoul 05505, Republic of Korea.
| | - Eleazar Eskin
- Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, USA; Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA 90095, USA.
| |
Collapse
|
32
|
Abstract
BACKGROUND: Multiple hypothesis testing is a major issue in genome-wide association studies (GWAS), which often analyze millions of markers. The permutation test is considered to be the gold standard in multiple testing correction as it accurately takes into account the correlation structure of the genome. Recently, the linear mixed model (LMM) has become the standard practice in GWAS, addressing issues of population structure and insufficient power. However, none of the current multiple testing approaches are applicable to LMM. RESULTS: We were able to estimate per-marker thresholds as accurately as the gold standard approach in real and simulated datasets, while reducing the time required from months to hours. We applied our approach to mouse, yeast, and human datasets to demonstrate its accuracy and efficiency. CONCLUSIONS: We provide an efficient and accurate multiple testing correction approach for linear mixed models. We further provide an intuition about the relationships between per-marker threshold, genetic relatedness, and heritability, based on our observations in real data.
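For context, the permutation test that the abstract calls the gold standard can be sketched as follows. This is the ordinary max-statistic permutation procedure for unrelated individuals, not the LMM-aware method of the paper. The function name, the correlation-based statistic, and the default parameters are assumptions chosen for illustration, and every marker is assumed to be polymorphic.

```python
import numpy as np

def permutation_threshold(genotypes, phenotype, n_perm=200, alpha=0.05, seed=0):
    """Family-wise significance threshold on |z| from the max-statistic
    permutation test (assumes unrelated samples and no monomorphic markers)."""
    rng = np.random.default_rng(seed)
    G = np.asarray(genotypes, dtype=float)
    y = np.asarray(phenotype, dtype=float)
    n = len(y)
    # Standardize markers and phenotype so that G.T @ y / n is a correlation.
    G = (G - G.mean(axis=0)) / G.std(axis=0)
    y = (y - y.mean()) / y.std()
    max_stats = np.empty(n_perm)
    for i in range(n_perm):
        # Shuffling the phenotype breaks genotype-phenotype association
        # while keeping the correlation structure among markers intact.
        r = G.T @ rng.permutation(y) / n
        max_stats[i] = np.sqrt(n) * np.max(np.abs(r))
    # The (1 - alpha) quantile of the null maximum controls the family-wise error rate.
    return np.quantile(max_stats, 1.0 - alpha)

# Toy data: 100 individuals, 20 independent markers, no true association.
_rng = np.random.default_rng(1)
G_toy = _rng.integers(0, 3, size=(100, 20))
y_toy = _rng.normal(size=100)
t05 = permutation_threshold(G_toy, y_toy, alpha=0.05)
t01 = permutation_threshold(G_toy, y_toy, alpha=0.01)
```

The cost grows with the number of permutations times the number of markers, and a naive shuffle ignores relatedness between individuals, which is the gap the abstract describes for linear mixed models.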
Affiliation(s)
- Jong Wha J Joo
  - Bioinformatics IDP, University of California, Los Angeles, CA, USA
- Farhad Hormozdiari
  - Computer Science Department, University of California, Los Angeles, CA, USA
- Buhm Han
  - Department of Convergence Medicine, University of Ulsan College of Medicine & Asan Institute for Life Sciences, Asan Medical Center, Seoul, 138-736, Republic of Korea
- Eleazar Eskin
  - Computer Science Department, University of California, Los Angeles, CA, USA
  - Department of Human Genetics, University of California, Los Angeles, CA, USA
33
Freedman AH, Schweizer RM, Ortega-Del Vecchyo D, Han E, Davis BW, Gronau I, Silva PM, Galaverni M, Fan Z, Marx P, Lorente-Galdos B, Ramirez O, Hormozdiari F, Alkan C, Vilà C, Squire K, Geffen E, Kusak J, Boyko AR, Parker HG, Lee C, Tadigotla V, Siepel A, Bustamante CD, Harkins TT, Nelson SF, Marques-Bonet T, Ostrander EA, Wayne RK, Novembre J. Demographically-Based Evaluation of Genomic Regions under Selection in Domestic Dogs. PLoS Genet 2016; 12:e1005851. [PMID: 26943675 PMCID: PMC4778760 DOI: 10.1371/journal.pgen.1005851] [Citation(s) in RCA: 65] [Impact Index Per Article: 8.1] [Received: 08/26/2015] [Accepted: 01/18/2016] [Indexed: 12/31/2022]
Abstract
Controlling for background demographic effects is important for accurately identifying loci that have recently undergone positive selection. To date, the effects of demography have not yet been explicitly considered when identifying loci under selection during dog domestication. To investigate positive selection on the dog lineage early in domestication, we examined patterns of polymorphism in six canid genomes that were previously used to infer a demographic model of dog domestication. Using an inferred demographic model, we computed false discovery rates (FDR) and identified 349 outlier regions consistent with positive selection at a low FDR. The signals in the top 100 regions were frequently centered on candidate genes related to brain function and behavior, including LHFPL3, CADM2, GRIK3, SH3GL2, MBP, PDE7B, NTAN1, and GLRA1. These regions contained significant enrichments in behavioral ontology categories. The 3rd top hit, CCRN4L, plays a major role in lipid metabolism, a finding supported by additional metabolism-related candidates revealed in our scan, including SCP2D1 and PDXC1. Comparing our method to an empirical outlier approach that does not directly account for demography, we found only modest overlap between the two methods, with 60% of empirical outliers having no overlap with our demography-based outlier detection approach. Demography-aware approaches have lower rates of false discovery. Our top candidates for selection, in addition to expanding the set of neurobehavioral candidate genes, include genes related to lipid metabolism, suggesting a dietary target of selection that was important during the period when proto-dogs hunted and fed alongside hunter-gatherers.
Affiliation(s)
- Adam H. Freedman
  - Department of Ecology and Evolutionary Biology, University of California, Los Angeles, Los Angeles, California, United States of America
- Rena M. Schweizer
  - Department of Ecology and Evolutionary Biology, University of California, Los Angeles, Los Angeles, California, United States of America
- Diego Ortega-Del Vecchyo
  - Department of Ecology and Evolutionary Biology, University of California, Los Angeles, Los Angeles, California, United States of America
- Eunjung Han
  - Department of Ecology and Evolutionary Biology, University of California, Los Angeles, Los Angeles, California, United States of America
- Brian W. Davis
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, United States of America
- Ilan Gronau
  - Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York, United States of America
- Zhenxin Fan
  - Key Laboratory of Bioresources and Ecoenvironment, Sichuan University, Chengdu, China
- Peter Marx
  - Department of Measurement and Information Systems, Budapest University of Technology and Economics, Budapest, Hungary
- Belen Lorente-Galdos
  - ICREA at Institut de Biologia Evolutiva (CSIC-Universitat Pompeu Fabra), Barcelona, Spain
- Oscar Ramirez
  - ICREA at Institut de Biologia Evolutiva (CSIC-Universitat Pompeu Fabra), Barcelona, Spain
- Farhad Hormozdiari
  - Department of Computer Science, University of California, Los Angeles, Los Angeles, California, United States of America
- Carles Vilà
  - Estación Biológica de Doñana EBD-CSIC, Sevilla, Spain
- Kevin Squire
  - Department of Human Genetics, University of California, Los Angeles, Los Angeles, California, United States of America
- Eli Geffen
  - Department of Zoology, Tel Aviv University, Tel Aviv, Israel
- Josip Kusak
  - Department of Biology, University of Zagreb, Zagreb, Croatia
- Adam R. Boyko
  - Department of Biomedical Sciences, Cornell University, Ithaca, New York, United States of America
- Heidi G. Parker
  - ICREA at Institut de Biologia Evolutiva (CSIC-Universitat Pompeu Fabra), Barcelona, Spain
- Clarence Lee
  - Life Technologies, Foster City, California, United States of America
- Vasisht Tadigotla
  - Life Technologies, Foster City, California, United States of America
- Adam Siepel
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
- Stanley F. Nelson
  - Department of Human Genetics, University of California, Los Angeles, Los Angeles, California, United States of America
- Tomas Marques-Bonet
  - ICREA at Institut de Biologia Evolutiva (CSIC-Universitat Pompeu Fabra), Barcelona, Spain
  - Centro Nacional de Analisis Genomico (CNAG/PCB), Baldiri Reixach 4–8, Barcelona, Spain
- Elaine A. Ostrander
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, United States of America
- Robert K. Wayne
  - Department of Ecology and Evolutionary Biology, University of California, Los Angeles, Los Angeles, California, United States of America
- John Novembre
  - Department of Ecology and Evolutionary Biology, University of California, Los Angeles, Los Angeles, California, United States of America
34
Abstract
Motivation: Although genome-wide association studies (GWAS) have identified thousands of variants associated with common diseases and complex traits, only a handful of these variants are validated to be causal. We consider ‘causal variants’ as variants which are responsible for the association signal at a locus. As opposed to association studies that benefit from linkage disequilibrium (LD), the main challenge in identifying causal variants at associated loci lies in distinguishing among the many closely correlated variants due to LD. This is particularly important for model organisms such as inbred mice, where LD extends much further than in human populations, resulting in large stretches of the genome with significantly associated variants. Furthermore, these model organisms are highly structured and require correction for population structure to remove potential spurious associations. Results: In this work, we propose CAVIAR-Gene (CAusal Variants Identification in Associated Regions), a novel method that is able to operate across large LD regions of the genome while also correcting for population structure. A key feature of our approach is that it provides as output a minimally sized set of genes that captures the genes which harbor causal variants with probability ρ. Through extensive simulations, we demonstrate that our method not only speeds up computation, but also has an average of 10% higher recall rate compared with the existing approaches. We validate our method using real mouse high-density lipoprotein (HDL) data and show that CAVIAR-Gene is able to identify Apoa2 (a gene known to harbor causal variants for HDL), while reducing the number of genes that need to be tested for functionality by a factor of 2. Availability and implementation: Software is freely available for download at genetics.cs.ucla.edu/caviar. Contact: eeskin@cs.ucla.edu
Affiliation(s)
- Farhad Hormozdiari
  - Department of Computer Science, Inter-Departmental Program in Bioinformatics, Department of Human Genetics and Department of Pathology and Laboratory Medicine, University of California, Los Angeles, CA 90095, USA
- Gleb Kichaev
  - Department of Computer Science, Inter-Departmental Program in Bioinformatics, Department of Human Genetics and Department of Pathology and Laboratory Medicine, University of California, Los Angeles, CA 90095, USA
- Wen-Yun Yang
  - Department of Computer Science, Inter-Departmental Program in Bioinformatics, Department of Human Genetics and Department of Pathology and Laboratory Medicine, University of California, Los Angeles, CA 90095, USA
- Bogdan Pasaniuc
  - Department of Computer Science, Inter-Departmental Program in Bioinformatics, Department of Human Genetics and Department of Pathology and Laboratory Medicine, University of California, Los Angeles, CA 90095, USA
- Eleazar Eskin
  - Department of Computer Science, Inter-Departmental Program in Bioinformatics, Department of Human Genetics and Department of Pathology and Laboratory Medicine, University of California, Los Angeles, CA 90095, USA
35
Park DS, Baran Y, Hormozdiari F, Eng C, Torgerson DG, Burchard EG, Zaitlen N. PIGS: improved estimates of identity-by-descent probabilities by probabilistic IBD graph sampling. BMC Bioinformatics 2015; 16 Suppl 5:S9. [PMID: 25860540 PMCID: PMC4402697 DOI: 10.1186/1471-2105-16-s5-s9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 12/19/2022]
Abstract
Identifying segments in the genome of different individuals that are identical-by-descent (IBD) is a fundamental element of genetics. IBD data is used for numerous applications including demographic inference, heritability estimation, and mapping disease loci. Simultaneous detection of IBD over multiple haplotypes has proven to be computationally difficult. To overcome this, many state-of-the-art methods estimate the probability of IBD between each pair of haplotypes separately. While computationally efficient, these methods fail to leverage the clique structure of IBD, resulting in less powerful IBD identification, especially for small IBD segments. We develop a hybrid approach (PIGS), which combines the computational efficiency of pairwise methods with the power of multiway methods. It leverages the IBD graph structure to compute the probability of IBD conditional on all pairwise estimates simultaneously. We show via extensive simulations and analysis of real data that our method produces a substantial increase in the number of identified small IBD segments.
36
Hormozdiari F, Eskin E. Memory efficient assembly of human genome. J Bioinform Comput Biol 2015; 13:1550008. [PMID: 25603998 DOI: 10.1142/s0219720015500080] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
The ability to detect the genetic variations between two individuals is an essential component of genetic studies. In these studies, obtaining the genome sequence of both individuals is the first step toward solving the variation detection problem. The emergence of high-throughput sequencing (HTS) technology has made DNA sequencing practical, and it is widely used by diagnosticians to increase their knowledge about the causal factors in genetics-related diseases. As HTS advances, more data are generated every day than scientists can process. Genome assembly is one of the existing methods to tackle the variation detection problem. The de Bruijn graph formulation of the assembly problem is widely used in the field. Furthermore, it is the only method which can assemble any genome in linear time. However, it requires an enormous amount of memory in order to assemble any mammalian-size genome. The high demand for sequencing more individuals and the urge to assemble them are the driving forces for a memory-efficient assembler. In this work, we propose a novel method which builds the de Bruijn graph while consuming less memory. Moreover, our proposed method can reduce the memory usage by 37% compared to the existing methods. In addition, we used a real data set (chromosome 17 of A/J strain) to illustrate the performance of our method.
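As background for the de Bruijn graph formulation mentioned in the abstract, a minimal sketch of the textbook construction follows. Nodes are (k-1)-mers and each k-mer in a read contributes one edge. The function name and the toy read are assumptions; this naive dictionary-based version is the memory-hungry baseline, not the memory-efficient method the paper proposes.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Build a de Bruijn multigraph: each k-mer adds an edge from its
    (k-1)-mer prefix to its (k-1)-mer suffix."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])   # prefix -> suffix
    return dict(graph)

# Toy example: the read "ACGT" with k = 3 yields the k-mers ACG and CGT.
toy_graph = build_de_bruijn(["ACGT"], k=3)
```

Construction takes one pass over the reads, but storing every node as a string in a hash table is what drives memory use on mammalian-size genomes.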
Affiliation(s)
- Farhad Hormozdiari
  - Department of Computer Science, University of California Los Angeles, Los Angeles, CA 90095, USA
37
Yang WY, Hormozdiari F, Eskin E, Pasaniuc B. A spatial haplotype copying model with applications to genotype imputation. J Comput Biol 2014; 22:451-62. [PMID: 25526526 DOI: 10.1089/cmb.2014.0151] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022]
Abstract
Ever since its introduction, the haplotype copy model has proven to be one of the most successful approaches for modeling genetic variation in human populations, with applications ranging from ancestry inference to genotype phasing and imputation. Motivated by coalescent theory, this approach assumes that any chromosome (haplotype) can be modeled as a mosaic of segments copied from a set of chromosomes sampled from the same population. At the core of the model is the assumption that any chromosome from the sample is equally likely to contribute a priori to the copying process. Motivated by recent works that model genetic variation in a geographic continuum, we propose a new spatial-aware haplotype copy model that jointly models geography and the haplotype copying process. We extend hidden Markov models of haplotype diversity such that at any given location, haplotypes that are closest in the genetic-geographic continuum map are a priori more likely to contribute to the copying process than distant ones. Through simulations starting from the 1000 Genomes data, we show that our model achieves superior accuracy in genotype imputation over the standard spatial-unaware haplotype copy model. In addition, we show the utility of our model in selecting a small personalized reference panel for imputation that leads to both improved accuracy as well as to a lower computational runtime than the standard approach. Finally, we show our proposed model can be used to localize individuals on the genetic-geographical map on the basis of their genotype data.
Affiliation(s)
- Wen-Yun Yang
  - Department of Computer Science, University of California, Los Angeles, California
38
Kichaev G, Yang WY, Lindstrom S, Hormozdiari F, Eskin E, Price AL, Kraft P, Pasaniuc B. Integrating functional data to prioritize causal variants in statistical fine-mapping studies. PLoS Genet 2014; 10:e1004722. [PMID: 25357204 PMCID: PMC4214605 DOI: 10.1371/journal.pgen.1004722] [Citation(s) in RCA: 320] [Impact Index Per Article: 32.0] [Received: 05/24/2014] [Accepted: 09/01/2014] [Indexed: 11/18/2022]
Abstract
Standard statistical approaches for prioritization of variants for functional testing in fine-mapping studies either use marginal association statistics or estimate posterior probabilities for variants to be causal under simplifying assumptions. Here, we present a probabilistic framework that integrates association strength with functional genomic annotation data to improve accuracy in selecting plausible causal variants for functional validation. A key feature of our approach is that it empirically estimates the contribution of each functional annotation to the trait of interest directly from summary association statistics while allowing for multiple causal variants at any risk locus. We devise efficient algorithms that estimate the parameters of our model across all risk loci to further increase performance. Using simulations starting from the 1000 Genomes data, we find that our framework consistently outperforms the current state-of-the-art fine-mapping methods, reducing the number of variants that need to be selected to capture 90% of the causal variants from an average of 13.3 to 10.4 SNPs per locus (as compared to the next-best performing strategy). Furthermore, we introduce a cost-to-benefit optimization framework for determining the number of variants to be followed up in functional assays and assess its performance using real and simulation data. We validate our findings using a large scale meta-analysis of four blood lipids traits and find that the relative probability for causality is increased for variants in exons and transcription start sites and decreased in repressed genomic regions at the risk loci of these traits. Using these highly predictive, trait-specific functional annotations, we estimate causality probabilities across all traits and variants, reducing the size of the 90% confidence set from an average of 17.5 to 13.5 variants per locus in this data. 
Genome-wide association studies (GWAS) have successfully identified numerous regions in the genome that harbor genetic variants that increase risk for various complex traits and diseases. However, it is generally the case that GWAS risk variants are not themselves causally affecting the trait, but rather, are correlated to the true causal variant through linkage disequilibrium (LD). Plausible causal variants are identified in fine-mapping studies through targeted sequencing followed by prioritization of variants for functional validation. In this work, we propose methods that leverage two sources of independent information, the association strength and genomic functional location, to prioritize causal variants. We demonstrate in simulations and empirical data that our approach reduces the number of SNPs that need to be selected for follow-up to identify the true causal variants at GWAS risk loci.
Affiliation(s)
- Gleb Kichaev
  - Bioinformatics Interdepartmental Program, University of California Los Angeles, Los Angeles, California, United States of America
- Wen-Yun Yang
  - Department of Computer Science, University of California Los Angeles, Los Angeles, California, United States of America
- Sara Lindstrom
  - Program in Genetic Epidemiology and Statistical Genetics, Harvard School of Public Health, Boston, Massachusetts, United States of America
- Farhad Hormozdiari
  - Department of Computer Science, University of California Los Angeles, Los Angeles, California, United States of America
- Eleazar Eskin
  - Bioinformatics Interdepartmental Program, University of California Los Angeles, Los Angeles, California, United States of America
  - Department of Computer Science, University of California Los Angeles, Los Angeles, California, United States of America
  - Department of Human Genetics, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California, United States of America
- Alkes L. Price
  - Program in Genetic Epidemiology and Statistical Genetics, Harvard School of Public Health, Boston, Massachusetts, United States of America
  - Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts, United States of America
- Peter Kraft
  - Program in Genetic Epidemiology and Statistical Genetics, Harvard School of Public Health, Boston, Massachusetts, United States of America
  - Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts, United States of America
- Bogdan Pasaniuc
  - Bioinformatics Interdepartmental Program, University of California Los Angeles, Los Angeles, California, United States of America
  - Department of Human Genetics, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California, United States of America
  - Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California, United States of America
39
Lee D, Hormozdiari F, Xin H, Hach F, Mutlu O, Alkan C. Fast and accurate mapping of Complete Genomics reads. Methods 2014; 79-80:3-10. [PMID: 25461772 DOI: 10.1016/j.ymeth.2014.10.012] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 05/02/2014] [Revised: 10/01/2014] [Accepted: 10/13/2014] [Indexed: 12/31/2022]
Abstract
Many recent advances in genomics and the expectations of personalized medicine are made possible thanks to the power of high throughput sequencing (HTS) in sequencing large collections of human genomes. There are tens of different sequencing technologies currently available, and each HTS platform has different strengths and biases. This diversity makes it possible to use different technologies to correct for shortcomings, but it also requires developing different algorithms for each platform due to the differences in data types and error models. The first problem to tackle in analyzing HTS data for resequencing applications is the read mapping stage, where many tools have been developed for the most popular HTS methods, but publicly available and open source aligners are still lacking for the Complete Genomics (CG) platform. Unfortunately, Burrows-Wheeler based methods are not practical for CG data due to the gapped nature of the reads generated by this method. Here we provide a sensitive read mapper (sirFAST) for the CG technology based on the seed-and-extend paradigm that can quickly map CG reads to a reference genome. We evaluate the performance and accuracy of sirFAST using both simulated and publicly available real data sets, showing high precision and recall rates.
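The seed-and-extend paradigm named in the abstract can be sketched in a few lines. The version below is a generic, ungapped illustration: it indexes reference k-mers in a hash table, uses non-overlapping seeds from the read to find candidate locations, and verifies each candidate by counting mismatches. The function names, the toy reference, and the mismatch threshold are assumptions; it does not model the gapped structure of CG reads and is not the sirFAST implementation.

```python
from collections import defaultdict

def build_index(reference, k):
    """Hash-table index from each k-mer to its start positions in the reference."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def map_read(read, reference, index, k, max_mismatches=2):
    """Seed: look up non-overlapping k-mers of the read in the index.
    Extend: verify each implied start position by Hamming distance."""
    candidates = set()
    for offset in range(0, len(read) - k + 1, k):
        for pos in index.get(read[offset:offset + k], ()):
            start = pos - offset                      # implied start of the whole read
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    hits = []
    for start in sorted(candidates):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits

# Toy example: the second read carries one substitution relative to the reference.
toy_reference = "GATTACAGGCTTAACCGT"
toy_index = build_index(toy_reference, k=4)
exact_hits = map_read("GGCTTAAC", toy_reference, toy_index, k=4)
noisy_hits = map_read("GGCTTAGC", toy_reference, toy_index, k=4)
```

A read is found as long as at least one seed matches exactly, which is why seed length trades sensitivity against the number of candidate locations that must be verified.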
Affiliation(s)
- Donghyuk Lee
  - Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
- Farhad Hormozdiari
  - Department of Computer Science, University of California Los Angeles, Los Angeles, CA, USA
- Hongyi Xin
  - Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
- Faraz Hach
  - School of Computing Science, Simon Fraser University, Burnaby, BC, Canada
- Onur Mutlu
  - Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
- Can Alkan
  - Department of Computer Engineering, Bilkent University, Ankara, Turkey
40
Hormozdiari F, Joo JWJ, Wadia A, Guan F, Ostrosky R, Sahai A, Eskin E. Privacy preserving protocol for detecting genetic relatives using rare variants. Bioinformatics 2014; 30:i204-11. [PMID: 24931985 PMCID: PMC4058916 DOI: 10.1093/bioinformatics/btu294] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Indexed: 11/13/2022]
Abstract
MOTIVATION: High-throughput sequencing technologies have impacted many areas of genetic research. One such area is the identification of relatives from genetic data. The standard approach for the identification of genetic relatives collects the genomic data of all individuals and stores it in a database. Then, each pair of individuals is compared to detect the set of genetic relatives, and the matched individuals are informed. The main drawback of this approach is the requirement of sharing one's genetic data with a trusted third party to perform the relatedness test. RESULTS: In this work, we propose a secure protocol to detect genetic relatives from sequencing data while not exposing any information about their genomes. We assume that individuals have access to their genome sequences but do not want to share their genomes with anyone else. Unlike previous approaches, our approach uses both common and rare variants, which provides the ability to detect much more distant relationships securely. We use simulated data generated from the 1000 Genomes data and illustrate that we can easily detect up to fifth-degree cousins, which was not possible using the existing methods. We also show in the 1000 Genomes data with cryptic relationships that our method can detect these individuals. AVAILABILITY: The software is freely available for download at http://genetics.cs.ucla.edu/crypto/.
Affiliation(s)
- Farhad Hormozdiari
  - Department of Computer Science, Bioinformatics IDP, Department of Mathematics and Department of Human Genetics, University of California, LA 90095, USA
- Jong Wha J Joo
  - Department of Computer Science, Bioinformatics IDP, Department of Mathematics and Department of Human Genetics, University of California, LA 90095, USA
- Akshay Wadia
  - Department of Computer Science, Bioinformatics IDP, Department of Mathematics and Department of Human Genetics, University of California, LA 90095, USA
- Feng Guan
  - Department of Computer Science, Bioinformatics IDP, Department of Mathematics and Department of Human Genetics, University of California, LA 90095, USA
- Rafail Ostrosky
  - Department of Computer Science, Bioinformatics IDP, Department of Mathematics and Department of Human Genetics, University of California, LA 90095, USA
- Amit Sahai
  - Department of Computer Science, Bioinformatics IDP, Department of Mathematics and Department of Human Genetics, University of California, LA 90095, USA
- Eleazar Eskin
  - Department of Computer Science, Bioinformatics IDP, Department of Mathematics and Department of Human Genetics, University of California, LA 90095, USA
41
Hormozdiari F, Kostem E, Kang EY, Pasaniuc B, Eskin E. Identifying causal variants at loci with multiple signals of association. Genetics 2014; 198:497-508. [PMID: 25104515 PMCID: PMC4196608 DOI: 10.1534/genetics.114.167908] [Citation(s) in RCA: 263] [Impact Index Per Article: 26.3] [Received: 05/29/2014] [Accepted: 07/18/2014] [Indexed: 12/22/2022]
Abstract
Although genome-wide association studies have successfully identified thousands of risk loci for complex traits, only a handful of the biologically causal variants, responsible for association at these loci, have been successfully identified. Current statistical methods for identifying causal variants at risk loci either use the strength of the association signal in an iterative conditioning framework or estimate probabilities for variants to be causal. A main drawback of existing methods is that they rely on the simplifying assumption of a single causal variant at each risk locus, which is typically invalid at many risk loci. In this work, we propose a new statistical framework that allows for the possibility of an arbitrary number of causal variants when estimating the posterior probability of a variant being causal. A direct benefit of our approach is that we predict a set of variants for each locus that under reasonable assumptions will contain all of the true causal variants with a high confidence level (e.g., 95%) even when the locus contains multiple causal variants. We use simulations to show that our approach provides 20-50% improvement in our ability to identify the causal variants compared to the existing methods at loci harboring multiple causal variants. We validate our approach using empirical data from an expression QTL study of CHI3L2 to identify new causal variants that affect gene expression at this locus. CAVIAR is publicly available online at http://genetics.cs.ucla.edu/caviar/.
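The idea of a set that captures all causal variants with a chosen confidence, while allowing more than one causal variant, can be illustrated with a small sketch. It enumerates causal configurations up to a cap, scores each under a Gaussian model in which causal variants add variance through the LD matrix, and greedily grows a set until the configurations contained in it hold posterior mass of at least rho. The model form, the per-variant prior, `effect_var`, the greedy ordering, and all names are simplifying assumptions for illustration, not the published CAVIAR implementation.

```python
import itertools
import numpy as np

def causal_set(z, ld, max_causal=2, prior=0.1, effect_var=25.0, rho=0.95):
    """Return (selected variant indices, per-variant posterior inclusion
    probabilities) under an assumed model z | c ~ N(0, LD + effect_var * LD diag(c) LD)."""
    z = np.asarray(z, dtype=float)
    ld = np.asarray(ld, dtype=float)
    m = len(z)
    configs, log_post = [], []
    for k in range(max_causal + 1):
        for causal in itertools.combinations(range(m), k):
            c = np.zeros(m)
            c[list(causal)] = 1.0
            # A small ridge keeps the covariance invertible under strong LD.
            cov = ld + effect_var * ld @ np.diag(c) @ ld + 1e-8 * np.eye(m)
            _, logdet = np.linalg.slogdet(cov)
            loglik = -0.5 * (logdet + z @ np.linalg.solve(cov, z))
            logprior = k * np.log(prior) + (m - k) * np.log1p(-prior)
            configs.append(c)
            log_post.append(loglik + logprior)
    configs = np.array(configs)
    log_post = np.array(log_post)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()                       # normalized posterior over configurations
    pip = post @ configs                     # posterior inclusion probability per variant
    selected = np.zeros(m, dtype=bool)
    chosen = []
    for j in np.argsort(-pip):
        # Mass of configurations whose causal variants all lie inside the current set.
        inside = ~np.any(configs[:, ~selected] > 0, axis=1)
        if post[inside].sum() >= rho:
            break
        selected[j] = True
        chosen.append(int(j))
    return chosen, pip

# Toy locus: three uncorrelated variants, one with a strong association signal.
toy_set, toy_pip = causal_set([6.0, 0.5, -0.3], np.eye(3), rho=0.9)
```

Exhaustive enumeration grows combinatorially with the number of variants, so the cap on causal variants per locus is what keeps the computation tractable in this sketch.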
Affiliation(s)
- Farhad Hormozdiari
  - Department of Computer Science, University of California, Los Angeles, California 90095
- Emrah Kostem
  - Department of Computer Science, University of California, Los Angeles, California 90095
- Eun Yong Kang
  - Department of Computer Science, University of California, Los Angeles, California 90095
- Bogdan Pasaniuc
  - Department of Human Genetics, University of California, Los Angeles, California 90095
  - Department of Pathology and Laboratory Medicine, University of California, Los Angeles, California 90095
- Eleazar Eskin
  - Department of Computer Science, University of California, Los Angeles, California 90095
  - Department of Human Genetics, University of California, Los Angeles, California 90095
42
Hasin-Brumshtein Y, Hormozdiari F, Martin L, van Nas A, Eskin E, Lusis AJ, Drake TA. Allele-specific expression and eQTL analysis in mouse adipose tissue. BMC Genomics 2014; 15:471. [PMID: 24927774 PMCID: PMC4089026 DOI: 10.1186/1471-2164-15-471] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.8] [Received: 11/15/2013] [Accepted: 05/07/2014] [Indexed: 11/17/2022]
Abstract
Background: The simplest definition of cis-eQTLs, as opposed to trans-eQTLs, refers to genetic variants that affect expression in an allele-specific manner, with implications for the underlying mechanism. Yet, due to technical limitations of expression microarrays, the vast majority of eQTL studies performed in the last decade used a genomic-distance-based definition as a surrogate for cis, therefore exploring local rather than cis-eQTLs. Results: In this study we use RNAseq to explore allele-specific expression (ASE) in adipose tissue of male and female F1 mice produced from reciprocal crosses of the C57BL/6J and DBA/2J strains. Comparison of the identified cis-eQTLs with local-eQTLs obtained from adipose tissue expression in two previous population-based studies in our laboratory yields poor overlap between the two mapping approaches, while the two local-eQTL studies show highly concordant results. Specifically, the local-eQTL studies show ~60% overlap with each other, while only 15-20% of local-eQTLs are identified as cis by ASE, and less than 50% of ASE genes are recovered in local-eQTL studies. Utilizing recently published ENCODE data, we also find that ASE genes show a significant bias in SNP prevalence in DNase I hypersensitive sites that is specific to the direction of ASE. Conclusions: We suggest a new approach to the analysis of allele-specific expression that is more sensitive and accurate than the commonly used Fisher or chi-square statistics. Our analysis indicates that technical differences between the cis and local-eQTL approaches, such as differences in genomic background or sex specificity, account for a relatively small fraction of the discrepancy. Therefore, we suggest that the differences between the two eQTL mapping approaches may facilitate sorting of SNP-eQTL interactions into true cis and trans, and that a considerable portion of local-eQTLs may actually represent trans interactions.
Electronic supplementary material The online version of this article (doi:10.1186/1471-2164-15-471) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Yehudit Hasin-Brumshtein
- Department of Medicine/Division of Cardiology, David Geffen School of Medicine, University of California, Los Angeles, CA 90095, USA.
|
43
|
Hach F, Sarrafi I, Hormozdiari F, Alkan C, Eichler EE, Sahinalp SC. mrsFAST-Ultra: a compact, SNP-aware mapper for high performance sequencing applications. Nucleic Acids Res 2014; 42:W494-500. [PMID: 24810850 PMCID: PMC4086126 DOI: 10.1093/nar/gku370] [Citation(s) in RCA: 47] [Impact Index Per Article: 4.7] [Indexed: 12/31/2022]
Abstract
High throughput sequencing (HTS) platforms generate unprecedented amounts of data that introduce challenges for processing and downstream analysis. While tools that report the ‘best’ mapping location of each read provide a fast way to process HTS data, they are not suitable for many types of downstream analysis such as structural variation detection, where it is important to report multiple mapping loci for each read. For this purpose we introduce mrsFAST-Ultra, a fast, cache oblivious, SNP-aware aligner that can handle the multi-mapping of HTS reads very efficiently. mrsFAST-Ultra improves mrsFAST, our first cache oblivious read aligner capable of handling multi-mapping reads, through new and compact index structures that reduce not only the overall memory usage but also the number of CPU operations per alignment. In fact the size of the index generated by mrsFAST-Ultra is 10 times smaller than that of mrsFAST. As importantly, mrsFAST-Ultra introduces new features such as being able to (i) obtain the best mapping loci for each read, and (ii) return all reads that have at most n mapping loci (within an error threshold), together with these loci, for any user specified n. Furthermore, mrsFAST-Ultra is SNP-aware, i.e. it can map reads to the reference genome while discounting the mismatches that occur at common SNP locations provided by dbSNP; this significantly increases the number of reads that can be mapped to the reference genome. Notice that all of the above features are implemented within the index structure and are not simple post-processing steps and thus are performed highly efficiently. Finally, mrsFAST-Ultra utilizes multiple available cores and processors and can be tuned for various memory settings. Our results show that mrsFAST-Ultra is roughly five times faster than its predecessor mrsFAST. In comparison to newly enhanced popular tools such as Bowtie2, it is more sensitive (it can report 10 times or more mappings per read) and much faster (six times or more) in the multi-mapping mode. Furthermore, mrsFAST-Ultra has an index size of 2GB for the entire human reference genome, which is roughly half of that of Bowtie2. mrsFAST-Ultra is open source and it can be accessed at http://mrsfast.sourceforge.net.
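The notion of SNP-aware mapping, where mismatches at known SNP positions are not counted against the error threshold, can be illustrated with a brute-force sketch. This is not how mrsFAST-Ultra works internally (the abstract states that the feature lives inside its index structure); the function names are hypothetical, and the sketch ignores which alleles are listed for each SNP, which a real mapper would presumably check.

```python
def snp_aware_mismatches(read, ref, ref_start, snp_positions):
    """Count mismatches between `read` and the reference window starting at
    `ref_start`, skipping reference coordinates listed in `snp_positions`."""
    window = ref[ref_start:ref_start + len(read)]
    if len(window) != len(read):
        raise ValueError("read runs past the end of the reference")
    return sum(
        1
        for offset, (read_base, ref_base) in enumerate(zip(read, window))
        if read_base != ref_base and (ref_start + offset) not in snp_positions
    )


def map_read(read, ref, snp_positions, max_errors):
    """Return every start position where the read fits within `max_errors`
    (multi-mapping: all qualifying loci are reported, not only the best)."""
    return [
        start
        for start in range(len(ref) - len(read) + 1)
        if snp_aware_mismatches(read, ref, start, snp_positions) <= max_errors
    ]


reference = "ACGTACGTAC"
read = "ACGA"
known_snps = {3}  # assumed: reference coordinate 3 is a common SNP

exact_hits_with_snps = map_read(read, reference, known_snps, max_errors=0)
exact_hits_without_snps = map_read(read, reference, set(), max_errors=0)
```

With the SNP at coordinate 3 discounted, the read maps at position 0 with zero counted errors; without SNP awareness it has no exact hit, which mirrors the abstract's point that SNP awareness increases the number of mappable reads.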
Affiliation(s)
- Faraz Hach
- School of Computing Science, Simon Fraser University, Burnaby, BC, Canada, V5A 1S6
- Iman Sarrafi
- School of Computing Science, Simon Fraser University, Burnaby, BC, Canada, V5A 1S6
- Farhad Hormozdiari
- Computer Science Department, University of California, Los Angeles, CA, USA, 90095
- Can Alkan
- Department of Computer Engineering, Bilkent University, 06800 Ankara, Turkey
- Evan E Eichler
- Department of Genome Sciences, University of Washington, Seattle, WA, USA, 98195
- S Cenk Sahinalp
- School of Computing Science, Simon Fraser University, Burnaby, BC, Canada, V5A 1S6
- School of Informatics and Computing, Indiana University, Bloomington, IN, USA, 47405
|
44
|
Orozco LD, Rubbi L, Martin LJ, Fang F, Hormozdiari F, Che N, Smith AD, Lusis AJ, Pellegrini M. Intergenerational genomic DNA methylation patterns in mouse hybrid strains. Genome Biol 2014; 15:R68. [PMID: 24887417 PMCID: PMC4076608 DOI: 10.1186/gb-2014-15-5-r68] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.7] [Received: 08/08/2013] [Accepted: 04/30/2014] [Indexed: 11/10/2022]
Abstract
Background DNA methylation is a contributing factor to both rare and common human diseases, and plays a major role in development and gene silencing. While the variation of DNA methylation among individuals has been partially characterized, the degree to which methylation patterns are preserved across generations is still poorly understood. To determine the extent of methylation differences between two generations of mice we examined DNA methylation patterns in the livers of eight parental and F1 mice from C57BL/6J and DBA/2J mouse strains using bisulfite sequencing. Results We find a large proportion of reproducible methylation differences between C57BL/6J and DBA/2J chromosomes in CpGs, which are highly heritable between parent and F1 mice. We also find sex differences in methylation levels in 396 genes, and 11% of these are differentially expressed between females and males. Using a recently developed approach to identify allelically methylated regions independently of genotypic differences, we identify 112 novel putative imprinted genes and microRNAs, and validate imprinting at the RNA level in 10 of these genes. Conclusions The majority of DNA methylation differences among individuals are associated with genetic differences, and a much smaller proportion of these epigenetic differences are due to sex, imprinting or stochastic intergenerational effects. Epigenetic differences can be a determining factor in heritable traits and should be considered in association studies for molecular and clinical traits, as we observed that methylation differences in the mouse model are highly heritable and can have functional consequences on molecular traits such as gene expression.
|
45
|
Abstract
The development of high-throughput genomic technologies has impacted many areas of genetic research. While many applications of these technologies focus on the discovery of genes involved in disease from population samples, applications of genomic technologies to an individual's genome or personal genomics have recently gained much interest. One such application is the identification of relatives from genetic data. In this application, genetic information from a set of individuals is collected in a database, and each pair of individuals is compared in order to identify genetic relatives. An inherent issue that arises in the identification of relatives is privacy. In this article, we propose a method for identifying genetic relatives without compromising privacy by taking advantage of novel cryptographic techniques customized for secure and private comparison of genetic information. We demonstrate the utility of these techniques by allowing a pair of individuals to discover whether or not they are related without compromising their genetic information or revealing it to a third party. The idea is that individuals only share enough special-purpose cryptographically protected information with each other to identify whether or not they are relatives, but not enough to expose any information about their genomes. We show in HapMap and 1000 Genomes data that our method can recover first- and second-order genetic relationships and, through simulations, show that our method can identify relationships as distant as third cousins while preserving privacy.
Affiliation(s)
- Dan He
- Department of Computer Science, University of California, Los Angeles, Los Angeles, California 90095,USA
|
46
|
Frésard L, Leroux S, Servin B, Gourichon D, Dehais P, Cristobal MS, Marsaud N, Vignoles F, Bed'hom B, Coville JL, Hormozdiari F, Beaumont C, Zerjal T, Vignal A, Morisson M, Lagarrigue S, Pitel F. Transcriptome-wide investigation of genomic imprinting in chicken. Nucleic Acids Res 2014; 42:3768-82. [PMID: 24452801 PMCID: PMC3973300 DOI: 10.1093/nar/gkt1390] [Citation(s) in RCA: 53] [Impact Index Per Article: 5.3] [Indexed: 01/04/2023]
Abstract
Genomic imprinting is an epigenetic mechanism by which alleles of some specific genes are expressed in a parent-of-origin manner. It has been observed in mammals and marsupials, but not in birds. Until now, only a few genes orthologous to mammalian imprinted ones have been analyzed in chicken, and none showed any evidence of imprinting in this species. However, several published observations, such as imprinted-like QTL in poultry or reciprocal effects, keep the question open. Our main objective was thus to screen the entire chicken genome for parental-allele-specific differential expression on whole embryonic transcriptomes, using high-throughput sequencing. To identify the parental origin of each observed haplotype, two chicken experimental populations were used, as inbred and as genetically distant as possible. Two families were produced from two reciprocal crosses. Transcripts from 20 embryos were sequenced using NGS technology, producing ∼200 Gb of sequences. This allowed the detection of 79 potentially imprinted SNPs, through an analysis method that we validated by detecting imprinting in previously published mouse data. However, out of 23 candidates tested by pyrosequencing, none could be confirmed. These results agree, without a priori assumptions, with previous reports and phylogenetic considerations supporting the absence of genomic imprinting in chicken.
Affiliation(s)
- Laure Frésard
- INRA, UMR444 Laboratoire de Génétique Cellulaire, Castanet-Tolosan F-31326, France, ENVT, UMR444 Laboratoire de Génétique Cellulaire, Toulouse F-31076, France, INRA, PEAT Pôle d'Expérimentation Avicole de Tours, Nouzilly F- 37380, France, INRA, Sigenae UR875 Biométrie et Intelligence Artificielle, Castanet-Tolosan F-31326, France, INRA, GeT-PlaGe Genotoul, Castanet-Tolosan F-31326, France, INRA, UMR1313 Génétique animale et biologie intégrative, Jouy en Josas F-78350, France, AgroParisTech, UMR1313 Génétique animale et biologie intégrative, Jouy en Josas F-78350, France, Department of Computer Sciences, University of California, Los Angeles, CA 90095, USA, INRA, UR83 Recherche Avicoles, Nouzilly F- 37380, France and Agrocampus Ouest, UMR1348 Physiologie, Environnement et Génétique pour l'Animal et les Systèmes d'Élevage, Animal Genetics Laboratory, Rennes F-35000, France
|
47
|
Freedman AH, Gronau I, Schweizer RM, Ortega-Del Vecchyo D, Han E, Silva PM, Galaverni M, Fan Z, Marx P, Lorente-Galdos B, Beale H, Ramirez O, Hormozdiari F, Alkan C, Vilà C, Squire K, Geffen E, Kusak J, Boyko AR, Parker HG, Lee C, Tadigotla V, Siepel A, Bustamante CD, Harkins TT, Nelson SF, Ostrander EA, Marques-Bonet T, Wayne RK, Novembre J. Genome sequencing highlights the dynamic early history of dogs. PLoS Genet 2014; 10:e1004016. [PMID: 24453982 PMCID: PMC3894170 DOI: 10.1371/journal.pgen.1004016] [Citation(s) in RCA: 323] [Impact Index Per Article: 32.3] [Received: 06/14/2013] [Accepted: 10/28/2013] [Indexed: 11/18/2022]
Abstract
To identify genetic changes underlying dog domestication and reconstruct their early evolutionary history, we generated high-quality genome sequences from three gray wolves, one from each of the three putative centers of dog domestication, two basal dog lineages (Basenji and Dingo) and a golden jackal as an outgroup. Analysis of these sequences supports a demographic model in which dogs and wolves diverged through a dynamic process involving population bottlenecks in both lineages and post-divergence gene flow. In dogs, the domestication bottleneck involved at least a 16-fold reduction in population size, a much more severe bottleneck than estimated previously. A sharp bottleneck in wolves occurred soon after their divergence from dogs, implying that the pool of diversity from which dogs arose was substantially larger than represented by modern wolf populations. We narrow the plausible range for the date of initial dog domestication to an interval spanning 11–16 thousand years ago, predating the rise of agriculture. In light of this finding, we expand upon previous work regarding the increase in copy number of the amylase gene (AMY2B) in dogs, which is believed to have aided digestion of starch in agricultural refuse. We find standing variation for amylase copy number variation in wolves and little or no copy number increase in the Dingo and Husky lineages. In conjunction with the estimated timing of dog origins, these results provide additional support to archaeological finds, suggesting the earliest dogs arose alongside hunter-gatherers rather than agriculturists. Regarding the geographic origin of dogs, we find that, surprisingly, none of the extant wolf lineages from putative domestication centers is more closely related to dogs, and, instead, the sampled wolves form a sister monophyletic clade. This result, in combination with dog-wolf admixture during the process of domestication, suggests that a re-evaluation of past hypotheses regarding dog origins is necessary.
The process of dog domestication is still poorly understood, largely because no studies thus far have leveraged deeply sequenced whole genomes from wolves and dogs to simultaneously evaluate support for the proposed source regions: East Asia, the Middle East, and Europe. To investigate dog origins, we sequence three wolf genomes from the putative centers of origin, two basal dog breeds (Basenji and Dingo), and a golden jackal as an outgroup. We find that none of the wolf lineages from the hypothesized domestication centers is supported as the source lineage for dogs, and that dogs and wolves diverged 11,000–16,000 years ago in a process that involved extensive admixture and was followed by a bottleneck in wolves. In addition, we investigate the amylase (AMY2B) gene family expansion in dogs, which has recently been suggested as being critical to domestication in response to increased dietary starch. We find standing variation in AMY2B copy number in wolves and show that some breeds, such as Dingo and Husky, lack the AMY2B expansion. This suggests that, at the beginning of the domestication process, dogs may have been characterized by a more carnivorous diet than their modern day counterparts, a diet held in common with early hunter-gatherers.
Affiliation(s)
- Adam H. Freedman
- Department of Ecology and Evolutionary Biology, University of California, Los Angeles, Los Angeles, California, United States of America
- Ilan Gronau
- Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York, United States of America
- Rena M. Schweizer
- Department of Ecology and Evolutionary Biology, University of California, Los Angeles, Los Angeles, California, United States of America
- Diego Ortega-Del Vecchyo
- Department of Ecology and Evolutionary Biology, University of California, Los Angeles, Los Angeles, California, United States of America
- Eunjung Han
- Department of Ecology and Evolutionary Biology, University of California, Los Angeles, Los Angeles, California, United States of America
- Zhenxin Fan
- Key Laboratory of Bioresources and Ecoenvironment, Sichuan University, Chengdu, China
- Peter Marx
- Department of Measurement and Information Systems, Budapest University of Technology and Economics, Budapest, Hungary
- Holly Beale
- National Institutes of Health/NHGRI, Bethesda, Maryland, United States of America
- Oscar Ramirez
- Institut de Biologia Evolutiva (CSIC-Univ Pompeu Fabra), Barcelona, Spain
- Farhad Hormozdiari
- Department of Computer Science, University of California, Los Angeles, Los Angeles, California, United States of America
- Carles Vilà
- Estación Biológica de Doñana EBD-CSIC, Sevilla, Spain
- Kevin Squire
- Department of Human Genetics, University of California, Los Angeles, Los Angeles, California, United States of America
- Eli Geffen
- Department of Zoology, Tel Aviv University, Tel Aviv, Israel
- Adam R. Boyko
- Department of Veterinary Medicine, Cornell University, Ithaca, New York, United States of America
- Heidi G. Parker
- National Institutes of Health/NHGRI, Bethesda, Maryland, United States of America
- Clarence Lee
- Life Technologies, Foster City, California, United States of America
- Vasisht Tadigotla
- Life Technologies, Foster City, California, United States of America
- Adam Siepel
- Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York, United States of America
- Stanley F. Nelson
- Department of Human Genetics, University of California, Los Angeles, Los Angeles, California, United States of America
- Elaine A. Ostrander
- National Institutes of Health/NHGRI, Bethesda, Maryland, United States of America
- Tomas Marques-Bonet
- Institut de Biologia Evolutiva (CSIC-Univ Pompeu Fabra), Barcelona, Spain
- Institució Catalana de Recerca i Estudis Avançats (ICREA). 08010, Barcelona, Spain
- Robert K. Wayne
- Department of Ecology and Evolutionary Biology, University of California, Los Angeles, Los Angeles, California, United States of America
- * E-mail: (RKW); (JN)
- John Novembre
- Department of Ecology and Evolutionary Biology, University of California, Los Angeles, Los Angeles, California, United States of America
- * E-mail: (RKW); (JN)
|
48
|
Eskin I, Hormozdiari F, Conde L, Riby J, Skibola CF, Eskin E, Halperin E. eALPS: estimating abundance levels in pooled sequencing using available genotyping data. J Comput Biol 2013; 20:861-77. [PMID: 24144111 DOI: 10.1089/cmb.2013.0105] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 01/16/2023]
Abstract
Recent advances in high-throughput sequencing technologies bring the potential for a better characterization of the genetic variation in humans and other organisms. On many occasions, either by design or by necessity, the sequencing procedure is performed on a pool of DNA samples with different abundances, where the abundance of each sample is unknown. Such a scenario occurs naturally in metagenomics analysis, where a pool of bacteria is sequenced, and in population studies that involve DNA pools by design. In particular, various pooling designs were recently suggested that can identify carriers of rare alleles in large cohorts, dramatically reducing the cost of such large-scale sequencing projects. A fundamental problem with such approaches for population studies is that the uncertainty of DNA proportions from different individuals in the pools might lead to spurious associations. Fortunately, it is often the case that the genotype data of at least some of the individuals in the pool are known. Here, we propose a method (eALPS) that uses the genotype data in conjunction with the pooled sequence data in order to accurately estimate the proportions of the samples in the pool, even in cases where not all individuals in the pool were genotyped (eALPS-LD). Using real data from a sequencing pooling study of non-Hodgkin's lymphoma, we demonstrate that the estimation of the proportions is crucial, since otherwise there is a risk of false discoveries. Additionally, we demonstrate that our approach is also applicable to the problem of quantification of species in metagenomics samples (eALPS-BCR) and is particularly suitable for metagenomic quantification of closely related species.
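The core estimation problem, recovering each sample's share of a pool from known genotypes and pooled allele frequencies, can be sketched as constrained least squares. This is not the eALPS estimator, whose model is not given in the abstract. The sketch assumes diploid genotypes coded 0/1/2, noise-free pooled frequencies, and that every individual in the pool is genotyped; it solves the problem by projected gradient descent onto the probability simplex. All names and the toy data are hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {p : p_i >= 0, sum(p) = 1}."""
    cumulative, theta = 0.0, 0.0
    for k, value in enumerate(sorted(v, reverse=True), start=1):
        cumulative += value
        shift = (cumulative - 1.0) / k
        if value - shift > 0:
            theta = shift
    return [max(x - theta, 0.0) for x in v]


def estimate_proportions(genotypes, pooled_freq, steps=2000, lr=0.05):
    """Minimise the squared gap between observed pooled allele frequencies
    and the mixture sum_i p_i * g_ij / 2, keeping p on the simplex.

    genotypes[i][j] is the alternate-allele count (0, 1 or 2) of individual i
    at SNP j. The step size `lr` suits this small example and would need
    lowering as the number of SNPs grows.
    """
    n, m = len(genotypes), len(pooled_freq)
    p = [1.0 / n] * n
    for _ in range(steps):
        residual = [
            sum(p[i] * genotypes[i][j] / 2 for i in range(n)) - pooled_freq[j]
            for j in range(m)
        ]
        gradient = [
            sum(residual[j] * genotypes[i][j] / 2 for j in range(m))
            for i in range(n)
        ]
        p = project_to_simplex([p[i] - lr * gradient[i] for i in range(n)])
    return p


# Toy data (assumed): three genotyped individuals, six SNPs.
genotypes = [
    [2, 0, 0, 1, 2, 0],
    [0, 2, 0, 1, 0, 2],
    [0, 0, 2, 0, 1, 1],
]
true_p = [0.5, 0.3, 0.2]
pooled = [sum(true_p[i] * genotypes[i][j] / 2 for i in range(3)) for j in range(6)]
estimate = estimate_proportions(genotypes, pooled)
```

Real pooled data add sequencing noise and, as the abstract notes, ungenotyped individuals, so the sketch only conveys why known genotypes make the proportions identifiable.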
Affiliation(s)
- Itamar Eskin
- The Blavatnik School of Computer Science, Tel-Aviv University, Tel Aviv, Israel
|
49
|
Yang WY, Hormozdiari F, Wang Z, He D, Pasaniuc B, Eskin E. Leveraging reads that span multiple single nucleotide polymorphisms for haplotype inference from sequencing data. Bioinformatics 2013; 29:2245-52. [PMID: 23825370 PMCID: PMC3753566 DOI: 10.1093/bioinformatics/btt386] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.9] [Received: 04/17/2012] [Revised: 06/19/2013] [Accepted: 06/28/2013] [Indexed: 01/05/2023]
Abstract
MOTIVATION: Haplotypes, defined as the sequence of alleles on one chromosome, are crucial for many genetic analyses. As experimental determination of haplotypes is extremely expensive, haplotypes are traditionally inferred using computational approaches from genotype data, i.e. the mixture of the genetic information from both haplotypes. The best performing approaches for haplotype inference rely on Hidden Markov Models, with the underlying assumption that the haplotypes of a given individual can be represented as a mosaic of segments from other haplotypes in the same population. Such algorithms use this model to predict the most likely haplotypes that explain the observed genotype data conditional on a reference panel of haplotypes. With rapid advances in short read sequencing technologies, sequencing is quickly establishing itself as a powerful approach for collecting genetic variation information. As opposed to traditional genotyping-array technologies that independently call genotypes at polymorphic sites, short read sequencing often collects haplotypic information; a read spanning more than one polymorphic locus (multi-single nucleotide polymorphic read) contains information on the haplotype from which the read originates. However, this information is generally ignored in existing approaches for haplotype phasing and genotype-calling from short read data. RESULTS: In this article, we propose a novel framework for haplotype inference from short read sequencing that leverages multi-single nucleotide polymorphic reads together with a reference panel of haplotypes. The basis of our approach is a new probabilistic model that finds the most likely haplotype segments from the reference panel to explain the short read sequencing data for a given individual. We devised an efficient sampling method within a probabilistic model to achieve performance superior to existing methods. Using simulated sequencing reads from real individual genotypes in the HapMap data and the 1000 Genomes projects, we show that our method is highly accurate and computationally efficient. Our haplotype predictions improve accuracy over the basic haplotype copying model by ∼20% with comparable computational time, and over another recently proposed approach, Hap-SeqX, by ∼10% with significantly reduced computational time and memory usage. AVAILABILITY: Software is publicly available at http://genetics.cs.ucla.edu/harsh CONTACT: bpasaniuc@mednet.ucla.edu or eeskin@cs.ucla.edu.
Affiliation(s)
- Wen-Yun Yang
- Department of Computer Science and Inter-Departmental Program in Bioinformatics, University of California, Los Angeles, CA 90095, USA
|
50
|
Abstract
Copy number variations (CNVs) are widely known to be an important mediator for diseases and traits. The development of high-throughput sequencing (HTS) technologies has provided great opportunities to identify CNV regions in mammalian genomes. In a typical experiment, millions of short reads obtained from a genome of interest are mapped to a reference genome. The mapping information can be used to identify CNV regions. One important challenge in analyzing the mapping information is the large fraction of reads that can be mapped to multiple positions. Most existing methods either only consider reads that can be uniquely mapped to the reference genome or randomly place a read to one of its mapping positions. Therefore, these methods have low power to detect CNVs located within repeated sequences. In this study, we propose a probabilistic model, CNVeM, that utilizes the inherent uncertainty of read mapping. We use maximum likelihood to estimate locations and copy numbers of copied regions and implement an expectation-maximization (EM) algorithm. One important contribution of our model is that we can distinguish between regions in the reference genome that differ from each other by as little as 0.1%. As our model aims to predict the copy number of each nucleotide, we can predict the CNV boundaries with high resolution. We apply our method to simulated datasets and achieve higher accuracy compared to CNVnator. Moreover, we apply our method to real data from which we detected known CNVs. To our knowledge, this is the first attempt to predict CNVs at nucleotide resolution and to utilize uncertainty of read mapping.
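The general idea of using an EM algorithm to handle reads that map to several positions can be sketched as follows. This is a generic mixture-model EM over candidate regions, not the CNVeM model itself (which works at nucleotide resolution and also models small sequence differences between regions). The function name, the region labels and the read counts are hypothetical, and turning the resulting read shares into copy numbers would additionally require normalising by region length and sequencing depth.

```python
def em_read_assignment(read_hits, regions, iterations=200):
    """Estimate each region's share of reads when some reads are ambiguous.

    read_hits is a list with one entry per read: the collection of regions
    that read maps to. Instead of dropping ambiguous reads or placing them
    at random, each read is split across its candidate regions in proportion
    to the current estimates (E-step), and the estimates are then refreshed
    from those fractional counts (M-step).
    """
    theta = {region: 1.0 / len(regions) for region in regions}
    for _ in range(iterations):
        counts = {region: 0.0 for region in regions}
        for hits in read_hits:
            total = sum(theta[region] for region in hits)
            for region in hits:
                counts[region] += theta[region] / total
        theta = {region: counts[region] / len(read_hits) for region in regions}
    return theta


# Toy data (assumed): 6 reads unique to A, 2 unique to B, 4 mapping to both.
reads = [("A",)] * 6 + [("B",)] * 2 + [("A", "B")] * 4
shares = em_read_assignment(reads, regions=["A", "B"])
```

In this toy case the ambiguous reads end up split 3:1 in favour of region A, following the evidence from uniquely mapped reads, whereas random placement would split them evenly.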
Affiliation(s)
- Zhanyong Wang
- Computer Science Department, University of California Los Angeles, Los Angeles, CA 90095-1596, USA
|