1
Vandersteen AM, Weerakkody RA, Parry DA, Kanonidou C, Toddie-Moore DJ, Vandrovcova J, Darlay R, Santoyo-Lopez J, Meynert A, Kazkaz H, Grahame R, Cummings C, Bartlett M, Ghali N, Brady AF, Pope FM, van Dijk FS, Cordell HJ, Aitman TJ. Genetic complexity of diagnostically unresolved Ehlers-Danlos syndrome. J Med Genet 2024; 61:232-238. [PMID: 37813462 DOI: 10.1136/jmg-2023-109329] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/10/2023] [Accepted: 09/18/2023] [Indexed: 10/17/2023]
Abstract
BACKGROUND The Ehlers-Danlos syndromes (EDS) are heritable disorders of connective tissue (HDCT), reclassified in the 2017 nosology into 13 subtypes. The genetic basis for hypermobile Ehlers-Danlos syndrome (hEDS) remains unknown. METHODS Whole exome sequencing (WES) was undertaken on 174 EDS patients recruited from a national diagnostic service for complex EDS and a specialist clinic for hEDS. Patients had already undergone expert phenotyping, laboratory investigation and gene sequencing, but were without a genetic diagnosis. Filtered WES data were reviewed for genes underlying Mendelian disorders and loci reported in EDS linkage, transcriptome and genome-wide association studies (GWAS). A genetic burden analysis (Minor Allele Frequency (MAF) <0.05) incorporating 248 Avon Longitudinal Study of Parents and Children (ALSPAC) controls sequenced as part of the UK10K study was undertaken using TASER methodology. RESULTS Heterozygous pathogenic (P) or likely pathogenic (LP) variants were identified in known EDS and Loeys-Dietz (LDS) genes. Multiple variants of uncertain significance where segregation and functional analysis may enable reclassification were found in genes associated with EDS, LDS, heritable thoracic aortic disease (HTAD), Mendelian disorders with EDS symptomatology and syndromes with EDS-like features. Genetic burden analysis revealed a number of novel loci, although none reached the threshold for genome-wide significance. Variants with biological plausibility were found in genes and pathways not currently associated with EDS or HTAD. CONCLUSIONS We demonstrate the clinical utility of large panel-based sequencing and WES for patients with complex EDS in distinguishing rare EDS subtypes, LDS and related syndromes. Although many of the P and LP variants reported in this cohort would be identified with current panel testing, they were not at the time of this study, highlighting the use of extended panels and WES as a clinical tool for complex EDS. 
Our results are consistent with the complex genetic architecture of EDS and suggest a number of novel hEDS and HTAD candidate genes and pathways.
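The burden analysis described in the abstract can be illustrated with a minimal Python sketch. It is not the TASER methodology used by the authors. It assumes a simple collapsing test in which a sample counts as a carrier if it holds at least one alternate allele at a variant with MAF < 0.05 in a gene, followed by a one-sided Fisher's exact test. The function names and the toy genotypes are hypothetical.

```python
from math import comb


def count_carriers(genotypes, mafs, maf_cutoff=0.05):
    """Count samples carrying >= 1 alternate allele at any variant below the MAF cutoff.

    genotypes: one list per sample of alternate-allele counts (0, 1 or 2), one per variant.
    mafs: minor allele frequency of each variant, in the same order.
    """
    rare = [j for j, maf in enumerate(mafs) if maf < maf_cutoff]
    return sum(1 for sample in genotypes if any(sample[j] > 0 for j in rare))


def fisher_one_sided(case_carriers, case_total, control_carriers, control_total):
    """One-sided Fisher's exact p-value for an excess of carriers among cases."""
    n = case_total + control_total
    carriers = case_carriers + control_carriers
    upper = min(case_total, carriers)
    # Sum hypergeometric probabilities for tables at least as extreme as observed.
    tail = sum(
        comb(case_total, k) * comb(control_total, carriers - k)
        for k in range(case_carriers, upper + 1)
    )
    return tail / comb(n, carriers)


# Hypothetical toy data for one gene: three variants, the last one is common.
mafs = [0.01, 0.03, 0.20]
cases = [[1, 0, 0], [0, 1, 2], [0, 0, 1], [1, 0, 0]]
controls = [[0, 0, 1], [0, 0, 0], [0, 1, 0], [0, 0, 2]]

a = count_carriers(cases, mafs)      # cases carrying a rare allele
c = count_carriers(controls, mafs)   # controls carrying a rare allele
p_value = fisher_one_sided(a, len(cases), c, len(controls))
```

With real exome data the same collapsing step would be applied per gene, and the resulting p-values would need correction for the number of genes tested.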
Affiliation(s)
- Anthony M Vandersteen
  - Maritime Medical Genetics Service, IWK Health Centre, Halifax, Nova Scotia, Canada
  - Faculty of Medicine, Department of Pediatrics, Dalhousie University, Halifax, Nova Scotia, Canada
  - Centre for Genomic and Experimental Medicine, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK
- Ruwan A Weerakkody
  - Centre for Genomic and Experimental Medicine, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK
  - Institute of Clinical Sciences, Imperial College London, London, UK
  - Department of Vascular Surgery, Royal Free Hospital, London, UK
- David A Parry
  - Centre for Genomic and Experimental Medicine, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK
- Christina Kanonidou
  - Department of Clinical Biochemistry, Queen Elizabeth University Hospital, NHS Greater Glasgow and Clyde, Glasgow, UK
- Daniel J Toddie-Moore
  - Centre for Genomic and Experimental Medicine, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK
- Jana Vandrovcova
  - Department of Neuromuscular Diseases, UCL Queen Square Institute of Neurology, University College London, London, UK
- Rebecca Darlay
  - Population Health Sciences Institute, Newcastle University, Newcastle upon Tyne, UK
- Alison Meynert
  - MRC Human Genetics Unit, University of Edinburgh, Edinburgh, UK
- Hanadi Kazkaz
  - Department of Rheumatology, University College London Hospitals NHS Foundation Trust, London, UK
- Rodney Grahame
  - Department of Rheumatology, University College London Hospitals NHS Foundation Trust, London, UK
- Carole Cummings
  - Ehlers-Danlos Syndrome National Diagnostic Service, London North West University Healthcare NHS Trust, Northwick Park Hospital, Harrow, UK
  - Department of Metabolism, Digestion and Reproduction Section of Genetics and Genomics, Imperial College London, London, UK
- Marion Bartlett
  - Ehlers-Danlos Syndrome National Diagnostic Service, London North West University Healthcare NHS Trust, Northwick Park Hospital, Harrow, UK
  - Department of Metabolism, Digestion and Reproduction Section of Genetics and Genomics, Imperial College London, London, UK
- Neeti Ghali
  - Ehlers-Danlos Syndrome National Diagnostic Service, London North West University Healthcare NHS Trust, Northwick Park Hospital, Harrow, UK
  - Department of Metabolism, Digestion and Reproduction Section of Genetics and Genomics, Imperial College London, London, UK
- Angela F Brady
  - Ehlers-Danlos Syndrome National Diagnostic Service, London North West University Healthcare NHS Trust, Northwick Park Hospital, Harrow, UK
  - Department of Metabolism, Digestion and Reproduction Section of Genetics and Genomics, Imperial College London, London, UK
- F Michael Pope
  - Ehlers-Danlos Syndrome National Diagnostic Service, London North West University Healthcare NHS Trust, Northwick Park Hospital, Harrow, UK
  - Department of Metabolism, Digestion and Reproduction Section of Genetics and Genomics, Imperial College London, London, UK
- Fleur S van Dijk
  - Ehlers-Danlos Syndrome National Diagnostic Service, London North West University Healthcare NHS Trust, Northwick Park Hospital, Harrow, UK
  - Department of Metabolism, Digestion and Reproduction Section of Genetics and Genomics, Imperial College London, London, UK
- Heather J Cordell
  - Population Health Sciences Institute, Newcastle University, Newcastle upon Tyne, UK
- Timothy J Aitman
  - Centre for Genomic and Experimental Medicine, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK
2
Mujyambere V, Adomako K, Olympio OS. Effectiveness of DArTseq markers application in genetic diversity and population structure of indigenous chickens in Eastern Province of Rwanda. BMC Genomics 2024; 25:193. [PMID: 38373904 PMCID: PMC10875757 DOI: 10.1186/s12864-024-10089-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/22/2023] [Accepted: 02/04/2024] [Indexed: 02/21/2024]
Abstract
BACKGROUND The application of biotechnologies which make use of genetic markers in chicken breeding is developing rapidly. Diversity Array Technology (DArT) is one of the current Genotyping-By-Sequencing techniques allowing the discovery of whole genome sequencing. In livestock, DArT has been applied in cattle, sheep, and horses. Currently, there is no study on the application of DArT markers in chickens. The aim was to study the effectiveness of DArTSeq markers in the genetic diversity and population structure of indigenous chickens (IC) and SASSO in the Eastern Province of Rwanda. METHODS In total 87 blood samples were randomly collected from 37 males and 40 females of indigenous chickens and 10 females of SASSO chickens purposively selected from 5 sites located in two districts of the Eastern Province of Rwanda. Genotyping by Sequencing (GBS) using DArTseq technology was employed. This involved the complexity reduction method through digestion of genomic DNA and ligation of barcoded adapters followed by PCR amplification of adapter-ligated fragments. RESULTS From 45,677 DArTseq SNPs and 25,444 SilicoDArTs generated, only 8,715 and 6,817 respectively remained for further analysis after quality control. The average call rates observed, 0.99 and 0.98 for DArTseq SNPs and SilicoDArTs respectively were quite similar. The polymorphic information content (PIC) from SilicoDArTs (0.33) was higher than that from DArTseq SNPs (0.22). DArTseq SNPs and SilicoDArTs had 34.4% and 34% of the loci respectively mapped on chromosome 1. DArTseq SNPs revealed distance averages of 0.17 and 0.15 within IC and SASSO chickens respectively while the respective averages observed with SilicoDArTs were 0.42 and 0.36. The average genetic distance between IC and SASSO chickens was moderate for SilicoDArTs (0.120) compared to that of DArTseq SNPs (0.048). 
The PCoA and population structure analyses clustered the chicken samples into two subpopulations (1 and 2): subpopulation 1 is composed of IC and subpopulation 2 of SASSO chickens. An admixture was observed in subpopulation 2 with 12 chickens from subpopulation 1. CONCLUSIONS The application of DArTseq markers proved effective and efficient for assessing genetic relationships among IC and separated IC from the exotic breed used, which indicates their suitability for genomic studies. However, further studies using all available chicken genetic resources and larger sample sizes are required.
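As a reading aid for the PIC values reported in the abstract, the following minimal Python sketch computes polymorphic information content from allele frequencies using the widely used definition PIC = 1 - Σ p_i² - Σ_{i<j} 2 p_i² p_j². Whether the DArT pipeline computes PIC in exactly this way for DArTseq SNPs and SilicoDArTs is an assumption, and the function name and example frequencies are hypothetical.

```python
def pic(freqs):
    """Polymorphic information content for one marker.

    freqs: allele frequencies of the marker, which must sum to 1.
    """
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    # Expected homozygosity: sum of squared allele frequencies.
    homozygosity = sum(p * p for p in freqs)
    # Correction term over all unordered pairs of distinct alleles.
    cross = sum(
        2 * freqs[i] ** 2 * freqs[j] ** 2
        for i in range(len(freqs))
        for j in range(i + 1, len(freqs))
    )
    return 1.0 - homozygosity - cross


balanced = pic([0.5, 0.5])   # biallelic marker with equal allele frequencies
skewed = pic([0.9, 0.1])     # biallelic marker with a rare allele
```

Under this definition a biallelic marker cannot exceed a PIC of 0.375, which is reached when both alleles have frequency 0.5.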
Affiliation(s)
- Valentin Mujyambere
  - Department of Animal Production, School of Veterinary Medicine, University of Rwanda, Nyagatare, Rwanda
  - Department of Animal Production, University of Rwanda (UR), P.O. Box 57, Nyagatare, Rwanda
  - Department of Animal Science, Kwame Nkrumah University of Science and Technology (KNUST), Kumasi, AK-385-1973, Ghana
- Kwaku Adomako
  - Department of Animal Science, Faculty of Agriculture, Kwame Nkrumah University of Science and Technology, Kumasi, Ghana
- Oscar Simon Olympio
  - Department of Animal Science, Faculty of Agriculture, Kwame Nkrumah University of Science and Technology, Kumasi, Ghana
3
Zhu L, Yan S, Cao X, Zhang S, Sha Q. Integrating External Controls by Regression Calibration for Genome-Wide Association Study. Genes (Basel) 2024; 15:67. [PMID: 38254957 PMCID: PMC10815702 DOI: 10.3390/genes15010067] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/13/2023] [Revised: 12/30/2023] [Accepted: 01/01/2024] [Indexed: 01/24/2024]
Abstract
Genome-wide association studies (GWAS) have successfully revealed many disease-associated genetic variants. For a case-control study, the adequate power of an association test can be achieved with a large sample size, although genotyping large samples is expensive. A cost-effective strategy to boost power is to integrate external control samples with publicly available genotyped data. However, the naive integration of external controls may inflate the type I error rates if ignoring the systematic differences (batch effect) between studies, such as the differences in sequencing platforms, genotype-calling procedures, population stratification, and so forth. To account for the batch effect, we propose an approach by integrating External Controls into the Association Test by Regression Calibration (iECAT-RC) in case-control association studies. Extensive simulation studies show that iECAT-RC not only can control type I error rates but also can boost statistical power in all models. We also apply iECAT-RC to the UK Biobank data for M72 Fibroblastic disorders by considering genotype calling as the batch effect. Four SNPs associated with fibroblastic disorders have been detected by iECAT-RC and the other two comparison methods, iECAT-Score and Internal. However, our method has a higher probability of identifying these significant SNPs in the scenario of an unbalanced case-control association study.
Affiliation(s)
- Qiuying Sha
  - Department of Mathematical Sciences, Michigan Technological University, Houghton, MI 49931, USA (L.Z., S.Y., X.C., S.Z.)
4
Wojcik GL, Murphy J, Edelson JL, Gignoux CR, Ioannidis AG, Manning A, Rivas MA, Buyske S, Hendricks AE. Opportunities and challenges for the use of common controls in sequencing studies. Nat Rev Genet 2022; 23:665-679. [PMID: 35581355 PMCID: PMC9765323 DOI: 10.1038/s41576-022-00487-4] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Accepted: 03/22/2022] [Indexed: 01/02/2023]
Abstract
Genome-wide association studies using large-scale genome and exome sequencing data have become increasingly valuable in identifying associations between genetic variants and disease, transforming basic research and translational medicine. However, this progress has not been equally shared across all people and conditions, in part due to limited resources. Leveraging publicly available sequencing data as external common controls, rather than sequencing new controls for every study, can better allocate resources by augmenting control sample sizes or providing controls where none existed. However, common control studies must be carefully planned and executed as even small differences in sample ascertainment and processing can result in substantial bias. Here, we discuss challenges and opportunities for the robust use of common controls in high-throughput sequencing studies, including study design, quality control and statistical approaches. Thoughtful generation and use of large and valuable genetic sequencing data sets will enable investigation of a broader and more representative set of conditions, environments and genetic ancestries than otherwise possible.
Affiliation(s)
- Genevieve L Wojcik
  - Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Jessica Murphy
  - Biostatistics and Informatics, Colorado School of Public Health, Aurora, CO, USA
  - Mathematical and Statistical Sciences, University of Colorado Denver, Denver, CO, USA
- Jacob L Edelson
  - Department of Biomedical Data Science, Stanford Medical School, Stanford, CA, USA
- Christopher R Gignoux
  - Biostatistics and Informatics, Colorado School of Public Health, Aurora, CO, USA
  - Human Medical Genetics and Genomics Program, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
  - Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Alexander G Ioannidis
  - Institute for Computational and Mathematical Engineering, Stanford University, Stanford, CA, USA
  - Clinical and Translational Epidemiology Unit, Massachusetts General Hospital, Boston, MA, USA
- Alisa Manning
  - Metabolism Program, Broad Institute, Cambridge, MA, USA
  - Department of Medicine, Harvard Medical School, Boston, MA, USA
- Manuel A Rivas
  - Department of Biomedical Data Science, Stanford Medical School, Stanford, CA, USA
- Steven Buyske
  - Department of Statistics, Rutgers University, Piscataway, NJ, USA
- Audrey E Hendricks
  - Biostatistics and Informatics, Colorado School of Public Health, Aurora, CO, USA
  - Mathematical and Statistical Sciences, University of Colorado Denver, Denver, CO, USA
  - Human Medical Genetics and Genomics Program, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
  - Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
5
Chen W, Coombes BJ, Larson NB. Recent advances and challenges of rare variant association analysis in the biobank sequencing era. Front Genet 2022; 13:1014947. [PMID: 36276986 PMCID: PMC9582646 DOI: 10.3389/fgene.2022.1014947] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 08/09/2022] [Accepted: 09/22/2022] [Indexed: 12/04/2022]
Abstract
Causal variants for rare genetic diseases are often rare in the general population. Rare variants may also contribute to common complex traits and can have much larger per-allele effect sizes than common variants, although power to detect these associations can be limited. Sequencing costs have steadily declined with technological advancements, making it feasible to adopt whole-exome and whole-genome profiling for large biobank-scale sample sizes. These large amounts of sequencing data provide both opportunities and challenges for rare-variant association analysis. Herein, we review the basic concepts of rare-variant analysis methods, the current state-of-the-art methods in utilizing variant annotations or external controls to improve the statistical power, and particular challenges facing rare-variant analysis, such as accounting for population structure and extremely unbalanced case-control designs. We also review recent advances and challenges in rare-variant analysis for familial sequencing data and for more complex phenotypes such as survival data. Finally, we discuss other potential directions for further methodology investigation.
Affiliation(s)
- Wenan Chen
  - Center for Applied Bioinformatics, St. Jude Children’s Research Hospital, Memphis, TN, United States
- Brandon J. Coombes
  - Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, United States
- Nicholas B. Larson
  - Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, United States

*Correspondence: Wenan Chen; Brandon J. Coombes; Nicholas B. Larson
6
González Silos R, Fischer C, Lorenzo Bermejo J. NGS allele counts versus called genotypes for testing genetic association. Comput Struct Biotechnol J 2022; 20:3729-3733. [PMID: 35891781 PMCID: PMC9294184 DOI: 10.1016/j.csbj.2022.07.016] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/13/2022] [Revised: 07/07/2022] [Accepted: 07/07/2022] [Indexed: 11/28/2022]
Abstract
RNA sequence data are commonly summarized as read counts. By contrast, so far there is no alternative to genotype calling for investigating the relationship between genetic variants determined by next-generation sequencing (NGS) and a phenotype of interest. Here we propose and evaluate the direct analysis of allele counts for genetic association tests. Specifically, we assess the potential advantage of the ratio of alternative allele counts to the total number of reads aligned at a specific position of the genome (coverage) over called genotypes. We simulated association studies based on NGS data from HapMap individuals. Genotype quality scores and allele counts were simulated using NGS data from the Personal Genome Project. Real data from the 1000 Genomes Project were also used to compare the two competing approaches. The average proportions of probability values lower than or equal to 0.05 amounted to 0.0496 for called genotypes and 0.0485 for the ratio of alternative allele counts to coverage in the null scenario, and to 0.69 for called genotypes and 0.75 for the ratio of alternative allele counts to coverage in the alternative scenario (9% power increase). The advantage in statistical power of the novel approach increased with decreasing coverage, with decreasing genotype quality and with decreasing allele frequency, reaching a 124% power increase for variants with a minor allele frequency lower than 0.05. We provide computer code in R to implement the novel approach, which does not preclude the use of complementary data quality filters before or after identification of the most promising association signals.
Author summary: Genetic association tests usually rely on called genotypes. We postulate here that the direct analysis of allele counts from sequence data improves the quality of statistical inference. To evaluate this hypothesis, we investigate simulated and real data using distinct statistical approaches. We demonstrate that association tests based on allele counts rather than called genotypes achieve higher statistical power with controlled type I error rates.
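The predictor described in the abstract, the ratio of alternative allele counts to coverage, can be sketched in a few lines of Python. This is not the authors' R code. The group comparison below uses Welch's t statistic purely as a stand-in association measure, and the function names and read counts are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def alt_allele_ratio(alt_reads, coverage):
    """Ratio of alternative-allele reads to total reads; None when there is no coverage."""
    if coverage <= 0:
        return None
    return alt_reads / coverage


def welch_t(group_a, group_b):
    """Welch's t statistic for a difference in means (assumes non-zero sample variances)."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


# Hypothetical (alt_reads, coverage) pairs at one variant position.
case_reads = [(4, 10), (6, 12), (0, 8), (9, 9)]
control_reads = [(0, 10), (1, 12), (5, 10), (0, 7)]

case_ratios = [alt_allele_ratio(a, c) for a, c in case_reads]
control_ratios = [alt_allele_ratio(a, c) for a, c in control_reads]
t_stat = welch_t(case_ratios, control_ratios)
```

A called genotype would replace each ratio with 0, 1 or 2 copies of the alternative allele. Keeping the ratio retains the read-level uncertainty that the abstract argues is informative at low coverage.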
Affiliation(s)
- Christine Fischer
  - Institute of Human Genetics, University of Heidelberg, 69120, Germany
7
Li Y, Lee S. Integrating external controls in case–control studies improves power for rare‐variant tests. Genet Epidemiol 2022; 46:145-158. [PMID: 35170803 PMCID: PMC9393083 DOI: 10.1002/gepi.22444] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 06/19/2021] [Revised: 12/29/2021] [Accepted: 01/20/2022] [Indexed: 11/08/2022]
Abstract
Large-scale sequencing and genotyping data provide an opportunity to integrate external samples as controls to improve power of association tests. However, due to the systematic differences between genotyped samples from different studies, naively aggregating the controls could lead to inflation in Type I error rates. There has been recent effort to integrate external controls while adjusting for batch effect, such as the integrating External Controls into Association Test (iECAT) and its score-based single variant tests. Building on the original iECAT framework, we propose an iECAT-Score region-based test that increases power for rare-variant tests when integrating external controls. This method assesses the systematic batch effect between internal and external samples at each variant and constructs compound shrinkage score statistics to test for the joint genetic effect within a gene or a region, while adjusting for covariates and population stratification. Through simulation studies, we demonstrate that the proposed method controls for Type I error rates and improves power in rare-variant tests. The application of the proposed method to the association studies of age-related macular degeneration (AMD) from the International AMD Genomics Consortium and UK Biobank revealed novel rare-variant associations in gene DXO. Through the incorporation of external controls, the iECAT methods offer a powerful suite to identify disease-associated genetic variants, further shedding light on future directions to investigate roles of rare variants in human diseases.
Affiliation(s)
- Yatong Li
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA
- Seunggeun Lee
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA
  - Graduate School of Data Science, Seoul National University, Seoul, Republic of Korea
8
Chen D, Tashman K, Palmer DS, Neale B, Roeder K, Bloemendal A, Churchhouse C, Ke ZT. A data harmonization pipeline to leverage external controls and boost power in GWAS. Hum Mol Genet 2021; 31:481-489. [PMID: 34508597 PMCID: PMC8825237 DOI: 10.1093/hmg/ddab261] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/22/2021] [Revised: 09/02/2021] [Accepted: 09/03/2021] [Indexed: 11/12/2022]
Abstract
The use of external controls in genome-wide association study (GWAS) can significantly increase the size and diversity of the control sample, enabling high-resolution ancestry matching and enhancing the power to detect association signals. However, the aggregation of controls from multiple sources is challenging due to batch effects, difficulty in identifying genotyping errors, and the use of different genotyping platforms. These obstacles have impeded the use of external controls in GWAS and can lead to spurious results if not carefully addressed. We propose a unified data harmonization pipeline that includes an iterative approach to quality control (QC) and imputation, implemented before and after merging cohorts and arrays. We apply this harmonization pipeline to aggregate 27 517 European control samples from 16 collections within dbGaP. We leverage these harmonized controls to conduct a GWAS of Crohn's disease. We demonstrate a boost in power over using the cohort samples alone, and that our procedure results in summary statistics free of any significant batch effects. This harmonization pipeline for aggregating genotype data from multiple sources can also serve other applications where individual level genotypes, rather than summary statistics, are required.
Affiliation(s)
- Danfeng Chen
  - Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, 08544, New Jersey, United States
- Katherine Tashman
  - Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital, Boston, 02114, Massachusetts, United States
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, 02142, Massachusetts, United States
- Duncan S Palmer
  - Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital, Boston, 02114, Massachusetts, United States
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, 02142, Massachusetts, United States
- Benjamin Neale
  - Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital, Boston, 02114, Massachusetts, United States
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, 02142, Massachusetts, United States
  - Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, 02142, Massachusetts, United States
- Kathryn Roeder
  - Department of Statistics, Carnegie Mellon University, Pittsburgh, 15213, Pennsylvania, United States
- Alex Bloemendal
  - Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital, Boston, 02114, Massachusetts, United States
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, 02142, Massachusetts, United States
- Claire Churchhouse
  - Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital, Boston, 02114, Massachusetts, United States
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, 02142, Massachusetts, United States
  - Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, 02142, Massachusetts, United States
- Zheng Tracy Ke
  - Department of Statistics, Harvard University, Cambridge, 02138, Massachusetts, United States
9
Sub-genic intolerance, ClinVar, and the epilepsies: A whole-exome sequencing study of 29,165 individuals. Am J Hum Genet 2021; 108:965-982. [PMID: 33932343 DOI: 10.1016/j.ajhg.2021.04.009] [Citation(s) in RCA: 32] [Impact Index Per Article: 8.0] [Received: 02/09/2021] [Accepted: 04/08/2021] [Indexed: 12/23/2022]
Abstract
Both mild and severe epilepsies are influenced by variants in the same genes, yet an explanation for the resulting phenotypic variation is unknown. As part of the ongoing Epi25 Collaboration, we performed a whole-exome sequencing analysis of 13,487 epilepsy-affected individuals and 15,678 control individuals. While prior Epi25 studies focused on gene-based collapsing analyses, we asked how the pattern of variation within genes differs by epilepsy type. Specifically, we compared the genetic architectures of severe developmental and epileptic encephalopathies (DEEs) and two generally less severe epilepsies, genetic generalized epilepsy and non-acquired focal epilepsy (NAFE). Our gene-based rare variant collapsing analysis used geographic ancestry-based clustering that included broader ancestries than previously possible and revealed novel associations. Using the missense intolerance ratio (MTR), we found that variants in DEE-affected individuals are in significantly more intolerant genic sub-regions than those in NAFE-affected individuals. Only previously reported pathogenic variants absent in available genomic datasets showed a significant burden in epilepsy-affected individuals compared with control individuals, and the ultra-rare pathogenic variants associated with DEE were located in more intolerant genic sub-regions than variants associated with non-DEE epilepsies. MTR filtering improved the yield of ultra-rare pathogenic variants in affected individuals compared with control individuals. Finally, analysis of variants in genes without a disease association revealed a significant burden of loss-of-function variants in the genes most intolerant to such variation, indicating additional epilepsy-risk genes yet to be discovered. Taken together, our study suggests that genic and sub-genic intolerance are critical characteristics for interpreting the effects of variation in genes that influence epilepsy.
10
Li Y, Lee S. Novel score test to increase power in association test by integrating external controls. Genet Epidemiol 2020; 45:293-304. [PMID: 33161601 PMCID: PMC9424128 DOI: 10.1002/gepi.22370] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 03/17/2020] [Revised: 10/13/2020] [Accepted: 10/20/2020] [Indexed: 12/18/2022]
Abstract
Recent advances in genotyping and sequencing technologies have enabled genetic association studies to leverage high-quality genotyped data to identify variants accounting for a substantial portion of disease risk. The use of external controls, whose genomes have already been genotyped and are publicly available, could be a cost-effective approach to increase the power of association testing. There has been recent effort to integrate external controls while adjusting for possible batch effects, such as the integrating External Controls into Association Test (iECAT). The original iECAT test, however, cannot adjust for covariates such as age, gender, and so forth. Hence, based on the insight of iECAT, we propose a novel score-based test that allows for covariate adjustment and constructs a shrinkage score statistic that is a weighted sum of the score statistic using exclusively internal samples and the score statistic using both internal and external control samples. We assess the existence of batch effect at a variant by comparing control samples of internal and external sources. We show by simulation studies that our method has increased power over the original iECAT while controlling for type I error rates. We present the application of our method to the association studies of age-related macular degeneration (AMD) utilizing data from the International AMD Genomics Consortium and Michigan Genomics Initiative. Through the incorporation of the score test approach, we extend the use of iECAT to adjust for covariates and improve power, further honing the statistical methods needed to identify disease-causing variants within the human genome.
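The weighted-sum idea in the abstract can be sketched as follows. This is an illustration only and not the published iECAT-Score estimator: the weight below is an assumed empirical-Bayes-style ratio that grows with the squared allele-frequency difference between internal and external controls, and all function names and numbers are hypothetical.

```python
def batch_weight(p_internal, n_internal, p_external, n_external):
    """Weight in [0, 1] placed on the internal-only statistic.

    p_*: alternate-allele frequency among internal / external controls.
    n_*: number of internal / external control samples (two alleles each).
    Assumed form: d^2 / (d^2 + Var(d)), where d is the frequency difference.
    """
    d = p_internal - p_external
    pooled = (p_internal * n_internal + p_external * n_external) / (n_internal + n_external)
    var_d = pooled * (1 - pooled) * (1 / (2 * n_internal) + 1 / (2 * n_external))
    denom = d * d + var_d
    return 0.0 if denom == 0 else d * d / denom


def shrinkage_score(score_internal, score_combined, weight):
    """Weighted sum of the internal-only and the internal-plus-external score statistics."""
    return weight * score_internal + (1 - weight) * score_combined


# No evidence of batch effect: rely on the combined (larger-sample) statistic.
w_same = batch_weight(0.10, 500, 0.10, 5000)
# Strong evidence of batch effect: fall back towards the internal-only statistic.
w_diff = batch_weight(0.10, 500, 0.30, 5000)
blended = shrinkage_score(2.0, 4.0, 0.5)
```

The design intent is that a variant whose control allele frequencies agree across sources borrows strength from the external controls, while a variant showing a batch effect is tested essentially on internal samples alone.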
Affiliation(s)
- Yatong Li
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA
- Seunggeun Lee
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA
  - Department of Data Science, Graduate School of Data Science, Seoul National University, Seoul, Republic of Korea
11
Baskurt Z, Mastromatteo S, Gong J, Wintle RF, Scherer SW, Strug LJ. VikNGS: a C++ variant integration kit for next generation sequencing association analysis. Bioinformatics 2020; 36:1283-1285. [PMID: 31580400 PMCID: PMC7703770 DOI: 10.1093/bioinformatics/btz716] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 01/28/2019] [Revised: 08/13/2019] [Accepted: 09/25/2019] [Indexed: 11/14/2022]
Abstract
SUMMARY Integration of next generation sequencing (NGS) data across different research studies can improve the power of genetic association testing by increasing sample size and can obviate the need for sequencing controls. If differential genotype uncertainty across studies is not accounted for, combining datasets can produce spurious association results. We developed the Variant Integration Kit for NGS (VikNGS), a fast cross-platform software package, to enable aggregation of several datasets for rare and common variant genetic association analysis of quantitative and binary traits with covariate adjustment. VikNGS also includes a graphical user interface, power simulation functionality and data visualization tools. AVAILABILITY AND IMPLEMENTATION The VikNGS package can be downloaded at http://www.tcag.ca/tools/index.html. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Zeynep Baskurt
- Program in Genetics and Genome Biology, Research Institute, The Hospital for Sick Children, Toronto, ON M5G0A4, Canada
- The Centre for Applied Genomics, The Hospital for Sick Children, Toronto, ON M5G0A4, Canada
| | - Scott Mastromatteo
- Program in Genetics and Genome Biology, Research Institute, The Hospital for Sick Children, Toronto, ON M5G0A4, Canada
- The Centre for Applied Genomics, The Hospital for Sick Children, Toronto, ON M5G0A4, Canada
| | - Jiafen Gong
- Program in Genetics and Genome Biology, Research Institute, The Hospital for Sick Children, Toronto, ON M5G0A4, Canada
- The Centre for Applied Genomics, The Hospital for Sick Children, Toronto, ON M5G0A4, Canada
| | - Richard F Wintle
- Program in Genetics and Genome Biology, Research Institute, The Hospital for Sick Children, Toronto, ON M5G0A4, Canada
- The Centre for Applied Genomics, The Hospital for Sick Children, Toronto, ON M5G0A4, Canada
| | - Stephen W Scherer
- Program in Genetics and Genome Biology, Research Institute, The Hospital for Sick Children, Toronto, ON M5G0A4, Canada
- The Centre for Applied Genomics, The Hospital for Sick Children, Toronto, ON M5G0A4, Canada
- McLaughlin Centre and Department of Molecular Genetics, University of Toronto, Toronto, ON M5G 0A4, Canada
| | - Lisa J Strug
- Program in Genetics and Genome Biology, Research Institute, The Hospital for Sick Children, Toronto, ON M5G0A4, Canada
- The Centre for Applied Genomics, The Hospital for Sick Children, Toronto, ON M5G0A4, Canada
- Division of Biostatistics and Department of Statistical Sciences, University of Toronto, Toronto, ON, M5T3M7, Canada
| |
|
12
|
Sun W, Jin C, Gelfond JA, Chen MH, Ibrahim JG. Joint analysis of single-cell and bulk tissue sequencing data to infer intratumor heterogeneity. Biometrics 2019; 76:983-994. [PMID: 31813161 DOI: 10.1111/biom.13198] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 02/09/2018] [Revised: 10/23/2019] [Accepted: 11/25/2019] [Indexed: 11/28/2022]
Abstract
Many computational methods have been developed to discern intratumor heterogeneity (ITH) using DNA sequence data from bulk tumor samples. These methods share an assumption that two mutations arise from the same subclone if they have similar mutant allele-frequencies (MAFs), and thus it is difficult or impossible to distinguish two subclones with similar MAFs. Single-cell DNA sequencing (scDNA-seq) data can be very informative for ITH inference. However, due to the difficulty of DNA amplification, scDNA-seq data are often very noisy. A promising new study design is to collect both bulk and single-cell DNA-seq data and jointly analyze them to mitigate the limitations of each data type. To address the analytic challenges of this new study design, we propose a computational method named BaSiC (Bulk tumor and Single Cell), to discern ITH by jointly analyzing DNA-seq data from bulk tumor and single cells. We demonstrate that BaSiC has comparable or better performance than the methods using either data type. We further evaluate BaSiC using bulk tumor and single-cell DNA-seq data from a breast cancer patient and several leukemia patients.
Affiliation(s)
- Wei Sun
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington
| | - Chong Jin
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina
| | - Jonathan A Gelfond
- Department of Epidemiology and Biostatistics, UT Health Science Center, San Antonio, Texas
| | - Ming-Hui Chen
- Department of Statistics, University of Connecticut, Storrs, Connecticut
| | - Joseph G Ibrahim
- Department of Biostatistics, University of North Carolina, Chapel Hill, North Carolina
| |
|
13
|
Stahel P, Nahmias A, Sud SK, Lee SJ, Pucci A, Yousseif A, Youseff A, Jackson T, Urbach DR, Okrainec A, Allard JP, Sockalingam S, Yao T, Barua M, Jiao H, Magi R, Bassett AS, Paterson AD, Dahlman I, Batterham RL, Dash S. Evaluation of the Genetic Association Between Adult Obesity and Neuropsychiatric Disease. Diabetes 2019; 68:2235-2246. [PMID: 31506345 DOI: 10.2337/db18-1254] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 11/27/2018] [Accepted: 08/27/2019] [Indexed: 11/13/2022]
Abstract
Extreme obesity (EO) (BMI >50 kg/m²) is frequently associated with neuropsychiatric disease (NPD). As both EO and NPD are heritable central nervous system disorders, we assessed the prevalence of protein-truncating variants (PTVs) and copy number variants (CNVs) in genes/regions previously implicated in NPD in adults with EO (n = 149) referred for weight loss/bariatric surgery. We also assessed the prevalence of CNVs in patients referred to University College London Hospital (UCLH) with EO (n = 218) and obesity (O) (BMI 35-50 kg/m²; n = 374) and in a Swedish cohort of participants from the community with predominantly O (n = 161). The prevalence of variants was compared with that in control subjects in the Exome Aggregation Consortium/Genome Aggregation Database. In the discovery cohort (high NPD prevalence: 77%), the cumulative PTV/CNV allele frequency (AF) was 7.7% vs. 2.6% in control subjects (odds ratio [OR] 3.1 [95% CI 2-4.1]; P < 0.0001). In the UCLH EO cohort (intermediate NPD prevalence: 47%), CNV AF (1.8% vs. 0.9% in control subjects; OR 1.95 [95% CI 0.96-3.93]; P = 0.06) was lower than in the discovery cohort. CNV AF was not increased in the UCLH O cohort (0.8%). No CNVs were identified in the Swedish cohort with no NPD. These findings suggest that PTV/CNVs, in genes/regions previously associated with NPD, may contribute to NPD in patients with EO.
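As a worked check of the discovery-cohort figures, the odds ratio can be recomputed from the two rounded allele frequencies quoted in the abstract. This is a sketch only: the authors presumably worked from allele counts, which the abstract does not give, and the function name is an assumption.

```python
def odds_ratio_from_frequencies(af_cases, af_controls):
    """Odds ratio implied by two allele frequencies (each strictly between 0 and 1)."""
    odds_cases = af_cases / (1.0 - af_cases)
    odds_controls = af_controls / (1.0 - af_controls)
    return odds_cases / odds_controls


# Discovery cohort: cumulative PTV/CNV allele frequency 7.7% vs. 2.6% in controls.
discovery_or = odds_ratio_from_frequencies(0.077, 0.026)
print(round(discovery_or, 1))  # prints 3.1
```

The result agrees with the reported OR of 3.1 to one decimal place; small differences at further decimals are expected because the inputs are rounded percentages.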
Affiliation(s)
- Priska Stahel
- Department of Medicine, Banting & Best Diabetes Centre, University of Toronto, Toronto, Ontario, Canada
| | - Avital Nahmias
- Department of Medicine, Banting & Best Diabetes Centre, University of Toronto, Toronto, Ontario, Canada
| | - Shawn K Sud
- Department of Medicine, Banting & Best Diabetes Centre, University of Toronto, Toronto, Ontario, Canada
| | - So Jeong Lee
- Department of Medicine, Banting & Best Diabetes Centre, University of Toronto, Toronto, Ontario, Canada
| | - Andrea Pucci
- Centre for Obesity Research, Rayne Institute, Department of Medicine, University College London, London, U.K
- UCLH Bariatric Centre for Weight Management and Metabolic Surgery, University College London Hospital, London, U.K
- NIHR Biomedical Research Centre at University College London Hospitals NHS Foundation Trust and University College London, London, U.K
| | - Ahmed Yousseif
- Centre for Obesity Research, Rayne Institute, Department of Medicine, University College London, London, U.K
- UCLH Bariatric Centre for Weight Management and Metabolic Surgery, University College London Hospital, London, U.K
- NIHR Biomedical Research Centre at University College London Hospitals NHS Foundation Trust and University College London, London, U.K
| | - Alaa Youseff
- Institute of Medical Science, Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada
| | - Timothy Jackson
- Institute of Medical Science, Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada
- Division of General Surgery, University Health Network, Toronto, Ontario, Canada
| | - David R Urbach
- Division of General Surgery, University Health Network, Toronto, Ontario, Canada
| | - Allan Okrainec
- Division of General Surgery, University Health Network, Toronto, Ontario, Canada
- Department of Surgery, Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada
| | - Johane P Allard
- Bariatric Surgery Department, Toronto Western Hospital, Toronto, Ontario, Canada
- Department of Medicine, University of Toronto, Toronto, Ontario, Canada
| | - Sanjeev Sockalingam
- Department of Surgery, Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada
- Department of Nutritional Sciences, University of Toronto, Toronto, Ontario, Canada
- Centre for Mental Health, University Health Network, Toronto, Ontario, Canada
| | - Tony Yao
- Division of Epidemiology and Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
| | - Moumita Barua
- Division of Epidemiology and Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
| | - Hong Jiao
- Division of Nephrology, Department of Medicine, Toronto General Research Institute, University Health Network, Toronto, Ontario, Canada
| | - Reedik Magi
- Department of Medicine, Karolinska Institutet, Stockholm, Sweden
| | - Anne S Bassett
- Estonian Genome Center, Institute of Genomics, University of Tartu, Tartu, Estonia
- The Dalglish Family 22q Clinic, University Health Network, Toronto, Ontario, Canada
- Clinical Genetics Research Program, Centre for Addiction and Mental Health, Toronto, Ontario, Canada
- Department of Psychiatry, University of Toronto, Toronto, Ontario, Canada
- Department of Psychiatry, University Health Network, Toronto, Ontario, Canada
- Division of Cardiology, Department of Medicine, University Health Network, Toronto, Ontario, Canada
| | - Andrew D Paterson
- Department of Nutritional Sciences, University of Toronto, Toronto, Ontario, Canada
- Toronto General Research Institute, University Health Network, Toronto, Ontario, Canada
| | - Ingrid Dahlman
- Division of Epidemiology and Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
| | - Rachel L Batterham
- Centre for Obesity Research, Rayne Institute, Department of Medicine, University College London, London, U.K
- UCLH Bariatric Centre for Weight Management and Metabolic Surgery, University College London Hospital, London, U.K
- NIHR Biomedical Research Centre at University College London Hospitals NHS Foundation Trust and University College London, London, U.K
| | - Satya Dash
- Department of Medicine, Banting & Best Diabetes Centre, University of Toronto, Toronto, Ontario, Canada
| |
|
14
|
Povysil G, Petrovski S, Hostyk J, Aggarwal V, Allen AS, Goldstein DB. Rare-variant collapsing analyses for complex traits: guidelines and applications. Nat Rev Genet 2019; 20:747-759. [PMID: 31605095 DOI: 10.1038/s41576-019-0177-4] [Citation(s) in RCA: 117] [Impact Index Per Article: 19.5] [Accepted: 09/06/2019] [Indexed: 12/11/2022]
Abstract
The first phase of genome-wide association studies (GWAS) assessed the role of common variation in human disease. Advances optimizing and economizing high-throughput sequencing have enabled a second phase of association studies that assess the contribution of rare variation to complex disease in all protein-coding genes. Unlike the early microarray-based studies, sequencing-based studies catalogue the full range of genetic variation, including the evolutionarily youngest forms. Although the experience with common variants helped establish relevant standards for genome-wide studies, the analysis of rare variation introduces several challenges that require novel analysis approaches.
Affiliation(s)
- Gundula Povysil
- Institute for Genomic Medicine, Columbia University Irving Medical Center, Columbia University, New York, NY, USA
| | - Slavé Petrovski
- Centre for Genomics Research, Discovery Sciences, BioPharmaceuticals R&D, AstraZeneca, Cambridge, UK; Department of Medicine, The University of Melbourne, Austin Health and Royal Melbourne Hospital, Melbourne, Victoria, Australia
| | - Joseph Hostyk
- Institute for Genomic Medicine, Columbia University Irving Medical Center, Columbia University, New York, NY, USA
| | - Vimla Aggarwal
- Institute for Genomic Medicine, Columbia University Irving Medical Center, Columbia University, New York, NY, USA
| | - Andrew S Allen
- Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, USA
| | - David B Goldstein
- Institute for Genomic Medicine, Columbia University Irving Medical Center, Columbia University, New York, NY, USA.
| |
|
15
|
Muyas F, Bosio M, Puig A, Susak H, Domènech L, Escaramis G, Zapata L, Demidov G, Estivill X, Rabionet R, Ossowski S. Allele balance bias identifies systematic genotyping errors and false disease associations. Hum Mutat 2018; 40:115-126. [PMID: 30353964 PMCID: PMC6587442 DOI: 10.1002/humu.23674] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Received: 03/29/2018] [Revised: 09/17/2018] [Accepted: 10/20/2018] [Indexed: 12/13/2022]
Abstract
In recent years, next‐generation sequencing (NGS) has become a cornerstone of clinical genetics and diagnostics. Many clinical applications require high precision, especially if rare events such as somatic mutations in cancer or genetic variants causing rare diseases need to be identified. Although random sequencing errors can be modeled statistically and deep sequencing minimizes their impact, systematic errors remain a problem even at high depth of coverage. Understanding their source is crucial to increase precision of clinical NGS applications. In this work, we studied the relation between recurrent biases in allele balance (AB), systematic errors, and false positive variant calls across a large cohort of human samples analyzed by whole exome sequencing (WES). We have modeled the AB distribution for biallelic genotypes in 987 WES samples in order to identify positions recurrently deviating significantly from the expectation, a phenomenon we termed allele balance bias (ABB). Furthermore, we have developed a genotype callability score based on ABB for all positions of the human exome, which detects false positive variant calls that passed state‐of‐the‐art filters. Finally, we demonstrate the use of ABB for detection of false associations proposed by rare variant association studies. Availability: https://github.com/Francesc-Muyas/ABB.
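To make the notion of allele balance concrete, the sketch below computes the alternate-read fraction at a heterozygous call and an exact two-sided binomial p-value for deviation from the expected 0.5. This is a simple stand-in chosen for illustration, not the ABB model or callability score from the paper; the function names and the choice of a binomial test are assumptions.

```python
from math import comb


def allele_balance(ref_reads, alt_reads):
    """Fraction of reads supporting the alternate allele."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads at this position")
    return alt_reads / total


def binomial_two_sided_p(alt_reads, total_reads, expected=0.5):
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more likely than
    the observed alternate-read count under the expected balance.
    """
    probs = [
        comb(total_reads, k) * expected**k * (1.0 - expected) ** (total_reads - k)
        for k in range(total_reads + 1)
    ]
    observed = probs[alt_reads]
    # Small relative tolerance so ties are counted despite rounding.
    return min(1.0, sum(p for p in probs if p <= observed * (1 + 1e-9)))


# A heterozygous call with 15 reference and 5 alternate reads.
ab = allele_balance(15, 5)
p_value = binomial_two_sided_p(5, 20)
```

A single skewed genotype is weak evidence on its own; the abstract's point is that positions deviating recurrently across many samples indicate systematic error.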
Affiliation(s)
- Francesc Muyas
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain; Institute of Medical Genetics and Applied Genomics, University of Tübingen, Tübingen, Germany
| | - Mattia Bosio
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Anna Puig
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Hana Susak
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Laura Domènech
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain; CIBER in Epidemiology and Public Health (CIBERESP), Barcelona, Spain
| | - Georgia Escaramis
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain; CIBER in Epidemiology and Public Health (CIBERESP), Barcelona, Spain
| | - Luis Zapata
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - German Demidov
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain; Institute of Medical Genetics and Applied Genomics, University of Tübingen, Tübingen, Germany
| | - Xavier Estivill
- Sidra Medicine, Doha, Qatar; Women's Health Dexeus, Barcelona, Spain
| | - Raquel Rabionet
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain; CIBER in Epidemiology and Public Health (CIBERESP), Barcelona, Spain; Institut de Recerca Sant Joan de Déu; Institut de Biomedicina de la Universitat de Barcelona (IBUB); Department of Genetics, Microbiology and Statistics, University of Barcelona, Barcelona, Spain
| | - Stephan Ossowski
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain; Institute of Medical Genetics and Applied Genomics, University of Tübingen, Tübingen, Germany
| |
|
16
|
Liu Y, He Q, Sun W. Association analysis using somatic mutations. PLoS Genet 2018; 14:e1007746. [PMID: 30388102 PMCID: PMC6235399 DOI: 10.1371/journal.pgen.1007746] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 02/07/2018] [Revised: 11/14/2018] [Accepted: 10/07/2018] [Indexed: 11/18/2022] Open
Abstract
Somatic mutations drive the growth of tumor cells and are pivotal biomarkers for many cancer treatments. Genetic association analysis using somatic mutations is an effective approach to study the functional impact of somatic mutations. However, standard regression methods are not appropriate for somatic mutation association studies because somatic mutation calls often have non-ignorable false positive and/or false negative rates. Large-scale association analysis using somatic mutations has recently become feasible, thanks to improvements in sequencing techniques and reductions in sequencing cost, so there is an urgent need for a new statistical method designed for somatic mutation association analysis. We propose such a method with a computationally efficient software implementation: Somatic mutation Association test with Measurement Errors (SAME). SAME accounts for somatic mutation calling uncertainty using a likelihood-based approach. It can be used to assess the associations between continuous/dichotomous outcomes and individual mutations or gene-level mutations. Through simulation studies across a wide range of realistic scenarios, we show that SAME can significantly improve statistical power over the naive generalized linear model that ignores mutation calling uncertainty. Finally, using data collected from The Cancer Genome Atlas (TCGA) project, we apply SAME to study the associations between somatic mutations and gene expression in 12 cancer types, as well as the associations between somatic mutations and colon cancer subtype defined by DNA methylation data. SAME recovered some interesting findings that were missed by the generalized linear model. In addition, we demonstrated that mutation-level and gene-level analyses are often more appropriate for oncogenes and tumor-suppressor genes, respectively.
Cancer is a genetic disease that is driven by the accumulation of somatic mutations. Association studies using somatic mutations are a powerful approach to identify the potential impact of somatic mutations on molecular or clinical features. One challenge for such tasks is non-ignorable error in somatic mutation calling. We have developed a statistical method to address this challenge and applied our method to study the gene expression traits associated with somatic mutations in 12 cancer types. Our results show that some somatic mutations affect gene expression in several cancer types. In particular, we show that the associations between gene expression traits and TP53 gene-level mutation reveal some similarities across a few cancer types.
Affiliation(s)
- Yang Liu
- Department of Mathematics and Statistics, Wright State University, Dayton, Ohio, United States of America
| | - Qianchan He
- Biostatistics Program, Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
| | - Wei Sun
- Biostatistics Program, Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
| |
|
17
|
Project MinE: study design and pilot analyses of a large-scale whole-genome sequencing study in amyotrophic lateral sclerosis. Eur J Hum Genet 2018; 26:1537-1546. [PMID: 29955173 PMCID: PMC6138692 DOI: 10.1038/s41431-018-0177-4] [Citation(s) in RCA: 101] [Impact Index Per Article: 14.4] [Received: 12/12/2017] [Revised: 04/10/2018] [Accepted: 04/26/2018] [Indexed: 11/16/2022] Open
Abstract
The most recent genome-wide association study in amyotrophic lateral sclerosis (ALS) demonstrates a disproportionate contribution from low-frequency variants to genetic susceptibility to disease. We have therefore begun Project MinE, an international collaboration that seeks to analyze whole-genome sequence data of at least 15 000 ALS patients and 7500 controls. Here, we report on the design of Project MinE and pilot analyses of 1169 successfully sequenced ALS patients and 608 controls drawn from the Netherlands. As has become characteristic of sequencing studies, we find an abundance of rare genetic variation (minor allele frequency < 0.1%), the vast majority of which is absent from public datasets. Principal component analysis reveals local geographical clustering of these variants within the Netherlands. We use the whole-genome sequence data to explore the implications of poor geographical matching of cases and controls in a sequence-based disease study and to investigate how ancestry-matched, externally sequenced controls can induce false positive associations. We have also publicly released genome-wide minor allele counts in cases and controls, as well as results from genic burden tests.
|
18
|
Liao P, Satten GA, Hu YJ. Robust inference of population structure from next-generation sequencing data with systematic differences in sequencing. Bioinformatics 2018; 34:1157-1163. [PMID: 29186324 PMCID: PMC6031038 DOI: 10.1093/bioinformatics/btx708] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 06/09/2017] [Revised: 09/29/2017] [Accepted: 11/24/2017] [Indexed: 12/30/2022] Open
Abstract
MOTIVATION Inferring population structure is important for both population genetics and genetic epidemiology. Principal components analysis (PCA) has been effective in ascertaining population structure with array genotype data but can be difficult to use with sequencing data, especially when low depth leads to uncertainty in called genotypes. Because PCA is sensitive to differences in variability, PCA using sequencing data can result in components that correspond to differences in sequencing quality (read depth and error rate), rather than differences in population structure. We demonstrate that even existing methods for PCA specifically designed for sequencing data can still yield biased conclusions when used with data having sequencing properties that are systematically different across different groups of samples (i.e. sequencing groups). This situation can arise in population genetics when combining sequencing data from different studies, or in genetic epidemiology when using historical controls such as samples from the 1000 Genomes Project. RESULTS To allow inference on population structure using PCA in these situations, we provide an approach that is based on using sequencing reads directly without calling genotypes. Our approach is to adjust the data from different sequencing groups to have the same read depth and error rate so that PCA does not generate spurious components representing sequencing quality. To accomplish this, we have developed a subsampling procedure to match the depth distributions in different sequencing groups, and a read-flipping procedure to match the error rates. We average over subsamples and read flips to minimize loss of information. We demonstrate the utility of our approach using two datasets from 1000 Genomes, and further evaluate it using simulation studies. AVAILABILITY AND IMPLEMENTATION TASER-PC software is publicly available at http://web1.sph.emory.edu/users/yhu30/software.html. CONTACT yijuan.hu@emory.edu. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
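The depth-matching step described in this abstract can be pictured with a minimal downsampling sketch. It shows only a single random subsample of the reads at one site; the published procedure matches whole depth distributions across sequencing groups and averages over many subsamples and read flips, which is not reproduced here. The function name and the list-of-base-calls data layout are assumptions.

```python
import random


def subsample_reads(reads, target_depth, rng):
    """Randomly keep at most target_depth reads at one site.

    reads        : list of base calls observed at the site (e.g. ["A", "G", ...])
    target_depth : depth to match, such as the depth seen in a shallower group
    rng          : a random.Random instance, passed in for reproducibility
    """
    if len(reads) <= target_depth:
        return list(reads)
    return rng.sample(reads, target_depth)


rng = random.Random(0)
deep_site = ["A", "A", "A", "A", "A", "G", "A", "A", "A", "A"]
matched = subsample_reads(deep_site, 4, rng)
```

Repeating the call with different random states and averaging downstream quantities is one way to limit the information lost by discarding reads, in the spirit of the averaging the abstract mentions.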
Affiliation(s)
- Peizhou Liao
- Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA, USA
| | - Glen A Satten
- Division of Reproductive Health, Centers for Disease Control and Prevention, Atlanta, GA, USA
| | - Yi-Juan Hu
- Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA, USA
| |
|
19
|
Genome-wide linkage and association study implicates the 10q26 region as a major genetic contributor to primary nonsyndromic vesicoureteric reflux. Sci Rep 2017; 7:14595. [PMID: 29097723 PMCID: PMC5668427 DOI: 10.1038/s41598-017-15062-9] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 07/06/2017] [Accepted: 10/06/2017] [Indexed: 12/29/2022] Open
Abstract
Vesicoureteric reflux (VUR) is the commonest urological anomaly in children. Despite treatment improvements, associated renal lesions – congenital dysplasia, acquired scarring or both – are a common cause of childhood hypertension and renal failure. Primary VUR is familial, with transmission rate and sibling risk both approaching 50%, and appears highly genetically heterogeneous. It is often associated with other developmental anomalies of the urinary tract, emphasising its etiology as a disorder of urogenital tract development. We conducted a genome-wide linkage and association study in three European populations to search for loci predisposing to VUR. Family-based association analysis of 1098 parent-affected-child trios and case/control association analysis of 1147 cases and 3789 controls did not reveal any compelling associations, but parametric linkage analysis of 460 families (1062 affected individuals) under a dominant model identified a single region, on 10q26, that showed strong linkage (HLOD = 4.90; ZLRLOD = 4.39) to VUR. The ~9Mb region contains 69 genes, including some good biological candidates. Resequencing this region in selected individuals did not clearly implicate any gene but FOXI2, FANK1 and GLRX3 remain candidates for further investigation. This, the largest genetic study of VUR to date, highlights the 10q26 region as a major genetic contributor to VUR in European populations.
|
20
|
Tom JA, Reeder J, Forrest WF, Graham RR, Hunkapiller J, Behrens TW, Bhangale TR. Identifying and mitigating batch effects in whole genome sequencing data. BMC Bioinformatics 2017; 18:351. [PMID: 28738841 PMCID: PMC5525370 DOI: 10.1186/s12859-017-1756-z] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.3] [Received: 01/12/2017] [Accepted: 07/12/2017] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND Large sample sets of whole genome sequencing with deep coverage are being generated; however, assembling datasets from different sources inevitably introduces batch effects. These batch effects are not well understood and can be due to changes in the sequencing protocol or in the bioinformatics tools used to process the data. No systematic algorithms or heuristics exist to detect and filter batch effects or remove associations impacted by batch effects in whole genome sequencing data. RESULTS We describe key quality metrics, provide a freely available software package to compute them, and demonstrate that identification of batch effects is aided by principal components analysis of these metrics. To mitigate batch effects, we developed new site-specific filters that identified and removed variants that falsely associated with the phenotype due to batch effect. These include filtering based on a haplotype-based genotype correction, a differential genotype quality test, and removal of sites with a missing genotype rate greater than 30% after setting genotypes with quality scores less than 20 to missing. This method removed 96.1% of unconfirmed genome-wide significant SNP associations and 97.6% of unconfirmed genome-wide significant indel associations. We performed analyses to demonstrate that: 1) these filters impacted variants known to be disease associated, as 2 out of 16 confirmed associations in an AMD candidate SNP analysis were filtered, representing a reduction in power of 12.5%; 2) in the absence of batch effects, these filters removed only a small proportion of variants across the genome (type I error rate of 3%); and 3) in an independent dataset, the method removed 90.2% of unconfirmed genome-wide SNP associations and 89.8% of unconfirmed genome-wide indel associations. CONCLUSIONS Researchers currently do not have effective tools to identify and mitigate batch effects in whole genome sequencing data. We developed and validated methods and filters to address this deficiency.
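The last of the three filters in this abstract (set genotypes with quality below 20 to missing, then drop sites whose missing rate exceeds 30%) is simple enough to sketch. The thresholds come from the abstract; the function name, the nested-list data layout and the use of None for a missing genotype are assumptions, and the haplotype-based correction and differential genotype quality test are not shown.

```python
def filter_sites(genotypes, qualities, min_gq=20, max_missing_rate=0.30):
    """Mask low-quality genotypes, then drop sites with too much missingness.

    genotypes : list of sites, each a list of per-sample genotypes (None = missing)
    qualities : same shape, per-sample genotype quality scores (None allowed)
    Returns a list of (site_index, masked_genotypes) for the sites that are kept.
    """
    kept = []
    for site_index, (site_gts, site_gqs) in enumerate(zip(genotypes, qualities)):
        if not site_gts:
            continue  # no samples at this site, nothing to evaluate
        # Genotypes with quality below min_gq are set to missing.
        masked = [
            gt if (gt is not None and gq is not None and gq >= min_gq) else None
            for gt, gq in zip(site_gts, site_gqs)
        ]
        missing_rate = masked.count(None) / len(masked)
        # Sites are removed only when the missing rate is greater than the cutoff.
        if missing_rate <= max_missing_rate:
            kept.append((site_index, masked))
    return kept


genotypes = [[0, 1, 1, 0], [0, 1, None, 2]]
qualities = [[30, 25, 40, 50], [10, 15, None, 35]]
kept_sites = filter_sites(genotypes, qualities)
```

In this toy input the first site keeps all four genotypes, while the second ends up with three of four genotypes missing after masking and is dropped.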
Affiliation(s)
- Jennifer A Tom
- Bioinformatics and Computational Biology Department, Genentech Inc, 1 DNA Way, South San Francisco, CA, 94080, USA.
| | - Jens Reeder
- Bioinformatics and Computational Biology Department, Genentech Inc, 1 DNA Way, South San Francisco, CA, 94080, USA
| | - William F Forrest
- Bioinformatics and Computational Biology Department, Genentech Inc, 1 DNA Way, South San Francisco, CA, 94080, USA
| | - Robert R Graham
- Human Genetics Department, Genentech Inc, 1 DNA Way, South San Francisco, CA, 94080, USA
| | - Julie Hunkapiller
- Human Genetics Department, Genentech Inc, 1 DNA Way, South San Francisco, CA, 94080, USA
| | - Timothy W Behrens
- Human Genetics Department, Genentech Inc, 1 DNA Way, South San Francisco, CA, 94080, USA
| | - Tushar R Bhangale
- Bioinformatics and Computational Biology Department, Genentech Inc, 1 DNA Way, South San Francisco, CA, 94080, USA; Human Genetics Department, Genentech Inc, 1 DNA Way, South San Francisco, CA, 94080, USA
| |
|
21
|
Lee S, Kim S, Fuchsberger C. Improving power for rare-variant tests by integrating external controls. Genet Epidemiol 2017; 41:610-619. [PMID: 28657150 DOI: 10.1002/gepi.22057] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 10/12/2016] [Revised: 03/16/2017] [Accepted: 04/25/2017] [Indexed: 11/07/2022]
Abstract
Due to the drop in sequencing cost, the number of sequenced genomes is increasing rapidly. To improve power of rare-variant tests, these sequenced samples could be used as external control samples in addition to control samples from the study itself. However, when using external controls, possible batch effects due to the use of different sequencing platforms or genotype calling pipelines can dramatically increase type I error rates. To address this, we propose novel summary statistics based single and gene- or region-based rare-variant tests that allow the integration of external controls while controlling for type I error. Our approach is based on the insight that batch effects on a given variant can be assessed by comparing odds ratio estimates using internal controls only vs. using combined control samples of internal and external controls. From simulation experiments and the analysis of data from age-related macular degeneration and type 2 diabetes studies, we demonstrate that our method can substantially improve power while controlling for type I error rate.
Affiliation(s)
- Seunggeun Lee
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA; Center for Statistical Genetics, University of Michigan, Ann Arbor, MI, USA
- Sehee Kim
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
- Christian Fuchsberger
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA; Center for Statistical Genetics, University of Michigan, Ann Arbor, MI, USA; Center for Biomedicine, European Academy of Bolzano/Bozen, affiliated to the University of Lübeck, Bolzano/Bozen, Italy
|
22
|
Liao P, Satten GA, Hu YJ. PhredEM: a phred-score-informed genotype-calling approach for next-generation sequencing studies. Genet Epidemiol 2017; 41:375-387. [PMID: 28560825 DOI: 10.1002/gepi.22048] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Received: 08/05/2015] [Revised: 11/30/2016] [Accepted: 02/27/2017] [Indexed: 12/30/2022]
Abstract
A fundamental challenge in analyzing next-generation sequencing (NGS) data is to determine an individual's genotype accurately, as the accuracy of the inferred genotype is essential to downstream analyses. Correctly estimating the base-calling error rate is critical to accurate genotype calls. Phred scores that accompany each call can be used to decide which calls are reliable. Some genotype callers, such as GATK and SAMtools, directly calculate the base-calling error rates from phred scores or recalibrated base quality scores. Others, such as SeqEM, estimate error rates from the read data without using any quality scores. It is also a common quality control procedure to filter out reads with low phred scores. However, choosing an appropriate phred score threshold is problematic, as too high a threshold may lose data, while too low a threshold may introduce errors. We propose a new likelihood-based genotype-calling approach that exploits all reads and estimates the per-base error rates by incorporating phred scores through a logistic regression model. The approach, which we call PhredEM, uses the expectation-maximization (EM) algorithm to obtain consistent estimates of genotype frequencies and logistic regression parameters. It also includes a simple, computationally efficient screening algorithm to identify loci that are estimated to be monomorphic, so that only loci estimated to be nonmonomorphic require application of the EM algorithm. Like GATK, PhredEM can be used together with a linkage-disequilibrium-based method such as Beagle, which can further improve genotype calling as a refinement step. We evaluate the performance of PhredEM using both simulated data and real sequencing data from the UK10K project and the 1000 Genomes project. The results demonstrate that PhredEM performs better than either GATK or SeqEM, and that PhredEM is an improved, robust, and widely applicable genotype-calling approach for NGS studies. The relevant software is freely available.
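The threshold trade-off mentioned in the abstract above can be made concrete with the standard Phred definition, Q = -10 log10(p), where p is the base-calling error probability. The sketch below is not PhredEM and does not reproduce its logistic regression model or EM step. The function names and the example quality values are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence, Tuple


def phred_to_error_prob(q: float) -> float:
    """Standard Phred definition: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Inverse of phred_to_error_prob for 0 < p <= 1."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -10 * math.log10(p)


def split_by_threshold(quals: Sequence[float],
                       threshold: float) -> Tuple[List[float], List[float]]:
    """Hard-threshold QC: return (kept, discarded) base qualities."""
    kept = [q for q in quals if q >= threshold]
    discarded = [q for q in quals if q < threshold]
    return kept, discarded


def summarize(quals: Sequence[float], threshold: float) -> Tuple[float, float]:
    """Return (fraction of bases discarded, mean error probability of kept bases)."""
    kept, discarded = split_by_threshold(quals, threshold)
    frac_discarded = len(discarded) / len(quals)
    mean_err = (sum(phred_to_error_prob(q) for q in kept) / len(kept)
                if kept else float("nan"))
    return frac_discarded, mean_err


if __name__ == "__main__":
    example_quals = [10, 15, 20, 25, 30, 35, 40]  # hypothetical base qualities
    for t in (10, 20, 30):
        frac, err = summarize(example_quals, t)
        print(f"threshold={t}: discarded={frac:.2f}, mean kept error={err:.4f}")
```

Raising the threshold lowers the mean error probability of the retained bases but discards a larger share of the data, which is the tension the abstract describes.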
Affiliation(s)
- Peizhou Liao
  - Department of Biostatistics and Bioinformatics, Emory University, Atlanta, Georgia, United States of America
- Glen A Satten
  - Centers for Disease Control and Prevention, Atlanta, Georgia, United States of America
- Yi-Juan Hu
  - Department of Biostatistics and Bioinformatics, Emory University, Atlanta, Georgia, United States of America
|
23
|
López‐Isac E, Bossini‐Castillo L, Palma AB, Assassi S, Mayes MD, Simeón CP, Ortego‐Centeno N, Vicente E, Tolosa C, Rubio‐Rivas M, Román‐Ivorra JA, Beretta L, Moroncini G, Hunzelmann N, Distler JHW, Riemekasten G, de Vries‐Bouwstra J, Voskuyl AE, Radstake TRDJ, Herrick A, Denton CP, Fonseca C, Martín J. Analysis of ATP8B4 F436L Missense Variant in a Large Systemic Sclerosis Cohort. Arthritis Rheumatol 2017; 69:1337-1338. [DOI: 10.1002/art.40058] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 08/04/2016] [Accepted: 01/26/2017] [Indexed: 01/15/2023]
Affiliation(s)
- Elena López‐Isac
  - Institute of Parasitology and Biomedicine López‐Neyra, Granada, Spain
- Ana B. Palma
  - Institute of Parasitology and Biomedicine López‐Neyra, Granada, Spain
- Lorenzo Beretta
  - Fondazione IRCCS Ca' Granda Ospedale, Maggiore Policlinico di Milano, Milan, Italy
- Ariane Herrick
  - The University of Manchester, Manchester Academic Health Science Centre, Manchester, UK
- Carmen Fonseca
  - Royal Free and University College Medical School, London, UK
- Javier Martín
  - Institute of Parasitology and Biomedicine López‐Neyra, Granada, Spain
|