1
|
Miller C, Portlock T, Nyaga DM, O'Sullivan JM. A review of model evaluation metrics for machine learning in genetics and genomics. Frontiers in Bioinformatics 2024; 4:1457619. [PMID: 39318760 PMCID: PMC11420621 DOI: 10.3389/fbinf.2024.1457619] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2024] [Accepted: 08/27/2024] [Indexed: 09/26/2024] Open
Abstract
Machine learning (ML) has shown great promise in genetics and genomics where large and complex datasets have the potential to provide insight into many aspects of disease risk, pathogenesis of genetic disorders, and prediction of health and wellbeing. However, with this possibility there is a responsibility to exercise caution against biases and inflation of results that can have harmful unintended impacts. Therefore, researchers must understand the metrics used to evaluate ML models which can influence the critical interpretation of results. In this review we provide an overview of ML metrics for clustering, classification, and regression and highlight the advantages and disadvantages of each. We also detail common pitfalls that occur during model evaluation. Finally, we provide examples of how researchers can assess and utilise the results of ML models, specifically from a genomics perspective.
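The caution about inflated results can be made concrete with a small sketch. The example below is illustrative only and is not taken from the review: the function name `classification_metrics` and the toy labels are assumptions. It shows how accuracy can look strong on an imbalanced case/control set while recall and the Matthews correlation coefficient (MCC) reveal that no cases were detected.

```python
import math


def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and MCC for binary labels (1 = case)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    # Conventions assumed here: undefined ratios are reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "mcc": mcc}


# Toy data (assumed): 5 cases among 100 samples, and a model that
# predicts "control" for everyone.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here accuracy is high purely because controls dominate, which is the kind of pitfall a single headline metric can hide.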
Affiliation(s)
- Catriona Miller
- The Liggins Institute, The University of Auckland, Auckland, New Zealand
| | - Theo Portlock
- The Liggins Institute, The University of Auckland, Auckland, New Zealand
| | - Denis M Nyaga
- The Liggins Institute, The University of Auckland, Auckland, New Zealand
| | - Justin M O'Sullivan
- The Liggins Institute, The University of Auckland, Auckland, New Zealand
- The Maurice Wilkins Centre, The University of Auckland, Auckland, New Zealand
- MRC Lifecourse Epidemiology Unit, University of Southampton, Southampton, United Kingdom
- Singapore Institute for Clinical Sciences, Agency for Science Technology and Research, Singapore, Singapore
|
2
|
Liu M, Zhang Q, Ma S. A tree-based gene-environment interaction analysis with rare features. Stat Anal Data Min 2022; 15:648-674. [PMID: 38046814 PMCID: PMC10691867 DOI: 10.1002/sam.11578] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/14/2021] [Accepted: 02/14/2022] [Indexed: 01/20/2023]
Abstract
Gene-environment (G-E) interaction analysis plays a critical role in understanding and modeling complex diseases. Compared to main-effect-only analysis, it is more seriously challenged by higher dimensionality, weaker signals, and the unique "main effects, interactions" variable selection hierarchy. In joint G-E interaction analysis, under which a large number of G factors are analysed in a single model, effort tailored to rare features (e.g., SNPs with low minor allele frequencies) has been limited. Existing investigations of rare features have mostly focused on marginal analysis, where various data aggregation techniques have been developed and hypothesis tests have been conducted to identify significant aggregated features. However, such techniques cannot be extended to joint G-E interaction analysis. In this study, building on a very recent tree-based data aggregation technique developed for main-effect-only analysis, we develop a new G-E interaction analysis approach tailored to rare features. The adopted data aggregation technique allows for more efficient information borrowing from neighboring rare features. Similar to some existing state-of-the-art approaches, the proposed approach adopts penalization for variable selection and regularized estimation, and respects the variable selection hierarchy. Simulation shows that it identifies important interactions and main effects more accurately than several competing alternatives. In the analysis of the NFBC1966 study, the proposed approach leads to findings that differ from those of the alternatives, with satisfactory prediction and stability performance.
Affiliation(s)
- Mengque Liu
- School of Journalism and New Media, Xi’an Jiaotong University, Shanxi Xi’an, China
| | - Qingzhao Zhang
- Department of Statistics and Data Science, School of Economics, Wang Yanan Institute for Studies in Economics, and Fujian Key Lab of Statistics, Xiamen University, Fujian Xiamen, China
| | - Shuangge Ma
- Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut, USA
|
3
|
[An improved association analysis pipeline for tumor susceptibility variant in haplotype amplification area]. Nan Fang Yi Ke Da Xue Xue Bao = Journal of Southern Medical University 2020; 40:1493-1499. [PMID: 33118521 PMCID: PMC7606235 DOI: 10.12122/j.issn.1673-4254.2020.10.16] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/23/2023]
Abstract
OBJECTIVE Haplotype amplification on germline variants is suggested to imply potential selective advantages and clonal expansion susceptibility, and has become an important signature in the search for cancer susceptibility genes. Here we propose an improved association method that fully considers the haplotype amplification status. METHODS The haplotype amplification status was estimated from the variant allelic frequencies. We adopted a permutation test on variant allelic frequencies to divide the candidate variants into multiple groups. A likelihood clustering method was then applied to establish the neighborhood system of the hidden Markov random field framework. A filtering pipeline, including a Wilson's interval filter and a false discovery rate controller, was introduced into the proposed method to further refine the candidate variants. The final candidate set, along with the haplotype amplification status, was collapsed into weighted virtual sites for association tests. RESULTS Through simulated tests on a series of datasets, we compared the type Ⅰ error rates at different minor allele frequencies, which stably fell within 2%, suggesting good robustness of the algorithm. In addition, we compared 5 other published association approaches with the proposed method in terms of type Ⅰ and type Ⅱ error rates; the error rates were all within 2%, demonstrating significant advantages and good statistical ability of the proposed method. CONCLUSIONS The proposed method can accurately identify tumor susceptibility variants in haplotype amplification areas with good robustness and stability.
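The abstract names a Wilson's interval filter without describing it. As a rough sketch only, the code below computes the standard Wilson score interval for a variant allelic frequency from read counts. The decision rule in `excludes_balanced_vaf` (flag a variant when the interval excludes 0.5) is a hypothetical illustration, not the authors' filter, and both function names are assumptions.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p_hat + z2 / (2.0 * trials)) / denom
    half = z * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z2 / (4.0 * trials * trials)
    ) / denom
    return center - half, center + half


def excludes_balanced_vaf(alt_reads, depth, balanced=0.5):
    """Hypothetical rule: True when the VAF interval does not contain 0.5."""
    low, high = wilson_interval(alt_reads, depth)
    return not (low <= balanced <= high)


low, high = wilson_interval(5, 10)
print(round(low, 4), round(high, 4))
print(excludes_balanced_vaf(5, 10), excludes_balanced_vaf(90, 100))
```

The Wilson interval is often preferred over the simple normal approximation at low read depth because it stays within [0, 1] and behaves better near the boundaries.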
|
4
|
Hahn G, Lutz SM, Hecker J, Prokopenko D, Cho MH, Silverman EK, Weiss ST, Lange C. locStra: Fast analysis of regional/global stratification in whole-genome sequencing studies. Genet Epidemiol 2020; 45:82-98. [PMID: 32929743 DOI: 10.1002/gepi.22356] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 04/13/2020] [Revised: 08/05/2020] [Accepted: 08/24/2020] [Indexed: 01/08/2023]
Abstract
locStra is an R package for the analysis of regional and global population stratification in whole-genome sequencing (WGS) studies, where regional stratification refers to the substructure defined by the loci in a particular region of the genome. Population substructure can be assessed based on the genetic covariance matrix, the genomic relationship matrix, and the unweighted/weighted genetic Jaccard similarity matrix. Using a sliding window approach, the regional similarity matrices are compared with the global ones, based on user-defined window sizes and metrics, for example, the correlation between regional and global eigenvectors. An algorithm for the specification of the window size is provided. As the implementation fully exploits sparse matrix algebra and is written in C++, the analysis is highly efficient. Even on single cores, for realistic study sizes (several thousand subjects, several million rare variants per subject), the runtime for the genome-wide computation of all regional similarity matrices typically does not exceed one hour, enabling an unprecedented investigation of regional stratification across the entire genome. The package is applied to three WGS studies, illustrating the varying patterns of regional substructure across the genome and its beneficial effects on association testing.
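To illustrate one of the similarity measures mentioned, the sketch below computes an unweighted Jaccard similarity matrix between subjects from binary carrier indicators. It is a plain-Python illustration, not locStra's API or implementation (the package itself is in R and C++ with sparse matrices); the function name, the toy data, and the convention of returning 0.0 when neither subject carries any variant are assumptions.

```python
def jaccard_similarity_matrix(genotypes):
    """genotypes[i][k] is 1 if subject i carries the minor allele at variant k, else 0."""
    # Store each subject as the set of variant indices they carry.
    carriers = [{k for k, g in enumerate(row) if g} for row in genotypes]
    n = len(carriers)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            union = carriers[i] | carriers[j]
            if union:
                value = len(carriers[i] & carriers[j]) / len(union)
            else:
                value = 0.0  # assumed convention for two empty carrier sets
            sim[i][j] = sim[j][i] = value
    return sim


# Toy data (assumed): 3 subjects, 4 rare variants.
subjects = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
]
similarity = jaccard_similarity_matrix(subjects)
for row in similarity:
    print([round(v, 3) for v in row])
```

A regional analysis in the spirit of the abstract would apply the same computation to the variants inside each sliding window and compare the result with the genome-wide matrix.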
Affiliation(s)
- Georg Hahn
- Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, Massachusetts, USA
| | - Sharon M Lutz
- Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, Massachusetts, USA
| | - Julian Hecker
- Department of Medicine, Brigham and Women's Hospital, Harvard University, Boston, Massachusetts, USA
| | - Dmitry Prokopenko
- Massachusetts General Hospital, Harvard University, Boston, Massachusetts, USA
| | - Michael H Cho
- Department of Medicine, Brigham and Women's Hospital, Harvard University, Boston, Massachusetts, USA
| | - Edwin K Silverman
- Department of Medicine, Brigham and Women's Hospital, Harvard University, Boston, Massachusetts, USA
| | - Scott T Weiss
- Department of Medicine, Brigham and Women's Hospital, Harvard University, Boston, Massachusetts, USA
| | - Christoph Lange
- Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, Massachusetts, USA
|
5
|
Yazdani A, Yazdani A, Elsea SH, Schaid DJ, Kosorok MR, Dangol G, Samiei A. Genome analysis and pleiotropy assessment using causal networks with loss of function mutation and metabolomics. BMC Genomics 2019; 20:395. [PMID: 31113383 PMCID: PMC6528192 DOI: 10.1186/s12864-019-5772-4] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 07/23/2018] [Accepted: 05/03/2019] [Indexed: 12/13/2022] Open
Abstract
BACKGROUND Many genome-wide association studies have detected genomic regions associated with traits, yet understanding the functional causes of association often remains elusive. Utilizing systems approaches and focusing on intermediate molecular phenotypes might facilitate biologic understanding. RESULTS The availability of exome sequencing of two populations of African-Americans and European-Americans from the Atherosclerosis Risk in Communities study allowed us to investigate the effects of annotated loss-of-function (LoF) mutations on 122 serum metabolites. To assess the findings, we built metabolomic causal networks for each population separately and utilized structural equation modeling. We then validated our findings with a set of independent samples. By use of methods based on concepts of Mendelian randomization of genetic variants, we showed that some of the affected metabolites are risk predictors in the causal pathway of disease. For example, LoF mutations in the gene KIAA1755 were identified to elevate the levels of eicosapentaenoate (p-value = 5E-14), an essential fatty acid clinically identified to increase essential hypertension. We showed that this gene is in the pathway to triglycerides, where both triglycerides and essential hypertension are risk factors of metabolomic disorder and heart attack. We also identified that the gene CLDN17, harboring loss-of-function mutations, had pleiotropic actions on metabolites from amino acid and lipid pathways. CONCLUSION Using systems biology approaches for the analysis of metabolomics and genetic data, we integrated several biological processes, which lead to findings that may functionally connect genetic variants with complex diseases.
Affiliation(s)
| | - Akram Yazdani
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, 10029 USA
| | - Sarah H. Elsea
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030 USA
| | - Daniel J. Schaid
- Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, MN 55905 USA
| | - Michael R. Kosorok
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599 USA
| | - Gita Dangol
- Health Science Center, The University of Texas MD Anderson Cancer Center, Austin, TX 77030 USA
| | - Ahmad Samiei
- Hasso Plattner Institute, 14482 Potsdam, Germany
- Climax Data Pattern, Boston, MA USA
|
6
|
Li X, Wu D, Cui Y, Liu B, Walter H, Schumann G, Li C, Jiang T. Reliable heritability estimation using sparse regularization in ultrahigh dimensional genome-wide association studies. BMC Bioinformatics 2019; 20:219. [PMID: 31039742 PMCID: PMC6492418 DOI: 10.1186/s12859-019-2792-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 11/22/2018] [Accepted: 04/02/2019] [Indexed: 12/28/2022] Open
Abstract
BACKGROUND Data from genome-wide association studies (GWASs) have been used to estimate the heritability of human complex traits in recent years. Existing methods are based on the linear mixed model, with the assumption that the genetic effects are random variables, which contrasts with the fixed effect assumption embedded in the framework of quantitative genetics theory. Moreover, heritability estimators provided by existing methods may have large standard errors, which calls for the development of reliable and accurate methods to estimate heritability. RESULTS In this paper, we first investigate the influence of the fixed and random effect assumptions on heritability estimation, and prove theoretically that these two assumptions are equivalent under mild conditions. Second, we propose a two-stage strategy that first performs sparse regularization via cross-validated elastic net and then applies variance estimation methods to construct reliable heritability estimates. Results on both simulated and real data show that our strategy achieves a considerable reduction in the standard error while preserving accuracy. CONCLUSIONS The proposed strategy allows for reliable and accurate heritability estimation using GWAS data. It suggests that reliable estimates can still be obtained even with a relatively restricted sample size, and it should be especially useful for large-scale heritability analyses in the genomics era.
Affiliation(s)
- Xin Li
- School of Mathematical Sciences, Zhejiang University, 38 Zheda Road, Hangzhou, 310027 China
| | - Dongya Wu
- Brainnetome Center, Institute of Automation, Chinese Academy of Sciences, 95 East Zhongguancun Road, Beijing, 100190 China
- National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, 95 East Zhongguancun Road, Beijing, 100190 China
- University of Chinese Academy of Sciences, 19 Yuquan Road, Beijing, 100049 China
| | - Yue Cui
- Brainnetome Center, Institute of Automation, Chinese Academy of Sciences, 95 East Zhongguancun Road, Beijing, 100190 China
- National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, 95 East Zhongguancun Road, Beijing, 100190 China
| | - Bing Liu
- Brainnetome Center, Institute of Automation, Chinese Academy of Sciences, 95 East Zhongguancun Road, Beijing, 100190 China
- National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, 95 East Zhongguancun Road, Beijing, 100190 China
| | - Henrik Walter
- Department of Psychiatry and Psychotherapy, Campus Charité Mitte, Charité, Universitätsmedizin Berlin, Berlin, Germany
| | - Gunter Schumann
- Centre for Population Neuroscience and Stratified Medicine (PONS) and MRC-SGDP Centre, Institute of Psychiatry, Psychology & Neuroscience, King’s College London, London, United Kingdom
| | - Chong Li
- School of Mathematical Sciences, Zhejiang University, 38 Zheda Road, Hangzhou, 310027 China
| | - Tianzi Jiang
- Brainnetome Center, Institute of Automation, Chinese Academy of Sciences, 95 East Zhongguancun Road, Beijing, 100190 China
- National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, 95 East Zhongguancun Road, Beijing, 100190 China
- CAS Center for Excellence in Brain Science and Intelligence Technology, Institute of Automation, Chinese Academy of Sciences, 95 East Zhongguancun Road, Beijing, 100190 China
- The Clinical Hospital of Chengdu Brain Science Institute, MOE Key Lab for Neuroinformation, University of Electronic Science and Technology of China, 4 Section 2 North Jianshe Road, Chengdu, 610054 China
- The Queensland Brain Institute, University of Queensland, Brisbane, QLD 4072 Australia
- University of Chinese Academy of Sciences, 19 Yuquan Road, Beijing, 100049 China
|
7
|
Yazdani A, Yazdani A, Méndez Giráldez R, Aguilar D, Sartore L. A Multi-Trait Approach Identified Genetic Variants Including a Rare Mutation in RGS3 with Impact on Abnormalities of Cardiac Structure/Function. Sci Rep 2019; 9:5845. [PMID: 30971721 PMCID: PMC6458140 DOI: 10.1038/s41598-019-41362-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 11/27/2018] [Accepted: 03/05/2019] [Indexed: 01/29/2023] Open
Abstract
Heart failure is a major cause of premature death. Given the heterogeneity of the heart failure syndrome, identifying genetic determinants of cardiac function and structure may provide greater insights into heart failure. Despite progress in understanding the genetic basis of heart failure through genome-wide association studies, the heritability of heart failure is not well understood. Gaining further insights into mechanisms that contribute to heart failure requires systematic approaches that go beyond single-trait analysis. We integrated a Bayesian multi-trait approach and Bayesian networks for the analysis of 10 correlated traits of cardiac structure and function measured across 3387 individuals with whole exome sequence data. While single-trait approaches did not find any significant genetic variant, the integrative Bayesian multi-trait approach identified 3 novel variants located in the genes RGS3, CHD3, and MRPL38 with significant impact on cardiac traits such as left ventricular volume index, parasternal long axis interventricular septum thickness, and mean left ventricular wall thickness. Among these, the rare variant NC_000009.11:g.116346115C > A (rs144636307) in RGS3 showed a pleiotropic effect on left ventricular mass index, left ventricular volume index, and maximal left atrial anterior-posterior diameter; RGS3 can inhibit TGF-beta signaling, which is associated with left ventricle dilation and systolic dysfunction.
Affiliation(s)
- Akram Yazdani
- Department of Genetics and Genomic Science, Icahn School of Medicine at Mount Sinai, New York, NY, USA; Climax Data Pattern, Boston, MA, USA
| | - Azam Yazdani
- School of Medicine, Boston University, Boston, MA, USA
| | - Raúl Méndez Giráldez
- Lineberger Comprehensive Cancer Center, School of Medicine, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
| | | | - Luca Sartore
- National Institute of Statistical Science, Washington, DC, USA
|
8
|
Geng Y, Zhao Z, Zhang X, Wang W, Cui X, Ye K, Xiao X, Wang J. An improved burden-test pipeline for identifying associations from rare germline and somatic variants. BMC Genomics 2017. [PMID: 29513197 PMCID: PMC5657102 DOI: 10.1186/s12864-017-4133-4] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Identifying rare germline and somatic variants associated with cancer progression is an important research topic in cancer genomics. Although many approaches have been proposed for rare variant association studies, they are not well suited to cancer sequencing data due to multiple issues, such as over-reliance on pre-selection and losing sight of interacting hotspots. RESULTS In this article, we propose an improved pipeline to identify germline variant and somatic mutation interactions influencing cancer susceptibility from pair-wise cancer sequencing data. The proposed pipeline, RareProb-C, performs an algorithmic selection on the given variants by incorporating the variant allelic frequencies. Interactions among the variants are considered within regions delimited by a four-gamete test. It then filters singular cases according to the posterior probability at each site. Finally, it outputs the selected candidates that pass a collapse test. CONCLUSIONS We apply RareProb-C to a series of carefully constructed simulation cases, where it outperforms six existing genetic model-free approaches. We also test RareProb-C on 429 TCGA ovarian cancer cases, and it successfully identifies the known highlighted variants that are considered to increase disease susceptibility.
Affiliation(s)
- Yu Geng
- School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, Shaanxi, China; Jinzhou Medical University, Jinzhou, Liaoning, 121001, China
| | - Zhongmeng Zhao
- School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, Shaanxi, China; Institute of Data Science and Information Quality, Shaanxi Engineering Research Center of Medical and Health Big Data, Xi'an Jiaotong University, Xi'an, 710049, Shaanxi, China
| | - Xuanping Zhang
- School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, Shaanxi, China; Institute of Data Science and Information Quality, Shaanxi Engineering Research Center of Medical and Health Big Data, Xi'an Jiaotong University, Xi'an, 710049, Shaanxi, China
| | - Wenke Wang
- School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, Shaanxi, China
| | - Xingjian Cui
- School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, Shaanxi, China; Institute of Data Science and Information Quality, Shaanxi Engineering Research Center of Medical and Health Big Data, Xi'an Jiaotong University, Xi'an, 710049, Shaanxi, China
| | - Kai Ye
- School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, Shaanxi, China; Institute of Data Science and Information Quality, Shaanxi Engineering Research Center of Medical and Health Big Data, Xi'an Jiaotong University, Xi'an, 710049, Shaanxi, China
| | - Xiao Xiao
- Institute of Data Science and Information Quality, Shaanxi Engineering Research Center of Medical and Health Big Data, Xi'an Jiaotong University, Xi'an, 710049, Shaanxi, China; State Key Laboratory of Cancer Biology, Xijing Hospital of Digestive Diseases, Xi'an, 710032, Shaanxi, China
| | - Jiayin Wang
- School of Management, Xi'an Jiaotong University, Xi'an, 710049, Shaanxi, China; Institute of Data Science and Information Quality, Shaanxi Engineering Research Center of Medical and Health Big Data, Xi'an Jiaotong University, Xi'an, 710049, Shaanxi, China
|
9
|
Longitudinal data analysis for rare variants detection with penalized quadratic inference function. Sci Rep 2017; 7:650. [PMID: 28381821 PMCID: PMC5429681 DOI: 10.1038/s41598-017-00712-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 12/09/2016] [Accepted: 03/08/2017] [Indexed: 11/08/2022] Open
Abstract
Longitudinal genetic data provide more information regarding genetic effects over time compared with cross-sectional data. With next-generation sequencing technologies, it has become feasible to identify important genes containing both rare and common variants in a longitudinal design. In this work, we adopted a weighted sum statistic (WSS) to collapse multiple variants in a gene region to form a gene score. When multiple genes in a pathway were considered together, a penalized longitudinal model under the quadratic inference function (QIF) framework was applied for efficient gene selection. We evaluated the estimation accuracy and model selection performance under different model settings, then applied the method to a real dataset from the Genetic Analysis Workshop 18 (GAW18). Compared with the unpenalized QIF method, the penalized QIF (pQIF) method achieved better estimation accuracy and higher selection efficiency. The pQIF remained optimal even when the working correlation structure was mis-specified. The real data analysis identified one important gene, angiotensin II receptor type 1 (AGTR1), in the Ca2+/AT-IIR/α-AR signaling pathway. The estimated effect implied that AGTR1 may have a protective effect against hypertension. Our pQIF method provides a general tool for longitudinal sequencing studies involving large numbers of genetic variants.
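The collapsing step can be sketched as follows. This is a simplified illustration of a weighted-sum gene score, not the paper's exact statistic: the weight 1/sqrt(q(1 - q)) with a pseudo-count-smoothed allele frequency q estimated from all subjects is an assumption made here for illustration, as are the function name and toy genotypes.

```python
import math


def weighted_sum_gene_scores(genotypes):
    """genotypes[j][k] is the minor-allele count (0, 1 or 2) of subject j at variant k.

    Returns (scores, weights): one gene score per subject, one weight per variant.
    """
    n_subjects = len(genotypes)
    n_variants = len(genotypes[0])
    weights = []
    for k in range(n_variants):
        minor_count = sum(row[k] for row in genotypes)
        # Smoothed allele frequency; the pseudo-counts keep q strictly in (0, 1).
        q = (minor_count + 1) / (2 * n_subjects + 2)
        # Assumed weighting: rarer variants contribute more to the score.
        weights.append(1.0 / math.sqrt(q * (1.0 - q)))
    scores = [sum(g * w for g, w in zip(row, weights)) for row in genotypes]
    return scores, weights


# Toy data (assumed): 4 subjects, 2 variants in one gene region.
example_genotypes = [
    [0, 1],
    [0, 1],
    [1, 0],
    [0, 0],
]
scores, weights = weighted_sum_gene_scores(example_genotypes)
print([round(w, 3) for w in weights])
print([round(s, 3) for s in scores])
```

The resulting per-subject gene scores would then serve as covariates in a downstream model, which in the abstract is the penalized QIF model.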
|
10
|
Loehlein Fier H, Prokopenko D, Hecker J, Cho MH, Silverman EK, Weiss ST, Tanzi RE, Lange C. On the association analysis of genome-sequencing data: A spatial clustering approach for partitioning the entire genome into nonoverlapping windows. Genet Epidemiol 2017; 41:332-340. [PMID: 28318110 DOI: 10.1002/gepi.22040] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 10/18/2016] [Revised: 12/20/2016] [Accepted: 02/04/2017] [Indexed: 12/16/2022]
Abstract
For the association analysis of whole-genome sequencing (WGS) studies, we propose an efficient and fast spatial-clustering algorithm. Compared to existing analysis approaches for WGS data, that define the tested regions either by sliding or consecutive windows of fixed sizes along variants, a meaningful grouping of nearby variants into consecutive regions has the advantage that, compared to sliding window approaches, the number of tested regions is likely to be smaller. In comparison to consecutive, fixed-window approaches, our approach is likely to group nearby variants together. Given existing biological evidence that disease-associated mutations tend to physically cluster in specific regions along the chromosome, the identification of meaningful groups of nearby located variants could thus lead to a potential power gain for association analysis. Our algorithm defines consecutive genomic regions based on the physical positions of the variants, assuming an inhomogeneous Poisson process and groups together nearby variants. As parameters are estimated locally, the algorithm takes the differing variant density along the chromosome into account and provides locally optimal partitioning of variants into consecutive regions. An R-implementation of the algorithm is provided. We discuss the theoretical advances of our algorithm compared to existing, window-based approaches and show the performance and advantage of our introduced algorithm in a simulation study and by an application to Alzheimer's disease WGS data. Our analysis identifies a region in the ITGB3 gene that potentially harbors disease susceptibility loci for Alzheimer's disease. The region-based association signal of ITGB3 replicates in an independent data set and achieves formally genome-wide significance. Software Implementation: An implementation of the algorithm in R is available at: https://github.com/heidefier/cluster_wgs_data.
Affiliation(s)
- Heide Loehlein Fier
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, United States of America; Working Group of Genomic Mathematics, University of Bonn, Bonn, Germany
| | - Dmitry Prokopenko
- Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts, United States of America
| | - Julian Hecker
- Working Group of Genomic Mathematics, University of Bonn, Bonn, Germany
| | - Michael H Cho
- Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts, United States of America
| | - Edwin K Silverman
- Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts, United States of America
| | - Scott T Weiss
- Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts, United States of America
| | - Rudolph E Tanzi
- Genetics and Aging Research Unit, MassGeneral Institute for Neurodegenerative Disease, Massachusetts General Hospital, Harvard Medical School, Charlestown, Massachusetts, United States of America
| | - Christoph Lange
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, United States of America; Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts, United States of America
|
11
|
Block-based association tests for rare variants using Kullback–Leibler divergence. J Hum Genet 2016; 61:965-975. [DOI: 10.1038/jhg.2016.90] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/17/2015] [Revised: 05/03/2016] [Accepted: 06/17/2016] [Indexed: 11/09/2022]
|
12
|
Yazdani A, Yazdani A, Liu X, Boerwinkle E. Identification of Rare Variants in Metabolites of the Carnitine Pathway by Whole Genome Sequencing Analysis. Genet Epidemiol 2016; 40:486-91. [PMID: 27256581 DOI: 10.1002/gepi.21980] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 09/28/2015] [Revised: 01/06/2016] [Accepted: 04/04/2016] [Indexed: 12/28/2022]
Abstract
We use whole genome sequence data and rare variant analysis methods to investigate a subset of the human serum metabolome, including 16 carnitine-related metabolites that are important components of mammalian energy metabolism. Medium pass sequence data consisting of 12,820,347 rare variants and serum metabolomics data were available on 1,456 individuals. By applying a penalization method, we identified two genes FGF8 and MDGA2 with significant effects on lysine and cis-4-decenoylcarnitine, respectively, using Δ-AIC and likelihood ratio test statistics. Single variant analyses in these regions did not identify a single low-frequency variant (minor allele count > 3) responsible for the underlying signal. The results demonstrate the utility of whole genome sequence and innovative analyses for identifying candidate regions influencing complex phenotypes.
Affiliation(s)
- Akram Yazdani
- Human Genetics Center, The University of Texas Health Science Center at Houston, Houston, Texas, United States of America
| | - Azam Yazdani
- Human Genetics Center, The University of Texas Health Science Center at Houston, Houston, Texas, United States of America
| | - Xiaoming Liu
- Human Genetics Center, The University of Texas Health Science Center at Houston, Houston, Texas, United States of America
| | - Eric Boerwinkle
- Human Genetics Center, The University of Texas Health Science Center at Houston, Houston, Texas, United States of America; Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, United States of America
|