1
Ahlinder J, Hall D, Suontama M, Sillanpää MJ. Principal component analysis revisited: fast multitrait genetic evaluations with smooth convergence. G3 (Bethesda) 2024:jkae228. [PMID: 39429114 DOI: 10.1093/g3journal/jkae228] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2024] [Accepted: 09/10/2024] [Indexed: 10/22/2024]
Abstract
A cornerstone in breeding and population genetics is the genetic evaluation procedure, needed to make important decisions on population management. Multivariate mixed model analysis, in which many traits are considered jointly, utilizes genetic and environmental correlations between traits to improve the accuracy. However, the number of parameters in the multitrait model grows exponentially with the number of traits, which reduces its scalability. Here, we suggest using principal component analysis to reduce the dimensions of the response variables, and then using the computed principal components as separate responses in the genetic evaluation analysis. Because principal components are orthogonal to each other, so that phenotypic covariance between them is absent, a full multivariate analysis can be approximated by separate univariate analyses, which should speed up computations considerably. We compared the approach to both traditional multivariate analysis and a factor analytic approach in terms of computational requirement and rank lists according to predicted genetic merit on two forest tree datasets with 22 and 27 measured traits, respectively. Obtained rank lists of the top 50 individuals were in good agreement. Interestingly, the approach required only a few seconds of computation time and showed no convergence issues, unlike the traditional approach, which required considerably more time to run (7 and 10 h, respectively). The factor analytic approach took approximately 5-10 min. Our approach can easily handle missing data and can be used with all available linear mixed effect model software as it does not require any specific implementation. The approach can help to mitigate difficulties with multitrait genetic analysis in both breeding and wild populations.
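The dimension-reduction step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `trait_principal_components`, the simulated trait matrix, and the choice to standardise traits are assumptions made here for illustration, and the sketch assumes complete data (the abstract notes that the published approach also handles missing values). It only shows that principal component scores have zero covariance with each other, which is the property that allows each component to be passed to a separate univariate mixed model.

```python
import numpy as np

def trait_principal_components(Y):
    """PCA of an (n_individuals x n_traits) trait matrix with complete data.

    Returns PC scores, loadings, and the variance explained by each PC.
    """
    # Standardise each trait so that PCA acts on the correlation matrix.
    Z = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)
    # Thin SVD: rows of Vt are the loadings, U * s are the PC scores.
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U * s
    eigvals = s**2 / (Z.shape[0] - 1)
    return scores, Vt.T, eigvals

# Hypothetical simulated data: 500 individuals, 6 correlated traits.
rng = np.random.default_rng(0)
n, t = 500, 6
mixing = rng.normal(size=(t, t))
Y = rng.normal(size=(n, t)) @ mixing

scores, loadings, eigvals = trait_principal_components(Y)

# Covariance between different PCs is zero up to rounding error.
pc_cov = np.cov(scores, rowvar=False)
off_diag = pc_cov - np.diag(np.diag(pc_cov))
print("largest off-diagonal covariance:", np.abs(off_diag).max())
```

Each column of `scores` would then serve as the response in its own univariate analysis, in whichever mixed model software is at hand.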
Affiliation(s)
- Jon Ahlinder
- Department of Tree Breeding, Skogforsk, Box 3, Tomterna 1, Sävar SE-91821, Sweden
- David Hall
- Department of Tree Breeding, Skogforsk, Box 3, Tomterna 1, Sävar SE-91821, Sweden
- Department of Ecology and Environmental Science, Umeå University, Umeå SE-90736, Sweden
- Mari Suontama
- Department of Tree Breeding, Skogforsk, Box 3, Tomterna 1, Sävar SE-91821, Sweden
- Mikko J Sillanpää
- Research Unit of Mathematical Sciences, Oulu University, Oulu FI-90014, Finland
2
He Q, Liu H, Lu L, Zhang Q, Wang Q, Wang B, Wu X, Guan L, Mao J, Xue Y, Zhang C, Cao X, He Y, Peng X, Peng H, Zhao K, Li H, Jin X, Zhao L, Zhang J, Wang T. A genome-wide association study of neonatal metabolites. Cell Genomics 2024; 4:100668. [PMID: 39389019 DOI: 10.1016/j.xgen.2024.100668] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/09/2023] [Revised: 12/16/2023] [Accepted: 09/11/2024] [Indexed: 10/12/2024]
Abstract
Genetic factors significantly influence the concentration of metabolites in adults. Nevertheless, the genetic influence on neonatal metabolites remains uncertain. To bridge this gap, we employed genotype imputation techniques on large-scale low-pass genome data obtained from non-invasive prenatal testing. Subsequently, we conducted association studies on a total of 75 metabolic components in neonates. The study identified 19 previously reported associations and 11 novel associations between single-nucleotide polymorphisms and metabolic components. These associations were initially found in the discovery cohort (8,744 participants) and subsequently confirmed in a replication cohort (19,041 participants). The average heritability of metabolic components was estimated to be 76.2%, with a range of 69%-78.8%. These findings offer valuable insights into the genetic architecture of neonatal metabolism.
Affiliation(s)
- Quanze He
- The Affiliated Suzhou Hospital of Nanjing Medical University, Suzhou, Jiangsu Province 215000, China; Suzhou Municipal Hospital, Suzhou Jiangsu 215000, China
- Hankui Liu
- Hebei Industrial Technology Research Institute of Genomics in Maternal & Child Health, Clin Lab, BGI Genomics, Shijiazhuang 050035, China; BGI Genomics, Shenzhen 518083, China
- Lu Lu
- The Affiliated Suzhou Hospital of Nanjing Medical University, Suzhou, Jiangsu Province 215000, China
- Qin Zhang
- The Affiliated Suzhou Hospital of Nanjing Medical University, Suzhou, Jiangsu Province 215000, China
- Qi Wang
- The Affiliated Suzhou Hospital of Nanjing Medical University, Suzhou, Jiangsu Province 215000, China
- Benjing Wang
- The Affiliated Suzhou Hospital of Nanjing Medical University, Suzhou, Jiangsu Province 215000, China
- Xiaojuan Wu
- The Affiliated Suzhou Hospital of Nanjing Medical University, Suzhou, Jiangsu Province 215000, China
- Liping Guan
- Hebei Industrial Technology Research Institute of Genomics in Maternal & Child Health, Clin Lab, BGI Genomics, Shijiazhuang 050035, China; BGI Genomics, Shenzhen 518083, China
- Jun Mao
- The Affiliated Suzhou Hospital of Nanjing Medical University, Suzhou, Jiangsu Province 215000, China
- Ying Xue
- The Affiliated Suzhou Hospital of Nanjing Medical University, Suzhou, Jiangsu Province 215000, China
- Chunhua Zhang
- The Affiliated Suzhou Hospital of Nanjing Medical University, Suzhou, Jiangsu Province 215000, China
- Xinye Cao
- Clinical Medicine Department, Xinjiang Medical University, Urumqi, Xinjiang Province 830054, China
- Yuxing He
- Clinical Medicine Department, Xinjiang Medical University, Urumqi, Xinjiang Province 830054, China
- Xiangwen Peng
- Changsha Hospital for Maternal and Child Health Care of Hunan Normal University, Changsha, Hunan Province 431005, China
- Kangrong Zhao
- The Affiliated Suzhou Hospital of Nanjing Medical University, Suzhou, Jiangsu Province 215000, China
- Hong Li
- The Affiliated Suzhou Hospital of Nanjing Medical University, Suzhou, Jiangsu Province 215000, China
- Xin Jin
- BGI Research, Shenzhen 518083, China; The Innovation Centre of Ministry of Education for Development and Diseases, School of Medicine, South China University of Technology, Guangzhou 510006, China; Shanxi Medical University-BGI Collaborative Center for Future Medicine, Shanxi Medical University, Taiyuan 030001, China; Shenzhen Key Laboratory of Transomics Biotechnologies, BGI Research, Shenzhen 518083, China.
- Lijian Zhao
- Hebei Industrial Technology Research Institute of Genomics in Maternal & Child Health, Clin Lab, BGI Genomics, Shijiazhuang 050035, China; BGI Genomics, Shenzhen 518083, China; Medical Technology College, Hebei Medical University, Shijiazhuang 050000, China.
- Jianguo Zhang
- Hebei Industrial Technology Research Institute of Genomics in Maternal & Child Health, Clin Lab, BGI Genomics, Shijiazhuang 050035, China; BGI Research, Shenzhen 518083, China; School of Public Health, Hebei Medical University, Shijiazhuang 050000, China.
- Ting Wang
- The Affiliated Suzhou Hospital of Nanjing Medical University, Suzhou, Jiangsu Province 215000, China; Suzhou Municipal Hospital, Suzhou Jiangsu 215000, China.
3
von Hinke S, Vitt N. An analysis of the accuracy of retrospective birth location recall using sibling data. Nat Commun 2024; 15:2665. [PMID: 38531849 DOI: 10.1038/s41467-024-46781-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/21/2023] [Accepted: 03/06/2024] [Indexed: 03/28/2024] Open
Abstract
Many surveys ask participants to retrospectively record their location of birth. This paper examines the accuracy of such data in the UK Biobank using a sample of full siblings. Comparison of reported birth locations for siblings with different age gaps allows us to estimate the probabilities of household moves and of misreported birth locations. Our first contribution is to show that there are inaccuracies in retrospective birth location data, showing a sizeable probability of misreporting, with 28% of birth coordinates, 16% of local districts and 6% of counties of birth being incorrectly reported. Our second contribution is to show that such error can lead to substantial attenuation bias when investigating the impacts of location-based exposures, especially when there is little spatial correlation and limited time variation in the exposure variable. Sibling fixed effect models are shown to be particularly vulnerable to the attenuation bias. Our third contribution is to highlight possible solutions to the attenuation bias and sensitivity analyses to the reporting error.
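The attenuation bias mentioned in this abstract can be illustrated with a small simulation. This is a sketch under stated assumptions, not the authors' estimator: the exposure, effect size, and sample size are hypothetical, and a misreported location is assumed to yield an exposure value that is independent of the true one (the case of no spatial correlation). Only the 28% misreporting rate for birth coordinates is taken from the abstract. Under these assumptions the regression slope is expected to shrink by roughly the share of correctly reported locations.

```python
import numpy as np

def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    x_c = x - x.mean()
    return float(x_c @ (y - y.mean()) / (x_c @ x_c))

rng = np.random.default_rng(1)
n = 200_000            # hypothetical sample size
beta = 1.0             # hypothetical true effect of the exposure
p_misreport = 0.28     # share of misreported birth coordinates (from the abstract)

# True exposure at the actual birth location, and the outcome it drives.
x_true = rng.normal(size=n)
y = beta * x_true + rng.normal(size=n)

# Assumption: a misreported location gives an unrelated exposure value.
misreported = rng.random(n) < p_misreport
x_wrong = rng.normal(size=n)
x_obs = np.where(misreported, x_wrong, x_true)

slope_true = ols_slope(x_true, y)
slope_obs = ols_slope(x_obs, y)
print("slope with true exposure:    ", round(slope_true, 3))
print("slope with reported exposure:", round(slope_obs, 3))
print("expected under assumptions:  ", beta * (1 - p_misreport))
```

With spatially correlated exposures the wrongly assigned value would still carry information about the true one, so the shrinkage would be smaller than in this worst-case sketch.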
Affiliation(s)
- Stephanie von Hinke
- School of Economics, University of Bristol, Bristol, United Kingdom.
- Institute for Fiscal Studies, London, United Kingdom.
- Institute for the Study of Labor (IZA), Bonn, Germany.
- Nicolai Vitt
- School of Economics, University of Bristol, Bristol, United Kingdom.
4
Kolobkov D, Mishra Sharma S, Medvedev A, Lebedev M, Kosaretskiy E, Vakhitov R. Efficacy of federated learning on genomic data: a study on the UK Biobank and the 1000 Genomes Project. Front Big Data 2024; 7:1266031. [PMID: 38487517 PMCID: PMC10937521 DOI: 10.3389/fdata.2024.1266031] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2023] [Accepted: 01/31/2024] [Indexed: 03/17/2024] Open
Abstract
Combining training data from multiple sources increases sample size and reduces confounding, leading to more accurate and less biased machine learning models. In healthcare, however, direct pooling of data is often not allowed by data custodians who are accountable for minimizing the exposure of sensitive information. Federated learning offers a promising solution to this problem by training a model in a decentralized manner, thus reducing the risks of data leakage. Although there is increasing utilization of federated learning on clinical data, its efficacy on individual-level genomic data has not been studied. This study lays the groundwork for the adoption of federated learning for genomic data by investigating its applicability in two scenarios: phenotype prediction on the UK Biobank data and ancestry prediction on the 1000 Genomes Project data. We show that federated models trained on data split into independent nodes achieve performance close to centralized models, even in the presence of significant inter-node heterogeneity. Additionally, we investigate how federated model accuracy is affected by communication frequency and suggest approaches to reduce computational complexity or communication costs.
Affiliation(s)
- Dmitry Kolobkov
- GENXT, Hinxton, United Kingdom
- Laboratory of Ecological Genetics, Vavilov Institute of General Genetics, Moscow, Russia
- Satyarth Mishra Sharma
- GENXT, Hinxton, United Kingdom
- Center for Artificial Intelligence Technology, Skolkovo Institute of Science and Technology, Moscow, Russia
- Aleksandr Medvedev
- GENXT, Hinxton, United Kingdom
- Center for Artificial Intelligence Technology, Skolkovo Institute of Science and Technology, Moscow, Russia
5
Chen H, Naseri A, Zhi D. FiMAP: A fast identity-by-descent mapping test for biobank-scale cohorts. PLoS Genet 2023; 19:e1011057. [PMID: 38039339 PMCID: PMC10718418 DOI: 10.1371/journal.pgen.1011057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2023] [Revised: 12/13/2023] [Accepted: 11/07/2023] [Indexed: 12/03/2023] Open
Abstract
Although genome-wide association studies (GWAS) have identified tens of thousands of genetic loci, the genetic architecture is still not fully understood for many complex traits. Most GWAS and sequencing association studies have focused on single nucleotide polymorphisms or copy number variations, including common and rare genetic variants. However, phased haplotype information is often ignored in GWAS or variant set tests for rare variants. Here we leverage the identity-by-descent (IBD) segments inferred from a random projection-based IBD detection algorithm in the mapping of genetic associations with complex traits, to develop a computationally efficient statistical test for IBD mapping in biobank-scale cohorts. We used sparse linear algebra and random matrix algorithms to speed up the computation, and a genome-wide IBD mapping scan of more than 400,000 samples finished within a few hours. Simulation studies showed that our new method had well-controlled type I error rates under the null hypothesis of no genetic association in large biobank-scale cohorts, and outperformed traditional GWAS single-variant tests when the causal variants were untyped and rare, or in the presence of haplotype effects. We also applied our method to IBD mapping of six anthropometric traits using the UK Biobank data and identified a total of 3,442 associations, 2,131 (62%) of which remained significant after conditioning on suggestive tag variants in the ± 3 centimorgan flanking regions from GWAS.
Affiliation(s)
- Han Chen
- Human Genetics Center, Department of Epidemiology, School of Public Health, The University of Texas Health Science Center at Houston, Houston, Texas, United States of America
- Ardalan Naseri
- Center for Artificial Intelligence and Genome Informatics, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, United States of America
- Degui Zhi
- Center for Artificial Intelligence and Genome Informatics, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, United States of America
6
Yuan D, Mancuso N. SuSiE PCA: A scalable Bayesian variable selection technique for principal component analysis. iScience 2023; 26:108181. [PMID: 37953948 PMCID: PMC10638022 DOI: 10.1016/j.isci.2023.108181] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/29/2023] [Revised: 09/20/2023] [Accepted: 10/09/2023] [Indexed: 11/14/2023] Open
Abstract
Latent factor models, like principal component analysis (PCA), provide a statistical framework to infer low-rank representation in various biological contexts. However, feature selection is challenging when this low-rank structure manifests from a sparse subspace. We introduce SuSiE PCA, a scalable sparse latent factor approach that evaluates uncertainty in contributing variables through posterior inclusion probabilities. We validate our model in extensive simulations and demonstrate that SuSiE PCA outperforms other approaches in signal detection and model robustness. We apply SuSiE PCA to multi-tissue expression quantitative trait loci (eQTLs) data from GTEx v8 and identify tissue-specific factors and their contributing eGenes. We further investigate its performance on the large-scale perturbation data and find that SuSiE PCA identifies modules with a higher enrichment of ribosome-related genes than sparse PCA (false discovery rate [FDR] = 9.2 × 10⁻⁸² vs. 1.4 × 10⁻³³), while being ∼18× faster. Overall, SuSiE PCA provides an efficient tool to identify relevant features in high-dimensional biological data.
Affiliation(s)
- Dong Yuan
- Biostatistics Division, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA 90089, USA
- Nicholas Mancuso
- Biostatistics Division, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA 90089, USA
- Center for Genetic Epidemiology, Keck School of Medicine, University of Southern California, Los Angeles, CA 90089, USA
7
Lin BD, Pries LK, van Os J, Luykx JJ, Rutten BPF, Guloksuz S. Adjusting for population stratification in polygenic risk score analyses: a guide for model specifications in the UK Biobank. J Hum Genet 2023; 68:653-656. [PMID: 37188914 DOI: 10.1038/s10038-023-01161-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/24/2023] [Revised: 04/06/2023] [Accepted: 05/08/2023] [Indexed: 05/17/2023]
Abstract
The current study was conducted to provide general guidance for model specifications in polygenic risk score (PRS) analyses of the UK Biobank, such as adjusting for covariates (i.e. age, sex, recruitment centers, and genetic batch) and the number of principal components (PCs) that need to be included. To cover behavioral, physical and mental health outcomes, we evaluated three continuous outcomes (BMI, smoking, drinking) and two binary outcomes (Major Depressive Disorder and educational attainment). We applied 3280 (656 per phenotype) different models including different sets of covariates. We evaluated these different model specifications by comparing regression parameters such as R², coefficients, and P values, as well as ANOVA tests. Findings suggest that up to three PCs appear to be sufficient for controlling population stratification for most outcomes, whereas including other covariates (particularly age and sex) appears to be more essential for model performance.
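The kind of model comparison described in this abstract can be sketched on simulated data. Everything below is hypothetical and chosen for illustration only: the effect sizes, the three simulated PCs, and the helper `r_squared` are assumptions, not values or code from the study. The sketch compares the variance explained by a PRS with and without adjustment for age, sex, and PCs, using the incremental R² of the PRS over a covariates-only model.

```python
import numpy as np

def r_squared(X, y):
    """R² of an ordinary least-squares fit of y on X plus an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ coef
    centered = y - y.mean()
    return 1.0 - (resid @ resid) / (centered @ centered)

rng = np.random.default_rng(7)
n = 5000
pcs = rng.normal(size=(n, 3))              # hypothetical genetic PCs
age = rng.uniform(40, 70, size=n)
sex = rng.integers(0, 2, size=n)

# Assumption: the PRS is correlated with PC1, mimicking population stratification.
prs = 0.5 * pcs[:, 0] + rng.normal(size=n)
y = 0.3 * prs + 0.4 * pcs[:, 0] + 0.02 * age + 0.2 * sex + rng.normal(size=n)

covars = np.column_stack([age, sex, pcs])
r2_base = r_squared(covars, y)                              # covariates only
r2_full = r_squared(np.column_stack([covars, prs]), y)      # covariates + PRS
incremental_r2 = r2_full - r2_base                          # PRS contribution
r2_unadjusted = r_squared(prs.reshape(-1, 1), y)            # PRS alone

print("unadjusted PRS R²: ", round(r2_unadjusted, 3))
print("incremental PRS R²:", round(incremental_r2, 3))
```

In this simulated setting the unadjusted R² overstates the PRS contribution because the PRS also picks up the stratification signal carried by PC1.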
Affiliation(s)
- Bochao Danae Lin
- Department of Psychiatry and Neuropsychology, School for Mental Health and Neuroscience, Maastricht University Medical Centre, Maastricht, The Netherlands
- Department of Preventive Medicine, Institute of Biomedical Informatics, Bioinformatics Center, School of Basic Medical Sciences, Henan University, Kaifeng, China
- Department of Psychiatry, UMC Utrecht Brain Center, University Medical Center Utrecht, Utrecht University, Utrecht, the Netherlands
- Lotta-Katrin Pries
- Department of Psychiatry and Neuropsychology, School for Mental Health and Neuroscience, Maastricht University Medical Centre, Maastricht, The Netherlands
- Jim van Os
- Department of Psychiatry and Neuropsychology, School for Mental Health and Neuroscience, Maastricht University Medical Centre, Maastricht, The Netherlands
- Department of Psychiatry, UMC Utrecht Brain Center, University Medical Center Utrecht, Utrecht University, Utrecht, the Netherlands
- Department of Psychosis Studies, Institute of Psychiatry, Psychology & Neuroscience, King's College London, London, UK
- Jurjen J Luykx
- Department of Psychiatry and Neuropsychology, School for Mental Health and Neuroscience, Maastricht University Medical Centre, Maastricht, The Netherlands
- Department of Psychiatry, UMC Utrecht Brain Center, University Medical Center Utrecht, Utrecht University, Utrecht, the Netherlands
- GGNet Mental Health, Warnsveld, The Netherlands
- Department of Translational Neuroscience, UMC Utrecht Brain Center, University Medical Center Utrecht, Utrecht University, Utrecht, the Netherlands
- Bart P F Rutten
- Department of Psychiatry and Neuropsychology, School for Mental Health and Neuroscience, Maastricht University Medical Centre, Maastricht, The Netherlands
- Sinan Guloksuz
- Department of Psychiatry and Neuropsychology, School for Mental Health and Neuroscience, Maastricht University Medical Centre, Maastricht, The Netherlands.
- Department of Psychiatry, Yale University School of Medicine, New Haven, CT, USA.
8
Li Z, Meisner J, Albrechtsen A. Fast and accurate out-of-core PCA framework for large scale biobank data. Genome Res 2023; 33:1599-1608. [PMID: 37620119 PMCID: PMC10620046 DOI: 10.1101/gr.277525.122] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/21/2022] [Accepted: 08/18/2023] [Indexed: 08/26/2023]
Abstract
Principal component analysis (PCA) is widely used in statistics, machine learning, and genomics for dimensionality reduction and uncovering low-dimensional latent structure. To address the challenges posed by ever-growing data size, fast and memory-efficient PCA methods have gained prominence. In this paper, we propose a novel randomized singular value decomposition (RSVD) algorithm implemented in PCAone, featuring a window-based optimization scheme that enables accelerated convergence while improving the accuracy. Additionally, PCAone incorporates out-of-core and multithreaded implementations for the existing Implicitly Restarted Arnoldi Method (IRAM) and RSVD. Through comprehensive evaluations using multiple large-scale real-world data sets in different fields, we show the advantage of PCAone over existing methods. The new algorithm achieves significantly faster computation time while maintaining accuracy comparable to the slower IRAM method. Notably, our analyses of UK Biobank, comprising around 0.5 million individuals and 6.1 million common single nucleotide polymorphisms, show that PCAone accurately computes the top 40 principal components within 9 h. This analysis effectively captures population structure, signals of selection, structural variants, and low recombination regions, utilizing <20 GB of memory and 20 CPU threads. Furthermore, when applied to single-cell RNA sequencing data featuring 1.3 million cells, PCAone accurately captured the top 40 principal components in 49 min. This performance represents a 10-fold improvement over state-of-the-art tools.
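For readers unfamiliar with randomized SVD, the following is a minimal sketch of the textbook algorithm with power iterations. It is not PCAone's window-based scheme and not its out-of-core implementation; the function name `randomized_svd`, its parameters, and the simulated low-rank matrix are assumptions made here for illustration. It shows the basic idea that such methods build on: project the data onto a small random subspace, refine that subspace with a few passes over the data, and take an exact SVD of the much smaller projected matrix.

```python
import numpy as np

def randomized_svd(A, k, oversample=10, n_power_iter=4, seed=0):
    """Approximate the top-k singular triplets of A with a randomized range finder."""
    rng = np.random.default_rng(seed)
    n_cols = A.shape[1]
    # Random test matrix with a few extra columns for stability.
    omega = rng.normal(size=(n_cols, k + oversample))
    Q, _ = np.linalg.qr(A @ omega)
    # Power iterations sharpen the captured subspace; QR keeps it well conditioned.
    for _ in range(n_power_iter):
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)
    # Exact SVD of the small projected matrix, then map back to the original space.
    B = Q.T @ A
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ U_small)[:, :k], s[:k], Vt[:k]

# Hypothetical test matrix: rank-5 signal plus small noise.
rng = np.random.default_rng(42)
signal = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 200))
A = signal + 0.01 * rng.normal(size=(300, 200))

U, s, Vt = randomized_svd(A, k=5)
s_exact = np.linalg.svd(A, compute_uv=False)[:5]
print("largest relative error in singular values:",
      np.max(np.abs(s - s_exact) / s_exact))
```

Each multiplication by `A` or `A.T` corresponds to one pass over the data, which is why the number of passes is the main cost for data sets that do not fit in memory.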
Affiliation(s)
- Zilong Li
- Section for Computational and RNA Biology, Department of Biology, University of Copenhagen, 2200 København, Denmark
- Jonas Meisner
- Biological and Precision Psychiatry, Mental Health Centre Copenhagen, Copenhagen University Hospital, 2100 København, Denmark
- Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, 2200 København, Denmark
- Anders Albrechtsen
- Section for Computational and RNA Biology, Department of Biology, University of Copenhagen, 2200 København, Denmark
9
Cheng S, Xu Z, Bian S, Chen X, Shi Y, Li Y, Duan Y, Liu Y, Lin J, Jiang Y, Jing J, Li Z, Wang Y, Meng X, Liu Y, Fang M, Jin X, Xu X, Wang J, Wang C, Li H, Liu S, Wang Y. The STROMICS genome study: deep whole-genome sequencing and analysis of 10K Chinese patients with ischemic stroke reveal complex genetic and phenotypic interplay. Cell Discov 2023; 9:75. [PMID: 37479695 PMCID: PMC10362040 DOI: 10.1038/s41421-023-00582-8] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 10/18/2022] [Accepted: 06/21/2023] [Indexed: 07/23/2023] Open
Abstract
Ischemic stroke is a leading cause of global mortality and long-term disability. However, there is a paucity of whole-genome sequencing studies on ischemic stroke, resulting in limited knowledge of the interplay between genomic and phenotypic variations among affected patients. Here, we outline the STROMICS design and present the first whole-genome analysis on ischemic stroke by deeply sequencing and analyzing 10,241 stroke patients from China. We identified 135.59 million variants, >42% of which were novel. Notable disparities in allele frequency were observed between Chinese and other populations for 89 variants associated with stroke risk and 10 variants linked to response to stroke medications. We investigated the population structure of the participants, generating a map of genetic selection consisting of 31 adaptive signals. The adaptation of the MTHFR rs1801133-G allele, which links to genetically evaluated VB9 (folic acid) in southern Chinese patients, suggests a gene-specific folate supplement strategy. Through genome-wide association analysis of 18 stroke-related traits, we discovered 10 novel genetic-phenotypic associations and extensive cross-trait pleiotropy at 6 lipid-trait loci of therapeutic relevance. Additionally, we found that the set of loss-of-function and cysteine-altering variants present in the causal gene NOTCH3 for the autosomal dominant stroke disorder CADASIL displayed a broad neuro-imaging spectrum. These findings deepen our understanding of the relationship between the population and individual genetic layout and clinical phenotype among stroke patients, and provide a foundation for future efforts to utilize human genetic knowledge to investigate mechanisms underlying ischemic stroke outcomes, discover novel therapeutic targets, and advance precision medicine.
Affiliation(s)
- Si Cheng
- Department of Neurology, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
- China National Clinical Research Center for Neurological Diseases, Beijing, China
- Changping Laboratory, Beijing, China
- Clinical Center for Precision Medicine in Stroke, Capital Medical University, Beijing, China
- Center of excellence for Omics Research (CORe), Beijing Tiantan Hospital, Capital Medical University, Beijing, China
- Zhe Xu
- Department of Neurology, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
- China National Clinical Research Center for Neurological Diseases, Beijing, China
- Center of excellence for Omics Research (CORe), Beijing Tiantan Hospital, Capital Medical University, Beijing, China
- Shengzhe Bian
- School of Public Health (Shenzhen), Sun Yat-sen University, Shenzhen, Guangdong, China
- Xi Chen
- BGI-Tianjin, BGI-Shenzhen, Tianjin, China
- Yanfeng Shi
- Department of Neurology, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
- China National Clinical Research Center for Neurological Diseases, Beijing, China
- Center of excellence for Omics Research (CORe), Beijing Tiantan Hospital, Capital Medical University, Beijing, China
- Yanran Li
- Department of Neurology, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
- China National Clinical Research Center for Neurological Diseases, Beijing, China
- Center of excellence for Omics Research (CORe), Beijing Tiantan Hospital, Capital Medical University, Beijing, China
- Yunyun Duan
- Department of Radiology, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
- Yang Liu
- Department of Neurology, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
- China National Clinical Research Center for Neurological Diseases, Beijing, China
- Center of excellence for Omics Research (CORe), Beijing Tiantan Hospital, Capital Medical University, Beijing, China
- Jinxi Lin
- Department of Neurology, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
- China National Clinical Research Center for Neurological Diseases, Beijing, China
- Yong Jiang
- Department of Neurology, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
- China National Clinical Research Center for Neurological Diseases, Beijing, China
- Jing Jing
- Department of Neurology, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
- China National Clinical Research Center for Neurological Diseases, Beijing, China
- Tiantan Neuroimaging Center of Excellence, Beijing, China
- Zixiao Li
- Department of Neurology, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
- China National Clinical Research Center for Neurological Diseases, Beijing, China
- Yilong Wang
- Department of Neurology, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
- Xia Meng
- Department of Neurology, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
- China National Clinical Research Center for Neurological Diseases, Beijing, China
- Yaou Liu
- Department of Radiology, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
- Xin Jin
- BGI-Shenzhen, Shenzhen, Guangdong, China
- Xun Xu
- BGI-Shenzhen, Shenzhen, Guangdong, China
- Guangdong Provincial Key Laboratory of Genome Read and Write, BGI-Shenzhen, Shenzhen, Guangdong, China
- Jian Wang
- BGI-Shenzhen, Shenzhen, Guangdong, China
- James D. Watson Institute of Genome Sciences, Hangzhou, Zhejiang, China
- Chaolong Wang
- Department of Epidemiology and Biostatistics, Ministry of Education Key Laboratory of Environment and Health, State Key Laboratory of Environmental Health (Incubating), School of Public Health, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, Hubei, China
- Hao Li
- Department of Neurology, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
- China National Clinical Research Center for Neurological Diseases, Beijing, China
- Center of excellence for Omics Research (CORe), Beijing Tiantan Hospital, Capital Medical University, Beijing, China
- Siyang Liu
- School of Public Health (Shenzhen), Sun Yat-sen University, Shenzhen, Guangdong, China.
- BGI-Shenzhen, Shenzhen, Guangdong, China.
- Yongjun Wang
- Department of Neurology, Beijing Tiantan Hospital, Capital Medical University, Beijing, China.
- China National Clinical Research Center for Neurological Diseases, Beijing, China.
- Changping Laboratory, Beijing, China.
- Clinical Center for Precision Medicine in Stroke, Capital Medical University, Beijing, China.
- Center of excellence for Omics Research (CORe), Beijing Tiantan Hospital, Capital Medical University, Beijing, China.
10
Khan Z, Jung M, Crow M, Mohindra R, Maiya V, Kaminker JS, Hackos DH, Chandler GS, McCarthy MI, Bhangale T. Whole genome sequencing across clinical trials identifies rare coding variants in GPR68 associated with chemotherapy-induced peripheral neuropathy. Genome Med 2023; 15:45. [PMID: 37344884 DOI: 10.1186/s13073-023-01193-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 11/30/2022] [Accepted: 05/17/2023] [Indexed: 06/23/2023] Open
Abstract
BACKGROUND Dose-limiting toxicities significantly impact the benefit/risk profile of many drugs. Whole genome sequencing (WGS) in patients receiving drugs with dose-limiting toxicities can identify therapeutic hypotheses to prevent these toxicities. Chemotherapy-induced peripheral neuropathy (CIPN) is a common dose-limiting neurological toxicity of chemotherapies with no effective approach for prevention.
METHODS We conducted a genetic study of time-to-first peripheral neuropathy event using 30× germline WGS data from whole blood samples from 4900 European-ancestry cancer patients in 14 randomized controlled trials. A substantial number of patients in these trials received taxane and platinum-based chemotherapies as part of their treatment regimen, either standard of care or in combination with the PD-L1 inhibitor atezolizumab. The trials spanned several cancers including renal cell carcinoma, triple negative breast cancer, non-small cell lung cancer, small cell lung cancer, bladder cancer, ovarian cancer, and melanoma.
RESULTS We identified a locus consisting of low-frequency variants in intron 13 of GRID2 associated with time-to-onset of first peripheral neuropathy (PN) indexed by rs17020773 (p = 2.03 × 10⁻⁸, all patients, p = 6.36 × 10⁻⁹, taxane treated). Gene-level burden analysis identified rare coding variants associated with increased PN risk in the C-terminus of GPR68 (p = 1.59 × 10⁻⁶, all patients, p = 3.47 × 10⁻⁸, taxane treated), a pH-sensitive G-protein coupled receptor (GPCR). The variants driving this signal were found to alter predicted arrestin binding motifs in the C-terminus of GPR68. Analysis of snRNA-seq from human dorsal root ganglia (DRG) indicated that expression of GPR68 was highest in mechano-thermo-sensitive nociceptors.
CONCLUSIONS Our genetic study provides insight into the impact of low-frequency and rare coding genetic variation on PN risk and suggests that further study of GPR68 in sensory neurons may yield a therapeutic hypothesis for prevention of CIPN.
Affiliation(s)
- Zia Khan
  - Genentech, 1 DNA Way, South San Francisco, 94080, USA
- Min Jung
  - Genentech, 1 DNA Way, South San Francisco, 94080, USA
- Megan Crow
  - Genentech, 1 DNA Way, South San Francisco, 94080, USA
- Rajat Mohindra
  - F. Hoffmann-La Roche, Grenzacherstrasse 124, 4070, Basel, Switzerland
- Vidya Maiya
  - Genentech, 1 DNA Way, South San Francisco, 94080, USA
- G Scott Chandler
  - F. Hoffmann-La Roche, Grenzacherstrasse 124, 4070, Basel, Switzerland
|
11
|
Wang N, Yu B, Jun G, Qi Q, Durazo-Arvizu RA, Lindstrom S, Morrison AC, Kaplan RC, Boerwinkle E, Chen H. StocSum: stochastic summary statistics for whole genome sequencing studies. bioRxiv 2023:2023.04.06.535886. [PMID: 37066281 PMCID: PMC10104122 DOI: 10.1101/2023.04.06.535886] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/19/2023]
Abstract
Genomic summary statistics, usually defined as single-variant test results from genome-wide association studies, have been widely used to advance the genetics field in a wide range of applications. Applications that involve multiple genetic variants also require their correlations or linkage disequilibrium (LD) information, often obtained from an external reference panel. In practice, it is usually difficult to find suitable external reference panels that represent the LD structure for underrepresented and admixed populations, or rare genetic variants from whole genome sequencing (WGS) studies, limiting the scope of applications for genomic summary statistics. Here we introduce StocSum, a novel reference-panel-free statistical framework for generating, managing, and analyzing stochastic summary statistics using random vectors. We develop various downstream applications using StocSum including single-variant tests, conditional association tests, gene-environment interaction tests, variant set tests, as well as meta-analysis and LD score regression tools. We demonstrate the accuracy and computational efficiency of StocSum using two cohorts from the Trans-Omics for Precision Medicine Program. StocSum will facilitate sharing and utilization of genomic summary statistics from WGS studies, especially for underrepresented and admixed populations.
Affiliation(s)
- Nannan Wang
  - Human Genetics Center, Department of Epidemiology, Human Genetics and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Bing Yu
  - Human Genetics Center, Department of Epidemiology, Human Genetics and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Goo Jun
  - Human Genetics Center, Department of Epidemiology, Human Genetics and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Qibin Qi
  - Department of Epidemiology & Population Health, Albert Einstein College of Medicine, Bronx, NY, USA
- Ramon A. Durazo-Arvizu
  - The Saban Research Institute, Children’s Hospital Los Angeles, Los Angeles, California
  - Department of Pediatrics, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Sara Lindstrom
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA, USA
  - Department of Epidemiology, School of Public Health, University of Washington, 3980 15th Ave NE, Seattle, WA, USA
- Alanna C. Morrison
  - Human Genetics Center, Department of Epidemiology, Human Genetics and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Robert C. Kaplan
  - Department of Epidemiology & Population Health, Albert Einstein College of Medicine, Bronx, NY, USA
- Eric Boerwinkle
  - Human Genetics Center, Department of Epidemiology, Human Genetics and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, USA
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
- Han Chen
  - Human Genetics Center, Department of Epidemiology, Human Genetics and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, USA
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
|
12
|
Chu BB, Ko S, Zhou JJ, Jensen A, Zhou H, Sinsheimer JS, Lange K. Multivariate genome-wide association analysis by iterative hard thresholding. Bioinformatics 2023; 39:btad193. [PMID: 37067496 PMCID: PMC10133532 DOI: 10.1093/bioinformatics/btad193] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/28/2022] [Revised: 04/07/2023] [Accepted: 04/13/2023] [Indexed: 04/18/2023]
Abstract
MOTIVATION In a genome-wide association study, analyzing multiple correlated traits simultaneously is potentially superior to analyzing the traits one by one. Standard methods for multivariate genome-wide association study operate marker-by-marker and are computationally intensive. RESULTS We present a sparsity constrained regression algorithm for multivariate genome-wide association study based on iterative hard thresholding and implement it in a convenient Julia package MendelIHT.jl. In simulation studies with up to 100 quantitative traits, iterative hard thresholding exhibits similar true positive rates, smaller false positive rates, and faster execution times than GEMMA's linear mixed models and mv-PLINK's canonical correlation analysis. On UK Biobank data with 470 228 variants, MendelIHT completed a three-trait joint analysis (n=185 656) in 20 h and an 18-trait joint analysis (n=104 264) in 53 h with an 80 GB memory footprint. In short, MendelIHT enables geneticists to fit a single regression model that simultaneously considers the effect of all SNPs and dozens of traits. AVAILABILITY AND IMPLEMENTATION Software, documentation, and scripts to reproduce our results are available from https://github.com/OpenMendel/MendelIHT.jl.
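The core idea of iterative hard thresholding can be illustrated with a minimal sketch. This is not the MendelIHT.jl implementation: it assumes a single quantitative trait, a plain least-squares loss, a fixed step size, and simulated data, and the function names `hard_threshold` and `iht` are hypothetical. Each iteration takes a gradient step and then keeps only the k largest coefficients.

```python
import numpy as np

def hard_threshold(beta, k):
    """Keep the k largest-magnitude entries of beta and zero the rest."""
    out = np.zeros_like(beta)
    if k > 0:
        keep = np.argsort(np.abs(beta))[-k:]
        out[keep] = beta[keep]
    return out

def iht(X, y, k, n_iter=200):
    """Projected gradient descent for least squares with at most k nonzero coefficients."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / largest eigenvalue of X^T X
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)          # gradient of 0.5 * ||y - X beta||^2
        beta = hard_threshold(beta - step * grad, k)
    return beta

# Simulated toy data (assumption): 200 samples, 50 predictors, 3 causal predictors.
rng = np.random.default_rng(0)
n, p, k = 200, 50, 3
X = rng.standard_normal((n, p))
true_beta = np.zeros(p)
true_beta[[4, 17, 31]] = [1.5, -2.0, 1.0]
y = X @ true_beta + 0.1 * rng.standard_normal(n)

beta_hat = iht(X, y, k)
selected = np.flatnonzero(beta_hat)  # indices of the predictors kept by the sparsity constraint
```

Because the sparsity level k is imposed directly, all predictors enter one regression model instead of being tested marker by marker, which is the contrast the abstract draws with standard methods.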
Affiliation(s)
- Benjamin B Chu
  - Department of Computational Medicine, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095-1554, United States
- Seyoon Ko
  - Department of Computational Medicine, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095-1554, United States
  - Department of Biostatistics, Fielding School of Public Health at UCLA, Los Angeles, CA 90095-1554, United States
- Jin J Zhou
  - Department of Biostatistics, Fielding School of Public Health at UCLA, Los Angeles, CA 90095-1554, United States
  - Department of Medicine, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095-1554, United States
- Aubrey Jensen
  - Department of Biostatistics, Fielding School of Public Health at UCLA, Los Angeles, CA 90095-1554, United States
- Hua Zhou
  - Department of Computational Medicine, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095-1554, United States
  - Department of Biostatistics, Fielding School of Public Health at UCLA, Los Angeles, CA 90095-1554, United States
- Janet S Sinsheimer
  - Department of Computational Medicine, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095-1554, United States
  - Department of Biostatistics, Fielding School of Public Health at UCLA, Los Angeles, CA 90095-1554, United States
  - Department of Human Genetics, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095-1554, United States
- Kenneth Lange
  - Department of Computational Medicine, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095-1554, United States
  - Department of Human Genetics, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095-1554, United States
  - Department of Statistics at UCLA, Los Angeles, CA 90095-1554, United States
|
13
|
Wei X, Robles CR, Pazokitoroudi A, Ganna A, Gusev A, Durvasula A, Gazal S, Loh PR, Reich D, Sankararaman S. The lingering effects of Neanderthal introgression on human complex traits. eLife 2023; 12:e80757. [PMID: 36939312 PMCID: PMC10076017 DOI: 10.7554/elife.80757] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 06/03/2022] [Accepted: 03/17/2023] [Indexed: 03/21/2023]
Abstract
The genetic variants introduced into the ancestors of modern humans from interbreeding with Neanderthals have been suggested to contribute an unexpected extent to complex human traits. However, testing this hypothesis has been challenging due to the idiosyncratic population genetic properties of introgressed variants. We developed rigorous methods to assess the contribution of introgressed Neanderthal variants to heritable trait variation and applied these methods to analyze 235,592 introgressed Neanderthal variants and 96 distinct phenotypes measured in about 300,000 unrelated white British individuals in the UK Biobank. Introgressed Neanderthal variants make a significant contribution to trait variation (explaining 0.12% of trait variation on average). However, the contribution of introgressed variants tends to be significantly depleted relative to modern human variants matched for allele frequency and linkage disequilibrium (about 59% depletion on average), consistent with purifying selection on introgressed variants. Different from previous studies (McArthur et al., 2021), we find no evidence for elevated heritability across the phenotypes examined. We identified 348 independent significant associations of introgressed Neanderthal variants with 64 phenotypes. Previous work (Skov et al., 2020) has suggested that a majority of such associations are likely driven by statistical association with nearby modern human variants that are the true causal variants. Applying a customized fine-mapping led us to identify 112 regions across 47 phenotypes containing 4303 unique genetic variants where introgressed variants are highly likely to have a phenotypic effect. Examination of these variants reveals their substantial impact on genes that are important for the immune system, development, and metabolism.
Affiliation(s)
- Xinzhu Wei
  - Department of Computational Biology, Cornell University, New York, United States
- Christopher R Robles
  - Department of Human Genetics, University of California, Los Angeles, Los Angeles, United States
- Ali Pazokitoroudi
  - Department of Computer Science, University of California, Los Angeles, Los Angeles, United States
- Andrea Ganna
  - Analytical and Translational Genetics Unit, Center for Genomic Medicine, Massachusetts General Hospital, Boston, United States
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, United States
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, United States
- Alexander Gusev
  - Dana-Farber Cancer Institute, Harvard Medical School, Boston, United States
- Arun Durvasula
  - Department of Genetics, Harvard Medical School, Boston, United States
  - Department of Human Evolutionary Biology, Harvard University, Cambridge, United States
- Steven Gazal
  - Center for Genetic Epidemiology, Department of Public and Population Health Sciences, University of Southern California, Los Angeles, United States
  - Division of Genetics, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, United States
- Po-Ru Loh
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, United States
- David Reich
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, United States
  - Department of Genetics, Harvard Medical School, Boston, United States
  - Department of Human Evolutionary Biology, Harvard University, Cambridge, United States
  - Howard Hughes Medical Institute, Harvard Medical School, Boston, United States
- Sriram Sankararaman
  - Department of Human Genetics, University of California, Los Angeles, Los Angeles, United States
  - Department of Computer Science, University of California, Los Angeles, Los Angeles, United States
  - Department of Computational Medicine, University of California, Los Angeles, Los Angeles, United States
|
14
|
Eliseussen E, Fleischer T, Vitelli V. Rank-based Bayesian variable selection for genome-wide transcriptomic analyses. Stat Med 2022; 41:4532-4553. [PMID: 35844145 PMCID: PMC9796757 DOI: 10.1002/sim.9524] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/09/2021] [Revised: 06/16/2022] [Accepted: 06/27/2022] [Indexed: 01/07/2023]
Abstract
Variable selection is crucial in high-dimensional omics-based analyses, since it is biologically reasonable to assume only a subset of non-noisy features contributes to the data structures. However, the task is particularly hard in an unsupervised setting, and a priori ad hoc variable selection is still a very frequent approach, despite the evident drawbacks and lack of reproducibility. We propose a Bayesian variable selection approach for rank-based unsupervised transcriptomic analysis. Making use of data rankings instead of the actual continuous measurements increases the robustness of conclusions when compared to classical statistical methods, and embedding variable selection into the inferential tasks allows complete reproducibility. Specifically, we develop a novel extension of the Bayesian Mallows model for variable selection that allows for a full probabilistic analysis, leading to coherent quantification of uncertainties. Simulation studies demonstrate the versatility and robustness of the proposed method in a variety of scenarios, as well as its superiority with respect to several competitors when varying the data dimension or data generating process. We use the novel approach to analyze genome-wide RNAseq gene expression data from ovarian cancer patients: several genes that affect cancer development are correctly detected in a completely unsupervised fashion, showing the usefulness of the method in the context of signature discovery for cancer genomics. Moreover, the possibility to also perform uncertainty quantification plays a key role in the subsequent biological investigation.
Affiliation(s)
- Emilie Eliseussen
  - Oslo Centre for Biostatistics and Epidemiology, Department of Biostatistics, University of Oslo, Oslo, Norway
- Thomas Fleischer
  - Department of Cancer Genetics, Institute for Cancer Research, Oslo University Hospital, Oslo, Norway
- Valeria Vitelli
  - Oslo Centre for Biostatistics and Epidemiology, Department of Biostatistics, University of Oslo, Oslo, Norway
|
15
|
Sun K, Yao Y, Yun L, Zhang C, Xie J, Qian X, Tang Q, Sun L. Application of machine learning for ancestry inference using multi-InDel markers. Forensic Sci Int Genet 2022; 59:102702. [DOI: 10.1016/j.fsigen.2022.102702] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 07/06/2021] [Revised: 03/22/2022] [Accepted: 03/27/2022] [Indexed: 01/04/2023]
|
16
|
Sapkota Y, Liu Q, Li N, Bhatt NS, Ehrhardt MJ, Wilson CL, Wang Z, Jefferies JL, Zhang J, Armstrong GT, Hudson MM, Robison LL, Mulrooney DA, Yasui Y. Contribution of Genome-Wide Polygenic Score to Risk of Coronary Artery Disease in Childhood Cancer Survivors. JACC CardioOncol 2022; 4:258-267. [PMID: 35818558 PMCID: PMC9270604 DOI: 10.1016/j.jaccao.2022.04.003] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 10/20/2021] [Revised: 04/15/2022] [Accepted: 04/21/2022] [Indexed: 11/25/2022]
Abstract
Background Adverse cardiovascular outcomes such as coronary artery disease (CAD) are the leading noncancer causes of morbidity and mortality among childhood cancer survivors. Objectives The aim of this study was to assess the role of a genome-wide polygenic score (GPS) for CAD, well validated in the general population, and its interplay with cancer-related risk factors among childhood cancer survivors. Methods In a cohort study of 2,472 5-year childhood cancer survivors from the St. Jude Lifetime Cohort, the association between the GPS and the risk of CAD was assessed using Cox regression models adjusted for age at cancer diagnosis, sex, cumulative dose of anthracyclines, and mean heart radiation dose. Results Among survivors of European ancestry, the GPS was significantly associated with the risk of CAD (HR per 1 SD of the GPS: 1.25; 95% CI: 1.04-1.49; P = 0.014). Compared with the first tertile, survivors in the upper tertile had a greater risk of CAD (1.51-fold higher HR of CAD [95% CI: 0.96-2.37; P = 0.074]), although the difference was not statistically significant. The GPS-CAD association was stronger among survivors diagnosed with cancer at age <10 years exposed to >25 Gy heart radiation (HR top vs. bottom tertile of GPS: 15.49; 95% CI: 5.24-45.52; Ptrend = 0.005) but not among those diagnosed at age ≥10 years (Ptrend ≥ 0.77) and not among those diagnosed at age <10 years exposed to ≤25 Gy heart radiation (Ptrend = 0.23). Among high-risk survivors, defined by an estimated relative hazard ≥3.0 from fitted Cox models including clinical risk factors alone, the cumulative incidence of CAD at 40 years from diagnosis was 29% (95% CI: 13%-45%). After incorporating the GPS into the model, the cumulative incidence increased to 48% (95% CI: 26%-69%). Conclusions Childhood cancer survivors are at risk for premature CAD. A GPS may help identify those who may benefit from targeted screening and personalized preventive interventions.
|
17
|
Peter BM. A geometric relationship of F2, F3 and F4-statistics with principal component analysis. Philos Trans R Soc Lond B Biol Sci 2022; 377:20200413. [PMID: 35430884 PMCID: PMC9014194 DOI: 10.1098/rstb.2020.0413] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 12/16/2022]
Abstract
Principal component analysis (PCA) and F-statistics sensu Patterson are two of the most widely used population genetic tools to study human genetic variation. Here, I derive explicit connections between the two approaches and show that these two methods are closely related. F-statistics have a simple geometrical interpretation in the context of PCA, and orthogonal projections are a key concept to establish this link. I show that for any pair of populations, any population that is admixed as determined by an F3-statistic will lie inside a circle on a PCA plot. Furthermore, the F4-statistic is closely related to an angle measurement, and will be zero if the differences between pairs of populations intersect at a right angle in PCA space. I illustrate my results on two examples, one of Western Eurasian, and one of global human diversity. In both examples, I find that the first few PCs are sufficient to approximate most F-statistics, and that PCA plots are effective at predicting F-statistics. Thus, while F-statistics are commonly understood in terms of discrete populations, the geometric perspective illustrates that they can be viewed in a framework of populations that vary in a more continuous manner.
This article is part of the theme issue ‘Celebrating 50 years since Lewontin's apportionment of human diversity’.
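The geometric reading of F-statistics described in this abstract can be illustrated with a small sketch. It assumes populations are represented by vectors of known allele frequencies (toy, hypothetical values below) and ignores the finite-sample bias corrections used when estimating F-statistics from real data. Under that assumption F2 is a squared Euclidean distance, F3 and F4 are inner products, a negative F3 places the target population inside the sphere whose diameter joins the two source populations, and F4 vanishes when the two difference vectors are orthogonal.

```python
import numpy as np

def f2(a, b):
    """Mean squared allele-frequency difference between populations a and b."""
    return float(np.mean((a - b) ** 2))

def f3(x, a, b):
    """F3(x; a, b): mean inner product of (x - a) and (x - b)."""
    return float(np.mean((x - a) * (x - b)))

def f4(a, b, c, d):
    """F4(a, b; c, d): mean inner product of (a - b) and (c - d)."""
    return float(np.mean((a - b) * (c - d)))

def inside_circle(x, a, b):
    """True if x lies strictly inside the sphere whose diameter is the segment from a to b."""
    centre = (a + b) / 2
    radius = np.linalg.norm(a - b) / 2
    return bool(np.linalg.norm(x - centre) < radius)

# Toy allele frequencies at four SNPs (hypothetical values).
a = np.array([0.1, 0.8, 0.3, 0.5])
b = np.array([0.7, 0.2, 0.9, 0.5])
x = 0.4 * a + 0.6 * b                 # population formed as a 40/60 mixture of a and b

c = np.array([0.2, 0.2, 0.2, 0.1])
d = np.array([0.5, 0.5, 0.2, 0.3])    # chosen so that (c - d) is orthogonal to (a - b)

f3_value = f3(x, a, b)                # negative for the mixture
x_inside = inside_circle(x, a, b)     # the same condition, stated geometrically
f4_value = f4(a, b, c, d)             # zero up to floating-point error
```

In this toy setting the two criteria coincide because F3(x; a, b) equals the squared distance from x to the midpoint of a and b minus the squared radius, divided by the number of SNPs.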
Affiliation(s)
- Benjamin M. Peter
  - Max-Planck-Institute for Evolutionary Anthropology, Leipzig 04103, Germany
|
18
|
Jiang R, Sun T, Song D, Li JJ. Statistics or biology: the zero-inflation controversy about scRNA-seq data. Genome Biol 2022; 23:31. [PMID: 35063006 PMCID: PMC8783472 DOI: 10.1186/s13059-022-02601-5] [Citation(s) in RCA: 130] [Impact Index Per Article: 65.0] [Received: 09/27/2021] [Accepted: 01/04/2022] [Indexed: 12/13/2022]
Abstract
Researchers view vast zeros in single-cell RNA-seq data differently: some regard zeros as biological signals representing no or low gene expression, while others regard zeros as missing data to be corrected. To help address the controversy, here we discuss the sources of biological and non-biological zeros; introduce five mechanisms of adding non-biological zeros in computational benchmarking; evaluate the impacts of non-biological zeros on data analysis; benchmark three input data types: observed counts, imputed counts, and binarized counts; discuss the open questions regarding non-biological zeros; and advocate the importance of transparent analysis.
Affiliation(s)
- Ruochen Jiang
  - Department of Statistics, University of California, Los Angeles, 90095-1554, CA, USA
- Tianyi Sun
  - Department of Statistics, University of California, Los Angeles, 90095-1554, CA, USA
- Dongyuan Song
  - Bioinformatics Interdepartmental Ph.D. Program, University of California, Los Angeles, 90095-7246, CA, USA
- Jingyi Jessica Li
  - Department of Statistics, University of California, Los Angeles, 90095-1554, CA, USA
  - Department of Human Genetics, University of California, Los Angeles, 90095-7088, CA, USA
  - Department of Computational Medicine, University of California, Los Angeles, 90095-1766, CA, USA
  - Department of Biostatistics, University of California, Los Angeles, 90095-1772, CA, USA
|
19
|
Genetic Evaluation and Combined Selection for the Simultaneous Improvement of Growth and Wood Properties in Catalpa bungei Clones. Forests 2021. [DOI: 10.3390/f12070868] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/16/2022]
Abstract
Catalpa bungei is an important timber tree. Improvements in growth and wood quality are important goals of C. bungei breeding, and it is necessary to understand the genetic parameters of specific target traits and genetic correlation between growth traits and wood properties for tree breeding. In this study, the genetic parameters of height, diameter at breast height (DBH) and wood properties were estimated and genetic and phenotypic correlations between growth traits and wood properties were evaluated in C. bungei. Finally, different selection scenarios were used to evaluate and select optimal clones. The results showed that there were significant differences in growth and wood properties among clones. The wood hardness (0.66–0.79), basic density (0.89), air-dried density (0.89) and compression strength parallel to the grain of wood (CSP) (0.84) had high repeatability. The variance component proportions indicated that the variation in wood properties came mainly from different genotypes (clones) rather than from different individuals of the same clone. The DBH showed a significant negative genetic correlation with the hardness of radial section (HRS) (−0.643), basic density (−0.531) and air-dry density (−0.495). This unfavorable relationship makes it difficult to improve growth and wood quality simultaneously in C. bungei. We selected the optimal clones under different scenarios, and we obtained 7.75–9.06% genetic gains for growth in the scenario in which height and DBH were the target traits. Genetic gains of 7.43–14.94% were obtained for wood properties by selecting optimal clones in the scenario in which wood properties were the target traits. Approximately 5% and 4% genetic gains were obtained for growth and wood properties, respectively, for the combined selection. This study provides new insights into the genetic improvement of wood quality in C. bungei.
|
20
|
Privé F, Luu K, Blum MGB, McGrath JJ, Vilhjálmsson BJ. Efficient toolkit implementing best practices for principal component analysis of population genetic data. Bioinformatics 2021; 36:4449-4457. [PMID: 32415959 PMCID: PMC7750941 DOI: 10.1093/bioinformatics/btaa520] [Citation(s) in RCA: 56] [Impact Index Per Article: 18.7] [Received: 02/19/2020] [Revised: 05/07/2020] [Accepted: 05/12/2020] [Indexed: 12/01/2022]
Abstract
Motivation Principal component analysis (PCA) of genetic data is routinely used to infer ancestry and control for population structure in various genetic analyses. However, conducting PCA analyses can be complicated and has several potential pitfalls. These pitfalls include (i) capturing linkage disequilibrium (LD) structure instead of population structure, (ii) projected PCs that suffer from shrinkage bias, (iii) detecting sample outliers and (iv) uneven population sizes. In this work, we explore these potential issues when using PCA, and present efficient solutions to these. Following applications to the UK Biobank and the 1000 Genomes project datasets, we make recommendations for best practices and provide efficient and user-friendly implementations of the proposed solutions in R packages bigsnpr and bigutilsr. Results For example, we find that PC19–PC40 in the UK Biobank capture complex LD structure rather than population structure. Using our automatic algorithm for removing long-range LD regions, we recover 16 PCs that capture population structure only. Therefore, we recommend using only 16–18 PCs from the UK Biobank to account for population structure confounding. We also show how to use PCA to restrict analyses to individuals of homogeneous ancestry. Finally, when projecting individual genotypes onto the PCA computed from the 1000 Genomes project data, we find a shrinkage bias that becomes large for PC5 and beyond. We then demonstrate how to obtain unbiased projections efficiently using bigsnpr. Overall, we believe this work would be of interest for anyone using PCA in their analyses of genetic data, as well as for other omics data. Availability and implementation R packages bigsnpr and bigutilsr can be installed from either CRAN or GitHub (see https://github.com/privefl/bigsnpr). A tutorial on the steps to perform PCA on 1000G data is available at https://privefl.github.io/bigsnpr/articles/bedpca.html. 
All code used for this paper is available at https://github.com/privefl/paper4-bedpca/tree/master/code. Supplementary information Supplementary data are available at Bioinformatics online.
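For orientation, the basic operation discussed in this abstract (PCA of a genotype matrix followed by projection of new individuals) can be sketched with NumPy alone. This is not the bigsnpr or bigutilsr API: the functions `fit_pca` and `project` are hypothetical, the data are simulated, and the projection shown is the naive one, which is the kind the abstract reports as prone to shrinkage bias on real, high-dimensional data. No LD pruning or outlier detection is attempted.

```python
import numpy as np

def fit_pca(G_ref, n_pcs=2):
    """PCA of a reference genotype matrix (individuals x SNPs, allele counts 0/1/2)."""
    p = G_ref.mean(axis=0) / 2.0              # allele frequency per SNP
    keep = (p > 0) & (p < 1)                  # drop monomorphic SNPs
    p = p[keep]
    scale = np.sqrt(2 * p * (1 - p))          # binomial standard deviation per SNP
    Z = (G_ref[:, keep] - 2 * p) / scale      # centred and scaled genotypes
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return {"scores": U[:, :n_pcs] * s[:n_pcs], "loadings": Vt[:n_pcs].T,
            "keep": keep, "p": p, "scale": scale}

def project(G_new, fit):
    """Naive projection of new individuals onto the reference loadings."""
    Z_new = (G_new[:, fit["keep"]] - 2 * fit["p"]) / fit["scale"]
    return Z_new @ fit["loadings"]

# Simulated data (assumption): two reference populations with diverged allele frequencies.
rng = np.random.default_rng(1)
n_snps = 300
freq_a = rng.uniform(0.1, 0.9, n_snps)
freq_b = np.clip(freq_a + rng.normal(0, 0.15, n_snps), 0.05, 0.95)
G_ref = np.vstack([rng.binomial(2, freq_a, (40, n_snps)),
                   rng.binomial(2, freq_b, (40, n_snps))])
G_new = rng.binomial(2, freq_a, (5, n_snps))  # new individuals drawn from population a

fit = fit_pca(G_ref)
new_scores = project(G_new, fit)
```

Scaling each SNP by its binomial standard deviation is one common convention; other scalings are possible and would change the relative weight of rare variants.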
Affiliation(s)
- Florian Privé
  - National Centre for Register-Based Research, Aarhus University, Aarhus 8210, Denmark
  - Laboratoire TIMC-IMAG, UMR 5525, Univ. Grenoble Alpes, La Tronche 38700, France
- Keurcien Luu
  - Laboratoire TIMC-IMAG, UMR 5525, Univ. Grenoble Alpes, La Tronche 38700, France
- Michael G B Blum
  - Laboratoire TIMC-IMAG, UMR 5525, Univ. Grenoble Alpes, La Tronche 38700, France
  - OWKIN France, Paris 75010, France
- John J McGrath
  - National Centre for Register-Based Research, Aarhus University, Aarhus 8210, Denmark
  - Queensland Brain Institute, University of Queensland, St. Lucia, 4072 Queensland, Australia
  - Queensland Centre for Mental Health Research, The Park Centre for Mental Health, Wacol, 4076 Queensland, Australia
- Bjarni J Vilhjálmsson
  - National Centre for Register-Based Research, Aarhus University, Aarhus 8210, Denmark
|