1
Chua EW, Ooi DJ, Nor Muhammad NA. A concise guide to essential R packages for analyses of DNA, RNA, and proteins. Mol Cells 2024; 47:100120. [PMID: 39374792 DOI: 10.1016/j.mocell.2024.100120] [Received: 08/06/2024] [Revised: 09/30/2024] [Accepted: 09/30/2024] [Indexed: 10/09/2024] [Open Access]
Abstract
R is widely regarded as unrivaled by other high-level programming languages for its statistical functions. The popularity of R as a statistical language has led many to overlook its applications outside the statistical realm. In this brief review, we present a list of R packages for supporting projects that entail analyses of DNA, RNA, and proteins. These R packages span the gamut of important molecular techniques, from routine quantitative polymerase chain reaction (qPCR) and Western blotting to high-throughput sequencing and proteomics generating very large datasets. The text-mining power of R can also be harnessed to facilitate literature reviews and predict future research trends and avenues. We encourage researchers to make full use of R in their work, given the versatility of the language, as well as its straightforward syntax which eases the initial learning curve.
Affiliation(s)
- Eng Wee Chua
  - Centre for Drug and Herbal Development, Faculty of Pharmacy, Universiti Kebangsaan Malaysia, 50300 Kuala Lumpur, Malaysia
- Der Jiun Ooi
  - Department of Preclinical Sciences, Faculty of Dentistry, MAHSA University, 42610 Jenjarom, Selangor, Malaysia
- Nor Azlan Nor Muhammad
  - Institute of Systems Biology (INBIOSIS), Universiti Kebangsaan Malaysia, 43600 Bangi, Selangor, Malaysia
2
Malakhov MM, Dai B, Shen XT, Pan W. A bootstrap model comparison test for identifying genes with context-specific patterns of genetic regulation. Ann Appl Stat 2024; 18:1840-1857. [PMID: 39421855 PMCID: PMC11484521 DOI: 10.1214/23-aoas1859] [Indexed: 10/19/2024]
Abstract
Understanding how genetic variation affects gene expression is essential for a complete picture of the functional pathways that give rise to complex traits. Although numerous studies have established that many genes are differentially expressed in distinct human tissues and cell types, no tools exist for identifying the genes whose expression is differentially regulated. Here we introduce DRAB (differential regulation analysis by bootstrapping), a gene-based method for testing whether patterns of genetic regulation are significantly different between tissues or other biological contexts. DRAB first leverages the elastic net to learn context-specific models of local genetic regulation and then applies a novel bootstrap-based model comparison test to check their equivalency. Unlike previous model comparison tests, our proposed approach can determine whether population-level models have equal predictive performance by accounting for the variability of feature selection and model training. We validated DRAB on mRNA expression data from a variety of human tissues in the Genotype-Tissue Expression (GTEx) Project. DRAB yielded biologically reasonable results and had sufficient power to detect genes with tissue-specific regulatory profiles while effectively controlling false positives. By providing a framework that facilitates the prioritization of differentially regulated genes, our study enables future discoveries on the genetic architecture of molecular phenotypes.
Affiliation(s)
- Ben Dai
  - Department of Statistics, The Chinese University of Hong Kong
- Wei Pan
  - Division of Biostatistics and Health Data Science, University of Minnesota
3
Malakhov MM, Dai B, Shen XT, Pan W. A bootstrap model comparison test for identifying genes with context-specific patterns of genetic regulation. bioRxiv 2023:2023.03.06.531446. [PMID: 36945657 PMCID: PMC10028853 DOI: 10.1101/2023.03.06.531446] [Indexed: 03/09/2023]
Abstract
Understanding how genetic variation affects gene expression is essential for a complete picture of the functional pathways that give rise to complex traits. Although numerous studies have established that many genes are differentially expressed in distinct human tissues and cell types, no tools exist for identifying the genes whose expression is differentially regulated. Here we introduce DRAB (Differential Regulation Analysis by Bootstrapping), a gene-based method for testing whether patterns of genetic regulation are significantly different between tissues or other biological contexts. DRAB first leverages the elastic net to learn context-specific models of local genetic regulation and then applies a novel bootstrap-based model comparison test to check their equivalency. Unlike previous model comparison tests, our proposed approach can determine whether population-level models have equal predictive performance by accounting for the variability of feature selection and model training. We validated DRAB on mRNA expression data from a variety of human tissues in the Genotype-Tissue Expression (GTEx) Project. DRAB yielded biologically reasonable results and had sufficient power to detect genes with tissue-specific regulatory profiles while effectively controlling false positives. By providing a framework that facilitates the prioritization of differentially regulated genes, our study enables future discoveries on the genetic architecture of molecular phenotypes.
Affiliation(s)
- Ben Dai
  - Department of Statistics, The Chinese University of Hong Kong
- Wei Pan
  - Division of Biostatistics, University of Minnesota
4
Valente BD, de los Campos G, Grueneberg A, Chen CY, Ros-Freixedes R, Herring WO. Using residual regressions to quantify and map signal leakage in genomic prediction. Genet Sel Evol 2023; 55:57. [PMID: 37550618 PMCID: PMC10405418 DOI: 10.1186/s12711-023-00830-1] [Received: 03/28/2022] [Accepted: 07/12/2023] [Indexed: 08/09/2023] [Open Access]
Abstract
BACKGROUND: Most genomic prediction applications in animal breeding use genotypes with tens of thousands of single nucleotide polymorphisms (SNPs). However, modern sequencing technologies and imputation algorithms can generate ultra-high-density genotypes (including millions of SNPs) at an affordable cost. Empirical studies have not produced clear evidence that using ultra-high-density genotypes can significantly improve prediction accuracy. However, (whole-genome) prediction accuracy is not very informative about the ability of a model to capture the genetic signals from specific genomic regions. To address this problem, we propose a simple methodology that detects chromosome regions for which a specific model (e.g., single-step genomic best linear unbiased prediction (ssGBLUP)) may fail to fully capture the genetic signal present in such segments, a phenomenon that we refer to as signal leakage. We propose to detect regions with evidence of signal leakage by testing the association of residuals from a pedigree or a genomic model with SNP genotypes. We discuss how this approach can be used to map regions with signals that are poorly captured by a model and to identify strategies to fix those problems (e.g., using a different prior or increasing marker density). Finally, we explored the proposed approach to scan for signal leakage of different models (pedigree-based, ssGBLUP, and various Bayesian models) applied to growth-related phenotypes (average daily gain and backfat thickness) in pigs. RESULTS: We report widespread evidence of signal leakage for pedigree-based models. Including a percentage of animals with SNP data in ssGBLUP reduced the extent of signal leakage. However, local peaks of missed signals remained in some regions, even when all animals were genotyped. Using variable selection priors solves leakage points that are caused by excessive shrinkage of marker effects. Nevertheless, these models still miss signals in some regions due to low linkage disequilibrium between the SNPs on the array used and causal variants. Thus, we discuss how such problems could be addressed by adding sequence SNPs from those regions to the prediction model. CONCLUSIONS: Residual single-marker regression analysis is a simple approach that can be used to detect regional genomic signals that are poorly captured by a model and to indicate ways to fix such problems.
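The residual single-marker regression described in this abstract can be sketched in a few lines. The block below is only an illustration, not the authors' software: the function name `residual_marker_scan`, the toy data, and the use of an ordinary least-squares t-statistic per SNP are all assumptions made here. It takes residuals that some prediction model has already produced and regresses them on each SNP in turn; a large absolute t-statistic would flag a marker whose signal the model left behind.

```python
import math


def residual_marker_scan(residuals, genotypes):
    """Regress model residuals on each SNP, one marker at a time.

    residuals: list of n floats (phenotype minus the model's fitted value).
    genotypes: list of n rows, each a list of p allele counts (0, 1, or 2).
    Returns one (slope, t_statistic) pair per SNP. Monomorphic SNPs carry
    no information and are reported as (0.0, 0.0).
    """
    n = len(residuals)
    p = len(genotypes[0])
    r_mean = sum(residuals) / n
    results = []
    for j in range(p):
        x = [row[j] for row in genotypes]
        x_mean = sum(x) / n
        sxx = sum((xi - x_mean) ** 2 for xi in x)
        if sxx == 0.0:
            results.append((0.0, 0.0))
            continue
        sxy = sum((xi - x_mean) * (ri - r_mean) for xi, ri in zip(x, residuals))
        slope = sxy / sxx
        intercept = r_mean - slope * x_mean
        # Residual sum of squares of the single-marker fit.
        sse = sum((ri - intercept - slope * xi) ** 2 for xi, ri in zip(x, residuals))
        se = math.sqrt(sse / (n - 2) / sxx)
        t_stat = slope / se if se > 0 else math.copysign(math.inf, slope)
        results.append((slope, t_stat))
    return results


# Hypothetical toy data: six animals, three SNPs (the last one monomorphic).
toy_residuals = [-1.0, -0.8, 0.1, -0.1, 0.9, 1.1]
toy_genotypes = [
    [0, 1, 2],
    [0, 1, 2],
    [1, 0, 2],
    [1, 2, 2],
    [2, 1, 2],
    [2, 1, 2],
]
scan = residual_marker_scan(toy_residuals, toy_genotypes)
```

In this toy example the first SNP tracks the residuals closely and the second does not, so the first would be the candidate leakage point. A real scan would also need a multiple-testing threshold, which this sketch leaves out.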
Affiliation(s)
- Gustavo de los Campos
  - Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI USA
  - Department of Statistics and Probability, Michigan State University, East Lansing, MI USA
  - Institute for Quantitative Health Science and Engineering, Michigan State University, East Lansing, MI USA
- Alexander Grueneberg
  - Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI USA
- Ching-Yi Chen
  - The Pig Improvement Company, Genus Plc, Hendersonville, TN USA
- Roger Ros-Freixedes
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Scotland, UK
  - Departament de Ciència Animal, Universitat de Lleida-Agrotecnio-CERCA Center, Lleida, Spain
5
de Los Campos G, Grueneberg A, Funkhouser S, Pérez-Rodríguez P, Samaddar A. Fine mapping and accurate prediction of complex traits using Bayesian Variable Selection models applied to biobank-size data. Eur J Hum Genet 2023; 31:313-320. [PMID: 35853950 PMCID: PMC9995454 DOI: 10.1038/s41431-022-01135-5] [Received: 10/28/2021] [Revised: 04/06/2022] [Accepted: 06/13/2022] [Indexed: 11/08/2022] [Open Access]
Abstract
Modern GWAS use enormous sample sizes and ultra-high-density SNP genotypes. These conditions reduce the mapping resolution of marginal association tests, the method most often used in GWAS. Multi-locus Bayesian Variable Selection (BVS) offers a one-stop solution for powerful and precise mapping of risk variants and polygenic risk score (PRS) prediction. We show (with an extensive simulation) that multi-locus BVS methods can achieve high power with a low false discovery rate and a much better mapping resolution than marginal association tests. We demonstrate the performance of BVS for mapping and PRS prediction using data from blood biomarkers from the UK Biobank (~300,000 samples and ~5.5 million SNPs). The article is accompanied by open-source R software that implements the methods used in the study and scales to biobank-sized data.
Affiliation(s)
- Gustavo de Los Campos
  - Michigan State University, Department of Epidemiology & Biostatistics, East Lansing, MI, USA
  - Michigan State University, Department of Statistics & Probability, East Lansing, MI, USA
  - Michigan State University, Institute for Quantitative Health Sciences and Engineering, East Lansing, MI, USA
- Alexander Grueneberg
  - Michigan State University, Department of Epidemiology & Biostatistics, East Lansing, MI, USA
  - Michigan State University, Institute for Quantitative Health Sciences and Engineering, East Lansing, MI, USA
- Anirban Samaddar
  - Michigan State University, Department of Statistics & Probability, East Lansing, MI, USA
  - Michigan State University, Institute for Quantitative Health Sciences and Engineering, East Lansing, MI, USA
6
Lopez-Cruz M, Dreisigacker S, Crespo-Herrera L, Bentley AR, Singh R, Poland J, Shrestha S, Huerta-Espino J, Govindan V, Juliana P, Mondal S, Pérez-Rodríguez P, Crossa J. Sparse kernel models provide optimization of training set design for genomic prediction in multiyear wheat breeding data. Plant Genome 2022; 15:e20254. [PMID: 36043341 DOI: 10.1002/tpg2.20254] [Received: 12/19/2021] [Accepted: 07/17/2022] [Indexed: 06/15/2023]
Abstract
The success of genomic selection (GS) in breeding schemes relies on its ability to provide accurate predictions of unobserved lines at early stages. Multigeneration data provides opportunities to increase the training data size and thus, the likelihood of extracting useful information from ancestors to improve prediction accuracy. The genomic best linear unbiased predictions (GBLUPs) are performed by borrowing information through kinship relationships between individuals. Multigeneration data usually becomes heterogeneous with complex family relationship patterns that are increasingly entangled with each generation. Under these conditions, historical data may not be optimal for model training as the accuracy could be compromised. The sparse selection index (SSI) is a method for training set (TRN) optimization, in which training individuals provide predictions to some but not all predicted subjects. We added an additional trimming process to the original SSI (trimmed SSI) to remove less important training individuals for prediction. Using a large multigeneration (8 yr) wheat (Triticum aestivum L.) grain yield dataset (n = 68,836), we found increases in accuracy as more years are included in the TRN, with improvements of ∼0.05 in the GBLUP accuracy when using 5 yr of historical data relative to when using only 1 yr. The SSI method showed a small gain over the GBLUP accuracy but with an important reduction on the TRN size. These reduced TRNs were formed with a similar number of subjects from each training generation. Our results suggest that the SSI provides a more stable ranking of genotypes than the GBLUP as the TRN becomes larger.
Affiliation(s)
- Marco Lopez-Cruz
  - Dep. of Epidemiology and Biostatistics, Michigan State Univ., East Lansing, MI, USA
- Susanne Dreisigacker
  - Global Wheat Program, International Maize and Wheat Improvement Center (CIMMYT), Texcoco, Mexico
- Leonardo Crespo-Herrera
  - Global Wheat Program, International Maize and Wheat Improvement Center (CIMMYT), Texcoco, Mexico
- Alison R Bentley
  - Global Wheat Program, International Maize and Wheat Improvement Center (CIMMYT), Texcoco, Mexico
- Ravi Singh
  - Global Wheat Program, International Maize and Wheat Improvement Center (CIMMYT), Texcoco, Mexico
- Jesse Poland
  - Dep. of Agronomy, Kansas State Univ., Manhattan, KS, USA
- Julio Huerta-Espino
  - Campo Experimental Valle de Mexico, Instituto Nacional de Investigaciones Forestales, Agricolas y Pecuarias (INIFAP), Chapingo, Mexico
- Velu Govindan
  - Global Wheat Program, International Maize and Wheat Improvement Center (CIMMYT), Texcoco, Mexico
- Philomin Juliana
  - Global Wheat Program, International Maize and Wheat Improvement Center (CIMMYT), Texcoco, Mexico
- Suchismita Mondal
  - Global Wheat Program, International Maize and Wheat Improvement Center (CIMMYT), Texcoco, Mexico
- Jose Crossa
  - Global Wheat Program, International Maize and Wheat Improvement Center (CIMMYT), Texcoco, Mexico
  - Colegio de Postgraduados, Montecillos, Mexico
7
Lupi AS, Sumpter NA, Leask MP, O’Sullivan J, Fadason T, de los Campos G, Merriman TR, Reynolds RJ, Vazquez AI. Local genetic covariance between serum urate and kidney function estimated with Bayesian multitrait models. G3 (Bethesda) 2022; 12:jkac158. [PMID: 35876900 PMCID: PMC9434310 DOI: 10.1093/g3journal/jkac158] [Received: 11/04/2021] [Accepted: 06/05/2022] [Indexed: 11/13/2022]
Abstract
Hyperuricemia (serum urate >6.8 mg/dl) is associated with several cardiometabolic and renal diseases, such as gout and chronic kidney disease. Previous studies have examined the shared genetic basis of chronic kidney disease and hyperuricemia in humans either using single-variant tests or estimating whole-genome genetic correlations between the traits. Individual variants typically explain a small fraction of the genetic correlation between traits, thus the ability to map pleiotropic loci is lacking power for available sample sizes. Alternatively, whole-genome estimates of genetic correlation indicate a moderate correlation between these traits. While useful to explain the comorbidity of these traits, whole-genome genetic correlation estimates do not shed light on what regions may be implicated in the shared genetic basis of traits. Therefore, to fill the gap between these two approaches, we used local Bayesian multitrait models to estimate the genetic covariance between a marker for chronic kidney disease (estimated glomerular filtration rate) and serum urate in specific genomic regions. We identified 134 overlapping linkage disequilibrium windows with statistically significant covariance estimates, 49 of which had positive directionalities, and 85 negative directionalities, the latter being consistent with that of the overall genetic covariance. The 134 significant windows condensed to 64 genetically distinct shared loci which validate 17 previously identified shared loci with consistent directionality and revealed 22 novel pleiotropic genes. Finally, to examine potential biological mechanisms for these shared loci, we have identified a subset of the genomic windows that are associated with gene expression using colocalization analyses. The regions identified by our local Bayesian multitrait model approach may help explain the association between chronic kidney disease and hyperuricemia.
Affiliation(s)
- Alexa S Lupi
  - Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI 48824, USA
  - Institute for Quantitative Health Science and Engineering, Systems Biology, Michigan State University, East Lansing, MI 48824, USA
- Nicholas A Sumpter
  - Department of Medicine, The University of Alabama at Birmingham, Birmingham, AL 35294, USA
- Megan P Leask
  - Department of Medicine, The University of Alabama at Birmingham, Birmingham, AL 35294, USA
  - Department of Biochemistry, University of Otago, Dunedin 9016, New Zealand
- Justin O’Sullivan
  - Liggins Institute, The University of Auckland, Auckland 1142, New Zealand
- Tayaza Fadason
  - Liggins Institute, The University of Auckland, Auckland 1142, New Zealand
- Gustavo de los Campos
  - Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI 48824, USA
  - Institute for Quantitative Health Science and Engineering, Systems Biology, Michigan State University, East Lansing, MI 48824, USA
  - Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA
- Tony R Merriman
  - Department of Medicine, The University of Alabama at Birmingham, Birmingham, AL 35294, USA
- Richard J Reynolds
  - Department of Medicine, The University of Alabama at Birmingham, Birmingham, AL 35294, USA
- Ana I Vazquez
  - Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI 48824, USA
  - Institute for Quantitative Health Science and Engineering, Systems Biology, Michigan State University, East Lansing, MI 48824, USA
8
Hansen PB, Ruud AK, de los Campos G, Malinowska M, Nagy I, Svane SF, Thorup-Kristensen K, Jensen JD, Krusell L, Asp T. Integration of DNA Methylation and Transcriptome Data Improves Complex Trait Prediction in Hordeum vulgare. Plants 2022; 11:plants11172190. [PMID: 36079572 PMCID: PMC9459846 DOI: 10.3390/plants11172190] [Received: 05/23/2022] [Revised: 08/19/2022] [Accepted: 08/21/2022] [Indexed: 11/30/2022]
Abstract
Whole-genome multi-omics profiles contain valuable information for the characterization and prediction of complex traits in plants. In this study, we evaluate multi-omics models to predict four complex traits in barley (Hordeum vulgare): grain yield, thousand kernel weight, protein content, and nitrogen uptake. Genomic, transcriptomic, and DNA methylation data were obtained from 75 spring barley lines tested in the RadiMax semi-field phenomics facility under control and water-scarce treatments. By integrating multi-omics data at genomic, transcriptomic, and DNA methylation regulatory levels, a higher proportion of phenotypic variance was explained (0.72–0.91) than with genomic models alone (0.55–0.86). The correlation between predictions and phenotypes ranged from 0.17 to 0.28 for control plants and from 0.23 to 0.37 for water-scarce plants, and the increase in accuracy was significant for nitrogen uptake and protein content compared to models using genomic information alone. Adding transcriptomic and DNA methylation information to the prediction models explained more of the phenotypic variance attributed to the environment in grain yield and nitrogen uptake. It furthermore explained more of the non-additive genetic effects for thousand kernel weight and protein content. Our results show the feasibility of multi-omics prediction for complex traits in barley.
Affiliation(s)
- Pernille Bjarup Hansen
  - Center for Quantitative Genetics and Genomics, Aarhus University, 4200 Slagelse, Denmark
  - Correspondence: (P.B.H.); (T.A.); Tel.: +45-87158243 (T.A.)
- Anja Karine Ruud
  - Center for Quantitative Genetics and Genomics, Aarhus University, 4200 Slagelse, Denmark
- Gustavo de los Campos
  - Departments of Epidemiology & Biostatistics and Statistics & Probability, Institute for Quantitative Health Science and Engineering, Michigan State University, East Lansing, MI 48824, USA
- Marta Malinowska
  - Center for Quantitative Genetics and Genomics, Aarhus University, 4200 Slagelse, Denmark
- Istvan Nagy
  - Center for Quantitative Genetics and Genomics, Aarhus University, 4200 Slagelse, Denmark
- Simon Fiil Svane
  - Section for Crop Sciences, Department of Plant and Environmental Sciences, Copenhagen University, 2630 Taastrup, Denmark
- Kristian Thorup-Kristensen
  - Section for Crop Sciences, Department of Plant and Environmental Sciences, Copenhagen University, 2630 Taastrup, Denmark
- Lene Krusell
  - Sejet Plant Breeding, Nørremarksvej 67, 8700 Horsens, Denmark
- Torben Asp
  - Center for Quantitative Genetics and Genomics, Aarhus University, 4200 Slagelse, Denmark
  - Correspondence: (P.B.H.); (T.A.); Tel.: +45-87158243 (T.A.)
9
Pérez-Rodríguez P, de Los Campos G. Multi-trait Bayesian Shrinkage and Variable Selection Models with the BGLR R-package. Genetics 2022; 222:6655691. [PMID: 35924977 PMCID: PMC9434216 DOI: 10.1093/genetics/iyac112] [Received: 05/11/2022] [Accepted: 07/14/2022] [Indexed: 12/02/2022] [Open Access]
Abstract
The BGLR R package implements various types of single-trait shrinkage/variable selection Bayesian regressions. The package was first released in 2014; since then, it has become widely used software in genomic studies. We recently developed functionality for multitrait models. The implementation allows users to include an arbitrary number of random-effects terms. For each set of predictors, users can choose diffuse, Gaussian, and Gaussian–spike–slab multivariate priors. Unlike other software packages for multitrait genomic regressions, BGLR offers many specifications for (co)variance parameters (unstructured, diagonal, factor analytic, and recursive). Samples from the posterior distribution of the models implemented in the multitrait function are generated using a Gibbs sampler, which is implemented by combining code written in the R and C programming languages. In this article, we provide an overview of the models and methods implemented in BGLR’s multitrait function, present examples that illustrate the use of the package, and benchmark the performance of the software.
Affiliation(s)
- Paulino Pérez-Rodríguez
  - Colegio de Postgraduados, CP 56230, Montecillos, Estado de México, México
  - Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI 48824, USA
- Gustavo de Los Campos
  - Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI 48824, USA
  - Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA
  - Institute for Quantitative Health Science and Engineering, Michigan State University, East Lansing, MI 48824, USA
10
Deng WQ, Sun L. gJLS2: an R package for generalized joint location and scale analysis in X-inclusive genome-wide association studies. G3 (Bethesda) 2022; 12:6535712. [PMID: 35201341 PMCID: PMC8982384 DOI: 10.1093/g3journal/jkac049] [Received: 12/02/2021] [Accepted: 02/17/2022] [Indexed: 11/12/2022]
Abstract
A joint analysis of location and scale can be a powerful tool in genome-wide association studies to uncover previously overlooked markers that influence a quantitative trait through both mean and variance, as well as to prioritize candidates for gene–environment interactions. This approach has recently been generalized to handle related samples, dosage data, and the analytically challenging X-chromosome. We disseminate the latest advances in methodology through a user-friendly R software package with added functionalities to support genome-wide analysis on individual-level or summary-level data. The implemented R package can be called from PLINK or directly in a scripting environment, to enable a streamlined genome-wide analysis for biobank-scale data. Application results on individual-level and summary-level data highlight the advantage of the joint test to discover more genome-wide signals as compared to a location or scale test alone. We hope the availability of the gJLS2 software package will encourage more scale and/or joint analyses in large-scale datasets, and promote the standardized reporting of their P-values to be shared with the scientific community.
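One standard way to build a joint test from a location P-value and a scale P-value is Fisher's combination method. The abstract does not say which combination rule gJLS2 applies, so treating Fisher's method as the joint test is an assumption made here for illustration, and the function name `fisher_joint_p` is hypothetical. The sketch also assumes the two P-values are independent under the null hypothesis. With two P-values the statistic is chi-square with 4 degrees of freedom, whose upper tail has a closed form, so no statistics library is needed.

```python
import math


def fisher_joint_p(p_location, p_scale):
    """Combine a location and a scale P-value with Fisher's method.

    Under the null, and assuming the two tests are independent, the
    statistic -2 * (ln p_L + ln p_S) follows a chi-square distribution
    with 4 degrees of freedom. Its upper tail probability at x is
    exp(-x / 2) * (1 + x / 2).
    """
    if not (0.0 < p_location <= 1.0 and 0.0 < p_scale <= 1.0):
        raise ValueError("P-values must lie in (0, 1]")
    stat = -2.0 * (math.log(p_location) + math.log(p_scale))
    return math.exp(-stat / 2.0) * (1.0 + stat / 2.0)


# Two individually borderline signals reinforce each other when combined.
joint = fisher_joint_p(0.05, 0.05)
```

Here `joint` is smaller than either input, which mirrors the abstract's point that a joint test can pick up markers that a location or scale test alone would miss.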
Affiliation(s)
- Wei Q Deng
  - Department of Psychiatry and Behavioural Neurosciences, McMaster University, Hamilton, ON L8P 3R2, Canada
  - Peter Boris Centre for Addictions Research, St. Joseph’s Healthcare Hamilton, McMaster University, Hamilton, ON L8P 3R2, Canada
- Lei Sun
  - Department of Statistical Sciences, University of Toronto, Toronto, ON M5G 1Z5, Canada
  - Biostatistics Division, Dalla Lana School of Public Health, University of Toronto, Toronto, ON M5T 3M7, Canada
11
Aase K, Jensen H, Muff S. Genomic estimation of quantitative genetic parameters in wild admixed populations. Methods Ecol Evol 2022. [DOI: 10.1111/2041-210x.13810] [Indexed: 11/27/2022]
Affiliation(s)
- Kenneth Aase
  - Centre for Biodiversity Dynamics, Department of Biology, Norwegian University of Science and Technology, Trondheim, Norway
- Henrik Jensen
  - Centre for Biodiversity Dynamics, Department of Biology, Norwegian University of Science and Technology, Trondheim, Norway
- Stefanie Muff
  - Centre for Biodiversity Dynamics, Department of Biology, Norwegian University of Science and Technology, Trondheim, Norway
  - Department of Mathematical Sciences, Norwegian University of Science and Technology, Trondheim, Norway
12
McCarty AJ, Allen SK, Plough LV. Genome-wide analysis of acute low salinity tolerance in the eastern oyster Crassostrea virginica and potential of genomic selection for trait improvement. G3 (Bethesda) 2022; 12:6409860. [PMID: 34849774 PMCID: PMC8727987 DOI: 10.1093/g3journal/jkab368] [Received: 08/09/2021] [Accepted: 10/11/2021] [Indexed: 11/12/2022]
Abstract
As the global demand for seafood increases, research into the genetic basis of traits that can increase aquaculture production is critical. The eastern oyster (Crassostrea virginica) is an important aquaculture species along the Atlantic and Gulf Coasts of the United States, but increases in heavy rainfall events expose oysters to acute low salinity conditions, which negatively impact production. Low salinity survival is known to be a moderately heritable trait, but the genetic architecture underlying this trait is still poorly understood. In this study, we used ddRAD sequencing to generate genome-wide single-nucleotide polymorphism (SNP) data for four F2 families to investigate the genomic regions associated with survival in extreme low salinity (<3). SNP data were also used to assess the feasibility of genomic selection (GS) for improving this trait. Quantitative trait locus (QTL) mapping and combined linkage disequilibrium analysis revealed significant QTL on eastern oyster chromosomes 1 and 7 underlying both survival and day to death in a 36-day experimental challenge. Significant QTL were located in genes related to DNA/RNA function and repair, ion binding and membrane transport, and general response to stress. GS was investigated using Bayesian linear regression models and prediction accuracies ranged from 0.48 to 0.57. Genomic prediction accuracies were largest using the BayesB prior and prediction accuracies did not substantially decrease when SNPs located within the QTL region on Chr1 were removed, suggesting that this trait is controlled by many genes of small effect. Our results suggest that GS will likely be a viable option for improvement of survival in extreme low salinity.
Affiliation(s)
- Alexandra J McCarty
  - Horn Point Laboratory, University of Maryland Center for Environmental Science, Cambridge, MD 21613, USA
- Standish K Allen
  - Virginia Institute of Marine Science, Aquaculture Genetics and Breeding Technology Center, Gloucester Point, VA 23062, USA
- Louis V Plough
  - Horn Point Laboratory, University of Maryland Center for Environmental Science, Cambridge, MD 21613, USA
13
Lopez-Cruz M, de Los Campos G. Optimal breeding-value prediction using a sparse selection index. Genetics 2021; 218:6179494. [PMID: 33748861 PMCID: PMC8128408 DOI: 10.1093/genetics/iyab030] [Received: 09/30/2020] [Accepted: 02/13/2021] [Indexed: 02/06/2023] [Open Access]
Abstract
Genomic prediction uses DNA sequences and phenotypes to predict genetic values. In homogeneous populations, theory indicates that the accuracy of genomic prediction increases with sample size. However, differences in allele frequencies and linkage disequilibrium patterns can lead to heterogeneity in SNP effects. In this context, calibrating genomic predictions using a large, potentially heterogeneous, training data set may not lead to optimal prediction accuracy. Some studies tried to address this sample size/homogeneity trade-off using training set optimization algorithms; however, this approach assumes that a single training data set is optimum for all individuals in the prediction set. Here, we propose an approach that identifies, for each individual in the prediction set, a subset from the training data (i.e., a set of support points) from which predictions are derived. The methodology that we propose is a sparse selection index (SSI) that integrates selection index methodology with sparsity-inducing techniques commonly used for high-dimensional regression. The sparsity of the resulting index is controlled by a regularization parameter (λ); the G-Best Linear Unbiased Predictor (G-BLUP) (the prediction method most commonly used in plant and animal breeding) appears as a special case which happens when λ = 0. In this study, we present the methodology and demonstrate (using two wheat data sets with phenotypes collected in 10 different environments) that the SSI can achieve significant (anywhere between 5 and 10%) gains in prediction accuracy relative to the G-BLUP.
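The role of λ in this abstract can be illustrated with a small L1-penalized selection index. The sketch below is not the authors' implementation. It assumes the index weights for one predicted individual minimize ½·β′Aβ − g′β + λ·Σ|β_j|, where `A` (training relationships plus a residual-to-genetic variance ratio on the diagonal) and `g` (relationships between the predicted individual and the training set) are inputs the caller supplies; that objective, the coordinate-descent solver, and the name `sparse_index_weights` are all assumptions. Under this formulation, λ = 0 reduces to solving Aβ = g, the unpenalized index that the abstract identifies with G-BLUP, while larger λ sets some weights exactly to zero.

```python
def soft_threshold(value, lam):
    """Shrink value toward zero by lam, returning 0.0 inside [-lam, lam]."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def sparse_index_weights(A, g, lam, n_sweeps=500, tol=1e-12):
    """Coordinate descent for 0.5 * b'Ab - g'b + lam * sum(|b_j|).

    A: symmetric positive-definite matrix as a list of lists.
    g: list of covariances between the target and each training individual.
    lam: non-negative penalty; lam = 0 gives the dense solution of Ab = g.
    """
    n = len(g)
    beta = [0.0] * n
    for _ in range(n_sweeps):
        largest_change = 0.0
        for j in range(n):
            # Partial residual for coordinate j with all others held fixed.
            rho = g[j] - sum(A[j][k] * beta[k] for k in range(n) if k != j)
            new_value = soft_threshold(rho, lam) / A[j][j]
            largest_change = max(largest_change, abs(new_value - beta[j]))
            beta[j] = new_value
        if largest_change < tol:
            break
    return beta


# Hypothetical two-individual training set.
A_toy = [[2.0, 0.5], [0.5, 2.0]]
g_toy = [1.0, 0.2]
dense_weights = sparse_index_weights(A_toy, g_toy, lam=0.0)
sparse_weights = sparse_index_weights(A_toy, g_toy, lam=0.3)
```

With `lam=0.3` the weakly related second training individual drops out of the index and only the first contributes, which is the "support points" idea from the abstract. The prediction for the target would then be the weighted sum of training phenotypes using these weights.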
Affiliation(s)
- Marco Lopez-Cruz
- Department of Plant, Soil and Microbial Sciences, Michigan State University, East Lansing, MI 48824, USA
| | - Gustavo de Los Campos
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI 48824, USA.,Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA.,Institute for Quantitative Health Science and Engineering, Michigan State University, East Lansing, MI 48824, USA
| |
|
14
|
Balmant KM, Noble JD, C Alves F, Dervinis C, Conde D, Schmidt HW, Vazquez AI, Barbazuk WB, Campos GDL, Resende MFR, Kirst M. Xylem systems genetics analysis reveals a key regulator of lignin biosynthesis in Populus deltoides. Genome Res 2020; 30:1131-1143. [PMID: 32817237 PMCID: PMC7462072 DOI: 10.1101/gr.261438.120] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 01/20/2020] [Accepted: 07/13/2020] [Indexed: 02/01/2023]
Abstract
Despite the growing resources and tools for high-throughput characterization and analysis of genomic information, the discovery of the genetic elements that regulate complex traits remains a challenge. Systems genetics is an emerging field that aims to understand the flow of biological information that underlies complex traits from genotype to phenotype. In this study, we used a systems genetics approach to identify and evaluate regulators of the lignin biosynthesis pathway in Populus deltoides by combining genome, transcriptome, and phenotype data from a population of 268 unrelated individuals of P. deltoides. The discovery of lignin regulators began with the quantitative genetic analysis of the xylem transcriptome and resulted in the detection of 6706 and 4628 significant local- and distant-eQTL associations, respectively. Among the locally regulated genes, we identified the R2R3-MYB transcription factor MYB125 (Potri.003G114100) as a putative trans-regulator of the majority of genes in the lignin biosynthesis pathway. The expression of MYB125 in a diverse population positively correlated with lignin content. Furthermore, overexpression of MYB125 in transgenic poplar resulted in increased lignin content, as well as altered expression of genes in the lignin biosynthesis pathway. Altogether, our findings indicate that MYB125 is involved in the control of a transcriptional coexpression network of lignin biosynthesis genes during secondary cell wall formation in P. deltoides.
Affiliation(s)
- Kelly M Balmant
- School of Forest Resources and Conservation, University of Florida, Gainesville, Florida 32611, USA
| | - Jerald D Noble
- Plant Molecular and Cellular Biology Graduate Program, University of Florida, Gainesville, Florida 32611, USA
| | - Filipe C Alves
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, Michigan 48824, USA
| | - Christopher Dervinis
- School of Forest Resources and Conservation, University of Florida, Gainesville, Florida 32611, USA
| | - Daniel Conde
- School of Forest Resources and Conservation, University of Florida, Gainesville, Florida 32611, USA
| | - Henry W Schmidt
- School of Forest Resources and Conservation, University of Florida, Gainesville, Florida 32611, USA
| | - Ana I Vazquez
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, Michigan 48824, USA
| | - William B Barbazuk
- Plant Molecular and Cellular Biology Graduate Program, University of Florida, Gainesville, Florida 32611, USA
- Department of Biology, University of Florida, Gainesville, Florida 32611, USA
- Genetics Institute, University of Florida, Gainesville, Florida 32611, USA
| | - Gustavo de Los Campos
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, Michigan 48824, USA
- Statistics Department, Michigan State University, East Lansing, Michigan 48824, USA
| | - Marcio F R Resende
- Plant Molecular and Cellular Biology Graduate Program, University of Florida, Gainesville, Florida 32611, USA
- Horticulture Sciences Department, University of Florida, Gainesville, Florida 32611, USA
| | - Matias Kirst
- School of Forest Resources and Conservation, University of Florida, Gainesville, Florida 32611, USA
- Plant Molecular and Cellular Biology Graduate Program, University of Florida, Gainesville, Florida 32611, USA
- Genetics Institute, University of Florida, Gainesville, Florida 32611, USA
| |
|
15
|
Privé F, Luu K, Vilhjálmsson BJ, Blum MGB. Performing Highly Efficient Genome Scans for Local Adaptation with R Package pcadapt Version 4. Mol Biol Evol 2020; 37:2153-2154. [DOI: 10.1093/molbev/msaa053] [Citation(s) in RCA: 50] [Impact Index Per Article: 12.5] [Indexed: 02/07/2023]
Abstract
pcadapt is a user-friendly R package for performing genome scans for local adaptation. Here, we present version 4 of pcadapt, which substantially improves computational efficiency while providing similar results. This improvement is made possible by using a different format for storing genotypes and a different algorithm for computing principal components of the genotype matrix, which is the most computationally demanding step of the pcadapt method. These changes are seamlessly integrated into the existing pcadapt package, and users will experience a large reduction in computation time (by a factor of 20–60 in our analyses) as compared with previous versions.
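To make the idea of a PCA-based genome scan concrete, the toy Python sketch below computes the leading principal component of a standardized genotype matrix and scores each SNP by its squared correlation with that component. It is not pcadapt's algorithm, storage format, or test statistic; the two-population simulation, the planted differentiated SNP at index 0, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_per_pop, n_snp = 100, 500

# Hypothetical allele frequencies: mild differentiation at every SNP,
# plus one strongly differentiated SNP (index 0) mimicking local adaptation.
base = rng.uniform(0.2, 0.8, n_snp)
drift = rng.normal(0.0, 0.05, n_snp)
p1 = np.clip(base + drift, 0.05, 0.95)
p2 = np.clip(base - drift, 0.05, 0.95)
p1[0], p2[0] = 0.9, 0.1

# Genotypes coded 0/1/2, individuals in rows, SNPs in columns
geno = np.vstack([
    rng.binomial(2, p1, size=(n_per_pop, n_snp)),
    rng.binomial(2, p2, size=(n_per_pop, n_snp)),
]).astype(float)

# Standardize each SNP, then take the first left singular vector as PC1
Z = (geno - geno.mean(axis=0)) / geno.std(axis=0)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
pc1 = U[:, 0]

# Squared correlation of each standardized SNP with PC1
# (columns of Z have mean 0 and unit variance; pc1 has mean 0 and unit norm)
r = Z.T @ pc1 / np.sqrt(Z.shape[0])
stat = r ** 2
top_snp = int(np.argmax(stat))

print("most PC1-associated SNP:", top_snp)
```

In this simulation the first principal component mostly separates the two populations, so the SNP with the largest allele-frequency difference receives the highest score. The full decomposition used here is the expensive step for large genotype matrices, which is the step the abstract identifies as the computational bottleneck.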
Affiliation(s)
- Florian Privé
- National Centre for Register-Based Research, Aarhus University, Aarhus, Denmark
- University of Grenoble Alpes, Laboratoire TIMC-IMAG, UMR 5525, La Tronche, France
| | - Keurcien Luu
- University of Grenoble Alpes, Laboratoire TIMC-IMAG, UMR 5525, La Tronche, France
| | | | - Michael G B Blum
- University of Grenoble Alpes, Laboratoire TIMC-IMAG, UMR 5525, La Tronche, France
- OWKIN France, Paris, France
| |
|
16
|
Imperfect Linkage Disequilibrium Generates Phantom Epistasis (& Perils of Big Data). G3-GENES GENOMES GENETICS 2019; 9:1429-1436. [PMID: 30877081 PMCID: PMC6505142 DOI: 10.1534/g3.119.400101] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Indexed: 12/27/2022]
Abstract
The genetic architecture of complex human traits and diseases is affected by a large number of possibly interacting genes, but detecting epistatic interactions can be challenging. In the last decade, several studies have alluded to problems that linkage disequilibrium (LD) can create when testing for epistatic interactions between DNA markers. However, these problems have not been formalized, nor have their consequences been quantified in a precise manner. Here we use a conceptually simple three-locus model involving a causal locus and two markers to show that imperfect LD can generate the illusion of epistasis, even when the underlying genetic architecture is purely additive. We describe necessary conditions for such "phantom epistasis" to emerge and quantify its relevance using simulations. Our empirical results demonstrate that phantom epistasis can be a very serious problem in GWAS (with rejection rates against the additive model greater than 0.28 for nominal p-values of 0.05, even when the model is purely additive). Some studies have sought to avoid this problem by only testing interactions between SNPs with R² < 0.1. We show that this threshold is not appropriate and demonstrate that the magnitude of the problem is even greater with large sample sizes, intermediate allele frequencies, and when the causal locus explains a large amount of phenotypic variance. We conclude that caution must be exercised when interpreting GWAS results derived from very large data sets showing strong evidence in support of epistatic interactions between markers.
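The three-locus argument can be illustrated with a small simulation. The Python sketch below is a simplified stand-in for the authors' design, not a reproduction of it: loci are coded as binary (0/1) alleles rather than diploid genotypes, each marker copies the causal allele with a fixed probability to mimic imperfect LD, and the interaction test uses a large-sample normal approximation. The function names and all parameter values are assumptions chosen for illustration.

```python
import math
import numpy as np

def interaction_p_value(y, x1, x2):
    """Two-sided p-value for the x1*x2 term in y ~ 1 + x1 + x2 + x1*x2.

    Uses ordinary least squares and a normal approximation to the
    t-statistic, which is reasonable for large samples.
    """
    X = np.column_stack([np.ones_like(y), x1, x2, x1 * x2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    z = coef[3] / math.sqrt(cov[3, 3])
    return math.erfc(abs(z) / math.sqrt(2.0))

def rejection_rate(effect, n=2000, reps=200, freq=0.2, match=0.9,
                   alpha=0.05, seed=1):
    """Share of replicates in which the marker-by-marker interaction is
    declared significant although the phenotype is additive in the
    (unobserved) causal locus."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(reps):
        causal = rng.binomial(1, freq, n).astype(float)
        # Each marker equals the causal allele with probability `match`
        x1 = np.where(rng.random(n) < match, causal, 1.0 - causal)
        x2 = np.where(rng.random(n) < match, causal, 1.0 - causal)
        # Purely additive phenotype: only the causal locus has an effect
        y = effect * causal + rng.normal(size=n)
        hits += interaction_p_value(y, x1, x2) < alpha
    return hits / reps

rate_additive = rejection_rate(effect=1.0)   # causal effect present
rate_null = rejection_rate(effect=0.0)       # no genetic effect at all

print("rejection rate with an additive causal effect:", rate_additive)
print("rejection rate with no causal effect:", rate_null)
```

In this toy setting the expected value of the causal allele given the two markers is not additive in the markers, so the interaction term absorbs part of the additive causal signal and is rejected far more often than the nominal level, whereas the control with no causal effect stays close to it.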
|