1
Li Q, Bian J, Qian Y, Kossinna P, Gau C, Gordon PMK, Zhou X, Guo X, Yan J, Wu J, Long Q. An expression-directed linear mixed model discovering low-effect genetic variants. Genetics 2024; 226:iyae018. [PMID: 38314848 DOI: 10.1093/genetics/iyae018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/29/2023] [Revised: 11/29/2023] [Accepted: 01/05/2024] [Indexed: 02/07/2024] Open
Abstract
Detecting genetic variants with low-effect sizes using a moderate sample size is difficult, hindering downstream efforts to learn pathology and estimate heritability. In this work, by utilizing informative weights learned from training genetically predicted gene expression models, we formed an alternative approach to estimate the polygenic term in a linear mixed model. Our linear mixed model estimates the genetic background by incorporating the variants' relevance to gene expression. Our protocol, expression-directed linear mixed model, enables the discovery of subtle signals of low-effect variants using a moderate sample size. By applying expression-directed linear mixed model to cohorts of around 5,000 individuals with either binary (WTCCC) or quantitative (NFBC1966) traits, we demonstrated its power gain at the low-effect end of the genetic etiology spectrum. In aggregate, the additional low-effect variants detected by expression-directed linear mixed model substantially improved estimation of missing heritability. Expression-directed linear mixed model moves precision medicine forward by accurately detecting the contribution of low-effect genetic variants to human diseases.
Affiliation(s)
- Qing Li
  - Department of Biochemistry & Molecular Biology, University of Calgary, Calgary T2N 1N4, Canada
- Jiayi Bian
  - Department of Mathematics and Statistics, University of Calgary, Calgary T2N 1N4, Canada
- Yanzhao Qian
  - Department of Mathematics and Statistics, University of Calgary, Calgary T2N 1N4, Canada
- Pathum Kossinna
  - Department of Biochemistry & Molecular Biology, University of Calgary, Calgary T2N 1N4, Canada
- Cooper Gau
  - Department of Mathematics and Statistics, University of Calgary, Calgary T2N 1N4, Canada
- Paul M K Gordon
  - Alberta Children's Hospital Research Institute, University of Calgary, Calgary T2N 1N4, Canada
- Xiang Zhou
  - School of Public Health, University of Michigan, Ann Arbor 48109, USA
- Xingyi Guo
  - Department of Medicine & Biomedical Informatics, Vanderbilt University Medical Center, Nashville 37203, USA
- Jun Yan
  - Physiology and Pharmacology, University of Calgary, Calgary T2N 1N4, Canada
  - Hotchkiss Brain Institute, University of Calgary, Calgary T2N 1N4, Canada
- Jingjing Wu
  - Department of Mathematics and Statistics, University of Calgary, Calgary T2N 1N4, Canada
- Quan Long
  - Department of Biochemistry & Molecular Biology, University of Calgary, Calgary T2N 1N4, Canada
  - Department of Mathematics and Statistics, University of Calgary, Calgary T2N 1N4, Canada
  - Alberta Children's Hospital Research Institute, University of Calgary, Calgary T2N 1N4, Canada
  - Hotchkiss Brain Institute, University of Calgary, Calgary T2N 1N4, Canada
  - Department of Medical Genetics, University of Calgary, Calgary T2N 1N4, Canada
2
Apicella C, Ruano CSM, Thilaganathan B, Khalil A, Giorgione V, Gascoin G, Marcellin L, Gaspar C, Jacques S, Murdoch CE, Miralles F, Méhats C, Vaiman D. Pan-Genomic Regulation of Gene Expression in Normal and Pathological Human Placentas. Cells 2023; 12:cells12040578. [PMID: 36831244 PMCID: PMC9954093 DOI: 10.3390/cells12040578] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 11/04/2022] [Revised: 01/17/2023] [Accepted: 01/28/2023] [Indexed: 02/17/2023] Open
Abstract
In this study, we attempted to find genetic variants affecting gene expression (eQTL = expression Quantitative Trait Loci) in the human placenta in normal and pathological situations. The analysis of gene expression in placental diseases (Pre-eclampsia and Intra-Uterine Growth Restriction) is hindered by the fact that diseased placental tissue samples are generally taken at earlier gestations compared to control samples. The difference in gestational age is considered a major confounding factor in the transcriptome regulation of the placenta. To alleviate this significant problem, we propose here a novel approach to pinpoint disease-specific cis-eQTLs. By statistical correction for gestational age at sampling as well as other confounding/surrogate variables systematically searched and identified, we found 43 e-genes for which proximal SNPs influence expression level. Then, we performed the analysis again, removing the disease status from the covariates, and we identified 54 e-genes, 16 of which are identified de novo and, thus, possibly related to placental disease. We found a highly significant overlap with previous studies for the list of 43 e-genes, validating our methodology and findings. Among the 16 disease-specific e-genes, several are intrinsic to trophoblast biology and, therefore, constitute novel targets of interest to better characterize placental pathology and its varied clinical consequences. The approach that we used may also be applied to the study of other human diseases where confounding factors have hampered a better understanding of the pathology.
Affiliation(s)
- Clara Apicella
  - Team ‘From Gametes to Birth’, Institut Cochin, U1016 INSERM, UMR 8104 CNRS, Paris-Descartes University, 75014 Paris, France
- Camino S. M. Ruano
  - Team ‘From Gametes to Birth’, Institut Cochin, U1016 INSERM, UMR 8104 CNRS, Paris-Descartes University, 75014 Paris, France
- Basky Thilaganathan
  - Fetal Medicine Unit, St George’s University Hospitals NHS Foundation Trust, London SW17 0RE, UK
  - Vascular Biology Research Centre, Molecular and Clinical Sciences Research Institute, St George’s University of London, London SW17 0RE, UK
- Asma Khalil
  - Fetal Medicine Unit, St George’s University Hospitals NHS Foundation Trust, London SW17 0RE, UK
  - Vascular Biology Research Centre, Molecular and Clinical Sciences Research Institute, St George’s University of London, London SW17 0RE, UK
- Veronica Giorgione
  - Fetal Medicine Unit, St George’s University Hospitals NHS Foundation Trust, London SW17 0RE, UK
  - Vascular Biology Research Centre, Molecular and Clinical Sciences Research Institute, St George’s University of London, London SW17 0RE, UK
- Géraldine Gascoin
  - Department of Neonatology, Angers University Hospital, F-49000 Angers, France
- Louis Marcellin
  - Department of Gynaecology, Obstetrics and Reproductive Medicine, Centre Hospitalier Universitaire (CHU) Cochin Faculté de Médecine, Assistance Publique-Hôpitaux de Paris (AP-HP), Hôpitaux Universitaires Paris Centre (HUPC), Université de Paris, 138 Boulevard de Port-Royal, 75014 Paris, France
- Cassandra Gaspar
  - Sorbonne Université, Inserm, UMS Production et Analyse des données en Sciences de la vie et en Santé, PASS, Plateforme Post-génomique de la Pitié-Salpêtrière, 75013 Paris, France
- Sébastien Jacques
  - Team ‘From Gametes to Birth’, Institut Cochin, U1016 INSERM, UMR 8104 CNRS, Paris-Descartes University, 75014 Paris, France
- Colin E. Murdoch
  - Systems Medicine, School of Medicine, University of Dundee, Dundee DD1 9SY, UK
- Francisco Miralles
  - Team ‘From Gametes to Birth’, Institut Cochin, U1016 INSERM, UMR 8104 CNRS, Paris-Descartes University, 75014 Paris, France
- Céline Méhats
  - Team ‘From Gametes to Birth’, Institut Cochin, U1016 INSERM, UMR 8104 CNRS, Paris-Descartes University, 75014 Paris, France
- Daniel Vaiman
  - Team ‘From Gametes to Birth’, Institut Cochin, U1016 INSERM, UMR 8104 CNRS, Paris-Descartes University, 75014 Paris, France
  - Correspondence: Tel.: +33-1-44412301; Fax: +33-1-44412302
3
Wang B, Gamazon ER. Modeling mutational effects on biochemical phenotypes using convolutional neural networks: application to SARS-CoV-2. iScience 2022; 25:104500. [PMID: 35669036 PMCID: PMC9159778 DOI: 10.1016/j.isci.2022.104500] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2021] [Revised: 11/15/2021] [Accepted: 05/26/2022] [Indexed: 11/29/2022] Open
Abstract
Deep mutational scanning (DMS) experiments have been performed on SARS-CoV-2’s spike receptor-binding domain (RBD) and human angiotensin-converting enzyme 2 (ACE2) zinc-binding peptidase domain, both central players in viral infection and evolution and antibody evasion, quantifying how mutations impact biochemical phenotypes. We modeled biochemical phenotypes from massively parallel assays, using neural networks trained on protein sequence mutations in the virus and human host. Neural networks were significantly predictive of binding affinity, protein expression, and antibody escape, learning complex interactions and higher-order features that are difficult to capture with conventional methods from structural biology. Integrating the physicochemical properties of amino acids, such as hydrophobicity and long-range non-bonded energy per atom, significantly improved prediction (empirical p < 0.01). We observed concordance of the neural network predictions with molecular dynamics (multiple 500 ns or 1 μs all-atom) simulations of the spike protein-ACE2 interface, with critical implications for the use of deep learning to dissect molecular mechanisms.
- Deep learning models of biochemical phenotypes from deep mutational scanning (DMS) data
- Prediction performance gain from using physicochemical properties of amino acids
- Concordance of neural network predictions with molecular dynamics simulations
- Improved causal inference properties for neural-network-defined phenotypes
Affiliation(s)
- Bo Wang
  - Division of Genetic Medicine, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37232, USA
- Eric R Gamazon
  - Division of Genetic Medicine, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37232, USA
  - Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN 37232, USA
  - Data Science Institute, Vanderbilt University Medical Center, Nashville, TN 37232, USA
  - Clare Hall, University of Cambridge, Cambridge CB3 9AL, UK
4
Navigating the pitfalls of applying machine learning in genomics. Nat Rev Genet 2022; 23:169-181. [PMID: 34837041 DOI: 10.1038/s41576-021-00434-9] [Citation(s) in RCA: 86] [Impact Index Per Article: 43.0] [Accepted: 10/28/2021] [Indexed: 11/08/2022]
Abstract
The scale of genetic, epigenomic, transcriptomic, cheminformatic and proteomic data available today, coupled with easy-to-use machine learning (ML) toolkits, has propelled the application of supervised learning in genomics research. However, the assumptions behind the statistical models and performance evaluations in ML software frequently are not met in biological systems. In this Review, we illustrate the impact of several common pitfalls encountered when applying supervised ML in genomics. We explore how the structure of genomics data can bias performance evaluations and predictions. To address the challenges associated with applying cutting-edge ML methods to genomics, we describe solutions and appropriate use cases where ML modelling shows great potential.
5
Malik MA, Michoel T. Restricted maximum-likelihood method for learning latent variance components in gene expression data with known and unknown confounders. G3 (Bethesda, Md.) 2022; 12:6447512. [PMID: 34864982 PMCID: PMC9210293 DOI: 10.1093/g3journal/jkab410] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/07/2021] [Accepted: 11/11/2021] [Indexed: 11/15/2022]
Abstract
Random effects models are popular statistical models for detecting and correcting spurious sample correlations due to hidden confounders in genome-wide gene expression data. In applications where some confounding factors are known, estimating simultaneously the contribution of known and latent variance components in random effects models is a challenge that has so far relied on numerical gradient-based optimizers to maximize the likelihood function. This is unsatisfactory because the resulting solution is poorly characterized and the efficiency of the method may be suboptimal. Here, we prove analytically that maximum-likelihood latent variables can always be chosen orthogonal to the known confounding factors, in other words, that maximum-likelihood latent variables explain sample covariances not already explained by known factors. Based on this result, we propose a restricted maximum-likelihood (REML) method that estimates the latent variables by maximizing the likelihood on the restricted subspace orthogonal to the known confounding factors and show that this reduces to probabilistic principal component analysis on that subspace. The method then estimates the variance-covariance parameters by maximizing the remaining terms in the likelihood function given the latent variables, using a newly derived analytic solution for this problem. Compared to gradient-based optimizers, our method attains greater or equal likelihood values, can be computed using standard matrix operations, results in latent factors that do not overlap with any known factors, and has a runtime reduced by several orders of magnitude. Hence, the REML method facilitates the application of random effects modeling strategies for learning latent variance components to much larger gene expression datasets than possible with current methods.
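The central idea of this abstract, latent factors chosen orthogonal to the known confounders, can be illustrated with a minimal Python sketch. It is not the authors' implementation: plain PCA (via SVD) stands in for probabilistic PCA, the variance-parameter estimation step is omitted, and the function name, the matrix names `Y` and `Z`, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def latent_factors_orthogonal_to_known(Y, Z, n_latent):
    """Illustrative helper (hypothetical name).

    Y: samples x genes expression matrix.
    Z: samples x d matrix of known confounders.
    Returns a samples x n_latent matrix of latent factors orthogonal to Z.
    """
    Y = Y - Y.mean(axis=0)                 # centre each gene
    Q, _ = np.linalg.qr(Z)                 # orthonormal basis for span(Z)
    R = Y - Q @ (Q.T @ Y)                  # remove the part of Y explained by Z
    U, _, _ = np.linalg.svd(R, full_matrices=False)
    return U[:, :n_latent]                 # top principal directions of the residual

# Simulated data, purely for illustration.
rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 2))
Y = rng.normal(size=(50, 200)) + Z @ rng.normal(size=(2, 200))
X = latent_factors_orthogonal_to_known(Y, Z, 3)
print(np.abs(Z.T @ X).max())               # should be close to zero
```

Because the decomposition is run only on the residual after projecting out `Z`, the returned factors cannot overlap with the known confounders, which mirrors the orthogonality property stated in the abstract.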
Affiliation(s)
- Muhammad Ammar Malik
  - Computational Biology Unit, Department of Informatics, University of Bergen, Bergen 5020, Norway
- Tom Michoel
  - Computational Biology Unit, Department of Informatics, University of Bergen, Bergen 5020, Norway
6
Payne NY, Gagnon-Bartsch JA. Separating and reintegrating latent variables to improve classification of genomic data. Biostatistics 2022; 23:1133-1149. [DOI: 10.1093/biostatistics/kxab046] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2021] [Revised: 11/09/2021] [Accepted: 11/24/2021] [Indexed: 11/12/2022] Open
Abstract
Summary
Genomic data sets contain the effects of various unobserved biological variables in addition to the variable of primary interest. These latent variables often affect a large number of features (e.g., genes), giving rise to dense latent variation. This latent variation presents both challenges and opportunities for classification. While some of these latent variables may be partially correlated with the phenotype of interest and thus helpful, others may be uncorrelated and merely contribute additional noise. Moreover, whether potentially helpful or not, these latent variables may obscure weaker effects that impact only a small number of features but more directly capture the signal of primary interest. To address these challenges, we propose the cross-residualization classifier (CRC). Through an adjustment and ensemble procedure, the CRC estimates and residualizes out the latent variation, trains a classifier on the residuals, and then reintegrates the latent variation in a final ensemble classifier. Thus, the latent variables are accounted for without discarding any potentially predictive information. We apply the method to simulated data and a variety of genomic data sets from multiple platforms. In general, we find that the CRC performs well relative to existing classifiers and sometimes offers substantial gains.
Affiliation(s)
- Nora Yujia Payne
  - Department of Statistics, University of Michigan, 1085 S. University Ave., Ann Arbor, MI 48109, USA
- Johann A Gagnon-Bartsch
  - Department of Statistics, University of Michigan, 1085 S. University Ave., Ann Arbor, MI 48109, USA
7
Gao C, Wei H, Zhang K. LORSEN: Fast and Efficient eQTL Mapping With Low Rank Penalized Regression. Front Genet 2021; 12:690926. [PMID: 34868194 PMCID: PMC8636089 DOI: 10.3389/fgene.2021.690926] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2021] [Accepted: 10/08/2021] [Indexed: 12/02/2022] Open
Abstract
Characterization of genetic variations that are associated with gene expression levels is essential to understand cellular mechanisms that underlie human complex traits. Expression quantitative trait loci (eQTL) mapping attempts to identify genetic variants, such as single nucleotide polymorphisms (SNPs), that affect the expression of one or more genes. With the availability of a large volume of gene expression data, it is necessary and important to develop fast and efficient statistical and computational methods to perform eQTL mapping for such large-scale data. In this paper, we proposed a new method, the low rank penalized regression method (LORSEN), for eQTL mapping. We evaluated and compared the performance of LORSEN with two existing methods for eQTL mapping using extensive simulations as well as real data from the HapMap3 project. Simulation studies showed that our method outperformed two commonly used methods for eQTL mapping, LORS and FastLORS, in many scenarios in terms of area under the curve (AUC). We illustrated the usefulness of our method by applying it to SNP variants data and gene expression levels on four chromosomes from the HapMap3 Project.
Affiliation(s)
- Cheng Gao
  - Department of Mathematical Sciences, Michigan Technological University, Houghton, MI, United States
- Hairong Wei
  - College of Forest Resources and Environmental Science, Michigan Technological University, Houghton, MI, United States
- Kui Zhang
  - Department of Mathematical Sciences, Michigan Technological University, Houghton, MI, United States
8
Ponsonby AL. Reflection on modern methods: building causal evidence within high-dimensional molecular epidemiological studies of moderate size. Int J Epidemiol 2021; 50:1016-1029. [PMID: 33594409 DOI: 10.1093/ije/dyaa174] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Accepted: 08/17/2020] [Indexed: 12/29/2022] Open
Abstract
This commentary provides a practical perspective on epidemiological analysis within a single high-dimensional study of moderate size to consider a causal question. In this setting, non-causal confounding is important. This occurs when a factor is a determinant of outcome and the underlying association between exposure and the factor is non-causal. That is, the association arises due to chance, confounding or other bias rather than reflecting that exposure and the factor are causally related. In particular, the influence of technical processing factors must be accounted for by pre-processing measures to remove artefact or to control for these factors such as batch run. Work steps include the evaluation of alternative non-causal explanations for observed exposure-disease associations and strategies to obtain the highest level of causal inference possible within the study. A systematic approach is required to work through a question set and obtain insights on not only the exposure-disease association but also the multifactorial causal structure of the underlying data where possible. The appropriate inclusion of molecular findings will enhance the quest to better understand multifactorial disease causation in modern observational epidemiological studies.
9
Gerard D, Stephens M. Unifying and generalizing methods for removing unwanted variation based on negative controls. Stat Sin 2021; 31:1145-1166. [PMID: 38148787 PMCID: PMC10751021 DOI: 10.5705/ss.202018.0345] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 12/28/2023]
Abstract
Unwanted variation, including hidden confounding, is a well-known problem in many fields, but particularly in large-scale gene expression studies. Recent proposals to use control genes, genes assumed to be unassociated with the covariates of interest, have led to new methods to deal with this problem. Several versions of these removing unwanted variation (RUV) methods have been proposed, including RUV1, RUV2, RUV4, RUVinv, RUVrinv, and RUVfun. Here, we introduce a general framework, RUV*, that both unites and generalizes these approaches. This unifying framework helps clarify the connections between existing methods. In particular, we provide conditions under which RUV2 and RUV4 are equivalent. The RUV* framework preserves an advantage of the RUV approaches, namely, their modularity, which facilitates the development of novel methods based on existing matrix imputation algorithms. We illustrate this by implementing RUVB, a version of RUV* based on Bayesian factor analysis. In realistic simulations based on real data, we found RUVB to be competitive with existing methods in terms of both power and calibration. However, providing a consistently reliable calibration among the data sets remains challenging.
Affiliation(s)
- David Gerard
  - Department of Mathematics and Statistics, American University, Washington, DC 20016, USA
- Matthew Stephens
  - Departments of Human Genetics and Statistics, University of Chicago, Chicago, IL 60637, USA
10
Mathew B, Léon J, Dadshani S, Pillen K, Sillanpää MJ, Naz AA. Importance of correcting genomic relationships in single-locus QTL mapping model with an advanced backcross population. G3 Genes|Genomes|Genetics 2021; 11:6211194. [PMID: 33822941 PMCID: PMC8495747 DOI: 10.1093/g3journal/jkab105] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/09/2021] [Accepted: 03/18/2021] [Indexed: 11/29/2022]
Abstract
Advanced backcross (AB) populations have been widely used to identify and utilize beneficial alleles in various crops such as rice, tomato, wheat, and barley. An AB population is developed with a controlled crossing scheme, and this controlled crossing, along with the selection (both natural and artificial) of agronomically adapted alleles during the development of the AB population, may lead to unbalanced allele frequencies in the population. However, it is commonly believed that interval mapping of traits in experimental crosses such as AB populations is immune to the deviations from the expected frequencies under Mendelian segregation. Using two AB populations and simulated data sets as examples, we describe the severity of the problem caused by unbalanced allele frequencies in quantitative trait loci mapping and demonstrate how it can be corrected using the linear mixed model having a polygenic effect with the covariance structure (genomic relationship matrix) calculated from molecular markers.
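As a rough illustration of a genomic relationship matrix calculated from molecular markers, the Python sketch below uses a VanRaden-style estimator, centring each marker by its observed allele frequency. The abstract does not say which estimator the authors used, so the estimator, the 0/1/2 genotype coding, the function name, and the toy genotypes are all assumptions.

```python
import numpy as np

def genomic_relationship_matrix(M):
    """Illustrative helper (hypothetical name).

    M: individuals x markers matrix coded as 0/1/2 copies of a reference allele.
    Returns an individuals x individuals relationship matrix.
    """
    M = np.asarray(M, dtype=float)
    p = M.mean(axis=0) / 2.0                   # observed allele frequency per marker
    keep = (p > 0) & (p < 1)                   # monomorphic markers carry no information
    M, p = M[:, keep], p[keep]
    W = M - 2.0 * p                            # centre by observed, not expected, frequency
    return W @ W.T / (2.0 * np.sum(p * (1.0 - p)))

# Toy genotypes for four individuals and four markers (the last is monomorphic).
genotypes = np.array([[0, 1, 2, 0],
                      [1, 1, 0, 0],
                      [2, 0, 1, 0],
                      [1, 2, 1, 0]])
K = genomic_relationship_matrix(genotypes)
print(K.round(3))
```

Centring by the observed frequencies rather than by the values expected under Mendelian segregation is the design choice that matters here, since the abstract's concern is precisely that selection can push allele frequencies away from those expectations.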
Affiliation(s)
- Boby Mathew
  - Institute of Crop Science and Resource Conservation, Department of Plant Breeding, University of Bonn, 53115 Bonn, Germany
- Jens Léon
  - Institute of Crop Science and Resource Conservation, Department of Plant Breeding, University of Bonn, 53115 Bonn, Germany
- Said Dadshani
  - Institute of Crop Science and Resource Conservation, Department of Plant Breeding, University of Bonn, 53115 Bonn, Germany
- Klaus Pillen
  - Department of Plant Breeding, Institute of Agricultural and Nutritional Sciences, Martin-Luther University Halle-Wittenberg, 06120 Halle (Saale), Germany
- Ali Ahmad Naz
  - Institute of Crop Science and Resource Conservation, Department of Plant Breeding, University of Bonn, 53115 Bonn, Germany
11
Mao W, Rahimikollu J, Hausler R, Chikina M. DataRemix: a universal data transformation for optimal inference from gene expression datasets. Bioinformatics 2021; 37:984-991. [PMID: 32821903 DOI: 10.1093/bioinformatics/btaa745] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 04/30/2020] [Revised: 08/01/2020] [Accepted: 08/17/2020] [Indexed: 11/12/2022] Open
Abstract
Motivation: RNA-seq technology provides unprecedented power in the assessment of the transcription abundance and can be used to perform a variety of downstream tasks such as inference of gene-correlation network and eQTL discovery. However, raw gene expression values have to be normalized for nuisance biological variation and technical covariates, and different normalization strategies can lead to dramatically different results in the downstream study.
Results: We describe a generalization of singular value decomposition-based reconstruction for which the common techniques of whitening, rank-k approximation and removing the top k principal components are special cases. Our simple three-parameter transformation, DataRemix, can be tuned to reweigh the contribution of hidden factors and reveal otherwise hidden biological signals. In particular, we demonstrate that the method can effectively prioritize biological signals over noise without leveraging external dataset-specific knowledge, and can outperform normalization methods that make explicit use of known technical factors. We also show that DataRemix can be efficiently optimized via Thompson sampling approach, which makes it feasible for computationally expensive objectives such as eQTL analysis. Finally, we apply our method to the Religious Orders Study and Memory and Aging Project dataset, and we report what to our knowledge is the first replicable trans-eQTL effect in human brain.
Availability and implementation: DataRemix is an R package which is freely available at GitHub (https://github.com/wgmao/DataRemix).
Supplementary information: Supplementary data are available at Bioinformatics online.
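The three techniques the abstract names as special cases (whitening, rank-k approximation, and removing the top k principal components) can each be written as a reweighting of singular values, which the Python sketch below makes concrete. This is not DataRemix's own three-parameter form, which the abstract does not spell out; the function names and random data are assumptions for illustration only.

```python
import numpy as np

def reweight_singular_values(X, weight_fn):
    """Rebuild X from its SVD after replacing the singular values s by weight_fn(s)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * weight_fn(s)) @ Vt             # scales column i of U by the i-th weight

def rank_k(k):
    # Keep the top k singular values, zero the rest.
    return lambda s: np.where(np.arange(s.size) < k, s, 0.0)

def remove_top_k(k):
    # Zero the top k singular values, keep the rest.
    return lambda s: np.where(np.arange(s.size) < k, 0.0, s)

def whiten(s):
    # Give every component unit weight.
    return np.ones_like(s)

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 8))
X_low = reweight_singular_values(X, rank_k(2))
X_res = reweight_singular_values(X, remove_top_k(2))
X_white = reweight_singular_values(X, whiten)
print(np.allclose(X_low + X_res, X))
```

Seen this way, a tunable transformation only has to interpolate between these weightings, which is presumably what lets a small number of parameters cover all three cases.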
Affiliation(s)
- Weiguang Mao
  - Joint Carnegie Mellon-University of Pittsburgh Ph.D. Program in Computational Biology, Pittsburgh, PA 15260, USA
  - Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, 15260, USA
- Javad Rahimikollu
  - Joint Carnegie Mellon-University of Pittsburgh Ph.D. Program in Computational Biology, Pittsburgh, PA 15260, USA
  - Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, 15260, USA
- Ryan Hausler
  - Department of Medicine, Division of Hematology/Oncology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA
- Maria Chikina
  - Joint Carnegie Mellon-University of Pittsburgh Ph.D. Program in Computational Biology, Pittsburgh, PA 15260, USA
  - Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, 15260, USA
12
Zhang W, Ghosh D. A general approach to sensitivity analysis for Mendelian randomization. Statistics in Biosciences 2021; 13:34-55. [PMID: 33737984 DOI: 10.1007/s12561-020-09280-5] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 10/24/2022]
Abstract
Mendelian Randomization (MR) represents a class of instrumental variable methods using genetic variants. It has become popular in epidemiological studies to account for the unmeasured confounders when estimating the effect of exposure on outcome. The success of Mendelian Randomization depends on three critical assumptions, which are difficult to verify. Therefore, sensitivity analysis methods are needed for evaluating results and making plausible conclusions. We propose a general and easy-to-apply approach to conduct sensitivity analysis for Mendelian Randomization studies. Bound et al. (1995) derived a formula for the asymptotic bias of the instrumental variable estimator. Based on their work, we derive a new sensitivity analysis formula. The parameters in the formula include sensitivity parameters such as the correlation between instruments and unmeasured confounder, the direct effect of instruments on outcome and the strength of instruments. In our simulation studies, we examined our approach in various scenarios using either individual SNPs or unweighted allele score as instruments. Using a previously published dataset from a bone mineral density study, we demonstrate that our proposed method is a useful tool for MR studies, and that investigators can combine their domain knowledge with our method to obtain bias-corrected results and make informed conclusions on the scientific plausibility of their findings.
Affiliation(s)
- Weiming Zhang
  - Department of Biostatistics and Informatics, Colorado School of Public Health, Aurora, Colorado, U.S.A
- Debashis Ghosh
  - Department of Biostatistics and Informatics, Colorado School of Public Health, Aurora, Colorado, U.S.A
13
Modeling mutational effects on biochemical phenotypes using convolutional neural networks: application to SARS-CoV-2. bioRxiv: the preprint server for biology 2021. [PMID: 33532766 PMCID: PMC7852230 DOI: 10.1101/2021.01.28.428521] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 12/11/2022]
Abstract
Biochemical phenotypes are major indexes for protein structure and function characterization. They are determined, at least in part, by the intrinsic physicochemical properties of amino acids and may be reflected in the protein three-dimensional structure. Modeling mutational effects on biochemical phenotypes is a critical step for understanding protein function and disease mechanism as well as enabling drug discovery. Deep Mutational Scanning (DMS) experiments have been performed on SARS-CoV-2’s spike receptor binding domain and the human ACE2 zinc-binding peptidase domain - both central players in viral infection and evolution and antibody evasion - quantifying how mutations impact binding affinity and protein expression. Here, we modeled biochemical phenotypes from massively parallel assays, using convolutional neural networks trained on protein sequence mutations in the virus and human host. We found that neural networks are significantly predictive of binding affinity, protein expression, and antibody escape, learning complex interactions and higher-order features that are difficult to capture with conventional methods from structural biology. Integrating the intrinsic physicochemical properties of amino acids, including hydrophobicity, solvent-accessible surface area, and long-range non-bonded energy per atom, significantly improved prediction (empirical p<0.01), although the sequence data alone already yielded reasonably good prediction. We observed concordance of the DMS data and our neural network predictions with an independent study on intermolecular interactions from molecular dynamics (multiple 500 ns or 1 μs all-atom) simulations of the spike protein-ACE2 interface, with critical implications for the use of deep learning to dissect molecular mechanisms.
The mutation- or genetically-determined component of a biochemical phenotype estimated from the neural networks has improved causal inference properties relative to the original phenotype and can facilitate crucial insights into disease pathophysiology and therapeutic design.
|
14
|
Choi JH, Kim T, Jung J, Joo JWJ. Fully automated web-based tool for identifying regulatory hotspots. BMC Genomics 2020; 21:616. [PMID: 33208108 PMCID: PMC7677835 DOI: 10.1186/s12864-020-07012-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022] Open
Abstract
Background Regulatory hotspots are genetic variations that may regulate the expression levels of many genes. It has been of great interest to find those hotspots utilizing expression quantitative trait locus (eQTL) analysis. However, it has been reported that many of the findings are spurious hotspots induced by various unknown confounding factors. Recently, methods utilizing complicated statistical models have been developed that successfully identify genuine hotspots. Next-generation Intersample Correlation Emended (NICE) is one of the methods that show high sensitivity and low false-discovery rate in finding regulatory hotspots. Even though the methods successfully find genuine hotspots, they have not been widely used due to their non-user-friendly interfaces and complex running processes. Furthermore, most of the methods are impractical due to their prohibitively high computational complexity. Results To overcome the limitations of existing methods, we developed a fully automated web-based tool, referred to as NICER (NICE Renew), which is based on the NICE program. First, we dramatically reduced the burden of installing and running NICE. Second, we significantly reduced running time by incorporating multi-processing. Third, besides our web-based NICER, users can use NICER on Google Compute Engine and can readily install and run the NICER web service on their local computers. Finally, we provide different input formats and visualization tools to show results. Utilizing a yeast dataset, we show that NICER can be successfully used in an eQTL analysis to identify many genuine regulatory hotspots, more than half of which were previously reported elsewhere. Conclusions Even though many hotspot analysis tools have been proposed, they have not been widely used for many practical reasons. NICER is a fully automated web-based solution for eQTL mapping and regulatory hotspot analysis. NICER provides a user-friendly interface and has made hotspot analysis more viable by reducing the running time significantly. We believe that NICER will become the method of choice for increasing power of eQTL hotspot analysis.
Affiliation(s)
- Ju Hun Choi
- Department of Computer Science and Engineering, Dongguk University-Seoul, Seoul, 04620, South Korea
| | - Taegun Kim
- Department of Computer Science and Engineering, Dongguk University-Seoul, Seoul, 04620, South Korea
| | - Junghyun Jung
- Department of Life Science, Dongguk University-Seoul, Seoul, 04620, South Korea
| | - Jong Wha J Joo
- Department of Computer Science and Engineering, Dongguk University-Seoul, Seoul, 04620, South Korea.
|
15
|
Jacob L, Witteveen A, Beumer I, Delahaye L, Wehkamp D, van den Akker J, Snel M, Chan B, Floore A, Bakx N, Brink G, Poncet C, Bogaerts J, Delorenzi M, Piccart M, Rutgers E, Cardoso F, Speed T, van 't Veer L, Glas A. Controlling technical variation amongst 6693 patient microarrays of the randomized MINDACT trial. Commun Biol 2020; 3:397. [PMID: 32719399 PMCID: PMC7385160 DOI: 10.1038/s42003-020-1111-1] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 01/03/2020] [Accepted: 06/23/2020] [Indexed: 12/12/2022] Open
Abstract
Gene expression data obtained in large studies hold great promise for discovering disease signatures or subtypes through data analysis. They are also prone to technical variation, whose removal is essential to avoid spurious discoveries. Because this variation is not always known and can be confounded with biological signals, its removal is a challenging task. Here we provide a step-wise procedure and comprehensive analysis of the MINDACT microarray dataset. The MINDACT trial enrolled 6693 breast cancer patients and prospectively validated the gene expression signature MammaPrint for outcome prediction. The study also yielded a full-transcriptome microarray for each tumor. We show for the first time in such a large dataset how technical variation can be removed while retaining expected biological signals. Because of its unprecedented size, we hope the resulting adjusted dataset will be an invaluable tool to discover or test gene expression signatures and to advance our understanding of breast cancer.
Affiliation(s)
- Laurent Jacob
- Université de Lyon, Université Lyon 1, CNRS, Laboratoire de Biométrie et Biologie Évolutive UMR 5558, Villeurbanne, France
| | | | - Inès Beumer
- Agendia NV/Agendia Inc, Amsterdam, The Netherlands
| | | | | | | | | | - Bob Chan
- Agendia NV/Agendia Inc, Amsterdam, The Netherlands
| | - Arno Floore
- Agendia NV/Agendia Inc, Amsterdam, The Netherlands
| | - Niels Bakx
- Agendia NV/Agendia Inc, Amsterdam, The Netherlands
| | - Guido Brink
- Agendia NV/Agendia Inc, Amsterdam, The Netherlands
| | | | | | - Mauro Delorenzi
- University Lausanne, Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | | | - Emiel Rutgers
- Netherlands Cancer Institute, Amsterdam, The Netherlands
| | | | - Terence Speed
- Walter and Eliza Hall Institute of Medical Research, Melbourne, VIC, Australia
| | - Laura van 't Veer
- Agendia NV/Agendia Inc, Amsterdam, The Netherlands.
- Helen Diller Family Comprehensive Cancer Center, University California San Francisco, San Francisco, CA, USA.
| | - Annuska Glas
- Agendia NV/Agendia Inc, Amsterdam, The Netherlands.
|
16
|
Abstract
BACKGROUND With the explosion in the number of methods designed to analyze bulk and single-cell RNA-seq data, there is a growing need for approaches that assess and compare these methods. The usual technique is to compare methods on data simulated according to some theoretical model. However, as real data often exhibit violations from theoretical models, this can result in unsubstantiated claims of a method's performance. RESULTS Rather than generate data from a theoretical model, in this paper we develop methods to add signal to real RNA-seq datasets. Since the resulting simulated data are not generated from an unrealistic theoretical model, they exhibit realistic (annoying) attributes of real data. This lets RNA-seq methods developers assess their procedures in non-ideal (model-violating) scenarios. Our procedures may be applied to both single-cell and bulk RNA-seq. We show that our simulation method results in more realistic datasets and can alter the conclusions of a differential expression analysis study. We also demonstrate our approach by comparing various factor analysis techniques on RNA-seq datasets. CONCLUSIONS Using data simulated from a theoretical model can substantially impact the results of a study. We developed more realistic simulation techniques for RNA-seq data. Our tools are available in the seqgendiff R package on the Comprehensive R Archive Network: https://cran.r-project.org/package=seqgendiff.
Affiliation(s)
- David Gerard
- Department of Mathematics and Statistics, American University, Massachusetts Ave NW, Washington, DC, 20016, USA.
|
17
|
Rhyne J, Jeng XJ, Chi EC, Tzeng J. FastLORS: Joint modelling for expression quantitative trait loci mapping in R. Stat (Int Stat Inst) 2020. [DOI: 10.1002/sta4.265] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/11/2022]
Affiliation(s)
- Jacob Rhyne
- Department of Statistics North Carolina State University Raleigh 27695 NC USA
| | - X. Jessie Jeng
- Department of Statistics North Carolina State University Raleigh 27695 NC USA
| | - Eric C. Chi
- Department of Statistics North Carolina State University Raleigh 27695 NC USA
| | - Jung‐Ying Tzeng
- Department of Statistics North Carolina State University Raleigh 27695 NC USA
|
18
|
Ellis JL, Pecanka J, Goeman JJ. Gaining power in multiple testing of interval hypotheses via conditionalization. Biostatistics 2020; 21:e65-e79. [PMID: 30247521 DOI: 10.1093/biostatistics/kxy042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2017] [Revised: 07/12/2018] [Accepted: 08/04/2018] [Indexed: 11/14/2022] Open
Abstract
In this article, we introduce a novel procedure for improving power of multiple testing procedures (MTPs) of interval hypotheses. When testing interval hypotheses the null hypothesis $P$-values tend to be stochastically larger than standard uniform if the true parameter is in the interior of the null hypothesis. The new procedure starts with a set of $P$-values and discards those with values above a certain pre-selected threshold, while the rest are corrected (scaled-up) by the value of the threshold. Subsequently, a chosen family-wise error rate (FWER) or false discovery rate MTP is applied to the set of corrected $P$-values only. We prove the general validity of this procedure under independence of $P$-values, and for the special case of the Bonferroni method, we formulate several sufficient conditions for the control of the FWER. It is demonstrated that this "filtering" of $P$-values can yield considerable gains of power.
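The filtering step described in this abstract can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function name `conditionalized_bonferroni`, the default threshold of 0.5, and the use of non-strict inequalities are assumptions made here, and Bonferroni is used only because the abstract names it as a special case.

```python
def conditionalized_bonferroni(p_values, threshold=0.5, alpha=0.05):
    """Return indices of hypotheses rejected after filtering and rescaling.

    Hypothetical helper: discard p-values above `threshold`, divide the
    remaining ones by `threshold`, then apply Bonferroni to the kept set only.
    """
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must lie in (0, 1]")
    # Keep only p-values at or below the threshold, scaled up by 1/threshold.
    kept = [(i, p / threshold) for i, p in enumerate(p_values) if p <= threshold]
    if not kept:
        return []
    # Bonferroni cutoff uses the number of kept hypotheses, not the total.
    cutoff = alpha / len(kept)
    return [i for i, p_scaled in kept if p_scaled <= cutoff]


# Illustrative p-values (made up): many large values, as expected when the
# true parameter lies in the interior of an interval null hypothesis.
example_p = [0.001, 0.007, 0.3, 0.6, 0.7, 0.8, 0.9, 0.95]
filtered = conditionalized_bonferroni(example_p, threshold=0.5)
unfiltered = conditionalized_bonferroni(example_p, threshold=1.0)  # plain Bonferroni
```

With a threshold of 1.0 nothing is discarded and the function reduces to ordinary Bonferroni, which makes the two results directly comparable. The gain appears only when enough p-values are discarded to offset the rescaling.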
Affiliation(s)
- Jules L Ellis
- Behavioral Science Institute, Radboud University Nijmegen, Postbus 9104, 6500 HE, Nijmegen, The Netherlands
| | - Jakub Pecanka
- Biomedical Data Sciences, Leiden University Medical Center, Postbus 9600, 2300 RC, Leiden, The Netherlands
| | - Jelle J Goeman
- Biomedical Data Sciences, Leiden University Medical Center, Postbus 9600, 2300 RC, Leiden, The Netherlands
|
19
|
Mefford J, Park D, Zheng Z, Ko A, Ala-Korpela M, Laakso M, Pajukanta P, Yang J, Witte J, Zaitlen N. Efficient Estimation and Applications of Cross-Validated Genetic Predictions to Polygenic Risk Scores and Linear Mixed Models. J Comput Biol 2020; 27:599-612. [PMID: 32077750 PMCID: PMC7185352 DOI: 10.1089/cmb.2019.0325] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Indexed: 01/19/2023] Open
Abstract
Large-scale cohorts with combined genetic and phenotypic data, coupled with methodological advances, have produced increasingly accurate genetic predictors of complex human phenotypes called polygenic risk scores (PRSs). In addition to the potential translational impacts of identifying at-risk individuals, PRS are being utilized for a growing list of scientific applications, including causal inference, identifying pleiotropy and genetic correlation, and powerful gene-based and mixed-model association tests. Existing PRS approaches rely on external large-scale genetic cohorts that have also measured the phenotype of interest. They further require matching on ancestry and genotyping platform or imputation quality. In this work, we present a novel reference-free method to produce a PRS that does not rely on an external cohort. We show that naive implementations of reference-free PRS either result in substantial overfitting or prohibitive increases in computational time. We show that our algorithm avoids both of these issues and can produce informative in-sample PRSs over a single cohort without overfitting. We then demonstrate several novel applications of reference-free PRSs, including detection of pleiotropy across 246 metabolic traits and efficient mixed-model association testing.
Affiliation(s)
| | - Danny Park
- School of Medicine, UCSF, San Francisco, California
| | - Zhili Zheng
- Institute for Molecular Bioscience, University of Queensland, Brisbane, Queensland, Australia
| | - Arthur Ko
- Human Genetics, UCLA, Los Angeles, California
| | - Mika Ala-Korpela
- Baker IDI Heart and Diabetes Institute, Melbourne, Victoria, Australia
- University of Oulu Biocenter, Oulu, Finland
- NMR Metabolomics Laboratory, School of Pharmacy, University of Eastern Finland, Kuopio, Finland
- University of Bristol School of Medical Sciences, Population Health Science, Bristol, Bristol, United Kingdom
| | - Markku Laakso
- Department of Medicine, University of Eastern Finland School of Medicine, Kuopio, Finland
| | | | - Jian Yang
- Institute for Molecular Bioscience, University of Queensland, Brisbane, Queensland, Australia
| | - John Witte
- Departments of Epidemiology and Biostatistics, and Urology, UCSF, San Francisco, California
|
20
|
Jeng XJ, Rhyne J, Zhang T, Tzeng JY. Effective SNP ranking improves the performance of eQTL mapping. Genet Epidemiol 2020; 44:611-619. [PMID: 32216117 DOI: 10.1002/gepi.22293] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 10/28/2019] [Revised: 02/21/2020] [Accepted: 03/11/2020] [Indexed: 11/06/2022]
Abstract
Genome-wide expression quantitative trait loci (eQTL) mapping explores the relationship between gene expression and DNA variants, such as single-nucleotide polymorphisms (SNPs), to understand the genetic basis of human diseases. Due to the large number of genes and SNPs that need to be assessed, current methods for eQTL mapping often suffer from low detection power, especially for identifying trans-eQTLs. In this paper, we propose the idea of performing SNP ranking based on the higher criticism (HC) statistic, a summary statistic developed in large-scale signal detection. We illustrate how the HC-based SNP ranking can effectively prioritize eQTL signals over noise, greatly reduce the burden of joint modeling, and improve the power for eQTL mapping. Numerical results in simulation studies demonstrate the superior performance of our method compared to existing methods. The proposed method is also evaluated in HapMap eQTL data analysis and the results are compared to a database of known eQTLs.
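As a rough illustration of ranking by a higher criticism statistic, the sketch below uses the standard Donoho-Jin form, HC = max over i ≤ α₀n of √n (i/n − p₍ᵢ₎) / √(p₍ᵢ₎(1 − p₍ᵢ₎)), computed from sorted p-values. The abstract does not say how the authors construct their ranking, so scoring each SNP by the HC of its p-values across genes, the names `higher_criticism` and `rank_snps_by_hc`, and the choice α₀ = 0.5 are all assumptions.

```python
import math


def higher_criticism(p_values, alpha0=0.5):
    """Donoho-Jin higher criticism statistic over the smallest alpha0 fraction."""
    n = len(p_values)
    if n == 0:
        raise ValueError("need at least one p-value")
    eps = 1e-12  # clip to avoid division by zero at p = 0 or p = 1
    sorted_p = sorted(min(max(p, eps), 1.0 - eps) for p in p_values)
    k = max(1, int(alpha0 * n))
    best = -math.inf
    for i, p in enumerate(sorted_p[:k], start=1):
        # Standardized gap between the empirical and the uniform quantile.
        z = math.sqrt(n) * (i / n - p) / math.sqrt(p * (1.0 - p))
        best = max(best, z)
    return best


def rank_snps_by_hc(p_by_snp):
    """Hypothetical ranking: SNP ids sorted by decreasing HC score."""
    scores = {snp: higher_criticism(ps) for snp, ps in p_by_snp.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Made-up p-values: one SNP with a few very small p-values across genes,
# one SNP whose p-values look roughly uniform.
example = {
    "snp_null": [0.1, 0.3, 0.5, 0.7, 0.9],
    "snp_signal": [1e-6, 1e-5, 0.2, 0.5, 0.9],
}
ranking = rank_snps_by_hc(example)
```

A SNP with a handful of extreme p-values scores far higher than one with uniform-looking p-values, which is the sense in which such a ranking could prioritize signal before any joint modeling is attempted.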
Affiliation(s)
- X Jessie Jeng
- Department of Statistics, North Carolina State University, Raleigh, North Carolina
| | - Jacob Rhyne
- Department of Statistics, North Carolina State University, Raleigh, North Carolina
| | - Teng Zhang
- Department of Statistics, North Carolina State University, Raleigh, North Carolina
| | - Jung-Ying Tzeng
- Department of Statistics, North Carolina State University, Raleigh, North Carolina; Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina; Department of Statistics, National Cheng-Kung University, Tainan, Taiwan; Division of Biostatistics, Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei, Taiwan
|
21
|
A Multi-Omics Perspective of Quantitative Trait Loci in Precision Medicine. Trends Genet 2020; 36:318-336. [PMID: 32294413 DOI: 10.1016/j.tig.2020.01.009] [Citation(s) in RCA: 30] [Impact Index Per Article: 7.5] [Received: 06/26/2019] [Revised: 01/05/2020] [Accepted: 01/21/2020] [Indexed: 02/07/2023]
Abstract
Quantitative trait loci (QTL) analysis is an important approach to investigate the effects of genetic variants identified through an increasing number of large-scale, multidimensional 'omics data sets. In this 'big data' era, the research community has identified a significant number of molecular QTLs (molQTLs) and increased our understanding of their effects. Herein, we review multiple categories of molQTLs, including those associated with transcriptome, post-transcriptional regulation, epigenetics, proteomics, metabolomics, and the microbiome. We summarize approaches to identify molQTLs and to infer their causal effects. We further discuss the integrative analysis of molQTLs through a multi-omics perspective. Our review highlights future opportunities to better understand the functional significance of genetic variants and to utilize the discovery of molQTLs in precision medicine.
|
22
|
Dahl A, Guillemot V, Mefford J, Aschard H, Zaitlen N. Adjusting for Principal Components of Molecular Phenotypes Induces Replicating False Positives. Genetics 2019; 211:1179-1189. [PMID: 30692194 PMCID: PMC6456307 DOI: 10.1534/genetics.118.301768] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 11/04/2018] [Accepted: 01/23/2019] [Indexed: 12/20/2022] Open
Abstract
High-throughput measurements of molecular phenotypes provide an unprecedented opportunity to model cellular processes and their impact on disease. These highly structured datasets are usually strongly confounded, creating false positives and reducing power. This has motivated many approaches based on principal components analysis (PCA) to estimate and correct for confounders, which have become indispensable elements of association tests between molecular phenotypes and both genetic and nongenetic factors. Here, we show that these correction approaches induce a bias, and that it persists for large sample sizes and replicates out-of-sample. We prove this theoretically for PCA by deriving an analytic, deterministic, and intuitive bias approximation. We assess other methods with realistic simulations, which show that perturbing any of several basic parameters can cause false positive rate (FPR) inflation. Our experiments show the bias depends on covariate and confounder sparsity, effect sizes, and their correlation. Surprisingly, when the covariate and confounder have [Formula: see text], standard two-step methods all have [Formula: see text]-fold FPR inflation. Our analysis informs best practices for confounder correction in genomic studies, and suggests many false discoveries have been made and replicated in some differential expression analyses.
Affiliation(s)
- Andy Dahl
- Department of Medicine, University of California San Francisco, 94158 California
| | - Vincent Guillemot
- Centre de Bioinformatique, Biostatistique et Biologie Intégrative, Institut Pasteur, Paris, 75015 France
| | - Joel Mefford
- Department of Medicine, University of California San Francisco, 94158 California
| | - Hugues Aschard
- Centre de Bioinformatique, Biostatistique et Biologie Intégrative, Institut Pasteur, Paris, 75015 France
- Department of Epidemiology, Harvard TH Chan School of Public Health, Boston, 02115 Massachusetts
| | - Noah Zaitlen
- Department of Medicine, University of California San Francisco, 94158 California
|
23
|
Huo Z, Song C, Tseng G. Bayesian latent hierarchical model for transcriptomic meta-analysis to detect biomarkers with clustered meta-patterns of differential expression signals. Ann Appl Stat 2019; 13:340-366. [PMID: 31007807 PMCID: PMC6472949 DOI: 10.1214/18-aoas1188] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 01/12/2023]
Abstract
Due to the rapid development of high-throughput experimental techniques and fast-dropping prices, many transcriptomic datasets have been generated and accumulated in the public domain. Meta-analysis combining multiple transcriptomic studies can increase the statistical power to detect disease-related biomarkers. In this paper, we introduce a Bayesian latent hierarchical model to perform transcriptomic meta-analysis. This method is capable of detecting genes that are differentially expressed (DE) in only a subset of the combined studies, and the latent variables help quantify homogeneous and heterogeneous differential expression signals across studies. A tight clustering algorithm is applied to detected biomarkers to capture differential meta-patterns that are informative to guide further biological investigation. Simulations and three examples, including a microarray dataset from metabolism-related knockout mice, an RNA-seq dataset from HIV transgenic rats, and cross-platform datasets from human breast cancer, are used to demonstrate the performance of the proposed method.
Affiliation(s)
- Zhiguang Huo
- Department of Biostatistics University of Florida Gainesville, FL 32611
| | - Chi Song
- Division of Biostatistics College of Public Health The Ohio State University Columbus, OH 43210
| | - George Tseng
- Department of Biostatistics, Human Genetics and Computational Biology University of Pittsburgh Pittsburgh, PA 15261
|
24
|
Chaturvedi N, Menezes RXD, Goeman JJ, Wieringen WV. A test for detecting differential indirect trans effects between two groups of samples. Stat Appl Genet Mol Biol 2018; 17:/j/sagmb.ahead-of-print/sagmb-2017-0058/sagmb-2017-0058.xml. [PMID: 30059350 DOI: 10.1515/sagmb-2017-0058] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
Abstract
Integrative analysis of copy number and gene expression data can help in understanding the cis and trans effect of copy number aberrations on transcription levels of genes involved in a pathway. To analyse how these copy number mediated gene-gene interactions differ between groups of samples we propose a new method, named dNET. Our method uses ridge regression to model the network topology involving one gene's expression level, its gene dosage and the expression levels of other genes in the network. The interaction parameters are estimated by fitting the model per gene for all samples together. However, instead of testing for differential network topology per gene, dNET tests for an overall difference in estimated parameters between two groups of samples and produces a single p-value. With the help of several simulation studies, we show that dNET can detect differential network nodes with high accuracy and low rate of false positives even in the presence of differential cis effects. We also apply dNET to publicly available TCGA cancer datasets and identify pathways where copy number mediated gene-gene interactions differ between samples with cancer stage lower than stage 3 and samples with cancer stage 3 or above.
Affiliation(s)
- Nimisha Chaturvedi
- Afdeling Epidemiologie en Biostatistiek, Amsterdam Public Health Research Institute, Medische Faculteit (F-vleugel), VU Medisch Centrum, 1007 MB Amsterdam, The Netherlands
- Netherlands Bioinformatics Center, 260 NBIC, 6500 HB Nijmegen, The Netherlands
| | - Renée X de Menezes
- Afdeling Epidemiologie en Biostatistiek, Amsterdam Public Health Research Institute, Medische Faculteit (F-vleugel), VU Medisch Centrum, 1007 MB Amsterdam, The Netherlands
- Netherlands Bioinformatics Center, 260 NBIC, 6500 HB Nijmegen, The Netherlands
| | - Jelle J Goeman
- Department of Biomedical Data Sciences, Room Number S5-P, LUMC Main Building, Leiden University Medical Center, Albinusdreef 2, 2333 ZA Leiden, The Netherlands
| | - Wessel van Wieringen
- Afdeling Epidemiologie en Biostatistiek, Amsterdam Public Health Research Institute, Medische Faculteit (F-vleugel), VU Medisch Centrum, 1007 MB Amsterdam, The Netherlands
- Department of Mathematics, Amsterdam Public Health Research Institute, Faculty of Sciences, Vrije Universiteit, De Boelelaan 1081a, 1081 HV Amsterdam, The Netherlands
|
25
|
Toosi A, Fernando RL, Dekkers JCM. Genome-wide mapping of quantitative trait loci in admixed populations using mixed linear model and Bayesian multiple regression analysis. Genet Sel Evol 2018; 50:32. [PMID: 29914353 PMCID: PMC6006859 DOI: 10.1186/s12711-018-0402-1] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 10/10/2017] [Accepted: 06/01/2018] [Indexed: 12/18/2022] Open
Abstract
Background Population stratification and cryptic relationships have been the main sources of excessive false-positives and false-negatives in population-based association studies. Many methods have been developed to model these confounding factors and minimize their impact on the results of genome-wide association studies. In most of these methods, a two-stage approach is applied where: (1) methods are used to determine if there is a population structure in the sample dataset and (2) the effects of population structure are corrected either by modeling it or by running a separate analysis within each sub-population. The objective of this study was to evaluate the impact of population structure on the accuracy and power of genome-wide association studies using a Bayesian multiple regression method. Methods We conducted a genome-wide association study in a stochastically simulated admixed population. The genome was composed of six chromosomes, each with 1000 markers. Fifteen segregating quantitative trait loci contributed to the genetic variation of a quantitative trait with heritability of 0.30. The impact of genetic relationships and breed composition (BC) on three analysis methods was evaluated: single marker simple regression (SMR), single marker mixed linear model (MLM) and Bayesian multiple-regression analysis (BMR). Each method was fitted with and without BC. Accuracy, power, false-positive rate and the positive predictive value of each method were calculated and used for comparison. Results SMR and BMR, both without BC, were ranked as the worst and the best performing approaches, respectively. Our results showed that, while explicit modeling of genetic relationships and BC is essential for models SMR and MLM, BMR can disregard them and yet result in a higher power without compromising its false-positive rate. Conclusions This study showed that the Bayesian multiple-regression analysis is robust to population structure and to relationships among study subjects and performs better than a single marker mixed linear model approach.
Affiliation(s)
- Ali Toosi
- Cobb-Vantress Inc., 4703 US HWY 412 E, Siloam Springs, AR, 72761, USA.
| | - Rohan L Fernando
- Department of Animal Science, Iowa State University, Ames, IA, 50010, USA
| | - Jack C M Dekkers
- Department of Animal Science, Iowa State University, Ames, IA, 50010, USA
|
26
|
Uzzaman MR, Park JE, Lee KT, Cho ES, Choi BH, Kim TH. A genome-wide association study of reproductive traits in a Yorkshire pig population. Livest Sci 2018. [DOI: 10.1016/j.livsci.2018.01.005] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 11/16/2022]
|
27
|
Posma JM, Garcia-Perez I, Ebbels TMD, Lindon JC, Stamler J, Elliott P, Holmes E, Nicholson JK. Optimized Phenotypic Biomarker Discovery and Confounder Elimination via Covariate-Adjusted Projection to Latent Structures from Metabolic Spectroscopy Data. J Proteome Res 2018; 17:1586-1595. [PMID: 29457906 PMCID: PMC5891819 DOI: 10.1021/acs.jproteome.7b00879] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Indexed: 02/06/2023]
Abstract
Metabolism is altered by genetics, diet, disease status, environment, and many other factors. Modeling any one of these is often done without considering the effects of the other covariates. Attributing differences in metabolic profile to one of these factors needs to be done while controlling for the metabolic influence of the rest. We describe here a data analysis framework and novel confounder-adjustment algorithm for multivariate analysis of metabolic profiling data. Using simulated data, we show that similar numbers of true associations and significantly fewer false positives are found compared to other commonly used methods. Covariate-adjusted projections to latent structures (CA-PLS) are exemplified here using a large-scale metabolic phenotyping study of two Chinese populations at different risks for cardiovascular disease. Using CA-PLS, we find that some previously reported differences are actually associated with external factors and discover a number of previously unreported biomarkers linked to different metabolic pathways. CA-PLS can be applied to any multivariate data where confounding may be an issue and the confounder-adjustment procedure is translatable to other multivariate regression techniques.
Affiliation(s)
| | - Isabel Garcia-Perez
- Investigative Medicine, Department of Medicine, Faculty of Medicine, Imperial College London, W12 0NN London, United Kingdom
| | | | | | - Jeremiah Stamler
- Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, Illinois 60611, United States
|
28
|
Buettner F, Pratanwanich N, McCarthy DJ, Marioni JC, Stegle O. f-scLVM: scalable and versatile factor analysis for single-cell RNA-seq. Genome Biol 2017; 18:212. [PMID: 29115968 PMCID: PMC5674756 DOI: 10.1186/s13059-017-1334-8] [Citation(s) in RCA: 70] [Impact Index Per Article: 10.0] [Received: 08/25/2017] [Accepted: 09/29/2017] [Indexed: 11/10/2022] Open
Abstract
Single-cell RNA-sequencing (scRNA-seq) allows studying heterogeneity in gene expression in large cell populations. Such heterogeneity can arise due to technical or biological factors, making decomposing sources of variation difficult. We here describe f-scLVM (factorial single-cell latent variable model), a method based on factor analysis that uses pathway annotations to guide the inference of interpretable factors underpinning the heterogeneity. Our model jointly estimates the relevance of individual factors, refines gene set annotations, and infers factors without annotation. In applications to multiple scRNA-seq datasets, we find that f-scLVM robustly decomposes scRNA-seq datasets into interpretable components, thereby facilitating the identification of novel subpopulations.
Affiliation(s)
- Florian Buettner
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
- Current address: Helmholtz Zentrum München-German Research Center for Environmental Health, Institute of Computational Biology, Neuherberg, Germany.
| | - Naruemon Pratanwanich
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - Davis J McCarthy
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- St Vincent's Institute of Medical Research, 41 Victoria Parade, Fitzroy, Victoria, 3065, Australia
| | - John C Marioni
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
- Cancer Research UK Cambridge Institute, Cambridge, UK.
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK.
| | - Oliver Stegle
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
- European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Meyerhofstr. 1, 69117, Heidelberg, Germany.
| |
Collapse
|
29
|
Uzzaman MR, Park JE, Lee KT, Cho ES, Choi BH, Kim TH. Whole-genome association and genome partitioning revealed variants and explained heritability for total number of teats in a Yorkshire pig population. Asian-Australasian Journal of Animal Sciences 2017; 31:473-479. [PMID: 29059723 PMCID: PMC5838318 DOI: 10.5713/ajas.17.0178] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 03/07/2017] [Revised: 07/03/2017] [Accepted: 10/09/2017] [Indexed: 12/03/2022]
Abstract
Objective: The study was designed to perform a genome-wide association (GWA) analysis and genome partitioning using Illumina’s PorcineSNP60 Beadchip in order to identify variants and determine the explained heritability for the total number of teats in Yorkshire pigs. Methods: After screening with the following criteria: minor allele frequency, MAF≤0.01; Hardy-Weinberg equilibrium, HWE≤0.000001, a pair-wise genomic relationship matrix was produced using 42,953 single nucleotide polymorphisms (SNPs). A genome-wide mixed linear model-based association analysis (MLMA) was conducted. To estimate the heritability explained by genome- or chromosome-wide SNPs, genetic relatedness estimation through a maximum likelihood approach was used. Results: The MLMA analysis and false discovery rate p-values identified three significant SNPs on two different chromosomes (rs81476910 and rs81405825 on SSC8; rs81332615 on SSC13) for total number of teats. In addition, we estimated that 30% of the variance in the trait could be explained by all of the common SNPs on the autosomal chromosomes. The largest heritability estimates obtained by partitioning the genome were 0.22±0.05, 0.16±0.05, 0.10±0.03 and 0.08±0.03 on SSC7, SSC13, SSC1, and SSC8, respectively. Of these, SSC7 explained the largest amount of estimated heritability and harbored a SNP (rs80805264) identified by genome-wide association studies at the empirical p value significance level of 2.35E-05 in our study. Interestingly, rs80805264 was found near a quantitative trait locus (QTL) on SSC7 for the teat number trait identified in a recent study. Moreover, all other significant SNPs were found within and/or close to QTLs related to ovary weight, total number born alive and age at puberty in pigs. Conclusion: The SNPs we identified unquestionably represent some of the important QTL regions as well as genes of interest in the genome for various physiological functions responsible for reproduction in pigs.
Affiliation(s)
- Md Rasel Uzzaman
  - Animal Genomics & Bioinformatics Division, National Institute of Animal Science, RDA, Wanju 55365, Korea
- Jong-Eun Park
  - Animal Genomics & Bioinformatics Division, National Institute of Animal Science, RDA, Wanju 55365, Korea
- Kyung-Tai Lee
  - Animal Genomics & Bioinformatics Division, National Institute of Animal Science, RDA, Wanju 55365, Korea
- Eun-Seok Cho
  - Animal Genomics & Bioinformatics Division, National Institute of Animal Science, RDA, Wanju 55365, Korea
- Bong-Hwan Choi
  - Animal Genomics & Bioinformatics Division, National Institute of Animal Science, RDA, Wanju 55365, Korea
- Tae-Hun Kim
  - Animal Genomics & Bioinformatics Division, National Institute of Animal Science, RDA, Wanju 55365, Korea
|
30
|
Controlling for Confounding Effects in Single Cell RNA Sequencing Studies Using both Control and Target Genes. Sci Rep 2017; 7:13587. [PMID: 29051597 PMCID: PMC5648789 DOI: 10.1038/s41598-017-13665-w] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Received: 05/18/2017] [Accepted: 09/29/2017] [Indexed: 11/24/2022]
Abstract
The single cell RNA sequencing (scRNAseq) technique is becoming increasingly popular for unbiased, high-resolution transcriptome analysis of heterogeneous cell populations. Despite its many advantages, scRNAseq, like any other genomic sequencing technique, is susceptible to the influence of confounding effects. Controlling for confounding effects in scRNAseq data is a crucial step for accurate downstream analysis. Here, we present a novel statistical method, which we refer to as scPLS (single cell partial least squares), for robust and accurate inference of confounding effects. scPLS takes advantage of the fact that genes in a scRNAseq study can often be naturally classified into two sets: a control set of genes that are free of effects of the predictor variables and a target set of genes that are of primary interest. By modeling the two sets of genes jointly using partial least squares regression, scPLS is capable of making full use of the data to improve the inference of confounding effects. With extensive simulations and comparisons with other methods, we demonstrate the effectiveness of scPLS. Finally, we apply scPLS to analyze two scRNAseq data sets to illustrate its benefits in removing technical confounding effects as well as cell cycle effects.
|
31
|
Yuan L, Zhu L, Guo WL, Zhou X, Zhang Y, Huang Z, Huang DS. Nonconvex Penalty Based Low-Rank Representation and Sparse Regression for eQTL Mapping. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2017; 14:1154-1164. [PMID: 28114074 DOI: 10.1109/tcbb.2016.2609420] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Indexed: 06/06/2023]
Abstract
This paper addresses the problem of accounting for confounding factors and expression quantitative trait loci (eQTL) mapping in the study of SNP-gene associations. The existing convex penalty based algorithm has limited capacity to preserve the main information of a matrix while reducing its rank. We present an algorithm that uses a nonconvex penalty based low-rank representation to account for confounding factors and makes use of sparse regression for eQTL mapping (NCLRS). The efficiency of the presented algorithm is evaluated by comparing the results of 18 synthetic datasets given by NCLRS and the presented algorithm, respectively. The experimental results on a biological dataset show that our approach is a more effective tool to account for non-genetic effects than currently existing methods.
|
32
|
Sun S, Hood M, Scott L, Peng Q, Mukherjee S, Tung J, Zhou X. Differential expression analysis for RNAseq using Poisson mixed models. Nucleic Acids Res 2017; 45:e106. [PMID: 28369632 PMCID: PMC5499851 DOI: 10.1093/nar/gkx204] [Citation(s) in RCA: 52] [Impact Index Per Article: 7.4] [Received: 07/26/2016] [Revised: 03/02/2017] [Accepted: 03/17/2017] [Indexed: 12/13/2022]
Abstract
Identifying differentially expressed (DE) genes from RNA sequencing (RNAseq) studies is among the most common analyses in genomics. However, RNAseq DE analysis presents several statistical and computational challenges, including over-dispersed read counts and, in some settings, sample non-independence. Previous count-based methods rely on simple hierarchical Poisson models (e.g. negative binomial) to model independent over-dispersion, but do not account for sample non-independence due to relatedness, population structure and/or hidden confounders. Here, we present a Poisson mixed model with two random effects terms that account for both independent over-dispersion and sample non-independence. We also develop a scalable sampling-based inference algorithm using a latent variable representation of the Poisson distribution. With simulations, we show that our method properly controls for type I error and is generally more powerful than other widely used approaches, except in small samples (n < 15) with other unfavorable properties (e.g. small effect sizes). We also apply our method to three real datasets that contain related individuals, population stratification or hidden confounders. Our results show that our method increases power in all three datasets compared to other approaches, though the power gain is smallest in the smallest sample (n = 6). Our method is implemented in MACAU, freely available at www.xzlab.org/software.html.
Affiliation(s)
- Shiquan Sun
  - Systems Engineering Institute, Xi’an Jiaotong University, Xi’an, Shaanxi 710049, P.R. China
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
- Michelle Hood
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
- Laura Scott
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
  - Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109, USA
- Qinke Peng
  - Systems Engineering Institute, Xi’an Jiaotong University, Xi’an, Shaanxi 710049, P.R. China
- Sayan Mukherjee
  - Departments of Statistical Science, Mathematics, and Computer Science, Duke University, Durham, NC 27708, USA
- Jenny Tung
  - Departments of Evolutionary Anthropology and Biology, Duke University, Durham, NC 27708, USA
  - Duke University Population Research Institute, Duke University, Durham, NC 27708, USA
- Xiang Zhou
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
  - Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109, USA
|
33
|
Ju JH, Shenoy SA, Crystal RG, Mezey JG. An independent component analysis confounding factor correction framework for identifying broad impact expression quantitative trait loci. PLoS Comput Biol 2017; 13:e1005537. [PMID: 28505156 PMCID: PMC5448815 DOI: 10.1371/journal.pcbi.1005537] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 11/22/2016] [Revised: 05/30/2017] [Accepted: 04/28/2017] [Indexed: 11/19/2022]
Abstract
Genome-wide expression Quantitative Trait Loci (eQTL) studies in humans have provided numerous insights into the genetics of both gene expression and complex diseases. While the majority of eQTL identified in genome-wide analyses impact a single gene, eQTL that impact many genes are particularly valuable for network modeling and disease analysis. To enable the identification of such broad impact eQTL, we introduce CONFETI: Confounding Factor Estimation Through Independent component analysis. CONFETI is designed to address two conflicting issues when searching for broad impact eQTL: the need to account for non-genetic confounding factors that can lower the power of the analysis or produce broad impact eQTL false positives, and the tendency of methods that account for confounding factors to model broad impact eQTL as non-genetic variation. The key advance of the CONFETI framework is the use of Independent Component Analysis (ICA) to identify variation likely caused by broad impact eQTL when constructing the sample covariance matrix used for the random effect in a mixed model. We show that CONFETI has better performance than other mixed model confounding factor methods when considering broad impact eQTL recovery from synthetic data. We also used the CONFETI framework and these same confounding factor methods to identify eQTL that replicate between matched twin pair datasets in the Multiple Tissue Human Expression Resource (MuTHER), the Depression Genes Networks study (DGN), the Netherlands Study of Depression and Anxiety (NESDA), and multiple tissue types in the Genotype-Tissue Expression (GTEx) consortium. These analyses identified both cis-eQTL and trans-eQTL impacting individual genes, and CONFETI had better or comparable performance to other mixed model confounding factor analysis methods when identifying such eQTL. In these analyses, we were able to identify and replicate a few broad impact eQTL although the overall number was small even when applying CONFETI. 
In light of these results, we discuss the broad impact eQTL that have been previously reported from the analysis of human data and suggest that considerable caution should be exercised when making biological inferences based on these reported eQTL.
Affiliation(s)
- Jin Hyun Ju
  - Department of Genetic Medicine, Weill Cornell Medical College, New York, NY, United States of America
  - Institute for Computational Biomedicine, Weill Cornell Medical College, New York, NY, United States of America
- Sushila A. Shenoy
  - Department of Genetic Medicine, Weill Cornell Medical College, New York, NY, United States of America
- Ronald G. Crystal
  - Department of Genetic Medicine, Weill Cornell Medical College, New York, NY, United States of America
- Jason G. Mezey
  - Department of Genetic Medicine, Weill Cornell Medical College, New York, NY, United States of America
  - Institute for Computational Biomedicine, Weill Cornell Medical College, New York, NY, United States of America
  - Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, NY, United States of America
|
34
|
Lee S, Sun W, Wright FA, Zou F. An improved and explicit surrogate variable analysis procedure by coefficient adjustment. Biometrika 2017; 104:303-316. [PMID: 29430031 PMCID: PMC5627626 DOI: 10.1093/biomet/asx018] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Received: 08/05/2015] [Indexed: 01/31/2023]
Abstract
Unobserved environmental, demographic and technical factors can adversely affect the estimation and testing of the effects of primary variables. Surrogate variable analysis, proposed to tackle this problem, has been widely used in genomic studies. To estimate hidden factors that are correlated with the primary variables, surrogate variable analysis performs principal component analysis either on a subset of features or on all features, but weighting each differently. However, existing approaches may fail to identify hidden factors that are strongly correlated with the primary variables, and the extra step of feature selection and weight calculation makes the theoretical investigation of surrogate variable analysis challenging. In this paper, we propose an improved surrogate variable analysis, using all measured features, that has a natural connection with restricted least squares, which allows us to study its theoretical properties. Simulation studies and real-data analysis show that the method is competitive with state-of-the-art methods.
Affiliation(s)
- Seunggeun Lee
  - Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, Michigan 48109
- Wei Sun
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave. N., Seattle, Washington 98109
- Fred A Wright
  - Bioinformatics Research Center, North Carolina State University, 1 Lampe Drive, Raleigh, North Carolina 27607
- Fei Zou
  - Department of Biostatistics, University of Florida, 2004 Mowry Rd, Gainesville, Florida 32611
|
35
|
Casale FP, Horta D, Rakitsch B, Stegle O. Joint genetic analysis using variant sets reveals polygenic gene-context interactions. PLoS Genet 2017; 13:e1006693. [PMID: 28426829 PMCID: PMC5398484 DOI: 10.1371/journal.pgen.1006693] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 06/13/2016] [Accepted: 03/15/2017] [Indexed: 01/28/2023]
Abstract
Joint genetic models for multiple traits have helped to enhance association analyses. Most existing multi-trait models have been designed to increase power for detecting associations, whereas the analysis of interactions has received considerably less attention. Here, we propose iSet, a method based on linear mixed models to test for interactions between sets of variants and environmental states or other contexts. Our model generalizes previous interaction tests and in particular provides a test for local differences in the genetic architecture between contexts. We first use simulations to validate iSet before applying the model to the analysis of genotype-environment interactions in an eQTL study. Our model retrieves a larger number of interactions than alternative methods and reveals that up to 20% of cases show context-specific configurations of causal variants. Finally, we apply iSet to test for sub-group specific genetic effects in human lipid levels in a large human cohort, where we identify a gene-sex interaction for C-reactive protein that is missed by alternative methods. Genetic effects on phenotypes can depend on external contexts, including environment. Statistical tests for identifying such interactions are important to understand how individual genetic variants may act in different contexts. Interaction effects can either be studied using measurements of a given phenotype in different contexts, under the same genetic backgrounds, or by stratifying a population into subgroups. Here, we derive a method based on linear mixed models that can be applied to both of these designs. iSet enables testing for interactions between context and sets of variants, and accounts for polygenic effects. We validate our model using simulations, before applying it to the genetic analysis of gene expression studies and genome-wide association studies of human blood lipid levels. 
We find that modeling interactions with variant sets offers increased power, thereby uncovering interactions that cannot be detected by alternative methods.
Affiliation(s)
- Francesco Paolo Casale
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, CB10 1SD Hinxton, Cambridge, United Kingdom
- Danilo Horta
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, CB10 1SD Hinxton, Cambridge, United Kingdom
- Barbara Rakitsch
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, CB10 1SD Hinxton, Cambridge, United Kingdom
- Oliver Stegle
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, CB10 1SD Hinxton, Cambridge, United Kingdom
|
36
|
Mähler N, Wang J, Terebieniec BK, Ingvarsson PK, Street NR, Hvidsten TR. Gene co-expression network connectivity is an important determinant of selective constraint. PLoS Genet 2017; 13:e1006402. [PMID: 28406900 PMCID: PMC5407845 DOI: 10.1371/journal.pgen.1006402] [Citation(s) in RCA: 69] [Impact Index Per Article: 9.9] [Received: 10/05/2016] [Revised: 04/27/2017] [Accepted: 03/31/2017] [Indexed: 12/12/2022]
Abstract
While several studies have investigated general properties of the genetic architecture of natural variation in gene expression, few of these have considered natural, outbreeding populations. In parallel, systems biology has established that a general feature of biological networks is that they are scale-free, rendering them buffered against random mutations. To date, few studies have attempted to examine the relationship between the selective processes acting to maintain natural variation of gene expression and the associated co-expression network structure. Here we utilised RNA-Sequencing to assay gene expression in winter buds undergoing bud flush in a natural population of Populus tremula, an outbreeding forest tree species. We performed expression Quantitative Trait Locus (eQTL) mapping and identified 164,290 significant eQTLs associating 6,241 unique genes (eGenes) with 147,419 unique SNPs (eSNPs). We found approximately four times as many local as distant eQTLs, with local eQTLs having significantly higher effect sizes. eQTLs were primarily located in regulatory regions of genes (UTRs or flanking regions), regardless of whether they were local or distant. We used the gene expression data to infer a co-expression network and investigated the relationship between network topology, the genetic architecture of gene expression and signatures of selection. Within the co-expression network, eGenes were underrepresented in network module cores (hubs) and overrepresented in the periphery of the network, with a negative correlation between eQTL effect size and network connectivity. We additionally found that module core genes have experienced stronger selective constraint on coding and non-coding sequence, with connectivity associated with signatures of selection. 
Our integrated genetics and genomics results suggest that purifying selection is the primary mechanism underlying the genetic architecture of natural variation in gene expression assayed in flushing leaf buds of P. tremula and that connectivity within the co-expression network is linked to the strength of purifying selection.
Affiliation(s)
- Niklas Mähler
  - Department of Chemistry, Biotechnology and Food Science, Norwegian University of Life Sciences, Ås, Norway
  - Umeå Plant Science Centre, Department of Plant Physiology, Umeå University, Umeå, Sweden
- Jing Wang
  - Umeå Plant Science Centre, Department of Ecology and Environmental Science, Umeå University, Umeå, Sweden
  - Centre for Integrative Genetics, Faculty of Biosciences, Norwegian University of Life Sciences, Ås, Norway
- Barbara K. Terebieniec
  - Umeå Plant Science Centre, Department of Plant Physiology, Umeå University, Umeå, Sweden
- Pär K. Ingvarsson
  - Umeå Plant Science Centre, Department of Ecology and Environmental Science, Umeå University, Umeå, Sweden
  - Department of Plant Biology, Swedish University of Agricultural Sciences, Uppsala, Sweden
- Nathaniel R. Street
  - Umeå Plant Science Centre, Department of Plant Physiology, Umeå University, Umeå, Sweden
- Torgeir R. Hvidsten
  - Department of Chemistry, Biotechnology and Food Science, Norwegian University of Life Sciences, Ås, Norway
  - Umeå Plant Science Centre, Department of Plant Physiology, Umeå University, Umeå, Sweden
|
37
|
Variation-preserving normalization unveils blind spots in gene expression profiling. Sci Rep 2017; 7:42460. [PMID: 28276435 PMCID: PMC5343588 DOI: 10.1038/srep42460] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Received: 05/11/2016] [Accepted: 01/11/2017] [Indexed: 11/17/2022]
Abstract
RNA-Seq and gene expression microarrays provide comprehensive profiles of gene activity, but lack of reproducibility has hindered their application. A key challenge in the data analysis is the normalization of gene expression levels, which is currently performed following the implicit assumption that most genes are not differentially expressed. Here, we present a mathematical approach to normalization that makes no assumption of this sort. We have found that variation in gene expression is much larger than currently believed, and that it can be measured with available assays. Our results also explain, at least partially, the reproducibility problems encountered in transcriptomics studies. We expect that this improvement in detection will help efforts to realize the full potential of gene expression profiling, especially in analyses of cellular processes involving complex modulations of gene expression.
|
38
|
Rohart F, Eslami A, Matigian N, Bougeard S, Lê Cao KA. MINT: a multivariate integrative method to identify reproducible molecular signatures across independent experiments and platforms. BMC Bioinformatics 2017; 18:128. [PMID: 28241739 PMCID: PMC5327533 DOI: 10.1186/s12859-017-1553-8] [Citation(s) in RCA: 49] [Impact Index Per Article: 7.0] [Received: 09/23/2016] [Accepted: 02/16/2017] [Indexed: 12/12/2022]
Abstract
Background: Molecular signatures identified from high-throughput transcriptomic studies often have poor reliability and fail to reproduce across studies. One solution is to combine independent studies into a single integrative analysis, additionally increasing sample size. However, the different protocols and technological platforms across transcriptomic studies produce unwanted systematic variation that strongly confounds the integrative analysis results. When studies aim to discriminate an outcome of interest, the common approach is a sequential two-step procedure; unwanted systematic variation removal techniques are applied prior to classification methods. Results: To limit the risk of overfitting and over-optimistic results of a two-step procedure, we developed a novel multivariate integration method, MINT, that simultaneously accounts for unwanted systematic variation and identifies predictive gene signatures with greater reproducibility and accuracy. In two biological examples on the classification of three human cell types and four subtypes of breast cancer, we combined high-dimensional microarray and RNA-seq data sets and MINT identified highly reproducible and relevant gene signatures predictive of a given phenotype. MINT led to superior classification and prediction accuracy compared to the existing sequential two-step procedures. Conclusions: MINT is a powerful approach and the first of its kind to solve the integrative classification framework in a single step by combining multiple independent studies. MINT is computationally fast as part of the mixOmics R CRAN package, available at http://www.mixOmics.org/mixMINT/ and http://cran.r-project.org/web/packages/mixOmics/.
Affiliation(s)
- Florian Rohart
  - The University of Queensland Diamantina Institute, The University of Queensland, Translational Research Institute, Brisbane, 4102, QLD, Australia
- Aida Eslami
  - Centre for Heart Lung Innovation, University of British Columbia, Vancouver, BC V6Z 1Y6, Canada
- Nicholas Matigian
  - The University of Queensland Diamantina Institute, The University of Queensland, Translational Research Institute, Brisbane, 4102, QLD, Australia
- Stéphanie Bougeard
  - French agency for food, environmental and occupational health safety (Anses), Department of Epidemiology, Ploufragan, 22440, France
- Kim-Anh Lê Cao
  - The University of Queensland Diamantina Institute, The University of Queensland, Translational Research Institute, Brisbane, 4102, QLD, Australia.
|
39
|
Fusi N, Listgarten J. Flexible Modeling of Genetic Effects on Function-Valued Traits. J Comput Biol 2017; 24:524-535. [PMID: 28056190 DOI: 10.1089/cmb.2016.0174] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022]
Abstract
Genome-wide association studies commonly examine one trait at a time. Occasionally they examine several related traits with the hope of increasing power; in such a setting, the traits are not generally smoothly varying in any way such as time or space. However, for function-valued traits, the trait is often smoothly varying along the axis of interest, such as space or time. For instance, in the case of longitudinal traits such as growth curves, the axis of interest is time; for spatially varying traits such as chromatin accessibility, it would be position along the genome. Although there have been efforts to perform genome-wide association studies with such function-valued traits, the statistical approaches developed for this purpose often have limitations such as requiring the trait to behave linearly in time or space, or constraining the genetic effect itself to be constant or linear in time. Herein, we present a flexible model for this problem, the Partitioned Gaussian Process, which removes many such limitations and is especially effective as the number of time points increases. The theoretical basis of this model provides machinery for handling missing and unaligned function values such as would occur when not all individuals are measured at the same time points. Furthermore, we make use of algebraic refactorizations to substantially reduce the time complexity of our model beyond the naive implementation. Finally, we apply our approach and several others to synthetic data before closing with some directions for improved modeling and statistical testing.
Affiliation(s)
- Nicolo Fusi
  - Microsoft Research, Cambridge, Massachusetts
|
40
|
Hoffman GE, Schadt EE. variancePartition: interpreting drivers of variation in complex gene expression studies. BMC Bioinformatics 2016; 17:483. [PMID: 27884101 PMCID: PMC5123296 DOI: 10.1186/s12859-016-1323-z] [Citation(s) in RCA: 354] [Impact Index Per Article: 44.3] [Received: 06/24/2016] [Accepted: 11/05/2016] [Indexed: 12/14/2022]
Abstract
Background: As large-scale studies of gene expression with multiple sources of biological and technical variation become widely adopted, characterizing these drivers of variation becomes essential to understanding disease biology and regulatory genetics. Results: We describe a statistical and visualization framework, variancePartition, to prioritize drivers of variation based on a genome-wide summary, and identify genes that deviate from the genome-wide trend. Using a linear mixed model, variancePartition quantifies variation in each expression trait attributable to differences in disease status, sex, cell or tissue type, ancestry, genetic background, experimental stimulus, or technical variables. Analysis of four large-scale transcriptome profiling datasets illustrates that variancePartition recovers striking patterns of biological and technical variation that are reproducible across multiple datasets. Conclusions: Our open source software, variancePartition, enables rapid interpretation of complex gene expression studies as well as other high-throughput genomics assays. variancePartition is available from Bioconductor: http://bioconductor.org/packages/variancePartition.
Affiliation(s)
- Gabriel E Hoffman
  - Department of Genetics and Genomic Sciences, Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, USA.
- Eric E Schadt
  - Department of Genetics and Genomic Sciences, Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, USA
|
41
|
Gao C, McDowell IC, Zhao S, Brown CD, Engelhardt BE. Context Specific and Differential Gene Co-expression Networks via Bayesian Biclustering. PLoS Comput Biol 2016; 12:e1004791. [PMID: 27467526 PMCID: PMC4965098 DOI: 10.1371/journal.pcbi.1004791] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.9] [Received: 05/14/2015] [Accepted: 02/03/2016] [Indexed: 01/15/2023] Open
Abstract
Identifying latent structure in high-dimensional genomic data is essential for exploring biological processes. Here, we consider recovering gene co-expression networks from gene expression data, where each network encodes relationships between genes that are co-regulated by shared biological mechanisms. To do this, we develop a Bayesian statistical model for biclustering to infer subsets of co-regulated genes that covary in all of the samples or in only a subset of the samples. Our biclustering method, BicMix, allows overcomplete representations of the data, computational tractability, and joint modeling of unknown confounders and biological signals. BicMix recovers latent structure with higher precision than state-of-the-art biclustering methods across diverse simulation scenarios. Further, we develop a principled method to recover context specific gene co-expression networks from the estimated sparse biclustering matrices. We apply BicMix to breast cancer gene expression data and to gene expression data from a cardiovascular study cohort, and we recover gene co-expression networks that are differential across ER+ and ER- samples and across male and female samples. We apply BicMix to the Genotype-Tissue Expression (GTEx) pilot data, and we find tissue specific gene networks. We validate these findings by using our tissue specific networks to identify trans-eQTLs specific to one of four primary tissues.
Affiliation(s)
- Chuan Gao
  - Department of Statistical Science, Duke University, Durham, North Carolina, United States of America
- Ian C. McDowell
  - Program in Computational Biology and Bioinformatics, Duke University, Durham, North Carolina, United States of America
- Shiwen Zhao
  - Program in Computational Biology and Bioinformatics, Duke University, Durham, North Carolina, United States of America
- Christopher D. Brown
  - Department of Genetics, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
- Barbara E. Engelhardt
  - Department of Computer Science, Center for Statistics and Machine Learning, Princeton University, Princeton, New Jersey, United States of America
|
42
|
Chaturvedi N, de Menezes RX, Goeman JJ. A global × global test for testing associations between two large sets of variables. Biom J 2016; 59:145-158. [PMID: 27225065 DOI: 10.1002/bimj.201500106] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 06/05/2015] [Revised: 01/06/2016] [Accepted: 03/07/2016] [Indexed: 12/30/2022]
Abstract
In high-dimensional omics studies where multiple molecular profiles are obtained for each set of patients, there is often interest in identifying complex multivariate associations, for example, copy number regulated expression levels in a certain pathway or in a genomic region. To detect such associations, we present a novel approach to test for association between two sets of variables. Our approach generalizes the global test, which tests for association between a group of covariates and a single univariate response, to allow a high-dimensional multivariate response. We apply the method to several simulated datasets as well as two publicly available datasets, where we compare the performance of the multivariate global test (G2) with that of the univariate global test. The method is implemented in R and will be available as part of the globaltest package.
Affiliation(s)
- Nimisha Chaturvedi
  - Epidemiology and Biostatistics, VU University Medical Center, Amsterdam, The Netherlands; Netherlands Bioinformatics Center, Nijmegen, The Netherlands
- Renée X de Menezes
  - Epidemiology and Biostatistics, VU University Medical Center, Amsterdam, The Netherlands; Netherlands Bioinformatics Center, Nijmegen, The Netherlands
- Jelle J Goeman
  - Biostatistics, Department for Health Evidence, Radboud University Medical Center, Nijmegen, The Netherlands; Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, Leiden, The Netherlands
|
43
|
Cheng W, Shi Y, Zhang X, Wang W. Sparse regression models for unraveling group and individual associations in eQTL mapping. BMC Bioinformatics 2016; 17:136. [PMID: 27000043 PMCID: PMC4802846 DOI: 10.1186/s12859-016-0986-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 08/20/2015] [Accepted: 03/10/2016] [Indexed: 11/18/2022] Open
Abstract
Background: As a promising tool for dissecting the genetic basis of common diseases, expression quantitative trait loci (eQTL) studies have attracted increasing research interest. Traditional eQTL methods focus on testing the associations between individual single-nucleotide polymorphisms (SNPs) and gene expression traits. A major drawback of this approach is that it cannot model the joint effect of a set of SNPs on a set of genes, which may correspond to biological pathways. Results: To alleviate this limitation, we propose geQTL, a sparse regression method that can detect both group-wise and individual associations between SNPs and expression traits. geQTL can also correct for the effects of potential confounders. Our method employs a computationally efficient technique and can therefore handle large-scale studies. Moreover, our method can automatically infer the proper number of group-wise associations. We perform extensive experiments on both simulated datasets and yeast datasets to demonstrate the effectiveness and efficiency of the proposed method. The results show that geQTL can effectively detect both individual and group-wise signals and outperforms state-of-the-art methods by a large margin. Conclusions: This paper illustrates that decoupling individual and group-wise associations in association mapping improves both eQTL mapping accuracy and the inference of individual and group-wise associations.
Affiliation(s)
- Wei Cheng
  - Department of Computer Science, UNC at Chapel Hill, 201 S Columbia St., Chapel Hill, NC 27599, USA
- Yu Shi
  - Computer Science at the University of Illinois at Urbana-Champaign, 201 North Goodwin Avenue, Urbana, IL 61801, USA
- Xiang Zhang
  - Department of Elect. Eng. and Computer Science, Case Western Reserve University, 10900 Euclid Avenue, Cleveland, OH 44106, USA
- Wei Wang
  - Department of Computer Science, University of California, Los Angeles, 3531-G Boelter Hall, Los Angeles, CA 90095, USA
|
44
|
Rakitsch B, Stegle O. Modelling local gene networks increases power to detect trans-acting genetic effects on gene expression. Genome Biol 2016; 17:33. [PMID: 26911988 PMCID: PMC4765046 DOI: 10.1186/s13059-016-0895-2] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Received: 09/19/2015] [Accepted: 02/09/2016] [Indexed: 01/05/2023] Open
Abstract
Expression quantitative trait loci (eQTL) mapping is a widely used tool to study the genetics of gene expression. Confounding factors and the burden of multiple testing limit the ability to map distal trans eQTLs, which is important to understand downstream genetic effects on genes and pathways. We propose a two-stage linear mixed model that first learns local directed gene-regulatory networks to then condition on the expression levels of selected genes. We show that this covariate selection approach controls for confounding factors and regulatory context, thereby increasing eQTL detection power and improving the consistency between studies. GNet-LMM is available at: https://github.com/PMBio/GNetLMM.
Affiliation(s)
- Barbara Rakitsch
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
- Oliver Stegle
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
|
45
|
The Dissection of Expression Quantitative Trait Locus Hotspots. Genetics 2016; 202:1563-74. [PMID: 26837753 DOI: 10.1534/genetics.115.183624] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Received: 10/09/2015] [Accepted: 01/27/2016] [Indexed: 02/03/2023] Open
Abstract
Studies of the genetic loci that contribute to variation in gene expression frequently identify loci with broad effects on gene expression: expression quantitative trait locus hotspots. We describe a set of exploratory graphical methods as well as a formal likelihood-based test for assessing whether a given hotspot is due to one or multiple polymorphisms. We first look at the pattern of effects of the locus on the expression traits that map to the locus: the direction of the effects and the degree of dominance. A second technique is to focus on the individuals that exhibit no recombination event in the region, apply dimensionality reduction (e.g., with linear discriminant analysis), and compare the phenotype distribution in the nonrecombinant individuals to that in the recombinant individuals: if the recombinant individuals display a different expression pattern than the nonrecombinant individuals, this indicates the presence of multiple causal polymorphisms. In the formal likelihood-based test, we compare a two-locus model, with each expression trait affected by one or the other locus, to a single-locus model. We apply our methods to a large mouse intercross with gene expression microarray data on six tissues.
|
46
|
François O, Martins H, Caye K, Schoville SD. Controlling false discoveries in genome scans for selection. Mol Ecol 2016; 25:454-69. [PMID: 26671840 DOI: 10.1111/mec.13513] [Citation(s) in RCA: 138] [Impact Index Per Article: 17.3] [Received: 07/01/2015] [Revised: 11/23/2015] [Accepted: 11/25/2015] [Indexed: 02/06/2023]
Abstract
Population differentiation (PD) and ecological association (EA) tests have recently emerged as prominent statistical methods to investigate signatures of local adaptation using population genomic data. Based on statistical models, these genomewide testing procedures have attracted considerable attention as tools to identify loci potentially targeted by natural selection. An important issue with PD and EA tests is that incorrect model specification can generate large numbers of false-positive associations. Spurious associations may indeed arise when shared demographic history, patterns of isolation by distance, cryptic relatedness or genetic background are ignored. Recent work on PD and EA tests has largely focused on improving test corrections for those confounding effects. Despite significant algorithmic improvements, there are still a number of open questions on how to check that false discoveries are under control and implement test corrections, or how to combine statistical tests from multiple genome scan methods. This tutorial study provides a detailed answer to these questions. It clarifies the relationships between traditional methods based on allele frequency differentiation and EA methods and provides a unified framework for their underlying statistical tests. We demonstrate how techniques developed in the area of genomewide association studies, such as inflation factors and linear mixed models, benefit genome scan methods and provide guidelines for good practice while conducting statistical tests in landscape and population genomic applications. Finally, we highlight how the combination of several well-calibrated statistical tests can increase the power to reject neutrality, improving our ability to infer patterns of local adaptation in large population genomic data sets.
Affiliation(s)
- Olivier François
  - Centre National de la Recherche Scientifique, Université Grenoble-Alpes, TIMC-IMAG UMR 5525, Grenoble, 38042, France
- Helena Martins
  - Centre National de la Recherche Scientifique, Université Grenoble-Alpes, TIMC-IMAG UMR 5525, Grenoble, 38042, France
- Kevin Caye
  - Centre National de la Recherche Scientifique, Université Grenoble-Alpes, TIMC-IMAG UMR 5525, Grenoble, 38042, France
- Sean D Schoville
  - Department of Entomology, 637 Russell Laboratories, University of Wisconsin-Madison, 1630 Linden Drive, Madison, WI, 53706, USA
|
47
|
Zhang P, Zhong K, Shahid MQ, Tong H. Association Analysis in Rice: From Application to Utilization. Frontiers in Plant Science 2016; 7:1202. [PMID: 27582745 PMCID: PMC4987372 DOI: 10.3389/fpls.2016.01202] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 01/14/2016] [Accepted: 07/28/2016] [Indexed: 05/03/2023]
Abstract
Association analysis based on linkage disequilibrium (LD) is an efficient way to dissect complex traits and to identify gene functions in rice. Although association analysis is an effective way to construct fine maps for quantitative traits, a few issues need to be addressed. In this review, we first summarize the type, structure, and LD level of populations used for association analysis in rice, and then discuss the genotyping methods and statistical approaches used. We also review the current shortcomings and benefits of association analysis, as well as the types of future research needed to overcome these shortcomings. Finally, we analyze the reasons why association analysis results are underutilized in rice breeding.
Affiliation(s)
- Peng Zhang
  - State Key Laboratory of Rice Biology, China National Rice Research Institute, Hangzhou, China
  - *Correspondence: Peng Zhang
- Kaizhen Zhong
  - State Key Laboratory of Rice Biology, China National Rice Research Institute, Hangzhou, China
- Muhammad Qasim Shahid
  - State Key Laboratory for Conservation and Utilization of Subtropical Agro-Bioresources, South China Agricultural University, Guangzhou, China
- Hanhua Tong
  - State Key Laboratory of Rice Biology, China National Rice Research Institute, Hangzhou, China
|
48
|
Identification of the Bile Acid Transporter Slco1a6 as a Candidate Gene That Broadly Affects Gene Expression in Mouse Pancreatic Islets. Genetics 2015; 201:1253-62. [PMID: 26385979 DOI: 10.1534/genetics.115.179432] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 06/15/2015] [Accepted: 09/16/2015] [Indexed: 12/15/2022] Open
Abstract
We surveyed gene expression in six tissues in an F2 intercross between mouse strains C57BL/6J (abbreviated B6) and BTBR T(+) tf/J (abbreviated BTBR) made genetically obese with the Leptin(ob) mutation. We identified a number of expression quantitative trait loci (eQTL) affecting the expression of numerous genes distal to the locus, called trans-eQTL hotspots. Some of these trans-eQTL hotspots showed effects in multiple tissues, whereas some were specific to a single tissue. An unusually large number of transcripts (∼8% of genes) mapped in trans to a hotspot on chromosome 6, specifically in pancreatic islets. By considering the first two principal components of the expression of genes mapping to this region, we were able to convert the multivariate phenotype into a simple Mendelian trait. Fine mapping the locus by traditional methods reduced the QTL interval to a 298-kb region containing only three genes, including Slco1a6, one member of a large family of organic anion transporters. Direct genomic sequencing of all Slco1a6 exons identified a nonsynonymous coding SNP that converts a highly conserved proline residue at amino acid position 564 to serine. Molecular modeling suggests that Pro564 faces an aqueous pore within this 12-transmembrane domain-spanning protein. When transiently overexpressed in HEK293 cells, BTBR organic anion transporting polypeptide (OATP)1A6-mediated cellular uptake of the bile acid taurocholic acid (TCA) was enhanced compared to B6 OATP1A6. Our results suggest that genetic variation in Slco1a6 leads to altered transport of TCA (and potentially other bile acids) by pancreatic islets, resulting in broad gene regulation.
|
49
|
Jacob L, Gagnon-Bartsch JA, Speed TP. Correcting gene expression data when neither the unwanted variation nor the factor of interest are observed. Biostatistics 2015; 17:16-28. [PMID: 26286812 PMCID: PMC4679071 DOI: 10.1093/biostatistics/kxv026] [Citation(s) in RCA: 58] [Impact Index Per Article: 6.4] [Received: 11/26/2014] [Accepted: 06/25/2015] [Indexed: 11/13/2022] Open
Abstract
When dealing with large-scale gene expression studies, observations are commonly contaminated by sources of unwanted variation such as platforms or batches. Not taking this unwanted variation into account when analyzing the data can lead to spurious associations and to missing important signals. When the analysis is unsupervised, e.g. when the goal is to cluster the samples or to build a corrected version of the dataset (as opposed to the study of an observed factor of interest), taking unwanted variation into account can become a difficult task. The factors driving unwanted variation may be correlated with the unobserved factor of interest, so that correcting for the former can remove the latter if not done carefully. We show how negative control genes and replicate samples can be used to estimate unwanted variation in gene expression, and discuss how this information can be used to correct the expression data. The proposed methods are then evaluated on synthetic data and three gene expression datasets. They generally manage to remove unwanted variation without losing the signal of interest and compare favorably to state-of-the-art corrections. All proposed methods are implemented in the Bioconductor package RUVnormalize.
Affiliation(s)
- Laurent Jacob
  - Laboratoire de Biométrie et Biologie Évolutive, Université de Lyon, Université Lyon 1, CNRS, UMR 5558, Lyon, France
- Terence P Speed
  - Department of Statistics, University of California, Berkeley, CA 94720, USA and Division of Bioinformatics, Walter and Eliza Hall Institute of Medical Research, Melbourne 3052, Australia
|
50
|
Bohlin J, Andreassen BK, Joubert BR, Magnus MC, Wu MC, Parr CL, Håberg SE, Magnus P, Reese SE, Stoltenberg C, London SJ, Nystad W. Effect of maternal gestational weight gain on offspring DNA methylation: a follow-up to the ALSPAC cohort study. BMC Res Notes 2015. [PMID: 26219460 PMCID: PMC4518864 DOI: 10.1186/s13104-015-1286-6] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Indexed: 11/30/2022] Open
Abstract
Background: Several epidemiologic studies indicate that maternal gestational weight gain (GWG) influences health outcomes in offspring. Any underlying mechanisms have, however, not been established. A recent study of 88 children based on the Avon Longitudinal Study of Parents and Children (ALSPAC) cohort examined the methylation levels at 1,505 Cytosine-Guanine methylation (CpG) loci and found several to be significantly associated with maternal weight gain between weeks 0 and 18 of gestation. Since these results could not be replicated, we wanted to examine associations between 0 and 18 week GWG and genome-wide methylation levels using the Infinium HumanMethylation450 BeadChip (450K) platform on a larger sample size, i.e. 729 newborns sampled from the Norwegian Mother and Child Cohort Study (MoBa). Results: We found no CpG loci associated with 0–18 week GWG after adjusting for the set of covariates used in the ALSPAC study (i.e. child’s sex and maternal age) and for multiple testing (q > 0.9, both 1,505 and 473,731 tests). Hence, none of the CpG loci linked with the genes found significantly associated with 0–18 week GWG in the ALSPAC study were significant in our study. Conclusions: The inconsistency between our results and those of the ALSPAC study with regard to the 0–18 week GWG model may arise for several reasons: sampling from different populations, dissimilar methylome coverage, sample size and/or false positive findings.
Affiliation(s)
- Jon Bohlin
  - Division of Epidemiology, Norwegian Institute of Public Health, Marcus Thranes gate 6, P.O. Box 4404, 0403, Oslo, Norway
- Bettina K Andreassen
  - Division of Epidemiology, Norwegian Institute of Public Health, Marcus Thranes gate 6, P.O. Box 4404, 0403, Oslo, Norway; Department of Molecular Biology, Institute of Clinical Medicine, University of Oslo, Oslo, Norway
- Bonnie R Joubert
  - National Institute of Environmental Health Sciences, MD A3-05, PO Box 12233, Research Triangle Park, NC, 27709, USA
- Maria C Magnus
  - Division of Epidemiology, Norwegian Institute of Public Health, Marcus Thranes gate 6, P.O. Box 4404, 0403, Oslo, Norway
- Michael C Wu
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA
- Christine L Parr
  - Division of Epidemiology, Norwegian Institute of Public Health, Marcus Thranes gate 6, P.O. Box 4404, 0403, Oslo, Norway
- Siri E Håberg
  - Division of Epidemiology, Norwegian Institute of Public Health, Marcus Thranes gate 6, P.O. Box 4404, 0403, Oslo, Norway
- Per Magnus
  - Division of Epidemiology, Norwegian Institute of Public Health, Marcus Thranes gate 6, P.O. Box 4404, 0403, Oslo, Norway
- Sarah E Reese
  - National Institute of Environmental Health Sciences, MD A3-05, PO Box 12233, Research Triangle Park, NC, 27709, USA
- Camilla Stoltenberg
  - Division of Epidemiology, Norwegian Institute of Public Health, Marcus Thranes gate 6, P.O. Box 4404, 0403, Oslo, Norway
- Stephanie J London
  - National Institute of Environmental Health Sciences, MD A3-05, PO Box 12233, Research Triangle Park, NC, 27709, USA
- Wenche Nystad
  - Division of Epidemiology, Norwegian Institute of Public Health, Marcus Thranes gate 6, P.O. Box 4404, 0403, Oslo, Norway
|