1
|
Koch AA, Bagnall JS, Smyllie NJ, Begley N, Adamson AD, Fribourgh JL, Spiller DG, Meng QJ, Partch CL, Strimmer K, House TA, Hastings MH, Loudon ASI. Quantification of protein abundance and interaction defines a mechanism for operation of the circadian clock. eLife 2022; 11:73976. [PMID: 35285799 PMCID: PMC8983044 DOI: 10.7554/elife.73976] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 09/16/2021] [Accepted: 03/11/2022] [Indexed: 11/13/2022]
Abstract
The mammalian circadian clock exerts control of daily gene expression through cycles of DNA binding. Here, we develop a quantitative model of how a finite pool of BMAL1 protein can regulate thousands of target sites over daily time scales. We used quantitative imaging to track dynamic changes in endogenous labelled proteins across peripheral tissues and the SCN. We determine the contribution of multiple rhythmic processes coordinating BMAL1 DNA binding, including cycling molecular abundance, binding affinities, and repression. We find that nuclear BMAL1 concentration determines corresponding CLOCK through heterodimerisation and define a DNA residence time of this complex. Repression of CLOCK:BMAL1 is achieved through rhythmic changes to BMAL1:CRY1 association and high-affinity interactions between PER2:CRY1, which mediate CLOCK:BMAL1 displacement from DNA. Finally, stochastic modelling reveals a dual role for PER:CRY complexes in which increasing concentrations of PER2:CRY1 promote removal of BMAL1:CLOCK from genes, consequently enhancing its ability to move to new target sites.
Affiliation(s)
- Alex Ashton Koch
  - Faculty of Biology, Medicine and Health, University of Manchester, Manchester, United Kingdom
- James S Bagnall
  - Faculty of Biology, Medicine and Health, University of Manchester, Manchester, United Kingdom
- Nicola J Smyllie
  - Laboratory of Molecular Biology, Medical Research Council, Cambridge, United Kingdom
- Nicola Begley
  - Faculty of Biology, Medicine and Health, University of Manchester, Manchester, United Kingdom
- Antony D Adamson
  - Faculty of Biology, Medicine and Health, University of Manchester, Manchester, United Kingdom
- Jennifer L Fribourgh
  - Department of Chemistry and Biochemistry, University of California, Santa Cruz, Santa Cruz, United States
- David G Spiller
  - Faculty of Biology, Medicine and Health, University of Manchester, Manchester, United Kingdom
- Qing-Jun Meng
  - Faculty of Biology, Medicine and Health, University of Manchester, Manchester, United Kingdom
- Carrie L Partch
  - Department of Chemistry and Biochemistry, University of California, Santa Cruz, Santa Cruz, United States
- Korbinian Strimmer
  - Department of Mathematics, University of Manchester, Manchester, United Kingdom
- Thomas A House
  - Department of Mathematics, University of Manchester, Manchester, United Kingdom
- Michael H Hastings
  - Laboratory of Molecular Biology, Medical Research Council, Cambridge, United Kingdom
- Andrew S I Loudon
  - Faculty of Biology, Medicine and Health, University of Manchester, Manchester, United Kingdom
|
2
|
Jendoubi T, Strimmer K. A whitening approach to probabilistic canonical correlation analysis for omics data integration. BMC Bioinformatics 2019; 20:15. [PMID: 30626338 PMCID: PMC6327589 DOI: 10.1186/s12859-018-2572-9] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 07/13/2018] [Accepted: 12/10/2018] [Indexed: 11/10/2022]
Abstract
BACKGROUND Canonical correlation analysis (CCA) is a classic statistical tool for investigating complex multivariate data. Correspondingly, it has found many diverse applications, ranging from molecular biology and medicine to social science and finance. Intriguingly, despite the importance and pervasiveness of CCA, a probabilistic understanding of CCA has only recently begun to develop, moving from an algorithmic to a model-based perspective and enabling its application to large-scale settings. RESULTS Here, we revisit CCA from the perspective of statistical whitening of random variables and propose a simple yet flexible probabilistic model for CCA in the form of a two-layer latent variable generative model. The advantages of this variant of probabilistic CCA include non-ambiguity of the latent variables, provisions for negative canonical correlations, possibility of non-normal generative variables, as well as ease of interpretation on all levels of the model. In addition, we show that it lends itself to computationally efficient estimation in high-dimensional settings using regularized inference. We test our approach to CCA in simulations and apply it to two omics data sets illustrating the integration of gene expression data, lipid concentrations and methylation levels. CONCLUSIONS Our whitening approach to CCA provides a unifying perspective on CCA, linking together sphering procedures, multivariate regression and corresponding probabilistic generative models. Furthermore, we offer an efficient computer implementation in the "whitening" R package available at https://CRAN.R-project.org/package=whitening .
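To make the notion of statistical whitening concrete, the following minimal Python sketch whitens two variables using the Cholesky factor of their sample covariance, so that the transformed variables have unit variance and zero sample correlation. It illustrates whitening in general and is not the probabilistic CCA model or the "whitening" R package; the names `sample_cov` and `cholesky_whiten_2d` and the toy data are assumptions made for illustration, and the data must not be perfectly collinear.

```python
from math import sqrt


def sample_cov(a, b):
    """Sample covariance of two equal-length sequences (n - 1 denominator)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (n - 1)


def cholesky_whiten_2d(xs, ys):
    """Whiten two variables with the inverse Cholesky factor of their covariance.

    Illustrative helper (hypothetical name): with S = L L^T, it returns
    z = L^{-1} (x - mean), computed by forward substitution.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx, syy, sxy = sample_cov(xs, xs), sample_cov(ys, ys), sample_cov(xs, ys)
    # Entries of the lower-triangular Cholesky factor L of the 2x2 covariance.
    l11 = sqrt(sxx)
    l21 = sxy / l11
    l22 = sqrt(syy - l21 ** 2)
    z1 = [(x - mx) / l11 for x in xs]
    z2 = [((y - my) - l21 * z1k) / l22 for y, z1k in zip(ys, z1)]
    return z1, z2


z1, z2 = cholesky_whiten_2d([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 1.0, 4.0, 3.0, 6.0])
print(sample_cov(z1, z1), sample_cov(z2, z2), sample_cov(z1, z2))
```

The printed variances should be close to 1 and the covariance close to 0, up to floating-point error. Cholesky whitening is only one of several possible whitening transformations; it is used here because it has a short closed form in two dimensions.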
Affiliation(s)
- Takoua Jendoubi
  - Epidemiology and Biostatistics, School of Public Health, Imperial College London, Norfolk Place, London, W2 1PG, UK; Statistics Section, Department of Mathematics, Imperial College London, South Kensington Campus, London, SW7 2AZ, UK
- Korbinian Strimmer
  - School of Mathematics, University of Manchester, Alan Turing Building, Oxford Road, Manchester, M13 9PL, UK
|
3
|
Affiliation(s)
- Agnan Kessy
  - Statistics Section, Department of Mathematics, Imperial College London, South Kensington Campus, London, United Kingdom
- Alex Lewin
  - Department of Mathematics, Brunel University London, Kingston Lane, Uxbridge, United Kingdom
- Korbinian Strimmer
  - Epidemiology and Biostatistics, School of Public Health, Imperial College London, Norfolk Place, London, United Kingdom
|
4
|
Jobb G, von Haeseler A, Strimmer K. Retraction Note: TREEFINDER: a powerful graphical analysis environment for molecular phylogenetics. BMC Evol Biol 2015; 15:243. [PMID: 26542699 PMCID: PMC4635604 DOI: 10.1186/s12862-015-0513-z] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 10/19/2015] [Accepted: 10/20/2015] [Indexed: 11/10/2022]
|
5
|
Gibb S, Strimmer K. Differential protein expression and peak selection in mass spectrometry data by binary discriminant analysis. Bioinformatics 2015; 31:3156-62. [PMID: 26026136 DOI: 10.1093/bioinformatics/btv334] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Received: 02/27/2015] [Accepted: 05/26/2015] [Indexed: 02/05/2023]
Abstract
MOTIVATION Proteomic mass spectrometry analysis is becoming routine in clinical diagnostics, for example to monitor cancer biomarkers using blood samples. However, differential proteomics and identification of peaks relevant for class separation remain challenging. RESULTS Here, we introduce a simple yet effective approach for identifying differentially expressed proteins using binary discriminant analysis. This approach works by data-adaptive thresholding of protein expression values and subsequent ranking of the dichotomized features using a relative entropy measure. Our framework may be viewed as a generalization of the 'peak probability contrast' approach of Tibshirani et al. (2004) and can be applied both in the two-group and the multi-group setting. Our approach is computationally inexpensive and, in the analysis of a large-scale drug discovery test dataset, achieves prediction accuracy equivalent to that of a random forest. Furthermore, in the analysis of mass spectrometry data from a pancreas cancer study, we were able to identify biologically relevant and statistically predictive marker peaks unrecognized in the original study. AVAILABILITY AND IMPLEMENTATION The methodology for binary discriminant analysis is implemented in the R package binda, which is freely available under the GNU General Public License (version 3 or later) from CRAN at URL http://cran.r-project.org/web/packages/binda/. R scripts reproducing all described analyses are available from the web page http://strimmerlab.org/software/binda/. CONTACT k.strimmer@imperial.ac.uk.
Affiliation(s)
- Sebastian Gibb
  - Anesthesiology and Intensive Care Medicine, University Hospital Greifswald, Ferdinand-Sauerbruch-Straße, D-17475 Greifswald, Germany
- Korbinian Strimmer
  - Epidemiology and Biostatistics, School of Public Health, Imperial College London, Norfolk Place, London, W2 1PG, UK
|
6
|
Hoffmann S, Stadler PF, Strimmer K. A simple data-adaptive probabilistic variant calling model. Algorithms Mol Biol 2015; 10:10. [PMID: 25788974 PMCID: PMC4363181 DOI: 10.1186/s13015-015-0037-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 05/21/2014] [Accepted: 01/11/2015] [Indexed: 11/30/2022]
Abstract
Background Several sources of noise obfuscate the identification of single nucleotide variation (SNV) in next generation sequencing data. For instance, errors may be introduced during library construction and sequencing steps. In addition, the reference genome and the algorithms used for the alignment of the reads are further critical factors determining the efficacy of variant calling methods. It is crucial to account for these factors in individual sequencing experiments. Results We introduce a simple data-adaptive model for variant calling. This model automatically adjusts to specific factors such as alignment errors. To achieve this, several characteristics are sampled from sites with low mismatch rates, and these are used to estimate empirical log-likelihoods. The likelihoods are then combined into a score that typically gives rise to a mixture distribution. From this we determine a decision threshold to separate potentially variant sites from the noisy background. Conclusions In simulations we show that our simple model is competitive with frequently used, much more complex SNV calling algorithms in terms of sensitivity and specificity. It performs particularly well in cases with low allele frequencies. The application to next-generation sequencing data reveals stark differences in the score distributions, indicating a strong influence of data-specific sources of noise. The proposed model is specifically designed to adjust to these differences.
|
7
|
Zuber V, Duarte Silva AP, Strimmer K. A novel algorithm for simultaneous SNP selection in high-dimensional genome-wide association studies. BMC Bioinformatics 2012; 13:284. [PMID: 23113980 PMCID: PMC3558454 DOI: 10.1186/1471-2105-13-284] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Received: 06/01/2012] [Accepted: 10/20/2012] [Indexed: 11/10/2022]
Abstract
BACKGROUND Identification of causal SNPs in most genome wide association studies relies on approaches that consider each SNP individually. However, there is a strong correlation structure among SNPs that needs to be taken into account. Hence, increasingly modern computationally expensive regression methods are employed for SNP selection that consider all markers simultaneously and thus incorporate dependencies among SNPs. RESULTS We develop a novel multivariate algorithm for large scale SNP selection using CAR score regression, a promising new approach for prioritizing biomarkers. Specifically, we propose a computationally efficient procedure for shrinkage estimation of CAR scores from high-dimensional data. Subsequently, we conduct a comprehensive comparison study including five advanced regression approaches (boosting, lasso, NEG, MCP, and CAR score) and a univariate approach (marginal correlation) to determine the effectiveness in finding true causal SNPs. CONCLUSIONS Simultaneous SNP selection is a challenging task. We demonstrate that our CAR score-based algorithm consistently outperforms all competing approaches, both uni- and multivariate, in terms of correctly recovered causal SNPs and SNP ranking. An R package implementing the approach as well as R code to reproduce the complete study presented here is available from http://strimmerlab.org/software/care/.
Affiliation(s)
- Verena Zuber
  - Institute for Medical Informatics, Statistics and Epidemiology, University of Leipzig, Härtelstr. 16–18, D-04107 Leipzig, Germany
- A Pedro Duarte Silva
  - Faculdade de Economia e Gestão & CEGE, Catholic University of Portugal, Rua Diogo Botelho 1327, 4169-005 Porto, Portugal
- Korbinian Strimmer
  - Institute for Medical Informatics, Statistics and Epidemiology, University of Leipzig, Härtelstr. 16–18, D-04107 Leipzig, Germany
|
8
|
Abstract
Signal identification in large-dimensional settings is a challenging problem in biostatistics. Recently, the method of higher criticism (HC) was shown to be an effective means for determining appropriate decision thresholds. Here, we study HC from a false discovery rate (FDR) perspective. We show that the HC threshold may be viewed as an approximation to a natural class boundary (CB) in two-class discriminant analysis, which in turn is expressible as an FDR threshold. We demonstrate that in a rare-weak setting, in the region of the phase space where signal identification is possible, both thresholds are practically indistinguishable, and thus HC thresholding is identical to using a simple local FDR cutoff. The relationship of the HC and CB thresholds and their properties are investigated both analytically and by simulations, and are further compared by application to four cancer gene expression data sets.
Affiliation(s)
- Bernd Klaus
  - Institute for Medical Informatics, Statistics and Epidemiology, University of Leipzig, Härtelstr. 16-18, D-04107 Leipzig, Germany
|
9
|
Abstract
UNLABELLED MALDIquant is an R package providing a complete and modular analysis pipeline for quantitative analysis of mass spectrometry data. MALDIquant is specifically designed with application in clinical diagnostics in mind and implements sophisticated routines for importing raw data, preprocessing, non-linear peak alignment and calibration. It also handles technical replicates as well as spectra with unequal resolution. AVAILABILITY MALDIquant and its associated R packages readBrukerFlexData and readMzXmlData are freely available from the R archive CRAN (http://cran.r-project.org). The software is distributed under the GNU General Public License (version 3 or later) and is accompanied by example files and data. Additional documentation is available from http://strimmerlab.org/software/maldiquant/.
Affiliation(s)
- Sebastian Gibb
  - Institute for Medical Informatics, Statistics and Epidemiology (IMISE), University of Leipzig, Leipzig, Germany.
|
10
|
Abstract
MOTIVATION In statistical bioinformatics research, different optimization mechanisms potentially lead to 'over-optimism' in published papers. So far, however, a systematic critical study concerning the various sources underlying this over-optimism is lacking. RESULTS We present an empirical study on over-optimism using high-dimensional classification as example. Specifically, we consider a 'promising' new classification algorithm, namely linear discriminant analysis incorporating prior knowledge on gene functional groups through an appropriate shrinkage of the within-group covariance matrix. While this approach yields poor results in terms of error rate, we quantitatively demonstrate that it can artificially seem superior to existing approaches if we 'fish for significance'. The investigated sources of over-optimism include the optimization of datasets, of settings, of competing methods and, most importantly, of the method's characteristics. We conclude that, if the improvement of a quantitative criterion such as the error rate is the main contribution of a paper, the superiority of new algorithms should always be demonstrated on independent validation data. AVAILABILITY The R codes and relevant data can be downloaded from http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/020_professuren/boulesteix/overoptimism/, such that the study is completely reproducible.
Affiliation(s)
- Monika Jelizarow
  - Department of Medical Informatics, Biometry and Epidemiology, University of Munich, Munich, Germany
|
11
|
Ahdesmäki M, Strimmer K. Feature selection in omics prediction problems using cat scores and false nondiscovery rate control. Ann Appl Stat 2010. [DOI: 10.1214/09-aoas277] [Citation(s) in RCA: 90] [Impact Index Per Article: 6.4] [Indexed: 11/19/2022]
|
12
|
Abstract
MOTIVATION Biomarker discovery and gene ranking is a standard task in genomic high-throughput analysis. Typically, the ordering of markers is based on a stabilized variant of the t-score, such as the moderated t or the SAM statistic. However, these procedures ignore gene-gene correlations, which may have a profound impact on the gene orderings and on the power of the subsequent tests. RESULTS We propose a simple procedure that adjusts gene-wise t-statistics to take account of correlations among genes. The resulting correlation-adjusted t-scores ('cat' scores) are derived from a predictive perspective, i.e. as a score for variable selection to discriminate group membership in two-class linear discriminant analysis. In the absence of correlation the cat score reduces to the standard t-score. Moreover, using the cat score it is straightforward to evaluate groups of features (i.e. gene sets). For computation of the cat score from small sample data, we propose a shrinkage procedure. In a comparative study comprising six different synthetic and empirical correlation structures, we show that the cat score improves estimation of gene orderings and leads to higher power for fixed true discovery rate, and vice versa. Finally, we also illustrate the cat score by analyzing metabolomic data. AVAILABILITY The shrinkage cat score is implemented in the R package 'st', which is freely available under the terms of the GNU General Public License (version 3 or later) from CRAN (http://cran.r-project.org/web/packages/st/).
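The effect of the correlation adjustment can be illustrated for the smallest non-trivial case of two features. The Python sketch below assumes that the cat score vector is obtained by premultiplying the t-score vector with the inverse matrix square root of the feature correlation matrix; the abstract does not spell out this formula, so it is an assumption here, and the shrinkage estimation described above is omitted. The function name `cat_scores_2` is hypothetical.

```python
from math import sqrt


def cat_scores_2(t1, t2, rho):
    """Decorrelate two t-scores given the correlation rho between the two features.

    Assumption: cat = P^(-1/2) t with P = [[1, rho], [rho, 1]]. For this 2x2
    case, P^(-1/2) = [[a, b], [b, a]] follows in closed form from the
    eigenvalues 1 + rho and 1 - rho.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    plus = 1.0 / sqrt(1.0 + rho)
    minus = 1.0 / sqrt(1.0 - rho)
    a = 0.5 * (plus + minus)
    b = 0.5 * (plus - minus)
    return a * t1 + b * t2, b * t1 + a * t2


# Without correlation the scores are unchanged, as stated in the abstract.
print(cat_scores_2(2.0, 1.0, 0.0))
# With positive correlation the two scores are adjusted against each other.
print(cat_scores_2(2.0, 1.0, 0.5))
```

Under this assumption, the sum of squared cat scores equals the quadratic form of the t-scores with the inverse correlation matrix, which is one way to see why grouped scores for gene sets are straightforward to form.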
Affiliation(s)
- Verena Zuber
  - Institute for Medical Informatics, Statistics and Epidemiology (IMISE), University of Leipzig, Härtelstr. 16-18, 04107 Leipzig, Germany
|
13
|
Abstract
Background Analysis of microarray and other high-throughput data on the basis of gene sets, rather than individual genes, is becoming more important in genomic studies. Correspondingly, a large number of statistical approaches for detecting gene set enrichment have been proposed, but both the interrelations and the relative performance of the various methods are still very much unclear. Results We conduct an extensive survey of statistical approaches for gene set analysis and identify a common modular structure underlying most published methods. Based on this finding we propose a general framework for detecting gene set enrichment. This framework provides a meta-theory of gene set analysis that not only helps to gain a better understanding of the relative merits of each embedded approach but also facilitates a principled comparison and offers insights into the relative interplay of the methods. Conclusion We use this framework to conduct a computer simulation comparing 261 different variants of gene set enrichment procedures and to analyze two experimental data sets. Based on the results we offer recommendations for best practices regarding the choice of effective procedures for gene set enrichment analysis.
Affiliation(s)
- Marit Ackermann
  - Biotechnology Center, Technical University Dresden, 01062 Dresden, Germany
|
14
|
|
15
|
Abstract
Background False discovery rate (FDR) methods play an important role in analyzing high-dimensional data. There are two types of FDR, tail area-based FDR and local FDR, as well as numerous statistical algorithms for estimating or controlling FDR. These differ in terms of underlying test statistics and procedures employed for statistical learning. Results A unifying algorithm for simultaneous estimation of both local FDR and tail area-based FDR is presented that can be applied to a diverse range of test statistics, including p-values, correlations, z- and t-scores. This approach is semiparametric and is based on a modified Grenander density estimator. For test statistics other than p-values it allows for empirical null modeling, so that dependencies among tests can be taken into account. The inference of the underlying model employs truncated maximum-likelihood estimation, with the cut-off point chosen according to the false non-discovery rate. Conclusion The proposed procedure generalizes a number of more specialized algorithms and thus offers a common framework for FDR estimation consistent across test statistics and types of FDR. In a comparative study the unified approach performs on par with the best competing yet more specialized alternatives. The algorithm is implemented in R in the "fdrtool" package, available under the GNU GPL from http://strimmerlab.org/software/fdrtool/ and from the R package archive CRAN.
Affiliation(s)
- Korbinian Strimmer
  - Institute for Medical Informatics, Statistics and Epidemiology, University of Leipzig, Härtelstr. 16-18, 04107 Leipzig, Germany.
|
16
|
Abstract
BACKGROUND False discovery rate (FDR) methods play an important role in analyzing high-dimensional data. There are two types of FDR, tail area-based FDR and local FDR, as well as numerous statistical algorithms for estimating or controlling FDR. These differ in terms of underlying test statistics and procedures employed for statistical learning. RESULTS A unifying algorithm for simultaneous estimation of both local FDR and tail area-based FDR is presented that can be applied to a diverse range of test statistics, including p-values, correlations, z- and t-scores. This approach is semiparametric and is based on a modified Grenander density estimator. For test statistics other than p-values it allows for empirical null modeling, so that dependencies among tests can be taken into account. The inference of the underlying model employs truncated maximum-likelihood estimation, with the cut-off point chosen according to the false non-discovery rate. CONCLUSION The proposed procedure generalizes a number of more specialized algorithms and thus offers a common framework for FDR estimation consistent across test statistics and types of FDR. In a comparative study the unified approach performs on par with the best competing yet more specialized alternatives. The algorithm is implemented in R in the "fdrtool" package, available under the GNU GPL from http://strimmerlab.org/software/fdrtool/ and from the R package archive CRAN.
Affiliation(s)
- Korbinian Strimmer
  - Institute for Medical Informatics, Statistics and Epidemiology, University of Leipzig, Härtelstr. 16-18, 04107 Leipzig, Germany.
|
17
|
Abstract
UNLABELLED False discovery rate (FDR) methodologies are essential in the study of high-dimensional genomic and proteomic data. The R package 'fdrtool' facilitates such analyses by offering a comprehensive set of procedures for FDR estimation. Its distinctive features include: (i) many different types of test statistics are allowed as input data, such as P-values, z-scores, correlations and t-scores; (ii) simultaneously, both local FDR and tail area-based FDR values are estimated for all test statistics and (iii) empirical null models are fit where possible, thereby taking account of potential over- or underdispersion of the theoretical null. In addition, 'fdrtool' provides readily interpretable graphical output, and can be applied to very large scale (in the order of millions of hypotheses) multiple testing problems. Consequently, 'fdrtool' implements a flexible FDR estimation scheme that is unified across different test statistics and variants of FDR. AVAILABILITY The program is freely available from the Comprehensive R Archive Network (http://cran.r-project.org/) under the terms of the GNU General Public License (version 3 or later). CONTACT strimmer@uni-leipzig.de.
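For readers unfamiliar with tail area-based FDR, the short Python sketch below computes Benjamini-Hochberg adjusted p-values, a standard tail-area FDR quantity. This is explicitly not the estimation scheme of 'fdrtool' (which, per the abstract, also fits empirical null models and estimates local FDR); it is a simpler, widely used procedure shown only to make the concept tangible. The function name `bh_qvalues` and the example p-values are illustrative assumptions.

```python
def bh_qvalues(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    For the i-th smallest p-value p_(i) among m tests, the adjusted value is
    the minimum over all ranks j >= i of p_(j) * m / j.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the running minimum enforces monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        q[i] = running_min
    return q


print(bh_qvalues([0.01, 0.04, 0.03, 0.20]))
```

A hypothesis is then called significant at FDR level alpha when its adjusted value does not exceed alpha.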
Affiliation(s)
- Korbinian Strimmer
  - Institute for Medical Informatics, Statistics and Epidemiology, University of Leipzig, Härtelstr. 16-18, 04107 Leipzig, Germany.
|
18
|
Abstract
With HIV persisting lifelong in infected persons, therapeutic vaccination is a novel alternative concept to control virus replication. Even though CD8 and CD4 cell responses to such immunizations have been demonstrated, their effects on virus replication are still unclear. In view of this fact, we studied the impact of a therapeutic vaccination with HIV nef delivered by a recombinant modified vaccinia Ankara vector on viral diversity. We investigated HIV sequences derived from chronically infected persons before and after therapeutic vaccination. Before immunization the mean +/- se pairwise variability of patient-derived Nef protein sequences was 0.1527 +/- 0.0041. After vaccination the respective value was 0.1249 +/- 0.0042, resulting in a significant (P<0.0001) difference between the two time points. The genes vif and 5'gag tested in parallel and nef sequences in control persons yielded a constant amino acid sequence variation. The data presented suggest that Nef immunization induced a selective pressure, limiting HIV sequence variability. To our knowledge this is the first report directly linking therapeutic HIV vaccination to decreasing diversity in patient-derived virus isolates.
Affiliation(s)
- Dieter Hoffmann
  - Institute of Virology, Technical University of Munich, Munich, Germany.
|
19
|
Opgen-Rhein R, Strimmer K. From correlation to causation networks: a simple approximate learning algorithm and its application to high-dimensional plant gene expression data. BMC Syst Biol 2007; 1:37. [PMID: 17683609 PMCID: PMC1995222 DOI: 10.1186/1752-0509-1-37] [Citation(s) in RCA: 261] [Impact Index Per Article: 15.4] [Received: 05/21/2007] [Accepted: 08/06/2007] [Indexed: 11/10/2022]
Abstract
Background The use of correlation networks is widespread in the analysis of gene expression and proteomics data, even though it is known that correlations not only confound direct and indirect associations but also provide no means to distinguish between cause and effect. For "causal" analysis typically the inference of a directed graphical model is required. However, this is rather difficult due to the curse of dimensionality. Results We propose a simple heuristic for the statistical learning of a high-dimensional "causal" network. The method first converts a correlation network into a partial correlation graph. Subsequently, a partial ordering of the nodes is established by multiple testing of the log-ratio of standardized partial variances. This allows identifying a directed acyclic causal network as a subgraph of the partial correlation network. We illustrate the approach by analyzing a large Arabidopsis thaliana expression data set. Conclusion The proposed approach is a heuristic algorithm that is based on a number of approximations, such as substituting lower order partial correlations by full order partial correlations. Nevertheless, for small samples and for sparse networks the algorithm not only yields sensible first order approximations of the causal structure in high-dimensional genomic data but is also computationally highly efficient. Availability and Requirements The method is implemented in the "GeneNet" R package (version 1.2.0), available from CRAN and from . The software includes an R script for reproducing the network analysis of the Arabidopsis thaliana data.
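The first step described above, moving from correlations to partial correlations, can be illustrated with three variables, where the partial correlation has a simple closed form. The Python sketch below is only an illustration of that step under toy numbers chosen here; it does not implement the node-ordering test or anything specific to the "GeneNet" package, and the function name `partial_corr` is hypothetical.

```python
from math import sqrt


def partial_corr(r_xy, r_xz, r_yz):
    """Partial correlation of x and y given z, from the three pairwise correlations."""
    return (r_xy - r_xz * r_yz) / sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))


# Toy chain x -> z -> y: x and y are correlated (0.64) only through z,
# so their partial correlation given z is (numerically) zero.
print(partial_corr(0.64, 0.8, 0.8))

# If z is unrelated to both x and y, conditioning on it changes nothing.
print(partial_corr(0.5, 0.0, 0.0))
```

This is why a partial correlation graph removes edges that reflect purely indirect association, which a plain correlation network would retain.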
Affiliation(s)
- Rainer Opgen-Rhein
  - Department of Statistics, Ludwig-Maximilians-Universität München, Ludwigstraße 33, D-80539 München, Germany
- Korbinian Strimmer
  - Institute for Medical Informatics, Statistics and Epidemiology (IMISE), University of Leipzig, Härtelstr. 16-18, 04107 Leipzig, Germany
|
20
|
Opgen-Rhein R, Strimmer K. Learning causal networks from systems biology time course data: an effective model selection procedure for the vector autoregressive process. BMC Bioinformatics 2007; 8 Suppl 2:S3. [PMID: 17493252 PMCID: PMC1892072 DOI: 10.1186/1471-2105-8-s2-s3] [Citation(s) in RCA: 95] [Impact Index Per Article: 5.6] [Indexed: 11/10/2022]
Abstract
Background Causal networks based on the vector autoregressive (VAR) process are a promising statistical tool for modeling regulatory interactions in a cell. However, learning these networks is challenging due to the low sample size and high dimensionality of genomic data. Results We present a novel and highly efficient approach to estimate a VAR network. This proceeds in two steps: (i) improved estimation of VAR regression coefficients using an analytic shrinkage approach, and (ii) subsequent model selection by testing the associated partial correlations. In simulations this approach outperformed for small sample size all other considered approaches in terms of true discovery rate (number of correctly identified edges relative to the significant edges). Moreover, the analysis of expression time series data from Arabidopsis thaliana resulted in a biologically sensible network. Conclusion Statistical learning of large-scale VAR causal models can be done efficiently by the proposed procedure, even in the difficult data situations prevalent in genomics and proteomics. Availability The method is implemented in R code that is available from the authors on request.
Affiliation(s)
- Rainer Opgen-Rhein
  - Department of Statistics, Ludwig-Maximilians-Universität München, Ludwigstraße 33, D-80539 München, Germany
- Korbinian Strimmer
  - Institute for Medical Informatics, Statistics and Epidemiology (IMISE), University of Leipzig, Härtelstr. 16-18, 04107 Leipzig, Germany
|
21
|
Abstract
High-dimensional case-control analysis is encountered in many different settings in genomics. In order to rank genes accordingly, many different scores have been proposed, ranging from ad hoc modifications of the ordinary t statistic to complicated hierarchical Bayesian models. Here, we introduce the "shrinkage t" statistic that is based on a novel and model-free shrinkage estimate of the variance vector across genes. This is derived in a quasi-empirical Bayes setting. The new rank score is fully automatic and requires no specification of parameters or distributions. It is computationally inexpensive and can be written analytically in closed form. Using a series of synthetic data sets and three real expression data sets, we studied the quality of gene rankings produced by the "shrinkage t" statistic. The new score consistently leads to highly accurate rankings for the complete range of investigated data sets and all considered scenarios for across-gene variance structures.
|
22
|
Abstract
Partial least squares (PLS) is an efficient statistical regression technique that is highly suited for the analysis of genomic and proteomic data. In this article, we review both the theory underlying PLS as well as a host of bioinformatics applications of PLS. In particular, we provide a systematic comparison of the PLS approaches currently employed, and discuss analysis problems as diverse as, e.g. tumor classification from transcriptome data, identification of relevant genes, survival analysis and modeling of gene networks and transcription factor activities.
Affiliation(s)
- Anne-Laure Boulesteix
- Department of Medical Statistics and Epidemiology, Technical University of Munich, Ismaningerstrasse 22, D-81675 Munich, Germany.
|
23
|
Strimmer K. Book Review: Statistical Methods in Bioinformatics: An Introduction. By Warren J. Ewens and Gregory R. Grant. Biom J 2006. [DOI: 10.1002/bimj.200510194] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2022]
|
24
|
Schäfer J, Strimmer K. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Stat Appl Genet Mol Biol 2005; 4:Article32. [PMID: 16646851 DOI: 10.2202/1544-6115.1175] [Citation(s) in RCA: 642] [Impact Index Per Article: 33.8] [Indexed: 12/11/2022]
Abstract
Inferring large-scale covariance matrices from sparse genomic data is a ubiquitous problem in bioinformatics. Clearly, the widely used standard covariance and correlation estimators are ill-suited for this purpose. As a statistically efficient and computationally fast alternative, we propose a novel shrinkage covariance estimator that exploits the Ledoit-Wolf (2003) lemma for analytic calculation of the optimal shrinkage intensity. Subsequently, we apply this improved covariance estimator (which has guaranteed minimum mean squared error, is well-conditioned, and is always positive definite even for small sample sizes) to the problem of inferring large-scale gene association networks. We show that it performs very favorably compared to competing approaches both in simulations as well as in application to real expression data.
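The idea of analytic shrinkage can be illustrated with a short sketch. It shrinks the sample correlation matrix toward the identity, with the intensity estimated as the summed variance of the off-diagonal correlations divided by their summed squares. The identity target, the correlation-scale shrinkage, and the function name `shrink_correlation` are assumptions made for this illustration and may differ in detail from the published estimator.

```python
import numpy as np

def shrink_correlation(data):
    """Shrink the sample correlation matrix toward the identity.

    data: array of shape (n, p), n samples by p variables.
    Returns (shrunken correlation matrix, estimated intensity lam).
    """
    x = np.asarray(data, dtype=float)
    n, p = x.shape
    xs = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)   # standardise columns
    # w[k, i, j] = xs[k, i] * xs[k, j]; its mean over k gives the correlations.
    w = xs[:, :, None] * xs[:, None, :]
    w_bar = w.mean(axis=0)
    r = n / (n - 1) * w_bar                              # sample correlation matrix
    # Empirical variance of each correlation estimate.
    var_r = n / (n - 1) ** 3 * ((w - w_bar) ** 2).sum(axis=0)
    off = ~np.eye(p, dtype=bool)
    lam = var_r[off].sum() / (r[off] ** 2).sum()
    lam = float(np.clip(lam, 0.0, 1.0))
    r_shrunk = (1.0 - lam) * r                           # damp off-diagonal entries
    np.fill_diagonal(r_shrunk, 1.0)
    return r_shrunk, lam

# Ten samples of forty variables: the plain sample correlation matrix is singular.
rng = np.random.default_rng(1)
r_shrunk, lam = shrink_correlation(rng.standard_normal((10, 40)))
```

Because the result equals `(1 - lam) * R + lam * I` for a positive semidefinite `R`, every eigenvalue is at least `lam`, which is why the shrunken matrix stays positive definite even when there are fewer samples than variables.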
|
25
|
Thalmeier A, Giegling I, Dietrich I, Schneider B, Maurer K, Hartmann AM, Möller HJ, Strimmer K, Schaefer J, Bratzke H, Schnabel A. Identification of differentially expressed genes in the brains of suicide victims; a microarray analysis study. Pharmacopsychiatry 2005. [DOI: 10.1055/s-2005-918852] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
|
26
|
Boulesteix AL, Strimmer K. Predicting transcription factor activities from combined analysis of microarray and ChIP data: a partial least squares approach. Theor Biol Med Model 2005; 2:23. [PMID: 15978125 PMCID: PMC1182396 DOI: 10.1186/1742-4682-2-23] [Citation(s) in RCA: 81] [Impact Index Per Article: 4.3] [Received: 04/15/2005] [Accepted: 06/24/2005] [Indexed: 11/10/2022] Open
Abstract
Background The study of the network between transcription factors and their targets is important for understanding the complex regulatory mechanisms in a cell. Unfortunately, with standard microarray experiments it is not possible to measure the transcription factor activities (TFAs) directly, as their own transcription levels are subject to post-translational modifications. Results Here we propose a statistical approach based on partial least squares (PLS) regression to infer the true TFAs from a combination of mRNA expression and DNA-protein binding measurements. This method is also statistically sound for small samples and allows the detection of functional interactions among the transcription factors via the notion of "meta"-transcription factors. In addition, it enables false positives to be identified in ChIP data and activation and suppression activities to be distinguished. Conclusion The proposed method performs very well both for simulated data and for real expression and ChIP data from yeast and E. coli experiments. It overcomes the limitations of previously used approaches to estimating TFAs. The estimated profiles may also serve as input for further studies, such as tests of periodicity or differential regulation. An R package "plsgenomics" implementing the proposed methods is available for download from the CRAN archive.
Affiliation(s)
- Anne-Laure Boulesteix
- Department of Statistics, University of Munich, Ludwigstr. 33, D-80539 Munich, Germany
- Korbinian Strimmer
- Department of Statistics, University of Munich, Ludwigstr. 33, D-80539 Munich, Germany
|
27
|
Opgen-Rhein R, Fahrmeir L, Strimmer K. Inference of demographic history from genealogical trees using reversible jump Markov chain Monte Carlo. BMC Evol Biol 2005; 5:6. [PMID: 15663782 PMCID: PMC548300 DOI: 10.1186/1471-2148-5-6] [Citation(s) in RCA: 62] [Impact Index Per Article: 3.3] [Received: 06/29/2004] [Accepted: 01/21/2005] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Coalescent theory is a general framework to model genetic variation in a population. Specifically, it allows inference about population parameters from sampled DNA sequences. However, most currently employed variants of coalescent theory only consider very simple demographic scenarios of population size changes, such as exponential growth. RESULTS Here we develop a coalescent approach that allows Bayesian non-parametric estimation of the demographic history using genealogies reconstructed from sampled DNA sequences. In this framework inference and model selection are done using reversible jump Markov chain Monte Carlo (MCMC). This method is computationally efficient and overcomes the limitations of related non-parametric approaches such as the skyline plot. We validate the approach using simulated data. Subsequently, we reanalyze HIV-1 sequence data from Central Africa and Hepatitis C virus (HCV) data from Egypt. CONCLUSIONS The new method provides a Bayesian procedure for non-parametric estimation of the demographic history. By construction it additionally provides confidence limits and may be used jointly with other MCMC-based coalescent approaches.
Affiliation(s)
- Rainer Opgen-Rhein
- Department of Statistics, University of Munich, Ludwigstr. 33, D-80539 Munich, Germany
- Ludwig Fahrmeir
- Department of Statistics, University of Munich, Ludwigstr. 33, D-80539 Munich, Germany
- Korbinian Strimmer
- Department of Statistics, University of Munich, Ludwigstr. 33, D-80539 Munich, Germany
|
28
|
Abstract
MOTIVATION Genetic networks are often described statistically using graphical models (e.g. Bayesian networks). However, inferring the network structure offers a serious challenge in microarray analysis where the sample size is small compared to the number of considered genes. This renders many standard algorithms for graphical models inapplicable, and inferring genetic networks an 'ill-posed' inverse problem. METHODS We introduce a novel framework for small-sample inference of graphical models from gene expression data. Specifically, we focus on the so-called graphical Gaussian models (GGMs) that are now frequently used to describe gene association networks and to detect conditionally dependent genes. Our new approach is based on (1) improved (regularized) small-sample point estimates of partial correlation, (2) an exact test of edge inclusion with adaptive estimation of the degree of freedom and (3) a heuristic network search based on false discovery rate multiple testing. Steps (2) and (3) correspond to an empirical Bayes estimate of the network topology. RESULTS Using computer simulations, we investigate the sensitivity (power) and specificity (true negative rate) of the proposed framework to estimate GGMs from microarray data. This shows that it is possible to recover the true network topology with high accuracy even for small-sample datasets. Subsequently, we analyze gene expression data from a breast cancer tumor study and illustrate our approach by inferring a corresponding large-scale gene association network for 3883 genes.
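In a graphical Gaussian model, an edge between two genes corresponds to a nonzero partial correlation, which can be read off the inverse covariance (precision) matrix. The sketch below shows only this standard identity; it assumes an invertible covariance matrix and does not reproduce the regularized small-sample estimate or the edge test from the abstract. The function name `partial_correlations` is hypothetical.

```python
import numpy as np

def partial_correlations(cov):
    """Partial correlations from a positive definite covariance matrix."""
    omega = np.linalg.inv(np.asarray(cov, dtype=float))   # precision matrix
    d = np.sqrt(np.diag(omega))
    pcor = -omega / np.outer(d, d)                        # rescale and flip sign
    np.fill_diagonal(pcor, 1.0)
    return pcor

# Chain X1 - X2 - X3: X1 and X3 are correlated only through X2.
precision = np.array([[ 2.0, -1.0,  0.0],
                      [-1.0,  2.0, -1.0],
                      [ 0.0, -1.0,  2.0]])
cov = np.linalg.inv(precision)
pcor = partial_correlations(cov)
```

In this example `cov[0, 2]` is clearly positive while `pcor[0, 2]` is zero up to rounding, which is the distinction between marginal and conditional dependence that the network inference relies on.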
Affiliation(s)
- Juliane Schäfer
- Department of Statistics, University of Munich, Ludwigstrasse 33, D-80539 Munich, Germany
|
29
|
Abstract
MOTIVATION Cancer diagnosis using gene expression profiles requires supervised learning and gene selection methods. Of the many suggested approaches, the method of emerging patterns (EPs) has the particular advantage of explicitly modeling interactions among genes, which improves classification accuracy. However, finding useful (i.e. short and statistically significant) EPs is typically very hard. METHODS Here we introduce a CART-based approach to discover EPs in microarray data. The method is based on growing decision trees from which the EPs are extracted. This approach combines pattern search with a statistical procedure based on Fisher's exact test to assess the significance of each EP. Subsequently, sample classification based on the inferred EPs is performed using maximum-likelihood linear discriminant analysis. RESULTS Using simulated data as well as gene expression data from colon and leukemia cancer experiments we assessed the performance of our pattern search algorithm and classification procedure. In the simulations, our method recovers a large proportion of known EPs while for real data it is comparable in classification accuracy with three top-performing alternative classification algorithms. In addition, it assigns statistical significance to the inferred EPs and allows the patterns to be ranked while simultaneously avoiding overfitting of the data. The new approach therefore provides a versatile and computationally fast tool for elucidating local gene interactions as well as for classification. AVAILABILITY A computer program written in the statistical language R implementing the new approach is freely available from the web page http://www.stat.uni-muenchen.de/~socher/
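The significance step can be sketched with a one-sided Fisher's exact test on a 2x2 table that counts how many samples of each class match a candidate pattern. This is an illustrative, standard-library-only version; the table layout, the choice of a one-sided alternative, and the function name `fisher_exact_greater` are assumptions, not details taken from the authors' program.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a: class-1 samples matching the pattern, b: class-1 samples not matching,
    c: class-2 samples matching,             d: class-2 samples not matching.
    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    denom = comb(total, col1)
    # Sum hypergeometric probabilities of tables at least as extreme as observed.
    tail = sum(comb(row1, k) * comb(total - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / denom

# Pattern seen in 8 of 10 class-1 samples but only 1 of 10 class-2 samples.
p_value = fisher_exact_greater(8, 2, 1, 9)
```

A small `p_value` indicates that the pattern occurs in class 1 more often than chance would explain, so candidate patterns could be ranked by this value.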
Affiliation(s)
- Anne-Laure Boulesteix
- Seminar for Applied Stochastics, Department of Statistics, University of Munich, Akademiestrasse 1, D-80799 Munich, Germany.
|
30
|
Jobb G, von Haeseler A, Strimmer K. TREEFINDER: a powerful graphical analysis environment for molecular phylogenetics. BMC Evol Biol 2004; 4:18. [PMID: 15222900 PMCID: PMC459214 DOI: 10.1186/1471-2148-4-18] [Citation(s) in RCA: 907] [Impact Index Per Article: 45.4] [Received: 03/16/2004] [Accepted: 06/28/2004] [Indexed: 11/23/2022] Open
Abstract
Background Most analysis programs for inferring molecular phylogenies are difficult to use, in particular for researchers with little programming experience. Results TREEFINDER is an easy-to-use integrative platform-independent analysis environment for molecular phylogenetics. In this paper the main features of TREEFINDER (version of April 2004) are described. TREEFINDER is written in ANSI C and Java and implements powerful statistical approaches for inferring gene trees and for related analyses. In addition, it provides a user-friendly graphical interface and a phylogenetic programming language. Conclusions TREEFINDER is a versatile framework for analyzing phylogenetic data across different platforms that is suited both for exploratory as well as advanced studies.
Affiliation(s)
- Gangolf Jobb
- Department of Statistics, University of Munich, Ludwigstr. 33, D-80539 Munich, Germany
- Arndt von Haeseler
- Department of Computer Science, University of Düsseldorf, Universitätsstr. 1, D-40225 Düsseldorf, Germany
- John von Neumann Institute for Computing, Forschungszentrum Jülich, D-52425 Jülich, Germany
- Korbinian Strimmer
- Department of Statistics, University of Munich, Ludwigstr. 33, D-80539 Munich, Germany
|
31
|
Abstract
UNLABELLED Analysis of Phylogenetics and Evolution (APE) is a package written in the R language for use in molecular evolution and phylogenetics. APE provides both utility functions for reading and writing data and manipulating phylogenetic trees, as well as several advanced methods for phylogenetic and evolutionary analysis (e.g. comparative and population genetic methods). APE takes advantage of the many R functions for statistics and graphics, and also provides a flexible framework for developing and implementing further statistical methods for the analysis of evolutionary processes. AVAILABILITY The program is free and available from the official R package archive at http://cran.r-project.org/src/contrib/PACKAGES.html#ape. APE is licensed under the GNU General Public License.
Affiliation(s)
- Emmanuel Paradis
- Laboratoire de Paléontologie, Paléobiologie and Phylogénie, Institut des Sciences de l'Evolution, Université Montpellier II, F-34095 Montpellier cédex 05, France.
|
32
|
Abstract
MOTIVATION Microarray experiments are now routinely used to collect large-scale time series data, for example to monitor gene expression during the cell cycle. Statistical analysis of this data poses many challenges, one being that it is hard to identify correctly the subset of genes with a clear periodic signature. This has led to a controversial argument with regard to the suitability of both available methods and current microarray data. METHODS We introduce two simple but efficient statistical methods for signal detection and gene selection in gene expression time series data. First, we suggest the average periodogram as an exploratory device for graphical assessment of the presence of periodic transcripts in the data. Second, we describe an exact statistical test to identify periodically expressed genes that allows one to distinguish periodic from purely random processes. This identification method is based on the so-called g-statistic and uses the false discovery rate approach to multiple testing. RESULTS Using simulated data it is shown that the suggested method is capable of identifying cell-cycle-activated genes in a gene expression data set even if the number of the cyclic genes is very small and regardless of the presence of a dominant non-periodic component in the data. Subsequently, we re-examine 12 large microarray time series data sets (in part controversially discussed) from yeast, human fibroblast, human HeLa and bacterial cells. Based on the statistical analysis it is found that a majority of these data sets contained little or no statistically significant evidence for genes with periodic variation linked to cell cycle regulation. On the other hand, for the remaining data the method extends the catalog of previously known cell-cycle-specific transcripts by identifying additional periodic genes not found by other methods. The problem of distinguishing periodicity due to generic cell cycle activity and to artifacts from synchronization is also discussed. AVAILABILITY The approach has been implemented in the R package GeneTS available from http://www.stat.uni-muenchen.de/~strimmer/software.html under the terms of the GNU General Public License.
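The exact test mentioned above can be sketched for a single time series as follows. The g-statistic is the largest periodogram ordinate divided by the sum of all ordinates, and its null distribution under Gaussian white noise has Fisher's closed form. This is an illustrative version and not the GeneTS implementation; the function name `fisher_g_test` and the simulated series are hypothetical, and the false discovery rate step across genes is omitted.

```python
from math import comb
import numpy as np

def fisher_g_test(series):
    """Fisher's g-statistic and its exact p-value for one time series."""
    x = np.asarray(series, dtype=float)
    n = x.size
    m = (n - 1) // 2                       # Fourier frequencies, excluding 0 and Nyquist
    spectrum = np.abs(np.fft.rfft(x - x.mean())) ** 2
    ordinates = spectrum[1:m + 1]
    g = float(ordinates.max() / ordinates.sum())
    # Exact null distribution: P(g > x) = sum_j (-1)^(j-1) C(m, j) (1 - j x)^(m-1)
    terms = int(np.floor(1.0 / g))
    p = sum((-1) ** (j - 1) * comb(m, j) * (1.0 - j * g) ** (m - 1)
            for j in range(1, terms + 1))
    return g, min(max(p, 0.0), 1.0)        # guard against rounding outside [0, 1]

rng = np.random.default_rng(0)
t = np.arange(48)
periodic = np.sin(2 * np.pi * t / 12) + 0.3 * rng.standard_normal(48)
noise = rng.standard_normal(48)
g_per, p_per = fisher_g_test(periodic)
g_noise, p_noise = fisher_g_test(noise)
```

For the sinusoidal series a single frequency dominates the periodogram, so `g_per` is close to 1 and `p_per` is very small. In a genome-wide analysis one p-value per gene would then be passed to a multiple testing procedure.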
Affiliation(s)
- Sofia Wichert
- Department of Statistics, University of Munich, Ludwigstrasse 33, D-80539 Munich, Germany
|
33
|
|
34
|
Strimmer K, Forslund K, Holland B, Moulton V. A novel exploratory method for visual recombination detection. Genome Biol 2003; 4:R33. [PMID: 12734013 PMCID: PMC156589 DOI: 10.1186/gb-2003-4-5-r33] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.8] [Received: 12/16/2002] [Revised: 03/10/2003] [Accepted: 03/31/2003] [Indexed: 11/17/2022] Open
Abstract
A versatile visual approach for detecting recombination and identifying recombination breakpoints within a sequence alignment is presented. The method is based on two novel diagrams - the highway plot and the occupancy plot - that graphically portray phylogenetic inhomogeneity along an alignment, and can be viewed as a synthesis of two widely used but unrelated methods: bootscanning and quartet-mapping. To illustrate the method, simulated data and HIV-1 and influenza A datasets are investigated.
|
35
|
Abstract
BACKGROUND Using suitable error models for gene expression measurements is essential in the statistical analysis of microarray data. However, the true probabilistic model underlying gene expression intensity readings is generally not known. Instead, in currently used approaches some simple parametric model is assumed (usually a transformed normal distribution) or the empirical distribution is estimated. However, both these strategies may not be optimal for gene expression data, as the non-parametric approach ignores known structural information whereas the fully parametric models run the risk of misspecification. A further related problem is the choice of a suitable scale for the model (e.g. observed vs. log-scale). RESULTS Here a simple semi-parametric model for gene expression measurement error is presented. In this approach inference is based on an approximate likelihood function (the extended quasi-likelihood). Only partial knowledge about the unknown true distribution is required to construct this function. In case of gene expression this information is available in the form of the postulated (e.g. quadratic) variance structure of the data. As the quasi-likelihood behaves (almost) like a proper likelihood, it allows for the estimation of calibration and variance parameters, and it is also straightforward to obtain corresponding approximate confidence intervals. Unlike most other frameworks, it also allows analysis on any preferred scale, i.e. both on the original linear scale as well as on a transformed scale. It can also be employed in regression approaches to model systematic (e.g. array or dye) effects. CONCLUSIONS The quasi-likelihood framework provides a simple and versatile approach to analyze gene expression data that does not make any strong distributional assumptions about the underlying error model. For several simulated as well as real data sets it provides a better fit to the data than competing models. In an example it also improved the power of tests to identify differential expression.
Affiliation(s)
- Korbinian Strimmer
- Department of Statistics, University of Munich, Ludwigstrasse 33, D-80539 Munich, Germany.
|
36
|
Abstract
SUMMARY TREE-PUZZLE is a program package for quartet-based maximum-likelihood phylogenetic analysis (formerly PUZZLE, Strimmer and von Haeseler, Mol. Biol. Evol., 13, 964-969, 1996) that provides methods for reconstruction, comparison, and testing of trees and models on DNA as well as protein sequences. To reduce waiting time for larger datasets the tree reconstruction part of the software has been parallelized using message passing that runs on clusters of workstations as well as parallel computers. AVAILABILITY http://www.tree-puzzle.de. The program is written in ANSI C. TREE-PUZZLE can be run on UNIX, Windows and Mac systems, including Mac OS X. To run the parallel version of PUZZLE, a Message Passing Interface (MPI) library has to be installed on the system. Free MPI implementations are available on the Web (cf. http://www.lam-mpi.org/mpi/implementations/).
Affiliation(s)
- Heiko A Schmidt
- Max-Planck-Institut für Molekulare Genetik, Ihnestr. 73, D-14195 Berlin, Germany.
|
37
|
Abstract
The problem of inferring confidence sets of gene trees is discussed without assuming that the substitution model or the branching pattern of any of the investigated trees is correct. In this case, widely used methods to compare genealogies can give highly contradicting results. Here, three methods to infer confidence sets that are robust against model misspecification are compared, including a new approach based on estimating the confidence in a specific tree using expected-likelihood weights. The power of the investigated methods is studied by analysing HIV-1 and mtDNA sequence data as well as simulated sequences. Finally, guidelines for choosing an appropriate method to compare multiple gene trees are provided.
Affiliation(s)
- Korbinian Strimmer
- Department of Zoology, University of Oxford, South Parks Road, Oxford OX1 3PS, UK.
|
38
|
Abstract
We present an intuitive visual framework, the generalized skyline plot, to explore the demographic history of sampled DNA sequences. This approach is based on a genealogy inferred from the sequences and provides a nonparametric estimate of effective population size through time. In contrast to previous related procedures, the generalized skyline plot is more applicable to cases where the underlying tree is not fully resolved and the data is not highly variable. This is achieved by the grouping of adjacent coalescent intervals. We employ a small-sample Akaike information criterion to objectively choose the optimal grouping strategy. We investigate the performance of our approach using simulation and subsequently apply it to HIV-1 sequences from central Africa and mtDNA sequences from red pandas.
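The grouping of adjacent coalescent intervals can be sketched as below. Under the standard coalescent, the waiting time while j lineages remain has expectation 2M / (j (j - 1)) for population size M, so pooling m consecutive intervals that start with k lineages gives the method-of-moments estimate M = total * k * (k - m) / (2 m). The pooling rule used here (merge until the pooled duration reaches a threshold `eps`) and the function name `generalized_skyline` are assumptions for illustration; the small-sample AIC choice of grouping described in the abstract is not implemented.

```python
def generalized_skyline(durations, eps):
    """Piecewise-constant population size estimates from coalescent intervals.

    durations: waiting times listed from the interval with n lineages down to
               the interval with 2 lineages (n = len(durations) + 1).
    eps:       adjacent intervals are pooled until their total duration
               reaches this threshold; eps = 0 keeps every interval separate.
    Returns a list of (total_duration, estimate) pairs, one per group.
    """
    n = len(durations) + 1
    groups = []
    k, total, m = n, 0.0, 0           # k = lineages at the start of the open group
    for i, w in enumerate(durations):
        total += w
        m += 1
        last = i == len(durations) - 1
        if total >= eps or last:
            # Method of moments: E[total] = 2 * M * m / (k * (k - m))
            groups.append((total, total * k * (k - m) / (2.0 * m)))
            k, total, m = k - m, 0.0, 0
    return groups

# Four sampled sequences give three coalescent intervals.
classic = generalized_skyline([0.1, 0.2, 0.6], eps=0.0)
pooled = generalized_skyline([0.1, 0.2, 0.6], eps=10.0)
```

With `eps=0.0` each interval yields its own estimate, as in the classic skyline plot, whereas a large `eps` pools all intervals into one constant-size estimate. The example durations are chosen so that both views give the same value.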
Affiliation(s)
- K Strimmer
- Department of Zoology, University of Oxford
|
39
|
Abstract
Phylogenetic Analysis Library (PAL) is a collection of Java classes for use in molecular evolution and phylogenetics. PAL provides a modular environment for the rapid construction of both special-purpose and general analysis programs. PAL version 1.1 consists of 145 public classes or interfaces in 13 packages, including classes for models of character evolution, maximum-likelihood estimation, and the coalescent, with a total of more than 27000 lines of code. The PAL project is set up as a collaborative project to facilitate contributions from other researchers. AVAILABILITY: The program is free and is available at http://www.pal-project.org. It requires Java 1.1 or later. PAL is licensed under the GNU General Public License.
Affiliation(s)
- A Drummond
- School of Biological Sciences, University of Auckland, 3A Symonds Street, Auckland, New Zealand.
|
40
|
Salemi M, Strimmer K, Hall WW, Duffy M, Delaporte E, Mboup S, Peeters M, Vandamme AM. Dating the common ancestor of SIVcpz and HIV-1 group M and the origin of HIV-1 subtypes using a new method to uncover clock-like molecular evolution. FASEB J 2001; 15:276-8. [PMID: 11156935 DOI: 10.1096/fj.00-0449fje] [Citation(s) in RCA: 85] [Impact Index Per Article: 3.7] [Indexed: 11/11/2022]
Abstract
Attempts to estimate the time of origin of human immunodeficiency virus (HIV)-1 by using phylogenetic analysis are seriously flawed because of the unequal evolutionary rates among different viral lineages. Here, we report a new method of molecular clock analysis, called Site Stripping for Clock Detection (SSCD), which allows selection of nucleotide sites evolving at an equal rate in different lineages. The method was validated on a dataset of patients all infected with hepatitis C virus in 1977 by the same donor, and it was able to date exactly the known origin of the infection. Using the same method, we calculated that the origin of HIV-1 group M radiation was in the 1930s. In addition, we show that the coalescence time of the simian ancestor of HIV-1 group M and its closest related cpz strains occurred around the end of the XVII century, a date that could be considered the upper limit to the time of simian-to-human transmission of HIV-1 group M. The results show also that SSCD is an easy-to-use method of general applicability in molecular evolution to calibrate clock-like phylogenetic trees.
Affiliation(s)
- M Salemi
- Rega Institute for Medical Research, Katholieke Universiteit Leuven, B-3000 Leuven, Belgium
|
41
|
|
42
|
Abstract
A method for computing the likelihood of a set of sequences assuming a phylogenetic network as an evolutionary hypothesis is presented. The approach applies directed graphical models to sequence evolution on networks and is a natural generalization of earlier work by Felsenstein on evolutionary trees, including it as a special case. The likelihood computation involves several steps. First, the phylogenetic network is rooted to form a directed acyclic graph (DAG). Then, applying standard models for nucleotide/amino acid substitution, the DAG is converted into a Bayesian network from which the joint probability distribution involving all nodes of the network can be directly read. The joint probability is explicitly dependent on branch lengths and on recombination parameters (prior probability of a parent sequence). The likelihood of the data assuming no knowledge of hidden nodes is obtained by marginalization, i.e., by summing over all combinations of unknown states. As the number of terms increases exponentially with the number of hidden nodes, a Markov chain Monte Carlo procedure (Gibbs sampling) is used to accurately approximate the likelihood by summing over the most important states only. Investigating a human T-cell lymphotropic virus (HTLV) data set and optimizing both branch lengths and recombination parameters, we find that the likelihood of a corresponding phylogenetic network outperforms a set of competing evolutionary trees. In general, except for the case of a tree, the likelihood of a network will be dependent on the choice of the root, even if a reversible model of substitution is applied. Thus, the method also provides a way in which to root a phylogenetic network by choosing a node that produces a most likely network.
Affiliation(s)
- K Strimmer
- GSF-Forschungszentrum für Umwelt und Gesundheit, MIPS, am Max-Planck-Institut für Biochemie, Martinsried, Germany
|
43
|
Abstract
We introduce a graphical method, likelihood-mapping, to visualize the phylogenetic content of a set of aligned sequences. The method is based on an analysis of the maximum likelihoods for the three fully resolved tree topologies that can be computed for four sequences. The three likelihoods are represented as one point inside an equilateral triangle. The triangle is partitioned in different regions. One region represents star-like evolution, three regions represent a well-resolved phylogeny, and three regions reflect the situation where it is difficult to distinguish between two of the three trees. The location of the likelihoods in the triangle defines the mode of sequence evolution. If n sequences are analyzed, then the likelihoods for each subset of four sequences are mapped onto the triangle. The resulting distribution of points shows whether the data are suitable for a phylogenetic reconstruction or not.
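The mapping from three likelihoods to a point in the triangle can be sketched as follows. The three likelihoods are normalised to weights that sum to one and used as barycentric coordinates. The region assignment shown here, which picks the nearest of seven reference points (three corners, three edge midpoints, and the centre), is an assumption about how the seven regions are delimited, and the function names are hypothetical.

```python
import math

def likelihood_map_point(log_likelihoods):
    """Map three quartet-tree log-likelihoods to a point in a unit triangle."""
    top = max(log_likelihoods)
    weights = [math.exp(v - top) for v in log_likelihoods]   # stable normalisation
    total = sum(weights)
    p = tuple(w / total for w in weights)
    # Corners of an equilateral triangle with unit side length.
    corners = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]
    x = sum(pi * c[0] for pi, c in zip(p, corners))
    y = sum(pi * c[1] for pi, c in zip(p, corners))
    return p, (x, y)

def classify_region(p):
    """Assign weights to the nearest of seven reference points (assumed partition)."""
    attractors = {
        "tree 1": (1, 0, 0), "tree 2": (0, 1, 0), "tree 3": (0, 0, 1),
        "tree 1 or 2": (0.5, 0.5, 0), "tree 1 or 3": (0.5, 0, 0.5),
        "tree 2 or 3": (0, 0.5, 0.5), "star-like": (1 / 3, 1 / 3, 1 / 3),
    }
    return min(attractors, key=lambda name: math.dist(p, attractors[name]))

p_star, xy_star = likelihood_map_point([-100.0, -100.0, -100.0])
p_tree, xy_tree = likelihood_map_point([-100.0, -120.0, -120.0])
p_pair, xy_pair = likelihood_map_point([-100.0, -100.0, -130.0])
```

Repeating this for every quartet of sequences (or a random sample of quartets) and plotting the resulting points gives the distribution described in the abstract: points near the corners indicate well-resolved quartets, and points near the centre indicate star-like signal.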
Affiliation(s)
- K Strimmer
- Zoologisches Institut, Universität München, P.O. Box 202136, D-80021 Munich, Germany
|
44
|
|
45
|
|
46
|
|