1.
Abstract
In the past few years genome-wide association (GWA) studies have uncovered a large number of convincingly replicated associations for many complex human diseases. Genotype imputation has been used widely in the analysis of GWA studies to boost power, fine-map associations and facilitate the combination of results across studies using meta-analysis. This Review describes the details of several different statistical methods for imputing genotypes, illustrates and discusses the factors that influence imputation performance, and reviews methods that can be used to assess imputation performance and test association at imputed SNPs.
2.
Morris TP, White IR, Crowther MJ. Using simulation studies to evaluate statistical methods. Stat Med 2019; 38:2074-2102. [PMID: 30652356] [PMCID: PMC6492164] [DOI: 10.1002/sim.8086]
Abstract
Simulation studies are computer experiments that involve creating data by pseudo-random sampling. A key strength of simulation studies is the ability to understand the behavior of statistical methods because some "truth" (usually some parameter/s of interest) is known from the process of generating the data. This allows us to consider properties of methods, such as bias. While widely used, simulation studies are often poorly designed, analyzed, and reported. This tutorial outlines the rationale for using simulation studies and offers guidance for design, execution, analysis, reporting, and presentation. In particular, this tutorial provides a structured approach for planning and reporting simulation studies, which involves defining aims, data-generating mechanisms, estimands, methods, and performance measures ("ADEMP"); coherent terminology for simulation studies; guidance on coding simulation studies; a critical discussion of key performance measures and their estimation; guidance on structuring tabular and graphical presentation of results; and new graphical presentations. With a view to describing recent practice, we review 100 articles taken from Volume 34 of Statistics in Medicine, which included at least one simulation study and identify areas for improvement.
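The ADEMP structure can be made concrete in a few lines of R. Below is a minimal sketch of a simulation study in which every design choice (the estimators compared, a contaminated-normal data-generating mechanism, the repetition count) is an illustrative assumption rather than an example from the tutorial itself; it reports bias, empirical SE, and the Monte Carlo SE of the bias estimate.

```r
## Aim: compare the sample mean vs. a 20% trimmed mean.
## DGM: contaminated normal with true mean 0. Estimand: the mean.
## Methods: mean(), mean(trim = 0.2). Performance: bias, empirical SE.
set.seed(2019)
n_sim <- 1000; n <- 50
results <- t(replicate(n_sim, {
  x <- ifelse(runif(n) < 0.9, rnorm(n, 0, 1), rnorm(n, 0, 5))  # 10% contamination
  c(mean = mean(x), trimmed = mean(x, trim = 0.2))
}))
bias   <- colMeans(results) - 0          # estimand is 0
emp_se <- apply(results, 2, sd)          # empirical SE over repetitions
mcse   <- emp_se / sqrt(n_sim)           # Monte Carlo SE of the bias estimate
round(rbind(bias, emp_se, mcse), 4)
```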
3.
White IR, Carlin JB. Bias and efficiency of multiple imputation compared with complete-case analysis for missing covariate values. Stat Med 2010; 29:2920-31. [PMID: 20842622] [DOI: 10.1002/sim.3944]
Abstract
When missing data occur in one or more covariates in a regression model, multiple imputation (MI) is widely advocated as an improvement over complete-case analysis (CC). We use theoretical arguments and simulation studies to compare these methods with MI implemented under a missing at random assumption. When data are missing completely at random, both methods have negligible bias, and MI is more efficient than CC across a wide range of scenarios. For other missing data mechanisms, bias arises in one or both methods. In our simulation setting, CC is biased towards the null when data are missing at random. However, when missingness is independent of the outcome given the covariates, CC has negligible bias and MI is biased away from the null. With more general missing data mechanisms, bias tends to be smaller for MI than for CC. Since MI is not always better than CC for missing covariate problems, the choice of method should take into account what is known about the missing data mechanism in a particular substantive application. Importantly, the choice of method should not be based on comparison of standard errors. We propose new ways to understand empirical differences between MI and CC, which may provide insights into the appropriateness of the assumptions underlying each method, and we propose a new index for assessing the likely gain in precision from MI: the fraction of incomplete cases among the observed values of a covariate (FICO).
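A minimal sketch of the comparison in R, assuming the mice package and a simulated missing-at-random mechanism (all parameter values are illustrative); the last line computes the paper's FICO index for the fully observed covariate z.

```r
library(mice)
set.seed(1)
n <- 500
z <- rnorm(n); x <- 0.5 * z + rnorm(n); y <- x + z + rnorm(n)
x[rbinom(n, 1, plogis(-1 + z)) == 1] <- NA        # x missing at random given z
d <- data.frame(y, x, z)
cc  <- lm(y ~ x + z, data = d)                    # complete-case analysis
imp <- mice(d, m = 20, printFlag = FALSE)         # multiple imputation under MAR
mi  <- pool(with(imp, lm(y ~ x + z)))
summary(mi)                                       # compare with summary(cc)
## FICO for z: fraction of incomplete cases among cases with z observed
with(d, mean(is.na(x)[!is.na(z)]))
```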
4.
Collins GS, Ogundimu EO, Altman DG. Sample size considerations for the external validation of a multivariable prognostic model: a resampling study. Stat Med 2016; 35:214-26. [PMID: 26553135] [PMCID: PMC4738418] [DOI: 10.1002/sim.6787]
Abstract
After developing a prognostic model, it is essential to evaluate the performance of the model in samples independent from those used to develop the model, which is often referred to as external validation. However, despite its importance, very little is known about the sample size requirements for conducting an external validation. Using a large real data set and resampling methods, we investigate the impact of sample size on the performance of six published prognostic models. Focussing on unbiased and precise estimation of performance measures (e.g. the c-index, D statistic and calibration), we provide guidance on sample size for investigators designing an external validation study. Our study suggests that externally validating a prognostic model requires a minimum of 100 events and ideally 200 (or more) events.
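The resampling idea translates directly into code. The sketch below uses a simulated stand-in for the paper's real cohort and a hand-rolled c-statistic; it draws validation samples of increasing size and tracks how the precision of the c-index improves as events accrue.

```r
set.seed(7)
cstat <- function(y, p) {                         # rank-based concordance (AUC)
  r <- rank(p); n1 <- sum(y); n0 <- sum(!y)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
N <- 50000
x <- rnorm(N); y <- rbinom(N, 1, plogis(-2 + x))  # large "population", ~12% events
p <- plogis(-2 + x)                               # a previously developed model's predictions
spread <- sapply(c(250, 500, 1000, 2000), function(n) {
  reps <- replicate(500, { i <- sample(N, n); cstat(y[i], p[i]) })
  c(n = n, expected_events = round(n * mean(y)), sd_c = sd(reps))
})
t(spread)                                         # precision of the c-index by sample size
```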
5.
Perperoglou A, Sauerbrei W, Abrahamowicz M, Schmid M. A review of spline function procedures in R. BMC Med Res Methodol 2019; 19:46. [PMID: 30841848] [PMCID: PMC6402144] [DOI: 10.1186/s12874-019-0666-3]
Abstract
BACKGROUND With progress on both the theoretical and the computational fronts, the use of spline modelling has become an established tool in statistical regression analysis. An important issue in spline modelling is the availability of user-friendly, well-documented software packages. Following the idea of the STRengthening Analytical Thinking for Observational Studies initiative to provide users with guidance documents on the application of statistical methods in observational research, the aim of this article is to provide an overview of the most widely used spline-based techniques and their implementation in R. METHODS In this work, we focus on the R Language for Statistical Computing, which has become hugely popular statistical software. We identified a set of packages that include functions for spline modelling within a regression framework. Using simulated and real data we provide an introduction to spline modelling and an overview of the most popular spline functions. RESULTS We present a series of simple scenarios of univariate data, where different basis functions are used to identify the correct functional form of an independent variable. Even with simple data, routines from different packages can lead to different results. CONCLUSIONS This work illustrates the challenges that an analyst faces when working with data. Most differences can be attributed to the choice of hyper-parameters rather than the basis used. In fact, an experienced user will know how to obtain a reasonable outcome, regardless of the type of spline used. However, many analysts do not have sufficient knowledge to use these powerful tools adequately and will need more guidance.
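A small R sketch of the kind of comparison the review performs: the same simulated nonlinear signal fitted with two widely used routines. The hyper-parameters (the fixed df for splines::ns, mgcv's automatic smoothness selection) are exactly the choices the authors flag as decisive.

```r
library(splines)
library(mgcv)
set.seed(42)
x <- runif(300, 0, 3); y <- sin(2 * x) + rnorm(300, sd = 0.3)
fit_ns  <- lm(y ~ ns(x, df = 4))   # natural cubic spline, df fixed by the analyst
fit_gam <- gam(y ~ s(x))           # penalized spline, smoothness chosen automatically
xg <- seq(0, 3, length.out = 200)
plot(x, y, col = "grey")
lines(xg, predict(fit_ns,  data.frame(x = xg)), lwd = 2)
lines(xg, predict(fit_gam, data.frame(x = xg)), lwd = 2, lty = 2)
```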
6.
Silverman JD, Washburne AD, Mukherjee S, David LA. A phylogenetic transform enhances analysis of compositional microbiota data. eLife 2017; 6:e21887. [PMID: 28198697] [PMCID: PMC5328592] [DOI: 10.7554/elife.21887]
Abstract
Surveys of microbial communities (microbiota), typically measured as relative abundance of species, have illustrated the importance of these communities in human health and disease. Yet, statistical artifacts commonly plague the analysis of relative abundance data. Here, we introduce the PhILR transform, which incorporates microbial evolutionary models with the isometric log-ratio transform to allow off-the-shelf statistical tools to be safely applied to microbiota surveys. We demonstrate that analyses of community-level structure can be applied to PhILR transformed data with performance on benchmarks rivaling or surpassing standard tools. Additionally, by decomposing distance in the PhILR transformed space, we identified neighboring clades that may have adapted to distinct human body sites. Decomposing variance revealed that covariation of bacterial clades within human body sites increases with phylogenetic relatedness. Together, these findings illustrate how the PhILR transform combines statistical and phylogenetic models to overcome compositional data challenges and enable evolutionary insights relevant to microbial communities.
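PhILR couples an isometric log-ratio (ILR) transform to the phylogeny. As a simpler stand-in, the base-R sketch below applies the centered log-ratio (CLR), a related transform, to toy counts, to show how log-ratio coordinates leave the simplex so that standard statistical tools can be applied. This is not the PhILR transform itself, and the pseudocount handling of zeros is an illustrative choice.

```r
counts <- matrix(c(10, 90, 0, 40, 55, 5), nrow = 2, byrow = TRUE,
                 dimnames = list(c("sample1", "sample2"),
                                 c("taxonA", "taxonB", "taxonC")))
rel <- (counts + 0.5) / rowSums(counts + 0.5)  # relative abundances, pseudocount for zeros
clr <- log(rel) - rowMeans(log(rel))           # centered log-ratio per sample
clr                                            # unconstrained coordinates, off the simplex
```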
7.
Xu S, Grullon S, Ge K, Peng W. Spatial clustering for identification of ChIP-enriched regions (SICER) to map regions of histone methylation patterns in embryonic stem cells. Methods Mol Biol 2014; 1150:97-111. [PMID: 24743992] [DOI: 10.1007/978-1-4939-0512-6_5]
Abstract
Chromatin states are the key to embryonic stem cell pluripotency and differentiation. Chromatin immunoprecipitation (ChIP) followed by high-throughput sequencing (ChIP-Seq) is increasingly used to map chromatin states and to functionally annotate the genome. Many ChIP-Seq profiles, especially those of histone methylations, are noisy and diffuse. Here we describe SICER (Zang et al., Bioinformatics 25(15):1952-1958, 2009), an algorithm specifically designed to identify disperse ChIP-enriched regions with high sensitivity and specificity. This algorithm has found wide application in epigenomic studies. In this chapter, we demonstrate in detail how to run SICER to delineate ChIP-enriched regions, assess their statistical significance, and identify regions of differential enrichment when two chromatin states are compared.
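A toy base-R sketch of the windows-to-islands idea behind SICER (illustrative only; the real algorithm's window scoring, gap parameter, and significance assessment follow Zang et al. 2009): windows are marked eligible against a Poisson background, and nearby eligible windows are merged into islands, tolerating short gaps.

```r
set.seed(3)
win_counts <- rpois(200, 2)
win_counts[60:80] <- rpois(21, 8)                 # a diffuse enriched block
lambda <- mean(win_counts)                        # crude background rate
eligible <- win_counts > qpois(0.95, lambda)      # per-window eligibility
gap_ok <- 2                                       # tolerate gaps of <= 2 windows
runs <- rle(eligible)
runs$values[!runs$values & runs$lengths <= gap_ok] <- TRUE  # bridge short gaps
islands <- rle(inverse.rle(runs))                 # merged islands of eligible windows
islands
```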
8.
Benner C, Havulinna AS, Järvelin MR, Salomaa V, Ripatti S, Pirinen M. Prospects of Fine-Mapping Trait-Associated Genomic Regions by Using Summary Statistics from Genome-wide Association Studies. Am J Hum Genet 2017; 101:539-551. [PMID: 28942963] [DOI: 10.1016/j.ajhg.2017.08.012]
Abstract
During the past few years, various novel statistical methods have been developed for fine-mapping with the use of summary statistics from genome-wide association studies (GWASs). Although these approaches require information about the linkage disequilibrium (LD) between variants, there has not been a comprehensive evaluation of how estimation of the LD structure from reference genotype panels performs in comparison with that from the original individual-level GWAS data. Using population genotype data from Finland and the UK Biobank, we show here that a reference panel of 1,000 individuals from the target population is adequate for a GWAS cohort of up to 10,000 individuals, whereas smaller panels, such as those from the 1000 Genomes Project, should be avoided. We also show, both theoretically and empirically, that the size of the reference panel needs to scale with the GWAS sample size; this has important consequences for the application of these methods in ongoing GWAS meta-analyses and large biobank studies. We conclude by providing software tools and by recommending practices for sharing LD information to more efficiently exploit summary statistics in genetics research.
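The step the paper evaluates, estimating LD from a reference panel, can be sketched in base R. G_ref below is a hypothetical individuals-by-variants dosage matrix; the paper's message is that its row count must scale with the GWAS sample size.

```r
set.seed(11)
n_ref <- 1000; m <- 50
maf <- runif(m, 0.05, 0.5)
G_ref <- sapply(maf, function(p) rbinom(n_ref, 2, p))  # stand-in reference panel (0/1/2 dosages)
R <- cor(G_ref)                                        # LD (correlation) matrix for fine-mapping
R[1:5, 1:5]
```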
9.
Marsh HW, Guo J, Dicke T, Parker PD, Craven RG. Confirmatory Factor Analysis (CFA), Exploratory Structural Equation Modeling (ESEM), and Set-ESEM: Optimal Balance Between Goodness of Fit and Parsimony. Multivariate Behav Res 2020; 55:102-119. [PMID: 31204844] [DOI: 10.1080/00273171.2019.1602503]
Abstract
CFAs of multidimensional constructs often fail to meet standards of good measurement (e.g., goodness-of-fit, measurement invariance, and well-differentiated factors). Exploratory structural equation modeling (ESEM) represents a compromise between the flexibility of exploratory factor analysis (EFA) and the rigor and parsimony of CFA/SEM, but it lacks parsimony (particularly in large models) and might confound constructs that need to be kept separate. In Set-ESEM, two or more a priori sets of constructs are modeled within a single model such that cross-loadings are permissible within the same set of factors (as in Full-ESEM) but are constrained to be zero for factors in different sets (as in CFA). The different sets can reflect the same set of constructs on multiple occasions, and/or different constructs measured within the same wave. Hence, Set-ESEM represents a middle ground between the flexibility of traditional ESEM (hereafter referred to as Full-ESEM) and the rigor and parsimony of CFA/SEM. Thus, the purposes of this article are to provide an overview tutorial on Set-ESEM, to juxtapose it with Full-ESEM, and to illustrate its application with simulated data and diverse "real" data applications, with accessible, heuristic explanations of best practice.
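A minimal lavaan sketch of the CFA end of the continuum (variable names and population values are hypothetical). In Set-ESEM, cross-loadings among factors within the same set would additionally be freed, whereas this ICM-CFA fixes them all to zero, which is what often harms fit.

```r
library(lavaan)
pop <- ' f1 =~ 0.8*x1 + 0.7*x2 + 0.6*x3
         f2 =~ 0.8*x4 + 0.7*x5 + 0.6*x6
         f1 ~~ 0.4*f2 '                      # population model used to simulate data
d <- simulateData(pop, sample.nobs = 300)
model <- ' f1 =~ x1 + x2 + x3
           f2 =~ x4 + x5 + x6 '              # ICM-CFA: no cross-loadings
fit <- cfa(model, data = d)
fitMeasures(fit, c("cfi", "tli", "rmsea", "srmr"))
```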
10.
Abstract
We describe analytic approaches for study designs that, like large simple trials, can be better characterized as longitudinal studies with baseline randomization than as either a pure randomized experiment or a purely observational study. We (i) discuss the intention-to-treat effect as an effect measure for randomized studies, (ii) provide a formal definition of causal effect for longitudinal studies, (iii) describe several methods -- based on inverse probability weighting and g-estimation -- to estimate such effects, (iv) present an application of these methods to a naturalistic trial of antipsychotics on symptom severity of schizophrenia, and (v) discuss the relative advantages and disadvantages of each method.
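A minimal sketch of the inverse-probability-weighting idea from point (iii), reduced to a single time point with illustrative names and effect sizes (the paper's setting is longitudinal, with time-varying weights): weight each subject by the inverse probability of the treatment actually received, then fit a weighted outcome model.

```r
set.seed(5)
n <- 2000
L <- rnorm(n)                                  # baseline confounder
A <- rbinom(n, 1, plogis(0.8 * L))             # treatment depends on L
Y <- 1 + 2 * A + L + rnorm(n)                  # outcome; true effect of A is 2
ps <- fitted(glm(A ~ L, family = binomial))    # propensity score
w  <- ifelse(A == 1, 1 / ps, 1 / (1 - ps))     # inverse-probability weights
coef(lm(Y ~ A, weights = w))["A"]              # IPW estimate of the causal effect
```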
11.
Sauerbrei W, Abrahamowicz M, Altman DG, le Cessie S, Carpenter J, on behalf of the STRATOS initiative. STRengthening analytical thinking for observational studies: the STRATOS initiative. Stat Med 2014; 33:5413-32. [PMID: 25074480] [PMCID: PMC4320765] [DOI: 10.1002/sim.6265]
Abstract
The validity and practical utility of observational medical research depends critically on good study design, excellent data quality, appropriate statistical methods and accurate interpretation of results. Statistical methodology has seen substantial development in recent times. Unfortunately, many of these methodological developments are ignored in practice. Consequently, design and analysis of observational studies often exhibit serious weaknesses. The lack of guidance on vital practical issues discourages many applied researchers from using more sophisticated and possibly more appropriate methods when analyzing observational studies. Furthermore, many analyses are conducted by researchers with a relatively weak statistical background and limited experience in using statistical methodology and software. Consequently, even 'standard' analyses reported in the medical literature are often flawed, casting doubt on their results and conclusions. An efficient way to help researchers to keep up with recent methodological developments is to develop guidance documents that are spread to the research community at large. These observations led to the initiation of the strengthening analytical thinking for observational studies (STRATOS) initiative, a large collaboration of experts in many different areas of biostatistical research. The objective of STRATOS is to provide accessible and accurate guidance in the design and analysis of observational studies. The guidance is intended for applied statisticians and other data analysts with varying levels of statistical education, experience and interests. In this article, we introduce the STRATOS initiative and its main aims, present the need for guidance documents and outline the planned approach and progress so far. We encourage other biostatisticians to become involved.
12.
Nakas CT, Alonzo TA, Yiannoutsos CT. Accuracy and cut-off point selection in three-class classification problems using a generalization of the Youden index. Stat Med 2010; 29:2946-55. [PMID: 20809485] [PMCID: PMC2991472] [DOI: 10.1002/sim.4044]
Abstract
We study properties of the index J3, defined as the accuracy, or the maximum correct classification, for a given three-class classification problem. Specifically, using J3 one can assess the discrimination between the three distributions and obtain an optimal pair of cut-off points c1 < c2.
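A base-R sketch of the quantity as described in the abstract: search cut-off pairs (c1, c2) for the maximum correct classification over three ordered classes. Scaling conventions for J3 vary, and the marker distributions here are simulated illustrations.

```r
set.seed(9)
x1 <- rnorm(100, 0); x2 <- rnorm(100, 1.5); x3 <- rnorm(100, 3)  # class-wise marker values
cand <- sort(unique(c(x1, x2, x3)))
grid <- expand.grid(c1 = cand, c2 = cand)
grid <- grid[grid$c1 < grid$c2, ]
acc <- with(grid, mapply(function(a, b)
  mean(x1 <= a) + mean(x2 > a & x2 <= b) + mean(x3 > b), c1, c2) / 3)
best <- grid[which.max(acc), ]
c(J3 = max(acc), unlist(best))      # accuracy and the optimal (c1, c2)
```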
13.
Porter HF, O'Reilly PF. Multivariate simulation framework reveals performance of multi-trait GWAS methods. Sci Rep 2017; 7:38837. [PMID: 28287610] [PMCID: PMC5347376] [DOI: 10.1038/srep38837]
Abstract
Burgeoning availability of genome-wide association study (GWAS) results and national biobank data has led to growing interest in performing multi-trait genetic analyses. Numerous multi-trait GWAS methods that exploit either summary statistics or individual-level data have been developed, but their relative performance is unclear. Here we develop a simulation framework to model the complex networks underlying multivariate genetic epidemiology, enabling the vast model space of genetic effects on multiple correlated traits to be explored systematically. We perform a comprehensive comparison of the leading multi-trait GWAS methods, finding: (1) method performance is highly sensitive to the specific combination of genetic effects and phenotypic correlations, (2) most of the current multivariate methods have remarkably similar statistical power, and (3) multivariate methods may offer a substantial increase in the discovery of genetic variants over the standard univariate approach. We believe our findings offer the clearest picture to date of the relative performance of multi-trait GWAS methods and act as a guide for method selection. We provide a web application and open-source software program implementing our simulation framework, for: (i) further benchmarking of multivariate GWAS methods, (ii) power calculations for multivariate genetic studies, and (iii) generating data for testing any multivariate method in genetic epidemiology.
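A minimal sketch of the kind of data such a framework generates, assuming the MASS package for correlated residuals (all parameter choices illustrative): a variant affects two of three correlated traits, and a univariate scan per trait is contrasted with a simple joint multivariate test.

```r
library(MASS)
set.seed(2017)
n <- 5000
g <- rbinom(n, 2, 0.3)                                   # genotype dosage, MAF 0.3
Sigma <- matrix(0.4, 3, 3); diag(Sigma) <- 1             # residual trait correlation
beta <- c(0.08, 0.05, 0)                                 # SNP affects traits 1-2 only
Y <- g %*% t(beta) + mvrnorm(n, mu = rep(0, 3), Sigma = Sigma)
sapply(1:3, function(j) summary(lm(Y[, j] ~ g))$coefficients[2, 4])  # univariate p-values
anova(lm(Y ~ g), lm(Y ~ 1))                              # joint multivariate test
```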
14.
Zhang X, Zhang MJ, Fine J. A proportional hazards regression model for the subdistribution with right-censored and left-truncated competing risks data. Stat Med 2011; 30:1933-51. [PMID: 21557288] [PMCID: PMC3408877] [DOI: 10.1002/sim.4264]
Abstract
With competing risks failure time data, one often needs to assess the covariate effects on the cumulative incidence probabilities. Fine and Gray proposed a proportional hazards regression model to directly model the subdistribution of a competing risk. They developed the estimating procedure for right-censored competing risks data, based on the inverse probability of censoring weighting. Right-censored and left-truncated competing risks data sometimes occur in biomedical research. In this paper, we study the proportional hazards regression model for the subdistribution of a competing risk with right-censored and left-truncated data. We adopt a new weighting technique to estimate the parameters in this model. We have derived the large-sample properties of the proposed estimators. To illustrate the application of the new method, we analyze failure time data for children with acute leukemia. In this example, the failure times for children who had bone marrow transplants were left truncated.
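For the right-censored case, the Fine-Gray model can be fitted with cmprsk::crr, sketched below on simulated placeholder data; note that crr itself does not handle left truncation, which is precisely the gap this paper's weighting technique addresses.

```r
library(cmprsk)
set.seed(21)
n <- 400
x <- rnorm(n)
cause <- sample(0:2, n, replace = TRUE, prob = c(0.2, 0.5, 0.3))  # 0 = censored
ftime <- rexp(n, rate = exp(0.3 * x * (cause == 1)))              # toy event times
fit <- crr(ftime, fstatus = cause, cov1 = x)   # subdistribution hazard; defaults:
summary(fit)                                   # failcode = 1, cencode = 0
```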
15.
Abstract
We examine the practicality of propensity score methods for estimating causal treatment effects conditional on intermediate post-treatment outcomes (principal effects) in the context of randomized experiments. In particular, we focus on the sensitivity of principal causal effect estimates to violation of principal ignorability, which is the primary assumption that underlies the use of propensity score methods to estimate principal effects. Under principal ignorability (PI), principal strata membership is conditionally independent of the potential outcome under control given the pre-treatment covariates; i.e., there are no differences in the potential outcomes under control across principal strata given the observed pre-treatment covariates. Under this assumption, principal scores modeling principal strata membership can be estimated based solely on the observed covariates and used to predict strata membership and estimate principal effects. While this assumption underlies the use of propensity scores in this setting, sensitivity to violations of it has not been studied rigorously. In this paper, we explicitly define PI using the outcome model (although we do not actually use this outcome model in estimating principal scores) and systematically examine how deviations from the assumption affect estimates, including how the strength of association between principal stratum membership and covariates modifies the performance. We find that when PI is violated, very strong covariate predictors of stratum membership are needed to yield accurate estimates of principal effects.
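A minimal sketch of principal-score estimation under PI together with a monotonicity-type assumption, so that stratum membership is observed in one arm (everything here, including names, is illustrative): fit a logistic model in the treatment arm and predict membership probabilities for all units.

```r
set.seed(16)
n <- 3000
x1 <- rnorm(n); x2 <- rbinom(n, 1, 0.5)
z  <- rbinom(n, 1, 0.5)                                  # randomized treatment
stratum <- rbinom(n, 1, plogis(1.2 * x1 - 0.5 + x2))     # latent: 1 = complier
d <- data.frame(x1, x2, z,
                s_obs = ifelse(z == 1, stratum, NA))     # observed only if treated
ps_fit <- glm(s_obs ~ x1 + x2, family = binomial, data = d, subset = z == 1)
d$pscore <- predict(ps_fit, newdata = d, type = "response")  # principal score for everyone
head(d)
```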
16.
Wang Y, Fang X, An F, Wang G, Zhang X. Improvement of antibiotic activity of Xenorhabdus bovienii by medium optimization using response surface methodology. Microb Cell Fact 2011; 10:98. [PMID: 22082189] [PMCID: PMC3227641] [DOI: 10.1186/1475-2859-10-98]
Abstract
BACKGROUND The production of secondary metabolites with antibiotic properties is a common characteristic of the entomopathogenic bacteria Xenorhabdus spp. These metabolites not only have diverse chemical structures but also a wide range of bioactivities of medicinal and agricultural interest, such as antibiotic, antimycotic, insecticidal, nematicidal, antiulcer, antineoplastic and antiviral activity. It has been known that cultivation parameters are critical to the secondary metabolites produced by microorganisms. Even small changes in the culture medium may impact not only the quantity of certain compounds but also the general metabolic profile of microorganisms. Manipulating nutritional or environmental factors can promote the biosynthesis of secondary metabolites and thus facilitate the discovery of new natural products. This work was conducted to evaluate the influence of nutrition on the antibiotic production of X. bovienii YL002 and to optimize the medium to maximize its antibiotic production. RESULTS Nutrition has a strong influence on the antibiotic production of X. bovienii YL002. Using a one-factor-at-a-time approach, glycerol and soytone were identified as the carbon and nitrogen sources that most significantly affected antibiotic production. Response surface methodology (RSM) was applied to optimize the medium constituents (glycerol, soytone and minerals) for the antibiotic production of X. bovienii YL002. Higher antibiotic activity (337.5 U/mL) was obtained after optimization. The optimal levels of medium components were (g/L): glycerol 6.90, soytone 25.17, MgSO4·7H2O 1.57, (NH4)2SO4 2.55, KH2PO4 0.87, K2HPO4 1.11 and Na2SO4 1.81. An overall increase of 37.8% in the antibiotic activity of X. bovienii YL002 was obtained compared with that of the original medium. CONCLUSIONS To the best of our knowledge, there are no reports on antibiotic production of X. bovienii by medium optimization using RSM. The results strongly support the use of RSM for medium optimization. The optimized medium not only resulted in a 37.8% increase in antibiotic activity, but the approach also reduced the number of experiments required. The chosen method of medium optimization was efficient, simple and less time-consuming. This work will be useful for the development of an X. bovienii cultivation process for efficient antibiotic production on a large scale, and for the development of more advanced control strategies for plant diseases.
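A base-R sketch of the RSM step in miniature: fit a second-order model in coded factors and solve for the stationary point. This two-factor toy with simulated yields only illustrates the mechanics; the study optimized seven medium components.

```r
set.seed(10)
x1 <- runif(30, -1, 1); x2 <- runif(30, -1, 1)                 # two coded factors
y  <- 300 + 20 * x1 + 10 * x2 - 25 * x1^2 - 15 * x2^2 + rnorm(30, sd = 5)
fit <- lm(y ~ x1 + x2 + I(x1^2) + I(x2^2) + x1:x2)             # second-order model
b <- coef(fit)
B <- matrix(c(2 * b["I(x1^2)"], b["x1:x2"],
              b["x1:x2"], 2 * b["I(x2^2)"]), 2)
opt <- solve(B, -b[c("x1", "x2")])                             # stationary point
opt                                                            # coded optimum; back-transform to g/L
```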
17.
Michel MC, Murphy TJ, Motulsky HJ. New Author Guidelines for Displaying Data and Reporting Data Analysis and Statistical Methods in Experimental Biology. J Pharmacol Exp Ther 2020; 372:136-147. [PMID: 31884418] [DOI: 10.1124/jpet.119.264143]
Abstract
The American Society for Pharmacology and Experimental Therapeutics has revised the Instructions to Authors for Drug Metabolism and Disposition, Journal of Pharmacology and Experimental Therapeutics, and Molecular Pharmacology. These revisions relate to data analysis (including statistical analysis) and reporting but do not tell investigators how to design and perform their experiments. Their overall focus is on greater granularity in the description of what has been done and found. Key recommendations include the need to differentiate between preplanned, hypothesis-testing, and exploratory experiments or studies; explanations of whether key elements of study design, such as sample size and choice of specific statistical tests, had been specified before any data were obtained or adapted thereafter; and explanation of whether any outliers (data points or entire experiments) were eliminated and when the rules for doing so had been defined. Variability should be described by S.D. or interquartile range, and precision should be described by confidence intervals; S.E. should not be used. P values should be used sparingly; in most cases, reporting differences or ratios (effect sizes) with their confidence intervals will be preferred. Depiction of data in figures should provide as much granularity as possible, e.g., by replacing bar graphs with scatter plots wherever feasible and violin or box-and-whisker plots when not. This editorial explains the revisions and the underlying scientific rationale. We believe that these revised guidelines will lead to less biased and more transparent reporting of research findings.
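A small sketch of the recommended reporting style on toy data: an effect size with its 95% confidence interval rather than an S.E. or a bare P value, and a scatter of the raw points instead of a bar graph.

```r
set.seed(8)
ctrl <- rnorm(8, 100, 15); treat <- rnorm(8, 120, 15)   # toy measurements
tt <- t.test(treat, ctrl)
round(c(difference = unname(diff(rev(tt$estimate))),    # effect size
        lower = tt$conf.int[1], upper = tt$conf.int[2]), 1)
stripchart(list(control = ctrl, treated = treat),       # raw points, not bars
           vertical = TRUE, method = "jitter", pch = 16)
```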
18.
Kaizer AM, Koopmeiners JS, Hobbs BP. Bayesian hierarchical modeling based on multisource exchangeability. Biostatistics 2018; 19:169-184. [PMID: 29036300] [PMCID: PMC5862286] [DOI: 10.1093/biostatistics/kxx031]
Abstract
Bayesian hierarchical models produce shrinkage estimators that can be used as the basis for integrating supplementary data into the analysis of a primary data source. Established approaches should be considered limited, however, because posterior estimation either requires prespecification of a shrinkage weight for each source or relies on the data to inform a single parameter, which determines the extent of influence or shrinkage from all sources, risking considerable bias or minimal borrowing. We introduce multisource exchangeability models (MEMs), a general Bayesian approach for integrating multiple, potentially non-exchangeable, supplemental data sources into the analysis of a primary data source. Our proposed modeling framework yields source-specific smoothing parameters that can be estimated in the presence of the data to facilitate a dynamic multi-resolution smoothed estimator that is asymptotically consistent while reducing the dimensionality of the prior space. When compared with competing Bayesian hierarchical modeling strategies, we demonstrate that MEMs achieve approximately 2.2 times larger median effective supplemental sample size when the supplemental data sources are exchangeable as well as a 56% reduction in bias when there is heterogeneity among the supplemental sources. We illustrate the application of MEMs using a recently completed randomized trial of very low nicotine content cigarettes, which resulted in a 30% improvement in efficiency compared with the standard analysis.
19.
Ishwaran H, Blackstone EH, Apperson-Hansen C, Rice TW. A novel approach to cancer staging: application to esophageal cancer. Biostatistics 2009; 10:603-20. [PMID: 19502615] [PMCID: PMC3590074] [DOI: 10.1093/biostatistics/kxp016]
Abstract
A novel 3-step random forests methodology involving survival data (survival forests), ordinal data (multiclass forests), and continuous data (regression forests) is introduced for cancer staging. The methodology is illustrated for esophageal cancer using worldwide esophageal cancer collaboration data involving 4627 patients.
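A sketch of the first (survival forest) step using randomForestSRC, the first author's package; the chained multiclass and regression forests are omitted, and the classic veteran lung-cancer data shipped with the package stands in for the esophageal cohort (an assumption for illustration, not the paper's data).

```r
library(randomForestSRC)
data(veteran, package = "randomForestSRC")
fit <- rfsrc(Surv(time, status) ~ ., data = veteran, ntree = 500)  # survival forest
fit                          # prints OOB error rate (1 - Harrell's C)
head(fit$predicted.oob)      # out-of-bag ensemble mortality predictions
```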
20.
Bruns H, Lozanovski VJ, Schultze D, Hillebrand N, Hinz U, Büchler MW, Schemmer P. Prediction of postoperative mortality in liver transplantation in the era of MELD-based liver allocation: a multivariate analysis. PLoS One 2014; 9:e98782. [PMID: 24905210] [PMCID: PMC4048202] [DOI: 10.1371/journal.pone.0098782]
Abstract
BACKGROUND AND AIMS Liver transplantation is the only curative treatment for end-stage liver disease. While waiting list mortality can be predicted by the MELD-score, reliable scoring systems for the postoperative period do not exist. This study's objective was to identify risk factors that contribute to postoperative mortality. METHODS Between December 2006 and March 2011, 429 patients underwent liver transplantation in our department. Risk factors for postoperative mortality in 266 consecutive liver transplantations were identified using univariate and multivariate analyses. Patients who were <18 years, HU-listings, and split-, living related, combined or re-transplantations were excluded from the analysis. The correlation between number of risk factors and mortality was analyzed. RESULTS A labMELD ≥20, female sex, coronary heart disease, donor risk index >1.5 and donor Na+>145 mmol/L were identified to be independent predictive factors for postoperative mortality. With increasing number of these risk-factors, postoperative 90-day and 1-year mortality increased (0-1: 0 and 0%; 2: 2.9 and 17.4%; 3: 5.6 and 16.8%; 4: 22.2 and 33.3%; 5-6: 60.9 and 66.2%). CONCLUSIONS In this analysis, a simple score was derived that adequately identified patients at risk after liver transplantation. Opening a discussion on the inclusion of these parameters in the process of organ allocation may be a worthwhile venture.
21.
Abstract
We consider the estimation of an optimal dynamic two time-point treatment rule defined as the rule that maximizes the mean outcome under the dynamic treatment, where the candidate rules are restricted to depend only on a user-supplied subset of the baseline and intermediate covariates. This estimation problem is addressed in a statistical model for the data distribution that is nonparametric, beyond possible knowledge about the treatment and censoring mechanisms. We propose data adaptive estimators of this optimal dynamic regime which are defined by sequential loss-based learning under both the blip function and weighted classification frameworks. Rather than a priori selecting an estimation framework and algorithm, we propose combining estimators from both frameworks using a super-learning based cross-validation selector that seeks to minimize an appropriate cross-validated risk. The resulting selector is guaranteed to asymptotically perform as well as the best convex combination of candidate algorithms in terms of loss-based dissimilarity under conditions. We offer simulation results to support our theoretical findings.
22.
Bellan SE, Pulliam JRC, Pearson CAB, Champredon D, Fox SJ, Skrip L, Galvani AP, Gambhir M, Lopman BA, Porco TC, Meyers LA, Dushoff J. Statistical power and validity of Ebola vaccine trials in Sierra Leone: a simulation study of trial design and analysis. Lancet Infect Dis 2015; 15:703-10. [PMID: 25886798] [PMCID: PMC4815262] [DOI: 10.1016/s1473-3099(15)70139-8]
Abstract
BACKGROUND Safe and effective vaccines could help to end the ongoing Ebola virus disease epidemic in parts of west Africa, and mitigate future outbreaks of the virus. We assess the statistical validity and power of randomised controlled trial (RCT) and stepped-wedge cluster trial (SWCT) designs in Sierra Leone, where the incidence of Ebola virus disease is spatiotemporally heterogeneous, and is decreasing rapidly. METHODS We projected district-level Ebola virus disease incidence for the next 6 months, using a stochastic model fitted to data from Sierra Leone. We then simulated RCT and SWCT designs in trial populations comprising geographically distinct clusters at high risk, taking into account realistic logistical constraints, and both individual-level and cluster-level variations in risk. We assessed false-positive rates and power for parametric and non-parametric analyses of simulated trial data, across a range of vaccine efficacies and trial start dates. FINDINGS For an SWCT, regional variation in Ebola virus disease incidence trends produced increased false-positive rates (up to 0.15 at α=0.05) under standard statistical models, but not when analysed by a permutation test, whereas analyses of RCTs remained statistically valid under all models. With the assumption of a 6-month trial starting on Feb 18, 2015, we estimate the power to detect a 90% effective vaccine to be between 49% and 89% for an RCT, and between 6% and 26% for an SWCT, depending on the Ebola virus disease incidence within the trial population. We estimate that a 1-month delay in trial initiation will reduce the power of the RCT by 20% and that of the SWCT by 49%. INTERPRETATION Spatiotemporal variation in infection risk undermines the statistical power of the SWCT. This variation also undercuts the SWCT's expected ethical advantages over the RCT, because an RCT, but not an SWCT, can prioritise vaccination of high-risk clusters. FUNDING US National Institutes of Health, US National Science Foundation, and Canadian Institutes of Health Research.
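A base-R miniature of the paper's simulation logic for the RCT arm of the comparison: simulate cluster-randomized trials under spatially heterogeneous incidence and take power as the rejection fraction. All rates and sizes are illustrative, not the fitted Sierra Leone projections.

```r
set.seed(2015)
power_sim <- function(VE, n_clusters = 20, n_per = 500, reps = 500) {
  mean(replicate(reps, {
    base_risk <- rlnorm(n_clusters, log(0.02), 0.8)      # heterogeneous cluster risk
    vacc <- sample(rep(c(0, 1), n_clusters / 2))         # randomize clusters 1:1
    cases <- rbinom(n_clusters, n_per,
                    base_risk * ifelse(vacc == 1, 1 - VE, 1))
    fit <- glm(cbind(cases, n_per - cases) ~ vacc, family = binomial)
    summary(fit)$coefficients["vacc", 4] < 0.05          # trial "significant"?
  }))
}
power_sim(VE = 0.9)
```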
23.
Chen H, Strand M, Norenburg JL, Sun S, Kajihara H, Chernyshev AV, Maslakova SA, Sundberg P. Statistical parsimony networks and species assemblages in cephalotrichid nemerteans (Nemertea). PLoS One 2010; 5:e12885. [PMID: 20877627] [PMCID: PMC2943479] [DOI: 10.1371/journal.pone.0012885]
Abstract
BACKGROUND It has been suggested that statistical parsimony network analysis could be used to get an indication of species represented in a set of nucleotide data, and the approach has been used to discuss species boundaries in some taxa. METHODOLOGY/PRINCIPAL FINDINGS Based on 635 base pairs of the mitochondrial protein-coding gene cytochrome c oxidase I (COI), we analyzed 152 nemertean specimens using statistical parsimony network analysis with the connection probability set to 95%. The analysis revealed 15 distinct networks together with seven singletons. Statistical parsimony yielded three networks supporting the species status of Cephalothrix rufifrons, C. major and C. spiralis as they currently have been delineated by morphological characters and geographical location. Many other networks contained haplotypes from nearby geographical locations. Cladistic structure by maximum likelihood analysis overall supported the network analysis, but indicated a false positive result where subnetworks should have been connected into one network/species. This probably is caused by undersampling of the intraspecific haplotype diversity. CONCLUSIONS/SIGNIFICANCE Statistical parsimony network analysis provides a rapid and useful tool for detecting possible undescribed/cryptic species among cephalotrichid nemerteans based on COI gene. It should be combined with phylogenetic analysis to get indications of false positive results, i.e., subnetworks that would have been connected with more extensive haplotype sampling.
24.
van Eenige R, Verhave PS, Koemans PJ, Tiebosch IACW, Rensen PCN, Kooijman S. RandoMice, a novel, user-friendly randomization tool in animal research. PLoS One 2020; 15:e0237096. [PMID: 32756603] [PMCID: PMC7406044] [DOI: 10.1371/journal.pone.0237096]
Abstract
Careful design of experiments using living organisms (e.g. mice) is of critical importance from both an ethical and a scientific standpoint. Randomization should, whenever possible, be an integral part of such experimental design to reduce bias thereby increasing its reliability and reproducibility. To keep the sample size as low as possible, one might take randomization one step further by controlling for baseline variations in the dependent variable(s) and/or certain known covariates. To give an example, in animal experiments aimed to study atherosclerosis development, one would want to control for baseline characteristics such as plasma triglyceride and total cholesterol levels and body weight. This can be done by first defining blocks to create balance among groups in terms of group size and baseline characteristics, followed by random assignment of the blocks to the various control and intervention groups. In the current study we developed a novel, user-friendly tool that allows users to easily randomize animals into blocks and identify random block divisions that are well-balanced based on given baseline characteristics, making randomization time-efficient and easy-to-use. Here, we present the resulting software tool that we have named RandoMice.
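A base-R sketch of what a tool like RandoMice automates: generate many candidate random divisions and keep the one best balanced on baseline covariates. The scoring rule here (summed SD of standardized group means) is an illustrative choice, not RandoMice's own criterion.

```r
set.seed(6)
n <- 24; groups <- rep(1:3, each = 8)
baseline <- data.frame(weight = rnorm(n, 25, 2),        # toy baseline covariates
                       chol   = rnorm(n, 2.5, 0.4))
score <- function(g) sum(apply(baseline, 2, function(v)
  sd(tapply(as.numeric(scale(v)), g, mean))))           # imbalance across groups
candidates <- replicate(1000, sample(groups), simplify = FALSE)
best <- candidates[[which.min(sapply(candidates, score))]]
tapply(baseline$weight, best, mean)                     # near-identical group means
```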
25.
Tang ZZ, Chen G. Zero-inflated generalized Dirichlet multinomial regression model for microbiome compositional data analysis. Biostatistics 2019; 20:698-713. [PMID: 29939212] [PMCID: PMC7410344] [DOI: 10.1093/biostatistics/kxy025]
Abstract
There is heightened interest in using high-throughput sequencing technologies to quantify abundances of microbial taxa and linking the abundance to human diseases and traits. Proper modeling of multivariate taxon counts is essential to the power of detecting this association. Existing models are limited in handling excessive zero observations in taxon counts and in flexibly accommodating complex correlation structures and dispersion patterns among taxa. In this article, we develop a new probability distribution, zero-inflated generalized Dirichlet multinomial (ZIGDM), that overcomes these limitations in modeling multivariate taxon counts. Based on this distribution, we propose a ZIGDM regression model to link microbial abundances to covariates (e.g. disease status) and develop a fast expectation-maximization algorithm to efficiently estimate parameters in the model. The derived tests enable us to reveal rich patterns of variation in microbial compositions including differential mean and dispersion. The advantages of the proposed methods are demonstrated through simulation studies and an analysis of a gut microbiome dataset.
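A base-R sketch of the kind of data ZIGDM is built for: Dirichlet-multinomial taxon counts with added structural zeros. All generating values are illustrative, and the actual model fitting uses the authors' EM algorithm, not this simulation.

```r
set.seed(13)
n <- 100; K <- 5; depth <- 1000
alpha <- c(8, 4, 2, 1, 0.5)                             # Dirichlet concentration
Y <- t(sapply(1:n, function(i) {
  p <- rgamma(K, alpha); p <- p / sum(p)                # Dirichlet draw
  zero <- rbinom(K, 1, 0.3); p <- p * (1 - zero)        # structural zero-inflation
  if (sum(p) == 0) p[1] <- 1                            # guard: keep one taxon present
  rmultinom(1, depth, p / sum(p))                       # sequencing counts
}))
colMeans(Y == 0)                                        # excess zero fractions per taxon
```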