1. How Trustworthy Is Your Tree? Bayesian Phylogenetic Effective Sample Size Through the Lens of Monte Carlo Error. Bayesian Analysis 2024; 19:565-593. PMID: 38665694; PMCID: PMC11042687; DOI: 10.1214/22-ba1339.
Abstract
Bayesian inference is a popular and widely used approach to infer phylogenies (evolutionary trees). However, despite decades of widespread application, it remains difficult to judge how well a given Bayesian Markov chain Monte Carlo (MCMC) run explores the space of phylogenetic trees. In this paper, we investigate the Monte Carlo error of phylogenies, focusing on high-dimensional summaries of the posterior distribution, including variability in estimated edge/branch (known in phylogenetics as "split") probabilities and tree probabilities, and variability in the estimated summary tree. Specifically, we ask whether any measure of effective sample size (ESS) applicable to phylogenetic trees can capture the Monte Carlo error of these three summary measures. We find that some ESS measures can capture the error inherent in using MCMC samples to approximate the posterior distributions on phylogenies. We term these tree ESS measures and identify a set of three that are useful in practice for assessing the Monte Carlo error. Lastly, we present visualization tools that can improve comparisons between multiple independent MCMC runs by accounting for the Monte Carlo error present in each chain. Our results indicate that common post-MCMC workflows are insufficient to capture the inherent Monte Carlo error of the tree, and highlight the need for both within-chain mixing and between-chain convergence assessments.
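For intuition (this is not the paper's own set of tree ESS measures): a split's occurrence can be tracked as a binary trace over MCMC samples, and a standard autocorrelation-based ESS applied to that trace. A minimal sketch, with `ess` a hypothetical helper using the common initial-positive-sequence truncation:

```python
import numpy as np

def ess(trace):
    """Autocorrelation-based effective sample size of a 1-D trace,
    using initial-positive-sequence truncation of the autocorrelation sum."""
    x = np.asarray(trace, dtype=float)
    n = len(x)
    x = x - x.mean()
    var = x.var()
    if var == 0.0:  # constant trace carries no mixing information
        return float(n)
    # sample autocorrelation at lags 1..n-1
    acf = np.array([np.dot(x[:n - k], x[k:]) / (n * var) for k in range(1, n)])
    tau = 1.0
    for rho in acf:
        if rho <= 0.0:  # truncate at the first non-positive autocorrelation
            break
        tau += 2.0 * rho
    return n / tau

# A split indicator trace: 1 when the sampled tree contains the split.
rng = np.random.default_rng(0)
iid = rng.integers(0, 2, size=1000)              # well-mixed chain
sticky = np.repeat(rng.integers(0, 2, 200), 5)   # autocorrelated chain
print(ess(iid), ess(sticky))  # the sticky chain has a much smaller ESS
```

Applied per split, this yields one ESS value per split probability; the paper's point is that not every such summary-level ESS tracks the Monte Carlo error of the tree equally well.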
2. Scalable Bayesian Divergence Time Estimation With Ratio Transformations. Syst Biol 2023; 72:1136-1153. PMID: 37458991; PMCID: PMC10636426; DOI: 10.1093/sysbio/syad039.
Abstract
Divergence time estimation is crucial for providing temporal signals to date biologically important events, from species divergences to viral transmissions in space and time. With the advent of high-throughput sequencing, recent Bayesian phylogenetic studies have analyzed hundreds to thousands of sequences. Such large-scale analyses challenge divergence time reconstruction by requiring inference on highly correlated internal node heights, which often becomes computationally infeasible. To overcome this limitation, we explore a ratio transformation that maps the original $N-1$ internal node heights into a space of one height parameter and $N-2$ ratio parameters. To make the analyses scalable, we develop a collection of linear-time algorithms to compute the gradient and Jacobian-associated terms of the log-likelihood with respect to these ratios. We then apply Hamiltonian Monte Carlo sampling with the ratio transform in a Bayesian framework to learn the divergence times in 4 pathogenic viruses (West Nile virus, rabies virus, Lassa virus, and Ebola virus) and the coralline red algae. Our method both resolves a mixing issue in the West Nile virus example and improves inference efficiency by at least 5-fold for the Lassa and rabies virus examples as well as for the algae example. It also makes it computationally feasible to incorporate mixed-effects molecular clock models for the Ebola virus example, confirms the findings of the original study, and reveals clearer multimodal distributions of the divergence times of some clades of interest.
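The transform itself is simple to state: keep the root height, and replace every other internal node height by its ratio to its parent's height. A minimal round-trip sketch on a fixed parent array (the tree encoding and node indexing here are illustrative, not the paper's implementation):

```python
import numpy as np

# Internal nodes 0..N-2, node 0 is the root; parent[i] gives i's parent.
# heights must satisfy heights[i] < heights[parent[i]] (parents are older),
# and parents must precede children in index order.
parent = np.array([-1, 0, 0, 1])           # a small 4-internal-node tree
heights = np.array([10.0, 6.0, 4.0, 3.0])

def to_ratios(heights, parent):
    """Map node heights to (root height, ratios in (0, 1])."""
    out = heights.copy()
    for i in range(1, len(heights)):
        out[i] = heights[i] / heights[parent[i]]
    return out

def from_ratios(ratios, parent):
    """Invert the map, rebuilding heights top-down from the root."""
    out = ratios.copy()
    for i in range(1, len(ratios)):
        out[i] = ratios[i] * out[parent[i]]
    return out

r = to_ratios(heights, parent)
assert np.allclose(from_ratios(r, parent), heights)  # round-trips exactly
```

The appeal is that the ratios are bounded in (0, 1] and decorrelated relative to the raw heights, which is what makes gradient-based samplers such as Hamiltonian Monte Carlo effective in this space.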
3. Low-coverage whole genome sequencing for highly accurate population assignment: Mapping migratory connectivity in the American Redstart (Setophaga ruticilla). Mol Ecol 2023; 32:5528-5540. PMID: 37706673; DOI: 10.1111/mec.17137.
Abstract
Understanding the geographic linkages among populations across the annual cycle is an essential component for understanding the ecology and evolution of migratory species and for facilitating their effective conservation. While genetic markers have been widely applied to describe migratory connections, the rapid development of new sequencing methods, such as low-coverage whole genome sequencing (lcWGS), provides new opportunities for improved estimates of migratory connectivity. Here, we use lcWGS to identify fine-scale population structure in a widespread songbird, the American Redstart (Setophaga ruticilla), and accurately assign individuals to genetically distinct breeding populations. Assignment of individuals from the nonbreeding range reveals population-specific patterns of varying migratory connectivity. By combining migratory connectivity results with demographic analysis of population abundance and trends, we consider full annual cycle conservation strategies for preserving numbers of individuals and genetic diversity. Notably, we highlight the importance of the Northern Temperate-Greater Antilles migratory population as containing the largest proportion of individuals in the species. Finally, we highlight valuable considerations for other population assignment studies aimed at using lcWGS. Our results have broad implications for improving our understanding of the ecology and evolution of migratory species through conservation genomics approaches.
4. Power and sample size for observational studies of point exposure effects. Biometrics 2022; 78:388-398. PMID: 33226116; PMCID: PMC8141060; DOI: 10.1111/biom.13405.
Abstract
Inverse probability of treatment weights (IPTWs) are commonly used to control for confounding when estimating causal effects of point exposures from observational data. When planning a study that will be analyzed with IPTWs, determining the required sample size for a given level of statistical power is challenging because of the effect of weighting on the variance of the estimated causal means. This paper considers the utility of the design effect to quantify the effect of weighting on the precision of causal estimates. The design effect is defined as the ratio of the variance of the causal mean estimator to the variance of a naïve estimator if, counter to fact, no confounding had been present and weights were not needed. A simple, closed-form approximation of the design effect is derived that is outcome invariant and can be estimated during the study design phase. Once the design effect is approximated for each treatment group, sample size calculations proceed as for a randomized trial, but with variances inflated by the design effects to account for weighting. Simulations demonstrate the accuracy of the design effect approximation, and practical considerations are discussed.
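The paper derives its own outcome-invariant approximation; a closely related and widely used quantity is the Kish design effect, n·Σw²/(Σw)², which likewise inflates the per-group sample size planned as for a randomized trial. A sketch under that simpler approximation (the propensity scores below are made up for illustration):

```python
import numpy as np

def kish_design_effect(w):
    """Kish's approximate design effect for a set of weights."""
    w = np.asarray(w, dtype=float)
    return len(w) * np.sum(w**2) / np.sum(w)**2

# Hypothetical IPTWs for one treatment group: w_i = 1 / e(x_i),
# where e(x_i) is the estimated propensity score.
ps = np.array([0.8, 0.5, 0.4, 0.25, 0.2])
w = 1.0 / ps
de = kish_design_effect(w)

# Sample size planned as for a randomized trial, inflated by the DE.
n_rct = 200
n_weighted = int(np.ceil(n_rct * de))
print(de, n_weighted)
```

Equal weights give a design effect of exactly 1 (no inflation); the more variable the weights, the larger the required sample size.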
5. Propensity-score-based meta-analytic predictive prior for incorporating real-world and historical data. Stat Med 2021; 40:4794-4808. PMID: 34126656; DOI: 10.1002/sim.9095.
Abstract
As the availability of real-world data sources (e.g., EHRs, claims data, registries) and historical data has surged in recent years, there is increasing interest and need from investigators and health authorities to leverage all available information to reduce patient burden and accelerate both drug development and regulatory decision making. Bayesian meta-analytic approaches are a popular historical borrowing method developed to leverage such data using robust hierarchical models. The model structure accounts for various degrees of between-trial heterogeneity, adaptively discounting the external information in the case of data conflict. In this article, we propose to integrate the propensity score method with the Bayesian meta-analytic-predictive (MAP) prior to leverage external real-world and historical data. The propensity score methodology is applied to select a subset of patients from the external data that are similar to those in the current study with regard to key baseline covariates, and to stratify the selected patients together with those in the current study into more homogeneous strata. The MAP prior approach is then used to obtain stratum-specific MAP priors and to derive the overall propensity-score-integrated meta-analytic predictive (PS-MAP) prior. Additionally, we allow tuning of the prior effective sample size of the proposed PS-MAP prior, which quantifies the amount of information borrowed from the external data. We evaluate the performance of the proposed PS-MAP prior by comparing it to the existing propensity-score-integrated power prior approach in a simulation study, and illustrate its implementation with an example of a single-arm phase II trial.
6. Sample Mass Estimate for the Use of Near-Infrared and Raman Spectroscopy to Monitor Content Uniformity in a Tablet Press Feed Frame of a Drug Product Continuous Manufacturing Process. Applied Spectroscopy 2021; 75:216-224. PMID: 32721168; DOI: 10.1177/0003702820950318.
Abstract
Recently, feed frame-based process analytical technology measurements used to assure product quality during continuous manufacturing processes have received significant attention. These measurements can accurately determine the uniformity of the powder blend before compression, and in these applications it is necessary to understand the interrogated sample volume per measurement. This understanding ensures that the blend measurement can be indicative of the uniformity of the final dosage form. A scientifically sound approach is proposed here to estimate sample mass for a continuous manufacturing process that utilizes either near-infrared or Raman spectroscopy. A wide range of commercially available probes with varying spot diameters are considered. By comparing near-infrared and Raman spectroscopy, an optimal range of probe spot diameters was identified to reach an estimated sample mass between 50 and 500 mg of pharmaceutical blend per measurement, which is equivalent to common tablet weight ranges for solid oral dosage forms currently on the market.
7. kg_nchs: A command for Korn-Graubard confidence intervals and National Center for Health Statistics' Data Presentation Standards for Proportions. The Stata Journal 2019; 19:510-522. PMID: 31814807; PMCID: PMC6896998; DOI: 10.1177/1536867x19874221.
Abstract
In August 2017, the National Center for Health Statistics (NCHS), part of the U.S. Federal Statistical System, published new standards for determining the reliability of proportions estimated using their data. These standards require analysts to use the Korn-Graubard confidence interval (CI), along with CI widths, sample size, and degrees of freedom, to assess the reliability of a proportion and determine whether it can be presented. The assessment itself involves determining whether several conditions are met. This manuscript presents kg_nchs, a postestimation command used following svy: proportion. It allows Stata users to (a) calculate the Korn-Graubard CI and associated statistics used in applying the NCHS presentation standards for proportions, and (b) display a series of three dichotomous flags that show whether the standards are met. The empirical examples provided show how kg_nchs can be used to easily apply the standards and spare Stata users from performing manual calculations. While developed for NCHS survey data, the command can also be used with data from any survey with a complex sample design.
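kg_nchs itself is a Stata command; purely for intuition, the core of the Korn-Graubard interval can be sketched in Python. It is a Clopper-Pearson interval computed on a design-based effective sample size n* = p(1-p)/Var(p̂) rather than the nominal n (the degrees-of-freedom adjustment and the NCHS presentation flags are omitted from this sketch):

```python
import math
from scipy.stats import beta

def korn_graubard_ci(p_hat, se, alpha=0.05):
    """Clopper-Pearson interval on the design-based effective sample size
    n* = p(1-p)/se^2. Requires 0 < p_hat < 1; df adjustment omitted."""
    n_eff = p_hat * (1.0 - p_hat) / se**2
    x_eff = n_eff * p_hat
    lower = beta.ppf(alpha / 2, x_eff, n_eff - x_eff + 1)
    upper = beta.ppf(1 - alpha / 2, x_eff + 1, n_eff - x_eff)
    return float(lower), float(upper)

# With se equal to its simple-random-sampling value, n* equals n and the
# result is the ordinary Clopper-Pearson interval.
p, n = 0.3, 100
se_srs = math.sqrt(p * (1 - p) / n)
lo, hi = korn_graubard_ci(p, se_srs)
print(lo, hi)
```

Under a complex design, the survey-estimated standard error replaces `se_srs`, typically shrinking n* and widening the interval.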
8. Development of an In-Line Near-Infrared Method for Blend Content Uniformity Assessment in a Tablet Feed Frame. Applied Spectroscopy 2019; 73:1028-1040. PMID: 30990067; DOI: 10.1177/0003702819842189.
Abstract
Process analytical technology (PAT) has shown great potential for in-line tableting process monitoring. The study focuses on the development and validation of an in-line near-infrared (NIR) spectroscopic method for the determination of content uniformity of blends in a tablet feed frame. An in-line NIR method was developed after careful evaluation of the impact of potential experimental factors on the robustness and model accuracy and precision. The NIR method was validated according to the principles outlined in International Conference on Harmonization-Q2 for validation of analytical procedures and was demonstrated to be suitable for monitoring blend content for the formulation under evaluation. Reliable measurements of blend homogeneity rely on representative sampling. To reach the appropriate scale of scrutiny for a unit dose, the study assessed factors that influence the effective sample size measured by NIR. Spectral averaging, integration time, and feed frame paddle wheel speed were found to influence the effective sample size measured by the NIR probe. The effective sampling size was also estimated by comparing the distribution of predicted values with the reference values. The development of a robust, in-line PAT method was facilitated by thorough understanding of the sensitivity of PAT sensors to factors affecting pharmaceutical processes and products.
9. Quantifying and presenting overall evidence in network meta-analysis. Stat Med 2018; 37:4114-4125. PMID: 30019428; DOI: 10.1002/sim.7905.
Abstract
Network meta-analysis (NMA) has become an increasingly used tool to compare multiple treatments simultaneously by synthesizing direct and indirect evidence in clinical research. However, many existing studies did not properly report the evidence of treatment comparisons or show the comparison structure to the audience. In addition, nearly all treatment networks presented only direct evidence, not the overall evidence that reflects the benefit of performing NMAs. This article classifies treatment networks into three types under different assumptions: networks with each treatment comparison's edge width proportional to the corresponding number of studies, sample size, or precision. In addition, three new measures (i.e., the effective number of studies, the effective sample size, and the effective precision) are proposed to preliminarily quantify the overall evidence gained in NMAs. They permit the audience to intuitively evaluate the benefit of performing NMAs compared with pairwise meta-analyses based on only direct evidence. We use four case studies, including one illustrative example, to demonstrate their derivations and interpretations. Treatment networks may look fairly different when different measures are used to present the evidence. The proposed measures provide clear information about the overall evidence of all treatment comparisons, and they also imply the additional number of studies, sample size, and precision obtained from indirect evidence. Some comparisons may benefit little from NMAs. Researchers are encouraged to present the overall evidence of all treatment comparisons, so that the audience can preliminarily evaluate the quality of NMAs.
10. Online Bayesian Phylogenetic Inference: Theoretical Foundations via Sequential Monte Carlo. Syst Biol 2018; 67:503-517. PMID: 29244177; PMCID: PMC5920340; DOI: 10.1093/sysbio/syx087.
Abstract
Phylogenetics, the inference of evolutionary trees from molecular sequence data such as DNA, is an enterprise that yields valuable evolutionary understanding of many biological systems. Bayesian phylogenetic algorithms, which approximate a posterior distribution on trees, have become a popular if computationally expensive means of doing phylogenetics. Modern data collection technologies are quickly adding new sequences to already substantial databases. With all current techniques for Bayesian phylogenetics, computation must start anew each time a sequence becomes available, making it costly to maintain an up-to-date estimate of a phylogenetic posterior. These considerations highlight the need for an online Bayesian phylogenetic method which can update an existing posterior with new sequences. Here, we provide theoretical results on the consistency and stability of methods for online Bayesian phylogenetic inference based on Sequential Monte Carlo (SMC) and Markov chain Monte Carlo. We first show a consistency result, demonstrating that the method samples from the correct distribution in the limit of a large number of particles. Next, we derive the first reported set of bounds on how phylogenetic likelihood surfaces change when new sequences are added. These bounds enable us to characterize the theoretical performance of sampling algorithms by bounding the effective sample size (ESS) with a given number of particles from below. We show that the ESS is guaranteed to grow linearly as the number of particles in an SMC sampler grows. Surprisingly, this result holds even though the dimensions of the phylogenetic model grow with each new added sequence.
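The ESS referred to here is the standard particle-weight quantity for SMC samplers: with normalized weights w̄_k, ESS = 1/Σ w̄_k². A minimal, numerically stable illustration:

```python
import numpy as np

def particle_ess(log_weights):
    """ESS = 1 / sum(normalized_weights^2), computed stably via log weights."""
    lw = np.asarray(log_weights, dtype=float)
    lw = lw - lw.max()          # guard against overflow in exp
    w = np.exp(lw)
    w /= w.sum()                # normalize
    return 1.0 / np.sum(w**2)

# Equal weights: ESS equals the number of particles.
assert np.isclose(particle_ess(np.zeros(500)), 500.0)

# Degenerate weights: ESS collapses toward 1.
lw = np.full(500, -50.0)
lw[0] = 0.0                     # one particle dominates
print(particle_ess(lw))         # close to 1
```

The paper's lower bounds guarantee that this quantity grows linearly in the number of particles for the online phylogenetic setting, rather than degenerating as new sequences are added.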
11. Population metrics for suicide events: A causal inference approach. Stat Methods Med Res 2017; 28:503-514. PMID: 28933251; DOI: 10.1177/0962280217729843.
Abstract
Large-scale public health prevention initiatives and interventions are an important component of current public health strategies. However, evaluating the effects of such large-scale prevention and intervention efforts faces many challenges due to confounding effects and the heterogeneity of the study population. In this paper, we develop metrics to assess the risk of suicide events within a causal inference framework when the study population is heterogeneous. The proposed metrics deal with the confounding effect by first estimating the risk of suicide events within each risk level, defined by the number of prior attempts, and then taking a weighted sum of the conditional probabilities. The metrics provide unbiased estimates of the risk of suicide events. Simulation studies and a real data example demonstrate the proposed metrics.
12. Statistical modeling for Bayesian extrapolation of adult clinical trial information in pediatric drug evaluation. Pharm Stat 2017; 16:232-249. PMID: 28448684; DOI: 10.1002/pst.1807.
Abstract
Children represent a large underserved population of "therapeutic orphans," as an estimated 80% of children are treated off-label. However, pediatric drug development often faces substantial challenges, including economic, logistical, technical, and ethical barriers, among others. Among many efforts trying to remove these barriers, increased recent attention has been paid to extrapolation; that is, the leveraging of available data from adults or older age groups to draw conclusions for the pediatric population. The Bayesian statistical paradigm is natural in this setting, as it permits the combining (or "borrowing") of information across disparate sources, such as the adult and pediatric data. In this paper, authored by the pediatric subteam of the Drug Information Association Bayesian Scientific Working Group and Adaptive Design Working Group, we develop, illustrate, and provide suggestions on Bayesian statistical methods that could be used to design improved pediatric development programs that use all available information in the most efficient manner. A variety of relevant Bayesian approaches are described, several of which are illustrated through 2 case studies: extrapolating adult efficacy data to expand the labeling for Remicade to include pediatric ulcerative colitis and extrapolating adult exposure-response information for antiepileptic drugs to pediatrics.
13. Risk management for moisture related effects in dry manufacturing processes: a statistical approach. Pharm Dev Technol 2014; 21:147-151. PMID: 25384711; DOI: 10.3109/10837450.2014.979943.
Abstract
A risk- and science-based approach to controlling quality in pharmaceutical manufacturing includes a full understanding of how product attributes and process parameters relate to product performance through a proactive approach in formulation and process development. For dry manufacturing, where moisture content is not directly manipulated within the process, variability in the moisture of incoming raw materials can impact both processability and drug product quality attributes. A statistical approach is developed that uses historical lots of individual raw materials as the basis for calculating tolerance intervals for drug product moisture content, so that risks associated with excursions in moisture content can be mitigated. The proposed method is model independent: it uses available data to estimate parameters of interest that describe the population of blend moisture content values, without requiring knowledge of the individual blend moisture content values. Another advantage of the proposed tolerance intervals is that they do not require tabulated values for tolerance factors, which facilitates implementation in any spreadsheet program, such as Microsoft Excel. A computational example demonstrates the proposed method.
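For context on avoiding tabulated tolerance factors: under normal theory, the two-sided factor can be computed directly from distribution quantiles via the Howe (1969) approximation, k ≈ z₍₁₊P₎⁄₂ · sqrt((n−1)(1+1/n)/χ²₍α,n−1₎). This is a generic normal-theory sketch with made-up summary numbers, not the paper's model-independent method:

```python
import math
from scipy.stats import norm, chi2

def howe_tolerance_factor(n, coverage=0.99, conf=0.95):
    """Approximate two-sided normal tolerance factor (Howe, 1969)."""
    z = norm.ppf((1.0 + coverage) / 2.0)          # coverage quantile
    c = chi2.ppf(1.0 - conf, n - 1)               # lower chi-square quantile
    return z * math.sqrt((n - 1) * (1.0 + 1.0 / n) / c)

# Interval expected to contain 99% of blend moisture values, 95% confidence.
n, mean, sd = 30, 2.5, 0.15   # hypothetical moisture summary (% w/w)
k = howe_tolerance_factor(n)
print(mean - k * sd, mean + k * sd)
```

Because the factor is a closed-form function of standard quantiles, the same calculation can be reproduced in a spreadsheet (e.g., with NORM.S.INV and CHISQ.INV in Excel) without lookup tables.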
14. Pike and salmon as sister taxa: detailed intraclade resolution and divergence time estimation of Esociformes + Salmoniformes based on whole mitochondrial genome sequences. Gene 2013; 530:57-65. PMID: 23954876; DOI: 10.1016/j.gene.2013.07.068.
Abstract
The increasing number of taxa and loci in molecular phylogenetic studies of basal euteleosts has brought stability to a controversial area. A key emerging aspect of these studies is a sister relationship between the Esociformes (pike) and Salmoniformes (salmon). We evaluate mitochondrial genome support for this hypothesis by surveying many potential outgroups for these taxa, employing multiple phylogenetic approaches, and utilizing a thorough sampling scheme. Second, we conduct simultaneous divergence time estimation and phylogenetic inference in a Bayesian framework with fossil calibrations, focusing on relationships within Esociformes+Salmoniformes. Our dataset supports a sister relationship between Esociformes and Salmoniformes; however, the nearest relatives of Esociformes+Salmoniformes are inconsistent among analyses. Within the order Esociformes, we advocate for a single family, Esocidae. Subfamily relationships within Salmonidae, recovered as Salmoninae sister to Thymallinae+Coregoninae, are poorly supported.
15. Molecular evolution of attachment glycoprotein (G) gene in human respiratory syncytial virus detected in Japan 2008-2011. Infection, Genetics and Evolution 2013; 18:168-173. PMID: 23707845; DOI: 10.1016/j.meegid.2013.05.010.
Abstract
We investigated the evolution of the C-terminal third hypervariable region of the G gene in the prevalent human respiratory syncytial virus (RSV) subgroups A (RSV-A) and B (RSV-B) in Japan in 2008-2011. The phylogenetic analysis and evolutionary timescale were obtained by the Bayesian Markov chain Monte Carlo method. All 38 RSV-A strains detected were classified into genotype NA1, and the 17 RSV-B strains detected belonged to genotypes BA and GB2. In the present phylogenetic tree, genotype NA1 subdivided around 1998 and genotype BA around 1994. The evolutionary rates for RSV-A and RSV-B were estimated at 3.63×10⁻³ and 4.56×10⁻³ substitutions/site/year, respectively; the mean evolutionary rate of RSV-B was significantly faster than that of RSV-A during all seasons. Pairwise distances were relatively short (less than 0.06). In addition, some unique sites under positive selection were found. The results suggest that this region of the RSV strains evolved rapidly, with some unique amino acid substitutions arising under positive selection pressure.
16.
Abstract
Equilibrium sampling of biomolecules remains an unmet challenge after more than 30 years of atomistic simulation. Efforts to enhance sampling capability, which are reviewed here, range from the development of new algorithms to parallelization to novel uses of hardware. Special focus is placed on classifying algorithms, most of which are underpinned by a few key ideas, in order to understand their fundamental strengths and limitations. Although algorithms have proliferated, progress resulting from novel hardware use appears to be more clear-cut than from algorithms alone, due partly to the lack of widely used sampling measures.