1. Liu CC, Yu RX, Aitkin M. The flaw of averages: Bayes factors as posterior means of the likelihood ratio. Pharm Stat 2024; 23:466-479. PMID: 38282048. DOI: 10.1002/pst.2355.
Abstract
As an alternative to the Frequentist p-value, the Bayes factor (or ratio of marginal likelihoods) has been regarded as one of the primary tools for Bayesian hypothesis testing. In recent years, several researchers have begun to re-analyze results from prominent medical journals, as well as from trials for FDA-approved drugs, to show that Bayes factors often give divergent conclusions from those of p-values. In this paper, we investigate the claim that Bayes factors are straightforward to interpret as directly quantifying the relative strength of evidence. In particular, we show that for nested hypotheses with consistent priors, the Bayes factor for the null over the alternative hypothesis is the posterior mean of the likelihood ratio. By re-analyzing 39 results previously published in the New England Journal of Medicine, we demonstrate how the posterior distribution of the likelihood ratio can be computed and visualized, providing useful information beyond the posterior mean alone.
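The paper's central identity (for nested hypotheses with consistent priors, BF_01 equals the posterior mean of the likelihood ratio) can be checked numerically. The sketch below is an illustration for a conjugate normal model (one observation, known variance), not the paper's own code; the chosen observation value and prior are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Conjugate normal setup: one observation x ~ N(theta, 1),
# H0: theta = 0 nested in H1: theta ~ N(0, 1).
x = 1.5

# Analytic Bayes factor BF_01 = m0(x) / m1(x):
# m0(x) = N(x; 0, 1); integrating theta out under H1 gives m1(x) = N(x; 0, sqrt(2)).
bf01_analytic = norm.pdf(x, 0, 1) / norm.pdf(x, 0, np.sqrt(2))

# Posterior under H1: theta | x ~ N(x/2, 1/2).
theta = rng.normal(x / 2, np.sqrt(0.5), size=500_000)

# Posterior mean of the likelihood ratio p(x | theta0) / p(x | theta).
lr = norm.pdf(x, 0, 1) / norm.pdf(x, theta, 1)
bf01_mc = lr.mean()

print(bf01_analytic, bf01_mc)  # both ≈ 0.81
```

The Monte Carlo average of the likelihood ratio over posterior draws reproduces the analytic marginal-likelihood ratio, which is exactly the identity the paper exploits; the full posterior distribution of `lr` is what the authors propose visualizing.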
Affiliation(s)
- Charles C Liu: Department of Biostatistics, Gilead Sciences, Foster City, CA, USA
- Ron Xiaolong Yu: Department of Biostatistics, Gilead Sciences, Foster City, CA, USA
- Murray Aitkin: School of Mathematics and Statistics, The University of Melbourne, Parkville, Victoria, Australia
2. Linde M, van Ravenzwaaij D. baymedr: an R package and web application for the calculation of Bayes factors for superiority, equivalence, and non-inferiority designs. BMC Med Res Methodol 2023; 23:279. PMID: 38001458. PMCID: PMC10668366. DOI: 10.1186/s12874-023-02097-y.
Abstract
BACKGROUND Clinical trials often seek to determine the superiority, equivalence, or non-inferiority of an experimental condition (e.g., a new drug) compared to a control condition (e.g., a placebo or an already existing drug). The use of frequentist statistical methods to analyze data for these types of designs is ubiquitous even though they have several limitations. Bayesian inference remedies many of these shortcomings and allows for intuitive interpretations, but is currently difficult to implement for the applied researcher. RESULTS We outline the frequentist conceptualization of superiority, equivalence, and non-inferiority designs and discuss its disadvantages. Subsequently, we explain how Bayes factors can be used to compare the relative plausibility of competing hypotheses. We present baymedr, an R package and web application that provide user-friendly tools for the computation of Bayes factors for superiority, equivalence, and non-inferiority designs. Instructions on how to use baymedr are provided and an example illustrates how existing results can be reanalyzed with baymedr. CONCLUSIONS Our baymedr R package and web application enable researchers to conduct Bayesian superiority, equivalence, and non-inferiority tests. baymedr is characterized by a user-friendly implementation, making it convenient for researchers who are not statistical experts. Using baymedr, it is possible to calculate Bayes factors based on raw data and summary statistics.
Affiliation(s)
- Maximilian Linde: GESIS - Leibniz Institute for the Social Sciences, Cologne, Germany; University of Groningen, Groningen, The Netherlands
3. Pittelkow MM, de Vries YA, Monden R, Bastiaansen JA, van Ravenzwaaij D. Comparing the evidential strength for psychotropic drugs: a Bayesian meta-analysis. Psychol Med 2021; 51:2752-2761. PMID: 34620261. PMCID: PMC8640368. DOI: 10.1017/s0033291721003950.
Abstract
Approval and prescription of psychotropic drugs should be informed by the strength of evidence for efficacy. Using a Bayesian framework, we examined (1) whether psychotropic drugs are supported by substantial evidence (at the time of approval by the Food and Drug Administration), and (2) whether there are systematic differences across drug groups. Data from short-term, placebo-controlled phase II/III clinical trials for 15 antipsychotics, 16 antidepressants for depression, nine antidepressants for anxiety, and 20 drugs for attention deficit hyperactivity disorder (ADHD) were extracted from FDA reviews. Bayesian model-averaged meta-analysis was performed and strength of evidence was quantified by the resulting Bayes factor (BF_BMA). Strength of evidence and trialling varied between drugs. Median evidential strength was extreme for ADHD medication (BF_BMA = 1820.4), moderate for antipsychotics (BF_BMA = 365.4), and considerably lower and more frequently classified as weak or moderate for antidepressants for depression (BF_BMA = 94.2) and anxiety (BF_BMA = 49.8). Varying median effect sizes (ES_schizophrenia = 0.45, ES_depression = 0.30, ES_anxiety = 0.37, ES_ADHD = 0.72), sample sizes (N_schizophrenia = 324, N_depression = 218, N_anxiety = 254, N_ADHD = 189.5), and numbers of trials (k_schizophrenia = 3, k_depression = 5.5, k_anxiety = 3, k_ADHD = 2) might account for these differences. Although most drugs were supported by strong evidence at the time of approval, some had only moderate or ambiguous evidence. These results show the need for more systematic quantification and classification of statistical evidence for psychotropic drugs. Evidential strength should be communicated transparently and clearly to clinical decision makers.
Affiliation(s)
- Merle-Marie Pittelkow: Department of Psychometrics and Statistics, University of Groningen, Groningen, the Netherlands
- Ymkje Anna de Vries: Department of Developmental Psychology, University of Groningen, Groningen, the Netherlands; Interdisciplinary Center Psychopathology and Emotion Regulation, Department of Psychiatry, University Medical Center Groningen, Groningen, the Netherlands
- Rei Monden: Interdisciplinary Center Psychopathology and Emotion Regulation, Department of Psychiatry, University Medical Center Groningen, Groningen, the Netherlands; Department of Biomedical Statistics, Graduate School of Medicine, Osaka University, Suita, Osaka, Japan
- Jojanneke A. Bastiaansen: Interdisciplinary Center Psychopathology and Emotion Regulation, Department of Psychiatry, University Medical Center Groningen, Groningen, the Netherlands; Department of Education and Research, Friesland Mental Health Care Services, Leeuwarden, the Netherlands
- Don van Ravenzwaaij: Department of Psychometrics and Statistics, University of Groningen, Groningen, the Netherlands
4. The evidence base for psychotropic drugs approved by the European Medicines Agency: a meta-assessment of all European Public Assessment Reports. Epidemiol Psychiatr Sci 2020; 29:e120. PMID: 32336312. PMCID: PMC7214735. DOI: 10.1017/s2045796020000359.
Abstract
AIMS To systematically assess the level of evidence for psychotropic drugs approved by the European Medicines Agency (EMA). METHODS Cross-sectional analysis of all European Public Assessment Reports (EPARs) and meta-analyses of the many studies reported in these EPARs. Eligible EPARs were identified from the EMA's website and individual study reports were requested from the Agency when necessary. All marketing authorisation applications (defined by the drug, the route of administration and given indications) for psychotropic medications for adults (including drugs used in psychiatry and addictology) were considered. EPARs solely based on bioequivalence studies were excluded. Our primary outcome measure was the presence of robust evidence of comparative effectiveness, defined as at least two 'positive' superiority studies against an active comparator. Various other features of the approvals were assessed, such as evidence of non-inferiority v. active comparator and superiority v. placebo. For studies with available data, effect sizes were computed and pooled using a random effect meta-analysis for each dose of each drug in each indication. RESULTS Twenty-seven marketing authorisations were identified. For one, comparative effectiveness was explicitly considered as not needed in the EPAR. Of those remaining, 21/26 (81%) did not provide any evidence of superiority against an active comparator, 2/26 (8%) were based on at least two trials showing superiority against an active comparator and three (11%) were based on one positive trial; 1/26 provided evidence for two positive non-inferiority analyses v. an active comparator and seven (26%) provided evidence for one. In total, 20/27 (74%) evaluations reported evidence of superiority v. placebo on the basis of two or more initiation trials, and seven were based on a single trial. Among the meta-analyses of initiation studies against an active comparator (57 available comparisons), the median effect size was 0.051 (range -0.503 to 0.318). Among meta-analyses of initiation studies against placebo (125 available comparisons), the median effect size was -0.283 (range -0.820 to 0.091). Importantly, among the 89 study reports requested on the EMA website, only 19 were made available 1 year after our requests. CONCLUSIONS The evidence for psychiatric drugs approved by the EMA was in general poor. Small to modest effects v. placebo were considered sufficient in indications where an earlier drug exists. Data retrieval was incomplete after 1 year despite the EMA's commitment to transparency. Improvements are needed.
5. van Ravenzwaaij D, Ioannidis JPA. True and false positive rates for different criteria of evaluating statistical evidence from clinical trials. BMC Med Res Methodol 2019; 19:218. PMID: 31775644. PMCID: PMC6882054. DOI: 10.1186/s12874-019-0865-y.
Abstract
Background Until recently, a typical rule that has often been used for the endorsement of new medications by the Food and Drug Administration has been the existence of at least two statistically significant clinical trials favoring the new medication. This rule has consequences for the true positive rate (endorsement of an effective treatment) and the false positive rate (endorsement of an ineffective treatment). Methods In this paper, we compare true positive and false positive rates for different evaluation criteria through simulations that rely on (1) conventional p-values; (2) confidence intervals based on meta-analyses assuming fixed or random effects; and (3) Bayes factors. We varied threshold levels for statistical evidence, thresholds for what constitutes a clinically meaningful treatment effect, and the number of trials conducted. Results Our results show that Bayes factors, meta-analytic confidence intervals, and p-values often have similar performance. Bayes factors may perform better when the number of trials conducted is high and when trials have small sample sizes and clinically meaningful effects are not small, particularly in fields where the number of non-zero effects is relatively large. Conclusions Thinking about realistic effect sizes in conjunction with desirable levels of statistical evidence, as well as quantifying statistical evidence with Bayes factors, may help improve decision-making in some circumstances.
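The operating characteristics of the two-trial rule the abstract describes can be worked out analytically for an idealized two-arm z-test. The following is a toy calculation in the spirit of the paper's simulations, not the authors' code; the effect size and sample size are illustrative assumptions.

```python
from scipy.stats import norm

def per_trial_power(d, n_per_arm, alpha=0.05):
    """Probability a two-arm z-test is significant in the positive
    direction, for true standardized effect d and n subjects per arm."""
    se = (2 / n_per_arm) ** 0.5        # SE of the mean difference, in d units
    zcrit = norm.ppf(1 - alpha / 2)    # two-sided alpha = 0.05 -> 1.96
    return 1 - norm.cdf(zcrit - d / se)

def two_trial_rule(d, n_per_arm):
    """Probability that two independent trials are both significant,
    i.e. the classic 'at least two positive trials' endorsement rule."""
    return per_trial_power(d, n_per_arm) ** 2

false_pos = two_trial_rule(0.0, 100)   # ineffective drug: (alpha/2)^2
true_pos = two_trial_rule(0.3, 100)    # modest, clinically meaningful effect
print(false_pos, true_pos)
```

Under the null the rule's false positive rate is (0.025)^2 ≈ 0.0006, while for a modest true effect with 100 per arm the endorsement probability is only about a third, which is the trade-off the paper compares against meta-analytic and Bayes factor criteria.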
Affiliation(s)
- Don van Ravenzwaaij: Department of Psychology, University of Groningen, Grote Kruisstraat 2/1, Heymans Building, room 169, 9712 TS Groningen, The Netherlands
- John P A Ioannidis: Departments of Medicine, of Health Research and Policy, and of Statistics, and Meta-Research Innovation Center at Stanford (METRICS), Stanford University, Stanford, USA
6. Powezka K, Normahani P, Standfield NJ, Jaffer U. A novel team Familiarity Score for operating teams is a predictor of length of a procedure: A retrospective Bayesian analysis. J Vasc Surg 2019; 71:959-966. PMID: 31401113. DOI: 10.1016/j.jvs.2019.03.085.
Abstract
OBJECTIVE The aim of our retrospective study was to assess whether a novel team Familiarity Score (FS) is associated with the length of procedure (LOP), postoperative length of stay (LOS), and complication rate after vascular procedures. METHODS We retrospectively analyzed 326 vascular procedures performed at a tertiary care vascular surgery center between April 2012 and September 2014. Data collected included patients' age, American Society of Anesthesiologists grade, LOP, type and urgency of procedure, LOS, and complications. The FS was defined as the sum of the number of times that each possible pair within the team (vascular consultant, vascular registrar, scrub nurse, anesthetic consultant) had worked together during the previous 6 months, divided by the number of possible combinations of pairs in the team. Bayesian statistics were used to analyze the data. RESULTS FS was significantly associated with type and urgency of the procedure (Bayes factor [BF] >1000). Emergency procedures were performed by less familiar teams, and the least familiar teams were involved in the emergency aortic procedures, both endovascular and open. FS was strongly associated with LOP (BF = 37) but not with LOS (BF = 4.0) or complication rate. CONCLUSIONS FS in vascular teams was shown to be strongly associated with LOP, suggesting that more familiar teams might collaborate more efficiently.
Affiliation(s)
- Katarzyna Powezka: Imperial Vascular Unit, Imperial College Healthcare NHS Trust, London, United Kingdom
- Pasha Normahani: Imperial Vascular Unit, Imperial College Healthcare NHS Trust, London, United Kingdom
- Nigel J Standfield: Imperial Vascular Unit, Imperial College Healthcare NHS Trust, London, United Kingdom
- Usman Jaffer: Imperial Vascular Unit, Imperial College Healthcare NHS Trust, London, United Kingdom
7. van Ravenzwaaij D, Monden R, Tendeiro JN, Ioannidis JPA. Bayes factors for superiority, non-inferiority, and equivalence designs. BMC Med Res Methodol 2019; 19:71. PMID: 30925900. PMCID: PMC6441196. DOI: 10.1186/s12874-019-0699-7.
Abstract
Background In clinical trials, study designs may focus on assessment of superiority, equivalence, or non-inferiority of a new medicine or treatment as compared to a control. Typically, evidence in each of these paradigms is quantified with a variant of the null hypothesis significance test. A null hypothesis is assumed (a null effect for superiority; inferiority by a specific margin for non-inferiority; inferiority or superiority by a specific margin for equivalence), after which the probabilities of obtaining data more extreme than those observed under these null hypotheses are quantified by p-values. Although ubiquitous in clinical testing, the null hypothesis significance test can lead to a number of difficulties in interpreting the statistical evidence. Methods We advocate quantifying evidence instead by means of Bayes factors and highlight how these can be calculated for different types of research design. Results We illustrate Bayes factors in practice with reanalyses of data from existing published studies. Conclusions Bayes factors for superiority, non-inferiority, and equivalence designs allow for explicit quantification of evidence in favor of the null hypothesis. They also allow for interim testing without the need to employ explicit corrections for multiple testing.
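Bayes factors of this family are commonly built on the default (JZS) priors of Rouder et al. (2009). As a rough illustration of the machinery, not the paper's own implementation (which, like baymedr, covers two-sample superiority, non-inferiority, and equivalence variants), here is a one-sample JZS Bayes factor computed by numerical integration; the t and n values are arbitrary assumptions.

```python
import numpy as np
from scipy.integrate import quad

def jzs_bf10(t, n, r=1.0):
    """One-sample default (JZS) Bayes factor BF10 for a t statistic,
    following Rouder et al. (2009): effect size delta ~ Cauchy(0, r),
    expressed via a normal mixture with g ~ InvGamma(1/2, r^2/2)."""
    v = n - 1  # degrees of freedom

    def integrand(g):
        # marginal likelihood under H1 given g, times the InvGamma density of g
        return ((1 + n * g) ** -0.5
                * (1 + t ** 2 / ((1 + n * g) * v)) ** (-(v + 1) / 2)
                * (r / np.sqrt(2 * np.pi)) * g ** -1.5
                * np.exp(-r ** 2 / (2 * g)))

    numerator, _ = quad(integrand, 0, np.inf)
    denominator = (1 + t ** 2 / v) ** (-(v + 1) / 2)  # likelihood under H0
    return numerator / denominator

print(jzs_bf10(0.0, 30))  # t = 0: BF10 < 1, i.e. evidence favors H0
print(jzs_bf10(5.0, 30))  # t = 5: strong evidence for H1
```

Unlike a p-value, the resulting BF10 can actively support the null (values below 1), which is what makes the equivalence and non-inferiority formulations in the paper possible.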
Affiliation(s)
- Don van Ravenzwaaij: University of Groningen, Department of Psychology, Grote Kruisstraat 2/1, Heymans Building, 9712 TS Groningen, The Netherlands
- Rei Monden: University of Groningen, Department of Psychology, Grote Kruisstraat 2/1, Heymans Building, 9712 TS Groningen, The Netherlands; University Medical Center Groningen, Groningen, The Netherlands
- Jorge N Tendeiro: University of Groningen, Department of Psychology, Grote Kruisstraat 2/1, Heymans Building, 9712 TS Groningen, The Netherlands
- John P A Ioannidis: Departments of Medicine, of Health Research and Policy, of Biomedical Data Science, and of Statistics, and Meta-Research Innovation Center, Stanford, USA
8.
Affiliation(s)
- John P. A. Ioannidis: Departments of Medicine, of Health Research and Policy, of Biomedical Data Science, and of Statistics, Stanford University and Meta-Research Innovation Center at Stanford (METRICS), Stanford, CA
9. Design analysis indicates potential overestimation of treatment effects in randomized controlled trials supporting Food and Drug Administration cancer drug approvals. J Clin Epidemiol 2018; 103:1-9. PMID: 30297036. DOI: 10.1016/j.jclinepi.2018.06.012.
Abstract
OBJECTIVE Statistical significance drives interpretation of randomized controlled trials (RCTs). We examined the type S error risk (the risk of claiming a new drug is beneficial when its true effect is in the opposite direction) and the exaggeration ratio (how much estimated effects differ from true effects) to re-emphasize direction and magnitude of treatment effects. STUDY DESIGN AND SETTING We systematically reviewed RCTs supporting Food and Drug Administration (FDA) approval of cancer drugs between 2007 and 2016. We extracted data for overall survival (OS), progression-free survival (PFS), and response outcomes from FDA reviews. We estimated type S error risks and exaggeration ratios by considering replicated RCTs of equal size and a range of true effects. RESULTS We analyzed 42 RCTs for 39 approved drugs. Across 38 RCTs reporting OS, the median type S error risk was 0.00% (Q1-Q3, 0.00-0.01%) and 3.56% (0.40-6.74%) for true hazard ratios of 0.7 and 0.9, respectively, indicating confidence in effect direction. The corresponding exaggeration ratios were 1.09 (1.01-1.11) and 1.30 (1.13-1.42), indicating median overestimations of 9% and 30%. Similar results held for PFS and response outcomes. CONCLUSIONS The type S error risk and exaggeration ratio provide additional insights into the replicability of RCTs. Our analyses also quantify the winner's curse, in which pivotal RCTs tend toward overoptimism.
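The two quantities in this abstract come from Gelman and Carlin's design-analysis recipe. The sketch below is a generic version assuming a normally distributed estimate (the paper works on the log hazard ratio scale and uses trial-specific standard errors); the effect and standard error values are illustrative assumptions, not data from the study.

```python
import numpy as np
from scipy.stats import norm

def retrodesign(true_effect, se, alpha=0.05, n_sims=1_000_000, seed=7):
    """Power, type S error risk, and exaggeration ratio for an estimate
    distributed N(true_effect, se), after Gelman & Carlin (2014)."""
    z = norm.ppf(1 - alpha / 2)
    # power and type S risk, computed analytically
    power = (1 - norm.cdf(z - true_effect / se)
             + norm.cdf(-z - true_effect / se))
    type_s = norm.cdf(-z - true_effect / se) / power  # wrong sign, given significance
    # exaggeration ratio by simulation: E[|estimate| / true effect | significant]
    rng = np.random.default_rng(seed)
    est = rng.normal(true_effect, se, n_sims)
    sig = np.abs(est) > z * se
    exaggeration = np.abs(est[sig]).mean() / true_effect
    return power, type_s, exaggeration

# A deliberately underpowered design: true effect equal to one standard error.
power, type_s, exaggeration = retrodesign(true_effect=1.0, se=1.0)
print(power, type_s, exaggeration)
```

With power around 17%, significant estimates overshoot the true effect by roughly a factor of 2.5, the winner's curse the paper quantifies for pivotal cancer trials.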
10. Fahy AS, Jamal L, Gavrilovic B, Carillo B, Gerstle JT, Nasr A, Azzie G. The Impact of Simulator Size on Forces Generated in the Performance of a Defined Intracorporeal Suturing Task: A Pilot Study. J Laparoendosc Adv Surg Tech A 2018; 28:1520-1524. PMID: 30004827. DOI: 10.1089/lap.2018.0255.
Abstract
Background: In pediatric minimal access surgery, the operative domain may vary from that of an adult to that of a neonate. This study aimed to quantify the impact of decreased operative domain on forces generated in the performance of a defined intracorporeal suturing task. Methods: One hundred five participants performed a defined intracorporeal suturing task in small and large simulators. Time to task completion and force analysis parameters (FAPs = total, maximum, and mean forces in X, Y, and Z axes) were measured. Expertise level was assigned based on the number of laparoscopic cases. Outcomes were analyzed using paired-sample t-tests with a significance threshold of P < .05. Results: Time to task completion varied significantly for experts between adult and pediatric simulators but not for intermediates or novices. Total, maximum, and mean forces in the X ("side to side") axis were significantly greater in the larger laparoscopic simulator for all levels of expertise. In the Y axis ("in and out" movement) and Z axis ("up and down" movement), total and mean forces were higher in the adult simulator regardless of the level of expertise. Differences in maximum force between the adult and pediatric simulators in the Z axis ("up and down" movement) varied significantly for novices and intermediates but not for experts. Conclusion: Forces were greater, particularly in the side-to-side plane, in the larger simulator for participants of all levels in the performance of this defined intracorporeal suturing task. Further analysis will determine the reasons for and implications of the increased force parameters in the simulator of larger domain.
Affiliation(s)
- Aodhnait S Fahy: Division of General and Thoracic Surgery, Hospital for Sick Children, Toronto, Canada
- Luai Jamal: Division of General and Thoracic Surgery, Hospital for Sick Children, Toronto, Canada
- Bojan Gavrilovic: Institute of Biomaterials and Biomedical Engineering, University of Toronto, Toronto, Canada
- Justin T Gerstle: Division of General and Thoracic Surgery, Hospital for Sick Children, Toronto, Canada
- Ahmed Nasr: Division of General and Thoracic Surgery, Children's Hospital of Eastern Ontario, University of Ottawa, Ottawa, Canada
- Georges Azzie: Division of General and Thoracic Surgery, Hospital for Sick Children, Toronto, Canada
11. Cristea IA, Ioannidis JPA. P values in display items are ubiquitous and almost invariably significant: A survey of top science journals. PLoS One 2018; 13:e0197440. PMID: 29763472. PMCID: PMC5953482. DOI: 10.1371/journal.pone.0197440.
Abstract
P values represent a widely used, but pervasively misunderstood and fiercely contested, method of scientific inference. Display items, such as figures and tables, often contain the main results and are an important source of P values. We conducted a survey comparing the overall use of P values and the occurrence of significant P values in display items of a sample of articles in the three top multidisciplinary journals (Nature, Science, PNAS) in 2017 and in 1997. We also examined the reporting of multiplicity corrections and its potential influence on the proportion of statistically significant P values. Our findings demonstrated substantial and growing reliance on P values in display items, with increases of 2.5 to 14.5 times in 2017 compared to 1997. The overwhelming majority of P values (94%, 95% confidence interval [CI] 92% to 96%) were statistically significant. Methods to adjust for multiplicity were almost non-existent in 1997, but were reported in many articles relying on P values in 2017 (Nature 68%, Science 48%, PNAS 38%). In their absence, almost all reported P values were statistically significant (98%, 95% CI 96% to 99%). Conversely, when any multiplicity corrections were described, 88% (95% CI 82% to 93%) of reported P values were statistically significant. Use of Bayesian methods was scant (2.5% of articles), and articles rarely (0.7%) relied exclusively on Bayesian statistics. Overall, wider appreciation of the need for multiplicity corrections is a welcome evolution, but the rapid growth of reliance on P values and implausibly high rates of reported statistical significance are worrisome.
Affiliation(s)
- Ioana Alina Cristea: Meta-Research Innovation Center at Stanford (METRICS), Stanford University, Stanford, California, United States of America; Department of Clinical Psychology and Psychotherapy, Babes-Bolyai University, Cluj-Napoca, Romania
- John P. A. Ioannidis: Meta-Research Innovation Center at Stanford (METRICS), Stanford University, Stanford, California, United States of America; Departments of Medicine, of Health Research and Policy, of Biomedical Data Science, and of Statistics, Stanford University, Stanford, California, United States of America