1. Drouin JR, Flores S. Effects of training length on adaptation to noise-vocoded speech. The Journal of the Acoustical Society of America 2024; 155:2114-2127. [PMID: 38488452] [DOI: 10.1121/10.0025273]
Abstract
Listeners show rapid perceptual learning of acoustically degraded speech, though the amount of exposure required to maximize speech adaptation is unspecified. The current work used a single-session design to examine the effect of auditory training length on perceptual learning for normal-hearing listeners exposed to eight-channel noise-vocoded speech. Participants completed short, medium, or long training using a two-alternative forced choice sentence identification task with feedback. To assess learning and generalization, a 40-trial pre-test and post-test transcription task was administered using trained and novel sentences. Training results showed that all groups performed near ceiling, with no reliable differences. For test data, we evaluated changes in transcription accuracy using separate linear mixed models for trained and novel sentences. In both models, we observed a significant improvement in transcription at post-test relative to pre-test. Critically, the three training groups did not differ in the magnitude of improvement following training. A subsequent Bayes factor analysis evaluating the test-by-group interaction provided strong evidence in support of the null hypothesis. For these stimuli and this procedure, results suggest increased training does not necessarily maximize learning outcomes; both passive and trained experience likely supported adaptation. Findings may contribute to rehabilitation recommendations for listeners adapting to degraded speech signals.
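The Bayes factor evidence "for the null" reported in this abstract can be illustrated with a simpler stand-in for the authors' mixed-model analysis: the BIC approximation BF01 = exp((BIC_alt - BIC_null) / 2) (Wagenmakers, 2007). The sketch below is a hypothetical reconstruction with invented data and model formulas, not the paper's analysis or stimuli.

```python
# Illustrative sketch only: data, formulas, and effect sizes are invented;
# the paper used linear mixed models on its own stimuli. Here data are
# generated under the null for the test-by-group interaction (same gain
# in every training group), so BF01 should favour the simpler model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
rows = []
for group in ("short", "medium", "long"):
    for test in ("pre", "post"):
        gain = 0.15 if test == "post" else 0.0   # identical gain across groups
        acc = rng.normal(0.55 + gain, 0.10, 40)  # 40 simulated listeners/cell
        rows += [{"group": group, "test": test, "accuracy": a} for a in acc]
df = pd.DataFrame(rows)

null_fit = smf.ols("accuracy ~ group + test", data=df).fit()
alt_fit = smf.ols("accuracy ~ group * test", data=df).fit()

# BF01 > 1 favours the model *without* the test-by-group interaction.
bf01 = np.exp((alt_fit.bic - null_fit.bic) / 2)
print(f"BF01 for dropping the interaction: {bf01:.2f}")
```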
Affiliation(s)
- Julia R Drouin
- Division of Speech and Hearing Sciences, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, USA
- Stephany Flores
- Department of Communication Sciences and Disorders, California State University Fullerton, Fullerton, California 92831, USA
2. Drown L, Philip B, Francis AL, Theodore RM. Revisiting the left ear advantage for phonetic cues to talker identification. The Journal of the Acoustical Society of America 2022; 152:3107. [PMID: 36456295] [PMCID: PMC9715276] [DOI: 10.1121/10.0015093]
Abstract
Previous research suggests that learning to use a phonetic property [e.g., voice onset time (VOT)] for talker identity supports a left ear processing advantage. Specifically, listeners trained to identify two "talkers" who differed only in characteristic VOTs showed faster talker identification for stimuli presented to the left ear than for stimuli presented to the right ear, which was interpreted as evidence of hemispheric lateralization consistent with task demands. Experiment 1 (n = 97) aimed to replicate this finding and identify predictors of performance; experiment 2 (n = 79) aimed to replicate this finding under conditions that better facilitate observation of laterality effects. Listeners completed a talker identification task during pretest, training, and posttest phases. Inhibition, category identification, and auditory acuity were also assessed in experiment 1. Listeners learned to use VOT for talker identity, and learning was positively associated with auditory acuity. Talker identification was not influenced by ear of presentation, and Bayes factors indicated strong support for the null. These results suggest that talker-specific phonetic variation is not sufficient to induce a left ear advantage for talker identification; together with the extant literature, this instead suggests that hemispheric lateralization for talker-specific phonetic variation requires phonetic variation to be conditioned on talker differences in source characteristics.
Affiliation(s)
- Lee Drown
- Department of Speech, Language, and Hearing Sciences, University of Connecticut, Storrs, Connecticut 06269-1085, USA
- Betsy Philip
- Department of Speech, Language, and Hearing Sciences, University of Connecticut, Storrs, Connecticut 06269-1085, USA
- Alexander L Francis
- Department of Speech, Language, and Hearing Sciences, Purdue University, West Lafayette, Indiana 47907-2122, USA
- Rachel M Theodore
- Department of Speech, Language, and Hearing Sciences, University of Connecticut, Storrs, Connecticut 06269-1085, USA
3. Quintana DS. Towards better hypothesis tests in oxytocin research: Evaluating the validity of auxiliary assumptions. Psychoneuroendocrinology 2022; 137:105642. [PMID: 34991063] [DOI: 10.1016/j.psyneuen.2021.105642]
Abstract
Various factors have been attributed to the inconsistent reproducibility of human oxytocin research in the cognitive and behavioral sciences. These factors include small sample sizes, a lack of pre-registered studies, and the absence of overarching theoretical frameworks that can account for oxytocin's effects over a broad range of contexts. While there have been efforts to remedy these issues, there has been very little systematic scrutiny of the role of auxiliary assumptions, which are claims that are not themselves the hypothesis under test but are nonetheless critical for testing theories. For instance, the hypothesis that oxytocin increases the salience of social cues is predicated on the assumption that intranasally administered oxytocin increases oxytocin levels in the brain. Without robust auxiliary assumptions, it is unclear whether a hypothesis testing failure is due to an incorrect hypothesis or poorly supported auxiliary assumptions. Consequently, poorly supported auxiliary assumptions can be blamed for hypothesis failure, thereby safeguarding theories from falsification. In this article, I will evaluate the body of evidence for key auxiliary assumptions in human behavioral oxytocin research in terms of theory, experimental design, and statistical inference, and highlight assumptions that require stronger evidence. Strong auxiliary assumptions will leave hypotheses vulnerable to falsification, which will improve hypothesis testing and consequently advance our understanding of oxytocin's role in cognition and behavior.
Affiliation(s)
- Daniel S Quintana
- Department of Psychology, University of Oslo, Oslo, Norway; NevSom, Department of Rare Disorders, Oslo University Hospital, Oslo, Norway; Norwegian Centre for Mental Disorders Research (NORMENT), University of Oslo, Oslo, Norway; KG Jebsen Centre for Neurodevelopmental Disorders, University of Oslo, Oslo, Norway.
4. Beyond psychology: prevalence of p value and confidence interval misinterpretation across different fields. Journal of Pacific Rim Psychology 2021. [DOI: 10.1017/prp.2019.28]
Abstract
P values and confidence intervals (CIs) are the most widely used statistical indices in the scientific literature. Several surveys have revealed that these two indices are generally misunderstood. However, existing surveys on this subject come from psychology and biomedical research, and data from other disciplines are rare. Moreover, the confidence with which researchers make these judgments remains unclear. To fill this research gap, we surveyed 1,479 researchers and students from different fields in China. Results reveal that for significant (i.e., p < .05, CI does not include zero) and non-significant (i.e., p > .05, CI includes zero) conditions, most respondents, regardless of academic degree, research field, or career stage, could not interpret p values and CIs accurately. Moreover, the majority were confident about their (inaccurate) judgments (see osf.io/mcu9q/ for raw data, materials, and supplementary analyses). Therefore, as misinterpretations of p values and CIs prevail in the whole scientific community, there is a need for better statistical training in science.
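The survey's "significant" and "non-significant" conditions rest on a standard duality between p-values and CIs. A minimal sketch with simulated data (not the survey's materials) shows that, for a two-sided one-sample t-test, p < .05 holds exactly when the 95% CI for the mean excludes zero:

```python
# Minimal demonstration: the p < .05 criterion and the "95% CI excludes
# zero" criterion always agree for this test at matched levels.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=0.4, scale=1.0, size=30)  # invented data

t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)
ci_low, ci_high = stats.t.interval(
    0.95, df=len(sample) - 1, loc=sample.mean(), scale=stats.sem(sample)
)

print(f"p = {p_value:.4f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
print("p < .05:", p_value < 0.05, "| CI excludes zero:", not ci_low <= 0 <= ci_high)
```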
5. Carpenter TP, Law KC. Optimizing the scientific study of suicide with open and transparent research practices. Suicide Life Threat Behav 2021; 51:36-46. [PMID: 33624871] [DOI: 10.1111/sltb.12665]
Abstract
Suicide research is vitally important, yet, like psychology research more broadly, it faces methodological challenges. In recent years, researchers have raised concerns about standard practices in psychological research, concerns that apply to suicide research and raise questions about its robustness and validity. In the present paper, we review these concerns and the corresponding solutions put forth by the "open science" community. These include using open science platforms, pre-registering studies, ensuring reproducible analyses, using high-powered studies, ensuring open access to research materials and products, and conducting replication studies. We build upon existing guides, address specific obstacles faced by suicide researchers, and offer a clear set of recommended practices for suicide researchers. In particular, we consider challenges that suicide researchers may face in seeking to adopt "open science" practices (e.g., prioritizing large samples) and suggest possible strategies that the field may use in order to ensure robust and transparent research, despite these challenges.
Affiliation(s)
- Keyne C Law
- Seattle Pacific University, Seattle, Washington, USA
6. Świątkowski W, Carrier A. There is Nothing Magical about Bayesian Statistics: An Introduction to Epistemic Probabilities in Data Analysis for Psychology Starters. Basic and Applied Social Psychology 2020. [DOI: 10.1080/01973533.2020.1792297]
7.
Abstract
In (educational) psychology, replication studies have so far been extremely rare exceptions. This article sets out that, and why, replication studies are indispensable. It then asks why, despite their enormous added value, almost no replications are published, and why many "findings" of psychological research are not replicable. That these statements are not mere conjecture is documented by the available studies. The causes lie at several, partly interdependent, levels of the scientific system: the widespread but mistaken view that "statistical significance" also indicates the probability of being able to replicate a finding; the confusion of "statistically significant" with relevant; the bad habit of formulating the tested hypotheses only after the fact (ex post), that is, with knowledge of a study's results, while presenting them in the publication as a theoretically derived starting point (i.e., formulated a priori); alpha-error inflation through multiple statistical significance tests; the exclusive reporting of results that support the research hypotheses, combined with the suppression of deviating findings; insufficient construct validity of the measurement instruments used; lying and fraud in science; and the low esteem in which journal editors, reviewers, and funding agencies hold replications. All of this means that almost exclusively "statistically significant" and "new" results are published and that false theories persist. Countermeasures include, for example: generous financial support for replication projects and their publication; strong endorsement by reviewers of publishing methodologically adequate replication studies; the willingness of journals to provide enough space for them; and recognition of the great scientific value of replication studies, including in appointment procedures. It follows that the possibilities and demands outlined here for establishing and promoting replication studies must address several audiences in parallel. Lasting change, however, can only be achieved if the individual actors (researchers, reviewers, journal editors, hiring committees, funding agencies) acknowledge their individual responsibility and act on it.
Affiliation(s)
- Detlef H. Rost
- Southwest University Chongqing, Faculty of Psychology, Chongqing, P. R. China
- Philipps-Universität Marburg, Department of Psychology, Marburg, Germany
- Marc Bienefeld
- Universität Bielefeld, Faculty of Educational Science, Bielefeld, Germany
8. Griffiths P, Needleman J. Statistical significance testing and p-values: Defending the indefensible? A discussion paper and position statement. Int J Nurs Stud 2019; 99:103384. [PMID: 31442781] [DOI: 10.1016/j.ijnurstu.2019.07.001]
Abstract
Much statistical teaching and many research reports focus on the 'null hypothesis significance test'. Yet the correct meaning and interpretation of statistical significance tests is elusive. Misinterpretations are both common and persistent, leading many to question whether significance tests should be used at all. While most take aim at the arbitrary declaration of p < 0.05 as a threshold for determining 'significance', others extend the critique to suggest the 'p-value' should be dispensed with entirely. P-values and significance tests are still widely used as if they give a measure of the size and importance of relationships, even though this misunderstanding has been observed and discussed for many years. We argue that p-values and significance tests are intrinsically misleading. Point estimates of relationships and confidence intervals give direct information about the effect and the uncertainty of the estimate without recourse to interpreting how a particular p-value might have arisen or indeed referring to them at all. In this paper we briefly outline some of the problems with significance testing, offer a number of examples selected from a recent issue of the International Journal of Nursing Studies and discuss some proposed responses to these problems. We conclude by offering some guidance to authors reporting statistical tests in journals and present a position statement that has been adopted by the International Journal of Nursing Studies to guide its authors in reporting the results of statistical analyses. While stopping short of calling for an outright ban on reporting p-values and significance tests, we urge authors (and journals) to place more emphasis on measures of effect and estimates of precision/uncertainty and, following the position of the American Statistical Association, emphasise that authors (and readers) should avoid using 0.05 or any other cut-off for a p-value as the basis for a decision about the meaningfulness/importance of an effect. If point estimates and confidence intervals are used, then the p-value may be redundant and can be omitted from reports. When authors talk about 'significance' they need to be explicit when referring to statistical significance and we recommend authors adopt the language of 'importance' when talking about effect sizes to avoid any confusion.
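As a concrete illustration of the recommended reporting style, the invented example below (not taken from the journal) reports a point estimate and a 95% CI for a group difference without attaching a significance verdict; the outcome variable and all numbers are hypothetical, and the pooled degrees of freedom are a simplification.

```python
# Invented example: report the effect and its precision, not a verdict.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
control = rng.normal(52.0, 8.0, size=60)  # e.g., minutes of care per shift
treated = rng.normal(55.5, 8.0, size=60)  # (hypothetical nursing outcome)

diff = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / len(treated)
             + control.var(ddof=1) / len(control))
df = len(treated) + len(control) - 2      # pooled df; Welch df would be finer
t_crit = stats.t.ppf(0.975, df)

print(f"difference = {diff:.2f}, "
      f"95% CI [{diff - t_crit * se:.2f}, {diff + t_crit * se:.2f}]")
```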
Affiliation(s)
- Peter Griffiths
- University of Southampton, UK; Executive Editor, International Journal of Nursing Studies
- Jack Needleman
- Department of Health Policy and Management, University of California, Los Angeles School of Public Health, Los Angeles, USA
9. Badenes-Ribera L, Frias-Navarro D, Iotti NO, Bonilla-Campos A, Longobardi C. Perceived Statistical Knowledge Level and Self-Reported Statistical Practice Among Academic Psychologists. Front Psychol 2018; 9:996. [PMID: 29988476] [PMCID: PMC6024681] [DOI: 10.3389/fpsyg.2018.00996]
Abstract
Introduction: Publications arguing against the null hypothesis significance testing (NHST) procedure and in favor of good statistical practices have increased. The most frequently mentioned alternatives to NHST are effect size statistics (ES), confidence intervals (CIs), and meta-analyses. A recent survey conducted in Spain found that academic psychologists have poor knowledge of effect size statistics, confidence intervals, and graphic displays for meta-analyses, which might lead to misinterpretation of results. It also found that, although the use of ES is becoming generalized, the same is not true for CIs. Finally, academics with greater knowledge of ES statistics presented a profile closer to good statistical practice and research design. Our main purpose was to analyze whether these results extend to a different geographical area through a replication study. Methods: For this purpose, we elaborated an online survey that included the same items as the original research, and we asked academic psychologists to indicate their level of knowledge of ES, their CIs, and meta-analyses, and how they use them. The sample consisted of 159 Italian academic psychologists (54.09% women, mean age 47.65 years). The mean number of years in the position of professor was 12.90 (SD = 10.21). Results: As in the original research, the results showed that, although the use of effect size estimates is becoming generalized, an under-reporting of CIs for ES persists. The most frequently mentioned ES statistics were Cohen's d and R²/η², which are not robust when the data contain outliers, are non-normal, or otherwise violate statistical assumptions. In addition, academics showed poor knowledge of meta-analytic displays (e.g., forest plots and funnel plots) and quality checklists for studies. Finally, academics with higher-level knowledge of ES statistics seem to have a profile closer to good statistical practices. Conclusions: Changing statistical practice is not easy. This change requires statistical training programs for academics, at both the graduate and undergraduate levels.
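Because the under-reporting concerns CIs for effect sizes, the sketch below shows one way to pair Cohen's d with a bootstrap 95% CI. The data are simulated and the helper function is hypothetical; nothing here comes from the survey itself.

```python
# Minimal sketch: Cohen's d with a percentile-bootstrap 95% CI.
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference with a pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1)
                  + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

rng = np.random.default_rng(42)
group_a = rng.normal(100, 15, size=50)  # invented scores
group_b = rng.normal(92, 15, size=50)

point = cohens_d(group_a, group_b)
boots = np.array([
    cohens_d(rng.choice(group_a, len(group_a), replace=True),
             rng.choice(group_b, len(group_b), replace=True))
    for _ in range(5000)
])
lo, hi = np.percentile(boots, [2.5, 97.5])
print(f"d = {point:.2f}, 95% bootstrap CI [{lo:.2f}, {hi:.2f}]")
```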
Affiliation(s)
- Laura Badenes-Ribera
- Departament de Metodologia de les Ciències del Comportament, Universitat de València, Valencia, Spain
- Dolores Frias-Navarro
- Departament de Metodologia de les Ciències del Comportament, Universitat de València, Valencia, Spain
- Nathalie O Iotti
- Dipartimento di Psicologia, Università degli Studi di Torino, Turin, Italy
- Amparo Bonilla-Campos
- Departament de Metodologia de les Ciències del Comportament, Universitat de València, Valencia, Spain
- Claudio Longobardi
- Dipartimento di Psicologia, Università degli Studi di Torino, Turin, Italy
10. Gigerenzer G. Statistical Rituals: The Replication Delusion and How We Got There. Advances in Methods and Practices in Psychological Science 2018. [DOI: 10.1177/2515245918771329]
Abstract
The “replication crisis” has been attributed to misguided external incentives gamed by researchers (the strategic-game hypothesis). Here, I want to draw attention to a complementary internal factor, namely, researchers’ widespread faith in a statistical ritual and associated delusions (the statistical-ritual hypothesis). The “null ritual,” unknown in statistics proper, eliminates judgment precisely at points where statistical theories demand it. The crucial delusion is that the p value specifies the probability of a successful replication (i.e., 1 – p), which makes replication studies appear to be superfluous. A review of studies with 839 academic psychologists and 991 students shows that the replication delusion existed among 20% of the faculty teaching statistics in psychology, 39% of the professors and lecturers, and 66% of the students. Two further beliefs, the illusion of certainty (e.g., that statistical significance proves that an effect exists) and Bayesian wishful thinking (e.g., that the probability of the alternative hypothesis being true is 1 – p), also make successful replication appear to be certain or almost certain, respectively. In every study reviewed, the majority of researchers (56%–97%) exhibited one or more of these delusions. Psychology departments need to begin teaching statistical thinking, not rituals, and journal editors should no longer accept manuscripts that report results as “significant” or “not significant.”
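The replication delusion is easy to probe by simulation. The sketch below (an illustration constructed for this listing, not Gigerenzer's analysis) conditions on "original" studies with p near .04 and shows that exact replications come out significant at roughly the rate set by statistical power, far below the ~96% the 1 - p intuition predicts; effect size and sample size are invented.

```python
# Simulate a true effect (d = 0.4, n = 30, one-sample design), keep
# "original" studies with p in [.03, .05), then run exact replications.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
effect, n, sims = 0.4, 30, 20_000

def one_study():
    x = rng.normal(effect, 1.0, n)
    return stats.ttest_1samp(x, 0.0).pvalue

originals = [one_study() for _ in range(sims)]
selected = [p for p in originals if 0.03 <= p < 0.05]
replications = [one_study() for _ in selected]
rep_rate = np.mean([p < 0.05 for p in replications])

print("the 1 - p intuition predicts ~96% replication for an original p of ~.04")
print(f"actual significant-replication rate: {rep_rate:.0%}")  # ~ the power
```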
Affiliation(s)
- Gerd Gigerenzer
- Harding Center for Risk Literacy, Max Planck Institute for Human Development, Berlin, Germany
11. Lyu Z, Peng K, Hu CP. P-Value, Confidence Intervals, and Statistical Inference: A New Dataset of Misinterpretation. Front Psychol 2018; 9:868. [PMID: 29937743] [PMCID: PMC6002511] [DOI: 10.3389/fpsyg.2018.00868]
Affiliation(s)
- Ziyang Lyu
- Department of Psychology, School of Social Science, Tsinghua University, Beijing, China
- Kaiping Peng
- Department of Psychology, School of Social Science, Tsinghua University, Beijing, China
- Chuan-Peng Hu
- Neuroimaging Center (NIC), Focus Program Translational Neuroscience (FTN), Johannes Gutenberg University, Mainz, Germany
- Deutsches Resilienz Zentrum (DRZ), University Medical Center of the Johannes Gutenberg University, Mainz, Germany
12. Badenes-Ribera L, Frias-Navarro D. Falacias sobre el valor p compartidas por profesores y estudiantes universitarios [Fallacies about the p value shared by professors and university students]. Universitas Psychologica 2017. [DOI: 10.11144/javeriana.upsy16-3.fvcp]
Abstract
The "Evidence Based Practice" requires professionals to critically assess the results of psychological research. However, incorrect interpretations of p values of probability are abundant and repetitive. These misconceptions affect professional decisions and compromise the quality of interventions and the accumulation of a valid scientific knowledge. Identifying the types of fallacies that underlying statistical decisions is fundamental for approaching and planning statistical education strategies designed to intervene in incorrect interpretations. Therefore, the aim of this study is to analyze the interpretation of p value among college students of psychology and academic psychologist. The sample was composed of 161 participants (43 academic and 118 students). The mean number of years as academic was 16.7 (SD = 10.07). The mean age of college students was 21.59 years (SD = 1.3). The findings suggest that college students and academic do not know the correct interpretation of p values. The fallacy of the inverse probability presents major problems of comprehension. In addition, statistical significance and practical significance or clinical are confused. There is a need for statistical education and statistical re-education.
13. Amrhein V, Korner-Nievergelt F, Roth T. The earth is flat (p > 0.05): significance thresholds and the crisis of unreplicable research. PeerJ 2017; 5:e3544. [PMID: 28698825] [PMCID: PMC5502092] [DOI: 10.7717/peerj.3544]
Abstract
The widespread use of 'statistical significance' as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (according to the American Statistical Association). We review why degrading p-values into 'significant' and 'nonsignificant' contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value, but mistrust results with larger p-values. In either case, p-values tell little about reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Significance (p ≤ 0.05) is itself hardly replicable: at a good statistical power of 80%, two studies will be 'conflicting', meaning that one is significant and the other is not, in one third of the cases if there is a true effect. A replication can therefore not be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgment based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to selective reporting and to publication bias against nonsignificant findings. Data dredging, p-hacking, and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Larger p-values also offer some evidence against the null hypothesis, and they cannot be interpreted as supporting it; concluding that 'there is no effect' from a nonsignificant result is a false conclusion. Information on possible true effect sizes that are compatible with the data must be obtained from the point estimate, e.g., from a sample average, and from the interval estimate, such as a confidence interval. We review how confusion about interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, for example that decision rules should rather be more stringent, that sample sizes could decrease, or that p-values should better be completely abandoned. We conclude that whatever method of statistical inference we use, dichotomous threshold thinking must give way to non-automated informed judgment.
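The "one third conflicting" figure follows from 2 × 0.8 × 0.2 = 0.32, and a short simulation reproduces it. The sketch below is a hedged check of that arithmetic, not the authors' code; the per-group sample size is an assumption taken from standard power tables (n = 64 per group gives roughly 80% power for d = 0.5 in a two-sample t-test).

```python
# Simulate pairs of independent studies of the same true effect at ~80%
# power and count how often exactly one study is significant.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
d, alpha, sims = 0.5, 0.05, 20_000
n = 64  # per group; ~80% power for d = 0.5 (assumed from power tables)

def significant():
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(d, 1.0, n)
    return stats.ttest_ind(a, b).pvalue < alpha

conflicting = np.mean([significant() != significant() for _ in range(sims)])
print(f"proportion of 'conflicting' study pairs: {conflicting:.2f}")  # ~0.32
```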
Affiliation(s)
- Valentin Amrhein
- Zoological Institute, University of Basel, Basel, Switzerland
- Research Station Petite Camargue Alsacienne, Saint-Louis, France
- Swiss Ornithological Institute, Sempach, Switzerland
- Tobias Roth
- Zoological Institute, University of Basel, Basel, Switzerland
- Research Station Petite Camargue Alsacienne, Saint-Louis, France