1. Debray TPA, Vergouwe Y, Koffijberg H, Nieboer D, Steyerberg EW, Moons KGM. A new framework to enhance the interpretation of external validation studies of clinical prediction models. J Clin Epidemiol 2015;68:279-89. PMID: 25179855. DOI: 10.1016/j.jclinepi.2014.06.018.
Abstract
OBJECTIVES It is widely acknowledged that the performance of diagnostic and prognostic prediction models should be assessed in external validation studies with independent data from samples that are "different but related" to the development sample. We developed a framework of methodological steps and statistical methods for analyzing and enhancing the interpretation of results from external validation studies of prediction models. STUDY DESIGN AND SETTING We propose to quantify the degree of relatedness between development and validation samples on a scale ranging from reproducibility to transportability by evaluating their corresponding case-mix differences. We subsequently assess the model's performance in the validation sample and interpret it in view of the case-mix differences. Finally, we may adjust the model to the validation setting. RESULTS We illustrate this three-step framework with a prediction model for diagnosing deep venous thrombosis using three validation samples with varying case mix. While one external validation sample merely assessed the model's reproducibility, the two other samples rather assessed its transportability. The performance in all validation samples was adequate, and the model did not require extensive updating to correct for miscalibration or poor fit to the validation settings. CONCLUSION The proposed framework enhances the interpretation of findings at external validation of prediction models.
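The case-mix comparison at the heart of this framework can be sketched with a simple statistic. The toy example below (a hypothetical illustration, not code from the paper) uses the standardized mean difference of a predictor between development and validation samples; a small difference suggests a reproducibility-type validation, a large one suggests transportability is being tested.

```python
# Hypothetical sketch: quantifying case-mix differences between a
# development and a validation sample with standardized mean
# differences, one crude way to judge whether a validation exercise
# tests reproducibility (similar case mix) or transportability
# (different case mix).
from statistics import mean, stdev

def standardized_difference(dev, val):
    """Absolute difference in means, scaled by the pooled SD."""
    pooled_sd = ((stdev(dev) ** 2 + stdev(val) ** 2) / 2) ** 0.5
    return abs(mean(dev) - mean(val)) / pooled_sd

# Toy values of one predictor (e.g. age) in the two samples.
development = [55, 60, 58, 62, 57, 59, 61, 56]
validation_similar = [56, 59, 60, 61, 58, 57, 60, 59]
validation_different = [35, 40, 38, 42, 37, 39, 41, 36]

print(standardized_difference(development, validation_similar))    # small
print(standardized_difference(development, validation_different))  # large
```

In practice the paper's framework works with the full multivariable case mix rather than one predictor at a time, but the intuition is the same.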
2. Park JE, Park SY, Kim HJ, Kim HS. Reproducibility and Generalizability in Radiomics Modeling: Possible Strategies in Radiologic and Statistical Perspectives. Korean J Radiol 2019;20:1124-1137. PMID: 31270976. PMCID: PMC6609433. DOI: 10.3348/kjr.2018.0070.
Abstract
Radiomics, which involves the use of high-dimensional quantitative imaging features for predictive purposes, is a powerful tool for developing and testing medical hypotheses. Radiologic and statistical challenges in radiomics include those related to the reproducibility of imaging data, control of overfitting due to high dimensionality, and the generalizability of modeling. The aims of this review article are to clarify the distinctions between radiomics features and other omics and imaging data, to describe the challenges and potential strategies in reproducibility and feature selection, and to reveal the epidemiological background of modeling, thereby facilitating and promoting more reproducible and generalizable radiomics research.
3. Nastase SA, Goldstein A, Hasson U. Keep it real: rethinking the primacy of experimental control in cognitive neuroscience. Neuroimage 2020;222:117254. PMID: 32800992. PMCID: PMC7789034. DOI: 10.1016/j.neuroimage.2020.117254.
Abstract
Naturalistic experimental paradigms in neuroimaging arose from a pressure to test the validity of models we derive from highly-controlled experiments in real-world contexts. In many cases, however, such efforts led to the realization that models developed under particular experimental manipulations failed to capture much variance outside the context of that manipulation. The critique of non-naturalistic experiments is not a recent development; it echoes a persistent and subversive thread in the history of modern psychology. The brain has evolved to guide behavior in a multidimensional world with many interacting variables. The assumption that artificially decoupling and manipulating these variables will lead to a satisfactory understanding of the brain may be untenable. We develop an argument for the primacy of naturalistic paradigms, and point to recent developments in machine learning as an example of the transformative power of relinquishing control. Naturalistic paradigms should not be deployed as an afterthought if we hope to build models of brain and behavior that extend beyond the laboratory into the real world.
4. Graham EK, Rutsohn JP, Turiano NA, Bendayan R, Batterham PJ, Gerstorf D, Katz MJ, Reynolds CA, Sharp ES, Yoneda TB, Bastarache ED, Elleman LG, Zelinski EM, Johansson B, Kuh D, Barnes LL, Bennett DA, Deeg DJH, Lipton RB, Pedersen NL, Piccinin AM, Spiro A, Muniz-Terrera G, Willis SL, Schaie KW, Roan C, Herd P, Hofer SM, Mroczek DK. Personality Predicts Mortality Risk: An Integrative Data Analysis of 15 International Longitudinal Studies. J Res Pers 2017;70:174-186. PMID: 29230075. DOI: 10.1016/j.jrp.2017.07.005.
Abstract
This study examined the Big Five personality traits as predictors of mortality risk, and smoking as a mediator of that association. Replication was built into the fabric of our design: we used a Coordinated Analysis with 15 international datasets, representing 44,094 participants. We found that high neuroticism and low conscientiousness, extraversion, and agreeableness were consistent predictors of mortality across studies. Smoking had a small mediating effect for neuroticism. Country and baseline age explained variation in effects: studies with older baseline age showed a pattern of protective effects (HR<1.00) for openness, and U.S. studies showed a pattern of protective effects for extraversion. This study demonstrated coordinated analysis as a powerful approach to enhance replicability and reproducibility, especially for aging-related longitudinal research.
5. He J, Morales DR, Guthrie B. Exclusion rates in randomized controlled trials of treatments for physical conditions: a systematic review. Trials 2020;21:228. PMID: 32102686. PMCID: PMC7045589. DOI: 10.1186/s13063-020-4139-0.
Abstract
Background The generalisability of randomized controlled trials (RCTs) can be uncertain because the impact of exclusion criteria is rarely quantified. The aim of this study was to systematically review studies examining the percentage of clinical populations with a physical health condition who would be excluded by RCTs of treatments for that condition. Methods Medline and Embase were searched from inception to February 11, 2018. Two reviewers independently completed screening, full-text review, data extraction and risk-of-bias assessment. The primary outcome was the percentage of patients in the clinical population who would have been excluded from each examined trial. Subgroup analyses examined exclusion by population setting, publication date and funding source. Results A total of 20,754 titles/abstracts were screened, and 50 studies were included which reported exclusion rates from 305 trials of treatments in 31 physical conditions. Estimated rates of exclusion from trials varied from 0% to 100%, and the median exclusion rate was 77.1% of patients (interquartile range 55.5% to 89.0%). Median exclusion rates for trials in common chronic conditions were high, including hypertension 83.0%, type 2 diabetes 81.7%, chronic obstructive pulmonary disease 84.3%, and asthma 96.0%. The most commonly applied exclusion criteria related to age, co-morbidity and co-prescribing, whereas more implicit criteria relating to life expectancy or functional status were not typically examined. There was no evidence that exclusion varied by the nature of the clinical population in which exclusion was evaluated or by trial funding source, and no statistically significant change in exclusion rates in more recent compared with older trials. Conclusions The majority of the trials examined excluded the majority of patients with the condition being treated: almost a quarter of the trials studied excluded over 90% of patients, more than half excluded at least three quarters of patients, and four out of five excluded at least half of patients. A limitation is that most studies applied only a subset of eligibility criteria, so exclusion rates are likely under-estimated. Exclusion from trials of older people and of people with co-morbidity and co-prescribing is increasingly untenable given population aging and increasing multimorbidity. Trial registration PROSPERO CRD42016042282.
6.
Abstract
Cross-validation (CV) is increasingly popular as a generic method to adjudicate between mathematical models of cognition and behavior. In order to measure model generalizability, CV quantifies out-of-sample predictive performance, and the CV preference goes to the model that predicted the out-of-sample data best. The advantages of CV include theoretic simplicity and practical feasibility. Despite its prominence, however, the limitations of CV are often underappreciated. Here, we demonstrate the limitations of a particular form of CV—Bayesian leave-one-out cross-validation or LOO—with three concrete examples. In each example, a data set of infinite size is perfectly in line with the predictions of a simple model (i.e., a general law or invariance). Nevertheless, LOO shows bounded and relatively modest support for the simple model. We conclude that CV is not a panacea for model selection.
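The mechanics of cross-validation that this abstract critiques can be illustrated with a plain (non-Bayesian) leave-one-out loop. In the hypothetical sketch below, a fixed "general law" model (always predict 0) competes against a flexible model that re-estimates the mean on each training fold; even with data centred on 0, the law's out-of-sample advantage is real but modest, echoing the abstract's point that CV support for a simple true model can be surprisingly bounded.

```python
# Hypothetical sketch: plain leave-one-out cross-validation scoring
# two models by out-of-sample squared error. Model A is a fixed
# "general law" (predict 0); model B fits the mean of the training fold.

def loo_error(data, fit_predict):
    """Mean squared leave-one-out prediction error."""
    errors = []
    for i, y in enumerate(data):
        train = data[:i] + data[i + 1:]   # hold out observation i
        pred = fit_predict(train)
        errors.append((y - pred) ** 2)
    return sum(errors) / len(errors)

data = [0.1, -0.2, 0.05, -0.1, 0.15, -0.05]  # noisy data centred on 0

law_error = loo_error(data, lambda train: 0.0)
mean_error = loo_error(data, lambda train: sum(train) / len(train))
print(law_error, mean_error)  # the fixed law wins, but only modestly
```

The article's examples concern Bayesian LOO with marginal likelihoods rather than squared error, but the same fold structure underlies both.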
7. Going web or staying paper? The use of web-surveys among older people. BMC Med Res Methodol 2020;20:252. PMID: 33032531. PMCID: PMC7545880. DOI: 10.1186/s12874-020-01138-0.
Abstract
Background Web-surveys are increasingly used in population studies, yet web-surveys targeting older individuals are still uncommon for various reasons. However, with younger cohorts approaching older age, the potential of web-surveys among older people might improve. In this study, we investigated response patterns in a web-survey targeting older adults and the potential importance of offering a paper questionnaire as an alternative to the web questionnaire. Methods We analyzed data from three waves of a retirement study, in which a web-push methodology was used and a paper questionnaire was offered as an alternative to the web questionnaire in the last reminder. We mapped the response patterns, compared web- and paper respondents, and compared key outcomes estimated from the sample with and without the paper respondents, both at baseline and after two follow-ups. Results Paper respondents, that is, those who did not respond until they received a paper questionnaire with the last reminder, were more likely to be female, retired, and single, and to report a lower level of education, higher levels of depression, and lower self-reported health, compared with web respondents. The association between retirement status and depression was present only among web respondents. The differences between web and paper respondents were stronger in the longitudinal sample (after two follow-ups) than at baseline. Conclusions We conclude that a web-survey can be a feasible and good alternative in surveys targeting people in the retirement age range. However, without offering a paper questionnaire, a small but important group will likely be missed, with potentially biased estimates as the result.
8. Hayes-Larson E, Kezios KL, Mooney SJ, Lovasi G. Who is in this study, anyway? Guidelines for a useful Table 1. J Clin Epidemiol 2019;114:125-132. PMID: 31229583. PMCID: PMC6773463. DOI: 10.1016/j.jclinepi.2019.06.011.
Abstract
OBJECTIVE Epidemiologic and clinical research papers often describe the study sample in the first table. If well-executed, this "Table 1" can illuminate potential threats to internal and external validity. However, little guidance exists on best practices for designing a Table 1, especially for complex study designs and analyses. We aimed to summarize and extend the literature related to reporting descriptive statistics. STUDY DESIGN AND SETTING In consultation with existing guidelines, we synthesized and developed reporting recommendations driven by study design and focused on transparency related to potential threats to internal and external validity. RESULTS We describe a basic structure for Table 1 and discuss simple modifications in terms of columns, rows, and cells to enhance a reader's ability to judge both internal and external validity. We further highlight several analytic complexities common in epidemiologic research (missing data, sample weights, clustered data, and interaction) and describe possible variations to Table 1 to maintain and add clarity about study validity in light of these issues. We discuss considerations and tradeoffs in Table 1 related to breadth and comprehensiveness vs. parsimony and reader-friendliness. CONCLUSION We anticipate that our work will guide authors considering layouts for Table 1, with attention to the reader's perspective.
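The basic column structure the authors start from, descriptive statistics overall and by exposure group, can be sketched in a few lines. The data and layout below are a hypothetical toy illustration, not from the paper.

```python
# Hypothetical sketch: a minimal "Table 1" row, a characteristic
# summarized overall and within each exposure group.
from statistics import mean, stdev

rows = [  # (age, exposed)
    (34, 1), (41, 1), (29, 1), (52, 0), (47, 0), (38, 0), (45, 1), (50, 0),
]

def summarize(values):
    """Mean (SD), the usual cell format for a continuous variable."""
    return f"{mean(values):.1f} ({stdev(values):.1f})"

overall = [age for age, _ in rows]
exposed = [age for age, e in rows if e == 1]
unexposed = [age for age, e in rows if e == 0]

print(f"{'':<18}{'Overall':<14}{'Exposed':<14}{'Unexposed'}")
print(f"{'Age, mean (SD)':<18}{summarize(overall):<14}"
      f"{summarize(exposed):<14}{summarize(unexposed)}")
```

The paper's recommendations then extend this skeleton with columns, rows, and cells that surface threats to internal and external validity (missing data, weights, clustering).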
9. Eckstein MK, Wilbrecht L, Collins AGE. What do Reinforcement Learning Models Measure? Interpreting Model Parameters in Cognition and Neuroscience. Curr Opin Behav Sci 2021;41:128-137. PMID: 34984213. PMCID: PMC8722372. DOI: 10.1016/j.cobeha.2021.06.004.
Abstract
Reinforcement learning (RL) is a concept that has been invaluable to fields including machine learning, neuroscience, and cognitive science. However, what RL entails differs between fields, leading to difficulties when interpreting and translating findings. After laying out these differences, this paper focuses on cognitive (neuro)science to discuss how we as a field might over-interpret RL modeling results. We too often assume, implicitly, that modeling results generalize between tasks, models, and participant populations, despite negative empirical evidence for this assumption. We also often assume that parameters measure specific, unique (neuro)cognitive processes, a concept we call interpretability, when evidence suggests that they capture different functions across studies and tasks. We conclude that future computational research needs to pay increased attention to implicit assumptions when using RL models, and suggest that a more systematic understanding of contextual factors will help address these issues and improve the ability of RL to explain brain and behavior.
10. Susukida R, Crum RM, Ebnesajjad C, Stuart EA, Mojtabai R. Generalizability of findings from randomized controlled trials: application to the National Institute of Drug Abuse Clinical Trials Network. Addiction 2017;112:1210-1219. PMID: 28191694. PMCID: PMC5461185. DOI: 10.1111/add.13789.
Abstract
AIMS To compare randomized controlled trial (RCT) sample treatment effects with the population effects of substance use disorder (SUD) treatment. DESIGN Statistical weighting was used to re-compute the effects from 10 RCTs such that the participants in the trials had characteristics that resembled those of patients in the target populations. SETTINGS Multi-site RCTs and usual SUD treatment settings in the United States. PARTICIPANTS A total of 3592 patients in 10 RCTs and 1 602 226 patients from usual SUD treatment settings between 2001 and 2009. MEASUREMENTS Three outcomes of SUD treatment were examined: retention, urine toxicology and abstinence. We weighted the RCT sample treatment effects using propensity scores representing the conditional probability of participating in RCTs. FINDINGS Weighting the samples changed the significance of estimated sample treatment effects. Most commonly, positive effects of trials became statistically non-significant after weighting (three trials for retention and urine toxicology and one trial for abstinence); also, non-significant effects became significantly positive (one trial for abstinence) and significantly negative effects became non-significant (two trials for abstinence). There was suggestive evidence of treatment effect heterogeneity in subgroups that are under- or over-represented in the trials, some of which were consistent with the differences in average treatment effects between weighted and unweighted results. CONCLUSIONS The findings of randomized controlled trials (RCTs) for substance use disorder treatment do not appear to be directly generalizable to target populations when the RCT samples do not reflect adequately the target populations and there is treatment effect heterogeneity across patient subgroups.
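The weighting idea in this study can be sketched with a toy example. Below (a hypothetical illustration, not the study's code, with made-up propensity scores), each trial participant is weighted by the inverse of their estimated probability of trial participation, so that the weighted sample resembles the target population before the treatment effect is re-computed.

```python
# Hypothetical sketch: inverse-probability-of-participation weighting
# of RCT participants, then a weighted difference in outcome means
# between treatment arms.

def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Toy trial data: binary outcome, treatment arm, and an assumed
# propensity score p(trial participation | covariates) from a
# separate model (values invented for illustration).
outcomes   = [1, 0, 1, 1, 0, 1, 0, 0]
treated    = [1, 1, 1, 1, 0, 0, 0, 0]
propensity = [0.8, 0.6, 0.7, 0.9, 0.5, 0.8, 0.4, 0.6]

weights = [1.0 / p for p in propensity]  # up-weight under-represented patients

treated_mean = weighted_mean(
    [y for y, t in zip(outcomes, treated) if t == 1],
    [w for w, t in zip(weights, treated) if t == 1])
control_mean = weighted_mean(
    [y for y, t in zip(outcomes, treated) if t == 0],
    [w for w, t in zip(weights, treated) if t == 0])
print(treated_mean - control_mean)  # weighted treatment effect
```

In the study itself, the propensity scores come from a model of trial participation fitted on combined trial and target-population data, which is where the re-weighting gets its leverage.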
11. Zhang J, Ma X, Zhang J, Sun D, Zhou X, Mi C, Wen H. Insights into geospatial heterogeneity of landslide susceptibility based on the SHAP-XGBoost model. J Environ Manage 2023;332:117357. PMID: 36731409. DOI: 10.1016/j.jenvman.2023.117357.
Abstract
The spatial heterogeneity of landslide influencing factors is the main reason for the poor generalizability of susceptibility evaluation models. This study aimed to construct a comprehensive explanatory framework for landslide susceptibility evaluation models based on the SHAP (SHapley Additive exPlanations)-XGBoost (eXtreme Gradient Boosting) algorithm, to analyze the regional characteristics and spatial heterogeneity of landslide influencing factors, and to discuss the heterogeneity of the generalizability of the models under different landscapes. First, we selected different regions in a typical mountainous and hilly region and constructed a geospatial database containing 12 landslide influencing factors, such as elevation, annual average rainfall, slope, lithology, and NDVI, through field surveys, satellite images, and a literature review. Subsequently, the landslide susceptibility evaluation model was constructed based on the XGBoost algorithm and the spatial database, and the model's predictions were explained in terms of regional topography, geology, and hydrology using the SHAP algorithm. Finally, the model was applied to regions with both similar and very different topography, geology, meteorology, and vegetation, to explore the spatial heterogeneity of its generalizability. The following conclusions were drawn: the spatial distribution of landslides is heterogeneous and complex, and the contribution of each influencing factor to the occurrence of landslides has obvious regional characteristics and spatial heterogeneity. The generalizability of the landslide susceptibility evaluation model is spatially heterogeneous, and the model generalizes better to regions with similar regional characteristics. Explaining the XGBoost model with the SHAP method allows quantitative analysis of how the contributions of various factors to disasters differ due to spatial heterogeneity, from the perspective of both global and local evaluation units. In summary, the integrated explanatory framework based on the SHAP-XGBoost model can quantify the contribution of influencing factors to landslide occurrence at both global and local levels, which is conducive to constructing and improving the influencing-factor system of landslide susceptibility in different regions. It can also provide a reference for predicting potential landslide hazard-prone areas and for Explainable Artificial Intelligence (XAI) research.
12. Limits to the generalizability of resting-state functional magnetic resonance imaging studies of youth: An examination of ABCD Study® baseline data. Brain Imaging Behav 2022;16:1919-1925. PMID: 35552993. DOI: 10.1007/s11682-022-00665-2.
Abstract
This study examined how resting-state functional magnetic resonance imaging (rs-fMRI) data quality and availability relate to clinical and sociodemographic variables within the Adolescent Brain Cognitive Development Study. A sample of participants with an adequate sample of quality baseline rs-fMRI data containing low average motion (framewise displacement ≤ 0.15; low-noise; n = 4,356) was compared to a sample of participants without an adequate sample of quality data and/or containing high average motion (higher-noise; n = 7,437) using Chi-squared analyses and t-tests. A linear mixed model examined relationships between clinical and sociodemographic characteristics and average head motion in the sample with low-noise data. Relative to the sample with higher-noise data, the low-noise sample included more females, youth identified by parents as non-Hispanic white, and youth with married parents, higher parent education, and greater household incomes (ORs = 1.32-1.42). Youth in the low-noise sample were also older and had higher neurocognitive skills, lower BMIs, and fewer externalizing and neurodevelopmental problems (ds = 0.12-0.30). Within the low-noise sample, several clinical and demographic characteristics related to motion. Thus, participants with low-noise rs-fMRI data may be less representative of the general population and motion may remain a confound in this sample. Future rs-fMRI studies of youth should consider these limitations in the design and analysis stages in order to optimize the representativeness and clinical relevance of analyses and results.
13. Subbaswamy A, Saria S. From development to deployment: dataset shift, causality, and shift-stable models in health AI. Biostatistics 2020;21:345-352. PMID: 31742354. DOI: 10.1093/biostatistics/kxz041.
14. Roach BJ, D'Souza DC, Ford JM, Mathalon DH. Test-retest reliability of time-frequency measures of auditory steady-state responses in patients with schizophrenia and healthy controls. Neuroimage Clin 2019;23:101878. PMID: 31228795. PMCID: PMC6587022. DOI: 10.1016/j.nicl.2019.101878.
Abstract
BACKGROUND Auditory steady-state response (ASSR) paradigms have consistently demonstrated gamma band abnormalities in schizophrenia at a 40-Hz driving frequency with both electroencephalography (EEG) and magnetoencephalography (MEG). Various time-frequency measures have been used to assess the 40-Hz ASSR, including evoked power, single trial total power, phase-locking factor (PLF), and phase-locking angle (PLA). While both EEG and MEG studies have shown power and PLF ASSR measures to exhibit excellent test-retest reliability in healthy adults, the reliability of these measures in patients with schizophrenia has not been determined. METHODS ASSRs were obtained by recording EEG data during presentation of repeated 20-Hz, 30-Hz and 40-Hz auditory click trains from nine schizophrenia patients (SZ) and nine healthy controls (HC) tested on two occasions. Similar ASSR data were collected from a separate group of 30 HC on two to three test occasions. A subset of these HC subjects had EEG recordings during two tasks, passively listening and actively attending to click train stimuli. Evoked power, total power, PLF, and PLA were calculated following Morlet wavelet time-frequency decomposition of EEG data, and test-retest generalizability (G) coefficients were calculated for each ASSR condition, time-frequency measure, and subject group. RESULTS G-coefficients ranged from good to excellent (> 0.6) for most 40-Hz time-frequency measures and participant groups, whereas 20-Hz G-coefficients were much more variable. Importantly, test-retest reliability was excellent for the various 40-Hz ASSR measures in SZ, similar to reliabilities in HC. Active attention to click train stimuli modestly reduced G-coefficients in HC relative to the passive listening condition. DISCUSSION The excellent test-retest reliability of 40-Hz ASSR measures replicates previous EEG and MEG studies. PLA, a relatively new time-frequency measure, was shown for the first time to have excellent reliability, comparable to power and PLF measures. The excellent reliability of 40-Hz ASSR measures in SZ supports their use in clinical trials and longitudinal observational studies.
15. Humphreys K, Blodgett JC, Roberts LW. The exclusion of people with psychiatric disorders from medical research. J Psychiatr Res 2015;70:28-32. PMID: 26424420. DOI: 10.1016/j.jpsychires.2015.08.005.
Abstract
People with psychiatric disorders are excluded from medical research to an unknown degree, with unknown effects. We examined the prevalence of reported psychiatric exclusion criteria using a sample of 400 highly-cited randomized trials (2002-2010) across 20 common chronic disorders (6 psychiatric and 14 other medical disorders). Two coders rated the presence of psychiatric exclusion criteria for each trial. Half of all trials (and 84% of psychiatric disorder treatment trials) reported possible or definite psychiatric exclusion criteria, with significant variation across disorders (p < .001). Non-psychiatric conditions with high rates of reported psychiatric exclusion criteria included low back pain (75%), osteoarthritis (57%), COPD (55%), and diabetes (55%). The most commonly reported psychiatric exclusion criteria were those related to substance use disorders (reported in 48% of trials with at least one psychiatric exclusion criterion). General psychiatric exclusions (e.g., "any serious psychiatric disorder") were also prevalent (38% of trials). Psychiatric disorder trials were more likely than other medical disorder trials to report each specific type of psychiatric exclusion (p's < .001). Because published clinical trial reports do not always fully describe exclusion criteria, this study's estimates of the prevalence of psychiatric exclusion criteria are conservative. Clinical trials greatly influence state-of-the-art medical care, yet individuals with psychiatric disorders are often actively excluded from these trials. This pattern of exclusion represents an under-recognized and worrisome cause of health inequity. Further attention should be paid to how individuals with psychiatric disorders can be safely included in medical research to address this important clinical and social justice issue.
16. Eckstein MK, Master SL, Xia L, Dahl RE, Wilbrecht L, Collins AGE. The interpretation of computational model parameters depends on the context. eLife 2022;11:e75474. PMID: 36331872. PMCID: PMC9635876. DOI: 10.7554/eLife.75474.
Abstract
Reinforcement Learning (RL) models have revolutionized the cognitive and brain sciences, promising to explain behavior from simple conditioning to complex problem solving, to shed light on developmental and individual differences, and to anchor cognitive processes in specific brain mechanisms. However, the RL literature increasingly reveals contradictory results, which might cast doubt on these claims. We hypothesized that many contradictions arise from two commonly-held assumptions about computational model parameters that are actually often invalid: That parameters generalize between contexts (e.g. tasks, models) and that they capture interpretable (i.e. unique, distinctive) neurocognitive processes. To test this, we asked 291 participants aged 8-30 years to complete three learning tasks in one experimental session, and fitted RL models to each. We found that some parameters (exploration / decision noise) showed significant generalization: they followed similar developmental trajectories, and were reciprocally predictive between tasks. Still, generalization was significantly below the methodological ceiling. Furthermore, other parameters (learning rates, forgetting) did not show evidence of generalization, and sometimes even opposite developmental trajectories. Interpretability was low for all parameters. We conclude that the systematic study of context factors (e.g. reward stochasticity; task volatility) will be necessary to enhance the generalizability and interpretability of computational cognitive models.
Zimmerman M, Balling C, Chelminski I, Dalrymple K. Have Treatment Studies of Depression Become Even Less Generalizable? Applying the Inclusion and Exclusion Criteria in Placebo-Controlled Antidepressant Efficacy Trials Published over 20 Years to a Clinical Sample. PSYCHOTHERAPY AND PSYCHOSOMATICS 2020; 88:165-170. [PMID: 31096246 DOI: 10.1159/000499917] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/04/2019] [Accepted: 03/26/2019] [Indexed: 11/19/2022]
Abstract
BACKGROUND Antidepressants are amongst the most frequently prescribed medications. More than a decade ago, our clinical research group applied a prototypic set of inclusion/exclusion criteria used in an antidepressant efficacy trial (AET) to patients presenting for treatment in our outpatient practice and found that most patients would not qualify for the trial. In the present report from the Rhode Island Methods to Improve Diagnostic Assessment and Services (MIDAS) project, we apply the psychiatric inclusion/exclusion criteria used in 158 placebo-controlled studies to a large sample of depressed patients who presented for outpatient treatment to determine the range and extent of the representativeness of samples treated in AETs and whether this has changed over time. METHOD We applied the inclusion and exclusion criteria used in 158 AETs to 1,271 patients presenting to an outpatient practice who received a principal diagnosis of major depressive disorder. The patients underwent a thorough diagnostic evaluation. RESULTS Across all 158 studies, the percentage of patients that would have been excluded ranged from 44.4 to 99.8% (mean = 86.1%). The percentage of patients that would have been excluded was significantly higher in the studies published in 2010 through 2014 compared to the studies published from 1995 to 2009 (91.4 vs. 83.8%, t(156) = 3.74, p < 0.001). CONCLUSIONS Only a minority of depressed patients seen in clinical practice are likely to be eligible for most AETs. The generalizability of AETs has decreased over time. It is unclear how generalizable the results of AETs are to patients treated in real-world clinical practice.
Chen JS, Coyner AS, Ostmo S, Sonmez K, Bajimaya S, Pradhan E, Valikodath N, Cole ED, Al-Khaled T, Chan RVP, Singh P, Kalpathy-Cramer J, Chiang MF, Campbell JP. Deep Learning for the Diagnosis of Stage in Retinopathy of Prematurity: Accuracy and Generalizability across Populations and Cameras. Ophthalmol Retina 2021; 5:1027-1035. [PMID: 33561545 PMCID: PMC8364291 DOI: 10.1016/j.oret.2020.12.013] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2020] [Revised: 12/02/2020] [Accepted: 12/16/2020] [Indexed: 12/23/2022]
Abstract
PURPOSE Stage is an important feature to identify in retinal images of infants at risk of retinopathy of prematurity (ROP). The purpose of this study was to implement a convolutional neural network (CNN) for binary detection of stages 1, 2, and 3 in ROP and to evaluate its generalizability across different populations and camera systems. DESIGN Diagnostic validation study of CNN for stage detection. PARTICIPANTS Retinal fundus images obtained from preterm infants during routine ROP screenings. METHODS Two datasets were used: 5943 fundus images obtained by RetCam camera (Natus Medical, Pleasanton, CA) from 9 North American institutions and 5049 images obtained by 3nethra camera (Forus Health Incorporated, Bengaluru, India) from 4 hospitals in Nepal. Images were labeled based on the presence of stage by 1 to 3 expert graders. Three CNN models were trained using 5-fold cross-validation on datasets from North America alone, Nepal alone, and a combined dataset and were evaluated on 2 held-out test sets consisting of 708 and 247 images from the Nepali and North American datasets, respectively. MAIN OUTCOME MEASURES Convolutional neural network performance was evaluated using area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), sensitivity, and specificity. RESULTS Both the North American- and Nepali-trained models demonstrated high performance on a test set from the same population (AUROC, 0.99; AUPRC, 0.98; sensitivity, 94%; and AUROC, 0.97; AUPRC, 0.91; sensitivity, 73%; respectively). However, the performance of each model decreased to AUROC of 0.96 and AUPRC of 0.88 (sensitivity, 52%) and AUROC of 0.62 and AUPRC of 0.36 (sensitivity, 44%) when evaluated on a test set from the other population. Compared with the models trained on individual datasets, the model trained on the combined dataset achieved improved performance on each respective test set: sensitivity improved from 94% to 98% on the North American test set and from 73% to 82% on the Nepali test set. CONCLUSIONS A CNN can accurately identify the presence of ROP stage in retinal images, but performance depends on the similarity between training and testing populations. We demonstrated that internal and external performance can be improved by increasing the heterogeneity of the training dataset, in this case by combining images from different populations and cameras.
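The AUROC comparisons reported here (and in several other entries in this list) rest on the Mann-Whitney identity: AUROC is the probability that a randomly chosen positive case outscores a randomly chosen negative one. A minimal NumPy sketch of that computation, with illustrative scores (not data from the study) where the "external" scores separate the classes less cleanly and AUROC drops accordingly:

```python
import numpy as np

def auroc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity:
    the probability that a random positive outscores a random negative.
    Assumes continuous scores (no tie correction)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Toy "internal" vs "external" score sets: the external scores rank the
# classes less cleanly, so AUROC falls below 1.0.
labels = np.array([0, 0, 0, 1, 1, 1])
internal_scores = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
external_scores = np.array([0.1, 0.6, 0.3, 0.4, 0.8, 0.9])

print(auroc(labels, internal_scores))  # 1.0
print(auroc(labels, external_scores))  # 8/9 ≈ 0.889
```

Because AUROC depends only on ranks, it is insensitive to score recalibration between populations, which is why studies like this one report it alongside sensitivity at a fixed threshold.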
Lorenzo-Luaces L, Zimmerman M, Cuijpers P. Are studies of psychotherapies for depression more or less generalizable than studies of antidepressants? J Affect Disord 2018. [PMID: 29522947 DOI: 10.1016/j.jad.2018.02.066] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 01/28/2023]
Abstract
BACKGROUND The generalizability of findings from studies exploring the efficacy of psychotherapy and antidepressants has been called into question in part because studies exclude many patients. Despite this, the frequency with which psychotherapy and antidepressant studies use specific inclusion and exclusion criteria has never been compared. We explored the exclusion criteria used in psychotherapy and pharmacotherapy studies from 1995 to 2014. METHOD Systematic literature searches were conducted in PubMed, Medline, PsycINFO, and Embase of published randomized controlled trials (RCTs) of the treatment of major depressive disorder (MDD) in adults with either antidepressants (vs. placebos) or psychotherapy (vs. placebos, treatments as usual, or other controls). RESULTS Most psychotherapy (81%) and antidepressant (100%) trials excluded patients with milder symptoms as well as patients with elevated suicidal risk (56-75%), psychotic symptoms (84-88%), or substance misuse (75-81%). Psychotherapy studies were less likely to exclude patients on the basis of brief episode duration (0% vs. 48%) and co-morbid Axis I disorders (6% vs. 27%). However, psychotherapy studies excluded patients with more severe symptoms more frequently (38%) than antidepressant studies (8%). CONCLUSIONS Overall, psychotherapy studies appear somewhat more inclusive than antidepressant studies. On average, antidepressant studies appear to target patients with more chronic and severe, as well as more purely depressive presentations.
Roberts I, Prieto-Merino D. Applying results from clinical trials: tranexamic acid in trauma patients. J Intensive Care 2014; 2:56. [PMID: 25705414 PMCID: PMC4336134 DOI: 10.1186/s40560-014-0056-1] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/19/2014] [Accepted: 09/16/2014] [Indexed: 01/22/2023] Open
Abstract
This paper considers how results from clinical trials should be applied in the care of patients, using the results of the Clinical Randomisation of an Antifibrinolytic in Significant Haemorrhage (CRASH-2) trial of tranexamic acid in bleeding trauma patients as a case study. We explain why an understanding of the mechanisms of action of the trial treatment, and insight into the factors that might be relevant to this mechanism, is critical in order to properly apply (generalise) trial results and why it is not necessary that the trial population is representative of the population in which the medicine will be used. We explain why cause (mechanism)-specific mortality is more generalizable than all-cause mortality and why the risk ratio is the generalizable measure of the effect of the treatment. Overall, we argue that a biological insight into how the treatment works is more relevant when applying research results to patient care than the application of statistical reasoning.
Pan I, Agarwal S, Merck D. Generalizable Inter-Institutional Classification of Abnormal Chest Radiographs Using Efficient Convolutional Neural Networks. J Digit Imaging 2019; 32:888-896. [PMID: 30838482 PMCID: PMC6737122 DOI: 10.1007/s10278-019-00180-9] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022] Open
Abstract
Our objective is to evaluate the effectiveness of efficient convolutional neural networks (CNNs) for abnormality detection in chest radiographs and investigate the generalizability of our models on data from independent sources. We used the National Institutes of Health ChestX-ray14 (NIH-CXR) and the Rhode Island Hospital chest radiograph (RIH-CXR) datasets in this study. Both datasets were split into training, validation, and test sets. The DenseNet and MobileNetV2 CNN architectures were used to train models on each dataset to classify chest radiographs into normal or abnormal categories; models trained on NIH-CXR were designed to also predict the presence of 14 different pathological findings. Models were evaluated on both NIH-CXR and RIH-CXR test sets based on the area under the receiver operating characteristic curve (AUROC). DenseNet and MobileNetV2 models achieved AUROCs of 0.900 and 0.893 for normal versus abnormal classification on NIH-CXR and AUROCs of 0.960 and 0.951 on RIH-CXR. For the 14 pathological findings in NIH-CXR, MobileNetV2 achieved an AUROC within 0.03 of DenseNet for each finding, with an average difference of 0.01. When externally validated on independently collected data (e.g., RIH-CXR-trained models on NIH-CXR), model AUROCs decreased by 3.6-5.2% relative to their locally trained counterparts. MobileNetV2 achieved comparable performance to DenseNet in our analysis, demonstrating the efficacy of efficient CNNs for chest radiograph abnormality detection. In addition, models were able to generalize to external data albeit with performance decreases that should be taken into consideration when applying models on data from different institutions.
Justification of exclusion criteria was underreported in a review of cardiovascular trials. J Clin Epidemiol 2014; 67:635-44. [PMID: 24613498 DOI: 10.1016/j.jclinepi.2013.12.005] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/25/2013] [Revised: 11/01/2013] [Accepted: 12/16/2013] [Indexed: 11/23/2022]
Abstract
OBJECTIVES Ethical guidelines for human subject research require that the burdens and benefits of participation be equally distributed. This study aimed to provide empirical data on exclusion of trial participants and reasons for this exclusion. As a secondary objective, we assessed to what extent exclusion affects generalizability of study results. STUDY DESIGN AND SETTING Review of trials on secondary prevention of cardiovascular events. RESULTS One hundred thirteen trials were identified, of which 112 reported exclusion criteria. One study justified the exclusion criteria applied. Ambiguous exclusion criteria due to the opinion of the physician (28 of 112 = 25%) or physical disability (12 of 112 = 11%) were reported. Within groups of trials that studied similar treatments (ie, beta-blocker, clopidogrel, or statin therapy), baseline characteristics differed among trials. For example, the proportion of women ranged between 23.1-47.4%, 2.1-38.9%, and 10.6-50.6% for the clopidogrel, beta-blocker, and statin trials, respectively. Nevertheless, no evidence was found for heterogeneity of treatment effects. CONCLUSION Almost none of the articles justified the applied exclusion criteria. No evidence was found that inclusion of dissimilar participants affected generalizability. To allow for a normative discussion on equitable selection of study populations, researchers should not only report exclusion criteria but also the reasons for using these criteria.
Gard AM, Hyde LW, Heeringa SG, West BT, Mitchell C. Why weight? Analytic approaches for large-scale population neuroscience data. Dev Cogn Neurosci 2023; 59:101196. [PMID: 36630774 PMCID: PMC9843279 DOI: 10.1016/j.dcn.2023.101196] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/06/2022] [Revised: 12/30/2022] [Accepted: 01/05/2023] [Indexed: 01/09/2023] Open
Abstract
Population-based neuroimaging studies that feature complex sampling designs enable researchers to generalize their results more widely. However, several theoretical and analytical questions pose challenges to researchers interested in these data. The following is a resource for researchers interested in using population-based neuroimaging data. We provide an overview of sampling designs and describe the differences between traditional model-based analyses and survey-oriented design-based analyses. To elucidate key concepts, we leverage data from the Adolescent Brain Cognitive Development℠ Study (ABCD Study®), a population-based sample of 11,878 9-10-year-olds in the United States. Analyses revealed modest sociodemographic discrepancies between the target population of 9-10-year-olds in the U.S. and both the recruited ABCD sample and the analytic sample with usable structural and functional imaging data. In evaluating the associations between socioeconomic resources (i.e., constructs that are tightly linked to recruitment biases) and several metrics of brain development, we show that model-based approaches over-estimated the associations of household income and under-estimated the associations of caregiver education with total cortical volume and surface area. Comparable results were found in models predicting neural function during two fMRI task paradigms. We conclude with recommendations for ABCD Study® users and users of population-based neuroimaging cohorts more broadly.
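The model-based versus design-based contrast drawn here can be illustrated with a minimal NumPy sketch (illustrative numbers, not ABCD data, and simple post-stratification weights rather than the study's full survey-weighting machinery): when one stratum is under-recruited relative to the target population, an unweighted mean is biased toward the over-represented group, while weighting each observation by its population share over its sample share recovers the population quantity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target population: 50% low-SES (outcome mean 90) and
# 50% high-SES (outcome mean 100), but the sample recruits only 20%
# low-SES -- a recruitment bias like the discrepancies described above.
n_low, n_high = 200, 800
y = np.concatenate([rng.normal(90, 5, n_low), rng.normal(100, 5, n_high)])

# Post-stratification weight = population share / sample share.
w = np.concatenate([np.full(n_low, 0.5 / 0.2), np.full(n_high, 0.5 / 0.8)])

unweighted = y.mean()                 # model-based: ignores the design
weighted = np.average(y, weights=w)   # design-based: rescales each stratum

print(f"unweighted: {unweighted:.1f}, weighted: {weighted:.1f}")
```

The unweighted mean lands near 98 (pulled toward the over-sampled high-SES group), while the weighted mean lands near the true population value of 95, which is the sense in which unweighted "model-based" analyses can over- or under-estimate associations in non-representative cohorts.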
Maleki F, Ovens K, Gupta R, Reinhold C, Spatz A, Forghani R. Generalizability of Machine Learning Models: Quantitative Evaluation of Three Methodological Pitfalls. Radiol Artif Intell 2022; 5:e220028. [PMID: 36721408 PMCID: PMC9885377 DOI: 10.1148/ryai.220028] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2022] [Revised: 10/10/2022] [Accepted: 10/24/2022] [Indexed: 11/17/2022]
Abstract
Purpose To investigate the impact of the following three methodological pitfalls on model generalizability: (a) violation of the independence assumption, (b) model evaluation with an inappropriate performance indicator or baseline for comparison, and (c) batch effect. Materials and Methods The authors used retrospective CT, histopathologic analysis, and radiography datasets to develop machine learning models with and without the three methodological pitfalls to quantitatively illustrate their effect on model performance and generalizability. F1 score was used to measure performance, and differences in performance between models developed with and without errors were assessed using the Wilcoxon rank sum test when applicable. Results Violation of the independence assumption by applying oversampling, feature selection, and data augmentation before splitting data into training, validation, and test sets seemingly improved model F1 scores by 71.2% for predicting local recurrence and 5.0% for predicting 3-year overall survival in head and neck cancer and by 46.0% for distinguishing histopathologic patterns in lung cancer. Randomly distributing data points for a patient across datasets superficially improved the F1 score by 21.8%. High model performance metrics did not indicate high-quality lung segmentation. In the presence of a batch effect, a model built for pneumonia detection had an F1 score of 98.7% but correctly classified only 3.86% of samples from a new dataset of healthy patients. Conclusion Machine learning models developed with these methodological pitfalls, which are undetectable during internal evaluation, produce inaccurate predictions; thus, understanding and avoiding these pitfalls is necessary for developing generalizable models. Keywords: Random Forest, Diagnosis, Prognosis, Convolutional Neural Network (CNN), Medical Image Analysis, Generalizability, Machine Learning, Deep Learning, Model Evaluation. Supplemental material is available for this article. Published under a CC BY 4.0 license.
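The first pitfall above, oversampling before the train/test split, can be reproduced in a few lines of NumPy (an illustrative toy setup, not the authors' experiments): naive oversampling duplicates samples, so splitting afterwards plants exact copies of test points in the training set, and even pure-noise features then score far above chance.

```python
import numpy as np

def nn_accuracy(Xtr, ytr, Xte, yte):
    """1-nearest-neighbour accuracy in plain NumPy."""
    preds = [ytr[np.argmin(np.linalg.norm(Xtr - x, axis=1))] for x in Xte]
    return np.mean(np.array(preds) == yte)

rng = np.random.default_rng(42)
# Pure-noise features with random labels: honest accuracy is ~50%.
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)

# PITFALL: duplicate every sample (naive oversampling) BEFORE splitting,
# so exact copies of many test points end up in the training set.
X_dup, y_dup = np.vstack([X, X]), np.concatenate([y, y])
idx = rng.permutation(len(y_dup))
tr, te = idx[:300], idx[300:]
leaky = nn_accuracy(X_dup[tr], y_dup[tr], X_dup[te], y_dup[te])

# Correct order: split first, then oversample only the training set.
idx2 = rng.permutation(len(y))
tr2, te2 = idx2[:150], idx2[150:]
Xtr = np.vstack([X[tr2], X[tr2]])
ytr = np.concatenate([y[tr2], y[tr2]])
honest = nn_accuracy(Xtr, ytr, X[te2], y[te2])

print(f"leaky: {leaky:.2f}, honest: {honest:.2f}")
```

The leaky estimate sits far above chance (whenever a test point's duplicate is in the training set, its nearest neighbour is itself), while the honest estimate hovers near 50% — the same mechanism behind the inflated F1 scores reported above, and, as the authors stress, invisible to internal evaluation alone.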
Yang S, Kim JK, Song R. Doubly robust inference when combining probability and non-probability samples with high dimensional data. J R Stat Soc Series B Stat Methodol 2020; 82:445-465. [PMID: 33162780 DOI: 10.1111/rssb.12354] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
Abstract
We consider integrating a non-probability sample with a probability sample which provides high dimensional representative covariate information of the target population. We propose a two-step approach for variable selection and finite population inference. In the first step, we use penalized estimating equations with folded concave penalties to select important variables and show selection consistency for general samples. In the second step, we focus on a doubly robust estimator of the finite population mean and re-estimate the nuisance model parameters by minimizing the asymptotic squared bias of the doubly robust estimator. This estimating strategy mitigates the possible first-step selection error and renders the doubly robust estimator root n consistent if either the sampling probability or the outcome model is correctly specified.
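The doubly robust estimator at the heart of this approach has a compact form: average the outcome model's prediction m(x) over the whole population, plus an inverse-propensity-weighted correction from the non-probability sample. A minimal NumPy sketch with simulated data (the true outcome model and propensity are plugged in directly here, standing in for the paper's penalized estimation steps):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000                                          # finite population
x = rng.normal(size=N)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=N)    # outcome
true_mean = y.mean()

# Non-probability sample: inclusion depends on x, favouring large x.
pi = 1 / (1 + np.exp(-(-2.0 + x)))   # inclusion propensity
R = rng.random(N) < pi               # sample-membership indicator

# Assume a correctly specified outcome model m(x) = 2 + 1.5 x
# (in practice estimated from the sample and auxiliary covariates).
m = 2.0 + 1.5 * x

# Doubly robust estimator of the population mean:
#   (1/N) * sum_i [ m(x_i) + (R_i / pi_i) * (y_i - m(x_i)) ]
dr = np.mean(m + (R / pi) * (y - m))

naive = y[R].mean()   # biased: selection favours large x
print(f"true: {true_mean:.3f}, naive: {naive:.3f}, DR: {dr:.3f}")
```

The naive sample mean is badly biased upward, while the DR estimate sits near the true population mean; the "doubly robust" property is that this holds if *either* the propensity model or the outcome model is correct, which is exactly the consistency condition stated in the abstract.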