1. Wlosik J, Granjeaud S, Gorvel L, Olive D, Chretien AS. A beginner's guide to supervised analysis for mass cytometry data in cancer biology. Cytometry A 2024. PMID: 39486897. DOI: 10.1002/cyto.a.24901.
Abstract
Mass cytometry enables deep profiling of biological samples at single-cell resolution. This technology is particularly relevant to cancer research, where cellular heterogeneity and complexity are high. Downstream analysis of high-dimensional datasets increasingly relies on machine learning (ML) to extract clinically relevant information, including supervised algorithms for classification and regression. In cancer research, these algorithms are used to develop predictive models intended to guide clinical decision making. However, the development of supervised algorithms faces major challenges, such as the need for sufficient validation, before translation to the clinic. In this work, we provide a framework for the analysis of mass cytometry data with a specific focus on supervised algorithms and practical examples of their applications. We also raise awareness of key good-practice issues for researchers who wish to apply supervised ML to their mass cytometry data. Finally, we discuss the challenges of applying supervised ML to cancer research.
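For illustration only, the sketch below shows the kind of supervised workflow the abstract describes: per-patient immune cell-population frequencies as features, a classifier, and cross-validated evaluation so that performance is not estimated on data the model was trained on. It is not taken from the paper; the data, feature counts, and model choice are assumptions, and a real study would still require validation on an independent cohort.

```python
# Minimal sketch (not from the paper): supervised classification of a patient
# outcome from per-sample immune cell-population frequencies, evaluated with
# cross-validation. All data and parameter choices here are hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_patients, n_populations = 60, 25
X = rng.dirichlet(np.ones(n_populations), size=n_patients)  # toy frequency matrix
y = rng.integers(0, 2, size=n_patients)                     # toy binary outcome

# Scaling lives inside the pipeline so it is re-fit on each training fold only.
model = make_pipeline(StandardScaler(),
                      RandomForestClassifier(n_estimators=200, random_state=0))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"Cross-validated AUC: {scores.mean():.2f} +/- {scores.std():.2f}")
```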
Affiliation(s)
- Julia Wlosik: Team 'Immunity and Cancer', Marseille Cancer Research Center, Inserm U1068, CNRS UMR7258, Paoli-Calmettes Institute, Aix-Marseille University UM105, Marseille, France; Immunomonitoring Department, Paoli-Calmettes Institute, Marseille, France
- Samuel Granjeaud: Systems Biology Platform, Marseille Cancer Research Center, Inserm U1068, CNRS UMR7258, Paoli-Calmettes Institute, Aix-Marseille University UM105, Marseille, France
- Laurent Gorvel: Team 'Immunity and Cancer', Marseille Cancer Research Center, Inserm U1068, CNRS UMR7258, Paoli-Calmettes Institute, Aix-Marseille University UM105, Marseille, France; Immunomonitoring Department, Paoli-Calmettes Institute, Marseille, France
- Daniel Olive: Team 'Immunity and Cancer', Marseille Cancer Research Center, Inserm U1068, CNRS UMR7258, Paoli-Calmettes Institute, Aix-Marseille University UM105, Marseille, France; Immunomonitoring Department, Paoli-Calmettes Institute, Marseille, France
- Anne-Sophie Chretien: Team 'Immunity and Cancer', Marseille Cancer Research Center, Inserm U1068, CNRS UMR7258, Paoli-Calmettes Institute, Aix-Marseille University UM105, Marseille, France; Immunomonitoring Department, Paoli-Calmettes Institute, Marseille, France
2. Ngiam W, Geng JJ, Shomstein S. Editorial for Attention, Perception, & Psychophysics. Atten Percept Psychophys 2024. PMID: 39482551. DOI: 10.3758/s13414-024-02973-9.
Affiliation(s)
- William Ngiam: School of Psychology, University of Adelaide, Adelaide, Australia
- Joy J Geng: Department of Psychology, University of California, Davis, California, USA; Center for Mind and Brain, University of California, Davis, California, USA
3. Cavalcante BRR, Freitas RD, Siquara da Rocha LO, Santos RSB, Souza BSDF, Ramos PIP, Rocha GV, Gurgel Rocha CA. In silico approaches for drug repurposing in oncology: a scoping review. Front Pharmacol 2024; 15:1400029. PMID: 38919258. PMCID: PMC11196849. DOI: 10.3389/fphar.2024.1400029.
Abstract
Introduction: Cancer refers to a group of diseases characterized by the uncontrolled growth and spread of abnormal cells in the body. Because of this complexity, no single medicine treats all cancer types, despite the urgent need for one. Moreover, developing a new drug is costly and time-consuming. In this context, drug repurposing (DR) can hasten drug discovery by giving existing drugs new disease indications. Many computational methods have been applied to achieve DR, but only a few have succeeded. This review therefore aims to map in silico DR approaches and the gap between these strategies and their ultimate application in oncology. Methods: The scoping review was conducted according to the Arksey and O'Malley framework and the Joanna Briggs Institute recommendations. Relevant studies were identified through electronic searches of the PubMed/MEDLINE, Embase, Scopus, and Web of Science databases, as well as the grey literature. We included peer-reviewed research articles involving in silico strategies applied to drug repurposing in oncology, published between 1 January 2003 and 31 December 2021. Results: We identified 238 studies for inclusion in the review. The United States, India, China, South Korea, and Italy published the most studies. Regarding cancer types, breast cancer, lymphomas and leukemias, and lung, colorectal, and prostate cancers were the most investigated. Most studies relied solely on computational methods, and only a few assessed their findings in more complex scientific models. Molecular modeling, which includes molecular docking and molecular dynamics simulations, was the most frequently used method, followed by signature-, machine learning-, and network-based strategies. Discussion: DR is a promising opportunity but still demands extensive testing to ensure its safety and efficacy for the new indications. Implementing DR can also be challenging due to various factors, including a lack of quality data, patient populations, cost, intellectual property issues, market considerations, and regulatory requirements. Despite these hurdles, DR remains an attractive strategy for identifying new treatments for numerous diseases, including cancers, and for giving patients faster access to new medications.
Affiliation(s)
- Bruno Raphael Ribeiro Cavalcante: Gonçalo Moniz Institute, Oswaldo Cruz Foundation (IGM-FIOCRUZ/BA), Salvador, Brazil; Department of Pathology and Forensic Medicine of the School of Medicine, Federal University of Bahia, Salvador, Brazil
- Raíza Dias Freitas: Gonçalo Moniz Institute, Oswaldo Cruz Foundation (IGM-FIOCRUZ/BA), Salvador, Brazil; Department of Social and Pediatric Dentistry of the School of Dentistry, Federal University of Bahia, Salvador, Brazil
- Leonardo de Oliveira Siquara da Rocha: Gonçalo Moniz Institute, Oswaldo Cruz Foundation (IGM-FIOCRUZ/BA), Salvador, Brazil; Department of Pathology and Forensic Medicine of the School of Medicine, Federal University of Bahia, Salvador, Brazil
- Bruno Solano de Freitas Souza: Gonçalo Moniz Institute, Oswaldo Cruz Foundation (IGM-FIOCRUZ/BA), Salvador, Brazil; D’Or Institute for Research and Education (IDOR), Salvador, Brazil
- Pablo Ivan Pereira Ramos: Gonçalo Moniz Institute, Oswaldo Cruz Foundation (IGM-FIOCRUZ/BA), Salvador, Brazil; Center of Data and Knowledge Integration for Health (CIDACS), Salvador, Brazil
- Gisele Vieira Rocha: Gonçalo Moniz Institute, Oswaldo Cruz Foundation (IGM-FIOCRUZ/BA), Salvador, Brazil; D’Or Institute for Research and Education (IDOR), Salvador, Brazil
- Clarissa Araújo Gurgel Rocha: Gonçalo Moniz Institute, Oswaldo Cruz Foundation (IGM-FIOCRUZ/BA), Salvador, Brazil; Department of Pathology and Forensic Medicine of the School of Medicine, Federal University of Bahia, Salvador, Brazil; D’Or Institute for Research and Education (IDOR), Salvador, Brazil; Department of Propaedeutics, School of Dentistry of the Federal University of Bahia, Salvador, Brazil
4. Nab L, Schaffer AL, Hulme W, DeVito NJ, Dillingham I, Wiedemann M, Andrews CD, Curtis H, Fisher L, Green A, Massey J, Walters CE, Higgins R, Cunningham C, Morley J, Mehrkar A, Hart L, Davy S, Evans D, Hickman G, Inglesby P, Morton CE, Smith RM, Ward T, O'Dwyer T, Maude S, Bridges L, Butler-Cole BFC, Stables CL, Stokes P, Bates C, Cockburn J, Hester F, Parry J, Bhaskaran K, Schultze A, Rentsch CT, Mathur R, Tomlinson LA, Williamson EJ, Smeeth L, Walker A, Bacon S, MacKenna B, Goldacre B. OpenSAFELY: A platform for analysing electronic health records designed for reproducible research. Pharmacoepidemiol Drug Saf 2024; 33:e5815. PMID: 38783412. PMCID: PMC7616137. DOI: 10.1002/pds.5815.
Abstract
Electronic health records (EHRs) and other administrative health data are increasingly used in research to generate evidence on the effectiveness, safety, and utilisation of medical products and services, and to inform public health guidance and policy. Reproducibility is a fundamental step for research credibility and promotes trust in evidence generated from EHRs. At present, ensuring research using EHRs is reproducible can be challenging for researchers. Research software platforms can provide technical solutions to enhance the reproducibility of research conducted using EHRs. In response to the COVID-19 pandemic, we developed OpenSAFELY, a secure, transparent, open-source analytics platform designed with reproducible research in mind. OpenSAFELY mitigates common barriers to reproducible research by: standardising key workflows around data preparation; removing barriers to code-sharing in secure analysis environments; enforcing public sharing of programming code and codelists; ensuring the same computational environment is used everywhere; integrating new and existing tools that encourage and enable the use of reproducible working practices; and providing an audit trail for all code that is run against the real data to increase transparency. This paper describes OpenSAFELY's reproducibility-by-design approach in detail.
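The audit-trail and pinned-environment ideas in this abstract generalise beyond the platform itself. The sketch below is a generic, hypothetical illustration of recording what was run (input hashes, package versions, the command) so that any output can be traced back to the code and environment that produced it; it is not OpenSAFELY code, and the file names are placeholders.

```python
# Generic sketch of an audit-trail wrapper for one analysis step.
# Illustrative only; not OpenSAFELY's implementation. File names are placeholders.
import hashlib
import json
import platform
import subprocess
import sys
from datetime import datetime, timezone
from importlib import metadata
from pathlib import Path

def sha256(path: Path) -> str:
    """Hash an input file so the run log pins the exact data version used."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def run_step(cmd: list[str], inputs: list[Path], log_path: Path) -> None:
    """Run one pipeline step and append a machine-readable audit record."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "command": cmd,
        "inputs": {str(p): sha256(p) for p in inputs},
        "python": platform.python_version(),
        "packages": {d.metadata["Name"]: d.version for d in metadata.distributions()},
    }
    subprocess.run(cmd, check=True)  # fail loudly if the step fails
    with log_path.open("a") as fh:
        fh.write(json.dumps(record) + "\n")

if __name__ == "__main__":
    run_step([sys.executable, "analysis.py"], [Path("input.csv")], Path("run_log.jsonl"))
```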
Affiliation(s)
- Linda Nab, Andrea L Schaffer, William Hulme, Nicholas J DeVito, Iain Dillingham, Milan Wiedemann, Colm D Andrews, Helen Curtis, Louis Fisher, Amelia Green, Jon Massey, Caroline E Walters, Rose Higgins, Christine Cunningham, Jessica Morley, Amir Mehrkar, Liam Hart, Simon Davy, David Evans, George Hickman, Peter Inglesby, Caroline E Morton, Rebecca M Smith, Tom Ward, Thomas O'Dwyer, Steven Maude, Lucy Bridges, Ben F C Butler-Cole, Catherine L Stables, Pete Stokes, Alex Walker, Sebastian Bacon, Brian MacKenna, Ben Goldacre: Bennett Institute for Applied Data Science, Nuffield Department of Primary Care Health Sciences, University of Oxford, Oxford, UK
- Anna Schultze, Rohini Mathur, Liam Smeeth: London School of Hygiene and Tropical Medicine, London, UK
5. Kohrs FE, Auer S, Bannach-Brown A, Fiedler S, Haven TL, Heise V, Holman C, Azevedo F, Bernard R, Bleier A, Bössel N, Cahill BP, Castro LJ, Ehrenhofer A, Eichel K, Frank M, Frick C, Friese M, Gärtner A, Gierend K, Grüning DJ, Hahn L, Hülsemann M, Ihle M, Illius S, König L, König M, Kulke L, Kutlin A, Lammers F, Mehler DMA, Miehl C, Müller-Alcazar A, Neuendorf C, Niemeyer H, Pargent F, Peikert A, Pfeuffer CU, Reinecke R, Röer JP, Rohmann JL, Sánchez-Tójar A, Scherbaum S, Sixtus E, Spitzer L, Straßburger VM, Weber M, Whitmire CJ, Zerna J, Zorbek D, Zumstein P, Weissgerber TL. Eleven strategies for making reproducible research and open science training the norm at research institutions. eLife 2023; 12:e89736. PMID: 37994903. PMCID: PMC10666927. DOI: 10.7554/elife.89736.
Abstract
Reproducible research and open science practices have the potential to accelerate scientific progress by allowing others to reuse research outputs, and by promoting rigorous research that is more likely to yield trustworthy results. However, these practices are uncommon in many fields, so there is a clear need for training that helps and encourages researchers to integrate reproducible research and open science practices into their daily work. Here, we outline eleven strategies for making training in these practices the norm at research institutions. The strategies, which emerged from a virtual brainstorming event organized in collaboration with the German Reproducibility Network, are concentrated in three areas: (i) adapting research assessment criteria and program requirements; (ii) training; (iii) building communities. We provide a brief overview of each strategy, offer tips for implementation, and provide links to resources. We also highlight the importance of allocating resources and monitoring impact. Our goal is to encourage researchers - in their roles as scientists, supervisors, mentors, instructors, and members of curriculum, hiring or evaluation committees - to think creatively about the many ways they can promote reproducible research and open science practices in their institutions.
Affiliation(s)
- Friederike E Kohrs: QUEST Center for Responsible Research, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
- Susann Auer: Department of Plant Physiology, Faculty of Biology, Technische Universität Dresden, Dresden, Germany
- Alexandra Bannach-Brown: QUEST Center for Responsible Research, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
- Susann Fiedler: Department Strategy & Innovation, Vienna University of Economics and Business, Vienna, Austria
- Tamarinde Laura Haven: Danish Centre for Studies in Research & Research Policy, Department of Political Science, Aarhus University, Aarhus, Denmark
- Constance Holman: QUEST Center for Responsible Research, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
- Flavio Azevedo: Saxony Center for Criminological Research, Chemnitz, Germany; University of Cambridge, Cambridge, United Kingdom
- René Bernard: NeuroCure Cluster of Excellence, Charité - Universitätsmedizin Berlin, Berlin, Germany
- Arnim Bleier: Department for Computational Social Sciences, GESIS - Leibniz Institute for the Social Sciences, Cologne, Germany
- Nicole Bössel: Department of Psychiatry and Psychotherapy, University Medicine Greifswald, Greifswald, Germany
- Adrian Ehrenhofer: Institute of Solid Mechanics & Dresden Center for Intelligent Materials, Technische Universität Dresden, Dresden, Germany
- Kristina Eichel: Department of Education and Psychology, Freie Universität Berlin, Berlin, Germany
- Claudia Frick: Institute of Information Science, Technische Hochschule Köln, Köln, Germany
- Malte Friese: Department of Psychology, Saarland University, Saarbrücken, Germany
- Anne Gärtner: Department of Psychology, Technische Universität Dresden, Dresden, Germany
- Kerstin Gierend: Department of Biomedical Informatics at the Center for Preventive Medicine and Digital Health, Medical Faculty Mannheim, Heidelberg University, Heidelberg, Germany
- David Joachim Grüning: Department of Psychology, Heidelberg University, Heidelberg, Germany; Department of Survey Development and Methodology, GESIS - Leibniz Institute for the Social Sciences, Mannheim, Germany
- Lena Hahn: Department of Social Psychology, Universität Trier, Trier, Germany
- Maren Hülsemann: QUEST Center for Responsible Research, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
- Malika Ihle: LMU Open Science Center, Department of Psychology, LMU Munich, Munich, Germany
- Sabrina Illius: ICAN Institute for Cognitive and Affective Neuroscience, Department of Psychology, Faculty of Human Sciences, Medical School Hamburg, Hamburg, Germany
- Laura König: Faculty of Life Sciences: Food, Nutrition and Health, University of Bayreuth, Bayreuth, Germany
- Matthias König: Institute for Biology, Institute for Theoretical Biology, Humboldt-University Berlin, Berlin, Germany
- Louisa Kulke: Developmental Psychology with Educational Psychology, University of Bremen, Bremen, Germany
- Anton Kutlin: Max Planck Institute for the Physics of Complex Systems, Dresden, Germany
- Fritjof Lammers: Division of Regulatory Genomics and Cancer Evolution, German Cancer Research Center (DKFZ), Heidelberg, Germany
- David MA Mehler: Department of Psychiatry, Psychotherapy and Psychosomatics, Medical School, RWTH Aachen University, Aachen, Germany
- Christoph Miehl: Computation in Neural Circuits, Max Planck Institute for Brain Research, Frankfurt, Germany
- Anett Müller-Alcazar: ICAN Institute for Cognitive and Affective Neuroscience, Department of Psychology, Faculty of Human Sciences, Medical School Hamburg, Hamburg, Germany
- Claudia Neuendorf: Hector-Institute for Education Sciences and Psychology, Eberhard Karls University of Tübingen, Tübingen, Germany
- Helen Niemeyer: Department of Education and Psychology, Freie Universität Berlin, Berlin, Germany
- Aaron Peikert: Center for Lifespan Psychology, Max Planck Institute for Human Development, Berlin, Germany
- Christina U Pfeuffer: Department of Psychology, Catholic University of Eichstätt-Ingolstadt, Eichstätt, Germany
- Robert Reinecke: Institute of Geography, Johannes Gutenberg-University Mainz, Mainz, Germany
- Jan Philipp Röer: Department of Psychology and Psychotherapy, Witten/Herdecke University, Witten, Germany
- Jessica L Rohmann: Scientific Directorate, Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), Berlin, Germany
- Stefan Scherbaum: Department of Psychology, Technische Universität Dresden, Dresden, Germany
- Elena Sixtus: Empirical Childhood Research, University of Potsdam, Potsdam, Germany
- Vera Maren Straßburger: Department of Psychology, Medical School Hamburg, Hamburg, Germany; Charité - Universitätsmedizin Berlin, Gender in Medicine (GiM), Berlin, Germany
- Marcel Weber: Department of Psychology, Saarland University, Saarbrücken, Germany
- Clarissa J Whitmire: Max Delbrück Center for Molecular Medicine in the Helmholtz Association, Berlin, Germany; Neuroscience Research Center, Charité - Universitätsmedizin Berlin, Berlin, Germany
- Josephine Zerna: Department of Psychology, Technische Universität Dresden, Dresden, Germany
- Dilara Zorbek: International Graduate Program Medical Neurosciences, Charité - Universitätsmedizin Berlin, Berlin, Germany
- Tracey L Weissgerber: QUEST Center for Responsible Research, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
6. Kapoor S, Narayanan A. Leakage and the reproducibility crisis in machine-learning-based science. Patterns (N Y) 2023; 4:100804. PMID: 37720327. PMCID: PMC10499856. DOI: 10.1016/j.patter.2023.100804.
Abstract
Machine-learning (ML) methods have gained prominence in the quantitative sciences. However, there are many known methodological pitfalls, including data leakage, in ML-based science. We systematically investigate reproducibility issues in ML-based science. Through a survey of literature in fields that have adopted ML methods, we find 17 fields where leakage has been found, collectively affecting 294 papers and, in some cases, leading to wildly overoptimistic conclusions. Based on our survey, we introduce a detailed taxonomy of eight types of leakage, ranging from textbook errors to open research problems. We propose that researchers test for each type of leakage by filling out model info sheets, which we introduce. Finally, we conduct a reproducibility study of civil war prediction, where complex ML models are believed to vastly outperform traditional statistical models such as logistic regression (LR). When the errors are corrected, complex ML models do not perform substantively better than decades-old LR models.
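As a concrete illustration of one textbook leakage type from the paper's taxonomy (preprocessing fit on the full dataset before splitting), the sketch below contrasts a leaky pipeline with a corrected one. The dataset and model are placeholders, not the paper's civil-war example, and on this toy data the numerical gap may be small; the point is the structural error.

```python
# Sketch of a common leakage pattern: fitting a scaler on all data before the
# train/test split lets information from the test rows influence training.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Leaky: the scaler sees the test rows before the split.
X_scaled = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y, random_state=0)
leaky = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Correct: split first, then fit all preprocessing on the training fold only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clean = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_tr, y_tr)

print("leaky test accuracy:", leaky.score(X_te, y_te))
print("clean test accuracy:", clean.score(X_te, y_te))
```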
Affiliation(s)
- Sayash Kapoor: Department of Computer Science and Center for Information Technology Policy, Princeton University, Princeton, NJ 08540, USA
- Arvind Narayanan: Department of Computer Science and Center for Information Technology Policy, Princeton University, Princeton, NJ 08540, USA
7. Hamilton DG, Hong K, Fraser H, Rowhani-Farid A, Fidler F, Page MJ. Prevalence and predictors of data and code sharing in the medical and health sciences: systematic review with meta-analysis of individual participant data. BMJ 2023; 382:e075767. PMID: 37433624. PMCID: PMC10334349. DOI: 10.1136/bmj-2023-075767.
Abstract
OBJECTIVES To synthesise research investigating data and code sharing in medicine and health to establish an accurate representation of the prevalence of sharing, how this frequency has changed over time, and what factors influence availability. DESIGN Systematic review with meta-analysis of individual participant data. DATA SOURCES Ovid Medline, Ovid Embase, and the preprint servers medRxiv, bioRxiv, and MetaArXiv were searched from inception to 1 July 2021. Forward citation searches were also performed on 30 August 2022. REVIEW METHODS Meta-research studies that investigated data or code sharing across a sample of scientific articles presenting original medical and health research were identified. Two authors screened records, assessed the risk of bias, and extracted summary data from study reports when individual participant data could not be retrieved. Key outcomes of interest were the prevalence of statements that declared that data or code were publicly or privately available (declared availability) and the success rates of retrieving these products (actual availability). The associations between data and code availability and several factors (eg, journal policy, type of data, trial design, and human participants) were also examined. A two stage approach to meta-analysis of individual participant data was performed, with proportions and risk ratios pooled with the Hartung-Knapp-Sidik-Jonkman method for random effects meta-analysis. RESULTS The review included 105 meta-research studies examining 2 121 580 articles across 31 specialties. Eligible studies examined a median of 195 primary articles (interquartile range 113-475), with a median publication year of 2015 (interquartile range 2012-2018). Only eight studies (8%) were classified as having a low risk of bias. Meta-analyses showed a prevalence of declared and actual public data availability of 8% (95% confidence interval 5% to 11%) and 2% (1% to 3%), respectively, between 2016 and 2021. For public code sharing, both the prevalence of declared and actual availability were estimated to be <0.5% since 2016. Meta-regressions indicated that only declared public data sharing prevalence estimates have increased over time. Compliance with mandatory data sharing policies ranged from 0% to 100% across journals and varied by type of data. In contrast, success in privately obtaining data and code from authors historically ranged between 0% and 37% and 0% and 23%, respectively. CONCLUSIONS The review found that public code sharing was persistently low across medical research. Declarations of data sharing were also low, increasing over time, but did not always correspond to actual sharing of data. The effectiveness of mandatory data sharing policies varied substantially by journal and type of data, a finding that might be informative for policy makers when designing policies and allocating resources to audit compliance. SYSTEMATIC REVIEW REGISTRATION Open Science Framework doi:10.17605/OSF.IO/7SX8U.
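To make the pooling step concrete, here is a minimal hand-rolled sketch of random-effects meta-analysis with the Hartung-Knapp-Sidik-Jonkman (HKSJ) adjustment named in the methods. The input effect estimates are invented, and a real analysis would use an established package (e.g., metafor in R) rather than this illustration.

```python
# Hand-rolled sketch of random-effects pooling with the HKSJ variance
# adjustment; illustrative only, with made-up inputs.
import numpy as np
from scipy import stats

def hksj_pool(y, v):
    """Pool effect estimates y with within-study variances v."""
    y, v = np.asarray(y, float), np.asarray(v, float)
    k = len(y)
    w_fe = 1.0 / v                                   # fixed-effect weights
    y_fe = np.sum(w_fe * y) / np.sum(w_fe)
    q = np.sum(w_fe * (y - y_fe) ** 2)               # Cochran's Q
    c = np.sum(w_fe) - np.sum(w_fe**2) / np.sum(w_fe)
    tau2 = max(0.0, (q - (k - 1)) / c)               # DerSimonian-Laird tau^2
    w = 1.0 / (v + tau2)                             # random-effects weights
    y_re = np.sum(w * y) / np.sum(w)
    # HKSJ: weighted residual variance and a t-distribution with k-1 df.
    se2 = np.sum(w * (y - y_re) ** 2) / ((k - 1) * np.sum(w))
    half_width = stats.t.ppf(0.975, k - 1) * np.sqrt(se2)
    return y_re, (y_re - half_width, y_re + half_width)

# Toy example: logit-transformed prevalence estimates from five studies.
estimate, ci = hksj_pool(y=[-2.1, -2.6, -2.3, -1.9, -2.8],
                         v=[0.04, 0.09, 0.05, 0.07, 0.12])
print(f"pooled logit prevalence: {estimate:.2f}, 95% CI {ci[0]:.2f} to {ci[1]:.2f}")
```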
Affiliation(s)
- Daniel G Hamilton: MetaMelb Research Group, School of BioSciences, University of Melbourne, Melbourne, VIC, Australia; Melbourne Medical School, Faculty of Medicine, Dentistry, and Health Sciences, University of Melbourne, Melbourne, VIC, Australia
- Kyungwan Hong: Department of Practice, Sciences, and Health Outcomes Research, University of Maryland School of Pharmacy, Baltimore, MD, USA
- Hannah Fraser: MetaMelb Research Group, School of BioSciences, University of Melbourne, Melbourne, VIC, Australia
- Anisa Rowhani-Farid: Department of Practice, Sciences, and Health Outcomes Research, University of Maryland School of Pharmacy, Baltimore, MD, USA
- Fiona Fidler: MetaMelb Research Group, School of BioSciences, University of Melbourne, Melbourne, VIC, Australia; School of Historical and Philosophical Studies, University of Melbourne, Melbourne, VIC, Australia
- Matthew J Page: Methods in Evidence Synthesis Unit, School of Public Health and Preventive Medicine, Monash University, Melbourne, VIC, Australia
8. Sparks AH, Ponte EMD, Alves KS, Foster ZSL, Grünwald NJ. Openness and Computational Reproducibility in Plant Pathology: Where We Stand and a Way Forward. Phytopathology 2023; 113:1159-1170. PMID: 36624724. DOI: 10.1094/phyto-10-21-0430-per.
Abstract
Open research practices have been highlighted extensively during the last 10 years in many fields of scientific study as essential standards needed to promote transparency and reproducibility of scientific results. Scientific claims can only be evaluated based on how protocols, materials, equipment, and methods were described; data were collected and prepared; and analyses were conducted. Openly sharing protocols, data, and computational code is central to current scholarly dissemination and communication, but in many fields, including plant pathology, adoption of these practices has been slow. We randomly selected 450 articles published from 2012 to 2021 across 21 journals representative of the plant pathology discipline and assigned them scores reflecting their openness and computational reproducibility. We found that most of the articles did not follow protocols for open science and failed to share data or code in a reproducible way. We propose that use of open-source tools facilitates computationally reproducible work and analyses, benefitting not just readers but the authors as well. Finally, we provide ideas and suggest tools to promote open, reproducible computational research practices among plant pathologists.
Affiliation(s)
- Adam H Sparks: Department of Primary Industries and Regional Development, Perth, WA 6000, Australia; University of Southern Queensland, Centre for Crop Health, Toowoomba, Qld 4350, Australia
- Kaique S Alves: Departamento de Fitopatologia, Universidade Federal de Viçosa, Brazil
- Zachary S L Foster: Horticultural Crops Disease and Pest Management Research Unit, U.S. Department of Agriculture-Agricultural Research Service, Corvallis, OR 97330, U.S.A.
- Niklaus J Grünwald: Horticultural Crops Disease and Pest Management Research Unit, U.S. Department of Agriculture-Agricultural Research Service, Corvallis, OR 97330, U.S.A.
9. Crüwell S, Apthorp D, Baker BJ, Colling L, Elson M, Geiger SJ, Lobentanzer S, Monéger J, Patterson A, Schwarzkopf DS, Zaneva M, Brown NJL. What's in a Badge? A Computational Reproducibility Investigation of the Open Data Badge Policy in One Issue of Psychological Science. Psychol Sci 2023; 34:512-522. PMID: 36730433. DOI: 10.1177/09567976221140828.
Abstract
In April 2019, Psychological Science published its first issue in which all Research Articles received the Open Data badge. We used that issue to investigate the effectiveness of this badge, focusing on the adherence to its aim at Psychological Science: sharing both data and code to ensure reproducibility of results. Twelve researchers of varying experience levels attempted to reproduce the results of the empirical articles in the target issue (at least three researchers per article). We found that all 14 articles provided at least some data and six provided analysis code, but only one article was rated to be exactly reproducible, and three were rated as essentially reproducible with minor deviations. We suggest that researchers should be encouraged to adhere to the higher standard in force at Psychological Science. Moreover, a check of reproducibility during peer review may be preferable to the disclosure method of awarding badges.
Affiliation(s)
- Sophia Crüwell: Meta-Research Innovation Center Berlin (METRIC-B), QUEST Center for Transforming Biomedical Research, Berlin Institute of Health, Charité - Universitätsmedizin Berlin; Department of History and Philosophy of Science, University of Cambridge
- Deborah Apthorp: School of Psychology, University of New England; School of Computing, Australian National University
- Bradley J Baker: Department of Sport and Recreation Management, Temple University
- Malte Elson: Faculty of Psychology, Ruhr University Bochum; Horst Görtz Institute for IT Security, Ruhr University Bochum
- Sandra J Geiger: Environmental Psychology, Department of Cognition, Emotion, and Methods, Faculty of Psychology, University of Vienna
- Jean Monéger: Department of Psychology, University of Poitiers; Research Center on Cognition and Learning, Centre National de la Recherche Scientifique (CNRS) 7295
- Alex Patterson: Sheffield Methods Institute, The University of Sheffield
- D Samuel Schwarzkopf: School of Optometry and Vision Science, University of Auckland; Experimental Psychology, University College London
- Mirela Zaneva: Department of Experimental Psychology, University of Oxford
10. DeVito NJ, Morton C, Cashin AG, Richards GC, Lee H. Sharing study materials in health and medical research. BMJ Evid Based Med 2022. PMID: 36162960. DOI: 10.1136/bmjebm-2022-111987.
Abstract
Making study materials available allows for a more comprehensive understanding of the scientific literature. Sharing can take many forms and include a wide variety of outputs including code and data. Biomedical research can benefit from increased transparency but faces unique challenges for sharing, for instance, confidentiality concerns around participants' medical data. Both general and specialised repositories exist to aid in sharing most study materials. Sharing may also require skills and resources to ensure that it is done safely and effectively. Educating researchers on how to best share their materials, and properly rewarding these practices, requires action from a variety of stakeholders including journals, funders and research institutions.
Affiliation(s)
- Nicholas J DeVito: Bennett Institute for Applied Data Science, Nuffield Department of Primary Care Health Sciences, University of Oxford, Oxford, Oxfordshire, UK
- Caroline Morton: Bennett Institute for Applied Data Science, Nuffield Department of Primary Care Health Sciences, University of Oxford, Oxford, Oxfordshire, UK
- Aidan Gregory Cashin: School of Health Sciences, University of New South Wales, Sydney, New South Wales, Australia; Centre for Pain IMPACT, Neuroscience Research Australia, Randwick, New South Wales, Australia
- Georgia C Richards: Centre for Evidence Based Medicine, Nuffield Department of Primary Care Health Sciences, University of Oxford, Oxford, Oxfordshire, UK
- Hopin Lee: Centre for Statistics in Medicine & Rehabilitation Research in Oxford, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences (NDORMS), University of Oxford, Oxford, Oxfordshire, UK; School of Medicine and Public Health, The University of Newcastle, Callaghan, New South Wales, Australia
11. Cadwallader L, Hrynaszkiewicz I. A survey of researchers' code sharing and code reuse practices, and assessment of interactive notebook prototypes. PeerJ 2022; 10:e13933. PMID: 36032954. PMCID: PMC9406794. DOI: 10.7717/peerj.13933.
Abstract
This research aimed to understand the needs and habits of researchers in relation to code sharing and reuse; gather feedback on prototype code notebooks created by NeuroLibre; and help determine strategies that publishers could use to increase code sharing. We surveyed 188 researchers in computational biology. Respondents were asked about how often and why they look at code, which methods of accessing code they find useful and why, what aspects of code sharing are important to them, and how satisfied they are with their ability to complete these tasks. Respondents were asked to look at a prototype code notebook and give feedback on its features. Respondents were also asked how much time they spent preparing code and if they would be willing to increase this to use a code sharing tool, such as a notebook. For readers of research articles, the most common reason (70%) for looking at code was to gain a better understanding of the article. The most commonly encountered method of code sharing, linking articles to a code repository, was also the most useful method of accessing code from the reader's perspective. As authors, the respondents were largely satisfied with their ability to carry out tasks related to code sharing. The most important of these tasks were ensuring that the code was running in the correct environment, and sharing code with good documentation. The average researcher, according to our results, is unwilling to incur additional costs (in time, effort or expenditure) that are currently needed to use code sharing tools alongside a publication. We infer this means we need different models for funding and producing interactive or executable research outputs if they are to reach a large number of researchers. For the purpose of increasing the amount of code shared by authors, PLOS Computational Biology is, as a result, focusing on policy rather than tools.
12. Seibold H, Czerny S, Decke S, Dieterle R, Eder T, Fohr S, Hahn N, Hartmann R, Heindl C, Kopper P, Lepke D, Loidl V, Mandl M, Musiol S, Peter J, Piehler A, Rojas E, Schmid S, Schmidt H, Schmoll M, Schneider L, To XY, Tran V, Völker A, Wagner M, Wagner J, Waize M, Wecker H, Yang R, Zellner S, Nalenz M. Correction: A computational reproducibility study of PLOS ONE articles featuring longitudinal data analyses. PLoS One 2022; 17:e0269047. PMID: 35604918. PMCID: PMC9126381. DOI: 10.1371/journal.pone.0269047.
13. Xu Y, Mansmann U. Validating the knowledge bank approach for personalized prediction of survival in acute myeloid leukemia: a reproducibility study. Hum Genet 2022; 141:1467-1480. PMID: 35429300. PMCID: PMC9360099. DOI: 10.1007/s00439-022-02455-8.
Abstract
Reproducibility is not only essential for the integrity of scientific research but is also a prerequisite for model validation and refinement for the future application of predictive algorithms. However, reproducible research is becoming increasingly challenging, particularly in high-dimensional genomic data analyses with complex statistical or algorithmic techniques. Given that there are no mandatory requirements in most biomedical and statistical journals to provide the original data, analytical source code, or other relevant materials for publication, accessibility to these supplements naturally suggests a greater credibility of the published work. In this study, we performed a reproducibility assessment of the notable paper by Gerstung et al. (Nat Genet 49:332–340, 2017) by rerunning the analysis using their original code and data, which are publicly accessible. Despite an open science setting, it was challenging to reproduce the entire research project; reasons included: incomplete data and documentation, suboptimal code readability, coding errors, limited portability of intensive computing performed on a specific platform, and an R computing environment that could no longer be re-established. We learn that the availability of code and data does not guarantee transparency and reproducibility of a study; paradoxically, the source code is still liable to error and obsolescence, essentially due to methodological and computational complexity, a lack of reproducibility checking at submission, and updates for software and operating environment. The complex code may also hide problematic methodological aspects of the proposed research. Building on the experience gained, we discuss the best programming and software engineering practices that could have been employed to improve reproducibility, and propose practical criteria for the conduct and reporting of reproducibility studies for future researchers.
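One of the practices the authors recommend, capturing the computing environment so it can be re-established later, can be scripted directly. The sketch below is a generic illustration in Python (the original study's analysis was in R, where sessionInfo() and renv play a similar role); the output file name and seed are placeholders, not from the paper.

```python
# Sketch: snapshot the computing environment alongside analysis outputs so a
# future reader can rebuild it. Generic illustration; names are placeholders.
import json
import platform
import random
from importlib import metadata
from pathlib import Path

import numpy as np

SEED = 20220405
random.seed(SEED)      # pin stochastic steps so reruns match
np.random.seed(SEED)

snapshot = {
    "python": platform.python_version(),
    "platform": platform.platform(),
    "seed": SEED,
    "packages": sorted(f"{d.metadata['Name']}=={d.version}"
                       for d in metadata.distributions()),
}
Path("environment_snapshot.json").write_text(json.dumps(snapshot, indent=2))
print("environment recorded for", len(snapshot["packages"]), "packages")
```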
Affiliation(s)
- Yujun Xu: Institute for Medical Information Processing, Biometry, and Epidemiology, Ludwig-Maximilians-Universität München, Marchioninistr. 15, 81377 Munich, Germany
- Ulrich Mansmann: Institute for Medical Information Processing, Biometry, and Epidemiology, Ludwig-Maximilians-Universität München, Marchioninistr. 15, 81377 Munich, Germany
14. Baldwin JR, Pingault JB, Schoeler T, Sallis HM, Munafò MR. Protecting against researcher bias in secondary data analysis: challenges and potential solutions. Eur J Epidemiol 2022; 37:1-10. PMID: 35025022. PMCID: PMC8791887. DOI: 10.1007/s10654-021-00839-0.
Abstract
Analysis of secondary data sources (such as cohort studies, survey data, and administrative records) has the potential to provide answers to science and society's most pressing questions. However, researcher biases can lead to questionable research practices in secondary data analysis, which can distort the evidence base. While pre-registration can help to protect against researcher biases, it presents challenges for secondary data analysis. In this article, we describe these challenges and propose novel solutions and alternative approaches. Proposed solutions include approaches to (1) address bias linked to prior knowledge of the data, (2) enable pre-registration of non-hypothesis-driven research, (3) help ensure that pre-registered analyses will be appropriate for the data, and (4) address difficulties arising from reduced analytic flexibility in pre-registration. For each solution, we provide guidance on implementation for researchers and data guardians. The adoption of these practices can help to protect against researcher bias in secondary data analysis, to improve the robustness of research based on existing data.
Affiliation(s)
- Jessie R Baldwin: Department of Clinical, Educational and Health Psychology, Division of Psychology and Language Sciences, University College London, London, WC1H 0AP, UK; Social, Genetic and Developmental Psychiatry Centre, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK
- Jean-Baptiste Pingault: Department of Clinical, Educational and Health Psychology, Division of Psychology and Language Sciences, University College London, London, WC1H 0AP, UK; Social, Genetic and Developmental Psychiatry Centre, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK
- Tabea Schoeler: Department of Clinical, Educational and Health Psychology, Division of Psychology and Language Sciences, University College London, London, WC1H 0AP, UK
- Hannah M Sallis: MRC Integrative Epidemiology Unit at the University of Bristol, Bristol Medical School, University of Bristol, Bristol, UK; School of Psychological Science, University of Bristol, Bristol, UK; Centre for Academic Mental Health, Population Health Sciences, University of Bristol, Bristol, UK
- Marcus R Munafò: MRC Integrative Epidemiology Unit at the University of Bristol, Bristol Medical School, University of Bristol, Bristol, UK; School of Psychological Science, University of Bristol, Bristol, UK; NIHR Biomedical Research Centre, University Hospitals Bristol NHS Foundation Trust and University of Bristol, Bristol, UK
15. Stewart AJ, Farran EK, Grange JA, Macleod M, Munafò M, Newton P, Shanks DR. Improving research quality: the view from the UK Reproducibility Network institutional leads for research improvement. BMC Res Notes 2021; 14:458. PMID: 34930427. PMCID: PMC8686561. DOI: 10.1186/s13104-021-05883-3.
Abstract
The adoption and incentivisation of open and transparent research practices is critical in addressing issues around research reproducibility and research integrity. These practices will require training and funding. Individuals need to be incentivised to adopt open and transparent research practices (e.g., by adding them as desirable criteria in hiring, probation, and promotion decisions; by recognising that funded research should be conducted openly and transparently; and by publishers mandating the publication of research workflows and appropriately curated data associated with each research output). Similarly, institutions need to be incentivised to encourage the adoption of open and transparent practices by researchers. Research quality should be prioritised over research quantity. As research transparency will look different for different disciplines, there can be no one-size-fits-all approach. An outward-looking and joined-up UK research strategy is needed that places openness and transparency at the heart of research activity. This should involve key stakeholders (institutions, research organisations, funders, publishers, and Government) and, crucially, should be focused on action. Failure to do this will have negative consequences not just for UK research, but also for our ability to innovate and subsequently commercialise UK-led discovery.
16. Incidence of invasive fungal infection in acute lymphoblastic and acute myelogenous leukemia in the era of antimold prophylaxis. Sci Rep 2021; 11:22160. PMID: 34773060. PMCID: PMC8590008. DOI: 10.1038/s41598-021-01716-2.
Abstract
The incidence of invasive fungal infection (IFI) in patients with acute myeloid leukemia (AML) has decreased with the introduction of antimold prophylaxis. Although acute lymphoblastic leukemia (ALL) has a lower risk of IFI than does AML, the incidences of IFI in both AML and ALL in the era of antimold prophylaxis should be re-evaluated. We analyzed adults with AML or ALL who had undergone induction, re-induction, or consolidation chemotherapy from January 2017 to December 2019 at Seoul National University Hospital. Their clinical characteristics during each chemotherapy episode were reviewed, and cases with proven or probable diagnoses were regarded as positive for IFI. Of 552 episodes (393 in AML and 159 in ALL), 40 (7.2%) were IFI events. Of the IFI episodes, 8.1% (12/148) and 5.9% (13/220) (P = 0.856) occurred in cases of ALL without antimold prophylaxis and AML with antimold prophylaxis, respectively. After adjusting for clinical factors, a lack of antimold prophylaxis (adjusted odds ratio [aOR], 3.52; 95% confidence interval [CI], 1.35–9.22; P = 0.010) and a longer duration of neutropenia (per one day, aOR, 1.02; 95% CI, 1.01–1.04; P = 0.001) were independently associated with IFI. In conclusion, the incidence of IFI in ALL without antimold prophylaxis was not lower than that in AML. A lack of antimold prophylaxis and prolonged neutropenia were independent risk factors for IFI. Clinicians should be on guard for detecting IFI in patients with ALL, especially those with risk factors.