1
Hasan E, Duhaime E, Trueblood JS. Boosting wisdom of the crowd for medical image annotation using training performance and task features. Cogn Res Princ Implic 2024; 9:31. [PMID: 38763994] [PMCID: PMC11102897] [DOI: 10.1186/s41235-024-00558-6]
Abstract
A crucial bottleneck in medical artificial intelligence (AI) is the scarcity of high-quality labeled medical datasets. In this paper, we test a large variety of wisdom of the crowd algorithms to label medical images that were initially classified by individuals recruited through an app-based platform. Individuals classified skin lesions from the International Skin Lesion Challenge 2018 into 7 different categories. There was a large dispersion in the geographical location, experience, training, and performance of the recruited individuals. We tested several wisdom of the crowd algorithms of varying complexity, from a simple unweighted average to more complex Bayesian models that account for individual patterns of errors. Using a switchboard analysis, we observe that the best-performing algorithms rely on selecting top performers, weighting decisions by training accuracy, and taking the task environment into account. These algorithms far exceed expert performance. We conclude by discussing the implications of these approaches for the development of medical AI.
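The aggregation idea can be sketched in a few lines. The snippet below is an illustrative Python sketch, not the paper's actual algorithms (which include Bayesian models of individual error patterns): it keeps the top-k raters by training accuracy and weights each retained vote by that accuracy. The rater IDs, accuracy values, and `top_k` setting are invented for illustration.

```python
# Illustrative sketch: aggregate 7-way skin lesion labels from a crowd by
# keeping only the top-performing raters and weighting each retained vote
# by that rater's accuracy on training items.
from collections import defaultdict

def aggregate_crowd_labels(votes, training_accuracy, top_k=10):
    """votes: dict rater_id -> chosen category (0..6) for one image.
    training_accuracy: dict rater_id -> accuracy on held-out training images."""
    # keep the top_k raters by training accuracy among those who voted
    ranked = sorted(votes, key=lambda r: training_accuracy.get(r, 0.0), reverse=True)
    kept = ranked[:top_k]
    # accumulate accuracy-weighted votes per category
    score = defaultdict(float)
    for rater in kept:
        score[votes[rater]] += training_accuracy.get(rater, 0.0)
    return max(score, key=score.get)

# toy usage with invented raters
votes = {"r1": 3, "r2": 3, "r3": 5, "r4": 3}
acc = {"r1": 0.62, "r2": 0.55, "r3": 0.90, "r4": 0.40}
print(aggregate_crowd_labels(votes, acc, top_k=3))  # -> 3
```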
Affiliation(s)
- Eeshan Hasan
- Department of Psychological and Brain Sciences, Indiana University, 1101 E. 10th St., Bloomington, IN, 47405-7007, USA.
- Cognitive Science Program, Indiana University, Bloomington, USA.
- Jennifer S Trueblood
- Department of Psychological and Brain Sciences, Indiana University, 1101 E. 10th St., Bloomington, IN, 47405-7007, USA.
- Cognitive Science Program, Indiana University, Bloomington, USA.
2
Lovis C, Weber J, Liopyris K, Braun RP, Marghoob AA, Quigley EA, Nelson K, Prentice K, Duhaime E, Halpern AC, Rotemberg V. Agreement Between Experts and an Untrained Crowd for Identifying Dermoscopic Features Using a Gamified App: Reader Feasibility Study. JMIR Med Inform 2023; 11:e38412. [PMID: 36652282] [PMCID: PMC9892985] [DOI: 10.2196/38412]
Abstract
BACKGROUND Dermoscopy is commonly used for the evaluation of pigmented lesions, but agreement between experts for identification of dermoscopic structures is known to be relatively poor. Expert labeling of medical data is a bottleneck in the development of machine learning (ML) tools, and crowdsourcing has been demonstrated as a cost- and time-efficient method for the annotation of medical images. OBJECTIVE The aim of this study is to demonstrate that crowdsourcing can be used to label basic dermoscopic structures from images of pigmented lesions with similar reliability to a group of experts. METHODS First, we obtained labels of 248 images of melanocytic lesions with 31 dermoscopic "subfeatures" labeled by 20 dermoscopy experts. These were then collapsed into 6 dermoscopic "superfeatures" based on structural similarity, due to low interrater reliability (IRR): dots, globules, lines, network structures, regression structures, and vessels. These images were then used as the gold standard for the crowd study. The commercial platform DiagnosUs was used to obtain annotations from a nonexpert crowd for the presence or absence of the 6 superfeatures in each of the 248 images. We replicated this methodology with a group of 7 dermatologists to allow direct comparison with the nonexpert crowd. The Cohen κ value was used to measure agreement across raters. RESULTS In total, we obtained 139,731 ratings of the 6 dermoscopic superfeatures from the crowd. There was relatively lower agreement for the identification of dots and globules (the median κ values were 0.526 and 0.395, respectively), whereas network structures and vessels showed the highest agreement (the median κ values were 0.581 and 0.798, respectively). This pattern was also seen among the expert raters, who had median κ values of 0.483 and 0.517 for dots and globules, respectively, and 0.758 and 0.790 for network structures and vessels. The median κ values between nonexperts and thresholded average-expert readers were 0.709 for dots, 0.719 for globules, 0.714 for lines, 0.838 for network structures, 0.818 for regression structures, and 0.728 for vessels. CONCLUSIONS This study confirmed that IRR for different dermoscopic features varied among a group of experts; a similar pattern was observed in a nonexpert crowd. There was good or excellent agreement for each of the 6 superfeatures between the crowd and the experts, highlighting the similar reliability of the crowd for labeling dermoscopic images. This confirms the feasibility and dependability of using crowdsourcing as a scalable solution to annotate large sets of dermoscopic images, with several potential clinical and educational applications, including the development of novel, explainable ML tools.
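For reference, the agreement statistic reported throughout the study can be computed as in the generic sketch below, which uses scikit-learn's `cohen_kappa_score` on invented presence/absence ratings rather than the study's data.

```python
# Minimal sketch of Cohen's kappa between two raters' binary labels for one
# dermoscopic superfeature (e.g., "network structures" present or absent).
from sklearn.metrics import cohen_kappa_score

rater_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]  # invented ratings
rater_b = [1, 0, 1, 0, 0, 0, 1, 1, 1, 1]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa = {kappa:.3f}")
```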
Affiliation(s)
- Jochen Weber
- Dermatology Section, Memorial Sloan Kettering Cancer Center, New York, NY, United States
- Konstantinos Liopyris
- Department of Dermatology, Andreas Syggros Hospital of Cutaneous and Venereal Diseases, University of Athens, Athens, Greece
- Ralph P Braun
- Department of Dermatology, University Hospital Zurich, Zurich, Switzerland
- Ashfaq A Marghoob
- Dermatology Section, Memorial Sloan Kettering Cancer Center, New York, NY, United States
- Elizabeth A Quigley
- Dermatology Section, Memorial Sloan Kettering Cancer Center, New York, NY, United States
- Kelly Nelson
- Department of Dermatology, The University of Texas MD Anderson Cancer Center, Houston, TX, United States
- Allan C Halpern
- Dermatology Section, Memorial Sloan Kettering Cancer Center, New York, NY, United States
- Veronica Rotemberg
- Dermatology Section, Memorial Sloan Kettering Cancer Center, New York, NY, United States
3
Zhang Z, Citardi D, Wang D, Genc Y, Shan J, Fan X. Patients' perceptions of using artificial intelligence (AI)-based technology to comprehend radiology imaging data. Health Informatics J 2021; 27:14604582211011215. [PMID: 33913359] [DOI: 10.1177/14604582211011215]
Abstract
Results of radiology imaging studies are not typically comprehensible to patients. With the advances in artificial intelligence (AI) technology in recent years, it is expected that AI technology can aid patients' understanding of radiology imaging data. The aim of this study is to understand patients' perceptions and acceptance of using AI technology to interpret their radiology reports. We conducted semi-structured interviews with 13 participants to elicit reflections pertaining to the use of AI technology in radiology report interpretation. A thematic analysis approach was employed to analyze the interview data. Participants had a generally positive attitude toward using AI-based systems to comprehend their radiology reports. AI was perceived as particularly useful for seeking actionable information, confirming the doctor's opinions, and preparing for the consultation. However, we also found various concerns related to the use of AI in this context, such as cyber-security, accuracy, and lack of empathy. Our results highlight the necessity of providing AI explanations to promote people's trust and acceptance of AI. Designers of patient-centered AI systems should employ user-centered design approaches to address patients' concerns. Such systems should also be designed to promote trust and deliver concerning health results in an empathetic manner to optimize the user experience.
4
Rother A, Niemann U, Hielscher T, Völzke H, Ittermann T, Spiliopoulou M. Assessing the difficulty of annotating medical data in crowdworking with help of experiments. PLoS One 2021; 16:e0254764. [PMID: 34324540] [PMCID: PMC8321104] [DOI: 10.1371/journal.pone.0254764]
Abstract
BACKGROUND As healthcare-related data proliferate, there is a need to annotate them expertly for the purposes of personalized medicine. Crowdworking is an alternative to expensive expert labour. Annotation corresponds to diagnosis, so comparing unlabeled records to labeled ones seems more appropriate for crowdworkers without medical expertise. We modeled the comparison of a record to two other records as a triplet annotation task, and we conducted an experiment to investigate to what extent sensor-measured stress, task duration, uncertainty of the annotators and agreement among the annotators could predict annotation correctness. MATERIALS AND METHODS We conducted an annotation experiment on health data from a population-based study. The triplet annotation task was to decide whether an individual was more similar to a healthy one or to one with a given disorder. We used hepatic steatosis as the example disorder and described the individuals with 10 pre-selected characteristics related to this disorder. We recorded task duration, electro-dermal activity as a stress indicator, and uncertainty as stated by the experiment participants (n = 29 non-experts and three experts) for 30 triplets. We built an Artificial Similarity-Based Annotator (ASBA) and compared its correctness and uncertainty to those of the experiment participants. RESULTS We found no correlation between correctness and stated uncertainty, stress, or task duration. Annotator agreement was not predictive either. Notably, for some tasks, annotators agreed unanimously on an incorrect annotation. When controlling for triplet ID, we identified significant correlations, indicating that correctness, stress levels and annotation duration depend on the task itself. Average correctness among the experiment participants was slightly lower than that achieved by ASBA. Triplet annotation turned out to be similarly difficult for experts and non-experts. CONCLUSION Our lab experiment indicates that the task of triplet annotation must be prepared cautiously if delegated to crowdworkers. Neither certainty nor agreement among annotators should be assumed to imply correct annotation, because annotators may misjudge difficult tasks as easy and agree on incorrect annotations. Further research is needed to improve visualizations for complex tasks and to judiciously decide how much information to provide. Out-of-the-lab experiments in a crowdworker setting are needed to identify appropriate designs for human-annotation tasks and to assess under what circumstances non-human annotation should be preferred.
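An artificial annotator of this kind can be approximated by a simple nearest-reference rule. The sketch below is one interpretation of the ASBA idea under stated assumptions (Euclidean distance over standardized characteristics); the feature values are invented, and the paper's exact similarity measure may differ.

```python
# Hypothetical similarity-based triplet annotator: given one unlabeled
# individual and two references (one healthy, one with hepatic steatosis),
# choose the reference with the smaller distance over the characteristics.
import numpy as np

def annotate_triplet(target, ref_healthy, ref_disorder):
    """Each argument is a 1-D array of the same standardized characteristics."""
    d_healthy = np.linalg.norm(target - ref_healthy)
    d_disorder = np.linalg.norm(target - ref_disorder)
    return "healthy" if d_healthy < d_disorder else "disorder"

# toy usage with invented, z-scored characteristics
target = np.array([0.2, 1.1, -0.3, 0.8])
healthy = np.array([0.0, 0.1, -0.2, 0.1])
disorder = np.array([1.5, 2.0, 0.9, 1.8])
print(annotate_triplet(target, healthy, disorder))  # -> "healthy"
```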
Affiliation(s)
- Anne Rother
- Faculty of Computer Science, Otto von Guericke University Magdeburg, Magdeburg, Germany
- Uli Niemann
- Faculty of Computer Science, Otto von Guericke University Magdeburg, Magdeburg, Germany
- Tommy Hielscher
- Faculty of Computer Science, Otto von Guericke University Magdeburg, Magdeburg, Germany
- Henry Völzke
- Institute for Community Medicine, University Medicine Greifswald, Greifswald, Germany
- Till Ittermann
- Institute for Community Medicine, University Medicine Greifswald, Greifswald, Germany
- Myra Spiliopoulou
- Faculty of Computer Science, Otto von Guericke University Magdeburg, Magdeburg, Germany
5
Casey A, Davidson E, Poon M, Dong H, Duma D, Grivas A, Grover C, Suárez-Paniagua V, Tobin R, Whiteley W, Wu H, Alex B. A systematic review of natural language processing applied to radiology reports. BMC Med Inform Decis Mak 2021; 21:179. [PMID: 34082729] [PMCID: PMC8176715] [DOI: 10.1186/s12911-021-01533-7]
Abstract
BACKGROUND Natural language processing (NLP) has a significant role in advancing healthcare and has been found to be key in extracting structured information from radiology reports. Understanding recent developments in NLP application to radiology is of significance, but recent reviews on this are limited. This study systematically assesses and quantifies recent literature in NLP applied to radiology reports. METHODS We conduct an automated literature search yielding 4836 results, using automated filtering, metadata-enriching steps and citation search combined with manual review. Our analysis is based on 21 variables including radiology characteristics, NLP methodology, performance, study, and clinical application characteristics. RESULTS We present a comprehensive analysis of the 164 publications retrieved, with publications in 2019 almost triple those in 2015. Each publication is categorised into one of 6 clinical application categories. Deep learning use increases over the period, but conventional machine learning approaches are still prevalent. Deep learning remains challenged when data are scarce, and there is little evidence of adoption into clinical practice. Despite 17% of studies reporting F1 scores greater than 0.85, it is hard to comparatively evaluate these approaches given that most of them use different datasets. Only 14 studies made their data available and 15 their code, with 10 externally validating their results. CONCLUSIONS Automated understanding of the clinical narratives in radiology reports has the potential to enhance the healthcare process, and we show that research in this field continues to grow. Reproducibility and explainability of models are important if the domain is to move applications into clinical use. More could be done to share code, enabling validation of methods on different institutional data, and to reduce heterogeneity in the reporting of study properties, allowing inter-study comparisons. Our results have significance for researchers in the field, providing a systematic synthesis of existing work to build on, identifying gaps and opportunities for collaboration, and helping avoid duplication.
Affiliation(s)
- Arlene Casey
- School of Literatures, Languages and Cultures (LLC), University of Edinburgh, Edinburgh, Scotland
- Emma Davidson
- Centre for Clinical Brain Sciences, University of Edinburgh, Edinburgh, Scotland
- Michael Poon
- Centre for Clinical Brain Sciences, University of Edinburgh, Edinburgh, Scotland
- Hang Dong
- Centre for Medical Informatics, Usher Institute of Population Health Sciences and Informatics, University of Edinburgh, Edinburgh, Scotland
- Health Data Research UK, London, UK
- Daniel Duma
- School of Literatures, Languages and Cultures (LLC), University of Edinburgh, Edinburgh, Scotland
- Andreas Grivas
- Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh, Edinburgh, Scotland
- Claire Grover
- Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh, Edinburgh, Scotland
- Víctor Suárez-Paniagua
- Centre for Medical Informatics, Usher Institute of Population Health Sciences and Informatics, University of Edinburgh, Edinburgh, Scotland
- Health Data Research UK, London, UK
- Richard Tobin
- Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh, Edinburgh, Scotland
- William Whiteley
- Centre for Clinical Brain Sciences, University of Edinburgh, Edinburgh, Scotland
- Nuffield Department of Population Health, University of Oxford, Oxford, UK
- Honghan Wu
- Health Data Research UK, London, UK
- Institute of Health Informatics, University College London, London, UK
- Beatrice Alex
- School of Literatures, Languages and Cultures (LLC), University of Edinburgh, Edinburgh, Scotland
- Edinburgh Futures Institute, University of Edinburgh, Edinburgh, Scotland
6
Zhang Z, Genc Y, Wang D, Ahsen ME, Fan X. Effect of AI Explanations on Human Perceptions of Patient-Facing AI-Powered Healthcare Systems. J Med Syst 2021; 45:64. [PMID: 33948743] [DOI: 10.1007/s10916-021-01743-6]
Abstract
Ongoing research efforts have been examining how to utilize artificial intelligence technology to help healthcare consumers make sense of their clinical data, such as diagnostic radiology reports. How to promote the acceptance of such novel technology is a heated research topic. Recent studies highlight the importance of providing local explanations about AI predictions and model performance to help users determine whether to trust AI's predictions. Despite some efforts, limited empirical research has been conducted to quantitatively measure how AI explanations impact healthcare consumers' perceptions of using patient-facing, AI-powered healthcare systems. The aim of this study is to evaluate the effects of different AI explanations on people's perceptions of AI-powered healthcare systems. In this work, we designed and deployed a large-scale experiment (N = 3,423) on Amazon Mechanical Turk (MTurk) to evaluate the effects of AI explanations on people's perceptions in the context of comprehending radiology reports. We created four groups based on two factors, the extent of explanations for the prediction (High vs. Low Transparency) and the model performance (Good vs. Weak AI Model), and randomly assigned participants to one of the four conditions. Participants were instructed to classify a radiology report as describing a normal or abnormal finding and then completed a post-study survey to indicate their perceptions of the AI tool. We found that revealing model performance information can promote people's trust and perceived usefulness of system outputs, while providing local explanations for the rationale of a prediction can promote understandability but not necessarily trust. We also found that when model performance is low, the more information the AI system discloses, the less people trust the system. Lastly, whether humans agree with the AI's predictions and whether those predictions are correct can also influence the effect of AI explanations. We conclude by discussing implications for designing AI systems for healthcare consumers to interpret diagnostic reports.
Affiliation(s)
- Zhan Zhang
- School of Computer Science and Information Systems, Pace University, New York, USA.
- Yegin Genc
- School of Computer Science and Information Systems, Pace University, New York, USA
- Mehmet Eren Ahsen
- College of Business, University of Illinois At Urbana-Champaign, Champaign, USA
- Xiangmin Fan
- The Institute of Software, Chinese Academy of Sciences, Beijing, China
7
Sousa D, Lamurias A, Couto FM. A hybrid approach toward biomedical relation extraction training corpora: combining distant supervision with crowdsourcing. Database (Oxford) 2020; 2020:baaa104. [PMID: 33258966] [PMCID: PMC7706181] [DOI: 10.1093/database/baaa104]
Abstract
Biomedical relation extraction (RE) datasets are vital in the construction of knowledge bases and to potentiate the discovery of new interactions. There are several ways to create biomedical RE datasets, some more reliable than others, such as resorting to domain expert annotations. However, the emerging use of crowdsourcing platforms, such as Amazon Mechanical Turk (MTurk), can potentially reduce the cost of RE dataset construction, even if the same level of quality cannot be guaranteed. Researchers have little power to control who engages in crowdsourcing platforms, how, and in what context. Hence, allying distant supervision with crowdsourcing can be a more reliable alternative. The crowdsourcing workers would be asked only to rectify or discard already existing annotations, which would make the process less dependent on their ability to interpret complex biomedical sentences. In this work, we use a previously created distantly supervised human phenotype-gene relations (PGR) dataset to perform crowdsourcing validation. We divided the original dataset into two annotation tasks: Task 1, 70% of the dataset annotated by one worker, and Task 2, 30% of the dataset annotated by seven workers. Also, for Task 2, we added an extra rater on-site and a domain expert to further assess the quality of the crowdsourcing validation. Here, we describe a detailed pipeline for RE crowdsourcing validation, create a new release of the PGR dataset with partial domain-expert revision, and assess the quality of the MTurk platform. We applied the new dataset to two state-of-the-art deep learning systems (BiOnt and BioBERT) and compared its performance with the original PGR dataset, as well as with combinations of the two, achieving a 0.3494 increase in average F-measure. The code supporting our work and the new release of the PGR dataset are available at https://github.com/lasigeBioTM/PGR-crowd.
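The validation step can be illustrated with a simple vote-aggregation rule. The following sketch is an assumption about the general workflow rather than the authors' exact pipeline: each distantly supervised relation is kept only if a majority of its crowd raters confirm it. The relation IDs and votes are invented.

```python
# Illustrative sketch: retain a distantly supervised phenotype-gene relation
# only if more than half of its crowd raters vote to keep it.
def validate_annotations(crowd_votes, min_ratio=0.5):
    """crowd_votes: dict relation_id -> list of booleans (True = keep)."""
    validated = {}
    for rel_id, votes in crowd_votes.items():
        validated[rel_id] = sum(votes) / len(votes) > min_ratio
    return validated

# toy usage: seven workers per relation, as in Task 2
votes = {"PGR_001": [True, True, False, True, True, True, True],
         "PGR_002": [False, True, False, False, False, True, False]}
print(validate_annotations(votes))  # {'PGR_001': True, 'PGR_002': False}
```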
Affiliation(s)
- Diana Sousa
- LASIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, Lisboa 1749-016, Portugal
- Andre Lamurias
- LASIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, Lisboa 1749-016, Portugal
- Francisco M Couto
- LASIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, Lisboa 1749-016, Portugal
8
Spasic I, Nenadic G. Clinical Text Data in Machine Learning: Systematic Review. JMIR Med Inform 2020; 8:e17984. [PMID: 32229465] [PMCID: PMC7157505] [DOI: 10.2196/17984]
Abstract
Background Clinical narratives represent the main form of communication within health care, providing a personalized account of patient history and assessments, and offering rich information for clinical decision making. Natural language processing (NLP) has repeatedly demonstrated its potential to unlock evidence buried in clinical narratives. Machine learning can facilitate rapid development of NLP tools by leveraging large amounts of text data. Objective The main aim of this study was to provide systematic evidence on the properties of text data used to train machine learning approaches to clinical NLP. We also investigated the types of NLP tasks that have been supported by machine learning and how they can be applied in clinical practice. Methods Our methodology was based on the guidelines for performing systematic reviews. In August 2018, we used PubMed, a multifaceted interface, to perform a literature search against MEDLINE. We identified 110 relevant studies and extracted information about text data used to support machine learning, NLP tasks supported, and their clinical applications. The data properties considered included their size, provenance, collection methods, annotation, and any relevant statistics. Results The majority of datasets used to train machine learning models included only hundreds or thousands of documents. Only 10 studies used tens of thousands of documents, with a handful of studies utilizing more. Relatively small datasets were utilized for training even when much larger datasets were available. The main reason for such poor data utilization is the annotation bottleneck faced by supervised machine learning algorithms. Active learning was explored to iteratively sample a subset of data for manual annotation as a strategy for minimizing the annotation effort while maximizing the predictive performance of the model. Supervised learning was successfully used where clinical codes, integrated with free-text notes in electronic health records, were utilized as class labels. Similarly, distant supervision was used to utilize an existing knowledge base to automatically annotate raw text. Where manual annotation was unavoidable, crowdsourcing was explored, but it remains unsuitable because of the sensitive nature of the data considered. Besides the small volume, training data were typically sourced from a small number of institutions, thus offering no hard evidence about the transferability of machine learning models. The majority of studies focused on text classification. Most commonly, the classification results were used to support phenotyping, prognosis, care improvement, resource management, and surveillance. Conclusions We identified the data annotation bottleneck as one of the key obstacles to machine learning approaches in clinical NLP. Active learning and distant supervision were explored as ways of reducing the annotation effort. Future research in this field would benefit from alternatives such as data augmentation and transfer learning, or unsupervised learning, which do not require data annotation.
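As an illustration of the active-learning strategy mentioned above, the sketch below runs uncertainty sampling with a logistic-regression classifier on synthetic data; it is a generic example under stated assumptions, not a reconstruction of any reviewed study's setup.

```python
# Minimal uncertainty-sampling loop: train on the labeled pool, then send the
# documents the model is least sure about to a (simulated) human annotator.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                 # synthetic stand-in for text features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # hidden "true" labels

labeled = list(range(20))                      # small seed set
unlabeled = list(range(20, 500))

for round_ in range(5):
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[unlabeled])[:, 1]
    uncertainty = np.abs(proba - 0.5)          # closest to 0.5 = least certain
    query = [unlabeled[i] for i in np.argsort(uncertainty)[:10]]
    labeled += query                           # simulate the annotator labeling them
    unlabeled = [i for i in unlabeled if i not in query]
    print(f"round {round_}: {len(labeled)} labeled documents")
```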
Affiliation(s)
- Irena Spasic
- School of Computer Science and Informatics, Cardiff University, Cardiff, United Kingdom
- Goran Nenadic
- Department of Computer Science, University of Manchester, Manchester, United Kingdom
9
Crowdsourcing in health and medical research: a systematic review. Infect Dis Poverty 2020; 9:8. [PMID: 31959234] [PMCID: PMC6971908] [DOI: 10.1186/s40249-020-0622-9]
Abstract
Background Crowdsourcing is used increasingly in health and medical research. Crowdsourcing is the process of aggregating crowd wisdom to solve a problem. The purpose of this systematic review is to summarize quantitative evidence on crowdsourcing to improve health. Methods We followed Cochrane systematic review guidance and systematically searched seven databases up to September 4th 2019. Studies were included if they reported on crowdsourcing and related to health or medicine. Studies were excluded if recruitment was the only use of crowdsourcing. We determined the level of evidence associated with review findings using the GRADE approach. Results We screened 3508 citations, accessed 362 articles, and included 188 studies. Ninety-six studies examined effectiveness, 127 examined feasibility, and 37 examined cost. The most common purposes were to evaluate surgical skills (17 studies), to create sexual health messages (seven studies), and to provide layperson cardio-pulmonary resuscitation (CPR) out-of-hospital (six studies). Seventeen observational studies used crowdsourcing to evaluate surgical skills, finding that crowdsourcing evaluation was as effective as expert evaluation (low quality). Four studies used a challenge contest to solicit human immunodeficiency virus (HIV) testing promotion materials and increase HIV testing rates (moderate quality), and two of the four studies found this approach saved money. Three studies suggested that an interactive technology system increased rates of layperson initiated CPR out-of-hospital (moderate quality). However, studies analyzing crowdsourcing to evaluate surgical skills and layperson-initiated CPR were only from high-income countries. Five studies examined crowdsourcing to inform artificial intelligence projects, most often related to annotation of medical data. Crowdsourcing was evaluated using different outcomes, limiting the extent to which studies could be pooled. Conclusions Crowdsourcing has been used to improve health in many settings. Although crowdsourcing is effective at improving behavioral outcomes, more research is needed to understand effects on clinical outcomes and costs. More research is needed on crowdsourcing as a tool to develop artificial intelligence systems in medicine. Trial registration PROSPERO: CRD42017052835. December 27, 2016.
10
Moccia S, Romeo L, Migliorelli L, Frontoni E, Zingaretti P. Supervised CNN Strategies for Optical Image Segmentation and Classification in Interventional Medicine. Intelligent Systems Reference Library 2020. [DOI: 10.1007/978-3-030-42750-4_8]
11
Grote A, Schaadt NS, Forestier G, Wemmert C, Feuerhake F. Crowdsourcing of Histological Image Labeling and Object Delineation by Medical Students. IEEE Trans Med Imaging 2019; 38:1284-1294. [PMID: 30489264] [DOI: 10.1109/tmi.2018.2883237]
Abstract
Crowdsourcing in pathology has been performed on tasks that are assumed to be manageable by nonexperts. Demand remains high for annotations of more complex elements in digital microscopic images, such as anatomical structures. Therefore, this paper investigates conditions to enable crowdsourced annotations of high-level image objects, a complex task considered to require expert knowledge. Seventy-six medical students without specific domain knowledge who voluntarily participated in three experiments solved two relevant annotation tasks on histopathological images: 1) labeling of images showing tissue regions and 2) delineation of morphologically defined image objects. We focus on methods to ensure sufficient annotation quality, including several tests on the required number of participants and on the correlation of participants' performance between tasks. In a setup simulating annotation of images with limited ground truth, we validated the feasibility of a confidence score using full ground truth. For this, we computed a majority vote using weighting factors based on the individual assessment of contributors against a scattered gold standard annotated by pathologists. In conclusion, we provide guidance for task design and quality control to enable a crowdsourced approach to obtain accurate annotations required in the era of digital pathology.
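The weighting scheme can be illustrated as follows; this is one interpretation of the described approach, not the authors' exact formula: each contributor's confidence score is their agreement with the pathologists' scattered gold standard, and that score then weights their label in a majority vote on images without gold standard. All identifiers, the fallback weight, and the data structures are assumptions.

```python
# Sketch: estimate per-contributor weights from a scattered (partial) gold
# standard, then use them in a weighted vote on an unverified image.
def confidence_scores(labels_by_student, gold):
    """labels_by_student: dict student -> {image_id: label};
    gold: {image_id: label}, available only for a scattered subset."""
    scores = {}
    for student, labels in labels_by_student.items():
        overlap = [img for img in labels if img in gold]
        if overlap:
            scores[student] = sum(labels[img] == gold[img] for img in overlap) / len(overlap)
        else:
            scores[student] = 0.5  # no gold overlap: neutral weight (assumption)
    return scores

def weighted_vote(image_id, labels_by_student, scores):
    tally = {}
    for student, labels in labels_by_student.items():
        if image_id in labels:
            tally[labels[image_id]] = tally.get(labels[image_id], 0.0) + scores[student]
    return max(tally, key=tally.get)

# toy usage with invented labels
labels = {"s1": {"img1": 1, "img2": 0, "img3": 1},
          "s2": {"img1": 0, "img2": 0, "img3": 1},
          "s3": {"img1": 1, "img2": 1, "img3": 0}}
gold = {"img1": 1, "img2": 0}            # scattered gold standard
w = confidence_scores(labels, gold)      # s1: 1.0, s2: 0.5, s3: 0.5
print(weighted_vote("img3", labels, w))  # -> 1
```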
12
Tkaczyk ER, Coco JR, Wang J, Chen F, Ye C, Jagasia MH, Dawant BM, Fabbri D. Crowdsourcing to delineate skin affected by chronic graft-vs-host disease. Skin Res Technol 2019; 25:572-577. [PMID: 30786065] [DOI: 10.1111/srt.12688]
Abstract
BACKGROUND Estimating the extent of affected skin is an important unmet clinical need both for research and practical management in many diseases. In particular, cutaneous burden of chronic graft-vs-host disease (cGVHD) is a primary outcome in many trials. Despite advances in artificial intelligence and 3D photography, progress toward reliable automated techniques is hindered by limited expert time to delineate cGVHD patient images. Crowdsourcing may have potential to provide the requisite expert-level data. MATERIALS AND METHODS Forty-one three-dimensional photographs of three cutaneous cGVHD patients were delineated by a board-certified dermatologist. 410 two-dimensional projections of the raw photos were each annotated by seven crowd workers, whose consensus performance was compared to the expert. RESULTS The consensus delineation by four of seven crowd workers achieved the highest agreement with the expert, measured by a median Dice index of 0.7551 across all 410 images, outperforming even the best worker from the crowd (Dice index 0.7216). For their internal agreement, crowd workers achieved a median Fleiss's kappa of 0.4140 across the images. The time a worker spent marking an image had only weak correlation with the surface area marked, and very low correlation with accuracy. Percent of pixels selected by the consensus exhibited good correlation (Pearson R = 0.81) with the patient's affected surface area. CONCLUSION Crowdsourcing may be an efficient method for obtaining demarcations of affected skin, on par with expert performance. Crowdsourced data generally agreed with the current clinical standard of percent body surface area to assess cGVHD severity in the skin.
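The two quantities reported above, a k-of-n consensus mask and its Dice index against the expert delineation, can be sketched as follows. The masks below are random toy arrays rather than patient data; only the 4-of-7 consensus rule is taken from the study.

```python
# Sketch: build a 4-of-7 consensus mask from crowd delineations and compare it
# to an expert mask with the Dice index.
import numpy as np

def consensus_mask(worker_masks, min_votes=4):
    """worker_masks: boolean array of shape (n_workers, H, W)."""
    return worker_masks.sum(axis=0) >= min_votes

def dice(a, b):
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

workers = np.random.default_rng(1).random((7, 64, 64)) > 0.5  # toy delineations
expert = np.random.default_rng(2).random((64, 64)) > 0.5      # toy expert mask
print(dice(consensus_mask(workers), expert))
```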
Affiliation(s)
- Eric R Tkaczyk
- Department of Veterans Affairs, Tennessee Valley Health System, Nashville, Tennessee; Department of Dermatology, Vanderbilt Cutaneous Imaging Clinic, Vanderbilt University Medical Center, Nashville, Tennessee; Department of Biomedical Engineering, Vanderbilt University, Nashville, Tennessee
- Joseph R Coco
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee
- Jianing Wang
- Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, Tennessee
- Fuyao Chen
- Department of Veterans Affairs, Tennessee Valley Health System, Nashville, Tennessee; Department of Dermatology, Vanderbilt Cutaneous Imaging Clinic, Vanderbilt University Medical Center, Nashville, Tennessee; Department of Biomedical Engineering, Vanderbilt University, Nashville, Tennessee
- Cheng Ye
- Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, Tennessee
- Benoit M Dawant
- Department of Biomedical Engineering, Vanderbilt University, Nashville, Tennessee; Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, Tennessee
- Daniel Fabbri
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee; Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, Tennessee
13
Moreno I, Boldrini E, Moreda P, Romá-Ferri MT. DrugSemantics: A corpus for Named Entity Recognition in Spanish Summaries of Product Characteristics. J Biomed Inform 2017. [PMID: 28624642] [DOI: 10.1016/j.jbi.2017.06.013]
Abstract
For the healthcare sector, it is critical to exploit the vast amount of textual health-related information. Nevertheless, healthcare providers have difficulty benefiting from such a quantity of data during pharmacotherapeutic care. The problem is that this information is stored in different sources and the time available to consult them is limited. In this context, Natural Language Processing techniques can be applied to efficiently transform textual data into structured information that can be used in critical healthcare applications, helping physicians with their daily workload in areas such as decision support systems, cohort identification, and patient management. Any development of these techniques requires annotated corpora. However, there is a lack of such resources in this domain and, in most cases, the few available concern English. This paper presents the definition and creation of the DrugSemantics corpus, a collection of Summaries of Product Characteristics in Spanish. It was manually annotated with pharmacotherapeutic named entities, detailed in the DrugSemantics annotation scheme. Annotators were a Registered Nurse (RN) and two students from the Degree in Nursing. The quality of the DrugSemantics corpus has been assessed by measuring its annotation reliability (overall F=79.33% [95%CI: 78.35-80.31]), as well as its annotation precision (overall P=94.65% [95%CI: 94.11-95.19]). In addition, the gold-standard construction process is described in detail. In total, our corpus contains more than 2000 named entities, 780 sentences and 226,729 tokens. Last, a Named Entity Classification module trained on DrugSemantics is presented, both to show the quality of our corpus and to provide an example of how to use it.
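The reliability figures above are span-level agreement statistics; a minimal sketch of how such pairwise precision, recall, and F-measure can be computed over exact (offset, type) matches is shown below. The entity tuples and type names are invented for illustration and are not taken from the DrugSemantics annotation scheme.

```python
# Sketch: agreement of one annotator's entities against another's, using exact
# (start, end, type) matches.
def prf(entities_a, entities_b):
    """Each argument is a set of (start, end, entity_type) tuples."""
    tp = len(entities_a & entities_b)
    precision = tp / len(entities_a) if entities_a else 0.0
    recall = tp / len(entities_b) if entities_b else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# toy usage with invented spans and types
ann_rn = {(0, 11, "DRUG"), (25, 34, "UNIT"), (50, 58, "DISEASE")}
ann_student = {(0, 11, "DRUG"), (25, 34, "UNIT"), (60, 70, "DISEASE")}
print("P=%.2f R=%.2f F=%.2f" % prf(ann_rn, ann_student))
```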
Affiliation(s)
- Isabel Moreno
- Department of Software and Computing Systems, University of Alicante, Alicante, Spain.
- Ester Boldrini
- Department of Software and Computing Systems, University of Alicante, Alicante, Spain.
- Paloma Moreda
- Department of Software and Computing Systems, University of Alicante, Alicante, Spain.