1
Zaman N, Goldberg DM, Gruss RJ, Abrahams AS. A semiautomated risk assessment method for consumer products. RISK ANALYSIS : AN OFFICIAL PUBLICATION OF THE SOCIETY FOR RISK ANALYSIS 2024; 44:705-723. [PMID: 37337464 DOI: 10.1111/risa.14180] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/19/2021] [Revised: 03/12/2023] [Accepted: 06/02/2023] [Indexed: 06/21/2023]
Abstract
In this study, we develop a model that assesses product risk using online reviews from Amazon.com. We first identify unique words and phrases capable of identifying hazards. Second, we estimate risk severity using hazard type weights and risk likelihood using total reviews as a proxy for sales volume. In addition, we obtain expert assessments of product hazard risk (risk likelihood and severity) from a sample of high- and low-risk consumer products identified by a computerized risk assessment model we have developed. Third, we assess the validity of our computerized product risk assessment scoring model by utilizing the experts' survey responses. We find that our model is especially consistent with expert judgments of hazard likelihood but not as consistent with expert judgments of hazard severity. This model helps organizations to determine the risk severity, risk likelihood, and overall risk level of a specific product. The model produced by this study is helpful for product safety practitioners in product risk identification, characterization, and mitigation.
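The scoring idea in this abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' model: the hazard phrases, the hazard type weights, the way severity is aggregated, and the product of likelihood and severity as the overall risk are all assumptions made here for the example.

```python
# Hypothetical lexicon and weights for illustration only; the abstract does not
# list the study's actual hazard terms or hazard type weights.
HAZARD_TERMS = {
    "caught fire": "fire",
    "burned": "fire",
    "choking": "choking",
    "sharp edge": "laceration",
}
HAZARD_WEIGHTS = {"fire": 0.9, "choking": 1.0, "laceration": 0.5}


def hazards_in(review):
    """Return the hazard types whose trigger phrases appear in one review."""
    text = review.lower()
    return {hazard for term, hazard in HAZARD_TERMS.items() if term in text}


def score_product(reviews):
    """Score one product from its list of review texts."""
    flagged = [found for found in map(hazards_in, reviews) if found]
    if not flagged:
        return {"likelihood": 0.0, "severity": 0.0, "risk": 0.0}
    # Likelihood: share of reviews that mention a hazard. The total review
    # count stands in for sales volume, as described in the abstract.
    likelihood = len(flagged) / len(reviews)
    # Severity (assumed aggregation): mean, over flagged reviews, of the
    # heaviest hazard type mentioned in each review.
    severity = sum(max(HAZARD_WEIGHTS[h] for h in found) for found in flagged) / len(flagged)
    # Overall risk as likelihood times severity is an assumption of this sketch.
    return {"likelihood": likelihood, "severity": severity, "risk": likelihood * severity}
```

Calling `score_product` on four reviews, two of which mention a hazard phrase, gives a likelihood of 0.5; the severity then depends only on the assumed weights.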
Affiliation(s)
- Nohel Zaman
  - Department of Management, Information Systems & Quantitative Methods, University of Alabama - Birmingham, Birmingham, Alabama, USA
- David M Goldberg
  - Department of Management Information Systems, San Diego State University, San Diego, California, USA
- Richard J Gruss
  - Department of Management, Radford University, Radford, Virginia, USA
- Alan S Abrahams
  - Department of Business Information Technology, Virginia Tech, Blacksburg, Virginia, USA
2
Catchpoole J, Nanda G, Vallmuur K, Nand G, Lehto M. Application of a Machine Learning-based Decision Support Tool to Improve an Injury Surveillance System Workflow. Appl Clin Inform 2022; 13:700-710. [PMID: 35644141 DOI: 10.1055/a-1863-7176] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 11/02/2022] Open
Abstract
Background
Emergency department (ED)-based injury surveillance systems across many countries face resourcing challenges related to manual validation and coding of data.
Objective
This paper describes the evaluation of a machine learning-based Decision Support Tool (DST) to assist injury surveillance departments in the validation, coding and use of their data, comparing outcomes in coding time and accuracy pre- and post-implementation.
Methods
Manually coded injury surveillance data has been used to develop, train and iteratively refine a machine learning-based classifier to enable semi-automated coding of injury narrative data. This paper describes a trial implementation of the machine learning-based DST in the Queensland Injury Surveillance Unit (QISU) workflow using a major pediatric hospital's emergency department data comparing outcomes in coding time and accuracy pre- and post-implementation.
Results
The study found a 10% reduction in manual coding time after the DST was introduced. The Kappa statistics analysis of both DST-assisted and unassisted data shows increases in accuracy across three data fields: injury intent (85.4% unassisted vs. 94.5% assisted), external cause (88.8% unassisted vs. 91.8% assisted) and injury factor (89.3% unassisted vs. 92.9% assisted). The classifier was also used to produce a timely report monitoring injury patterns during the COVID-19 pandemic. Hence, it has the potential for near real-time surveillance of emerging hazards to inform public health responses.
Conclusions
The integration of the DST into the injury surveillance workflow shows benefits as it facilitates timely reporting and acts as a DST in the manual coding process.
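The abstract reports agreement using Kappa statistics. As a reminder of what such a statistic measures, here is a minimal sketch of Cohen's kappa for two sets of codes assigned to the same records. The function name and the example labels are hypothetical, and the abstract does not say exactly which kappa variant the study used.

```python
from collections import Counter


def cohen_kappa(codes_a, codes_b):
    """Cohen's kappa: agreement between two coders, corrected for chance."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both code lists must be non-empty and of equal length")
    n = len(codes_a)
    # Observed agreement: share of records given the same code by both coders.
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Expected agreement if each coder assigned codes independently at
    # their own marginal rates.
    counts_a, counts_b = Counter(codes_a), Counter(codes_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical code throughout
    return (observed - expected) / (1.0 - expected)
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, so kappa is stricter than raw percentage agreement when one code dominates.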
Affiliation(s)
- Jesani Catchpoole
  - Jamieson Trauma Institute, Metro North Hospital and Health Service, Herston, Australia
  - Queensland Injury Surveillance Unit, Metro North Hospital and Health Service, Herston, Australia
  - Queensland University of Technology, Kelvin Grove, Australia
- Gaurav Nanda
  - School of Engineering Technology, Purdue University, West Lafayette, United States
- Kirsten Vallmuur
  - Australian Centre for Health Services Innovation, Queensland University of Technology, Kelvin Grove, Australia
  - Jamieson Trauma Institute, Metro North Hospital and Health Service, Herston, Australia
- Goshad Nand
  - Queensland Injury Surveillance Unit, Metro North Hospital and Health Service, Herston, Australia
- Mark Lehto
  - Industrial Engineering, Purdue University, West Lafayette, United States
3
Relational Graph Convolutional Network for Text-Mining-Based Accident Causal Classification. APPLIED SCIENCES-BASEL 2022. [DOI: 10.3390/app12052482] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 11/16/2022]
Abstract
Accident investigation reports are text documents that systematically review and analyze the cause and process of accidents after they have occurred, and they are widely used in fields such as transportation, construction and aerospace. With the aid of accident investigation reports, the cause of an accident can be clearly identified, which provides an important basis for accident prevention and reliability assessment. However, since accident reports are mostly composed of unstructured data such as text, the analysis of accident causes inevitably relies on a great deal of expert experience, and statistical analyses also require extensive manual classification. In recent years, with the development of natural language processing technology, there have been many efforts to automatically analyze and classify text. However, the existing methods either rely on large corpora and cumbersome data preprocessing, or extract text information with bidirectional encoder representations from transformers (BERT) at an extremely high computational cost. These shortcomings mean that automatically analyzing accident investigation reports and extracting the information in them remains a great challenge. To address these problems, this study proposes a text-mining-based accident causal classification method based on a relational graph convolutional network (R-GCN) and pre-trained BERT. On the one hand, the proposed method avoids preprocessing such as stop word removal and word segmentation, which not only preserves the information in accident investigation reports to the greatest extent, but also avoids tedious operations. On the other hand, because R-GCN processes the semantic features obtained from the BERT representation, the computing resources needed to retrain BERT can be avoided.
4
Goldberg DM. Characterizing accident narratives with word embeddings: Improving accuracy, richness, and generalizability. JOURNAL OF SAFETY RESEARCH 2022; 80:441-455. [PMID: 35249625 DOI: 10.1016/j.jsr.2021.12.024] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/16/2021] [Revised: 07/12/2021] [Accepted: 12/20/2021] [Indexed: 06/14/2023]
Abstract
INTRODUCTION Ensuring occupational health and safety is an enormous concern for organizations, as accidents not only harm workers but also result in financial losses. Analysis of accident data has the potential to reveal insights that may improve capabilities to mitigate future accidents. However, because accident data are often transcribed textually, analyzing these narratives proves difficult. This study contributes to a recent stream of literature utilizing machine learning to automatically label accident narratives, converting them into more easily analyzable fields. METHOD First, a large dataset of accident narratives in which workers were injured is collected from the U.S. Occupational Safety and Health Administration (OSHA). Word embeddings-based text mining is implemented; compared to past works, this methodology offers excellent performance. Second, to improve the richness of analyses, each record is assessed across five dimensions. The machine learning models provide classifications of body part(s) injured, the source of the injury, the type of event causing the injury, whether a hospitalization occurred, and whether an amputation occurred. Finally, demonstrating generalizability, the trained models are deployed to analyze two additional datasets of accident narratives in the construction industry and the mining and metals industry (transfer learning). Practical Applications: These contributions improve organizations' capacities to rapidly analyze textual accident narratives.
Affiliation(s)
- David M Goldberg
  - San Diego State University, 5500 Campanile Drive, San Diego, CA 92182, United States.
5
Sienkiewicz K, Chen J, Chatrath A, Lawson JT, Sheffield NC, Zhang L, Ratan A. Detecting molecular subtypes from multi-omics datasets using SUMO. CELL REPORTS METHODS 2022; 2:100152. [PMID: 35211690 PMCID: PMC8865426 DOI: 10.1016/j.crmeth.2021.100152] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/11/2021] [Revised: 08/27/2021] [Accepted: 12/21/2021] [Indexed: 12/31/2022]
Abstract
We present a data integration framework that uses non-negative matrix factorization of patient-similarity networks to integrate continuous multi-omics datasets for molecular subtyping. It is demonstrated to have the capability to handle missing data without using imputation and to be consistently among the best in detecting subtypes with differential prognosis and enrichment of clinical associations in a large number of cancers. When applying the approach to data from individuals with lower-grade gliomas, we identify a subtype with a significantly worse prognosis. Tumors assigned to this subtype are hypomethylated genome wide with a gain of AP-1 occupancy in demethylated distal enhancers. The tumors are also enriched for somatic chromosome 7 (chr7) gain, chr10 loss, and other molecular events that have been suggested as diagnostic markers for "IDH wild type, with molecular features of glioblastoma" by the cIMPACT-NOW consortium but have yet to be included in the World Health Organization (WHO) guidelines.
Affiliation(s)
- Karolina Sienkiewicz
  - Center for Public Health Genomics, University of Virginia, Charlottesville, VA 22908, USA
- Jinyu Chen
  - Department of Mathematics and Computational Biology Program, National University of Singapore, Singapore 119076, Singapore
- Ajay Chatrath
  - Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA 22908, USA
- John T. Lawson
  - Center for Public Health Genomics, University of Virginia, Charlottesville, VA 22908, USA
  - Department of Biomedical Engineering, University of Virginia, Charlottesville, VA 22908, USA
- Nathan C. Sheffield
  - Center for Public Health Genomics, University of Virginia, Charlottesville, VA 22908, USA
  - Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA 22908, USA
  - Department of Biomedical Engineering, University of Virginia, Charlottesville, VA 22908, USA
  - Department of Public Health Sciences, University of Virginia, Charlottesville, VA 22908, USA
  - University of Virginia Cancer Center, Charlottesville, VA 22908, USA
- Louxin Zhang
  - Department of Mathematics and Computational Biology Program, National University of Singapore, Singapore 119076, Singapore
- Aakrosh Ratan
  - Center for Public Health Genomics, University of Virginia, Charlottesville, VA 22908, USA
  - Department of Public Health Sciences, University of Virginia, Charlottesville, VA 22908, USA
  - University of Virginia Cancer Center, Charlottesville, VA 22908, USA
6
Nanda G, Vallmuur K, Lehto M. Improving autocoding performance of rare categories in injury classification: Is more training data or filtering the solution? ACCIDENT; ANALYSIS AND PREVENTION 2018; 110:115-127. [PMID: 29127808 DOI: 10.1016/j.aap.2017.10.020] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 02/06/2017] [Revised: 08/13/2017] [Accepted: 10/21/2017] [Indexed: 06/07/2023]
Abstract
INTRODUCTION Classical Machine Learning (ML) models have been found to assign external-cause-of-injury codes (E-codes) based on injury narratives with good overall accuracy, but they often struggle with rare categories, primarily due to a lack of sufficient training cases and the heavily skewed nature of injury data. In this paper, we have: a) studied the effect of increasing the size of training data on the prediction performance of three classical ML models: Multinomial Naïve Bayes (MNB), Support Vector Machine (SVM) and Logistic Regression (LR), and b) studied the effect of filtering based on the prediction strength of the LR model when the model is trained on very small (10,000 cases) and very large (450,000 cases) training sets. METHOD Data from the Queensland Injury Surveillance Unit for the years 2002-2012, categorized into 20 broad E-codes, were used for this study. Eleven randomly chosen training sets ranging in size from 10,000 to 450,000 cases were used to train the ML models, and the prediction performance was analyzed on a prediction set of 50,150 cases. The filtering approach was tested on LR models trained on the smallest and largest training datasets. Sensitivity was used as the performance measure for individual categories. Weighted average sensitivity (WAvg) and unweighted average sensitivity (UAvg) were used as the measures of overall performance. The filtering approach was also tested for estimating category counts and was compared with the approaches of summing prediction probabilities and counting direct predictions by the ML model. RESULTS The overall performance of all three ML models improved with increasing training data size. The overall sensitivities with maximum training size for the LR and SVM models were similar (∼82%), and higher than for MNB (76%). For all the ML models, the sensitivities of rare categories improved with increasing training data but were considerably lower than the sensitivities of larger categories.
With increasing training data size, LR and SVM exhibited diminishing improvement in UAvg whereas the improvement was relatively steady in case of MNB. Filtering based on prediction strength of LR model (and manual review of filtered cases) helped in improving the sensitivities of rare categories. A sizeable portion of cases still needed to be filtered even when the LR model was trained on very large training set. For estimating category counts, filtering approach provided best estimates for most E-codes and summing prediction probabilities approach provided better estimates for rare categories. CONCLUSIONS Increasing the size of training data alone cannot solve the problem of poor classification performance on rare categories by ML models. Filtering could be an effective strategy to improve classification performance of rare categories when large training data is not available.
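The two overall measures named in this abstract can be illustrated with a short sketch. It assumes that WAvg weights each category's sensitivity by its number of true cases and that UAvg is the plain mean across categories; the function names and the toy labels are hypothetical.

```python
from collections import Counter


def category_sensitivities(true_codes, predicted_codes):
    """Sensitivity (recall) for each category that occurs in the true codes."""
    totals = Counter(true_codes)
    hits = Counter(t for t, p in zip(true_codes, predicted_codes) if t == p)
    return {code: hits[code] / totals[code] for code in totals}


def average_sensitivities(true_codes, predicted_codes):
    """Return (WAvg, UAvg) under the weighting assumptions stated above."""
    sens = category_sensitivities(true_codes, predicted_codes)
    totals = Counter(true_codes)
    # Weighted average: large categories dominate the result.
    wavg = sum(sens[c] * totals[c] for c in sens) / len(true_codes)
    # Unweighted average: every category counts equally, so weak rare
    # categories pull the figure down.
    uavg = sum(sens.values()) / len(sens)
    return wavg, uavg
```

With eight correctly coded falls and two drowning cases of which one is missed, WAvg is 0.9 while UAvg is 0.75, which is the kind of gap between overall and rare-category performance the abstract describes.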
Affiliation(s)
- Gaurav Nanda
  - School of Industrial Engineering, Purdue University, USA.
- Kirsten Vallmuur
  - Current: Australian Centre for Health Services Innovation, School of Public Health and Social Work, Queensland University of Technology, Australia; Formerly: Centre for Accident Research and Road Safety-Queensland, School of Psychology and Counselling, Queensland University of Technology, Australia
- Mark Lehto
  - School of Industrial Engineering, Purdue University, USA
7
Goh YM, Ubeynarayana CU. Construction accident narrative classification: An evaluation of text mining techniques. ACCIDENT; ANALYSIS AND PREVENTION 2017; 108:122-130. [PMID: 28865927 DOI: 10.1016/j.aap.2017.08.026] [Citation(s) in RCA: 35] [Impact Index Per Article: 5.0] [Received: 05/15/2017] [Revised: 07/24/2017] [Accepted: 08/26/2017] [Indexed: 06/07/2023]
Abstract
Learning from past accidents is fundamental to accident prevention. Thus, accident and near miss reporting are encouraged by organizations and regulators. However, for organizations managing large safety databases, the time taken to accurately classify accident and near miss narratives will be very significant. This study aims to evaluate the utility of various text mining classification techniques in classifying 1000 publicly available construction accident narratives obtained from the US OSHA website. The study evaluated six machine learning algorithms, including support vector machine (SVM), linear regression (LR), random forest (RF), k-nearest neighbor (KNN), decision tree (DT) and Naive Bayes (NB), and found that SVM produced the best performance in classifying the test set of 251 cases. Further experimentation with tokenization of the processed text and non-linear SVM were also conducted. In addition, a grid search was conducted on the hyperparameters of the SVM models. It was found that the best performing classifiers were linear SVM with unigram tokenization and radial basis function (RBF) SVM with uni-gram tokenization. In view of its relative simplicity, the linear SVM is recommended. Across the 11 labels of accident causes or types, the precision of the linear SVM ranged from 0.5 to 1, recall ranged from 0.36 to 0.9 and F1 score was between 0.45 and 0.92. The reasons for misclassification were discussed and suggestions on ways to improve the performance were provided.
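The abstract reports precision, recall and F1 separately for each of the 11 labels. The sketch below shows how those per-label figures are computed from true and predicted labels; the function name and the example labels are hypothetical, and this says nothing about how the study's classifiers themselves were built.

```python
def label_scores(true_labels, predicted_labels, label):
    """Return (precision, recall, F1) for one label, treated as one-vs-rest."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly assigned
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly assigned
    fn = sum(t == label and p != label for t, p in pairs)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Reporting these per label, rather than one overall accuracy, is what exposes the wide ranges quoted in the abstract (for example, recall from 0.36 to 0.9).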
Affiliation(s)
- Yang Miang Goh
  - Safety and Resilience Research Unit (SaRRU), Dept. of Building, School of Design and Environment, National Univ. of Singapore, 4 Architecture Dr., 117566, Singapore.
- C U Ubeynarayana
  - Safety and Resilience Research Unit (SaRRU), Dept. of Building, School of Design and Environment, National Univ. of Singapore, 4 Architecture Dr., 117566, Singapore
8
Marucci-Wellman HR, Corns HL, Lehto MR. Classifying injury narratives of large administrative databases for surveillance-A practical approach combining machine learning ensembles and human review. ACCIDENT; ANALYSIS AND PREVENTION 2017; 98:359-371. [PMID: 27863339 DOI: 10.1016/j.aap.2016.10.014] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 07/13/2016] [Revised: 10/07/2016] [Accepted: 10/10/2016] [Indexed: 06/06/2023]
Abstract
Injury narratives are now available real time and include useful information for injury surveillance and prevention. However, manual classification of the cause or events leading to injury found in large batches of narratives, such as workers compensation claims databases, can be prohibitive. In this study we compare the utility of four machine learning algorithms (Naïve Bayes, Single word and Bi-gram models, Support Vector Machine and Logistic Regression) for classifying narratives into Bureau of Labor Statistics Occupational Injury and Illness event leading to injury classifications for a large workers compensation database. These algorithms are known to do well classifying narrative text and are fairly easy to implement with off-the-shelf software packages such as Python. We propose human-machine learning ensemble approaches which maximize the power and accuracy of the algorithms for machine-assigned codes and allow for strategic filtering of rare, emerging or ambiguous narratives for manual review. We compare human-machine approaches based on filtering on the prediction strength of the classifier vs. agreement between algorithms. Regularized Logistic Regression (LR) was the best performing algorithm alone. Using this algorithm and filtering out the bottom 30% of predictions for manual review resulted in high accuracy (overall sensitivity/positive predictive value of 0.89) of the final machine-human coded dataset. The best pairings of algorithms included Naïve Bayes with Support Vector Machine whereby the triple ensemble NBSW=NBBI-GRAM=SVM had very high performance (0.93 overall sensitivity/positive predictive value and high accuracy (i.e. high sensitivity and positive predictive values)) across both large and small categories leaving 41% of the narratives for manual review. Integrating LR into this ensemble mix improved performance only slightly. 
For large administrative datasets we propose incorporation of methods based on human-machine pairings such as we have done here, utilizing readily-available off-the-shelf machine learning techniques and resulting in only a fraction of narratives that require manual review. Human-machine ensemble methods are likely to improve performance over total manual coding.
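The two filtering strategies compared in this abstract, agreement between algorithms and prediction strength, can be sketched as simple routing rules. This is an illustrative outline only: the function names, the data layout and the example codes are hypothetical, and the classifiers that would produce the predictions are not shown.

```python
def split_by_agreement(predictions):
    """Machine-code narratives on which all classifiers agree; flag the rest.

    predictions: one tuple of predicted codes per narrative, one code per classifier.
    Returns (auto_coded, manual_review): a dict of index -> code and a list of indices.
    """
    auto_coded, manual_review = {}, []
    for index, codes in enumerate(predictions):
        if len(set(codes)) == 1:
            auto_coded[index] = codes[0]
        else:
            manual_review.append(index)
    return auto_coded, manual_review


def split_by_strength(scored, review_fraction=0.30):
    """Send the weakest predictions of a single classifier to manual review.

    scored: one (code, prediction_strength) pair per narrative.
    """
    order = sorted(range(len(scored)), key=lambda i: scored[i][1])
    n_review = round(len(scored) * review_fraction)
    manual_review = sorted(order[:n_review])
    auto_coded = {i: scored[i][0] for i in order[n_review:]}
    return auto_coded, manual_review
```

The default `review_fraction` of 0.30 mirrors the "bottom 30% of predictions" mentioned in the abstract; with the agreement rule, the share sent to review is not set in advance but depends on how often the classifiers disagree.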
Affiliation(s)
- Helen R Marucci-Wellman
  - Center for Injury Epidemiology, Liberty Mutual Research Institute for Safety, 71 Frankland Road, Hopkinton, MA 01748, USA.
- Helen L Corns
  - Center for Injury Epidemiology, Liberty Mutual Research Institute for Safety, 71 Frankland Road, Hopkinton, MA 01748, USA
- Mark R Lehto
  - School of Industrial Engineering, Purdue University, 1287 Grissom Hall, West Lafayette, IN 47907, USA
9
Dinh MM, Russell SB, Bein KJ, Vallmuur K, Muscatello D, Chalkley D, Ivers R. Age-related trends in injury and injury severity presenting to emergency departments in New South Wales Australia: Implications for major injury surveillance and trauma systems. Injury 2017; 48:171-176. [PMID: 27542554 DOI: 10.1016/j.injury.2016.08.005] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 05/27/2016] [Revised: 08/06/2016] [Accepted: 08/11/2016] [Indexed: 02/02/2023]
Abstract
OBJECTIVES To describe population based trends and clinical characteristics of injury related presentations to Emergency Departments (EDs). DESIGN AND SETTING A retrospective, descriptive analysis of de-identified linked ED data across New South Wales, Australia over five calendar years, from 2010 to 2014. PARTICIPANTS Patients were included in this analysis if they presented to an Emergency Department and had an injury related diagnosis. Injury severity was categorised into critical (triage category 1-2 and admitted to ICU or operating theatre, or died in ED), serious (admitted as an in-patient, excluding above critical injuries) and minor injuries (discharged from ED). MAIN OUTCOME MEASURES The outcomes of interest were rates of injury related presentations to EDs by age groups and injury severity. RESULTS A total of 2.09 million injury related ED presentations were analysed. Minor injuries comprised 85.0%, and 14.1% and 1.0% were serious and critical injuries respectively. There was a 15.8% per annum increase in the rate of critical injuries per 1000 population in those 80 years and over, with the most common diagnosis being head injuries. Around 40% of those with critical injuries presented directly to a major trauma centre. CONCLUSION Critical injuries in the elderly have risen dramatically in recent years. A minority of critical injuries present directly to major trauma centres. Trauma service provision models need revision to ensure appropriate patient care. Injury surveillance is needed to understand the external causes of injury presenting to hospital.
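The three severity categories defined in this abstract amount to a small decision rule, sketched below. The wording of the critical category is ambiguous, so this sketch assumes it means either (triage category 1-2 and admitted to ICU or operating theatre) or died in ED. The disposition strings are hypothetical labels, not field values from the linked ED data.

```python
# Hypothetical disposition labels chosen for this illustration.
CRITICAL_DESTINATIONS = {"icu", "operating_theatre"}


def injury_severity(triage_category, disposition):
    """Classify one ED presentation as critical, serious or minor."""
    if disposition == "died_in_ed":
        return "critical"
    if disposition in CRITICAL_DESTINATIONS and triage_category in (1, 2):
        return "critical"
    # Any other in-patient admission counts as serious (assumed reading of
    # "admitted as an in-patient, excluding above critical injuries").
    if disposition in CRITICAL_DESTINATIONS or disposition == "admitted":
        return "serious"
    if disposition == "discharged":
        return "minor"
    return "unclassified"  # outcomes the abstract does not cover
```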
Affiliation(s)
- Michael M Dinh
  - Royal Prince Alfred Hospital, Australia; Discipline of Emergency Medicine, The University of Sydney, Australia.
- David Muscatello
  - School of Public Health and Community Medicine, University of New South Wales, Australia
- Rebecca Ivers
  - The George Institute for Global Health, The University of Sydney, Australia; School of Nursing and Midwifery, Flinders University, Australia
10
Nanda G, Grattan KM, Chu MT, Davis LK, Lehto MR. Bayesian decision support for coding occupational injury data. JOURNAL OF SAFETY RESEARCH 2016; 57:71-82. [PMID: 27178082 DOI: 10.1016/j.jsr.2016.03.001] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 07/01/2015] [Revised: 12/10/2015] [Accepted: 03/02/2016] [Indexed: 06/05/2023]
Abstract
INTRODUCTION Studies on autocoding injury data have found that machine learning algorithms perform well for categories that occur frequently but often struggle with rare categories. Therefore, manual coding, although resource-intensive, cannot be eliminated. We propose a Bayesian decision support system to autocode a large portion of the data, filter cases for manual review, and assist human coders by presenting them top k prediction choices and a confusion matrix of predictions from Bayesian models. METHOD We studied the prediction performance of Single-Word (SW) and Two-Word-Sequence (TW) Naïve Bayes models on a sample of data from the 2011 Survey of Occupational Injury and Illness (SOII). We used the agreement in prediction results of SW and TW models, and various prediction strength thresholds for autocoding and filtering cases for manual review. We also studied the sensitivity of the top k predictions of the SW model, TW model, and SW-TW combination, and then compared the accuracy of the manually assigned codes to SOII data with that of the proposed system. RESULTS The accuracy of the proposed system, assuming well-trained coders reviewing a subset of only 26% of cases flagged for review, was estimated to be comparable (86.5%) to the accuracy of the original coding of the data set (range: 73%-86.8%). Overall, the TW model had higher sensitivity than the SW model, and the accuracy of the prediction results increased when the two models agreed, and for higher prediction strength thresholds. The sensitivity of the top five predictions was 93%. CONCLUSIONS The proposed system seems promising for coding injury data as it offers comparable accuracy and less manual coding. PRACTICAL APPLICATIONS Accurate and timely coded occupational injury data is useful for surveillance as well as prevention activities that aim to make workplaces safer.
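To make the "top k prediction choices" idea concrete, here is a minimal single-word Naïve Bayes sketch that ranks candidate codes for a narrative. It is a generic multinomial Naïve Bayes with add-one smoothing written for illustration; the class name, the whitespace tokenization and the toy narratives are assumptions, and it is not the system evaluated in the study.

```python
import math
from collections import Counter, defaultdict


class SingleWordNB:
    """Tiny multinomial Naive Bayes over single words (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # additive smoothing constant
        self.class_counts = Counter()           # narratives seen per code
        self.word_counts = defaultdict(Counter) # word frequencies per code
        self.vocab = set()

    def fit(self, narratives, codes):
        for text, code in zip(narratives, codes):
            words = text.lower().split()
            self.class_counts[code] += 1
            self.word_counts[code].update(words)
            self.vocab.update(words)
        return self

    def top_k(self, text, k=3):
        """Return the k most probable (code, posterior) pairs, best first."""
        words = [w for w in text.lower().split() if w in self.vocab]
        total = sum(self.class_counts.values())
        log_scores = {}
        for code, n_docs in self.class_counts.items():
            denom = sum(self.word_counts[code].values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total)
            for w in words:
                score += math.log((self.word_counts[code][w] + self.alpha) / denom)
            log_scores[code] = score
        # Convert log scores to normalized posteriors in a numerically safe way.
        peak = max(log_scores.values())
        weights = {c: math.exp(s - peak) for c, s in log_scores.items()}
        norm = sum(weights.values())
        ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
        return [(code, w / norm) for code, w in ranked[:k]]
```

A human coder could be shown the output of `top_k` as a shortlist, and the leading posterior could serve as the prediction strength against which a review threshold is applied.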
Affiliation(s)
- Gaurav Nanda
  - School of Industrial Engineering, Purdue University, 315 N. Grant Street, West Lafayette, IN 47907-2023, USA.
- Kathleen M Grattan
  - Massachusetts Department of Public Health, 250 Washington Street, 4th Floor, Boston, MA 02108, USA.
- MyDzung T Chu
  - Massachusetts Department of Public Health, 250 Washington Street, 4th Floor, Boston, MA 02108, USA
- Letitia K Davis
  - Massachusetts Department of Public Health, 250 Washington Street, 4th Floor, Boston, MA 02108, USA
- Mark R Lehto
  - School of Industrial Engineering, Purdue University, 315 N. Grant Street, West Lafayette, IN 47907-2023, USA.
11
Ning W, Yu M, Zhang R. A hierarchical method to automatically encode Chinese diagnoses through semantic similarity estimation. BMC Med Inform Decis Mak 2016; 16:30. [PMID: 26940992 PMCID: PMC4778321 DOI: 10.1186/s12911-016-0269-4] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.0] [Received: 08/25/2015] [Accepted: 02/26/2016] [Indexed: 12/31/2022] Open
Abstract
Background The accumulation of medical documents in China has rapidly increased in the past years. We focus on developing a method that automatically performs ICD-10 code assignment to Chinese diagnoses from electronic medical records to support the medical coding process in Chinese hospitals. Methods We propose two encoding methods: one that directly determines the desired code (flat method), and one that hierarchically determines the most suitable code until the desired code is obtained (hierarchical method). Both methods are based on instances from the standard diagnostic library, a gold standard dataset in China. For the first time, semantic similarity estimation between Chinese words is applied in the biomedical domain, with the successful implementation of knowledge-based and distributional approaches. Characteristics of the Chinese language are considered in implementing distributional semantics. We test our methods against 16,330 coding instances from our partner hospital. Results The hierarchical method outperforms the flat method in terms of accuracy and time complexity. Representing distributional semantics using Chinese characters can achieve comparable performance to the use of Chinese words. The diagnoses in the test set can be encoded automatically with micro-averaged precision of 92.57%, recall of 89.63%, and F-score of 91.08%. A sharp decrease in encoding performance is observed without semantic similarity estimation. Conclusion The hierarchical nature of ICD-10 codes can enhance the performance of automated code assignment. Semantic similarity estimation is demonstrated to be indispensable in dealing with Chinese medical text. The proposed method can greatly reduce the workload and improve the efficiency of the code assignment process in Chinese hospitals.
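The hierarchical idea, choosing the best-matching code at each level and then descending into its children, can be sketched as follows. Several things here are stand-ins chosen for illustration: token-overlap (Jaccard) similarity replaces the paper's semantic similarity estimation, the text is English rather than Chinese, and the small code tree with abbreviated descriptions is a toy example, not the standard diagnostic library.

```python
def jaccard(text_a, text_b):
    """Token-overlap similarity; a stand-in for semantic similarity estimation."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


# Toy hierarchy: code -> (description, children). Illustrative only.
TOY_TREE = {
    "S72": ("fracture of femur", {
        "S72.0": ("fracture of neck of femur", {}),
        "S72.3": ("fracture of shaft of femur", {}),
    }),
    "S82": ("fracture of lower leg", {
        "S82.0": ("fracture of patella", {}),
    }),
}


def encode_hierarchical(diagnosis, tree=TOY_TREE, similarity=jaccard):
    """Walk down the tree, keeping the most similar code at each level."""
    code, level = None, tree
    while level:
        code, (_, children) = max(
            level.items(), key=lambda item: similarity(diagnosis, item[1][0])
        )
        level = children
    return code
```

Because each step only compares the diagnosis with the children of the chosen node, far fewer comparisons are needed than in a flat search over every code. The trade-off is that a wrong choice at a higher level cannot be corrected further down.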
Affiliation(s)
- Wenxin Ning
  - Health Care Services Research Center, Department of Industrial Engineering, Tsinghua University, Beijing, 100084, PR China.
- Ming Yu
  - Health Care Services Research Center, Department of Industrial Engineering, Tsinghua University, Beijing, 100084, PR China.
- Runtong Zhang
  - Department of Information Management, School of Economics and Management, Beijing Jiaotong University, Beijing, 100084, PR China.
12
Vallmuur K, Marucci-Wellman HR, Taylor JA, Lehto M, Corns HL, Smith GS. Harnessing information from injury narratives in the 'big data' era: understanding and applying machine learning for injury surveillance. Inj Prev 2016; 22 Suppl 1:i34-42. [PMID: 26728004 DOI: 10.1136/injuryprev-2015-041813] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.1] [Received: 08/31/2015] [Accepted: 12/08/2015] [Indexed: 11/03/2022]
Abstract
OBJECTIVE Vast amounts of injury narratives are collected daily and are available electronically in real time and have great potential for use in injury surveillance and evaluation. Machine learning algorithms have been developed to assist in identifying cases and classifying mechanisms leading to injury in a much timelier manner than is possible when relying on manual coding of narratives. The aim of this paper is to describe the background, growth, value, challenges and future directions of machine learning as applied to injury surveillance. METHODS This paper reviews key aspects of machine learning using injury narratives, providing a case study to demonstrate an application to an established human-machine learning approach. RESULTS The range of applications and utility of narrative text has increased greatly with advancements in computing techniques over time. Practical and feasible methods exist for semiautomatic classification of injury narratives which are accurate, efficient and meaningful. The human-machine learning approach described in the case study achieved high sensitivity and PPV and reduced the need for human coding to less than a third of cases in one large occupational injury database. CONCLUSIONS The last 20 years have seen a dramatic change in the potential for technological advancements in injury surveillance. Machine learning of 'big injury narrative data' opens up many possibilities for expanded sources of data which can provide more comprehensive, ongoing and timely surveillance to inform future injury prevention policy and practice.
Affiliation(s)
- Kirsten Vallmuur
  - Queensland University of Technology, Centre for Accident Research and Road Safety-Queensland, Brisbane, Queensland, Australia
- Helen R Marucci-Wellman
  - Center for Injury Epidemiology, Liberty Mutual Research Institute for Safety, Hopkinton, Massachusetts, USA
- Jennifer A Taylor
  - Department of Environmental & Occupational Health, School of Public Health, Drexel University, Philadelphia, Pennsylvania, USA
- Mark Lehto
  - School of Industrial Engineering, Purdue University, West Lafayette, Indiana, USA
- Helen L Corns
  - Center for Injury Epidemiology, Liberty Mutual Research Institute for Safety, Hopkinton, Massachusetts, USA
- Gordon S Smith
  - National Center for Trauma and EMS, University of Maryland School of Medicine, Baltimore, Maryland, USA