1
Jiang S, Wang T, Zhang KH. Data-driven decision-making for precision diagnosis of digestive diseases. Biomed Eng Online 2023; 22:87. [PMID: 37658345 PMCID: PMC10472739 DOI: 10.1186/s12938-023-01148-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/14/2023] [Accepted: 08/15/2023] [Indexed: 09/03/2023]
Abstract
Modern omics technologies can generate massive amounts of biomedical data, providing unprecedented opportunities for individualized precision medicine. However, traditional statistical methods cannot effectively process and utilize such big data. To meet this challenge, machine learning algorithms, which can reduce dimensionality, extract features, organize data, and form automatable data-driven clinical decision systems, have been developed and applied rapidly in recent years. Data-driven clinical decision-making has promising applications in precision medicine and has been studied in digestive diseases, including early diagnosis and screening, molecular typing, staging and stratification of digestive malignancies, as well as precise diagnosis of Crohn's disease, auxiliary diagnosis in imaging and endoscopy, differential diagnosis of cystic lesions, etiology discrimination of acute abdominal pain, stratification of upper gastrointestinal bleeding (UGIB), and real-time diagnosis of esophageal motility function. Herein, after a brief introduction to methods for data-driven decision-making, we review recent progress in data-driven clinical decision-making for the precision diagnosis of digestive diseases and discuss its limitations.
Affiliation(s)
- Song Jiang
  - Department of Gastroenterology, The First Affiliated Hospital of Nanchang University, No. 17, Yongwai Zheng Street, Nanchang, 330006 China
  - Jiangxi Institute of Gastroenterology and Hepatology, Nanchang, 330006 China
- Ting Wang
  - Department of Gastroenterology, The First Affiliated Hospital of Nanchang University, No. 17, Yongwai Zheng Street, Nanchang, 330006 China
  - Jiangxi Institute of Gastroenterology and Hepatology, Nanchang, 330006 China
- Kun-He Zhang
  - Department of Gastroenterology, The First Affiliated Hospital of Nanchang University, No. 17, Yongwai Zheng Street, Nanchang, 330006 China
  - Jiangxi Institute of Gastroenterology and Hepatology, Nanchang, 330006 China
2
Yamanouchi Y, Nakamura T, Ikeda T, Usuku K. An Alternative Application of Natural Language Processing to Express a Characteristic Feature of Diseases in Japanese Medical Records. Methods Inf Med 2023; 62:110-118. [PMID: 36809794 PMCID: PMC10462427 DOI: 10.1055/a-2039-3773] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/17/2021] [Accepted: 04/13/2022] [Indexed: 02/23/2023]
Abstract
BACKGROUND Owing to the linguistic situation, Japanese natural language processing (NLP) requires morphological analyses for word segmentation using dictionary techniques. OBJECTIVE We aimed to clarify whether it can be substituted with an open-end discovery-based NLP (OD-NLP), which does not use any dictionary techniques. METHODS Clinical texts at the first medical visit were collected for comparison of OD-NLP with word dictionary-based NLP (WD-NLP). Topics were generated in each document using a topic model and were later matched to the respective diseases as defined in the International Statistical Classification of Diseases and Related Health Problems, 10th Revision. The prediction accuracy and expressivity of each disease were examined in an equivalent number of entities/words after filtration with either term frequency and inverse document frequency (TF-IDF) or dominance value (DMV). RESULTS In documents from 10,520 observed patients, 169,913 entities and 44,758 words were segmented using OD-NLP and WD-NLP, respectively. Without filtering, accuracy and recall levels were low, and there was no difference in the F-measure (harmonic mean) between the two NLPs. However, physicians reported that OD-NLP contained more meaningful words than WD-NLP. When datasets were created with an equivalent number of entities/words using TF-IDF, the F-measure in OD-NLP was higher than in WD-NLP at lower thresholds. When the threshold increased, the number of datasets created decreased, resulting in increased values of F-measure, although the differences disappeared. Two datasets near the maximum threshold showing differences in F-measure were examined to determine whether their topics were associated with diseases. The results showed that more diseases were found in OD-NLP at lower thresholds, indicating that the topics described characteristics of diseases. The superiority remained as much as that of TF-IDF when filtration was changed to DMV. CONCLUSION The current findings favor the use of OD-NLP to express characteristics of diseases from Japanese clinical texts and may help in the construction of document summaries and retrieval in clinical settings.
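As an illustration of the TF-IDF filtration step mentioned in the abstract above, the sketch below keeps only the terms in each document whose TF-IDF score meets a threshold. It uses a common textbook variant (relative term frequency times the natural log of inverse document frequency); the abstract does not state which variant or threshold the authors used, and the function name `tfidf_filter` and the toy documents are assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_filter(docs, threshold):
    """Keep, per document, only the terms whose TF-IDF score is >= threshold.

    docs is a list of token lists. This is a textbook TF-IDF variant,
    not necessarily the one used in the cited study.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    filtered = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        kept = []
        for term in doc:
            tf = counts[term] / total          # relative term frequency
            idf = math.log(n_docs / df[term])  # 0 for terms in every document
            if tf * idf >= threshold:
                kept.append(term)
        filtered.append(kept)
    return filtered


# Hypothetical, already-segmented clinical notes.
docs = [
    ["fever", "cough", "fever", "visit"],
    ["headache", "nausea", "visit"],
    ["cough", "sputum", "visit"],
]
filtered = tfidf_filter(docs, threshold=0.05)
print(filtered)
```

In this toy corpus the term "visit" appears in every document, so its inverse document frequency is zero and it is filtered out at any positive threshold, while disease-specific terms survive.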
Affiliation(s)
- Yoshinori Yamanouchi
  - Department of Medical Information Science, Graduate School of Medical Sciences, Kumamoto University, Kumamoto, Japan
- Taishi Nakamura
  - Department of Medical Information Science, Graduate School of Medical Sciences, Kumamoto University, Kumamoto, Japan
- Tokunori Ikeda
  - Department of Pharmaceutical Sciences, Faculty of Pharmaceutical Sciences, Sojo University, Nishi-ku, Kumamoto, Japan
- Koichiro Usuku
  - Department of Medical Information Science, Graduate School of Medical Sciences, Kumamoto University, Kumamoto, Japan
3
Ji W, Xue M, Zhang Y, Yao H, Wang Y. A Machine Learning Based Framework to Identify and Classify Non-alcoholic Fatty Liver Disease in a Large-Scale Population. Front Public Health 2022; 10:846118. [PMID: 35444985 PMCID: PMC9013842 DOI: 10.3389/fpubh.2022.846118] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/30/2021] [Accepted: 02/23/2022] [Indexed: 12/12/2022]
Abstract
Non-alcoholic fatty liver disease (NAFLD) is a common and serious health problem worldwide that lacks efficient medical treatment. We aimed to develop and validate machine learning (ML) models that could be used for accurate screening of large populations. This study included 304,145 adults who took part in the national physical examination and used their questionnaire and physical measurement parameters as candidate covariates for the models. The least absolute shrinkage and selection operator (LASSO) was used to select features from the candidate covariates, four ML algorithms were then used to build screening models for NAFLD, and the best-performing classifier was used to output an importance score for each covariate. Among the four ML algorithms, XGBoost performed best (accuracy = 0.880, precision = 0.801, recall = 0.894, F-1 = 0.882, and AUC = 0.951), and the covariates ranked by importance were BMI, age, waist circumference, gender, type 2 diabetes, gallbladder disease, smoking, hypertension, dietary status, physical activity, oil-loving and salt-loving. ML classifiers could help medical agencies achieve early identification and classification of NAFLD, which is particularly useful in economically underdeveloped areas, and the covariate importance scores will be helpful for the prevention and treatment of NAFLD.
Affiliation(s)
- Weidong Ji
  - Department of Medical Information, Zhongshan School of Medicine, Sun Yat-sen University, Guangzhou, China
- Mingyue Xue
  - Hospital of Traditional Chinese Medicine Affiliated to the Fourth Clinical Medical College of Xinjiang Medical University, Urumqi, China
- Yushan Zhang
  - Department of Maternal and Child Health, School of Public Health, Sun Yat-sen University, Guangzhou, China
- Hua Yao
  - Center of Health Management, The First Affiliated Hospital of Xinjiang Medical University, Urumqi, China
- Yushan Wang
  - Center of Health Management, The First Affiliated Hospital of Xinjiang Medical University, Urumqi, China
  - *Correspondence: Yushan Wang
4
Nwanosike EM, Conway BR, Merchant HA, Hasan SS. Potential applications and performance of machine learning techniques and algorithms in clinical practice: A systematic review. Int J Med Inform 2021; 159:104679. [PMID: 34990939 DOI: 10.1016/j.ijmedinf.2021.104679] [Citation(s) in RCA: 34] [Impact Index Per Article: 11.3] [Received: 05/29/2021] [Revised: 12/08/2021] [Accepted: 12/27/2021] [Indexed: 12/11/2022]
Abstract
PURPOSE The advent of clinically adapted machine learning algorithms can solve numerous problems ranging from disease diagnosis and prognosis to therapy recommendations. This systematic review examines the performance of machine learning (ML) algorithms and evaluates the progress made to date towards their implementation in clinical practice. METHODS We systematically searched databases (PubMed, MEDLINE, Scopus, Google Scholar, Cochrane Library and WHO Covid-19 database) to identify original articles published between January 2011 and October 2021. Studies reporting ML techniques in clinical practice involving humans and ML algorithms with a performance metric were considered. RESULTS Of 873 unique articles identified, 36 studies were eligible for inclusion. The XGBoost (extreme gradient boosting) algorithm showed the highest potential for clinical applications (n = 7 studies); this was followed jointly by the random forest algorithm, logistic regression, and the support vector machine (n = 5 studies each). Prediction of outcomes (n = 33), in particular for inflammatory diseases (n = 7), received the most attention, followed by cancer and neuropsychiatric disorders (n = 5 for each) and Covid-19 (n = 4). Thirty-three of the thirty-six included studies passed more than 50% of the selected quality assessment criteria in the TRIPOD checklist. In contrast, none of the studies achieved an ideal overall bias rating of 'low' based on the PROBAST checklist. Moreover, only three studies showed evidence of the deployment of ML algorithm(s) in clinical practice. CONCLUSIONS ML is potentially a reliable tool for clinical decision support. Although advocated widely in clinical practice, work is still in progress to validate clinically adapted ML algorithms. Improving quality standards, transparency, and interpretability of ML models will further lower the barriers to acceptability.
Affiliation(s)
- Ezekwesiri Michael Nwanosike
  - Department of Pharmacy, School of Applied Sciences, University of Huddersfield, Queensgate Huddersfield HD1 3DH, West Yorkshire, United Kingdom
- Barbara R Conway
  - Department of Pharmacy, School of Applied Sciences, University of Huddersfield, Queensgate Huddersfield HD1 3DH, West Yorkshire, United Kingdom
- Hamid A Merchant
  - Department of Pharmacy, School of Applied Sciences, University of Huddersfield, Queensgate Huddersfield HD1 3DH, West Yorkshire, United Kingdom
- Syed Shahzad Hasan
  - Department of Pharmacy, School of Applied Sciences, University of Huddersfield, Queensgate Huddersfield HD1 3DH, West Yorkshire, United Kingdom
  - School of Biomedical Sciences & Pharmacy, University of Newcastle, Callaghan, Australia
5
Mijwil MM. Skin cancer disease images classification using deep learning solutions. MULTIMEDIA TOOLS AND APPLICATIONS 2021. [DOI: 10.1007/s11042-021-10952-7] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Received: 08/16/2020] [Revised: 11/04/2020] [Accepted: 04/14/2021] [Indexed: 08/30/2023]
6
Automated machine learning: Review of the state-of-the-art and opportunities for healthcare. Artif Intell Med 2020; 104:101822. [DOI: 10.1016/j.artmed.2020.101822] [Citation(s) in RCA: 197] [Impact Index Per Article: 49.3] [Received: 10/08/2019] [Revised: 01/17/2020] [Accepted: 02/17/2020] [Indexed: 12/13/2022]
7
Classification and prediction of diabetes disease using machine learning paradigm. Health Inf Sci Syst 2020; 8:7. [PMID: 31949894 DOI: 10.1007/s13755-019-0095-z] [Citation(s) in RCA: 45] [Impact Index Per Article: 11.3] [Received: 08/21/2019] [Accepted: 12/21/2019] [Indexed: 12/19/2022]
Abstract
Background and objectives Diabetes is a chronic disease characterized by high blood sugar. It may cause many complications, such as stroke, kidney failure, and heart attack. About 422 million people worldwide were affected by diabetes in 2014, and the figure is projected to reach 642 million by 2040. The main objective of this study is to develop a machine learning (ML)-based system for predicting diabetic patients. Materials and methods Logistic regression (LR) is used to identify the risk factors for diabetes based on p value and odds ratio (OR). We adopted four classifiers, naïve Bayes (NB), decision tree (DT), Adaboost (AB), and random forest (RF), to predict diabetic patients. Three types of partition protocols (K2, K5, and K10) were also adopted, and each protocol was repeated for 20 trials. The performance of these classifiers is evaluated using accuracy (ACC) and area under the curve (AUC). Results We used a diabetes dataset derived from the National Health and Nutrition Examination Survey conducted in 2009-2012. The dataset consists of 6561 respondents, with 657 diabetic patients and 5904 controls. The LR model demonstrates that 7 of 14 factors (age, education, BMI, systolic BP, diastolic BP, direct cholesterol, and total cholesterol) are risk factors for diabetes. The overall ACC of the ML-based system is 90.62%. The combination of LR-based feature selection and an RF-based classifier gives 94.25% ACC and 0.95 AUC for the K10 protocol. Conclusion The combination of LR-based feature selection and the RF-based classifier performs better than the other combinations. This combination will be very helpful for predicting diabetic patients.
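The partition protocols in the abstract above can be sketched as repeated k-fold cross-validation. This is a minimal sketch under the assumption that K2, K5, and K10 denote 2-, 5-, and 10-fold cross-validation, which the abstract does not spell out; the function names and the use of the trial number as a shuffle seed are choices made for this example, not the authors' procedure.

```python
import random


def kfold_indices(n_samples, k, seed):
    """Yield (train, test) index lists for one shuffled k-fold partition."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def repeated_kfold(n_samples, k, n_trials):
    """Repeat the k-fold partition n_trials times with different shuffles."""
    for trial in range(n_trials):
        yield from kfold_indices(n_samples, k, seed=trial)


n_samples = 6561  # respondents reported in the abstract
splits = {k: list(repeated_kfold(n_samples, k, n_trials=20)) for k in (2, 5, 10)}
for k, parts in splits.items():
    print(k, len(parts))  # k folds x 20 trials
```

Each classifier would then be trained on `train` and scored on `test` for every split, and the accuracy and AUC averaged over all splits of a protocol.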
8
Xue M, Su Y, Li C, Wang S, Yao H. Identification of Potential Type II Diabetes in a Large-Scale Chinese Population Using a Systematic Machine Learning Framework. J Diabetes Res 2020; 2020:6873891. [PMID: 33029536 PMCID: PMC7532405 DOI: 10.1155/2020/6873891] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 03/12/2020] [Revised: 08/01/2020] [Accepted: 09/02/2020] [Indexed: 12/19/2022]
Abstract
BACKGROUND An estimated 425 million people globally have diabetes, accounting for 12% of the world's health expenditures, and the number continues to grow, placing a huge burden on healthcare systems, especially in remote, underserved areas. METHODS A total of 584,168 adult subjects who participated in the national physical examination were enrolled in this study. The risk factors for type II diabetes mellitus (T2DM) were identified by p values and odds ratio, using logistic regression (LR) based on variables of physical measurement and a questionnaire. Combined with the risk factors selected by LR, we used a decision tree, a random forest, AdaBoost with a decision tree (AdaBoost), and an extreme gradient boosting decision tree (XGBoost) to identify individuals with T2DM, compared the performance of the four machine learning classifiers, and used the best-performing classifier to output variable importance scores for T2DM. RESULTS The results indicated that XGBoost had the best performance (accuracy = 0.906, precision = 0.910, recall = 0.902, F-1 = 0.906, and AUC = 0.968). The variable importance scores in XGBoost showed that BMI was the most important feature, followed by age, waist circumference, systolic pressure, ethnicity, smoking amount, fatty liver, hypertension, physical activity, drinking status, dietary ratio (meat to vegetables), drink amount, smoking status, and diet habit (oil loving). CONCLUSIONS We proposed an LR-XGBoost classifier that uses fourteen easily obtained, noninvasive patient variables as predictors to identify potential cases of T2DM. The classifier can accurately screen for diabetes risk in the early phase, and the variable importance scores give clues for preventing diabetes.
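The reported metrics can be cross-checked with the standard definition of the F-1 score as the harmonic mean of precision and recall. The short sketch below applies that definition to the precision and recall quoted in the abstract above; the helper name `f1_score` is an assumption made for this example.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (standard F-1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for XGBoost in the abstract above.
f1 = f1_score(0.910, 0.902)
print(round(f1, 3))  # 0.906, matching the reported F-1
```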
Affiliation(s)
- Mingyue Xue
  - Hospital of Traditional Chinese Medicine Affiliated to the Fourth Clinical Medical College of Xinjiang Medical University, Urumqi, China
  - College of Public Health, Xinjiang Medical University, Urumqi, China
- Yinxia Su
  - College of Public Health, Xinjiang Medical University, Urumqi, China
- Chen Li
  - The First Affiliated Hospital of Xinjiang Medical University, Urumqi, China
- Shuxia Wang
  - Center of Health Management, The First Affiliated Hospital, Xinjiang Medical University, Urumqi, China
- Hua Yao
  - Center of Health Management, The First Affiliated Hospital, Xinjiang Medical University, Urumqi, China
9
Wu DTY, Vennemeyer S, Brown K, Revalee J, Murdock P, Salomone S, France A, Clarke-Myers K, Hanke SP. Usability Testing of an Interactive Dashboard for Surgical Quality Improvement in a Large Congenital Heart Center. Appl Clin Inform 2019; 10:859-869. [PMID: 31724143 DOI: 10.1055/s-0039-1698466] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Indexed: 12/14/2022]
Abstract
BACKGROUND Interactive data visualization and dashboards can be an effective way to explore meaningful patterns in large clinical data sets and to inform quality improvement initiatives. However, these interactive dashboards may have usability issues that undermine their effectiveness. These usability issues can be attributed to mismatched mental models between the designers and the users. Unfortunately, very few evaluation studies in visual analytics have specifically examined such mismatches between these two groups. OBJECTIVES We aimed to evaluate the usability of an interactive surgical dashboard and to seek opportunities for improvement. We also aimed to provide empirical evidence to demonstrate the mismatched mental models between the designers and the users of the dashboard. METHODS An interactive dashboard was developed in a large congenital heart center. This dashboard provides real-time, interactive access to clinical outcomes data for the surgical program. A mixed-method, two-phase study was conducted to collect user feedback. A group of designers (N = 3) and a purposeful sample of users (N = 12) were recruited. The qualitative data were analyzed thematically. The dashboards were compared using the System Usability Scale (SUS) and qualitative data. RESULTS The participating users gave an average SUS score of 82.9 on the new dashboard and 63.5 on the existing dashboard (p = 0.006). The participants achieved high task accuracy when using the new dashboard. The qualitative analysis revealed three opportunities for improvement. The data analysis and triangulation provided empirical evidence to the mismatched mental models. CONCLUSION We conducted a mixed-method usability study on an interactive surgical dashboard and identified areas of improvements. Our study design can be an effective and efficient way to evaluate visual analytics systems in health care. 
We encourage researchers and practitioners to conduct user-centered evaluation and implement education plans to mitigate potential usability challenges and increase user satisfaction and adoption.
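For readers unfamiliar with the System Usability Scale used in the study above, the sketch below shows the standard SUS scoring rule: ten items rated 1 to 5, odd-numbered items contribute (rating - 1), even-numbered items contribute (5 - rating), and the sum is multiplied by 2.5 to give a score from 0 to 100. The questionnaire responses here are hypothetical and are not data from the study.

```python
def sus_score(responses):
    """Score one SUS questionnaire (ten ratings, each from 1 to 5)."""
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 item responses")
    total = 0
    for item, rating in enumerate(responses, start=1):
        if not 1 <= rating <= 5:
            raise ValueError("each rating must be between 1 and 5")
        # Odd items are positively worded, even items negatively worded.
        total += (rating - 1) if item % 2 == 1 else (5 - rating)
    return total * 2.5


# Hypothetical responses from two participants.
participants = [
    [5, 1, 5, 1, 5, 1, 5, 1, 5, 1],  # best possible answers
    [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],  # neutral on every item
]
scores = [sus_score(r) for r in participants]
mean_sus = sum(scores) / len(scores)
print(scores, mean_sus)  # [100.0, 50.0] 75.0
```

A study-level figure such as the averages reported in the abstract is the mean of these per-participant scores.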
Affiliation(s)
- Danny T Y Wu
  - Department of Biomedical Informatics, University of Cincinnati, Cincinnati, Ohio, United States
  - Department of Pediatrics, University of Cincinnati, Cincinnati, Ohio, United States
- Scott Vennemeyer
  - Department of Biomedical Informatics, University of Cincinnati, Cincinnati, Ohio, United States
- Kelly Brown
  - Heart Institute, Cincinnati Children's Hospital Medical Center, Cincinnati, Ohio, United States
- Jason Revalee
  - DAAP School of Design, University of Cincinnati, Cincinnati, Ohio, United States
- Paul Murdock
  - Department of Biomedical Informatics, University of Cincinnati, Cincinnati, Ohio, United States
- Sarah Salomone
  - Department of Biomedical Informatics, University of Cincinnati, Cincinnati, Ohio, United States
- Ashton France
  - Heart Institute, Cincinnati Children's Hospital Medical Center, Cincinnati, Ohio, United States
- Katherine Clarke-Myers
  - Heart Institute, Cincinnati Children's Hospital Medical Center, Cincinnati, Ohio, United States
- Samuel P Hanke
  - Department of Pediatrics, University of Cincinnati, Cincinnati, Ohio, United States
  - Heart Institute, Cincinnati Children's Hospital Medical Center, Cincinnati, Ohio, United States
10
Wang X, Williams C, Liu ZH, Croghan J. Big data management challenges in health research-a literature review. Brief Bioinform 2019; 20:156-167. [PMID: 28968677 DOI: 10.1093/bib/bbx086] [Citation(s) in RCA: 38] [Impact Index Per Article: 7.6] [Received: 03/27/2017] [Indexed: 12/12/2022]
Abstract
Big data management for information centralization (i.e. making data of interest findable) and integration (i.e. making related data connectable) in health research is a defining challenge in biomedical informatics. While essential to create a foundation for knowledge discovery, optimized solutions to deliver high-quality and easy-to-use information resources are not thoroughly explored. In this review, we identify the gaps between current data management approaches and the need for new capacity to manage big data generated in advanced health research. Focusing on these unmet needs and well-recognized problems, we introduce state-of-the-art concepts, approaches and technologies for data management from computing academia and industry to explore improvement solutions. We explain the potential and significance of these advances for biomedical informatics. In addition, we discuss specific issues that have a great impact on technical solutions for developing the next generation of digital products (tools and data) to facilitate the raw-data-to-knowledge process in health research.
Affiliation(s)
- Xiaoming Wang
  - National Institute of Infectious and Allergy Diseases, NIH, Rockville, Maryland, USA
- Carolyn Williams
  - National Institute of Infectious and Allergy Diseases, NIH, Rockville, Maryland, USA
- Joe Croghan
  - National Institute of Infectious and Allergy Diseases, NIH, Rockville, Maryland, USA
11
Zeng X, Luo G. Progressive sampling-based Bayesian optimization for efficient and automatic machine learning model selection. Health Inf Sci Syst 2017; 5:2. [PMID: 29038732 PMCID: PMC5617811 DOI: 10.1007/s13755-017-0023-z] [Citation(s) in RCA: 30] [Impact Index Per Article: 4.3] [Received: 04/02/2017] [Accepted: 09/20/2017] [Indexed: 12/11/2022]
Abstract
PURPOSE Machine learning is broadly used for clinical data analysis. Before training a model, a machine learning algorithm must be selected. Also, the values of one or more model parameters termed hyper-parameters must be set. Selecting algorithms and hyper-parameter values requires advanced machine learning knowledge and many labor-intensive manual iterations. To lower the bar to machine learning, miscellaneous automatic selection methods for algorithms and/or hyper-parameter values have been proposed. Existing automatic selection methods are inefficient on large data sets. This poses a challenge for using machine learning in the clinical big data era. METHODS To address the challenge, this paper presents progressive sampling-based Bayesian optimization, an efficient and automatic selection method for both algorithms and hyper-parameter values. RESULTS We report an implementation of the method. We show that compared to a state of the art automatic selection method, our method can significantly reduce search time, classification error rate, and standard deviation of error rate due to randomization. CONCLUSIONS This is major progress towards enabling fast turnaround in identifying high-quality solutions required by many machine learning-based clinical data analysis tasks.
Affiliation(s)
- Xueqiang Zeng
  - Computer Center, Nanchang University, 999 Xuefu Road, Nanchang, 330031 Jiangxi People’s Republic of China
- Gang Luo
  - Department of Biomedical Informatics and Medical Education, University of Washington, UW Medicine South Lake Union, 850 Republican Street, Building C, Box 358047, Seattle, WA 98109 USA
12
Luo G, Stone BL, Johnson MD, Tarczy-Hornoch P, Wilcox AB, Mooney SD, Sheng X, Haug PJ, Nkoy FL. Automating Construction of Machine Learning Models With Clinical Big Data: Proposal Rationale and Methods. JMIR Res Protoc 2017; 6:e175. [PMID: 28851678 PMCID: PMC5596298 DOI: 10.2196/resprot.7757] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Received: 03/26/2017] [Revised: 07/14/2017] [Accepted: 07/15/2017] [Indexed: 12/14/2022]
Abstract
Background To improve health outcomes and cut health care costs, we often need to conduct prediction/classification using large clinical datasets (aka, clinical big data), for example, to identify high-risk patients for preventive interventions. Machine learning has been proposed as a key technology for doing this. Machine learning has won most data science competitions and could support many clinical activities, yet only 15% of hospitals use it for even limited purposes. Despite familiarity with data, health care researchers often lack machine learning expertise to directly use clinical big data, creating a hurdle in realizing value from their data. Health care researchers can work with data scientists with deep machine learning knowledge, but it takes time and effort for both parties to communicate effectively. Facing a shortage in the United States of data scientists and hiring competition from companies with deep pockets, health care systems have difficulty recruiting data scientists. Building and generalizing a machine learning model often requires hundreds to thousands of manual iterations by data scientists to select the following: (1) hyper-parameter values and complex algorithms that greatly affect model accuracy and (2) operators and periods for temporally aggregating clinical attributes (eg, whether a patient’s weight kept rising in the past year). This process becomes infeasible with limited budgets. Objective This study’s goal is to enable health care researchers to directly use clinical big data, make machine learning feasible with limited budgets and data scientist resources, and realize value from data. 
Methods This study will allow us to achieve the following: (1) finish developing the new software, Automated Machine Learning (Auto-ML), to automate model selection for machine learning with clinical big data and validate Auto-ML on seven benchmark modeling problems of clinical importance; (2) apply Auto-ML and novel methodology to two new modeling problems crucial for care management allocation and pilot one model with care managers; and (3) perform simulations to estimate the impact of adopting Auto-ML on US patient outcomes. Results We are currently writing Auto-ML’s design document. We intend to finish our study by around the year 2022. Conclusions Auto-ML will generalize to various clinical prediction/classification problems. With minimal help from data scientists, health care researchers can use Auto-ML to quickly build high-quality models. This will boost wider use of machine learning in health care and improve patient outcomes.
Affiliation(s)
- Gang Luo
  - Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, United States
- Bryan L Stone
  - Department of Pediatrics, University of Utah, Salt Lake City, UT, United States
- Michael D Johnson
  - Department of Pediatrics, University of Utah, Salt Lake City, UT, United States
- Peter Tarczy-Hornoch
  - Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, United States
  - Division of Neonatology, Department of Pediatrics, University of Washington, Seattle, WA, United States
  - Department of Computer Science and Engineering, University of Washington, Seattle, WA, United States
- Adam B Wilcox
  - Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, United States
- Sean D Mooney
  - Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, United States
- Xiaoming Sheng
  - Department of Pediatrics, University of Utah, Salt Lake City, UT, United States
- Peter J Haug
  - Homer Warner Research Center, Intermountain Healthcare, Murray, UT, United States
  - Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, United States
- Flory L Nkoy
  - Department of Pediatrics, University of Utah, Salt Lake City, UT, United States
13
Abstract
In various biomedical applications that collect, handle, and manipulate data, the amount of data tends to build up and venture into the range identified as big data. In such cases, a design decision has to be made as to what type of database should be used to handle the data. According to past research, the default and classical solution in the biomedical domain is, more often than not, a relational database. While this was the norm for a long time, there is an evident trend to move away from relational databases in favor of other database types and paradigms. However, it is still of paramount importance to understand the interrelation between biomedical big data and relational databases. This chapter reviews the pros and cons of using relational databases to store biomedical big data that previous research has discussed.
Affiliation(s)
- N H Nisansa D de Silva
  - Department of Computer and Information Science, University of Oregon, 224 Deschutes Hall, 1477 E 13th Ave., Eugene, OR, 97403, USA
14
Luo G. PredicT-ML: a tool for automating machine learning model building with big clinical data. Health Inf Sci Syst 2016; 4:5. [PMID: 27280018 PMCID: PMC4897944 DOI: 10.1186/s13755-016-0018-1] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Received: 04/02/2016] [Accepted: 06/01/2016] [Indexed: 12/16/2022]
Abstract
Background Predictive modeling is fundamental to transforming large clinical data sets, or “big clinical data,” into actionable knowledge for various healthcare applications. Machine learning is a major predictive modeling approach, but two barriers make its use in healthcare challenging. First, a machine learning tool user must choose an algorithm and assign one or more model parameters called hyper-parameters before model training. The algorithm and hyper-parameter values used typically impact model accuracy by over 40 %, but their selection requires many labor-intensive manual iterations that can be difficult even for computer scientists. Second, many clinical attributes are repeatedly recorded over time, requiring temporal aggregation before predictive modeling can be performed. Many labor-intensive manual iterations are required to identify a good pair of aggregation period and operator for each clinical attribute. Both barriers result in time and human resource bottlenecks, and preclude healthcare administrators and researchers from asking a series of what-if questions when probing opportunities to use predictive models to improve outcomes and reduce costs. Methods This paper describes our design of and vision for PredicT-ML (prediction tool using machine learning), a software system that aims to overcome these barriers and automate machine learning model building with big clinical data. Results The paper presents the detailed design of PredicT-ML. Conclusions PredicT-ML will open the use of big clinical data to thousands of healthcare administrators and researchers and increase the ability to advance clinical research and improve healthcare.
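The temporal aggregation barrier described in the abstract above can be illustrated with a small sketch: a repeatedly recorded clinical attribute is reduced to a single feature by choosing an aggregation period and an operator. The function names, the candidate periods and operators, and the weight measurements are all hypothetical; this is not PredicT-ML's design, only an example of what one (period, operator) pair computes.

```python
from datetime import date, timedelta
from statistics import mean


def aggregate(records, as_of, period_days, operator):
    """Apply operator to the values recorded within period_days before as_of.

    records is a list of (date, value) pairs. Returns None when no value
    falls inside the window.
    """
    window_start = as_of - timedelta(days=period_days)
    values = [v for d, v in sorted(records) if window_start < d <= as_of]
    return operator(values) if values else None


def kept_rising(values):
    """True if every value is higher than the one recorded before it."""
    return len(values) >= 2 and all(a < b for a, b in zip(values, values[1:]))


# Hypothetical weight measurements (kg) for one patient.
weights = [
    (date(2013, 3, 1), 90.0),
    (date(2015, 1, 10), 80.0),
    (date(2015, 6, 5), 82.5),
    (date(2015, 11, 20), 84.0),
]
as_of = date(2015, 12, 31)

# Each (period, operator) pair yields a different candidate feature.
features = {
    (days, op.__name__): aggregate(weights, as_of, days, op)
    for days in (365, 3 * 365)
    for op in (max, mean, kept_rising)
}
print(features)
```

Searching over such pairs for every clinical attribute is the manual iteration the abstract describes; here the weight "kept rising" over the past year but not over the past three years, so the choice of period changes the feature.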
Affiliation(s)
- Gang Luo
- Department of Biomedical Informatics, University of Utah, Suite 140, 421 Wakara Way, Salt Lake City, UT 84108 USA
|
15
|
Optimized Distributed Hyperparameter Search and Simulation for Lung Texture Classification in CT Using Hadoop. J Imaging 2016. [DOI: 10.3390/jimaging2020019] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Indexed: 11/16/2022] Open
|
16
|
A review of automatic selection methods for machine learning algorithms and hyper-parameter values. Network Modeling Analysis in Health Informatics and Bioinformatics 2016. [DOI: 10.1007/s13721-016-0125-6] [Citation(s) in RCA: 130] [Impact Index Per Article: 16.3] [Indexed: 11/26/2022]
|
17
|
Petridis AK, Fischer I, Cornelius JF, Kamp MA, Ringel F, Tortora A, Steiger HJ. Demographic distribution of hospital admissions for brain arteriovenous malformations in Germany--estimation of the natural course with the big-data approach. Acta Neurochir (Wien) 2016; 158:791-796. [PMID: 26873715 DOI: 10.1007/s00701-016-2727-2] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 12/02/2015] [Accepted: 01/27/2016] [Indexed: 11/29/2022]
Abstract
BACKGROUND Estimation of the natural history of arteriovenous malformations based on short-term observation is potentially biased by multiple factors. Retrieval of demographic information of all AVM patients of national data pools and comparison with the national demographic profile might be another way to approach the natural history. MATERIALS AND METHODS Upon request, the German Federal Statistical Office provided the numbers of patients admitted in Germany from 2009 through 2013 with ICD Q28.2 (brain AVM) as primary discharge diagnosis, and the corresponding age distribution. Age-related admission rates of AVM were calculated by comparison with the German demographic distribution. RESULTS A total of 6527 patients were hospitalized from 2009-2013 with brain AVM (Q28.2) as the principal diagnosis. Age-specific admission rate during the first year of life was high with 19.0/100,000 during the 5-year study period, corresponding to a yearly admission rate of 3.8 per 100,000 babies. Apart from the high admission rate during the first year of life, the admission rate was low, but steadily increasing during first decades of life reaching a plateau with 11.1/100,000 in the age group 30-34 years, corresponding to an annual admission rate of 2.2/100,000. After the age of 30-34 years, admission rates decreased continuously, reaching 0 in the age group 90-95 years. The lifetime risk of admission in terms of admission per 100,000 age-matched people was calculated by retrograde integration of the admission rates. At the age of 1 year, the cumulative number of future admissions for AVM during lifetime amounted to 131.3/100,000 children. For the older age groups, the chance of future admission for AVM decreased as expected, reaching 43.8/100,000 by the age of 50 and 0 by the age of 90. CONCLUSIONS Despite some open issues, the current data suggests that achieving old age with an untreated brain AVM is unlikely. 
Furthermore, the data support the concept that most brain AVMs are not necessarily a congenital entity but develop during the first decades of life.
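The arithmetic behind these figures can be sketched as follows. The annualization reproduces the abstract's own numbers (19.0 and 11.1 per 100,000 over the 5-year period). The "retrograde integration" is shown as a reverse cumulative sum over age bands, which is an assumed reading of the method; the band widths and rates passed to it are hypothetical placeholders, not the study's data.

```python
from itertools import accumulate


def annualize(rate_over_period, years):
    """Convert a rate accumulated over several years into a yearly rate."""
    return rate_over_period / years


def remaining_admissions(band_widths_years, annual_rates):
    """Expected future admissions per 100,000 from the start of each age band.

    Each band contributes (width in years) x (annual rate); the contributions
    are summed from the oldest band backwards.
    """
    contributions = [w * r for w, r in zip(band_widths_years, annual_rates)]
    # Accumulate from the oldest band, then restore youngest-first order.
    return list(accumulate(reversed(contributions)))[::-1]


# Figures quoted in the abstract (per 100,000 over the 5-year study period).
infant_rate = annualize(19.0, 5)  # first year of life
peak_rate = annualize(11.1, 5)    # age group 30-34 years
print(round(infant_rate, 1), round(peak_rate, 1))

# Hypothetical age bands and annual rates, for illustration only.
widths = [1, 4, 5]
rates = [3.8, 1.0, 2.0]
risk = remaining_admissions(widths, rates)
print(risk)
```

The first element of `risk` is the cumulative figure from the youngest band onward, and later elements shrink as fewer years of exposure remain, which matches the decreasing pattern the abstract reports for older age groups.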
Affiliation(s)
- Athanasios K Petridis
- Department of Neurosurgery, University Hospital Duesseldorf, Moorenstr. 5, 40225, Düsseldorf, Germany
- Igor Fischer
- Division of Informatics and Statistics, Department of Neurosurgery, Heinrich-Heine-Universität, Düsseldorf, Germany
- Jan F Cornelius
- Department of Neurosurgery, University Hospital Duesseldorf, Moorenstr. 5, 40225, Düsseldorf, Germany
- Marcel A Kamp
- Department of Neurosurgery, University Hospital Duesseldorf, Moorenstr. 5, 40225, Düsseldorf, Germany
- Florian Ringel
- Department of Neurosurgery, Technische Universität, Munich, Germany
- Angelo Tortora
- Department of Neurosurgery, University Hospital Duesseldorf, Moorenstr. 5, 40225, Düsseldorf, Germany
- Hans-Jakob Steiger
- Department of Neurosurgery, University Hospital Duesseldorf, Moorenstr. 5, 40225, Düsseldorf, Germany
|
18
|
Predictive Business Process Monitoring Framework with Hyperparameter Optimization. Advanced Information Systems Engineering 2016. [DOI: 10.1007/978-3-319-39696-5_22] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Indexed: 12/02/2022]
|