1
Demircioğlu A. radMLBench: A dataset collection for benchmarking in radiomics. Comput Biol Med 2024; 182:109140. [PMID: 39270457 DOI: 10.1016/j.compbiomed.2024.109140] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/07/2024] [Revised: 08/20/2024] [Accepted: 09/08/2024] [Indexed: 09/15/2024]
Abstract
BACKGROUND: New machine learning methods and techniques are frequently introduced in radiomics, but they are often tested on a single dataset, which makes it challenging to assess their true benefit. Currently, there is no larger, publicly accessible dataset collection on which such assessments could be performed. In this study, a collection of radiomics datasets with binary outcomes in tabular form was curated to allow benchmarking of machine learning methods and techniques. METHODS: A variety of journals and online sources were searched to identify tabular radiomics data with binary outcomes, which were then compiled into a homogeneous data collection that is easily accessible via Python. To illustrate the utility of the dataset collection, it was applied to investigate whether feature decorrelation prior to feature selection could improve predictive performance in a radiomics pipeline. RESULTS: A total of 50 radiomic datasets were collected, with sample sizes ranging from 51 to 969 and feature counts ranging from 101 to 11165. Using this data, it was observed that decorrelating features did not yield any significant improvement on average. CONCLUSIONS: A large collection of datasets, easily accessible via Python and suitable for benchmarking and evaluating new machine learning techniques and methods, was curated. Its utility was exemplified by demonstrating that feature decorrelation prior to feature selection does not, on average, lead to significant performance gains and could be omitted, thereby increasing the robustness and reliability of the radiomics pipeline.
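The abstract does not say how features were decorrelated. As an illustration only, the sketch below uses one common approach: a greedy filter that drops any feature whose absolute Pearson correlation with an already kept feature exceeds a threshold. The threshold of 0.95, the function names, and the toy feature values are assumptions, not details from the study.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def decorrelate(columns, threshold=0.95):
    """Greedily keep features whose |r| with every kept feature is <= threshold.

    columns: dict mapping feature name -> list of values (insertion order matters).
    Returns the names of the kept features in their original order.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Toy data (hypothetical): the second feature is a rescaled copy of the first.
features = {
    "volume": [1.0, 2.0, 3.0, 4.0, 5.0],
    "volume_mm3": [10.0, 20.0, 30.0, 40.0, 50.0],
    "entropy": [0.3, 0.9, 0.1, 0.7, 0.2],
}
kept = decorrelate(features)
print(kept)
```

In a pipeline like the one described, such a filter would run before feature selection; the study's finding is that this step did not significantly improve performance on average.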
Affiliation(s)
- Aydin Demircioğlu
  - Institute of Diagnostic and Interventional Radiology and Neuroradiology, University Hospital Essen, Hufelandstraße 55, D-45147, Essen, Germany.
2
Peterson RA, McGrath M, Cavanaugh JE. Can a Transparent Machine Learning Algorithm Predict Better than Its Black Box Counterparts? A Benchmarking Study Using 110 Data Sets. Entropy (Basel) 2024; 26:746. [PMID: 39330080 PMCID: PMC11431724 DOI: 10.3390/e26090746] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/27/2024] [Revised: 08/27/2024] [Accepted: 08/28/2024] [Indexed: 09/28/2024]
Abstract
We developed a novel machine learning (ML) algorithm with the goal of producing transparent models (i.e., understandable by humans) while also flexibly accounting for nonlinearity and interactions. Our method is based on ranked sparsity, and it allows for flexibility and user control in varying the shade of the opacity of black box machine learning methods. The main tenet of ranked sparsity is that an algorithm should be more skeptical of higher-order polynomials and interactions a priori compared to main effects, and hence, the inclusion of these more complex terms should require a higher level of evidence. In this work, we put our new ranked sparsity algorithm (as implemented in the open source R package, sparseR) to the test in a predictive model "bakeoff" (i.e., a benchmarking study of ML algorithms applied "out of the box", that is, with no special tuning). Algorithms were trained on a large set of simulated and real-world data sets from the Penn Machine Learning Benchmarks database, addressing both regression and binary classification problems. We evaluated the extent to which our human-centered algorithm can attain predictive accuracy that rivals popular black box approaches such as neural networks, random forests, and support vector machines, while also producing more interpretable models. Using out-of-bag error as a meta-outcome, we describe the properties of data sets in which human-centered approaches can perform as well as or better than black box approaches. We found that interpretable approaches predicted optimally or within 5% of the optimal method in most real-world data sets. We provide a more in-depth comparison of the performances of random forests to interpretable methods for several case studies, including exemplars in which algorithms performed similarly, and several cases when interpretable methods underperformed. 
This work provides a strong rationale for including human-centered transparent algorithms such as ours in predictive modeling applications.
Affiliation(s)
- Ryan A Peterson
  - Department of Biostatistics & Informatics, Colorado School of Public Health, University of Colorado, Anschutz Medical Campus, 13001 E. 17th Pl, Aurora, CO 80045, USA
- Max McGrath
  - Department of Biostatistics & Informatics, Colorado School of Public Health, University of Colorado, Anschutz Medical Campus, 13001 E. 17th Pl, Aurora, CO 80045, USA
- Joseph E Cavanaugh
  - Department of Biostatistics, College of Public Health, University of Iowa, 145 N. Riverside Dr., Iowa City, IA 52245, USA
3
Tjaden J, Tjaden B. MLpronto: A tool for democratizing machine learning. PLoS One 2023; 18:e0294924. [PMID: 38032968 PMCID: PMC10688639 DOI: 10.1371/journal.pone.0294924] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/26/2023] [Accepted: 11/11/2023] [Indexed: 12/02/2023]
Abstract
The democratization of machine learning is a popular and growing movement. In a world with a wealth of publicly available data, it is important that algorithms for analysis of data are accessible and usable by everyone. We present MLpronto, a system for machine learning analysis that is designed to be easy to use so as to facilitate engagement with machine learning algorithms. With its web interface, MLpronto requires no computer programming or machine learning background, and it normally returns results in a matter of seconds. As input, MLpronto takes a file of data to be analyzed. MLpronto then executes some of the more commonly used supervised machine learning algorithms on the data and reports the results of the analyses. As part of its execution, MLpronto generates computer programming code corresponding to its machine learning analysis, which it also supplies as output. Thus, MLpronto can be used as a no-code solution for citizen data scientists with no machine learning or programming background, as an educational tool for those learning about machine learning, and as a first step for those who prefer to engage with programming code in order to facilitate rapid development of machine learning projects. MLpronto is freely available for use at https://mlpronto.org/.
Affiliation(s)
- Jacob Tjaden
  - Computer Science Department, Colby College, Waterville, ME, United States of America
- Brian Tjaden
  - Department of Computer Science, Wellesley College, Wellesley, MA, United States of America
4
Ong W, Liu RW, Makmur A, Low XZ, Sng WJ, Tan JH, Kumar N, Hallinan JTPD. Artificial Intelligence Applications for Osteoporosis Classification Using Computed Tomography. Bioengineering (Basel) 2023; 10:1364. [PMID: 38135954 PMCID: PMC10741220 DOI: 10.3390/bioengineering10121364] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/20/2023] [Revised: 11/21/2023] [Accepted: 11/23/2023] [Indexed: 12/24/2023]
Abstract
Osteoporosis, marked by low bone mineral density (BMD) and a high fracture risk, is a major health issue. Recent progress in medical imaging, especially CT scans, offers new ways of diagnosing and assessing osteoporosis. This review examines the use of AI analysis of CT scans to stratify BMD and diagnose osteoporosis. By summarizing the relevant studies, we aimed to assess the effectiveness, constraints, and potential impact of AI-based osteoporosis classification (severity) via CT. A systematic search of electronic databases (PubMed, MEDLINE, Web of Science, ClinicalTrials.gov) was conducted according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. A total of 39 articles were retrieved from the databases, and the key findings were compiled and summarized, including the regions analyzed, the type of CT imaging, and their efficacy in predicting BMD compared with conventional DXA studies. Important considerations and limitations are also discussed. The overall reported accuracy, sensitivity, and specificity of AI in classifying osteoporosis using CT images ranged from 61.8% to 99.4%, 41.0% to 100.0%, and 31.0% to 100.0% respectively, with areas under the curve (AUCs) ranging from 0.582 to 0.994. While additional research is necessary to validate the clinical efficacy and reproducibility of these AI tools before incorporating them into routine clinical practice, these studies demonstrate the promising potential of using CT to opportunistically predict and classify osteoporosis without the need for DEXA.
Affiliation(s)
- Wilson Ong
  - Department of Diagnostic Imaging, National University Hospital, 5 Lower Kent Ridge Rd, Singapore 119074, Singapore
- Ren Wei Liu
  - Department of Diagnostic Imaging, National University Hospital, 5 Lower Kent Ridge Rd, Singapore 119074, Singapore
- Andrew Makmur
  - Department of Diagnostic Imaging, National University Hospital, 5 Lower Kent Ridge Rd, Singapore 119074, Singapore
  - Department of Diagnostic Radiology, Yong Loo Lin School of Medicine, National University of Singapore, 10 Medical Drive, Singapore 117597, Singapore
- Xi Zhen Low
  - Department of Diagnostic Imaging, National University Hospital, 5 Lower Kent Ridge Rd, Singapore 119074, Singapore
  - Department of Diagnostic Radiology, Yong Loo Lin School of Medicine, National University of Singapore, 10 Medical Drive, Singapore 117597, Singapore
- Weizhong Jonathan Sng
  - Department of Diagnostic Imaging, National University Hospital, 5 Lower Kent Ridge Rd, Singapore 119074, Singapore
  - Department of Diagnostic Radiology, Yong Loo Lin School of Medicine, National University of Singapore, 10 Medical Drive, Singapore 117597, Singapore
- Jiong Hao Tan
  - University Spine Centre, Department of Orthopaedic Surgery, National University Health System, 1E Lower Kent Ridge Road, Singapore 119228, Singapore
- Naresh Kumar
  - University Spine Centre, Department of Orthopaedic Surgery, National University Health System, 1E Lower Kent Ridge Road, Singapore 119228, Singapore
- James Thomas Patrick Decourcy Hallinan
  - Department of Diagnostic Imaging, National University Hospital, 5 Lower Kent Ridge Rd, Singapore 119074, Singapore
  - Department of Diagnostic Radiology, Yong Loo Lin School of Medicine, National University of Singapore, 10 Medical Drive, Singapore 117597, Singapore
5
Decoux A, Duron L, Habert P, Roblot V, Arsovic E, Chassagnon G, Arnoux A, Fournier L. Comparative performances of machine learning algorithms in radiomics and impacting factors. Sci Rep 2023; 13:14069. [PMID: 37640728 PMCID: PMC10462640 DOI: 10.1038/s41598-023-39738-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2023] [Accepted: 07/30/2023] [Indexed: 08/31/2023]
Abstract
There are no current recommendations on which machine learning (ML) algorithms should be used in radiomics. The objective was to compare performances of ML algorithms in radiomics when applied to different clinical questions to determine whether some strategies could give the best and most stable performances regardless of datasets. This study compares the performances of nine feature selection algorithms combined with fourteen binary classification algorithms on ten datasets. These datasets included radiomics features and clinical diagnosis for binary clinical classifications including COVID-19 pneumonia or sarcopenia on CT, head and neck, orbital or uterine lesions on MRI. For each dataset, a train-test split was created. Each of the 126 (9 × 14) combinations of feature selection algorithms and classification algorithms was trained and tuned using a ten-fold cross validation, then AUC was computed. This procedure was repeated three times per dataset. Best overall performances were obtained with JMI and JMIM as feature selection algorithms and random forest and linear regression models as classification algorithms. The choice of the classification algorithm was the factor explaining most of the performance variation (10% of total variance). The choice of the feature selection algorithm explained only 2% of variation, while the train-test split explained 9%.
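The size of the experimental grid described above can be worked out explicitly. The sketch below enumerates it as a rough illustration; only JMI, JMIM, random forest, and linear regression are named in the abstract, so every other label, as well as the dataset names, is a hypothetical placeholder.

```python
from itertools import product

# Named in the abstract: JMI, JMIM, random forest, linear regression.
# All other labels are hypothetical placeholders to reach 9 and 14 entries.
selectors = ["JMI", "JMIM"] + [f"selector_{i}" for i in range(3, 10)]
classifiers = ["random_forest", "linear_regression"] + [
    f"classifier_{i}" for i in range(3, 15)
]
datasets = [f"dataset_{i}" for i in range(1, 11)]  # ten datasets
repeats = 3  # the procedure was repeated three times per dataset

# Every feature selection algorithm is paired with every classifier.
combinations = list(product(selectors, classifiers))

# One tuned-and-evaluated pipeline per (dataset, repeat, selector, classifier).
runs = [
    (d, r, s, c)
    for d in datasets
    for r in range(repeats)
    for s, c in combinations
]

print(len(combinations))  # 9 x 14 = 126
print(len(runs))          # 126 x 3 x 10 = 3780
```

Each of these runs additionally involves ten-fold cross-validation for tuning, so the number of model fits is considerably larger than the number of runs.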
Affiliation(s)
- Antoine Decoux
  - Université Paris Cité, PARCC UMRS 970, INSERM, Paris, France
  - Unité de Recherche Clinique, Centre d'Investigation Clinique 1418 Épidémiologie Clinique, Université Paris Cité, AP-HP, Hôpital Européen Georges Pompidou, INSERM, Paris, France
- Loic Duron
  - Université Paris Cité, PARCC UMRS 970, INSERM, Paris, France
  - Department of Radiology, Hôpital Fondation Ophtalmologique Adolphe de Rothschild, Paris, France
- Paul Habert
  - Université Paris Cité, PARCC UMRS 970, INSERM, Paris, France
  - Imaging Department, Hôpital Nord, APHM, Aix Marseille University, Marseille, France
  - Aix Marseille Univ, LIIE, Marseille, France
- Victoire Roblot
  - Université Paris Cité, PARCC UMRS 970, INSERM, Paris, France
- Emina Arsovic
  - Université Paris Cité, PARCC UMRS 970, INSERM, Paris, France
- Guillaume Chassagnon
  - Department of Radiology, Université Paris Cité, AP-HP, Hôpital Cochin, Paris, France
- Armelle Arnoux
  - Unité de Recherche Clinique, Centre d'Investigation Clinique 1418 Épidémiologie Clinique, Université Paris Cité, AP-HP, Hôpital Européen Georges Pompidou, INSERM, Paris, France
- Laure Fournier
  - Department of Radiology, Université Paris Cité, AP-HP, Hôpital Européen Georges Pompidou, PARCC UMRS 970, INSERM, Paris, France.
6
La Cava WG, Lee PC, Ajmal I, Ding X, Solanki P, Cohen JB, Moore JH, Herman DS. A flexible symbolic regression method for constructing interpretable clinical prediction models. NPJ Digit Med 2023; 6:107. [PMID: 37277550 PMCID: PMC10241925 DOI: 10.1038/s41746-023-00833-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2021] [Accepted: 05/05/2023] [Indexed: 06/07/2023]
Abstract
Machine learning (ML) models trained for triggering clinical decision support (CDS) are typically either accurate or interpretable but not both. Scaling CDS to the panoply of clinical use cases while mitigating risks to patients will require many ML models be intuitively interpretable for clinicians. To this end, we adapted a symbolic regression method, coined the feature engineering automation tool (FEAT), to train concise and accurate models from high-dimensional electronic health record (EHR) data. We first present an in-depth application of FEAT to classify hypertension, hypertension with unexplained hypokalemia, and apparent treatment-resistant hypertension (aTRH) using EHR data for 1200 subjects receiving longitudinal care in a large healthcare system. FEAT models trained to predict phenotypes adjudicated by chart review had equivalent or higher discriminative performance (p < 0.001) and were at least three times smaller (p < 1 × 10⁻⁶) than other potentially interpretable models. For aTRH, FEAT generated a six-feature, highly discriminative (positive predictive value = 0.70, sensitivity = 0.62), and clinically intuitive model. To assess the generalizability of the approach, we tested FEAT on 25 benchmark clinical phenotyping tasks using the MIMIC-III critical care database. Under comparable dimensionality constraints, FEAT's models exhibited higher area under the receiver-operating curve scores than penalized linear models across tasks (p < 6 × 10⁻⁶). In summary, FEAT can train EHR prediction models that are both intuitively interpretable and accurate, which should facilitate safe and effective scaling of ML-triggered CDS to the panoply of potential clinical use cases and healthcare practices.
Affiliation(s)
- William G La Cava
  - Computational Health Informatics Program, Boston Children's Hospital, Harvard Medical School, Boston, MA, USA
- Paul C Lee
  - Department of Pathology and Laboratory Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Imran Ajmal
  - Department of Pathology and Laboratory Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Xiruo Ding
  - Department of Pathology and Laboratory Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Priyanka Solanki
  - Department of Pathology and Laboratory Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Jordana B Cohen
  - Division of Renal-Electrolyte and Hypertension, Department of Medicine, University of Pennsylvania, Philadelphia, PA, USA
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, PA, USA
- Jason H Moore
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, PA, USA
- Daniel S Herman
  - Department of Pathology and Laboratory Medicine, University of Pennsylvania, Philadelphia, PA, USA.
7
Alòs J, Ansótegui C, Torres E. Interpretable decision trees through MaxSAT. Artif Intell Rev 2022; 56:1-21. [PMID: 36590759 PMCID: PMC9794111 DOI: 10.1007/s10462-022-10377-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 12/12/2022] [Indexed: 12/29/2022]
Abstract
We present an approach to improve the accuracy-interpretability trade-off of Machine Learning (ML) Decision Trees (DTs). In particular, we apply Maximum Satisfiability technology to compute Minimum Pure DTs (MPDTs). We improve the runtime of previous approaches and show that these MPDTs can outperform the accuracy of DTs generated with the ML framework sklearn.
Affiliation(s)
- Josep Alòs
  - Logic & Optimization Group (LOG), University of Lleida, Lleida, Spain
- Carlos Ansótegui
  - Logic & Optimization Group (LOG), University of Lleida, Lleida, Spain
- Eduard Torres
  - Logic & Optimization Group (LOG), University of Lleida, Lleida, Spain
8
Duong-Trung N, Born S, Kim JW, Schermeyer MT, Paulick K, Borisyak M, Cruz-Bournazou MN, Werner T, Scholz R, Schmidt-Thieme L, Neubauer P, Martinez E. When Bioprocess Engineering Meets Machine Learning: A Survey from the Perspective of Automated Bioprocess Development. Biochem Eng J 2022. [DOI: 10.1016/j.bej.2022.108764] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/14/2022]
9
Valdes G, Interian Y, Gennatas E, Van der Laan M. The Conditional Super Learner. IEEE Trans Pattern Anal Mach Intell 2022; 44:10236-10243. [PMID: 34851823 DOI: 10.1109/tpami.2021.3131976] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
Using cross validation to select the best model from a library is standard practice in machine learning. Similarly, meta learning is a widely used technique where models previously developed are combined (mainly linearly) with the expectation of improving performance with respect to individual models. In this article we consider the Conditional Super Learner (CSL), an algorithm that selects the best model candidate from a library of models conditional on the covariates. The CSL expands the idea of using cross validation to select the best model and merges it with meta learning. We propose an optimization algorithm that finds a local minimum to the problem posed and prove that it converges at a rate faster than $O_p(n^{-1/4})$. We offer empirical evidence that: (1) CSL is an excellent candidate to substitute stacking and (2) CSL is suitable for the analysis of hierarchical problems. Additionally, implications for global interpretability are emphasized.
10
Orzechowski P, Moore JH. Generative and reproducible benchmarks for comprehensive evaluation of machine learning classifiers. Sci Adv 2022; 8:eabl4747. [PMID: 36417520 PMCID: PMC9683726 DOI: 10.1126/sciadv.abl4747] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2021] [Accepted: 10/07/2022] [Indexed: 06/16/2023]
Abstract
Understanding the strengths and weaknesses of machine learning (ML) algorithms is crucial to determine their scope of application. Here, we introduce the Diverse and Generative ML Benchmark (DIGEN), a collection of synthetic datasets for comprehensive, reproducible, and interpretable benchmarking of ML algorithms for classification of binary outcomes. The DIGEN resource consists of 40 mathematical functions that map continuous features to binary targets for creating synthetic datasets. These 40 functions were found using a heuristic algorithm designed to maximize the diversity of performance among multiple popular ML algorithms, thus providing a useful test suite for evaluating and comparing new methods. Access to the generative functions facilitates understanding of why a method performs poorly compared to other algorithms, thus providing ideas for improvement.
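To make the idea of a generative benchmark concrete, the sketch below builds a reproducible synthetic dataset by mapping continuous features to a binary target through a fixed mathematical function. This is not DIGEN's code or one of its 40 functions: `toy_function`, the standard normal features, the median threshold, and all parameter values are assumptions chosen for illustration.

```python
import random
import statistics

def make_dataset(func, n_samples=200, n_features=4, seed=0):
    """Generate features from a seeded RNG and derive binary labels from func.

    Labels are 1 where func(row) exceeds the median score, which gives
    roughly balanced classes (an assumption, not a DIGEN detail).
    """
    rng = random.Random(seed)  # fixed seed makes the dataset reproducible
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    scores = [func(row) for row in X]
    cut = statistics.median(scores)
    y = [1 if s > cut else 0 for s in scores]
    return X, y

def toy_function(row):
    """Hypothetical generative function with an interaction and a nonlinearity."""
    return row[0] * row[1] + row[2] ** 2 - row[3]

X, y = make_dataset(toy_function)
X_again, y_again = make_dataset(toy_function)
print(X == X_again and y == y_again)  # same seed, same dataset
```

Because the generating function is known, a poor result from some classifier can be traced back to the structure it failed to capture, here an interaction term and a squared term.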
Affiliation(s)
- Patryk Orzechowski
  - Institute for Biomedical Informatics, University of Pennsylvania, 3700 Hamilton Walk, Philadelphia, PA 19104, USA
  - Department of Automatics and Robotics, AGH University of Science and Technology, al. Mickiewicza 30, 30-059 Kraków, Poland
- Jason H. Moore
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, 700 N. San Vicente Blvd., Suite G540, West Hollywood, CA 90069, USA
11
Kasperek D, Podpora M, Kawala-Sterniuk A. Comparison of the Usability of Apple M1 Processors for Various Machine Learning Tasks. Sensors (Basel) 2022; 22:8005. [PMID: 36298358 PMCID: PMC9608475 DOI: 10.3390/s22208005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/09/2022] [Revised: 10/07/2022] [Accepted: 10/17/2022] [Indexed: 06/16/2023]
Abstract
In this paper, the authors compare all of the currently available Apple MacBook Pro laptops in terms of their usability for basic machine learning research applications (text-based, vision-based, tabular). The paper presents four tests/benchmarks comparing four Apple MacBook Pro laptop versions: one Intel-based (i5) and three Apple-based (M1, M1 Pro and M1 Max). A script in the Swift programming language was prepared to conduct the training and evaluation process for four machine learning (ML) models. It used the Create ML framework, Apple's solution dedicated to ML model creation on macOS devices. The training and evaluation processes were performed three times. While running, the script measured their performance, including the time taken. The results were compared in tables, which allowed the performance of the individual devices and the benefits of their specific hardware architectures to be compared and discussed.
12
Ho L, Goethals P. Machine learning applications in river research: Trends, opportunities and challenges. Methods Ecol Evol 2022. [DOI: 10.1111/2041-210x.13992] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/27/2022]
Affiliation(s)
- Long Ho
  - Department of Animal Sciences and Aquatic Ecology, Ghent University, Ghent, Belgium
- Peter Goethals
  - Department of Animal Sciences and Aquatic Ecology, Ghent University, Ghent, Belgium
13
Stafford IS, Gosink MM, Mossotto E, Ennis S, Hauben M. A Systematic Review of Artificial Intelligence and Machine Learning Applications to Inflammatory Bowel Disease, with Practical Guidelines for Interpretation. Inflamm Bowel Dis 2022; 28:1573-1583. [PMID: 35699597 PMCID: PMC9527612 DOI: 10.1093/ibd/izac115] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 02/03/2022] [Indexed: 12/15/2022]
Abstract
BACKGROUND: Inflammatory bowel disease (IBD) is a gastrointestinal chronic disease with an unpredictable disease course. Computational methods such as machine learning (ML) have the potential to stratify IBD patients for the provision of individualized care. The use of ML methods for IBD was surveyed, with an additional focus on how the field has changed over time. METHODS: On May 6, 2021, a systematic review was conducted through a search of MEDLINE and Embase databases, with the search structure ("machine learning" OR "artificial intelligence") AND ("Crohn* Disease" OR "Ulcerative Colitis" OR "Inflammatory Bowel Disease"). Exclusion criteria included studies not written in English, no human patient data, publication before 2001, studies that were not peer reviewed, nonautoimmune disease comorbidity research, and record types that were not primary research. RESULTS: Seventy-eight (of 409) records met the inclusion criteria. Random forest methods were most prevalent, and there was an increase in neural networks, mainly applied to imaging data sets. The main applications of ML to clinical tasks were diagnosis (18 of 78), disease course (22 of 78), and disease severity (16 of 78). The median sample size was 263. Clinical and microbiome-related data sets were most popular. Five percent of studies used an external data set after training and testing for additional model validation. DISCUSSION: Availability of longitudinal and deep phenotyping data could lead to better modeling. Machine learning pipelines that account for imbalanced data and that perform feature selection only on training data will generate more generalizable models. Machine learning models are increasingly being applied to more complex clinical tasks for specific phenotypes, indicating progress towards personalized medicine for IBD.
Affiliation(s)
- Imogen S Stafford
  - Human Genetics and Genomic Medicine, University of Southampton, Southampton, UK
  - Institute for Life Sciences, University of Southampton, Southampton, UK
  - NIHR Southampton Biomedical Research, University Hospital Southampton, Southampton, UK
- Enrico Mossotto
  - Human Genetics and Genomic Medicine, University of Southampton, Southampton, UK
- Sarah Ennis
  - Human Genetics and Genomic Medicine, University of Southampton, Southampton, UK
- Manfred Hauben
  - Pfizer Inc, New York, NY, USA
  - NYU Langone Health, Department of Medicine, New York, NY, USA
14
Colombelli F, Kowalski TW, Recamonde-Mendoza M. A hybrid ensemble feature selection design for candidate biomarkers discovery from transcriptome profiles. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.109655] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/15/2022]
15
Oppong SO, Twum F, Hayfron-Acquah JB, Missah YM. A Novel Computer Vision Model for Medicinal Plant Identification Using Log-Gabor Filters and Deep Learning Algorithms. Comput Intell Neurosci 2022; 2022:1189509. [PMID: 36203732 PMCID: PMC9532088 DOI: 10.1155/2022/1189509] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 04/24/2022] [Revised: 08/16/2022] [Accepted: 09/05/2022] [Indexed: 11/27/2022]
Abstract
Computer vision is the science that enables computers and machines to see and perceive image content on a semantic level. It combines concepts, techniques, and ideas from various fields such as digital image processing, pattern matching, artificial intelligence, and computer graphics. A computer vision system is designed to model the human visual system on a functional basis as closely as possible. Deep learning, and in particular Convolutional Neural Networks (CNNs), which are biologically inspired, have contributed significantly to computer vision studies. This research develops a computer vision system that uses CNNs and handcrafted Log-Gabor filters in an ensemble manner to identify medicinal plants based on their leaf textural features. The system was tested on a dataset developed from the Centre of Plant Medicine Research, Ghana (MyDataset) consisting of forty-nine (49) plant species. Using the concept of transfer learning, ten pretrained networks (Alexnet, GoogLeNet, DenseNet201, Inceptionv3, Mobilenetv2, Resnet18, Resnet50, Resnet101, vgg16, and vgg19) were used as feature extractors. Averaged across six supervised learning algorithms, the DenseNet201 architecture performed best with 87% accuracy and GoogLeNet performed worst with 79%. The proposed model (OTAMNet), created by fusing a Log-Gabor layer into the transition layers of the DenseNet201 architecture, achieved 98% accuracy when tested on MyDataset. OTAMNet was also tested on other benchmark datasets: Flavia, Swedish Leaf, MD2020, and Folio. It achieved 99% on Flavia, 100% on Swedish Leaf, 99% on MD2020, and 97% on Folio. A false-positive rate of less than 0.1% was achieved in all cases.
Affiliation(s)
- Frimpong Twum
  - Department of Computer Science, Kwame Nkrumah University of Science and Technology, Kumasi, Ghana
- James Ben Hayfron-Acquah
  - Department of Computer Science, Kwame Nkrumah University of Science and Technology, Kumasi, Ghana
- Yaw Marfo Missah
  - Department of Computer Science, Kwame Nkrumah University of Science and Technology, Kumasi, Ghana
16
Paepae T, Bokoro PN, Kyamakya K. A Virtual Sensing Concept for Nitrogen and Phosphorus Monitoring Using Machine Learning Techniques. Sensors (Basel) 2022; 22:7338. [PMID: 36236438 PMCID: PMC9572788 DOI: 10.3390/s22197338] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/21/2022] [Revised: 09/20/2022] [Accepted: 09/24/2022] [Indexed: 06/16/2023]
Abstract
Harmful cyanobacterial bloom (HCB) is problematic for drinking water treatment, and some of its strains can produce toxins that significantly affect human health. To better control eutrophication and HCB, catchment managers need to continuously keep track of nitrogen (N) and phosphorus (P) in the water bodies. However, the high-frequency monitoring of these water quality indicators is not economical. In these cases, machine learning techniques may serve as viable alternatives since they can learn directly from the available surrogate data. In the present work, virtual sensors based on a random forest, extremely randomized trees (ET), extreme gradient boosting, k-nearest neighbors, a light gradient boosting machine, and a bagging regressor were used to predict N and P in two catchments with contrasting land uses. The effects of data scaling and missing value imputation were also assessed, while Shapley additive explanations were used to rank feature importance. A specification book, sensitivity analysis, and best practices for developing virtual sensors are discussed. Results show that ET, the MinMax scaler, and a multivariate imputer were the best predictive model, scaler, and imputer, respectively. The highest predictive performance, reported in terms of R², was 97% in the rural catchment and 82% in the urban catchment.
Affiliation(s)
- Thulane Paepae
- Department of Electrical and Electronic Engineering Technology, University of Johannesburg, Doornfontein 2028, South Africa
- Pitshou N. Bokoro
- Department of Electrical and Electronic Engineering Technology, University of Johannesburg, Doornfontein 2028, South Africa
- Kyandoghere Kyamakya
- Institute for Smart Systems Technologies, Transportation Informatics, Alpen-Adria Universität Klagenfurt, 9020 Klagenfurt, Austria
|
17
|
Uncertainty Propagation Based MINLP Approach for Artificial Neural Network Structure Reduction. Processes (Basel) 2022. [DOI: 10.3390/pr10091716] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022] Open
Abstract
The performance of artificial neural networks (ANNs) is highly influenced by the selection of input variables and the architecture defined by hyperparameters such as the number of neurons in the hidden layer and connections between network variables. Although there are some black-box and trial-and-error-based studies in the literature to deal with these issues, it is fair to state that a rigorous and systematic method providing a global and unique solution is still missing. Accordingly, in this study, a mixed integer nonlinear programming (MINLP) formulation is proposed to detect the best features and connections among the neural network elements while propagating parameter and output uncertainties for regression problems. The objective of the formulation is to minimize the covariance of the estimated parameters by (i) detecting the ideal number of neurons, (ii) synthesizing the connection configuration between those neurons, inputs and outputs, and (iii) selecting optimum input variables in a multivariable data set to design and ensure identifiable ANN architectures. As a result, the suggested approach provides a robust and optimal ANN architecture with tighter prediction bounds obtained from propagation of parameter uncertainty, and higher prediction accuracy compared to the traditional fully connected approach and other benchmarks. Furthermore, such a performance is obtained after elimination of approximately 85% and 90% of the connections, for two case studies respectively, compared to traditional ANN, in addition to a significant reduction in the input subset.
|
18
|
Ngo G, Beard R, Chandra R. Evolutionary bagging for ensemble learning. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.08.055] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
|
19
|
A Romero RA, Y Deypalan MN, Mehrotra S, Jungao JT, Sheils NE, Manduchi E, Moore JH. Benchmarking AutoML frameworks for disease prediction using medical claims. BioData Min 2022; 15:15. [PMID: 35883154 PMCID: PMC9327416 DOI: 10.1186/s13040-022-00300-2] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 12/07/2021] [Accepted: 06/27/2022] [Indexed: 11/10/2022] Open
Abstract
Objectives Ascertain and compare the performances of Automated Machine Learning (AutoML) tools on large, highly imbalanced healthcare datasets. Materials and Methods We generated a large dataset using historical de-identified administrative claims including demographic information and flags for disease codes in four different time windows prior to 2019. We then trained three AutoML tools on this dataset to predict six different disease outcomes in 2019 and evaluated model performances on several metrics. Results The AutoML tools showed improvement from the baseline random forest model but did not differ significantly from each other. All models recorded low area under the precision-recall curve and failed to predict true positives while keeping the true negative rate high. Model performance was not directly related to prevalence. We provide a specific use-case to illustrate how to select a threshold that gives the best balance between true and false positive rates, as this is an important consideration in medical applications. Discussion Healthcare datasets present several challenges for AutoML tools, including large sample size, high imbalance, and limitations in the available features. Improvements in scalability, combinations of imbalance-learning resampling and ensemble approaches, and curated feature selection are possible next steps to achieve better performance. Conclusion Among the three explored, no AutoML tool consistently outperforms the rest in terms of predictive performance. The performances of the models in this study suggest that there may be room for improvement in handling medical claims data. Finally, selection of the optimal prediction threshold should be guided by the specific practical application. Supplementary Information The online version contains supplementary material available at (10.1186/s13040-022-00300-2).
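The abstract mentions selecting a threshold that balances true and false positive rates but does not state which criterion was used. As a sketch under that caveat, one common choice is Youden's J statistic (TPR minus FPR), shown here on hypothetical toy scores. The function names and data are illustrative assumptions, not the authors' code.

```python
def roc_points(y_true, scores, thresholds):
    """Return (threshold, TPR, FPR) for each candidate threshold."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = []
    for t in thresholds:
        # A case is predicted positive when its score reaches the threshold.
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t)
        points.append((t, tp / pos, fp / neg))
    return points


def best_threshold_youden(y_true, scores):
    """Pick the threshold maximizing Youden's J = TPR - FPR."""
    candidates = sorted(set(scores))
    return max(roc_points(y_true, scores, candidates),
               key=lambda p: p[1] - p[2])


# Hypothetical imbalanced toy data: 2 positives among 8 cases.
labels = [0, 0, 0, 0, 0, 0, 1, 1]
probs = [0.05, 0.10, 0.15, 0.20, 0.30, 0.45, 0.40, 0.70]
threshold, tpr, fpr = best_threshold_youden(labels, probs)
```

In a medical setting the costs of false negatives and false positives are rarely equal, so a weighted criterion may be preferable to plain Youden's J. This matches the abstract's point that the threshold should follow the practical application.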
Affiliation(s)
- Elisabetta Manduchi
- Department of Computational Biomedicine, Cedars-Sinai Medical Center, 700 N. San Vicente Blvd., Pacific Design Center Suite G540, West Hollywood, 90069, CA, USA
- Jason H Moore
- Department of Computational Biomedicine, Cedars-Sinai Medical Center, 700 N. San Vicente Blvd., Pacific Design Center Suite G540, West Hollywood, 90069, CA, USA
|
20
|
Boecking B, Jeanselme V, Dubrawski A. Constrained clustering and multiple kernel learning without pairwise constraint relaxation. ADV DATA ANAL CLASSI 2022. [DOI: 10.1007/s11634-022-00507-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
|
21
|
Successfully and efficiently training deep multi-layer perceptrons with logistic activation function simply requires initializing the weights with an appropriate negative mean. Neural Netw 2022; 153:87-103. [DOI: 10.1016/j.neunet.2022.05.030] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/14/2021] [Revised: 03/25/2022] [Accepted: 05/31/2022] [Indexed: 12/26/2022]
|
22
|
Zheng Y, Guo Z, Zhang Y, Shang J, Yu L, Fu P, Liu Y, Li X, Wang H, Ren L, Zhang W, Hou H, Tan X, Wang W. Rapid triage for ischemic stroke: a machine learning-driven approach in the context of predictive, preventive and personalised medicine. EPMA J 2022; 13:285-298. [PMID: 35719136 PMCID: PMC9203613 DOI: 10.1007/s13167-022-00283-4] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 04/27/2022] [Accepted: 05/09/2022] [Indexed: 02/05/2023]
Abstract
BACKGROUND Recognising the early signs of ischemic stroke (IS) in emergency settings has been challenging. Machine learning (ML), a robust tool for predictive, preventive and personalised medicine (PPPM/3PM), presents a possible solution for this issue and produces accurate predictions for real-time data processing. METHODS This investigation evaluated 4999 IS patients among a total of 10,476 adults included in the initial dataset, and 1076 IS subjects among 3935 participants in the external validation dataset. Six ML-based models for the prediction of IS were trained on the initial dataset of 10,476 participants, who were split into a training set (80%) and an internal validation set (20%). Selected clinical laboratory features routinely assessed at admission were used to inform the models. Model performance was mainly evaluated by the area under the receiver operating characteristic curve (AUC). Additional techniques, namely permutation feature importance (PFI), local interpretable model-agnostic explanations (LIME), and SHapley Additive exPlanations (SHAP), were applied to explain the black-box ML models. RESULTS Fifteen routine haematological and biochemical features were selected to establish ML-based models for the prediction of IS. The XGBoost-based model achieved the highest predictive performance, reaching AUCs of 0.91 (0.90-0.92) and 0.92 (0.91-0.93) in the internal and external datasets respectively. PFI globally revealed that the demographic feature age, the routine haematological parameters haemoglobin and neutrophil count, and the biochemical analytes total protein and high-density lipoprotein cholesterol were more influential on the model's prediction. LIME and SHAP showed similar local feature attribution explanations. CONCLUSION In the context of PPPM/3PM, we used the selected predictors obtained from the results of common blood tests to develop and validate ML-based models for the diagnosis of IS. The XGBoost-based model offers the most accurate prediction. By incorporating the individualised patient profile, this prediction tool is simple and quick to administer. It shows promise for supporting subjective decision making in resource-limited settings or primary care, thereby shortening the time window for treatment and improving outcomes after IS. SUPPLEMENTARY INFORMATION The online version contains supplementary material available at 10.1007/s13167-022-00283-4.
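Permutation feature importance (PFI), one of the explanation techniques named in this abstract, can be sketched in a few lines: shuffle one feature column at a time and record how much a performance metric drops. The toy classifier, data, and function names below are hypothetical stand-ins, with accuracy used in place of AUC for brevity. This is not the study's XGBoost model or its features.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the true label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy when one feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle column j only, leaving the other features intact.
            column = [r[j] for r in rows]
            rng.shuffle(column)
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical stand-in for a trained classifier: it uses only feature 0.
def toy_model(row):
    return 1 if row[0] > 0.5 else 0


rows = [[0.1, 5.0], [0.2, 1.0], [0.3, 4.0], [0.4, 2.0],
        [0.6, 3.0], [0.7, 5.0], [0.8, 1.0], [0.9, 2.0]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
importances = permutation_importance(toy_model, rows, labels)
```

Because the toy model ignores feature 1, shuffling that column never changes a prediction and its importance is exactly zero, whereas shuffling feature 0 lowers accuracy. PFI gives a global ranking of this kind, while LIME and SHAP attribute individual predictions.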
Affiliation(s)
- Yulu Zheng
- Centre for Precision Health, Edith Cowan University, 270 Joondalup Drive, Joondalup, 6027 Western Australia, Australia
- Zheng Guo
- Centre for Precision Health, Edith Cowan University, 270 Joondalup Drive, Joondalup, 6027 Western Australia, Australia
- Yanbo Zhang
- The Second Affiliated Hospital of Shandong First Medical University, Tai’an, Shandong, China
- Leilei Yu
- Tai’an City Central Hospital, Tai’an, Shandong, China
- Ping Fu
- Ti’men Township Central Hospital, Tai’an, Shandong, China
- Yizhi Liu
- School of Public Health, Shandong First Medical University & Shandong Academy of Medical Sciences, 619 Changcheng Road, Tai’an, 271016 Shandong, China
- Xingang Li
- Centre for Precision Health, Edith Cowan University, 270 Joondalup Drive, Joondalup, 6027 Western Australia, Australia
- Hao Wang
- Department of Clinical Epidemiology and Evidence-Based Medicine, National Clinical Research Centre for Digestive Disease, Beijing Friendship Hospital, Capital Medical University, Beijing, China
- Beijing Key Laboratory of Clinical Epidemiology, School of Public Health, Capital Medical University, Beijing, China
- Ling Ren
- Beijing United Family Hospital, No.2 Jiangtai Road, Chaoyang District, Beijing, China
- Wei Zhang
- Centre for Cognitive Neurology, Department of Neurology, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
- Haifeng Hou
- Centre for Precision Health, Edith Cowan University, 270 Joondalup Drive, Joondalup, 6027 Western Australia, Australia
- The Second Affiliated Hospital of Shandong First Medical University, Tai’an, Shandong, China
- School of Public Health, Shandong First Medical University & Shandong Academy of Medical Sciences, 619 Changcheng Road, Tai’an, 271016 Shandong, China
- Xuerui Tan
- The First Affiliated Hospital of Shantou University Medical College, Shantou, Guangdong, China
- Wei Wang
- Centre for Precision Health, Edith Cowan University, 270 Joondalup Drive, Joondalup, 6027 Western Australia, Australia
- School of Public Health, Shandong First Medical University & Shandong Academy of Medical Sciences, 619 Changcheng Road, Tai’an, 271016 Shandong, China
- Beijing Key Laboratory of Clinical Epidemiology, School of Public Health, Capital Medical University, Beijing, China
- The First Affiliated Hospital of Shantou University Medical College, Shantou, Guangdong, China
- Institute for Nutrition Research, Edith Cowan University, Joondalup, WA, Australia
|
23
|
Wittscher L, Diers J, Pigorsch C. Improving image classification robustness using self‐supervision. Stat (Int Stat Inst) 2022. [DOI: 10.1002/sta4.455] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/09/2022]
Affiliation(s)
- Ladyna Wittscher
- Economic and Social Statistics, Friedrich‐Schiller‐University Jena, Jena, Germany
- Jan Diers
- Economic and Social Statistics, Friedrich‐Schiller‐University Jena, Jena, Germany
- Christian Pigorsch
- Economic and Social Statistics, Friedrich‐Schiller‐University Jena, Jena, Germany
|
24
|
Romano JD, Le TT, La Cava W, Gregg JT, Goldberg DJ, Chakraborty P, Ray NL, Himmelstein D, Fu W, Moore JH. PMLB v1.0: an open-source dataset collection for benchmarking machine learning methods. Bioinformatics 2022; 38:878-880. [PMID: 34677586 PMCID: PMC8756190 DOI: 10.1093/bioinformatics/btab727] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 04/02/2021] [Revised: 08/17/2021] [Accepted: 10/18/2021] [Indexed: 02/04/2023] Open
Abstract
MOTIVATION Novel machine learning and statistical modeling studies rely on standardized comparisons to existing methods using well-studied benchmark datasets. Few tools exist that provide rapid access to many of these datasets through a standardized, user-friendly interface that integrates well with popular data science workflows. RESULTS This release of PMLB (Penn Machine Learning Benchmarks) provides the largest collection of diverse, public benchmark datasets for evaluating new machine learning and data science methods aggregated in one location. v1.0 introduces a number of critical improvements developed following discussions with the open-source community. AVAILABILITY AND IMPLEMENTATION PMLB is available at https://github.com/EpistasisLab/pmlb. Python and R interfaces for PMLB can be installed through the Python Package Index and Comprehensive R Archive Network, respectively.
Affiliation(s)
- Joseph D Romano
- Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Center of Excellence in Environmental Toxicology, University of Pennsylvania, Philadelphia, PA 19104, USA
- Trang T Le
- Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
- William La Cava
- Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
- John T Gregg
- Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Daniel J Goldberg
- Department of Computer Science & Engineering, Washington University in St. Louis, St. Louis, MO 63130, USA
- Praneel Chakraborty
- School of Arts and Sciences, University of Pennsylvania, Philadelphia, PA 19104, USA
- Wharton School, University of Pennsylvania, Philadelphia, PA 19104, USA
- Daniel Himmelstein
- Related Sciences, Denver, CO 80220, USA
- Department of Systems Pharmacology & Translational Therapeutics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Weixuan Fu
- Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Jason H Moore
- Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
|
25
|
Glaab E, Rauschenberger A, Banzi R, Gerardi C, Garcia P, Demotes J. Biomarker discovery studies for patient stratification using machine learning analysis of omics data: a scoping review. BMJ Open 2021; 11:e053674. [PMID: 34873011 PMCID: PMC8650485 DOI: 10.1136/bmjopen-2021-053674] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 05/20/2021] [Accepted: 11/09/2021] [Indexed: 12/12/2022] Open
Abstract
OBJECTIVE To review biomarker discovery studies using omics data for patient stratification which led to clinically validated FDA-cleared tests or laboratory developed tests, in order to identify common characteristics and derive recommendations for future biomarker projects. DESIGN Scoping review. METHODS We searched PubMed, EMBASE and Web of Science to obtain a comprehensive list of articles from the biomedical literature published between January 2000 and July 2021, describing clinically validated biomarker signatures for patient stratification, derived using statistical learning approaches. All documents were screened to retain only peer-reviewed research articles, review articles or opinion articles, covering supervised and unsupervised machine learning applications for omics-based patient stratification. Two reviewers independently confirmed the eligibility. Disagreements were solved by consensus. We focused the final analysis on omics-based biomarkers which achieved the highest level of validation, that is, clinical approval of the developed molecular signature as a laboratory developed test or FDA approved tests. RESULTS Overall, 352 articles fulfilled the eligibility criteria. The analysis of validated biomarker signatures identified multiple common methodological and practical features that may explain the successful test development and guide future biomarker projects. These include study design choices to ensure sufficient statistical power for model building and external testing, suitable combinations of non-targeted and targeted measurement technologies, the integration of prior biological knowledge, strict filtering and inclusion/exclusion criteria, and the adequacy of statistical and machine learning methods for discovery and validation. 
CONCLUSIONS While most clinically validated biomarker models derived from omics data have been developed for personalised oncology, first applications for non-cancer diseases show the potential of multivariate omics biomarker design for other complex disorders. Distinctive characteristics of prior success stories, such as early filtering and robust discovery approaches, continuous improvements in assay design and experimental measurement technology, and rigorous multicohort validation approaches, enable the derivation of specific recommendations for future studies.
Affiliation(s)
- Enrico Glaab
- Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
- Armin Rauschenberger
- Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
- Rita Banzi
- Center for Health Regulatory Policies, Istituto di Ricerche Farmacologiche Mario Negri IRCCS, Milano, Italy
- Chiara Gerardi
- Center for Health Regulatory Policies, Istituto di Ricerche Farmacologiche Mario Negri IRCCS, Milano, Italy
- Paula Garcia
- European Clinical Research Infrastructure Network, ECRIN, Paris, France
- Jacques Demotes
- European Clinical Research Infrastructure Network, ECRIN, Paris, France
|
26
|
La Cava W, Burlacu B, Virgolin M, Kommenda M, Orzechowski P, de França FO, Jin Y, Moore JH. Contemporary Symbolic Regression Methods and their Relative Performance. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 2021; 2021:1-16. [PMID: 38715933 PMCID: PMC11074949] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/12/2024]
Abstract
Many promising approaches to symbolic regression have been presented in recent years, yet progress in the field continues to suffer from a lack of uniform, robust, and transparent benchmarking standards. We address this shortcoming by introducing an open-source, reproducible benchmarking platform for symbolic regression. We assess 14 symbolic regression methods and 7 machine learning methods on a set of 252 diverse regression problems. Our assessment includes both real-world datasets with no known model form as well as ground-truth benchmark problems. For the real-world datasets, we benchmark the ability of each method to learn models with low error and low complexity relative to state-of-the-art machine learning methods. For the synthetic problems, we assess each method's ability to find exact solutions in the presence of varying levels of noise. Under these controlled experiments, we conclude that the best performing methods for real-world regression combine genetic algorithms with parameter estimation and/or semantic search drivers. When tasked with recovering exact equations in the presence of noise, we find that several approaches perform similarly. We provide a detailed guide to reproducing this experiment and contributing new methods, and encourage other researchers to collaborate with us on a common and living symbolic regression benchmark.
Affiliation(s)
- Bogdan Burlacu
- Josef Ressel Center for Symbolic Regression, University of Applied Sciences Upper Austria
- Marco Virgolin
- Life Sciences and Health Group, Centrum Wiskunde & Informatica
- Michael Kommenda
- Josef Ressel Center for Symbolic Regression, University of Applied Sciences Upper Austria
- Ying Jin
- Department of Statistics, Stanford University
- Jason H Moore
- Institute for Biomedical Informatics, University of Pennsylvania
|
27
|
Bikia V, Fong T, Climie RE, Bruno RM, Hametner B, Mayer C, Terentes-Printzios D, Charlton PH. Leveraging the potential of machine learning for assessing vascular ageing: state-of-the-art and future research. EUROPEAN HEART JOURNAL. DIGITAL HEALTH 2021; 2:676-690. [PMID: 35316972 PMCID: PMC7612526 DOI: 10.1093/ehjdh/ztab089] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 01/30/2023]
Abstract
Vascular ageing biomarkers have been found to be predictive of cardiovascular risk independently of classical risk factors, yet are not widely used in clinical practice. In this review, we present two basic approaches for using machine learning (ML) to assess vascular age: parameter estimation and risk classification. We then summarize their role in developing new techniques to assess vascular ageing quickly and accurately. We discuss the methods used to validate ML-based markers, the evidence for their clinical utility, and key directions for future research. The review is complemented by case studies of the use of ML in vascular age assessment which can be replicated using freely available data and code.
Affiliation(s)
- Vasiliki Bikia
- Laboratory of Hemodynamics and Cardiovascular Technology (LHTC), Swiss Federal Institute of Technology, CH-1015 Lausanne, Vaud, Switzerland
- Terence Fong
- Baker Heart and Diabetes Institute, 75 Commercial Rd, Melbourne, Victoria, 3004 Australia; Department of Cardiometabolic Health, Melbourne Medical School, University of Melbourne, Grattan Street, Parkville, Victoria, 3010 Australia
- Rachel E Climie
- Baker Heart and Diabetes Institute, 75 Commercial Rd, Melbourne, Victoria, 3004 Australia; Université de Paris, INSERM U970, Paris Cardiovascular Research Centre, Integrative Epidemiology of Cardiovascular Disease, Paris, France
- Rosa-Maria Bruno
- Université de Paris, INSERM U970, Paris Cardiovascular Research Centre, Integrative Epidemiology of Cardiovascular Disease, Paris, France
- Bernhard Hametner
- Center for Health & Bioresources, AIT Austrian Institute of Technology, Giefinggasse 4, 1210 Vienna, Austria
- Christopher Mayer
- Center for Health & Bioresources, AIT Austrian Institute of Technology, Giefinggasse 4, 1210 Vienna, Austria
- Dimitrios Terentes-Printzios
- First Department of Cardiology, Hippokration Hospital, Medical School, National and Kapodistrian University of Athens, 114 Vasilissis Sofias Avenue, 11527, Athens, Greece
- Peter H Charlton
- Department of Public Health and Primary Care, Strangeways Research Laboratory, 2 Worts' Causeway, Cambridge, CB1 8RN, UK; Research Centre for Biomedical Engineering, City, University of London, Northampton Square, London, EC1V 0HB, UK. Corresponding author.
|
28
|
Azad TD, Ehresman J, Ahmed AK, Staartjes VE, Lubelski D, Stienen MN, Veeravagu A, Ratliff JK. Fostering reproducibility and generalizability in machine learning for clinical prediction modeling in spine surgery. Spine J 2021; 21:1610-1616. [PMID: 33065274 DOI: 10.1016/j.spinee.2020.10.006] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Received: 05/27/2020] [Revised: 08/13/2020] [Accepted: 10/07/2020] [Indexed: 02/03/2023]
Abstract
As the use of machine learning algorithms in the development of clinical prediction models has increased, researchers are becoming more aware of the deleterious effects that stem from the lack of reporting standards. One of the most obvious consequences is the insufficient reproducibility found in current prediction models. In an attempt to characterize methods to improve reproducibility and to allow for better clinical performance, we utilize a previously proposed taxonomy that separates reproducibility into 3 components: technical, statistical, and conceptual reproducibility. By following this framework, we discuss common errors that lead to poor reproducibility, highlight the importance of generalizability when evaluating a ML model's performance, and provide suggestions to optimize generalizability to ensure adequate performance. These efforts are a necessity before such models are applied to patient care.
Affiliation(s)
- Tej D Azad
- Department of Neurosurgery, Johns Hopkins Hospital, 1800 Orleans Street, Baltimore, MD, USA 21287
- Jeff Ehresman
- Department of Neurosurgery, Johns Hopkins Hospital, 1800 Orleans Street, Baltimore, MD, USA 21287
- Ali Karim Ahmed
- Department of Neurosurgery, Johns Hopkins Hospital, 1800 Orleans Street, Baltimore, MD, USA 21287
- Victor E Staartjes
- Machine Intelligence in Clinical Neuroscience (MICN) Lab, Clinical Neuroscience Centre, University of Zurich, Switzerland; Department of Neurosurgery, University Hospital Zurich, Zurich, Switzerland
- Daniel Lubelski
- Department of Neurosurgery, Johns Hopkins Hospital, 1800 Orleans Street, Baltimore, MD, USA 21287
- Martin N Stienen
- Machine Intelligence in Clinical Neuroscience (MICN) Lab, Clinical Neuroscience Centre, University of Zurich, Switzerland; Department of Neurosurgery, University Hospital Zurich, Zurich, Switzerland
- Anand Veeravagu
- Department of Neurosurgery, Stanford University School of Medicine, Stanford, CA, USA
- John K Ratliff
- Department of Neurosurgery, Stanford University School of Medicine, Stanford, CA, USA
|
29
|
de Franca FO, Aldeia GSI. Interaction-Transformation Evolutionary Algorithm for Symbolic Regression. EVOLUTIONARY COMPUTATION 2021; 29:367-390. [PMID: 33306435 DOI: 10.1162/evco_a_00285] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 03/23/2020] [Accepted: 12/03/2020] [Indexed: 06/12/2023]
Abstract
Interaction-Transformation (IT) is a new representation for Symbolic Regression that reduces the space of solutions to a set of expressions that follow a specific structure. The potential of this representation was illustrated in prior work with the algorithm called SymTree. This algorithm starts with a simple linear model and incrementally introduces new transformed features until a stop criterion is met. While the results obtained by this algorithm were competitive with the literature, it had the drawback of not scaling well with the problem dimension. This article introduces a mutation-only Evolutionary Algorithm, called ITEA, capable of evolving a population of IT expressions. One advantage of this algorithm is that it enables the user to specify the maximum number of terms in an expression. In order to verify the competitiveness of this approach, ITEA is compared to linear, nonlinear, and Symbolic Regression models from the literature. The results indicate that ITEA is capable of finding equal or better approximations than other Symbolic Regression models while being competitive to state-of-the-art nonlinear models. Additionally, since this representation follows a specific structure, it is possible to extract the importance of each original feature of a data set as an analytical function, enabling us to automate the explanation of any prediction. In conclusion, ITEA is competitive when comparing to regression models with the additional benefit of automating the extraction of additional information of the generated models.
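The abstract does not spell out the "specific structure" of an IT expression. The sketch below assumes the form usually given for this representation: a weighted sum of terms, each applying a one-argument transformation to a product of inputs raised to integer exponents. The function names, the example expression, and the intercept parameter are illustrative assumptions, not the ITEA implementation.

```python
import math


def interaction(x, exponents):
    """Product of inputs raised to integer exponents: prod_j x_j ** k_j."""
    result = 1.0
    for value, k in zip(x, exponents):
        result *= value ** k
    return result


def evaluate_it(x, terms, intercept=0.0):
    """Evaluate intercept + sum_i w_i * t_i(interaction_i(x)).

    Each term is a (weight, transformation, exponents) triple.
    """
    return intercept + sum(w * t(interaction(x, ks)) for w, t, ks in terms)


# Hypothetical two-term expression:
#   2.0 * sqrt(x0 * x1) + 0.5 * (x0**2 / x1)
terms = [
    (2.0, math.sqrt, (1, 1)),
    (0.5, lambda z: z, (2, -1)),
]
value = evaluate_it([4.0, 1.0], terms)
```

Under this assumed form, capping the maximum number of terms, as the abstract says ITEA allows, amounts to bounding `len(terms)`. A mutation-only search would add, drop, or alter individual triples rather than recombine whole expressions.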
Affiliation(s)
- F O de Franca
- Center for Mathematics, Computation and Cognition, Heuristics, Analysis and Learning Laboratory, Federal University of ABC, Santo Andre, Brazil
- G S I Aldeia
- Center for Mathematics, Computation and Cognition, Heuristics, Analysis and Learning Laboratory, Federal University of ABC, Santo Andre, Brazil
|
30
|
Gu Q, Kumar A, Bray S, Creason A, Khanteymoori A, Jalili V, Grüning B, Goecks J. Galaxy-ML: An accessible, reproducible, and scalable machine learning toolkit for biomedicine. PLoS Comput Biol 2021; 17:e1009014. [PMID: 34061826 PMCID: PMC8213174 DOI: 10.1371/journal.pcbi.1009014] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 01/17/2021] [Revised: 06/18/2021] [Accepted: 04/27/2021] [Indexed: 11/25/2022] Open
Abstract
Supervised machine learning is an essential but difficult-to-use approach in biomedical data analysis. The Galaxy-ML toolkit (https://galaxyproject.org/community/machine-learning/) makes supervised machine learning more accessible to biomedical scientists by enabling them to perform end-to-end reproducible machine learning analyses at large scale using only a web browser. Galaxy-ML extends Galaxy (https://galaxyproject.org), a biomedical computational workbench used by tens of thousands of scientists across the world, with a suite of tools for all aspects of supervised machine learning.
Affiliation(s)
- Qiang Gu
- Department of Biomedical Engineering, Oregon Health & Science University, Portland, Oregon, United States of America
- The Knight Cancer Institute, Oregon Health & Science University, Portland, Oregon, United States of America
- Anup Kumar
- Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg, Germany
- Simon Bray
- Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg, Germany
- Allison Creason
- Department of Biomedical Engineering, Oregon Health & Science University, Portland, Oregon, United States of America
- The Knight Cancer Institute, Oregon Health & Science University, Portland, Oregon, United States of America
- Alireza Khanteymoori
- Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg, Germany
- Vahid Jalili
- Department of Biomedical Engineering, Oregon Health & Science University, Portland, Oregon, United States of America
- The Knight Cancer Institute, Oregon Health & Science University, Portland, Oregon, United States of America
- Björn Grüning
- Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg, Germany
- Jeremy Goecks
- Department of Biomedical Engineering, Oregon Health & Science University, Portland, Oregon, United States of America
- The Knight Cancer Institute, Oregon Health & Science University, Portland, Oregon, United States of America
|
31
|
|
32
|
Bridgelall R, Tolliver DD. Railroad accident analysis using extreme gradient boosting. Accident Analysis and Prevention 2021; 156:106126. [PMID: 33878573 DOI: 10.1016/j.aap.2021.106126] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/23/2020] [Revised: 03/19/2021] [Accepted: 04/03/2021] [Indexed: 06/12/2023]
Abstract
Railroads are critical to the economic health of a nation. Unfortunately, railroads lose hundreds of millions of dollars from accidents each year. Trends reveal that derailments consistently account for more than 70 % of the U.S. railroad industry's average annual accident cost. Hence, knowledge of explanatory factors that distinguish derailments from other accident types can inform more cost-effective and impactful railroad risk management strategies. Five feature scoring methods, including ANOVA and Gini, agreed that the top four explanatory factors in accident type prediction were track class, type of movement authority, excess speed, and territory signalization. Among 11 different types of machine learning algorithms, the extreme gradient boosting method was most effective at predicting the accident type, with an area under the receiver operating characteristic curve (AUC) of 89 %. Principal component analysis revealed that, relative to other accident types, derailments were more strongly associated with lower track classes, non-signalized territories, and movement authorizations within restricted limits. On average, derailments occurred at 16 kph below the speed limit for the track class, whereas other accident types occurred at 32 kph below the speed limit. Railroads can use the integrated data preparation, machine learning, and feature ranking framework presented to gain additional insights for managing risk, based on their unique operating environments.
Affiliation(s)
- Raj Bridgelall
  - Department of Transportation, Logistics & Finance, College of Business, North Dakota State University, Fargo, ND, 58108, United States.
- Denver D Tolliver
  - Upper Great Plains Transportation Institute, North Dakota State University, Fargo, ND, 58108, United States.
33
Kim S, Jeong M, Ko BC. Lightweight surrogate random forest support for model simplification and feature relevance. APPL INTELL 2021. [DOI: 10.1007/s10489-021-02451-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 10/21/2022]
34
La Cava W, Williams H, Fu W, Vitale S, Srivatsan D, Moore JH. Evaluating recommender systems for AI-driven biomedical informatics. Bioinformatics 2021; 37:250-256. [PMID: 32766825 PMCID: PMC8055228 DOI: 10.1093/bioinformatics/btaa698] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 04/06/2020] [Revised: 06/23/2020] [Accepted: 07/27/2020] [Indexed: 11/13/2022] Open
Abstract
Motivation Many researchers with domain expertise are unable to easily apply machine learning (ML) to their bioinformatics data due to a lack of ML and/or coding expertise. Methods that have been proposed thus far to automate ML mostly require programming experience as well as expert knowledge to tune and apply the algorithms correctly. Here, we study a method of automating biomedical data science using a web-based AI platform to recommend model choices and conduct experiments. We have two goals in mind: first, to make it easy to construct sophisticated models of biomedical processes; and second, to provide a fully automated AI agent that can choose and conduct promising experiments for the user, based on the user’s experiments as well as prior knowledge. To validate this framework, we conduct an experiment on 165 classification problems, comparing to state-of-the-art, automated approaches. Finally, we use this tool to develop predictive models of septic shock in critical care patients. Results We find that matrix factorization-based recommendation systems outperform metalearning methods for automating ML. This result mirrors the results of earlier recommender systems research in other domains. The proposed AI is competitive with state-of-the-art automated ML methods in terms of choosing optimal algorithm configurations for datasets. In our application to prediction of septic shock, the AI-driven analysis produces a competent ML model (AUROC 0.85±0.02) that performs on par with state-of-the-art deep learning results for this task, with much less computational effort. Availability and implementation PennAI is available free of charge and open-source. It is distributed under the GNU public license (GPL) version 3. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- William La Cava
  - Institute for Biomedical Informatics, Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Heather Williams
  - Institute for Biomedical Informatics, Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Weixuan Fu
  - Institute for Biomedical Informatics, Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Steve Vitale
  - Institute for Biomedical Informatics, Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Durga Srivatsan
  - Institute for Biomedical Informatics, Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Jason H Moore
  - Institute for Biomedical Informatics, Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
36
Moreno-Indias I, Lahti L, Nedyalkova M, Elbere I, Roshchupkin G, Adilovic M, Aydemir O, Bakir-Gungor B, Santa Pau ECD, D’Elia D, Desai MS, Falquet L, Gundogdu A, Hron K, Klammsteiner T, Lopes MB, Marcos-Zambrano LJ, Marques C, Mason M, May P, Pašić L, Pio G, Pongor S, Promponas VJ, Przymus P, Saez-Rodriguez J, Sampri A, Shigdel R, Stres B, Suharoschi R, Truu J, Truică CO, Vilne B, Vlachakis D, Yilmaz E, Zeller G, Zomer AL, Gómez-Cabrero D, Claesson MJ. Statistical and Machine Learning Techniques in Human Microbiome Studies: Contemporary Challenges and Solutions. Front Microbiol 2021; 12:635781. [PMID: 33692771 PMCID: PMC7937616 DOI: 10.3389/fmicb.2021.635781] [Citation(s) in RCA: 39] [Impact Index Per Article: 13.0] [Received: 11/30/2020] [Accepted: 01/28/2021] [Indexed: 12/23/2022] Open
Abstract
The human microbiome has emerged as a central research topic in human biology and biomedicine. Current microbiome studies generate high-throughput omics data across different body sites, populations, and life stages. Many of the challenges in microbiome research are similar to those of other high-throughput studies: the quantitative analyses need to address the heterogeneity of data, specific statistical properties, and the remarkable variation in microbiome composition across individuals and body sites. This has led to a broad spectrum of statistical and machine learning challenges that range from study design, data processing, and standardization to analysis, modeling, cross-study comparison, prediction, data science ecosystems, and reproducible reporting. Nevertheless, although many statistics and machine learning approaches and tools have been developed, new techniques are needed to deal with emerging applications and the vast heterogeneity of microbiome data. We review and discuss emerging applications of statistical and machine learning techniques in human microbiome studies and introduce the COST Action CA18131 "ML4Microbiome" that brings together microbiome researchers and machine learning experts to address current challenges such as standardization of analysis pipelines for reproducibility of data analysis results, benchmarking, improvement, or development of existing and new tools and ontologies.
Affiliation(s)
- Isabel Moreno-Indias
  - Instituto de Investigación Biomédica de Málaga (IBIMA), Unidad de Gestión Clínica de Endocrinología y Nutrición, Hospital Clínico Universitario Virgen de la Victoria, Universidad de Málaga, Málaga, Spain
  - Centro de Investigación Biomédica en Red de Fisiopatología de la Obesidad y la Nutrición (CIBEROBN), Instituto de Salud Carlos III, Madrid, Spain
- Leo Lahti
  - Department of Computing, University of Turku, Turku, Finland
- Miroslava Nedyalkova
  - Human Genetics and Disease Mechanisms, Latvian Biomedical Research and Study Centre, Riga, Latvia
- Ilze Elbere
  - Latvian Biomedical Research and Study Centre, Riga, Latvia
- Muhamed Adilovic
  - Department of Genetics and Bioengineering, International University of Sarajevo, Sarajevo, Bosnia and Herzegovina
- Onder Aydemir
  - Department of Electrical and Electronics Engineering, Karadeniz Technical University, Trabzon, Turkey
- Burcu Bakir-Gungor
  - Department of Computer Engineering, Abdullah Gul University, Kayseri, Turkey
- Domenica D’Elia
  - Department for Biomedical Sciences, Institute for Biomedical Technologies, National Research Council, Bari, Italy
- Mahesh S. Desai
  - Department of Infection and Immunity, Luxembourg Institute of Health, Esch-sur-Alzette, Luxembourg
  - Odense Research Center for Anaphylaxis, Department of Dermatology and Allergy Center, Odense University Hospital, University of Southern Denmark, Odense, Denmark
- Laurent Falquet
  - Department of Biology, University of Fribourg, Fribourg, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Aycan Gundogdu
  - Department of Microbiology and Clinical Microbiology, Faculty of Medicine, Erciyes University, Kayseri, Turkey
  - Metagenomics Laboratory, Genome and Stem Cell Center (GenKök), Erciyes University, Kayseri, Turkey
- Karel Hron
  - Department of Mathematical Analysis and Applications of Mathematics, Palacký University, Olomouc, Czechia
- Marta B. Lopes
  - NOVA Laboratory for Computer Science and Informatics (NOVA LINCS), FCT, UNL, Caparica, Portugal
  - Centro de Matemática e Aplicações (CMA), FCT, UNL, Caparica, Portugal
- Laura Judith Marcos-Zambrano
  - Computational Biology Group, Precision Nutrition and Cancer Research Program, IMDEA Food Institute, Madrid, Spain
- Cláudia Marques
  - CINTESIS, NOVA Medical School, NMS, Universidade Nova de Lisboa, Lisbon, Portugal
- Michael Mason
  - Computational Oncology, Sage Bionetworks, Seattle, WA, United States
- Patrick May
  - Bioinformatics Core, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
- Lejla Pašić
  - Sarajevo Medical School, University Sarajevo School of Science and Technology, Sarajevo, Bosnia and Herzegovina
- Gianvito Pio
  - Department of Computer Science, University of Bari Aldo Moro, Bari, Italy
- Sándor Pongor
  - Faculty of Information Technology and Bionics, Pázmány University, Budapest, Hungary
- Vasilis J. Promponas
  - Bioinformatics Research Laboratory, Department of Biological Sciences, University of Cyprus, Nicosia, Cyprus
- Piotr Przymus
  - Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Toruń, Poland
- Julio Saez-Rodriguez
  - Institute of Computational Biomedicine, Heidelberg University, Faculty of Medicine and Heidelberg University Hospital, Heidelberg, Germany
- Alexia Sampri
  - Division of Informatics, Imaging and Data Sciences, School of Health Sciences, University of Manchester, Manchester, United Kingdom
- Rajesh Shigdel
  - Department of Clinical Science, University of Bergen, Bergen, Norway
- Blaz Stres
  - Jozef Stefan Institute, Ljubljana, Slovenia
  - Biotechnical Faculty, University of Ljubljana, Ljubljana, Slovenia
  - Faculty of Civil and Geodetic Engineering, University of Ljubljana, Ljubljana, Slovenia
- Ramona Suharoschi
  - Molecular Nutrition and Proteomics Lab, Faculty of the Food Science and Technology, Institute of Life Sciences, University of Agricultural Sciences and Veterinary Medicine of Cluj-Napoca, Cluj-Napoca, Romania
- Jaak Truu
  - Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
- Ciprian-Octavian Truică
  - Department of Computer Science and Engineering, Faculty of Automatic Control and Computers, University Politehnica of Bucharest, Bucharest, Romania
- Baiba Vilne
  - Bioinformatics Research Unit, Riga Stradins University, Riga, Latvia
- Dimitrios Vlachakis
  - Laboratory of Genetics, Department of Biotechnology, School of Applied Biology and Biotechnology, Agricultural University of Athens, Athens, Greece
- Ercument Yilmaz
  - Department of Computer Technologies, Karadeniz Technical University, Trabzon, Turkey
- Georg Zeller
  - European Molecular Biology Laboratory, Structural and Computational Biology Unit, Heidelberg, Germany
- Aldert L. Zomer
  - Department of Infectious Diseases and Immunology, Faculty of Veterinary Medicine, Utrecht University, Utrecht, Netherlands
- David Gómez-Cabrero
  - Navarrabiomed, Complejo Hospitalario de Navarra (CHN), IdiSNA, Universidad Pública de Navarra (UPNA), Pamplona, Spain
- Marcus J. Claesson
  - School of Microbiology and APC Microbiome Ireland, University College Cork, Cork, Ireland
37
Sipper M, Moore JH. Conservation machine learning: a case study of random forests. Sci Rep 2021; 11:3629. [PMID: 33574563 PMCID: PMC7878914 DOI: 10.1038/s41598-021-83247-4] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 08/23/2020] [Accepted: 02/01/2021] [Indexed: 11/19/2022] Open
Abstract
Conservation machine learning conserves models across runs, users, and experiments, and puts them to good use. We have previously shown the merit of this idea through a small-scale preliminary experiment, involving a single dataset source, 10 datasets, and a single so-called cultivation method (used to produce the final ensemble). In this paper, focusing on classification tasks, we perform extensive experimentation with conservation random forests, involving 5 cultivation methods (including a novel one introduced herein, lexigarden), 6 dataset sources, and 31 datasets. We show that significant improvement can be attained by making use of models we are already in possession of anyway, and envisage the possibility of repositories of models (not merely datasets, solutions, or code), which could be made available to everyone, thus having conservation live up to its name, furthering the cause of data and computational science.
Affiliation(s)
- Moshe Sipper
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, 19104-6021, USA.
  - Department of Computer Science, Ben-Gurion University, Beer Sheva, 84105, Israel.
- Jason H Moore
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, 19104-6021, USA
38
Orlenko A, Moore JH. A comparison of methods for interpreting random forest models of genetic association in the presence of non-additive interactions. BioData Min 2021; 14:9. [PMID: 33514397 PMCID: PMC7847145 DOI: 10.1186/s13040-021-00243-0] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 07/01/2020] [Accepted: 01/13/2021] [Indexed: 01/19/2023] Open
Abstract
BACKGROUND Non-additive interactions among genes are frequently associated with a number of phenotypes, including known complex diseases such as Alzheimer's, diabetes, and cardiovascular disease. Detecting interactions requires careful selection of analytical methods, and some machine learning algorithms are unable or underpowered to detect or model feature interactions that exhibit non-additivity. The Random Forest method is often employed in these efforts due to its ability to detect and model non-additive interactions. In addition, Random Forest has the built-in ability to estimate feature importance scores, a characteristic that allows the model to be interpreted with the order and effect size of the feature association with the outcome. This characteristic is very important for epidemiological and clinical studies where results of predictive modeling could be used to define the future direction of the research efforts. Alternative ways to interpret the model are the permutation feature importance metric, which employs a permutation approach to calculate a feature contribution coefficient in units of the decrease in the model's performance, and the Shapley additive explanations, which employ a cooperative game theory approach. Currently, it is unclear which Random Forest feature importance metric provides a superior estimation of the true informative contribution of features in genetic association analysis. RESULTS To address this issue, and to improve interpretability of Random Forest predictions, we compared different methods for feature importance estimation in real and simulated datasets with non-additive interactions. As a result, we detected a discrepancy between the metrics for the real-world datasets and further established that the permutation feature importance metric provides more precise feature importance rank estimation for the simulated datasets with non-additive interactions.
CONCLUSIONS By analyzing both real and simulated data, we established that the permutation feature importance metric provides more precise feature importance rank estimation in the presence of non-additive interactions.
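The permutation feature importance idea described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: to stay dependent on NumPy only, it swaps the Random Forest for an ordinary least-squares model fitted on synthetic data, and the function names, the R² metric, and all data values are hypothetical.

```python
import numpy as np


def r2_score(y_true, y_pred):
    """Coefficient of determination, used here as the performance metric."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def permutation_importance(predict, X, y, metric, n_repeats=10, seed=0):
    """Mean drop in `metric` when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = metric(y, predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling one column breaks its link to the outcome
            # while leaving its marginal distribution unchanged.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(baseline - metric(y, predict(X_perm)))
        importances[j] = np.mean(drops)
    return importances


# Synthetic data (assumption): feature 0 is strong, 1 is weak, 2 is noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

# Stand-in model (assumption): least squares instead of a Random Forest.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)


def predict(A):
    return A @ coef


importances = permutation_importance(predict, X, y, r2_score)
print(importances)
```

The importance is expressed in units of lost performance, which matches the abstract's description of a coefficient "in units of the decrease in the model's performance". In practice the scores are usually computed on held-out data rather than on the training set.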
Affiliation(s)
- Alena Orlenko
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, USA
- Jason H Moore
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, USA.
39
Kalyuzhnaya AV, Nikitin NO, Hvatov A, Maslyaev M, Yachmenkov M, Boukhanovsky A. Towards Generative Design of Computationally Efficient Mathematical Models with Evolutionary Learning. ENTROPY 2020; 23:e23010028. [PMID: 33375471 PMCID: PMC7823403 DOI: 10.3390/e23010028] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 11/09/2020] [Revised: 12/17/2020] [Accepted: 12/24/2020] [Indexed: 11/16/2022]
Abstract
In this paper, we describe the concept of a generative design approach applied to the automated evolutionary learning of mathematical models in a computationally efficient way. To formalize the problems of model design and co-design, a generalized formulation of the modeling workflow is proposed. A parallelized evolutionary learning approach for the identification of model structure is described for equation-based models and composite machine learning models. Moreover, the involvement of performance models in the design process is analyzed. A set of experiments with various models and computational resources is conducted to verify different aspects of the proposed approach.
40
Khomtchouk BB, Tran DT, Vand KA, Might M, Gozani O, Assimes TL. Cardioinformatics: the nexus of bioinformatics and precision cardiology. Brief Bioinform 2020; 21:2031-2051. [PMID: 31802103 PMCID: PMC7947182 DOI: 10.1093/bib/bbz119] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 04/29/2019] [Revised: 08/08/2019] [Accepted: 08/13/2019] [Indexed: 12/12/2022] Open
Abstract
Cardiovascular disease (CVD) is the leading cause of death worldwide, causing over 17 million deaths per year, which outpaces global cancer mortality rates. Despite these sobering statistics, most bioinformatics and computational biology research and funding to date has been concentrated predominantly on cancer research, with a relatively modest footprint in CVD. In this paper, we review the existing literary landscape and critically assess the unmet need to further develop an emerging field at the multidisciplinary interface of bioinformatics and precision cardiovascular medicine, which we refer to as 'cardioinformatics'.
Affiliation(s)
- Bohdan B Khomtchouk
  - Department of Biology, Stanford University, Stanford, CA, USA
  - Department of Medicine, Division of Cardiovascular Medicine, Stanford University, Stanford, CA, USA
  - VA Palo Alto Health Care System, Palo Alto, CA, USA
  - Department of Medicine, Section of Computational Biomedicine and Biomedical Data Science, University of Chicago, Chicago, IL, USA
- Diem-Trang Tran
  - School of Computing, University of Utah, Salt Lake City, UT, USA
- Matthew Might
  - Hugh Kaul Personalized Medicine Institute, University of Alabama at Birmingham, Birmingham, AL, USA
- Or Gozani
  - Department of Biology, Stanford University, Stanford, CA, USA
- Themistocles L Assimes
  - Department of Medicine, Division of Cardiovascular Medicine, Stanford University, Stanford, CA, USA
  - VA Palo Alto Health Care System, Palo Alto, CA, USA
41
Thiagarajan JJ, Venkatesh B, Anirudh R, Bremer PT, Gaffney J, Anderson G, Spears B. Designing accurate emulators for scientific processes using calibration-driven deep models. Nat Commun 2020; 11:5622. [PMID: 33159053 PMCID: PMC7648787 DOI: 10.1038/s41467-020-19448-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 04/20/2020] [Accepted: 09/21/2020] [Indexed: 01/16/2023] Open
Abstract
Predictive models that accurately emulate complex scientific processes can achieve speed-ups over numerical simulators or experiments and at the same time provide surrogates for improving the subsequent analysis. Consequently, there is a recent surge in utilizing modern machine learning methods to build data-driven emulators. In this work, we study an often overlooked, yet important, problem of choosing loss functions while designing such emulators. Popular choices such as the mean squared error or the mean absolute error are based on a symmetric noise assumption and can be unsuitable for heterogeneous data or asymmetric noise distributions. We propose Learn-by-Calibrating, a novel deep learning approach based on interval calibration for designing emulators that can effectively recover the inherent noise structure without any explicit priors. Using a large suite of use-cases, we demonstrate the efficacy of our approach in providing high-quality emulators, when compared to widely-adopted loss function choices, even in small-data regimes.
Affiliation(s)
- Jayaraman J Thiagarajan
  - Lawrence Livermore National Laboratory, Center for Applied Scientific Computing, Livermore, CA, USA.
- Bindya Venkatesh
  - School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ, USA
- Rushil Anirudh
  - Lawrence Livermore National Laboratory, Center for Applied Scientific Computing, Livermore, CA, USA
- Peer-Timo Bremer
  - Lawrence Livermore National Laboratory, Center for Applied Scientific Computing, Livermore, CA, USA
- Jim Gaffney
  - Lawrence Livermore National Laboratory, Center for Applied Scientific Computing, Livermore, CA, USA
- Gemma Anderson
  - Lawrence Livermore National Laboratory, Center for Applied Scientific Computing, Livermore, CA, USA
- Brian Spears
  - Lawrence Livermore National Laboratory, Center for Applied Scientific Computing, Livermore, CA, USA
42
Trujillo L, Álvarez González E, Galván E, Tapia JJ, Ponsich A. On the analysis of hyper-parameter space for a genetic programming system with iterated F-Race. Soft comput 2020. [DOI: 10.1007/s00500-020-04829-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 10/24/2022]
43
Kline A, Kline T, Shakeri Hossein Abad Z, Lee J. Using Item Response Theory for Explainable Machine Learning in Predicting Mortality in the Intensive Care Unit: Case-Based Approach. J Med Internet Res 2020; 22:e20268. [PMID: 32975523 PMCID: PMC7547395 DOI: 10.2196/20268] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 05/14/2020] [Revised: 07/02/2020] [Accepted: 08/08/2020] [Indexed: 01/29/2023] Open
Abstract
BACKGROUND Supervised machine learning (ML) is being featured in the health care literature with study results frequently reported using metrics such as accuracy, sensitivity, specificity, recall, or F1 score. Although each metric provides a different perspective on the performance, they remain to be overall measures for the whole sample, discounting the uniqueness of each case or patient. Intuitively, we know that all cases are not equal, but the present evaluative approaches do not take case difficulty into account. OBJECTIVE A more case-based, comprehensive approach is warranted to assess supervised ML outcomes and forms the rationale for this study. This study aims to demonstrate how the item response theory (IRT) can be used to stratify the data based on how difficult each case is to classify, independent of the outcome measure of interest (eg, accuracy). This stratification allows the evaluation of ML classifiers to take the form of a distribution rather than a single scalar value. METHODS Two large, public intensive care unit data sets, Medical Information Mart for Intensive Care III and electronic intensive care unit, were used to showcase this method in predicting mortality. For each data set, a balanced sample (n=8078 and n=21,940, respectively) and an imbalanced sample (n=12,117 and n=32,910, respectively) were drawn. A 2-parameter logistic model was used to provide scores for each case. Several ML algorithms were used in the demonstration to classify cases based on their health-related features: logistic regression, linear discriminant analysis, K-nearest neighbors, decision tree, naive Bayes, and a neural network. Generalized linear mixed model analyses were used to assess the effects of case difficulty strata, ML algorithm, and the interaction between them in predicting accuracy. 
RESULTS The results showed significant effects (P<.001) for case difficulty strata, ML algorithm, and their interaction in predicting accuracy and illustrated that all classifiers performed better with easier-to-classify cases and that overall the neural network performed best. Significant interactions suggest that cases that fall in the most arduous strata should be handled by logistic regression, linear discriminant analysis, decision tree, or neural network but not by naive Bayes or K-nearest neighbors. Conventional metrics for ML classification have been reported for methodological comparison. CONCLUSIONS This demonstration shows that using the IRT is a viable method for understanding the data that are provided to ML algorithms, independent of outcome measures, and highlights how well classifiers differentiate cases of varying difficulty. This method explains which features are indicative of healthy states and why. It enables end users to tailor the classifier that is appropriate to the difficulty level of the patient for personalized medicine.
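The 2-parameter logistic model mentioned in the methods can be illustrated with its standard textbook item response function. This is a sketch only: the abstract does not say how classifiers and cases were mapped onto respondents and items, so the function name, the parameter names, and the example values below are assumptions.

```python
import math


def two_pl_probability(theta: float, a: float, b: float) -> float:
    """Standard 2PL item response function.

    theta: ability of the respondent
    a:     discrimination of the item (slope of the curve)
    b:     difficulty of the item (ability at which P = 0.5)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical values: the same respondent facing an easy and a hard item.
easy_case = two_pl_probability(theta=0.0, a=1.5, b=-1.0)
hard_case = two_pl_probability(theta=0.0, a=1.5, b=1.0)
print(round(easy_case, 3), round(hard_case, 3))
```

Ranking cases by an estimated difficulty parameter such as `b` is one way to form the difficulty strata the abstract describes, so that classifier accuracy can be reported per stratum rather than as a single scalar.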
Affiliation(s)
- Adrienne Kline
  - Department of Biomedical Engineering, University of Calgary, Calgary, AB, Canada
  - Undergraduate Medical Education, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
  - Data Intelligence for Health Lab, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
- Theresa Kline
  - Department of Psychology, University of Calgary, Calgary, AB, Canada
- Zahra Shakeri Hossein Abad
  - Data Intelligence for Health Lab, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
  - Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
- Joon Lee
  - Data Intelligence for Health Lab, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
  - Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
  - Department of Cardiac Sciences, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
44
Abstract
Gaussian processes (GPs) are distributions over functions, which provide a Bayesian nonparametric approach to regression and classification. In spite of their success, GPs have limited use in some applications. For example, in some cases a distribution that is symmetric with respect to its mean is an unreasonable model. Symmetry implies, for instance, that the mean and the median coincide, while the mean and median of an asymmetric (skewed) distribution can be different numbers. In this paper, we propose skew-Gaussian processes (SkewGPs) as a non-parametric prior over functions. A SkewGP extends the multivariate unified skew-normal distribution over finite dimensional vectors to a stochastic process. The SkewGP class of distributions includes GPs and, therefore, SkewGPs inherit all good properties of GPs and increase their flexibility by allowing asymmetry in the probabilistic model. By exploiting the fact that the SkewGP and the probit likelihood form a conjugate model, we derive closed form expressions for the marginal likelihood and predictive distribution of this new nonparametric classifier. We verify empirically that the proposed SkewGP classifier provides a better performance than a GP classifier based on either Laplace’s method or expectation propagation.
45
La Cava W, Moore JH. Learning feature spaces for regression with genetic programming. GENETIC PROGRAMMING AND EVOLVABLE MACHINES 2020; 21:433-467. [PMID: 33343224 PMCID: PMC7748157 DOI: 10.1007/s10710-020-09383-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 10/15/2019] [Revised: 01/17/2020] [Indexed: 06/07/2023]
Abstract
Genetic programming has found recent success as a tool for learning sets of features for regression and classification. Multidimensional genetic programming is a useful variant of genetic programming for this task because it represents candidate solutions as sets of programs. These sets of programs expose additional information that can be exploited for building block identification. In this work, we discuss this architecture and others in terms of their propensity for allowing heuristic search to utilize information during the evolutionary process. We investigate methods for biasing the components of programs that are promoted in order to guide search towards useful and complementary feature spaces. We study two main approaches: 1) the introduction of new objectives and 2) the use of specialized semantic variation operators. We find that a semantic crossover operator based on stagewise regression leads to significant improvements on a set of regression problems. The inclusion of semantic crossover produces state-of-the-art results in a large benchmark study of open-source regression problems in comparison to several state-of-the-art machine learning approaches and other genetic programming frameworks. Finally, we look at the collinearity and complexity of the data representations produced by different methods, in order to assess whether relevant, concise, and independent factors of variation can be produced in application.
Affiliation(s)
- William La Cava
  - University of Pennsylvania, 3700 Hamilton Walk, Philadelphia, PA
- Jason H Moore
  - University of Pennsylvania, 3700 Hamilton Walk, Philadelphia, PA
|
46
|
Le TT, Fu W, Moore JH. Scaling tree-based automated machine learning to biomedical big data with a feature set selector. Bioinformatics 2020; 36:250-256. [PMID: 31165141 PMCID: PMC6956793 DOI: 10.1093/bioinformatics/btz470] [Citation(s) in RCA: 114] [Impact Index Per Article: 28.5] [Received: 01/03/2019] [Revised: 05/17/2019] [Accepted: 06/02/2019] [Indexed: 12/13/2022] Open
Abstract
Motivation Automated machine learning (AutoML) systems are helpful data science assistants designed to scan data for novel features, select appropriate supervised learning models and optimize their parameters. For this purpose, the Tree-based Pipeline Optimization Tool (TPOT) was developed using strongly typed genetic programming (GP) to recommend an optimized analysis pipeline for the data scientist's prediction problem. However, like other AutoML systems, TPOT may reach computational resource limits when working on big data such as whole-genome expression data. Results We introduce two new features implemented in TPOT that help increase the system's scalability: Feature Set Selector (FSS) and Template. FSS provides the option to specify subsets of the features as separate datasets, assuming the signals come from one or more of these specific data subsets. FSS increases TPOT's efficiency in application on big data by slicing the entire dataset into smaller sets of features and allowing GP to select the best subset in the final pipeline. Template enforces type constraints with strongly typed GP and enables the incorporation of FSS at the beginning of each pipeline. Consequently, FSS and Template help reduce TPOT computation time and may provide more interpretable results. Our simulations show that TPOT-FSS significantly outperforms a tuned XGBoost model and the standard TPOT implementation. We apply TPOT-FSS to real RNA-Seq data from a study of major depressive disorder. Independently of the previous study, which identified a significant association of two modules with depression severity, TPOT-FSS corroborates that one of the modules is largely predictive of the clinical diagnosis of each individual. Availability and implementation Detailed simulation and analysis code needed to reproduce the results in this study is available at https://github.com/lelaboratoire/tpot-fss. Implementation of the new TPOT operators is available at https://github.com/EpistasisLab/tpot.
Supplementary information Supplementary data are available at Bioinformatics online.
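The idea of slicing a dataset into named feature subsets and keeping the best one can be sketched in a few lines. This is not TPOT's implementation or API: the abstract describes GP choosing the subset inside a pipeline, whereas this sketch simply scores every subset exhaustively with a nearest-centroid classifier as a stand-in model. The helper names, the subset names `module_a` and `module_b`, and the synthetic data are all assumptions made for illustration.

```python
import random


def slice_columns(rows, columns):
    """Keep only the listed column indices from each row (the selector step)."""
    return [[row[c] for c in columns] for row in rows]


def centroid(rows):
    """Column-wise mean of a list of equal-length rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def nearest_centroid_accuracy(train_x, train_y, test_x, test_y):
    """Hold-out accuracy of a nearest-centroid classifier (stand-in model)."""
    centroids = {
        label: centroid([x for x, y in zip(train_x, train_y) if y == label])
        for label in set(train_y)
    }

    def predict(x):
        # Pick the class whose centroid has the smallest squared distance to x.
        return min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
        )

    hits = sum(predict(x) == y for x, y in zip(test_x, test_y))
    return hits / len(test_y)


def best_feature_set(x, y, feature_sets, n_train):
    """Score each named feature subset on a hold-out split; return the best."""
    scores = {}
    for name, cols in feature_sets.items():
        sub = slice_columns(x, cols)
        scores[name] = nearest_centroid_accuracy(
            sub[:n_train], y[:n_train], sub[n_train:], y[n_train:]
        )
    return max(scores, key=scores.get), scores


# Synthetic data (hypothetical): columns 0-1 carry the signal, 2-3 are noise.
rng = random.Random(0)
y = [i % 2 for i in range(200)]
x = [
    [2 * label + rng.gauss(0, 0.5), 2 * label + rng.gauss(0, 0.5),
     rng.gauss(0, 1), rng.gauss(0, 1)]
    for label in y
]
feature_sets = {"module_a": [0, 1], "module_b": [2, 3]}

best, scores = best_feature_set(x, y, feature_sets, n_train=100)
print(best, scores)
```

Each candidate model only ever sees the columns of one subset, which is where the reduction in computation comes from when the full feature matrix is very wide.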
Affiliation(s)
- Trang T Le
  - Department of Biostatistics, Epidemiology and Informatics, Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Weixuan Fu
  - Department of Biostatistics, Epidemiology and Informatics, Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Jason H Moore
  - Department of Biostatistics, Epidemiology and Informatics, Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
|
47
|
Tzanetos A, Dounias G. Nature inspired optimization algorithms or simply variations of metaheuristics? Artif Intell Rev 2020. [DOI: 10.1007/s10462-020-09893-8] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Indexed: 10/23/2022]
|
48
|
Verkhivker GM, Agajanian S, Hu G, Tao P. Allosteric Regulation at the Crossroads of New Technologies: Multiscale Modeling, Networks, and Machine Learning. Front Mol Biosci 2020; 7:136. [PMID: 32733918 PMCID: PMC7363947 DOI: 10.3389/fmolb.2020.00136] [Citation(s) in RCA: 40] [Impact Index Per Article: 10.0] [Received: 05/01/2020] [Accepted: 06/08/2020] [Indexed: 12/12/2022] Open
Abstract
Allosteric regulation is a common mechanism employed by complex biomolecular systems for regulation of activity and adaptability in the cellular environment, serving as an effective molecular tool for cellular communication. As an intrinsic but elusive property, allostery is a ubiquitous phenomenon where binding or disturbing of a distal site in a protein can functionally control its activity and is considered the "second secret of life." The fundamental biological importance and complexity of these processes require a multi-faceted platform of synergistically integrated approaches for prediction and characterization of allosteric functional states, atomistic reconstruction of allosteric regulatory mechanisms and discovery of allosteric modulators. The unifying theme and overarching goal of allosteric regulation studies in recent years have been the integration of emerging experimental and computational approaches and technologies to advance quantitative characterization of allosteric mechanisms in proteins. Despite significant advances, the quantitative characterization and reliable prediction of functional allosteric states, interactions, and mechanisms continue to present highly challenging problems in the field. In this review, we discuss simulation-based multiscale approaches, experiment-informed Markovian models, and network modeling of allostery and information-theoretical approaches that can describe the thermodynamics and hierarchy of allosteric states and the molecular basis of allosteric mechanisms. The wealth of structural and functional information along with diversity and complexity of allosteric mechanisms in therapeutically important protein families have provided a well-suited platform for development of data-driven research strategies. Data-centric integration of chemistry, biology and computer science using artificial intelligence technologies has gained significant momentum and is at the forefront of many cross-disciplinary efforts. We discuss new developments in the machine learning field and the emergence of deep learning and deep reinforcement learning applications in modeling of molecular mechanisms and allosteric proteins. The experiment-guided integrated approaches empowered by recent advances in multiscale modeling, network science, and machine learning can lead to more reliable prediction of allosteric regulatory mechanisms and discovery of allosteric modulators for therapeutically important protein targets.
Affiliation(s)
- Gennady M. Verkhivker
  - Graduate Program in Computational and Data Sciences, Schmid College of Science and Technology, Chapman University, Orange, CA, United States
  - Department of Biomedical and Pharmaceutical Sciences, Chapman University School of Pharmacy, Irvine, CA, United States
- Steve Agajanian
  - Graduate Program in Computational and Data Sciences, Schmid College of Science and Technology, Chapman University, Orange, CA, United States
- Guang Hu
  - Center for Systems Biology, Department of Bioinformatics, School of Biology and Basic Medical Sciences, Soochow University, Suzhou, China
- Peng Tao
  - Department of Chemistry, Center for Drug Discovery, Design, and Delivery (CD4), Center for Scientific Computation, Southern Methodist University, Dallas, TX, United States
|
49
|
|
50
|
Levy JJ, O'Malley AJ. Don't dismiss logistic regression: the case for sensible extraction of interactions in the era of machine learning. BMC Med Res Methodol 2020; 20:171. [PMID: 32600277 PMCID: PMC7325087 DOI: 10.1186/s12874-020-01046-3] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Received: 12/17/2019] [Accepted: 06/10/2020] [Indexed: 01/08/2023] Open
Abstract
BACKGROUND Machine learning approaches have become increasingly popular modeling techniques, relying on data-driven heuristics to arrive at their solutions. Recent comparisons between these algorithms and traditional statistical modeling techniques have largely ignored the superiority gained by the former approaches due to involvement of model-building search algorithms. This has led to alignment of statistical and machine learning approaches with different types of problems and the under-development of procedures that combine their attributes. In this context, we hoped to understand the domains of applicability for each approach and to identify areas where a marriage between the two approaches is warranted. We then sought to develop a hybrid statistical-machine learning procedure with the best attributes of each. METHODS We present three simple examples to illustrate when to use each modeling approach and posit a general framework for combining them into an enhanced logistic regression model building procedure that aids interpretation. We study 556 benchmark machine learning datasets to uncover when machine learning techniques outperformed rudimentary logistic regression models and so are potentially well-equipped to enhance them. We illustrate a software package, InteractionTransformer, which embeds logistic regression with advanced model building capacity by using machine learning algorithms to extract candidate interaction features from a random forest model for inclusion in the model. Finally, we apply our enhanced logistic regression analysis to two real-world biomedical examples, one where predictors vary linearly with the outcome and another with extensive second-order interactions. RESULTS Preliminary statistical analysis demonstrated that across 556 benchmark datasets, the random forest approach significantly outperformed the logistic regression approach. We found a statistically significant increase in predictive performance when using hybrid procedures and greater clarity in the association with the outcome of terms acquired compared to directly interpreting the random forest output. CONCLUSIONS When a random forest model is closer to the true model, hybrid statistical-machine learning procedures can substantially enhance the performance of statistical procedures in an automated manner while preserving easy interpretation of the results. Such hybrid methods may help facilitate widespread adoption of machine learning techniques in the biomedical setting.
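Why an added interaction term can rescue logistic regression is easy to see on a toy problem. The sketch below is not the InteractionTransformer package and does not use a random forest: it assumes the candidate interaction pair has already been identified (here it is hard-coded) and only shows the downstream effect of including the product term. All function names and the XOR-style data are hypothetical.

```python
import math


def fit_logistic(rows, labels, epochs=500, lr=0.5):
    """Fit logistic regression (bias + weights) by batch gradient descent."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            z = bias + sum(w * v for w, v in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y  # gradient of the log-loss with respect to z
            grad_b += err
            for j, v in enumerate(x):
                grad_w[j] += err * v
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return bias, weights


def accuracy(model, rows, labels):
    """Fraction of rows whose predicted class (z > 0 means 1) matches the label."""
    bias, weights = model
    hits = 0
    for x, y in zip(rows, labels):
        z = bias + sum(w * v for w, v in zip(weights, x))
        hits += int((z > 0) == (y == 1))
    return hits / len(labels)


def add_interactions(rows, pairs):
    """Append one product column x[i] * x[j] per candidate pair (i, j)."""
    return [list(x) + [x[i] * x[j] for i, j in pairs] for x in rows]


# Toy data with a pure second-order interaction: y = 1 when x0 and x1 agree.
base_rows = [[a, b] for a in (-1.0, 1.0) for b in (-1.0, 1.0)] * 10
labels = [1 if a * b > 0 else 0 for a, b in base_rows]

# Main effects only: no linear boundary separates this pattern.
main_only = fit_logistic(base_rows, labels)
main_acc = accuracy(main_only, base_rows, labels)

# Hybrid design matrix: main effects plus the assumed candidate pair (0, 1).
hybrid_rows = add_interactions(base_rows, pairs=[(0, 1)])
hybrid = fit_logistic(hybrid_rows, labels)
hybrid_acc = accuracy(hybrid, hybrid_rows, labels)

print(f"main effects only: {main_acc:.2f}, with interaction: {hybrid_acc:.2f}")
```

The fitted model remains an ordinary logistic regression, so the coefficient on the product column can be read as an interaction effect, which is the interpretability benefit the abstract refers to.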
Affiliation(s)
- Joshua J Levy
  - Program in Quantitative Biomedical Sciences, Geisel School of Medicine at Dartmouth, Hanover, USA
  - Department of Epidemiology, Geisel School of Medicine at Dartmouth, Hanover, USA
  - Department of Pathology, Geisel School of Medicine at Dartmouth, Hanover, USA
- A James O'Malley
  - Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth, Hanover, USA
  - The Dartmouth Institute for Health Policy and Clinical Practice, Geisel School of Medicine at Dartmouth, Hanover, USA
|