Reference Citation Analysis

For: Olson RS, La Cava W, Orzechowski P, Urbanowicz RJ, Moore JH. PMLB: a large benchmark suite for machine learning evaluation and comparison. BioData Min 2017;10:36. [PMID: 29238404 PMCID: PMC5725843 DOI: 10.1186/s13040-017-0154-4] [Citation(s) in RCA: 67] [Impact Index Per Article: 9.6] [Received: 03/09/2017] [Accepted: 11/07/2017] [Indexed: 11/10/2022]

Cited by Other Article(s)

Demircioğlu A. radMLBench: A dataset collection for benchmarking in radiomics. Comput Biol Med 2024;182:109140. [PMID: 39270457 DOI: 10.1016/j.compbiomed.2024.109140] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/07/2024] [Revised: 08/20/2024] [Accepted: 09/08/2024] [Indexed: 09/15/2024]

Peterson RA, McGrath M, Cavanaugh JE. Can a Transparent Machine Learning Algorithm Predict Better than Its Black Box Counterparts? A Benchmarking Study Using 110 Data Sets. ENTROPY (BASEL, SWITZERLAND) 2024;26:746. [PMID: 39330080 PMCID: PMC11431724 DOI: 10.3390/e26090746] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/27/2024] [Revised: 08/27/2024] [Accepted: 08/28/2024] [Indexed: 09/28/2024]

Abstract

We developed a novel machine learning (ML) algorithm with the goal of producing transparent models (i.e., understandable by humans) while also flexibly accounting for nonlinearity and interactions. Our method is based on ranked sparsity, and it allows for flexibility and user control in varying the shade of the opacity of black box machine learning methods. The main tenet of ranked sparsity is that an algorithm should be more skeptical of higher-order polynomials and interactions a priori compared to main effects, and hence, the inclusion of these more complex terms should require a higher level of evidence. In this work, we put our new ranked sparsity algorithm (as implemented in the open source R package, sparseR) to the test in a predictive model "bakeoff" (i.e., a benchmarking study of ML algorithms applied "out of the box", that is, with no special tuning). Algorithms were trained on a large set of simulated and real-world data sets from the Penn Machine Learning Benchmarks database, addressing both regression and binary classification problems. We evaluated the extent to which our human-centered algorithm can attain predictive accuracy that rivals popular black box approaches such as neural networks, random forests, and support vector machines, while also producing more interpretable models. Using out-of-bag error as a meta-outcome, we describe the properties of data sets in which human-centered approaches can perform as well as or better than black box approaches. We found that interpretable approaches predicted optimally or within 5% of the optimal method in most real-world data sets. We provide a more in-depth comparison of the performances of random forests to interpretable methods for several case studies, including exemplars in which algorithms performed similarly, and several cases when interpretable methods underperformed. 
This work provides a strong rationale for including human-centered transparent algorithms such as ours in predictive modeling applications.

Tjaden J, Tjaden B. MLpronto: A tool for democratizing machine learning. PLoS One 2023;18:e0294924. [PMID: 38032968 PMCID: PMC10688639 DOI: 10.1371/journal.pone.0294924] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/26/2023] [Accepted: 11/11/2023] [Indexed: 12/02/2023]

Ong W, Liu RW, Makmur A, Low XZ, Sng WJ, Tan JH, Kumar N, Hallinan JTPD. Artificial Intelligence Applications for Osteoporosis Classification Using Computed Tomography. Bioengineering (Basel) 2023;10:1364. [PMID: 38135954 PMCID: PMC10741220 DOI: 10.3390/bioengineering10121364] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/20/2023] [Revised: 11/21/2023] [Accepted: 11/23/2023] [Indexed: 12/24/2023]

Affiliation(s)

Wilson Ong: Department of Diagnostic Imaging, National University Hospital, 5 Lower Kent Ridge Rd, Singapore 119074, Singapore
Ren Wei Liu: Department of Diagnostic Imaging, National University Hospital, 5 Lower Kent Ridge Rd, Singapore 119074, Singapore
Andrew Makmur: Department of Diagnostic Imaging, National University Hospital, 5 Lower Kent Ridge Rd, Singapore 119074, Singapore; Department of Diagnostic Radiology, Yong Loo Lin School of Medicine, National University of Singapore, 10 Medical Drive, Singapore 117597, Singapore
Xi Zhen Low: Department of Diagnostic Imaging, National University Hospital, 5 Lower Kent Ridge Rd, Singapore 119074, Singapore; Department of Diagnostic Radiology, Yong Loo Lin School of Medicine, National University of Singapore, 10 Medical Drive, Singapore 117597, Singapore
Weizhong Jonathan Sng: Department of Diagnostic Imaging, National University Hospital, 5 Lower Kent Ridge Rd, Singapore 119074, Singapore; Department of Diagnostic Radiology, Yong Loo Lin School of Medicine, National University of Singapore, 10 Medical Drive, Singapore 117597, Singapore
Jiong Hao Tan: University Spine Centre, Department of Orthopaedic Surgery, National University Health System, 1E Lower Kent Ridge Road, Singapore 119228, Singapore
Naresh Kumar: University Spine Centre, Department of Orthopaedic Surgery, National University Health System, 1E Lower Kent Ridge Road, Singapore 119228, Singapore
James Thomas Patrick Decourcy Hallinan: Department of Diagnostic Imaging, National University Hospital, 5 Lower Kent Ridge Rd, Singapore 119074, Singapore; Department of Diagnostic Radiology, Yong Loo Lin School of Medicine, National University of Singapore, 10 Medical Drive, Singapore 117597, Singapore

Decoux A, Duron L, Habert P, Roblot V, Arsovic E, Chassagnon G, Arnoux A, Fournier L. Comparative performances of machine learning algorithms in radiomics and impacting factors. Sci Rep 2023;13:14069. [PMID: 37640728 PMCID: PMC10462640 DOI: 10.1038/s41598-023-39738-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2023] [Accepted: 07/30/2023] [Indexed: 08/31/2023]

La Cava WG, Lee PC, Ajmal I, Ding X, Solanki P, Cohen JB, Moore JH, Herman DS. A flexible symbolic regression method for constructing interpretable clinical prediction models. NPJ Digit Med 2023;6:107. [PMID: 37277550 PMCID: PMC10241925 DOI: 10.1038/s41746-023-00833-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2021] [Accepted: 05/05/2023] [Indexed: 06/07/2023]

Alòs J, Ansótegui C, Torres E. Interpretable decision trees through MaxSAT. Artif Intell Rev 2022;56:1-21. [PMID: 36590759 PMCID: PMC9794111 DOI: 10.1007/s10462-022-10377-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 12/12/2022] [Indexed: 12/29/2022]

Duong-Trung N, Born S, Kim JW, Schermeyer MT, Paulick K, Borisyak M, Cruz-Bournazou MN, Werner T, Scholz R, Schmidt-Thieme L, Neubauer P, Martinez E. When Bioprocess Engineering Meets Machine Learning: A Survey from the Perspective of Automated Bioprocess Development. Biochem Eng J 2022. [DOI: 10.1016/j.bej.2022.108764] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/14/2022]

Valdes G, Interian Y, Gennatas E, Van der Laan M. The Conditional Super Learner. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2022;44:10236-10243. [PMID: 34851823 DOI: 10.1109/tpami.2021.3131976] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]

Orzechowski P, Moore JH. Generative and reproducible benchmarks for comprehensive evaluation of machine learning classifiers. SCIENCE ADVANCES 2022;8:eabl4747. [PMID: 36417520 PMCID: PMC9683726 DOI: 10.1126/sciadv.abl4747] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2021] [Accepted: 10/07/2022] [Indexed: 06/16/2023]

Kasperek D, Podpora M, Kawala-Sterniuk A. Comparison of the Usability of Apple M1 Processors for Various Machine Learning Tasks. SENSORS (BASEL, SWITZERLAND) 2022;22:8005. [PMID: 36298358 PMCID: PMC9608475 DOI: 10.3390/s22208005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/09/2022] [Revised: 10/07/2022] [Accepted: 10/17/2022] [Indexed: 06/16/2023]

Ho L, Goethals P. Machine learning applications in river research: Trends, opportunities and challenges. Methods Ecol Evol 2022. [DOI: 10.1111/2041-210x.13992] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/27/2022]

Stafford IS, Gosink MM, Mossotto E, Ennis S, Hauben M. A Systematic Review of Artificial Intelligence and Machine Learning Applications to Inflammatory Bowel Disease, with Practical Guidelines for Interpretation. Inflamm Bowel Dis 2022;28:1573-1583. [PMID: 35699597 PMCID: PMC9527612 DOI: 10.1093/ibd/izac115] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 02/03/2022] [Indexed: 12/15/2022]

Colombelli F, Kowalski TW, Recamonde-Mendoza M. A hybrid ensemble feature selection design for candidate biomarkers discovery from transcriptome profiles. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.109655] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/15/2022]

Oppong SO, Twum F, Hayfron-Acquah JB, Missah YM. A Novel Computer Vision Model for Medicinal Plant Identification Using Log-Gabor Filters and Deep Learning Algorithms. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2022;2022:1189509. [PMID: 36203732 PMCID: PMC9532088 DOI: 10.1155/2022/1189509] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 04/24/2022] [Revised: 08/16/2022] [Accepted: 09/05/2022] [Indexed: 11/27/2022]

Paepae T, Bokoro PN, Kyamakya K. A Virtual Sensing Concept for Nitrogen and Phosphorus Monitoring Using Machine Learning Techniques. SENSORS (BASEL, SWITZERLAND) 2022;22:7338. [PMID: 36236438 PMCID: PMC9572788 DOI: 10.3390/s22197338] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/21/2022] [Revised: 09/20/2022] [Accepted: 09/24/2022] [Indexed: 06/16/2023]

Uncertainty Propagation Based MINLP Approach for Artificial Neural Network Structure Reduction. Processes (Basel) 2022. [DOI: 10.3390/pr10091716] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]

Ngo G, Beard R, Chandra R. Evolutionary bagging for ensemble learning. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.08.055] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]

A Romero RA, Y Deypalan MN, Mehrotra S, Jungao JT, Sheils NE, Manduchi E, Moore JH. Benchmarking AutoML frameworks for disease prediction using medical claims. BioData Min 2022;15:15. [PMID: 35883154 PMCID: PMC9327416 DOI: 10.1186/s13040-022-00300-2] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 12/07/2021] [Accepted: 06/27/2022] [Indexed: 11/10/2022]

Abstract

Objectives

Ascertain and compare the performances of Automated Machine Learning (AutoML) tools on large, highly imbalanced healthcare datasets.

Materials and Methods

We generated a large dataset using historical de-identified administrative claims including demographic information and flags for disease codes in four different time windows prior to 2019. We then trained three AutoML tools on this dataset to predict six different disease outcomes in 2019 and evaluated model performances on several metrics.

Results

The AutoML tools showed improvement from the baseline random forest model but did not differ significantly from each other. All models recorded low area under the precision-recall curve and failed to predict true positives while keeping the true negative rate high. Model performance was not directly related to prevalence. We provide a specific use-case to illustrate how to select a threshold that gives the best balance between true and false positive rates, as this is an important consideration in medical applications.

Discussion

Healthcare datasets present several challenges for AutoML tools, including large sample size, high imbalance, and limitations in the available features. Improvements in scalability, combinations of imbalance-learning resampling and ensemble approaches, and curated feature selection are possible next steps to achieve better performance.

Conclusion

Among the three explored, no AutoML tool consistently outperforms the rest in terms of predictive performance. The performances of the models in this study suggest that there may be room for improvement in handling medical claims data. Finally, selection of the optimal prediction threshold should be guided by the specific practical application.

Supplementary Information

The online version contains supplementary material available at (10.1186/s13040-022-00300-2).

Boecking B, Jeanselme V, Dubrawski A. Constrained clustering and multiple kernel learning without pairwise constraint relaxation. ADV DATA ANAL CLASSI 2022. [DOI: 10.1007/s11634-022-00507-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]

Successfully and efficiently training deep multi-layer perceptrons with logistic activation function simply requires initializing the weights with an appropriate negative mean. Neural Netw 2022;153:87-103. [DOI: 10.1016/j.neunet.2022.05.030] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/14/2021] [Revised: 03/25/2022] [Accepted: 05/31/2022] [Indexed: 12/26/2022]

Zheng Y, Guo Z, Zhang Y, Shang J, Yu L, Fu P, Liu Y, Li X, Wang H, Ren L, Zhang W, Hou H, Tan X, Wang W. Rapid triage for ischemic stroke: a machine learning-driven approach in the context of predictive, preventive and personalised medicine. EPMA J 2022;13:285-298. [PMID: 35719136 PMCID: PMC9203613 DOI: 10.1007/s13167-022-00283-4] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 04/27/2022] [Accepted: 05/09/2022] [Indexed: 02/05/2023]

Abstract

BACKGROUND

Recognising the early signs of ischemic stroke (IS) in emergency settings has been challenging. Machine learning (ML), a robust tool for predictive, preventive and personalised medicine (PPPM/3PM), presents a possible solution for this issue and produces accurate predictions for real-time data processing.

METHODS

This investigation evaluated 4999 IS patients among a total of 10,476 adults included in the initial dataset, and 1076 IS subjects among 3935 participants in the external validation dataset. Six ML-based models for the prediction of IS were trained on the initial dataset of 10,476 participants, which was split into a training set (80%) and an internal validation set (20%). Selected clinical laboratory features routinely assessed at admission were used to inform the models. Model performance was mainly evaluated by the area under the receiver operating characteristic curve (AUC). Three additional techniques, namely permutation feature importance (PFI), local interpretable model-agnostic explanations (LIME), and SHapley Additive exPlanations (SHAP), were applied to explain the black-box ML models.

RESULTS

Fifteen routine haematological and biochemical features were selected to establish ML-based models for the prediction of IS. The XGBoost-based model achieved the highest predictive performance, reaching AUCs of 0.91 (0.90-0.92) and 0.92 (0.91-0.93) in the internal and external datasets, respectively. PFI revealed that, globally, the demographic feature age, the routine haematological parameters haemoglobin and neutrophil count, and the biochemical analytes total protein and high-density lipoprotein cholesterol were more influential on the model's prediction. LIME and SHAP showed similar local feature attribution explanations.

CONCLUSION

In the context of PPPM/3PM, we used the selected predictors obtained from the results of common blood tests to develop and validate ML-based models for the diagnosis of IS. The XGBoost-based model offers the most accurate prediction. By incorporating the individualised patient profile, this prediction tool is simple and quick to administer. It shows promise for supporting subjective decision making in resource-limited settings or primary care, thereby shortening the time window for treatment and improving outcomes after IS.

SUPPLEMENTARY INFORMATION

The online version contains supplementary material available at 10.1007/s13167-022-00283-4.

Affiliation(s)

Yulu Zheng: Centre for Precision Health, Edith Cowan University, 270 Joondalup Drive, Joondalup, 6027 Western Australia, Australia
Zheng Guo: Centre for Precision Health, Edith Cowan University, 270 Joondalup Drive, Joondalup, 6027 Western Australia, Australia
Yanbo Zhang: The Second Affiliated Hospital of Shandong First Medical University, Tai’an, Shandong, China
Jianjing Shang: Dongping People’s Hospital, Tai’an, Shandong, China
Leilei Yu: Tai’an City Central Hospital, Tai’an, Shandong, China
Ping Fu: Ti’men Township Central Hospital, Tai’an, Shandong, China
Yizhi Liu: School of Public Health, Shandong First Medical University & Shandong Academy of Medical Sciences, 619 Changcheng Road, Tai’an, 271016 Shandong, China
Xingang Li: Centre for Precision Health, Edith Cowan University, 270 Joondalup Drive, Joondalup, 6027 Western Australia, Australia
Hao Wang: Department of Clinical Epidemiology and Evidence-Based Medicine, National Clinical Research Centre for Digestive Disease, Beijing Friendship Hospital, Capital Medical University, Beijing, China; Beijing Key Laboratory of Clinical Epidemiology, School of Public Health, Capital Medical University, Beijing, China
Ling Ren: Beijing United Family Hospital, No.2 Jiangtai Road, Chaoyang District, Beijing, China
Wei Zhang: Centre for Cognitive Neurology, Department of Neurology, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
Haifeng Hou: Centre for Precision Health, Edith Cowan University, 270 Joondalup Drive, Joondalup, 6027 Western Australia, Australia; The Second Affiliated Hospital of Shandong First Medical University, Tai’an, Shandong, China; School of Public Health, Shandong First Medical University & Shandong Academy of Medical Sciences, 619 Changcheng Road, Tai’an, 271016 Shandong, China
Xuerui Tan: The First Affiliated Hospital of Shantou University Medical College, Shantou, Guangdong, China
Wei Wang: Centre for Precision Health, Edith Cowan University, 270 Joondalup Drive, Joondalup, 6027 Western Australia, Australia; School of Public Health, Shandong First Medical University & Shandong Academy of Medical Sciences, 619 Changcheng Road, Tai’an, 271016 Shandong, China; Beijing Key Laboratory of Clinical Epidemiology, School of Public Health, Capital Medical University, Beijing, China; The First Affiliated Hospital of Shantou University Medical College, Shantou, Guangdong, China; Institute for Nutrition Research, Edith Cowan University, Joondalup, WA, Australia
on behalf of Global Health Epidemiology Reference Group (GHERG)

Wittscher L, Diers J, Pigorsch C. Improving image classification robustness using self‐supervision. Stat (Int Stat Inst) 2022. [DOI: 10.1002/sta4.455] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/09/2022]

Romano JD, Le TT, La Cava W, Gregg JT, Goldberg DJ, Chakraborty P, Ray NL, Himmelstein D, Fu W, Moore JH. PMLB v1.0: an open-source dataset collection for benchmarking machine learning methods. Bioinformatics 2022;38:878-880. [PMID: 34677586 PMCID: PMC8756190 DOI: 10.1093/bioinformatics/btab727] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 04/02/2021] [Revised: 08/17/2021] [Accepted: 10/18/2021] [Indexed: 02/04/2023]

Glaab E, Rauschenberger A, Banzi R, Gerardi C, Garcia P, Demotes J. Biomarker discovery studies for patient stratification using machine learning analysis of omics data: a scoping review. BMJ Open 2021;11:e053674. [PMID: 34873011 PMCID: PMC8650485 DOI: 10.1136/bmjopen-2021-053674] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 05/20/2021] [Accepted: 11/09/2021] [Indexed: 12/12/2022]

Abstract

OBJECTIVE

To review biomarker discovery studies using omics data for patient stratification which led to clinically validated FDA-cleared tests or laboratory developed tests, in order to identify common characteristics and derive recommendations for future biomarker projects.

DESIGN

Scoping review.

METHODS

We searched PubMed, EMBASE and Web of Science to obtain a comprehensive list of articles from the biomedical literature published between January 2000 and July 2021, describing clinically validated biomarker signatures for patient stratification, derived using statistical learning approaches. All documents were screened to retain only peer-reviewed research articles, review articles or opinion articles, covering supervised and unsupervised machine learning applications for omics-based patient stratification. Two reviewers independently confirmed the eligibility. Disagreements were solved by consensus. We focused the final analysis on omics-based biomarkers which achieved the highest level of validation, that is, clinical approval of the developed molecular signature as a laboratory developed test or FDA approved tests.

RESULTS

Overall, 352 articles fulfilled the eligibility criteria. The analysis of validated biomarker signatures identified multiple common methodological and practical features that may explain the successful test development and guide future biomarker projects. These include study design choices to ensure sufficient statistical power for model building and external testing, suitable combinations of non-targeted and targeted measurement technologies, the integration of prior biological knowledge, strict filtering and inclusion/exclusion criteria, and the adequacy of statistical and machine learning methods for discovery and validation.

CONCLUSIONS

While most clinically validated biomarker models derived from omics data have been developed for personalised oncology, first applications for non-cancer diseases show the potential of multivariate omics biomarker design for other complex disorders. Distinctive characteristics of prior success stories, such as early filtering and robust discovery approaches, continuous improvements in assay design and experimental measurement technology, and rigorous multicohort validation approaches, enable the derivation of specific recommendations for future studies.

La Cava W, Burlacu B, Virgolin M, Kommenda M, Orzechowski P, de França FO, Jin Y, Moore JH. Contemporary Symbolic Regression Methods and their Relative Performance. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 2021;2021:1-16. [PMID: 38715933 PMCID: PMC11074949] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/12/2024]

Bikia V, Fong T, Climie RE, Bruno RM, Hametner B, Mayer C, Terentes-Printzios D, Charlton PH. Leveraging the potential of machine learning for assessing vascular ageing: state-of-the-art and future research. EUROPEAN HEART JOURNAL. DIGITAL HEALTH 2021;2:676-690. [PMID: 35316972 PMCID: PMC7612526 DOI: 10.1093/ehjdh/ztab089] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 01/30/2023]

Azad TD, Ehresman J, Ahmed AK, Staartjes VE, Lubelski D, Stienen MN, Veeravagu A, Ratliff JK. Fostering reproducibility and generalizability in machine learning for clinical prediction modeling in spine surgery. Spine J 2021;21:1610-1616. [PMID: 33065274 DOI: 10.1016/j.spinee.2020.10.006] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Received: 05/27/2020] [Revised: 08/13/2020] [Accepted: 10/07/2020] [Indexed: 02/03/2023]

de Franca FO, Aldeia GSI. Interaction-Transformation Evolutionary Algorithm for Symbolic Regression. EVOLUTIONARY COMPUTATION 2021;29:367-390. [PMID: 33306435 DOI: 10.1162/evco_a_00285] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 03/23/2020] [Accepted: 12/03/2020] [Indexed: 06/12/2023]

Gu Q, Kumar A, Bray S, Creason A, Khanteymoori A, Jalili V, Grüning B, Goecks J. Galaxy-ML: An accessible, reproducible, and scalable machine learning toolkit for biomedicine. PLoS Comput Biol 2021;17:e1009014. [PMID: 34061826 PMCID: PMC8213174 DOI: 10.1371/journal.pcbi.1009014] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 01/17/2021] [Revised: 06/18/2021] [Accepted: 04/27/2021] [Indexed: 11/25/2022]

In-depth analysis of SVM kernel learning and its components. Neural Comput Appl 2021. [DOI: 10.1007/s00521-020-05419-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/23/2022]

Bridgelall R, Tolliver DD. Railroad accident analysis using extreme gradient boosting. ACCIDENT; ANALYSIS AND PREVENTION 2021;156:106126. [PMID: 33878573 DOI: 10.1016/j.aap.2021.106126] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/23/2020] [Revised: 03/19/2021] [Accepted: 04/03/2021] [Indexed: 06/12/2023]

Kim S, Jeong M, Ko BC. Lightweight surrogate random forest support for model simplification and feature relevance. APPL INTELL 2021. [DOI: 10.1007/s10489-021-02451-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 10/21/2022]

La Cava W, Williams H, Fu W, Vitale S, Srivatsan D, Moore JH. Evaluating recommender systems for AI-driven biomedical informatics. Bioinformatics 2021;37:250-256. [PMID: 32766825 PMCID: PMC8055228 DOI: 10.1093/bioinformatics/btaa698] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 04/06/2020] [Revised: 06/23/2020] [Accepted: 07/27/2020] [Indexed: 11/13/2022]

Abstract

Motivation

Many researchers with domain expertise are unable to easily apply machine learning (ML) to their bioinformatics data due to a lack of ML and/or coding expertise. Methods that have been proposed thus far to automate ML mostly require programming experience as well as expert knowledge to tune and apply the algorithms correctly. Here, we study a method of automating biomedical data science using a web-based AI platform to recommend model choices and conduct experiments. We have two goals in mind: first, to make it easy to construct sophisticated models of biomedical processes; and second, to provide a fully automated AI agent that can choose and conduct promising experiments for the user, based on the user’s experiments as well as prior knowledge. To validate this framework, we conduct an experiment on 165 classification problems, comparing to state-of-the-art, automated approaches. Finally, we use this tool to develop predictive models of septic shock in critical care patients.

Results

We find that matrix factorization-based recommendation systems outperform metalearning methods for automating ML. This result mirrors the results of earlier recommender systems research in other domains. The proposed AI is competitive with state-of-the-art automated ML methods in terms of choosing optimal algorithm configurations for datasets. In our application to prediction of septic shock, the AI-driven analysis produces a competent ML model (AUROC 0.85±0.02) that performs on par with state-of-the-art deep learning results for this task, with much less computational effort.

Availability and implementation

PennAI is available free of charge and open-source. It is distributed under the GNU public license (GPL) version 3.

Supplementary information

Supplementary data are available at Bioinformatics online.


Khasidashvili Z, Norman AJ. Feature range analysis. International Journal of Data Science and Analytics 2021. [DOI: 10.1007/s41060-021-00251-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/25/2022]

Moreno-Indias I, Lahti L, Nedyalkova M, Elbere I, Roshchupkin G, Adilovic M, Aydemir O, Bakir-Gungor B, Santa Pau ECD, D’Elia D, Desai MS, Falquet L, Gundogdu A, Hron K, Klammsteiner T, Lopes MB, Marcos-Zambrano LJ, Marques C, Mason M, May P, Pašić L, Pio G, Pongor S, Promponas VJ, Przymus P, Saez-Rodriguez J, Sampri A, Shigdel R, Stres B, Suharoschi R, Truu J, Truică CO, Vilne B, Vlachakis D, Yilmaz E, Zeller G, Zomer AL, Gómez-Cabrero D, Claesson MJ. Statistical and Machine Learning Techniques in Human Microbiome Studies: Contemporary Challenges and Solutions. Front Microbiol 2021;12:635781. [PMID: 33692771 PMCID: PMC7937616 DOI: 10.3389/fmicb.2021.635781] [Citation(s) in RCA: 39] [Impact Index Per Article: 13.0] [Received: 11/30/2020] [Accepted: 01/28/2021] [Indexed: 12/23/2022] Open

Affiliation(s)

Isabel Moreno-Indias Instituto de Investigación Biomédica de Málaga (IBIMA), Unidad de Gestión Clínica de Endocrinología y Nutrición, Hospital Clínico Universitario Virgen de la Victoria, Universidad de Málaga, Málaga, Spain Centro de Investigación Biomédica en Red de Fisiopatología de la Obesidad y la Nutrición (CIBEROBN), Instituto de Salud Carlos III, Madrid, Spain
Leo Lahti Department of Computing, University of Turku, Turku, Finland
Miroslava Nedyalkova Human Genetics and Disease Mechanisms, Latvian Biomedical Research and Study Centre, Riga, Latvia
Ilze Elbere Latvian Biomedical Research and Study Centre, Riga, Latvia
Gennady Roshchupkin Department of Epidemiology, Erasmus Medical Center, Rotterdam, Netherlands
Muhamed Adilovic Department of Genetics and Bioengineering, International University of Sarajevo, Sarajevo, Bosnia and Herzegovina
Onder Aydemir Department of Electrical and Electronics Engineering, Karadeniz Technical University, Trabzon, Turkey
Burcu Bakir-Gungor Department of Computer Engineering, Abdullah Gul University, Kayseri, Turkey
Enrique Carrillo-de Santa Pau Computational Biology Group, Precision Nutrition and Cancer Research Program, IMDEA Food Institute, Madrid, Spain
Domenica D’Elia Department for Biomedical Sciences, Institute for Biomedical Technologies, National Research Council, Bari, Italy
Mahesh S. Desai Department of Infection and Immunity, Luxembourg Institute of Health, Esch-sur-Alzette, Luxembourg Odense Research Center for Anaphylaxis, Department of Dermatology and Allergy Center, Odense University Hospital, University of Southern Denmark, Odense, Denmark
Laurent Falquet Department of Biology, University of Fribourg, Fribourg, Switzerland Swiss Institute of Bioinformatics, Lausanne, Switzerland
Aycan Gundogdu Department of Microbiology and Clinical Microbiology, Faculty of Medicine, Erciyes University, Kayseri, Turkey Metagenomics Laboratory, Genome and Stem Cell Center (GenKök), Erciyes University, Kayseri, Turkey
Karel Hron Department of Mathematical Analysis and Applications of Mathematics, Palacký University, Olomouc, Czechia
Thomas Klammsteiner Department of Microbiology, University of Innsbruck, Innsbruck, Austria
Marta B. Lopes NOVA Laboratory for Computer Science and Informatics (NOVA LINCS), FCT, UNL, Caparica, Portugal Centro de Matemática e Aplicações (CMA), FCT, UNL, Caparica, Portugal
Laura Judith Marcos-Zambrano Computational Biology Group, Precision Nutrition and Cancer Research Program, IMDEA Food Institute, Madrid, Spain
Cláudia Marques CINTESIS, NOVA Medical School, NMS, Universidade Nova de Lisboa, Lisbon, Portugal
Michael Mason Computational Oncology, Sage Bionetworks, Seattle, WA, United States
Patrick May Bioinformatics Core, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
Lejla Pašić Sarajevo Medical School, University Sarajevo School of Science and Technology, Sarajevo, Bosnia and Herzegovina
Gianvito Pio Department of Computer Science, University of Bari Aldo Moro, Bari, Italy
Sándor Pongor Faculty of Information Technology and Bionics, Pázmány University, Budapest, Hungary
Vasilis J. Promponas Bioinformatics Research Laboratory, Department of Biological Sciences, University of Cyprus, Nicosia, Cyprus
Piotr Przymus Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Toruń, Poland
Julio Saez-Rodriguez Institute of Computational Biomedicine, Heidelberg University, Faculty of Medicine and Heidelberg University Hospital, Heidelberg, Germany
Alexia Sampri Division of Informatics, Imaging and Data Sciences, School of Health Sciences, University of Manchester, Manchester, United Kingdom
Rajesh Shigdel Department of Clinical Science, University of Bergen, Bergen, Norway
Blaz Stres Jozef Stefan Institute, Ljubljana, Slovenia Biotechnical Faculty, University of Ljubljana, Ljubljana, Slovenia Faculty of Civil and Geodetic Engineering, University of Ljubljana, Ljubljana, Slovenia
Ramona Suharoschi Molecular Nutrition and Proteomics Lab, Faculty of the Food Science and Technology, Institute of Life Sciences, University of Agricultural Sciences and Veterinary Medicine of Cluj-Napoca, Cluj-Napoca, Romania
Jaak Truu Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
Ciprian-Octavian Truică Department of Computer Science and Engineering, Faculty of Automatic Control and Computers, University Politehnica of Bucharest, Bucharest, Romania
Baiba Vilne Bioinformatics Research Unit, Riga Stradins University, Riga, Latvia
Dimitrios Vlachakis Laboratory of Genetics, Department of Biotechnology, School of Applied Biology and Biotechnology, Agricultural University of Athens, Athens, Greece
Ercument Yilmaz Department of Computer Technologies, Karadeniz Technical University, Trabzon, Turkey
Georg Zeller European Molecular Biology Laboratory, Structural and Computational Biology Unit, Heidelberg, Germany
Aldert L. Zomer Department of Infectious Diseases and Immunology, Faculty of Veterinary Medicine, Utrecht University, Utrecht, Netherlands
David Gómez-Cabrero Navarrabiomed, Complejo Hospitalario de Navarra (CHN), IdiSNA, Universidad Pública de Navarra (UPNA), Pamplona, Spain
Marcus J. Claesson School of Microbiology and APC Microbiome Ireland, University College Cork, Cork, Ireland


Sipper M, Moore JH. Conservation machine learning: a case study of random forests. Sci Rep 2021;11:3629. [PMID: 33574563 PMCID: PMC7878914 DOI: 10.1038/s41598-021-83247-4] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 08/23/2020] [Accepted: 02/01/2021] [Indexed: 11/19/2022] Open

Orlenko A, Moore JH. A comparison of methods for interpreting random forest models of genetic association in the presence of non-additive interactions. BioData Min 2021;14:9. [PMID: 33514397 PMCID: PMC7847145 DOI: 10.1186/s13040-021-00243-0] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 07/01/2020] [Accepted: 01/13/2021] [Indexed: 01/19/2023] Open

Abstract

BACKGROUND

Non-additive interactions among genes are frequently associated with a number of phenotypes, including known complex diseases such as Alzheimer's, diabetes, and cardiovascular disease. Detecting interactions requires careful selection of analytical methods, and some machine learning algorithms are unable or underpowered to detect or model feature interactions that exhibit non-additivity. The Random Forest method is often employed in these efforts due to its ability to detect and model non-additive interactions. In addition, Random Forest has the built-in ability to estimate feature importance scores, a characteristic that allows the model to be interpreted with the order and effect size of the feature association with the outcome. This characteristic is very important for epidemiological and clinical studies where results of predictive modeling could be used to define the future direction of the research efforts. Alternative ways to interpret the model are a permutation feature importance metric, which employs a permutation approach to calculate a feature contribution coefficient in units of the decrease in the model's performance, and Shapley additive explanations, which employ a cooperative game theory approach. Currently, it is unclear which Random Forest feature importance metric provides a superior estimation of the true informative contribution of features in genetic association analysis.

RESULTS

To address this issue, and to improve interpretability of Random Forest predictions, we compared different methods for feature importance estimation in real and simulated datasets with non-additive interactions. As a result, we detected a discrepancy between the metrics for the real-world datasets and further established that the permutation feature importance metric provides more precise feature importance rank estimation for the simulated datasets with non-additive interactions.

CONCLUSIONS

By analyzing both real and simulated data, we established that the permutation feature importance metric provides more precise feature importance rank estimation in the presence of non-additive interactions.
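
The permutation feature importance idea described in this abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' code: the toy data, the `xor_model` stand-in for a fitted Random Forest, and all function names are assumptions made purely for illustration. Each feature's importance is the mean drop in accuracy observed when that feature's column is shuffled.

```python
import random

def accuracy(model, X, y):
    """Fraction of rows for which model(row) equals the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle column j only, leaving every other column untouched.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical toy data: the label is the XOR of features 0 and 1, a purely
# non-additive interaction; feature 2 is noise.
data_rng = random.Random(42)
X = [[data_rng.randint(0, 1) for _ in range(3)] for _ in range(400)]
y = [row[0] ^ row[1] for row in X]

def xor_model(row):
    """Stand-in for a fitted model that has learned the interaction."""
    return row[0] ^ row[1]

scores = permutation_importance(xor_model, X, y)
print(scores)  # features 0 and 1 score well above feature 2
```

Because the stand-in model ignores the noise feature entirely, shuffling it cannot change any prediction, so its importance is exactly zero, while the two interacting features each cost roughly half of the accuracy when shuffled.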


Kalyuzhnaya AV, Nikitin NO, Hvatov A, Maslyaev M, Yachmenkov M, Boukhanovsky A. Towards Generative Design of Computationally Efficient Mathematical Models with Evolutionary Learning. Entropy 2020;23:e23010028. [PMID: 33375471 PMCID: PMC7823403 DOI: 10.3390/e23010028] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 11/09/2020] [Revised: 12/17/2020] [Accepted: 12/24/2020] [Indexed: 11/16/2022]

Khomtchouk BB, Tran DT, Vand KA, Might M, Gozani O, Assimes TL. Cardioinformatics: the nexus of bioinformatics and precision cardiology. Brief Bioinform 2020;21:2031-2051. [PMID: 31802103 PMCID: PMC7947182 DOI: 10.1093/bib/bbz119] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 04/29/2019] [Revised: 08/08/2019] [Accepted: 08/13/2019] [Indexed: 12/12/2022] Open

Thiagarajan JJ, Venkatesh B, Anirudh R, Bremer PT, Gaffney J, Anderson G, Spears B. Designing accurate emulators for scientific processes using calibration-driven deep models. Nat Commun 2020;11:5622. [PMID: 33159053 PMCID: PMC7648787 DOI: 10.1038/s41467-020-19448-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 04/20/2020] [Accepted: 09/21/2020] [Indexed: 01/16/2023] Open

Trujillo L, Álvarez González E, Galván E, Tapia JJ, Ponsich A. On the analysis of hyper-parameter space for a genetic programming system with iterated F-Race. Soft Comput 2020. [DOI: 10.1007/s00500-020-04829-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 10/24/2022]

Kline A, Kline T, Shakeri Hossein Abad Z, Lee J. Using Item Response Theory for Explainable Machine Learning in Predicting Mortality in the Intensive Care Unit: Case-Based Approach. J Med Internet Res 2020;22:e20268. [PMID: 32975523 PMCID: PMC7547395 DOI: 10.2196/20268] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 05/14/2020] [Revised: 07/02/2020] [Accepted: 08/08/2020] [Indexed: 01/29/2023] Open

Abstract

BACKGROUND

Supervised machine learning (ML) is being featured in the health care literature with study results frequently reported using metrics such as accuracy, sensitivity, specificity, recall, or F1 score. Although each metric provides a different perspective on the performance, they remain overall measures for the whole sample, discounting the uniqueness of each case or patient. Intuitively, we know that all cases are not equal, but the present evaluative approaches do not take case difficulty into account.

OBJECTIVE

A more case-based, comprehensive approach is warranted to assess supervised ML outcomes and forms the rationale for this study. This study aims to demonstrate how the item response theory (IRT) can be used to stratify the data based on how difficult each case is to classify, independent of the outcome measure of interest (eg, accuracy). This stratification allows the evaluation of ML classifiers to take the form of a distribution rather than a single scalar value.

METHODS

Two large, public intensive care unit data sets, Medical Information Mart for Intensive Care III and electronic intensive care unit, were used to showcase this method in predicting mortality. For each data set, a balanced sample (n=8078 and n=21,940, respectively) and an imbalanced sample (n=12,117 and n=32,910, respectively) were drawn. A 2-parameter logistic model was used to provide scores for each case. Several ML algorithms were used in the demonstration to classify cases based on their health-related features: logistic regression, linear discriminant analysis, K-nearest neighbors, decision tree, naive Bayes, and a neural network. Generalized linear mixed model analyses were used to assess the effects of case difficulty strata, ML algorithm, and the interaction between them in predicting accuracy.

RESULTS

The results showed significant effects (P<.001) for case difficulty strata, ML algorithm, and their interaction in predicting accuracy and illustrated that all classifiers performed better with easier-to-classify cases and that overall the neural network performed best. Significant interactions suggest that cases that fall in the most arduous strata should be handled by logistic regression, linear discriminant analysis, decision tree, or neural network but not by naive Bayes or K-nearest neighbors. Conventional metrics for ML classification have been reported for methodological comparison.

CONCLUSIONS

This demonstration shows that using the IRT is a viable method for understanding the data that are provided to ML algorithms, independent of outcome measures, and highlights how well classifiers differentiate cases of varying difficulty. This method explains which features are indicative of healthy states and why. It enables end users to tailor the classifier that is appropriate to the difficulty level of the patient for personalized medicine.
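
For readers unfamiliar with the 2-parameter logistic (2PL) model mentioned in the Methods of this abstract, the following Python sketch shows the standard 2PL item response function and one possible way to bin cases into difficulty strata. The equal-size, rank-based binning rule and the function names are assumptions for illustration only; the abstract does not state how the strata were actually formed.

```python
import math

def two_pl(theta, a, b):
    """Standard 2PL item response function.

    theta: ability of the respondent
    a:     discrimination of the item (here, a case)
    b:     difficulty of the item
    Returns the probability of a correct response.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def stratify_by_difficulty(difficulties, n_strata=3):
    """Assign each case to a stratum by difficulty rank (0 = easiest).

    Assumption: equal-size strata built from the rank order of difficulties.
    """
    n = len(difficulties)
    order = sorted(range(n), key=lambda i: difficulties[i])
    strata = [0] * n
    for rank, i in enumerate(order):
        strata[i] = rank * n_strata // n
    return strata

# Hypothetical difficulty scores for six cases.
difficulties = [-1.2, 0.3, 2.0, -0.5, 1.1, 0.0]
print(stratify_by_difficulty(difficulties))  # [0, 1, 2, 0, 2, 1]
print(two_pl(theta=0.0, a=1.0, b=0.0))       # 0.5
```

Reporting a metric such as accuracy separately within each stratum yields the distribution of performance across difficulty levels that the abstract contrasts with a single scalar value.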


Skew Gaussian processes for classification. Mach Learn 2020. [DOI: 10.1007/s10994-020-05906-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/23/2022]

La Cava W, Moore JH. Learning feature spaces for regression with genetic programming. Genetic Programming and Evolvable Machines 2020;21:433-467. [PMID: 33343224 PMCID: PMC7748157 DOI: 10.1007/s10710-020-09383-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 10/15/2019] [Revised: 01/17/2020] [Indexed: 06/07/2023]

Le TT, Fu W, Moore JH. Scaling tree-based automated machine learning to biomedical big data with a feature set selector. Bioinformatics 2020;36:250-256. [PMID: 31165141 PMCID: PMC6956793 DOI: 10.1093/bioinformatics/btz470] [Citation(s) in RCA: 114] [Impact Index Per Article: 28.5] [Received: 01/03/2019] [Revised: 05/17/2019] [Accepted: 06/02/2019] [Indexed: 12/13/2022] Open

Abstract

Motivation

Automated machine learning (AutoML) systems are helpful data science assistants designed to scan data for novel features, select appropriate supervised learning models and optimize their parameters. For this purpose, Tree-based Pipeline Optimization Tool (TPOT) was developed using strongly typed genetic programming (GP) to recommend an optimized analysis pipeline for the data scientist’s prediction problem. However, like other AutoML systems, TPOT may reach computational resource limits when working on big data such as whole-genome expression data.

Results

We introduce two new features implemented in TPOT that help increase the system’s scalability: Feature Set Selector (FSS) and Template. FSS provides the option to specify subsets of the features as separate datasets, assuming the signals come from one or more of these specific data subsets. FSS increases TPOT’s efficiency in application on big data by slicing the entire dataset into smaller sets of features and allowing GP to select the best subset in the final pipeline. Template enforces type constraints with strongly typed GP and enables the incorporation of FSS at the beginning of each pipeline. Consequently, FSS and Template help reduce TPOT computation time and may provide more interpretable results. Our simulations show TPOT-FSS significantly outperforms a tuned XGBoost model and standard TPOT implementation. We apply TPOT-FSS to real RNA-Seq data from a study of major depressive disorder. Independent of the previous study that identified significant association with depression severity of two modules, TPOT-FSS corroborates that one of the modules is largely predictive of the clinical diagnosis of each individual.
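
The core idea of slicing a dataset into named feature subsets, as described above, can be sketched in plain Python. This is not TPOT's API: the function `select_feature_set`, the gene names, and the module names are hypothetical and serve only to show what restricting a dataset to one predefined feature set looks like.

```python
def select_feature_set(rows, header, feature_sets, set_name):
    """Return only the columns of `rows` that belong to the named feature set."""
    wanted = feature_sets[set_name]
    idx = [header.index(name) for name in wanted]
    return [[row[i] for i in idx] for row in rows]

# Hypothetical expression data: four genes grouped into two modules.
header = ["gene_a", "gene_b", "gene_c", "gene_d"]
feature_sets = {
    "module_1": ["gene_a", "gene_c"],
    "module_2": ["gene_b", "gene_d"],
}
rows = [
    [0.1, 0.2, 0.3, 0.4],
    [1.1, 1.2, 1.3, 1.4],
]

subset = select_feature_set(rows, header, feature_sets, "module_1")
print(subset)  # [[0.1, 0.3], [1.1, 1.3]]
```

In the approach the abstract describes, the choice of subset is left to genetic programming; in a sketch like this one, a search procedure would simply evaluate a downstream model on each named subset and keep the best-scoring one.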

Availability and implementation

Detailed simulation and analysis code needed to reproduce the results in this study is available at https://github.com/lelaboratoire/tpot-fss. Implementation of the new TPOT operators is available at https://github.com/EpistasisLab/tpot.

Supplementary information

Supplementary data are available at Bioinformatics online.


Tzanetos A, Dounias G. Nature inspired optimization algorithms or simply variations of metaheuristics? Artif Intell Rev 2020. [DOI: 10.1007/s10462-020-09893-8] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Indexed: 10/23/2022]

Verkhivker GM, Agajanian S, Hu G, Tao P. Allosteric Regulation at the Crossroads of New Technologies: Multiscale Modeling, Networks, and Machine Learning. Front Mol Biosci 2020;7:136. [PMID: 32733918 PMCID: PMC7363947 DOI: 10.3389/fmolb.2020.00136] [Citation(s) in RCA: 40] [Impact Index Per Article: 10.0] [Received: 05/01/2020] [Accepted: 06/08/2020] [Indexed: 12/12/2022] Open

Abstract

Allosteric regulation is a common mechanism employed by complex biomolecular systems for regulation of activity and adaptability in the cellular environment, serving as an effective molecular tool for cellular communication. As an intrinsic but elusive property, allostery is a ubiquitous phenomenon where binding or disturbing of a distal site in a protein can functionally control its activity and is considered as the "second secret of life." The fundamental biological importance and complexity of these processes require a multi-faceted platform of synergistically integrated approaches for prediction and characterization of allosteric functional states, atomistic reconstruction of allosteric regulatory mechanisms and discovery of allosteric modulators. The unifying theme and overarching goal of allosteric regulation studies in recent years have been integration between emerging experimental and computational approaches and technologies to advance quantitative characterization of allosteric mechanisms in proteins. Despite significant advances, the quantitative characterization and reliable prediction of functional allosteric states, interactions, and mechanisms continue to present highly challenging problems in the field. In this review, we discuss simulation-based multiscale approaches, experiment-informed Markovian models, and network modeling of allostery and information-theoretical approaches that can describe the thermodynamics and hierarchy of allosteric states and the molecular basis of allosteric mechanisms. The wealth of structural and functional information along with diversity and complexity of allosteric mechanisms in therapeutically important protein families have provided a well-suited platform for development of data-driven research strategies. Data-centric integration of chemistry, biology and computer science using artificial intelligence technologies has gained significant momentum and is at the forefront of many cross-disciplinary efforts. We discuss new developments in the machine learning field and the emergence of deep learning and deep reinforcement learning applications in modeling of molecular mechanisms and allosteric proteins. The experiment-guided integrated approaches empowered by recent advances in multiscale modeling, network science, and machine learning can lead to more reliable prediction of allosteric regulatory mechanisms and discovery of allosteric modulators for therapeutically important protein targets.


Klyuchnikov N, Burnaev E. Gaussian process classification for variable fidelity data. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2019.10.111] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/28/2022]

Levy JJ, O'Malley AJ. Don't dismiss logistic regression: the case for sensible extraction of interactions in the era of machine learning. BMC Med Res Methodol 2020;20:171. [PMID: 32600277 PMCID: PMC7325087 DOI: 10.1186/s12874-020-01046-3] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Received: 12/17/2019] [Accepted: 06/10/2020] [Indexed: 01/08/2023] Open

Abstract

BACKGROUND

Machine learning approaches have become increasingly popular modeling techniques, relying on data-driven heuristics to arrive at their solutions. Recent comparisons between these algorithms and traditional statistical modeling techniques have largely ignored the superiority gained by the former approaches due to involvement of model-building search algorithms. This has led to alignment of statistical and machine learning approaches with different types of problems and the under-development of procedures that combine their attributes. In this context, we hoped to understand the domains of applicability for each approach and to identify areas where a marriage between the two approaches is warranted. We then sought to develop a hybrid statistical-machine learning procedure with the best attributes of each.

METHODS

We present three simple examples to illustrate when to use each modeling approach and posit a general framework for combining them into an enhanced logistic regression model building procedure that aids interpretation. We study 556 benchmark machine learning datasets to uncover when machine learning techniques outperformed rudimentary logistic regression models and so are potentially well-equipped to enhance them. We illustrate a software package, InteractionTransformer, which embeds logistic regression with advanced model building capacity by using machine learning algorithms to extract candidate interaction features from a random forest model for inclusion in the model. Finally, we apply our enhanced logistic regression analysis to two real-world biomedical examples, one where predictors vary linearly with the outcome and another with extensive second-order interactions.

RESULTS

Preliminary statistical analysis demonstrated that across 556 benchmark datasets, the random forest approach significantly outperformed the logistic regression approach. We found a statistically significant increase in predictive performance when using hybrid procedures and greater clarity in the association with the outcome of terms acquired compared to directly interpreting the random forest output.

CONCLUSIONS

When a random forest model is closer to the true model, hybrid statistical-machine learning procedures can substantially enhance the performance of statistical procedures in an automated manner while preserving easy interpretation of the results. Such hybrid methods may help facilitate widespread adoption of machine learning techniques in the biomedical setting.
