1
|
Abou-Abbas L, Henni K, Jemal I, Mezghani N. Generative AI with WGAN-GP for boosting seizure detection accuracy. Front Artif Intell 2024; 7:1437315. [PMID: 39415942 PMCID: PMC11480023 DOI: 10.3389/frai.2024.1437315] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/24/2024] [Accepted: 09/16/2024] [Indexed: 10/19/2024] Open
Abstract
Background: Imbalanced datasets pose challenges for developing accurate seizure detection systems based on electroencephalogram (EEG) data. Generative AI techniques may help augment minority class data to facilitate automatic epileptic seizure detection. New method: This study investigates the impact of various data augmentation (DA) approaches, including Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP), Vanilla GAN, Conditional GAN (CGAN), and Cramer GAN, on classification performance with Random Forest models. The best-performing GAN variant, WGAN-GP, was then integrated with a bidirectional Long Short-Term Memory (LSTM) architecture and compared against traditional and synthetic oversampling methods. Results: The evaluation of different GAN variants for data augmentation with Random Forest classifiers identified WGAN-GP as the most effective approach. The integration of WGAN-GP with bidirectional LSTM yielded substantial performance improvements, outperforming traditional oversampling methods and achieving an accuracy of 91.73% on the augmented data, compared to 86% accuracy on real data without augmentation. Comparison with existing methods: The proposed generative AI approach combining WGAN-GP and recurrent neural network models outperforms comparative synthetic oversampling methods on metrics relevant for reliable seizure detection from imbalanced EEG datasets. Conclusions: Incorporating the WGAN-GP generative AI technique for data augmentation and integrating it with bidirectional LSTM elevates seizure detection accuracy for imbalanced EEG datasets, surpassing the performance of traditional oversampling and class weight adjustment methods. This approach shows promise for improving epilepsy monitoring and management through enhanced automated detection system effectiveness.
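For readers unfamiliar with the gradient penalty named in this abstract, the critic objective in the standard WGAN-GP formulation can be written as below. This is the generic textbook form, not a statement of the authors' exact configuration; the symbols D (critic), P_r (real data distribution), P_g (generator distribution) and the penalty weight λ are notation assumed here, and the abstract does not report the value of λ that was used.

```latex
L_D = \mathbb{E}_{\tilde{x} \sim P_g}\left[ D(\tilde{x}) \right]
    - \mathbb{E}_{x \sim P_r}\left[ D(x) \right]
    + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\left[ \left( \lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1 \right)^2 \right],
\qquad
\hat{x} = \epsilon x + (1 - \epsilon)\tilde{x}, \quad \epsilon \sim U[0, 1]
```

The last term pushes the norm of the critic's gradient toward 1 at points interpolated between real and generated samples, which is how this variant enforces the Lipschitz constraint without weight clipping.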
Affiliation(s)
- Lina Abou-Abbas
- Applied Artificial Intelligence Institute (I2A), TELUQ University, Montreal, QC, Canada
- Department of Science and Technology, TELUQ University, Montreal, QC, Canada
- Department of Electrical and Computer Engineering, Lebanese American University, Byblos, Lebanon
| | - Khadidja Henni
- Applied Artificial Intelligence Institute (I2A), TELUQ University, Montreal, QC, Canada
- Department of Science and Technology, TELUQ University, Montreal, QC, Canada
| | - Imene Jemal
- Department of Science and Technology, TELUQ University, Montreal, QC, Canada
| | - Neila Mezghani
- Applied Artificial Intelligence Institute (I2A), TELUQ University, Montreal, QC, Canada
- Department of Science and Technology, TELUQ University, Montreal, QC, Canada
| |
|
2
|
Wang Y, Liu S, Spiteri AG, Huynh ALH, Chu C, Masters CL, Goudey B, Pan Y, Jin L. Understanding machine learning applications in dementia research and clinical practice: a review for biomedical scientists and clinicians. Alzheimers Res Ther 2024; 16:175. [PMID: 39085973 PMCID: PMC11293066 DOI: 10.1186/s13195-024-01540-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/22/2024] [Accepted: 07/21/2024] [Indexed: 08/02/2024]
Abstract
Several (inter)national longitudinal dementia observational datasets, encompassing demographic information, neuroimaging, biomarkers, neuropsychological evaluations, and multi-omics data, have ushered in a new era of potential for integrating machine learning (ML) into dementia research and clinical practice. ML, with its proficiency in handling multi-modal and high-dimensional data, has emerged as an innovative technique to facilitate early diagnosis and differential diagnosis, and to predict the onset and progression of mild cognitive impairment and dementia. In this review, we evaluate current and potential applications of ML, including its history in dementia research, how it compares to traditional statistics, the types of datasets it uses and the general workflow. Moreover, we identify the technical barriers and challenges of ML implementations in clinical practice. Overall, this review provides a comprehensive understanding of ML with non-technical explanations for broader accessibility to biomedical scientists and clinicians.
Affiliation(s)
- Yihan Wang
- The Florey Institute of Neuroscience and Mental Health, 30 Royal Parade, Parkville, VIC, 3052, Australia
- Florey Department of Neuroscience and Mental Health, The University of Melbourne, 30 Royal Parade, Parkville, VIC, 3052, Australia
| | - Shu Liu
- The Florey Institute of Neuroscience and Mental Health, 30 Royal Parade, Parkville, VIC, 3052, Australia
- Florey Department of Neuroscience and Mental Health, The University of Melbourne, 30 Royal Parade, Parkville, VIC, 3052, Australia
- The ARC Training Centre in Cognitive Computing for Medical Technologies, The University of Melbourne, Carlton, VIC, 3010, Australia
| | - Alanna G Spiteri
- The Florey Institute of Neuroscience and Mental Health, 30 Royal Parade, Parkville, VIC, 3052, Australia
| | - Andrew Liem Hieu Huynh
- Department of Aged Care, Austin Health, Heidelberg, VIC, 3084, Australia
- Department of Medicine, Austin Health, University of Melbourne, Heidelberg, VIC, 3084, Australia
| | - Chenyin Chu
- The Florey Institute of Neuroscience and Mental Health, 30 Royal Parade, Parkville, VIC, 3052, Australia
- Florey Department of Neuroscience and Mental Health, The University of Melbourne, 30 Royal Parade, Parkville, VIC, 3052, Australia
| | - Colin L Masters
- The Florey Institute of Neuroscience and Mental Health, 30 Royal Parade, Parkville, VIC, 3052, Australia
| | - Benjamin Goudey
- Florey Department of Neuroscience and Mental Health, The University of Melbourne, 30 Royal Parade, Parkville, VIC, 3052, Australia
- The ARC Training Centre in Cognitive Computing for Medical Technologies, The University of Melbourne, Carlton, VIC, 3010, Australia
| | - Yijun Pan
- The Florey Institute of Neuroscience and Mental Health, 30 Royal Parade, Parkville, VIC, 3052, Australia
- Florey Department of Neuroscience and Mental Health, The University of Melbourne, 30 Royal Parade, Parkville, VIC, 3052, Australia
| | - Liang Jin
- The Florey Institute of Neuroscience and Mental Health, 30 Royal Parade, Parkville, VIC, 3052, Australia
- Florey Department of Neuroscience and Mental Health, The University of Melbourne, 30 Royal Parade, Parkville, VIC, 3052, Australia
| |
|
3
|
Adeoye J, Su YX. Artificial intelligence in salivary biomarker discovery and validation for oral diseases. Oral Dis 2024; 30:23-37. [PMID: 37335832 DOI: 10.1111/odi.14641] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Received: 03/17/2023] [Revised: 05/19/2023] [Accepted: 05/28/2023] [Indexed: 06/21/2023]
Abstract
Salivary biomarkers can improve the efficacy, efficiency, and timeliness of oral and maxillofacial disease diagnosis and monitoring. Oral and maxillofacial conditions in which salivary biomarkers have been utilized for disease-related outcomes include periodontal diseases, dental caries, oral cancer, temporomandibular joint dysfunction, and salivary gland diseases. However, given the equivocal accuracy of salivary biomarkers during validation, incorporating contemporary analytical techniques for biomarker selection and operationalization from the abundant multi-omics data available may help improve biomarker performance. Artificial intelligence represents one such advanced approach that may optimize the potential of salivary biomarkers to diagnose and manage oral and maxillofacial diseases. Therefore, this review summarized the role and current application of techniques based on artificial intelligence for salivary biomarker discovery and validation in oral and maxillofacial diseases.
Affiliation(s)
- John Adeoye
- Division of Oral and Maxillofacial Surgery, Faculty of Dentistry, University of Hong Kong, Hong Kong SAR, China
| | - Yu-Xiong Su
- Division of Oral and Maxillofacial Surgery, Faculty of Dentistry, University of Hong Kong, Hong Kong SAR, China
| |
|
4
|
Meng Z, Iaboni A, Ye B, Newman K, Mihailidis A, Deng Z, Khan SS. Undersampling and cumulative class re-decision methods to improve detection of agitation in people with dementia. Biomed Eng Lett 2024; 14:69-78. [PMID: 38186943 PMCID: PMC10769992 DOI: 10.1007/s13534-023-00313-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/28/2023] [Revised: 07/10/2023] [Accepted: 08/14/2023] [Indexed: 01/09/2024] Open
Abstract
Agitation is one of the most prevalent symptoms in people with dementia (PwD) and can put their own safety and that of their caregivers at risk. Developing objective agitation detection approaches is important to support the health and safety of PwD living in a residential setting. In a previous study, we collected multimodal wearable sensor data from 17 participants for 600 days and developed machine learning models for detecting agitation in 1-min windows. However, the dataset has significant limitations, such as class imbalance and potentially imprecise labels, because agitation occurs much more rarely than normal behaviours. In this paper, we first implemented different undersampling methods to eliminate the imbalance problem, and concluded that only 20% of the normal behaviour data were adequate to train a competitive agitation detection model. Then, we designed a weighted undersampling method to evaluate the manual labeling mechanism given the ambiguous time interval assumption. After that, the postprocessing method of cumulative class re-decision (CCR) was proposed based on the historical sequential information and the continuity characteristic of agitation, improving the decision-making performance for the potential application of an agitation detection system. The results showed that a combination of undersampling and CCR improved F1-score and other metrics to varying degrees with less training time and data. Supplementary Information: The online version contains supplementary material available at 10.1007/s13534-023-00313-8.
Affiliation(s)
- Zhidong Meng
- School of Automation, Beijing Institute of Technology, Beijing, 100081 China
- KITE—Toronto Rehabilitation Institute, University Health Network, Toronto, ON M5G2A2 Canada
- Institute of Biomedical Engineering, University of Toronto, Toronto, ON M5S3G9 Canada
| | - Andrea Iaboni
- KITE—Toronto Rehabilitation Institute, University Health Network, Toronto, ON M5G2A2 Canada
- Department of Psychiatry, University of Toronto, Toronto, ON M5T1R8 Canada
| | - Bing Ye
- Institute of Biomedical Engineering, University of Toronto, Toronto, ON M5S3G9 Canada
| | - Kristine Newman
- Daphne Cockwell School of Nursing, Ryerson University, Toronto, ON M5B1Z5 Canada
| | - Alex Mihailidis
- KITE—Toronto Rehabilitation Institute, University Health Network, Toronto, ON M5G2A2 Canada
- Institute of Biomedical Engineering, University of Toronto, Toronto, ON M5S3G9 Canada
| | - Zhihong Deng
- School of Automation, Beijing Institute of Technology, Beijing, 100081 China
| | - Shehroz S. Khan
- KITE—Toronto Rehabilitation Institute, University Health Network, Toronto, ON M5G2A2 Canada
- Institute of Biomedical Engineering, University of Toronto, Toronto, ON M5S3G9 Canada
| |
|
5
|
Le ND, Nguyen NTH. A metric learning-based method for biomedical entity linking. Front Res Metr Anal 2023; 8:1247094. [PMID: 38173988 PMCID: PMC10762861 DOI: 10.3389/frma.2023.1247094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/25/2023] [Accepted: 11/29/2023] [Indexed: 01/05/2024] Open
Abstract
Biomedical entity linking is the task of mapping mention(s) that occur in a particular textual context to a unique concept or entity in a knowledge base, e.g., the Unified Medical Language System (UMLS). One of the most challenging aspects of the entity linking task is the ambiguity of mentions, i.e., (1) mentions whose surface forms are very similar, but which map to different entities in different contexts, and (2) entities that can be expressed using diverse types of mentions. Recent studies have used BERT-based encoders to encode mentions and entities into distinguishable representations such that their similarity can be measured using distance metrics. However, most real-world biomedical datasets suffer from severe imbalance, i.e., some classes have many instances while others appear only once or are completely absent from the training data. A common way to address this issue is to down-sample the dataset, i.e., to reduce the number of instances of the majority classes to make the dataset more balanced. In the context of entity linking, down-sampling reduces the ability of the model to comprehensively learn the representations of mentions in different contexts, which is very important. To tackle this issue, we propose a metric-based learning method that treats a given entity and its mentions as a whole, regardless of the number of mentions in the training set. Specifically, our method uses a triplet loss-based function in conjunction with a clustering technique to learn the representation of mentions and entities. Through evaluations on two challenging biomedical datasets, i.e., MedMentions and BC5CDR, we show that our proposed method is able to address the issue of imbalanced data and to perform competitively with other state-of-the-art models. Moreover, our method significantly reduces computational cost in both training and inference steps. Our source code is publicly available here.
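As a rough illustration of the triplet loss idea mentioned in this abstract, the sketch below computes the common margin-based triplet loss on plain coordinate tuples. It is not the authors' loss function: the Euclidean distance, the margin of 1.0, and the function name `triplet_loss` are assumptions made for illustration, and real systems would apply this to learned embeddings rather than hand-written vectors.

```python
import math


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Margin-based triplet loss on three vectors (illustrative sketch).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    d_pos = math.dist(anchor, positive)  # distance anchor -> same-entity example
    d_neg = math.dist(anchor, negative)  # distance anchor -> other-entity example
    return max(0.0, d_pos - d_neg + margin)


# Negative already far away: no penalty.
easy = triplet_loss((0.0, 0.0), (0.0, 1.0), (3.0, 4.0))
# Negative only slightly farther than the positive: penalised.
hard = triplet_loss((0.0, 0.0), (0.0, 1.0), (0.0, 1.5))
print(easy, hard)
```

In an entity linking setting the anchor would typically be a mention embedding, the positive its gold entity (or another mention of it), and the negative a different entity.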
Affiliation(s)
- Ngoc D. Le
- Faculty of Information Technology, University of Science, Ho Chi Minh City, Vietnam
- Vietnam National University, Ho Chi Minh City, Vietnam
| | - Nhung T. H. Nguyen
- Department of Computer Science, School of Engineering, University of Manchester, Manchester, United Kingdom
| |
|
6
|
Tong B, Zhou Z, Tarzanagh DA, Hou B, Saykin AJ, Moore J, Ritchie M, Shen L. Class-Balanced Deep Learning with Adaptive Vector Scaling Loss for Dementia Stage Detection. MACHINE LEARNING IN MEDICAL IMAGING. MLMI (WORKSHOP) 2023; 14349:144-154. [PMID: 38463442 PMCID: PMC10924683 DOI: 10.1007/978-3-031-45676-3_15] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/12/2024]
Abstract
Alzheimer's disease (AD) leads to irreversible cognitive decline, with Mild Cognitive Impairment (MCI) as its prodromal stage. Early detection of AD and related dementia is crucial for timely treatment and slowing disease progression. However, classifying cognitive normal (CN), MCI, and AD subjects using machine learning models faces class imbalance, necessitating the use of balanced accuracy as a suitable metric. To enhance model performance and balanced accuracy, we introduce a novel method called VS-Opt-Net. This approach incorporates the recently developed vector-scaling (VS) loss into a machine learning pipeline named STREAMLINE. Moreover, it employs Bayesian optimization for hyperparameter learning of both the model and loss function. VS-Opt-Net not only amplifies the contribution of minority examples in proportion to the imbalance level but also addresses the challenge of generalization in training deep networks. In our empirical study, we use MRI-based brain regional measurements as features to conduct the CN vs MCI and AD vs MCI binary classifications. We compare the balanced accuracy of our model with other machine learning models and deep neural network loss functions that also employ class-balanced strategies. Our findings demonstrate that after hyperparameter optimization, the deep neural network using the VS loss function substantially improves balanced accuracy. It also surpasses other models in performance on the AD dataset. Moreover, our feature importance analysis highlights VS-Opt-Net's ability to elucidate biomarker differences across dementia stages.
Affiliation(s)
- Boning Tong
- University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Zhuoping Zhou
- University of Pennsylvania, Philadelphia, PA 19104, USA
| | | | - Bojian Hou
- University of Pennsylvania, Philadelphia, PA 19104, USA
| | | | - Jason Moore
- Cedars-Sinai Medical Center, Los Angeles, CA 90069, USA
| | | | - Li Shen
- University of Pennsylvania, Philadelphia, PA 19104, USA
| |
|
7
|
Yi F, Yang H, Chen D, Qin Y, Han H, Cui J, Bai W, Ma Y, Zhang R, Yu H. XGBoost-SHAP-based interpretable diagnostic framework for alzheimer's disease. BMC Med Inform Decis Mak 2023; 23:137. [PMID: 37491248 PMCID: PMC10369804 DOI: 10.1186/s12911-023-02238-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/09/2022] [Accepted: 07/13/2023] [Indexed: 07/27/2023] Open
Abstract
BACKGROUND Due to the class imbalance issue faced when Alzheimer's disease (AD) develops from normal cognition (NC) to mild cognitive impairment (MCI), present clinical practice is met with challenges regarding the auxiliary diagnosis of AD using machine learning (ML). This leads to low diagnosis performance. We aimed to construct an interpretable framework, extreme gradient boosting-Shapley additive explanations (XGBoost-SHAP), to handle the imbalance among different AD progression statuses at the algorithmic level. We also sought to achieve multiclassification of NC, MCI, and AD. METHODS We obtained patient data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database, including clinical information, neuropsychological test results, neuroimaging-derived biomarkers, and APOE-ε4 gene statuses. First, three feature selection algorithms were applied, and they were then included in the XGBoost algorithm. Due to the imbalance among the three classes, we changed the sample weight distribution to achieve multiclassification of NC, MCI, and AD. Then, the SHAP method was linked to XGBoost to form an interpretable framework. This framework utilized attribution ideas that quantified the impacts of model predictions into numerical values and analysed them based on their directions and sizes. Subsequently, the top 10 features (optimal subset) were used to simplify the clinical decision-making process, and their performance was compared with that of a random forest (RF), Bagging, AdaBoost, and a naive Bayes (NB) classifier. Finally, the National Alzheimer's Coordinating Center (NACC) dataset was employed to assess the impact path consistency of the features within the optimal subset. RESULTS Compared to the RF, Bagging, AdaBoost, NB and XGBoost (unweighted), the interpretable framework had higher classification performance with accuracy improvements of 0.74%, 0.74%, 1.46%, 13.18%, and 0.83%, respectively. 
The framework achieved high sensitivity (81.21%/74.85%), specificity (92.18%/89.86%), accuracy (87.57%/80.52%), area under the receiver operating characteristic curve (AUC) (0.91/0.88), positive clinical utility index (0.71/0.56), and negative clinical utility index (0.75/0.68) on the ADNI and NACC datasets, respectively. In the ADNI dataset, the top 10 features were found to have varying associations with the risk of AD onset based on their SHAP values. Specifically, the higher SHAP values of CDRSB, ADAS13, ADAS11, ventricle volume, ADASQ4, and FAQ were associated with higher risks of AD onset. Conversely, the higher SHAP values of LDELTOTAL, mPACCdigit, RAVLT_immediate, and MMSE were associated with lower risks of AD onset. Similar results were found for the NACC dataset. CONCLUSIONS The proposed interpretable framework contributes to achieving excellent performance in imbalanced AD multiclassification tasks and provides scientific guidance (optimal subset) for clinical decision-making, thereby facilitating disease management and offering new research ideas for optimizing AD prevention and treatment programs.
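The abstract says the sample weight distribution was changed to handle the imbalance among NC, MCI, and AD, but does not give the scheme. One common choice, shown here purely as an assumption, is inverse-frequency ("balanced") weighting, where each sample of class c gets weight N / (K * n_c) for N samples, K classes, and n_c samples in class c. The function name and the toy label counts are hypothetical.

```python
from collections import Counter


def balanced_sample_weights(labels):
    """Inverse-frequency weights so every class carries equal total weight.

    This is one common weighting scheme, assumed for illustration; it is
    not confirmed to be the scheme used in the paper.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return [n_samples / (n_classes * counts[y]) for y in labels]


# Toy label vector with a 6:3:1 imbalance (hypothetical counts).
labels = ["NC"] * 6 + ["MCI"] * 3 + ["AD"] * 1
weights = balanced_sample_weights(labels)

# Total weight per class is N / K for every class.
per_class = {c: sum(w for w, y in zip(weights, labels) if y == c) for c in set(labels)}
print(per_class)
```

Gradient boosting libraries generally accept a per-sample weight vector like this at fit time, so the rarest class contributes as much to the loss as the most common one.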
Affiliation(s)
- Fuliang Yi
- Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, 030001 P.R. China
| | - Hui Yang
- Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, 030001 P.R. China
| | - Durong Chen
- Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, 030001 P.R. China
| | - Yao Qin
- Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, 030001 P.R. China
| | - Hongjuan Han
- Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, 030001 P.R. China
| | - Jing Cui
- Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, 030001 P.R. China
| | - Wenlin Bai
- Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, 030001 P.R. China
| | - Yifei Ma
- Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, 030001 P.R. China
| | - Rong Zhang
- Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, 030001 P.R. China
| | - Hongmei Yu
- Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, 030001 P.R. China
- Shanxi Provincial Key Laboratory of Major Diseases Risk Assessment, Taiyuan, China
| |
|
8
|
Thölke P, Mantilla-Ramos YJ, Abdelhedi H, Maschke C, Dehgan A, Harel Y, Kemtur A, Mekki Berrada L, Sahraoui M, Young T, Bellemare Pépin A, El Khantour C, Landry M, Pascarella A, Hadid V, Combrisson E, O'Byrne J, Jerbi K. Class imbalance should not throw you off balance: Choosing the right classifiers and performance metrics for brain decoding with imbalanced data. Neuroimage 2023:120253. [PMID: 37385392 DOI: 10.1016/j.neuroimage.2023.120253] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Received: 04/18/2023] [Revised: 06/05/2023] [Accepted: 06/26/2023] [Indexed: 07/01/2023] Open
Abstract
Machine learning (ML) is increasingly used in cognitive, computational and clinical neuroscience. The reliable and efficient application of ML requires a sound understanding of its subtleties and limitations. Training ML models on datasets with imbalanced classes is a particularly common problem, and it can have severe consequences if not adequately addressed. With the neuroscience ML user in mind, this paper provides a didactic assessment of the class imbalance problem and illustrates its impact through systematic manipulation of data imbalance ratios in (i) simulated data and (ii) brain data recorded with electroencephalography (EEG), magnetoencephalography (MEG) and functional magnetic resonance imaging (fMRI). Our results illustrate how the widely-used Accuracy (Acc) metric, which measures the overall proportion of successful predictions, yields misleadingly high performance as class imbalance increases. Because Acc weights the per-class ratios of correct predictions proportionally to class size, it largely disregards the performance on the minority class. A binary classification model that learns to systematically vote for the majority class will yield an artificially high decoding accuracy that directly reflects the imbalance between the two classes, rather than any genuine generalizable ability to discriminate between them. We show that other evaluation metrics, such as the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) and the less common Balanced Accuracy (BAcc) metric, defined as the arithmetic mean of sensitivity and specificity, provide more reliable performance evaluations for imbalanced data. Our findings also highlight the robustness of Random Forest (RF), and the benefits of using stratified cross-validation and hyperparameter optimization to tackle data imbalance.
Critically, for neuroscience ML applications that seek to minimize overall classification error, we recommend the routine use of BAcc, which in the specific case of balanced data is equivalent to using standard Acc, and readily extends to multi-class settings. Importantly, we present a list of recommendations for dealing with imbalanced data, as well as open-source code to allow the neuroscience community to replicate and extend our observations and explore alternative approaches to coping with imbalanced data.
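The contrast between Acc and BAcc described in this abstract can be reproduced with a few lines of arithmetic. The sketch below is not the authors' released code; the helper names and the 90/10 toy split are assumptions chosen to show a classifier that always votes for the majority class.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for a binary problem.

    Assumes both classes occur in y_true; otherwise a division by zero occurs.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    return tp / (tp + fn), tn / (tn + fp)


def accuracy(y_true, y_pred):
    """Overall proportion of correct predictions."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def balanced_accuracy(y_true, y_pred, positive=1):
    """Arithmetic mean of sensitivity and specificity."""
    sens, spec = sensitivity_specificity(y_true, y_pred, positive)
    return (sens + spec) / 2


# 90 majority-class samples, 10 minority-class samples (toy data).
y_true = [0] * 90 + [1] * 10
# A degenerate model that always predicts the majority class.
y_pred = [0] * 100

print(accuracy(y_true, y_pred))           # 0.9
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

Accuracy simply mirrors the 90/10 class ratio here, while balanced accuracy stays at the chance level of 0.5 because the minority class is never detected.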
Affiliation(s)
- Philipp Thölke
- Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada; Institute of Cognitive Science, Osnabrück University, Neuer Graben 29/Schloss, Osnabrück, 49074, Lower Saxony, Germany.
| | - Yorguin-Jose Mantilla-Ramos
- Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada; Neuropsychology and Behavior Group (GRUNECO), Faculty of Medicine, Universidad de Antioquia, 53-108, Medellin, Aranjuez, Medellin, 050010, Colombia
| | - Hamza Abdelhedi
- Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada
| | - Charlotte Maschke
- Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada; Integrated Program in Neuroscience, McGill University, 1033 Pine Ave, Montreal, H3A 0G4, Canada
| | - Arthur Dehgan
- Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada; Institut de Neurosciences de la Timone (INT), CNRS, Aix Marseille University, Marseille, 13005, France
| | - Yann Harel
- Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada
| | - Anirudha Kemtur
- Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada
| | - Loubna Mekki Berrada
- Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada
| | - Myriam Sahraoui
- Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada
| | - Tammy Young
- Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada; Department of Computing Science, University of Alberta, 116 St & 85 Ave, Edmonton, T6G 2R3, AB, Canada
| | - Antoine Bellemare Pépin
- Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada; Department of Music, Concordia University, 1550 De Maisonneuve Blvd. W., Montreal, H3H 1G8, QC, Canada
| | - Clara El Khantour
- Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada
| | - Mathieu Landry
- Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada
| | - Annalisa Pascarella
- Institute for Applied Mathematics Mauro Picone, National Research Council, Roma, Italy
| | - Vanessa Hadid
- Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada
| | - Etienne Combrisson
- Institut de Neurosciences de la Timone (INT), CNRS, Aix Marseille University, Marseille, 13005, France
| | - Jordan O'Byrne
- Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada
| | - Karim Jerbi
- Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada; Mila (Quebec Machine Learning Institute), 6666 Rue Saint-Urbain, Montreal, H2S 3H1, QC, Canada; UNIQUE Centre (Quebec Neuro-AI Research Centre), 3744 rue Jean-Brillant, Montreal, H3T 1P1, QC, Canada
| |
|
9
|
Shao R, Sim A, Wu K, Kim J. Leveraging History to Predict Infrequent Abnormal Transfers in Distributed Workflows. SENSORS (BASEL, SWITZERLAND) 2023; 23:5485. [PMID: 37420657 DOI: 10.3390/s23125485] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/01/2023] [Revised: 05/29/2023] [Accepted: 06/06/2023] [Indexed: 07/09/2023]
Abstract
Scientific computing heavily relies on data shared by the community, especially in distributed data-intensive applications. This research focuses on predicting slow connections that create bottlenecks in distributed workflows. In this study, we analyze network traffic logs collected between January 2021 and August 2022 at the National Energy Research Scientific Computing Center (NERSC). Based on the observed patterns, we define a set of features primarily based on history for identifying low-performing data transfers. Typically, there are far fewer slow connections on well-maintained networks, which creates difficulty in learning to identify these abnormally slow connections from the normal ones. We devise several stratified sampling techniques to address the class-imbalance challenge and study how they affect the machine learning approaches. Our tests show that a relatively simple technique that undersamples the normal cases to balance the number of samples in two classes (normal and slow) is very effective for model training. This model predicts slow connections with an F1 score of 0.926.
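The "relatively simple technique" of undersampling the normal cases until both classes have the same number of samples can be sketched as follows. This is a generic random undersampler, not the authors' pipeline; the function name, the seed, and the toy 95/5 split are assumptions, and the paper's stratified variants are not reproduced here.

```python
import random


def undersample_to_balance(samples, labels, seed=0):
    """Randomly drop samples so every class keeps as many as the rarest class."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable

    # Group samples by class label.
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    n_keep = min(len(rows) for rows in by_class.values())

    balanced_x, balanced_y = [], []
    for label, rows in by_class.items():
        kept = rng.sample(rows, n_keep)  # sampling without replacement
        balanced_x.extend(kept)
        balanced_y.extend([label] * n_keep)
    return balanced_x, balanced_y


# Toy data: 95 "normal" transfers and 5 "slow" ones (hypothetical counts).
samples = list(range(100))
labels = ["normal"] * 95 + ["slow"] * 5

bal_x, bal_y = undersample_to_balance(samples, labels)
print(len(bal_x), bal_y.count("normal"), bal_y.count("slow"))  # 10 5 5
```

Undersampling should be applied to the training split only, so that evaluation still reflects the true class ratio.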
Affiliation(s)
- Robin Shao
- EECS, University of California at Berkeley, Berkeley, CA 94720, USA
| | - Alex Sim
- Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Kesheng Wu
- Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Jinoh Kim
- Computer Science Department, Texas A&M University, Commerce, TX 75428, USA
| |
|
10
|
Kosolwattana T, Liu C, Hu R, Han S, Chen H, Lin Y. A self-inspected adaptive SMOTE algorithm (SASMOTE) for highly imbalanced data classification in healthcare. BioData Min 2023; 16:15. [PMID: 37098549 PMCID: PMC10131309 DOI: 10.1186/s13040-023-00330-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 05/12/2022] [Accepted: 03/09/2023] [Indexed: 04/27/2023] Open
Abstract
In many healthcare applications, datasets for classification may be highly imbalanced due to the rare occurrence of target events such as disease onset. The SMOTE (Synthetic Minority Over-sampling Technique) algorithm has been developed as an effective resampling method for imbalanced data classification by oversampling samples from the minority class. However, samples generated by SMOTE may be ambiguous, low-quality and non-separable with the majority class. To enhance the quality of generated samples, we proposed a novel self-inspected adaptive SMOTE (SASMOTE) model that leverages an adaptive nearest neighborhood selection algorithm to identify the "visible" nearest neighbors, which are used to generate samples likely to fall into the minority class. To further enhance the quality of the generated samples, an uncertainty elimination via self-inspection approach is introduced in the proposed SASMOTE model. Its objective is to filter out the generated samples that are highly uncertain and inseparable with the majority class. The effectiveness of the proposed algorithm is compared with existing SMOTE-based algorithms and demonstrated through two real-world case studies in healthcare, including risk gene discovery and fatal congenital heart disease prediction. By generating the higher quality synthetic samples, the proposed algorithm is able to help achieve better prediction performance (in terms of F1 score) on average compared to the other methods, which is promising to enhance the usability of machine learning models on highly imbalanced healthcare data.
Affiliation(s)
| | - Chenang Liu
- School of Industrial Engineering & Management, Oklahoma State University, Stillwater, USA
| | - Renjie Hu
- Department of Information and Logistics Technology, University of Houston, Houston, USA
| | - Shizhong Han
- Department of Psychiatry and Behavioral Sciences, Johns Hopkins School of Medicine, Baltimore, USA
- Lieber Institute for Brain Development, Baltimore, USA
| | - Hua Chen
- Department of Pharmaceutical Health Outcomes and Policy, University of Houston, Houston, USA
| | - Ying Lin
- Department of Industrial Engineering, University of Houston, Houston, USA.
| |
|
11
|
Machine learning to improve frequent emergency department use prediction: a retrospective cohort study. Sci Rep 2023; 13:1981. [PMID: 36737625 PMCID: PMC9898278 DOI: 10.1038/s41598-023-27568-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/23/2022] [Accepted: 01/04/2023] [Indexed: 02/05/2023] Open
Abstract
Frequent emergency department use is associated with many adverse events, such as increased risk for hospitalization and mortality. Frequent users have complex needs and associated factors are commonly evaluated using logistic regression. However, other machine learning models, especially those exploiting the potential of large databases, have been less explored. This study aims at comparing the performance of logistic regression to four machine learning models for predicting frequent emergency department use in an adult population with chronic diseases, in the province of Quebec (Canada). This is a retrospective population-based study using medical and administrative databases from the Régie de l'assurance maladie du Québec. Two definitions were used for frequent emergency department use (outcome to predict): having at least three or at least five visits during a one-year period. Independent variables included sociodemographic characteristics, healthcare service use, and chronic diseases. We compared the performance of logistic regression with gradient boosting machine, naïve Bayes, neural networks, and random forests (binary and continuous outcome) using area under the ROC curve, sensitivity, specificity, positive predictive value, and negative predictive value. Out of 451,775 ED users, 43,151 (9.5%) and 13,676 (3.0%) were frequent users with at least three and five visits per year, respectively. Random forests with a binary outcome had the lowest performance (ROC curve: 53.8 [95% confidence interval 53.5-54.0] and 51.4 [95% confidence interval 51.1-51.8] for frequent users 3 and 5, respectively) while the other models had superior and overall similar performance. The most important variable in prediction was the number of emergency department visits in the previous year. No model outperformed the others.
Innovations in algorithms may slightly refine current predictions, but access to other variables may be more helpful in the case of frequent emergency department use prediction.
|
12
|
The framing of time-dependent machine learning models improves risk estimation among young individuals with acute coronary syndromes. Sci Rep 2023; 13:1021. [PMID: 36658176 PMCID: PMC9852445 DOI: 10.1038/s41598-023-27776-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/04/2022] [Accepted: 01/09/2023] [Indexed: 01/20/2023] Open
Abstract
Acute coronary syndrome (ACS) is a common cause of death in individuals older than 55 years. Although younger individuals are less frequently seen with ACS, this clinical event has increasing incidence trends, shows high recurrence rates and triggers considerable economic burden. Young individuals with ACS (yACS) are usually underrepresented and show idiosyncratic epidemiologic features compared to older subjects. These differences may justify why available risk prediction models usually penalize yACS with higher false positive rates compared to older subjects. We hypothesized that exploring temporal framing structures such as prediction time, observation windows and subgroup-specific prediction, could improve time-dependent prediction metrics. Among individuals who have experienced ACS (n_global_cohort = 6341 and n_yACS = 2242), the predictive accuracy for adverse clinical events was optimized by using specific rules for yACS and splitting short-term and long-term prediction windows, leading to the detection of 80% of events, compared to 69% by using a rule designed for the global cohort.
|
13
|
Elreedy D, Atiya AF, Kamalov F. A theoretical distribution analysis of synthetic minority oversampling technique (SMOTE) for imbalanced learning. Mach Learn 2023. [DOI: 10.1007/s10994-022-06296-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/07/2023]
Abstract
Class imbalance occurs when the class distribution is not equal. Namely, one class is under-represented (minority class), and the other class has significantly more samples in the data (majority class). The class imbalance problem is prevalent in many real world applications. Generally, the under-represented minority class is the class of interest. The synthetic minority over-sampling technique (SMOTE) method is considered the most prominent method for handling unbalanced data. The SMOTE method generates new synthetic data patterns by performing linear interpolation between minority class samples and their K nearest neighbors. However, the SMOTE generated patterns do not necessarily conform to the original minority class distribution. This paper develops a novel theoretical analysis of the SMOTE method by deriving the probability distribution of the SMOTE generated samples. To the best of our knowledge, this is the first work deriving a mathematical formulation for the SMOTE patterns’ probability distribution. This allows us to compare the density of the generated samples with the true underlying class-conditional density, in order to assess how representative the generated samples are. The derived formula is verified by computing it on a number of densities versus densities computed and estimated empirically.
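The interpolation step that this abstract attributes to SMOTE can be sketched as below. It is a simplified illustration under stated assumptions (Euclidean distance, uniformly chosen base sample and neighbor, a uniform interpolation factor in [0, 1)), not the paper's analysis or a reference implementation; the name `smote_like_samples` and the toy minority data are hypothetical.

```python
import numpy as np

def smote_like_samples(X_min, k=5, n_new=10, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # a point cannot be its own neighbor
    # Pairwise squared Euclidean distances between minority samples.
    d = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)
    neighbors = np.argsort(d, axis=1)[:, :k]  # shape (n, k)
    base = rng.integers(0, n, size=n_new)  # base sample per new point
    nb = neighbors[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    # New point lies on the segment from the base sample to its neighbor.
    return X_min[base] + gap * (X_min[nb] - X_min[base])

# Hypothetical minority-class data: 30 points in two dimensions.
X_min = np.random.default_rng(1).normal(size=(30, 2))
synthetic = smote_like_samples(X_min, k=5, n_new=20)
```

Because each synthetic point lies on a segment between two existing minority samples, the generated set stays inside the convex hull of the minority class, which is one reason its density can differ from the true class-conditional density.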
|
14
|
Carrington AM, Manuel DG, Fieguth PW, Ramsay T, Osmani V, Wernly B, Bennett C, Hawken S, Magwood O, Sheikh Y, McInnes M, Holzinger A. Deep ROC Analysis and AUC as Balanced Average Accuracy, for Improved Classifier Selection, Audit and Explanation. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2023; 45:329-341. [PMID: 35077357 DOI: 10.1109/tpami.2022.3145392] [Citation(s) in RCA: 46] [Impact Index Per Article: 23.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/14/2023]
Abstract
Optimal performance is desired for decision-making in any field with binary classifiers and diagnostic tests; however, common performance measures lack depth in information. The area under the receiver operating characteristic curve (AUC) and the area under the precision recall curve are too general because they evaluate all decision thresholds including unrealistic ones. Conversely, accuracy, sensitivity, specificity, positive predictive value and the F1 score are too specific: they are measured at a single threshold that is optimal for some instances, but not others, which is not equitable. In between both approaches, we propose deep ROC analysis to measure performance in multiple groups of predicted risk (like calibration), or groups of true positive rate or false positive rate. In each group, we measure the group AUC (properly), normalized group AUC, and averages of: sensitivity, specificity, positive and negative predictive value, and likelihood ratio positive and negative. The measurements can be compared between groups, to whole measures, to point measures and between models. We also provide a new interpretation of AUC in whole or part, as balanced average accuracy, relevant to individuals instead of pairs. We evaluate models in three case studies using our method and Python toolkit and confirm its utility.
|
15
|
Afrose S, Song W, Nemeroff CB, Lu C, Yao D. Subpopulation-specific machine learning prognosis for underrepresented patients with double prioritized bias correction. COMMUNICATIONS MEDICINE 2022; 2:111. [PMID: 36059892 PMCID: PMC9436942 DOI: 10.1038/s43856-022-00165-w] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/23/2021] [Accepted: 07/27/2022] [Indexed: 11/09/2022] Open
Abstract
Background
Many clinical datasets are intrinsically imbalanced, dominated by overwhelming majority groups. Off-the-shelf machine learning models that optimize the prognosis of majority patient types (e.g., healthy class) may cause substantial errors on the minority prediction class (e.g., disease class) and demographic subgroups (e.g., Black or young patients). In the typical one-machine-learning-model-fits-all paradigm, racial and age disparities are likely to exist, but unreported. In addition, some widely used whole-population metrics give misleading results.
Methods
We design a double prioritized (DP) bias correction technique to mitigate representational biases in machine learning-based prognosis. Our method trains customized machine learning models for specific ethnicity or age groups, a substantial departure from the one-model-predicts-all convention. We compare with other sampling and reweighting techniques in mortality and cancer survivability prediction tasks.
Results
We first provide empirical evidence showing various prediction deficiencies in a typical machine learning setting without bias correction. For example, missed death cases are 3.14 times higher than missed survival cases for mortality prediction. Then, we show DP consistently boosts the minority class recall for underrepresented groups, by up to 38.0%. DP also reduces relative disparities across race and age groups, e.g., up to 88.0% better than the 8 existing sampling solutions in terms of the relative disparity of minority class recall. Cross-race and cross-age-group evaluation also suggests the need for subpopulation-specific machine learning models.
Conclusions
Biases exist in the widely accepted one-machine-learning-model-fits-all-population approach. We invent a bias correction method that produces specialized machine learning prognostication models for underrepresented racial and age groups. This technique may reduce potentially life-threatening prediction mistakes for minority populations.
|
16
|
Recognition of the Multi-class Schizophrenia Based on the Resting-State EEG Network Topology. Brain Topogr 2022; 35:495-506. [PMID: 35849250 DOI: 10.1007/s10548-022-00907-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/12/2021] [Accepted: 06/02/2022] [Indexed: 11/02/2022]
Abstract
The clinical therapy of schizophrenia (SCZ) relies on accurate and reliable recognition of the disorder. Despite considerable effort, the diagnosis of SCZ is still largely subjective, so objective physiological parameters are urgently needed. Motivated by the great potential of resting-state networks for characterizing the brain deficits among different SCZ groups, in this study we developed a multi-class feature extraction approach that could effectively extract the spatial network topology and facilitate the recognition of SCZ, by combining a network structure based supervised learning with an ensemble co-decision strategy. The results demonstrated that the multi-class spatial pattern of the network (MSPN) features outperformed the other conventional electrophysiological features, such as relative power spectrums and network properties, and achieved the highest classification accuracy of 71.58% in the alpha band. These findings validate that the resting-state MSPN is a promising tool for the clinical assessment of SCZ.
|
17
|
Pelosi B. Developing a bioinformatics pipeline for comparative protein classification analysis. BMC Genom Data 2022; 23:43. [PMID: 35668373 PMCID: PMC9172112 DOI: 10.1186/s12863-022-01045-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/14/2021] [Accepted: 03/11/2022] [Indexed: 11/13/2022] Open
Abstract
BACKGROUND Protein classification is a task of paramount importance in various fields of biology. Despite the great momentum of modern implementation of protein classification, machine learning techniques such as Random Forest and Neural Network could not always be used for several reasons: data collection, unbalanced classification or labelling of the data. As an alternative, I propose the use of a bioinformatics pipeline to search for and classify information from protein databases. Hence, to evaluate the efficiency and accuracy of the pipeline, I focused on the carotenoid biosynthetic genes and developed a filtering approach to retrieve orthologs clusters in two well-studied plants that belong to the Brassicaceae family: Arabidopsis thaliana and Brassica rapa Pekinensis group. The result obtained has been compared with previous studies on carotenoid biosynthetic genes in B. rapa where phylogenetic analysis was conducted. RESULTS The developed bioinformatics pipeline relies on commercial software and multiple databases including the use of phylogeny, Gene Ontology terms (GOs) and Protein Families (Pfams) at a protein level. Furthermore, the phylogeny is coupled with "population analysis" to evaluate the potential orthologs. All the steps taken together give a final table of potential orthologs. The phylogenetic tree gives a result of 43 putative orthologs conserved in B. rapa Pekinensis group. Different A. thaliana proteins have more than one syntenic ortholog as also shown in a previous finding (Li et al., BMC Genomics 16(1):1-11, 2015). CONCLUSIONS This study demonstrates that, when the biological features of proteins of interest are not specific, I can rely on a computational approach in filtering steps for classification purposes. The comparison of the results obtained here for the carotenoid biosynthetic genes with previous research confirmed the accuracy of the developed pipeline which can therefore be applied for filtering different types of datasets.
Affiliation(s)
- Benedetta Pelosi
- Department of Molecular Biosciences, The Wenner-Gren Institute, Stockholm University, Stockholm, Sweden.
| |
|
18
|
Machine Learning for the Prediction of Antiviral Compounds Targeting Avian Influenza A/H9N2 Viral Proteins. Symmetry (Basel) 2022. [DOI: 10.3390/sym14061114] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/04/2023] Open
Abstract
Avian influenza subtype A/H9N2—which infects chickens, reducing egg production by up to 80%—may be transmissible to humans. In humans, this virus is very harmful since it attacks the respiratory system and reproductive tract, replicating in both. Previous attempts to find antiviral candidates capable of inhibiting influenza A/H9N2 transmission were unsuccessful. This study aims to better characterize A/H9N2 to facilitate the discovery of antiviral compounds capable of inhibiting its transmission. The Symmetry of this study is to apply several machine learning methods to perform virtual screening to identify H9N2 antivirus candidates. The parameters used to measure the machine learning model’s quality included accuracy, sensitivity, specificity, balanced accuracy, and receiver operating characteristic score. We found that the extreme gradient boosting method yielded better results in classifying compounds predicted to be suitable antiviral compounds than six other machine learning methods, including logistic regression, k-nearest neighbor analysis, support vector machine, multilayer perceptron, random forest, and gradient boosting. Using this algorithm, we identified 10 candidate synthetic compounds with the highest scores. These high scores predicted that the molecular fingerprint may involve strong bonding characteristics. Thus, we were able to find significant candidates for synthetic H9N2 antivirus compounds and identify the best machine learning method to perform virtual screenings.
|
19
|
Wie JH, Lee SJ, Choi SK, Jo YS, Hwang HS, Park MH, Kim YH, Shin JE, Kil KC, Kim SM, Choi BS, Hong H, Seol HJ, Won HS, Ko HS, Na S. Prediction of Emergency Cesarean Section Using Machine Learning Methods: Development and External Validation of a Nationwide Multicenter Dataset in Republic of Korea. Life (Basel) 2022; 12:life12040604. [PMID: 35455095 PMCID: PMC9033083 DOI: 10.3390/life12040604] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/01/2022] [Revised: 04/05/2022] [Accepted: 04/13/2022] [Indexed: 11/16/2022] Open
Abstract
This study was a multicenter retrospective cohort study of term nulliparous women who underwent labor, and was conducted to develop an automated machine learning model for prediction of emergent cesarean section (CS) before onset of labor. Nine machine learning methods (logistic regression, random forest, Support Vector Machine (SVM), gradient boosting, extreme gradient boosting (XGBoost), light gradient boosting machine (LGBM), k-nearest neighbors (KNN), Voting, and Stacking) were applied and compared for prediction of emergent CS during active labor. External validation was performed using a nationwide multicenter dataset for Korean fetal growth. A total of 6549 term nulliparous women were included in the analysis, and the emergent CS rate was 16.1%. The C-statistics values for KNN, Voting, XGBoost, Stacking, gradient boosting, random forest, LGBM, logistic regression, and SVM were 0.6, 0.69, 0.64, 0.59, 0.66, 0.68, 0.68, 0.7, and 0.69, respectively. The logistic regression model showed the best predictive performance with an accuracy of 0.78. The machine learning model identified nine significant variables of maternal age, height, weight at pre-pregnancy, pregnancy-associated hypertension, gestational age, and fetal sonographic findings. The C-statistic value for the logistic regression machine learning model in the external validation set (1391 term nulliparous women) was 0.69, with an overall accuracy of 0.68, a specificity of 0.83, and a sensitivity of 0.41. Machine learning algorithms with clinical and sonographic parameters at near term could be useful tools to predict individual risk of emergent CS during active labor in nulliparous women.
Affiliation(s)
- Jeong Ha Wie
- Department of Obstetrics and Gynecology, Eunpyeong St. Mary’s Hospital, College of Medicine, The Catholic University of Korea, Seoul 03312, Korea;
| | - Se Jin Lee
- Department of Obstetrics and Gynecology, Kangwon National University Hospital, Kangwon National University School of Medicine, Chuncheon 24289, Korea;
| | - Sae Kyung Choi
- Department of Obstetrics and Gynecology, Incheon St. Mary’s Hospital, College of Medicine, The Catholic University of Korea, Seoul 21431, Korea;
| | - Yun Sung Jo
- Department of Obstetrics and Gynecology, St. Vincent’s Hospital, College of Medicine, The Catholic University of Korea, Seoul 16247, Korea;
| | - Han Sung Hwang
- Department of Obstetrics and Gynecology, Research Institute of Medical Science, Konkuk University School of Medicine, Seoul 05030, Korea;
| | - Mi Hye Park
- Department of Obstetrics and Gynecology, Ewha Medical Center, Ewha Medical Institute, Ewha Womans University College of Medicine, Seoul 07804, Korea;
| | - Yeon Hee Kim
- Department of Obstetrics and Gynecology, Uijeongbu St. Mary’s Hospital, College of Medicine, The Catholic University of Korea, Seoul 11765, Korea;
| | - Jae Eun Shin
- Department of Obstetrics and Gynecology, Bucheon St. Mary’s Hospital, College of Medicine, The Catholic University of Korea, Seoul 14647, Korea;
| | - Ki Cheol Kil
- Department of Obstetrics and Gynecology, Yeouido St. Mary’s Hospital, College of Medicine, The Catholic University of Korea, Seoul 07345, Korea;
| | - Su Mi Kim
- Department of Obstetrics and Gynecology, Daejeon St. Mary’s Hospital, College of Medicine, The Catholic University of Korea, Seoul 34943, Korea;
| | - Bong Suk Choi
- Innerwave Co., Ltd., Seoul 08510, Korea; (B.S.C.); (H.H.)
| | - Hanul Hong
- Innerwave Co., Ltd., Seoul 08510, Korea; (B.S.C.); (H.H.)
| | - Hyun-Joo Seol
- Department of Obstetrics and Gynecology, School of Medicine, Kyung Hee University, Seoul 05278, Korea;
| | - Hye-Sung Won
- Department of Obstetrics and Gynecology, Asan Medical Center, University of Ulsan College of Medicine, Seoul 05505, Korea;
| | - Hyun Sun Ko
- Department of Obstetrics and Gynecology, Seoul St. Mary’s Hospital, College of Medicine, The Catholic University of Korea, Seoul 06591, Korea
- Correspondence: (H.S.K.); (S.N.)
| | - Sunghun Na
- Department of Obstetrics and Gynecology, Kangwon National University Hospital, Kangwon National University School of Medicine, Chuncheon 24289, Korea;
- Correspondence: (H.S.K.); (S.N.)
| |
|
20
|
Valente F, Paredes S, Henriques J, Rocha T, de Carvalho P, Morais J. Interpretability, personalization and reliability of a machine learning based clinical decision support system. Data Min Knowl Discov 2022. [DOI: 10.1007/s10618-022-00821-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/03/2022]
|
21
|
Towards hybrid over- and under-sampling combination methods for class imbalanced datasets: an experimental study. Artif Intell Rev 2022. [DOI: 10.1007/s10462-022-10186-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/02/2022]
|
22
|
Sarica A, Quattrone A, Quattrone A. Introducing the Rank-Biased Overlap as Similarity Measure for Feature Importance in Explainable Machine Learning: A Case Study on Parkinson’s Disease. Brain Inform 2022. [DOI: 10.1007/978-3-031-15037-1_11] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022] Open
|
23
|
Saul M, Rostami S. Assessing performance of artificial neural networks and re-sampling techniques for healthcare datasets. Health Informatics J 2022; 28:14604582221087109. [PMID: 35357976 DOI: 10.1177/14604582221087109] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
Abstract
Re-sampling methods to solve class imbalance problems have been shown to improve classification accuracy by mitigating the bias introduced by differences in class size. However, it is possible that a model which uses a specific re-sampling technique prior to artificial neural network (ANN) training may not be suitable to aid in classifying varied datasets from the healthcare industry. Five healthcare-related datasets were used across three re-sampling conditions: under-sampling, over-sampling and combi-sampling. Within each condition, different algorithmic approaches were applied to the dataset and the results were statistically analysed for a significant difference in ANN performance. The combi-sampling condition showed that four out of the five datasets did not show significant consistency for the optimal re-sampling technique between the f1-score and Area Under the Receiver Operating Characteristic Curve performance evaluation methods. In contrast, the over-sampling and under-sampling conditions showed that all five datasets put forward the same optimal algorithmic approach across performance evaluation methods. Furthermore, the optimal combi-sampling techniques (under-, over-sampling and convergence point) were found to be consistent across evaluation measures in only two of the five datasets. This study exemplifies how discrete ANN performances on datasets from the same industry can occur in two ways: how the same re-sampling technique can generate varying ANN performance on different datasets, and how different re-sampling techniques can generate varying ANN performance on the same dataset.
|
24
|
Bektaş J, Bektaş Y, Ersin Kangal E. Integrating a novel SRCRN network for segmentation with representative batch-mode experiments for detecting melanoma. Biomed Signal Process Control 2022. [DOI: 10.1016/j.bspc.2021.103218] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/24/2022]
|
25
|
Wen J, Varol E, Sotiras A, Yang Z, Chand GB, Erus G, Shou H, Abdulkadir A, Hwang G, Dwyer DB, Pigoni A, Dazzan P, Kahn RS, Schnack HG, Zanetti MV, Meisenzahl E, Busatto GF, Crespo-Facorro B, Rafael RG, Pantelis C, Wood SJ, Zhuo C, Shinohara RT, Fan Y, Gur RC, Gur RE, Satterthwaite TD, Koutsouleris N, Wolf DH, Davatzikos C. Multi-scale semi-supervised clustering of brain images: Deriving disease subtypes. Med Image Anal 2022; 75:102304. [PMID: 34818611 PMCID: PMC8678373 DOI: 10.1016/j.media.2021.102304] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/29/2021] [Revised: 08/09/2021] [Accepted: 11/08/2021] [Indexed: 01/03/2023]
Abstract
Disease heterogeneity is a significant obstacle to understanding pathological processes and delivering precision diagnostics and treatment. Clustering methods have gained popularity for stratifying patients into subpopulations (i.e., subtypes) of brain diseases using imaging data. However, unsupervised clustering approaches are often confounded by anatomical and functional variations not related to a disease or pathology of interest. Semi-supervised clustering techniques have been proposed to overcome this and, therefore, capture disease-specific patterns more effectively. An additional limitation of both unsupervised and semi-supervised conventional machine learning methods is that they typically model, learn and infer from data using a basis of feature sets pre-defined at a fixed anatomical or functional scale (e.g., atlas-based regions of interest). Herein we propose a novel method, "Multi-scAle heteroGeneity analysIs and Clustering" (MAGIC), to depict the multi-scale presentation of disease heterogeneity, which builds on a previously proposed semi-supervised clustering method, HYDRA. It derives multi-scale and clinically interpretable feature representations and exploits a double-cyclic optimization procedure to effectively drive identification of inter-scale-consistent disease subtypes. More importantly, to understand the conditions under which the clustering model can estimate true heterogeneity related to diseases, we conducted extensive and systematic semi-simulated experiments to evaluate the proposed method on a sizeable healthy control sample from the UK Biobank (N = 4403). We then applied MAGIC to imaging data from Alzheimer's disease (ADNI, N = 1728) and schizophrenia (PHENOM, N = 1166) patients to demonstrate its potential and challenges in dissecting the neuroanatomical heterogeneity of common brain diseases. Taken together, we aim to provide guidance regarding when such analyses can succeed or should be taken with caution. 
The code of the proposed method is publicly available at https://github.com/anbai106/MAGIC.
Affiliation(s)
- Junhao Wen
- Center for Biomedical Image Computing and Analytics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, USA.
| | - Erdem Varol
- Department of Statistics, Center for Theoretical Neuroscience, Zuckerman Institute, Columbia University, New York, USA
| | - Aristeidis Sotiras
- Department of Radiology and Institute for Informatics, Washington University School of Medicine, St. Louis, USA
| | - Zhijian Yang
- Center for Biomedical Image Computing and Analytics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, USA
| | - Ganesh B Chand
- Department of Radiology, Mallinckrodt Institute of Radiology, Washington University School of Medicine, St. Louis, USA
| | - Guray Erus
- Center for Biomedical Image Computing and Analytics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, USA
| | - Haochang Shou
- Center for Biomedical Image Computing and Analytics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, USA; Penn Statistics in Imaging and Visualization Center, Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, USA
| | - Ahmed Abdulkadir
- Center for Biomedical Image Computing and Analytics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, USA
| | - Gyujoon Hwang
- Center for Biomedical Image Computing and Analytics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, USA
| | - Dominic B Dwyer
- Department of Psychiatry and Psychotherapy, Ludwig-Maximilian University, Munich, Germany
| | - Alessandro Pigoni
- Department of Neurosciences and Mental Health, Fondazione IRCCS Ca' Granda Ospedale Maggiore Policlinico, Milan, Italy
| | - Paola Dazzan
- Institute of Psychiatry, King's College London, London, UK
| | - Rene S Kahn
- Department of Psychiatry, Icahn School of Medicine at Mount Sinai, New York, USA
| | - Hugo G Schnack
- Department of Psychiatry, University Medical Center Utrecht, Utrecht, the Netherlands
| | - Marcus V Zanetti
- Institute of Psychiatry, Faculty of Medicine, University of São Paulo, São Paulo, Brazil
| | - Eva Meisenzahl
- LVR-Klinikum Düsseldorf, Kliniken der Heinrich-Heine-Universität, Düsseldorf, Germany
| | - Geraldo F Busatto
- Institute of Psychiatry, Faculty of Medicine, University of São Paulo, São Paulo, Brazil
| | - Benedicto Crespo-Facorro
- Hospital Universitario Virgen del Rocio, University of Sevilla-IBIS; IDIVAL-CIBERSAM, Cantabria, Spain
| | - Romero-Garcia Rafael
- Department of Medical Physiology and Biophysics, University of Seville, Instituto de Investigación Sanitaria de Sevilla, IBiS, CIBERSAM, Sevilla, Spain
| | - Christos Pantelis
- Melbourne Neuropsychiatry Centre, Department of Psychiatry, University of Melbourne and Melbourne Health, Carlton South, Australia
| | - Stephen J Wood
- Orygen, National Centre of Excellence for Youth Mental Health, Melbourne, Australia
| | - Chuanjun Zhuo
- Key Laboratory of Real Time Tracing of Brain Circuits in Psychiatry and Neurology (RTBCPN-Lab), Nankai University Affiliated Tianjin Fourth Center Hospital; Department of Psychiatry, Tianjin Medical University, Tianjin, China
| | - Russell T Shinohara
- Center for Biomedical Image Computing and Analytics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, USA; Penn Statistics in Imaging and Visualization Center, Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, USA
| | - Yong Fan
- Center for Biomedical Image Computing and Analytics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, USA
| | - Ruben C Gur
- Department of Psychiatry, Perelman School of Medicine, University of Pennsylvania, Philadelphia, USA
| | - Raquel E Gur
- Department of Psychiatry, Perelman School of Medicine, University of Pennsylvania, Philadelphia, USA
| | - Theodore D Satterthwaite
- Center for Biomedical Image Computing and Analytics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, USA; Department of Psychiatry, Perelman School of Medicine, University of Pennsylvania, Philadelphia, USA; University Hospital of Old Age Psychiatry and Psychotherapy, University of Bern, Bern, Switzerland
| | - Nikolaos Koutsouleris
- Department of Psychiatry and Psychotherapy, Ludwig-Maximilian University, Munich, Germany
| | - Daniel H Wolf
- Center for Biomedical Image Computing and Analytics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, USA; Department of Psychiatry, Perelman School of Medicine, University of Pennsylvania, Philadelphia, USA; University Hospital of Old Age Psychiatry and Psychotherapy, University of Bern, Bern, Switzerland
| | - Christos Davatzikos
- Center for Biomedical Image Computing and Analytics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, USA.
|
26
|
Li R, Li L, Xu Y, Yang J. Machine learning meets omics: applications and perspectives. Brief Bioinform 2021; 23:6425809. [PMID: 34791021 DOI: 10.1093/bib/bbab460] [Citation(s) in RCA: 53] [Impact Index Per Article: 13.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/23/2021] [Revised: 09/29/2021] [Accepted: 10/07/2021] [Indexed: 02/07/2023] Open
Abstract
The innovation of biotechnologies has allowed the accumulation of omics data at an alarming rate, thus introducing the era of 'big data'. Extracting inherent valuable knowledge from various omics data remains a daunting problem in bioinformatics. Better solutions often require more innovative methods for efficient handling and effective results. Recent advancements in integrated analysis and computational modeling of multi-omics data have helped address such needs in an increasingly harmonious manner. The development and application of machine learning have largely advanced our insights into biology and biomedicine and greatly promoted the development of therapeutic strategies, especially for precision medicine. Here, we present a comprehensive survey and discussion of what happened, is happening and will happen when machine learning meets omics. Specifically, we describe how artificial intelligence can be applied to omics studies and review recent advancements at the interface between machine learning and an ever-widening range of omics including genomics, transcriptomics, proteomics, metabolomics, radiomics, as well as those at the single-cell resolution. We also discuss and provide a synthesis of ideas, new insights, current challenges and perspectives of machine learning in omics.
Affiliation(s)
- Rufeng Li
- Department of Cell Biology and Genetics, School of Basic Medical Sciences, Xi'an Jiaotong University Health Science Center, Xi'an 710061, P. R. China
| | - Lixin Li
- Department of Cell Biology and Genetics, School of Basic Medical Sciences, Xi'an Jiaotong University Health Science Center, Xi'an 710061, P. R. China
| | - Yungang Xu
- School of Electronics and Information, Northwestern Polytechnical University, Xi'an, 710129, China
| | - Juan Yang
- Department of Cell Biology and Genetics, School of Basic Medical Sciences, Xi'an Jiaotong University Health Science Center, Xi'an 710061, P. R. China; Key Laboratory of Environment and Genes Related to Diseases (Xi'an Jiaotong University), Ministry of Education of China, Xi'an 710061, P. R. China
|
27
|
Wu J, Dong Q, Gui J, Zhang J, Su Y, Chen K, Thompson PM, Caselli RJ, Reiman EM, Ye J, Wang Y. Predicting Brain Amyloid Using Multivariate Morphometry Statistics, Sparse Coding, and Correntropy: Validation in 1,101 Individuals From the ADNI and OASIS Databases. Front Neurosci 2021; 15:669595. [PMID: 34421510 PMCID: PMC8377280 DOI: 10.3389/fnins.2021.669595] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/19/2021] [Accepted: 07/15/2021] [Indexed: 01/04/2023] Open
Abstract
Biomarker assisted preclinical/early detection and intervention in Alzheimer’s disease (AD) may be the key to therapeutic breakthroughs. One of the presymptomatic hallmarks of AD is the accumulation of beta-amyloid (Aβ) plaques in the human brain. However, current methods to detect Aβ pathology are either invasive (lumbar puncture) or quite costly and not widely available (amyloid PET). Our prior studies show that magnetic resonance imaging (MRI)-based hippocampal multivariate morphometry statistics (MMS) are an effective neurodegenerative biomarker for preclinical AD. Here we attempt to use MRI-MMS to make inferences regarding brain Aβ burden at the individual subject level. As MMS data has a larger dimension than the sample size, we propose a sparse coding algorithm, Patch Analysis-based Surface Correntropy-induced Sparse-coding and Max-Pooling (PASCS-MP), to generate a low-dimensional representation of hippocampal morphometry for each individual subject. Then we apply these individual representations and a binary random forest classifier to predict brain Aβ positivity for each person. We test our method in two independent cohorts, 841 subjects from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) and 260 subjects from the Open Access Series of Imaging Studies (OASIS). Experimental results suggest that our proposed PASCS-MP method and MMS can discriminate Aβ positivity in people with mild cognitive impairment (MCI) [Accuracy (ACC) = 0.89 (ADNI)] and in cognitively unimpaired (CU) individuals [ACC = 0.79 (ADNI) and ACC = 0.81 (OASIS)]. These results compare favorably relative to measures derived from traditional algorithms, including hippocampal volume and surface area, shape measures based on spherical harmonics (SPHARM) and our prior Patch Analysis-based Surface Sparse-coding and Max-Pooling (PASS-MP) methods.
Affiliation(s)
- Jianfeng Wu
- School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, AZ, United States
| | - Qunxi Dong
- School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, AZ, United States; Institute of Engineering Medicine, Beijing Institute of Technology, Beijing, China
| | - Jie Gui
- School of Cyber Science and Engineering, Southeast University, Nanjing, China
| | - Jie Zhang
- School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, AZ, United States
| | - Yi Su
- Banner Alzheimer's Institute, Phoenix, AZ, United States
| | - Kewei Chen
- Banner Alzheimer's Institute, Phoenix, AZ, United States
| | - Paul M Thompson
- Imaging Genetics Center, Stevens Neuroimaging and Informatics Institute, University of Southern California, Marina del Rey, CA, United States
| | - Richard J Caselli
- Department of Neurology, Mayo Clinic Arizona, Scottsdale, AZ, United States
| | - Eric M Reiman
- Banner Alzheimer's Institute, Phoenix, AZ, United States
| | - Jieping Ye
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, United States
| | - Yalin Wang
- School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, AZ, United States
|
28
|
Zhang T, Liao Q, Zhang D, Zhang C, Yan J, Ngetich R, Zhang J, Jin Z, Li L. Predicting MCI to AD Conversation Using Integrated sMRI and rs-fMRI: Machine Learning and Graph Theory Approach. Front Aging Neurosci 2021; 13:688926. [PMID: 34421570 PMCID: PMC8375594 DOI: 10.3389/fnagi.2021.688926] [Citation(s) in RCA: 29] [Impact Index Per Article: 7.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/31/2021] [Accepted: 06/23/2021] [Indexed: 11/28/2022] Open
Abstract
BACKGROUND Graph theory and machine learning have been shown to be effective ways of classifying different stages of Alzheimer's disease (AD). Most previous studies have only focused on inter-subject classification with single-mode neuroimaging data. However, whether this classification can truly reflect the changes in the structure and function of the brain region in disease progression remains unverified. In the current study, we aimed to evaluate the classification framework, which combines structural Magnetic Resonance Imaging (sMRI) and resting-state functional Magnetic Resonance Imaging (rs-fMRI) metrics, to distinguish mild cognitive impairment non-converters (MCInc)/AD from MCI converters (MCIc) by using graph theory and machine learning. METHODS With the intra-subject (MCInc vs. MCIc) and inter-subject (MCIc vs. AD) design, we employed cortical thickness features, structural brain network features, and sub-frequency (full-band, slow-4, slow-5) functional brain network features for classification. Three feature selection methods [random subset feature selection algorithm (RSFS), minimal redundancy maximal relevance (mRMR), and sparse linear regression feature selection algorithm based on stationary selection (SS-LR)] were used respectively to select discriminative features in the iterative combinations of MRI and network measures. A support vector machine (SVM) classifier with nested cross-validation was then employed for classification. We also compared the performance of multiple classifiers (Random Forest, K-nearest neighbor, Adaboost, SVM) and verified the reliability of our results by upsampling. RESULTS We found that in the classifications of MCIc vs. MCInc and MCIc vs. AD, the proposed RSFS algorithm achieved higher accuracies (84.71% and 89.80%, respectively) than the other algorithms. The high-sensitivity brain regions found with the two classification groups were inconsistent. Specifically, in MCIc vs. MCInc, the high-sensitivity brain regions associated with both structural and functional features included frontal, temporal, caudate, entorhinal, parahippocampal, and calcarine fissure and surrounding cortex. In MCIc vs. AD, by contrast, the high-sensitivity brain regions associated only with functional features included frontal, temporal, thalamus, olfactory, and angular. CONCLUSIONS These results suggest that our proposed method could effectively predict the conversion of MCI to AD, and the inconsistency of specific brain regions provides a novel insight for clinical AD diagnosis.
Affiliation(s)
- Zhenlan Jin
- Key Laboratory for NeuroInformation of Ministry of Education, High-Field Magnetic Resonance Brain Imaging Key Laboratory of Sichuan Province, Center for Information in Medicine, School of Life Sciences and Technology, University of Electronic Science and Technology of China, Chengdu, China
| | - Ling Li
- Key Laboratory for NeuroInformation of Ministry of Education, High-Field Magnetic Resonance Brain Imaging Key Laboratory of Sichuan Province, Center for Information in Medicine, School of Life Sciences and Technology, University of Electronic Science and Technology of China, Chengdu, China
|
29
|
Shang Y, Jiang K, Wang L, Zhang Z, Zhou S, Liu Y, Dong J, Wu H. The 30-days hospital readmission risk in diabetic patients: predictive modeling with machine learning classifiers. BMC Med Inform Decis Mak 2021; 21:57. [PMID: 34330267 PMCID: PMC8323261 DOI: 10.1186/s12911-021-01423-y] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/07/2021] [Accepted: 02/08/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND AND OBJECTIVES Diabetes mellitus is a major chronic disease that results in readmissions due to poor disease control. Here we established and compared machine learning (ML)-based readmission prediction methods to predict readmission risks of diabetic patients. METHODS The dataset analyzed in this study was acquired from the Health Facts Database, which includes over 100,000 records of diabetic patients from 1999 to 2008. The basic data distribution characteristics of this dataset were summarized and then analyzed. In this study, 30-days readmission was defined as a readmission period of less than 30 days. After data preprocessing and normalization, multiple risk factors in the dataset were examined for classifier training to predict the probability of readmission using ML models. Different ML classifiers such as random forest, Naive Bayes, and decision tree ensemble were adopted to improve the clinical efficiency of the classification. In this study, the Konstanz Information Miner platform was used to preprocess and model the data, and the performances of the different classifiers were compared. RESULTS A total of 100,244 records were included in the model construction after the data preprocessing and normalization. A total of 23 attributes, including race, sex, age, admission type, admission location, length of stay, and drug use, were finally identified as modeling risk factors. Comparison of the performance indexes of the three algorithms revealed that the RF model had the best performance with a higher area under receiver operating characteristic curve (AUC) than the other two algorithms, suggesting that its use is more suitable for making readmission predictions. 
CONCLUSION The factors influencing 30-days readmission predictions in diabetic patients, including number of inpatient admissions, age, diagnosis, number of emergencies, and sex, would help healthcare providers to identify patients who are at high risk of short-term readmission and reduce the probability of 30-days readmission. The RF algorithm with the highest AUC is more suitable for making 30-days readmission predictions and deserves further validation in clinical trials.
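The abstract above ranks classifiers by area under the ROC curve (AUC). As a minimal sketch of what that metric measures (this is not the authors' Konstanz Information Miner workflow), AUC can be computed as the fraction of positive/negative pairs in which the positive case receives the higher score. The labels, scores, and the function name `roc_auc` are hypothetical and chosen only for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the Mann-Whitney statistic: P(score_pos > score_neg), ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical readmission labels (1 = readmitted within 30 days)
# and scores from two hypothetical classifiers.
labels = [0, 0, 1, 1]
scores_model_a = [0.1, 0.4, 0.35, 0.8]
scores_model_b = [0.2, 0.3, 0.6, 0.9]

auc_a = roc_auc(labels, scores_model_a)
auc_b = roc_auc(labels, scores_model_b)
print(f"model A AUC = {auc_a:.2f}, model B AUC = {auc_b:.2f}")
```

The pairwise form is quadratic in the number of records, so for a dataset of roughly 100,000 records a rank-based implementation would be the more practical choice.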
Affiliation(s)
- Yujuan Shang
- Department of Medical Informatics, Medical School of Nantong University, 19 Qixiu Road, Nantong, 226001, Jiangsu, People's Republic of China
- Department of Statistics and Data Management, Children's Hospital of Fudan University, Shanghai, 201102, People's Republic of China
| | - Kui Jiang
- Department of Medical Informatics, Medical School of Nantong University, 19 Qixiu Road, Nantong, 226001, Jiangsu, People's Republic of China
| | - Lei Wang
- Department of Medical Informatics, Medical School of Nantong University, 19 Qixiu Road, Nantong, 226001, Jiangsu, People's Republic of China
| | - Zheqing Zhang
- Department of Medical Informatics, Medical School of Nantong University, 19 Qixiu Road, Nantong, 226001, Jiangsu, People's Republic of China
| | - Siwei Zhou
- Department of Medical Informatics, Medical School of Nantong University, 19 Qixiu Road, Nantong, 226001, Jiangsu, People's Republic of China
| | - Yun Liu
- Department of Information, the First Affiliated Hospital, Nanjing Medical University, No. 300 Guang Zhou Road, Nanjing, 210029, Jiangsu, People's Republic of China
- Department of Medical Informatics, School of Biomedical Engineering and Informatics, Nanjing Medical University, Nanjing, 211166, Jiangsu, People's Republic of China
| | - Jiancheng Dong
- Department of Medical Informatics, Medical School of Nantong University, 19 Qixiu Road, Nantong, 226001, Jiangsu, People's Republic of China
| | - Huiqun Wu
- Department of Medical Informatics, Medical School of Nantong University, 19 Qixiu Road, Nantong, 226001, Jiangsu, People's Republic of China.
|
30
|
Li F, Yi C, Liao Y, Jiang Y, Si Y, Song L, Zhang T, Yao D, Zhang Y, Cao Z, Xu P. Reconfiguration of Brain Network Between Resting State and P300 Task. IEEE Trans Cogn Dev Syst 2021. [DOI: 10.1109/tcds.2020.2965135] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
|
31
|
Computational methods for integrative evaluation of confidence, accuracy, and reaction time in facial affect recognition in schizophrenia. SCHIZOPHRENIA RESEARCH-COGNITION 2021; 25:100196. [PMID: 33996517 PMCID: PMC8093458 DOI: 10.1016/j.scog.2021.100196] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Received: 10/14/2020] [Revised: 03/06/2021] [Accepted: 03/10/2021] [Indexed: 11/21/2022]
Abstract
People with schizophrenia (SZ) process emotions less accurately than do healthy comparators (HC), and studies of emotion recognition have expanded beyond accuracy to performance variables like reaction time (RT) and confidence. These domains are typically evaluated independently, but complex inter-relationships can be evaluated through machine learning at an item-by-item level. Using a mix of ranking and machine learning tools, we investigated item-by-item discrimination of facial affect with two emotion recognition tests (BLERT and ER-40) between SZ and HC. The best performing multi-domain model for ER-40 had a large effect size in differentiating SZ and HC (d = 1.24) compared to a standard comparison of accuracy alone (d = 0.48); smaller increments in effect sizes were evident for the BLERT (d = 0.87 vs. d = 0.58). Almost half of the selected items were confidence ratings. Within SZ, machine learning models with ER-40 (generally accuracy and reaction time) items predicted severity of depression and overconfidence in social cognitive ability, but not psychotic symptoms. Pending independent replication, the results support machine learning, and the inclusion of confidence ratings, in characterizing the social cognitive deficits in SZ. This moderate-sized study (n = 372) included subjects with schizophrenia (SZ, n = 218) and healthy controls (HC, n = 154). This paper explores the value of integrative evaluation of confidence, accuracy, and reaction time by way of machine learning in understanding the unique aspects of facial affect recognition in schizophrenia. Machine learning models separated schizophrenia from healthy comparators better than standard statistical comparison, and confidence ratings contributed disproportionately to this separation. Machine learning approaches provide a novel way to analyze item-by-item associations with social cognition measures, or potentially other tests, where multiple overlapping dimensions exist. 
Aberrant confidence ratings interact with performance variables in complex ways to contribute to social cognitive deficits in schizophrenia.
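The group separations above are reported as effect sizes (d). The abstract does not say which variant of d was used; the sketch below assumes the common pooled-standard-deviation form of Cohen's d, and the scores are hypothetical values, not data from the study.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d with a pooled sample standard deviation (assumed variant)."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical test scores for healthy comparators (HC) and people with SZ.
hc_scores = [2, 3, 4, 5, 6]
sz_scores = [1, 2, 3, 4, 5]

d = cohens_d(hc_scores, sz_scores)
print(f"d = {d:.2f}")
```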
|
32
|
Reproducible Evaluation of Diffusion MRI Features for Automatic Classification of Patients with Alzheimer's Disease. Neuroinformatics 2021; 19:57-78. [PMID: 32524428 DOI: 10.1007/s12021-020-09469-5] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/24/2022]
Abstract
Diffusion MRI is the modality of choice to study alterations of white matter. In past years, various works have used diffusion MRI for automatic classification of Alzheimer's disease. However, classification performance obtained with different approaches is difficult to compare because of variations in components such as input data, participant selection, image preprocessing, feature extraction, feature rescaling (FR), feature selection (FS) and cross-validation (CV) procedures. Moreover, these studies are also difficult to reproduce because these different components are not readily available. In a previous work (Samper-González et al. 2018), we proposed an open-source framework for the reproducible evaluation of AD classification from T1-weighted (T1w) MRI and PET data. In the present paper, we first extend this framework to diffusion MRI data. Specifically, we add: conversion of diffusion MRI ADNI data into the BIDS standard and pipelines for diffusion MRI preprocessing and feature extraction. We then apply the framework to compare different components. First, FS has a positive impact on classification results: highest balanced accuracy (BA) improved from 0.76 to 0.82 for task CN vs AD. Secondly, voxel-wise features generally give better performance than regional features. Fractional anisotropy (FA) and mean diffusivity (MD) provided comparable results for voxel-wise features. Moreover, we observe that the poor performance obtained in tasks involving MCI was potentially caused by the small data samples, rather than by the data imbalance. Furthermore, no extensive classification difference exists for different degrees of smoothing and registration methods. Besides, we demonstrate that using non-nested validation of FS leads to unreliable and over-optimistic results: 5% up to 40% relative increase in BA. Lastly, with proper FR and FS, the performance of diffusion MRI features is comparable to that of T1w MRI. 
All the code of the framework and the experiments is publicly available: general-purpose tools have been integrated into the Clinica software package (www.clinica.run) and the paper-specific code is available at: https://github.com/aramis-lab/AD-ML .
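The over-optimism of non-nested feature selection described above can be illustrated on pure-noise data. The sketch below is not the Clinica or AD-ML code: it assumes a simple mean-difference feature ranking and a nearest-centroid classifier as stand-ins for the paper's FS methods and classifiers, and all function names and data are hypothetical. When features are selected on the full dataset before cross-validation, test labels leak into the selection step, so the balanced accuracy estimate is typically inflated even though no signal exists.

```python
import random


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def make_noise_data(n_samples=40, n_features=300, seed=0):
    """Pure-noise features with alternating labels: there is no real signal."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    return X, y


def select_features(X, y, k):
    """Indices of the k features with the largest absolute class-mean difference."""
    scored = []
    for j in range(len(X[0])):
        m0 = mean(r[j] for r, l in zip(X, y) if l == 0)
        m1 = mean(r[j] for r, l in zip(X, y) if l == 1)
        scored.append((abs(m1 - m0), j))
    scored.sort(reverse=True)
    return [j for _, j in scored[:k]]


def nearest_centroid_predict(X_tr, y_tr, X_te, feats):
    """Assign each test row to the class whose training centroid is closest."""
    cents = {c: [mean(r[j] for r, l in zip(X_tr, y_tr) if l == c) for j in feats]
             for c in (0, 1)}
    preds = []
    for r in X_te:
        dist = {c: sum((r[j] - m) ** 2 for j, m in zip(feats, cents[c])) for c in (0, 1)}
        preds.append(0 if dist[0] <= dist[1] else 1)
    return preds


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity."""
    sens = mean(p == 1 for t, p in zip(y_true, y_pred) if t == 1)
    spec = mean(p == 0 for t, p in zip(y_true, y_pred) if t == 0)
    return (sens + spec) / 2


def cv_balanced_accuracy(X, y, k_feats=10, n_folds=5, nested=True):
    """K-fold CV; folds are i % n_folds (odd n_folds keeps both classes in each fold)."""
    n = len(X)
    leaky_feats = select_features(X, y, k_feats)  # uses ALL labels, including test folds
    y_pred = [0] * n
    for fold in range(n_folds):
        te = [i for i in range(n) if i % n_folds == fold]
        tr = [i for i in range(n) if i % n_folds != fold]
        X_tr, y_tr = [X[i] for i in tr], [y[i] for i in tr]
        # Nested: re-select features using the training fold only.
        feats = select_features(X_tr, y_tr, k_feats) if nested else leaky_feats
        preds = nearest_centroid_predict(X_tr, y_tr, [X[i] for i in te], feats)
        for i, p in zip(te, preds):
            y_pred[i] = p
    return balanced_accuracy(y, y_pred)


X, y = make_noise_data()
ba_nested = cv_balanced_accuracy(X, y, nested=True)
ba_leaky = cv_balanced_accuracy(X, y, nested=False)
print(f"nested FS BA = {ba_nested:.2f}, non-nested FS BA = {ba_leaky:.2f}")
```

On noise data an honest estimate should hover around 0.5; any gap between the two printed values reflects leakage from the selection step rather than a better model.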
|
33
|
Use of Machine Learning to Determine the Information Value of a BMI Screening Program. Am J Prev Med 2021; 60:425-433. [PMID: 33483154 PMCID: PMC8610445 DOI: 10.1016/j.amepre.2020.10.016] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 01/28/2020] [Revised: 10/13/2020] [Accepted: 10/14/2020] [Indexed: 12/12/2022]
Abstract
INTRODUCTION Childhood obesity continues to be a significant public health issue in the U.S. and is associated with short- and long-term adverse health outcomes. A number of states have implemented school-based BMI screening programs. However, these programs have been criticized for not being effective in improving students' BMI or reducing childhood obesity. One potential benefit, however, of screening programs is the identification of younger children at risk of obesity as they age. METHODS This study used a unique panel data set from the BMI screening program for public school children in the state of Arkansas collected from the 2003-2004 through the 2018-2019 academic years and analyzed in 2020. Machine learning algorithms were applied to understand the informational value of BMI screening. Specifically, this study evaluated the importance of BMI information during kindergarten to the accurate prediction of childhood obesity by the 4th grade. RESULTS Kindergarten BMI z-score is the most important predictor of obesity by the 4th grade and is much more important to prediction than sociodemographic and socioeconomic variables that would otherwise be available to policymakers in the absence of the screening program. Including the kindergarten BMI z-score of students in the model meaningfully increases the accuracy of the prediction. CONCLUSIONS Data from the Arkansas BMI screening program greatly improve the ability to identify children at greatest risk of future obesity to the extent that better prediction can be translated into more effective policy and better health outcomes. This is a heretofore unexamined benefit of school-based BMI screening.
|
34
|
Gulino MS, Gangi LD, Sortino A, Vangi D. Injury risk assessment based on pre-crash variables: The role of closing velocity and impact eccentricity. ACCIDENT; ANALYSIS AND PREVENTION 2021; 150:105864. [PMID: 33385620 DOI: 10.1016/j.aap.2020.105864] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/03/2020] [Revised: 09/28/2020] [Accepted: 10/22/2020] [Indexed: 06/12/2023]
Abstract
Thorough evaluations of injury risk (IR) are fundamental for guiding interventions toward the enhancement of both the road infrastructure and the active/passive safety of vehicles. Well-established estimates are currently based on IR functions modeled on post-crash variables, such as velocity change sustained by the vehicle (ΔV); hence, these analyses do not directly suggest how pre-crash conditions can be modified to allow for IR reduction. Nevertheless, ΔV can be disaggregated into two contributions which enable its a priori calculation, based only on the information available at the impact instant: the Crash Momentum Index (CMI), representing impact eccentricity at collision, and the closing velocity at collision (Vr). By extensively employing the CMI indicator, this work assesses the overall influence of impact eccentricity and closing velocity on the risk for occupants to sustain a serious injury. As CMI synthesizes indications regarding ΔV, its use can be decoupled from ΔV itself for the derivation of high-quality IR models. This feature distinguishes CMI from the other eccentricity indicators available in the state of the art, allowing for the contribution of eccentricity to IR to be completely isolated. Because of this element of originality, special attention is given to the CMI variable throughout the present work. Based on data extracted from the NASS/CDS database, the influence of the CMI and Vr variables on IR is specifically highlighted and analyzed from several perspectives. The feature ranking algorithm ReliefF, whose use is unprecedented in the accident analysis field, is first employed to assess the importance of such impact-related variables in determining the injury outcome: compared with vehicle-related and occupant-related variables (such as category and age, respectively), the higher influence of CMI and Vr is initially highlighted. 
Secondly, the relevance of CMI and Vr is confirmed by fitting different predictive models: the fitted models which include the CMI predictor perform better than models which neglect the CMI, in terms of classical evaluation metrics. As a whole, considering the high predictive power of the proposed CMI-based models, this work provides valuable tools for the a priori assessment of IR.
Affiliation(s)
- Michelangelo-Santo Gulino
- Department of Industrial Engineering, Università degli Studi di Firenze, Via di Santa Marta, 3, 50139 Firenze, Italy.
| | - Leonardo Di Gangi
- Department of Information Engineering, Università degli Studi di Firenze, Via di Santa Marta, 3, 50139 Firenze, Italy
| | - Alessio Sortino
- Department of Information Engineering, Università degli Studi di Firenze, Via di Santa Marta, 3, 50139 Firenze, Italy
| | - Dario Vangi
- Department of Industrial Engineering, Università degli Studi di Firenze, Via di Santa Marta, 3, 50139 Firenze, Italy
|
35
|
Shimpi N, McRoy S, Zhao H, Wu M, Acharya A. Development of a periodontitis risk assessment model for primary care providers in an interdisciplinary setting. Technol Health Care 2021; 28:143-154. [PMID: 31282445 DOI: 10.3233/thc-191642] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/21/2022]
Abstract
BACKGROUND Periodontitis (PD), a form of gum disease, is a major public health concern as it is globally prevalent and harms both individual quality of life and economic productivity. Global cost in lost productivity is estimated at US$54 billion annually. Moreover, current PD assessment applies only after the damage has already occurred. OBJECTIVE This study proposes and tests a new PD risk assessment model applicable at point-of-care, using supervised machine learning methods. METHODS We compare the performance of five algorithms using retrospective clinical data: Naïve Bayes (NB), Logistic Regression (LR), Support Vector Machine (SVM), Artificial Neural Network (ANN), and Decision Tree (DT). RESULTS DT and ANN demonstrated higher accuracy in classifying the patients with high or low PD risk as compared to NB, LR and SVM. The resultant model with DT showed a sensitivity of 87.08% (95% CI 84.12% to 89.76%) and specificity of 93.5% (95% CI 91% to 95.49%). CONCLUSIONS A predictive model with high sensitivity and specificity to stratify individuals into low and high PD risk tiers was developed. Validation in other populations will inform the translational value of this approach and its potential applicability as a clinical decision support tool.
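The sensitivity and specificity above are reported with 95% confidence intervals. The abstract does not state how the intervals were computed; the sketch below assumes the Wilson score interval as one reasonable choice, and the confusion-matrix counts are hypothetical, not taken from the study.

```python
from math import sqrt


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half


def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical counts for a high-risk vs. low-risk PD classifier.
tp, fn, tn, fp = 87, 13, 93, 7

sens, spec = sensitivity_specificity(tp, fn, tn, fp)
sens_ci = wilson_interval(tp, tp + fn)
spec_ci = wilson_interval(tn, tn + fp)
print(f"sensitivity = {sens:.2%} (95% CI {sens_ci[0]:.2%} to {sens_ci[1]:.2%})")
print(f"specificity = {spec:.2%} (95% CI {spec_ci[0]:.2%} to {spec_ci[1]:.2%})")
```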
Affiliation(s)
- Neel Shimpi
- University of Wisconsin-Milwaukee, Milwaukee, WI, USA; Center for Oral and Systemic Health, Marshfield Clinic Research Institute, Marshfield, WI, USA
| | - Susan McRoy
- University of Wisconsin-Milwaukee, Milwaukee, WI, USA
| | - Huimin Zhao
- University of Wisconsin-Milwaukee, Milwaukee, WI, USA
| | - Min Wu
- University of Wisconsin-Milwaukee, Milwaukee, WI, USA
| | - Amit Acharya
- Center for Oral and Systemic Health, Marshfield Clinic Research Institute, Marshfield, WI, USA
|
36
|
Kim N, Hong S. Automatic classification of citizen requests for transportation using deep learning: Case study from Boston city. Inf Process Manag 2021. [DOI: 10.1016/j.ipm.2020.102410] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/23/2022]
|
37
|
Jing XY, Zhang X, Zhu X, Wu F, You X, Gao Y, Shan S, Yang JY. Multiset Feature Learning for Highly Imbalanced Data Classification. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2021; 43:139-156. [PMID: 31331881 DOI: 10.1109/tpami.2019.2929166] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
With the expansion of data, imbalanced data have become increasingly common. When the imbalance ratio (IR) of data is high, most existing imbalanced learning methods decline seriously in classification performance. In this paper, we systematically investigate the highly imbalanced data classification problem, and propose an uncorrelated cost-sensitive multiset learning (UCML) approach for it. Specifically, UCML first constructs multiple balanced subsets through random partition, and then employs multiset feature learning (MFL) to learn discriminant features from the constructed multiset. To enhance the usability of each subset and deal with the non-linearity issue that exists in each subset, we further propose a deep metric based UCML (DM-UCML) approach. DM-UCML introduces the generative adversarial network technique into the multiset constructing process, such that each subset has a distribution similar to that of the original dataset. To cope with the non-linearity issue, DM-UCML integrates deep metric learning with MFL, such that more favorable performance can be achieved. In addition, DM-UCML designs a new discriminant term to enhance the discriminability of learned metrics. Experiments on eight traditional highly class-imbalanced datasets and two large-scale datasets indicate that the proposed approaches outperform state-of-the-art highly imbalanced learning methods and are more robust to high IR.
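The first UCML step, building several balanced subsets by random partition, can be illustrated with a minimal sketch. It assumes one plausible reading of the abstract: the majority class is shuffled and split into chunks, and each chunk is paired with all minority samples. The function name `balanced_subsets` and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np

def balanced_subsets(X, y, minority_label=1, seed=0):
    """Pair every minority row with one random chunk of the majority class."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_label)
    # Shuffle the majority indices so the partition is random.
    majority_idx = rng.permutation(np.flatnonzero(y != minority_label))
    # Use roughly one chunk per unit of imbalance ratio (at least one chunk).
    n_chunks = max(1, len(majority_idx) // max(1, len(minority_idx)))
    subsets = []
    for chunk in np.array_split(majority_idx, n_chunks):
        idx = np.concatenate([minority_idx, chunk])
        subsets.append((X[idx], y[idx]))
    return subsets

# Toy data with an imbalance ratio of 10:1 (assumed for illustration only).
X = np.arange(110 * 2, dtype=float).reshape(110, 2)
y = np.array([1] * 10 + [0] * 100)
subsets = balanced_subsets(X, y)
```

Each returned subset here holds all 10 minority rows plus a disjoint block of 10 majority rows, so a separate learner can be fitted per subset.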
|
38
|
Zhu W, Huang H, Yang S, Luo X, Zhu W, Xu S, Meng Q, Zuo C, Zhao K, Liu H, Liu Y, Wang W. Dysfunctional Architecture Underlies White Matter Hyperintensities with and without Cognitive Impairment. J Alzheimers Dis 2020; 71:461-476. [PMID: 31403946 DOI: 10.3233/jad-190174] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Indexed: 12/18/2022]
Abstract
BACKGROUND White matter hyperintensities (WMH) are common in older adults and are associated with cognitive decline. However, little is known about the functional changes underlying cognitive decline in WMH subjects. OBJECTIVES To investigate whole-brain functional connectivity (FC) underpinnings of cognitive decline in WMH subjects using univariate and multivariate analyses. METHODS Twenty-three WMH subjects with mild cognitive impairment (WMH-MCI), 43 WMH subjects with no cognitive impairment (WMH-nCI), and 55 healthy controls underwent resting-state functional MRI scans. Whole-brain FC was calculated using the fine-grained human Brainnetome Atlas, followed by between-group comparisons and FC-cognition correlation analysis. A multivariate analysis using support vector machine (SVM) was performed to classify WMH-MCI and WMH-nCI subjects based on FC. RESULTS Both the WMH-MCI and WMH-nCI subjects exhibited characteristic impaired FC patterns. Markedly reduced FC involving subcortical nuclei and cortical hub regions of cognitive networks, especially the cingulate cortex, was identified in the WMH-MCI patients. In the WMH-MCI group, several connections involving the cingulate cortex were associated with cognitive decline. The exploratory mediation analyses indicated that FC alterations could partially explain the association between WMH and cognition. Furthermore, an SVM classifier based on FC distinguished WMH-MCI and WMH-nCI subjects with 78.8% accuracy. Connections that contributed most to the classification showed a distribution similar to that of the connections identified in the univariate analysis. CONCLUSIONS This study provides a new window into the pathophysiology of cognitive impairment in WMH subjects and offers a novel potential approach for early detection of cognitive impairment in WMH subjects at the individual level.
Affiliation(s)
- Wenhao Zhu
- Department of Neurology, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
| | - Hao Huang
- Department of Neurology, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
| | - Shiqi Yang
- Department of Radiology, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
| | - Xiang Luo
- Department of Neurology, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
| | - Wenzhen Zhu
- Department of Radiology, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
| | - Shabei Xu
- Department of Neurology, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
| | - Qi Meng
- Department of Neurology, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
| | - Chengchao Zuo
- Department of Neurology, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
| | - Kun Zhao
- Brainnetome Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China.,School of Information Science and Engineering, Shandong Normal University, Ji'nan, China
| | - Hesheng Liu
- Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
| | - Yong Liu
- Brainnetome Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China.,National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China.,Center for Excellence in Brain Science and Intelligence Technology, Institute of Automation, Chinese Academy of Sciences, Beijing, China.,School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
| | - Wei Wang
- Department of Neurology, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
| |
|
39
|
Identifying influential factors distinguishing recidivists among offender patients with a diagnosis of schizophrenia via machine learning algorithms. Forensic Sci Int 2020; 315:110435. [DOI: 10.1016/j.forsciint.2020.110435] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 12/19/2019] [Revised: 07/05/2020] [Accepted: 07/24/2020] [Indexed: 11/17/2022]
|
40
|
Du G, Zhang J, Luo Z, Ma F, Ma L, Li S. Joint imbalanced classification and feature selection for hospital readmissions. Knowl Based Syst 2020. [DOI: 10.1016/j.knosys.2020.106020] [Citation(s) in RCA: 41] [Impact Index Per Article: 8.2] [Indexed: 12/14/2022]
|
41
|
Wang KZ, Bani-Fatemi A, Adanty C, Harripaul R, Griffiths J, Kolla N, Gerretsen P, Graff A, De Luca V. Prediction of physical violence in schizophrenia with machine learning algorithms. Psychiatry Res 2020; 289:112960. [PMID: 32361562 DOI: 10.1016/j.psychres.2020.112960] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 10/22/2019] [Revised: 03/17/2020] [Accepted: 03/27/2020] [Indexed: 10/24/2022]
Abstract
Patients with schizophrenia have been shown to have an increased risk for physical violence. While certain features have been identified as risk factors, it has been difficult to integrate these variables to identify violent patients. The present study thus attempts to develop a clinically-relevant predictive tool. In a population of 275 schizophrenia patients, we identified 103 participants as violent and 172 as non-violent through electronic medical documentation, and conducted cross-sectional assessments to identify demographic, clinical, and sociocultural variables. Using these predictors, we utilized seven machine learning classification algorithms to predict past instances of physical violence. Our classification algorithms predicted with significant accuracy compared to random discrimination alone, and had varying degrees of predictive power, as described by various performance measures. We determined that the random forest model performed marginally better than other algorithms, with an accuracy of 62% and an area under the receiver operator characteristic curve (AUROC) of 0.63. To summarize, machine learning classification algorithms are becoming increasingly valuable, though optimization of these models is needed to better complement diagnostic decisions regarding early interventional measures to predict instances of physical violence.
Affiliation(s)
- Kevin Z Wang
- Group for Suicide Studies, Centre for Addiction and Mental Health, 250 College St, M5T1R8, Toronto, Canada
| | - Ali Bani-Fatemi
- Group for Suicide Studies, Centre for Addiction and Mental Health, 250 College St, M5T1R8, Toronto, Canada
| | - Christopher Adanty
- Group for Suicide Studies, Centre for Addiction and Mental Health, 250 College St, M5T1R8, Toronto, Canada
| | - Ricardo Harripaul
- Group for Suicide Studies, Centre for Addiction and Mental Health, 250 College St, M5T1R8, Toronto, Canada
| | - John Griffiths
- Group for Suicide Studies, Centre for Addiction and Mental Health, 250 College St, M5T1R8, Toronto, Canada
| | - Nathan Kolla
- Group for Suicide Studies, Centre for Addiction and Mental Health, 250 College St, M5T1R8, Toronto, Canada
| | - Philip Gerretsen
- Group for Suicide Studies, Centre for Addiction and Mental Health, 250 College St, M5T1R8, Toronto, Canada
| | - Ariel Graff
- Group for Suicide Studies, Centre for Addiction and Mental Health, 250 College St, M5T1R8, Toronto, Canada
| | - Vincenzo De Luca
- Group for Suicide Studies, Centre for Addiction and Mental Health, 250 College St, M5T1R8, Toronto, Canada.
| |
|
42
|
N'Diaye A, Byrns B, Cory AT, Nilsen KT, Walkowiak S, Sharpe A, Robinson SJ, Pozniak CJ. Machine learning analyses of methylation profiles uncovers tissue-specific gene expression patterns in wheat. THE PLANT GENOME 2020; 13:e20027. [PMID: 33016606 DOI: 10.1002/tpg2.20027] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 11/25/2019] [Revised: 01/24/2020] [Accepted: 04/12/2020] [Indexed: 06/11/2023]
Abstract
DNA methylation is a mechanism of epigenetic modification in eukaryotic organisms. Generally, methylation within gene promoters inhibits regulatory protein binding and represses transcription, whereas gene body methylation is associated with actively transcribed genes. However, it remains unclear whether there is interaction between methylation levels across genic regions and which site has the biggest impact on gene regulation. We investigated and used the methylation patterns of the bread wheat cultivar Chinese Spring to uncover differentially expressed genes (DEGs) between roots and leaves, using six machine learning algorithms and a deep neural network. As anticipated, genes with higher expression in leaves were mainly involved in photosynthesis and pigment biosynthesis processes whereas genes that were not differentially expressed between roots and leaves were involved in protein processes and membrane structures. Methylation occurred preponderantly (60%) in the CG context, whereas 35 and 5% of methylation occurred in CHG and CHH contexts, respectively. Methylation levels were highly correlated (r = 0.7 to 0.9) between all genic regions, except within the promoter (r = 0.4 to 0.5). Machine learning models gave a high (0.81) prediction accuracy of DEGs. There was a strong correlation (p-value = 9.20×10^-10) between all features and gene expression, suggesting that methylation across all genic regions contributes to gene regulation. However, the methylation of the promoter, the CDS and the exon in CG context was the most impactful. Our study provides more insights into the interplay between DNA methylation and gene expression and paves the way for identifying tissue-specific genes using methylation profiles.
Affiliation(s)
- Amidou N'Diaye
- Department of Plant Sciences and Crop Development Centre, University of Saskatchewan, Saskatoon, SK, Canada, S7N 5A8
| | - Brook Byrns
- Department of Plant Sciences and Crop Development Centre, University of Saskatchewan, Saskatoon, SK, Canada, S7N 5A8
| | - Aron T Cory
- Department of Plant Sciences and Crop Development Centre, University of Saskatchewan, Saskatoon, SK, Canada, S7N 5A8
| | - Kirby T Nilsen
- Department of Plant Sciences and Crop Development Centre, University of Saskatchewan, Saskatoon, SK, Canada, S7N 5A8
| | - Sean Walkowiak
- Department of Plant Sciences and Crop Development Centre, University of Saskatchewan, Saskatoon, SK, Canada, S7N 5A8
| | - Andrew Sharpe
- Global Institute for Food Security, Saskatoon, SK, Canada, S7N 0W9
| | - Stephen J Robinson
- Saskatoon Research and Development Centre, Agriculture and Agri-Food Canada, Saskatoon, SK, Canada, S7N 0X2
| | - Curtis J Pozniak
- Department of Plant Sciences and Crop Development Centre, University of Saskatchewan, Saskatoon, SK, Canada, S7N 5A8
| |
|
43
|
Fu GH, Wu YJ, Zong MJ, Pan J. Hellinger distance-based stable sparse feature selection for high-dimensional class-imbalanced data. BMC Bioinformatics 2020; 21:121. [PMID: 32293252 PMCID: PMC7092448 DOI: 10.1186/s12859-020-3411-3] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 04/05/2019] [Accepted: 02/12/2020] [Indexed: 11/11/2022] Open
Abstract
Background Feature selection in class-imbalance learning has gained increasing attention in recent years due to the massive growth of high-dimensional class-imbalanced data across many scientific fields. In addition to reducing model complexity and discovering key biomarkers, feature selection is also an effective method of combating overlapping which may arise in such data and become a crucial aspect for determining classification performance. However, ordinary feature selection techniques for classification cannot simply be used for addressing class-imbalanced data without any adjustment. Thus, more efficient feature selection techniques must be developed for complicated class-imbalanced data, especially in the context of high-dimensionality. Results We proposed an algorithm called sssHD to achieve stable sparse feature selection and applied it to complicated class-imbalanced data. sssHD is based on the Hellinger distance (HD) coupled with sparse regularization techniques. We stated that Hellinger distance is not only class-insensitive but also translation-invariant. Simulation results indicate that the HD-based selection algorithm is effective in recognizing key features and controlling false discoveries for class-imbalance learning. Five gene expression datasets are also employed to test the performance of the sssHD algorithm, and a comparison with several existing selection procedures is performed. The results show that sssHD is highly competitive in terms of five assessment metrics. In addition, sssHD presents limited differences between performing and not performing re-balance preprocessing. Conclusions sssHD is a practical feature selection method for high-dimensional class-imbalanced data, which is simple and can be an alternative for performing feature selection in class-imbalanced data. sssHD can be easily extended by connecting it with different re-balance preprocessing, different sparse regularization structures as well as different classifiers. As such, the algorithm is extremely general and has a wide range of applicability.
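The class-insensitivity property mentioned in the abstract can be illustrated with a minimal sketch. It uses the standard discrete Hellinger distance applied to class-conditional histograms of a single feature; it is not the sssHD algorithm, and the helper names `hellinger` and `feature_hellinger` and the toy data are assumptions made for illustration.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions that each sum to 1."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def feature_hellinger(x, y, bins=10):
    """Distance between the class-conditional histograms of one feature."""
    edges = np.histogram_bin_edges(x, bins=bins)
    p, _ = np.histogram(x[y == 1], bins=edges)
    q, _ = np.histogram(x[y == 0], bins=edges)
    # Normalising each histogram separately removes the class sizes.
    return hellinger(p / p.sum(), q / q.sum())

# Toy imbalanced feature: 50 minority and 500 majority samples.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(2.0, 1.0, 50), rng.normal(0.0, 1.0, 500)])
y = np.array([1] * 50 + [0] * 500)
d_original = feature_hellinger(x, y)

# Duplicating every majority sample doubles the imbalance but keeps the distance.
x_dup = np.concatenate([x, x[y == 0]])
y_dup = np.concatenate([y, np.zeros(500, dtype=int)])
d_duplicated = feature_hellinger(x_dup, y_dup)
```

Because each class histogram is normalised on its own, the score depends on the shapes of the two class-conditional distributions and not on how many samples each class contributes.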
Affiliation(s)
- Guang-Hui Fu
- School of Science, Kunming University of Science and Technology, Kunming, 650500, People's Republic of China.
| | - Yuan-Jiao Wu
- School of Science, Kunming University of Science and Technology, Kunming, 650500, People's Republic of China
| | - Min-Jie Zong
- School of Science, Kunming University of Science and Technology, Kunming, 650500, People's Republic of China
| | - Jianxin Pan
- School of Mathematics, The University of Manchester, Manchester, M13 9PL, UK
| |
|
44
|
Abdelhamid N, Padmavathy A, Peebles D, Thabtah F, Goulder-Horobin D. Data Imbalance in Autism Pre-Diagnosis Classification Systems: An Experimental Study. JOURNAL OF INFORMATION & KNOWLEDGE MANAGEMENT 2020. [DOI: 10.1142/s0219649220400146] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 11/18/2022]
Abstract
Machine learning (ML) is a branch of computer science that is rapidly gaining popularity within the healthcare arena due to its ability to explore large datasets to discover useful patterns that can be interpreted for decision-making and prediction. ML techniques are used for the analysis of clinical parameters and their combinations for prognosis, therapy planning and support and patient management and wellbeing. In this research, we investigate a crucial problem associated with medical applications such as autism spectrum disorder (ASD) data imbalances in which cases are far more than just controls in the dataset. In autism diagnosis data, the number of possible instances is linked with one class, i.e. the no ASD is larger than the ASD, and this may cause performance issues such as models favouring the majority class and undermining the minority class. This research experimentally measures the impact of the class imbalance issue on the performance of different classifiers on real autism datasets when various data imbalance approaches are utilised in the pre-processing phase. We employ oversampling techniques, such as Synthetic Minority Oversampling (SMOTE), and undersampling with different classifiers including Naive Bayes, RIPPER, C4.5 and Random Forest to measure the impact of these on the performance of the models derived in terms of area under curve and other metrics. Results pinpoint that oversampling techniques are superior to undersampling techniques, at least for the toddlers’ autism dataset that we consider, and suggest that further work should look at incorporating sampling techniques with feature selection to generate models that do not overfit the dataset.
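The core idea of SMOTE-style oversampling, creating synthetic minority points by interpolating between a minority sample and one of its nearest minority neighbours, can be shown in a minimal sketch. This is a simplified NumPy illustration and not the implementation used in the study; the function name `smote_sketch`, the parameter defaults, and the toy data are assumptions.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)  # needs at least two minority rows
    k = min(k, n - 1)
    # Pairwise squared Euclidean distances among the minority rows.
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)  # a row is never its own neighbour
    neighbors = np.argsort(d2, axis=1)[:, :k]
    # Pick a base row, then one of its k nearest neighbours, then a random gap.
    base = rng.integers(0, n, size=n_new)
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

# Toy minority class with 20 rows and 3 features (illustrative only).
X_min = np.random.default_rng(1).normal(size=(20, 3))
synthetic = smote_sketch(X_min, n_new=40)
```

Every synthetic row lies on a line segment between two existing minority rows, which is why such oversampling is normally applied to the training data only.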
Affiliation(s)
- Neda Abdelhamid
- IT Programme, Auckland Institute of Studies, Auckland, New Zealand
| | - Arun Padmavathy
- Digital Technologies, Manukau Institute of Technology, Auckland, New Zealand
| | - David Peebles
- Department of Psychology, University of Huddersfield, Queensgate, Huddersfield HD1 3DH, UK
| | - Fadi Thabtah
- Digital Technologies, Manukau Institute of Technology, Auckland, New Zealand
| | | |
|
45
|
Qian Y, Ye S, Zhang Y, Zhang J. SUMO-Forest: A Cascade Forest based method for the prediction of SUMOylation sites on imbalanced data. Gene 2020; 741:144536. [PMID: 32160959 DOI: 10.1016/j.gene.2020.144536] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 11/20/2019] [Revised: 03/03/2020] [Accepted: 03/06/2020] [Indexed: 11/30/2022]
Affiliation(s)
- Ying Qian
- School of Computer Science & Technology, East China Normal University, North Zhongshan Road, 200062 Shanghai, China.
| | - Shasha Ye
- School of Computer Science & Technology, East China Normal University, North Zhongshan Road, 200062 Shanghai, China.
| | - Yu Zhang
- School of Computer Science & Technology, East China Normal University, North Zhongshan Road, 200062 Shanghai, China.
| | - Jiongmin Zhang
- School of Computer Science & Technology, East China Normal University, North Zhongshan Road, 200062 Shanghai, China.
| |
|
46
|
Rahman R, Kodesh A, Levine SZ, Sandin S, Reichenberg A, Schlessinger A. Identification of newborns at risk for autism using electronic medical records and machine learning. Eur Psychiatry 2020; 63:e22. [PMID: 32100657 PMCID: PMC7315872 DOI: 10.1192/j.eurpsy.2020.17] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Indexed: 12/27/2022] Open
Abstract
BACKGROUND Current approaches for early identification of individuals at high risk for autism spectrum disorder (ASD) in the general population are limited, and most ASD patients are not identified until after the age of 4. This is despite substantial evidence suggesting that early diagnosis and intervention improves developmental course and outcome. The aim of the current study was to test the ability of machine learning (ML) models applied to electronic medical records (EMRs) to predict ASD early in life, in a general population sample. METHODS We used EMR data from a single Israeli Health Maintenance Organization, including EMR information for parents of 1,397 ASD children (ICD-9/10) and 94,741 non-ASD children born between January 1st, 1997 and December 31st, 2008. Routinely available parental sociodemographic information, parental medical histories, and prescribed medications data were used to generate features to train various ML algorithms, including multivariate logistic regression, artificial neural networks, and random forest. Prediction performance was evaluated with 10-fold cross-validation by computing the area under the receiver operating characteristic curve (AUC; C-statistic), sensitivity, specificity, accuracy, false positive rate, and precision (positive predictive value [PPV]). RESULTS All ML models tested had similar performance. The average performance across all models had C-statistic of 0.709, sensitivity of 29.93%, specificity of 98.18%, accuracy of 95.62%, false positive rate of 1.81%, and PPV of 43.35% for predicting ASD in this dataset. CONCLUSIONS We conclude that ML algorithms combined with EMR capture early life ASD risk as well as reveal previously unknown features to be associated with ASD-risk. Such approaches may be able to enhance the ability for accurate and efficient early detection of ASD in large populations of children.
Affiliation(s)
- Rayees Rahman
- Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
| | - Arad Kodesh
- Department of Mental Health, Meuhedet Health Services, Tel Aviv, Israel.,Department of Community Health, University of Haifa, Haifa, Israel
| | - Stephen Z Levine
- Department of Community Health, University of Haifa, Haifa, Israel
| | - Sven Sandin
- Department of Psychiatry, Icahn School of Medicine at Mount Sinai, New York, New York, USA.,Seaver Center for Autism Research and Treatment, Icahn School of Medicine at Mount Sinai, New York, New York, USA
| | - Abraham Reichenberg
- Department of Psychiatry, Icahn School of Medicine at Mount Sinai, New York, New York, USA.,Seaver Center for Autism Research and Treatment, Icahn School of Medicine at Mount Sinai, New York, New York, USA.,MINDICH Child Health and Development Institute, Icahn School of Medicine at Mount Sinai, New York, New York, USA.,Department of Environmental Medicine and Public Health, Icahn School of Medicine at Mount Sinai, New York, New York, USA
| | - Avner Schlessinger
- Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
| |
|
47
|
Matsuzaka Y, Uesawa Y. DeepSnap-Deep Learning Approach Predicts Progesterone Receptor Antagonist Activity With High Performance. Front Bioeng Biotechnol 2020; 7:485. [PMID: 32039185 PMCID: PMC6987043 DOI: 10.3389/fbioe.2019.00485] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 11/01/2019] [Accepted: 12/30/2019] [Indexed: 12/16/2022] Open
Abstract
The progesterone receptor (PR) is an important therapeutic target for many malignancies and endocrine disorders due to its role in controlling ovulation and pregnancy via the reproductive cycle. Therefore, the modulation of PR activity using its agonists and antagonists is receiving increasing interest as a novel treatment strategy. However, clinical trials using PR modulators have not yet yielded conclusive evidence. Recently, increasing evidence from several fields shows that the classification of chemical compounds, including agonists and antagonists, can be done with recent improvements in deep learning (DL) using deep neural networks. Therefore, we recently proposed a novel DL-based quantitative structure-activity relationship (QSAR) strategy using transfer learning to build prediction models for agonists and antagonists. By employing this novel approach, referred to as the DeepSnap-DL method, which uses images captured from 3-dimensional (3D) chemical structures at multiple angles as input data for the DL classification, we constructed prediction models of the PR antagonists in this study. Here, the DeepSnap-DL method showed high-performance prediction of the PR antagonists through optimization of some parameters and image adjustment from 3D structures. Furthermore, comparison of the prediction models from this approach with conventional machine learning (ML) methods indicated that the DeepSnap-DL method outperformed these ML methods. Therefore, the models built by DeepSnap-DL would be a powerful tool not only for the QSAR field in predicting physiological and agonist/antagonist activities, toxicity, and molecular bindings, but also for identifying biological or pathological phenomena.
Affiliation(s)
| | - Yoshihiro Uesawa
- Department of Medical Molecular Informatics, Meiji Pharmaceutical University, Tokyo, Japan
| |
|
48
|
Chang YW, Tsai SJ, Wu YF, Yang AC. Development of an AI-Based Web Diagnostic System for Phenotyping Psychiatric Disorders. Front Psychiatry 2020; 11:542394. [PMID: 33250789 PMCID: PMC7674487 DOI: 10.3389/fpsyt.2020.542394] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 03/12/2020] [Accepted: 09/14/2020] [Indexed: 12/14/2022] Open
Abstract
Background: Artificial intelligence (AI)-based medical diagnostic applications are on the rise. Our recent study has suggested an explainable deep neural network (EDNN) framework for identifying key structural deficits related to the pathology of schizophrenia. Here, we presented an AI-based web diagnostic system for schizophrenia under the EDNN framework with three-dimensional (3D) visualization of subjects' neuroimaging dataset. Methods: This AI-based web diagnostic system consisted of a web server and a neuroimaging diagnostic database. The web server deployed the EDNN algorithm under the Node.js environment. Feature selection and network model building were performed on the dataset obtained from two hundred schizophrenic patients and healthy controls in the Taiwan Aging and Mental Illness (TAMI) cohort. We included an independent cohort with 88 schizophrenic patients and 44 healthy controls recruited at Tri-Service General Hospital Beitou Branch for validation purposes. Results: Our AI-based web diagnostic system achieved 84.00% accuracy (89.47% sensitivity, 80.62% specificity) for gray matter (GM) and 90.22% accuracy (89.21% sensitivity, 91.23% specificity) for white matter (WM) on the TAMI cohort. For the Beitou cohort as an unseen test set, the model achieved 77.27 and 70.45% accuracy for GM and WM. Furthermore, it achieved 85.50 and 88.20% accuracy after model retraining to mitigate the effects of drift on the predictive capability. Moreover, our system visualized the identified voxels in brain atrophy in a 3D manner with patients' structural image, optimizing the evaluation process of the diagnostic results. Discussion: Together, our approach under the EDNN framework demonstrated the potential future direction of making a schizophrenia diagnosis based on structural brain imaging data. Our deep learning model is explainable, arguing for the accuracy of the key information related to the pathology of schizophrenia when using the AI-based web assessment platform. The rationale of this approach is in accordance with the Research Domain Criteria suggested by the National Institute of Mental Health.
Affiliation(s)
- Yu-Wei Chang
- Institute of Brain Science and Digital Medicine Center, National Yang-Ming University, Taipei, Taiwan
| | - Shih-Jen Tsai
- Institute of Brain Science and Digital Medicine Center, National Yang-Ming University, Taipei, Taiwan.,Department of Psychiatry, Taipei Veterans General Hospital, Taipei, Taiwan.,Division of Psychiatry, School of Medicine, National Yang-Ming University, Taipei, Taiwan
| | - Yung-Fu Wu
- Department of Psychiatry, Beitou Branch, Tri-service General Hospital, National Defense Medical Center, Taipei, Taiwan
| | - Albert C Yang
- Institute of Brain Science and Digital Medicine Center, National Yang-Ming University, Taipei, Taiwan.,Brain Medicine Center, Tao-Yuan Psychiatric Center, Tao-Yuan, Taiwan
| |
|
49
|
Machine learning algorithms, bull genetic information, and imbalanced datasets used in abortion incidence prediction models for Iranian Holstein dairy cattle. Prev Vet Med 2019; 175:104869. [PMID: 31896505 DOI: 10.1016/j.prevetmed.2019.104869] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 10/27/2018] [Revised: 12/13/2019] [Accepted: 12/16/2019] [Indexed: 11/21/2022]
Abstract
The ability to predict abortion incidence, especially in regions with high abortion rates (e.g., Iran), helps improve reproductive performance and, thereby, dairy farm profitability. The objective of this study was to predict pregnancy loss in Iranian dairy herds. For this purpose, the cow history records and bull genetic information available at 6 large commercial dairy farms with cows calved between 2005 and 2014 were extracted from an on-farm record-keeping software. Using WEKA, 12 commonly used machine learning (ML) algorithms were applied to the dataset. The algorithms belonged to 5 classifier groups, which were Bayes, meta, functions, rules, and trees. The original dataset including herd-cow factors was randomly divided into 2 subsets: a training dataset and a test one (at a ratio of 60:40). The original dataset was combined with the bull genetic information to create a full dataset. The average abortion rate was 15.4 %, which represented an imbalanced dataset. Therefore, 2 down- and up-sampling techniques were additionally implemented on the original dataset (more specifically on the training one) to create 2 balanced datasets. This ultimately yielded 4 datasets: original, full, down-sampling, and up-sampling. Different algorithms and models were evaluated based on F-measure and area under the curve (AUC). Based on the results obtained, ML algorithms exhibited a high performance in predicting abortion when applied to the balanced dataset. However, their performance varied from 32.3 % (poor) to 69.2 % (medium upward) when applied to the imbalanced original dataset. In addition to the imbalance in the original dataset, the reason for these poor results was attributed to the high proportion of unknown risk factors underlying abortion incidence. Even including the bull genetic information did not lead to any significant improvements in the prediction model. From among the datasets used, the Bayes algorithms outperformed the others in predicting pregnancy losses while rules had the worst performance. Furthermore, while the Bayes algorithms were not affected by the type of dataset (balanced or imbalanced), substantial increases in F-measure and AUC were observed for rules, trees, and functions with balanced datasets. Overall, the balanced models outperformed the others, with the down-sampling method exhibiting the highest performance. Despite the fact that the prediction models used in this study did not perform as expected, it was shown that they can be beneficially used to predict and reduce pregnancy losses, despite their moderate accuracy, especially when used for herds with high abortion rates and low reproductive performances.
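The resampling protocol described above, a 60:40 split followed by down- or up-sampling of the training portion only, can be sketched as follows. The study used WEKA; this NumPy version, the function name `split_then_resample`, and the toy labels with a 15.4 % positive rate are assumptions made purely for illustration.

```python
import numpy as np

def split_then_resample(y, mode="down", train_frac=0.6, seed=0):
    """Split indices into train/test, then balance only the training indices."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_train = int(round(train_frac * len(y)))
    train, test = idx[:n_train], idx[n_train:]
    pos = train[y[train] == 1]
    neg = train[y[train] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    if mode == "down":
        # Discard majority rows until both classes have the same size.
        majority = rng.choice(majority, size=len(minority), replace=False)
    elif mode == "up":
        # Repeat minority rows (with replacement) to match the majority size.
        minority = rng.choice(minority, size=len(majority), replace=True)
    else:
        raise ValueError("mode must be 'down' or 'up'")
    return np.concatenate([minority, majority]), test

# Toy labels: 154 positives out of 1000 rows, mirroring the 15.4 % rate.
y = np.zeros(1000, dtype=int)
y[:154] = 1
train_down, test_down = split_then_resample(y, mode="down")
train_up, test_up = split_then_resample(y, mode="up")
```

The test indices are left untouched in both modes, so evaluation still reflects the original class ratio.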
|
50
|
A Comprehensive Analysis of Synthetic Minority Oversampling Technique (SMOTE) for handling class imbalance. Inf Sci (N Y) 2019. [DOI: 10.1016/j.ins.2019.07.070] [Citation(s) in RCA: 100] [Impact Index Per Article: 16.7] [Indexed: 11/20/2022]
|