1
Cheung EYW, Wu RWK, Chu ESM, Mak HKF. Integrating Demographics and Imaging Features for Various Stages of Dementia Classification: Feed Forward Neural Network Multi-Class Approach. Biomedicines 2024; 12:896. [PMID: 38672253 PMCID: PMC11047992 DOI: 10.3390/biomedicines12040896] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/29/2023] [Revised: 03/05/2024] [Accepted: 03/12/2024] [Indexed: 04/28/2024] Open
Abstract
BACKGROUND: MRI magnetization-prepared rapid acquisition with gradient echo (MPRAGE) is a readily available imaging modality for dementia diagnosis. Previous studies suggested that volumetric analysis plays a crucial role in classifying the various stages of dementia. In this study, volumetry, radiomics and demographics were integrated as inputs to develop an artificial intelligence model for classifying the stages of dementia, namely Alzheimer's disease (AD), mild cognitive decline (MCI) and cognitively normal (CN). METHOD: The Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset was separated into training and testing groups, and the Open Access Series of Imaging Studies (OASIS) dataset was used as the second testing group. The MRI MPRAGE images were reoriented via statistical parametric mapping (SPM12). Freesurfer was employed for brain segmentation, and 45 regional brain volumes were retrieved. The 3D Slicer software was employed to extract 107 radiomics features from the whole brain. Data on patient demographics were collected from the datasets. A feed-forward neural network (FFNN) and other common artificial intelligence algorithms, including support vector machine (SVM), ensemble classifier (EC) and decision tree (DT), were used to build the models using various features. RESULTS: The integration of brain regional volumes, radiomics and patient demographics attained the highest overall accuracy, 76.57% in ADNI testing and 73.14% in OASIS testing. The subclass accuracies for MCI, AD and CN were 78.29%, 89.71% and 85.14%, respectively, in ADNI testing, and 74.86%, 88% and 83.43% in OASIS testing. Balanced sensitivity and specificity were obtained for all subclass classifications (MCI, AD and CN). CONCLUSION: The FFNN yielded good overall accuracy for MCI, AD and CN categorization, with balanced subclass accuracy, sensitivity and specificity. The proposed FFNN model is simple, and it may support the triage of patients for further confirmation of the diagnosis.
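The subclass accuracies reported above appear consistent with a one-vs-rest reading of the three-class confusion matrix: in a three-class problem each misclassified sample counts as an error for exactly two classes (its true class and its predicted class), so the per-class error rates sum to twice the overall error rate (21.71% + 10.29% + 14.86% = 46.86% = 2 × 23.43% for ADNI testing). The Python sketch below illustrates that calculation under this assumption. The function name `one_vs_rest_metrics` and the confusion-matrix counts are hypothetical and are not taken from the study.

```python
def one_vs_rest_metrics(cm):
    """Per-class accuracy, sensitivity and specificity from a confusion matrix.

    cm[i][j] is the number of samples whose true class is i and whose
    predicted class is j. Each class is scored against all others combined.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]                                  # class k predicted as k
        fn = sum(cm[k]) - tp                           # class k predicted as another class
        fp = sum(cm[i][k] for i in range(n)) - tp      # other classes predicted as k
        tn = total - tp - fn - fp                      # everything else
        metrics.append({
            "accuracy": (tp + tn) / total,
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
        })
    return metrics


# Hypothetical counts, NOT the study's data.
# Rows: true MCI, AD, CN. Columns: predicted MCI, AD, CN.
labels = ["MCI", "AD", "CN"]
cm = [
    [40, 5, 5],
    [6, 30, 4],
    [8, 2, 50],
]

overall_accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(sum(row) for row in cm)
per_class = one_vs_rest_metrics(cm)

print(f"overall accuracy: {overall_accuracy:.4f}")
for label, m in zip(labels, per_class):
    print(label, {name: round(value, 4) for name, value in m.items()})
```

Because one-vs-rest accuracy also credits true negatives, each subclass accuracy can exceed the overall multi-class accuracy, which matches the pattern of the reported figures.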
Affiliation(s)
- Eva Y. W. Cheung
  - School of Medical and Health Sciences, Tung Wah College, 31 Wylie Road, HoManTin, Hong Kong
- Ricky W. K. Wu
  - Department of Biological and Biomedical Sciences, School of Health and Life Sciences, Glasgow Caledonian University, Glasgow G4 0BA, UK
- Ellie S. M. Chu
  - School of Medical and Health Sciences, Tung Wah College, 31 Wylie Road, HoManTin, Hong Kong
- Henry K. F. Mak
  - Department of Diagnostic Radiology, School of Clinical Medicine, LKS Faculty of Medicine, University of Hong Kong, Hong Kong
2
Bucholc M, James C, Khleifat AA, Badhwar A, Clarke N, Dehsarvi A, Madan CR, Marzi SJ, Shand C, Schilder BM, Tamburin S, Tantiangco HM, Lourida I, Llewellyn DJ, Ranson JM. Artificial intelligence for dementia research methods optimization. Alzheimers Dement 2023; 19:5934-5951. [PMID: 37639369 DOI: 10.1002/alz.13441] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/03/2023] [Revised: 07/19/2023] [Accepted: 07/23/2023] [Indexed: 08/31/2023]
Abstract
Artificial intelligence (AI) and machine learning (ML) approaches are increasingly being used in dementia research. However, several methodological challenges exist that may limit the insights we can obtain from high-dimensional data and our ability to translate these findings into improved patient outcomes. To improve reproducibility and replicability, researchers should make their well-documented code and modeling pipelines openly available. Data should also be shared where appropriate. To enhance the acceptability of models and AI-enabled systems to users, researchers should prioritize interpretable methods that provide insights into how decisions are generated. Models should be developed using multiple, diverse datasets to improve robustness, generalizability, and reduce potentially harmful bias. To improve clarity and reproducibility, researchers should adhere to reporting guidelines that are co-produced with multiple stakeholders. If these methodological challenges are overcome, AI and ML hold enormous promise for changing the landscape of dementia research and care. HIGHLIGHTS: Machine learning (ML) can improve diagnosis, prevention, and management of dementia. Inadequate reporting of ML procedures affects reproduction/replication of results. ML models built on unrepresentative datasets do not generalize to new datasets. Obligatory metrics for certain model structures and use cases have not been defined. Interpretability and trust in ML predictions are barriers to clinical translation.
Affiliation(s)
- Magda Bucholc
  - Cognitive Analytics Research Lab, School of Computing, Engineering & Intelligent Systems, Ulster University, Derry, UK
- Charlotte James
  - NIHR Bristol Biomedical Research Centre, University Hospitals Bristol and Weston NHS Foundation Trust and University of Bristol, Bristol, UK
- Ahmad Al Khleifat
  - Department of Basic and Clinical Neuroscience, Institute of Psychiatry, Psychology & Neuroscience, King's College London, London, UK
- AmanPreet Badhwar
  - Multiomics Investigation of Neurodegenerative Diseases (MIND) Lab, Centre de Recherche de l'Institut Universitaire de Gériatrie de Montréal, Montréal, Quebec, Canada
  - Institut de génie biomédical, Université de Montréal, Montréal, Quebec, Canada
  - Département de Pharmacologie et Physiologie, Université de Montréal, Montréal, Quebec, Canada
- Natasha Clarke
  - Multiomics Investigation of Neurodegenerative Diseases (MIND) Lab, Centre de Recherche de l'Institut Universitaire de Gériatrie de Montréal, Montréal, Quebec, Canada
- Amir Dehsarvi
  - Aberdeen Biomedical Imaging Centre, School of Medicine, Medical Sciences, and Nutrition, University of Aberdeen, Aberdeen, UK
- Sarah J Marzi
  - UK Dementia Research Institute, Imperial College London, London, UK
  - Department of Brain Sciences, Imperial College London, London, UK
- Cameron Shand
  - Centre for Medical Image Computing, Department of Computer Science, University College London, London, UK
- Brian M Schilder
  - UK Dementia Research Institute, Imperial College London, London, UK
  - Department of Brain Sciences, Imperial College London, London, UK
- Stefano Tamburin
  - Department of Neurosciences, Biomedicine and Movement Sciences, University of Verona, Verona, Italy
- David J Llewellyn
  - University of Exeter Medical School, Exeter, UK
  - The Alan Turing Institute, London, UK
3
Han Trong T, Nguyen Van H, Vu Dang L. High-Performance Method for Brain Tumor Feature Extraction in MRI Using Complex Network. Appl Bionics Biomech 2023; 2023:8843488. [PMID: 37780200 PMCID: PMC10539089 DOI: 10.1155/2023/8843488] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/26/2023] [Revised: 08/12/2023] [Accepted: 08/26/2023] [Indexed: 10/03/2023] Open
Abstract
Objective: To localize brain tumors on MRI and distinguish between benign and malignant tumors. Method: This work proposes a high-performance method for brain tumor feature extraction that combines a complex network with the U-Net architecture. Common machine-learning algorithms are then used to discriminate between benign and malignant tumors. Experiments and Results: Brain MRI data from a total of 230 brain tumor patients, comprising 77 high-grade glioma patients and 153 low-grade glioma patients, were processed. Classification of benign and malignant tumors achieved an accuracy of 99.84%. Conclusion: The high accuracy of the experimental results demonstrates that combining the complex network with the U-Net architecture can significantly improve the accuracy of brain tumor classification. This method could potentially be useful for clinicians in aiding diagnosis and treatment planning for brain tumor patients.
Affiliation(s)
- Thanh Han Trong
  - School of Electronics and Telecommunications, Hanoi University of Science and Technology, Hanoi, Vietnam
- Hinh Nguyen Van
  - Department of Science and Technology Management and International Cooperation, Posts and Telecommunications Institute of Technology, Hanoi, Vietnam
4
Jain A, Begum T, Ahmad S. Analysis and Prediction of Pathogen Nucleic Acid Specificity for Toll-like Receptors in Vertebrates. J Mol Biol 2023; 435:168208. [PMID: 37479078 DOI: 10.1016/j.jmb.2023.168208] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2023] [Revised: 06/20/2023] [Accepted: 07/13/2023] [Indexed: 07/23/2023]
Abstract
Identification of key sequence, expression and function related features of nucleic acid-sensing host proteins is of fundamental importance to understand the dynamics of pathogen-specific host responses. To meet this objective, we considered toll-like receptors (TLRs), a representative class of membrane-bound sensor proteins, from 17 vertebrate species covering mammals, birds, reptiles, amphibians, and fishes in this comparative study. We identified the molecular signatures of host TLRs that are responsible for sensing pathogen nucleic acids or other pathogen-associated molecular patterns (PAMPs), and potentially play important roles in host defence mechanism. Interestingly, our findings reveal that such host-specific features are directly related to the strand (single or double) specificity of nucleic acid from pathogens. However, during host-pathogen interactions, such features were unable to explain the pathogenic PAMP (i.e., DNA, RNA or other) selectivity, suggesting a more complex mechanism. Using these features, we developed a number of machine learning models, of which Random Forest achieved a high performance (94.57% accuracy) to predict strand specificity of TLRs from protein-derived features. We applied the trained model to propose strand specificity of some previously uncharacterized distinct fish-specific novel TLRs (TLR18, TLR23, TLR24, TLR25, TLR27).
Affiliation(s)
- Anuja Jain
  - School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi 110067, India. https://twitter.com/@Anuja334
- Tina Begum
  - School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi 110067, India.
- Shandar Ahmad
  - School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi 110067, India.
5
Park D, Son SI, Kim MS, Kim TY, Choi JH, Lee SE, Hong D, Kim MC. Machine learning predictive model for aspiration screening in hospitalized patients with acute stroke. Sci Rep 2023; 13:7835. [PMID: 37188793 DOI: 10.1038/s41598-023-34999-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/19/2022] [Accepted: 05/11/2023] [Indexed: 05/17/2023] Open
Abstract
Dysphagia is a fatal condition after acute stroke. We established machine learning (ML) models for screening aspiration in patients with acute stroke. This retrospective study enrolled patients with acute stroke admitted to a cerebrovascular specialty hospital between January 2016 and June 2022. A videofluoroscopic swallowing study (VFSS) confirmed aspiration. We evaluated the Gugging Swallowing Screen (GUSS), an early assessment tool for dysphagia, in all patients and compared its predictive value with that of the ML models. The following ML algorithms were applied: regularized logistic regressions (ridge, lasso, and elastic net), random forest, extreme gradient boosting, support vector machines, k-nearest neighbors, and naïve Bayes. We finally analyzed data from 3408 patients, of whom 448 had aspiration on VFSS. The GUSS showed an area under the receiver operating characteristics curve (AUROC) of 0.79 (0.77-0.81). The ridge regression model was the best among all ML models, with an AUROC of 0.81 (0.76-0.86) and an F1 measure of 0.45. Regularized logistic regression models exhibited higher sensitivity (0.66-0.72) than the GUSS (0.64). Feature importance analyses revealed that the modified Rankin scale was the most important feature for ML performance. The proposed ML prediction models are valid and practical for screening aspiration in patients with acute stroke.
Affiliation(s)
- Dougho Park
  - Department of Medical Science and Engineering, School of Convergence Science and Technology, Pohang University of Science and Technology, Pohang, Republic of Korea.
  - Department of Rehabilitation Medicine, Pohang Stroke and Spine Hospital, Pohang, Republic of Korea.
- Seok Il Son
  - Occupational Therapy Department of Rehabilitation Center, Pohang Stroke and Spine Hospital, Pohang, Republic of Korea
- Min Sol Kim
  - Occupational Therapy Department of Rehabilitation Center, Pohang Stroke and Spine Hospital, Pohang, Republic of Korea
- Tae Yeon Kim
  - Speech-Language Therapy Department of Rehabilitation Center, Pohang Stroke and Spine Hospital, Pohang, Republic of Korea
- Jun Hwa Choi
  - Department of Quality Improvement, Pohang Stroke and Spine Hospital, Pohang, Republic of Korea
- Sang-Eok Lee
  - Department of Rehabilitation Medicine, Pohang Stroke and Spine Hospital, Pohang, Republic of Korea
- Daeyoung Hong
  - Department of Neurosurgery, Pohang Stroke and Spine Hospital, Pohang, Republic of Korea
- Mun-Chul Kim
  - Department of Neurosurgery, Pohang Stroke and Spine Hospital, Pohang, Republic of Korea
6
Bucholc M, James C, Al Khleifat A, Badhwar A, Clarke N, Dehsarvi A, Madan CR, Marzi SJ, Shand C, Schilder BM, Tamburin S, Tantiangco HM, Lourida I, Llewellyn DJ, Ranson JM. Artificial Intelligence for Dementia Research Methods Optimization. ARXIV 2023:arXiv:2303.01949v1. [PMID: 36911275 PMCID: PMC10002770] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/14/2023]
Abstract
INTRODUCTION Machine learning (ML) has been extremely successful in identifying key features from high-dimensional datasets and executing complicated tasks with human expert levels of accuracy or greater. METHODS We summarize and critically evaluate current applications of ML in dementia research and highlight directions for future research. RESULTS We present an overview of ML algorithms most frequently used in dementia research and highlight future opportunities for the use of ML in clinical practice, experimental medicine, and clinical trials. We discuss issues of reproducibility, replicability and interpretability and how these impact the clinical applicability of dementia research. Finally, we give examples of how state-of-the-art methods, such as transfer learning, multi-task learning, and reinforcement learning, may be applied to overcome these issues and aid the translation of research to clinical practice in the future. DISCUSSION ML-based models hold great promise to advance our understanding of the underlying causes and pathological mechanisms of dementia.
Affiliation(s)
- Magda Bucholc
  - Cognitive Analytics Research Lab, School of Computing, Engineering & Intelligent Systems, Ulster University, Derry, UK
- Charlotte James
  - NIHR Bristol Biomedical Research Centre, University Hospitals Bristol and Weston NHS Foundation Trust and University of Bristol, Bristol, UK
- Ahmad Al Khleifat
  - Department of Basic and Clinical Neuroscience, Institute of Psychiatry, Psychology & Neuroscience, King’s College London, London, United Kingdom
- AmanPreet Badhwar
  - Multiomics Investigation of Neurodegenerative Diseases (MIND) Lab, Centre de Recherche de l’Institut Universitaire de Gériatrie de Montréal, Montréal, Canada
  - Institut de génie biomédical, Université de Montréal, Montréal, Canada
  - Département de Pharmacologie et Physiologie, Université de Montréal, Montréal, Canada
- Natasha Clarke
  - Multiomics Investigation of Neurodegenerative Diseases (MIND) Lab, Centre de Recherche de l’Institut Universitaire de Gériatrie de Montréal, Montréal, Canada
- Amir Dehsarvi
  - Aberdeen Biomedical Imaging Centre, School of Medicine, Medical Sciences, and Nutrition, University of Aberdeen, Aberdeen, UK
- Sarah J. Marzi
  - UK Dementia Research Institute, Imperial College London, London, UK
  - Department of Brain Sciences, Imperial College London, London, UK
- Cameron Shand
  - Centre for Medical Image Computing, Department of Computer Science, University College London, London, UK
- Brian M. Schilder
  - UK Dementia Research Institute, Imperial College London, London, UK
  - Department of Brain Sciences, Imperial College London, London, UK
- Stefano Tamburin
  - Department of Neurosciences, Biomedicine and Movement Sciences, University of Verona, Verona, Italy
- David J. Llewellyn
  - University of Exeter Medical School, Exeter, UK
  - The Alan Turing Institute, London, UK
7
Alazwari A, Johnstone A, Tafakori L, Abdollahian M, AlEidan AM, Alfuhigi K, Alghofialy MM, Albunyan AA, Al Abbad H, AlEssa MH, Alareefy AKH, Alshamrani MA. Predicting the development of T1D and identifying its Key Performance Indicators in children; a case-control study in Saudi Arabia. PLoS One 2023; 18:e0282426. [PMID: 36857368 PMCID: PMC9977054 DOI: 10.1371/journal.pone.0282426] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2022] [Accepted: 02/15/2023] [Indexed: 03/02/2023] Open
Abstract
The increasing incidence of type 1 diabetes (T1D) in children is a growing global concern. It is known that genetic and environmental factors contribute to childhood T1D. An optimal model to predict the development of T1D in children using Key Performance Indicators (KPIs) would aid medical practitioners in developing intervention plans. This paper is the first to build a model to predict the risk of developing T1D and identify its significant KPIs in children aged 0-14 years in Saudi Arabia. Machine learning methods, namely Logistic Regression, Random Forest, Support Vector Machine, Naive Bayes, and Artificial Neural Network, were utilised and compared for their relative performance. Analyses were performed in a population-based case-control study from three Saudi Arabian regions. The dataset (n = 1,142) contained demographic and socioeconomic status, genetic and disease history, nutrition history, obstetric history, and maternal characteristics. The comparison between case and control groups showed that most children (cases = 68% and controls = 88%) were from urban areas, 69% (cases) and 66% (controls) were delivered after a full-term pregnancy, and 31% of the case group were delivered by caesarean, which was higher than in the controls (χ² = 4.12, P-value = 0.042). Models were built using all available environmental and family history factors. The efficacy of the models was evaluated using Area Under the Curve, Sensitivity, F Score and Precision. Full logistic regression outperformed the other models with Accuracy = 0.77, Sensitivity, F Score and Precision of 0.70, and AUC = 0.83. The most significant KPIs were early exposure to cow's milk (OR = 2.92, P = 0.000), birth weight >4 kg (OR = 3.11, P = 0.007), rural residency (OR = 3.74, P = 0.000), family history (first and second degree), and maternal age >25 years. The results presented here can assist healthcare providers in collecting and monitoring influential KPIs and developing intervention strategies to reduce the childhood T1D incidence rate in Saudi Arabia.
Affiliation(s)
- Ahood Alazwari
  - School of Science, RMIT University, Melbourne, Victoria, Australia
  - School of Science, Al-Baha University, Al-Baha, Saudi Arabia
- Alice Johnstone
  - School of Science, RMIT University, Melbourne, Victoria, Australia
- Laleh Tafakori
  - School of Science, RMIT University, Melbourne, Victoria, Australia
- Mali Abdollahian
  - School of Science, RMIT University, Melbourne, Victoria, Australia
8
Using Recurrent Neural Networks for Predicting Type-2 Diabetes from Genomic and Tabular Data. Diagnostics (Basel) 2022; 12:diagnostics12123067. [PMID: 36553074 PMCID: PMC9776641 DOI: 10.3390/diagnostics12123067] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 11/04/2022] [Revised: 12/01/2022] [Accepted: 12/04/2022] [Indexed: 12/12/2022] Open
Abstract
The development of genomic technology for smart diagnosis and therapies for various diseases has lately been the most demanding area for computer-aided diagnostic and treatment research. Exponential breakthroughs in artificial intelligence and machine intelligence technologies could pave the way for identifying challenges afflicting the healthcare industry. Genomics is paving the way for predicting future illnesses, including cancer, Alzheimer's disease, and diabetes. Machine learning advancements have expedited the pace of biomedical informatics research and inspired new branches of computational biology. Furthermore, knowing gene relationships has resulted in developing more accurate models that can effectively detect patterns in vast volumes of data, making classification models important in various domains. Recurrent Neural Network models have a memory that allows them to quickly remember knowledge from previous cycles and process genetic data. The present work focuses on type 2 diabetes prediction using gene sequences derived from genomic DNA fragments through automated feature selection and feature extraction procedures for matching gene patterns with training data. The suggested model was tested using tabular data to predict type 2 diabetes based on several parameters. The performance of neural networks incorporating Recurrent Neural Network (RNN) components, Long Short-Term Memory (LSTM), and Gated Recurrent Units (GRU) was tested in this research. The model's efficiency is assessed using evaluation metrics such as Sensitivity, Specificity, Accuracy, F1-Score, and Matthews Correlation Coefficient (MCC). The suggested technique predicted future illnesses with fair Accuracy. Furthermore, our research showed that the suggested model could be used in real-world scenarios and that input risk variables from an end-user Android application could be kept and evaluated on a secure remote server.
9
Thabtah F, Spencer R, Abdelhamid N, Kamalov F, Wentzel C, Ye Y, Dayara T. Autism screening: an unsupervised machine learning approach. Health Inf Sci Syst 2022; 10:26. [PMID: 36092454 PMCID: PMC9458819 DOI: 10.1007/s13755-022-00191-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/11/2022] [Accepted: 08/08/2022] [Indexed: 11/26/2022] Open
Abstract
Early screening of autism spectrum disorders (ASD) is a key area of research in healthcare. Currently artificial intelligence (AI)-driven approaches are used to improve the process of autism diagnosis using computer-aided diagnosis (CAD) systems. One of the issues related to autism diagnosis and screening data is the reliance of the predictions primarily on scores provided by medical screening methods which can be biased depending on how the scores are calculated. We attempt to reduce this bias by assessing the performance of the predictions related to the screening process using a new model that consists of a Self-Organizing Map (SOM) with classification algorithms. The SOM is employed prior to the diagnostic process to derive a new class label using clusters learnt from the independent features; these clusters are related to communication, repetitive traits, and social traits in the input dataset. Then, the new clusters are compared with existing class labels in the dataset to refine and eliminate any inconsistencies. Lastly, the refined dataset is utilised to derive classification systems for autism diagnosis. The new model was evaluated against a real-life autism screening dataset that consists of over 2000 instances of cases and controls. The results based on the refined dataset show that the proposed method achieves significantly higher accuracy, precision, and recall for the classification models derived when compared to models derived from the original dataset.
Affiliation(s)
- Robinson Spencer
  - Digital Technologies, Manukau Institute of Technology, Auckland, New Zealand
- Carl Wentzel
  - Digital Technologies, Manukau Institute of Technology, Auckland, New Zealand
- Yongsheng Ye
  - Digital Technologies, Manukau Institute of Technology, Auckland, New Zealand
- Thanu Dayara
  - Digital Technologies, Manukau Institute of Technology, Auckland, New Zealand
10
Quazi S. Artificial intelligence and machine learning in precision and genomic medicine. Med Oncol 2022; 39:120. [PMID: 35704152 PMCID: PMC9198206 DOI: 10.1007/s12032-022-01711-1] [Citation(s) in RCA: 34] [Impact Index Per Article: 17.0] [Received: 03/06/2022] [Accepted: 03/14/2022] [Indexed: 10/28/2022]
Abstract
The advancement of precision medicine in medical care has left behind the conventional symptom-driven treatment process by allowing early risk prediction of disease through improved diagnostics and customization of more effective treatments. It is necessary to scrutinize overall patient data alongside broad factors to observe and differentiate between ill and relatively healthy people to take the most appropriate path toward precision medicine, resulting in an improved vision of biological indicators that can signal health changes. Precision and genomic medicine combined with artificial intelligence have the potential to improve patient healthcare. Patients with less common therapeutic responses or unique healthcare demands are using genomic medicine technologies. AI provides insights through advanced computation and inference, enabling the system to reason and learn while enhancing physician decision making. Many cell characteristics, including gene up-regulation, proteins binding to nucleic acids, and splicing, can be measured at high throughput and used as training objectives for predictive models. Researchers can create a new era of effective genomic medicine with the improved availability of a broad range of datasets and modern computer techniques such as machine learning. This review article has elucidated the contributions of ML algorithms in precision and genome medicine.
Affiliation(s)
- Sameer Quazi
  - GenLab Biosolutions Private Limited, Bangalore, Karnataka, 560043, India.
  - Department of Biomedical Sciences, School of Life Sciences, Anglia Ruskin University, Cambridge, UK.
12
Kejzlar V, Bhattacharya S, Son M, Maiti T. Black Box Variational Bayesian Model Averaging. AM STAT 2022. [DOI: 10.1080/00031305.2022.2058611] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/01/2022]
Affiliation(s)
- Mookyong Son
  - Department of Statistics and Probability, Michigan State University
- Tapabrata Maiti
  - Department of Statistics and Probability, Michigan State University
13
Jha AN, Kumar A, Tiwari G, Chatterjee N. Identification and analysis of offenders causing hit and run accidents using classification algorithms. Int J Inj Contr Saf Promot 2022; 29:360-371. [PMID: 35276052 DOI: 10.1080/17457300.2022.2040541] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 10/18/2022]
Abstract
Hit-and-run crashes are a significant concern for many countries. Because information on the offending vehicles is lacking, it is difficult to understand the dynamics of these crashes and to develop a prevention plan. The paper aims to identify the impacting vehicle in hit-and-run crashes. We studied fatal road crashes in New Delhi over eleven years (2006-2016) and found that approximately 40% of fatal crashes are hit-and-run crashes with unknown impacting vehicles. We proposed a framework using eleven different machine learning-based classification algorithms: Logistic Regression, KNN, SVM with linear and RBF kernels, Naïve Bayes, Random Forest, Decision Tree, AdaBoost, Multilayer Perceptron, CART and Linear Discriminant Analysis. We found that the SVM with a linear kernel gave the best results. The results reveal that cars, buses, and heavy vehicles are the vehicles involved in hit-and-run crashes. Buses were the primary cause, accounting for 39% of hit-and-run crashes during 2006-2009; thereafter, crashes involving cars increased drastically. Our framework is robust and scalable to any city. The outcomes provide inputs to traffic engineers for better policy prescription and road user safety.
Affiliation(s)
- Alok Nikhil Jha
  - TRIPP, Indian Institute of Technology Delhi, New Delhi, India
- Ajay Kumar
  - School of Basic & Applied Sciences, K R Mangalam University, Gurugram, India
- Geetam Tiwari
  - TRIPP, Indian Institute of Technology Delhi, New Delhi, India
- Niladri Chatterjee
  - Department of Mathematics, Indian Institute of Technology Delhi, New Delhi, India
14
|
Abstract
Health information has become especially valuable for protecting public health in the current coronavirus situation. Knowledge-based information systems can play a crucial role in helping individuals to practice risk assessment and remote diagnosis. We introduce a novel approach that develops causality-focused knowledge learning in a robust and transparent manner. The machine thereby gains the causality and probability knowledge needed for inference (thinking) and accurate prediction later. In addition, hidden knowledge can be discovered beyond the existing understanding of the diseases. The whole approach is built on a Causal Probability Description Logic Framework that combines Natural Language Processing (NLP), Causality Analysis and extended Knowledge Graph (KG) technologies. The experimental work processed 801 diseases in total (from the UK NHS website, linked with DBpedia datasets). As a result, the machine efficiently learnt comprehensive health causal knowledge and the relations among diseases, symptoms and other facts.
|
15
|
Pettit RW, Fullem R, Cheng C, Amos CI. Artificial intelligence, machine learning, and deep learning for clinical outcome prediction. Emerg Top Life Sci 2021; 5:ETLS20210246. [PMID: 34927670 PMCID: PMC8786279 DOI: 10.1042/etls20210246] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Received: 10/01/2021] [Revised: 12/03/2021] [Accepted: 12/07/2021] [Indexed: 12/12/2022]
Abstract
AI is a broad concept, grouping initiatives that use a computer to perform tasks that would usually require a human to complete. AI methods are well suited to predict clinical outcomes. In practice, AI methods can be thought of as functions that learn the outcomes accompanying standardized input data to produce accurate outcome predictions when trialed with new data. Current methods for cleaning, creating, accessing, extracting, augmenting, and representing data for training AI clinical prediction models are well defined. The use of AI to predict clinical outcomes is a dynamic and rapidly evolving arena, with new methods and applications emerging. Extraction or accession of electronic health care records and combining these with patient genetic data is an area of present attention, with tremendous potential for future growth. Machine learning approaches, including decision tree methods of Random Forest and XGBoost, and deep learning techniques including deep multi-layer and recurrent neural networks, afford unique capabilities to accurately create predictions from high dimensional, multimodal data. Furthermore, AI methods are increasing our ability to accurately predict clinical outcomes that previously were difficult to model, including time-dependent and multi-class outcomes. Barriers to robust AI-based clinical outcome model deployment include changing AI product development interfaces, the specificity of regulation requirements, and limitations in ensuring model interpretability, generalizability, and adaptability over time.
Affiliation(s)
- Rowland W. Pettit
  - Institute for Clinical and Translational Research, Baylor College of Medicine, Houston, TX, U.S.A
- Robert Fullem
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, U.S.A
- Chao Cheng
  - Institute for Clinical and Translational Research, Baylor College of Medicine, Houston, TX, U.S.A
  - Section of Epidemiology and Population Sciences, Department of Medicine, Baylor College of Medicine, Houston, TX, U.S.A
- Christopher I. Amos
  - Institute for Clinical and Translational Research, Baylor College of Medicine, Houston, TX, U.S.A
  - Section of Epidemiology and Population Sciences, Department of Medicine, Baylor College of Medicine, Houston, TX, U.S.A
  - Dan L Duncan Comprehensive Cancer Center, Baylor College of Medicine, Houston, TX, U.S.A
|
16
|
Decision Tree in Working Memory Task Effectively Characterizes EEG Signals in Healthy Aging Adults. Ing Rech Biomed 2021. [DOI: 10.1016/j.irbm.2021.12.001] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/18/2022]
|
17
|
Rowe TW, Katzourou IK, Stevenson-Hoare JO, Bracher-Smith MR, Ivanov DK, Escott-Price V. Machine learning for the life-time risk prediction of Alzheimer's disease: a systematic review. Brain Commun 2021; 3:fcab246. [PMID: 34805994 PMCID: PMC8598986 DOI: 10.1093/braincomms/fcab246] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/11/2021] [Revised: 06/30/2021] [Accepted: 07/19/2021] [Indexed: 12/23/2022] Open
Abstract
Alzheimer’s disease is a neurodegenerative disorder and the most common form of dementia. Early diagnosis may assist interventions to delay onset and reduce the progression rate of the disease. We systematically reviewed the use of machine learning algorithms for predicting Alzheimer’s disease using single nucleotide polymorphisms and instances where these were combined with other types of data. We evaluated the ability of machine learning models to distinguish between controls and cases, while also assessing their implementation and potential biases. Articles published between December 2009 and June 2020 were collected using Scopus, PubMed and Google Scholar. These were systematically screened for inclusion leading to a final set of 12 publications. Eighty-five per cent of the included studies used the Alzheimer's Disease Neuroimaging Initiative dataset. In studies which reported area under the curve, discrimination varied (0.49–0.97). However, more than half of the included manuscripts used other forms of measurement, such as accuracy, sensitivity and specificity. Model calibration statistics were also found to be reported inconsistently across all studies. The most frequent limitation in the assessed studies was sample size, with the total number of participants often numbering less than a thousand, whilst the number of predictors usually ran into the many thousands. In addition, key steps in model implementation and validation were often not performed or unreported, making it difficult to assess the capability of machine learning models.
Affiliation(s)
- Thomas W Rowe
  - UK Dementia Research Institute, Cardiff University, Cardiff, UK
- Matthew R Bracher-Smith
  - Division of Psychological Medicine and Clinical Neurosciences, School of Medicine and Clinical Neurosciences, School of Medicine, Cardiff University, Cardiff CF24 4HQ, UK
- Dobril K Ivanov
  - UK Dementia Research Institute, Cardiff University, Cardiff, UK
- Valentina Escott-Price
  - UK Dementia Research Institute, Cardiff University, Cardiff, UK
  - Division of Psychological Medicine and Clinical Neurosciences, School of Medicine and Clinical Neurosciences, School of Medicine, Cardiff University, Cardiff CF24 4HQ, UK
|
18
|
Comparison of machine learning methods for estimating case fatality ratios: An Ebola outbreak simulation study. PLoS One 2021; 16:e0257005. [PMID: 34525098 PMCID: PMC8443081 DOI: 10.1371/journal.pone.0257005] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/22/2021] [Accepted: 08/20/2021] [Indexed: 11/19/2022] Open
Abstract
BACKGROUND Machine learning (ML) algorithms are now increasingly used in infectious disease epidemiology. Epidemiologists should understand how ML algorithms behave within the context of outbreak data where missingness of data is almost ubiquitous. METHODS Using simulated data, we use a ML algorithmic framework to evaluate data imputation performance and the resulting case fatality ratio (CFR) estimates, focusing on the scale and type of data missingness (i.e., missing completely at random-MCAR, missing at random-MAR, or missing not at random-MNAR). RESULTS Across ML methods, dataset sizes and proportions of training data used, the area under the receiver operating characteristic curve decreased by 7% (median, range: 1%-16%) when missingness was increased from 10% to 40%. Overall reduction in CFR bias for MAR across methods, proportion of missingness, outbreak size and proportion of training data was 0.5% (median, range: 0%-11%). CONCLUSION ML methods could reduce bias and increase the precision in CFR estimates at low levels of missingness. However, no method is robust to high percentages of missingness. Thus, a datacentric approach is recommended in outbreak settings-patient survival outcome data should be prioritised for collection and random-sample follow-ups should be implemented to ascertain missing outcomes.
|
19
|
McCombe N, Liu S, Ding X, Prasad G, Bucholc M, Finn DP, Todd S, McClean PL, Wong-Lin K. Practical Strategies for Extreme Missing Data Imputation in Dementia Diagnosis. IEEE J Biomed Health Inform 2021; 26:818-827. [PMID: 34288882 DOI: 10.1109/jbhi.2021.3098511] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 11/08/2022]
Abstract
Accurate computational models for clinical decision support systems require clean and reliable data but, in clinical practice, data are often incomplete. Hence, missing data could arise not only from training datasets but also test datasets which could consist of a single undiagnosed case, an individual. This work addresses the problem of extreme missingness in both training and test data by evaluating multiple imputation and classification workflows based on both diagnostic classification accuracy and computational cost. Extreme missingness is defined as having ~50% of the total data missing in more than half the data features. In particular, we focus on dementia diagnosis due to long time delays, high variability, high attrition rates and lack of practical data imputation strategies in its diagnostic pathway. We identified and replicated the extreme missingness structure of data from a real-world memory clinic on a larger open dataset, with the original complete data acting as ground truth. Overall, we found that computational cost, but not accuracy, varies widely for various imputation and classification approaches. Particularly, we found that iterative imputation on the training dataset combined with a reduced-feature classification model provides the best approach, in terms of speed and accuracy. Taken together, this work has elucidated important factors to be considered when developing a predictive model for a dementia diagnostic support system.
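The definition of extreme missingness quoted in this abstract can be made concrete with a small check. The sketch below is illustrative only and is not taken from the paper: the function name `is_extremely_missing` is hypothetical, missing values are assumed to be encoded as `None`, and the definition is read as "more than half of the features each have at least about 50% of their values missing".

```python
from typing import Optional, Sequence


def is_extremely_missing(
    rows: Sequence[Sequence[Optional[float]]],
    missing_rate: float = 0.5,  # assumed reading of "~50%" in the abstract
) -> bool:
    """Return True if more than half of the features (columns) have at
    least `missing_rate` of their values missing (encoded as None)."""
    if not rows:
        return False
    n_rows = len(rows)
    n_features = len(rows[0])
    heavily_missing = 0
    for j in range(n_features):
        # Fraction of records in which feature j is missing.
        n_missing = sum(1 for row in rows if row[j] is None)
        if n_missing / n_rows >= missing_rate:
            heavily_missing += 1
    return heavily_missing > n_features / 2


# Made-up toy data: 4 records, 3 features; features 1 and 2 are 75% missing.
toy = [
    [1.0, None, None],
    [2.0, None, 3.0],
    [None, 4.0, None],
    [5.0, None, None],
]
print(is_extremely_missing(toy))
```

Other readings of the definition are possible (for example, a strict "more than 50%" threshold), so the `missing_rate` parameter and the comparison operators should be adjusted to whatever definition a given study actually uses.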
|
20
|
Chang CH, Lin CH, Liu CY, Huang CS, Chen SJ, Lin WC, Yang HT, Lane HY. Plasma d-glutamate levels for detecting mild cognitive impairment and Alzheimer's disease: Machine learning approaches. J Psychopharmacol 2021; 35:265-272. [PMID: 33586518 DOI: 10.1177/0269881120972331] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Indexed: 02/05/2023]
Abstract
BACKGROUND d-glutamate, which is involved in N-methyl-d-aspartate receptor modulation, may be associated with cognitive ageing. AIMS This study aimed to use peripheral plasma d-glutamate levels to differentiate patients with mild cognitive impairment (MCI) and Alzheimer's disease (AD) from healthy individuals and to evaluate its prediction ability using machine learning. METHODS Overall, 31 healthy controls, 21 patients with MCI and 133 patients with AD were recruited. Serum d-glutamate levels were measured using high-performance liquid chromatography (HPLC). Cognitive deficit severity was assessed using the Clinical Dementia Rating scale and the Mini-Mental Status Examination (MMSE). We employed four machine learning algorithms (support vector machine, logistic regression, random forest and naïve Bayes) to build an optimal predictive model to distinguish patients with MCI or AD from healthy controls. RESULTS The MCI and AD groups had lower plasma d-glutamate levels (1097.79 ± 283.99 and 785.10 ± 720.06 ng/mL, respectively) compared to healthy controls (1620.08 ± 548.80 ng/mL). The naïve Bayes model and random forest model appeared to be the best models for determining MCI and AD susceptibility, respectively (area under the receiver operating characteristic curve: 0.8207 and 0.7900; sensitivity: 0.8438 and 0.6997; and specificity: 0.8158 and 0.9188, respectively). The total MMSE score was positively correlated with d-glutamate levels (r = 0.368, p < 0.001). Multivariate regression analysis indicated that d-glutamate levels were significantly associated with the total MMSE score (B = 0.003, 95% confidence interval 0.002-0.005, p < 0.001). CONCLUSIONS Peripheral plasma d-glutamate levels were associated with cognitive impairment and may therefore be a suitable peripheral biomarker for detecting MCI and AD. Rapid and cost-effective HPLC for biomarkers and machine learning algorithms may assist physicians in diagnosing MCI and AD in outpatient clinics.
Affiliation(s)
- Chun-Hung Chang
  - Institute of Clinical Medical Science, China Medical University, Taichung, Taiwan
  - Department of Psychiatry and Brain Disease Research Centre, China Medical University Hospital, Taichung, Taiwan
  - An Nan Hospital, China Medical University, Tainan, Taiwan
- Chieh-Hsin Lin
  - Institute of Clinical Medical Science, China Medical University, Taichung, Taiwan
  - Graduate Institute of Biomedical Sciences, China Medical University, Taichung, Taiwan
  - Kaohsiung Chang Gung Memorial Hospital, Chang Gung University College of Medicine, Kaohsiung, Taiwan
- Chieh-Yu Liu
  - Biostatistical Consulting Lab, Department of Speech Language Pathology and Audiology, National Taipei University of Nursing and Health Sciences, Taipei, Taiwan
- Chih-Sheng Huang
  - Artificial Intelligence Research and Development Department, ELAN Microelectronics Corporation, Hsinchu, Taiwan
- Shaw-Ji Chen
  - Department of Psychiatry, Mackay Memorial Hospital Taitung Branch, Taitung, Taiwan
  - Department of Medicine, Mackay Medical College, New Taipei, Taiwan
- Wen-Cheng Lin
  - Department of Medical Informatics, Tzu Chi University, Hualien, Taiwan
- Hui-Ting Yang
  - School of Food Safety, Taipei Medical University, Taipei, Taiwan
- Hsien-Yuan Lane
  - Institute of Clinical Medical Science, China Medical University, Taichung, Taiwan
  - Department of Psychiatry and Brain Disease Research Centre, China Medical University Hospital, Taichung, Taiwan
  - Graduate Institute of Biomedical Sciences, China Medical University, Taichung, Taiwan
  - Department of Psychology, College of Medical and Health Sciences, Asia University, Taichung, Taiwan
|
21
|
Bayesian Network as a Decision Tool for Predicting ALS Disease. Brain Sci 2021; 11:brainsci11020150. [PMID: 33498784 PMCID: PMC7912628 DOI: 10.3390/brainsci11020150] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 12/21/2020] [Revised: 01/09/2021] [Accepted: 01/20/2021] [Indexed: 12/14/2022] Open
Abstract
Clinical diagnosis of amyotrophic lateral sclerosis (ALS) is difficult in the early period. Blood tests, however, are less time-consuming and lower in cost than other diagnostic methods. ALS researchers have used machine learning methods to predict the genetic architecture of the disease. In this study, we take advantage of Bayesian networks and machine learning methods to predict ALS patients from blood plasma protein levels and independent personal features. According to the comparison results, Bayesian networks produced the best results, with accuracy (0.887), area under the curve (AUC) (0.970) and other comparison metrics. We confirmed that sex and age are effective variables in ALS. In addition, we found that the probability of onset involvement in ALS patients is very high. Also, a person’s other chronic or neurological diseases are associated with ALS. Finally, we confirmed that the Parkin level may also have an effect on ALS. While this protein is at very low levels in Parkinson’s patients, it is higher in ALS patients than in all control groups.
|
22
|
Wang MWH, Goodman JM, Allen TEH. Machine Learning in Predictive Toxicology: Recent Applications and Future Directions for Classification Models. Chem Res Toxicol 2020; 34:217-239. [PMID: 33356168 DOI: 10.1021/acs.chemrestox.0c00316] [Citation(s) in RCA: 30] [Impact Index Per Article: 7.5] [Indexed: 02/07/2023]
Abstract
In recent times, machine learning has become increasingly prominent in predictive toxicology as it has shifted from in vivo studies toward in silico studies. Currently, in vitro methods together with other computational methods such as quantitative structure-activity relationship modeling and absorption, distribution, metabolism, and excretion calculations are being used. An overview of machine learning and its applications in predictive toxicology is presented here, including support vector machines (SVMs), random forest (RF) and decision trees (DTs), neural networks, regression models, naïve Bayes, k-nearest neighbors, and ensemble learning. The recent successes of these machine learning methods in predictive toxicology are summarized, and a comparison of some models used in predictive toxicology is presented. In predictive toxicology, SVMs, RF, and DTs are the dominant machine learning methods due to the characteristics of the data available. Lastly, this review describes the current challenges facing the use of machine learning in predictive toxicology and offers insights into the possible areas of improvement in the field.
Affiliation(s)
- Marcus W H Wang
  - Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom
- Jonathan M Goodman
  - Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom
- Timothy E H Allen
  - Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom
  - MRC Toxicology Unit, University of Cambridge, Hodgkin Building, Lancaster Road, Leicester LE1 7HB, United Kingdom
|
23
|
Mishra R, Li B. The Application of Artificial Intelligence in the Genetic Study of Alzheimer's Disease. Aging Dis 2020; 11:1567-1584. [PMID: 33269107 PMCID: PMC7673858 DOI: 10.14336/ad.2020.0312] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 01/02/2020] [Accepted: 03/12/2020] [Indexed: 12/13/2022] Open
Abstract
Alzheimer's disease (AD) is a neurodegenerative disease in which genetic factors contribute approximately 70% of etiological effects. Studies have found many significant genetic and environmental factors, but the pathogenesis of AD is still unclear. With the application of microarray and next-generation sequencing technologies, research using genetic data has shown explosive growth. In addition to conventional statistical methods for the processing of these data, artificial intelligence (AI) technology shows obvious advantages in analyzing such complex projects. This article first briefly reviews the application of AI technology in medicine and the current status of genetic research in AD. Then, a comprehensive review focuses on the application of AI in the genetic research of AD, including the diagnosis and prognosis of AD based on genetic data, the analysis of genetic variation, gene expression profiles and gene-gene interactions in AD, and genetic analysis of AD based on a knowledge base. Although many studies have yielded some meaningful results, they are still at a preliminary stage. The main shortcomings include the limitations of the databases, the failure to take advantage of AI to conduct a systematic biology analysis of multilevel databases, and the lack of a theoretical framework for the analysis results. Finally, we outline the direction of future development. It is crucial to develop high-quality, comprehensive, large-sample, data-sharing resources; a multi-level systems biology AI analysis strategy is one of the development directions, and computational creativity may play a role in theory model building, verification, and designing new intervention protocols for AD.
Affiliation(s)
- Rohan Mishra
  - Washington Institute for Health Sciences, Arlington, VA 22203, USA
- Bin Li
  - Washington Institute for Health Sciences, Arlington, VA 22203, USA
  - Georgetown University Medical Center, Washington D.C. 20057, USA
|
24
|
Ahmed Z, Mohamed K, Zeeshan S, Dong X. Artificial intelligence with multi-functional machine learning platform development for better healthcare and precision medicine. Database (Oxford) 2020; 2020:baaa010. [PMID: 32185396 PMCID: PMC7078068 DOI: 10.1093/database/baaa010] [Citation(s) in RCA: 151] [Impact Index Per Article: 37.8] [Received: 11/02/2019] [Revised: 01/05/2020] [Accepted: 01/21/2020] [Indexed: 02/06/2023]
Abstract
Precision medicine is one of the recent and powerful developments in medical care, which has the potential to improve the traditional symptom-driven practice of medicine, allowing earlier interventions using advanced diagnostics and tailoring better and economically personalized treatments. Identifying the best pathway to personalized and population medicine involves the ability to analyze comprehensive patient information together with broader aspects to monitor and distinguish between sick and relatively healthy people, which will lead to a better understanding of biological indicators that can signal shifts in health. While the complexities of disease at the individual level have made it difficult to utilize healthcare information in clinical decision-making, some of the existing constraints have been greatly minimized by technological advancements. To implement effective precision medicine with enhanced ability to positively impact patient outcomes and provide real-time decision support, it is important to harness the power of electronic health records by integrating disparate data sources and discovering patient-specific patterns of disease progression. Useful analytic tools, technologies, databases, and approaches are required to augment networking and interoperability of clinical, laboratory and public health systems, as well as addressing ethical and social issues related to the privacy and protection of healthcare data with effective balance. Developing multifunctional machine learning platforms for clinical data extraction, aggregation, management and analysis can support clinicians by efficiently stratifying subjects to understand specific scenarios and optimize decision-making. Implementation of artificial intelligence in healthcare is a compelling vision that has the potential in leading to the significant improvements for achieving the goals of providing real-time, better personalized and population medicine at lower costs. 
In this study, we focused on analyzing and discussing various published artificial intelligence and machine learning solutions, approaches and perspectives, aiming to advance academic solutions in paving the way for a new data-centric era of discovery in healthcare.
Affiliation(s)
- Zeeshan Ahmed
  - Institute for Health, Health Care Policy and Aging Research, Rutgers, The State University of New Jersey, 112 Paterson Street, New Brunswick, NJ, USA
  - Department of Medicine, Rutgers Robert Wood Johnson Medical School, Rutgers Biomedical and Health Sciences, 125 Paterson Street, New Brunswick, NJ, USA
  - Department of Genetics and Genome Sciences, School of Medicine, University of Connecticut Health Center, 263 Farmington Ave., Farmington, CT, USA
  - Institute for Systems Genomics, University of Connecticut, 67 North Eagleville Road, Storrs, CT, USA
- Khalid Mohamed
  - Department of Genetics and Genome Sciences, School of Medicine, University of Connecticut Health Center, 263 Farmington Ave., Farmington, CT, USA
- Saman Zeeshan
  - The Jackson Laboratory for Genomic Medicine, 10 Discovery Drive, Farmington, CT, USA
- XinQi Dong
  - Institute for Health, Health Care Policy and Aging Research, Rutgers, The State University of New Jersey, 112 Paterson Street, New Brunswick, NJ, USA
  - Department of Medicine, Rutgers Robert Wood Johnson Medical School, Rutgers Biomedical and Health Sciences, 125 Paterson Street, New Brunswick, NJ, USA
|
25
|
Bi Q, Goodman KE, Kaminsky J, Lessler J. What is Machine Learning? A Primer for the Epidemiologist. Am J Epidemiol 2019; 188:2222-2239. [PMID: 31509183 DOI: 10.1093/aje/kwz189] [Citation(s) in RCA: 94] [Impact Index Per Article: 18.8] [Received: 11/16/2018] [Revised: 07/29/2019] [Accepted: 08/14/2019] [Indexed: 12/22/2022] Open
Abstract
Machine learning is a branch of computer science that has the potential to transform epidemiologic sciences. Amid a growing focus on "Big Data," it offers epidemiologists new tools to tackle problems for which classical methods are not well-suited. In order to critically evaluate the value of integrating machine learning algorithms and existing methods, however, it is essential to address language and technical barriers between the two fields that can make it difficult for epidemiologists to read and assess machine learning studies. Here, we provide an overview of the concepts and terminology used in machine learning literature, which encompasses a diverse set of tools with goals ranging from prediction to classification to clustering. We provide a brief introduction to 5 common machine learning algorithms and 4 ensemble-based approaches. We then summarize epidemiologic applications of machine learning techniques in the published literature. We recommend approaches to incorporate machine learning in epidemiologic research and discuss opportunities and challenges for integrating machine learning and existing epidemiologic research methods.
Affiliation(s)
- Qifang Bi
  - Department of Epidemiology, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, Maryland
- Katherine E Goodman
  - Department of Epidemiology, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, Maryland
- Joshua Kaminsky
  - Department of Epidemiology, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, Maryland
- Justin Lessler
  - Department of Epidemiology, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, Maryland
|
26
|
Bottigliengo D, Berchialla P, Lanera C, Azzolina D, Lorenzoni G, Martinato M, Giachino D, Baldi I, Gregori D. The Role of Genetic Factors in Characterizing Extra-Intestinal Manifestations in Crohn's Disease Patients: Are Bayesian Machine Learning Methods Improving Outcome Predictions? J Clin Med 2019; 8:jcm8060865. [PMID: 31212952 PMCID: PMC6617350 DOI: 10.3390/jcm8060865] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 05/03/2019] [Revised: 06/12/2019] [Accepted: 06/13/2019] [Indexed: 01/01/2023] Open
Abstract
(1) Background: The high heterogeneity of inflammatory bowel disease (IBD) makes the study of this condition challenging. In subjects affected by Crohn’s disease (CD), extra-intestinal manifestations (EIMs) have a remarkable potential impact on health status. Increasing numbers of patient characteristics and the small size of analyzed samples make EIMs prediction very difficult. Under such constraints, Bayesian machine learning techniques (BMLTs) have been proposed as a robust alternative to classical models for outcome prediction. This study aims to determine whether BMLT could improve EIM prediction and statistical support for the decision-making process of clinicians. (2) Methods: Three of the most popular BMLTs were employed in this study: Naïve Bayes (NB), Bayesian Network (BN) and Bayesian Additive Regression Trees (BART). They were applied to a retrospective observational Italian study of IBD genetics. (3) Results: The performance of the model is strongly affected by the features of the dataset, and BMLTs poorly classify EIM appearance. (4) Conclusions: This study shows that BMLTs perform worse than expected in classifying the presence of EIMs compared to classical statistical tools in a context where mixed genetic and clinical data are available but relevant data are also missing, as often occurs in clinical practice.
Affiliation(s)
- Daniele Bottigliengo
  - Unit of Biostatistics, Epidemiology and Public Health, Department of Cardiac, Thoracic, and Vascular Sciences and Public Health, University of Padova, 35131 Padova, Italy
- Paola Berchialla
  - Department of Clinical and Biological Sciences, University of Torino, 10126 Torino, Italy
- Corrado Lanera
  - Unit of Biostatistics, Epidemiology and Public Health, Department of Cardiac, Thoracic, and Vascular Sciences and Public Health, University of Padova, 35131 Padova, Italy
- Danila Azzolina
  - Unit of Biostatistics, Epidemiology and Public Health, Department of Cardiac, Thoracic, and Vascular Sciences and Public Health, University of Padova, 35131 Padova, Italy
- Giulia Lorenzoni
  - Unit of Biostatistics, Epidemiology and Public Health, Department of Cardiac, Thoracic, and Vascular Sciences and Public Health, University of Padova, 35131 Padova, Italy
- Matteo Martinato
  - Unit of Biostatistics, Epidemiology and Public Health, Department of Cardiac, Thoracic, and Vascular Sciences and Public Health, University of Padova, 35131 Padova, Italy
- Daniela Giachino
  - Department of Clinical and Biological Sciences, University of Torino, 10126 Torino, Italy
- Ileana Baldi
  - Unit of Biostatistics, Epidemiology and Public Health, Department of Cardiac, Thoracic, and Vascular Sciences and Public Health, University of Padova, 35131 Padova, Italy
- Dario Gregori
  - Unit of Biostatistics, Epidemiology and Public Health, Department of Cardiac, Thoracic, and Vascular Sciences and Public Health, University of Padova, 35131 Padova, Italy
|
27
|
Zhao C, Jiang J, Guan Y, Guo X, He B. EMR-based medical knowledge representation and inference via Markov random fields and distributed representation learning. Artif Intell Med 2018; 87:49-59. [PMID: 29691122 DOI: 10.1016/j.artmed.2018.03.005] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Received: 05/22/2017] [Revised: 02/28/2018] [Accepted: 03/29/2018] [Indexed: 01/09/2023]
Abstract
OBJECTIVE Electronic medical records (EMRs) contain medical knowledge that can be used for clinical decision support (CDS). Our objective is to develop a general system that can extract and represent knowledge contained in EMRs to support three CDS tasks (test recommendation, initial diagnosis, and treatment plan recommendation) given the condition of a patient. METHODS We extracted four kinds of medical entities from records and constructed an EMR-based medical knowledge network (EMKN), in which nodes are entities and edges reflect their co-occurrence in a record. Three bipartite subgraphs (bigraphs) were extracted from the EMKN, one to support each task. One part of the bigraph was the given condition (e.g., symptoms), and the other was the condition to be inferred (e.g., diseases). Each bigraph was regarded as a Markov random field (MRF) to support the inference. We proposed three graph-based energy functions and three likelihood-based energy functions. Two of these functions are based on knowledge representation learning and can provide distributed representations of medical entities. Two EMR datasets and three metrics were utilized to evaluate the performance. RESULTS As a whole, the evaluation results indicate that the proposed system outperformed the baseline methods. The distributed representation of medical entities does reflect similarity relationships with respect to knowledge level. CONCLUSION Combining EMKN and MRF is an effective approach for general medical knowledge representation and inference. Different tasks, however, require individually designed energy functions.
Affiliation(s)
- Chao Zhao
- School of Computer Science and Technology, Harbin, Heilongjiang 150001, China.
| | - Jingchi Jiang
- School of Computer Science and Technology, Harbin, Heilongjiang 150001, China.
| | - Yi Guan
- School of Computer Science and Technology, Harbin, Heilongjiang 150001, China.
| | - Xitong Guo
- School of Management, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China.
| | - Bin He
- School of Computer Science and Technology, Harbin, Heilongjiang 150001, China.
|
28
|
|
29
|
|
30
|
Langarizadeh M, Moghbeli F. Applying Naive Bayesian Networks to Disease Prediction: a Systematic Review. Acta Inform Med 2016; 24:364-369. [PMID: 28077895 PMCID: PMC5203736 DOI: 10.5455/aim.2016.24.364-369] [Citation(s) in RCA: 60] [Impact Index Per Article: 7.5] [Received: 08/20/2016] [Accepted: 10/11/2016] [Indexed: 12/15/2022] Open
Abstract
INTRODUCTION Naive Bayesian networks (NBNs) are one of the most effective and simplest Bayesian networks for prediction. OBJECTIVE This paper aims to review published evidence about the application of NBNs in predicting disease and it tries to show NBNs as the fundamental algorithm for the best performance in comparison with other algorithms. METHODS PubMed was electronically checked for articles published between 2005 and 2015. For characterizing eligible articles, a comprehensive electronic searching method was conducted. Inclusion criteria were determined based on NBN and its effects on disease prediction. A total of 99 articles were found. After excluding the duplicates (n = 5), the titles and abstracts of 94 articles were skimmed according to the inclusion criteria. Finally, 38 articles remained. They were reviewed in full text and 15 articles were excluded. Eventually, 23 articles were selected which met our eligibility criteria and were included in this study. RESULT In this article, the use of NBN in predicting diseases was described. Finally, the results were reported in terms of Accuracy, Sensitivity, Specificity and Area under ROC curve (AUC). The last column in Table 2 shows the differences between NBNs and other algorithms. DISCUSSION This systematic review (23 studies, 53,725 patients) indicates that predicting diseases based on an NBN had the best performance in most diseases in comparison with the other algorithms. Finally, in most cases, NBN works better than other algorithms based on the reported accuracy. CONCLUSION The method, termed NBNs, is proposed and can efficiently construct a prediction model for disease.
Affiliation(s)
- Mostafa Langarizadeh
- Department of Health Information Management, School of Health Management and Information Sciences, Iran University of Medical Sciences, Tehran, Iran
| | - Fateme Moghbeli
- Department of Health Information Management, School of Health Management and Information Sciences, Iran University of Medical Sciences, Tehran, Iran
|
31
|
Whiteside D, Martini DN, Lepley AS, Zernicke RF, Goulet GC. Predictors of Ulnar Collateral Ligament Reconstruction in Major League Baseball Pitchers. Am J Sports Med 2016; 44:2202-9. [PMID: 27159303 DOI: 10.1177/0363546516643812] [Citation(s) in RCA: 64] [Impact Index Per Article: 8.0] [Indexed: 01/31/2023]
Abstract
BACKGROUND Ulnar collateral ligament (UCL) reconstruction surgeries in Major League Baseball (MLB) have increased significantly in recent decades. Although several risk factors have been proposed, a scientific consensus is yet to be reached, providing challenges to those tasked with preventing UCL injuries. PURPOSE To identify significant predictors of UCL reconstruction in MLB pitchers. STUDY DESIGN Case control study; Level of evidence, 3. METHODS Demographic and pitching performance data were sourced from public databases for 104 MLB pitchers who underwent UCL reconstruction surgery and 104 age- and position-matched controls. These variables were compared between groups and inserted into a binary logistic regression to identify significant predictors of UCL reconstruction. Two machine learning models (naïve Bayes and support vector machine) were also employed to predict UCL reconstruction in this cohort. RESULTS The binary logistic regression model was statistically significant (χ²(12) = 33.592; P = .001), explained 19.9% of the variance in UCL reconstruction surgery, and correctly classified 66.8% of cases. According to this model, (1) fewer days between consecutive games, (2) a smaller repertoire of pitches, (3) a less pronounced horizontal release location, (4) a smaller stature, (5) greater mean pitch speed, and (6) greater mean pitch counts per game were all significant predictors of UCL reconstruction. More specifically, an increase in mean days between consecutive games (odds ratio [OR], 0.685; 95% CI, 0.542-0.865) or number of unique pitch types thrown (OR, 0.672; 95% CI, 0.492-0.917) was associated with a significantly smaller likelihood of UCL reconstruction. In contrast, an increase in mean pitch speed (OR, 1.381; 95% CI, 1.103-1.729) or mean pitches per game (OR, 1.020; 95% CI, 1.007-1.033) was associated with significantly higher odds of UCL reconstruction surgery.
The naïve Bayes classifier predicted UCL reconstruction with an accuracy of 72% and the support vector machine classifier with an accuracy of 75%. CONCLUSION This study identified 6 key performance factors that may present significant risk factors for UCL reconstruction in MLB pitchers. These findings could help to enhance the prevention of UCL reconstruction surgery in MLB pitchers and shape the direction of future research in this domain.
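The odds ratios reported in this abstract can be read more concretely with a little arithmetic. The following minimal Python sketch is not part of the study. It converts a reported odds ratio into the corresponding logistic regression coefficient and rescales it to a change of several units. The function names are hypothetical, and the abstract does not state the measurement units of each predictor, so `delta` is simply a number of units.

```python
import math


def log_odds_coefficient(odds_ratio):
    """Logistic regression coefficient implied by a per-unit odds ratio."""
    return math.log(odds_ratio)


def odds_ratio_for_change(odds_ratio_per_unit, delta):
    """Odds ratio for a change of `delta` units in the predictor.

    A logistic model is linear in the log-odds, so the per-unit
    coefficient is multiplied by `delta` before exponentiating.
    """
    return math.exp(math.log(odds_ratio_per_unit) * delta)


# Per-unit odds ratios quoted in the abstract (units are not stated there).
reported = {
    "mean days between consecutive games": 0.685,
    "mean pitch speed": 1.381,
}

for name, per_unit in reported.items():
    beta = log_odds_coefficient(per_unit)
    two_units = odds_ratio_for_change(per_unit, 2)
    print(f"{name}: beta={beta:.3f}, OR for +2 units={two_units:.3f}")
```

An odds ratio below 1 gives a negative coefficient (a protective association), and a two-unit increase multiplies the odds by the square of the per-unit odds ratio.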
Affiliation(s)
- David Whiteside
- School of Kinesiology, University of Michigan, Ann Arbor, Michigan, USA; Game Insight Group, Tennis Australia, Melbourne, Australia; Institute of Sport, Exercise and Active Living, Victoria University, Melbourne, Australia
| | - Douglas N Martini
- Department of Neurology, School of Medicine, Oregon Health and Science University, Portland, Oregon, USA
| | - Adam S Lepley
- Department of Kinesiology, University of Connecticut, Storrs, Connecticut, USA
| | - Ronald F Zernicke
- School of Kinesiology, University of Michigan, Ann Arbor, Michigan, USA; Department of Orthopaedic Surgery, University of Michigan Medical School, Ann Arbor, Michigan, USA; Department of Biomedical Engineering, University of Michigan, Ann Arbor, Michigan, USA
| | - Grant C Goulet
- School of Kinesiology, University of Michigan, Ann Arbor, Michigan, USA
|
32
|
Ronquillo JG, Baer MR, Lester WT. Sex-specific patterns and differences in dementia and Alzheimer's disease using informatics approaches. J Women Aging 2016; 28:403-11. [PMID: 27105335 PMCID: PMC5110121 DOI: 10.1080/08952841.2015.1018038] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Indexed: 10/21/2022]
Abstract
The National Institutes of Health Office of Research on Women's Health recently highlighted the critical need for explicitly addressing sex differences in biomedical research, including Alzheimer's disease and dementia. The purpose of our study was to perform a sex-stratified analysis of cognitive impairment using diverse medical, clinical, and genetic factors of unprecedented scale and scope by applying informatics approaches to three large Alzheimer's databases. Analyses suggested females were 1.5 times more likely than males to have a documented diagnosis of probable Alzheimer's disease, and several other factors fell along sex-specific lines and were possibly associated with severity of cognitive impairment.
Affiliation(s)
- William T. Lester
- Laboratory of Computer Science, Massachusetts General Hospital, Boston, Massachusetts
- Harvard Medical School, Boston, Massachusetts
|
33
|
Cai B, Jiang X. Computational methods for ubiquitination site prediction using physicochemical properties of protein sequences. BMC Bioinformatics 2016; 17:116. [PMID: 26940649 PMCID: PMC4778322 DOI: 10.1186/s12859-016-0959-z] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Received: 05/20/2015] [Accepted: 02/19/2016] [Indexed: 11/10/2022] Open
Abstract
Background Ubiquitination is a very important process in protein post-translational modification, which has been widely investigated by biology scientists and researchers. Different experimental and computational methods have been developed to identify the ubiquitination sites in protein sequences. This paper aims at exploring computational machine learning methods for the prediction of ubiquitination sites using the physicochemical properties (PCPs) of amino acids in the protein sequences. Results We first establish six different ubiquitination data sets, whose records contain both ubiquitination sites and non-ubiquitination sites in variant numbers of protein sequence segments. In particular, to establish such data sets, protein sequence segments are extracted from the original protein sequences used in four published papers on ubiquitination, while 531 PCP features of each extracted protein sequence segment are calculated based on PCP values from AAindex (Amino Acid index database) by averaging PCP values of all amino acids on each segment. Various computational machine-learning methods, including four Bayesian network methods (i.e., Naïve Bayes (NB), Feature Selection NB (FSNB), Model Averaged NB (MANB), and Efficient Bayesian Multivariate Classifier (EBMC)) and three regression methods (i.e., Support Vector Machine (SVM), Logistic Regression (LR), and Least Absolute Shrinkage and Selection Operator (LASSO)), are then applied to the six established segment-PCP data sets. Five-fold cross-validation and the Area Under Receiver Operating Characteristic Curve (AUROC) are employed to evaluate the ubiquitination prediction performance of each method. Results demonstrate that the PCP data of protein sequences contain information that could be mined by machine learning methods for ubiquitination site prediction. 
The comparative results show that EBMC, SVM and LR perform better than other methods, and EBMC is the only method that can get AUCs greater than or equal to 0.6 for the six established data sets. Results also show EBMC tends to perform better for larger data. Conclusions Machine learning methods have been employed for the ubiquitination site prediction based on physicochemical properties of amino acids on protein sequences. Results demonstrate the effectiveness of using machine learning methodology to mine information from PCP data concerning protein sequences, as well as the superiority of EBMC, SVM and LR (especially EBMC) for the ubiquitination prediction compared to other methods. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-0959-z) contains supplementary material, which is available to authorized users.
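The feature construction described in this abstract (one feature per physicochemical property, obtained by averaging that property over the residues of a segment) can be sketched as follows. This is a minimal illustration, not the authors' code. The two property tables hold made-up numbers rather than real AAindex values, the function names are hypothetical, and skipping residues that are missing from a table is an assumption of this sketch.

```python
def segment_feature(segment, property_values):
    """Average one physicochemical property over the residues of a segment.

    Residues absent from `property_values` are skipped (an assumption here).
    """
    values = [property_values[aa] for aa in segment if aa in property_values]
    if not values:
        raise ValueError("no residue in the segment has a known property value")
    return sum(values) / len(values)


def segment_features(segment, property_tables):
    """Build the feature vector: one averaged value per property table."""
    return [segment_feature(segment, table) for table in property_tables]


# Hypothetical toy tables; real work would load the AAindex entries instead.
TOY_PROPERTY_A = {"A": 1.0, "K": -2.0, "G": 0.0}
TOY_PROPERTY_B = {"A": 2.0, "K": 5.0, "G": 1.0}

features = segment_features("AKG", [TOY_PROPERTY_A, TOY_PROPERTY_B])
print(features)
```

With 531 property tables instead of two, the same function would yield the 531-dimensional vector per segment mentioned in the abstract.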
Affiliation(s)
- Binghuang Cai
- Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, Pittsburgh, PA, 15206-3701, USA.
| | - Xia Jiang
- Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, Pittsburgh, PA, 15206-3701, USA.
|
34
|
Chen Y, Wang L, Li L, Zhang H, Yuan Z. Informative gene selection and the direct classification of tumors based on relative simplicity. BMC Bioinformatics 2016; 17:44. [PMID: 26792270 PMCID: PMC4721022 DOI: 10.1186/s12859-016-0893-0] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Received: 09/10/2015] [Accepted: 01/19/2016] [Indexed: 12/15/2022] Open
Abstract
BACKGROUND Selecting a parsimonious set of informative genes to build a classifier with high generalization performance is the most important task in the analysis of tumor microarray expression data. Many existing gene pair evaluation methods cannot highlight diverse patterns of gene pairs because they use only one of two strategies, vertical comparison or horizontal comparison, while individual-gene-ranking methods ignore redundancy and synergy among genes. RESULTS Here we propose a novel score measure named relative simplicity (RS). We evaluated gene pairs by integrating vertical comparison with horizontal comparison, and then built an RS-based direct classifier (RS-based DC) from a set of informative genes capable of binary discrimination, using a paired votes strategy. Nine multi-class gene expression datasets involving human cancers were used to validate the performance of the new method. Compared with the nine reference models, RS-based DC achieved the highest average independent test accuracy (91.40%), the best generalization performance and the smallest average number of informative genes (20.56). Compared with the four reference feature selection methods, RS also achieved the highest average test accuracy with three classifiers (Naïve Bayes, k-Nearest Neighbor and Support Vector Machine), and only RS improved the performance of SVM. CONCLUSIONS Diverse patterns of gene pairs can be highlighted more fully when vertical comparison is integrated with horizontal comparison. The DC core classifier can effectively control over-fitting. The RS-based feature selection method combined with the DC classifier can lead to more robust selection of informative genes and classification accuracy.
Affiliation(s)
- Yuan Chen
- Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Changsha, China; Hunan Provincial Key Laboratory for Germplasm Innovation and Utilization of Crop, Hunan Agricultural University, Changsha, China.
| | - Lifeng Wang
- Biotechnology Research Center, Hunan Academy of Agricultural Sciences, Changsha, China.
| | - Lanzhi Li
- Hunan Provincial Key Laboratory for Germplasm Innovation and Utilization of Crop, Hunan Agricultural University, Changsha, China.
| | - Hongyan Zhang
- Hunan Provincial Key Laboratory for Germplasm Innovation and Utilization of Crop, Hunan Agricultural University, Changsha, China.
| | - Zheming Yuan
- Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Changsha, China; Hunan Provincial Key Laboratory for Germplasm Innovation and Utilization of Crop, Hunan Agricultural University, Changsha, China.
|
35
|
Acikel C, Aydin Son Y, Celik C, Gul H. Evaluation of potential novel variations and their interactions related to bipolar disorders: analysis of genome-wide association study data. Neuropsychiatr Dis Treat 2016; 12:2997-3004. [PMID: 27920536 PMCID: PMC5127431 DOI: 10.2147/ndt.s112558] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Indexed: 11/29/2022] Open
Abstract
BACKGROUND Multifactor dimensionality reduction (MDR) is a nonparametric approach that can be used to detect relevant interactions between single-nucleotide polymorphisms (SNPs). The aim of this study was to build the best genomic model based on SNP associations and to identify candidate polymorphisms that are the underlying molecular basis of the bipolar disorders. METHODS This study was performed on Whole-Genome Association Study of Bipolar Disorder (dbGaP [database of Genotypes and Phenotypes] study accession number: phs000017.v3.p1) data. After preprocessing of the genotyping data, three classification-based data mining methods (ie, random forest, naïve Bayes, and k-nearest neighbor) were performed. Additionally, as a nonparametric, model-free approach, the MDR method was used to evaluate the SNP profiles. The validity of these methods was evaluated using true classification rate, recall (sensitivity), precision (positive predictive value), and F-measure. RESULTS Random forests, naïve Bayes, and k-nearest neighbors identified 16, 13, and ten candidate SNPs, respectively. Surprisingly, the top six SNPs were reported by all three methods. Random forests and k-nearest neighbors were more successful than naïve Bayes, with recall values >0.95. On the other hand, MDR generated a model with comparable predictive performance based on five SNPs. Although different SNP profiles were identified in MDR compared to the classification-based models, all models mapped SNPs to the DOCK10 gene. CONCLUSION Three classification-based data mining approaches, random forests, naïve Bayes, and k-nearest neighbors, have prioritized similar SNP profiles as predictors of bipolar disorders, in contrast to MDR, which has found different SNPs through analysis of two-way and three-way interactions. 
The reduced number of associated SNPs discovered by MDR, without loss in the classification performance, would facilitate validation studies and decision support models, and would reduce the cost to develop predictive and diagnostic tests. Nevertheless, we need to emphasize that translation of genomic models to the clinical setting requires models with higher classification performance.
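The four validity measures named in this abstract (true classification rate, recall, precision, and F-measure) all follow from a binary confusion matrix. The sketch below is an illustration only: the counts are invented, the function name is hypothetical, and the F-measure is assumed to be the balanced F1 score, which the abstract does not specify.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, recall, precision and F1 from confusion counts.

    tp/fp/fn/tn are true positives, false positives, false negatives and
    true negatives. Ratios with an empty denominator are reported as 0.0.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0        # true classification rate
    recall = tp / (tp + fn) if (tp + fn) else 0.0         # sensitivity
    precision = tp / (tp + fp) if (tp + fp) else 0.0      # positive predictive value
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0  # balanced F-measure
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}


# Invented counts, for illustration only.
metrics = classification_metrics(tp=8, fp=2, fn=2, tn=8)
print(metrics)
```

Recall and precision can diverge sharply on imbalanced case/control data, which is one reason a recall above 0.95 needs to be read alongside the other three measures.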
Affiliation(s)
- Yesim Aydin Son
- Department of Health Informatics, Graduate School of Informatics, Middle East Technical University
| | - Husamettin Gul
- Department of Medical Informatics, Gulhane Military Medical Academy, Ankara, Turkey
|
36
|
Jiang X, Neapolitan RE. Evaluation of a two-stage framework for prediction using big genomic data. Brief Bioinform 2015; 16:912-21. [PMID: 25788325 PMCID: PMC4652616 DOI: 10.1093/bib/bbv010] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 11/05/2014] [Revised: 02/05/2015] [Indexed: 01/13/2023] Open
Abstract
We are in the era of abundant 'big' or 'high-dimensional' data. These data afford us the opportunity to discover predictors of an event of interest, and to estimate occurrence of the event based on values of these predictors. For example, 'genome-wide association studies' examine millions of single-nucleotide polymorphisms (SNPs), along with disease status. We can learn SNPs that affect disease status from these data sets, and use the knowledge learned to predict disease likelihood. Owing to the large number of features, it is difficult for many prediction methods to use all the features directly. The ReliefF algorithm ranks a set of features in terms of how well they predict a target. It can be used to identify good predictors, which can then be provided to a prediction method. We compared the performance of eight prediction methods when predicting binary outcomes using high-dimensional discrete data sets. We performed two-stage prediction, where ReliefF is used in the first stage to identify good predictors. Bayesian network (BN)-based methods performed best overall. Furthermore, ReliefF did not improve their performance. The BN-based methods use the Bayesian Dirichlet Equivalent Uniform score to evaluate candidate models, and use BN inference algorithms to perform prediction. This score and these algorithms were developed for discrete variables. This perhaps explains why they perform better in this domain. Many prediction methods are available, and researchers have little reason for choosing one over the other in the domain of binary prediction using high-dimensional data sets. Our results indicate that the best choices overall are BN-based methods.
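The first stage of the two-stage framework described in this abstract ranks features before any predictor sees them. The sketch below illustrates the idea with basic Relief, a simpler relative of the ReliefF algorithm the abstract names (ReliefF averages over several nearest neighbours and handles multi-class and missing data; this sketch does not). The function names and the toy data are hypothetical, and the code is not the authors' implementation.

```python
import random


def relief_scores(X, y, n_iter=None, seed=0):
    """Score discrete features with basic Relief (binary target).

    A feature gains weight when it differs from the nearest sample of the
    other class (nearest miss) and loses weight when it differs from the
    nearest sample of the same class (nearest hit).
    """
    rng = random.Random(seed)
    n_samples, n_features = len(X), len(X[0])
    n_iter = n_samples if n_iter is None else n_iter
    weights = [0.0] * n_features

    def hamming(a, b):
        # Number of positions at which two discrete feature vectors differ.
        return sum(ai != bi for ai, bi in zip(a, b))

    for _ in range(n_iter):
        i = rng.randrange(n_samples)
        hits = [j for j in range(n_samples) if j != i and y[j] == y[i]]
        misses = [j for j in range(n_samples) if y[j] != y[i]]
        if not hits or not misses:
            continue
        hit = min(hits, key=lambda j: hamming(X[i], X[j]))
        miss = min(misses, key=lambda j: hamming(X[i], X[j]))
        for f in range(n_features):
            weights[f] += (X[i][f] != X[miss][f]) - (X[i][f] != X[hit][f])
    return [w / n_iter for w in weights]


def top_features(scores, k):
    """Indices of the k highest-scoring features (the stage-one output)."""
    return sorted(range(len(scores)), key=lambda f: scores[f], reverse=True)[:k]


# Made-up toy data: feature 0 equals the label, feature 1 is noise.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]
scores = relief_scores(X, y)
selected = top_features(scores, k=1)
print(scores, selected)
```

In a two-stage setup the columns in `selected` would then be passed to the prediction method of choice. The abstract reports that this filtering step did not improve the Bayesian network-based methods.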
|
37
|
Cheng CW, Wang MD. Improving Personalized Clinical Risk Prediction Based on Causality-Based Association Rules. ACM Conference on Bioinformatics, Computational Biology and Biomedicine (ACM-BCB) 2015; 2015:386-392. [PMID: 27532063 DOI: 10.1145/2808719.2808759] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 01/18/2023]
Abstract
Developing clinical risk prediction models is one of the main tasks of healthcare data mining. Advanced data collection techniques in the current Big Data era have created an emerging and urgent need for scalable, computer-based data mining methods. These methods can turn data into useful, personalized decision support knowledge in a flexible, cost-effective, and productive way. In our previous study, we developed a tool, called icuARM-II, that can generate personalized clinical risk prediction evidence using a temporal rule mining framework. However, the generation of the final risk prediction possibility with icuARM-II still relied on human interpretation, which was subjective and, most of the time, biased. In this study, we propose a new mechanism to improve icuARM-II's rule selection by including the concept of causal analysis. The generated risk prediction is quantitatively assessed using calibration statistics. To evaluate the performance of the new rule selection mechanism, we conducted a case study to predict short-term intensive care unit mortality based on personalized lab testing abnormalities. Our results demonstrated a better-calibrated ICU risk prediction using the new causality-based rule selection solution compared with conventional confidence-only rule selection methods.
|
38
|
Jeon SH, Jeon EH, Lee JY, Kim YS, Yoon HJ, Hong SP, Lee JH. The potential of interleukin 12 receptor beta 2 (IL12RB2) and tumor necrosis factor receptor superfamily member 8 (TNFRSF8) gene as diagnostic biomarkers of oral lichen planus (OLP). Acta Odontol Scand 2015; 73:588-94. [PMID: 25915578 DOI: 10.3109/00016357.2014.967719] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 12/29/2022]
Abstract
OBJECTIVE This study evaluated the potential of interleukin 12 receptor beta 2 and tumor necrosis factor receptor superfamily member 8 as diagnostic biomarkers of oral lichen planus (OLP). MATERIALS AND METHODS The mRNA expression of IL12RB2 and TNFRSF8 in FFPE OLP samples (OLP group, n = 38) was investigated with quantitative reverse transcriptase-polymerase chain reaction (qRT-PCR) analysis and compared with that of chronic non-specific mucositis (Non-OLP group, n = 25) and normal mucosa (Normal group, n = 18). Predictive modeling of the expression of IL12RB2 and TNFRSF8 was constructed using support vector machine (SVM), random forest (RF), linear discriminant analysis (LDA), neural network (NN) and naive Bayes (NB) methods. RESULTS Normalized expression of IL12RB2 in the OLP group (3.78 ± 1.67) was significantly higher than in the Normal group (1.97 ± 1.12), but lower than in the Non-OLP group (6.86 ± 1.67). TNFRSF8 gene expression in the OLP group (7.46 ± 1.51) was significantly higher than in the Normal group (2.90 ± 1.61), but no significant difference was found between the OLP and Non-OLP groups. The ratio of IL12RB2/TNFRSF8 in the OLP group (0.52 ± 0.23) was significantly lower than in the Normal group (0.74 ± 0.39) and the Non-OLP group (1.07 ± 0.38). In the predictive modeling, the area under the receiver operating characteristic (ROC) curve (AUC) ranged from 0.83 to 0.92 and accuracy was higher than 0.75 in all methods. CONCLUSIONS The IL12RB2/TNFRSF8 ratio can be a useful diagnostic tool for OLP.
Affiliation(s)
- Seung-Ho Jeon
- Department of Oral and Maxillofacial Surgery, School of Dentistry
|
39
|
Bielza C, Larrañaga P. Bayesian networks in neuroscience: a survey. Front Comput Neurosci 2014; 8:131. [PMID: 25360109 PMCID: PMC4199264 DOI: 10.3389/fncom.2014.00131] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.8] [Received: 06/30/2014] [Accepted: 09/26/2014] [Indexed: 12/29/2022] Open
Abstract
Bayesian networks are a type of probabilistic graphical model that lies at the intersection between statistics and machine learning. They have been shown to be powerful tools to encode dependence relationships among the variables of a domain under uncertainty. Thanks to their generality, Bayesian networks can accommodate continuous and discrete variables, as well as temporal processes. In this paper we review Bayesian networks and how they can be learned automatically from data by means of structure learning algorithms. Also, we examine how a user can take advantage of these networks for reasoning by exact or approximate inference algorithms that propagate the given evidence through the graphical structure. Despite their applicability in many fields, they have been little used in neuroscience, where they have focused on specific problems, like functional connectivity analysis from neuroimaging data. Here we survey key research in neuroscience where Bayesian networks have been used with different aims: discover associations between variables, perform probabilistic reasoning over the model, and classify new observations with and without supervision. The networks are learned from data of any kind (morphological, electrophysiological, -omics and neuroimaging), thereby broadening the scope (molecular, cellular, structural, functional, cognitive and medical) of the brain aspects to be studied.
Affiliation(s)
- Concha Bielza
- *Correspondence: Concha Bielza, Departamento de Inteligencia Artificial, Universidad Politecnica de Madrid, Campus de Montegancedo, Boadilla del Monte, 28660 Madrid, Spain
|
40
|
Jiang X, Cai B, Xue D, Lu X, Cooper GF, Neapolitan RE. A comparative analysis of methods for predicting clinical outcomes using high-dimensional genomic datasets. J Am Med Inform Assoc 2014; 21:e312-9. [PMID: 24737607 PMCID: PMC4173174 DOI: 10.1136/amiajnl-2013-002358] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 09/14/2013] [Revised: 02/20/2014] [Accepted: 03/14/2014] [Indexed: 11/03/2022] Open
Abstract
OBJECTIVE The objective of this investigation is to evaluate binary prediction methods for predicting disease status using high-dimensional genomic data. The central hypothesis is that the Bayesian network (BN)-based method called efficient Bayesian multivariate classifier (EBMC) will do well at this task because EBMC builds on BN-based methods that have performed well at learning epistatic interactions. METHOD We evaluate how well eight methods perform binary prediction using high-dimensional discrete genomic datasets containing epistatic interactions. The methods are as follows: naive Bayes (NB), model averaging NB (MANB), feature selection NB (FSNB), EBMC, logistic regression (LR), support vector machines (SVM), Lasso, and extreme learning machines (ELM). We use a hundred 1000-single nucleotide polymorphism (SNP) simulated datasets, ten 10,000-SNP datasets, six semi-synthetic sets, and two real genome-wide association studies (GWAS) datasets in our evaluation. RESULTS In fivefold cross-validation studies, the SVM performed best on the 1000-SNP dataset, while the BN-based methods performed best on the other datasets, with EBMC exhibiting the best overall performance. In-sample testing indicates that LR, SVM, Lasso, ELM, and NB tend to overfit the data. DISCUSSION EBMC performed better than NB when there are several strong predictors, whereas NB performed better when there are many weak predictors. Furthermore, for all BN-based methods, prediction capability did not degrade as the dimension increased. CONCLUSIONS Our results support the hypothesis that EBMC performs well at binary outcome prediction using high-dimensional discrete datasets containing epistatic-like interactions. Future research using more GWAS datasets is needed to further investigate the potential of EBMC.
Affiliation(s)
- Xia Jiang
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, USA
| | - Binghuang Cai
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, USA
| | - Diyang Xue
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, USA
| | - Xinghua Lu
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, USA
| | - Gregory F Cooper
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, USA
| | - Richard E Neapolitan
- Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
|
41
|
Informative gene selection and direct classification of tumor based on Chi-square test of pairwise gene interactions. Biomed Res Int 2014; 2014:589290. [PMID: 25140319 PMCID: PMC4130026 DOI: 10.1155/2014/589290] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 05/28/2014] [Accepted: 07/10/2014] [Indexed: 01/04/2023]
Abstract
In efforts to discover disease mechanisms and improve clinical diagnosis of tumors, it is useful to mine profiles for informative genes with definite biological meanings and to build robust classifiers with high precision. In this study, we developed a new method for tumor-gene selection, the Chi-square test-based integrated rank gene and direct classifier (χ2-IRG-DC). First, we obtained the weighted integrated rank of gene importance from chi-square tests of single and pairwise gene interactions. Then, we sequentially introduced the ranked genes and removed redundant genes by using leave-one-out cross-validation of the chi-square test-based Direct Classifier (χ2-DC) within the training set to obtain informative genes. Finally, we determined the accuracy of independent test data by utilizing the genes obtained above with χ2-DC. Furthermore, we analyzed the robustness of χ2-IRG-DC by comparing the generalization performance of different models, the efficiency of different feature-selection methods, and the accuracy of different classifiers. An independent test of ten multiclass tumor gene-expression datasets showed that χ2-IRG-DC could efficiently control overfitting and had higher generalization performance. The informative genes selected by χ2-IRG-DC could dramatically improve the independent test precision of other classifiers; meanwhile, the informative genes selected by other feature selection methods also had good performance in χ2-DC.
|
42
|
Jiang X, Tse K, Wang S, Doan S, Kim H, Ohno-Machado L. Recent trends in biomedical informatics: a study based on JAMIA articles. J Am Med Inform Assoc 2013; 20:e198-205. [PMID: 24214018 PMCID: PMC3861936 DOI: 10.1136/amiajnl-2013-002429] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Indexed: 12/24/2022] Open
Abstract
In a growing interdisciplinary field like biomedical informatics, information dissemination and citation trends are changing rapidly due to many factors. To understand these factors better, we analyzed the evolution of the number of articles per major biomedical informatics topic, download/online view frequencies, and citation patterns (using Web of Science) for articles published from 2009 to 2012 in JAMIA. The number of articles published in JAMIA increased significantly from 2009 to 2012, and there were some topic differences in the last 4 years. Medical Record Systems, Algorithms, and Methods are topic categories that are growing fast in several publications. We observed a significant correlation between download frequencies and the number of citations per month since publication for a given article. Earlier free availability of articles to non-subscribers was associated with a higher number of downloads and showed a trend towards a higher number of citations. This trend will need to be verified as more data accumulate in coming years.
Affiliation(s)
- Xiaoqian Jiang
- Division of Biomedical Informatics, Department of Medicine, University of California San Diego, La Jolla, California, USA
|
43
|
Mitchell E, Monaghan D, O'Connor NE. Classification of sporting activities using smartphone accelerometers. Sensors 2013; 13:5317-37. [PMID: 23604031 PMCID: PMC3673139 DOI: 10.3390/s130405317] [Citation(s) in RCA: 46] [Impact Index Per Article: 4.2] [Received: 02/26/2013] [Revised: 04/08/2013] [Accepted: 04/11/2013] [Indexed: 11/30/2022]
Abstract
In this paper we present a framework that allows for the automatic identification of sporting activities using commonly available smartphones. We extract discriminative informational features from smartphone accelerometers using the Discrete Wavelet Transform (DWT). Despite the poor quality of their accelerometers, smartphones were used as capture devices due to their prevalence in today's society. Successful classification on this basis potentially makes the technology accessible to both elite and non-elite athletes. Extracted features are used to train different categories of classifiers. No one classifier family has a reportable direct advantage in activity classification problems to date; thus we examine classifiers from each of the most widely used classifier families. We investigate three classification approaches: a commonly used SVM-based approach, an optimized classification model and a fusion of classifiers. We also investigate the effect of changing several of the DWT input parameters, including mother wavelets, window lengths and DWT decomposition levels. During the course of this work we created a challenging sports activity analysis dataset, comprising soccer and field-hockey activities. An average maximum F-measure of 87% was achieved using a fusion of classifiers, which was 6% better than a single classifier model and 23% better than a standard SVM approach.
Affiliation(s)
- Edmond Mitchell
- Centre for Sensor Web Technologies, Dublin City University, Dublin, Ireland.
|
44
|
Jiang X, Menon A, Wang S, Kim J, Ohno-Machado L. Doubly Optimized Calibrated Support Vector Machine (DOC-SVM): an algorithm for joint optimization of discrimination and calibration. PLoS One 2012; 7:e48823. [PMID: 23139819 PMCID: PMC3490990 DOI: 10.1371/journal.pone.0048823] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 05/15/2012] [Accepted: 10/03/2012] [Indexed: 11/19/2022] Open
Abstract
Historically, probabilistic models for decision support have focused on discrimination, e.g., minimizing the ranking error of predicted outcomes. Unfortunately, these models ignore another important aspect, calibration, which indicates the magnitude of correctness of model predictions. Using discrimination and calibration simultaneously can be helpful for many clinical decisions. We investigated tradeoffs between these goals, and developed a unified maximum-margin method to handle them jointly. Our approach, called Doubly Optimized Calibrated Support Vector Machine (DOC-SVM), concurrently optimizes two loss functions: the ridge regression loss and the hinge loss. Experiments using three breast cancer gene-expression datasets (i.e., GSE2034, GSE2990, and Chanrion's datasets) showed that our model generated more calibrated outputs when compared to other state-of-the-art models like Support Vector Machine (p = 0.03, p = 0.13, and p < 0.001) and Logistic Regression (p = 0.006, p = 0.008, and p < 0.001). DOC-SVM also demonstrated better discrimination (i.e., higher AUCs) when compared to Support Vector Machine (p = 0.38, p = 0.29, and p = 0.047) and Logistic Regression (p = 0.38, p = 0.04, and p < 0.0001). DOC-SVM produced a model that was better calibrated without sacrificing discrimination, and hence may be helpful in clinical decision making.
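The idea of optimizing a hinge loss and a ridge regression loss together can be sketched for a linear scorer as below. The abstract does not give the exact DOC-SVM formulation, so the mixing weight `alpha`, the penalty strength `lam`, and all function names are assumptions made purely for illustration.

```python
def linear_score(w, b, x):
    """Linear decision value w.x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def hinge_loss(w, b, X, y):
    """Mean hinge loss (discrimination term); labels y are in {-1, +1}."""
    return sum(max(0.0, 1.0 - yi * linear_score(w, b, xi)) for xi, yi in zip(X, y)) / len(y)

def squared_loss(w, b, X, y):
    """Mean squared error between score and label (the data term of ridge regression)."""
    return sum((yi - linear_score(w, b, xi)) ** 2 for xi, yi in zip(X, y)) / len(y)

def joint_objective(w, b, X, y, alpha=0.5, lam=0.1):
    """Hypothetical combined objective: weighted hinge + squared loss + L2 penalty.

    alpha and lam are illustrative assumptions, not values from the paper.
    """
    penalty = lam * sum(wj * wj for wj in w)
    return alpha * hinge_loss(w, b, X, y) + (1.0 - alpha) * squared_loss(w, b, X, y) + penalty

# Hypothetical one-dimensional toy data.
X = [[1.0], [-1.0]]
y = [1, -1]
value = joint_objective([0.5], 0.0, X, y)
```

In this toy setting, the hinge term rewards a large margin while the squared term pulls scores toward the labels themselves, which is one way to see how a single objective can trade off discrimination against calibration.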
Affiliation(s)
- Xiaoqian Jiang
- Division of Biomedical Informatics, University California San Diego (UCSD), La Jolla, California, USA.
|
45
|
Malovini A, Barbarini N, Bellazzi R, de Michelis F. Hierarchical Naive Bayes for genetic association studies. BMC Bioinformatics 2012; 13 Suppl 14:S6. [PMID: 23095471 PMCID: PMC3439732 DOI: 10.1186/1471-2105-13-s14-s6] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Indexed: 12/17/2022] Open
Abstract
Background: Genome Wide Association Studies represent powerful approaches that aim at disentangling the genetic and molecular mechanisms underlying complex traits. The usual "one-SNP-at-a-time" testing strategy cannot capture the multi-factorial nature of this kind of disorder. We propose a Hierarchical Naïve Bayes classification model for taking into account associations in SNP data characterized by Linkage Disequilibrium. Validation shows that our model reaches classification performance superior to that obtained by the standard Naïve Bayes classifier for simulated and real datasets. Methods: In the Hierarchical Naïve Bayes implemented, the SNPs mapping to the same region of Linkage Disequilibrium are considered as "details" or "replicates" of the locus, each contributing to the overall effect of the region on the phenotype. A latent variable for each block, which models the "population" of correlated SNPs, can then be used to summarize the available information. The classification is thus performed relying on the latent variables' conditional probability distributions and on the SNP data available. Results: The developed methodology has been tested on simulated datasets, each composed of 300 cases, 300 controls and a variable number of SNPs. Our approach has also been applied to two real datasets on the genetic bases of Type 1 Diabetes and Type 2 Diabetes generated by the Wellcome Trust Case Control Consortium. Conclusions: The approach proposed in this paper, called Hierarchical Naïve Bayes, allows dealing with classification of examples for which genetic information on structurally correlated SNPs is available. It improves Naïve Bayes performance by properly handling the within-loci variability.
Affiliation(s)
- Alberto Malovini
- Department of Industrial and Information Engineering, University of Pavia, Pavia, 27100, Italy.
|
46
|
Russu A, Malovini A, Puca AA, Bellazzi R. Stochastic model search with binary outcomes for genome-wide association studies. J Am Med Inform Assoc 2012; 19:e13-20. [PMID: 22534080 PMCID: PMC3392850 DOI: 10.1136/amiajnl-2011-000741] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 11/26/2022] Open
Abstract
Objective: The spread of case–control genome-wide association studies (GWASs) has stimulated the development of new variable selection methods and predictive models. We introduce a novel Bayesian model search algorithm, Binary Outcome Stochastic Search (BOSS), which addresses the model selection problem when the number of predictors far exceeds the number of binary responses. Materials and methods: Our method is based on a latent variable model that links the observed outcomes to the underlying genetic variables. A Markov Chain Monte Carlo approach is used for model search and to evaluate the posterior probability of each predictor. Results: BOSS is compared with three established methods (stepwise regression, logistic lasso, and elastic net) in a simulated benchmark. Two real case studies are also investigated: a GWAS on the genetic bases of longevity, and the type 2 diabetes study from the Wellcome Trust Case Control Consortium. Simulations show that BOSS achieves higher precision than the reference methods while preserving good recall rates. In both experimental studies, BOSS successfully detects genetic polymorphisms previously reported to be associated with the analyzed phenotypes. Discussion: BOSS outperforms the other methods in terms of F-measure on simulated data. In the two real studies, BOSS successfully detects biologically relevant features, some of which are missed by univariate analysis and the three reference techniques. Conclusion: The proposed algorithm is an advance in the methodology for model selection with a large number of features. Our simulated and experimental results show that BOSS is effective in detecting relevant markers while providing a parsimonious model.
Affiliation(s)
- Alberto Russu
- Department of Industrial and Information Engineering, University of Pavia, Pavia, Italy.
|
47
|
Jiang X, Boxwala AA, El-Kareh R, Kim J, Ohno-Machado L. A patient-driven adaptive prediction technique to improve personalized risk estimation for clinical decision support. J Am Med Inform Assoc 2012; 19:e137-44. [PMID: 22493049 PMCID: PMC3392846 DOI: 10.1136/amiajnl-2011-000751] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Indexed: 12/19/2022] Open
Abstract
Objective: Competing tools are available online to assess the risk of developing certain conditions of interest, such as cardiovascular disease. While predictive models have been developed and validated on data from cohort studies, little attention has been paid to ensuring the reliability of such predictions for individuals, which is critical for care decisions. The goal was to develop a patient-driven adaptive prediction technique to improve personalized risk estimation for clinical decision support. Material and methods: A data-driven approach was proposed that utilizes individualized confidence intervals (CIs) to select the most ‘appropriate’ model from a pool of candidates to assess the individual patient's clinical condition. The method does not require access to the training dataset. This approach was compared with other strategies: the BEST model (the ideal model, which can only be achieved by access to data or knowledge of which population is most similar to the individual), CROSS model, and RANDOM model selection. Results: When evaluated on clinical datasets, the approach significantly outperformed the CROSS model selection strategy in terms of discrimination (p<1e-14) and calibration (p<0.006). The method outperformed the RANDOM model selection strategy in terms of discrimination (p<1e-12), but the improvement did not achieve significance for calibration (p=0.1375). Limitations: The CI may not always offer enough information to rank the reliability of predictions, and this evaluation was done using aggregation. If a particular individual is very different from those represented in a training set of existing models, the CI may be somewhat misleading. Conclusion: This approach has the potential to offer more reliable predictions than those offered by other heuristics for disease risk estimation of individual patients.
Affiliation(s)
- Xiaoqian Jiang
- Division of Biomedical Informatics, University of California at San Diego, La Jolla, California 92093-0728, USA.
|
48
|
Aguiar-Pulido V, Munteanu CR, Seoane JA, Fernández-Blanco E, Pérez-Montoto LG, González-Díaz H, Dorado J. Naïve Bayes QSDR classification based on spiral-graph Shannon entropies for protein biomarkers in human colon cancer. Molecular BioSystems 2012; 8:1716-22. [PMID: 22466084 DOI: 10.1039/c2mb25039j] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Indexed: 12/14/2022]
Abstract
Fast cancer diagnosis represents a real necessity in applied medicine due to the importance of this disease. Thus, theoretical models can help as prediction tools. Graph theory representation is one option because it permits us to numerically describe any real system such as the protein macromolecules by transforming real properties into molecular graph topological indices. This study proposes a new classification model for proteins linked with human colon cancer by using spiral graph topological indices of protein amino acid sequences. The best quantitative structure-disease relationship model is based on eleven Shannon entropy indices. It was obtained with the Naïve Bayes method and shows excellent predictive ability (90.92%) for new proteins linked with this type of cancer. The statistical analysis confirms that this model allows diagnosing the absence of human colon cancer, with an area under the receiver operating characteristic curve of 0.91. The methodology presented can be used for any type of sequential information, such as any protein or nucleic acid sequence.
Affiliation(s)
- Vanessa Aguiar-Pulido
- Department of Information and Communications Technologies, University of A Coruña, A Coruña, Spain
|
49
|
Wu Y, Jiang X, Kim J, Ohno-Machado L. I-spline Smoothing for Calibrating Predictive Models. AMIA Joint Summits on Translational Science Proceedings 2012; 2012:39-46. [PMID: 22779048 PMCID: PMC3392066] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/04/2022]
Abstract
We proposed the I-spline Smoothing approach for calibrating predictive models by solving a nonlinear monotone regression problem. We took advantage of I-spline properties to obtain globally optimal solutions while keeping the computational cost low. Numerical studies based on three data sets provided empirical evidence that I-spline Smoothing improves calibration (i.e., 1.6x, 1.4x, and 1.4x on the three datasets compared to the average of competitors: Binning, Platt Scaling, Isotonic Regression, Monotone Spline Smoothing, and Smooth Isotonic Regression) without deterioration of discrimination.
|
50
|
Abstract
The performance of a classification system depends on the context in which it will be used, including the prevalence of the classes and the relative costs of different types of errors. Metrics such as accuracy are limited to the context in which the experiment was originally carried out, and metrics such as sensitivity, specificity, and receiver operating characteristic area (while independent of prevalence) do not provide a clear picture of the performance characteristics of the system over different contexts. Graphing a prevalence-specific metric such as F-measure or the relative cost of errors over a wide range of prevalence allows a visualization of the performance of the system and a comparison of systems in different contexts.
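The dependence of F-measure on prevalence can be made concrete with a short sketch. It assumes a system with fixed sensitivity and specificity and derives precision (positive predictive value) from Bayes' rule. The function names and the example operating point are illustrative assumptions, not values from the article.

```python
def precision_at(sensitivity, specificity, prevalence):
    """Positive predictive value at a given prevalence, via Bayes' rule.

    Undefined (division by zero) when no positive predictions are expected.
    """
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

def f_measure_at(sensitivity, specificity, prevalence):
    """Balanced F-measure (harmonic mean of precision and recall) at a given prevalence."""
    precision = precision_at(sensitivity, specificity, prevalence)
    recall = sensitivity
    return 2.0 * precision * recall / (precision + recall)

# Hypothetical operating point evaluated over a range of prevalence values.
prevalences = [0.01, 0.1, 0.5, 0.9]
curve = [(p, f_measure_at(0.9, 0.9, p)) for p in prevalences]
```

Plotting the pairs in `curve` for two systems on the same axes would show where one overtakes the other as the positive class becomes rarer or more common.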
Affiliation(s)
- George Hripcsak
- Department of Biomedical Informatics, Columbia University Medical Center, 622 West 168th Street, VC5, New York, NY 10027, USA.
|