1
Barnett EJ, Onete DG, Salekin A, Faraone SV. Genomic Machine Learning Meta-regression: Insights on Associations of Study Features With Reported Model Performance. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2024; 21:169-177. [PMID: 38109236 DOI: 10.1109/tcbb.2023.3343808] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/20/2023]
Abstract
Many studies have been conducted with the goal of correctly predicting diagnostic status of a disorder using the combination of genomic data and machine learning. It is often hard to judge which components of a study led to better results and whether better reported results represent a true improvement or an uncorrected bias inflating performance. We extracted information about the methods used and other differentiating features in genomic machine learning models. We used these features in linear regressions predicting model performance. We tested for univariate and multivariate associations as well as interactions between features. Of the models reviewed, 46% used feature selection methods that can lead to data leakage. Across our models, the number of hyperparameter optimizations reported, data leakage due to feature selection, model type, and modeling an autoimmune disorder were significantly associated with an increase in reported model performance. We found a significant, negative interaction between data leakage and training size. Our results suggest that methods susceptible to data leakage are prevalent among genomic machine learning research, resulting in inflated reported performance. Best practice guidelines that promote the avoidance and recognition of data leakage may help the field avoid biased results.
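The data leakage described in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the univariate filter (`class_mean_gap`), the nearest-centroid classifier, and the function `cross_validate` are all hypothetical stand-ins chosen so the example needs only the standard library. The `leaky=True` path selects features once using every sample, including samples later used for testing; the `leaky=False` path repeats the selection inside each training fold.

```python
import random

def mean(values):
    return sum(values) / len(values)

def class_mean_gap(X, y, j):
    """Univariate filter score: absolute gap between class means of feature j."""
    pos = [row[j] for row, label in zip(X, y) if label == 1]
    neg = [row[j] for row, label in zip(X, y) if label == 0]
    return abs(mean(pos) - mean(neg))

def select_top_k(X, y, k):
    """Indices of the k features with the largest class-mean gap."""
    ranked = sorted(range(len(X[0])), key=lambda j: class_mean_gap(X, y, j), reverse=True)
    return ranked[:k]

def fit_centroids(X, y, features):
    """Per-class mean vector restricted to the selected features."""
    centroids = {}
    for label in (0, 1):
        rows = [row for row, lab in zip(X, y) if lab == label]
        centroids[label] = [mean([r[j] for r in rows]) for j in features]
    return centroids

def predict(centroids, row, features):
    """Assign the class whose centroid is closest in squared distance."""
    def dist(label):
        return sum((row[j] - c) ** 2 for j, c in zip(features, centroids[label]))
    return 0 if dist(0) <= dist(1) else 1

def cross_validate(X, y, k_features, n_folds, leaky):
    """Cross-validated accuracy; assumes both classes occur in every training fold."""
    n = len(X)
    folds = [list(range(i, n, n_folds)) for i in range(n_folds)]
    # Leaky variant: features are chosen once from ALL samples, test folds included.
    global_features = select_top_k(X, y, k_features) if leaky else None
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        # Leak-free variant: selection sees only the training fold.
        features = global_features if leaky else select_top_k(X_tr, y_tr, k_features)
        centroids = fit_centroids(X_tr, y_tr, features)
        correct += sum(predict(centroids, X[i], features) == y[i] for i in test_idx)
    return correct / n

if __name__ == "__main__":
    random.seed(0)
    # Pure-noise features: no real signal, so true accuracy should be near chance.
    y = [i % 2 for i in range(40)]
    X = [[random.gauss(0.0, 1.0) for _ in range(500)] for _ in y]
    print("leaky selection:    ", cross_validate(X, y, 10, 5, leaky=True))
    print("leak-free selection:", cross_validate(X, y, 10, 5, leaky=False))
```

On pure-noise data such as the demo above, the leaky estimate is typically well above chance while the leak-free estimate stays near it, which is the kind of inflation the abstract attributes to feature selection performed outside the validation loop.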
2
Research of Epidemic Big Data Based on Improved Deep Convolutional Neural Network. Computational and Mathematical Methods in Medicine 2020; 2020:3641745. [PMID: 32774444 PMCID: PMC7396034 DOI: 10.1155/2020/3641745] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 05/23/2020] [Accepted: 06/23/2020] [Indexed: 02/06/2023]
Abstract
In recent years, as populations age and the pressures of daily life intensify, the proportion of chronic diseases has gradually increased. A large amount of medical data is generated during the hospitalization of diabetic patients, and discovering latent patterns and valuable information in these data has practical significance and social value. To this end, an improved deep convolutional neural network algorithm (“CNN+” for short) was proposed to predict changes in diabetes. The bagging ensemble classification algorithm replaces the output-layer function of the deep CNN, which adapts the model to the diabetic patient data set and improves classification accuracy. The “CNN+” algorithm thereby combines the advantages of both components: the deep CNN extracts latent features from the data set, and the bagging ensemble classifies those features, improving classification accuracy and yielding better disease prediction to assist doctors in diagnosis and treatment. Experimental results show that, compared with a traditional convolutional neural network and other classification algorithms, the “CNN+” model gives more reliable predictions.
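The bagging stage described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the rows of `X` are assumed to be feature vectors already extracted by a CNN, the base learner is a one-feature threshold rule ("decision stump") chosen here only for brevity, and every function name is hypothetical.

```python
import random
from collections import Counter

def predict_stump(stump, row):
    """polarity=1 predicts class 1 when value > threshold; polarity=0 when value <= threshold."""
    j, threshold, polarity = stump
    return int((row[j] > threshold) == bool(polarity))

def fit_stump(X, y):
    """Exhaustively pick the (feature, threshold, polarity) with the fewest training errors."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            for polarity in (0, 1):
                candidate = (j, threshold, polarity)
                errors = sum(predict_stump(candidate, row) != label for row, label in zip(X, y))
                if best is None or errors < best[0]:
                    best = (errors, candidate)
    return best[1]

def fit_bagging(X, y, n_estimators=25, seed=0):
    """Train one stump per bootstrap resample (sampling rows with replacement)."""
    rng = random.Random(seed)
    n = len(X)
    ensemble = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        ensemble.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return ensemble

def predict_bagging(ensemble, row):
    """Majority vote over the base learners."""
    votes = Counter(predict_stump(stump, row) for stump in ensemble)
    return votes.most_common(1)[0][0]

if __name__ == "__main__":
    # Hypothetical one-dimensional "CNN features" for four patients.
    X = [[0.1], [0.2], [0.8], [0.9]]
    y = [0, 0, 1, 1]
    ensemble = fit_bagging(X, y)
    print(predict_bagging(ensemble, [0.95]))
```

Each base learner sees a different resample, so individual stumps can disagree; the vote is what stabilizes the final classification.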
3
Stafford IS, Kellermann M, Mossotto E, Beattie RM, MacArthur BD, Ennis S. A systematic review of the applications of artificial intelligence and machine learning in autoimmune diseases. NPJ Digit Med 2020; 3:30. [PMID: 32195365 PMCID: PMC7062883 DOI: 10.1038/s41746-020-0229-3] [Citation(s) in RCA: 102] [Impact Index Per Article: 25.5] [Received: 08/12/2019] [Accepted: 01/17/2020] [Indexed: 02/07/2023] [Open Access]
Abstract
Autoimmune diseases are chronic, multifactorial conditions. Through machine learning (ML), a branch of the wider field of artificial intelligence, it is possible to extract patterns within patient data, and exploit these patterns to predict patient outcomes for improved clinical management. Here, we surveyed the use of ML methods to address clinical problems in autoimmune disease. A systematic review was conducted using the MEDLINE, Embase, and Computers and Applied Sciences Complete databases. Relevant papers included "machine learning" or "artificial intelligence" and the autoimmune diseases search term(s) in their title, abstract or key words. Exclusion criteria: studies not written in English, no real human patient data included, publication prior to 2001, studies that were not peer reviewed, non-autoimmune disease comorbidity research and review papers. 169 (of 702) studies met the criteria for inclusion. Support vector machines and random forests were the most popular ML methods used. ML models using data on multiple sclerosis, rheumatoid arthritis and inflammatory bowel disease were most common. A small proportion of studies (7.7% or 13/169) combined different data types in the modelling process. Cross-validation combined with a separate testing set, for more robust model evaluation, occurred in 8.3% of papers (14/169). The field may benefit from adopting a best practice of validation, cross-validation and independent testing of ML models. Many models achieved good predictive results in simple scenarios (e.g. classification of cases and controls). Progression to more complex predictive models may be achievable in future through integration of multiple data types.
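The evaluation design this abstract recommends, cross-validation combined with a separate testing set, can be sketched as an index-splitting routine. The function name `holdout_then_kfold`, the 20% test fraction, and the five folds are assumptions made for illustration; the review does not prescribe specific values.

```python
import random

def holdout_then_kfold(n_samples, test_fraction=0.2, n_folds=5, seed=0):
    """Reserve an untouched test set, then build k-fold splits on the remainder.

    Returns (test_idx, splits), where splits is a list of
    (train_idx, validation_idx) pairs drawn only from the non-test samples.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)

    n_test = int(round(n_samples * test_fraction))
    test_idx = indices[:n_test]   # used once, after all model selection is finished
    dev_idx = indices[n_test:]    # used for cross-validation and tuning

    folds = [dev_idx[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, validation_idx in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        splits.append((train_idx, validation_idx))
    return test_idx, splits

if __name__ == "__main__":
    test_idx, splits = holdout_then_kfold(100)
    print(len(test_idx), [len(val) for _, val in splits])
```

Model selection and hyperparameter tuning would use only the `splits`; the `test_idx` samples are scored once at the end so that the reported performance is not influenced by tuning decisions.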
Affiliation(s)
- I. S. Stafford
  - Department of Human Genetics and Genomic Medicine, University of Southampton, Southampton, UK
  - Institute for Life Sciences, University of Southampton, Southampton, UK
- M. Kellermann
  - Department of Human Genetics and Genomic Medicine, University of Southampton, Southampton, UK
- E. Mossotto
  - Department of Human Genetics and Genomic Medicine, University of Southampton, Southampton, UK
  - Institute for Life Sciences, University of Southampton, Southampton, UK
- R. M. Beattie
  - Department of Paediatric Gastroenterology, Southampton Children’s Hospital, Southampton, UK
- B. D. MacArthur
  - Institute for Life Sciences, University of Southampton, Southampton, UK
- S. Ennis
  - Department of Human Genetics and Genomic Medicine, University of Southampton, Southampton, UK
4
Zhao LP, Carlsson A, Larsson HE, Forsander G, Ivarsson SA, Kockum I, Ludvigsson J, Marcus C, Persson M, Samuelsson U, Örtqvist E, Pyo CW, Bolouri H, Zhao M, Nelson WC, Geraghty DE, Lernmark Å. Building and validating a prediction model for paediatric type 1 diabetes risk using next generation targeted sequencing of class II HLA genes. Diabetes Metab Res Rev 2017; 33. [PMID: 28755385 DOI: 10.1002/dmrr.2921] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 09/28/2016] [Revised: 06/26/2017] [Accepted: 07/10/2017] [Indexed: 01/06/2023]
Abstract
AIM: It is of interest to predict possible lifetime risk of type 1 diabetes (T1D) in young children for recruiting high-risk subjects into longitudinal studies of effective prevention strategies. METHODS: Utilizing a case-control study in Sweden, we applied a recently developed next generation targeted sequencing technology to genotype class II genes and applied an object-oriented regression to build and validate a prediction model for T1D. RESULTS: In the training set, estimated risk scores were significantly different between patients and controls (P = 8.12 × 10⁻⁹²), and the area under the curve (AUC) from the receiver operating characteristic (ROC) analysis was 0.917. Using the validation data set, we validated the result with AUC of 0.886. Combining both training and validation data resulted in a predictive model with AUC of 0.903. Further, we performed a "biological validation" by correlating risk scores with 6 islet autoantibodies, and found that the risk score was significantly correlated with IA-2A (Z-score = 3.628, P < 0.001). When applying this prediction model to the Swedish population, where the lifetime T1D risk ranges from 0.5% to 2%, we anticipate identifying approximately 20 000 high-risk subjects after testing all newborns, and this calculation would identify approximately 80% of all patients expected to develop T1D in their lifetime. CONCLUSION: Through both empirical and biological validation, we have established a prediction model for estimating lifetime T1D risk, using class II HLA. This prediction model should prove useful for future investigations to identify high-risk subjects for prevention research in high-risk populations.
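The AUC values reported in this abstract summarize how well risk scores rank patients above controls. A minimal sketch of that statistic, using the pairwise (Mann-Whitney) formulation, is shown here; the function name `roc_auc` and the example scores are hypothetical and are not taken from the study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve as the probability that a randomly chosen
    case (label 1) scores higher than a randomly chosen control (label 0).
    Tied scores count as half a win.
    """
    cases = [s for s, lab in zip(scores, labels) if lab == 1]
    controls = [s for s, lab in zip(scores, labels) if lab == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")

    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))

if __name__ == "__main__":
    # Hypothetical risk scores: two patients (label 1) and two controls (label 0).
    labels = [1, 1, 0, 0]
    scores = [0.9, 0.4, 0.5, 0.1]
    print(roc_auc(labels, scores))  # 3 of 4 case-control pairs are ranked correctly -> 0.75
```

Computing the statistic separately on training and validation data, as the abstract does, shows how much of the training-set discrimination carries over to subjects the model has not seen.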
Affiliation(s)
- Lue Ping Zhao
  - Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, WA, USA
  - School of Public Health, University of Washington, Seattle, WA, USA
- Helena Elding Larsson
  - Department of Clinical Sciences, Lund University/CRC, Skåne University Hospital, Malmö, Sweden
- Gun Forsander
  - Institute of Clinical Sciences, Department of Pediatrics and the Queen Silvia Children's Hospital, Sahlgrenska University Hospital, Gothenburg, Sweden
- Sten A Ivarsson
  - Department of Clinical Sciences, Lund University/CRC, Skåne University Hospital, Malmö, Sweden
- Ingrid Kockum
  - Department of Clinical Neurosciences, Karolinska Institutet, Solna, Sweden
- Johnny Ludvigsson
  - Department of Clinical and Experimental Medicine, Linköping University, Linköping, Sweden
- Claude Marcus
  - Department of Clinical Science, Karolinska Institutet, Huddinge, Sweden
- Martina Persson
  - Department of Medicine, Clinical Epidemiology, Karolinska University Hospital, Solna, Sweden
- Ulf Samuelsson
  - Department of Clinical and Experimental Medicine, Linköping University, Linköping, Sweden
- Eva Örtqvist
  - Department of Medicine, Clinical Epidemiology, Karolinska University Hospital, Solna, Sweden
- Chul-Woo Pyo
  - Clinical Research Division, Fred Hutchinson Cancer Research Center, Seattle, WA, USA
- Hamid Bolouri
  - School of Arts and Sciences, University of Washington, Seattle, WA, USA
- Michael Zhao
  - School of Arts and Sciences, University of Washington, Seattle, WA, USA
- Wyatt C Nelson
  - Clinical Research Division, Fred Hutchinson Cancer Research Center, Seattle, WA, USA
- Daniel E Geraghty
  - Clinical Research Division, Fred Hutchinson Cancer Research Center, Seattle, WA, USA
- Åke Lernmark
  - Department of Clinical Sciences, Lund University/CRC, Skåne University Hospital, Malmö, Sweden
5
Zhang JW, Liu TF, Chen XH, Liang WY, Feng XR, Wang L, Fu SW, McCaffrey TA, Liu ML. Validation of aspirin response-related transcripts in patients with coronary artery disease and preliminary investigation on CMTM5 function. Gene 2017; 624:56-65. [PMID: 28457985 DOI: 10.1016/j.gene.2017.04.041] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 02/02/2017] [Revised: 04/15/2017] [Accepted: 04/25/2017] [Indexed: 11/28/2022]
Abstract
Aspirin is widely used in the prevention of cardiovascular diseases, but the antiplatelet responses vary from one patient to another. To validate aspirin response-related transcripts and illustrate their roles in predicting cardiovascular events, we quantified the relative expression of 14 transcripts previously identified as related to high on-aspirin platelet reactivity (HAPR) in 223 patients with coronary artery disease (CAD) on regular aspirin treatment. All patients were followed up regularly for cardiovascular events (CVE). The mean age of our enrolled population was 75.80 ± 8.57 years. HAPR patients showed no significant differences in terms of co-morbidities and combined drugs. The relative expression of HLA-DQA1 was significantly lower in low on-aspirin platelet reactivity (LAPR) patients when compared with the HAPR and high normal (HN) groups (p = 0.028). Moreover, the number of arteries involved, HAPR status and the relative expression of CLU, CMTM5 and SPARC were independent risk factors for CVE during follow-up (p < 0.05). In addition, overexpression of CMTM5 attenuated endothelial cell (EC) migration and proliferation, with significantly decreased phosphorylated-Akt levels, while its inhibition promoted these processes in vitro (p < 0.05). Our study provides evidence that circulating transcripts might be potential biomarkers in predicting cardiovascular events. CMTM5 might exert anti-atherosclerotic effects via suppressing migration and proliferation in the vessel wall. Nevertheless, larger-scale and long-term studies are still needed.
Affiliation(s)
- J W Zhang
  - Department of Geriatrics, Peking University First Hospital, Beijing, China
- T F Liu
  - Department of Geriatrics, Peking University First Hospital, Beijing, China
- X H Chen
  - Department of Geriatrics, Peking University First Hospital, Beijing, China
- W Y Liang
  - Department of Geriatrics, Peking University First Hospital, Beijing, China
- X R Feng
  - Department of Geriatrics, Peking University First Hospital, Beijing, China
- L Wang
  - Peking University Center for Human Disease Genomics, Department of Immunology, Health Science Center, Peking University, Beijing, China
- Sidney W Fu
  - Department of Medicine, George Washington University Medical Center, Washington DC, USA
- Timothy A McCaffrey
  - Department of Medicine, George Washington University Medical Center, Washington DC, USA
- M L Liu
  - Department of Geriatrics, Peking University First Hospital, Beijing, China
6
Kavakiotis I, Tsave O, Salifoglou A, Maglaveras N, Vlahavas I, Chouvarda I. Machine Learning and Data Mining Methods in Diabetes Research. Comput Struct Biotechnol J 2017; 15:104-116. [PMID: 28138367 PMCID: PMC5257026 DOI: 10.1016/j.csbj.2016.12.005] [Citation(s) in RCA: 340] [Impact Index Per Article: 48.6] [Received: 09/15/2016] [Revised: 12/20/2016] [Accepted: 12/27/2016] [Indexed: 12/14/2022] [Open Access]
Abstract
The remarkable advances in biotechnology and health sciences have led to a significant production of data, such as high throughput genetic data and clinical information, generated from large Electronic Health Records (EHRs). To this end, application of machine learning and data mining methods in biosciences is presently, more than ever before, vital and indispensable in efforts to transform intelligently all available information into valuable knowledge. Diabetes mellitus (DM) is defined as a group of metabolic disorders exerting significant pressure on human health worldwide. Extensive research in all aspects of diabetes (diagnosis, etiopathophysiology, therapy, etc.) has led to the generation of huge amounts of data. The aim of the present study is to conduct a systematic review of the applications of machine learning, data mining techniques and tools in the field of diabetes research with respect to a) Prediction and Diagnosis, b) Diabetic Complications, c) Genetic Background and Environment, and d) Health Care and Management, with the first category appearing to be the most popular. A wide range of machine learning algorithms were employed. In general, 85% of those used were characterized by supervised learning approaches and 15% by unsupervised ones, and more specifically, association rules. Support vector machines (SVM) arise as the most successful and widely used algorithm. Concerning the type of data, clinical datasets were mainly used. The title applications in the selected articles project the usefulness of extracting valuable knowledge leading to new hypotheses targeting deeper understanding and further investigation in DM.
Affiliation(s)
- Ioannis Kavakiotis
  - Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki 54124, Greece
  - Institute of Applied Biosciences, CERTH, Thessaloniki, Greece
- Olga Tsave
  - Laboratory of Inorganic Chemistry, Department of Chemical Engineering, Aristotle University of Thessaloniki, Thessaloniki 54124, Greece
- Athanasios Salifoglou
  - Laboratory of Inorganic Chemistry, Department of Chemical Engineering, Aristotle University of Thessaloniki, Thessaloniki 54124, Greece
- Nicos Maglaveras
  - Institute of Applied Biosciences, CERTH, Thessaloniki, Greece
  - Lab of Computing and Medical Informatics, Medical School, Aristotle University of Thessaloniki, Thessaloniki 54124, Greece
- Ioannis Vlahavas
  - Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki 54124, Greece
- Ioanna Chouvarda
  - Institute of Applied Biosciences, CERTH, Thessaloniki, Greece
  - Lab of Computing and Medical Informatics, Medical School, Aristotle University of Thessaloniki, Thessaloniki 54124, Greece