1
|
Ashayeri H, Sobhi N, Pławiak P, Pedrammehr S, Alizadehsani R, Jafarizadeh A. Transfer Learning in Cancer Genetics, Mutation Detection, Gene Expression Analysis, and Syndrome Recognition. Cancers (Basel) 2024; 16:2138. [PMID: 38893257 PMCID: PMC11171544 DOI: 10.3390/cancers16112138] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/05/2024] [Revised: 05/30/2024] [Accepted: 06/01/2024] [Indexed: 06/21/2024] Open
Abstract
Artificial intelligence (AI), encompassing machine learning (ML) and deep learning (DL), has revolutionized medical research, facilitating advancements in drug discovery and cancer diagnosis. ML identifies patterns in data, while DL employs neural networks for intricate processing. Predictive modeling challenges, such as data labeling, are addressed by transfer learning (TL), which leverages pre-existing models for faster training. TL shows potential in genetic research, improving tasks like gene expression analysis, mutation detection, genetic syndrome recognition, and genotype-phenotype association. This review explores the role of TL in overcoming challenges in mutation detection, genetic syndrome detection, gene expression analysis, and phenotype-genotype association. TL has shown effectiveness in various aspects of genetic research. TL enhances the accuracy and efficiency of mutation detection, aiding in the identification of genetic abnormalities. TL can improve the diagnostic accuracy of syndrome-related genetic patterns. Moreover, TL plays a crucial role in gene expression analysis by enabling accurate prediction of gene expression levels and their interactions. Additionally, TL enhances phenotype-genotype association studies by leveraging pre-trained models. In conclusion, TL enhances AI efficiency by improving mutation prediction, gene expression analysis, and genetic syndrome detection. Future studies should focus on increasing domain similarities, expanding databases, and incorporating clinical data for better predictions.
Affiliation(s)
- Hamidreza Ashayeri
  - Student Research Committee, Tabriz University of Medical Sciences, Tabriz 5165665811, Iran
- Navid Sobhi
  - Nikookari Eye Center, Tabriz University of Medical Sciences, Tabriz 5165665811, Iran
- Paweł Pławiak
  - Department of Computer Science, Faculty of Computer Science and Telecommunications, Cracow University of Technology, Warszawska 24, 31-155 Krakow, Poland
  - Institute of Theoretical and Applied Informatics, Polish Academy of Sciences, Bałtycka 5, 44-100 Gliwice, Poland
- Siamak Pedrammehr
  - Faculty of Design, Tabriz Islamic Art University, Tabriz 5164736931, Iran
  - Institute for Intelligent Systems Research and Innovation (IISRI), Deakin University, Burwood, VIC 3216, Australia
- Roohallah Alizadehsani
  - Institute for Intelligent Systems Research and Innovation (IISRI), Deakin University, Burwood, VIC 3216, Australia
- Ali Jafarizadeh
  - Nikookari Eye Center, Tabriz University of Medical Sciences, Tabriz 5165665811, Iran
  - Immunology Research Center, Tabriz University of Medical Sciences, Tabriz 5165665811, Iran
|
2
|
Nazari E, Naderi H, Tabadkani M, ArefNezhad R, Farzin AH, Dashtiahangar M, Khazaei M, Ferns GA, Mehrabian A, Tabesh H, Avan A. Breast cancer prediction using different machine learning methods applying multi factors. J Cancer Res Clin Oncol 2023; 149:17133-17146. [PMID: 37773467 DOI: 10.1007/s00432-023-05388-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/08/2023] [Accepted: 09/01/2023] [Indexed: 10/01/2023]
Abstract
OBJECTIVE Breast cancer (BC) is a multifactorial disease and one of the most common cancers globally. This study aimed to compare different machine learning (ML) techniques to develop a comprehensive breast cancer risk prediction model based on features from various factors. METHODS The population sample contained 810 records (115 cancer patients and 695 healthy individuals). Of 85 attributes, 45 were selected based on expert opinion. The selected attributes span genetic, biochemical, biomarker, gender, demographic, and pathological factors. Thirteen machine learning models were trained on the proposed attributes, and the attribute coefficients and internal relationships were calculated. RESULTS Compared with the other methods, random forest (RF) had higher performance (accuracy 99.26%, precision 99%, and area under the curve (AUC) 99%). Assessing the impact and correlation of variables using the RF method based on PCA indicated that pathology, biomarker, biochemistry, gene, and demographic factors, with coefficients of 0.35, 0.23, 0.15, 0.14, and 0.13 respectively, affected the risk of BC (r2 = 0.54). CONCLUSION Breast cancer has several risk factors, which medical experts use for early diagnosis. Identifying the related risk factors and their effects can therefore increase the accuracy of diagnosis. Considering a broad set of features for predicting breast cancer leads to a comprehensive prediction model. In this study, a breast cancer prediction model with 99.3% accuracy was developed from multifactorial features using the RF technique.
Affiliation(s)
- Elham Nazari
  - Faculty of Medicine, Department of Medical Informatics, Mashhad University of Medical Sciences, Mashhad, Iran
  - Metabolic Syndrome Research Center, Mashhad University of Medical Sciences, Mashhad, Iran
  - Department of Health Information Technology and Management, School of Allied Medical Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran
- Hamid Naderi
  - Faculty of Medicine, Department of Medical Informatics, Mashhad University of Medical Sciences, Mashhad, Iran
- Mahla Tabadkani
  - Student Research Committee, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran
  - Metabolic Syndrome Research Center, Mashhad University of Medical Sciences, Mashhad, Iran
- Reza ArefNezhad
  - Halal Research Center of IRI, FDA, Tehran, Iran
  - Department of Anatomy, School of Medicine, Shiraz University of Medical Sciences, Shiraz, Iran
- Majid Khazaei
  - Student Research Committee, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran
  - Metabolic Syndrome Research Center, Mashhad University of Medical Sciences, Mashhad, Iran
- Gordon A Ferns
  - Division of Medical Education, Brighton & Sussex Medical School, Falmer, Brighton, BN1 9PH, Sussex, UK
- Amin Mehrabian
  - Warwick Medical School, University of Warwick, Coventry, UK
- Hamed Tabesh
  - Faculty of Medicine, Department of Medical Informatics, Mashhad University of Medical Sciences, Mashhad, Iran
- Amir Avan
  - Metabolic Syndrome Research Center, Mashhad University of Medical Sciences, Mashhad, Iran
  - Faculty of Health, School of Biomedical Sciences, Queensland University of Technology, Brisbane, QLD, Australia
  - College of Medicine, University of Warith Al-Anbiyaa, Karbala, Iraq
|
3
|
Angelakis A, Soulioti I, Filippakis M. Diagnosis of acute myeloid leukaemia on microarray gene expression data using categorical gradient boosted trees. Heliyon 2023; 9:e20530. [PMID: 37860531 PMCID: PMC10582309 DOI: 10.1016/j.heliyon.2023.e20530] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/28/2023] [Revised: 09/27/2023] [Accepted: 09/28/2023] [Indexed: 10/21/2023] Open
Abstract
We define an iterative method for dimensionality reduction using categorical gradient boosted trees and Shapley values, and create four machine learning models that could potentially be used as diagnostic tests for acute myeloid leukaemia (AML). For the final CatBoost model we use a dataset of 2177 individuals, with 16 probe sets and age as features, to classify whether someone has AML or is healthy. The dataset is multicentric and consists of data from 27 organizations, 25 cities, 15 countries and 4 continents. Under ten-fold cross-validation, the performance of our final model is specificity 0.9909, sensitivity 0.9985, F1-score 0.9976 and ROC-AUC 0.9962. On an inference dataset the performance is specificity 0.9909, sensitivity 0.9969, F1-score 0.9969 and ROC-AUC 0.9939. To the best of our knowledge, this is the best performance reported in the literature for the diagnosis of AML, whether on similar or different data. Moreover, we found no bibliographic reference that associates AML or any other type of cancer with the 16 probe sets used as features in our final model.
Affiliation(s)
- Athanasios Angelakis
  - Department of Epidemiology and Data Science, Amsterdam University Medical Centers, Amsterdam Public Health Research Institute, University of Amsterdam Data Science Center, Netherlands
- Ioanna Soulioti
  - Department of Biology, National and Kapodistrian University of Athens, Greece
|
4
|
Rai HM. Cancer detection and segmentation using machine learning and deep learning techniques: a review. Multimedia Tools and Applications 2023. [DOI: 10.1007/s11042-023-16520-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/16/2022] [Revised: 05/12/2023] [Accepted: 08/13/2023] [Indexed: 09/16/2023]
|
5
|
Almadhor A, Sattar U, Al Hejaili A, Ghulam Mohammad U, Tariq U, Ben Chikha H. An efficient computer vision-based approach for acute lymphoblastic leukemia prediction. Front Comput Neurosci 2022; 16:1083649. [PMID: 36507304 PMCID: PMC9729282 DOI: 10.3389/fncom.2022.1083649] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/29/2022] [Accepted: 11/14/2022] [Indexed: 11/25/2022] Open
Abstract
Leukemia (blood cancer) arises when the number of white blood cells (WBCs) in the human body is imbalanced. Acute lymphocytic leukemia (ALL), in which the bone marrow produces many immature WBCs that kill healthy cells, affects people of all ages. Timely prediction of the disease can therefore increase the chance of survival by allowing the patient to begin therapy early. Manual prediction is very expensive and time-consuming, so automated prediction techniques are essential. In this research, we propose an automated ensemble prediction approach that uses four machine learning algorithms: K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Random Forest (RF), and Naive Bayes (NB). The C-NMC leukemia dataset from the Kaggle repository is used to predict leukemia. The dataset is divided into two classes: cancer and healthy cells. We perform data preprocessing steps, first cropping the images using minimum and maximum points. Features are extracted using pre-trained Convolutional Neural Network-based Deep Neural Network (DNN) architectures (VGG19, ResNet50, or ResNet101). Data scaling is performed using the MinMaxScaler normalization technique. Analysis of Variance (ANOVA), Recursive Feature Elimination (RFE), and Random Forest (RF) are used as feature selection techniques. Classification machine learning algorithms and ensemble voting are applied to the selected features. Results reveal that SVM, with 90.0% accuracy, outperforms the other algorithms.
Affiliation(s)
- Ahmad Almadhor
  - Department of Computer Engineering and Networks, College of Computer and Information Sciences, Jouf University, Sakaka, Saudi Arabia
- Usman Sattar
  - Department of Management Science, Beaconhouse National University, Lahore, Pakistan
- Abdullah Al Hejaili
  - Computer Science Department, Faculty of Computers & Information Technology, University of Tabuk, Tabuk, Saudi Arabia
- Uzma Ghulam Mohammad
  - Department of Computer Science and Software Engineering, International Islamic University, Islamabad, Pakistan
- Usman Tariq
  - Department of Management Information Systems, Prince Sattam Bin Abdulaziz University, Al-Kharj, Saudi Arabia
- Haithem Ben Chikha
  - Department of Computer Engineering and Networks, College of Computer and Information Sciences, Jouf University, Sakaka, Saudi Arabia
|
6
|
Blood cancer prediction using leukemia microarray gene data and hybrid logistic vector trees model. Sci Rep 2022; 12:1000. [PMID: 35046459 PMCID: PMC8770560 DOI: 10.1038/s41598-022-04835-6] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Received: 06/21/2021] [Accepted: 12/09/2021] [Indexed: 01/21/2023] Open
Abstract
Blood cancer has been a growing concern during the last decade and requires early diagnosis so that proper treatment can begin. The diagnosis process is costly and time-consuming, involving medical experts and several tests. Thus, an automatic diagnosis system for accurate prediction is of significant importance. Diagnosis of blood cancer using leukemia microarray gene data and machine learning has become an important area of medical research. Despite research efforts, accuracy and efficiency still need further improvement. This study proposes an approach for blood cancer prediction using supervised machine learning. For the current study, a leukemia microarray gene dataset containing 22,283 genes is used. ADASYN resampling and Chi-squared (Chi2) feature selection are used to address the problems of an imbalanced and high-dimensional dataset. ADASYN generates artificial data to balance the dataset across target classes, and Chi2 selects the best features out of 22,283 to train the learning models. For classification, a hybrid logistic vector trees classifier (LVTrees) is proposed, which combines logistic regression, a support vector classifier, and an extra tree classifier. Besides extensive experiments on the datasets, a performance comparison with state-of-the-art methods has been made to determine the significance of the proposed approach. LVTrees outperforms all other models with the ADASYN and Chi2 techniques, reaching 100% accuracy. Further, a statistical significance t-test is also performed to show the efficacy of the proposed approach. Results using k-fold cross-validation prove the supremacy of the proposed model.
|
7
|
Nazari E, Biviji R, Roshandel D, Pour R, Shahriari MH, Mehrabian A, Tabesh H. Decision fusion in healthcare and medicine: a narrative review. Mhealth 2022; 8:8. [PMID: 35178439 PMCID: PMC8800206 DOI: 10.21037/mhealth-21-15] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2021] [Accepted: 08/02/2021] [Indexed: 11/06/2022] Open
Abstract
OBJECTIVE To provide an overview of the decision fusion (DF) technique and describe its applications in healthcare and medicine at the prevention, diagnosis, treatment and administrative levels. BACKGROUND The rapid development of technology over the past 20 years has led to an explosion in data growth in various industries, including healthcare. Big data analysis within healthcare systems is essential for arriving at a value-based decision over a period of time. Diversity and uncertainty in big data analytics have made it impossible to analyze data using conventional data mining techniques, and thus alternative solutions are required. DF is a data fusion technique that could increase the accuracy of diagnosis and facilitate interpretation, summarization and sharing of information. METHODS We conducted a review of articles published between January 1980 and December 2020 from various databases such as Google Scholar, IEEE, PubMed, Science Direct, Scopus and Web of Science using the keywords decision fusion (DF), information fusion, healthcare, medicine and big data. A total of 141 articles were included in this narrative review. CONCLUSIONS Given the importance of big data analysis in reducing costs and improving the quality of healthcare, along with the potential role of DF in big data analysis, it is recommended to understand the full potential of this technique, including its advantages, challenges and applications, before its use. Future studies should focus on describing the methodology and types of data used for its applications within the healthcare sector.
Affiliation(s)
- Elham Nazari
  - Department of Medical Informatics, Mashhad University of Medical Sciences, Mashhad, Iran
- Rizwana Biviji
  - Science of Healthcare Delivery, College of Health Solutions, Arizona State University, Phoenix, AZ, USA
- Danial Roshandel
  - Centre for Ophthalmology and Visual Science (affiliated with the Lions Eye Institute), The University of Western Australia, Perth, Western Australia, Australia
- Reza Pour
  - Department of Computer Engineering, Azad University, Mashhad, Iran
- Mohammad Hasan Shahriari
  - Department of Health Information Technology and Management, School of Allied Medical Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran
- Amin Mehrabian
  - Warwick Medical School, University of Warwick, Coventry, UK
- Hamed Tabesh
  - Department of Medical Informatics, Mashhad University of Medical Sciences, Mashhad, Iran
|
8
|
Bailly A, Blanc C, Francis É, Guillotin T, Jamal F, Wakim B, Roy P. Effects of dataset size and interactions on the prediction performance of logistic regression and deep learning models. Computer Methods and Programs in Biomedicine 2022; 213:106504. [PMID: 34798408 DOI: 10.1016/j.cmpb.2021.106504] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 04/01/2021] [Accepted: 10/24/2021] [Indexed: 06/13/2023]
Abstract
BACKGROUND AND OBJECTIVE Machine learning and deep learning models are very powerful in predicting the presence of a disease. To achieve good predictions, those models require a certain amount of training data, but this amount i) is generally limited and difficult to obtain; and ii) increases with the complexity of the interactions between the outcome (disease presence) and the model variables. This study compares the ways training dataset size and interactions affect the performance of those prediction models. METHODS To compare the two influences, several datasets were simulated that differed in the number of observations and the complexity of the interactions between the variables and the outcome. A few logistic regressions and neural networks were trained on the simulated datasets, and their performance was evaluated by cross-validation and compared using accuracy, F1 score, and AUC metrics. RESULTS Models trained on simulated datasets without interactions provided good results: AUC close to 0.80 with either logistic regression or neural networks. Models trained on simulated datasets with order 2 interactions also led to AUCs close to 0.80 with either logistic regression or neural networks. Models trained on simulated datasets with order 4 interactions led to AUCs close to 0.80 with neural networks and 0.85 with penalized logistic regressions. Whatever the interaction order, increasing the dataset size did not significantly affect model performance, especially that of machine learning models. CONCLUSION Machine learning models were less influenced by the dataset size but needed interaction terms to achieve good performance, whereas deep learning models could achieve good performance without interaction terms. In conclusion, in the scenarios considered, well-specified machine learning models outperformed deep learning models.
Affiliation(s)
- Alexandre Bailly
  - Everteam Software, Research and Development Lab, 17 quai Joseph Gillet, Lyon, France; Université de Lyon, Lyon, France; Université Lyon 1, Villeurbanne, France; Service de Biostatistique-Bioinformatique, Hospices Civils de Lyon, Lyon, France; Équipe Biostatistique-Santé, Laboratoire de Biométrie et Biologie Évolutive, CNRS UMR 5558 Villeurbanne, France
- Corentin Blanc
  - Everteam Software, Research and Development Lab, 17 quai Joseph Gillet, Lyon, France; Université de Lyon, Lyon, France; Université Lyon 1, Villeurbanne, France; Service de Biostatistique-Bioinformatique, Hospices Civils de Lyon, Lyon, France; Équipe Biostatistique-Santé, Laboratoire de Biométrie et Biologie Évolutive, CNRS UMR 5558 Villeurbanne, France
- Élie Francis
  - Everteam Software, Research and Development Lab, 17 quai Joseph Gillet, Lyon, France
- Thierry Guillotin
  - Everteam Software, Research and Development Lab, 17 quai Joseph Gillet, Lyon, France
- Pascal Roy
  - Université de Lyon, Lyon, France; Université Lyon 1, Villeurbanne, France; Service de Biostatistique-Bioinformatique, Hospices Civils de Lyon, Lyon, France; Équipe Biostatistique-Santé, Laboratoire de Biométrie et Biologie Évolutive, CNRS UMR 5558 Villeurbanne, France
|
9
|
Deep Learning in Cancer Diagnosis and Prognosis Prediction: A Minireview on Challenges, Recent Trends, and Future Directions. Computational and Mathematical Methods in Medicine 2021; 2021:9025470. [PMID: 34754327 PMCID: PMC8572604 DOI: 10.1155/2021/9025470] [Citation(s) in RCA: 34] [Impact Index Per Article: 11.3] [Received: 08/20/2021] [Revised: 09/30/2021] [Accepted: 10/05/2021] [Indexed: 12/30/2022]
Abstract
Deep learning (DL) is a branch of machine learning and artificial intelligence that has been applied to many areas in different domains such as health care and drug design. Cancer prognosis estimates the ultimate fate of a cancer subject and provides survival estimation of the subjects. An accurate and timely diagnostic and prognostic decision will greatly benefit cancer subjects. DL has emerged as a technology of choice due to the availability of high computational resources. The main components in a standard computer-aided design (CAD) system are preprocessing, feature recognition, extraction and selection, categorization, and performance assessment. Reduction of costs associated with sequencing systems offers a myriad of opportunities for building precise models for cancer diagnosis and prognosis prediction. In this survey, we provided a summary of current works where DL has helped to determine the best models for the cancer diagnosis and prognosis prediction tasks. DL is a generic model requiring minimal data manipulations and achieves better results while working with enormous volumes of data. Aims are to scrutinize the influence of DL systems using histopathology images, present a summary of state-of-the-art DL methods, and give directions to future researchers to refine the existing methods.
|