1
Shin H, Oh S. An effective heuristic for developing hybrid feature selection in high dimensional and low sample size datasets. BMC Bioinformatics 2024; 25:390. [PMID: 39722052 DOI: 10.1186/s12859-024-06017-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/14/2024] [Accepted: 12/17/2024] [Indexed: 12/28/2024] Open
Abstract
BACKGROUND: High-dimensional datasets with low sample sizes (HDLSS) are pivotal in the fields of biology and bioinformatics. One of the core objectives in HDLSS analysis is to select the most informative features and discard redundant or irrelevant ones. This is particularly crucial in bioinformatics, where accurate feature (gene) selection can lead to breakthroughs in drug development and provide insights into disease diagnostics. Despite its importance, identifying optimal features is still a significant challenge in HDLSS. RESULTS: To address this challenge, we propose an effective feature selection method that combines gradual permutation filtering with a heuristic tribrid search strategy, specifically tailored for HDLSS contexts. The proposed method considers inter-feature interactions and leverages feature rankings during the search process. In addition, a new performance metric for HDLSS that evaluates both the number and quality of selected features is suggested. In a comparison with existing methods on the benchmark dataset, the proposed method reduced the average number of selected features from 37.8 to 5.5 and improved the performance of the prediction model, based on the selected features, from 0.855 to 0.927. CONCLUSIONS: The proposed method effectively selects a small number of important features and achieves high prediction performance.
Affiliation(s)
- Hyunseok Shin
- Department of Computer Science, Dankook University, Youngin, Gyeonggi, South Korea
- Sejong Oh
- Department of Software Science, Dankook University, Youngin, Gyeonggi, South Korea.
2
Gürkan Kuntalp D, Özcan N, Düzyel O, Kababulut FY, Kuntalp M. A Comparative Study of Metaheuristic Feature Selection Algorithms for Respiratory Disease Classification. Diagnostics (Basel) 2024; 14:2244. [PMID: 39410648 PMCID: PMC11475976 DOI: 10.3390/diagnostics14192244] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/26/2024] [Revised: 10/02/2024] [Accepted: 10/07/2024] [Indexed: 10/20/2024] Open
Abstract
The correct diagnosis and early treatment of respiratory diseases can significantly improve the health status of patients, reduce healthcare expenses, and enhance quality of life. Therefore, there has been extensive interest in developing automatic respiratory disease detection systems. Most recent methods for detecting respiratory disease use machine and deep learning algorithms. The success of these machine learning methods depends heavily on the selection of proper features to be used in the classifier. Although metaheuristic-based feature selection methods have been successful in addressing difficulties presented by high-dimensional medical data in various biomedical classification tasks, there is not much research on the utilization of metaheuristic methods in respiratory disease classification. This paper aims to conduct a detailed and comparative analysis of six widely used metaheuristic optimization methods using eight different transfer functions in respiratory disease classification. For this purpose, two different classification cases were examined: binary and multi-class. The findings demonstrate that metaheuristic algorithms using correct transfer functions could effectively reduce data dimensionality while enhancing classification accuracy.
Affiliation(s)
- Damla Gürkan Kuntalp
- Department of Electrical and Electronics Engineering, Dokuz Eylül University, İzmir 35160, Türkiye
- Nermin Özcan
- Department of Biomedical Engineering, İskenderun Technical University, İskenderun 31200, Türkiye
- Okan Düzyel
- Department of Electrical and Electronics Engineering, İzmir Institute of Technology, İzmir 35433, Türkiye
- Mehmet Kuntalp
- Department of Electrical and Electronics Engineering, Dokuz Eylül University, İzmir 35160, Türkiye
3
Jeon J, Suk Y, Kim SC, Jo HY, Kim K, Jung I. Denoiseit: denoising gene expression data using rank based isolation trees. BMC Bioinformatics 2024; 25:271. [PMID: 39169300 PMCID: PMC11340143 DOI: 10.1186/s12859-024-05899-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/28/2024] [Accepted: 08/13/2024] [Indexed: 08/23/2024] Open
Abstract
BACKGROUND: Selecting informative genes or eliminating uninformative ones before any downstream gene expression analysis is a standard task with great impact on the results. A carefully curated gene set significantly enhances the likelihood of identifying meaningful biomarkers. METHOD: In contrast to the conventional forward gene search methods that focus on selecting highly informative genes, we propose a backward search method, DenoiseIt, that aims to remove potential outlier genes, yielding a robust gene set with reduced noise. The gene set constructed by DenoiseIt is expected to capture biologically significant genes while pruning irrelevant ones to the greatest extent possible. Therefore, it also enhances the quality of downstream comparative gene expression analysis. DenoiseIt utilizes non-negative matrix factorization in conjunction with isolation forests to identify outlier rank features and remove their associated genes. RESULTS: DenoiseIt was applied to both bulk and single-cell RNA-seq data collected from TCGA and a COVID-19 cohort to show that it proficiently identified and removed genes exhibiting expression anomalies confined to specific samples rather than a known group. DenoiseIt was also shown to reduce the level of technical noise while preserving a higher proportion of biologically relevant genes compared to existing methods. The DenoiseIt software is publicly available on GitHub at https://github.com/cobi-git/DenoiseIt.
Affiliation(s)
- Jaemin Jeon
- Interdisciplinary Program in Bioinformatics, Seoul National University, Gwanak-gu, Seoul, 08826, Republic of Korea
- Youjeong Suk
- School of Computer Science and Engineering, Kyungpook National University, Buk-gu, Daegu, 41566, Republic of Korea
- Sang Cheol Kim
- Division of Healthcare and Artificial Intelligence, Department of Precision Medicine, Korea National Institute of Health, Korea Disease Control and Prevention Agency, Osong, CheongJu, 28159, Republic of Korea
- Hye-Yeong Jo
- Division of Healthcare and Artificial Intelligence, Department of Precision Medicine, Korea National Institute of Health, Korea Disease Control and Prevention Agency, Osong, CheongJu, 28159, Republic of Korea
- Kwangsoo Kim
- Department of Transdisciplinary Medicine, Seoul National University Hospital, Jongno-gu, Seoul, 03080, Republic of Korea.
- Department of Medicine, Seoul National University, Jongno-gu, Seoul, 03080, Republic of Korea.
- Inuk Jung
- School of Computer Science and Engineering, Kyungpook National University, Buk-gu, Daegu, 41566, Republic of Korea.
4
Mei Y, Li M, Li Y, Sheng X, Zhu C, Fan X, Zhang L, Pan A. Early Warning Models Using Machine Learning to Predict Sepsis-Associated Chronic Critical Illness: A Study Based on the Medical Information Mart for Intensive Care Database. Cureus 2024; 16:e67121. [PMID: 39290928 PMCID: PMC11407544 DOI: 10.7759/cureus.67121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 08/18/2024] [Indexed: 09/19/2024] Open
Abstract
Background: Patients with chronic critical illness (CCI) experience poor prognoses and incur high medical costs. However, there is currently limited clinical awareness of sepsis-associated CCI, resulting in insufficient vigilance. Therefore, it is necessary to build a machine learning model that can predict whether sepsis patients will develop CCI. Methods: Clinical data on 19,077 sepsis patients from the Medical Information Mart for Intensive Care IV (MIMIC-IV) database were analyzed. Predictive factors were identified using the Student's t-test, Mann-Whitney U test, or χ² test. Six machine learning classification models, namely, the logistic regression, support vector machine, decision tree, random forest, extreme gradient enhancement, and artificial neural network, were established. The optimal model was selected on the basis of its performance. Calibration curves were used to evaluate the accuracy of model classification, while the external validation dataset was used to evaluate the performance of the model. Results: Thirty-seven characteristics, such as elevated alanine aminotransferase, rapid heart rate, and high Logistic Organ Dysfunction System scores, were identified as risk factors for developing CCI. The area under the receiver operating characteristic curve (AUROC) values for all models were above 0.73 on the internal test set. Among them, the extreme gradient enhancement model exhibited superior performance (F1 score = 0.91, AUROC = 0.91, Brier score = 0.052). It also exhibited stable prediction performance on the external validation set (AUROC = 0.72). Conclusion: A machine learning model was established to predict whether sepsis patients will develop CCI. It can provide useful predictive information for clinical decision-making.
Affiliation(s)
- Yulin Mei
- Department of Critical Care Medicine, Wannan Medical College, Wuhu, CHN
- Meng Li
- Department of Intensive Care Unit, First Affiliated Hospital of Anhui Medical University, Hefei, CHN
- Yuqi Li
- Department of Critical Care Medicine, Wannan Medical College, Wuhu, CHN
- Ximei Sheng
- Department of Critical Care Medicine, Wannan Medical College, Wuhu, CHN
- Chunyan Zhu
- Department of Critical Care Medicine, The First Affiliated Hospital of USTC, Division of Life Science and Medicine, University of Science and Technology of China, Hefei, CHN
- Xiaoqin Fan
- Department of Critical Care Medicine, The First Affiliated Hospital of USTC, Division of Life Science and Medicine, University of Science and Technology of China, Hefei, CHN
- Lei Zhang
- Department of Critical Care Medicine, The First Affiliated Hospital of USTC, Division of Life Science and Medicine, University of Science and Technology of China, Hefei, CHN
- Aijun Pan
- Department of Critical Care Medicine, The First Affiliated Hospital of USTC, Division of Life Science and Medicine, University of Science and Technology of China, Hefei, CHN
5
Pradhan UK, Meher PK, Naha S, Sharma NK, Agarwal A, Gupta A, Parsad R. DBPMod: a supervised learning model for computational recognition of DNA-binding proteins in model organisms. Brief Funct Genomics 2024; 23:363-372. [PMID: 37651627 DOI: 10.1093/bfgp/elad039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/08/2023] [Revised: 08/09/2023] [Accepted: 08/15/2023] [Indexed: 09/02/2023] Open
Abstract
DNA-binding proteins (DBPs) play critical roles in many biological processes, including gene expression, DNA replication, recombination and repair. Understanding the molecular mechanisms underlying these processes depends on the precise identification of DBPs. In recent times, several computational methods have been developed to identify DBPs. However, because these models are generic, they are unable to identify species-specific DBPs with high accuracy. Therefore, a species-specific computational model is needed to predict species-specific DBPs. In this paper, we introduce the computational DBPMod method, which makes use of a machine learning approach to identify species-specific DBPs. For prediction, both shallow learning algorithms and deep learning models were used, with shallow learning models achieving higher accuracy. Additionally, the evolutionary features outperformed sequence-derived features in terms of accuracy. Five model organisms, namely Caenorhabditis elegans, Drosophila melanogaster, Escherichia coli, Homo sapiens and Mus musculus, were used to assess the performance of DBPMod. Five-fold cross-validation and independent test set analyses were used to evaluate the prediction accuracy in terms of area under receiver operating characteristic curve (auROC) and area under precision-recall curve (auPRC), which were found to be ~89-92% and ~89-95%, respectively. The comparative results demonstrate that DBPMod outperforms 12 current state-of-the-art computational approaches in identifying the DBPs for all five model organisms. We further developed a DBPMod web server, publicly available at https://iasri-sg.icar.gov.in/dbpmod/, to make it easier for researchers to detect DBPs. DBPMod is expected to be an invaluable tool for discovering DBPs, supplementing the current experimental and computational methods.
Affiliation(s)
- Upendra K Pradhan
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
- Prabina K Meher
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
- Sanchita Naha
- Division of Computer Applications, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
- Nitesh K Sharma
- Titus Family Department of Clinical Pharmacy, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, University of Southern California, 1540 Alcazar Street, Los Angeles, CA 90033, USA
- Aarushi Agarwal
- Amity Institute of Biotechnology, Amity University, Noida, Uttar Pradesh 201313, India
- Ajit Gupta
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
- Rajender Parsad
- ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
6
Chen L, Shao X, Yu P. Machine learning prediction models for diabetic kidney disease: systematic review and meta-analysis. Endocrine 2024; 84:890-902. [PMID: 38141061 DOI: 10.1007/s12020-023-03637-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/18/2023] [Accepted: 11/28/2023] [Indexed: 12/24/2023]
Abstract
BACKGROUND: Machine learning is increasingly recognized as a viable approach for identifying risk factors associated with diabetic kidney disease (DKD). However, the current state of real-world research lacks a comprehensive systematic analysis of the predictive performance of machine learning (ML) models for DKD. OBJECTIVES: The objectives of this study were to systematically summarize the predictive capabilities of various ML methods in forecasting the onset and the advancement of DKD, and to provide a basic outline for ML methods in DKD. METHODS: We searched mainstream databases, including PubMed, Web of Science, Embase, and MEDLINE, to obtain the eligible studies. Subsequently, we categorized various ML techniques and analyzed the differences in their performance in predicting DKD. RESULTS: Logistic regression (LR) was the prevailing ML method, yielding an overall pooled area under the receiver operating characteristic curve (AUROC) of 0.83. The non-LR models also performed well, with an overall pooled AUROC of 0.80. Our t-tests showed no statistically significant difference in predictive ability between LR and non-LR models (t = 1.6767, p > 0.05). CONCLUSION: All ML prediction models yielded relatively satisfactory DKD prediction ability, with AUROCs greater than 0.7. However, we found no evidence that non-LR models outperformed the LR model. LR exhibits high performance in practice while being known for its algorithmic simplicity and computational efficiency compared to other models. Thus, LR may be considered a cost-effective ML model in practice.
Affiliation(s)
- Lianqin Chen
- NHC Key Laboratory of Hormones and Development, Tianjin Key Laboratory of Metabolic Diseases, Chu Hsien-I Memorial Hospital & Tianjin Institute of Endocrinology, Tianjin Medical University, Tianjin, 300134, China
- Xian Shao
- NHC Key Laboratory of Hormones and Development, Tianjin Key Laboratory of Metabolic Diseases, Chu Hsien-I Memorial Hospital & Tianjin Institute of Endocrinology, Tianjin Medical University, Tianjin, 300134, China
- Pei Yu
- NHC Key Laboratory of Hormones and Development, Tianjin Key Laboratory of Metabolic Diseases, Chu Hsien-I Memorial Hospital & Tianjin Institute of Endocrinology, Tianjin Medical University, Tianjin, 300134, China.
7
Verma RK, Lokhande KB, Srivastava PK, Singh A. Elucidating B4GALNT1 as potential biomarker in hepatocellular carcinoma using machine learning models and mutational dynamics explored through MD simulation. Informatics in Medicine Unlocked 2024; 48:101514. [DOI: 10.1016/j.imu.2024.101514] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/03/2025] Open
8
Yoon SJ, Kim D, Park SH, Han JH, Lim J, Shin JE, Eun HS, Lee SM, Park MS. Prediction of Postnatal Growth Failure in Very Low Birth Weight Infants Using a Machine Learning Model. Diagnostics (Basel) 2023; 13:3627. [PMID: 38132211 PMCID: PMC10743090 DOI: 10.3390/diagnostics13243627] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2023] [Revised: 12/04/2023] [Accepted: 12/06/2023] [Indexed: 12/23/2023] Open
Abstract
Accurate prediction of postnatal growth failure (PGF) can be beneficial for early intervention and prevention. We aimed to develop a machine learning model to predict PGF at discharge among very low birth weight (VLBW) infants using extreme gradient boosting. A total of 729 VLBW infants, born between 2013 and 2017 in four hospitals, were included. PGF was defined as a decrease in z-score between birth and discharge that was greater than 1.28. Feature selection and addition were performed to improve the accuracy of prediction at four different time points: 0, 7, 14, and 28 days after birth. A total of 12 features with high contribution at all time points, as ranked by feature importance, were selected, and the model showed good performance, with an area under the receiver operating characteristic curve (AUROC) of 0.78 at 7 days. The 12 features were sex, gestational age, birth weight, small for gestational age, maternal hypertension, respiratory distress syndrome, duration of invasive ventilation, duration of non-invasive ventilation, patent ductus arteriosus, sepsis, use of parenteral nutrition, and reaching full enteral nutrition. After adding weight change to these 12 features, the AUROC at 7 days after birth was 0.84. Our prediction model for PGF performed well at early detection. Its potential clinical application as a supplemental tool could be helpful for reducing PGF and improving child health.
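The PGF definition quoted in this abstract can be expressed as a one-line labeling rule. The snippet below is a minimal sketch of that rule only; the function name `has_pgf` and its argument names are hypothetical and are not taken from the paper, and the abstract does not state which growth reference was used to compute the z-scores.

```python
def has_pgf(z_birth: float, z_discharge: float, threshold: float = 1.28) -> bool:
    """Label postnatal growth failure from two z-scores.

    Hypothetical helper: returns True when the z-score dropped by more
    than `threshold` between birth and discharge, following the
    definition quoted in the abstract (decrease greater than 1.28).
    """
    decrease = z_birth - z_discharge
    return decrease > threshold


# Example: a drop from 0.0 to -1.5 is a decrease of 1.5, which exceeds 1.28.
print(has_pgf(0.0, -1.5))
# Example: a drop from -0.5 to -1.0 is a decrease of 0.5, which does not.
print(has_pgf(-0.5, -1.0))
```

The comparison is strict, so a decrease of exactly 1.28 is not labeled as PGF under this reading of the definition.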
Affiliation(s)
- So Jin Yoon
- Department of Pediatrics, Yonsei University College of Medicine, Seoul 03722, Republic of Korea
- Donghyun Kim
- Department of Advanced General Dentistry, Yonsei University College of Dentistry, Seoul 03722, Republic of Korea
- InVisionLab Inc., Seoul 05854, Republic of Korea
- Sook Hyun Park
- Department of Pediatrics, Yonsei University College of Medicine, Seoul 03722, Republic of Korea
- Jung Ho Han
- Department of Pediatrics, Yonsei University College of Medicine, Seoul 03722, Republic of Korea
- Joohee Lim
- Department of Pediatrics, Yonsei University College of Medicine, Seoul 03722, Republic of Korea
- Jeong Eun Shin
- Department of Pediatrics, Yonsei University College of Medicine, Seoul 03722, Republic of Korea
- Ho Seon Eun
- Department of Pediatrics, Yonsei University College of Medicine, Seoul 03722, Republic of Korea
- Soon Min Lee
- Department of Pediatrics, Yonsei University College of Medicine, Seoul 03722, Republic of Korea
- Min Soo Park
- Department of Pediatrics, Yonsei University College of Medicine, Seoul 03722, Republic of Korea
9
Houssein EH, Samee NA, Mahmoud NF, Hussain K. Dynamic Coati Optimization Algorithm for Biomedical Classification Tasks. Comput Biol Med 2023; 164:107237. [PMID: 37467535 DOI: 10.1016/j.compbiomed.2023.107237] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 05/12/2023] [Revised: 06/13/2023] [Accepted: 07/07/2023] [Indexed: 07/21/2023]
Abstract
Medical datasets are primarily made up of numerous pointless and redundant elements in a collection of patient records. None of these characteristics are necessary for a medical decision-making process. Moreover, a large amount of data leads to increased dimensionality and decreased classifier performance in machine learning. Numerous approaches have recently been proposed to address this issue, and the results indicate that feature selection can be a successful remedy. To meet the various needs of input patterns, medical diagnostic tasks typically involve learning a suitable categorization model. The classification performance of the k-Nearest Neighbors (kNN) classifier is typically decreased by an abundance of irrelevant features among the input variables. To simplify the kNN classifier, essential attributes of the input variables have been searched using the feature selection approach. This paper presents the Coati Optimization Algorithm in a dynamic form (DCOA) as a feature selection technique where each iteration of the optimization process involves the introduction of a different feature. We enhance the exploration and exploitation capability of DCOA by employing dynamic opposing candidate solutions. The most impressive feature of DCOA is that, unlike most popular metaheuristic algorithms, it does not require any preparatory parameter fine-tuning. The CEC'22 test suite and nine medical datasets with various dimension sizes were used to evaluate the performance of the original COA and the proposed dynamic version. The statistical results were validated using the Bonferroni-Dunn test and Kendall's W test and showed the superiority of DCOA over seven well-known metaheuristic algorithms with an overall accuracy of 89.7%, a feature selection of 24%, a sensitivity of 93.35%, a specificity of 96.81%, and a precision of 93.90%.
Affiliation(s)
- Essam H Houssein
- Faculty of Computers and Information, Minia University, Minia, Egypt.
- Nagwan Abdel Samee
- Department of Information Technology, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia.
- Noha F Mahmoud
- Rehabilitation Sciences Department, Health and Rehabilitation Sciences College, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia.
- Kashif Hussain
- Department of Science and Engineering, Solent University, East Park Terrace, Southampton, SO14 0YN, United Kingdom.
10
Fleck JL, Hooijenga D, Phan R, Xie X, Augusto V, Heudel PE. Adjuvant therapeutic strategy decision support for an elderly population with localized breast cancer: A monocentric cohort retrospective study. PLoS One 2023; 18:e0290566. [PMID: 37616325 PMCID: PMC10449163 DOI: 10.1371/journal.pone.0290566] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/04/2022] [Accepted: 08/09/2023] [Indexed: 08/26/2023] Open
Abstract
Guidelines for the management of elderly patients with early breast cancer are scarce. Additional adjuvant systemic treatment to surgery for early breast cancer in elderly populations is challenged by increasing comorbidities with age. In non-metastatic settings, treatment decisions are often made under considerable uncertainty; this commonly leads to undertreatment and, consequently, poorer outcomes. This study aimed to develop a decision support tool that can help to identify candidate adjuvant post-surgery treatment schemes for elderly breast cancer patients based on tumor and patient characteristics. Our approach was to generate predictions of patient outcomes for different courses of action; these predictions can, in turn, be used to inform clinical decisions for new patients. We used a cohort of elderly patients (≥ 70 years) who underwent surgery with curative intent for early breast cancer to train the models. We tested seven classification algorithms using 5-fold cross-validation, with 80% of the data being randomly selected for training and the remaining 20% for testing. We assessed model performance using accuracy, precision, recall, F1-score, and AUC score. We used an autoencoder to perform dimensionality reduction prior to classification. We observed consistently better performance using logistic regression and linear discriminant analysis models when compared to the other models we tested. Classification performance generally improved when an autoencoder was used, except for when we predicted the need for adjuvant treatment. We obtained overall best results using a logistic regression model without autoencoding to predict the need for adjuvant treatment (F1-score = 0.869).
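The evaluation protocol mentioned in this abstract (5-fold cross-validation, so each fold trains on 80% of the data and tests on the remaining 20%) can be illustrated with a short index-splitting routine. This is a minimal sketch using only the Python standard library; the function name `k_fold_indices`, the `seed` parameter, and the sample count of 100 are assumptions for illustration and do not come from the study.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Hypothetical helper: indices are shuffled once, dealt into k
    near-equal folds, and each fold serves as the test set exactly once.
    With k=5, each split uses about 80% of samples for training and
    about 20% for testing.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Print the train/test sizes of each of the five splits.
for train_idx, test_idx in k_fold_indices(100, k=5):
    print(len(train_idx), len(test_idx))
```

Shuffling before splitting matches the "randomly selected" wording in the abstract; whether the authors stratified by class is not stated, so this sketch does not stratify.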
Affiliation(s)
- Julia L. Fleck
- Mines Saint-Etienne, Univ Clermont Auvergne, CNRS, UMR 6158 LIMOS, Centre CIS, Saint-Etienne, France
- Daniëlle Hooijenga
- Mines Saint-Etienne, Univ Clermont Auvergne, CNRS, UMR 6158 LIMOS, Centre CIS, Saint-Etienne, France
- Raksmey Phan
- Mines Saint-Etienne, Univ Clermont Auvergne, CNRS, UMR 6158 LIMOS, Centre CIS, Saint-Etienne, France
- Xiaolan Xie
- Mines Saint-Etienne, Univ Clermont Auvergne, CNRS, UMR 6158 LIMOS, Centre CIS, Saint-Etienne, France
- Vincent Augusto
- Mines Saint-Etienne, Univ Clermont Auvergne, CNRS, UMR 6158 LIMOS, Centre CIS, Saint-Etienne, France
11
Rajput A, Bhamare KT, Thakur A, Kumar M. Anti-Biofilm: Machine Learning Assisted Prediction of IC50 Activity of Chemicals Against Biofilms of Microbes Causing Antimicrobial Resistance and Implications in Drug Repurposing. J Mol Biol 2023; 435:168115. [PMID: 37356913 DOI: 10.1016/j.jmb.2023.168115] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2022] [Revised: 04/06/2023] [Accepted: 04/14/2023] [Indexed: 06/27/2023]
Abstract
Biofilms are one of the leading causes of antibiotic resistance. They act as a physical barrier against the human immune system and drugs. The use of anti-biofilm agents helps in tackling the menace of antibiotic resistance. The identification of efficient anti-biofilm chemicals remains a challenge. Therefore, in this study, we developed 'anti-Biofilm', a machine learning technique (MLT) based predictive algorithm for identifying and analyzing the biofilm inhibition of small molecules. The algorithm was developed using experimentally validated anti-biofilm compounds with half maximal inhibitory concentration (IC50) values extracted from the aBiofilm resource. Out of the five MLTs, the Support Vector Machine performed best, with a Pearson's correlation coefficient of 0.75 on the training/testing data set. The robustness of the developed model was further checked using an independent validation dataset. While analyzing the chemical diversity of the anti-biofilm compounds, we observed that they occupy diverse chemical spaces with parent molecules like furanone, urea, phenolic acids, quinolines, and many more. The use of diverse chemicals as input further signifies the robustness of our predictive models. The three best-performing machine learning models were implemented as a user-friendly 'anti-Biofilm' web server (https://bioinfo.imtech.res.in/manojk/antibiofilm/), along with several other modules that make 'anti-Biofilm' a comprehensive platform. Therefore, we hope that our initiative will be helpful for the scientific community engaged in identifying effective anti-biofilm agents to target the problem of antimicrobial resistance.
Affiliation(s)
- Akanksha Rajput
- Virology Unit and Bioinformatics Centre, Institute of Microbial Technology, Council of Scientific and Industrial Research (CSIR), Sector 39A, Chandigarh 160036, India
- Kailash T Bhamare
- Virology Unit and Bioinformatics Centre, Institute of Microbial Technology, Council of Scientific and Industrial Research (CSIR), Sector 39A, Chandigarh 160036, India; Academy of Scientific and Innovative Research (AcSIR), Ghaziabad 201002, India
- Anamika Thakur
- Virology Unit and Bioinformatics Centre, Institute of Microbial Technology, Council of Scientific and Industrial Research (CSIR), Sector 39A, Chandigarh 160036, India; Academy of Scientific and Innovative Research (AcSIR), Ghaziabad 201002, India
- Manoj Kumar
- Virology Unit and Bioinformatics Centre, Institute of Microbial Technology, Council of Scientific and Industrial Research (CSIR), Sector 39A, Chandigarh 160036, India; Academy of Scientific and Innovative Research (AcSIR), Ghaziabad 201002, India.
12
Tripathy G, Sharaff A. AEGA: enhanced feature selection based on ANOVA and extended genetic algorithm for online customer review analysis. The Journal of Supercomputing 2023; 79:1-30. [PMID: 37359344 PMCID: PMC10031171 DOI: 10.1007/s11227-023-05179-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 03/07/2023] [Indexed: 06/28/2023]
Abstract
Sentiment analysis involves extricating and interpreting people's views, feelings, beliefs, etc., about diverse actualities such as services, goods, and topics. People intend to investigate the users' opinions on the online platform to achieve better performance. Regardless, the high-dimensional feature set in an online review study affects the interpretation of classification. Several studies have implemented different feature selection techniques; however, getting a high accuracy with a very minimal number of features is yet to be accomplished. This paper develops an effective hybrid approach based on an enhanced genetic algorithm (GA) and analysis of variance (ANOVA) to achieve this purpose. To beat the local minima convergence problem, this paper uses a unique two-phase crossover and impressive selection approach, gaining high exploration and fast convergence of the model. The use of ANOVA drastically reduces the feature size to minimize the computational burden of the model. Experiments are performed to estimate the algorithm performance using different conventional classifiers and algorithms like GA, Particle Swarm Optimization (PSO), Recursive Feature Elimination (RFE), Random Forest, ExtraTree, AdaBoost, GradientBoost, and XGBoost. The proposed novel approach gives impressive results using the Amazon Review dataset with an accuracy of 78.60 %, F1 score of 79.38 %, and an average precision of 0.87, and the Restaurant Customer Review dataset with an accuracy of 77.70 %, F1 score of 78.24 %, and average precision of 0.89 as compared to other existing algorithms. The result shows that the proposed model outperforms other algorithms with nearly 45 and 42% fewer features for the Amazon Review and Restaurant Customer Review datasets.
Affiliation(s)
- Gyananjaya Tripathy
- Department of Computer Science and Engineering, National Institute of Technology, Raipur, Chhattisgarh 492010 India
| | - Aakanksha Sharaff
- Department of Computer Science and Engineering, National Institute of Technology, Raipur, Chhattisgarh 492010 India
| |
|
13
|
Wang X, Ren H, Ren J, Song W, Qiao Y, Ren Z, Zhao Y, Linghu L, Cui Y, Zhao Z, Chen L, Qiu L. Machine learning-enabled risk prediction of chronic obstructive pulmonary disease with unbalanced data. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2023; 230:107340. [PMID: 36640604 DOI: 10.1016/j.cmpb.2023.107340] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/09/2022] [Revised: 11/25/2022] [Accepted: 01/04/2023] [Indexed: 06/17/2023]
Abstract
BACKGROUND AND OBJECTIVE Because the early symptoms of chronic obstructive pulmonary disease (COPD) are not obvious, patients are not easily identified, leading to poorly timed prevention and treatment. In the present study, machine learning (ML) methods were employed to construct a risk prediction model for COPD to improve its prediction efficiency. METHODS We collected data from a sample of 5807 cases with a complete COPD diagnosis from the 2019 COPD Surveillance Program in Shanxi Province and extracted 34 potentially relevant variables from the dataset. First, we used feature selection methods (i.e., Generalized elastic net, Lasso and Adaptive lasso) to select ten variables. Afterwards, we employed supervised classifiers for class-imbalanced data by combining cost-sensitive learning and SMOTE resampling methods with the ML methods (Logistic Regression, SVM, Random Forest, XGBoost, LightGBM, NGBoost and Stacking), respectively. Last, we assessed their performance. RESULTS Frequent cough at age 14 and before and nine other variables were significant parameters for COPD. The Stacking heterogeneous ensemble model showed relatively good performance in the unbalanced datasets. Logistic Regression with class weighting achieved the best classification performance in the balanced data when the composite indicators (AUC, F1-Score and G-mean) were used as criteria for model comparison. The values of F1-Score and G-mean for the top three ML models were 0.290/0.660 for Logistic Regression with class weighting, 0.288/0.649 for Stacking with synthetic minority oversampling technique (SMOTE), and 0.285/0.648 for LightGBM with SMOTE. CONCLUSIONS This paper, which combines feature selection methods, unbalanced data processing methods and machine learning methods with data from disease surveillance questionnaires and physical measurements to identify people at risk of COPD, concludes that machine learning models based on survey questionnaires could provide automated identification of patients at risk of COPD and a simple and scientific aid for early identification of COPD.
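The F1-Score and G-mean used above as comparison criteria can both be computed from a confusion matrix. The following is a minimal sketch using the standard definitions; the function name `f1_and_gmean` and the counts in the example are hypothetical and are not taken from the study.

```python
import math


def f1_and_gmean(tp, fp, fn, tn):
    """Return (F1, G-mean) for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0        # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    gmean = math.sqrt(recall * specificity)
    return f1, gmean


# Hypothetical imbalanced case: 100 positives, 900 negatives.
f1, gmean = f1_and_gmean(tp=60, fp=240, fn=40, tn=660)
```

With these illustrative counts precision is 0.2 and recall is 0.6, so F1 is 0.3 while G-mean is about 0.66. This shows how, on imbalanced data, F1 can stay low because of many false positives even when sensitivity and specificity are both moderate.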
Affiliation(s)
- Xuchun Wang
- Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, Shanxi 030001, China
| | - Hao Ren
- Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, Shanxi 030001, China
| | - Jiahui Ren
- Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, Shanxi 030001, China
| | - Wenzhu Song
- Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, Shanxi 030001, China
| | - Yuchao Qiao
- Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, Shanxi 030001, China
| | - Zeping Ren
- Shanxi Centre for Disease Control and Prevention, Taiyuan, Shanxi 030012, China
| | - Ying Zhao
- Shanxi Centre for Disease Control and Prevention, Taiyuan, Shanxi 030012, China
| | - Liqin Linghu
- Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, Shanxi 030001, China; Shanxi Centre for Disease Control and Prevention, Taiyuan, Shanxi 030012, China
| | - Yu Cui
- Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, Shanxi 030001, China
| | - Zhiyang Zhao
- Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, Shanxi 030001, China
| | - Limin Chen
- The Fifth Hospital (Shanxi People's Hospital) of Shanxi Medical University, No. 29, Shuangtaji Street, Taiyuan, Shanxi 030012, China.
| | - Lixia Qiu
- Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, Shanxi 030001, China.
| |
|
14
|
Liu J, Feng H, Tang Y, Zhang L, Qu C, Zeng X, Peng X. A novel hybrid algorithm based on Harris Hawks for tumor feature gene selection. PeerJ Comput Sci 2023; 9:e1229. [PMID: 37346505 PMCID: PMC10280456 DOI: 10.7717/peerj-cs.1229] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2022] [Accepted: 01/09/2023] [Indexed: 06/23/2023]
Abstract
Background Gene expression data are often used to classify cancer genes. In such high-dimensional datasets, however, only a few feature genes are closely related to tumors. Therefore, it is important to accurately select a subset of feature genes with high contributions to cancer classification. Methods In this article, a new three-stage hybrid gene selection method is proposed that combines a variance filter, extremely randomized tree and Harris Hawks (VEH). In the first stage, we evaluated each gene in the dataset through the variance filter and selected the feature genes that meet the variance threshold. In the second stage, we use extremely randomized tree to further eliminate irrelevant genes. Finally, we used the Harris Hawks algorithm to select the gene subset from the previous two stages to obtain the optimal feature gene subset. Results We evaluated the proposed method using three different classifiers on eight published microarray gene expression datasets. The results showed a 100% classification accuracy for VEH in gastric cancer, acute lymphoblastic leukemia and ovarian cancer, and an average classification accuracy of 95.33% across a variety of other cancers. Compared with other advanced feature selection algorithms, VEH has obvious advantages when measured by many evaluation criteria.
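The first of the three stages, the variance filter, can be sketched as follows. This is a minimal illustration that assumes population variance and a strict "greater than" threshold; the function name `variance_filter`, the threshold values, and the toy matrix are hypothetical, and the extremely randomized tree and Harris Hawks stages are not shown.

```python
from statistics import pvariance


def variance_filter(X, threshold):
    """Return indices of features (columns) whose variance exceeds `threshold`."""
    n_features = len(X[0])
    return [
        j for j in range(n_features)
        if pvariance([row[j] for row in X]) > threshold
    ]


# Hypothetical expression matrix: rows are samples, columns are genes.
X = [
    [1.0, 0.5, 10.0],
    [1.0, 0.6, 2.0],
    [1.0, 0.4, 6.0],
]
kept = variance_filter(X, threshold=0.01)
```

A gene that is constant across all samples (column 0 here) carries no information for classification, so it is removed before the more expensive later stages run.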
Affiliation(s)
- Junjian Liu
- Department of Statistics, Hunan Normal University College of Mathematics and Statistics, Changsha, Hunan, China
| | - Huicong Feng
- Department of Pathology and Pathophysiology, Hunan Normal University School of Medicine, Changsha, Hunan, China
| | - Yifan Tang
- Department of Pathology and Pathophysiology, Hunan Normal University School of Medicine, Changsha, Hunan, China
| | - Lupeng Zhang
- Department of Biochemistry and Molecular Biology, Jishou University School of Medicine, Jishou, Hunan, China
| | - Chiwen Qu
- Department of Statistics, Hunan Normal University College of Mathematics and Statistics, Changsha, Hunan, China
| | - Xiaomin Zeng
- Department of Epidemiology and Health Statistics, Xiangya Public Health School, Central South University, Changsha, Hunan, China
| | - Xiaoning Peng
- Department of Statistics, Hunan Normal University College of Mathematics and Statistics, Changsha, Hunan, China
- Department of Pathology and Pathophysiology, Hunan Normal University School of Medicine, Changsha, Hunan, China
| |
|
15
|
Dehdar S, Salimifard K, Mohammadi R, Marzban M, Saadatmand S, Fararouei M, Dianati-Nasab M. Applications of different machine learning approaches in prediction of breast cancer diagnosis delay. Front Oncol 2023; 13:1103369. [PMID: 36874113 PMCID: PMC9978377 DOI: 10.3389/fonc.2023.1103369] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/03/2022] [Accepted: 01/30/2023] [Indexed: 02/18/2023] Open
Abstract
Background The increasing rate of breast cancer (BC) incidence and mortality in Iran has turned this disease into a challenge. A delay in diagnosis leads to more advanced stages of BC and a lower chance of survival, which makes this cancer even more fatal. Objectives The present study was aimed at identifying the predicting factors for delayed BC diagnosis in women in Iran. Methods In this study, four machine learning methods, including extreme gradient boosting (XGBoost), random forest (RF), neural networks (NNs), and logistic regression (LR), were applied to analyze the data of 630 women with confirmed BC. Also, different statistical methods, including chi-square, p-value, sensitivity, specificity, accuracy, and area under the receiver operating characteristic curve (AUC), were utilized in different steps of the survey. Results Thirty percent of patients had a delayed BC diagnosis. Of all the patients with delayed diagnoses, 88.5% were married, 72.1% had an urban residency, and 84.8% had health insurance. The top three important factors in the RF model were urban residency (12.04), breast disease history (11.58), and other comorbidities (10.72). In the XGBoost, urban residency (17.54), having other comorbidities (17.14), and age at first childbirth (>30) (13.13) were the top factors; in the LR model, having other comorbidities (49.41), older age at first childbirth (82.57), and being nulliparous (44.19) were the top factors. Finally, in the NN, it was found that being married (50.05), having a marriage age above 30 (18.03), and having other breast disease history (15.83) were the main predicting factors for a delayed BC diagnosis. Conclusion Machine learning techniques suggest that women with an urban residency who got married or had their first child at an age older than 30 and those without children are at a higher risk of diagnosis delay. It is necessary to educate them about BC risk factors, symptoms, and breast self-examination to shorten the delay in diagnosis.
Affiliation(s)
- Samira Dehdar
- Computational Intelligence & Intelligent Optimization Research Group, Business and Economic School, Persian Gulf University, Bushehr, Iran
| | - Khodakaram Salimifard
- Computational Intelligence & Intelligent Optimization Research Group, Business and Economic School, Persian Gulf University, Bushehr, Iran
| | - Reza Mohammadi
- Business Analytics Section, Amsterdam Business School, University of Amsterdam, Amsterdam, Netherlands
| | - Maryam Marzban
- Department of Public Health, School of Public Health, Bushehr University of Medical Science, Bushehr, Iran
| | - Sara Saadatmand
- Computational Intelligence & Intelligent Optimization Research Group, Business and Economic School, Persian Gulf University, Bushehr, Iran
| | - Mohammad Fararouei
- Department of Epidemiology, School of Public Health, Shiraz University of Medical Sciences, Shiraz, Iran
| | - Mostafa Dianati-Nasab
- Department of Complex Genetics and Epidemiology, School of Nutrition and Translational Research in Metabolism, Maastricht University, Maastricht, Netherlands
| |
|
16
|
Sheikhpour R. A local spline regression-based framework for semi-supervised sparse feature selection. Knowl Based Syst 2023. [DOI: 10.1016/j.knosys.2023.110265] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/09/2023]
|
17
|
Mukherjee R, Kundu A, Mukherjee I, Gupta D, Tiwari P, Khanna A, Shorfuzzaman M. IoT-cloud based healthcare model for COVID-19 detection: an enhanced k-Nearest Neighbour classifier based approach. COMPUTING 2023; 105. [PMCID: PMC8085103 DOI: 10.1007/s00607-021-00951-9] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/08/2023]
Abstract
COVID-19 has severely affected the whole world. The pandemic has caused many casualties in a very short span. An IoT-cloud-based healthcare model is urgently required in this situation to support better decisions during the COVID-19 pandemic. In this paper, an attempt has been made to perform predictive analytics regarding the disease using a machine learning classifier. This research proposed an enhanced KNN (k-Nearest Neighbor) algorithm, eKNN, which did not choose the value of k randomly; instead, it used a mathematical function of the dataset's sample size to determine the k value. The eKNN algorithm was tested on 7 benchmark COVID-19 datasets of different sizes, which were gathered from standard data clouds of different countries (Brazil, Mexico, etc.). It appeared that the enhanced KNN classifier performs significantly better than ordinary KNN. The second research question augmented the enhanced KNN algorithm with feature selection using ACO (Ant Colony Optimization). Results indicated that the enhanced KNN classifier along with the feature selection mechanism performed considerably better than enhanced KNN without feature selection. This paper proposes an improved KNN that attempts to find an optimal value of k and studies IoT-cloud-based COVID-19 detection.
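The idea of deriving k from the sample size instead of picking it arbitrarily can be sketched as below. The abstract does not state which function eKNN uses, so this example assumes the common square-root heuristic rounded to an odd number; `choose_k`, `knn_predict`, and the toy data are hypothetical and should not be read as the authors' algorithm.

```python
import math
from collections import Counter


def choose_k(n_samples):
    """Assumed heuristic: k is the odd integer nearest to sqrt(n) from above."""
    k = max(1, round(math.sqrt(n_samples)))
    return k if k % 2 == 1 else k + 1


def knn_predict(train_X, train_y, query):
    """Classify `query` by majority vote among its k nearest training points."""
    k = min(choose_k(len(train_X)), len(train_X))
    nearest = sorted(
        range(len(train_X)),
        key=lambda i: math.dist(train_X[i], query),  # Euclidean distance
    )[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-cluster data set with 9 samples, so k = 3.
train_X = [(0, 0), (0, 1), (1, 0), (0.5, 0.5), (5, 5), (5, 6), (6, 5), (6, 6), (5.5, 5.5)]
train_y = ["neg", "neg", "neg", "neg", "pos", "pos", "pos", "pos", "pos"]
label = knn_predict(train_X, train_y, (0.2, 0.2))
```

Keeping k odd avoids tied votes in two-class problems, which is one reason this kind of rule is often preferred over a fixed or randomly chosen k.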
Affiliation(s)
- Rajendrani Mukherjee
- Department of Computer Science and Engineering, University of Engineering and Management, Kolkata, India
| | - Aurghyadip Kundu
- Department of Computer Science and Engineering, Brainware University, Kolkata, India
| | - Indrajit Mukherjee
- Department of Computer Science and Engineering, Birla Institute of Technology, Mesra, India
| | - Deepak Gupta
- Maharaja Agrasen Institute of Technology, Delhi, India
| | - Prayag Tiwari
- Department of Computer Science, Aalto University, Espoo, Finland
| | - Ashish Khanna
- Maharaja Agrasen Institute of Technology, Delhi, India
| | - Mohammad Shorfuzzaman
- Department of Computer Science, College of Computers and Information Technology, Taif University, Taif, 21944 Saudi Arabia
| |
|
18
|
Fan M, Zhang X, Hu J, Gu N, Tao D. Adaptive Data Structure Regularized Multiclass Discriminative Feature Selection. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2022; 33:5859-5872. [PMID: 33882003 DOI: 10.1109/tnnls.2021.3071603] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
Feature selection (FS), which aims to identify the most informative subset of input features, is an important approach to dimensionality reduction. In this article, a novel FS framework is proposed for both unsupervised and semisupervised scenarios. To make efficient use of data distribution to evaluate features, the framework combines data structure learning (also referred to as data distribution modeling) and FS in a unified formulation such that the data structure learning improves the results of FS and vice versa. Moreover, two types of data structures, namely the soft and hard data structures, are learned and used in the proposed FS framework. The soft data structure refers to the pairwise weights among data samples, and the hard data structure refers to the estimated labels obtained from clustering or semisupervised classification. Both of these data structures are naturally formulated as regularization terms in the proposed framework. In the optimization process, the soft and hard data structures are learned from data represented by the selected features, and then, the most informative features are reselected by referring to the data structures. In this way, the framework uses the interactions between data structure learning and FS to select the most discriminative and informative features. Following the proposed framework, a new semisupervised FS (SSFS) method is derived and studied in depth. Experiments on real-world data sets demonstrate the effectiveness of the proposed method.
|
19
|
Patterson A, Auslander N. Mutated processes predict immune checkpoint inhibitor therapy benefit in metastatic melanoma. Nat Commun 2022; 13:5151. [PMID: 36123351 PMCID: PMC9485158 DOI: 10.1038/s41467-022-32838-4] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/17/2022] [Accepted: 08/19/2022] [Indexed: 02/06/2023] Open
Abstract
Immune Checkpoint Inhibitor (ICI) therapy has revolutionized treatment for advanced melanoma; however, only a subset of patients benefit from this treatment. Despite considerable efforts, the Tumor Mutation Burden (TMB) is the only FDA-approved biomarker in melanoma. However, the mechanisms underlying TMB association with prolonged ICI survival are not entirely understood and may depend on numerous confounding factors. To identify more interpretable ICI response biomarkers based on tumor mutations, we train classifiers using mutations within distinct biological processes. We evaluate a variety of feature selection and classification methods and identify key mutated biological processes that provide improved predictive capability compared to the TMB. The top mutated processes we identify are leukocyte and T-cell proliferation regulation, which demonstrate stable predictive performance across different data cohorts of melanoma patients treated with ICI. This study provides biologically interpretable genomic predictors of ICI response with substantially improved predictive performance over the TMB.
Affiliation(s)
- Andrew Patterson
- Genomics and Computational Biology Graduate Group, University of Pennsylvania - Perelman School of Medicine, Philadelphia, PA, 19104, USA
- Program in Molecular and Cellular Oncogenesis, The Wistar Institute, Philadelphia, PA, 19104, USA
| | - Noam Auslander
- Program in Molecular and Cellular Oncogenesis, The Wistar Institute, Philadelphia, PA, 19104, USA.
| |
|
20
|
Murphy RG, Gilmore A, Senevirathne S, O'Reilly PG, LaBonte Wilson M, Jain S, McArt DG. Particle Swarm Optimization Artificial Intelligence technique for gene signature discovery in transcriptomic cohorts. Comput Struct Biotechnol J 2022; 20:5547-5563. [PMID: 36249564 PMCID: PMC9556859 DOI: 10.1016/j.csbj.2022.09.033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/18/2022] [Revised: 09/22/2022] [Accepted: 09/22/2022] [Indexed: 11/12/2022] Open
Abstract
EBPSO identifies unique, accurate, and succinct gene signatures. Key genes within the signatures provide biological insights into their associated functions. A web-based micro-framework was developed for ease of use and real-time visualizations. A promising alternative to traditional single gene signature generation. Downstream analysis will better translate these signatures towards clinical translation.
The development of gene signatures is key for delivering personalized medicine, despite only a few signatures being available for use in the clinic for cancer patients. Gene signature discovery tends to revolve around identifying a single signature. However, it has been shown that various highly predictive signatures can be produced from the same dataset. This study assumes that the presentation of top ranked signatures will allow greater efforts in the selection of gene signatures for validation on external datasets and for their clinical translation. Particle swarm optimization (PSO) is an evolutionary algorithm often used as a search strategy and largely represented as binary PSO (BPSO) in this domain. BPSO, however, fails to produce succinct feature sets for complex optimization problems, thus affecting its overall runtime and optimization performance. Enhanced BPSO (EBPSO) was developed to overcome these shortcomings. Thus, this study will validate unique candidate gene signatures for different underlying biology from EBPSO on transcriptomics cohorts. EBPSO was consistently seen to be as accurate as BPSO with substantially smaller feature signatures and significantly faster runtimes. 100% accuracy was achieved in all but two of the selected data sets. Using clinical transcriptomics cohorts, EBPSO has demonstrated the ability to identify accurate, succinct, and significantly prognostic signatures that are unique from one another. This has been proposed as a promising alternative to overcome the issues regarding traditional single gene signature generation. Interpretation of key genes within the signatures provided biological insights into the associated functions that were well correlated to their cancer type.
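For readers unfamiliar with binary PSO, one update step of the standard BPSO formulation can be sketched as follows. This is the textbook variant with a sigmoid transfer function, not the EBPSO enhancements, which the abstract does not describe; the function name `bpso_step` and the coefficient defaults are assumptions chosen for illustration.

```python
import math
import random


def bpso_step(position, velocity, personal_best, global_best, rng,
              inertia=0.7, c1=1.5, c2=1.5, v_max=4.0):
    """One standard BPSO update for a single particle (bit = gene selected)."""
    new_position, new_velocity = [], []
    for x, v, p, g in zip(position, velocity, personal_best, global_best):
        # Pull the velocity toward the particle's own best and the swarm's best.
        v = inertia * v + c1 * rng.random() * (p - x) + c2 * rng.random() * (g - x)
        v = max(-v_max, min(v_max, v))        # clamp to keep probabilities sane
        prob = 1.0 / (1.0 + math.exp(-v))     # sigmoid: probability that bit is 1
        new_velocity.append(v)
        new_position.append(1 if rng.random() < prob else 0)
    return new_position, new_velocity


rng = random.Random(0)
pos, vel = bpso_step(
    position=[0, 1, 0, 1],
    velocity=[0.0, 0.0, 0.0, 0.0],
    personal_best=[1, 1, 0, 0],
    global_best=[1, 0, 0, 1],
    rng=rng,
)
```

Because a velocity of zero maps to a selection probability of 0.5, plain BPSO tends to keep roughly half of the features switched on, which is consistent with the abstract's remark that it struggles to produce succinct feature sets.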
|
21
|
Qiu F, Zheng P, Heidari AA, Liang G, Chen H, Karim FK, Elmannai H, Lin H. Mutational Slime Mould Algorithm for Gene Selection. Biomedicines 2022; 10:2052. [PMID: 36009599 PMCID: PMC9406076 DOI: 10.3390/biomedicines10082052] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/22/2022] [Revised: 08/14/2022] [Accepted: 08/16/2022] [Indexed: 02/02/2023] Open
Abstract
A large volume of high-dimensional genetic data has been produced in modern medicine and biology fields. Data-driven decision-making is particularly crucial to clinical practice and relevant procedures. However, high-dimensional data in these fields increase the processing complexity and scale. Identifying representative genes and reducing the data's dimensions is often challenging. The purpose of gene selection is to eliminate irrelevant or redundant features to reduce the computational cost and improve classification accuracy. The wrapper gene selection model is based on a feature set, which can reduce the number of features and improve classification accuracy. This paper proposes a wrapper gene selection method based on the slime mould algorithm (SMA) to solve this problem. SMA is a new algorithm with a lot of application space in the feature selection field. This paper improves the original SMA by combining the Cauchy mutation mechanism with the crossover mutation strategy based on differential evolution (DE). Then, the transfer function converts the continuous optimizer into a binary version to solve the gene selection problem. Firstly, the continuous version of the method, ISMA, is tested on 33 classical continuous optimization problems. Then, the effect of the discrete version, or BISMA, was thoroughly studied by comparing it with other gene selection methods on 14 gene expression datasets. Experimental results show that the continuous version of the algorithm achieves an optimal balance between local exploitation and global search capabilities, and the discrete version of the algorithm has the highest accuracy when selecting the least number of genes.
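Two ingredients named in this abstract, Cauchy mutation and a transfer function that turns a continuous position into a binary gene mask, can be sketched in isolation. The sketch assumes a standard Cauchy perturbation and an S-shaped (sigmoid) transfer function; the abstract does not say which transfer function BISMA uses, and the slime mould update and differential-evolution crossover are not shown. The names `cauchy_mutate` and `binarize` are hypothetical.

```python
import math
import random


def cauchy_mutate(position, rng, scale=1.0):
    """Add heavy-tailed Cauchy noise to each coordinate (inverse-CDF sampling)."""
    return [x + scale * math.tan(math.pi * (rng.random() - 0.5)) for x in position]


def binarize(position, rng, clip=50.0):
    """Map a continuous position to a 0/1 mask with a sigmoid transfer function."""
    mask = []
    for x in position:
        x = max(-clip, min(clip, x))          # avoid overflow in exp for outliers
        prob = 1.0 / (1.0 + math.exp(-x))     # probability of selecting the gene
        mask.append(1 if rng.random() < prob else 0)
    return mask


rng = random.Random(1)
mutated = cauchy_mutate([0.0, 0.0, 0.0], rng)
mask = binarize(mutated, rng)
```

The heavy tails of the Cauchy distribution occasionally produce large jumps, which is the usual motivation for using it to help a search escape local optima.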
Affiliation(s)
- Feng Qiu
- Department of Computer Science and Artificial Intelligence, Wenzhou University, Wenzhou 325035, China
| | - Pan Zheng
- Information Systems, University of Canterbury, Christchurch 8014, New Zealand
| | - Ali Asghar Heidari
- Department of Computer Science and Artificial Intelligence, Wenzhou University, Wenzhou 325035, China
| | - Guoxi Liang
- Department of Information Technology, Wenzhou Polytechnic, Wenzhou 325035, China
| | - Huiling Chen
- Department of Computer Science and Artificial Intelligence, Wenzhou University, Wenzhou 325035, China
| | - Faten Khalid Karim
- Department of Computer Sciences, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia
| | - Hela Elmannai
- Department of Information Technology, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia
| | - Haiping Lin
- Department of Information Engineering, Hangzhou Vocational & Technical College, Hangzhou 310018, China
| |
|
22
|
Zheng J, Qu H, Li Z, Li L, Tang X, Guo F. A novel autoencoder approach to feature extraction with linear separability for high-dimensional data. PeerJ Comput Sci 2022; 8:e1061. [PMID: 37547057 PMCID: PMC10403198 DOI: 10.7717/peerj-cs.1061] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2022] [Accepted: 07/18/2022] [Indexed: 08/08/2023]
Abstract
Feature extraction often relies on sufficient information from the input data; however, the distribution of data in a high-dimensional space is too sparse to provide sufficient information for feature extraction. Furthermore, high dimensionality also makes it hard to search for features scattered across subspaces. As such, feature extraction from data in a high-dimensional space is a tricky task. To address this issue, this article proposes a novel autoencoder method using Mahalanobis distance metric of rescaling transformation. The key idea of the method is that by implementing Mahalanobis distance metric of rescaling transformation, the difference between the reconstructed distribution and the original distribution can be reduced, so as to improve the feature extraction ability of the autoencoder. Results show that the proposed approach outperforms the state-of-the-art methods in terms of both the accuracy of feature extraction and the linear separability of the extracted features. We indicate that distance metric-based methods are more suitable for extracting linearly separable features from high-dimensional data than feature selection-based methods. In a high-dimensional space, evaluating feature similarity is relatively easier than evaluating feature importance, so distance metric methods that evaluate feature similarity gain advantages over feature selection methods that assess feature importance for feature extraction, while evaluating feature importance is more computationally efficient than evaluating feature similarity.
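The Mahalanobis distance at the core of this method rescales differences by the data covariance, so that directions with large spread count for less. The sketch below computes it for two-dimensional points only, using the closed-form inverse of a 2x2 covariance matrix; the function names are hypothetical, and how the paper embeds the metric in the autoencoder is not specified in the abstract and is not shown.

```python
import math
from statistics import mean


def covariance_2d(points):
    """Sample covariance of 2-D points, returned as (sxx, sxy, syy)."""
    mx = mean(p[0] for p in points)
    my = mean(p[1] for p in points)
    n = len(points)
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return sxx, sxy, syy


def mahalanobis_2d(a, b, cov):
    """sqrt((a-b)^T S^-1 (a-b)) for a 2x2 covariance S = [[sxx, sxy], [sxy, syy]]."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = a[0] - b[0], a[1] - b[1]
    # The inverse of S is (1/det) * [[syy, -sxy], [-sxy, sxx]].
    q = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(q)


# With variance 4 along x and 1 along y, a step of 2 along x counts as 1 unit,
# while the same step along y counts as 2 units.
cov = (4.0, 0.0, 1.0)
d_x = mahalanobis_2d((0, 0), (2, 0), cov)
d_y = mahalanobis_2d((0, 0), (0, 2), cov)
```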
Affiliation(s)
- Jian Zheng
- College of Computer Science and Technology, Chongqing University of Post and Telecommunications, Chongqing, China
| | - Hongchun Qu
- College of Computer Science and Technology, Chongqing University of Post and Telecommunications, Chongqing, China
- College of Automation, Chongqing University of Posts and Telecommunications, Chongqing, China
| | - Zhaoni Li
- College of Computer Science and Technology, Chongqing University of Post and Telecommunications, Chongqing, China
| | - Lin Li
- College of Computer Science and Technology, Chongqing University of Post and Telecommunications, Chongqing, China
| | - Xiaoming Tang
- College of Automation, Chongqing University of Posts and Telecommunications, Chongqing, China
| | - Fei Guo
- College of Automation, Chongqing University of Posts and Telecommunications, Chongqing, China
| |
|
23
|
Wang L, Guo J, Tian Z, Seery S, Jin Y, Zhang S. Developing a Hybrid Risk Assessment Tool for Familial Hypercholesterolemia: A Machine Learning Study of Chinese Arteriosclerotic Cardiovascular Disease Patients. Front Cardiovasc Med 2022; 9:893986. [PMID: 35990942 PMCID: PMC9381985 DOI: 10.3389/fcvm.2022.893986] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/11/2022] [Accepted: 06/22/2022] [Indexed: 11/15/2022] Open
Abstract
Background Familial hypercholesterolemia (FH) is an autosomal-dominant genetic disorder with a high risk of premature arteriosclerotic cardiovascular disease (ASCVD). There are many alternative risk assessment tools, for example, DLCN, although their sensitivity and specificity vary among specific populations. We aimed to assess the risk discovery performance of a hybrid model consisting of existing FH risk assessment tools and machine learning (ML) methods, based on the Chinese patients with ASCVD. Materials and Methods In total, 5,597 primary patients with ASCVD were assessed for FH risk using 11 tools. The three best performing tools were hybridized through a voting strategy. ML models were set according to hybrid results to create a hybrid FH risk assessment tool (HFHRAT). PDP and ICE were adopted to interpret black box features. Results After hybridizing the mDLCN, Taiwan criteria, and DLCN, the HFHRAT was taken as a stacking ensemble method (AUC_class[94.85 ± 0.47], AUC_prob[98.66 ± 0.27]). The interpretation of HFHRAT suggests that patients aged <75 years with LDL-c >4 mmol/L were more likely to be at risk of developing FH. Conclusion The HFHRAT has provided a median of the three tools, which could reduce the false-negative rate associated with existing tools and prevent the development of atherosclerosis. The hybrid tool could satisfy the need for a risk assessment tool for specific populations.
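The voting step that hybridizes the three best-performing tools can be pictured with a small sketch. The abstract does not specify the voting rule, so simple majority voting over binary "at risk" flags is assumed here; the function name `majority_vote` and the example outputs are hypothetical.

```python
def majority_vote(flags):
    """Return True when more than half of the tools flag the patient as at risk."""
    flags = list(flags)
    return 2 * sum(1 for f in flags if f) > len(flags)


# Hypothetical outputs of the three tools for one patient.
tool_outputs = {"mDLCN": True, "Taiwan criteria": False, "DLCN": True}
hybrid_label = majority_vote(tool_outputs.values())
```

With three voters, the majority decision is the median of the three binary outputs, which matches the abstract's description of the hybrid as "a median of the three tools". Labels produced this way could then serve as training targets for an ML model.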
Affiliation(s)
- Lei Wang
- State Key Laboratory of Complex Severe and Rare Diseases, Department of Cardiology, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
| | - Jian Guo
- State Key Laboratory of Complex Severe and Rare Diseases, Department of Cardiology, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- State Key Laboratory of Complex Severe and Rare Diseases, Medical Research Center, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
| | - Zhuang Tian
- State Key Laboratory of Complex Severe and Rare Diseases, Department of Cardiology, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
| | - Samuel Seery
- Department of Humanities and Social Sciences, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
| | - Ye Jin
- State Key Laboratory of Complex Severe and Rare Diseases, Medical Research Center, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
| | - Shuyang Zhang
- State Key Laboratory of Complex Severe and Rare Diseases, Department of Cardiology, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- *Correspondence: Shuyang Zhang,
| |
|
24
|
Liang X, Li J, Fu Y, Qu L, Tan Y, Zhang P. A novel machine learning model based on sparse structure learning with adaptive graph regularization for predicting drug side effects. J Biomed Inform 2022; 132:104131. [PMID: 35840061 DOI: 10.1016/j.jbi.2022.104131] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/11/2022] [Revised: 06/08/2022] [Accepted: 06/29/2022] [Indexed: 10/17/2022]
Abstract
Drug side effects are closely related to the success and failure of drug development. Here we present a novel machine learning method for side effect prediction. The proposed method treats side effect prediction as a multi-label learning problem and uses sparse structure learning to model the relationships between side effects. Additionally, the proposed method adopts the adaptive graph regularization strategy to explore the local structure in drug data and fuse multiple types of drug features. An alternating optimization algorithm is proposed to solve the optimization problem. We collected chemical structures and biological pathway features of drugs as the inputs of our method to predict drug side effects. The results of the cross-validation experiment showed that our method could significantly improve the prediction performance compared to the other state-of-the-art methods. Besides, our model is highly interpretable. It could learn the drug neighbourhood relationships, side effect relationships, and drug features related to side effects. We systematically validated the information extracted by the model with independent data. Some prediction results could also be supported by literature reports. The proposed method could be applied to integrate both chemical and biological data to predict side effects and helps improve drug safety.
Affiliation(s)
- Xujun Liang
- NHC Key Laboratory of Cancer Proteomics, Department of Oncology, PR China; National Clinical Research Center for Gerontology, Xiangya Hospital, Central South University, PR China.
| | - Jun Li
- NHC Key Laboratory of Cancer Proteomics, Department of Oncology, PR China
| | - Ying Fu
- NHC Key Laboratory of Cancer Proteomics, Department of Oncology, PR China
| | - Lingzhi Qu
- NHC Key Laboratory of Cancer Proteomics, Department of Oncology, PR China
| | - Yuying Tan
- NHC Key Laboratory of Cancer Proteomics, Department of Oncology, PR China
| | - Pengfei Zhang
- NHC Key Laboratory of Cancer Proteomics, Department of Oncology, PR China; National Clinical Research Center for Gerontology, Xiangya Hospital, Central South University, PR China
| |
|
25
|
Chen Z, Liu X, Zhao P, Li C, Wang Y, Li F, Akutsu T, Bain C, Gasser RB, Li J, Yang Z, Gao X, Kurgan L, Song J. iFeatureOmega: an integrative platform for engineering, visualization and analysis of features from molecular sequences, structural and ligand data sets. Nucleic Acids Res 2022; 50:W434-W447. [PMID: 35524557 PMCID: PMC9252729 DOI: 10.1093/nar/gkac351] [Citation(s) in RCA: 34] [Impact Index Per Article: 11.3] [Received: 03/21/2022] [Revised: 04/22/2022] [Accepted: 04/25/2022] [Indexed: 01/07/2023] Open
Abstract
The rapid accumulation of molecular data motivates development of innovative approaches to computationally characterize sequences, structures and functions of biological and chemical molecules in an efficient, accessible and accurate manner. Notwithstanding several computational tools that characterize protein or nucleic acids data, there are no one-stop computational toolkits that comprehensively characterize a wide range of biomolecules. We address this vital need by developing a holistic platform that generates features from sequence and structural data for a diverse collection of molecule types. Our freely available and easy-to-use iFeatureOmega platform generates, analyzes and visualizes 189 representations for biological sequences, structures and ligands. To the best of our knowledge, iFeatureOmega provides the largest scope when directly compared to the current solutions, in terms of the number of feature extraction and analysis approaches and coverage of different molecules. We release three versions of iFeatureOmega including a webserver, command line interface and graphical interface to satisfy needs of experienced bioinformaticians and less computer-savvy biologists and biochemists. With the assistance of iFeatureOmega, users can encode their molecular data into representations that facilitate construction of predictive models and analytical studies. We highlight benefits of iFeatureOmega based on three research applications, demonstrating how it can be used to accelerate and streamline research in bioinformatics, computational biology, and cheminformatics areas. The iFeatureOmega webserver is freely available at http://ifeatureomega.erc.monash.edu and the standalone versions can be downloaded from https://github.com/Superzchen/iFeatureOmega-GUI/ and https://github.com/Superzchen/iFeatureOmega-CLI/.
Affiliation(s)
- Zhen Chen
- Collaborative Innovation Center of Henan Grain Crops, Henan Agricultural University, Zhengzhou 450046, China
- Center for Crop Genome Engineering, Henan Agricultural University, Zhengzhou 450046, China
| | - Xuhan Liu
- Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Einsteinweg 55, Leiden 2333 CC, The Netherlands
| | - Pei Zhao
- State Key Laboratory of Cotton Biology, Institute of Cotton Research of Chinese Academy of Agricultural Sciences (CAAS), Anyang 455000, China
| | - Chen Li
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
| | - Yanan Wang
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
| | - Fuyi Li
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
| | - Tatsuya Akutsu
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto 611-0011, Japan
| | - Chris Bain
- Monash Data Future Institutes, Monash University, Melbourne, Victoria 3800, Australia
| | - Robin B Gasser
- Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia
| | - Junzhou Li
- Collaborative Innovation Center of Henan Grain Crops, Henan Agricultural University, Zhengzhou 450046, China
| | - Zuoren Yang
- State Key Laboratory of Cotton Biology, Institute of Cotton Research of Chinese Academy of Agricultural Sciences (CAAS), Anyang 455000, China
| | - Xin Gao
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
| | - Jiangning Song
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
- Monash Data Future Institutes, Monash University, Melbourne, Victoria 3800, Australia
| |
|
26
|
Kurata H, Tsukiyama S, Manavalan B. iACVP: markedly enhanced identification of anti-coronavirus peptides using a dataset-specific word2vec model. Brief Bioinform 2022; 23:6623727. [PMID: 35772910 DOI: 10.1093/bib/bbac265] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 04/06/2022] [Revised: 05/23/2022] [Accepted: 06/06/2022] [Indexed: 01/22/2023] Open
Abstract
The COVID-19 pandemic caused several million deaths worldwide. Development of anti-coronavirus drugs is thus urgent. Unlike conventional non-peptide drugs, antiviral peptide drugs are highly specific, easy to synthesize and modify, and not highly susceptible to drug resistance. To reduce the time and expense involved in screening thousands of peptides and assaying their antiviral activity, computational predictors for identifying anti-coronavirus peptides (ACVPs) are needed. However, few experimentally verified ACVP samples are available, even though a relatively large number of antiviral peptides (AVPs) have been discovered. In this study, we attempted to predict ACVPs using an AVP dataset and a small collection of ACVPs. Using conventional features, a binary profile and a word-embedding word2vec (W2V), we systematically explored five different machine learning methods: Transformer, Convolutional Neural Network, bidirectional Long Short-Term Memory, Random Forest (RF) and Support Vector Machine. Via exhaustive searches, we found that the RF classifier with W2V consistently achieved better performance on different datasets. The two main controlling factors were: (i) the dataset-specific W2V dictionary was generated from the training and independent test datasets instead of the widely used general UniProt proteome and (ii) a systematic search was conducted and determined the optimal k-mer value in W2V, which provides greater discrimination between positive and negative samples. Therefore, our proposed method, named iACVP, consistently provides better prediction performance compared with existing state-of-the-art methods. To assist experimentalists in identifying putative ACVPs, we implemented our model as a web server accessible via the following link: http://kurata35.bio.kyutech.ac.jp/iACVP.
Affiliation(s)
- Hiroyuki Kurata
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
| | - Sho Tsukiyama
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
| | - Balachandran Manavalan
- Computational Biology and Bioinformatics Laboratory, Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
| |
|
27
|
Dokeroglu T, Deniz A, Kiziloz HE. A comprehensive survey on recent metaheuristics for feature selection. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.04.083] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 01/05/2023]
|
28
|
Multi-classification for high-dimensional data using probabilistic neural networks. Journal of Radiation Research and Applied Sciences 2022. [DOI: 10.1016/j.jrras.2022.05.010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
|
29
|
Identification of gene signatures for COAD using feature selection and Bayesian network approaches. Sci Rep 2022; 12:8761. [PMID: 35610288 PMCID: PMC9130243 DOI: 10.1038/s41598-022-12780-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/17/2022] [Accepted: 05/03/2022] [Indexed: 12/13/2022] Open
Abstract
Combining the TCGA and GTEx databases provides more comprehensive information for characterizing the human genome in health and disease, especially for understanding the genetic alterations underlying cancer. Here we analyzed the gene expression profile of COAD in both tumor samples from TCGA and normal colon tissues from GTEx. Using the SNR-PPFS feature selection algorithms, we discovered a 38-gene signature that performed well in distinguishing COAD tumors from normal samples. A Bayesian network of the 38 genes revealed that DEGs with similar expression patterns or functions interacted more closely. We identified 14 up-regulated DEGs that were significantly correlated with tumor stages. Cox regression analysis demonstrated that tumor stage and dysregulation of STMN4 and FAM135B were independent prognostic factors for COAD survival outcomes. Overall, this study indicates that using feature selection approaches to select key gene signatures from high-dimensional datasets can be an effective way to study cancer genomic characteristics.
|
30
|
Abbasi MS, Al-Sahaf H, Mansoori M, Welch I. Behavior-based ransomware classification: A particle swarm optimization wrapper-based approach for feature selection. Appl Soft Comput 2022. [DOI: 10.1016/j.asoc.2022.108744] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/24/2022]
|
31
|
Nersisyan S, Novosad V, Galatenko A, Sokolov A, Bokov G, Konovalov A, Alekseev D, Tonevitsky A. ExhauFS: exhaustive search-based feature selection for classification and survival regression. PeerJ 2022; 10:e13200. [PMID: 35378930 PMCID: PMC8976470 DOI: 10.7717/peerj.13200] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 01/20/2022] [Accepted: 03/09/2022] [Indexed: 01/12/2023] Open
Abstract
Feature selection is one of the main techniques used to prevent overfitting in machine learning applications. The most straightforward approach to feature selection is an exhaustive search: one goes over all possible feature combinations and picks the model with the highest accuracy. This method, together with its optimizations, has been actively used in biomedical research; however, a publicly available implementation is missing. We present ExhauFS, a user-friendly command-line implementation of the exhaustive search approach for classification and survival regression. Aside from the tool description, we included three application examples in the manuscript to comprehensively review the implemented functionality. First, we executed ExhauFS on a toy cervical cancer dataset to illustrate basic concepts. Then, multi-cohort microarray breast cancer datasets were used to construct gene signatures for 5-year recurrence classification. The vast majority of signatures constructed by ExhauFS passed the 0.65 threshold of sensitivity and specificity on all datasets, including the validation one. Moreover, a number of gene signatures demonstrated reliable performance on an independent RNA-seq dataset without any coefficient re-tuning, i.e., turned out to be cross-platform. Finally, Cox survival regression models were used to fit isomiR signatures for overall survival prediction for patients with colorectal cancer. Similarly to the previous example, the majority of models passed the pre-defined concordance index threshold of 0.65 on all datasets. In both real-world scenarios (breast and colorectal cancer datasets), ExhauFS was benchmarked against state-of-the-art feature selection models, including L1-regularized sparse models. In the case of breast cancer, we were unable to construct reliable cross-platform classifiers using alternative feature selection approaches. In the case of colorectal cancer, not a single model passed the same 0.65 threshold.
Source codes and documentation of ExhauFS are available on GitHub: https://github.com/s-a-nersisyan/ExhauFS.
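The exhaustive search idea described in this abstract (score every feature combination of a given size and keep the best one) can be illustrated with a minimal Python sketch. This is not ExhauFS's code or command-line interface: the function names, the nearest-centroid classifier, the leave-one-out scoring, and the toy data are all assumptions chosen to keep the example self-contained.

```python
from itertools import combinations


def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a nearest-centroid classifier
    restricted to the given feature indices (illustrative scorer)."""
    correct = 0
    for i in range(len(X)):
        sums, counts = {}, {}
        # Build class centroids from every sample except the held-out one.
        for j, (row, label) in enumerate(zip(X, y)):
            if j == i:
                continue
            vec = [row[f] for f in features]
            prev = sums.get(label, [0.0] * len(features))
            sums[label] = [s + v for s, v in zip(prev, vec)]
            counts[label] = counts.get(label, 0) + 1
        held_out = [X[i][f] for f in features]

        def sq_dist(label):
            centroid = [s / counts[label] for s in sums[label]]
            return sum((a - b) ** 2 for a, b in zip(held_out, centroid))

        predicted = min(sums, key=sq_dist)
        correct += int(predicted == y[i])
    return correct / len(X)


def exhaustive_search(X, y, k):
    """Score every k-feature combination and return (best_subset, best_score)."""
    best_subset, best_score = None, -1.0
    for subset in combinations(range(len(X[0])), k):
        score = loo_accuracy(X, y, subset)
        if score > best_score:  # ties keep the first subset found
            best_subset, best_score = subset, score
    return best_subset, best_score


# Hypothetical toy data: only the third column separates the two classes.
X_toy = [
    [0.9, 0.1, 1.0],
    [0.1, 0.9, 1.2],
    [0.8, 0.2, 0.9],
    [0.2, 0.8, 3.0],
    [0.9, 0.3, 3.1],
    [0.1, 0.7, 2.9],
]
y_toy = [0, 0, 0, 1, 1, 1]

best_subset, best_score = exhaustive_search(X_toy, y_toy, k=1)
print(best_subset, best_score)
```

The number of combinations grows as "n choose k", so a search of this kind is only practical after the candidate features have been reduced to a small pool or when k is kept small.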
Affiliation(s)
- Stepan Nersisyan
- Faculty of Biology and Biotechnology, HSE University, Moscow, Russia
| | - Victor Novosad
- Faculty of Biology and Biotechnology, HSE University, Moscow, Russia; Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry RAS, Moscow, Russia
| | - Alexei Galatenko
- Faculty of Mechanics and Mathematics, Lomonosov Moscow State University, Moscow, Russia; Moscow Center for Fundamental and Applied Mathematics, Moscow, Russia
| | - Andrey Sokolov
- Faculty of Mechanics and Mathematics, Lomonosov Moscow State University, Moscow, Russia; Moscow Center for Fundamental and Applied Mathematics, Moscow, Russia
| | - Grigoriy Bokov
- Faculty of Mechanics and Mathematics, Lomonosov Moscow State University, Moscow, Russia; Moscow Center for Fundamental and Applied Mathematics, Moscow, Russia
| | - Alexander Konovalov
- Faculty of Mechanics and Mathematics, Lomonosov Moscow State University, Moscow, Russia; Moscow Center for Fundamental and Applied Mathematics, Moscow, Russia
| | - Dmitry Alekseev
- Faculty of Mechanics and Mathematics, Lomonosov Moscow State University, Moscow, Russia; Moscow Center for Fundamental and Applied Mathematics, Moscow, Russia
| | - Alexander Tonevitsky
- Faculty of Biology and Biotechnology, HSE University, Moscow, Russia; Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry RAS, Moscow, Russia; Institute of Nanotechnologies of Microelectronics RAS, Moscow, Russia
| |
|
32
|
Veiner M, Morimoto J, Leadbeater E, Manfredini F. Machine Learning models identify gene predictors of waggle dance behaviour in honeybees. Mol Ecol Resour 2022; 22:2248-2261. [PMID: 35334147 DOI: 10.1111/1755-0998.13611] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 08/23/2021] [Revised: 03/02/2022] [Accepted: 03/21/2022] [Indexed: 11/28/2022]
Abstract
The molecular characterisation of complex behaviours is a challenging task as a range of different factors are often involved to produce the observed phenotype. An established approach is to look at the overall levels of expression of brain genes - or 'neurogenomics' - to select the best candidates that associate with patterns of interest. However, traditional neurogenomic analyses have some well-known limitations; above all, the usually limited number of biological replicates compared to the number of genes tested - known as "curse of dimensionality". In this study we implemented a Machine Learning (ML) approach that can be used as a complement to more established methods of transcriptomic analyses. We tested three supervised learning algorithms (Random Forests, Lasso and Elastic net Regularized Generalized Linear Model, and Support Vector Machine) for their performance in the characterization of transcriptomic patterns and identification of genes associated with honeybee waggle dance. We then intersected the results of these analyses with traditional outputs of differential gene expression analyses and identified two promising candidates for the neural regulation of the waggle dance: boss and hnRNP A1. Overall, our study demonstrates the application of Machine Learning to analyse transcriptomics data and identify candidate genes underlying social behaviour. This approach has great potential for application to a wide range of different scenarios in evolutionary ecology, when investigating the genomic basis for complex phenotypic traits and can present some clear advantages compared to the established tools of gene expression analysis, making it a valuable complement for future studies.
Affiliation(s)
- Marcell Veiner
- The School of Natural and Computing Sciences, University of Aberdeen, Aberdeen, Scotland, UK
| | - Juliano Morimoto
- The School of Biological Sciences, University of Aberdeen, Aberdeen, Scotland, UK
| | - Ellouise Leadbeater
- School of Biological Sciences, Royal Holloway University of London, Egham, Surrey, UK
| | - Fabio Manfredini
- The School of Biological Sciences, University of Aberdeen, Aberdeen, Scotland, UK; School of Biological Sciences, Royal Holloway University of London, Egham, Surrey, UK
| |
|
33
|
Wang Y, Gao X, Ru X, Sun P, Wang J. A hybrid feature selection algorithm and its application in bioinformatics. PeerJ Comput Sci 2022; 8:e933. [PMID: 35494789 PMCID: PMC9044222 DOI: 10.7717/peerj-cs.933] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 01/21/2022] [Accepted: 03/03/2022] [Indexed: 06/14/2023]
Abstract
Feature selection is an independent technology for high-dimensional datasets that has been widely applied in a variety of fields. With the vast expansion of information, such as bioinformatics data, there has been an urgent need to investigate more effective and accurate methods involving feature selection in recent decades. Here, we proposed the hybrid MMPSO method, which combines a feature ranking method with a heuristic search method to obtain an optimal subset that yields higher classification accuracy. In this study, ten datasets obtained from the UCI Machine Learning Repository were analyzed to demonstrate the superiority of our method. The MMPSO algorithm outperformed other algorithms in terms of classification accuracy while utilizing the same number of features. Then we applied the method to a biological dataset containing gene expression information about liver hepatocellular carcinoma (LIHC) samples obtained from The Cancer Genome Atlas (TCGA) and Genotype-Tissue Expression (GTEx). On the basis of the MMPSO algorithm, we identified an 18-gene signature that performed well in distinguishing normal samples from tumours. Nine of the 18 differentially expressed genes were significantly up-regulated in LIHC tumour samples, and the area under the curve (AUC) of a combination of seven genes (ADRA2B, ERAP2, NPC1L1, PLVAP, POMC, PYROXD2, TRIM29) in distinguishing tumours from normal samples was greater than 0.99. Six genes (ADRA2B, PYROXD2, CACHD1, FKBP1B, PRKD1 and RPL7AP6) were significantly correlated with survival time. The MMPSO algorithm can be used to effectively extract features from a high-dimensional dataset, which will provide new clues for identifying biomarkers or therapeutic targets from biological data and more perspectives in tumour research.
Affiliation(s)
- Yangyang Wang
- School of Electronics and Information, Northwestern Polytechnical University, Xi’an, Shaanxi, China
| | - Xiaoguang Gao
- School of Electronics and Information, Northwestern Polytechnical University, Xi’an, Shaanxi, China
| | - Xinxin Ru
- School of Electronics and Information, Northwestern Polytechnical University, Xi’an, Shaanxi, China
| | - Pengzhan Sun
- School of Electronics and Information, Northwestern Polytechnical University, Xi’an, Shaanxi, China
| | - Jihan Wang
- Institute of Medical Research, Northwestern Polytechnical University, Xi’an, Shaanxi, China
| |
|
34
|
Wang L, Sourina O, Erdt M, Wang Y, Chang Q. Machine learning methods for bio-medical image and signal processing: Recent advances. Methods 2022; 202:1-2. [DOI: 10.1016/j.ymeth.2022.03.009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022] Open
|
35
|
Li F, Zhou Y, Zhang Y, Yin J, Qiu Y, Gao J, Zhu F. POSREG: proteomic signature discovered by simultaneously optimizing its reproducibility and generalizability. Brief Bioinform 2022; 23:6532538. [PMID: 35183059 DOI: 10.1093/bib/bbac040] [Citation(s) in RCA: 91] [Impact Index Per Article: 30.3] [Received: 11/30/2021] [Revised: 01/21/2022] [Accepted: 01/27/2022] [Indexed: 12/17/2022] Open
Abstract
The mass spectrometry-based proteomic technique has become indispensable in the current exploration of complex and dynamic biological processes. Instrument development has largely ensured the effective production of proteomic data, which necessitates commensurate advances in the statistical framework used to discover the optimal proteomic signature. The current framework mainly emphasizes the generalizability of the identified signature in predicting independent data but neglects the reproducibility among signatures identified from independently repeated trials on different sub-datasets. These problems have seriously restricted the wide application of the proteomic technique in molecular biology and other related directions. Thus, it is crucial to enable the generalizable and reproducible discovery of the proteomic signature with the subsequent indication of phenotype association. However, no such tool has yet been developed. Herein, an online tool, POSREG, was constructed to identify the optimal signature for a set of proteomic data. It works by (i) identifying proteomic signatures of good reproducibility and aggregating them into an ensemble feature ranking by ensemble learning, (ii) assessing the generalizability of the ensemble feature ranking to acquire the optimal signature and (iii) indicating the phenotype association of the discovered signature. POSREG is unique in its capacity to discover the proteomic signature by simultaneously optimizing its reproducibility and generalizability. It is now accessible free of charge without any registration or login requirement at https://idrblab.org/posreg/.
Affiliation(s)
- Fengcheng Li
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
| | - Ying Zhou
- State Key Laboratory for Diagnosis and Treatment of Infectious Disease, Collaborative Innovation Center for Diagnosis and Treatment of Infectious Diseases, Zhejiang Provincial Key Laboratory for Drug Clinical Research and Evaluation, The First Affiliated Hospital, Zhejiang University, Hangzhou, Zhejiang 310000, China
| | - Ying Zhang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
| | - Jiayi Yin
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
| | - Yunqing Qiu
- State Key Laboratory for Diagnosis and Treatment of Infectious Disease, Collaborative Innovation Center for Diagnosis and Treatment of Infectious Diseases, Zhejiang Provincial Key Laboratory for Drug Clinical Research and Evaluation, The First Affiliated Hospital, Zhejiang University, Hangzhou, Zhejiang 310000, China
| | - Jianqing Gao
- Westlake Laboratory of Life Sciences and Biomedicine, Hangzhou, Zhejiang, China
| | - Feng Zhu
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
| |
|
36
|
Xu P, Chen H, Li M, Lu W. New Opportunity: Machine Learning for Polymer Materials Design and Discovery. Advanced Theory and Simulations 2022. [DOI: 10.1002/adts.202100565] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/13/2022]
Affiliation(s)
- Pengcheng Xu
- Materials Genome Institute, Shanghai University, Shanghai 200444, China
| | - Huimin Chen
- Department of Mathematics, College of Sciences, Shanghai University, Shanghai 200444, China
| | - Minjie Li
- Department of Chemistry, College of Sciences, Shanghai University, Shanghai 200444, China
| | - Wencong Lu
- Materials Genome Institute, Shanghai University, Shanghai 200444, China
- Department of Chemistry, College of Sciences, Shanghai University, Shanghai 200444, China
| |
|
37
|
Martínez-García M, Hernández-Lemus E. Data Integration Challenges for Machine Learning in Precision Medicine. Front Med (Lausanne) 2022; 8:784455. [PMID: 35145977 PMCID: PMC8821900 DOI: 10.3389/fmed.2021.784455] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 09/27/2021] [Accepted: 12/28/2021] [Indexed: 12/19/2022] Open
Abstract
A main goal of Precision Medicine is to incorporate and integrate the vast corpora held in different databases about the molecular and environmental origins of disease into analytic frameworks, allowing the development of individualized, context-dependent diagnostics and therapeutic approaches. In this regard, artificial intelligence and machine learning approaches can be used to build analytical models of complex disease aimed at prediction of personalized health conditions and outcomes. Such models must handle the wide heterogeneity of individuals in both their genetic predisposition and their social and environmental determinants. Computational approaches to medicine need to be able to efficiently manage, visualize and integrate large datasets combining structured and unstructured formats. This needs to be done while constrained by different levels of confidentiality, ideally within a unified analytical architecture. Efficient data integration and management is key to the successful application of computational intelligence approaches to medicine. A number of challenges arise in designing successful approaches to medical data analytics under the currently demanding performance conditions of personalized medicine, while also subject to time, computational power, and bioethical constraints. Here, we will review some of these constraints and discuss possible avenues to overcome current challenges.
Affiliation(s)
- Mireya Martínez-García
- Clinical Research Division, National Institute of Cardiology ‘Ignacio Chávez’, Mexico City, Mexico
| | - Enrique Hernández-Lemus
- Computational Genomics Division, National Institute of Genomic Medicine (INMEGEN), Mexico City, Mexico
- Center for Complexity Sciences, Universidad Nacional Autónoma de México, Mexico City, Mexico
| |
|
38
|
Yadav NS, Kumar P, Singh I. Structural and functional analysis of protein. Bioinformatics 2022. [DOI: 10.1016/b978-0-323-89775-4.00026-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022] Open
|
39
|
Xicota L, De Toma I, Maffioletti E, Pisanu C, Squassina A, Baune BT, Potier MC, Stacey D, Dierssen M. Recommendations for pharmacotranscriptomic profiling of drug response in CNS disorders. Eur Neuropsychopharmacol 2022; 54:41-53. [PMID: 34743061 DOI: 10.1016/j.euroneuro.2021.10.005] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 10/12/2020] [Revised: 08/16/2021] [Accepted: 10/08/2021] [Indexed: 12/13/2022]
Abstract
Pharmacotranscriptomics is still a very new field of research that has just begun to flourish and promises to enable target discovery, inform biomarker and evaluate drug efficacy beyond pharmacogenomics. The aim of this review is to provide a critical overview of the biological foundations of transcriptomics, methodological approaches to transcriptomic studies, and their advantages and limitations. We present the different RNA species (rRNAs, tRNAs, mtRNAs, snRNAs, scRNAs, mRNAs, ncRNAs, LINE and SINE transcripts, circular RNAs, piRNAs, miRNAs, snoRNAs) and their potential for pharmacotranscriptomic studies as markers to predict treatment response in neurological and psychiatric disorders. We also review the accessible sources of RNA in patients: peripheral blood cells (including platelets), plasma, microvesicles, exosomes, and apoptotic bodies, and how those affect the integrity and relative abundances of RNAs and reflect the situation in the Central Nervous System (CNS). Finally, we discuss the suitability and indications of different techniques, such as microarrays and RNA-sequencing (RNA-Seq), to understand gene expression differences or to reveal variation in expression levels of coding and non-coding genes. We conclude with some recommendations for future directions, e.g., gaps of knowledge and particular RNAs/tissues that have been overlooked.
Affiliation(s)
- Laura Xicota
- Paris Brain Institute, CNRS UMR7225, INSERM U1127, UPMC, Hôpital de la Pitié-Salpêtrière, Paris, France
| | - Ilario De Toma
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
| | - Elisabetta Maffioletti
- Genetics Unit, IRCCS Istituto Centro San Giovanni di Dio Fatebenefratelli, Brescia, Italy
| | - Claudia Pisanu
- Department of Biomedical Sciences, Section of Neuroscience and Clinical Pharmacology, University of Cagliari, Cagliari, Italy
| | - Alessio Squassina
- Department of Biomedical Sciences, Section of Neuroscience and Clinical Pharmacology, University of Cagliari, Cagliari, Italy; Department of Psychiatry, Dalhousie University, Halifax, NS, Canada
| | - Bernhard T Baune
- Department of Psychiatry, University of Muenster, Muenster, Germany; Department of Psychiatry, Melbourne Medical School, The University of Melbourne, Melbourne, Australia; The Florey Institute of Neuroscience and Mental Health, The University of Melbourne, Parkville, VIC, Australia
| | - Marie Claude Potier
- Paris Brain Institute, CNRS UMR7225, INSERM U1127, UPMC, Hôpital de la Pitié-Salpêtrière, Paris, France
| | - David Stacey
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, United Kingdom
| | - Mara Dierssen
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain; Biomedical Research Networking Center on Rare Diseases (CIBERER), Institute of Health Carlos III, Madrid, Spain.
| | | |
|
40
|
Whiteside TL. Tumor-Infiltrating Lymphocytes and Their Role in Solid Tumor Progression. Experientia Supplementum (2012) 2022; 113:89-106. [PMID: 35165861 PMCID: PMC9113058 DOI: 10.1007/978-3-030-91311-3_3] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Indexed: 12/13/2022]
Abstract
Tumor-infiltrating lymphocytes (TIL) are an important component of the tumor environment. Their role in tumor growth and progression has been debated for decades. Today, emphasis has shifted to beneficial effects of TIL for the host and to therapies optimizing the benefits by reducing immune suppression in the tumor microenvironment. Evidence indicates that when TILs are present in the tumor as dense aggregates of activated immune cells, tumor prognosis and responses to therapy are favorable. Gene signatures and protein profiling of TIL at the population and single-cell levels provide clues not only about their phenotype and numbers but also about TIL potential functions in the tumor. Correlations of the TIL data with clinicopathological tumor characteristics, clinical outcome, and patients' survival indicate that TILs exert influence on the disease progression, especially in colorectal carcinomas and breast cancer. At the same time, the recognition that TIL signatures vary with time and cancer progression has initiated investigations of TIL as potential prognostic biomarkers. Multiple mechanisms are utilized by tumors to subvert the host immune system. The balance between pro- and antitumor responses of TIL largely depends on the tumor microenvironment, which is unique in each cancer patient. This balance is orchestrated by the tumor and thus is shifted toward the promotion of tumor growth. Changes occurring in TIL during tumor progression appear to serve as a measure of tumor aggressiveness and potentially provide a key to selecting therapeutic strategies and inform about prognosis.
Affiliation(s)
- Theresa L Whiteside
- Departments of Pathology and Immunology, University of Pittsburgh School of Medicine, UPMC Hillman Cancer Center, Pittsburgh, PA, USA.
| |
|
41
|
Wang H, Xie X, Zhu J, Qi S, Xie J. Comprehensive analysis identifies IFI16 as a novel signature associated with overall survival and immune infiltration of skin cutaneous melanoma. Cancer Cell Int 2021; 21:694. [PMID: 34930258 PMCID: PMC8690488 DOI: 10.1186/s12935-021-02409-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/10/2021] [Accepted: 12/13/2021] [Indexed: 11/13/2022] Open
Abstract
Background Skin cutaneous melanoma (SKCM) is the most common skin tumor with high mortality. The unfavorable outcome of SKCM urges the discovery of prognostic biomarkers for accurate therapy. The present study aimed to explore novel prognosis-related signatures of SKCM and determine the significance of immune cell infiltration in this pathology. Methods Four gene expression profiles (GSE130244, GSE3189, GSE7553 and GSE46517) of SKCM and normal skin samples were retrieved from the GEO database. Differentially expressed genes (DEGs) were then screened, and the feature genes were identified by LASSO regression and the Boruta algorithm. Survival analysis was performed to filter the potential prognostic signature, and GEPIA was used for preliminary validation. The area under the receiver operating characteristic curve (AUC) was obtained to evaluate discriminatory ability. Gene Set Variation Analysis (GSVA) was performed, and the composition of the immune cell infiltration in SKCM was estimated using CIBERSORT. Finally, paraffin-embedded specimens of primary SKCM and normal skin tissues were collected, and the signature was validated by fluorescence in situ hybridization (FISH) and immunohistochemistry (IHC). Results In total, 823 DEGs and 16 feature genes were screened. IFI16 was identified as the signature associated with overall survival of SKCM with a great discriminatory ability (AUC > 0.9 for all datasets). GSVA indicated that IFI16 might be involved in apoptosis and ultraviolet response in SKCM, and immune cell infiltration of IFI16 was evaluated. Finally, FISH and IHC both validated the differential expression of IFI16 in SKCM. Conclusions In conclusion, our comprehensive analysis identified IFI16 as a signature associated with overall survival and immune infiltration of SKCM, which may play a critical role in the occurrence and development of SKCM.
Supplementary Information The online version contains supplementary material available at 10.1186/s12935-021-02409-6.
|
42
|
Shu L, Huang K, Jiang W, Wu W, Liu H. Feature selection using autoencoders with Bayesian methods to high-dimensional data. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS 2021. [DOI: 10.3233/jifs-211348] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
Abstract
Using real-world data directly in machine learning tasks easily leads to poor generalization, since such data are usually high-dimensional and limited. By learning low-dimensional representations of high-dimensional data, feature selection can retain useful features for machine learning tasks, and using these features trains machine learning models effectively. Feature selection from high-dimensional data is therefore a challenge. To address this issue, this paper proposes a novel hybrid feature selection approach consisting of an autoencoder and Bayesian methods. First, Bayesian methods are embedded in the proposed autoencoder as a special hidden layer, with the aim of increasing precision when selecting non-redundant features. Then, the other hidden layers of the autoencoder are used for non-redundant feature selection. Finally, in a comparison with mainstream feature selection approaches, the proposed method outperforms them. We find that combining autoencoders with probabilistic correction methods is more meaningful for feature selection than stacking architectures or adding constraints to autoencoders. We also demonstrate that stacked autoencoders are more suitable for large-scale feature selection, whereas sparse autoencoders are beneficial when selecting a smaller number of features. The proposed method provides a theoretical reference for analyzing the optimality of feature selection.
Affiliation(s)
- Lei Shu
- Chongqing Aerospace Polytechnic, Chongqing, China
| | - Kun Huang
- Urban Vocational College of Sichuan, P.R. China
| | - Wenhao Jiang
- Chongqing Aerospace Polytechnic, Chongqing, China
| | - Wenming Wu
- Chongqing Aerospace Polytechnic, Chongqing, China
| | - Hongling Liu
- Chongqing Aerospace Polytechnic, Chongqing, China
| |
|
43
|
R. P. S. M, A. M. K. Big data feature selection using fish and frog optimization. Comput Intell 2021. [DOI: 10.1111/coin.12483] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
Affiliation(s)
- Manikandan R. P. S.
- Department of Information Technology Sri Shakthi Institute of Engineering & Technology Coimbatore India
| | - Kalpana A. M.
- Department of CSE Government College of Engineering Salem India
| |
|
44
|
Protein function prediction using functional inter-relationship. Comput Biol Chem 2021; 95:107593. [PMID: 34736126 DOI: 10.1016/j.compbiolchem.2021.107593] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/21/2021] [Revised: 08/25/2021] [Accepted: 10/03/2021] [Indexed: 11/23/2022]
Abstract
With the growth of high-throughput sequencing techniques, the generation of protein sequences has become fast and cheap, leading to a huge increase in the number of known proteins. However, it is challenging to identify the functions being performed by these newly discovered proteins. Machine learning techniques have improved traditional methods' efficiency by suggesting relevant functions but fail to perform well when the number of functions to be predicted becomes large. In this work, we propose a machine learning-based approach to predict a huge set of protein functions that uses the inter-relationships between functions to improve the model's predictability. These inter-relationships between functions are used to reduce the redundancy caused by highly correlated functions. The proposed model is trained on the reduced set of non-redundant functions, limiting the ambiguity caused by inter-related functions. Here, we use two statistical approaches, 1) Pearson's correlation coefficient and 2) the Jaccard similarity coefficient, as measures of correlation to remove redundant functions. To have a fair evaluation of the proposed model, we recreate our original function set by inverse transforming the reduced set using the two proposed approaches: Direct mapping and Ensemble approach. The model is tested using different feature sets and function sets of biological processes and molecular functions to get promising results on the DeepGO and CAFA3 datasets. The proposed model is able to predict specific functions for the test data which were unpredictable by other compared methods. The experimental models, code and other relevant data are available at https://github.com/richadhanuka/PFP-using-Functional-interrelationship.
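As a rough illustration of the redundancy-reduction idea described in this abstract, the Python sketch below drops functions whose annotation sets overlap heavily under the Jaccard similarity coefficient. It is not the authors' implementation, which is in the repository linked above. The names `jaccard` and `drop_redundant`, the greedy most-frequent-first ordering, the 0.75 threshold, and the toy annotations are all assumptions made for this example.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| (0.0 by convention if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def drop_redundant(annotations: Dict[str, Set[str]], threshold: float = 0.75) -> List[str]:
    """Greedily keep a function only if it is not too similar to any kept one.

    `annotations` maps a function label to the set of proteins annotated with it.
    The ordering rule and the threshold are illustrative assumptions.
    """
    kept: List[str] = []
    # Visit functions with more annotated proteins first (ties broken by name),
    # so the broader function becomes the representative of a redundant group.
    order = sorted(annotations, key=lambda name: (-len(annotations[name]), name))
    for name in order:
        if all(jaccard(annotations[name], annotations[k]) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy data (hypothetical): F1 and F2 share 4 of 5 proteins, F3 is unrelated.
toy = {
    "F1": {"p1", "p2", "p3", "p4"},
    "F2": {"p1", "p2", "p3", "p4", "p5"},
    "F3": {"p6", "p7"},
}
print(drop_redundant(toy, threshold=0.75))
```

With the toy data, F1 is treated as redundant with F2 because their similarity of 0.8 is not below the threshold. Raising the threshold to 0.9 keeps all three functions.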
|
45
|
da Silva PN, Plastino A, Fabris F, Freitas AA. A Novel Feature Selection Method for Uncertain Features: An Application to the Prediction of Pro-/Anti-Longevity Genes. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:2230-2238. [PMID: 32324561 DOI: 10.1109/tcbb.2020.2988450] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
Understanding the ageing process is a very challenging problem for biologists. To help in this task, there has been a growing use of classification methods (from machine learning) to learn models that predict whether a gene influences the process of ageing or promotes longevity. One type of predictive feature often used for learning such classification models is Protein-Protein Interaction (PPI) features. One important property of PPI features is their uncertainty, i.e., a given feature (PPI annotation) is often associated with a confidence score, which is usually ignored by conventional classification methods. Hence, we propose the Lazy Feature Selection for Uncertain Features (LFSUF) method, which is tailored for coping with the uncertainty in PPI confidence scores. In addition, following the lazy learning paradigm, LFSUF selects features for each instance to be classified, making the feature selection process more flexible. We show that our LFSUF method achieves better predictive accuracy when compared to other feature selection methods that either do not explicitly take PPI confidence scores into account or deal with uncertainty globally rather than using a per-instance approach. Also, we interpret the results of the classification process using the features selected by LFSUF, showing that the number of selected features is significantly reduced, assisting the interpretability of the results. The datasets used in the experiments and the program code of the LFSUF method are freely available on the web at http://github.com/pablonsilva/FSforUncertainFeatureSpaces.
|
46
|
Wang M, Liu W, Chen M, Huang X, Han W. A band selection approach based on a modified gray wolf optimizer and weight updating of bands for hyperspectral image. Appl Soft Comput 2021. [DOI: 10.1016/j.asoc.2021.107805] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/20/2022]
|
47
|
Evaluation of Feature Selection Methods for Mammographic Breast Cancer Diagnosis in a Unified Framework. BIOMED RESEARCH INTERNATIONAL 2021; 2021:6079163. [PMID: 34646886 PMCID: PMC8505067 DOI: 10.1155/2021/6079163] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 06/17/2020] [Revised: 07/10/2020] [Accepted: 07/18/2020] [Indexed: 11/17/2022]
Abstract
Over recent years, feature selection (FS) has gained more attention in intelligent diagnosis. This study is aimed at evaluating FS methods in a unified framework for mammographic breast cancer diagnosis. After FS methods generated rank lists according to feature importance, the framework added features incrementally as the input of random forest which performed as the classifier for breast lesion classification. In this study, 10 FS methods were evaluated and the digital database for screening mammography (1104 benign and 980 malignant lesions) was analyzed. The classification performance was quantified with the area under the curve (AUC), and accuracy, sensitivity, and specificity were also considered. Experimental results suggested that both infinite latent FS method (AUC, 0.866 ± 0.028) and RELIEFF (AUC, 0.855 ± 0.020) achieved good prediction (AUC ≥ 0.85) when 6 features were used, followed by correlation-based FS method (AUC, 0.867 ± 0.023) using 7 features and WILCOXON (AUC, 0.887 ± 0.019) using 8 features. The reliability of the diagnosis models was also verified, indicating that correlation-based FS method was generally superior over other methods. Identification of discriminative features among high-throughput ones remains an unavoidable challenge in intelligent diagnosis, and extra efforts should be made toward accurate and efficient feature selection.
|
48
|
Li Q, Xie W, Li L, Wang L, You Q, Chen L, Li J, Ke Y, Fang J, Liu L, Hong H. Development and Validation of a Prediction Model for Elevated Arterial Stiffness in Chinese Patients With Diabetes Using Machine Learning. Front Physiol 2021; 12:714195. [PMID: 34497538 PMCID: PMC8419456 DOI: 10.3389/fphys.2021.714195] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/26/2021] [Accepted: 07/31/2021] [Indexed: 01/21/2023] Open
Abstract
Background Arterial stiffness assessed by pulse wave velocity is a major risk factor for cardiovascular diseases. The incidence of cardiovascular events remains high in diabetics. However, a clinical prediction model for elevated arterial stiffness using machine learning to identify subjects consequently at higher risk remains to be developed. Methods Least absolute shrinkage and selection operator and support vector machine-recursive feature elimination were used for feature selection. Four machine learning algorithms were used to construct a prediction model, and their performance was compared based on the area under the receiver operating characteristic curve metric in a discovery dataset (n = 760). The model with the best performance was selected and validated in an independent dataset (n = 912) from the Dryad Digital Repository (https://doi.org/10.5061/dryad.m484p). To apply our model to clinical practice, we built a free and user-friendly web online tool. Results The predictive model includes the predictors: age, systolic blood pressure, diastolic blood pressure, and body mass index. In the discovery cohort, the gradient boosting-based model outperformed other methods in the elevated arterial stiffness prediction. In the validation cohort, the gradient boosting model showed a good discrimination capacity. A cutoff value of 0.46 for the elevated arterial stiffness risk score in the gradient boosting model resulted in a good specificity (0.813 in the discovery data and 0.761 in the validation data) and sensitivity (0.875 and 0.738, respectively) trade-off points. Conclusion The gradient boosting-based prediction system presents a good classification in elevated arterial stiffness prediction. The web online tool makes our gradient boosting-based model easily accessible for further clinical studies and utilization.
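To make the cutoff-based trade-off in this abstract concrete, the Python sketch below computes sensitivity and specificity for a risk score at a given cutoff. It only illustrates the two metrics and is not the authors' model or data. The function name `sensitivity_specificity`, the choice to count a score equal to the cutoff as positive, and the toy scores and labels are assumptions made for this example. Only the cutoff value 0.46 comes from the abstract.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    scores: Sequence[float], labels: Sequence[int], cutoff: float
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) when score >= cutoff is called positive.

    `labels` uses 1 for the positive class and 0 for the negative class.
    Treating a score equal to the cutoff as positive is an assumption.
    """
    tp = fn = tn = fp = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= cutoff
        if label == 1:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_positive:
                fp += 1
            else:
                tn += 1
    # Guard against division by zero when one class is absent.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy risk scores and labels (hypothetical, not from the study).
toy_scores = [0.9, 0.5, 0.3, 0.6, 0.7, 0.2, 0.1, 0.4]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(sensitivity_specificity(toy_scores, toy_labels, cutoff=0.46))
```

Sweeping `cutoff` over a grid and recording both values would trace out the points from which a trade-off such as the one reported above is chosen.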
Affiliation(s)
- Qingqing Li
- Fujian Key Laboratory of Vascular Aging, Department of Geriatrics, Department of Cardiology, Department of Cardiac Surgery, Fujian Heart Disease Center, Fujian Institute of Geriatrics, Fujian Medical University Union Hospital, Fuzhou, China
| | - Wenhui Xie
- Fujian Key Laboratory of Vascular Aging, Department of Geriatrics, Department of Cardiology, Department of Cardiac Surgery, Fujian Heart Disease Center, Fujian Institute of Geriatrics, Fujian Medical University Union Hospital, Fuzhou, China
| | - Liping Li
- Fujian Key Laboratory of Vascular Aging, Department of Geriatrics, Department of Cardiology, Department of Cardiac Surgery, Fujian Heart Disease Center, Fujian Institute of Geriatrics, Fujian Medical University Union Hospital, Fuzhou, China
| | - Lijing Wang
- Department of Endocrinology, Fujian Medical University Union Hospital, Fuzhou, China
| | - Qinyi You
- Fujian Key Laboratory of Vascular Aging, Department of Geriatrics, Department of Cardiology, Department of Cardiac Surgery, Fujian Heart Disease Center, Fujian Institute of Geriatrics, Fujian Medical University Union Hospital, Fuzhou, China
| | - Lu Chen
- Fujian Key Laboratory of Vascular Aging, Department of Geriatrics, Department of Cardiology, Department of Cardiac Surgery, Fujian Heart Disease Center, Fujian Institute of Geriatrics, Fujian Medical University Union Hospital, Fuzhou, China
| | - Jing Li
- Fujian Key Laboratory of Vascular Aging, Department of Geriatrics, Department of Cardiology, Department of Cardiac Surgery, Fujian Heart Disease Center, Fujian Institute of Geriatrics, Fujian Medical University Union Hospital, Fuzhou, China
| | - Yilang Ke
- Fujian Key Laboratory of Vascular Aging, Department of Geriatrics, Department of Cardiology, Department of Cardiac Surgery, Fujian Heart Disease Center, Fujian Institute of Geriatrics, Fujian Medical University Union Hospital, Fuzhou, China
| | - Jun Fang
- Fujian Key Laboratory of Vascular Aging, Department of Geriatrics, Department of Cardiology, Department of Cardiac Surgery, Fujian Heart Disease Center, Fujian Institute of Geriatrics, Fujian Medical University Union Hospital, Fuzhou, China
| | - Libin Liu
- Department of Endocrinology, Fujian Medical University Union Hospital, Fuzhou, China
| | - Huashan Hong
- Fujian Key Laboratory of Vascular Aging, Department of Geriatrics, Department of Cardiology, Department of Cardiac Surgery, Fujian Heart Disease Center, Fujian Institute of Geriatrics, Fujian Medical University Union Hospital, Fuzhou, China
| |
|
49
|
Mohamad M, Selamat A, Subroto IM, Krejcar O. Improving the classification performance on imbalanced data sets via new hybrid parameterisation model. JOURNAL OF KING SAUD UNIVERSITY - COMPUTER AND INFORMATION SCIENCES 2021. [DOI: 10.1016/j.jksuci.2019.04.009] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/05/2023]
|
50
|
Li Q, Xie W, Li L, Wang L, You Q, Chen L, Li J, Ke Y, Fang J, Liu L, Hong H. Development and Validation of a Prediction Model for Elevated Arterial Stiffness in Chinese Patients With Diabetes Using Machine Learning. Front Physiol 2021. [DOI: 10.3389/fphys.2021.714195] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/16/2023] Open
Abstract
Background Arterial stiffness assessed by pulse wave velocity is a major risk factor for cardiovascular diseases. The incidence of cardiovascular events remains high in diabetics. However, a clinical prediction model for elevated arterial stiffness using machine learning to identify subjects consequently at higher risk remains to be developed. Methods Least absolute shrinkage and selection operator and support vector machine-recursive feature elimination were used for feature selection. Four machine learning algorithms were used to construct a prediction model, and their performance was compared based on the area under the receiver operating characteristic curve metric in a discovery dataset (n = 760). The model with the best performance was selected and validated in an independent dataset (n = 912) from the Dryad Digital Repository (https://doi.org/10.5061/dryad.m484p). To apply our model to clinical practice, we built a free and user-friendly web online tool. Results The predictive model includes the predictors: age, systolic blood pressure, diastolic blood pressure, and body mass index. In the discovery cohort, the gradient boosting-based model outperformed other methods in the elevated arterial stiffness prediction. In the validation cohort, the gradient boosting model showed a good discrimination capacity. A cutoff value of 0.46 for the elevated arterial stiffness risk score in the gradient boosting model resulted in a good specificity (0.813 in the discovery data and 0.761 in the validation data) and sensitivity (0.875 and 0.738, respectively) trade-off points. Conclusion The gradient boosting-based prediction system presents a good classification in elevated arterial stiffness prediction. The web online tool makes our gradient boosting-based model easily accessible for further clinical studies and utilization.
|