1
Zhao H, Qiu S, Bai M, Wang L, Wang Z. Toxicity prediction and classification of Gunqile-7 with small sample based on transfer learning method. Comput Biol Med 2024; 173:108348. [PMID: 38531249 DOI: 10.1016/j.compbiomed.2024.108348] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2023] [Revised: 03/10/2024] [Accepted: 03/17/2024] [Indexed: 03/28/2024]
Abstract
Drug-induced diseases are the most important component of iatrogenic disease. It is the duty of doctors to provide a reasonable and safe dose of medication. Gunqile-7 is a Mongolian medicine with analgesic and anti-inflammatory effects. As a foreign substance in the body, even with reasonable medication, it may produce varying degrees of adverse reactions or toxic side effects. Since the cost of collecting Gunqile-7 for pharmacological animal trials is high and the data sample is small, this paper employs transfer learning and data augmentation methods to study the toxicity of Gunqile-7. More specifically, to reduce the necessary number of training samples, the data augmentation approach is employed to extend the data set. Then, the transfer learning method and one-dimensional convolutional neural network are utilized to train the network. In addition, we use the support vector machine-recursive feature elimination method for feature selection to reduce features that have adverse effects on model predictions. Furthermore, due to the important role of the pre-trained model of transfer learning, we select a quantitative toxicity prediction model as the pre-trained model, which is consistent with the purpose of this paper. Lastly, the experimental results demonstrate the efficiency of the proposed method. Our method can improve accuracy by up to 9 percentage points compared to the method without transfer learning on a small sample set.
Affiliation(s)
- Hongkai Zhao
- Key Laboratory of Intelligent Control and Optimization for Industrial Equipment of Ministry of Education, Dalian University of Technology, Dalian 116024, China; School of Control Science and Engineering, Dalian University of Technology, Dalian 116024, China.
- Sen Qiu
- Key Laboratory of Intelligent Control and Optimization for Industrial Equipment of Ministry of Education, Dalian University of Technology, Dalian 116024, China; School of Control Science and Engineering, Dalian University of Technology, Dalian 116024, China.
- Meirong Bai
- Key Laboratory of Ministry of Education of Mongolian Medicine R&D Engineering, Inner Mongolia Minzu University, Tongliao 028000, China.
- Luyao Wang
- Key Laboratory of Intelligent Control and Optimization for Industrial Equipment of Ministry of Education, Dalian University of Technology, Dalian 116024, China; School of Control Science and Engineering, Dalian University of Technology, Dalian 116024, China.
- Zhelong Wang
- Key Laboratory of Intelligent Control and Optimization for Industrial Equipment of Ministry of Education, Dalian University of Technology, Dalian 116024, China; School of Control Science and Engineering, Dalian University of Technology, Dalian 116024, China.
2
Li W, Chi Y, Yu K, Xie W. A two-stage hybrid biomarker selection method based on ensemble filter and binary differential evolution incorporating binary African vultures optimization. BMC Bioinformatics 2023; 24:130. [PMID: 37016297 PMCID: PMC10072044 DOI: 10.1186/s12859-023-05247-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/26/2022] [Accepted: 03/21/2023] [Indexed: 04/06/2023]
Abstract
BACKGROUND In genomics and personalized medicine, a key issue is finding biomarkers directly related to the diagnosis of specific diseases in high-throughput gene microarray data. Feature selection technology can discover biomarkers that carry disease classification information. RESULTS We use support vector machines as classifiers and use the five-fold cross-validation average classification accuracy, recall, precision and F1 score as evaluation metrics to evaluate the identified biomarkers. Experimental results show classification accuracy above 0.93, recall above 0.92, precision above 0.91, and F1 score above 0.94 on eight microarray datasets. METHOD This paper proposes a two-stage hybrid biomarker selection method based on ensemble filter and binary differential evolution incorporating binary African vultures optimization (EF-BDBA), which can effectively reduce the dimension of microarray data and obtain optimal biomarkers. In the first stage, we propose an ensemble filter feature selection method that combines an improved fast correlation-based filter algorithm with the Fisher score. Obviously redundant and irrelevant features can thus be filtered out to initially reduce the dimensionality of the microarray data. In the second stage, the optimal feature subset is selected using an improved binary differential evolution incorporating an improved binary African vultures optimization algorithm. The African vultures optimization algorithm has excellent global optimization ability, but it has not been systematically applied to feature selection problems, especially for gene microarray data. We combine it with a differential evolution algorithm to improve population diversity. CONCLUSION Compared with traditional feature selection methods and advanced hybrid methods, the proposed method achieves higher classification accuracy and identifies excellent biomarkers while retaining fewer features. The experimental results demonstrate the effectiveness and advancement of our proposed algorithmic model.
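The first-stage filter mentioned in this abstract relies in part on the Fisher score. The following is a minimal sketch of the standard Fisher score only, not the authors' ensemble filter or their improved fast correlation-based filter; the function names `fisher_scores` and `top_k_features` and the toy data are illustrative assumptions.

```python
from statistics import fmean


def fisher_scores(X, y):
    """Return one Fisher score per column of X (a list of rows) for labels y."""
    classes = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        overall_mean = fmean(column)
        between = 0.0  # class-size-weighted spread of the class means
        within = 0.0   # class-size-weighted variance inside each class
        for c in classes:
            values = [v for v, label in zip(column, y) if label == c]
            mean_c = fmean(values)
            var_c = fmean([(v - mean_c) ** 2 for v in values])
            between += len(values) * (mean_c - overall_mean) ** 2
            within += len(values) * var_c
        if within > 0:
            scores.append(between / within)
        else:
            # Degenerate column: no within-class variance at all.
            scores.append(float("inf") if between > 0 else 0.0)
    return scores


def top_k_features(X, y, k):
    """Indices of the k features with the highest Fisher score."""
    scores = fisher_scores(X, y)
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]


# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
X_toy = [[1.0, 5.0], [1.2, 3.0], [0.8, 4.0], [3.0, 4.0], [3.2, 5.0], [2.8, 3.0]]
y_toy = [0, 0, 0, 1, 1, 1]
print(top_k_features(X_toy, y_toy, 1))  # [0]
```

A filter of this kind scores each feature independently, which is why the abstract pairs it with a second, subset-level search stage.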
Affiliation(s)
- Wei Li
- Key Laboratory of Intelligent Computing in Medical Image (MIIC), Northeastern University, Ministry of Education, Shenyang, China
- Yuhuan Chi
- School of Computer Science and Engineering, Northeastern University, Shenyang, China
- Kun Yu
- School of Biomedical and Information Engineering, Northeastern University, Shenyang, China
- Weidong Xie
- School of Computer Science and Engineering, Northeastern University, Shenyang, China
3
An in-depth and contrasting survey of meta-heuristic approaches with classical feature selection techniques specific to cervical cancer. Knowl Inf Syst 2023. [DOI: 10.1007/s10115-022-01825-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/22/2023]
4
Chamlal H, Ouaderhman T, Rebbah FE. A hybrid feature selection approach for Microarray datasets using graph theoretic-based method. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2022.10.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
5
Ma XA, Ju C. Fuzzy information-theoretic feature selection via relevance, redundancy, and complementarity criteria. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2022.08.067] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 01/18/2023]
6
A Highly Discriminative Hybrid Feature Selection Algorithm for Cancer Diagnosis. ScientificWorldJournal 2022; 2022:1056490. [PMID: 35983572 PMCID: PMC9381276 DOI: 10.1155/2022/1056490] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2022] [Accepted: 07/20/2022] [Indexed: 11/17/2022]
Abstract
Cancer is a deadly disease that occurs due to rapid and uncontrolled cell growth. In this article, a machine learning (ML) algorithm is proposed to diagnose different cancer diseases from big data. The algorithm comprises a two-stage hybrid feature selection. In the first stage, an overall ranker is initiated to combine the results of three filter-based feature evaluation methods, namely, chi-squared, F-statistic, and mutual information (MI). The features are then ordered according to this combination. In the second stage, the modified wrapper-based sequential forward selection is utilized to discover the optimal feature subset, using ML models such as support vector machine (SVM), decision tree (DT), random forest (RF), and K-nearest neighbor (KNN) classifiers. To examine the proposed algorithm, many tests have been carried out on four cancerous microarray datasets, employing in the process 10-fold cross-validation and hyperparameter tuning. The performance of the algorithm is evaluated by calculating the diagnostic accuracy. The results indicate that for the leukemia dataset, both SVM and KNN models register the highest accuracy at 100% using only 5 features. For the ovarian cancer dataset, the SVM model achieves the highest accuracy at 100% using only 6 features. For the small round blue cell tumor (SRBCT) dataset, the SVM model also achieves the highest accuracy at 100% using only 8 features. For the lung cancer dataset, the SVM model also achieves the highest accuracy at 99.57% using 19 features. By comparing with other algorithms, the results obtained from the proposed algorithm are superior in terms of the number of selected features and diagnostic accuracy.
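The two-stage idea described in this abstract (combine several filter rankings, then grow a subset with a wrapper) can be sketched as follows. This is an illustration under stated assumptions, not the authors' algorithm: the abstract does not say how the three filter results are combined or how sequential forward selection is modified, so the mean-rank aggregation, the single pass over the ranked list, the names `combine_rankings` and `forward_select`, and the toy scores are all hypothetical. In practice `evaluate` would be a cross-validated classifier accuracy.

```python
def combine_rankings(score_lists):
    """Order features by their mean rank across several filter scores.

    Each entry of score_lists holds one score per feature (higher is better).
    """
    n = len(score_lists[0])
    mean_rank = [0.0] * n
    for scores in score_lists:
        order = sorted(range(n), key=lambda j: scores[j], reverse=True)
        for rank, j in enumerate(order):
            mean_rank[j] += rank / len(score_lists)
    return sorted(range(n), key=lambda j: mean_rank[j])


def forward_select(ranked, evaluate):
    """Walk the ranked list once; keep a feature only if the score improves."""
    selected, best = [], float("-inf")
    for j in ranked:
        score = evaluate(selected + [j])
        if score > best:
            selected, best = selected + [j], score
    return selected, best


# Hypothetical filter scores for four features (chi-squared, F-statistic, MI).
chi2 = [0.9, 0.1, 0.5, 0.3]
f_stat = [0.8, 0.2, 0.6, 0.1]
mutual_info = [0.7, 0.3, 0.4, 0.2]
ranked = combine_rankings([chi2, f_stat, mutual_info])

# Stand-in for cross-validated accuracy: per-feature gain minus a size penalty.
GAIN = {0: 0.6, 1: 0.0, 2: 0.3, 3: 0.01}


def toy_evaluate(subset):
    return sum(GAIN[j] for j in subset) - 0.02 * len(subset)


subset, score = forward_select(ranked, toy_evaluate)
print(ranked, subset)  # [0, 2, 1, 3] [0, 2]
```

Walking the features in ranked order keeps the number of wrapper evaluations linear in the number of features, at the cost of never revisiting a rejected feature.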
7
Liang Y, Yang S, Zheng L, Wang H, Zhou J, Huang S, Yang L, Zuo Y. Research progress of reduced amino acid alphabets in protein analysis and prediction. Comput Struct Biotechnol J 2022; 20:3503-3510. [PMID: 35860409 PMCID: PMC9284397 DOI: 10.1016/j.csbj.2022.07.001] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 03/14/2022] [Revised: 06/30/2022] [Accepted: 07/01/2022] [Indexed: 11/29/2022]
Abstract
A comprehensive summary of the literature on the reduced amino acid alphabets. A systematic review of the development history of reduced amino acid alphabets. Rich application cases of amino acid reduction alphabets are described in the article. A detailed analysis of the properties and uses of the reduced amino acid alphabets.
Proteins are the executors of cellular physiological activities, and accurate elucidation of their structure and function is crucial for the refined mapping of proteins. As a feature engineering method, the reduction of amino acid composition is not only an important method for protein structure and function analysis, but also opens a broad horizon for the complex field of machine learning. Representing sequences with fewer amino acid types greatly reduces the dimensionality and noise of traditional feature engineering, and provides more interpretable predictive models in which machine learning can capture key features. In this paper, we systematically review the strategies and methods used to build reduced amino acid (RAA) alphabets, and summarize their main applications in protein sequence alignment, functional classification, and prediction of structural properties. Finally, we give a comprehensive analysis of 672 RAA alphabets from 74 reduction methods.
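To make the idea of a reduced alphabet concrete, the sketch below maps the 20 standard amino acids onto six groups and computes the reduced composition of a sequence. The grouping is an illustrative assumption based on rough physicochemical classes; it is not taken from the review, and the names `GROUPS`, `reduce_sequence`, and `reduced_composition` are hypothetical.

```python
from collections import Counter

# Hypothetical six-letter reduction; each key stands for its whole group.
GROUPS = {
    "A": "AVLIMC",  # aliphatic / small hydrophobic
    "F": "FWYH",    # aromatic
    "S": "STNQ",    # polar, uncharged
    "K": "KR",      # positively charged
    "D": "DE",      # negatively charged
    "G": "GP",      # conformationally special
}

# Invert the grouping into a residue -> reduced letter lookup table.
RESIDUE_TO_GROUP = {res: letter for letter, members in GROUPS.items() for res in members}


def reduce_sequence(sequence):
    """Rewrite a protein sequence in the reduced alphabet ('X' for unknowns)."""
    return "".join(RESIDUE_TO_GROUP.get(res, "X") for res in sequence.upper())


def reduced_composition(sequence):
    """Fraction of each reduced letter in the sequence."""
    reduced = reduce_sequence(sequence)
    counts = Counter(reduced)
    return {letter: count / len(reduced) for letter, count in counts.items()}


print(reduce_sequence("MKVLDE"))  # AKAADD
```

With this grouping a composition vector has at most 7 entries (six groups plus the unknown symbol) instead of 20, which is the dimensionality reduction the abstract refers to.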
Affiliation(s)
- Yuchao Liang
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Siqi Yang
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Lei Zheng
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Hao Wang
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Jian Zhou
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Shenghui Huang
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Lei Yang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
- Corresponding authors.
- Yongchun Zuo
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Corresponding authors.
8
Zaman EAK, Mohamed A, Ahmad A. Feature selection for online streaming high-dimensional data: A state-of-the-art review. Appl Soft Comput 2022. [DOI: 10.1016/j.asoc.2022.109355] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
9
Pudjihartono N, Fadason T, Kempa-Liehr AW, O'Sullivan JM. A Review of Feature Selection Methods for Machine Learning-Based Disease Risk Prediction. Front Bioinform 2022; 2:927312. [PMID: 36304293 PMCID: PMC9580915 DOI: 10.3389/fbinf.2022.927312] [Citation(s) in RCA: 75] [Impact Index Per Article: 37.5] [Received: 04/24/2022] [Accepted: 06/03/2022] [Indexed: 01/14/2023]
Abstract
Machine learning has shown utility in detecting patterns within large, unstructured, and complex datasets. One of the promising applications of machine learning is in precision medicine, where disease risk is predicted using patient genetic data. However, creating an accurate prediction model based on genotype data remains challenging due to the so-called “curse of dimensionality” (i.e., extensively larger number of features compared to the number of samples). Therefore, the generalizability of machine learning models benefits from feature selection, which aims to extract only the most “informative” features and remove noisy “non-informative,” irrelevant and redundant features. In this article, we provide a general overview of the different feature selection methods, their advantages, disadvantages, and use cases, focusing on the detection of relevant features (i.e., SNPs) for disease risk prediction.
Affiliation(s)
- Tayaza Fadason
- Liggins Institute, University of Auckland, Auckland, New Zealand
- Maurice Wilkins Centre for Molecular Biodiscovery, Auckland, New Zealand
- Andreas W. Kempa-Liehr
- Department of Engineering Science, The University of Auckland, Auckland, New Zealand
- *Correspondence: Andreas W. Kempa-Liehr; Justin M. O'Sullivan
- Justin M. O'Sullivan
- Liggins Institute, University of Auckland, Auckland, New Zealand
- Maurice Wilkins Centre for Molecular Biodiscovery, Auckland, New Zealand
- MRC Lifecourse Epidemiology Unit, University of Southampton, Southampton, United Kingdom
- Singapore Institute for Clinical Sciences, Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
- Australian Parkinson’s Mission, Garvan Institute of Medical Research, Sydney, NSW, Australia
- *Correspondence: Andreas W. Kempa-Liehr; Justin M. O'Sullivan
10
A Path-Based Feature Selection Algorithm for Enterprise Credit Risk Evaluation. Comput Intell Neurosci 2022; 2022:7650207. [PMID: 35586103 PMCID: PMC9110157 DOI: 10.1155/2022/7650207] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2022] [Accepted: 04/12/2022] [Indexed: 11/22/2022]
Abstract
In recent years, there has been increasing interest in exploring diversified features to measure the credit risk of small and medium-sized enterprises (SMEs). Path-based features, which reveal logical connections between SMEs, are widely adopted as an informative kind of feature for causal inference in credit risk evaluation. Since thousands of feature paths may lead to the target enterprise, selecting the most informative path-based features for evaluating its credit risk becomes a challenging problem. To solve this problem, we propose a novel feature selection method that considers both the similarity and the importance of features' structured semantics as factors of informativeness. With this, the proposed method can effectively rank conventional and path-based features together. Furthermore, to improve the efficiency of the method, a heuristic algorithm is proposed to quickly search for candidate features. Through extensive experiments, we show that our method performs competitively with other state-of-the-art selection methods.
11
Wu X, Chen H, Li T, Wan J. Semi-supervised feature selection with minimal redundancy based on local adaptive. Appl Intell 2021. [DOI: 10.1007/s10489-021-02288-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 10/21/2022]
12
Females and Males Show Differences in Early-Stage Transcriptomic Biomarkers of Lung Adenocarcinoma and Lung Squamous Cell Carcinoma. Diagnostics (Basel) 2021; 11:diagnostics11020347. [PMID: 33669819 PMCID: PMC7922551 DOI: 10.3390/diagnostics11020347] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/20/2021] [Revised: 02/15/2021] [Accepted: 02/17/2021] [Indexed: 12/25/2022]
Abstract
The incidence and mortality rates of lung cancers differ between females and males. Therefore, sex information should be an important part of how a diagnostic model is trained and optimized. However, most existing studies do not fully utilize this information. This study carried out a comparative investigation between sex-specific models and sex-independent models. Three feature selection algorithms and five classifiers were utilized to evaluate the contribution of the sex information to the detection of early-stage lung cancers. For both lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC), the sex-specific models outperformed the sex-independent models in the detection of early-stage lung cancers. The Venn plots suggested that females and males shared only a few transcriptomic biomarkers of early-stage lung cancers. Our experimental data suggested that sex information should be included in optimizing disease diagnosis models.
13
Han Y, Huang L, Zhou F. A dynamic recursive feature elimination framework (dRFE) to further refine a set of OMIC biomarkers. Bioinformatics 2021; 37:2183-2189. [PMID: 33515240 DOI: 10.1093/bioinformatics/btab055] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 10/08/2020] [Revised: 12/23/2020] [Accepted: 01/25/2021] [Indexed: 11/14/2022]
Abstract
MOTIVATION A feature selection algorithm may select the subset of features with the best associations with the class labels. The recursive feature elimination (RFE) is a heuristic feature screening framework and has been widely used to select the biological OMIC biomarkers. This study proposed a dynamic recursive feature elimination (dRFE) framework with more flexible feature elimination operations. The proposed dRFE was comprehensively compared with 11 existing feature selection algorithms and five classifiers on the eight difficult transcriptome datasets from a previous study, the ten newly collected transcriptome datasets and the five methylome datasets. RESULTS The experimental data suggested that the regular RFE framework did not perform well, and dRFE outperformed the existing feature selection algorithms in most cases. The dRFE-detected features achieved Acc=1.0000 for the two methylome datasets GSE53045 and GSE66695. The best prediction accuracies of the dRFE-detected features were 0.9259, 0.9424, and 0.8601 for the other three methylome datasets GSE74845, GSE103186, and GSE80970, respectively. Four transcriptome datasets received Acc=1.0000 using the dRFE-detected features, and the prediction accuracies for the other six newly collected transcriptome datasets were between 0.6301 and 0.9917. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
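For readers unfamiliar with the baseline that dRFE extends, the sketch below shows plain recursive feature elimination: fit a model, drop the feature with the smallest absolute weight, and refit until the desired number of features remains. It is not the dRFE framework itself, whose more flexible elimination operations are not described in the abstract. The least-squares model fitted by gradient descent stands in for whatever estimator a real pipeline would use, and the names `fit_linear`, `rfe`, and the toy data are assumptions made for illustration.

```python
from itertools import product


def fit_linear(X, y, cols, epochs=300, lr=0.1):
    """Least-squares fit by batch gradient descent; returns one weight per column."""
    n = len(X)
    w = [0.0] * len(cols)
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(cols)
        grad_b = 0.0
        for row, target in zip(X, y):
            err = b + sum(wi * row[c] for wi, c in zip(w, cols)) - target
            grad_b += err
            for k, c in enumerate(cols):
                grad_w[k] += err * row[c]
        b -= lr * grad_b / n
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
    return w


def rfe(X, y, n_keep):
    """Refit and drop the weakest feature until n_keep features remain."""
    cols = list(range(len(X[0])))
    while len(cols) > n_keep:
        w = fit_linear(X, y, cols)
        weakest = min(range(len(cols)), key=lambda k: abs(w[k]))
        del cols[weakest]
    return cols


# Toy data (hypothetical): a full +/-1 factorial design with four features,
# where the target depends strongly on features 0 and 2, weakly on 1, not on 3.
X_toy = [list(p) for p in product([-1.0, 1.0], repeat=4)]
y_toy = [3.0 * r[0] + 0.5 * r[1] - 2.0 * r[2] for r in X_toy]
print(rfe(X_toy, y_toy, n_keep=2))  # [0, 2]
```

Removing exactly one feature per iteration is the fixed step that makes regular RFE rigid; frameworks such as the one in this entry vary how elimination proceeds.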
Affiliation(s)
- Yuanyuan Han
- College of Computer Science and Technology, and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, China, 130012
- Lan Huang
- College of Computer Science and Technology, and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, China, 130012
- Fengfeng Zhou
- College of Computer Science and Technology, and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, China, 130012