1
Zhang M, Du J, Nie B, Luo J, Liu M, Yuan Y. Hybrid mRMR and multi-objective particle swarm feature selection methods and application to metabolomics of traditional Chinese medicine. PeerJ Comput Sci 2024; 10:e2073. [PMID: 38855250 PMCID: PMC11157565 DOI: 10.7717/peerj-cs.2073] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2024] [Accepted: 04/29/2024] [Indexed: 06/11/2024]
Abstract
Metabolomics data has high-dimensional features and a small sample size, which is typical of high-dimensional small sample (HDSS) data. Too high a dimensionality leads to the curse of dimensionality, and too small a sample size tends to trigger overfitting, which poses a challenge to deeper mining in metabolomics. Feature selection is a valuable technique for effectively handling the challenges HDSS data poses. For the feature selection problem of HDSS data in metabolomics, a hybrid Max-Relevance and Min-Redundancy (mRMR) and multi-objective particle swarm feature selection method (MCMOPSO) is proposed. Experimental results using metabolomics data and various University of California, Irvine (UCI) public datasets demonstrate the effectiveness of MCMOPSO in selecting feature subsets with a limited number of high-quality features. MCMOPSO achieves this by efficiently eliminating irrelevant and redundant features, showcasing its efficacy. Therefore, MCMOPSO is a powerful approach for selecting features from high-dimensional metabolomics data with limited sample sizes.
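As a rough illustration of the Max-Relevance and Min-Redundancy idea named in this abstract, the sketch below greedily picks features that correlate strongly with the target while correlating weakly with features already chosen. It is not the MCMOPSO method: the absolute Pearson correlation is used here as a simple stand-in for mutual information, the particle swarm stage is omitted, and the function names `pearson` and `mrmr_select` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def mrmr_select(columns, target, k):
    """Greedy mRMR-style selection (assumption: |Pearson r| replaces mutual information).

    columns: list of feature columns, each a list of numbers
    target:  list of target values
    k:       number of features to select
    Returns the indices of the selected columns in selection order.
    """
    relevance = [abs(pearson(col, target)) for col in columns]
    selected = []
    remaining = list(range(len(columns)))

    def score(j):
        # Relevance to the target minus mean redundancy with already selected features.
        if not selected:
            return relevance[j]
        redundancy = sum(abs(pearson(columns[j], columns[s])) for s in selected)
        return relevance[j] - redundancy / len(selected)

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With a duplicated feature in the input, the redundancy term pushes the duplicate's score below that of a weaker but independent feature, which is the behaviour the mRMR criterion is designed to produce.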
Affiliation(s)
- Mengting Zhang
  - School of Computer Science, Jiangxi University of Chinese Medicine, Nanchang, China
- Jianqiang Du
  - School of Computer Science, Jiangxi University of Chinese Medicine, Nanchang, China
  - Key Laboratory of Artificial Intelligence in Chinese Medicine, Jiangxi University of Chinese Medicine, Nanchang, China
- Bin Nie
  - School of Computer Science, Jiangxi University of Chinese Medicine, Nanchang, China
  - Key Laboratory of Artificial Intelligence in Chinese Medicine, Jiangxi University of Chinese Medicine, Nanchang, China
- Jigen Luo
  - School of Computer Science, Jiangxi University of Chinese Medicine, Nanchang, China
  - Key Laboratory of Artificial Intelligence in Chinese Medicine, Jiangxi University of Chinese Medicine, Nanchang, China
- Ming Liu
  - School of Computer Science, Jiangxi University of Chinese Medicine, Nanchang, China
- Yang Yuan
  - School of Computer Science, Jiangxi University of Chinese Medicine, Nanchang, China
2
You W, Yang Z, Ji G. PLS-based gene subset augmentation and tumor-specific gene identification. Comput Biol Med 2024; 174:108434. [PMID: 38636329 DOI: 10.1016/j.compbiomed.2024.108434] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2024] [Revised: 03/18/2024] [Accepted: 04/07/2024] [Indexed: 04/20/2024]
Abstract
In the study of tumor disease pathogenesis, the identification of genes specifically expressed in disease states is pivotal, yet challenges arise from high-dimensional datasets with limited samples. Conventional gene (feature) selection methods often fall short of capturing the complexity of gene-phenotype and gene-gene interactions, necessitating a more robust analysis method. To address these challenges, a gene subset augmentation strategy is proposed in this paper. Our approach introduces diverse perturbation mechanisms to generate distinct gene subsets. The partial least squares-based multiple gene measurement algorithm considers gene-phenotype and gene-gene correlations, identifying differentially expressed genes, including those with weak signals. The constructed gene networks derived from the augmented subsets unveil regulatory patterns, enabling association analysis to explore gene associations comprehensively. Our algorithm excels in identifying small-sized gene subsets with strong discriminative power, surpassing traditional methods that yield a single gene subset. Unlike conventional approaches, our algorithm reveals a spectrum of different gene subsets and their weakly differentially expressed genes. This nuanced perspective aids in unraveling the molecular characteristics and specific expression patterns of tumor genes. The versatility of our approach not only contributes to the advancement of tumor-specific gene identification but also holds promise for addressing challenges in various fields characterized by high-dimensional datasets and limited samples. The Python implementation is available at http://github.com/wenjieyou/PLSGSA.
Affiliation(s)
- Wenjie You
  - School of Big Data and Artificial Intelligence, Fujian Polytechnic Normal University, Fuqing, 350300, China.
- Zijiang Yang
  - School of Information Technology, York University, Toronto, M3J 1P3, Canada.
- Guoli Ji
  - Department of Automation, Xiamen University, Xiamen, 361005, China.
3
He N, Kou C. Predicting verbal and performance intelligence quotients from multimodal data in individuals with attention deficit/hyperactivity disorder. Int J Dev Neurosci 2024; 84:217-226. [PMID: 38387863 DOI: 10.1002/jdn.10320] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2023] [Revised: 01/10/2024] [Accepted: 02/05/2024] [Indexed: 02/24/2024]
Abstract
Despite the importance of understanding how intelligence is ingrained in the function and structure of the brain in some neurological disorders, research on the alterations of intelligence-associated neurological factors in atypical neurodevelopmental disorders, such as attention deficit/hyperactivity disorder (ADHD), is limited. Therefore, we aimed to explore the relationship between the brain functional and morphological characteristics and the intellectual performance of 139 patients with ADHD. Resting-state functional and T1-weighted structural magnetic resonance imaging (MRI) data and intellectual-performance data of the patients were collected. The MRI data were preprocessed to extract four indicators characterizing the participants' brain features: fractional amplitude of low-frequency fluctuation, regional homogeneity, and gray and white matter volumes. Then, we used a two-layer feature-selection method with support vector regression models based on three kernel functions to predict the verbal and performance intelligence quotients of the patients, along with tenfold cross-validation to evaluate the models' predictive performance. All models showed good performance; the correlation coefficients between the predicted and observed values for each predictive phenotypic variable were >0.41, with statistical significance. The brain features that could best predict the intellectual performance of the patients were concentrated in the superior and inferior frontal gyrus of the prefrontal areas, the angular gyrus and precuneus of the parietal lobe, the inferior and middle temporal gyrus of the temporal lobe, and part of the cerebellar regions. Thus, the voxel-based brain-feature indicators could adequately predict the intellectual performance of patients with ADHD, providing a foundation for future neuroimaging studies of this disorder.
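The evaluation protocol described in this abstract, tenfold cross-validation scored by the correlation between predicted and observed values, can be sketched as follows. This is only an illustration of the protocol: a one-variable least-squares line stands in for the support vector regression models, folds are assigned by interleaving rather than shuffling, and the function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx and vy else 0.0


def fit_line(xs, ys):
    """Ordinary least-squares line (stand-in model, not the paper's SVR)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, my - slope * mx


def cross_val_predict(xs, ys, k=10):
    """Return out-of-fold predictions: each sample is predicted by a model
    that was fitted without that sample's fold."""
    n = len(xs)
    preds = [0.0] * n
    for fold in range(k):
        test = set(range(fold, n, k))  # interleaved fold assignment
        train = [i for i in range(n) if i not in test]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in test:
            preds[i] = slope * xs[i] + intercept
    return preds
```

Calling `pearson(cross_val_predict(xs, ys), ys)` then gives a single out-of-sample correlation of the kind the abstract reports per phenotypic variable.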
Affiliation(s)
- Ningning He
  - School of Mathematics and Statistics, Zhoukou Normal University, Zhoukou, People's Republic of China
- Chao Kou
  - School of Foreign Languages, Zhoukou Normal University, Zhoukou, People's Republic of China
4
Jia Y, Hu X, Kang W, Dong X. Unveiling Microbial Nitrogen Metabolism in Rivers using a Machine Learning Approach. Environmental Science & Technology 2024; 58:6605-6615. [PMID: 38566483 DOI: 10.1021/acs.est.3c09653] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2024]
Abstract
Microbial nitrogen metabolism is a complicated and key process in mediating environmental pollution and greenhouse gas emissions in rivers. However, the interactive drivers of microbial nitrogen metabolism in rivers have not been identified. Here, we analyze the microbial nitrogen metabolism patterns in 105 rivers in China driven by 26 environmental and socioeconomic factors using an interpretable causal machine learning (ICML) framework. ICML better recognizes the complex relationships between factors and microbial nitrogen metabolism than traditional linear regression models. Furthermore, tipping points and concentration windows were proposed to precisely regulate microbial nitrogen metabolism. For example, concentrations of dissolved organic carbon (DOC) below tipping points of 6.2 and 4.2 mg/L easily reduce bacterial denitrification and nitrification, respectively. The concentration windows for NO₃⁻-N (15.9-18.0 mg/L) and DOC (9.1-10.8 mg/L) enabled the highest abundance of denitrifying bacteria on a national scale. The integration of ICML models and field data clarifies the important drivers of microbial nitrogen metabolism, supporting the precise regulation of nitrogen pollution and river ecological management.
Affiliation(s)
- Yuying Jia
  - Key Laboratory of Pollution Processes and Environmental Criteria (Ministry of Education), Tianjin Key Laboratory of Environmental Remediation and Pollution Control, College of Environmental Science and Engineering, Nankai University, Tianjin 300350, China
- Xiangang Hu
  - Key Laboratory of Pollution Processes and Environmental Criteria (Ministry of Education), Tianjin Key Laboratory of Environmental Remediation and Pollution Control, College of Environmental Science and Engineering, Nankai University, Tianjin 300350, China
- Weilu Kang
  - Key Laboratory of Pollution Processes and Environmental Criteria (Ministry of Education), Tianjin Key Laboratory of Environmental Remediation and Pollution Control, College of Environmental Science and Engineering, Nankai University, Tianjin 300350, China
- Xu Dong
  - Key Laboratory of Pollution Processes and Environmental Criteria (Ministry of Education), Tianjin Key Laboratory of Environmental Remediation and Pollution Control, College of Environmental Science and Engineering, Nankai University, Tianjin 300350, China
5
Ersoz NS, Bakir-Gungor B, Yousef M. GeNetOntology: identifying affected gene ontology terms via grouping, scoring, and modeling of gene expression data utilizing biological knowledge-based machine learning. Front Genet 2023; 14:1139082. [PMID: 37671046 PMCID: PMC10476493 DOI: 10.3389/fgene.2023.1139082] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/06/2023] [Accepted: 07/05/2023] [Indexed: 09/07/2023]
Abstract
Introduction: Identifying significant sets of genes that are up/downregulated under specific conditions is vital to understand disease development mechanisms at the molecular level. Along this line, in order to analyze transcriptomic data, several computational feature selection (i.e., gene selection) methods have been proposed. On the other hand, uncovering the core functions of the selected genes provides a deep understanding of diseases. In order to address this problem, biological domain knowledge-based feature selection methods have been proposed. Unlike computational gene selection approaches, these domain knowledge-based methods take the underlying biology into account and integrate knowledge from external biological resources. Gene Ontology (GO) is one such biological resource that provides ontology terms for defining the molecular function, cellular component, and biological process of the gene product. Methods: In this study, we developed a tool named GeNetOntology which performs GO-based feature selection for gene expression data analysis. In the proposed approach, the process of Grouping, Scoring, and Modeling (G-S-M) is used to identify significant GO terms. GO information has been used as the grouping information, which has been embedded into a machine learning (ML) algorithm to select informative ontology terms. The genes annotated with the selected ontology terms have been used in the training part to carry out the classification task of the ML model. The output is an important set of ontologies for the two-class classification task applied to gene expression data for a given phenotype. Results: Our approach has been tested on 11 different gene expression datasets, and the results showed that GeNetOntology successfully identified important disease-related ontology terms to be used in the classification model. 
Discussion: GeNetOntology will assist geneticists and scientists to identify a range of disease-related genes and ontologies in transcriptomic data analysis, and it will also help doctors design diagnosis platforms and improve patient treatment plans.
Affiliation(s)
- Nur Sebnem Ersoz
  - Department of Bioengineering, Graduate School of Engineering and Science, Abdullah Gul University, Kayseri, Türkiye
- Burcu Bakir-Gungor
  - Department of Computer Engineering, Faculty of Engineering, Abdullah Gul University, Kayseri, Türkiye
  - Department of Bioengineering, Faculty of Life and Natural Sciences, Abdullah Gul University, Kayseri, Türkiye
- Malik Yousef
  - Department of Information Systems, Zefat Academic College, Zefat, Israel
  - Galilee Digital Health Research Center (GDH), Zefat Academic College, Zefat, Israel
6
Park J, Lee JW, Park M. Comparison of cancer subtype identification methods combined with feature selection methods in omics data analysis. BioData Min 2023; 16:18. [PMID: 37420304 DOI: 10.1186/s13040-023-00334-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2022] [Accepted: 06/30/2023] [Indexed: 07/09/2023]
Abstract
BACKGROUND Cancer subtype identification is important for the early diagnosis of cancer and the provision of adequate treatment. Prior to identifying the subtype of cancer in a patient, feature selection is also crucial for reducing the dimensionality of the data by detecting genes that contain important information about the cancer subtype. Numerous cancer subtyping methods have been developed, and their performance has been compared. However, combinations of feature selection and subtype identification methods have rarely been considered. This study aimed to identify the best combination of variable selection and subtype identification methods in single omics data analysis. RESULTS Combinations of six filter-based methods and six unsupervised subtype identification methods were investigated using The Cancer Genome Atlas (TCGA) datasets for four cancers. The number of features selected varied, and several evaluation metrics were used. Although no single combination was found to have a distinctively good performance, Consensus Clustering (CC) and Neighborhood-Based Multi-omics Clustering (NEMO) used with variance-based feature selection had a tendency to show lower p-values, and nonnegative matrix factorization (NMF) stably showed good performance in many cases unless the Dip test was used for feature selection. In terms of accuracy, the combination of NMF and similarity network fusion (SNF) with Monte Carlo Feature Selection (MCFS) and Minimum-Redundancy Maximum Relevance (mRMR) showed good overall performance. NMF always showed among the worst performances without feature selection in all datasets, but performed much better when used with various feature selection methods. iClusterBayes (ICB) had decent performance when used without feature selection. CONCLUSIONS Rather than a single method clearly emerging as optimal, the best methodology was different depending on the data used, the number of features selected, and the evaluation method. 
A guideline for choosing the best combination method under various situations is provided.
Affiliation(s)
- JiYoon Park
  - Department of Statistics, Korea University, 145 Anam-Ro, Seongbuk-Gu, Seoul, 02841, South Korea
- Jae Won Lee
  - Department of Statistics, Korea University, 145 Anam-Ro, Seongbuk-Gu, Seoul, 02841, South Korea
- Mira Park
  - Department of Preventive Medicine, Eulji University, 77 Gyeryong-Ro, Jung-Gu, Daejeon, 34824, South Korea.
7
Liu SH, Yang ZK, Pan KL, Zhu X, Chen W. Estimation of Left Ventricular Ejection Fraction Using Cardiovascular Hemodynamic Parameters and Pulse Morphological Characteristics with Machine Learning Algorithms. Nutrients 2022; 14:nu14194051. [PMID: 36235703 PMCID: PMC9572754 DOI: 10.3390/nu14194051] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/02/2022] [Revised: 09/22/2022] [Accepted: 09/26/2022] [Indexed: 11/16/2022]
Abstract
It is estimated that 360,000 patients have suffered from heart failure (HF) in Taiwan, mostly those over the age of 65 years, who need long-term medication and daily healthcare to reduce the risk of mortality. The left ventricular ejection fraction (LVEF) is an important index for diagnosing HF. The goal of this study is to estimate the LVEF using the cardiovascular hemodynamic parameters, morphological characteristics of pulse, and bodily information with two machine learning algorithms. Twenty patients with HF who had been treated for at least six to nine months participated in this study. The self-constructing neural fuzzy inference network (SoNFIN) and XGBoost regression models were used to estimate their LVEF. A total of 193 training samples and 118 test samples were obtained. The recursive feature elimination algorithm was used to choose the optimal parameter set. With echocardiography as the ground truth, the root-mean-square estimation errors (ERMS) of SoNFIN and XGBoost were 6.9 ± 2.3% and 6.4 ± 2.4%, respectively. The benefit of this approach is that the LVEF could be measured conveniently without medical imaging. Thus, the proposed method may reach an application level for clinical practice in the future.
Affiliation(s)
- Shing-Hong Liu
  - Department of Computer Science and Information Engineering, Chaoyang University of Technology, Taichung City 41349, Taiwan
- Zhi-Kai Yang
  - Department of Computer Science and Information Engineering, Chaoyang University of Technology, Taichung City 41349, Taiwan
- Kuo-Li Pan
  - Division of Cardiology, Department of Internal Medicine, Chang Gung Memorial Hospital, Chiayi Branch, Chiayi City 61363, Taiwan
  - College of Medicine, Chang Gung University, Taoyuan City 33305, Taiwan
  - Heart Failure Center, Chang Gung Memorial Hospital, Chiayi Branch, Chiayi City 61363, Taiwan
  - Correspondence: (K.-L.P.); (W.C.); Tel.: +886-5-362-1000-2854 (K.-L.P.); +81-242-37-2606 (W.C.)
- Xin Zhu
  - Division of Information Systems, School of Computer Science and Engineering, University of Aizu, Aizu-Wakamatsu City, Fukushima 965-8580, Japan
- Wenxi Chen
  - Division of Information Systems, School of Computer Science and Engineering, University of Aizu, Aizu-Wakamatsu City, Fukushima 965-8580, Japan
  - Correspondence: (K.-L.P.); (W.C.); Tel.: +886-5-362-1000-2854 (K.-L.P.); +81-242-37-2606 (W.C.)
8
Simic V, Ebadi Torkayesh A, Ijadi Maghsoodi A. Locating a disinfection facility for hazardous healthcare waste in the COVID-19 era: a novel approach based on Fermatean fuzzy ITARA-MARCOS and random forest recursive feature elimination algorithm. Annals of Operations Research 2022; 328:1-46. [PMID: 35821664 PMCID: PMC9263821 DOI: 10.1007/s10479-022-04822-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Accepted: 06/07/2022] [Indexed: 05/09/2023]
Abstract
Hazardous healthcare waste (HCW) management is one of the most critical urban systems affected by the COVID-19 pandemic, owing to both the increased waste generation rate in hospitals and medical centers dealing with infected patients and the greater hazardousness of the generated waste due to exposure to the virus. In this regard, waste network flow would face severe problems unless hazardous waste is treated at disinfection facilities. For this purpose, this study aims to develop an advanced decision support system based on a multi-stage model that combines the random forest recursive feature elimination (RF-RFE) algorithm, the indifference threshold-based attribute ratio analysis (ITARA), and the measurement of alternatives and ranking according to compromise solution (MARCOS) method into a unique framework under the Fermatean fuzzy environment. In the first stage, the innovative Fermatean fuzzy RF-RFE algorithm extracts core criteria from a finite set of initial criteria. In the second stage, the novel Fermatean fuzzy ITARA determines the semi-objective importance of the core criteria. In the third stage, the new Fermatean fuzzy MARCOS method ranks alternatives. A real-life case study in Istanbul, Turkey, illustrates the applicability of the introduced methodology. Our empirical findings indicate that "Pendik" is the best among five candidate locations for siting a new disinfection facility for hazardous HCW in Istanbul. The sensitivity and comparative analyses confirmed that our approach is highly robust and reliable. This approach could be used to tackle other critical multi-dimensional problems related to COVID-19 and support sustainability and circular economy. Supplementary Information: The online version contains supplementary material available at 10.1007/s10479-022-04822-0.
Affiliation(s)
- Vladimir Simic
  - Faculty of Transport and Traffic Engineering, University of Belgrade, Vojvode Stepe 305, 11010 Belgrade, Serbia
- Ali Ebadi Torkayesh
  - School of Business and Economics, RWTH Aachen University, 52072 Aachen, Germany
- Abtin Ijadi Maghsoodi
  - Department of Information Systems and Operations Management, Faculty of Business and Economics, Business School, University of Auckland, Auckland, 1010 New Zealand
9
Cheng KS, Su YL, Kuo LC, Yang TH, Lee CL, Chen W, Liu SH. Muscle Mass Measurement Using Machine Learning Algorithms with Electrical Impedance Myography. Sensors 2022; 22:s22083087. [PMID: 35459072 PMCID: PMC9031580 DOI: 10.3390/s22083087] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/30/2022] [Revised: 04/15/2022] [Accepted: 04/16/2022] [Indexed: 02/04/2023]
Abstract
Sarcopenia is a widespread chronic disease among elderly people. Although it does not entail a life-threatening risk, it increases the risk of adverse outcomes due to the associated unsteady gait, falls, fractures, and functional disability. The important factors in diagnosing sarcopenia are muscle mass and strength. The examination of muscle mass must be carried out in the clinic. However, the loss of muscle mass can be improved by rehabilitation that can be performed in non-medical environments. Electrical impedance myography (EIM) can measure some muscle parameters that correlate with muscle mass and strength. The goal of this study is to use machine learning algorithms to estimate the total mass of thigh muscles (MoTM) with the parameters of EIM and body information. We explored the seven major muscles of the lower limbs. Feature selection methods, including recursive feature elimination (RFE) and feature combination, were used to select the optimal features based on the ridge regression (RR) and support vector regression (SVR) models. The optimal features were the resistance of the rectus femoris normalized by the thigh circumference, the phase of the tibialis anterior combined with gender, and body information (height and weight). There were 96 subjects involved in this study. Performance in estimating the MoTM was assessed with the regression coefficient (r²) and the root-mean-square error (RMSE): r² was 0.800 and 0.929, and RMSE was 1.432 kg and 0.980 kg, for the RR and SVR models, respectively. Thus, the proposed method could have the potential to support people examining their muscle mass in non-medical environments.
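Recursive feature elimination with a ridge regression model, as mentioned in this abstract, can be sketched in a few lines. The version below is a minimal illustration and not the authors' pipeline: it assumes the features are already centred and on comparable scales (so coefficient magnitudes can be compared), drops one feature per round, and uses hypothetical helper names (`solve`, `ridge_coefs`, `rfe`) with a small pure-Python linear solver.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def ridge_coefs(X, y, lam):
    """Ridge coefficients from the normal equations (X^T X + lam I) w = X^T y."""
    p = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(A, b)


def rfe(X, y, n_keep, lam=1e-3):
    """Repeatedly refit and drop the feature with the smallest |coefficient|.

    Returns the original column indices of the surviving features.
    """
    active = list(range(len(X[0])))
    while len(active) > n_keep:
        Xa = [[row[j] for j in active] for row in X]
        w = ridge_coefs(Xa, y, lam)
        weakest = min(range(len(active)), key=lambda i: abs(w[i]))
        del active[weakest]
    return active
```

Refitting after every removal is what distinguishes RFE from a one-shot ranking by coefficient size: the importance of the remaining features is re-estimated each time a feature leaves the model.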
Affiliation(s)
- Kuo-Sheng Cheng
  - Department of Biomedical Engineering, National Cheng Kung University, Tainan 701, Taiwan; (K.-S.C.); (Y.-L.S.); (T.-H.Y.)
- Ya-Ling Su
  - Department of Biomedical Engineering, National Cheng Kung University, Tainan 701, Taiwan; (K.-S.C.); (Y.-L.S.); (T.-H.Y.)
- Li-Chieh Kuo
  - Department of Occupational Therapy, National Cheng Kung University, Tainan 701, Taiwan
- Tai-Hua Yang
  - Department of Biomedical Engineering, National Cheng Kung University, Tainan 701, Taiwan; (K.-S.C.); (Y.-L.S.); (T.-H.Y.)
- Chia-Lin Lee
  - Department of Physical Education, National Kaohsiung Normal University, Kaohsiung City 80201, Taiwan
- Wenxi Chen
  - Biomedical Information Engineering Laboratory, The University of Aizu, Aizu-Wakamatsu City, Fukushima 965-8580, Japan
- Shing-Hong Liu
  - Department of Computer Science and Information Engineering, Chaoyang University of Technology, Taichung 413310, Taiwan
  - Correspondence: ; Tel.: +886-4-233230000-7811
10
11
Xiao J, Wang Y, Chen J, Xie L, Huang J. Impact of resampling methods and classification models on the imbalanced credit scoring problems. Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2021.05.029] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Indexed: 12/17/2022]
12
Lu S, Zhao J, Wang H. MD-MBPLS: A novel explanatory model in computational social science. Knowl Based Syst 2021. [DOI: 10.1016/j.knosys.2021.107023] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/25/2022]
13
Comprehensive relative importance analysis and its applications to high dimensional gene expression data analysis. Knowl Based Syst 2020. [DOI: 10.1016/j.knosys.2020.106120] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/24/2022]
14
Xiao J, Zhou X, Zhong Y, Xie L, Gu X, Liu D. Cost-sensitive semi-supervised selective ensemble model for customer credit scoring. Knowl Based Syst 2020. [DOI: 10.1016/j.knosys.2019.105118] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Indexed: 10/25/2022]
15
Zou M, Zhou ZW, Fan L, Zhang WJ, Zhao L, Liu XP, Wang HB, Tan WS. A novel method based on nonparametric regression with a Gaussian kernel algorithm identifies the critical components in CHO media and feed optimization. J Ind Microbiol Biotechnol 2019; 47:63-72. [PMID: 31754859 DOI: 10.1007/s10295-019-02248-5] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 07/14/2019] [Accepted: 11/08/2019] [Indexed: 12/18/2022]
Abstract
As the composition of animal cell culture medium becomes more complex, the identification of key variables is important for simplifying and guiding the subsequent medium optimization. However, the traditional experimental design methods are impractical and limited in their ability to explore such large feature spaces. Therefore, in this work, we developed a NRGK (nonparametric regression with Gaussian kernel) method, which aimed to identify the critical components that affect product titres during the development of cell culture media. With this nonparametric model, we successfully identified the important components that were neglected by the conventional PLS (partial least squares regression) method. The superiority of the NRGK method was further verified by ANOVA (analysis of variance). Additionally, it was proven that the selection accuracy was increased with the NRGK method because of its ability to model both the nonlinear and linear relationships between the medium components and titres. The application of this NRGK method provides new perspectives for the more precise identification of the critical components that further enable the optimization of media in a shorter timeframe.
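For readers unfamiliar with nonparametric regression using a Gaussian kernel, one common form is the Nadaraya-Watson estimator, sketched below for a single predictor. Whether NRGK uses exactly this estimator is an assumption, since the abstract does not say; the component-ranking step is not shown, and `gaussian_kernel`, `nw_predict`, and the `bandwidth` parameter are illustrative names.

```python
from math import exp


def gaussian_kernel(u):
    """Unnormalised Gaussian kernel; the normalising constant cancels in the ratio."""
    return exp(-0.5 * u * u)


def nw_predict(x_train, y_train, x_query, bandwidth):
    """Nadaraya-Watson estimate: a kernel-weighted average of the training responses.

    Training points close to x_query (relative to the bandwidth) get large
    weights, so the fit can follow nonlinear trends without a parametric form.
    """
    weights = [gaussian_kernel((x_query - xi) / bandwidth) for xi in x_train]
    total = sum(weights)
    if total == 0.0:
        # All weights underflowed: the query is too far from every training point.
        raise ValueError("query point too far from training data for this bandwidth")
    return sum(w * yi for w, yi in zip(weights, y_train)) / total
```

The bandwidth controls the trade-off: a small value follows the data closely, while a large value approaches the global mean of the responses.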
Affiliation(s)
- Mao Zou
  - State Key Laboratory of Bioreactor Engineering, East China University of Science and Technology, 200237, Shanghai, China
- Zi-Wei Zhou
  - Shanghai Bioengine Sci-Tech Co. Ltd, 201203, Shanghai, China
- Li Fan
  - State Key Laboratory of Bioreactor Engineering, East China University of Science and Technology, 200237, Shanghai, China
- Wei-Jian Zhang
  - State Key Laboratory of Bioreactor Engineering, East China University of Science and Technology, 200237, Shanghai, China
- Liang Zhao
  - State Key Laboratory of Bioreactor Engineering, East China University of Science and Technology, 200237, Shanghai, China
- Xu-Ping Liu
  - State Key Laboratory of Bioreactor Engineering, East China University of Science and Technology, 200237, Shanghai, China
- Hai-Bin Wang
  - Hisun Pharmaceutical (Hangzhou) Co. Ltd, Xialiancun, Xukou, Fuyang, 311404, Hangzhou, Zhejiang, China
- Wen-Song Tan
  - State Key Laboratory of Bioreactor Engineering, East China University of Science and Technology, 200237, Shanghai, China.
16
Identifying Brain Abnormalities with Schizophrenia Based on a Hybrid Feature Selection Technology. Applied Sciences-Basel 2019. [DOI: 10.3390/app9102148] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 12/18/2022]
Abstract
Many medical imaging datasets, especially magnetic resonance imaging (MRI) data, have a small sample size but a large number of features. Effectively reducing the data dimension and accurately locating biomarkers in such data are crucial for diagnosis and further precision medicine. In this paper, we propose a hybrid feature selection method based on machine learning and traditional statistical approaches and explore the brain abnormalities of schizophrenia by using functional and structural MRI data. The results show that the abnormal brain regions are mainly distributed in the supramarginal gyrus, cingulate gyrus, frontal gyrus, precuneus and caudate, and the abnormal functional connections are related to the caudate nucleus, insula and rolandic operculum. In addition, complex network analyses based on graph theory are applied to the functional connection data, and the results demonstrate that the located abnormal functional connections in the brain can distinguish schizophrenia patients from healthy controls. The brain abnormalities in schizophrenia identified by the proposed hybrid feature selection method show that there are abnormal brain regions and abnormal disruption of network segregation and network integration in schizophrenia; these changes may lead to inaccurate and inefficient information processing and synthesis in the brain, providing further evidence for the cognitive dysmetria of schizophrenia.
|
17
|
Wang W, Ackland DC, McClelland JA, Webster KE, Halgamuge S. Assessment of Gait Characteristics in Total Knee Arthroplasty Patients Using a Hierarchical Partial Least Squares Method. IEEE J Biomed Health Inform 2017; 22:205-214. [PMID: 28371786 DOI: 10.1109/jbhi.2017.2689070] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
Abstract
Quantitative gait analysis is an important tool in objective assessment and management of total knee arthroplasty (TKA) patients. Studies evaluating gait patterns in TKA patients have tended to focus on discrete data such as spatiotemporal information, joint range of motion and peak values of kinematics and kinetics, or consider selected principal components of gait waveforms for analysis. These strategies may not have the capacity to capture small variations in gait patterns associated with each joint across an entire gait cycle, and may ultimately limit the accuracy of gait classification. The aim of this study was to develop an automatic feature extraction method to analyse patterns from high-dimensional autocorrelated gait waveforms. A general linear feature extraction framework was proposed and a hierarchical partial least squares method derived for discriminant analysis of multiple gait waveforms. The effectiveness of this strategy was verified using a dataset of joint angle and ground reaction force waveforms from 43 patients after TKA surgery and 31 healthy control subjects. Compared with principal component analysis and partial least squares methods, the hierarchical partial least squares method achieved generally better classification performance on all possible combinations of waveforms, with the highest classification accuracy . The novel hierarchical partial least squares method proposed is capable of capturing virtually all significant differences between TKA patients and the controls, and provides new insights into data visualization. The proposed framework presents a foundation for more rigorous classification of gait, and may ultimately be used to evaluate the effects of interventions such as surgery and rehabilitation.
|
18
|
Liu Y, Li L, Xiao Y, Yao J, Li P, Chen L, Yu D, Ma Y. Rapid identification of the quality decoction pieces by partial least squares-based pattern recognition: grade classification of the decoction pieces of Saposhnikovia divaricata. Biomed Chromatogr 2016; 30:1240-7. [PMID: 26683172 DOI: 10.1002/bmc.3673] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2015] [Revised: 12/15/2015] [Accepted: 12/16/2015] [Indexed: 01/31/2023]
Abstract
Herbal medicines are commonly used in many countries after they undergo processing. Quality decoction pieces are a guarantee of the efficacy and safety of herbal medical products. Here, a strategy based on chemical analysis combined with chemometric techniques is proposed for the classification and prediction of the different grades of decoction pieces. Because the public needs a shared and simple method for grade classification, the chemical constituents were characterized using high-performance liquid chromatography (HPLC)/diode array detection. HPLC was first established for the characterization of the chemical constituents of the different grades of decoction pieces. Furthermore, several of the marker compounds in these decoction pieces were quantified simultaneously. Finally, a partial least squares-based pattern recognition method was used to obtain a predictive model for the grade classification of the decoction pieces. Saposhnikovia divaricata (Turcz.) Schischk was used as a case study. The partial least squares-based pattern recognition for the grade classification of the decoction pieces of S. divaricata demonstrated good sensitivity, specificity and prediction performance, which may efficiently validate the identification results of appearance assessment. The proposed strategy is expected to provide new insight for the grade classification and quality control of decoction pieces. Copyright © 2016 John Wiley & Sons, Ltd.
Affiliation(s)
- Ying Liu
- Institute of Chinese Materia Medica, China Academy of Chinese Medical Sciences, No. 16 Nanxiao Lane, Dongzhimennei, Beijing, 100700, People's Republic of China
| | - Li Li
- Institute of Chinese Materia Medica, China Academy of Chinese Medical Sciences, No. 16 Nanxiao Lane, Dongzhimennei, Beijing, 100700, People's Republic of China
| | - Yongqing Xiao
- Institute of Chinese Materia Medica, China Academy of Chinese Medical Sciences, No. 16 Nanxiao Lane, Dongzhimennei, Beijing, 100700, People's Republic of China
| | - Jiaqi Yao
- Institute of Chinese Materia Medica, China Academy of Chinese Medical Sciences, No. 16 Nanxiao Lane, Dongzhimennei, Beijing, 100700, People's Republic of China
| | - Pengyuan Li
- Institute of Chinese Materia Medica, China Academy of Chinese Medical Sciences, No. 16 Nanxiao Lane, Dongzhimennei, Beijing, 100700, People's Republic of China
| | - Liang Chen
- Institute of Chinese Materia Medica, China Academy of Chinese Medical Sciences, No. 16 Nanxiao Lane, Dongzhimennei, Beijing, 100700, People's Republic of China
| | - Dingrong Yu
- Institute of Chinese Materia Medica, China Academy of Chinese Medical Sciences, No. 16 Nanxiao Lane, Dongzhimennei, Beijing, 100700, People's Republic of China
| | - Yinlian Ma
- Institute of Chinese Materia Medica, China Academy of Chinese Medical Sciences, No. 16 Nanxiao Lane, Dongzhimennei, Beijing, 100700, People's Republic of China
| |
|
19
|
|
20
|
Local configuration pattern features for age-related macular degeneration characterization and classification. Comput Biol Med 2015; 63:208-18. [PMID: 26093788 DOI: 10.1016/j.compbiomed.2015.05.019] [Citation(s) in RCA: 38] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/28/2015] [Revised: 05/25/2015] [Accepted: 05/26/2015] [Indexed: 12/30/2022]
Abstract
Age-related Macular Degeneration (AMD) is an irreversible and chronic medical condition characterized by drusen, Choroidal Neovascularization (CNV) and Geographic Atrophy (GA). AMD is one of the major causes of visual loss among elderly people. It is caused by the degeneration of cells in the macula, which is responsible for central vision. AMD can be of the dry or wet type; dry AMD is the most common. It is classified into early, intermediate and late AMD. Early detection and treatment may help to stop the progression of the disease. Automated AMD diagnosis may reduce the screening time of clinicians. In this work, we introduce Local Configuration Pattern (LCP) features to characterize normal and AMD classes using fundus images. Linear Configuration Coefficients (CC) and Pattern Occurrence (PO) features are extracted from fundus images. These extracted features are ranked using the p-value of the t-test and fed to various supervised classifiers, viz. Decision Tree (DT), k-Nearest Neighbour (k-NN), Naive Bayes (NB), Probabilistic Neural Network (PNN) and Support Vector Machine (SVM), to classify normal and AMD classes. The performance of the system is evaluated with ten-fold cross validation on both a private dataset (Kasturba Medical Hospital, Manipal, India) and public domain datasets, viz. Automated Retinal Image Analysis (ARIA) and STructured Analysis of the Retina (STARE). The proposed approach performed best on the STARE dataset, with a highest average accuracy of 97.78%, sensitivity of 98.00% and specificity of 97.50% using 22 significant features. Hence, this system can be used as an aiding tool for clinicians during mass eye screening programs to diagnose AMD.
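The ranking step described in this abstract (ordering features by a two-class t-test before classification) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a pooled-variance Student's t-test, and it ranks by the absolute t statistic, which gives the same order as ranking by p-value when every feature is measured on the same samples (so the degrees of freedom are identical). The names `pooled_t` and `rank_features` are hypothetical.

```python
import math
from statistics import mean, variance


def pooled_t(a, b):
    """Pooled-variance two-sample t statistic for two lists of floats."""
    na, nb = len(a), len(b)
    # Pooled sample variance across the two classes.
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    diff = mean(a) - mean(b)
    if sp2 == 0.0:
        # Degenerate case: both classes are constant for this feature.
        return 0.0 if diff == 0.0 else math.copysign(math.inf, diff)
    return diff / math.sqrt(sp2 * (1.0 / na + 1.0 / nb))


def rank_features(X, y):
    """Return feature indices ordered from most to least discriminative.

    X is a list of rows (one list of floats per sample); y holds labels 0 or 1.
    With equal degrees of freedom for all features, a larger |t| means a
    smaller p-value, so this order matches a p-value ranking.
    """
    n_features = len(X[0])
    strength = []
    for j in range(n_features):
        a = [row[j] for row, label in zip(X, y) if label == 0]
        b = [row[j] for row, label in zip(X, y) if label == 1]
        strength.append(abs(pooled_t(a, b)))
    return sorted(range(n_features), key=lambda j: -strength[j])


# Toy data (made up for illustration): feature 1 separates the classes,
# feature 0 does not.
X_demo = [[1.0, 0.1], [2.0, 0.2], [3.0, 0.0],
          [1.0, 5.1], [2.0, 5.0], [3.0, 5.2]]
y_demo = [0, 0, 0, 1, 1, 1]
ranking_demo = rank_features(X_demo, y_demo)
```

In a workflow like the one in the abstract, the top-ranked features would then be passed to a classifier and evaluated with cross validation; that part is omitted here.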
|
21
|
Wang A, An N, Chen G, Li L, Alterovitz G. Improving PLS-RFE based gene selection for microarray data classification. Comput Biol Med 2015; 62:14-24. [PMID: 25912984 DOI: 10.1016/j.compbiomed.2015.04.011] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/29/2014] [Revised: 04/07/2015] [Accepted: 04/08/2015] [Indexed: 10/23/2022]
Abstract
Gene selection plays a crucial role in constructing efficient classifiers for microarray data classification, since microarray data are characterized by high dimensionality and small sample sizes and contain irrelevant and redundant genes. In practical use, partial least squares-based gene selection approaches can obtain gene subsets of good quality, but are considerably time-consuming. In this paper, we propose to integrate partial least squares based recursive feature elimination (PLS-RFE) with two feature elimination schemes, simulated annealing and square root, to speed up the feature selection process. Inspired by the strategy of the annealing schedule, the two proposed approaches eliminate a number of features, rather than the single least informative feature, during each iteration, and the number of removed features decreases as the iteration proceeds. To verify the effectiveness and efficiency of the proposed approaches, we perform extensive experiments on six publicly available microarray datasets with three typical classifiers (Naïve Bayes, K-Nearest-Neighbor and Support Vector Machine), and compare our approaches with the ReliefF, PLS and PLS-RFE feature selectors in terms of classification accuracy and running time. Experimental results demonstrate that the two proposed approaches substantially accelerate the feature selection process without degrading the classification accuracy and obtain more compact feature subsets for both two-category and multi-category problems. Further experimental comparisons of feature subset consistency show that the proposed approach with the simulated annealing scheme not only has better time performance, but also obtains slightly better feature subset consistency than the one with the square root scheme.
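The elimination idea in this abstract (dropping several features per iteration, with fewer dropped as the set shrinks) can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's algorithm: it assumes the square root scheme removes floor(sqrt(n)) of the n remaining features per round, which the abstract does not spell out. `sqrt_schedule_rfe` and `rank_fn` are hypothetical names, and `rank_fn` stands in for whatever PLS-based importance score is recomputed each round.

```python
import math
from typing import Callable, List, Sequence, Tuple


def sqrt_schedule_rfe(
    features: Sequence[int],
    rank_fn: Callable[[List[int]], List[float]],
    n_keep: int,
) -> Tuple[List[int], List[int]]:
    """Recursive feature elimination with a square-root drop schedule.

    rank_fn receives the currently remaining features and returns one
    importance score per feature (higher means more informative).
    Returns the surviving features and the set size after every round.
    """
    remaining = list(features)
    sizes = [len(remaining)]
    while len(remaining) > n_keep:
        # Assumed schedule: drop floor(sqrt(n)) features, at least one,
        # and never go below the requested subset size.
        n_drop = max(1, math.isqrt(len(remaining)))
        n_drop = min(n_drop, len(remaining) - n_keep)
        scores = rank_fn(remaining)
        # Positions of the n_drop least informative features.
        order = sorted(range(len(remaining)), key=lambda i: scores[i])
        drop = set(order[:n_drop])
        remaining = [f for i, f in enumerate(remaining) if i not in drop]
        sizes.append(len(remaining))
    return remaining, sizes


# Toy scorer (made up): a feature's importance is simply its index.
kept_demo, sizes_demo = sqrt_schedule_rfe(
    range(100), lambda feats: [float(f) for f in feats], n_keep=10
)
```

With this schedule, 100 features shrink to 10 in 14 scoring rounds instead of the 90 needed when one feature is removed at a time, which is where the speed-up comes from.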
Affiliation(s)
- Aiguo Wang
- School of Computer and Information, Hefei University of Technology, Hefei, China.
| | - Ning An
- School of Computer and Information, Hefei University of Technology, Hefei, China.
| | - Guilin Chen
- School of Computer and Information Engineering, Chuzhou University, Chuzhou, China.
| | - Lian Li
- School of Computer and Information, Hefei University of Technology, Hefei, China.
| | - Gil Alterovitz
- Center for Biomedical Informatics, Harvard Medical School, Boston, USA; Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, USA; Children's Hospital Informatics Program at the Harvard/MIT Division of Health Sciences and Technology, Boston, USA.
| |
|
22
|
Sun S, Peng Q, Shakoor A. A kernel-based multivariate feature selection method for microarray data classification. PLoS One 2014; 9:e102541. [PMID: 25048512 PMCID: PMC4105478 DOI: 10.1371/journal.pone.0102541] [Citation(s) in RCA: 41] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/07/2014] [Accepted: 06/20/2014] [Indexed: 11/19/2022] Open
Abstract
High dimensionality and small sample sizes, and their inherent risk of overfitting, pose great challenges for constructing efficient classifiers in microarray data classification. Therefore, feature selection should be conducted prior to data classification to enhance prediction performance. In general, filter methods can serve as the principal or auxiliary selection mechanism because of their simplicity, scalability, and low computational complexity. However, a series of trivial examples show that filter methods result in less accurate performance because they ignore the dependencies among features. Although a few publications have tried to reveal the relationships among features with multivariate-based methods, these methods describe those relationships only linearly, and simple linear combinations restrict the improvement in performance. In this paper, we used a kernel method to discover inherent nonlinear correlations among features as well as between features and the target. Moreover, the number of orthogonal components was determined by kernel Fisher's linear discriminant analysis (FLDA) in a self-adaptive manner rather than by manual parameter settings. To show the effectiveness of our method, we performed several experiments and compared our method with other competitive multivariate-based feature selectors. In our comparison, we used two classifiers (support vector machine, k-nearest neighbor) on two groups of datasets, namely two-class and multi-class datasets. Experimental results demonstrate that our method performs better than the others, especially on three hard-to-classify datasets, namely Wang's Breast Cancer, Gordon's Lung Adenocarcinoma and Pomeroy's Medulloblastoma.
Affiliation(s)
- Shiquan Sun
- Systems Engineering Institute, School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China
| | - Qinke Peng
- Systems Engineering Institute, School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China
| | - Adnan Shakoor
- Systems Engineering Institute, School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China
| |
|