1
Xiao D, Lin M, Liu C, Geddes TA, Burchfield J, Parker B, Humphrey SJ, Yang P. SnapKin: a snapshot deep learning ensemble for kinase-substrate prediction from phosphoproteomics data. NAR Genom Bioinform 2023; 5:lqad099. [PMID: 37954574 PMCID: PMC10632189 DOI: 10.1093/nargab/lqad099] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/29/2023] [Revised: 09/18/2023] [Accepted: 10/25/2023] [Indexed: 11/14/2023] [Open Access]
Abstract
A major challenge in mass spectrometry-based phosphoproteomics lies in identifying the substrates of kinases, as currently only a small fraction of the substrates identified can be confidently linked with a known kinase. Machine learning techniques are promising approaches for leveraging large-scale phosphoproteomics data to computationally predict substrates of kinases. However, the small number of experimentally validated kinase substrates (true positives) and the high data noise in many phosphoproteomics datasets together limit their applicability and utility. Here, we aim to develop advanced kinase-substrate prediction methods to address these challenges. Using a collection of seven large phosphoproteomics datasets, and both traditional and deep learning models, we first demonstrate that a 'pseudo-positive' learning strategy for alleviating small sample size is effective at improving model predictive performance. We next show that a data resampling-based ensemble learning strategy is useful for improving model stability while further enhancing prediction. Lastly, we introduce SnapKin, an ensemble deep learning method for predicting substrates of kinases from large-scale phosphoproteomics data, which incorporates the above two learning strategies into a 'snapshot' ensemble learning algorithm. We demonstrate that SnapKin consistently outperforms existing methods in kinase-substrate prediction. SnapKin is freely available at https://github.com/PYangLab/SnapKin.
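The 'snapshot' idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes the cyclic cosine learning-rate schedule commonly associated with snapshot ensembles; the abstract does not state which schedule SnapKin uses, and the function names and scores below are hypothetical.

```python
import math


def cyclic_cosine_lr(step, steps_per_cycle, lr_max):
    """Learning rate that restarts at lr_max every cycle and decays along a cosine."""
    position = step % steps_per_cycle
    return 0.5 * lr_max * (1.0 + math.cos(math.pi * position / steps_per_cycle))


def snapshot_steps(total_steps, steps_per_cycle):
    """0-based training steps at which a snapshot is stored (the end of each cycle)."""
    return [s for s in range(total_steps) if (s + 1) % steps_per_cycle == 0]


def average_snapshots(snapshot_probabilities):
    """Average per-site probabilities over all stored snapshots."""
    n = len(snapshot_probabilities)
    return [sum(column) / n for column in zip(*snapshot_probabilities)]


# Hypothetical scores for three phosphosites, one row per snapshot.
scores = [[0.9, 0.2, 0.6], [0.7, 0.4, 0.5], [0.8, 0.3, 0.7]]
ensemble = average_snapshots(scores)
```

One training run thus yields several models at the cost of one, because each learning-rate restart lets the network settle into a different minimum before its weights are stored.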
Affiliation(s)
- Di Xiao
- Computational Systems Biology Group, Children’s Medical Research Institute, The University of Sydney, Westmead, NSW 2145, Australia
- Michael Lin
- School of Mathematics and Statistics, The University of Sydney, Sydney, NSW 2006, Australia
- Chunlei Liu
- Computational Systems Biology Group, Children’s Medical Research Institute, The University of Sydney, Westmead, NSW 2145, Australia
- Thomas A Geddes
- Computational Systems Biology Group, Children’s Medical Research Institute, The University of Sydney, Westmead, NSW 2145, Australia
- Charles Perkins Centre, The University of Sydney, Sydney, NSW 2006, Australia
- School of Environmental and Life Sciences, The University of Sydney, Sydney, NSW 2006, Australia
- James G Burchfield
- Charles Perkins Centre, The University of Sydney, Sydney, NSW 2006, Australia
- School of Environmental and Life Sciences, The University of Sydney, Sydney, NSW 2006, Australia
- Benjamin L Parker
- Centre for Muscle Research, Department of Anatomy and Physiology, School of Biomedical Sciences, Melbourne, VIC 3010, Australia
- Sean J Humphrey
- Charles Perkins Centre, The University of Sydney, Sydney, NSW 2006, Australia
- School of Environmental and Life Sciences, The University of Sydney, Sydney, NSW 2006, Australia
- Murdoch Children’s Research Institute, The Royal Children’s Hospital, Melbourne, VIC, 3052, Australia
- Pengyi Yang
- Computational Systems Biology Group, Children’s Medical Research Institute, The University of Sydney, Westmead, NSW 2145, Australia
- School of Mathematics and Statistics, The University of Sydney, Sydney, NSW 2006, Australia
- Charles Perkins Centre, The University of Sydney, Sydney, NSW 2006, Australia
2
Huang H, Liu C, Wagle MM, Yang P. Evaluation of deep learning-based feature selection for single-cell RNA sequencing data analysis. Genome Biol 2023; 24:259. [PMID: 37950331 PMCID: PMC10638755 DOI: 10.1186/s13059-023-03100-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/09/2022] [Accepted: 10/24/2023] [Indexed: 11/12/2023] [Open Access]
Abstract
BACKGROUND Feature selection is an essential task in single-cell RNA-seq (scRNA-seq) data analysis and can be critical for gene dimension reduction and downstream analyses, such as gene marker identification and cell type classification. Most popular methods for feature selection from scRNA-seq data are based on the concept of differential distribution wherein a statistical model is used to detect changes in gene expression among cell types. Recent development of deep learning-based feature selection methods provides an alternative approach compared to traditional differential distribution-based methods in that the importance of a gene is determined by neural networks. RESULTS In this work, we explore the utility of various deep learning-based feature selection methods for scRNA-seq data analysis. We sample from Tabula Muris and Tabula Sapiens atlases to create scRNA-seq datasets with a range of data properties and evaluate the performance of traditional and deep learning-based feature selection methods for cell type classification, feature selection reproducibility and diversity, and computational time. CONCLUSIONS Our study provides a reference for future development and application of deep learning-based feature selection methods for single-cell omics data analyses.
Affiliation(s)
- Hao Huang
- Computational Systems Biology Unit, Faculty of Medicine and Health, Children's Medical Research Institute, University of Sydney, Westmead, NSW, 2145, Australia
- School of Mathematics and Statistics, Faculty of Science, University of Sydney, Camperdown, NSW, 2006, Australia
- Sydney Precision Data Science Centre, University of Sydney, Camperdown, NSW, 2006, Australia
- Chunlei Liu
- Computational Systems Biology Unit, Faculty of Medicine and Health, Children's Medical Research Institute, University of Sydney, Westmead, NSW, 2145, Australia
- Sydney Precision Data Science Centre, University of Sydney, Camperdown, NSW, 2006, Australia
- Manoj M Wagle
- Computational Systems Biology Unit, Faculty of Medicine and Health, Children's Medical Research Institute, University of Sydney, Westmead, NSW, 2145, Australia
- School of Mathematics and Statistics, Faculty of Science, University of Sydney, Camperdown, NSW, 2006, Australia
- Sydney Precision Data Science Centre, University of Sydney, Camperdown, NSW, 2006, Australia
- Pengyi Yang
- Computational Systems Biology Unit, Faculty of Medicine and Health, Children's Medical Research Institute, University of Sydney, Westmead, NSW, 2145, Australia.
- School of Mathematics and Statistics, Faculty of Science, University of Sydney, Camperdown, NSW, 2006, Australia.
- Sydney Precision Data Science Centre, University of Sydney, Camperdown, NSW, 2006, Australia.
- Charles Perkins Centre, University of Sydney, Camperdown, NSW, 2006, Australia.
3
Pan X, Cheng J, Hou F, Lan R, Lu C, Li L, Feng Z, Wang H, Liang C, Liu Z, Chen X, Han C, Liu Z. SMILE: Cost-sensitive multi-task learning for nuclear segmentation and classification with imbalanced annotations. Med Image Anal 2023; 88:102867. [PMID: 37348167 DOI: 10.1016/j.media.2023.102867] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 09/24/2022] [Revised: 03/25/2023] [Accepted: 06/07/2023] [Indexed: 06/24/2023]
Abstract
High-throughput nuclear segmentation and classification of whole slide images (WSIs) is crucial to biological analysis, clinical diagnosis and precision medicine. With the advances of CNN algorithms and continuously growing datasets, considerable progress has been made in nuclear segmentation and classification. However, few works consider how to reasonably deal with nuclear heterogeneity in the following two aspects: imbalanced data distribution and diversified morphology characteristics. The minority classes might be dominated by the majority classes due to the imbalanced data distribution, and the diversified morphology characteristics may lead to fragile segmentation results. In this study, a cost-Sensitive MultI-task LEarning (SMILE) framework is proposed to tackle the data heterogeneity problem. Based on the most popular multi-task learning backbone in nuclei segmentation and classification, we propose a multi-task correlation attention (MTCA), which performs feature interaction among multiple highly relevant tasks to learn a better feature representation. A cost-sensitive learning strategy is proposed to address the imbalanced data distribution by increasing the penalty for misclassifying the minority classes. Furthermore, we propose a novel post-processing step based on a coarse-to-fine marker-controlled watershed scheme to alleviate fragile segmentation when nuclei are large and have unclear contours. Extensive experiments show that the proposed method achieves state-of-the-art performance on the CoNSeP and MoNuSAC 2020 datasets. The code is available at: https://github.com/panxipeng/nuclear_segandcls.
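The cost-sensitive idea of penalising minority-class errors more heavily can be sketched as a class-weighted cross-entropy. This is a generic illustration under the assumption of inverse-frequency weights; the abstract does not give the weighting used in SMILE, and the labels and probabilities below are hypothetical.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


def weighted_cross_entropy(probabilities, labels, weights):
    """Cross-entropy in which each sample is scaled by the weight of its true class."""
    total = sum(-weights[y] * math.log(p[y]) for p, y in zip(probabilities, labels))
    return total / sum(weights[y] for y in labels)


labels = [0, 0, 0, 0, 0, 0, 1, 1]  # hypothetical: class 1 is the minority
weights = inverse_frequency_weights(labels)
uniform = [[0.5, 0.5]] * len(labels)
loss = weighted_cross_entropy(uniform, labels, weights)
```

With these weights a single misclassified minority sample contributes three times as much to the loss as a misclassified majority sample, which is what pushes the classifier away from always predicting the majority class.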
Affiliation(s)
- Xipeng Pan
- School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China; Department of Radiology, Guangdong Provincial People's Hospital (Guangdong Academy of Medical Sciences), Southern Medical University, Guangzhou, Guangdong 510080, China; Guangdong Provincial Key Laboratory of Artificial Intelligence in Medical Image Analysis and Application, Guangzhou, Guangdong 510080, China.
- Jijun Cheng
- Software Engineering Institute, East China Normal University, Shanghai 200062, China
- Feihu Hou
- School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China
- Rushi Lan
- School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China
- Cheng Lu
- Department of Radiology, Guangdong Provincial People's Hospital (Guangdong Academy of Medical Sciences), Southern Medical University, Guangzhou, Guangdong 510080, China; Guangdong Provincial Key Laboratory of Artificial Intelligence in Medical Image Analysis and Application, Guangzhou, Guangdong 510080, China
- Lingqiao Li
- School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China
- Zhengyun Feng
- School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China
- Huadeng Wang
- School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China
- Changhong Liang
- Department of Radiology, Guangdong Provincial People's Hospital (Guangdong Academy of Medical Sciences), Southern Medical University, Guangzhou, Guangdong 510080, China; Guangdong Provincial Key Laboratory of Artificial Intelligence in Medical Image Analysis and Application, Guangzhou, Guangdong 510080, China
- Zhenbing Liu
- School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China.
- Xin Chen
- Department of Radiology, Guangzhou First People's Hospital, School of Medicine, South China University of Technology, Guangzhou, Guangdong 510180, China.
- Chu Han
- Department of Radiology, Guangdong Provincial People's Hospital (Guangdong Academy of Medical Sciences), Southern Medical University, Guangzhou, Guangdong 510080, China; Guangdong Provincial Key Laboratory of Artificial Intelligence in Medical Image Analysis and Application, Guangzhou, Guangdong 510080, China.
- Zaiyi Liu
- School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China; Department of Radiology, Guangdong Provincial People's Hospital (Guangdong Academy of Medical Sciences), Southern Medical University, Guangzhou, Guangdong 510080, China; Guangdong Provincial Key Laboratory of Artificial Intelligence in Medical Image Analysis and Application, Guangzhou, Guangdong 510080, China.
4
Xu Y, Yu Z, Chen CLP, Liu Z. Adaptive Subspace Optimization Ensemble Method for High-Dimensional Imbalanced Data Classification. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2023; 34:2284-2297. [PMID: 34469316 DOI: 10.1109/tnnls.2021.3106306] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Indexed: 05/04/2023]
Abstract
It is hard to construct an optimal classifier for high-dimensional imbalanced data, on which classifier performance degrades seriously. Although many approaches, such as resampling, cost-sensitive, and ensemble learning methods, have been proposed to deal with skewed data, they are constrained by high-dimensional data with noise and redundancy. In this study, we propose an adaptive subspace optimization ensemble method (ASOEM) for high-dimensional imbalanced data classification to overcome the above limitations. To construct accurate and diverse base classifiers, a novel adaptive subspace optimization (ASO) method based on an adaptive subspace generation (ASG) process and a rotated subspace optimization (RSO) process is designed to generate multiple robust and discriminative subspaces. Then a resampling scheme is applied to the optimized subspace to build a class-balanced dataset for each base classifier. To verify its effectiveness, ASOEM is implemented with different resampling strategies on 24 real-world high-dimensional imbalanced datasets. Experimental results demonstrate that our proposed methods outperform other mainstream imbalance learning approaches and classifier ensemble methods.
5
Sharma M, Patel RK, Garg A, SanTan R, Acharya UR. Automated detection of schizophrenia using deep learning: a review for the last decade. Physiol Meas 2023; 44. [PMID: 36630717 DOI: 10.1088/1361-6579/acb24d] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2022] [Accepted: 01/11/2023] [Indexed: 01/12/2023]
Abstract
Schizophrenia (SZ) is a devastating mental disorder that disrupts higher brain functions like thought, perception, etc., with a profound impact on the individual's life. Deep learning (DL) can detect SZ automatically by learning signal data characteristics hierarchically without the need for the feature engineering associated with traditional machine learning. We performed a systematic review of DL models for SZ detection. Various deep models like long short-term memory, convolutional neural networks, AlexNet, etc., and composite methods have been published based on electroencephalographic signals, and structural and/or functional magnetic resonance imaging acquired from SZ patients and healthy control subjects in diverse public and private datasets. The studies, the study datasets, and model methodologies are reported in detail. In addition, the challenges of DL models for SZ diagnosis and future works are discussed.
Affiliation(s)
- Manish Sharma
- Department of Electrical and Computer Science Engineering, Institute of Infrastructure Technology Research and Management, Ahmedabad 380026, India
- Ruchit Kumar Patel
- Department of Electrical and Computer Science Engineering, Institute of Infrastructure Technology Research and Management, Ahmedabad 380026, India
- Akshat Garg
- Department of Electrical and Computer Science Engineering, Institute of Infrastructure Technology Research and Management, Ahmedabad 380026, India
- Ru SanTan
- Department of Cardiology, National Heart Centre Singapore, Singapore 169609, Singapore
- U Rajendra Acharya
- Department of Electronics and Computer Engineering, Ngee Ann Polytechnic, Singapore 639798, Singapore; Department of Bioinformatics and Medical Engineering, Asia University, Taichung 41354, Taiwan; Department of Biomedical Engineering, School of Science and Technology, Singapore 639798, Singapore
6
Yang Y, Hu Y, Zhang X, Wang S. Two-Stage Selective Ensemble of CNN via Deep Tree Training for Medical Image Classification. IEEE TRANSACTIONS ON CYBERNETICS 2022; 52:9194-9207. [PMID: 33705343 DOI: 10.1109/tcyb.2021.3061147] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Indexed: 06/12/2023]
Abstract
Medical image classification is an important task in computer-aided diagnosis systems. Its performance is critically determined by the descriptiveness and discriminative power of features extracted from images. With the rapid development of deep learning, deep convolutional neural networks (CNNs) have been widely used to learn the optimal high-level features from the raw pixels of images for a given classification task. However, due to the limited amount of labeled medical images with certain quality distortions, such techniques suffer from training difficulties, including overfitting, local optima, and vanishing gradients. To solve these problems, in this article, we propose a two-stage selective ensemble of CNN branches via a novel training strategy called deep tree training (DTT). In our approach, DTT is adopted to jointly train a series of networks constructed from the hidden layers of the CNN in a hierarchical manner. This mitigates vanishing gradients by supplementing gradients for the hidden layers of the CNN and intrinsically yields base classifiers on the middle-level features with minimal computational burden for an ensemble solution. Moreover, the CNN branches as base learners are combined into the optimal classifier via the proposed two-stage selective ensemble approach based on both accuracy and diversity criteria. Extensive experiments on the CIFAR-10 benchmark and two specific medical image datasets illustrate that our approach achieves better performance in terms of accuracy, sensitivity, specificity, and F1 score.
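Selecting ensemble members by accuracy and then by diversity can be sketched as follows. This is a hypothetical two-stage rule (keep the most accurate candidates, then greedily add the ones that disagree most with those already chosen); the abstract does not specify the paper's actual criteria, and all names and data are illustrative.

```python
def accuracy(predictions, truth):
    """Fraction of samples predicted correctly."""
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)


def disagreement(a, b):
    """Fraction of samples on which two classifiers give different labels."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def select_ensemble(all_predictions, truth, keep_accurate, keep_diverse):
    # Stage 1: keep the keep_accurate most accurate candidates.
    ranked = sorted(range(len(all_predictions)),
                    key=lambda i: accuracy(all_predictions[i], truth),
                    reverse=True)
    pool = ranked[:keep_accurate]
    # Stage 2: start from the best one, then greedily add the candidate
    # with the largest total disagreement with the members chosen so far.
    chosen = [pool[0]]
    while len(chosen) < min(keep_diverse, len(pool)):
        rest = [i for i in pool if i not in chosen]
        best = max(rest, key=lambda i: sum(
            disagreement(all_predictions[i], all_predictions[j]) for j in chosen))
        chosen.append(best)
    return chosen


truth = [1, 1, 0, 0, 1, 0]
candidates = [
    [1, 1, 0, 0, 1, 0],  # branch 0: perfect
    [1, 1, 0, 0, 1, 1],  # branch 1: one error
    [1, 1, 0, 0, 1, 1],  # branch 2: duplicate of branch 1
    [0, 1, 0, 0, 1, 0],  # branch 3: a different single error
    [0, 0, 1, 1, 0, 1],  # branch 4: always wrong
]
selected = select_ensemble(candidates, truth, keep_accurate=4, keep_diverse=3)
```

In this toy case the always-wrong branch is dropped in the first stage and the duplicate branch in the second, since it adds no diversity.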
7
Xu Y, Yu Z, Chen CLP. Classifier Ensemble Based on Multiview Optimization for High-Dimensional Imbalanced Data Classification. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2022; PP:870-883. [PMID: 35657843 DOI: 10.1109/tnnls.2022.3177695] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
High-dimensional class imbalanced data have plagued the performance of classification algorithms seriously. Because of a large number of redundant/invalid features and the class imbalanced issue, it is difficult to construct an optimal classifier for high-dimensional imbalanced data. Classifier ensemble has attracted intensive attention since it can achieve better performance than an individual classifier. In this work, we propose a multiview optimization (MVO) to learn more effective and robust features from high-dimensional imbalanced data, based on which an accurate and robust ensemble system is designed. Specifically, an optimized subview generation (OSG) in MVO is first proposed to generate multiple optimized subviews from different scenarios, which can strengthen the classification ability of features and increase the diversity of ensemble members simultaneously. Second, a new evaluation criterion that considers the distribution of data in each optimized subview is developed based on which a selective ensemble of optimized subviews (SEOS) is designed to perform the subview selective ensemble. Finally, an oversampling approach is executed on the optimized view to obtain a new class rebalanced subset for the classifier. Experimental results on 25 high-dimensional class imbalanced datasets indicate that the proposed method outperforms other mainstream classifier ensemble methods.
8
Zhu Z, Wang Z, Li D, Du W. Globalized Multiple Balanced Subsets With Collaborative Learning for Imbalanced Data. IEEE TRANSACTIONS ON CYBERNETICS 2022; 52:2407-2417. [PMID: 32609619 DOI: 10.1109/tcyb.2020.3001158] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 06/11/2023]
Abstract
The skewed distribution of data makes it difficult to classify minority and majority samples in the imbalanced problem. Balanced bagging randomly undersamples the majority samples several times and combines the selected majority samples with the minority samples to form several balanced subsets, in which the numbers of minority and majority samples are roughly equal. However, balanced bagging lacks a unified learning framework. Moreover, it ignores the connection between the subsets and the global information of the entire data distribution. To this end, this article puts several balanced subsets into an effective learning framework with a criterion function. In the learning framework, one regularization term called RS establishes the connection and realizes the collaborative learning of all subsets by requiring consistent outputs for the minority samples in different subsets. Besides, another regularization term called RW provides the global information to each basic classifier by reducing the difference between the direction of the solution vector in each subset and that in the entire dataset. The proposed learning framework is called globalized multiple balanced subsets with collaborative learning (GMBSCL). The experimental results validate the effectiveness of the proposed GMBSCL.
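The balanced bagging baseline described in the abstract can be sketched in a few lines. This covers only the subset construction, not the RS and RW regularization terms of GMBSCL; the function name, seed handling and toy labels are assumptions for illustration.

```python
import random


def balanced_bagging_subsets(labels, n_subsets, minority_label=1, seed=0):
    """Build index sets that pair all minority samples with an equally sized
    random draw from the majority class (requires more majority than minority)."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    subsets = []
    for _ in range(n_subsets):
        sampled = rng.sample(majority, len(minority))  # undersample without replacement
        subsets.append(sorted(minority + sampled))
    return subsets


labels = [1] * 3 + [0] * 20  # hypothetical data: 3 minority, 20 majority samples
subsets = balanced_bagging_subsets(labels, n_subsets=5)
```

Each subset would then train its own base classifier. Because the draws are independent, the subsets share the minority samples but are otherwise unconnected, which is the limitation the abstract sets out to address.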
9
Zhang J, Wang T, Ng WW, Pedrycz W. Ensembling perturbation-based oversamplers for imbalanced datasets. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.01.049] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 10/19/2022]
10
Li Y, Wang S, Jin J, Philip Chen CL. Weighted Competitive-Collaborative Representation Based Classifier for Imbalanced Data Classification. ARTIF INTELL 2022. [DOI: 10.1007/978-3-031-20500-2_38] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/02/2023]
11
Singh A, Ranjan RK, Tiwari A. Credit Card Fraud Detection under Extreme Imbalanced Data: A Comparative Study of Data-level Algorithms. J EXP THEOR ARTIF IN 2021. [DOI: 10.1080/0952813x.2021.1907795] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 10/21/2022]
Affiliation(s)
- Amit Singh
- Indian Computer Emergency Response Team, Ministry of Electronics and Information Technology, New Delhi, India
- Abhishek Tiwari
- Department of Computer Science, Central University of Haryana, Mahendergarh, India
12
Alzubaidi L, Zhang J, Humaidi AJ, Al-Dujaili A, Duan Y, Al-Shamma O, Santamaría J, Fadhel MA, Al-Amidie M, Farhan L. Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. JOURNAL OF BIG DATA 2021; 8:53. [PMID: 33816053 PMCID: PMC8010506 DOI: 10.1186/s40537-021-00444-8] [Citation(s) in RCA: 775] [Impact Index Per Article: 258.3] [Received: 01/21/2021] [Accepted: 03/22/2021] [Indexed: 05/04/2023]
Abstract
In the last few years, the deep learning (DL) computing paradigm has been deemed the gold standard in the machine learning (ML) community. Moreover, it has gradually become the most widely used computational approach in the field of ML, achieving outstanding results on several complex cognitive tasks, matching or even beating human performance. One of the benefits of DL is the ability to learn from massive amounts of data. The DL field has grown fast in the last few years and it has been extensively used to successfully address a wide range of traditional applications. More importantly, DL has outperformed well-known ML techniques in many domains, e.g., cybersecurity, natural language processing, bioinformatics, robotics and control, and medical information processing, among many others. Although several works have reviewed the state of the art in DL, each of them tackled only one aspect of DL, which leads to an overall lack of knowledge about it. Therefore, in this contribution, we propose using a more holistic approach in order to provide a more suitable starting point from which to develop a full understanding of DL. Specifically, this review attempts to provide a more comprehensive survey of the most important aspects of DL, including those enhancements recently added to the field. In particular, this paper outlines the importance of DL and presents the types of DL techniques and networks. It then presents convolutional neural networks (CNNs), which are the most utilized DL network type, and describes the development of CNN architectures together with their main features, e.g., starting with the AlexNet network and closing with the High-Resolution network (HR.Net). Finally, we further present the challenges and suggested solutions to help researchers understand the existing research gaps. It is followed by a list of the major DL applications. Computational tools including FPGA, GPU, and CPU are summarized along with a description of their influence on DL. The paper ends with the evolution matrix, benchmark datasets, and summary and conclusion.
Affiliation(s)
- Laith Alzubaidi
- School of Computer Science, Queensland University of Technology, Brisbane, QLD 4000 Australia
- AlNidhal Campus, University of Information Technology & Communications, Baghdad, 10001 Iraq
- Jinglan Zhang
- School of Computer Science, Queensland University of Technology, Brisbane, QLD 4000 Australia
- Amjad J. Humaidi
- Control and Systems Engineering Department, University of Technology, Baghdad, 10001 Iraq
- Ayad Al-Dujaili
- Electrical Engineering Technical College, Middle Technical University, Baghdad, 10001 Iraq
- Ye Duan
- Faculty of Electrical Engineering & Computer Science, University of Missouri, Columbia, MO 65211 USA
- Omran Al-Shamma
- AlNidhal Campus, University of Information Technology & Communications, Baghdad, 10001 Iraq
- J. Santamaría
- Department of Computer Science, University of Jaén, 23071 Jaén, Spain
- Mohammed A. Fadhel
- College of Computer Science and Information Technology, University of Sumer, Thi Qar, 64005 Iraq
- Muthana Al-Amidie
- Faculty of Electrical Engineering & Computer Science, University of Missouri, Columbia, MO 65211 USA
- Laith Farhan
- School of Engineering, Manchester Metropolitan University, Manchester, M1 5GD UK
13
Jing XY, Zhang X, Zhu X, Wu F, You X, Gao Y, Shan S, Yang JY. Multiset Feature Learning for Highly Imbalanced Data Classification. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2021; 43:139-156. [PMID: 31331881 DOI: 10.1109/tpami.2019.2929166] [Citation(s) in RCA: 26] [Impact Index Per Article: 8.7] [Indexed: 06/10/2023]
Abstract
With the expansion of data, increasingly imbalanced data have emerged. When the imbalance ratio (IR) of data is high, the classification performance of most existing imbalanced learning methods declines seriously. In this paper, we systematically investigate the highly imbalanced data classification problem, and propose an uncorrelated cost-sensitive multiset learning (UCML) approach for it. Specifically, UCML first constructs multiple balanced subsets through random partition, and then employs multiset feature learning (MFL) to learn discriminant features from the constructed multiset. To enhance the usability of each subset and deal with the non-linearity issue existing in each subset, we further propose a deep metric based UCML (DM-UCML) approach. DM-UCML introduces the generative adversarial network technique into the multiset construction process, such that each subset has a distribution similar to that of the original dataset. To cope with the non-linearity issue, DM-UCML integrates deep metric learning with MFL, such that more favorable performance can be achieved. In addition, DM-UCML designs a new discriminant term to enhance the discriminability of the learned metrics. Experiments on eight traditional highly class-imbalanced datasets and two large-scale datasets indicate that the proposed approaches outperform state-of-the-art highly imbalanced learning methods and are more robust to high IR.
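The first step of UCML, building balanced subsets through random partition, might look like the sketch below. It assumes the partition is over the majority class, with roughly as many blocks as the imbalance ratio and every block paired with all minority samples; the abstract does not confirm these details, and the names and data are hypothetical.

```python
import random


def partition_into_balanced_subsets(labels, minority_label=1, seed=0):
    """Shuffle the majority class, split it into disjoint blocks of about the
    minority size, and pair every block with all minority samples."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    rng.shuffle(majority)
    n_parts = max(1, len(majority) // len(minority))  # roughly the imbalance ratio
    parts = [majority[k::n_parts] for k in range(n_parts)]
    return [sorted(minority + part) for part in parts]


labels = [1] * 4 + [0] * 20  # hypothetical data with imbalance ratio 5
subsets = partition_into_balanced_subsets(labels)
```

Unlike repeated random undersampling, a partition uses every majority sample exactly once, so no majority information is discarded across the multiset.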
14
Guo Y, Chu Y, Jiao B, Cheng J, Yu Z, Cui N, Ma L. Evolutionary Dual-Ensemble Class Imbalance Learning for Human Activity Recognition. IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE 2021. [DOI: 10.1109/tetci.2021.3079966] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 11/09/2022]
15
Iiduka H. Stochastic Fixed Point Optimization Algorithm for Classifier Ensemble. IEEE TRANSACTIONS ON CYBERNETICS 2020; 50:4370-4380. [PMID: 31247582 DOI: 10.1109/tcyb.2019.2921369] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 06/09/2023]
Abstract
This paper considers a classifier ensemble problem with sparsity and diversity learning, which arises in the field of machine learning, and shows that the classifier ensemble problem can be formulated as a convex stochastic optimization problem over the fixed point set of a quasi-nonexpansive mapping. Specifically, for such a problem, this paper proposes an algorithm referred to as the stochastic fixed point optimization algorithm and performs a convergence analysis for three types of step size: 1) constant step size; 2) decreasing step size; and 3) a step size computed by line searches. In the case of a constant step size, the results indicate that a sufficiently small constant step size allows a solution to the problem to be approximated. In the case of a decreasing step size, conditions are shown under which the algorithm converges in probability to a solution. For the third case, a variation of the basic proposed algorithm also achieves convergence in probability to a solution. The high classification accuracies of the proposed algorithms are demonstrated through numerical comparisons with the conventional algorithm.
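The contrast between constant and decreasing step sizes can be seen on a toy problem. The sketch below uses plain stochastic gradient descent on a one-dimensional quadratic as a stand-in; it is not the paper's fixed point algorithm, the line-search variant is omitted, and the target value, noise range and step sizes are arbitrary assumptions.

```python
import random


def sgd_mean(samples, step_size, x0=0.0):
    """Minimise E[(x - z)^2] / 2 with stochastic gradient steps x <- x - step * (x - z)."""
    x = x0
    for n, z in enumerate(samples):
        x -= step_size(n) * (x - z)
    return x


rng = random.Random(0)
samples = [3.0 + rng.uniform(-1.0, 1.0) for _ in range(2000)]  # noisy draws around 3

x_constant = sgd_mean(samples, lambda n: 0.1)               # constant step size
x_decreasing = sgd_mean(samples, lambda n: 1.0 / (n + 1))   # decreasing step size
sample_mean = sum(samples) / len(samples)
```

With the step size 1/(n+1) the iterate equals the running mean of the samples, so it converges to the minimiser as more samples arrive. With a constant step the iterate keeps fluctuating in a neighbourhood of the minimiser whose size shrinks with the step, which mirrors the abstract's statement that a sufficiently small constant step only approximates a solution.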
16
17
Pei W, Xue B, Shang L, Zhang M. Genetic programming for high-dimensional imbalanced classification with a new fitness function and program reuse mechanism. Soft comput 2020. [DOI: 10.1007/s00500-020-05056-7] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 11/25/2022]
18
Sachdev K, Gupta MK. A comprehensive review of computational techniques for the prediction of drug side effects. Drug Dev Res 2020; 81:650-670. [DOI: 10.1002/ddr.21669] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 01/08/2020] [Revised: 03/18/2020] [Accepted: 03/30/2020] [Indexed: 12/28/2022]
Affiliation(s)
- Kanica Sachdev
  - School of Computer Science and Engineering, Shri Mata Vaishno Devi University, Katra, Jammu and Kashmir, India
- Manoj K. Gupta
  - School of Computer Science and Engineering, Shri Mata Vaishno Devi University, Katra, Jammu and Kashmir, India
|
19
|
Yang K, Yu Z, Wen X, Cao W, Chen CLP, Wong HS, You J. Hybrid Classifier Ensemble for Imbalanced Data. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2020; 31:1387-1400. [PMID: 31265410 DOI: 10.1109/tnnls.2019.2920246] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
The class imbalance problem has become a leading challenge. Although conventional imbalance learning methods have been proposed to tackle this problem, they have some limitations: 1) undersampling methods suffer from losing important information and 2) cost-sensitive methods are sensitive to outliers and noise. To address these issues, we propose a hybrid optimal ensemble classifier framework that combines density-based undersampling and cost-effective methods by exploring state-of-the-art solutions using a multi-objective optimization algorithm. Specifically, we first develop a density-based undersampling method to select informative samples from the original training data with probability-based data transformation, which makes it possible to obtain multiple subsets following a balanced distribution across classes. Second, we exploit the cost-sensitive classification method to address the incompleteness of information problem by modifying the weights of misclassified minority samples rather than the majority ones. Finally, we introduce a multi-objective optimization procedure and utilize connections between samples to self-modify the classification result using an ensemble classifier framework. Extensive comparative experiments conducted on real-world data sets demonstrate that our method outperforms the majority of imbalance and ensemble classification approaches.
|
20
|
Kim T, Lo K, Geddes TA, Kim HJ, Yang JYH, Yang P. scReClassify: post hoc cell type classification of single-cell RNA-seq data. BMC Genomics 2019; 20:913. [PMID: 31874628 PMCID: PMC6929456 DOI: 10.1186/s12864-019-6305-x] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/21/2022] Open
Abstract
Background: Single-cell RNA-sequencing (scRNA-seq) is a fast emerging technology allowing global transcriptome profiling on the single cell level. Cell type identification from scRNA-seq data is a critical task in a variety of research such as developmental biology, cell reprogramming, and cancers. Typically, cell type identification relies on human inspection using a combination of prior biological knowledge (e.g. marker genes and morphology) and computational techniques (e.g. PCA and clustering). Due to the incompleteness of our current knowledge and the subjectivity involved in this process, a small number of cells may be subject to mislabelling. Results: Here, we propose a semi-supervised learning framework, named scReClassify, for ‘post hoc’ cell type identification from scRNA-seq datasets. Starting from an initial cell type annotation with potentially mislabelled cells, scReClassify first performs dimension reduction using PCA and next applies a semi-supervised learning method to learn and subsequently reclassify cells that are likely mislabelled initially to the most probable cell types. By using both simulated and real-world experimental datasets that profiled various tissues and biological systems, we demonstrate that scReClassify is able to accurately identify and reclassify misclassified cells to their correct cell types. Conclusions: scReClassify can be used for scRNA-seq data as a post hoc cell type classification tool to fine-tune cell type annotations generated by any cell type classification procedure. It is implemented as an R package and is freely available from https://github.com/SydneyBioX/scReClassify
Affiliation(s)
- Taiyun Kim
  - School of Mathematics and Statistics, Faculty of Science, The University of Sydney, 2006, NSW, Australia; Charles Perkins Centre, The University of Sydney, 2006, NSW, Australia
- Kitty Lo
  - School of Mathematics and Statistics, Faculty of Science, The University of Sydney, 2006, NSW, Australia; Charles Perkins Centre, The University of Sydney, 2006, NSW, Australia
- Thomas A Geddes
  - School of Mathematics and Statistics, Faculty of Science, The University of Sydney, 2006, NSW, Australia; Charles Perkins Centre, The University of Sydney, 2006, NSW, Australia; School of Life and Environmental Sciences, Faculty of Science, The University of Sydney, 2006, NSW, Australia
- Hani Jieun Kim
  - School of Mathematics and Statistics, Faculty of Science, The University of Sydney, 2006, NSW, Australia; Computational Systems Biology Group, Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, 2145, NSW, Australia; Charles Perkins Centre, The University of Sydney, 2006, NSW, Australia
- Jean Yee Hwa Yang
  - School of Mathematics and Statistics, Faculty of Science, The University of Sydney, 2006, NSW, Australia; Charles Perkins Centre, The University of Sydney, 2006, NSW, Australia
- Pengyi Yang
  - School of Mathematics and Statistics, Faculty of Science, The University of Sydney, 2006, NSW, Australia; Computational Systems Biology Group, Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, 2145, NSW, Australia; Charles Perkins Centre, The University of Sydney, 2006, NSW, Australia
|
21
|
Quan Y, Luo ZH, Yang QY, Li J, Zhu Q, Liu YM, Lv BM, Cui ZJ, Qin X, Xu YH, Zhu LD, Zhang HY. Systems Chemical Genetics-Based Drug Discovery: Prioritizing Agents Targeting Multiple/Reliable Disease-Associated Genes as Drug Candidates. Front Genet 2019; 10:474. [PMID: 31191604 PMCID: PMC6549477 DOI: 10.3389/fgene.2019.00474] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/07/2019] [Accepted: 05/01/2019] [Indexed: 01/10/2023] Open
Abstract
Genetic disease genes are considered a promising source of drug targets. Most diseases are caused by more than one pathogenic factor; thus, it is reasonable to consider that chemical agents targeting multiple disease genes are more likely to have desired activities. This is supported by a comprehensive analysis on the relationships between agent activity/druggability and target genetic characteristics. The therapeutic potential of agents increases steadily with increasing number of targeted disease genes, and can be further enhanced by strengthened genetic links between targets and diseases. By using the multi-label classification models for genetics-based drug activity prediction, we provide universal tools for prioritizing drug candidates. All of the documented data and the machine-learning prediction service are available at SCG-Drug (http://zhanglab.hzau.edu.cn/scgdrug).
Affiliation(s)
- Yuan Quan
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China
- Zhi-Hui Luo
  - College of Life Sciences and Technology, Huazhong Agricultural University, Wuhan, China
- Qing-Yong Yang
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China
- Jiang Li
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China
- Qiang Zhu
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China
- Ye-Mao Liu
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China
- Bo-Min Lv
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China
- Ze-Jia Cui
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China
- Xuan Qin
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China
- Yan-Hua Xu
  - Sci-meds Biopharmaceutical Co., Ltd., Wuhan, China
- Li-Da Zhu
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China
- Hong-Yu Zhang
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China
|
22
|
Susan S, Kumar A. SSOMaj-SMOTE-SSOMin: Three-step intelligent pruning of majority and minority samples for learning from imbalanced datasets. Appl Soft Comput 2019. [DOI: 10.1016/j.asoc.2019.02.028] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/27/2022]
|
23
|
Yang P, Ormerod JT, Liu W, Ma C, Zomaya AY, Yang JYH. AdaSampling for Positive-Unlabeled and Label Noise Learning With Bioinformatics Applications. IEEE TRANSACTIONS ON CYBERNETICS 2019; 49:1932-1943. [PMID: 29993676 DOI: 10.1109/tcyb.2018.2816984] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Class labels are required for supervised learning but may be corrupted or missing in various applications. In binary classification, for example, when only a subset of positive instances is labeled whereas the remaining are unlabeled, positive-unlabeled (PU) learning is required to model from both positive and unlabeled data. Similarly, when class labels are corrupted by mislabeled instances, methods are needed for learning in the presence of class label noise (LN). Here we propose adaptive sampling (AdaSampling), a framework for both PU learning and learning with class LN. By iteratively estimating the class mislabeling probability with an adaptive sampling procedure, the proposed method progressively reduces the risk of selecting mislabeled instances for model training and subsequently constructs highly generalizable models even when a large proportion of mislabeled instances is present in the data. We demonstrate the utilities of proposed methods using simulation and benchmark data, and compare them to alternative approaches that are commonly used for PU learning and/or learning with LN. We then introduce two novel bioinformatics applications where AdaSampling is used to: 1) identify kinase-substrates from mass spectrometry-based phosphoproteomics data and 2) predict transcription factor target genes by integrating various next-generation sequencing data.
|
24
|
Yang Y, Jiang J. Adaptive Bi-Weighting Toward Automatic Initialization and Model Selection for HMM-Based Hybrid Meta-Clustering Ensembles. IEEE TRANSACTIONS ON CYBERNETICS 2019; 49:1657-1668. [PMID: 29994293 DOI: 10.1109/tcyb.2018.2809562] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Temporal data clustering can provide underpinning techniques for the discovery of intrinsic structures, which has proved important in condensing or summarizing information demanded in various fields of information sciences, ranging from time series analysis to sequential data understanding. In this paper, we propose a novel hidden Markov model (HMM)-based hybrid meta-clustering ensemble with a bi-weighting scheme to solve the problems of initialization and model selection associated with temporal data clustering. To improve the performance of the ensemble techniques, the proposed bi-weighting scheme adaptively examines the partition process and hence optimizes the fusion of consensus functions. Specifically, three consensus functions are used to combine the input partitions, generated by HMM-based K-models under different initializations, into a robust consensus partition. An optimal consensus partition is then selected from the three candidates by a normalized mutual information-based objective function. Finally, the optimal consensus partition is further refined by the HMM-based agglomerative clustering algorithm in association with a dendrogram-based similarity partitioning algorithm, with the advantage that the number of clusters can be automatically and adaptively determined. Extensive experiments on synthetic data, time series, and real-world motion trajectory datasets illustrate that our proposed approach outperforms all the selected benchmarks and hence provides promising potential for developing improved clustering tools for information analysis and management.
|
25
|
Binary teaching–learning-based optimization algorithm with a new update mechanism for sample subset optimization in software defect prediction. Soft comput 2018. [DOI: 10.1007/s00500-018-3546-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/28/2022]
|
26
|
Yu Z, Lu Y, Zhang J, You J, Wong HS, Wang Y, Han G. Progressive Semisupervised Learning of Multiple Classifiers. IEEE TRANSACTIONS ON CYBERNETICS 2018; 48:689-702. [PMID: 28113355 DOI: 10.1109/tcyb.2017.2651114] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]
Abstract
Semisupervised learning methods are often adopted to handle datasets with a very small number of labeled samples. However, conventional semisupervised ensemble learning approaches have two limitations: 1) most of them cannot obtain satisfactory results on high-dimensional datasets with limited labels and 2) they usually do not consider how to use an optimization process to enlarge the training set. In this paper, we propose the progressive semisupervised ensemble learning approach (PSEMISEL) to address the above limitations and handle datasets with a very small number of labeled samples. When compared with traditional semisupervised ensemble learning approaches, PSEMISEL is characterized by two properties: 1) it adopts the random subspace technique to investigate the structure of the dataset in the subspaces and 2) a progressive training set generation process and a self-evolutionary sample selection process are proposed to enlarge the training set. We also use a set of nonparametric tests to compare different semisupervised ensemble learning methods over multiple datasets. The experimental results on 18 real-world datasets from the University of California, Irvine machine learning repository show that PSEMISEL works well on most of the real-world datasets, and outperforms other state-of-the-art approaches on 10 out of 18 datasets.
|
27
|
Zhang X, Zhuang Y, Wang W, Pedrycz W. Transfer Boosting With Synthetic Instances for Class Imbalanced Object Recognition. IEEE TRANSACTIONS ON CYBERNETICS 2018; 48:357-370. [PMID: 28026795 DOI: 10.1109/tcyb.2016.2636370] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]
Abstract
A challenging problem in object recognition is to train a robust classifier with a small and imbalanced data set. In such cases, the learned classifier tends to overfit the training data and has low prediction accuracy on the minority class. In this paper, we address the problem of class imbalanced object recognition by combining the synthetic minority over-sampling technique (SMOTE) and instance-based transfer boosting to rebalance the skewed class distribution. We present ways of generating synthetic instances under the learning framework of transfer Adaboost. A novel weighted SMOTE technique (WSMOTE) is proposed to generate weighted synthetic instances with weighted source and target instances at each boosting round. Based on WSMOTE, we propose a novel class imbalanced transfer boosting algorithm called WSMOTE-TrAdaboost and experimentally demonstrate its effectiveness on four datasets (Office, Caltech256, SUN2012, and VOC2012) for object recognition applications. A bag-of-words model with SURF features and histogram of oriented gradient features are separately used to represent an image. We experimentally demonstrated the effectiveness and robustness of our approaches by comparing them with several baseline algorithms in the boosting family for class imbalanced learning.
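As an illustration of the oversampling idea this abstract builds on, the following is a minimal sketch of plain SMOTE-style interpolation, not the weighted WSMOTE variant or the transfer boosting procedure of the paper. The function name `smote_sample`, its parameters, and the toy data are assumptions made for this example only.

```python
import random


def smote_sample(minority, k=2, n_new=4, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority point and one of its k nearest minority neighbours.
    (Hypothetical helper for illustration; plain SMOTE, no instance weights.)"""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        # The new point lies on the segment between base and neighbour.
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Toy minority class with two features per sample (made-up values).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sample(minority)
print(len(new_points))
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the region already occupied by the minority class.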
|
28
|
Wankhade KK, Jondhale KC, Thool VR. A hybrid approach for classification of rare class data. Knowl Inf Syst 2017. [DOI: 10.1007/s10115-017-1114-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
|
29
|
Yu Z, Chen H, Liu J, You J, Leung H, Han G. Hybrid k-Nearest Neighbor Classifier. IEEE TRANSACTIONS ON CYBERNETICS 2016; 46:1263-1275. [PMID: 26126291 DOI: 10.1109/tcyb.2015.2443857] [Citation(s) in RCA: 52] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/04/2023]
Abstract
Conventional k-nearest neighbor (KNN) classification approaches have several limitations when dealing with problems caused by special datasets, such as the sparse problem, the imbalance problem, and the noise problem. In this paper, we first perform a brief survey on the recent progress of the KNN classification approaches. Then, the hybrid KNN (HBKNN) classification approach, which takes into account the local and global information of the query sample, is designed to address the problems raised by such datasets. Next, the random subspace ensemble framework based on the HBKNN (RS-HBKNN) classifier is proposed to perform classification on datasets with noisy attributes in the high-dimensional space. Finally, we propose adopting nonparametric tests to compare the proposed method with other classification approaches over multiple datasets. The experiments on the real-world datasets from the Knowledge Extraction based on Evolutionary Learning dataset repository demonstrate that RS-HBKNN works well on real datasets, and outperforms most of the state-of-the-art classification approaches.
|
30
|
|
31
|
Hu J, Li Y, Yan WX, Yang JY, Shen HB, Yu DJ. KNN-based dynamic query-driven sample rescaling strategy for class imbalance learning. Neurocomputing 2016. [DOI: 10.1016/j.neucom.2016.01.043] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/31/2022]
|
32
|
Zhang W, Liu F, Luo L, Zhang J. Predicting drug side effects by multi-label learning and ensemble learning. BMC Bioinformatics 2015; 16:365. [PMID: 26537615 PMCID: PMC4634905 DOI: 10.1186/s12859-015-0774-y] [Citation(s) in RCA: 100] [Impact Index Per Article: 11.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/05/2015] [Accepted: 10/14/2015] [Indexed: 01/10/2023] Open
Abstract
BACKGROUND: Predicting drug side effects is an important topic in drug discovery. Although several machine learning methods have been proposed to predict side effects, there is still room for improvement. Firstly, side effect prediction is a multi-label learning task, and we can adopt multi-label learning techniques for it. Secondly, drug-related features are associated with side effects, and feature dimensions have specific biological meanings. Recognizing critical dimensions and reducing irrelevant dimensions may help to reveal the causes of side effects. METHODS: In this paper, we propose a novel method, the 'feature selection-based multi-label k-nearest neighbor method' (FS-MLKNN), which can simultaneously determine critical feature dimensions and construct high-accuracy multi-label prediction models. RESULTS: Computational experiments demonstrate that FS-MLKNN leads to good performance as well as explainable results. To achieve better performance, we further develop the ensemble learning model by integrating individual feature-based FS-MLKNN models. When compared with other state-of-the-art methods, the ensemble method produces better performance on benchmark datasets. CONCLUSIONS: In conclusion, FS-MLKNN and the ensemble method are promising tools for side effect prediction. The source code and datasets are available in the Additional file 1.
Affiliation(s)
- Wen Zhang
  - School of Computer, Wuhan University, Wuhan, 430072, China; Research Institute of Shenzhen, Wuhan University, Shenzhen, 518057, China
- Feng Liu
  - International School of Software, Wuhan University, Wuhan, 430072, China
- Longqiang Luo
  - School of Mathematics and Statistics, Wuhan University, Wuhan, 430072, China
- Jingxia Zhang
  - School of Mathematics and Statistics, Wuhan University, Wuhan, 430072, China
|
33
|
Yang P, Humphrey SJ, James DE, Yang YH, Jothi R. Positive-unlabeled ensemble learning for kinase substrate prediction from dynamic phosphoproteomics data. Bioinformatics 2015; 32:252-9. [PMID: 26395771 DOI: 10.1093/bioinformatics/btv550] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/12/2015] [Accepted: 09/11/2015] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION: Protein phosphorylation is a post-translational modification that underlies various aspects of cellular signaling. A key step to reconstructing signaling networks involves identification of the set of all kinases and their substrates. Experimental characterization of kinase substrates is both expensive and time-consuming. To expedite the discovery of novel substrates, computational approaches based on kinase recognition sequences (motifs) from known substrates, protein structure, interaction and co-localization have been proposed. However, rarely do these methods take into account the dynamic responses of signaling cascades measured from in vivo cellular systems. Given that recent advances in mass spectrometry-based technologies make it possible to quantify phosphorylation on a proteome-wide scale, computational approaches that can integrate static features with dynamic phosphoproteome data would greatly facilitate the prediction of biologically relevant kinase-specific substrates. RESULTS: Here, we propose a positive-unlabeled ensemble learning approach that integrates dynamic phosphoproteomics data with static kinase recognition motifs to predict novel substrates for kinases of interest. We extended a positive-unlabeled learning technique for an ensemble model, which significantly improves prediction sensitivity on novel substrates of kinases while retaining high specificity. We evaluated the performance of the proposed model using simulation studies and subsequently applied it to predict novel substrates of key kinases relevant to insulin signaling. Our analyses show that static sequence motifs and dynamic phosphoproteomics data are complementary and that the proposed integrated model performs better than methods relying only on static information for accurate prediction of kinase-specific substrates. AVAILABILITY AND IMPLEMENTATION: Executable GUI tool, source code and documentation are freely available at https://github.com/PengyiYang/KSP-PUEL. CONTACT: pengyi.yang@nih.gov or jothi@mail.nih.gov SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Pengyi Yang
  - Systems Biology Section, Epigenetics & Stem Cell Biology Laboratory, National Institute of Environmental Health Sciences, National Institutes of Health, RTP, NC 27709, USA
- Sean J Humphrey
  - Department of Proteomics and Signal Transduction, Max-Planck-Institute of Biochemistry, Martinsried, Germany
- David E James
  - Charles Perkins Centre, School of Molecular Bioscience, Sydney Medical School and
- Yee Hwa Yang
  - School of Mathematics and Statistics, The University of Sydney, NSW 2006, Australia
- Raja Jothi
  - Systems Biology Section, Epigenetics & Stem Cell Biology Laboratory, National Institute of Environmental Health Sciences, National Institutes of Health, RTP, NC 27709, USA
|
34
|
Yu Z, Li L, Liu J, Han G. Hybrid adaptive classifier ensemble. IEEE TRANSACTIONS ON CYBERNETICS 2015; 45:177-190. [PMID: 24860045 DOI: 10.1109/tcyb.2014.2322195] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/03/2023]
Abstract
Traditional random subspace-based classifier ensemble approaches (RSCE) have several limitations, such as assigning the same importance to the base classifiers trained in different subspaces and not considering how to find the optimal random subspace set. In this paper, we design a general hybrid adaptive ensemble learning framework (HAEL), and apply it to address the limitations of RSCE. As compared with RSCE, HAEL consists of two adaptive processes, i.e., base classifier competition and classifier ensemble interaction, so as to adjust the weights of the base classifiers in each ensemble and to explore the optimal random subspace set simultaneously. The experiments on the real-world datasets from the KEEL dataset repository for the classification task and the cancer gene expression profiles show that: 1) HAEL works well on both the real-world KEEL datasets and the cancer gene expression profiles and 2) it outperforms most of the state-of-the-art classifier ensemble approaches on 28 out of 36 KEEL datasets and 6 out of 6 cancer datasets.
|
35
|
Zhang XL. Heuristic ternary error-correcting output codes via weight optimization and layered clustering-based approach. IEEE TRANSACTIONS ON CYBERNETICS 2015; 45:289-301. [PMID: 25486660 DOI: 10.1109/tcyb.2014.2325603] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/04/2023]
Abstract
One important classifier ensemble for multiclass classification problems is error-correcting output codes (ECOCs). It bridges multiclass problems and binary-class classifiers by decomposing multiclass problems into a series of binary-class problems. In this paper, we present a heuristic ternary code, named weight optimization and layered clustering-based ECOC (WOLC-ECOC). It starts with an arbitrary valid ECOC and iterates the following two steps until the training risk converges. The first step, named layered clustering-based ECOC (LC-ECOC), constructs multiple strong classifiers on the most confusing binary-class problem. The second step adds the new classifiers to the ECOC by a novel optimized weighted (OW) decoding algorithm, where the optimization problem of the decoding is solved by the cutting plane algorithm. Technically, LC-ECOC prevents the heuristic training process from being blocked by a difficult binary-class problem. OW decoding guarantees that the training risk does not increase, ensuring a small code length. Results on 14 UCI datasets and a music genre classification problem demonstrate the effectiveness of WOLC-ECOC.
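To make the ternary coding idea in this abstract concrete, here is a minimal sketch of decoding with a ternary ECOC matrix using plain Hamming-style distance. It is not the optimized weighted (OW) decoding or the layered clustering step of WOLC-ECOC. The function name `ecoc_decode`, the class labels, and the one-vs-one code matrix are assumptions chosen for illustration.

```python
def ecoc_decode(code_matrix, predictions):
    """Return the class whose codeword disagrees least with the outputs of
    the binary classifiers. Codeword entries are +1 or -1; an entry of 0
    means the class is not involved in that binary problem and is skipped.
    (Hypothetical helper: unweighted Hamming-style decoding.)"""
    def distance(codeword):
        return sum(
            1 for c, p in zip(codeword, predictions) if c != 0 and c != p
        )

    return min(code_matrix, key=lambda label: distance(code_matrix[label]))


# One-vs-one ternary code for three classes (illustrative only):
# column 0 separates A from B, column 1 separates A from C,
# column 2 separates B from C.
code_matrix = {
    "A": [+1, +1, 0],
    "B": [-1, 0, +1],
    "C": [0, -1, -1],
}

print(ecoc_decode(code_matrix, [+1, +1, -1]))  # -> A
print(ecoc_decode(code_matrix, [-1, -1, +1]))  # -> B
```

A weighted decoder would replace the unit cost of each disagreement with a learned per-classifier weight; the unweighted form above is only the baseline that such schemes refine.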
|