1
|
Analysis of variables to determine their influence on renewable energy forecasting using ensemble methods. Heliyon 2024; 10:e30002. [PMID: 38774065 PMCID: PMC11106819 DOI: 10.1016/j.heliyon.2024.e30002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/14/2023] [Revised: 04/04/2024] [Accepted: 04/18/2024] [Indexed: 05/24/2024] Open
Abstract
Forecasting is of great importance in the field of renewable energy because it indicates how much energy can be produced and thus enables efficient management of energy sources. However, determining which prediction system is most suitable is very complex, as each energy infrastructure is different. This work studies the influence of several variables on predictions made with ensemble methods for different locations. In particular, the proposal analyzes three aspects: the sampling frequency of solar panel systems, the type of neural network architecture, and the number of ensemble method blocks for each model. Following comprehensive experimentation across multiple locations, our study has identified the most effective solar energy prediction model tailored to the specific conditions of each energy infrastructure. The results offer a decisive framework for selecting the optimal system for accurate and efficient energy forecasting. The key finding is that short time intervals should be used, regardless of the type of prediction model and of its ensemble method.
|
2
|
CGO-ensemble: Chaos game optimization algorithm-based fusion of deep neural networks for accurate Mpox detection. Neural Netw 2024; 173:106183. [PMID: 38382397 DOI: 10.1016/j.neunet.2024.106183] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/15/2023] [Revised: 12/19/2023] [Accepted: 02/15/2024] [Indexed: 02/23/2024]
Abstract
The rising global incidence of human Mpox cases necessitates prompt and accurate identification for effective disease control. Whereas previous studies have predominantly explored traditional ensemble methods for detection, we introduce a novel approach that leverages a metaheuristic-based ensemble framework. In this research, we present the CGO-Ensemble framework, designed to improve the accuracy of detecting Mpox infection in patients. Initially, we employ five transfer learning base models that integrate feature integration layers and residual blocks. These components play a crucial role in capturing significant features from the skin images, thereby enhancing the models' efficacy. In the next step, we employ a weighted averaging scheme to consolidate the predictions generated by the distinct models. To achieve the optimal allocation of weights for each base model in the ensemble process, we leverage the Chaos Game Optimization (CGO) algorithm. This strategic weight assignment enhances classification outcomes considerably, surpassing the performance of randomly assigned weights. Implementing this approach yields notably higher prediction accuracy than using individual models. We evaluate the effectiveness of our proposed approach through comprehensive experiments conducted on two widely recognized benchmark datasets: the Mpox Skin Lesion Dataset (MSLD) and the Mpox Skin Image Dataset (MSID). To gain insights into the decision-making process of the base models, we performed Gradient-weighted Class Activation Mapping (Grad-CAM) analysis. The experimental results show that the CGO-ensemble achieves an accuracy of 100% on MSLD and 94.16% on MSID. Our approach significantly outperforms other state-of-the-art optimization algorithms, traditional ensemble methods, and existing techniques in the context of Mpox detection on these datasets.
These findings underscore the effectiveness and superiority of the CGO-Ensemble in accurately identifying Mpox cases, highlighting its potential in disease detection and classification.
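The weighted averaging step described in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy probabilities, the function names, and in particular the optimiser, which is a plain random search standing in for Chaos Game Optimization (the CGO update rules are not given in the abstract).

```python
import random

def weighted_average(probabilities, weights):
    """Fuse per-model class probabilities using one weight per model."""
    total = sum(weights)
    n_classes = len(probabilities[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]

def accuracy(samples, labels, weights):
    """Share of samples whose fused arg-max class matches the label."""
    hits = 0
    for per_model_probs, label in zip(samples, labels):
        fused = weighted_average(per_model_probs, weights)
        if fused.index(max(fused)) == label:
            hits += 1
    return hits / len(labels)

def random_search(samples, labels, n_models, trials=200, seed=0):
    """Stand-in optimiser (NOT CGO): keep the best of `trials` random weight vectors."""
    rng = random.Random(seed)
    best_w = [1.0] * n_models                      # start from uniform weights
    best_acc = accuracy(samples, labels, best_w)
    for _ in range(trials):
        w = [rng.random() + 1e-9 for _ in range(n_models)]  # strictly positive
        acc = accuracy(samples, labels, w)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc

# Toy validation set: 3 samples, 2 models, 2 classes (made-up numbers).
samples = [
    [[0.9, 0.1], [0.4, 0.6]],
    [[0.2, 0.8], [0.6, 0.4]],
    [[0.9, 0.1], [0.3, 0.7]],
]
labels = [0, 1, 1]
weights, acc = random_search(samples, labels, n_models=2)
print(weights, acc)
```

With uniform weights the toy ensemble gets two of three samples right, so any weight vector the search keeps is at least that good. A metaheuristic such as CGO would replace `random_search` while the fusion step stays the same.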
|
3
|
Revolutionizing heart disease prediction with quantum-enhanced machine learning. Sci Rep 2024; 14:7453. [PMID: 38548774 PMCID: PMC10978992 DOI: 10.1038/s41598-024-55991-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/07/2023] [Accepted: 06/23/2023] [Indexed: 04/01/2024] Open
Abstract
Recent developments in quantum technology have opened up new opportunities for machine learning algorithms to assist the healthcare industry in diagnosing complex health disorders, such as heart disease. In this work, we summarize the effectiveness of quantum-enhanced machine learning (QuEML) in heart disease prediction. To evaluate the performance of QuEML against traditional machine learning algorithms, the Kaggle heart disease dataset was used, which contains 1190 samples, of which 53% are labeled as positive and the remaining 47% as negative. The performance of QuEML was evaluated in terms of accuracy, precision, recall, specificity, F1 score, and training time against traditional machine learning algorithms. The experimental results show that the proposed quantum approaches predicted around 50.03% of positive samples as positive and an average of 44.65% of negative samples as negative, whereas traditional machine learning approaches predicted around 49.78% of positive samples as positive and 44.31% of negative samples as negative. Furthermore, the computational cost of QuEML was measured: it took an average of 670 µs to train, whereas traditional machine learning algorithms took an average of 862.5 µs. Hence, QuEML was found to be a promising approach to heart disease prediction, with an accuracy 0.6% higher and a training time 192.5 µs shorter than those of traditional machine learning approaches.
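The headline figures in this abstract can be checked with a few lines of arithmetic. The sketch below assumes that the reported shares (50.03%, 44.65%, and so on) are percentages of the whole dataset, so that accuracy is their sum; the abstract does not state this explicitly, but the shares stay below the 53% and 47% class proportions, which is consistent with that reading.

```python
# Reported shares, read here as percentages of all 1190 samples (an assumption).
quantum = {"true_pos": 50.03, "true_neg": 44.65, "train_us": 670.0}
classical = {"true_pos": 49.78, "true_neg": 44.31, "train_us": 862.5}

def accuracy_pct(result):
    """Accuracy = correctly classified positives + correctly classified negatives."""
    return result["true_pos"] + result["true_neg"]

# Difference in accuracy, in percentage points, rounded to two decimals.
accuracy_gain = round(accuracy_pct(quantum) - accuracy_pct(classical), 2)
# Difference in average training time, in microseconds.
time_saved_us = classical["train_us"] - quantum["train_us"]

print(f"quantum accuracy:    {accuracy_pct(quantum):.2f}%")
print(f"classical accuracy:  {accuracy_pct(classical):.2f}%")
print(f"accuracy gain:       {accuracy_gain:.2f} percentage points")
print(f"training time saved: {time_saved_us:.1f} µs")
```

Under this reading the gain is 0.59 percentage points, which rounds to the 0.6% quoted in the abstract, and the time difference matches the quoted 192.5 µs.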
|
4
|
A numerical compass for experiment design in chemical kinetics and molecular property estimation. J Cheminform 2024; 16:34. [PMID: 38520014 PMCID: PMC10960421 DOI: 10.1186/s13321-024-00825-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/01/2023] [Accepted: 03/10/2024] [Indexed: 03/25/2024] Open
Abstract
Kinetic process models are widely applied in science and engineering, including atmospheric, physiological and technical chemistry, reactor design, or process optimization. These models rely on numerous kinetic parameters such as reaction rate, diffusion or partitioning coefficients. Determining these properties by experiments can be challenging, especially for multiphase systems, and researchers often face the task of intuitively selecting experimental conditions to obtain insightful results. We developed a numerical compass (NC) method that integrates computational models, global optimization, ensemble methods, and machine learning to identify experimental conditions with the greatest potential to constrain model parameters. The approach is based on the quantification of model output variance in an ensemble of solutions that agree with experimental data. The utility of the NC method is demonstrated for the parameters of a multi-layer model describing the heterogeneous ozonolysis of oleic acid aerosols. We show how neural network surrogate models of the multiphase chemical reaction system can be used to accelerate the application of the NC for a comprehensive mapping and analysis of experimental conditions. The NC can also be applied for uncertainty quantification of quantitative structure-activity relationship (QSAR) models. We show that the uncertainty calculated for molecules that are used to extend training data correlates with the reduction of QSAR model error. The code is openly available as the Julia package KineticCompass.
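The core idea stated in this abstract, ranking experimental conditions by the variance of model output across an ensemble of data-consistent solutions, can be sketched in a few lines. This is not the KineticCompass package or its API: the first-order decay model, the parameter values, and all function names below are hypothetical stand-ins chosen to keep the example self-contained.

```python
import math
from statistics import pvariance

def model_output(rate_constant, time):
    """Toy kinetic model (hypothetical): first-order decay of a reactant fraction."""
    return math.exp(-rate_constant * time)

def most_informative_condition(ensemble, conditions):
    """Return the condition with the largest output variance across the ensemble."""
    spread = {
        t: pvariance([model_output(k, t) for k in ensemble])
        for t in conditions
    }
    return max(spread, key=spread.get), spread

# Rate constants assumed to fit the existing data equally well (made-up values).
ensemble = [0.5, 1.0, 1.5]
# Candidate observation times for the next experiment (made-up values).
conditions = [0.1, 1.0, 10.0]

best, spread = most_informative_condition(ensemble, conditions)
print(best, spread)
```

In this toy case the ensemble members nearly agree at very short and very long times, so the intermediate time is the one where a new measurement would discriminate between them best.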
|
5
|
Evaluation of penalized and machine learning methods for asthma disease prediction in the Korean Genome and Epidemiology Study (KoGES). BMC Bioinformatics 2024; 25:56. [PMID: 38308205 PMCID: PMC10837879 DOI: 10.1186/s12859-024-05677-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2022] [Accepted: 01/26/2024] [Indexed: 02/04/2024] Open
Abstract
BACKGROUND Genome-wide association studies have successfully identified genetic variants associated with human disease. Various statistical approaches based on penalized and machine learning methods have recently been proposed for disease prediction. In this study, we evaluated the performance of several such methods for predicting asthma using the Korean Chip (KORV1.1) from the Korean Genome and Epidemiology Study (KoGES). RESULTS First, single-nucleotide polymorphisms were selected via single-variant tests using logistic regression adjusted for several epidemiological factors. Next, we evaluated the following methods for disease prediction: ridge, least absolute shrinkage and selection operator, elastic net, smoothly clipped absolute deviation, support vector machine, random forest, boosting, bagging, naïve Bayes, and k-nearest neighbor. Finally, we compared their predictive performance based on the area under the receiver operating characteristic curve, precision, recall, F1-score, Cohen's Kappa, balanced accuracy, error rate, Matthews correlation coefficient, and area under the precision-recall curve. Additionally, three oversampling algorithms were used to deal with imbalance problems. CONCLUSIONS Our results show that penalized methods exhibit better predictive performance for asthma than machine learning methods. On the other hand, in the oversampling study, random forest and boosting methods overall showed better prediction performance than penalized methods.
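Several of the threshold-based metrics listed in this abstract follow directly from a binary confusion matrix. The sketch below uses their standard definitions with made-up counts; it is not the study's evaluation code, and the curve-based metrics (area under the ROC and precision-recall curves) are left out because they need ranked scores rather than counts.

```python
import math

def classification_metrics(tp, fp, fn, tn):
    """Compute several of the listed metrics from binary confusion-matrix counts."""
    n = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                 # also called sensitivity
    specificity = tn / (tn + fp)
    observed = (tp + tn) / n                # plain accuracy
    # Agreement expected by chance, used by Cohen's kappa.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "balanced_accuracy": (recall + specificity) / 2,
        "error_rate": 1 - observed,
        "cohen_kappa": (observed - expected) / (1 - expected),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }

# Made-up counts for illustration only.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Balanced accuracy, Cohen's kappa, and the Matthews correlation coefficient are the ones that remain informative under class imbalance, which is presumably why they are reported alongside the oversampling experiments.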
|
6
|
Star algorithm for neural network ensembling. Neural Netw 2024; 170:364-375. [PMID: 38029718 DOI: 10.1016/j.neunet.2023.11.020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2023] [Revised: 09/20/2023] [Accepted: 11/07/2023] [Indexed: 12/01/2023]
Abstract
Neural network ensembling is a common and robust way to increase model efficiency. In this paper, we propose a new neural network ensemble algorithm based on Audibert's empirical star algorithm. We provide an optimal theoretical minimax bound on the excess squared risk. Additionally, we empirically study this algorithm on regression and classification tasks and compare it to the most popular ensembling methods.
|
7
|
A Comprehensive Review on Ensemble Solar Power Forecasting Algorithms. JOURNAL OF ELECTRICAL ENGINEERING & TECHNOLOGY 2023; 18:719-733. [PMID: 37521955 PMCID: PMC9834683 DOI: 10.1007/s42835-023-01378-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/29/2022] [Revised: 12/25/2022] [Accepted: 01/03/2023] [Indexed: 08/01/2023]
Abstract
With increasing demand for energy, the penetration of alternative sources such as renewable energy in power grids has increased. Solar energy is one of the most common and well-known sources of energy in existing networks. However, because of its non-stationary and non-linear characteristics, solar irradiance must be predicted in order to make photovoltaic (PV) plants more reliable and to balance power supply and demand. Although there are various methods to predict solar irradiance, this paper gives an overview of recent studies focusing on solar irradiance forecasting with ensemble methods, which are divided into two main categories: competitive and cooperative ensemble forecasting. In addition, parameter diversity and data diversity are treated as forms of competitive ensemble forecasting, while preprocessing and post-processing are treated as forms of cooperative ensemble forecasting. All these ensemble forecasting methods are investigated in this study. Finally, conclusions are drawn and recommendations for future studies are discussed.
|
8
|
Effective treatment of imbalanced datasets in health care using modified SMOTE coupled with stacked deep learning algorithms. APPLIED NANOSCIENCE 2023; 13:1829-1840. [PMID: 35132368 PMCID: PMC8811587 DOI: 10.1007/s13204-021-02063-4] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 07/21/2021] [Accepted: 08/28/2021] [Indexed: 12/03/2022]
Abstract
One of the prominent uses of predictive analytics is in health care, where it supports more accurate predictions based on proper analysis of cumulative datasets. Often the datasets are quite imbalanced, and sampling techniques like the Synthetic Minority Oversampling Technique (SMOTE) give only moderate accuracy in such cases. To overcome this problem, a two-step approach has been proposed. In the first step, SMOTE is modified into Distance-based SMOTE (D-SMOTE) and Bi-phasic SMOTE (BP-SMOTE) to reduce the class imbalance; these were then coupled with selective classifiers for prediction. An increase in accuracy is noted for both BP-SMOTE and D-SMOTE compared to basic SMOTE. In the second step, machine learning, deep learning, and ensemble algorithms were used to develop a Stacking Ensemble Framework, which showed a significant increase in accuracy for stacking compared to individual machine learning algorithms like Decision Tree, Naïve Bayes, and Neural Networks and to ensemble techniques like Voting, Bagging, and Boosting. Two different methods have been developed by combining deep learning with the stacking approach, namely Stacked CNN and Stacked RNN, which yielded a significantly higher accuracy of 96-97% compared to individual algorithms. The Framingham dataset is used for data sampling, the Wisconsin Hospital data of the Breast Cancer study is used for Stacked CNN, and the Novel Coronavirus 2019 dataset, relating to forecasting COVID-19 cases, is used for Stacked RNN.
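For readers unfamiliar with the baseline that this abstract modifies, the interpolation step of basic SMOTE can be sketched as follows. This shows plain SMOTE only; the D-SMOTE and BP-SMOTE variants are not specified in the abstract and are not reproduced here. The data points, the function name, and the parameter defaults are made up for illustration.

```python
import random

def smote(minority, n_synthetic, k=2, seed=0):
    """Basic SMOTE: interpolate between a minority sample and one of its
    k nearest minority-class neighbours."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sq_dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()   # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Made-up two-feature minority-class samples.
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (3.0, 3.0)]
new_points = smote(minority, n_synthetic=6)
print(new_points)
```

Each synthetic point lies on the line segment between two existing minority samples, so oversampling fills in the minority region rather than duplicating records.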
|
9
|
Machine Learning Methods for Virus-Host Protein-Protein Interaction Prediction. Methods Mol Biol 2023; 2690:401-417. [PMID: 37450162 DOI: 10.1007/978-1-0716-3327-4_31] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/18/2023]
Abstract
The attachment of a virion to its cellular receptor on the host organism, which occurs through virus-host protein-protein interactions (PPIs), is a decisive step for viral pathogenicity and infectivity. Therefore, a vast number of wet-lab experimental techniques are used to study virus-host PPIs. Given the great number and enormous variety of virus-host PPIs, and the cost and labor of laboratory work, however, computational approaches to analyzing the available interaction data and predicting previously unidentified interactions have been on the rise. Among them, machine-learning-based models are receiving increasing attention, with a great body of resources and tools proposed recently. In this chapter, we first provide the methodology, with the major steps toward the development of a virus-host PPI prediction tool. Next, we discuss the challenges involved and evaluate several existing machine-learning-based virus-host PPI prediction tools. Finally, we describe our experience with several ensemble techniques applied to available prediction results retrieved from individual PPI prediction tools. Overall, based on our experience, we recognize that there is still room for the development of new individual and/or ensemble virus-host PPI prediction tools that leverage existing tools.
|
10
|
A Comparison of the Various Methods for Selecting Features for Single-Cell RNA Sequencing Data in Alzheimer's Disease. ADVANCES IN EXPERIMENTAL MEDICINE AND BIOLOGY 2023; 1424:241-246. [PMID: 37486500 DOI: 10.1007/978-3-031-31982-2_27] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/25/2023]
Abstract
The high-throughput sequencing method known as RNA-Seq records the whole transcriptome of individual cells. Single-cell RNA sequencing, also known as scRNA-Seq, is widely utilized in the field of biomedical research and has resulted in the generation of huge quantities and types of data. The noise and artifacts that are present in the raw data require extensive cleaning before the data can be used. In machine learning and pattern recognition applications, feature selection methods reduce computation time while improving predictions and offering a better understanding of the data. The process of discovering biomarkers is analogous to feature selection in machine learning and is especially helpful for applications in the medical field. A feature selection algorithm attempts to cut down the total number of features by eliminating those that are unnecessary or redundant while retaining those that are the most helpful. We apply feature selection (FS) algorithms designed for scRNA-Seq to Alzheimer's disease (AD), which is the most prevalent neurodegenerative disease in the western world and causes cognitive and behavioral impairment. AD is clinically and pathologically varied, and genetic studies imply a diversity of biological mechanisms and pathways. Over 20 new Alzheimer's disease susceptibility loci have been discovered through linkage, genome-wide association, and next-generation sequencing (Tosto G, Reitz C, Mol Cell Probes 30:397-403, 2016). In this study, we focus on the performance of three different approaches to marker gene selection and compare them using the support vector machine (SVM), the k-nearest neighbors algorithm (k-NN), and linear discriminant analysis (LDA), which are mainly supervised classification algorithms.
|
11
|
A divisive hierarchical clustering methodology for enhancing the ensemble prediction power in large scale population studies: the ATHLOS project. Health Inf Sci Syst 2022; 10:6. [PMID: 35529251 PMCID: PMC9013733 DOI: 10.1007/s13755-022-00171-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2021] [Accepted: 03/30/2022] [Indexed: 01/13/2023] Open
Abstract
The ATHLOS cohort is composed of several harmonized datasets of international groups related to health and aging. As a result, the Healthy Aging index has been constructed based on a selection of variables from 16 individual studies. In this paper, we consider additional variables found in ATHLOS and investigate their utilization for predicting the Healthy Aging index. For this purpose, motivated by the volume and diversity of the dataset, we focus our attention upon data clustering, where unsupervised learning is utilized to enhance prediction power. Thus we show the predictive utility of exploiting hidden data structures. In addition, we demonstrate that imposed computation bottlenecks can be surpassed when using appropriate hierarchical clustering, within a clustering for ensemble classification scheme, while retaining prediction benefits. We propose a complete methodology that is evaluated against baseline methods and the original concept. The results are very encouraging suggesting further developments in this direction along with applications in tasks with similar characteristics. A straightforward open source implementation for the R project is also provided (https://github.com/Petros-Barmpas/HCEP). Supplementary Information The online version contains supplementary material available at 10.1007/s13755-022-00171-1.
|
12
|
Predicting implementation of active learning by tenure-track teaching faculty using robust cluster analysis. INTERNATIONAL JOURNAL OF STEM EDUCATION 2022; 9:49. [PMID: 35915654 PMCID: PMC9334417 DOI: 10.1186/s40594-022-00365-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 07/12/2021] [Accepted: 07/08/2022] [Indexed: 06/15/2023]
Abstract
BACKGROUND The University of California system has a novel tenure-track education-focused faculty position called Lecturer with Security of Employment (working titles: Teaching Professor or Professor of Teaching). We focus on the potential difference in implementation of active-learning strategies by faculty type, including tenure-track education-focused faculty, tenure-track research-focused faculty, and non-tenure-track lecturers. In addition, we consider other instructor characteristics (faculty rank, years of teaching, and gender) and classroom characteristics (campus, discipline, and class size). We use a robust clustering algorithm to determine the number of clusters, identify instructors using active learning, and to understand the instructor and classroom characteristics in relation to the adoption of active-learning strategies. RESULTS We observed 125 science, technology, engineering, and mathematics (STEM) undergraduate courses at three University of California campuses using the Classroom Observation Protocol for Undergraduate STEM to examine active-learning strategies implemented in the classroom. Tenure-track education-focused faculty are more likely to teach with active-learning strategies compared to tenure-track research-focused faculty. Instructor and classroom characteristics that are also related to active learning include campus, discipline, and class size. The campus with initiatives and programs to support undergraduate STEM education is more likely to have instructors who adopt active-learning strategies. There is no difference in instructors in the Biological Sciences, Engineering, or Information and Computer Sciences disciplines who teach actively. However, instructors in the Physical Sciences are less likely to teach actively. Smaller class sizes also tend to have instructors who teach more actively. 
CONCLUSIONS The novel tenure-track education-focused faculty position within the University of California system represents a formal structure that results in higher adoption of active-learning strategies in undergraduate STEM education. Campus context and evolving expectations of the position (faculty rank) contribute to the symbols related to learning and teaching that correlate with differential implementation of active learning. SUPPLEMENTARY INFORMATION The online version contains supplementary material available at 10.1186/s40594-022-00365-9.
|
13
|
A machine learning-based approach to determine infection status in recipients of BBV152 (Covaxin) whole-virion inactivated SARS-CoV-2 vaccine for serological surveys. Comput Biol Med 2022; 146:105419. [PMID: 35483225 PMCID: PMC9040372 DOI: 10.1016/j.compbiomed.2022.105419] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/24/2021] [Revised: 02/19/2022] [Accepted: 02/19/2022] [Indexed: 12/16/2022]
Abstract
Data science has been an invaluable part of the COVID-19 pandemic response with multiple applications, ranging from tracking viral evolution to understanding vaccine effectiveness. Asymptomatic breakthrough infections have been a major problem in assessing vaccine effectiveness in populations globally. Serological discrimination of vaccine response from infection has so far been limited to Spike protein vaccines, since whole virion vaccines generate antibodies against all the viral proteins. Here, we show how a statistical and machine learning (ML) based approach can be used to discriminate between SARS-CoV-2 infection and immune response to an inactivated whole virion vaccine (BBV152, Covaxin). For this, we assessed serial data on antibodies against Spike and Nucleocapsid antigens, along with age, sex, number of doses taken, and days since last dose, for 1823 Covaxin recipients. An ensemble ML model, incorporating a consensus clustering approach alongside the support vector machine model, was built on 1063 samples where reliable qualifying data existed, and then applied to the entire dataset. Of 1448 self-reported negative subjects, our ensemble ML model classified 724 as infected. For method validation, we determined the relative ability of a random subset of samples to neutralize the Delta versus the wild-type strain using a surrogate neutralization assay. We worked on the premise that antibodies generated by a whole virion vaccine would neutralize the wild-type strain more efficiently than the Delta strain. In 100 of 156 samples where the ML prediction differed from self-reported uninfected status, neutralization against the Delta strain was more effective, indicating infection. We found that 71.8% of subjects were predicted to be infected during the surge, which is concordant with the percentage of sequences classified as Delta (75.6%-80.2%) over the same period. Our approach will help in real-world vaccine effectiveness assessments where whole virion vaccines are commonly used.
|
14
|
Ensemble blood glucose prediction in diabetes mellitus: A review. Comput Biol Med 2022; 147:105674. [PMID: 35716436 DOI: 10.1016/j.compbiomed.2022.105674] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2022] [Revised: 04/28/2022] [Accepted: 05/25/2022] [Indexed: 11/03/2022]
Abstract
Considering the complexity of blood glucose dynamics, a single model for predicting blood glucose level does not always capture inter- and intra-patient context changes. Ensembles are a set of machine learning techniques that combine multiple single learners to find a better variance/bias trade-off and hence improve prediction accuracy. The present paper aims to review the state of the art in predicting blood glucose using ensemble methods with regard to eight criteria: publication year and sources, datasets used to train/evaluate the models, types of ensembles used, single learners involved in constructing ensembles, combination schemes used to aggregate the base learners, metrics and validation methods adopted to assess the performance of ensembles, reported overall performance of the predictors, and accuracy comparison of ensemble techniques with single models. A systematic literature review has been conducted in order to analyze and synthesize primary studies published between 2000 and 2020 in six digital libraries. A total of 32 primary papers were selected and reviewed with regard to eight review questions. The results show that ensembles have gained wider interest in recent years and have in general improved performance compared with single models. However, multiple gaps have been identified concerning the ensemble construction process and the performance metrics used. Several recommendations have been made in this regard to design accurate ensembles for blood glucose level prediction.
|
15
|
Genome-Enabled Prediction Methods Based on Machine Learning. METHODS IN MOLECULAR BIOLOGY (CLIFTON, N.J.) 2022; 2467:189-218. [PMID: 35451777 DOI: 10.1007/978-1-0716-2205-6_7] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 10/18/2022]
Abstract
Growth of artificial intelligence and machine learning (ML) methodology has been explosive in recent years. In this class of procedures, computers acquire knowledge from sets of experiences and provide forecasts or classifications. In genome-wide based prediction (GWP), many ML studies have been carried out. This chapter provides a description of the main semiparametric and nonparametric algorithms used in GWP in animals and plants. Thirty-four ML comparative studies conducted in the last decade were used to develop a meta-analysis through a Thurstonian model, to evaluate algorithms with the best predictive qualities. It was found that some kernel, Bayesian, and ensemble methods displayed greater robustness and predictive ability. However, the type of study and data distribution must be considered in order to choose the most appropriate model for a given problem.
|
16
|
Deep convolutional forest: a dynamic deep ensemble approach for spam detection in text. COMPLEX INTELL SYST 2022; 8:4897-4909. [PMID: 35496326 PMCID: PMC9039275 DOI: 10.1007/s40747-022-00741-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/22/2021] [Accepted: 04/08/2022] [Indexed: 12/20/2022]
Abstract
The increase in people’s use of mobile messaging services has led to the spread of social engineering attacks like phishing, and spam text is one of the main factors in the dissemination of phishing attacks that steal sensitive data such as credit cards and passwords. In addition, rumors and incorrect medical information regarding the COVID-19 pandemic are widely shared on social media, leading to people’s fear and confusion. Thus, filtering spam content is vital to reduce risks and threats. Previous studies relied on machine learning and deep learning approaches for spam classification, but these approaches have two limitations: machine learning models require manual feature engineering, whereas deep neural networks incur a high computational cost. This paper introduces a dynamic deep ensemble model for spam detection that adjusts its complexity and extracts features automatically. The proposed model utilizes convolutional and pooling layers for feature extraction along with base classifiers such as random forests and extremely randomized trees for classifying texts as spam or legitimate. Moreover, the model employs ensemble learning procedures like boosting and bagging. As a result, the model achieved high precision, recall, f1-score and accuracy of 98.38%.
|
17
|
Assessing the utility of remote sensing data to accurately estimate changes in groundwater storage. THE SCIENCE OF THE TOTAL ENVIRONMENT 2022; 807:150635. [PMID: 34606871 DOI: 10.1016/j.scitotenv.2021.150635] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 07/21/2021] [Revised: 09/09/2021] [Accepted: 09/23/2021] [Indexed: 06/13/2023]
Abstract
Accurate and timely estimates of groundwater storage changes are critical to the sustainable management of aquifers worldwide, but are hindered by the lack of in-situ groundwater measurements in most regions. Hydrologic remote sensing measurements provide a potential pathway to quantify groundwater storage changes by closing the water balance, but the degree to which remote sensing data can accurately estimate groundwater storage changes is unclear. In this study, we quantified groundwater storage changes in California's Central Valley at two spatial scales for the period 2002 through 2020 using remote sensing data and an ensemble water balance method. To evaluate performance, we compared estimates of groundwater storage changes to three independent estimates: GRACE satellite data, groundwater wells, and a groundwater flow model. Results suggest evapotranspiration has the highest uncertainty among water balance components, while precipitation has the lowest. We found that remote sensing-based groundwater storage estimates correlated well with independent estimates; annual trends during droughts fall within 15% of trends calculated using wells and groundwater models within the Central Valley. Remote sensing-based estimates also reliably captured the long-term trend, seasonality, and rate of groundwater depletion during major drought events. Additionally, our study suggests that the proposed method can estimate changes in groundwater at sub-annual latencies, which is not currently possible using other methods. The findings have implications for improving the understanding of aquifer dynamics and can inform regional water managers about the status of groundwater systems during droughts.
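The idea of closing the water balance with an ensemble of input products can be sketched as follows. The abstract does not give the exact balance equation or the products used, so the residual formula, the choice of terms (runoff and a lumped "other storage" term), and all numbers below are assumptions for illustration only.

```python
from itertools import product
from statistics import mean, pstdev

def groundwater_change(precip, evapotranspiration, runoff, other_storage_change):
    """Water-balance residual: inflow minus outflows and other storage changes."""
    return precip - evapotranspiration - runoff - other_storage_change

# Made-up annual values in mm, standing in for alternative remote sensing products.
precip_products = [500.0, 520.0]
et_products = [400.0, 430.0, 460.0]
runoff = 60.0
other_storage_change = 10.0   # e.g. soil moisture and surface water combined

# One ensemble member per combination of input products.
members = [
    groundwater_change(p, et, runoff, other_storage_change)
    for p, et in product(precip_products, et_products)
]
ensemble_mean = mean(members)
ensemble_spread = pstdev(members)
print(members, ensemble_mean, ensemble_spread)
```

Because the groundwater term is a residual, disagreement between evapotranspiration products feeds straight into the spread of the estimate, which is consistent with the abstract's remark that evapotranspiration carries the highest uncertainty.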
|
18
|
Comparing model skills for deterministic versus ensemble dispersion modelling: The Fukushima Daiichi NPP accident as a case study. THE SCIENCE OF THE TOTAL ENVIRONMENT 2022; 806:150128. [PMID: 34583084 DOI: 10.1016/j.scitotenv.2021.150128] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2021] [Revised: 08/11/2021] [Accepted: 08/31/2021] [Indexed: 06/13/2023]
Abstract
Atmospheric dispersion models are crucial for nuclear risk assessment and emergency response systems since they rapidly predict air concentrations and deposition of released radionuclides, providing a basis for dose estimations and countermeasure strategies. Atmospheric dispersion models are associated with relatively large and often unknown uncertainties that are mostly attributed to meteorology, source terms and parametrisation of the dispersion model. By developing methods that can provide reliable uncertainty ranges for model outputs, decision makers have an improved basis for handling nuclear emergency situations. In the present work, model skill of the Severe Nuclear Accident Programme (SNAP) model was quantified by employing an ensemble method in which 51 meteorological realisations from a numerical weather prediction model were combined with 9 source term descriptions for the accidental ¹³⁷Cs releases from Fukushima Daiichi Nuclear Power Plant during 14th-17th March 2011. The meteorological forecast was compared to observations of wind speed from 30 meteorological stations. The 459 dispersion realisations were compared with hourly observations of activity concentrations from 100 air filter stations. Exclusive use of deterministic meteorology resulted in most members of the dispersion ensemble showing concentration values that were too low; however, this was mitigated by applying ensemble meteorology. Ensemble predictions, including both the meteorological and source term ensemble, show an overall higher prediction skill compared to individual meteorology and source term runs, with true predictive rate accuracy increasing from 30%-50% to 70%-90% and positive predictive rate accuracy decreasing from 75%-80% to 65%-75%. Skill scores and other ensemble indicators also showed improvements in using ensembles of source terms and meteorology. From the present study on the Fukushima accident there are strong indications that ensemble predictions improve the basis for decision making in the early phase after a nuclear accident, which emphasises the importance of including ensemble prediction in nuclear preparedness tools of the future.
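The figure of 459 dispersion realisations follows from pairing each of the 51 meteorological realisations with each of the 9 source term descriptions. A minimal Python sketch of that bookkeeping is given here; the member labels are hypothetical placeholders, not names used in the study.

```python
from itertools import product

# Placeholder labels (hypothetical); only the counts 51 and 9 come from the abstract.
met_members = [f"met_{i:02d}" for i in range(51)]
source_terms = [f"src_{j}" for j in range(9)]

# Each dispersion run pairs one meteorological member with one source term.
dispersion_runs = list(product(met_members, source_terms))

print(len(dispersion_runs))  # 51 * 9 = 459
```

Enumerating the pairs explicitly, rather than only multiplying the counts, makes it straightforward to tag each run with the member and source term that produced it when scores are later grouped by either factor.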
|
19
|
A U-Net Ensemble for breast lesion segmentation in DCE MRI. Comput Biol Med 2022; 140:105093. [PMID: 34883343 DOI: 10.1016/j.compbiomed.2021.105093] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 07/03/2021] [Revised: 11/26/2021] [Accepted: 11/26/2021] [Indexed: 11/16/2022]
Abstract
Dynamic Contrast Enhanced Magnetic Resonance Imaging (DCE-MRI) has been recognized as an effective tool for Breast Cancer (BC) diagnosis. Automatic BC analysis from DCE-MRI depends on features extracted particularly from lesions; hence, lesions need to be accurately segmented as a prior step. Due to the time and experience required to manually segment lesions in 4D DCE-MRI, automating this task is expected to reduce the workload, reduce observer variability and improve diagnostic accuracy. In this paper we propose an automated method for breast lesion segmentation from DCE-MRI based on a U-Net framework. The contributions of this work are the proposal of a modified U-Net architecture and the analysis of the input DCE information. In that sense, we propose the use of an ensemble method combining three U-Net models, each using a different input combination, outperforming all individual methods and other existing approaches. For evaluation, we use a subset of 46 cases from the TCGA-BRCA dataset, a challenging and publicly available dataset not reported to date for this task. Because the annotations provided are incomplete, we complement them with the help of a radiologist in order to include secondary lesions that were not originally segmented. The proposed ensemble method obtains a mean Dice Similarity Coefficient (DSC) of 0.680 (0.802 for main lesions), which outperforms state-of-the-art methods using the same dataset, demonstrating the effectiveness of our method considering the complexity of the dataset.
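For readers unfamiliar with the metric, the Dice Similarity Coefficient is twice the overlap of two masks divided by the sum of their sizes. The Python sketch below computes it on toy binary masks and combines three model outputs by voxel-wise majority vote. The abstract does not say how the three U-Nets are combined, so the majority vote, the function names and the toy data are all assumptions made for illustration.

```python
def dice(pred, truth):
    """Dice Similarity Coefficient for two flat binary masks (lists of 0/1)."""
    intersection = sum(p and t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def majority_vote(masks):
    """Combine binary masks voxel-wise; keep a voxel if most models mark it."""
    threshold = len(masks) / 2
    return [1 if sum(votes) > threshold else 0 for votes in zip(*masks)]


# Toy 1-D masks standing in for a reference and three model outputs (hypothetical).
truth = [0, 1, 1, 1, 0, 0]
model_a = [0, 1, 1, 0, 0, 0]
model_b = [0, 1, 1, 1, 1, 0]
model_c = [1, 0, 1, 1, 0, 0]

ensemble = majority_vote([model_a, model_b, model_c])
print("individual:", [round(dice(m, truth), 3) for m in (model_a, model_b, model_c)])
print("ensemble:  ", dice(ensemble, truth))
```

In this toy case each model makes a different single error, so the vote cancels them out; real segmentations would rarely combine this cleanly.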
|
20
|
A novel methodology for Groundwater Flooding Susceptibility assessment through Machine Learning techniques in a mixed-land use aquifer. THE SCIENCE OF THE TOTAL ENVIRONMENT 2021; 790:148067. [PMID: 34111794 DOI: 10.1016/j.scitotenv.2021.148067] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/19/2021] [Revised: 05/21/2021] [Accepted: 05/23/2021] [Indexed: 06/12/2023]
Abstract
Many areas around the world are affected by Groundwater Level rising (GWLr). One of the most severe consequences of this phenomenon is Groundwater Flooding (GF), with serious impacts for the human and natural environment. In Europe, GF has recently received specific attention with Directive 2007/60/EC, which requires Member States to map GF hazard and propose measures for risk mitigation. In this paper a methodology has been developed for Groundwater Flooding Susceptibility (GFS) assessment, using for the first time Spatial Distribution Models. These Machine Learning techniques connect occurrence data to predisposing factors (PFs) to estimate their distributions. The implemented methodology employs aquifer type, depth of piezometric level, thickness and hydraulic conductivity of unsaturated zone, drainage density and land-use as PFs, and a GF observations inventory as occurrences. The algorithms adopted to perform the analysis are Generalized Boosting Model, Artificial Neural Network and Maximum Entropy. Ensemble Models are built to reduce the uncertainty associated with each algorithm and increase reliability. GFS is mapped by choosing the ensemble model with the best predictive performance and dividing occurrence probability values into five classes, from very low to very high susceptibility, using Natural Breaks classification. The methodology has been tested and statistically validated in an area of 14.3 km² located in the Metropolitan City of Naples (Italy), affected by GWLr since 1990 and GF in buildings and agricultural soils since 2007. The results of modeling show that about 93% of the inventoried points fall in the high and very high GFS classes, and piezometric level depth, thickness of unsaturated zone and drainage density are the most influential PFs, in accordance with field observations and the triggering mechanism of GF. The outcomes provide a first step in the assessment of GF hazard and a decision support tool to local authorities for GF risk management.
|
21
|
A laminar augmented cascading flexible neural forest model for classification of cancer subtypes based on gene expression data. BMC Bioinformatics 2021; 22:475. [PMID: 34600466 PMCID: PMC8487515 DOI: 10.1186/s12859-021-04391-2] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 01/29/2021] [Accepted: 09/22/2021] [Indexed: 01/22/2023] Open
Abstract
BACKGROUND Correctly classifying the subtypes of cancer is of great significance for the in-depth study of cancer pathogenesis and the realization of personalized treatment for cancer patients. In recent years, classification of cancer subtypes using deep neural networks and gene expression data has gradually become a research hotspot. However, most classifiers may face overfitting and low classification accuracy when dealing with small sample size and high-dimensional biology data. RESULTS In this paper, a laminar augmented cascading flexible neural forest (LACFNForest) model was proposed to complete the classification of cancer subtypes. This model is a cascading flexible neural forest using deep flexible neural forest (DFNForest) as the base classifier. A hierarchical broadening ensemble method was proposed, which ensures the robustness of classification results and avoids the waste of model structure and function as much as possible. We also introduced an output judgment mechanism to each layer of the forest to reduce the computational complexity of the model. The deep neural forest was extended to the densely connected deep neural forest to improve the prediction results. The experiments on RNA-seq gene expression data showed that LACFNForest has better performance in the classification of cancer subtypes compared to the conventional methods. CONCLUSION The LACFNForest model effectively improves the accuracy of cancer subtype classification with good robustness. It provides a new approach for the ensemble learning of classifiers in terms of structural design.
|
22
|
Extremely randomized neural networks for constructing prediction intervals. Neural Netw 2021; 144:113-128. [PMID: 34487958 DOI: 10.1016/j.neunet.2021.08.020] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 03/04/2021] [Revised: 07/20/2021] [Accepted: 08/12/2021] [Indexed: 11/29/2022]
Abstract
The aim of this paper is to propose a novel prediction model based on an ensemble of deep neural networks adapting the extremely randomized trees method originally developed for random forests. The extra-randomness introduced in the ensemble reduces the variance of the predictions and improves out-of-sample accuracy. As a byproduct, we are able to compute the uncertainty about our model predictions and construct interval forecasts. Some of the limitations associated with bootstrap-based algorithms can be overcome by not performing data resampling and thus, by ensuring the suitability of the methodology in low and mid-dimensional settings, or when the i.i.d. assumption does not hold. An extensive Monte Carlo simulation exercise shows the good performance of this novel prediction method in terms of mean square prediction error and the accuracy of the prediction intervals in terms of out-of-sample prediction interval coverage probabilities. The advanced approach delivers better out-of-sample accuracy in experimental settings, improving upon state-of-the-art methods like MC dropout and bootstrap procedures.
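One generic way to turn an ensemble's spread into an interval forecast is to take the mean of the member predictions as the point forecast and empirical quantiles of the members as the interval bounds. The Python sketch below illustrates that idea only; the abstract does not specify how the intervals are built, so this construction, the function names and the toy member values are assumptions and should not be read as the paper's procedure.

```python
from statistics import mean


def quantile(sorted_values, q):
    """Linearly interpolated empirical quantile of a sorted list, 0 <= q <= 1."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


def ensemble_interval(member_predictions, coverage=0.90):
    """Return (point forecast, lower bound, upper bound) for one input."""
    ordered = sorted(member_predictions)
    tail = (1 - coverage) / 2
    return mean(ordered), quantile(ordered, tail), quantile(ordered, 1 - tail)


# Hypothetical predictions from 101 ensemble members for a single input.
members = [float(v) for v in range(101)]
point, low, high = ensemble_interval(members)
print(point, low, high)
```

Whether such an interval reaches its nominal coverage out of sample has to be checked empirically, which is what the coverage probabilities mentioned in the abstract measure.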
|
23
|
Automatic extension of corpora from the intelligent ensembling of eHealth knowledge discovery systems outputs. J Biomed Inform 2021; 116:103716. [PMID: 33647519 DOI: 10.1016/j.jbi.2021.103716] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/25/2020] [Revised: 01/28/2021] [Accepted: 02/14/2021] [Indexed: 11/20/2022]
Abstract
Corpora are one of the most valuable resources at present for building machine learning systems. However, building new corpora is an expensive task, which makes the automatic extension of corpora a highly attractive task to develop. Hence, finding new strategies that reduce the cost and effort involved in this task, while at the same time guaranteeing quality, remains an open and important challenge for the research community. In this paper, we present a set of ensembling strategies oriented toward entity and relation extraction tasks. The main goal is to combine several automatically annotated versions of corpora to produce a single version with improved quality. An ensembler is built by exploring a configuration space in search of the combination that maximizes the fitness of the ensembled collection according to a reference collection. The eHealth-KD 2019 challenge was chosen for the case study. The submitted systems' outputs were ensembled, resulting in the construction of an automatically annotated collection of 8000 sentences. We show that using this collection as additional training input for a baseline algorithm has a positive impact on its performance. Additionally, the ensembling pipeline was used as a participant system in the 2020 edition of the challenge. The ensembled run achieved a slightly better performance than the individual runs.
|
24
|
An ensemble approach for multi-stage transfer learning models for COVID-19 detection from chest CT scans. INTELLIGENCE-BASED MEDICINE 2021; 5:100027. [PMID: 33623929 PMCID: PMC7891130 DOI: 10.1016/j.ibmed.2021.100027] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 06/23/2020] [Revised: 11/25/2020] [Accepted: 02/12/2021] [Indexed: 12/23/2022]
Abstract
The novel coronavirus outbreak of 2019 reached pandemic status in March 2020. Since then, many countries have joined efforts to fight the COVID-19 pandemic. A central task for governments is the rapid and effective identification of COVID-19 positive patients. While many molecular tests currently exist, not all hospitals have immediate access to these. However, CT scans, which are readily available at most hospitals, offer an additional method to diagnose COVID-19. As a result, hospitals lacking molecular tests can use CT scans to mitigate that shortage. Furthermore, radiologists have come to achieve accuracy levels over 80% in identifying COVID-19 cases by CT scan image analysis. This paper adds to the existing literature a model based on ensemble methods and 2-stage transfer learning to detect COVID-19 cases from CT scan images, relying on a simple architecture with a sufficiently complex model definition to attain competitive performance. The proposed model achieved an accuracy of 86.70%, an F1 score of 85.86% and an AUC of 90.82%, proving capable of assisting radiologists with COVID-19 diagnosis. Code developed for this research can be found in the following repository: https://github.com/josehernandezsc/COVID19Net.
|
25
|
Early diagnosis of thyroid cancer diseases using computational intelligence techniques: A case study of a Saudi Arabian dataset. Comput Biol Med 2021; 131:104267. [PMID: 33647831 DOI: 10.1016/j.compbiomed.2021.104267] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 12/05/2020] [Revised: 02/08/2021] [Accepted: 02/09/2021] [Indexed: 10/22/2022]
Abstract
In recent times, researchers have noticed that chronic diseases have become more common. In the Kingdom of Saudi Arabia, the number of patients with thyroid cancer (TC) has become a concern, necessitating a proactive system that can help cut down the incidence of this disease, where the system can assist in early interventions to prevent or cure the disease. In this paper, we introduce our work developing machine learning-based tools that can serve as early warning systems by detecting TC at very early stages (pre-symptomatic stage). In addition, we aimed at obtaining the greatest possible accuracy while using fewer features. It must be noted that while there have been past efforts to use machine learning in predicting TC, this is the first attempt using a Saudi Arabian dataset as well as targeting diagnosis in the pre-symptomatic stage (pre-emptive diagnosis). The techniques used in this work include random forest (RF), artificial neural network (ANN), support vector machine (SVM), and naïve Bayes (NB), each of which was selected for their unique capabilities. The highest accuracy rate obtained was 90.91% with the RF technique, while SVM, ANN, and NB achieved 84.09%, 88.64%, and 81.82% accuracy, respectively. These levels were obtained by using only seven features out of an available 15. Considering the pattern of the obtained results, it is clear that the RF technique is better and, hence, recommended for this specific problem.
|
26
|
Ensemble of heterogeneous classifiers for diagnosis and prediction of coronary artery disease with reduced feature subset. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2021; 198:105770. [PMID: 33027698 DOI: 10.1016/j.cmpb.2020.105770] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Received: 06/09/2020] [Accepted: 09/19/2020] [Indexed: 06/11/2023]
Abstract
BACKGROUND AND OBJECTIVE Coronary artery disease (CAD) is considered one of the most prominent health issues causing high mortality in the world population. Hence, earlier diagnosis and prediction of CAD is essential for the proper medication of patients. The objective of this study is to develop a machine learning algorithm that will help in accurate diagnosis of CAD. METHODS In this paper, we have proposed a novel heterogeneous ensemble method combining three base classifiers, viz. K-Nearest Neighbour, Random Forest, and Support Vector Machine, for effective diagnosis of CAD. The results of the base classifiers are combined using ensemble voting techniques based on average voting (AVEn), majority voting (MVEn), and weighted-average voting (WAVEn) for prediction of CAD. The random forest-based Boruta wrapper feature selection algorithm and the feature importance of SVM are used for relevant feature selection based on attribute importance and rank. RESULTS The proposed ensemble algorithm is developed using 5 features selected based on feature importance, and the performance of the algorithm is evaluated using the Z-Alizadeh Sani dataset. Further, the dataset is balanced using the Synthetic Minority Over-sampling Technique and its performance is evaluated. The result analysis shows that the WAVEn algorithm achieves better classification accuracy, sensitivity, specificity and precision of 98.97%, 100%, 96.3% and 98.3% respectively for the original dataset. The WAVEn algorithm applied on the balanced dataset achieves 100% accuracy, sensitivity, specificity and precision in diagnosing CAD. To the best of the authors' knowledge, the accuracy achieved by WAVEn is the highest among the state-of-the-art algorithms in the literature for both the original and balanced datasets. CONCLUSIONS The statistical results prove the robustness of the WAVEn algorithm in reliably discriminating the CAD patients from healthy ones with high precision, and therefore it can be used for developing a decision support system for diagnosing CAD at an early stage.
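The three voting rules named in the abstract can give different answers for the same patient. The Python sketch below shows generic versions of average, majority and weighted-average voting over per-classifier probabilities. The probabilities, the weights and the 0.5 threshold are hypothetical, and the abstract does not state how the WAVEn weights are derived.

```python
def average_vote(probabilities):
    """Mean of the base classifiers' predicted probabilities."""
    return sum(probabilities) / len(probabilities)


def majority_vote(probabilities, threshold=0.5):
    """Class label chosen by most base classifiers (1 = positive)."""
    votes = sum(p >= threshold for p in probabilities)
    return 1 if votes > len(probabilities) / 2 else 0


def weighted_average_vote(probabilities, weights):
    """Weighted mean of predicted probabilities; weights need not sum to 1."""
    return sum(w * p for w, p in zip(weights, probabilities)) / sum(weights)


# Hypothetical P(CAD) from KNN, RF and SVM for one patient.
probs = [0.40, 0.45, 0.90]
# Hypothetical weights, e.g. reflecting each classifier's validation accuracy.
weights = [1.0, 1.0, 3.0]

avg = average_vote(probs)
maj = majority_vote(probs)
wavg = weighted_average_vote(probs, weights)
print(avg, maj, wavg)
```

Here two of three classifiers vote negative, so majority voting returns 0, while the averaged and weighted probabilities both exceed 0.5 because the one confident classifier dominates.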
|
27
|
CatBoost for big data: an interdisciplinary review. JOURNAL OF BIG DATA 2020; 7:94. [PMID: 33169094 PMCID: PMC7610170 DOI: 10.1186/s40537-020-00369-8] [Citation(s) in RCA: 133] [Impact Index Per Article: 33.3] [Received: 08/05/2020] [Accepted: 10/19/2020] [Indexed: 05/25/2023]
Abstract
Gradient Boosted Decision Trees (GBDT's) are a powerful tool for classification and regression tasks in Big Data. Researchers should be familiar with the strengths and weaknesses of current implementations of GBDT's in order to use them effectively and make successful contributions. CatBoost is a member of the family of GBDT machine learning ensemble techniques. Since its debut in late 2018, researchers have successfully used CatBoost for machine learning studies involving Big Data. We take this opportunity to review recent research on CatBoost as it relates to Big Data, and learn best practices from studies that cast CatBoost in a positive light, as well as studies where CatBoost does not outshine other techniques, since we can learn lessons from both types of scenarios. Furthermore, as a Decision Tree based algorithm, CatBoost is well-suited to machine learning tasks involving categorical, heterogeneous data. Recent work across multiple disciplines illustrates CatBoost's effectiveness and shortcomings in classification and regression tasks. Another important issue we expose in literature on CatBoost is its sensitivity to hyper-parameters and the importance of hyper-parameter tuning. One contribution we make is to take an interdisciplinary approach to cover studies related to CatBoost in a single work. This provides researchers an in-depth understanding to help clarify proper application of CatBoost in solving problems. To the best of our knowledge, this is the first survey that studies all works related to CatBoost in a single publication.
|
28
|
AI-driven quantification, staging and outcome prediction of COVID-19 pneumonia. Med Image Anal 2020; 67:101860. [PMID: 33171345 PMCID: PMC7558247 DOI: 10.1016/j.media.2020.101860] [Citation(s) in RCA: 77] [Impact Index Per Article: 19.3] [Received: 06/08/2020] [Revised: 08/24/2020] [Accepted: 09/29/2020] [Indexed: 12/11/2022]
Abstract
Coronavirus disease 2019 (COVID-19) emerged in 2019 and disseminated around the world rapidly. Computed tomography (CT) imaging has been proven to be an important tool for screening, disease quantification and staging. The latter is of extreme importance for organizational anticipation (availability of intensive care unit beds, patient management planning) as well as to accelerate drug development through rapid, reproducible and quantified assessment of treatment response. Although there are currently no specific guidelines for staging patients, CT is used together with some clinical and biological biomarkers. In this study, we collected a multi-center cohort and investigated the use of medical imaging and artificial intelligence for disease quantification, staging and outcome prediction. Our approach relies on automatic deep learning-based disease quantification using an ensemble of architectures, and a data-driven consensus for the staging and outcome prediction of the patients, fusing imaging biomarkers with clinical and biological attributes. Highly promising results on multiple external/independent evaluation cohorts as well as comparisons with expert human readers demonstrate the potential of our approach.
|
29
|
A mapping study of ensemble classification methods in lung cancer decision support systems. Med Biol Eng Comput 2020; 58:2177-2193. [PMID: 32621068 DOI: 10.1007/s11517-020-02223-8] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 01/05/2020] [Accepted: 06/25/2020] [Indexed: 10/23/2022]
Abstract
Achieving a high level of classification accuracy in medical datasets is a key need for researchers seeking to provide effective decision systems that assist doctors in their work. In many domains of artificial intelligence, ensemble classification methods are able to improve the performance of single classifiers. This paper reports the state of the art of ensemble classification methods in lung cancer detection. We have performed a systematic mapping study to identify the most interesting papers concerning this topic. A total of 65 papers published between 2000 and 2018 were selected after an automatic search in four digital libraries and a careful selection process. As a result, it was observed that diagnosis was the task most commonly studied; homogeneous ensembles and decision trees were the most frequently adopted for constructing ensembles; and the majority voting rule was the predominant combination rule. Few studies considered the parameter tuning of the techniques used. These findings open several perspectives for researchers to enhance lung cancer research by addressing the identified gaps, such as investigating different classification methods, proposing other heterogeneous ensemble methods, and using new combination rules. Graphical abstract: Main features of the mapping study of ensemble classification methods applied to lung cancer decision support systems.
|
30
|
Comparing performance of ensemble methods in predicting movie box office revenue. Heliyon 2020; 6:e04260. [PMID: 32613125 PMCID: PMC7322254 DOI: 10.1016/j.heliyon.2020.e04260] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 02/06/2020] [Revised: 04/07/2020] [Accepted: 06/16/2020] [Indexed: 11/26/2022] Open
Abstract
While many business intelligence methods have been applied to predict movie box office revenue, studies using an ensemble approach to predict box office revenue are almost nonexistent. In this study, we apply ensemble methods to decision trees, k-nearest neighbors (k-NN), and linear regression, and compare the prediction performance of decision trees based on random forests, bagging and boosting with that of k-NN and linear regression based on bagging and boosting, using a sample of 1439 movies. The results indicate that ensemble methods based on decision trees (random forests, bagging, boosting) outperform ensemble methods based on k-NN (bagging, boosting) in predicting box office at weeks 1, 2 and 3 after release. Decision trees using ensemble methods provide better prediction performance than ensemble methods based on linear regression analysis for the box office at week 1 after release. This is explained by a comparison of prediction performance between ensemble and non-ensemble methods: for decision trees, unlike the other methods, the prediction performance of ensemble methods is greater than that of non-ensemble methods. This shows that decision trees benefit more from ensemble methods than k-NN and linear regression analysis do.
|
31
|
A novel approach to modeling multifactorial diseases using Ensemble Bayesian Rule classifiers. J Biomed Inform 2020; 107:103455. [PMID: 32497685 DOI: 10.1016/j.jbi.2020.103455] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/10/2019] [Revised: 03/26/2020] [Accepted: 05/10/2020] [Indexed: 10/24/2022]
Abstract
Modeling factors influencing disease phenotypes, from biomarker profiling study datasets, is a critical task in biomedicine. Such datasets are typically generated from high-throughput 'omic' technologies, which help examine disease mechanisms at an unprecedented resolution. These datasets are challenging because they are high-dimensional. The disease mechanisms they study are also complex because many diseases are multifactorial, resulting from the collective activity of several factors, each with a small effect. Bayesian rule learning (BRL) is a rule model inferred from learning Bayesian networks from data, and has been shown to be effective in modeling high-dimensional datasets. However, BRL is not efficient at modeling multifactorial diseases since it suffers from data fragmentation during learning. In this paper, we overcome this limitation by implementing and evaluating three types of ensemble model combination strategies with BRL: uniform combination (UC; same as Bagging), Bayesian model averaging (BMA), and Bayesian model combination (BMC), collectively called Ensemble Bayesian Rule Learning (EBRL). We also introduce a novel method to visualize EBRL models, called the Bayesian Rule Ensemble Visualizing tool (BREVity), which helps extract and interpret the most important rule patterns guiding the predictions made by the ensemble model. Our results using twenty-five public, high-dimensional, gene expression datasets of multifactorial diseases suggest that both EBRL models using UC and BMC achieve better predictive performance than BMA and other classic machine learning methods. Furthermore, BMC is found to be more reliable than UC when the ensemble includes sub-optimal models resulting from the stochasticity of the model search process. Together, EBRL and BREVity provide researchers a promising and novel tool for modeling multifactorial diseases from high-dimensional datasets that leverages strengths of ensemble methods for predictive performance, while also providing interpretable explanations for its predictions.
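The difference between uniform combination and Bayesian model averaging comes down to how member models are weighted. The Python sketch below uses the textbook definitions: UC gives every model the same weight, while BMA weights each model by its posterior probability, taken here as proportional to the marginal likelihood under a uniform model prior. The scores and predictions are hypothetical, BMC is not shown, and the paper's actual scoring of rule models may differ.

```python
import math


def uniform_weights(n_models):
    """Uniform combination: every model gets the same weight."""
    return [1.0 / n_models] * n_models


def bma_weights(log_marginal_likelihoods):
    """Posterior model probabilities, assuming a uniform prior over models."""
    peak = max(log_marginal_likelihoods)  # subtract the max for numerical stability
    unnormalised = [math.exp(v - peak) for v in log_marginal_likelihoods]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


def combine(weights, class_probabilities):
    """Weighted average of the models' predicted class probabilities."""
    return sum(w * p for w, p in zip(weights, class_probabilities))


# Hypothetical log marginal likelihoods and P(disease) from three rule models.
log_ml = [-100.0, -103.0, -110.0]
preds = [0.9, 0.4, 0.2]

uc = combine(uniform_weights(len(preds)), preds)
bma = combine(bma_weights(log_ml), preds)
print(uc, bma, bma_weights(log_ml))
```

With these numbers BMA places over 95% of the weight on the single best-scoring model, so its prediction is close to that model's alone. This concentration of weight is a well-known property of BMA and one plausible reason it gains less from the ensemble than UC or BMC.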
|
32
|
A flexible analytic wavelet transform based approach for motor-imagery tasks classification in BCI applications. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2020; 187:105325. [PMID: 31964514 DOI: 10.1016/j.cmpb.2020.105325] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 04/09/2019] [Revised: 12/16/2019] [Accepted: 01/08/2020] [Indexed: 05/04/2023]
Abstract
BACKGROUND AND OBJECTIVE Motor Imagery (MI) based Brain-Computer Interface (BCI) is an emerging support system that can help disabled people communicate with the real world without any external help. It serves as an alternative communication channel between the user and computer. Electroencephalogram (EEG) recordings prove to be an appropriate choice for imaging MI tasks in a BCI system as they provide a non-invasive way of completing the task. The reliability of a BCI system depends on how efficiently different MI tasks are assessed. METHODS The present work proposes a new approach for the classification of distinct MI tasks based on EEG signals using the flexible analytic wavelet transform (FAWT) technique. The FAWT decomposes the EEG signal into sub-bands, and temporal moment-based features are extracted from the sub-bands. Feature normalization is applied to minimize the bias of the classifier. The FAWT-based features are utilized as inputs to multiple classifiers. The ensemble learning-based subspace k-Nearest Neighbour (kNN) classifier is established as the best and most robust classifier for distinguishing right hand (RH) and right foot (RF) MI tasks. RESULTS The sub-band (SB) wise features are tested on multiple classifiers, and the best performance parameters are obtained using the ensemble method-based subspace kNN classifier. The best results are obtained for the fourth SB: accuracy 99.33%, sensitivity 99%, specificity 99.6%, F1-score 0.9925, and kappa value 0.9865. The other sub-bands also attained significant results using the subspace kNN classifier. CONCLUSIONS The proposed work explores the utility of FAWT-based features for the classification of EEG signals from RH and RF MI tasks. The suggested work highlights the effectiveness of multiple classifiers for classifying MI tasks. The proposed method shows better performance than state-of-the-art methods. It thus has the potential to be used in implementing a BCI system for controlling wheelchairs, robotic arms, etc.
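The performance figures reported above (accuracy, sensitivity, specificity, F1-score and kappa) can all be derived from a binary confusion matrix. The Python sketch below uses the standard definitions; the counts are hypothetical and are not taken from the study.

```python
def binary_metrics(tp, fn, fp, tn):
    """Standard metrics from a 2x2 confusion matrix (positive = first class)."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    # Cohen's kappa: agreement beyond what the class marginals give by chance.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total**2
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "kappa": kappa,
    }


# Hypothetical counts for a two-class problem with 50 trials per class.
metrics = binary_metrics(tp=45, fn=5, fp=10, tn=40)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Kappa is lower than accuracy here (0.70 against 0.85) because half of the agreement would be expected by chance with balanced classes, which is why it is often reported alongside accuracy for two-class BCI tasks.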
|
33
|
Comparative analysis of surface water quality prediction performance and identification of key water parameters using different machine learning models based on big data. WATER RESEARCH 2020; 171:115454. [PMID: 31918388 DOI: 10.1016/j.watres.2019.115454] [Citation(s) in RCA: 97] [Impact Index Per Article: 24.3] [Received: 08/09/2019] [Revised: 12/24/2019] [Accepted: 12/30/2019] [Indexed: 06/10/2023]
Abstract
The water quality prediction performance of machine learning models may depend not only on the models but also on the parameters in the data set chosen for training them. Moreover, the key water parameters should also be identified by the learning models, in order to further reduce prediction costs and improve prediction efficiency. Here we endeavored for the first time to compare the water quality prediction performance of 10 learning models (7 traditional and 3 ensemble models) using big data (33,612 observations) from the major rivers and lakes in China from 2012 to 2018, based on precision, recall, F1-score, and weighted F1-score, and to explore the potential key water parameters for future model prediction. Our results showed that larger data sets could improve the performance of learning models in predicting water quality. Compared to the other 7 models, decision tree (DT), random forest (RF) and deep cascade forest (DCF) trained on data sets of pH, DO, CODMn, and NH3-N had significantly better performance in predicting all 6 levels of water quality recommended by the Chinese government. Moreover, two key water parameter sets (DO, CODMn, and NH3-N; CODMn and NH3-N) were identified and validated by DT, RF and DCF as having high specificity for predicting water quality. Therefore, DT, RF and DCF with selected key water parameters could be prioritized for future water quality monitoring and for providing timely water quality warnings.
|
34
|
Individualized treatment effects with censored data via fully nonparametric Bayesian accelerated failure time models. Biostatistics 2020; 21:50-68. [PMID: 30052809 PMCID: PMC8972560 DOI: 10.1093/biostatistics/kxy028] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Received: 09/22/2017] [Revised: 05/24/2018] [Accepted: 06/14/2018] [Indexed: 09/04/2023] Open
Abstract
Individuals often respond differently to identical treatments, and characterizing such variability in treatment response is an important aim in the practice of personalized medicine. In this article, we describe a nonparametric accelerated failure time model that can be used to analyze heterogeneous treatment effects (HTE) when patient outcomes are time-to-event. By utilizing Bayesian additive regression trees and a mean-constrained Dirichlet process mixture model, our approach offers a flexible model for the regression function while placing few restrictions on the baseline hazard. Our nonparametric method leads to natural estimates of individual treatment effect and has the flexibility to address many major goals of HTE assessment. Moreover, our method requires little user input in terms of model specification for treatment covariate interactions or for tuning parameter selection. Our procedure shows strong predictive performance while also exhibiting good frequentist properties in terms of parameter coverage and mitigation of spurious findings of HTE. We illustrate the merits of our proposed approach with a detailed analysis of two large clinical trials (N = 6769) for the prevention and treatment of congestive heart failure using an angiotensin-converting enzyme inhibitor. The analysis revealed considerable evidence for the presence of HTE in both trials as demonstrated by substantial estimated variation in treatment effect and by high proportions of patients exhibiting strong evidence of having treatment effects which differ from the overall treatment effect.
|
35
|
Reviewing ensemble classification methods in breast cancer. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2019; 177:89-112. [PMID: 31319964 DOI: 10.1016/j.cmpb.2019.05.019] [Citation(s) in RCA: 41] [Impact Index Per Article: 8.2] [Received: 02/07/2019] [Revised: 05/16/2019] [Accepted: 05/18/2019] [Indexed: 05/09/2023]
Abstract
CONTEXT Ensemble methods consist of combining more than one single technique to solve the same task. This approach was designed to overcome the weaknesses of single techniques and consolidate their strengths. Ensemble methods are now widely used to carry out prediction tasks (e.g. classification and regression) in several fields, including that of bioinformatics. Researchers have particularly begun to employ ensemble techniques to improve research into breast cancer, as this is the most frequent type of cancer and accounts for most of the deaths among women. OBJECTIVE AND METHOD The goal of this study is to analyse the state of the art in ensemble classification methods when applied to breast cancer as regards 9 aspects: publication venues, medical tasks tackled, empirical and research types adopted, types of ensembles proposed, single techniques used to construct the ensembles, validation framework adopted to evaluate the proposed ensembles, tools used to build the ensembles, and optimization methods used for the single techniques. This paper was undertaken as a systematic mapping study. RESULTS A total of 193 papers that were published from the year 2000 onwards, were selected from four online databases: IEEE Xplore, ACM digital library, Scopus and PubMed. This study found that of the six medical tasks that exist, the diagnosis medical task was that most frequently researched, and that the experiment-based empirical type and evaluation-based research type were the most dominant approaches adopted in the selected studies. The homogeneous type was that most widely used to perform the classification task. With regard to single techniques, this mapping study found that decision trees, support vector machines and artificial neural networks were those most frequently adopted to build ensemble classifiers. 
In the case of the evaluation framework, the Wisconsin Breast Cancer dataset was the most frequently used by researchers to perform their experiments, while the most noticeable validation method was k-fold cross-validation. Several tools are available to perform experiments related to ensemble classification methods, such as Weka and R Software. Few researchers took into account the optimisation of the single technique of which their proposed ensemble was composed, while the grid search method was that most frequently adopted to tune the parameter settings of a single classifier. CONCLUSION This paper reports an in-depth study of the application of ensemble methods as regards breast cancer. Our results show that there are several gaps and issues and we, therefore, provide researchers in the field of breast cancer research with recommendations. Moreover, after analysing the papers found in this systematic mapping study, we discovered that the majority report positive results concerning the accuracy of ensemble classifiers when compared to the single classifiers. In order to aggregate the evidence reported in literature, it will, therefore, be necessary to perform a systematic literature review and meta-analysis in which an in-depth analysis could be conducted so as to confirm the superiority of ensemble classifiers over the classical techniques.
|
36
|
Ensemble genomic analysis in human lung tissue identifies novel genes for chronic obstructive pulmonary disease. Hum Genomics 2018; 12:1. [PMID: 29335020 PMCID: PMC5769240 DOI: 10.1186/s40246-018-0132-z] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Received: 09/20/2017] [Accepted: 01/02/2018] [Indexed: 12/29/2022] Open
Abstract
BACKGROUND Genome-wide association studies (GWAS) have identified single nucleotide polymorphisms (SNPs) significantly associated with chronic obstructive pulmonary disease (COPD). However, many genetic variants show suggestive evidence for association but do not meet the strict threshold for genome-wide significance. Integrative analysis of multiple omics datasets has the potential to identify novel genes involved in disease pathogenesis by leveraging these variants in a functional, regulatory context. RESULTS We performed expression quantitative trait locus (eQTL) analysis using genome-wide SNP genotyping and gene expression profiling of lung tissue samples from 86 COPD cases and 31 controls, testing for SNPs associated with gene expression levels. These results were integrated with a prior COPD GWAS using an ensemble statistical and network methods approach to identify relevant genes and observe them in the context of overall genetic control of gene expression to highlight co-regulated genes and disease pathways. We identified 250,312 unique SNPs and 4997 genes in the cis(local)-eQTL analysis (5% false discovery rate). The top gene from the integrative analysis was MAPT, a gene recently identified in an independent GWAS of lung function. The genes HNRNPAB and PCBP2 with RNA binding activity and the gene ACVR1B were identified in network communities with validated disease relevance. CONCLUSIONS The integration of lung tissue gene expression with genome-wide SNP genotyping and subsequent intersection with prior GWAS and omics studies highlighted candidate genes within COPD loci and in communities harboring known COPD genes. This integration also identified novel disease genes in sub-threshold regions that would otherwise have been missed through GWAS.
|
37
|
L₁ splitting rules in survival forests. LIFETIME DATA ANALYSIS 2017; 23:671-691. [PMID: 27379423 DOI: 10.1007/s10985-016-9372-1] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 05/31/2015] [Accepted: 06/17/2016] [Indexed: 06/06/2023]
Abstract
The log-rank test is used as the split function in many commonly used survival tree and forest algorithms. However, the log-rank test may suffer a significant loss of power in some circumstances, especially when the hazard functions or the survival functions of the two compared groups cross each other. We investigate the use of the integrated absolute difference between the survival functions of the two child nodes as the splitting rule. Simulation studies and applications to real data sets show that forests built with this rule produce very good results in general, and that they are often better than forests built with the log-rank splitting rule.
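The splitting statistic described above, the integrated absolute difference between the two child-node survival functions, can be sketched as follows. This is a simplified illustration and not the authors' implementation: it assumes Kaplan-Meier estimates for each child node, integrates up to a user-chosen horizon `tau`, and omits any node-size weighting the paper may apply. All function names and the toy data are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time (right-continuous steps)."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv, curve, i = 1.0, [], 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Group all subjects sharing the same observed time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]   # event flag: 1 = event, 0 = censored
            removed += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
    return curve


def step_value(curve, t):
    """Evaluate the step function S at time t (S = 1 before the first event)."""
    s = 1.0
    for time, surv in curve:
        if time > t:
            break
        s = surv
    return s


def l1_distance(curve_a, curve_b, tau):
    """Integrate |S_a(t) - S_b(t)| over [0, tau]; both curves are step functions."""
    knots = sorted({0.0, tau, *(t for t, _ in curve_a + curve_b if t < tau)})
    total = 0.0
    for left, right in zip(knots, knots[1:]):
        gap = abs(step_value(curve_a, left) - step_value(curve_b, left))
        total += gap * (right - left)
    return total


# Toy candidate split (hypothetical data): event flag 0 marks a censored time.
left_curve = kaplan_meier([1, 2, 3], [1, 1, 1])
right_curve = kaplan_meier([2, 4, 6], [1, 1, 0])
split_score = l1_distance(left_curve, right_curve, tau=4.0)
print(round(split_score, 4))  # 1.3333
```

In a tree-growing loop, a score like this would be computed for every candidate split and the split with the largest value retained. Unlike the log-rank statistic, the score stays large when the two curves cross, because the absolute value prevents positive and negative differences from cancelling.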
|
38
|
Comprehensive benchmarking and ensemble approaches for metagenomic classifiers. Genome Biol 2017; 18:182. [PMID: 28934964 PMCID: PMC5609029 DOI: 10.1186/s13059-017-1299-7] [Citation(s) in RCA: 163] [Impact Index Per Article: 23.3] [Received: 02/07/2017] [Accepted: 08/16/2017] [Indexed: 12/25/2022] Open
Abstract
Background One of the main challenges in metagenomics is the identification of microorganisms in clinical and environmental samples. While an extensive and heterogeneous set of computational tools is available to classify microorganisms using whole-genome shotgun sequencing data, comprehensive comparisons of these methods are limited. Results In this study, we use the largest-to-date set of laboratory-generated and simulated controls across 846 species to evaluate the performance of 11 metagenomic classifiers. Tools were characterized on the basis of their ability to identify taxa at the genus, species, and strain levels, quantify relative abundances of taxa, and classify individual reads to the species level. Strikingly, the number of species identified by the 11 tools can differ by over three orders of magnitude on the same datasets. Various strategies can ameliorate taxonomic misclassification, including abundance filtering, ensemble approaches, and tool intersection. Nevertheless, these strategies were often insufficient to completely eliminate false positives from environmental samples, which are especially important where they concern medically relevant species. Overall, pairing tools with different classification strategies (k-mer, alignment, marker) can combine their respective advantages. Conclusions This study provides positive and negative controls, titrated standards, and a guide for selecting tools for metagenomic analyses by comparing ranges of precision, accuracy, and recall. We show that proper experimental design and analysis parameters can reduce false positives, provide greater resolution of species in complex metagenomic samples, and improve the interpretation of results. Electronic supplementary material The online version of this article (doi:10.1186/s13059-017-1299-7) contains supplementary material, which is available to authorized users.
|
39
|
Extract critical factors affecting the length of hospital stay of pneumonia patient by data mining (case study: an Iranian hospital). Artif Intell Med 2017; 83:2-13. [PMID: 28712673 DOI: 10.1016/j.artmed.2017.06.010] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 02/03/2017] [Revised: 06/21/2017] [Accepted: 06/28/2017] [Indexed: 11/30/2022]
Abstract
MOTIVATION Pneumonia is a prevalent lower respiratory tract infection affecting the lungs. Length of stay (LOS) in hospital is one of the simplest and most important indicators of hospital activity and is used for different purposes. The aim of this study is to explore the important factors affecting the LOS of patients with pneumonia in hospitals. METHODS The clinical data set for the study was collected from 387 patients in a specialized hospital in Iran between 2009 and 2015. Patients' discharge summaries include their demographic details, reasons for admission, prescribed medications, laboratory test results, and length of treatment. RESULTS AND CONCLUSIONS The proposed model demonstrates how various data-processing scenarios affect the scale efficiency model, which points to the significance of pre-processing in data mining. Several methods were applied; notably, the Bayesian boosting method led to better results in identifying the factors affecting LOS (accuracy 95.17%). In addition, it was found that 58% of patients younger than 15 years old and 74% of the elderly within the age range of 74-88 were more vulnerable to pneumonia. It was also found that Meropenem is relatively more effective than the other antibiotics used to treat pneumonia in the majority of age groups. Regardless of the impact of various laboratory findings (including CRP, ESR, WBC, NA, K), the patients' LOS decreased as a result of Meropenem.
|
40
|
Improved Prediction of Procedure Duration for Elective Surgery. Stud Health Technol Inform 2017; 239:133-138. [PMID: 28756448] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/07/2023]
Abstract
Accurate surgery duration estimation is essential for efficient use of hospital operating theatres and the scheduling of elective patients. This study focuses on analysing the performance of previously developed surgery duration prediction algorithms at a specialty level to gain further insight on their performance. We also evaluate algorithm performance after applying filtering to exclude unreliable data from modelling, and develop and validate new ensemble approaches for prediction. These are shown to significantly improve the prediction accuracy of the algorithms. Employing filtered data delivers a reduction in overall prediction error of 44% (Mean Absolute Percentage Error from 0.68 to 0.38) employing the Random Forests algorithm, while using the newly developed ensemble approach delivers a Mean Absolute Percentage Error of 0.31, a reduction of 55% when compared to the original error, and a reduction of 18% when compared to the Random Forests performance on filtered data.
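The percentage reductions quoted in this abstract follow from the three reported Mean Absolute Percentage Error (MAPE) values. The short sketch below reproduces the arithmetic; the `mape` helper uses the standard definition and is included only for context, and the function and variable names are illustrative.

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error, expressed as a fraction (0.38 = 38%)."""
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction in an error metric going from `before` to `after`."""
    return (before - after) / before


# MAPE values as reported in the abstract.
original_mape = 0.68      # before filtering
filtered_rf_mape = 0.38   # Random Forests on filtered data
ensemble_mape = 0.31      # newly developed ensemble approach

print(f"{relative_reduction(original_mape, filtered_rf_mape):.1%}")  # 44.1%
print(f"{relative_reduction(original_mape, ensemble_mape):.1%}")     # 54.4%
print(f"{relative_reduction(filtered_rf_mape, ensemble_mape):.1%}")  # 18.4%
```

The rounded inputs give 54.4% for the second figure, where the abstract states 55%; the difference is presumably because the authors computed it from unrounded MAPE values.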
|
41
|
Abstract
Background The Random Forest (RF) algorithm for supervised machine learning is an ensemble learning method widely used in science and many other fields. Its popularity has been increasing, but relatively few studies address the parameter selection process, a critical step in model fitting. Due to numerous assertions regarding the performance reliability of the default parameters, many RF models are fit using these values. However, there has not yet been a thorough examination of the parameter sensitivity of RFs in computational genomic studies. We address this gap here. Results We examined the effects of parameter selection on classification performance using the RF machine learning algorithm on two biological datasets with distinct p/n ratios: sequencing summary statistics (low p/n) and microarray-derived data (high p/n). Here, p refers to the number of variables and n to the number of samples. Our findings demonstrate that parameterization is highly correlated with prediction accuracy and variable importance measures (VIMs). Further, we demonstrate that different parameters are critical in tuning different datasets, and that parameter optimization significantly improves upon the default parameters. Conclusions Parameter performance demonstrated wide variability on both low and high p/n data. Therefore, there is significant benefit to be gained by tuning RFs away from their default parameter settings. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1228-x) contains supplementary material, which is available to authorized users.
|
42
|
Interpretable per case weighted ensemble method for cancer associations. BMC Genomics 2016; 17:501. [PMID: 27435615 PMCID: PMC4952276 DOI: 10.1186/s12864-016-2647-9] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 07/23/2015] [Accepted: 04/22/2016] [Indexed: 12/27/2022] Open
Abstract
BACKGROUND Molecular measurements from cancer patients such as gene expression and DNA methylation can be influenced by several external factors. This makes it harder to reproduce the exact values of measurements coming from different laboratories. Furthermore, some cancer types are very heterogeneous, meaning that there might be different underlying causes for the same type of cancer among different individuals. If a model does not take potential biases in the data into account, this can lead to problems when trying to predict the stage of a certain cancer type. This is especially true when these biases differ between the training and test set. RESULTS We introduce a method that can estimate this bias on a per-feature level and incorporate calculated feature confidences into a weighted combination of classifiers with disjoint feature sets. In this way, the method provides a prediction that is adjusted for the potential biases on a per-patient basis, providing a personalized prediction for each test patient. The new method achieves state-of-the-art performance on many different cancer data sets with measured DNA methylation or gene expression. Moreover, we show how to visualize the learned classifiers to display interesting associations with the target label. Applied to a leukemia data set, our method finds several ribosomal proteins associated with the risk group, which might be interesting targets for follow-up studies. This discovery supports the hypothesis that the ribosomes are a new frontier in gene regulation. CONCLUSION We introduce a new method for robust prediction of phenotypes from molecular measurements such as DNA methylation or gene expression. Furthermore, the visualization capabilities enable exploratory analysis on the learnt dependencies and pave the way for a personalized prediction of phenotypes. The software is available under GPL2+ from https://github.com/adrinjalali/Network-Classifier/tree/v1.0.
|
43
|
Inference for survival prediction under the regularized Cox model. Biostatistics 2016; 17:692-707. [PMID: 27107008 DOI: 10.1093/biostatistics/kxw016] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 03/23/2015] [Accepted: 03/23/2016] [Indexed: 12/31/2022] Open
Abstract
When a moderate number of potential predictors are available and a survival model is fit with regularization to achieve variable selection, providing accurate inference on the predicted survival can be challenging. We investigate inference on the predicted survival estimated after fitting a Cox model under regularization guaranteeing the oracle property. We demonstrate that existing asymptotic formulas for the standard errors of the coefficients tend to underestimate the variability for some coefficients, while typical resampling such as the bootstrap tends to overestimate it; these approaches can both lead to inaccurate variance estimation for predicted survival functions. We propose a two-stage adaptation of a resampling approach that brings the estimated error in line with the truth. In stage 1, we estimate the coefficients in the observed data set and in [Formula: see text] resampled data sets, and allow the resampled coefficient estimates to vote on whether each coefficient should be 0. For those coefficients voted as zero, we set both the point and interval estimates to [Formula: see text]. In stage 2, to make inference about coefficients not voted as zero in stage 1, we refit the penalized model in the observed data and in the [Formula: see text] resampled data sets with only variables corresponding to those coefficients. We demonstrate that ensemble voting-based point and interval estimators of the coefficients perform well in finite samples, and prove that the point estimator maintains the oracle property. We extend this approach to derive inference procedures for survival functions and demonstrate that our proposed interval estimation procedures substantially outperform estimators based on asymptotic inference or standard bootstrap. We further illustrate our proposed procedures to predict breast cancer survival in a gene expression study.
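The stage 1 voting step can be illustrated with a short sketch. The abstract does not state the voting rule, so a simple majority threshold is assumed here; the coefficient values are made up, and fitting the penalized Cox model itself is outside the scope of the example. The names `vote_zero`, `coef_fits`, and `threshold` are hypothetical.

```python
def vote_zero(coef_fits, threshold=0.5):
    """For each coefficient, return True if the fits vote to set it to zero.

    coef_fits: one list of penalized coefficient estimates per resampled fit.
    threshold: share of fits that must estimate exactly 0 (ASSUMED majority rule).
    """
    n_fits = len(coef_fits)
    n_coefs = len(coef_fits[0])
    return [
        sum(fit[j] == 0.0 for fit in coef_fits) / n_fits > threshold
        for j in range(n_coefs)
    ]


# Hypothetical penalized estimates from five resampled data sets.
fits = [
    [0.8, 0.0, 0.0],
    [0.7, 0.0, 0.1],
    [0.9, 0.0, 0.0],
    [0.6, 0.2, 0.0],
    [0.8, 0.0, 0.0],
]

zeroed = vote_zero(fits)
kept = [j for j, is_zero in enumerate(zeroed) if not is_zero]
print(zeroed)  # [False, True, True]
print(kept)    # [0]
```

In stage 2 as described in the abstract, only the variables indexed by `kept` would be passed to the refit on the observed and resampled data sets.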
|
44
|
Ensemble of a subset of kNN classifiers. ADV DATA ANAL CLASSI 2016; 12:827-840. [PMID: 30931011 PMCID: PMC6404785 DOI: 10.1007/s11634-015-0227-5] [Citation(s) in RCA: 50] [Impact Index Per Article: 6.3] [Received: 12/22/2014] [Revised: 10/12/2015] [Accepted: 12/10/2015] [Indexed: 01/04/2023]
Abstract
Combining multiple classifiers, known as ensemble methods, can give substantial improvement in the prediction performance of learning algorithms, especially in the presence of non-informative features in the data sets. We propose an ensemble of a subset of kNN classifiers, ESkNN, for classification tasks, built in two steps. First, we choose classifiers based on their individual performance using out-of-sample accuracy. The selected classifiers are then combined sequentially, starting from the best model, and assessed for collective performance on a validation data set. We use benchmark data sets with their original features and with some added non-informative features to evaluate our method. The results are compared with the usual kNN, bagged kNN, random kNN, the multiple feature subset method, random forest and support vector machines. Our experimental comparisons on benchmark classification problems and simulated data sets reveal that the proposed ensemble gives better classification performance than the usual kNN and its ensembles, and performs comparably to random forest and support vector machines.
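The two-step selection described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the published ESkNN procedure: it takes each kNN model's out-of-sample accuracy and its class-1 probabilities on the validation set as given, combines models by averaging probabilities, and keeps a model only if validation accuracy strictly improves. All names and numbers are hypothetical.

```python
def accuracy(pred, truth):
    """Share of predictions that match the true labels."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def ensemble_predict(prob_lists):
    """Average class-1 probabilities across models and threshold at 0.5."""
    return [int(sum(col) / len(col) > 0.5) for col in zip(*prob_lists)]


def select_ensemble(oos_acc, val_probs, val_truth, top_m):
    """Step 1: keep the top_m models by out-of-sample accuracy.
    Step 2: add them one by one, best first, keeping a model only if the
    ensemble's validation accuracy strictly improves (ASSUMED criterion)."""
    ranked = sorted(range(len(oos_acc)), key=lambda i: oos_acc[i], reverse=True)
    ranked = ranked[:top_m]
    chosen = [ranked[0]]
    best = accuracy(ensemble_predict([val_probs[ranked[0]]]), val_truth)
    for i in ranked[1:]:
        trial = chosen + [i]
        acc = accuracy(ensemble_predict([val_probs[j] for j in trial]), val_truth)
        if acc > best:
            chosen, best = trial, acc
    return chosen, best


# Hypothetical inputs: three kNN models (k = 5) built on different feature subsets.
val_truth = [0, 1, 1, 0, 1, 0]
val_probs = [
    [0.2, 0.8, 0.8, 0.4, 0.4, 0.2],  # model 0: misses the fifth case
    [0.4, 0.6, 0.4, 0.2, 0.8, 0.4],  # model 1: misses the third case
    [0.6, 0.6, 0.6, 0.6, 0.6, 0.6],  # model 2: always predicts class 1
]
oos_acc = [0.83, 0.80, 0.55]

chosen, best = select_ensemble(oos_acc, val_probs, val_truth, top_m=3)
print(chosen, best)  # [0, 1] 1.0
```

Models 0 and 1 make different mistakes, so averaging them corrects both errors; model 2 adds nothing and is left out. That is the behaviour the abstract attributes to selecting a subset rather than combining every classifier.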
|
45
|
Analysis of shared miRNAs of different species using ensemble CCA and genetic distance. Comput Biol Med 2015; 64:261-7. [PMID: 26233781 DOI: 10.1016/j.compbiomed.2015.06.023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/06/2015] [Revised: 06/23/2015] [Accepted: 06/24/2015] [Indexed: 10/23/2022]
Abstract
MicroRNA (miRNA) is a type of single-stranded RNA molecule that plays an important role in gene expression. Although a number of computational methodologies exist in bioinformatics research for miRNA classification and target prediction, the analysis of shared miRNAs among different species has not yet been addressed. In this article, we analyzed miRNAs that have the same name and function but different sequences and that belong to different (but closely related) species, drawn from the online miRBase database. We used sequence-driven features and performed the standard and the ensemble versions of Canonical Correlation Analysis (CCA). Because standard CCA is sensitive to noise and outliers, we extended it using an ensemble approach. Using linear combinations of dimer features, the proposed Ensemble CCA (ECCA) method identified higher test-set correlations than CCA. Moreover, our analysis reveals that the Redundancy Index of ECCA applied to a pair of species correlates with their genetic distance.
|