51. Rahmanian S, Pourghasemi HR, Pouyan S, Karami S. Habitat potential modelling and mapping of Teucrium polium using machine learning techniques. Environmental Monitoring and Assessment 2021; 193:759. [PMID: 34718878] [DOI: 10.1007/s10661-021-09551-8]
Abstract
Determining suitable habitats is important for the successful management and conservation of plant and wildlife species. Teucrium polium L. is a wild plant species found in Iran. It is widely used to treat numerous health problems. The range of this plant is shrinking due to habitat destruction and overexploitation. Therefore, habitat suitability (HS) modeling is critical for conservation. HS modeling can also identify the key characteristics of habitats that support this species. This study models the habitats of T. polium using five data mining models: random forest (RF), flexible discriminant analysis (FDA), multivariate adaptive regression splines (MARS), support vector machine (SVM), and generalized linear model (GLM). A total of 119 T. polium locations were identified and mapped. According to the RF model, the most important factors describing T. polium habitat were elevation, soil texture, and mean annual rainfall. HS maps (HSMs) were prepared, and habitat suitability was classified as low, medium, high, or very high. The percentages of the study area assigned high or very high suitability ratings by each of the models were 44.62% for FDA, 43.75% for GLM, 43.12% for SVM, 38.91% for RF, 28.72% for MARS, and 39.16% for their ensemble. Although the six models were reasonably accurate, the ensemble model had the highest AUC value, demonstrating strong predictive performance. The rank order of the other models in this regard was RF, MARS, SVM, FDA, and GLM. HSMs can provide useful output to support the sustainable management of rangelands, reclamation, and land protection.
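As an illustration of the ensemble idea described in this abstract, the sketch below averages the suitability probabilities of a few scikit-learn classifiers and bins the result into the four suitability classes. It is not the authors' pipeline: the data are synthetic stand-ins for the 119 presence records, FDA and MARS have no direct scikit-learn equivalents and are omitted, and equal-weight averaging with arbitrary class cutoffs is assumed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for presence/absence records with environmental covariates
# (elevation, soil texture class, mean annual rainfall, ...).
X, y = make_classification(n_samples=500, n_features=8, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "RF": RandomForestClassifier(n_estimators=300, random_state=0),
    "GLM": LogisticRegression(max_iter=1000),   # logistic GLM analogue
    "SVM": SVC(probability=True, random_state=0),
}

probs = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    p = model.predict_proba(X_te)[:, 1]
    probs.append(p)
    print(f"{name}: AUC = {roc_auc_score(y_te, p):.3f}")

# Unweighted ensemble: mean suitability across models.
ensemble = np.mean(probs, axis=0)
print(f"Ensemble: AUC = {roc_auc_score(y_te, ensemble):.3f}")

# Bin the ensemble suitability into four ordinal classes, as in the habitat maps.
classes = np.digitize(ensemble, bins=[0.25, 0.5, 0.75])  # 0=low ... 3=very high
```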
52. Cui L, Wang S. Mapping the daily nitrous acid (HONO) concentrations across China during 2006-2017 through ensemble machine-learning algorithm. The Science of the Total Environment 2021; 785:147325. [PMID: 33957584] [DOI: 10.1016/j.scitotenv.2021.147325]
Abstract
Nitrous acid (HONO) is a major source of the hydroxyl radical (OH) and plays a key role in atmospheric photochemistry. The lack of spatially resolved HONO concentration information results in large knowledge gaps regarding HONO and its role in atmospheric chemistry and air pollution in China. In this work, an ensemble machine learning model comprising random forest, gradient boosting, and a back-propagation neural network was proposed, for the first time, to estimate long-term (2006-2017) daily HONO concentrations across China at 0.25° resolution. Further, the key factors controlling the space-time variability of HONO concentrations were analyzed based on variable importance values. The ensemble model characterized the spatiotemporal distribution of daily HONO concentrations well, with sample-based, site-based, and by-year cross-validation (CV) R2 (RMSE) of 0.7 (0.36 ppbv), 0.67 (0.36 ppbv), and 0.62 (0.40 ppbv), respectively. HONO hotspots were mainly distributed in the Beijing-Tianjin-Hebei (BTH), Pearl River Delta (PRD), and Yangtze River Delta (YRD) regions and at several sites in the Sichuan Basin, in line with the distribution patterns of tropospheric NO2 columns and assimilated surface NO3- levels. National HONO levels stagnated during 2006-2013, then declined after 2013, benefiting from the implementation of the Action Plan for Air Pollution Prevention and Control. NO3- concentration, urban area, and NO2 column density ranked as important variables for HONO prediction, while agricultural land, forest, and grassland played minor roles in affecting HONO concentrations, suggesting the significant role of heterogeneous HONO production from anthropogenic precursor emissions. Leveraging ground-level HONO observations, this study fills the gap in statistical modelling of nationwide HONO in China and provides essential data for atmospheric chemistry research.
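A minimal sketch of the kind of three-learner ensemble the abstract describes (random forest, gradient boosting, back-propagation neural network), assuming synthetic data and a simple linear meta-model to combine out-of-fold predictions; the paper's actual predictor set and weighting scheme are not reproduced here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.metrics import r2_score, mean_squared_error

# Synthetic stand-in for daily grid-cell predictors (NO2 column, NO3-, land use, meteorology).
X, y = make_regression(n_samples=2000, n_features=12, noise=5.0, random_state=1)

cv = KFold(n_splits=10, shuffle=True, random_state=1)
base = {
    "RF": RandomForestRegressor(n_estimators=200, random_state=1),
    "GBM": GradientBoostingRegressor(random_state=1),
    "BPNN": MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000, random_state=1),
}

# Out-of-fold predictions for each base learner (sample-based cross-validation).
oof = {name: cross_val_predict(m, X, y, cv=cv) for name, m in base.items()}

# Combine base learners with a simple linear meta-model; the paper's combination
# rule is not specified here, so this illustrative choice is an assumption.
Z = np.column_stack(list(oof.values()))
meta = LinearRegression().fit(Z, y)
y_hat = meta.predict(Z)

print("ensemble R2  :", round(r2_score(y, y_hat), 3))
print("ensemble RMSE:", round(float(np.sqrt(mean_squared_error(y, y_hat))), 3))
```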
53. Ke B, Nguyen H, Bui XN, Bui HB, Choi Y, Zhou J, Moayedi H, Costache R, Nguyen-Trang T. Predicting the sorption efficiency of heavy metal based on the biochar characteristics, metal sources, and environmental conditions using various novel hybrid machine learning models. Chemosphere 2021; 276:130204. [PMID: 34088091] [DOI: 10.1016/j.chemosphere.2021.130204]
Abstract
Heavy metals in water and wastewater are considered one of the most hazardous environmental issues, with significant impacts on human health. Biochar systems based on different materials have helped to significantly remove heavy metals from water, especially in wastewater treatment systems. Nevertheless, the sorption efficiency of heavy metals on biochar systems is highly dependent on the biochar characteristics, metal sources, and environmental conditions. Therefore, this study examines the feasibility of biochar systems for heavy metal sorption in water/wastewater and the use of artificial intelligence (AI) models for investigating the sorption efficiency of heavy metals on biochar. Accordingly, this work investigated and proposed 20 artificial intelligence models for forecasting the sorption efficiency of heavy metals onto biochar, based on five machine learning algorithms and the bagging technique (BA). Support vector machine (SVM), random forest (RF), artificial neural network (ANN), M5Tree, and Gaussian process (GP) algorithms were used as the key algorithms for this purpose. Subsequently, the individual models were combined with each other to generate new ensemble models. Finally, 20 intelligent models were developed and evaluated, including SVM, RF, M5Tree, GP, ANN, BA-SVM, BA-RF, BA-M5Tree, BA-GP, BA-ANN, SVM-RF, SVM-M5Tree, SVM-GP, SVM-ANN, RF-M5Tree, RF-GP, RF-ANN, M5Tree-GP, M5Tree-ANN, and GP-ANN. Of those, the hybrid models (i.e., BA-SVM, BA-RF, BA-M5Tree, BA-GP, BA-ANN, SVM-RF, SVM-M5Tree, SVM-GP, SVM-ANN, RF-M5Tree, RF-GP, RF-ANN, M5Tree-GP, M5Tree-ANN, GP-ANN) are introduced as the novelty of this study for estimating the sorption efficiency of heavy metals on biochar systems. The comprehensive assessment and use of biochar characteristics, metal sources, and environmental conditions as model inputs is also considered a novelty of the study. For this aim, a dataset of heavy metal sorption efficiency comprising 353 experimental tests was collected and processed. Various performance indices were applied to evaluate the models, such as RMSE, R2, MAE, color intensity, Taylor diagrams, and box-and-whisker plots. The findings revealed that AI models can predict the sorption efficiency of heavy metals onto biochar with high reliability, and that the ensemble models are more efficient than the individual models. The results also showed that the SVM-ANN ensemble model is the best of the 20 developed models. The proposed models suggest that the sorption efficiency of heavy metals on biochar can be accurately forecast, enabling early warning of water pollution by heavy metals.
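The sketch below illustrates the two ensemble mechanisms named in the abstract, bagging a single learner (e.g., BA-SVM) and pairing two different learners (e.g., SVM-ANN), on synthetic data with the same sample size as the paper's dataset. How the authors combine paired models is not stated here, so a simple average of predictions is assumed, and all hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error

# Synthetic stand-in for the 353 experiments (biochar properties, metal source,
# pH, temperature, dose, ...) with sorption efficiency as the target.
X, y = make_regression(n_samples=353, n_features=10, noise=8.0, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=2)

# Bagged single learner (BA-SVM style): bootstrap-resampled copies of one base model.
ba_svm = BaggingRegressor(SVR(C=10.0), n_estimators=25, random_state=2).fit(X_tr, y_tr)

# Two-model hybrid (SVM-ANN style): average the predictions of two distinct learners.
svm = SVR(C=10.0).fit(X_tr, y_tr)
ann = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=3000, random_state=2).fit(X_tr, y_tr)
hybrid = (svm.predict(X_te) + ann.predict(X_te)) / 2.0

results = {
    "BA-SVM": ba_svm.predict(X_te),
    "SVM-ANN": hybrid,
    "RF": RandomForestRegressor(n_estimators=300, random_state=2).fit(X_tr, y_tr).predict(X_te),
}
for name, pred in results.items():
    print(f"{name}: R2={r2_score(y_te, pred):.3f}  MAE={mean_absolute_error(y_te, pred):.2f}")
```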
54. Mukherjee T, Sharma V, Sharma LK, Thakur M, Joshi BD, Sharief A, Thapa A, Dutta R, Dolker S, Tripathy B, Chandra K. Landscape-level habitat management plan through geometric reserve design for critically endangered Hangul (Cervus hanglu hanglu). The Science of the Total Environment 2021; 777:146031. [PMID: 33676208] [DOI: 10.1016/j.scitotenv.2021.146031]
Abstract
Hangul (Cervus hanglu hanglu), the only red deer subspecies surviving in the Indian subcontinent, is of top conservation priority with global importance. Unfortunately, it has lost much of its historical distribution range and is now confined to the Dachigam landscape within the Kashmir valley of India. The Government of India initiated a recovery plan in 2008 to augment their numbers through ex-situ conservation programs. However, it was necessary to identify potential hangul habitats in the Kashmir valley for adopting landscape-level conservation planning for the species. Based on geometric aspects of reserve design, we modeled hangul habitat using an ensemble approach. The present model indicates that conifer and broadleaf mixed forests were the most suitable habitats. Only 9% of the total study landscape was found suitable for the species. We identified corridors among the suitable habitat blocks, which may be vital for the species' long-term genetic viability. We suggest reorganizing the existing management of Dachigam National Park (NP) toward landscape-level and habitat block-level management planning based on the core principles of geometric reserve design. We recommend that the identified patch (PID-6) in the southern region of the landscape be converted into a Conservation Reserve or merged with the Overa-Aru Wildlife Sanctuary. This habitat patch (PID-6) may serve as a stepping-stone habitat and be vital for maintaining the species' landscape connectivity and metapopulation dynamics.
55. Tanveer MA, Khan MJ, Sajid H, Naseer N. Convolutional neural networks ensemble model for neonatal seizure detection. J Neurosci Methods 2021; 358:109197. [PMID: 33864835] [DOI: 10.1016/j.jneumeth.2021.109197]
Abstract
BACKGROUND Neonatal seizures are a common occurrence in clinical settings, requiring immediate attention and detection. Previous studies have proposed using manual feature extraction coupled with machine learning, or deep learning, to classify between seizure and non-seizure states. NEW METHOD In this paper, a deep learning-based approach is used for neonatal seizure classification using electroencephalogram (EEG) signals. The architecture detects seizure activity in raw EEG signals, as opposed to the common state of the art, where manual feature extraction with machine learning algorithms is used. The architecture is a two-dimensional (2D) convolutional neural network (CNN) that classifies between seizure/non-seizure states. RESULTS The dataset used for this study is annotated by three experts, and as such three separate models are trained on the individual annotations, resulting in average accuracies (ACC) of 95.6%, 94.8%, and 90.1%, respectively, and average areas under the receiver operating characteristic curve (AUC) of 99.2%, 98.4%, and 96.7%, respectively. Testing was done using 10-fold cross-validation, so that the performance is an accurate representation of the architecture's classification capability in a clinical setting. After training/testing of the three individual models, a final ensemble model is built from the three models. The ensemble model gives an average ACC and AUC of 96.3% and 99.3%, respectively. COMPARISON WITH EXISTING METHODS This study outperforms previous studies, with increased ACC and AUC results coupled with the use of small time windows (1 s) for evaluation. CONCLUSION The proposed approach is promising for detecting seizure activity in unseen neonate data in a clinical setting.
56. Yu X, Yang Q, Wang D, Li Z, Chen N, Kong DX. Predicting lung adenocarcinoma disease progression using methylation-correlated blocks and ensemble machine learning classifiers. PeerJ 2021; 9:e10884. [PMID: 33628643] [PMCID: PMC7894106] [DOI: 10.7717/peerj.10884]
Abstract
Applying the knowledge that methyltransferases and demethylases can modify adjacent cytosine-phosphate-guanine (CpG) sites in the same DNA strand, we found that combining multiple CpGs into a single block may improve cancer diagnosis. However, survival prediction remains a challenge. In this study, we developed a pipeline named "stacked ensemble of machine learning models for methylation-correlated blocks" (EnMCB) that combined Cox regression, support vector regression (SVR), and elastic-net models to construct signatures based on DNA methylation-correlated blocks for lung adenocarcinoma (LUAD) survival prediction. We used methylation profiles from The Cancer Genome Atlas (TCGA) as the training set, and profiles from the Gene Expression Omnibus (GEO) as validation and testing sets. First, we partitioned the genome into blocks of tightly co-methylated CpG sites, which we termed methylation-correlated blocks (MCBs). After partitioning and feature selection, we observed different diagnostic capacities for predicting patient survival across the models. We combined the multiple models into a single stacking ensemble model. The stacking ensemble model based on the top-ranked block had an area under the receiver operating characteristic curve of 0.622 in the TCGA training set, 0.773 in the validation set, and 0.698 in the testing set. When stratified by clinicopathological risk factors, the risk score predicted by the top-ranked MCB was an independent prognostic factor. Our results showed that our pipeline is a reliable tool that may facilitate MCB selection and survival prediction.
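A simplified sketch of a stacking ensemble in the spirit of EnMCB, assuming synthetic methylation features for one block and a binary progression label. The Cox regression component is omitted because it needs a survival-analysis package (e.g., lifelines), so the elastic-net and SVR learners shown are scikit-learn stand-ins and the meta-learner is an ordinary logistic regression.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import ElasticNet, LogisticRegression
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.metrics import roc_auc_score

# Synthetic stand-in: methylation beta-values for CpGs in one correlated block,
# with a binary progression/survival-event label.
X, y = make_classification(n_samples=400, n_features=30, n_informative=10, random_state=3)

cv = KFold(n_splits=5, shuffle=True, random_state=3)

# Level-0 learners produce out-of-fold risk scores for the block.
enet_score = cross_val_predict(ElasticNet(alpha=0.1), X, y, cv=cv)
svr_score = cross_val_predict(SVR(), X, y, cv=cv)

# Level-1 (stacking) learner combines the block-level scores into one signature.
Z = np.column_stack([enet_score, svr_score])
meta = LogisticRegression().fit(Z, y)
risk = meta.predict_proba(Z)[:, 1]
print("stacked AUC:", round(roc_auc_score(y, risk), 3))
```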
57. Mukherjee T, Sharma LK, Kumar V, Sharief A, Dutta R, Kumar M, Joshi BD, Thakur M, Venkatraman C, Chandra K. Adaptive spatial planning of protected area network for conserving the Himalayan brown bear. The Science of the Total Environment 2021; 754:142416. [PMID: 33254933] [DOI: 10.1016/j.scitotenv.2020.142416]
Abstract
Large mammals that occur in low densities, particularly in high-altitude areas, are globally threatened due to fragile climatic and ecological envelopes. Among bear species, the Himalayan brown bear (Ursus arctos isabellinus) has a distribution restricted to the Himalayan highlands, with relatively small and fragmented populations. To date, very little scientific information is available on the Himalayan brown bear, although such information is vital for the conservation of the species and the management of its habitats, especially in protected areas of the landscape. The present study aims to understand the effectiveness of existing Himalayan Protected Areas in terms of representativeness for the conservation of the Himalayan brown bear (HBB), an umbrella species in high-altitude habitats of the Himalayan region. We used an ensemble species distribution modeling approach and then assessed biological connectivity to predict the current and future distribution and movement of the HBB under climate change scenarios for the year 2050. Approximately 33 protected areas (PAs) currently possess suitable habitats. Our model suggests a massive decline of approximately 73.38% and 72.87% under representative concentration pathways (RCPs) 4.5 and 8.5, respectively, by the year 2050 compared with the current distribution. The predicted change in suitability will result in loss of habitats from thirteen PAs; eight will become completely uninhabitable by the year 2050, followed by loss of connectivity in the majority of PAs. Habitat configuration analysis suggested a 40% decline in the number of suitable patches, a reduction in large habitat patches (up to 50%), and aggregation of suitable areas (9%) by 2050, indicating fragmentation. The predicted shift in geographic isotherms will result in loss of habitat from thirteen PAs, eight of which will become completely uninhabitable. Hence, these PAs may lose their effectiveness and representativeness in achieving the very objective of their existence and their conservation goals. Therefore, we recommend adaptive spatial planning for protecting suitable habitats distributed outside the PAs for climate change adaptation.
58. Gifani P, Shalbaf A, Vafaeezadeh M. Automated detection of COVID-19 using ensemble of transfer learning with deep convolutional neural network based on CT scans. Int J Comput Assist Radiol Surg 2021; 16:115-123. [PMID: 33191476] [PMCID: PMC7667011] [DOI: 10.1007/s11548-020-02286-w]
Abstract
PURPOSE COVID-19 has infected millions of people worldwide. One of the most important hurdles in controlling the spread of this disease is the inefficiency and lack of medical tests. Computed tomography (CT) scans are promising in providing accurate and fast detection of COVID-19. However, determining COVID-19 requires highly trained radiologists and suffers from inter-observer variability. To remedy these limitations, this paper introduces an automatic methodology based on an ensemble of deep transfer learning for the detection of COVID-19. METHODS A total of 15 pre-trained convolutional neural network (CNN) architectures, EfficientNets (B0-B5), NasNetLarge, NasNetMobile, InceptionV3, ResNet-50, SeResNet-50, Xception, DenseNet121, ResNext50, and Inception_resnet_v2, are used and then fine-tuned on the target task. After that, we built an ensemble method based on majority voting of the best combination of deep transfer learning outputs to further improve the recognition performance. We used a publicly available dataset of CT scans, which consists of 349 CT scans labeled as positive for COVID-19 and 397 negative CT scans that are normal or contain other types of lung disease. RESULTS The experimental results indicate that majority voting of five deep transfer learning architectures, EfficientNetB0, EfficientNetB3, EfficientNetB5, Inception_resnet_v2, and Xception, achieves higher results than the individual transfer learning structures and the other combinations, with precision of 0.857, recall of 0.854, and accuracy of 0.85 in diagnosing COVID-19 from CT scans. CONCLUSION An ensemble deep transfer learning system with different pre-trained CNN architectures can work well on a publicly available dataset of CT images for the diagnosis of COVID-19 based on CT scans.
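The majority-voting step described in the abstract reduces to counting per-model class votes; the sketch below shows only that step, with hypothetical probability outputs standing in for the five fine-tuned CNNs rather than actual networks.

```python
import numpy as np

# Hypothetical per-scan class probabilities from five fine-tuned CNNs
# (EfficientNetB0, EfficientNetB3, EfficientNetB5, Inception-ResNet-v2, Xception);
# shape: (n_models, n_scans, n_classes) with classes {0: non-COVID, 1: COVID}.
rng = np.random.default_rng(0)
probs = rng.random((5, 10, 2))
probs /= probs.sum(axis=2, keepdims=True)

# Each model votes with its argmax class.
votes = probs.argmax(axis=2)                                        # (n_models, n_scans)

# Count votes per class for every scan and take the majority label.
counts = np.apply_along_axis(np.bincount, 0, votes, minlength=2)    # (n_classes, n_scans)
majority = counts.argmax(axis=0)                                    # (n_scans,)
print(majority)
```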
59. Singh P, Kaur R. An integrated fog and Artificial Intelligence smart health framework to predict and prevent COVID-19. Global Transitions 2020; 2:283-292. [PMID: 33205037] [PMCID: PMC7659515] [DOI: 10.1016/j.glt.2020.11.002]
Abstract
COVID-19 is spreading at a rapid rate across almost all continents of the world. It has already affected many people, who in turn spread it further day by day. Hence, it is essential to alert nearby people so that they are aware of it, given its communicable behavior. As of May 2020, no vaccine was available for the treatment of COVID-19, but existing technologies can be used to minimize its effects. Cloud/fog computing could be used to monitor and control this rapidly spreading infection in a cost-effective and time-saving manner. To strengthen COVID-19 patient prediction, Artificial Intelligence (AI) can be integrated with cloud/fog computing for practical solutions. In this paper, a fog-assisted, Internet of Things-based quality-of-service framework is presented to prevent and protect against COVID-19. It provides real-time processing of users' health data to predict COVID-19 infection by observing their symptoms, and immediately sends an emergency alert, medical reports, and significant precautions to the user, their guardian, as well as doctors/experts. It collects sensitive information from hospitals/quarantine shelters through the patients' IoT devices for taking necessary actions/decisions. Further, it sends an alert message to government health agencies for controlling the outbreak of chronic illness and taking quick and timely actions.
60. Saha S, Saha M, Mukherjee K, Arabameri A, Ngo PTT, Paul GC. Predicting the deforestation probability using the binary logistic regression, random forest, ensemble rotational forest, REPTree: A case study at the Gumani River Basin, India. The Science of the Total Environment 2020; 730:139197. [PMID: 32402979] [DOI: 10.1016/j.scitotenv.2020.139197]
Abstract
Rapid population growth and its corresponding effects, such as the expansion of human settlements, increasing agricultural land, and industry, lead to the loss of forest area in most parts of the world, especially in highly populated nations such as India. Forest canopy density (FCD) is a useful measure for assessing forest cover change on its own, as numerous forest change studies have been carried out using only FCD with the help of remote sensing and GIS. Coupling binary logistic regression (BLR), random forest (RF), and an ensemble of rotation forest and reduced error pruning trees (RTF-REPTree) with FCD makes it more convenient to estimate the deforestation probability. The advanced vegetation index (AVI), bare soil index (BSI), shadow index (SI), and scaled vegetation density (VD) derived from Landsat imagery are the main input parameters used to identify the FCD. After preparing the FCDs for 1990, 2000, 2010, and 2017, the deforestation map of the study area was prepared and used as the dependent parameter for deforestation probability modelling. In addition, twelve deforestation-determining factors were used to delineate the deforestation probability with the help of the BLR, RF, and RTF-REPTree models. These deforestation probability models were validated using the area under the receiver operating characteristic (ROC) curve (AUC), efficiency, true skill statistics (TSS), and the kappa coefficient. The validation results show that all the models, i.e., BLR (AUC = 0.874), RF (AUC = 0.886), and RTF-REPTree (AUC = 0.919), are capable of assessing deforestation probability, but among them RTF-REPTree has the highest accuracy. The results also show that the low canopy density area, i.e., land not under dense forest cover, increased by 9.26% from 1990 to 2017. Moreover, nearly 30% of the forested land falls within the high to very high deforestation probability zone and needs to be protected with immediate measures.
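A sketch of the deforestation-probability modelling step on synthetic pixel data: binary logistic regression and random forest are fit to conditioning factors and compared by AUC and kappa. The rotation-forest/REPTree ensemble of the paper is a Weka method with no scikit-learn equivalent, so it is not reproduced, and the factor values here are random stand-ins.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, cohen_kappa_score

# Synthetic stand-in for pixels labelled deforested / not deforested between two
# FCD dates, with 12 conditioning factors (distance to roads, slope, population, ...).
X, y = make_classification(n_samples=3000, n_features=12, n_informative=8, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=4)

blr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
rf = RandomForestClassifier(n_estimators=300, random_state=4).fit(X_tr, y_tr)

for name, model in {"BLR": blr, "RF": rf}.items():
    p = model.predict_proba(X_te)[:, 1]
    print(f"{name}: AUC={roc_auc_score(y_te, p):.3f}  "
          f"Kappa={cohen_kappa_score(y_te, model.predict(X_te)):.3f}")
```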
61. Hwang S, Shin HK, Shin SE, Seo M, Jeon HN, Yim DE, Kim DH, No KT. PreMetabo: An in silico phase I and II drug metabolism prediction platform. Drug Metab Pharmacokinet 2020; 35:361-367. [PMID: 32616370] [DOI: 10.1016/j.dmpk.2020.05.007]
Abstract
This study aimed to develop a drug metabolism prediction platform using knowledge-based prediction models. Site of Metabolism (SOM) prediction models for four cytochrome P450 (CYP) subtypes were developed along with uridine 5'-diphosphoglucuronosyltransferase (UGT) and sulfotransferase (SULT) substrate classification models. The SOM substrate for a certain CYP was determined using the sum of the activation energy required for the reaction at the reaction site of the substrate and the binding energy of the substrate to the CYP enzyme. Activation energy was calculated using the EaMEAD model and binding energy was calculated by docking simulation. Phase II prediction models were developed to predict whether a molecule is the substrate of a certain phase II conjugate protein, i.e., UGT or SULT. Using SOM prediction models, the predictability of the major metabolite in the top-3 was obtained as 72.5-84.5% for four CYPs, respectively. For internal validation, the accuracy of the UGT and SULT substrate classification model was obtained as 93.94% and 80.68%, respectively. Additionally, for external validation, the accuracy of the UGT substrate classification model was obtained as 81% in the case of 11 FDA-approved drugs. PreMetabo is implemented in a web environment and is available at https://premetabo.bmdrc.kr/.
62. Liu M, Zhang L, Li S, Yang T, Liu L, Zhao J, Liu H. Prediction of hERG potassium channel blockage using ensemble learning methods and molecular fingerprints. Toxicol Lett 2020; 332:88-96. [PMID: 32629073] [DOI: 10.1016/j.toxlet.2020.07.003]
Abstract
The human ether-a-go-go-related gene (hERG) encodes a tetrameric potassium channel called Kv11.1. This channel can be blocked by certain drugs, which leads to long QT syndrome, causing cardiotoxicity. This is a significant problem during drug development. Using computer models to predict compound cardiotoxicity during the early stages of drug design will help to solve this problem. In this study, we used a dataset of 1865 compounds exhibiting known hERG inhibitory activities as a training set. Thirty cardiotoxicity classification models were established using three machine learning algorithms based on molecular fingerprints and molecular descriptors. Using these models as base classifiers, a new cardiotoxicity classification model with better predictive performance was developed using an ensemble learning method. The accuracy of the best base classifier, which was generated using the XGBoost method with molecular descriptors, was 84.8%, and the area under the receiver-operating characteristic curve (AUC) was 0.876 in five-fold cross-validation. However, all of the ensemble models that we developed had higher predictive performance than the base classifiers in five-fold cross-validation. The best predictive performance was achieved by the Ensemble-Top7 model, with an accuracy of 84.9% and an AUC of 0.887. We also tested the ensemble model using external validation data and achieved an accuracy of 85.0% and an AUC of 0.786. Furthermore, we identified several hERG-related substructures, which provide valuable information for designing drug candidates.
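A hedged sketch of the fingerprint-based ensemble workflow: several base classifiers are combined by soft voting and evaluated with five-fold cross-validation. The features are random stand-ins for fingerprint bits (real work would compute them with a cheminformatics toolkit such as RDKit), and scikit-learn's gradient boosting is substituted for the XGBoost learner used in the paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_validate, StratifiedKFold

# Random stand-in for binary fingerprint bits labelled hERG blocker / non-blocker
# (the study used 1865 compounds; a smaller synthetic set keeps this sketch fast).
X, y = make_classification(n_samples=600, n_features=128, n_informative=30, random_state=5)
X = (X > 0).astype(int)  # binarize to mimic fingerprint bits

# Soft-voting ensemble of three base classifiers.
ens = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=300, random_state=5)),
        ("gbm", GradientBoostingClassifier(random_state=5)),
        ("svm", SVC(probability=True, random_state=5)),
    ],
    voting="soft",
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=5)
scores = cross_validate(ens, X, y, cv=cv, scoring=["accuracy", "roc_auc"])
print(f"5-fold accuracy: {scores['test_accuracy'].mean():.3f}"
      f"   AUC: {scores['test_roc_auc'].mean():.3f}")
```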
63. Lopez-Martin M, Nevado A, Carro B. Detection of early stages of Alzheimer's disease based on MEG activity with a randomized convolutional neural network. Artif Intell Med 2020; 107:101924. [PMID: 32828459] [DOI: 10.1016/j.artmed.2020.101924]
Abstract
The early detection of Alzheimer's disease can potentially make eventual treatments more effective. This work presents a deep learning model to detect early symptoms of Alzheimer's disease using synchronization measures obtained with magnetoencephalography. The proposed model is a novel deep learning architecture based on an ensemble of randomized blocks formed by a sequence of 2D-convolutional, batch-normalization and pooling layers. An important challenge is to avoid overfitting, as the number of features is very high (25755) compared to the number of samples (132 patients). To address this issue the model uses an ensemble of identical sub-models all sharing weights, with a final stage that performs an average across sub-models. To facilitate the exploration of the feature space, each sub-model receives a random permutation of features. The features correspond to magnetic signals reflecting neural activity and are arranged in a matrix structure interpreted as a 2D image that is processed by 2D convolutional networks. The proposed detection model is a binary classifier (disease/non-disease), which compared to other deep learning architectures and classic machine learning classifiers, such as random forest and support vector machine, obtains the best classification performance results with an average F1-score of 0.92. To perform the comparison a strict validation procedure is proposed, and a thorough study of results is provided.
64. Kong W, Wang W, An J. Prediction of 5-hydroxytryptamine transporter inhibitors based on machine learning. Comput Biol Chem 2020; 87:107303. [PMID: 32563857] [DOI: 10.1016/j.compbiolchem.2020.107303]
Abstract
In patients with depression, the use of 5-HT reuptake inhibitors can improve the condition. Machine learning methods can be used in ligand-based activity prediction processes. In order to predict SERT inhibitors, SERT inhibitor data from the ChEMBL database were screened and pre-processed. Then, four machine learning methods (LR, SVM, RF, and KNN) and four molecular fingerprints (CDK, Graph, MACCS, and PubChem) were used to build 16 prediction models. The top five models by accuracy (Q) in cross-validation on the training set were used to build three different ensemble learning models. In the test1 set, the VOT_CLF3 model had the largest SP (0.871), Q (0.869), AUC (0.919), and MCC (0.728). In the unbalanced test2 set, VOT_CLF3 had the largest SE (0.857), SP (0.867), Q (0.865), and MCC (0.639). VOT_CLF3 is therefore recommended for the virtual screening of SERT inhibitors. In addition, 12 molecular structural alerts that frequently appear in SERT inhibitors were identified (P < 0.05), which provide an important reference for the design of SERT inhibitors.
65. Liu X, Jin J, Wu W, Herz F. A novel support vector machine ensemble model for estimation of free lime content in cement clinkers. ISA Transactions 2020; 99:479-487. [PMID: 31515089] [DOI: 10.1016/j.isatra.2019.09.003]
Abstract
Free lime (f-CaO) content is a crucial quality parameter for cement clinkers in rotary cement kilns. Due to the lack of hardware sensors, the f-CaO content in cement clinker is mostly obtained by offline laboratory measurement, making timely control difficult or even impossible. In this work, a soft sensor approach, named the support vector machine ensemble (ESVM) model, is proposed to estimate f-CaO content. The process data employed to train and test the model were collected from a cement plant in China, covering a time span of about 30 days. The raw data were preprocessed by filters and time-series matching. The processed data were then clustered by the fuzzy c-means clustering algorithm to capture process features under different operating conditions. For each individual cluster, a base SVM regressor was trained to estimate f-CaO content. Finally, an ensemble model consisting of four base SVM regressors was established to estimate f-CaO content under various process conditions. The effectiveness of the proposed ESVM model was investigated by comparing it with manual measurements and other models available in the literature. The results demonstrate that the proposed ESVM model achieves improvements in model accuracy as well as generalization capability. The proposed ESVM model has broad application potential in the cement production process for automatic monitoring of f-CaO content.
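A sketch of the cluster-then-regress soft-sensor idea on synthetic process data: operating conditions are clustered, one SVR is trained per cluster, and new samples are routed to the model of their nearest cluster. Ordinary k-means with hard assignments stands in for the paper's fuzzy c-means, and the cluster count of four mirrors the four base regressors mentioned in the abstract.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.cluster import KMeans
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in for preprocessed kiln process variables with f-CaO content as target.
X, y = make_regression(n_samples=1500, n_features=9, noise=3.0, random_state=6)
X = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=6)

# Cluster the training operating conditions and train one SVR per cluster.
km = KMeans(n_clusters=4, n_init=10, random_state=6).fit(X_tr)
experts = {}
for c in range(4):
    mask = km.labels_ == c
    experts[c] = SVR(C=10.0).fit(X_tr[mask], y_tr[mask])

# At prediction time, each sample is routed to the SVR of its nearest cluster centre.
assign = km.predict(X_te)
y_hat = np.array([experts[c].predict(x.reshape(1, -1))[0] for c, x in zip(assign, X_te)])
print("MAE:", round(mean_absolute_error(y_te, y_hat), 3))
```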
66. Akpoti K, Kabo-Bah AT, Dossou-Yovo ER, Groen TA, Zwart SJ. Mapping suitability for rice production in inland valley landscapes in Benin and Togo using environmental niche modeling. The Science of the Total Environment 2020; 709:136165. [PMID: 31905543] [DOI: 10.1016/j.scitotenv.2019.136165]
Abstract
Inland valleys (IVs) in Africa are important landscapes for rice cultivation and are targeted by national governments to attain self-sufficiency. Yet, there is limited information on the spatial distribution of IV suitability at the national scale. In the present study, we developed an ensemble model approach to characterize IV suitability for rainfed lowland rice using four machine learning algorithms based on environmental niche modeling (ENM) with presence-only data and background samples, namely Boosted Regression Trees (BRT), the Generalized Linear Model (GLM), Maximum Entropy (MAXENT), and Random Forest (RF). We used a set of predictors grouped under climatic variables, agricultural water productivity and soil water content, soil chemical properties, soil physical properties, vegetation cover, and socio-economic variables. The area under the curve (AUC) evaluation metrics for training and testing were, respectively, 0.999 and 0.873 for BRT, 0.866 and 0.816 for GLM, 0.948 and 0.861 for MAXENT, and 0.911 and 0.878 for RF. Results showed that the proximity of inland valleys to roads and urban centers, elevation, soil water holding capacity, bulk density, vegetation index, gross biomass water productivity, precipitation of the wettest quarter, isothermality, annual precipitation, and total phosphorus, among others, were major predictors of IV suitability for rainfed lowland rice. Suitable IV areas were estimated at 155,000-225,000 ha in Togo and 351,000-406,000 ha in Benin. We estimated that 53.8% of the suitable IV area in Togo and 60.1% of the suitable IV area in Benin would be needed to attain rice self-sufficiency. These results demonstrate the effectiveness of an ensemble environmental niche modeling approach that combines the strengths of several models.
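The presence-only/background setup used by this ENM ensemble can be sketched as below, under clearly stated assumptions: the covariates are synthetic, presence points and randomly drawn background points form a binary response, scikit-learn analogues of BRT, GLM, and RF are fitted, and their suitability surfaces are averaged with equal weights; MAXENT has no scikit-learn counterpart and is omitted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in: environmental covariates at rice-presence points and at
# randomly drawn background points (presence-only niche modelling setup).
rng = np.random.default_rng(11)
X_presence = rng.normal(loc=0.5, scale=1.0, size=(400, 10))
X_background = rng.normal(loc=0.0, scale=1.0, size=(2000, 10))
X = np.vstack([X_presence, X_background])
y = np.concatenate([np.ones(400), np.zeros(2000)])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=11)

models = {
    "BRT": GradientBoostingClassifier(random_state=11),
    "GLM": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=300, random_state=11),
}
suitability = []
for name, m in models.items():
    m.fit(X_tr, y_tr)
    p = m.predict_proba(X_te)[:, 1]
    suitability.append(p)
    print(f"{name}: test AUC = {roc_auc_score(y_te, p):.3f}")

# Unweighted ensemble of the per-model suitability surfaces.
print("Ensemble AUC:", round(roc_auc_score(y_te, np.mean(suitability, axis=0)), 3))
```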
67. iPseU-Layer: Identifying RNA Pseudouridine Sites Using Layered Ensemble Model. Interdiscip Sci 2020; 12:193-203. [PMID: 32170573] [DOI: 10.1007/s12539-020-00362-y]
Abstract
Pseudouridine represents one of the most prevalent post-transcriptional RNA modifications. The identification of pseudouridine sites is an essential step toward understanding RNA function, RNA structure stabilization, the translation process, and RNA stability; however, high-throughput experimental techniques remain expensive and time-consuming in laboratory and biochemical work. Thus, developing an efficient machine learning-based pseudouridine site identification method is very important for both academic research and drug development. Motivated by this, we present an effective layered ensemble model, designated iPseU-Layer, for the identification of RNA pseudouridine sites. The proposed iPseU-Layer approach is essentially based on three machine learning layers: a feature selection layer, a feature extraction and fusion layer, and a prediction layer. The feature selection layer reduces the dimensionality, which can be regarded as a data pre-processing stage. The feature extraction and fusion layer utilizes an ensemble method, implemented through various machine learning algorithms, to generate intermediate outputs. The prediction layer applies a classic random forest to produce the final results. Furthermore, we systematically conduct validation experiments using cross-validation and independent tests against current state-of-the-art models. The proposed iPseU-Layer provides promising predictive performance in terms of sensitivity, specificity, accuracy, and Matthews correlation coefficient. Collectively, these findings indicate that the iPseU-Layer framework is a feasible and effective strategy for the prediction of RNA pseudouridine sites.
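A sketch of the three-layer structure the abstract outlines (feature selection, then ensemble feature extraction/fusion, then random-forest prediction), assuming synthetic sequence-derived features; the specific feature encodings and learners of iPseU-Layer are not reproduced.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict, StratifiedKFold
from sklearn.metrics import accuracy_score, matthews_corrcoef

# Synthetic stand-in for encoded RNA sequence windows labelled as pseudouridine site / not.
X, y = make_classification(n_samples=1000, n_features=80, n_informative=20, random_state=7)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)

# Layer 1: feature selection (dimensionality reduction).
X_sel = SelectKBest(f_classif, k=30).fit_transform(X, y)

# Layer 2: feature extraction/fusion - out-of-fold probabilities from several learners.
layer2 = [LogisticRegression(max_iter=1000), SVC(probability=True, random_state=7),
          KNeighborsClassifier()]
meta_features = np.column_stack([
    cross_val_predict(m, X_sel, y, cv=cv, method="predict_proba")[:, 1] for m in layer2
])

# Layer 3: prediction - a random forest on the fused meta-features.
final_pred = cross_val_predict(RandomForestClassifier(n_estimators=300, random_state=7),
                               meta_features, y, cv=cv)
print("ACC:", round(accuracy_score(y, final_pred), 3),
      " MCC:", round(matthews_corrcoef(y, final_pred), 3))
```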
68. Di Q, Amini H, Shi L, Kloog I, Silvern R, Kelly J, Sabath MB, Choirat C, Koutrakis P, Lyapustin A, Wang Y, Mickley LJ, Schwartz J. An ensemble-based model of PM2.5 concentration across the contiguous United States with high spatiotemporal resolution. Environment International 2019; 130:104909. [PMID: 31272018] [PMCID: PMC7063579] [DOI: 10.1016/j.envint.2019.104909]
Abstract
Various approaches have been proposed to model PM2.5 in the recent decade, with satellite-derived aerosol optical depth, land-use variables, chemical transport model predictions, and several meteorological variables as major predictor variables. Our study used an ensemble model that integrated multiple machine learning algorithms and predictor variables to estimate daily PM2.5 at a resolution of 1 km × 1 km across the contiguous United States. We used a generalized additive model that accounted for geographic difference to combine PM2.5 estimates from neural network, random forest, and gradient boosting. The three machine learning algorithms were based on multiple predictor variables, including satellite data, meteorological variables, land-use variables, elevation, chemical transport model predictions, several reanalysis datasets, and others. The model training results from 2000 to 2015 indicated good model performance with a 10-fold cross-validated R2 of 0.86 for daily PM2.5 predictions. For annual PM2.5 estimates, the cross-validated R2 was 0.89. Our model demonstrated good performance up to 60 μg/m3. Using trained PM2.5 model and predictor variables, we predicted daily PM2.5 from 2000 to 2015 at every 1 km × 1 km grid cell in the contiguous United States. We also used localized land-use variables within 1 km × 1 km grids to downscale PM2.5 predictions to 100 m × 100 m grid cells. To characterize uncertainty, we used meteorological variables, land-use variables, and elevation to model the monthly standard deviation of the difference between daily monitored and predicted PM2.5 for every 1 km × 1 km grid cell. This PM2.5 prediction dataset, including the downscaled and uncertainty predictions, allows epidemiologists to accurately estimate the adverse health effect of PM2.5. Compared with model performance of individual base learners, an ensemble model would achieve a better overall estimation. It is worth exploring other ensemble model formats to synthesize estimations from different models or from different groups to improve overall performance.
69. West AM, Jarnevich CS, Young NE, Fuller PL. Evaluating Potential Distribution of High-Risk Aquatic Invasive Species in the Water Garden and Aquarium Trade at a Global Scale Based on Current Established Populations. Risk Analysis 2019; 39:1169-1191. [PMID: 30428498] [DOI: 10.1111/risa.13230]
Abstract
Aquatic non-native invasive species are commonly traded in the worldwide water garden and aquarium markets, and some of these species pose major threats to the economy, the environment, and human health. Understanding the potential suitable habitat for these species at a global scale and at regional scales can inform risk assessments and predict future potential establishment. Typically, global habitat suitability models are fit for freshwater species with only climate variables, which provides little information about suitable terrestrial conditions for aquatic species. Remotely sensed data including topography and land cover data have the potential to improve our understanding of suitable habitat for aquatic species. In this study, we fit species distribution models using five different model algorithms for three non-native aquatic invasive species with bioclimatic, topographic, and remotely sensed covariates to evaluate potential suitable habitat beyond simple climate matches. The species examined included a frog (Xenopus laevis), toad (Bombina orientalis), and snail (Pomacea spp.). Using a unique modeling approach for each species including background point selection based on known established populations resulted in robust ensemble habitat suitability models. All models for all species had test area under the receiver operating characteristic curve values greater than 0.70 and percent correctly classified values greater than 0.65. Importantly, we employed multivariate environmental similarity surface maps to evaluate potential extrapolation beyond observed conditions when applying models globally. These global models provide necessary forecasts of where these aquatic invasive species have the potential for establishment outside their native range, a key component in risk analyses.
70. Shang Z, Deng T, He J, Duan X. A novel model for hourly PM2.5 concentration prediction based on CART and EELM. The Science of the Total Environment 2019; 651:3043-3052. [PMID: 30463154] [DOI: 10.1016/j.scitotenv.2018.10.193]
Abstract
Hourly PM2.5 concentrations have multiple change patterns. For hourly PM2.5 concentration prediction, it is beneficial to split the whole dataset into several subsets with similar properties and to train a local prediction model for each subset. However, the methods based on local models need to solve the global-local duality. In this study, a novel prediction model based on classification and regression tree (CART) and ensemble extreme learning machine (EELM) methods is developed to split the dataset into subsets in a hierarchical fashion and build a prediction model for each leaf. Firstly, CART is used to split the dataset by constructing a shallow hierarchical regression tree. Then at each node of the tree, EELM models are built using the training samples of the node, and hidden neuron numbers are selected to minimize validation errors respectively on the leaves of a sub-tree that takes the node as the root. Finally, for each leaf of the tree, a global and several local EELMs on the path from the root to the leaf are compared, and the one with the smallest validation error on the leaf is chosen. The meteorological data of Yancheng urban area and the air pollutant concentration data from City Monitoring Centre are used to evaluate the method developed. The experimental results demonstrate that the method developed addresses the global-local duality, having better performance than global models including random forest (RF), v-support vector regression (v-SVR) and EELM, and other local models based on season and k-means clustering. The new model has improved the capability of treating multiple change patterns.
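A sketch of the global-local idea described in this abstract, on synthetic data: a shallow CART partitions the samples, one local model is trained per leaf, and for each leaf the better of the global and local model on validation data is kept. Ridge regression stands in for the ensemble extreme learning machine, which has no scikit-learn implementation, and the tree depth and leaf size are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for hourly meteorology + pollutant predictors with PM2.5 as target.
X, y = make_regression(n_samples=4000, n_features=8, noise=10.0, random_state=8)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=8)

# Step 1: a shallow CART splits the data into leaves (subsets with similar behaviour).
cart = DecisionTreeRegressor(max_depth=2, min_samples_leaf=200, random_state=8).fit(X_tr, y_tr)
leaves_tr, leaves_va = cart.apply(X_tr), cart.apply(X_va)

# Step 2: one global model and one local model per leaf.
global_model = Ridge().fit(X_tr, y_tr)
local_models = {leaf: Ridge().fit(X_tr[leaves_tr == leaf], y_tr[leaves_tr == leaf])
                for leaf in np.unique(leaves_tr)}

# Step 3: per leaf, keep whichever of the global or local model validates better.
y_hat = np.empty_like(y_va)
for leaf in np.unique(leaves_va):
    m = leaves_va == leaf
    cands = [global_model] + ([local_models[leaf]] if leaf in local_models else [])
    errs = [mean_squared_error(y_va[m], c.predict(X_va[m])) for c in cands]
    y_hat[m] = cands[int(np.argmin(errs))].predict(X_va[m])
print("validation RMSE:", round(float(np.sqrt(mean_squared_error(y_va, y_hat))), 2))
```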
71. Guo P, Zhang Q, Chen Y, Xiao J, He J, Zhang Y, Wang L, Liu T, Ma W. An ensemble forecast model of dengue in Guangzhou, China using climate and social media surveillance data. The Science of the Total Environment 2019; 647:752-762. [PMID: 30092532] [DOI: 10.1016/j.scitotenv.2018.08.044]
Abstract
BACKGROUND China experienced an unprecedented outbreak of dengue in 2014, and the number of dengue cases reached the highest level in the past 25 years. There is a significant delay in the release of official case count data, and our ability to track the timing and magnitude of local dengue outbreaks in a timely manner remains limited. MATERIAL AND METHODS We developed an ensemble penalized regression algorithm (EPRA) for initializing near-real-time forecasts of the dengue epidemic trajectory by integrating different penalties (LASSO, Ridge, Elastic Net, SCAD, and MCP) with iterative sampling and model averaging. Multiple streams of near-real-time data, including dengue-related Baidu searches, Sina Weibo posts, and climatic conditions, together with historical dengue incidence, were used. We compared the predictive power of the EPRA with the alternatives (penalized regression models using single penalties) for retrospectively forecasting weekly dengue incidence and detecting outbreak occurrence, defined using different cutoffs, during 2011-2016 in Guangzhou, south China. RESULTS The EPRA showed the best or at least comparable performance for 1- and 2-week-ahead out-of-sample and leave-one-out cross-validation forecasts. The findings indicate that skillful near-real-time forecasts of dengue, with confidence in those predictions, can be made. For detecting dengue outbreaks, the EPRA predicted periods of high dengue incidence more accurately than the alternatives. CONCLUSION This study developed a statistically rigorous approach for near-real-time forecasting of dengue in China. The EPRA provides skillful forecasts and can be used as a timely and complementary way to assess dengue dynamics, which will help design interventions to mitigate dengue transmission.
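A sketch of an EPRA-style ensemble, assuming synthetic weekly data: penalized regressions are refit on bootstrap resamples and their forecasts are averaged. Only LASSO, Ridge, and Elastic Net are included, since SCAD and MCP penalties are not available in scikit-learn, and the penalty strengths and number of resamples are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in for weekly predictors (lagged incidence, Baidu search index,
# Weibo posts, temperature, rainfall) with weekly dengue incidence as the target.
X, y = make_regression(n_samples=300, n_features=15, noise=4.0, random_state=9)
X_tr, y_tr, X_te, y_te = X[:250], y[:250], X[250:], y[250:]

def penalized_models():
    # Fresh model instances for each resample.
    return [Lasso(alpha=0.5), Ridge(alpha=1.0), ElasticNet(alpha=0.5)]

rng = np.random.default_rng(9)
preds = []
for _ in range(50):                      # iterative (bagging-style) resampling
    idx = rng.integers(0, len(X_tr), len(X_tr))
    for model in penalized_models():
        preds.append(model.fit(X_tr[idx], y_tr[idx]).predict(X_te))

# Model averaging across penalties and resamples gives the ensemble forecast.
forecast = np.mean(preds, axis=0)
print("MAE:", round(mean_absolute_error(y_te, forecast), 2))
```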
72. Ensemble Technique for Prediction of T-cell Mycobacterium tuberculosis Epitopes. Interdiscip Sci 2018; 11:611-627. [PMID: 30406342] [DOI: 10.1007/s12539-018-0309-0]
Abstract
Development of an effective machine-learning model for T-cell Mycobacterium tuberculosis (M. tuberculosis) epitopes is beneficial for saving biologists' time and effort in identifying epitopes in a targeted antigen. Existing servers such as NetMHC 2.2, NetMHC 2.3, NetMHC 3.0, and NetMHC 4.0 estimate the binding capacity of peptides. It is still a challenge for those servers to predict whether a given peptide is an M. tuberculosis epitope or a non-epitope. One server, CTLpred, works in this category, but it is limited to peptides of length 9-mers. Therefore, in this work a direct method of predicting M. tuberculosis epitopes or non-epitopes is proposed, which also overcomes the limitations of the above servers. The proposed method is able to work with variable-length epitopes, even those longer than 9-mers. Identification of T-cell or B-cell epitopes in the targeted antigen is the main goal in designing epitope-based vaccines, immunodiagnostic tests, and antibody production. Therefore, it is important to introduce a reliable system that may help in the diagnosis of M. tuberculosis. In the present study, computational intelligence methods are used to classify T-cell M. tuberculosis epitopes. The caret feature selection approach is used to find the set of relevant features. The ensemble model is designed by combining three models and is used to predict M. tuberculosis epitopes of variable length (7-40-mers). The proposed ensemble model achieves 82.0% accuracy, 0.89 specificity, and 0.77 sensitivity, with an average accuracy of 80.61% under repeated k-fold cross-validation. The proposed ensemble model has been validated and compared with the NetMHC 2.3 and NetMHC 4.0 servers and the CTLpred T-cell prediction server.
73. Chen W, Li H, Hou E, Wang S, Wang G, Panahi M, Li T, Peng T, Guo C, Niu C, Xiao L, Wang J, Xie X, Ahmad BB. GIS-based groundwater potential analysis using novel ensemble weights-of-evidence with logistic regression and functional tree models. The Science of the Total Environment 2018; 634:853-867. [PMID: 29653429] [DOI: 10.1016/j.scitotenv.2018.04.055]
Abstract
The aim of the current study was to produce groundwater spring potential maps using novel ensemble weights-of-evidence (WoE) with logistic regression (LR) and functional tree (FT) models. First, a total of 66 springs were identified by field surveys, out of which 70% of the spring locations were used for training the models and 30% of the spring locations were employed for the validation process. Second, a total of 14 affecting factors including aspect, altitude, slope, plan curvature, profile curvature, stream power index (SPI), topographic wetness index (TWI), sediment transport index (STI), lithology, normalized difference vegetation index (NDVI), land use, soil, distance to roads, and distance to streams was used to analyze the spatial relationship between these affecting factors and spring occurrences. Multicollinearity analysis and feature selection of the correlation attribute evaluation (CAE) method were employed to optimize the affecting factors. Subsequently, the novel ensembles of the WoE, LR, and FT models were constructed using the training dataset. Finally, the receiver operating characteristic (ROC) curves, standard error, confidence interval (CI) at 95%, and significance level P were employed to validate and compare the performance of three models. Overall, all three models performed well for groundwater spring potential evaluation. The prediction capability of the FT model, with the highest AUC values, the smallest standard errors, the narrowest CIs, and the smallest P values for the training and validation datasets, is better compared to those of other models. The groundwater spring potential maps can be adopted for the management of water resources and land use by planners and engineers.
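A sketch of the weights-of-evidence (WoE) encoding coupled with logistic regression, assuming synthetic categorical conditioning factors and spring/non-spring labels; the functional-tree variant and the paper's 14 real factors are not reproduced, and the smoothing constant is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in: each row is a raster cell with categorical conditioning factors
# (lithology class, land-use class, soil class) and a spring / non-spring label.
rng = np.random.default_rng(10)
n = 2000
df = pd.DataFrame({
    "lithology": rng.integers(0, 5, n),
    "landuse": rng.integers(0, 4, n),
    "soil": rng.integers(0, 3, n),
})
df["spring"] = (rng.random(n) < 0.05 + 0.05 * (df["lithology"] == 2)).astype(int)

def woe_encode(series, target, eps=0.5):
    """Weight of evidence per class: ln(share among springs / share among non-springs)."""
    tab = pd.crosstab(series, target) + eps          # Laplace smoothing
    share_pos = tab[1] / tab[1].sum()
    share_neg = tab[0] / tab[0].sum()
    return series.map(np.log(share_pos / share_neg))

X = np.column_stack([woe_encode(df[c], df["spring"]) for c in ["lithology", "landuse", "soil"]])

# Ensemble step: the WoE-transformed layers feed a logistic regression, mirroring
# the WoE-LR coupling described in the abstract.
lr = LogisticRegression().fit(X, df["spring"])
print("training AUC:", round(roc_auc_score(df["spring"], lr.predict_proba(X)[:, 1]), 3))
```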
74. Wang Q, Xie Z, Li F. Using ensemble models to identify and apportion heavy metal pollution sources in agricultural soils on a local scale. Environmental Pollution 2015; 206:227-235. [PMID: 26188913] [DOI: 10.1016/j.envpol.2015.06.040]
Abstract
This study aims to identify and apportion multi-source and multi-phase heavy metal pollution from natural and anthropogenic inputs using ensemble models that include stochastic gradient boosting (SGB) and random forest (RF) in agricultural soils on the local scale. The heavy metal pollution sources were quantitatively assessed, and the results illustrated the suitability of the ensemble models for the assessment of multi-source and multi-phase heavy metal pollution in agricultural soils on the local scale. The results of SGB and RF consistently demonstrated that anthropogenic sources contributed the most to the concentrations of Pb and Cd in agricultural soils in the study region and that SGB performed better than RF.