1. Sadaiappan B, Balakrishnan P, C.R. V, Vijayan NT, Subramanian M, Gauns MU. Applications of Machine Learning in Chemical and Biological Oceanography. ACS Omega 2023; 8:15831-15853. [PMID: 37179641 PMCID: PMC10173431 DOI: 10.1021/acsomega.2c06441] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/05/2022] [Accepted: 02/22/2023] [Indexed: 05/15/2023]
Abstract
Machine learning (ML) refers to computer algorithms that predict a meaningful output or categorize complex systems based on large amounts of data. ML is applied in various areas, including natural science, engineering, space exploration, and even game development. This review focuses on the use of ML in chemical and biological oceanography. ML is a promising tool for predicting global fixed nitrogen levels, the partial pressure of carbon dioxide, and other chemical properties. In biological oceanography, ML is also used to detect planktonic forms from various images (i.e., microscopy, FlowCAM, and video recorders), spectrometers, and other signal processing techniques. Moreover, ML has successfully classified mammals by their acoustics and detected endangered mammalian and fish species in specific environments. Most importantly, using environmental data, ML has proved to be an effective method for predicting hypoxic conditions and harmful algal bloom events, which is essential for environmental monitoring. Furthermore, ML has been used to construct a number of databases for various species that will be useful to other researchers, and the creation of new algorithms will help the marine research community better comprehend the chemistry and biology of the ocean.
Affiliation(s)
- Balamurugan Sadaiappan
  - Department of Biology, United Arab Emirates University, Al Ain 971, UAE
  - Plankton Laboratory, Biological Oceanography Division, CSIR-National Institute of Oceanography, Dona Paula, Goa 403004, India
- Preethiya Balakrishnan
  - Faraday-Fleming Laboratory, London W148TL, United Kingdom
  - University of London, London WC1E 7HU, United Kingdom
- Vishal C.R.
  - Plankton Laboratory, Biological Oceanography Division, CSIR-National Institute of Oceanography, Dona Paula, Goa 403004, India
- Neethu T. Vijayan
  - Plankton Laboratory, Biological Oceanography Division, CSIR-National Institute of Oceanography, Dona Paula, Goa 403004, India
- Mahendran Subramanian
  - Faraday-Fleming Laboratory, London W148TL, United Kingdom
  - Department of Computing, Imperial College, London SW7 2AZ, United Kingdom
- Mangesh U. Gauns
  - Plankton Laboratory, Biological Oceanography Division, CSIR-National Institute of Oceanography, Dona Paula, Goa 403004, India
2. Lynn TF, Ottino JM, Lueptow RM, Umbanhowar PB. Potentialities and limitations of machine learning to solve cut-and-shuffle mixing problems: A case study. Chem Eng Sci 2022. [DOI: 10.1016/j.ces.2022.117840] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
3. Zhu X, Hu J, Xiao T, Huang S, Wen Y, Shang D. An interpretable stacking ensemble learning framework based on multi-dimensional data for real-time prediction of drug concentration: The example of olanzapine. Front Pharmacol 2022; 13:975855. [PMID: 36238557 PMCID: PMC9552071 DOI: 10.3389/fphar.2022.975855] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 06/22/2022] [Accepted: 09/05/2022] [Indexed: 11/13/2022]
Abstract
Background and Aim: Therapeutic drug monitoring (TDM) has evolved over the years as an important tool for personalized medicine. Nevertheless, some limitations are associated with traditional TDM. Emerging data-driven model forecasting [e.g., through machine learning (ML)-based approaches] has been used for individualized therapy. This study proposes an interpretable stacking-based ML framework to predict concentrations in real time after olanzapine (OLZ) treatment. Methods: The TDM-OLZ dataset, consisting of 2,142 OLZ measurements and 472 features, was formed by collecting electronic health records during the TDM of 927 patients who had received OLZ treatment. We compared the performance of ML algorithms by using 10-fold cross-validation and the mean absolute error (MAE). The optimal subset of features was analyzed by a random forest-based sequential forward feature selection method in the context of the top five heterogeneous regressors as base models to develop a stacked ensemble regressor, which was then optimized via the grid search method. Its predictions were explained by using local interpretable model-agnostic explanations (LIME) and partial dependence plots (PDPs). Results: A state-of-the-art stacking ensemble learning framework that integrates optimized extra trees, XGBoost, random forest, bagging, and gradient-boosting regressors was developed for nine selected features [i.e., daily dose (OLZ), gender_male, age, valproic acid_yes, ALT, K, BW, MONO#, and time of blood sampling after first administration]. It outperformed other base regressors that were considered, with an MAE of 0.064, R-square value of 0.5355, mean squared error of 0.0089, mean relative error of 13%, and ideal rate (the percentages of predicted TDM within ± 30% of actual TDM) of 63.40%. Predictions at the individual level were illustrated by LIME plots, whereas the global interpretation of associations between features and outcomes was illustrated by PDPs. 
Conclusion: This study highlights the feasibility of the real-time estimation of drug concentrations by using stacking-based ML strategies without losing interpretability, thus facilitating model-informed precision dosing.
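The error measures quoted in this abstract can be made concrete with a small sketch. The following Python is illustrative only and is not the authors' code: the function names and the sample values are hypothetical, and the "ideal rate" is implemented from the abstract's wording (the percentage of predictions within ± 30% of the actual value).

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between measured and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def ideal_rate(actual, predicted, tolerance=0.30):
    """Percentage of predictions within +/- tolerance (as a fraction) of the actual value."""
    within = sum(
        1 for a, p in zip(actual, predicted) if abs(p - a) <= tolerance * abs(a)
    )
    return 100.0 * within / len(actual)


# Hypothetical concentrations, chosen only to exercise the functions.
actual = [10.0, 20.0, 40.0, 50.0]
predicted = [12.0, 27.0, 38.0, 30.0]

print(mean_absolute_error(actual, predicted))
print(ideal_rate(actual, predicted))
```

Reporting both numbers is useful because MAE is scale-dependent, whereas the ideal rate expresses how often a prediction lands inside a relative tolerance band.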
Affiliation(s)
- Xiuqing Zhu
  - Department of Pharmacy, The Affiliated Brain Hospital of Guangzhou Medical University, Guangzhou, China
  - Guangdong Engineering Technology Research Center for Translational Medicine of Mental Disorders, Guangzhou, China
- Jinqing Hu
  - Department of Pharmacy, The Affiliated Brain Hospital of Guangzhou Medical University, Guangzhou, China
  - Guangdong Engineering Technology Research Center for Translational Medicine of Mental Disorders, Guangzhou, China
- Tao Xiao
  - Department of Pharmacy, The Affiliated Brain Hospital of Guangzhou Medical University, Guangzhou, China
  - Department of Clinical Research, Guangdong Second Provincial General Hospital, Guangzhou, China
- Shanqing Huang
  - Department of Pharmacy, The Affiliated Brain Hospital of Guangzhou Medical University, Guangzhou, China
  - Guangdong Engineering Technology Research Center for Translational Medicine of Mental Disorders, Guangzhou, China
- Yuguan Wen
  - Department of Pharmacy, The Affiliated Brain Hospital of Guangzhou Medical University, Guangzhou, China
  - Guangdong Engineering Technology Research Center for Translational Medicine of Mental Disorders, Guangzhou, China
- Dewei Shang
  - Department of Pharmacy, The Affiliated Brain Hospital of Guangzhou Medical University, Guangzhou, China
  - Guangdong Engineering Technology Research Center for Translational Medicine of Mental Disorders, Guangzhou, China
*Correspondence: Yuguan Wen; Dewei Shang
4. Automation of species-specific cyanobacteria phycocyanin fluorescence compensation using machine learning classification. Ecol Inform 2022. [DOI: 10.1016/j.ecoinf.2022.101669] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/18/2022]
5. Bourel M, Segura AM, Crisci C, López G, Sampognaro L, Vidal V, Kruk C, Piccini C, Perera G. Machine learning methods for imbalanced data set for prediction of faecal contamination in beach waters. Water Research 2021; 202:117450. [PMID: 34352535 DOI: 10.1016/j.watres.2021.117450] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 12/11/2020] [Revised: 07/09/2021] [Accepted: 07/15/2021] [Indexed: 06/13/2023]
Abstract
Predicting water contamination with statistical models is a useful tool for managing health risk at recreational beaches. Extreme contamination events, i.e. those exceeding normative limits, are generally rare with respect to bathing conditions, and thus the data are said to be imbalanced. Modeling and predicting those rare events presents unique challenges. Here we introduce and evaluate several machine learning techniques and metrics to model imbalanced data and evaluate model performance. We do so by using a) simulated data sets and b) a real database with records of faecal coliform abundance monitored for 10 years at 21 recreational beaches in Uruguay (N ≈ 19000) using in situ and meteorological variables. We discuss advantages and disadvantages of the methods and provide a simple guide to fitting models for a general audience. We also provide R code to reproduce model fitting and testing. We found that most machine learning techniques are sensitive to imbalance and require specific data pre-treatment (e.g. upsampling) to improve performance. Accuracy (i.e. correctly classified cases over total cases) is not adequate to evaluate model performance on imbalanced data sets. Instead, true positive rates (TPR) and false positive rates (FPR) are recommended. Among the 52 candidate algorithms tested, the stratified random forest showed the best performance, improving TPR by 50% with respect to baseline (0.4), and outperformed baseline in the evaluated metrics. Support vector machines combined with upsampling or the synthetic minority oversampling technique (SMOTE) performed well, similar to AdaBoost with SMOTE. These results suggest that combining modeling strategies is necessary to improve our capacity to anticipate water contamination and avoid health risk.
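The abstract's point about accuracy on imbalanced data can be illustrated with a short sketch. This is generic Python, not the R code the authors provide; the function names (`rates`, `upsample_minority`) and the toy labels are hypothetical, and the upsampling shown is plain random duplication of minority cases rather than SMOTE.

```python
import random


def rates(y_true, y_pred):
    """Accuracy, true positive rate and false positive rate for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
    }


def upsample_minority(X, y, seed=0):
    """Duplicate randomly chosen minority-class rows until both classes are equal in size."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    idx = list(range(len(y))) + extra
    return [X[i] for i in idx], [y[i] for i in idx]


# Toy data: 5 contamination events among 100 samples (assumed values).
y_true = [1] * 5 + [0] * 95
always_negative = [0] * 100  # a "model" that never predicts an event

print(rates(y_true, always_negative))

X = [[float(i)] for i in range(100)]
X_up, y_up = upsample_minority(X, y_true)
print(sum(y_up), len(y_up) - sum(y_up))
```

A classifier that never flags an event scores high accuracy here while detecting nothing, which is why TPR and FPR are the more informative pair. Upsampling should be applied to the training portion only, so that duplicated cases do not leak into the evaluation data.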
Affiliation(s)
- Mathias Bourel
  - IMERL, Facultad de Ingeniería, Universidad de la República, Montevideo, Uruguay
  - Departamento de Modelización Estadística de Datos e Inteligencia Artificial (MEDIA), Centro Universitario Regional Este, Universidad de la República, Rocha, Uruguay
- Angel M Segura
  - Departamento de Modelización Estadística de Datos e Inteligencia Artificial (MEDIA), Centro Universitario Regional Este, Universidad de la República, Rocha, Uruguay
- Carolina Crisci
  - Departamento de Modelización Estadística de Datos e Inteligencia Artificial (MEDIA), Centro Universitario Regional Este, Universidad de la República, Rocha, Uruguay
- Guzmán López
  - Departamento de Modelización Estadística de Datos e Inteligencia Artificial (MEDIA), Centro Universitario Regional Este, Universidad de la República, Rocha, Uruguay
- Lia Sampognaro
  - Departamento de Modelización Estadística de Datos e Inteligencia Artificial (MEDIA), Centro Universitario Regional Este, Universidad de la República, Rocha, Uruguay
- Victoria Vidal
  - Departamento de Modelización Estadística de Datos e Inteligencia Artificial (MEDIA), Centro Universitario Regional Este, Universidad de la República, Rocha, Uruguay
- Carla Kruk
  - Departamento de Modelización Estadística de Datos e Inteligencia Artificial (MEDIA), Centro Universitario Regional Este, Universidad de la República, Rocha, Uruguay
  - Departamento de Microbiología, Instituto de Investigaciones Biológicas Clemente Estable, Ministerio de Educación y Cultura, Montevideo, Uruguay
  - Instituto de Ecología y Ciencias Ambientales, Facultad de Ciencias, Universidad de la República, Montevideo, Uruguay
- Claudia Piccini
  - Departamento de Modelización Estadística de Datos e Inteligencia Artificial (MEDIA), Centro Universitario Regional Este, Universidad de la República, Rocha, Uruguay
  - Departamento de Microbiología, Instituto de Investigaciones Biológicas Clemente Estable, Ministerio de Educación y Cultura, Montevideo, Uruguay
- Gonzalo Perera
  - Departamento de Modelización Estadística de Datos e Inteligencia Artificial (MEDIA), Centro Universitario Regional Este, Universidad de la República, Rocha, Uruguay
6. Tamvakis A, Tsirtsis G, Karydis M, Patsidis K, Kokkoris GD. Drivers of harmful algal blooms in coastal areas of Eastern Mediterranean: a machine learning methodological approach. Mathematical Biosciences and Engineering: MBE 2021; 18:6484-6505. [PMID: 34517542 DOI: 10.3934/mbe.2021322] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/13/2023]
Abstract
Harmful algal species are present in the Mediterranean Sea and are often associated with toxic events affecting the nearby coastal zones. The presence of 18 marine microalgae, at genus level, associated with potentially harmful characteristics was predicted using a number of machine learning techniques based exclusively on a small set of abiotic variables already identified as drivers of blooms. The Random Forest (RF) algorithm achieved the best predictive performance, correctly identifying the presence of most genera with a mean of 89.2% of total samples. Although RF showed lower predictive performance for genera present in a low number of samples, its predictive power remains at least "fair" in these cases. The main tree-based advantage of RF was thereafter used to assess the importance of the input variables in predicting the presence of the algal genera. Temperature had the most powerful effect on the presence of genera, although this effect varies among genera. Finally, the genera were clustered based on their response to the considered abiotic variables, and common trends in an ecological context were identified.
Affiliation(s)
- Androniki Tamvakis
  - Department of Marine Sciences, Faculty of Environment, University of the Aegean, University Hill, GR81100, Mytilene, Greece
- George Tsirtsis
  - Department of Marine Sciences, Faculty of Environment, University of the Aegean, University Hill, GR81100, Mytilene, Greece
- Michael Karydis
  - Department of Marine Sciences, Faculty of Environment, University of the Aegean, University Hill, GR81100, Mytilene, Greece
- Kleanthis Patsidis
  - Department of Marine Sciences, Faculty of Environment, University of the Aegean, University Hill, GR81100, Mytilene, Greece
- Giorgos D Kokkoris
  - Department of Marine Sciences, Faculty of Environment, University of the Aegean, University Hill, GR81100, Mytilene, Greece
7. Gawriljuk VO, Foil DH, Puhl AC, Zorn KM, Lane TR, Riabova O, Makarov V, Godoy AS, Oliva G, Ekins S. Development of Machine Learning Models and the Discovery of a New Antiviral Compound against Yellow Fever Virus. J Chem Inf Model 2021; 61:3804-3813. [PMID: 34286575 DOI: 10.1021/acs.jcim.1c00460] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Indexed: 12/14/2022]
Abstract
Yellow fever (YF) is an acute viral hemorrhagic disease transmitted by infected mosquitoes. Large epidemics of YF occur when the virus is introduced into heavily populated areas with high mosquito density and low vaccination coverage. The lack of a specific small molecule drug treatment against YF, as well as for homologous infections such as Zika and dengue, highlights the importance of these flaviviruses as a public health concern. With advances in computer hardware and bioactivity data availability, new tools based on machine learning methods have been introduced into drug discovery as a means to utilize the growing high-throughput screening (HTS) data to reduce costs and increase the speed of drug development. Predictive machine learning models built on previously published data from HTS campaigns or data available in public databases can enable the selection of compounds with desirable bioactivity and absorption, distribution, metabolism, and excretion profiles. In this study, we have collated cell-based assay data for yellow fever virus from the literature and public databases. The data were used to build predictive models with several machine learning methods that could prioritize compounds for in vitro testing. Five molecules were prioritized and tested in vitro, from which we identified a new pyrazolesulfonamide derivative with EC50 3.2 μM and CC50 24 μM, which represents a new scaffold suitable for hit-to-lead optimization that can expand the available drug discovery candidates for YF.
Affiliation(s)
- Victor O Gawriljuk
  - São Carlos Institute of Physics, University of São Paulo, Av. João Dagnone, 1100 - Santa Angelina, São Carlos, São Paulo 13563-120, Brazil
- Daniel H Foil
  - Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, North Carolina 27606, United States
- Ana C Puhl
  - Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, North Carolina 27606, United States
- Kimberley M Zorn
  - Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, North Carolina 27606, United States
- Thomas R Lane
  - Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, North Carolina 27606, United States
- Olga Riabova
  - Research Center of Biotechnology RAS, Leninsky Prospekt 33-2, 119071 Moscow, Russia
- Vadim Makarov
  - Research Center of Biotechnology RAS, Leninsky Prospekt 33-2, 119071 Moscow, Russia
- Andre S Godoy
  - São Carlos Institute of Physics, University of São Paulo, Av. João Dagnone, 1100 - Santa Angelina, São Carlos, São Paulo 13563-120, Brazil
- Glaucius Oliva
  - São Carlos Institute of Physics, University of São Paulo, Av. João Dagnone, 1100 - Santa Angelina, São Carlos, São Paulo 13563-120, Brazil
- Sean Ekins
  - Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, North Carolina 27606, United States
8. Prediction of Chlorophyll-a Concentrations in the Nakdong River Using Machine Learning Methods. Water 2020. [DOI: 10.3390/w12061822] [Citation(s) in RCA: 26] [Impact Index Per Article: 6.5] [Indexed: 12/24/2022]
Abstract
Many studies have attempted to predict chlorophyll-a concentrations using multiple regression models and validating them with a hold-out technique. In this study, commonly used machine learning models, such as Support Vector Regression, Bagging, Random Forest, Extreme Gradient Boosting (XGBoost), Recurrent Neural Network (RNN), and Long Short-Term Memory (LSTM), are used to build a new model to predict chlorophyll-a concentrations in the Nakdong River, Korea. We employed 1-step-ahead recursive prediction to reflect the characteristics of the time series data. To increase the prediction accuracy, the model construction was based on forward variable selection. The fitted models were validated by means of cumulative learning and rolling window learning, as opposed to the hold-out technique. The best results were obtained when the chlorophyll-a concentration was predicted by combining the RNN model with the rolling window learning method. The results suggest that the selection of explanatory variables and 1-step-ahead recursive prediction in the machine learning model are important processes for improving its prediction performance.
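The two validation schemes named in this abstract differ only in which past observations are used to fit the model before each 1-step-ahead prediction. The sketch below shows one common way to generate the index splits; it is an assumption about the general technique, not the paper's exact procedure, and the function names and the `initial`/`window` parameters are hypothetical.

```python
def cumulative_splits(n, initial):
    """Cumulative learning: train on every observation up to time t, predict t."""
    for t in range(initial, n):
        yield list(range(0, t)), t


def rolling_splits(n, window):
    """Rolling window learning: train on the last `window` observations, predict t."""
    for t in range(window, n):
        yield list(range(t - window, t)), t


# Six time steps, with three observations available before the first forecast.
for train_idx, test_idx in cumulative_splits(6, initial=3):
    print("cumulative", train_idx, "->", test_idx)

for train_idx, test_idx in rolling_splits(6, window=3):
    print("rolling   ", train_idx, "->", test_idx)
```

In both schemes the test index always lies after every training index, so no future information reaches the model. The rolling window discards older observations, which can help when the relationship between predictors and the target drifts over time.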
9. Muñoz-Mas R, Gil-Martínez E, Oliva-Paterna FJ, Belda EJ, Martínez-Capel F. Tree-based ensembles unveil the microhabitat suitability for the invasive bleak (Alburnus alburnus L.) and pumpkinseed (Lepomis gibbosus L.): Introducing XGBoost to eco-informatics. Ecol Inform 2019. [DOI: 10.1016/j.ecoinf.2019.100974] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Indexed: 11/28/2022]