1
Xu Y, Liu X, Xia W, Ge J, Ju CW, Zhang H, Zhang JZH. ChemXTree: A Feature-Enhanced Graph Neural Network-Neural Decision Tree Framework for ADMET Prediction. J Chem Inf Model 2024. [PMID: 39497657 DOI: 10.1021/acs.jcim.4c01186] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/07/2024]
Abstract
The rapid progression of machine learning, especially deep learning (DL), has catalyzed a new era in drug discovery, introducing innovative approaches for predicting molecular properties. Despite the many methods available for feature representation, efficiently utilizing rich, high-dimensional information remains a significant challenge. Our work introduces ChemXTree, a novel graph-based model that integrates a Gate Modulation Feature Unit (GMFU) and neural decision tree (NDT) in the output layer to address this challenge. Extensive evaluations on benchmark data sets, including MoleculeNet and eight additional drug databases, have demonstrated ChemXTree's superior performance, surpassing or matching the current state-of-the-art models. Visualization techniques clearly demonstrate that ChemXTree significantly improves the separation between substrates and nonsubstrates in the latent space. In summary, ChemXTree demonstrates a promising approach for integrating advanced feature extraction with neural decision trees, offering significant improvements in predictive accuracy for drug discovery tasks and opening new avenues for optimizing molecular properties.
Affiliation(s)
- Yuzhi Xu
- Shanghai Frontiers Science Center of Artificial Intelligence and Deep Learning and NYU-ECNU Center for Computational Chemistry, NYU Shanghai, Shanghai 200062, China
- Department of Chemistry, New York University, New York, New York 10003, United States
- Xinxin Liu
- Department of Computer and Information Science, University of Pennsylvania, Philadelphia, Pennsylvania 19104, United States
- Department of Materials Science and Engineering, University of Pennsylvania, Philadelphia, Pennsylvania 19104, United States
- Wei Xia
- Shanghai Frontiers Science Center of Artificial Intelligence and Deep Learning and NYU-ECNU Center for Computational Chemistry, NYU Shanghai, Shanghai 200062, China
- Department of Chemistry, New York University, New York, New York 10003, United States
- Jiankai Ge
- Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States
- Cheng-Wei Ju
- Pritzker School of Molecular Engineering, The University of Chicago, Chicago, Illinois 60615, United States
- Haiping Zhang
- Faculty of Synthetic Biology, Shenzhen Institute of Advanced Technology, Shenzhen 518055, China
- John Z H Zhang
- Shanghai Frontiers Science Center of Artificial Intelligence and Deep Learning and NYU-ECNU Center for Computational Chemistry, NYU Shanghai, Shanghai 200062, China
- Department of Chemistry, New York University, New York, New York 10003, United States
- Faculty of Synthetic Biology, Shenzhen Institute of Advanced Technology, Shenzhen 518055, China
- Shanghai Engineering Research Center of Molecular Therapeutics and New Drug Development, School of Chemistry and Molecular Engineering, East China Normal University, 200062 Shanghai, China
2
Wang G, Feng H, Du M, Feng Y, Cao C. Multimodal Representation Learning via Graph Isomorphism Network for Toxicity Multitask Learning. J Chem Inf Model 2024. [PMID: 39432821 DOI: 10.1021/acs.jcim.4c01061] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2024]
Abstract
Toxicity is paramount for comprehending compound properties, particularly in the early stages of drug design. Because toxic effects are diverse and complex, computing compound toxicity remains a challenge. To address this issue, we propose a multimodal representation learning model, termed the multimodal graph isomorphism network (MMGIN), for compound toxicity multitask learning. Based on fingerprints and molecular graphs of compounds, our MMGIN model incorporates a multimodal representation learning model to acquire a comprehensive compound representation. This model adopts a two-channel structure to independently learn a fingerprint representation and a molecular graph representation. Subsequently, two feedforward neural networks use the learned multimodal compound representation to perform multitask learning, encompassing compound toxicity classification and multiple compound category classification simultaneously. To test the effectiveness of our model, we constructed a novel data set, termed the compound toxicity multitask learning (CTMTL) data set, derived from the TOXRIC data set. We compare our MMGIN model with other representative machine learning and deep learning models on the CTMTL and Tox21 data sets. The experimental results demonstrate significant advancements achieved by our MMGIN model. Furthermore, the ablation study underscores the effectiveness of the introduced fingerprints, molecular graphs, the multimodal representation learning model, and the multitask learning model, showcasing the model's superior predictive capability and robustness.
Affiliation(s)
- Guishen Wang
- School of Computer Science and Engineering, Changchun University of Technology, North Yuanda Street No. 3000, Changchun, 130012 Jilin, China
- Hui Feng
- School of Computer Science and Engineering, Changchun University of Technology, North Yuanda Street No. 3000, Changchun, 130012 Jilin, China
- Mengyan Du
- School of Computer Science and Engineering, Changchun University of Technology, North Yuanda Street No. 3000, Changchun, 130012 Jilin, China
- Yuncong Feng
- School of Computer Science and Engineering, Changchun University of Technology, North Yuanda Street No. 3000, Changchun, 130012 Jilin, China
- Chen Cao
- Key Laboratory for Bio-Electromagnetic Environment and Advanced Medical Theranostics, School of Biomedical Engineering and Informatics, Nanjing Medical University, Longmian Avenue No. 101, Nanjing, 211166 Jiangsu, China
3
Zhang T, Wang S, Chai Y, Yu J, Zhu W, Li L, Li B. Prediction and Interpretability Study of the Glass Transition Temperature of Polyimide Based on Machine Learning with Quantitative Structure-Property Relationship (Tg-QSPR). J Phys Chem B 2024; 128:8807-8817. [PMID: 38979707 DOI: 10.1021/acs.jpcb.4c00756] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/10/2024]
Abstract
The glass transition temperature (Tg) is a crucial characteristic of polyimides (PIs). Developing a Tg predictive model using machine learning methodologies can facilitate the design of PI structures and expedite the development process. In this investigation, a data set comprising 1257 PIs was assembled, with Tg values determined using differential scanning calorimetry. A total of 210 molecular descriptors were computed using RDKit, and six distinct feature selection methodologies were then employed to refine the descriptor set. Quantitative structure-property relationship models targeting Tg (Tg-QSPR) were then constructed using five ensemble learning algorithms and one deep learning algorithm. These models exhibited high predictive accuracy and robustness, with the CatBoost model demonstrating the highest accuracy, achieving a coefficient of determination of 0.823 for the test set, a mean absolute error of 20.1 °C, and a root-mean-square error of 29.0 °C. The study identified the NumRotatableBonds descriptor as particularly influential on Tg, showing a negative correlation with the property. Additionally, the model's accuracy was validated using ten new PI films not included in the original data set, resulting in absolute errors ranging from 2.5 to 26.9 °C and absolute percentage error rates of 1.0-12.8%. These findings underscore the importance of utilizing extensive and diverse data sets for predictive modeling to enhance accuracy and stability. Furthermore, exploring the interpretability of the model and experimentally validating newly synthesized PIs have augmented the practical utility of the model. Under the guidance of the Tg-QSPR models, it will be possible to accelerate the performance prediction and structural design of PIs in the future.
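The accuracy figures quoted in this abstract (coefficient of determination, mean absolute error, root-mean-square error, and absolute percentage error) follow standard definitions. A minimal sketch of those definitions in plain Python is given below; the function names and the example Tg values are hypothetical and are not taken from the paper.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2, MAE and RMSE for paired measured/predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares: squared gap between measurement and prediction.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: spread of the measurements around their mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
        "rmse": math.sqrt(ss_res / n),
    }


def absolute_percentage_error(measured, predicted):
    """Absolute error expressed as a percentage of the measured value."""
    return abs(measured - predicted) / abs(measured) * 100.0


# Hypothetical Tg values in degrees Celsius, for illustration only.
measured_tg = [200.0, 250.0, 300.0]
predicted_tg = [210.0, 240.0, 300.0]
print(regression_metrics(measured_tg, predicted_tg))
print(absolute_percentage_error(250.0, 240.0))
```

Note that the percentage error depends on the temperature scale: the same absolute error gives a different percentage in kelvin than in degrees Celsius.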
Affiliation(s)
- Tianyong Zhang
- Tianjin Key Laboratory of Applied Catalysis Science and Technology, School of Chemical Engineering and Technology, Tianjin University, Tianjin 300354, China
- Collaborative Innovation Center of Chemical Science and Engineering (Tianjin), Tianjin 300072, China
- Tianjin Engineering Research Center of Functional Fine Chemicals, Tianjin 300354, China
- Suisui Wang
- Tianjin Key Laboratory of Applied Catalysis Science and Technology, School of Chemical Engineering and Technology, Tianjin University, Tianjin 300354, China
- Yamei Chai
- Tianjin Key Laboratory of Applied Catalysis Science and Technology, School of Chemical Engineering and Technology, Tianjin University, Tianjin 300354, China
- Jianing Yu
- Tianjin Key Laboratory of Applied Catalysis Science and Technology, School of Chemical Engineering and Technology, Tianjin University, Tianjin 300354, China
- Wenxuan Zhu
- Tianjin Key Laboratory of Applied Catalysis Science and Technology, School of Chemical Engineering and Technology, Tianjin University, Tianjin 300354, China
- Liang Li
- College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
- Bin Li
- Tianjin Key Laboratory of Applied Catalysis Science and Technology, School of Chemical Engineering and Technology, Tianjin University, Tianjin 300354, China
- Tianjin Engineering Research Center of Functional Fine Chemicals, Tianjin 300354, China
4
Arab I, Laukens K, Bittremieux W. Semisupervised Learning to Boost hERG, Nav1.5, and Cav1.2 Cardiac Ion Channel Toxicity Prediction by Mining a Large Unlabeled Small Molecule Data Set. J Chem Inf Model 2024; 64:6410-6420. [PMID: 39110924 DOI: 10.1021/acs.jcim.4c01102] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/27/2024]
Abstract
Predicting drug toxicity is a critical aspect of ensuring patient safety during the drug design process. Although conventional machine learning techniques have shown some success in this field, the scarcity of annotated toxicity data poses a significant challenge in enhancing models' performance. In this study, we explore the potential of leveraging large unlabeled small molecule data sets using semisupervised learning to improve drug cardiotoxicity predictive performance across three cardiac ion channel targets: the voltage-gated potassium channel (hERG), the voltage-gated sodium channel (Nav1.5), and the voltage-gated calcium channel (Cav1.2). We extensively mined the ChEMBL database, comprising approximately 2 million small molecules, and then employed semisupervised learning to construct robust classification models for this purpose. We achieved a performance boost on highly diverse (i.e., structurally dissimilar) test data sets across all three targets. Using our built models, we screened the whole ChEMBL database and a large set of FDA-approved drugs, identifying several compounds with potential cardiac ion channel activity. To ensure broad accessibility and usability for both technical and nontechnical users, we developed a cross-platform graphical user interface that allows users to make predictions and gain insights into the cardiotoxicity of drugs and other small molecules. The software is made available as open source under the permissive MIT license at https://github.com/issararab/CToxPred2.
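The abstract does not state which semisupervised scheme is used. As a generic illustration only, the sketch below shows self-training (pseudo-labeling) with a toy nearest-centroid classifier in plain Python. The classifier, the margin threshold, the feature vectors, and all function names are assumptions made for illustration and do not reproduce the CToxPred2 models.

```python
import math


def fit_centroids(X, y):
    """Compute one mean feature vector (centroid) per class label."""
    groups = {}
    for x, label in zip(X, y):
        groups.setdefault(label, []).append(x)
    return {
        label: tuple(sum(col) / len(points) for col in zip(*points))
        for label, points in groups.items()
    }


def predict_with_margin(centroids, x):
    """Return the nearest class and how much closer it is than the runner-up."""
    ranked = sorted((math.dist(x, c), label) for label, c in centroids.items())
    margin = ranked[1][0] - ranked[0][0] if len(ranked) > 1 else float("inf")
    return ranked[0][1], margin


def self_train(X_lab, y_lab, X_unlab, margin_threshold=1.0, max_rounds=5):
    """Grow the labeled set with confident pseudo-labels, then refit."""
    X, y, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(max_rounds):
        centroids = fit_centroids(X, y)
        confident, remaining = [], []
        for x in pool:
            label, margin = predict_with_margin(centroids, x)
            if margin >= margin_threshold:
                confident.append((x, label))
            else:
                remaining.append(x)
        if not confident:
            break  # nothing new was added, so further rounds change nothing
        for x, label in confident:
            X.append(x)
            y.append(label)
        pool = remaining
    return fit_centroids(X, y), pool


# Hypothetical 2-D descriptors: two labeled molecules, three unlabeled ones.
centroids, unresolved = self_train(
    [(0.0, 0.0), (4.0, 4.0)], [0, 1],
    [(0.5, 0.5), (3.5, 3.5), (2.0, 2.0)],
)
print(centroids, unresolved)
```

The margin threshold controls the usual trade-off in self-training: a low threshold adds more pseudo-labels but risks reinforcing early mistakes, while a high threshold leaves ambiguous molecules unlabeled.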
Affiliation(s)
- Issar Arab
- Department of Computer Science, University of Antwerp, 2020 Antwerp, Belgium
- Biomedical Informatics Network Antwerpen (biomina), 2020 Antwerp, Belgium
- Kris Laukens
- Department of Computer Science, University of Antwerp, 2020 Antwerp, Belgium
- Biomedical Informatics Network Antwerpen (biomina), 2020 Antwerp, Belgium
- Wout Bittremieux
- Department of Computer Science, University of Antwerp, 2020 Antwerp, Belgium
- Biomedical Informatics Network Antwerpen (biomina), 2020 Antwerp, Belgium
5
Le K, Radović JR, MacCallum JL, Larter SR, Van Humbeck JF. Machine Learning in Complex Organic Mixtures: Applying Domain Knowledge Allows for Meaningful Performance with Small Data Sets. J Am Chem Soc 2024; 146:22563-22569. [PMID: 39082215 DOI: 10.1021/jacs.4c06595] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/15/2024]
Abstract
The ability to quantify individual components of complex mixtures is a challenge found throughout the life and physical sciences. An improved capacity to generate large data sets along with the uptake of machine-learning (ML)-based analysis tools has allowed for various "omics" disciplines to realize exceptional advances. Other areas of chemistry that deal with complex mixtures often do not leverage these advances. Environmental samples, for example, can be more difficult to access, and the resulting small data sets are less appropriate for unconstrained ML approaches. Herein, we present an approach to address this latter issue. Using a very small environmental data set (35 high-resolution mass spectra gathered from various solvent extractions of Canadian petroleum fractions), we show that the application of specific domain knowledge can lead to ML models with notable performance.
Affiliation(s)
- Katelyn Le
- Department of Chemistry, University of Calgary, Calgary, Alberta T2N 1N4, Canada
- Jagoš R Radović
- Center for Petroleum Geochemistry (UH-CPG), Department of Earth and Atmospheric Sciences, University of Houston, Houston, Texas 77204-5007, United States
- Justin L MacCallum
- Department of Chemistry, University of Calgary, Calgary, Alberta T2N 1N4, Canada
- Stephen R Larter
- Department of Earth, Energy, and Environment, University of Calgary, Calgary, Alberta T2N 1N4, Canada
6
Ausri IR, Sadeghzadeh S, Biswas S, Zheng H, GhavamiNejad P, Huynh MDT, Keyvani F, Shirzadi E, Rahman FA, Quadrilatero J, GhavamiNejad A, Poudineh M. Multifunctional Dopamine-Based Hydrogel Microneedle Electrode for Continuous Ketone Sensing. Adv Mater 2024; 36:e2402009. [PMID: 38847967 DOI: 10.1002/adma.202402009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/06/2024] [Revised: 05/05/2024] [Indexed: 06/18/2024]
Abstract
Diabetic ketoacidosis (DKA), a severe complication of type 1 diabetes (T1D), is triggered by the production of large quantities of ketone bodies, requiring patients with T1D to constantly monitor their ketone levels. Here, a skin-compatible hydrogel microneedle (HMN)-continuous ketone monitoring (HMN-CKM) device is reported. The sensing mechanism relies on the catechol-quinone chemistry inherent to the dopamine (DA) molecules that are covalently linked to the polymer structure of the HMN patch. The DA serves a dual purpose: it acts as a redox mediator for measuring the byproduct of oxidation of 3-beta-hydroxybutyrate (β-HB), the primary ketone body, while also facilitating the formation of a crosslinked HMN patch. A universal approach involving pre-oxidation and detection of the generated catechol compounds is introduced to correlate the sensor response to the β-HB concentrations. It is further shown that real-time tracking of a decrease in ketone levels in a T1D rat model is possible using the HMN-CKM device, in conjunction with a data-driven machine learning model that considers potential time delays.
Affiliation(s)
- Irfani Rahmi Ausri
- Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, N2L 3G1, Canada
- Waterloo Institute for Nanotechnology, University of Waterloo, Waterloo, ON, N2L 3G1, Canada
- Sadegh Sadeghzadeh
- Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, N2L 3G1, Canada
- Waterloo Institute for Nanotechnology, University of Waterloo, Waterloo, ON, N2L 3G1, Canada
- Subhamoy Biswas
- Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, N2L 3G1, Canada
- Waterloo Institute for Nanotechnology, University of Waterloo, Waterloo, ON, N2L 3G1, Canada
- Hanjia Zheng
- Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, N2L 3G1, Canada
- Waterloo Institute for Nanotechnology, University of Waterloo, Waterloo, ON, N2L 3G1, Canada
- Peyman GhavamiNejad
- Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, N2L 3G1, Canada
- Waterloo Institute for Nanotechnology, University of Waterloo, Waterloo, ON, N2L 3G1, Canada
- Michelle Dieu Thao Huynh
- Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, N2L 3G1, Canada
- Waterloo Institute for Nanotechnology, University of Waterloo, Waterloo, ON, N2L 3G1, Canada
- Fatemeh Keyvani
- Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, N2L 3G1, Canada
- Waterloo Institute for Nanotechnology, University of Waterloo, Waterloo, ON, N2L 3G1, Canada
- Erfan Shirzadi
- Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, N2L 3G1, Canada
- Waterloo Institute for Nanotechnology, University of Waterloo, Waterloo, ON, N2L 3G1, Canada
- Fasih A Rahman
- Department of Kinesiology and Health Sciences, University of Waterloo, Waterloo, ON, N2L 3G1, Canada
- Joe Quadrilatero
- Department of Kinesiology and Health Sciences, University of Waterloo, Waterloo, ON, N2L 3G1, Canada
- Amin GhavamiNejad
- Advanced Pharmaceutics and Drug Delivery Laboratory, Leslie L. Dan Faculty of Pharmacy, University of Toronto, Toronto, ON, M5S 3M2, Canada
- Mahla Poudineh
- Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, N2L 3G1, Canada
- Waterloo Institute for Nanotechnology, University of Waterloo, Waterloo, ON, N2L 3G1, Canada
7
Huang Y, Zhang L. Descriptor Design for Perovskite Material with Compatible Molecules via Language Model and First-Principles. J Chem Theory Comput 2024. [PMID: 39037056 DOI: 10.1021/acs.jctc.4c00465] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/23/2024]
Abstract
Directly applying large language models to material and molecular design is not straightforward, particularly for real-scenario cases, where experimental validation accuracy is required. In this study, we propose a multimode descriptor design method for materials prediction and analysis, leveraging the advantages of the natural language processing literature model and density functional theory (DFT) calculations with the assistance of the genetic algorithm (GA). A case study on prediction of aqueous photocurrents of multisolvent engineered halide perovskite CH3NH3PbI3 is performed, and follow-up validation experiments are carried out to demonstrate the improved accuracy of the multimode descriptors (an unprecedented experimental validation accuracy of 87.5% via the GA is achieved) for predicting aqueous photocurrents of perovskite materials (cf. only 50% experimental accuracy for other common machine learning models). The improved experimental accuracy of the descriptors is attributed to the successful deployment of a language model incorporating concise scientific information from >1 million articles into molecular descriptors in combination with DFT calculations. The subsequent machine learning analysis suggests the importance of cation···π and crystallization in molecule-modified halide perovskite materials representing ontological and conceptual understanding. Importantly, the genetic process affords an accurate "white-box" model to describe the perovskite stability (accuracy = 90.2% for the test data set and 92.3% for the training data set) with the mathematical equation Stability = tan F2 × F3 × F1 F2 + F4 + F5, where F1 ∼ F5 represent atomic-level structural and chemical details such as cation···π interactions and highest occupied molecular orbital levels. This study offers a feasible descriptor design route to accurately predict complex material properties, leveraging both language models and density functional theories.
Affiliation(s)
- Yiru Huang
- Department of Materials Physics, School of Chemistry and Materials Science, Nanjing University of Information Science & Technology, Nanjing 210044, China
- Lei Zhang
- Department of Materials Physics, School of Chemistry and Materials Science, Nanjing University of Information Science & Technology, Nanjing 210044, China
8
Whitehead TM, Strickland J, Conduit GJ, Borrel A, Mucs D, Baskerville-Abraham I. Quantifying the Benefits of Imputation over QSAR Methods in Toxicology Data Modeling. J Chem Inf Model 2024; 64:2624-2636. [PMID: 38091381 DOI: 10.1021/acs.jcim.3c01695] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 04/09/2024]
Abstract
Imputation machine learning (ML) surpasses traditional approaches in modeling toxicity data. The method was tested on an open-source data set comprising approximately 2500 ingredients with limited in vitro and in vivo data obtained from the OECD QSAR Toolbox. By leveraging the relationships between different toxicological end points, imputation extracts more valuable information from each data point compared to well-established single end point methods, such as ML-based Quantitative Structure Activity Relationship (QSAR) approaches, providing a final improvement of up to around 0.2 in the coefficient of determination. A significant aspect of this methodology is its resilience to the inclusion of extraneous chemical or experimental data. While additional data typically introduces a considerable level of noise and can hinder performance of single end point QSAR modeling, imputation models remain unaffected. This implies a reduction in the need for laborious manual preprocessing tasks such as feature selection, thereby making data preparation for ML analysis more efficient. This successful test, conducted on open-source data, validates the efficacy of imputation approaches in toxicity data analysis. This work opens the way for applying similar methods to other types of sparse toxicological data matrices, and so we discuss the development of regulatory authority guidelines to accept imputation models, a key aspect for the wider adoption of these methods.
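The core idea described here, filling a missing end point from its relationship to other measured end points rather than from chemical structure alone, can be illustrated with a deliberately simple stand-in. The sketch below fits a one-predictor least-squares line on rows where two end points are both measured and uses it to fill gaps; the linear model, the column names, and the values are hypothetical and are not the imputation method evaluated in the paper.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes at least two distinct x values
    return slope, mean_y - slope * mean_x


def impute_endpoint(rows, source, target):
    """Fill missing `target` values from `source` using rows that have both."""
    complete = [r for r in rows if r[source] is not None and r[target] is not None]
    slope, intercept = fit_line(
        [r[source] for r in complete], [r[target] for r in complete]
    )
    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the input table is left untouched
        if new_row[target] is None and new_row[source] is not None:
            new_row[target] = slope * new_row[source] + intercept
        filled.append(new_row)
    return filled


# Hypothetical sparse table: the last ingredient lacks the in vivo value.
table = [
    {"in_vitro": 1.0, "in_vivo": 2.0},
    {"in_vitro": 2.0, "in_vivo": 4.0},
    {"in_vitro": 3.0, "in_vivo": 6.0},
    {"in_vitro": 4.0, "in_vivo": None},
]
print(impute_endpoint(table, "in_vitro", "in_vivo")[-1])
```

A single-end-point QSAR model would instead predict each column from structural descriptors only, so it cannot use a measured value in one column to inform a missing value in another.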
Affiliation(s)
- Thomas M Whitehead
- Intellegens Ltd., The Studio, Chesterton Mill, Cambridge CB4 3NP, United Kingdom
- Joel Strickland
- Intellegens Ltd., The Studio, Chesterton Mill, Cambridge CB4 3NP, United Kingdom
- Gareth J Conduit
- Intellegens Ltd., The Studio, Chesterton Mill, Cambridge CB4 3NP, United Kingdom
- Alexandre Borrel
- Inotiv, Research Triangle Park, North Carolina 27560, United States
- Daniel Mucs
- Scientific and Regulatory Affairs, JT International SA, 8, rue Kazem Radjavi, 1202 Geneva, Switzerland
- Irene Baskerville-Abraham
- Scientific and Regulatory Affairs, JT International SA, 8, rue Kazem Radjavi, 1202 Geneva, Switzerland
9
Chen L, Jiang J, Dou B, Feng H, Liu J, Zhu Y, Zhang B, Zhou T, Wei GW. Machine learning study of the extended drug-target interaction network informed by pain related voltage-gated sodium channels. Pain 2024; 165:908-921. [PMID: 37851391 PMCID: PMC11021136 DOI: 10.1097/j.pain.0000000000003089] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/11/2023] [Accepted: 09/09/2023] [Indexed: 10/19/2023]
Abstract
Pain is a significant global health issue, and the current treatment options for pain management have limitations in terms of effectiveness, side effects, and potential for addiction. There is a pressing need for improved pain treatments and the development of new drugs. Voltage-gated sodium channels, particularly Nav1.3, Nav1.7, Nav1.8, and Nav1.9, play a crucial role in neuronal excitability and are predominantly expressed in the peripheral nervous system. Targeting these channels may provide a means to treat pain while minimizing central and cardiac adverse effects. In this study, we construct protein-protein interaction (PPI) networks based on pain-related sodium channels and develop a corresponding drug-target interaction network to identify potential lead compounds for pain management. To ensure reliable machine learning predictions, we carefully select 111 inhibitor data sets from a pool of more than 1000 targets in the PPI network. We employ 3 distinct machine learning algorithms combined with advanced natural language processing (NLP)-based embeddings, specifically pretrained transformer and autoencoder representations. Through a systematic screening process, we evaluate the side effects and repurposing potential of more than 150,000 drug candidates targeting Nav1.7 and Nav1.8 sodium channels. In addition, we assess the ADMET (absorption, distribution, metabolism, excretion, and toxicity) properties of these candidates to identify leads with near-optimal characteristics. Our strategy provides an innovative platform for the pharmacological development of pain treatments, offering the potential for improved efficacy and reduced side effects.
Affiliation(s)
- Long Chen
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan, P.R. China
- Jian Jiang
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan, P.R. China
- Department of Mathematics, Michigan State University, East Lansing, MI, United States
- Bozheng Dou
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan, P.R. China
- Hongsong Feng
- Department of Mathematics, Michigan State University, East Lansing, MI, United States
- Jie Liu
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan, P.R. China
- Yueying Zhu
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan, P.R. China
- Bengong Zhang
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan, P.R. China
- Tianshou Zhou
- Key Laboratory of Computational Mathematics, Guangdong Province, and School of Mathematics, Sun Yat-sen University, Guangzhou, P.R. China
- Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, MI, United States
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI, United States
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, United States
10
Yang Z, Wang L, Yang Y, Pang X, Sun Y, Liang Y, Cao H. Screening of the Antagonistic Activity of Potential Bisphenol A Alternatives toward the Androgen Receptor Using Machine Learning and Molecular Dynamics Simulation. Environ Sci Technol 2024; 58:2817-2829. [PMID: 38291630 DOI: 10.1021/acs.est.3c09779] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/01/2024]
Abstract
Over the past few decades, extensive research has indicated that exposure to bisphenol A (BPA) increases the health risks in humans. Toxicological studies have demonstrated that BPA can bind to the androgen receptor (AR), resulting in endocrine-disrupting effects. In recent investigations, many alternatives to BPA have been detected in various environmental media as major pollutants. However, related experimental evaluations of BPA alternatives have not been systematically implemented for the assessment of chemical safety and the effects of structural characteristics on the antagonistic activity of the AR. To promote the green development of BPA alternatives, high-throughput toxicological screening is fundamental for prioritizing chemical tests. Therefore, we proposed a hybrid deep learning architecture that combines molecular descriptors and molecular graphs to predict AR antagonistic activity. Compared to previous models, this hybrid architecture can extract substantial chemical information from various molecular representations to improve the model's generalization ability for BPA alternatives. Our predictions suggest that lignin-derivable bisguaiacols, as alternatives to BPA, are likely to be nonantagonist for AR compared to bisphenol analogues. Additionally, molecular dynamics (MD) simulations identified the dihydrotestosterone-bound pocket, rather than the surface, as the major binding site of bisphenol analogues. The conformational changes of key helix H12 from an agonistic to an antagonistic conformation can be evaluated qualitatively by accelerated MD simulations to explain the underlying mechanism. Overall, our computational study is helpful for toxicological screening of BPA alternatives and the design of environmentally friendly BPA alternatives.
Affiliation(s)
- Zeguo Yang
- Hubei Key Laboratory of Environmental and Health Effects of Persistent Toxic Substances, School of Environment and Health, Jianghan University, Wuhan 430056, China
- Ling Wang
- Hubei Key Laboratory of Environmental and Health Effects of Persistent Toxic Substances, School of Environment and Health, Jianghan University, Wuhan 430056, China
- Ying Yang
- Hubei Key Laboratory of Environmental and Health Effects of Persistent Toxic Substances, School of Environment and Health, Jianghan University, Wuhan 430056, China
- Xudi Pang
- Hubei Key Laboratory of Environmental and Health Effects of Persistent Toxic Substances, School of Environment and Health, Jianghan University, Wuhan 430056, China
- Yuzhen Sun
- Hubei Key Laboratory of Environmental and Health Effects of Persistent Toxic Substances, School of Environment and Health, Jianghan University, Wuhan 430056, China
- Yong Liang
- Hubei Key Laboratory of Environmental and Health Effects of Persistent Toxic Substances, School of Environment and Health, Jianghan University, Wuhan 430056, China
- Huiming Cao
- Hubei Key Laboratory of Environmental and Health Effects of Persistent Toxic Substances, School of Environment and Health, Jianghan University, Wuhan 430056, China
11
Joseph J, Niemczak C, Lichtenstein J, Kobrina A, Magohe A, Leigh S, Ealer C, Fellows A, Reike C, Massawe E, Gui J, Buckey JC. Central auditory test performance predicts future neurocognitive function in children living with and without HIV. Sci Rep 2024; 14:2712. [PMID: 38302516 PMCID: PMC10834399 DOI: 10.1038/s41598-024-52380-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/14/2023] [Accepted: 01/18/2024] [Indexed: 02/03/2024] Open
Abstract
Tests of the brain's ability to process complex sounds (central auditory tests) correlate with overall measures of neurocognitive performance. In low- and middle-income countries, where resources to conduct detailed cognitive testing are limited, tests that assess the central auditory system may provide a novel and useful way to track neurocognitive performance. This could be particularly useful for children living with HIV (CLWH). To evaluate this, we administered central auditory tests to CLWH and children living without HIV and examined whether central auditory tests given early in a child's life could predict later neurocognitive performance. We used a machine learning technique to incorporate factors known to affect performance on neurocognitive tests, such as education. The results show that central auditory tests are useful predictors of neurocognitive performance and perform as well as, or in some cases better than, factors such as education. Central auditory tests may offer an objective way to track neurocognitive performance in CLWH.
Affiliation(s)
- Jeff Joseph
- Department of Quantitative Biomedical Sciences, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
| | - Christopher Niemczak
- Department of Quantitative Biomedical Sciences, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
- Space Medicine Innovations Laboratory, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
- Department of Medicine, Dartmouth-Hitchcock Medical Center, Lebanon, NH, USA
| | - Jonathan Lichtenstein
- Department of Psychiatry, Dartmouth-Hitchcock Medical Center, Lebanon, NH, USA
- The Dartmouth Institute for Health Policy and Clinical Practice, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
| | - Anastasiya Kobrina
- Space Medicine Innovations Laboratory, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
| | - Albert Magohe
- Muhimbili University of Health and Allied Sciences, Dar es Salaam, Tanzania
| | - Samantha Leigh
- Space Medicine Innovations Laboratory, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
| | - Christin Ealer
- Space Medicine Innovations Laboratory, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
| | - Abigail Fellows
- Space Medicine Innovations Laboratory, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
| | - Catherine Reike
- Space Medicine Innovations Laboratory, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
| | - Enica Massawe
- Muhimbili University of Health and Allied Sciences, Dar es Salaam, Tanzania
| | - Jiang Gui
- Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
| | - Jay C Buckey
- Department of Quantitative Biomedical Sciences, Geisel School of Medicine at Dartmouth, Hanover, NH, USA.
- Space Medicine Innovations Laboratory, Geisel School of Medicine at Dartmouth, Hanover, NH, USA.
- Department of Medicine, Dartmouth-Hitchcock Medical Center, Lebanon, NH, USA.
| |
|
12
|
Glaubitz C, Bazzoni A, Ackermann-Hirschi L, Baraldi L, Haeffner M, Fortunatus R, Rothen-Rutishauser B, Balog S, Petri-Fink A. Leveraging Machine Learning for Size and Shape Analysis of Nanoparticles: A Shortcut to Electron Microscopy. THE JOURNAL OF PHYSICAL CHEMISTRY. C, NANOMATERIALS AND INTERFACES 2024; 128:421-427. [PMID: 38229591 PMCID: PMC10788956 DOI: 10.1021/acs.jpcc.3c05938] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 09/02/2023] [Revised: 11/20/2023] [Accepted: 11/21/2023] [Indexed: 01/18/2024]
Abstract
Characterizing nanoparticles (NPs) is crucial in nanoscience due to the direct influence of their physicochemical properties on their behavior. Various experimental techniques exist to analyze the size and shape of NPs, each with advantages, limitations, proneness to uncertainty, and resource requirements. One of them is electron microscopy (EM), often considered the gold standard, which offers visualization of the primary particles. However, despite its advantages, EM can be expensive, less accessible, and difficult to apply during dynamic processes. Therefore, using EM for specific experimental conditions, such as observing dynamic processes or visualizing low-contrast particles, is challenging. This study showcases the potential of machine learning in deriving EM parameters by utilizing cost-effective and dynamic techniques such as dynamic light scattering (DLS) and UV-vis spectroscopy. Our developed model successfully predicts the size and shape parameters of gold NPs based on DLS and UV-vis results. Furthermore, we demonstrate the practicality of our model in a situation in which conducting EM measurements presents a challenge: tracking the synthesis of 100 nm gold NPs in situ.
Affiliation(s)
- Christina Glaubitz
- Adolphe Merkle Institute, University of Fribourg, Chemin des Verdiers 4, 1700 Fribourg, Switzerland
| | - Amélie Bazzoni
- Adolphe Merkle Institute, University of Fribourg, Chemin des Verdiers 4, 1700 Fribourg, Switzerland
| | | | - Laura Baraldi
- Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parco Area delle Scienze 17/A, 43124 Parma, Italy
| | - Moritz Haeffner
- Adolphe Merkle Institute, University of Fribourg, Chemin des Verdiers 4, 1700 Fribourg, Switzerland
| | - Roman Fortunatus
- Adolphe Merkle Institute, University of Fribourg, Chemin des Verdiers 4, 1700 Fribourg, Switzerland
| | | | - Sandor Balog
- Adolphe Merkle Institute, University of Fribourg, Chemin des Verdiers 4, 1700 Fribourg, Switzerland
| | - Alke Petri-Fink
- Adolphe Merkle Institute, University of Fribourg, Chemin des Verdiers 4, 1700 Fribourg, Switzerland
- Chemistry Department, University of Fribourg, Chemin du Musée 9, 1700 Fribourg, Switzerland
| |
|
13
|
Xia S, Chen E, Zhang Y. Integrated Molecular Modeling and Machine Learning for Drug Design. J Chem Theory Comput 2023; 19:7478-7495. [PMID: 37883810 PMCID: PMC10653122 DOI: 10.1021/acs.jctc.3c00814] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/26/2023] [Revised: 10/10/2023] [Accepted: 10/11/2023] [Indexed: 10/28/2023]
Abstract
Modern therapeutic development often involves several stages that are interconnected, and multiple iterations are usually required to bring a new drug to the market. Computational approaches have increasingly become an indispensable part of helping reduce the time and cost of the research and development of new drugs. In this Perspective, we summarize our recent efforts on integrating molecular modeling and machine learning to develop computational tools for modulator design, including a pocket-guided rational design approach based on AlphaSpace to target protein-protein interactions, delta machine learning scoring functions for protein-ligand docking as well as virtual screening, and state-of-the-art deep learning models to predict calculated and experimental molecular properties based on molecular mechanics optimized geometries. Meanwhile, we discuss remaining challenges and promising directions for further development and use a retrospective example of FDA approved kinase inhibitor Erlotinib to demonstrate the use of these newly developed computational tools.
Affiliation(s)
- Song Xia
- Department of Chemistry, New York University, New York, New York 10003, United States
| | - Eric Chen
- Department of Chemistry, New York University, New York, New York 10003, United States
| | - Yingkai Zhang
- Department of Chemistry, New York University, New York, New York 10003, United States
- Simons Center for Computational Physical Chemistry at New York University, New York, New York 10003, United States
- NYU-ECNU Center for Computational Chemistry at NYU Shanghai, Shanghai 200062, China
| |
|
14
|
Banerjee A, Roy K. Read-across-based intelligent learning: development of a global q-RASAR model for the efficient quantitative predictions of skin sensitization potential of diverse organic chemicals. ENVIRONMENTAL SCIENCE. PROCESSES & IMPACTS 2023; 25:1626-1644. [PMID: 37682520 DOI: 10.1039/d3em00322a] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 09/09/2023]
Abstract
Environmental chemicals and contaminants have a wide array of harmful effects on terrestrial and aquatic life, ranging from skin sensitization to acute oral toxicity. The current study aims to assess the quantitative skin sensitization potential of a large set of industrial and environmental chemicals acting through different mechanisms using the novel quantitative Read-Across Structure-Activity Relationship (q-RASAR) approach. Based on the identified important set of structural and physicochemical features, Read-Across-based hyperparameters were optimized using the training set compounds, followed by the calculation of similarity and error-based RASAR descriptors. Data fusion, further feature selection, and removal of prediction confidence outliers were performed to generate a partial least squares (PLS) q-RASAR model, followed by the application of various Machine Learning (ML) tools to check the quality of predictions. The PLS model was found to be the best among the different models. A simple, user-friendly Java-based software tool was developed based on the PLS model, which efficiently predicts the toxicity value(s) of query compound(s) along with their Applicability Domain (AD) status in terms of leverage values. This model has been developed using structurally diverse compounds and is expected to predict efficiently and quantitatively the skin sensitization potential of environmental chemicals to estimate their occupational and health hazards.
Affiliation(s)
- Arkaprava Banerjee
- Drug Theoretics and Cheminformatics Laboratory, Department of Pharmaceutical Technology, Jadavpur University, Kolkata 700032, India.
| | - Kunal Roy
- Drug Theoretics and Cheminformatics Laboratory, Department of Pharmaceutical Technology, Jadavpur University, Kolkata 700032, India.
| |
|
15
|
Viljanen M, Minnema J, Wassenaar PNH, Rorije E, Peijnenburg W. What is the ecotoxicity of a given chemical for a given aquatic species? Predicting interactions between species and chemicals using recommender system techniques. SAR AND QSAR IN ENVIRONMENTAL RESEARCH 2023; 34:765-788. [PMID: 37670728 DOI: 10.1080/1062936x.2023.2254225] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/21/2023] [Accepted: 08/27/2023] [Indexed: 09/07/2023]
Abstract
Ecotoxicological safety assessment of chemicals requires toxicity data on multiple species, despite the general desire to minimize animal testing. Predictive models, specifically machine learning (ML) methods, are one of the tools capable of resolving this apparent contradiction, as they allow toxicity patterns to be generalized across chemicals and species. However, despite the availability of large public toxicity datasets, the data are highly sparse, complicating model development. The aim of this study is to provide insights into how ML can predict toxicity using a large but sparse dataset. We developed models to predict LC50 values, based on experimental LC50 data covering 2431 organic chemicals and 1506 aquatic species from the ECOTOX database. Several well-known ML techniques were evaluated and a new ML model was developed, inspired by recommender systems. This new model involves a simple linear model that learns low-rank interactions between species and chemicals using factorization machines. We evaluated the predictive performance of the developed models in two validation settings: 1) predicting unseen chemical-species pairs, and 2) predicting unseen chemicals. The results of this study show that ML models can accurately predict LC50 values in both validation settings. Moreover, we show that the novel factorization machine approach can match well-tuned, complex ML approaches.
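To illustrate the low-rank interaction idea mentioned in this abstract, the following is a minimal sketch of a standard second-order factorization machine applied to a one-hot encoded chemical-species pair. It is not the authors' implementation: the function names (`fm_predict`, `one_hot_pair`), the toy dimensions, and the random parameters are all hypothetical and chosen only for illustration.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order factorization machine prediction for one feature vector x.

    w0: global bias, w: per-feature linear weights, V: (n_features, rank) factors.
    """
    linear = w0 + x @ w
    # Pairwise term via the usual identity:
    #   sum_{i<j} <v_i, v_j> x_i x_j
    #     = 0.5 * sum_f [ (sum_i V[i, f] x_i)^2 - sum_i V[i, f]^2 x_i^2 ]
    xv = x @ V
    pairwise = 0.5 * np.sum(xv ** 2 - (x ** 2) @ (V ** 2))
    return linear + pairwise

def one_hot_pair(chem_idx, species_idx, n_chem, n_species):
    """Encode a (chemical, species) pair as concatenated one-hot vectors."""
    x = np.zeros(n_chem + n_species)
    x[chem_idx] = 1.0
    x[n_chem + species_idx] = 1.0
    return x

# Toy, randomly initialised parameters (assumption: not fitted to any data).
rng = np.random.default_rng(0)
n_chem, n_species, rank = 5, 4, 3
w0 = 0.1
w = rng.normal(size=n_chem + n_species)
V = rng.normal(size=(n_chem + n_species, rank))

x = one_hot_pair(2, 1, n_chem, n_species)
y_hat = fm_predict(x, w0, w, V)
```

With one-hot inputs the prediction reduces to a bias, one weight per chemical, one weight per species, and the dot product of the two latent vectors, which is why such a model can score chemical-species pairs that were never measured together.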
Affiliation(s)
- M Viljanen
- Department of Statistics, Data Science and Modelling, National Institute of Public Health and the Environment, Bilthoven, The Netherlands
| | - J Minnema
- Center for Safety of Substances and Products, National Institute of Public Health and the Environment, Bilthoven, The Netherlands
| | - P N H Wassenaar
- Center for Safety of Substances and Products, National Institute of Public Health and the Environment, Bilthoven, The Netherlands
| | - E Rorije
- Center for Safety of Substances and Products, National Institute of Public Health and the Environment, Bilthoven, The Netherlands
| | - W Peijnenburg
- Center for Safety of Substances and Products, National Institute of Public Health and the Environment, Bilthoven, The Netherlands
- Institute of Environmental Sciences (CML), Leiden University, Leiden, The Netherlands
| |
|
16
|
Zhang H, Li H, Xin H, Zhang J. Property Prediction and Structural Feature Extraction of Polyimide Materials Based on Machine Learning. J Chem Inf Model 2023; 63:5473-5483. [PMID: 37620998 DOI: 10.1021/acs.jcim.3c00326] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 08/26/2023]
Abstract
The construction of material prediction models using machine learning algorithms can aid in the polyimide structural design and screening of materials as well as accelerate the development of new materials. There is a lack of research on predicting the optical properties of polyimide materials and the interpretation of the structural features. Here, we collected 652 polyimide molecular structures and used seven popular machine learning algorithms to predict the glass transition temperature and cut-off wavelength of polyimide materials and extract key feature information of repeating unit structures. The results showed that the root mean square error of the glass transition temperature prediction model was 33.92 °C, and the correlation coefficient was 0.861. The root mean square error of the cut-off wavelength prediction model was 17.18 nm, and the correlation coefficient was 0.837. The elasticity of the molecular structure was also found to be the key factor affecting glass transition temperature, and the presence and location of heterogeneous atoms had a significant effect on the cut-off wavelengths. Finally, eight polyimide materials were synthesized to test the accuracy of the prediction models, and the experimental characterization values agreed with the predicted values. The results would contribute to the development of polyimide structural design and materials preparation for flexible display.
Affiliation(s)
- Han Zhang
- School of Microelectronics, Shanghai University, Shanghai 201800, China
| | - Haoyuan Li
- School of Microelectronics, Shanghai University, Shanghai 201800, China
| | - Hanshen Xin
- School of Microelectronics, Shanghai University, Shanghai 201800, China
| | - Jianhua Zhang
- School of Microelectronics, Shanghai University, Shanghai 201800, China
| |
|
17
|
Wang L, Yang F, Bao X, Bo X, Dang S, Wang R, Pan F. Deep learning-mediated prediction of concealed accessory pathway based on sinus rhythmic electrocardiograms. Ann Noninvasive Electrocardiol 2023; 28:e13072. [PMID: 37530078 PMCID: PMC10475885 DOI: 10.1111/anec.13072] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 10/17/2022] [Revised: 06/01/2023] [Accepted: 06/27/2023] [Indexed: 08/03/2023] Open
Abstract
BACKGROUND Concealed accessory pathway (AP) may cause atrioventricular reentrant tachycardia, impacting the health of patients. However, it is asymptomatic and undetectable during sinus rhythm. METHODS To detect concealed AP with electrocardiography (ECG) images, we collected normal sinus rhythmic ECG images of concealed AP patients and healthy subjects. All ECG images were randomly allocated to the training and testing datasets, and were used to train and test six popular convolutional neural networks, each initialized either from ImageNet pre-training or randomly. RESULTS We screened 152 ECG recordings in the concealed AP group and 600 ECG recordings in the control group. There were no statistically significant differences in ECG characteristics between the control group and the concealed AP group in terms of PR interval and QRS interval. However, the QT interval and QTc were slightly higher in the control group than in the concealed AP group. In the testing set, ResNet26, SE-ResNet50, MobileNetV3_large_100, and DenseNet169 achieved a sensitivity above 87.0% with a specificity above 98.0%. Models trained from random initialization showed similar performance and convergence to models trained from ImageNet pre-training. CONCLUSION Our study suggests that deep learning could be an effective way to predict concealed AP with normal sinus rhythmic ECG images. Our results might also encourage people to rethink the possibility of training from random initialization on ECG image tasks.
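For readers less familiar with the two metrics reported in this abstract, the sketch below shows how sensitivity and specificity are computed from binary predictions. The function name and the toy labels are hypothetical and are not data from the study.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, where 1 is the positive class.

    Assumes at least one positive and one negative example in y_true.
    """
    tp = fn = tn = fp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1  # positive case correctly detected
        elif t == 1 and p == 0:
            fn += 1  # positive case missed
        elif t == 0 and p == 0:
            tn += 1  # negative case correctly rejected
        else:
            fp += 1  # negative case wrongly flagged
    sensitivity = tp / (tp + fn)  # fraction of true positives that are found
    specificity = tn / (tn + fp)  # fraction of true negatives that are cleared
    return sensitivity, specificity

# Toy labels for illustration only (1 = concealed AP, 0 = control).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
```

Reporting both values matters when the classes are imbalanced, as they are here with far more control recordings than concealed AP recordings, because overall accuracy alone can hide a poor detection rate in the smaller class.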
Affiliation(s)
- Lei Wang
- Department of Cardiology, The Affiliated Wuxi People's Hospital of Nanjing Medical University, Wuxi, China
- Key Laboratory of Advanced Process Control for Light Industry (Ministry of Education), Jiangnan University, Wuxi, China
| | - Fang Yang
- Department of Cardiology, The Affiliated Wuxi People's Hospital of Nanjing Medical University, Wuxi, China
| | - Xiao‐Jing Bao
- Department of Cardiology, The Affiliated Wuxi People's Hospital of Nanjing Medical University, Wuxi, China
| | - Xiao‐Ping Bo
- Department of Cardiology, The Affiliated Wuxi People's Hospital of Nanjing Medical University, Wuxi, China
| | - Shipeng Dang
- Department of Cardiology, The Affiliated Wuxi People's Hospital of Nanjing Medical University, Wuxi, China
| | - Ru‐Xing Wang
- Department of Cardiology, The Affiliated Wuxi People's Hospital of Nanjing Medical University, Wuxi, China
| | - Feng Pan
- Key Laboratory of Advanced Process Control for Light Industry (Ministry of Education), Jiangnan University, Wuxi, China
| |
|
18
|
Rana MM, Nguyen DD. Geometric graph learning with extended atom-types features for protein-ligand binding affinity prediction. Comput Biol Med 2023; 164:107250. [PMID: 37515872 DOI: 10.1016/j.compbiomed.2023.107250] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/12/2023] [Revised: 06/12/2023] [Accepted: 07/07/2023] [Indexed: 07/31/2023]
Abstract
Understanding and accurately predicting protein-ligand binding affinity are essential in the drug design and discovery process. At present, machine learning-based methodologies are gaining popularity as a means of predicting binding affinity due to their efficiency and accuracy, as well as the increasing availability of structural and binding affinity data for protein-ligand complexes. In biomolecular studies, graph theory has been widely applied since graphs can be used to model molecules or molecular complexes in a natural manner. In the present work, we upgrade the graph-based learners for the study of protein-ligand interactions by integrating extensive atom types such as SYBYL and extended connectivity interactive features (ECIF) into multiscale weighted colored graphs (MWCG). By pairing with the gradient boosting decision tree (GBDT) machine learning algorithm, our approach results in two different methods, namely sybylGGL-Score and ecifGGL-Score. Both of our models are extensively validated in their scoring power using three commonly used benchmark datasets in the drug design area, namely CASF-2007, CASF-2013, and CASF-2016. The performance of our best model sybylGGL-Score is compared with other state-of-the-art models in the binding affinity prediction for each benchmark. While both of our models achieve state-of-the-art results, the SYBYL atom-type model sybylGGL-Score outperforms other methods by a wide margin in all benchmarks. Finally, the best-performing SYBYL atom-type model is evaluated on two test sets that are independent of CASF benchmarks.
Affiliation(s)
- Md Masud Rana
- Department of Mathematics, University of Kentucky, Lexington, 40506, KY, USA.
| | - Duc Duy Nguyen
- Department of Mathematics, University of Kentucky, Lexington, 40506, KY, USA.
| |
|
19
|
Dou B, Zhu Z, Merkurjev E, Ke L, Chen L, Jiang J, Zhu Y, Liu J, Zhang B, Wei GW. Machine Learning Methods for Small Data Challenges in Molecular Science. Chem Rev 2023; 123:8736-8780. [PMID: 37384816 PMCID: PMC10999174 DOI: 10.1021/acs.chemrev.3c00189] [Citation(s) in RCA: 37] [Impact Index Per Article: 37.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/01/2023]
Abstract
Small data are often used in scientific and engineering research due to the presence of various constraints, such as time, cost, ethics, privacy, security, and technical limitations in data acquisition. However, because big data have been the focus for the past decade, small data and their challenges have received little attention, even though these challenges are technically more severe in machine learning (ML) and deep learning (DL) studies. Overall, the small data challenge is often compounded by issues such as data diversity, imputation, noise, imbalance, and high dimensionality. Fortunately, the current big data era is characterized by technological breakthroughs in ML, DL, and artificial intelligence (AI), which enable data-driven scientific discovery, and many advanced ML and DL technologies developed for big data have inadvertently provided solutions for small data problems. As a result, significant progress has been made in ML and DL for small data challenges in the past decade. In this review, we summarize and analyze several emerging potential solutions to small data challenges in molecular science, including chemical and biological sciences. We review both basic machine learning algorithms, such as linear regression, logistic regression (LR), k-nearest neighbor (KNN), support vector machine (SVM), kernel learning (KL), random forest (RF), and gradient boosting trees (GBT), and more advanced techniques, including artificial neural network (ANN), convolutional neural network (CNN), U-Net, graph neural network (GNN), Generative Adversarial Network (GAN), long short-term memory (LSTM), autoencoder, transformer, transfer learning, active learning, graph-based semi-supervised learning, combining deep learning with traditional machine learning, and physical model-based data augmentation. We also briefly discuss the latest advances in these methods. Finally, we conclude the survey with a discussion of promising trends in small data challenges in molecular science.
Affiliation(s)
- Bozheng Dou
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
| | - Zailiang Zhu
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
| | - Ekaterina Merkurjev
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
| | - Lu Ke
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
| | - Long Chen
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
| | - Jian Jiang
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
| | - Yueying Zhu
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
| | - Jie Liu
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
| | - Bengong Zhang
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan 48824, United States
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan 48824, United States
| |
|
20
|
Shim E, Tewari A, Cernak T, Zimmerman PM. Machine Learning Strategies for Reaction Development: Toward the Low-Data Limit. J Chem Inf Model 2023; 63:3659-3668. [PMID: 37312524 PMCID: PMC11163943 DOI: 10.1021/acs.jcim.3c00577] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/15/2023]
Abstract
Machine learning models are increasingly being utilized to predict outcomes of organic chemical reactions. A large amount of reaction data is used to train these models, which is in stark contrast to how expert chemists discover and develop new reactions by leveraging information from a small number of relevant transformations. Transfer learning and active learning are two strategies that can operate in low-data situations, which may help fill this gap and promote the use of machine learning for tackling real-world challenges in organic synthesis. This Perspective introduces active and transfer learning and connects these to potential opportunities and directions for further research, especially in the area of prospective development of chemical transformations.
Affiliation(s)
- Eunjae Shim
- Department of Chemistry, University of Michigan, Ann Arbor, Michigan 48109, United States
| | - Ambuj Tewari
- Department of Statistics, University of Michigan, Ann Arbor, Michigan 48109, United States
- Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, Michigan 48109, United States
| | - Tim Cernak
- Department of Chemistry, University of Michigan, Ann Arbor, Michigan 48109, United States
- Department of Medicinal Chemistry, University of Michigan, Ann Arbor, Michigan 48109, United States
| | - Paul M Zimmerman
- Department of Chemistry, University of Michigan, Ann Arbor, Michigan 48109, United States
| |
|
21
|
Fraigne JJ, Wang J, Lee H, Luke R, Pintwala SK, Peever JH. A novel machine learning system for identifying sleep-wake states in mice. Sleep 2023; 46:zsad101. [PMID: 37021715 PMCID: PMC10262194 DOI: 10.1093/sleep/zsad101] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/12/2022] [Revised: 03/23/2023] [Indexed: 04/07/2023] Open
Abstract
Research into sleep-wake behaviors relies on scoring sleep states, normally done by manual inspection of electroencephalogram (EEG) and electromyogram (EMG) recordings. This is a highly time-consuming process prone to inter-rater variability. When studying relationships between sleep and motor function, analyzing arousal states under a four-state system of active wake (AW), quiet wake (QW), nonrapid-eye-movement (NREM) sleep, and rapid-eye-movement (REM) sleep provides greater precision in behavioral analysis but is a more complex model for classification than the traditional three-state identification (wake, NREM, and REM sleep) usually used in rodent models. Features that distinguish sleep-wake states provide potential for the use of machine learning to automate classification. Here, we devised SleepEns, which uses a novel ensemble architecture, the time-series ensemble. SleepEns achieved 90% accuracy relative to the source expert, which was statistically similar to the performance of two other human experts. Considering the capacity for classification disagreements that are still physiologically reasonable, SleepEns had an acceptable performance of 99% accuracy, as determined blindly by the source expert. Classifications given by SleepEns also maintained similar sleep-wake characteristics compared to expert classifications, some of which were essential for sleep-wake identification. Hence, our approach achieves results comparable to human ability in a fraction of the time. This new machine-learning ensemble will significantly impact the ability of sleep researchers to detect and study sleep-wake behaviors in mice and potentially in humans.
Affiliation(s)
- Jimmy J Fraigne
- Department of Cell & Systems Biology, University of Toronto, Toronto, ON, Canada
| | - Jeffrey Wang
- Department of Cell & Systems Biology, University of Toronto, Toronto, ON, Canada
| | - Hanhee Lee
- Department of Cell & Systems Biology, University of Toronto, Toronto, ON, Canada
| | - Russell Luke
- Department of Cell & Systems Biology, University of Toronto, Toronto, ON, Canada
| | - Sara K Pintwala
- Department of Cell & Systems Biology, University of Toronto, Toronto, ON, Canada
| | - John H Peever
- Department of Cell & Systems Biology, University of Toronto, Toronto, ON, Canada
- Department of Physiology, University of Toronto, Toronto, ON, Canada
| |
|
22
|
Merkurjev E, Nguyen DD, Wei GW. Multiscale Laplacian Learning. APPL INTELL 2023; 53:15727-15746. [PMID: 38031564 PMCID: PMC10686291 DOI: 10.1007/s10489-022-04333-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 11/08/2022] [Indexed: 11/29/2022]
Abstract
Machine learning has greatly influenced many fields, including science. However, despite the tremendous accomplishments of machine learning, one of the key limitations of most existing machine learning approaches is their reliance on large labeled sets, and thus, data with limited labeled samples remains a challenge. Moreover, the performance of machine learning methods is often severely hindered in the case of diverse data, usually associated with smaller data sets or data associated with areas of study where the size of the data sets is constrained by high experimental cost and/or ethics. These challenges call for innovative strategies for dealing with these types of data. In this work, the aforementioned challenges are addressed by integrating graph-based frameworks, semi-supervised techniques, multiscale structures, and modified and adapted optimization procedures. This results in two innovative multiscale Laplacian learning (MLL) approaches for machine learning tasks, such as data classification, and for tackling data with limited samples, diverse data, and small data sets. The first approach, multikernel manifold learning (MML), integrates manifold learning with multikernel information and incorporates a warped kernel regularizer using multiscale graph Laplacians. The second approach, the multiscale MBO (MMBO) method, introduces multiscale Laplacians to the modification of the famous classical Merriman-Bence-Osher (MBO) scheme, and makes use of fast solvers. We demonstrate the performance of our algorithms experimentally on a variety of benchmark data sets, and compare them favorably to the state-of-the-art approaches.
Affiliation(s)
- Duc Duy Nguyen
- Department of Mathematics, University of Kentucky, KY 40506, USA
- Guo-Wei Wei
- Department of Mathematics, Department of Biochemistry and Molecular Biology, Department of Electrical and Computer Engineering, Michigan State University, MI 48824, USA
|
23
|
Xia M, Yang R, Zhao N, Chen X, Dong M, Chen J. A Method of Water COD Retrieval Based on 1D CNN and 2D Gabor Transform for Absorption-Fluorescence Spectra. MICROMACHINES 2023; 14:1128. [PMID: 37374713 DOI: 10.3390/mi14061128] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2023] [Revised: 05/19/2023] [Accepted: 05/25/2023] [Indexed: 06/29/2023]
Abstract
Chemical Oxygen Demand (COD) is one of the indicators of organic pollution in water bodies. The rapid and accurate detection of COD is of great significance to environmental protection. To address the problem of COD retrieval errors in the absorption spectrum method for fluorescent organic matter solutions, a rapid synchronous COD retrieval method for the absorption-fluorescence spectrum is proposed. Based on a one-dimensional convolutional neural network and 2D Gabor transform, an absorption-fluorescence spectrum fusion neural network algorithm is developed to improve the accuracy of water COD retrieval. Results show that the RRMSEP of the absorption-fluorescence COD retrieval method is 0.32% in amino acid aqueous solution, which is 84% lower than that of the single absorption spectrum method. The accuracy of COD retrieval is 98%, which is 15.3% higher than that of the single absorption spectrum method. The test results on the actual sampled water spectral dataset demonstrate that the fusion network outperformed the absorption spectrum CNN network in measuring COD accuracy, with the RRMSEP improving from 5.09% to 1.15%.
Affiliation(s)
- Meng Xia
- Key Laboratory of Environmental Optics and Technology, Anhui Institute of Optics and Fine Mechanics, Chinese Academy of Sciences, Hefei 230031, China
- Science Island Branch of Graduate School, University of Science and Technology of China, Hefei 230026, China
- Ruifang Yang
- Key Laboratory of Environmental Optics and Technology, Anhui Institute of Optics and Fine Mechanics, Chinese Academy of Sciences, Hefei 230031, China
- Nanjing Zhao
- Key Laboratory of Environmental Optics and Technology, Anhui Institute of Optics and Fine Mechanics, Chinese Academy of Sciences, Hefei 230031, China
- Institutes of Physical Science and Information Technology, Anhui University, Hefei 230601, China
- Xiaowei Chen
- Key Laboratory of Environmental Optics and Technology, Anhui Institute of Optics and Fine Mechanics, Chinese Academy of Sciences, Hefei 230031, China
- Ming Dong
- Key Laboratory of Environmental Optics and Technology, Anhui Institute of Optics and Fine Mechanics, Chinese Academy of Sciences, Hefei 230031, China
- Science Island Branch of Graduate School, University of Science and Technology of China, Hefei 230026, China
- Jingsong Chen
- Key Laboratory of Environmental Optics and Technology, Anhui Institute of Optics and Fine Mechanics, Chinese Academy of Sciences, Hefei 230031, China
- Science Island Branch of Graduate School, University of Science and Technology of China, Hefei 230026, China
|
24
|
Tran TTV, Surya Wibowo A, Tayara H, Chong KT. Artificial Intelligence in Drug Toxicity Prediction: Recent Advances, Challenges, and Future Perspectives. J Chem Inf Model 2023; 63:2628-2643. [PMID: 37125780 DOI: 10.1021/acs.jcim.3c00200] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Indexed: 05/02/2023]
Abstract
Toxicity prediction is a critical step in the drug discovery process that helps identify and prioritize compounds with the greatest potential for safe and effective use in humans, while also reducing the risk of costly late-stage failures. It is estimated that over 30% of drug candidates are discarded owing to toxicity. Recently, artificial intelligence (AI) has been used to improve drug toxicity prediction as it provides more accurate and efficient methods for identifying the potentially toxic effects of new compounds before they are tested in human clinical trials, thus saving time and money. In this review, we present an overview of recent advances in AI-based drug toxicity prediction, including the use of various machine learning algorithms and deep learning architectures, of six major toxicity properties and Tox21 assay end points. Additionally, we provide a list of public data sources and useful toxicity prediction tools for the research community and highlight the challenges that must be addressed to enhance model performance. Finally, we discuss future perspectives for AI-based drug toxicity prediction. This review can aid researchers in understanding toxicity prediction and pave the way for new methods of drug discovery.
Affiliation(s)
- Thi Tuyet Van Tran
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Republic of Korea
- Faculty of Information Technology, An Giang University, Long Xuyen 880000, Vietnam
- Vietnam National University - Ho Chi Minh City, Ho Chi Minh 700000, Vietnam
- Agung Surya Wibowo
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Republic of Korea
- Department of Electrical Engineering, Telkom University, Bandung 40257, Indonesia
- Hilal Tayara
- School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, Republic of Korea
- Kil To Chong
- Advances Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, Republic of Korea
|
25
|
Mucllari E, Zadorozhnyy V, Ye Q, Nguyen DD. Novel Molecular Representations Using Neumann-Cayley Orthogonal Gated Recurrent Unit. J Chem Inf Model 2023; 63:2656-2666. [PMID: 37075324 DOI: 10.1021/acs.jcim.2c01526] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/21/2023]
Abstract
Advances in deep neural networks (DNNs) have made a very powerful machine learning method available to researchers across many fields of study, including the biomedical and cheminformatics communities, where DNNs help to improve tasks such as protein performance, molecular design, drug discovery, etc. Many of those tasks rely on molecular descriptors for representing molecular characteristics in cheminformatics. Despite significant efforts and the introduction of numerous methods that derive molecular descriptors, the quantitative prediction of molecular properties remains challenging. One widely used method of encoding molecule features into bit strings is the molecular fingerprint. In this work, we propose using new Neumann-Cayley Gated Recurrent Units (NC-GRU) inside the Neural Nets encoder (AutoEncoder) to create neural molecular fingerprints (NC-GRU fingerprints). The NC-GRU AutoEncoder introduces orthogonal weights into the widely used GRU architecture, resulting in faster, more stable training and more reliable molecular fingerprints. Integrating novel NC-GRU fingerprints and Multi-Task DNN schematics improves the performance of various molecular-related tasks such as toxicity, partition coefficient, lipophilicity, and solvation free energy, producing state-of-the-art results on several benchmarks.
Affiliation(s)
- Edison Mucllari
- Department of Mathematics, University of Kentucky, Lexington, Kentucky 40506, United States
- Vasily Zadorozhnyy
- Department of Mathematics, University of Kentucky, Lexington, Kentucky 40506, United States
- Qiang Ye
- Department of Mathematics, University of Kentucky, Lexington, Kentucky 40506, United States
- Duc Duy Nguyen
- Department of Mathematics, University of Kentucky, Lexington, Kentucky 40506, United States
|
26
|
Peng Y, Zheng C, Guo S, Gao F, Wang X, Du Z, Gao F, Su F, Zhang W, Yu X, Liu G, Liu B, Wu C, Sun Y, Yang Z, Hao Z, Yu X. Metabolomics integrated with machine learning to discriminate the geographic origin of Rougui Wuyi rock tea. NPJ Sci Food 2023; 7:7. [PMID: 36928372 PMCID: PMC10020150 DOI: 10.1038/s41538-023-00187-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2022] [Accepted: 03/03/2023] [Indexed: 03/18/2023] Open
Abstract
The geographic origin of agri-food products contributes greatly to their quality and market value. Here, we developed a robust method combining metabolomics and machine learning (ML) to authenticate the geographic origin of Wuyi rock tea, a premium oolong tea. The volatiles of 333 tea samples (174 from the core region and 159 from the non-core region) were profiled using gas chromatography time-of-flight mass spectrometry and a series of ML algorithms were tested. Wuyi rock tea from the two regions featured distinct aroma profiles. Multilayer Perceptron achieved the best performance with an average accuracy of 92.7% on the training data using 176 volatile features. The model was benchmarked with two independent test sets, showing over 90% accuracy. Gradient Boosting algorithm yielded the best accuracy (89.6%) when using only 30 volatile features. The proposed methodology holds great promise for its broader applications in identifying the geographic origins of other valuable agri-food products.
Affiliation(s)
- Yifei Peng
- College of Horticulture, Fujian Agriculture and Forestry University, Fuzhou, 350002, China; FAFU-UCR Joint Center for Horticultural Biology and Metabolomics, Haixia Institute of Science and Technology, Fujian Agriculture and Forestry University, Fuzhou, 350002, China
- Chao Zheng
- FAFU-UCR Joint Center for Horticultural Biology and Metabolomics, Haixia Institute of Science and Technology, Fujian Agriculture and Forestry University, Fuzhou, 350002, China
- Shuang Guo
- College of Horticulture, Fujian Agriculture and Forestry University, Fuzhou, 350002, China; FAFU-UCR Joint Center for Horticultural Biology and Metabolomics, Haixia Institute of Science and Technology, Fujian Agriculture and Forestry University, Fuzhou, 350002, China
- Fuquan Gao
- College of Horticulture, Fujian Agriculture and Forestry University, Fuzhou, 350002, China; FAFU-UCR Joint Center for Horticultural Biology and Metabolomics, Haixia Institute of Science and Technology, Fujian Agriculture and Forestry University, Fuzhou, 350002, China
- Xiaxia Wang
- FAFU-UCR Joint Center for Horticultural Biology and Metabolomics, Haixia Institute of Science and Technology, Fujian Agriculture and Forestry University, Fuzhou, 350002, China
- Zhenghua Du
- FAFU-UCR Joint Center for Horticultural Biology and Metabolomics, Haixia Institute of Science and Technology, Fujian Agriculture and Forestry University, Fuzhou, 350002, China
- Feng Gao
- Fujian Farming Technology Extension Center, Fuzhou, 350003, China
- Feng Su
- Fujian Farming Technology Extension Center, Fuzhou, 350003, China
- Wenjing Zhang
- Fujian Farming Technology Extension Center, Fuzhou, 350003, China
- Xueling Yu
- Fujian Farming Technology Extension Center, Fuzhou, 350003, China
- Guoying Liu
- Wuyishan Institute of Agricultural Sciences, Wuyishan, 354300, China
- Baoshun Liu
- Wuyishan Tea Bureau, Wuyishan, 354300, China
- Chengjian Wu
- Fujian Vocational College of Agriculture, Fuzhou, 350119, China
- Yun Sun
- College of Horticulture, Fujian Agriculture and Forestry University, Fuzhou, 350002, China
- Zhenbiao Yang
- FAFU-UCR Joint Center for Horticultural Biology and Metabolomics, Haixia Institute of Science and Technology, Fujian Agriculture and Forestry University, Fuzhou, 350002, China
- Zhilong Hao
- College of Horticulture, Fujian Agriculture and Forestry University, Fuzhou, 350002, China
- Xiaomin Yu
- FAFU-UCR Joint Center for Horticultural Biology and Metabolomics, Haixia Institute of Science and Technology, Fujian Agriculture and Forestry University, Fuzhou, 350002, China
|
27
|
Zhu Z, Dou B, Cao Y, Jiang J, Zhu Y, Chen D, Feng H, Liu J, Zhang B, Zhou T, Wei GW. TIDAL: Topology-Inferred Drug Addiction Learning. J Chem Inf Model 2023; 63:1472-1489. [PMID: 36826415 DOI: 10.1021/acs.jcim.3c00046] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Indexed: 02/25/2023]
Abstract
Drug addiction is a global public health crisis, and the design of antiaddiction drugs remains a major challenge due to intricate mechanisms. Since experimental drug screening and optimization are too time-consuming and expensive, there is an urgent need to develop innovative artificial intelligence (AI) methods for addressing the challenge. We tackle this challenge by topology-inferred drug addiction learning (TIDAL), built by integrating multiscale topological Laplacians, a deep bidirectional transformer, and ensemble-assisted neural networks (EANNs). Multiscale topological Laplacians are a novel class of algebraic topology tools that embed molecular topological invariants and algebraic invariants into their harmonic spectra and nonharmonic spectra, respectively. These invariants complement sequence information extracted from a bidirectional transformer. We validate the proposed TIDAL framework on 22 drug addiction-related, 4 hERG, and 12 DAT data sets, which suggests that the proposed TIDAL is a state-of-the-art framework for the modeling and analysis of drug addiction data. We carry out cross-target analysis of the current drug addiction candidates to flag their side effects and identify their repurposing potential. Our analysis reveals drug-mediated linear and bilinear target correlations. Finally, TIDAL is applied to shed light on the relative efficacy, repurposing potential, and potential side effects of 12 existing antiaddiction medications. Our results suggest that TIDAL provides a new computational strategy for pressingly needed antisubstance addiction drug development.
Affiliation(s)
- Zailiang Zhu
- School of Computer Science and Artificial Intelligence, Wuhan Textile University, Wuhan, 430200, P. R. China
- Bozheng Dou
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan, 430200, P. R. China
- Yukang Cao
- School of Computer Science and Artificial Intelligence, Wuhan Textile University, Wuhan, 430200, P. R. China
- Jian Jiang
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan, 430200, P. R. China; Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Yueying Zhu
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan, 430200, P. R. China
- Dong Chen
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Hongsong Feng
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Jie Liu
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan, 430200, P. R. China
- Bengong Zhang
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan, 430200, P. R. China
- Tianshou Zhou
- Key Laboratory of Computational Mathematics, Guangdong Province, and School of Mathematics, Sun Yat-sen University, Guangzhou, 510006, P. R. China
- Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States; Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan 48824, United States; Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan 48824, United States
|
28
|
Feng H, Elladki R, Jiang J, Wei GW. Machine-learning analysis of opioid use disorder informed by MOR, DOR, KOR, NOR and ZOR-based interactome networks. Comput Biol Med 2023; 157:106745. [PMID: 36924727 DOI: 10.1016/j.compbiomed.2023.106745] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 01/12/2023] [Revised: 02/11/2023] [Accepted: 03/04/2023] [Indexed: 03/17/2023]
Abstract
Opioid use disorder (OUD) continuously poses major public health challenges and social implications worldwide, with a dramatic rise of opioid dependence leading to potential abuse. Although a few pharmacological agents have been approved for OUD treatment, the efficacy of said agents for OUD requires further improvement in order to provide safer and more effective pharmacological and psychosocial treatments. Proteins including mu, delta, kappa, nociceptin, and zeta opioid receptors are the direct targets of opioids and play critical roles in therapeutic treatments. The protein-protein interaction (PPI) networks of these receptors increase the complexity of the drug development process for an effective opioid addiction treatment. The report below presents a PPI-network informed machine-learning study of OUD. We have examined more than 500 proteins in the five opioid receptor networks and subsequently collected 74 inhibitor datasets. Machine learning models were constructed by pairing the gradient boosting decision tree (GBDT) algorithm with two advanced natural language processing (NLP)-based autoencoder and Transformer fingerprints for molecules. With these models, we systematically carried out evaluations of the screening and repurposing potential of more than 120,000 drug candidates for four opioid receptors. In addition, absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties were also considered in the screening of potential drug candidates. Our machine-learning tools determined a few inhibitor compounds with desired potency and ADMET properties for nociceptin opioid receptors. Our approach offers a valuable and promising tool for the pharmacological development of OUD treatments.
Affiliation(s)
- Hongsong Feng
- Department of Mathematics, Michigan State University, MI 48824, USA
- Rana Elladki
- Department of Mathematics, Michigan State University, MI 48824, USA
- Jian Jiang
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan, 430200, P. R. China
- Guo-Wei Wei
- Department of Mathematics, Michigan State University, MI 48824, USA; Department of Electrical and Computer Engineering, Michigan State University, MI 48824, USA; Department of Biochemistry and Molecular Biology, Michigan State University, MI 48824, USA
|
29
|
Hayes N, Merkurjev E, Wei GW. Integrating transformer and autoencoder techniques with spectral graph algorithms for the prediction of scarcely labeled molecular data. Comput Biol Med 2023; 153:106479. [PMID: 36610214 PMCID: PMC9868114 DOI: 10.1016/j.compbiomed.2022.106479] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 09/03/2022] [Revised: 10/25/2022] [Accepted: 12/21/2022] [Indexed: 12/24/2022]
Abstract
In molecular and biological sciences, experiments are expensive, time-consuming, and often subject to ethical constraints. Consequently, one often faces the challenging task of predicting desirable properties from small data sets or scarcely-labeled data sets. Although transfer learning can be advantageous, it requires the existence of a related large data set. This work introduces three graph-based models incorporating Merriman-Bence-Osher (MBO) techniques to tackle this challenge. Specifically, graph-based modifications of the MBO scheme are integrated with state-of-the-art techniques, including a home-made transformer and an autoencoder, in order to deal with scarcely-labeled data sets. In addition, a consensus technique is detailed. The proposed models are validated using five benchmark data sets. We also provide a thorough comparison to other competing methods, such as support vector machines, random forests, and gradient boosting decision trees, which are known for their good performance on small data sets. The performances of various methods are analyzed using residue-similarity (R-S) scores and R-S indices. Extensive computational experiments and theoretical analysis show that the new models perform very well even when as little as 1% of the data set is used as labeled data.
Affiliation(s)
- Nicole Hayes
- Department of Mathematics, Michigan State University, MI 48824, USA
- Ekaterina Merkurjev
- Department of Mathematics, Michigan State University, MI 48824, USA; Department of Computational Mathematics, Science and Engineering, Michigan State University, MI 48824, USA
- Guo-Wei Wei
- Department of Mathematics, Michigan State University, MI 48824, USA; Department of Electrical and Computer Engineering, Michigan State University, MI 48824, USA; Department of Biochemistry and Molecular Biology, Michigan State University, MI 48824, USA
|
30
|
Naciri LC, Mastinu M, Crnjar R, Barbarossa IT, Melis M. Automated identification of the genetic variants of TAS2R38 bitter taste receptor with supervised learning. Comput Struct Biotechnol J 2023; 21:1054-1065. [PMID: 38213886 PMCID: PMC10782009 DOI: 10.1016/j.csbj.2023.01.029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2022] [Revised: 01/20/2023] [Accepted: 01/20/2023] [Indexed: 01/23/2023] Open
Abstract
Several studies have focused on the genetic ability to taste the bitter compound 6-n-propylthiouracil (PROP) to assess the inter-individual taste variability in humans, and its effect on food predilections, nutrition, and health. PROP taste sensitivity and that of other chemical molecules throughout the body are mediated by the bitter receptor TAS2R38, and their variability is significantly associated with TAS2R38 genetic variants. We recently automatically identified PROP phenotypes with high precision using Machine Learning (ML). Here we have used Supervised Learning (SL) algorithms to automatically identify TAS2R38 genotypes by using the biological features of eighty-four participants. The catBoost algorithm was the best-suited model for the automatic discrimination of the genotypes. It allowed us to automatically predict the genotypes and precisely define the effectiveness and impact of each feature. The ratings of perceived intensity for PROP solutions (0.32 and 0.032 mM) and the medium taster (MT) category were the most important features in training the model and understanding the difference between genotypes. Our findings suggest that SL may represent a trustworthy and objective tool for identifying TAS2R38 variants which, by reducing the cost and time of molecular analysis, can find wide application in taste physiology and medicine studies.
Affiliation(s)
- Lala Chaimae Naciri
- Department of Biomedical Sciences, University of Cagliari, Monserrato, CA 09042, Italy
- Mariano Mastinu
- Department of Biomedical Sciences, University of Cagliari, Monserrato, CA 09042, Italy
- Roberto Crnjar
- Department of Biomedical Sciences, University of Cagliari, Monserrato, CA 09042, Italy
- Melania Melis
- Department of Biomedical Sciences, University of Cagliari, Monserrato, CA 09042, Italy
|
31
|
Xia S, Zhang D, Zhang Y. Multitask Deep Ensemble Prediction of Molecular Energetics in Solution: From Quantum Mechanics to Experimental Properties. J Chem Theory Comput 2023; 19:10.1021/acs.jctc.2c01024. [PMID: 36607141 PMCID: PMC10323048 DOI: 10.1021/acs.jctc.2c01024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/07/2023]
Abstract
The past few years have witnessed significant advances in developing machine learning methods for molecular energetics predictions, including calculated electronic energies with high-level quantum mechanical methods and experimental properties, such as solvation free energy and logP. Typically, task-specific machine learning models are developed for distinct prediction tasks. In this work, we present a multitask deep ensemble model, sPhysNet-MT-ens5, which can simultaneously and accurately predict electronic energies of molecules in gas, water, and octanol phases, as well as transfer free energies at both calculated and experimental levels. On the calculated data set Frag20-solv-678k, which is developed in this work and contains 678,916 molecular conformations, up to 20 heavy atoms, and their properties calculated at the B3LYP/6-31G* level of theory with continuum solvent models, sPhysNet-MT-ens5 predicts density functional theory (DFT)-level electronic energies directly from force field-optimized geometry within chemical accuracy. On the experimental data sets, sPhysNet-MT-ens5 achieves state-of-the-art performance, predicting experimental hydration free energy with an RMSE of 0.620 kcal/mol on the FreeSolv data set and experimental logP with an RMSE of 0.393 on the PHYSPROP data set. Furthermore, sPhysNet-MT-ens5 also provides a reasonable estimation of model uncertainty, which shows correlations with prediction error. Finally, by analyzing the atomic contributions of its predictions, we find that the developed deep learning model is aware of the chemical environment of each atom by assigning reasonable atomic contributions consistent with our chemical knowledge.
Affiliation(s)
- Song Xia
- Department of Chemistry, New York University, New York, New York 10003, United States
- Dongdong Zhang
- Department of Chemistry, New York University, New York, New York 10003, United States
- Yingkai Zhang
- Department of Chemistry, New York University, New York, New York 10003, United States
- Simons Center for Computational Physical Chemistry at New York University, New York, New York 10003, United States
- NYU-ECNU Center for Computational Chemistry at NYU Shanghai, Shanghai 200062, China
|
32
|
Lansford JL, Barnes BC, Rice BM, Jensen KF. Building Chemical Property Models for Energetic Materials from Small Datasets Using a Transfer Learning Approach. J Chem Inf Model 2022; 62:5397-5410. [PMID: 36240441 DOI: 10.1021/acs.jcim.2c00841] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Indexed: 12/13/2022]
Abstract
For many experimentally measured chemical properties that cannot be directly computed from first principles, the existing physics-based models do not extrapolate well to out-of-sample molecules, and experimental datasets themselves are too small for traditional machine learning (ML) approaches. To overcome these limitations, we apply a transfer learning approach, whereby we simultaneously train a multi-target regression model on a small number of molecules with experimentally measured values and a large number of molecules with related computed properties. We demonstrate this methodology on predicting the experimentally measured impact sensitivity of energetic crystals, finding that both the characteristics of the computed dataset and the model architecture are important to prediction accuracy on the small experimental dataset. Our directed message passing neural network (D-MPNN) ML model using transfer learning outperforms direct-ML and physics-based models on a diverse test set, and the new methods described here are widely applicable to modeling many other structure-property relationships.
Affiliation(s)
- Joshua L Lansford
- U.S. Army Combat Capabilities Development Command (DEVCOM) Army Research Laboratory, Aberdeen Proving Ground, Maryland 21005, United States; Department of Chemical Engineering, MIT, Cambridge, Massachusetts 02139, United States
- Brian C Barnes
- U.S. Army Combat Capabilities Development Command (DEVCOM) Army Research Laboratory, Aberdeen Proving Ground, Maryland 21005, United States
- Betsy M Rice
- U.S. Army Combat Capabilities Development Command (DEVCOM) Army Research Laboratory, Aberdeen Proving Ground, Maryland 21005, United States
- Klavs F Jensen
- Department of Chemical Engineering, MIT, Cambridge, Massachusetts 02139, United States
|
33
|
García-Jacas CR, García-González LA, Martinez-Rios F, Tapia-Contreras IP, Brizuela CA. Handcrafted versus non-handcrafted (self-supervised) features for the classification of antimicrobial peptides: complementary or redundant? Brief Bioinform 2022; 23:6754757. [PMID: 36215083 DOI: 10.1093/bib/bbac428] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 06/22/2022] [Revised: 08/28/2022] [Accepted: 09/02/2022] [Indexed: 12/14/2022] Open
Abstract
Antimicrobial peptides (AMPs) have received a great deal of attention given their potential to become a plausible option to fight multi-drug resistant bacteria as well as other pathogens. Quantitative sequence-activity models (QSAMs) have been helpful in discovering new AMPs because they make it possible to explore a large universe of peptide sequences and help reduce the number of wet lab experiments. A main aspect in the building of QSAMs based on shallow learning is to determine an optimal set of protein descriptors (features) required to discriminate between sequences with different antimicrobial activities. These features are generally handcrafted from peptide sequence datasets that are labeled with specific antimicrobial activities. However, recent developments have shown that unsupervised approaches can be used to determine features that outperform human-engineered (handcrafted) features. Thus, knowing which of these two approaches contributes to a better classification of AMPs is a fundamental question in order to design more accurate models. Here, we present a systematic and rigorous study to compare both types of features. Experimental outcomes show that non-handcrafted features achieve better performance than handcrafted features. However, the experiments also show that an improvement in performance is achieved when both types of features are merged. A relevance analysis reveals that non-handcrafted features have higher information content than handcrafted features, while an interaction-based importance analysis reveals that handcrafted features are more important. These findings suggest that there is complementarity between both types of features. Comparisons with state-of-the-art deep models show that shallow models yield better performance both when fed with non-handcrafted features alone and when fed with non-handcrafted and handcrafted features together.
Affiliation(s)
- César R García-Jacas
- Cátedras CONACYT - Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), 22860 Ensenada, Baja California, México
- Luis A García-González
- Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), 22860 Ensenada, Baja California, México
- Issac P Tapia-Contreras
- Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), 22860 Ensenada, Baja California, México
- Carlos A Brizuela
- Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), 22860 Ensenada, Baja California, México
|
34
|
Glaubitz C, Rothen-Rutishauser B, Lattuada M, Balog S, Petri-Fink A. Designing the ultrasonic treatment of nanoparticle-dispersions via machine learning. NANOSCALE 2022; 14:12940-12950. [PMID: 36043853 PMCID: PMC9477382 DOI: 10.1039/d2nr03240f] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 06/15/2023]
Abstract
Ultrasonication is a widely used and standardized method to redisperse nanopowders in liquids and to homogenize nanoparticle dispersions. One goal of sonication is to disrupt agglomerates without changing the intrinsic physicochemical properties of the primary particles. The outcome of sonication, however, is uncertain most of the time, and quantitative models have been beyond reach. The magnitude of this problem is considerable owing to the fact that the efficiency of sonication depends not only on the parameters of the actual device but also on the physicochemical properties of the particle dispersion itself. As a consequence, sonication suffers from poor reproducibility. To tackle this problem, we propose to involve machine learning. By focusing on four nanoparticle types in aqueous dispersions, we combine supervised machine learning and dynamic light scattering to analyze the aggregate size after sonication, and demonstrate the potential to improve considerably the design and reproducibility of sonication experiments.
Affiliation(s)
- Christina Glaubitz
  - Adolphe Merkle Institute, University of Fribourg, Chemin des Verdiers 4, 1700 Fribourg, Switzerland
- Marco Lattuada
  - Chemistry Department, University of Fribourg, Chemin du Musée 9, 1700 Fribourg, Switzerland
- Sandor Balog
  - Adolphe Merkle Institute, University of Fribourg, Chemin des Verdiers 4, 1700 Fribourg, Switzerland
- Alke Petri-Fink
  - Adolphe Merkle Institute, University of Fribourg, Chemin des Verdiers 4, 1700 Fribourg, Switzerland
  - Chemistry Department, University of Fribourg, Chemin du Musée 9, 1700 Fribourg, Switzerland
|
35
|
Katayama Y, Kobayashi TJ. Comparative Study of Repertoire Classification Methods Reveals Data Efficiency of k-mer Feature Extraction. Front Immunol 2022; 13:797640. [PMID: 35936014 PMCID: PMC9346074 DOI: 10.3389/fimmu.2022.797640] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/19/2021] [Accepted: 06/20/2022] [Indexed: 01/18/2023] Open
Abstract
The repertoire of T cell receptors encodes various types of immunological information. Machine learning is indispensable for decoding such information from repertoire datasets measured by next-generation sequencing (NGS). In particular, the classification of repertoires is the most basic task, which is relevant for a variety of scientific and clinical problems. Supported by the recent appearance of large datasets, efficient but data-expensive methods have been proposed. However, it is unclear whether they can work efficiently when the available sample size is severely restricted, as in practical situations. In this study, we demonstrate that their performance can be impaired substantially below critical sample sizes. To compensate for this drawback, we propose MotifBoost, which exploits the information of short k-mer motifs of TCRs. MotifBoost can perform the classification as efficiently as a deep learning method on large datasets while providing more stable and reliable results on small datasets. We tested MotifBoost on four small datasets covering various conditions, such as Cytomegalovirus (CMV), HIV, α-chain, and β-chain, and it remained consistently stable. We also clarify that the robustness of MotifBoost can be attributed to the efficiency of k-mer motifs as representation features of repertoires. Finally, by comparing the predictions of these methods, we show that the whole sequence identity and sequence motifs encode partially different information and that a combination of such complementary information is necessary for further development of repertoire analysis.
Affiliation(s)
- Yotaro Katayama
  - Graduate School of Engineering, The University of Tokyo, Tokyo, Japan
|
36
|
García-Jacas CR, Pinacho-Castellanos SA, García-González LA, Brizuela CA. Do deep learning models make a difference in the identification of antimicrobial peptides? Brief Bioinform 2022; 23:6563422. [PMID: 35380616 DOI: 10.1093/bib/bbac094] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 12/23/2021] [Revised: 02/16/2022] [Accepted: 02/23/2022] [Indexed: 12/21/2022] Open
Abstract
In the last few decades, antimicrobial peptides (AMPs) have been explored as an alternative to classical antibiotics, which in turn motivated the development of machine learning models to predict antimicrobial activities in peptides. The first generation of these predictors was filled with what is now known as shallow learning-based models. These models require the computation and selection of molecular descriptors to characterize each peptide sequence and train the models. The second generation, known as deep learning-based models, which no longer requires the explicit computation and selection of those descriptors, started to be used in the prediction task of AMPs just four years ago. The superior performance claimed for deep models over shallow models has created a prevalent inertia toward using deep learning to identify AMPs. However, methodological flaws and/or modeling biases in the building of deep models do not support such superiority. Here, we analyze the main pitfalls that led to biased conclusions on the leading performance of deep models. Also, we analyze whether deep models truly contribute to better predictions than shallow models by performing fair studies on different state-of-the-art benchmarking datasets. The experiments reveal that deep models do not outperform shallow models in the classification of AMPs, and that both types of models codify similar chemical information since their predictions are highly similar. Thus, according to the currently available datasets, we conclude that the use of deep learning may not be the most suitable approach to develop models to identify AMPs, mainly because shallow models achieve comparable-to-superior performance and are simpler (Ockham's razor principle). Even so, we suggest the use of deep learning only when its capabilities lead to significantly better performance gains worth the additional computational cost.
Affiliation(s)
- César R García-Jacas
  - Cátedras CONACYT - Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), 22860 Ensenada, Baja California, México
- Sergio A Pinacho-Castellanos
  - Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), 22860 Ensenada, Baja California, México; Centro de Investigación y Desarrollo de Tecnología Digital (CITEDI), Instituto Politécnico Nacional (IPN), 22435 Tijuana, Baja California, México
- Luis A García-González
  - Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), 22860 Ensenada, Baja California, México
- Carlos A Brizuela
  - Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), 22860 Ensenada, Baja California, México
|
37
|
Jiang Y, Xiong Z, Zhao W, Zhang J, Guo Y, Li G, Li Z. Computed tomography radiomics-based distinction of invasive adenocarcinoma from minimally invasive adenocarcinoma manifesting as pure ground-glass nodules with bubble-like signs. Gen Thorac Cardiovasc Surg 2022; 70:880-890. [PMID: 35301662 DOI: 10.1007/s11748-022-01801-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/17/2021] [Accepted: 03/03/2022] [Indexed: 11/04/2022]
Abstract
BACKGROUND: To explore an effective model based on radiomics features extracted from nonenhanced computed tomography (CT) images to distinguish invasive adenocarcinoma (IAC) from minimally invasive adenocarcinoma (MIA) presenting as pure ground-glass nodules (pGGNs) with bubble-like (B-pGGNs) signs. PATIENTS AND METHODS: We retrospectively reviewed 511 nodules (MIA, n = 288; IAC, n = 223) between November 2012 and June 2018 from almost all pGGNs pathologically confirmed MIA or IAC. Eventually, a total of 109 B-pGGNs (MIA, n = 55; IAC, n = 54) from 109 patients fulfilling the criteria were randomly assigned to the training and test cluster at a ratio of 7:3. The gradient boosting decision tree (GBDT) method and logistic regression (LR) analysis were applied to feature selection (radiomics, semantic, and conventional CT features). LR was performed to construct three models (the conventional, radiomics and combined model). The performance of the predictive models was evaluated using the area under the curve (AUC). RESULTS: The radiomics model had good AUCs of 0.947 in the training cluster and of 0.945 in the test cluster. The combined model produced an AUC of 0.953 in the training cluster and of 0.945 in the test cluster. The combined model yielded no performance improvement (vs. the radiomics model). The rad_score was the only independent predictor of invasiveness. CONCLUSION: The radiomics model showed excellent predictive performance in discriminating IAC from MIA presenting as B-pGGNs and may provide a necessary reference for extending clinical practice.
Affiliation(s)
- Yining Jiang
  - Department of Radiology, The First Affiliated Hospital of Dalian Medical University, Dalian, China
- Ziqi Xiong
  - Department of Radiology, The First Affiliated Hospital of Dalian Medical University, Dalian, China
- Wenjing Zhao
  - Department of Radiology, The First Affiliated Hospital of Dalian Medical University, Dalian, China
- Jingyu Zhang
  - Department of Radiology, The First Affiliated Hospital of Dalian Medical University, Dalian, China
- Yan Guo
  - GE Healthcare, Beijing, China
- Guosheng Li
  - Department of Pathology, The First Affiliated Hospital of Dalian Medical University, Dalian, China
- Zhiyong Li
  - Department of Radiology, The First Affiliated Hospital of Dalian Medical University, Dalian, China; Dalian Engineering Research Centre for Artificial Intelligence in Medical Imaging, Dalian, China
|
38
|
A machine learning model for predicting deterioration of COVID-19 inpatients. Sci Rep 2022; 12:2630. [PMID: 35173197 PMCID: PMC8850417 DOI: 10.1038/s41598-022-05822-7] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 07/20/2021] [Accepted: 01/19/2022] [Indexed: 01/22/2023] Open
Abstract
The COVID-19 pandemic has been spreading worldwide since December 2019, presenting an urgent threat to global health. Due to the limited understanding of disease progression and of the risk factors for the disease, it is a clinical challenge to predict which hospitalized patients will deteriorate. Moreover, several studies suggested that taking early measures for treating patients at risk of deterioration could prevent or lessen condition worsening and the need for mechanical ventilation. We developed a predictive model for early identification of patients at risk for clinical deterioration by retrospective analysis of electronic health records of COVID-19 inpatients at the two largest medical centers in Israel. Our model employs machine learning methods and uses routine clinical features such as vital signs, lab measurements, demographics, and background disease. Deterioration was defined as a high NEWS2 score adjusted to COVID-19. In the prediction of deterioration within the next 7–30 h, the model achieved an area under the ROC curve of 0.84 and an area under the precision-recall curve of 0.74. In external validation on data from a different hospital, it achieved values of 0.76 and 0.7, respectively.
|
39
|
Naciri LC, Mastinu M, Crnjar R, Tomassini Barbarossa I, Melis M. Automated Classification of 6-n-Propylthiouracil Taster Status with Machine Learning. Nutrients 2022; 14:252. [PMID: 35057433 PMCID: PMC8778915 DOI: 10.3390/nu14020252] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/03/2021] [Revised: 12/31/2021] [Accepted: 01/03/2022] [Indexed: 12/03/2022] Open
Abstract
Several studies have used taste sensitivity to 6-n-propylthiouracil (PROP) to evaluate interindividual taste variability and its impact on food preferences, nutrition, and health. We used a supervised learning (SL) approach for the automatic identification of the PROP taster categories (super taster (ST); medium taster (MT); and non-taster (NT)) of 84 subjects (aged 18-40 years). Biological features determined from subjects were included for the training system. Results showed that SL enables the automatic identification of objective PROP taster status, with high precision (97%). The biological features were classified in order of importance in facilitating learning and as prediction factors. The ratings of perceived taste intensity for PROP paper disks (50 mM) and PROP solution (3.2 mM), along with fungiform papilla density, were the most important features, and high estimated values pushed toward ST prediction, while low values leaned toward NT prediction. Furthermore, TAS2R38 genotypes were significant features (AVI/AVI, PAV/PAV, and PAV/AVI to classify NTs, STs, and MTs, respectively). These results, in showing that the SL approach enables an automatic, immediate, scalable, and high-precision classification of PROP taster status, suggest that it may represent an objective and reliable tool in taste physiology studies, with applications ranging from basic science and medicine to food sciences.
Affiliation(s)
- Iole Tomassini Barbarossa
  - Department of Biomedical Sciences, University of Cagliari, Monserrato, 09042 Cagliari, Italy; (L.C.N.); (M.M.); (R.C.); (M.M.)
|
40
|
Iyer J, Jalid F, Khan TS, Haider MA. Tracing the reactivity of single atom alloys for ethanol dehydrogenation using ab initio simulations. REACT CHEM ENG 2022. [DOI: 10.1039/d1re00396h] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/21/2022]
Abstract
A full DFT parameterized MKM is used to accurately predict the reactivity trend for ethanol dehydrogenation reaction on SAAs.
Affiliation(s)
- Jayendran Iyer
  - Renewable Energy and Chemicals Laboratory, Department of Chemical Engineering, Indian Institute of Technology Delhi, Hauz Khas, Delhi, 110016, India
- Fatima Jalid
  - Department of Chemical Engineering, National Institute of Technology Srinagar, Srinagar, Jammu and Kashmir, 190006, India
- Tuhin S. Khan
  - Light Stock Processing Division, CSIR-Indian Institute of Petroleum, Dehradun, Uttarakhand, 248005, India
- M. Ali Haider
  - Renewable Energy and Chemicals Laboratory, Department of Chemical Engineering, Indian Institute of Technology Delhi, Hauz Khas, Delhi, 110016, India
|
41
|
Baker RS, Hawn A. Algorithmic Bias in Education. INTERNATIONAL JOURNAL OF ARTIFICIAL INTELLIGENCE IN EDUCATION 2021. [DOI: 10.1007/s40593-021-00285-9] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Indexed: 01/12/2023]
|
42
|
Wang Y, Wang B, Jiang J, Guo J, Lai J, Lian XY, Wu J. Multitask CapsNet: An Imbalanced Data Deep Learning Method for Predicting Toxicants. ACS OMEGA 2021; 6:26545-26555. [PMID: 34661009 PMCID: PMC8515573 DOI: 10.1021/acsomega.1c03842] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/20/2021] [Accepted: 09/14/2021] [Indexed: 05/17/2023]
Abstract
Drug development has a high failure rate, with safety properties constituting a considerable challenge. To reduce risk, in silico tools, including various machine learning methods, have been applied for toxicity prediction. However, these approaches often confront a serious problem: the training data sets are usually biased (imbalanced positive and negative samples), which would result in model training difficulty and unsatisfactory prediction accuracy. Multitask networks obtained significantly better predictive accuracies than single-task methods, and capsule neural networks showed excellent performance in sparse data sets in previous studies. In this study, we developed a new multitask framework based on a capsule neural network (multitask CapsNet) to measure 12 different toxic effects simultaneously. We found that multitask CapsNet excelled in toxicity prediction and outperformed many other computational approaches using the multitask strategy. Only after training on biased data sets did multitask CapsNet achieve significantly improved prediction accuracy on the Tox21 Data Challenge, which gave the largest ratio of highest accuracy (8/12) among compared models. Our model gave a prediction accuracy of 96.6% for the target NR.PPAR.gamma, whose ratio of negative to positive samples was up to 36:1. These results suggested that multitask CapsNet could overcome the bias problems and would provide a novel, accurate, and efficient approach for predicting the toxicities of compounds.
Affiliation(s)
- Yiwei Wang
  - School of Preclinical Medicine, Southwest Medical University, Luzhou 646000, China
- Binyou Wang
  - School of Pharmacy, Southwest Medical University, Luzhou 646000, China
- Jie Jiang
  - School of Preclinical Medicine, Southwest Medical University, Luzhou 646000, China
- Jianmin Guo
  - School of Preclinical Medicine, Southwest Medical University, Luzhou 646000, China
- Jia Lai
  - School of Pharmacy, Southwest Medical University, Luzhou 646000, China
- Xiao-Yuan Lian
  - School of Pharmacy, Zhejiang University, Hangzhou 310011, China
- Jianming Wu
  - Key Laboratory of Medical Electrophysiology, Ministry of Education of China, Medical Key Laboratory for Drug Discovery and Druggability Evaluation of Sichuan Province, Luzhou Key Laboratory of Activity Screening and Druggability Evaluation for Chinese Materia Medica, Luzhou 646000, China
|
43
|
Chen D, Gao K, Nguyen DD, Chen X, Jiang Y, Wei GW, Pan F. Algebraic graph-assisted bidirectional transformers for molecular property prediction. Nat Commun 2021; 12:3521. [PMID: 34112777 PMCID: PMC8192505 DOI: 10.1038/s41467-021-23720-w] [Citation(s) in RCA: 57] [Impact Index Per Article: 19.0] [Received: 01/21/2021] [Accepted: 05/06/2021] [Indexed: 11/09/2022] Open
Abstract
The ability to predict molecular properties is of great significance to drug discovery, human health, and environmental protection. Despite considerable efforts, quantitative prediction of various molecular properties remains a challenge. Although some machine learning models, such as bidirectional encoder from transformer, can incorporate massive unlabeled molecular data into molecular representations via a self-supervised learning strategy, they neglect three-dimensional (3D) stereochemical information. Algebraic graph, specifically, element-specific multiscale weighted colored algebraic graph, embeds complementary 3D molecular information into graph invariants. We propose an algebraic graph-assisted bidirectional transformer (AGBT) framework by fusing representations generated by algebraic graph and bidirectional transformer, as well as a variety of machine learning algorithms, including decision trees, multitask learning, and deep neural networks. We validate the proposed AGBT framework on eight molecular datasets, involving quantitative toxicity, physical chemistry, and physiology datasets. Extensive numerical experiments have shown that AGBT is a state-of-the-art framework for molecular property prediction.
Affiliation(s)
- Dong Chen
  - School of Advanced Materials, Peking University, Shenzhen Graduate School, Shenzhen, China
  - Department of Mathematics, Michigan State University, East Lansing, MI, USA
- Kaifu Gao
  - Department of Mathematics, Michigan State University, East Lansing, MI, USA
- Duc Duy Nguyen
  - Department of Mathematics, University of Kentucky, Lexington, KY, USA
- Xin Chen
  - School of Advanced Materials, Peking University, Shenzhen Graduate School, Shenzhen, China
- Yi Jiang
  - School of Advanced Materials, Peking University, Shenzhen Graduate School, Shenzhen, China
- Guo-Wei Wei
  - Department of Mathematics, Michigan State University, East Lansing, MI, USA
  - Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI, USA
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, USA
- Feng Pan
  - School of Advanced Materials, Peking University, Shenzhen Graduate School, Shenzhen, China
|
44
|
Deng D, Chen X, Zhang R, Lei Z, Wang X, Zhou F. XGraphBoost: Extracting Graph Neural Network-Based Features for a Better Prediction of Molecular Properties. J Chem Inf Model 2021; 61:2697-2705. [PMID: 34009965 DOI: 10.1021/acs.jcim.0c01489] [Citation(s) in RCA: 31] [Impact Index Per Article: 10.3] [Indexed: 01/02/2023]
Abstract
Determining the properties of chemical molecules is essential for screening candidates similar to a specific drug. These candidate molecules are further evaluated for their target binding affinities, side effects, target missing probabilities, etc. Conventional machine learning algorithms demonstrated satisfying prediction accuracies of molecular properties. A molecule cannot be directly loaded into a machine learning model, and a set of engineered features needs to be designed and calculated from a molecule. Such hand-crafted features rely heavily on the experiences of the investigating researchers. The concept of graph neural networks (GNNs) was recently introduced to describe the chemical molecules. The features may be automatically and objectively extracted from the molecules through various types of GNNs, e.g., GCN (graph convolution network), GGNN (gated graph neural network), DMPNN (directed message passing neural network), etc. However, the training of a stable GNN model requires a huge number of training samples and a large amount of computing power, compared with the conventional machine learning strategies. This study proposed the integrated framework XGraphBoost to extract the features using a GNN and build an accurate prediction model of molecular properties using the classifier XGBoost. The proposed framework XGraphBoost fully inherits the merits of the GNN-based automatic molecular feature extraction and XGBoost-based accurate prediction performance. Both classification and regression problems were evaluated using the framework XGraphBoost. The experimental results strongly suggest that XGraphBoost may facilitate the efficient and accurate predictions of various molecular properties. The source code is freely available to academic users at https://github.com/chenxiaowei-vincent/XGraphBoost.git.
Affiliation(s)
- Daiguo Deng
  - Fermion Technology Co., Ltd., Guangzhou, Guangdong 510000, P.R. China
- Xiaowei Chen
  - Fermion Technology Co., Ltd., Guangzhou, Guangdong 510000, P.R. China
- Ruochi Zhang
  - Fermion Technology Co., Ltd., Guangzhou, Guangdong 510000, P.R. China; College of Computer Science and Technology, and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin 130012, P.R. China
- Zengrong Lei
  - Fermion Technology Co., Ltd., Guangzhou, Guangdong 510000, P.R. China
- Xiaojian Wang
  - State Key Laboratory of Bioactive Substances and Functions of Natural Medicines, Institute of Materia Medica, Peking Union Medical College and Chinese Academy of Medical Sciences, Beijing 100050, P.R. China
- Fengfeng Zhou
  - College of Computer Science and Technology, and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin 130012, P.R. China
|
45
|
Szocinski T, Nguyen DD, Wei GW. AweGNN: Auto-parametrized weighted element-specific graph neural networks for molecules. Comput Biol Med 2021; 134:104460. [PMID: 34020133 DOI: 10.1016/j.compbiomed.2021.104460] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/11/2021] [Revised: 04/23/2021] [Accepted: 04/26/2021] [Indexed: 11/29/2022]
Abstract
While automated feature extraction has had tremendous success in many deep learning algorithms for image analysis and natural language processing, it does not work well for data involving complex internal structures, such as molecules. Data representations via advanced mathematics, including algebraic topology, differential geometry, and graph theory, have demonstrated superiority in a variety of biomolecular applications; however, their performance is often dependent on manual parametrization. This work introduces the auto-parametrized weighted element-specific graph neural network, dubbed AweGNN, to overcome the obstacle of this tedious parametrization process while also being a suitable technique for automated feature extraction on these internally complex biomolecular data sets. The AweGNN is a neural network model based on geometric-graph features of element-pair interactions, with its graph parameters being updated throughout the training, which results in what we call a network-enabled automatic representation (NEAR). To enhance the predictions with small data sets, we construct multi-task (MT) AweGNN models in addition to single-task (ST) AweGNN models. The proposed methods are applied to various benchmark data sets, including four data sets for quantitative toxicity analysis and another data set for solvation prediction. Extensive numerical tests show that AweGNN models can achieve state-of-the-art performance in molecular property predictions.
Affiliation(s)
- Timothy Szocinski
  - Department of Mathematics, Michigan State University, MI, 48824, USA
- Duc Duy Nguyen
  - Department of Mathematics, University of Kentucky, KY, 40506, USA
- Guo-Wei Wei
  - Department of Mathematics, Michigan State University, MI, 48824, USA; Department of Biochemistry and Molecular Biology, Michigan State University, MI, 48824, USA; Department of Electrical and Computer Engineering, Michigan State University, MI, 48824, USA
|
46
|
Mamprin M, Zelis JM, Tonino PAL, Zinger S, de With PHN. Decision Trees for Predicting Mortality in Transcatheter Aortic Valve Implantation. Bioengineering (Basel) 2021; 8:bioengineering8020022. [PMID: 33572063 PMCID: PMC7915084 DOI: 10.3390/bioengineering8020022] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 12/27/2020] [Revised: 01/27/2021] [Accepted: 02/04/2021] [Indexed: 12/23/2022] Open
Abstract
Current prognostic risk scores in cardiac surgery do not benefit yet from machine learning (ML). This research aims to create a machine learning model to predict one-year mortality of a patient after transcatheter aortic valve implantation (TAVI). We adopt a modern gradient boosting on decision trees classifier (GBDTs), specifically designed for categorical features. In combination with a recent technique for model interpretations, we developed a feature analysis and selection stage, enabling the identification of the most important features for the prediction. We base our prediction model on the most relevant features, after interpreting and discussing the feature analysis results with clinical experts. We validated our model on 270 consecutive TAVI cases, reaching a C-statistic of 0.83 with CI [0.82, 0.84]. The model has achieved a positive predictive value ranging from 57% to 64%, suggesting that the patient selection made by the heart team of professionals can be further improved by taking into consideration the clinical data we identified as important and by exploiting ML approaches in the development of clinical risk scores. Our approach has shown promising predictive potential also with respect to widespread prognostic risk scores, such as logistic European system for cardiac operative risk evaluation (EuroSCORE II) and the society of thoracic surgeons (STS) risk score, which are broadly adopted by cardiologists worldwide.
Affiliation(s)
- Marco Mamprin
  - Department of Electrical Engineering, Eindhoven University of Technology, 5612 AJ Eindhoven, The Netherlands; (S.Z.); (P.H.N.d.W.)
- Jo M. Zelis
  - Department of Cardiology, Catharina Hospital, 5623 EJ Eindhoven, The Netherlands; (J.M.Z.); (P.A.L.T.)
- Pim A. L. Tonino
  - Department of Cardiology, Catharina Hospital, 5623 EJ Eindhoven, The Netherlands; (J.M.Z.); (P.A.L.T.)
- Sveta Zinger
  - Department of Electrical Engineering, Eindhoven University of Technology, 5612 AJ Eindhoven, The Netherlands; (S.Z.); (P.H.N.d.W.)
- Peter H. N. de With
  - Department of Electrical Engineering, Eindhoven University of Technology, 5612 AJ Eindhoven, The Netherlands; (S.Z.); (P.H.N.d.W.)
|
47
|
Jain S, Siramshetty VB, Alves VM, Muratov EN, Kleinstreuer N, Tropsha A, Nicklaus MC, Simeonov A, Zakharov AV. Large-Scale Modeling of Multispecies Acute Toxicity End Points Using Consensus of Multitask Deep Learning Methods. J Chem Inf Model 2021; 61:653-663. [PMID: 33533614 DOI: 10.1021/acs.jcim.0c01164] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Indexed: 01/01/2023]
Abstract
Computational methods to predict molecular properties regarding safety and toxicology represent alternative approaches to expedite drug development, screen environmental chemicals, and thus significantly reduce associated time and costs. There is a strong need and interest in the development of computational methods that yield reliable predictions of toxicity, and many approaches, including the recently introduced deep neural networks, have been leveraged towards this goal. Herein, we report on the collection, curation, and integration of data from the public data sets that were the source of the ChemIDplus database for systemic acute toxicity. These efforts generated the largest publicly available such data set comprising > 80,000 compounds measured against a total of 59 acute systemic toxicity end points. This data was used for developing multiple single- and multitask models utilizing random forest, deep neural networks, convolutional, and graph convolutional neural network approaches. For the first time, we also reported the consensus models based on different multitask approaches. To the best of our knowledge, prediction models for 36 of the 59 end points have never been published before. Furthermore, our results demonstrated a significantly better performance of the consensus model obtained from three multitask learning approaches that particularly predicted the 29 smaller tasks (less than 300 compounds) better than other models developed in the study. The curated data set and the developed models have been made publicly available at https://github.com/ncats/ld50-multitask, https://predictor.ncats.io/, and https://cactus.nci.nih.gov/download/acute-toxicity-db (data set only) to support regulatory and research applications.
Affiliation(s)
- Sankalp Jain
  - National Center for Advancing Translational Sciences (NCATS), National Institutes of Health, 9800 Medical Center Drive, Rockville, Maryland 20850, United States
- Vishal B Siramshetty
  - National Center for Advancing Translational Sciences (NCATS), National Institutes of Health, 9800 Medical Center Drive, Rockville, Maryland 20850, United States
- Vinicius M Alves
  - UNC Eshelman School of Pharmacy, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, United States
- Eugene N Muratov
  - UNC Eshelman School of Pharmacy, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, United States
- Nicole Kleinstreuer
  - Division of Intramural Research, Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, 111 T.W. Alexander Drive, Durham, North Carolina 27709, United States; National Toxicology Program Interagency Center for the Evaluation of Alternative Toxicological Methods, National Institute of Environmental Health Sciences, 111 T.W. Alexander Drive, Durham, North Carolina 27709, United States
- Alexander Tropsha
  - UNC Eshelman School of Pharmacy, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, United States
- Marc C Nicklaus
  - Computer-Aided Drug Design (CADD) Group, Chemical Biology Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes of Health, DHHS, NCI-Frederick, 376 Boyles Street, Frederick, Maryland 21702, United States
- Anton Simeonov
  - National Center for Advancing Translational Sciences (NCATS), National Institutes of Health, 9800 Medical Center Drive, Rockville, Maryland 20850, United States
- Alexey V Zakharov
  - National Center for Advancing Translational Sciences (NCATS), National Institutes of Health, 9800 Medical Center Drive, Rockville, Maryland 20850, United States
48
Antelo-Collado A, Carrasco-Velar R, García-Pedrajas N, Cerruela-García G. Effective Feature Selection Method for Class-Imbalance Datasets Applied to Chemical Toxicity Prediction. J Chem Inf Model 2020; 61:76-94. [PMID: 33350301 DOI: 10.1021/acs.jcim.0c00908] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/29/2022]
Abstract
During the drug development process, it is common to carry out toxicity tests and adverse effect studies, which are essential to guarantee patient safety and the success of the research. The use of in silico quantitative structure-activity relationship (QSAR) approaches for this task involves processing a huge amount of data that, in many cases, have an imbalanced distribution of active and inactive samples. This is usually termed the class-imbalance problem and may have a significant negative effect on the performance of the learned models. The performance of feature selection (FS) for QSAR models is usually damaged by the class-imbalance nature of the involved datasets. This paper proposes the use of an FS method focused on dealing with the class-imbalance problems. The method is based on the use of FS ensembles constructed by boosting and using two well-known FS methods, fast clustering-based FS and the fast correlation-based filter. The experimental results demonstrate the efficiency of the proposal in terms of the classification performance compared to standard methods. The proposal can be extended to other FS methods and applied to other problems in cheminformatics.
Affiliation(s)
- Nicolás García-Pedrajas
  - Department of Computing and Numerical Analysis, University of Córdoba, Campus de Rabanales, Albert Einstein Building, E-14071 Córdoba, Spain
- Gonzalo Cerruela-García
  - Department of Computing and Numerical Analysis, University of Córdoba, Campus de Rabanales, Albert Einstein Building, E-14071 Córdoba, Spain
49
Elbadawi M, Gaisford S, Basit AW. Advanced machine-learning techniques in drug discovery. Drug Discov Today 2020; 26:769-777. [PMID: 33290820 DOI: 10.1016/j.drudis.2020.12.003] [Citation(s) in RCA: 50] [Impact Index Per Article: 12.5] [Received: 08/24/2020] [Revised: 11/16/2020] [Accepted: 12/02/2020] [Indexed: 01/20/2023]
Abstract
The popularity of machine learning (ML) across drug discovery continues to grow, yielding impressive results. As the use of ML techniques increases, their limitations also become apparent. These limitations include the need for big data, sparsity in data, and a lack of interpretability. It has also become apparent that the techniques are not truly autonomous, requiring retraining even after deployment. In this review, we detail the use of advanced techniques to circumvent these challenges, with examples drawn from drug discovery and allied disciplines. In addition, we present emerging techniques and their potential role in drug discovery. The techniques presented herein are anticipated to expand the applicability of ML in drug discovery.
Affiliation(s)
- Moe Elbadawi
  - Department of Pharmaceutics, UCL School of Pharmacy, University College London, 29-39 Brunswick Square, London, WC1N 1AX, UK
- Simon Gaisford
  - Department of Pharmaceutics, UCL School of Pharmacy, University College London, 29-39 Brunswick Square, London, WC1N 1AX, UK; FabRx Ltd, 3 Romney Road, Ashford, TN24 0RW, UK
- Abdul W Basit
  - Department of Pharmaceutics, UCL School of Pharmacy, University College London, 29-39 Brunswick Square, London, WC1N 1AX, UK; FabRx Ltd, 3 Romney Road, Ashford, TN24 0RW, UK
50
Electric Vehicles Plug-In Duration Forecasting Using Machine Learning for Battery Optimization. ENERGIES 2020. [DOI: 10.3390/en13164208] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 11/16/2022]
Abstract
The aging of rechargeable batteries, with its associated replacement costs, is one of the main issues limiting the diffusion of electric vehicles (EVs) as the future transportation infrastructure. An effective way to mitigate battery aging is to act on its charge cycles, which are more controllable than discharge cycles, by implementing so-called battery-aware charging protocols. Since one of the main factors affecting battery aging is its average state of charge (SOC), these protocols try to minimize the standby time, i.e., the time interval between the end of the actual charge and the moment when the EV is unplugged from the charging station. Doing so while still ensuring that the EV is fully charged when needed (in order to achieve a satisfying user experience) requires a “just-in-time” charging protocol, which completes exactly at the plug-out time. This type of protocol can only be achieved if an estimate of the expected plug-in duration is available. While many previous works have stressed the importance of having this estimate, they have either used straightforward forecasting methods, or assumed that the plug-in duration was directly indicated by the user, which could lead to sub-optimal results. In this paper, we evaluate the effectiveness of a more advanced forecasting approach based on machine learning (ML). With experiments on a public dataset containing data from domestic EV charge points, we show that a simple tree-based ML model, trained on each charge station based on its users’ behaviour, can reduce the forecasting error by up to 4× compared to the simple predictors used in previous works. This, in turn, leads to an improvement of up to 50% in a combined aging-quality of service metric.
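The relationship between the forecast plug-in duration, the just-in-time start of charging, and the resulting standby time can be illustrated with a short sketch. It is not the paper's protocol: it assumes constant-power charging, times expressed in hours, and hypothetical function and parameter names.

```python
def just_in_time_start(plug_in_h, predicted_duration_h, energy_needed_kwh, charger_kw):
    """Return the hour at which charging should start so that it
    completes at the predicted plug-out time.

    ASSUMPTION: charging proceeds at constant power `charger_kw`.
    """
    charge_time_h = energy_needed_kwh / charger_kw
    plug_out_h = plug_in_h + predicted_duration_h
    # Charging cannot start before the vehicle is plugged in.
    return max(plug_in_h, plug_out_h - charge_time_h)


def standby_time(charge_start_h, plug_out_h, energy_needed_kwh, charger_kw):
    """Hours between the end of charging and plug-out (never negative)."""
    charge_end_h = charge_start_h + energy_needed_kwh / charger_kw
    return max(0.0, plug_out_h - charge_end_h)


# Hypothetical session: plugged in at hour 18, forecast to stay 12 h,
# needing 21 kWh from a 7 kW charger (3 h of charging).
start = just_in_time_start(18.0, 12.0, 21.0, 7.0)
standby_immediate = standby_time(18.0, 30.0, 21.0, 7.0)
standby_jit = standby_time(start, 30.0, 21.0, 7.0)
```

In this hypothetical session, charging immediately leaves the battery full for 9 h before plug-out, whereas the just-in-time start removes the standby time entirely, provided the duration forecast is accurate.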