1
|
Saifi I, Bhat BA, Hamdani SS, Bhat UY, Lobato-Tapia CA, Mir MA, Dar TUH, Ganie SA. Artificial intelligence and cheminformatics tools: a contribution to the drug development and chemical science. J Biomol Struct Dyn 2024; 42:6523-6541. [PMID: 37434311 DOI: 10.1080/07391102.2023.2234039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/12/2023] [Accepted: 07/03/2023] [Indexed: 07/13/2023]
Abstract
In the ever-evolving field of drug discovery, the integration of Artificial Intelligence (AI) and Machine Learning (ML) with cheminformatics has proven to be a powerful combination. Cheminformatics, which combines the principles of computer science and chemistry, is used to extract chemical information and search compound databases, while the application of AI and ML allows for the identification of potential hit compounds, optimization of synthesis routes, and prediction of drug efficacy and toxicity. This collaborative approach has led to the discovery, preclinical evaluations and approval of over 70 drugs in recent years. To aid researchers in the pursuit of new drugs, this article presents a comprehensive list of databases, datasets, predictive and generative models, scoring functions and web platforms that have been launched between 2021 and 2022. These resources provide a wealth of information and tools for computer-assisted drug development, and are a valuable asset for those working in the field of cheminformatics. Overall, the integration of AI, ML and cheminformatics has greatly advanced the drug discovery process and continues to hold great potential for the future. As new resources and technologies become available, we can expect to see even more groundbreaking discoveries and advancements in these fields.Communicated by Ramaswamy H. Sarma.
Collapse
Affiliation(s)
- Ifra Saifi
- Chaudhary Charan Singh University, Meerut, Uttar Pradesh, India
| | - Basharat Ahmad Bhat
- Department of Bioresources, School of Biological Sciences, University of Kashmir, Srinagar, J&K, India
| | - Syed Suhail Hamdani
- Department of Bioresources, School of Biological Sciences, University of Kashmir, Srinagar, J&K, India
| | - Umar Yousuf Bhat
- Department of Zoology, School of Biological Sciences, University of Kashmir, Srinagar, J&K, India
| | | | - Mushtaq Ahmad Mir
- Department of Clinical Laboratory Sciences, College of Applied Medical Science, King Khalid University, KSA, Saudi Arabia
| | - Tanvir Ul Hasan Dar
- Department of Biotechnology, School of Biosciences and Biotechnology, BGSB University, Rajouri, India
| | - Showkat Ahmad Ganie
- Department of Clinical Biochemistry, School of Biological Sciences, University of Kashmir, Srinagar, J&K, India
| |
Collapse
|
2
|
Ouyang H, Liu W, Tao J, Luo Y, Zhang W, Zhou J, Geng S, Zhang C. ChemReco: automated recognition of hand-drawn carbon-hydrogen-oxygen structures using deep learning. Sci Rep 2024; 14:17126. [PMID: 39054356 PMCID: PMC11272916 DOI: 10.1038/s41598-024-67496-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/24/2024] [Accepted: 07/11/2024] [Indexed: 07/27/2024] Open
Abstract
Chemical molecular structures are a direct and convenient means of expressing chemical knowledge, playing a vital role in academic communication. In chemistry, hand drawing is a common task for students and researchers. If we can convert hand-drawn chemical molecular structures into machine-readable formats, like SMILES encoding, computers can efficiently process and analyze these structures, significantly enhancing the efficiency of chemical research. Furthermore, with the progress of educational technology, automated grading is gaining popularity. When machines automatically recognize chemical molecular structures and assess the correctness of the drawings, it offers great convenience to teachers. We created ChemReco, a tool designed to identify chemical molecular structures involving three atoms: C, H, and O, providing convenience for chemical researchers. Currently, there are limited studies on hand-drawn chemical molecular structures. Therefore, the primary focus of this paper is constructing datasets. We propose a synthetic image method to rapidly generate images resembling hand-drawn chemical molecular structures, enhancing dataset acquisition efficiency. Regarding model selection, the hand-drawn chemical molecule structural recognition model developed in this article achieves a final recognition accuracy of 96.90%. This model employs the encoder-decoder architecture of EfficientNet + Transformer, demonstrating superior performance compared to other encoder-decoder combinations.
Collapse
Affiliation(s)
- Hengjie Ouyang
- School of Informatics, Hunan University of Chinese Medicine, Changsha, 410208, Hunan, People's Republic of China
| | - Wei Liu
- School of Informatics, Hunan University of Chinese Medicine, Changsha, 410208, Hunan, People's Republic of China.
| | - Jiajun Tao
- School of Informatics, Hunan University of Chinese Medicine, Changsha, 410208, Hunan, People's Republic of China
| | - Yanghong Luo
- School of Informatics, Hunan University of Chinese Medicine, Changsha, 410208, Hunan, People's Republic of China
| | - Wanjia Zhang
- School of Informatics, Hunan University of Chinese Medicine, Changsha, 410208, Hunan, People's Republic of China
| | - Jiayu Zhou
- School of Informatics, Hunan University of Chinese Medicine, Changsha, 410208, Hunan, People's Republic of China
| | - Shuqi Geng
- School of Informatics, Hunan University of Chinese Medicine, Changsha, 410208, Hunan, People's Republic of China
| | - Chengpeng Zhang
- School of Informatics, Hunan University of Chinese Medicine, Changsha, 410208, Hunan, People's Republic of China
| |
Collapse
|
3
|
Qu G, Song Q, Fang T. The artistic image processing for visual healing in smart city. Sci Rep 2024; 14:16846. [PMID: 39039163 PMCID: PMC11263401 DOI: 10.1038/s41598-024-68082-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/27/2023] [Accepted: 07/19/2024] [Indexed: 07/24/2024] Open
Abstract
This study investigates the processing methods of artistic images within the context of Smart city (SC) initiatives, focusing on the visual healing effects of artistic image processing to enhance urban residents' mental health and quality of life. Firstly, it examines the role of artistic image processing techniques in visual healing. Secondly, deep learning technology is introduced and improved, proposing the overlapping segmentation vision transformer (OSViT) for image blocks, and further integrating the bidirectional long short-term memory (BiLSTM) algorithm. An innovative artistic image processing and classification recognition model based on OSViT-BiLSTM is then constructed. Finally, the visual healing effect of the processed art images in different scenes is analyzed. The results demonstrate that the proposed model achieves a classification recognition accuracy of 92.9% for art images, which is at least 6.9% higher than that of other existing model algorithms. Additionally, over 90% of users report satisfaction with the visual healing effects of the artistic images. Therefore, it is found that the proposed model can accurately identify artistic images, enhance their beauty and artistry, and improve the visual healing effect. This study provides an experimental reference for incorporating visual healing into SC initiatives.
Collapse
Affiliation(s)
- Guangfu Qu
- School of Film, Shandong University of Arts, Jinan, 250300, China.
| | - Qian Song
- Sichuan Fine Arts Institute, Chongqing, 400053, China
| | - Ting Fang
- Sichuan Fine Arts Institute, Chongqing, 400053, China.
| |
Collapse
|
4
|
Fan V, Qian Y, Wang A, Wang A, Coley CW, Barzilay R. OpenChemIE: An Information Extraction Toolkit for Chemistry Literature. J Chem Inf Model 2024; 64:5521-5534. [PMID: 38950894 DOI: 10.1021/acs.jcim.4c00572] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/03/2024]
Abstract
Information extraction from chemistry literature is vital for constructing up-to-date reaction databases for data-driven chemistry. Complete extraction requires combining information across text, tables, and figures, whereas prior work has mainly investigated extracting reactions from single modalities. In this paper, we present OpenChemIE to address this complex challenge and enable the extraction of reaction data at the document level. OpenChemIE approaches the problem in two steps: extracting relevant information from individual modalities and then integrating the results to obtain a final list of reactions. For the first step, we employ specialized neural models that each address a specific task for chemistry information extraction, such as parsing molecules or reactions from text or figures. We then integrate the information from these modules using chemistry-informed algorithms, allowing for the extraction of fine-grained reaction data from reaction condition and substrate scope investigations. Our machine learning models attain state-of-the-art performance when evaluated individually, and we meticulously annotate a challenging dataset of reaction schemes with R-groups to evaluate our pipeline as a whole, achieving an F1 score of 69.5%. Additionally, the reaction extraction results of OpenChemIE attain an accuracy score of 64.3% when directly compared against the Reaxys chemical database. OpenChemIE is most suited for information extraction on organic chemistry literature, where molecules are generally depicted as planar graphs or written in text and can be consolidated into a SMILES format. We provide OpenChemIE freely to the public as an open-source package, as well as through a web interface.
Collapse
Affiliation(s)
- Vincent Fan
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Yujie Qian
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Alex Wang
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Amber Wang
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Connor W Coley
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Regina Barzilay
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| |
Collapse
|
5
|
Borup RM, Ree N, Jensen JH. pKalculator: A p K a predictor for C-H bonds. Beilstein J Org Chem 2024; 20:1614-1622. [PMID: 39076289 PMCID: PMC11285060 DOI: 10.3762/bjoc.20.144] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/18/2024] [Accepted: 07/02/2024] [Indexed: 07/31/2024] Open
Abstract
Determining the pK a values of various C-H sites in organic molecules offers valuable insights for synthetic chemists in predicting reaction sites. As molecular complexity increases, this task becomes more challenging. This paper introduces pKalculator, a quantum chemistry (QM)-based workflow for automatic computations of C-H pK a values, which is used to generate a training dataset for a machine learning (ML) model. The QM workflow is benchmarked against 695 experimentally determined C-H pK a values in DMSO. The ML model is trained on a diverse dataset of 775 molecules with 3910 C-H sites. Our ML model predicts C-H pK a values with a mean absolute error (MAE) and a root mean squared error (RMSE) of 1.24 and 2.15 pK a units, respectively. Furthermore, we employ our model on 1043 pK a-dependent reactions (aldol, Claisen, and Michael) and successfully indicate the reaction sites with a Matthew's correlation coefficient (MCC) of 0.82.
Collapse
Affiliation(s)
- Rasmus M Borup
- Department of Chemistry, University of Copenhagen, Copenhagen, DK-2100, Denmark
| | - Nicolai Ree
- Department of Chemistry, University of Copenhagen, Copenhagen, DK-2100, Denmark
| | - Jan H Jensen
- Department of Chemistry, University of Copenhagen, Copenhagen, DK-2100, Denmark
| |
Collapse
|
6
|
Rajan K, Brinkhaus HO, Zielesny A, Steinbeck C. Advancements in hand-drawn chemical structure recognition through an enhanced DECIMER architecture. J Cheminform 2024; 16:78. [PMID: 38970120 PMCID: PMC11227129 DOI: 10.1186/s13321-024-00872-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/08/2024] [Accepted: 06/12/2024] [Indexed: 07/07/2024] Open
Abstract
Accurate recognition of hand-drawn chemical structures is crucial for digitising hand-written chemical information in traditional laboratory notebooks or facilitating stylus-based structure entry on tablets or smartphones. However, the inherent variability in hand-drawn structures poses challenges for existing Optical Chemical Structure Recognition (OCSR) software. To address this, we present an enhanced Deep lEarning for Chemical ImagE Recognition (DECIMER) architecture that leverages a combination of Convolutional Neural Networks (CNNs) and Transformers to improve the recognition of hand-drawn chemical structures. The model incorporates an EfficientNetV2 CNN encoder that extracts features from hand-drawn images, followed by a Transformer decoder that converts the extracted features into Simplified Molecular Input Line Entry System (SMILES) strings. Our models were trained using synthetic hand-drawn images generated by RanDepict, a tool for depicting chemical structures with different style elements. A benchmark was performed using a real-world dataset of hand-drawn chemical structures to evaluate the model's performance. The results indicate that our improved DECIMER architecture exhibits a significantly enhanced recognition accuracy compared to other approaches. SCIENTIFIC CONTRIBUTION: The new DECIMER model presented here refines our previous research efforts and is currently the only open-source model tailored specifically for the recognition of hand-drawn chemical structures. The enhanced model performs better in handling variations in handwriting styles, line thicknesses, and background noise, making it suitable for real-world applications. The DECIMER hand-drawn structure recognition model and its source code have been made available as an open-source package under a permissive license.
Collapse
Affiliation(s)
- Kohulan Rajan
- Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University Jena, Lessingstr. 8, 07743, Jena, Germany
| | - Henning Otto Brinkhaus
- Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University Jena, Lessingstr. 8, 07743, Jena, Germany
| | - Achim Zielesny
- Institute for Bioinformatics and Chemoinformatics, Westphalian University of Applied Sciences, August-Schmidt-Ring 10, 45665, Recklinghausen, Germany
| | - Christoph Steinbeck
- Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University Jena, Lessingstr. 8, 07743, Jena, Germany.
| |
Collapse
|
7
|
Luong KD, Singh A. Application of Transformers in Cheminformatics. J Chem Inf Model 2024; 64:4392-4409. [PMID: 38815246 PMCID: PMC11167597 DOI: 10.1021/acs.jcim.3c02070] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/28/2023] [Revised: 04/05/2024] [Accepted: 05/06/2024] [Indexed: 06/01/2024]
Abstract
By accelerating time-consuming processes with high efficiency, computing has become an essential part of many modern chemical pipelines. Machine learning is a class of computing methods that can discover patterns within chemical data and utilize this knowledge for a wide variety of downstream tasks, such as property prediction or substance generation. The complex and diverse chemical space requires complex machine learning architectures with great learning power. Recently, learning models based on transformer architectures have revolutionized multiple domains of machine learning, including natural language processing and computer vision. Naturally, there have been ongoing endeavors in adopting these techniques to the chemical domain, resulting in a surge of publications within a short period. The diversity of chemical structures, use cases, and learning models necessitate a comprehensive summarization of existing works. In this paper, we review recent innovations in adapting transformers to solve learning problems in chemistry. Because chemical data is diverse and complex, we structure our discussion based on chemical representations. Specifically, we highlight the strengths and weaknesses of each representation, the current progress of adapting transformer architectures, and future directions.
Collapse
Affiliation(s)
- Kha-Dinh Luong
- Department of Computer Science, University of California Santa Barbara, Santa Barbara, CA 93106, United States
| | - Ambuj Singh
- Department of Computer Science, University of California Santa Barbara, Santa Barbara, CA 93106, United States
| |
Collapse
|
8
|
Meewan I, Panmanee J, Petchyam N, Lertvilai P. HBCVTr: an end-to-end transformer with a deep neural network hybrid model for anti-HBV and HCV activity predictor from SMILES. Sci Rep 2024; 14:9262. [PMID: 38649402 PMCID: PMC11035669 DOI: 10.1038/s41598-024-59933-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/09/2024] [Accepted: 04/16/2024] [Indexed: 04/25/2024] Open
Abstract
Hepatitis B and C viruses (HBV and HCV) are significant causes of chronic liver diseases, with approximately 350 million infections globally. To accelerate the finding of effective treatment options, we introduce HBCVTr, a novel ligand-based drug design (LBDD) method for predicting the inhibitory activity of small molecules against HBV and HCV. HBCVTr employs a hybrid model consisting of double encoders of transformers and a deep neural network to learn the relationship between small molecules' simplified molecular-input line-entry system (SMILES) and their antiviral activity against HBV or HCV. The prediction accuracy of HBCVTr has surpassed baseline machine learning models and existing methods, with R-squared values of 0.641 and 0.721 for the HBV and HCV test sets, respectively. The trained models were successfully applied to virtual screening against 10 million compounds within 240 h, leading to the discovery of the top novel inhibitor candidates, including IJN04 for HBV and IJN12 and IJN19 for HCV. Molecular docking and dynamics simulations identified IJN04, IJN12, and IJN19 target proteins as the HBV core antigen, HCV NS5B RNA-dependent RNA polymerase, and HCV NS3/4A serine protease, respectively. Overall, HBCVTr offers a new and rapid drug discovery and development screening method targeting HBV and HCV.
Collapse
Affiliation(s)
- Ittipat Meewan
- Center for Advanced Therapeutics, Institute of Molecular Biosciences, Mahidol University, Nakhon Pathom, 73170, Thailand.
| | - Jiraporn Panmanee
- Research Center for Neuroscience, Institute of Molecular Biosciences, Mahidol University, Nakhon Pathom, 73170, Thailand
| | - Nopphon Petchyam
- Center for Advanced Therapeutics, Institute of Molecular Biosciences, Mahidol University, Nakhon Pathom, 73170, Thailand
| | - Pichaya Lertvilai
- Scripps Institution of Oceanography, University of California San Diego, La Jolla, CA, 92037, USA
| |
Collapse
|
9
|
Liu P, Ren Y, Tao J, Ren Z. GIT-Mol: A multi-modal large language model for molecular science with graph, image, and text. Comput Biol Med 2024; 171:108073. [PMID: 38359660 DOI: 10.1016/j.compbiomed.2024.108073] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/16/2023] [Revised: 12/25/2023] [Accepted: 01/27/2024] [Indexed: 02/17/2024]
Abstract
Large language models have made significant strides in natural language processing, enabling innovative applications in molecular science by processing textual representations of molecules. However, most existing language models cannot capture the rich information with complex molecular structures or images. In this paper, we introduce GIT-Mol, a multi-modal large language model that integrates the Graph, Image, and Text information. To facilitate the integration of multi-modal molecular data, we propose GIT-Former, a novel architecture that is capable of aligning all modalities into a unified latent space. We achieve a 5%-10% accuracy increase in properties prediction and a 20.2% boost in molecule generation validity compared to the baselines. With the any-to-language molecular translation strategy, our model has the potential to perform more downstream tasks, such as compound name recognition and chemical reaction prediction.
Collapse
Affiliation(s)
- Pengfei Liu
- Peng Cheng Laboratory, Shenzhen, 518055, Guangdong Province, China; School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou, 510006, Guangdong Province, China
| | - Yiming Ren
- Peng Cheng Laboratory, Shenzhen, 518055, Guangdong Province, China
| | - Jun Tao
- School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou, 510006, Guangdong Province, China
| | - Zhixiang Ren
- Peng Cheng Laboratory, Shenzhen, 518055, Guangdong Province, China.
| |
Collapse
|
10
|
Kunnakkattu IR, Choudhary P, Pravda L, Nadzirin N, Smart OS, Yuan Q, Anyango S, Nair S, Varadi M, Velankar S. PDBe CCDUtils: an RDKit-based toolkit for handling and analysing small molecules in the Protein Data Bank. J Cheminform 2023; 15:117. [PMID: 38042830 PMCID: PMC10693035 DOI: 10.1186/s13321-023-00786-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/11/2023] [Accepted: 11/17/2023] [Indexed: 12/04/2023] Open
Abstract
While the Protein Data Bank (PDB) contains a wealth of structural information on ligands bound to macromolecules, their analysis can be challenging due to the large amount and diversity of data. Here, we present PDBe CCDUtils, a versatile toolkit for processing and analysing small molecules from the PDB in PDBx/mmCIF format. PDBe CCDUtils provides streamlined access to all the metadata for small molecules in the PDB and offers a set of convenient methods to compute various properties using RDKit, such as 2D depictions, 3D conformers, physicochemical properties, scaffolds, common fragments, and cross-references to small molecule databases using UniChem. The toolkit also provides methods for identifying all the covalently attached chemical components in a macromolecular structure and calculating similarity among small molecules. By providing a broad range of functionality, PDBe CCDUtils caters to the needs of researchers in cheminformatics, structural biology, bioinformatics and computational chemistry.
Collapse
Affiliation(s)
- Ibrahim Roshan Kunnakkattu
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - Preeti Choudhary
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - Lukas Pravda
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - Nurul Nadzirin
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - Oliver S Smart
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - Qi Yuan
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - Stephen Anyango
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - Sreenath Nair
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - Mihaly Varadi
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - Sameer Velankar
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
| |
Collapse
|
11
|
Zhou C, Liu W, Song X, Yang M, Peng X. YoDe-Segmentation: automated noise-free retrieval of molecular structures from scientific publications. J Cheminform 2023; 15:111. [PMID: 37986007 PMCID: PMC10662772 DOI: 10.1186/s13321-023-00783-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/20/2023] [Accepted: 11/11/2023] [Indexed: 11/22/2023] Open
Abstract
In chemistry-related disciplines, a vast repository of molecular structural data has been documented in scientific publications but remains inaccessible to computational analyses owing to its non-machine-readable format. Optical chemical structure recognition (OCSR) addresses this gap by converting images of chemical molecular structures into a format accessible to computers and convenient for storage, paving the way for further analyses and studies on chemical information. A pivotal initial step in OCSR is automating the noise-free extraction of molecular descriptions from literature. Despite efforts utilising rule-based and deep learning approaches for the extraction process, the accuracy achieved to date is unsatisfactory. To address this issue, we introduce a deep learning model named YoDe-Segmentation in this study, engineered for the automated retrieval of molecular structures from scientific documents. This model operates via a three-stage process encompassing detection, mask generation, and calculation. Initially, it identifies and isolates molecular structures during the detection phase. Subsequently, mask maps are created based on these isolated structures in the mask generation stage. In the final calculation stage, refined and separated mask maps are combined with the isolated molecular structure images, resulting in the acquisition of pure molecular structures. Our model underwent rigorous testing using texts from multiple chemistry-centric journals, with the outcomes subjected to manual validation. The results revealed the superior performance of YoDe-Segmentation compared to alternative algorithms, documenting an average extraction efficiency of 97.62%. This outcome not only highlights the robustness and reliability of the model but also suggests its applicability on a broad scale.
Collapse
Affiliation(s)
- Chong Zhou
- School of Informatics, Hunan University of Chinese Medicine, Changsha, 410208, Hunan, People's Republic of China
| | - Wei Liu
- School of Informatics, Hunan University of Chinese Medicine, Changsha, 410208, Hunan, People's Republic of China.
| | - Xiyue Song
- School of Informatics, Hunan University of Chinese Medicine, Changsha, 410208, Hunan, People's Republic of China
| | - Mengling Yang
- School of Informatics, Hunan University of Chinese Medicine, Changsha, 410208, Hunan, People's Republic of China
| | - Xiaowang Peng
- School of Informatics, Hunan University of Chinese Medicine, Changsha, 410208, Hunan, People's Republic of China
| |
Collapse
|
12
|
Mullowney MW, Duncan KR, Elsayed SS, Garg N, van der Hooft JJJ, Martin NI, Meijer D, Terlouw BR, Biermann F, Blin K, Durairaj J, Gorostiola González M, Helfrich EJN, Huber F, Leopold-Messer S, Rajan K, de Rond T, van Santen JA, Sorokina M, Balunas MJ, Beniddir MA, van Bergeijk DA, Carroll LM, Clark CM, Clevert DA, Dejong CA, Du C, Ferrinho S, Grisoni F, Hofstetter A, Jespers W, Kalinina OV, Kautsar SA, Kim H, Leao TF, Masschelein J, Rees ER, Reher R, Reker D, Schwaller P, Segler M, Skinnider MA, Walker AS, Willighagen EL, Zdrazil B, Ziemert N, Goss RJM, Guyomard P, Volkamer A, Gerwick WH, Kim HU, Müller R, van Wezel GP, van Westen GJP, Hirsch AKH, Linington RG, Robinson SL, Medema MH. Artificial intelligence for natural product drug discovery. Nat Rev Drug Discov 2023; 22:895-916. [PMID: 37697042 DOI: 10.1038/s41573-023-00774-7] [Citation(s) in RCA: 33] [Impact Index Per Article: 33.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 07/20/2023] [Indexed: 09/13/2023]
Abstract
Developments in computational omics technologies have provided new means to access the hidden diversity of natural products, unearthing new potential for drug discovery. In parallel, artificial intelligence approaches such as machine learning have led to exciting developments in the computational drug design field, facilitating biological activity prediction and de novo drug design for molecular targets of interest. Here, we describe current and future synergies between these developments to effectively identify drug candidates from the plethora of molecules produced by nature. We also discuss how to address key challenges in realizing the potential of these synergies, such as the need for high-quality datasets to train deep learning algorithms and appropriate strategies for algorithm validation.
Collapse
Affiliation(s)
| | - Katherine R Duncan
- Strathclyde Institute of Pharmacy and Biomedical Sciences, University of Strathclyde, Glasgow, UK
| | - Somayah S Elsayed
- Department of Molecular Biotechnology, Institute of Biology, Leiden University, Leiden, The Netherlands
| | - Neha Garg
- School of Chemistry and Biochemistry, Center for Microbial Dynamics and Infection, Georgia Institute of Technology, Atlanta, GA, USA
| | - Justin J J van der Hooft
- Bioinformatics Group, Wageningen University, Wageningen, The Netherlands
- Department of Biochemistry, University of Johannesburg, Johannesburg, South Africa
| | - Nathaniel I Martin
- Biological Chemistry Group, Institute of Biology, Leiden University, Leiden, The Netherlands
| | - David Meijer
- Bioinformatics Group, Wageningen University, Wageningen, The Netherlands
| | - Barbara R Terlouw
- Bioinformatics Group, Wageningen University, Wageningen, The Netherlands
| | - Friederike Biermann
- Bioinformatics Group, Wageningen University, Wageningen, The Netherlands
- Institute of Molecular Bio Science, Goethe-University Frankfurt, Frankfurt am Main, Germany
- LOEWE Center for Translational Biodiversity Genomics (TBG), Frankfurt am Main, Germany
| | - Kai Blin
- The Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kongens Lyngby, Denmark
| | | | - Marina Gorostiola González
- Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Leiden, The Netherlands
- ONCODE institute, Leiden, The Netherlands
| | - Eric J N Helfrich
- Institute of Molecular Bio Science, Goethe-University Frankfurt, Frankfurt am Main, Germany
- LOEWE Center for Translational Biodiversity Genomics (TBG), Frankfurt am Main, Germany
| | - Florian Huber
- Center for Digitalization and Digitality, Hochschule Düsseldorf, Düsseldorf, Germany
| | - Stefan Leopold-Messer
- Institut für Mikrobiologie, Eidgenössische Technische Hochschule (ETH) Zürich, Zürich, Switzerland
| | - Kohulan Rajan
- Institute for Inorganic and Analytical Chemistry, Friedrich-Schiller-University Jena, Jena, Germany
| | - Tristan de Rond
- School of Chemical Sciences, University of Auckland, Auckland, New Zealand
| | - Jeffrey A van Santen
- Department of Chemistry, Simon Fraser University, Burnaby, British Columbia, Canada
| | - Maria Sorokina
- Institute for Inorganic and Analytical Chemistry, Friedrich-Schiller University, Jena, Germany
- Pharmaceuticals R&D, Bayer AG, Berlin, Germany
| | - Marcy J Balunas
- Department of Microbiology and Immunology, University of Michigan, Ann Arbor, MI, USA
- Department of Medicinal Chemistry, University of Michigan, Ann Arbor, MI, USA
| | - Mehdi A Beniddir
- Équipe "Chimie des Substances Naturelles", Université Paris-Saclay, CNRS, BioCIS, Orsay, France
| | - Doris A van Bergeijk
- Department of Molecular Biotechnology, Institute of Biology, Leiden University, Leiden, The Netherlands
| | - Laura M Carroll
- Structural and Computational Biology Unit, EMBL, Heidelberg, Germany
| | - Chase M Clark
- Division of Pharmaceutical Sciences, School of Pharmacy, University of Wisconsin-Madison, Madison, WI, USA
| | | | | | - Chao Du
- Department of Molecular Biotechnology, Institute of Biology, Leiden University, Leiden, The Netherlands
| | | | - Francesca Grisoni
- Institute for Complex Molecular Systems, Department of Biomedical Engineering, Eindhoven University of Technology, Eindhoven, The Netherlands
- Centre for Living Technologies, Alliance TU/e, WUR, UU, UMC Utrecht, Utrecht, The Netherlands
| | | | - Willem Jespers
- Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Leiden, The Netherlands
| | - Olga V Kalinina
- Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Centre for Infection Research (HZI), Saarbrücken, Germany
- Drug Bioinformatics, Medical Faculty, Saarland University, Homburg, Germany
- Center for Bioinformatics, Saarland University, Saarbrücken, Germany
| | | | - Hyunwoo Kim
- College of Pharmacy and Integrated Research Institute for Drug Development, Dongguk University Seoul, Goyang-si, Republic of Korea
| | - Tiago F Leao
- Center for Nuclear Energy in Agriculture, University of São Paulo, Piracicaba, Brazil
| | - Joleen Masschelein
- Center for Microbiology, VIB-KU Leuven, Heverlee, Belgium
- Department of Biology, KU Leuven, Heverlee, Belgium
| | - Evan R Rees
- Division of Pharmaceutical Sciences, School of Pharmacy, University of Wisconsin-Madison, Madison, WI, USA
| | - Raphael Reher
- Institute of Pharmaceutical Biology and Biotechnology, University of Marburg, Marburg, Germany
- Institute of Pharmacy, Martin-Luther-University Halle-Wittenberg, Halle (Saale), Germany
| | - Daniel Reker
- Department of Biomedical Engineering, Duke University, Durham, NC, USA
- Duke Microbiome Center, Duke University, Durham, NC, USA
| | - Philippe Schwaller
- Laboratory of Artificial Chemical Intelligence, Institut des Sciences et Ingénierie Chimiques, Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
| | | | - Michael A Skinnider
- Adapsyn Bioscience, Hamilton, Ontario, Canada
- Michael Smith Laboratories, University of British Columbia, Vancouver, British Columbia, Canada
| | - Allison S Walker
- Department of Chemistry, Vanderbilt University, Nashville, TN, USA
- Department of Biological Sciences, Vanderbilt University, Nashville, TN, USA
| | - Egon L Willighagen
- Department of Bioinformatics - BiGCaT, NUTRIM, Maastricht University, Maastricht, The Netherlands
| | - Barbara Zdrazil
- European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridgeshire, UK
| | - Nadine Ziemert
- Interfaculty Institute for Microbiology and Infection Medicine Tuebingen (IMIT), Institute for Bioinformatics and Medical Informatics (IBMI), University of Tuebingen, Tuebingen, Germany
| | | | - Pierre Guyomard
- Bonsai team, CRIStAL - Centre de Recherche en Informatique Signal et Automatique de Lille, Université de Lille, Villeneuve d'Ascq Cedex, France
| | - Andrea Volkamer
- Center for Bioinformatics, Saarland University, Saarbrücken, Germany
- In silico Toxicology and Structural Bioinformatics, Institute of Physiology, Charité - Universitätsmedizin Berlin, Berlin, Germany
| | - William H Gerwick
- Scripps Institution of Oceanography, University of California San Diego, La Jolla, CA, USA
| | - Hyun Uk Kim
- Department of Chemical and Biomolecular Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea
| | - Rolf Müller
- Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Centre for Infection Research (HZI), Saarbrücken, Germany
- Department of Pharmacy, Saarland University, Saarbrücken, Germany
- German Center for infection research (DZIF), Braunschweig, Germany
- Helmholtz International Lab for Anti-Infectives, Saarbrücken, Germany
| | - Gilles P van Wezel
- Department of Molecular Biotechnology, Institute of Biology, Leiden University, Leiden, The Netherlands
- Netherlands Institute of Ecology, NIOO-KNAW, Wageningen, The Netherlands
| | - Gerard J P van Westen
- Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Leiden, The Netherlands.
| | - Anna K H Hirsch
- Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Centre for Infection Research (HZI), Saarbrücken, Germany.
- Department of Pharmacy, Saarland University, Saarbrücken, Germany.
- German Center for infection research (DZIF), Braunschweig, Germany.
- Helmholtz International Lab for Anti-Infectives, Saarbrücken, Germany.
| | - Roger G Linington
- Department of Chemistry, Simon Fraser University, Burnaby, British Columbia, Canada.
| | - Serina L Robinson
- Department of Environmental Microbiology, Eawag: Swiss Federal Institute for Aquatic Science and Technology, Dübendorf, Switzerland.
| | - Marnix H Medema
- Bioinformatics Group, Wageningen University, Wageningen, The Netherlands.
- Institute of Biology, Leiden University, Leiden, The Netherlands.
| |
Collapse
|
13
|
Scholz VA, Stork C, Frericks M, Kirchmair J. Computational prediction of the metabolites of agrochemicals formed in rats. THE SCIENCE OF THE TOTAL ENVIRONMENT 2023; 895:165039. [PMID: 37355108 DOI: 10.1016/j.scitotenv.2023.165039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/21/2023] [Revised: 06/17/2023] [Accepted: 06/19/2023] [Indexed: 06/26/2023]
Abstract
Today, computational tools for the prediction of the metabolite structures of xenobiotics are widely available and employed in small-molecule research. Reflecting the availability of measured data, these in silico tools are trained and validated primarily on drug metabolism data. In this work, we assessed the capacity of five leading metabolite structure predictors to represent the metabolism of agrochemicals observed in rats. More specifically, we tested the ability of SyGMa, GLORY, GLORYx, BioTransformer 3.0, and MetaTrans to correctly predict and rank the experimentally observed metabolites of a set of 85 parent compounds. We found that the models were able to recover about one to two-thirds of the experimentally observed first-generation, second-generation and third-generation metabolites, confirming their value in applications such as metabolite identification. However, precision was low for all investigated tools and did not exceed approximately 18 % for the pool of first-generation metabolites and 2 % for the pool of compounds representing the first three generations of metabolites. The variance in prediction success rates was high across the individual metabolic maps, meaning that outcomes depend strongly on the specific compound under investigation. We also found that the predictions for individual parent compounds differed strongly between the tools, particularly between those built on orthogonal technologies (e.g., rule-based and end-to-end machine learning approaches). This renders ensemble model strategies promising for improving success rates. Overall, the results of this benchmark study show that there is still considerable room for the improvement of metabolite structure predictors left. Our discussion points out several avenues to progress. The bottleneck in method development certainly has been, and will remain, for the foreseeable future, the limited quantity and quality of available measured data on small-molecule metabolism.
Collapse
Affiliation(s)
- Vincent-Alexander Scholz
- Department of Pharmaceutical Sciences, Division of Pharmaceutical Chemistry, University of Vienna, Josef-Holaubek-Platz 2, 1090 Vienna, Austria; Vienna Doctoral School of Pharmaceutical, Nutritional and Sport Sciences (PhaNuSpo), University of Vienna, 1090 Vienna, Austria
| | | | | | - Johannes Kirchmair
- Department of Pharmaceutical Sciences, Division of Pharmaceutical Chemistry, University of Vienna, Josef-Holaubek-Platz 2, 1090 Vienna, Austria; Christian Doppler Laboratory for Molecular Informatics in the Biosciences, Department for Pharmaceutical Sciences, University of Vienna, 1090 Vienna, Austria.
| |
Collapse
|
14
|
Wilary D, Cole JM. ReactionDataExtractor 2.0: A Deep Learning Approach for Data Extraction from Chemical Reaction Schemes. J Chem Inf Model 2023; 63:6053-6067. [PMID: 37729111 PMCID: PMC10565829 DOI: 10.1021/acs.jcim.3c00422] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/16/2023] [Indexed: 09/22/2023]
Abstract
Knowledge in the chemical domain is often disseminated graphically via chemical reaction schemes. The task of describing chemical transformations is greatly simplified by introducing reaction schemes that are composed of chemical diagrams and symbols. While intuitively understood by any chemist, like most graphical representations, such drawings are not easily understood by machines; this poses a challenge in the context of data extraction. Currently available tools are limited in their scope of extraction and require manual preprocessing, thus slowing down the speed of data extraction. We present a new tool, ReactionDataExtractor v2.0, which uses a combination of neural networks and symbolic artificial intelligence to effectively remove this barrier. We have evaluated our tool on a test set composed of reaction schemes that were taken from open-source journal articles and realized F1 score metrics between 75 and 96%. These evaluation metrics can be further improved by tuning our object-detection models to a specific chemical subdomain thanks to a data-driven approach that we have adopted with synthetically generated data. The system architecture of our tool is modular, which allows it to balance speed and accuracy to afford an autonomous, high-throughput solution for image-based chemical data extraction.
Collapse
Affiliation(s)
- Damian
M. Wilary
- Cavendish
Laboratory, Department of Physics, University
of Cambridge, J. J. Thomson Avenue, Cambridge, CB3 0HE, U.K.
| | - Jacqueline M. Cole
- Cavendish
Laboratory, Department of Physics, University
of Cambridge, J. J. Thomson Avenue, Cambridge, CB3 0HE, U.K.
- ISIS
Neutron and Muon Source, STFC Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0QX, U.K.
| |
Collapse
|
15
|
Rajan K, Brinkhaus HO, Agea MI, Zielesny A, Steinbeck C. DECIMER.ai: an open platform for automated optical chemical structure identification, segmentation and recognition in scientific publications. Nat Commun 2023; 14:5045. [PMID: 37598180 PMCID: PMC10439916 DOI: 10.1038/s41467-023-40782-0] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/07/2023] [Accepted: 08/09/2023] [Indexed: 08/21/2023] Open
Abstract
The number of publications describing chemical structures has increased steadily over the last decades. However, the majority of published chemical information is currently not available in machine-readable form in public databases. It remains a challenge to automate the process of information extraction in a way that requires less manual intervention - especially the mining of chemical structure depictions. As an open-source platform that leverages recent advancements in deep learning, computer vision, and natural language processing, DECIMER.ai (Deep lEarning for Chemical IMagE Recognition) strives to automatically segment, classify, and translate chemical structure depictions from the printed literature. The segmentation and classification tools are the only openly available packages of their kind, and the optical chemical structure recognition (OCSR) core application yields outstanding performance on all benchmark datasets. The source code, the trained models and the datasets developed in this work have been published under permissive licences. An instance of the DECIMER web application is available at https://decimer.ai .
Collapse
Affiliation(s)
- Kohulan Rajan
- Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University Jena, Lessingstr. 8, 07743, Jena, Germany
| | - Henning Otto Brinkhaus
- Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University Jena, Lessingstr. 8, 07743, Jena, Germany
| | - M Isabel Agea
- Department of Informatics and Chemistry, Faculty of Chemical Technology, University of Chemistry and Technology Prague, Technicka 5, 166 28, Prague, Czech Republic
| | - Achim Zielesny
- Institute for Bioinformatics and Chemoinformatics, Westphalian University of Applied Sciences, August-Schmidt-Ring 10, 45665, Recklinghausen, Germany
| | - Christoph Steinbeck
- Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University Jena, Lessingstr. 8, 07743, Jena, Germany.
| |
Collapse
|
16
|
Carracedo-Cosme J, Romero-Muñiz C, Pou P, Pérez R. Molecular Identification from AFM Images Using the IUPAC Nomenclature and Attribute Multimodal Recurrent Neural Networks. ACS APPLIED MATERIALS & INTERFACES 2023; 15:22692-22704. [PMID: 37126486 PMCID: PMC10176476 DOI: 10.1021/acsami.3c01550] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/11/2023]
Abstract
Spectroscopic methods─like nuclear magnetic resonance, mass spectrometry, X-ray diffraction, and UV/visible spectroscopies─applied to molecular ensembles have so far been the workhorse for molecular identification. Here, we propose a radically different chemical characterization approach, based on the ability of noncontact atomic force microscopy with metal tips functionalized with a CO molecule at the tip apex (referred as HR-AFM) to resolve the internal structure of individual molecules. Our work demonstrates that a stack of constant-height HR-AFM images carries enough chemical information for a complete identification (structure and composition) of quasiplanar organic molecules, and that this information can be retrieved using machine learning techniques that are able to disentangle the contribution of chemical composition, bond topology, and internal torsion of the molecule to the HR-AFM contrast. In particular, we exploit multimodal recurrent neural networks (M-RNN) that combine convolutional neural networks for image analysis and recurrent neural networks to deal with language processing, to formulate the molecular identification as an imaging captioning problem. The algorithm is trained using a data set─which contains almost 700,000 molecules and 165 million theoretical AFM images─to produce as final output the IUPAC name of the imaged molecule. Our extensive test with theoretical images and a few experimental ones shows the potential of deep learning algorithms in the automatic identification of molecular compounds by AFM. This achievement supports the development of on-surface synthesis and overcomes some limitations of spectroscopic methods in traditional solution-based synthesis.
Collapse
Affiliation(s)
- Jaime Carracedo-Cosme
- Quasar Science Resources S.L., Camino de las Ceudas 2, E-28232 Las Rozas de Madrid, Spain
- Departamento de Física Teórica de la Materia Condensada, Universidad Autónoma de Madrid, E-28049 Madrid, Spain
| | - Carlos Romero-Muñiz
- Departamento de Física de la Materia Condensada, Universidad de Sevilla, P.O. Box 1065, 41080 Sevilla, Spain
| | - Pablo Pou
- Departamento de Física Teórica de la Materia Condensada, Universidad Autónoma de Madrid, E-28049 Madrid, Spain
- Condensed Matter Physics Center (IFIMAC), Universidad Autónoma de Madrid, E-28049 Madrid, Spain
| | - Rubén Pérez
- Departamento de Física Teórica de la Materia Condensada, Universidad Autónoma de Madrid, E-28049 Madrid, Spain
- Condensed Matter Physics Center (IFIMAC), Universidad Autónoma de Madrid, E-28049 Madrid, Spain
| |
Collapse
|
17
|
Qian Y, Guo J, Tu Z, Li Z, Coley CW, Barzilay R. MolScribe: Robust Molecular Structure Recognition with Image-to-Graph Generation. J Chem Inf Model 2023; 63:1925-1934. [PMID: 36971363 DOI: 10.1021/acs.jcim.2c01480] [Citation(s) in RCA: 11] [Impact Index Per Article: 11.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/29/2023]
Abstract
Molecular structure recognition is the task of translating a molecular image into its graph structure. Significant variation in drawing styles and conventions exhibited in chemical literature poses a significant challenge for automating this task. In this paper, we propose MolScribe, a novel image-to-graph generation model that explicitly predicts atoms and bonds, along with their geometric layouts, to construct the molecular structure. Our model flexibly incorporates symbolic chemistry constraints to recognize chirality and expand abbreviated structures. We further develop data augmentation strategies to enhance the model robustness against domain shifts. In experiments on both synthetic and realistic molecular images, MolScribe significantly outperforms previous models, achieving 76-93% accuracy on public benchmarks. Chemists can also easily verify MolScribe's prediction, informed by its confidence estimation and atom-level alignment with the input image. MolScribe is publicly available through Python and web interfaces: https://github.com/thomas0809/MolScribe.
Collapse
Affiliation(s)
- Yujie Qian
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Jiang Guo
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Zhengkai Tu
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Zhening Li
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Connor W Coley
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Regina Barzilay
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| |
Collapse
|
18
|
Abstract
Citations are an essential aspect of research communication and have become the basis of many evaluation metrics in the academic world. Some see citation counts as a mark of scientific impact or even quality, but in reality the reasons for citing other work are manifold which makes the interpretation more complicated than a single citation count can reflect. Two years ago, the Journal of Cheminformatics proposed the CiTO Pilot for the adoption of a practice of annotating citations with their citation intentions. Basically, when you cite a journal article or dataset (or any other source), you also explain why specifically you cite that source. Particularly, the agreement and disagreement and reuse of methods and data are of interest. This article explores what happened after the launch of the pilot. We summarize how authors in the Journal of Cheminformatics used the pilot, shows citation annotations are distributed with Wikidata, visualized with Scholia, discusses adoption outside BMC, and finally present some thoughts on what needs to happen next.
Collapse
|
19
|
Lai A, Schaub J, Steinbeck C, Schymanski EL. An algorithm to classify homologous series within compound datasets. J Cheminform 2022; 14:85. [PMID: 36510332 PMCID: PMC9746203 DOI: 10.1186/s13321-022-00663-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/31/2022] [Accepted: 11/27/2022] [Indexed: 12/15/2022] Open
Abstract
Homologous series are groups of related compounds that share the same core structure attached to a motif that repeats to different degrees. Compounds forming homologous series are of interest in multiple domains, including natural products, environmental chemistry, and drug design. However, many homologous compounds remain unannotated as such in compound datasets, which poses obstacles to understanding chemical diversity and their analytical identification via database matching. To overcome these challenges, an algorithm to detect homologous series within compound datasets was developed and implemented using the RDKit. The algorithm takes a list of molecules as SMILES strings and a monomer (i.e., repeating unit) encoded as SMARTS as its main inputs. In an iterative process, substructure matching of repeating units, molecule fragmentation, and core detection lead to homologous series classification through grouping of identical cores. Three open compound datasets from environmental chemistry (NORMAN Suspect List Exchange, NORMAN-SLE), exposomics (PubChemLite for Exposomics), and natural products (the COlleCtion of Open NatUral producTs, COCONUT) were subject to homologous series classification using the algorithm. Over 2000, 12,000, and 5000 series with CH2 repeating units were classified in the NORMAN-SLE, PubChemLite, and COCONUT respectively. Validation of classified series was performed using published homologous series and structure categories, including a comparison with a similar existing method for categorising PFAS compounds. The OngLai algorithm and its implementation for classifying homologues are openly available at: https://github.com/adelenelai/onglai-classify-homologues .
Collapse
Affiliation(s)
- Adelene Lai
- grid.16008.3f0000 0001 2295 9843Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, 6 Avenue du Swing, 4367 Belvaux, Luxembourg ,grid.9613.d0000 0001 1939 2794Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University Jena, Lessing Strasse 8, 07743 Jena, Germany
| | - Jonas Schaub
- grid.9613.d0000 0001 1939 2794Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University Jena, Lessing Strasse 8, 07743 Jena, Germany
| | - Christoph Steinbeck
- grid.9613.d0000 0001 1939 2794Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University Jena, Lessing Strasse 8, 07743 Jena, Germany
| | - Emma L. Schymanski
- grid.16008.3f0000 0001 2295 9843Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, 6 Avenue du Swing, 4367 Belvaux, Luxembourg
| |
Collapse
|
20
|
Xu Y, Xiao J, Chou CH, Zhang J, Zhu J, Hu Q, Li H, Han N, Liu B, Zhang S, Han J, Zhang Z, Zhang S, Zhang W, Lai L, Pei J. MolMiner: You Only Look Once for Chemical Structure Recognition. J Chem Inf Model 2022; 62:5321-5328. [PMID: 36108142 PMCID: PMC9710516 DOI: 10.1021/acs.jcim.2c00733] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/09/2022] [Indexed: 12/13/2022]
Abstract
Molecular structures are commonly depicted in 2D printed forms in scientific documents such as journal papers and patents. However, these 2D depictions are not machine readable. Due to a backlog of decades and an increasing amount of printed literatures, there is a high demand for translating printed depictions into machine-readable formats, which is known as Optical Chemical Structure Recognition (OCSR). Most OCSR systems developed over the last three decades use a rule-based approach, which vectorizes the depiction based on the interpretation of vectors and nodes as bonds and atoms. Here, we present a practical software called MolMiner, which is primarily built using deep neural networks originally developed for semantic segmentation and object detection to recognize atom and bond elements from documents. These recognized elements can be easily connected as a molecular graph with a distance-based construction algorithm. MolMiner gave state-of-the-art performance on four benchmark data sets and a self-collected external data set from scientific papers. As MolMiner performed similarly well in real-world OCSR tasks with a user-friendly interface, it is a useful and valuable tool for daily applications. The free download links of Mac and Windows versions are available at https://github.com/iipharma/pharmamind-molminer.
Collapse
Affiliation(s)
- Youjun Xu
- Infinite
Intelligence Pharma, Beijing, China 100083
| | | | | | | | - Jintao Zhu
- Center
for Quantitative Biology, Peking University, Beijing, China 100871
| | - Qiwan Hu
- Center
for Quantitative Biology, Peking University, Beijing, China 100871
| | - Hemin Li
- Infinite
Intelligence Pharma, Beijing, China 100083
| | | | - Bingyu Liu
- Infinite
Intelligence Pharma, Beijing, China 100083
| | | | - Jinyu Han
- Infinite
Intelligence Pharma, Beijing, China 100083
| | - Zhen Zhang
- Infinite
Intelligence Pharma, Beijing, China 100083
| | - Shuhao Zhang
- Infinite
Intelligence Pharma, Beijing, China 100083
| | - Weilin Zhang
- Infinite
Intelligence Pharma, Beijing, China 100083
| | - Luhua Lai
- Center
for Quantitative Biology, Peking University, Beijing, China 100871
- BNLMS,
Peking-Tsinghua Center for Life Sciences at the College of Chemistry
and Molecular Engineering, Peking University, Beijing, China 100871
| | - Jianfeng Pei
- Center
for Quantitative Biology, Peking University, Beijing, China 100871
| |
Collapse
|
21
|
Wang J, Shen Z, Liao Y, Yuan Z, Li S, He G, Lan M, Qian X, Zhang K, Li H. Multi-modal chemical information reconstruction from images and texts for exploring the near-drug space. Brief Bioinform 2022; 23:6761958. [PMID: 36252922 PMCID: PMC9677486 DOI: 10.1093/bib/bbac461] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/31/2022] [Revised: 09/21/2022] [Accepted: 09/26/2022] [Indexed: 12/14/2022] Open
Abstract
Identification of new chemical compounds with desired structural diversity and biological properties plays an essential role in drug discovery, yet the construction of such a potential space with elements of 'near-drug' properties is still a challenging task. In this work, we proposed a multimodal chemical information reconstruction system to automatically process, extract and align heterogeneous information from the text descriptions and structural images of chemical patents. Our key innovation lies in a heterogeneous data generator that produces cross-modality training data in the form of text descriptions and Markush structure images, from which a two-branch model with image- and text-processing units can then learn to both recognize heterogeneous chemical entities and simultaneously capture their correspondence. In particular, we have collected chemical structures from ChEMBL database and chemical patents from the European Patent Office and the US Patent and Trademark Office using keywords 'A61P, compound, structure' in the years from 2010 to 2020, and generated heterogeneous chemical information datasets with 210K structural images and 7818 annotated text snippets. Based on the reconstructed results and substituent replacement rules, structural libraries of a huge number of near-drug compounds can be generated automatically. In quantitative evaluations, our model can correctly reconstruct 97% of the molecular images into structured format and achieve an F1-score around 97-98% in the recognition of chemical entities, which demonstrated the effectiveness of our model in automatic information extraction from chemical patents, and hopefully transforming them to a user-friendly, structured molecular database enriching the near-drug space to realize the intelligent retrieval technology of chemical knowledge.
Collapse
Affiliation(s)
| | | | - Yichen Liao
- Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science & Technology, Shanghai 200237, China
| | - Zhen Yuan
- Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science & Technology, Shanghai 200237, China
| | - Shiliang Li
- Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science & Technology, Shanghai 200237, China
| | - Gaoqi He
- School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
| | - Man Lan
- School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
| | - Xuhong Qian
- Innovation Center for AI and Drug Discovery, East China Normal University, Shanghai 200062, China
| | - Kai Zhang
- Corresponding authors: Kai Zhang, School of Computer Science and Technology, Innovation Center for AI and Drug Discovery, East China Normal University, Shanghai 200062, China. E-mail: ; Honglin Li, Shanghai Key Laboratory of New Drug Design, East China University of Science & Technology, Shanghai 200237, China. Innovation Center for AI and Drug Discovery, East China Normal University, Shanghai 200062, China. E-mail:
| | - Honglin Li
- Corresponding authors: Kai Zhang, School of Computer Science and Technology, Innovation Center for AI and Drug Discovery, East China Normal University, Shanghai 200062, China. E-mail: ; Honglin Li, Shanghai Key Laboratory of New Drug Design, East China University of Science & Technology, Shanghai 200237, China. Innovation Center for AI and Drug Discovery, East China Normal University, Shanghai 200062, China. E-mail:
| |
Collapse
|
22
|
Musazade F, Jamalova N, Hasanov J. Review of techniques and models used in optical chemical structure recognition in images and scanned documents. J Cheminform 2022; 14:61. [PMID: 36076301 PMCID: PMC9461257 DOI: 10.1186/s13321-022-00642-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/08/2022] [Accepted: 08/20/2022] [Indexed: 11/10/2022] Open
Abstract
Extraction of chemical formulas from images was not in the top priority of Computer Vision tasks for a while. The complexity both on the input and prediction sides has made this task challenging for the conventional Artificial Intelligence and Machine Learning problems. A binary input image which might seem trivial for convolutional analysis was not easy to classify, since the provided sample was not representative of the given molecule: to describe the same formula, a variety of graphical representations which do not resemble each other can be used. Considering the variety of molecules, the problem shifted from classification to that of formula generation, which makes Natural Language Processing (NLP) a good candidate for an effective solution. This paper describes the evolution of approaches from rule-based structure analyses to complex statistical models, and compares the efficiency of models and methodologies used in the recent years. Although the latest achievements deliver ideal results on particular datasets, the authors mention possible problems for various scenarios and provide suggestions for further development.
Collapse
Affiliation(s)
- Fidan Musazade
- School of Engineering and Applied Science, The George Washington University, Washington, DC, United States
| | - Narmin Jamalova
- School of Engineering and Applied Science, The George Washington University, Washington, DC, United States
| | - Jamaladdin Hasanov
- School of Engineering and Applied Science, The George Washington University, Washington, DC, United States. .,School of IT and Engineering, ADA University, Baku, Azerbaijan.
| |
Collapse
|
23
|
Xu Z, Li J, Yang Z, Li S, Li H. SwinOCSR: end-to-end optical chemical structure recognition using a Swin Transformer. J Cheminform 2022; 14:41. [PMID: 35778754 PMCID: PMC9248127 DOI: 10.1186/s13321-022-00624-5] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/09/2022] [Accepted: 06/12/2022] [Indexed: 11/26/2022] Open
Abstract
Optical chemical structure recognition from scientific publications is essential for rediscovering a chemical structure. It is an extremely challenging problem, and current rule-based and deep-learning methods cannot achieve satisfactory recognition rates. Herein, we propose SwinOCSR, an end-to-end model based on a Swin Transformer. This model uses the Swin Transformer as the backbone to extract image features and introduces Transformer models to convert chemical information from publications into DeepSMILES. A novel chemical structure dataset was constructed to train and verify our method. Our proposed Swin Transformer-based model was extensively tested against the backbone of existing publicly available deep learning methods. The experimental results show that our model significantly outperforms the compared methods, demonstrating the model’s effectiveness. Moreover, we used a focal loss to address the token imbalance problem in the text representation of the chemical structure diagram, and our model achieved an accuracy of 98.58%.
Collapse
Affiliation(s)
- Zhanpeng Xu
- School of Information Science and Engineering, East China University of Science and Technology, 130 Mei Long Road, Shanghai, 200237, China
| | - Jianhua Li
- School of Information Science and Engineering, East China University of Science and Technology, 130 Mei Long Road, Shanghai, 200237, China.
| | - Zhaopeng Yang
- School of Information Science and Engineering, East China University of Science and Technology, 130 Mei Long Road, Shanghai, 200237, China
| | - Shiliang Li
- State Key Laboratory of Bioreactor Engineering, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China
| | - Honglin Li
- State Key Laboratory of Bioreactor Engineering, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China
| |
Collapse
|
24
|
Brinkhaus HO, Zielesny A, Steinbeck C, Rajan K. DECIMER-hand-drawn molecule images dataset. J Cheminform 2022; 14:36. [PMID: 35681226 PMCID: PMC9185882 DOI: 10.1186/s13321-022-00620-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/20/2022] [Accepted: 05/25/2022] [Indexed: 12/01/2022] Open
Abstract
The translation of images of chemical structures into machine-readable representations of the depicted molecules is known as optical chemical structure recognition (OCSR). There has been a lot of progress over the last three decades in this field, but the development of systems for the recognition of complex hand-drawn structure depictions is still at the beginning. Currently, there is no data for the systematic evaluation of OCSR methods on hand-drawn structures available. Here we present DECIMER — Hand-drawn molecule images, a standardised, openly available benchmark dataset of 5088 hand-drawn depictions of diversely picked chemical structures. Every structure depiction in the dataset is mapped to a machine-readable representation of the underlying molecule. The dataset is openly available and published under the CC-BY 4.0 licence which applies very few limitations. We hope that it will contribute to the further development of the field.
Collapse
Affiliation(s)
- Henning Otto Brinkhaus
- Institute for Inorganic and Analytical Chemistry, Friedrich-Schiller-University Jena, Lessingstr. 8, 07743, Jena, Germany
| | - Achim Zielesny
- Institute for Bioinformatics and Chemoinformatics, Westphalian University of Applied Sciences, August-Schmidt-Ring 10, 45665, Recklinghausen, Germany
| | - Christoph Steinbeck
- Institute for Inorganic and Analytical Chemistry, Friedrich-Schiller-University Jena, Lessingstr. 8, 07743, Jena, Germany
| | - Kohulan Rajan
- Institute for Inorganic and Analytical Chemistry, Friedrich-Schiller-University Jena, Lessingstr. 8, 07743, Jena, Germany.
| |
Collapse
|
25
|
Brinkhaus HO, Rajan K, Zielesny A, Steinbeck C. RanDepict: Random chemical structure depiction generator. J Cheminform 2022; 14:31. [PMID: 35668480 PMCID: PMC9169273 DOI: 10.1186/s13321-022-00609-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/28/2022] [Accepted: 05/09/2022] [Indexed: 11/10/2022] Open
Abstract
The development of deep learning-based optical chemical structure recognition (OCSR) systems has led to a need for datasets of chemical structure depictions. The diversity of the features in the training data is an important factor for the generation of deep learning systems that generalise well and are not overfit to a specific type of input. In the case of chemical structure depictions, these features are defined by the depiction parameters such as bond length, line thickness, label font style and many others. Here we present RanDepict, a toolkit for the creation of diverse sets of chemical structure depictions. The diversity of the image features is generated by making use of all available depiction parameters in the depiction functionalities of the CDK, RDKit, and Indigo. Furthermore, there is the option to enhance and augment the image with features such as curved arrows, chemical labels around the structure, or other kinds of distortions. Using depiction feature fingerprints, RanDepict ensures diversely picked image features. Here, the depiction and augmentation features are summarised in binary vectors and the MaxMin algorithm is used to pick diverse samples out of all valid options. By making all resources described herein publicly available, we hope to contribute to the development of deep learning-based OCSR systems.
Collapse
Affiliation(s)
- Henning Otto Brinkhaus
- Institute for Inorganic and Analytical Chemistry, Friedrich-Schiller-University Jena, Lessingstr. 8, 07743, Jena, Germany
| | - Kohulan Rajan
- Institute for Inorganic and Analytical Chemistry, Friedrich-Schiller-University Jena, Lessingstr. 8, 07743, Jena, Germany
| | - Achim Zielesny
- Institute for Bioinformatics and Chemoinformatics, Westphalian University of Applied Sciences, August-Schmidt-Ring 10, D-45665, Recklinghausen, Germany
| | - Christoph Steinbeck
- Institute for Inorganic and Analytical Chemistry, Friedrich-Schiller-University Jena, Lessingstr. 8, 07743, Jena, Germany.
| |
Collapse
|
26
|
Petmezas G, Stefanopoulos L, Kilintzis V, Tzavelis A, Rogers JA, Katsaggelos AK, Maglaveras N. State-of-the-art Deep Learning Methods on Electrocardiogram Data: A Systematic Review (Preprint). JMIR Med Inform 2022; 10:e38454. [PMID: 35969441 PMCID: PMC9425174 DOI: 10.2196/38454] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/03/2022] [Revised: 06/03/2022] [Accepted: 07/03/2022] [Indexed: 11/13/2022] Open
Abstract
Background Electrocardiogram (ECG) is one of the most common noninvasive diagnostic tools that can provide useful information regarding a patient’s health status. Deep learning (DL) is an area of intense exploration that leads the way in most attempts to create powerful diagnostic models based on physiological signals. Objective This study aimed to provide a systematic review of DL methods applied to ECG data for various clinical applications. Methods The PubMed search engine was systematically searched by combining “deep learning” and keywords such as “ecg,” “ekg,” “electrocardiogram,” “electrocardiography,” and “electrocardiology.” Irrelevant articles were excluded from the study after screening titles and abstracts, and the remaining articles were further reviewed. The reasons for article exclusion were manuscripts written in any language other than English, absence of ECG data or DL methods involved in the study, and absence of a quantitative evaluation of the proposed approaches. Results We identified 230 relevant articles published between January 2020 and December 2021 and grouped them into 6 distinct medical applications, namely, blood pressure estimation, cardiovascular disease diagnosis, ECG analysis, biometric recognition, sleep analysis, and other clinical analyses. We provide a complete account of the state-of-the-art DL strategies per the field of application, as well as major ECG data sources. We also present open research problems, such as the lack of attempts to address the issue of blood pressure variability in training data sets, and point out potential gaps in the design and implementation of DL models. Conclusions We expect that this review will provide insights into state-of-the-art DL methods applied to ECG data and point to future directions for research on DL to create robust models that can assist medical experts in clinical decision-making.
Collapse
Affiliation(s)
- Georgios Petmezas
- Lab of Computing, Medical Informatics and Biomedical-Imaging Technologies, The Medical School, Aristotle University of Thessaloniki, Thessaloniki, Greece
| | - Leandros Stefanopoulos
- Lab of Computing, Medical Informatics and Biomedical-Imaging Technologies, The Medical School, Aristotle University of Thessaloniki, Thessaloniki, Greece
| | - Vassilis Kilintzis
- Lab of Computing, Medical Informatics and Biomedical-Imaging Technologies, The Medical School, Aristotle University of Thessaloniki, Thessaloniki, Greece
| | - Andreas Tzavelis
- Department of Biomedical Engineering, Northwestern University, Evanston, IL, United States
| | - John A Rogers
- Department of Material Science, Northwestern University, Evanston, IL, United States
| | - Aggelos K Katsaggelos
- Department of Electrical and Computer Engineering, Northwestern University, Evanston, IL, United States
| | - Nicos Maglaveras
- Lab of Computing, Medical Informatics and Biomedical-Imaging Technologies, The Medical School, Aristotle University of Thessaloniki, Thessaloniki, Greece
| |
Collapse
|
27
|
Stocker M, Heger T, Schweidtmann A, Ćwiek-Kupczyńska H, Penev L, Dojchinovski M, Willighagen E, Vidal ME, Turki H, Balliet D, Tiddi I, Kuhn T, Mietchen D, Karras O, Vogt L, Hellmann S, Jeschke J, Krajewski P, Auer S. SKG4EOSC - Scholarly Knowledge Graphs for EOSC: Establishing a backbone of knowledge graphs for FAIR Scholarly Information in EOSC. RESEARCH IDEAS AND OUTCOMES 2022. [DOI: 10.3897/rio.8.e83789] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
In the age of advanced information systems powering fast-paced knowledge economies that face global societal challenges, it is no longer adequate to express scholarly information - an essential resource for modern economies - primarily as article narratives in document form. Despite being a well-established tradition in scholarly communication, PDF-based text publishing is hindering scientific progress as it buries scholarly information into non-machine-readable formats. The key objective of SKG4EOSC is to improve science productivity through development and implementation of services for text and data conversion, and production, curation, and re-use of FAIR scholarly information. This will be achieved by (1) establishing the Open Research Knowledge Graph (ORKG, orkg.org), a service operated by the SKG4EOSC coordinator, as a Hub for access to FAIR scholarly information in the EOSC; (2) lifting to EOSC of numerous and heterogeneous domain-specific research infrastructures through the ORKG Hub’s harmonized access facilities; and (3) leverage the Hub to support cross-disciplinary research and policy decisions addressing societal challenges. SKG4EOSC will pilot the devised approaches and technologies in four research domains: biodiversity crisis, precision oncology, circular processes, and human cooperation. With the aim to improve machine-based scholarly information use, SKG4EOSC addresses an important current and future need of researchers. It extends the application of the FAIR data principles to scholarly communication practices, hence a more comprehensive coverage of the entire research lifecycle. Through explicit, machine actionable provenance links between FAIR scholarly information, primary data and contextual entities, it will substantially contribute to reproducibility, validation and trust in science. The resulting advanced machine support will catalyse new discoveries in basic research and solutions in key application areas.
Collapse
|