1
Moreau E, Hardiman O, Heverin M, O'Sullivan D. Mining impactful discoveries from the biomedical literature. BMC Bioinformatics 2024; 25:303. [PMID: 39285337 PMCID: PMC11403870 DOI: 10.1186/s12859-024-05881-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2023] [Accepted: 07/23/2024] [Indexed: 09/20/2024] Open
Abstract
BACKGROUND Literature-based discovery (LBD) aims to help researchers to identify relations between concepts which are worthy of further investigation by text-mining the biomedical literature. While the LBD literature is rich and the field is considered mature, standard practice in the evaluation of LBD methods is methodologically poor and has not progressed on par with the domain. The lack of a properly designed and decent-sized benchmark dataset hinders the progress of the field and its development into applications usable by biomedical experts. RESULTS This work presents a method for mining past discoveries from the biomedical literature. It leverages the impact made by a discovery, using descriptive statistics to detect surges in the prevalence of a relation across time. The validity of the method is tested against a baseline representing the state-of-the-art "time-sliced" method. CONCLUSIONS This method allows the collection of a large number of time-stamped discoveries. These can be used for LBD evaluation, alleviating the long-standing issue of inadequate evaluation. It might also pave the way for more fine-grained LBD methods, which could exploit the diversity of these past discoveries to train supervised models. Finally, the dataset (or some future version of it inspired by our method) could be used as a methodological tool for systematic reviews. We provide an online exploration tool in this perspective, available at https://brainmend.adaptcentre.ie/ .
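The idea of detecting a surge in the prevalence of a relation across time can be illustrated with a minimal sketch. This is not the authors' method: the function names, the three-year window, and the "largest before/after gain" rule are assumptions chosen only to make the idea concrete, and the counts are toy data.

```python
from statistics import mean


def relative_frequency(pair_counts, total_counts):
    """Share of each year's documents that mention the concept pair."""
    return {y: pair_counts.get(y, 0) / total_counts[y] for y in sorted(total_counts)}


def find_surge_year(freq, window=3):
    """Return the year where the mean frequency over the following `window`
    years most exceeds the mean over the preceding `window` years.
    Returns None when no year shows a positive gain.
    (Hypothetical rule, for illustration only.)"""
    years = sorted(freq)
    best_year, best_gain = None, 0.0
    for i in range(window, len(years) - window + 1):
        before = mean(freq[y] for y in years[i - window:i])
        after = mean(freq[y] for y in years[i:i + window])
        gain = after - before
        if gain > best_gain:
            best_year, best_gain = years[i], gain
    return best_year


# Toy data: yearly co-mention counts for one relation, 1000 documents per year.
pair_counts = {2000: 1, 2001: 1, 2002: 1, 2003: 1,
               2004: 10, 2005: 12, 2006: 11, 2007: 12}
total_counts = {year: 1000 for year in range(2000, 2008)}

freq = relative_frequency(pair_counts, total_counts)
surge_year = find_surge_year(freq)  # 2004 for this toy series
```

Normalising by the yearly document total matters because the literature itself grows over time; raw counts alone would make almost every relation look as if it were surging.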
Affiliation(s)
- Erwan Moreau
- Adapt Centre and School of Computer Science and Statistics, Trinity College Dublin, Dublin, Ireland.
- Orla Hardiman
- School of Medicine, Trinity College Dublin, Dublin, Ireland
- Mark Heverin
- School of Medicine, Trinity College Dublin, Dublin, Ireland
- Declan O'Sullivan
- Adapt Centre and School of Computer Science and Statistics, Trinity College Dublin, Dublin, Ireland
2
Pu Y, Beck D, Verspoor K. Graph embedding-based link prediction for literature-based discovery in Alzheimer's Disease. J Biomed Inform 2023; 145:104464. [PMID: 37541406 DOI: 10.1016/j.jbi.2023.104464] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 01/22/2023] [Revised: 07/29/2023] [Accepted: 07/30/2023] [Indexed: 08/06/2023]
Abstract
OBJECTIVE We explore the framing of literature-based discovery (LBD) as link prediction and graph embedding learning, with Alzheimer's Disease (AD) as our focus disease context. The key link prediction setting of prediction window length is specifically examined in the context of a time-sliced evaluation methodology. METHODS We propose a four-stage approach to explore literature-based discovery for Alzheimer's Disease, creating and analyzing a knowledge graph tailored to the AD context, and predicting and evaluating new knowledge based on time-sliced link prediction. The first stage is to collect an AD-specific corpus. The second stage involves constructing an AD knowledge graph with identified AD-specific concepts and relations from the corpus. In the third stage, 20 pairs of training and testing datasets are constructed with the time-slicing methodology. Finally, we infer new knowledge with graph embedding-based link prediction methods. We compare different link prediction methods in this context. The impact of limiting prediction evaluation of LBD models in the context of short-term and longer-term knowledge evolution for Alzheimer's Disease is assessed. RESULTS We constructed an AD corpus of over 16 k papers published in 1977-2021, and automatically annotated it with concepts and relations covering 11 AD-specific semantic entity types. The knowledge graph of Alzheimer's Disease derived from this resource consisted of ∼11 k nodes and ∼394 k edges, among which 34% were genotype-phenotype relationships, 57% were genotype-genotype relationships, and 9% were phenotype-phenotype relationships. A Structural Deep Network Embedding (SDNE) model consistently showed the best performance in terms of returning the most confident set of link predictions as time progresses over 20 years. 
A huge improvement in model performance was observed when changing the link prediction evaluation setting to consider a more distant future, reflecting the time required for knowledge accumulation. CONCLUSION Neural network graph-embedding link prediction methods show promise for the literature-based discovery context, although the prediction setting is extremely challenging, with graph densities of less than 1%. Varying prediction window length on the time-sliced evaluation methodology leads to hugely different results and interpretations of LBD studies. Our approach can be generalized to enable knowledge discovery for other diseases. AVAILABILITY Code, AD ontology, and data are available at https://github.com/READ-BioMed/readbiomed-lbd.
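As a rough illustration of time-sliced evaluation and of why the prediction window length changes the result, the following sketch splits dated relations at a cutoff year and scores a ranked list of predictions. It is a simplified stand-in, not the pipeline from the paper: the function names, the `window` parameter, and the concept identifiers `c1` to `c5` are all hypothetical.

```python
def time_slice(edges, cutoff_year, window=None):
    """Split dated relations (a, b, year) into train and test sets.

    train: pairs first seen at or before `cutoff_year`.
    test:  pairs first seen after the cutoff, optionally restricted to the
           next `window` years (the prediction window).
    """
    first_seen = {}
    for a, b, year in edges:
        pair = frozenset((a, b))  # undirected relation
        first_seen[pair] = min(year, first_seen.get(pair, year))
    train = {p for p, y in first_seen.items() if y <= cutoff_year}
    test = {
        p for p, y in first_seen.items()
        if y > cutoff_year and (window is None or y <= cutoff_year + window)
    }
    return train, test


def precision_at_k(ranked_pairs, test, k):
    """Fraction of the top-k predicted pairs that appear in the test set."""
    top = ranked_pairs[:k]
    if not top:
        return 0.0
    hits = sum(1 for a, b in top if frozenset((a, b)) in test)
    return hits / len(top)


# Toy relations between hypothetical concepts, with publication years.
edges = [("c1", "c2", 1995), ("c3", "c2", 2013),
         ("c1", "c2", 2001), ("c4", "c2", 2010)]
ranked = [("c2", "c4"), ("c3", "c2"), ("c5", "c2")]  # model output, best first

train, test_long = time_slice(edges, 2005)            # unlimited future
_, test_short = time_slice(edges, 2005, window=5)     # only 2006-2010

p_long = precision_at_k(ranked, test_long, 2)    # 1.0
p_short = precision_at_k(ranked, test_short, 2)  # 0.5
```

The same ranked list scores differently under the two windows because a prediction confirmed in 2013 counts as a miss when only the five years after the cutoff are considered.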
Affiliation(s)
- Yiyuan Pu
- School of Computing and Information Systems, The University of Melbourne, Melbourne, Victoria, Australia.
- Daniel Beck
- School of Computing and Information Systems, The University of Melbourne, Melbourne, Victoria, Australia.
- Karin Verspoor
- School of Computing and Information Systems, The University of Melbourne, Melbourne, Victoria, Australia; School of Computing Technologies, RMIT University, Melbourne, Victoria, Australia.
3
Cuffy C, McInnes BT. Exploring a deep learning neural architecture for closed Literature-based discovery. J Biomed Inform 2023; 143:104362. [PMID: 37146741 DOI: 10.1016/j.jbi.2023.104362] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/28/2022] [Revised: 03/15/2023] [Accepted: 04/06/2023] [Indexed: 05/07/2023]
Abstract
Scientific literature presents a wealth of information yet to be explored. As the number of researchers increases with each passing year and more publications are released, specialized fields of research are becoming more prevalent. As this trend continues, it further separates interdisciplinary publications and makes keeping up to date with the literature a laborious task. Literature-based discovery (LBD) aims to mitigate these concerns by promoting information sharing among non-interacting literatures while extracting potentially meaningful information. Furthermore, recent advances in neural network architectures and data representation techniques have fueled their respective research communities in achieving state-of-the-art performance in many downstream tasks. However, neural network-based methods for LBD remain largely unexplored. We introduce and explore a deep learning neural network-based approach for LBD. Additionally, we investigate various approaches to represent terms as concepts and analyze the effect of feature scaling the representations fed into our model. We compare the evaluation performance of our method on five hallmarks of cancer datasets utilized for closed discovery. Our results show that the representation chosen as input to our model affects evaluation performance. We found that feature scaling our input representations increases evaluation performance and decreases the number of epochs needed to achieve model generalization. We also explore two approaches to represent model output. We found that reducing the model's output to capture a subset of concepts improved evaluation performance at the cost of model generalizability. We also compare the efficacy of our method on the five hallmarks of cancer datasets to a set of randomly chosen relations between concepts. We found these experiments confirm our method's suitability for LBD.
Affiliation(s)
- Clint Cuffy
- Virginia Commonwealth University, 401 S. Main St., Richmond, VA 23284, USA.
- Bridget T McInnes
- Virginia Commonwealth University, 401 S. Main St., Richmond, VA 23284, USA.
4
Launer-Wachs S, Taub-Tabib H, Tokarev Madem J, Bar-Natan O, Goldberg Y, Shamay Y. From Centralized to Ad-Hoc Knowledge Base Construction for Hypotheses Generation. J Biomed Inform 2023; 142:104383. [PMID: 37196989 DOI: 10.1016/j.jbi.2023.104383] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 08/15/2022] [Revised: 04/27/2023] [Accepted: 05/03/2023] [Indexed: 05/19/2023]
Abstract
OBJECTIVE To demonstrate and develop an approach enabling individual researchers or small teams to create their own ad-hoc, lightweight knowledge bases tailored for specialized scientific interests, using text-mining over scientific literature, and demonstrate the effectiveness of these knowledge bases in hypothesis generation and literature-based discovery (LBD). METHODS We propose a lightweight process using an extractive search framework to create ad-hoc knowledge bases, which require minimal training and no background in bio-curation or computer science. These knowledge bases are particularly effective for LBD and hypothesis generation using Swanson's ABC method. The personalized nature of the knowledge bases allows for a somewhat higher level of noise than "public facing" ones, as researchers are expected to have prior domain experience to separate signal from noise. Fact verification is shifted from exhaustive verification of the knowledge base to post-hoc verification of specific entries of interest, allowing researchers to assess the correctness of relevant knowledge base entries by considering the paragraphs in which the facts were introduced. RESULTS We demonstrate the methodology by constructing several knowledge bases of different kinds: three knowledge bases that support lab-internal hypothesis generation: Drug Delivery to Ovarian Tumors (DDOT); Tissue Engineering and Regeneration; Challenges in Cancer Research; and an additional comprehensive, accurate knowledge base designated as a public resource for the wider community on the topic of Cell Specific Drug Delivery (CSDD). In each case, we show the design and construction process, along with relevant visualizations for data exploration, and hypothesis generation. For CSDD and DDOT we also show meta-analysis, human evaluation, and in vitro experimental evaluation. 
CONCLUSION Our approach enables researchers to create personalized, lightweight knowledge bases for specialized scientific interests, effectively facilitating hypothesis generation and literature-based discovery (LBD). By shifting fact verification efforts to post-hoc verification of specific entries, researchers can focus on exploring and generating hypotheses based on their expertise. The constructed knowledge bases demonstrate the versatility and adaptability of our approach to versatile research interests. The web-based platform, available at https://spike-kbc.apps.allenai.org , provides researchers with a valuable tool for rapid construction of knowledge bases tailored to their needs.
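Swanson's ABC method, mentioned above, can be sketched over a small knowledge base of concept pairs: starting from a concept A, collect the concepts B linked to A, then propose concepts C that are linked to some B but not directly to A. This is a minimal sketch and not the authors' platform: the function name `abc_candidates`, the ranking by number of bridging concepts, and the toy relations are assumptions for illustration.

```python
from collections import defaultdict


def abc_candidates(relations, a):
    """Open-discovery ABC search over undirected (x, y) relations.

    Returns (c, bridging_terms) pairs, ordered by how many intermediate
    B concepts connect `a` to `c` (most first, ties broken alphabetically).
    """
    neighbours = defaultdict(set)
    for x, y in relations:
        neighbours[x].add(y)
        neighbours[y].add(x)

    linked_to_a = neighbours[a]
    support = defaultdict(set)
    for b in linked_to_a:
        for c in neighbours[b]:
            # Keep only indirect links: skip A itself and its direct neighbours.
            if c != a and c not in linked_to_a:
                support[c].add(b)
    return sorted(support.items(), key=lambda kv: (-len(kv[1]), kv[0]))


# Toy knowledge base loosely modelled on Swanson's fish oil example.
relations = [
    ("fish oil", "blood viscosity"),
    ("fish oil", "platelet aggregation"),
    ("blood viscosity", "Raynaud's disease"),
    ("platelet aggregation", "Raynaud's disease"),
    ("blood viscosity", "hypertension"),
]

candidates = abc_candidates(relations, "fish oil")
top_concept, bridges = candidates[0]
```

Returning the bridging B terms alongside each candidate fits the post-hoc verification described in the abstract: a researcher can check the specific entries that support a hypothesis rather than the whole knowledge base.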
Affiliation(s)
- Shaked Launer-Wachs
- Faculty of Biomedical Engineering, Technion - Israel Institute of Technology, Haifa, Israel
- Jennie Tokarev Madem
- Faculty of Biomedical Engineering, Technion - Israel Institute of Technology, Haifa, Israel
- Orr Bar-Natan
- Faculty of Biomedical Engineering, Technion - Israel Institute of Technology, Haifa, Israel
- Yoav Goldberg
- Allen Institute for AI, Tel Aviv, Israel; Bar-Ilan University, Ramat-Gan, Israel
- Yosi Shamay
- Faculty of Biomedical Engineering, Technion - Israel Institute of Technology, Haifa, Israel.
5
Moreau E. Literature-based discovery: addressing the issue of the subpar evaluation methodology. Bioinformatics 2023; 39:btad090. [PMID: 36786419 PMCID: PMC9945845 DOI: 10.1093/bioinformatics/btad090] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/10/2022] [Revised: 01/26/2023] [Accepted: 02/13/2023] [Indexed: 02/15/2023] Open
Affiliation(s)
- Erwan Moreau
- Adapt Centre, Trinity College Dublin, Dublin, Ireland
6
Wu H, Wang M, Wu J, Francis F, Chang YH, Shavick A, Dong H, Poon MTC, Fitzpatrick N, Levine AP, Slater LT, Handy A, Karwath A, Gkoutos GV, Chelala C, Shah AD, Stewart R, Collier N, Alex B, Whiteley W, Sudlow C, Roberts A, Dobson RJB. A survey on clinical natural language processing in the United Kingdom from 2007 to 2022. NPJ Digit Med 2022; 5:186. [PMID: 36544046 PMCID: PMC9770568 DOI: 10.1038/s41746-022-00730-6] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Received: 06/20/2022] [Accepted: 11/29/2022] [Indexed: 12/24/2022] Open
Abstract
Much of the knowledge and information needed for enabling high-quality clinical research is stored in free-text format. Natural language processing (NLP) has been used to extract information from these sources at scale for several decades. This paper aims to present a comprehensive review of clinical NLP for the past 15 years in the UK to identify the community, depict its evolution, analyse methodologies and applications, and identify the main barriers. We collect a dataset of clinical NLP projects (n = 94; £ = 41.97 m) funded by UK funders or the European Union's funding programmes. Additionally, we extract details on 9 funders, 137 organisations, 139 persons and 431 research papers. Networks are created from timestamped data interlinking all entities, and network analysis is subsequently applied to generate insights. 431 publications are identified as part of a literature review, of which 107 are eligible for final analysis. Results show that, not surprisingly, clinical NLP in the UK has increased substantially in the last 15 years: the total budget in the period of 2019-2022 was 80 times that of 2007-2010. However, effort is required to deepen areas such as disease (sub-)phenotyping and broaden application domains. There is also a need to improve links between academia and industry and enable deployments in real-world settings for the realisation of clinical NLP's great potential in care delivery. The major barriers include research and development access to hospital data, lack of capable computational resources in the right places, the scarcity of labelled data and barriers to sharing of pretrained models.
Affiliation(s)
- Honghan Wu
- Institute of Health Informatics, University College London, London, UK.
- Minhong Wang
- Institute of Health Informatics, University College London, London, UK
- Jinge Wu
- Institute of Health Informatics, University College London, London, UK
- Usher Institute, University of Edinburgh, Edinburgh, UK
- Farah Francis
- Usher Institute, University of Edinburgh, Edinburgh, UK
- Yun-Hsuan Chang
- Institute of Health Informatics, University College London, London, UK
- Alex Shavick
- Research Department of Pathology, UCL Cancer Institute, University College London, London, UK
- Hang Dong
- Usher Institute, University of Edinburgh, Edinburgh, UK
- Department of Computer Science, University of Oxford, Oxford, UK
- Adam P Levine
- Research Department of Pathology, UCL Cancer Institute, University College London, London, UK
- Luke T Slater
- Institute of Cancer and Genomics, University of Birmingham, Birmingham, UK
- Alex Handy
- Institute of Health Informatics, University College London, London, UK
- University College London Hospitals NHS Trust, London, UK
- Andreas Karwath
- Institute of Cancer and Genomics, University of Birmingham, Birmingham, UK
- Georgios V Gkoutos
- Institute of Cancer and Genomics, University of Birmingham, Birmingham, UK
- Claude Chelala
- Centre for Tumour Biology, Barts Cancer Institute, Queen Mary University of London, London, UK
- Anoop Dinesh Shah
- Institute of Health Informatics, University College London, London, UK
- Robert Stewart
- Department of Psychological Medicine, Institute of Psychiatry, Psychology and Neuroscience (IoPPN), King's College London, London, UK
- South London and Maudsley NHS Foundation Trust, London, UK
- Nigel Collier
- Theoretical and Applied Linguistics, Faculty of Modern & Medieval Languages & Linguistics, University of Cambridge, Cambridge, UK
- Beatrice Alex
- Edinburgh Futures Institute, University of Edinburgh, Edinburgh, UK
- Cathie Sudlow
- Usher Institute, University of Edinburgh, Edinburgh, UK
- Angus Roberts
- Department of Biostatistics & Health Informatics, King's College London, London, UK
- Richard J B Dobson
- Institute of Health Informatics, University College London, London, UK
- Department of Biostatistics & Health Informatics, King's College London, London, UK
7
Optimizations for Computing Relatedness in Biomedical Heterogeneous Information Networks: SemNet 2.0. Big Data and Cognitive Computing 2022; 6. [PMID: 35936510 PMCID: PMC9351549 DOI: 10.3390/bdcc6010027] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 12/10/2022]
Abstract
Literature-based discovery (LBD) summarizes information and generates insight from large text corpuses. The SemNet framework utilizes a large heterogeneous information network or “knowledge graph” of nodes and edges to compute relatedness and rank concepts pertinent to a user-specified target. SemNet provides a way to perform multi-factorial and multi-scalar analysis of complex disease etiology and therapeutic identification using the 33+ million articles in PubMed. The present work improves the efficacy and efficiency of LBD for end users by augmenting SemNet to create SemNet 2.0. A custom Python data structure replaced reliance on Neo4j to improve knowledge graph query times by several orders of magnitude. Additionally, two randomized algorithms were built to optimize the HeteSim metric calculation for computing metapath similarity. The unsupervised learning algorithm for rank aggregation (ULARA), which ranks concepts with respect to the user-specified target, was reconstructed using derived mathematical proofs of correctness and probabilistic performance guarantees for optimization. The upgraded ULARA is generalizable to other rank aggregation problems outside of SemNet. In summary, SemNet 2.0 is a comprehensive open-source software for significantly faster, more effective, and user-friendly means of automated biomedical LBD. An example case is performed to rank relationships between Alzheimer’s disease and metabolic co-morbidities.
8
Bhasuran B. Combining Literature Mining and Machine Learning for Predicting Biomedical Discoveries. Methods Mol Biol 2022; 2496:123-140. [PMID: 35713862 DOI: 10.1007/978-1-0716-2305-3_7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/15/2023]
Abstract
The major outcomes and insights of scientific research and clinical studies end up as publications or clinical records in unstructured text format. Due to advancements in biomedical research, the published literature has grown tremendously in recent years. Scientists and clinical researchers face a major challenge in staying current with this knowledge and extracting hidden information from the millions of published biomedical articles. A potential one-stop automated solution to this problem is biomedical literature mining. One of the long-standing goals in biology is to discover disease-causing genes and their specific roles in personalized precision medicine and drug repurposing. However, empirical approaches and clinical affirmation are expensive and time-consuming. An in silico approach using text mining to identify disease-causing genes can contribute towards biomarker discovery. This chapter presents a protocol on combining literature mining and machine learning for predicting biomedical discoveries, with a special emphasis on gene-disease relation based discovery. The protocol is presented as a literature-based discovery (LBD) pipeline for gene-disease based discovery. The protocol includes our web-based tools: (1) DNER (Disease Named Entity Recognizer) for disease entity recognition, (2) BCCNER (Bidirectional, Contextual clues Named Entity Tagger) for gene/protein entity recognition, (3) DisGeReExT (Disease-Gene Relation Extractor) for statistically validated results and visualization, and (4) a newly introduced deep learning based method for association discovery. Our proposed deep learning based method can be generalized and applied to other important biomedical discoveries focusing on entities such as drug/chemical, or miRNA.
Affiliation(s)
- Balu Bhasuran
- DRDO-BU Center for Life Sciences, Bharathiar University Campus, Coimbatore, Tamilnadu, India.
- Bakar Computational Health Sciences Institute, University of California, San Francisco, CA, USA.
9
Phang CSJ, Vong WT, Sebastian Y, Raman V, Then PHH. Understanding the Usability of a Literature-Based Discovery System Among Clinical Researchers in Sarawak, Malaysia. International Journal of Technology and Human Interaction 2022. [DOI: 10.4018/ijthi.304092] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
The rapid increase in scientific publications makes it difficult for researchers to keep up with the latest literature and to explore new research directions. Literature-based discovery (LBD) systems aim to resolve this issue by bridging literatures from disparate fields to assist researchers in knowledge discovery and the formulation and testing of research hypotheses. Previous studies have focused mainly on evaluating the efficacy of LBD systems by replicating historical LBD events. The usability of LBD systems has been under-researched, which partly explains the low adoption of the systems. This paper presents a survey study that evaluates the usability of an LBD system for knowledge discovery and hypothesis refinement, and also investigates factors affecting its adoption among biomedical researchers in Sarawak, Malaysia. The findings suggest that the adoption of the LBD system is related to its perceived usefulness and the perceived difficulty of interacting with the user interface features of the system.
Affiliation(s)
- Wan-Tze Vong
- Swinburne University of Technology, Sarawak, Malaysia
10
Sharma PP, Bansal M, Sethi A, Poonam, Pena L, Goel VK, Grishina M, Chaturvedi S, Kumar D, Rathi B. Computational methods directed towards drug repurposing for COVID-19: advantages and limitations. RSC Adv 2021; 11:36181-36198. [PMID: 35492747 PMCID: PMC9043418 DOI: 10.1039/d1ra05320e] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 07/10/2021] [Accepted: 10/07/2021] [Indexed: 12/19/2022] Open
Abstract
Novel coronavirus disease 2019 (COVID-19) has significantly altered the socio-economic status of countries. Although vaccines are now available against the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the causative agent of COVID-19, the virus continues to spread and newer variants of concern have been consistently emerging worldwide. Computational strategies involving drug repurposing offer a viable opportunity to select a medication from the list of approved drugs for use against distinct diseases, including COVID-19. While pandemics strain healthcare systems, drug repurposing or repositioning represents a hopeful approach in which existing drugs can be remodeled and employed to treat newer diseases. In this review, we summarize the diverse computational approaches attempted for developing drugs through drug repurposing or repositioning against COVID-19 and discuss their advantages and limitations. To this end, we have outlined studies that utilized computational techniques such as molecular docking, molecular dynamic simulation, disease-disease association, drug-drug interaction, integrated biological network, artificial intelligence, machine learning and network medicine to accelerate creation of smart and safe drugs against COVID-19.
Affiliation(s)
- Prem Prakash Sharma
- Laboratory For Translational Chemistry and Drug Discovery, Department of Chemistry, Hansraj College, University of Delhi Delhi 110007 India
- Meenakshi Bansal
- Laboratory For Translational Chemistry and Drug Discovery, Department of Chemistry, Hansraj College, University of Delhi Delhi 110007 India
- Aaftaab Sethi
- Department of Medicinal Chemistry, National Institute of Pharmaceutical Education and Research (NIPER) Hyderabad India
- Poonam
- Department of Chemistry, Miranda House, University of Delhi Delhi 110007 India
- Lindomar Pena
- Department of Virology, Aggeu Magalhaes, Institute (IAM), Oswaldo Cruz Foundation (Fiocruz) Recife 50670-420 Pernambuco Brazil
- Vijay Kumar Goel
- School of Physical Sciences, Jawaharlal Nehru University New Delhi 110067 India
- Maria Grishina
- South Ural State University, Laboratory of Computational Modelling of Drugs Pr. Lenina 76 454080 Russia
- Shubhra Chaturvedi
- Division of Cyclotron and Radiopharmaceutical Sciences, Institute of Nuclear Medicine and Allied Sciences New Delhi 110054 India
- Dhruv Kumar
- Amity Institute of Molecular Medicine & Stem Cell Research (AIMMSCR), Amity University Uttar Pradesh Noida 201313 India
- Brijesh Rathi
- Laboratory For Translational Chemistry and Drug Discovery, Department of Chemistry, Hansraj College, University of Delhi Delhi 110007 India
11
Barupal DK, Schubauer-Berigan MK, Korenjak M, Zavadil J, Guyton KZ. Prioritizing cancer hazard assessments for IARC Monographs using an integrated approach of database fusion and text mining. Environment International 2021; 156:106624. [PMID: 33984576 PMCID: PMC8380673 DOI: 10.1016/j.envint.2021.106624] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 01/12/2021] [Revised: 03/22/2021] [Accepted: 04/30/2021] [Indexed: 05/14/2023]
Abstract
BACKGROUND Systematic evaluation of literature data on the cancer hazards of human exposures is an essential process underlying cancer prevention strategies. The scope and volume of evidence for suspected carcinogens can range from very few to thousands of publications, requiring a complex, systematically planned, and critical procedure to nominate, prioritize and evaluate carcinogenic agents. To aid in this process, database fusion, cheminformatics and text mining techniques can be combined into an integrated approach to inform agent prioritization, selection, and grouping. RESULTS We have applied these techniques to agents recommended for the IARC Monographs evaluations during 2020-2024. An integration of PubMed filters to cover cancer epidemiology, key characteristics of carcinogens, chemical lists from 34 databases relevant for cancer research, chemical structure grouping and a literature data-based clustering was applied in an innovative approach to 119 agents recommended by an advisory group for future IARC Monographs evaluations. The approach also facilitated a rational grouping of these agents and aids in understanding the volume and complexity of relevant information, as well as important gaps in coverage of the available studies on cancer etiology and carcinogenesis. CONCLUSION A new data-science approach has been applied to diverse agents recommended for cancer hazard assessments, and its applications for the IARC Monographs are demonstrated. The prioritization approach has been made available at www.cancer.idsl.me site for ranking cancer agents.
Affiliation(s)
- Dinesh Kumar Barupal
- Department of Environmental Medicine and Public Health, Icahn School of Medicine at Mt Sinai, NY, USA.
- Mary K Schubauer-Berigan
- Evidence Synthesis and Classification Branch, International Agency for Research on Cancer, Lyon, France
- Michael Korenjak
- Epigenomics and Mechanisms Branch, International Agency for Research on Cancer, Lyon, France
- Jiri Zavadil
- Epigenomics and Mechanisms Branch, International Agency for Research on Cancer, Lyon, France
- Kathryn Z Guyton
- Evidence Synthesis and Classification Branch, International Agency for Research on Cancer, Lyon, France
12
Škrlj B, Kokalj E, Lavrač N. PubMed-Scale Chemical Concept Embeddings Reconstruct Physical Protein Interaction Networks. Front Res Metr Anal 2021; 6:644614. [PMID: 33928210 PMCID: PMC8076635 DOI: 10.3389/frma.2021.644614] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/21/2020] [Accepted: 02/08/2021] [Indexed: 11/13/2022] Open
Abstract
PubMed is the largest resource of curated biomedical knowledge to date, entailing more than 25 million documents. Large quantities of novel literature prevent a single expert from keeping track of all potentially relevant papers, resulting in knowledge gaps. In this article, we present CHEMMESHNET, a newly developed PubMed-based network comprising more than 10,000,000 associations, constructed from expert-curated MeSH annotations of chemicals based on all currently available PubMed articles. By learning latent representations of concepts in the obtained network, we demonstrate in a proof of concept study that purely literature-based representations are sufficient for the reconstruction of a large part of the currently known network of physical, empirically determined protein-protein interactions. We demonstrate that simple linear embeddings of node pairs, when coupled with a neural network-based classifier, reliably reconstruct the existing collection of empirically confirmed protein-protein interactions. Furthermore, we demonstrate how pairs of learned representations can be used to prioritize potentially interesting novel interactions based on the common chemical context. Highly ranked interactions are qualitatively inspected in terms of potential complex formation at the structural level and represent potentially interesting new knowledge. We demonstrate that two protein-protein interactions, prioritized by structure-based approaches, also emerge as probable with regard to the trained machine-learning model.
Affiliation(s)
- Blaž Škrlj
- Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
- Jožef Stefan Institute, Ljubljana, Slovenia
- Enja Kokalj
- Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
- Jožef Stefan Institute, Ljubljana, Slovenia
- Nada Lavrač
- Jožef Stefan Institute, Ljubljana, Slovenia
- University of Nova Gorica, Vipava, Slovenia
13
Tworowski D, Gorohovski A, Mukherjee S, Carmi G, Levy E, Detroja R, Mukherjee SB, Frenkel-Morgenstern M. COVID19 Drug Repository: text-mining the literature in search of putative COVID19 therapeutics. Nucleic Acids Res 2021; 49:D1113-D1121. [PMID: 33166390 PMCID: PMC7778969 DOI: 10.1093/nar/gkaa969] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Received: 08/25/2020] [Revised: 10/07/2020] [Accepted: 11/04/2020] [Indexed: 12/12/2022] Open
Abstract
The recent outbreak of COVID-19 has generated an enormous amount of Big Data. To date, the COVID-19 Open Research Dataset (CORD-19) lists ∼130,000 articles from the WHO COVID-19 database, PubMed Central, medRxiv, and bioRxiv, as collected by Semantic Scholar. According to LitCovid (11 August 2020), ∼40,300 COVID19-related articles are currently listed in PubMed. It has been shown in clinical settings that the analysis of past research results and the mining of available data can provide novel opportunities for the successful application of currently approved therapeutics and their combinations for the treatment of conditions caused by a novel SARS-CoV-2 infection. As such, effective responses to the pandemic require the development of efficient applications, methods and algorithms for data navigation, text-mining, clustering, classification, analysis, and reasoning. Thus, our COVID19 Drug Repository represents a modular platform for drug data navigation and analysis, with an emphasis on COVID-19-related information currently being reported. The COVID19 Drug Repository enables users to focus on different levels of complexity, starting from general information about (FDA-) approved drugs, PubMed references, clinical trials, and recipes, as well as descriptions of the molecular mechanisms of drug action. Our COVID19 Drug Repository provides an up-to-date collection of drugs that have been repurposed for COVID19 treatments around the world.
Affiliation(s)
- Dmitry Tworowski
- Laboratory of Cancer Genomics and Biocomputing of Complex Diseases, Azrieli Faculty of Medicine, Bar-Ilan University, Henrietta Szold 8, Safed 13195, Israel
- Alessandro Gorohovski
- Laboratory of Cancer Genomics and Biocomputing of Complex Diseases, Azrieli Faculty of Medicine, Bar-Ilan University, Henrietta Szold 8, Safed 13195, Israel
- Sumit Mukherjee
- Laboratory of Cancer Genomics and Biocomputing of Complex Diseases, Azrieli Faculty of Medicine, Bar-Ilan University, Henrietta Szold 8, Safed 13195, Israel
- Gon Carmi
- Laboratory of Cancer Genomics and Biocomputing of Complex Diseases, Azrieli Faculty of Medicine, Bar-Ilan University, Henrietta Szold 8, Safed 13195, Israel
- Eliad Levy
- Laboratory of Cancer Genomics and Biocomputing of Complex Diseases, Azrieli Faculty of Medicine, Bar-Ilan University, Henrietta Szold 8, Safed 13195, Israel
- Rajesh Detroja
- Laboratory of Cancer Genomics and Biocomputing of Complex Diseases, Azrieli Faculty of Medicine, Bar-Ilan University, Henrietta Szold 8, Safed 13195, Israel
- Sunanda Biswas Mukherjee
- Laboratory of Cancer Genomics and Biocomputing of Complex Diseases, Azrieli Faculty of Medicine, Bar-Ilan University, Henrietta Szold 8, Safed 13195, Israel
- Milana Frenkel-Morgenstern
- Laboratory of Cancer Genomics and Biocomputing of Complex Diseases, Azrieli Faculty of Medicine, Bar-Ilan University, Henrietta Szold 8, Safed 13195, Israel
14