101
Garda S, Lenihan-Geels F, Proft S, Hochmuth S, Schülke M, Seelow D, Leser U. RegEl corpus: identifying DNA regulatory elements in the scientific literature. Database (Oxford) 2022; 2022:6618549. [PMID: 35758881 PMCID: PMC9235371 DOI: 10.1093/database/baac043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/04/2022] [Revised: 05/25/2022] [Accepted: 06/02/2022] [Indexed: 11/17/2022]
Abstract
High-throughput technologies led to the generation of a wealth of data on regulatory DNA elements in the human genome. However, results from disease-driven studies are primarily shared in textual form as scientific articles. Information extraction (IE) algorithms allow this information to be (semi-)automatically accessed. Their development, however, is dependent on the availability of annotated corpora. Therefore, we introduce RegEl (Regulatory Elements), the first freely available corpus annotated with regulatory DNA elements comprising 305 PubMed abstracts for a total of 2690 sentences. We focus on enhancers, promoters and transcription factor binding sites. Three annotators worked in two stages, achieving an overall 0.73 F1 inter-annotator agreement and 0.46 for regulatory elements. Depending on the entity type, IE baselines reach F1-scores of 0.48–0.91 for entity detection and 0.71–0.88 for entity normalization. Next, we apply our entity detection models to the entire PubMed collection and extract co-occurrences of genes or diseases with regulatory elements. This generates large collections of regulatory elements associated with 137 870 unique genes and 7420 diseases, which we make openly available. Database URL: https://zenodo.org/record/6418451#.YqcLHvexVqg
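The inter-annotator agreement reported above is an F1 score. As a rough illustration only, the sketch below computes an exact-match F1 between two annotators' entity spans. The `(start, end, label)` tuple layout, the function name `span_f1` and the toy data are assumptions made for this example; the abstract does not state how the authors matched spans.

```python
def span_f1(annotator_a, annotator_b):
    """Exact-match F1 between two collections of (start, end, label) spans.

    Annotator A is treated as the reference and B as the prediction;
    with exact matching the F1 value is symmetric in A and B.
    """
    a, b = set(annotator_a), set(annotator_b)
    if not a and not b:
        return 1.0  # two empty annotation sets agree trivially
    matched = len(a & b)
    if matched == 0:
        return 0.0
    precision = matched / len(b)
    recall = matched / len(a)
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations: the annotators disagree on the label of one span.
ann_a = [(0, 8, "enhancer"), (15, 23, "promoter"), (30, 34, "gene")]
ann_b = [(0, 8, "enhancer"), (15, 23, "gene"), (30, 34, "gene")]
print(round(span_f1(ann_a, ann_b), 3))
```

Exact matching is the strictest choice; a relaxed variant that accepts overlapping spans would generally yield higher agreement.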
Affiliation(s)
- Samuele Garda
  - Humboldt-Universität zu Berlin, Computer Science, Rudower Chaussee 25, 12489, Berlin, Germany
- Freyda Lenihan-Geels
  - Charité-Universitätsmedizin Berlin, Klinik für Pädiatrie m.S. Neurologie, Augustenburger Platz 1, 13353, Berlin, Germany
- Sebastian Proft
  - Berlin Institute of Health at Charité-Universitätsmedizin Berlin, Bioinformatics and Translational Genetics, Anna-Louisa-Karsch-Straße 2, 10178, Berlin, Germany
  - Charité-Universitätsmedizin Berlin, Institut für Medizinische Genetik und Humangenetik, Augustenburger Platz 1, 13353, Berlin, Germany
- Stefanie Hochmuth
  - Charité-Universitätsmedizin Berlin, Klinik für Pädiatrie m.S. Neurologie, Augustenburger Platz 1, 13353, Berlin, Germany
- Markus Schülke
  - Charité-Universitätsmedizin Berlin, Klinik für Pädiatrie m.S. Neurologie, Augustenburger Platz 1, 13353, Berlin, Germany
- Dominik Seelow
  - Berlin Institute of Health at Charité-Universitätsmedizin Berlin, Bioinformatics and Translational Genetics, Anna-Louisa-Karsch-Straße 2, 10178, Berlin, Germany
- Ulf Leser
  - Humboldt-Universität zu Berlin, Computer Science, Rudower Chaussee 25, 12489, Berlin, Germany
102
Lischka A, Lassuthova P, Çakar A, Record CJ, Van Lent J, Baets J, Dohrn MF, Senderek J, Lampert A, Bennett DL, Wood JN, Timmerman V, Hornemann T, Auer-Grumbach M, Parman Y, Hübner CA, Elbracht M, Eggermann K, Geoffrey Woods C, Cox JJ, Reilly MM, Kurth I. Genetic pain loss disorders. Nat Rev Dis Primers 2022; 8:41. [PMID: 35710757 DOI: 10.1038/s41572-022-00365-7] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Accepted: 05/10/2022] [Indexed: 01/05/2023]
Abstract
Genetic pain loss includes congenital insensitivity to pain (CIP), hereditary sensory neuropathies and, if autonomic nerves are involved, hereditary sensory and autonomic neuropathy (HSAN). This heterogeneous group of disorders highlights the essential role of nociception in protecting against tissue damage. Patients with genetic pain loss have recurrent injuries, burns and poorly healing wounds as disease hallmarks. CIP and HSAN are caused by pathogenic genetic variants in >20 genes that lead to developmental defects, neurodegeneration or altered neuronal excitability of peripheral damage-sensing neurons. These genetic variants lead to hyperactivity of sodium channels, disturbed haem metabolism, altered clathrin-mediated transport and impaired gene regulatory mechanisms affecting epigenetic marks, long non-coding RNAs and repetitive elements. Therapies for pain loss disorders are mainly symptomatic but the first targeted therapies are being tested. Conversely, chronic pain remains one of the greatest unresolved medical challenges, and the genes and mechanisms associated with pain loss offer new targets for analgesics. Given the progress that has been made, the coming years are promising both in terms of targeted treatments for pain loss disorders and the development of innovative pain medicines based on knowledge of these genetic diseases.
Affiliation(s)
- Annette Lischka
  - Institute of Human Genetics, Medical Faculty, Uniklinik RWTH Aachen University, Aachen, Germany
- Petra Lassuthova
  - Department of Paediatric Neurology, 2nd Faculty of Medicine, Charles University in Prague and Motol University Hospital, Prague, Czech Republic
- Arman Çakar
  - Neuromuscular Unit, Department of Neurology, Istanbul Faculty of Medicine, Istanbul University, Istanbul, Turkey
- Christopher J Record
  - Centre for Neuromuscular Diseases, UCL Queen Square Institute of Neurology, London, UK
- Jonas Van Lent
  - Peripheral Neuropathy Research Group, Department of Biomedical Sciences, University of Antwerp, Antwerp, Belgium
  - Laboratory of Neuromuscular Pathology, Institute Born Bunge, Antwerp, Belgium
- Jonathan Baets
  - Laboratory of Neuromuscular Pathology, Institute Born Bunge, Antwerp, Belgium
  - Translational Neurosciences, Faculty of Medicine and Health Sciences, University of Antwerp, Antwerp, Belgium
  - Neuromuscular Reference Centre, Department of Neurology, Antwerp University Hospital, Antwerp, Belgium
- Maike F Dohrn
  - Department of Neurology, Medical Faculty, Uniklinik RWTH Aachen University, Aachen, Germany
  - Dr. John T. Macdonald Foundation, Department of Human Genetics and John P. Hussman Institute for Human Genomics, University of Miami, Miller School of Medicine, Miami, FL, USA
- Jan Senderek
  - Friedrich-Baur-Institute, Department of Neurology, Ludwig-Maximilians-University, Munich, Germany
- Angelika Lampert
  - Institute of Physiology, Medical Faculty, Uniklinik RWTH Aachen University, Aachen, Germany
- David L Bennett
  - Nuffield Department of Clinical Neuroscience, Oxford University, Oxford, UK
- John N Wood
  - Molecular Nociception Group, Wolfson Institute for Biomedical Research, University College London, London, UK
- Vincent Timmerman
  - Peripheral Neuropathy Research Group, Department of Biomedical Sciences, University of Antwerp, Antwerp, Belgium
  - Laboratory of Neuromuscular Pathology, Institute Born Bunge, Antwerp, Belgium
- Thorsten Hornemann
  - Department of Clinical Chemistry, University Hospital Zurich, University of Zurich, Zurich, Switzerland
- Michaela Auer-Grumbach
  - Department of Orthopedics and Trauma Surgery, Medical University of Vienna, Vienna, Austria
- Yesim Parman
  - Neuromuscular Unit, Department of Neurology, Istanbul Faculty of Medicine, Istanbul University, Istanbul, Turkey
- Miriam Elbracht
  - Institute of Human Genetics, Medical Faculty, Uniklinik RWTH Aachen University, Aachen, Germany
- Katja Eggermann
  - Institute of Human Genetics, Medical Faculty, Uniklinik RWTH Aachen University, Aachen, Germany
- C Geoffrey Woods
  - Cambridge Institute for Medical Research, Keith Peters Building, Cambridge Biomedical Campus, Cambridge, UK
- James J Cox
  - Molecular Nociception Group, Wolfson Institute for Biomedical Research, University College London, London, UK
- Mary M Reilly
  - Centre for Neuromuscular Diseases, UCL Queen Square Institute of Neurology, London, UK
- Ingo Kurth
  - Institute of Human Genetics, Medical Faculty, Uniklinik RWTH Aachen University, Aachen, Germany
103
Yates T, Lain A, Campbell J, FitzPatrick DR, Simpson TI. Creation and evaluation of full-text literature-derived, feature-weighted disease models of genetically determined developmental disorders. Database (Oxford) 2022; 2022:baac038. [PMID: 35670729 PMCID: PMC9216525 DOI: 10.1093/database/baac038] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2021] [Revised: 03/26/2022] [Accepted: 05/25/2022] [Indexed: 11/24/2022]
Abstract
There are >2500 different genetically determined developmental disorders (DD), which, as a group, show very high levels of both locus and allelic heterogeneity. This has led to the wide-spread use of evidence-based filtering of genome-wide sequence data as a diagnostic tool in DD. Determining whether the association of a filtered variant at a specific locus is a plausible explanation of the phenotype in the proband is crucial and commonly requires extensive manual literature review by both clinical scientists and clinicians. Access to a database of weighted clinical features extracted from rigorously curated literature would increase the efficiency of this process and facilitate the development of robust phenotypic similarity metrics. However, given the large and rapidly increasing volume of published information, conventional biocuration approaches are becoming impractical. Here, we present a scalable, automated method for the extraction of categorical phenotypic descriptors from the full-text literature. Papers identified through literature review were downloaded and parsed using the Cadmus custom retrieval package. Human Phenotype Ontology terms were extracted using MetaMap, with 76-84% precision and 65-73% recall. Mean terms per paper increased from 9 in title + abstract, to 68 using full text. We demonstrate that these literature-derived disease models plausibly reflect true disease expressivity more accurately than widely used manually curated models, through comparison with prospectively gathered data from the Deciphering Developmental Disorders study. The area under the curve for receiver operating characteristic (ROC) curves increased by 5-10% through the use of literature-derived models. This work shows that scalable automated literature curation increases performance and adds weight to the need for this strategy to be integrated into informatic variant analysis pipelines. Database URL: https://doi.org/10.1093/database/baac038.
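The comparison above is reported as a change in area under the ROC curve. As a minimal sketch of what that metric measures, the code below computes AUC as the probability that a randomly chosen positive case scores higher than a randomly chosen negative one, counting ties as half. The function name `roc_auc` and the similarity scores are invented for illustration and are not taken from the study.

```python
def roc_auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    This pairwise formulation is equivalent to the area under the ROC
    curve; ties between a positive and a negative score count as 0.5.
    """
    if not scores_pos or not scores_neg:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical phenotype-similarity scores for true and false gene matches.
true_match_scores = [0.9, 0.8, 0.4]
false_match_scores = [0.7, 0.3, 0.2]
print(round(roc_auc(true_match_scores, false_match_scores), 3))
```

The double loop is quadratic in the number of cases, which is fine for a small illustration; a rank-based computation would be preferable for large cohorts.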
Affiliation(s)
- T M Yates
  - MRC Human Genetics Unit, Western General Hospital, Institute of Genetics and Cancer, The University of Edinburgh, Crewe Road South, Edinburgh EH4 2XU, UK
  - Transforming Genetic Medicine Initiative, European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- A Lain
  - Institute for Adaptive and Neural Computation, Informatics Forum, The University of Edinburgh, 10 Crichton Street, Edinburgh EH8 9AB, UK
- J Campbell
  - MRC Human Genetics Unit, Western General Hospital, Institute of Genetics and Cancer, The University of Edinburgh, Crewe Road South, Edinburgh EH4 2XU, UK
  - Simons Initiative for the Developing Brain, The University of Edinburgh, Hugh Robson Building, George Square, Edinburgh EH8 9XF, UK
- D R FitzPatrick
  - MRC Human Genetics Unit, Western General Hospital, Institute of Genetics and Cancer, The University of Edinburgh, Crewe Road South, Edinburgh EH4 2XU, UK
  - Transforming Genetic Medicine Initiative, European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
  - Simons Initiative for the Developing Brain, The University of Edinburgh, Hugh Robson Building, George Square, Edinburgh EH8 9XF, UK
- T I Simpson
  - Institute for Adaptive and Neural Computation, Informatics Forum, The University of Edinburgh, 10 Crichton Street, Edinburgh EH8 9AB, UK
  - Simons Initiative for the Developing Brain, The University of Edinburgh, Hugh Robson Building, George Square, Edinburgh EH8 9XF, UK
104
Feng B, Gao J. AnthraxKP: a knowledge graph-based, Anthrax Knowledge Portal mined from biomedical literature. Database (Oxford) 2022; 2022:6598946. [PMID: 35653350 PMCID: PMC9216567 DOI: 10.1093/database/baac037] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/17/2022] [Revised: 04/13/2022] [Accepted: 05/13/2022] [Indexed: 11/15/2022]
Abstract
Anthrax is a zoonotic infectious disease caused by Bacillus anthracis (anthrax bacterium) that affects not only domestic and wild animals worldwide but also human health. As research on the disease has deepened, a large number of related biomedical publications have emerged. Acquiring knowledge from the literature is essential for gaining insight into anthrax etiology, diagnosis, treatment and research. In this study, we used a set of text mining tools to identify nearly 14 000 entities of 29 categories, such as genes, diseases, chemicals, species, vaccines and proteins, from nearly 8000 anthrax-related biomedical publications and extracted 281 categories of association relationships among the entities. We curated an Anthrax-related Entities Dictionary and an Anthrax Ontology. We built the Anthrax Knowledge Graph (AnthraxKG), containing more than 6000 nodes, 6000 edges and 32 000 properties. An interactive, visual Anthrax Knowledge Portal (AnthraxKP) was also developed on top of AnthraxKG using Web technology. AnthraxKP provides rich and authentic relevant knowledge in many forms, which can help researchers carry out research more efficiently.
Database URL: AnthraxKP allows users to query and download data at http://139.224.212.120:18095/.
Affiliation(s)
- Baiyang Feng
  - College of Computer and Information Engineering, Inner Mongolia Agricultural University, Erdos East Street No. 29, Hohhot 010011, China
  - Inner Mongolia Autonomous Region Key Laboratory of Big Data Research and Application for Agriculture and Animal Husbandry, Zhaowuda Road No. 306, Hohhot 010018, China
- Jing Gao
  - College of Computer and Information Engineering, Inner Mongolia Agricultural University, Erdos East Street No. 29, Hohhot 010011, China
  - Inner Mongolia Autonomous Region Key Laboratory of Big Data Research and Application for Agriculture and Animal Husbandry, Zhaowuda Road No. 306, Hohhot 010018, China
  - Inner Mongolia Autonomous Region Big Data Center, Chilechuan Street No. 1, Hohhot 010091, China
105
Yang H, Lee N, Park B, Park J, Lee J, Jang HS, Yoo H. Hierarchical network analysis of co-occurring bioentities in literature. Sci Rep 2022; 12:7885. [PMID: 35550589 PMCID: PMC9098521 DOI: 10.1038/s41598-022-12093-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2021] [Accepted: 05/03/2022] [Indexed: 11/09/2022] Open
Abstract
Biomedical databases grow by more than a thousand new publications every day. The large volume of biomedical literature that is being published at an unprecedented rate hinders the discovery of relevant knowledge from keywords of interest to gather new insights and form hypotheses. A text-mining tool, PubTator, helps to automatically annotate bioentities, such as species, chemicals, genes, and diseases, from PubMed abstracts and full-text articles. However, the manual re-organization and analysis of bioentities is a non-trivial and highly time-consuming task. ChexMix was designed to extract the unique identifiers of bioentities from query results. Herein, ChexMix was used to construct a taxonomic tree with allied species among Korean native plants and to extract the medical subject headings unique identifier of the bioentities, which co-occurred with the keywords in the same literature. ChexMix discovered the allied species related to a keyword of interest and experimentally proved its usefulness for multi-species analysis.
Affiliation(s)
- Heejung Yang
  - Department of Pharmacy, Kangwon National University, Chuncheon, 24341, Republic of Korea
  - Bionsight, Inc., Chuncheon, 24341, Republic of Korea
- Namgil Lee
  - Bionsight, Inc., Chuncheon, 24341, Republic of Korea
  - Department of Information Statistics, Kangwon National University, Gangwondaehak-gil 1, Chuncheon, Gangwon, 24341, Republic of Korea
- Beomjun Park
  - Bionsight, Inc., Chuncheon, 24341, Republic of Korea
- Jinyoung Park
  - Department of Pharmacy, Kangwon National University, Chuncheon, 24341, Republic of Korea
- Jiho Lee
  - Department of Pharmacy, Kangwon National University, Chuncheon, 24341, Republic of Korea
- Hyeon Seok Jang
  - Department of Pharmacy, Kangwon National University, Chuncheon, 24341, Republic of Korea
- Hojin Yoo
  - Bionsight, Inc., Chuncheon, 24341, Republic of Korea
106
Gyori BM, Hoyt CT, Steppi A. Gilda: biomedical entity text normalization with machine-learned disambiguation as a service. Bioinformatics Advances 2022; 2:vbac034. [PMID: 36699362 PMCID: PMC9710686 DOI: 10.1093/bioadv/vbac034] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 10/11/2021] [Revised: 04/27/2022] [Accepted: 05/06/2022] [Indexed: 01/28/2023]
Abstract
Summary:
Gilda is a software tool and web service that implements a scored string matching algorithm for names and synonyms across entries in biomedical ontologies covering genes, proteins (and their families and complexes), small molecules, biological processes and diseases. Gilda integrates machine-learned disambiguation models to choose between ambiguous strings given relevant surrounding text as context, and supports species-prioritization in case of ambiguity.
Availability and implementation:
The Gilda web service is available at http://grounding.indra.bio with source code, documentation and tutorials available via https://github.com/indralab/gilda.
Supplementary information:
Supplementary data are available at Bioinformatics Advances online.
Affiliation(s)
- Benjamin M Gyori
  - Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA 02115, USA
- Charles Tapley Hoyt
  - Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA 02115, USA
- Albert Steppi
  - Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA 02115, USA
107
Tamanai-Shacoori Z, Le Gall-David S, Moussouni F, Sweidan A, Polard E, Bousarghin L, Jolivet-Gougeon A. SARS-CoV-2 and Prevotella spp.: friend or foe? A systematic literature review. J Med Microbiol 2022; 71. [PMID: 35511246 DOI: 10.1099/jmm.0.001520] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 01/08/2023] Open
Abstract
During the global COVID-19 pandemic, a lot of information has arisen in the media and online without scientific validation. Among these claims are the possibility that the disease could be aggravated by a secondary bacterial infection such as Prevotella, and the question of whether or not to use azithromycin, a potentially active antimicrobial agent. The aim of this study was to carry out a systematic literature review to prove or disprove these allegations by scientific arguments. The search included Medline, PubMed, and Pubtator Central databases for English-language articles published 1999-2021. After removing duplicates, a final total of 149 eligible studies were selected. There were more articles showing an increase of Prevotella abundance in the presence of viral infection, such as that related to Human Immunodeficiency Virus (HIV), Papillomavirus (HPV), Herpesviridae and respiratory viruses, highlighting differences according to methodologies and patient groups. The arguments for or against the use of azithromycin are stated in light of the results of the literature, showing the role of intercurrent factors, such as age, drug consumption, the presence of cancer or periodontal diseases. However, clinical trials are lacking to prove the direct link between the presence of Prevotella spp. and a worsening of COVID-19, mainly those using azithromycin alone in this indication.
Affiliation(s)
- Zohreh Tamanai-Shacoori
  - Univ Rennes, INSERM, INRAE, CHU Rennes, Institut NUMECAN (Nutrition Metabolisms and Cancer), F-35000 Rennes, France
- Sandrine Le Gall-David
  - Univ Rennes, INSERM, INRAE, CHU Rennes, Institut NUMECAN (Nutrition Metabolisms and Cancer), F-35000 Rennes, France
- Fouzia Moussouni
  - Univ Rennes, INSERM, INRAE, CHU Rennes, Institut NUMECAN (Nutrition Metabolisms and Cancer), F-35000 Rennes, France
- Alaa Sweidan
  - Laboratory of Microbiology, Department of Life and Earth Sciences, Faculty of Sciences, Lebanese University, Hadath Campus, Beirut, Lebanon
- Elisabeth Polard
  - Teaching Hospital Rennes, Service de Pharmacovigilance, F-35033 Rennes, France
- Latifa Bousarghin
  - Univ Rennes, INSERM, INRAE, CHU Rennes, Institut NUMECAN (Nutrition Metabolisms and Cancer), F-35000 Rennes, France
- Anne Jolivet-Gougeon
  - Univ Rennes, INSERM, INRAE, CHU Rennes, Institut NUMECAN (Nutrition Metabolisms and Cancer), F-35000 Rennes, France
108
Gunturkun MH, Flashner E, Wang T, Mulligan MK, Williams RW, Prins P, Chen H. GeneCup: mining PubMed and GWAS catalog for gene-keyword relationships. G3 (Bethesda, Md.) 2022; 12:jkac059. [PMID: 35285473 PMCID: PMC9073678 DOI: 10.1093/g3journal/jkac059] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 01/18/2022] [Accepted: 03/04/2022] [Indexed: 11/13/2022]
Abstract
Interpreting and integrating results from omics studies typically requires a comprehensive and time-consuming survey of extant literature. GeneCup is a literature mining web service that retrieves sentences containing user-provided gene symbols and keywords from PubMed abstracts. The keywords are organized into an ontology and can be extended to include results from human genome-wide association studies. We provide a drug addiction keyword ontology that contains over 300 keywords as an example. The literature search is conducted by querying the PubMed server using a programming interface, which is followed by retrieving abstracts from a local copy of the PubMed archive. The main results presented to the user are sentences where gene symbol and keywords co-occur. These sentences are presented through an interactive graphical interface or as tables. All results are linked to the original abstract in PubMed. In addition, a convolutional neural network is employed to distinguish sentences describing systemic stress from those describing cellular stress. The automated and comprehensive search strategy provided by GeneCup facilitates the integration of new discoveries from omic studies with existing literature. GeneCup is free and open source software. The source code of GeneCup and the link to a running instance are available at https://github.com/hakangunturkun/GeneCup.
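The core idea described above, returning sentences in which a gene symbol and a keyword co-occur, can be sketched in a few lines. The code below is an illustrative approximation under stated assumptions and is not GeneCup's implementation: the function name `cooccurring_sentences`, the naive punctuation-based sentence splitter and the example text are all hypothetical.

```python
import re


def cooccurring_sentences(abstract, gene, keywords):
    """Return (sentence, matched_keywords) pairs for sentences that mention
    the gene symbol and at least one keyword.

    The gene symbol is matched case-sensitively as a whole word; keywords
    are matched case-insensitively as whole words.
    """
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", abstract.strip())
    gene_pattern = re.compile(r"\b" + re.escape(gene) + r"\b")
    hits = []
    for sentence in sentences:
        if not gene_pattern.search(sentence):
            continue
        lowered = sentence.lower()
        found = [
            kw for kw in keywords
            if re.search(r"\b" + re.escape(kw.lower()) + r"\b", lowered)
        ]
        if found:
            hits.append((sentence, found))
    return hits


# Hypothetical abstract text, used only to exercise the function.
text = (
    "Drd2 expression was altered after chronic cocaine exposure. "
    "No change was observed in controls. "
    "Stress increased Drd2 binding in the striatum."
)
hits = cooccurring_sentences(text, "Drd2", ["cocaine", "stress"])
for sentence, found in hits:
    print(found, "|", sentence)
```

A production system would need a proper sentence tokenizer and gene synonym handling; the sketch only shows the co-occurrence filter itself.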
Affiliation(s)
- Mustafa H Gunturkun
  - Department of Pharmacology, Addiction Science and Toxicology, University of Tennessee Health Science, Memphis, TN 38103, USA
- Efraim Flashner
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science, Memphis, TN 38103, USA
- Tengfei Wang
  - Department of Pharmacology, Addiction Science and Toxicology, University of Tennessee Health Science, Memphis, TN 38103, USA
- Megan K Mulligan
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science, Memphis, TN 38103, USA
- Robert W Williams
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science, Memphis, TN 38103, USA
- Pjotr Prins
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science, Memphis, TN 38103, USA
- Hao Chen
  - Department of Pharmacology, Addiction Science and Toxicology, University of Tennessee Health Science, Memphis, TN 38103, USA
109
Di Maria A, Alaimo S, Bellomo L, Billeci F, Ferragina P, Ferro A, Pulvirenti A. BioTAGME: A Comprehensive Platform for Biological Knowledge Network Analysis. Front Genet 2022; 13:855739. [PMID: 35571058 PMCID: PMC9096447 DOI: 10.3389/fgene.2022.855739] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/15/2022] [Accepted: 03/24/2022] [Indexed: 02/02/2023] Open
Abstract
The inference of novel knowledge and new hypotheses from analysis of the current literature is crucial in making new scientific discoveries. In bio-medicine, given the enormous amount of literature and knowledge bases available, the automatic gain of knowledge concerning relationships among biological elements, in the form of semantically related terms (or entities), is raising novel research challenges and corresponding applications. In this regard, we propose BioTAGME, a system that combines an entity-annotation framework based on Wikipedia corpus (i.e., TAGME tool) with a network-based inference methodology (i.e., DT-Hybrid). This integration aims to create an extensive Knowledge Graph modeling relations among biological terms and phrases extracted from titles and abstracts of papers available in PubMed. The framework consists of a back-end and a front-end. The back-end is entirely implemented in Scala and runs on top of a Spark cluster that distributes the computing effort among several machines. The front-end is released through the Laravel framework, connected with the Neo4j graph database to store the knowledge graph.
Affiliation(s)
- Antonio Di Maria
  - Department of Clinical and Experimental Medicine, University of Catania, Catania, Italy
- Salvatore Alaimo
  - Department of Clinical and Experimental Medicine, University of Catania, Catania, Italy
- Fabrizio Billeci
  - Department of Maths and Computer Science, University of Catania, Catania, Italy
- Paolo Ferragina
  - Department of Computer Science, University of Pisa, Pisa, Italy
- Alfredo Ferro
  - Department of Clinical and Experimental Medicine, University of Catania, Catania, Italy
- Alfredo Pulvirenti
  - Department of Clinical and Experimental Medicine, University of Catania, Catania, Italy
  - *Correspondence: Alfredo Pulvirenti,
110
Zhu X, Gu Y, Xiao Z. HerbKG: Constructing a Herbal-Molecular Medicine Knowledge Graph Using a Two-Stage Framework Based on Deep Transfer Learning. Front Genet 2022; 13:799349. [PMID: 35571049 PMCID: PMC9091197 DOI: 10.3389/fgene.2022.799349] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/21/2021] [Accepted: 04/05/2022] [Indexed: 11/13/2022] Open
Abstract
Recent advances have witnessed a growth of herbalism studies adopting a modern scientific approach in molecular medicine, offering valuable domain knowledge that can potentially boost the development of herbalism with evidence-supported efficacy and safety. However, these domain-specific scientific findings have not been systematically organized, affecting the efficiency of knowledge discovery and usage. Existing knowledge graphs in herbalism mainly focus on diagnosis and treatment with an absence of knowledge connection with molecular medicine. To fill this gap, we present HerbKG, a knowledge graph that bridges herbal and molecular medicine. The core bio-entities of HerbKG include herbs, chemicals extracted from the herbs, genes that are affected by the chemicals, and diseases treated by herbs due to the functions of genes. We have developed a learning framework to automate the process of HerbKG construction. The resulting HerbKG, after analyzing over 500K PubMed abstracts, is populated with 53K relations, providing extensive herbal-molecular domain knowledge in support of downstream applications. The code and an interactive tool are available at https://github.com/FeiYee/HerbKG.
Affiliation(s)
- Xian Zhu
  - School of Information Management, Nanjing University, Nanjing, China
  - School of Health Economics and Management, Nanjing University of Chinese Medicine, Nanjing, China
- Yueming Gu
  - School of Computing and Information Systems, Faculty of Engineering and Information Technology, University of Melbourne, Parkville, VIC, Australia
- Zhifeng Xiao
  - School of Engineering, Penn State Erie, The Behrend College, Erie, PA, United States
111
Yue Z, Slominski R, Bharti S, Chen JY. PAGER Web APP: An Interactive, Online Gene Set and Network Interpretation Tool for Functional Genomics. Front Genet 2022; 13:820361. [PMID: 35495152 PMCID: PMC9039620 DOI: 10.3389/fgene.2022.820361] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/22/2021] [Accepted: 03/17/2022] [Indexed: 12/30/2022] Open
Abstract
Functional genomics studies have helped researchers annotate differentially expressed gene lists, extract gene expression signatures, and identify biological pathways from omics profiling experiments conducted on biological samples. The current geneset, network, and pathway analysis (GNPA) web servers, e.g., DAVID, EnrichR, WebGestaltR, or PAGER, do not allow automated integrative functional genomic downstream analysis. In this study, we developed a new web-based interactive application, “PAGER Web APP”, which supports online R scripting of integrative GNPA. In a case study of melanoma drug resistance, we showed that the new PAGER Web APP enabled us to discover highly relevant pathways and network modules, leading to novel biological insights. We also compared PAGER Web APP’s pathway analysis results retrieved among PAGER, EnrichR, and WebGestaltR to show its advantages in integrative GNPA. The interactive online web APP is publicly accessible from the link, https://aimed-lab.shinyapps.io/PAGERwebapp/.
Affiliation(s)
- Zongliang Yue
  - Informatics Institute in the School of Medicine, The University of Alabama at Birmingham, Birmingham, AL, United States
- Radomir Slominski
  - Informatics Institute in the School of Medicine, The University of Alabama at Birmingham, Birmingham, AL, United States
  - Graduate Biomedical Sciences Program, The University of Alabama at Birmingham, Birmingham, AL, United States
- Samuel Bharti
  - Informatics Institute in the School of Medicine, The University of Alabama at Birmingham, Birmingham, AL, United States
- Jake Y. Chen
  - Informatics Institute in the School of Medicine, The University of Alabama at Birmingham, Birmingham, AL, United States
  - *Correspondence: Jake Y. Chen,
112
Tomar S, Klinzing DC, Chen CK, Gan LH, Moscarello T, Reuter C, Ashley EA, Foo R. Causative Variants for Inherited Cardiac Conditions in a Southeast Asian Population Cohort. Circ Genom Precis Med 2022; 15:e003536. [DOI: 10.1161/circgen.121.003536] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
Background:
Variable penetrance and late-onset phenotypes are key challenges for classifying causal as well as incidental findings in inherited cardiac conditions. Allele frequencies of variants in ancestry-specific populations, along with clinical variant analysis and interpretation, are critical to determine their true significance.
Methods:
Here, we carefully reviewed and classified variants in genes associated with inherited cardiac conditions based on a population whole-genome sequencing cohort of 4810 Singaporeans representing Southeast Asian ancestries.
Results:
Eighty-nine (1.85%) individuals carried either pathogenic or likely pathogenic variants across 25 genes. Forty-six (51.7%) had variants in causal genes for familial hyperlipidemia, but there were also recurrent variants in SCN5A and MYBPC3, causal genes for inherited arrhythmia and cardiomyopathy, which, despite previous reports, we determined to lack criteria for pathogenicity.
Conclusions:
Our findings highlight the incidence of disease-related variants in inherited cardiac conditions and emphasize the value of large-scale sequencing in specific ancestries. Follow-up detailed phenotyping and analysis of pedigrees are crucial because assigning pathogenicity will significantly affect clinical management for individuals and their family members.
Affiliation(s)
- Swati Tomar
- Cardiovascular Disease Translational Research Programme, Yong Loo Lin School of Medicine, National University Singapore (S.T., D.C.K., C.K.C., L.H.G., R.F.)
- Cardiovascular Research Institute, National University Heart Centre (S.T., D.C.K., C.K.C., L.H.G., R.F.), National University Health System, Singapore
| | - David C. Klinzing
- Cardiovascular Disease Translational Research Programme, Yong Loo Lin School of Medicine, National University Singapore (S.T., D.C.K., C.K.C., L.H.G., R.F.)
- Cardiovascular Research Institute, National University Heart Centre (S.T., D.C.K., C.K.C., L.H.G., R.F.), National University Health System, Singapore
- Khoo Teck Puat National University Children’s Medical Institute (C.K.C.), National University Health System, Singapore
- Department of Pediatrics, Yong Loo Lin School of Medicine, National University Singapore, Singapore (C.K.C.)
| | - Ching Kit Chen
- Cardiovascular Disease Translational Research Programme, Yong Loo Lin School of Medicine, National University Singapore (S.T., D.C.K., C.K.C., L.H.G., R.F.)
- Cardiovascular Research Institute, National University Heart Centre (S.T., D.C.K., C.K.C., L.H.G., R.F.), National University Health System, Singapore
| | - Louis Hanqiang Gan
- Cardiovascular Disease Translational Research Programme, Yong Loo Lin School of Medicine, National University Singapore (S.T., D.C.K., C.K.C., L.H.G., R.F.)
- Cardiovascular Research Institute, National University Heart Centre (S.T., D.C.K., C.K.C., L.H.G., R.F.), National University Health System, Singapore
| | - Tia Moscarello
- Centre for Inherited Cardiovascular Disease, Stanford University Medical Center, CA (T.M., C.R., E.A.A.)
| | - Chloe Reuter
- Centre for Inherited Cardiovascular Disease, Stanford University Medical Center, CA (T.M., C.R., E.A.A.)
| | - Euan A. Ashley
- Centre for Inherited Cardiovascular Disease, Stanford University Medical Center, CA (T.M., C.R., E.A.A.)
| | - Roger Foo
- Cardiovascular Disease Translational Research Programme, Yong Loo Lin School of Medicine, National University Singapore (S.T., D.C.K., C.K.C., L.H.G., R.F.)
- Cardiovascular Research Institute, National University Heart Centre (S.T., D.C.K., C.K.C., L.H.G., R.F.), National University Health System, Singapore
- Genome Institute of Singapore (R.F.)
|
113
|
He H, Fu S, Wang L, Liu S, Wen A, Liu H. MedTator: a serverless annotation tool for corpus development. Bioinformatics 2022; 38:1776-1778. [PMID: 34983060 DOI: 10.1093/bioinformatics/btab880] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/07/2021] [Revised: 12/23/2021] [Accepted: 12/31/2021] [Indexed: 02/05/2023] Open
Abstract
Summary:
Building a high-quality annotation corpus requires expenditure of considerable time and expertise, particularly for biomedical and clinical research applications. Most existing annotation tools provide many advanced features to cover a variety of needs, but their installation, integration and difficulty of use present a significant burden for actual annotation tasks. Here, we present MedTator, a serverless annotation tool, aiming to provide an intuitive and interactive user interface that focuses on the core steps related to corpus annotation, such as document annotation, corpus summarization, annotation export and annotation adjudication.
Availability and implementation:
MedTator and its tutorial are freely available from https://ohnlp.github.io/MedTator. MedTator source code is available under the Apache 2.0 license: https://github.com/OHNLP/MedTator.
Supplementary information:
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Huan He
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN 55901, USA
| | - Sunyang Fu
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN 55901, USA
| | - Liwei Wang
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN 55901, USA
| | - Sijia Liu
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN 55901, USA
| | - Andrew Wen
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN 55901, USA
| | - Hongfang Liu
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN 55901, USA
|
114
|
Choi H, Lee K, Kim D, Kim S, Lee JH. The implication of holocytochrome c synthase mutation in Korean familial hypoplastic amelogenesis imperfecta. Clin Oral Investig 2022; 26:4487-4498. [PMID: 35243551 PMCID: PMC9203382 DOI: 10.1007/s00784-022-04413-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/17/2021] [Accepted: 02/15/2022] [Indexed: 11/18/2022]
Abstract
Objectives:
This study aimed to comprehensively characterise genetic variants of amelogenesis imperfecta in a single Korean family through whole-exome sequencing and bioinformatics analysis.
Material and methods:
Thirty-one individuals of a Korean family, 9 of whom were affected and 22 unaffected by amelogenesis imperfecta, were enrolled. Whole-exome sequencing was performed on 12 saliva samples, including samples from 8 affected and 4 unaffected individuals. The possible candidate genes associated with the disease were screened by segregation analysis and variant filtering. In silico mutation impact analysis was then performed on the filtered variants based on sequence conservation and protein structure.
Results:
Whole-exome sequencing data revealed an X-linked dominant, heterozygous genomic missense mutation in the mitochondrial gene holocytochrome c synthase (HCCS). We also found that HCCS is potentially related to the role of mitochondria in amelogenesis. The HCCS variant was expected to be deleterious in both evolution-based and large population-based analyses. Further, the variant was predicted to have a negative effect on catalytic function of HCCS by in silico analysis of protein structure. In addition, HCCS had significant association with amelogenesis in literature mining analysis.
Conclusions:
These findings suggest new evidence for the relationship between amelogenesis and mitochondria function, which could be implicated in the pathogenesis of amelogenesis imperfecta.
Clinical relevance:
The discovery of HCCS mutations and a deeper understanding of the pathogenesis of amelogenesis imperfecta could lead to finding solutions for the fundamental treatment of this disease. Furthermore, it enables dental practitioners to establish predictable prosthetic treatment plans at an early stage by early detection of amelogenesis imperfecta through personalised medicine.
Supplementary Information:
The online version contains supplementary material available at 10.1007/s00784-022-04413-0.
Affiliation(s)
- Hyejin Choi
- Department of Prosthodontics, College of Dentistry at Yonsei University, 50-1 Yonsei-ro, Seodaemoon-gu, Seoul, 120-752, Republic of Korea
| | - Kwanghwan Lee
- Department of Life Sciences, Pohang University of Science and Technology, Pohang, 790-784, Republic of Korea
| | - Donghyo Kim
- Department of Life Sciences, Pohang University of Science and Technology, Pohang, 790-784, Republic of Korea
| | - Sanguk Kim
- Department of Life Sciences, Pohang University of Science and Technology, Pohang, 790-784, Republic of Korea.
| | - Jae Hoon Lee
- Department of Prosthodontics, College of Dentistry at Yonsei University, 50-1 Yonsei-ro, Seodaemoon-gu, Seoul, 120-752, Republic of Korea.
|
115
|
Systematic illumination of druggable genes in cancer genomes. Cell Rep 2022; 38:110400. [PMID: 35196490 PMCID: PMC8919705 DOI: 10.1016/j.celrep.2022.110400] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 01/24/2021] [Revised: 09/12/2021] [Accepted: 01/26/2022] [Indexed: 01/15/2023] Open
Abstract
By combining 6 druggable genome resources, we identify 6,083 genes as potential druggable genes (PDGs). We characterize their expression, recurrent genomic alterations, cancer dependencies, and therapeutic potentials by integrating genome, functionome, and druggome profiles across cancers. 81.5% of PDGs are reliably expressed in major adult cancers, 46.9% show selective expression patterns, and 39.1% exhibit at least one recurrent genomic alteration. We annotate a total of 784 PDGs as dependent genes for cancer cell growth. We further quantify 16 cancer-related features and estimate a PDG cancer drug target score (PCDT score). PDGs with higher PCDT scores are significantly enriched for genes encoding kinases and histone modification enzymes. Importantly, we find that a considerable portion of high PCDT score PDGs are understudied genes, providing unexplored opportunities for drug development in oncology. By integrating the druggable genome and the cancer genome, our study thus generates a comprehensive blueprint of potential druggable genes across cancers. Jiang et al. generate a comprehensive blueprint of potential druggable genes (PDGs) across cancers by a systematic integration of the druggable genome and the cancer genome. This resource is publicly available to the cancer research community in The Cancer Druggable Gene Atlas (TCDA) through the Functional Cancer Genome data portal.
|
116
|
Foote SL, Jones S, Lockmuller J, Brown L, Breen J, Gururaj A. Parsing Immune Correlates of Protection Against SARS-CoV-2 from Biomedical Literature. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2022; 2021:466-475. [PMID: 35308924 PMCID: PMC8861695] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
After the emergence of severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) in 2019, identification of immune correlates of protection (CoPs) has become increasingly important to understand the immune response to SARS-CoV-2. The vast amount of preprint and published literature related to COVID-19 makes it challenging for researchers to stay up to date on research results regarding CoPs against SARS-CoV-2. To address this problem, we developed a machine learning classifier to identify papers relevant to CoPs and a customized named entity recognition (NER) model to extract terms of interest, including CoPs, vaccines, assays, and animal models. A user-friendly visualization tool was populated with the extracted and normalized NER results and associated publication information, including links to full-text articles and clinical trial information where available. The goal of this pilot project is to provide a basis for developing real-time informatics platforms that can inform researchers with scientific insights from emerging research.
Affiliation(s)
- Sydney L Foote
- Office of Data Science and Emerging Technologies, NIAID, NIH, Rockville, MD, USA
- Both authors contributed to the work equally
| | - Sara Jones
- Office of Data Science and Emerging Technologies, NIAID, NIH, Rockville, MD, USA
- Both authors contributed to the work equally
| | - Jane Lockmuller
- Office of Data Science and Emerging Technologies, NIAID, NIH, Rockville, MD, USA
| | - Liliana Brown
- Division of Microbiology and Infectious Diseases, NIAID, NIH, Rockville, MD, USA
| | - Joseph Breen
- Division of Allergy, Immunology, and Transplantation, NIAID, NIH, Rockville, MD, USA
| | - Anupama Gururaj
- Division of Allergy, Immunology, and Transplantation, NIAID, NIH, Rockville, MD, USA
|
117
|
Nicholson DN, Rubinetti V, Hu D, Thielk M, Hunter LE, Greene CS. Examining linguistic shifts between preprints and publications. PLoS Biol 2022; 20:e3001470. [PMID: 35104289 PMCID: PMC8806061 DOI: 10.1371/journal.pbio.3001470] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 05/12/2021] [Accepted: 11/05/2021] [Indexed: 11/19/2022] Open
Abstract
Preprints allow researchers to make their findings available to the scientific community before they have undergone peer review. Studies on preprints within bioRxiv have been largely focused on article metadata and how often these preprints are downloaded, cited, published, and discussed online. A missing element that has yet to be examined is the language contained within the bioRxiv preprint repository. We sought to compare and contrast linguistic features within bioRxiv preprints with those of published biomedical text as a whole, as this is an excellent opportunity to examine how peer review changes these documents. The most prevalent features that changed appear to be associated with typesetting and mentions of supporting information sections or additional files. In addition to text comparison, we created document embeddings derived from a preprint-trained word2vec model. We found that these embeddings are able to parse out different scientific approaches and concepts, link unannotated preprint-peer-reviewed article pairs, and identify journals that publish linguistically similar papers to a given preprint. We also used these embeddings to examine factors associated with the time elapsed between the posting of a first preprint and the appearance of a peer-reviewed publication. We found that preprints with more versions posted and more textual changes took longer to publish. Lastly, we constructed a web application (https://greenelab.github.io/preprint-similarity-search/) that allows users to identify which journals and articles are most linguistically similar to a bioRxiv or medRxiv preprint, as well as observe where the preprint would be positioned within a published article landscape.
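As a rough illustration of the document-embedding idea in this abstract, the sketch below averages word vectors into a document vector and compares documents by cosine similarity. The averaging step, the function names, and the tiny two-dimensional vectors are assumptions made for illustration only; the abstract states just that the embeddings are derived from a preprint-trained word2vec model.

```python
import math

# Hypothetical 2-dimensional word vectors; a real word2vec model would
# supply vectors with hundreds of dimensions.
WORD_VECTORS = {
    "gene": [1.0, 0.0],
    "protein": [0.8, 0.2],
    "galaxy": [0.0, 1.0],
}


def document_embedding(tokens, word_vectors):
    """Average the vectors of all known tokens (assumed pooling strategy)."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        raise ValueError("document contains no known tokens")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


biology_doc = document_embedding(["gene", "protein"], WORD_VECTORS)
astronomy_doc = document_embedding(["galaxy"], WORD_VECTORS)
similarity = cosine_similarity(biology_doc, astronomy_doc)
```

Under this assumed scheme, ranking journals or articles by cosine similarity to a preprint's vector is one way a similarity search of the kind described could be approximated.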
Affiliation(s)
- David N. Nicholson
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
| | - Vincent Rubinetti
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
- Center for Health AI, University of Colorado School of Medicine, Aurora, Colorado, United States of America
| | - Dongbo Hu
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
| | - Marvin Thielk
- Elsevier, Philadelphia, Pennsylvania, United States of America
| | - Lawrence E. Hunter
- Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, United States of America
| | - Casey S. Greene
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
- Center for Health AI, University of Colorado School of Medicine, Aurora, Colorado, United States of America
- Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, Colorado, United States of America
|
118
|
Geronikolou SA, Takan I, Pavlopoulou A, Mantzourani M, Chrousos GP. Thrombocytopenia in COVID‑19 and vaccine‑induced thrombotic thrombocytopenia. Int J Mol Med 2022; 49:35. [PMID: 35059730 PMCID: PMC8815408 DOI: 10.3892/ijmm.2022.5090] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/21/2021] [Accepted: 12/28/2021] [Indexed: 12/16/2022] Open
Abstract
The highly heterogeneous symptomatology and unpredictable progress of COVID-19 triggered unprecedented intensive biomedical research and a number of clinical research projects. Although the pathophysiology of the disease is being progressively clarified, its complexity remains vast. Moreover, some extremely infrequent cases of thrombotic thrombocytopenia following vaccination against SARS-CoV-2 infection have been observed. The present study aimed to map the signaling pathways of thrombocytopenia implicated in COVID-19, as well as in vaccine-induced thrombotic thrombocytopenia (VITT). The biomedical literature database, MEDLINE/PubMed, was thoroughly searched using artificial intelligence techniques for the semantic relations among the top 50 similar words (>0.9) implicated in COVID-19-mediated human infection or VITT. Additionally, STRING, a database of primary and predicted associations among genes and proteins (collected from diverse resources, such as documented pathway knowledge, high-throughput experimental studies, cross-species extrapolated information, automated text mining results, computationally predicted interactions, etc.), was employed, with the confidence threshold set at 0.7. In addition, two interactomes were constructed: i) A network including 119 and 56 nodes relevant to COVID-19 and thrombocytopenia, respectively; and ii) a second network containing 60 nodes relevant to VITT. Although thrombocytopenia is a dominant morbidity in both entities, three nodes were observed that corresponded to genes (AURKA, CD46 and CD19) expressed only in VITT, whilst ADAM10, CDC20, SHC1 and STXBP2 are silenced in VITT, but are commonly expressed in both COVID-19 and thrombocytopenia. The calculated average node degree was immense (11.9 in COVID-19 and 6.43 in VITT), illustrating the complexity of COVID-19 and VITT pathologies and confirming the importance of cytokines, as well as of pathways activated following hypoxic events. 
In addition, PYCARD, NLP3 and P2RX7 are key potential therapeutic targets for all three morbid entities, meriting further research. This interactome was based on wild-type genes, revealing the predisposition of the body to hypoxia-induced thrombosis, leading to the acute COVID-19 phenotype, the 'long-COVID syndrome', and/or VITT. Thus, common nodes appear to be key players in illness prevention, progression and treatment.
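The average node degree quoted in this abstract is, under the usual definition for an undirected network, twice the number of edges divided by the number of nodes. A minimal sketch, assuming a hypothetical edge list with STRING-style confidence scores and placeholder node names (none of these values come from the study):

```python
def average_node_degree(edges, nodes, min_confidence=0.7):
    """Average degree of an undirected network after confidence filtering.

    edges: iterable of (node_a, node_b, confidence) tuples.
    nodes: collection of all node names, including isolated ones.
    """
    kept = [(a, b) for a, b, score in edges if score >= min_confidence]
    # Each retained edge contributes one to the degree of both endpoints.
    return 2 * len(kept) / len(nodes)


# Placeholder network: four nodes, four candidate interactions.
toy_nodes = ["G1", "G2", "G3", "G4"]
toy_edges = [
    ("G1", "G2", 0.90),
    ("G2", "G3", 0.75),
    ("G3", "G4", 0.40),  # dropped: below the 0.7 threshold
    ("G1", "G3", 0.80),
]

degree = average_node_degree(toy_edges, toy_nodes)
```

Raising the threshold removes low-confidence edges and therefore lowers the average degree, which is why the cut-off matters when comparing networks.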
Affiliation(s)
- Styliani A Geronikolou
- Clinical, Translational and Experimental Surgery Research Centre, Biomedical Research Foundation Academy of Athens, 11527 Athens, Greece
| | - Işil Takan
- Izmir Biomedicine and Genome Center (IBG), 35340 Izmir, Turkey
| | | | - Marina Mantzourani
- First Department of Internal Medicine, Laiko Hospital, National and Kapodistrian University of Athens Medical School, 11527 Athens, Greece
| | - George P Chrousos
- Clinical, Translational and Experimental Surgery Research Centre, Biomedical Research Foundation Academy of Athens, 11527 Athens, Greece
|
119
|
Dholakia D, Kalra A, Misir BR, Kanga U, Mukerji M. HLA-SPREAD: a natural language processing based resource for curating HLA association from PubMed abstracts. BMC Genomics 2022; 23:10. [PMID: 34991484 PMCID: PMC8740486 DOI: 10.1186/s12864-021-08239-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/07/2021] [Accepted: 12/07/2021] [Indexed: 11/16/2022] Open
Abstract
Extreme complexity in the Human Leukocyte Antigens (HLA) system and its nomenclature makes it difficult to interpret and integrate relevant information for HLA associations with diseases, Adverse Drug Reactions (ADR) and Transplantation. A PubMed search displays ~146,000 studies on HLA reported from diverse locations. Currently, the IPD-IMGT/HLA database (Robinson et al., Nucleic Acids Research 48:D948-D955, 2019) houses data on 28,320 HLA alleles. We developed an automated pipeline with a unified graphical user interface, HLA-SPREAD, that provides structured information on SNPs, Populations, REsources, ADRs and Diseases. Information on HLA was extracted from ~28 million PubMed abstracts using Natural Language Processing (NLP). Python scripts were used to mine and curate information on diseases, filter false positives and categorize them into 24 tree hierarchical groups, and Named Entity Recognition (NER) algorithms followed by semantic analysis were used to infer HLA association(s). This resource from 109 countries and 40 ethnic groups provides interesting insights on markers associated with allelic/haplotypic association in autoimmune, cancer, viral and skin diseases, transplantation outcome and ADRs for hypersensitivity. Summary information on clinically relevant biomarkers related to HLA disease associations with mapped susceptible/risk alleles is readily retrievable from HLA-SPREAD. The resource is available at http://hla-spread.igib.res.in/. This resource is the first of its kind and can help uncover novel patterns in HLA gene-disease associations.
Affiliation(s)
- Dhwani Dholakia
- Institute of Genomics and Integrative Biology-Council of Scientific and Industrial Research, New Delhi, 110025, India.
- Academy of Scientific and Innovative Research, Ghaziabad, 201002, India.
| | - Ankit Kalra
- Netaji Subhas University of Technology, New Delhi, 110078, India
| | - Bishnu Raman Misir
- Centre of Excellence for Applied Development of Ayurveda, Prakriti and Genomics, CSIR- IGIB, Delhi, 110007, India
| | - Uma Kanga
- All India Institute of Medical Sciences, New Delhi, 110029, India
| | - Mitali Mukerji
- Institute of Genomics and Integrative Biology-Council of Scientific and Industrial Research, New Delhi, 110025, India.
- Centre of Excellence for Applied Development of Ayurveda, Prakriti and Genomics, CSIR- IGIB, Delhi, 110007, India.
- Present Address: Department of Bioscience and Bioengineering, Indian Institute of Technology, Jodhpur, Rajasthan, 342037, India.
|
120
|
Church K, Liu B. Acronyms and Opportunities for Improving Deep Nets. Front Artif Intell 2022; 4:732381. [PMID: 34988434 PMCID: PMC8721666 DOI: 10.3389/frai.2021.732381] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/29/2021] [Accepted: 10/21/2021] [Indexed: 11/13/2022] Open
Abstract
Recently, several studies have reported promising results with BERT-like methods on acronym tasks. In this study, we find an older rule-based program, Ab3P, not only performs better, but error analysis suggests why. There is a well-known spelling convention in acronyms where each letter in the short form (SF) refers to “salient” letters in the long form (LF). The error analysis uses decision trees and logistic regression to show that there is an opportunity for many pre-trained models (BERT, T5, BioBert, BART, ERNIE) to take advantage of this spelling convention.
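The spelling convention mentioned in this abstract can be illustrated with a simple check that the characters of the short form occur, in order, within the long form. This is only a sketch of the convention under that simplifying assumption; it is not the Ab3P algorithm, and the function name is hypothetical.

```python
def follows_spelling_convention(short_form: str, long_form: str) -> bool:
    """Return True if each letter or digit of the short form appears,
    in order, somewhere in the long form (case-insensitive)."""
    long_lower = long_form.lower()
    position = 0
    for char in short_form.lower():
        if not char.isalnum():
            continue  # ignore hyphens, periods and similar punctuation
        found = long_lower.find(char, position)
        if found == -1:
            return False
        position = found + 1
    return True


matches = follows_spelling_convention("SF", "short form")
mismatch = follows_spelling_convention("LF", "short form")
```

A practical system would add further constraints, such as requiring the first letter of the short form to start a word in the long form, since a bare in-order match accepts many spurious pairs.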
Affiliation(s)
| | - Boxiang Liu
- Baidu Research, Sunnyvale, CA, United States
|
121
|
Yan D, Zheng G, Wang C, Chen Z, Mao T, Gao J, Yan Y, Chen X, Ji X, Yu J, Mo S, Wen H, Han W, Zhou M, Wang Y, Wang J, Tang K, Cao Z. HIT 2.0: an enhanced platform for Herbal Ingredients' Targets. Nucleic Acids Res 2022; 50:D1238-D1243. [PMID: 34986599 PMCID: PMC8728248 DOI: 10.1093/nar/gkab1011] [Citation(s) in RCA: 45] [Impact Index Per Article: 22.5] [Received: 09/14/2021] [Revised: 10/08/2021] [Accepted: 11/15/2021] [Indexed: 12/29/2022] Open
Abstract
Literature-described targets of herbal ingredients have been explored to facilitate the mechanistic study of herbs, as well as new drug discovery. Though several databases provide similar information, the majority of them are limited to literature before 2010 and urgently need to be updated. HIT 2.0 was constructed as the latest curated dataset focusing on Herbal Ingredients’ Targets, covering PubMed literature from 2000–2020. Currently, HIT 2.0 hosts 10 031 compound-target activity pairs with quality indicators between 2208 targets and 1237 ingredients from more than 1250 reputable herbs. The molecular targets cover those genes/proteins being directly/indirectly activated/inhibited, protein binders, and enzyme substrates or products. Also included are those genes regulated under the treatment of an individual ingredient. Crosslinks were made to databases of TTD, DrugBank, KEGG, PDB, UniProt, Pfam, NCBI, TCM-ID and others. More importantly, HIT enables automatic Target-mining and My-target curation from daily released PubMed literature. Thus, users can retrieve and download the latest abstracts containing potential targets for compounds of interest, even for those not yet covered in HIT. Further, users can log into the ‘My-target’ system to curate personal target-profiling online based on retrieved abstracts. HIT is accessible at http://hit2.badd-cao.net.
Affiliation(s)
- Deyu Yan
- Dept. of Gastroenterology, Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Genhui Zheng
- Dept. of Gastroenterology, Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Caicui Wang
- Dept. of Gastroenterology, Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Zikun Chen
- Dept. of Gastroenterology, Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Tiantian Mao
- Dept. of Gastroenterology, Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Jian Gao
- International Human Phenome Institutes (Shanghai), Shanghai, China; Department of Thoracic Surgery, Fudan University Shanghai Cancer Center, Shanghai, China
| | - Yu Yan
- Dept. of Gastroenterology, Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Xiangyi Chen
- Dept. of Gastroenterology, Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Xuejie Ji
- Dept. of Gastroenterology, Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Jinyu Yu
- Dept. of Gastroenterology, Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Saifeng Mo
- Dept. of Gastroenterology, Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Haonan Wen
- Dept. of Gastroenterology, Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Wenhao Han
- Dept. of Gastroenterology, Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Mengdi Zhou
- Dept. of Gastroenterology, Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Yuan Wang
- Dept. of Gastroenterology, Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Jun Wang
- Dept. of Gastroenterology, Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Kailin Tang
- Dept. of Gastroenterology, Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Zhiwei Cao
- School of Life Sciences, Fudan University, Shanghai 200092, China
|
122
|
Wang Y, Tong Y, Zhang Z, Zheng R, Huang D, Yang J, Zong H, Tan F, Xie Y, Huang H, Zhang X. ViMIC: a database of human disease-related virus mutations, integration sites and cis-effects. Nucleic Acids Res 2022; 50:D918-D927. [PMID: 34500462 PMCID: PMC8728280 DOI: 10.1093/nar/gkab779] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 06/15/2021] [Revised: 08/10/2021] [Accepted: 08/26/2021] [Indexed: 02/06/2023] Open
Abstract
Molecular mechanisms of virus-related diseases involve multiple factors, including viral mutation accumulation and integration of a viral genome into the host DNA. With increasing attention being paid to virus-mediated pathogenesis and the development of many useful technologies to identify virus mutations (VMs) and viral integration sites (VISs), much research on these topics is available in PubMed. However, knowledge of VMs and VISs is widely scattered across numerous published papers that lack standardization, integration and curation. To address these challenges, we built a pilot database of human disease-related Virus Mutations, Integration sites and Cis-effects (ViMIC), which specializes in three features: virus mutation sites, viral integration sites and target genes. In total, ViMIC provides information on 31 712 VM entries, 105 624 VISs, 16 310 viral target genes and 1 110 015 virus sequences of eight viruses in 77 human diseases obtained from the public domain. Furthermore, ViMIC allows users to explore the cis-effects of virus-host interactions by surveying 78 histone modifications, binding of 1358 transcription regulators and chromatin accessibility on these VISs. We believe ViMIC will become a valuable resource for the virus research community. The database is available at http://bmtongji.cn/ViMIC/index.php.
Affiliation(s)
- Ying Wang
- Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
- Department of Laboratory Medicine, Shanghai Eastern Hepatobiliary Surgery Hospital, Shanghai 200438, China
| | - Yuantao Tong
- Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Zeyu Zhang
- Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Rongbin Zheng
- Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Danqi Huang
- Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Jinxuan Yang
- Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Hui Zong
- Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Fanglin Tan
- Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Yujia Xie
- Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Honglian Huang
- Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Xiaoyan Zhang
- Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
|
123
|
Muscolino A, Di Maria A, Rapicavoli RV, Alaimo S, Bellomo L, Billeci F, Borzì S, Ferragina P, Ferro A, Pulvirenti A. NETME: on-the-fly knowledge network construction from biomedical literature. Appl Netw Sci 2022; 7:1. [PMID: 35013714 PMCID: PMC8733431 DOI: 10.1007/s41109-021-00435-x] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 03/31/2021] [Accepted: 09/21/2021] [Indexed: 06/14/2023]
Abstract
BACKGROUND The rapidly increasing biological literature is a key resource to automatically extract and gain knowledge concerning biological elements and their relations. Knowledge Networks are helpful tools in the context of biological knowledge discovery and modeling. RESULTS We introduce a novel system called NETME, which, starting from a set of full-texts obtained from PubMed, through an easy-to-use web interface, interactively extracts biological elements from ontological databases and then synthesizes a network inferring relations among such elements. The results clearly show that our tool is capable of inferring comprehensive and reliable biological networks. SUPPLEMENTARY INFORMATION The online version contains supplementary material available at 10.1007/s41109-021-00435-x.
Affiliation(s)
| | - Antonio Di Maria
- Department of Clinical and Experimental Medicine, University of Catania, Catania, Italy
| | | | - Salvatore Alaimo
- Department of Clinical and Experimental Medicine, University of Catania, Catania, Italy
| | - Lorenzo Bellomo
- Department of Computer Science, University of Pisa, Pisa, Italy
| | - Fabrizio Billeci
- Department of Maths and Computer Science, University of Catania, Catania, Italy
| | - Stefano Borzì
- Department of Maths and Computer Science, University of Catania, Catania, Italy
| | - Paolo Ferragina
- Department of Computer Science, University of Pisa, Pisa, Italy
| | - Alfredo Ferro
- Department of Clinical and Experimental Medicine, University of Catania, Catania, Italy
| | - Alfredo Pulvirenti
- Department of Clinical and Experimental Medicine, University of Catania, Catania, Italy
|
124
|
El Idrissi F, Fruchart M, Belarbi K, Lamer A, Dubois-Deruy E, Lemdani M, N’Guessan AL, Guinhouya BC, Zitouni D. Exploration of the core protein network under endometriosis symptomatology using a computational approach. Front Endocrinol (Lausanne) 2022; 13:869053. [PMID: 36120440 PMCID: PMC9478376 DOI: 10.3389/fendo.2022.869053] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/09/2022] [Accepted: 08/17/2022] [Indexed: 11/13/2022]
Abstract
BACKGROUND Endometriosis is defined by implantation and invasive growth of endometrial tissue in extra-uterine locations, causing heterogeneous symptoms and a unique clinical picture for each patient. Understanding the complex biological mechanisms underlying these symptoms and the protein networks involved may be useful for early diagnosis and identification of pharmacological targets. METHODS In the present study, we combined three approaches: (i) a text-mining analysis to perform a systematic search of proteins over existing literature, (ii) a functional enrichment analysis to identify the biological pathways in which proteins are most involved, and (iii) a protein-protein interaction (PPI) network to identify which proteins most strongly modulate the symptomatology of endometriosis. RESULTS Two hundred seventy-eight proteins associated with endometriosis symptomatology in the scientific literature were extracted. Thirty-five proteins were selected according to degree and betweenness score criteria. The most enriched biological pathways associated with these symptoms were (i) Interleukin-4 and Interleukin-13 signaling (p = 1.11 × 10⁻¹⁶), (ii) Signaling by Interleukins (p = 1.11 × 10⁻¹⁶), (iii) Cytokine signaling in Immune system (p = 1.11 × 10⁻¹⁶), and (iv) Interleukin-10 signaling (p = 5.66 × 10⁻¹⁵). CONCLUSION Our study identified some key proteins with the ability to modulate endometriosis symptomatology. Our findings indicate that both pro- and anti-inflammatory biological pathways may play important roles in the symptomatology of endometriosis. This approach represents a genuine systemic method that may complement traditional experimental studies. The current data can be used to identify promising biomarkers for early diagnosis and potential therapeutic targets.
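As a rough illustration of what selecting proteins "according to degree and betweenness" can look like, the following minimal Python sketch computes both scores on a toy undirected network and keeps the nodes that pass two cut-offs. It is not the authors' pipeline: the node names `P1` to `P5`, the edges, the `select_hubs` helper and both thresholds are hypothetical, since the abstract does not state the actual criteria.

```python
from collections import deque


def degree_centrality(graph):
    """Number of neighbours of each node in an undirected graph."""
    return {node: len(neighbours) for node, neighbours in graph.items()}


def betweenness_centrality(graph):
    """Unnormalised betweenness (Brandes' algorithm, unweighted, undirected)."""
    score = {node: 0.0 for node in graph}
    for source in graph:
        stack = []                              # nodes in order of discovery
        preds = {node: [] for node in graph}    # predecessors on shortest paths
        sigma = {node: 0 for node in graph}     # number of shortest paths
        sigma[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:                            # breadth-first search
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {node: 0.0 for node in graph}
        while stack:                            # accumulate dependencies
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each undirected shortest path was counted once from each endpoint.
    return {node: value / 2 for node, value in score.items()}


def select_hubs(graph, min_degree, min_betweenness):
    """Hypothetical filter: keep nodes that pass both cut-offs."""
    degree = degree_centrality(graph)
    betweenness = betweenness_centrality(graph)
    return sorted(
        node for node in graph
        if degree[node] >= min_degree and betweenness[node] >= min_betweenness
    )


# Toy network (hypothetical interactions, not taken from the study).
toy_ppi = {
    "P1": {"P2"},
    "P2": {"P1", "P3"},
    "P3": {"P2", "P4", "P5"},
    "P4": {"P3"},
    "P5": {"P3"},
}

if __name__ == "__main__":
    print(select_hubs(toy_ppi, min_degree=2, min_betweenness=1.0))
```

In this toy network only `P2` and `P3` lie on shortest paths between other nodes, so they are the only candidates that survive the filter. On a real PPI network the cut-offs would have to be chosen from the score distributions.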
Affiliation(s)
- Fatima El Idrissi
- Univ. Lille, UFR 3S, Faculté Ingénierie et Management de la Santé, Lille, France
- Univ. Lille, UFR 3S, Faculté de Pharmacie, Lille, France
| | - Mathilde Fruchart
- Univ. Lille, UFR 3S, Faculté Ingénierie et Management de la Santé, Lille, France
- Univ. Lille, CHU Lille, ULR 2694 - METRICS, Lille, France
| | - Karim Belarbi
- Univ. Lille, UFR 3S, Faculté de Pharmacie, Lille, France
- Univ. Lille, Inserm, CHU-Lille, Lille Neuroscience & Cognition, Lille, France
| | - Antoine Lamer
- Univ. Lille, UFR 3S, Faculté Ingénierie et Management de la Santé, Lille, France
- Univ. Lille, CHU Lille, ULR 2694 - METRICS, Lille, France
| | - Emilie Dubois-Deruy
- Univ. Lille, Inserm, CHU Lille, Institut Pasteur de Lille, U1167 - RID-AGE - Facteurs de risque et déterminants moléculaires des maladies liées au vieillissement, Lille, France
| | - Mohamed Lemdani
- Univ. Lille, UFR 3S, Faculté de Pharmacie, Lille, France
- Univ. Lille, CHU Lille, ULR 2694 - METRICS, Lille, France
| | - Assi L. N’Guessan
- Univ. Lille, UMR CNRS 8524, Laboratoire Paul Painlevé, Villeneuve d’Ascq, Cedex, France
| | - Benjamin C. Guinhouya
- Univ. Lille, UFR 3S, Faculté Ingénierie et Management de la Santé, Lille, France
- Univ. Lille, CHU Lille, ULR 2694 - METRICS, Lille, France
- *Correspondence: Benjamin C. Guinhouya,
| | - Djamel Zitouni
- Univ. Lille, UFR 3S, Faculté de Pharmacie, Lille, France
- Univ. Lille, CHU Lille, ULR 2694 - METRICS, Lille, France
|
125
|
Stoeger T, Nunes Amaral LA. The characteristics of early-stage research into human genes are substantially different from subsequent research. PLoS Biol 2022; 20:e3001520. [PMID: 34990452 PMCID: PMC8769369 DOI: 10.1371/journal.pbio.3001520] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/22/2021] [Revised: 01/19/2022] [Accepted: 12/21/2021] [Indexed: 11/19/2022]
Abstract
Throughout the last 2 decades, several scholars observed that present day research into human genes rarely turns toward genes that had not already been extensively investigated in the past. Guided by hypotheses derived from studies of science and innovation, we present here a literature-wide data-driven meta-analysis to identify the specific scientific and organizational contexts that coincided with early-stage research into human genes throughout the past half century. We demonstrate that early-stage research into human genes differs in team size, citation impact, funding mechanisms, and publication outlet, but that generalized insights derived from studies of science and innovation only partially apply to early-stage research into human genes. Further, we demonstrate that, presently, genome biology accounts for most of the initial early-stage research, while subsequent early-stage research can engage other life sciences fields. We therefore anticipate that the specificity of our findings will enable scientists and policymakers to better promote early-stage research into human genes and increase overall innovation within the life sciences.
Affiliation(s)
- Thomas Stoeger
- Department of Chemical and Biological Engineering, Northwestern University, Evanston, Illinois, United States of America
- Northwestern Institute on Complex Systems (NICO), Northwestern University, Evanston, Illinois, United States of America
- Center for Genetic Medicine, Northwestern University, Chicago, Illinois, United States of America
| | - Luís A. Nunes Amaral
- Department of Chemical and Biological Engineering, Northwestern University, Evanston, Illinois, United States of America
- Northwestern Institute on Complex Systems (NICO), Northwestern University, Evanston, Illinois, United States of America
- Department of Molecular Bioscience, Northwestern University, Evanston, Illinois, United States of America
- Department of Physics and Astronomy, Northwestern University, Evanston, Illinois, United States of America
- Department of Medicine, Northwestern University School of Medicine, Chicago, Illinois, United States of America
|
126
|
Giachelle F, Irrera O, Silvello G. MedTAG: a portable and customizable annotation tool for biomedical documents. BMC Med Inform Decis Mak 2021; 21:352. [PMID: 34922517 PMCID: PMC8684237 DOI: 10.1186/s12911-021-01706-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/27/2021] [Accepted: 12/01/2021] [Indexed: 01/08/2023]
Abstract
BACKGROUND Semantic annotators and Natural Language Processing (NLP) methods for Named Entity Recognition and Linking (NER+L) require plenty of training and test data, especially in the biomedical domain. Despite the abundance of unstructured biomedical data, the lack of richly annotated biomedical datasets hinders the further development of NER+L algorithms for any effective secondary use. In addition, manual annotation of biomedical documents performed by physicians and experts is a costly and time-consuming task. To support, organize and speed up the annotation process, we introduce MedTAG, a collaborative biomedical annotation tool that is open-source, platform-independent, and free to use/distribute. RESULTS We present the main features of MedTAG and how it has been employed in the histopathology domain by physicians and experts to annotate more than seven thousand clinical reports manually. We compare MedTAG with a set of well-established biomedical annotation tools, including BioQRator, ezTag, MyMiner, and tagtog, weighing their pros and cons against those of MedTAG. We highlight that MedTAG is one of the very few open-source tools provided with an open license and a straightforward installation procedure supporting cross-platform use. CONCLUSIONS MedTAG has been designed according to five requirements (i.e. available, distributable, installable, workable and schematic) defined in a recent extensive review of manual annotation tools. Moreover, MedTAG satisfies 20 of the 22 criteria specified in the same study.
Affiliation(s)
- Fabio Giachelle
- Department of Information Engineering, University of Padua, Padua, Italy
| | - Ornella Irrera
- Department of Information Engineering, University of Padua, Padua, Italy
| | - Gianmaria Silvello
- Department of Information Engineering, University of Padua, Padua, Italy
|
127
|
Ravanmehr V, Blau H, Cappelletti L, Fontana T, Carmody L, Coleman B, George J, Reese J, Joachimiak M, Bocci G, Hansen P, Bult C, Rueter J, Casiraghi E, Valentini G, Mungall C, Oprea TI, Robinson PN. Supervised learning with word embeddings derived from PubMed captures latent knowledge about protein kinases and cancer. NAR Genom Bioinform 2021; 3:lqab113. [PMID: 34888523 PMCID: PMC8652379 DOI: 10.1093/nargab/lqab113] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/10/2021] [Revised: 10/14/2021] [Accepted: 11/24/2021] [Indexed: 11/17/2022]
Abstract
Inhibiting protein kinases (PKs) that cause cancers has been an important topic in cancer therapy for years. So far, almost 8% of >530 PKs have been targeted by FDA-approved medications, and around 150 protein kinase inhibitors (PKIs) have been tested in clinical trials. We present an approach based on natural language processing and machine learning to investigate the relations between PKs and cancers, predicting PKs whose inhibition would be efficacious to treat a certain cancer. Our approach represents PKs and cancers as semantically meaningful 100-dimensional vectors based on word and concept neighborhoods in PubMed abstracts. We use information about phase I-IV trials in ClinicalTrials.gov to construct a training set for random forest classification. Our results with historical data show that associations between PKs and specific cancers can be predicted years in advance with good accuracy. Our tool can be used to predict the relevance of inhibiting PKs for specific cancers and to support the design of well-focused clinical trials to discover novel PKIs for cancer therapy.
Affiliation(s)
- Vida Ravanmehr
- The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
| | - Hannah Blau
- The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
| | - Luca Cappelletti
- AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Italy
| | - Tommaso Fontana
- AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Italy
| | - Leigh Carmody
- The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
| | - Ben Coleman
- The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
- University of Connecticut Health Center, Department of Genetics and Genome Sciences, Farmington, CT 06030, USA
| | - Joshy George
- The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
| | - Justin Reese
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94710, USA
| | - Marcin Joachimiak
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94710, USA
| | - Giovanni Bocci
- Department of Internal Medicine and UNM Comprehensive Cancer Center, UNM School of Medicine, Albuquerque, NM 87102, USA
| | - Peter Hansen
- The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
| | - Carol Bult
- The Jackson Laboratory for Mammalian Genetics, Bar Harbor, ME 04609, USA
| | - Jens Rueter
- The Jackson Laboratory for Mammalian Genetics, Bar Harbor, ME 04609, USA
| | - Elena Casiraghi
- AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Italy
| | - Giorgio Valentini
- AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Italy
| | - Christopher Mungall
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94710, USA
| | - Tudor I Oprea
- Department of Internal Medicine and UNM Comprehensive Cancer Center, UNM School of Medicine, Albuquerque, NM 87102, USA
| | - Peter N Robinson
- The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
- Institute for Systems Genomics, University of Connecticut, Farmington, CT 06032, USA
|
128
|
Borchert F, Mock A, Tomczak A, Hügel J, Alkarkoukly S, Knurr A, Volckmar AL, Stenzinger A, Schirmacher P, Debus J, Jäger D, Longerich T, Fröhling S, Eils R, Bougatf N, Sax U, Schapranow MP. Knowledge bases and software support for variant interpretation in precision oncology. Brief Bioinform 2021; 22:bbab134. [PMID: 33971666 PMCID: PMC8574624 DOI: 10.1093/bib/bbab134] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 11/05/2020] [Revised: 03/10/2021] [Accepted: 03/30/2021] [Indexed: 12/12/2022]
Abstract
Precision oncology is a rapidly evolving interdisciplinary medical specialty. Comprehensive cancer panels are becoming increasingly available at pathology departments worldwide, creating the urgent need for scalable cancer variant annotation and molecularly informed treatment recommendations. A wealth of mainly academia-driven knowledge bases calls for software tools supporting the multi-step diagnostic process. We derive a comprehensive list of knowledge bases relevant for variant interpretation by a review of existing literature followed by a survey among medical experts from university hospitals in Germany. In addition, we review cancer variant interpretation tools, which integrate multiple knowledge bases. We categorize the knowledge bases along the diagnostic process in precision oncology and analyze programmatic access options as well as the integration of knowledge bases into software tools. The most commonly used knowledge bases provide good programmatic access options and have been integrated into a range of software tools. For the wider set of knowledge bases, access options vary across different parts of the diagnostic process. Programmatic access is limited for information regarding clinical classifications of variants and for therapy recommendations. The main issue for databases used for biological classification of pathogenic variants and pathway context information is the lack of standardized interfaces. There is no single cancer variant interpretation tool that integrates all identified knowledge bases. Specialized tools are available and need to be further developed for different steps in the diagnostic process.
Affiliation(s)
- Florian Borchert
- Digital Health Center, Hasso Plattner Institute (HPI), University of Potsdam, Prof.-Dr.-Helmert-Str. 2-3, 14482 Potsdam, Germany
| | - Andreas Mock
- Department of Translational Medical Oncology (TMO), National Center for Tumor Diseases (NCT) Heidelberg, German Cancer Research Center (DKFZ) Heidelberg, Im Neuenheimer Feld 460, 69120 Heidelberg, Germany
- Department of Medical Oncology, National Center for Tumor Diseases (NCT) Heidelberg, Heidelberg University Hospital, Im Neuenheimer Feld 460, 69120 Heidelberg, Germany
| | - Aurelie Tomczak
- Institute of Pathology Heidelberg, Heidelberg University Hospital, Im Neuenheimer Feld 224, 69120 Heidelberg, Germany
- Liver Cancer Center Heidelberg, Heidelberg University Hospital, Im Neuenheimer Feld 460, 69120 Heidelberg, Germany
| | - Jonas Hügel
- Department of Medical Informatics, University Medical Center Göttingen, Von-Siebold-Str. 3, 37099 Göttingen, Germany
- Campus Institute Data Science, Göttingen, Germany
| | - Samer Alkarkoukly
- CECAD, Faculty of Medicine and University Hospital Cologne, University of Cologne, Joseph-Stelzmann-Straße 26, 50931 Cologne
| | - Alexander Knurr
- Division of Medical Informatics for Translational Oncology, German Cancer Research Center (DKFZ) Heidelberg, Im Neuenheimer Feld 280, 69120 Heidelberg, Germany
| | - Anna-Lena Volckmar
- Institute of Pathology Heidelberg, Heidelberg University Hospital, Im Neuenheimer Feld 224, 69120 Heidelberg, Germany
| | - Albrecht Stenzinger
- Institute of Pathology Heidelberg, Heidelberg University Hospital, Im Neuenheimer Feld 224, 69120 Heidelberg, Germany
| | - Peter Schirmacher
- Institute of Pathology Heidelberg, Heidelberg University Hospital, Im Neuenheimer Feld 224, 69120 Heidelberg, Germany
- Liver Cancer Center Heidelberg, Heidelberg University Hospital, Im Neuenheimer Feld 460, 69120 Heidelberg, Germany
| | - Jürgen Debus
- Department of Radiation Oncology, Heidelberg University Hospital, Im Neuenheimer Feld 400, 69120 Heidelberg, Germany
- National Center for Tumor Diseases (NCT), Heidelberg University Hospital, Im Neuenheimer Feld 460, 69120 Heidelberg, Germany
- Clinical Cooperation Unit Radiation Oncology, German Cancer Research Center (DKFZ) Heidelberg, Im Neuenheimer Feld 280, 69120 Heidelberg, Germany
- Heidelberg Ion-Beam Therapy Center (HIT), Department of Radiation Oncology, Heidelberg University Hospital, Im Neuenheimer Feld 450, 69120 Heidelberg, Germany
- Heidelberg Institute of Radiation Oncology (HIRO), Heidelberg University Hospital, Im Neuenheimer Feld 400, 69120 Heidelberg, Germany
| | - Dirk Jäger
- Department of Medical Oncology, National Center for Tumor Diseases (NCT) Heidelberg, Heidelberg University Hospital, Im Neuenheimer Feld 460, 69120 Heidelberg, Germany
- Clinical Cooperation Unit Applied Tumor-Immunity, German Cancer Research Center (DKFZ) Heidelberg, Im Neuenheimer Feld 280, 69120 Heidelberg, Germany
| | - Thomas Longerich
- Institute of Pathology Heidelberg, Heidelberg University Hospital, Im Neuenheimer Feld 224, 69120 Heidelberg, Germany
- Liver Cancer Center Heidelberg, Heidelberg University Hospital, Im Neuenheimer Feld 460, 69120 Heidelberg, Germany
| | - Stefan Fröhling
- Department of Translational Medical Oncology (TMO), National Center for Tumor Diseases (NCT) Heidelberg, German Cancer Research Center (DKFZ) Heidelberg, Im Neuenheimer Feld 460, 69120 Heidelberg, Germany
- German Cancer Consortium (DKTK), 69120 Heidelberg, Germany
| | - Roland Eils
- Health Data Science Unit, Heidelberg University Hospital, Im Neuenheimer Feld 267, 69120 Heidelberg, Germany
- Center for Digital Health, Berlin Institute of Health and Charité Universitätsmedizin Berlin, Kapelle-Ufer 2, 10117 Berlin, Germany
| | - Nina Bougatf
- Department of Radiation Oncology, Heidelberg University Hospital, Im Neuenheimer Feld 400, 69120 Heidelberg, Germany
- National Center for Tumor Diseases (NCT), Heidelberg University Hospital, Im Neuenheimer Feld 460, 69120 Heidelberg, Germany
- Clinical Cooperation Unit Radiation Oncology, German Cancer Research Center (DKFZ) Heidelberg, Im Neuenheimer Feld 280, 69120 Heidelberg, Germany
- Heidelberg Ion-Beam Therapy Center (HIT), Department of Radiation Oncology, Heidelberg University Hospital, Im Neuenheimer Feld 450, 69120 Heidelberg, Germany
- Heidelberg Institute of Radiation Oncology (HIRO), Heidelberg University Hospital, Im Neuenheimer Feld 400, 69120 Heidelberg, Germany
| | - Ulrich Sax
- Department of Medical Informatics, University Medical Center Göttingen, Von-Siebold-Str. 3, 37099 Göttingen, Germany
- Campus Institute Data Science, Göttingen, Germany
| | - Matthieu-P Schapranow
- Digital Health Center, Hasso Plattner Institute (HPI), University of Potsdam, Prof.-Dr.-Helmert-Str. 2-3, 14482 Potsdam, Germany
|
129
|
Kuiper M, Bonello J, Fernández-Breis JT, Bucher P, Futschik ME, Gaudet P, Kulakovskiy IV, Licata L, Logie C, Lovering RC, Makeev VJ, Orchard S, Panni S, Perfetto L, Sant D, Schulz S, Zerbino DR, Lægreid A. The Gene Regulation Knowledge Commons: The action area of GREEKC. Biochim Biophys Acta Gene Regul Mech 2021; 1865:194768. [PMID: 34757206 DOI: 10.1016/j.bbagrm.2021.194768] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/20/2021] [Revised: 10/18/2021] [Accepted: 10/20/2021] [Indexed: 02/08/2023]
Abstract
The COST Action Gene Regulation Ensemble Effort for the Knowledge Commons (GREEKC, CA15205, www.greekc.org) organized nine workshops in a four-year period, starting September 2016. The workshops brought together a wide range of experts from all over the world working on various parts of the knowledge cycle that is central to understanding gene regulatory mechanisms. The discussions between ontologists, curators, text miners, biologists, bioinformaticians, philosophers and computational scientists spawned a host of activities aimed at updating and standardising existing knowledge management workflows, encouraging new experimental approaches and thoroughly involving end-users in the process of designing the Gene Regulation Knowledge Commons (GRKC). The GREEKC consortium describes its main achievements, contextualised within the state of the art of the current tools and resources that today represent the GRKC.
Affiliation(s)
- Martin Kuiper
- Systems Biology Group, Department of Biology, Norwegian University of Science and Technology, Trondheim, Norway.
| | - Joseph Bonello
- Faculty of Information & Communication Technology, University of Malta, Msida, Malta
| | | | - Philipp Bucher
- Swiss Institute of Bioinformatics, Quartier Sorge, Bâtiment Amphipôle, 1015 Lausanne, Switzerland
| | - Matthias E Futschik
- Systems Biology and Bioinformatics Laboratory (SysBioLab), Centre of Marine Sciences (CCMAR), University of Algarve, 8005-139 Faro, Portugal
| | - Pascale Gaudet
- SIB Swiss Institute of Bioinformatics, 1 Rue Michel-Servet, 1204 Geneva, Switzerland
| | - Ivan V Kulakovskiy
- Institute of Protein Research, Russian Academy of Sciences, Institutskaya 4, 142290 Pushchino, Russia
| | - Luana Licata
- Department of Biology, University of Rome Tor Vergata, Rome, Italy
| | - Colin Logie
- Department of Molecular Biology, Faculty of Science, Radboud University, PO Box 9101, Nijmegen 6500HG, the Netherlands
| | - Ruth C Lovering
- Functional Gene Annotation, Pre-clinical and Fundamental Science, Institute of Cardiovascular Science, University College London, 5 University Street, London WC1E 6JF, UK
| | - Vsevolod J Makeev
- Vavilov Institute of General Genetics, Russian Academy of Sciences, Gubkina 3, 119991 Moscow, Russia
| | - Sandra Orchard
- European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
| | - Simona Panni
- Department DIBEST, University of Calabria, Rende, Italy
| | - Livia Perfetto
- Fondazione Human Technopole, Department of Biology, Via Cristina Belgioioso, 171, 20157 Milan, Italy
| | - David Sant
- Department of Biomedical Informatics, University of Utah, 421 Wakara Way #140, Salt Lake City, UT 84108, United States
| | - Stefan Schulz
- Institute of Medical Informatics, Statistics and Documentation, Medical University of Graz, Auenbruggerpl. 2, Graz, Austria
| | - Daniel R Zerbino
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Astrid Lægreid
- Department of Clinical and Molecular Medicine, Norwegian University of Science and Technology, 7491 Trondheim, Norway
|
130
|
Chen C, Ross KE, Gavali S, Cowart JE, Wu CH. COVID-19 knowledge graph from semantic integration of biomedical literature and databases. Bioinformatics 2021; 37:4597-4598. [PMID: 34613368 PMCID: PMC8513397 DOI: 10.1093/bioinformatics/btab694] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 05/02/2021] [Revised: 09/26/2021] [Accepted: 10/04/2021] [Indexed: 11/12/2022]
Abstract
SUMMARY The global response to the COVID-19 pandemic has led to a rapid increase of scientific literature on this deadly disease. Extracting knowledge from biomedical literature and integrating it with relevant information from curated biological databases is essential to gain insight into COVID-19 etiology, diagnosis, and treatment. We used Semantic Web technology RDF to integrate COVID-19 knowledge mined from literature by iTextMine, PubTator, and SemRep with relevant biological databases and formalized the knowledge in a standardized and computable COVID-19 Knowledge Graph (KG). We published the COVID-19 KG via a SPARQL endpoint to support federated queries on the Semantic Web and developed a knowledge portal with browsing and searching interfaces. We also developed a RESTful API to support programmatic access and provided RDF dumps for download. AVAILABILITY AND IMPLEMENTATION The COVID-19 Knowledge Graph is publicly available under CC-BY 4.0 license at https://research.bioinformatics.udel.edu/covid19kg/.
Affiliation(s)
- Chuming Chen
- Department of Computer and Information Sciences, University of Delaware, Newark, Delaware, 19716, USA
| | - Karen E Ross
- Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center, Washington, DC, 20007, USA
| | - Sachin Gavali
- Department of Computer and Information Sciences, University of Delaware, Newark, Delaware, 19716, USA
| | - Julie E Cowart
- Department of Computer and Information Sciences, University of Delaware, Newark, Delaware, 19716, USA
| | - Cathy H Wu
- Department of Computer and Information Sciences, University of Delaware, Newark, Delaware, 19716, USA
- Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center, Washington, DC, 20007, USA
|
131
|
Ouyang S, Wang Y, Zhou K, Xia J. LitCovid-AGAC: cellular and molecular level annotation data set based on COVID-19. Genomics Inform 2021; 19:e23. [PMID: 34638170 PMCID: PMC8510875 DOI: 10.5808/gi.21013] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/17/2021] [Revised: 09/03/2021] [Accepted: 09/13/2021] [Indexed: 12/20/2022]
Abstract
The coronavirus disease 2019 (COVID-19) literature has been increasing dramatically, and the increased amount of text makes it possible to perform large-scale text mining and knowledge discovery. Curation of these texts has therefore become a crucial issue for the biomedical natural language processing (BioNLP) community, so as to retrieve important information about the mechanism of COVID-19. PubAnnotation is an aligned annotation system which provides an efficient platform for biological curators to upload their annotations or merge other external annotations. Inspired by the integration among multiple useful COVID-19 annotations, we merged three annotation resources into the LitCovid data set and constructed a cross-annotated corpus, LitCovid-AGAC. This corpus consists of 12 labels: Mutation, Species, Gene and Disease from PubTator; GO and CHEBI from OGER; and Var, MPA, CPA, NegReg, PosReg and Reg from AGAC, over 50,018 COVID-19 abstracts in LitCovid. The corpus contains abundant information that may help unveil hidden knowledge about the pathological mechanism of COVID-19.
Affiliation(s)
- Sizhuo Ouyang
- Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, 430070 Wuhan, China
| | - Yuxing Wang
- Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, 430070 Wuhan, China
| | - Kaiyin Zhou
- Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, 430070 Wuhan, China
| | - Jingbo Xia
- Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, 430070 Wuhan, China
|
132
|
Liu D, Han M, Tian Y, Gong L, Jia C, Cai P, Tu W, Chen J, Hu QN. Cell2Chem: mining explored and unexplored biosynthetic chemical spaces. Bioinformatics 2021; 36:5269-5270. [PMID: 32697815 DOI: 10.1093/bioinformatics/btaa660] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/20/2020] [Revised: 06/14/2020] [Accepted: 07/16/2020] [Indexed: 11/12/2022]
Abstract
SUMMARY Living cell strains have important applications in synthesizing their native compounds and potential for use in studies exploring the universal chemical space. Here, we present a web server named Cell2Chem, which accelerates the search for explored compounds in organisms and facilitates investigations of biosynthesis in unexplored chemical spaces. Cell2Chem uses co-occurrence networks and natural language processing to provide a systematic method for linking living organisms to biosynthesized compounds and the processes that produce these compounds. The Cell2Chem platform comprises 40 370 species and 125 212 compounds. Using in silico prediction methods for reaction pathways and enzyme function, Cell2Chem reveals possible biosynthetic pathways of compounds and catalytic functions of proteins to expand unexplored biosynthetic chemical spaces. Cell2Chem can help improve biosynthesis research and enhance the efficiency of synthetic biology. AVAILABILITY AND IMPLEMENTATION Cell2Chem is available at: http://www.rxnfinder.org/cell2chem/.
Affiliation(s)
- Dongliang Liu
- CAS Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, P. R. China
| | - Mengying Han
- CAS Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, P. R. China
| | - Yu Tian
- Tianjin Institute of Industrial Biotechnology, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Tianjin 300308, P. R. China
| | - Linlin Gong
- CAS Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, P. R. China
| | - Cancan Jia
- CAS Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, P. R. China
| | - Pengli Cai
- CAS Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, P. R. China.,Tianjin Institute of Industrial Biotechnology, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Tianjin 300308, P. R. China
| | - Weizhong Tu
- Wuhan LifeSynther Science and Technology Co. Limited, Wuhan 430070, P. R. China
| | - Junni Chen
- Wuhan LifeSynther Science and Technology Co. Limited, Wuhan 430070, P. R. China
| | - Qian-Nan Hu
- CAS Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, P. R. China
|
133
|
Djekidel MN, Rosikiewicz W, Peng JC, Kanneganti TD, Hui Y, Jin H, Hedges D, Schreiner P, Fan Y, Wu G, Xu B. CovidExpress: an interactive portal for intuitive investigation on SARS-CoV-2 related transcriptomes. bioRxiv 2021:2021.05.14.444026. [PMID: 34075382 PMCID: PMC8168395 DOI: 10.1101/2021.05.14.444026] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 12/22/2022]
Abstract
Infection with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in humans can cause coronavirus disease 2019 (COVID-19). Since its discovery in December 2019, SARS-CoV-2 has caused a global pandemic and 3.3 million direct or indirect deaths (as of May 2021). In the scientific community's response to COVID-19, data sharing has emerged as an essential aspect of the fight against SARS-CoV-2. Despite the ever-growing number of studies on SARS-CoV-2 and COVID-19, only a few databases have so far been curated to enable access to gene expression data. Furthermore, these databases curate only a small set of data and do not provide easy access for investigators without computational skills to perform analyses. To fill this gap and advance open access to the growing gene expression data on this deadly virus, we collected about 1,500 human bulk RNA-seq datasets from publicly available resources and developed a database and visualization tool named CovidExpress (https://stjudecab.github.io/covidexpress). This open-access database will allow investigators to examine gene expression in various tissues and cell lines, and their response to SARS-CoV-2 under different experimental conditions, accelerating the understanding of the etiology of this disease to inform drug and vaccine development. Our integrative analysis of this large dataset highlights a set of commonly regulated genes in SARS-CoV-2-infected lung and rhinovirus-infected nasal tissues, including OASL, that were under-studied in COVID-19-related reports. Our results also suggest a potential FURIN positive feedback loop that might explain the evolutionary advantage of SARS-CoV-2.
Affiliation(s)
- Mohamed Nadhir Djekidel
- Center for Applied Bioinformatics, St. Jude Children’s Research Hospital, Memphis, Tennessee, 38105, USA
- These authors contributed equally to this study
| | - Wojciech Rosikiewicz
- Center for Applied Bioinformatics, St. Jude Children’s Research Hospital, Memphis, Tennessee, 38105, USA
- These authors contributed equally to this study
| | - Jamy C. Peng
- Department of Developmental Neurobiology, St. Jude Children’s Research Hospital, Memphis, Tennessee, 38105, USA
| | | | - Yawei Hui
- Center for Applied Bioinformatics, St. Jude Children’s Research Hospital, Memphis, Tennessee, 38105, USA
| | - Hongjian Jin
- Center for Applied Bioinformatics, St. Jude Children’s Research Hospital, Memphis, Tennessee, 38105, USA
| | - Dale Hedges
- Center for Applied Bioinformatics, St. Jude Children’s Research Hospital, Memphis, Tennessee, 38105, USA
| | - Patrick Schreiner
- Center for Applied Bioinformatics, St. Jude Children’s Research Hospital, Memphis, Tennessee, 38105, USA
| | - Yiping Fan
- Center for Applied Bioinformatics, St. Jude Children’s Research Hospital, Memphis, Tennessee, 38105, USA
| | - Gang Wu
- Center for Applied Bioinformatics, St. Jude Children’s Research Hospital, Memphis, Tennessee, 38105, USA
| | - Beisi Xu
- Center for Applied Bioinformatics, St. Jude Children’s Research Hospital, Memphis, Tennessee, 38105, USA
|
134
|
Noh J, Kavuluru R. Improved biomedical word embeddings in the transformer era. J Biomed Inform 2021; 120:103867. [PMID: 34284119 PMCID: PMC8373296 DOI: 10.1016/j.jbi.2021.103867] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/22/2020] [Revised: 05/10/2021] [Accepted: 07/11/2021] [Indexed: 10/20/2022]
Abstract
BACKGROUND Recent natural language processing (NLP) research is dominated by neural network methods that employ word embeddings as basic building blocks. Pre-training with neural methods that capture local and global distributional properties (e.g., skip-gram, GloVe) using free text corpora is often used to embed both words and concepts. Pre-trained embeddings are typically leveraged in downstream tasks using various neural architectures that are designed to optimize task-specific objectives that might further tune such embeddings. OBJECTIVE Despite advances in contextualized language-model-based embeddings, static word embeddings still form an essential starting point in BioNLP research and applications. They are useful in low-resource settings and in lexical semantics studies. Our main goal is to build improved biomedical word embeddings and make them publicly available for downstream applications. METHODS We jointly learn word and concept embeddings by first using the skip-gram method and further fine-tuning them with correlational information manifesting in co-occurring Medical Subject Heading (MeSH) concepts in biomedical citations. This fine-tuning is accomplished with the transformer-based BERT architecture in the two-sentence input mode with a classification objective that captures MeSH pair co-occurrence. We conduct evaluations of these tuned static embeddings using multiple datasets for word relatedness developed by previous efforts. RESULTS In both qualitative and quantitative evaluations, we demonstrate that our methods produce improved biomedical embeddings in comparison with other static embedding efforts. Without selectively culling concepts and terms (as was pursued by previous efforts), we believe we offer the most exhaustive evaluation of biomedical embeddings to date with clear performance improvements across the board.
CONCLUSION We repurposed a transformer architecture (typically used to generate dynamic embeddings) to improve static biomedical word embeddings using concept correlations. We provide our code and embeddings for public use for downstream applications and research endeavors: https://github.com/bionlproc/BERT-CRel-Embeddings.
Affiliation(s)
- Jiho Noh
- Department of Computer Science, University of Kentucky, United States of America.
| | - Ramakanth Kavuluru
- Division of Biomedical Informatics, Department of Internal Medicine, University of Kentucky, United States of America; Department of Computer Science, University of Kentucky, United States of America.
|
135
|
Allot A, Lee K, Chen Q, Luo L, Lu Z. LitSuggest: a web-based system for literature recommendation and curation using machine learning. Nucleic Acids Res 2021; 49:W352-W358. [PMID: 33950204 DOI: 10.1093/nar/gkab326] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Received: 03/01/2021] [Revised: 04/16/2021] [Accepted: 04/20/2021] [Indexed: 01/02/2023]
Abstract
Searching and reading relevant literature is a routine practice in biomedical research. However, it is challenging for a user to design optimal search queries using all the keywords related to a given topic. As such, existing search systems such as PubMed often return suboptimal results. Several computational methods have been proposed as an effective alternative to keyword-based query methods for literature recommendation. However, those methods require specialized knowledge in machine learning and natural language processing, which can make them difficult for biologists to utilize. In this paper, we propose LitSuggest, a web server that provides an all-in-one literature recommendation and curation service to help biomedical researchers stay up to date with scientific literature. LitSuggest combines advanced machine learning techniques for suggesting relevant PubMed articles with high accuracy. In addition to innovative text-processing methods, LitSuggest offers multiple advantages over existing tools. First, LitSuggest allows users to curate, organize, and download classification results in a single interface. Second, users can easily fine-tune LitSuggest results by updating the training corpus. Third, results can be readily shared, enabling collaborative analysis and curation of scientific literature. Finally, LitSuggest provides an automated personalized weekly digest of newly published articles for each user's project. LitSuggest is publicly available at https://www.ncbi.nlm.nih.gov/research/litsuggest.
Affiliation(s)
- Alexis Allot
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Kyubum Lee
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA.,Department of Biostatistics and Bioinformatics, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL 33612, USA
| | - Qingyu Chen
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Ling Luo
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA
|
136
|
Chen Q, Leaman R, Allot A, Luo L, Wei CH, Yan S, Lu Z. Artificial Intelligence in Action: Addressing the COVID-19 Pandemic with Natural Language Processing. Annu Rev Biomed Data Sci 2021; 4:313-339. [PMID: 34465169 DOI: 10.1146/annurev-biodatasci-021821-061045] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Indexed: 11/09/2022]
Abstract
The COVID-19 (coronavirus disease 2019) pandemic has had a significant impact on society, both because of the serious health effects of COVID-19 and because of public health measures implemented to slow its spread. Many of these difficulties are fundamentally information needs; attempts to address these needs have caused an information overload for both researchers and the public. Natural language processing (NLP), the branch of artificial intelligence that interprets human language, can be applied to address many of the information needs made urgent by the COVID-19 pandemic. This review surveys approximately 150 NLP studies and more than 50 systems and datasets addressing the COVID-19 pandemic. We detail work on four core NLP tasks: information retrieval, named entity recognition, literature-based discovery, and question answering. We also describe work that directly addresses aspects of the pandemic through four additional tasks: topic modeling, sentiment and emotion analysis, caseload forecasting, and misinformation detection. We conclude by discussing observable trends and remaining challenges.
Affiliation(s)
- Qingyu Chen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
| | - Robert Leaman
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
| | - Alexis Allot
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
| | - Ling Luo
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
| | - Chih-Hsuan Wei
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
| | - Shankai Yan
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
|
137
|
Chatterjee S, Chakraborty R, Hasija Y. Polymorphisms at site 469 of B-RAF protein associated with skin melanoma may be correlated with dabrafenib resistance: An in silico study. J Biomol Struct Dyn 2021; 40:10862-10877. [PMID: 34278963 DOI: 10.1080/07391102.2021.1950571] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/21/2021] [Accepted: 06/28/2021] [Indexed: 12/24/2022]
Abstract
Melanoma is a type of skin cancer. Numerous genes and their proteins are strongly associated with melanoma susceptibility. This study aims to use an in silico method to identify genetic variants in the melanoma susceptibility gene. The COSMIC database was queried for genes and cross-referenced with three environment-gene interaction databases (EGP, SeattleSNPs and CTD) to identify shared genes. The majority of approved skin melanoma drugs were found to act on the protein serine/threonine-protein kinase (B-RAF) encoded by the BRAF gene, which was also present in all three referenced databases. Comprehensive computational analysis was performed to predict deleterious genetic variants associated with skin melanoma, and the nsSNPs G469V and G469E were prioritized based on their predicted deleterious effects. Molecular dynamic simulation analysis of the B-RAF protein mutants G469V and G469E reveals that variations in the amino acid conformation at the drug binding site result in inconsistency in drug interaction. Additionally, this analysis showed that the G469V and G469E mutants have lower binding energy for dabrafenib than the wild type. The population with the highest frequency of each deleterious and pathogenic variant has been determined. The study's findings would support the development of more effective treatment strategies for skin melanoma. Communicated by Ramaswamy H. Sarma.
Affiliation(s)
| | | | - Yasha Hasija
- Department of Biotechnology, Delhi Technological University, Delhi, India
|
138
|
Li P, Jiang X, Zhang G, Trabucco JT, Raciti D, Smith C, Ringwald M, Marai GE, Arighi C, Shatkay H. Utilizing image and caption information for biomedical document classification. Bioinformatics 2021; 37:i468-i476. [PMID: 34252939 PMCID: PMC8346654 DOI: 10.1093/bioinformatics/btab331] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Accepted: 05/06/2021] [Indexed: 11/15/2022]
Abstract
Motivation Biomedical research findings are typically disseminated through publications. To simplify access to domain-specific knowledge while supporting the research community, several biomedical databases devote significant effort to manual curation of the literature, a labor-intensive process. The first step toward biocuration requires identifying articles relevant to the specific area on which the database focuses. Thus, automatically identifying publications relevant to a specific topic within a large volume of publications is an important task toward expediting the biocuration process and, in turn, biomedical research. Current methods focus on textual contents, typically extracted from the title-and-abstract. Notably, images and captions are often used in publications to convey pivotal evidence about processes, experiments and results. Results We present a new document classification scheme, using both image and caption information, in addition to titles-and-abstracts. To use the image information, we introduce a new image representation, namely Figure-word, based on class labels of subfigures. We use word embeddings for representing captions and titles-and-abstracts. To utilize all three types of information, we introduce two information integration methods. The first combines Figure-words and textual features obtained from captions and titles-and-abstracts into a single larger vector for document representation; the second employs a meta-classification scheme. Our experiments and results demonstrate the usefulness of the newly proposed Figure-words for representing images. Moreover, the results showcase the value of Figure-words, captions and titles-and-abstracts in providing complementary information for document classification; when combined, these three sources of information lead to an overall improved classification performance.
Availability and implementation Source code and the list of PMIDs of the publications in our datasets are available upon request.
Affiliation(s)
- Pengyuan Li
- Department of Computer and Information Sciences, University of Delaware, Newark, DE 19716, USA
| | - Xiangying Jiang
- Department of Computer and Information Sciences, University of Delaware, Newark, DE 19716, USA.,Amazon, Seattle, WA 98109, USA
| | - Gongbo Zhang
- Department of Computer and Information Sciences, University of Delaware, Newark, DE 19716, USA.,Google, Mountain View, CA 94043, USA
| | - Juan Trelles Trabucco
- Department of Computer Science, The University of Illinois at Chicago, Chicago, IL 60612, USA
| | - Daniela Raciti
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125, USA
| | | | | | - G Elisabeta Marai
- Department of Computer Science, The University of Illinois at Chicago, Chicago, IL 60612, USA
| | - Cecilia Arighi
- Department of Computer and Information Sciences, University of Delaware, Newark, DE 19716, USA
| | - Hagit Shatkay
- Department of Computer and Information Sciences, University of Delaware, Newark, DE 19716, USA
|
139
|
Su J, Wu Y, Ting HF, Lam TW, Luo R. RENET2: high-performance full-text gene-disease relation extraction with iterative training data expansion. NAR Genom Bioinform 2021; 3:lqab062. [PMID: 34235433 PMCID: PMC8256824 DOI: 10.1093/nargab/lqab062] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/21/2021] [Revised: 06/16/2021] [Accepted: 06/23/2021] [Indexed: 01/06/2023]
Abstract
Relation extraction (RE) is a fundamental task for extracting gene–disease associations from biomedical text. Many state-of-the-art tools have limited capacity, as they can extract gene–disease associations only from single sentences or abstract texts. A few studies have explored extracting gene–disease associations from full-text articles, but there is considerable room for improvement. In this work, we propose RENET2, a deep learning-based RE method, which implements Section Filtering and ambiguous relations modeling to extract gene–disease associations from full-text articles. We designed a novel iterative training data expansion strategy to build an annotated full-text dataset and so address the scarcity of labels on full-text articles. In our experiments, RENET2 achieved an F1-score of 72.13% for extracting gene–disease associations from an annotated full-text dataset, which was 27.22, 30.30, 29.24 and 23.87% higher than BeFree, DTMiner, BioBERT and RENET, respectively. We applied RENET2 to (i) ∼1.89M full-text articles from PubMed Central and found ∼3.72M gene–disease associations; and (ii) the LitCovid articles and ranked the top 15 proteins associated with COVID-19, supported by recent articles. RENET2 is an efficient and accurate method for full-text gene–disease association extraction. The source code, manually curated abstract/full-text training data, and results of RENET2 are available at GitHub.
Affiliation(s)
- Junhao Su
- Department of Computer Science, The University of Hong Kong, Hong Kong, 999077, China
| | - Ye Wu
- Department of Computer Science, The University of Hong Kong, Hong Kong, 999077, China
| | - Hing-Fung Ting
- Department of Computer Science, The University of Hong Kong, Hong Kong, 999077, China
| | - Tak-Wah Lam
- Department of Computer Science, The University of Hong Kong, Hong Kong, 999077, China
| | - Ruibang Luo
- Department of Computer Science, The University of Hong Kong, Hong Kong, 999077, China
|
140
|
Untangling the genetic link between type 1 and type 2 diabetes using functional genomics. Sci Rep 2021; 11:13871. [PMID: 34230558 PMCID: PMC8260770 DOI: 10.1038/s41598-021-93346-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 10/07/2020] [Accepted: 06/16/2021] [Indexed: 02/06/2023]
Abstract
There is evidence pointing towards shared etiological features between type 1 diabetes (T1D) and type 2 diabetes (T2D) despite both phenotypes being considered genetically distinct. However, the existence of shared genetic features for T1D and T2D remains complex and poorly defined. To better understand the link between T1D and T2D, we employed an integrated functional genomics approach involving extensive chromatin interaction data (Hi-C) and expression quantitative trait loci (eQTL) data to characterize the tissue-specific impacts of single nucleotide polymorphisms associated with T1D and T2D. We identified 195 pleiotropic genes that are modulated by tissue-specific spatial eQTLs associated with both T1D and T2D. The pleiotropic genes are enriched in inflammatory and metabolic pathways that include mitogen-activated protein kinase activity, pertussis toxin signaling, and the Parkinson's disease pathway. We identified 8 regulatory elements within the TCF7L2 locus that modulate transcript levels of genes involved in immune regulation as well as genes important in the etiology of T2D. Despite the observed gene and pathway overlaps, there was no significant genetic correlation between variant effects on T1D and T2D risk using European ancestral summary data. Collectively, our findings support the hypothesis that T1D and T2D specific genetic variants act through genetic regulatory mechanisms to alter the regulation of common genes, and genes that co-locate in biological pathways, to mediate pleiotropic effects on disease development. Crucially, a high risk genetic profile for T1D alters biological pathways that increase the risk of developing both T1D and T2D. The same is not true for genetic profiles that increase the risk of developing T2D. The conversion of information on genetic susceptibility to the protein pathways that are altered provides an important resource for repurposing or designing novel therapies for the management of diabetes.
|
141
|
Desterke C, Turhan AG, Bennaceur-Griscelli A, Griscelli F. HLA-dependent heterogeneity and macrophage immunoproteasome activation during lung COVID-19 disease. J Transl Med 2021; 19:290. [PMID: 34225749 PMCID: PMC8256232 DOI: 10.1186/s12967-021-02965-5] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 02/23/2021] [Accepted: 06/27/2021] [Indexed: 01/10/2023]
Abstract
BACKGROUND The worldwide pandemic caused by the SARS-CoV-2 virus is characterized by significant and unpredictable heterogeneity in symptoms that remains poorly understood. METHODS Transcriptome and single-cell transcriptome data from COVID-19 lung were integrated with deep-learning analysis of the MHC class I immunopeptidome against the SARS-CoV-2 proteome. RESULTS An analysis of the transcriptomes of lung samples from COVID-19 patients revealed that activation of MHC class I antigen presentation in these tissues was correlated with the amount of SARS-CoV-2 RNA present. Similarly, a positive relationship was detected in these samples between the level of SARS-CoV-2 and the expression of a genomic cluster located in the 6p21.32 region (40 kb long, inside the MHC-II cluster) that encodes constituents of the immunoproteasome. An analysis of single-cell transcriptomes of bronchoalveolar cells highlighted the activation of the immunoproteasome in CD68+ M1 macrophages of COVID-19 patients, in addition to a PSMB8-based trajectory in these cells that featured an activation of defense response during mild cases of the disease and an impairment of alveolar clearance mechanisms during severe COVID-19. By examining the binding affinity of the SARS-CoV-2 immunopeptidome with the most common HLA-A, -B, and -C alleles worldwide, we found higher numbers of stronger presenters in type A alleles and in Asian populations, which could shed light on why this disease is now less widespread in this part of the world. CONCLUSIONS HLA-dependent heterogeneity in macrophage immunoproteasome activation during lung COVID-19 disease could have implications for efforts to predict the response to HLA-dependent SARS-CoV-2 vaccines in the global population.
Affiliation(s)
- Christophe Desterke
- INSERM UA9- University Paris-Saclay, 94800, Villejuif, France
- University Paris Saclay, Faculty of Medicine, 94275, Le Kremlin Bicêtre, France
| | - Ali G Turhan
- INSERM UA9- University Paris-Saclay, 94800, Villejuif, France
- ESTeam Paris Sud, INGESTEM National IPSC Infrastructure, University Paris-Saclay, 94800, Villejuif, France
- Division of Hematology, Kremlin-Bicetre Hospital, 94270, Kremlin Bicetre, France
- University Paris Saclay, Faculty of Medicine, 94275, Le Kremlin Bicêtre, France
| | - Annelise Bennaceur-Griscelli
- INSERM UA9- University Paris-Saclay, 94800, Villejuif, France
- ESTeam Paris Sud, INGESTEM National IPSC Infrastructure, University Paris-Saclay, 94800, Villejuif, France
- Division of Hematology, Kremlin-Bicetre Hospital, 94270, Kremlin Bicetre, France
- University Paris Saclay, Faculty of Medicine, 94275, Le Kremlin Bicêtre, France
| | - Frank Griscelli
- INSERM UA9- University Paris-Saclay, 94800, Villejuif, France
- ESTeam Paris Sud, INGESTEM National IPSC Infrastructure, University Paris-Saclay, 94800, Villejuif, France
- University of Paris, Faculty Sorbonne Paris Cité, Faculté Des Sciences Pharmaceutiques Et Biologiques, Paris, France
- Department of Biopathology, Gustave-Roussy Cancer Institute, 94800, Villejuif, France
- INSERM UA9, Institut André Lwoff, Hôpital Paul Brousse, Bâtiment A CNRS, 7 rue Guy Moquet, 94802, Villejuif, France
|
142
|
Mining Proteome Research Reports: A Bird's Eye View. Proteomes 2021; 9:proteomes9020029. [PMID: 34200663 PMCID: PMC8293458 DOI: 10.3390/proteomes9020029] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/22/2021] [Revised: 05/27/2021] [Accepted: 06/08/2021] [Indexed: 01/25/2023]
Abstract
The complexity of data has burgeoned to such an extent that scientists in every field face the constant challenge of data management. Modern analytical approaches, aided by freely available tools and programming languages, have facilitated access to the context of the various domains as well as to specific reported works. In this article, we attempt a systematic text-mining analysis of all reports on the proteome available in PubMed. The work comprises scientometrics as well as information extraction to provide publication trends, frequent keywords, bioconcepts and, most importantly, a gene–gene co-occurrence network. Out of 33,028 PMIDs collected initially, the segregation of 24,350 articles under 28 Medical Subject Headings (MeSH) was analyzed and plotted. Keyword link network and density visualizations were provided for the top 1000 most frequent MeSH keywords. Using PubTator, 322,026 bioconcepts were extracted under 10 classes (such as Gene, Disease, CellLine, etc.). Co-occurrence networks were constructed for PMID-bioconcept as well as bioconcept–bioconcept associations. Further, the gene–gene co-occurrence subnetwork comprised a total of 11,100 unique genes, with mTOR and AKT showing the highest number of connections (64). The gene p53 was the most popular one in the network according to both degree and weighted degree centrality, which were 425 and 1414, respectively. The present study is an amalgam of bibliometrics and scientific data-mining methods that looks deeper into whole-scale analysis of the available literature on the proteome.
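The abstract above reports degree and weighted degree centrality for a gene–gene co-occurrence network. As a rough illustration of what those two measures mean, and not a reconstruction of the authors' pipeline, the following Python sketch builds such a network from per-article gene lists. The input `docs` is invented toy data, the function names are hypothetical, and the sketch assumes that an edge's weight is the number of articles in which both genes are mentioned and that degree is reported as a raw neighbour count.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(docs):
    """Count, for each unordered gene pair, how many documents mention both."""
    edges = Counter()
    for genes in docs:
        # De-duplicate within a document and sort so each pair has one canonical key.
        for a, b in combinations(sorted(set(genes)), 2):
            edges[(a, b)] += 1
    return edges


def degree_centralities(edges):
    """Return (degree, weighted_degree) counters keyed by gene.

    degree: number of distinct neighbours of a gene.
    weighted_degree: sum of the edge weights incident to a gene.
    """
    degree = Counter()
    weighted = Counter()
    for (a, b), weight in edges.items():
        degree[a] += 1
        degree[b] += 1
        weighted[a] += weight
        weighted[b] += weight
    return degree, weighted


# Toy input (assumption): gene mentions per article, e.g. as returned by a tagger.
docs = [
    ["TP53", "MTOR", "AKT1"],
    ["TP53", "MTOR"],
    ["TP53", "EGFR"],
]

edges = cooccurrence_edges(docs)
degree, weighted = degree_centralities(edges)

for gene, deg in degree.most_common():
    print(gene, deg, weighted[gene])
```

In this toy network TP53 has three neighbours and a weighted degree of four, because its edge to MTOR is supported by two articles. A graph library would normally replace the hand-rolled counters for a network of the size reported in the abstract.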
|
143
|
Garda S, Schwarz JM, Schuelke M, Leser U, Seelow D. Public data sources for regulatory genomic features. MED GENET-BERLIN 2021; 33:167-177. [PMID: 38836022 PMCID: PMC11113004 DOI: 10.1515/medgen-2021-2075] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/15/2021] [Accepted: 06/24/2021] [Indexed: 06/06/2024]
Abstract
High-throughput technologies have led to a continuously growing amount of information about regulatory features in the genome. A wealth of data generated by large international research consortia is available from online databases. Disease-driven studies provide details on specific DNA elements or epigenetic modifications regulating gene expression in specific cellular and developmental contexts, but these results are usually only published in scientific articles. All this information can be helpful in interpreting variants in the regulatory genome. This review describes a selection of high-profile data sources providing information on the non-coding genome, as well as pitfalls and techniques to search and capture information from the literature.
Affiliation(s)
- Samuele Garda
- Knowledge Management in Bioinformatics, Institute for Computer Science, Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin, Germany
| | - Jana Marie Schwarz
- Department of Neuropediatrics, Charité-Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- NeuroCure Cluster of Excellence, Charité-Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
| | - Markus Schuelke
- Department of Neuropediatrics, Charité-Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- NeuroCure Cluster of Excellence, Charité-Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
| | - Ulf Leser
- Knowledge Management in Bioinformatics, Institute for Computer Science, Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin, Germany
| | - Dominik Seelow
- BIH-Bioinformatics and Translational Genetics, Berlin Institute of Health at Charité-Universitätsmedizin Berlin, Berlin, Germany
- Institute for Medical and Human Genetics, Charité-Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
| |
|
144
|
Islamaj R, Wei CH, Cissel D, Miliaras N, Printseva O, Rodionov O, Sekiya K, Ward J, Lu Z. NLM-Gene, a richly annotated gold standard dataset for gene entities that addresses ambiguity and multi-species gene recognition. J Biomed Inform 2021; 118:103779. [PMID: 33839304 PMCID: PMC11037554 DOI: 10.1016/j.jbi.2021.103779] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 01/04/2021] [Revised: 03/14/2021] [Accepted: 04/05/2021] [Indexed: 10/21/2022]
Abstract
The automatic recognition of gene names and their corresponding database identifiers in biomedical text is an important first step for many downstream text-mining applications. While current methods for tagging gene entities have been developed for biomedical literature, their performance on species other than human is substantially lower due to the lack of annotation data. We therefore present the NLM-Gene corpus, a high-quality manually annotated corpus for genes developed at the US National Library of Medicine (NLM), covering ambiguous gene names, with an average of 29 gene mentions (10 unique identifiers) per document, and a broader representation of different species (including Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Arabidopsis thaliana, Danio rerio, etc.) when compared to previous gene annotation corpora. NLM-Gene consists of 550 PubMed abstracts from 156 biomedical journals, doubly annotated by six experienced NLM indexers, randomly paired for each document to control for bias. The annotators worked in three annotation rounds until they reached complete agreement. This gold-standard corpus can serve as a benchmark to develop & test new gene text mining algorithms. Using this new resource, we have developed a new gene finding algorithm based on deep learning which improved both on precision and recall from existing tools. The NLM-Gene annotated corpus is freely available at ftp://ftp.ncbi.nlm.nih.gov/pub/lu/NLMGene. We have also applied this tool to the entire PubMed/PMC with their results freely accessible through our web-based tool PubTator (www.ncbi.nlm.nih.gov/research/pubtator).
Affiliation(s)
- Rezarta Islamaj
- National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Chih-Hsuan Wei
- National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - David Cissel
- National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Nicholas Miliaras
- National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Olga Printseva
- National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Oleg Rodionov
- National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Keiko Sekiya
- National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Janice Ward
- National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Zhiyong Lu
- National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| |
|
145
|
Abstract
The SARS-CoV-2 pandemic has caused a surge in research exploring all aspects of the virus and its effects on human health. The overwhelming publication rate means that researchers are unable to keep abreast of the literature. To ameliorate this, we present the CoronaCentral resource that uses machine learning to process the research literature on SARS-CoV-2 together with SARS-CoV and MERS-CoV. We categorize the literature into useful topics and article types and enable analysis of the contents, pace, and emphasis of research during the crisis with integration of Altmetric data. These topics include therapeutics, disease forecasting, as well as growing areas such as “long COVID” and studies of inequality. This resource, available at https://coronacentral.ai, is updated daily.
|
146
|
Macnee M, Pérez-Palma E, Schumacher-Bass S, Dalton J, Leu C, Blankenberg D, Lal D. SimText: A text mining framework for interactive analysis and visualization of similarities among biomedical entities. Bioinformatics 2021; 37:4285-4287. [PMID: 34037702 PMCID: PMC9502138 DOI: 10.1093/bioinformatics/btab365] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/30/2020] [Revised: 04/07/2021] [Accepted: 05/24/2021] [Indexed: 11/23/2022]
Abstract
Summary: Literature exploration in PubMed on a large number of biomedical entities (e.g. genes, diseases or experiments) can be time-consuming and challenging, especially when assessing associations between entities. Here, we describe SimText, a user-friendly toolset that provides customizable and systematic workflows for the analysis of similarities among a set of entities based on text. SimText can be used for (i) text collection from PubMed and extraction of words with different text mining approaches, and (ii) interactive analysis and visualization of data using unsupervised learning techniques in an interactive app. Availability and implementation: We developed SimText as an open-source R software and integrated it into Galaxy (https://usegalaxy.eu), an online data analysis platform with supporting self-learning training material available at https://training.galaxyproject.org. A command-line version of the toolset is available for download from GitHub (https://github.com/dlal-group/simtext) or as a Docker image (https://hub.docker.com/r/dlalgroup/simtext/tags). Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Marie Macnee
- Cologne Center for Genomics (CCG), Medical Faculty of the University of Cologne, University Hospital of Cologne, Cologne, 50931, Germany
| | - Eduardo Pérez-Palma
- Universidad del Desarrollo, Centro de Genética y Genómica, Facultad de Medicina Clínica Alemana, Santiago, Chile
| | | | - Jarrod Dalton
- Department of Quantitative Health Sciences, Cleveland Clinic, Cleveland, Ohio, 44195, USA
| | - Costin Leu
- Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, OH, 44195, USA
| | - Daniel Blankenberg
- Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, OH, 44195, USA
| | - Dennis Lal
- Cologne Center for Genomics (CCG), Medical Faculty of the University of Cologne, University Hospital of Cologne, Cologne, 50931, Germany
- Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, OH, 44195, USA
- Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
- Epilepsy Center, Neurological Institute, Cleveland Clinic, Cleveland, OH, 44195, USA
| |
|
147
|
Wu M, Zhang Y, Grosser M, Tipper S, Venter D, Lin H, Lu J. Profiling COVID-19 Genetic Research: A Data-Driven Study Utilizing Intelligent Bibliometrics. Front Res Metr Anal 2021; 6:683212. [PMID: 34109284 PMCID: PMC8184093 DOI: 10.3389/frma.2021.683212] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 03/20/2021] [Accepted: 05/06/2021] [Indexed: 12/14/2022]
Abstract
The COVID-19 pandemic constitutes an ongoing worldwide threat to human society and has caused massive impacts on global public health, the economy and the political landscape. The key to gaining control of the disease lies in understanding the genetics of SARS-CoV-2 and the disease spectrum that follows infection. This study leverages traditional and intelligent bibliometric methods to conduct a multi-dimensional analysis on 5,632 COVID-19 genetic research papers, revealing that 1) the key players include research institutions from the United States, China, Britain and Canada; 2) research topics predominantly focus on virus infection mechanisms, virus testing, gene expression related to the immune reactions and patient clinical manifestation; 3) studies originated from the comparison of SARS-CoV-2 to previous human coronaviruses, following which research directions diverge into the analysis of virus molecular structure and genetics, the human immune response, vaccine development and gene expression related to immune responses; and 4) genes that are frequently highlighted include ACE2, IL6, TMPRSS2, and TNF. Genes newly emerging in COVID-19 research include FURIN, CXCL10, OAS1, OAS2, OAS3, and ISG15. This study demonstrates that our suite of novel bibliometric tools could help biomedical researchers follow this rapidly growing field and provide substantial evidence for policymakers’ decision-making on science policy and public health administration.
Affiliation(s)
- Mengjia Wu
- Australian Artificial Intelligence Institute, Faculty of Engineering and Information Technology, University of Technology Sydney, Ultimo, NSW, Australia
| | - Yi Zhang
- Australian Artificial Intelligence Institute, Faculty of Engineering and Information Technology, University of Technology Sydney, Ultimo, NSW, Australia
| | | | | | | | - Hua Lin
- 23Strands, Pyrmont, NSW, Australia
| | - Jie Lu
- Australian Artificial Intelligence Institute, Faculty of Engineering and Information Technology, University of Technology Sydney, Ultimo, NSW, Australia
| |
|
148
|
LabelRS: An Automated Toolbox to Make Deep Learning Samples from Remote Sensing Images. REMOTE SENSING 2021. [DOI: 10.3390/rs13112064] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 12/15/2022]
Abstract
Deep learning technology has achieved great success in the field of remote sensing processing. However, tools for making deep learning samples from remote sensing images are lacking, so researchers have to rely on a small number of existing public data sets, which may affect learning performance. Therefore, we developed an add-in (LabelRS) based on ArcGIS to help researchers make their own deep learning samples in a simple way. In this work, we proposed a feature merging strategy that enables LabelRS to automatically adapt to both sparsely distributed and densely distributed scenarios. LabelRS solves the problem of size diversity of the targets in remote sensing images through sliding windows. We have designed and built in multiple band stretching, image resampling, and gray level transformation algorithms for LabelRS to deal with the high spectral remote sensing images. In addition, the attached geographic information helps to achieve seamless conversion between natural samples and geographic samples. To evaluate the reliability of LabelRS, we used its three sub-tools to make semantic segmentation, object detection and image classification samples, respectively. The experimental results show that LabelRS can produce deep learning samples with remote sensing images automatically and efficiently.
|
149
|
Lee K, Wei CH, Lu Z. Recent advances of automated methods for searching and extracting genomic variant information from biomedical literature. Brief Bioinform 2021; 22:bbaa142. [PMID: 32770181 PMCID: PMC8138883 DOI: 10.1093/bib/bbaa142] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 04/01/2020] [Revised: 06/07/2020] [Accepted: 06/25/2020] [Indexed: 12/28/2022]
Abstract
MOTIVATION: To obtain key information for personalized medicine and cancer research, clinicians and researchers in the biomedical field need, now more than ever before, to search for genomic variant information in the biomedical literature. Due to the various written forms of genomic variants, however, it is difficult to locate the right information in the literature when using a general literature search system. To address this difficulty, researchers have suggested various solutions based on automated literature-mining techniques. There is, however, no study summarizing and comparing existing tools for genomic variant literature mining in terms of how to search easily for information in the literature on genomic variants. RESULTS: In this article, we systematically compared currently available genomic variant recognition and normalization tools as well as the literature search engines that adopted these literature-mining techniques. First, we explain the problems that are caused by the use of non-standard formats of genomic variants in the PubMed literature by considering examples from the literature and show the prevalence of the problem. Second, we review literature-mining tools that address the problem by recognizing and normalizing the various forms of genomic variants in the literature and systematically compare them. Third, we present and compare existing literature search engines that are designed for a genomic variant search by using the literature-mining techniques. We expect this work to be helpful for researchers who seek information about genomic variants from the literature, and for developers who integrate genomic variant information from the literature and beyond.
Affiliation(s)
- Kyubum Lee
- National Center for Biotechnology Information
| | | | - Zhiyong Lu
- National Center for Biotechnology Information
| |
|
150
|
Núñez-Carpintero I, Petrizzelli M, Zinovyev A, Cirillo D, Valencia A. The multilayer community structure of medulloblastoma. iScience 2021; 24:102365. [PMID: 33889829 PMCID: PMC8050854 DOI: 10.1016/j.isci.2021.102365] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/13/2020] [Revised: 03/17/2021] [Accepted: 03/24/2021] [Indexed: 01/20/2023]
Abstract
Multilayer networks allow interpreting the molecular basis of diseases, which is particularly challenging in rare diseases where the number of cases is small compared with the size of the associated multi-omics datasets. In this work, we develop a dimensionality reduction methodology to identify the minimal set of genes that characterize disease subgroups based on their persistent association in multilayer network communities. We apply this approach to the study of medulloblastoma, a childhood brain tumor, using proteogenomic data. Our approach is able to recapitulate known medulloblastoma subgroups (accuracy >94%) and provide a clear characterization of gene associations, with the downstream implications for diagnosis and therapeutic interventions. We verified the general applicability of our method on an independent medulloblastoma dataset (accuracy >98%). This approach opens the door to a new generation of multilayer network-based methods able to overcome the specific dimensionality limitations of rare disease datasets. The molecular interpretation of rare diseases is a challenging task. Multilayer networks allow patient stratification and explainability. We identify subgroup-specific genes and multilayer associations in medulloblastoma. Multilayer community analysis enables the molecular interpretation of rare diseases.
Affiliation(s)
| | - Marianyela Petrizzelli
- Institut Curie, PSL Research University, 75005 Paris, France
- INSERM, U900, 75005 Paris, France
- MINES ParisTech, PSL Research University, CBIO-Centre for Computational Biology, 75006 Paris, France
| | - Andrei Zinovyev
- Institut Curie, PSL Research University, 75005 Paris, France
- INSERM, U900, 75005 Paris, France
- MINES ParisTech, PSL Research University, CBIO-Centre for Computational Biology, 75006 Paris, France
- Lobachevsky University, 603000 Nizhny Novgorod, Russia
| | - Davide Cirillo
- Barcelona Supercomputing Center (BSC), C/ Jordi Girona 29, 08034, Barcelona, Spain
- Corresponding author
| | - Alfonso Valencia
- Barcelona Supercomputing Center (BSC), C/ Jordi Girona 29, 08034, Barcelona, Spain
- ICREA - Institució Catalana de Recerca i Estudis Avançats, Pg. Lluís Companys 23, 08010, Barcelona, Spain
| |
|