1
|
Allot A, Wei CH, Phan L, Hefferon T, Landrum M, Rehm HL, Lu Z. Tracking genetic variants in the biomedical literature using LitVar 2.0. Nat Genet 2023; 55:901-903. [PMID: 37268776 PMCID: PMC11096795 DOI: 10.1038/s41588-023-01414-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/04/2023]
Affiliation(s)
- Alexis Allot
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
| | - Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
| | - Lon Phan
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
| | - Timothy Hefferon
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
| | - Melissa Landrum
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
| | - Heidi L Rehm
- Program in Medical and Population Genetics, Broad institute of MIT and Harvard, Cambridge, MA, USA
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA.
| |
Collapse
|
2
|
Pasche E, Mottaz A, Caucheteur D, Gobeill J, Michel PA, Ruch P. Variomes: a high recall search engine to support the curation of genomic variants. Bioinformatics 2022; 38:2595-2601. [PMID: 35274687 PMCID: PMC9048643 DOI: 10.1093/bioinformatics/btac146] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2021] [Revised: 02/07/2022] [Accepted: 03/10/2022] [Indexed: 12/02/2022] Open
Abstract
Motivation Identification and interpretation of clinically actionable variants is a critical bottleneck. Searching for evidence in the literature is mandatory according to ASCO/AMP/CAP practice guidelines; however, it is both labor-intensive and error-prone. We developed a system to perform triage of publications relevant to support an evidence-based decision. The system is also able to prioritize variants. Our system searches within pre-annotated collections such as MEDLINE and PubMed Central. Results We assess the search effectiveness of the system using three different experimental settings: literature triage; variant prioritization and comparison of Variomes with LitVar. Almost two-thirds of the publications returned in the top-5 are relevant for clinical decision-support. Our approach enabled identifying 81.8% of clinically actionable variants in the top-3. Variomes retrieves on average +21.3% more articles than LitVar and returns the same number of results or more results than LitVar for 90% of the queries when tested on a set of 803 queries; thus, establishing a new baseline for searching the literature about variants. Availability and implementation Variomes is publicly available at https://candy.hesge.ch/Variomes. Source code is freely available at https://github.com/variomes/sibtm-variomes. SynVar is publicly available at https://goldorak.hesge.ch/synvar. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Emilie Pasche
- SIB Text Mining Group, Swiss Institute of Bioinformatics, 1206 Geneva, Switzerland.,BiTeM Group, Information Sciences, 1227 Carouge, Switzerland HES-SO/HEG
| | - Anaïs Mottaz
- SIB Text Mining Group, Swiss Institute of Bioinformatics, 1206 Geneva, Switzerland.,BiTeM Group, Information Sciences, 1227 Carouge, Switzerland HES-SO/HEG
| | - Déborah Caucheteur
- SIB Text Mining Group, Swiss Institute of Bioinformatics, 1206 Geneva, Switzerland.,BiTeM Group, Information Sciences, 1227 Carouge, Switzerland HES-SO/HEG
| | - Julien Gobeill
- SIB Text Mining Group, Swiss Institute of Bioinformatics, 1206 Geneva, Switzerland.,BiTeM Group, Information Sciences, 1227 Carouge, Switzerland HES-SO/HEG
| | - Pierre-André Michel
- SIB Text Mining Group, Swiss Institute of Bioinformatics, 1206 Geneva, Switzerland.,BiTeM Group, Information Sciences, 1227 Carouge, Switzerland HES-SO/HEG
| | - Patrick Ruch
- SIB Text Mining Group, Swiss Institute of Bioinformatics, 1206 Geneva, Switzerland.,BiTeM Group, Information Sciences, 1227 Carouge, Switzerland HES-SO/HEG
| |
Collapse
|
3
|
Adams T, Namysl M, Kodamullil AT, Behnke S, Jacobs M. Benchmarking table recognition performance on biomedical literature on neurological disorders. Bioinformatics 2022; 38:1624-1630. [PMID: 34935870 DOI: 10.1093/bioinformatics/btab843] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/18/2021] [Revised: 12/07/2021] [Accepted: 12/18/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Table recognition systems are widely used to extract and structure quantitative information from the vast amount of documents that are increasingly available from different open sources. While many systems already perform well on tables with a simple layout, tables in the biomedical domain are often much more complex. Benchmark and training data for such tables are however very limited. RESULTS To address this issue, we present a novel, highly curated benchmark dataset based on a hand-curated literature corpus on neurological disorders, which can be used to tune and evaluate table extraction applications for this challenging domain. We evaluate several state-of-the-art table extraction systems based on our proposed benchmark and discuss challenges that emerged during the benchmark creation as well as factors that can impact the performance of recognition methods. For the evaluation procedure, we propose a new metric as well as several improvements that result in a better performance evaluation. AVAILABILITY AND IMPLEMENTATION The resulting benchmark dataset (https://zenodo.org/record/5549977) as well as the source code to our novel evaluation approach can be openly accessed. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Tim Adams
- Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, Sankt Augustin 53757, Germany
| | - Marcin Namysl
- Fraunhofer Institute for Intelligent Analysis and Information Systems, Schloss Birlinghoven, Sankt Augustin 53757, Germany.,Autonomous Intelligent Systems, Computer Science Institute VI, University of Bonn, Bonn 53115, Germany
| | - Alpha Tom Kodamullil
- Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, Sankt Augustin 53757, Germany
| | - Sven Behnke
- Fraunhofer Institute for Intelligent Analysis and Information Systems, Schloss Birlinghoven, Sankt Augustin 53757, Germany.,Autonomous Intelligent Systems, Computer Science Institute VI, University of Bonn, Bonn 53115, Germany
| | - Marc Jacobs
- Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, Sankt Augustin 53757, Germany
| |
Collapse
|
4
|
|
5
|
Saberian N, Shafi A, Peyvandipour A, Draghici S. MAGPEL: an autoMated pipeline for inferring vAriant-driven Gene PanEls from the full-length biomedical literature. Sci Rep 2020; 10:12365. [PMID: 32703994 PMCID: PMC7378213 DOI: 10.1038/s41598-020-68649-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/13/2019] [Accepted: 06/17/2020] [Indexed: 11/09/2022] Open
Abstract
In spite of the efforts in developing and maintaining accurate variant databases, a large number of disease-associated variants are still hidden in the biomedical literature. Curation of the biomedical literature in an effort to extract this information is a challenging task due to: (i) the complexity of natural language processing, (ii) inconsistent use of standard recommendations for variant description, and (iii) the lack of clarity and consistency in describing the variant-genotype-phenotype associations in the biomedical literature. In this article, we employ text mining and word cloud analysis techniques to address these challenges. The proposed framework extracts the variant-gene-disease associations from the full-length biomedical literature and designs an evidence-based variant-driven gene panel for a given condition. We validate the identified genes by showing their diagnostic abilities to predict the patients' clinical outcome on several independent validation cohorts. As representative examples, we present our results for acute myeloid leukemia (AML), breast cancer and prostate cancer. We compare these panels with other variant-driven gene panels obtained from Clinvar, Mastermind and others from literature, as well as with a panel identified with a classical differentially expressed genes (DEGs) approach. The results show that the panels obtained by the proposed framework yield better results than the other gene panels currently available in the literature.
Collapse
Affiliation(s)
- Nafiseh Saberian
- Department of Computer Science, Wayne State University, Detroit, MI, USA
| | - Adib Shafi
- Department of Computer Science, Wayne State University, Detroit, MI, USA
| | - Azam Peyvandipour
- Department of Computer Science, Wayne State University, Detroit, MI, USA
| | - Sorin Draghici
- Department of Computer Science, Wayne State University, Detroit, MI, USA.
- Department of Obstetrics and Gynecology, Wayne State University, Detroit, MI, USA.
| |
Collapse
|
6
|
Wei CH, Allot A, Leaman R, Lu Z. PubTator central: automated concept annotation for biomedical full text articles. Nucleic Acids Res 2020; 47:W587-W593. [PMID: 31114887 DOI: 10.1093/nar/gkz389] [Citation(s) in RCA: 188] [Impact Index Per Article: 47.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/17/2019] [Revised: 04/08/2019] [Accepted: 04/30/2019] [Indexed: 11/12/2022] Open
Abstract
PubTator Central (https://www.ncbi.nlm.nih.gov/research/pubtator/) is a web service for viewing and retrieving bioconcept annotations in full text biomedical articles. PubTator Central (PTC) provides automated annotations from state-of-the-art text mining systems for genes/proteins, genetic variants, diseases, chemicals, species and cell lines, all available for immediate download. PTC annotates PubMed (29 million abstracts) and the PMC Text Mining subset (3 million full text articles). The new PTC web interface allows users to build full text document collections and visualize concept annotations in each document. Annotations are downloadable in multiple formats (XML, JSON and tab delimited) via the online interface, a RESTful web service and bulk FTP. Improved concept identification systems and a new disambiguation module based on deep learning increase annotation accuracy, and the new server-side architecture is significantly faster. PTC is synchronized with PubMed and PubMed Central, with new articles added daily. The original PubTator service has served annotated abstracts for ∼300 million requests, enabling third-party research in use cases such as biocuration support, gene prioritization, genetic disease analysis, and literature-based knowledge discovery. We demonstrate the full text results in PTC significantly increase biomedical concept coverage and anticipate this expansion will both enhance existing downstream applications and enable new use cases.
Collapse
Affiliation(s)
- Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
| | - Alexis Allot
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
| | - Robert Leaman
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
| |
Collapse
|
7
|
Guin D, Rani J, Singh P, Grover S, Bora S, Talwar P, Karthikeyan M, Satyamoorthy K, Adithan C, Ramachandran S, Saso L, Hasija Y, Kukreti R. Global Text Mining and Development of Pharmacogenomic Knowledge Resource for Precision Medicine. Front Pharmacol 2019; 10:839. [PMID: 31447668 PMCID: PMC6692532 DOI: 10.3389/fphar.2019.00839] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/07/2019] [Accepted: 07/01/2019] [Indexed: 11/20/2022] Open
Abstract
Understanding patients' genomic variations and their effect in protecting or predisposing them to drug response phenotypes is important for providing personalized healthcare. Several studies have manually curated such genotype-phenotype relationships into organized databases from clinical trial data or published literature. However, there are no text mining tools available to extract high-accuracy information from such existing knowledge. In this work, we used a semiautomated text mining approach to retrieve a complete pharmacogenomic (PGx) resource integrating disease-drug-gene-polymorphism relationships to derive a global perspective for ease in therapeutic approaches. We used an R package, pubmed.mineR, to automatically retrieve PGx-related literature. We identified 1,753 disease types, and 666 drugs, associated with 4,132 genes and 33,942 polymorphisms collated from 180,088 publications. With further manual curation, we obtained a total of 2,304 PGx relationships. We evaluated our approach by performance (precision = 0.806) with benchmark datasets like Pharmacogenomic Knowledgebase (PharmGKB) (0.904), Online Mendelian Inheritance in Man (OMIM) (0.600), and The Comparative Toxicogenomics Database (CTD) (0.729). We validated our study by comparing our results with 362 commercially used the US- Food and drug administration (FDA)-approved drug labeling biomarkers. Of the 2,304 PGx relationships identified, 127 belonged to the FDA list of 362 approved pharmacogenomic markers, indicating that our semiautomated text mining approach may reveal significant PGx information with markers for drug response prediction. In addition, it is a scalable and state-of-art approach in curation for PGx clinical utility.
Collapse
Affiliation(s)
- Debleena Guin
- Genomics and Molecular Medicine Unit, Council of Scientific and Industrial Research (CSIR)—Institute of Genomics and Integrative Biology (IGIB), New Delhi, India
- Department of Biotechnology, Delhi Technological University, Delhi, India
| | - Jyoti Rani
- Department of Biomedical Sciences, Acharya Narayan Dev College, University of Delhi, New Delhi, India
- G N Ramachandran Knowledge Centre, Council of Scientific and Industrial Research (CSIR)—Institute of Genomics and Integrative Biology (IGIB), New Delhi, India
| | - Priyanka Singh
- Genomics and Molecular Medicine Unit, Council of Scientific and Industrial Research (CSIR)—Institute of Genomics and Integrative Biology (IGIB), New Delhi, India
- Academy of Scientific & Innovative Research (AcSIR), New Delhi, India
| | - Sandeep Grover
- Institute of Medical Biometry and Statistics, University of Lübeck University Medical Center Schleswig-Holstein - Campus Lübeck, Lübeck, Germany
| | - Shivangi Bora
- Genomics and Molecular Medicine Unit, Council of Scientific and Industrial Research (CSIR)—Institute of Genomics and Integrative Biology (IGIB), New Delhi, India
- Department of Biotechnology, Delhi Technological University, Delhi, India
| | - Puneet Talwar
- Institute of Human Behaviour and Allied Sciences, Delhi, India
| | | | - K Satyamoorthy
- School of Life Sciences, Manipal University, Manipal, India
| | - C Adithan
- Central Inter-Disciplinary Research Facility (CIDRF), Pondicherry, India
| | - S Ramachandran
- G N Ramachandran Knowledge Centre, Council of Scientific and Industrial Research (CSIR)—Institute of Genomics and Integrative Biology (IGIB), New Delhi, India
- Academy of Scientific & Innovative Research (AcSIR), New Delhi, India
| | - Luciano Saso
- Department of Physiology and Pharmacology “Vittorio Erspamer,” Sapienza University of Rome, Rome, Italy
| | - Yasha Hasija
- Department of Biotechnology, Delhi Technological University, Delhi, India
| | - Ritushree Kukreti
- Genomics and Molecular Medicine Unit, Council of Scientific and Industrial Research (CSIR)—Institute of Genomics and Integrative Biology (IGIB), New Delhi, India
- Academy of Scientific & Innovative Research (AcSIR), New Delhi, India
| |
Collapse
|
8
|
McAlpine JB, Chen SN, Kutateladze A, MacMillan JB, Appendino G, Barison A, Beniddir MA, Biavatti MW, Bluml S, Boufridi A, Butler MS, Capon RJ, Choi YH, Coppage D, Crews P, Crimmins MT, Csete M, Dewapriya P, Egan JM, Garson MJ, Genta-Jouve G, Gerwick WH, Gross H, Harper MK, Hermanto P, Hook JM, Hunter L, Jeannerat D, Ji NY, Johnson TA, Kingston DGI, Koshino H, Lee HW, Lewin G, Li J, Linington RG, Liu M, McPhail KL, Molinski TF, Moore BS, Nam JW, Neupane RP, Niemitz M, Nuzillard JM, Oberlies NH, Ocampos FMM, Pan G, Quinn RJ, Reddy DS, Renault JH, Rivera-Chávez J, Robien W, Saunders CM, Schmidt TJ, Seger C, Shen B, Steinbeck C, Stuppner H, Sturm S, Taglialatela-Scafati O, Tantillo DJ, Verpoorte R, Wang BG, Williams CM, Williams PG, Wist J, Yue JM, Zhang C, Xu Z, Simmler C, Lankin DC, Bisson J, Pauli GF. The value of universally available raw NMR data for transparency, reproducibility, and integrity in natural product research. Nat Prod Rep 2019; 36:35-107. [PMID: 30003207 PMCID: PMC6350634 DOI: 10.1039/c7np00064b] [Citation(s) in RCA: 74] [Impact Index Per Article: 14.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/18/2017] [Indexed: 12/20/2022]
Abstract
Covering: up to 2018With contributions from the global natural product (NP) research community, and continuing the Raw Data Initiative, this review collects a comprehensive demonstration of the immense scientific value of disseminating raw nuclear magnetic resonance (NMR) data, independently of, and in parallel with, classical publishing outlets. A comprehensive compilation of historic to present-day cases as well as contemporary and future applications show that addressing the urgent need for a repository of publicly accessible raw NMR data has the potential to transform natural products (NPs) and associated fields of chemical and biomedical research. The call for advancing open sharing mechanisms for raw data is intended to enhance the transparency of experimental protocols, augment the reproducibility of reported outcomes, including biological studies, become a regular component of responsible research, and thereby enrich the integrity of NP research and related fields.
Collapse
Affiliation(s)
- James B McAlpine
- Center for Natural Product Technologies (CENAPT), Program for Collaborative Research in the Pharmaceutical Sciences (PCRPS), Department of Medicinal Chemistry and Pharmacognosy, College of Pharmacy, University of Illinois at Chicago, 833 S. Wood St., Chicago, IL 60612, USA. ,
| | - Shao-Nong Chen
- Center for Natural Product Technologies (CENAPT), Program for Collaborative Research in the Pharmaceutical Sciences (PCRPS), Department of Medicinal Chemistry and Pharmacognosy, College of Pharmacy, University of Illinois at Chicago, 833 S. Wood St., Chicago, IL 60612, USA. ,
| | - Andrei Kutateladze
- Department of Chemistry and Biochemistry, University of Denver, Denver, CO 80210, USA
| | - John B MacMillan
- Department of Chemistry and Biochemistry, University of California, Santa Cruz, CA 95064, USA
| | - Giovanni Appendino
- Dipartimento di Scienze Chimiche, Alimentari, Farmaceutiche e Farmacologiche, Universita` del Piemonte Orientale, Via Bovio 6, 28100 Novara, Italy
| | | | - Mehdi A Beniddir
- Équipe "Pharmacognosie-Chimie des Substances Naturelles" BioCIS, Univ. Paris-Sud, CNRS, Université Paris-Saclay, 5 rue J.-B. Clément, 92290 Châtenay-Malabry, France
| | - Maique W Biavatti
- Department of Pharmaceutical Sciences, Federal University of Santa Catarina, Florianópolis, Brazil
| | - Stefan Bluml
- University of Southern California, Keck School of Medicine, Los Angeles, CA 90089, USA
| | - Asmaa Boufridi
- Griffith Institute for Drug Discovery, Griffith University, Brisbane, QLD 4111, Australia
| | - Mark S Butler
- Institute for Molecular Bioscience, The University of Queensland, St. Lucia, QLD 4072, Australia
| | - Robert J Capon
- Institute for Molecular Bioscience, The University of Queensland, St. Lucia, QLD 4072, Australia
| | - Young H Choi
- Division of Pharmacognosy, Section Metabolomics, Institute of Biology, Leiden University, P.O. Box 9502, 2300 RA Leiden, The Netherlands
| | - David Coppage
- Department of Chemistry and Biochemistry, University of California, Santa Cruz, CA 95064, USA
| | - Phillip Crews
- Department of Chemistry and Biochemistry, University of California, Santa Cruz, CA 95064, USA
| | - Michael T Crimmins
- Kenan and Caudill Laboratories of Chemistry, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
| | - Marie Csete
- University of Southern California, Huntington Medical Research Institutes, 99 N. El Molino Ave., Pasadena, CA 91101, USA
| | - Pradeep Dewapriya
- Institute for Molecular Bioscience, The University of Queensland, St. Lucia, QLD 4072, Australia
| | - Joseph M Egan
- Department of Chemistry, Simon Fraser University, Burnaby, BC V5A 1S6, Canada
| | - Mary J Garson
- School of Chemistry and Molecular Sciences, University of Queensland, St. Lucia, QLD 4072, Australia
| | - Grégory Genta-Jouve
- C-TAC, UMR 8638 CNRS, Faculté de Pharmacie de Paris, Paris-Descartes University, Sorbonne, Paris Cité, 4, Aveue de l'Observatoire, 75006 Paris, France
| | - William H Gerwick
- Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California, La Jolla, San Diego, CA 92093, USA and Center for Marine Biotechnology and Biomedicine, Scripps Institution of Oceanography, La Jolla, CA 92093, USA
| | - Harald Gross
- Pharmaceutical Institute, Department of Pharmaceutical Biology, Eberhard Karls University of Tübingen, Auf der Morgenstelle 8, 72076 Tübingen, Germany
| | - Mary Kay Harper
- Department of Medicinal Chemistry, University of Utah, Salt Lake City, UT 84112, USA
| | - Precilia Hermanto
- NMR Facility, Mark Wainwright Analytical Centre, University of New South Wales, Sydney, NSW 2052, Australia
| | - James M Hook
- NMR Facility, Mark Wainwright Analytical Centre, University of New South Wales, Sydney, NSW 2052, Australia
| | - Luke Hunter
- NMR Facility, Mark Wainwright Analytical Centre, University of New South Wales, Sydney, NSW 2052, Australia
| | - Damien Jeannerat
- University of Geneva, Department of Organic Chemistry, 30 quai E. Ansermet, CH 1211 Geneva 4, Switzerland
| | - Nai-Yun Ji
- Yantai Institute of Coastal Zone Research, Chinese Academy of Sciences, Chunhui Road 17, Yantai 264003, People's Republic of China
| | - Tyler A Johnson
- Department of Chemistry and Biochemistry, University of California, Santa Cruz, CA 95064, USA
| | - David G I Kingston
- Department of Chemistry, M/C 0212, Virginia Polytechnic Institute and State University, Blacksburg, VA 24061, USA
| | - Hiroyuki Koshino
- RIKEN Center for Sustainable Resource Science, Wako, Saitama, 351-0198, Japan
| | - Hsiau-Wei Lee
- Department of Chemistry and Biochemistry, University of California, Santa Cruz, CA 95064, USA
| | - Guy Lewin
- Équipe "Pharmacognosie-Chimie des Substances Naturelles" BioCIS, Univ. Paris-Sud, CNRS, Université Paris-Saclay, 5 rue J.-B. Clément, 92290 Châtenay-Malabry, France
| | - Jie Li
- Center for Marine Biotechnology and Biomedicine, Scripps Institution of Oceanography, La Jolla, CA 92093, USA
| | - Roger G Linington
- Department of Chemistry, Simon Fraser University, Burnaby, BC V5A 1S6, Canada
| | - Miaomiao Liu
- Griffith Institute for Drug Discovery, Griffith University, Brisbane, QLD 4111, Australia
| | - Kerry L McPhail
- Department of Pharmaceutical Sciences, College of Pharmacy, Oregon State University, Corvallis, OR 97331, USA
| | - Tadeusz F Molinski
- Center for Natural Product Technologies (CENAPT), Program for Collaborative Research in the Pharmaceutical Sciences (PCRPS), Department of Medicinal Chemistry and Pharmacognosy, College of Pharmacy, University of Illinois at Chicago, 833 S. Wood St., Chicago, IL 60612, USA. , and
| | - Bradley S Moore
- Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California, La Jolla, San Diego, CA 92093, USA and Center for Marine Biotechnology and Biomedicine, Scripps Institution of Oceanography, La Jolla, CA 92093, USA
| | - Joo-Won Nam
- Center for Natural Product Technologies (CENAPT), Program for Collaborative Research in the Pharmaceutical Sciences (PCRPS), Department of Medicinal Chemistry and Pharmacognosy, College of Pharmacy, University of Illinois at Chicago, 833 S. Wood St., Chicago, IL 60612, USA. , and
| | - Ram P Neupane
- Center for Natural Product Technologies (CENAPT), Program for Collaborative Research in the Pharmaceutical Sciences (PCRPS), Department of Medicinal Chemistry and Pharmacognosy, College of Pharmacy, University of Illinois at Chicago, 833 S. Wood St., Chicago, IL 60612, USA. , and
| | - Matthias Niemitz
- Center for Natural Product Technologies (CENAPT), Program for Collaborative Research in the Pharmaceutical Sciences (PCRPS), Department of Medicinal Chemistry and Pharmacognosy, College of Pharmacy, University of Illinois at Chicago, 833 S. Wood St., Chicago, IL 60612, USA. , and
| | - Jean-Marc Nuzillard
- Center for Natural Product Technologies (CENAPT), Program for Collaborative Research in the Pharmaceutical Sciences (PCRPS), Department of Medicinal Chemistry and Pharmacognosy, College of Pharmacy, University of Illinois at Chicago, 833 S. Wood St., Chicago, IL 60612, USA. , and
| | - Nicholas H Oberlies
- Center for Natural Product Technologies (CENAPT), Program for Collaborative Research in the Pharmaceutical Sciences (PCRPS), Department of Medicinal Chemistry and Pharmacognosy, College of Pharmacy, University of Illinois at Chicago, 833 S. Wood St., Chicago, IL 60612, USA. , and
| | | | - Guohui Pan
- Center for Natural Product Technologies (CENAPT), Program for Collaborative Research in the Pharmaceutical Sciences (PCRPS), Department of Medicinal Chemistry and Pharmacognosy, College of Pharmacy, University of Illinois at Chicago, 833 S. Wood St., Chicago, IL 60612, USA. , and
| | - Ronald J Quinn
- Griffith Institute for Drug Discovery, Griffith University, Brisbane, QLD 4111, Australia
| | - D Sai Reddy
- Department of Chemistry and Biochemistry, University of Denver, Denver, CO 80210, USA
| | - Jean-Hugues Renault
- Center for Natural Product Technologies (CENAPT), Program for Collaborative Research in the Pharmaceutical Sciences (PCRPS), Department of Medicinal Chemistry and Pharmacognosy, College of Pharmacy, University of Illinois at Chicago, 833 S. Wood St., Chicago, IL 60612, USA. , and
| | - José Rivera-Chávez
- Center for Natural Product Technologies (CENAPT), Program for Collaborative Research in the Pharmaceutical Sciences (PCRPS), Department of Medicinal Chemistry and Pharmacognosy, College of Pharmacy, University of Illinois at Chicago, 833 S. Wood St., Chicago, IL 60612, USA. , and
| | - Wolfgang Robien
- Center for Natural Product Technologies (CENAPT), Program for Collaborative Research in the Pharmaceutical Sciences (PCRPS), Department of Medicinal Chemistry and Pharmacognosy, College of Pharmacy, University of Illinois at Chicago, 833 S. Wood St., Chicago, IL 60612, USA. , and
| | - Carla M Saunders
- Center for Natural Product Technologies (CENAPT), Program for Collaborative Research in the Pharmaceutical Sciences (PCRPS), Department of Medicinal Chemistry and Pharmacognosy, College of Pharmacy, University of Illinois at Chicago, 833 S. Wood St., Chicago, IL 60612, USA. , and
| | - Thomas J Schmidt
- Center for Natural Product Technologies (CENAPT), Program for Collaborative Research in the Pharmaceutical Sciences (PCRPS), Department of Medicinal Chemistry and Pharmacognosy, College of Pharmacy, University of Illinois at Chicago, 833 S. Wood St., Chicago, IL 60612, USA. , and
| | - Christoph Seger
- Center for Natural Product Technologies (CENAPT), Program for Collaborative Research in the Pharmaceutical Sciences (PCRPS), Department of Medicinal Chemistry and Pharmacognosy, College of Pharmacy, University of Illinois at Chicago, 833 S. Wood St., Chicago, IL 60612, USA. , and
| | - Ben Shen
- Center for Natural Product Technologies (CENAPT), Program for Collaborative Research in the Pharmaceutical Sciences (PCRPS), Department of Medicinal Chemistry and Pharmacognosy, College of Pharmacy, University of Illinois at Chicago, 833 S. Wood St., Chicago, IL 60612, USA. , and
| | - Christoph Steinbeck
- Center for Natural Product Technologies (CENAPT), Program for Collaborative Research in the Pharmaceutical Sciences (PCRPS), Department of Medicinal Chemistry and Pharmacognosy, College of Pharmacy, University of Illinois at Chicago, 833 S. Wood St., Chicago, IL 60612, USA. , and
| | - Hermann Stuppner
- Center for Natural Product Technologies (CENAPT), Program for Collaborative Research in the Pharmaceutical Sciences (PCRPS), Department of Medicinal Chemistry and Pharmacognosy, College of Pharmacy, University of Illinois at Chicago, 833 S. Wood St., Chicago, IL 60612, USA. , and
| | - Sonja Sturm
- Center for Natural Product Technologies (CENAPT), Program for Collaborative Research in the Pharmaceutical Sciences (PCRPS), Department of Medicinal Chemistry and Pharmacognosy, College of Pharmacy, University of Illinois at Chicago, 833 S. Wood St., Chicago, IL 60612, USA. , and
| | - Orazio Taglialatela-Scafati
- Center for Natural Product Technologies (CENAPT), Program for Collaborative Research in the Pharmaceutical Sciences (PCRPS), Department of Medicinal Chemistry and Pharmacognosy, College of Pharmacy, University of Illinois at Chicago, 833 S. Wood St., Chicago, IL 60612, USA. , and
| | - Dean J Tantillo
- Center for Natural Product Technologies (CENAPT), Program for Collaborative Research in the Pharmaceutical Sciences (PCRPS), Department of Medicinal Chemistry and Pharmacognosy, College of Pharmacy, University of Illinois at Chicago, 833 S. Wood St., Chicago, IL 60612, USA. , and
| | - Robert Verpoorte
- Division of Pharmacognosy, Section Metabolomics, Institute of Biology, Leiden University, P.O. Box 9502, 2300 RA Leiden, The Netherlands
| | - Bin-Gui Wang
- Yantai Institute of Coastal Zone Research, Chinese Academy of Sciences, Chunhui Road 17, Yantai 264003, People's Republic of China and Center for Natural Product Technologies (CENAPT), Program for Collaborative Research in the Pharmaceutical Sciences (PCRPS), Department of Medicinal Chemistry and Pharmacognosy, College of Pharmacy, University of Illinois at Chicago, 833 S. Wood St., Chicago, IL 60612, USA. , and
| | - Craig M Williams
- School of Chemistry and Molecular Sciences, University of Queensland, St. Lucia, QLD 4072, Australia
| | - Philip G Williams
- Center for Natural Product Technologies (CENAPT), Program for Collaborative Research in the Pharmaceutical Sciences (PCRPS), Department of Medicinal Chemistry and Pharmacognosy, College of Pharmacy, University of Illinois at Chicago, 833 S. Wood St., Chicago, IL 60612, USA. , and
| | - Julien Wist
- Center for Natural Product Technologies (CENAPT), Program for Collaborative Research in the Pharmaceutical Sciences (PCRPS), Department of Medicinal Chemistry and Pharmacognosy, College of Pharmacy, University of Illinois at Chicago, 833 S. Wood St., Chicago, IL 60612, USA. , and
| | - Jian-Min Yue
- Center for Natural Product Technologies (CENAPT), Program for Collaborative Research in the Pharmaceutical Sciences (PCRPS), Department of Medicinal Chemistry and Pharmacognosy, College of Pharmacy, University of Illinois at Chicago, 833 S. Wood St., Chicago, IL 60612, USA. , and
| | - Chen Zhang
- Center for Natural Product Technologies (CENAPT), Program for Collaborative Research in the Pharmaceutical Sciences (PCRPS), Department of Medicinal Chemistry and Pharmacognosy, College of Pharmacy, University of Illinois at Chicago, 833 S. Wood St., Chicago, IL 60612, USA. , and
| | - Zhengren Xu
- Center for Natural Product Technologies (CENAPT), Program for Collaborative Research in the Pharmaceutical Sciences (PCRPS), Department of Medicinal Chemistry and Pharmacognosy, College of Pharmacy, University of Illinois at Chicago, 833 S. Wood St., Chicago, IL 60612, USA. , and
| | - Charlotte Simmler
- Center for Natural Product Technologies (CENAPT), Program for Collaborative Research in the Pharmaceutical Sciences (PCRPS), Department of Medicinal Chemistry and Pharmacognosy, College of Pharmacy, University of Illinois at Chicago, 833 S. Wood St., Chicago, IL 60612, USA. ,
| | - David C Lankin
- Center for Natural Product Technologies (CENAPT), Program for Collaborative Research in the Pharmaceutical Sciences (PCRPS), Department of Medicinal Chemistry and Pharmacognosy, College of Pharmacy, University of Illinois at Chicago, 833 S. Wood St., Chicago, IL 60612, USA. ,
| | - Jonathan Bisson
- Center for Natural Product Technologies (CENAPT), Program for Collaborative Research in the Pharmaceutical Sciences (PCRPS), Department of Medicinal Chemistry and Pharmacognosy, College of Pharmacy, University of Illinois at Chicago, 833 S. Wood St., Chicago, IL 60612, USA. ,
| | - Guido F Pauli
- Center for Natural Product Technologies (CENAPT), Program for Collaborative Research in the Pharmaceutical Sciences (PCRPS), Department of Medicinal Chemistry and Pharmacognosy, College of Pharmacy, University of Illinois at Chicago, 833 S. Wood St., Chicago, IL 60612, USA. ,
| |
Collapse
|
9
|
Abstract
PubMed contains more than 27 million documents, and this number is growing at an estimated 4% per year. Even within specialized topics, it is no longer possible for a researcher to read any field in its entirety, and thus nobody has a complete picture of the scientific knowledge in any given field at any time. Text mining provides a means to automatically read this corpus and to extract the relations found therein as structured information. Having data in a structured format is a huge boon for computational efforts to access, cross reference, and mine the data stored therein. This is increasingly useful as biological research is becoming more focused on systems and multi-omics integration. This chapter provides an overview of the steps that are required for text mining: tokenization, named entity recognition, normalization, event extraction, and benchmarking. It discusses a variety of approaches to these tasks and then goes into detail on how to prepare data for use specifically with the JensenLab tagger. This software uses a dictionary-based approach and provides the text mining evidence for STRING and several other databases.
Collapse
Affiliation(s)
- Helen V Cook
- School of Clinical Medicine, University of Cambridge, Cambridge, UK.,Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
| | - Lars Juhl Jensen
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark.
| |
Collapse
|
10
|
Kilicoglu H. Biomedical text mining for research rigor and integrity: tasks, challenges, directions. Brief Bioinform 2018; 19:1400-1414. [PMID: 28633401 PMCID: PMC6291799 DOI: 10.1093/bib/bbx057] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/07/2017] [Revised: 04/10/2017] [Indexed: 01/01/2023] Open
Abstract
An estimated quarter of a trillion US dollars is invested in the biomedical research enterprise annually. There is growing alarm that a significant portion of this investment is wasted because of problems in reproducibility of research findings and in the rigor and integrity of research conduct and reporting. Recent years have seen a flurry of activities focusing on standardization and guideline development to enhance the reproducibility and rigor of biomedical research. Research activity is primarily communicated via textual artifacts, ranging from grant applications to journal publications. These artifacts can be both the source and the manifestation of practices leading to research waste. For example, an article may describe a poorly designed experiment, or the authors may reach conclusions not supported by the evidence presented. In this article, we pose the question of whether biomedical text mining techniques can assist the stakeholders in the biomedical research enterprise in doing their part toward enhancing research integrity and rigor. In particular, we identify four key areas in which text mining techniques can make a significant contribution: plagiarism/fraud detection, ensuring adherence to reporting guidelines, managing information overload and accurate citation/enhanced bibliometrics. We review the existing methods and tools for specific tasks, if they exist, or discuss relevant research that can provide guidance for future work. With the exponential increase in biomedical research output and the ability of text mining approaches to perform automatic tasks at large scale, we propose that such approaches can support tools that promote responsible research practices, providing significant benefits for the biomedical research enterprise.
Collapse
Affiliation(s)
- Halil Kilicoglu
- Lister Hill National Center for Biomedical Communications, US National Library of Medicine
| |
Collapse
|
11
|
Wei CH, Phan L, Feltz J, Maiti R, Hefferon T, Lu Z. tmVar 2.0: integrating genomic variant information from literature with dbSNP and ClinVar for precision medicine. Bioinformatics 2018; 34:80-87. [PMID: 28968638 DOI: 10.1093/bioinformatics/btx541] [Citation(s) in RCA: 56] [Impact Index Per Article: 9.3] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/17/2017] [Accepted: 08/31/2017] [Indexed: 11/12/2022] Open
Abstract
Motivation Despite significant efforts in expert curation, clinical relevance about most of the 154 million dbSNP reference variants (RS) remains unknown. However, a wealth of knowledge about the variant biological function/disease impact is buried in unstructured literature data. Previous studies have attempted to harvest and unlock such information with text-mining techniques but are of limited use because their mutation extraction results are not standardized or integrated with curated data. Results We propose an automatic method to extract and normalize variant mentions to unique identifiers (dbSNP RSIDs). Our method, in benchmarking results, demonstrates a high F-measure of ∼90% and compared favorably to the state of the art. Next, we applied our approach to the entire PubMed and validated the results by verifying that each extracted variant-gene pair matched the dbSNP annotation based on mapped genomic position, and by analyzing variants curated in ClinVar. We then determined which text-mined variants and genes constituted novel discoveries. Our analysis reveals 41 889 RS numbers (associated with 9151 genes) not found in ClinVar. Moreover, we obtained a rich set worth further review: 12 462 rare variants (MAF ≤ 0.01) in 3849 genes which are presumed to be deleterious and not frequently found in the general population. To our knowledge, this is the first large-scale study to analyze and integrate text-mined variant data with curated knowledge in existing databases. Our results suggest that databases can be significantly enriched by text mining and that the combined information can greatly assist human efforts in evaluating/prioritizing variants in genomic research. Availability and implementation The tmVar 2.0 source code and corpus are freely available at https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/tmvar/. Contact zhiyong.lu@nih.gov.
Collapse
Affiliation(s)
- Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), Bethesda, MD 20894, USA
| | - Lon Phan
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), Bethesda, MD 20894, USA
| | - Juliana Feltz
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), Bethesda, MD 20894, USA
| | - Rama Maiti
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), Bethesda, MD 20894, USA
| | - Tim Hefferon
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), Bethesda, MD 20894, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), Bethesda, MD 20894, USA
| |
Collapse
|
12
|
Cejuela JM, Bojchevski A, Uhlig C, Bekmukhametov R, Kumar Karn S, Mahmuti S, Baghudana A, Dubey A, Satagopam VP, Rost B. nala: text mining natural language mutation mentions. Bioinformatics 2018; 33:1852-1858. [PMID: 28200120 PMCID: PMC5870606 DOI: 10.1093/bioinformatics/btx083] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/17/2016] [Accepted: 02/08/2017] [Indexed: 01/30/2023] Open
Abstract
Motivation The extraction of sequence variants from the literature remains an important task. Existing methods primarily target standard (ST) mutation mentions (e.g. ‘E6V’), leaving relevant mentions natural language (NL) largely untapped (e.g. ‘glutamic acid was substituted by valine at residue 6’). Results We introduced three new corpora suggesting named-entity recognition (NER) to be more challenging than anticipated: 28–77% of all articles contained mentions only available in NL. Our new method nala captured NL and ST by combining conditional random fields with word embedding features learned unsupervised from the entire PubMed. In our hands, nala substantially outperformed the state-of-the-art. For instance, we compared all unique mentions in new discoveries correctly detected by any of three methods (SETH, tmVar, or nala). Neither SETH nor tmVar discovered anything missed by nala, while nala uniquely tagged 33% mentions. For NL mentions the corresponding value shot up to 100% nala-only. Availability and Implementation Source code, API and corpora freely available at: http://tagtog.net/-corpora/IDP4+. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Juan Miguel Cejuela
- TUM, Department of Informatics, Bioinformatics & Computational Biology - i12, Garching, Munich, Germany.,TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Garching, Germany
| | - Aleksandar Bojchevski
- TUM, Department of Informatics, Bioinformatics & Computational Biology - i12, Garching, Munich, Germany.,TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Garching, Germany
| | - Carsten Uhlig
- TUM, Department of Informatics, Bioinformatics & Computational Biology - i12, Garching, Munich, Germany
| | - Rustem Bekmukhametov
- TUM, Department of Informatics, Bioinformatics & Computational Biology - i12, Garching, Munich, Germany.,Microsoft, WA, Bellevue, USA
| | - Sanjeev Kumar Karn
- TUM, Department of Informatics, Bioinformatics & Computational Biology - i12, Garching, Munich, Germany.,Ludwig Maximilian University, 80538 Munich & Siemens AG, Corporate Technology, Munich, Germany
| | - Shpend Mahmuti
- TUM, Department of Informatics, Bioinformatics & Computational Biology - i12, Garching, Munich, Germany
| | - Ashish Baghudana
- TUM, Department of Informatics, Bioinformatics & Computational Biology - i12, Garching, Munich, Germany.,BITS-Pilani K. K. Birla Goa Campus, Goa, India
| | - Ankit Dubey
- TUM, Department of Informatics, Bioinformatics & Computational Biology - i12, Garching, Munich, Germany.,Concur (Germany) GmbH, Frankfurt am Main, Germany
| | - Venkata P Satagopam
- Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Belvaux, Luxembourg
| | - Burkhard Rost
- TUM, Department of Informatics, Bioinformatics & Computational Biology - i12, Garching, Munich, Germany.,Institute of Advanced Study (TUM-IAS) & Institute for Food and Plant Sciences WZW - Weihenstephan & New York Consortium on Membrane Protein Structure (NYCOMPS) & Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY, USA
| |
Collapse
|
13
|
Zhao Y, Song WM, Zhang F, Zhou MM, Zhang W, Walsh MJ, Zhang B. Distinct distributions of genomic features of the 5’ and 3’ partners of coding somatic cancer gene fusions: arising mechanisms and functional implications. Oncotarget 2017; 8:66769-66783. [PMID: 28977995 PMCID: PMC5620135 DOI: 10.18632/oncotarget.10734] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/02/2016] [Accepted: 06/06/2016] [Indexed: 11/25/2022] Open
Abstract
The genomic features and arising mechanisms of coding cancer somatic gene fusions (CSGFs) largely remain elusive. In this study, we show the gene origin stratification pattern of CSGF partners that fusion partners in human cancers are significantly enriched for genes with the gene age ofEuteleostomes and with the gene family age of Bilateria. GC skew (a measurement of G, C nucleotide content bias, (G-C)/(G+C)) is a useful measurement to indicate the DNA leading strand, lagging strand, replication origin, and replication terminal and DNA-RNA R-loop formation. We find that GC skew bias at the 5 prime (5′) but not the 3 prime (3’) partners of CSGFs, coincident with the polarity feature of gene expression breadth that the 5’ partners are more ubiquitous while the 3’ fusion partners are more tissue specific in general. We reveal distinct length and composition distributions of 5’ and 3’ of CSGFs, including sequence features corresponded to the 5’ untranslated regions (UTRs), 3’ UTRs, and the N-terminal sequences of the encoded proteins. Oncogenic somatic gene fusions are most enriched for the 5’ and 3’ genes’ somatic amplification alongside a substantial proportion of other types of combinations. At the function level, 5’ partners of CSGFs appear more likely to be tumour suppressor genes while many 3’ partners appear to be proto-oncogene. Such distinct polarities of CSGFs at the evolutionary, structural, genomic and functional levels indicate the heterogeneous arsing mechanisms of CSGFs including R-loops and suggest potential novel targeted therapeutics specific to CSGF functional categories.
Collapse
Affiliation(s)
- Yongzhong Zhao
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, NY 10029, USA
- Institute of Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, NY 10029, USA
| | - Won-Min Song
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, NY 10029, USA
- Institute of Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, NY 10029, USA
| | - Fan Zhang
- Department of Medicine, Icahn School of Medicine at Mount Sinai, NY 10029, USA
| | - Ming-Ming Zhou
- Department of Structural and Chemical Biology, Icahn School of Medicine at Mount Sinai, NY 10029, USA
| | - Weijia Zhang
- Department of Medicine, Icahn School of Medicine at Mount Sinai, NY 10029, USA
| | - Martin J. Walsh
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, NY 10029, USA
- Department of Structural and Chemical Biology, Icahn School of Medicine at Mount Sinai, NY 10029, USA
- Department of Pediatrics, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | - Bin Zhang
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, NY 10029, USA
- Institute of Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, NY 10029, USA
| |
Collapse
|
14
|
Literature evidence in open targets - a target validation platform. J Biomed Semantics 2017; 8:20. [PMID: 28587637 PMCID: PMC5461726 DOI: 10.1186/s13326-017-0131-3] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/24/2016] [Accepted: 05/31/2017] [Indexed: 11/13/2022] Open
Abstract
Background We present the Europe PMC literature component of Open Targets - a target validation platform that integrates various evidence to aid drug target identification and validation. The component identifies target-disease associations in documents and ranks the documents based on their confidence from the Europe PMC literature database, by using rules utilising expert-provided heuristic information. The confidence score of a given document represents how valuable the document is in the scope of target validation for a given target-disease association by taking into account the credibility of the association based on the properties of the text. The component serves the platform regularly with the up-to-date data since December, 2015. Results Currently, there are a total number of 1168365 distinct target-disease associations text mined from >26 million PubMed abstracts and >1.2 million Open Access full text articles. Our comparative analyses on the current available evidence data in the platform revealed that 850179 of these associations are exclusively identified by literature mining. Conclusions This component helps the platform’s users by providing the most relevant literature hits for a given target and disease. The text mining evidence along with the other types of evidence can be explored visually through https://www.targetvalidation.org and all the evidence data is available for download in json format from https://www.targetvalidation.org/downloads/data.
Collapse
|
15
|
Singhal A, Simmons M, Lu Z. Text Mining Genotype-Phenotype Relationships from Biomedical Literature for Database Curation and Precision Medicine. PLoS Comput Biol 2016; 12:e1005017. [PMID: 27902695 PMCID: PMC5130168 DOI: 10.1371/journal.pcbi.1005017] [Citation(s) in RCA: 66] [Impact Index Per Article: 8.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/09/2016] [Accepted: 06/04/2016] [Indexed: 11/23/2022] Open
Abstract
The practice of precision medicine will ultimately require databases of genes and mutations for healthcare providers to reference in order to understand the clinical implications of each patient’s genetic makeup. Although the highest quality databases require manual curation, text mining tools can facilitate the curation process, increasing accuracy, coverage, and productivity. However, to date there are no available text mining tools that offer high-accuracy performance for extracting such triplets from biomedical literature. In this paper we propose a high-performance machine learning approach to automate the extraction of disease-gene-variant triplets from biomedical literature. Our approach is unique because we identify the genes and protein products associated with each mutation from not just the local text content, but from a global context as well (from the Internet and from all literature in PubMed). Our approach also incorporates protein sequence validation and disease association using a novel text-mining-based machine learning approach. We extract disease-gene-variant triplets from all abstracts in PubMed related to a set of ten important diseases (breast cancer, prostate cancer, pancreatic cancer, lung cancer, acute myeloid leukemia, Alzheimer’s disease, hemochromatosis, age-related macular degeneration (AMD), diabetes mellitus, and cystic fibrosis). We then evaluate our approach in two ways: (1) a direct comparison with the state of the art using benchmark datasets; (2) a validation study comparing the results of our approach with entries in a popular human-curated database (UniProt) for each of the previously mentioned diseases. In the benchmark comparison, our full approach achieves a 28% improvement in F1-measure (from 0.62 to 0.79) over the state-of-the-art results. For the validation study with UniProt Knowledgebase (KB), we present a thorough analysis of the results and errors. Across all diseases, our approach returned 272 triplets (disease-gene-variant) that overlapped with entries in UniProt and 5,384 triplets without overlap in UniProt. Analysis of the overlapping triplets and of a stratified sample of the non-overlapping triplets revealed accuracies of 93% and 80% for the respective categories (cumulative accuracy, 77%). We conclude that our process represents an important and broadly applicable improvement to the state of the art for curation of disease-gene-variant relationships. To provide personalized health care it is important to understand patients’ genomic variations and the effect these variants have in protecting or predisposing patients to disease. Several projects aim at providing this information by manually curating such genotype-phenotype relationships in organized databases using data from clinical trials and biomedical literature. However, the exponentially increasing size of biomedical literature and the limited ability of manual curators to discover the genotype-phenotype relationships “hidden” in text has led to delays in keeping such databases updated with the current findings. The result is a bottleneck in leveraging valuable information that is currently available to develop personalized health care solutions. In the past, a few computational techniques have attempted to speed up the curation efforts by using text mining techniques to automatically mine genotype-phenotype information from biomedical literature. However, such computational approaches have not been able to achieve accuracy levels sufficient to make them appealing for practical use. In this work, we present a highly accurate machine-learning-based text mining approach for mining complete genotype-phenotype relationships from biomedical literature. We test the performance of this approach on ten well-known diseases and demonstrate the validity of our approach and its potential utility for practical purposes. We are currently working towards generating genotype-phenotype relationships for all PubMed data with the goal of developing an exhaustive database of all the known diseases in life science. We believe that this work will provide very important and needed support for implementation of personalized health care using genomic data.
Collapse
Affiliation(s)
- Ayush Singhal
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
| | - Michael Simmons
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
- * E-mail:
| |
Collapse
|
16
|
Thomas P, Rocktäschel T, Hakenberg J, Lichtblau Y, Leser U. SETH detects and normalizes genetic variants in text. Bioinformatics 2016; 32:2883-5. [PMID: 27256315 DOI: 10.1093/bioinformatics/btw234] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/28/2015] [Accepted: 04/18/2016] [Indexed: 11/14/2022] Open
Abstract
UNLABELLED : Descriptions of genetic variations and their effect are widely spread across the biomedical literature. However, finding all mentions of a specific variation, or all mentions of variations in a specific gene, is difficult to achieve due to the many ways such variations are described. Here, we describe SETH, a tool for the recognition of variations from text and their subsequent normalization to dbSNP or UniProt. SETH achieves high precision and recall on several evaluation corpora of PubMed abstracts. It is freely available and encompasses stand-alone scripts for isolated application and evaluation as well as a thorough documentation for integration into other applications. AVAILABILITY AND IMPLEMENTATION SETH is released under the Apache 2.0 license and can be downloaded from http://rockt.github.io/SETH/ CONTACT: thomas@informatik.hu-berlin.de or leser@informatik.hu-berlin.de.
Collapse
Affiliation(s)
- Philippe Thomas
- Language Technology Lab, DFKI Berlin, Germany Knowledge Management in Bioinformatics, Institute for Computer Science, Humboldt-Universität Zu Berlin, Unter Den Linden 6, Berlin 10099, Germany
| | | | - Jörg Hakenberg
- Illumina, Inc, 451 El Camino Real, Santa Clara, CA 95050, USA
| | - Yvonne Lichtblau
- Knowledge Management in Bioinformatics, Institute for Computer Science, Humboldt-Universität Zu Berlin, Unter Den Linden 6, Berlin 10099, Germany
| | - Ulf Leser
- Knowledge Management in Bioinformatics, Institute for Computer Science, Humboldt-Universität Zu Berlin, Unter Den Linden 6, Berlin 10099, Germany
| |
Collapse
|
17
|
Kafkas Ş, Kim JH, Pi X, McEntyre JR. Database citation in supplementary data linked to Europe PubMed Central full text biomedical articles. J Biomed Semantics 2015; 6:1. [PMID: 25789152 PMCID: PMC4363206 DOI: 10.1186/2041-1480-6-1] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/02/2014] [Accepted: 12/16/2014] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND In this study, we present an analysis of data citation practices in full text research articles and their corresponding supplementary data files, made available in the Open Access set of articles from Europe PubMed Central. Our aim is to investigate whether supplementary data files should be considered as a source of information for integrating the literature with biomolecular databases. RESULTS Using text-mining methods to identify and extract a variety of core biological database accession numbers, we found that the supplemental data files contain many more database citations than the body of the article, and that those citations often take the form of a relatively small number of articles citing large collections of accession numbers in text-based files. Moreover, citation of value-added databases derived from submission databases (such as Pfam, UniProt or Ensembl) is common, demonstrating the reuse of these resources as datasets in themselves. All the database accession numbers extracted from the supplementary data are publicly accessible from http://dx.doi.org/10.5281/zenodo.11771. CONCLUSIONS Our study suggests that supplementary data should be considered when linking articles with data, in curation pipelines, and in information retrieval tasks in order to make full use of the entire research article. These observations highlight the need to improve the management of supplemental data in general, in order to make this information more discoverable and useful.
Collapse
Affiliation(s)
- Şenay Kafkas
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD UK
| | - Jee-Hyub Kim
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD UK
| | - Xingjun Pi
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD UK
| | - Johanna R McEntyre
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD UK
| |
Collapse
|
18
|
Good BM, Nanis M, Wu C, Su AI. Microtask crowdsourcing for disease mention annotation in PubMed abstracts. PACIFIC SYMPOSIUM ON BIOCOMPUTING. PACIFIC SYMPOSIUM ON BIOCOMPUTING 2015:282-293. [PMID: 25592589 PMCID: PMC4299946] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Subscribe] [Scholar Register] [Indexed: 06/04/2023]
Abstract
Identifying concepts and relationships in biomedical text enables knowledge to be applied in computational analyses. Many biological natural language processing (BioNLP) projects attempt to address this challenge, but the state of the art still leaves much room for improvement. Progress in BioNLP research depends on large, annotated corpora for evaluating information extraction systems and training machine learning models. Traditionally, such corpora are created by small numbers of expert annotators often working over extended periods of time. Recent studies have shown that workers on microtask crowdsourcing platforms such as Amazon's Mechanical Turk (AMT) can, in aggregate, generate high-quality annotations of biomedical text. Here, we investigated the use of the AMT in capturing disease mentions in PubMed abstracts. We used the NCBI Disease corpus as a gold standard for refining and benchmarking our crowdsourcing protocol. After several iterations, we arrived at a protocol that reproduced the annotations of the 593 documents in the 'training set' of this gold standard with an overall F measure of 0.872 (precision 0.862, recall 0.883). The output can also be tuned to optimize for precision (max = 0.984 when recall = 0.269) or recall (max = 0.980 when precision = 0.436). Each document was completed by 15 workers, and their annotations were merged based on a simple voting method. In total 145 workers combined to complete all 593 documents in the span of 9 days at a cost of $.066 per abstract per worker. The quality of the annotations, as judged with the F measure, increases with the number of workers assigned to each task; however minimal performance gains were observed beyond 8 workers per task. These results add further evidence that microtask crowdsourcing can be a valuable tool for generating well-annotated corpora in BioNLP. Data produced for this analysis are available at http://figshare.com/articles/Disease_Mention_Annotation_with_Mechanical_Turk/1126402.
Collapse
|
19
|
Macintyre G, Jimeno Yepes A, Ong CS, Verspoor K. Associating disease-related genetic variants in intergenic regions to the genes they impact. PeerJ 2014; 2:e639. [PMID: 25374782 PMCID: PMC4217187 DOI: 10.7717/peerj.639] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/23/2014] [Accepted: 10/07/2014] [Indexed: 11/20/2022] Open
Abstract
We present a method to assist in interpretation of the functional impact of intergenic disease-associated SNPs that is not limited to search strategies proximal to the SNP. The method builds on two sources of external knowledge: the growing understanding of three-dimensional spatial relationships in the genome, and the substantial repository of information about relationships among genetic variants, genes, and diseases captured in the published biomedical literature. We integrate chromatin conformation capture data (HiC) with literature support to rank putative target genes of intergenic disease-associated SNPs. We demonstrate that this hybrid method outperforms a genomic distance baseline on a small test set of expression quantitative trait loci, as well as either method individually. In addition, we show the potential for this method to uncover relationships between intergenic SNPs and target genes across chromosomes. With more extensive chromatin conformation capture data becoming readily available, this method provides a way forward towards functional interpretation of SNPs in the context of the three dimensional structure of the genome in the nucleus.
Collapse
Affiliation(s)
- Geoff Macintyre
- Department of Computing and Information Systems, The University of Melbourne, VIC, Australia
- Centre for Neural Engineering, The University of Melbourne, VIC, Australia
| | - Antonio Jimeno Yepes
- Department of Computing and Information Systems, The University of Melbourne, VIC, Australia
| | - Cheng Soon Ong
- Department of Electrical and Electronic Engineering, The University of Melbourne, VIC, Australia
- Machine Learning Group, NICTA Canberra Research Laboratory, Australia
- Research School of Computer Science, Australian National University, Australia
| | - Karin Verspoor
- Department of Computing and Information Systems, The University of Melbourne, VIC, Australia
- Health and Biomedical Informatics Centre, The University of Melbourne, VIC, Australia
| |
Collapse
|
20
|
Burger JD, Doughty E, Khare R, Wei CH, Mishra R, Aberdeen J, Tresner-Kirsch D, Wellner B, Kann MG, Lu Z, Hirschman L. Hybrid curation of gene-mutation relations combining automated extraction and crowdsourcing. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2014; 2014:bau094. [PMID: 25246425 PMCID: PMC4170591 DOI: 10.1093/database/bau094] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/23/2022]
Abstract
Background: This article describes capture of biological information using a hybrid approach that combines natural language processing to extract biological entities and crowdsourcing with annotators recruited via Amazon Mechanical Turk to judge correctness of candidate biological relations. These techniques were applied to extract gene– mutation relations from biomedical abstracts with the goal of supporting production scale capture of gene–mutation–disease findings as an open source resource for personalized medicine. Results: The hybrid system could be configured to provide good performance for gene–mutation extraction (precision ∼82%; recall ∼70% against an expert-generated gold standard) at a cost of $0.76 per abstract. This demonstrates that crowd labor platforms such as Amazon Mechanical Turk can be used to recruit quality annotators, even in an application requiring subject matter expertise; aggregated Turker judgments for gene–mutation relations exceeded 90% accuracy. Over half of the precision errors were due to mismatches against the gold standard hidden from annotator view (e.g. incorrect EntrezGene identifier or incorrect mutation position extracted), or incomplete task instructions (e.g. the need to exclude nonhuman mutations). Conclusions: The hybrid curation model provides a readily scalable cost-effective approach to curation, particularly if coupled with expert human review to filter precision errors. We plan to generalize the framework and make it available as open source software. Database URL:http://www.mitre.org/publications/technical-papers/hybrid-curation-of-gene-mutation-relations-combining-automated
Collapse
Affiliation(s)
- John D Burger
- The MITRE Corporation, Bedford, MA 01730, USA, Biomedical Informatics Program, Stanford University, Stanford, CA 94305, USA, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA and The University of Maryland, Baltimore County, Baltimore MD 21250, USA
| | - Emily Doughty
- The MITRE Corporation, Bedford, MA 01730, USA, Biomedical Informatics Program, Stanford University, Stanford, CA 94305, USA, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA and The University of Maryland, Baltimore County, Baltimore MD 21250, USA
| | - Ritu Khare
- The MITRE Corporation, Bedford, MA 01730, USA, Biomedical Informatics Program, Stanford University, Stanford, CA 94305, USA, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA and The University of Maryland, Baltimore County, Baltimore MD 21250, USA
| | - Chih-Hsuan Wei
- The MITRE Corporation, Bedford, MA 01730, USA, Biomedical Informatics Program, Stanford University, Stanford, CA 94305, USA, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA and The University of Maryland, Baltimore County, Baltimore MD 21250, USA
| | - Rajashree Mishra
- The MITRE Corporation, Bedford, MA 01730, USA, Biomedical Informatics Program, Stanford University, Stanford, CA 94305, USA, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA and The University of Maryland, Baltimore County, Baltimore MD 21250, USA
| | - John Aberdeen
- The MITRE Corporation, Bedford, MA 01730, USA, Biomedical Informatics Program, Stanford University, Stanford, CA 94305, USA, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA and The University of Maryland, Baltimore County, Baltimore MD 21250, USA
| | - David Tresner-Kirsch
- The MITRE Corporation, Bedford, MA 01730, USA, Biomedical Informatics Program, Stanford University, Stanford, CA 94305, USA, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA and The University of Maryland, Baltimore County, Baltimore MD 21250, USA
| | - Ben Wellner
- The MITRE Corporation, Bedford, MA 01730, USA, Biomedical Informatics Program, Stanford University, Stanford, CA 94305, USA, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA and The University of Maryland, Baltimore County, Baltimore MD 21250, USA
| | - Maricel G Kann
- The MITRE Corporation, Bedford, MA 01730, USA, Biomedical Informatics Program, Stanford University, Stanford, CA 94305, USA, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA and The University of Maryland, Baltimore County, Baltimore MD 21250, USA
| | - Zhiyong Lu
- The MITRE Corporation, Bedford, MA 01730, USA, Biomedical Informatics Program, Stanford University, Stanford, CA 94305, USA, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA and The University of Maryland, Baltimore County, Baltimore MD 21250, USA
| | - Lynette Hirschman
- The MITRE Corporation, Bedford, MA 01730, USA, Biomedical Informatics Program, Stanford University, Stanford, CA 94305, USA, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA and The University of Maryland, Baltimore County, Baltimore MD 21250, USA
| |
Collapse
|
21
|
Jimeno Yepes A, Verspoor K. Mutation extraction tools can be combined for robust recognition of genetic variants in the literature. F1000Res 2014; 3:18. [PMID: 25285203 PMCID: PMC4176422 DOI: 10.12688/f1000research.3-18.v2] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Accepted: 05/27/2014] [Indexed: 11/20/2022] Open
Abstract
As the cost of genomic sequencing continues to fall, the amount of data being collected and studied for the purpose of understanding the genetic basis of disease is increasing dramatically. Much of the source information relevant to such efforts is available only from unstructured sources such as the scientific literature, and significant resources are expended in manually curating and structuring the information in the literature. As such, there have been a number of systems developed to target automatic extraction of mutations and other genetic variation from the literature using text mining tools. We have performed a broad survey of the existing publicly available tools for extraction of genetic variants from the scientific literature. We consider not just one tool but a number of different tools, individually and in combination, and apply the tools in two scenarios. First, they are compared in an intrinsic evaluation context, where the tools are tested for their ability to identify specific mentions of genetic variants in a corpus of manually annotated papers, the Variome corpus. Second, they are compared in an extrinsic evaluation context based on our previous study of text mining support for curation of the COSMIC and InSiGHT databases. Our results demonstrate that no single tool covers the full range of genetic variants mentioned in the literature. Rather, several tools have complementary coverage and can be used together effectively. In the intrinsic evaluation on the Variome corpus, the combined performance is above 0.95 in F-measure, while in the extrinsic evaluation the combined recall performance is above 0.71 for COSMIC and above 0.62 for InSiGHT, a substantial improvement over the performance of any individual tool. Based on the analysis of these results, we suggest several directions for the improvement of text mining tools for genetic variant extraction from the literature.
Collapse
Affiliation(s)
- Antonio Jimeno Yepes
- National ICT Australia, Victoria Research Laboratory, Melbourne, Australia ; Department of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
| | - Karin Verspoor
- National ICT Australia, Victoria Research Laboratory, Melbourne, Australia ; Department of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
| |
Collapse
|
22
|
Biomedical Text Mining: State-of-the-Art, Open Problems and Future Challenges. INTERACTIVE KNOWLEDGE DISCOVERY AND DATA MINING IN BIOMEDICAL INFORMATICS 2014. [DOI: 10.1007/978-3-662-43968-5_16] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/24/2022]
|