1
Vollmar M, Tirunagari S, Harrus D, Armstrong D, Gáborová R, Gupta D, Afonso MQL, Evans G, Velankar S. Dataset from a human-in-the-loop approach to identify functionally important protein residues from literature. Sci Data 2024; 11:1032. [PMID: 39333508 PMCID: PMC11436914 DOI: 10.1038/s41597-024-03841-9] [Received: 03/11/2024] [Accepted: 08/29/2024] [Indexed: 09/29/2024] Open
Abstract
We present a novel system that leverages curators in the loop to develop a dataset and model for detecting structure features and functional annotations at residue-level from standard publication text. Our approach involves the integration of data from multiple resources, including PDBe, EuropePMC, PubMedCentral, and PubMed, combined with annotation guidelines from UniProt, and LitSuggest and HuggingFace models as tools in the annotation process. A team of seven annotators manually curated ten articles for named entities, which we utilized to train a starting PubmedBert model from HuggingFace. Using a human-in-the-loop annotation system, we iteratively developed the best model with commendable performance metrics of 0.90 for precision, 0.92 for recall, and 0.91 for F1-measure. Our proposed system showcases a successful synergy of machine learning techniques and human expertise in curating a dataset for residue-level functional annotations and protein structure features. The results demonstrate the potential for broader applications in protein research, bridging the gap between advanced machine learning models and the indispensable insights of domain experts.
Affiliation(s)
- Melanie Vollmar
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
- Santosh Tirunagari
- Literature Services, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Deborah Harrus
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- David Armstrong
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Romana Gáborová
- CEITEC - Central European Institute of Technology, Masaryk University, Kamenice 5, 62500, Brno, Czech Republic
- Deepti Gupta
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Marcelo Querino Lima Afonso
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Genevieve Evans
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Sameer Velankar
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
2
Lambert SA, Wingfield B, Gibson JT, Gil L, Ramachandran S, Yvon F, Saverimuttu S, Tinsley E, Lewis E, Ritchie SC, Wu J, Cánovas R, McMahon A, Harris LW, Parkinson H, Inouye M. Enhancing the Polygenic Score Catalog with tools for score calculation and ancestry normalization. Nat Genet 2024:10.1038/s41588-024-01937-x. [PMID: 39327485 DOI: 10.1038/s41588-024-01937-x] [Indexed: 09/28/2024]
Affiliation(s)
- Samuel A Lambert
- Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK.
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK.
- Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, Cambridge, UK.
- Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge, UK.
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK.
- Benjamin Wingfield
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Joel T Gibson
- Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, Cambridge, UK
- Laurent Gil
- Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge, UK
- Wellcome Sanger Institute, Hinxton, UK
- Santhi Ramachandran
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Florent Yvon
- Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, Cambridge, UK
- Shirin Saverimuttu
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Emily Tinsley
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Elizabeth Lewis
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Scott C Ritchie
- Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, Cambridge, UK
- British Heart Foundation Centre of Research Excellence, University of Cambridge, Cambridge, UK
- Jingqin Wu
- Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia
- Rodrigo Cánovas
- Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia
- The Australian E-Health Research Centre, CSIRO, Parkville, Victoria, Australia
- Aoife McMahon
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Laura W Harris
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Helen Parkinson
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Michael Inouye
- Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK.
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK.
- Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, Cambridge, UK.
- Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge, UK.
- British Heart Foundation Centre of Research Excellence, University of Cambridge, Cambridge, UK.
- Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia.
3
Lai PT, Coudert E, Aimo L, Axelsen K, Breuza L, de Castro E, Feuermann M, Morgat A, Pourcel L, Pedruzzi I, Poux S, Redaschi N, Rivoire C, Sveshnikova A, Wei CH, Leaman R, Luo L, Lu Z, Bridge A. EnzChemRED, a rich enzyme chemistry relation extraction dataset. Sci Data 2024; 11:982. [PMID: 39251610 PMCID: PMC11384730 DOI: 10.1038/s41597-024-03835-7] [Received: 05/01/2024] [Accepted: 08/23/2024] [Indexed: 09/11/2024] Open
Abstract
Expert curation is essential to capture knowledge of enzyme functions from the scientific literature in FAIR open knowledgebases but cannot keep pace with the rate of new discoveries and new publications. In this work we present EnzChemRED, for Enzyme Chemistry Relation Extraction Dataset, a new training and benchmarking dataset to support the development of Natural Language Processing (NLP) methods such as (large) language models that can assist enzyme curation. EnzChemRED consists of 1,210 expert curated PubMed abstracts where enzymes and the chemical reactions they catalyze are annotated using identifiers from the protein knowledgebase UniProtKB and the chemical ontology ChEBI. We show that fine-tuning language models with EnzChemRED significantly boosts their ability to identify proteins and chemicals in text (86.30% F1 score) and to extract the chemical conversions (86.66% F1 score) and the enzymes that catalyze those conversions (83.79% F1 score). We apply our methods to abstracts at PubMed scale to create a draft map of enzyme functions in literature to guide curation efforts in UniProtKB and the reaction knowledgebase Rhea.
Grants
- U24 HG007822 NHGRI NIH HHS
- NIH Intramural Research Program, National Library of Medicine
- Expert curation and evaluation of EnzChemRED at Swiss-Prot were supported by the Swiss Federal Government through the State Secretariat for Education, Research and Innovation (SERI) and the National Human Genome Research Institute (NHGRI), Office of Director [OD/DPCPSI/ODSS], National Institute of Allergy and Infectious Diseases (NIAID), National Institute on Aging (NIA), National Institute of General Medical Sciences (NIGMS), National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), National Eye Institute (NEI), National Cancer Institute (NCI), National Heart, Lung, and Blood Institute (NHLBI) of the National Institutes of Health [U24HG007822], and by the European Union's Horizon Europe Framework Programme (grant number 101080997), supported in Switzerland through the State Secretariat for Education, Research and Innovation (SERI).
- Fundamental Research Funds for the Central Universities [DUT23RC(3)014 to L.L.]
Affiliation(s)
- Po-Ting Lai
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, 20894, USA
- Elisabeth Coudert
- Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Lucila Aimo
- Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Kristian Axelsen
- Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Lionel Breuza
- Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Edouard de Castro
- Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Marc Feuermann
- Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Anne Morgat
- Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Lucille Pourcel
- Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Ivo Pedruzzi
- Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Sylvain Poux
- Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Nicole Redaschi
- Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Catherine Rivoire
- Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Anastasia Sveshnikova
- Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, 20894, USA
- Robert Leaman
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, 20894, USA
- Ling Luo
- School of Computer Science and Technology, Dalian University of Technology, 116024, Dalian, China
- Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, 20894, USA.
- Alan Bridge
- Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland.
4
Martinez K, Agirre J, Akune Y, Aoki-Kinoshita KF, Arighi C, Axelsen KB, Bolton E, Bordeleau E, Edwards NJ, Fadda E, Feizi T, Hayes C, Ives CM, Joshi HJ, Krishna Prasad K, Kossida S, Lisacek F, Liu Y, Lütteke T, Ma J, Malik A, Martin M, Mehta AY, Neelamegham S, Panneerselvam K, Ranzinger R, Ricard-Blum S, Sanou G, Shanker V, Thomas PD, Tiemeyer M, Urban J, Vita R, Vora J, Yamamoto Y, Mazumder R. Functional implications of glycans and their curation: insights from the workshop held at the 16th Annual International Biocuration Conference in Padua, Italy. Database (Oxford) 2024; 2024:baae073. [PMID: 39137905 PMCID: PMC11321244 DOI: 10.1093/database/baae073] [Received: 04/01/2024] [Revised: 06/24/2024] [Accepted: 07/10/2024] [Indexed: 08/15/2024]
Abstract
Dynamic changes in protein glycosylation impact human health and disease progression. However, current resources that capture disease and phenotype information focus primarily on the macromolecules within the central dogma of molecular biology (DNA, RNA, proteins). To gain a better understanding of organisms, there is a need to capture the functional impact of glycans and glycosylation on biological processes. A workshop titled "Functional impact of glycans and their curation" was held in conjunction with the 16th Annual International Biocuration Conference to discuss ongoing worldwide activities related to glycan function curation. This workshop brought together subject matter experts, tool developers, and biocurators from over 20 projects and bioinformatics resources. Participants discussed four key topics for each of their resources: (i) how they curate glycan function-related data from publications and other sources, (ii) what type of data they would like to acquire, (iii) what data they currently have, and (iv) what standards they use. Their answers contributed input that provided a comprehensive overview of state-of-the-art glycan function curation and annotations. This report summarizes the outcome of discussions, including potential solutions and areas where curators, data wranglers, and text mining experts can collaborate to address current gaps in glycan and glycosylation annotations, leveraging each other's work to improve their respective resources and encourage impactful data sharing among resources. Database URL: https://wiki.glygen.org/Glycan_Function_Workshop_2023.
Affiliation(s)
- Karina Martinez
- Department of Biochemistry & Molecular Medicine, The George Washington University School of Medicine and Health Sciences, 2300 I St. NW, Washington, DC 20052, United States
- Jon Agirre
- York Structural Biology Laboratory, Department of Chemistry, University of York, Wentworth Way, York YO10 5DD, United Kingdom
- Yukie Akune
- The Glycosciences Laboratory, Imperial College London, Hammersmith Campus, Du Cane Road, London W12 0NN, United Kingdom
- Kiyoko F Aoki-Kinoshita
- Glycan and Life Systems Integration Center (GaLSIC), Soka University, 1-236 Tangi-machi, Hachioji, Tokyo 192-8577, Japan
- Cecilia Arighi
- Department of Computer and Information Sciences, University of Delaware, 18 Amstel Ave, Newark, DE 19716, United States
- Kristian B Axelsen
- Swiss-Prot Group, Swiss Institute of Bioinformatics (SIB), CMU, 1 rue Michel Servet, Geneva 4 1211, Switzerland
- Evan Bolton
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, United States
- Emily Bordeleau
- Michael Smith Laboratories, The University of British Columbia, 2185 East Mall, Vancouver, British Columbia V6T 1Z4, Canada
- Nathan J Edwards
- Department of Biochemistry and Molecular & Cellular Biology, Georgetown University, 2115 Wisconsin Ave NW, Washington, DC 20007, United States
- Elisa Fadda
- Department of Chemistry and Hamilton Institute, Maynooth University, Kilcock Road, Maynooth, Co. Kildare W23 AH3Y, Ireland
- Ten Feizi
- The Glycosciences Laboratory, Imperial College London, Hammersmith Campus, Du Cane Road, London W12 0NN, United Kingdom
- Catherine Hayes
- Proteome Informatics Group, Swiss Institute of Bioinformatics (SIB), route de Drize 7, Geneva CH-1227, Switzerland
- Callum M Ives
- Department of Chemistry and Hamilton Institute, Maynooth University, Kilcock Road, Maynooth, Co. Kildare W23 AH3Y, Ireland
- Hiren J Joshi
- Copenhagen Center for Glycomics, Department of Cellular and Molecular Medicine, Faculty of Health Sciences, University of Copenhagen, Blegdamsvej 3, Copenhagen DK-2200, Denmark
- Khakurel Krishna Prasad
- ELI Beamlines Facility, The Extreme Light Infrastructure ERIC, Za Radnicí 835, Dolní Břežany 25241, Czech Republic
- Sofia Kossida
- IMGT, The International ImMunoGeneTics Information System, National Center for Scientific Research (CNRS), Institute of Human Genetics (IGH), University of Montpellier (UM), 141 rue de la Cardonille, Montpellier 34 090, France
- Frederique Lisacek
- Proteome Informatics Group, Swiss Institute of Bioinformatics (SIB), route de Drize 7, Geneva CH-1227, Switzerland
- Yan Liu
- The Glycosciences Laboratory, Imperial College London, Hammersmith Campus, Du Cane Road, London W12 0NN, United Kingdom
- Thomas Lütteke
- Institute of Veterinary Physiology and Biochemistry, Justus-Liebig-University Gießen, Frankfurter Str. 100, Gießen 35392, Germany
- Junfeng Ma
- Department of Oncology, Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, 3900 Reservoir Road NW, Washington, DC 20007, United States
- Adnan Malik
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
- Maria Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
- Akul Y Mehta
- Department of Surgery, Beth Israel Deaconess Medical Center, National Center for Functional Glycomics, Harvard Medical School, 330 Brookline Avenue, Boston, MA 02215, United States
- Sriram Neelamegham
- Departments of Chemical & Biological Engineering, Biomedical Engineering and Medicine, University at Buffalo, State University of New York, 906 Furnas Hall, Buffalo, NY 14260, United States
- Kalpana Panneerselvam
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
- René Ranzinger
- Complex Carbohydrate Research Center, University of Georgia, 315 Riverbend Rd, Athens, GA 30602, United States
- Sylvie Ricard-Blum
- Institute of Molecular and Supramolecular Chemistry and Biochemistry (ICBMS), UMR 5246, University Lyon 1, CNRS, 43 Boulevard du 11 novembre 1918, Villeurbanne cedex F-69622, France
- Gaoussou Sanou
- IMGT, The International ImMunoGeneTics Information System, National Center for Scientific Research (CNRS), Institute of Human Genetics (IGH), University of Montpellier (UM), 141 rue de la Cardonille, Montpellier 34 090, France
- Vijay Shanker
- Department of Computer and Information Sciences, University of Delaware, 18 Amstel Ave, Newark, DE 19716, United States
- Paul D Thomas
- Department of Population and Public Health Sciences, University of Southern California, 2001 N Soto Street, Los Angeles, CA 90032, United States
- Michael Tiemeyer
- Complex Carbohydrate Research Center, University of Georgia, 315 Riverbend Rd, Athens, GA 30602, United States
- James Urban
- Department of Chemistry and Molecular Biology, University of Gothenburg, Medicinaregatan 7 B, Gothenburg 41390, Sweden
- Randi Vita
- Immune Epitope Database and Analysis Project, La Jolla Institute for Allergy & Immunology, 9420 Athena Circle, La Jolla, CA 92037, United States
- Jeet Vora
- Department of Biochemistry & Molecular Medicine, The George Washington University School of Medicine and Health Sciences, 2300 I St. NW, Washington, DC 20052, United States
- Yasunori Yamamoto
- Database Center for Life Science, Joint Support-Center for Data Science Research, Research Organization of Information and Systems, 178-4-4 Wakashiba, Kashiwa, Chiba 277-0871, Japan
- Raja Mazumder
- Department of Biochemistry & Molecular Medicine, The George Washington University School of Medicine and Health Sciences, 2300 I St. NW, Washington, DC 20052, United States
5
Stevens ER, Laynor G. Enhancing the quality and efficiency of regulatory science literature reviews through innovation and collaboration with library and information science experts. Front Med (Lausanne) 2024; 11:1434427. [PMID: 39021816 PMCID: PMC11251899 DOI: 10.3389/fmed.2024.1434427] [Received: 05/17/2024] [Accepted: 06/24/2024] [Indexed: 07/20/2024] Open
Affiliation(s)
- Elizabeth R. Stevens
- Department of Population Health, New York University Grossman School of Medicine, New York, NY, United States
- Gregory Laynor
- Health Sciences Library, New York University Grossman School of Medicine, New York, NY, United States
6
Khalil H, Pollock D, McInerney P, Evans C, Moraes EB, Godfrey CM, Alexander L, Tricco A, Peters MDJ, Pieper D, Saran A, Ameen D, Taneri PE, Munn Z. Automation tools to support undertaking scoping reviews. Res Synth Methods 2024. [PMID: 38885942 DOI: 10.1002/jrsm.1731] [Received: 08/03/2023] [Revised: 05/15/2024] [Accepted: 06/02/2024] [Indexed: 06/20/2024]
Abstract
OBJECTIVE: This paper describes several automation tools and software that can be considered during evidence synthesis projects and provides guidance for their integration in the conduct of scoping reviews. STUDY DESIGN AND SETTING: The guidance presented in this work is adapted from the results of a scoping review and consultations with the JBI Scoping Review Methodology group. RESULTS: This paper describes several reliable, validated automation tools and software that can be used to enhance the conduct of scoping reviews. Developments in the automation of systematic reviews, and more recently scoping reviews, are continuously evolving. We detail several helpful tools in order of the key steps recommended by the JBI's methodological guidance for undertaking scoping reviews including team establishment, protocol development, searching, de-duplication, screening titles and abstracts, data extraction, data charting, and report writing. While we include several reliable tools and software that can be used for the automation of scoping reviews, there are some limitations to the tools mentioned. For example, some are available in English only and their lack of integration with other tools results in limited interoperability. CONCLUSION: This paper highlighted several useful automation tools and software programs to use in undertaking each step of a scoping review. This guidance has the potential to inform collaborative efforts aiming at the development of evidence informed, integrated automation tools and software packages for enhancing the conduct of high-quality scoping reviews.
Affiliation(s)
- Hanan Khalil
- School of Psychology and Public Health, Department of Public Health, La Trobe University, Melbourne, Australia
- The Queensland Centre of Evidence Based Nursing and Midwifery: A JBI Centre of Excellence, Brisbane, Queensland, Australia
- Danielle Pollock
- JBI, University of Adelaide, Adelaide, Australia
- Health Evidence Synthesis, Recommendations and Impact (HESRI), School of Public Health, University of Adelaide, Adelaide, Australia
- Patricia McInerney
- The Wits JBI Centre for Evidence-Based Practice: A JBI Centre of Excellence, Faculty of Health Sciences, University of the Witwatersrand, South Africa
- Catrin Evans
- The Nottingham Centre for Evidence Based Healthcare: A JBI Centre of Excellence, University of Nottingham, UK
- Erica B Moraes
- Nursing School, Department of Nursing Fundamentals and Administration, Federal Fluminense University, Rio de Janeiro, Brazil
- The Brazilian Centre of Evidence-based Healthcare: A JBI Centre of Excellence - JBI, Brazil
- Christina M Godfrey
- Queen's Collaboration for Health Care Quality: A JBI Centre of Excellence, Queen's University School of Nursing, Kingston, Ontario, Canada
- Lyndsay Alexander
- The Scottish Centre for Evidence-based, Multi-Professional Practice: A JBI Centre of Excellence, Aberdeen, UK
- School of Health Sciences, Robert Gordon University, Aberdeen, UK
- Andrea Tricco
- Queen's Collaboration for Health Care Quality: A JBI Centre of Excellence, Queen's University School of Nursing, Kingston, Ontario, Canada
- Epidemiology Division and Institute for Health, Management, and Evaluation, Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
- Knowledge Translation Program, Li Ka Shing Knowledge Institute, St. Michael's Hospital, Unity Health Toronto, Toronto, Ontario, Canada
- Micah D J Peters
- Health Evidence Synthesis, Recommendations and Impact (HESRI), School of Public Health, University of Adelaide, Adelaide, Australia
- University of South Australia, Clinical and Health Sciences, Rosemary Bryant AO Research Centre, Adelaide, South Australia, Australia
- University of Adelaide, Faculty of Health and Medical Sciences, Adelaide Nursing School, Adelaide, Australia
- Dawid Pieper
- Faculty of Health Sciences Brandenburg, Brandenburg Medical School (Theodor Fontane), Institute for Health Services and Health System Research, Rüdersdorf, Germany
- Center for Health Services Research, Brandenburg Medical School (Theodor Fontane), Rüdersdorf, Germany
- Daniel Ameen
- Faculty of Medicine, Nursing and Health Sciences, School of Medicine, Monash University, Australia
- Petek Eylul Taneri
- HRB-Trials Methodology Research Network, College of Medicine, Nursing and Health Sciences, University of Galway, Galway, Ireland
- Zachary Munn
- JBI, University of Adelaide, Adelaide, Australia
- Health Evidence Synthesis, Recommendations and Impact (HESRI), School of Public Health, University of Adelaide, Adelaide, Australia
7
Fabiano N, Gupta A, Bhambra N, Luu B, Wong S, Maaz M, Fiedorowicz JG, Smith AL, Solmi M. How to optimize the systematic review process using AI tools. JCPP Advances 2024; 4:e12234. [PMID: 38827982 PMCID: PMC11143948 DOI: 10.1002/jcv2.12234] [Received: 10/06/2023] [Accepted: 12/18/2023] [Indexed: 06/05/2024] Open
Abstract
Systematic reviews are a cornerstone for synthesizing the available evidence on a given topic. They simultaneously allow for gaps in the literature to be identified and provide direction for future research. However, due to the ever-increasing volume and complexity of the available literature, traditional methods for conducting systematic reviews are less efficient and more time-consuming. Numerous artificial intelligence (AI) tools are being released with the potential to optimize efficiency in academic writing and assist with various stages of the systematic review process including developing and refining search strategies, screening titles and abstracts for inclusion or exclusion criteria, extracting essential data from studies and summarizing findings. Therefore, in this article we provide an overview of the currently available tools and how they can be incorporated into the systematic review process to improve efficiency and quality of research synthesis. We emphasize that authors must report all AI tools that have been used at each stage to ensure replicability as part of reporting in methods.
Affiliation(s)
- Arnav Gupta
- Department of Medicine, University of Calgary, Calgary, Alberta, Canada
- College of Public Health, Kent State University, Kent, Ohio, USA
- Nishaant Bhambra
- Department of Family Medicine, University of Ottawa, Ottawa, Ontario, Canada
- Brandon Luu
- Department of Medicine, University of Toronto, Toronto, Ontario, Canada
- Stanley Wong
- Department of Psychiatry, University of Toronto, Toronto, Ontario, Canada
- Muhammad Maaz
- Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada
- Department of Mechanical and Industrial Engineering, University of Toronto, Toronto, Ontario, Canada
- Jess G. Fiedorowicz
- Department of Psychiatry, University of Ottawa, Ottawa, Ontario, Canada
- Department of Mental Health, The Ottawa Hospital, Ottawa, Ontario, Canada
- Ottawa Hospital Research Institute (OHRI) Clinical Epidemiology Program, University of Ottawa, Ottawa, Ontario, Canada
- School of Epidemiology and Public Health, Faculty of Medicine, University of Ottawa, Ottawa, Ontario, Canada
- Andrew L. Smith
- Department of Psychiatry, University of Ottawa, Ottawa, Ontario, Canada
- Department of Mental Health, The Ottawa Hospital, Ottawa, Ontario, Canada
- Marco Solmi
- Department of Psychiatry, University of Ottawa, Ottawa, Ontario, Canada
- Department of Mental Health, The Ottawa Hospital, Ottawa, Ontario, Canada
- Ottawa Hospital Research Institute (OHRI) Clinical Epidemiology Program, University of Ottawa, Ottawa, Ontario, Canada
- School of Epidemiology and Public Health, Faculty of Medicine, University of Ottawa, Ottawa, Ontario, Canada
- Department of Child and Adolescent Psychiatry, Charité - Universitätsmedizin Berlin, Berlin, Germany
|
8
|
Kamihara T, Tanaka K, Omura T, Kaneko S, Hirashiki A, Kokubo M, Shimizu A. Exploratory bibliometric analysis and text mining to reveal research trends in cardiac aging. Aging Med (Milton) 2024; 7:301-311. [PMID: 38975309 PMCID: PMC11222727 DOI: 10.1002/agm2.12329] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/24/2023] [Revised: 04/02/2024] [Accepted: 05/29/2024] [Indexed: 07/09/2024] Open
Abstract
Objectives We conducted a text mining analysis of 40 years of literature on cardiac aging from PubMed to investigate the current understanding on cardiac aging and its mechanisms. This study aimed to embody what most researchers consider cardiac aging to be. Methods We used multiple text mining and machine learning tools to extract important information from a large amount of text. Results Analysis revealed that the terms most frequently associated with cardiac aging include "diastolic," "hypertrophy," "fibrosis," "apoptosis," "mitochondrial," "oxidative," and "autophagy." These terms suggest that cardiac aging is characterized by mitochondrial dysfunction, oxidative stress, and impairment of autophagy, especially mitophagy. We also revealed an increase in the frequency of occurrence of "autophagy" in recent years, suggesting that research on autophagy has made a breakthrough in the field of cardiac aging. Additionally, the frequency of occurrence of "mitophagy" has increased significantly since 2019, suggesting that mitophagy is an important factor in cardiac aging. Conclusions Cardiac aging is a complex process that involves mitochondrial dysfunction, oxidative stress, and impairment of autophagy, especially mitophagy. Further research is warranted to elucidate the mechanisms of cardiac aging and develop strategies to mitigate its detrimental effects.
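The term-frequency trend described in this abstract can be illustrated with a minimal sketch. The abstract does not name the tools the authors used, so the function `term_frequency_by_year` and the toy `records` below are hypothetical stand-ins, not the study's pipeline; the sketch simply counts, per publication year, how many abstracts mention each term of interest.

```python
from collections import Counter, defaultdict
import re


def term_frequency_by_year(records, terms):
    """Count, per year, how many abstracts mention each term (hypothetical helper)."""
    wanted = {t.lower() for t in terms}
    counts = defaultdict(Counter)
    for year, abstract in records:
        # Tokenize into lowercase alphabetic words; each term counts once per abstract.
        tokens = set(re.findall(r"[a-z]+", abstract.lower()))
        for term in wanted & tokens:
            counts[year][term] += 1
    return counts


# Toy records (year, abstract text) invented purely for illustration.
records = [
    (2015, "Oxidative stress and fibrosis in the aging heart."),
    (2019, "Mitophagy and autophagy protect aged cardiomyocytes."),
    (2021, "Impaired mitophagy drives mitochondrial dysfunction."),
    (2021, "Autophagy, mitophagy and diastolic function."),
]

counts = term_frequency_by_year(records, ["autophagy", "mitophagy", "fibrosis"])
for year in sorted(counts):
    print(year, dict(counts[year]))
```

Counting each term once per abstract, rather than every occurrence, keeps a single long article from dominating a year's total; a real analysis would likely also normalize by the number of articles published that year.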
Affiliation(s)
- Takahiro Kamihara
- Department of Cardiology, National Center for Geriatrics and Gerontology, Obu, Japan
| | - Ken Tanaka
- Department of Public Health, University of Hawaii at Manoa, Honolulu, Hawaii, USA
| | - Takuya Omura
- Department of Metabolic Research, National Center for Geriatrics and Gerontology, Obu, Japan
| | - Shinji Kaneko
- Department of Cardiology, Toyota Kosei Hospital, Toyota, Japan
| | - Akihiro Hirashiki
- Department of Cardiology, National Center for Geriatrics and Gerontology, Obu, Japan
| | - Manabu Kokubo
- Department of Cardiology, National Center for Geriatrics and Gerontology, Obu, Japan
| | - Atsuya Shimizu
- Department of Cardiology, National Center for Geriatrics and Gerontology, Obu, Japan
|
9
|
Lambert SA, Wingfield B, Gibson JT, Gil L, Ramachandran S, Yvon F, Saverimuttu S, Tinsley E, Lewis E, Ritchie SC, Wu J, Canovas R, McMahon A, Harris LW, Parkinson H, Inouye M. The Polygenic Score Catalog: new functionality and tools to enable FAIR research. medRxiv 2024:2024.05.29.24307783. [PMID: 38853961 PMCID: PMC11160819 DOI: 10.1101/2024.05.29.24307783] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2024]
Abstract
Polygenic scores (PGS) have transformed human genetic research and have multiple potential clinical applications, including risk stratification for disease prevention and prediction of treatment response. Here, we present a series of recent enhancements to the PGS Catalog (www.PGSCatalog.org), the largest findable, accessible, interoperable, and reusable (FAIR) repository of PGS. These include expansions in data content and ancestral diversity as well as the addition of new features. We further present the PGS Catalog Calculator (pgsc_calc, https://github.com/PGScatalog/pgsc_calc), an open-source, scalable and portable pipeline to reproducibly calculate PGS that securely democratizes equitable PGS applications by implementing genetic ancestry estimation and score normalization using reference data. With the PGS Catalog & calculator users can now quantify an individual's genetic predisposition for hundreds of common diseases and clinically relevant traits. Taken together, these updates and tools facilitate the next generation of PGS, thus lowering barriers to the clinical studies necessary to identify where PGS may be integrated into clinical practice.
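As a rough illustration of what calculating and normalizing a polygenic score involves, the sketch below computes a score as a weighted sum of allele dosages and then expresses it as a z-score against a reference distribution. This is an assumption-laden toy, not the pgsc_calc implementation: the function names, variant identifiers, weights and reference scores are all hypothetical, and the simple z-score stands in for the ancestry-aware normalization the abstract mentions.

```python
from statistics import mean, pstdev


def polygenic_score(dosages, weights):
    """Weighted sum of effect-allele dosages; variants missing from dosages count as 0."""
    return sum(weight * dosages.get(variant, 0.0) for variant, weight in weights.items())


def normalize(score, reference_scores):
    """Z-score of one score relative to a reference panel's scores (simplifying assumption)."""
    mu = mean(reference_scores)
    sigma = pstdev(reference_scores)
    return (score - mu) / sigma


# Hypothetical effect weights and one individual's dosages (0, 1 or 2 effect alleles).
weights = {"rs1": 0.2, "rs2": -0.1, "rs3": 0.05}
dosages = {"rs1": 2, "rs2": 1, "rs3": 0}

raw = polygenic_score(dosages, weights)
z = normalize(raw, reference_scores=[0.1, 0.2, 0.3])
print(round(raw, 3), round(z, 3))
```

Normalizing against reference data matters because a raw weighted sum has no intrinsic scale; it only becomes interpretable relative to the distribution of scores in a comparable population.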
Affiliation(s)
- Samuel A. Lambert
- Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, Cambridge, UK
- Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge, UK
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Benjamin Wingfield
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Joel T. Gibson
- Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, Cambridge, UK
| | - Laurent Gil
- Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge, UK
- Wellcome Sanger Institute, Hinxton, UK
| | - Santhi Ramachandran
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Florent Yvon
- Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, Cambridge, UK
| | - Shirin Saverimuttu
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Emily Tinsley
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Elizabeth Lewis
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Scott C. Ritchie
- Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, Cambridge, UK
- British Heart Foundation Centre of Research Excellence, University of Cambridge, Cambridge, UK
| | - Jingqin Wu
- Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, VIC, Australia
| | - Rodrigo Canovas
- Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, VIC, Australia
| | - Aoife McMahon
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Laura W. Harris
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Helen Parkinson
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Michael Inouye
- Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, Cambridge, UK
- Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge, UK
- British Heart Foundation Centre of Research Excellence, University of Cambridge, Cambridge, UK
- Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, VIC, Australia
|
10
|
Jin Q, Leaman R, Lu Z. PubMed and beyond: biomedical literature search in the age of artificial intelligence. EBioMedicine 2024; 100:104988. [PMID: 38306900 PMCID: PMC10850402 DOI: 10.1016/j.ebiom.2024.104988] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/23/2023] [Revised: 01/14/2024] [Accepted: 01/15/2024] [Indexed: 02/04/2024] Open
Abstract
Biomedical research yields vast information, much of which is only accessible through the literature. Consequently, literature search is crucial for healthcare and biomedicine. Recent improvements in artificial intelligence (AI) have expanded functionality beyond keywords, but they might be unfamiliar to clinicians and researchers. In response, we present an overview of over 30 literature search tools tailored to common biomedical use cases, aiming at helping readers efficiently fulfill their information needs. We first discuss recent improvements and continued challenges of the widely used PubMed. Then, we describe AI-based literature search tools catering to five specific information needs: 1. Evidence-based medicine. 2. Precision medicine and genomics. 3. Searching by meaning, including questions. 4. Finding related articles with literature recommendation. 5. Discovering hidden associations through literature mining. Finally, we discuss the impacts of recent developments of large language models such as ChatGPT on biomedical information seeking.
Affiliation(s)
- Qiao Jin
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
| | - Robert Leaman
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA.
|
11
|
de Crécy-Lagard V, Swairjo MA. On the necessity to include multiple types of evidence when predicting molecular function of proteins. bioRxiv 2023:2023.12.18.571875. [PMID: 38187591 PMCID: PMC10769224 DOI: 10.1101/2023.12.18.571875] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/09/2024]
Abstract
Machine learning-based platforms are currently revolutionizing many fields of molecular biology including structure prediction for monomers or complexes, predicting the consequences of mutations, or predicting the functions of proteins. However, these platforms use training sets based on currently available knowledge and, in essence, are not built to discover novelty. Hence, claims of discovering novel functions for protein families using artificial intelligence should be carefully dissected, as the dangers of overpredictions are real, as we show in a detailed analysis of the prediction made by Kim et al. [1] on the function of the YciO protein in the model organism Escherichia coli.
|
12
|
Xu HQ, Xiao H, Bu JH, Hong YF, Liu YH, Tao ZY, Ding SF, Xia YT, Wu E, Yan Z, Zhang W, Chen GX, Zhu F, Tao L. EMNPD: a comprehensive endophytic microorganism natural products database for prompt the discovery of new bioactive substances. J Cheminform 2023; 15:115. [PMID: 38017550 PMCID: PMC10683116 DOI: 10.1186/s13321-023-00779-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2023] [Accepted: 11/05/2023] [Indexed: 11/30/2023] Open
Abstract
The discovery and utilization of natural products derived from endophytic microorganisms have garnered significant attention in pharmaceutical research. While remarkable progress has been made in this field each year, the absence of dedicated open-access databases for endophytic microorganism natural products research is evident. To address the increasing demand for mining and sharing of data resources related to endophytic microorganism natural products, this study introduces EMNPD, a comprehensive endophytic microorganism natural products database comprising manually curated data. Currently, EMNPD offers 6632 natural products from 1017 endophytic microorganisms, targeting 1286 entities (including 94 proteins, 282 cell lines, and 910 species) with 91 diverse bioactivities. It encompasses the physico-chemical properties of natural products, ADMET information, quantitative activity data with their potency, natural products contents with diverse fermentation conditions, systematic taxonomy, and links to various well-established databases. EMNPD aims to function as an open-access knowledge repository for the study of endophytic microorganisms and their natural products, thereby facilitating drug discovery research and exploration of bioactive substances. The database can be accessed at http://emnpd.idrblab.cn/ without the need for registration, enabling researchers to freely download the data. EMNPD is expected to become a valuable resource in the field of endophytic microorganism natural products and contribute to future drug development endeavors.
Affiliation(s)
- Hong-Quan Xu
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
| | - Huan Xiao
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
| | - Jin-Hui Bu
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
| | - Yan-Feng Hong
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
| | - Yu-Hong Liu
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
| | - Zi-Yue Tao
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
| | - Shu-Fan Ding
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
| | - Yi-Tong Xia
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
| | - E Wu
- Rehabilitation and Nursing School, Hangzhou Vocational & Technical College, Hangzhou, 310018, Zhejiang, China
| | - Zhen Yan
- The Affiliated Hospital of Hangzhou Normal University, Hangzhou, 310000, China
- First Clinical Medical Institute, Nanjing University of Chinese Medicine, Nanjing, 210023, Jiangsu, China
| | - Wei Zhang
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China
- Innovation Institute for Affiliated Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou, 330110, China
| | - Gong-Xing Chen
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
| | - Feng Zhu
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China.
- Innovation Institute for Affiliated Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou, 330110, China.
| | - Lin Tao
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China.
|
13
|
Yang X, Saha S, Venkatesan A, Tirunagari S, Vartak V, McEntyre J. Europe PMC annotated full-text corpus for gene/proteins, diseases and organisms. Sci Data 2023; 10:722. [PMID: 37857688 PMCID: PMC10587067 DOI: 10.1038/s41597-023-02617-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2023] [Accepted: 10/03/2023] [Indexed: 10/21/2023] Open
Abstract
Named entity recognition (NER) is a widely used text-mining and natural language processing (NLP) subtask. In recent years, deep learning methods have superseded traditional dictionary- and rule-based NER approaches. A high-quality dataset is essential to fully leverage recent deep learning advancements. While several gold-standard corpora for biomedical entities in abstracts exist, only a few are based on full-text research articles. The Europe PMC literature database routinely annotates Gene/Proteins, Diseases, and Organisms entities. To transition this pipeline from a dictionary-based to a machine learning-based approach, we have developed a human-annotated full-text corpus for these entities, comprising 300 full-text open-access research articles. Over 72,000 mentions of biomedical concepts have been identified within approximately 114,000 sentences. This article describes the corpus and details how to access and reuse this open community resource.
Affiliation(s)
- Xiao Yang
- Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
| | - Shyamasree Saha
- Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Open Targets, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - Aravind Venkatesan
- Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
| | - Santosh Tirunagari
- Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK.
- Open Targets, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
| | - Vid Vartak
- Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
| | - Johanna McEntyre
- Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
|
14
|
Lehmann R. When We Publish: Accuracy and Quality Control in the Time of Open Access. Annu Rev Cell Dev Biol 2023; 39:v-ix. [PMID: 37843927 DOI: 10.1146/annurev-cb-39-091823-100001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2023]
|
15
|
UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Res 2023; 51:D523-D531. [PMID: 36408920 PMCID: PMC9825514 DOI: 10.1093/nar/gkac1052] [Citation(s) in RCA: 1758] [Impact Index Per Article: 1758.0] [Received: 09/16/2022] [Revised: 10/05/2022] [Accepted: 10/25/2022] [Indexed: 11/22/2022] Open
Abstract
The aim of the UniProt Knowledgebase is to provide users with a comprehensive, high-quality and freely accessible set of protein sequences annotated with functional information. In this publication we describe enhancements made to our data processing pipeline and to our website to adapt to an ever-increasing information content. The number of sequences in UniProtKB has risen to over 227 million and we are working towards including a reference proteome for each taxonomic group. We continue to extract detailed annotations from the literature to update or create reviewed entries, while unreviewed entries are supplemented with annotations provided by automated systems using a variety of machine-learning techniques. In addition, the scientific community continues their contributions of publications and annotations to UniProt entries of their interest. Finally, we describe our new website (https://www.uniprot.org/), designed to enhance our users' experience and make our data easily accessible to the research community. This interface includes access to AlphaFold structures for more than 85% of all entries as well as improved visualisations for subcellular localisation of proteins.
|
16
|
Coudert E, Gehant S, de Castro E, Pozzato M, Baratin D, Neto T, Sigrist CJA, Redaschi N, Bridge A. Annotation of biologically relevant ligands in UniProtKB using ChEBI. Bioinformatics 2023; 39:6885442. [PMID: 36484697 PMCID: PMC9825770 DOI: 10.1093/bioinformatics/btac793] [Citation(s) in RCA: 73] [Impact Index Per Article: 73.0] [Received: 08/19/2022] [Revised: 11/09/2022] [Accepted: 12/08/2022] [Indexed: 12/14/2022] Open
Abstract
MOTIVATION To provide high quality, computationally tractable annotation of binding sites for biologically relevant (cognate) ligands in UniProtKB using the chemical ontology ChEBI (Chemical Entities of Biological Interest), to better support efforts to study and predict functionally relevant interactions between protein sequences and structures and small molecule ligands. RESULTS We structured the data model for cognate ligand binding site annotations in UniProtKB and performed a complete reannotation of all cognate ligand binding sites using stable unique identifiers from ChEBI, which we now use as the reference vocabulary for all such annotations. We developed improved search and query facilities for cognate ligands in the UniProt website, REST API and SPARQL endpoint that leverage the chemical structure data, nomenclature and classification that ChEBI provides. AVAILABILITY AND IMPLEMENTATION Binding site annotations for cognate ligands described using ChEBI are available for UniProtKB protein sequence records in several formats (text, XML and RDF) and are freely available to query and download through the UniProt website (www.uniprot.org), REST API (www.uniprot.org/help/api), SPARQL endpoint (sparql.uniprot.org/) and FTP site (https://ftp.uniprot.org/pub/databases/uniprot/). SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Elisabeth Coudert
- Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1211 Geneva 4, Switzerland
| | - Sebastien Gehant
- Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1211 Geneva 4, Switzerland
| | - Edouard de Castro
- Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1211 Geneva 4, Switzerland
| | - Monica Pozzato
- Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1211 Geneva 4, Switzerland
| | - Delphine Baratin
- Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1211 Geneva 4, Switzerland
| | - Teresa Neto
- Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1211 Geneva 4, Switzerland
| | - Christian J A Sigrist
- Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1211 Geneva 4, Switzerland
| | - Nicole Redaschi
- Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1211 Geneva 4, Switzerland
| | | | - The UniProt Consortium
Bridge Alan J, Aimo Lucila, Argoud-Puy Ghislaine, Auchincloss Andrea H, Axelsen Kristian B, Bansal Parit, Baratin Delphine, Neto Teresa M Batista, Blatter Marie-Claude, Bolleman Jerven T, Boutet Emmanuel, Breuza Lionel, Gil Blanca Cabrera, Casals-Casas Cristina, Echioukh Kamal Chikh, Coudert Elisabeth, Cuche Beatrice, de Castro Edouard, Estreicher Anne, Famiglietti Maria L, Feuermann Marc, Gasteiger Elisabeth, Gaudet Pascale, Gehant Sebastien, Gerritsen Vivienne, Gos Arnaud, Gruaz Nadine, Hulo Chantal, Hyka-Nouspikel Nevila, Jungo Florence, Kerhornou Arnaud, Le Mercier Philippe, Lieberherr Damien, Masson Patrick, Morgat Anne, Muthukrishnan Venkatesh, Paesano Salvo, Pedruzzi Ivo, Pilbout Sandrine, Pourcel Lucille, Poux Sylvain, Pozzato Monica, Pruess Manuela, Redaschi Nicole, Rivoire Catherine, Sigrist Christian J A, Sonesson Karin, Sundaram Shyamala, Bateman Alex, Martin Maria-Jesus, Orchard Sandra, Magrane Michele, Ahmad Shadab, Alpi Emanuele, Bowler-Barnett Emily H, Britto Ramona, Bye-A-Jee Hema, Cukura Austra, Denny Paul, Dogan Tunca, Ebenezer ThankGod, Fan Jun, Garmiri Penelope, da Costa Gonzales Leonardo Jose, Hatton-Ellis Emma, Hussein Abdulrahman, Ignatchenko Alexandr, Insana Giuseppe, Ishtiaq Rizwan, Joshi Vishal, Jyothi Dushyanth, Kandasaamy Swaathi, Lock Antonia, Luciani Aurelien, Lugaric Marija, Luo Jie, Lussi Yvonne, MacDougall Alistair, Madeira Fabio, Mahmoudy Mahdi, Mishra Alok, Moulang Katie, Nightingale Andrew, Pundir Sangya, Qi Guoying, Raj Shriya, Raposo Pedro, Rice Daniel L, Saidi Rabie, Santos Rafael, Speretta Elena, Stephenson James, Totoo Prabhat, Turner Edward, Tyagi Nidhi, Vasudev Preethi, Warner Kate, Watkins Xavier, Zaru Rossana, Zellner Hermann, Wu Cathy H, Arighi Cecilia N, Arminski Leslie, Chen Chuming, Chen Yongxing, Huang Hongzhan, Laiho Kati, McGarvey Peter, Natale Darren A, Ross Karen, Vinayaka C R, Wang Qinghua, Wang Yuqi
- Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1211 Geneva 4, Switzerland
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridgeshire CB10 1SD, UK
- Protein Information Resource, University of Delaware, Newark, DE 19711, USA
- Protein Information Resource, Georgetown University Medical Center, Washington, DC 20007, USA
|
17
|
Comprehensively identifying Long Covid articles with human-in-the-loop machine learning. Patterns (N Y) 2022; 4:100659. [PMID: 36471749 PMCID: PMC9712067 DOI: 10.1016/j.patter.2022.100659] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 07/29/2022] [Revised: 09/19/2022] [Accepted: 11/17/2022] [Indexed: 12/05/2022]
Abstract
A significant percentage of COVID-19 survivors experience ongoing multisystemic symptoms that often affect daily living, a condition known as Long Covid or post-acute-sequelae of SARS-CoV-2 infection. However, identifying scientific articles relevant to Long Covid is challenging since there is no standardized or consensus terminology. We developed an iterative human-in-the-loop machine learning framework combining data programming with active learning into a robust ensemble model, demonstrating higher specificity and considerably higher sensitivity than other methods. Analysis of the Long Covid Collection shows that (1) most Long Covid articles do not refer to Long Covid by any name, (2) when the condition is named, the name used most frequently in the literature is Long Covid, and (3) Long Covid is associated with disorders in a wide variety of body systems. The Long Covid Collection is updated weekly and is searchable online at the LitCovid portal: https://www.ncbi.nlm.nih.gov/research/coronavirus/docsum?filters=e_condition.LongCovid.
|
18
|
Chen Q, Allot A, Leaman R, Wei CH, Aghaarabi E, Guerrerio J, Xu L, Lu Z. LitCovid in 2022: an information resource for the COVID-19 literature. Nucleic Acids Res 2022; 51:D1512-D1518. [PMID: 36350613 PMCID: PMC9825538 DOI: 10.1093/nar/gkac1005] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 09/15/2022] [Revised: 10/11/2022] [Accepted: 10/19/2022] [Indexed: 11/11/2022] Open
Abstract
LitCovid (https://www.ncbi.nlm.nih.gov/research/coronavirus/), first launched in February 2020, is a first-of-its-kind literature hub for tracking up-to-date published research on COVID-19. The number of articles in LitCovid has increased from 55 000 to ∼300 000 over the past 2.5 years, with a consistent growth rate of ∼10 000 articles per month. In addition to the rapid literature growth, the COVID-19 pandemic has evolved dramatically. For instance, the Omicron variant has now accounted for over 98% of new infections in the United States. In response to the continuing evolution of the COVID-19 pandemic, this article describes significant updates to LitCovid over the last 2 years. First, we introduced the long Covid collection consisting of the articles on COVID-19 survivors experiencing ongoing multisystemic symptoms, including respiratory issues, cardiovascular disease, cognitive impairment, and profound fatigue. Second, we provided new annotations on the latest COVID-19 strains and vaccines mentioned in the literature. Third, we improved several existing features with more accurate machine learning algorithms for annotating topics and classifying articles relevant to COVID-19. LitCovid has been widely used with millions of accesses by users worldwide on various information needs and continues to play a critical role in collecting, curating and standardizing the latest knowledge on the COVID-19 literature.
Affiliation(s)
| | | | - Robert Leaman
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, MD, USA
| | - Chih-Hsuan Wei
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, MD, USA
| | | | | | | | - Zhiyong Lu
- To whom correspondence should be addressed. Tel: +1 301 594 7089; Fax: +1 301 480 2290;
|
19
|
Hamdan HZ, Hamdan SZ, Adam I. Association of Selenium Levels with Gestational Diabetes Mellitus: An Updated Systematic Review and Meta-Analysis. Nutrients 2022; 14:3941. [PMID: 36235594 PMCID: PMC9570773 DOI: 10.3390/nu14193941] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 07/21/2022] [Revised: 09/14/2022] [Accepted: 09/20/2022] [Indexed: 11/16/2022] Open
Abstract
Several studies have investigated the association between selenium levels and gestational diabetes mellitus (GDM); however, their results are not conclusive. This systematic review and meta-analysis aimed to update and draw conclusions regarding the evidence from published studies that investigated selenium levels in relation to GDM. PubMed, Google Scholar, Cochrane Library and ScienceDirect were searched for studies related to selenium and GDM, published from the inception of each database through to July 2022. The meta-analysis was conducted by measuring the standardized mean difference (SMD) between the selenium levels of women with GDM and those of pregnant women without GDM (control group). Stratified meta-analysis, meta-regression analysis and reporting bias were applied. The "meta" package in the open-access software R was used to analyze all of the data. A total of 12 studies, including 940 pregnant women with GDM and 1749 controls, met this study's inclusion criteria. The selenium levels were significantly lower in women with GDM compared with the control group (SMD = -0.66; 95% confidence interval (CI): (-1.04, -0.28); p ≤ 0.001). Due to significant heterogeneity (I² = 94%, Cochran's Q = 186.7; p ≤ 0.0001), the random-effects model was followed. The stratified meta-analysis showed that the selenium levels were lower in the cases compared with the normal controls in the third trimester (SMD = -1.85 (-3.03, -0.66); p ≤ 0.01). The same trend was observed in the studies published before the year 2014 (SMD = -0.99 (-1.70, -0.28); p ≤ 0.01) and those published in or after 2014 (SMD = -0.45 (-0.90, 0.00); p = 0.05). None of the investigated covariates in the meta-regression analysis (each study's geographic location, trimester of selenium quantification, World Bank economic classification, method of selenium determination, study design, study quality score, publication year and study's sample size) were significantly associated with the selenium SMD.
The current evidence indicates that selenium levels are lower among women with GDM in comparison to those without GDM; however, after the correction of the reporting bias, the result was no longer significant. Further studies with more prospective designs are needed to confirm this evidence and explain the function of selenium in GDM throughout pregnancy.
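As a rough illustration of how a pooled SMD, Cochran's Q and I² of the kind reported in this abstract can be computed, the sketch below pools per-study effect sizes with a random-effects model. It is not the authors' analysis: they used the R "meta" package, whereas this is plain Python, and the DerSimonian-Laird estimator of the between-study variance is an assumption (the abstract does not name the estimator). The function name `random_effects_pool` and the input numbers are hypothetical.

```python
import math


def random_effects_pool(effects, variances, z=1.96):
    """Pool per-study effects (e.g. SMDs) with a DerSimonian-Laird random-effects model.

    `effects` and `variances` are parallel lists, one entry per study.
    The estimator choice is an assumption made for illustration.
    """
    k = len(effects)
    w = [1.0 / v for v in variances]          # inverse-variance (fixed-effect) weights
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w

    # Cochran's Q: weighted squared deviations from the fixed-effect estimate.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = k - 1

    # DerSimonian-Laird between-study variance, truncated at zero.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # I^2: share of total variability attributed to heterogeneity, in percent.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    # Random-effects weights add tau^2 to every study's variance.
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))

    return {
        "pooled": pooled,
        "ci_low": pooled - z * se,
        "ci_high": pooled + z * se,
        "q": q,
        "tau2": tau2,
        "i2": i2,
    }


if __name__ == "__main__":
    # Hypothetical SMDs and variances for three studies (not data from the review).
    result = random_effects_pool([-0.2, -0.8, -1.5], [0.04, 0.05, 0.09])
    for name, value in result.items():
        print(f"{name}: {value:.3f}")
```

When Q is large relative to its degrees of freedom, the between-study variance is positive and the random-effects interval is wider than the fixed-effect one, which is why high heterogeneity motivates the random-effects model.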
Affiliation(s)
- Hamdan Z. Hamdan
- Department of Basic Medical Sciences, Unaizah College of Medicine and Medical Sciences, Qassim University, Unaizah 56219, Saudi Arabia
- Faculty of Medicine, Al-Neelain University, Khartoum 12702, Sudan
- Ishag Adam
- Department of Obstetrics and Gynecology, Unaizah College of Medicine and Medical Sciences, Qassim University, Unaizah 56219, Saudi Arabia
20
Xu Q, Liu Y, Hu J, Duan X, Song N, Zhou J, Zhai J, Su J, Liu S, Chen F, Zheng W, Guo Z, Li H, Zhou Q, Niu B. OncoPubMiner: a platform for mining oncology publications. Brief Bioinform 2022; 23:6691792. [PMID: 36058206 DOI: 10.1093/bib/bbac383] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/11/2022] [Revised: 08/08/2022] [Accepted: 08/09/2022] [Indexed: 11/12/2022]
Abstract
Updated and expert-quality knowledge bases are fundamental to biomedical research. A knowledge base established with human participation and subject to multiple inspections is needed to support clinical decision making, especially in the growing field of precision oncology. The number of original publications in this field has risen dramatically with the advances in technology and the evolution of in-depth research. Consequently, the issue of how to gather and mine these articles accurately and efficiently now requires close consideration. In this study, we present OncoPubMiner (https://oncopubminer.chosenmedinfo.com), a free and powerful system that combines text mining, data structure customisation, publication search with online reading and project-centred and team-based data collection to form a one-stop 'keyword in-knowledge out' oncology publication mining platform. The platform was constructed by integrating all open-access abstracts from PubMed and full-text articles from PubMed Central, and it is updated daily. OncoPubMiner makes obtaining precision oncology knowledge from scientific articles straightforward and will assist researchers in efficiently developing structured knowledge base systems and bring us closer to achieving precision oncology goals.
Affiliation(s)
- Quan Xu
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China
- Yueyue Liu
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China; ChosenMed Gene Technology Co. Ltd., Nanjing, China
- Jifang Hu
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China; Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100190, China
- Xiaohong Duan
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China; ChosenMed Gene Technology Co. Ltd., Nanjing, China
- Niuben Song
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China
- Jiale Zhou
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China
- Jincheng Zhai
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China
- Junyan Su
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China
- Siyao Liu
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China
- Fan Chen
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China; ChosenMed Gene Technology Co. Ltd., Nanjing, China
- Wei Zheng
- The Department of Nephrology and Hypertension Medicine, Beijing Electric Power Hospital, Beijing 100073, China
- Zhongjia Guo
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China
- Hexiang Li
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China
- Qiming Zhou
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China; ChosenMed Gene Technology Co. Ltd., Nanjing, China
- Beifang Niu
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China; Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100190, China
21
Chen Q, Du J, Allot A, Lu Z. LitMC-BERT: Transformer-Based Multi-Label Classification of Biomedical Literature With An Application on COVID-19 Literature Curation. IEEE/ACM Trans Comput Biol Bioinform 2022; 19:2584-2595. [PMID: 35536809 PMCID: PMC9647722 DOI: 10.1109/tcbb.2022.3173562] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/11/2021] [Revised: 04/19/2022] [Accepted: 04/22/2022] [Indexed: 05/20/2023]
Abstract
The rapid growth of biomedical literature poses a significant challenge for curation and interpretation. This has become more evident during the COVID-19 pandemic. LitCovid, a literature database of COVID-19 related papers in PubMed, has accumulated over 200,000 articles with millions of accesses. Approximately 10,000 new articles are added to LitCovid every month. A main curation task in LitCovid is topic annotation, where an article is assigned up to eight topics, e.g., Treatment and Diagnosis. The annotated topics have been widely used both in LitCovid (e.g., accounting for ∼18% of total uses) and downstream studies such as network generation. However, topic annotation has been a primary curation bottleneck due to the nature of the task and the rapid literature growth. This study proposes LITMC-BERT, a transformer-based multi-label classification method for biomedical literature. It uses a shared transformer backbone for all the labels while also capturing label-specific features and the correlations between label pairs. We compare LITMC-BERT with three baseline models on two datasets. Its micro-F1 and instance-based F1 are 5% and 4% higher than the current best results, respectively, and it requires only ∼18% of the inference time of the Binary BERT baseline. The related datasets and models are available via https://github.com/ncbi/ml-transformer.
22
Chen Q, Allot A, Leaman R, Islamaj R, Du J, Fang L, Wang K, Xu S, Zhang Y, Bagherzadeh P, Bergler S, Bhatnagar A, Bhavsar N, Chang YC, Lin SJ, Tang W, Zhang H, Tavchioski I, Pollak S, Tian S, Zhang J, Otmakhova Y, Yepes AJ, Dong H, Wu H, Dufour R, Labrak Y, Chatterjee N, Tandon K, Laleye FAA, Rakotoson L, Chersoni E, Gu J, Friedrich A, Pujari SC, Chizhikova M, Sivadasan N, Vg S, Lu Z. Multi-label classification for biomedical literature: an overview of the BioCreative VII LitCovid Track for COVID-19 literature topic annotations. Database (Oxford) 2022; 2022:baac069. [PMID: 36043400 PMCID: PMC9428574 DOI: 10.1093/database/baac069] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Received: 04/20/2022] [Revised: 08/02/2022] [Accepted: 08/13/2022] [Indexed: 05/03/2023]
Abstract
The coronavirus disease 2019 (COVID-19) pandemic has been severely impacting global society since December 2019. The related findings such as vaccine and drug development have been reported in biomedical literature at a rate of about 10 000 articles on COVID-19 per month. Such rapid growth significantly challenges manual curation and interpretation. For instance, LitCovid is a literature database of COVID-19-related articles in PubMed, which has accumulated more than 200 000 articles with millions of accesses each month by users worldwide. One primary curation task is to assign up to eight topics (e.g. Diagnosis and Treatment) to the articles in LitCovid. The annotated topics have been widely used for navigating the COVID literature, rapidly locating articles of interest and other downstream studies. However, annotating the topics has been the bottleneck of manual curation. Despite the continuing advances in biomedical text-mining methods, few have been dedicated to topic annotations in COVID-19 literature. To close the gap, we organized the BioCreative LitCovid track to call for a community effort to tackle automated topic annotation for COVID-19 literature. The BioCreative LitCovid dataset, consisting of over 30 000 articles with manually reviewed topics, was created for training and testing. It is one of the largest multi-label classification datasets in biomedical scientific literature. Nineteen teams worldwide participated and made 80 submissions in total. Most teams used hybrid systems based on transformers. The highest performing submissions achieved 0.8875, 0.9181 and 0.9394 for macro-F1-score, micro-F1-score and instance-based F1-score, respectively. Notably, these scores are substantially higher (e.g. 12% higher for macro F1-score) than the corresponding scores of the state-of-the-art multi-label classification method. The level of participation and results demonstrate a successful track and help close the gap between dataset curation and method development. The dataset is publicly available via https://ftp.ncbi.nlm.nih.gov/pub/lu/LitCovid/biocreative/ for benchmarking and further development. Database URL: https://ftp.ncbi.nlm.nih.gov/pub/lu/LitCovid/biocreative/.
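The three scores named in this abstract (macro-F1, micro-F1 and instance-based F1) average the same per-decision counts in different ways. The sketch below is only an illustration of those standard definitions, not the track's official evaluation script; the function names, the toy topic labels and the convention of scoring an empty prediction against an empty gold set as 0 are assumptions.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from true-positive, false-positive and false-negative counts."""
    denom = 2 * tp + fp + fn
    # Assumption: an empty gold set matched by an empty prediction scores 0.
    return 2 * tp / denom if denom else 0.0


def multilabel_f1(gold, pred, labels):
    """Compute macro-, micro- and instance-based F1 for multi-label predictions.

    `gold` and `pred` are parallel lists of label sets, one set per article.
    """
    per_label_f1 = []
    total_tp = total_fp = total_fn = 0
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        per_label_f1.append(f1_from_counts(tp, fp, fn))
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn

    # Macro: average the per-label F1 scores, so rare labels count equally.
    macro = sum(per_label_f1) / len(per_label_f1)
    # Micro: pool all counts first, so frequent labels dominate.
    micro = f1_from_counts(total_tp, total_fp, total_fn)
    # Instance-based: compute F1 per article, then average over articles.
    instance = sum(
        f1_from_counts(len(g & p), len(p - g), len(g - p)) for g, p in zip(gold, pred)
    ) / len(gold)
    return {"macro_f1": macro, "micro_f1": micro, "instance_f1": instance}


if __name__ == "__main__":
    # Toy example with hypothetical annotations for three articles.
    gold = [{"Treatment", "Diagnosis"}, {"Prevention"}, {"Treatment"}]
    pred = [{"Treatment"}, {"Prevention", "Diagnosis"}, {"Treatment"}]
    print(multilabel_f1(gold, pred, ["Treatment", "Diagnosis", "Prevention"]))
```

Because macro-F1 weights every topic equally, a single poorly predicted rare topic pulls it down more than it pulls down the micro or instance-based scores, which is consistent with macro-F1 being the lowest of the three reported values.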
Affiliation(s)
- Qingyu Chen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20892, USA
- Alexis Allot
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20892, USA
- Robert Leaman
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20892, USA
- Rezarta Islamaj
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20892, USA
- Jingcheng Du
- School of Biomedical Informatics, UT Health, Houston, TX 77030, USA
- Li Fang
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, USA
- Kai Wang
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, USA
- Department of Pathology and Laboratory Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Shuo Xu
- College of Economics and Management, Beijing University of Technology, Beijing, China
- Yuefu Zhang
- College of Economics and Management, Beijing University of Technology, Beijing, China
- Yung-Chun Chang
- Graduate Institute of Data Science, Taipei Medical University, Taipei, Taiwan
- Sheng-Jie Lin
- Graduate Institute of Data Science, Taipei Medical University, Taipei, Taiwan
- Wentai Tang
- College of Computer Science and Technology, Dalian University of Technology, Dalian, China
- Hongtong Zhang
- College of Computer Science and Technology, Dalian University of Technology, Dalian, China
- Ilija Tavchioski
- Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia
- Jožef Stefan Institute, Ljubljana, Slovenia
- Shubo Tian
- Department of Statistics, Florida State University, Tallahassee, FL, USA
- Jinfeng Zhang
- Department of Statistics, Florida State University, Tallahassee, FL, USA
- Yulia Otmakhova
- School of Computing and Information Systems, University of Melbourne, Melbourne, VIC, Australia
- Hang Dong
- Centre for Medical Informatics, Usher Institute, University of Edinburgh, Edinburgh, UK
- Honghan Wu
- Institute of Health Informatics, University College London, London, UK
- Niladri Chatterjee
- Department of Mathematics, Indian Institute of Technology Delhi, New Delhi, India
- Kushagri Tandon
- Department of Mathematics, Indian Institute of Technology Delhi, New Delhi, India
- Emmanuele Chersoni
- Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Hong Kong, China
- Jinghang Gu
- Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Hong Kong, China
- Subhash Chandra Pujari
- Institute of Computer Science, Heidelberg University, Heidelberg, Germany
- Bosch Center for Artificial Intelligence, Renningen, Germany
- Mariia Chizhikova
- SINAI Group, Department of Computer Science, Advanced Studies Center in ICT (CEATIC), Universidad de Jaén, Jaén, Spain
- Zhiyong Lu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20892, USA
23
de Crécy-Lagard V, Amorin de Hegedus R, Arighi C, Babor J, Bateman A, Blaby I, Blaby-Haas C, Bridge AJ, Burley SK, Cleveland S, Colwell LJ, Conesa A, Dallago C, Danchin A, de Waard A, Deutschbauer A, Dias R, Ding Y, Fang G, Friedberg I, Gerlt J, Goldford J, Gorelik M, Gyori BM, Henry C, Hutinet G, Jaroch M, Karp PD, Kondratova L, Lu Z, Marchler-Bauer A, Martin MJ, McWhite C, Moghe GD, Monaghan P, Morgat A, Mungall CJ, Natale DA, Nelson WC, O’Donoghue S, Orengo C, O’Toole KH, Radivojac P, Reed C, Roberts RJ, Rodionov D, Rodionova IA, Rudolf JD, Saleh L, Sheynkman G, Thibaud-Nissen F, Thomas PD, Uetz P, Vallenet D, Carter EW, Weigele PR, Wood V, Wood-Charlson EM, Xu J. A roadmap for the functional annotation of protein families: a community perspective. Database (Oxford) 2022; 2022:baac062. [PMID: 35961013 PMCID: PMC9374478 DOI: 10.1093/database/baac062] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 06/07/2022] [Revised: 06/28/2022] [Accepted: 08/03/2022] [Indexed: 12/23/2022]
Abstract
Over the last 25 years, biology has entered the genomic era and is becoming a science of 'big data'. Most interpretations of genomic analyses rely on accurate functional annotations of the proteins encoded by more than 500 000 genomes sequenced to date. By different estimates, only half the predicted sequenced proteins carry an accurate functional annotation, and this percentage varies drastically between different organismal lineages. Such a large gap in knowledge hampers all aspects of biological enterprise and, thereby, is standing in the way of genomic biology reaching its full potential. A brainstorming meeting to address this issue, funded by the National Science Foundation, was held on 3-4 February 2022. Bringing together data scientists, biocurators, computational biologists and experimentalists within the same venue allowed for a comprehensive assessment of the current state of functional annotations of protein families. Further, major issues that were obstructing the field were identified and discussed, which ultimately allowed for the proposal of solutions on how to move forward.
Affiliation(s)
- Valérie de Crécy-Lagard
- Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
- Cecilia Arighi
- Department of Computer and Information Sciences, University of Delaware, Newark, DE 19713, USA
- Jill Babor
- Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
- Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Ian Blaby
- US Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Crysten Blaby-Haas
- Biology Department, Brookhaven National Laboratory, Upton, NY 11973, USA
- Alan J Bridge
- Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva 4 CH-1211, Switzerland
- Stephen K Burley
- RCSB Protein Data Bank, Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Stacey Cleveland
- Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
- Lucy J Colwell
- Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, UK
- Ana Conesa
- Spanish National Research Council, Institute for Integrative Systems Biology, Paterna, Valencia 46980, Spain
- Christian Dallago
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology, i12, Boltzmannstr. 3, Garching/Munich 85748, Germany
- Antoine Danchin
- School of Biomedical Sciences, Li KaShing Faculty of Medicine, The University of Hong Kong, 21 Sassoon Road, Pokfulam, SAR Hong Kong 999077, China
- Anita de Waard
- Research Collaboration Unit, Elsevier, Jericho, VT 05465, USA
- Adam Deutschbauer
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Raquel Dias
- Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
- Yousong Ding
- Department of Medicinal Chemistry, Center for Natural Products, Drug Discovery and Development, University of Florida, Gainesville, FL 32610, USA
- Gang Fang
- NYU-Shanghai, Shanghai 200120, China
- Iddo Friedberg
- Department of Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA 50011, USA
- John Gerlt
- Institute for Genomic Biology and Departments of Biochemistry and Chemistry, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Joshua Goldford
- Physics of Living Systems, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Mark Gorelik
- Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
- Benjamin M Gyori
- Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA 02115, USA
- Christopher Henry
- Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439, USA
- Geoffrey Hutinet
- Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
- Marshall Jaroch
- Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
- Peter D Karp
- Bioinformatics Research Group, SRI International, Menlo Park, CA 94025, USA
- Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20817, USA
- Aron Marchler-Bauer
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20817, USA
- Maria-Jesus Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Claire McWhite
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08540, USA
- Gaurav D Moghe
- Plant Biology Section, School of Integrative Plant Science, Cornell University, Ithaca, NY 14853, USA
- Paul Monaghan
- Department of Agricultural Education and Communication, University of Florida, Gainesville, FL 32611, USA
- Anne Morgat
- Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva 4 CH-1211, Switzerland
- Christopher J Mungall
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Darren A Natale
- Georgetown University Medical Center, Washington, DC 20007, USA
- William C Nelson
- Biological Sciences Division, Pacific Northwest National Laboratories, Richland, WA 99354, USA
- Seán O’Donoghue
- School of Biotechnology and Biomolecular Sciences, University of NSW, Sydney, NSW 2052, Australia
- Christine Orengo
- Department of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
- Predrag Radivojac
- Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, USA
- Colbie Reed
- Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
- Dmitri Rodionov
- Sanford Burnham Prebys Medical Discovery Institute, La Jolla, CA 92037, USA
- Irina A Rodionova
- Department of Bioengineering, Division of Engineering, University of California at San Diego, La Jolla, CA 92093-0412, USA
- Jeffrey D Rudolf
- Department of Chemistry, University of Florida, Gainesville, FL 32611, USA
- Lana Saleh
- New England Biolabs, Ipswich, MA 01938, USA
- Gloria Sheynkman
- Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville, VA, USA
- Francoise Thibaud-Nissen
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20817, USA
- Paul D Thomas
- Department of Population and Public Health Sciences, University of Southern California, Los Angeles, CA 90033, USA
- Peter Uetz
- Center for Biological Data Science, Virginia Commonwealth University, Richmond, VA 23284, USA
- David Vallenet
- LABGeM, Génomique Métabolique, CEA, Genoscope, Institut François Jacob, Université d’Évry, Université Paris-Saclay, CNRS, Evry 91057, France
- Erica Watson Carter
- Department of Plant Pathology, University of Florida Citrus Research and Education Center, 700 Experiment Station Rd., Lake Alfred, FL 33850, USA
- Valerie Wood
- Department of Biochemistry, University of Cambridge, Cambridge CB2 1GA, UK
- Elisha M Wood-Charlson
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Jin Xu
- Department of Plant Pathology, University of Florida Citrus Research and Education Center, 700 Experiment Station Rd., Lake Alfred, FL 33850, USA
24
Zhang L, Lu W, Chen H, Huang Y, Cheng Q. A comparative evaluation of biomedical similar article recommendation. J Biomed Inform 2022; 131:104106. [PMID: 35661818 DOI: 10.1016/j.jbi.2022.104106] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/10/2021] [Revised: 05/27/2022] [Accepted: 05/28/2022] [Indexed: 11/28/2022]
Abstract
BACKGROUND Biomedical sciences, with their focus on human health and disease, have attracted unprecedented attention in the 21st century. The proliferation of biomedical sciences has also led to a large number of scientific articles being produced, which makes it difficult for biomedical researchers to find relevant articles and hinders the dissemination of valuable discoveries. To bridge this gap, the research community has initiated the article recommendation task, with the aim of recommending articles to biomedical researchers automatically based on their research interests. Over the past two decades, many recommendation methods have been developed. However, an algorithm-level comparison and rigorous evaluation of the most important methods on a shared dataset is still lacking. METHOD In this study, we first investigate 15 methods for automated article recommendation in the biomedical domain. We then conduct an empirical evaluation of the 15 methods, including six term-based methods, two word embedding methods, three sentence embedding methods, two document embedding methods, and two BERT-based methods. These methods are evaluated in two scenarios: article-oriented recommenders and user-oriented recommenders, with two publicly available datasets: TREC 2005 Genomics and RELISH, respectively. RESULTS Our experimental results show that the text representation models BERT and BioSenVec outperform many existing recommendation methods (e.g., BM25, PMRA, XPRC) and web-based recommendation systems (e.g., MScanner, MedlineRanker, BioReader) on both datasets regarding most of the evaluation metrics, and fine-tuning can improve the performance of the BERT-based methods. CONCLUSIONS Our comparison study is useful for researchers and practitioners in selecting the best modeling strategies for building article recommendation systems in the biomedical domain. The code and datasets are publicly available.
Affiliation(s)
- Li Zhang
- School of Information Management, Wuhan University, Wuhan, 430074, Hubei Province, China.
- Wei Lu
- School of Information Management, Wuhan University, Wuhan, 430074, Hubei Province, China.
- Haihua Chen
- Department of Information Science, University of North Texas, Denton, 76203, Texas, USA.
- Yong Huang
- School of Information Management, Wuhan University, Wuhan, 430074, Hubei Province, China.
- Qikai Cheng
- School of Information Management, Wuhan University, Wuhan, 430074, Hubei Province, China.
25
Hyams TC, Luo L, Hair B, Lee K, Lu Z, Seminara D. Machine Learning Approach to Facilitate Knowledge Synthesis at the Intersection of Liver Cancer, Epidemiology, and Health Disparities Research. JCO Clin Cancer Inform 2022; 6:e2100129. [PMID: 35623021 DOI: 10.1200/cci.21.00129] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022]
Abstract
PURPOSE Liver cancer is a global challenge, and disparities exist across multiple domains and throughout the disease continuum. However, liver cancer's global epidemiology and etiology are shifting, and the literature is rapidly evolving, presenting a challenge to the synthesis of knowledge needed to identify areas of research needs and to develop research agendas focusing on disparities. Machine learning (ML) techniques can be used to semiautomate the literature review process and improve efficiency. In this study, we detail our approach and provide practical benchmarks for the development of an ML approach to classify literature and extract data at the intersection of three fields: liver cancer, health disparities, and epidemiology. METHODS We performed a six-phase process including: training (I), validating (II), confirming (III), and performing error analysis (IV) for an ML classifier. We then developed an extraction model (V) and applied it (VI) to the liver cancer literature identified through PubMed. We present precision, recall, F1, and accuracy metrics for the classifier and extraction models as appropriate for each phase of the process. We also provide the results for the application of our extraction model. RESULTS With limited training data, we achieved a high degree of accuracy for both our classifier and for the extraction model for liver cancer disparities research literature performed using epidemiologic methods. The disparities concept was the most challenging to accurately classify, and concepts that appeared infrequently in our data set were the most difficult to extract. CONCLUSION We provide a roadmap for using ML to classify and extract comprehensive information on multidisciplinary literature. Our technique can be adapted and modified for other cancers or diseases where disparities persist.
Affiliation(s)
- Travis C Hyams
- Office of the Director, Division of Cancer Control and Population Sciences, National Cancer Institute, National Institutes of Health, Bethesda, MD
- Ling Luo
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD
- Brionna Hair
- Office of the Director, Division of Cancer Control and Population Sciences, National Cancer Institute, National Institutes of Health, Bethesda, MD
- Kyubum Lee
- Department of Biostatistics and Bioinformatics, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL
- Zhiyong Lu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD
- Daniela Seminara
- Office of the Director, Division of Cancer Control and Population Sciences, National Cancer Institute, National Institutes of Health, Bethesda, MD
26
Pasche E, Mottaz A, Caucheteur D, Gobeill J, Michel PA, Ruch P. Variomes: a high recall search engine to support the curation of genomic variants. Bioinformatics 2022; 38:2595-2601. [PMID: 35274687 PMCID: PMC9048643 DOI: 10.1093/bioinformatics/btac146] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/30/2021] [Revised: 02/07/2022] [Accepted: 03/10/2022] [Indexed: 12/02/2022]
Abstract
Motivation: Identification and interpretation of clinically actionable variants is a critical bottleneck. Searching for evidence in the literature is mandatory according to ASCO/AMP/CAP practice guidelines; however, it is both labor-intensive and error-prone. We developed a system to perform triage of publications relevant to support an evidence-based decision. The system is also able to prioritize variants. Our system searches within pre-annotated collections such as MEDLINE and PubMed Central. Results: We assess the search effectiveness of the system using three different experimental settings: literature triage, variant prioritization and comparison of Variomes with LitVar. Almost two-thirds of the publications returned in the top-5 are relevant for clinical decision-support. Our approach enabled identifying 81.8% of clinically actionable variants in the top-3. Variomes retrieves on average 21.3% more articles than LitVar and returns the same number of results or more results than LitVar for 90% of the queries when tested on a set of 803 queries, thus establishing a new baseline for searching the literature about variants. Availability and implementation: Variomes is publicly available at https://candy.hesge.ch/Variomes. Source code is freely available at https://github.com/variomes/sibtm-variomes. SynVar is publicly available at https://goldorak.hesge.ch/synvar. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Emilie Pasche
- SIB Text Mining Group, Swiss Institute of Bioinformatics, 1206 Geneva, Switzerland; BiTeM Group, Information Sciences, HES-SO/HEG, 1227 Carouge, Switzerland
- Anaïs Mottaz
- SIB Text Mining Group, Swiss Institute of Bioinformatics, 1206 Geneva, Switzerland; BiTeM Group, Information Sciences, HES-SO/HEG, 1227 Carouge, Switzerland
- Déborah Caucheteur
- SIB Text Mining Group, Swiss Institute of Bioinformatics, 1206 Geneva, Switzerland; BiTeM Group, Information Sciences, HES-SO/HEG, 1227 Carouge, Switzerland
- Julien Gobeill
- SIB Text Mining Group, Swiss Institute of Bioinformatics, 1206 Geneva, Switzerland; BiTeM Group, Information Sciences, HES-SO/HEG, 1227 Carouge, Switzerland
- Pierre-André Michel
- SIB Text Mining Group, Swiss Institute of Bioinformatics, 1206 Geneva, Switzerland; BiTeM Group, Information Sciences, HES-SO/HEG, 1227 Carouge, Switzerland
- Patrick Ruch
- SIB Text Mining Group, Swiss Institute of Bioinformatics, 1206 Geneva, Switzerland; BiTeM Group, Information Sciences, HES-SO/HEG, 1227 Carouge, Switzerland
| |
|
27
|
Mougeot JLC, Beckman MF, Langdon HC, Lalla RV, Brennan MT, Bahrani Mougeot FK. Haemophilus pittmaniae and Leptotrichia spp. Constitute a Multi-Marker Signature in a Cohort of Human Papillomavirus-Positive Head and Neck Cancer Patients. Front Microbiol 2022; 12:794546. [PMID: 35116012 PMCID: PMC8803733 DOI: 10.3389/fmicb.2021.794546] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/13/2021] [Accepted: 11/10/2021] [Indexed: 12/25/2022] Open
Abstract
Objectives: Human papillomavirus (HPV) is a known etiological factor of oropharyngeal head and neck cancer (HNC). HPV positivity and periodontal disease have been associated with higher HNC risk, suggesting a role for oral bacterial species. Our objective was to determine oral microbiome profiles in HNC patients (HPV-positive and HPV-negative) and in healthy controls (HC). Methods: Saliva samples and swabs of buccal mucosa, supragingival plaque, and tongue were collected from HNC patients (N = 23 patients, n = 92 samples) before cancer therapy. Next-generation sequencing (16S-rRNA gene V3–V4 region) was used to determine bacterial taxa relative abundance (RA). β-Diversities of HNC HPV+ (N = 16 patients, n = 64 samples) and HNC HPV– (N = 7 patients, n = 28 samples) groups were compared using PERMANOVA (Monte Carlo p < 0.05). LEfSe discriminant analysis was performed to identify differentiating taxa (Log LDA > 2.0). RA differences were analyzed by Mann–Whitney U-test (α = 0.05). The CombiROC program was used to determine multi-marker bacterial signatures. The Microbial Interaction Network Database (MIND) and LitSuggest online tools were used for complementary analyses. Results: HNC vs. HC and HNC HPV+ vs. HNC HPV– β-diversities differed significantly (Monte Carlo p < 0.05). Streptococcus was the most abundant genus for HNC and HC groups, while Rothia mucilaginosa and Haemophilus parainfluenzae were the most abundant species in HNC and HC patients, respectively, regardless of antibiotics treatment. LEfSe analysis identified 43 and 44 distinctive species for HNC HPV+ and HNC HPV– groups, respectively. In the HNC HPV+ group, 26 periodontal disease-associated species identified by LEfSe had a higher average RA compared to the HNC HPV– group. The significant species included Alloprevotella tannerae, Fusobacterium periodonticum, Haemophilus pittmaniae, Lachnoanaerobaculum orale, and Leptotrichia spp. (Mann–Whitney U-test, p < 0.05). Of 43 LEfSe-identified species in the HPV+ group, 31 had a higher RA compared to the HPV– group (Mann–Whitney U-test, p < 0.05). MIND analysis confirmed interactions between Haemophilus and Leptotrichia spp., representing a multi-marker signature per CombiROC analysis [area under the curve (AUC) > 0.9]. LitSuggest correctly classified 15 articles relevant to oral microbiome and HPV status. Conclusion: Oral microbiome profiles of HNC HPV+ and HNC HPV– patients differed significantly regarding periodontal-associated species. Our results suggest that oral bacterial species (e.g., Leptotrichia spp.), possessing unique niches and invasive properties, coexist with HPV within HPV-induced oral lesions in HNC patients. Further investigation into host–microbe interactions in HPV-positive HNC patients may shed light on cancer development.
Affiliation(s)
- Jean-Luc C. Mougeot
- Carolinas Medical Center—Atrium Health, Charlotte, NC, United States
- *Correspondence: Jean-Luc C. Mougeot,
| | | | - Holden C. Langdon
- Carolinas Medical Center—Atrium Health, Charlotte, NC, United States
| | - Rajesh V. Lalla
- Section of Oral Medicine–University of Connecticut Health, Farmington, CT, United States
| | | | | |
|
28
|
Arshinoff BI, Cary GA, Karimi K, Foley S, Agalakov S, Delgado F, Lotay VS, Ku CJ, Pells TJ, Beatman TR, Kim E, Cameron RA, Vize PD, Telmer C, Croce JC, Ettensohn CA, Hinman VF. Echinobase: leveraging an extant model organism database to build a knowledgebase supporting research on the genomics and biology of echinoderms. Nucleic Acids Res 2022; 50:D970-D979. [PMID: 34791383 PMCID: PMC8728261 DOI: 10.1093/nar/gkab1005] [Citation(s) in RCA: 45] [Impact Index Per Article: 22.5] [Received: 08/13/2021] [Revised: 10/05/2021] [Accepted: 10/13/2021] [Indexed: 12/16/2022] Open
Abstract
Echinobase (www.echinobase.org) is a third-generation web resource supporting genomic research on echinoderms. The new version was built by cloning the mature Xenopus model organism knowledgebase, Xenbase, refactoring data ingestion pipelines and modifying the user interface to adapt to multispecies echinoderm content. This approach leveraged over 15 years of previous database and web application development to generate a new fully featured informatics resource in a single year. In addition to the software stack, Echinobase uses the private cloud and physical hosts that support Xenbase. Echinobase currently supports six echinoderm species, focused on those used for genomics, developmental biology and gene regulatory network analyses. Over 38 000 gene pages, 18 000 publications, new improved genome assemblies, a JBrowse genome browser and BLAST+ services are available and supported by the development of a new echinoderm anatomical ontology, uniformly applied formal gene nomenclature, and consistent orthology predictions. A novel feature of Echinobase is integrating support for multiple, disparate species. New genomes from the diverse echinoderm phylum will be added and supported as data become available. The common code development design of the integrated knowledgebases ensures parallel improvements as each resource evolves. This approach is widely applicable for developing new model organism informatics resources.
Affiliation(s)
- Bradley I Arshinoff
- Department of Biological Sciences, University of Calgary, Calgary, AB T2N 1N4, Canada
| | - Gregory A Cary
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Kamran Karimi
- Department of Biological Sciences, University of Calgary, Calgary, AB T2N 1N4, Canada
| | - Saoirse Foley
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Sergei Agalakov
- Department of Biological Sciences, University of Calgary, Calgary, AB T2N 1N4, Canada
| | - Francisco Delgado
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Vaneet S Lotay
- Department of Biological Sciences, University of Calgary, Calgary, AB T2N 1N4, Canada
| | - Carolyn J Ku
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Troy J Pells
- Department of Biological Sciences, University of Calgary, Calgary, AB T2N 1N4, Canada
| | - Thomas R Beatman
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Eugene Kim
- Department of Biological Sciences, University of Calgary, Calgary, AB T2N 1N4, Canada
| | - R Andrew Cameron
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125, USA
| | - Peter D Vize
- Department of Biological Sciences, University of Calgary, Calgary, AB T2N 1N4, Canada
| | - Cheryl A Telmer
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Jenifer C Croce
- Laboratoire de Biologie du Développement de Villefranche-sur-Mer (LBDV), Institut de la Mer de Villefranche (IMEV), Sorbonne Université, CNRS, Villefranche-sur-Mer, France
| | - Charles A Ettensohn
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Veronica F Hinman
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| |
|
29
|
Bansal P, Morgat A, Axelsen KB, Muthukrishnan V, Coudert E, Aimo L, Hyka-Nouspikel N, Gasteiger E, Kerhornou A, Neto TB, Pozzato M, Blatter MC, Ignatchenko A, Redaschi N, Bridge A. Rhea, the reaction knowledgebase in 2022. Nucleic Acids Res 2022; 50:D693-D700. [PMID: 34755880 PMCID: PMC8728268 DOI: 10.1093/nar/gkab1016] [Citation(s) in RCA: 68] [Impact Index Per Article: 34.0] [Received: 09/14/2021] [Revised: 10/08/2021] [Accepted: 11/09/2021] [Indexed: 12/15/2022] Open
Abstract
Rhea (https://www.rhea-db.org) is an expert-curated knowledgebase of biochemical reactions based on the chemical ontology ChEBI (Chemical Entities of Biological Interest) (https://www.ebi.ac.uk/chebi). In this paper, we describe a number of key developments in Rhea since our last report in the database issue of Nucleic Acids Research in 2019. These include improved reaction coverage in Rhea, the adoption of Rhea as the reference vocabulary for enzyme annotation in the UniProt knowledgebase UniProtKB (https://www.uniprot.org), the development of a new Rhea website, and the designation of Rhea as an ELIXIR Core Data Resource. We hope that these and other developments will enhance the utility of Rhea as a reference resource to study and engineer enzymes and the metabolic systems in which they function.
Affiliation(s)
- Parit Bansal
- Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211 Geneva 4, Switzerland
| | - Anne Morgat
- Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211 Geneva 4, Switzerland
| | - Kristian B Axelsen
- Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211 Geneva 4, Switzerland
| | - Venkatesh Muthukrishnan
- Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211 Geneva 4, Switzerland
| | - Elisabeth Coudert
- Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211 Geneva 4, Switzerland
| | - Lucila Aimo
- Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211 Geneva 4, Switzerland
| | - Nevila Hyka-Nouspikel
- Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211 Geneva 4, Switzerland
| | - Elisabeth Gasteiger
- Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211 Geneva 4, Switzerland
| | - Arnaud Kerhornou
- Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211 Geneva 4, Switzerland
| | - Teresa Batista Neto
- Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211 Geneva 4, Switzerland
| | - Monica Pozzato
- Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211 Geneva 4, Switzerland
| | - Marie-Claude Blatter
- Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211 Geneva 4, Switzerland
| | - Alex Ignatchenko
- EMBL-EBI European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Nicole Redaschi
- Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211 Geneva 4, Switzerland
| | - Alan Bridge
- Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211 Geneva 4, Switzerland
| |
|
30
|
Karimi K, Agalakov S, Telmer CA, Beatman TR, Pells TJ, Arshinoff BI, Ku CJ, Foley S, Hinman VF, Ettensohn CA, Vize PD. Classifying domain-specific text documents containing ambiguous keywords. Database: The Journal of Biological Databases and Curation 2021; 2021:6377760. [PMID: 34585729 PMCID: PMC8588847 DOI: 10.1093/database/baab062] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/20/2021] [Revised: 08/23/2021] [Accepted: 09/16/2021] [Indexed: 11/14/2022]
Abstract
A keyword-based search of comprehensive databases such as PubMed may return
irrelevant papers, especially if the keywords are used in multiple fields of
study. In such cases, domain experts (curators) need to verify the results and
remove the irrelevant articles. Automating this filtering process will save
time, but it has to be done well enough to ensure few relevant papers are
rejected and few irrelevant papers are accepted. A good solution would be fast,
work with the limited amount of data freely available (full paper body may be
missing), handle ambiguous keywords and be as domain-neutral as possible. In
this paper, we evaluate a number of classification algorithms for identifying a
domain-specific set of papers about echinoderm species and show that the
resulting tool satisfies most of the abovementioned requirements. Echinoderms
consist of a number of very different organisms, including brittle stars, sea
stars (starfish), sea urchins and sea cucumbers. While their taxonomic
identifiers are specific, the common names are used in many other contexts,
creating ambiguity and making a keyword search prone to error. We try
classifiers using Linear, Naïve Bayes, Nearest Neighbor, Tree, SVM,
Bagging, AdaBoost and Neural Network learning models and compare their
performance. We show how effective the resulting classifiers are in filtering
irrelevant articles returned from PubMed. The methodology used is more dependent
on the good selection of training data and is a practical solution that can be
applied to other fields of study facing similar challenges. Database URL: The code and data reported in this paper are freely available at
http://xenbaseturbofrog.org/pub/Text-Topic-Classifier/
Affiliation(s)
- Kamran Karimi
- Department of Biological Sciences, University of Calgary, Calgary, AB T2N 1N4, Canada
| | - Sergei Agalakov
- Department of Biological Sciences, University of Calgary, Calgary, AB T2N 1N4, Canada
| | - Cheryl A Telmer
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Thomas R Beatman
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Troy J Pells
- Department of Biological Sciences, University of Calgary, Calgary, AB T2N 1N4, Canada
| | - Bradley I Arshinoff
- Department of Biological Sciences, University of Calgary, Calgary, AB T2N 1N4, Canada
| | - Carolyn J Ku
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Saoirse Foley
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Veronica F Hinman
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Charles A Ettensohn
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Peter D Vize
- Department of Biological Sciences, University of Calgary, Calgary, AB T2N 1N4, Canada
| |
|