1
Lin D, McAuliffe M, Pruitt KD, Gururaj A, Melchior C, Schmitt C, Wright SN. Biomedical Data Repository Concepts and Management Principles. Sci Data 2024; 11:622. [PMID: 38871749 PMCID: PMC11176378 DOI: 10.1038/s41597-024-03449-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/03/2024] [Accepted: 05/31/2024] [Indexed: 06/15/2024] Open
Abstract
The demand for open data and open science is on the rise, fueled by expectations from the scientific community, calls to increase transparency and reproducibility in research findings, and developments such as the Final Data Management and Sharing Policy from the U.S. National Institutes of Health and a memorandum on increasing public access to federally funded research, issued by the U.S. Office of Science and Technology Policy. This paper explores the pivotal role of data repositories in biomedical research and open science, emphasizing their importance in managing, preserving, and sharing research data. Our objective is to familiarize readers with the functions of data repositories, set expectations for their services, and provide an overview of methods to evaluate their capabilities. The paper serves to introduce fundamental concepts and community-based guiding principles and aims to equip researchers, repository operators, funders, and policymakers with the knowledge to select appropriate repositories for their data management and sharing needs and foster a foundation for the open sharing and preservation of research data.
Affiliation(s)
- Dawei Lin
- National Institute of Allergy and Infectious Diseases (NIAID), National Institutes of Health, Bethesda, Maryland, USA.
- Matthew McAuliffe
- Center of Information Technology (CIT), National Institutes of Health, Bethesda, Maryland, USA.
- Kim D Pruitt
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA.
- Anupama Gururaj
- National Institute of Allergy and Infectious Diseases (NIAID), National Institutes of Health, Bethesda, Maryland, USA
- Christine Melchior
- Center for Scientific Review (CSR), National Institutes of Health, Bethesda, Maryland, USA
- Charles Schmitt
- National Institute of Environmental Health Sciences (NIEHS), National Institutes of Health, Durham, North Carolina, USA
- Susan N Wright
- National Institute on Drug Abuse (NIDA), National Institutes of Health, Bethesda, Maryland, USA
2
Novoa J, López-Ibáñez J, Chagoyen M, Ranea JAG, Pazos F. CoMentG: comprehensive retrieval of generic relationships between biomedical concepts from the scientific literature. Database (Oxford) 2024; 2024:baae025. [PMID: 38564426 PMCID: PMC10986793 DOI: 10.1093/database/baae025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/03/2023] [Revised: 03/01/2024] [Accepted: 03/15/2024] [Indexed: 04/04/2024]
Abstract
The CoMentG resource contains millions of relationships between terms of biomedical interest obtained from the scientific literature. At the core of the system is a methodology for detecting significant co-mentions of concepts in the entire PubMed corpus. That method was applied to nine sets of terms covering the most important classes of biomedical concepts: diseases, symptoms/clinical signs, molecular functions, biological processes, cellular compartments, anatomic parts, cell types, bacteria and chemical compounds. We obtained more than 7 million relationships between more than 74 000 terms, and many types of relationships were not available in any other resource. As the terms were obtained from widely used resources and ontologies, the relationships are given using the standard identifiers provided by them and hence can be linked to other data. A web interface allows users to browse these associations, searching for relationships for a set of terms of interest provided as input, such as those between a disease and its associated symptoms, underlying molecular processes or affected tissues. The results are presented in an interactive interface where the user can explore the reported relationships in different ways and follow links to other resources. Database URL: https://csbg.cnb.csic.es/CoMentG/.
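The abstract does not say which statistical test CoMentG uses to call a co-mention significant. As a minimal sketch of one common option, assuming a one-sided hypergeometric test over document counts, the function below computes the probability of seeing at least the observed number of shared documents by chance. The function name and its parameters are hypothetical and are not taken from the CoMentG paper.

```python
from math import comb


def comention_pvalue(n_docs: int, n_a: int, n_b: int, n_both: int) -> float:
    """One-sided hypergeometric p-value for a co-mention (illustrative only).

    n_docs: total documents in the corpus
    n_a:    documents mentioning term A
    n_b:    documents mentioning term B
    n_both: documents mentioning both terms

    Returns P(X >= n_both) when the n_b documents of term B are drawn at
    random from the corpus and X counts how many also mention term A.
    """
    if not (0 <= n_both <= min(n_a, n_b) and max(n_a, n_b) <= n_docs):
        raise ValueError("inconsistent document counts")
    total = comb(n_docs, n_b)
    # Sum the upper tail: every overlap size from the observed one upward.
    tail = sum(
        comb(n_a, k) * comb(n_docs - n_a, n_b - k)
        for k in range(n_both, min(n_a, n_b) + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Toy counts, not real PubMed figures.
    print(comention_pvalue(n_docs=10, n_a=3, n_b=3, n_both=3))
```

In practice a corpus-wide screen like this would also need a multiple-testing correction, which is omitted here.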
Affiliation(s)
- Jorge Novoa
- Computational Systems Biology, National Center for Biotechnology (CNB-CSIC), c/ Darwin, 3., Madrid 28049 , Spain
- Javier López-Ibáñez
- Computational Systems Biology, National Center for Biotechnology (CNB-CSIC), c/ Darwin, 3., Madrid 28049 , Spain
- Mónica Chagoyen
- Computational Systems Biology, National Center for Biotechnology (CNB-CSIC), c/ Darwin, 3., Madrid 28049 , Spain
- Juan A G Ranea
- Department of Molecular Biology and Biochemistry, University of Málaga, Avda. Cervantes, 2., Málaga 29071, Spain
- CIBER de Enfermedades Raras (CIBERER), Instituto de Salud Carlos III, Madrid, Spain
- Institute of Biomedical Research in Malaga and platform of nanomedicine (IBIMA platform BIONAND), Malaga 29071, Spain
- Spanish National Bioinformatics Institute (INB/ELIXIR-ES), Barcelona 08034, Spain
- Florencio Pazos
- Computational Systems Biology, National Center for Biotechnology (CNB-CSIC), c/ Darwin, 3., Madrid 28049 , Spain
3
Zhang B, Chen L, Xiao S, Dang C, Wang F, Fang Q, Ye X, Stanley DW, Ye G. iSalivaomicDB: A comprehensive saliva omics database for insects. Insect Science 2024. [PMID: 38450904 DOI: 10.1111/1744-7917.13349] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2023] [Revised: 01/26/2024] [Accepted: 02/05/2024] [Indexed: 03/08/2024]
Affiliation(s)
- Bo Zhang
- State Key Laboratory of Rice Biology and Breeding, Ministry of Agricultural and Rural Affairs Key Laboratory of Molecular Biology of Crop Pathogens and Insect Pests & Key Laboratory of Biology of Crop Pathogens and Insects of Zhejiang Province, Zhejiang University, Hangzhou, China
- Longfei Chen
- State Key Laboratory of Rice Biology and Breeding, Ministry of Agricultural and Rural Affairs Key Laboratory of Molecular Biology of Crop Pathogens and Insect Pests & Key Laboratory of Biology of Crop Pathogens and Insects of Zhejiang Province, Zhejiang University, Hangzhou, China
- Shan Xiao
- State Key Laboratory of Rice Biology and Breeding, Ministry of Agricultural and Rural Affairs Key Laboratory of Molecular Biology of Crop Pathogens and Insect Pests & Key Laboratory of Biology of Crop Pathogens and Insects of Zhejiang Province, Zhejiang University, Hangzhou, China
- Cong Dang
- College of Life and Environmental Sciences, Hangzhou Normal University, Hangzhou, China
- Fang Wang
- State Key Laboratory of Rice Biology and Breeding, Ministry of Agricultural and Rural Affairs Key Laboratory of Molecular Biology of Crop Pathogens and Insect Pests & Key Laboratory of Biology of Crop Pathogens and Insects of Zhejiang Province, Zhejiang University, Hangzhou, China
- Qi Fang
- State Key Laboratory of Rice Biology and Breeding, Ministry of Agricultural and Rural Affairs Key Laboratory of Molecular Biology of Crop Pathogens and Insect Pests & Key Laboratory of Biology of Crop Pathogens and Insects of Zhejiang Province, Zhejiang University, Hangzhou, China
- Xinhai Ye
- State Key Laboratory of Rice Biology and Breeding, Ministry of Agricultural and Rural Affairs Key Laboratory of Molecular Biology of Crop Pathogens and Insect Pests & Key Laboratory of Biology of Crop Pathogens and Insects of Zhejiang Province, Zhejiang University, Hangzhou, China
- David W Stanley
- Biological Control of Insects Research Laboratory USDA/Agricultural Research Service, Columbia MO, USA
- Gongyin Ye
- State Key Laboratory of Rice Biology and Breeding, Ministry of Agricultural and Rural Affairs Key Laboratory of Molecular Biology of Crop Pathogens and Insect Pests & Key Laboratory of Biology of Crop Pathogens and Insects of Zhejiang Province, Zhejiang University, Hangzhou, China
4
Ma L, Zou D, Liu L, Shireen H, Abbasi AA, Bateman A, Xiao J, Zhao W, Bao Y, Zhang Z. Database Commons: A Catalog of Worldwide Biological Databases. Genomics, Proteomics & Bioinformatics 2023; 21:1054-1058. [PMID: 36572336 PMCID: PMC10928426 DOI: 10.1016/j.gpb.2022.12.004] [Citation(s) in RCA: 14] [Impact Index Per Article: 14.0] [Received: 09/21/2022] [Revised: 12/13/2022] [Accepted: 12/14/2022] [Indexed: 12/25/2022]
Abstract
Biological databases serve as fundamental global infrastructure for the worldwide scientific community, dramatically aiding the transformation of big data into knowledge and driving significant innovations in a wide range of research fields. Given the rapid pace of data production, biological databases continue to increase in size and importance. To build a catalog of worldwide biological databases, we curated a total of 5825 biological databases from 8931 publications, which are geographically distributed across 72 countries/regions and developed by 1975 institutions (as of September 20, 2022). We further devise a z-index, a novel index to characterize the scientific impact of a database, and rank all these biological databases as well as their hosting institutions and countries in terms of citation and z-index. Consequently, we present a series of statistics and trends of worldwide biological databases, yielding a global perspective to better understand their status and impact for life and health sciences. An up-to-date catalog of worldwide biological databases, as well as their curated meta-information and derived statistics, is publicly available at Database Commons (https://ngdc.cncb.ac.cn/databasecommons/).
Affiliation(s)
- Lina Ma
- National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; University of Chinese Academy of Sciences, Beijing 100049, China.
- Dong Zou
- National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
- Lin Liu
- National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
- Huma Shireen
- National Center for Bioinformatics, Programme of Comparative and Evolutionary Genomics, Faculty of Biological Sciences, Quaid-i-Azam University, Islamabad 45320, Pakistan
- Amir A Abbasi
- National Center for Bioinformatics, Programme of Comparative and Evolutionary Genomics, Faculty of Biological Sciences, Quaid-i-Azam University, Islamabad 45320, Pakistan
- Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge CB10 1SD, United Kingdom
- Jingfa Xiao
- National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; University of Chinese Academy of Sciences, Beijing 100049, China
- Wenming Zhao
- National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; University of Chinese Academy of Sciences, Beijing 100049, China
- Yiming Bao
- National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; University of Chinese Academy of Sciences, Beijing 100049, China
- Zhang Zhang
- National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; University of Chinese Academy of Sciences, Beijing 100049, China.
5
Bowler-Barnett EH, Fan J, Luo J, Magrane M, Martin MJ, Orchard S. UniProt and Mass Spectrometry-Based Proteomics-A 2-Way Working Relationship. Mol Cell Proteomics 2023; 22:100591. [PMID: 37301379 PMCID: PMC10404557 DOI: 10.1016/j.mcpro.2023.100591] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 01/30/2023] [Revised: 05/20/2023] [Accepted: 06/07/2023] [Indexed: 06/12/2023] Open
Abstract
The human proteome comprises all of the proteins produced by the sequences translated from the human genome, with additional modifications in both sequence and function caused by nonsynonymous variants and posttranslational modifications, including cleavage of the initial transcript into smaller peptides and polypeptides. The UniProtKB database (www.uniprot.org) is the world's leading high-quality, comprehensive and freely accessible resource of protein sequence and functional information and presents a summary of experimentally verified, or computationally predicted, functional information added by our expert biocuration team for each protein in the proteome. Researchers in the field of mass spectrometry-based proteomics both consume and add to the body of data available in UniProtKB, and this review highlights the information we provide to this community and the knowledge we in turn obtain from groups via deposition of large-scale datasets in public domain databases.
Affiliation(s)
- E H Bowler-Barnett
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, United Kingdom
- J Fan
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, United Kingdom
- J Luo
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, United Kingdom
- M Magrane
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, United Kingdom
- M J Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, United Kingdom
- S Orchard
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, United Kingdom.
6
Cuzick A, Seager J, Wood V, Urban M, Rutherford K, Hammond-Kosack KE. A framework for community curation of interspecies interactions literature. eLife 2023; 12:e84658. [PMID: 37401199 DOI: 10.7554/elife.84658] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2022] [Accepted: 05/18/2023] [Indexed: 07/05/2023] Open
Abstract
The quantity and complexity of data being generated and published in biology have increased substantially, but few methods exist for capturing knowledge about phenotypes derived from molecular interactions between diverse groups of species in a way that is amenable to data-driven biology and research. To improve access to this knowledge, we have constructed a framework for the curation of the scientific literature studying interspecies interactions, using data curated for the Pathogen-Host Interactions database (PHI-base) as a case study. The framework provides a curation tool, phenotype ontology, and controlled vocabularies to curate pathogen-host interaction data at the level of the host, pathogen, strain, gene, and genotype. The concept of a multispecies genotype, the 'metagenotype,' is introduced to facilitate capturing changes in the disease-causing abilities of pathogens, and host resistance or susceptibility, observed by gene alterations. We report on this framework and describe PHI-Canto, a community curation tool for use by publication authors.
Affiliation(s)
- Alayne Cuzick
- Strategic area: Protecting Crops and the Environment, Rothamsted Research, Harpenden, United Kingdom
- James Seager
- Strategic area: Protecting Crops and the Environment, Rothamsted Research, Harpenden, United Kingdom
- Valerie Wood
- Department of Biochemistry, University of Cambridge, Cambridge, United Kingdom
- Martin Urban
- Strategic area: Protecting Crops and the Environment, Rothamsted Research, Harpenden, United Kingdom
- Kim Rutherford
- Department of Biochemistry, University of Cambridge, Cambridge, United Kingdom
- Kim E Hammond-Kosack
- Strategic area: Protecting Crops and the Environment, Rothamsted Research, Harpenden, United Kingdom
7
Picot C, Ajiji P, Jurek L, Nourredine M, Massardier J, Peron A, Cucherat M, Cottin J. Risk of drug use during pregnancy: master protocol for living systematic reviews and meta-analyses performed in the metaPreg project. Syst Rev 2023; 12:101. [PMID: 37344917 DOI: 10.1186/s13643-023-02256-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 08/31/2022] [Accepted: 05/12/2023] [Indexed: 06/23/2023] Open
Abstract
BACKGROUND Knowledge about the risks of drugs during pregnancy is continuously evolving due to the frequent publication of a large number of epidemiological studies. Systematic reviews and meta-analyses therefore need to be regularly updated to reflect these advances. To improve dissemination of this updated information, we developed an initiative of real-time full-scale living meta-analyses relying on an open online dissemination platform (www.metapreg.org). METHOD All living meta-analyses performed in this project will be conducted in accordance with this master protocol after adaptation of the search strategy. A systematic literature search of PubMed and Embase will be performed. All analytical studies (e.g., cohort, case-control, randomized studies) reporting original empirical findings on the association between in utero exposure to drugs and adverse pregnancy outcomes will be included. Study screening and data extraction will be performed in a semi-automated way supervised by a biocurator. Risk of bias will be assessed using the ROBINS-I tool. All clinically relevant adverse pregnancy outcomes (malformations, stillbirths, neuro-developmental disorders, pre-eclampsia, etc.) available in the included studies will be pooled through random-effects meta-analysis. Heterogeneity will be evaluated by the I² statistic. DISCUSSION Our living systematic reviews and subsequent updates will inform the medical, regulatory, and health policy communities as new results emerge, to guide decisions on the proper use of drugs during pregnancy. SYSTEMATIC REVIEW REGISTRATION Open Science Framework (OSF) registries.
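The protocol names random-effects pooling and the I² heterogeneity statistic but, at least in this abstract, does not name an estimator. The sketch below assumes the widely used DerSimonian-Laird estimator of the between-study variance; the function name, argument names and the toy numbers are illustrative and do not come from the metaPreg project.

```python
import math


def random_effects_meta(effects, variances):
    """DerSimonian-Laird random-effects pooling (illustrative sketch).

    effects:   per-study effect estimates, e.g. log odds ratios
    variances: per-study sampling variances of those estimates
    """
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two studies with matching variances")

    # Fixed-effect (inverse-variance) weights and pooled estimate.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w

    # Cochran's Q and its degrees of freedom.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = k - 1

    # Between-study variance (tau^2), truncated at zero.
    c = sum_w - sum(wi * wi for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c)

    # I^2: share of total variability attributed to heterogeneity, in percent.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    # Random-effects weights add tau^2 to every study variance.
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    return {"pooled": pooled, "se": se, "tau2": tau2, "q": q, "i2": i2}


if __name__ == "__main__":
    # Two toy studies with equal precision and different effects.
    print(random_effects_meta([0.0, 1.0], [0.1, 0.1]))
```

With the two toy studies, Q is 5 on 1 degree of freedom, so I² is 80% and the random-effects weights are much flatter than the fixed-effect ones.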
Affiliation(s)
- Cyndie Picot
- Service Hospitalo-Universitaire de Pharmaco-Toxicologie, Hospices Civils de Lyon, Bât. A-162, avenue Lacassagne, 69424 Cedex 03, Lyon, France
- Priscilla Ajiji
- Faculté de Santé, Université Paris-Est Créteil, EA 7379, Créteil, France
- French National Agency for Medicines and Health Products Safety (ANSM), Saint Denis, France
- Lucie Jurek
- Child and Adolescent Neurodevelopmental Psychiatry Department, Center for Assessment and Diagnostic of Autism, Le Vinatier Hospital, Bron, France
- RESHAPE, Université Claude Bernard Lyon 1, U1290, Lyon, France
- Mikail Nourredine
- Service Hospitalo-Universitaire de Pharmaco-Toxicologie, Hospices Civils de Lyon, Bât. A-162, avenue Lacassagne, 69424 Cedex 03, Lyon, France
- Service de biostatistiques, Hospices Civils de Lyon, Lyon, France
- Laboratoire d'évaluation et modélisation des effets thérapeutiques, UMR CNRS 5558, Lyon, France
- Jérôme Massardier
- Service de Gynécologie Obstétrique et Médecine Foetale, HFME, Hospices Civils de Lyon, Lyon, France
- Audrey Peron
- Service Hospitalo-Universitaire de Pharmaco-Toxicologie, Hospices Civils de Lyon, Bât. A-162, avenue Lacassagne, 69424 Cedex 03, Lyon, France
- Michel Cucherat
- metaEvidence.org - Service Hospitalo, Universitaire de Pharmaco-Toxicologie, Hospices Civils de Lyon, Lyon, France
- Judith Cottin
- Service Hospitalo-Universitaire de Pharmaco-Toxicologie, Hospices Civils de Lyon, Bât. A-162, avenue Lacassagne, 69424 Cedex 03, Lyon, France.
8
Pérez-Pérez M, Ferreira T, Igrejas G, Fdez-Riverola F. A novel gluten knowledge base of potential biomedical and health-related interactions extracted from the literature: using machine learning and graph analysis methodologies to reconstruct the bibliome. J Biomed Inform 2023:104398. [PMID: 37230405 DOI: 10.1016/j.jbi.2023.104398] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 10/17/2022] [Revised: 05/12/2023] [Accepted: 05/15/2023] [Indexed: 05/27/2023]
Abstract
BACKGROUND Despite their nutritional properties and broad availability, cereal crops have been associated with different alimentary disorders and symptoms, most of which are attributed to gluten. Gluten-related research literature therefore continues to be produced at ever-growing rates, driven in part by recent exploratory studies that link gluten to non-traditional diseases and by the popularity of gluten-free diets, making it increasingly difficult to access and analyse practical and structured information. The accelerated pace of advances in diagnosis and treatment, together with exploratory studies, creates favourable conditions for disinformation and misinformation. OBJECTIVES Aligned with the European Union strategy "Delivering on EU Food Safety and Nutrition in 2050", which emphasizes the inextricable links between imbalanced diets, the increased exposure to unreliable sources of information and misleading information, and the increased dependency on reliable sources of information, this paper presents GlutKNOIS, a public and interactive literature-based database that reconstructs and represents the experimental biomedical knowledge extracted from the gluten-related literature. The platform integrates knowledge from external databases, bibliometric statistics and social media discussion to offer a novel and enhanced way to search, visualise and analyse potential biomedical and health-related interactions in the gluten domain.
METHODS The study applies a semi-supervised curation workflow that combines natural language processing techniques, machine learning algorithms, ontology-based normalization and integration approaches, named entity recognition methods, and graph knowledge reconstruction methodologies to process, classify, represent and analyse the experimental findings contained in the literature, complemented by data from social media discussion. RESULTS AND CONCLUSIONS In total, 5,814 documents were manually annotated and 7,424 were fully automatically processed to reconstruct the first online gluten-related knowledge database of evidenced health-related interactions that produce health or metabolic changes based on the literature. In addition, the automatic processing of the literature combined with the proposed knowledge representation methodologies has the potential to assist in the revision and analysis of years of gluten research. The reconstructed knowledge base is public and accessible at https://sing-group.org/glutknois/.
Affiliation(s)
- Martín Pérez-Pérez
- CINBIO, Universidade de Vigo, Department of Computer Science, ESEI - Escuela Superior de Ingeniería Informática, 32004 Ourense, España; SING Research Group, Galicia Sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO, Spain.
- Tânia Ferreira
- Department of Genetics and Biotechnology, University of Trás-os-Montes and Alto Douro, Vila Real, Portugal; Functional Genomics and Proteomics Unit, University of Trás-os-Montes and Alto Douro, Vila Real, Portugal.
- Gilberto Igrejas
- Department of Genetics and Biotechnology, University of Trás-os-Montes and Alto Douro, Vila Real, Portugal; Functional Genomics and Proteomics Unit, University of Trás-os-Montes and Alto Douro, Vila Real, Portugal; LAQV-REQUIMTE, Faculty of Science and Technology, Nova University of Lisbon, Lisbon, Portugal.
- Florentino Fdez-Riverola
- CINBIO, Universidade de Vigo, Department of Computer Science, ESEI - Escuela Superior de Ingeniería Informática, 32004 Ourense, España; SING Research Group, Galicia Sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO, Spain.
9
Launer-Wachs S, Taub-Tabib H, Tokarev Madem J, Bar-Natan O, Goldberg Y, Shamay Y. From Centralized to Ad-Hoc Knowledge Base Construction for Hypotheses Generation. J Biomed Inform 2023; 142:104383. [PMID: 37196989 DOI: 10.1016/j.jbi.2023.104383] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 08/15/2022] [Revised: 04/27/2023] [Accepted: 05/03/2023] [Indexed: 05/19/2023]
Abstract
OBJECTIVE To demonstrate and develop an approach enabling individual researchers or small teams to create their own ad-hoc, lightweight knowledge bases tailored for specialized scientific interests, using text-mining over scientific literature, and demonstrate the effectiveness of these knowledge bases in hypothesis generation and literature-based discovery (LBD). METHODS We propose a lightweight process using an extractive search framework to create ad-hoc knowledge bases, which require minimal training and no background in bio-curation or computer science. These knowledge bases are particularly effective for LBD and hypothesis generation using Swanson's ABC method. The personalized nature of the knowledge bases allows for a somewhat higher level of noise than "public facing" ones, as researchers are expected to have prior domain experience to separate signal from noise. Fact verification is shifted from exhaustive verification of the knowledge base to post-hoc verification of specific entries of interest, allowing researchers to assess the correctness of relevant knowledge base entries by considering the paragraphs in which the facts were introduced. RESULTS We demonstrate the methodology by constructing several knowledge bases of different kinds: three knowledge bases that support lab-internal hypothesis generation: Drug Delivery to Ovarian Tumors (DDOT); Tissue Engineering and Regeneration; Challenges in Cancer Research; and an additional comprehensive, accurate knowledge base designated as a public resource for the wider community on the topic of Cell Specific Drug Delivery (CSDD). In each case, we show the design and construction process, along with relevant visualizations for data exploration, and hypothesis generation. For CSDD and DDOT we also show meta-analysis, human evaluation, and in vitro experimental evaluation. 
CONCLUSION Our approach enables researchers to create personalized, lightweight knowledge bases for specialized scientific interests, effectively facilitating hypothesis generation and literature-based discovery (LBD). By shifting fact verification efforts to post-hoc verification of specific entries, researchers can focus on exploring and generating hypotheses based on their expertise. The constructed knowledge bases demonstrate the versatility and adaptability of our approach to diverse research interests. The web-based platform, available at https://spike-kbc.apps.allenai.org, provides researchers with a valuable tool for rapid construction of knowledge bases tailored to their needs.
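Swanson's ABC method, which the abstract relies on for hypothesis generation, can be sketched briefly: if the literature links A to B and B to C, but never A to C directly, then A to C is a candidate hypothesis. The code below is a toy illustration over a hand-written list of directed relations; the function name and the data layout are assumptions and do not reflect how the SPIKE-based platform stores or queries its knowledge bases.

```python
from collections import defaultdict


def abc_hypotheses(relations):
    """Return candidate A -> C links with the bridging B terms.

    relations: iterable of (source, target) pairs extracted from text.
    A pair (a, c) is proposed when some b has a -> b and b -> c,
    and a -> c is not already a known relation.
    """
    targets = defaultdict(set)
    for source, target in relations:
        targets[source].add(target)

    hypotheses = defaultdict(set)
    for a, bs in targets.items():
        for b in bs:
            # .get avoids inserting new keys while iterating over targets.
            for c in targets.get(b, ()):
                if c != a and c not in bs:
                    hypotheses[(a, c)].add(b)
    return dict(hypotheses)


if __name__ == "__main__":
    # Toy relations modelled on Swanson's fish oil and Raynaud's example.
    known = [
        ("fish oil", "blood viscosity"),
        ("fish oil", "platelet aggregation"),
        ("blood viscosity", "Raynaud's disease"),
        ("platelet aggregation", "Raynaud's disease"),
    ]
    for (a, c), bridges in abc_hypotheses(known).items():
        print(a, "->", c, "via", sorted(bridges))
```

Keeping the bridging B terms alongside each candidate supports the post-hoc verification the abstract describes, since a researcher can go back to the passages behind each A to B and B to C link.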
Affiliation(s)
- Shaked Launer-Wachs
- Faculty of Biomedical Engineering, Technion - Israel Institute of Technology, Haifa, Israel
- Jennie Tokarev Madem
- Faculty of Biomedical Engineering, Technion - Israel Institute of Technology, Haifa, Israel
- Orr Bar-Natan
- Faculty of Biomedical Engineering, Technion - Israel Institute of Technology, Haifa, Israel
- Yoav Goldberg
- Allen Institute for AI, Tel Aviv, Israel; Bar-Ilan University, Ramat-Gan, Israel
- Yosi Shamay
- Faculty of Biomedical Engineering, Technion - Israel Institute of Technology, Haifa, Israel.
10
Mayer C, Vogt A, Uslu T, Scalzitti N, Chennen K, Poch O, Thompson JD. CeGAL: Redefining a Widespread Fungal-Specific Transcription Factor Family Using an In Silico Error-Tracking Approach. J Fungi (Basel) 2023; 9:jof9040424. [PMID: 37108879 PMCID: PMC10141177 DOI: 10.3390/jof9040424] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2023] [Revised: 03/21/2023] [Accepted: 03/28/2023] [Indexed: 03/31/2023] Open
Abstract
In fungi, the most abundant transcription factor (TF) class contains a fungal-specific ‘GAL4-like’ Zn2C6 DNA binding domain (DBD), while the second class contains another fungal-specific domain, known as ‘fungal_trans’ or middle homology domain (MHD), whose function remains largely uncharacterized. Remarkably, almost a third of MHD-containing TFs in public sequence databases apparently lack DNA binding activity, since they are not predicted to contain a DBD. Here, we reassess the domain organization of these ‘MHD-only’ proteins using an in silico error-tracking approach. In a large-scale analysis of ~17,000 MHD-only TF sequences present in all fungal phyla except Microsporidia and Cryptomycota, we show that the vast majority (>90%) result from genome annotation errors and we are able to predict a new DBD sequence for 14,261 of them. Most of these sequences correspond to a Zn2C6 domain (82%), with a small proportion of C2H2 domains (4%) found only in Dikarya. Our results contradict previous findings that the MHD-only TF are widespread in fungi. In contrast, we show that they are exceptional cases, and that the fungal-specific Zn2C6–MHD domain pair represents the canonical domain signature defining the most predominant fungal TF family. We call this family CeGAL, after the highly characterized members: Cep3, whose 3D structure is determined, and GAL4, a eukaryotic TF archetype. We believe that this will not only improve the annotation and classification of the Zn2C6 TF but will also provide critical guidance for future fungal gene regulatory network analyses.
Affiliation(s)
- Claudine Mayer
- Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000 Strasbourg, France
- Faculté des Sciences, Université Paris Cité, UFR Sciences du Vivant, 75013 Paris, France
- Correspondence: (C.M.); (J.D.T.)
| | - Arthur Vogt
- Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000 Strasbourg, France
| | - Tuba Uslu
- Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000 Strasbourg, France
| | - Nicolas Scalzitti
- Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000 Strasbourg, France
| | - Kirsley Chennen
- Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000 Strasbourg, France
| | - Olivier Poch
- Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000 Strasbourg, France
| | - Julie D. Thompson
- Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000 Strasbourg, France
- Correspondence: (C.M.); (J.D.T.)
| |
|
11
|
Shaw F, Minotto A, McTaggart S, Providence A, Harrison P, Paupério J, Rajan J, Burgin J, Cochrane G, Kilias E, Lawniczak M, Davey R. Managing sample metadata for biodiversity: considerations from the Darwin Tree of Life project. Wellcome Open Res 2022. [DOI: 10.12688/wellcomeopenres.18499.1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022] Open
Abstract
Large-scale reference genome sequencing projects for all of biodiversity are underway and common standards have been in place for some years to enable the understanding and sharing of sequence data. However, the metadata that describe the collection, processing and management of samples, and that link to the associated sequencing and genome data, are not yet adequately developed and standardised for these projects. At the time of writing, the Darwin Tree of Life (DToL) Project is over two years into its ten-year ambition to sequence all described eukaryotic species in Britain and Ireland. We have sought consensus from a wide range of scientists across taxonomic domains to determine the minimal set of metadata that we collectively deem as critically important to accompany each sequenced specimen. These metadata are made available throughout the subsequent laboratory processes, and once collected, need to be adequately managed to fulfil the requirements of good data management practice. Due to the size and scale of management required, software tools are needed. These tools need to implement rigorous development pathways and change management procedures to ensure that effective research data management of key project and sample metadata is maintained. Tracking of sample properties through the sequencing process is handled by Lab Information Management Systems (LIMS), so publication of the sequenced data is achieved via technical integration of LIMS and data management tools. Discussions with community members on how metadata standards need to be managed within large-scale programmes are a priority in the planning process. Here we report on the standards we developed with respect to a robust and reusable mechanism of metadata collection, in the hope that other projects forthcoming or underway will adopt these practices for metadata.
|
12
|
Chen Q, Allot A, Leaman R, Wei CH, Aghaarabi E, Guerrerio J, Xu L, Lu Z. LitCovid in 2022: an information resource for the COVID-19 literature. Nucleic Acids Res 2022; 51:D1512-D1518. [PMID: 36350613 PMCID: PMC9825538 DOI: 10.1093/nar/gkac1005] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/15/2022] [Revised: 10/11/2022] [Accepted: 10/19/2022] [Indexed: 11/11/2022] Open
Abstract
LitCovid (https://www.ncbi.nlm.nih.gov/research/coronavirus/), first launched in February 2020, is a first-of-its-kind literature hub for tracking up-to-date published research on COVID-19. The number of articles in LitCovid has increased from 55 000 to ∼300 000 over the past 2.5 years, with a consistent growth rate of ∼10 000 articles per month. In addition to the rapid literature growth, the COVID-19 pandemic has evolved dramatically. For instance, the Omicron variant has now accounted for over 98% of new infections in the United States. In response to the continuing evolution of the COVID-19 pandemic, this article describes significant updates to LitCovid over the last 2 years. First, we introduced the long Covid collection consisting of the articles on COVID-19 survivors experiencing ongoing multisystemic symptoms, including respiratory issues, cardiovascular disease, cognitive impairment, and profound fatigue. Second, we provided new annotations on the latest COVID-19 strains and vaccines mentioned in the literature. Third, we improved several existing features with more accurate machine learning algorithms for annotating topics and classifying articles relevant to COVID-19. LitCovid has been widely used with millions of accesses by users worldwide on various information needs and continues to play a critical role in collecting, curating and standardizing the latest knowledge on the COVID-19 literature.
Affiliation(s)
| | | | - Robert Leaman
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, MD, USA
| | - Chih-Hsuan Wei
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, MD, USA
| | | | | | | | - Zhiyong Lu
- To whom correspondence should be addressed. Tel: +1 301 594 7089; Fax: +1 301 480 2290;
| |
|
13
|
A group theoretic approach to model comparison with simplicial representations. J Math Biol 2022; 85:48. [PMID: 36209430 PMCID: PMC9548478 DOI: 10.1007/s00285-022-01807-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/12/2021] [Revised: 05/31/2022] [Accepted: 07/25/2022] [Indexed: 10/28/2022]
Abstract
The complexity of biological systems, and the increasingly large amount of associated experimental data, necessitate that we develop mathematical models to further our understanding of these systems. Because biological systems are generally not well understood, most mathematical models of these systems are based on experimental data, resulting in a seemingly heterogeneous collection of models that ostensibly represent the same system. To understand the system we therefore need to understand how the different models are related to each other, with a view to obtaining a unified mathematical description. This goal is complicated by the fact that a number of distinct mathematical formalisms may be employed to represent the same system, making direct comparison of the models very difficult. A methodology for comparing mathematical models based on their underlying conceptual structure is therefore required. In previous work we developed an appropriate framework for model comparison where we represent models, specifically the conceptual structure of the models, as labelled simplicial complexes and compare them with the two general methodologies of comparison by distance and comparison by equivalence. In this article we continue the development of our model comparison methodology in two directions. First, we present a rigorous and automatable methodology for the core process of comparison by equivalence, namely determining the vertices in a simplicial representation, corresponding to model components, that are conceptually related and the identification of these vertices via simplicial operations. Our methodology is based on considerations of vertex symmetry in the simplicial representation, for which we develop the required mathematical theory of group actions on simplicial complexes. This methodology greatly simplifies and expedites the process of determining model equivalence. Second, we provide an alternative mathematical framework for our model-comparison methodology by representing models as groups, which allows for the direct application of group-theoretic techniques within our model-comparison methodology.
|
14
|
Schuler R, Bugacov A, Hacia J, Ho T, Iwata J, Pearlman L, Samuels B, Williams C, Zhao Z, Kesselman C, Chai Y. FaceBase: A Community-Driven Hub for Data-Intensive Research. J Dent Res 2022; 101:1289-1298. [PMID: 35912790 PMCID: PMC9516628 DOI: 10.1177/00220345221107905] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 02/02/2023] Open
Abstract
The FaceBase Consortium, funded by the National Institute of Dental and Craniofacial Research of the National Institutes of Health, was established in 2009 with the recognition that dental and craniofacial research are increasingly data-intensive disciplines. Data sharing is critical for the validation and reproducibility of results as well as to enable reuse of data. In service of these goals, data ought to be FAIR: Findable, Accessible, Interoperable, and Reusable. The FaceBase data repository and educational resources exemplify the FAIR principles and support a broad user community including researchers in craniofacial development, molecular genetics, and genomics. FaceBase demonstrates that a model in which researchers "self-curate" their data can be successful and scalable. We present the results of the first 2.5 y of FaceBase's operations as an open community and summarize the data sets published during this period. We then describe a research highlight from work on the identification of regulatory networks and noncoding RNAs involved in cleft lip with/without cleft palate that both used and in turn contributed new findings to publicly available FaceBase resources. Collectively, FaceBase serves as a dynamic and continuously evolving resource to facilitate data-intensive research, enhance data reproducibility, and perform deep phenotyping across multiple species in dental and craniofacial research.
Affiliation(s)
- R.E. Schuler
- Viterbi School of Engineering, Information Sciences Institute, University of Southern California, Marina del Rey, CA, USA
| | - A. Bugacov
- Viterbi School of Engineering, Information Sciences Institute, University of Southern California, Marina del Rey, CA, USA
| | - J.G. Hacia
- Keck School of Medicine, Biochemistry and Molecular Medicine, University of Southern California, Los Angeles, CA, USA
| | - T.V. Ho
- Ostrow School of Dentistry, Center for Craniofacial Molecular Biology, University of Southern California, Los Angeles, CA, USA
| | - J. Iwata
- School of Dentistry, Diagnostic & Biomedical Sciences, The University of Texas Health Science Center at Houston, Houston, TX, USA
| | - L. Pearlman
- Viterbi School of Engineering, Information Sciences Institute, University of Southern California, Marina del Rey, CA, USA
| | - B.D. Samuels
- Ostrow School of Dentistry, Center for Craniofacial Molecular Biology, University of Southern California, Los Angeles, CA, USA
| | - C. Williams
- Viterbi School of Engineering, Information Sciences Institute, University of Southern California, Marina del Rey, CA, USA
| | - Z. Zhao
- School of Biomedical Informatics, Center for Precision Health, The University of Texas Health Science Center at Houston, Houston, TX, USA
| | - C. Kesselman
- Viterbi School of Engineering, Information Sciences Institute, University of Southern California, Marina del Rey, CA, USA
| | - Y. Chai
- Ostrow School of Dentistry, Center for Craniofacial Molecular Biology, University of Southern California, Los Angeles, CA, USA
| |
|
15
|
“KRiShI”: a manually curated knowledgebase on rice sheath blight disease. Funct Integr Genomics 2022; 22:1403-1410. [DOI: 10.1007/s10142-022-00899-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/02/2022] [Revised: 06/28/2022] [Accepted: 09/04/2022] [Indexed: 11/04/2022]
|
16
|
Xu Q, Liu Y, Hu J, Duan X, Song N, Zhou J, Zhai J, Su J, Liu S, Chen F, Zheng W, Guo Z, Li H, Zhou Q, Niu B. OncoPubMiner: a platform for mining oncology publications. Brief Bioinform 2022; 23:6691792. [PMID: 36058206 DOI: 10.1093/bib/bbac383] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/11/2022] [Revised: 08/08/2022] [Accepted: 08/09/2022] [Indexed: 11/12/2022] Open
Abstract
Updated and expert-quality knowledge bases are fundamental to biomedical research. A knowledge base established with human participation and subject to multiple inspections is needed to support clinical decision making, especially in the growing field of precision oncology. The number of original publications in this field has risen dramatically with the advances in technology and the evolution of in-depth research. Consequently, the issue of how to gather and mine these articles accurately and efficiently now requires close consideration. In this study, we present OncoPubMiner (https://oncopubminer.chosenmedinfo.com), a free and powerful system that combines text mining, data structure customisation, publication search with online reading and project-centred and team-based data collection to form a one-stop 'keyword in-knowledge out' oncology publication mining platform. The platform was constructed by integrating all open-access abstracts from PubMed and full-text articles from PubMed Central, and it is updated daily. OncoPubMiner makes obtaining precision oncology knowledge from scientific articles straightforward and will assist researchers in efficiently developing structured knowledge base systems and bring us closer to achieving precision oncology goals.
Affiliation(s)
- Quan Xu
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China
| | - Yueyue Liu
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China; ChosenMed Gene Technology Co. Ltd., Nanjing, China
| | - Jifang Hu
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China; Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100190, China
| | - Xiaohong Duan
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China; ChosenMed Gene Technology Co. Ltd., Nanjing, China
| | - Niuben Song
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China
| | - Jiale Zhou
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China
| | - Jincheng Zhai
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China
| | - Junyan Su
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China
| | - Siyao Liu
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China
| | - Fan Chen
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China; ChosenMed Gene Technology Co. Ltd., Nanjing, China
| | - Wei Zheng
- The Department of Nephrology and Hypertension Medicine, Beijing Electric Power Hospital, Beijing 100073, China
| | - Zhongjia Guo
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China
| | - Hexiang Li
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China
| | - Qiming Zhou
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China; ChosenMed Gene Technology Co. Ltd., Nanjing, China
| | - Beifang Niu
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China; Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100190, China
| |
|
17
|
Chen Q, Du J, Allot A, Lu Z. LitMC-BERT: Transformer-Based Multi-Label Classification of Biomedical Literature With An Application on COVID-19 Literature Curation. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:2584-2595. [PMID: 35536809 PMCID: PMC9647722 DOI: 10.1109/tcbb.2022.3173562] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/11/2021] [Revised: 04/19/2022] [Accepted: 04/22/2022] [Indexed: 05/20/2023]
Abstract
The rapid growth of biomedical literature poses a significant challenge for curation and interpretation. This has become more evident during the COVID-19 pandemic. LitCovid, a literature database of COVID-19 related papers in PubMed, has accumulated over 200,000 articles with millions of accesses. Approximately 10,000 new articles are added to LitCovid every month. A main curation task in LitCovid is topic annotation, where an article is assigned up to eight topics, e.g., Treatment and Diagnosis. The annotated topics have been widely used both in LitCovid (e.g., accounting for ∼18% of total uses) and downstream studies such as network generation. However, it has been a primary curation bottleneck due to the nature of the task and the rapid literature growth. This study proposes LITMC-BERT, a transformer-based multi-label classification method in biomedical literature. It uses a shared transformer backbone for all the labels while also capturing label-specific features and the correlations between label pairs. We compare LITMC-BERT with three baseline models on two datasets. Its micro-F1 and instance-based F1 are 5% and 4% higher than the current best results, respectively, and it requires only ∼18% of the inference time of the Binary BERT baseline. The related datasets and models are available via https://github.com/ncbi/ml-transformer.
|
18
|
Chen Q, Allot A, Leaman R, Islamaj R, Du J, Fang L, Wang K, Xu S, Zhang Y, Bagherzadeh P, Bergler S, Bhatnagar A, Bhavsar N, Chang YC, Lin SJ, Tang W, Zhang H, Tavchioski I, Pollak S, Tian S, Zhang J, Otmakhova Y, Yepes AJ, Dong H, Wu H, Dufour R, Labrak Y, Chatterjee N, Tandon K, Laleye FAA, Rakotoson L, Chersoni E, Gu J, Friedrich A, Pujari SC, Chizhikova M, Sivadasan N, VG S, Lu Z. Multi-label classification for biomedical literature: an overview of the BioCreative VII LitCovid Track for COVID-19 literature topic annotations. Database (Oxford) 2022; 2022:baac069. [PMID: 36043400 PMCID: PMC9428574 DOI: 10.1093/database/baac069] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 04/20/2022] [Revised: 08/02/2022] [Accepted: 08/13/2022] [Indexed: 05/03/2023]
Abstract
The coronavirus disease 2019 (COVID-19) pandemic has been severely impacting global society since December 2019. The related findings such as vaccine and drug development have been reported in biomedical literature at a rate of about 10 000 articles on COVID-19 per month. Such rapid growth significantly challenges manual curation and interpretation. For instance, LitCovid is a literature database of COVID-19-related articles in PubMed, which has accumulated more than 200 000 articles with millions of accesses each month by users worldwide. One primary curation task is to assign up to eight topics (e.g. Diagnosis and Treatment) to the articles in LitCovid. The annotated topics have been widely used for navigating the COVID literature, rapidly locating articles of interest and other downstream studies. However, annotating the topics has been the bottleneck of manual curation. Despite the continuing advances in biomedical text-mining methods, few have been dedicated to topic annotations in COVID-19 literature. To close the gap, we organized the BioCreative LitCovid track to call for a community effort to tackle automated topic annotation for COVID-19 literature. The BioCreative LitCovid dataset, consisting of over 30 000 articles with manually reviewed topics, was created for training and testing. It is one of the largest multi-label classification datasets in biomedical scientific literature. Nineteen teams worldwide participated and made 80 submissions in total. Most teams used hybrid systems based on transformers. The highest performing submissions achieved 0.8875, 0.9181 and 0.9394 for macro-F1-score, micro-F1-score and instance-based F1-score, respectively. Notably, these scores are substantially higher (e.g. 12% higher for macro-F1-score) than the corresponding scores of the state-of-the-art multi-label classification method. The level of participation and results demonstrate a successful track and help close the gap between dataset curation and method development. The dataset is publicly available via https://ftp.ncbi.nlm.nih.gov/pub/lu/LitCovid/biocreative/ for benchmarking and further development. Database URL: https://ftp.ncbi.nlm.nih.gov/pub/lu/LitCovid/biocreative/.
Affiliation(s)
- Qingyu Chen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20892, USA
| | - Alexis Allot
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20892, USA
| | - Robert Leaman
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20892, USA
| | - Rezarta Islamaj
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20892, USA
| | - Jingcheng Du
- School of Biomedical Informatics, UT Health, Houston, TX 77030, USA
| | - Li Fang
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA, USA
| | - Kai Wang
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA, USA
- Department of Pathology and Laboratory Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
| | - Shuo Xu
- College of Economics and Management, Beijing University of Technology, Beijing, QC, China
| | - Yuefu Zhang
- College of Economics and Management, Beijing University of Technology, Beijing, QC, China
| | | | | | | | | | - Yung-Chun Chang
- Graduate Institute of Data Science, Taipei Medical University, Taipei, Taiwan
| | - Sheng-Jie Lin
- Graduate Institute of Data Science, Taipei Medical University, Taipei, Taiwan
| | - Wentai Tang
- College of Computer Science and Technology, Dalian University of Technology, Dalian, China
| | - Hongtong Zhang
- College of Computer Science and Technology, Dalian University of Technology, Dalian, China
| | - Ilija Tavchioski
- Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia
- Jožef Stefan Institute, Ljubljana, Slovenia
| | | | - Shubo Tian
- Department of Statistics, Florida State University, Tallahassee, FL, USA
| | - Jinfeng Zhang
- Department of Statistics, Florida State University, Tallahassee, FL, USA
| | - Yulia Otmakhova
- School of Computing and Information Systems, University of Melbourne, Melbourne, AU-VIC, Australia
| | | | - Hang Dong
- Centre for Medical Informatics, Usher Institute, University of Edinburgh, Edinburgh, UK
| | - Honghan Wu
- Institute of Health Informatics, University College London, London, UK
| | | | | | - Niladri Chatterjee
- Department of Mathematics, Indian Institute of Technology Delhi, New Delhi, India
| | - Kushagri Tandon
- Department of Mathematics, Indian Institute of Technology Delhi, New Delhi, India
| | | | | | - Emmanuele Chersoni
- Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Hong Kong, China
| | - Jinghang Gu
- Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Hong Kong, China
| | | | - Subhash Chandra Pujari
- Institute of Computer Science, Heidelberg University, Heidelberg, Germany
- Bosch Center for Artificial Intelligence, Renningen, Germany
| | - Mariia Chizhikova
- SINAI Group, Department of Computer Science, Advanced Studies Center in ICT (CEATIC), Universidad de Jaén, Jaén, Spain
| | | | | | - Zhiyong Lu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20892, USA
| |
|
19
|
Raboudi A, Allanic M, Balvay D, Hervé PY, Viel T, Yoganathan T, Certain A, Hilbey J, Charlet J, Durupt A, Boutinaud P, Eynard B, Tavitian B. The BMS-LM ontology for biomedical data reporting throughout the lifecycle of a research study: From data model to ontology. J Biomed Inform 2022; 127:104007. [DOI: 10.1016/j.jbi.2022.104007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2021] [Revised: 12/24/2021] [Accepted: 01/28/2022] [Indexed: 11/16/2022]
|
20
|
Nadendla S, Jackson R, Munro J, Quaglia F, Mészáros B, Olley D, Hobbs ET, Goralski SM, Chibucos M, Mungall CJ, Tosatto SCE, Erill I, Giglio MG. ECO: the Evidence and Conclusion Ontology, an update for 2022. Nucleic Acids Res 2022; 50:D1515-D1521. [PMID: 34986598 PMCID: PMC8728134 DOI: 10.1093/nar/gkab1025] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 09/15/2021] [Revised: 10/12/2021] [Accepted: 10/18/2021] [Indexed: 11/12/2022] Open
Abstract
The Evidence and Conclusion Ontology (ECO) is a community resource that provides an ontology of terms used to capture the type of evidence that supports biomedical annotations and assertions. Consistent capture of evidence information with ECO allows tracking of annotation provenance, establishment of quality control measures, and evidence-based data mining. ECO is in use by dozens of data repositories and resources with both specific and general areas of focus. ECO is continually being expanded and enhanced in response to user requests as well as our aim to adhere to community best-practices for ontology development. The ECO support team engages in multiple collaborations with other ontologies and annotating groups. Here we report on recent updates to the ECO ontology itself as well as associated resources that are available through this project. ECO project products are freely available for download from the project website (https://evidenceontology.org/) and GitHub (https://github.com/evidenceontology/evidenceontology). ECO is released into the public domain under a CC0 1.0 Universal license.
Affiliation(s)
- Suvarna Nadendla
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, Maryland, USA
| | - Rebecca Jackson
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, Maryland, USA
| | - James Munro
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, Maryland, USA
| | - Federica Quaglia
- Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (CNR-IBIOM), Bari, Italy; Department of Biomedical Sciences, University of Padova, Padova, Italy
| | - Bálint Mészáros
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg 69117, Germany
| | - Dustin Olley
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, Maryland, USA
| | - Elizabeth T Hobbs
- Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, Maryland, United States
| | - Stephen M Goralski
- Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, Maryland, United States
| | - Marcus Chibucos
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, Maryland, USA
| | - Christopher John Mungall
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Lab, Berkeley, California, USA
| | | | - Ivan Erill
- Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, Maryland, United States
| | - Michelle G Giglio
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, Maryland, USA
| |
|
21
|
Charles WM, Delgado BM. Health Datasets as Assets: Blockchain-Based Valuation and Transaction Methods. Blockchain in Healthcare Today 2022; 5:185. [PMID: 36779021 PMCID: PMC9907414 DOI: 10.30953/bhty.v5.185] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/10/2021] [Revised: 12/19/2021] [Accepted: 12/21/2021] [Indexed: 05/13/2023]
Abstract
There is increasing recognition about health-oriented datasets that could be regarded as intangible assets: distinct assets with future economic benefits but without physical properties. While health-oriented datasets - particularly health records - are ascribed monetary value on the black market, there are few established methods for assessing the value for legitimate research and business purposes. The emergence of blockchain has created new commercial opportunities for transferring assets without intermediaries. Therefore, blockchain is proposed as a medium by which research datasets could be transacted to provide future value. For authorized individuals to verify their transactions, blockchain methodologies offer security, auditability, and transparency. The authors share data valuation methodologies consistent with accounting principles and include discussions of black market valuation of health data. Furthermore, this article describes blockchain-based methods of managing real-time payment/micropayment strategies.
|
22
|
Fitzpatrick R, Stefan MI. Validation Through Collaboration: Encouraging Team Efforts to Ensure Internal and External Validity of Computational Models of Biochemical Pathways. Neuroinformatics 2022; 20:277-284. [PMID: 35543917 PMCID: PMC9537119 DOI: 10.1007/s12021-022-09584-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 03/17/2022] [Indexed: 01/09/2023]
Abstract
Computational modelling of biochemical reaction pathways is an increasingly important part of neuroscience research. In order to be useful, computational models need to be valid in two senses: First, they need to be consistent with experimental data and able to make testable predictions (external validity). Second, they need to be internally consistent and independently reproducible (internal validity). Here, we discuss both types of validity and provide a brief overview of tools and technologies used to ensure they are met. We also suggest the introduction of new collaborative technologies to ensure model validity: an incentivised experimental database for external validity and reproducibility audits for internal validity. Both rely on FAIR principles and on collaborative science practices.
Affiliation(s)
- Richard Fitzpatrick
- Centre for Discovery Brain Sciences, University of Edinburgh, Edinburgh, UK; School of Biological Sciences, University of Edinburgh, Edinburgh, UK
| | - Melanie I. Stefan
- Centre for Discovery Brain Sciences, University of Edinburgh, Edinburgh, UK; ZJU-UoE Institute, Zhejiang University, Haining, China
| |
|
23
|
Chen Q, Rankine A, Peng Y, Aghaarabi E, Lu Z. Benchmarking Effectiveness and Efficiency of Deep Learning Models for Semantic Textual Similarity in the Clinical Domain: Validation Study. JMIR Med Inform 2021; 9:e27386. [PMID: 34967748 PMCID: PMC8759018 DOI: 10.2196/27386] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 01/22/2021] [Revised: 08/06/2021] [Accepted: 08/06/2021] [Indexed: 01/23/2023] Open
Abstract
Background: Semantic textual similarity (STS) measures the degree of relatedness between sentence pairs. The Open Health Natural Language Processing (OHNLP) Consortium released an expertly annotated STS data set and called for the National Natural Language Processing Clinical Challenges. This work describes our entry, an ensemble model that leverages a range of deep learning (DL) models. Our team from the National Library of Medicine obtained a Pearson correlation of 0.8967 in an official test set during 2019 National Natural Language Processing Clinical Challenges/Open Health Natural Language Processing shared task and achieved a second rank. Objective: Although our models strongly correlate with manual annotations, annotator-level correlation was only moderate (weighted Cohen κ=0.60). We are cautious of the potential use of DL models in production systems and argue that it is more critical to evaluate the models in-depth, especially those with extremely high correlations. In this study, we benchmark the effectiveness and efficiency of top-ranked DL models. We quantify their robustness and inference times to validate their usefulness in real-time applications. Methods: We benchmarked five DL models, which are the top-ranked systems for STS tasks: Convolutional Neural Network, BioSentVec, BioBERT, BlueBERT, and ClinicalBERT. We evaluated a random forest model as an additional baseline. For each model, we repeated the experiment 10 times, using the official training and testing sets. We reported 95% CI of the Wilcoxon rank-sum test on the average Pearson correlation (official evaluation metric) and running time. We further evaluated Spearman correlation, R², and mean squared error as additional measures. Results: Using only the official training set, all models obtained highly effective results. BioSentVec and BioBERT achieved the highest average Pearson correlations (0.8497 and 0.8481, respectively). BioSentVec also had the highest results in 3 of 4 effectiveness measures, followed by BioBERT. However, their robustness to sentence pairs of different similarity levels varies significantly. A particular observation is that BERT models made the most errors (a mean squared error of over 2.5) on highly similar sentence pairs. They cannot capture highly similar sentence pairs effectively when they have different negation terms or word orders. In addition, time efficiency is dramatically different from the effectiveness results. On average, the BERT models were approximately 20 times and 50 times slower than the Convolutional Neural Network and BioSentVec models, respectively. This results in challenges for real-time applications. Conclusions: Despite the excitement of further improving Pearson correlations in this data set, our results highlight that evaluations of the effectiveness and efficiency of STS models are critical. In future, we suggest more evaluations on the generalization capability and user-level testing of the models. We call for community efforts to create more biomedical and clinical STS data sets from different perspectives to reflect the multifaceted notion of sentence-relatedness.
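The effectiveness measures named in this abstract (Pearson correlation, Spearman correlation, and mean squared error) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names and the toy gold and predicted similarity scores are hypothetical, chosen only to show how each measure is computed with the standard library.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


def mse(x, y):
    """Mean squared error between gold and predicted scores."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


# Hypothetical toy data: annotated similarity scores and model predictions.
gold = [0.0, 1.0, 2.5, 4.0, 5.0]
pred = [0.5, 1.0, 2.0, 4.5, 4.5]

print("Pearson:", pearson(gold, pred))
print("Spearman:", spearman(gold, pred))
print("MSE:", mse(gold, pred))
```

Reporting the rank-based and error-based measures next to Pearson correlation is useful because a model can correlate well overall while still making large errors on a particular band of scores, which is the kind of robustness issue the abstract describes.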
Affiliation(s)
- Qingyu Chen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, United States
| | - Alex Rankine
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, United States; Harvard College, Cambridge, MA, United States
| | - Yifan Peng
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, United States; Weill Cornell Medicine, New York, NY, United States
| | - Elaheh Aghaarabi
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, United States; Towson University, Towson, MD, United States
| | - Zhiyong Lu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, United States
| |
|
24
|
Kuiper M, Bonello J, Fernández-Breis JT, Bucher P, Futschik ME, Gaudet P, Kulakovskiy IV, Licata L, Logie C, Lovering RC, Makeev VJ, Orchard S, Panni S, Perfetto L, Sant D, Schulz S, Zerbino DR, Lægreid A. The Gene Regulation Knowledge Commons: The action area of GREEKC. BIOCHIMICA ET BIOPHYSICA ACTA-GENE REGULATORY MECHANISMS 2021; 1865:194768. [PMID: 34757206 DOI: 10.1016/j.bbagrm.2021.194768] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/20/2021] [Revised: 10/18/2021] [Accepted: 10/20/2021] [Indexed: 02/08/2023]
Abstract
The COST Action Gene Regulation Ensemble Effort for the Knowledge Commons (GREEKC, CA15205, www.greekc.org) organized nine workshops in a four-year period, starting September 2016. The workshops brought together a wide range of experts from all over the world working on various parts of the knowledge cycle that is central to understanding gene regulatory mechanisms. The discussions between ontologists, curators, text miners, biologists, bioinformaticians, philosophers and computational scientists spawned a host of activities aimed to update and standardise existing knowledge management workflows, encourage new experimental approaches and thoroughly involve end-users in the process to design the Gene Regulation Knowledge Commons (GRKC). The GREEKC consortium describes its main achievements, contextualised in a state-of-the-art of current tools and resources that today represent the GRKC.
Affiliation(s)
- Martin Kuiper
- Systems Biology Group, Department of Biology, Norwegian University of Science and Technology, Trondheim, Norway.
| | - Joseph Bonello
- Faculty of Information & Communication Technology, University of Malta, Msida, Malta
| | | | - Philipp Bucher
- Swiss Institute of Bioinformatics, Quartier Sorge, Bâtiment Amphipôle, 1015 Lausanne, Switzerland
| | - Matthias E Futschik
- Systems Biology and Bioinformatics Laboratory (SysBioLab), Centre of Marine Sciences (CCMAR), University of Algarve, 8005-139 Faro, Portugal
| | - Pascale Gaudet
- SIB Swiss Institute of Bioinformatics, 1 Rue Michel-Servet, 1204 Geneva, Switzerland
| | - Ivan V Kulakovskiy
- Institute of Protein Research, Russian Academy of Sciences, Institutskaya 4, 142290 Pushchino, Russia
| | - Luana Licata
- Department of Biology, University of Rome Tor Vergata, Rome, Italy
| | - Colin Logie
- Department of Molecular Biology, Faculty of Science, Radboud University, PO Box 9101, Nijmegen 6500HG, the Netherlands
| | - Ruth C Lovering
- Functional Gene Annotation, Pre-clinical and Fundamental Science, Institute of Cardiovascular Science, University College London, 5 University Street, London WC1E 6JF, UK
| | - Vsevolod J Makeev
- Vavilov Institute of General Genetics, Russian Academy of Sciences, Gubkina 3, 119991 Moscow, Russia
| | - Sandra Orchard
- European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
| | - Simona Panni
- Department DIBEST, University of Calabria, Rende, Italy
| | - Livia Perfetto
- Fondazione Human Technopole, Department of Biology, Via Cristina Belgioioso, 171, 20157 Milan, Italy
| | - David Sant
- Department of Biomedical Informatics, University of Utah, 421 Wakara Way #140, Salt Lake City, UT 84108, United States
| | - Stefan Schulz
- Institute of Medical Informatics, Statistics and Documentation, Medical University of Graz, Auenbruggerpl. 2, Graz, Austria
| | - Daniel R Zerbino
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Astrid Lægreid
- Department of Clinical and Molecular Medicine, Norwegian University of Science and Technology, 7491 Trondheim, Norway
| | | |
|
25
|
Pourreza Shahri M, Kahanda I. Deep semi-supervised learning ensemble framework for classifying co-mentions of human proteins and phenotypes. BMC Bioinformatics 2021; 22:500. [PMID: 34656098 PMCID: PMC8520253 DOI: 10.1186/s12859-021-04421-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/14/2021] [Accepted: 10/04/2021] [Indexed: 11/13/2022] Open
Abstract
Background: Identifying human protein-phenotype relationships has attracted researchers in bioinformatics and biomedical natural language processing due to its importance in uncovering rare and complex diseases. Since experimental validation of protein-phenotype associations is prohibitive, automated tools capable of accurately extracting these associations from the biomedical text are in high demand. However, while the manual annotation of protein-phenotype co-mentions required for training such models is highly resource-consuming, extracting millions of unlabeled co-mentions is straightforward. Results: In this study, we propose a novel deep semi-supervised ensemble framework that combines deep neural networks, semi-supervised, and ensemble learning for classifying human protein-phenotype co-mentions with the help of unlabeled data. This framework allows the ability to incorporate an extensive collection of unlabeled sentence-level co-mentions of human proteins and phenotypes with a small labeled dataset to enhance overall performance. We develop PPPredSS, a prototype of our proposed semi-supervised framework that combines sophisticated language models, convolutional networks, and recurrent networks. Our experimental results demonstrate that the proposed approach provides a new state-of-the-art performance in classifying human protein-phenotype co-mentions by outperforming other supervised and semi-supervised counterparts. Furthermore, we highlight the utility of PPPredSS in powering a curation assistant system through case studies involving a group of biologists. Conclusions: This article presents a novel approach for human protein-phenotype co-mention classification based on deep, semi-supervised, and ensemble learning. The insights and findings from this work have implications for biomedical researchers, biocurators, and the text mining community working on biomedical relationship extraction.
Affiliation(s)
| | - Indika Kahanda
- School of Computing, University of North Florida, Jacksonville, USA.
| |
|
26
|
Glavaški M, Velicki L. Humans and machines in biomedical knowledge curation: hypertrophic cardiomyopathy molecular mechanisms' representation. BioData Min 2021; 14:45. [PMID: 34600580 PMCID: PMC8487578 DOI: 10.1186/s13040-021-00279-2] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 04/27/2021] [Accepted: 09/14/2021] [Indexed: 11/25/2022] Open
Abstract
Background: Biomedical knowledge is dispersed in scientific literature and is growing constantly. Curation is the extraction of knowledge from unstructured data into a computable form and could be done manually or automatically. Hypertrophic cardiomyopathy (HCM) is the most common inherited cardiac disease, with genotype–phenotype associations still incompletely understood. We compared human- and machine-curated HCM molecular mechanisms’ models and examined the performance of different machine approaches for that task. Results: We created six models representing HCM molecular mechanisms using different approaches and made them publicly available, analyzed them as networks, and tried to explain the models’ differences by the analysis of factors that affect the quality of machine-curated models (query constraints and reading systems’ performance). A result of this work is also the Interactive HCM map, the only publicly available knowledge resource dedicated to HCM. Sizes and topological parameters of the networks differed notably, and a low consensus was found in terms of centrality measures between networks. Consensus about the most important nodes was achieved only with respect to one element (calcium). Models with a reduced level of noise were generated and cooperatively working elements were detected. REACH and TRIPS reading systems showed much higher accuracy than Sparser, but at the cost of extraction performance. TRIPS proved to be the best single reading system for text segments about HCM, in terms of the compromise between accuracy and extraction performance. Conclusions: Different approaches in curation can produce models of the same disease with diverse characteristics, and they give rise to utterly different conclusions in subsequent analysis. The final purpose of the model should direct the choice of curation techniques. Manual curation represents the gold standard for information extraction in biomedical research and is most suitable when only high-quality elements for models are required. Automated curation provides more substance, but high level of noise is expected. Different curation strategies can reduce the level of human input needed. Biomedical knowledge would benefit overwhelmingly, especially as to its rapid growth, if computers were to be able to assist in analysis on a larger scale. Supplementary Information: The online version contains supplementary material available at 10.1186/s13040-021-00279-2.
Affiliation(s)
- Mila Glavaški
- Faculty of Medicine, University of Novi Sad, Novi Sad, Serbia.
| | - Lazar Velicki
- Faculty of Medicine, University of Novi Sad, Novi Sad, Serbia; Institute of Cardiovascular Diseases Vojvodina, Sremska Kamenica, Serbia
| |
|
27
|
Ramsey J, McIntosh B, Renfro D, Aleksander SA, LaBonte S, Ross C, Zweifel AE, Liles N, Farrar S, Gill JJ, Erill I, Ades S, Berardini TZ, Bennett JA, Brady S, Britton R, Carbon S, Caruso SM, Clements D, Dalia R, Defelice M, Doyle EL, Friedberg I, Gurney SMR, Hughes L, Johnson A, Kowalski JM, Li D, Lovering RC, Mans TL, McCarthy F, Moore SD, Murphy R, Paustian TD, Perdue S, Peterson CN, Prüß BM, Saha MS, Sheehy RR, Tansey JT, Temple L, Thorman AW, Trevino S, Vollmer AC, Walbot V, Willey J, Siegele DA, Hu JC. Crowdsourcing biocuration: The Community Assessment of Community Annotation with Ontologies (CACAO). PLoS Comput Biol 2021; 17:e1009463. [PMID: 34710081 PMCID: PMC8553046 DOI: 10.1371/journal.pcbi.1009463] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/19/2022] Open
Abstract
Experimental data about gene functions curated from the primary literature have enormous value for research scientists in understanding biology. Using the Gene Ontology (GO), manual curation by experts has provided an important resource for studying gene function, especially within model organisms. Unprecedented expansion of the scientific literature and validation of the predicted proteins have increased both data value and the challenges of keeping pace. Capturing literature-based functional annotations is limited by the ability of biocurators to handle the massive and rapidly growing scientific literature. Within the community-oriented wiki framework for GO annotation called the Gene Ontology Normal Usage Tracking System (GONUTS), we describe an approach to expand biocuration through crowdsourcing with undergraduates. This multiplies the number of high-quality annotations in international databases, enriches our coverage of the literature on normal gene function, and pushes the field in new directions. From an intercollegiate competition judged by experienced biocurators, Community Assessment of Community Annotation with Ontologies (CACAO), we have contributed nearly 5,000 literature-based annotations. Many of those annotations are to organisms not currently well-represented within GO. Over a 10-year history, our community contributors have spurred changes to the ontology not traditionally covered by professional biocurators. The CACAO principle of relying on community members to participate in and shape the future of biocuration in GO is a powerful and scalable model used to promote the scientific enterprise. It also provides undergraduate students with a unique and enriching introduction to critical reading of primary literature and acquisition of marketable skills.
Affiliation(s)
- Jolene Ramsey
- Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, United States of America
- Center for Phage Technology, Texas A&M University, College Station, Texas, United States of America
| | - Brenley McIntosh
- Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, United States of America
| | - Daniel Renfro
- Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, United States of America
| | - Suzanne A. Aleksander
- Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, United States of America
| | - Sandra LaBonte
- Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, United States of America
| | - Curtis Ross
- Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, United States of America
- Center for Phage Technology, Texas A&M University, College Station, Texas, United States of America
| | - Adrienne E. Zweifel
- Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, United States of America
| | - Nathan Liles
- Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, United States of America
| | - Shabnam Farrar
- Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, United States of America
| | - Jason J. Gill
- Center for Phage Technology, Texas A&M University, College Station, Texas, United States of America
- Department of Animal Science, Texas A&M University, College Station, Texas, United States of America
| | - Ivan Erill
- Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, Maryland, United States of America
- Department of Computer Science and Electrical Engineering, University of Maryland Baltimore County, Baltimore, Maryland, United States of America
| | - Sarah Ades
- Department of Biochemistry & Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America
| | - Tanya Z. Berardini
- The Arabidopsis Information Resource, Phoenix Bioinformatics, Newark, California, United States of America
| | - Jennifer A. Bennett
- Department of Biology and Earth Science, Otterbein University, Westerville, Ohio, United States of America
| | - Siobhan Brady
- Department of Plant Biology and Genome Center, University of California Davis, Davis, California, United States of America
| | - Robert Britton
- Department of Microbiology and Molecular Genetics, Michigan State University, East Lansing, Michigan, United States of America
| | - Seth Carbon
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
| | - Steven M. Caruso
- Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, Maryland, United States of America
| | - Dave Clements
- Department of Biology, Johns Hopkins University, Baltimore, Maryland, United States of America
| | - Ritu Dalia
- Department of Biology, Drexel University, Philadelphia, Pennsylvania, United States of America
| | - Meredith Defelice
- Department of Biochemistry & Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America
| | - Erin L. Doyle
- Biology Department, Doane University, Crete, Nebraska, United States of America
| | - Iddo Friedberg
- Department of Microbiology, Miami University, Oxford, Ohio, United States of America
| | - Susan M. R. Gurney
- Department of Biology, Drexel University, Philadelphia, Pennsylvania, United States of America
| | - Lee Hughes
- Department of Biological Sciences, University of North Texas, Denton, Texas, United States of America
| | - Allison Johnson
- Center for the Study of Biological Complexity, Virginia Commonwealth University, Richmond, Virginia, United States of America
| | - Jason M. Kowalski
- Biological Sciences Department, University of Wisconsin-Parkside, Kenosha, Wisconsin, United States of America
| | - Donghui Li
- The Arabidopsis Information Resource, Phoenix Bioinformatics, Newark, California, United States of America
| | - Ruth C. Lovering
- Institute of Cardiovascular Science, University College London, London, United Kingdom
| | - Tamara L. Mans
- Department of Biochemistry and Biotechnology, Minnesota State University Moorhead, Brooklyn Park, Minnesota, United States of America
| | - Fiona McCarthy
- Department of Basic Science, College of Veterinary Medicine, Mississippi State University, Starkville, Mississippi, United States of America
| | - Sean D. Moore
- Burnett School of Biomedical Sciences, University of Central Florida, Orlando, Florida, United States of America
| | - Rebecca Murphy
- Department of Biology, Centenary College of Louisiana, Shreveport, Louisiana, United States of America
| | - Timothy D. Paustian
- Department of Bacteriology, University of Wisconsin, Madison, Wisconsin, United States of America
| | - Sarah Perdue
- Biological Sciences Department, University of Wisconsin-Parkside, Kenosha, Wisconsin, United States of America
| | - Celeste N. Peterson
- Biology Department, Suffolk University, Boston, Massachusetts, United States of America
| | - Birgit M. Prüß
- Microbiological Sciences Department, North Dakota State University, Fargo, North Dakota, United States of America
| | - Margaret S. Saha
- Department of Biology, College of William & Mary, Williamsburg, Virginia, United States of America
| | - Robert R. Sheehy
- Biology Department, Radford University, Radford, Virginia, United States of America
| | - John T. Tansey
- Department of Biochemistry and Molecular Biology, Otterbein University, Westerville, Ohio, United States of America
| | - Louise Temple
- School of Integrated Sciences, James Madison University, Harrisonburg, Virginia, United States of America
| | - Alexander William Thorman
- Department of Environmental and Public Health Sciences, University of Cincinnati, Cincinnati, Ohio, United States of America
| | - Saul Trevino
- Department of Chemistry, Math, and Physics, Houston Baptist University, Houston, Texas, United States of America
| | - Amy Cheng Vollmer
- Department of Biology, Swarthmore College, Swarthmore, Pennsylvania, United States of America
| | - Virginia Walbot
- Department of Biology, Stanford University, Stanford, California, United States of America
| | - Joanne Willey
- Department of Science Education, Donald and Barbara Zucker School of Medicine at Hofstra/Northwell, Hempstead, New York, United States of America
| | - Deborah A. Siegele
- Department of Biology, Texas A&M University, College Station, Texas, United States of America
| | - James C. Hu
- Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, United States of America
- Center for Phage Technology, Texas A&M University, College Station, Texas, United States of America
| |
|
28
|
Díaz-Rodríguez M, Lithgow-Serrano O, Guadarrama-García F, Tierrafría VH, Gama-Castro S, Solano-Lira H, Salgado H, Rinaldi F, Méndez-Cruz CF, Collado-Vides J. Lisen&Curate: A platform to facilitate gathering textual evidence for curation of regulation of transcription initiation in bacteria. BIOCHIMICA ET BIOPHYSICA ACTA-GENE REGULATORY MECHANISMS 2021; 1864:194753. [PMID: 34461312 PMCID: PMC10155859 DOI: 10.1016/j.bbagrm.2021.194753] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2021] [Revised: 07/12/2021] [Accepted: 08/25/2021] [Indexed: 10/20/2022]
Abstract
The number of published papers in biomedical research makes it rather impossible for a researcher to keep up to date. This is where manually curated databases contribute facilitating the access to knowledge. However, the structure required by databases strongly limits the type of valuable information that can be incorporated. Here, we present Lisen&Curate, a curation system that facilitates linking sentences or part of sentences (both considered sources) in articles with their corresponding curated objects, so that rich additional information of these objects is easily available to users. These sources are going to be offered both within RegulonDB and a new database, L-Regulon. To show the relevance of our work, two senior curators performed a curation of 31 articles on the regulation of transcription initiation of E. coli using Lisen&Curate. As a result, 194 objects were curated and 781 sources were recorded. We also found that these sources are useful to develop automatic approaches to detect objects in articles by observing word frequency patterns and by carrying out an open information extraction task. Sources may help to elaborate a controlled vocabulary of experimental methods. Finally, we discuss our ecosystem of interconnected applications, RegulonDB, L-Regulon, and Lisen&Curate, to facilitate the access to knowledge on regulation of transcription initiation in bacteria. We see our proposal as the starting point to change the way experimentalists connect a piece of knowledge with its evidence using RegulonDB.
Affiliation(s)
- Martín Díaz-Rodríguez
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n Col. Chamilpa, 62210 Cuernavaca, Mor., Mexico
| | - Oscar Lithgow-Serrano
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n Col. Chamilpa, 62210 Cuernavaca, Mor., Mexico; Dalle Molle Institute for Artificial Intelligence Research, IDSIA USI-SUPSI, Polo universitario Lugano-Campus Est, Via la Santa 1, CH-6962 Lugano, Switzerland
| | - Francisco Guadarrama-García
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n Col. Chamilpa, 62210 Cuernavaca, Mor., Mexico
| | - Víctor H Tierrafría
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n Col. Chamilpa, 62210 Cuernavaca, Mor., Mexico
| | - Socorro Gama-Castro
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n Col. Chamilpa, 62210 Cuernavaca, Mor., Mexico
| | - Hilda Solano-Lira
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n Col. Chamilpa, 62210 Cuernavaca, Mor., Mexico
| | - Heladia Salgado
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n Col. Chamilpa, 62210 Cuernavaca, Mor., Mexico
| | - Fabio Rinaldi
- Dalle Molle Institute for Artificial Intelligence Research, IDSIA USI-SUPSI, Polo universitario Lugano-Campus Est, Via la Santa 1, CH-6962 Lugano, Switzerland; Department of Quantitative Biomedicine, University of Zurich, Zurich, Switzerland
| | - Carlos-Francisco Méndez-Cruz
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n Col. Chamilpa, 62210 Cuernavaca, Mor., Mexico.
| | - Julio Collado-Vides
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n Col. Chamilpa, 62210 Cuernavaca, Mor., Mexico; Department of Biomedical Engineering, Boston University, 44 Cummington Mall Room 403, 02215 Boston, MA, USA; Center for Genomic Regulation (CRG), Dr. Aiguader 88, 08003, Barcelona, Spain
| |
|
29
|
Allot A, Lee K, Chen Q, Luo L, Lu Z. LitSuggest: a web-based system for literature recommendation and curation using machine learning. Nucleic Acids Res 2021; 49:W352-W358. [PMID: 33950204 DOI: 10.1093/nar/gkab326] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Received: 03/01/2021] [Revised: 04/16/2021] [Accepted: 04/20/2021] [Indexed: 01/02/2023] Open
Abstract
Searching and reading relevant literature is a routine practice in biomedical research. However, it is challenging for a user to design optimal search queries using all the keywords related to a given topic. As such, existing search systems such as PubMed often return suboptimal results. Several computational methods have been proposed as an effective alternative to keyword-based query methods for literature recommendation. However, those methods require specialized knowledge in machine learning and natural language processing, which can make them difficult for biologists to utilize. In this paper, we propose LitSuggest, a web server that provides an all-in-one literature recommendation and curation service to help biomedical researchers stay up to date with scientific literature. LitSuggest combines advanced machine learning techniques for suggesting relevant PubMed articles with high accuracy. In addition to innovative text-processing methods, LitSuggest offers multiple advantages over existing tools. First, LitSuggest allows users to curate, organize, and download classification results in a single interface. Second, users can easily fine-tune LitSuggest results by updating the training corpus. Third, results can be readily shared, enabling collaborative analysis and curation of scientific literature. Finally, LitSuggest provides an automated personalized weekly digest of newly published articles for each user's project. LitSuggest is publicly available at https://www.ncbi.nlm.nih.gov/research/litsuggest.
Affiliation(s)
- Alexis Allot
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Kyubum Lee
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA; Department of Biostatistics and Bioinformatics, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL 33612, USA
| | - Qingyu Chen
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Ling Luo
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA
| |
|
30
|
Staton M, Cannon E, Sanderson LA, Wegrzyn J, Anderson T, Buehler S, Cobo-Simón I, Faaberg K, Grau E, Guignon V, Gunoskey J, Inderski B, Jung S, Lager K, Main D, Poelchau M, Ramnath R, Richter P, West J, Ficklin S. Tripal, a community update after 10 years of supporting open source, standards-based genetic, genomic and breeding databases. Brief Bioinform 2021; 22:6318561. [PMID: 34251419 PMCID: PMC8574961 DOI: 10.1093/bib/bbab238] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/10/2021] [Revised: 05/28/2021] [Accepted: 06/01/2021] [Indexed: 12/01/2022] Open
Abstract
Online, open access databases for biological knowledge serve as central repositories for research communities to store, find and analyze integrated, multi-disciplinary datasets. With increasing volumes, complexity and the need to integrate genomic, transcriptomic, metabolomic, proteomic, phenomic and environmental data, community databases face tremendous challenges in ongoing maintenance, expansion and upgrades. A common infrastructure framework using community standards shared by many databases can reduce development burden, provide interoperability, ensure use of common standards and support long-term sustainability. Tripal is a mature, open source platform built to meet this need. With ongoing improvement since its first release in 2009, Tripal provides full functionality for searching, browsing, loading and curating numerous types of data and is a primary technology powering at least 31 publicly available databases spanning plants, animals and human data, primarily storing genomics, genetics and breeding data. Tripal software development is managed by a shared, inclusive governance structure including both project management and advisory teams. Here, we report on the most important and innovative aspects of Tripal after 11 years of development, including integration of diverse types of biological data, successful collaborative projects across member databases, and support for implementing FAIR principles.
Affiliation(s)
| | - Ethalinda Cannon
- USDA-ARS, Corn Insects and Crop Genetics Research Unit, Ames, IA USA
| | | | | | | | | | | | - Kay Faaberg
- USDA-ARS, National Animal Disease Center, Ames, IA, USA
| | - Emily Grau
- University of Connecticut, Storrs, CT USA
| | | | | | | | - Sook Jung
- Washington State University, Pullman, WA USA
| | - Kelly Lager
- USDA-ARS, National Animal Disease Center, Ames, IA, USA
| | - Dorrie Main
- Washington State University, Pullman, WA USA
| | - Monica Poelchau
- USDA-ARS, National Agricultural Library, Beltsville, MD, USA
| | | | | | - Joe West
- University of Tennessee, Knoxville, TN USA
| | | |
|
31
|
Foerster H, Battey JND, Sierro N, Ivanov NV, Mueller LA. Metabolic networks of the Nicotiana genus in the spotlight: content, progress and outlook. Brief Bioinform 2021; 22:bbaa136. [PMID: 32662816 PMCID: PMC8138835 DOI: 10.1093/bib/bbaa136] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/25/2020] [Revised: 05/19/2020] [Accepted: 06/04/2020] [Indexed: 01/09/2023] Open
Abstract
Manually curated metabolic databases residing at the Sol Genomics Network comprise two taxon-specific databases, SolanaCyc for the Solanaceae family and NicotianaCyc for the genus Nicotiana, as well as six species-specific databases for Nicotiana tabacum TN90, N. tabacum K326, Nicotiana benthamiana, N. sylvestris, N. tomentosiformis and N. attenuata. New pathways were created through the extraction, examination and verification of related data from the literature, aided by external databases and guided by an expert-led curation process. Here we describe the curation progress that has been achieved in these databases since the first release (version 1.0) in 2016, the curation flow and the curation process using the example metabolic pathway for cholesterol in plants. The current content of our databases comprises 266 pathways and 36 superpathways in SolanaCyc and 143 pathways plus 21 superpathways in NicotianaCyc, manually curated and validated specifically for the Solanaceae family and Nicotiana genus, respectively. The curated data have been propagated to the respective Nicotiana-specific databases, which resulted in the enrichment and more accurate presentation of their metabolic networks. The quality and coverage in those databases have been compared with related external databases and discussed in terms of literature support and metabolic content.
|
32
|
Hatos A, Quaglia F, Piovesan D, Tosatto SCE. APICURON: a database to credit and acknowledge the work of biocurators. Database (Oxford) 2021; 2021:baab019. [PMID: 33882120 PMCID: PMC8060004 DOI: 10.1093/database/baab019] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 02/02/2021] [Revised: 03/12/2021] [Accepted: 04/12/2021] [Indexed: 11/14/2022]
Abstract
APICURON is an open and freely accessible resource that tracks and credits the work of biocurators across multiple participating knowledgebases. Biocuration is essential to extract knowledge from research data and make it available in a structured and standardized way to the scientific community. However, processing biological data, mainly from literature, requires a huge effort that is difficult to attribute and quantify. APICURON collects biocuration events from third-party resources and aggregates this information, spotlighting biocurator contributions. APICURON promotes biocurator engagement by implementing gamification concepts like badges, medals and leaderboards and at the same time provides a monitoring service for registered resources and for biocurators themselves. APICURON adopts a data model that is flexible enough to represent and track the majority of biocuration activities. Biocurators are identified through their Open Researcher and Contributor ID. The definition of curation events, scoring systems and rules for assigning badges and medals are resource-specific and easily customizable. Registered resources can transfer curation activities on the fly through a secure and robust Application Programming Interface (API). Here, we show how simple and effective it is to connect a resource to APICURON, describing the DisProt database of intrinsically disordered proteins as a use case. We believe APICURON will provide biological knowledgebases with a service to recognize and credit the effort of their biocurators, monitor their activity and promote curator engagement. Database URL: https://apicuron.org.
Affiliation(s)
- András Hatos
- Department of Biomedical Sciences, University of Padua, Via Ugo Bassi 58/B, Padova 35131, Italy
| | - Federica Quaglia
- Department of Biomedical Sciences, University of Padua, Via Ugo Bassi 58/B, Padova 35131, Italy
| | - Damiano Piovesan
- Department of Biomedical Sciences, University of Padua, Via Ugo Bassi 58/B, Padova 35131, Italy
| | - Silvio C E Tosatto
- Department of Biomedical Sciences, University of Padua, Via Ugo Bassi 58/B, Padova 35131, Italy
| |
|
33
|
Arnaboldi V, Cho J, Sternberg PW. Wormicloud: a new text summarization tool based on word clouds to explore the C. elegans literature. Database (Oxford) 2021; 2021:6206631. [PMID: 33787871 DOI: 10.1093/database/baab015] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/15/2020] [Revised: 02/19/2021] [Accepted: 03/24/2021] [Indexed: 11/12/2022]
Abstract
Finding relevant information from newly published scientific papers is becoming increasingly difficult due to the pace at which articles are published every year as well as the increasing amount of information per paper. Biocuration and model organism databases provide a map for researchers to navigate through the complex structure of the biomedical literature by distilling knowledge into curated and standardized information. In addition, scientific search engines such as PubMed and text-mining tools such as Textpresso allow researchers to easily search for specific biological aspects from newly published papers, facilitating knowledge transfer. However, digesting the information returned by these systems, often a large number of documents, still requires considerable effort. In this paper, we present Wormicloud, a new tool that summarizes scientific articles in a graphical way through word clouds. This tool is aimed at facilitating the discovery of new experimental results not yet curated by model organism databases and is designed for both researchers and biocurators. Wormicloud is customized for the Caenorhabditis elegans literature and provides several advantages over existing solutions, including being able to perform full-text searches through Textpresso, which provides more accurate results than other existing literature search engines. Wormicloud is integrated through direct links from gene interaction pages in WormBase. Additionally, it allows analysis on the gene sets obtained from literature searches with other WormBase tools such as SimpleMine and Gene Set Enrichment. Database URL: https://wormicloud.textpressolab.com.
Affiliation(s)
- Valerio Arnaboldi
- Division of Biology and Biological Engineering 156-29, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
| | - Jaehyoung Cho
- Division of Biology and Biological Engineering 156-29, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
| | - Paul W Sternberg
- Division of Biology and Biological Engineering 156-29, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
| |
|
34
|
Touré V, Zobolas J, Kuiper M, Vercruysse S. CausalBuilder: bringing the MI2CAST causal interaction annotation standard to the curator. Database (Oxford) 2021; 2021:6129748. [PMID: 33547799 PMCID: PMC7904049 DOI: 10.1093/database/baaa107] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/27/2020] [Revised: 11/16/2020] [Accepted: 12/07/2020] [Indexed: 12/23/2022]
Abstract
Molecular causal interactions are defined as regulatory connections between biological components. They are commonly retrieved from biological experiments and can be used for connecting biological molecules together to enable the building of regulatory computational models that represent biological systems. However, including a molecular causal interaction in a model requires assessing its relevance to that model, based on the detailed knowledge about the biomolecules, interaction type and biological context. In order to standardize the representation of this knowledge in 'causal statements', we recently developed the Minimum Information about a Molecular Interaction Causal Statement (MI2CAST) guidelines. Here, we introduce causalBuilder: an intuitive web-based curation interface for the annotation of molecular causal interactions that comply with the MI2CAST standard. The causalBuilder prototype essentially embeds the MI2CAST curation guidelines in its interface and makes its rules easy to follow by a curator. In addition, causalBuilder serves as an original application of the Visual Syntax Method general-purpose curation technology and provides both curators and tool developers with an interface that can be fully configured to allow focusing on selected MI2CAST concepts to annotate. After the information is entered, the causalBuilder prototype produces genuine causal statements that can be exported in different formats.
Affiliation(s)
- Vasundra Touré
- Department of Biology, Norwegian University of Science and Technology (NTNU), Høgskoleringen 5, 7491 Trondheim, Norway
| | - John Zobolas
- Department of Biology, Norwegian University of Science and Technology (NTNU), Høgskoleringen 5, 7491 Trondheim, Norway
| | - Martin Kuiper
- Department of Biology, Norwegian University of Science and Technology (NTNU), Høgskoleringen 5, 7491 Trondheim, Norway
| | - Steven Vercruysse
- Department of Biology, Norwegian University of Science and Technology (NTNU), Høgskoleringen 5, 7491 Trondheim, Norway
| |
|
35
|
Pancsa R, Vranken W, Mészáros B. Computational resources for identifying and describing proteins driving liquid-liquid phase separation. Brief Bioinform 2021; 22:6124912. [PMID: 33517364 PMCID: PMC8425267 DOI: 10.1093/bib/bbaa408] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Received: 08/26/2020] [Revised: 11/23/2020] [Accepted: 12/12/2020] [Indexed: 01/06/2023] Open
Abstract
One of the most intriguing fields emerging in current molecular biology is the study of membraneless organelles formed via liquid–liquid phase separation (LLPS). These organelles perform crucial functions in cell regulation and signalling, and recent years have also brought about the understanding of the molecular mechanism of their formation. The LLPS field is continuously developing and optimizing dedicated in vitro and in vivo methods to identify and characterize these non-stoichiometric molecular condensates and the proteins able to drive or contribute to LLPS. Building on these observations, several computational tools and resources have emerged in parallel to serve as platforms for the collection, annotation and prediction of membraneless organelle-linked proteins. In this survey, we showcase recent advancements in LLPS bioinformatics, focusing on (i) available databases and ontologies that are necessary to describe the studied phenomena and the experimental results in an unambiguous way and (ii) prediction methods to assess the potential LLPS involvement of proteins. Through hands-on application of these resources on example proteins and representative datasets, we give a practical guide to show how they can be used in conjunction to provide in silico information on LLPS.
Affiliation(s)
- Rita Pancsa
- Enzymology Institute of the Research Centre for Natural Sciences, Budapest, Hungary
| | - Wim Vranken
- Computer Science, chemistry and biomedical sciences at the Vrije Universiteit Brussel
| | - Bálint Mészáros
- Structural and Computational Biology Unit at the European Molecular Biology Laboratory, Heidelberg 69117, Germany
| |
|
36
|
Bastian FB, Roux J, Niknejad A, Comte A, Fonseca Costa SS, de Farias TM, Moretti S, Parmentier G, de Laval VR, Rosikiewicz M, Wollbrett J, Echchiki A, Escoriza A, Gharib WH, Gonzales-Porta M, Jarosz Y, Laurenczy B, Moret P, Person E, Roelli P, Sanjeev K, Seppey M, Robinson-Rechavi M. The Bgee suite: integrated curated expression atlas and comparative transcriptomics in animals. Nucleic Acids Res 2021; 49:D831-D847. [PMID: 33037820 PMCID: PMC7778977 DOI: 10.1093/nar/gkaa793] [Citation(s) in RCA: 76] [Impact Index Per Article: 25.3] [Received: 07/14/2020] [Revised: 08/24/2020] [Accepted: 09/15/2020] [Indexed: 01/24/2023] Open
Abstract
Bgee is a database to retrieve and compare gene expression patterns in multiple animal species, produced by integrating multiple data types (RNA-Seq, Affymetrix, in situ hybridization, and EST data). It is based exclusively on curated healthy wild-type expression data (e.g., no gene knock-out, no treatment, no disease), to provide a comparable reference of normal gene expression. Curation includes very large datasets such as GTEx (re-annotation of samples as ‘healthy’ or not) as well as many small ones. Data are integrated and made comparable between species thanks to consistent data annotation and processing, and to calls of presence/absence of expression, along with expression scores. As a result, Bgee is capable of detecting the conditions of expression of any single gene, accommodating any data type and species. Bgee provides several tools for analyses, allowing, e.g., automated comparisons of gene expression patterns within and between species, retrieval of the preferred conditions of expression of any gene, or enrichment analyses of conditions with expression of sets of genes. Bgee release 14.1 includes 29 animal species, and is available at https://bgee.org/ and through its Bioconductor R package BgeeDB.
Affiliation(s)
- Frederic B Bastian
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Julien Roux
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Anne Niknejad
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Aurélie Comte
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Sara S Fonseca Costa
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Tarcisio Mendes de Farias
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Sébastien Moretti
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Gilles Parmentier
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Valentine Rech de Laval
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Marta Rosikiewicz
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Julien Wollbrett
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Amina Echchiki
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Angélique Escoriza
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Walid H Gharib
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Mar Gonzales-Porta
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Yohan Jarosz
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Balazs Laurenczy
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Philippe Moret
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Emilie Person
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Patrick Roelli
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Komal Sanjeev
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Mathieu Seppey
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Marc Robinson-Rechavi
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| |
|
37
|
Chen Q, Allot A, Lu Z. LitCovid: an open database of COVID-19 literature. Nucleic Acids Res 2021; 49:D1534-D1540. [PMID: 33166392 PMCID: PMC7778958 DOI: 10.1093/nar/gkaa952] [Citation(s) in RCA: 130] [Impact Index Per Article: 43.3] [Received: 08/15/2020] [Revised: 10/02/2020] [Accepted: 10/08/2020] [Indexed: 12/22/2022] Open
Abstract
Since the outbreak of the current pandemic in 2020, there has been a rapid growth of published articles on COVID-19 and SARS-CoV-2, with about 10 000 new articles added each month. This is causing an increasingly serious information overload, making it difficult for scientists, healthcare professionals and the general public to remain up to date on the latest SARS-CoV-2 and COVID-19 research. Hence, we developed LitCovid (https://www.ncbi.nlm.nih.gov/research/coronavirus/), a curated literature hub, to track up-to-date scientific information in PubMed. LitCovid is updated daily with newly identified relevant articles organized into curated categories. To support manual curation, advanced machine-learning and deep-learning algorithms have been developed, evaluated and integrated into the curation workflow. To the best of our knowledge, LitCovid is the first-of-its-kind COVID-19-specific literature resource, with all of its collected articles and curated data freely available. Since its release, LitCovid has been widely used, with millions of accesses by users worldwide for various information needs, such as evidence synthesis, drug discovery and text and data mining, among others.
Affiliation(s)
- Qingyu Chen
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20892, USA
| | - Alexis Allot
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20892, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20892, USA
| |
|
38
|
Egorova KS, Smirnova NS, Toukach PV. CSDB_GT, a curated glycosyltransferase database with close-to-full coverage on three most studied nonanimal species. Glycobiology 2020; 31:524-529. [PMID: 33242091 DOI: 10.1093/glycob/cwaa107] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 10/03/2020] [Revised: 11/13/2020] [Accepted: 11/18/2020] [Indexed: 11/13/2022] Open
Abstract
We report the completion of the first stage of the development of a novel manually curated database on glycosyltransferase (GT) activities, CSDB_GT. CSDB_GT (http://csdb.glycoscience.ru/gt.html) has been supplemented with GT activities from Saccharomyces cerevisiae. It now provides close-to-complete coverage of experimentally confirmed GTs from the three most studied model organisms from the three kingdoms: plantae (Arabidopsis thaliana, ca. 930 activities), bacteria (Escherichia coli, ca. 820 activities) and fungi (S. cerevisiae, ca. 270 activities).
Affiliation(s)
- Ksenia S Egorova
- Laboratory of Metal-Complex and Nano-Scale Catalysts, N.D. Zelinsky Institute of Organic Chemistry, Russian Academy of Sciences, Leninsky prospect 47, Moscow 119991, Russia
| | - Nadezhda S Smirnova
- Kurnakov Institute of General and Inorganic Chemistry, Russian Academy of Sciences, Leninsky prospect 31, Moscow 119991, Russia
| | - Philip V Toukach
- Laboratory of Carbohydrate Chemistry, N.D. Zelinsky Institute of Organic Chemistry, Russian Academy of Sciences, Leninsky prospect 47, Moscow 119991, Russia
| |
|
39
|
Gabrielsen AM. Openness and trust in data-intensive science: the case of biocuration. Med Health Care Philos 2020; 23:497-504. [PMID: 32524312 PMCID: PMC7426290 DOI: 10.1007/s11019-020-09960-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 06/11/2023]
Abstract
Data-intensive science comes with increased risks concerning quality and reliability of data, and while trust in science has traditionally been framed as a matter of scientists being expected to adhere to certain technical and moral norms for behaviour, emerging discourses of open science present openness and transparency as substitutes for established trust mechanisms. By ensuring access to all available information, quality becomes a matter of informed judgement by the users, and trust no longer seems necessary. This strategy does not, however, take into consideration the networks of professionals already enabling data-intensive science by providing high-quality data. In the life sciences, biological data- and knowledge bases managed by expert biocurators have become crucial for data-intensive research. In this paper, I will use the case of biocurators to argue that openness and transparency will not diminish the need for trust in data-intensive science. On the contrary, data-intensive science requires a reconfiguration of existing trust mechanisms in order to include those who take care of and manage scientific data after its production.
|
40
|
Nydal R, Bennett G, Kuiper M, Lægreid A. Silencing trust: confidence and familiarity in re-engineering knowledge infrastructures. Med Health Care Philos 2020; 23:471-484. [PMID: 32468194 PMCID: PMC7426298 DOI: 10.1007/s11019-020-09957-0] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 05/03/2023]
Abstract
In this paper, we tell the story of efforts currently underway, on diverse fronts, to build digital knowledge repositories ('knowledge-bases') to support research in the life sciences. If successful, knowledge bases will be part of a new knowledge infrastructure capable of facilitating ever-more comprehensive, computational models of biological systems. Such an infrastructure would, however, represent a sea-change in the technological management and manipulation of complex data, inducing a generational shift in how questions are asked and answered and results published and circulated. Integrating such knowledge bases into the daily workflow of the lab thus destabilizes a number of well-established habits which biologists rely on to ensure the quality of the knowledge they produce, evaluate, communicate and exploit. As the story we tell here shows, such destabilization introduces a situation of unfamiliarity, one that carries with it epistemic risks. It should elicit, to use Niklas Luhmann's terms, the question of trust: a shared recognition that the reliability of research practices is being risked, but that such a risk is worth taking in view of what may be gained. And yet, the problem of trust is being unexpectedly silenced. How that silencing has come about, why it matters, and what might yet be done forms the heart of this paper.
Affiliation(s)
- Rune Nydal
- Programme for Applied Ethics, Department of Philosophy and Religious Studies, Norwegian University of Science and Technology, NO-7491 Trondheim, Norway
| | - Gaymon Bennett
- School of Historical, Philosophical, and Religious Studies, Arizona State University, Tempe, AZ 85287-4302 USA
| | - Martin Kuiper
- Department of Biology, Norwegian University of Science and Technology, NO-7491 Trondheim, Norway
| | - Astrid Lægreid
- Department of Clinical and Molecular Medicine, Norwegian University of Science and Technology, NO-7491 Trondheim, Norway
| |
|
41
|
Shaw F, Etuk A, Minotto A, Gonzalez-Beltran A, Johnson D, Rocca-Serra P, Laporte MA, Arnaud E, Devare M, Kersey P, Sansone SA, Davey RP. COPO: a metadata platform for brokering FAIR data in the life sciences. F1000Res 2020. [DOI: 10.12688/f1000research.23889.1] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Indexed: 01/02/2023] Open
Abstract
Scientific innovation is increasingly reliant on data and computational resources. Much of today’s life science research involves generating, processing, and reusing heterogeneous datasets that are growing exponentially in size. Demand for technical experts (data scientists and bioinformaticians) to process these data is at an all-time high, but these are not typically trained in good data management practices. That said, we have come a long way in the last decade, with funders, publishers, and researchers themselves making the case for open, interoperable data as a key component of an open science philosophy. In response, recognition of the FAIR Principles (that data should be Findable, Accessible, Interoperable and Reusable) has become commonplace. However, both technical and cultural challenges for the implementation of these principles still exist when storing, managing, analysing and disseminating both legacy and new data. COPO is a computational system that attempts to address some of these challenges by enabling scientists to describe their research objects (raw or processed data, publications, samples, images, etc.) using community-sanctioned metadata sets and vocabularies, and then use public or institutional repositories to share them with the wider scientific community. COPO encourages data generators to adhere to appropriate metadata standards when publishing research objects, using semantic terms to add meaning to them and specify relationships between them. This allows data consumers, be they people or machines, to find, aggregate, and analyse data which would otherwise be private or invisible, building upon existing standards to push the state of the art in scientific data dissemination whilst minimising the burden of data publication and sharing.
42
Leaman R, Wei CH, Allot A, Lu Z. Ten tips for a text-mining-ready article: How to improve automated discoverability and interpretability. PLoS Biol 2020; 18:e3000716. [PMID: 32479517 PMCID: PMC7289435 DOI: 10.1371/journal.pbio.3000716] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Revised: 06/11/2020] [Indexed: 12/22/2022] Open
Abstract
Data-driven research in biomedical science requires structured, computable data. Increasingly, these data are created with support from automated text mining. Text-mining tools have rapidly matured: although not perfect, they now frequently provide outstanding results. We describe 10 straightforward writing tips—and a web tool, PubReCheck—guiding authors to help address the most common cases that remain difficult for text-mining tools. We anticipate these guides will help authors’ work be found more readily and used more widely, ultimately increasing the impact of their work and the overall benefit to both authors and readers. PubReCheck is available at http://www.ncbi.nlm.nih.gov/research/pubrecheck. Your published research is already being processed with automated tools, and text mining will become more common; this Community Page article describes how you can help these tools process your work more accurately, including a web tool, PubReCheck.
Affiliation(s)
- Robert Leaman
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
- Chih-Hsuan Wei
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
- Alexis Allot
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
- Zhiyong Lu
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
43
Teodoro D, Knafou J, Naderi N, Pasche E, Gobeill J, Arighi CN, Ruch P. UPCLASS: a deep learning-based classifier for UniProtKB entry publications. Database (Oxford) 2020; 2020:5822772. [PMID: 32367111 PMCID: PMC7198315 DOI: 10.1093/database/baaa026] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 10/31/2019] [Revised: 02/19/2020] [Accepted: 03/11/2020] [Indexed: 12/20/2022]
Abstract
In the UniProt Knowledgebase (UniProtKB), publications providing evidence for a specific protein annotation entry are organized across different categories, such as function, interaction and expression, based on the type of data they contain. To provide a systematic way of categorizing computationally mapped bibliographies in UniProt, we investigate a convolutional neural network (CNN) model to classify publications with accession annotations according to UniProtKB categories. The main challenge of categorizing publications at the accession annotation level is that the same publication can be annotated with multiple proteins and thus be associated with different category sets according to the evidence provided for the protein. We propose a model that divides the document into parts containing and not containing evidence for the protein annotation. Then, we use these parts to create different feature sets for each accession and feed them to separate layers of the network. The CNN model achieved a micro F1-score of 0.72 and a macro F1-score of 0.62, outperforming baseline models based on logistic regression and support vector machine by up to 22 and 18 percentage points, respectively. We believe that such an approach could be used to systematically categorize the computationally mapped bibliography in UniProtKB, which represents a significant set of the publications, and help curators to decide whether a publication is relevant for further curation for a protein accession. Database URL: https://goldorak.hesge.ch/bioexpclass/upclass/.
Affiliation(s)
- Douglas Teodoro
  - Geneva School of Business Administration, CH-1227, University of Applied Sciences and Arts Western Switzerland, HES-SO, Geneva, Switzerland
  - Text Mining Group, Rue Michel-Servet 1, CH-1206, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
- Julien Knafou
  - Geneva School of Business Administration, CH-1227, University of Applied Sciences and Arts Western Switzerland, HES-SO, Geneva, Switzerland
  - Text Mining Group, Rue Michel-Servet 1, CH-1206, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
- Nona Naderi
  - Geneva School of Business Administration, CH-1227, University of Applied Sciences and Arts Western Switzerland, HES-SO, Geneva, Switzerland
  - Text Mining Group, Rue Michel-Servet 1, CH-1206, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
- Emilie Pasche
  - Geneva School of Business Administration, CH-1227, University of Applied Sciences and Arts Western Switzerland, HES-SO, Geneva, Switzerland
  - Text Mining Group, Rue Michel-Servet 1, CH-1206, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
- Julien Gobeill
  - Geneva School of Business Administration, CH-1227, University of Applied Sciences and Arts Western Switzerland, HES-SO, Geneva, Switzerland
  - Text Mining Group, Rue Michel-Servet 1, CH-1206, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
- Cecilia N Arighi
  - Center of Bioinformatics and Computational Biology, 15 Innovation Way, 19711, Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA
- Patrick Ruch
  - Geneva School of Business Administration, CH-1227, University of Applied Sciences and Arts Western Switzerland, HES-SO, Geneva, Switzerland
  - Text Mining Group, Rue Michel-Servet 1, CH-1206, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
44
Lock A, Harris MA, Rutherford K, Hayles J, Wood V. Community curation in PomBase: enabling fission yeast experts to provide detailed, standardized, sharable annotation from research publications. Database (Oxford) 2020; 2020:5827230. [PMID: 32353878 PMCID: PMC7192550 DOI: 10.1093/database/baaa028] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 01/10/2020] [Revised: 02/28/2020] [Accepted: 03/22/2020] [Indexed: 11/22/2022]
Abstract
Maximizing the impact and value of scientific research requires efficient knowledge distribution, which increasingly depends on the integration of standardized published data into online databases. To make data integration more comprehensive and efficient for fission yeast research, PomBase has pioneered a community curation effort that engages publication authors directly in FAIR-sharing of data representing detailed biological knowledge from hypothesis-driven experiments. Canto, an intuitive online curation tool that enables biologists to describe their detailed functional data using shared ontologies, forms the core of PomBase’s system. With 8 years’ experience, and as the author response rate reaches 50%, we review community curation progress and the insights we have gained from the project. We highlight incentives and nudges we deploy to maximize participation, and summarize project outcomes, which include increased knowledge integration and dissemination as well as the unanticipated added value arising from co-curation by publication authors and professional curators.
Affiliation(s)
- Antonia Lock
  - Department of Genetics, Evolution and Environment, University College London, Gower Street, London WC1E 6BT, UK
- Midori A Harris
  - Department of Biochemistry, University of Cambridge, Tennis Court Road, Cambridge CB2 1GA, UK
- Kim Rutherford
  - Department of Biochemistry, University of Cambridge, Tennis Court Road, Cambridge CB2 1GA, UK
- Jacqueline Hayles
  - Department of Biochemistry, University of Cambridge, Tennis Court Road, Cambridge CB2 1GA, UK
- Valerie Wood
  - Cell Cycle Laboratory, The Francis Crick Institute, Midland Rd, London NW1 1AT, UK
45
Southan C. Opening up connectivity between documents, structures and bioactivity. Beilstein J Org Chem 2020; 16:596-606. [PMID: 32280387 PMCID: PMC7136548 DOI: 10.3762/bjoc.16.54] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 11/22/2019] [Accepted: 03/12/2020] [Indexed: 12/17/2022] Open
Abstract
Bioscientists reading papers or patents strive to discern the key relationships reported within a document "D" where a bioactivity "A" with a quantitative result "R" (e.g., an IC50) is reported for chemical structure "C" that modulates (e.g., inhibits) a protein target "P". A useful shorthand for this connectivity thus becomes DARCP. The problem at the core of this article is that the community has spent millions effectively burying these relationships in PDFs over many decades but must now spend millions more trying to get them back out. The key imperative for this is to increase the flow into structured open databases. The positive impacts will include expanded data mining opportunities for drug discovery and chemical biology. Over the last decade commercial sources have manually extracted DARCP from ≈300,000 documents encompassing ≈7 million compounds interacting with ≈10,000 targets. Over a similar time, the Guide to Pharmacology, BindingDB and ChEMBL have carried out analogous DARCP extractions. Although their expert-curated numbers are lower (i.e., ≈2 million compounds against ≈3700 human proteins), these open sources have the great advantage of being merged within PubChem. Parallel efforts have focused on the extraction of document-to-compound (D-C-only) connectivity. In the absence of molecular mechanism of action (mmoa) annotation, this is of less value but can be automatically extracted. This has been significantly accomplished for patents (e.g., by IBM, SureChEMBL and WIPO) for over 30 million compounds in PubChem. These have recently been joined by 1.4 million D-C submissions from three major chemistry publishers. In addition, both the European and US PubMed Central portals now add chemistry look-ups from abstracts and full-text papers. However, the fully automated extraction of DARCP has not yet been achieved. This stands in contrast to the ability of biocurators to discern these relationships in minutes. Unfortunately, no journals have yet instigated a flow of author-specified DARCP directly into open databases. Progress may come from trends such as open science, open access (OA), findable, accessible, interoperable and reusable (FAIR), resource description framework (RDF) and WikiData. However, we will need to await the technical applicability in respect to DARCP capture to see if this opens up connectivity.
Affiliation(s)
- Christopher Southan
  - Deanery of Biomedical Sciences, University of Edinburgh, Edinburgh, EH8 9XD, UK
  - TW2Informatics Ltd, Västra Frölunda, Gothenburg, 42166, Sweden
46
Baryshnikova A. Data libraries - the missing element for modeling biological systems. FEBS J 2020; 287:4594-4601. [PMID: 32100391 PMCID: PMC7687078 DOI: 10.1111/febs.15261] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/09/2019] [Revised: 02/19/2020] [Accepted: 02/24/2020] [Indexed: 11/29/2022]
Abstract
The primary bottleneck in understanding and modeling biological systems is shifting from data collection to data analysis and integration. This process critically depends on data being available in an organized form, so that they can be accessed, understood, and reused by a broad community of scientists. A proven solution for organizing data is literature curation, which extracts, aggregates, and distributes findings from publications. Here, I describe the benefits of extending curation practices to datasets, especially those that are not deposited in centralized databases. I argue that dataset curation (or ‘data librarianship’ as I suggest we call it) will overcome many barriers in data visibility and reusability and make a unique contribution to integration and modeling.
47
Breuza L, Arighi CN, Argoud-Puy G, Casals-Casas C, Estreicher A, Famiglietti ML, Georghiou G, Gos A, Gruaz-Gumowski N, Hinz U, Hyka-Nouspikel N, Kramarz B, Lovering RC, Lussi Y, Magrane M, Masson P, Perfetto L, Poux S, Rodriguez-Lopez M, Stoeckert C, Sundaram S, Wang LS, Wu E, Orchard S. A Coordinated Approach by Public Domain Bioinformatics Resources to Aid the Fight Against Alzheimer's Disease Through Expert Curation of Key Protein Targets. J Alzheimers Dis 2020; 77:257-273. [PMID: 32716361 PMCID: PMC7592670 DOI: 10.3233/jad-200206] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Accepted: 06/05/2020] [Indexed: 01/08/2023]
Abstract
BACKGROUND: The analysis and interpretation of data generated from patient-derived clinical samples relies on access to high-quality bioinformatics resources. These are maintained and updated by expert curators extracting knowledge from unstructured biological data described in free-text journal articles and converting this into more structured, computationally-accessible forms. This enables analyses such as functional enrichment of sets of genes/proteins using the Gene Ontology, and makes the searching of data more productive by managing issues such as gene/protein name synonyms, identifier mapping, and data quality. OBJECTIVE: To undertake a coordinated annotation update of key public-domain resources to better support Alzheimer's disease research. METHODS: We have systematically identified target proteins critical to disease process, in part by accessing informed input from the clinical research community. RESULTS: Data from 954 papers have been added to the UniProtKB, Gene Ontology, and the International Molecular Exchange Consortium (IMEx) databases, with 299 human proteins and 279 orthologs updated in UniProtKB. 745 binary interactions were added to the IMEx human molecular interaction dataset. CONCLUSION: This represents a significant enhancement in the expert curated data pertinent to Alzheimer's disease available in a number of biomedical databases. Relevant protein entries have been updated in UniProtKB and concomitantly in the Gene Ontology. Molecular interaction networks have been significantly extended in the IMEx Consortium dataset and a set of reference protein complexes created. All the resources described are open-source and freely available to the research community and we provide examples of how these data could be exploited by researchers.
Affiliation(s)
- Lionel Breuza
  - Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva, Switzerland
- Cecilia N. Arighi
  - Protein Information Resource, Georgetown University Medical Center, Washington, DC, USA
  - Protein Information Resource, University of Delaware, Newark, DE, USA
- Ghislaine Argoud-Puy
  - Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva, Switzerland
- Cristina Casals-Casas
  - Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva, Switzerland
- Anne Estreicher
  - Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva, Switzerland
- Maria Livia Famiglietti
  - Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva, Switzerland
- George Georghiou
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Campus, Hinxton, Cambridge, UK
- Arnaud Gos
  - Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva, Switzerland
- Nadine Gruaz-Gumowski
  - Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva, Switzerland
- Ursula Hinz
  - Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva, Switzerland
- Nevila Hyka-Nouspikel
  - Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva, Switzerland
- Barbara Kramarz
  - Functional Gene Annotation, Preclinical and Fundamental Science, Institute of Cardiovascular Science, University College London (UCL), London, UK
- Ruth C. Lovering
  - Functional Gene Annotation, Preclinical and Fundamental Science, Institute of Cardiovascular Science, University College London (UCL), London, UK
- Yvonne Lussi
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Campus, Hinxton, Cambridge, UK
- Michele Magrane
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Campus, Hinxton, Cambridge, UK
- Patrick Masson
  - Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva, Switzerland
- Livia Perfetto
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Campus, Hinxton, Cambridge, UK
- Sylvain Poux
  - Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva, Switzerland
- Milagros Rodriguez-Lopez
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Campus, Hinxton, Cambridge, UK
- Christian Stoeckert
  - Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Shyamala Sundaram
  - Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva, Switzerland
- Li-San Wang
  - Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Sandra Orchard
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Campus, Hinxton, Cambridge, UK
- IMEx Consortium, UniProt Consortium
  - Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva, Switzerland
  - Protein Information Resource, Georgetown University Medical Center, Washington, DC, USA
  - Protein Information Resource, University of Delaware, Newark, DE, USA
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Campus, Hinxton, Cambridge, UK
  - Functional Gene Annotation, Preclinical and Fundamental Science, Institute of Cardiovascular Science, University College London (UCL), London, UK
  - Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
  - Alzforum, Cambridge, MA, USA
48
Davis AP, Wiegers J, Wiegers TC, Mattingly CJ. Public data sources to support systems toxicology applications. Curr Opin Toxicol 2019; 16:17-24. [PMID: 33604492 PMCID: PMC7889036 DOI: 10.1016/j.cotox.2019.03.002] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 02/03/2023]
Abstract
Public databases provide a wealth of freely available information about chemicals, genes, proteins, biological networks, phenotypes, diseases, and exposure science that can be integrated to construct pathways for systems toxicology applications. Relating this disparate information from public repositories, however, can be challenging since databases use a variety of ways to represent, describe, and make available their content. The use of standard vocabularies to annotate key data concepts, however, allows the information to be more easily exchanged and combined for discovery of new findings. We explore some of the many public data sources currently available to support systems toxicology, and demonstrate the value of standardizing data to help construct chemical-induced outcome pathways.
Affiliation(s)
- Allan Peter Davis
  - Department of Biological Sciences, North Carolina State University, Raleigh, North Carolina 27695, United States
- Jolene Wiegers
  - Department of Biological Sciences, North Carolina State University, Raleigh, North Carolina 27695, United States
- Thomas C Wiegers
  - Department of Biological Sciences, North Carolina State University, Raleigh, North Carolina 27695, United States
- Carolyn J Mattingly
  - Department of Biological Sciences, North Carolina State University, Raleigh, North Carolina 27695, United States
  - Center for Human Health and the Environment, North Carolina State University, Raleigh, North Carolina 27695, United States
49
Tang YA, Pichler K, Füllgrabe A, Lomax J, Malone J, Munoz-Torres MC, Vasant DV, Williams E, Haendel M. Ten quick tips for biocuration. PLoS Comput Biol 2019; 15:e1006906. [PMID: 31048830 PMCID: PMC6497217 DOI: 10.1371/journal.pcbi.1006906] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Indexed: 11/29/2022] Open
Affiliation(s)
- Y. Amy Tang
  - Genestack Limited, Cambridge, Cambridgeshire, United Kingdom
- Klemens Pichler
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridgeshire, United Kingdom
- Anja Füllgrabe
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridgeshire, United Kingdom
- Jane Lomax
  - SciBite Limited, BioData Innovation Centre, Hinxton, Cambridgeshire, United Kingdom
- James Malone
  - SciBite Limited, BioData Innovation Centre, Hinxton, Cambridgeshire, United Kingdom
- Drashtti V. Vasant
  - Bayer Business Services GmbH, BP Research and Development, Translational Sciences, Berlin, Germany
- Eleanor Williams
  - Centre for Gene Regulation and Expression, School of Life Sciences, University of Dundee, Dundee, United Kingdom
  - Genomics England, Queen Mary University of London, London, United Kingdom
- Melissa Haendel
  - Linus Pauling Institute, Oregon State University, Corvallis, Oregon, United States of America
50
Thompson R, Abicht A, Beeson D, Engel AG, Eymard B, Maxime E, Lochmüller H. A nomenclature and classification for the congenital myasthenic syndromes: preparing for FAIR data in the genomic era. Orphanet J Rare Dis 2018; 13:211. [PMID: 30477555 PMCID: PMC6260762 DOI: 10.1186/s13023-018-0955-7] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 08/29/2018] [Accepted: 11/14/2018] [Indexed: 12/30/2022] Open
Abstract
BACKGROUND: Congenital myasthenic syndromes (CMS) are a heterogeneous group of inherited neuromuscular disorders sharing the common feature of fatigable weakness due to defective neuromuscular transmission. Despite rapidly increasing knowledge about the genetic origins, specific features and potential treatments for the known CMS entities, the lack of standardized classification at the most granular level has hindered the implementation of computer-based systems for knowledge capture and reuse. Where individual clinical or genetic entities do not exist in disease coding systems, they are often invisible in clinical records and inadequately annotated in information systems, and features that apply to one disease but not another cannot be adequately differentiated. RESULTS: We created a detailed classification of all CMS disease entities suitable for use in clinical and genetic databases and decision support systems. To avoid conflict with existing coding systems as well as with expert-defined group-level classifications, we developed a collaboration with the Orphanet nomenclature for rare diseases, creating a clinically understandable name for each entity and placing it within a logical hierarchy that paves the way towards computer-aided clinical systems and improved knowledge bases for CMS that can adequately differentiate between types and ascribe relevant expert knowledge to each. CONCLUSIONS: We suggest that data science approaches can be used effectively in the clinical domain in a way that does not disrupt preexisting expert classification and that enhances the utility of existing coding systems. Our classification provides a comprehensive view of the individual CMS entities in a manner that supports differential diagnosis and understanding of the range and heterogeneity of the disease but that also enables robust computational coding and hierarchy for machine-readability. It can be extended as required in the light of future scientific advances, but already provides the starting point for the creation of FAIR (Findable, Accessible, Interoperable and Reusable) knowledge bases of data on the congenital myasthenic syndromes.
Affiliation(s)
- Rachel Thompson
  - Institute of Genetic Medicine, Newcastle University, Newcastle upon Tyne, UK
- David Beeson
  - Nuffield Department of Clinical Neurosciences, University of Oxford, Oxford, OX3 9DU UK
- Emmanuel Maxime
  - INSERM US14 - Orphanet, Plateforme Maladies Rares, 75014 Paris, France
- Hanns Lochmüller
  - Children’s Hospital of Eastern Ontario (CHEO) Research Institute, University of Ottawa, Ottawa, ON K1H 8L1 Canada
  - Department of Neuropediatrics and Muscle Disorders, Medical Center – University of Freiburg, Faculty of Medicine, Freiburg, Germany
  - Centro Nacional de Análisis Genómico (CNAG-CRG), Center for Genomic Regulation, Barcelona Institute of Science and Technology (BIST), Barcelona, Spain