1
Tsueng G, Mullen JL, Alkuzweny M, Cano M, Rush B, Haag E, Lin J, Welzel DJ, Zhou X, Qian Z, Latif AA, Hufbauer E, Zeller M, Andersen KG, Wu C, Su AI, Gangavarapu K, Hughes LD. Outbreak.info Research Library: a standardized, searchable platform to discover and explore COVID-19 resources. Nat Methods 2023; 20:536-540. [PMID: 36823331 PMCID: PMC10393269 DOI: 10.1038/s41592-023-01770-w] [Citation(s) in RCA: 14] [Impact Index Per Article: 14.0] [Received: 06/03/2022] [Accepted: 01/17/2023] [Indexed: 02/25/2023]
Abstract
Outbreak.info Research Library is a standardized, searchable interface of coronavirus disease 2019 (COVID-19) and severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) publications, clinical trials, datasets, protocols and other resources, built with a reusable framework. We developed a rigorous schema to enforce consistency across different sources and resource types and linked related resources. Researchers can quickly search the latest research across data repositories, regardless of resource type or repository location, via a search interface, public application programming interface (API) and R package.
Affiliation(s)
- Ginger Tsueng
  - Department of Integrative, Structural and Computational Biology, the Scripps Research Institute, La Jolla, CA, USA
- Julia L Mullen
  - Department of Integrative, Structural and Computational Biology, the Scripps Research Institute, La Jolla, CA, USA
- Manar Alkuzweny
  - Department of Biological Sciences, University of Notre Dame, Notre Dame, IN, USA
  - Department of Immunology and Microbiology, the Scripps Research Institute, La Jolla, CA, USA
- Marco Cano
  - Department of Integrative, Structural and Computational Biology, the Scripps Research Institute, La Jolla, CA, USA
- Emily Haag
  - Department of Integrative, Structural and Computational Biology, the Scripps Research Institute, La Jolla, CA, USA
- Jason Lin
  - Department of Integrative, Structural and Computational Biology, the Scripps Research Institute, La Jolla, CA, USA
- Dylan J Welzel
  - Department of Integrative, Structural and Computational Biology, the Scripps Research Institute, La Jolla, CA, USA
- Xinghua Zhou
  - Department of Integrative, Structural and Computational Biology, the Scripps Research Institute, La Jolla, CA, USA
- Zhongchao Qian
  - Department of Integrative, Structural and Computational Biology, the Scripps Research Institute, La Jolla, CA, USA
- Alaa Abdel Latif
  - Department of Immunology and Microbiology, the Scripps Research Institute, La Jolla, CA, USA
- Emory Hufbauer
  - Department of Immunology and Microbiology, the Scripps Research Institute, La Jolla, CA, USA
- Mark Zeller
  - Department of Immunology and Microbiology, the Scripps Research Institute, La Jolla, CA, USA
- Kristian G Andersen
  - Department of Immunology and Microbiology, the Scripps Research Institute, La Jolla, CA, USA
  - Scripps Research Translational Institute, La Jolla, CA, USA
- Chunlei Wu
  - Department of Integrative, Structural and Computational Biology, the Scripps Research Institute, La Jolla, CA, USA
  - Scripps Research Translational Institute, La Jolla, CA, USA
  - Department of Molecular Medicine, the Scripps Research Institute, La Jolla, CA, USA
- Andrew I Su
  - Department of Integrative, Structural and Computational Biology, the Scripps Research Institute, La Jolla, CA, USA
  - Scripps Research Translational Institute, La Jolla, CA, USA
  - Department of Molecular Medicine, the Scripps Research Institute, La Jolla, CA, USA
- Karthik Gangavarapu
  - Department of Immunology and Microbiology, the Scripps Research Institute, La Jolla, CA, USA
  - Department of Human Genetics, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA, USA
- Laura D Hughes
  - Department of Integrative, Structural and Computational Biology, the Scripps Research Institute, La Jolla, CA, USA
2
Tsueng G, Mullen JL, Alkuzweny M, Cano M, Rush B, Haag E, Curators O, Lin J, Welzel DJ, Zhou X, Qian Z, Latif AA, Hufbauer E, Zeller M, Andersen KG, Wu C, Su AI, Gangavarapu K, Hughes LD. Outbreak.info Research Library: A standardized, searchable platform to discover and explore COVID-19 resources. bioRxiv 2022:2022.01.20.477133. [PMID: 35132411 PMCID: PMC8820656 DOI: 10.1101/2022.01.20.477133] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Indexed: 05/17/2023]
Abstract
To combat the ongoing COVID-19 pandemic, scientists have been conducting research at breakneck speeds, producing over 52,000 peer-reviewed articles within the first year. To address the challenge in tracking the vast amount of new research located in separate repositories, we developed outbreak.info Research Library, a standardized, searchable interface of COVID-19 and SARS-CoV-2 resources. Unifying metadata from sixteen repositories, we assembled a collection of over 350,000 publications, clinical trials, datasets, protocols, and other resources as of October 2022. We used a rigorous schema to enforce consistency across different sources and resource types and linked related resources. Researchers can quickly search the latest research across data repositories, regardless of resource type or repository location, via a search interface, public API, and R package. Finally, we discuss the challenges inherent in combining metadata from scattered and heterogeneous resources and provide recommendations to streamline this process to aid scientific research.
Affiliation(s)
- Ginger Tsueng
  - Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA 92037, USA
- Julia L. Mullen
  - Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA 92037, USA
- Manar Alkuzweny
  - Department of Biological Sciences, University of Notre Dame, Notre Dame, IN 46556, USA
- Marco Cano
  - Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA 92037, USA
- Emily Haag
  - Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA 92037, USA
- Jason Lin
  - Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA 92037, USA
- Dylan J. Welzel
  - Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA 92037, USA
- Xinghua Zhou
  - Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA 92037, USA
- Zhongchao Qian
  - Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA 92037, USA
- Alaa Abdel Latif
  - Department of Immunology and Microbiology, The Scripps Research Institute, La Jolla, CA 92037, USA
- Emory Hufbauer
  - Department of Immunology and Microbiology, The Scripps Research Institute, La Jolla, CA 92037, USA
- Mark Zeller
  - Department of Immunology and Microbiology, The Scripps Research Institute, La Jolla, CA 92037, USA
- Kristian G. Andersen
  - Department of Immunology and Microbiology, The Scripps Research Institute, La Jolla, CA 92037, USA
  - Scripps Research Translational Institute, La Jolla, CA 92037, USA
- Chunlei Wu
  - Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA 92037, USA
  - Scripps Research Translational Institute, La Jolla, CA 92037, USA
  - Department of Molecular Medicine, The Scripps Research Institute, La Jolla, CA 92037, USA
- Andrew I. Su
  - Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA 92037, USA
  - Scripps Research Translational Institute, La Jolla, CA 92037, USA
  - Department of Molecular Medicine, The Scripps Research Institute, La Jolla, CA 92037, USA
- Karthik Gangavarapu
  - Department of Human Genetics, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA 90095, USA
- Laura D. Hughes
  - Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA 92037, USA
3
Sousa D, Lamurias A, Couto FM. A hybrid approach toward biomedical relation extraction training corpora: combining distant supervision with crowdsourcing. Database (Oxford) 2020; 2020:baaa104. [PMID: 33258966 PMCID: PMC7706181 DOI: 10.1093/database/baaa104] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 06/17/2020] [Revised: 09/02/2020] [Accepted: 11/12/2020] [Indexed: 12/14/2022]
Abstract
Biomedical relation extraction (RE) datasets are vital in the construction of knowledge bases and to potentiate the discovery of new interactions. There are several ways to create biomedical RE datasets, some more reliable than others, such as resorting to domain expert annotations. However, the emerging use of crowdsourcing platforms, such as Amazon Mechanical Turk (MTurk), can potentially reduce the cost of RE dataset construction, even if the same level of quality cannot be guaranteed. There is a lack of power of the researcher to control who, how and in what context workers engage in crowdsourcing platforms. Hence, allying distant supervision with crowdsourcing can be a more reliable alternative. The crowdsourcing workers would be asked only to rectify or discard already existing annotations, which would make the process less dependent on their ability to interpret complex biomedical sentences. In this work, we use a previously created distantly supervised human phenotype-gene relations (PGR) dataset to perform crowdsourcing validation. We divided the original dataset into two annotation tasks: Task 1, 70% of the dataset annotated by one worker, and Task 2, 30% of the dataset annotated by seven workers. Also, for Task 2, we added an extra rater on-site and a domain expert to further assess the crowdsourcing validation quality. Here, we describe a detailed pipeline for RE crowdsourcing validation, creating a new release of the PGR dataset with partial domain expert revision, and assess the quality of the MTurk platform. We applied the new dataset to two state-of-the-art deep learning systems (BiOnt and BioBERT) and compared its performance with the original PGR dataset, as well as combinations between the two, achieving a 0.3494 increase in average F-measure. The code supporting our work and the new release of the PGR dataset is available at https://github.com/lasigeBioTM/PGR-crowd.
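The two-task design described in this abstract can be illustrated with a minimal sketch. It assumes a random 70/30 split and a simple majority vote over the seven Task 2 workers; the abstract does not state how the dataset was partitioned or how votes were aggregated, so both choices, along with the function names `split_tasks` and `majority_label`, are hypothetical and are not taken from the PGR-crowd code.

```python
import random
from collections import Counter


def split_tasks(relations, task1_fraction=0.7, seed=0):
    """Shuffle the relations and split them into Task 1 and Task 2.

    Assumption: a random split; the abstract only gives the 70%/30% proportions.
    """
    shuffled = list(relations)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * task1_fraction)
    return shuffled[:cut], shuffled[cut:]


def majority_label(votes):
    """Return the most frequent label among worker votes.

    Assumption: majority voting; the abstract does not name the aggregation rule.
    """
    label, _count = Counter(votes).most_common(1)[0]
    return label


# Toy data: ten placeholder relations and seven worker votes for one of them.
relations = [f"relation_{i}" for i in range(10)]
task1, task2 = split_tasks(relations)
seven_votes = ["true", "true", "false", "true", "false", "true", "true"]

print(len(task1), len(task2))
print(majority_label(seven_votes))
```

With an odd number of workers and two labels, a majority vote cannot tie, which is one plausible reason to use seven raters; this is an observation about the sketch, not a claim about the authors' rationale.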
Affiliation(s)
- Diana Sousa
  - LASIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, Lisboa 1749-016, Portugal
- Andre Lamurias
  - LASIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, Lisboa 1749-016, Portugal
- Francisco M Couto
  - LASIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, Lisboa 1749-016, Portugal
4
Antonazzo G, Urbano JM, Marygold SJ, Millburn GH, Brown NH. Building a pipeline to solicit expert knowledge from the community to aid gene summary curation. Database (Oxford) 2020; 2020:baz152. [PMID: 31960022 PMCID: PMC6971343 DOI: 10.1093/database/baz152] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 08/23/2019] [Revised: 10/29/2019] [Accepted: 12/12/2019] [Indexed: 11/25/2022]
Abstract
Brief summaries describing the function of each gene's product(s) are of great value to the research community, especially when interpreting genome-wide studies that reveal changes to hundreds of genes. However, manually writing such summaries, even for a single species, is a daunting task; for example, the Drosophila melanogaster genome contains almost 14 000 protein-coding genes. One solution is to use computational methods to generate summaries, but this often fails to capture the key functions or express them eloquently. Here, we describe how we solicited help from the research community to generate manually written summaries of D. melanogaster gene function. Based on the data within the FlyBase database, we developed a computational pipeline to identify researchers who have worked extensively on each gene. We e-mailed these researchers to ask them to draft a brief summary of the main function(s) of the gene's product, which we edited for consistency to produce a 'gene snapshot'. This approach yielded 1800 gene snapshot submissions within a 3-month period. We discuss the general utility of this strategy for other databases that capture data from the research literature. Database URL: https://flybase.org/.
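The idea of identifying researchers who have worked extensively on each gene can be sketched as a per-gene author ranking. This is only an illustration under stated assumptions: the abstract does not give the criterion the authors used, so ranking by number of associated papers, the input format of (gene, author) pairs, and the function name `top_researchers` are all hypothetical and do not reflect the FlyBase pipeline itself.

```python
from collections import Counter, defaultdict


def top_researchers(gene_author_pairs, top_n=1):
    """Rank authors per gene by how many papers link them to that gene.

    Assumption: each (gene, author) pair stands for one paper by that
    author about that gene; paper count is a stand-in for "extensive" work.
    """
    counts = defaultdict(Counter)
    for gene, author in gene_author_pairs:
        counts[gene][author] += 1
    return {
        gene: [author for author, _n in per_gene.most_common(top_n)]
        for gene, per_gene in counts.items()
    }


# Toy records with placeholder gene and author names.
pairs = [
    ("gene_A", "Author 1"),
    ("gene_A", "Author 1"),
    ("gene_A", "Author 1"),
    ("gene_A", "Author 2"),
    ("gene_B", "Author 3"),
    ("gene_B", "Author 3"),
]

contacts = top_researchers(pairs)
print(contacts)
```

The resulting mapping could then drive the e-mail requests the abstract describes, with one short list of candidate experts per gene.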
Affiliation(s)
- Giulia Antonazzo
  - Department of Physiology, Development and Neuroscience, University of Cambridge, Downing Street, Cambridge, CB2 3DY, UK
- Jose M Urbano
  - Department of Physiology, Development and Neuroscience, University of Cambridge, Downing Street, Cambridge, CB2 3DY, UK
- Steven J Marygold
  - Department of Physiology, Development and Neuroscience, University of Cambridge, Downing Street, Cambridge, CB2 3DY, UK
- Gillian H Millburn
  - Department of Physiology, Development and Neuroscience, University of Cambridge, Downing Street, Cambridge, CB2 3DY, UK
- Nicholas H Brown
  - Department of Physiology, Development and Neuroscience, University of Cambridge, Downing Street, Cambridge, CB2 3DY, UK