1
Rosado E, Garcia-Remesal M, Paraiso-Medina S, Pazos A, Maojo V. Using Machine Learning to Collect and Facilitate Remote Access to Biomedical Databases: Development of the Biomedical Database Inventory. JMIR Med Inform 2021; 9:e22976. [PMID: 33629960 PMCID: PMC7952234 DOI: 10.2196/22976] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2020] [Revised: 12/30/2020] [Accepted: 01/16/2021] [Indexed: 11/13/2022]
Abstract
Background: Currently, existing biomedical literature repositories do not commonly provide users with specific means to locate and remotely access biomedical databases.
Objective: To address this issue, we developed the Biomedical Database Inventory (BiDI), a repository linking to biomedical databases automatically extracted from the scientific literature. BiDI provides an index of data resources and a path to access them seamlessly.
Methods: We designed an ensemble of deep learning methods to extract database mentions. To train the system, we annotated a set of 1242 articles that included mentions of database publications. Such a data set was used along with transfer learning techniques to train an ensemble of deep learning natural language processing models targeted at database publication detection.
Results: The system obtained an F1 score of 0.929 on database detection, showing high precision and recall values. When applying this model to the PubMed and PubMed Central databases, we identified over 10,000 unique databases. The ensemble model also extracted the weblinks to the reported databases and discarded irrelevant links. For the extraction of weblinks, the model achieved a cross-validated F1 score of 0.908. We show two use cases: one related to “omics” and the other related to the COVID-19 pandemic.
Conclusions: BiDI enables access to biomedical resources over the internet and facilitates data-driven research and other scientific initiatives. The repository is openly available online and will be regularly updated with an automatic text processing pipeline. The approach can be reused to create repositories of different types (ie, biomedical and others).
Affiliation(s)
- Eduardo Rosado
- Biomedical Informatics Group, School of Computer Science, Universidad Politecnica de Madrid, Madrid, Spain
- Miguel Garcia-Remesal
- Biomedical Informatics Group, School of Computer Science, Universidad Politecnica de Madrid, Madrid, Spain
- Sergio Paraiso-Medina
- Biomedical Informatics Group, School of Computer Science, Universidad Politecnica de Madrid, Madrid, Spain
- Alejandro Pazos
- Grupo de Redes de Neuronas Artificiales y Sistemas Adaptativos - Imagen Médica y Diagnóstico Radiológico, Department of Computer Science and Information Technologies, Faculty of Computer Science, University of A Coruña, A Coruña, Spain
- Victor Maojo
- Biomedical Informatics Group, School of Computer Science, Universidad Politecnica de Madrid, Madrid, Spain
2
A Survey of Bioinformatics Database and Software Usage through Mining the Literature. PLoS One 2016; 11:e0157989. [PMID: 27331905 PMCID: PMC4917176 DOI: 10.1371/journal.pone.0157989] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Received: 10/13/2015] [Accepted: 06/08/2016] [Indexed: 11/19/2022]
Abstract
Computer-based resources are central to much, if not most, biological and medical research. However, while there is an ever-expanding choice of bioinformatics resources to use, described within the biomedical literature, little work to date has provided an evaluation of the full range of availability or levels of usage of database and software resources. Here we use text mining to process the PubMed Central full-text corpus, identifying mentions of databases or software within the scientific literature. We provide an audit of the resources contained within the biomedical literature, and a comparison of their relative usage, both over time and between the sub-disciplines of bioinformatics, biology and medicine. We find that trends in resource usage differ between these domains. The bioinformatics literature emphasises novel resource development, while database and software usage within biology and medicine is more stable and conservative. Many resources are only mentioned in the bioinformatics literature, with a relatively small number making it out into general biology, and fewer still into the medical literature. In addition, many resources are seeing a steady decline in their usage (e.g., BLAST, SWISS-PROT), though some are instead seeing rapid growth (e.g., the GO, R). We find a striking imbalance in resource usage, with the top 5% of resource names (133 names) accounting for 47% of total usage, and over 70% of resources extracted being mentioned only once each. While these results highlight the dynamic and creative nature of bioinformatics research, they raise questions about software reuse, choice and the sharing of bioinformatics practice. Is it acceptable that so many resources are apparently never reused? Finally, our work is a step towards automated extraction of scientific method from text. We make the dataset generated by our study available under the CC0 license here: http://dx.doi.org/10.6084/m9.figshare.1281371.
3
Smedley D, Schofield P, Chen CK, Aidinis V, Ainali C, Bard J, Balling R, Birney E, Blake A, Bongcam-Rudloff E, Brookes AJ, Cesareni G, Chandras C, Eppig J, Flicek P, Gkoutos G, Greenaway S, Gruenberger M, Hériché JK, Lyall A, Mallon AM, Muddyman D, Reisinger F, Ringwald M, Rosenthal N, Schughart K, Swertz M, Thorisson GA, Zouberakis M, Hancock JM. Finding and sharing: new approaches to registries of databases and services for the biomedical sciences. Database (Oxford) 2010; 2010:baq014. [PMID: 20627863 PMCID: PMC2911849 DOI: 10.1093/database/baq014] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Received: 04/26/2010] [Accepted: 06/20/2010] [Indexed: 11/14/2022]
Abstract
The recent explosion of biological data and the concomitant proliferation of distributed databases make it challenging for biologists and bioinformaticians to discover the best data resources for their needs, and the most efficient way to access and use them. Despite a rapid acceleration in uptake of syntactic and semantic standards for interoperability, it is still difficult for users to find which databases support the standards and interfaces that they need. To solve these problems, several groups are developing registries of databases that capture key metadata describing the biological scope, utility, accessibility, ease-of-use and existence of web services allowing interoperability between resources. Here, we describe some of these initiatives, including a novel formalism, the Database Description Framework, for describing database operations and functionality and encouraging good database practice. We expect such approaches will result in improved discovery, uptake and utilization of data resources. Database URL: http://www.casimir.org.uk/casimir_ddf.
Affiliation(s)
- Damian Smedley
- European Bioinformatics Institute, Genome Campus, Hinxton, Cambridgeshire, CB10 1SA.
4
Triplet T, Shortridge MD, Griep MA, Stark JL, Powers R, Revesz P. PROFESS: a PROtein function, evolution, structure and sequence database. Database (Oxford) 2010; 2010:baq011. [PMID: 20624718 PMCID: PMC2911846 DOI: 10.1093/database/baq011] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 12/02/2009] [Revised: 06/03/2010] [Accepted: 06/06/2010] [Indexed: 11/13/2022]
Abstract
The proliferation of biological databases and the easy access enabled by the Internet are having a beneficial impact on biological sciences and transforming the way research is conducted. There are approximately 1100 molecular biology databases dispersed throughout the Internet. To assist in the functional, structural and evolutionary analysis of the abundant number of novel proteins continually identified from whole-genome sequencing, we introduce the PROFESS (PROtein Function, Evolution, Structure and Sequence) database. Our database is designed to be versatile and expandable and will not confine analysis to a pre-existing set of data relationships. A fundamental component of this approach is the development of an intuitive query system that incorporates a variety of similarity functions capable of generating data relationships not conceived during the creation of the database. The utility of PROFESS is demonstrated by the analysis of the structural drift of homologous proteins and the identification of potential pancreatic cancer therapeutic targets based on the observation of protein-protein interaction networks. Database URL: http://cse.unl.edu/~profess/
Affiliation(s)
- Thomas Triplet
- Department of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, NE 68588-0115 and Department of Chemistry, University of Nebraska-Lincoln, Lincoln NE 68588-0304, USA
- Matthew D. Shortridge
- Department of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, NE 68588-0115 and Department of Chemistry, University of Nebraska-Lincoln, Lincoln NE 68588-0304, USA
- Mark A. Griep
- Department of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, NE 68588-0115 and Department of Chemistry, University of Nebraska-Lincoln, Lincoln NE 68588-0304, USA
- Jaime L. Stark
- Department of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, NE 68588-0115 and Department of Chemistry, University of Nebraska-Lincoln, Lincoln NE 68588-0304, USA
- Robert Powers
- Department of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, NE 68588-0115 and Department of Chemistry, University of Nebraska-Lincoln, Lincoln NE 68588-0304, USA
- Peter Revesz
- Department of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, NE 68588-0115 and Department of Chemistry, University of Nebraska-Lincoln, Lincoln NE 68588-0304, USA
5
Affiliation(s)
- Curtis Huttenhower
- Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts, United States of America.
6
de la Calle G, García-Remesal M, Chiesa S, de la Iglesia D, Maojo V. BIRI: a new approach for automatically discovering and indexing available public bioinformatics resources from the literature. BMC Bioinformatics 2009; 10:320. [PMID: 19811635 PMCID: PMC2765974 DOI: 10.1186/1471-2105-10-320] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.5] [Received: 02/12/2009] [Accepted: 10/07/2009] [Indexed: 11/10/2022]
Abstract
BACKGROUND: The rapid evolution of Internet technologies and the collaborative approaches that dominate the field have stimulated the development of numerous bioinformatics resources. To address this new framework, several initiatives have tried to organize these services and resources. In this paper, we present the BioInformatics Resource Inventory (BIRI), a new approach for automatically discovering and indexing available public bioinformatics resources using information extracted from the scientific literature. The index generated can be automatically updated by adding additional manuscripts describing new resources. We have developed web services and applications to test and validate our approach. It has not been designed to replace current indexes but to extend their capabilities with richer functionalities.
RESULTS: We developed a web service to provide a set of high-level query primitives to access the index. The web service can be used by third-party web services or web-based applications. To test the web service, we created a pilot web application to access a preliminary knowledge base of resources. We tested our tool using an initial set of 400 abstracts. Almost 90% of the resources described in the abstracts were correctly classified. More than 500 descriptions of functionalities were extracted.
CONCLUSION: These experiments suggest the feasibility of our approach for automatically discovering and indexing current and future bioinformatics resources. Given the domain-independent characteristics of this tool, it is currently being applied by the authors in other areas, such as medical nanoinformatics. BIRI is available at http://edelman.dia.fi.upm.es/biri/.
Affiliation(s)
- Guillermo de la Calle
- Dept Inteligencia Artificial, Facultad de Informática, Universidad Politécnica de Madrid, Campus de Montegancedo S/N, 28660 Boadilla del Monte, Madrid, Spain
- Miguel García-Remesal
- Dept Inteligencia Artificial, Facultad de Informática, Universidad Politécnica de Madrid, Campus de Montegancedo S/N, 28660 Boadilla del Monte, Madrid, Spain
- Stefano Chiesa
- Dept Inteligencia Artificial, Facultad de Informática, Universidad Politécnica de Madrid, Campus de Montegancedo S/N, 28660 Boadilla del Monte, Madrid, Spain
- Diana de la Iglesia
- Dept Inteligencia Artificial, Facultad de Informática, Universidad Politécnica de Madrid, Campus de Montegancedo S/N, 28660 Boadilla del Monte, Madrid, Spain
- Victor Maojo
- Dept Inteligencia Artificial, Facultad de Informática, Universidad Politécnica de Madrid, Campus de Montegancedo S/N, 28660 Boadilla del Monte, Madrid, Spain