1
Turki H, Dossou BFP, Emezue CC, Owodunni AT, Hadj Taieb MA, Ben Aouicha M, Ben Hassen H, Masmoudi A. MeSH2Matrix: combining MeSH keywords and machine learning for biomedical relation classification based on PubMed. J Biomed Semantics 2024; 15:18. [PMID: 39354632 PMCID: PMC11445994 DOI: 10.1186/s13326-024-00319-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2023] [Accepted: 08/31/2024] [Indexed: 10/03/2024] Open
Abstract
Biomedical relation classification has been significantly improved by the application of advanced machine learning techniques to the raw text of scholarly publications. Despite this improvement, the reliance on large chunks of raw text makes these algorithms suffer in terms of generalization, precision, and reliability. The distinctive characteristics of bibliographic metadata can prove effective in achieving better performance for this challenging task. In this research paper, we introduce an approach for biomedical relation classification using the qualifiers of co-occurring Medical Subject Headings (MeSH). First, we introduce MeSH2Matrix, our dataset consisting of 46,469 biomedical relations curated from PubMed publications using our approach. Our dataset includes a matrix that maps associations between the qualifiers of subject MeSH keywords and those of object MeSH keywords. It also specifies the corresponding Wikidata relation type and the superclass of semantic relations for each relation. Using MeSH2Matrix, we build and train three machine learning models (Support Vector Machine [SVM], a dense model [D-Model], and a convolutional neural network [C-Net]) to evaluate the efficiency of our approach for biomedical relation classification. Our best model achieves an accuracy of 70.78% for 195 classes and 83.09% for five superclasses. Finally, we provide confusion matrix and extensive feature analyses to better examine the relationship between the MeSH qualifiers and the biomedical relations being classified. Our results will hopefully shed light on developing better algorithms for biomedical ontology classification based on the MeSH keywords of PubMed publications. For reproducibility purposes, MeSH2Matrix, as well as all our source code, is made publicly accessible at https://github.com/SisonkeBiotik-Africa/MeSH2Matrix.
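The qualifier matrix described in this abstract can be pictured with a small sketch. The Python below is illustrative only and is not the authors' code: the function name `qualifier_matrix`, the three-qualifier vocabulary, and the toy input are assumptions, and the actual construction in the MeSH2Matrix repository may differ (for example, in how counts are normalized).

```python
from itertools import product


def qualifier_matrix(publications, qualifiers):
    """Count subject-qualifier / object-qualifier co-occurrences.

    publications: iterable of (subject_qualifiers, object_qualifiers) pairs,
                  one pair per publication mentioning both MeSH keywords.
    qualifiers:   ordered vocabulary; row = subject qualifier,
                  column = object qualifier.
    """
    index = {q: i for i, q in enumerate(qualifiers)}
    n = len(qualifiers)
    matrix = [[0] * n for _ in range(n)]
    for subj_quals, obj_quals in publications:
        for s, o in product(subj_quals, obj_quals):
            # Skip qualifiers outside the chosen vocabulary.
            if s in index and o in index:
                matrix[index[s]][index[o]] += 1
    return matrix


# Hypothetical toy vocabulary and data, chosen only for illustration.
QUALIFIERS = ["drug therapy", "therapeutic use", "adverse effects"]
pubs = [
    (["drug therapy"], ["therapeutic use"]),
    (["drug therapy"], ["therapeutic use", "adverse effects"]),
]
m = qualifier_matrix(pubs, QUALIFIERS)
```

A matrix of this kind, flattened or kept two-dimensional, is the sort of input that a classifier such as an SVM or a convolutional network could consume for relation classification.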
Affiliation(s)
- Houcemeddine Turki
  - Data Engineering and Semantics Research Unit, Faculty of Sciences of Sfax, University of Sfax, Sfax, Tunisia
- Chris Chinenye Emezue
  - Mila Quebec AI Institute, Montreal, Canada
  - Technical University of Munich, Munich, Germany
- Mohamed Ali Hadj Taieb
  - Data Engineering and Semantics Research Unit, Faculty of Sciences of Sfax, University of Sfax, Sfax, Tunisia
- Mohamed Ben Aouicha
  - Data Engineering and Semantics Research Unit, Faculty of Sciences of Sfax, University of Sfax, Sfax, Tunisia
- Hanen Ben Hassen
  - Laboratory of Probability and Statistics, Faculty of Sciences of Sfax, University of Sfax, Sfax, Tunisia
- Afif Masmoudi
  - Laboratory of Probability and Statistics, Faculty of Sciences of Sfax, University of Sfax, Sfax, Tunisia
2
Menotti L, Silvello G, Atzori M, Boytcheva S, Ciompi F, Di Nunzio GM, Fraggetta F, Giachelle F, Irrera O, Marchesin S, Marini N, Müller H, Primov T. Modelling digital health data: The ExaMode ontology for computational pathology. J Pathol Inform 2023; 14:100332. [PMID: 37705689 PMCID: PMC10495665 DOI: 10.1016/j.jpi.2023.100332] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/09/2023] [Revised: 07/14/2023] [Accepted: 08/16/2023] [Indexed: 09/15/2023] Open
Abstract
Computational pathology can significantly benefit from ontologies to standardize the employed nomenclature and help with knowledge extraction processes for high-quality annotated image datasets. The end goal is to reach a shared model for digital pathology to overcome data variability and integration problems. Indeed, data annotation in such a specific domain is still an unsolved challenge, and datasets cannot be readily reused in diverse contexts due to heterogeneity of the adopted labels, multilingualism, and different clinical practices. Material and methods: This paper presents the ExaMode ontology, modeling the histopathology process by considering 3 key cancer diseases (colon, cervical, and lung tumors) and celiac disease. The ExaMode ontology has been designed bottom-up in an iterative fashion with continuous feedback and validation from pathologists and clinicians. The ontology is organized into 5 semantic areas that define an ontological template to model any disease of interest in histopathology. Results: The ExaMode ontology is currently being used as a common semantic layer in: (i) an entity linking tool for the automatic annotation of medical records; (ii) a web-based collaborative annotation tool for histopathology text reports; and (iii) a software platform for building holistic solutions integrating multimodal histopathology data. Discussion: The ExaMode ontology is a key means to store data in a graph database according to the RDF data model. The creation of an RDF dataset can help develop more accurate algorithms for image analysis, especially in the field of digital pathology. This approach allows for seamless data integration and a unified query access point, from which we can extract relevant clinical insights about the considered diseases using SPARQL queries.
Affiliation(s)
- Laura Menotti
  - Department of Information Engineering, University of Padua, Padova, Italy
- Gianmaria Silvello
  - Department of Information Engineering, University of Padua, Padova, Italy
- Manfredo Atzori
  - Information Systems Institute, University of Applied Sciences Western Switzerland, Delémont, Switzerland
  - Department of Neuroscience, University of Padua, Padova, Italy
- Francesco Ciompi
  - Department of Pathology, Radboud University Medical Center, Nijmegen, The Netherlands
- Fabio Giachelle
  - Department of Information Engineering, University of Padua, Padova, Italy
- Ornella Irrera
  - Department of Information Engineering, University of Padua, Padova, Italy
- Stefano Marchesin
  - Department of Information Engineering, University of Padua, Padova, Italy
- Niccolò Marini
  - Information Systems Institute, University of Applied Sciences Western Switzerland, Delémont, Switzerland
- Henning Müller
  - Information Systems Institute, University of Applied Sciences Western Switzerland, Delémont, Switzerland
3
Boguslav MR, Salem NM, White EK, Sullivan KJ, Bada M, Hernandez TL, Leach SM, Hunter LE. Creating an ignorance-base: Exploring known unknowns in the scientific literature. J Biomed Inform 2023; 143:104405. [PMID: 37270143 PMCID: PMC10528083 DOI: 10.1016/j.jbi.2023.104405] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/08/2022] [Revised: 05/18/2023] [Accepted: 05/21/2023] [Indexed: 06/05/2023]
Abstract
BACKGROUND: Scientific discovery progresses by exploring new and uncharted territory. More specifically, it advances by a process of transforming unknown unknowns first into known unknowns, and then into knowns. Over the last few decades, researchers have developed many knowledge bases to capture and connect the knowns, which has enabled topic exploration and contextualization of experimental results. But recognizing the unknowns is also critical for finding the most pertinent questions and their answers. Prior work on known unknowns has sought to understand them, annotate them, and automate their identification. However, no knowledge-bases yet exist to capture these unknowns, and little work has focused on how scientists might use them to trace a given topic or experimental result in search of open questions and new avenues for exploration. We show here that a knowledge base of unknowns can be connected to ontologically grounded biomedical knowledge to accelerate research in the field of prenatal nutrition. RESULTS: We present the first ignorance-base, a knowledge-base created by combining classifiers to recognize ignorance statements (statements of missing or incomplete knowledge that imply a goal for knowledge) and biomedical concepts over the prenatal nutrition literature. This knowledge-base places biomedical concepts mentioned in the literature in context with the ignorance statements authors have made about them. Using our system, researchers interested in the topic of vitamin D and prenatal health were able to uncover three new avenues for exploration (immune system, respiratory system, and brain development) by searching for concepts enriched in ignorance statements. These were buried among the many standard enriched concepts. Additionally, we used the ignorance-base to enrich concepts connected to a gene list associated with vitamin D and spontaneous preterm birth and found an emerging topic of study (brain development) in an implied field (neuroscience). The researchers could look to the field of neuroscience for potential answers to the ignorance statements. CONCLUSION: Our goal is to help students, researchers, funders, and publishers better understand the state of our collective scientific ignorance (known unknowns) in order to help accelerate research through the continued illumination of and focus on the known unknowns and their respective goals for scientific knowledge.
Affiliation(s)
- Mayla R Boguslav
  - Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, 80045, CO, USA
- Nourah M Salem
  - Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, 80045, CO, USA
- Elizabeth K White
  - Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, 80045, CO, USA; Center for Genes, Environment and Health, National Jewish Health, Jackson Street, Denver, 80206, CO, USA
- Katherine J Sullivan
  - Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, 80045, CO, USA
- Michael Bada
  - Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, 80045, CO, USA
- Teri L Hernandez
  - College of Nursing, Department of Medicine/Division of Endocrinology, Metabolism, & Diabetes, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, 80045, CO, USA
- Sonia M Leach
  - Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, 80045, CO, USA; Center for Genes, Environment and Health, National Jewish Health, Jackson Street, Denver, 80206, CO, USA
- Lawrence E Hunter
  - Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, 80045, CO, USA
4
Lima VC, Rijo RPCL, Bernardi FA, Filho MEC, Barbosa-Junior F, Pellison FC, Galliez RM, Kritski AL, Alves D. REDbox: a comprehensive semantic framework for data collection and management in tuberculosis research. Sci Rep 2023; 13:7686. [PMID: 37169802 PMCID: PMC10173910 DOI: 10.1038/s41598-023-33492-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2021] [Accepted: 04/13/2023] [Indexed: 05/13/2023] Open
Abstract
Clinical research outcomes depend on the correct definition of the research protocol, the data collection strategy, and the data management plan. Furthermore, researchers often need to work within challenging contexts, as is the case in tuberculosis services, where human and technological resources for research may be scarce. Electronic Data Capture Systems mitigate such risks and enable a reliable environment to conduct health research and promote result dissemination and data reusability. The proposed solution is based on needs pinpointed by researchers, considering the need for an accommodating solution to conduct research in low-resource environments. The REDbox framework was developed to facilitate data collection, management, sharing, and availability in tuberculosis research and improve the user experience through user-friendly, web-based tools. REDbox combines elements of the REDCap and KoBoToolbox electronic data capture systems and semantics to deliver new valuable tools that meet the needs of tuberculosis researchers in Brazil. The framework was implemented in five cross-institutional, nationwide projects to evaluate the users' perceptions of the system's usefulness and the information and user experience. Seventeen responses (representing 40% of active users) to an anonymous survey distributed to active users indicated that REDbox was perceived to be helpful for the particular audience of researchers and health professionals. The relevance of this article lies in the innovative approach to supporting tuberculosis research by combining existing technologies and tailoring supporting features.
Affiliation(s)
- Vinícius Costa Lima
  - Ribeirão Preto Medical School, University of São Paulo, Ribeirão Preto, Brazil
- Rui Pedro Charters Lopes Rijo
  - Ribeirão Preto Medical School, University of São Paulo, Ribeirão Preto, Brazil
  - School of Technology and Management, Polytechnic Institute of Leiria, Leiria, Portugal
  - Institute for Systems Engineering and Computers at Coimbra, Coimbra, Portugal
  - Center for Research in Health Technologies and Services, Faculty of Medicine, University of Porto, Porto, Portugal
- Rafael Mello Galliez
  - Faculty of Medicine, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
- Domingos Alves
  - Department of Social Medicine, Ribeirão Preto Medical School, University of São Paulo, Ribeirão Preto, Brazil
5
Turki H, Rasberry L, Ali Hadj Taieb M, Mietchen D, Ben Aouicha M, Pouris A, Bousrih Y. Letter to the Editor: FHIR RDF - Why the world needs structured electronic health records. J Biomed Inform 2022; 136:104253. [DOI: 10.1016/j.jbi.2022.104253] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/12/2022] [Revised: 11/15/2022] [Accepted: 11/16/2022] [Indexed: 11/21/2022]
6
Timón-Reina S, Rincón M, Martínez-Tomás R. An overview of graph databases and their applications in the biomedical domain. Database (Oxford) 2021; 2021:baab026. [PMID: 34003247 PMCID: PMC8130509 DOI: 10.1093/database/baab026] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 11/28/2020] [Revised: 03/24/2021] [Accepted: 04/30/2021] [Indexed: 01/18/2023]
Abstract
Over the past couple of decades, the explosion of densely interconnected data has stimulated the research, development and adoption of graph database technologies. From early graph models to more recent native graph databases, the landscape of implementations has evolved to cover enterprise-ready requirements. Because of the interconnected nature of its data, the biomedical domain has been one of the early adopters of graph databases, enabling more natural representation models and better data integration workflows, exploration and analysis facilities. In this work, we survey the literature to explore the evolution and performance of graph databases and how the most recent solutions are applied in the biomedical domain, compiling a great variety of use cases. With this evidence, we conclude that the available graph database management systems are fit to support data-intensive, integrative applications, targeted at both basic research and exploratory tasks closer to the clinic.
Affiliation(s)
- Santiago Timón-Reina
  - Departamento de Inteligencia Artificial, Universidad Nacional de Educación a Distancia (UNED), C/Juan del Rosal, 16 Ciudad Universitaria, Madrid 28040, Spain
- Mariano Rincón
  - Departamento de Inteligencia Artificial, Universidad Nacional de Educación a Distancia (UNED), C/Juan del Rosal, 16 Ciudad Universitaria, Madrid 28040, Spain
- Rafael Martínez-Tomás
  - Departamento de Inteligencia Artificial, Universidad Nacional de Educación a Distancia (UNED), C/Juan del Rosal, 16 Ciudad Universitaria, Madrid 28040, Spain
7
Denecke K. Biomedical Standards and Open Health Data. Systems Medicine 2021. [DOI: 10.1016/b978-0-12-801238-3.11527-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022] Open
8
Rubinstein YR, Robinson PN, Gahl WA, Avillach P, Baynam G, Cederroth H, Goodwin RM, Groft SC, Hansson MG, Harris NL, Huser V, Mascalzoni D, McMurry JA, Might M, Nellaker C, Mons B, Paltoo DN, Pevsner J, Posada M, Rockett-Frase AP, Roos M, Rubinstein TB, Taruscio D, van Enckevort E, Haendel MA. The case for open science: rare diseases. JAMIA Open 2020; 3:472-486. [PMID: 33426479 PMCID: PMC7660964 DOI: 10.1093/jamiaopen/ooaa030] [Citation(s) in RCA: 26] [Impact Index Per Article: 6.5] [Received: 02/21/2020] [Revised: 05/30/2020] [Accepted: 06/23/2020] [Indexed: 01/04/2023] Open
Abstract
The premise of Open Science is that research and medical management will progress faster if data and knowledge are openly shared. The value of Open Science is nowhere more important and appreciated than in the rare disease (RD) community. Research into RDs has been limited by insufficient patient data and resources, a paucity of trained disease experts, and lack of therapeutics, leading to long delays in diagnosis and treatment. These issues can be ameliorated by following the principles and practices of sharing that are intrinsic to Open Science. Here, we describe how the RD community has adopted the core pillars of Open Science, adding new initiatives to promote care and research for RD patients and, ultimately, for all of medicine. We also present recommendations that can advance Open Science more globally.
Affiliation(s)
- Yaffa R Rubinstein
  - Special Volunteer in the Office of Strategic Initiatives, National Library of Medicine, Bethesda, Maryland, USA
- Peter N Robinson
  - The Jackson Laboratory for Genomic Medicine, Farmington, Connecticut, USA
- William A Gahl
  - Undiagnosed Diseases Program and Office of the Clinical Director, National Human Genome Research Institute (NHGRI), National Institutes of Health, Bethesda, Maryland, USA
- Paul Avillach
  - Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA
- Gareth Baynam
  - Western Australian Register of Developmental Anomalies and Telethon Kids Institute, Perth, Australia
- Rebecca M Goodwin
  - Department of Health and Human Services, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
- Stephen C Groft
  - NCATS, National Institutes of Health, Bethesda, Maryland, USA
- Mats G Hansson
  - Center for Research Ethics and Bioethics, Uppsala Universitet, Uppsala, Sweden
- Nomi L Harris
  - Department of Environmental Genomics & System Biology, Lawrence Berkeley National Laboratory, Berkeley, California, USA
- Vojtech Huser
  - Department of Health and Human Services, NCBI, National Institutes of Health, Bethesda, Maryland, USA
- Deborah Mascalzoni
  - Center for Research Ethics and Bioethics, Uppsala University, Sweden and EURAC Research, Bolzano, Italy
- Julie A McMurry
  - Linus Pauling Institute, Oregon State University, Corvallis, Oregon, USA
- Matthew Might
  - Hugh Kaul Precision Medicine Institute, The University of Alabama at Birmingham, Birmingham, Alabama, USA
- Christoffer Nellaker
  - Nuffield Department of Women's and Reproductive Health, Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, UK
- Barend Mons
  - Department of Human Genetics, Leiden University Medical Center, Leiden, Netherlands
- Dina N Paltoo
  - Department of Health and Human Services, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
- Jonathan Pevsner
  - Department of Neurology, Kennedy Krieger Institute and Department of Psychiatry and Behavioral Sciences, Johns Hopkins School of Medicine, Baltimore, Maryland, USA
- Manuel Posada
  - Rare Diseases Research Institute & CIBERER, Instituto de Salud Carlos III, Madrid, Spain
- Marco Roos
  - Human Genetics, Leiden University Medical Center, Leiden, Netherlands
- Tamar B Rubinstein
  - Children Hospital at Montefiore/Albert Einstein College of Medicine, Pediatrics, Bronx, New York, USA
- Domenica Taruscio
  - National Centre for Rare Diseases, Istituto Superiore di Sanità, Rome, Italy
- Esther van Enckevort
  - Department of Genetics, University Medical Center Groningen, University of Groningen, Leiden, Netherlands
- Melissa A Haendel
  - Linus Pauling Institute, Oregon State University, Corvallis, Oregon, USA
9
Taxonomy-Based Approaches to Quality Assurance of Ontologies. Journal of Healthcare Engineering 2017; 2017:3495723. [PMID: 29158885 PMCID: PMC5660792 DOI: 10.1155/2017/3495723] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 05/07/2017] [Accepted: 08/06/2017] [Indexed: 11/17/2022]
Abstract
Ontologies are important components of health information management systems. As such, the quality of their content is of paramount importance. It has been proven to be practical to develop quality assurance (QA) methodologies based on automated identification of sets of concepts expected to have higher likelihood of errors. Four kinds of such sets (called QA-sets) organized around the themes of complex and uncommonly modeled concepts are introduced. A survey of different methodologies based on these QA-sets and the results of applying them to various ontologies are presented. Overall, following these approaches leads to higher QA yields and better utilization of QA personnel. The formulation of additional QA-set methodologies will further enhance the suite of available ontology QA tools.
10
Quinn S, Bond R, Nugent C. A two-staged approach to developing and evaluating an ontology for delivering personalized education to diabetic patients. Inform Health Soc Care 2017; 43:264-279. [PMID: 29035605 DOI: 10.1080/17538157.2017.1364246] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 01/04/2023]
Abstract
Ontologies are often used in biomedical and health domains to provide a concise and consistent means of attributing meaning to medical terminology. Although domain specialists are novices in terms of ontology engineering, their evaluation of an ontology provides an opportunity to enhance its objectivity, accuracy, and coverage of the domain itself. This paper provides an evaluation of the viability of using ontology engineering novices to evaluate and enrich an ontology that can be used for personalized diabetic patient education. We describe a methodology for engaging healthcare and information technology specialists with a range of ontology engineering tasks. We used 87.8% of the data collected to validate the accuracy of our ontological model. The contributions also enabled a 16% increase in the class size and an 18% increase in object properties. Furthermore, we propose that ontology engineering novices can make valuable contributions to ontology development. Application-specific evaluation of the ontology using a semantic-web-based architecture is also discussed.
Affiliation(s)
- Susan Quinn
  - Computer Science Research Institute, School of Computing & Maths, University of Ulster, Newtownabbey, County Antrim, UK
- Raymond Bond
  - Computer Science Research Institute, School of Computing & Maths, University of Ulster, Newtownabbey, County Antrim, UK
- Chris Nugent
  - Computer Science Research Institute, School of Computing & Maths, University of Ulster, Newtownabbey, County Antrim, UK