1. Thakur M, Buniello A, Brooksbank C, Gurwitz KT, Hall M, Hartley M, Hulcoop DG, Leach AR, Marques D, Martin M, Mithani A, McDonagh EM, Mutasa-Gottgens E, Ochoa D, Perez-Riverol Y, Stephenson J, Varadi M, Velankar S, Vizcaino JA, Witham R, McEntyre J. EMBL's European Bioinformatics Institute (EMBL-EBI) in 2023. Nucleic Acids Res 2024; 52:D10-D17. [PMID: 38015445 PMCID: PMC10767983 DOI: 10.1093/nar/gkad1088] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 09/26/2023] [Revised: 10/23/2023] [Accepted: 10/30/2023] [Indexed: 11/29/2023] Open
Abstract
The European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) is one of the world's leading sources of public biomolecular data. Based at the Wellcome Genome Campus in Hinxton, UK, EMBL-EBI is one of six sites of the European Molecular Biology Laboratory (EMBL), Europe's only intergovernmental life sciences organisation. This overview summarises the latest developments in the services provided by EMBL-EBI data resources to scientific communities globally. These developments aim to ensure EMBL-EBI resources meet the current and future needs of these scientific communities, accelerating the impact of open biological data for all.
Affiliation(s)
- Matthew Thakur
  - Data Services Teams, EMBL’s European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Annalisa Buniello
  - Open Targets, EMBL’s European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Catherine Brooksbank
  - Training Team, EMBL’s European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Kim T Gurwitz
  - Training Team, EMBL’s European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Matthew Hall
  - Industry Partnerships, EMBL’s European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Matthew Hartley
  - Data Services Teams, EMBL’s European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- David G Hulcoop
  - Open Targets, EMBL’s European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Andrew R Leach
  - Data Services Teams, EMBL’s European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
  - Industry Partnerships, EMBL’s European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Diana Marques
  - Data Services Teams, EMBL’s European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Maria Martin
  - Data Services Teams, EMBL’s European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Aziz Mithani
  - Training Team, EMBL’s European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Ellen M McDonagh
  - Open Targets, EMBL’s European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Euphemia Mutasa-Gottgens
  - Industry Partnerships, EMBL’s European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- David Ochoa
  - Open Targets, EMBL’s European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Yasset Perez-Riverol
  - Data Services Teams, EMBL’s European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- James Stephenson
  - Data Services Teams, EMBL’s European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Mihaly Varadi
  - Data Services Teams, EMBL’s European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Sameer Velankar
  - Data Services Teams, EMBL’s European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Juan Antonio Vizcaino
  - Data Services Teams, EMBL’s European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Rick Witham
  - Data Services Teams, EMBL’s European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Johanna McEntyre
  - Data Services Teams, EMBL’s European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
2. Al-Aamri A, Taha K, Al-Hammadi Y, Maalouf M, Homouz D. Analyzing a co-occurrence gene-interaction network to identify disease-gene association. BMC Bioinformatics 2019; 20:70. [PMID: 30736752 PMCID: PMC6368766 DOI: 10.1186/s12859-019-2634-7] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Received: 04/23/2018] [Accepted: 01/17/2019] [Indexed: 12/03/2022] Open
Abstract
Background: Understanding the genetic networks and their role in chronic diseases (e.g., cancer) is one of the important objectives of biological researchers. In this work, we present a text mining system that constructs a gene-gene-interaction network for the entire human genome and then performs network analysis to identify disease-related genes. We recognize the interacting genes based on their co-occurrence frequency within the biomedical literature and by employing linear and non-linear rare-event classification models. We analyze the constructed network of genes by using different network centrality measures to decide on the importance of each gene. Specifically, we apply betweenness, closeness, eigenvector, and degree centrality metrics to rank the central genes of the network and to identify possible cancer-related genes.
Results: We evaluated the top 15 ranked genes for different cancer types (i.e., Prostate, Breast, and Lung Cancer). The average precisions for identifying breast, prostate, and lung cancer genes vary between 80-100%. On a prostate case study, the system predicted an average of 80% prostate-related genes.
Conclusions: The results show that our system has the potential for improving the prediction accuracy of identifying gene-gene interaction and disease-gene associations. We also conduct a prostate cancer case study by using the threshold property in logistic regression, and we compare our approach with some of the state-of-the-art methods.
Electronic supplementary material: The online version of this article (10.1186/s12859-019-2634-7) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Amira Al-Aamri
  - Department of Electrical and Computer Engineering, Abu Dhabi, United Arab Emirates
- Kamal Taha
  - Department of Electrical and Computer Engineering, Abu Dhabi, United Arab Emirates
- Yousof Al-Hammadi
  - Department of Electrical and Computer Engineering, Abu Dhabi, United Arab Emirates
- Maher Maalouf
  - Department of Industrial and Systems Engineering, Abu Dhabi, United Arab Emirates
- Dirar Homouz
  - Department of Physics, Khalifa University of Science and Technology, Abu Dhabi, P.O. Box 127788, United Arab Emirates.
3. Hasnain A, Rebholz-Schuhmann D. Biomedical Semantic Resources for Drug Discovery Platforms. Lecture Notes in Computer Science 2017. [DOI: 10.1007/978-3-319-70407-4_34] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 12/30/2022]
4. Barros M, Couto FM. Knowledge Representation and Management: a Linked Data Perspective. Yearb Med Inform 2016:178-183. [PMID: 27830248 DOI: 10.15265/iy-2016-022] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Indexed: 01/03/2023] Open
Abstract
INTRODUCTION: Biomedical research is increasingly becoming a data-intensive science in several areas, where prodigious amounts of data are being generated that have to be stored, integrated, shared and analyzed. In an effort to improve the accessibility of data and knowledge, the Linked Data initiative proposed a well-defined set of recommendations for exposing, sharing and integrating data, information and knowledge, using semantic web technologies.
OBJECTIVE: The main goal of this paper is to identify the current status and future trends of knowledge representation and management in Life and Health Sciences, mostly with regard to linked data technologies.
METHODS: We selected three prominent linked data studies, namely Bio2RDF, Open PHACTS and EBI RDF platform, and selected 14 studies published after 2014 (inclusive) that cited any of the three studies. We manually analyzed these 14 papers in relation to how they use linked data techniques.
RESULTS: The analyses show a tendency to use linked data techniques in Life and Health Sciences, and even if some studies do not follow all of the recommendations, many of them already represent and manage their knowledge using RDF and biomedical ontologies.
CONCLUSION: These insights from RDF and biomedical ontologies are having a strong impact on how knowledge is generated from biomedical data, by making data elements increasingly connected and by providing a better description of their semantics. As health institutes become more data centric, we believe that the adoption of linked data techniques will continue to grow and be an effective solution to knowledge representation and management.
Affiliation(s)
- F M Couto
  - Francisco M. Couto, LaSIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, 1749-016 Lisboa, Portugal, E-mail:
5. Peng Y, Wei CH, Lu Z. Improving chemical disease relation extraction with rich features and weakly labeled data. J Cheminform 2016; 8:53. [PMID: 28316651 PMCID: PMC5054544 DOI: 10.1186/s13321-016-0165-z] [Citation(s) in RCA: 41] [Impact Index Per Article: 4.6] [Received: 03/22/2016] [Accepted: 09/28/2016] [Indexed: 01/08/2023] Open
Abstract
Background: Due to the importance of identifying relations between chemicals and diseases for new drug discovery and improving chemical safety, there has been a growing interest in developing automatic relation extraction systems for capturing these relations from the rich and rapid-growing biomedical literature. In this work we aim to build on current advances in named entity recognition and a recent BioCreative effort to further improve the state of the art in biomedical relation extraction, in particular for the chemical-induced disease (CID) relations.
Results: We propose a rich-feature approach with Support Vector Machine to aid in the extraction of CIDs from PubMed articles. Our feature vector includes novel statistical features, linguistic knowledge, and domain resources. We also incorporate the output of a rule-based system as features, thus combining the advantages of rule- and machine learning-based systems. Furthermore, we augment our approach with automatically generated labeled text from an existing knowledge base to improve performance without additional cost for corpus construction. To evaluate our system, we perform experiments on the human-annotated BioCreative V benchmarking dataset and compare with previous results. When trained using only BioCreative V training and development sets, our system achieves an F-score of 57.51%, which already compares favorably to previous methods. Our system performance was further improved to 61.01% in F-score when augmented with additional automatically generated weakly labeled data.
Conclusions: Our text-mining approach demonstrates state-of-the-art performance in disease-chemical relation extraction. More importantly, this work exemplifies the use of (freely available) curated document-level annotations in existing biomedical databases, which are largely overlooked in text-mining system development.
Affiliation(s)
- Yifan Peng
  - National Center for Biotechnology Information, Bethesda, MD 20894 USA; Computer and Information Sciences, University of Delaware, Newark, DE 19716 USA
- Chih-Hsuan Wei
  - National Center for Biotechnology Information, Bethesda, MD 20894 USA
- Zhiyong Lu
  - National Center for Biotechnology Information, Bethesda, MD 20894 USA
6. Kaalia R, Ghosh I. Semantics based approach for analyzing disease-target associations. J Biomed Inform 2016; 62:125-35. [PMID: 27349858 DOI: 10.1016/j.jbi.2016.06.009] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 01/29/2016] [Revised: 06/23/2016] [Accepted: 06/24/2016] [Indexed: 12/16/2022]
Abstract
BACKGROUND: A complex disease is caused by heterogeneous biological interactions between genes and their products along with the influence of environmental factors. There have been many attempts to understand the cause of these diseases using experimental, statistical and computational methods. In the present work the objective is to address the challenge of representation and integration of information from heterogeneous biomedical aspects of a complex disease using a semantics-based approach.
METHODS: Semantic web technology is used to design Disease Association Ontology (DAO-db) for representation and integration of disease associated information with diabetes as the case study. The functional associations of disease genes are integrated using RDF graphs of DAO-db. Three semantic web based scoring algorithms (PageRank, HITS (Hyperlink Induced Topic Search) and HITS with semantic weights) are used to score the gene nodes on the basis of their functional interactions in the graph.
RESULTS: Disease Association Ontology for Diabetes (DAO-db) provides a standard ontology-driven platform for describing genes, proteins, pathways involved in diabetes and for integrating functional associations from various interaction levels (gene-disease, gene-pathway, gene-function, gene-cellular component and protein-protein interactions). An automatic instance loader module is also developed in the present work that helps in adding instances to DAO-db on a large scale.
CONCLUSIONS: Our ontology provides a framework for querying and analyzing the disease associated information in the form of RDF graphs. The above developed methodology is used to predict novel potential targets involved in diabetes disease from the long list of loose (statistically associated) gene-disease associations.
Affiliation(s)
- Rama Kaalia
  - School of Computational & Integrative Sciences, Jawaharlal Nehru University, New Delhi 110067, India
- Indira Ghosh
  - School of Computational & Integrative Sciences, Jawaharlal Nehru University, New Delhi 110067, India.
7. Bousfield D, McEntyre J, Velankar S, Papadatos G, Bateman A, Cochrane G, Kim JH, Graef F, Vartak V, Alako B, Blomberg N. Patterns of database citation in articles and patents indicate long-term scientific and industry value of biological data resources. F1000Res 2016; 5:ELIXIR-160. [PMID: 27092246 PMCID: PMC4821287 DOI: 10.12688/f1000research.7911.1] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Accepted: 02/05/2016] [Indexed: 01/26/2023] Open
Abstract
Data from open access biomolecular data resources, such as the European Nucleotide Archive and the Protein Data Bank are extensively reused within life science research for comparative studies, method development and to derive new scientific insights. Indicators that estimate the extent and utility of such secondary use of research data need to reflect this complex and highly variable data usage. By linking open access scientific literature, via Europe PubMedCentral, to the metadata in biological data resources we separate data citations associated with a deposition statement from citations that capture the subsequent, long-term, reuse of data in academia and industry. We extend this analysis to begin to investigate citations of biomolecular resources in patent documents. We find citations in more than 8,000 patents from 2014, demonstrating substantial use and an important role for data resources in defining biological concepts in granted patents to both academic and industrial innovators. Combined together our results indicate that the citation patterns in biomedical literature and patents vary, not only due to citation practice but also according to the data resource cited. The results guard against the use of simple metrics such as citation counts and show that indicators of data use must not only take into account citations within the biomedical literature but also include reuse of data in industry and other parts of society by including patents and other scientific and technical documents such as guidelines, reports and grant applications.
Affiliation(s)
- David Bousfield
  - ELIXIR, Wellcome Genome Campus, Cambridge, UK
  - Ganesha Associates, Cambridge, UK
- Johanna McEntyre
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridge, UK
- Sameer Velankar
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridge, UK
- George Papadatos
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridge, UK
- Alex Bateman
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridge, UK
- Guy Cochrane
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridge, UK
- Jee-Hyub Kim
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridge, UK
- Florian Graef
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridge, UK
- Vid Vartak
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridge, UK
- Blaise Alako
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridge, UK
8. Machado CM, Rebholz-Schuhmann D, Freitas AT, Couto FM. The semantic web in translational medicine: current applications and future directions. Brief Bioinform 2015; 16:89-103. [PMID: 24197933 PMCID: PMC4293377 DOI: 10.1093/bib/bbt079] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Received: 07/22/2013] [Accepted: 10/08/2013] [Indexed: 11/14/2022] Open
Abstract
Semantic web technologies offer an approach to data integration and sharing, even for resources developed independently or broadly distributed across the web. This approach is particularly suitable for scientific domains that profit from large amounts of data that reside in the public domain and that have to be exploited in combination. Translational medicine is such a domain, which in addition has to integrate private data from the clinical domain with proprietary data from the pharmaceutical domain. In this survey, we present the results of our analysis of translational medicine solutions that follow a semantic web approach. We assessed these solutions in terms of their target medical use case; the resources covered to achieve their objectives; and their use of existing semantic web resources for the purposes of data sharing, data interoperability and knowledge discovery. The semantic web technologies seem to fulfill their role in facilitating the integration and exploration of data from disparate sources, but it is also clear that simply using them is not enough. It is fundamental to reuse resources, to define mappings between resources, to share data and knowledge. All these aspects allow the instantiation of translational medicine at the semantic web-scale, thus resulting in a network of solutions that can share resources for a faster transfer of new scientific results into the clinical practice. The envisioned network of translational medicine solutions is on its way, but it still requires resolving the challenges of sharing protected data and of integrating semantic-driven technologies into the clinical practice.
Affiliation(s)
- Catia M. Machado
  - *Corresponding author. Catia M. Machado, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, Portugal and Instituto de Engenharia de Sistemas e Computadores - Investigação e Desenvolvimento, Universidade de Lisboa, Portugal. E-mail:
9. Hettne KM, Dharuri H, Zhao J, Wolstencroft K, Belhajjame K, Soiland-Reyes S, Mina E, Thompson M, Cruickshank D, Verdes-Montenegro L, Garrido J, de Roure D, Corcho O, Klyne G, van Schouwen R, ‘t Hoen PAC, Bechhofer S, Goble C, Roos M. Structuring research methods and data with the research object model: genomics workflows as a case study. J Biomed Semantics 2014; 5:41. [PMID: 25276335 PMCID: PMC4177597 DOI: 10.1186/2041-1480-5-41] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Received: 05/13/2013] [Accepted: 07/29/2014] [Indexed: 12/24/2022] Open
Abstract
BACKGROUND: One of the main challenges for biomedical research lies in the computer-assisted integrative study of large and increasingly complex combinations of data in order to understand molecular mechanisms. The preservation of the materials and methods of such computational experiments with clear annotations is essential for understanding an experiment, and this is increasingly recognized in the bioinformatics community. Our assumption is that offering means of digital, structured aggregation and annotation of the objects of an experiment will provide necessary meta-data for a scientist to understand and recreate the results of an experiment. To support this we explored a model for the semantic description of a workflow-centric Research Object (RO), where an RO is defined as a resource that aggregates other resources, e.g., datasets, software, spreadsheets, text, etc. We applied this model to a case study where we analysed human metabolite variation by workflows.
RESULTS: We present the application of the workflow-centric RO model for our bioinformatics case study. Three workflows were produced following recently defined Best Practices for workflow design. By modelling the experiment as an RO, we were able to automatically query the experiment and answer questions such as "which particular data was input to a particular workflow to test a particular hypothesis?", and "which particular conclusions were drawn from a particular workflow?".
CONCLUSIONS: Applying a workflow-centric RO model to aggregate and annotate the resources used in a bioinformatics experiment allowed us to retrieve the conclusions of the experiment in the context of the driving hypothesis, the executed workflows and their input data. The RO model is an extendable reference model that can be used by other systems as well.
AVAILABILITY: The Research Object is available at http://www.myexperiment.org/packs/428. The Wf4Ever Research Object Model is available at http://wf4ever.github.io/ro.
Affiliation(s)
- Kristina M Hettne
  - Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- Harish Dharuri
  - Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- Jun Zhao
  - Department of Zoology, University of Oxford, Oxford, UK
- Katherine Wolstencroft
  - School of Computer Science, University of Manchester, Manchester, UK
  - Leiden Institute of Advanced Computer Science, Leiden University, Leiden, The Netherlands
- Khalid Belhajjame
  - School of Computer Science, University of Manchester, Manchester, UK
- Eleni Mina
  - Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- Mark Thompson
  - Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- David de Roure
  - Department of Zoology, University of Oxford, Oxford, UK
- Oscar Corcho
  - Ontology Engineering Group, Universidad Politécnica de Madrid, Madrid, Spain
- Graham Klyne
  - Department of Zoology, University of Oxford, Oxford, UK
- Reinout van Schouwen
  - Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- Peter A C ‘t Hoen
  - Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- Sean Bechhofer
  - School of Computer Science, University of Manchester, Manchester, UK
- Carole Goble
  - School of Computer Science, University of Manchester, Manchester, UK
- Marco Roos
  - Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
10. Hanauer DA, Saeed M, Zheng K, Mei Q, Shedden K, Aronson AR, Ramakrishnan N. Applying MetaMap to Medline for identifying novel associations in a large clinical dataset: a feasibility analysis. J Am Med Inform Assoc 2014; 21:925-37. [PMID: 24928177 PMCID: PMC4147617 DOI: 10.1136/amiajnl-2014-002767] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Received: 03/01/2014] [Revised: 05/23/2014] [Accepted: 05/27/2014] [Indexed: 02/07/2023] Open
Abstract
OBJECTIVE: We describe experiments designed to determine the feasibility of distinguishing known from novel associations based on a clinical dataset comprised of International Classification of Disease, V.9 (ICD-9) codes from 1.6 million patients by comparing them to associations of ICD-9 codes derived from 20.5 million Medline citations processed using MetaMap. Associations appearing only in the clinical dataset, but not in Medline citations, are potentially novel.
METHODS: Pairwise associations of ICD-9 codes were independently identified in both the clinical and Medline datasets, which were then compared to quantify their degree of overlap. We also performed a manual review of a subset of the associations to validate how well MetaMap performed in identifying diagnoses mentioned in Medline citations that formed the basis of the Medline associations.
RESULTS: The overlap of associations based on ICD-9 codes in the clinical and Medline datasets was low: only 6.6% of the 3.1 million associations found in the clinical dataset were also present in the Medline dataset. Further, a manual review of a subset of the associations that appeared in both datasets revealed that co-occurring diagnoses from Medline citations do not always represent clinically meaningful associations.
DISCUSSION: Identifying novel associations derived from large clinical datasets remains challenging. Medline as a sole data source for existing knowledge may not be adequate to filter out widely known associations.
CONCLUSIONS: In this study, novel associations were not readily identified. Further improvements in accuracy and relevance for tools such as MetaMap are needed to realize their expected utility.
Affiliation(s)
- David A Hanauer
  - Department of Pediatrics, University of Michigan Medical School, Ann Arbor, Michigan, USA
- Mohammed Saeed
  - Department of Internal Medicine, University of Michigan Medical School, Ann Arbor, Michigan, USA
- Kai Zheng
  - Department of Health Management and Policy, University of Michigan School of Public Health, Ann Arbor, Michigan, USA
  - School of Information, University of Michigan, Ann Arbor, Michigan, USA
- Qiaozhu Mei
  - School of Information, University of Michigan, Ann Arbor, Michigan, USA
  - Department of Electronic Engineering and Computer Science, University of Michigan, Ann Arbor, Michigan, USA
- Kerby Shedden
  - Center for Statistical Consultation and Research, University of Michigan, Ann Arbor, Michigan, USA
- Alan R Aronson
  - Lister Hill Center, National Library of Medicine, Bethesda, Maryland, USA
- Naren Ramakrishnan
  - Department of Computer Science, Discovery Analytics Center, Virginia Tech, Arlington, Virginia, USA