1
Brochhausen M, Fransson MN, Kanaskar NV, Eriksson M, Merino-Martinez R, Hall RA, Norlin L, Kjellqvist S, Hortlund M, Topaloglu U, Hogan WR, Litton JE. Developing a semantically rich ontology for the biobank-administration domain. J Biomed Semantics 2013; 4:23. [PMID: 24103726 PMCID: PMC4021870 DOI: 10.1186/2041-1480-4-23] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Received: 01/17/2013] [Accepted: 05/15/2013] [Indexed: 01/21/2023] Open
Abstract
BACKGROUND: Biobanks are a critical resource for translational science. Recently, semantic web technologies such as ontologies have been found useful in retrieving research data from biobanks. However, recent research has also shown that there is a lack of data about the administrative aspects of biobanks. These data would be helpful to answer research-relevant questions such as what is the scope of specimens collected in a biobank, what is the curation status of the specimens, and what is the contact information for curators of biobanks. Our use cases include giving researchers the ability to retrieve key administrative data (e.g. contact information, contact's affiliation, etc.) about the biobanks where specific specimens of interest are stored. Thus, our goal is to provide an ontology that represents the administrative entities in biobanking and their relations. We base our ontology development on a set of 53 data attributes called MIABIS, which were in part the result of semantic integration efforts of the European Biobanking and Biomolecular Resources Research Infrastructure (BBMRI). The previous work on MIABIS provided the domain analysis for our ontology. We report on a test of our ontology against competency questions that we derived from the initial BBMRI use cases. Future work includes additional ontology development to answer additional competency questions from these use cases.
RESULTS: We created an open-source ontology of biobank administration called Ontologized MIABIS (OMIABIS) coded in OWL 2.0 and developed according to the principles of the OBO Foundry. It re-uses pre-existing ontologies when possible in cooperation with developers of other ontologies in related domains, such as the Ontology of Biomedical Investigation. OMIABIS provides a formalized representation of biobanks and their administration. Using the ontology and a set of Description Logic queries derived from the competency questions that we identified, we were able to retrieve test data with perfect accuracy. In addition, we began development of a mapping from the ontology to pre-existing biobank data structures commonly used in the U.S.
CONCLUSIONS: In conclusion, we created OMIABIS, an ontology of biobank administration. We found that basing its development on pre-existing resources to meet the BBMRI use cases resulted in a biobanking ontology that is re-useable in environments other than BBMRI. Our ontology retrieved all true positives and no false positives when queried according to the competency questions we derived from the BBMRI use cases. Mapping OMIABIS to a data structure used for biospecimen collections in a medical center in Little Rock, AR showed adequate coverage of our ontology.
Affiliation(s)
- Mathias Brochhausen
- Division of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA
- Martin N Fransson
- Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
- Nitin V Kanaskar
- Department of IT Research, University of Arkansas for Medical Sciences, Little Rock, AR, USA
- Mikael Eriksson
- Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
- Roxana Merino-Martinez
- Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
- Roger A Hall
- Division of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA
- Loreana Norlin
- Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
- Sanela Kjellqvist
- Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
- Maria Hortlund
- Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
- Umit Topaloglu
- Division of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA
- William R Hogan
- Division of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA
- Jan-Eric Litton
- Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
2
Deus HF, Veiga DF, Freire PR, Weinstein JN, Mills GB, Almeida JS. Exposing the cancer genome atlas as a SPARQL endpoint. J Biomed Inform 2010; 43:998-1008. [PMID: 20851208 PMCID: PMC3071752 DOI: 10.1016/j.jbi.2010.09.004] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.6] [Received: 05/07/2010] [Revised: 07/07/2010] [Accepted: 09/09/2010] [Indexed: 02/03/2023]
Abstract
The Cancer Genome Atlas (TCGA) is a multidisciplinary, multi-institutional effort to characterize several types of cancer. Datasets from biomedical domains such as TCGA present a particularly challenging task for those interested in dynamically aggregating its results because the data sources are typically both heterogeneous and distributed. The Linked Data best practices offer a solution to integrate and discover data with those characteristics, namely through exposure of data as Web services supporting SPARQL, the Resource Description Framework query language. Most SPARQL endpoints, however, cannot easily be queried by data experts. Furthermore, exposing experimental data as SPARQL endpoints remains a challenging task because, in most cases, data must first be converted to Resource Description Framework triples. In line with those requirements, we have developed an infrastructure to expose clinical, demographic and molecular data elements generated by TCGA as a SPARQL endpoint by assigning elements to entities of the Simple Sloppy Semantic Database (S3DB) management model. All components of the infrastructure are available as independent Representational State Transfer (REST) Web services to encourage reusability, and a simple interface was developed to automatically assemble SPARQL queries by navigating a representation of the TCGA domain. A key feature of the proposed solution that greatly facilitates assembly of SPARQL queries is the distinction between the TCGA domain descriptors and data elements. Furthermore, the use of the S3DB management model as a mediator enables queries to both public and protected data without the need for prior submission to a single data source.
Affiliation(s)
- Helena F Deus
- Department of Bioinformatics and Computational Biology, The University of Texas M. D. Anderson Cancer Center, 1515 Holcombe Blvd., Unit 1410, Houston, TX 77230-1402, USA.
3
McCusker JP, Phillips JA, Beltrán AG, Finkelstein A, Krauthammer M. Semantic web data warehousing for caGrid. BMC Bioinformatics 2009; 10 Suppl 10:S2. [PMID: 19796399 PMCID: PMC2755823 DOI: 10.1186/1471-2105-10-s10-s2] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.5] [Indexed: 01/07/2023] Open
Abstract
The National Cancer Institute (NCI) is developing caGrid as a means for sharing cancer-related data and services. As more data sets become available on caGrid, we need effective ways of accessing and integrating this information. Although the data models exposed on caGrid are semantically well annotated, it is currently up to the caGrid client to infer relationships between the different models and their classes. In this paper, we present a Semantic Web-based data warehouse (Corvus) for creating relationships among caGrid models. This is accomplished through the transformation of semantically-annotated caBIG Unified Modeling Language (UML) information models into Web Ontology Language (OWL) ontologies that preserve those semantics. We demonstrate the validity of the approach by Semantic Extraction, Transformation and Loading (SETL) of data from two caGrid data sources, caTissue and caArray, as well as alignment and query of those sources in Corvus. We argue that semantic integration is necessary for integration of data from distributed web services and that Corvus is a useful way of accomplishing this. Our approach is generalizable and of broad utility to researchers facing similar integration challenges.
Affiliation(s)
- Jamie P McCusker
- Department of Pathology, Yale University School of Medicine, New Haven, CT, USA (GRID: grid.47100.32; ISNI: 0000000419368710)
- Anthony Finkelstein
- Department of Computer Science, University College London, London, UK (GRID: grid.83440.3b; ISNI: 0000000121901201)
- Michael Krauthammer
- Department of Pathology, Yale University School of Medicine, New Haven, CT, USA (GRID: grid.47100.32; ISNI: 0000000419368710)
4
Huang T, Shenoy PJ, Sinha R, Graiser M, Bumpers KW, Flowers CR. Development of the Lymphoma Enterprise Architecture Database: a caBIG Silver level compliant system. Cancer Inform 2009; 8:45-64. [PMID: 19492074 PMCID: PMC2675136 DOI: 10.4137/cin.s940] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Indexed: 11/25/2022] Open
Abstract
Lymphomas are the fifth most common cancer in the United States, with numerous histological subtypes. Integrating existing clinical information on lymphoma patients provides a platform for understanding biological variability in presentation and treatment response and aids development of novel therapies. We developed a cancer Biomedical Informatics Grid (caBIG) Silver level compliant lymphoma database, called the Lymphoma Enterprise Architecture Data-system (LEAD), which integrates the pathology, pharmacy, laboratory, cancer registry, clinical trials, and clinical data from institutional databases. We utilized the Cancer Common Ontological Representation Environment Software Development Kit (caCORE SDK) provided by the National Cancer Institute's Center for Bioinformatics to establish the LEAD platform for data management. The caCORE SDK-generated system utilizes an n-tier architecture with open Application Programming Interfaces, controlled vocabularies, and registered metadata to achieve semantic integration across multiple cancer databases. We demonstrated that the data elements and structures within LEAD could be used to manage clinical research data from phase 1 clinical trials, cohort studies, and registry data from the Surveillance Epidemiology and End Results database. This work provides a clear example of how semantic technologies from caBIG can be applied to support a wide range of clinical and research tasks, and integrate data from disparate systems into a single architecture. This illustrates the central importance of caBIG to the management of clinical and biological data.
Affiliation(s)
- Taoying Huang
- Winship Cancer Institute, School of Medicine, Emory University, Atlanta, GA, U.S.A
- Pareen J. Shenoy
- Winship Cancer Institute, School of Medicine, Emory University, Atlanta, GA, U.S.A
- Rajni Sinha
- Winship Cancer Institute, School of Medicine, Emory University, Atlanta, GA, U.S.A
- Michael Graiser
- Winship Cancer Institute, School of Medicine, Emory University, Atlanta, GA, U.S.A
- Kevin W. Bumpers
- Winship Cancer Institute, School of Medicine, Emory University, Atlanta, GA, U.S.A
5
Holford ME, Rajeevan H, Zhao H, Kidd KK, Cheung KH. Semantic Web-based integration of cancer pathways and allele frequency data. Cancer Inform 2009; 8:19-30. [PMID: 19458791 PMCID: PMC2664696 DOI: 10.4137/cin.s1006] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 01/30/2023] Open
Abstract
We demonstrate the use of Semantic Web technology to integrate the ALFRED allele frequency database and the Starpath pathway resource. The linking of population-specific genotype data with cancer-related pathway data is potentially useful given the growing interest in personalized medicine and the exploitation of pathway knowledge for cancer drug discovery. We model our data using the Web Ontology Language (OWL), drawing upon ideas from existing standard formats BioPAX for pathway data and PML for allele frequency data. We store our data within an Oracle database, using Oracle Semantic Technologies. We then query the data using Oracle’s rule-based inference engine and SPARQL-like RDF query language. The ability to perform queries across the domains of population genetics and pathways offers the potential to answer a number of cancer-related research questions. Among the possibilities is the ability to identify genetic variants which are associated with cancer pathways and whose frequency varies significantly between ethnic groups. This sort of information could be useful for designing clinical studies and for providing background data in personalized medicine. It could also assist with the interpretation of genetic analysis results such as those from genome-wide association studies.
Affiliation(s)
- Matthew E Holford
- Department of Biostatistics, School of Public Health, Yale University, New Haven, CT, USA