1
Iyappan A, Kawalia SB, Raschka T, Hofmann-Apitius M, Senger P. NeuroRDF: semantic integration of highly curated data to prioritize biomarker candidates in Alzheimer's disease. J Biomed Semantics 2016; 7:45. PMID: 27392431; PMCID: PMC4939021; DOI: 10.1186/s13326-016-0079-8.
Abstract
BACKGROUND Neurodegenerative diseases are incurable and debilitating indications with huge social and economic impact, and much is still to be learnt about the underlying molecular events. Mechanistic disease models could offer a knowledge framework to help decipher the complex interactions that occur at the molecular and cellular levels. This motivates the development of an approach that integrates highly curated, heterogeneous data into a disease model spanning different regulatory data layers. Although several disease models exist, they often do not consider the quality of the underlying data. Moreover, even with current advances in semantic web technology, we still have no cure for complex diseases such as Alzheimer's disease. One key reason for this could be the widening gap between generated data and the knowledge derived from it. RESULTS In this paper, we describe an approach, called NeuroRDF, to develop an integrative framework for modeling curated knowledge in the area of complex neurodegenerative diseases. The core of this strategy is the use of well-curated, context-specific data, integrated into a single semantic web framework, RDF. This increases the probability that the derived knowledge is novel and reliable in a specific disease context. The infrastructure integrates highly curated data from databases (BIND, IntAct, etc.), literature (PubMed), and gene expression resources (such as GEO and ArrayExpress). We illustrate the effectiveness of our approach by asking real-world biomedical questions that link these resources to prioritize plausible biomarker candidates. Among the 13 prioritized candidate genes, we identified MIF as a potential emerging candidate due to its role as a pro-inflammatory cytokine. We additionally report on the effort and challenges involved in generating such an indication-specific knowledge base of curated and quality-controlled data.
CONCLUSION Although many alternative approaches have been proposed and practiced for modeling diseases, semantic web technology is a flexible and well-established solution for harmonized aggregation. The benefit of this work, the use of high-quality and context-specific data, becomes apparent in surfacing previously unattended biomarker candidates around a well-known mechanism, which can be further leveraged for experimental investigation.
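The integration strategy described in this abstract can be sketched in miniature: facts harmonized from several curated sources become subject-predicate-object triples in one store, which can then be queried across evidence layers. All gene names, predicates, and source assignments below are hypothetical illustrations, not data from the paper (except MIF, which the abstract names as a prioritized candidate).

```python
# Hypothetical triples, each tagged by the kind of curated source it came from.
triples = {
    # from a protein-protein interaction database (IntAct-like)
    ("MIF", "interacts_with", "CD74"),
    # from literature mining (PubMed-like)
    ("MIF", "mentioned_with_disease", "Alzheimer's disease"),
    ("APP", "mentioned_with_disease", "Alzheimer's disease"),
    # from a gene expression resource (GEO-like)
    ("MIF", "differentially_expressed_in", "Alzheimer's disease"),
}

def match(pattern, store):
    """Return all triples matching a single (s, p, o) pattern; None is a wildcard."""
    s, p, o = pattern
    return [t for t in store
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# "Which genes are supported by both the expression and the literature layer?"
expr = {t[0] for t in match((None, "differentially_expressed_in",
                             "Alzheimer's disease"), triples)}
lit = {t[0] for t in match((None, "mentioned_with_disease",
                            "Alzheimer's disease"), triples)}
candidates = expr & lit
print(sorted(candidates))  # genes supported by both evidence layers
```

In the real system the store is RDF queried with SPARQL; the point here is only that a cross-source question reduces to intersecting pattern matches over one harmonized triple set.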
Affiliation(s)
- Anandhi Iyappan
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, 53754, Sankt Augustin, Germany
- Bonn-Aachen International Center for Information Technology, Rheinische Friedrich-Wilhelms-Universität Bonn, 53113, Bonn, Germany
- Shweta Bagewadi Kawalia
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, 53754, Sankt Augustin, Germany
- Bonn-Aachen International Center for Information Technology, Rheinische Friedrich-Wilhelms-Universität Bonn, 53113, Bonn, Germany
- Tamara Raschka
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, 53754, Sankt Augustin, Germany
- University of Applied Sciences Koblenz, RheinAhrCampus, Joseph-Rovan-Allee 2, 53424, Remagen, Germany
- Martin Hofmann-Apitius
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, 53754, Sankt Augustin, Germany
- Bonn-Aachen International Center for Information Technology, Rheinische Friedrich-Wilhelms-Universität Bonn, 53113, Bonn, Germany
- Philipp Senger
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, 53754, Sankt Augustin, Germany
2
Affiliation(s)
- Toni Kazic
- Dept. of Computer Science, Missouri Maize Center, Missouri Informatics Institute, and Interdisciplinary Plant Group, University of Missouri, Columbia, Missouri, United States of America
3
Anguita A, García-Remesal M, de la Iglesia D, Graf N, Maojo V. Toward a view-oriented approach for aligning RDF-based biomedical repositories. Methods Inf Med 2014; 54:50-5. PMID: 24777240; DOI: 10.3414/me13-02-0020.
Abstract
INTRODUCTION This article is part of the Focus Theme of Methods of Information in Medicine on "Managing Interoperability and Complexity in Health Systems". BACKGROUND The need for complementary access to multiple RDF databases has fostered new lines of research, but also entailed new challenges due to disparities in data representation. While several approaches for RDF-based database integration have been proposed, those focused on schema alignment have become the most widely adopted. All state-of-the-art solutions for aligning RDF-based sources resort to a simple technique inherited from legacy relational database integration methods. This technique, known as element-to-element (e2e) mapping, establishes 1:1 mappings between single primitive elements (e.g. concepts, attributes, relationships) belonging to the source and target schemas. However, due to the intrinsic nature of RDF, a representation language based on <subject, predicate, object> tuples, one may find RDF elements whose semantics vary dramatically when combined into a view with other RDF elements, i.e. their meaning depends on their context. Such elements cannot be adequately represented in the target schema with the traditional e2e approach, which fails to address this issue without explicitly modifying the target ontology and thus lacks the expressiveness required to reflect the intended semantics in the alignment information. OBJECTIVES To enhance existing RDF schema alignment techniques by providing a mechanism to properly represent elements with context-dependent semantics, thus enabling users to perform more expressive alignments, including scenarios that cannot be adequately addressed by existing approaches. METHODS Instead of establishing 1:1 correspondences between single primitive elements of the schemas, we propose a view-based approach.
This approach establishes mapping relationships between RDF subgraphs, which can be regarded as the equivalent of views in traditional databases, rather than between single schema elements. It enables users to represent scenarios defined by context-dependent RDF elements that cannot be properly expressed with the currently existing approaches. RESULTS We developed a software tool implementing our view-based strategy. The tool is currently being used in the European Commission funded p-medicine project, which aims to create a technological framework for integrating clinical and genomic data to facilitate the development of personalized drugs and therapies for cancer, based on the genetic profile of the patient. We used our tool to integrate different RDF-based databases, including repositories of clinical trials and DICOM images, using the Health Data Ontology Trunk (HDOT) ontology as the target schema. CONCLUSIONS The importance of database integration methods and tools in biomedical research has been widely recognized. Modern research in this area, e.g. the identification of disease biomarkers or the design of personalized therapies, relies heavily on a technical framework that enables researchers to uniformly access disparate repositories. We present a method and a tool implementing a novel alignment approach specifically designed to support and enhance the integration of RDF-based data sources at the schema (metadata) level. This approach provides a higher level of expressiveness than existing solutions and solves heterogeneity scenarios that cannot be properly represented using other state-of-the-art techniques.
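The difference between e2e and view-based mapping can be sketched with toy triples. The schema, property names, and clinical example below are invented for illustration; they are not from the paper.

```python
# A source graph where the meaning of hasValue depends on context:
# the value "140" is only interpretable given the sibling hasType triple.
source_graph = {
    ("p1", "hasMeasurement", "m1"),
    ("m1", "hasType", "BloodPressure"),
    ("m1", "hasValue", "140"),
}

# An e2e mapping could only say hasValue -> someTargetProperty, losing the
# fact that this particular value is a blood pressure. A view-based mapping
# keys on a whole subgraph pattern instead of a single element.
def view_mapping(graph):
    """Map the subgraph {?m hasType BloodPressure . ?m hasValue ?v}
    to a context-aware (hypothetical) target triple."""
    out = set()
    for (m, p, t) in graph:
        if p == "hasType" and t == "BloodPressure":
            for (m2, p2, v) in graph:
                if m2 == m and p2 == "hasValue":
                    out.add((m, "target:bloodPressureValue", v))
    return out

print(view_mapping(source_graph))
```

The mapping fires only when the full view (both triples) is present, which is exactly the context sensitivity that a 1:1 element mapping cannot express.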
Affiliation(s)
- Alberto Anguita, PhD
- Group of Biomedical Informatics, Universidad Politécnica de Madrid, Campus de Montegancedo s/n, 28660 Boadilla del Monte, Spain
4
Khan S, Bilal M. Bitmap Index in Ontology Mapping for Data Integration. Arabian Journal for Science and Engineering 2013. DOI: 10.1007/s13369-012-0373-4.
5
Adak S. e2eXpress: end-to-end bioinformatics and knowledge management system for microarrays. J Biol Syst 2012. DOI: 10.1142/s0218339002000664.
Abstract
The advent of high-density microarrays has made it possible for scientists to measure the expression levels of thousands of genes simultaneously. Understanding and interpreting the massive volumes of microarray data is necessary to unravel the molecular basis of diseases, and may someday lead to medicines tailored to individual genetic profiles. One of the main barriers to realizing the full potential of microarrays today is the need for the specialized bioinformatics and knowledge management solutions required to mine microarray data for biological information. After initial efforts at clustering expression data based on similarity, scientists have recognized the need to cross-reference and correlate experimental data with external data sources to improve the quality of the biological conclusions that can be drawn. This paper describes e2eXpress, an end-to-end bioinformatics and knowledge management system for microarrays. e2eXpress combines basic data management and analysis tasks with novel approaches for mining various molecular biological databases to summarize information about coregulated gene clusters. In particular, this paper describes two new algorithms: (a) text mining for gene clusters, a statistical algorithm aimed at deriving biologically relevant information for gene clusters from the biomedical literature; and (b) pathway scoring for gene clusters, a computational algorithm aimed at deriving pathway-related information for gene clusters. The paper describes the variety of statistical and computational algorithms required to mine the transcriptome in conjunction with external data sources, which can lead to real biological advances.
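The abstract does not specify the statistic behind its cluster-scoring algorithms, but a common choice for scoring whether an annotation term is over-represented in a gene cluster is the hypergeometric test, sketched here with invented counts; this is an illustration of the general technique, not the paper's own method.

```python
import math

def hypergeom_pvalue(k, K, n, N):
    """P(X >= k): probability of seeing at least k term-annotated genes
    when drawing a cluster of n genes from a genome of N genes, of which
    K carry the term."""
    total = math.comb(N, n)
    return sum(math.comb(K, i) * math.comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / total

# Invented example: 8 of 20 cluster genes carry a term that annotates
# only 50 of 6000 genes genome-wide.
p = hypergeom_pvalue(8, 50, 20, 6000)
print(f"{p:.2e}")  # very small p-value: the term is enriched in the cluster
```

A small p-value here means the term co-occurs with the cluster far more often than chance, which is the kind of "biologically relevant information" such scoring surfaces.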
Affiliation(s)
- Sudeshna Adak
- Information and Decision Technologies Lab, GE India Technology Center, EPIP Phase II, Hoodi Village, Whitefield Road, Bangalore, Karnataka 560066, India
6
Hwang D, Fotouhi F, Son Y. A case study: development of an organism-specific protein interaction database and its associated tools. Int J Coop Inf Syst 2012. DOI: 10.1142/s0218843003000723.
Abstract
In this paper, we describe the architecture of a protein interaction database and tools for manipulating Drosophila protein interaction data. The proposed system not only maintains interaction data collected by experiment, but also associates the interaction data with valuable data from various genomic databases. The system adopts a layered, modular architecture based on a wrapper-mediator approach, in order to resolve the syntactic and semantic heterogeneity among multiple data sources. The component modules for wrapping and integrating the relevant data, querying the database, and visualizing the interaction data among proteins are discussed. The system wrapped the relevant data for 14,000 Drosophila proteins from 5 publicly accessible sources. A web-based query interface was developed to browse the database, and a query result can be viewed as a protein interaction map depicting functional pathways, complexes, or networks. Protein interaction maps aid in understanding or predicting potential functions for uncharacterized proteins and in describing their functional networks in a biological context. We show that the proposed approach supports data association and data interoperability in a protein interaction database.
Affiliation(s)
- Doosung Hwang
- Department of Computer Science, Wayne State University, Detroit, MI 48202, USA
- Farshad Fotouhi
- Department of Computer Science, Wayne State University, Detroit, MI 48202, USA
- Youngju Son
- Department of Computer Science, Wayne State University, Detroit, MI 48202, USA
7
Wroe C, Stevens R, Goble C, Roberts A, Greenwood M. A suite of DAML+OIL ontologies to describe bioinformatics web services and data. Int J Coop Inf Syst 2012. DOI: 10.1142/s0218843003000711.
Abstract
The growing quantity and distribution of bioinformatics resources means that finding and utilizing them requires a great deal of expert knowledge, especially as many resources need to be tied together into a workflow to accomplish a useful goal. We want to formally capture at least some of this knowledge within a virtual workbench and middleware framework to assist a wider range of biologists in utilizing these resources. Different activities require different representations of knowledge. Finding or substituting a service within a workflow is often best supported by a classification. Marshalling and configuring services is best accomplished using a formal description. Both representations are highly interdependent, and maintaining consistency between the two by hand is difficult. We report on a description logic approach using the web ontology language DAML+OIL that uses property-based service descriptions. The ontology is founded on DAML-S to dynamically create service classifications. These classifications are then used to support semantic service matching and discovery in a large grid-based middleware project. We describe the extensions necessary to DAML-S in order to support bioinformatics service description; the utility of DAML+OIL in creating dynamic classifications based on formal descriptions; and the implementation of a DAML+OIL ontology service to support partial user-driven service matching and composition.
Affiliation(s)
- Chris Wroe
- Department of Computer Science, University of Manchester, Oxford Rd, Manchester, M13 9PL, UK
- Robert Stevens
- Department of Computer Science, University of Manchester, Oxford Rd, Manchester, M13 9PL, UK
- Carole Goble
- Department of Computer Science, University of Manchester, Oxford Rd, Manchester, M13 9PL, UK
- Angus Roberts
- Department of Computer Science, University of Manchester, Oxford Rd, Manchester, M13 9PL, UK
- Mark Greenwood
- Department of Computer Science, University of Manchester, Oxford Rd, Manchester, M13 9PL, UK
9
Vandervalk BP, McCarthy EL, Wilkinson MD. Moby and Moby 2: Creatures of the Deep (Web). Brief Bioinform 2009; 10:114-28. DOI: 10.1093/bib/bbn051.
10
O'Neill K, Garcia A, Schwegmann A, Jimenez RC, Jacobson D, Hermjakob H. OntoDas – a tool for facilitating the construction of complex queries to the Gene Ontology. BMC Bioinformatics 2008; 9:437. PMID: 18925933; PMCID: PMC2579441; DOI: 10.1186/1471-2105-9-437.
Abstract
Background Ontologies such as the Gene Ontology can enable the construction of complex queries over biological information in a conceptual way; however, existing systems for doing so are too technical. Within the biological domain there is an increasing need for software that facilitates the flexible retrieval of information. OntoDas aims to fulfil this need by allowing queries to be defined by selecting valid ontology terms. Results OntoDas is a web-based tool that uses information visualisation techniques to provide an intuitive, interactive environment for constructing ontology-based queries against the Gene Ontology database. Both a comprehensive use case and the interface itself were designed in a participatory manner by working with biologists to ensure that the interface matches the way biologists work. OntoDas was further tested with a separate group of biologists and refined based on their suggestions. Conclusion OntoDas provides a visual and intuitive means for constructing complex queries against the Gene Ontology. It was designed with the participation of biologists and compares favourably with similar tools. It is available at
11
Kanagasabai R, Choo KH, Ranganathan S, Baker CJO. A workflow for mutation extraction and structure annotation. J Bioinform Comput Biol 2008; 5:1319-37. PMID: 18172931; DOI: 10.1142/s0219720007003119.
Abstract
Rich information on point mutation studies is scattered across heterogeneous data sources. This paper presents an automated workflow for mining mutation annotations from full-text biomedical literature using natural language processing (NLP) techniques, and for their subsequent reuse in protein structure annotation and visualization. The system, called mSTRAP (Mutation extraction and STRucture Annotation Pipeline), is designed for both information aggregation and subsequent brokerage of the mutation annotations. It coordinates semantically related information from a series of text mining and sequence analysis steps into a formal OWL-DL ontology. The ontology is designed to support application-specific data management of sequence, structure, and literature annotations, which are populated as instances of object and datatype properties. mSTRAPviz is a subsystem that facilitates the brokerage of structure information and the associated mutations for visualization. For mutated sequences without a corresponding structure in the Protein Data Bank (PDB), an automated homology modeling pipeline generates a theoretical model. With mSTRAP, we demonstrate a workable system that automates the retrieval, extraction, processing, and visualization of mutation annotations, tasks which are well known to be tedious, time-consuming, complex, and error-prone. The ontology and visualization tool are available at http://datam.i2r.a-star.edu.sg/mstrap.
12
A bilateral integrative health-care knowledge service mechanism based on ‘MedGrid’. Comput Biol Med 2008; 38:446-60. DOI: 10.1016/j.compbiomed.2008.01.007.
13
Abstract
Background The development of e-Science presents a major set of opportunities and challenges for the future progress of biological and life science research. Major new tools are required, and corresponding demands are placed on the high-throughput data generated and used in these processes. Nowhere is the demand greater than in the semantic integration of these data. Semantic Web tools and technologies afford the chance to achieve this semantic integration. Since pathway knowledge is central to much of today's scientific research, it is a good test-bed for semantic integration. Within the context of biological pathways, the BioPAX initiative, part of a broader movement towards the standardization and integration of life science databases, forms a necessary prerequisite for the successful application of e-Science in health care and life science research. This paper examines whether BioPAX, an effort to overcome the barrier of disparate and heterogeneous pathway data sources, addresses the needs of e-Science. Results We demonstrate how BioPAX pathway data can be used to ask and answer some useful biological questions. We find that BioPAX comes close to meeting a broad range of e-Science needs, but certain semantic weaknesses mean that these goals are missed. We make a series of recommendations for re-modeling some aspects of BioPAX to better meet these needs. Conclusion Once these semantic weaknesses are addressed, it will be possible to integrate pathway information in a manner that would be useful in e-Science.
Affiliation(s)
- Joanne S Luciano
- Genetics Department, Harvard Medical School, 77 Avenue Louis Pasteur, Boston, MA 02115, USA
- School of Computer Science, Manchester University, Oxford Road, Manchester, M13 9PL, UK
- Robert D Stevens
- School of Computer Science, Manchester University, Oxford Road, Manchester, M13 9PL, UK
14
Kell DB. Systems biology, metabolic modelling and metabolomics in drug discovery and development. Drug Discov Today 2006; 11:1085-92. PMID: 17129827; DOI: 10.1016/j.drudis.2006.10.004.
Abstract
Unlike signalling pathways, metabolic networks are subject to strict stoichiometric constraints. Metabolomics amplifies changes in the proteome, and represents more closely the phenotype of an organism. Recent advances enable the production (and computer-readable encoding as SBML) of metabolic network models reconstructed from genome sequences, as well as experimental measurements of much of the metabolome. There is increasing convergence between the number of human metabolites estimated via genomics (approximately 3000) and the number measured experimentally. It is thus both timely, and now possible, to bring these two approaches together as an integrated (if distributed) whole to help understand the genesis of metabolic biomarkers, the progress of disease, and the modes of action, efficacy, off-target effects and toxicity of pharmaceutical drugs.
Affiliation(s)
- Douglas B Kell
- School of Chemistry, Faraday Building, The University of Manchester, PO Box 88, Manchester, M60 1QD, UK
15
Abstract
Effective information management in the pharmacogenomics discipline presents many unique challenges. Genetic and genomic data generated via high-throughput methods need to be integrated with phenotypic data, which are defined at multiple scales ranging from the molecular to the clinical level. The repositories storing these data are distributed and vary in syntax and semantics, which leads to issues of data exchange and integration. The application of the emerging semantic web offers a promising solution to these interoperability issues.
Affiliation(s)
- Hiten Vyas
- Loughborough University, Health Informatics Research Group, Research School of Informatics, LE11 3TU, UK
16
Kell DB. Theodor Bücher Lecture. Metabolomics, modelling and machine learning in systems biology - towards an understanding of the languages of cells. Delivered on 3 July 2005 at the 30th FEBS Congress and the 9th IUBMB Conference in Budapest. FEBS J 2006; 273:873-94. PMID: 16478464; DOI: 10.1111/j.1742-4658.2006.05136.x.
Abstract
The newly emerging field of systems biology involves a judicious interplay between high-throughput 'wet' experimentation, computational modelling and technology development, coupled to the world of ideas and theory. This interplay involves iterative cycles, such that systems biology is not at all confined to hypothesis-dependent studies; intelligent, principled, hypothesis-generating studies are of high importance and consequently very far from aimless fishing expeditions. I seek to illustrate each of these facets. Novel technology development in metabolomics can substantially increase the dynamic range and the number of metabolites that one can detect, and these can be exploited as disease markers and in the consequent and principled generation of hypotheses that are consistent with the data, achieved in a value-free manner. Much of classical biochemistry and signalling pathway analysis has concentrated on analysing changes in the concentrations of intermediates, with 'local' equations that describe individual steps, such as that of Michaelis and Menten, v = Vmax·S / (S + Km), being based solely on the instantaneous values of these concentrations. Recent work using single cells (which are not subject to the intellectually unsupportable averaging of the variables displayed by heterogeneous cells possessing nonlinear kinetics) has led to the recognition that some protein signalling pathways may encode their signals not (just) as concentrations (AM, or amplitude-modulated, in a radio analogy) but via changes in the dynamics of those concentrations (the signals are FM, or frequency-modulated). This contributes in principle to a straightforward solution of the crosstalk problem, leads to a profound reassessment of how to understand the downstream effects of dynamic changes in the concentrations of elements in these pathways, and stresses the role of signal processing (and not merely the intermediates) in biological signalling.
It is this signal processing that lies at the heart of understanding the languages of cells. The resolution of many of the modern and postgenomic problems of biochemistry requires the development of a myriad of new technologies (and maybe a new culture), and thus regular input from the physical sciences, engineering, mathematics and computer science. One solution, that we are adopting in the Manchester Interdisciplinary Biocentre (http://www.mib.ac.uk/) and the Manchester Centre for Integrative Systems Biology (http://www.mcisb.org/), is thus to colocate individuals with the necessary combinations of skills. Novel disciplines that require such an integrative approach continue to emerge. These include fields such as chemical genomics, synthetic biology, distributed computational environments for biological data and modelling, single cell diagnostics/bionanotechnology, and computational linguistics/text mining.
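The 'local' Michaelis-Menten equation cited in the abstract can be computed directly; the rate depends only on the instantaneous substrate concentration S. The parameter values below are arbitrary examples, not values from the lecture.

```python
def michaelis_menten(S, Vmax, Km):
    """Michaelis-Menten rate v = Vmax*S / (S + Km) for substrate
    concentration S, maximal rate Vmax, and Michaelis constant Km."""
    return Vmax * S / (S + Km)

Vmax, Km = 10.0, 2.0
print(michaelis_menten(2.0, Vmax, Km))    # at S = Km the rate is half-maximal
print(michaelis_menten(200.0, Vmax, Km))  # at S >> Km the rate approaches Vmax
```

The two printed points illustrate the equation's defining behaviour: half-saturation at S = Km and saturation toward Vmax, with no dependence on the history or dynamics of S, which is exactly the limitation the lecture contrasts with frequency-encoded signalling.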
Affiliation(s)
- Douglas B Kell
- School of Chemistry, Faraday Building, The University of Manchester, UK.
17
Bichutskiy VY, Colman R, Brachmann RK, Lathrop RH. Heterogeneous biomedical database integration using a hybrid strategy: a p53 cancer research database. Cancer Inform 2006. DOI: 10.1177/117693510600200021.
Abstract
Complex problems in life science research give rise to multidisciplinary collaboration, and hence to the need for heterogeneous database integration. The tumor suppressor p53 is mutated in close to 50% of human cancers, and a small drug-like molecule with the ability to restore native function to cancerous p53 mutants is a long-held goal of cancer treatment. The Cancer Research DataBase (CRDB) was designed in support of a project to find such small molecules. As a cancer informatics project, the CRDB involved small molecule data, computational docking results, functional assays, and protein structure data. As an example of the hybrid strategy for data integration, it combined the mediation and data warehousing approaches. This paper uses the CRDB to illustrate the hybrid strategy as a viable approach to heterogeneous data integration in biomedicine, and provides a design method for those considering similar systems. More efficient data sharing implies increased productivity and, hopefully, improved chances of success in cancer research. (Code and database schemas are freely downloadable at http://www.igb.uci.edu/research/research.html.)
Affiliation(s)
- Vadim Y. Bichutskiy
- Department of Computer Science, University of California, Irvine, California 92697, U.S.A
- Institute for Genomics and Bioinformatics, University of California, Irvine, California 92697, U.S.A
- Richard Colman
- Institute for Genomics and Bioinformatics, University of California, Irvine, California 92697, U.S.A
- Rainer K. Brachmann
- Department of Medicine, University of California, Irvine, California 92697, U.S.A
- Department of Biological Chemistry, University of California, Irvine, California 92697, U.S.A
- Department of Pathology, University of California, Irvine, California 92697, U.S.A
- Division of Hematology/Oncology, University of California, Irvine, California 92697, U.S.A
- Institute for Genomics and Bioinformatics, University of California, Irvine, California 92697, U.S.A
- Richard H. Lathrop
- Department of Computer Science, University of California, Irvine, California 92697, U.S.A
- Department of Biomedical Engineering, University of California, Irvine, California 92697, U.S.A
- Institute for Genomics and Bioinformatics, University of California, Irvine, California 92697, U.S.A
18
Gros PE, Hérisson J, Ferey N, Gherbi R. Combining applications and remote databases view in a common SQL distributed genomic database. Data Science Journal 2005. DOI: 10.2481/dsj.4.244.
20
Abstract
Interoperability is the most critical issue facing businesses that need to access information from multiple information systems. Our objective in this research is to develop a comprehensive framework and methodology to facilitate semantic interoperability among distributed and heterogeneous information systems. A comprehensive framework for managing various semantic conflicts is proposed. Our proposed framework provides a unified view of the underlying representational and reasoning formalism for the semantic mediation process. This framework is then used as a basis for automating the detection and resolution of semantic conflicts among heterogeneous information sources. We define several types of semantic mediators to achieve semantic interoperability. A domain-independent ontology is used to capture various semantic conflicts. A mediation-based query processing technique is developed to provide uniform and integrated access to the multiple heterogeneous databases. A usable prototype is implemented as a proof-of-concept for this work. Finally, the usefulness of our approach is evaluated using three cases in different application domains. Various heterogeneous datasets are used during the evaluation phase. The results of the evaluation suggest that correct identification and construction of both schema and ontology-schema mapping knowledge play very important roles in achieving interoperability at both the data and schema levels.
Affiliation(s)
- Sudha Ram
- The University of Arizona, Tucson, Arizona
21
Karasavvas KA, Baldock R, Burger A. Bioinformatics integration and agent technology. J Biomed Inform 2004; 37:205-19. [PMID: 15196484] [DOI: 10.1016/j.jbi.2004.04.003]
Abstract
Vast amounts of life sciences data are scattered around the world in a variety of heterogeneous data sources. The ability to correlate relevant information is fundamental to increasing the overall knowledge and understanding of a specific subject. Bioinformaticians aspire to find ways to integrate biological data sources for this purpose, and system integration is a very important research topic. The purpose of this paper is to provide an overview of important integration issues that should be considered when designing a bioinformatics integration system. The currently prevailing approach to integration is presented with examples of bioinformatics information systems together with their main characteristics. Here, we introduce agent technology and argue why it provides an appropriate solution for designing bioinformatics integration systems.
Affiliation(s)
- K A Karasavvas
- Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh EH14 4AS, UK.
22
Flexible Integration of Molecular-Biological Annotation Data: The GenMapper Approach. Advances in Database Technology - EDBT 2004. [DOI: 10.1007/978-3-540-24741-8_47]
23

24

25
Stevens R, Goble C, Paton NW, Bechhofer S, Ng G, Baker P, Brass A. Complex Query Formulation Over Diverse Information Sources in TAMBIS. Bioinformatics 2003. [DOI: 10.1016/b978-155860829-0/50009-7]
26
The Kleisli Query System as a Backbone for Bioinformatics Data Integration and Analysis. Bioinformatics 2003. [DOI: 10.1016/b978-155860829-0/50008-5]
27
Ludäscher B, Gupta A, Martone ME. A Model-Based Mediator System for Scientific Data Management. Bioinformatics 2003. [DOI: 10.1016/b978-155860829-0/50014-0]
28
Hancock WS, Wu SL, Stanley RR, Gombocz EA. Publishing large proteome datasets: scientific policy meets emerging technologies. Trends Biotechnol 2002; 20:S39-44. [PMID: 12570159] [DOI: 10.1016/s1471-1931(02)00205-7]
Abstract
Currently, there are various approaches to proteomic analyses based on either 2D gel or HPLC separation platforms, generating data of different formats, structures and types. Identification of these separated proteins or peptide fragments is typically achieved by mass spectrometry (MS) measurements that use either accurate mass measurements or fragmentation (MS-MS) information. Integrating the information generated from these different platforms is essential if proteomics is to succeed. A further challenge lies in generating standards that can accept the hundreds of thousands of mass spectra produced per analysis based on threshold or probability measurements. Finally, peer review and electronic publication processes will be crucial to the dissemination and use of proteomic information. Merging the policy requirements of data-intensive research with information technology will enable scientists to gain real value from global proteomics information.
29

30
Abstract
As the pace of biological research accelerates, biologists are becoming increasingly reliant on computers to manage the information explosion. Biologists communicate their research findings by relying on precise biological terms; these terms then provide indices into the literature and across the growing number of biological databases. This article examines emerging techniques to access biological resources through extraction of entity names and relations among them. Information extraction has been an active area of research in natural language processing and there are promising results for information extraction applied to news stories, e.g., balanced precision and recall in the 93-95% range for identifying person, organization and location names. But these results do not seem to transfer directly to biological names, where results remain in the 75-80% range. Multiple factors may be involved, including absence of shared training and test sets for rigorous measures of progress, lack of annotated training data specific to biological tasks, pervasive ambiguity of terms, frequent introduction of new terms, and a mismatch between evaluation tasks as defined for news and real biological problems. We present evidence from a simple lexical matching exercise that illustrates some specific problems encountered when identifying biological names. We conclude by outlining a research agenda to raise performance of named entity tagging to a level where it can be used to perform tasks of biological importance.
Affiliation(s)
- Lynette Hirschman
- The MITRE Corporation, MS K312, 202 Burlington Rd., Bedford, MA 01730, USA.
31
Stevens R, Goble C, Horrocks I, Bechhofer S. Building a bioinformatics ontology using OIL. IEEE Trans Inf Technol Biomed 2002; 6:135-41. [PMID: 12075668] [DOI: 10.1109/titb.2002.1006301]
Abstract
This paper describes the initial stages of building an ontology of bioinformatics and molecular biology. The conceptualization is encoded using the Ontology Inference Layer (OIL), a knowledge representation language that combines the modeling style of frame-based systems with the expressiveness and reasoning power of description logics (DLs). This paper is the second of a pair in this special issue. The first described the core of the OIL language and the need to use ontologies to deliver semantic bioinformatics resources. In this paper, the early stages of building an ontology component of a bioinformatics resource querying application are described. This ontology (TaO) holds the information about molecular biology represented in bioinformatics resources and the bioinformatics tasks performed over these resources. It therefore represents the metadata of the resources the application can query. It also manages the terminologies used in constructing the query plans used to retrieve instances from those external resources. The methodology used in this task capitalizes upon three features of OIL: the conceptualization afforded by the frame-based view of OIL's syntax; the expressive power and reasoning of the logical formalism; and the ability both to encode handcrafted hierarchies of concepts and to define concepts in terms of their properties, which can then be used to establish a classification and infer relationships not encoded by the ontologist. This ability forms the basis of the methodology described here: for each portion of the TaO, a basic framework of concepts is asserted by the ontologist. Then, the properties of these concepts are defined by the ontologist and the logic's reasoning power is used to reclassify and infer further relationships. This cycle of elaboration and refinement is iterated on each portion of the ontology until a satisfactory ontology has been created.
Affiliation(s)
- Robert Stevens
- Department of Computer Science, University of Manchester, UK.
32
Bechhofer S, Horrocks I, Goble C, Stevens R. OilEd: A Reason-able Ontology Editor for the Semantic Web. KI 2001: Advances in Artificial Intelligence 2001. [DOI: 10.1007/3-540-45422-5_28]