1
Schultes E, Roos M, Bonino da Silva Santos LO, Guizzardi G, Bouwman J, Hankemeier T, Baak A, Mons B. FAIR Digital Twins for Data-Intensive Research. Front Big Data 2022; 5:883341. [PMID: 35647536 PMCID: PMC9130601 DOI: 10.3389/fdata.2022.883341] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 02/24/2022] [Accepted: 04/12/2022] [Indexed: 11/13/2022] Open
Abstract
Although all the technical components supporting fully orchestrated Digital Twins (DT) currently exist, what remains missing is a conceptual clarification and analysis of a more generalized concept of a DT that is made FAIR, that is, universally machine actionable. This methodological overview is a first step toward this clarification. We present a review of previously developed semantic artifacts and how they may be used to compose a higher-order data model referred to here as a FAIR Digital Twin (FDT). We propose an architectural design to compose, store and reuse FDTs supporting data intensive research, with emphasis on privacy by design and their use in GDPR compliant open science.
2
Theodosiou T, Papanikolaou N, Savvaki M, Bonetto G, Maxouri S, Fakoureli E, Eliopoulos AG, Tavernarakis N, Amoutzias GD, Pavlopoulos GA, Aivaliotis M, Nikoletopoulou V, Tzamarias D, Karagogeos D, Iliopoulos I. UniProt-Related Documents (UniReD): assisting wet lab biologists in their quest on finding novel counterparts in a protein network. NAR Genom Bioinform 2020; 2:lqaa005. [PMID: 33575553 PMCID: PMC7671407 DOI: 10.1093/nargab/lqaa005] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 08/02/2019] [Revised: 01/20/2020] [Accepted: 01/31/2020] [Indexed: 02/04/2023] Open
Abstract
The in-depth study of protein–protein interactions (PPIs) is of key importance for understanding how cells operate. Therefore, in the past few years, many experimental as well as computational approaches have been developed for the identification and discovery of such interactions. Here, we present UniReD, a user-friendly, computational prediction tool which analyses biomedical literature in order to extract known protein associations and suggest undocumented ones. As a proof of concept, we demonstrate its usefulness by experimentally validating six predicted interactions and by benchmarking it against public databases of experimentally validated PPIs, achieving high coverage. We believe that UniReD can become an important and intuitive resource for experimental biologists in their quest for finding novel associations within a protein network and a useful tool to complement experimental approaches (e.g. mass spectrometry) by producing sorted lists of candidate proteins for further experimental validation. UniReD is available at http://bioinformatics.med.uoc.gr/unired/
Affiliation(s)
- Theodosios Theodosiou: University of Crete, School of Medicine, Department of Basic Sciences, Heraklion 71003, Crete, Greece
- Nikolaos Papanikolaou: University of Crete, School of Medicine, Department of Basic Sciences, Heraklion 71003, Crete, Greece
- Maria Savvaki: University of Crete, School of Medicine, Department of Basic Sciences, Heraklion 71003, Crete, Greece; Institute of Molecular Biology and Biotechnology, Foundation for Research and Technology-Hellas, Nikolaou Plastira 100, 70013 Heraklion, Crete, Greece
- Giulia Bonetto: University of Crete, School of Medicine, Department of Basic Sciences, Heraklion 71003, Crete, Greece
- Stella Maxouri: University of Crete, School of Medicine, Department of Basic Sciences, Heraklion 71003, Crete, Greece; Medical School of Patras University, Laboratory of General Biology, Asklipiou 1, 26500 Rio Patras, Greece
- Eirini Fakoureli: University of Crete, School of Medicine, Department of Basic Sciences, Heraklion 71003, Crete, Greece
- Aristides G Eliopoulos: Department of Biology, Medical School, National and Kapodistrian University of Athens, Mikras Asias 75, 11527 Athens, Greece
- Nektarios Tavernarakis: University of Crete, School of Medicine, Department of Basic Sciences, Heraklion 71003, Crete, Greece; Institute of Molecular Biology and Biotechnology, Foundation for Research and Technology-Hellas, Nikolaou Plastira 100, 70013 Heraklion, Crete, Greece
- Grigoris D Amoutzias: Bioinformatics Laboratory, Department of Biochemistry and Biotechnology, University of Thessaly, Larisa 41500, Greece
- Georgios A Pavlopoulos: Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", 34 Fleming Street, 16672 Vari, Greece
- Michalis Aivaliotis: Institute of Molecular Biology and Biotechnology, Foundation for Research and Technology-Hellas, Nikolaou Plastira 100, 70013 Heraklion, Crete, Greece; Laboratory of Biological Chemistry, Faculty of Health Sciences, School of Medicine, Aristotle University of Thessaloniki, GR-54124, Thessaloniki, Greece; Functional Proteomics and Systems Biology (FunPATh), Center for Interdisciplinary Research and Innovation (CIRI-AUTH), Balkan Center, Thessaloniki, 10th km Thessaloniki-Thermi Rd, P.O.Box 8318, GR 57001, Greece
- Vasiliki Nikoletopoulou: Institute of Molecular Biology and Biotechnology, Foundation for Research and Technology-Hellas, Nikolaou Plastira 100, 70013 Heraklion, Crete, Greece
- Dimitris Tzamarias: Institute of Molecular Biology and Biotechnology, Foundation for Research and Technology-Hellas, Nikolaou Plastira 100, 70013 Heraklion, Crete, Greece
- Domna Karagogeos: University of Crete, School of Medicine, Department of Basic Sciences, Heraklion 71003, Crete, Greece; Institute of Molecular Biology and Biotechnology, Foundation for Research and Technology-Hellas, Nikolaou Plastira 100, 70013 Heraklion, Crete, Greece
- Ioannis Iliopoulos: University of Crete, School of Medicine, Department of Basic Sciences, Heraklion 71003, Crete, Greece
3
Hatz S, Spangler S, Bender A, Studham M, Haselmayer P, Lacoste AMB, Willis VC, Martin RL, Gurulingappa H, Betz U. Identification of pharmacodynamic biomarker hypotheses through literature analysis with IBM Watson. PLoS One 2019; 14:e0214619. [PMID: 30958864 PMCID: PMC6453528 DOI: 10.1371/journal.pone.0214619] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 12/21/2018] [Accepted: 03/16/2019] [Indexed: 12/12/2022] Open
Abstract
BACKGROUND: Pharmacodynamic biomarkers are becoming increasingly valuable for assessing drug activity and target modulation in clinical trials. However, identifying quality biomarkers is challenging due to the increasing volume and heterogeneity of relevant data describing the biological networks that underlie disease mechanisms. A biological pathway network typically includes entities (e.g. genes, proteins and chemicals/drugs) as well as the relationships between these and is typically curated or mined from structured databases and textual co-occurrence data. We propose a hybrid Natural Language Processing and directed relationships-based network analysis approach using IBM Watson for Drug Discovery to rank all human genes and identify potential candidate biomarkers, requiring only an initial determination of a specific target-disease relationship.

METHODS: Through natural language processing of scientific literature, Watson for Drug Discovery creates a network of semantic relationships between biological concepts such as genes, drugs, and diseases. Using Bruton's tyrosine kinase as a case study, Watson for Drug Discovery's automatically extracted relationship network was compared with a prominent manually curated physical interaction network. Additionally, potential biomarkers for Bruton's tyrosine kinase inhibition were predicted using a matrix factorization approach and subsequently compared with expert-generated biomarkers.

RESULTS: Watson's natural language processing generated a relationship network matching 55 (86%) genes upstream of BTK and 98 (95%) genes downstream of Bruton's tyrosine kinase in a prominent manually curated physical interaction network. Matrix factorization analysis predicted 11 of 13 genes identified by Merck subject matter experts in the top 20% of Watson for Drug Discovery's 13,595 ranked genes, with 7 in the top 5%.

CONCLUSION: Taken together, these results suggest that Watson for Drug Discovery's automatic relationship network identifies the majority of upstream and downstream genes in biological pathway networks and can be used to help with the identification and prioritization of pharmacodynamic biomarker evaluation, accelerating the early phases of disease hypothesis generation.
Affiliation(s)
- Sonja Hatz: Merck KGaA, Frankfurter Straße, Darmstadt, Germany
- Scott Spangler: IBM Watson Health, Almaden, California, United States of America
- Andrew Bender: EMD Serono, Middlesex Turnpike, Billerica, United States of America
- Matthew Studham: EMD Serono, Middlesex Turnpike, Billerica, United States of America
- Van C. Willis: IBM Watson Health, Cambridge, Massachusetts, United States of America
- Richard L. Martin: IBM Watson Health, Cambridge, Massachusetts, United States of America
- Ulrich Betz: Merck KGaA, Frankfurter Straße, Darmstadt, Germany
4
Mons B. FAIR Science for Social Machines: Let's Share Metadata Knowlets in the Internet of FAIR Data and Services. DATA INTELLIGENCE 2019. [DOI: 10.1162/dint_a_00002] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Indexed: 01/08/2023] Open
Abstract
In a world awash with fragmented data and tools, the notion of Open Science has been gaining a lot of momentum, but simultaneously, it has caused a great deal of anxiety. Some of the anxiety may be related to crumbling kingdoms, but there are also very legitimate concerns, especially about the relative role of machines and algorithms as compared to humans and the combination of both (i.e., social machines). There are also grave concerns about the connotations of the term “open”, as well as about the unwanted side effects and the scalability of the approaches advocated by early adopters of new methodological developments. Many of these concerns are associated with mind-machine interaction and the critical role that computers are now playing in our day to day scientific practice. Here we address a number of these concerns and provide some possible solutions. FAIR (machine-actionable) data and services are obviously at the core of Open Science (or rather FAIR science). The scalable and transparent routing of data, tools and compute (to run the tools on) is a key central feature of the envisioned Internet of FAIR Data and Services (IFDS). The European Commission in its Declaration on the European Open Science Cloud, the G7, and the USA data commons have all identified the need to ensure a solid and sustainable infrastructure for Open Science. Here we first define the term FAIR science as opposed to Open Science. In FAIR science, data and the associated tools are all Findable, Accessible under well defined conditions, Interoperable and Reusable, but not necessarily “open” without restrictions, and certainly not always “gratis”. The ambiguous term “open” has already caused considerable confusion and also opt-out reactions from researchers and other data-intensive professionals who cannot make their data open for very good reasons, such as patient privacy or national security.
Although Open Science is a definition for a way of working rather than explicitly requiring all data to be available in full Open Access, the connotation of openness of the data involved in Open Science is very strong. In FAIR science, data and the associated services to run all processes in the data stewardship cycle from design of experiment to capture to curation, processing, linking and analytics all have minimally FAIR metadata, which specify the conditions under which the actual underlying research objects are reusable, first for machines and then also for humans. This effectively means that—properly conducted—Open Science is part of FAIR science. However, FAIR science can also be done with partly closed, sensitive and proprietary data. As has been emphasized before, FAIR is not identical to “open”. In FAIR/Open Science, data should be as open as possible and as closed as necessary. Where data are generated using public funding, the default will usually be that for the FAIR data resulting from the study the accessibility will be as high as possible, and that more restrictive access and licensing policies on these data will have to be explicitly justified and described. In all cases, however, even if the reuse is restricted, data and related services should be findable for their major users, machines, which will make them also much better findable for human users. With a tendency to make good data stewardship the norm, a very significant new market for distributed data analytics and learning is opening and a plethora of tools and reusable data objects are being developed and released. These all need FAIR metadata to be routed to each other and to be effective.
Affiliation(s)
- Barend Mons: Leiden University Medical Centre, The Netherlands, Poortgebouw N-01, Rijnsburgerweg 10 2333 AA Leiden, The Netherlands
5
Botsis T, Foster M, Kreimeyer K, Pandey A, Forshee R. Monitoring biomedical literature for post-market safety purposes by analyzing networks of text-based coded information. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE PROCEEDINGS. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE 2017; 2017:66-75. [PMID: 28815108 PMCID: PMC5543357] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
Abstract
Literature review is critical but time-consuming in the post-market surveillance of medical products. We focused on the safety signal of intussusception after the vaccination of infants with the Rotashield Vaccine in 1999 and retrieved all PubMed abstracts for rotavirus vaccines published after January 1, 1998. We used the Event-based Text-mining of Health Electronic Records system, the MetaMap tool, and the National Center for Biomedical Ontologies Annotator to process the abstracts and generate coded terms stamped with the date of publication. Data were analyzed in the Pattern-based and Advanced Network Analyzer for Clinical Evaluation and Assessment to evaluate the intussusception-related findings before and after the release of the new rotavirus vaccines in 2006. The tight connection of intussusception with the historical signal in the first period and the absence of any safety concern for the new vaccines in the second period were verified. We demonstrated the feasibility of semi-automated solutions that may assist medical reviewers in monitoring biomedical literature.
Affiliation(s)
- Taxiarchis Botsis: Office of Biostatistics and Epidemiology, Center for Biologics Evaluation & Research, US Food and Drug Administration, Silver Spring, MD
- Matthew Foster: Office of Biostatistics and Epidemiology, Center for Biologics Evaluation & Research, US Food and Drug Administration, Silver Spring, MD
- Kory Kreimeyer: Office of Biostatistics and Epidemiology, Center for Biologics Evaluation & Research, US Food and Drug Administration, Silver Spring, MD
- Abhishek Pandey: Office of Biostatistics and Epidemiology, Center for Biologics Evaluation & Research, US Food and Drug Administration, Silver Spring, MD
- Richard Forshee: Office of Biostatistics and Epidemiology, Center for Biologics Evaluation & Research, US Food and Drug Administration, Silver Spring, MD
6
Abstract
Literature-based discovery systems aim at discovering valuable latent connections between previously disparate research areas. This is achieved by analyzing the contents of their respective literatures with the help of various intelligent computational techniques. In this paper, we review the progress of literature-based discovery research, focusing on understanding their technical features and evaluating their performance. The present literature-based discovery techniques can be divided into two general approaches: the traditional approach and the emerging approach. The traditional approach, which dominates the current research landscape, consists mainly of techniques that rely on lexical statistics, knowledge-based and visualization methods to address literature-based discovery problems. On the other hand, we have also observed the birth of new trends and unprecedented paradigm shifts within the recently emerging literature-based discovery approach. These trends are likely to shape the future trajectory of the next generation of literature-based discovery systems.
7
Hettne KM, Thompson M, van Haagen HHHBM, van der Horst E, Kaliyaperumal R, Mina E, Tatum Z, Laros JFJ, van Mulligen EM, Schuemie M, Aten E, Li TS, Bruskiewich R, Good BM, Su AI, Kors JA, den Dunnen J, van Ommen GJB, Roos M, ‘t Hoen PA, Mons B, Schultes EA. The Implicitome: A Resource for Rationalizing Gene-Disease Associations. PLoS One 2016; 11:e0149621. [PMID: 26919047 PMCID: PMC4769089 DOI: 10.1371/journal.pone.0149621] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Received: 09/22/2015] [Accepted: 02/03/2016] [Indexed: 11/19/2022] Open
Abstract
High-throughput experimental methods such as medical sequencing and genome-wide association studies (GWAS) identify increasingly large numbers of potential relations between genetic variants and diseases. Both biological complexity (millions of potential gene-disease associations) and the accelerating rate of data production necessitate computational approaches to prioritize and rationalize potential gene-disease relations. Here, we use concept profile technology to expose from the biomedical literature both explicitly stated gene-disease relations (the explicitome) and a much larger set of implied gene-disease associations (the implicitome). Implicit relations are largely unknown to, or are even unintended by the original authors, but they vastly extend the reach of existing biomedical knowledge for identification and interpretation of gene-disease associations. The implicitome can be used in conjunction with experimental data resources to rationalize both known and novel associations. We demonstrate the usefulness of the implicitome by rationalizing known and novel gene-disease associations, including those from GWAS. To facilitate the re-use of implicit gene-disease associations, we publish our data in compliance with FAIR Data Publishing recommendations [https://www.force11.org/group/fairgroup] using nanopublications. An online tool (http://knowledge.bio) is available to explore established and potential gene-disease associations in the context of other biomedical relations.
Affiliation(s)
- Kristina M. Hettne: Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- Mark Thompson: Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- Eelke van der Horst: Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- Rajaram Kaliyaperumal: Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- Eleni Mina: Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- Zuotian Tatum: Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- Jeroen F. J. Laros: Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- Erik M. van Mulligen: Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands; Department of Medical Informatics, Erasmus University Medical Center Rotterdam, Rotterdam, The Netherlands
- Martijn Schuemie: Department of Medical Informatics, Erasmus University Medical Center Rotterdam, Rotterdam, The Netherlands
- Emmelien Aten: Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- Tong Shu Li: Department of Molecular and Experimental Medicine, The Scripps Research Institute, La Jolla, CA, United States of America
- Benjamin M. Good: Department of Molecular and Experimental Medicine, The Scripps Research Institute, La Jolla, CA, United States of America
- Andrew I. Su: Department of Molecular and Experimental Medicine, The Scripps Research Institute, La Jolla, CA, United States of America
- Jan A. Kors: Department of Medical Informatics, Erasmus University Medical Center Rotterdam, Rotterdam, The Netherlands
- Johan den Dunnen: Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- Gert-Jan B. van Ommen: Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- Marco Roos: Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- Peter A.C. ‘t Hoen: Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- Barend Mons: Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands; Dutch Techcentre for Life Sciences, Utrecht, The Netherlands
- Erik A. Schultes: Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands; Leiden Institute for Advanced Computer Science, Leiden, The Netherlands
8
Laukens K, Naulaerts S, Berghe WV. Bioinformatics approaches for the functional interpretation of protein lists: from ontology term enrichment to network analysis. Proteomics 2015; 15:981-96. [PMID: 25430566 DOI: 10.1002/pmic.201400296] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.0] [Received: 06/30/2014] [Revised: 10/16/2014] [Accepted: 11/24/2014] [Indexed: 12/24/2022]
Abstract
The main result of a great deal of the published proteomics studies is a list of identified proteins, which then needs to be interpreted in relation to the research question and existing knowledge. In the early days of proteomics this interpretation was only based on expert insights, acquired by digesting a large amount of relevant literature. With the growing size and complexity of the experimental datasets, many computational techniques, databases, and tools have claimed a central role in this task. In this review we discuss commonly and less commonly used methods to functionally interpret experimental proteome lists and compare them with available knowledge. We first address several functional analysis and enrichment techniques based on ontologies and literature. Then we outline how various types of network and pathway information can be used. While the problem of functional interpretation of proteome data is to an extent equivalent to the interpretation of transcriptome or other "omics" data, this paper addresses some of the specific challenges and solutions of the proteomics field.
Affiliation(s)
- Kris Laukens: Department of Mathematics and Computer Science, University of Antwerp, Middelheimlaan, Antwerp, Belgium; Biomedical Informatics Research Center Antwerp (biomina), University of Antwerp / Antwerp University Hospital, Antwerp, Belgium
9
Mina E, Thompson M, Kaliyaperumal R, Zhao J, der Horst VE, Tatum Z, Hettne KM, Schultes EA, Mons B, Roos M. Nanopublications for exposing experimental data in the life-sciences: a Huntington's Disease case study. J Biomed Semantics 2015; 6:5. [PMID: 26464783 PMCID: PMC4603842 DOI: 10.1186/2041-1480-6-5] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 05/31/2014] [Accepted: 10/31/2014] [Indexed: 12/20/2022] Open
Abstract
High throughput experiments often produce far more results than can ever appear in the main text or tables of a single research article. In these cases, the majority of new associations are often archived either as supplemental information in an arbitrary format or in publisher-independent databases that can be difficult to find. These data are not only lost from scientific discourse, but are also elusive to automated search, retrieval and processing. Here, we use the nanopublication model to make scientific assertions that were concluded from a workflow analysis of Huntington’s Disease data machine-readable, interoperable, and citable. We followed the nanopublication guidelines to semantically model our assertions as well as their provenance metadata and authorship. We demonstrate interoperability by linking nanopublication provenance to the Research Object model. These results indicate that nanopublications can provide an incentive for researchers to expose data that are interoperable and machine-readable for future use and preservation, and for which they can receive credit for their effort. Nanopublications can play a leading role in hypothesis generation, offering opportunities for large-scale data integration.
Affiliation(s)
- Eleni Mina: Department of Human Genetics, Leiden University Medical Center, PO Box 9600, 2300 RC Leiden, The Netherlands
- Mark Thompson: Department of Human Genetics, Leiden University Medical Center, PO Box 9600, 2300 RC Leiden, The Netherlands
- Rajaram Kaliyaperumal: Department of Human Genetics, Leiden University Medical Center, PO Box 9600, 2300 RC Leiden, The Netherlands
- Jun Zhao: Department of Zoology, University of Oxford, Oxford, UK
- Eelke van der Horst: Department of Human Genetics, Leiden University Medical Center, PO Box 9600, 2300 RC Leiden, The Netherlands
- Zuotian Tatum: Department of Human Genetics, Leiden University Medical Center, PO Box 9600, 2300 RC Leiden, The Netherlands
- Kristina M Hettne: Department of Human Genetics, Leiden University Medical Center, PO Box 9600, 2300 RC Leiden, The Netherlands
- Erik A Schultes: Department of Human Genetics, Leiden University Medical Center, PO Box 9600, 2300 RC Leiden, The Netherlands
- Barend Mons: Department of Human Genetics, Leiden University Medical Center, PO Box 9600, 2300 RC Leiden, The Netherlands
- Marco Roos: Department of Human Genetics, Leiden University Medical Center, PO Box 9600, 2300 RC Leiden, The Netherlands
10
Wang J, Zuo Y, Man YG, Avital I, Stojadinovic A, Liu M, Yang X, Varghese RS, Tadesse MG, Ressom HW. Pathway and network approaches for identification of cancer signature markers from omics data. J Cancer 2015; 6:54-65. [PMID: 25553089 PMCID: PMC4278915 DOI: 10.7150/jca.10631] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.4] [Received: 09/24/2014] [Accepted: 11/14/2014] [Indexed: 12/12/2022] Open
Abstract
The advancement of high throughput omic technologies during the past few years has made it possible to perform many complex assays in a much shorter time than the traditional approaches. The rapid accumulation and wide availability of omic data generated by these technologies offer great opportunities to unravel disease mechanisms, but also present significant challenges in extracting knowledge from such massive data and evaluating the findings. To address these challenges, a number of pathway and network based approaches have been introduced. This review article evaluates these methods and discusses their application in cancer biomarker discovery using hepatocellular carcinoma (HCC) as an example.
Affiliation(s)
- Jinlian Wang: Lombardi Comprehensive Cancer Center, Georgetown University, Washington, DC, USA; Genetics and Genomics Science, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Yiming Zuo: Lombardi Comprehensive Cancer Center, Georgetown University, Washington, DC, USA; Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA, USA
- Yan-gao Man: Bon Secours Cancer Institute, Richmond VA, USA
- Alexander Stojadinovic: Bon Secours Cancer Institute, Richmond VA, USA; Division of Surgical Oncology, Walter Reed National Military Medical Center, Bethesda, MD, USA
- Meng Liu: Department of Public Health School of Hunter College, City University of New York, NYC, USA
- Xiaowei Yang: Department of Public Health School of Hunter College, City University of New York, NYC, USA
- Rency S. Varghese: Lombardi Comprehensive Cancer Center, Georgetown University, Washington, DC, USA
- Mahlet G Tadesse: Department of Mathematics and Statistics, Georgetown University, Washington DC, USA
- Habtom W Ressom: Lombardi Comprehensive Cancer Center, Georgetown University, Washington, DC, USA
11
Protein-protein interaction predictions using text mining methods. Methods 2014; 74:47-53. [PMID: 25448298 DOI: 10.1016/j.ymeth.2014.10.026] [Citation(s) in RCA: 46] [Impact Index Per Article: 4.6] [Received: 04/02/2014] [Revised: 09/05/2014] [Accepted: 10/21/2014] [Indexed: 01/10/2023] Open
Abstract
It is beyond any doubt that proteins and their interactions play an essential role in most complex biological processes. Understanding their function, both individually and in the form of protein complexes, is of great importance. Nowadays, despite the plethora of high-throughput experimental approaches for detecting protein-protein interactions, many computational methods aiming to predict new interactions have appeared and gained interest. In this review, we focus on text-mining based computational methodologies, aiming to extract information for proteins and their interactions from public repositories such as literature and various biological databases. We discuss their strengths, their weaknesses and how they complement existing experimental techniques, while also commenting on the biological databases which hold such information and the benchmark datasets that can be used for evaluating new tools.
12
Borland AM, Hartwell J, Weston DJ, Schlauch KA, Tschaplinski TJ, Tuskan GA, Yang X, Cushman JC. Engineering crassulacean acid metabolism to improve water-use efficiency. TRENDS IN PLANT SCIENCE 2014; 19:327-38. [PMID: 24559590 PMCID: PMC4065858 DOI: 10.1016/j.tplants.2014.01.006] [Citation(s) in RCA: 122] [Impact Index Per Article: 12.2] [Received: 10/06/2013] [Revised: 01/01/2014] [Accepted: 01/13/2014] [Indexed: 05/19/2023]
Abstract
Climatic extremes threaten agricultural sustainability worldwide. One approach to increase plant water-use efficiency (WUE) is to introduce crassulacean acid metabolism (CAM) into C3 crops. Such a task requires comprehensive systems-level understanding of the enzymatic and regulatory pathways underpinning this temporal CO2 pump. Here we review the progress that has been made in achieving this goal. Given that CAM arose through multiple independent evolutionary origins, comparative transcriptomics and genomics of taxonomically diverse CAM species are being used to define the genetic 'parts list' required to operate the core CAM functional modules of nocturnal carboxylation, diurnal decarboxylation, and inverse stomatal regulation. Engineered CAM offers the potential to sustain plant productivity for food, feed, fiber, and biofuel production in hotter and drier climates.
Affiliation(s)
- Anne M Borland
- School of Biology, Newcastle University, Newcastle upon Tyne NE1 7RU, UK; Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831-6407, USA
- James Hartwell
- Department of Plant Sciences, Institute of Integrative Biology, University of Liverpool, Liverpool L69 7ZB, UK
- David J Weston
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831-6407, USA
- Karen A Schlauch
- Department of Biochemistry and Molecular Biology, MS330, University of Nevada, Reno, NV 89557-0330, USA
- Gerald A Tuskan
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831-6407, USA
- Xiaohan Yang
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831-6407, USA
- John C Cushman
- Department of Biochemistry and Molecular Biology, MS330, University of Nevada, Reno, NV 89557-0330, USA
|
13
|
Dumontier M, Baker CJ, Baran J, Callahan A, Chepelev L, Cruz-Toledo J, Del Rio NR, Duck G, Furlong LI, Keath N, Klassen D, McCusker JP, Queralt-Rosinach N, Samwald M, Villanueva-Rosales N, Wilkinson MD, Hoehndorf R. The Semanticscience Integrated Ontology (SIO) for biomedical research and knowledge discovery. J Biomed Semantics 2014; 5:14. [PMID: 24602174 PMCID: PMC4015691 DOI: 10.1186/2041-1480-5-14] [Citation(s) in RCA: 77] [Impact Index Per Article: 7.7] [Received: 07/02/2013] [Accepted: 02/02/2014] [Indexed: 11/10/2022] Open
Abstract
The Semanticscience Integrated Ontology (SIO) is an ontology to facilitate biomedical knowledge discovery. SIO features a simple upper level comprised of essential types and relations for the rich description of arbitrary (real, hypothesized, virtual, fictional) objects, processes and their attributes. SIO specifies simple design patterns to describe and associate qualities, capabilities, functions, quantities, and informational entities including textual, geometrical, and mathematical entities, and provides specific extensions in the domains of chemistry, biology, biochemistry, and bioinformatics. SIO provides an ontological foundation for the Bio2RDF linked data for the life sciences project and is used for semantic integration and discovery for SADI-based semantic web services. SIO is freely available to all users under a creative commons by attribution license. See website for further information: http://sio.semanticscience.org.
Affiliation(s)
- Michel Dumontier
- Center for Biomedical Informatics Research, Stanford University, Stanford, California, USA.
|
14
|
Abstract
Purpose
– The aim of this paper is to reposition the research library in the context of the changing information and knowledge architecture at the end of the “Gutenberg Parenthesis” and as part of the rapidly emerging “semantic” environment of the Linked Open Data paradigm. Understanding this process requires a good understanding of the evolution of the “document” notion in the passage from print based culture to the distributed hypertextual and RDF based information architecture of the WWW.
Design/methodology/approach
– These objectives are reached using literature study and a descriptive historical approach as well as text mining techniques using Google nGrams as a data source.
Findings
– The paper presents a proposal for effectively repositioning research libraries in the context of eScience and eScholarship as well as clear indications of the proposed repositioning already taking place. Furthermore, a new perspective of the “document” notion is provided.
Practical implications
– The evolution described in the contribution creates opportunities for libraries to reposition themselves as aggregators and selectors of content and as contextualising agents as part of future Linked Data based scholarly research environments provided they are able and ready to operate the related cultural changes.
Originality/value
– The paper will be useful for practitioners in search of strategic guidance for repositioning their librarian institutions in a context of ever increasing competition for scarce funding resources.
|
15
|
Coelho ED, Arrais JP, Matos S, Pereira C, Rosa N, Correia MJ, Barros M, Oliveira JL. Computational prediction of the human-microbial oral interactome. BMC Systems Biology 2014; 8:24. [PMID: 24576332 PMCID: PMC3975954 DOI: 10.1186/1752-0509-8-24] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.2] [Received: 08/27/2013] [Accepted: 02/17/2014] [Indexed: 11/12/2022]
Abstract
BACKGROUND The oral cavity is a complex ecosystem where human chemical compounds coexist with a particular microbiota. However, shifts in the normal composition of this microbiota may result in the onset of oral ailments, such as periodontitis and dental caries. In addition, it is known that the microbial colonization of the oral cavity is mediated by protein-protein interactions (PPIs) between the host and microorganisms. Nevertheless, PPIs of this kind are still largely unknown. To elucidate these interactions, we have created a computational prediction method that allows us to obtain a first model of the Human-Microbial oral interactome. RESULTS We collected high-quality experimental PPIs from five major human databases. The obtained PPIs were used to create our positive dataset and, indirectly, our negative dataset. The positive and negative datasets were merged and used for training and validation of a naïve Bayes classifier. For the final prediction model, we used an ensemble methodology combining five distinct PPI prediction techniques, namely: literature mining, primary protein sequences, orthologous profiles, biological process similarity, and domain interactions. Performance evaluation of our method revealed an area under the ROC-curve (AUC) value greater than 0.926, supporting our primary hypothesis, as no single set of features reached an AUC greater than 0.877. After subjecting our dataset to the prediction model, the classified result was filtered for very high confidence PPIs (probability ≥ 1 − 10⁻⁷), leading to a set of 46,579 PPIs to be further explored. CONCLUSIONS We believe this dataset holds not only important pathways involved in the onset of infectious oral diseases, but also potential drug-targets and biomarkers. The dataset used for training and validation, the predictions obtained and the final network are available at http://bioinformatics.ua.pt/software/oralint.
Affiliation(s)
- Edgar D Coelho
- Department of Electronics, Telecommunications and Informatics (DETI), Institute of Electronics and Telematics Engineering of Aveiro (IEETA), University of Aveiro, Aveiro, Portugal
- Joel P Arrais
- Department of Informatics Engineering (DEI), University of Coimbra, Coimbra, Portugal
- Centre for Informatics and Systems of the University at Coimbra (CISUC), University of Coimbra, Coimbra, Portugal
- Sérgio Matos
- Department of Electronics, Telecommunications and Informatics (DETI), Institute of Electronics and Telematics Engineering of Aveiro (IEETA), University of Aveiro, Aveiro, Portugal
- Carlos Pereira
- Centre for Informatics and Systems of the University at Coimbra (CISUC), University of Coimbra, Coimbra, Portugal
- Department of Informatics Engineering and Systems, Polytechnic Institute of Coimbra, Engineering Institute of Coimbra (IPC-ISEC), Coimbra, Portugal
- Nuno Rosa
- Department of Health Sciences, Institute of Health Sciences, The Catholic University of Portugal, Viseu, Portugal
- Maria José Correia
- Department of Health Sciences, Institute of Health Sciences, The Catholic University of Portugal, Viseu, Portugal
- Marlene Barros
- Department of Health Sciences, Institute of Health Sciences, The Catholic University of Portugal, Viseu, Portugal
- Centre for Neurosciences and Cell Biology, University of Coimbra, Coimbra, Portugal
- José Luís Oliveira
- Department of Electronics, Telecommunications and Informatics (DETI), Institute of Electronics and Telematics Engineering of Aveiro (IEETA), University of Aveiro, Aveiro, Portugal
|
16
|
Pavlopoulos GA, Promponas VJ, Ouzounis CA, Iliopoulos I. Biological information extraction and co-occurrence analysis. Methods Mol Biol 2014; 1159:77-92. [PMID: 24788262 DOI: 10.1007/978-1-4939-0709-0_5] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Indexed: 12/03/2022]
Abstract
Nowadays, it is possible to identify terms corresponding to biological entities within passages in biomedical text corpora: critically, their potential relationships then need to be detected. These relationships are typically detected by co-occurrence analysis, revealing associations between bioentities through their coexistence in single sentences and/or entire abstracts. These associations implicitly define networks, whose nodes represent terms/bioentities/concepts being connected by relationship edges; edge weights might represent confidence for these semantic connections. This chapter provides a review of current methods for co-occurrence analysis, focusing on data storage, analysis, and representation. We highlight scenarios of these approaches implemented by useful tools for information extraction and knowledge inference in the field of systems biology. We illustrate the practical utility of two online resources providing services of this type, namely STRING and BioTextQuest, and conclude with a discussion of current challenges and future perspectives in the field.
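Editorial illustration (not part of the cited chapter): a minimal sketch of sentence-level co-occurrence counting as the abstract describes it, where terms are nodes and the number of shared sentences is the edge weight. The function name `cooccurrence_edges`, the naive substring matching, and the three toy sentences are assumptions made for this example; real systems would use proper named-entity recognition.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(sentences, terms):
    """Count how often each pair of terms appears in the same sentence.

    Matching is a naive case-insensitive substring test (an assumption
    for illustration only, not how STRING or BioTextQuest work).
    """
    edges = Counter()
    for sentence in sentences:
        lowered = sentence.lower()
        # Sort so each pair is always stored in the same order.
        present = sorted(t for t in terms if t.lower() in lowered)
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    return edges


# Toy sentences invented for this sketch.
sentences = [
    "BRCA1 binds RAD51 during DNA repair.",
    "RAD51 and BRCA1 colocalize in nuclear foci.",
    "TP53 regulates the cell cycle.",
]
edges = cooccurrence_edges(sentences, ["BRCA1", "RAD51", "TP53"])
```

Here the only edge is BRCA1-RAD51 with weight 2; TP53 never shares a sentence with another listed term, so it stays unconnected.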
Affiliation(s)
- Georgios A Pavlopoulos
- Division of Basic Sciences, University of Crete Medical School, Heraklion, 71110, Greece
|
17
|
van Haagen HHHBM, 't Hoen PAC, Mons B, Schultes EA. Generic information can retrieve known biological associations: implications for biomedical knowledge discovery. PLoS One 2013; 8:e78665. [PMID: 24260124 PMCID: PMC3834066 DOI: 10.1371/journal.pone.0078665] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 05/25/2013] [Accepted: 09/13/2013] [Indexed: 02/01/2023] Open
Abstract
Motivation Weighted semantic networks built from text-mined literature can be used to retrieve known protein-protein or gene-disease associations, and have been shown to anticipate associations years before they are explicitly stated in the literature. Our text-mining system recognizes over 640,000 biomedical concepts: some are specific (i.e., names of genes or proteins) others generic (e.g., ‘Homo sapiens’). Generic concepts may play important roles in automated information retrieval, extraction, and inference but may also result in concept overload and confound retrieval and reasoning with low-relevance or even spurious links. Here, we attempted to optimize the retrieval performance for protein-protein interactions (PPI) by filtering generic concepts (node filtering) or links to generic concepts (edge filtering) from a weighted semantic network. First, we defined metrics based on network properties that quantify the specificity of concepts. Then using these metrics, we systematically filtered generic information from the network while monitoring retrieval performance of known protein-protein interactions. We also systematically filtered specific information from the network (inverse filtering), and assessed the retrieval performance of networks composed of generic information alone. Results Filtering generic or specific information induced a two-phase response in retrieval performance: initially the effects of filtering were minimal but beyond a critical threshold network performance suddenly drops. Contrary to expectations, networks composed exclusively of generic information demonstrated retrieval performance comparable to unfiltered networks that also contain specific concepts. Furthermore, an analysis using individual generic concepts demonstrated that they can effectively support the retrieval of known protein-protein interactions. For instance the concept “binding” is indicative for PPI retrieval and the concept “mutation abnormality” is indicative for gene-disease associations. Conclusion Generic concepts are important for information retrieval and cannot be removed from semantic networks without negative impact on retrieval performance.
Affiliation(s)
- Peter A. C. 't Hoen
- Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- Barend Mons
- Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- Erik A. Schultes
- Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
|
18
|
Botsis T, Ball R. Automating case definitions using literature-based reasoning. Appl Clin Inform 2013; 4:515-27. [PMID: 24454579 PMCID: PMC3885912 DOI: 10.4338/aci-2013-04-ra-0028] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 04/24/2013] [Accepted: 10/08/2013] [Indexed: 11/23/2022] Open
Abstract
BACKGROUND Establishing a Case Definition (CDef) is a first step in many epidemiological, clinical, surveillance, and research activities. The application of CDefs still relies on manual steps and this is a major source of inefficiency in surveillance and research. OBJECTIVE Describe the need and propose an approach for automating the useful representation of CDefs for medical conditions. METHODS We translated the existing Brighton Collaboration CDef for anaphylaxis by mostly relying on the identification of synonyms for the criteria of the CDef using the NLM MetaMap tool. We also generated a CDef for the same condition using all the related PubMed abstracts, processing them with a text mining tool, and further treating the synonyms with the above strategy. The co-occurrence of the anaphylaxis and any other medical term within the same sentence of the abstracts supported the construction of a large semantic network. The 'islands' algorithm reduced the network and revealed its densest region including the nodes that were used to represent the key criteria of the CDef. We evaluated the ability of the "translated" and the "generated" CDef to classify a set of 6034 H1N1 reports for anaphylaxis using two similarity approaches and comparing them with our previous semi-automated classification approach. RESULTS Overall classification performance across approaches to producing CDefs was similar, with the generated CDef and vector space model with cosine similarity having the highest accuracy (0.825 ± 0.003) and the semi-automated approach and vector space model with cosine similarity having the highest recall (0.809 ± 0.042). Precision was low for all approaches. CONCLUSION The useful representation of CDefs is a complicated task but potentially offers substantial gains in efficiency to support safety and clinical surveillance.
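Editorial illustration (not part of the cited paper): the abstract compares case definitions and reports with a vector space model and cosine similarity. A minimal sketch of that similarity measure over bag-of-words term counts follows; the function name `cosine_similarity`, the whitespace tokenisation, and the example terms and report text are assumptions for this example, not the authors' pipeline or data.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


# Hypothetical case-definition terms and report text, invented for this sketch.
case_definition = Counter(["hypotension", "urticaria", "wheezing"])
report = Counter("patient developed urticaria and wheezing after vaccination".split())
score = cosine_similarity(case_definition, report)
```

Two of the three case-definition terms occur in the seven-token report, so the score is 2 / sqrt(3 × 7); a classifier would then compare such scores against a chosen threshold.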
Affiliation(s)
- T. Botsis
- Taxiarchis Botsis PhD, MS, Office of Biostatistics and Epidemiology, CBER, FDA, Woodmont Office Complex 1, Rm 306N, 1401 Rockville Pike, Rockville, MD 20852, Tel. +1 301 827 5405, E-mail:
- R. Ball
- Office of Biostatistics and Epidemiology, Center for Biologics Evaluation and Research (CBER), Food and Drug Administration (FDA), Rockville, MD
|
19
|
de Vries B, Eising E, Broos LAM, Koelewijn SC, Todorov B, Frants RR, Boer JM, Ferrari MD, 't Hoen PAC, van den Maagdenberg AMJM. RNA expression profiling in brains of familial hemiplegic migraine type 1 knock-in mice. Cephalalgia 2013; 34:174-82. [PMID: 23985897 DOI: 10.1177/0333102413502736] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 02/05/2023]
Abstract
BACKGROUND Various CACNA1A missense mutations cause familial hemiplegic migraine type 1 (FHM1), a rare monogenic subtype of migraine with aura. FHM1 mutation R192Q is associated with pure hemiplegic migraine, whereas the S218L mutation causes hemiplegic migraine, cerebellar ataxia, seizures, and mild head trauma-induced brain edema. Transgenic knock-in (KI) migraine mouse models were generated that carried either the FHM1 R192Q or the S218L mutation and were shown to exhibit increased CaV2.1 channel activity. Here we investigated their cerebellar and caudal cortical transcriptome. METHODS Caudal cortical and cerebellar RNA expression profiles from mutant and wild-type mice were studied using microarrays. Respective brain regions were selected based on their relevance to migraine aura and ataxia. Relevant expression changes were further investigated at RNA and protein level by quantitative polymerase chain reaction (qPCR) and/or immunohistochemistry, respectively. RESULTS Expression differences in the cerebellum were most pronounced in S218L mice. Particularly, tyrosine hydroxylase, a marker of delayed cerebellar maturation, appeared strongly upregulated in S218L cerebella. In contrast, only minimal expression differences were observed in the caudal cortex of either mutant mice strain. CONCLUSION Despite pronounced consequences of migraine gene mutations at the neurobiological level, changes in cortical RNA expression in FHM1 migraine mice compared to wild-type are modest. In contrast, pronounced RNA expression changes are seen in the cerebellum of S218L mice and may explain their cerebellar ataxia phenotype.
Affiliation(s)
- Boukje de Vries
- Department of Human Genetics, Leiden University Medical Centre, The Netherlands
|
20
|
Vos R, Aarts S, van Mulligen E, Metsemakers J, van Boxtel MP, Verhey F, van den Akker M. Finding potentially new multimorbidity patterns of psychiatric and somatic diseases: exploring the use of literature-based discovery in primary care research. J Am Med Inform Assoc 2013; 21:139-45. [PMID: 23775174 DOI: 10.1136/amiajnl-2012-001448] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Indexed: 01/04/2023] Open
Abstract
BACKGROUND Multimorbidity, the co-occurrence of two or more chronic medical conditions within a single individual, is increasingly becoming part of daily care of general medical practice. Literature-based discovery may help to investigate the patterns of multimorbidity and to integrate medical knowledge for improving healthcare delivery for individuals with co-occurring chronic conditions. OBJECTIVE To explore the usefulness of literature-based discovery in primary care research through the key-case of finding associations between psychiatric and somatic diseases relevant to general practice in a large biomedical literature database (Medline). METHODS By using literature based discovery for matching disease profiles as vectors in a high-dimensional associative concept space, co-occurrences of a broad spectrum of chronic medical conditions were matched for their potential in biomedicine. An experimental setting was chosen in parallel with expert evaluations and expert meetings to assess performance and to generate targets for integrating literature-based discovery in multidisciplinary medical research of psychiatric and somatic disease associations. RESULTS Through stepwise reductions a reference set of 21,945 disease combinations was generated, from which a set of 166 combinations between psychiatric and somatic diseases was selected and assessed by text mining and expert evaluation. CONCLUSIONS Literature-based discovery tools generate specific patterns of associations between psychiatric and somatic diseases: one subset was appraised as promising for further research; the other subset surprised the experts, leading to intricate discussions and further eliciting of frameworks of biomedical knowledge. These frameworks enable us to specify targets for further developing and integrating literature-based discovery in multidisciplinary research of general practice, psychology and psychiatry, and epidemiology.
Affiliation(s)
- Rein Vos
- School for Public Health and Primary Care: CAPHRI, Maastricht University, Maastricht, The Netherlands
|
21
|
Li C, Jimeno-Yepes A, Arregui M, Kirsch H, Rebholz-Schuhmann D. PCorral: interactive mining of protein interactions from MEDLINE. Database (Oxford) 2013; 2013:bat030. [PMID: 23640984 PMCID: PMC3641755 DOI: 10.1093/database/bat030] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 11/30/2012] [Revised: 03/15/2013] [Accepted: 03/27/2013] [Indexed: 11/13/2022]
Abstract
The extraction of information from the scientific literature is a complex task, both for researchers doing manual curation and for automatic text processing solutions. The identification of protein-protein interactions (PPIs) requires the extraction of protein named entities and their relations. Semi-automatic interactive support is one approach to combine both solutions for efficient working processes to generate reliable database content. In principle, the extraction of PPIs can be achieved with different methods that can be combined to deliver high precision and/or high recall results in different combinations at the same time. Interactive use can be achieved if the analytical methods are fast enough to process the retrieved documents. PCorral provides interactive mining of PPIs from the scientific literature, allowing curators to skim MEDLINE for PPIs with low overhead. The keyword query to PCorral steers the selection of documents, and the subsequent text analysis generates high recall and high precision results for the curator. The underlying components of PCorral process the documents on-the-fly and are also available as a web service from the Whatizit infrastructure. The human interface summarizes the identified PPI results, and the involved entities are linked to relevant resources and databases. Altogether, PCorral serves curators at both the beginning and the end of the curation workflow for information retrieval and information extraction. Database URL: http://www.ebi.ac.uk/Rebholz-srv/pcorral.
Affiliation(s)
- Chen Li
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
|
22
|
A protein prioritization approach tailored for the FA/BRCA pathway. PLoS One 2013; 8:e62017. [PMID: 23620800 PMCID: PMC3631253 DOI: 10.1371/journal.pone.0062017] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 12/20/2012] [Accepted: 03/15/2013] [Indexed: 11/22/2022] Open
Abstract
Fanconi anemia (FA) is a heterogeneous recessive disorder associated with a markedly elevated risk to develop cancer. To date sixteen FA genes have been identified, three of which predispose heterozygous mutation carriers to breast cancer. The FA proteins work together in a genome maintenance pathway, the so-called FA/BRCA pathway which is important during the S phase of the cell cycle. Since not all FA patients can be linked to (one of) the sixteen known complementation groups, new FA genes remain to be identified. In addition the complex FA network remains to be further unravelled. One of the FA genes, FANCI, has been identified via a combination of bioinformatic techniques exploiting FA protein properties and genetic linkage. The aim of this study was to develop a prioritization approach for proteins of the entire human proteome that potentially interact with the FA/BRCA pathway or are novel candidate FA genes. To this end, we combined the original bioinformatics approach based on the properties of the first thirteen FA proteins identified with publicly available tools for protein-protein interactions, literature mining (Nermal) and a protein function prediction tool (FuncNet). Importantly, the three newest FA proteins FANCO/RAD51C, FANCP/SLX4, and XRCC2 displayed scores in the range of the already known FA proteins. Likewise, a prime candidate FA gene based on next generation sequencing and having a very low score was subsequently disproven by functional studies for the FA phenotype. Furthermore, the approach strongly enriches for GO terms such as DNA repair, response to DNA damage stimulus, and cell cycle-regulated genes. Additionally, overlaying the top 150 with a haploinsufficiency probability score, renders the approach more tailored for identifying breast cancer related genes. This approach may be useful for prioritization of putative novel FA or breast cancer genes from next generation sequencing efforts.
|
23
|
Rebholz-Schuhmann D, Oellrich A, Hoehndorf R. Text-mining solutions for biomedical research: enabling integrative biology. Nat Rev Genet 2012; 13:829-39. [DOI: 10.1038/nrg3337] [Citation(s) in RCA: 170] [Impact Index Per Article: 14.2] [Indexed: 12/24/2022]
|
24
|
Williams AJ, Harland L, Groth P, Pettifer S, Chichester C, Willighagen EL, Evelo CT, Blomberg N, Ecker G, Goble C, Mons B. Open PHACTS: semantic interoperability for drug discovery. Drug Discov Today 2012; 17:1188-98. [PMID: 22683805 DOI: 10.1016/j.drudis.2012.05.016] [Citation(s) in RCA: 172] [Impact Index Per Article: 14.3] [Received: 03/15/2012] [Revised: 05/18/2012] [Accepted: 05/31/2012] [Indexed: 01/22/2023]
Abstract
Open PHACTS is a public-private partnership between academia, publishers, small and medium sized enterprises and pharmaceutical companies. The goal of the project is to deliver and sustain an 'open pharmacological space' using and enhancing state-of-the-art semantic web standards and technologies. It is focused on practical and robust applications to solve specific questions in drug discovery research. OPS is intended to facilitate improvements in drug discovery in academia and industry and to support open innovation and in-house non-public drug discovery research. This paper lays out the challenges and how the Open PHACTS project is hoping to address these challenges technically and socially.
Affiliation(s)
- Antony J Williams
- Royal Society of Chemistry, ChemSpider, US Office, Wake Forest, NC 27587, USA.
|
25
|
Tang YT, Kao HY. Augmented transitive relationships with high impact protein distillation in protein interaction prediction. Biochimica et Biophysica Acta - Proteins and Proteomics 2012; 1824:1468-75. [PMID: 22683815 DOI: 10.1016/j.bbapap.2012.05.013] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 02/08/2012] [Revised: 05/18/2012] [Accepted: 05/30/2012] [Indexed: 11/16/2022]
Abstract
Predicting new protein-protein interactions is important for discovering novel functions of various biological pathways. Predicting these interactions is a crucial and challenging task. Moreover, discovering new protein-protein interactions through biological experiments is still difficult. Therefore, it is increasingly important to discover new protein interactions. Many studies have predicted protein-protein interactions using biological features such as Gene Ontology (GO) functional annotations and structural domains of two proteins. In this paper, we propose an augmented transitive relationships predictor (ATRP), a new method of predicting potential protein interactions using transitive relationships and annotations of protein interactions. In addition, a distillation of virtual direct protein-protein interactions is proposed to deal with the unbalanced distribution of different types of interactions in the existing protein-protein interaction databases. Our results demonstrate that ATRP can effectively predict protein-protein interactions. ATRP achieves, on average, 81% precision, 74% recall and a 77% F-measure in the prediction of direct protein-protein interactions. Using the generated benchmark datasets from KUPS to evaluate all types of protein-protein interactions, ATRP achieved, on average, 93% precision, 49% recall and a 64% F-measure. This article is part of a Special Issue entitled: Computational Methods for Protein Interaction and Structural Prediction.
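Editorial illustration (not part of the cited paper): the reported F-measures are consistent with the standard balanced F-measure, the harmonic mean of precision and recall. The sketch below assumes that definition, which the abstract does not state explicitly; because the paper reports averaged rates, an F-measure averaged over runs could differ slightly from one computed from the averaged precision and recall.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall figures quoted in the abstract above.
direct = f_measure(0.81, 0.74)     # direct interactions
all_types = f_measure(0.93, 0.49)  # all interaction types (KUPS benchmark)
print(round(direct, 2), round(all_types, 2))  # prints "0.77 0.64"
```

Both rounded values match the 77% and 64% quoted in the abstract, and the second case shows how a low recall pulls the F-measure down despite high precision.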
Affiliation(s)
- Yi-Tsung Tang
- Department of Computer Science and Information Engineering, National Cheng Kung University, No. 1, Ta-Hsueh Road, Tainan, Taiwan, ROC
|
26
|
van Mulligen EM, Fourrier-Reglat A, Gurwitz D, Molokhia M, Nieto A, Trifiro G, Kors JA, Furlong LI. The EU-ADR corpus: annotated drugs, diseases, targets, and their relationships. J Biomed Inform 2012; 45:879-84. [PMID: 22554700 DOI: 10.1016/j.jbi.2012.04.004] [Citation(s) in RCA: 55] [Impact Index Per Article: 4.6] [Received: 05/02/2011] [Revised: 02/02/2012] [Accepted: 04/11/2012] [Indexed: 11/25/2022]
Abstract
Corpora with specific entities and relationships annotated are essential to train and evaluate text-mining systems that are developed to extract specific structured information from a large corpus. In this paper we describe an approach where a named-entity recognition system produces a first annotation and annotators revise this annotation using a web-based interface. The agreement figures achieved show that the inter-annotator agreement is much better than the agreement with the system provided annotations. The corpus has been annotated for drugs, disorders, genes and their inter-relationships. For each of the drug-disorder, drug-target, and target-disorder relations three experts have annotated a set of 100 abstracts. These annotated relationships will be used to train and evaluate text-mining software to capture these relationships in texts.
Affiliation(s)
- Erik M van Mulligen
- Dept. of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands.
|
27
|
Hossain MS, Gresock J, Edmonds Y, Helm R, Potts M, Ramakrishnan N. Connecting the dots between PubMed abstracts. PLoS One 2012; 7:e29509. [PMID: 22235301 PMCID: PMC3250456 DOI: 10.1371/journal.pone.0029509] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.3] [Received: 07/26/2011] [Accepted: 11/29/2011] [Indexed: 11/23/2022] Open
Abstract
Background: There are now a multitude of articles published in a diversity of journals providing information about genes, proteins, pathways, and diseases. Each article investigates subsets of a biological process, but to gain insight into the functioning of a system as a whole, we must integrate information from multiple publications. Particularly, unraveling relationships between extra-cellular inputs and downstream molecular response mechanisms requires integrating conclusions from diverse publications.
Methodology: We present an automated approach to biological knowledge discovery from PubMed abstracts, suitable for “connecting the dots” across the literature. We describe a storytelling algorithm that, given a start and end publication, typically with little or no overlap in content, identifies a chain of intermediate publications from one to the other, such that neighboring publications have significant content similarity. The quality of discovered stories is measured using local criteria such as the size of supporting neighborhoods for each link and the strength of individual links connecting publications, as well as global metrics of dispersion. To ensure that the story stays coherent as it meanders from one publication to another, we demonstrate the design of novel coherence and overlap filters for use as post-processing steps.
Conclusions: We demonstrate the application of our storytelling algorithm to three case studies: i) a many-one study exploring relationships between multiple cellular inputs and a molecule responsible for cell-fate decisions, ii) a many-many study exploring the relationships between multiple cytokines and multiple downstream transcription factors, and iii) a one-to-one study to showcase the ability to recover a cancer related association, viz. the Warburg effect, from past literature. The storytelling pipeline helps narrow down a scientist's focus from several hundreds of thousands of relevant documents to only around a hundred stories. We argue that our approach can serve as a valuable discovery aid for hypothesis generation and connection exploration in large unstructured biological knowledge bases.
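The chaining idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes each publication is reduced to a set of terms, uses Jaccard similarity as a stand-in for "content similarity", and uses a plain breadth-first search in place of the neighborhood, link-strength, and coherence criteria the abstract mentions. The names `jaccard`, `find_story`, `min_sim`, and the toy documents are all hypothetical.

```python
from collections import deque


def jaccard(a, b):
    """Jaccard similarity between two term sets (assumed similarity measure)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_story(docs, start, end, min_sim=0.3):
    """Return the shortest chain of document ids from start to end in which
    every pair of neighbors has similarity >= min_sim, or None if no chain exists."""
    if start not in docs or end not in docs:
        return None
    parent = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == end:
            # Walk the parent links back to the start to rebuild the chain.
            path = []
            while current is not None:
                path.append(current)
                current = parent[current]
            return path[::-1]
        for other in docs:
            if other not in parent and jaccard(docs[current], docs[other]) >= min_sim:
                parent[other] = current
                queue.append(other)
    return None


# Toy corpus: A and D share no terms, but neighbors overlap.
docs = {
    "A": {"cytokine", "receptor", "signaling"},
    "B": {"receptor", "signaling", "kinase"},
    "C": {"signaling", "kinase", "transcription"},
    "D": {"kinase", "transcription", "apoptosis"},
}
story = find_story(docs, "A", "D", min_sim=0.4)
print(story)
```

In this toy corpus the start and end documents have no terms in common, so any story connecting them has to pass through intermediate documents. Raising `min_sim` trades longer or missing stories for stronger individual links.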
Affiliation(s)
- M Shahriar Hossain
- Department of Computer Science, Virginia Tech, Blacksburg, Virginia, United States of America.
|
28
|
|
29
|
Xu L, Furlotte N, Lin Y, Heinrich K, Berry MW, George EO, Homayouni R. Functional cohesion of gene sets determined by latent semantic indexing of PubMed abstracts. PLoS One 2011; 6:e18851. [PMID: 21533142 PMCID: PMC3077411 DOI: 10.1371/journal.pone.0018851] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.7] [Received: 11/10/2010] [Accepted: 03/21/2011] [Indexed: 12/31/2022] Open
Abstract
High-throughput genomic technologies enable researchers to identify genes that are co-regulated with respect to specific experimental conditions. Numerous statistical approaches have been developed to identify differentially expressed genes. Because each approach can produce distinct gene sets, it is difficult for biologists to determine which statistical approach yields biologically relevant gene sets and is appropriate for their study. To address this issue, we implemented Latent Semantic Indexing (LSI) to determine the functional coherence of gene sets. An LSI model was built using over 1 million Medline abstracts for over 20,000 mouse and human genes annotated in Entrez Gene. The gene-to-gene LSI-derived similarities were used to calculate a literature cohesion p-value (LPv) for a given gene set using a Fisher's exact test. We tested this method against genes in more than 6,000 functional pathways annotated in Gene Ontology (GO) and found that approximately 75% of gene sets in GO biological process category and 90% of the gene sets in GO molecular function and cellular component categories were functionally cohesive (LPv<0.05). These results indicate that the LPv methodology is both robust and accurate. Application of this method to previously published microarray datasets demonstrated that LPv can be helpful in selecting the appropriate feature extraction methods. To enable real-time calculation of LPv for mouse or human gene sets, we developed a web tool called Gene-set Cohesion Analysis Tool (GCAT). GCAT can complement other gene set enrichment approaches by determining the overall functional cohesion of data sets, taking into account both explicit and implicit gene interactions reported in the biomedical literature.
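The abstract states that LPv is computed from gene-to-gene similarities with a Fisher's exact test but does not give the table layout. The sketch below only shows the test itself, under an assumed layout: rows are "pairs inside the gene set" versus "background pairs", and columns are "similarity above a cutoff" versus "not above". The function name `fisher_exact_greater` and the example counts are hypothetical and are not taken from the paper or from GCAT.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null, i.e. the probability of
    seeing at least `a` counts in the top-left cell with all margins fixed.
    """
    row1 = a + b          # e.g. number of gene pairs inside the set (assumed layout)
    col1 = a + c          # e.g. number of pairs above the similarity cutoff
    n = a + b + c + d     # all pairs considered
    k_max = min(row1, col1)
    # Sum exact integer counts first, then divide once to limit rounding error.
    favorable = sum(
        comb(col1, k) * comb(n - col1, row1 - k) for k in range(a, k_max + 1)
    )
    return favorable / comb(n, row1)


# Hypothetical counts: 3 of 4 in-set pairs exceed the cutoff, 1 of 4 background pairs do.
p_value = fisher_exact_greater(3, 1, 1, 3)
print(p_value)
```

A small p-value under this layout would indicate that high-similarity pairs are over-represented inside the gene set relative to the background, which is the sense of "functionally cohesive" used in the abstract.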
Affiliation(s)
- Lijing Xu
- Bioinformatics Program, University of Memphis, Memphis, Tennessee, United States of America
- Department of Mathematical Sciences, University of Memphis, Memphis, Tennessee, United States of America
- Nicholas Furlotte
- Bioinformatics Program, University of Memphis, Memphis, Tennessee, United States of America
- Yunyue Lin
- Department of Computer Science, University of Memphis, Memphis, Tennessee, United States of America
- Kevin Heinrich
- Computable Genomix, Memphis, Tennessee, United States of America
- Michael W. Berry
- Department of Electrical and Computer Engineering, University of Tennessee, Knoxville, Tennessee, United States of America
- Ebenezer O. George
- Bioinformatics Program, University of Memphis, Memphis, Tennessee, United States of America
- Department of Mathematical Sciences, University of Memphis, Memphis, Tennessee, United States of America
- Ramin Homayouni
- Bioinformatics Program, University of Memphis, Memphis, Tennessee, United States of America
- Department of Biological Sciences, University of Memphis, Memphis, Tennessee, United States of America
|
30
|
Ligthart L, de Vries B, Smith AV, Ikram MA, Amin N, Hottenga JJ, Koelewijn SC, Kattenberg VM, de Moor MHM, Janssens ACJW, Aulchenko YS, Oostra BA, de Geus EJC, Smit JH, Zitman FG, Uitterlinden AG, Hofman A, Willemsen G, Nyholt DR, Montgomery GW, Terwindt GM, Gudnason V, Penninx BWJH, Breteler M, Ferrari MD, Launer LJ, van Duijn CM, van den Maagdenberg AMJM, Boomsma DI. Meta-analysis of genome-wide association for migraine in six population-based European cohorts. Eur J Hum Genet 2011; 19:901-7. [PMID: 21448238 PMCID: PMC3172930 DOI: 10.1038/ejhg.2011.48] [Citation(s) in RCA: 71] [Impact Index Per Article: 5.5] [Indexed: 12/15/2022] Open
Abstract
Migraine is a common neurological disorder with a genetically complex background. This paper describes a meta-analysis of genome-wide association (GWA) studies on migraine, performed by the Dutch–Icelandic migraine genetics (DICE) consortium, which brings together six population-based European migraine cohorts with a total sample size of 10 980 individuals (2446 cases and 8534 controls). A total of 32 SNPs showed marginal evidence for association at a P-value < 10⁻⁵. The best result was obtained for SNP rs9908234, which had a P-value of 8.00 × 10⁻⁸. This top SNP is located in the nerve growth factor receptor (NGFR) gene. However, this SNP did not replicate in three cohorts from the Netherlands and Australia. Of the other 31 SNPs, 18 SNPs were tested in two replication cohorts, but none replicated. In addition, we explored previously identified candidate genes in the meta-analysis data set. This revealed a modest gene-based significant association between migraine and the metadherin (MTDH) gene, previously identified in the first clinic-based GWA study (GWAS) for migraine (Bonferroni-corrected gene-based P-value = 0.026). This finding is consistent with the involvement of the glutamate pathway in migraine. Additional research is necessary to further confirm the involvement of glutamate.
Affiliation(s)
- Lannie Ligthart
- Department of Biological Psychology, VU University Amsterdam, Amsterdam, The Netherlands.
|
31
|
|
32
|
van Haagen HHHBM, 't Hoen PAC, de Morrée A, van Roon-Mom WMC, Peters DJM, Roos M, Mons B, van Ommen GJ, Schuemie MJ. In silico discovery and experimental validation of new protein-protein interactions. Proteomics 2011; 11:843-53. [DOI: 10.1002/pmic.201000398] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.5] [Received: 07/05/2010] [Revised: 10/17/2010] [Accepted: 11/25/2010] [Indexed: 01/27/2023]
|
33
|
Abstract
This chapter gives a brief overview of text-mining techniques to extract knowledge from large text collections. It describes the basic pipeline for getting from text to relationships between biological concepts and the problems that are encountered at each step in the pipeline. We first explain how words in text are recognized as concepts. Second, concepts are associated with each other using 2×2 contingency tables and test statistics. Third, we explain that it is possible to extract indirect links between concepts using the direct links taken from 2×2 table analyses. This we call implicit information extraction. Fourth, the validation techniques to evaluate a text-mining system, such as ROC curves and retrospective studies, are discussed. We conclude by examining how text information can be combined with other non-textual data sources such as microarray expression data and what the future directions are for text-mining within the Internet.
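The second step in this abstract, associating concepts through 2×2 contingency tables, can be sketched as follows. The abstract does not say which test statistic the chapter uses, so the Pearson chi-square statistic here is an assumption, as are the function names `contingency` and `chi_square` and the toy documents, each of which is represented as a set of already-recognized concepts.

```python
def contingency(docs, x, y):
    """Count documents by co-occurrence of concepts x and y.

    Returns (a, b, c, d): both present, only x, only y, neither.
    """
    a = sum(1 for doc in docs if x in doc and y in doc)
    b = sum(1 for doc in docs if x in doc and y not in doc)
    c = sum(1 for doc in docs if x not in doc and y in doc)
    d = sum(1 for doc in docs if x not in doc and y not in doc)
    return a, b, c, d


def chi_square(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 table (no continuity correction)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a margin is empty, so no association can be measured
    return n * (a * d - b * c) ** 2 / denom


# Toy corpus of concept sets (hypothetical).
docs = [
    {"p53", "apoptosis"},
    {"p53", "apoptosis"},
    {"p53"},
    {"apoptosis"},
    {"insulin"},
    {"insulin", "glucose"},
]
table = contingency(docs, "p53", "apoptosis")
statistic = chi_square(*table)
print(table, statistic)
```

Direct links scored this way could then feed the implicit-link step the abstract describes, for example by connecting two concepts that never co-occur but are both strongly associated with a shared third concept.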
Affiliation(s)
- Herman van Haagen
- Department of Human Genetics, University Medical Center, Leiden, The Netherlands.
|
34
|
Harmston N, Filsell W, Stumpf MPH. What the papers say: text mining for genomics and systems biology. Hum Genomics 2010; 5:17-29. [PMID: 21106487 PMCID: PMC3500154 DOI: 10.1186/1479-7364-5-1-17] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.8] [Received: 08/06/2010] [Accepted: 08/06/2010] [Indexed: 12/11/2022] Open
Abstract
Keeping up with the rapidly growing literature has become virtually impossible for most scientists. This can have dire consequences. First, we may waste research time and resources on reinventing the wheel simply because we can no longer maintain a reliable grasp on the published literature. Second, and perhaps more detrimental, judicious (or serendipitous) combination of knowledge from different scientific disciplines, which would require following disparate and distinct research literatures, is rapidly becoming impossible for even the most ardent readers of research publications. Text mining - the automated extraction of information from (electronically) published sources - could potentially fulfil an important role - but only if we know how to harness its strengths and overcome its weaknesses. As we do not expect that the rate at which scientific results are published will decrease, text mining tools are now becoming essential in order to cope with, and derive maximum benefit from, this information explosion. In genomics, this is particularly pressing as more and more rare disease-causing variants are found and need to be understood. Not being conversant with this technology may put scientists and biomedical regulators at a severe disadvantage. In this review, we introduce the basic concepts underlying modern text mining and its applications in genomics and systems biology. We hope that this review will serve three purposes: (i) to provide a timely and useful overview of the current status of this field, including a survey of present challenges; (ii) to enable researchers to decide how and when to apply text mining tools in their own research; and (iii) to highlight how the research communities in genomics and systems biology can help to make text mining from biomedical abstracts and texts more straightforward.
Affiliation(s)
- Nathan Harmston
- Division of Molecular Biosciences, Centre for Bioinformatics, Imperial College London, 303, Wolfson Building, South Kensington Campus, London, SW7 2AZ, UK
- Wendy Filsell
- Unilever R&D, Colworth Science Park, Sharnbrook, Bedford MK44 1LQ, UK
- Michael PH Stumpf
- Division of Molecular Biosciences, Centre for Bioinformatics, Imperial College London, 303, Wolfson Building, South Kensington Campus, London, SW7 2AZ, UK
|
35
|
Biomedical semantics: the hub for biomedical research 2.0. J Biomed Semantics 2010; 1:1. [PMID: 20618983 PMCID: PMC2895735 DOI: 10.1186/2041-1480-1-1] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 03/25/2010] [Accepted: 03/31/2010] [Indexed: 11/10/2022] Open
|