1
|
Chen H, Lu D, Xiao Z, Li S, Zhang W, Luan X, Zhang W, Zheng G. Comprehensive applications of the artificial intelligence technology in new drug research and development. Health Inf Sci Syst 2024; 12:41. [PMID: 39130617 PMCID: PMC11310389 DOI: 10.1007/s13755-024-00300-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/31/2023] [Accepted: 07/27/2024] [Indexed: 08/13/2024] Open
Abstract
Purpose Target-based strategy is a prevalent means of drug research and development (R&D), since targets provide effector molecules of drug action and offer the foundation of pharmacological investigation. Recently, the artificial intelligence (AI) technology has been utilized in various stages of drug R&D, where AI-assisted experimental methods show higher efficiency than sole experimental ones. It is a critical need to give a comprehensive review of AI applications in drug R &D for biopharmaceutical field. Methods Relevant literatures about AI-assisted drug R&D were collected from the public databases (Including Google Scholar, Web of Science, PubMed, IEEE Xplore Digital Library, Springer, and ScienceDirect) through a keyword searching strategy with the following terms [("Artificial Intelligence" OR "Knowledge Graph" OR "Machine Learning") AND ("Drug Target Identification" OR "New Drug Development")]. Results In this review, we first introduced common strategies and novel trends of drug R&D, followed by characteristic description of AI algorithms widely used in drug R&D. Subsequently, we depicted detailed applications of AI algorithms in target identification, lead compound identification and optimization, drug repurposing, and drug analytical platform construction. Finally, we discussed the challenges and prospects of AI-assisted methods for drug discovery. Conclusion Collectively, this review provides comprehensive overview of AI applications in drug R&D and presents future perspectives for biopharmaceutical field, which may promote the development of drug industry.
Collapse
Affiliation(s)
- Hongyu Chen
- Shanghai Frontiers Science Center for Chinese Medicine Chemical Biology, Institute of Interdisciplinary Integrative Medicine Research, Shanghai University of Traditional Chinese Medicine, Shanghai, China
| | - Dong Lu
- Shanghai Frontiers Science Center for Chinese Medicine Chemical Biology, Institute of Interdisciplinary Integrative Medicine Research, Shanghai University of Traditional Chinese Medicine, Shanghai, China
| | - Ziyi Xiao
- Johns Hopkins Bloomberg School of Public Health, Baltimore, MD USA
| | - Shensuo Li
- Shanghai Frontiers Science Center for Chinese Medicine Chemical Biology, Institute of Interdisciplinary Integrative Medicine Research, Shanghai University of Traditional Chinese Medicine, Shanghai, China
| | - Wen Zhang
- Shanghai Frontiers Science Center for Chinese Medicine Chemical Biology, Institute of Interdisciplinary Integrative Medicine Research, Shanghai University of Traditional Chinese Medicine, Shanghai, China
| | - Xin Luan
- Shanghai Frontiers Science Center for Chinese Medicine Chemical Biology, Institute of Interdisciplinary Integrative Medicine Research, Shanghai University of Traditional Chinese Medicine, Shanghai, China
| | - Weidong Zhang
- Shanghai Frontiers Science Center for Chinese Medicine Chemical Biology, Institute of Interdisciplinary Integrative Medicine Research, Shanghai University of Traditional Chinese Medicine, Shanghai, China
| | - Guangyong Zheng
- Shanghai Frontiers Science Center for Chinese Medicine Chemical Biology, Institute of Interdisciplinary Integrative Medicine Research, Shanghai University of Traditional Chinese Medicine, Shanghai, China
| |
Collapse
|
2
|
Mazein I, Rougny A, Mazein A, Henkel R, Gütebier L, Michaelis L, Ostaszewski M, Schneider R, Satagopam V, Jensen LJ, Waltemath D, Wodke JAH, Balaur I. Graph databases in systems biology: a systematic review. Brief Bioinform 2024; 25:bbae561. [PMID: 39565895 PMCID: PMC11578065 DOI: 10.1093/bib/bbae561] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2024] [Revised: 09/28/2024] [Accepted: 10/21/2024] [Indexed: 11/22/2024] Open
Abstract
Graph databases are becoming increasingly popular across scientific disciplines, being highly suitable for storing and connecting complex heterogeneous data. In systems biology, they are used as a backend solution for biological data repositories, ontologies, networks, pathways, and knowledge graph databases. In this review, we analyse all publications using or mentioning graph databases retrieved from PubMed and PubMed Central full-text search, focusing on the top 16 available graph databases, Publications are categorized according to their domain and application, focusing on pathway and network biology and relevant ontologies and tools. We detail different approaches and highlight the advantages of outstanding resources, such as UniProtKB, Disease Ontology, and Reactome, which provide graph-based solutions. We discuss ongoing efforts of the systems biology community to standardize and harmonize knowledge graph creation and the maintenance of integrated resources. Outlining prospects, including the use of graph databases as a way of communication between biological data repositories, we conclude that efficient design, querying, and maintenance of graph databases will be key for knowledge generation in systems biology and other research fields with heterogeneous data.
Collapse
Affiliation(s)
- Ilya Mazein
- Medical Informatics Laboratory, University Medicine Greifswald, Walther-Rathenau-Straße 48, Greifswald 17475, Germany
| | - Adrien Rougny
- Luxembourg Centre for Systems Biology, University of Luxembourg, 6 Avenue du Swing, Belvaux L-4367, Luxembourg
| | - Alexander Mazein
- Luxembourg Centre for Systems Biology, University of Luxembourg, 6 Avenue du Swing, Belvaux L-4367, Luxembourg
| | - Ron Henkel
- Medical Informatics Laboratory, University Medicine Greifswald, Walther-Rathenau-Straße 48, Greifswald 17475, Germany
| | - Lea Gütebier
- Medical Informatics Laboratory, University Medicine Greifswald, Walther-Rathenau-Straße 48, Greifswald 17475, Germany
| | - Lea Michaelis
- Medical Informatics Laboratory, University Medicine Greifswald, Walther-Rathenau-Straße 48, Greifswald 17475, Germany
| | - Marek Ostaszewski
- Luxembourg Centre for Systems Biology, University of Luxembourg, 6 Avenue du Swing, Belvaux L-4367, Luxembourg
| | - Reinhard Schneider
- Luxembourg Centre for Systems Biology, University of Luxembourg, 6 Avenue du Swing, Belvaux L-4367, Luxembourg
| | - Venkata Satagopam
- Luxembourg Centre for Systems Biology, University of Luxembourg, 6 Avenue du Swing, Belvaux L-4367, Luxembourg
| | - Lars Juhl Jensen
- Department of Veterinary and Animal Sciences, Faculty of Health and Medical Sciences, University of Copenhagen, Grønnegårdsvej 15, 1870 Frederiksberg C, Denmark
| | - Dagmar Waltemath
- Medical Informatics Laboratory, University Medicine Greifswald, Walther-Rathenau-Straße 48, Greifswald 17475, Germany
| | - Judith A H Wodke
- Medical Informatics Laboratory, University Medicine Greifswald, Walther-Rathenau-Straße 48, Greifswald 17475, Germany
| | - Irina Balaur
- Luxembourg Centre for Systems Biology, University of Luxembourg, 6 Avenue du Swing, Belvaux L-4367, Luxembourg
| |
Collapse
|
3
|
Hauben M, Rafi M. Knowledge Graphs in Pharmacovigilance: A Step-By-Step Guide. Clin Ther 2024; 46:538-543. [PMID: 38670887 DOI: 10.1016/j.clinthera.2024.03.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/27/2023] [Revised: 02/16/2024] [Accepted: 03/15/2024] [Indexed: 04/28/2024]
Abstract
PURPOSE This work aims to demystify Knowledge Graphs (KGs) in pharmacovigilance (PV). It complements the scoping review within this issue. By bridging knowledge gaps and stimulating interest, further engagement with this topic by pharmacovigilance professionals will be facilitated. METHODS We elucidate fundamental KGs concepts and terminology, followed by delineating a sequence of implementation steps: use case definition, data type selection, data sourcing, KG construction, KG embedding, and deriving actionable insights. Information technology options and limitations are also explored. FINDINGS KGs in pharmacovigilance is a multi-disciplinary field involving information technology, machine learning, biology, and PV. We were able to synthesize the relevant core concepts to create an intuitive exposition of KGs in PV. IMPLICATIONS This work demystifies KGs with a pharmacovigilance focus, preparing readers for the accompanying in-depth scoping review. that follows. It lays the groundwork for advancing PV research and practice by emphasizing the importance of engaging with vigilance experts. This approach enhances knowledge sharing and collaboration, contributing to more effective and informed pharmacovigilance efforts and optimal assessment and deployment of KGs in PV.
Collapse
Affiliation(s)
| | - Mazin Rafi
- Department of Statistics, Rutgers University, Piscataway, New Jersey.
| |
Collapse
|
4
|
Chaturvedi J, Wang T, Velupillai S, Stewart R, Roberts A. Development of a Knowledge Graph Embeddings Model for Pain. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2024; 2023:299-308. [PMID: 38222382 PMCID: PMC10785867] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Subscribe] [Scholar Register] [Indexed: 01/16/2024]
Abstract
Pain is a complex concept that can interconnect with other concepts such as a disorder that might cause pain, a medication that might relieve pain, and so on. To fully understand the context of pain experienced by either an individual or across a population, we may need to examine all concepts related to pain and the relationships between them. This is especially useful when modeling pain that has been recorded in electronic health records. Knowledge graphs represent concepts and their relations by an interlinked network, enabling semantic and context-based reasoning in a computationally tractable form. These graphs can, however, be too large for efficient computation. Knowledge graph embeddings help to resolve this by representing the graphs in a low-dimensional vector space. These embeddings can then be used in various downstream tasks such as classification and link prediction. The various relations associated with pain which are required to construct such a knowledge graph can be obtained from external medical knowledge bases such as SNOMED CT, a hierarchical systematic nomenclature of medical terms. A knowledge graph built in this way could be further enriched with real-world examples of pain and its relations extracted from electronic health records. This paper describes the construction of such knowledge graph embedding models of pain concepts, extracted from the unstructured text of mental health electronic health records, combined with external knowledge created from relations described in SNOMED CT, and their evaluation on a subject-object link prediction task. The performance of the models was compared with other baseline models.
Collapse
Affiliation(s)
- Jaya Chaturvedi
- Institute of Psychiatry, Psychology and Neurosciences, King's College London, London, United Kingdom
| | - Tao Wang
- Institute of Psychiatry, Psychology and Neurosciences, King's College London, London, United Kingdom
| | - Sumithra Velupillai
- Institute of Psychiatry, Psychology and Neurosciences, King's College London, London, United Kingdom
| | - Robert Stewart
- Institute of Psychiatry, Psychology and Neurosciences, King's College London, London, United Kingdom
- South London and Maudsley NHS Foundation Trust, London, United Kingdom
| | - Angus Roberts
- Institute of Psychiatry, Psychology and Neurosciences, King's College London, London, United Kingdom
- South London and Maudsley NHS Foundation Trust, London, United Kingdom
| |
Collapse
|
5
|
Schäfer J, Tang M, Luu D, Bergmann AK, Wiese L. Graph4Med: a web application and a graph database for visualizing and analyzing medical databases. BMC Bioinformatics 2022; 23:537. [PMID: 36503436 PMCID: PMC9743588 DOI: 10.1186/s12859-022-05092-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/24/2022] [Accepted: 12/01/2022] [Indexed: 12/14/2022] Open
Abstract
BACKGROUND Medical databases normally contain large amounts of data in a variety of forms. Although they grant significant insights into diagnosis and treatment, implementing data exploration into current medical databases is challenging since these are often based on a relational schema and cannot be used to easily extract information for cohort analysis and visualization. As a consequence, valuable information regarding cohort distribution or patient similarity may be missed. With the rapid advancement of biomedical technologies, new forms of data from methods such as Next Generation Sequencing (NGS) or chromosome microarray (array CGH) are constantly being generated; hence it can be expected that the amount and complexity of medical data will rise and bring relational database systems to a limit. DESCRIPTION We present Graph4Med, a web application that relies on a graph database obtained by transforming a relational database. Graph4Med provides a straightforward visualization and analysis of a selected patient cohort. Our use case is a database of pediatric Acute Lymphoblastic Leukemia (ALL). Along routine patients' health records it also contains results of latest technologies such as NGS data. We developed a suitable graph data schema to convert the relational data into a graph data structure and store it in Neo4j. We used NeoDash to build a dashboard for querying and displaying patients' cohort analysis. This way our tool (1) quickly displays the overview of patients' cohort information such as distributions of gender, age, mutations (fusions), diagnosis; (2) provides mutation (fusion) based similarity search and display in a maneuverable graph; (3) generates an interactive graph of any selected patient and facilitates the identification of interesting patterns among patients. CONCLUSION We demonstrate the feasibility and advantages of a graph database for storing and querying medical databases. Our dashboard allows a fast and interactive analysis and visualization of complex medical data. It is especially useful for patients similarity search based on mutations (fusions), of which vast amounts of data have been generated by NGS in recent years. It can discover relationships and patterns in patients cohorts that are normally hard to grasp. Expanding Graph4Med to more medical databases will bring novel insights into diagnostic and research.
Collapse
Affiliation(s)
- Jero Schäfer
- grid.7839.50000 0004 1936 9721Institute of Computer Science, Goethe-Universität Frankfurt, Frankfurt am Main, Germany
| | - Ming Tang
- grid.10423.340000 0000 9529 9877Department of Human Genetics, Hannover Medical School, Hannover, Germany ,grid.9122.80000 0001 2163 2777L3S Research Center, Leibniz Universität Hannover, Hannover, Germany
| | - Danny Luu
- grid.10423.340000 0000 9529 9877Department of Human Genetics, Hannover Medical School, Hannover, Germany
| | - Anke Katharina Bergmann
- grid.10423.340000 0000 9529 9877Department of Human Genetics, Hannover Medical School, Hannover, Germany
| | - Lena Wiese
- grid.7839.50000 0004 1936 9721Institute of Computer Science, Goethe-Universität Frankfurt, Frankfurt am Main, Germany ,grid.418009.40000 0000 9191 9864Bioinformatics Group, Fraunhofer ITEM, Hannover, Germany
| |
Collapse
|
6
|
Feng F, Tang F, Gao Y, Zhu D, Li T, Yang S, Yao Y, Huang Y, Liu J. GenomicKB: a knowledge graph for the human genome. Nucleic Acids Res 2022; 51:D950-D956. [PMID: 36318240 PMCID: PMC9825430 DOI: 10.1093/nar/gkac957] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/22/2022] [Revised: 10/06/2022] [Accepted: 10/27/2022] [Indexed: 11/07/2022] Open
Abstract
Genomic Knowledgebase (GenomicKB) is a graph database for researchers to explore and investigate human genome, epigenome, transcriptome, and 4D nucleome with simple and efficient queries. The database uses a knowledge graph to consolidate genomic datasets and annotations from over 30 consortia and portals, including 347 million genomic entities, 1.36 billion relations, and 3.9 billion entity and relation properties. GenomicKB is equipped with a web-based query system (https://gkb.dcmb.med.umich.edu/) which allows users to query the knowledge graph with customized graph patterns and specific constraints on entities and relations. Compared with traditional tabular-structured data stored in separate data portals, GenomicKB emphasizes the relations among genomic entities, intuitively connects isolated data matrices, and supports efficient queries for scientific discoveries. GenomicKB transforms complicated analysis among multiple genomic entities and relations into coding-free queries, and facilitates data-driven genomic discoveries in the future.
Collapse
Affiliation(s)
- Fan Feng
- Department of Computational Medicine and Bioinformatics, University of Michigan, MI, USA
| | | | | | | | | | | | - Yuan Yao
- Electrical Engineering and Computer Science, University of Michigan, MI, USA
| | - Yuanhao Huang
- Department of Computational Medicine and Bioinformatics, University of Michigan, MI, USA
| | - Jie Liu
- To whom correspondence should be addressed.
| |
Collapse
|
7
|
Xiao G, Pfaff E, Prud'hommeaux E, Booth D, Sharma DK, Huo N, Yu Y, Zong N, Ruddy KJ, Chute CG, Jiang G. FHIR-Ontop-OMOP: Building clinical knowledge graphs in FHIR RDF with the OMOP Common data Model. J Biomed Inform 2022; 134:104201. [PMID: 36089199 PMCID: PMC9561043 DOI: 10.1016/j.jbi.2022.104201] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/23/2022] [Revised: 08/04/2022] [Accepted: 09/04/2022] [Indexed: 11/26/2022]
Abstract
BACKGROUND Knowledge graphs (KGs) play a key role to enable explainable artificial intelligence (AI) applications in healthcare. Constructing clinical knowledge graphs (CKGs) against heterogeneous electronic health records (EHRs) has been desired by the research and healthcare AI communities. From the standardization perspective, community-based standards such as the Fast Healthcare Interoperability Resources (FHIR) and the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) are increasingly used to represent and standardize EHR data for clinical data analytics, however, the potential of such a standard on building CKG has not been well investigated. OBJECTIVE To develop and evaluate methods and tools that expose the OMOP CDM-based clinical data repositories into virtual clinical KGs that are compliant with FHIR Resource Description Framework (RDF) specification. METHODS We developed a system called FHIR-Ontop-OMOP to generate virtual clinical KGs from the OMOP relational databases. We leveraged an OMOP CDM-based Medical Information Mart for Intensive Care (MIMIC-III) data repository to evaluate the FHIR-Ontop-OMOP system in terms of the faithfulness of data transformation and the conformance of the generated CKGs to the FHIR RDF specification. RESULTS A beta version of the system has been released. A total of more than 100 data element mappings from 11 OMOP CDM clinical data, health system and vocabulary tables were implemented in the system, covering 11 FHIR resources. The generated virtual CKG from MIMIC-III contains 46,520 instances of FHIR Patient, 716,595 instances of Condition, 1,063,525 instances of Procedure, 24,934,751 instances of MedicationStatement, 365,181,104 instances of Observations, and 4,779,672 instances of CodeableConcept. Patient counts identified by five pairs of SQL (over the MIMIC database) and SPARQL (over the virtual CKG) queries were identical, ensuring the faithfulness of the data transformation. Generated CKG in RDF triples for 100 patients were fully conformant with the FHIR RDF specification. CONCLUSION The FHIR-Ontop-OMOP system can expose OMOP database as a FHIR-compliant RDF graph. It provides a meaningful use case demonstrating the potentials that can be enabled by the interoperability between FHIR and OMOP CDM. Generated clinical KGs in FHIR RDF provide a semantic foundation to enable explainable AI applications in healthcare.
Collapse
Affiliation(s)
- Guohui Xiao
- University of Bergen, Norway; University of Oslo, Norway; Ontopic S.r.l., Italy.
| | - Emily Pfaff
- University of North Carolina, Chapel Hill, NC, USA
| | | | | | | | - Nan Huo
- Mayo Clinic, Rochester, MN, USA
| | - Yue Yu
- Mayo Clinic, Rochester, MN, USA
| | | | | | | | | |
Collapse
|
8
|
Robin V, Bodein A, Scott-Boyer MP, Leclercq M, Périn O, Droit A. Overview of methods for characterization and visualization of a protein-protein interaction network in a multi-omics integration context. Front Mol Biosci 2022; 9:962799. [PMID: 36158572 PMCID: PMC9494275 DOI: 10.3389/fmolb.2022.962799] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/06/2022] [Accepted: 08/16/2022] [Indexed: 11/26/2022] Open
Abstract
At the heart of the cellular machinery through the regulation of cellular functions, protein-protein interactions (PPIs) have a significant role. PPIs can be analyzed with network approaches. Construction of a PPI network requires prediction of the interactions. All PPIs form a network. Different biases such as lack of data, recurrence of information, and false interactions make the network unstable. Integrated strategies allow solving these different challenges. These approaches have shown encouraging results for the understanding of molecular mechanisms, drug action mechanisms, and identification of target genes. In order to give more importance to an interaction, it is evaluated by different confidence scores. These scores allow the filtration of the network and thus facilitate the representation of the network, essential steps to the identification and understanding of molecular mechanisms. In this review, we will discuss the main computational methods for predicting PPI, including ones confirming an interaction as well as the integration of PPIs into a network, and we will discuss visualization of these complex data.
Collapse
Affiliation(s)
- Vivian Robin
- Molecular Medicine Department, CHU de Québec Research Center, Université Laval, Québec, QC, Canada
| | - Antoine Bodein
- Molecular Medicine Department, CHU de Québec Research Center, Université Laval, Québec, QC, Canada
| | - Marie-Pier Scott-Boyer
- Molecular Medicine Department, CHU de Québec Research Center, Université Laval, Québec, QC, Canada
| | - Mickaël Leclercq
- Molecular Medicine Department, CHU de Québec Research Center, Université Laval, Québec, QC, Canada
| | - Olivier Périn
- Digital Sciences Department, L'Oréal Advanced Research, Aulnay-sous-bois, France
| | - Arnaud Droit
- Molecular Medicine Department, CHU de Québec Research Center, Université Laval, Québec, QC, Canada
| |
Collapse
|
9
|
Santos A, Colaço AR, Nielsen AB, Niu L, Strauss M, Geyer PE, Coscia F, Albrechtsen NJW, Mundt F, Jensen LJ, Mann M. A knowledge graph to interpret clinical proteomics data. Nat Biotechnol 2022; 40:692-702. [PMID: 35102292 PMCID: PMC9110295 DOI: 10.1038/s41587-021-01145-6] [Citation(s) in RCA: 93] [Impact Index Per Article: 31.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/18/2020] [Accepted: 11/01/2021] [Indexed: 12/14/2022]
Abstract
Implementing precision medicine hinges on the integration of omics data, such as proteomics, into the clinical decision-making process, but the quantity and diversity of biomedical data, and the spread of clinically relevant knowledge across multiple biomedical databases and publications, pose a challenge to data integration. Here we present the Clinical Knowledge Graph (CKG), an open-source platform currently comprising close to 20 million nodes and 220 million relationships that represent relevant experimental data, public databases and literature. The graph structure provides a flexible data model that is easily extendable to new nodes and relationships as new databases become available. The CKG incorporates statistical and machine learning algorithms that accelerate the analysis and interpretation of typical proteomics workflows. Using a set of proof-of-concept biomarker studies, we show how the CKG might augment and enrich proteomics data and help inform clinical decision-making.
Collapse
Affiliation(s)
- Alberto Santos
- NNF Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark.
- Li-Ka Shing Big Data Institute, University of Oxford, Oxford, UK.
- Center for Health Data Science, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark.
| | - Ana R Colaço
- NNF Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark
| | - Annelaura B Nielsen
- NNF Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark
| | - Lili Niu
- NNF Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark
| | - Maximilian Strauss
- NNF Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark
- OmicEra Diagnostics GmbH, Planegg, Germany
| | - Philipp E Geyer
- NNF Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark
- OmicEra Diagnostics GmbH, Planegg, Germany
- Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
| | - Fabian Coscia
- NNF Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark
- Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
| | - Nicolai J Wewer Albrechtsen
- NNF Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark
- Department of Biomedical Sciences, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Department for Clinical Biochemistry, Rigshospitalet, University of Copenhagen, Copenhagen, Denmark
| | - Filip Mundt
- NNF Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark
| | - Lars Juhl Jensen
- NNF Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark
| | - Matthias Mann
- NNF Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark.
- Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany.
| |
Collapse
|
10
|
CTIMS: Automated Defect Detection Framework Using Computed Tomography. APPLIED SCIENCES-BASEL 2022. [DOI: 10.3390/app12042175] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/04/2022]
Abstract
Non-Destructive Testing (NDT) is one of the inspection techniques used in industrial tool inspection for quality and safety control. It is performed mainly using X-ray Computed Tomography (CT) to scan the internal structure of the tools and detect the potential defects. In this paper, we propose a new toolbox called the CT-Based Integrity Monitoring System (CTIMS-Toolbox) for automated inspection of CT images and volumes. It contains three main modules: first, the database management module, which handles the database and reads/writes queries to retrieve or save the CT data; second, the pre-processing module for registration and background subtraction; third, the defect inspection module to detect all the potential defects (missing parts, damaged screws, etc.) based on a hybrid system composed of computer vision and deep learning techniques. This paper explores the different features of the CTIMS-Toolbox, exposes the performance of its modules, compares its features to some existing CT inspection toolboxes, and provides some examples of the obtained results.
Collapse
|
11
|
Yang JJ, Gessner CR, Duerksen JL, Biber D, Binder JL, Ozturk M, Foote B, McEntire R, Stirling K, Ding Y, Wild DJ. Knowledge graph analytics platform with LINCS and IDG for Parkinson's disease target illumination. BMC Bioinformatics 2022; 23:37. [PMID: 35021991 PMCID: PMC8756622 DOI: 10.1186/s12859-021-04530-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/12/2021] [Accepted: 12/13/2021] [Indexed: 11/12/2022] Open
Abstract
Background LINCS, "Library of Integrated Network-based Cellular Signatures", and IDG, "Illuminating the Druggable Genome", are both NIH projects and consortia that have generated rich datasets for the study of the molecular basis of human health and disease. LINCS L1000 expression signatures provide unbiased systems/omics experimental evidence. IDG provides compiled and curated knowledge for illumination and prioritization of novel drug target hypotheses. Together, these resources can support a powerful new approach to identifying novel drug targets for complex diseases, such as Parkinson's disease (PD), which continues to inflict severe harm on human health, and resist traditional research approaches. Results Integrating LINCS and IDG, we built the Knowledge Graph Analytics Platform (KGAP) to support an important use case: identification and prioritization of drug target hypotheses for associated diseases. The KGAP approach includes strong semantics interpretable by domain scientists and a robust, high performance implementation of a graph database and related analytical methods. Illustrating the value of our approach, we investigated results from queries relevant to PD. Approved PD drug indications from IDG’s resource DrugCentral were used as starting points for evidence paths exploring chemogenomic space via LINCS expression signatures for associated genes, evaluated as target hypotheses by integration with IDG. The KG-analytic scoring function was validated against a gold standard dataset of genes associated with PD as elucidated, published mechanism-of-action drug targets, also from DrugCentral. IDG's resource TIN-X was used to rank and filter KGAP results for novel PD targets, and one, SYNGR3 (Synaptogyrin-3), was manually investigated further as a case study and plausible new drug target for PD. Conclusions The synergy of LINCS and IDG, via KG methods, empowers graph analytics methods for the investigation of the molecular basis of complex diseases, and specifically for identification and prioritization of novel drug targets. The KGAP approach enables downstream applications via integration with resources similarly aligned with modern KG methodology. The generality of the approach indicates that KGAP is applicable to many disease areas, in addition to PD, the focus of this paper. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04530-9.
Collapse
|
12
|
Williams CJ, Richardson DC, Richardson JS. The importance of residue-level filtering and the Top2018 best-parts dataset of high-quality protein residues. Protein Sci 2022; 31:290-300. [PMID: 34779043 PMCID: PMC8740842 DOI: 10.1002/pro.4239] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/03/2021] [Revised: 11/08/2021] [Accepted: 11/09/2021] [Indexed: 01/03/2023]
Abstract
We have curated a high-quality, "best-parts" reference dataset of about 3 million protein residues in about 15,000 PDB-format coordinate files, each containing only residues with good electron density support for a physically acceptable model conformation. The resulting prefiltered data typically contain the entire core of each chain, in quite long continuous fragments. Each reference file is a single protein chain, and the total set of files were selected for low redundancy, high resolution, good MolProbity score, and other chain-level criteria. Then each residue was critically tested for adequate local map quality to firmly support its conformation, which must also be free of serious clashes or covalent-geometry outliers. The resulting Top2018 prefiltered datasets have been released on the Zenodo online web service and are freely available for all uses under a Creative Commons license. Currently, one dataset is residue filtered on main chain plus Cβ atoms, and a second dataset is full-residue filtered; each is available at four different sequence-identity levels. Here, we illustrate both statistics and examples that show the beneficial consequences of residue-level filtering. That process is necessary because even the best of structures contain a few highly disordered local regions with poor density and low-confidence conformations that should not be included in reference data. Therefore, the open distribution of these very large, prefiltered reference datasets constitutes a notable advance for structural bioinformatics and the fields that depend upon it.
Collapse
|
13
|
Zhu Q, Nguyễn ÐT, Sheils T, Alyea G, Sid E, Xu Y, Dickens J, Mathé EA, Pariser A. Scientific evidence based rare disease research discovery with research funding data in knowledge graph. Orphanet J Rare Dis 2021; 16:483. [PMID: 34794473 PMCID: PMC8600882 DOI: 10.1186/s13023-021-02120-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/13/2021] [Accepted: 11/06/2021] [Indexed: 01/20/2023] Open
Abstract
BACKGROUND Limited knowledge and unclear underlying biology of many rare diseases pose significant challenges to patients, clinicians, and scientists. To address these challenges, there is an urgent need to inspire and encourage scientists to propose and pursue innovative research studies that aim to uncover the genetic and molecular causes of more rare diseases and ultimately to identify effective therapeutic solutions. A clear understanding of current research efforts, knowledge/research gaps, and funding patterns as scientific evidence is crucial to systematically accelerate the pace of research discovery in rare diseases, which is an overarching goal of this study. METHODS To semantically represent NIH funding data for rare diseases and advance its use of effectively promoting rare disease research, we identified NIH funded projects for rare diseases by mapping GARD diseases to the project based on project titles; subsequently we presented and managed those identified projects in a knowledge graph using Neo4j software, hosted at NCATS, based on a pre-defined data model that captures semantics among the data. With this developed knowledge graph, we were able to perform several case studies to demonstrate scientific evidence generation for supporting rare disease research discovery. RESULTS Of 5001 rare diseases belonging to 32 distinct disease categories, we identified 1294 diseases that are mapped to 45,647 distinct, NIH-funded projects obtained from the NIH ExPORTER by implementing semantic annotation of project titles. To capture semantic relationships presenting amongst mapped research funding data, we defined a data model comprised of seven primary classes and corresponding object and data properties. A Neo4j knowledge graph based on this predefined data model has been developed, and we performed multiple case studies over this knowledge graph to demonstrate its use in directing and promoting rare disease research. CONCLUSION We developed an integrative knowledge graph with rare disease funding data and demonstrated its use as a source from where we can effectively identify and generate scientific evidence to support rare disease research. With the success of this preliminary study, we plan to implement advanced computational approaches for analyzing more funding related data, e.g., project abstracts and PubMed article abstracts, and linking to other types of biomedical data to perform more sophisticated research gap analysis and identify opportunities for future research in rare diseases.
Collapse
Affiliation(s)
- Qian Zhu
- Division of Pre-Clinical Innovation, National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH), Rockville, MD, 20850, USA.
| | - Ðắc-Trung Nguyễn
- Division of Pre-Clinical Innovation, National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH), Rockville, MD, 20850, USA
| | - Timothy Sheils
- Division of Pre-Clinical Innovation, National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH), Rockville, MD, 20850, USA
| | | | - Eric Sid
- Office of Rare Diseases Research, National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH), Bethesda, MD, 20892, USA
| | - Yanji Xu
- Office of Rare Diseases Research, National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH), Bethesda, MD, 20892, USA
| | - James Dickens
- Office of Rare Diseases Research, National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH), Bethesda, MD, 20892, USA
| | - Ewy A Mathé
- Division of Pre-Clinical Innovation, National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH), Rockville, MD, 20850, USA
| | - Anne Pariser
- Office of Rare Diseases Research, National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH), Bethesda, MD, 20892, USA
| |
Collapse
|
14
|
Kim L, Yahia E, Segonds F, Véron P, Mallet A. i-Dataquest: A heterogeneous information retrieval tool using data graph for the manufacturing industry. COMPUT IND 2021. [DOI: 10.1016/j.compind.2021.103527] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
|
15
|
Bundhoo E, Ghoorah AW, Jaufeerally-Fakim Y. TAGOPSIN: collating taxa-specific gene and protein functional and structural information. BMC Bioinformatics 2021; 22:517. [PMID: 34688246 PMCID: PMC8541804 DOI: 10.1186/s12859-021-04429-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/13/2020] [Accepted: 10/06/2021] [Indexed: 11/25/2022] Open
Abstract
Background The wealth of biological information available nowadays in public databases has triggered an unprecedented rise in multi-database search and data retrieval for obtaining detailed information about key functional and structural entities. This concerns investigations ranging from gene or genome analysis to protein structural analysis. However, the retrieval of interconnected data from a number of different databases is very often done repeatedly in an unsystematic way. Results Here, we present TAxonomy, Gene, Ontology, Protein, Structure INtegrated (TAGOPSIN), a command line program written in Java for rapid and systematic retrieval of select data from seven of the most popular public biological databases relevant to comparative genomics and protein structure studies. The program allows a user to retrieve organism-centred data and assemble them in a single data warehouse which constitutes a useful resource for several biological applications. TAGOPSIN was tested with a number of organisms encompassing eukaryotes, prokaryotes and viruses. For example, it successfully integrated data for about 17,000 UniProt entries of Homo sapiens and 21 UniProt entries of human coronavirus. Conclusion TAGOPSIN demonstrates efficient data integration whereby manipulation of interconnected data is more convenient than doing multi-database queries. The program facilitates for instance interspecific comparative analyses of protein-coding genes in a molecular evolutionary study, or identification of taxa-specific protein domains and three-dimensional structures. TAGOPSIN is available as a JAR file at https://github.com/ebundhoo/TAGOPSIN and is released under the GNU General Public License. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04429-5.
Collapse
Affiliation(s)
- Eshan Bundhoo
- Department of Agricultural and Food Science, Faculty of Agriculture, University of Mauritius, Reduit, 80837, Mauritius
| | - Anisah W Ghoorah
- Department of Digital Technologies, Faculty of Information, Communication and Digital Technologies, University of Mauritius, Reduit, 80837, Mauritius.
| | - Yasmina Jaufeerally-Fakim
- Department of Agricultural and Food Science, Faculty of Agriculture, University of Mauritius, Reduit, 80837, Mauritius
| |
Collapse
|
16
|
Dedié A, Bleimehl T, Täger J, Preusse M, Hrabě de Angelis M, Jarasch A. DZDconnect: mit vernetzten Daten gegen Diabetes. DIABETOLOGE 2021. [DOI: 10.1007/s11428-021-00807-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/12/2022]
|
17
|
Bukhari SAC, Pawar S, Mandell J, Kleinstein SH, Cheung KH. LinkedImm: a linked data graph database for integrating immunological data. BMC Bioinformatics 2021; 22:105. [PMID: 34433410 PMCID: PMC8385794 DOI: 10.1186/s12859-021-04031-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/15/2021] [Accepted: 02/16/2021] [Indexed: 11/23/2022] Open
Abstract
Background Many systems biology studies leverage the integration of multiple data types (across different data sources) to offer a more comprehensive view of the biological system being studied. While SQL (Structured Query Language) databases are popular in the biomedical domain, NoSQL database technologies have been used as a more relationship-based, flexible and scalable method of data integration. Results We have created a graph database integrating data from multiple sources. In addition to using a graph-based query language (Cypher) for data retrieval, we have developed a web-based dashboard that allows users to easily browse and plot data without the need to learn Cypher. We have also implemented a visual graph query interface for users to browse graph data. Finally, we have built a prototype to allow the user to query the graph database in natural language. Conclusion We have demonstrated the feasibility and flexibility of using a graph database for storing and querying immunological data with complex biological relationships. Querying a graph database through such relationships has the potential to discover novel relationships among heterogeneous biological data and metadata.
Collapse
Affiliation(s)
- Syed Ahmad Chan Bukhari
- Division of Computer Science, Mathematics and Science, Collins College of Professional Studies, St. John's University, New York, NY, USA
| | - Shrikant Pawar
- Department of Genetics, Yale School of Medicine, New Haven, CT, USA
| | - Jeff Mandell
- Program in Computational Biology and Bioinformatics, Yale School of Medicine, New Haven, CT, USA
| | - Steven H Kleinstein
- Program in Computational Biology and Bioinformatics, Yale School of Medicine, New Haven, CT, USA.,Department of Pathology, Yale School of Medicine, New Haven, CT, USA
| | - Kei-Hoi Cheung
- Program in Computational Biology and Bioinformatics, Yale School of Medicine, New Haven, CT, USA. .,Department of Emergency Medicine, Yale School of Medicine, New Haven, CT, USA. .,Yale Center for Medical Informatics, Yale School of Medicine, New Haven, CT, USA.
| |
Collapse
|
18
|
García Del Valle EP, Lagunes García G, Prieto Santamaría L, Zanin M, Menasalvas Ruiz E, Rodríguez-González A. DisMaNET: A network-based tool to cross map disease vocabularies. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2021; 207:106233. [PMID: 34157517 DOI: 10.1016/j.cmpb.2021.106233] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/26/2020] [Accepted: 06/02/2021] [Indexed: 06/13/2023]
Abstract
BACKGROUND AND OBJECTIVES The growing integration of healthcare sources is improving our understanding of diseases. Cross-mapping resources such as UMLS play a very important role in this area, but their coverage is still incomplete. With the aim to facilitate the integration and interoperability of biological, clinical and literary sources in studies of diseases, we built DisMaNET, a system to cross-map terms from disease vocabularies by leveraging the power and interpretability of network analysis. METHODS First, we collected and normalized data from 8 disease vocabularies and mapping sources to generate our datasets. Next, we built DisMaNET by integrating the generated datasets into a Neo4j graph database. Then we exploited the query mechanisms of Neo4j to cross-map disease terms of different vocabularies with a relevance score metric and contrasted the results with some state-of-the-art solutions. Finally, we made our system publicly available for its exploitation and evaluation both through a graphical user interface and REST APIs. RESULTS DisMaNET contains almost half a million nodes and near nine hundred thousand edges, including hierarchical and mapping relationships. Its query capabilities enabled the detection of connections between disease vocabularies that are not present in major mapping sources such as UMLS and the Disease Ontology, even for rare diseases. Furthermore, DisMaNET was capable of obtaining more than 80% of the mappings with UMLS reported in MonDO and DisGeNET, and it was successfully exploited to resolve the missing mappings in the DISNET project. CONCLUSIONS DisMaNET is a powerful, intuitive and publicly available system to cross-map terms from different disease vocabularies. Our study proves that it is a competitive alternative to existing mapping systems, incorporating the potential of network analysis and the interpretability of the results through a visual interface as its main advantages. Expansion with new sources, versioning and the improvement of the search and scoring algorithms are envisioned as future lines of work.
Collapse
Affiliation(s)
| | - Gerardo Lagunes García
- ETS de Ingenieros Informáticos. Universidad Politécnica de Madrid. Boadilla del Monte, Madrid, Spain; Centro de Tecnología Biomédica, ETS Ingenieros Informáticos. Universidad Politécnica de Madrid. Pozuelo de Alarcón, Madrid, Spain
| | - Lucía Prieto Santamaría
- Centro de Tecnología Biomédica, ETS Ingenieros Informáticos. Universidad Politécnica de Madrid. Pozuelo de Alarcón, Madrid, Spain
| | - Massimiliano Zanin
- Instituto de Física Interdisciplinar y Sistemas Complejos IFISC (CSIC-UIB), Campus UIB, Palma de Mallorca, Spain
| | - Ernestina Menasalvas Ruiz
- ETS de Ingenieros Informáticos. Universidad Politécnica de Madrid. Boadilla del Monte, Madrid, Spain; Centro de Tecnología Biomédica, ETS Ingenieros Informáticos. Universidad Politécnica de Madrid. Pozuelo de Alarcón, Madrid, Spain
| | - Alejandro Rodríguez-González
- ETS de Ingenieros Informáticos. Universidad Politécnica de Madrid. Boadilla del Monte, Madrid, Spain; Centro de Tecnología Biomédica, ETS Ingenieros Informáticos. Universidad Politécnica de Madrid. Pozuelo de Alarcón, Madrid, Spain
| |
Collapse
|
19
|
Hassani‐Pak K, Singh A, Brandizi M, Hearnshaw J, Parsons JD, Amberkar S, Phillips AL, Doonan JH, Rawlings C. KnetMiner: a comprehensive approach for supporting evidence-based gene discovery and complex trait analysis across species. PLANT BIOTECHNOLOGY JOURNAL 2021; 19:1670-1678. [PMID: 33750020 PMCID: PMC8384599 DOI: 10.1111/pbi.13583] [Citation(s) in RCA: 48] [Impact Index Per Article: 12.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/30/2020] [Revised: 12/17/2020] [Accepted: 03/16/2021] [Indexed: 05/03/2023]
Abstract
The generation of new ideas and scientific hypotheses is often the result of extensive literature and database searches, but, with the growing wealth of public and private knowledge, the process of searching diverse and interconnected data to generate new insights into genes, gene networks, traits and diseases is becoming both more complex and more time-consuming. To guide this technically challenging data integration task and to make gene discovery and hypotheses generation easier for researchers, we have developed a comprehensive software package called KnetMiner which is open-source and containerized for easy use. KnetMiner is an integrated, intelligent, interactive gene and gene network discovery platform that supports scientists explore and understand the biological stories of complex traits and diseases across species. It features fast algorithms for generating rich interactive gene networks and prioritizing candidate genes based on knowledge mining approaches. KnetMiner is used in many plant science institutions and has been adopted by several plant breeding organizations to accelerate gene discovery. The software is generic and customizable and can therefore be readily applied to new species and data types; for example, it has been applied to pest insects and fungal pathogens; and most recently repurposed to support COVID-19 research. Here, we give an overview of the main approaches behind KnetMiner and we report plant-centric case studies for identifying genes, gene networks and trait relationships in Triticum aestivum (bread wheat), as well as, an evidence-based approach to rank candidate genes under a large Arabidopsis thaliana QTL. KnetMiner is available at: https://knetminer.org.
Collapse
|
20
|
Tarazona S, Arzalluz-Luque A, Conesa A. Undisclosed, unmet and neglected challenges in multi-omics studies. NATURE COMPUTATIONAL SCIENCE 2021; 1:395-402. [PMID: 38217236 DOI: 10.1038/s43588-021-00086-z] [Citation(s) in RCA: 57] [Impact Index Per Article: 14.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/18/2021] [Accepted: 05/17/2021] [Indexed: 01/15/2024]
Abstract
Multi-omics approaches have become a reality in both large genomics projects and small laboratories. However, the multi-omics research community still faces a number of issues that have either not been sufficiently discussed or for which current solutions are still limited. In this Perspective, we elaborate on these limitations and suggest points of attention for future research. We finally discuss new opportunities and challenges brought to the field by the rapid development of single-cell high-throughput molecular technologies.
Collapse
Affiliation(s)
- Sonia Tarazona
- Department of Applied Statistics, Operations Research and Quality, Universitat Politècnica de València, Valencia, Spain
| | - Angeles Arzalluz-Luque
- Department of Applied Statistics, Operations Research and Quality, Universitat Politècnica de València, Valencia, Spain
| | - Ana Conesa
- Microbiology and Cell Science Department, Institute for Food and Agricultural Research, University of Florida, Gainesville, FL, USA.
- Genetics Institute, University of Florida, Gainesville, FL, USA.
- Institute for Integrative Systems Biology, Spanish National Research Council, Valencia, Spain.
| |
Collapse
|
21
|
Timón-Reina S, Rincón M, Martínez-Tomás R. An overview of graph databases and their applications in the biomedical domain. Database (Oxford) 2021; 2021:baab026. [PMID: 34003247 PMCID: PMC8130509 DOI: 10.1093/database/baab026] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/28/2020] [Revised: 03/24/2021] [Accepted: 04/30/2021] [Indexed: 01/18/2023]
Abstract
Over the past couple of decades, the explosion of densely interconnected data has stimulated the research, development and adoption of graph database technologies. From early graph models to more recent native graph databases, the landscape of implementations has evolved to cover enterprise-ready requirements. Because of the interconnected nature of its data, the biomedical domain has been one of the early adopters of graph databases, enabling more natural representation models and better data integration workflows, exploration and analysis facilities. In this work, we survey the literature to explore the evolution, performance and how the most recent graph database solutions are applied in the biomedical domain, compiling a great variety of use cases. With this evidence, we conclude that the available graph database management systems are fit to support data-intensive, integrative applications, targeted at both basic research and exploratory tasks closer to the clinic.
Collapse
Affiliation(s)
- Santiago Timón-Reina
- Departamento de Inteligencia Artificial, Universidad Nacional de Educación a Distancia (UNED), C/Juan del Rosal, 16 Ciudad Universitaria, Madrid 28040, Spain
| | - Mariano Rincón
- Departamento de Inteligencia Artificial, Universidad Nacional de Educación a Distancia (UNED), C/Juan del Rosal, 16 Ciudad Universitaria, Madrid 28040, Spain
| | - Rafael Martínez-Tomás
- Departamento de Inteligencia Artificial, Universidad Nacional de Educación a Distancia (UNED), C/Juan del Rosal, 16 Ciudad Universitaria, Madrid 28040, Spain
| |
Collapse
|
22
|
Friedrichs M. BioDWH2: an automated graph-based data warehouse and mapping tool. J Integr Bioinform 2021; 18:167-176. [PMID: 33618440 PMCID: PMC8238471 DOI: 10.1515/jib-2020-0033] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/12/2020] [Revised: 01/25/2021] [Accepted: 01/25/2021] [Indexed: 12/28/2022] Open
Abstract
Data integration plays a vital role in scientific research. In biomedical research, the OMICS fields have shown the need for larger datasets, like proteomics, pharmacogenomics, and newer fields like foodomics. As research projects require multiple data sources, mapping between these sources becomes necessary. Utilized workflow systems and integration tools therefore need to process large amounts of heterogeneous data formats, check for data source updates, and find suitable mapping methods to cross-reference entities from different databases. This article presents BioDWH2, an open-source, graph-based data warehouse and mapping tool, capable of helping researchers with these issues. A workspace centered approach allows project-specific data source selections and Neo4j or GraphQL server tools enable quick access to the database for analysis. The BioDWH2 tools are available to the scientific community at https://github.com/BioDWH2.
Collapse
Affiliation(s)
- Marcel Friedrichs
- Bielefeld University, Faculty of Technology, Bioinformatics / Medical Informatics Department, Bielefeld, Germany
| |
Collapse
|
23
|
Simpson CM, Gnad F. Applying graph database technology for analyzing perturbed co-expression networks in cancer. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2020; 2020:6029398. [PMID: 33306799 PMCID: PMC7731929 DOI: 10.1093/database/baaa110] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/17/2020] [Revised: 09/20/2020] [Accepted: 11/30/2020] [Indexed: 11/13/2022]
Abstract
Graph representations provide an elegant solution to capture and analyze complex molecular mechanisms in the cell. Co-expression networks are undirected graph representations of transcriptional co-behavior indicating (co-)regulations, functional modules or even physical interactions between the corresponding gene products. The growing avalanche of available RNA sequencing (RNAseq) data fuels the construction of such networks, which are usually stored in relational databases like most other biological data. Inferring linkage by recursive multiple-join statements, however, is computationally expensive and complex to design in relational databases. In contrast, graph databases store and represent complex interconnected data as nodes, edges and properties, making it fast and intuitive to query and analyze relationships. While graph-based database technologies are on their way from a fringe domain to going mainstream, there are only a few studies reporting their application to biological data. We used the graph database management system Neo4j to store and analyze co-expression networks derived from RNAseq data from The Cancer Genome Atlas. Comparing co-expression in tumors versus healthy tissues in six cancer types revealed significant perturbation tracing back to erroneous or rewired gene regulation. Applying centrality, community detection and pathfinding graph algorithms uncovered the destruction or creation of central nodes, modules and relationships in co-expression networks of tumors. Given the speed, accuracy and straightforwardness of managing these densely connected networks, we conclude that graph databases are ready for entering the arena of biological data.
Collapse
Affiliation(s)
- Claire M Simpson
- Department of Bioinformatics and Data Science, Cell Signaling Technology Inc., 3 Trask Lane, Danvers, MA 01923, USA
| | - Florian Gnad
- Department of Bioinformatics and Data Science, Cell Signaling Technology Inc., 3 Trask Lane, Danvers, MA 01923, USA
| |
Collapse
|
24
|
Adoni WYH, Nahhal T, Krichen M, El byed A, Assayad I. DHPV: a distributed algorithm for large-scale graph partitioning. JOURNAL OF BIG DATA 2020; 7:76. [PMID: 32953386 PMCID: PMC7493068 DOI: 10.1186/s40537-020-00357-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 05/19/2020] [Accepted: 09/03/2020] [Indexed: 06/11/2023]
Abstract
Big graphs are part of the movement of "Not Only SQL" databases (also called NoSQL) focusing on the relationships between data, rather than the values themselves. The data is stored in vertices while the edges model the interactions or relationships between these data. They offer flexibility in handling data that is strongly connected to each other. The analysis of a big graph generally involves exploring all of its vertices. Thus, this operation is costly in time and resources because big graphs are generally composed of millions of vertices connected through billions of edges. Consequently, the graph algorithms are expansive compared to the size of the big graph, and are therefore ineffective for data exploration. Thus, partitioning the graph stands out as an efficient and less expensive alternative for exploring a big graph. This technique consists in partitioning the graph into a set of k sub-graphs in order to reduce the complexity of the queries. Nevertheless, it presents many challenges because it is an NP-complete problem. In this article, we present DPHV (Distributed Placement of Hub-Vertices) an efficient parallel and distributed heuristic for large-scale graph partitioning. An application on a real-world graphs demonstrates the feasibility and reliability of our method. The experiments carried on a 10-nodes Spark cluster proved that the proposed methodology achieves significant gain in term of time and outperforms JA-BE-JA, Greedy, DFEP.
Collapse
Affiliation(s)
| | - Tarik Nahhal
- LIMSAD Laboratory, Faculty of sciences, Hassan II University of Casablanca, Casablanca, Morocco
| | - Moez Krichen
- Faculty of CSIT, Albaha University, Al Bahah, Saudi Arabia
- ReDCAD Laboratory, University of Sfax, Sfax, Tunisia
| | - Abdeltif El byed
- LIMSAD Laboratory, Faculty of sciences, Hassan II University of Casablanca, Casablanca, Morocco
| | - Ismail Assayad
- LIMSAD Laboratory, ENSEM, Hassan II University of Casablanca, Casablanca, Morocco
| |
Collapse
|
25
|
Bolduc B, Hodgkins SB, Varner RK, Crill PM, McCalley CK, Chanton JP, Tyson GW, Riley WJ, Palace M, Duhaime MB, Hough MA, Saleska SR, Sullivan MB, Rich VI. The IsoGenie database: an interdisciplinary data management solution for ecosystems biology and environmental research. PeerJ 2020. [DOI: 10.7717/peerj.9467] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022] Open
Abstract
Modern microbial and ecosystem sciences require diverse interdisciplinary teams that are often challenged in “speaking” to one another due to different languages and data product types. Here we introduce the IsoGenie Database (IsoGenieDB; https://isogenie-db.asc.ohio-state.edu/), a de novo developed data management and exploration platform, as a solution to this challenge of accurately representing and integrating heterogenous environmental and microbial data across ecosystem scales. The IsoGenieDB is a public and private data infrastructure designed to store and query data generated by the IsoGenie Project, a ~10 year DOE-funded project focused on discovering ecosystem climate feedbacks in a thawing permafrost landscape. The IsoGenieDB provides (i) a platform for IsoGenie Project members to explore the project’s interdisciplinary datasets across scales through the inherent relationships among data entities, (ii) a framework to consolidate and harmonize the datasets needed by the team’s modelers, and (iii) a public venue that leverages the same spatially explicit, disciplinarily integrated data structure to share published datasets. The IsoGenieDB is also being expanded to cover the NASA-funded Archaea to Atmosphere (A2A) project, which scales the findings of IsoGenie to a broader suite of Arctic peatlands, via the umbrella A2A Database (A2A-DB). The IsoGenieDB’s expandability and flexible architecture allow it to serve as an example ecosystems database.
Collapse
Affiliation(s)
- Benjamin Bolduc
- Department of Microbiology, The Ohio State University, Columbus, OH, USA
| | | | - Ruth K. Varner
- Earth Systems Research Center, Institute for the Study of Earth, Oceans and Space, University of New Hampshire, Durham, NH, USA
- Department of Earth Sciences, College of Engineering and Physical Sciences, University of New Hampshire, Durham, NH, USA
| | - Patrick M. Crill
- Department of Geological Sciences and Bolin Centre for Climate Research, Stockholm University, Stockholm, Sweden
| | - Carmody K. McCalley
- Thomas H. Gosnell School of Life Sciences, Rochester Institute of Technology, Rochester, NY, USA
| | - Jeffrey P. Chanton
- Department of Earth, Ocean, and Atmospheric Science, Florida State University, Tallahassee, FL, USA
| | - Gene W. Tyson
- Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, University of Queensland, Brisbane, QLD, Australia
| | - William J. Riley
- Climate and Ecosystem Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
| | - Michael Palace
- Earth Systems Research Center, Institute for the Study of Earth, Oceans and Space, University of New Hampshire, Durham, NH, USA
- Department of Earth Sciences, College of Engineering and Physical Sciences, University of New Hampshire, Durham, NH, USA
| | - Melissa B. Duhaime
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI, USA
| | - Moira A. Hough
- Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ, USA
| | - Scott R. Saleska
- Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ, USA
| | - Matthew B. Sullivan
- Department of Microbiology, The Ohio State University, Columbus, OH, USA
- Department of Civil, Environmental and Geodetic Engineering, The Ohio State University, Columbus, OH, USA
| | - Virginia I. Rich
- Department of Microbiology, The Ohio State University, Columbus, OH, USA
| | | |
Collapse
|
26
|
Aguilera-Mendoza L, Marrero-Ponce Y, Beltran JA, Tellez Ibarra R, Guillen-Ramirez HA, Brizuela CA. Graph-based data integration from bioactive peptide databases of pharmaceutical interest: toward an organized collection enabling visual network analysis. Bioinformatics 2020; 35:4739-4747. [PMID: 30994884 DOI: 10.1093/bioinformatics/btz260] [Citation(s) in RCA: 41] [Impact Index Per Article: 8.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/22/2018] [Revised: 03/30/2019] [Accepted: 04/10/2019] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Bioactive peptides have gained great attention in the academy and pharmaceutical industry since they play an important role in human health. However, the increasing number of bioactive peptide databases is causing the problem of data redundancy and duplicated efforts. Even worse is the fact that the available data is non-standardized and often dirty with data entry errors. Therefore, there is a need for a unified view that enables a more comprehensive analysis of the information on this topic residing at different sites. RESULTS After collecting web pages from a large variety of bioactive peptide databases, we organized the web content into an integrated graph database (starPepDB) that holds a total of 71 310 nodes and 348 505 relationships. In this graph structure, there are 45 120 nodes representing peptides, and the rest of the nodes are connected to peptides for describing metadata. Additionally, to facilitate a better understanding of the integrated data, a software tool (starPep toolbox) has been developed for supporting visual network analysis in a user-friendly way; providing several functionalities such as peptide retrieval and filtering, network construction and visualization, interactive exploration and exporting data options. AVAILABILITY AND IMPLEMENTATION Both starPepDB and starPep toolbox are freely available at http://mobiosd-hub.com/starpep/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Longendri Aguilera-Mendoza
- Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), 22860 Ensenada, Mexico.,Grupo de Investigación de Bioinformática, Universidad de las Ciencias Informáticas (UCI), CP 17100, La Habana, Cuba
| | - Yovani Marrero-Ponce
- Universidad San Francisco de Quito (USFQ), Grupo de Medicina Molecular y Translacional (MeM&T), Colegio de Ciencias de la Salud (COCSA), Escuela de Medicina, Edificio de Especialidades Médicas, CP 170901, Quito, Pichincha, Ecuador.,Grupo de Investigación Ambiental (GIA), Programas Ambientales, Facultad de Ingenierías, Fundacion Universitaria Tecnologico Comfenalco - Cartagena, Cr 44 D N° 30A - 91, CP 130015, Cartagena, Bolívar, Colombia
| | - Jesus A Beltran
- Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), 22860 Ensenada, Mexico
| | - Roberto Tellez Ibarra
- Grupo de Investigación de Bioinformática, Universidad de las Ciencias Informáticas (UCI), CP 17100, La Habana, Cuba
| | - Hugo A Guillen-Ramirez
- Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), 22860 Ensenada, Mexico
| | - Carlos A Brizuela
- Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), 22860 Ensenada, Mexico
| |
Collapse
|
27
|
Stothers JAM, Nguyen A. Can Neo4j Replace PostgreSQL in Healthcare? AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE PROCEEDINGS. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE 2020; 2020:646-653. [PMID: 32477687 PMCID: PMC7233060] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
Our current big data landscape prompts us to develop new analytical skills in order to make the best use of the abundance of datasets at hand. Traditionally, SQL databases such as PostgreSQL have been the databases of choice, and newer graph databases such as Neo4j have been relegated to the analysis of social network and transportation datasets. In this paper, we conduct a side by side comparison of PostgreSQL (using SQL) and Neo4j (using Cypher) using the MIMIC-III patient database as a case study. We found that, while Neo4j is more time intensive to implement, its queries are less complex and have a faster runtime than comparable queries performed in PostgreSQL. This leads to the conclusion that while PostgreSQL is adequate as a database, Neo4j should be considered a viable contender for health data storage and analysis.
Collapse
|
28
|
Schrodt J, Dudchenko A, Knaup-Gregori P, Ganzinger M. Graph-Representation of Patient Data: a Systematic Literature Review. J Med Syst 2020; 44:86. [PMID: 32166501 PMCID: PMC7067737 DOI: 10.1007/s10916-020-1538-4] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/03/2019] [Accepted: 02/07/2020] [Indexed: 11/30/2022]
Abstract
Graph theory is a well-established theory with many methods used in mathematics to study graph structures. In the field of medicine, electronic health records (EHR) are commonly used to store and analyze patient data. Consequently, it seems straight-forward to perform research on modeling EHR data as graphs. This systematic literature review aims to investigate the frontiers of the current research in the field of graphs representing and processing patient data. We want to show, which areas of research in this context need further investigation. The databases MEDLINE, Web of Science, IEEE Xplore and ACM digital library were queried by using the search terms health record, graph and related terms. Based on the "Preferred Reporting Items for Systematic Reviews and Meta-Analysis" (PRISMA) statement guidelines the articles were screened and evaluated using full-text analysis. Eleven out of 383 articles found in systematic literature review were finally included for analysis in this literature review. Most of them use graphs to represent temporal relations, often representing the connection among laboratory data points. Only two papers report that the graph data were further processed by comparing the patient graphs using similarity measurements. Graphs representing individual patients are hardly used in research context, only eleven papers considered such kind of graphs in their investigations. The potential of graph theoretical algorithms, which are already well established, could help increasing this research field, but currently there are too few papers to estimate how this area of research will develop. Altogether, the use of such patient graphs could be a promising technique to develop decision support systems for diagnosis, medication or therapy of patients using similarity measurements or different kinds of analysis.
Collapse
Affiliation(s)
- Jens Schrodt
- Institute for Medical Biometry and Informatics, Heidelberg University, Im Neuenheimer Feld 130.3, 69120, Heidelberg, Germany
| | - Aleksei Dudchenko
- Institute for Medical Biometry and Informatics, Heidelberg University, Im Neuenheimer Feld 130.3, 69120, Heidelberg, Germany.,School of Translational Information Technologies, ITMO University, Kronverksky Pr. 49, 197101, Saint-Petersburg, Russia
| | - Petra Knaup-Gregori
- Institute for Medical Biometry and Informatics, Heidelberg University, Im Neuenheimer Feld 130.3, 69120, Heidelberg, Germany
| | - Matthias Ganzinger
- Institute for Medical Biometry and Informatics, Heidelberg University, Im Neuenheimer Feld 130.3, 69120, Heidelberg, Germany.
| |
Collapse
|
29
|
Struck A, Walsh B, Buchanan A, Lee JA, Spangler R, Stuart JM, Ellrott K. Exploring Integrative Analysis Using the BioMedical Evidence Graph. JCO Clin Cancer Inform 2020; 4:147-159. [PMID: 32097025 PMCID: PMC7049249 DOI: 10.1200/cci.19.00110] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 01/16/2020] [Indexed: 12/22/2022] Open
Abstract
PURPOSE The analysis of cancer biology data involves extremely heterogeneous data sets, including information from RNA sequencing, genome-wide copy number, DNA methylation data reporting on epigenetic regulation, somatic mutations from whole-exome or whole-genome analyses, pathology estimates from imaging sections or subtyping, drug response or other treatment outcomes, and various other clinical and phenotypic measurements. Bringing these different resources into a common framework, with a data model that allows for complex relationships as well as dense vectors of features, will unlock integrated data set analysis. METHODS We introduce the BioMedical Evidence Graph (BMEG), a graph database and query engine for discovery and analysis of cancer biology. The BMEG is unique from other biologic data graphs in that sample-level molecular and clinical information is connected to reference knowledge bases. It combines gene expression and mutation data with drug-response experiments, pathway information databases, and literature-derived associations. RESULTS The construction of the BMEG has resulted in a graph containing > 41 million vertices and 57 million edges. The BMEG system provides a graph query-based application programming interface to enable analysis, with client code available for Python, Javascript, and R, and a server online at bmeg.io. Using this system, we have demonstrated several forms of cross-data set analysis to show the utility of the system. CONCLUSION The BMEG is an evolving resource dedicated to enabling integrative analysis. We have demonstrated queries on the system that illustrate mutation significance analysis, drug-response machine learning, patient-level knowledge-base queries, and pathway level analysis. We have compared the resulting graph to other available integrated graph systems and demonstrated the former is unique in the scale of the graph and the type of data it makes available.
Collapse
Affiliation(s)
- Adam Struck
- Biomedical Engineering, Oregon Health and Science University, Portland OR
| | - Brian Walsh
- Biomedical Engineering, Oregon Health and Science University, Portland OR
| | - Alexander Buchanan
- Biomedical Engineering, Oregon Health and Science University, Portland OR
| | - Jordan A. Lee
- Biomedical Engineering, Oregon Health and Science University, Portland OR
| | - Ryan Spangler
- Biomedical Engineering, Oregon Health and Science University, Portland OR
| | - Joshua M. Stuart
- Biomolecular Engineering Department, University of California, Santa Cruz, Santa Cruz, CA
- University of California Santa Cruz Genomics Institute, University of California, Santa Cruz Santa Cruz, CA
| | - Kyle Ellrott
- Biomedical Engineering, Oregon Health and Science University, Portland OR
| |
Collapse
|
30
|
Bukhari SAC, Mandell J, Kleinstein SH, Cheung KH. A linked data graph approach to integration of immunological data. PROCEEDINGS. IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE 2019; 2019:1742-1749. [PMID: 34707915 DOI: 10.1109/bibm47256.2019.8982986] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]
Abstract
Systems biology involves the integration of multiple data types (across different data sources) to offer a more complete picture of the biological system being studied. While many existing biological databases are implemented using the traditional SQL (Structured Query Language) database technology, NoSQL database technologies have been explored as a more relationship-based, flexible and scalable method of data integration. In this paper, we describe how to use the Neo4J graph database to integrate a variety of types of data sets in the context of systems vaccinology. Specifically, we have converted into a common graph model diverse types of vaccine response measurement data from the NIH/NIAID ImmPort data repository, pathway data from Reactome, influenza virus strains from WHO, and taxonomic data from NCBI Taxon. While Neo4J provides a graph-based query language (Cypher) for data retrieval, we develop a web-based dashboard for users to easily browse and visualize data without the need to learn Cypher. In addition, we have prototyped a natural language query interface for users to interact with our system. In conclusion, we demonstrate the feasibility of using a graph-based database for storing and querying immunological data with complex biological relationships. Querying a graph database through such relationships has the potential to reveal novel relationships among heterogeneous biological data.
Collapse
Affiliation(s)
- Syed Ahmad Chan Bukhari
- Division of Computer Science, Mathematics and Science Lesley H. & William L. Collins College of Professional Studies, St. John's University, New York, NY, USA
| | - Jeff Mandell
- Interdepartmental Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA
| | - Steven H Kleinstein
- Department of Pathology, Interdepartmental Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA
| | - Kei-Hoi Cheung
- Department of Emergency Medicine and Yale Center for Medical Informatics Computationa Biology and Bioinformaticcs, Yale University, New Haven, CT, USA
| |
Collapse
|
31
|
Tully B, Balleine RL, Hains PG, Zhong Q, Reddel RR, Robinson PJ. Addressing the Challenges of High-Throughput Cancer Tissue Proteomics for Clinical Application: ProCan. Proteomics 2019; 19:e1900109. [PMID: 31321850 DOI: 10.1002/pmic.201900109] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2019] [Revised: 07/11/2019] [Indexed: 11/09/2022]
Abstract
The cancer tissue proteome has enormous potential as a source of novel predictive biomarkers in oncology. Progress in the development of mass spectrometry (MS)-based tissue proteomics now presents an opportunity to exploit this by applying the strategies of comprehensive molecular profiling and big-data analytics that are refined in other fields of 'omics research. ProCan (ProCan is a registered trademark) is a program aiming to generate high-quality tissue proteomic data across a broad spectrum of cancer types. It is based on data-independent acquisition-MS proteomic analysis of annotated tissue samples sourced through collaboration with expert clinical and cancer research groups. The practical requirements of a high-throughput translational research program have shaped the approach that ProCan is taking to address challenges in study design, sample preparation, raw data acquisition, and data analysis. The ultimate goal is to establish a large proteomics knowledge-base that, in combination with other cancer 'omics data, will accelerate cancer research.
Collapse
Affiliation(s)
- Brett Tully
- ProCan, Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW, 2145, Australia
| | - Rosemary L Balleine
- ProCan, Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW, 2145, Australia
| | - Peter G Hains
- ProCan, Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW, 2145, Australia
| | - Qing Zhong
- ProCan, Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW, 2145, Australia
| | - Roger R Reddel
- ProCan, Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW, 2145, Australia
| | - Phillip J Robinson
- ProCan, Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW, 2145, Australia
| |
Collapse
|
32
|
Sachs J, Page R, Baskauf SJ, Pender J, Lujan-Toro B, Macklin J, Comspon Z. Training and hackathon on building biodiversity knowledge graphs. RESEARCH IDEAS AND OUTCOMES 2019. [DOI: 10.3897/rio.5.e36152] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
Knowledge graphs have the potential to unite disconnected digitized biodiversity data, and there are a number of efforts underway to build biodiversity knowledge graphs. More generally, the recent popularity of knowledge graphs, driven in part by the advent and success of the Google Knowledge Graph, has breathed life into the ongoing development of semantic web infrastructure and prototypes in the biodiversity informatics community. We describe a one week training event and hackathon that focused on applying three specific knowledge graph technologies – the Neptune graph database; Metaphactory; and Wikidata - to a diverse set of biodiversity use cases.
We give an overview of the training, the projects that were advanced throughout the week, and the critical discussions that emerged. We believe that the main barriers towards adoption of biodiversity knowledge graphs are the lack of understanding of knowledge graphs and the lack of adoption of shared unique identifiers. Furthermore, we believe an important advancement in the outlook of knowledge graph development is the emergence of Wikidata as an identifier broker and as a scoping tool. To remedy the current barriers towards biodiversity knowledge graph development, we recommend continued discussions at workshops and at conferences, which we expect to increase awareness and adoption of knowledge graph technologies.
Collapse
|
33
|
Duran‐Frigola M, Fernández‐Torras A, Bertoni M, Aloy P. Formatting biological big data for modern machine learning in drug discovery. WILEY INTERDISCIPLINARY REVIEWS-COMPUTATIONAL MOLECULAR SCIENCE 2018. [DOI: 10.1002/wcms.1408] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/15/2022]
Affiliation(s)
- Miquel Duran‐Frigola
- Joint IRB‐BSC‐CRG Program in Computational Biology, Institute for Research in Biomedicine (IRB Barcelona) Barcelona Institute of Science and Technology Barcelona Spain
| | - Adrià Fernández‐Torras
- Joint IRB‐BSC‐CRG Program in Computational Biology, Institute for Research in Biomedicine (IRB Barcelona) Barcelona Institute of Science and Technology Barcelona Spain
| | - Martino Bertoni
- Joint IRB‐BSC‐CRG Program in Computational Biology, Institute for Research in Biomedicine (IRB Barcelona) Barcelona Institute of Science and Technology Barcelona Spain
| | - Patrick Aloy
- Joint IRB‐BSC‐CRG Program in Computational Biology, Institute for Research in Biomedicine (IRB Barcelona) Barcelona Institute of Science and Technology Barcelona Spain
- Institució Catalana de Recerca i Estudis Avançats (ICREA) Barcelona Spain
| |
Collapse
|
34
|
Abstract
The understanding of molecular processes involved in a specific biological system can be significantly improved by combining and comparing different data sets and knowledge resources. However, these information sources often use different identification systems and an identifier conversion step is required before any integration effort. Mapping between identifiers is often provided by the reference information resources and several tools have been implemented to simplify their use. However, most of these tools do not combine the information provided by individual resources to increase the completeness of the mapping process. Also, deprecated identifiers from former versions of databases are not taken into account. Finally, finding automatically the most relevant path to map identifiers from one scope to the other is often not trivial. The Biological Entity Dictionary (BED) addresses these three challenges by relying on a graph data model describing possible relationships between entities and their identifiers. This model has been implemented using Neo4j and an R package provides functions to query the graph but also to create and feed a custom instance of the database. This design combined with a local installation of the graph database and a cache system make BED very efficient to convert large lists of identifiers.
Collapse
Affiliation(s)
- Patrice Godard
- Clarivate Analytics, Carlsbad, CA, 92008, USA.,UCB Pharma, Braine-l'Alleud, 1420, Belgium
| | | |
Collapse
|
35
|
Castillo-Lara S, Abril JF. PlanNET: homology-based predicted interactome for multiple planarian transcriptomes. Bioinformatics 2018; 34:1016-1023. [PMID: 29186384 PMCID: PMC5860622 DOI: 10.1093/bioinformatics/btx738] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/23/2017] [Revised: 10/24/2017] [Accepted: 11/23/2017] [Indexed: 01/30/2023] Open
Abstract
Motivation Planarians are emerging as a model organism to study regeneration in animals. However, the little available data of protein-protein interactions hinders the advances in understanding the mechanisms underlying its regenerating capabilities. Results We have developed a protocol to predict protein-protein interactions using sequence homology data and a reference Human interactome. This methodology was applied on 11 Schmidtea mediterranea transcriptomic sequence datasets. Then, using Neo4j as our database manager, we developed PlanNET, a web application to explore the multiplicity of networks and the associated sequence annotations. By mapping RNA-seq expression experiments onto the predicted networks, and allowing a transcript-centric exploration of the planarian interactome, we provide researchers with a useful tool to analyse possible pathways and to design new experiments, as well as a reproducible methodology to predict, store, and explore protein interaction networks for non-model organisms. Availability and implementation The web application PlanNET is available at https://compgen.bio.ub.edu/PlanNET. The source code used is available at https://compgen.bio.ub.edu/PlanNET/downloads. Contact jabril@ub.edu. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- S Castillo-Lara
- Computational Genomics Laboratory, Genetics, Microbiology and Statistics Department, Institut de Biomedicina (IBUB), Universitat de Barcelona, Barcelona, Catalonia, Spain
| | - J F Abril
- Computational Genomics Laboratory, Genetics, Microbiology and Statistics Department, Institut de Biomedicina (IBUB), Universitat de Barcelona, Barcelona, Catalonia, Spain
| |
Collapse
|
36
|
Himmelstein DS, Lizee A, Hessler C, Brueggeman L, Chen SL, Hadley D, Green A, Khankhanian P, Baranzini SE. Systematic integration of biomedical knowledge prioritizes drugs for repurposing. eLife 2017; 6:26726. [PMID: 28936969 PMCID: PMC5640425 DOI: 10.7554/elife.26726] [Citation(s) in RCA: 244] [Impact Index Per Article: 30.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/11/2017] [Accepted: 09/11/2017] [Indexed: 12/16/2022] Open
Abstract
The ability to computationally predict whether a compound treats a disease would improve the economy and success rate of drug approval. This study describes Project Rephetio to systematically model drug efficacy based on 755 existing treatments. First, we constructed Hetionet (neo4j.het.io), an integrative network encoding knowledge from millions of biomedical studies. Hetionet v1.0 consists of 47,031 nodes of 11 types and 2,250,197 relationships of 24 types. Data were integrated from 29 public resources to connect compounds, diseases, genes, anatomies, pathways, biological processes, molecular functions, cellular components, pharmacologic classes, side effects, and symptoms. Next, we identified network patterns that distinguish treatments from non-treatments. Then, we predicted the probability of treatment for 209,168 compound-disease pairs (het.io/repurpose). Our predictions validated on two external sets of treatment and provided pharmacological insights on epilepsy, suggesting they will help prioritize drug repurposing candidates. This study was entirely open and received realtime feedback from 40 community members.
Collapse
Affiliation(s)
- Daniel Scott Himmelstein
- Biological and Medical Informatics Program, University of California, San Francisco, San Francisco, United States.,Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, United States
| | - Antoine Lizee
- Department of Neurology, University of California, San Francisco, San Francisco, United States.,ITUN-CRTI-UMR 1064 Inserm, University of Nantes, Nantes, France
| | - Christine Hessler
- Department of Neurology, University of California, San Francisco, San Francisco, United States
| | - Leo Brueggeman
- Department of Neurology, University of California, San Francisco, San Francisco, United States.,University of Iowa, Iowa City, United States
| | - Sabrina L Chen
- Department of Neurology, University of California, San Francisco, San Francisco, United States.,Johns Hopkins University, Baltimore, United States
| | - Dexter Hadley
- Department of Pediatrics, University of California, San Fransisco, San Fransisco, United States.,Institute for Computational Health Sciences, University of California, San Francisco, San Francisco, United States
| | - Ari Green
- Department of Neurology, University of California, San Francisco, San Francisco, United States
| | - Pouya Khankhanian
- Department of Neurology, University of California, San Francisco, San Francisco, United States.,Center for Neuroengineering and Therapeutics, University of Pennsylvania, Philadelphia, United States
| | - Sergio E Baranzini
- Biological and Medical Informatics Program, University of California, San Francisco, San Francisco, United States.,Department of Neurology, University of California, San Francisco, San Francisco, United States
| |
Collapse
|