1. Sen S, Woodhouse MR, Portwood JL, Andorf CM. Maize Feature Store: a centralized resource to manage and analyze curated maize multi-omics features for machine learning applications. Database (Oxford) 2023; 2023:baad078. PMID: 37935586; PMCID: PMC10634621; DOI: 10.1093/database/baad078.
Abstract
The big-data analysis of complex data associated with maize genomes accelerates genetic research and improves agronomic traits. As a result, efforts have increased to integrate diverse datasets and extract meaning from these measurements. Machine learning models are a powerful tool for gaining knowledge from large and complex datasets, but these models must be trained on high-quality features to succeed. Currently, no resource hosts maize multi-omics datasets together with end-to-end tooling for evaluating and linking features to target gene annotations. Our work presents the Maize Feature Store (MFS), a versatile application that combines features built on complex data to facilitate exploration, modeling and analysis. Feature stores allow researchers to rapidly deploy machine learning applications by managing and providing access to frequently used features. We populated the MFS for the maize reference genome with over 14 000 gene-based features based on published genomic, transcriptomic, epigenomic, variomic and proteomic datasets. Using the MFS, we created an accurate pan-genome classification model with an AUC-ROC score of 0.87. The MFS is publicly available through the Maize Genetics and Genomics Database (MaizeGDB). Database URL: https://mfs.maizegdb.org/.
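The AUC-ROC metric reported in this abstract can be computed without any machine learning framework. A minimal sketch using the rank-based (Mann-Whitney) formulation, with made-up gene labels and classifier scores rather than real MFS features:

```python
def auc_roc(labels, scores):
    """AUC-ROC via the Mann-Whitney U statistic: the probability that a
    randomly chosen positive example scores above a random negative one."""
    # Assign 1-based ranks, averaging ranks within tied score blocks.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    pos_ranks = [r for r, y in zip(ranks, labels) if y == 1]
    n_pos, n_neg = len(pos_ranks), len(labels) - len(pos_ranks)
    u = sum(pos_ranks) - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

# Hypothetical labels (1 = core gene, 0 = dispensable) and model scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
print(auc_roc(labels, scores))
```

Here the single misranked positive/negative pairing out of nine pairs gives an AUC of 8/9, matching a pairwise count of correctly ordered positive-negative pairs.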
Affiliation(s)
- Shatabdi Sen: Department of Plant Pathology & Microbiology, Iowa State University, 1344 Advanced Teaching & Research Bldg, 2213 Pammel Dr, Ames, IA 50011, USA
- Margaret R Woodhouse: USDA-ARS, Corn Insects and Crop Genetics Research Unit, 819 Wallace Road, Ames, IA 50011, USA
- John L Portwood: USDA-ARS, Corn Insects and Crop Genetics Research Unit, 819 Wallace Road, Ames, IA 50011, USA
- Carson M Andorf: USDA-ARS, Corn Insects and Crop Genetics Research Unit, 819 Wallace Road, Ames, IA 50011, USA; Department of Computer Science, Iowa State University, Atanasoff Hall, 2434 Osborn Dr, Ames, IA 50011, USA
2. Big Data in Laboratory Medicine—FAIR Quality for AI? Diagnostics (Basel) 2022; 12:1923. PMID: 36010273; PMCID: PMC9406962; DOI: 10.3390/diagnostics12081923.
Abstract
Laboratory medicine is a digital science. Every large hospital produces a wealth of data each day, from simple numerical results such as sodium measurements to the highly complex output of "-omics" analyses, as well as quality control results and metadata. Processing, connecting, storing, and ordering extensive parts of these individual data requires Big Data techniques. Whereas novel technologies such as artificial intelligence and machine learning have exciting applications in augmenting laboratory medicine, the Big Data concept remains fundamental for any sophisticated data analysis in large databases. To make laboratory medicine data optimally usable for clinical and research purposes, they need to be FAIR: findable, accessible, interoperable, and reusable. This can be achieved, for example, by automated recording, connection of devices, efficient ETL (Extract, Transform, Load) processes, careful data governance, and modern data security solutions. Enriched with clinical data, laboratory medicine data allow a gain in pathophysiological insights, can improve patient care, or can be used to develop reference intervals for diagnostic purposes. Nevertheless, Big Data in laboratory medicine do not come without challenges: the growing number of analyses, and the data derived from them, demand careful stewardship. Laboratory medicine experts are and will be needed to drive this development, take an active role in the ongoing digitalization, and provide guidance for their clinical colleagues engaging with the laboratory data in research.
3. Pal S, Mondal S, Das G, Khatua S, Ghosh Z. Big data in biology: the hope and present-day challenges in it. Gene Reports 2020. DOI: 10.1016/j.genrep.2020.100869.
4. Liu J, Liu Q, Zhang L, Su S, Liu Y. Enabling massive XML-based biological data management in HBase. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:1994-2004. PMID: 31094692; DOI: 10.1109/tcbb.2019.2915811.
Abstract
Publishing biological data in XML formats is attractive for organizations that would like to provide their bioinformatics resources in an extensible and machine-readable format. In the era of big data, managing massive XML-based biological data has emerged as a challenging problem. As XML-based biological datasets continue to grow, traditional declarative query languages often cannot provide efficient query capabilities in terms of processing speed and scale. In this study, we report a novel platform to store and query massive XML-based biological data collections. A prototype tool for constructing HBase tables from XML-based biological data collections is first developed, and then a formal approach to transform the XML query model into the MapReduce query model is proposed. Finally, an evaluation of the query performance of the proposed approach on existing XML-based biological databases is presented, showing the performance advantages of the proposed solution. The source code of the massive XML-based biological data management platform is freely available at https://github.com/lyotvincent/X2H.
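The core idea of mapping XML records into an HBase table can be sketched in a few lines: each record becomes a row, and element paths become column qualifiers. This is only an illustration of the flattening step, not the paper's X2H tool; the document, tags, and the "d" column family are invented for the example:

```python
import xml.etree.ElementTree as ET

def xml_to_rows(xml_text, record_tag, id_attr):
    """Flatten each record element into an HBase-style
    {row key: {column qualifier: value}} mapping."""
    rows = {}
    root = ET.fromstring(xml_text)
    for rec in root.iter(record_tag):
        cells = {}

        def walk(elem, path):
            for child in elem:
                qual = f"{path}{child.tag}"
                if list(child):                       # nested element: recurse
                    walk(child, qual + ".")
                elif child.text and child.text.strip():  # leaf: store text
                    cells[qual] = child.text.strip()

        walk(rec, "d:")  # "d" stands in for an HBase column family
        rows[rec.get(id_attr)] = cells
    return rows

doc = """<proteins>
  <protein id="P1"><name>Kinase A</name><organism><taxon>9606</taxon></organism></protein>
  <protein id="P2"><name>Kinase B</name></protein>
</proteins>"""
rows = xml_to_rows(doc, "protein", "id")
```

With this layout, a sparse record (P2 above) simply stores fewer cells, which is exactly the property that makes column-oriented stores like HBase a good fit for heterogeneous XML.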
5. Nti-Addae Y, Matthews D, Ulat VJ, Syed R, Sempéré G, Pétel A, Renner J, Larmande P, Guignon V, Jones E, Robbins K. Benchmarking database systems for Genomic Selection implementation. Database (Oxford) 2019; 2019:baz096. PMID: 31508797; PMCID: PMC6737464; DOI: 10.1093/database/baz096.
Abstract
MOTIVATION With high-throughput genotyping systems now available, it has become feasible to fully integrate genotyping information into breeding programs. Using this information effectively requires DNA extraction and marker production facilities that can deploy the desired set of markers across samples with a turnaround rapid enough to allow selection before crosses need to be made. In reality, breeders often have a short window of time to make decisions by the time they are able to collect all their phenotyping data and receive corresponding genotyping data. This presents a challenge to organize information and utilize it in downstream analyses to support decisions made by breeders. Implementing genomic selection routinely as part of a breeding program therefore requires an efficient genotyping data storage system. We selected and benchmarked six popular open-source data storage systems, including relational database management and columnar storage systems. RESULTS We found that data extraction times are greatly influenced by the orientation in which genotype data are stored in a system. HDF5 consistently performed best, in part because it can work efficiently with both orientations of the allele matrix. AVAILABILITY http://gobiin1.bti.cornell.edu:6083/projects/GBM/repos/benchmarking/browse.
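The orientation effect described in the results can be illustrated without HDF5 at all: whichever axis of the allele matrix is stored contiguously is cheap to extract, while the other axis requires a strided gather. Plain Python lists stand in for on-disk datasets here (the real gains come from HDF5's chunked storage layout); the genotype values are invented:

```python
# Toy allele matrix: 3 samples x 4 markers, genotype calls coded 0/1/2.
by_sample = [
    [0, 1, 2, 0],   # sample 0
    [1, 1, 0, 2],   # sample 1
    [2, 0, 0, 1],   # sample 2
]

# The transposed orientation: markers as rows.
by_marker = [list(col) for col in zip(*by_sample)]

# "All markers for one sample" is one contiguous row in sample orientation...
sample1 = by_sample[1]
# ...but a strided gather across every row in marker orientation.
sample1_strided = [row[1] for row in by_marker]
assert sample1 == sample1_strided == [1, 1, 0, 2]

# Conversely, "one marker across all samples" (the typical genomic-selection
# query) is contiguous only in marker orientation.
marker2 = by_marker[2]
assert marker2 == [2, 0, 0]
```

A store fixed to one orientation pays the strided-gather cost for half of the query types, which is consistent with the benchmark's observation that HDF5 wins by handling both orientations efficiently.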
Affiliation(s)
- Victor Jun Ulat: Centro Internacional de Mejoramiento de Maíz y Trigo (CIMMYT)
- Raza Syed: Institute of Biotechnology, Cornell University
- Kelly Robbins: Section of Plant Breeding and Genetics, School of Integrative Plant Sciences, Cornell University
6. Wang X, Williams C, Liu ZH, Croghan J. Big data management challenges in health research: a literature review. Brief Bioinform 2019; 20:156-167. PMID: 28968677; DOI: 10.1093/bib/bbx086.
Abstract
Big data management for information centralization (i.e. making data of interest findable) and integration (i.e. making related data connectable) in health research is a defining challenge in biomedical informatics. While essential to create a foundation for knowledge discovery, optimized solutions to deliver high-quality and easy-to-use information resources are not thoroughly explored. In this review, we identify the gaps between current data management approaches and the need for new capacity to manage big data generated in advanced health research. Focusing on these unmet needs and well-recognized problems, we introduce state-of-the-art concepts, approaches and technologies for data management from computing academia and industry to explore improvement solutions. We explain the potential and significance of these advances for biomedical informatics. In addition, we discuss specific issues that have a great impact on technical solutions for developing the next generation of digital products (tools and data) to facilitate the raw-data-to-knowledge process in health research.
Affiliation(s)
- Xiaoming Wang: National Institute of Allergy and Infectious Diseases, NIH, Rockville, Maryland, USA
- Carolyn Williams: National Institute of Allergy and Infectious Diseases, NIH, Rockville, Maryland, USA
- Joe Croghan: National Institute of Allergy and Infectious Diseases, NIH, Rockville, Maryland, USA
7. Paris N, Mendis M, Daniel C, Murphy S, Tannier X, Zweigenbaum P. i2b2 implemented over SMART-on-FHIR. AMIA Jt Summits Transl Sci Proc 2018; 2017:369-378. PMID: 29888095; PMCID: PMC5961782.
Abstract
Integrating Biology and the Bedside (i2b2) is the de facto open-source medical tool for cohort discovery. Fast Healthcare Interoperability Resources (FHIR) is a new standard for exchanging health care information electronically. Substitutable Modular third-party Applications (SMART) defines the SMART-on-FHIR specification for how applications shall interface with Electronic Health Records (EHR) through FHIR. Related work has made it possible to produce FHIR data from an i2b2 instance, or to let i2b2 store FHIR datasets. In this paper, we extend i2b2 to search remotely into one or multiple SMART-on-FHIR Application Programming Interfaces (APIs). This enables federated queries, security, and terminology mapping, and bridges the gap between i2b2 and modern big-data technologies.
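A cohort criterion of the kind i2b2 would federate translates into a FHIR REST search request. A minimal sketch of building such a request URL; the endpoint is hypothetical, while the `gender` and `birthdate` search parameters (with the `ge` "greater-or-equal" date prefix) follow the FHIR search specification:

```python
from urllib.parse import urlencode

def fhir_search_url(base, resource, **params):
    """Build a FHIR REST search URL from a base endpoint, a resource type,
    and search parameters."""
    return f"{base}/{resource}?{urlencode(params)}"

# Hypothetical SMART-on-FHIR endpoint; i2b2-style criterion:
# female patients born on or after 1970-01-01.
url = fhir_search_url("https://ehr.example.org/fhir", "Patient",
                      gender="female", birthdate="ge1970-01-01")
print(url)
```

In a real SMART-on-FHIR deployment the request would additionally carry an OAuth2 bearer token obtained through the SMART authorization flow.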
Affiliation(s)
- Nicolas Paris: WIND-DSI, AP-HP, Paris, France; LIMSI, CNRS, Université Paris-Saclay, Orsay, France; INSERM, UMR S 1142, LIMICS, Paris, France
- Christel Daniel: WIND-DSI, AP-HP, Paris, France; INSERM, UMR S 1142, LIMICS, Paris, France
- Xavier Tannier: INSERM, UMR S 1142, LIMICS, Paris, France; Sorbonne Universités, UPMC Univ Paris 06, France
8. Christoph J, Knell C, Bosserhoff A, Naschberger E, Stürzl M, Rübner M, Seuss H, Ruh M, Prokosch HU, Sedlmayr B. Usability and suitability of the omics-integrating analysis platform tranSMART for translational research and education. Appl Clin Inform 2017; 8:1173-1183. PMID: 29270954; DOI: 10.4338/aci-2017-05-ra-0085.
Abstract
BACKGROUND Platforms like tranSMART assist researchers in analyzing clinical and corresponding omics data. Usability is an important, yet often overlooked, factor affecting adoption and meaningful use. Analyses of the specific needs of translational researchers, and considerations about applying such platforms in education, are rare. OBJECTIVES The aims of this study were to test whether tranSMART can be used in education and how well medical students and professional researchers can handle it; to identify which translational researchers, in terms of skills, experienced limitations, and available data, can take advantage of tranSMART; and to evaluate the usability and generate recommendations for improvement. METHODS An online test was completed by medical students (N = 109) and researchers (N = 26). The test comprised 13 tasks in the context of four typical research scenarios based on experimental and clinical data. A web questionnaire was provided to identify both the needs and the conditions of research, and to evaluate the system's usability based on the System Usability Scale (SUS). RESULTS Students and researchers were able to handle tranSMART well and coped with most scenarios: cohort identification, data exploration, hypothesis generation, and hypothesis validation were answered with a rate of correctness between 82 and 100%. Of the total, 72.2% of the teaching researchers considered tranSMART suitable for their lessons, and 84.6% of the researchers considered the platform useful for their daily work; 65.4% of the researchers named the nonavailability of a platform like tranSMART as a restriction on their research. The usability was rated "acceptable" with a SUS of 70.8. CONCLUSION tranSMART is potentially suitable for education purposes and fits most of the needs of translational researchers. Improvements are needed in the presentation of analysis results and in guiding users through the analysis, especially to ensure that analyses comply with the requirements of statistical testing.
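The SUS value of 70.8 cited above comes from the standard System Usability Scale formula: ten items rated 1-5, odd (positively worded) items contribute their score minus 1, even (negatively worded) items contribute 5 minus their score, and the sum is scaled by 2.5 to a 0-100 range. A sketch with one hypothetical participant's ratings:

```python
def sus_score(responses):
    """System Usability Scale score from ten 1-5 ratings.
    Odd items (index 0, 2, ...) contribute (score - 1);
    even items contribute (5 - score); sum is scaled by 2.5."""
    assert len(responses) == 10
    total = sum((r - 1) if i % 2 == 0 else (5 - r)
                for i, r in enumerate(responses))
    return total * 2.5

# Hypothetical single participant's ratings for items 1..10.
print(sus_score([4, 2, 4, 1, 5, 2, 4, 2, 4, 3]))
```

A study-level SUS, like the 70.8 reported here, is the mean of the per-participant scores.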
Affiliation(s)
- J Christoph: Department of Medical Informatics, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- C Knell: Department of Medical Informatics, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- A Bosserhoff: Institute of Biochemistry (Emil-Fischer-Center), Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- E Naschberger: Division of Molecular and Experimental Surgery, Department of Surgery, Translational Research Center Erlangen, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- M Stürzl: Division of Molecular and Experimental Surgery, Department of Surgery, Translational Research Center Erlangen, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- M Rübner: Department of Gynecology and Obstetrics, Comprehensive Cancer Center Erlangen-EMN, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- H Seuss: Department of Radiology, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- M Ruh: Department of Experimental Medicine 1, Nikolaus-Fiebiger-Center for Molecular Medicine, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- H-U Prokosch: Department of Medical Informatics, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- B Sedlmayr: Department of Medical Informatics, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
9. Using distributed data over HBase in big data analytics platform for clinical services. Comput Math Methods Med 2017; 2017:6120820. PMID: 29375652; PMCID: PMC5742497; DOI: 10.1155/2017/6120820.
Abstract
Big data analytics (BDA) is important for reducing healthcare costs, but it faces many challenges in data aggregation, maintenance, integration, translation, analysis, and security/privacy. The study objective, to establish an interactive BDA platform with simulated patient data using open-source software technologies, was achieved by constructing a platform framework on the Hadoop Distributed File System (HDFS) with HBase (a key-value NoSQL database). Distributed data structures were generated from benchmarked hospital-specific metadata of nine billion patient records. At optimized iteration, HDFS ingestion of HFiles into HBase store files showed sustained availability over hundreds of iterations; however, completing MapReduce into HBase required a week for 10 TB and a month for three billion (30 TB) indexed patient records, respectively. Inconsistencies found in MapReduce limited the capacity to generate and replicate data efficiently. Apache Spark and Drill showed high performance and high usability for technical support, but poor usability for clinical services. A hospital system based on patient-centric data was challenging to model in HBase, as not all data profiles could be fully integrated with the complex patient-to-hospital relationships. Nevertheless, we recommend HBase for securing patient data while querying entire hospital volumes in a simplified clinical event model across clinical services.
10. Kulkarni P, Frommolt P. Challenges in the setup of large-scale next-generation sequencing analysis workflows. Comput Struct Biotechnol J 2017; 15:471-477. PMID: 29158876; PMCID: PMC5683667; DOI: 10.1016/j.csbj.2017.10.001.
Abstract
While Next-Generation Sequencing (NGS) can now be considered an established analysis technology for research applications across the life sciences, the analysis workflows still require substantial bioinformatics expertise. Typical challenges include the appropriate selection of analytical software tools, the speedup of the overall procedure using HPC parallelization and acceleration technology, the development of automation strategies, data storage solutions and finally the development of methods for full exploitation of the analysis results across multiple experimental conditions. Recently, NGS has begun to expand into clinical environments, where it facilitates diagnostics enabling personalized therapeutic approaches, but is also accompanied by new technological, legal and ethical challenges. There are probably as many overall concepts for the analysis of the data as there are academic research institutions. Among these concepts are, for instance, complex IT architectures developed in-house, ready-to-use technologies installed on-site as well as comprehensive Everything as a Service (XaaS) solutions. In this mini-review, we summarize the key points to consider in the setup of the analysis architectures, mostly for scientific rather than diagnostic purposes, and provide an overview of the current state of the art and challenges of the field.
Affiliation(s)
- Pranav Kulkarni: Bioinformatics Core Facility, CECAD Research Center, University of Cologne, Germany
- Peter Frommolt: Bioinformatics Core Facility, CECAD Research Center, University of Cologne, Germany
11. Bao S, Plassard AJ, Landman BA, Gokhale A. Cloud engineering principles and technology enablers for medical image processing-as-a-service. Proc IEEE Int Conf Cloud Eng 2017; 2017:127-137. PMID: 28884169; PMCID: PMC5584067; DOI: 10.1109/ic2e.2017.23.
Abstract
Traditional in-house, laboratory-based medical imaging studies use hierarchical data structures (e.g., NFS file stores) or databases (e.g., COINS, XNAT) for storage and retrieval. The resulting performance is, however, impeded by standard network switches, which can saturate network bandwidth during transfer from storage to processing nodes for even moderate-sized studies. To that end, a cloud-based "medical image processing-as-a-service" offers promise by utilizing the ecosystem of Apache Hadoop, a flexible framework providing distributed, scalable, fault-tolerant storage and parallel computational modules, and HBase, a NoSQL database built atop Hadoop's distributed file system. Despite this promise, HBase's load distribution strategy of region split and merge is detrimental to the hierarchical organization of imaging data (e.g., project, subject, session, scan, slice). This paper makes two contributions to address these concerns by describing key cloud engineering principles and technology enhancements we made to the Apache Hadoop ecosystem for medical imaging applications. First, we propose a row-key design for HBase, a necessary step driven by the hierarchical organization of imaging data. Second, we propose a novel data allocation policy within HBase to strongly enforce collocation of hierarchically related imaging data. The proposed enhancements accelerate data processing by minimizing network usage and localizing processing to machines where the data already exist. Moreover, our approach is amenable to the traditional scan-, subject-, and project-level analysis procedures, and is compatible with standard command-line/scriptable image processing software. Experimental results for an illustrative sample of imaging data reveal that our new HBase policy yields a three-fold improvement in conversion time from classic DICOM to NIfTI file formats compared with the default HBase region split policy, and nearly a six-fold improvement over a commonly available network file system (NFS) approach, even for relatively small file sets. Moreover, file access latency is lower than with network-attached storage.
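The general idea behind a hierarchy-preserving HBase row key can be sketched directly: concatenate fixed-width, zero-padded fields so that byte-lexicographic key order matches the project/subject/session/scan/slice hierarchy, keeping related rows adjacent. This is only an illustration of the principle, not the paper's actual key layout; the field widths and the "brainmri" project name are invented:

```python
def image_row_key(project, subject, session, scan, slice_idx):
    """Composite row key: fixed-width, zero-padded fields make the
    lexicographic sort of keys identical to the hierarchical order,
    so hierarchically related rows stay collocated in one region."""
    return f"{project:>8}-{subject:05d}-{session:03d}-{scan:03d}-{slice_idx:04d}"

# Enumerate a small hierarchy in its natural (hierarchical) order...
keys = [image_row_key("brainmri", subj, 1, scan, sl)
        for subj in (1, 2) for scan in (1, 2) for sl in (1, 2)]
# ...and confirm the byte order of the keys matches it exactly.
assert keys == sorted(keys)
print(keys[0])
```

Without the zero padding, subject 10 would sort between subjects 1 and 2 and break collocation, which is why fixed-width fields matter for byte-ordered stores like HBase.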
Affiliation(s)
- Shunxing Bao: Dept of EECS, Vanderbilt University, Nashville, TN 37235, USA
12.
Abstract
In various biomedical applications that collect, handle, and manipulate data, the amounts of data tend to build up into the range identified as big data. In such cases, a design decision has to be made as to what type of database will be used to handle the data. According to past research, the default and classical solution in the biomedical domain has more often than not been the relational database. While this was the norm for a long time, there is an evident trend away from relational databases in favor of other database types and paradigms. Nevertheless, it remains of paramount importance to understand the interrelation between biomedical big data and relational databases. This chapter reviews the pros and cons, discussed and applied in previous research, of using relational databases to store biomedical big data.
Affiliation(s)
- N H Nisansa D de Silva: Department of Computer and Information Science, University of Oregon, 224 Deschutes Hall, 1477 E 13th Ave., Eugene, OR, 97403, USA
13. Schulz WL, Nelson BG, Felker DK, Durant TJS, Torres R. Evaluation of relational and NoSQL database architectures to manage genomic annotations. J Biomed Inform 2016; 64:288-295. PMID: 27810480; DOI: 10.1016/j.jbi.2016.10.015.
Abstract
While the adoption of next generation sequencing has rapidly expanded, the informatics infrastructure used to manage the data generated by this technology has not kept pace. Historically, relational databases have provided much of the framework for data storage and retrieval. Newer technologies based on NoSQL architectures may provide significant advantages in storage and query efficiency, thereby reducing the cost of data management. But their relative advantage when applied to biomedical data sets, such as genetic data, has not been characterized. To this end, we compared the storage, indexing, and query efficiency of a common relational database (MySQL), a document-oriented NoSQL database (MongoDB), and a relational database with NoSQL support (PostgreSQL). When used to store genomic annotations from the dbSNP database, we found the NoSQL architectures to outperform traditional, relational models for speed of data storage, indexing, and query retrieval in nearly every operation. These findings strongly support the use of novel database technologies to improve the efficiency of data management within the biological sciences.
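The relational-versus-document contrast this study benchmarks can be shown compactly. The sketch below uses SQLite in place of MySQL/MongoDB/PostgreSQL, and the dbSNP-like record (rsid, position, gene) is invented; it illustrates only the modeling difference, not the performance comparison:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Relational model: one typed column per annotation field.
cur.execute("CREATE TABLE snp_rel (rsid TEXT PRIMARY KEY, chrom TEXT, "
            "pos INTEGER, gene TEXT)")
# Document model: the whole annotation kept as a single JSON blob.
cur.execute("CREATE TABLE snp_doc (rsid TEXT PRIMARY KEY, doc TEXT)")

record = {"rsid": "rs12345", "chrom": "7", "pos": 117559590, "gene": "CFTR"}
cur.execute("INSERT INTO snp_rel VALUES (?, ?, ?, ?)",
            (record["rsid"], record["chrom"], record["pos"], record["gene"]))
cur.execute("INSERT INTO snp_doc VALUES (?, ?)",
            (record["rsid"], json.dumps(record)))

# Relational query: the engine filters on a typed, indexable column.
gene_rel = cur.execute("SELECT gene FROM snp_rel WHERE rsid = ?",
                       ("rs12345",)).fetchone()[0]
# Document query: fetch the blob and decode it in the application.
row = cur.execute("SELECT doc FROM snp_doc WHERE rsid = ?",
                  ("rs12345",)).fetchone()
gene_doc = json.loads(row[0])["gene"]
assert gene_rel == gene_doc == "CFTR"
```

The trade-off the paper measures follows from this difference: the document model absorbs schema changes (new annotation fields) without migrations, while the relational model exposes every field to the query planner.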
Affiliation(s)
- Wade L Schulz: Yale University, Department of Laboratory Medicine, New Haven, CT, United States
- Brent G Nelson: University of Minnesota, Department of Psychiatry, Minneapolis, MN, United States
- Thomas J S Durant: Yale University, Department of Laboratory Medicine, New Haven, CT, United States
- Richard Torres: Yale University, Department of Laboratory Medicine, New Haven, CT, United States
14. Wang S, Mares MA, Guo YK. CGDM: collaborative genomic data model for molecular profiling data using NoSQL. Bioinformatics 2016; 32:3654-3660. PMID: 27522085; DOI: 10.1093/bioinformatics/btw531.
Abstract
MOTIVATION High-throughput molecular profiling has greatly improved patient stratification and mechanistic understanding of diseases. With the increasing amount of data used in translational medicine studies in recent years, there is a need to improve the performance of data warehouses in terms of data retrieval and statistical processing. Both relational and key-value models have been used for managing molecular profiling data. Key-value models such as SeqWare have been shown to be particularly advantageous in terms of query processing speed for large datasets. However, further improvement can be achieved, particularly through better indexing techniques in the key-value models that take advantage of the query types specific to high-throughput molecular profiling data. RESULTS In this article, we introduce a Collaborative Genomic Data Model (CGDM), aimed at significantly increasing the query processing speed for the main classes of queries on genomic databases. CGDM creates three Collaborative Global Clustering Index Tables (CGCITs) to address the velocity and variety issues at the cost of limited extra volume. Several benchmarking experiments were carried out, comparing CGDM implemented on HBase to the traditional SQL data model (TDM) implemented on both HBase and MySQL Cluster, using large publicly available molecular profiling datasets taken from NCBI and HapMap. In the microarray case, CGDM on HBase performed up to 246 times faster than TDM on HBase and 7 times faster than TDM on MySQL Cluster. In the single-nucleotide polymorphism case, CGDM on HBase outperformed TDM on HBase by up to 351 times and TDM on MySQL Cluster by up to 9 times. AVAILABILITY AND IMPLEMENTATION The CGDM source code is available at https://github.com/evanswang/CGDM. CONTACT y.guo@imperial.ac.uk.
Affiliation(s)
- Shicai Wang: Data Science Institute, Imperial College London, London, UK
- Yi-Ke Guo: Data Science Institute, Imperial College London, London, UK; School of Computer Science, Shanghai University, Shanghai, China
15. Sempéré G, Philippe F, Dereeper A, Ruiz M, Sarah G, Larmande P. Gigwa: genotype investigator for genome-wide analyses. Gigascience 2016; 5:25. PMID: 27267926; PMCID: PMC4897896; DOI: 10.1186/s13742-016-0131-8.
Abstract
BACKGROUND Exploring the structure of genomes and analyzing their evolution is essential to understanding the ecological adaptation of organisms. However, with the large amounts of data being produced by next-generation sequencing, computational challenges arise in terms of storage, search, sharing, analysis and visualization. This is particularly true with regards to studies of genomic variation, which are currently lacking scalable and user-friendly data exploration solutions. DESCRIPTION Here we present Gigwa, a web-based tool that provides an easy and intuitive way to explore large amounts of genotyping data by filtering it not only on the basis of variant features, including functional annotations, but also on genotype patterns. The data storage relies on MongoDB, which offers good scalability properties. Gigwa can handle multiple databases and may be deployed in either single- or multi-user mode. In addition, it provides a wide range of popular export formats. CONCLUSIONS The Gigwa application is suitable for managing large amounts of genomic variation data. Its user-friendly web interface makes such processing widely accessible. It can either be simply deployed on a workstation or be used to provide a shared data portal for a given community of researchers.
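The two-level filtering described here, on variant features and on genotype patterns, can be sketched with plain Python dicts standing in for MongoDB documents. This mimics the kind of query Gigwa exposes but is not its actual schema; the variant IDs, annotations, and genotypes are invented:

```python
# Hypothetical variant documents as they might sit in a MongoDB collection.
variants = [
    {"id": "chr1:1012", "annotation": "missense",
     "genotypes": {"s1": "0/1", "s2": "1/1", "s3": "0/0"}},
    {"id": "chr1:2044", "annotation": "synonymous",
     "genotypes": {"s1": "0/0", "s2": "0/1", "s3": "0/0"}},
    {"id": "chr2:0333", "annotation": "missense",
     "genotypes": {"s1": "0/0", "s2": "0/0", "s3": "0/0"}},
]

def matches(variant, annotation, min_carriers):
    """Variant-feature filter (functional annotation) combined with a
    genotype-pattern filter: at least `min_carriers` samples carry a
    non-reference (non-0/0) genotype."""
    carriers = sum(gt != "0/0" for gt in variant["genotypes"].values())
    return variant["annotation"] == annotation and carriers >= min_carriers

hits = [v["id"] for v in variants if matches(v, "missense", 1)]
print(hits)
```

In Gigwa's actual MongoDB backend, both conditions would be pushed down into a query document and served by indexes rather than evaluated in application code.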
Collapse
Affiliation(s)
- Guilhem Sempéré
- UMR InterTryp (CIRAD), Campus International de Baillarguet, 34398, Montpellier, Cedex 5, France.
- South Green Bioinformatics Platform, 1000 Avenue Agropolis, 34934, Montpellier, Cedex 5, France.
| | - Florian Philippe
- UMR DIADE (IRD), 911 Avenue Agropolis, 34934, Montpellier, Cedex 5, France
| | - Alexis Dereeper
- South Green Bioinformatics Platform, 1000 Avenue Agropolis, 34934, Montpellier, Cedex 5, France
- UMR IPME (IRD), 911 Avenue Agropolis, 34394, Montpellier, Cedex 5, France
| | - Manuel Ruiz
- South Green Bioinformatics Platform, 1000 Avenue Agropolis, 34934, Montpellier, Cedex 5, France
- UMR AGAP, CIRAD, 34398, Montpellier, Cedex 5, France
- Institut de Biologie Computationnelle, Université de Montpellier, 860 Rue de St Priest, 34095, Montpellier, Cedex 5, France
- Agrobiodiversity Research Area, International Center for Tropical Agriculture (CIAT), 6713, Cali, Colombia
- Gautier Sarah
- South Green Bioinformatics Platform, 1000 Avenue Agropolis, 34934, Montpellier, Cedex 5, France
- INRA, UMR AGAP, 34398, Montpellier, Cedex 5, France
- Pierre Larmande
- South Green Bioinformatics Platform, 1000 Avenue Agropolis, 34934, Montpellier, Cedex 5, France
- UMR DIADE (IRD), 911 Avenue Agropolis, 34934, Montpellier, Cedex 5, France
- Institut de Biologie Computationnelle, Université de Montpellier, 860 Rue de St Priest, 34095, Montpellier, Cedex 5, France
- INRIA Zenith Team, LIRMM, 161 Rue Ada, 34095, Montpellier, Cedex 5, France
16
Satagopam V, Gu W, Eifes S, Gawron P, Ostaszewski M, Gebel S, Barbosa-Silva A, Balling R, Schneider R. Integration and Visualization of Translational Medicine Data for Better Understanding of Human Diseases. BIG DATA 2016; 4:97-108. [PMID: 27441714 PMCID: PMC4932659 DOI: 10.1089/big.2015.0057] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]
Abstract
Translational medicine is a domain that turns results of basic life science research into new tools and methods in a clinical environment, for example as new diagnostics or therapies. Nowadays, the process of translation is supported by large amounts of heterogeneous data, ranging from medical data to a whole range of -omics data. This is not only a great opportunity but also a great challenge, as translational medicine big data is difficult to integrate and analyze, and requires the involvement of biomedical experts for data processing. We show here that visualization and interoperable workflows, combining multiple complex steps, can address at least parts of this challenge. In this article, we present an integrated workflow for the exploration, analysis and interpretation of translational medicine data in the context of human health. Three web services (tranSMART, a Galaxy Server and a MINERVA platform) are combined into one big data pipeline. Native visualization capabilities give biomedical experts a comprehensive overview of, and control over, the separate steps of the workflow. The capabilities of tranSMART enable flexible filtering of multidimensional integrated data sets to create subsets suitable for downstream processing. A Galaxy Server offers visually aided construction of analytical pipelines, with the use of existing or custom components. A MINERVA platform supports the exploration of health- and disease-related mechanisms in a contextualized analytical visualization system. We demonstrate the utility of our workflow by illustrating its subsequent steps on an existing data set, for which we propose a filtering scheme, an analytical pipeline and a corresponding visualization of analytical results. The workflow is available as a sandbox environment where readers can work with the described setup themselves. Overall, our work shows how visualization and interfacing of big data processing services facilitate the exploration, analysis and interpretation of translational medicine data.
Affiliation(s)
- Venkata Satagopam
- Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-Belval, Luxembourg
- Wei Gu
- Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-Belval, Luxembourg
- Serge Eifes
- Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-Belval, Luxembourg
- Information Technology for Translational Medicine (ITTM) S.A., Esch-Belval, Luxembourg
- Piotr Gawron
- Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-Belval, Luxembourg
- Marek Ostaszewski
- Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-Belval, Luxembourg
- Stephan Gebel
- Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-Belval, Luxembourg
- Adriano Barbosa-Silva
- Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-Belval, Luxembourg
- Rudi Balling
- Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-Belval, Luxembourg
- Reinhard Schneider
- Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-Belval, Luxembourg
17
Gabetta M, Limongelli I, Rizzo E, Riva A, Segagni D, Bellazzi R. BigQ: a NoSQL based framework to handle genomic variants in i2b2. BMC Bioinformatics 2015; 16:415. [PMID: 26714792 PMCID: PMC4696314 DOI: 10.1186/s12859-015-0861-0] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/26/2015] [Accepted: 12/15/2015] [Indexed: 12/25/2022] Open
Abstract
Background Precision medicine requires the tight integration of clinical and molecular data. To this end, it is mandatory to define proper technological solutions able to manage the overwhelming amounts of high-throughput genomic data needed to test associations between genomic signatures and human phenotypes. The i2b2 Center (Informatics for Integrating Biology and the Bedside) has developed a framework, widely adopted internationally, for using existing clinical data in discovery research; coupled with genetic data, it can help define precision medicine interventions. i2b2 can be significantly advanced by designing efficient management solutions for Next Generation Sequencing data. Results We developed BigQ, an extension of the i2b2 framework that integrates patient clinical phenotypes with genomic variant profiles generated by Next Generation Sequencing. A visual programming i2b2 plugin allows retrieving variants belonging to the patients in a cohort by applying filters on genomic variant annotations. We report an evaluation of the query performance of our system on more than 11 million variants, showing that the implemented solution scales linearly in query time and disk space with the number of variants. Conclusions In this paper we describe a new i2b2 web service, composed of an efficient and scalable document-based database that manages annotations of genomic variants, and a visual programming plug-in designed to dynamically perform queries on clinical and genetic data. The system therefore supports the fast-growing volume of genomic variants and can be used to integrate heterogeneous genomic annotations. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0861-0) contains supplementary material, which is available to authorized users.
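BigQ's actual web service, plug-in and document schema are not reproduced in the abstract; the sketch below, with invented patient IDs, field names and annotation values, only illustrates the core operation it describes: retrieving annotated variants restricted to the patients of a clinical cohort.

```python
# Minimal sketch under assumed data shapes; BigQ's real i2b2 integration
# and document model are not shown in the abstract.
cohort = {"p1", "p3"}  # patients selected by a clinical (phenotype) query

# One record per (patient, variant), as a document store might hold them
variant_docs = [
    {"patient": "p1", "variant": "chr1:1052A>G", "impact": "HIGH"},
    {"patient": "p2", "variant": "chr1:1052A>G", "impact": "HIGH"},
    {"patient": "p3", "variant": "chr7:880C>T", "impact": "LOW"},
]

def cohort_variants(docs, cohort, impact="HIGH"):
    """Variants carried by cohort members, filtered on an annotation."""
    return sorted({d["variant"] for d in docs
                   if d["patient"] in cohort and d["impact"] == impact})

result = cohort_variants(variant_docs, cohort)
```

The point of pushing this join into a document database, as the paper does, is that the annotation filter and the cohort restriction can both be answered from indexes, which is what makes the linear scaling in variant count reported above plausible.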
Affiliation(s)
- Matteo Gabetta
- Dipartimento di Ingegneria Industriale e dell'Informazione and Center for Health Technologies, Università di Pavia, Pavia, Italy
- Biomeris s.r.l., Pavia, Italy
- Ivan Limongelli
- Dipartimento di Ingegneria Industriale e dell'Informazione and Center for Health Technologies, Università di Pavia, Pavia, Italy
- IRCCS Fondazione Policlinico S. Matteo, Pavia, Italy
- Ettore Rizzo
- Dipartimento di Ingegneria Industriale e dell'Informazione and Center for Health Technologies, Università di Pavia, Pavia, Italy
- Dipartimento di Medicina Molecolare, Università di Pavia, Pavia, Italy
- Alberto Riva
- Interdisciplinary Center for Biotechnology Research, University of Florida, Gainesville, FL, USA.
- Riccardo Bellazzi
- Dipartimento di Ingegneria Industriale e dell'Informazione and Center for Health Technologies, Università di Pavia, Pavia, Italy
- IRCCS Fondazione S. Maugeri, Pavia, Italy
18
Noor AM, Holmberg L, Gillett C, Grigoriadis A. Big Data: the challenge for small research groups in the era of cancer genomics. Br J Cancer 2015; 113:1405-12. [PMID: 26492224 PMCID: PMC4815885 DOI: 10.1038/bjc.2015.341] [Citation(s) in RCA: 41] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/02/2015] [Revised: 08/04/2015] [Accepted: 08/09/2015] [Indexed: 01/06/2023] Open
Abstract
In the past decade, cancer research has seen an increasing trend towards high-throughput techniques and translational approaches. The increasing availability of assays that utilise smaller quantities of source material and produce higher volumes of data output has created a need for data storage solutions beyond those previously used. Multifactorial data, both large in sample size and heterogeneous in context, needs to be integrated in a standardised, cost-effective and secure manner. This requires technical solutions and administrative support not normally budgeted for in small- to moderate-sized research groups. In this review, we highlight the Big Data challenges faced by translational research groups in the precision medicine era; an era in which the genomes of over 75 000 patients will be sequenced by the National Health Service over the next 3 years to advance healthcare. In particular, we look at three main themes of data management in relation to cancer research, namely (1) cancer ontology management, (2) IT infrastructures that have been developed to support data management and (3) the unique ethical challenges introduced by utilising Big Data in research.
Affiliation(s)
- Aisyah Mohd Noor
- Research Oncology, Faculty of Life Sciences and Medicine, King's College London, Guy's Hospital, London SE1 9RT, UK
- Lars Holmberg
- Research Oncology, Faculty of Life Sciences and Medicine, King's College London, Guy's Hospital, London SE1 9RT, UK
- Department of Surgical Sciences, Uppsala University, Uppsala 751 85, Sweden
- Cheryl Gillett
- Research Oncology, Faculty of Life Sciences and Medicine, King's College London, Guy's Hospital, London SE1 9RT, UK
- Faculty of Life Sciences and Medicine, King's Health Partners Cancer Biobank, King's College London, Research Oncology, Guy's Hospital, London SE1 9RT, UK
- Anita Grigoriadis
- Research Oncology, Faculty of Life Sciences and Medicine, King's College London, Guy's Hospital, London SE1 9RT, UK
- Breast Cancer Now Research Unit, Research Oncology, Faculty of Life Sciences and Medicine, King's College London, Guy's Hospital, London SE1 9RT, UK
19
Trends in IT Innovation to Build a Next Generation Bioinformatics Solution to Manage and Analyse Biological Big Data Produced by NGS Technologies. BIOMED RESEARCH INTERNATIONAL 2015; 2015:904541. [PMID: 26125026 PMCID: PMC4466500 DOI: 10.1155/2015/904541] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 12/31/2014] [Revised: 04/01/2015] [Accepted: 04/01/2015] [Indexed: 02/07/2023]
Abstract
Sequencing the human genome began in 1994, and 10 years of work were necessary to provide a nearly complete sequence. Nowadays, NGS technologies allow sequencing of a whole human genome in a few days. This deluge of data challenges scientists in many ways: they face data management issues as well as analysis and visualization drawbacks due to the limitations of current bioinformatics tools. In this paper, we describe how the NGS Big Data revolution changes the way data are managed and analysed. We show how biologists are confronted with an abundance of methods, tools and data formats. To overcome these problems, we focus on Big Data information technology innovations from the web and business intelligence. We underline the interest of NoSQL databases, which are much more efficient than relational databases for this kind of data. Since Big Data leads to a loss of interactivity with data during analysis, owing to high processing times, we describe solutions from business intelligence that restore interactivity whatever the volume of data, illustrating this point with a focus on the Amadea platform. Finally, we discuss the visualization challenges posed by Big Data and present the latest innovations in JavaScript graphic libraries.