1. Koppad S, Annappa B, Gkoutos GV, Acharjee A. Cloud Computing Enabled Big Multi-Omics Data Analytics. Bioinform Biol Insights 2021;15:11779322211035921. PMID: 34376975; PMCID: PMC8323418; DOI: 10.1177/11779322211035921.
Abstract
High-throughput experiments enable researchers to explore complex multifactorial
diseases through large-scale analysis of omics data. Challenges for such
high-dimensional data sets include storage, analyses, and sharing. Recent
innovations in computational technologies and approaches, especially in cloud
computing, offer a promising, low-cost, and highly flexible solution in the
bioinformatics domain. Cloud computing is proving increasingly useful in
molecular modeling, omics data analytics (eg, RNA sequencing, metabolomics, or
proteomics data sets), and for the integration, analysis, and interpretation of
phenotypic data. We review the adoption of advanced cloud-based and big data
technologies for processing and analyzing omics data and provide insights into
state-of-the-art cloud bioinformatics applications.
Affiliation(s)
- Saraswati Koppad
- Department of Computer Science and Engineering, National Institute of Technology Karnataka, Surathkal, India
- Annappa B
- Department of Computer Science and Engineering, National Institute of Technology Karnataka, Surathkal, India
- Georgios V Gkoutos
- Institute of Cancer and Genomic Sciences and Centre for Computational Biology, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK
- Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
- NIHR Surgical Reconstruction and Microbiology Research Centre, University Hospitals Birmingham, Birmingham, UK
- MRC Health Data Research UK (HDR UK), London, UK
- NIHR Experimental Cancer Medicine Centre, Birmingham, UK
- NIHR Biomedical Research Centre, University Hospitals Birmingham, Birmingham, UK
- Animesh Acharjee
- Institute of Cancer and Genomic Sciences and Centre for Computational Biology, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK
- Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
- NIHR Surgical Reconstruction and Microbiology Research Centre, University Hospitals Birmingham, Birmingham, UK
2. Muth T, Renard BY. Evaluating de novo sequencing in proteomics: already an accurate alternative to database-driven peptide identification? Brief Bioinform 2019;19:954-970. PMID: 28369237; DOI: 10.1093/bib/bbx033.
Abstract
While peptide identifications in mass spectrometry (MS)-based shotgun proteomics are mostly obtained using database search methods, high-resolution spectrum data from modern MS instruments nowadays offer the prospect of improving the performance of computational de novo peptide sequencing. The major benefit of de novo sequencing is that it does not require a reference database to deduce full-length or partial tag-based peptide sequences directly from experimental tandem mass spectrometry spectra. Although various algorithms have been developed for automated de novo sequencing, the prediction accuracy of proposed solutions has been rarely evaluated in independent benchmarking studies. The main objective of this work is to provide a detailed evaluation on the performance of de novo sequencing algorithms on high-resolution data. For this purpose, we processed four experimental data sets acquired from different instrument types from collision-induced dissociation and higher energy collisional dissociation (HCD) fragmentation mode using the software packages Novor, PEAKS and PepNovo. Moreover, the accuracy of these algorithms is also tested on ground truth data based on simulated spectra generated from peak intensity prediction software. We found that Novor shows the overall best performance compared with PEAKS and PepNovo with respect to the accuracy of correct full peptide, tag-based and single-residue predictions. In addition, the same tool outpaced the commercial competitor PEAKS in terms of running time speedup by factors of around 12-17. Despite around 35% prediction accuracy for complete peptide sequences on HCD data sets, taken as a whole, the evaluated algorithms perform moderately on experimental data but show a significantly better performance on simulated data (up to 84% accuracy). Further, we describe the most frequently occurring de novo sequencing errors and evaluate the influence of missing fragment ion peaks and spectral noise on the accuracy. 
Finally, we discuss the potential for de novo sequencing to become more widely used in the field.
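The inverse problem these tools solve can be illustrated by the forward computation: given a peptide sequence, the singly charged b- and y-ion masses follow from standard monoisotopic residue masses. The sketch below is illustrative only (a truncated residue table, no modifications or charge states above 1+) and is not code from Novor, PEAKS, or PepNovo:

```python
# Monoisotopic residue masses (Da) for a handful of amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "L": 113.08406, "K": 128.09496, "R": 156.10111,
}
PROTON = 1.007276   # mass of a proton (Da)
WATER = 18.010565   # mass of H2O (Da)

def fragment_ions(peptide):
    """Singly charged b- and y-ion m/z values for a peptide string."""
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        prefix = sum(RESIDUE_MASS[aa] for aa in peptide[:i])
        suffix = sum(RESIDUE_MASS[aa] for aa in peptide[i:])
        b_ions.append(prefix + PROTON)          # b-ion: N-terminal fragment + proton
        y_ions.append(suffix + WATER + PROTON)  # y-ion: C-terminal fragment + water + proton
    return b_ions, y_ions

b_ions, y_ions = fragment_ions("PAGK")
```

A de novo engine works in the opposite direction, searching for the residue sequence whose predicted fragment ladder best explains the observed spectrum peaks.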
Affiliation(s)
- Thilo Muth
- Research Group Bioinformatics, Robert Koch Institute, Berlin, Germany
- Bernhard Y Renard
- Research Group Bioinformatics, Robert Koch Institute, Berlin, Germany
3. Eslami T, Saeed F. Fast-GPU-PCC: A GPU-Based Technique to Compute Pairwise Pearson's Correlation Coefficients for Time Series Data-fMRI Study. High Throughput 2018;7:E11. PMID: 29677161; PMCID: PMC6023306; DOI: 10.3390/ht7020011.
Abstract
Functional magnetic resonance imaging (fMRI) is a non-invasive brain imaging technique that has been regularly used for studying the brain's functional activity in the past few years. A widely used measure for capturing functional associations in the brain is Pearson's correlation coefficient, which is commonly employed for constructing functional networks and studying the dynamic functional connectivity of the brain. These are useful measures for understanding the effects of brain disorders on connectivity among brain regions. fMRI scanners produce a huge number of voxels, and using traditional central processing unit (CPU)-based techniques to compute pairwise correlations is very time consuming, especially when a large number of subjects is studied. In this paper, we propose a graphics processing unit (GPU)-based algorithm called Fast-GPU-PCC for computing pairwise Pearson's correlation coefficients. Exploiting the symmetry of Pearson's correlation, the approach returns the N(N-1)/2 correlation coefficients located in the strictly upper triangular part of the correlation matrix. Storing the correlations in a one-dimensional array, in the order proposed in this paper, is useful for further processing. Our experiments on real and synthetic fMRI data, for different numbers of voxels and varying lengths of time series, show that the proposed approach outperforms state-of-the-art GPU-based techniques as well as sequential CPU-based versions. We show that Fast-GPU-PCC runs 62 times faster than the CPU-based version and about 2 to 3 times faster than two other state-of-the-art GPU-based methods.
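The output convention described above, N(N-1)/2 coefficients taken from the strictly upper triangle and flattened into a one-dimensional array, is easy to reproduce on the CPU with NumPy. This is a sketch of the layout only, not the GPU kernel itself:

```python
import numpy as np

def pairwise_pcc_upper(data):
    """Pairwise Pearson correlations for rows of `data` (voxels x timepoints),
    returned as the strictly upper triangle flattened row-major into 1D."""
    # Center each time series, then scale to unit norm, so that the
    # matrix product below yields Pearson correlation coefficients.
    x = data - data.mean(axis=1, keepdims=True)
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    corr = x @ x.T                           # full N x N correlation matrix
    iu = np.triu_indices(len(data), k=1)     # strictly upper triangle indices
    return corr[iu]                          # N(N-1)/2 coefficients, row-major

rng = np.random.default_rng(0)
n = 5
vals = pairwise_pcc_upper(rng.standard_normal((n, 100)))
```

With this ordering, the coefficient for voxel pair (i, j), i < j, sits at a position computable from i and j alone, which is what makes the flattened array convenient for downstream processing.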
Affiliation(s)
- Taban Eslami
- Department of Computer Science, Western Michigan University, Kalamazoo, MI 49008, USA
- Fahad Saeed
- Department of Computer Science, Western Michigan University, Kalamazoo, MI 49008, USA
4. Challenges in the quantitation of naturally generated bioactive peptides in processed meats. Trends Food Sci Technol 2017. DOI: 10.1016/j.tifs.2017.04.011.
5. Maabreh M, Qolomany B, Alsmadi I, Gupta A. Deep Learning-based MSMS Spectra Reduction in Support of Running Multiple Protein Search Engines on Cloud. Proc IEEE Int Conf Bioinformatics Biomed 2017;2017:1909-1914. PMID: 34430067; PMCID: PMC8382039; DOI: 10.1109/bibm.2017.8217951.
Abstract
The diversity of available protein search engines with respect to their matching algorithms, the low overlap among their results, and the disparity of their coverage encourage the proteomics community to use ensemble solutions that combine different search engines. Advances in cloud computing technology and the availability of distributed processing clusters can also support this task. However, data transfer and result combination can become the major bottleneck: the flood of billions of observed mass spectra, hundreds of gigabytes or potentially terabytes of data, can easily cause congestion, increase the risk of failure, degrade performance, add computational cost, and waste available resources. In this study, we therefore propose a deep learning model to mitigate traffic over the cloud network and thus reduce the cost of cloud computing. The model, which uses the top 50 peak intensities and their m/z values from each spectrum, removes any spectrum predicted not to pass the majority vote of the participating search engines. Our results using three search engines (pFind, Comet, and X!Tandem) and four different datasets are promising and support further investment in deep learning to solve this type of big data problem.
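The per-spectrum reduction the model consumes, the 50 most intense peaks together with their m/z values, can be sketched in a few lines. This is an illustrative helper under assumed list-of-floats inputs, not the authors' code, and the deep learning classifier itself is omitted:

```python
def top_k_peaks(mz, intensity, k=50):
    """Keep the k most intense peaks of a spectrum, returned in ascending m/z order."""
    # Rank peak indices by intensity (descending), keep the top k,
    # then restore m/z order for the surviving peaks.
    top = sorted(range(len(mz)), key=lambda i: intensity[i], reverse=True)[:k]
    keep = sorted(top)
    return [mz[i] for i in keep], [intensity[i] for i in keep]
```

Reducing every spectrum to a fixed-size peak list before upload is what shrinks the traffic each cloud-hosted search engine must receive.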
Affiliation(s)
- Majdi Maabreh
- Department of Computer Science, Western Michigan University, Kalamazoo, MI, USA
- Basheer Qolomany
- Department of Computer Science, Western Michigan University, Kalamazoo, MI, USA
- Izzat Alsmadi
- Department of Computing and Cyber Security, Texas A&M University, San Antonio, TX, USA
- Ajay Gupta
- Department of Computer Science, Western Michigan University, Kalamazoo, MI, USA
6. HDInsight4PSi: Boosting performance of 3D protein structure similarity searching with HDInsight clusters in Microsoft Azure cloud. Inf Sci (N Y) 2016. DOI: 10.1016/j.ins.2016.02.029.
7. Yu P, Lin W. Single-cell Transcriptome Study as Big Data. Genomics Proteomics Bioinformatics 2016;14:21-30. PMID: 26876720; PMCID: PMC4792842; DOI: 10.1016/j.gpb.2016.01.005.
Abstract
The rapid growth of single-cell RNA-seq (scRNA-seq) studies demands efficient data storage, processing, and analysis. Big-data technology provides a framework that facilitates the comprehensive discovery of biological signals from inter-institutional scRNA-seq datasets. Strategies for handling the stochastic and heterogeneous single-cell transcriptome signal are discussed in this article. After extensively reviewing the available big-data applications of next-generation sequencing (NGS)-based studies, we propose a workflow that accounts for the unique characteristics of scRNA-seq data and the primary objectives of single-cell studies.
Affiliation(s)
- Pingjian Yu
- Genomics and Bioinformatics Lab, Baylor Institute for Immunology Research, Dallas, TX 75204, USA
- Wei Lin
- Genomics and Bioinformatics Lab, Baylor Institute for Immunology Research, Dallas, TX 75204, USA
8. Luo J, Wu M, Gopukumar D, Zhao Y. Big Data Application in Biomedical Research and Health Care: A Literature Review. Biomed Inform Insights 2016;8:1-10. PMID: 26843812; PMCID: PMC4720168; DOI: 10.4137/bii.s31559.
Abstract
Big data technologies are increasingly used for biomedical and health-care informatics research. Large amounts of biological and clinical data have been generated and collected at an unprecedented speed and scale. For example, the new generation of sequencing technologies enables the processing of billions of DNA sequences per day, and the application of electronic health records (EHRs) is documenting large amounts of patient data. The cost of acquiring and analyzing biomedical data is expected to decrease dramatically with the help of technology upgrades, such as the emergence of new sequencing machines, the development of novel hardware and software for parallel computing, and the extensive expansion of EHRs. Big data applications present new opportunities to discover new knowledge and create novel methods to improve the quality of health care. The application of big data in health care is a fast-growing field, with many new discoveries and methodologies published in the last five years. In this paper, we review and discuss big data applications in four major biomedical subdisciplines: (1) bioinformatics, (2) clinical informatics, (3) imaging informatics, and (4) public health informatics. Specifically, in bioinformatics, high-throughput experiments facilitate new genome-wide association studies of diseases, and in clinical informatics, the clinical field benefits from the vast amount of collected patient data for making intelligent decisions. Imaging informatics is now more rapidly integrated with cloud platforms to share medical image data and workflows, and public health informatics leverages big data techniques for predicting and monitoring infectious disease outbreaks, such as Ebola. In this paper, we review the recent progress and breakthroughs of big data applications in these health-care domains and summarize the challenges, gaps, and opportunities to improve and advance big data applications in health care.
Affiliation(s)
- Jake Luo
- College of Health Science, Department of Health Informatics and Administration, Center for Biomedical Data and Language Processing, University of Wisconsin–Milwaukee, Milwaukee, WI, USA
- Min Wu
- College of Health Science, Department of Health Informatics and Administration, Center for Biomedical Data and Language Processing, University of Wisconsin–Milwaukee, Milwaukee, WI, USA
- Deepika Gopukumar
- College of Health Science, Department of Health Informatics and Administration, Center for Biomedical Data and Language Processing, University of Wisconsin–Milwaukee, Milwaukee, WI, USA
- Yiqing Zhao
- College of Health Science, Department of Health Informatics and Administration, Center for Biomedical Data and Language Processing, University of Wisconsin–Milwaukee, Milwaukee, WI, USA
9. Prabahar A, Swaminathan S. Perspectives of Machine Learning Techniques in Big Data Mining of Cancer. Big Data Analytics in Genomics 2016:317-336. DOI: 10.1007/978-3-319-41279-5_9.
10. Das K, Nenadic Z. A Nonlinear Technique for Analysis of Big Data in Neuroscience. Big Data Analytics 2016. DOI: 10.1007/978-81-322-3628-3_13.
11. Agarwal P, Owzar K. Next generation distributed computing for cancer research. Cancer Inform 2015;13:97-109. PMID: 25983539; PMCID: PMC4412427; DOI: 10.4137/cin.s16344.
Abstract
Advances in next generation sequencing (NGS) and mass spectrometry (MS) technologies have provided many new opportunities and angles for extending the scope of translational cancer research while creating tremendous challenges in data management and analysis. The resulting informatics challenge is invariably not amenable to the use of traditional computing models. Recent advances in scalable computing and associated infrastructure, particularly distributed computing for Big Data, can provide solutions for addressing these challenges. In this review, the next generation of distributed computing technologies that can address these informatics problems is described from the perspective of three key components of a computational platform, namely computing, data storage and management, and networking. A broad overview of scalable computing is provided to set the context for a detailed description of Hadoop, a technology that is being rapidly adopted for large-scale distributed computing. A proof-of-concept Hadoop cluster, set up for performance benchmarking of NGS read alignment, is described as an example of how to work with Hadoop. Finally, Hadoop is compared with a number of other current technologies for distributed computing.
Affiliation(s)
- Pankaj Agarwal
- Duke Cancer Institute, Duke University Medical Center, Durham, NC, USA
- Kouros Owzar
- Duke Cancer Institute, Duke University Medical Center, Durham, NC, USA
- Department of Biostatistics and Bioinformatics, Duke University Medical Center, Durham, NC, USA
12. Slagel J, Mendoza L, Shteynberg D, Deutsch EW, Moritz RL. Processing shotgun proteomics data on the Amazon cloud with the trans-proteomic pipeline. Mol Cell Proteomics 2014;14:399-404. PMID: 25418363; DOI: 10.1074/mcp.o114.043380.
Abstract
Cloud computing, where scalable, on-demand compute cycles and storage are available as a service, has the potential to accelerate mass spectrometry-based proteomics research by providing simple, expandable, and affordable large-scale computing to all laboratories regardless of location or information technology expertise. We present new cloud computing functionality for the Trans-Proteomic Pipeline, a free and open-source suite of tools for the processing and analysis of tandem mass spectrometry datasets. Enabled with Amazon Web Services cloud computing, the Trans-Proteomic Pipeline now gives all users access to large-scale computing resources, limited only by the available Amazon Web Services infrastructure. The Trans-Proteomic Pipeline can run in an environment fully hosted on Amazon Web Services, where all software and data reside on cloud resources to tackle large search studies. It can also run on a local computer, with computationally intensive tasks launched onto the Amazon Elastic Compute Cloud service to greatly decrease analysis times. We describe the new Trans-Proteomic Pipeline cloud service components, compare the relative performance and costs of various Elastic Compute Cloud instance types, and present online tutorials that enable users to learn how to deploy cloud computing technology rapidly with the Trans-Proteomic Pipeline. We provide tools for estimating the necessary computing resources and costs given the scale of a job, and demonstrate the use of the cloud-enabled Trans-Proteomic Pipeline by processing over 1100 tandem mass spectrometry files through four proteomic search engines in 9 h at very low cost.
Affiliation(s)
- Joseph Slagel
- Institute for Systems Biology, 401 Terry Avenue North, Seattle, WA 98109
- Luis Mendoza
- Institute for Systems Biology, 401 Terry Avenue North, Seattle, WA 98109
- David Shteynberg
- Institute for Systems Biology, 401 Terry Avenue North, Seattle, WA 98109
- Eric W Deutsch
- Institute for Systems Biology, 401 Terry Avenue North, Seattle, WA 98109
- Robert L Moritz
- Institute for Systems Biology, 401 Terry Avenue North, Seattle, WA 98109
13. Mohammed EA, Far BH, Naugler C. Applications of the MapReduce programming framework to clinical big data analysis: current landscape and future trends. BioData Min 2014;7:22. PMID: 25383096; PMCID: PMC4224309; DOI: 10.1186/1756-0381-7-22.
Abstract
The emergence of massive datasets in clinical settings presents both challenges and opportunities for data storage and analysis. This so-called "big data" challenges traditional analytic tools and will increasingly require novel solutions adapted from other fields. Advances in information and communication technology present the most viable solutions to big data analysis in terms of efficiency and scalability. It is vital that big data solutions be multithreaded and that data access approaches be precisely tailored to large volumes of semi-structured and unstructured data. The MapReduce programming framework uses two tasks common in functional programming: Map and Reduce. MapReduce is a parallel processing framework, and Hadoop is its open-source implementation on a single computing node or on clusters. Compared with existing parallel processing paradigms (e.g., grid computing and the graphical processing unit (GPU)), MapReduce and Hadoop have two advantages: 1) fault-tolerant storage, resulting in reliable data processing by replicating computing tasks and cloning data chunks on different computing nodes across the cluster; 2) high-throughput data processing via a batch processing framework and the Hadoop distributed file system (HDFS). Data are stored in the HDFS and made available to the slave nodes for computation. In this paper, we review existing applications of the MapReduce programming framework and its implementation platform Hadoop in clinical big data and related medical health informatics fields. The usage of MapReduce and Hadoop on a distributed system represents a significant advance in clinical big data processing and utilization, and opens up new opportunities in the emerging era of big data analytics. The objective of this paper is to summarize the state-of-the-art efforts in clinical big data analytics and to highlight what might be needed to enhance the outcomes of clinical big data analytics tools. The paper concludes by summarizing the potential usage of the MapReduce programming framework and the Hadoop platform to process huge volumes of clinical data in medical health informatics related fields.
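The Map and Reduce tasks named above can be imitated in plain Python. This is a conceptual, single-machine sketch of the classic word-count job, not Hadoop code; a real framework distributes the map and reduce calls across cluster nodes and performs the shuffle over the network:

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every input record, emitting (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer to each key's list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Word count: the mapper emits (word, 1); the reducer sums the ones.
records = ["big data", "clinical big data"]
pairs = map_phase(records, lambda line: ((word, 1) for word in line.split()))
counts = reduce_phase(shuffle(pairs), lambda _word, ones: sum(ones))
```

The fault tolerance the abstract credits to Hadoop comes from replicating exactly these per-record map calls and per-key reduce calls, so a failed task can simply be rerun elsewhere.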
Affiliation(s)
- Emad A Mohammed
- Department of Electrical and Computer Engineering, Schulich School of Engineering, University of Calgary, Calgary, AB, Canada
- Behrouz H Far
- Department of Electrical and Computer Engineering, Schulich School of Engineering, University of Calgary, Calgary, AB, Canada
- Christopher Naugler
- Department of Pathology and Laboratory Medicine, University of Calgary and Calgary Laboratory Services, Calgary, AB, Canada
14. Distributed computing strategies for processing of FT-ICR MS imaging datasets for continuous mode data visualization. Anal Bioanal Chem 2014;407:2321-7. DOI: 10.1007/s00216-014-8210-0.
15. Mrozek D, Małysiak-Mrozek B, Kłapciński A. Cloud4Psi: cloud computing for 3D protein structure similarity searching. Bioinformatics 2014;30:2822-5. PMID: 24930141; PMCID: PMC4173022; DOI: 10.1093/bioinformatics/btu389.
Abstract
Summary: Popular methods for 3D protein structure similarity searching, especially those that generate high-quality alignments, such as Combinatorial Extension (CE) and Flexible structure Alignment by Chaining Aligned fragment pairs allowing Twists (FATCAT), are still time consuming. As a consequence, performing similarity searches against large repositories of structural data requires increased computational resources that are not always available. Cloud computing provides huge amounts of computational power that can be provisioned on a pay-as-you-go basis. We have developed a cloud-based system that allows scaling the similarity-searching process both vertically and horizontally. Cloud4Psi (Cloud for Protein Similarity) was tested in the Microsoft Azure cloud environment and provided good, almost linearly proportional acceleration when scaled out onto many computational units. Availability and implementation: Cloud4Psi is available as Software as a Service for testing purposes at http://cloud4psi.cloudapp.net/. For source code and software availability, please visit the Cloud4Psi project home page at http://zti.polsl.pl/dmrozek/science/cloud4psi.htm. Contact: dariusz.mrozek@polsl.pl
Affiliation(s)
- Dariusz Mrozek
- Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland
- Bożena Małysiak-Mrozek
- Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland
- Artur Kłapciński
- Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland
16. Hillman C, Ahmad Y, Whitehorn M, Cobley A. Near Real-Time Processing of Proteomics Data Using Hadoop. Big Data 2014;2:44-49. PMID: 27447310; DOI: 10.1089/big.2013.0036.
Abstract
This article presents a near real-time processing solution using MapReduce and Hadoop. The solution is aimed at some of the data management and processing challenges facing the life sciences community. Research into genes and their product proteins generates huge volumes of data that must be extensively preprocessed before any biological insight can be gained. In order to carry out this processing in a timely manner, we have investigated the use of techniques from the big data field. These are applied specifically to process data resulting from mass spectrometers in the course of proteomic experiments. Here we present methods of handling the raw data in Hadoop, and then we investigate a process for preprocessing the data using Java code and the MapReduce framework to identify 2D and 3D peaks.
Affiliation(s)
- Chris Hillman
- School of Computing, University of Dundee, Nethergate, Dundee, United Kingdom
- Yasmeen Ahmad
- Centre for Gene Regulation & Expression, University of Dundee, Nethergate, Dundee, United Kingdom
- Mark Whitehorn
- School of Computing, University of Dundee, Nethergate, Dundee, United Kingdom
- Andy Cobley
- School of Computing, University of Dundee, Nethergate, Dundee, United Kingdom
17. Lee WP, Hsiao YT, Hwang WC. Designing a parallel evolutionary algorithm for inferring gene networks on the cloud computing environment. BMC Syst Biol 2014;8:5. PMID: 24428926; PMCID: PMC3900469; DOI: 10.1186/1752-0509-8-5.
Abstract
Background: Reconstructing gene networks by experimentally testing the possible interactions between genes is tedious, so automated reverse-engineering procedures are increasingly adopted instead. Several evolutionary algorithms have been suggested for deriving network parameters. However, to infer large networks with an evolutionary algorithm, two important issues must be addressed: premature convergence and high computational cost. To tackle the former and enhance the performance of traditional evolutionary algorithms, it is advisable to use parallel-model evolutionary algorithms. To overcome the latter and speed up the computation, cloud computing is a promising solution, most popularly through the MapReduce programming model, a fault-tolerant framework for implementing parallel algorithms that infer large gene networks.
Results: This work presents a practical framework to infer large gene networks by developing and parallelizing a hybrid GA-PSO optimization method. Our parallel method is extended to work with the Hadoop MapReduce programming model and is executed in different cloud computing environments. To evaluate the proposed approach, we use the well-known open-source software GeneNetWeaver to create several yeast S. cerevisiae sub-networks and use them to produce gene profiles. Experiments show that our parallel approach can successfully infer networks with the desired behaviors while largely reducing the computation time.
Conclusions: Parallel population-based algorithms can effectively determine network parameters, and they perform better than the widely used sequential algorithms in gene network inference. These parallel algorithms can be distributed to the cloud computing environment to speed up the computation. By coupling the parallel-model population-based optimization method with the parallel computational framework, high-quality solutions can be obtained within a relatively short time. This integrated approach is a promising way to infer large networks.
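For readers unfamiliar with the PSO half of the hybrid, the canonical velocity and position update can be sketched serially. This is a toy minimizer with illustrative parameter values, not the paper's GA-PSO method or its MapReduce parallelization:

```python
import random

def pso_minimize(f, dim, n_particles=20, iters=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal particle swarm optimizer using the standard velocity/position update."""
    rng = random.Random(seed)
    xs = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vs = [[0.0] * dim for _ in range(n_particles)]
    pbest = [x[:] for x in xs]                 # each particle's best position
    pbest_val = [f(x) for x in xs]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]   # swarm's best position
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia + cognitive pull (own best) + social pull (swarm best).
                vs[i][d] = (w * vs[i][d]
                            + c1 * r1 * (pbest[i][d] - xs[i][d])
                            + c2 * r2 * (gbest[d] - xs[i][d]))
                xs[i][d] += vs[i][d]
            val = f(xs[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = xs[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = xs[i][:], val
    return gbest, gbest_val

best, best_val = pso_minimize(lambda x: sum(t * t for t in x), dim=3)
```

In the paper's setting, a particle would encode candidate network parameters and f would score how well the simulated network reproduces the observed gene profiles; the per-particle fitness evaluations are what the MapReduce phase parallelizes.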
Affiliation(s)
- Wei-Po Lee
- Department of Information Management, National Sun Yat-sen University, Kaohsiung, Taiwan
18. Verheggen K, Barsnes H, Martens L. Distributed computing and data storage in proteomics: many hands make light work, and a stronger memory. Proteomics 2013;14:367-77. PMID: 24285552; DOI: 10.1002/pmic.201300288.
Abstract
Modern-day proteomics generates ever more complex data, causing the requirements for storing and processing such data to outgrow the capacity of most desktop computers. To cope with the increased computational demands, distributed architectures have gained substantial popularity in recent years. In this review, we provide an overview of current techniques for distributed computing, along with examples of how these techniques are currently employed in the field of proteomics. We thus underline the benefits of distributed computing in proteomics, while also pointing out the potential issues and pitfalls involved.
Affiliation(s)
- Kenneth Verheggen
- Department of Medical Protein Research, VIB, Ghent, Belgium
- Department of Biochemistry, Faculty of Medicine and Health Sciences, Ghent University, Ghent, Belgium
19. O'Driscoll A, Daugelaite J, Sleator RD. 'Big data', Hadoop and cloud computing in genomics. J Biomed Inform 2013;46:774-81. PMID: 23872175; DOI: 10.1016/j.jbi.2013.07.001.
Abstract
Since the completion of the Human Genome Project at the turn of the century, there has been an unprecedented proliferation of genomic sequence data. A consequence of this is that the medical discoveries of the future will largely depend on our ability to process and analyse large genomic data sets, which continue to expand as the cost of sequencing decreases. Herein, we provide an overview of cloud computing and big data technologies, and discuss how such expertise can be used to deal with biology's big data sets. In particular, big data technologies such as the Apache Hadoop project, which provides distributed and parallelised data processing and analysis of petabyte (PB)-scale data sets, will be discussed, together with an overview of the current usage of Hadoop within the bioinformatics community.
Affiliation(s)
- Aisling O'Driscoll
- Department of Computing, Cork Institute of Technology, Rossa Avenue, Bishopstown, Cork, Ireland