1
|
Garzón W, Benavides L, Gaignard A, Redon R, Südholt M. A taxonomy of tools and approaches for distributed genomic analyses. INFORMATICS IN MEDICINE UNLOCKED 2022. [DOI: 10.1016/j.imu.2022.101024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/16/2022] Open
|
2
|
Pal S, Mondal S, Das G, Khatua S, Ghosh Z. Big data in biology: The hope and present-day challenges in it. GENE REPORTS 2020. [DOI: 10.1016/j.genrep.2020.100869] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/23/2022]
|
3
|
Cattaneo G, Giancarlo R, Piotto S, Ferraro Petrillo U, Roscigno G, Di Biasi L. MapReduce in Computational Biology - A Synopsis. ADVANCES IN ARTIFICIAL LIFE, EVOLUTIONARY COMPUTATION, AND SYSTEMS CHEMISTRY 2017. [DOI: 10.1007/978-3-319-57711-1_5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/23/2023]
|
4
|
Hundt C, Hildebrandt A, Schmidt B. rapidGSEA: Speeding up gene set enrichment analysis on multi-core CPUs and CUDA-enabled GPUs. BMC Bioinformatics 2016; 17:394. [PMID: 27663265 PMCID: PMC5035472 DOI: 10.1186/s12859-016-1244-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/29/2016] [Accepted: 09/08/2016] [Indexed: 12/02/2022] Open
Abstract
Background Gene Set Enrichment Analysis (GSEA) is a popular method to reveal significant dependencies between predefined sets of gene symbols and observed phenotypes by evaluating the deviation of gene expression values between cases and controls. An established measure of inter-class deviation, the enrichment score, is usually computed using a weighted running sum statistic over the whole set of gene symbols. Due to the lack of analytic expressions the significance of enrichment scores is determined using a non-parametric estimation of their null distribution by permuting the phenotype labels of the probed patients. Accordingly, GSEA is a time-consuming task due to the large number of required permutations to accurately estimate the nominal p-value – a circumstance that is even more pronounced during multiple hypothesis testing since its estimate is lower-bounded by the inverse number of samples in permutation space. Results We present rapidGSEA – a software suite consisting of two tools for facilitating permutation-based GSEA: cudaGSEA and ompGSEA. cudaGSEA is a CUDA-accelerated tool using fine-grained parallelization schemes on massively parallel architectures while ompGSEA is a coarse-grained multi-threaded tool for multi-core CPUs. Nominal p-value estimation of 4,725 gene sets on a data set consisting of 20,639 unique gene symbols and 200 patients (183 cases + 17 controls) each probing one million permutations takes 19 hours on a Xeon CPU and less than one hour on a GeForce Titan X GPU while the established GSEA tool from the Broad Institute (broadGSEA) takes roughly 13 days. Conclusion cudaGSEA outperforms broadGSEA by around two orders-of-magnitude on a single Tesla K40c or GeForce Titan X GPU. ompGSEA provides around one order-of-magnitude speedup to broadGSEA on a standard Xeon CPU. The rapidGSEA suite is open-source software and can be downloaded at https://github.com/gravitino/cudaGSEAas standalone application or package for the R framework.
Collapse
Affiliation(s)
- Christian Hundt
- Department of Computer Science, Johannes Gutenberg University, Staudingerweg 9, Mainz, 55128, Germany.
| | - Andreas Hildebrandt
- Department of Computer Science, Johannes Gutenberg University, Staudingerweg 9, Mainz, 55128, Germany
| | - Bertil Schmidt
- Department of Computer Science, Johannes Gutenberg University, Staudingerweg 9, Mainz, 55128, Germany
| |
Collapse
|
5
|
Yu P, Lin W. Single-cell Transcriptome Study as Big Data. GENOMICS PROTEOMICS & BIOINFORMATICS 2016; 14:21-30. [PMID: 26876720 PMCID: PMC4792842 DOI: 10.1016/j.gpb.2016.01.005] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/17/2015] [Revised: 01/09/2016] [Accepted: 01/10/2016] [Indexed: 12/31/2022]
Abstract
The rapid growth of single-cell RNA-seq studies (scRNA-seq) demands efficient data storage, processing, and analysis. Big-data technology provides a framework that facilitates the comprehensive discovery of biological signals from inter-institutional scRNA-seq datasets. The strategies to solve the stochastic and heterogeneous single-cell transcriptome signal are discussed in this article. After extensively reviewing the available big-data applications of next-generation sequencing (NGS)-based studies, we propose a workflow that accounts for the unique characteristics of scRNA-seq data and primary objectives of single-cell studies.
Collapse
Affiliation(s)
- Pingjian Yu
- Genomics and Bioinformatics Lab, Baylor Institute for Immunology Research, Dallas, TX 75204, USA
| | - Wei Lin
- Genomics and Bioinformatics Lab, Baylor Institute for Immunology Research, Dallas, TX 75204, USA.
| |
Collapse
|
6
|
Abstract
High-throughput platforms such as microarray, mass spectrometry, and next-generation sequencing are producing an increasing volume of omics data that needs large data storage and computing power. Cloud computing offers massive scalable computing and storage, data sharing, on-demand anytime and anywhere access to resources and applications, and thus, it may represent the key technology for facing those issues. In fact, in the recent years it has been adopted for the deployment of different bioinformatics solutions and services both in academia and in the industry. Although this, cloud computing presents several issues regarding the security and privacy of data, that are particularly important when analyzing patients data, such as in personalized medicine. This chapter reviews main academic and industrial cloud-based bioinformatics solutions; with a special focus on microarray data analysis solutions and underlines main issues and problems related to the use of such platforms for the storage and analysis of patients data.
Collapse
Affiliation(s)
- Barbara Calabrese
- Department of Medical and Surgical Sciences, University Magna Graecia of Catanzaro, 88100, Catanzaro, Italy
| | - Mario Cannataro
- Department of Medical and Surgical Sciences, University Magna Graecia of Catanzaro, 88100, Catanzaro, Italy.
| |
Collapse
|
7
|
Wang L, Yang L, Peng Z, Lu D, Jin Y, McNutt M, Yin Y. cisPath: an R/Bioconductor package for cloud users for visualization and management of functional protein interaction networks. BMC SYSTEMS BIOLOGY 2015; 9 Suppl 1:S1. [PMID: 25708840 PMCID: PMC4331675 DOI: 10.1186/1752-0509-9-s1-s1] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 12/15/2022]
Abstract
Background With the burgeoning development of cloud technology and services, there are an increasing number of users who prefer cloud to run their applications. All software and associated data are hosted on the cloud, allowing users to access them via a web browser from any computer, anywhere. This paper presents cisPath, an R/Bioconductor package deployed on cloud servers for client users to visualize, manage, and share functional protein interaction networks. Results With this R package, users can easily integrate downloaded protein-protein interaction information from different online databases with private data to construct new and personalized interaction networks. Additional functions allow users to generate specific networks based on private databases. Since the results produced with the use of this package are in the form of web pages, cloud users can easily view and edit the network graphs via the browser, using a mouse or touch screen, without the need to download them to a local computer. This package can also be installed and run on a local desktop computer. Depending on user preference, results can be publicized or shared by uploading to a web server or cloud driver, allowing other users to directly access results via a web browser. Conclusions This package can be installed and run on a variety of platforms. Since all network views are shown in web pages, such package is particularly useful for cloud users. The easy installation and operation is an attractive quality for R beginners and users with no previous experience with cloud services.
Collapse
|
8
|
Shanahan HP, Owen AM, Harrison AP. Bioinformatics on the cloud computing platform Azure. PLoS One 2014; 9:e102642. [PMID: 25050811 PMCID: PMC4106841 DOI: 10.1371/journal.pone.0102642] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/23/2013] [Accepted: 06/20/2014] [Indexed: 12/27/2022] Open
Abstract
We discuss the applicability of the Microsoft cloud computing platform, Azure, for bioinformatics. We focus on the usability of the resource rather than its performance. We provide an example of how R can be used on Azure to analyse a large amount of microarray expression data deposited at the public database ArrayExpress. We provide a walk through to demonstrate explicitly how Azure can be used to perform these analyses in Appendix S1 and we offer a comparison with a local computation. We note that the use of the Platform as a Service (PaaS) offering of Azure can represent a steep learning curve for bioinformatics developers who will usually have a Linux and scripting language background. On the other hand, the presence of an additional set of libraries makes it easier to deploy software in a parallel (scalable) fashion and explicitly manage such a production run with only a few hundred lines of code, most of which can be incorporated from a template. We propose that this environment is best suited for running stable bioinformatics software by users not involved with its development.
Collapse
Affiliation(s)
- Hugh P. Shanahan
- Department of Computer Science, Royal Holloway, University of London, Egham, Surrey, United Kingdom
- * E-mail:
| | - Anne M. Owen
- Department of Mathematical Sciences, University of Essex, Wivenhoe Park, Colchester, United Kingdom
| | - Andrew P. Harrison
- Department of Mathematical Sciences, University of Essex, Wivenhoe Park, Colchester, United Kingdom
- Department of Biological Sciences, University of Essex, Wivenhoe Park, Colchester, United Kingdom
| |
Collapse
|
9
|
Chang TH, Wu SL, Wang WJ, Horng JT, Chang CW. A novel approach for discovering condition-specific correlations of gene expressions within biological pathways by using cloud computing technology. BIOMED RESEARCH INTERNATIONAL 2014; 2014:763237. [PMID: 24579087 PMCID: PMC3919110 DOI: 10.1155/2014/763237] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/25/2013] [Revised: 11/18/2013] [Accepted: 12/15/2013] [Indexed: 11/18/2022]
Abstract
Microarrays are widely used to assess gene expressions. Most microarray studies focus primarily on identifying differential gene expressions between conditions (e.g., cancer versus normal cells), for discovering the major factors that cause diseases. Because previous studies have not identified the correlations of differential gene expression between conditions, crucial but abnormal regulations that cause diseases might have been disregarded. This paper proposes an approach for discovering the condition-specific correlations of gene expressions within biological pathways. Because analyzing gene expression correlations is time consuming, an Apache Hadoop cloud computing platform was implemented. Three microarray data sets of breast cancer were collected from the Gene Expression Omnibus, and pathway information from the Kyoto Encyclopedia of Genes and Genomes was applied for discovering meaningful biological correlations. The results showed that adopting the Hadoop platform considerably decreased the computation time. Several correlations of differential gene expressions were discovered between the relapse and nonrelapse breast cancer samples, and most of them were involved in cancer regulation and cancer-related pathways. The results showed that breast cancer recurrence might be highly associated with the abnormal regulations of these gene pairs, rather than with their individual expression levels. The proposed method was computationally efficient and reliable, and stable results were obtained when different data sets were used. The proposed method is effective in identifying meaningful biological regulation patterns between conditions.
Collapse
Affiliation(s)
- Tzu-Hao Chang
- Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan
| | - Shih-Lin Wu
- Department of Computer Science and Information Engineering, College of Engineering, Chang Gung University, Taoyuan 333, Taiwan
| | - Wei-Jen Wang
- Department of Computer Science and Information Engineering, National Central University, Taoyuan 320, Taiwan
| | - Jorng-Tzong Horng
- Department of Computer Science and Information Engineering, National Central University, Taoyuan 320, Taiwan
- Department of Biomedical Informatics, Asia University, Taichung 413, Taiwan
| | - Cheng-Wei Chang
- Department of Information Management, Hsing Wu University, New Taipei City 244, Taiwan
| |
Collapse
|
10
|
Lin YC, Yu CS, Lin YJ. Enabling large-scale biomedical analysis in the cloud. BIOMED RESEARCH INTERNATIONAL 2013; 2013:185679. [PMID: 24288665 PMCID: PMC3832998 DOI: 10.1155/2013/185679] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Received: 08/06/2013] [Accepted: 09/22/2013] [Indexed: 01/02/2023]
Abstract
Recent progress in high-throughput instrumentations has led to an astonishing growth in both volume and complexity of biomedical data collected from various sources. The planet-size data brings serious challenges to the storage and computing technologies. Cloud computing is an alternative to crack the nut because it gives concurrent consideration to enable storage and high-performance computing on large-scale data. This work briefly introduces the data intensive computing system and summarizes existing cloud-based resources in bioinformatics. These developments and applications would facilitate biomedical research to make the vast amount of diversification data meaningful and usable.
Collapse
Affiliation(s)
- Ying-Chih Lin
- Master's Program in Biomedical Informatics and Biomedical Engineering, Feng Chia University, No. 100 Wenhwa Road, Seatwen, Taichung 40724, Taiwan
- Department of Applied Mathematics, Feng Chia University, No. 100 Wenhwa Road, Seatwen, Taichung 40724, Taiwan
| | - Chin-Sheng Yu
- Master's Program in Biomedical Informatics and Biomedical Engineering, Feng Chia University, No. 100 Wenhwa Road, Seatwen, Taichung 40724, Taiwan
- Department of Information Engineering and Computer Science, Feng Chia University, No. 100 Wenhwa Road, Seatwen, Taichung 40724, Taiwan
| | - Yen-Jen Lin
- Department of Computer Science, National Tsing Hua University, No. 101, Section 2, Kuang-Fu Road, Hsinchu 30013, Taiwan
| |
Collapse
|
11
|
Zhou S, Liao R, Guan J. When cloud computing meets bioinformatics: a review. J Bioinform Comput Biol 2013; 11:1330002. [PMID: 24131049 DOI: 10.1142/s0219720013300025] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/15/2022]
Abstract
In the past decades, with the rapid development of high-throughput technologies, biology research has generated an unprecedented amount of data. In order to store and process such a great amount of data, cloud computing and MapReduce were applied to many fields of bioinformatics. In this paper, we first introduce the basic concepts of cloud computing and MapReduce, and their applications in bioinformatics. We then highlight some problems challenging the applications of cloud computing and MapReduce to bioinformatics. Finally, we give a brief guideline for using cloud computing in biology research.
Collapse
Affiliation(s)
- Shuigeng Zhou
- School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai 200433, P. R. China
| | | | | |
Collapse
|
12
|
Abstract
Multiple sequence alignment (MSA) of DNA, RNA, and protein sequences is one of the most essential techniques in the fields of molecular biology, computational biology, and bioinformatics. Next-generation sequencing technologies are changing the biology landscape, flooding the databases with massive amounts of raw sequence data. MSA of ever-increasing sequence data sets is becoming a significant bottleneck. In order to realise the promise of MSA for large-scale sequence data sets, it is necessary for existing MSA algorithms to be run in a parallelised fashion with the sequence data distributed over a computing cluster or server farm. Combining MSA algorithms with cloud computing technologies is therefore likely to improve the speed, quality, and capability for MSA to handle large numbers of sequences.
In this review, multiple sequence alignments are discussed, with a specific focus on the ClustalW and Clustal Omega algorithms. Cloud computing technologies and concepts are outlined, and the next generation of cloud base MSA algorithms is introduced.
Collapse
|
13
|
O'Driscoll A, Daugelaite J, Sleator RD. 'Big data', Hadoop and cloud computing in genomics. J Biomed Inform 2013; 46:774-81. [PMID: 23872175 DOI: 10.1016/j.jbi.2013.07.001] [Citation(s) in RCA: 125] [Impact Index Per Article: 11.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/16/2013] [Revised: 06/17/2013] [Accepted: 07/08/2013] [Indexed: 12/18/2022]
Abstract
Since the completion of the Human Genome project at the turn of the Century, there has been an unprecedented proliferation of genomic sequence data. A consequence of this is that the medical discoveries of the future will largely depend on our ability to process and analyse large genomic data sets, which continue to expand as the cost of sequencing decreases. Herein, we provide an overview of cloud computing and big data technologies, and discuss how such expertise can be used to deal with biology's big data sets. In particular, big data technologies such as the Apache Hadoop project, which provides distributed and parallelised data processing and analysis of petabyte (PB) scale data sets will be discussed, together with an overview of the current usage of Hadoop within the bioinformatics community.
Collapse
Affiliation(s)
- Aisling O'Driscoll
- Department of Computing, Cork Institute of Technology, Rossa Avenue, Bishopstown, Cork, Ireland
| | | | | |
Collapse
|
14
|
Translational biomedical informatics in the cloud: present and future. BIOMED RESEARCH INTERNATIONAL 2013; 2013:658925. [PMID: 23586054 PMCID: PMC3613081 DOI: 10.1155/2013/658925] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Received: 12/08/2012] [Accepted: 02/17/2013] [Indexed: 01/14/2023]
Abstract
Next generation sequencing and other high-throughput experimental techniques of recent decades have driven the exponential growth in publicly available molecular and clinical data. This information explosion has prepared the ground for the development of translational bioinformatics. The scale and dimensionality of data, however, pose obvious challenges in data mining, storage, and integration. In this paper we demonstrated the utility and promise of cloud computing for tackling the big data problems. We also outline our vision that cloud computing could be an enabling tool to facilitate translational bioinformatics research.
Collapse
|
15
|
Zou Q, Li XB, Jiang WR, Lin ZY, Li GL, Chen K. Survey of MapReduce frame operation in bioinformatics. Brief Bioinform 2013; 15:637-47. [PMID: 23396756 DOI: 10.1093/bib/bbs088] [Citation(s) in RCA: 107] [Impact Index Per Article: 9.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
Bioinformatics is challenged by the fact that traditional analysis tools have difficulty in processing large-scale data from high-throughput sequencing. The open source Apache Hadoop project, which adopts the MapReduce framework and a distributed file system, has recently given bioinformatics researchers an opportunity to achieve scalable, efficient and reliable computing performance on Linux clusters and on cloud computing services. In this article, we present MapReduce frame-based applications that can be employed in the next-generation sequencing and other biological domains. In addition, we discuss the challenges faced by this field as well as the future works on parallel computing in bioinformatics.
Collapse
|
16
|
Abstract
Abstract As advances in life sciences and information technology bring profound influences on bioinformatics due to its interdisciplinary nature, bioinformatics is experiencing a new leap-forward from in-house computing infrastructure into utility-supplied cloud computing delivered over the Internet, in order to handle the vast quantities of biological data generated by high-throughput experimental technologies. Albeit relatively new, cloud computing promises to address big data storage and analysis issues in the bioinformatics field. Here we review extant cloud-based services in bioinformatics, classify them into Data as a Service (DaaS), Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS), and present our perspectives on the adoption of cloud computing in bioinformatics. Reviewers This article was reviewed by Frank Eisenhaber, Igor Zhulin, and Sandor Pongor.
Collapse
Affiliation(s)
- Lin Dai
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, No.7 Beitucheng West Road, Building G, Chaoyang District, Beijing 100029, China
| | | | | | | | | |
Collapse
|
17
|
Abstract
High-throughput genome research has long been associated with bioinformatics, as it assists genome sequencing and annotation projects. Along with databases, to store, properly manage, and retrieve biological data, a large number of computational tools have been developed to decode biological information from this data. However, with the advent of next-generation sequencing (NGS) technology the sequence data starts generating at a pace never before seen. Consequently researchers are facing a threat as they are experiencing a potential shortage of storage space and tools to analyze the data. Moreover, the voluminous data increases traffic in the network by uploading and downloading large data sets, and thus consume much of the network's available bandwidth. All of these obstacles have led to the solution in the form of cloud computing.
Collapse
Affiliation(s)
- Asheesh Shanker
- Department of Bioscience and Biotechnology, Banasthali University, Banasthali, Rajasthan, India.
| |
Collapse
|