1. Scalable Extraction of Big Macromolecular Data in Azure Data Lake Environment. Molecules 2019; 24:179. [PMID: 30621295; PMCID: PMC6337464; DOI: 10.3390/molecules24010179] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 11/08/2018] [Revised: 12/29/2018] [Accepted: 01/01/2019] [Indexed: 11/16/2022] Open
Abstract
Calculation of structural features of proteins, nucleic acids, and nucleic acid-protein complexes on the basis of their geometries and studying various interactions within these macromolecules, for which high-resolution structures are stored in the Protein Data Bank (PDB), require parsing and extraction of suitable data stored in text files. To perform these operations on a large scale in the face of the growing amount of macromolecular data in public repositories, we propose to perform them in the distributed environment of Azure Data Lake and scale the calculations on the Cloud. In this paper, we present dedicated data extractors for PDB files that can be used in various types of calculations performed over protein and nucleic acid structures in the Azure Data Lake. Results of our tests show that the Cloud storage space occupied by the macromolecular data can be successfully reduced by compressing PDB files without significant loss of data processing efficiency. Moreover, our experiments show that the calculations can be significantly accelerated by using large sequential files for storing macromolecular data and by parallelizing the calculations and the data extractions that precede them. Finally, the paper shows how all the calculations can be performed in a declarative way in U-SQL scripts for Data Lake Analytics.

2. HDInsight4PSi: Boosting performance of 3D protein structure similarity searching with HDInsight clusters in Microsoft Azure cloud. Inf Sci (N Y) 2016. [DOI: 10.1016/j.ins.2016.02.029] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.2] [Indexed: 11/18/2022]

3. Horlacher O, Lisacek F, Müller M. Mining Large Scale Tandem Mass Spectrometry Data for Protein Modifications Using Spectral Libraries. J Proteome Res 2015; 15:721-31. [PMID: 26653734; DOI: 10.1021/acs.jproteome.5b00877] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Indexed: 11/28/2022]
Abstract
Experimental improvements in post-translational modification (PTM) detection by tandem mass spectrometry (MS/MS) have allowed the identification of vast numbers of PTMs. Open modification searches (OMSs) of MS/MS data, which do not require prior knowledge of the modifications present in the sample, have further increased the diversity of detected PTMs. Despite much effort, there is still a lack of functional annotation of PTMs. One possibility to narrow the annotation gap is to mine MS/MS data deposited in public repositories and to correlate the PTM presence with biological meta-information attached to the data. Since the data volume can be quite substantial and contain tens of millions of MS/MS spectra, the data mining tools must be able to cope with big data. Here, we present two tools, Liberator and MzMod, which are built using the MzJava class library and the Apache Spark large-scale computing framework. Liberator builds large MS/MS spectrum libraries, and MzMod searches them in an OMS mode. We applied these tools to a recently published set of 25 million spectra from 30 human tissues and present tissue-specific PTMs. We also compared the results to the ones obtained with the OMS tool MODa and the search engine X!Tandem.
Affiliation(s)
- Oliver Horlacher
  - Proteome Informatics Group, SIB Swiss Institute of Bioinformatics, Geneva 1211, Switzerland; Centre Universitaire de Bioinformatique, University of Geneva, Geneva 1211, Switzerland
- Frederique Lisacek
  - Proteome Informatics Group, SIB Swiss Institute of Bioinformatics, Geneva 1211, Switzerland; Centre Universitaire de Bioinformatique, University of Geneva, Geneva 1211, Switzerland
- Markus Müller
  - Proteome Informatics Group, SIB Swiss Institute of Bioinformatics, Geneva 1211, Switzerland; Centre Universitaire de Bioinformatique, University of Geneva, Geneva 1211, Switzerland

4. Horlacher O, Nikitin F, Alocci D, Mariethoz J, Müller M, Lisacek F. MzJava: An open source library for mass spectrometry data processing. J Proteomics 2015; 129:63-70. [PMID: 26141507; DOI: 10.1016/j.jprot.2015.06.013] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.2] [Received: 05/02/2015] [Revised: 06/17/2015] [Accepted: 06/22/2015] [Indexed: 10/23/2022]
Abstract
Mass spectrometry (MS) is a widely used and evolving technique for the high-throughput identification of molecules in biological samples. The need for sharing and reuse of code among bioinformaticians working with MS data prompted the design and implementation of MzJava, an open-source Java Application Programming Interface (API) for MS-related data processing. MzJava provides data structures and algorithms for representing and processing mass spectra and their associated biological molecules, such as metabolites, glycans and peptides. MzJava includes functionality to perform mass calculation, peak processing (e.g. centroiding, filtering, transforming), spectrum alignment and clustering, protein digestion, and fragmentation of peptides and glycans, as well as scoring functions for spectrum-spectrum and peptide/glycan-spectrum matches. For data import and export, MzJava implements readers and writers for commonly used data formats. For many classes, support for the Hadoop MapReduce (hadoop.apache.org) and Apache Spark (spark.apache.org) cluster computing frameworks was implemented. The library has been developed applying best practices of software engineering. To ensure that MzJava contains code that is correct and easy to use, the library's API was carefully designed and thoroughly tested. MzJava is an open-source project distributed under the AGPL v3.0 licence. MzJava requires Java 1.7 or higher. Binaries, source code and documentation can be downloaded from http://mzjava.expasy.org and https://bitbucket.org/sib-pig/mzjava. This article is part of a Special Issue entitled: Computational Proteomics.
Affiliation(s)
- Oliver Horlacher
  - Proteome Informatics Group, SIB Swiss Institute of Bioinformatics, Geneva 1211, Switzerland; Centre Universitaire de Bioinformatique, University of Geneva, Geneva 1211, Switzerland
- Frederic Nikitin
  - Proteome Informatics Group, SIB Swiss Institute of Bioinformatics, Geneva 1211, Switzerland
- Davide Alocci
  - Proteome Informatics Group, SIB Swiss Institute of Bioinformatics, Geneva 1211, Switzerland; Centre Universitaire de Bioinformatique, University of Geneva, Geneva 1211, Switzerland
- Julien Mariethoz
  - Proteome Informatics Group, SIB Swiss Institute of Bioinformatics, Geneva 1211, Switzerland
- Markus Müller
  - Proteome Informatics Group, SIB Swiss Institute of Bioinformatics, Geneva 1211, Switzerland; Centre Universitaire de Bioinformatique, University of Geneva, Geneva 1211, Switzerland
- Frederique Lisacek
  - Proteome Informatics Group, SIB Swiss Institute of Bioinformatics, Geneva 1211, Switzerland; Centre Universitaire de Bioinformatique, University of Geneva, Geneva 1211, Switzerland

5. Hung CL, Chen WP, Hua GJ, Zheng H, Tsai SJJ, Lin YL. Cloud computing-based TagSNP selection algorithm for human genome data. Int J Mol Sci 2015; 16:1096-110. [PMID: 25569088; PMCID: PMC4307292; DOI: 10.3390/ijms16011096] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 09/16/2014] [Accepted: 12/04/2014] [Indexed: 12/31/2022] Open
Abstract
Single nucleotide polymorphisms (SNPs) play a fundamental role in human genetic variation and are used in medical diagnostics, phylogeny construction, and drug design. They provide the highest-resolution genetic fingerprint for identifying disease associations and human features. Haplotypes are regions of linked genetic variants that are closely spaced on the genome and tend to be inherited together. Genetics research has revealed SNPs within certain haplotype blocks that introduce few distinct common haplotypes into most of the population. Haplotype block structures are used in association-based methods to map disease genes. In this paper, we propose an efficient algorithm for identifying haplotype blocks in the genome. In chromosomal haplotype data retrieved from the HapMap project website, the proposed algorithm identified longer haplotype blocks than an existing algorithm. To enhance its performance, we extended the proposed algorithm into a parallel algorithm that copies data in parallel via the Hadoop MapReduce framework. The proposed MapReduce-paralleled combinatorial algorithm performed well on real-world data obtained from the HapMap dataset; the improvement in computational efficiency was proportional to the number of processors used.
Affiliation(s)
- Che-Lun Hung
  - Department of Computer Science and Communication Engineering, Providence University, Taichung 43301, Taiwan.
- Wen-Pei Chen
  - Department of Applied Chemistry, Providence University, Taiwan 43301, Taiwan.
- Guan-Jie Hua
  - Department of Computer Science, National Tsing Hua University, Hsinchu 30013, Taiwan.
- Huiru Zheng
  - School of Computing and Mathematics, University of Ulster, Newtownabbey BT37 0QB, UK.
- Suh-Jen Jane Tsai
  - Department of Applied Chemistry, Providence University, Taiwan 43301, Taiwan.
- Yaw-Ling Lin
  - Department of Applied Chemistry, Providence University, Taiwan 43301, Taiwan.

6. Seralathan MV, Sivanesan S, Bafana A, Kashyap SM, Patrizio A, Krishnamurthi K, Chakrabarti T. Cytochrome P450 BM3 of Bacillus megaterium - a possible endosulfan biotransforming gene. J Environ Sci (China) 2014; 26:2307-2314. [PMID: 25458686; DOI: 10.1016/j.jes.2014.09.016] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 12/19/2013] [Revised: 01/23/2014] [Accepted: 04/03/2014] [Indexed: 06/04/2023]
Abstract
Computational chemistry was applied to understand the biotransformation mechanism of an organochlorine pesticide, endosulfan. The stereospecific metabolic activity of human CYP-2B6 (cytochrome P450) on endosulfan has been well demonstrated. Sequence and structural similarity searches revealed that the bacterium Bacillus megaterium encodes CYP-BM3, which is similar to CYP-2B6. The functional similarity was studied at the organism level by batch-scale studies, which showed that B. megaterium could metabolize endosulfan to endosulfan sulfate, as CYP-2B6 does in the human system. The gene expression analyses also confirmed the possible role of CYP-BM3 in endosulfan metabolism. Thus, our results show that the protein structure-based in silico approach can help us to understand and identify microbes for remediation strategy development. To the best of our knowledge, this is the first report that has extrapolated the bacterial gene for endosulfan biotransformation through an in silico prediction approach for metabolic gene identification.
Affiliation(s)
- Amit Bafana
  - Environmental Health Division, CSIR-NEERI, Nagpur 440020, India

7. Dalpé G, Joly Y. Opportunities and Challenges Provided by Cloud Repositories for Bioinformatics-Enabled Drug Discovery. Drug Dev Res 2014; 75:393-401. [DOI: 10.1002/ddr.21211] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 06/24/2014] [Accepted: 06/24/2014] [Indexed: 02/03/2023]
Affiliation(s)
- Gratien Dalpé
  - Centre of Genomics and Policy, McGill University, Montreal, Quebec, Canada
- Yann Joly
  - Centre of Genomics and Policy, McGill University, Montreal, Quebec, Canada

8. Mrozek D, Małysiak-Mrozek B, Kłapciński A. Cloud4Psi: cloud computing for 3D protein structure similarity searching. Bioinformatics 2014; 30:2822-5. [PMID: 24930141; PMCID: PMC4173022; DOI: 10.1093/bioinformatics/btu389] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.1] [Indexed: 11/14/2022] Open
Abstract
Summary: Popular methods for 3D protein structure similarity searching, especially those that generate high-quality alignments such as Combinatorial Extension (CE) and Flexible structure Alignment by Chaining Aligned fragment pairs allowing Twists (FATCAT), are still time-consuming. As a consequence, performing similarity searching against large repositories of structural data requires increased computational resources that are not always available. Cloud computing provides huge amounts of computational power that can be provisioned on a pay-as-you-go basis. We have developed a cloud-based system that allows scaling of the similarity searching process vertically and horizontally. Cloud4Psi (Cloud for Protein Similarity) was tested in the Microsoft Azure cloud environment and provided good, almost linearly proportional acceleration when scaled out onto many computational units. Availability and implementation: Cloud4Psi is available as Software as a Service for testing purposes at: http://cloud4psi.cloudapp.net/. For source code and software availability, please visit the Cloud4Psi project home page at http://zti.polsl.pl/dmrozek/science/cloud4psi.htm. Contact: dariusz.mrozek@polsl.pl
Affiliation(s)
- Dariusz Mrozek
  - Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland
- Bożena Małysiak-Mrozek
  - Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland
- Artur Kłapciński
  - Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland

9. Hung CL, Hua GJ. Local alignment tool based on Hadoop framework and GPU architecture. BioMed Research International 2014; 2014:541490. [PMID: 24955362; PMCID: PMC4052794; DOI: 10.1155/2014/541490] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 01/30/2014] [Accepted: 04/14/2014] [Indexed: 11/17/2022]
Abstract
With the rapid growth of next generation sequencing technologies, such as Slex, more and more data have been discovered and published. When analyzing such huge data sets, computational performance is an important issue. Recently, many tools, such as SOAP, have been implemented on Hadoop and GPU parallel computing architectures. BLASTP is an important tool, implemented on GPU architectures, for biologists to compare protein sequences. When dealing with big biological data, it is hard to rely on a single GPU. Therefore, we implement a distributed BLASTP by combining Hadoop and multiple GPUs. The experimental results show that the proposed method can improve on the performance of BLASTP on a single GPU and can also achieve high availability and fault tolerance.
Affiliation(s)
- Che-Lun Hung
  - Department of Computer Science and Communication Engineering, Providence University, No. 200, Section 7, Taiwan Boulevard, Shalu District, Taichung 43301, Taiwan
- Guan-Jie Hua
  - Department of Computer Science and Information Engineering, Providence University, No. 200, Section 7, Taiwan Boulevard, Shalu District, Taichung 43301, Taiwan

10. Chang TH, Wu SL, Wang WJ, Horng JT, Chang CW. A novel approach for discovering condition-specific correlations of gene expressions within biological pathways by using cloud computing technology. BioMed Research International 2014; 2014:763237. [PMID: 24579087; PMCID: PMC3919110; DOI: 10.1155/2014/763237] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/25/2013] [Revised: 11/18/2013] [Accepted: 12/15/2013] [Indexed: 11/18/2022]
Abstract
Microarrays are widely used to assess gene expressions. Most microarray studies focus primarily on identifying differential gene expressions between conditions (e.g., cancer versus normal cells), for discovering the major factors that cause diseases. Because previous studies have not identified the correlations of differential gene expression between conditions, crucial but abnormal regulations that cause diseases might have been disregarded. This paper proposes an approach for discovering the condition-specific correlations of gene expressions within biological pathways. Because analyzing gene expression correlations is time consuming, an Apache Hadoop cloud computing platform was implemented. Three microarray data sets of breast cancer were collected from the Gene Expression Omnibus, and pathway information from the Kyoto Encyclopedia of Genes and Genomes was applied for discovering meaningful biological correlations. The results showed that adopting the Hadoop platform considerably decreased the computation time. Several correlations of differential gene expressions were discovered between the relapse and nonrelapse breast cancer samples, and most of them were involved in cancer regulation and cancer-related pathways. The results showed that breast cancer recurrence might be highly associated with the abnormal regulations of these gene pairs, rather than with their individual expression levels. The proposed method was computationally efficient and reliable, and stable results were obtained when different data sets were used. The proposed method is effective in identifying meaningful biological regulation patterns between conditions.
Affiliation(s)
- Tzu-Hao Chang
  - Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan
- Shih-Lin Wu
  - Department of Computer Science and Information Engineering, College of Engineering, Chang Gung University, Taoyuan 333, Taiwan
- Wei-Jen Wang
  - Department of Computer Science and Information Engineering, National Central University, Taoyuan 320, Taiwan
- Jorng-Tzong Horng
  - Department of Computer Science and Information Engineering, National Central University, Taoyuan 320, Taiwan
  - Department of Biomedical Informatics, Asia University, Taichung 413, Taiwan
- Cheng-Wei Chang
  - Department of Information Management, Hsing Wu University, New Taipei City 244, Taiwan