Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Alnasir JJ, Shanahan HP. The application of Hadoop in structural bioinformatics. Brief Bioinform 2018;21:96-105. [PMID: 30462158 DOI: 10.1093/bib/bby106] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/27/2018] [Revised: 09/20/2018] [Accepted: 10/05/2018] [Indexed: 11/13/2022] Open

For:	Alnasir JJ, Shanahan HP. The application of Hadoop in structural bioinformatics. Brief Bioinform 2018;21:96-105. [PMID: 30462158 DOI: 10.1093/bib/bby106] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/27/2018] [Revised: 09/20/2018] [Accepted: 10/05/2018] [Indexed: 11/13/2022] Open

Number

Cited by Other Article(s)

Chen J, Li F, Wang M, Li J, Marquez-Lago TT, Leier A, Revote J, Li S, Liu Q, Song J. BigFiRSt: A Software Program Using Big Data Technique for Mining Simple Sequence Repeats From Large-Scale Sequencing Data. Front Big Data 2022;4:727216. [PMID: 35118375 PMCID: PMC8805145 DOI: 10.3389/fdata.2021.727216] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/18/2021] [Accepted: 12/13/2021] [Indexed: 11/22/2022] Open

Abstract

Background

Simple Sequence Repeats (SSRs) are short tandem repeats of nucleotide sequences. It has been shown that SSRs are associated with human diseases and are of medical relevance. Accordingly, a variety of computational methods have been proposed to mine SSRs from genomes. Conventional methods rely on a high-quality complete genome to identify SSRs. However, the sequenced genome often misses several highly repetitive regions. Moreover, many non-model species have no entire genomes. With the recent advances of next-generation sequencing (NGS) techniques, large-scale sequence reads for any species can be rapidly generated using NGS. In this context, a number of methods have been proposed to identify thousands of SSR loci within large amounts of reads for non-model species. While the most commonly used NGS platforms (e.g., Illumina platform) on the market generally provide short paired-end reads, merging overlapping paired-end reads has become a common way prior to the identification of SSR loci. This has posed a big data analysis challenge for traditional stand-alone tools to merge short read pairs and identify SSRs from large-scale data.

Results

In this study, we present a new Hadoop-based software program, termed BigFiRSt, to address this problem using cutting-edge big data technology. BigFiRSt consists of two major modules, BigFLASH and BigPERF, implemented based on two state-of-the-art stand-alone tools, FLASH and PERF, respectively. BigFLASH and BigPERF address the problem of merging short read pairs and mining SSRs in the big data manner, respectively. Comprehensive benchmarking experiments show that BigFiRSt can dramatically reduce the execution times of fast read pairs merging and SSRs mining from very large-scale DNA sequence data.

Conclusions

The excellent performance of BigFiRSt mainly resorts to the Big Data Hadoop technology to merge read pairs and mine SSRs in parallel and distributed computing on clusters. We anticipate BigFiRSt will be a valuable tool in the coming biological Big Data era.

Collapse

Affiliation(s)

Jinxiang Chen Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
Fuyi Li Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia Monash Centre for Data Science, Monash University, Melbourne, VIC, Australia Department of Microbiology and Immunity, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, VIC, Australia
Miao Wang Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
Junlong Li Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
Tatiana T. Marquez-Lago Department of Genetics, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
André Leier Department of Genetics, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
Jerico Revote Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia
Shuqin Li Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
Quanzhong Liu Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China Quanzhong Liu
Jiangning Song Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia Monash Centre for Data Science, Monash University, Melbourne, VIC, Australia *Correspondence: Jiangning Song

Collapse

Chen W, Yao C, Guo Y, Wang Y, Xue Z. pmTM-align: scalable pairwise and multiple structure alignment with Apache Spark and OpenMP. BMC Bioinformatics 2020;21:426. [PMID: 32993484 PMCID: PMC7526426 DOI: 10.1186/s12859-020-03757-2] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/27/2019] [Accepted: 09/16/2020] [Indexed: 12/18/2022] Open

Bagheri H, Muppirala U, Masonbrink RE, Severin AJ, Rajan H. Shared data science infrastructure for genomics data. BMC Bioinformatics 2019;20:436. [PMID: 31438850 PMCID: PMC6704658 DOI: 10.1186/s12859-019-2967-2] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/02/2019] [Accepted: 06/25/2019] [Indexed: 01/20/2023] Open

Abstract

BACKGROUND

Creating a scalable computational infrastructure to analyze the wealth of information contained in data repositories is difficult due to significant barriers in organizing, extracting and analyzing relevant data. Shared data science infrastructures like Boag is needed to efficiently process and parse data contained in large data repositories. The main features of Boag are inspired from existing languages for data intensive computing and can easily integrate data from biological data repositories.

RESULTS

As a proof of concept, Boa for genomics, Boag, has been implemented to analyze RefSeq's 153,848 annotation (GFF) and assembly (FASTA) file metadata. Boag provides a massive improvement from existing solutions like Python and MongoDB, by utilizing a domain-specific language that uses Hadoop infrastructure for a smaller storage footprint that scales well and requires fewer lines of code. We execute scripts through Boag to answer questions about the genomes in RefSeq. We identify the largest and smallest genomes deposited, explore exon frequencies for assemblies after 2016, identify the most commonly used bacterial genome assembly program, and address how animal genome assemblies have improved since 2016. Boag databases provide a significant reduction in required storage of the raw data and a significant speed up in its ability to query large datasets due to automated parallelization and distribution of Hadoop infrastructure during computations.

CONCLUSIONS

In order to keep pace with our ability to produce biological data, innovative methods are required. The Shared Data Science Infrastructure, Boag, provides researchers a greater access to researchers to efficiently explore data in new ways. We demonstrate the potential of a the domain specific language Boag using the RefSeq database to explore how deposited genome assemblies and annotations are changing over time. This is a small example of how Boag could be used with large biological datasets.

Collapse