Reference Citation Analysis

For: Ferraro Petrillo U, Sorella M, Cattaneo G, Giancarlo R, Rombo SE. Analyzing big datasets of genomic sequences: fast and scalable collection of k-mer statistics. BMC Bioinformatics 2019;20:138. [PMID: 30999863 PMCID: PMC6471689 DOI: 10.1186/s12859-019-2694-8] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Indexed: 11/24/2022]

Cited by Other Article(s)

Zheng H, Marçais G, Kingsford C. Creating and Using Minimizer Sketches in Computational Genomics. J Comput Biol 2023;30:1251-1276. [PMID: 37646787 PMCID: PMC11082048 DOI: 10.1089/cmb.2023.0094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/01/2023]

Chen J, Li F, Wang M, Li J, Marquez-Lago TT, Leier A, Revote J, Li S, Liu Q, Song J. BigFiRSt: A Software Program Using Big Data Technique for Mining Simple Sequence Repeats From Large-Scale Sequencing Data. Front Big Data 2022;4:727216. [PMID: 35118375 PMCID: PMC8805145 DOI: 10.3389/fdata.2021.727216] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2021] [Accepted: 12/13/2021] [Indexed: 11/22/2022]

Abstract

Background

Simple Sequence Repeats (SSRs) are short tandem repeats of nucleotide sequences. SSRs have been shown to be associated with human diseases and are of medical relevance. Accordingly, a variety of computational methods have been proposed to mine SSRs from genomes. Conventional methods rely on a high-quality complete genome to identify SSRs. However, a sequenced genome often misses several highly repetitive regions, and many non-model species have no complete genome available. With recent advances in next-generation sequencing (NGS) techniques, large-scale sequence reads can be rapidly generated for any species. In this context, a number of methods have been proposed to identify thousands of SSR loci within large amounts of reads for non-model species. Because the most commonly used NGS platforms on the market (e.g., the Illumina platform) generally provide short paired-end reads, merging overlapping paired-end reads has become a common step prior to the identification of SSR loci. Merging short read pairs and identifying SSRs from large-scale data poses a big data analysis challenge for traditional stand-alone tools.

Results

In this study, we present a new Hadoop-based software program, termed BigFiRSt, to address this problem using big data technology. BigFiRSt consists of two major modules, BigFLASH and BigPERF, which are based on two state-of-the-art stand-alone tools, FLASH and PERF, respectively. BigFLASH merges short read pairs and BigPERF mines SSRs, each in a big data manner. Comprehensive benchmarking experiments show that BigFiRSt can dramatically reduce the execution times of read-pair merging and SSR mining on very large-scale DNA sequence data.

Conclusions

The excellent performance of BigFiRSt stems mainly from its use of Hadoop big data technology to merge read pairs and mine SSRs in parallel across distributed computing clusters. We anticipate that BigFiRSt will be a valuable tool in the coming biological Big Data era.

Affiliation(s)

Jinxiang Chen: Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
Fuyi Li: Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia; Monash Centre for Data Science, Monash University, Melbourne, VIC, Australia; Department of Microbiology and Immunity, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, VIC, Australia
Miao Wang: Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
Junlong Li: Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
Tatiana T. Marquez-Lago: Department of Genetics, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States; Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
André Leier: Department of Genetics, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States; Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
Jerico Revote: Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia
Shuqin Li: Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
Quanzhong Liu: Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
Jiangning Song: Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia; Monash Centre for Data Science, Monash University, Melbourne, VIC, Australia. *Correspondence: Jiangning Song

Nyström-Persson J, Keeble-Gagnère G, Zawad N. Compact and evenly distributed k-mer binning for genomic sequences. Bioinformatics 2021;37:2563-2569. [PMID: 33693556 PMCID: PMC8428581 DOI: 10.1093/bioinformatics/btab156] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 10/13/2020] [Revised: 02/15/2021] [Accepted: 03/03/2021] [Indexed: 11/17/2022]

HRCM: An Efficient Hybrid Referential Compression Method for Genomic Big Data. BioMed Research International 2020;2019:3108950. [PMID: 31915686 PMCID: PMC6930768 DOI: 10.1155/2019/3108950] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 06/16/2019] [Revised: 09/14/2019] [Accepted: 10/22/2019] [Indexed: 12/22/2022]

Romano P, Céol A, Dräger A, Fiannaca A, Giugno R, La Rosa M, Milanesi L, Pfeffer U, Rizzo R, Shin SY, Xia J, Urso A. The 2017 Network Tools and Applications in Biology (NETTAB) workshop: aims, topics and outcomes. BMC Bioinformatics 2019;20:125. [PMID: 30999855 PMCID: PMC6472292 DOI: 10.1186/s12859-019-2681-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 12/24/2022]