1
Chen J, Li F, Wang M, Li J, Marquez-Lago TT, Leier A, Revote J, Li S, Liu Q, Song J. BigFiRSt: A Software Program Using Big Data Technique for Mining Simple Sequence Repeats From Large-Scale Sequencing Data. Front Big Data 2022; 4:727216. [PMID: 35118375 PMCID: PMC8805145 DOI: 10.3389/fdata.2021.727216] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2021] [Accepted: 12/13/2021] [Indexed: 11/22/2022]
Abstract
Background: Simple Sequence Repeats (SSRs) are short tandem repeats of nucleotide sequences. SSRs have been shown to be associated with human diseases and are of medical relevance. Accordingly, a variety of computational methods have been proposed to mine SSRs from genomes. Conventional methods rely on a high-quality complete genome to identify SSRs. However, sequenced genomes often miss highly repetitive regions, and many non-model species lack a complete genome. With recent advances in next-generation sequencing (NGS) techniques, large-scale sequence reads can be rapidly generated for any species. In this context, a number of methods have been proposed to identify thousands of SSR loci within large amounts of reads for non-model species. Because the most commonly used NGS platforms (e.g., Illumina) generally produce short paired-end reads, merging overlapping paired-end reads has become a common step before identifying SSR loci. Merging short read pairs and identifying SSRs from large-scale data poses a big data analysis challenge for traditional stand-alone tools.
Results: In this study, we present a new Hadoop-based software program, termed BigFiRSt, to address this problem using big data technology. BigFiRSt consists of two major modules, BigFLASH and BigPERF, implemented based on two state-of-the-art stand-alone tools, FLASH and PERF, respectively. BigFLASH merges short read pairs and BigPERF mines SSRs, each in a big data manner. Comprehensive benchmarking experiments show that BigFiRSt can dramatically reduce the execution time of read-pair merging and SSR mining on very large-scale DNA sequence data.
Conclusions: The excellent performance of BigFiRSt stems mainly from its use of Big Data Hadoop technology to merge read pairs and mine SSRs in parallel through distributed computing on clusters. We anticipate BigFiRSt will be a valuable tool in the coming biological Big Data era.
Affiliation(s)
- Jinxiang Chen
  - Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Fuyi Li
  - Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia
  - Monash Centre for Data Science, Monash University, Melbourne, VIC, Australia
  - Department of Microbiology and Immunity, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, VIC, Australia
- Miao Wang
  - Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Junlong Li
  - Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Tatiana T. Marquez-Lago
  - Department of Genetics, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
  - Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
- André Leier
  - Department of Genetics, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
  - Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
- Jerico Revote
  - Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia
- Shuqin Li
  - Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Quanzhong Liu
  - Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Jiangning Song
  - Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia
  - Monash Centre for Data Science, Monash University, Melbourne, VIC, Australia
  - *Correspondence: Jiangning Song
2
Vasconcelos S, Nunes GL, Dias MC, Lorena J, Oliveira RRM, Lima TGL, Pires ES, Valadares RBS, Alves R, Watanabe MTC, Zappi DC, Hiura AL, Pastore M, Vasconcelos LV, Mota NFO, Viana PL, Gil ASB, Simões AO, Imperatriz-Fonseca VL, Harley RM, Giulietti AM, Oliveira G. Unraveling the plant diversity of the Amazonian canga through DNA barcoding. Ecol Evol 2021; 11:13348-13362. [PMID: 34646474 PMCID: PMC8495817 DOI: 10.1002/ece3.8057] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 07/17/2021] [Revised: 08/03/2021] [Accepted: 08/11/2021] [Indexed: 01/04/2023]
Abstract
The canga of the Serra dos Carajás, in the Eastern Amazon, is home to a unique open plant community harboring several endemic and rare species. Although a complete flora survey has recently been published, little to no genetic information is available for most plant species of the ironstone outcrops of the Serra dos Carajás. In this scenario, DNA barcoding is a fast and effective approach to assessing the genetic diversity of the Serra dos Carajás flora, given the growing need for robust biodiversity conservation planning in an area with industrial mining activities. After testing eight different DNA barcode markers (matK, rbcL, rpoB, rpoC1, atpF-atpH, psbK-psbI, trnH-psbA, and ITS2), we chose rbcL and ITS2 as the most suitable markers for broad application in the regional flora. Here we describe DNA barcodes for 1,130 specimens of 538 species, 323 genera, and 115 families of vascular plants from a highly diverse flora in the Amazon basin, with a total of 344 species being barcoded for the first time. In addition, we assessed the potential of using DNA metabarcoding of bulk samples for surveying plant diversity in the canga. Having completed the first comprehensive DNA barcoding effort directed at a complete flora in the Brazilian Amazon, we discuss the relevance of our results for guiding future conservation measures in the Serra dos Carajás.
Affiliation(s)
- Mariana C. Dias
  - Instituto Tecnológico Vale, Belém, Brazil
  - Programa Interunidades de Pós-Graduação em Bioinformática, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
- Renato R. M. Oliveira
  - Instituto Tecnológico Vale, Belém, Brazil
  - Programa Interunidades de Pós-Graduação em Bioinformática, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
- Daniela C. Zappi
  - Instituto Tecnológico Vale, Belém, Brazil
  - Instituto de Ciências Biológicas, Universidade de Brasília, Brasília, Brazil
- Mayara Pastore
  - Instituto Tecnológico Vale, Belém, Brazil
  - Coordenação de Botânica, Museu Paraense Emílio Goeldi, Belém, Brazil
- Liziane V. Vasconcelos
  - Instituto Tecnológico Vale, Belém, Brazil
  - Programa de Pós-Graduação em Ecologia, Universidade Federal do Pará, Belém, Brazil
- Nara F. O. Mota
  - Instituto Tecnológico Vale, Belém, Brazil
  - Coordenação de Botânica, Museu Paraense Emílio Goeldi, Belém, Brazil
- Pedro L. Viana
  - Coordenação de Botânica, Museu Paraense Emílio Goeldi, Belém, Brazil
- André S. B. Gil
  - Coordenação de Botânica, Museu Paraense Emílio Goeldi, Belém, Brazil
- André O. Simões
  - Departamento de Biologia Vegetal, Universidade Estadual de Campinas, Campinas, Brazil
- Ana M. Giulietti
  - Instituto Tecnológico Vale, Belém, Brazil
  - Programa de Pós-Graduação em Botânica, Universidade Estadual de Feira de Santana, Feira de Santana, Brazil