1
Chen J, Li F, Wang M, Li J, Marquez-Lago TT, Leier A, Revote J, Li S, Liu Q, Song J. BigFiRSt: A Software Program Using Big Data Technique for Mining Simple Sequence Repeats From Large-Scale Sequencing Data. Front Big Data 2022; 4:727216. [PMID: 35118375 PMCID: PMC8805145 DOI: 10.3389/fdata.2021.727216] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2021] [Accepted: 12/13/2021] [Indexed: 11/22/2022]
Abstract
Background: Simple Sequence Repeats (SSRs) are short tandem repeats of nucleotide sequences. SSRs have been shown to be associated with human diseases and are of medical relevance. Accordingly, a variety of computational methods have been proposed to mine SSRs from genomes. Conventional methods rely on a high-quality complete genome to identify SSRs. However, sequenced genomes often miss several highly repetitive regions, and many non-model species have no complete genome. With recent advances in next-generation sequencing (NGS) techniques, large-scale sequence reads can be rapidly generated for any species. In this context, a number of methods have been proposed to identify thousands of SSR loci within large amounts of reads for non-model species. Because the most commonly used NGS platforms (e.g., the Illumina platform) generally provide short paired-end reads, merging overlapping paired-end reads has become a common step prior to the identification of SSR loci. Merging short read pairs and identifying SSRs from large-scale data poses a big data analysis challenge for traditional stand-alone tools.
Results: In this study, we present a new Hadoop-based software program, termed BigFiRSt, to address this problem using cutting-edge big data technology. BigFiRSt consists of two major modules, BigFLASH and BigPERF, implemented based on two state-of-the-art stand-alone tools, FLASH and PERF, respectively. BigFLASH and BigPERF address the problems of merging short read pairs and mining SSRs, respectively, in a big data manner. Comprehensive benchmarking experiments show that BigFiRSt can dramatically reduce the execution times of read-pair merging and SSR mining on very large-scale DNA sequence data.
Conclusions: The excellent performance of BigFiRSt is mainly attributable to Big Data Hadoop technology, which merges read pairs and mines SSRs through parallel and distributed computing on clusters. We anticipate that BigFiRSt will be a valuable tool in the coming biological Big Data era.
Affiliation(s)
- Jinxiang Chen
  - Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Fuyi Li
  - Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia
  - Monash Centre for Data Science, Monash University, Melbourne, VIC, Australia
  - Department of Microbiology and Immunity, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, VIC, Australia
- Miao Wang
  - Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Junlong Li
  - Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Tatiana T. Marquez-Lago
  - Department of Genetics, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
  - Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
- André Leier
  - Department of Genetics, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
  - Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
- Jerico Revote
  - Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia
- Shuqin Li
  - Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Quanzhong Liu
  - Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Jiangning Song
  - Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia
  - Monash Centre for Data Science, Monash University, Melbourne, VIC, Australia
  - *Correspondence: Jiangning Song
2
Mining the red deer genome (CerEla1.0) to develop X- and Y-chromosome-linked STR markers. PLoS One 2020; 15:e0242506. [PMID: 33226998 PMCID: PMC7986210 DOI: 10.1371/journal.pone.0242506] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 04/27/2020] [Accepted: 11/03/2020] [Indexed: 12/13/2022]
Abstract
Microsatellites are widely applied in population and forensic genetics, wildlife studies, and parentage testing in animal breeding, among other fields, and high-throughput sequencing technologies have recently greatly facilitated the identification of microsatellite markers. In this study, the genomic data of Cervus elaphus (CerEla1.0) were used to identify microsatellite loci along the red deer genome and to design the cognate primers. The bioinformatics pipeline identified 982,433 microsatellite motifs genome-wide, distributed along the chromosomes, of which 45,711 loci mapped to the X-chromosome and 1,096 to the Y-chromosome. Primers were successfully designed for 170,873 loci and validated with an independently developed autosomal tetranucleotide STR set. Ten X- and five Y-chromosome-linked microsatellites were selected and tested in two multiplex PCR setups on genomic DNA samples of 123 red deer stags. The average number of alleles per locus was 3.3, and the average gene diversity value of the markers was 0.270. The overall observed and expected heterozygosities were 0.755 and 0.832, respectively. Polymorphic Information Content (PIC) ranged between 0.469 and 0.909 per locus, with a mean value of 0.813. Using the X- and Y-chromosome-linked markers, 19 different Y-chromosome and 72 X-chromosome lines were identified. The X- and Y-haplotypes each split into two distinct clades. The Y-chromosome clades correlated strongly with the geographic origin of the sampled haplotypes. Segregation and admixture of subpopulations in southwestern and northeastern Hungary were demonstrated using a combination of nine autosomal and 16 sex chromosomal STRs. In conclusion, the approach demonstrated here is a very efficient method for developing microsatellite markers for species with available genomic sequence data, as well as for using them in individual identification and in population genetics studies.
3
Wu Q, Miao G, Li X, Liu W, Ikhwanuddin M, Ma H. De novo assembly of genome and development of polymorphic microsatellite loci in the blue swimming crab (Portunus pelagicus) using RAD approach. Mol Biol Rep 2018; 45:1913-1918. [DOI: 10.1007/s11033-018-4339-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 03/30/2018] [Accepted: 08/28/2018] [Indexed: 12/17/2022]