1
Alves SIA, Ferreira VBC, Dantas CWD, da Silva ALDC, Ramos RTJ. EasySSR: a user-friendly web application with full command-line features for large-scale batch microsatellite mining and samples comparison. Front Genet 2023; 14:1228552. [PMID: 37693309 PMCID: PMC10483286 DOI: 10.3389/fgene.2023.1228552] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/25/2023] [Accepted: 07/28/2023] [Indexed: 09/12/2023]
Abstract
Microsatellites, also known as SSRs or STRs, are polymorphic DNA regions consisting of tandem repetitions of a nucleotide motif of 1-6 base pairs, with a broad range of applications in fields such as comparative genomics, molecular biology, and forensics. However, the majority of researchers do not have computational training and struggle to run command-line tools or very limited web tools for their SSR research. They spend considerable time learning how to execute the software and tabulating the post-processed data in other tools or manually, time that could be spent directly on data analysis. We present EasySSR, a user-friendly web tool with full command-line functionality, designed for practical batch identification and comparison of SSRs in sequences and in draft or complete genomes, with no prior bioinformatics skills required. EasySSR requires only a FASTA file and an optional GenBank file of one or more genomes to identify and compare SSRs. The tool can automatically analyze and compare SSRs in whole genomes, convert GenBank to PTT files, identify perfect and imperfect SSRs and coding and non-coding regions, and compare their frequencies, abundance, motifs, flanking sequences, and iterations. It produces many outputs ready for download, such as PTT files, interactive charts, and Excel tables, giving the user data ready for further analysis in minutes. EasySSR was implemented as a web application, which can be executed from any browser and is available for free at https://computationalbiology.ufpa.br/easyssr/. Tutorials, usage notes, and download links to the source code can be found at https://github.com/engbiopct/EasySSR.
Affiliation(s)
- Sandy Ingrid Aguiar Alves
  - Laboratory of Biological Engineering, Biological Science Institute, Park of Science and Technology, Federal University of Pará, Belém, Brazil
- Victor Benedito Costa Ferreira
  - Laboratory of Biological Engineering, Biological Science Institute, Park of Science and Technology, Federal University of Pará, Belém, Brazil
- Artur Luiz da Costa da Silva
  - Laboratory of Biological Engineering, Biological Science Institute, Park of Science and Technology, Federal University of Pará, Belém, Brazil
- Rommel Thiago Jucá Ramos
  - Laboratory of Biological Engineering, Biological Science Institute, Park of Science and Technology, Federal University of Pará, Belém, Brazil
2
Chen J, Li F, Wang M, Li J, Marquez-Lago TT, Leier A, Revote J, Li S, Liu Q, Song J. BigFiRSt: A Software Program Using Big Data Technique for Mining Simple Sequence Repeats From Large-Scale Sequencing Data. Front Big Data 2022; 4:727216. [PMID: 35118375 PMCID: PMC8805145 DOI: 10.3389/fdata.2021.727216] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2021] [Accepted: 12/13/2021] [Indexed: 11/22/2022]
Abstract
Background: Simple Sequence Repeats (SSRs) are short tandem repeats of nucleotide sequences. It has been shown that SSRs are associated with human diseases and are of medical relevance. Accordingly, a variety of computational methods have been proposed to mine SSRs from genomes. Conventional methods rely on a high-quality complete genome to identify SSRs. However, a sequenced genome often misses several highly repetitive regions. Moreover, many non-model species have no complete genome. With the recent advances in next-generation sequencing (NGS) techniques, large-scale sequence reads for any species can be rapidly generated. In this context, a number of methods have been proposed to identify thousands of SSR loci within large amounts of reads for non-model species. Because the most commonly used NGS platforms on the market (e.g., the Illumina platform) generally provide short paired-end reads, merging overlapping paired-end reads has become a common step prior to the identification of SSR loci. Merging short read pairs and identifying SSRs from large-scale data poses a big data analysis challenge for traditional stand-alone tools. Results: In this study, we present a new Hadoop-based software program, termed BigFiRSt, to address this problem using cutting-edge big data technology. BigFiRSt consists of two major modules, BigFLASH and BigPERF, implemented based on two state-of-the-art stand-alone tools, FLASH and PERF, respectively. BigFLASH and BigPERF address the problems of merging short read pairs and mining SSRs, respectively, in a big data manner. Comprehensive benchmarking experiments show that BigFiRSt can dramatically reduce the execution times of fast read pair merging and SSR mining from very large-scale DNA sequence data. Conclusions: The excellent performance of BigFiRSt is mainly attributable to the Big Data Hadoop technology, which merges read pairs and mines SSRs through parallel and distributed computing on clusters. We anticipate BigFiRSt will be a valuable tool in the coming biological Big Data era.
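The read-pair merging step mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the FLASH or BigFLASH algorithm: FLASH scores overlaps while tolerating mismatches, whereas this sketch accepts only exact overlaps. The function names and the `min_overlap` default are hypothetical, chosen for illustration.

```python
# Hypothetical illustration of merging an overlapping read pair.
# Assumption: exact-match overlaps only (real mergers tolerate mismatches).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def merge_pair(read1, read2, min_overlap=10):
    """Merge read1 with the reverse complement of read2.

    Tries the longest candidate overlap first and returns the merged
    fragment, or None if no exact overlap of at least min_overlap exists.
    """
    r1 = read1.upper()
    r2 = reverse_complement(read2)
    for olen in range(min(len(r1), len(r2)), min_overlap - 1, -1):
        if r1[-olen:] == r2[:olen]:
            # Keep read1 whole and append the non-overlapping tail of r2.
            return r1 + r2[olen:]
    return None
```

For example, `merge_pair("ACGTACGTTT", "TCCAAACG", min_overlap=5)` joins the two reads over their shared 5-base overlap, while a pair with no sufficient overlap yields `None`.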
Affiliation(s)
- Jinxiang Chen
  - Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Fuyi Li
  - Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia
  - Monash Centre for Data Science, Monash University, Melbourne, VIC, Australia
  - Department of Microbiology and Immunity, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, VIC, Australia
- Miao Wang
  - Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Junlong Li
  - Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Tatiana T. Marquez-Lago
  - Department of Genetics, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
  - Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
- André Leier
  - Department of Genetics, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
  - Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
- Jerico Revote
  - Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia
- Shuqin Li
  - Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Quanzhong Liu
  - Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Jiangning Song
  - Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia
  - Monash Centre for Data Science, Monash University, Melbourne, VIC, Australia
  - *Correspondence: Jiangning Song
3
Gou X, Shi H, Yu S, Wang Z, Li C, Liu S, Ma J, Chen G, Liu T, Liu Y. SSRMMD: A Rapid and Accurate Algorithm for Mining SSR Feature Loci and Candidate Polymorphic SSRs Based on Assembled Sequences. Front Genet 2020; 11:706. [PMID: 32849772 PMCID: PMC7398111 DOI: 10.3389/fgene.2020.00706] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 04/07/2020] [Accepted: 06/10/2020] [Indexed: 12/16/2022]
Abstract
Microsatellites or simple sequence repeats (SSRs) are short tandem repeats of DNA widespread in genomes and transcriptomes of diverse organisms and are used in various genetic studies. Few software programs that mine SSRs can be further used to mine polymorphic SSRs, and these programs have poor portability, have slow computational speed, are highly dependent on other programs, and have low marker development rates. In this study, we develop an algorithm named Simple Sequence Repeat Molecular Marker Developer (SSRMMD), which uses improved regular expressions to rapidly and exhaustively mine perfect SSR loci from any size of assembled sequence. To mine polymorphic SSRs, SSRMMD uses a novel three-stage method to assess the conservativeness of SSR flanking sequences and then uses the sliding window method to fragment each assembled sequence to assess its uniqueness. Furthermore, molecular biology assays support the polymorphic SSRs identified by SSRMMD. SSRMMD is implemented using the Perl programming language and can be downloaded from https://github.com/GouXiangJian/SSRMMD.
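Regular-expression mining of perfect SSR loci, as described in this abstract, can be sketched in a few lines of Python. This is only an illustration of the general technique with a backreference pattern, not SSRMMD's "improved regular expressions" (SSRMMD itself is written in Perl). The function name and the minimum repeat counts per motif length are hypothetical defaults.

```python
import re


def find_perfect_ssrs(seq, min_repeats=None):
    """Return sorted (start, end, motif, repeat_count) tuples for perfect SSRs.

    Coordinates are 0-based, end-exclusive. Motif lengths of 1-6 bp are scanned.
    """
    # Assumed thresholds (minimum repeat count per motif length); adjust as needed.
    if min_repeats is None:
        min_repeats = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}
    seq = seq.upper()
    hits = []
    for k, n in sorted(min_repeats.items()):
        # ([ACGT]{k}) captures one motif; \1{n-1,} demands at least n-1 more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA", "ATAT"),
            # since the shorter motif length already reports that locus.
            if any(k % d == 0 and motif == motif[:d] * (k // d) for d in range(1, k)):
                continue
            hits.append((m.start(), m.end(), motif, (m.end() - m.start()) // k))
    return sorted(hits)
```

Calling `find_perfect_ssrs("GG" + "AT" * 7 + "CC" + "A" * 12 + "G")` reports one dinucleotide locus and one mononucleotide locus. Imperfect and compound SSRs, flanking-sequence checks, and polymorphism assessment are outside the scope of this sketch.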
Affiliation(s)
- Xiangjian Gou
  - Triticeae Research Institute, Sichuan Agricultural University, Chengdu, China
  - Maize Research Institute, Sichuan Agricultural University, Chengdu, China
- Haoran Shi
  - Triticeae Research Institute, Sichuan Agricultural University, Chengdu, China
- Shifan Yu
  - Triticeae Research Institute, Sichuan Agricultural University, Chengdu, China
- Zhiqiang Wang
  - Triticeae Research Institute, Sichuan Agricultural University, Chengdu, China
- Caixia Li
  - Triticeae Research Institute, Sichuan Agricultural University, Chengdu, China
- Shihang Liu
  - Triticeae Research Institute, Sichuan Agricultural University, Chengdu, China
- Jian Ma
  - Triticeae Research Institute, Sichuan Agricultural University, Chengdu, China
- Guangdeng Chen
  - College of Resources, Sichuan Agricultural University, Chengdu, China
- Tao Liu
  - College of Information Engineering, Sichuan Agricultural University, Ya'an, China
- Yaxi Liu
  - Triticeae Research Institute, Sichuan Agricultural University, Chengdu, China
  - State Key Laboratory of Crop Gene Exploration and Utilization in Southwest China, Chengdu, China
4
Mitra U, Bhattacharyya B, Mukhopadhyay T. PEER: A direct method for biosequence pattern mining through waits of optimal k-mers. Inf Sci (N Y) 2020. [DOI: 10.1016/j.ins.2019.12.072] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/16/2022]
5
IDSSR: An Efficient Pipeline for Identifying Polymorphic Microsatellites from a Single Genome Sequence. Int J Mol Sci 2019; 20:ijms20143497. [PMID: 31315288 PMCID: PMC6678329 DOI: 10.3390/ijms20143497] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 05/13/2019] [Revised: 06/25/2019] [Accepted: 07/15/2019] [Indexed: 12/02/2022]
Abstract
Simple sequence repeats (SSRs), also known as microsatellites, consist of tandem 1–6-base motifs. They have become one of the most popular molecular markers, and are widely used in molecular ecology, conservation biology, molecular breeding, and many other fields. Previously reported methods identify both monomorphic and polymorphic SSRs and then determine which are polymorphic via experimental validation, which is potentially time-consuming and costly. Herein, we present a new strategy named insertion/deletion (INDEL) SSR (IDSSR) to identify polymorphic SSRs by integrating SSRs with nucleotide insertions/deletions (INDELs), based solely on a single genome sequence and the sequenced paired-end reads. These INDEL indexes and polymorphic SSRs were identified, as well as the number of repeats, repeat motifs, chromosome location, annealing temperature, and primer sequences, enabling future experimental approaches to determine their correctness and polymorphism. Experimental validation with the giant panda demonstrated that our method has high reliability and stability. The efficient SSR pipeline would help researchers obtain high-quality genetic markers for plants and animals of interest, save labor, and reduce costly marker-screening experiments. IDSSR is freely available at https://github.com/Allsummerking/IDSSR.
6
Shamanskiy VA, Timonina VN, Popadin KY, Gunbin KV. ImtRDB: a database and software for mitochondrial imperfect interspersed repeats annotation. BMC Genomics 2019; 20:295. [PMID: 31284879 PMCID: PMC6614062 DOI: 10.1186/s12864-019-5536-1] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 12/17/2022]
Abstract
BACKGROUND: The mitochondrion is the powerhouse of eukaryotic cells and has its own circular DNA (mtDNA) encoding various RNAs and proteins. Somatic perturbations of mtDNA accumulate with age, so it is of great importance to uncover the main sources of mtDNA instability. Recent analyses demonstrated that somatic mtDNA deletions depend on imperfect repeats of various kinds between distant mtDNA segments. However, until now there have been no comprehensive databases annotating all types of imperfect repeats in the many species with sequenced complete mitochondrial genomes, and no algorithms capable of calling all types of imperfect repeats in circular mtDNA. RESULTS: We implemented a naïve pattern recognition algorithm, by analogy to standard dot-plot construction procedures, that allows us to find both perfect and imperfect repeats of four main types: direct, inverted, mirror and complementary. Our algorithm is adapted to specific characteristics of mtDNA such as circularity and an excess of short repeats: it calls imperfect repeats starting from a length of 10 bp. We constructed ImtRDB, an interactive web-accessible database that stores the positions of perfect and imperfect repeats in the mtDNAs of more than 3500 vertebrate species. Additional tools, such as visualization of repeats within a genome, comparison of repeat densities among different genomes, and the option to download all results, make this database useful for many biologists. Our first analyses of the database demonstrated that (i) mtDNA imperfect repeats are usually short; (ii) they are associated with unfolded DNA structures; (iii) the four types of repeats positively correlate with each other, forming two equivalent pairs, direct and mirror versus inverted and complementary, with identical nucleotide content and similar distribution between species; (iv) the abundance of repeats is negatively associated with GC content; (v) the dinucleotides GC versus CG are overrepresented on the light chain of mtDNA covered by repeats. CONCLUSIONS: ImtRDB is available at http://bioinfodbs.kantiana.ru/ImtRDB/. It is accompanied by software that calls all types of interspersed repeats with different levels of degeneracy in circular DNA. This database and software can become a very useful tool in various areas of mitochondrial and chloroplast DNA research.
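The four repeat types named in the abstract can be made concrete with a small Python sketch. It assumes the commonly used definitions (direct: the same sequence; mirror: the reversed sequence; complementary: the complemented sequence; inverted: the reverse complement) and searches only for exact matches in a circular genome. It is not the ImtRDB algorithm, which also calls imperfect repeats; the function names are hypothetical.

```python
# Hypothetical illustration of the four repeat types on a circular sequence.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def repeat_partners(segment):
    """Return the sequence a partner must have for each repeat type (assumed definitions)."""
    seg = segment.upper()
    return {
        "direct": seg,
        "mirror": seg[::-1],
        "complementary": seg.translate(COMPLEMENT),
        "inverted": seg.translate(COMPLEMENT)[::-1],
    }


def find_in_circular(genome, query):
    """Return start positions of exact occurrences of query in a circular genome."""
    g = genome.upper()
    n = len(g)
    if not query or len(query) > n:
        return []
    # Append the first len(query)-1 bases so matches spanning the origin are found.
    extended = g + g[: len(query) - 1]
    return [i for i in range(n) if extended.startswith(query, i)]
```

For the segment `"AACG"`, the inverted partner is `"CGTT"`; searching the toy circular genome `"CGTTAA"` for `"AACG"` finds the copy that wraps around the origin at position 4.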
Affiliation(s)
- Viktor A Shamanskiy
  - Center for Mitochondrial Functional Genomics, School of Life Science, Immanuel Kant Baltic Federal University, Kaliningrad, Russia
- Valeria N Timonina
  - Center for Mitochondrial Functional Genomics, School of Life Science, Immanuel Kant Baltic Federal University, Kaliningrad, Russia
- Konstantin Yu Popadin
  - Center for Mitochondrial Functional Genomics, School of Life Science, Immanuel Kant Baltic Federal University, Kaliningrad, Russia
  - Center for Integrative Genomics, University of Lausanne, Lausanne, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Konstantin V Gunbin
  - Center for Mitochondrial Functional Genomics, School of Life Science, Immanuel Kant Baltic Federal University, Kaliningrad, Russia
  - Center of Brain Neurobiology and Neurogenetics, Institute of Cytology and Genetics SB RAS, Novosibirsk, Russia
7
Pickett BD, Miller JB, Ridge PG. Kmer-SSR: a fast and exhaustive SSR search algorithm. Bioinformatics 2018; 33:3922-3928. [PMID: 28968741 PMCID: PMC5860095 DOI: 10.1093/bioinformatics/btx538] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 07/19/2017] [Accepted: 08/29/2017] [Indexed: 11/30/2022]
Abstract
Motivation One of the main challenges with bioinformatics software is that the size and complexity of datasets necessitate trading speed for accuracy, or completeness. To combat this problem of computational complexity, a plethora of heuristic algorithms have arisen that report a ‘good enough’ solution to biological questions. However, in instances such as Simple Sequence Repeats (SSRs), a ‘good enough’ solution may not accurately portray results in population genetics, phylogenetics and forensics, which require accurate SSRs to calculate intra- and inter-species interactions. Results We present Kmer-SSR, which finds all SSRs faster than most heuristic SSR identification algorithms in a parallelized, easy-to-use manner. The exhaustive Kmer-SSR option has 100% precision and 100% recall and accurately identifies every SSR of any specified length. To identify more biologically pertinent SSRs, we also developed several filters that allow users to easily view a subset of SSRs based on user input. Kmer-SSR, coupled with the filter options, accurately and intuitively identifies SSRs quickly and in a more user-friendly manner than any other SSR identification algorithm. Availability and implementation The source code is freely available on GitHub at https://github.com/ridgelab/Kmer-SSR.
8
Beier S, Thiel T, Münch T, Scholz U, Mascher M. MISA-web: a web server for microsatellite prediction. Bioinformatics 2018; 33:2583-2585. [PMID: 28398459 PMCID: PMC5870701 DOI: 10.1093/bioinformatics/btx198] [Citation(s) in RCA: 1010] [Impact Index Per Article: 168.3] [Received: 11/02/2016] [Accepted: 04/06/2017] [Indexed: 12/27/2022]
Abstract
Motivation: Microsatellites are a widely used marker system in plant genetics and forensics. The development of reliable microsatellite markers from resequencing data is challenging. Results: We extended MISA, a computational tool assisting the development of microsatellite markers, and reimplemented it as a web-based application. We improved compound microsatellite detection and added the possibility to display and export MISA results in GFF3 format for downstream analysis. Availability and Implementation: MISA-web can be accessed at http://misaweb.ipk-gatersleben.de/. The website provides tutorials and usage notes as well as download links to the source code. Contact: scholz@ipk-gatersleben.de.
Affiliation(s)
- Sebastian Beier
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Corrensstr. 3, 06466 Seeland, Germany
- Thomas Thiel
  - KWS Saat SE, Grimsehlstr. 31, 37555 Einbeck, Germany
- Thomas Münch
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Corrensstr. 3, 06466 Seeland, Germany
- Uwe Scholz
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Corrensstr. 3, 06466 Seeland, Germany
- Martin Mascher
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Corrensstr. 3, 06466 Seeland, Germany
  - German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Germany
9
Rodrigues-Luiz GF, Cardoso MS, Valdivia HO, Ayala EV, Gontijo CMF, Rodrigues TDS, Fujiwara RT, Lopes RS, Bartholomeu DC. TipMT: Identification of PCR-based taxon-specific markers. BMC Bioinformatics 2017; 18:104. [PMID: 28187714 PMCID: PMC5303226 DOI: 10.1186/s12859-017-1485-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 06/16/2016] [Accepted: 01/11/2017] [Indexed: 12/02/2022]
Abstract
Background: Molecular genetic markers are one of the most informative and widely used genome features in clinical and environmental diagnostic studies. A polymerase chain reaction (PCR)-based molecular marker is very attractive because it is suitable for high-throughput automation and confers high specificity. However, the design of taxon-specific primers may be difficult and time-consuming due to the need to identify appropriate genomic regions for annealing primers and to evaluate primer specificity. Results: Here, we report the development of a Tool for Identification of Primers for Multiple Taxa (TipMT), which is a web application to search and design primers for genotyping based on genomic data. The tool identifies and targets simple sequence repeats (SSRs) or orthologous/taxa-specific genes for genotyping using multiplex PCR. This pipeline was applied to the genomes of four species of Leishmania (L. amazonensis, L. braziliensis, L. infantum and L. major) and validated by PCR using artificial genomic DNA mixtures of the Leishmania species as templates. This experimental validation demonstrates the reliability of TipMT because amplification profiles showed discrimination of genomic DNA samples from Leishmania species. Conclusions: The TipMT web tool allows for large-scale identification and design of taxon-specific primers and is freely available to the scientific community at http://200.131.37.155/tipMT/. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-017-1485-3) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Gabriela F Rodrigues-Luiz
  - Laboratório de Imunologia e Genômica de Parasitos, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
- Mariana S Cardoso
  - Laboratório de Imunologia e Genômica de Parasitos, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
- Hugo O Valdivia
  - Laboratório de Imunologia e Genômica de Parasitos, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
- Edward V Ayala
  - Laboratório de Imunologia e Genômica de Parasitos, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
- Thiago de S Rodrigues
  - Departamento de Computação, Centro Federal de Educação Tecnológica de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
- Ricardo T Fujiwara
  - Laboratório de Imunologia e Genômica de Parasitos, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
- Robson S Lopes
  - Laboratório de Imunologia e Genômica de Parasitos, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
  - Departamento de Computação, Universidade Federal do Mato Grosso, Barra do Garças, Mato Grosso, Brazil
- Daniella C Bartholomeu
  - Laboratório de Imunologia e Genômica de Parasitos, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil