Reference Citation Analysis

For: Lefebvre A, Lecroq T, Dauchel H, Alexandre J. FORRepeats: detects repeats on entire chromosomes and between genomes. Bioinformatics 2003;19:319-26. [PMID: 12584116 DOI: 10.1093/bioinformatics/btf843] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.4] [Indexed: 11/14/2022]

Cited by Other Article(s)

Díaz-Domínguez D, Leinonen M, Salmela L. Space-efficient computation of k-mer dictionaries for large values of k. Algorithms Mol Biol 2024;19:14. [PMID: 38581000 PMCID: PMC10996146 DOI: 10.1186/s13015-024-00259-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/19/2023] [Accepted: 03/02/2024] [Indexed: 04/07/2024]

Leinonen M, Salmela L. SAKE: Strobemer-assisted k-mer extraction. PLoS One 2023;18:e0294415. [PMID: 38019768 PMCID: PMC10686461 DOI: 10.1371/journal.pone.0294415] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2023] [Accepted: 10/30/2023] [Indexed: 12/01/2023]

Leinonen M, Salmela L. Extraction of long k-mers using spaced seeds. IEEE/ACM Trans Comput Biol Bioinform 2021;PP:1-1. [PMID: 34529572 DOI: 10.1109/tcbb.2021.3113131] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 06/13/2023]

Kundu S, Ray MD, Sharma A. Interplay between genome organization and epigenomic alterations of pericentromeric DNA in cancer. J Genet Genomics 2021;48:184-197. [PMID: 33840602 DOI: 10.1016/j.jgg.2021.02.004] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 09/06/2020] [Revised: 02/07/2021] [Accepted: 02/20/2021] [Indexed: 12/16/2022]

Feng C, Dai M, Liu Y, Chen M. Sequence repetitiveness quantification and de novo repeat detection by weighted k-mer coverage. Brief Bioinform 2020;22:5855256. [PMID: 32591772 DOI: 10.1093/bib/bbaa086] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 01/23/2020] [Revised: 04/10/2020] [Accepted: 04/22/2020] [Indexed: 11/12/2022]

O'Neill RJ. Seq'ing identity and function in a repeat-derived noncoding RNA world. Chromosome Res 2020;28:111-127. [PMID: 32146545 PMCID: PMC7393779 DOI: 10.1007/s10577-020-09628-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 12/11/2019] [Revised: 02/07/2020] [Accepted: 02/14/2020] [Indexed: 01/06/2023]

Orozco-Arias S, Isaza G, Guyot R. Retrotransposons in Plant Genomes: Structure, Identification, and Classification through Bioinformatics and Machine Learning. Int J Mol Sci 2019;20:E3837. [PMID: 31390781 PMCID: PMC6696364 DOI: 10.3390/ijms20153837] [Citation(s) in RCA: 47] [Impact Index Per Article: 7.8] [Received: 06/21/2019] [Revised: 07/31/2019] [Accepted: 08/02/2019] [Indexed: 01/26/2023]

Manekar SC, Sathe SR. Estimating the k-mer Coverage Frequencies in Genomic Datasets: A Comparative Assessment of the State-of-the-art. Curr Genomics 2019;20:2-15. [PMID: 31015787 PMCID: PMC6446480 DOI: 10.2174/1389202919666181026101326] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 07/23/2018] [Revised: 10/05/2018] [Accepted: 10/24/2018] [Indexed: 12/24/2022]

Transposable Elements: Classification, Identification, and Their Use As a Tool For Comparative Genomics. Methods Mol Biol 2019;1910:177-207. [PMID: 31278665 DOI: 10.1007/978-1-4939-9074-0_6] [Citation(s) in RCA: 56] [Impact Index Per Article: 9.3] [Indexed: 12/16/2022]

Manekar SC, Sathe SR. A benchmark study of k-mer counting methods for high-throughput sequencing. Gigascience 2018;7:5140149. [PMID: 30346548 PMCID: PMC6280066 DOI: 10.1093/gigascience/giy125] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.7] [Received: 09/25/2017] [Accepted: 10/16/2018] [Indexed: 11/25/2022]

Touyar N, Schbath S, Cellier D, Dauchel H. Poisson Approximation for the Number of Repeats in a Stationary Markov Chain. J Appl Probab 2016. [DOI: 10.1239/jap/1214950359] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/18/2022]

Taillefer E, Miller J. Exhaustive computation of exact duplications via super and non-nested local maximal repeats. J Bioinform Comput Biol 2013;12:1350018. [PMID: 24467757 DOI: 10.1142/s0219720013500182] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 01/05/2023]

Jiang N. Overview of repeat annotation and de novo repeat identification. Methods Mol Biol 2013;1057:275-87. [PMID: 23918436 DOI: 10.1007/978-1-62703-568-2_20] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Indexed: 12/13/2022]

To detect and analyze sequence repeats whatever be their origin. Methods Mol Biol 2012;859:69-90. [PMID: 22367866 DOI: 10.1007/978-1-61779-603-6_4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/11/2023]

Janicki M, Rooke R, Yang G. Bioinformatics and genomic analysis of transposable elements in eukaryotic genomes. Chromosome Res 2012;19:787-808. [PMID: 21850457 DOI: 10.1007/s10577-011-9230-7] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.6] [Indexed: 12/18/2022]

Makałowski W, Pande A, Gotea V, Makałowska I. Transposable elements and their identification. Methods Mol Biol 2012;855:337-359. [PMID: 22407715 DOI: 10.1007/978-1-61779-582-4_12] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.7] [Indexed: 05/31/2023]

Külekci MO, Vitter JS, Xu B. Efficient maximal repeat finding using the Burrows-Wheeler transform and wavelet tree. IEEE/ACM Trans Comput Biol Bioinform 2011;9:421-429. [PMID: 21968959 DOI: 10.1109/tcbb.2011.127] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 05/31/2023]

Marçais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 2011;27:764-70. [PMID: 21217122 DOI: 10.1093/bioinformatics/btr011] [Citation(s) in RCA: 2744] [Impact Index Per Article: 196.0] [Indexed: 11/13/2022]

Identifying repeats and transposable elements in sequenced genomes: how to find your way through the dense forest of programs. Heredity (Edinb) 2009;104:520-33. [PMID: 19935826 DOI: 10.1038/hdy.2009.165] [Citation(s) in RCA: 143] [Impact Index Per Article: 8.9] [Indexed: 12/25/2022]

Becher V, Deymonnaz A, Heiber P. Efficient computation of all perfect repeats in genomic sequences of up to half a gigabyte, with a case study on the human genome. Bioinformatics 2009;25:1746-53. [DOI: 10.1093/bioinformatics/btp321] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.0] [Indexed: 01/23/2023]

Scheibye-Alsing K, Hoffmann S, Frankel A, Jensen P, Stadler PF, Mang Y, Tommerup N, Gilchrist MJ, Nygård AB, Cirera S, Jørgensen CB, Fredholm M, Gorodkin J. Sequence assembly. Comput Biol Chem 2008;33:121-36. [PMID: 19152793 DOI: 10.1016/j.compbiolchem.2008.11.003] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.1] [Received: 10/24/2008] [Revised: 11/28/2008] [Accepted: 11/28/2008] [Indexed: 01/20/2023]

Kurtz S, Narechania A, Stein JC, Ware D. A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes. BMC Genomics 2008;9:517. [PMID: 18976482 PMCID: PMC2613927 DOI: 10.1186/1471-2164-9-517] [Citation(s) in RCA: 177] [Impact Index Per Article: 10.4] [Received: 07/21/2008] [Accepted: 10/31/2008] [Indexed: 12/02/2022]

Abstract

Background

The challenges of accurate gene prediction and enumeration are further aggravated in large genomes that contain highly repetitive transposable elements (TEs). Yet TEs play a substantial role in genome evolution and are themselves an important subject of study. Repeat annotation, based on counting occurrences of k-mers, has been previously used to distinguish TEs from low-copy genic regions; but currently available software solutions are impractical due to high memory requirements or specialization for specific user-tasks.

Results

Here we introduce the Tallymer software, a flexible and memory-efficient collection of programs for k-mer counting and indexing of large sequence sets. Unlike previous methods, Tallymer is based on enhanced suffix arrays. This gives a much larger flexibility concerning the choice of the k-mer size. Tallymer can process large data sizes of several billion bases. We used it in a variety of applications to study the genomes of maize and other plant species. In particular, Tallymer was used to index a set of whole genome shotgun sequences from maize (B73) (total size 10⁹bp.). We analyzed k-mer frequencies for a wide range of k. At this low genome coverage (≈ 0.45×) highly repetitive 20-mers constituted 44% of the genome but represented only 1% of all possible k-mers. Similar low-complexity was seen in the repeat fractions of sorghum and rice. When applying our method to other maize data sets, High-C₀t derived sequences showed the greatest enrichment for low-copy sequences. Among annotated TEs, the most highly repetitive were of the Ty3/gypsy class of retrotransposons, followed by the Ty1/copia class, and DNA transposons. Among expressed sequence tags (EST), a notable fraction contained high-copy k-mers, suggesting that transposons are still active in maize. Retrotransposons in Mo17 and McC cultivars were readily detected using the B73 20-mer frequency index, indicating their conservation despite extensive rearrangement across cultivars. Among one hundred annotated bacterial artificial chromosomes (BACs), k-mer frequency could be used to detect transposon-encoded genes with 92% sensitivity, compared to 96% using alignment-based repeat masking, while both methods showed 92% specificity.

Conclusion

The Tallymer software was effective in a variety of applications to aid genome annotation in maize, despite limitations imposed by the relatively low coverage of sequence available. For more information on the software, see .
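The two quantities contrasted in the abstract above (the share of the genome covered by highly repetitive k-mers versus their share among distinct k-mers) can be illustrated with a minimal Python sketch. It uses a plain hash-based counter, not the enhanced suffix arrays that Tallymer is built on, and the function names and the `min_count` cutoff are assumptions made for illustration only.

```python
from collections import Counter


def count_kmers(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer on the forward strand of seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def repetitive_fractions(seq: str, k: int, min_count: int) -> tuple[float, float]:
    """Return (occurrence_fraction, distinct_fraction) for repetitive k-mers.

    occurrence_fraction: share of k-mer start positions whose k-mer
        occurs at least min_count times (hypothetical cutoff).
    distinct_fraction: share of distinct k-mers that reach that cutoff.
    """
    counts = count_kmers(seq, k)
    total_positions = sum(counts.values())
    if total_positions == 0:
        return 0.0, 0.0
    repetitive = [c for c in counts.values() if c >= min_count]
    return sum(repetitive) / total_positions, len(repetitive) / len(counts)


if __name__ == "__main__":
    occ, distinct = repetitive_fractions("ACGTACGTTT", k=4, min_count=2)
    print(f"occurrences: {occ:.3f}, distinct k-mers: {distinct:.3f}")
```

A hash table keeps the sketch short but fixes k at counting time and stores every distinct k-mer in memory; the abstract names flexibility in the choice of k as the benefit of the suffix-array approach.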

Poisson Approximation for the Number of Repeats in a Stationary Markov Chain. J Appl Probab 2008. [DOI: 10.1017/s0021900200004344] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/07/2022]

Kleffe J, Möller F, Wittig B. Simultaneous identification of long similar substrings in large sets of sequences. BMC Bioinformatics 2007;8 Suppl 5:S7. [PMID: 17570866 PMCID: PMC1892095 DOI: 10.1186/1471-2105-8-s5-s7] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/23/2022]

Zhang S, Xiao Y. Quasiperiodic property in Alu repeats. Phys Rev E Stat Nonlin Soft Matter Phys 2006;74:022901. [PMID: 17025492 DOI: 10.1103/physreve.74.022901] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 12/21/2005] [Revised: 04/06/2006] [Indexed: 05/12/2023]

Morgulis A, Gertz EM, Schäffer AA, Agarwala R. WindowMasker: window-based masker for sequenced genomes. Bioinformatics 2005;22:134-41. [PMID: 16287941 DOI: 10.1093/bioinformatics/bti774] [Citation(s) in RCA: 202] [Impact Index Per Article: 10.1] [Indexed: 11/15/2022]

Abstract

MOTIVATION

Matches to repetitive sequences are usually undesirable in the output of DNA database searches. Repetitive sequences need not be matched to a query, if they can be masked in the database. RepeatMasker/Maskeraid (RM), currently the most widely used software for DNA sequence masking, is slow and requires a library of repetitive template sequences, such as a manually curated RepBase library, that may not exist for newly sequenced genomes.

RESULTS

We have developed a software tool called WindowMasker (WM) that identifies and masks highly repetitive DNA sequences in a genome, using only the sequence of the genome itself. WM is orders of magnitude faster than RM because WM uses a few linear-time scans of the genome sequence, rather than local alignment methods that compare each library sequence with each piece of the genome. We validate WM by comparing BLAST outputs from large sets of queries applied to two versions of the same genome, one masked by WM, and the other masked by RM. Even for genomes such as the human genome, where a good RepBase library is available, searching the database as masked with WM yields more matches that are apparently non-repetitive and fewer matches to repetitive sequences. We show that these results hold for transcribed regions as well. WM also performs well on genomes for which much of the sequence was in draft form at the time of the analysis.

AVAILABILITY

WM is included in the NCBI C++ toolkit. The source code for the entire toolkit is available at ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools++/CURRENT/. Once the toolkit source is unpacked, the instructions for building WindowMasker application in the UNIX environment can be found in file src/app/winmasker/README.build.

SUPPLEMENTARY INFORMATION

Supplementary data are available at ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/windowmasker/windowmasker_suppl.pdf
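The general idea of masking repeats from the genome sequence alone, without a repeat library, can be sketched in Python as two passes over the sequence: one to count k-mers and one to soft-mask (lowercase) every base covered by a frequent k-mer. This is a simplified illustration and not WindowMasker's actual scoring or windowing scheme; the function name, `k`, and `threshold` are hypothetical.

```python
from collections import Counter


def soft_mask(seq: str, k: int, threshold: int) -> str:
    """Lowercase every base covered by a k-mer seen >= threshold times."""
    seq = seq.upper()
    n = len(seq)

    # Pass 1: count all overlapping k-mers in the sequence itself.
    counts = Counter(seq[i:i + k] for i in range(n - k + 1))

    # Pass 2: flag the positions spanned by each frequent k-mer.
    masked = [False] * n
    for i in range(n - k + 1):
        if counts[seq[i:i + k]] >= threshold:
            for j in range(i, i + k):
                masked[j] = True

    return "".join(base.lower() if hit else base
                   for base, hit in zip(seq, masked))


if __name__ == "__main__":
    print(soft_mask("ACGTACGTTT", k=4, threshold=2))
```

Soft-masking keeps the original bases recoverable, which matters when a downstream search tool only needs to skip the masked regions during seeding.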

Darling ACE, Mau B, Blattner FR, Perna NT. Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res 2004;14:1394-403. [PMID: 15231754 PMCID: PMC442156 DOI: 10.1101/gr.2289704] [Citation(s) in RCA: 3508] [Impact Index Per Article: 167.0] [Indexed: 11/25/2022]

Mizuta S, Munakata H, Aimaiti A, Oosawa K, Shimizu T. Evaluation of the color-coding method for searching tandem repeats in prokaryotic genomes. Chem-Bio Informatics Journal 2004. [DOI: 10.1273/cbij.4.133] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.0] [Indexed: 12/27/2022]

Cannon SB, Kozik A, Chan B, Michelmore R, Young ND. DiagHunter and GenoPix2D: programs for genomic comparisons, large-scale homology discovery and visualization. Genome Biol 2003;4:R68. [PMID: 14519203 PMCID: PMC328457 DOI: 10.1186/gb-2003-4-10-r68] [Citation(s) in RCA: 37] [Impact Index Per Article: 1.7] [Received: 05/01/2003] [Revised: 06/19/2003] [Accepted: 08/08/2003] [Indexed: 11/10/2022]

Cannon SB, Young ND. OrthoParaMap: distinguishing orthologs from paralogs by integrating comparative genome data and gene phylogenies. BMC Bioinformatics 2003;4:35. [PMID: 12952558 PMCID: PMC200972 DOI: 10.1186/1471-2105-4-35] [Citation(s) in RCA: 37] [Impact Index Per Article: 1.7] [Received: 06/27/2003] [Accepted: 09/02/2003] [Indexed: 11/10/2022]

Current Awareness on Comparative and Functional Genomics. Comp Funct Genomics 2003. [PMCID: PMC2447368 DOI: 10.1002/cfg.229] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/04/2022]