1. Hu K, Ni P, Xu M, Zou Y, Chang J, Gao X, Li Y, Ruan J, Hu B, Wang J. HiTE: a fast and accurate dynamic boundary adjustment approach for full-length transposable element detection and annotation. Nat Commun 2024; 15:5573. [PMID: 38956036 PMCID: PMC11219922 DOI: 10.1038/s41467-024-49912-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/09/2023] [Accepted: 06/25/2024] [Indexed: 07/04/2024] Open
Abstract
Recent advancements in genome assembly have greatly improved the prospects for comprehensive annotation of Transposable Elements (TEs). However, existing methods for TE annotation using genome assemblies suffer from limited accuracy and robustness, requiring extensive manual editing. In addition, the currently available gold-standard TE databases are not comprehensive, even for extensively studied species, highlighting the critical need for an automated TE detection method to supplement existing repositories. In this study, we introduce HiTE, a fast and accurate dynamic boundary adjustment approach designed to detect full-length TEs. The experimental results demonstrate that HiTE outperforms RepeatModeler2, the state-of-the-art tool, across various species. Furthermore, HiTE has identified numerous novel transposons with well-defined structures containing protein-coding domains, some of which are directly inserted within crucial genes, leading to direct alterations in gene expression. A Nextflow version of HiTE is also available, with enhanced parallelism, reproducibility, and portability.
Affiliation(s)
- Kang Hu
  - School of Computer Science and Engineering, Central South University, Changsha, 410083, China
  - Xiangjiang Laboratory, Changsha, 410205, China
  - Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha, 410083, China
- Peng Ni
  - School of Computer Science and Engineering, Central South University, Changsha, 410083, China
  - Xiangjiang Laboratory, Changsha, 410205, China
  - Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha, 410083, China
- Minghua Xu
  - School of Computer Science and Engineering, Central South University, Changsha, 410083, China
  - Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha, 410083, China
- You Zou
  - School of Computer Science and Engineering, Central South University, Changsha, 410083, China
  - Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha, 410083, China
- Jianye Chang
  - Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, 518000, China
- Xin Gao
  - Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
  - Center of Excellence on Smart Health, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- Yaohang Li
  - Department of Computer Science, Old Dominion University, Norfolk, VA, 23529, USA
- Jue Ruan
  - Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, 518000, China
- Bin Hu
  - Key Laboratory of Brain Health Intelligent Evaluation and Intervention, Ministry of Education (Beijing Institute of Technology), Beijing, P. R. China
  - School of Medical Technology, Beijing Institute of Technology, Beijing, P. R. China
- Jianxin Wang
  - School of Computer Science and Engineering, Central South University, Changsha, 410083, China
  - Xiangjiang Laboratory, Changsha, 410205, China
  - Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha, 410083, China
2. Rudenko V, Korotkov E. Study of Dispersed Repeats in the Cyanidioschyzon merolae Genome. Int J Mol Sci 2024; 25:4441. [PMID: 38674025 PMCID: PMC11050394 DOI: 10.3390/ijms25084441] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/19/2024] [Revised: 04/08/2024] [Accepted: 04/15/2024] [Indexed: 04/28/2024] Open
Abstract
In this study, we applied the iterative procedure (IP) method to search for families of highly diverged dispersed repeats in the genome of Cyanidioschyzon merolae, which contains over 16 million bases. The algorithm included the construction of position weight matrices (PWMs) for repeat families and the identification of more dispersed repeats based on the PWMs using dynamic programming. The results showed that the C. merolae genome contained 20 repeat families comprising a total of 33,938 dispersed repeats, which is significantly more than has been previously found using other methods. The repeats varied in length from 108 to 600 bp (522.54 bp on average) and occupied more than 72% of the C. merolae genome, whereas previously identified repeats, including tandem repeats, have been shown to constitute only about 28%. The high genomic content of dispersed repeats and their location in the coding regions suggest a significant role in the regulation of the functional activity of the genome.
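As a rough illustration of the PWM idea mentioned in this abstract, the sketch below builds a log-odds position weight matrix from a few aligned sequences and scans a longer sequence for the best-scoring window. It is a simplified, ungapped scan and not the authors' iterative procedure or their dynamic-programming search; the function names, the uniform background of 0.25 per base, and the pseudocount are assumptions made for this example.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def build_pwm(aligned_seqs, pseudocount=1.0):
    """Build a log-odds PWM (vs. a uniform background) from equal-length sequences."""
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all sequences must have the same length")
    total = len(aligned_seqs) + pseudocount * len(ALPHABET)
    pwm = []
    for pos in range(length):
        counts = Counter(s[pos] for s in aligned_seqs)
        column = {}
        for base in ALPHABET:
            freq = (counts.get(base, 0) + pseudocount) / total
            column[base] = math.log2(freq / 0.25)  # assumed uniform background
        pwm.append(column)
    return pwm


def best_hit(pwm, sequence):
    """Return (score, start) of the best-scoring ungapped window in sequence."""
    width = len(pwm)
    best = (float("-inf"), -1)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Bases outside ACGT (e.g. N) contribute a neutral score of 0.
        score = sum(col.get(base, 0.0) for col, base in zip(pwm, window))
        if score > best[0]:
            best = (score, start)
    return best


pwm = build_pwm(["ACGT", "ACGT", "ACGA"])
score, start = best_hit(pwm, "TTTTACGTTTTT")
print(start, round(score, 3))
```

A search that tolerates insertions and deletions, as the abstract's dynamic programming implies, would replace the fixed-width window with an alignment of the sequence against the PWM columns.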
Affiliation(s)
- Valentina Rudenko
  - Institute of Bioengineering, Research Center of Biotechnology of the Russian Academy of Sciences, Moscow 119071, Russia
3. Yang L, Metzger GA, Padilla Del Valle R, Delgadillo Rubalcaba D, McLaughlin RN. Evolutionary insights from profiling LINE-1 activity at allelic resolution in a single human genome. EMBO J 2024; 43:112-131. [PMID: 38177314 PMCID: PMC10883270 DOI: 10.1038/s44318-023-00007-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/08/2023] [Revised: 10/18/2023] [Accepted: 11/10/2023] [Indexed: 01/06/2024] Open
Abstract
Transposable elements have created the majority of the sequence in many genomes. In mammals, LINE-1 retrotransposons have been expanding for more than 100 million years as distinct, consecutive lineages; however, the drivers of this recurrent lineage emergence and disappearance are unknown. Most human genome assemblies provide a record of this ancient evolution, but fail to resolve ongoing LINE-1 retrotranspositions. Utilizing the human CHM1 long-read-based haploid assembly, we identified and cloned all full-length, intact LINE-1s, and found 29 LINE-1s with measurable in vitro retrotransposition activity. Among individuals, these LINE-1s varied in their presence, their allelic sequences, and their activity. We found that recently retrotransposed LINE-1s tend to be active in vitro and polymorphic in the population relative to more ancient LINE-1s. However, some rare allelic forms of old LINE-1s retain activity, suggesting older lineages can persist longer than expected. Finally, in LINE-1s with in vitro activity and in vivo fitness, we identified mutations that may have increased replication in ancient genomes and may prove promising candidates for mechanistic investigations of the drivers of LINE-1 evolution and which LINE-1 sequences contribute to human disease.
Affiliation(s)
- Lei Yang
  - Pacific Northwest Research Institute, Seattle, WA, USA
- Ricky Padilla Del Valle
  - Pacific Northwest Research Institute, Seattle, WA, USA
  - Molecular and Cellular Biology Graduate Program, University of Washington, Seattle, WA, USA
- Richard N McLaughlin
  - Pacific Northwest Research Institute, Seattle, WA, USA
  - Molecular and Cellular Biology Graduate Program, University of Washington, Seattle, WA, USA
4. Zhao P, Gu L, Gao Y, Pan Z, Liu L, Li X, Zhou H, Yu D, Han X, Qian L, Liu GE, Fang L, Wang Z. Young SINEs in pig genomes impact gene regulation, genetic diversity, and complex traits. Commun Biol 2023; 6:894. [PMID: 37652983 PMCID: PMC10471783 DOI: 10.1038/s42003-023-05234-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/02/2022] [Accepted: 08/09/2023] [Indexed: 09/02/2023] Open
Abstract
Transposable elements (TEs) are a major source of genetic polymorphisms and play a role in chromatin architecture, gene regulatory networks, and genomic evolution. However, their functional role in pigs and contributions to complex traits are largely unknown. We created a catalog of TEs (n = 3,087,929) in pigs and found that young SINEs were predominantly silenced by histone modifications, DNA methylation, and decreased accessibility. However, some transcripts from active young SINEs showed high tissue-specificity, as confirmed by analyzing 3570 RNA-seq samples. We also detected 211,067 dimorphic SINEs in 374 individuals, including 340 population-specific ones associated with local adaptation. Mapping these dimorphic SINEs to genome-wide associations of 97 complex traits in pigs, we found 54 candidate genes (e.g., ANK2 and VRTN) that might be mediated by TEs. Our findings highlight the important roles of young SINEs and provide a supplement for genotype-to-phenotype associations and modern breeding in pigs.
Affiliation(s)
- Pengju Zhao
  - Hainan Institute, Zhejiang University, Yongyou Industry Park, Yazhou Bay Sci-Tech City, Sanya, 572000, China
  - College of Animal Sciences, Zhejiang University, Hangzhou, Zhejiang, 310058, China
- Lihong Gu
  - Institute of Animal Science & Veterinary Medicine, Hainan Academy of Agricultural Sciences, No. 14 Xingdan Road, Haikou, 571100, China
- Yahui Gao
  - Animal Genomics and Improvement Laboratory, Beltsville Agricultural Research Center, Agricultural Research Service, USDA, Beltsville, MD, 20705, USA
- Zhangyuan Pan
  - Department of Animal Science, University of California, Davis, CA, 95616, USA
- Lei Liu
  - Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, 518124, China
- Xingzheng Li
  - Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, 518124, China
- Huaijun Zhou
  - Department of Animal Science, University of California, Davis, CA, 95616, USA
- Dongyou Yu
  - Hainan Institute, Zhejiang University, Yongyou Industry Park, Yazhou Bay Sci-Tech City, Sanya, 572000, China
  - College of Animal Sciences, Zhejiang University, Hangzhou, Zhejiang, 310058, China
- Xinyan Han
  - Hainan Institute, Zhejiang University, Yongyou Industry Park, Yazhou Bay Sci-Tech City, Sanya, 572000, China
  - College of Animal Sciences, Zhejiang University, Hangzhou, Zhejiang, 310058, China
- Lichun Qian
  - Hainan Institute, Zhejiang University, Yongyou Industry Park, Yazhou Bay Sci-Tech City, Sanya, 572000, China
  - College of Animal Sciences, Zhejiang University, Hangzhou, Zhejiang, 310058, China
- George E Liu
  - Animal Genomics and Improvement Laboratory, Beltsville Agricultural Research Center, Agricultural Research Service, USDA, Beltsville, MD, 20705, USA
- Lingzhao Fang
  - Center for Quantitative Genetics and Genomics, Aarhus University, Aarhus, 8000, Denmark
- Zhengguang Wang
  - Hainan Institute, Zhejiang University, Yongyou Industry Park, Yazhou Bay Sci-Tech City, Sanya, 572000, China
  - College of Animal Sciences, Zhejiang University, Hangzhou, Zhejiang, 310058, China
5. Rodriguez M, Makałowski W. Software evaluation for de novo detection of transposons. Mob DNA 2022; 13:14. [PMID: 35477485 PMCID: PMC9047281 DOI: 10.1186/s13100-022-00266-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 12/18/2021] [Accepted: 03/16/2022] [Indexed: 11/16/2022] Open
Abstract
Transposable elements (TEs) are major genomic components in most eukaryotic genomes and play an important role in genome evolution. However, despite their relevance, the identification of TEs is not an easy task, and a number of tools have been developed to tackle this problem. To better understand how they perform, we tested several widely used tools for de novo TE detection and compared their performance on both simulated data and well-curated genomic sequences. As expected, tools that build TE models performed better than k-mer counting ones, with RepeatModeler beating competitors in most datasets. However, most tools tend to identify TE regions in a fragmented manner, and small or fragmented TEs frequently go undetected. Consequently, the identification of TEs is still a challenging endeavor that requires significant manual curation by an experienced expert. The results will be helpful for identifying common issues associated with TE annotation and for evaluating how comparable the results obtained with different tools are.
Affiliation(s)
- Matias Rodriguez
  - Institute of Bioinformatics, Faculty of Medicine, University of Münster, 48149, Münster, Germany
- Wojciech Makałowski
  - Institute of Bioinformatics, Faculty of Medicine, University of Münster, 48149, Münster, Germany
6. Storer JM, Hubley R, Rosen J, Smit AFA. Methodologies for the De novo Discovery of Transposable Element Families. Genes (Basel) 2022; 13:709. [PMID: 35456515 PMCID: PMC9025800 DOI: 10.3390/genes13040709] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 03/23/2022] [Revised: 04/14/2022] [Accepted: 04/15/2022] [Indexed: 02/07/2023] Open
Abstract
The discovery and characterization of transposable element (TE) families are crucial tasks in the process of genome annotation. Careful curation of TE libraries for each organism is necessary as each has been exposed to a unique and often complex set of TE families. De novo methods have been developed; however, a fully automated and accurate approach to the development of complete libraries remains elusive. In this review, we cover established methods and recent developments in de novo TE analysis. We also present various methodologies used to assess these tools and discuss opportunities for further advancement of the field.
Affiliation(s)
- Arian F. A. Smit
  - Institute for Systems Biology, Seattle, WA 98109, USA; (J.M.S.); (R.H.); (J.R.)
7. Finding and Characterizing Repeats in Plant Genomes. Methods Mol Biol 2022; 2443:327-385. [PMID: 35037215 DOI: 10.1007/978-1-0716-2067-0_18] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 01/01/2023]
Abstract
Plant genomes contain a particularly high proportion of repeated structures of various types. This chapter proposes a guided tour of the available software that can help biologists to scan automatically for these repeats in sequence data or check hypothetical models intended to characterize their structures. Since transposable elements (TEs) are a major source of repeats in plants, many methods have been used or developed for this broad class of sequences. They are representative of the range of tools available for other classes of repeats, and we have provided two sections on this topic (for the analysis of genomes or directly of sequenced reads), as well as a selection of the main existing software. It may be hard to keep up with the profusion of proposals in this dynamic field, and the rest of the chapter is devoted to the foundations of an efficient search for repeats and more complex patterns. We first introduce the key concepts of the art of indexing and mapping or querying sequences. We end the chapter with the more prospective issue of building models of repeat families. We first present the machine learning approach, which seeks to build predictors automatically for a TE family from a set of sequences known to belong to that family. A second approach, the linguistic (or syntactic) approach, allows biologists to describe models of their favorite repeat family themselves and to check the validity of those models.
8. Zeng C, Takeda A, Sekine K, Osato N, Fukunaga T, Hamada M. Bioinformatics Approaches for Determining the Functional Impact of Repetitive Elements on Non-coding RNAs. Methods Mol Biol 2022; 2509:315-340. [PMID: 35796972 DOI: 10.1007/978-1-0716-2380-0_19] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
With a large number of annotated non-coding RNAs (ncRNAs), repetitive sequences are found to constitute functional components (termed as repetitive elements) in ncRNAs that perform specific biological functions. Bioinformatics analysis is a powerful tool for improving our understanding of the role of repetitive elements in ncRNAs. This chapter summarizes recent findings that reveal the role of repetitive elements in ncRNAs. Furthermore, relevant bioinformatics approaches are systematically reviewed, which promises to provide valuable resources for studying the functional impact of repetitive elements on ncRNAs.
Affiliation(s)
- Chao Zeng
  - Faculty of Science and Engineering, Waseda University, Tokyo, Japan
  - AIST-Waseda University Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), Tokyo, Japan
- Atsushi Takeda
  - Faculty of Science and Engineering, Waseda University, Tokyo, Japan
- Kotaro Sekine
  - Faculty of Science and Engineering, Waseda University, Tokyo, Japan
- Naoki Osato
  - Faculty of Science and Engineering, Waseda University, Tokyo, Japan
- Tsukasa Fukunaga
  - Waseda Institute for Advanced Study, Waseda University, Tokyo, Japan
- Michiaki Hamada
  - Faculty of Science and Engineering, Waseda University, Tokyo, Japan
  - AIST-Waseda University Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), Tokyo, Japan
9. Feng C, Dai M, Liu Y, Chen M. Sequence repetitiveness quantification and de novo repeat detection by weighted k-mer coverage. Brief Bioinform 2020; 22:5855256. [PMID: 32591772 DOI: 10.1093/bib/bbaa086] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/23/2020] [Revised: 04/10/2020] [Accepted: 04/22/2020] [Indexed: 11/12/2022] Open
Abstract
DNA repeats are abundant in eukaryotic genomes and have been proved to play a vital role in genome evolution and regulation. A large number of approaches have been proposed to identify various repeats in the genome. Some de novo repeat identification tools can efficiently generate sequence repetitive scores based on k-mer counting for repeat detection. However, we noticed that these tools can still be improved in terms of repetitive score calculation, sensitivity to segmental duplications and detection specificity. Therefore, here, we present a new computational approach named Repeat Locator (RepLoc), which is based on weighted k-mer coverage to quantify the genome sequence repetitiveness and locate the repetitive sequences. According to the repetitiveness map of the human genome generated by RepLoc, we found that there may be relationships between sequence repetitiveness and genome structures. A comprehensive benchmark shows that RepLoc is a more efficient k-mer counting based tool for de novo repeat detection. The RepLoc software is freely available at http://bis.zju.edu.cn/reploc.
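To make the notion of a k-mer based repetitiveness score concrete, the sketch below assigns each base the mean genome-wide count of the k-mers that cover it. This is a plain, unweighted illustration and not RepLoc's weighted k-mer coverage, whose weighting scheme is not described in the abstract; the function name and the default k are assumptions made for this example.

```python
from collections import Counter


def repetitiveness(seq, k=3):
    """Per-base score: mean occurrence count of the k-mers covering each base."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    score_sum = [0.0] * len(seq)
    covering = [0] * len(seq)
    for i in range(len(seq) - k + 1):
        count = counts[seq[i:i + k]]
        for j in range(i, i + k):
            score_sum[j] += count
            covering[j] += 1
    # Sequences shorter than k have no covering k-mers and score 0.0 everywhere.
    return [s / n if n else 0.0 for s, n in zip(score_sum, covering)]


print(repetitiveness("ACGT", k=2))    # every 2-mer is unique
print(repetitiveness("AAAAAA", k=2))  # one 2-mer repeated throughout
```

Bases whose score stays near 1 lie in unique sequence, while higher scores mark positions covered by k-mers that recur elsewhere in the input.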
Affiliation(s)
- Cong Feng
  - Ming Chen's laboratory in Zhejiang University
- Min Dai
  - Key Laboratory of Genetic Network Biology, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences
- Ming Chen
  - Department of Bioinformatics, College of Life Sciences, Zhejiang University
10. Shortt JA, Ruggiero RP, Cox C, Wacholder AC, Pollock DD. Finding and extending ancient simple sequence repeat-derived regions in the human genome. Mob DNA 2020; 11:11. [PMID: 32095164 PMCID: PMC7027126 DOI: 10.1186/s13100-020-00206-y] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 11/04/2019] [Accepted: 02/04/2020] [Indexed: 12/19/2022] Open
Abstract
Background: Previously, 3% of the human genome has been annotated as simple sequence repeats (SSRs), similar to the proportion annotated as protein coding. The origin of much of the genome is not well annotated, however, and some of the unidentified regions are likely to be ancient SSR-derived regions not identified by current methods. The identification of these regions is complicated because SSRs appear to evolve through complex cycles of expansion and contraction, often interrupted by mutations that alter both the repeated motif and mutation rate. We applied an empirical, kmer-based approach to identify genome regions that are likely derived from SSRs.
Results: The sequences flanking annotated SSRs are enriched for similar sequences and for SSRs with similar motifs, suggesting that the evolutionary remains of SSR activity abound in regions near obvious SSRs. Using our previously described P-clouds approach, we identified ‘SSR-clouds’, groups of similar kmers (or ‘oligos’) that are enriched near a training set of unbroken SSR loci, and then used the SSR-clouds to detect likely SSR-derived regions throughout the genome.
Conclusions: Our analysis indicates that the amount of likely SSR-derived sequence in the human genome is 6.77%, over twice as much as previous estimates, including millions of newly identified ancient SSR-derived loci. SSR-clouds identified poly-A sequences adjacent to transposable element termini in over 74% of the oldest class of Alu (roughly, AluJ), validating the sensitivity of the approach. Poly-A’s annotated by SSR-clouds also had a length distribution that was more consistent with their poly-A origins, with mean about 35 bp even in older Alus. This work demonstrates that the high sensitivity provided by SSR-Clouds improves the detection of SSR-derived regions and will enable deeper analysis of how decaying repeats contribute to genome structure.
Affiliation(s)
- Jonathan A Shortt
  - Colorado Center for Personalized Medicine, University of Colorado School of Medicine, Aurora, CO 80045 USA
- Robert P Ruggiero
  - Department of Biology, Southeast Missouri State University, Cape Girardeau, MO 63701 USA
- Corey Cox
  - Colorado Center for Personalized Medicine, University of Colorado School of Medicine, Aurora, CO 80045 USA
- Aaron C Wacholder
  - Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15213 USA
- David D Pollock
  - Department of Biochemistry & Molecular Genetics, University of Colorado School of Medicine, Aurora, CO 80045 USA
11. Orozco-Arias S, Isaza G, Guyot R. Retrotransposons in Plant Genomes: Structure, Identification, and Classification through Bioinformatics and Machine Learning. Int J Mol Sci 2019; 20:E3837. [PMID: 31390781 PMCID: PMC6696364 DOI: 10.3390/ijms20153837] [Citation(s) in RCA: 34] [Impact Index Per Article: 6.8] [Received: 06/21/2019] [Revised: 07/31/2019] [Accepted: 08/02/2019] [Indexed: 01/26/2023] Open
Abstract
Transposable elements (TEs) are genomic units able to move within the genome of virtually all organisms. Due to their naturally high copy numbers and their high structural diversity, the identification and classification of TEs remain a challenge in sequenced genomes. Although TEs were initially regarded as "junk DNA", it has been demonstrated that they play key roles in chromosome structures, gene expression, and regulation, as well as adaptation and evolution. A highly reliable annotation of these elements is, therefore, crucial to better understand genome functions and their evolution. To date, much bioinformatics software has been developed to address TE detection and classification processes, but many problematic aspects remain, such as the reliability, precision, and speed of the analyses. Machine learning and deep learning are families of algorithms that can make automatic predictions and decisions in a wide variety of scientific applications. They have been tested in bioinformatics and, more specifically, for TE classification, with encouraging results. In this review, we will discuss important aspects of TEs, such as their structure, importance in the evolution and architecture of the host, and their current classifications and nomenclatures. We will also address current methods and their limitations in identifying and classifying TEs.
Affiliation(s)
- Simon Orozco-Arias
  - Department of Computer Science, Universidad Autónoma de Manizales, Manizales 170001, Colombia
  - Department of Systems and Informatics, Universidad de Caldas, Manizales 170001, Colombia
- Gustavo Isaza
  - Department of Systems and Informatics, Universidad de Caldas, Manizales 170001, Colombia
- Romain Guyot
  - Department of Electronics and Automatization, Universidad Autónoma de Manizales, Manizales 170001, Colombia
  - Institut de Recherche pour le Développement, CIRAD, University Montpellier, 34000 Montpellier, France
12. Li W, Freudenberg J, Freudenberg J. Alignment-free approaches for predicting novel Nuclear Mitochondrial Segments (NUMTs) in the human genome. Gene 2019; 691:141-152. [PMID: 30630097 DOI: 10.1016/j.gene.2018.12.040] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 09/05/2018] [Revised: 12/07/2018] [Accepted: 12/14/2018] [Indexed: 10/27/2022]
Abstract
The nuclear human genome harbors sequences of mitochondrial origin, indicating an ancestral transfer of DNA from the mitogenome. Several Nuclear Mitochondrial Segments (NUMTs) have been detected by alignment-based sequence similarity search, as implemented in the Basic Local Alignment Search Tool (BLAST). Identifying NUMTs is important for the comprehensive annotation and understanding of the human genome. Here we explore the possibility of detecting NUMTs in the human genome by alignment-free sequence similarity search, such as k-mers (k-tuples, k-grams, oligos of length k) distributions. We find that when k=6 or larger, the k-mer approach and BLAST search produce almost identical results, e.g., detect the same set of NUMTs longer than 3 kb. However, when k=5 or k=4, certain signals are only detected by the alignment-free approach, and these may indicate yet unrecognized, and potentially more ancestral NUMTs. We introduce a "Manhattan plot" style representation of NUMT predictions across the genome, which are calculated based on the reciprocal of the Jensen-Shannon divergence between the nuclear and mitochondrial k-mer frequencies. The further inspection of the k-mer-based NUMT predictions however shows that most of them contain long-terminal-repeat (LTR) annotations, whereas BLAST-based NUMT predictions do not. Thus, similarity of the mitogenome to LTR sequences is recognized, which we validate by finding the mitochondrial k-mer distribution closer to those for transposable sequences and specifically, close to some types of LTR.
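The alignment-free score described in this abstract, the reciprocal of the Jensen-Shannon divergence between two k-mer frequency distributions, can be sketched as follows. This is an illustrative reading of the abstract rather than the authors' code: the function names are hypothetical, the divergence is computed in bits, and the small `eps` added to avoid division by zero is an assumption of this example.

```python
import math
from collections import Counter
from itertools import product


def kmer_freqs(seq, k):
    """Relative frequencies of all 4**k DNA k-mers over overlapping windows."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    if total == 0:
        raise ValueError("sequence contains no valid k-mers")
    return [counts[m] / total for m in kmers]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    mix = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the Kullback-Leibler sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


def similarity_score(window, reference, k=4, eps=1e-9):
    """Reciprocal JS divergence; eps (an assumption) guards against 1/0."""
    d = js_divergence(kmer_freqs(window, k), kmer_freqs(reference, k))
    return 1.0 / (d + eps)


same = similarity_score("ACGTACGTACGT", "ACGTACGTACGT", k=2)
diff = similarity_score("ACGTACGTACGT", "AAAAAAAAAAAA", k=2)
print(same > diff)
```

Computing this score for successive nuclear windows against the mitochondrial k-mer distribution yields one value per genomic position, which is the kind of profile a "Manhattan plot" style figure would display.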
Affiliation(s)
- Wentian Li
  - The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, Northwell Health, Manhasset, NY, USA
- Jerome Freudenberg
  - The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, Northwell Health, Manhasset, NY, USA
- Jan Freudenberg
  - Regeneron Genetics Center, Regeneron Pharmaceuticals, Inc., Tarrytown, NY, USA
13. Metsky HC, Siddle KJ, Gladden-Young A, Qu J, Yang DK, Brehio P, Goldfarb A, Piantadosi A, Wohl S, Carter A, Lin AE, Barnes KG, Tully DC, Corleis B, Hennigan S, Barbosa-Lima G, Vieira YR, Paul LM, Tan AL, Garcia KF, Parham LA, Odia I, Eromon P, Folarin OA, Goba A, Simon-Lorière E, Hensley L, Balmaseda A, Harris E, Kwon DS, Allen TM, Runstadler JA, Smole S, Bozza FA, Souza TML, Isern S, Michael SF, Lorenzana I, Gehrke L, Bosch I, Ebel G, Grant DS, Happi CT, Park DJ, Gnirke A, Sabeti PC, Matranga CB. Capturing sequence diversity in metagenomes with comprehensive and scalable probe design. Nat Biotechnol 2019; 37:160-168. [PMID: 30718881 PMCID: PMC6587591 DOI: 10.1038/s41587-018-0006-x] [Citation(s) in RCA: 75] [Impact Index Per Article: 15.0] [Received: 03/15/2018] [Accepted: 12/18/2018] [Indexed: 01/24/2023]
Abstract
Metagenomic sequencing has the potential to transform microbial detection and characterization, but new tools are needed to improve its sensitivity. Here we present CATCH, a computational method to enhance nucleic acid capture for enrichment of diverse microbial taxa. CATCH designs optimal probe sets, with a specified number of oligonucleotides, that achieve full coverage of, and scale well with, known sequence diversity. We focus on applying CATCH to capture viral genomes in complex metagenomic samples. We design, synthesize, and validate multiple probe sets, including one that targets the whole genomes of the 356 viral species known to infect humans. Capture with these probe sets enriches unique viral content on average 18-fold, allowing us to assemble genomes that could not be recovered without enrichment, and accurately preserves within-sample diversity. We also use these probe sets to recover genomes from the 2018 Lassa fever outbreak in Nigeria and to improve detection of uncharacterized viral infections in human and mosquito samples. The results demonstrate that CATCH enables more sensitive and cost-effective metagenomic sequencing.
Affiliation(s)
- Hayden C. Metsky
- Broad Institute of MIT and Harvard, Cambridge, MA USA
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA USA
- Katherine J. Siddle
- Broad Institute of MIT and Harvard, Cambridge, MA USA
- Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA USA
- James Qu
- Broad Institute of MIT and Harvard, Cambridge, MA USA
- David K. Yang
- Broad Institute of MIT and Harvard, Cambridge, MA USA
- Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA USA
- Patrick Brehio
- Broad Institute of MIT and Harvard, Cambridge, MA USA
- Andrew Goldfarb
- Faculty of Arts and Sciences, Harvard University, Cambridge, MA USA
- Anne Piantadosi
- Broad Institute of MIT and Harvard, Cambridge, MA USA
- Division of Infectious Diseases, Massachusetts General Hospital, Boston, MA USA
- Shirlee Wohl
- Broad Institute of MIT and Harvard, Cambridge, MA USA
- Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA USA
- Amber Carter
- Broad Institute of MIT and Harvard, Cambridge, MA USA
- Aaron E. Lin
- Broad Institute of MIT and Harvard, Cambridge, MA USA
- Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA USA
- Kayla G. Barnes
- Broad Institute of MIT and Harvard, Cambridge, MA USA
- Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA USA
- Department of Immunology and Infectious Diseases, Harvard T.H. Chan School of Public Health, Harvard University, Boston, MA USA
- Damien C. Tully
- The Ragon Institute of MGH, MIT and Harvard, Cambridge, MA USA
- Björn Corleis
- The Ragon Institute of MGH, MIT and Harvard, Cambridge, MA USA
- Scott Hennigan
- Massachusetts Department of Public Health, Boston, MA USA
- Giselle Barbosa-Lima
- Fundação Oswaldo Cruz (FIOCRUZ), Rio de Janeiro, Rio de Janeiro, Brazil
- Yasmine R. Vieira
- Fundação Oswaldo Cruz (FIOCRUZ), Rio de Janeiro, Rio de Janeiro, Brazil
- Lauren M. Paul
- Department of Biological Sciences, College of Arts and Sciences, Florida Gulf Coast University, Fort Myers, FL USA
- Amanda L. Tan
- Department of Biological Sciences, College of Arts and Sciences, Florida Gulf Coast University, Fort Myers, FL USA
- Kimberly F. Garcia
- Instituto de Investigacion en Microbiologia, Universidad Nacional Autónoma de Honduras, Tegucigalpa, Honduras
- Leda A. Parham
- Instituto de Investigacion en Microbiologia, Universidad Nacional Autónoma de Honduras, Tegucigalpa, Honduras
- Ikponmwosa Odia
- Institute of Lassa Fever Research and Control, Irrua Specialist Teaching Hospital, Irrua, Nigeria
- Philomena Eromon
- African Center of Excellence for Genomics of Infectious Disease (ACEGID), Redeemer’s University, Ede, Nigeria
- Onikepe A. Folarin
- African Center of Excellence for Genomics of Infectious Disease (ACEGID), Redeemer’s University, Ede, Nigeria
- Department of Biological Sciences, College of Natural Sciences, Redeemer’s University, Ede, Nigeria
- Augustine Goba
- Lassa Fever Laboratory, Kenema Government Hospital, Kenema, Sierra Leone
- Etienne Simon-Lorière
- Evolutionary Genomics of RNA Viruses, Virology Department, Institut Pasteur, Paris, France
- Lisa Hensley
- Integrated Research Facility, Division of Clinical Research, National Institute of Allergy and Infectious Diseases, US National Institutes of Health, Frederick, MD USA
- Angel Balmaseda
- Laboratorio Nacional de Virología, Centro Nacional de Diagnóstico y Referencia, Ministry of Health, Managua, Nicaragua
- Eva Harris
- Division of Infectious Diseases and Vaccinology, School of Public Health, University of California, Berkeley, Berkeley, CA USA
- Douglas S. Kwon
- Division of Infectious Diseases, Massachusetts General Hospital, Boston, MA USA
- The Ragon Institute of MGH, MIT and Harvard, Cambridge, MA USA
- Todd M. Allen
- The Ragon Institute of MGH, MIT and Harvard, Cambridge, MA USA
- Jonathan A. Runstadler
- Department of Infectious Disease and Global Health, Cummings School of Veterinary Medicine, Tufts University, North Grafton, MA USA
- Sandra Smole
- Massachusetts Department of Public Health, Boston, MA USA
- Fernando A. Bozza
- Fundação Oswaldo Cruz (FIOCRUZ), Rio de Janeiro, Rio de Janeiro, Brazil
- Thiago M. L. Souza
- Fundação Oswaldo Cruz (FIOCRUZ), Rio de Janeiro, Rio de Janeiro, Brazil
- Sharon Isern
- Department of Biological Sciences, College of Arts and Sciences, Florida Gulf Coast University, Fort Myers, FL USA
- Scott F. Michael
- Department of Biological Sciences, College of Arts and Sciences, Florida Gulf Coast University, Fort Myers, FL USA
- Ivette Lorenzana
- Instituto de Investigacion en Microbiologia, Universidad Nacional Autónoma de Honduras, Tegucigalpa, Honduras
- Lee Gehrke
- Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, MA USA
- Department of Microbiology and Immunobiology, Harvard Medical School, Boston, MA USA
- Irene Bosch
- Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, MA USA
- Gregory Ebel
- Department of Microbiology, Immunology and Pathology, Colorado State University, Fort Collins, CO USA
- Donald S. Grant
- Lassa Fever Laboratory, Kenema Government Hospital, Kenema, Sierra Leone
- College of Medicine and Allied Health Sciences, University of Sierra Leone, Freetown, Sierra Leone
- Christian T. Happi
- Department of Immunology and Infectious Diseases, Harvard T.H. Chan School of Public Health, Harvard University, Boston, MA USA
- Institute of Lassa Fever Research and Control, Irrua Specialist Teaching Hospital, Irrua, Nigeria
- African Center of Excellence for Genomics of Infectious Disease (ACEGID), Redeemer’s University, Ede, Nigeria
- Department of Biological Sciences, College of Natural Sciences, Redeemer’s University, Ede, Nigeria
- Daniel J. Park
- Broad Institute of MIT and Harvard, Cambridge, MA USA
- Andreas Gnirke
- Broad Institute of MIT and Harvard, Cambridge, MA USA
- Pardis C. Sabeti
- Broad Institute of MIT and Harvard, Cambridge, MA USA
- Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA USA
- Department of Immunology and Infectious Diseases, Harvard T.H. Chan School of Public Health, Harvard University, Boston, MA USA
- Howard Hughes Medical Institute, Chevy Chase, MD USA
14
Guizard S, Piégu B, Arensburger P, Guillou F, Bigot Y. Deep landscape update of dispersed and tandem repeats in the genome model of the red jungle fowl, Gallus gallus, using a series of de novo investigating tools. BMC Genomics 2016; 17:659. [PMID: 27542599 PMCID: PMC4992247 DOI: 10.1186/s12864-016-3015-5] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Received: 03/28/2016] [Accepted: 08/12/2016] [Indexed: 01/19/2023] Open
Abstract
BACKGROUND: The program RepeatMasker and the database Repbase-ISB are part of the most widely used strategy for annotating repeats in animal genomes. They have been used to show that avian genomes have a lower repeat content (8-12 %) than the sequenced genomes of many vertebrate species (30-55 %). However, the efficiency of such library-based strategies depends on the quality and completeness of the sequences in the database that is used. An alternative to these library-based methods is to identify repeats de novo. These alternative methods have existed for at least a decade and may be more powerful than the library-based methods. We have used an annotation strategy involving several complementary de novo tools to determine the repeat content of the model genome galGal4 (1.04 Gbp), including identifying simple sequence repeats (SSRs), tandem repeats and transposable elements (TEs). RESULTS: We annotated over one Gbp of the galGal4 genome and showed that it is composed of approximately 19 % SSR and TE repeats. Furthermore, we estimate that the actual genome of the red jungle fowl contains about 31-35 % repeats. We find that library-based methods tend to overestimate TE diversity. These results have a major impact on the current understanding of repeat distributions throughout chromosomes in the red jungle fowl. CONCLUSIONS: Our results are a proof of concept of the reliability of using de novo tools to annotate repeats in large animal genomes. They have also revealed issues that will need to be resolved in order to develop gold-standard methodologies for annotating repeats in eukaryote genomes.
Affiliation(s)
- Sébastien Guizard
- Physiologie de la Reproduction et des Comportements, UMR INRA-CNRS 7247, PRC, 37380 Nouzilly, France
- Benoît Piégu
- Physiologie de la Reproduction et des Comportements, UMR INRA-CNRS 7247, PRC, 37380 Nouzilly, France
- Peter Arensburger
- Physiologie de la Reproduction et des Comportements, UMR INRA-CNRS 7247, PRC, 37380 Nouzilly, France
- Biological Sciences Department, California State Polytechnic University, Pomona, CA 91768 USA
- Florian Guillou
- Physiologie de la Reproduction et des Comportements, UMR INRA-CNRS 7247, PRC, 37380 Nouzilly, France
- Yves Bigot
- Physiologie de la Reproduction et des Comportements, UMR INRA-CNRS 7247, PRC, 37380 Nouzilly, France
15
Maumus F, Quesneville H. Impact and insights from ancient repetitive elements in plant genomes. Current Opinion in Plant Biology 2016; 30:41-6. [PMID: 26874965 DOI: 10.1016/j.pbi.2016.01.003] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Received: 11/10/2015] [Revised: 01/04/2016] [Accepted: 01/17/2016] [Indexed: 05/13/2023]
Abstract
Transposable elements and other repeated sequences are predominant contributors to most plant genomes. The vast majority of repeated elements accumulate mutations to the extent of becoming anonymous sequences, also known as 'genomic dark matter', which is thought to contribute significantly to the composition of plant genomes. This review aims to highlight recent methods and analyses suggesting that ancient repeats have profound effects on plant genome biology.
Affiliation(s)
- Florian Maumus
- INRA, UR1164 URGI-Research Unit in Genomics-Info, INRA de Versailles-Grignon, Route de Saint-Cyr, Versailles 78026, France
- Hadi Quesneville
- INRA, UR1164 URGI-Research Unit in Genomics-Info, INRA de Versailles-Grignon, Route de Saint-Cyr, Versailles 78026, France
16
Abstract
Plant genomes contain a particularly high proportion of repeated structures of various types. This chapter proposes a guided tour of available software that can help biologists to look for these repeats and check some hypothetical models intended to characterize their structures. Since transposable elements are a major source of repeats in plants, many methods have been used or developed for this large class of sequences. They are representative of the range of tools available for other classes of repeats and we have provided a whole section on this topic as well as a selection of the main existing software. In order to better understand how they work and how repeats may be efficiently found in genomes, it is necessary to look at the technical issues involved in the large-scale search of these structures. Indeed, it may be hard to keep up with the profusion of proposals in this dynamic field and the rest of the chapter is devoted to the foundations of the search for repeats and more complex patterns. The second section introduces the key concepts that are useful for understanding the current state of the art in playing with words, applied to genomic sequences. This can be seen as the first stage of a very general approach called linguistic analysis that is interested in the analysis of natural or artificial texts. Words, the lexical level, correspond to simple repeated entities in texts or strings. In fact, biologists need to represent more complex entities where a repeat family is built on more abstract structures, including direct or inverted small repeats, motifs, composition constraints as well as ordering and distance constraints between these elementary blocks. In terms of linguistics, this corresponds to the syntactic level of a language. The last section introduces concepts and practical tools that can be used to reach this syntactic level in biological sequence analysis.
Affiliation(s)
- Jacques Nicolas
- Dyliss Team, Irisa/Inria Centre de Rennes Bretagne Atlantique, Campus de Beaulieu, 35510, Rennes cedex, France
- Pierre Peterlongo
- Irisa/Inria Centre de Rennes Bretagne Atlantique, Campus de Beaulieu, 35510, Rennes cedex, France
- Sébastien Tempel
- LCB, CNRS UMR 7283, 31 Chemin Joseph Aiguier, 13402, Marseille cedex 20, France
17
Bast J, Schaefer I, Schwander T, Maraun M, Scheu S, Kraaijeveld K. No Accumulation of Transposable Elements in Asexual Arthropods. Mol Biol Evol 2015; 33:697-706. [PMID: 26560353 PMCID: PMC4760076 DOI: 10.1093/molbev/msv261] [Citation(s) in RCA: 53] [Impact Index Per Article: 5.9] [Indexed: 12/15/2022] Open
Abstract
Transposable elements (TEs) and other repetitive DNA can accumulate in the absence of recombination, a process contributing to the degeneration of Y-chromosomes and other nonrecombining genome portions. A similar accumulation of repetitive DNA is expected for asexually reproducing species, given their entire genome is effectively nonrecombining. We tested this expectation by comparing the whole-genome TE loads of five asexual arthropod lineages and their sexual relatives, including asexual and sexual lineages of crustaceans (Daphnia water fleas), insects (Leptopilina wasps), and mites (Oribatida). Surprisingly, there was no evidence for increased TE load in genomes of asexual as compared to sexual lineages, neither for all classes of repetitive elements combined nor for specific TE families. Our study therefore suggests that nonrecombining genomes do not accumulate TEs like nonrecombining genomic regions of sexual lineages. Even if a slight but undetected increase of TEs were caused by asexual reproduction, it appears to be negligible compared to variance between species caused by processes unrelated to reproductive mode. It remains to be determined if molecular mechanisms underlying genome regulation in asexuals hamper TE activity. Alternatively, the differences in TE dynamics between nonrecombining genomes in asexual lineages versus nonrecombining genome portions in sexual species might stem from selection for benign TEs in asexual lineages because of the lack of genetic conflict between TEs and their hosts and/or because asexual lineages may only arise from sexual ancestors with particularly low TE loads.
Affiliation(s)
- Jens Bast
- Department of Ecology and Evolution, University of Lausanne, Lausanne, Switzerland
- Ina Schaefer
- J.F. Blumenbach Institute of Zoology and Anthropology, Georg August University Goettingen, Goettingen, Germany
- Tanja Schwander
- Department of Ecology and Evolution, University of Lausanne, Lausanne, Switzerland
- Mark Maraun
- J.F. Blumenbach Institute of Zoology and Anthropology, Georg August University Goettingen, Goettingen, Germany
- Stefan Scheu
- J.F. Blumenbach Institute of Zoology and Anthropology, Georg August University Goettingen, Goettingen, Germany
- Ken Kraaijeveld
- Department of Ecological Science, VU University Amsterdam, Amsterdam, The Netherlands
- Leiden Genome Technology Center, Department of Human genetics, Leiden University Medical Center, Leiden, The Netherlands
18
Sun C, Mueller RL. Hellbender genome sequences shed light on genomic expansion at the base of crown salamanders. Genome Biol Evol 2015; 6:1818-29. [PMID: 25115007 PMCID: PMC4122941 DOI: 10.1093/gbe/evu143] [Citation(s) in RCA: 48] [Impact Index Per Article: 5.3] [Indexed: 11/12/2022] Open
Abstract
Among animals, genome sizes range from 20 Mb to 130 Gb, with 380-fold variation across vertebrates. Most of the largest vertebrate genomes are found in salamanders, an amphibian clade of 660 species. Thus, salamanders are an important system for studying causes and consequences of genomic gigantism. Previously, we showed that plethodontid salamander genomes accumulate higher levels of long terminal repeat (LTR) retrotransposons than do other vertebrates, although the evolutionary origins of such sequences remained unexplored. We also showed that some salamanders in the family Plethodontidae have relatively slow rates of DNA loss through small insertions and deletions. Here, we present new data from Cryptobranchus alleganiensis, the hellbender. Cryptobranchus and Plethodontidae span the basal phylogenetic split within salamanders; thus, analyses incorporating these taxa can shed light on the genome of the ancestral crown salamander lineage, which underwent expansion. We show that high levels of LTR retrotransposons likely characterize all crown salamanders, suggesting that disproportionate expansion of this transposable element (TE) class contributed to genomic expansion. Phylogenetic and age distribution analyses of salamander LTR retrotransposons indicate that salamanders' high TE levels reflect persistence and diversification of ancestral TEs rather than horizontal transfer events. Finally, we show that relatively slow DNA loss rates through small indels likely characterize all crown salamanders, suggesting that a decreased DNA loss rate contributed to genomic expansion at the clade's base. Our identification of shared genomic features across phylogenetically distant salamanders is a first step toward identifying the evolutionary processes underlying accumulation and persistence of high levels of repetitive sequence in salamander genomes.
19
Maumus F, Fiston-Lavier AS, Quesneville H. Impact of transposable elements on insect genomes and biology. Current Opinion in Insect Science 2015; 7:30-36. [PMID: 32846669 DOI: 10.1016/j.cois.2015.01.001] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.6] [Received: 12/03/2014] [Revised: 12/30/2014] [Accepted: 01/06/2015] [Indexed: 06/11/2023]
Affiliation(s)
- Florian Maumus
- Unité de recherche en Génomique-Info (URGI), UR1164, INRA, RD10 route de Saint Cyr, 78026 Versailles, France
- Anna-Sophie Fiston-Lavier
- Institut des Sciences de l'Evolution de Montpellier (ISEM), UMR5554 CNRS-Université Montpellier II, 2 place Eugene Bataillon, bat. 22, CC065 34095 Montpellier Cedex 05, France
- Hadi Quesneville
- Unité de recherche en Génomique-Info (URGI), UR1164, INRA, RD10 route de Saint Cyr, 78026 Versailles, France
20
Inference of transposable element ancestry. PLoS Genet 2014; 10:e1004482. [PMID: 25121584 PMCID: PMC4133154 DOI: 10.1371/journal.pgen.1004482] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Received: 11/11/2013] [Accepted: 05/16/2014] [Indexed: 01/11/2023] Open
Abstract
Most common methods for inferring transposable element (TE) evolutionary relationships are based on dividing TEs into subfamilies using shared diagnostic nucleotides. Although originally justified based on the “master gene” model of TE evolution, computational and experimental work indicates that many of the subfamilies generated by these methods contain multiple source elements. This implies that subfamily-based methods give an incomplete picture of TE relationships. Studies on selection, functional exaptation, and predictions of horizontal transfer may all be affected. Here, we develop a Bayesian method for inferring TE ancestry that gives the probability that each sequence was replicative, its frequency of replication, and the probability that each extant TE sequence came from each possible ancestral sequence. Applying our method to 986 members of the newly-discovered LAVA family of TEs, we show that there were far more source elements in the history of LAVA expansion than subfamilies identified using the CoSeg subfamily-classification program. We also identify multiple replicative elements in the AluSc subfamily in humans. Our results strongly indicate that a reassessment of subfamily structures is necessary to obtain accurate estimates of mutation processes, phylogenetic relationships and historical times of activity.
The most common entities in vertebrate genomes are transposable elements (TEs), DNA sequences that have been repeatedly copied and inserted into new locations throughout the genome. Some TEs have been replicated hundreds of thousands of times, and their ecology and evolutionary history within a genome are thus critical to understanding how genome structure evolves. It was once thought that only a few “master gene” copies could replicate, while the rest were inactive (dead on arrival), but recent computational and laboratory studies have indicated that this is not the case. However, previous methods for reconstructing TE evolutionary history were not designed to solve the problem of determining the ancestral source sequence for large numbers of elements. Here, we present a new method that is. Our method surveys all likely TE ancestors and determines the probability that each modern element arose from each of its plausible ancestors. We applied our method to the gibbon-derived LAVA TE family and to the human AluSc subfamily and inferred many more source elements than indicated by previous methods. This new method will help us better understand TE evolution, including both the impact of sequence on replication and the substitution process after replication.
21
Deep investigation of Arabidopsis thaliana junk DNA reveals a continuum between repetitive elements and genomic dark matter. PLoS One 2014; 9:e94101. [PMID: 24709859 PMCID: PMC3978025 DOI: 10.1371/journal.pone.0094101] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.8] [Received: 01/02/2014] [Accepted: 03/10/2014] [Indexed: 11/19/2022] Open
Abstract
Eukaryotic genomes contain highly variable amounts of DNA with no apparent function. This so-called junk DNA is composed of two components: repeated and repeat-derived sequences (together referred to as the repeatome), and non-annotated sequences also known as genomic dark matter. Because of their high duplication rates as compared to other genomic features, transposable elements are predominant contributors to the repeatome, and the products of their decay are thought to be a major source of genomic dark matter. Determining the origin and composition of junk DNA is thus important for understanding genome evolution as well as host biology. In this study, we have used a combination of tools to show that the repeatome of the small and shrinking A. thaliana genome is significantly larger than previously thought. Furthermore, we present the concepts and results from a series of innovative approaches suggesting that a significant amount of the A. thaliana dark matter is of repetitive origin. As a tentative standard for the community, we propose a deep compendium annotation of the A. thaliana repeatome that may help to further address genome evolution as well as transcriptional and epigenetic regulation in this model plant.
22
The Burmese python genome reveals the molecular basis for extreme adaptation in snakes. Proc Natl Acad Sci U S A 2013; 110:20645-50. [PMID: 24297902 DOI: 10.1073/pnas.1314475110] [Citation(s) in RCA: 203] [Impact Index Per Article: 18.5] [Indexed: 11/18/2022] Open
Abstract
Snakes possess many extreme morphological and physiological adaptations. Identification of the molecular basis of these traits can provide novel understanding for vertebrate biology and medicine. Here, we study snake biology using the genome sequence of the Burmese python (Python molurus bivittatus), a model of extreme physiological and metabolic adaptation. We compare the python and king cobra genomes along with genomic samples from other snakes and perform transcriptome analysis to gain insights into the extreme phenotypes of the python. We discovered rapid and massive transcriptional responses in multiple organ systems that occur on feeding and coordinate major changes in organ size and function. Intriguingly, the homologs of these genes in humans are associated with metabolism, development, and pathology. We also found that many snake metabolic genes have undergone positive selection, which together with the rapid evolution of mitochondrial proteins, provides evidence for extensive adaptive redesign of snake metabolic pathways. Additional evidence for molecular adaptation and gene family expansions and contractions is associated with major physiological and phenotypic adaptations in snakes; genes involved are related to cell cycle, development, lungs, eyes, heart, intestine, and skeletal structure, including GRB2-associated binding protein 1, SSH, WNT16, and bone morphogenetic protein 7. Finally, changes in repetitive DNA content, guanine-cytosine isochore structure, and nucleotide substitution rates indicate major shifts in the structure and evolution of snake genomes compared with other amniotes. Phenotypic and physiological novelty in snakes seems to be driven by system-wide coordination of protein adaptation, gene expression, and changes in the structure of the genome.
23
Fernandez-Silva I, Whitney J, Wainwright B, Andrews KR, Ylitalo-Ward H, Bowen BW, Toonen RJ, Goetze E, Karl SA. Microsatellites for next-generation ecologists: a post-sequencing bioinformatics pipeline. PLoS One 2013; 8:e55990. [PMID: 23424642 PMCID: PMC3570555 DOI: 10.1371/journal.pone.0055990] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.6] [Received: 08/26/2012] [Accepted: 01/04/2013] [Indexed: 11/18/2022] Open
Abstract
Microsatellites are the markers of choice for a variety of population genetic studies. The recent advent of next-generation pyrosequencing has drastically accelerated microsatellite locus discovery by providing a greater number of DNA sequencing reads at lower cost than other techniques. However, laboratory testing of PCR primers targeting potential microsatellite markers remains time-consuming and costly. Here we show how to reduce this workload by screening microsatellite loci via bioinformatic analyses prior to primer design. Our method emphasizes the importance of sequence quality, and we avoid loci associated with repetitive elements by screening with repetitive sequence databases available for a growing number of taxa. Testing with the Yellowstripe Goatfish Mulloidichthys flavolineatus and the marine planktonic copepod Pleuromamma xiphias, we show a higher success rate for primers selected by our pipeline than for previous in silico microsatellite detection methodologies. Following the same pipeline, we discover and select microsatellite loci in nine additional species including fishes, sea stars, copepods and octopuses.
Affiliation(s)
- Iria Fernandez-Silva
- Hawai'i Institute of Marine Biology, University of Hawai'i, Kāne'ohe, Hawai'i, United States of America
24
Abstract
The availability of a large amount of genomic sequences has provided unique opportunities for understanding the composition and dynamics of transposable elements (TEs) in plants. As the cost of sequencing declines, the genomic sequences of most crop plants will be available within the next few years. Thus, the annotation of genomic sequences, rather than sequence availability, will become the "bottleneck" for genome study. Since TEs are the largest component of most plant genomes, the automation of TE identification and classification is essential for future genome annotation as well as characterization of TEs. In this chapter, the functions and mechanisms of different repeat finding tools are reviewed, with a focus on de novo repeat identification programs. In addition, this chapter covers the further processing of results from de novo identification programs and the construction of repeat libraries for downstream genome analyses.
Affiliation(s)
- Ning Jiang
- Department of Horticulture, Michigan State University, East Lansing, MI, USA
25
Xu HE, Zhang HH, Han MJ, Shen YH, Huang XZ, Xiang ZH, Zhang Z. [Computational approaches for identification and classification of transposable elements in eukaryotic genomes]. Yi Chuan = Hereditas 2012; 34:1009-1019. [PMID: 22917906 DOI: 10.3724/sp.j.1005.2012.01009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/01/2023]
Abstract
Repetitive sequences (repeats) represent a significant fraction of the eukaryotic genomes and can be divided into tandem repeats, segmental duplications, and interspersed repeats on the basis of their sequence characteristics and how they are formed. Most interspersed repeats are derived from transposable elements (TEs). Eukaryotic TEs have been subdivided into two major classes according to the intermediate they use to move. The transposition and amplification of TEs have a great impact on the evolution of genes and the stability of genomes. However, identification and classification of TEs are complex and difficult due to the fact that their structure and classification are complex and diverse compared with those of other types of repeats. Here, we briefly introduced the function and classification of TEs, and summarized three different steps for identification, classification and annotation of TEs in eukaryotic genomes: (1) assembly of a repeat library, (2) repeat correction and classification, and (3) genome annotation. The existing computational approaches for each step were summarized and the advantages and disadvantages of the approaches were also highlighted in this review. To accurately identify, classify, and annotate the TEs in eukaryotic genomes requires combined methods. This review provides useful information for biologists who are not familiar with these approaches to find their way through the forest of programs.
Affiliation(s)
- Hong-En Xu
- The Institute of Sericulture and Systems Biology, Southwest University, Chongqing, China.
|
26
|
Janicki M, Rooke R, Yang G. Bioinformatics and genomic analysis of transposable elements in eukaryotic genomes. Chromosome Res 2012; 19:787-808. [PMID: 21850457 DOI: 10.1007/s10577-011-9230-7] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.8] [Indexed: 12/18/2022]
Abstract
A major portion of most eukaryotic genomes are transposable elements (TEs). During evolution, TEs have introduced profound changes to genome size, structure, and function. As integral parts of genomes, the dynamic presence of TEs will continue to be a major force in reshaping genomes. Early computational analyses of TEs in genome sequences focused on filtering out "junk" sequences to facilitate gene annotation. When the high abundance and diversity of TEs in eukaryotic genomes were recognized, these early efforts transformed into the systematic genome-wide categorization and classification of TEs. The availability of genomic sequence data reversed the classical genetic approaches to discovering new TE families and superfamilies. Curated TE databases and their accurate annotation of genome sequences in turn facilitated the studies on TEs in a number of frontiers including: (1) TE-mediated changes of genome size and structure, (2) the influence of TEs on genome and gene functions, (3) TE regulation by host, (4) the evolution of TEs and their population dynamics, and (5) genomic scale studies of TE activity. Bioinformatics and genomic approaches have become an integral part of large-scale studies on TEs to extract information with pure in silico analyses or to assist wet lab experimental studies. The current revolution in genome sequencing technology facilitates further progress in the existing frontiers of research and emergence of new initiatives. The rapid generation of large-sequence datasets at record low costs on a routine basis is challenging the computing industry on storage capacity and manipulation speed and the bioinformatics community for improvement in algorithms and their implementations.
Affiliation(s)
- Mateusz Janicki
- Department of Biology, University of Toronto at Mississauga, 3359 Mississauga Road, Mississauga, ON L5L1C6, Canada
|
27
|
Flutre T, Permal E, Quesneville H. Transposable Element Annotation in Completely Sequenced Eukaryote Genomes. PLANT TRANSPOSABLE ELEMENTS 2012. [DOI: 10.1007/978-3-642-31842-9_2] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Indexed: 12/14/2022]
|
28
|
Permal E, Flutre T, Quesneville H. Roadmap for annotating transposable elements in eukaryote genomes. Methods Mol Biol 2012; 859:53-68. [PMID: 22367865 DOI: 10.1007/978-1-61779-603-6_3] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Indexed: 02/04/2023]
Abstract
Current high-throughput techniques have made it feasible to sequence even the genomes of non-model organisms. However, the annotation process now represents a bottleneck to genome analysis, especially when dealing with transposable elements (TE). Combined approaches, using both de novo and knowledge-based methods to detect TEs, are likely to produce reasonably comprehensive and sensitive results. This chapter provides a roadmap for researchers involved in genome projects to address this issue. At each step of the TE annotation process, from the identification of TE families to the annotation of TE copies, we outline the tools and good practices to be used.
Affiliation(s)
- Emmanuelle Permal
- Unité de Recherches en Génomique Info - URGI (UR1164) - INRA - Centre de Versailles, Versailles cedex, France
|
29
|
Sun C, Shepard DB, Chong RA, López Arriaza J, Hall K, Castoe TA, Feschotte C, Pollock DD, Mueller RL. LTR retrotransposons contribute to genomic gigantism in plethodontid salamanders. Genome Biol Evol 2011; 4:168-83. [PMID: 22200636 PMCID: PMC3318908 DOI: 10.1093/gbe/evr139] [Citation(s) in RCA: 130] [Impact Index Per Article: 10.0] [Accepted: 12/22/2011] [Indexed: 01/20/2023] Open
Abstract
Among vertebrates, most of the largest genomes are found within the salamanders, a clade of amphibians that includes 613 species. Salamander genome sizes range from ~14 to ~120 Gb. Because genome size is correlated with nucleus and cell sizes, as well as other traits, morphological evolution in salamanders has been profoundly affected by genomic gigantism. However, the molecular mechanisms driving genomic expansion in this clade remain largely unknown. Here, we present the first comparative analysis of transposable element (TE) content in salamanders. Using high-throughput sequencing, we generated genomic shotgun data for six species from the Plethodontidae, the largest family of salamanders. We then developed a pipeline to mine TE sequences from shotgun data in taxa with limited genomic resources, such as salamanders. Our summaries of overall TE abundance and diversity for each species demonstrate that TEs make up a substantial portion of salamander genomes, and that all of the major known types of TEs are represented in salamanders. The most abundant TE superfamilies found in the genomes of our six focal species are similar, despite substantial variation in genome size. However, our results demonstrate a major difference between salamanders and other vertebrates: salamander genomes contain much larger amounts of long terminal repeat (LTR) retrotransposons, primarily Ty3/gypsy elements. Thus, the extreme increase in genome size that occurred in salamanders was likely accompanied by a shift in TE landscape. These results suggest that increased proliferation of LTR retrotransposons was a major molecular mechanism contributing to genomic expansion in salamanders.
Affiliation(s)
- Cheng Sun
- Department of Biology, Colorado State University
- Donald B. Shepard
- Department of Biology, Colorado State University
- Current address: Department of Fisheries, Wildlife and Conservation Biology; University of Minnesota
- Kathryn Hall
- Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine
- Todd A. Castoe
- Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine
- David D. Pollock
- Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine
|
30
|
Repetitive elements may comprise over two-thirds of the human genome. PLoS Genet 2011; 7:e1002384. [PMID: 22144907 PMCID: PMC3228813 DOI: 10.1371/journal.pgen.1002384] [Citation(s) in RCA: 724] [Impact Index Per Article: 55.7] [Received: 08/05/2011] [Accepted: 10/04/2011] [Indexed: 12/18/2022] Open
Abstract
Transposable elements (TEs) are conventionally identified in eukaryotic genomes by alignment to consensus element sequences. Using this approach, about half of the human genome has been previously identified as TEs and low-complexity repeats. We recently developed a highly sensitive alternative de novo strategy, P-clouds, that instead searches for clusters of high-abundance oligonucleotides that are related in sequence space (oligo "clouds"). We show here that P-clouds predicts >840 Mbp of additional repetitive sequences in the human genome, thus suggesting that 66%-69% of the human genome is repetitive or repeat-derived. To investigate this remarkable difference, we conducted detailed analyses of the ability of both P-clouds and a commonly used conventional approach, RepeatMasker (RM), to detect different sized fragments of the highly abundant human Alu and MIR SINEs. RM can have surprisingly low sensitivity for even moderately long fragments, in contrast to P-clouds, which has good sensitivity down to small fragment sizes (∼25 bp). Although short fragments have a high intrinsic probability of being false positives, we performed a probabilistic annotation that reflects this fact. We further developed "element-specific" P-clouds (ESPs) to identify novel Alu and MIR SINE elements, and using it we identified ∼100 Mb of previously unannotated human elements. ESP estimates of new MIR sequences are in good agreement with RM-based predictions of the amount that RM missed. These results highlight the need for combined, probabilistic genome annotation approaches and suggest that the human genome consists of substantially more repetitive sequence than previously believed.
|
31
|
Castoe TA, Hall KT, Guibotsy Mboulas ML, Gu W, de Koning APJ, Fox SE, Poole AW, Vemulapalli V, Daza JM, Mockler T, Smith EN, Feschotte C, Pollock DD. Discovery of highly divergent repeat landscapes in snake genomes using high-throughput sequencing. Genome Biol Evol 2011; 3:641-53. [PMID: 21572095 PMCID: PMC3157835 DOI: 10.1093/gbe/evr043] [Citation(s) in RCA: 73] [Impact Index Per Article: 5.6] [Indexed: 01/19/2023] Open
Abstract
We conducted a comprehensive assessment of genomic repeat content in two snake genomes, the venomous copperhead (Agkistrodon contortrix) and the Burmese python (Python molurus bivittatus). These two genomes are both relatively small (∼1.4 Gb) but have surprisingly extensive differences in the abundance and expansion histories of their repeat elements. In the python, the readily identifiable repeat element content is low (21%), similar to bird genomes, whereas that of the copperhead is higher (45%), similar to mammalian genomes. The copperhead's greater repeat content arises from the recent expansion of many different microsatellites and transposable element (TE) families, and the copperhead had 23-fold greater levels of TE-related transcripts than the python. This suggests the possibility that greater TE activity in the copperhead is ongoing. Expansion of CR1 LINEs in the copperhead genome has resulted in TE-mediated microsatellite expansion ("microsatellite seeding") at a scale several orders of magnitude greater than previously observed in vertebrates. Snakes also appear to be prone to horizontal transfer of TEs, particularly in the copperhead lineage. The reason that the copperhead has such a small genome in the face of so much recent expansion of repeat elements remains an open question, although selective pressure related to extreme metabolic performance is an obvious candidate. TE activity can affect gene regulation as well as rates of recombination and gene duplication, and it is therefore possible that TE activity played a role in the evolution of major adaptations in snakes; some evidence suggests this may include the evolution of venom repertoires.
Affiliation(s)
- Todd A Castoe
- Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, USA
|
32
|
Flutre T, Duprat E, Feuillet C, Quesneville H. Considering transposable element diversification in de novo annotation approaches. PLoS One 2011; 6:e16526. [PMID: 21304975 PMCID: PMC3031573 DOI: 10.1371/journal.pone.0016526] [Citation(s) in RCA: 324] [Impact Index Per Article: 24.9] [Received: 08/31/2010] [Accepted: 01/04/2011] [Indexed: 01/24/2023] Open
Abstract
Transposable elements (TEs) are mobile, repetitive DNA sequences that are almost ubiquitous in prokaryotic and eukaryotic genomes. They have a large impact on genome structure, function and evolution. With the recent development of high-throughput sequencing methods, many genome sequences have become available, making possible comparative studies of TE dynamics at an unprecedented scale. Several methods have been proposed for the de novo identification of TEs in sequenced genomes. Most begin with the detection of genomic repeats, but the subsequent steps for defining TE families differ. High-quality TE annotations are available for the Drosophila melanogaster and Arabidopsis thaliana genome sequences, providing a solid basis for the benchmarking of such methods. We compared the performance of specific algorithms for the clustering of interspersed repeats and found that only a particular combination of algorithms detected TE families with good recovery of the reference sequences. We then applied a new procedure for reconciling the different clustering results and classifying TE sequences. The whole approach was implemented in a pipeline using the REPET package. Finally, we show that our combined approach highlights the dynamics of well defined TE families by making it possible to identify structural variations among their copies. This approach makes it possible to annotate TE families and to study their diversification in a single analysis, improving our understanding of TE dynamics at the whole-genome scale and for diverse species.
Affiliation(s)
- Timothée Flutre
- Unité de Recherche en Génomique-Info, UR 1164, INRA Centre de Versailles-Grignon, Versailles, France
- Elodie Duprat
- Institut de Minéralogie et de Physique des Milieux Condensés, UMR 7590, CNRS-UPMC-IPGP-Université Paris Diderot, Paris, France
- Catherine Feuillet
- Génétique, Diversité et Ecophysiologie des Céréales, UMR 1095, INRA Domaine du Crouël, Clermont-Ferrand, France
- Hadi Quesneville
- Unité de Recherche en Génomique-Info, UR 1164, INRA Centre de Versailles-Grignon, Versailles, France
|
33
|
Identifying repeats and transposable elements in sequenced genomes: how to find your way through the dense forest of programs. Heredity (Edinb) 2009; 104:520-33. [PMID: 19935826 DOI: 10.1038/hdy.2009.165] [Citation(s) in RCA: 130] [Impact Index Per Article: 8.7] [Indexed: 12/25/2022] Open
Abstract
The production of genome sequences has led to another important advance in their annotation, which is closely linked to the exact determination of their content in terms of repeats, among which are transposable elements (TEs). The evolutionary implications and the presence of coding regions in some TEs can confuse gene annotation and also hinder the process of genome assembly, making it particularly crucial to be able to annotate and classify them correctly in genome sequences. This review is intended to provide as comprehensive an overview as possible of the automated methods currently used to annotate and classify TEs in sequenced genomes. Different categories of programs exist according to their methodology and the repeats they can identify. I describe here the main characteristics of the programs, their main goals and the difficulties they can entail. The drawbacks of the different methods are also highlighted to help biologists who are unfamiliar with algorithmic methods to understand this methodology better. Globally, using several different programs and carrying out a cross comparison of their results has a better chance of finding reliable results than any single program. However, this makes it essential to verify the results provided by each program independently. The ideal solution would be to test all programs against the same data set to obtain a true comparison of their actual performance.
|
34
|
Belancio VP, Deininger PL, Roy-Engel AM. LINE dancing in the human genome: transposable elements and disease. Genome Med 2009; 1:97. [PMID: 19863772 PMCID: PMC2784310 DOI: 10.1186/gm97] [Citation(s) in RCA: 99] [Impact Index Per Article: 6.6] [Indexed: 11/11/2022] Open
Abstract
Transposable elements (TEs) have been consistently underestimated in their contribution to genetic instability and human disease. TEs can cause human disease by creating insertional mutations in genes, and also contributing to genetic instability through non-allelic homologous recombination and introduction of sequences that evolve into various cis-acting signals that alter gene expression. Other outcomes of TE activity, such as their potential to cause DNA double-strand breaks or to modulate the epigenetic state of chromosomes, are less fully characterized. The currently active human transposable elements are members of the non-LTR retroelement families, LINE-1, Alu (SINE), and SVA. The impact of germline insertional mutagenesis by TEs is well established, whereas the rate of post-insertional TE-mediated germline mutations and all forms of somatic mutations remain less well quantified. The number of human diseases discovered to be associated with non-allelic homologous recombination between TEs, and particularly between Alu elements, is growing at an unprecedented rate. Improvement in the technology for detection of such events, as well as the mounting interest in the research and medical communities in resolving the underlying causes of the human diseases with unknown etiology, explain this increase. Here, we focus on the most recent advances in understanding of the impact of the active human TEs on the stability of the human genome and its relevance to human disease.
Affiliation(s)
- Victoria P Belancio
- Department of Structural and Cellular Biology, School of Medicine, Tulane Cancer Center and Tulane Center for Aging, Tulane University, SL-49 1430 Tulane Ave, New Orleans, LA 70112, USA.
|