1
|
Orozco-Arias S, Isaza G, Guyot R. Retrotransposons in Plant Genomes: Structure, Identification, and Classification through Bioinformatics and Machine Learning. Int J Mol Sci 2019; 20:E3837. [PMID: 31390781 PMCID: PMC6696364 DOI: 10.3390/ijms20153837] [Citation(s) in RCA: 34] [Impact Index Per Article: 6.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/21/2019] [Revised: 07/31/2019] [Accepted: 08/02/2019] [Indexed: 01/26/2023] Open
Abstract
Transposable elements (TEs) are genomic units able to move within the genome of virtually all organisms. Due to their natural repetitive numbers and their high structural diversity, the identification and classification of TEs remain a challenge in sequenced genomes. Although TEs were initially regarded as "junk DNA", it has been demonstrated that they play key roles in chromosome structures, gene expression, and regulation, as well as adaptation and evolution. A highly reliable annotation of these elements is, therefore, crucial to better understand genome functions and their evolution. To date, much bioinformatics software has been developed to address TE detection and classification processes, but many problematic aspects remain, such as the reliability, precision, and speed of the analyses. Machine learning and deep learning are algorithms that can make automatic predictions and decisions in a wide variety of scientific applications. They have been tested in bioinformatics and, more specifically for TEs, classification with encouraging results. In this review, we will discuss important aspects of TEs, such as their structure, importance in the evolution and architecture of the host, and their current classifications and nomenclatures. We will also address current methods and their limitations in identifying and classifying TEs.
Collapse
Affiliation(s)
- Simon Orozco-Arias
- Department of Computer Science, Universidad Autónoma de Manizales, Manizales 170001, Colombia
- Department of Systems and Informatics, Universidad de Caldas, Manizales 170001, Colombia
| | - Gustavo Isaza
- Department of Systems and Informatics, Universidad de Caldas, Manizales 170001, Colombia
| | - Romain Guyot
- Department of Electronics and Automatization, Universidad Autónoma de Manizales, Manizales 170001, Colombia.
- Institut de Recherche pour le Développement, CIRAD, University Montpellier, 34000 Montpellier, France.
| |
Collapse
|
2
|
Valencia JD, Girgis HZ. LtrDetector: A tool-suite for detecting long terminal repeat retrotransposons de-novo. BMC Genomics 2019; 20:450. [PMID: 31159720 PMCID: PMC6547461 DOI: 10.1186/s12864-019-5796-9] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/26/2018] [Accepted: 05/14/2019] [Indexed: 12/19/2022] Open
Abstract
BACKGROUND Long terminal repeat retrotransposons are the most abundant transposons in plants. They play important roles in alternative splicing, recombination, gene regulation, and defense mechanisms. Large-scale sequencing projects for plant genomes are currently underway. Software tools are important for annotating long terminal repeat retrotransposons in these newly available genomes. However, the available tools are not very sensitive to known elements and perform inconsistently on different genomes. Some are hard to install or obsolete. They may struggle to process large plant genomes. None can be executed in parallel out of the box and very few have features to support visual review of new elements. To overcome these limitations, we developed LtrDetector, which uses techniques inspired by signal-processing. RESULTS We compared LtrDetector to LTR_Finder and LTRharvest, the two most successful predecessor tools, on six plant genomes. For each organism, we constructed a ground truth data set based on queries from a consensus sequence database. According to this evaluation, LtrDetector was the most sensitive tool, achieving 16-23% improvement in sensitivity over LTRharvest and 21% improvement over LTR_Finder. All three tools had low false positive rates, with LtrDetector achieving 98.2% precision, in between its two competitors. Overall, LtrDetector provides the best compromise between high sensitivity and low false positive rate while requiring moderate time and utilizing memory available on personal computers. CONCLUSIONS LtrDetector uses a novel methodology revolving around k-mer distributions, which allows it to produce high-quality results using relatively lightweight procedures. It is easy to install and use. It is not species specific, performing well using its default parameters on genomes of varying size and repeat content. It is automatically configured for parallel execution and runs efficiently on an ordinary personal computer. It includes a k-mer scores visualization tool to facilitate manual review of the identified elements. These features make LtrDetector an attractive tool for future annotation projects involving long terminal repeat retrotransposons.
Collapse
Affiliation(s)
- Joseph D Valencia
- The Bioinformatics Toolsmith Laboratory, Tandy School of Computer Science, University of Tulsa, 800 South Tucker Drive, Tulsa, 74104, OK, USA
| | - Hani Z Girgis
- The Bioinformatics Toolsmith Laboratory, Tandy School of Computer Science, University of Tulsa, 800 South Tucker Drive, Tulsa, 74104, OK, USA.
| |
Collapse
|
3
|
Inpactor, Integrated and Parallel Analyzer and Classifier of LTR Retrotransposons and Its Application for Pineapple LTR Retrotransposons Diversity and Dynamics. BIOLOGY 2018; 7:biology7020032. [PMID: 29799487 PMCID: PMC6022998 DOI: 10.3390/biology7020032] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/03/2018] [Revised: 05/16/2018] [Accepted: 05/22/2018] [Indexed: 12/22/2022]
Abstract
One particular class of Transposable Elements (TEs), called Long Terminal Repeats (LTRs), retrotransposons, comprises the most abundant mobile elements in plant genomes. Their copy number can vary from several hundreds to up to a few million copies per genome, deeply affecting genome organization and function. The detailed classification of LTR retrotransposons is an essential step to precisely understand their effect at the genome level, but remains challenging in large-sized genomes, requiring the use of optimized bioinformatics tools that can take advantage of supercomputers. Here, we propose a new tool: Inpactor, a parallel and scalable pipeline designed to classify LTR retrotransposons, to identify autonomous and non-autonomous elements, to perform RT-based phylogenetic trees and to analyze their insertion times using High Performance Computing (HPC) techniques. Inpactor was tested on the classification and annotation of LTR retrotransposons in pineapple, a recently-sequenced genome. The pineapple genome assembly comprises 44% of transposable elements, of which 23% were classified as LTR retrotransposons. Exceptionally, 16.4% of the pineapple genome assembly corresponded to only one lineage of the Gypsy superfamily: Del, suggesting that this particular lineage has undergone a significant increase in its copy numbers. As demonstrated for the pineapple genome, Inpactor provides comprehensive data of LTR retrotransposons’ classification and dynamics, allowing a fine understanding of their contribution to genome structure and evolution. Inpactor is available at https://github.com/simonorozcoarias/Inpactor.
Collapse
|
4
|
Schietgat L, Vens C, Cerri R, Fischer CN, Costa E, Ramon J, Carareto CMA, Blockeel H. A machine learning based framework to identify and classify long terminal repeat retrotransposons. PLoS Comput Biol 2018; 14:e1006097. [PMID: 29684010 PMCID: PMC5933816 DOI: 10.1371/journal.pcbi.1006097] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/26/2017] [Revised: 05/03/2018] [Accepted: 03/19/2018] [Indexed: 12/03/2022] Open
Abstract
Transposable elements (TEs) are repetitive nucleotide sequences that make up a large portion of eukaryotic genomes. They can move and duplicate within a genome, increasing genome size and contributing to genetic diversity within and across species. Accurate identification and classification of TEs present in a genome is an important step towards understanding their effects on genes and their role in genome evolution. We introduce TE-Learner, a framework based on machine learning that automatically identifies TEs in a given genome and assigns a classification to them. We present an implementation of our framework towards LTR retrotransposons, a particular type of TEs characterized by having long terminal repeats (LTRs) at their boundaries. We evaluate the predictive performance of our framework on the well-annotated genomes of Drosophila melanogaster and Arabidopsis thaliana and we compare our results for three LTR retrotransposon superfamilies with the results of three widely used methods for TE identification or classification: RepeatMasker, Censor and LtrDigest. In contrast to these methods, TE-Learner is the first to incorporate machine learning techniques, outperforming these methods in terms of predictive performance, while able to learn models and make predictions efficiently. Moreover, we show that our method was able to identify TEs that none of the above method could find, and we investigated TE-Learner’s predictions which did not correspond to an official annotation. It turns out that many of these predictions are in fact strongly homologous to a known TE. Over the years, with the increase of the acquisition of biological data, the extraction of knowledge from this data is getting more important. To understand how biology works is very important to increase the quality of the products and services which use biological data. This directly influences companies and governments, which need to remain in the knowledge frontier of an increasing competitive economy. Transposable Elements (TEs) are an example of very important biological data, and to understand their role in the genomes of organisms is very important for the development of products based on biological data. As an example, we can cite the production biofuels such as the sugar-cane-based ones. Many studies have revealed the presence of active TEs in this plant, which has gained economic importance in many countries. To understand how TEs influence the plant should help researchers to develop more resistant varieties of sugar-cane, increasing the production. Thus, the development of computational methods able to help biologists in the correct identification and classification of TEs is very important from both theoretical and practical perspectives.
Collapse
Affiliation(s)
| | - Celine Vens
- Department of Computer Science, KU Leuven, Leuven, Belgium
- Department of Public Health and Primary Care, KU Leuven Kulak, Kortrijk, Belgium
- Department of Respiratory Medicine, Ghent University, and VIB Inflammation Research Center, Ghent, Belgium
- * E-mail:
| | - Ricardo Cerri
- Department of Computer Science, UFSCar Federal University of São Carlos, São Carlos, São Paulo, Brazil
| | - Carlos N. Fischer
- Department of Statistics, Applied Mathematics, and Computer Science, UNESP São Paulo State University, Rio Claro, São Paulo, Brazil
| | - Eduardo Costa
- Department of Computer Science, KU Leuven, Leuven, Belgium
- Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, São Carlos, São Paulo, Brazil
| | - Jan Ramon
- Department of Computer Science, KU Leuven, Leuven, Belgium
- INRIA Lille Nord Europe, 40 avenue Halley, 59650 Villeneuve d’Ascq, France
| | - Claudia M. A. Carareto
- Department of Biology, UNESP São Paulo State University, São José do Rio Preto, São Paulo, Brazil
| | | |
Collapse
|
5
|
Genome-wide analysis of transposable elements in the coffee berry borer Hypothenemus hampei (Coleoptera: Curculionidae): description of novel families. Mol Genet Genomics 2017; 292:565-583. [PMID: 28204924 DOI: 10.1007/s00438-017-1291-7] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/14/2016] [Accepted: 01/12/2017] [Indexed: 10/20/2022]
Abstract
The coffee berry borer (CBB) Hypothenemus hampei is the most limiting pest of coffee production worldwide. The CBB genome has been recently sequenced; however, information regarding the presence and characteristics of transposable elements (TEs) was not provided. Using systematic searching strategies based on both de novo and homology-based approaches, we present a library of TEs from the draft genome of CBB sequenced by the Colombian Coffee Growers Federation. The library consists of 880 sequences classified as 66% Class I (LTRs: 46%, non-LTRs: 20%) and 34% Class II (DNA transposons: 8%, Helitrons: 16% and MITEs: 10%) elements, including families of the three main LTR (Gypsy, Bel-Pao and Copia) and non-LTR (CR1, Daphne, I/Nimb, Jockey, Kiri, R1, R2 and R4) clades and DNA superfamilies (Tc1-mariner, hAT, Merlin, P, PIF-Harbinger, PiggyBac and Helitron). We propose the existence of novel families: Hypo, belonging to the LTR Gypsy superfamily; Hamp, belonging to non-LTRs; and rosa, belonging to Class II or DNA transposons. Although the rosa clade has been previously described, it was considered to be a basal subfamily of the mariner family. Based on our phylogenetic analysis, including Tc1, mariner, pogo, rosa and Lsra elements from other insects, we propose that rosa and Lsra elements are subfamilies of an independent family of Class II elements termed rosa. The annotations obtained indicate that a low percentage of the assembled CBB genome (approximately 8.2%) consists of TEs. Although these TEs display high diversity, most sequences are degenerate, with few full-length copies of LTR and DNA transposons and several complete and putatively active copies of non-LTR elements. MITEs constitute approximately 50% of the total TEs content, with a high proportion associated with DNA transposons in the Tc1-mariner superfamily.
Collapse
|
6
|
Characterization of new transposable element sub-families from white clover (Trifolium repens) using PCR amplification. Genetica 2016; 144:577-589. [PMID: 27671023 DOI: 10.1007/s10709-016-9926-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/25/2016] [Accepted: 09/17/2016] [Indexed: 12/15/2022]
Abstract
Transposable elements (TEs) dominate the landscapes of most plant and animal genomes. Once considered junk DNA and genetic parasites, these interspersed, repetitive DNA elements are now known to play major roles in both genetic and epigenetic processes that sponsor genome variation and regulate gene expression. Knowledge of TE consensus sequences from elements in species whose genomes have not been sequenced is limited, and the individual TEs that are encountered in clones or short-reads rarely represent potentially canonical, let alone, functional representatives. In this study, we queried the Repbase database with eight BAC clones from white clover (Trifolium repens), identified a large number of candidate TEs, and used polymerase chain reaction and Sanger sequencing to create consensus sequences for three new TE families. The results show that TE family consensus sequences can be obtained experimentally in species for which just a single, full-length member of a TE family has been sequenced.
Collapse
|
7
|
Monat C, Tando N, Tranchant-Dubreuil C, Sabot F. LTRclassifier: A website for fast structural LTR retrotransposons classification in plants. Mob Genet Elements 2016; 6:e1241050. [PMID: 28090381 DOI: 10.1080/2159256x.2016.1241050] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/23/2016] [Revised: 09/20/2016] [Accepted: 09/20/2016] [Indexed: 10/20/2022] Open
Abstract
Automatic classification of LTR retrotransposons is a big challenge in the area of massive genomics. Many tools were developed to detect them but automatic classification is somehow challenging. Here we propose a simple approach, LTRclassifier, based on HMM recognition followed by BLAST analyses (i) to classify plant LTR retrotransposons in their respective superfamily, and (ii) to provide automatically a basic functional annotation of these elements. The method was tested on various TE databases, and shown to be robust and fast. This tool is available as a web service implemented at IRD bioinformatics facility, http://LTRclassifier.ird.fr/.
Collapse
Affiliation(s)
- Cecile Monat
- UMR DIADE IRD/UM, IRD , Montpellier Cedex 5 , France
| | | | | | | |
Collapse
|
8
|
Möller S, Afgan E, Banck M, Bonnal RJP, Booth T, Chilton J, Cock PJA, Gumbel M, Harris N, Holland R, Kalaš M, Kaján L, Kibukawa E, Powel DR, Prins P, Quinn J, Sallou O, Strozzi F, Seemann T, Sloggett C, Soiland-Reyes S, Spooner W, Steinbiss S, Tille A, Travis AJ, Guimera R, Katayama T, Chapman BA. Community-driven development for computational biology at Sprints, Hackathons and Codefests. BMC Bioinformatics 2014; 15 Suppl 14:S7. [PMID: 25472764 PMCID: PMC4255748 DOI: 10.1186/1471-2105-15-s14-s7] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/23/2022] Open
Abstract
Background Computational biology comprises a wide range of technologies and approaches. Multiple technologies can be combined to create more powerful workflows if the individuals contributing the data or providing tools for its interpretation can find mutual understanding and consensus. Much conversation and joint investigation are required in order to identify and implement the best approaches. Traditionally, scientific conferences feature talks presenting novel technologies or insights, followed up by informal discussions during coffee breaks. In multi-institution collaborations, in order to reach agreement on implementation details or to transfer deeper insights in a technology and practical skills, a representative of one group typically visits the other. However, this does not scale well when the number of technologies or research groups is large. Conferences have responded to this issue by introducing Birds-of-a-Feather (BoF) sessions, which offer an opportunity for individuals with common interests to intensify their interaction. However, parallel BoF sessions often make it hard for participants to join multiple BoFs and find common ground between the different technologies, and BoFs are generally too short to allow time for participants to program together. Results This report summarises our experience with computational biology Codefests, Hackathons and Sprints, which are interactive developer meetings. They are structured to reduce the limitations of traditional scientific meetings described above by strengthening the interaction among peers and letting the participants determine the schedule and topics. These meetings are commonly run as loosely scheduled "unconferences" (self-organized identification of participants and topics for meetings) over at least two days, with early introductory talks to welcome and organize contributors, followed by intensive collaborative coding sessions. We summarise some prominent achievements of those meetings and describe differences in how these are organised, how their audience is addressed, and their outreach to their respective communities. Conclusions Hackathons, Codefests and Sprints share a stimulating atmosphere that encourages participants to jointly brainstorm and tackle problems of shared interest in a self-driven proactive environment, as well as providing an opportunity for new participants to get involved in collaborative projects.
Collapse
|
9
|
Gremme G, Steinbiss S, Kurtz S. GenomeTools: a comprehensive software library for efficient processing of structured genome annotations. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2013; 10:645-56. [PMID: 24091398 DOI: 10.1109/tcbb.2013.68] [Citation(s) in RCA: 229] [Impact Index Per Article: 20.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/19/2023]
Abstract
Genome annotations are often published as plain text files describing genomic features and their subcomponents by an implicit annotation graph. In this paper, we present the GenomeTools, a convenient and efficient software library and associated software tools for developing bioinformatics software intended to create, process or convert annotation graphs. The GenomeTools strictly follow the annotation graph approach, offering a unified graph-based representation. This gives the developer intuitive and immediate access to genomic features and tools for their manipulation. To process large annotation sets with low memory overhead, we have designed and implemented an efficient pull-based approach for sequential processing of annotations. This allows to handle even the largest annotation sets, such as a complete catalogue of human variations. Our object-oriented C-based software library enables a developer to conveniently implement their own functionality on annotation graphs and to integrate it into larger workflows, simultaneously accessing compressed sequence data if required. The careful C implementation of the GenomeTools does not only ensure a light-weight memory footprint while allowing full sequential as well as random access to the annotation graph, but also facilitates the creation of bindings to a variety of script programming languages (like Python and Ruby) sharing the same interface.
Collapse
|