1
|
Salojärvi J, Rambani A, Yu Z, Guyot R, Strickler S, Lepelley M, Wang C, Rajaraman S, Rastas P, Zheng C, Muñoz DS, Meidanis J, Paschoal AR, Bawin Y, Krabbenhoft TJ, Wang ZQ, Fleck SJ, Aussel R, Bellanger L, Charpagne A, Fournier C, Kassam M, Lefebvre G, Métairon S, Moine D, Rigoreau M, Stolte J, Hamon P, Couturon E, Tranchant-Dubreuil C, Mukherjee M, Lan T, Engelhardt J, Stadler P, Correia De Lemos SM, Suzuki SI, Sumirat U, Wai CM, Dauchot N, Orozco-Arias S, Garavito A, Kiwuka C, Musoli P, Nalukenge A, Guichoux E, Reinout H, Smit M, Carretero-Paulet L, Filho OG, Braghini MT, Padilha L, Sera GH, Ruttink T, Henry R, Marraccini P, Van de Peer Y, Andrade A, Domingues D, Giuliano G, Mueller L, Pereira LF, Plaisance S, Poncet V, Rombauts S, Sankoff D, Albert VA, Crouzillat D, de Kochko A, Descombes P. The genome and population genomics of allopolyploid Coffea arabica reveal the diversification history of modern coffee cultivars. Nat Genet 2024; 56:721-731. [PMID: 38622339 PMCID: PMC11018527 DOI: 10.1038/s41588-024-01695-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/10/2022] [Accepted: 02/23/2024] [Indexed: 04/17/2024]
Abstract
Coffea arabica, an allotetraploid hybrid of Coffea eugenioides and Coffea canephora, is the source of approximately 60% of coffee products worldwide, and its cultivated accessions have undergone several population bottlenecks. We present chromosome-level assemblies of a di-haploid C. arabica accession and modern representatives of its diploid progenitors, C. eugenioides and C. canephora. The three species exhibit largely conserved genome structures between diploid parents and descendant subgenomes, with no obvious global subgenome dominance. We find evidence for a founding polyploidy event 350,000-610,000 years ago, followed by several pre-domestication bottlenecks, resulting in narrow genetic variation. A split between wild accessions and cultivar progenitors occurred ~30.5 thousand years ago, followed by a period of migration between the two populations. Analysis of modern varieties, including lines historically introgressed with C. canephora, highlights their breeding histories and loci that may contribute to pathogen resistance, laying the groundwork for future genomics-based breeding of C. arabica.
Affiliation(s)
- Jarkko Salojärvi
- School of Biological Sciences, Nanyang Technological University, Singapore, Singapore.
- Organismal and Evolutionary Biology Research Programme, University of Helsinki, Helsinki, Finland.
- Singapore Centre for Environmental Life Sciences Engineering, Nanyang Technological University, Singapore, Singapore.
| | - Aditi Rambani
- Boyce Thompson Institute, Cornell University, Ithaca, NY, USA
| | - Zhe Yu
- Department of Mathematics and Statistics, University of Ottawa, Ottawa, Ontario, Canada
| | - Romain Guyot
- Institut de Recherche pour le Développement (IRD), Université de Montpellier, Montpellier, France
- Department of Electronics and Automation, Universidad Autónoma de Manizales, Manizales, Colombia
| | - Susan Strickler
- Boyce Thompson Institute, Cornell University, Ithaca, NY, USA
| | - Maud Lepelley
- Société des Produits Nestlé SA, Nestlé Research, Tours, France
| | - Cui Wang
- Organismal and Evolutionary Biology Research Programme, University of Helsinki, Helsinki, Finland
| | - Sitaram Rajaraman
- Organismal and Evolutionary Biology Research Programme, University of Helsinki, Helsinki, Finland
| | - Pasi Rastas
- Institute of Biotechnology, University of Helsinki, Helsinki, Finland
| | - Chunfang Zheng
- Department of Mathematics and Statistics, University of Ottawa, Ottawa, Ontario, Canada
| | - Daniella Santos Muñoz
- Department of Mathematics and Statistics, University of Ottawa, Ottawa, Ontario, Canada
| | - João Meidanis
- Institute of Computing, University of Campinas, Campinas, Brazil
| | - Alexandre Rossi Paschoal
- Department of Computer Science, The Federal University of Technology - Paraná (UTFPR), Cornélio Procópio, Brazil
| | - Yves Bawin
- Plant Sciences Unit, Flanders Research Institute for Agriculture, Fisheries and Food (ILVO), Melle, Belgium
| | | | - Zhen Qin Wang
- Department of Biological Sciences, University at Buffalo, Buffalo, NY, USA
| | - Steven J Fleck
- Department of Biological Sciences, University at Buffalo, Buffalo, NY, USA
| | - Rudy Aussel
- Société des Produits Nestlé SA, Nestlé Research, Tours, France
- Centre d'Immunologie de Marseille-Luminy, Aix Marseille Université, Marseille, France
| | | | - Aline Charpagne
- Société des Produits Nestlé SA, Nestlé Research, Lausanne, Switzerland
| | - Coralie Fournier
- Société des Produits Nestlé SA, Nestlé Research, Lausanne, Switzerland
| | - Mohamed Kassam
- Société des Produits Nestlé SA, Nestlé Research, Lausanne, Switzerland
| | - Gregory Lefebvre
- Société des Produits Nestlé SA, Nestlé Research, Lausanne, Switzerland
| | - Sylviane Métairon
- Société des Produits Nestlé SA, Nestlé Research, Lausanne, Switzerland
| | - Déborah Moine
- Société des Produits Nestlé SA, Nestlé Research, Lausanne, Switzerland
| | - Michel Rigoreau
- Société des Produits Nestlé SA, Nestlé Research, Tours, France
| | - Jens Stolte
- Société des Produits Nestlé SA, Nestlé Research, Lausanne, Switzerland
| | - Perla Hamon
- Institut de Recherche pour le Développement (IRD), Université de Montpellier, Montpellier, France
| | - Emmanuel Couturon
- Institut de Recherche pour le Développement (IRD), Université de Montpellier, Montpellier, France
| | | | - Minakshi Mukherjee
- Department of Biological Sciences, University at Buffalo, Buffalo, NY, USA
| | - Tianying Lan
- Department of Biological Sciences, University at Buffalo, Buffalo, NY, USA
| | - Jan Engelhardt
- Department of Computer Science, University of Leipzig, Leipzig, Germany
| | - Peter Stadler
- Department of Computer Science, University of Leipzig, Leipzig, Germany
- Interdisciplinary Center for Bioinformatics, University of Leipzig, Leipzig, Germany
| | | | | | - Ucu Sumirat
- Indonesian Coffee and Cocoa Research Institute (ICCRI), Jember, Indonesia
| | - Ching Man Wai
- University of Illinois at Urbana-Champaign, Urbana, IL, USA
| | - Nicolas Dauchot
- Research Unit in Plant Cellular and Molecular Biology, University of Namur, Namur, Belgium
| | - Simon Orozco-Arias
- Department of Electronics and Automation, Universidad Autónoma de Manizales, Manizales, Colombia
| | - Andrea Garavito
- Departamento de Ciencias Biológicas, Facultad de Ciencias Exactas y Naturales, Universidad de Caldas, Manizales, Colombia
| | - Catherine Kiwuka
- National Agricultural Research Organization (NARO), Entebbe, Uganda
| | - Pascal Musoli
- National Agricultural Research Organization (NARO), Entebbe, Uganda
| | - Anne Nalukenge
- National Agricultural Research Organization (NARO), Entebbe, Uganda
| | - Erwan Guichoux
- Biodiversité Gènes & Communautés, INRA, Bordeaux, France
| | | | - Martin Smit
- Hortus Botanicus Amsterdam, Amsterdam, the Netherlands
| | | | - Oliveiro Guerreiro Filho
- Instituto Agronômico (IAC) Centro de Café 'Alcides Carvalho', Fazenda Santa Elisa, Campinas, Brazil
| | - Masako Toma Braghini
- Instituto Agronômico (IAC) Centro de Café 'Alcides Carvalho', Fazenda Santa Elisa, Campinas, Brazil
| | - Lilian Padilha
- Embrapa Café/Instituto Agronômico (IAC) Centro de Café 'Alcides Carvalho', Fazenda Santa Elisa, Campinas, Brazil
| | | | - Tom Ruttink
- Plant Sciences Unit, Flanders Research Institute for Agriculture, Fisheries and Food (ILVO), Melle, Belgium
- Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium
| | - Robert Henry
- Queensland Alliance for Agriculture and Food Innovation, University of Queensland, Brisbane, Queensland, Australia
| | - Pierre Marraccini
- CIRAD - UMR DIADE (IRD-CIRAD-Université de Montpellier) BP 64501, Montpellier, France
| | - Yves Van de Peer
- Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium
- Department of Biochemistry, Genetics and Microbiology, University of Pretoria, Pretoria, South Africa
- College of Horticulture, Academy for Advanced Interdisciplinary Studies, Nanjing Agricultural University, Nanjing, China
- Center for Plant Systems Biology, VIB, Ghent, Belgium
| | - Alan Andrade
- Embrapa Café/Inovacafé Laboratory of Molecular Genetics Campus da UFLA-MG, Lavras, Brazil
| | - Douglas Domingues
- Group of Genomics and Transcriptomes in Plants, São Paulo State University, UNESP, Rio Claro, Brazil
| | - Giovanni Giuliano
- Italian National Agency for New Technologies, Energy and Sustainable Economic Development, ENEA Casaccia Research Center, Rome, Italy
| | - Lukas Mueller
- Boyce Thompson Institute, Cornell University, Ithaca, NY, USA
| | - Luiz Filipe Pereira
- Embrapa Café/Lab. Biotecnologia, Área de Melhoramento Genético, Londrina, Brazil
| | | | - Valerie Poncet
- Institut de Recherche pour le Développement (IRD), Université de Montpellier, Montpellier, France
| | - Stephane Rombauts
- Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium
- Center for Plant Systems Biology, VIB, Ghent, Belgium
| | - David Sankoff
- Department of Mathematics and Statistics, University of Ottawa, Ottawa, Ontario, Canada
| | - Victor A Albert
- Department of Biological Sciences, University at Buffalo, Buffalo, NY, USA.
| | | | - Alexandre de Kochko
- Institut de Recherche pour le Développement (IRD), Université de Montpellier, Montpellier, France.
| | - Patrick Descombes
- Société des Produits Nestlé SA, Nestlé Research, Lausanne, Switzerland.
|
2
|
Gao D. Introduction of Plant Transposon Annotation for Beginners. BIOLOGY 2023; 12:1468. [PMID: 38132293 PMCID: PMC10741241 DOI: 10.3390/biology12121468] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2023] [Revised: 11/21/2023] [Accepted: 11/23/2023] [Indexed: 12/23/2023]
Abstract
Transposons are mobile DNA sequences that contribute large fractions of many plant genomes. They provide exclusive resources for tracking gene and genome evolution and for developing molecular tools for basic and applied research. Despite extensive efforts, it is still challenging to accurately annotate transposons, especially for beginners, as transposon prediction requires expertise in both transposon biology and bioinformatics. Moreover, the complexity of plant genomes and the dynamic evolution of transposons also complicate genome-wide transposon discovery. This review summarizes the three major strategies for transposon detection (repeat-based, structure-based, and homology-based annotation) and introduces the transposon superfamilies identified in plants thus far, along with some related bioinformatics resources for detecting plant transposons. Furthermore, it describes transposon classification and explains why the terms 'autonomous' and 'non-autonomous' cannot be used to classify the superfamilies of transposons. Lastly, this review discusses how to identify misannotated transposons and improve the quality of the transposon database. This review provides helpful information about plant transposons and a beginner's guide to annotating these repetitive sequences.
Affiliation(s)
- Dongying Gao
- Small Grains and Potato Germplasm Research Unit, USDA-ARS, Aberdeen, ID 83210, USA
|
3
|
Orozco-Arias S, Lopez-Murillo LH, Piña JS, Valencia-Castrillon E, Tabares-Soto R, Castillo-Ossa L, Isaza G, Guyot R. Genomic object detection: An improved approach for transposable elements detection and classification using convolutional neural networks. PLoS One 2023; 18:e0291925. [PMID: 37733731 PMCID: PMC10513252 DOI: 10.1371/journal.pone.0291925] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2023] [Accepted: 09/10/2023] [Indexed: 09/23/2023]
Abstract
Analysis of eukaryotic genomes requires the detection and classification of transposable elements (TEs), a crucial but complex and time-consuming task. To improve the performance of tools that accomplish these tasks, machine learning (ML) approaches that leverage computing resources, such as GPUs (graphics processing units) and multiple CPU (central processing unit) cores, have been adopted. However, until now, the use of ML techniques has mostly been limited to the classification of TEs. Herein, a detection-classification strategy (named YORO) based on convolutional neural networks is adapted from computer vision (YOLO) to genomics. This approach enables the detection of genomic objects through the prediction of their position, length, and classification in large DNA sequences such as fully sequenced genomes. As a proof of concept, the internal protein-coding domains of LTR-retrotransposons are used to train the proposed neural network. Precision, recall, accuracy, F1-score, execution times and time ratios, as well as several graphical representations, were used as metrics to measure performance. These promising results open the door for a new generation of Deep Learning tools for genomics. YORO architecture is available at https://github.com/simonorozcoarias/YORO.
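For readers unfamiliar with the performance measures named in this abstract, the following minimal sketch computes precision, recall, accuracy and F1-score from confusion-matrix counts. It is an illustration only and is not taken from YORO; the function name `classification_metrics` and the example counts are assumptions made for this sketch.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard metrics from confusion-matrix counts.

    tp/fp/tn/fn are true positives, false positives, true negatives and
    false negatives. This helper is hypothetical and not part of YORO.
    """
    total = tp + fp + tn + fn
    # Guard every ratio against a zero denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(classification_metrics(tp=8, fp=2, tn=85, fn=5))
```

Reporting F1 alongside accuracy matters when the positive class is rare, since accuracy alone can stay high while most positives are missed.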
Affiliation(s)
- Simon Orozco-Arias
- Department of Computer Science, Universidad Autónoma de Manizales, Manizales, Colombia
- Center for Technology Development Bioprocess and Agroindustry Plant, Department of Systems and Informatics, Universidad de Caldas, Manizales, Colombia
| | | | - Johan S. Piña
- Department of Computer Science, Universidad Autónoma de Manizales, Manizales, Colombia
| | | | - Reinel Tabares-Soto
- Center for Technology Development Bioprocess and Agroindustry Plant, Department of Systems and Informatics, Universidad de Caldas, Manizales, Colombia
- Department of Electronics and Automation, Universidad Autónoma de Manizales, Manizales, Colombia
| | - Luis Castillo-Ossa
- Center for Technology Development Bioprocess and Agroindustry Plant, Department of Systems and Informatics, Universidad de Caldas, Manizales, Colombia
| | - Gustavo Isaza
- Center for Technology Development Bioprocess and Agroindustry Plant, Department of Systems and Informatics, Universidad de Caldas, Manizales, Colombia
| | - Romain Guyot
- Department of Electronics and Automation, Universidad Autónoma de Manizales, Manizales, Colombia
- Institut de Recherche pour le Développement, CIRAD, Univ. Montpellier, Montpellier, France
|
4
|
Orozco-Arias S, Dupeyron M, Gutiérrez-Duque D, Tabares-Soto R, Guyot R. High nucleotide similarity of three Copia lineage LTR retrotransposons among plant genomes. Genome 2023; 66:51-61. [PMID: 36623262 DOI: 10.1139/gen-2022-0026] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/11/2023]
Abstract
Transposable elements (TEs) are mobile elements found in the majority of eukaryotic genomes. TEs deeply impact the structure and evolution of chromosomes and can induce mutations affecting coding genes. In plants, the major group of TEs is long terminal repeat retrotransposons (LTR-RTs). They are classified into superfamilies (Gypsy, Copia) and subclassified into lineages. Horizontal transfer (HT), defined as the nonsexual transmission of genetic material between species, is a process allowing LTR-RTs to invade a new genome. Although this phenomenon was considered rare, recent studies demonstrate numerous transfers of LTR-RTs. This study aims to determine which LTR-RT lineages are shared with high similarity among 69 plant genomes. We identified and classified 88 450 LTR-RTs and determined 143 cases of high similarities between pairs of genomes. Most of them involved three Copia lineages (Oryco/Ivana, Retrofit/Ale, and Tork/Tar/Ikeros). A detailed analysis of three cases of high similarity involving the Tork/Tar/Ikeros group shows an uneven distribution of the elements in the phylogeny and incongruence between phylogenetic tree topologies, indicating that they could have originated from HTs. Overall, our results suggest that LTR-RT Copia lineages share outstanding similarity between distant species and may be involved in HT more frequently than initially estimated.
Affiliation(s)
- Simon Orozco-Arias
- Department of Computer Sciences, Universidad Autónoma de Manizales, Colombia
- Department of Systems and Informatics, Universidad de Caldas, Colombia
| | - Mathilde Dupeyron
- Institut de Recherche pour le Développement, IRD, CIRAD, Université de Montpellier, France
| | | | - Reinel Tabares-Soto
- Department of Systems and Informatics, Universidad de Caldas, Colombia
- Department of Electronics and Automatization, Universidad Autónoma de Manizales, Colombia
| | - Romain Guyot
- Institut de Recherche pour le Développement, IRD, CIRAD, Université de Montpellier, France
- Department of Electronics and Automatization, Universidad Autónoma de Manizales, Colombia
|
5
|
Piña JS, Orozco-Arias S, Tobón-Orozco N, Camargo-Forero L, Tabares-Soto R, Guyot R. G-SAIP: Graphical Sequence Alignment Through Parallel Programming in the Post-Genomic Era. Evol Bioinform Online 2023; 19:11769343221150585. [PMID: 36703866 PMCID: PMC9871978 DOI: 10.1177/11769343221150585] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2022] [Accepted: 12/23/2022] [Indexed: 01/22/2023]
Abstract
A common task in bioinformatics is to compare DNA sequences to identify similarities between organisms at the sequence level. One approach to such comparison is the dot-plot, a 2-dimensional graphical representation used to analyze DNA or protein alignments. Dot-plot alignment software existed before the sequencing revolution, and it remains limited when dealing with large sequences, resulting in very long execution times. High-Performance Computing (HPC) techniques have been successfully used in many applications to reduce computing times, but so far, very few applications for graphical sequence alignment using HPC have been reported. Here, we present G-SAIP (Graphical Sequence Alignment in Parallel), software capable of spawning multiple distributed processes on CPUs over a supercomputing infrastructure to speed up dot-plot generation by up to 1.68× compared with the fastest current tools. It improves the efficiency of comparative structural genomic analysis and phylogenetics, which benefit from pairwise alignments for comparison between genomes, as well as repetitive structure identification and assembly quality checking.
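The dot-plot idea described in this abstract can be illustrated with a small serial sketch: a cell (i, j) is marked when the k-mer starting at position i of the first sequence equals the k-mer starting at position j of the second. This is not G-SAIP's implementation and does no parallel processing; the function names `dot_plot` and `render`, the parameter `k`, and the toy sequences are assumptions chosen for illustration.

```python
def dot_plot(seq_a, seq_b, k=3):
    """Return a boolean matrix of exact k-mer matches.

    matrix[i][j] is True when seq_a[i:i+k] == seq_b[j:j+k].
    Hypothetical helper for illustration; not taken from G-SAIP.
    """
    rows = max(0, len(seq_a) - k + 1)
    cols = max(0, len(seq_b) - k + 1)
    # Index every k-mer of seq_b by its start positions so each k-mer
    # of seq_a is looked up once instead of compared against all columns.
    index = {}
    for j in range(cols):
        index.setdefault(seq_b[j:j + k], []).append(j)
    matrix = [[False] * cols for _ in range(rows)]
    for i in range(rows):
        for j in index.get(seq_a[i:i + k], []):
            matrix[i][j] = True
    return matrix


def render(matrix):
    """Draw the matrix as text: '*' for a match, '.' otherwise."""
    return "\n".join(
        "".join("*" if cell else "." for cell in row) for row in matrix
    )


if __name__ == "__main__":
    # Identical toy sequences produce a main diagonal of matches.
    print(render(dot_plot("ACGTACGT", "ACGTACGT", k=3)))
```

Runs of matches along a diagonal indicate shared segments, and off-diagonal runs within a self-comparison indicate repeats. Because the rows are independent of one another, the work could in principle be divided among processes, which is the kind of decomposition a parallel tool would exploit.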
Affiliation(s)
- Johan S. Piña
- Department of Data Science, People Contact, Manizales, Caldas, Colombia
- Department of Computer Science, Universidad Autónoma de Manizales, Manizales, Caldas, Colombia
- Johan S. Piña, Department of Computer Science, Universidad Autónoma de Manizales, Antigua estación del ferrocarril, Manizales, Caldas 170004, Colombia.
| | - Simon Orozco-Arias
- Department of Computer Science, Universidad Autónoma de Manizales, Manizales, Caldas, Colombia
- Department of Systems and Informatics, Universidad de Caldas, Manizales, Caldas, Colombia
| | - Nicolas Tobón-Orozco
- Department of Computer Science, Universidad Autónoma de Manizales, Manizales, Caldas, Colombia
| | | | - Reinel Tabares-Soto
- Department of Electronics and Automation, Universidad Autónoma de Manizales, Manizales, Caldas, Colombia
| | - Romain Guyot
- Department of Electronics and Automation, Universidad Autónoma de Manizales, Manizales, Caldas, Colombia
- Institut de Recherche pour le Développement, CIRAD, University of Montpellier, Montpellier, France
|
6
|
Orozco-Arias S, Gaviria-Orrego S, Tabares-Soto R, Isaza G, Guyot R. InpactorDB: A Plant LTR Retrotransposon Reference Library. Methods Mol Biol 2023; 2703:31-44. [PMID: 37646935 DOI: 10.1007/978-1-0716-3389-2_3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/01/2023]
Abstract
LTR retrotransposons (LTR-RTs) are major components of plant genomes. These transposable elements participate in the structure and evolution of genes and genomes through their mobility and their copy number amplification. For example, they are commonly used as evolutionary markers in genetic, genomic, and cytogenetic approaches. However, the plant research community is faced with the near absence of freely available, full-length, curated, and lineage-level classified LTR retrotransposon reference sequences. In this chapter, we will introduce InpactorDB, an LTR retrotransposon sequence database of 181 plant species representing 98 plant families for a total of 67,241 non-redundant elements. We will introduce how to use newly sequenced genomes to identify and classify LTR-RTs in a similar way with a standardized procedure using the Inpactor tool. InpactorDB is freely available at https://inpactordb.github.io.
Affiliation(s)
- Simon Orozco-Arias
- Department of Computer Science, Universidad Autónoma de Manizales, Manizales, Caldas, Colombia
- Department of Systems and Informatics, Universidad de Caldas, Manizales, Caldas, Colombia
| | - Simon Gaviria-Orrego
- Department of Computer Science, Universidad Autónoma de Manizales, Manizales, Caldas, Colombia
| | - Reinel Tabares-Soto
- Department of Electronics and Automation, Universidad Autónoma de Manizales, Manizales, Caldas, Colombia
- Department of Systems and Informatics, Universidad de Caldas, Manizales, Caldas, Colombia
| | - Gustavo Isaza
- Department of Systems and Informatics, Universidad de Caldas, Manizales, Caldas, Colombia
| | - Romain Guyot
- Department of Electronics and Automation, Universidad Autónoma de Manizales, Manizales, Caldas, Colombia.
- Institut de Recherche pour le Développement, CIRAD, University of Montpellier, Montpellier, France.
|
7
|
Kirov I, Merkulov P, Polkhovskaya E, Konstantinov Z, Kazancev M, Saenko K, Polkhovskiy A, Dudnikov M, Garibyan T, Demurin Y, Soloviev A. Epigenetic Stress and Long-Read cDNA Sequencing of Sunflower (Helianthus annuus L.) Revealed the Origin of the Plant Retrotranscriptome. PLANTS (BASEL, SWITZERLAND) 2022; 11:3579. [PMID: 36559691 PMCID: PMC9784723 DOI: 10.3390/plants11243579] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 11/17/2022] [Revised: 12/13/2022] [Accepted: 12/13/2022] [Indexed: 06/12/2023]
Abstract
Transposable elements (TEs) contribute not only to genome diversity but also to transcriptome diversity in plants. To unravel the sources of LTR retrotransposon (RTE) transcripts in sunflower, we exploited a recently developed transposon activation method ('TEgenesis') along with long-read cDNA Nanopore sequencing. This approach allows for the identification of 56 RTE transcripts from different genomic loci including full-length and non-autonomous RTEs. Using the mobilome analysis, we provided a new set of expressed and transpositionally active sunflower RTEs for future studies. Among them, a Ty3/Gypsy RTE called SUNTY3 exhibited ongoing transposition activity, as detected by eccDNA analysis. We showed that the sunflower genome contains a diverse set of non-autonomous RTEs encoding a single RTE protein, including the previously described TR-GAG (terminal repeat with the GAG domain) as well as new categories, TR-RT-RH, TR-RH, and TR-INT-RT. Our results demonstrate that 40% of the loci for RTE-related transcripts (nonLTR-RTEs) lack their LTR sequences and resemble conventional eukaryotic genes encoding RTE-related proteins with unknown functions. It was evident based on phylogenetic analysis that three nonLTR-RTEs encode GAG (HadGAG1-3) fused to a host protein. These HadGAG proteins have homologs found in other plant species, potentially indicating GAG domestication. Ultimately, we found that the sunflower retrotranscriptome originated from the transcription of active RTEs, non-autonomous RTEs, and gene-like RTE transcripts, including those encoding domesticated proteins.
Affiliation(s)
- Ilya Kirov
- All-Russia Research Institute of Agricultural Biotechnology, Timiryazevskaya Str. 42, 127550 Moscow, Russia
- Moscow Institute of Physics and Technology, 141701 Dolgoprudny, Russia
| | - Pavel Merkulov
- All-Russia Research Institute of Agricultural Biotechnology, Timiryazevskaya Str. 42, 127550 Moscow, Russia
- Moscow Institute of Physics and Technology, 141701 Dolgoprudny, Russia
| | - Ekaterina Polkhovskaya
- All-Russia Research Institute of Agricultural Biotechnology, Timiryazevskaya Str. 42, 127550 Moscow, Russia
| | - Zakhar Konstantinov
- All-Russia Research Institute of Agricultural Biotechnology, Timiryazevskaya Str. 42, 127550 Moscow, Russia
| | - Mikhail Kazancev
- All-Russia Research Institute of Agricultural Biotechnology, Timiryazevskaya Str. 42, 127550 Moscow, Russia
- Moscow Institute of Physics and Technology, 141701 Dolgoprudny, Russia
| | - Ksenia Saenko
- All-Russia Research Institute of Agricultural Biotechnology, Timiryazevskaya Str. 42, 127550 Moscow, Russia
- Federal Research Center of Biological Plant Protection, 350039 Krasnodar, Russia
| | - Alexander Polkhovskiy
- All-Russia Research Institute of Agricultural Biotechnology, Timiryazevskaya Str. 42, 127550 Moscow, Russia
- Moscow Institute of Physics and Technology, 141701 Dolgoprudny, Russia
- Skolkovo Institute of Science and Technology, 121205 Moscow, Russia
| | - Maxim Dudnikov
- All-Russia Research Institute of Agricultural Biotechnology, Timiryazevskaya Str. 42, 127550 Moscow, Russia
- Moscow Institute of Physics and Technology, 141701 Dolgoprudny, Russia
| | - Tsovinar Garibyan
- All-Russia Research Institute of Agricultural Biotechnology, Timiryazevskaya Str. 42, 127550 Moscow, Russia
| | - Yakov Demurin
- Pustovoit All-Russia Research Institute of Oilseed Crops, Filatova St. 17, 350038 Krasnodar, Russia
| | - Alexander Soloviev
- All-Russia Research Institute of Agricultural Biotechnology, Timiryazevskaya Str. 42, 127550 Moscow, Russia
|
8
|
Orozco-Arias S, Humberto Lopez-Murillo L, Candamil-Cortés MS, Arias M, Jaimes PA, Rossi Paschoal A, Tabares-Soto R, Isaza G, Guyot R. Inpactor2: a software based on deep learning to identify and classify LTR-retrotransposons in plant genomes. Brief Bioinform 2022; 24:6887110. [PMID: 36502372 PMCID: PMC9851300 DOI: 10.1093/bib/bbac511] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/01/2022] [Revised: 10/13/2022] [Accepted: 10/26/2022] [Indexed: 12/14/2022]
Abstract
LTR-retrotransposons are the most abundant repeat sequences in plant genomes and play an important role in evolution and biodiversity. Their characterization is of great importance to understand their dynamics. However, the identification and classification of these elements remains a challenge today. Moreover, current software can be relatively slow (from hours to days), sometimes involve a lot of manual work and do not reach satisfactory levels in terms of precision and sensitivity. Here we present Inpactor2, an accurate and fast application that creates LTR-retrotransposon reference libraries in a very short time. Inpactor2 takes an assembled genome as input and follows a hybrid approach (deep learning and structure-based) to detect elements, filter partial sequences and finally classify intact sequences into superfamilies and, as very few tools do, into lineages. This tool takes advantage of multi-core and GPU architectures to decrease execution times. Using the rice genome, Inpactor2 showed a run time of 5 minutes (faster than other tools) and has the best accuracy and F1-Score of the tools tested here, also having the second best accuracy and specificity only surpassed by EDTA, but achieving 28% higher sensitivity. For large genomes, Inpactor2 is up to seven times faster than other available bioinformatics tools.
Affiliation(s)
- Simon Orozco-Arias
- Corresponding author. Computer Science Department, Universidad Autónoma de Manizales, Antigua Estación del Ferrocarril, Manizales, Colombia. Tel.: +57(606)8727272 - 8727709 Ext 102
| | | | | | - Maradey Arias
- Department of Computer Science, Universidad Autónoma de Manizales, 170001, Caldas, Colombia
| | - Paula A Jaimes
- Department of Computer Science, Universidad Autónoma de Manizales, 170001, Caldas, Colombia
| | - Alexandre Rossi Paschoal
- Corresponding author. Department of Computer Science, Bioinformatics and Pattern Recognition Group, Graduation Program in Bioinformatics, Federal University of Technology - Paraná, UTFPR, Cornélio Procópio, Paraná, 86300-000, Brazil. Tel.: +433133-3790
| | - Reinel Tabares-Soto
- Department of Electronics and Automation, Universidad Autónoma de Manizales, 170001, Caldas, Colombia
| | - Gustavo Isaza
- Corresponding author. Systems and Informatics Department, Center for Technology Development - Bioprocess and Agro-industry Plant, Universidad de Caldas, St 65 #26-10, Manizales, Colombia. Tel.: +57(606)8781500 ext 13146
| | - Romain Guyot
- Corresponding authors. Simon Orozco-Arias, Computer Science Department, Universidad Autónoma de Manizales, Antigua Estación del Ferrocarrill, Manizalez, Colombia. Tel.: +57(606)8727272 - 8727709 Ext 102; E-mail: ; Alexandre Rossi Paschoal, Department of Computer Science, Bioinformatics and Pattern Recognition Group, Graduation Program in Bioinformatics, Federal University of Technology - Paraná, UTFPR, Cornélio Procópio, Paraná, 86300-000, Brazil. Tel.: +433133-3790; E-mail: ; Gustavo Isaza, Systems and Informatics Department, Center for Technology Development - Bioprocess and Agro-industry Plant, Universidad de Caldas, St 65 #26-10, Manizales, Colombia. Tel.: +57(606)8781500 ext 13146; E-mail: , Romain Guyot, IRD, 911 Av. Agropolis, 34394 Montpellier, France. Tel.: +334674160000; E-mail:
| |
|
9
|
Finding and Characterizing Repeats in Plant Genomes. Methods in Molecular Biology (Clifton, N.J.) 2022; 2443:327-385. [PMID: 35037215 DOI: 10.1007/978-1-0716-2067-0_18] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 01/01/2023]
Abstract
Plant genomes contain a particularly high proportion of repeated structures of various types. This chapter proposes a guided tour of the available software that can help biologists to scan automatically for these repeats in sequence data or check hypothetical models intended to characterize their structures. Since transposable elements (TEs) are a major source of repeats in plants, many methods have been used or developed for this broad class of sequences. They are representative of the range of tools available for other classes of repeats and we have provided two sections on this topic (for the analysis of genomes or directly of sequenced reads), as well as a selection of the main existing software. It may be hard to keep up with the profusion of proposals in this dynamic field and the rest of the chapter is devoted to the foundations of an efficient search for repeats and more complex patterns. We first introduce the key concepts of the art of indexing and mapping or querying sequences. We end the chapter with the more prospective issue of building models of repeat families. We present the Machine Learning approach first, seeking to build predictors automatically for some families of TEs, from a set of sequences known to belong to these families. A second approach, the linguistic (or syntactic) approach, allows biologists to describe models of their favorite repeat family themselves and to check the validity of these models.
|
10
|
Raharimalala N, Rombauts S, McCarthy A, Garavito A, Orozco-Arias S, Bellanger L, Morales-Correa AY, Froger S, Michaux S, Berry V, Metairon S, Fournier C, Lepelley M, Mueller L, Couturon E, Hamon P, Rakotomalala JJ, Descombes P, Guyot R, Crouzillat D. The absence of the caffeine synthase gene is involved in the naturally decaffeinated status of Coffea humblotiana, a wild species from Comoro archipelago. Sci Rep 2021; 11:8119. [PMID: 33854089 PMCID: PMC8046976 DOI: 10.1038/s41598-021-87419-0] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 11/09/2020] [Accepted: 03/23/2021] [Indexed: 02/02/2023]
Abstract
Caffeine is the most consumed alkaloid stimulant in the world. It is synthesized through the activity of three known N-methyltransferase proteins. Here we report the 422-Mb chromosome-level assembly of the Coffea humblotiana genome, a wild and endangered, naturally caffeine-free species from the Comoro archipelago. We predicted 32,874 genes and anchored 88.7% of the sequence onto the 11 chromosomes. Comparative analyses with the African Robusta coffee genome (C. canephora) revealed extensive genome conservation, despite an estimated 11 million years of divergence and a broad diversity of genome sizes within the Coffea genus. In this genome, the absence of caffeine is likely due to the absence, through an illegitimate recombination mechanism, of the caffeine synthase gene, which converts theobromine into caffeine. These findings pave the way for further characterization of caffeine-free species in the Coffea genus and will guide research towards naturally decaffeinated coffee drinks for consumers.
Affiliation(s)
- Nathalie Raharimalala
- Centre National de Recherche Appliquée au Développement Rural, BP 1444, 101 Ambatobe, Antananarivo, Madagascar
| | - Stephane Rombauts
- Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium; VIB Center for Plant Systems Biology, 9052 Gent, Belgium
| | - Andrew McCarthy
- European Molecular Biology Laboratory, 71 Avenue des Martyrs, CS 90181, 38042 Grenoble Cedex 9, France
| | - Andréa Garavito
- Departamento de Ciencias Biológicas, Facultad de Ciencias Exactas y Naturales, Universidad de Caldas, Manizales, Colombia; Centro de Bioinformática y biología computacional de Colombia – BIOS, Ecoparque los Yarumos, Manizales, Caldas, Colombia
| | - Simon Orozco-Arias
- Department of Systems and Informatics, Universidad de Caldas, Manizales, Colombia; Universidad Autónoma de Manizales, Manizales, Colombia
| | - Laurence Bellanger
- Nestle Research-Plant Science Research Unit, BP 49716, 37097 Tours Cedex 2, France
| | - Alexa Yadira Morales-Correa
- Departamento de Ciencias Biológicas, Facultad de Ciencias Exactas y Naturales, Universidad de Caldas, Manizales, Colombia
| | - Solène Froger
- Nestle Research-Plant Science Research Unit, BP 49716, 37097 Tours Cedex 2, France
| | - Stéphane Michaux
- Nestle Research-Plant Science Research Unit, BP 49716, 37097 Tours Cedex 2, France
| | - Victoria Berry
- Nestle Research-Plant Science Research Unit, BP 49716, 37097 Tours Cedex 2, France
| | - Sylviane Metairon
- Nestle Research, Société des Produits Nestlé SA, 1015 Lausanne, Switzerland
| | - Coralie Fournier
- Nestle Research, Société des Produits Nestlé SA, 1015 Lausanne, Switzerland; Present Address: University of Geneva, CMU-Décanat, 1 Rue Michel Servet, 1211 Geneva 4, Switzerland
| | - Maud Lepelley
- Nestle Research-Plant Science Research Unit, BP 49716, 37097 Tours Cedex 2, France
| | - Lukas Mueller
- Boyce Thompson Institute for Plant Research, Cornell University, Ithaca, NY 14853, USA
| | - Emmanuel Couturon
- Institut de Recherche pour le Développement, UMR DIADE, Université de Montpellier, Montpellier, France
| | - Perla Hamon
- Institut de Recherche pour le Développement, UMR DIADE, Université de Montpellier, Montpellier, France
| | - Jean-Jacques Rakotomalala
- Centre National de Recherche Appliquée au Développement Rural, BP 1444, 101 Ambatobe, Antananarivo, Madagascar
| | - Patrick Descombes
- Nestle Research, Société des Produits Nestlé SA, 1015 Lausanne, Switzerland
| | - Romain Guyot
- Universidad Autónoma de Manizales, Manizales, Colombia; Institut de Recherche pour le Développement, UMR DIADE, Université de Montpellier, Montpellier, France
| | - Dominique Crouzillat
- Nestle Research-Plant Science Research Unit, BP 49716, 37097 Tours Cedex 2, France
| |
|
11
|
Orozco-Arias S, Jaimes PA, Candamil MS, Jiménez-Varón CF, Tabares-Soto R, Isaza G, Guyot R. InpactorDB: A Classified Lineage-Level Plant LTR Retrotransposon Reference Library for Free-Alignment Methods Based on Machine Learning. Genes (Basel) 2021; 12:genes12020190. [PMID: 33525408 PMCID: PMC7910972 DOI: 10.3390/genes12020190] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 12/30/2020] [Revised: 01/21/2021] [Accepted: 01/22/2021] [Indexed: 12/04/2022]
Abstract
Long terminal repeat (LTR) retrotransposons are mobile elements that constitute the major fraction of most plant genomes. The identification and annotation of these elements via bioinformatics approaches represent a major challenge in the era of massive plant genome sequencing. In addition to their involvement in genome size variation, LTR retrotransposons are also associated with the function and structure of different chromosomal regions and can alter the function of coding regions, among others. Several sequence databases of plant LTR retrotransposons are available for public access, such as PGSB and RepetDB, or restricted access such as Repbase. Although these databases are useful to identify LTR-RTs in new genomes by similarity, the elements of these databases are not fully classified to the lineage (also called family) level. Here, we present InpactorDB, a semi-curated dataset composed of 130,439 elements from 195 plant genomes (belonging to 108 plant species) classified to the lineage level. This dataset has been used to train two deep neural networks (i.e., one fully connected and one convolutional) for the rapid classification of these elements. In lineage-level classification approaches, we obtain up to 98% performance, indicated by the F1-score, precision and recall scores.
Affiliation(s)
- Simon Orozco-Arias
- Department of Computer Science, Universidad Autónoma de Manizales, 170002 Manizales, Colombia
- Department of Systems and Informatics, Universidad de Caldas, 170002 Manizales, Colombia
- Correspondence: S.O.-A.; R.G.
| | - Paula A. Jaimes
- Department of Computer Science, Universidad Autónoma de Manizales, 170002 Manizales, Colombia
| | - Mariana S. Candamil
- Department of Computer Science, Universidad Autónoma de Manizales, 170002 Manizales, Colombia
| | | | - Reinel Tabares-Soto
- Department of Electronics and Automation, Universidad Autónoma de Manizales, 170002 Manizales, Colombia
| | - Gustavo Isaza
- Department of Systems and Informatics, Universidad de Caldas, 170002 Manizales, Colombia
| | - Romain Guyot
- Department of Electronics and Automation, Universidad Autónoma de Manizales, 170002 Manizales, Colombia
- Institut de Recherche pour le Développement, CIRAD, University of Montpellier, 34394 Montpellier, France
- Correspondence: S.O.-A.; R.G.
| |
|
12
|
Orozco-Arias S, Tobon-Orozco N, Piña JS, Jiménez-Varón CF, Tabares-Soto R, Guyot R. TIP_finder: An HPC Software to Detect Transposable Element Insertion Polymorphisms in Large Genomic Datasets. Biology 2020; 9:E281. [PMID: 32917036 PMCID: PMC7563458 DOI: 10.3390/biology9090281] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/27/2020] [Revised: 09/01/2020] [Accepted: 09/07/2020] [Indexed: 12/12/2022]
Abstract
Transposable elements (TEs) are mobile genomic units capable of moving from one chromosomal location to another. Their insertion polymorphisms may cause beneficial mutations, such as the creation of new gene functions, or deleterious ones in eukaryotes, e.g., different types of cancer in humans. A particular type of TE called LTR retrotransposons comprises almost 8% of the human genome. Among LTR retrotransposons, human endogenous retroviruses (HERVs) bear structural and functional similarities to retroviruses. Several tools allow the detection of transposon insertion polymorphisms (TIPs) but fail to efficiently analyze large genomes or large datasets. Here, we developed a computational tool, named TIP_finder, able to detect mobile element insertions in very large genomes, through high-performance computing (HPC) and parallel programming, using discordant read pair analysis. TIP_finder inputs are (i) short paired-end reads such as those obtained by Illumina, (ii) a chromosome-level reference genome sequence, and (iii) a database of consensus TE sequences. The HPC strategy we propose adds scalability and provides a useful tool to analyze huge genomic datasets in a reasonable running time. TIP_finder accelerates the detection of TIPs by up to 55 times in breast cancer datasets and 46 times in cancer-free datasets compared to the fastest available algorithms. TIP_finder applies a validated strategy to find TIPs, accelerates the process through HPC, and addresses the issues of runtime for large-scale analyses in the post-genomic era. TIP_finder version 1.0 is available at https://github.com/simonorozcoarias/TIP_finder.
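The discordant read pair idea mentioned in this abstract can be illustrated with a minimal Python sketch. It is not TIP_finder's actual implementation: the `Mate` record, the `candidate_windows` function, the 10 kb window size and the support threshold are all hypothetical choices made for illustration. The sketch assumes each read pair has already been mapped, with every mate labelled as hitting either the reference genome or a TE consensus sequence, and it counts pairs where one mate lands on the genome and the other on a TE.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple, Tuple


class Mate(NamedTuple):
    """One mapped mate of a read pair (hypothetical record layout)."""
    ref: str    # "genome" or "te": which reference the mate mapped to
    chrom: str  # chromosome name, or TE consensus name when ref == "te"
    pos: int    # 0-based mapping position


def candidate_windows(
    pairs: Iterable[Tuple[Mate, Mate]],
    window: int = 10_000,   # assumed window size, not taken from the paper
    min_support: int = 2,   # assumed minimum number of supporting pairs
) -> Dict[Tuple[str, int], int]:
    """Count discordant pairs (one mate on the genome, one on a TE) per window."""
    support: Counter = Counter()
    for m1, m2 in pairs:
        # Keep only pairs that bridge the genome and a TE consensus.
        if {m1.ref, m2.ref} != {"genome", "te"}:
            continue
        genome_mate = m1 if m1.ref == "genome" else m2
        window_start = genome_mate.pos // window * window
        support[(genome_mate.chrom, window_start)] += 1
    # Report only windows with enough supporting pairs.
    return {w: n for w, n in support.items() if n >= min_support}


pairs = [
    (Mate("genome", "chr1", 12_000), Mate("te", "TE_A", 150)),
    (Mate("te", "TE_A", 900), Mate("genome", "chr1", 15_500)),
    (Mate("genome", "chr1", 40_000), Mate("genome", "chr1", 40_300)),  # concordant
    (Mate("genome", "chr2", 500), Mate("te", "TE_B", 20)),             # single support
]
result = candidate_windows(pairs)
print(result)
```

Windows that collect several such bridging pairs are candidate insertion sites. Because each window is counted independently, this kind of step lends itself to the parallel execution the abstract describes.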
Affiliation(s)
- Simon Orozco-Arias
- Department of Computer Science, Universidad Autónoma de Manizales, Manizales 170002, Colombia
- Department of Systems and Informatics, Universidad de Caldas, Manizales 170002, Colombia
| | - Nicolas Tobon-Orozco
- Department of Computer Science, Universidad Autónoma de Manizales, Manizales 170002, Colombia
| | - Johan S. Piña
- Department of Computer Science, Universidad Autónoma de Manizales, Manizales 170002, Colombia
| | | | - Reinel Tabares-Soto
- Department of Electronics and Automation, Universidad Autónoma de Manizales, Manizales 170002, Colombia
| | - Romain Guyot
- Department of Electronics and Automation, Universidad Autónoma de Manizales, Manizales 170002, Colombia
- Institut de Recherche pour le Développement (IRD), CIRAD, Université de Montpellier, 34394 Montpellier, France
| |
|
13
|
Structural and Functional Annotation of Transposable Elements Revealed a Potential Regulation of Genes Involved in Rubber Biosynthesis by TE-Derived siRNA Interference in Hevea brasiliensis. Int J Mol Sci 2020; 21:ijms21124220. [PMID: 32545790 PMCID: PMC7353026 DOI: 10.3390/ijms21124220] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 05/05/2020] [Revised: 06/09/2020] [Accepted: 06/11/2020] [Indexed: 12/14/2022]
Abstract
The natural rubber biosynthetic pathway is well described in Hevea, although the final stages of rubber elongation are still poorly understood. Small Rubber Particle Proteins and Rubber Elongation Factors (SRPPs and REFs) are proteins with a major function in rubber particle formation and stabilization. Their corresponding genes are clustered on scaffold1222 of the reference genomic sequence of Hevea brasiliensis. Apart from gene expression studies based on transcriptomic analyses, no in-depth analyses of the genomic environment of the SRPP and REF loci have been carried out to date. Through integrative analyses of transposable element annotation, small RNA production and gene expression, we analysed their role in the control of the transcription of rubber biosynthetic genes. The first in-depth annotation of TEs (Transposable Elements) and their capacity to produce TE-derived siRNAs (small interfering RNAs) is presented, which was only possible in the Hevea brasiliensis clone PB 260, for which all data are available. We observed that 11% of genes are located near TEs, whose presence may interfere with gene transcription at both the genetic and epigenetic levels. We hypothesized that the genomic environment of rubber biosynthesis genes has been shaped by TEs and TE-derived siRNAs, with possible transcriptional interference on their gene expression. We discussed the possible functionalization of TEs as enhancers and as donors of alternative transcription start sites in promoter sequences, possibly through the modelling of genetic and epigenetic landscapes.
|
14
|
Measuring Performance Metrics of Machine Learning Algorithms for Detecting and Classifying Transposable Elements. Processes (Basel) 2020. [DOI: 10.3390/pr8060638] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Indexed: 12/13/2022]
Abstract
Because of the promising results obtained by machine learning (ML) approaches in several fields, the use of ML to solve problems in bioinformatics is becoming more common every day. In genomics, a current issue is to detect and classify transposable elements (TEs) because of the tedious tasks involved in bioinformatics methods. Thus, ML was recently evaluated for TE datasets, demonstrating better results than bioinformatics applications. A crucial step for ML approaches is the selection of metrics that measure the realistic performance of algorithms. Each metric has specific characteristics and measures properties that may be different from the predicted results. Although the most commonly used way to compare measures is by using empirical analysis, a non-result-based methodology has been proposed, called measure invariance properties. These properties are calculated on the basis of whether a given measure changes its value under certain modifications in the confusion matrix, giving comparative parameters independent of the datasets. Measure invariance properties make metrics more or less informative, particularly on unbalanced, monomodal, or multimodal negative class datasets and for real or simulated datasets. Although several studies applied ML to detect and classify TEs, there are no works evaluating performance metrics in TE tasks. Here, we analyzed 26 different metrics utilized in binary, multiclass, and hierarchical classifications, through bibliographic sources, and their invariance properties. Then, we corroborated our findings utilizing freely available TE datasets and commonly used ML algorithms. Based on our analysis, the most suitable metrics for TE tasks must be stable, even using highly unbalanced datasets, multimodal negative class, and training datasets with errors or outliers. Based on these parameters, we conclude that the F1-score and the area under the precision-recall curve are the most informative metrics since they are calculated based on other metrics, providing insight into the development of an ML application.
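One invariance property of the kind this abstract describes can be checked with a short Python sketch. The `Confusion` class, the function names and the counts below are illustrative assumptions and are not taken from the cited paper. The sketch modifies only the true-negative cell of a binary confusion matrix and compares how accuracy and the F1-score respond.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Binary confusion matrix counts (illustrative values only)."""
    tp: int
    fp: int
    fn: int
    tn: int


def accuracy(c: Confusion) -> float:
    return (c.tp + c.tn) / (c.tp + c.fp + c.fn + c.tn)


def precision(c: Confusion) -> float:
    return c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0


def recall(c: Confusion) -> float:
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0


def f1(c: Confusion) -> float:
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Same classifier behaviour on the positive class; only the number of
# true negatives changes, as on a dataset with many more negatives.
base = Confusion(tp=80, fp=20, fn=20, tn=80)
inflated = Confusion(tp=80, fp=20, fn=20, tn=8000)

for name, c in (("base", base), ("inflated", inflated)):
    print(f"{name}: accuracy={accuracy(c):.4f} f1={f1(c):.4f}")
```

Accuracy rises when true negatives are added, while the F1-score stays the same because it never uses the true-negative count. This is one way to read the abstract's preference for metrics that remain stable on highly unbalanced datasets.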
|
15
|
Orozco-Arias S, Isaza G, Guyot R, Tabares-Soto R. A systematic review of the application of machine learning in the detection and classification of transposable elements. PeerJ 2019; 7:e8311. [PMID: 31976169 PMCID: PMC6967008 DOI: 10.7717/peerj.8311] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 08/05/2019] [Accepted: 11/28/2019] [Indexed: 12/16/2022]
Abstract
Background: Transposable elements (TEs) constitute the most common repeated sequences in eukaryotic genomes. Recent studies demonstrated their deep impact on species diversity, adaptation to the environment and diseases. Although there are many conventional bioinformatics algorithms for detecting and classifying TEs, none have achieved reliable results on different types of TEs. Machine learning (ML) techniques can automatically extract hidden patterns and novel information from labeled or non-labeled data and have been applied to solving several scientific problems. Methodology: We followed the Systematic Literature Review (SLR) process, applying the six stages of its review protocol, but added a preliminary stage, which aims to detect the need for a review. Then search equations were formulated and executed in several literature databases. Relevant publications were scanned and used to extract evidence to answer research questions. Results: Several ML approaches have already been tested on other bioinformatics problems with promising results, yet there are few algorithms and architectures available in the literature focused specifically on TEs, despite TEs representing the majority of the nuclear DNA of many organisms. Only 35 articles were found and categorized as relevant in TE or related fields. Conclusions: ML is a powerful tool that can be used to address many problems. Although ML techniques have been used widely in other biological tasks, their utilization in TE analyses is still limited. Following the SLR, it was possible to notice that the use of ML for TE analyses (detection and classification) is an open problem, and this new field of research is growing in interest.
Affiliation(s)
- Simon Orozco-Arias
- Department of Computer Science, Universidad Autónoma de Manizales, Manizales, Caldas, Colombia.,Department of Systems and Informatics, Universidad de Caldas, Manizales, Caldas, Colombia
| | - Gustavo Isaza
- Department of Systems and Informatics, Universidad de Caldas, Manizales, Caldas, Colombia
| | - Romain Guyot
- Institut de Recherche pour le Développement, CIRAD, University of Montpellier, Montpellier, France.,Department of Electronics and Automation, Universidad Autónoma de Manizales, Manizales, Caldas, Colombia
| | - Reinel Tabares-Soto
- Department of Electronics and Automation, Universidad Autónoma de Manizales, Manizales, Caldas, Colombia
| |
|
16
|
Jiang F, Zhang J, Wang S, Yang L, Luo Y, Gao S, Zhang M, Wu S, Hu S, Sun H, Wang Y. The apricot (Prunus armeniaca L.) genome elucidates Rosaceae evolution and beta-carotenoid synthesis. Horticulture Research 2019; 6:128. [PMID: 31754435 PMCID: PMC6861294 DOI: 10.1038/s41438-019-0215-6] [Citation(s) in RCA: 66] [Impact Index Per Article: 13.2] [Received: 06/21/2019] [Revised: 10/09/2019] [Accepted: 10/23/2019] [Indexed: 05/23/2023]
Abstract
Apricots, scientifically known as Prunus armeniaca L, are drupes that resemble and are closely related to peaches or plums. As one of the top consumed fruits, apricots are widely grown worldwide except in Antarctica. A high-quality reference genome for apricot is still unavailable, which has become a handicap that has dramatically limited the elucidation of the associations of phenotypes with the genetic background, evolutionary diversity, and population diversity in apricot. DNA from P. armeniaca was used to generate a standard, size-selected library with an average DNA fragment size of ~20 kb. The library was run on Sequel SMRT Cells, generating a total of 16.54 Gb of PacBio subreads (N50 = 13.55 kb). The high-quality P. armeniaca reference genome presented here was assembled using long-read single-molecule sequencing at approximately 70× coverage and 171× Illumina reads (40.46 Gb), combined with a genetic map for chromosome scaffolding. The assembled genome size was 221.9 Mb, with a contig NG50 size of 1.02 Mb. Scaffolds covering 92.88% of the assembled genome were anchored on eight chromosomes. Benchmarking Universal Single-Copy Orthologs analysis showed 98.0% complete genes. We predicted 30,436 protein-coding genes, and 38.28% of the genome was predicted to be repetitive. We found 981 contracted gene families, 1324 expanded gene families and 2300 apricot-specific genes. The differentially expressed gene (DEG) analysis indicated that a change in the expression of the 9-cis-epoxycarotenoid dioxygenase (NCED) gene but not lycopene beta-cyclase (LcyB) gene results in a low β-carotenoid content in the white cultivar "Dabaixing". This complete and highly contiguous P. armeniaca reference genome will be of help for future studies of resistance to plum pox virus (PPV) and the identification and characterization of important agronomic genes and breeding strategies in apricot.
Affiliation(s)
- Fengchao Jiang
- Beijing Academy of Forestry and Pomology Sciences, 100093 Beijing, PR China
- Apricot Engineering and Technology Research Center, National Forestry and Grassland Administration, 100093 Beijing, PR China
| | - Junhuan Zhang
- Beijing Academy of Forestry and Pomology Sciences, 100093 Beijing, PR China
- Apricot Engineering and Technology Research Center, National Forestry and Grassland Administration, 100093 Beijing, PR China
| | - Sen Wang
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, 100101 Beijing, China
| | - Li Yang
- Beijing Academy of Forestry and Pomology Sciences, 100093 Beijing, PR China
- Apricot Engineering and Technology Research Center, National Forestry and Grassland Administration, 100093 Beijing, PR China
| | - Yingfeng Luo
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, 100101 Beijing, China
| | - Shenghan Gao
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, 100101 Beijing, China
| | - Meiling Zhang
- Beijing Academy of Forestry and Pomology Sciences, 100093 Beijing, PR China
- Apricot Engineering and Technology Research Center, National Forestry and Grassland Administration, 100093 Beijing, PR China
| | - Shuangyang Wu
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, 100101 Beijing, China
| | - Songnian Hu
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, 100101 Beijing, China
| | - Haoyuan Sun
- Beijing Academy of Forestry and Pomology Sciences, 100093 Beijing, PR China
- Apricot Engineering and Technology Research Center, National Forestry and Grassland Administration, 100093 Beijing, PR China
| | - Yuzhu Wang
- Beijing Academy of Forestry and Pomology Sciences, 100093 Beijing, PR China
- Apricot Engineering and Technology Research Center, National Forestry and Grassland Administration, 100093 Beijing, PR China
| |
|
17
|
Orozco-Arias S, Isaza G, Guyot R. Retrotransposons in Plant Genomes: Structure, Identification, and Classification through Bioinformatics and Machine Learning. Int J Mol Sci 2019; 20:E3837. [PMID: 31390781 PMCID: PMC6696364 DOI: 10.3390/ijms20153837] [Citation(s) in RCA: 34] [Impact Index Per Article: 6.8] [Received: 06/21/2019] [Revised: 07/31/2019] [Accepted: 08/02/2019] [Indexed: 01/26/2023]
Abstract
Transposable elements (TEs) are genomic units able to move within the genome of virtually all organisms. Due to their naturally high copy numbers and their high structural diversity, the identification and classification of TEs remain a challenge in sequenced genomes. Although TEs were initially regarded as "junk DNA", it has been demonstrated that they play key roles in chromosome structures, gene expression, and regulation, as well as adaptation and evolution. A highly reliable annotation of these elements is, therefore, crucial to better understand genome functions and their evolution. To date, much bioinformatics software has been developed to address TE detection and classification processes, but many problematic aspects remain, such as the reliability, precision, and speed of the analyses. Machine learning and deep learning are approaches that can make automatic predictions and decisions in a wide variety of scientific applications. They have been tested in bioinformatics and, more specifically, for TE classification, with encouraging results. In this review, we will discuss important aspects of TEs, such as their structure, importance in the evolution and architecture of the host, and their current classifications and nomenclatures. We will also address current methods and their limitations in identifying and classifying TEs.
Affiliation(s)
- Simon Orozco-Arias
- Department of Computer Science, Universidad Autónoma de Manizales, Manizales 170001, Colombia
- Department of Systems and Informatics, Universidad de Caldas, Manizales 170001, Colombia
| | - Gustavo Isaza
- Department of Systems and Informatics, Universidad de Caldas, Manizales 170001, Colombia
| | - Romain Guyot
- Department of Electronics and Automatization, Universidad Autónoma de Manizales, Manizales 170001, Colombia.
- Institut de Recherche pour le Développement, CIRAD, University Montpellier, 34000 Montpellier, France.
| |
|
18
|
Valencia JD, Girgis HZ. LtrDetector: A tool-suite for detecting long terminal repeat retrotransposons de-novo. BMC Genomics 2019; 20:450. [PMID: 31159720 PMCID: PMC6547461 DOI: 10.1186/s12864-019-5796-9] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 11/26/2018] [Accepted: 05/14/2019] [Indexed: 12/19/2022]
Abstract
Background: Long terminal repeat retrotransposons are the most abundant transposons in plants. They play important roles in alternative splicing, recombination, gene regulation, and defense mechanisms. Large-scale sequencing projects for plant genomes are currently underway. Software tools are important for annotating long terminal repeat retrotransposons in these newly available genomes. However, the available tools are not very sensitive to known elements and perform inconsistently on different genomes. Some are hard to install or obsolete. They may struggle to process large plant genomes. None can be executed in parallel out of the box and very few have features to support visual review of new elements. To overcome these limitations, we developed LtrDetector, which uses techniques inspired by signal-processing. Results: We compared LtrDetector to LTR_Finder and LTRharvest, the two most successful predecessor tools, on six plant genomes. For each organism, we constructed a ground truth data set based on queries from a consensus sequence database. According to this evaluation, LtrDetector was the most sensitive tool, achieving 16-23% improvement in sensitivity over LTRharvest and 21% improvement over LTR_Finder. All three tools had low false positive rates, with LtrDetector achieving 98.2% precision, in between its two competitors. Overall, LtrDetector provides the best compromise between high sensitivity and low false positive rate while requiring moderate time and utilizing memory available on personal computers. Conclusions: LtrDetector uses a novel methodology revolving around k-mer distributions, which allows it to produce high-quality results using relatively lightweight procedures. It is easy to install and use. It is not species specific, performing well using its default parameters on genomes of varying size and repeat content. It is automatically configured for parallel execution and runs efficiently on an ordinary personal computer. It includes a k-mer scores visualization tool to facilitate manual review of the identified elements. These features make LtrDetector an attractive tool for future annotation projects involving long terminal repeat retrotransposons.
Affiliation(s)
- Joseph D Valencia
- The Bioinformatics Toolsmith Laboratory, Tandy School of Computer Science, University of Tulsa, 800 South Tucker Drive, Tulsa, 74104, OK, USA
| | - Hani Z Girgis
- The Bioinformatics Toolsmith Laboratory, Tandy School of Computer Science, University of Tulsa, 800 South Tucker Drive, Tulsa, 74104, OK, USA.
| |
|