1
|
Liu X, Zhang H, Zeng Y, Zhu X, Zhu L, Fu J. DRANetSplicer: A Splice Site Prediction Model Based on Deep Residual Attention Networks. Genes (Basel) 2024; 15:404. [PMID: 38674339 PMCID: PMC11048956 DOI: 10.3390/genes15040404] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2024] [Revised: 03/20/2024] [Accepted: 03/23/2024] [Indexed: 04/28/2024]
Abstract
The precise identification of splice sites is essential for unraveling the structure and function of genes, constituting a pivotal step in the gene annotation process. In this study, we developed a novel deep learning model, DRANetSplicer, that integrates residual learning and attention mechanisms for enhanced accuracy in capturing the intricate features of splice sites. We constructed multiple datasets using the most recent versions of genomic data from three different organisms, Oryza sativa japonica, Arabidopsis thaliana and Homo sapiens. This approach allows us to train models with a richer set of high-quality data. DRANetSplicer outperformed benchmark methods on donor and acceptor splice site datasets, achieving an average accuracy of (96.57%, 95.82%) across the three organisms. Comparative analyses with benchmark methods, including SpliceFinder, Splice2Deep, Deep Splicer, EnsembleSplice, and DNABERT, revealed DRANetSplicer's superior predictive performance, resulting in at least a (4.2%, 11.6%) relative reduction in average error rate. We utilized the DRANetSplicer model trained on O. sativa japonica data to predict splice sites in A. thaliana, achieving accuracies for donor and acceptor sites of (94.89%, 94.25%). These results indicate that DRANetSplicer possesses excellent cross-organism predictive capabilities, with its performance in cross-organism predictions even surpassing that of benchmark methods in non-cross-organism predictions. Cross-organism validation showcased DRANetSplicer's excellence in predicting splice sites across similar organisms, supporting its applicability in gene annotation for understudied organisms. We employed multiple methods to visualize the decision-making process of the model. The visualization results indicate that DRANetSplicer can learn and interpret well-known biological features, further validating its overall performance. Our study systematically examined and confirmed the predictive ability of DRANetSplicer from various levels and perspectives, indicating that its practical application in gene annotation is justified.
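The "relative reduction in average error rate" quoted in this abstract can be illustrated with a small calculation. The sketch below assumes the usual definition (error rate = 100% minus accuracy, and the reduction is taken relative to the baseline error). The function name and the baseline accuracy of 96.0% are hypothetical and are not taken from the paper; only the 96.57% donor accuracy comes from the abstract.

```python
def relative_error_reduction(acc_model: float, acc_baseline: float) -> float:
    """Return the relative error-rate reduction (in percent) of a model
    versus a baseline, given both accuracies in percent.

    Assumed definition: error = 100 - accuracy.
    """
    err_model = 100.0 - acc_model
    err_baseline = 100.0 - acc_baseline
    if err_baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return 100.0 * (err_baseline - err_model) / err_baseline


# 96.57% is the average donor-site accuracy reported in the abstract;
# the 96.0% baseline is a made-up value used only for illustration.
reduction = relative_error_reduction(96.57, 96.0)
print(f"relative error reduction: {reduction:.2f}%")
```

Under this definition a small gain in accuracy can correspond to a sizeable relative reduction in error when the baseline is already accurate.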
Affiliation(s)
- Xueyan Liu
- College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China; (X.L.); (X.Z.); (L.Z.); (J.F.)
| | - Hongyan Zhang
- College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China; (X.L.); (X.Z.); (L.Z.); (J.F.)
| | - Ying Zeng
- School of Computer and Communication, Hunan Institute of Engineering, Xiangtan 411104, China;
| | - Xinghui Zhu
- College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China; (X.L.); (X.Z.); (L.Z.); (J.F.)
| | - Lei Zhu
- College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China; (X.L.); (X.Z.); (L.Z.); (J.F.)
| | - Jiahui Fu
- College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China; (X.L.); (X.Z.); (L.Z.); (J.F.)
| |
|
2
|
Bartas M, Volna A, Cerven J, Pucker B. Identification of annotation artifacts concerning the chalcone synthase (CHS). BMC Res Notes 2023; 16:109. [PMID: 37340477 DOI: 10.1186/s13104-023-06386-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/27/2023] [Accepted: 06/13/2023] [Indexed: 06/22/2023]
Abstract
OBJECTIVE Chalcone synthase (CHS) catalyzes the initial step of the flavonoid biosynthesis. The CHS encoding gene is well studied in numerous plant species. Rapidly growing sequence databases contain hundreds of CHS entries that are the result of automatic annotation. In this study, we evaluated apparent multiplication of CHS domains in CHS gene models of four plant species. MAIN FINDINGS CHS genes with an apparent triplication of the CHS domain encoding part were discovered through database searches. Such genes were found in Macadamia integrifolia, Musa balbisiana, Musa troglodytarum, and Nymphaea colorata. A manual inspection of the CHS gene models in these four species with massive RNA-seq data suggests that these gene models are the result of artificial fusions in the annotation process. While there are hundreds of seemingly correct CHS records in the databases, it is not clear why these annotation artifacts appeared.
Affiliation(s)
- Martin Bartas
- Department of Biology and Ecology, Faculty of Science, University of Ostrava, Ostrava, Czech Republic
| | - Adriana Volna
- Department of Physics, Faculty of Science, University of Ostrava, Ostrava, Czech Republic
| | - Jiri Cerven
- Department of Biology and Ecology, Faculty of Science, University of Ostrava, Ostrava, Czech Republic
| | - Boas Pucker
- Institute of Plant Biology & BRICS, TU Braunschweig, Braunschweig, Germany.
| |
|
3
|
Pucker B, Iorizzo M. Apiaceae FNS I originated from F3H through tandem gene duplication. PLoS One 2023; 18:e0280155. [PMID: 36656808 PMCID: PMC9851555 DOI: 10.1371/journal.pone.0280155] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 07/20/2022] [Accepted: 12/21/2022] [Indexed: 01/20/2023]
Abstract
BACKGROUND Flavonoids are specialized metabolites with numerous biological functions in stress response and reproduction of plants. Flavones are one subgroup that is produced by the flavone synthase (FNS). Two distinct enzyme families evolved that can catalyze the biosynthesis of flavones. While the membrane-bound FNS II is widely distributed in seed plants, one lineage of soluble FNS I appeared to be unique to Apiaceae species. RESULTS We show through phylogenetic and comparative genomic analyses that Apiaceae FNS I evolved through tandem gene duplication of flavanone 3-hydroxylase (F3H) followed by neofunctionalization. Currently available datasets suggest that this event happened within the Apiaceae in a common ancestor of Daucus carota and Apium graveolens. The results also support previous findings that FNS I in the Apiaceae evolved independent of FNS I in other plant species. CONCLUSION We validated a long standing hypothesis about the evolution of Apiaceae FNS I and predicted the phylogenetic position of this event. Our results explain how an Apiaceae-specific FNS I lineage evolved and confirm independence from other FNS I lineages reported in non-Apiaceae species.
Affiliation(s)
- Boas Pucker
- Institute of Plant Biology, TU Braunschweig, Braunschweig, Germany
- BRICS, TU Braunschweig, Braunschweig, Germany
- * E-mail: (BP); (MI)
| | - Massimo Iorizzo
- Plants for Human Health Institute, NC State University, Kannapolis, North Carolina, United States of America
- Department of Horticultural Science, NC State University, Raleigh, North Carolina, United States of America
- * E-mail: (BP); (MI)
| |
|
4
|
Schilbert HM, Pucker B, Ries D, Viehöver P, Micic Z, Dreyer F, Beckmann K, Wittkop B, Weisshaar B, Holtgräwe D. Mapping‑by‑Sequencing Reveals Genomic Regions Associated with Seed Quality Parameters in Brassica napus. Genes (Basel) 2022; 13:genes13071131. [PMID: 35885914 PMCID: PMC9317104 DOI: 10.3390/genes13071131] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 05/31/2022] [Revised: 06/15/2022] [Accepted: 06/22/2022] [Indexed: 11/21/2022]
Abstract
Rapeseed (Brassica napus L.) is an important oil crop and has the potential to serve as a highly productive source of protein. This protein exhibits an excellent amino acid composition and has high nutritional value for humans. Seed protein content (SPC) and seed oil content (SOC) are two complex quantitative and polygenic traits which are negatively correlated and assumed to be controlled by additive and epistatic effects. A reduction in seed glucosinolate (GSL) content is desired as GSLs cause a stringent and bitter taste. The goal here was the identification of genomic intervals relevant for seed GSL content and SPC/SOC. Mapping by sequencing (MBS) revealed 30 and 15 new and known genomic intervals associated with seed GSL content and SPC/SOC, respectively. Within these intervals, we identified known but also so far unknown putatively causal genes and sequence variants. A 4 bp insertion in the MYB28 homolog on C09 shows a significant association with a reduction in seed GSL content. This study provides insights into the genetic architecture and potential mechanisms underlying seed quality traits, which will enhance future breeding approaches in B. napus.
Affiliation(s)
- Hanna Marie Schilbert
- Genetics and Genomics of Plants, CeBiTec & Faculty of Biology, Bielefeld University, Universitätsstraße 27, 33615 Bielefeld, Germany; (H.M.S.); (B.P.); (D.R.); (P.V.); (B.W.)
- Graduate School DILS, Bielefeld Institute for Bioinformatics Infrastructure (BIBI), Faculty of Technology, Bielefeld University, Universitätsstraße 27, 33615 Bielefeld, Germany
| | - Boas Pucker
- Genetics and Genomics of Plants, CeBiTec & Faculty of Biology, Bielefeld University, Universitätsstraße 27, 33615 Bielefeld, Germany; (H.M.S.); (B.P.); (D.R.); (P.V.); (B.W.)
- Plant Biotechnology and Bioinformatics, Institute of Plant Biology & Braunschweig Integrated Centre of Systems Biology (BRICS), TU Braunschweig, Mendelssohnstraße 4, 38106 Braunschweig, Germany
| | - David Ries
- Genetics and Genomics of Plants, CeBiTec & Faculty of Biology, Bielefeld University, Universitätsstraße 27, 33615 Bielefeld, Germany; (H.M.S.); (B.P.); (D.R.); (P.V.); (B.W.)
| | - Prisca Viehöver
- Genetics and Genomics of Plants, CeBiTec & Faculty of Biology, Bielefeld University, Universitätsstraße 27, 33615 Bielefeld, Germany; (H.M.S.); (B.P.); (D.R.); (P.V.); (B.W.)
| | - Zeljko Micic
- Deutsche Saatveredelung AG, Weissenburger Straße 5, 59557 Lippstadt, Germany;
| | - Felix Dreyer
- NPZ Innovation GmbH, Hohenlieth-Hof 1, 24363 Holtsee, Germany; (F.D.); (K.B.)
| | - Katrin Beckmann
- NPZ Innovation GmbH, Hohenlieth-Hof 1, 24363 Holtsee, Germany; (F.D.); (K.B.)
| | - Benjamin Wittkop
- Department of Plant Breeding, Justus Liebig University, Heinrich-Buff-Ring 26-32, 35392 Giessen, Germany;
| | - Bernd Weisshaar
- Genetics and Genomics of Plants, CeBiTec & Faculty of Biology, Bielefeld University, Universitätsstraße 27, 33615 Bielefeld, Germany; (H.M.S.); (B.P.); (D.R.); (P.V.); (B.W.)
| | - Daniela Holtgräwe
- Genetics and Genomics of Plants, CeBiTec & Faculty of Biology, Bielefeld University, Universitätsstraße 27, 33615 Bielefeld, Germany; (H.M.S.); (B.P.); (D.R.); (P.V.); (B.W.)
| |
|
5
|
Scalzitti N, Kress A, Orhand R, Weber T, Moulinier L, Jeannin-Girardon A, Collet P, Poch O, Thompson JD. Spliceator: multi-species splice site prediction using convolutional neural networks. BMC Bioinformatics 2021; 22:561. [PMID: 34814826 PMCID: PMC8609763 DOI: 10.1186/s12859-021-04471-3] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Received: 07/04/2021] [Accepted: 11/09/2021] [Indexed: 12/14/2022]
Abstract
Background Ab initio prediction of splice sites is an essential step in eukaryotic genome annotation. Recent predictors have exploited Deep Learning algorithms and reliable gene structures from model organisms. However, Deep Learning methods for non-model organisms are lacking. Results We developed Spliceator to predict splice sites in a wide range of species, including model and non-model organisms. Spliceator uses a convolutional neural network and is trained on carefully validated data from over 100 organisms. We show that Spliceator achieves consistently high accuracy (89–92%) compared to existing methods on independent benchmarks from human, fish, fly, worm, plant and protist organisms. Conclusions Spliceator is a new Deep Learning method trained on high-quality data, which can be used to predict splice sites in diverse organisms, ranging from human to protists, with consistently high accuracy. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04471-3.
Affiliation(s)
- Nicolas Scalzitti
- Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
| | - Arnaud Kress
- Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France; BiGEst-ICube Platform, ICube Laboratory, UMR7357, 1 rue Eugène Boeckel, 67000, Strasbourg, France
| | - Romain Orhand
- Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
| | - Thomas Weber
- Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
| | - Luc Moulinier
- Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France; BiGEst-ICube Platform, ICube Laboratory, UMR7357, 1 rue Eugène Boeckel, 67000, Strasbourg, France
| | - Anne Jeannin-Girardon
- Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
| | - Pierre Collet
- Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
| | - Olivier Poch
- Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
| | - Julie D Thompson
- Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France.
| |
|
6
|
Schilbert HM, Schöne M, Baier T, Busche M, Viehöver P, Weisshaar B, Holtgräwe D. Characterization of the Brassica napus Flavonol Synthase Gene Family Reveals Bifunctional Flavonol Synthases. Front Plant Sci 2021; 12:733762. [PMID: 34721462 PMCID: PMC8548573 DOI: 10.3389/fpls.2021.733762] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 06/30/2021] [Accepted: 09/21/2021] [Indexed: 06/13/2023]
Abstract
Flavonol synthase (FLS) is a key enzyme for the formation of flavonols, which are a subclass of the flavonoids. FLS catalyzes the conversion of dihydroflavonols to flavonols. The enzyme belongs to the 2-oxoglutarate-dependent dioxygenases (2-ODD) superfamily. We characterized the FLS gene family of Brassica napus that covers 13 genes, based on the genome sequence of the B. napus cultivar Express 617. The goal was to unravel which BnaFLS genes are relevant for seed flavonol accumulation in the amphidiploid species B. napus. Two BnaFLS1 homeologs were identified and shown to encode bifunctional enzymes. Both exhibit FLS activity as well as flavanone 3-hydroxylase (F3H) activity, which was demonstrated in vivo and in planta. BnaFLS1-1 and -2 are capable of converting flavanones into dihydroflavonols and further into flavonols. Analysis of spatio-temporal transcription patterns revealed similar expression profiles of BnaFLS1 genes. Both are mainly expressed in reproductive organs and co-expressed with the genes encoding early steps of flavonoid biosynthesis. Our results provide novel insights into flavonol biosynthesis in B. napus and contribute information for breeding targets with the aim to modify the flavonol content in rapeseed.
Affiliation(s)
- Hanna Marie Schilbert
- Genetics and Genomics of Plants, CeBiTec and Faculty of Biology, Bielefeld University, Bielefeld, Germany
| | - Maximilian Schöne
- Genetics and Genomics of Plants, CeBiTec and Faculty of Biology, Bielefeld University, Bielefeld, Germany
| | - Thomas Baier
- Algae Biotechnology and Bioenergy, CeBiTec and Faculty of Biology, Bielefeld University, Bielefeld, Germany
| | - Mareike Busche
- Genetics and Genomics of Plants, CeBiTec and Faculty of Biology, Bielefeld University, Bielefeld, Germany
| | - Prisca Viehöver
- Genetics and Genomics of Plants, CeBiTec and Faculty of Biology, Bielefeld University, Bielefeld, Germany
| | - Bernd Weisshaar
- Genetics and Genomics of Plants, CeBiTec and Faculty of Biology, Bielefeld University, Bielefeld, Germany
| | - Daniela Holtgräwe
- Genetics and Genomics of Plants, CeBiTec and Faculty of Biology, Bielefeld University, Bielefeld, Germany
| |
|
7
|
Farhat S, Le P, Kayal E, Noel B, Bigeard E, Corre E, Maumus F, Florent I, Alberti A, Aury JM, Barbeyron T, Cai R, Da Silva C, Istace B, Labadie K, Marie D, Mercier J, Rukwavu T, Szymczak J, Tonon T, Alves-de-Souza C, Rouzé P, Van de Peer Y, Wincker P, Rombauts S, Porcel BM, Guillou L. Rapid protein evolution, organellar reductions, and invasive intronic elements in the marine aerobic parasite dinoflagellate Amoebophrya spp. BMC Biol 2021; 19:1. [PMID: 33407428 PMCID: PMC7789003 DOI: 10.1186/s12915-020-00927-9] [Citation(s) in RCA: 36] [Impact Index Per Article: 12.0] [Received: 05/24/2020] [Accepted: 11/12/2020] [Indexed: 12/28/2022]
Abstract
BACKGROUND Dinoflagellates are aquatic protists particularly widespread in the oceans worldwide. Some are responsible for toxic blooms while others live in symbiotic relationships, either as mutualistic symbionts in corals or as parasites infecting other protists and animals. Dinoflagellates harbor atypically large genomes (~ 3 to 250 Gb), with gene organization and gene expression patterns very different from closely related apicomplexan parasites. Here we sequenced and analyzed the genomes of two early-diverging and co-occurring parasitic dinoflagellate Amoebophrya strains, to shed light on the emergence of such atypical genomic features, dinoflagellate evolution, and host specialization. RESULTS We sequenced, assembled, and annotated high-quality genomes for two Amoebophrya strains (A25 and A120), using a combination of Illumina paired-end short-read and Oxford Nanopore Technology (ONT) MinION long-read sequencing approaches. We found that a small number of transposable elements, along with short introns and intergenic regions, and a limited number of gene families, together contribute to the compactness of the Amoebophrya genomes, a feature potentially linked with parasitism. While the majority of Amoebophrya proteins (63.7% of A25 and 59.3% of A120) had no functional assignment, we found many orthologs shared with Dinophyceae. Our analyses revealed a strong tendency for genes encoded by unidirectional clusters and high levels of synteny conservation between the two genomes despite low interspecific protein sequence similarity, suggesting rapid protein evolution. Most strikingly, we identified a large portion of non-canonical introns, including repeated introns, displaying a broad variability of associated splicing motifs never observed among eukaryotes. Those introner elements appear to have the capacity to spread over their respective genomes in a manner similar to transposable elements. Finally, we confirmed the reduction of organelles observed in Amoebophrya spp., i.e., loss of the plastid, potential loss of a mitochondrial genome and functions. CONCLUSION These results expand the range of atypical genome features found in basal dinoflagellates and raise questions regarding speciation and the evolutionary mechanisms at play while parasitism was selected for in this particular unicellular lineage.
Affiliation(s)
- Sarah Farhat
- Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ. Evry, Université Paris-Saclay, 91057, Evry, France
- School of Marine and Atmospheric Sciences, Stony Brook University, Stony Brook, New York, 11794, USA
| | - Phuong Le
- Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium
- VIB Center for Plant Systems Biology, Ghent, Belgium
| | - Ehsan Kayal
- Sorbonne Université, CNRS, FR2424, Station Biologique de Roscoff, Place Georges Teissier, 29680, Roscoff, France
| | - Benjamin Noel
- Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ. Evry, Université Paris-Saclay, 91057, Evry, France
| | - Estelle Bigeard
- Sorbonne Université, CNRS, UMR7144 Adaptation et Diversité en Milieu Marin, Ecology of Marine Plankton (ECOMAP), Station Biologique de Roscoff SBR, 29680, Roscoff, France
| | - Erwan Corre
- Sorbonne Université, CNRS, FR2424, Station Biologique de Roscoff, Place Georges Teissier, 29680, Roscoff, France
| | - Florian Maumus
- URGI, INRA, Université Paris-Saclay, 78026, Versailles, France
| | - Isabelle Florent
- Unité Molécules de Communication et Adaptation des Microorganismes (MCAM, UMR7245), Muséum national d'Histoire naturelle, CNRS, CP 52, 57 rue Cuvier, 75005, Paris, France
| | - Adriana Alberti
- Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ. Evry, Université Paris-Saclay, 91057, Evry, France
| | - Jean-Marc Aury
- Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ. Evry, Université Paris-Saclay, 91057, Evry, France
| | - Tristan Barbeyron
- Sorbonne Université, CNRS, UMR 8227, Station Biologique de Roscoff, Place Georges Teissier, 29680, Roscoff, France
| | - Ruibo Cai
- Sorbonne Université, CNRS, UMR7144 Adaptation et Diversité en Milieu Marin, Ecology of Marine Plankton (ECOMAP), Station Biologique de Roscoff SBR, 29680, Roscoff, France
| | - Corinne Da Silva
- Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ. Evry, Université Paris-Saclay, 91057, Evry, France
| | - Benjamin Istace
- Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ. Evry, Université Paris-Saclay, 91057, Evry, France
| | - Karine Labadie
- Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ. Evry, Université Paris-Saclay, 91057, Evry, France
| | - Dominique Marie
- Sorbonne Université, CNRS, UMR7144 Adaptation et Diversité en Milieu Marin, Ecology of Marine Plankton (ECOMAP), Station Biologique de Roscoff SBR, 29680, Roscoff, France
| | - Jonathan Mercier
- Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ. Evry, Université Paris-Saclay, 91057, Evry, France
| | - Tsinda Rukwavu
- Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ. Evry, Université Paris-Saclay, 91057, Evry, France
| | - Jeremy Szymczak
- Sorbonne Université, CNRS, FR2424, Station Biologique de Roscoff, Place Georges Teissier, 29680, Roscoff, France
- Sorbonne Université, CNRS, UMR7144 Adaptation et Diversité en Milieu Marin, Ecology of Marine Plankton (ECOMAP), Station Biologique de Roscoff SBR, 29680, Roscoff, France
| | - Thierry Tonon
- Centre for Novel Agricultural Products, Department of Biology, University of York, Heslington, York, YO10 5DD, UK
| | - Catharina Alves-de-Souza
- Algal Resources Collection, MARBIONC, Center for Marine Sciences, University of North Carolina Wilmington, 5600 Marvin K. Moss Lane, Wilmington, NC, 28409, USA
| | - Pierre Rouzé
- Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium
- VIB Center for Plant Systems Biology, Ghent, Belgium
| | - Yves Van de Peer
- Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium
- VIB Center for Plant Systems Biology, Ghent, Belgium
- Department of Biochemistry, Genetics and Microbiology, University of Pretoria, Pretoria, South Africa
| | - Patrick Wincker
- Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ. Evry, Université Paris-Saclay, 91057, Evry, France
| | - Stephane Rombauts
- Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium
- VIB Center for Plant Systems Biology, Ghent, Belgium
| | - Betina M Porcel
- Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ. Evry, Université Paris-Saclay, 91057, Evry, France.
| | - Laure Guillou
- Sorbonne Université, CNRS, UMR7144 Adaptation et Diversité en Milieu Marin, Ecology of Marine Plankton (ECOMAP), Station Biologique de Roscoff SBR, 29680, Roscoff, France.
| |
|
8
|
Sielemann K, Hafner A, Pucker B. The reuse of public datasets in the life sciences: potential risks and rewards. PeerJ 2020; 8:e9954. [PMID: 33024631 PMCID: PMC7518187 DOI: 10.7717/peerj.9954] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 04/30/2020] [Accepted: 08/25/2020] [Indexed: 12/13/2022]
Abstract
The 'big data' revolution has enabled novel types of analyses in the life sciences, facilitated by public sharing and reuse of datasets. Here, we review the prodigious potential of reusing publicly available datasets and the associated challenges, limitations and risks. Possible solutions to issues and research integrity considerations are also discussed. Due to the prominence, abundance and wide distribution of sequencing data, we focus on the reuse of publicly available sequence datasets. We define 'successful reuse' as the use of previously published data to enable novel scientific findings. By using selected examples of successful reuse from different disciplines, we illustrate the enormous potential of the practice, while acknowledging the respective limitations and risks. A checklist to determine the reuse value and potential of a particular dataset is also provided. The open discussion of data reuse and the establishment of this practice as a norm has the potential to benefit all stakeholders in the life sciences.
Affiliation(s)
- Katharina Sielemann
- Genetics and Genomics of Plants, Center for Biotechnology (CeBiTec) & Faculty of Biology, Bielefeld University, Bielefeld, Germany
- Graduate School DILS, Bielefeld Institute for Bioinformatics Infrastructure (BIBI), Bielefeld University, Bielefeld, Germany
| | - Alenka Hafner
- Genetics and Genomics of Plants, Center for Biotechnology (CeBiTec) & Faculty of Biology, Bielefeld University, Bielefeld, Germany
- Current Affiliation: Intercollege Graduate Degree Program in Plant Biology, Penn State University, University Park, State College, PA, United States of America
| | - Boas Pucker
- Genetics and Genomics of Plants, Center for Biotechnology (CeBiTec) & Faculty of Biology, Bielefeld University, Bielefeld, Germany
- Evolution and Diversity, Department of Plant Sciences, University of Cambridge, Cambridge, United Kingdom
| |
|
9
|
Siadjeu C, Pucker B, Viehöver P, Albach DC, Weisshaar B. High Contiguity De Novo Genome Sequence Assembly of Trifoliate Yam (Dioscorea dumetorum) Using Long Read Sequencing. Genes (Basel) 2020; 11:E274. [PMID: 32143301 PMCID: PMC7140821 DOI: 10.3390/genes11030274] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Received: 01/31/2020] [Revised: 02/25/2020] [Accepted: 02/29/2020] [Indexed: 12/17/2022]
Abstract
Trifoliate yam (Dioscorea dumetorum) is one example of an orphan crop, not traded internationally. Post-harvest hardening of the tubers of this species starts within 24 h after harvesting and renders the tubers inedible. Genomic resources are required for D. dumetorum to improve breeding for non-hardening varieties as well as for other traits. We sequenced the D. dumetorum genome and generated the corresponding annotation. The two haplophases of this highly heterozygous genome were separated to a large extent. The assembly represents 485 Mbp of the genome with an N50 of over 3.2 Mbp. A total of 35,269 protein-encoding gene models as well as 9941 non-coding RNA genes were predicted, and functional annotations were assigned.
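The N50 value cited for this assembly is a standard contiguity statistic: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. A minimal sketch of that calculation follows. The function name and the toy contig lengths are illustrative assumptions, not data from the study.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first contig that brings the cumulative length
        # to at least half of the total assembly size.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Toy contig lengths (hypothetical values, in bp).
print(n50([2, 3, 4, 5, 6]))
```

In practice the lengths would be read from the assembly FASTA file; that step is omitted here to keep the sketch self-contained.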
Affiliation(s)
- Christian Siadjeu
- Institute for Biology and Environmental Sciences, Biodiversity and Evolution of Plants, Carl-von-Ossietzky University Oldenburg, Carl-von-Ossietzky Str. 9-11, 26111 Oldenburg, Germany; (C.S.); (D.C.A.)
- Genetics and Genomics of Plants, Faculty of Biology, Center for Biotechnology (CeBiTec), Bielefeld University, Sequenz 1, 33615 Bielefeld, NRW, Germany; (B.P.); (P.V.)
| | - Boas Pucker
- Genetics and Genomics of Plants, Faculty of Biology, Center for Biotechnology (CeBiTec), Bielefeld University, Sequenz 1, 33615 Bielefeld, NRW, Germany; (B.P.); (P.V.)
- Molecular Genetics and Physiology of Plants, Faculty of Biology and Biotechnology, Ruhr-University Bochum, Universitätsstraße 150, 44801 Bochum, Germany
| | - Prisca Viehöver
- Genetics and Genomics of Plants, Faculty of Biology, Center for Biotechnology (CeBiTec), Bielefeld University, Sequenz 1, 33615 Bielefeld, NRW, Germany; (B.P.); (P.V.)
| | - Dirk C. Albach
- Institute for Biology and Environmental Sciences, Biodiversity and Evolution of Plants, Carl-von-Ossietzky University Oldenburg, Carl-von-Ossietzky Str. 9-11, 26111 Oldenburg, Germany; (C.S.); (D.C.A.)
| | - Bernd Weisshaar
- Genetics and Genomics of Plants, Faculty of Biology, Center for Biotechnology (CeBiTec), Bielefeld University, Sequenz 1, 33615 Bielefeld, NRW, Germany; (B.P.); (P.V.)
| |
|
10
|
Frey K, Pucker B. Animal, Fungi, and Plant Genome Sequences Harbor Different Non-Canonical Splice Sites. Cells 2020; 9:E458. [PMID: 32085510 PMCID: PMC7072748 DOI: 10.3390/cells9020458] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 01/20/2020] [Revised: 02/11/2020] [Accepted: 02/14/2020] [Indexed: 11/17/2022]
Abstract
Most protein-encoding genes in eukaryotes contain introns, which are interwoven with exons. Introns need to be removed from initial transcripts in order to generate the final messenger RNA (mRNA), which can be translated into an amino acid sequence. Precise excision of introns by the spliceosome requires conserved dinucleotides, which mark the splice sites. However, there are variations of the highly conserved combination of GT at the 5' end and AG at the 3' end of an intron in the genome. GC-AG and AT-AC are two major non-canonical splice site combinations, which have been known for years. Recently, various minor non-canonical splice site combinations were detected with numerous dinucleotide permutations. Here, we expand systematic investigations of non-canonical splice site combinations in plants across eukaryotes by analyzing fungal and animal genome sequences. Comparisons of splice site combinations between these three kingdoms revealed several differences, such as an apparently increased CT-AC frequency in fungal genome sequences. Canonical GT-AG splice site combinations in antisense transcripts are a likely explanation for this observation, thus indicating annotation errors. In addition, high numbers of GA-AG splice site combinations were observed in Eurytemora affinis and Oikopleura dioica. A variant in one U1 small nuclear RNA (snRNA) isoform might allow the recognition of GA as a 5' splice site. In depth investigation of splice site usage based on RNA-Seq read mappings indicates a generally higher flexibility of the 3' splice site compared to the 5' splice site across animals, fungi, and plants.
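The splice site categories named in this abstract, and the reason a canonical GT-AG intron annotated on the wrong strand shows up as CT-AC, can be sketched in a few lines. This is only an illustration of the dinucleotide logic described above and not the authors' pipeline. All function names and the toy intron sequence are hypothetical.

```python
CANONICAL = ("GT", "AG")
MAJOR_NON_CANONICAL = {("GC", "AG"), ("AT", "AC")}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def splice_site_combination(intron: str) -> tuple[str, str]:
    """Return the (5' dinucleotide, 3' dinucleotide) pair of an intron."""
    intron = intron.upper()
    if len(intron) < 4:
        raise ValueError("intron must be at least 4 nt long")
    return intron[:2], intron[-2:]


def classify(intron: str) -> str:
    """Classify an intron by its terminal dinucleotides."""
    combo = splice_site_combination(intron)
    if combo == CANONICAL:
        return "canonical"
    if combo in MAJOR_NON_CANONICAL:
        return "major non-canonical"
    return "minor non-canonical"


# Toy intron with canonical GT...AG borders (made-up sequence).
intron = "GTAAGTTTTCAG"
print(classify(intron))

# Reading the same intron on the opposite strand yields CT...AC, which is
# how an antisense annotation of a GT-AG intron appears as CT-AC.
antisense = reverse_complement(intron)
print(splice_site_combination(antisense), classify(antisense))
```

Counting such combinations over all annotated introns of a genome would reproduce the kind of frequency comparison the abstract describes, although real analyses also need strand and annotation handling that is left out here.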
Affiliation(s)
- Katharina Frey
- Genetics and Genomics of Plants, Center for Biotechnology (CeBiTec), Bielefeld University, 33615 Bielefeld, Germany;
- Graduate School DILS, Bielefeld Institute for Bioinformatics Infrastructure (BIBI), Bielefeld University, 33615 Bielefeld, Germany
- Boas Pucker
- Genetics and Genomics of Plants, Center for Biotechnology (CeBiTec), Bielefeld University, 33615 Bielefeld, Germany;
- Molecular Genetics and Physiology of Plants, Faculty of Biology and Biotechnology, Ruhr-University Bochum, Universitätsstraße 150, 44801 Bochum, Germany
11
Pucker B, Holtgräwe D, Stadermann KB, Frey K, Huettel B, Reinhardt R, Weisshaar B. A chromosome-level sequence assembly reveals the structure of the Arabidopsis thaliana Nd-1 genome and its gene set. PLoS One 2019; 14:e0216233. [PMID: 31112551 PMCID: PMC6529160 DOI: 10.1371/journal.pone.0216233] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Received: 12/12/2018] [Accepted: 04/16/2019] [Indexed: 01/27/2023] Open
Abstract
In addition to the BAC-based reference sequence of the accession Columbia-0 from the year 2000, several short read assemblies of THE plant model organism Arabidopsis thaliana were published during the last years. Also, a SMRT-based assembly of Landsberg erecta has been generated that identified translocation and inversion polymorphisms between two genotypes of the species. Here we provide a chromosome-arm level assembly of the A. thaliana accession Niederzenz-1 (AthNd-1_v2c) based on SMRT sequencing data. The best assembly comprises 69 nucleome sequences and displays a contig length of up to 16 Mbp. Compared to an earlier Illumina short read-based NGS assembly (AthNd-1_v1), a 75-fold increase in contiguity was observed for AthNd-1_v2c. To assign contig locations independent from the Col-0 gold standard reference sequence, we used genetic anchoring to generate a de novo assembly. In addition, we assembled the chondrome and plastome sequences. Detailed analyses of AthNd-1_v2c allowed reliable identification of large genomic rearrangements between A. thaliana accessions contributing to differences in the gene sets that distinguish the genotypes. One of the differences detected identified a gene that is lacking from the Col-0 gold standard sequence. This de novo assembly extends the known proportion of the A. thaliana pan-genome.
Affiliation(s)
- Boas Pucker
- Bielefeld University, Faculty of Biology & Center for Biotechnology, Bielefeld, Germany
- Daniela Holtgräwe
- Bielefeld University, Faculty of Biology & Center for Biotechnology, Bielefeld, Germany
- Kai Bernd Stadermann
- Bielefeld University, Faculty of Biology & Center for Biotechnology, Bielefeld, Germany
- Katharina Frey
- Bielefeld University, Faculty of Biology & Center for Biotechnology, Bielefeld, Germany
- Bruno Huettel
- Max Planck Genome Centre Cologne, Max Planck Institute for Plant Breeding Research, Cologne, Germany
- Richard Reinhardt
- Max Planck Genome Centre Cologne, Max Planck Institute for Plant Breeding Research, Cologne, Germany
- Bernd Weisshaar
- Bielefeld University, Faculty of Biology & Center for Biotechnology, Bielefeld, Germany
12
Pucker B, Brockington SF. Genome-wide analyses supported by RNA-Seq reveal non-canonical splice sites in plant genomes. BMC Genomics 2018; 19:980. [PMID: 30594132 PMCID: PMC6310983 DOI: 10.1186/s12864-018-5360-z] [Citation(s) in RCA: 30] [Impact Index Per Article: 5.0] [Received: 10/11/2018] [Accepted: 12/10/2018] [Indexed: 12/19/2022] Open
Abstract
BACKGROUND: Most eukaryotic genes comprise exons and introns, thus requiring the precise removal of introns from pre-mRNAs to enable protein biosynthesis. U2 and U12 spliceosomes catalyze this step by recognizing motifs on the transcript in order to remove the introns, a process which is dependent on precise definition of exon-intron borders by splice sites, which are consequently highly conserved across species. Only very few combinations of terminal dinucleotides are frequently observed at intron ends, dominated by the canonical GT-AG splice sites on the DNA level. RESULTS: Here we investigate the occurrence of diverse combinations of dinucleotides at predicted splice sites. Analyzing 121 plant genome sequences based on their annotation revealed strong splice site conservation across species, annotation errors, and true biological divergence from canonical splice sites. The frequency of non-canonical splice sites clearly correlates with their divergence from canonical ones, indicating either an accumulation of probably neutral mutations, or evolution towards canonical splice sites. Strong conservation across multiple species and non-random accumulation of substitutions in splice sites indicate a functional relevance of non-canonical splice sites. The average composition of splice sites across all investigated species is 98.7% for GT-AG, 1.2% for GC-AG, 0.06% for AT-AC, and 0.09% for minor non-canonical splice sites. RNA-Seq data sets of 35 species were incorporated to validate non-canonical splice site predictions through gaps in sequencing read alignments and to demonstrate the expression of affected genes. CONCLUSION: We conclude that bona fide non-canonical splice sites are present and appear to be functionally relevant in most plant genomes, although at low abundance.
Affiliation(s)
- Boas Pucker
- Evolution and Diversity, Department of Plant Sciences, University of Cambridge, Cambridge, UK
- Genetics and Genomics of Plants, CeBiTec & Faculty of Biology, Bielefeld University, Bielefeld, Germany
- Samuel F. Brockington
- Evolution and Diversity, Department of Plant Sciences, University of Cambridge, Cambridge, UK
13
Behnke N, Suprianto E, Möllers C. A major QTL on chromosome C05 significantly reduces acid detergent lignin (ADL) content and increases seed oil and protein content in oilseed rape (Brassica napus L.). Theor Appl Genet 2018; 131:2477-2492. [PMID: 30143828 DOI: 10.1007/s00122-018-3167-6] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 05/27/2018] [Accepted: 08/17/2018] [Indexed: 05/27/2023]
Abstract
A reduction in acid detergent lignin content in oilseed rape resulted in an increase in seed oil and protein content. Worldwide increasing demand for vegetable oil and protein requires continuous breeding efforts to enhance the yield of oil and protein crop species. The oil-extracted meal of oilseed rape is currently mainly used for feeding livestock, but efforts are undertaken to use the oilseed rape protein in food production. One limiting factor is the high lignin content of black-seeded oilseed rape that negatively affects digestibility and sensory quality of food products compared to soybean. Breeding attempts to develop yellow seeded oilseed rape with reduced lignin content have not yet resulted in competitive cultivars. The objective of this work was to investigate the inheritance of seed quality in a DH population derived from the cross of the high oil lines SGDH14 and cv. Express. The DH population of 139 lines was tested in field experiments in 14 environments in north-west Europe. Seeds harvested from open pollinated plants were used for extensive seed quality analysis. A molecular marker map based on the Illumina Infinium 60 K Brassica SNP chip was used to map QTL. Amongst others, one major QTL for acid detergent lignin content, explaining 81% of the phenotypic variance, was identified on chromosome C05. Lines with reduced lignin content nevertheless did not show a yellowish appearance, but showed a reduced seed hull content. The position of the QTL co-located with QTL for oil and protein content of the defatted meal with opposite additive effects, suggesting that the reduction in lignin content resulted in an increase in oil and protein content.
Affiliation(s)
- Nina Behnke
- Department of Crop Sciences, Georg-August-Universität Göttingen, Von-Siebold-Str. 8, 37075, Göttingen, Germany
- Edy Suprianto
- Department of Crop Sciences, Georg-August-Universität Göttingen, Von-Siebold-Str. 8, 37075, Göttingen, Germany
- Christian Möllers
- Department of Crop Sciences, Georg-August-Universität Göttingen, Von-Siebold-Str. 8, 37075, Göttingen, Germany