1
Salamzade R, Tran PQ, Martin C, Manson AL, Gilmore MS, Earl AM, Anantharaman K, Kalan LR. zol & fai: large-scale targeted detection and evolutionary investigation of gene clusters. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2023.06.07.544063. [PMID: 37333121 PMCID: PMC10274777 DOI: 10.1101/2023.06.07.544063] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 06/20/2023]
Abstract
Many universally and conditionally important genes are genomically aggregated within clusters. Here, we introduce fai and zol, which together enable large-scale comparative analysis of different types of gene clusters and mobile-genetic elements (MGEs), such as biosynthetic gene clusters (BGCs) or viruses. Fundamentally, they overcome a current bottleneck to reliably perform comprehensive orthology inference at large scale across broad taxonomic contexts and thousands of genomes. First, fai allows the identification of orthologous instances of a query gene cluster of interest amongst a database of target genomes. Subsequently, zol enables reliable, context-specific inference of ortholog groups for individual protein-encoding genes across gene cluster instances. In addition, zol performs functional annotation and computes a variety of evolutionary statistics for each inferred ortholog group. Importantly, in comparison to tools for visual exploration of homologous relationships between gene clusters, zol can scale to thousands of gene cluster instances and produce detailed reports that are easy to digest. To showcase fai and zol, we apply them for: (i) longitudinal tracking of a virus in metagenomes, (ii) discovering novel population-level genetic insights of two common BGCs in the fungal species Aspergillus flavus, and (iii) uncovering large-scale evolutionary trends of a virulence-associated gene cluster across thousands of genomes from a diverse bacterial genus.
Affiliation(s)
- Rauf Salamzade
  - Department of Medical Microbiology and Immunology, School of Medicine and Public Health, University of Wisconsin-Madison, Madison, WI, USA
  - Microbiology Doctoral Training Program, University of Wisconsin-Madison, Madison, WI, USA
- Patricia Q. Tran
  - Department of Bacteriology, University of Wisconsin-Madison, Madison, WI, USA
  - Freshwater and Marine Science Doctoral Program, University of Wisconsin-Madison, WI, USA
- Cody Martin
  - Microbiology Doctoral Training Program, University of Wisconsin-Madison, Madison, WI, USA
  - Department of Bacteriology, University of Wisconsin-Madison, Madison, WI, USA
- Abigail L. Manson
  - Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA
- Michael S. Gilmore
  - Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA
  - Department of Ophthalmology, Harvard Medical School and Mass Eye and Ear, Boston, Massachusetts, USA
  - Department of Microbiology, Harvard Medical School and Mass Eye and Ear, Boston, Massachusetts, USA
- Ashlee M. Earl
  - Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA
- Lindsay R. Kalan
  - Department of Medical Microbiology and Immunology, School of Medicine and Public Health, University of Wisconsin-Madison, Madison, WI, USA
  - Department of Medicine, Division of Infectious Disease, School of Medicine and Public Health, University of Wisconsin-Madison, Madison, WI, USA
  - M.G. DeGroote Institute for Infectious Disease Research, David Braley Centre for Antibiotic Discovery, Department of Biochemistry and Biomedical Sciences, McMaster University, Hamilton, Ontario, Canada
2
Li H, Durbin R. Genome assembly in the telomere-to-telomere era. Nat Rev Genet 2024; 25:658-670. [PMID: 38649458 DOI: 10.1038/s41576-024-00718-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 02/27/2024] [Indexed: 04/25/2024]
Abstract
Genome sequences largely determine the biology and encode the history of an organism, and de novo assembly - the process of reconstructing the genome sequence of an organism from sequencing reads - has been a central problem in bioinformatics for four decades. Until recently, genomes were typically assembled into fragments of a few megabases at best, but now technological advances in long-read sequencing enable the near-complete assembly of each chromosome - also known as telomere-to-telomere assembly - for many organisms. Here, we review recent progress on assembly algorithms and protocols, with a focus on how to derive near-telomere-to-telomere assemblies. We also discuss the additional developments that will be required to resolve remaining assembly gaps and to assemble non-diploid genomes.
Affiliation(s)
- Heng Li
  - Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Richard Durbin
  - Department of Genetics, Cambridge University, Cambridge, UK
3
Helf MJ, Buntin K, Klančar A, Rust M, Petersen F, Pistorius D, Weber E, Wong J, Krastel P. Scaling up for success: from bioactive natural products to new medicines. Nat Prod Rep 2024. [PMID: 39129507 DOI: 10.1039/d4np00022f] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/13/2024]
Abstract
Covering 1986 to present. Natural product drug discovery at Novartis has a long and successful history of delivering life saving medicines to millions of patients. In this viewpoint, we are presenting the tools we use and challenges we face as we advance natural products from early research into development and beyond. We are leveraging our collection of 90 000 microbial strains and 20 000 isolated natural products to find new medications in an interdisciplinary approach that requires expertise in microbiology, computational biology, synthetic biology, chemistry, and process development. Technological advances, particularly in genome engineering and data science have transformed our field, accelerating discovery and facilitating sustainable compound supply. Emerging new modalities such as antibody drug conjugates, radioligand therapies and xRNA-based medications offer new opportunities for natural product-derived drugs. By taking advantage of these new modalities and the most recent research technologies, natural products will significantly contribute to the medicines of the future.
Affiliation(s)
- Kathrin Buntin
  - Biomedical Research, Novartis Pharma AG, 4002 Basel, Switzerland
- Michael Rust
  - Biomedical Research, Novartis Pharma AG, 4002 Basel, Switzerland
- Frank Petersen
  - Biomedical Research, Novartis Pharma AG, 4002 Basel, Switzerland
- Eric Weber
  - Biomedical Research, Novartis Pharma AG, 4002 Basel, Switzerland
- Joanne Wong
  - Biomedical Research, Novartis Pharma AG, 4002 Basel, Switzerland
- Philipp Krastel
  - Biomedical Research, Novartis Pharma AG, 4002 Basel, Switzerland
4
Peyretaillade E, Akossi RF, Tournayre J, Delbac F, Wawrzyniak I. How to overcome constraints imposed by microsporidian genome features to ensure gene prediction? J Eukaryot Microbiol 2024:e13038. [PMID: 38934348 DOI: 10.1111/jeu.13038] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/18/2024] [Revised: 06/03/2024] [Accepted: 06/10/2024] [Indexed: 06/28/2024]
Abstract
Since the advent of sequencing techniques and due to their continuous evolution, it has become easier and less expensive to obtain the complete genome sequence of any organism. Nevertheless, to elucidate all biological processes governing organism development, quality annotation is essential. In genome annotation, predicting gene structure is one of the most important and captivating challenges for computational biology. This aspect of annotation requires continual optimization, particularly for genomes as unusual as those of microsporidia. Indeed, this group of fungal-related parasites exhibits specific features (highly reduced gene sizes, sequences with high rate of evolution) linked to their evolution as intracellular parasites, requiring the implementation of specific annotation approaches to consider all these features. This review aimed to outline these characteristics and to assess the increasingly efficient approaches and tools that have enhanced the accuracy of gene prediction for microsporidia, both in terms of sensitivity and specificity. Subsequently, a final part will be dedicated to postgenomic approaches aimed at reinforcing the annotation data generated by prediction software. These approaches include the characterization of other understudied genes, such as those encoding regulatory noncoding RNAs or very small proteins, which also play crucial roles in the life cycle of these microorganisms.
Affiliation(s)
- Reginal F Akossi
  - LMGE, CNRS, Université Clermont Auvergne, Clermont-Ferrand, France
- Jérémy Tournayre
  - INRAE, UMR Herbivores, Université Clermont Auvergne, VetAgro Sup, Saint-Genès-Champanelle, France
- Frédéric Delbac
  - LMGE, CNRS, Université Clermont Auvergne, Clermont-Ferrand, France
- Ivan Wawrzyniak
  - LMGE, CNRS, Université Clermont Auvergne, Clermont-Ferrand, France
5
Brůna T, Lomsadze A, Borodovsky M. GeneMark-ETP significantly improves the accuracy of automatic annotation of large eukaryotic genomes. Genome Res 2024; 34:757-768. [PMID: 38866548 PMCID: PMC11216313 DOI: 10.1101/gr.278373.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2023] [Accepted: 05/02/2024] [Indexed: 06/14/2024]
Abstract
Large-scale genomic initiatives, such as the Earth BioGenome Project, require efficient methods for eukaryotic genome annotation. Here we present an automatic gene finder, GeneMark-ETP, integrating genomic-, transcriptomic-, and protein-derived evidence that has been developed with a focus on large plant and animal genomes. GeneMark-ETP first identifies genomic loci where extrinsic data are sufficient for making gene predictions with "high confidence." The genes situated in the genomic space between the high-confidence genes are predicted in the next stage. The set of high-confidence genes serves as an initial training set for the statistical model. Further on, the model parameters are iteratively updated in the rounds of gene prediction and parameter re-estimation. Upon reaching convergence, GeneMark-ETP makes the final predictions and delivers the whole complement of predicted genes. GeneMark-ETP outperforms gene finders using a single type of extrinsic evidence. Comparisons with gene finders MAKER2 and TSEBRA, those that use both transcript- and protein-derived extrinsic evidence, show that GeneMark-ETP delivers state-of-the-art gene-prediction accuracy, with the margin of outperforming existing approaches increasing in its application to larger and more complex eukaryotic genomes.
Affiliation(s)
- Tomáš Brůna
  - School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia 30332, USA
- Alexandre Lomsadze
  - Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332, USA
- Mark Borodovsky
  - School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia 30332, USA
  - Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332, USA
  - School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332, USA
6
Gabriel L, Brůna T, Hoff KJ, Ebel M, Lomsadze A, Borodovsky M, Stanke M. BRAKER3: Fully automated genome annotation using RNA-seq and protein evidence with GeneMark-ETP, AUGUSTUS, and TSEBRA. Genome Res 2024; 34:769-777. [PMID: 38866550 PMCID: PMC11216308 DOI: 10.1101/gr.278090.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/10/2023] [Accepted: 02/28/2024] [Indexed: 06/14/2024]
Abstract
Gene prediction has remained an active area of bioinformatics research for a long time. Still, gene prediction in large eukaryotic genomes presents a challenge that must be addressed by new algorithms. The amount and significance of the evidence available from transcriptomes and proteomes vary across genomes, between genes, and even along a single gene. User-friendly and accurate annotation pipelines that can cope with such data heterogeneity are needed. The previously developed annotation pipelines BRAKER1 and BRAKER2 use RNA-seq or protein data, respectively, but not both. A further significant performance improvement integrating all three data types was made by the recently released GeneMark-ETP. We here present the BRAKER3 pipeline that builds on GeneMark-ETP and AUGUSTUS, and further improves accuracy using the TSEBRA combiner. BRAKER3 annotates protein-coding genes in eukaryotic genomes using both short-read RNA-seq and a large protein database, along with statistical models learned iteratively and specifically for the target genome. We benchmarked the new pipeline on genomes of 11 species under an assumed level of relatedness of the target species proteome to available proteomes. BRAKER3 outperforms BRAKER1 and BRAKER2. The average transcript-level F1-score is increased by about 20 percentage points on average, whereas the difference is most pronounced for species with large and complex genomes. BRAKER3 also outperforms other existing tools, MAKER2, Funannotate, and FINDER. The code of BRAKER3 is available on GitHub and as a ready-to-run Docker container for execution with Docker or Singularity. Overall, BRAKER3 is an accurate, easy-to-use tool for eukaryotic genome annotation.
Affiliation(s)
- Lars Gabriel
  - Institute of Mathematics and Computer Science, University of Greifswald, 17489 Greifswald, Germany
  - Center for Functional Genomics of Microbes, University of Greifswald, 17489 Greifswald, Germany
- Tomáš Brůna
  - U.S. Department of Energy, Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA
- Katharina J Hoff
  - Institute of Mathematics and Computer Science, University of Greifswald, 17489 Greifswald, Germany
  - Center for Functional Genomics of Microbes, University of Greifswald, 17489 Greifswald, Germany
- Matthis Ebel
  - Institute of Mathematics and Computer Science, University of Greifswald, 17489 Greifswald, Germany
  - Center for Functional Genomics of Microbes, University of Greifswald, 17489 Greifswald, Germany
- Alexandre Lomsadze
  - Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332, USA
- Mark Borodovsky
  - Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332, USA
  - School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332, USA
- Mario Stanke
  - Institute of Mathematics and Computer Science, University of Greifswald, 17489 Greifswald, Germany
  - Center for Functional Genomics of Microbes, University of Greifswald, 17489 Greifswald, Germany
7
Ariffin N, Newman DW, Nelson MG, O’cualain R, Hubbard SJ. Proteogenomic Gene Structure Validation in the Pineapple Genome. J Proteome Res 2024; 23:1583-1592. [PMID: 38651221 PMCID: PMC11077482 DOI: 10.1021/acs.jproteome.3c00675] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2023] [Revised: 03/15/2024] [Accepted: 04/12/2024] [Indexed: 04/25/2024]
Abstract
MD2 pineapple (Ananas comosus) is the second most important tropical crop that preserves crassulacean acid metabolism (CAM), which has high water-use efficiency and is fast becoming the most consumed fresh fruit worldwide. Despite the significance of environmental efficiency and popularity, until very recently, its genome sequence has not been determined and a high-quality annotated proteome has not been available. Here, we have undertaken a pilot proteogenomic study, analyzing the proteome of MD2 pineapple leaves using liquid chromatography-mass spectrometry (LC-MS/MS), which validates 1781 predicted proteins in the annotated F153 (V3) genome. In addition, a further 603 peptide identifications are found that map exclusively to an independent MD2 transcriptome-derived database but are not found in the standard F153 (V3) annotated proteome. Peptide identifications derived from these MD2 transcripts are also cross-referenced to a more recent and complete MD2 genome annotation, resulting in 402 nonoverlapping peptides, which in turn support 30 high-quality gene candidates novel to both pineapple genomes. Many of the validated F153 (V3) genes are also supported by an independent proteomics data set collected for an ornamental pineapple variety. The contigs and peptides have been mapped to the current F153 genome build and are available as bed files to display a custom gene track on the Ensembl Plants region viewer. These analyses add to the knowledge of experimentally validated pineapple genes and demonstrate the utility of transcript-derived proteomics to discover both novel genes and genetic structure in a plant genome, adding value to its annotation.
Affiliation(s)
- Norazrin Ariffin
  - School of Biological Sciences, Faculty of Biology Medicine and Health, MAHSC, University of Manchester, Michael Smith Building, Oxford Road, Manchester M13 9PT, United Kingdom
  - Department of Agriculture Technology, Faculty of Agriculture, Universiti Putra Malaysia, Serdang 43400, Selangor Darul Ehsan, Malaysia
- David Wells Newman
  - School of Biological Sciences, Faculty of Biology Medicine and Health, MAHSC, University of Manchester, Michael Smith Building, Oxford Road, Manchester M13 9PT, United Kingdom
- Michael G. Nelson
  - School of Biological Sciences, Faculty of Biology Medicine and Health, MAHSC, University of Manchester, Michael Smith Building, Oxford Road, Manchester M13 9PT, United Kingdom
- Ronan O’cualain
  - School of Biological Sciences, Faculty of Biology Medicine and Health, MAHSC, University of Manchester, Michael Smith Building, Oxford Road, Manchester M13 9PT, United Kingdom
- Simon J. Hubbard
  - School of Biological Sciences, Faculty of Biology Medicine and Health, MAHSC, University of Manchester, Michael Smith Building, Oxford Road, Manchester M13 9PT, United Kingdom
8
Bruna T, Lomsadze A, Borodovsky M. A new gene finding tool GeneMark-ETP significantly improves the accuracy of automatic annotation of large eukaryotic genomes. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2023.01.13.524024. [PMID: 36711453 PMCID: PMC9882169 DOI: 10.1101/2023.01.13.524024] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Indexed: 01/17/2023]
Abstract
Large-scale genomic initiatives, such as the Earth BioGenome Project, require efficient methods for eukaryotic genome annotation. Here we present an automatic gene finder, GeneMark-ETP, integrating genomic-, transcriptomic- and protein-derived evidence that has been developed with a focus on large plant and animal genomes. GeneMark-ETP first identifies genomic loci where extrinsic data is sufficient for making gene predictions with 'high confidence'. The genes situated in the genomic space between the high confidence genes are predicted in the next stage. The set of high confidence genes serves as an initial training set for the statistical model. Further on, the model parameters are iteratively updated in the rounds of gene prediction and parameter re-estimation. Upon reaching convergence, GeneMark-ETP makes the final predictions and delivers the whole complement of predicted genes. GeneMark-ETP outperformed gene finders using a single type of extrinsic evidence. Comparisons with gene finders utilizing both transcript- and protein-derived extrinsic evidence, MAKER2, and TSEBRA, demonstrated that GeneMark-ETP delivered state-of-the-art gene prediction accuracy with the margin of outperforming existing approaches increasing in its applications to larger and more complex eukaryotic genomes.
Affiliation(s)
- Tomas Bruna
  - School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA 30332, USA
- Alexandre Lomsadze
  - Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
- Mark Borodovsky
  - School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA 30332, USA
  - Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
  - School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
9
Scalzitti N, Miralavy I, Korenchan DE, Farrar CT, Gilad AA, Banzhaf W. Computational peptide discovery with a genetic programming approach. J Comput Aided Mol Des 2024; 38:17. [PMID: 38570405 PMCID: PMC11416381 DOI: 10.1007/s10822-024-00558-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2023] [Accepted: 03/07/2024] [Indexed: 04/05/2024]
Abstract
The development of peptides for therapeutic targets or biomarkers for disease diagnosis is a challenging task in protein engineering. Current approaches are tedious, often time-consuming and require complex laboratory data due to the vast search spaces that need to be considered. In silico methods can accelerate research and substantially reduce costs. Evolutionary algorithms are a promising approach for exploring large search spaces and can facilitate the discovery of new peptides. This study presents the development and use of a new variant of the genetic-programming-based POET algorithm, called POETRegex, where individuals are represented by a list of regular expressions. This algorithm was trained on a small curated dataset and employed to generate new peptides improving the sensitivity of peptides in magnetic resonance imaging with chemical exchange saturation transfer (CEST). The resulting model achieves a performance gain of 20% over the initial POET models and is able to predict a candidate peptide with a 58% performance increase compared to the gold-standard peptide. By combining the power of genetic programming with the flexibility of regular expressions, new peptide targets were identified that improve the sensitivity of detection by CEST. This approach provides a promising research direction for the efficient identification of peptides with therapeutic or diagnostic potential.
Affiliation(s)
- Nicolas Scalzitti
  - BEACON Center of Evolution in Action, Michigan State University, East Lansing, MI, USA
  - Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA
- Iliya Miralavy
  - BEACON Center of Evolution in Action, Michigan State University, East Lansing, MI, USA
  - Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA
- David E Korenchan
  - Athinoula A. Martinos Center for Biomedical Imaging, Department of Radiology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
- Christian T Farrar
  - Athinoula A. Martinos Center for Biomedical Imaging, Department of Radiology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
- Assaf A Gilad
  - BEACON Center of Evolution in Action, Michigan State University, East Lansing, MI, USA
  - Department of Chemical Engineering, Michigan State University, East Lansing, MI, USA
  - Department of Radiology, Michigan State University, East Lansing, MI, USA
- Wolfgang Banzhaf
  - BEACON Center of Evolution in Action, Michigan State University, East Lansing, MI, USA
  - Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA
10
Martinů J, Tarabai H, Štefka J, Hypša V. Highly Resolved Genomes of Two Closely Related Lineages of the Rodent Louse Polyplax serrata with Different Host Specificities. Genome Biol Evol 2024; 16:evae045. [PMID: 38478715 PMCID: PMC10972687 DOI: 10.1093/gbe/evae045] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 02/27/2024] [Indexed: 04/01/2024] Open
Abstract
Sucking lice of the parvorder Anoplura are permanent ectoparasites with specific lifestyle and highly derived features. Currently, genomic data are only available for a single species, the human louse Pediculus humanus. Here, we present genomes of two distinct lineages, with different host spectra, of a rodent louse Polyplax serrata. Genomes of these ecologically different lineages are closely similar in gene content and display a conserved order of genes, with the exception of a single translocation. Compared with P. humanus, the P. serrata genomes are noticeably larger (139 vs. 111 Mbp) and encode a higher number of genes. Similar to P. humanus, they are reduced in sensory-related categories such as vision and olfaction. Utilizing genome-wide data, we perform phylogenetic reconstruction and evolutionary dating of the P. serrata lineages. Obtained estimates reveal their relatively deep divergence (∼6.5 Mya), comparable with the split between the human and chimpanzee lice P. humanus and Pediculus schaeffi. This supports the view that the P. serrata lineages are likely to represent two cryptic species with different host spectra. Historical demographies show glaciation-related population size (Ne) reduction, but recent restoration of Ne was seen only in the less host-specific lineage. Together with the louse genomes, we analyze genomes of their bacterial symbiont Legionella polyplacis and evaluate their potential complementarity in synthesis of amino acids and B vitamins. We show that both systems, Polyplax/Legionella and Pediculus/Riesia, display almost identical patterns, with symbionts involved in synthesis of B vitamins but not amino acids.
Affiliation(s)
- Jana Martinů
  - Department of Parasitology, Faculty of Science, University of South Bohemia, České Budějovice, Czech Republic
- Hassan Tarabai
  - Department of Parasitology, Faculty of Science, University of South Bohemia, České Budějovice, Czech Republic
  - Central European Institute of Technology (CEITEC), University of Veterinary Sciences, Brno, Czech Republic
- Jan Štefka
  - Department of Parasitology, Faculty of Science, University of South Bohemia, České Budějovice, Czech Republic
  - Institute of Parasitology, Biology Centre, The Czech Academy of Sciences, České Budějovice, Czech Republic
- Václav Hypša
  - Department of Parasitology, Faculty of Science, University of South Bohemia, České Budějovice, Czech Republic
  - Institute of Parasitology, Biology Centre, The Czech Academy of Sciences, České Budějovice, Czech Republic
11
Kwon T, Hovde BT. Global characterization of biosynthetic gene clusters in non-model eukaryotes using domain architectures. Sci Rep 2024; 14:1534. [PMID: 38233413 PMCID: PMC10794256 DOI: 10.1038/s41598-023-50095-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/09/2023] [Accepted: 12/15/2023] [Indexed: 01/19/2024] Open
Abstract
The majority of pharmaceuticals are derived from natural products, bioactive compounds naturally synthesized by organisms to provide evolutionary advantages. Although the rich evolutionary history of eukaryotic algal species implicates a high potential for natural product-based drug discovery, it remains largely untouched. This study investigates 2762 putative biosynthetic gene clusters (BGCs) from 212 eukaryotic algal genomes. To analyze a vast set of structurally diverse BGCs, we employed comparative analysis based on the vectorization of biosynthetic domains, referred to as biosynthetic domain architecture (BDA). By characterizing core biosynthetic machineries through BDA, we identified key BDAs of modular BGCs in diverse eukaryotes and introduced 16 candidate modular BGCs with similar BDAs to previously validated BGCs. This study provides a global characterization of eukaryotic algal BGCs, offering an alternative to laborious manual curation for BGC prioritization.
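One way to picture the vectorization idea in this abstract is the minimal sketch below. It is not the authors' implementation: the abstract does not state how domains are encoded or compared, so the count-vector encoding, the cosine similarity metric, the cluster names, and the example domain lists are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def domain_vector(domains, vocabulary):
    """Count how often each vocabulary domain occurs in one cluster."""
    counts = Counter(domains)
    return [counts[d] for d in vocabulary]


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length count vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Hypothetical clusters: names and domain orders are invented for this sketch.
clusters = {
    "bgc_pks_1": ["KS", "AT", "ACP", "KS", "AT", "KR", "ACP", "TE"],
    "bgc_pks_2": ["KS", "AT", "KR", "ACP", "TE"],
    "bgc_nrps_1": ["C", "A", "PCP", "C", "A", "PCP", "TE"],
}

# Shared vocabulary so every cluster maps to a vector of the same length.
vocabulary = sorted({d for doms in clusters.values() for d in doms})
vectors = {name: domain_vector(doms, vocabulary) for name, doms in clusters.items()}

pks_vs_pks = cosine_similarity(vectors["bgc_pks_1"], vectors["bgc_pks_2"])
pks_vs_nrps = cosine_similarity(vectors["bgc_pks_1"], vectors["bgc_nrps_1"])
```

Under these assumptions the two polyketide-like clusters score much closer to each other than to the nonribosomal-peptide-like one, which is the kind of grouping a domain-architecture comparison is meant to surface. A count vector discards domain order, so an order-aware encoding would be needed where module arrangement matters.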
Affiliation(s)
- Taehyung Kwon
  - Genomics and Bioanalytics Group, Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM, USA
- Blake T Hovde
  - Genomics and Bioanalytics Group, Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM, USA
12
Tournayre J, Polonais V, Wawrzyniak I, Akossi RF, Parisot N, Lerat E, Delbac F, Souvignet P, Reichstadt M, Peyretaillade E. MicroAnnot: A Dedicated Workflow for Accurate Microsporidian Genome Annotation. Int J Mol Sci 2024; 25:880. [PMID: 38255958 PMCID: PMC10815200 DOI: 10.3390/ijms25020880] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2023] [Revised: 12/29/2023] [Accepted: 01/04/2024] [Indexed: 01/24/2024] Open
Abstract
With nearly 1700 species, Microsporidia represent a group of obligate intracellular eukaryotes with veterinary, economic and medical impacts. To help understand the biological functions of these microorganisms, complete genome sequencing is routinely used. Nevertheless, the proper prediction of their gene catalogue is challenging due to their taxon-specific evolutionary features. As innovative genome annotation strategies are needed to obtain a representative snapshot of the overall lifestyle of these parasites, the MicroAnnot tool, a dedicated workflow for microsporidian sequence annotation using data from curated databases of accurately annotated microsporidian genes, has been developed. Furthermore, specific modules have been implemented to perform small gene (<300 bp) and transposable element identification. Finally, functional annotation was performed using the signature-based InterProScan software. MicroAnnot's accuracy has been verified by the re-annotation of four microsporidian genomes for which structural annotation had previously been validated. With its comparative approach and transcriptional signal identification method, MicroAnnot provides an accurate prediction of translation initiation sites, an efficient identification of transposable elements, as well as high specificity and sensitivity for microsporidian genes, including those under 300 bp.
Affiliation(s)
- Jérémy Tournayre
  - INRAE, UMR Herbivores, Université Clermont Auvergne, VetAgro Sup, 63122 Saint-Genès-Champanelle, France
- Valérie Polonais
  - LMGE, CNRS, Université Clermont Auvergne, 63000 Clermont-Ferrand, France
- Ivan Wawrzyniak
  - LMGE, CNRS, Université Clermont Auvergne, 63000 Clermont-Ferrand, France
- Reginald Florian Akossi
  - LMGE, CNRS, Université Clermont Auvergne, 63000 Clermont-Ferrand, France
- Nicolas Parisot
  - UMR 203, BF2I, INRAE, INSA Lyon, Université de Lyon, 69621 Villeurbanne, France
- Emmanuelle Lerat
  - VAS, CNRS, UMR5558, LBBE, Université Claude Bernard Lyon 1, 69622 Villeurbanne, France
- Frédéric Delbac
  - LMGE, CNRS, Université Clermont Auvergne, 63000 Clermont-Ferrand, France
- Pierre Souvignet
  - INRAE, UMR Herbivores, Université Clermont Auvergne, VetAgro Sup, 63122 Saint-Genès-Champanelle, France
- Matthieu Reichstadt
  - INRAE, UMR Herbivores, Université Clermont Auvergne, VetAgro Sup, 63122 Saint-Genès-Champanelle, France
- Eric Peyretaillade
  - LMGE, CNRS, Université Clermont Auvergne, 63000 Clermont-Ferrand, France
Collapse
|
13
|
Nørrevang AF, Shabala S, Palmgren M. A two-sequence motif-based method for the inventory of gene families in fragmented and poorly annotated genome sequences. BMC Genomics 2024; 25:26. [PMID: 38172704 PMCID: PMC10763278 DOI: 10.1186/s12864-023-09859-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/01/2023] [Accepted: 11/29/2023] [Indexed: 01/05/2024] Open
Abstract
Databases of genome sequences are growing exponentially, but, in some cases, assembly is incomplete and genes are poorly annotated. For evolutionary studies, it is important to identify all members of a given gene family in a genome. We developed a method for identifying most, if not all, members of a gene family from raw genomes in which assembly is of low quality, using the P-type ATPase superfamily as an example. The method is based on the translation of an entire genome in all six reading frames and the co-occurrence of two family-specific sequence motifs that are in close proximity to each other. To test the method's usability, we first used it to identify P-type ATPase members in the high-quality annotated genome of barley (Hordeum vulgare). Subsequently, after successfully identifying plasma membrane H+-ATPase family members (P3A ATPases) in various plant genomes of varying quality, we tested the hypothesis that the number of P3A ATPases correlates with the ability of the plant to tolerate saline conditions. In 19 genomes of glycophytes and halophytes, the total number of P3A ATPase genes was found to vary from 7 to 22, but no significant difference was found between the two groups. The method successfully identified P-type ATPase family members in raw genomes that are poorly assembled.
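The core idea of this abstract (translate the genome in all six reading frames, then look for two family-specific motifs lying close together) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `max_gap` parameter, and the example motif patterns in the usage note are assumptions made for illustration only.

```python
import re

# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the reverse complement; characters other than ACGT are kept."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate codon by codon; unknown codons (e.g. containing N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Yield (strand, offset, protein) for all six reading frames."""
    dna = dna.upper()
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            yield strand, offset, translate(seq[offset:])


def motif_pairs(dna, motif_a, motif_b, max_gap=60):
    """Report frames where motif_b starts within max_gap residues after motif_a.

    motif_a and motif_b are regular expressions over amino-acid letters;
    max_gap is a hypothetical proximity cutoff, not a value from the paper.
    """
    hits = []
    for strand, offset, protein in six_frames(dna):
        for a in re.finditer(motif_a, protein):
            for b in re.finditer(motif_b, protein):
                gap = b.start() - a.end()
                if 0 <= gap <= max_gap:
                    hits.append((strand, offset, a.start(), b.start()))
    return hits
```

For example, `motif_pairs(genome, "DKTGT", "GDG.ND", max_gap=10)` would list every frame in which the two placeholder patterns occur at most 10 residues apart. Because no gene models are needed, a scan of this kind can be applied to unannotated or fragmented assemblies.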
Affiliation(s)
- Anton Frisgaard Nørrevang
  - NovoCrops Center, Department of Plant and Environmental Sciences, University of Copenhagen, Thorvaldsensvej 40, Frederiksberg C, DK-1871, Denmark
- Sergey Shabala
  - School of Biological Sciences, University of Western Australia, Crawley, WA6009, Australia
  - International Research Centre for Environmental Membrane Biology, Foshan University, Foshan, 528000, China
- Michael Palmgren
  - NovoCrops Center, Department of Plant and Environmental Sciences, University of Copenhagen, Thorvaldsensvej 40, Frederiksberg C, DK-1871, Denmark
|
14
|
Stepankiw N, Yang AWH, Hughes TR. The human genome contains over a million autonomous exons. Genome Res 2023; 33:1865-1878. [PMID: 37945377 PMCID: PMC10760453 DOI: 10.1101/gr.277792.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/19/2023] [Accepted: 10/27/2023] [Indexed: 11/12/2023]
Abstract
Mammalian mRNA and lncRNA exons are often small compared to introns. The exon definition model predicts that exons splice autonomously, dependent on proximal exon sequence features, explaining their delineation within large introns. This model has not been examined on a genome-wide scale, however, leaving open the question of how often mRNA and lncRNA exons are autonomous. It is also unknown how frequently such exons can arise by chance. Here, we directly assayed large fragments (500-1000 bp) of the human genome by exon trapping, which detects exons spliced into a heterologous transgene, here designed with a large intron context. We define the trapped exons as "autonomous." We obtained ∼1.25 million trapped exons, including most known mRNA and well-annotated lncRNA internal exons, demonstrating that human exons are predominantly autonomous. mRNA exons are trapped with the highest efficiency. Nearly a million of the trapped exons are unannotated, most located in intergenic regions and antisense to mRNA, with depletion from the forward strand of introns. These exons are not conserved, suggesting they are nonfunctional and arose from random mutations. They are nonetheless highly enriched with known splicing promoting sequence features that delineate known exons. Novel autonomous exons are more numerous than annotated lncRNA exons, and computational models also indicate they will occur with similar frequency in any randomly generated sequence. These results show that most human coding exons splice autonomously, and provide an explanation for the existence of many unconserved lncRNAs, as well as a new annotation and inclusion levels of spliceable loci in the human genome.
Affiliation(s)
- Nicholas Stepankiw
  - Donnelly Centre, University of Toronto, Toronto, Ontario, Canada M5S 3E1
- Ally W H Yang
  - Donnelly Centre, University of Toronto, Toronto, Ontario, Canada M5S 3E1
- Timothy R Hughes
  - Donnelly Centre, University of Toronto, Toronto, Ontario, Canada M5S 3E1
  - Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada M5S 1A8
|
15
|
Kirkland TN, Beyhan S, Stajich JE. Evaluation of Different Gene Prediction Tools in Coccidioides immitis. J Fungi (Basel) 2023; 9:1094. [PMID: 37998899 PMCID: PMC10672684 DOI: 10.3390/jof9111094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2023] [Revised: 11/01/2023] [Accepted: 11/07/2023] [Indexed: 11/25/2023] Open
Abstract
Gene prediction is required to obtain optimal biologically meaningful information from genomic sequences, but automated gene prediction software is imperfect. In this study, we compare the original annotation of the Coccidioides immitis RS genome (the reference strain of C. immitis) to annotations using the Funannotate and Augustus genome prediction pipelines. A total of 25% of the originally predicted genes (denoted CIMG) were not found in either the Funannotate or Augustus predictions. A comparison of Funannotate and Augustus predictions also found overlapping but not identical sets of genes. The predicted genes found only in the original annotation (referred to as CIMG-unique) were less likely to have a meaningful functional annotation and had fewer orthologs and homologs in other fungi than all CIMG genes predicted by the original annotation. The CIMG-unique genes were also more likely to be lineage-specific and poorly expressed. In addition, the CIMG-unique genes were found in clusters and tended to be more frequently associated with transposable elements than all CIMG-predicted genes. The CIMG-unique genes were more likely to have experimentally determined transcription start sites that were further away from the originally predicted transcription start sites, and experimentally determined initial transcription was less likely to result in stable CIMG-unique transcripts. A sample of CIMG-unique genes that were relatively well expressed and differentially expressed in mycelia and spherules was inspected in a genome browser, and the structure of only about half of them was found to be supported by RNA-seq data. These data suggest that some of the CIMG-unique genes are not authentic gene predictions. Genes that were predicted only by the Funannotate pipeline were also less likely to have a meaningful functional annotation, were shorter, and were less well expressed than all the genes predicted by Funannotate. C. immitis genes predicted by more than one annotation are more likely to have predicted functions, many orthologs and homologs, and be well expressed. Lineage-specific genes are relatively uncommon in this group. These data emphasize the importance and limitations of gene prediction software and suggest that improvements to the annotation of the C. immitis genome should be considered.
Affiliation(s)
- Theo N. Kirkland
  - Department of Medicine, Division of Infectious Disease, School of Medicine, University of California, La Jolla, CA 92093, USA
  - Department of Pathology, School of Medicine, University of California, La Jolla, CA 92093, USA
- Sinem Beyhan
  - Department of Medicine, Division of Infectious Disease, School of Medicine, University of California, La Jolla, CA 92093, USA
  - Department of Infectious Diseases, J. Craig Venter Institute, La Jolla, CA 92037, USA
- Jason E. Stajich
  - Department of Microbiology and Plant Pathology, Institute for Integrative Genome Biology, University of California—Riverside, Riverside, CA 92521, USA
|
16
|
Scalzitti N, Miralavy I, Korenchan DE, Farrar CT, Gilad AA, Banzhaf W. Computational Peptide Discovery with a Genetic Programming Approach. RESEARCH SQUARE 2023:rs.3.rs-3307450. [PMID: 37693481 PMCID: PMC10491332 DOI: 10.21203/rs.3.rs-3307450/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/12/2023]
Abstract
Background: The development of peptides for therapeutic targets or biomarkers for disease diagnosis is a challenging task in protein engineering. Current approaches are tedious, often time-consuming and require complex laboratory data due to the vast search space. In silico methods can accelerate research and substantially reduce costs. Evolutionary algorithms are a promising approach for exploring large search spaces and facilitating the discovery of new peptides. Results: This study presents the development and use of a variant of the initial POET algorithm, called POETRegex, which is based on genetic programming, where individuals are represented by a list of regular expressions. The program was trained on a small curated dataset and employed to predict new peptides that can improve detection sensitivity in magnetic resonance imaging using chemical exchange saturation transfer (CEST). The resulting model achieves a performance gain of 20% over the initial POET variant and is able to predict a candidate peptide with a 58% performance increase compared to the gold-standard peptide. Conclusions: By combining the power of genetic programming with the flexibility of regular expressions, new potential peptide targets were identified to improve the sensitivity of detection by CEST. This approach provides a promising research direction for the efficient identification of peptides with therapeutic or diagnostic potential.
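To make the representation described in this abstract concrete, the following Python sketch treats an individual as a list of regular expressions and uses it to score and rank peptides. The pairing of each expression with a numeric weight, the scoring rule (sum of the weights of matching expressions), and all names and patterns are assumptions for illustration; the abstract does not specify how the published program scores peptides.

```python
import re


def score_peptide(individual, peptide):
    """Sum the weights of all regex rules that match the peptide.

    `individual` is a list of (pattern, weight) pairs; both the weights and
    this additive scoring rule are hypothetical.
    """
    return sum(
        weight for pattern, weight in individual if re.search(pattern, peptide)
    )


def rank_peptides(individual, peptides):
    """Return peptides sorted from highest to lowest score."""
    return sorted(peptides, key=lambda p: score_peptide(individual, p), reverse=True)


# A made-up individual: three rules over amino-acid letters.
example_individual = [
    (r"K{2,}", 1.5),   # two or more consecutive lysines
    (r"[RK]T", 0.5),   # arginine or lysine followed by threonine
    (r"W", -1.0),      # any tryptophan is penalised
]
```

In a genetic programming loop, individuals of this kind would be mutated and recombined, with their fitness measured by how well such scores agree with the measured values in the training set.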
Affiliation(s)
- Nicolas Scalzitti
  - BEACON Center of Evolution in Action, Michigan State University, East Lansing, MI, USA
  - Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA
- Iliya Miralavy
  - BEACON Center of Evolution in Action, Michigan State University, East Lansing, MI, USA
  - Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA
- David E. Korenchan
  - Athinoula A. Martinos Center for Biomedical Imaging, Department of Radiology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
- Christian T. Farrar
  - Athinoula A. Martinos Center for Biomedical Imaging, Department of Radiology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
- Assaf A. Gilad
  - BEACON Center of Evolution in Action, Michigan State University, East Lansing, MI, USA
  - Department of Chemical Engineering, Michigan State University, East Lansing, MI, USA
  - Department of Radiology, Michigan State University, East Lansing, MI, USA
- Wolfgang Banzhaf
  - BEACON Center of Evolution in Action, Michigan State University, East Lansing, MI, USA
  - Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA
|
17
|
Baker L, David C, Jacobs DJ. Ab initio gene prediction for protein-coding regions. BIOINFORMATICS ADVANCES 2023; 3:vbad105. [PMID: 37638212 PMCID: PMC10448985 DOI: 10.1093/bioadv/vbad105] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 02/15/2023] [Revised: 07/04/2023] [Accepted: 08/08/2023] [Indexed: 08/29/2023]
Abstract
Motivation: Ab initio gene prediction in nonmodel organisms is a difficult task. While many ab initio methods have been developed, their average accuracy over long segments of a genome, and especially when assessed over a wide range of species, generally yields results with sensitivity and specificity levels in the low 60% range. A common weakness of most methods is the tendency to learn patterns that are species-specific to varying degrees. The need exists for methods to extract genetic features that can distinguish coding and noncoding regions that are not sensitive to specific organism characteristics. Results: A new method based on a neural network (NN) that uses a collection of sensors to create input features is presented. It is shown that accurate predictions are achieved even when trained on organisms that are significantly different phylogenetically than test organisms. A consensus prediction algorithm for a CoDing Sequence (CDS) is subsequently applied to the first nucleotide level of NN predictions that boosts accuracy through a data-driven procedure that optimizes a CDS/non-CDS threshold. An aggregate accuracy benchmark at the nucleotide level shows that this new approach performs better than existing ab initio methods, while requiring significantly less training data. Availability and implementation: https://github.com/BioMolecularPhysicsGroup-UNCC/MachineLearning.
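The data-driven choice of a CDS/non-CDS threshold mentioned in this abstract can be sketched as a simple scan over candidate cutoffs on per-nucleotide scores. The sketch below is an assumption-laden illustration, not the published procedure: the objective (the mean of sensitivity and specificity) and every name in it are hypothetical.

```python
def best_threshold(scores, labels, candidates=None):
    """Pick the cutoff that maximises the mean of sensitivity and specificity.

    scores: per-nucleotide scores, higher meaning "more likely coding".
    labels: 1 for CDS nucleotides, 0 for non-CDS nucleotides.
    A nucleotide is called CDS when its score is >= the threshold.
    The balanced-accuracy objective is an illustrative choice.
    """
    if candidates is None:
        candidates = sorted(set(scores))
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need both CDS and non-CDS examples")

    best = None
    for threshold in candidates:
        true_pos = sum(1 for s, y in zip(scores, labels) if y and s >= threshold)
        true_neg = sum(1 for s, y in zip(scores, labels) if not y and s < threshold)
        balanced = 0.5 * (true_pos / positives + true_neg / negatives)
        if best is None or balanced > best[1]:
            best = (threshold, balanced)
    return best
```

Calling `best_threshold(scores, labels)` on a labelled training genome returns the chosen cutoff together with the balanced accuracy it achieves, and the cutoff can then be applied unchanged to scores from other genomes.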
Affiliation(s)
- Lonnie Baker
  - Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, NC 28223, United States
- Charles David
  - Department of Bioinformatics, The New Zealand Institute for Plant and Food Research, Lincoln 7608, New Zealand
- Donald J Jacobs
  - Department of Physics and Optical Science, University of North Carolina at Charlotte, NC 28223, United States
  - UNC Charlotte School of Data Science, University of North Carolina at Charlotte, NC 28223, United States
|
18
|
Glick L, Mayrose I. The Effect of Methodological Considerations on the Construction of Gene-Based Plant Pan-genomes. Genome Biol Evol 2023; 15:evad121. [PMID: 37401440 PMCID: PMC10340445 DOI: 10.1093/gbe/evad121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/29/2023] [Revised: 06/21/2023] [Accepted: 06/28/2023] [Indexed: 07/05/2023] Open
Abstract
Pan-genomics is an emerging approach for studying the genetic diversity within plant populations. In contrast to common resequencing studies that compare whole genome sequencing data with a single reference genome, the construction of a pan-genome (PG) involves the direct comparison of multiple genomes to one another, thereby enabling the detection of genomic sequences and genes not present in the reference, as well as the analysis of gene content diversity. Although multiple studies describing PGs of various plant species have been published in recent years, a better understanding regarding the effect of the computational procedures used for PG construction could guide researchers in making more informed methodological decisions. Here, we examine the effect of several key methodological factors on the obtained gene pool and on gene presence-absence detections by constructing and comparing multiple PGs of Arabidopsis thaliana and cultivated soybean, as well as conducting a meta-analysis on published PGs. These factors include the construction method, the sequencing depth, and the extent of input data used for gene annotation. We observe substantial differences between PGs constructed using three common procedures (de novo assembly and annotation, map-to-pan, and iterative assembly) and that results are dependent on the extent of the input data. Specifically, we report low agreement between the gene content inferred using different procedures and input data. Our results should increase the awareness of the community to the consequences of methodological decisions made during the process of PG construction and emphasize the need for further investigation of commonly applied methodologies.
Affiliation(s)
- Lior Glick
  - Department of Life Sciences, School of Plant Sciences and Food Security, Tel-Aviv University, Tel Aviv, Israel
- Itay Mayrose
  - Department of Life Sciences, School of Plant Sciences and Food Security, Tel-Aviv University, Tel Aviv, Israel
|
19
|
Zhou S, Xia T, Gao X, Lyu T, Wang L, Wang X, Shi L, Dong Y, Zhang H. A high-quality chromosomal-level genome assembly of Greater Scaup (Aythya marila). Sci Data 2023; 10:254. [PMID: 37142629 PMCID: PMC10160052 DOI: 10.1038/s41597-023-02142-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/28/2022] [Accepted: 04/11/2023] [Indexed: 05/06/2023] Open
Abstract
Aythya marila is one of the few species of Anatidae, and the only Aythya species, with a circumpolar distribution. However, genetic research on this species is relatively scarce. In this study, we report the first high-quality chromosome-level genome assembly of A. marila. The genome was assembled using Nanopore long reads, with errors corrected using Illumina short reads, giving a final genome size of 1.14 Gb, a scaffold N50 of 85.44 Mb, and a contig N50 of 32.46 Mb. A total of 106 contigs were clustered and ordered onto 35 chromosomes based on Hi-C data, covering approximately 98.28% of the genome. BUSCO assessment showed that 97.0% of the highly conserved genes in aves_odb10 were present intact in the genome assembly. In addition, a total of 154.94 Mb of repetitive sequences were identified. A total of 15,953 protein-coding genes were predicted in the genome, and 98.96% of genes were functionally annotated. This genome will be a valuable resource for future genetic diversity and genomics studies of A. marila.
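The scaffold and contig N50 values quoted in this abstract follow the usual definition: the length L such that sequences of length at least L together make up at least half of the total assembly. A minimal Python sketch of that calculation is given below; the function name and the toy lengths in the usage note are illustrative and are not data from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are visited from longest to shortest, and the length at which
    the running total first reaches half of the assembly size is returned.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no sequence lengths given")
```

For the toy input `[2, 2, 2, 3, 3, 4, 8, 8]` the total is 32, and the two longest sequences already sum to 16, so `n50` returns 8. The same function applies to contig lengths and scaffold lengths alike.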
Affiliation(s)
- Shengyang Zhou
  - College of Life Sciences, Qufu Normal University, Qufu, 273165, Shandong, China
- Tian Xia
  - College of Life Sciences, Qufu Normal University, Qufu, 273165, Shandong, China
- Xiaodong Gao
  - College of Life Sciences, Qufu Normal University, Qufu, 273165, Shandong, China
- Tianshu Lyu
  - College of Life Sciences, Qufu Normal University, Qufu, 273165, Shandong, China
- Lidong Wang
  - College of Life Sciences, Qufu Normal University, Qufu, 273165, Shandong, China
- Xibao Wang
  - College of Life Sciences, Qufu Normal University, Qufu, 273165, Shandong, China
- Lupeng Shi
  - College of Life Sciences, Qufu Normal University, Qufu, 273165, Shandong, China
- Yuehuan Dong
  - College of Life Sciences, Qufu Normal University, Qufu, 273165, Shandong, China
- Honghai Zhang
  - College of Life Sciences, Qufu Normal University, Qufu, 273165, Shandong, China
|
20
|
Mayer C, Vogt A, Uslu T, Scalzitti N, Chennen K, Poch O, Thompson JD. CeGAL: Redefining a Widespread Fungal-Specific Transcription Factor Family Using an In Silico Error-Tracking Approach. J Fungi (Basel) 2023; 9:jof9040424. [PMID: 37108879 PMCID: PMC10141177 DOI: 10.3390/jof9040424] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2023] [Revised: 03/21/2023] [Accepted: 03/28/2023] [Indexed: 03/31/2023] Open
Abstract
In fungi, the most abundant transcription factor (TF) class contains a fungal-specific ‘GAL4-like’ Zn2C6 DNA binding domain (DBD), while the second class contains another fungal-specific domain, known as ‘fungal_trans’ or middle homology domain (MHD), whose function remains largely uncharacterized. Remarkably, almost a third of MHD-containing TFs in public sequence databases apparently lack DNA binding activity, since they are not predicted to contain a DBD. Here, we reassess the domain organization of these ‘MHD-only’ proteins using an in silico error-tracking approach. In a large-scale analysis of ~17,000 MHD-only TF sequences present in all fungal phyla except Microsporidia and Cryptomycota, we show that the vast majority (>90%) result from genome annotation errors and we are able to predict a new DBD sequence for 14,261 of them. Most of these sequences correspond to a Zn2C6 domain (82%), with a small proportion of C2H2 domains (4%) found only in Dikarya. Our results contradict previous findings that the MHD-only TF are widespread in fungi. In contrast, we show that they are exceptional cases, and that the fungal-specific Zn2C6–MHD domain pair represents the canonical domain signature defining the most predominant fungal TF family. We call this family CeGAL, after the highly characterized members: Cep3, whose 3D structure is determined, and GAL4, a eukaryotic TF archetype. We believe that this will not only improve the annotation and classification of the Zn2C6 TF but will also provide critical guidance for future fungal gene regulatory network analyses.
Affiliation(s)
- Claudine Mayer
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000 Strasbourg, France
  - Faculté des Sciences, Université Paris Cité, UFR Sciences du Vivant, 75013 Paris, France
  - Correspondence: (C.M.); (J.D.T.)
- Arthur Vogt
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000 Strasbourg, France
- Tuba Uslu
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000 Strasbourg, France
- Nicolas Scalzitti
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000 Strasbourg, France
- Kirsley Chennen
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000 Strasbourg, France
- Olivier Poch
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000 Strasbourg, France
- Julie D. Thompson
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000 Strasbourg, France
  - Correspondence: (C.M.); (J.D.T.)
|
21
|
Khodji H, Collet P, Thompson JD, Jeannin-Girardon A. De-MISTED: Image-based classification of erroneous multiple sequence alignments using convolutional neural networks. APPL INTELL 2023. [DOI: 10.1007/s10489-022-04390-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/11/2023]
|
22
|
Brajkovic S, Rugen N, Agius C, Berner N, Eckert S, Sakhteman A, Schwechheimer C, Kuster B. Getting Ready for Large-Scale Proteomics in Crop Plants. Nutrients 2023; 15:nu15030783. [PMID: 36771489 PMCID: PMC9921824 DOI: 10.3390/nu15030783] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/04/2023] [Revised: 01/27/2023] [Accepted: 02/01/2023] [Indexed: 02/05/2023] Open
Abstract
Plants are an indispensable cornerstone of sustainable global food supply. While immense progress has been made in decoding the genomes of crops in recent decades, the composition of their proteomes, the entirety of all expressed proteins of a species, is virtually unknown. In contrast to the model plant Arabidopsis thaliana, proteomic analyses of crop plants have often been hindered by the presence of extreme concentrations of secondary metabolites such as pigments, phenolic compounds, lipids, carbohydrates or terpenes. As a consequence, crop proteomic experiments have, thus far, required individually optimized protein extraction protocols to obtain samples of acceptable quality for downstream analysis by liquid chromatography tandem mass spectrometry (LC-MS/MS). In this article, we present a universal protein extraction protocol originally developed for gel-based experiments and combine it with an automated single-pot solid-phase-enhanced sample preparation (SP3) protocol on a liquid handling robot to prepare high-quality samples for proteomic analysis of crop plants. We also report an automated offline peptide separation protocol and optimized micro-LC-MS/MS conditions that enable the identification and quantification of ~10,000 proteins from plant tissue within 6 h of instrument time. We illustrate the utility of the workflow by analyzing the proteomes of mature tomato fruits to an unprecedented depth. The data demonstrate the robustness of the approach, which we propose for use in upcoming large-scale projects that aim to map crop tissue proteomes.
Affiliation(s)
- Sarah Brajkovic
  - Chair of Proteomics and Bioanalytics, Technical University of Munich (TUM), 85354 Freising, Germany
- Nils Rugen
  - Chair of Proteomics and Bioanalytics, Technical University of Munich (TUM), 85354 Freising, Germany
  - Institute of Plant Genetics, Leibniz University Hannover, 30167 Hannover, Germany
- Carlos Agius
  - Chair of Plant Systems Biology, Technical University of Munich (TUM), 85354 Freising, Germany
- Nicola Berner
  - Chair of Proteomics and Bioanalytics, Technical University of Munich (TUM), 85354 Freising, Germany
- Stephan Eckert
  - Chair of Proteomics and Bioanalytics, Technical University of Munich (TUM), 85354 Freising, Germany
- Amirhossein Sakhteman
  - Chair of Proteomics and Bioanalytics, Technical University of Munich (TUM), 85354 Freising, Germany
- Claus Schwechheimer
  - Chair of Plant Systems Biology, Technical University of Munich (TUM), 85354 Freising, Germany
- Bernhard Kuster
  - Chair of Proteomics and Bioanalytics, Technical University of Munich (TUM), 85354 Freising, Germany
  - Correspondence:
|
23
|
Addressing the pervasive scarcity of structural annotation in eukaryotic algae. Sci Rep 2023; 13:1687. [PMID: 36717613 PMCID: PMC9886943 DOI: 10.1038/s41598-023-27881-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 04/13/2022] [Accepted: 01/09/2023] [Indexed: 02/01/2023] Open
Abstract
Despite a continuous increase in algal genome sequencing, structural annotations of most algal genome assemblies remain unavailable. This pervasive scarcity of genome annotation has restricted rigorous investigation of these genomic resources and may have precipitated misleading biological interpretations. However, the annotation process for eukaryotic algal species is often challenging as genomic resources and transcriptomic evidence are not always available. To address this challenge, we benchmark the cutting-edge gene prediction methods that can be generalized for a broad range of non-model eukaryotes. Using the most accurate methods selected based on high-quality algal genomes, we predict structural annotations for 135 unannotated algal genomes. Using previously available genomic data pooled together with new data obtained in this study, we identified the core orthologous genes and the multi-gene phylogeny of eukaryotic algae, including of previously unexplored algal species. This study not only provides a benchmark for the use of structural annotation methods on a variety of non-model eukaryotes, but also compensates for missing data in the current spectrum of algal genomic resources. These results bring us one step closer to the full potential of eukaryotic algal genomics.
|
24
|
Novel CaLB-like Lipase Found Using ProspectBIO, a Software for Genome-Based Bioprospection. BIOTECH (BASEL (SWITZERLAND)) 2023; 12:biotech12010006. [PMID: 36648832 PMCID: PMC9844320 DOI: 10.3390/biotech12010006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2022] [Revised: 12/29/2022] [Accepted: 01/03/2023] [Indexed: 01/11/2023]
Abstract
Enzymes are in high demand for diverse applications, such as in the food, pharmaceutical, and industrial fuel sectors. Thus, in silico bioprospecting emerges as an efficient strategy for discovering new enzyme candidates. A new program called ProspectBIO was developed for this purpose as it can find non-annotated sequences by searching for homologs of a model enzyme directly in genomes. Here we describe the ProspectBIO software methodology and the experimental validation by prospecting for novel lipases by sequence homology to Candida antarctica lipase B (CaLB) and conserved motifs. As expected, we observed that the new bioprospecting software could find more sequences (1672) than a conventional similarity-based search in a protein database (733). Additionally, the absence of patent protection was introduced as a criterion, resulting in the final selection of a putative lipase-encoding gene from Ustilago hordei (UhL). Expression of UhL in Pichia pastoris resulted in the production of an enzyme with activity towards a tributyrin substrate. The recombinant enzyme activity levels were improved 4-fold by lowering the temperature and increasing methanol concentrations during the induction phase in shake-flask cultures. Protein sequence alignment and structural modeling showed that the recombinant enzyme has high similarity and capability of adjustment to the structure of CaLB. However, amino acid substitutions identified in the active pocket entrance may be responsible for the differences in the substrate specificities of the two enzymes. Thus, the ProspectBIO software allowed the finding of a new promising lipase for biotechnological application without the need for laborious and expensive conventional bioprospecting experimental steps.
|
25
|
Li H. Protein-to-genome alignment with miniprot. Bioinformatics 2023; 39:btad014. [PMID: 36648328 PMCID: PMC9869432 DOI: 10.1093/bioinformatics/btad014] [Citation(s) in RCA: 70] [Impact Index Per Article: 70.0] [Received: 10/13/2022] [Revised: 12/25/2022] [Accepted: 01/16/2023] [Indexed: 01/18/2023] Open
Abstract
MOTIVATION: Protein-to-genome alignment is critical to annotating genes in non-model organisms. While there are a few tools for this purpose, all of them were developed over 10 years ago and did not incorporate the latest advances in alignment algorithms. They are inefficient and could not keep up with the rapid production of new genomes and quickly growing protein databases. RESULTS: Here, we describe miniprot, a new aligner for mapping protein sequences to a complete genome. Miniprot integrates recent techniques such as k-mer sketch and vectorized dynamic programming. It is tens of times faster than existing tools while achieving comparable accuracy on real data. AVAILABILITY AND IMPLEMENTATION: https://github.com/lh3/miniprot.
Affiliation(s)
- Heng Li
  - Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA 02215, USA
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA
26
da Silva EMG, Rebello KM, Choi YJ, Gregorio V, Paschoal AR, Mitreva M, McKerrow JH, Neves-Ferreira AGDC, Passetti F. Identification of Novel Genes and Proteoforms in Angiostrongylus costaricensis through a Proteogenomic Approach. Pathogens 2022; 11:1273. [PMID: 36365024 PMCID: PMC9694666 DOI: 10.3390/pathogens11111273] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/21/2022] [Revised: 10/15/2022] [Accepted: 10/20/2022] [Indexed: 07/22/2023]
Abstract
RNA sequencing (RNA-Seq) and mass-spectrometry-based proteomics data are often integrated in proteogenomic studies to assist in the prediction of eukaryote genome features, such as genes, splicing, single-nucleotide variants (SNVs), and single-amino-acid variants (SAAVs). Most genomes of parasitic nematodes are draft versions that lack transcript- and protein-level information and whose gene annotations rely only on computational predictions. Angiostrongylus costaricensis is a roundworm species that causes an intestinal inflammatory disease known as abdominal angiostrongyliasis (AA). Currently, there is no drug available that acts directly on this parasite, mostly due to the sparse understanding of its molecular characteristics. The available genome of A. costaricensis, specific to the Costa Rica strain, is a draft version that is not supported by transcript- or protein-level evidence. This study used RNA-Seq and MS/MS data to perform an in-depth annotation of the A. costaricensis genome. Our prediction improved the reference annotation with (a) novel coding and non-coding genes; (b) evidence of alternative splicing generating new proteoforms; and (c) a list of SNVs between the Brazilian (Crissiumal) and the Costa Rica strains. To the best of our knowledge, this is the first time that a multi-omics approach has been used to improve the genome annotation of A. costaricensis. We hope this improved genome annotation can assist in the future development of drugs, kits, and vaccines to treat, diagnose, and prevent AA caused by either the Brazil strain (Crissiumal) or the Costa Rica strain.
Affiliation(s)
- Esdras Matheus Gomes da Silva
  - Instituto Carlos Chagas, Fiocruz, Curitiba 81350-010, PR, Brazil
  - Laboratory of Toxinology, Oswaldo Cruz Institute, Fiocruz, Rio de Janeiro 21040-900, RJ, Brazil
- Karina Mastropasqua Rebello
  - Laboratory of Toxinology, Oswaldo Cruz Institute, Fiocruz, Rio de Janeiro 21040-900, RJ, Brazil
  - Laboratory of Integrated Studies in Protozoology, Oswaldo Cruz Institute, Fiocruz, Rio de Janeiro 21040-360, RJ, Brazil
- Young-Jun Choi
  - Department of Medicine, Washington University School of Medicine, St. Louis, MO 63110, USA
- Vitor Gregorio
  - Bioinformatics and Pattern Recognition Group (Bioinfo-CP), Department of Computer Science (DACOM), Federal University of Technology-Parana (UTFPR), Cornélio Procópio 86300-000, PR, Brazil
- Alexandre Rossi Paschoal
  - Bioinformatics and Pattern Recognition Group (Bioinfo-CP), Department of Computer Science (DACOM), Federal University of Technology-Parana (UTFPR), Cornélio Procópio 86300-000, PR, Brazil
- Makedonka Mitreva
  - Department of Medicine, Washington University School of Medicine, St. Louis, MO 63110, USA
- James H. McKerrow
  - Center for Discovery and Innovation in Parasitic Diseases, Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California, San Diego, CA 92093, USA
- Fabio Passetti
  - Instituto Carlos Chagas, Fiocruz, Curitiba 81350-010, PR, Brazil
27
Abdullah-Zawawi MR, Govender N, Harun S, Muhammad NAN, Zainal Z, Mohamed-Hussein ZA. Multi-Omics Approaches and Resources for Systems-Level Gene Function Prediction in the Plant Kingdom. PLANTS (BASEL, SWITZERLAND) 2022; 11:2614. [PMID: 36235479 PMCID: PMC9573505 DOI: 10.3390/plants11192614] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 07/29/2022] [Revised: 09/05/2022] [Accepted: 09/13/2022] [Indexed: 06/16/2023]
Abstract
In higher plants, the complexity of a system and the components within and among species are rapidly dissected by omics technologies. Multi-omics datasets are integrated to infer and enable a comprehensive understanding of the life processes of organisms of interest. Further, growing open-source datasets coupled with the emergence of high-performance computing and development of computational tools for biological sciences have assisted in silico functional prediction of unknown genes, proteins and metabolites, otherwise known as uncharacterized. The systems biology approach includes data collection and filtration, system modelling, experimentation and the establishment of new hypotheses for experimental validation. Informatics technologies add meaningful sense to the output generated by complex bioinformatics algorithms, which are now freely available in a user-friendly graphical user interface. These resources accentuate gene function prediction at a relatively minimal cost and effort. Herein, we present a comprehensive view of relevant approaches available for system-level gene function prediction in the plant kingdom. Together, the most recent applications and sought-after principles for gene mining are discussed to benefit the plant research community. A realistic tabulation of plant genomic resources is included for a less laborious and accurate candidate gene discovery in basic plant research and improvement strategies.
Affiliation(s)
- Muhammad-Redha Abdullah-Zawawi
  - UKM Medical Molecular Biology Institute (UMBI), Universiti Kebangsaan Malaysia, Kuala Lumpur 56000, Malaysia
  - Institute of System Biology (INBIOSIS), Universiti Kebangsaan Malaysia (UKM), Bangi 43600, Malaysia
- Nisha Govender
  - Institute of System Biology (INBIOSIS), Universiti Kebangsaan Malaysia (UKM), Bangi 43600, Malaysia
- Sarahani Harun
  - Institute of System Biology (INBIOSIS), Universiti Kebangsaan Malaysia (UKM), Bangi 43600, Malaysia
- Nor Azlan Nor Muhammad
  - Institute of System Biology (INBIOSIS), Universiti Kebangsaan Malaysia (UKM), Bangi 43600, Malaysia
- Zamri Zainal
  - Institute of System Biology (INBIOSIS), Universiti Kebangsaan Malaysia (UKM), Bangi 43600, Malaysia
  - Faculty of Science and Technology, Universiti Kebangsaan Malaysia (UKM), Bangi 43600, Malaysia
- Zeti-Azura Mohamed-Hussein
  - Institute of System Biology (INBIOSIS), Universiti Kebangsaan Malaysia (UKM), Bangi 43600, Malaysia
  - Faculty of Science and Technology, Universiti Kebangsaan Malaysia (UKM), Bangi 43600, Malaysia
28
Fisher CR, Wilson M, Scott JG. A chromosome-level assembly of the widely used Rockefeller strain of Aedes aegypti, the yellow fever mosquito. G3 GENES|GENOMES|GENETICS 2022; 12:6695221. [PMID: 36086997 PMCID: PMC9635639 DOI: 10.1093/g3journal/jkac242] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/22/2022] [Accepted: 08/23/2022] [Indexed: 12/03/2022]
Abstract
Aedes aegypti is the vector of important human diseases, and genomic resources are crucial in facilitating the study of A. aegypti and its ecosystem interactions. Several laboratory-acclimated strains of this mosquito have been established, but the most used strain in toxicology studies is “Rockefeller,” which was originally collected and established in Cuba 130 years ago. A full-length genome assembly of another reference strain, “Liverpool,” was published in 2018 and is the reference genome for the species (AaegL5). However, genetic studies with the Rockefeller strain are complicated by the availability of only the Liverpool strain as the reference genome. Differences between Liverpool and Rockefeller have been known for decades, particularly in the expression of genes relevant to mosquito behavior and vector control (e.g. olfactory). These differences indicate that AaegL5 is likely not fully representative of the Rockefeller genome, presenting potential impediments to research. Here, we present a chromosomal-level assembly and annotation of the Rockefeller genome and a comparative characterization vs the Liverpool genome. Our results set the stage for a pan-genomic approach to understanding evolution and diversity within this important disease vector.
Affiliation(s)
- Cera R Fisher
  - Department of Entomology, Comstock Hall, Cornell University, Ithaca, NY 14853, USA
- Michael Wilson
  - Center for Cell Analysis & Modeling, University of Connecticut Health Center, Farmington, CT 06030, USA
- Jeffrey G Scott
  - Department of Entomology, Comstock Hall, Cornell University, Ithaca, NY 14853, USA
29
Ermolaev A, Kudryavtseva N, Pivovarov A, Kirov I, Karlov G, Khrustaleva L. Integrating Genetic and Chromosome Maps of Allium cepa: From Markers Visualization to Genome Assembly Verification. Int J Mol Sci 2022; 23:10486. [PMID: 36142398 PMCID: PMC9504663 DOI: 10.3390/ijms231810486] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/11/2022] [Revised: 09/05/2022] [Accepted: 09/07/2022] [Indexed: 11/16/2022]
Abstract
The ability to directly look into genome sequences has opened great opportunities in plant breeding. Yet, the assembly of full-length chromosomes remains one of the most difficult problems in modern genomics. Genetic maps are commonly used in de novo genome assembly and are constructed on the basis of a statistical analysis of the number of recombinations. This may affect the accuracy of the ordering and orientation of scaffolds within the chromosome, especially in the region of recombination suppression. Moreover, it is impossible to assign contigs lacking DNA markers. Here, we report the use of Tyr-FISH to determine the position of the short DNA sequence of markers and non-mapped unique copy sequence on the physical chromosomes of a large-genome onion (Allium cepa L.). In order to minimize potential background masking of the target signal, we improved our earlier developed pipeline for probe design. A total of 23 markers were located on physical chromosomes 2 and 6. The order of markers was corrected by the integration of genetic, pseudochromosome maps and cytogenetic maps. Additionally, the position of the mlh1 gene, which was not on the genetic map, was defined on physical chromosome 2. Tyr-FISH mapping showed that the order of 23.1% (chromosome 2) and 27.3% (chromosome 6) of the tested genes differed between physical chromosomes and pseudochromosomes. The results can be used for the improvement of pseudochromosome 2 and 6 assembly. The present study aims to demonstrate the value of the in situ visualization of DNA sequences in chromosome-scaffold genome assembly.
Affiliation(s)
- Aleksey Ermolaev
  - Laboratory of Applied Genomics and Crop Breeding, All-Russian Research Institute of Agricultural Biotechnology, Timiryazevskay 42 Str., 127550 Moscow, Russia
  - Center of Molecular Biotechnology, Russian State Agrarian University-Moscow Timiryazev Agricultural Academy, Timiryazevskay 49 Str., 127550 Moscow, Russia
- Natalia Kudryavtseva
  - Center of Molecular Biotechnology, Russian State Agrarian University-Moscow Timiryazev Agricultural Academy, Timiryazevskay 49 Str., 127550 Moscow, Russia
  - Plant Cell Engineering Laboratory, All-Russian Research Institute of Agricultural Biotechnology, Timiryazevskay 42 Str., 127550 Moscow, Russia
- Anton Pivovarov
  - Center of Molecular Biotechnology, Russian State Agrarian University-Moscow Timiryazev Agricultural Academy, Timiryazevskay 49 Str., 127550 Moscow, Russia
  - Plant Cell Engineering Laboratory, All-Russian Research Institute of Agricultural Biotechnology, Timiryazevskay 42 Str., 127550 Moscow, Russia
- Ilya Kirov
  - Laboratory of Marker-Assisted and Genomic Selection of Plants, All-Russian Research Institute of Agricultural Biotechnology, Timiryazevskay 42 Str., 127550 Moscow, Russia
  - Moscow Institute of Physics and Technology, 141701 Dolgoprudny, Russia
- Gennady Karlov
  - Laboratory of Applied Genomics and Crop Breeding, All-Russian Research Institute of Agricultural Biotechnology, Timiryazevskay 42 Str., 127550 Moscow, Russia
- Ludmila Khrustaleva
  - Center of Molecular Biotechnology, Russian State Agrarian University-Moscow Timiryazev Agricultural Academy, Timiryazevskay 49 Str., 127550 Moscow, Russia
  - Plant Cell Engineering Laboratory, All-Russian Research Institute of Agricultural Biotechnology, Timiryazevskay 42 Str., 127550 Moscow, Russia
  - Department of Botany, Breeding and Seed Production of Garden Plants, Russian State Agrarian University-Moscow Timiryazev Agricultural Academy, Timiryazevskay 49 Str., 127550 Moscow, Russia
30
Annotation of Siberian Larch (Larix sibirica Ledeb.) Nuclear Genome—One of the Most Cold-Resistant Tree Species in the Only Deciduous Genus in Pinaceae. PLANTS 2022; 11:plants11152062. [PMID: 35956540 PMCID: PMC9370799 DOI: 10.3390/plants11152062] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 06/24/2022] [Revised: 07/22/2022] [Accepted: 07/26/2022] [Indexed: 11/17/2022]
Abstract
The recent release of the nuclear, chloroplast and mitochondrial genome assemblies of Siberian larch (Larix sibirica Ledeb.), one of the most cold-resistant tree species in the only deciduous genus of Pinaceae, with seasonal senescence and a valuable rot-resistant timber widely used in construction, greatly contributed to the development of genomic resources for the larch genus. Here, we present an extensive repeatome analysis and the first annotation of the draft nuclear Siberian larch genome assembly. About 66% of the larch genome consists of highly repetitive elements (REs), with a likely wave of retrotransposon insertions into the larch genome estimated to have occurred 4–5 MYA. In total, 39,370 gene models were predicted, with 87% of them having homology to Arabidopsis-annotated proteins and 78% having at least one GO term assignment. The current state of the genome annotations allows for the exploration of gymnosperm and angiosperm species for relative gene abundance in different functional categories. Comparative analysis of functional gene categories across different angiosperm and gymnosperm species finds that the Siberian larch genome has an overabundance of genes associated with programmed cell death (PCD), autophagy, stress hormone biosynthesis and regulatory pathways; these genes may play important roles in seasonal senescence and stress response to extreme cold in larch. Despite being incomplete, the draft assemblies and annotations of the conifer genomes are at a point of development where they now represent a valuable source for further genomic, genetic and population studies.
31
Sácký J, Chaloupecká A, Kaňa A, Šantrůček J, Borovička J, Leonhardt T, Kotrba P. Intracellular sequestration of cadmium and zinc in ectomycorrhizal fungus Amanita muscaria (Agaricales, Amanitaceae) and characterization of its metallothionein gene. Fungal Genet Biol 2022; 162:103717. [PMID: 35764233 DOI: 10.1016/j.fgb.2022.103717] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2021] [Revised: 06/10/2022] [Accepted: 06/22/2022] [Indexed: 11/16/2022]
Abstract
Amanita muscaria is an ectomycorrhizal mushroom that commonly grows at metal-polluted sites. Sporocarps from the lead smelter-polluted area near Příbram (Central Bohemia, Czech Republic) showed elevated concentrations of Cd and Zn. Size exclusion chromatography of the cell extracts of the sporocarps from both polluted and unpolluted sites indicated that a substantial part of intracellular Cd and Zn was sequestered in 6-kDa complexes, presumably with metallothionein(s) (MT). When the cultured mycelial isolates were compared, those from Příbram were more Cd-tolerant and accumulated slightly less Cd and Zn than those from the unpolluted site. The analysis of the available A. muscaria sequence data returned a 67-amino acid (AA) MT encoded by the AmMT1 gene. Weak Cd and Zn responsiveness of AmMT1 in the mycelia suggested its metal homeostasis function in A. muscaria, rather than a major role in detoxification. AmMT1 belongs to a ubiquitous peptide group in the Agaricomycetes consisting of 60-70-AA MTs containing seven cysteinyl domains and a conserved histidyl, features observed also in a newly predicted, atypical 45-AA RaMT1 of the Zn-accumulator Russula bresadolae in which the C-terminal cysteinyl domains VI and VII are missing. Heterologous expression in metal-sensitive yeast mutants indicated that AmMT1 and RaMT1 encode functional peptides that can protect cells against Cd, Zn, and Cu toxicity. The metal protection phenotype observed in yeasts with mutant variants of AmMT1 and RaMT1 further indicated that the conserved histidyl seems to play a structural, not metal-binding, role, and that the cysteinyls of the C-terminal domains VI and VII are important for Cu binding. The data provide an important insight into the metal handling of site-associated ectomycorrhizal species disturbed by excess metals and the properties of MTs common in Agaricomycetes.
Affiliation(s)
- Jan Sácký
  - Department of Biochemistry and Microbiology, University of Chemistry and Technology, Prague, Technická 3, 166 28 Prague 6, Czech Republic
- Anna Chaloupecká
  - Department of Biochemistry and Microbiology, University of Chemistry and Technology, Prague, Technická 3, 166 28 Prague 6, Czech Republic
- Antonín Kaňa
  - Department of Analytical Chemistry, University of Chemistry and Technology, Prague, Technická 5, 166 28 Prague, Czech Republic
- Jiří Šantrůček
  - Department of Biochemistry and Microbiology, University of Chemistry and Technology, Prague, Technická 3, 166 28 Prague 6, Czech Republic
- Jan Borovička
  - Institute of Geology of the Czech Academy of Sciences, Rozvojová 269, 16500 Prague 6, Czech Republic
  - Nuclear Physics Institute of the Czech Academy of Sciences, Hlavní 130, 25068 Husinec-Řež, Czech Republic
- Tereza Leonhardt
  - Department of Biochemistry and Microbiology, University of Chemistry and Technology, Prague, Technická 3, 166 28 Prague 6, Czech Republic
- Pavel Kotrba
  - Department of Biochemistry and Microbiology, University of Chemistry and Technology, Prague, Technická 3, 166 28 Prague 6, Czech Republic
32
Solano-González S, Solano-Campos F. Production of mannosylerythritol lipids: biosynthesis, multi-omics approaches, and commercial exploitation. Mol Omics 2022; 18:699-715. [DOI: 10.1039/d2mo00150k] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/21/2022]
Abstract
Compilation of resources regarding MEL biosynthesis, key production parameters; available omics resources and current commercial applications, for smut fungi known to produce MELs.
Affiliation(s)
- Stefany Solano-González
  - Universidad Nacional, Escuela de Ciencias Biológicas, Laboratorio de Bioinformática Aplicada, Heredia, Costa Rica
- Frank Solano-Campos
  - Universidad Nacional, Escuela de Ciencias Biológicas, Laboratorio de Biotecnología de Plantas, Heredia, Costa Rica
33
Hurgobin B. Annotation of Protein-Coding Genes in Plant Genomes. Methods Mol Biol 2022; 2443:309-326. [PMID: 35037214 DOI: 10.1007/978-1-0716-2067-0_17] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
Advances in next-generation sequencing technologies and lower sequencing costs are paving the way to more plant genome sequencing, assembly, and annotation projects. While genome assembly is the first step toward elucidating the genome structure of a species, it is the annotation of the protein-coding genes that provides meaningful information to biologists. However, genome annotation is not a trivial task. Therefore, the aim of this chapter is to provide a detailed view of this important process, including tools and commands that can be used to carry it out.
Affiliation(s)
- Bhavna Hurgobin
  - La Trobe Institute for Agriculture and Food, Department of Animal, Plant and Soil Sciences, School of Life Sciences, AgriBio Building, La Trobe University, Bundoora, VIC, Australia
  - Australian Research Council Research Hub for Medicinal Agriculture, AgriBio Building, La Trobe University, Bundoora, VIC, Australia
34
Al Kadi M, Jung N, Okuzaki D. UNAGI: Yeast Transcriptome Reconstruction and Gene Discovery Using Nanopore Sequencing. Methods Mol Biol 2022; 2477:79-89. [PMID: 35524113 DOI: 10.1007/978-1-0716-2257-5_6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
Computational approaches are the main approaches used in genome annotation. However, their accuracy is low: untranslated regions are not identified, complex isoforms are not predicted correctly, and the discovery rate of noncoding RNA is low. RNA-seq has revolutionized transcriptome reconstruction over the last decade. However, the fragmentation involved in cDNA sequencing leads to information loss, requiring transcripts to be assembled and reconstructed, which affects the accuracy of the reconstructed transcriptome. Recently, long-read sequencing has been introduced with technologies such as Oxford Nanopore sequencing. cDNA is sequenced directly without fragmentation, producing long reads that don't need to be assembled, keeping the transcript structure intact and increasing the accuracy of transcriptome reconstruction. Here we present a protocol and a pipeline to reconstruct the transcriptome of compact genomes, including yeasts. It involves generating full-length cDNA and using an Oxford Nanopore ligation-based sequencing kit to sequence multiple samples in the same run. The pipeline (1) strands the generated long reads, (2) corrects the reads by mapping them to the reference genome, (3) identifies transcripts including 5'UTR and 3'UTR, (4) profiles the isoforms, filtering out artifacts resulting from low accuracy in sequencing, and (5) improves the accuracy of provided annotations. Using long reads improves the accuracy of transcriptome reconstruction and helps in discovering a significant number of novel RNAs.
Affiliation(s)
- Mohamad Al Kadi
  - Department of Bacterial Infections, Research Institute for Microbial Diseases, Osaka University, Osaka, Japan
- Nicolas Jung
  - Department of Infection Metagenomics, Research Institute for Microbial Diseases, Osaka University, Osaka, Japan
- Daisuke Okuzaki
  - Integrated Frontier Research for Medical Science Division, Institute for Open and Transdisciplinary Research Initiatives, Osaka University, Osaka, Japan
  - Single Cell Genomics, Human Immunology, WPI Immunology Frontier Research Center, Osaka University, Osaka, Japan
  - Genome Information Research Center, Research Institute for Microbial Diseases, Osaka University, Osaka, Japan
35
Li J, Singh U, Bhandary P, Campbell J, Arendsee Z, Seetharam AS, Wurtele ES. Foster thy young: enhanced prediction of orphan genes in assembled genomes. Nucleic Acids Res 2021; 50:e37. [PMID: 34928390 PMCID: PMC9023268 DOI: 10.1093/nar/gkab1238] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 08/27/2021] [Revised: 10/22/2021] [Accepted: 12/02/2021] [Indexed: 02/06/2023]
Abstract
Proteins encoded by newly-emerged genes ('orphan genes') share no sequence similarity with proteins in any other species. They provide organisms with a reservoir of genetic elements to quickly respond to changing selection pressures. Here, we systematically assess the ability of five gene prediction pipelines to accurately predict genes in genomes according to phylostratal origin. BRAKER and MAKER are existing, popular ab initio tools that infer gene structures by machine learning. Direct Inference is an evidence-based pipeline we developed to predict gene structures from alignments of RNA-Seq data. The BIND pipeline integrates ab initio predictions of BRAKER and Direct Inference; MIND combines Direct Inference and MAKER predictions. We use highly-curated Arabidopsis and yeast annotations as gold-standard benchmarks, and cross-validate in rice. Each pipeline under-predicts orphan genes (as few as 11 percent, under one prediction scenario). Increasing RNA-Seq diversity greatly improves prediction efficacy. The combined methods (BIND and MIND) yield the best predictions overall, with BIND identifying 68% of annotated orphan genes and 99% of ancient genes, and giving the highest sensitivity score regardless of dataset in Arabidopsis. We provide a lightweight, flexible, reproducible, and well-documented solution to improve gene prediction.
Affiliation(s)
- Jing Li
  - Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50014, USA
  - Center for Metabolic Biology, Iowa State University, Ames, IA 50014, USA
  - Genetics and Genomics Graduate Program, Iowa State University, Ames, IA 50014, USA
- Urminder Singh
  - Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50014, USA
  - Center for Metabolic Biology, Iowa State University, Ames, IA 50014, USA
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA 50014, USA
- Priyanka Bhandary
  - Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50014, USA
  - Center for Metabolic Biology, Iowa State University, Ames, IA 50014, USA
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA 50014, USA
- Jacqueline Campbell
  - Corn Insects and Crop Genetics Research Unit, US Department of Agriculture Agriculture Research Service, Ames, IA 50014, USA
- Zebulun Arendsee
  - Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50014, USA
  - Center for Metabolic Biology, Iowa State University, Ames, IA 50014, USA
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA 50014, USA
- Arun S Seetharam
  - Genome Informatics Facility, Iowa State University, Ames, IA 50014, USA
- Eve Syrkin Wurtele
  - Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50014, USA
  - Center for Metabolic Biology, Iowa State University, Ames, IA 50014, USA
  - Genetics and Genomics Graduate Program, Iowa State University, Ames, IA 50014, USA
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA 50014, USA
36
Scalzitti N, Kress A, Orhand R, Weber T, Moulinier L, Jeannin-Girardon A, Collet P, Poch O, Thompson JD. Spliceator: multi-species splice site prediction using convolutional neural networks. BMC Bioinformatics 2021; 22:561. [PMID: 34814826 PMCID: PMC8609763 DOI: 10.1186/s12859-021-04471-3] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Received: 07/04/2021] [Accepted: 11/09/2021] [Indexed: 12/14/2022]
Abstract
Background: Ab initio prediction of splice sites is an essential step in eukaryotic genome annotation. Recent predictors have exploited Deep Learning algorithms and reliable gene structures from model organisms. However, Deep Learning methods for non-model organisms are lacking. Results: We developed Spliceator to predict splice sites in a wide range of species, including model and non-model organisms. Spliceator uses a convolutional neural network and is trained on carefully validated data from over 100 organisms. We show that Spliceator achieves consistently high accuracy (89–92%) compared to existing methods on independent benchmarks from human, fish, fly, worm, plant and protist organisms. Conclusions: Spliceator is a new Deep Learning method trained on high-quality data, which can be used to predict splice sites in diverse organisms, ranging from human to protists, with consistently high accuracy. Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04471-3.
Affiliation(s)
- Nicolas Scalzitti
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
- Arnaud Kress
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
  - BiGEst-ICube Platform, ICube Laboratory, UMR7357, 1 rue Eugène Boeckel, 67000, Strasbourg, France
- Romain Orhand
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
- Thomas Weber
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
- Luc Moulinier
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
  - BiGEst-ICube Platform, ICube Laboratory, UMR7357, 1 rue Eugène Boeckel, 67000, Strasbourg, France
- Anne Jeannin-Girardon
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
- Pierre Collet
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
- Olivier Poch
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
- Julie D Thompson
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
| |
Collapse
|
37
|
Wisser RJ, Oppenheim SJ, Ernest EG, Mhora TT, Dumas MD, Gregory NF, Evans TA, Donofrio NM. Genome assembly of a Mesoamerican derived variety of lima bean: a foundational cultivar in the Mid-Atlantic USA. G3 Genes|Genomes|Genetics 2021; 11:6326801. [PMID: 34542584 PMCID: PMC8527486 DOI: 10.1093/g3journal/jkab207] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 04/15/2021] [Accepted: 05/25/2021] [Indexed: 11/25/2022]
Abstract
Lima bean, Phaseolus lunatus, is closely related to common bean and is high in fiber and protein, with a low glycemic index. Lima bean is widely grown in the state of Delaware, where late summer and early fall weather are conducive to pod production. The same weather conditions also promote diseases such as pod rot and downy mildew, the latter of which has caused previous epidemics. A better understanding of the genes underlying resistance to this and other pathogens is needed to keep this industry thriving in the region. Our current study sought to sequence, assemble, and annotate a commercially available cultivar called Bridgeton, which could then serve as a reference genome, a basis of comparison to other Phaseolus taxa, and a resource for the identification of potential resistance genes. Combined efforts of sequencing, linkage, and comparative analysis resulted in a 623 Mb annotated assembly for lima bean, as well as a better understanding of an evolutionarily dynamic resistance locus in legumes.
Affiliation(s)
- Randall J Wisser
  - Department of Plant and Soil Sciences, University of Delaware, Newark, DE 19716, USA
  - Laboratoire d’Ecophysiologie des Plantes sous Stress Environnementaux, INRAE, Univ. Montpellier, SupAgro, 34060 Montpellier, France
- Sara J Oppenheim
  - Sackler Institute for Comparative Genomics, American Museum of Natural History, New York, NY 10024, USA
- Emmalea G Ernest
  - Cooperative Extension, University of Delaware, Georgetown, DE 19947, USA
- Terence T Mhora
  - Department of Plant and Soil Sciences, University of Delaware, Newark, DE 19716, USA
- Michael D Dumas
  - Department of Plant and Soil Sciences, University of Delaware, Newark, DE 19716, USA
- Nancy F Gregory
  - Department of Plant and Soil Sciences, University of Delaware, Newark, DE 19716, USA
- Thomas A Evans
  - Department of Plant and Soil Sciences, University of Delaware, Newark, DE 19716, USA
- Nicole M Donofrio
  - Department of Plant and Soil Sciences, University of Delaware, Newark, DE 19716, USA
|
38
|
Mathé C, Dunand C. Automatic Prediction and Annotation: There Are Strong Biases for Multigenic Families. Front Genet 2021; 12:697477. [PMID: 34603370 PMCID: PMC8481831 DOI: 10.3389/fgene.2021.697477] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/21/2021] [Accepted: 08/05/2021] [Indexed: 11/16/2022] Open
Affiliation(s)
- Catherine Mathé
  - Laboratoire de Recherche en Sciences Végétales, Université de Toulouse, CNRS, UPS, Toulouse INP, Auzeville-Tolosane, France
- Christophe Dunand
  - Laboratoire de Recherche en Sciences Végétales, Université de Toulouse, CNRS, UPS, Toulouse INP, Auzeville-Tolosane, France
|
39
|
SAVMD: An adaptive signal processing method for identifying protein coding regions. Biomed Signal Process Control 2021. [DOI: 10.1016/j.bspc.2021.102998] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022]
|
40
|
Martin R, Hackl T, Hattab G, Fischer MG, Heider D. MOSGA: Modular Open-Source Genome Annotator. Bioinformatics 2021; 36:5514-5515. [PMID: 33258916 DOI: 10.1093/bioinformatics/btaa1003] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 09/03/2020] [Revised: 10/16/2020] [Accepted: 11/18/2020] [Indexed: 12/25/2022] Open
Abstract
MOTIVATION: The generation of high-quality assemblies, even for large eukaryotic genomes, has become a routine task for many biologists thanks to recent advances in sequencing technologies. However, the annotation of these assemblies, a crucial step toward unlocking the biology of the organism of interest, has remained a complex challenge that often requires advanced bioinformatics expertise. RESULTS: Here, we present MOSGA (Modular Open-Source Genome Annotator), a genome annotation framework for eukaryotic genomes with a user-friendly web interface that generates and integrates annotations from various tools. The aggregated results can be analyzed with a fully integrated genome browser and are provided in a format ready for submission to NCBI. MOSGA is built on a portable, customizable and easily extendible Snakemake backend, and thus can be tailored to a wide range of users and projects. AVAILABILITY AND IMPLEMENTATION: We provide MOSGA as a web service at https://mosga.mathematik.uni-marburg.de and as a docker container at registry.gitlab.com/mosga/mosga:latest. Source code can be found at https://gitlab.com/mosga/mosga. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Roman Martin
  - Department of Mathematics and Computer Science, University of Marburg, 35032 Marburg, Germany
  - Department of Organic-Analytical Chemistry, TUM Campus Straubing, 94315 Straubing, Germany
- Thomas Hackl
  - Department of Biomolecular Mechanisms, Max Planck Institute for Medical Research, Heidelberg 69120, Germany
- Georges Hattab
  - Department of Mathematics and Computer Science, University of Marburg, 35032 Marburg, Germany
- Matthias G Fischer
  - Department of Biomolecular Mechanisms, Max Planck Institute for Medical Research, Heidelberg 69120, Germany
- Dominik Heider
  - Department of Mathematics and Computer Science, University of Marburg, 35032 Marburg, Germany
|
41
|
Qi H, Li L, Zhang G. Construction of a chromosome-level genome and variation map for the Pacific oyster Crassostrea gigas. Mol Ecol Resour 2021; 21:1670-1685. [PMID: 33655634 DOI: 10.1111/1755-0998.13368] [Citation(s) in RCA: 28] [Impact Index Per Article: 9.3] [Received: 11/30/2020] [Revised: 02/17/2021] [Accepted: 02/23/2021] [Indexed: 12/11/2022]
Abstract
The Pacific oyster (Crassostrea gigas) is a widely distributed marine bivalve of great ecological and economic importance. In this study, we provide a high-quality chromosome-level genome assembled using Pacific Biosciences long reads and Hi-C-based and linkage-map-based scaffolding technologies and a high-resolution variation map constructed using large-scale resequencing analysis. The 586.8 Mb genome consists of 10 pseudochromosome sequences ranging from 38.6 to 78.9 Mb, containing 301 contigs with an N50 size of 3.1 Mb. A total of 30,078 protein-coding genes were predicted, of which 22,757 (75.7%) were high-reliability annotations supported by a homologous match to a curated protein in the SWISS-PROT database or transcript expression. Although a medium level of repeat components (57.2%) was detected, the genomic content of the segmental duplications reached 26.2%, which is the highest among the reported genomes. By whole genome resequencing analysis of 495 Pacific oysters, a comprehensive variation map was built, comprised of 4.78 million single nucleotide polymorphisms, 0.60 million short insertions and deletions, and 49,333 copy number variation regions. The structural variations can lead to an average interindividual genomic divergence of 0.21, indicating their crucial role in shaping the Pacific oyster genome diversity. The large amount of mosaic distributed repeat elements, small variations, and copy number variations indicate that the Pacific oyster is a diploid organism with an extremely high genomic complexity at the intra- and interindividual level. The genome and variation maps can improve our understanding of oyster genome diversity and enrich the resources for oyster molecular evolution, comparative genomics, and genetic research.
Affiliation(s)
- Haigang Qi
  - Key Laboratory of Experimental Marine Biology, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, China
  - Laboratory for Marine Biology and Biotechnology, Qingdao National Laboratory for Marine Science and Technology, Qingdao, China
  - Center for Ocean Mega-Science, Chinese Academy of Sciences, Qingdao, China
  - National and Local Joint Engineering Laboratory of Ecological Mariculture, Qingdao, China
- Li Li
  - Key Laboratory of Experimental Marine Biology, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, China
  - Center for Ocean Mega-Science, Chinese Academy of Sciences, Qingdao, China
  - Laboratory for Marine Fisheries Science and Food Production Processes, Qingdao National Laboratory for Marine Science and Technology, Qingdao, China
  - National and Local Joint Engineering Laboratory of Ecological Mariculture, Qingdao, China
- Guofan Zhang
  - Key Laboratory of Experimental Marine Biology, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, China
  - Laboratory for Marine Biology and Biotechnology, Qingdao National Laboratory for Marine Science and Technology, Qingdao, China
  - Center for Ocean Mega-Science, Chinese Academy of Sciences, Qingdao, China
  - National and Local Joint Engineering Laboratory of Ecological Mariculture, Qingdao, China
|
42
|
Zheng Q, Chen T, Zhou W, Xie L, Su H. Gene prediction by the noise-assisted MEMD and wavelet transform for identifying the protein coding regions. Biocybern Biomed Eng 2021. [DOI: 10.1016/j.bbe.2020.12.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/22/2022]
|
43
|
Meyer C, Scalzitti N, Jeannin-Girardon A, Collet P, Poch O, Thompson JD. Understanding the causes of errors in eukaryotic protein-coding gene prediction: a case study of primate proteomes. BMC Bioinformatics 2020; 21:513. [PMID: 33172385 PMCID: PMC7656754 DOI: 10.1186/s12859-020-03855-1] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 07/28/2020] [Accepted: 10/30/2020] [Indexed: 11/10/2022] Open
Abstract
Background: Recent advances in sequencing technologies have led to an explosion in the number of genomes available, but accurate genome annotation remains a major challenge. The prediction of protein-coding genes in eukaryotic genomes is especially problematic, due to their complex exon–intron structures. Even the best eukaryotic gene prediction algorithms can make serious errors that will significantly affect subsequent analyses. Results: We first investigated the prevalence of gene prediction errors in a large set of 176,478 proteins from ten primate proteomes available in public databases. Using the well-studied human proteins as a reference, a total of 82,305 potential errors were detected, including 44,001 deletions, 27,289 insertions and 11,015 mismatched segments where part of the correct protein sequence is replaced with an alternative erroneous sequence. We then focused on the mismatched sequence errors that cause particular problems for downstream applications. A detailed characterization allowed us to identify the potential causes for the gene misprediction in approximately half (5446) of these cases. As a proof-of-concept, we also developed a simple method which allowed us to propose improved sequences for 603 primate proteins. Conclusions: Gene prediction errors in primate proteomes affect up to 50% of the sequences. Major causes of errors include undetermined genome regions, genome sequencing or assembly issues, and limitations in the models used to represent gene exon–intron structures. Nevertheless, existing genome sequences can still be exploited to improve protein sequence quality. Perspectives of the work include the characterization of other types of gene prediction errors, as well as the development of a more comprehensive algorithm for protein sequence error correction.
Affiliation(s)
- Corentin Meyer
  - Department of Computer Science, ICube, CNRS, University of Strasbourg, Strasbourg, France
- Nicolas Scalzitti
  - Department of Computer Science, ICube, CNRS, University of Strasbourg, Strasbourg, France
- Anne Jeannin-Girardon
  - Department of Computer Science, ICube, CNRS, University of Strasbourg, Strasbourg, France
- Pierre Collet
  - Department of Computer Science, ICube, CNRS, University of Strasbourg, Strasbourg, France
- Olivier Poch
  - Department of Computer Science, ICube, CNRS, University of Strasbourg, Strasbourg, France
- Julie D Thompson
  - Department of Computer Science, ICube, CNRS, University of Strasbourg, Strasbourg, France
|