1
|
Tatusova T, DiCuccio M, Badretdin A, Chetvernin V, Nawrocki EP, Zaslavsky L, Lomsadze A, Pruitt KD, Borodovsky M, Ostell J. NCBI prokaryotic genome annotation pipeline. Nucleic Acids Res 2016; 44:6614-24. [PMID: 27342282 PMCID: PMC5001611 DOI: 10.1093/nar/gkw569] [Citation(s) in RCA: 4825] [Impact Index Per Article: 536.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/18/2016] [Revised: 06/08/2016] [Accepted: 06/13/2016] [Indexed: 12/01/2022] Open
Abstract
Recent technological advances have opened unprecedented opportunities for large-scale sequencing and analysis of populations of pathogenic species in disease outbreaks, as well as for large-scale diversity studies aimed at expanding our knowledge across the whole domain of prokaryotes. To meet the challenge of timely interpretation of structure, function and meaning of this vast genetic information, a comprehensive approach to automatic genome annotation is critically needed. In collaboration with Georgia Tech, NCBI has developed a new approach to genome annotation that combines alignment based methods with methods of predicting protein-coding and RNA genes and other functional elements directly from sequence. A new gene finding tool, GeneMarkS+, uses the combined evidence of protein and RNA placement by homology as an initial map of annotation to generate and modify ab initio gene predictions across the whole genome. Thus, the new NCBI's Prokaryotic Genome Annotation Pipeline (PGAP) relies more on sequence similarity when confident comparative data are available, while it relies more on statistical predictions in the absence of external evidence. The pipeline provides a framework for generation and analysis of annotation on the full breadth of prokaryotic taxonomy. For additional information on PGAP see https://www.ncbi.nlm.nih.gov/genome/annotation_prok/ and the NCBI Handbook, https://www.ncbi.nlm.nih.gov/books/NBK174280/.
Collapse
|
Research Support, N.I.H., Intramural |
9 |
4825 |
2
|
Besemer J, Lomsadze A, Borodovsky M. GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res 2001; 29:2607-18. [PMID: 11410670 PMCID: PMC55746 DOI: 10.1093/nar/29.12.2607] [Citation(s) in RCA: 1760] [Impact Index Per Article: 73.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/23/2001] [Revised: 05/03/2001] [Accepted: 05/03/2001] [Indexed: 02/06/2023] Open
Abstract
Improving the accuracy of prediction of gene starts is one of a few remaining open problems in computer prediction of prokaryotic genes. Its difficulty is caused by the absence of relatively strong sequence patterns identifying true translation initiation sites. In the current paper we show that the accuracy of gene start prediction can be improved by combining models of protein-coding and non-coding regions and models of regulatory sites near gene start within an iterative Hidden Markov model based algorithm. The new gene prediction method, called GeneMarkS, utilizes a non-supervised training procedure and can be used for a newly sequenced prokaryotic genome with no prior knowledge of any protein or rRNA genes. The GeneMarkS implementation uses an improved version of the gene finding program GeneMark.hmm, heuristic Markov models of coding and non-coding regions and the Gibbs sampling multiple alignment program. GeneMarkS predicted precisely 83.2% of the translation starts of GenBank annotated Bacillus subtilis genes and 94.4% of translation starts in an experimentally validated set of Escherichia coli genes. We have also observed that GeneMarkS detects prokaryotic genes, in terms of identifying open reading frames containing real genes, with an accuracy matching the level of the best currently used gene detection methods. Accurate translation start prediction, in addition to the refinement of protein sequence N-terminal data, provides the benefit of precise positioning of the sequence region situated upstream to a gene start. Therefore, sequence motifs related to transcription and translation regulatory sites can be revealed and analyzed with higher precision. These motifs were shown to possess a significant variability, the functional and evolutionary connections of which are discussed.
Collapse
|
research-article |
24 |
1760 |
3
|
Zhu W, Lomsadze A, Borodovsky M. Ab initio gene identification in metagenomic sequences. Nucleic Acids Res 2010; 38:e132. [PMID: 20403810 PMCID: PMC2896542 DOI: 10.1093/nar/gkq275] [Citation(s) in RCA: 1081] [Impact Index Per Article: 72.1] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
We describe an algorithm for gene identification in DNA sequences derived from shotgun sequencing of microbial communities. Accurate ab initio gene prediction in a short nucleotide sequence of anonymous origin is hampered by uncertainty in model parameters. While several machine learning approaches could be proposed to bypass this difficulty, one effective method is to estimate parameters from dependencies, formed in evolution, between frequencies of oligonucleotides in protein-coding regions and genome nucleotide composition. Original version of the method was proposed in 1999 and has been used since for (i) reconstructing codon frequency vector needed for gene finding in viral genomes and (ii) initializing parameters of self-training gene finding algorithms. With advent of new prokaryotic genomes en masse it became possible to enhance the original approach by using direct polynomial and logistic approximations of oligonucleotide frequencies, as well as by separating models for bacteria and archaea. These advances have increased the accuracy of model reconstruction and, subsequently, gene prediction. We describe the refined method and assess its accuracy on known prokaryotic genomes split into short sequences. Also, we show that as a result of application of the new method, several thousands of new genes could be added to existing annotations of several human and mouse gut metagenomes.
Collapse
|
Research Support, Non-U.S. Gov't |
15 |
1081 |
4
|
Brůna T, Hoff KJ, Lomsadze A, Stanke M, Borodovsky M. BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database. NAR Genom Bioinform 2021; 3:lqaa108. [PMID: 33575650 PMCID: PMC7787252 DOI: 10.1093/nargab/lqaa108] [Citation(s) in RCA: 838] [Impact Index Per Article: 209.5] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/10/2020] [Revised: 11/26/2020] [Accepted: 12/20/2020] [Indexed: 12/13/2022] Open
Abstract
The task of eukaryotic genome annotation remains challenging. Only a few genomes could serve as standards of annotation achieved through a tremendous investment of human curation efforts. Still, the correctness of all alternative isoforms, even in the best-annotated genomes, could be a good subject for further investigation. The new BRAKER2 pipeline generates and integrates external protein support into the iterative process of training and gene prediction by GeneMark-EP+ and AUGUSTUS. BRAKER2 continues the line started by BRAKER1 where self-training GeneMark-ET and AUGUSTUS made gene predictions supported by transcriptomic data. Among the challenges addressed by the new pipeline was a generation of reliable hints to protein-coding exon boundaries from likely homologous but evolutionarily distant proteins. In comparison with other pipelines for eukaryotic genome annotation, BRAKER2 is fully automatic. It is favorably compared under equal conditions with other pipelines, e.g. MAKER2, in terms of accuracy and performance. Development of BRAKER2 should facilitate solving the task of harmonization of annotation of protein-coding genes in genomes of different eukaryotic species. However, we fully understand that several more innovations are needed in transcriptomic and proteomic technologies as well as in algorithmic development to reach the goal of highly accurate annotation of eukaryotic genomes.
Collapse
|
research-article |
4 |
838 |
5
|
Hoff KJ, Lange S, Lomsadze A, Borodovsky M, Stanke M. BRAKER1: Unsupervised RNA-Seq-Based Genome Annotation with GeneMark-ET and AUGUSTUS. Bioinformatics 2016; 32:767-9. [PMID: 26559507 PMCID: PMC6078167 DOI: 10.1093/bioinformatics/btv661] [Citation(s) in RCA: 709] [Impact Index Per Article: 78.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/08/2015] [Revised: 10/02/2015] [Accepted: 10/26/2015] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Gene finding in eukaryotic genomes is notoriously difficult to automate. The task is to design a work flow with a minimal set of tools that would reach state-of-the-art performance across a wide range of species. GeneMark-ET is a gene prediction tool that incorporates RNA-Seq data into unsupervised training and subsequently generates ab initio gene predictions. AUGUSTUS is a gene finder that usually requires supervised training and uses information from RNA-Seq reads in the prediction step. Complementary strengths of GeneMark-ET and AUGUSTUS provided motivation for designing a new combined tool for automatic gene prediction. RESULTS We present BRAKER1, a pipeline for unsupervised RNA-Seq-based genome annotation that combines the advantages of GeneMark-ET and AUGUSTUS. As input, BRAKER1 requires a genome assembly file and a file in bam-format with spliced alignments of RNA-Seq reads to the genome. First, GeneMark-ET performs iterative training and generates initial gene structures. Second, AUGUSTUS uses predicted genes for training and then integrates RNA-Seq read information into final gene predictions. In our experiments, we observed that BRAKER1 was more accurate than MAKER2 when it is using RNA-Seq as sole source for training and prediction. BRAKER1 does not require pre-trained parameters or a separate expert-prepared training step. AVAILABILITY AND IMPLEMENTATION BRAKER1 is available for download at http://bioinf.uni-greifswald.de/bioinf/braker/ and http://exon.gatech.edu/GeneMark/ CONTACT katharina.hoff@uni-greifswald.de or borodovsky@gatech.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
|
brief-report |
9 |
709 |
6
|
Ter-Hovhannisyan V, Lomsadze A, Chernoff YO, Borodovsky M. Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training. Genome Res 2008; 18:1979-90. [PMID: 18757608 DOI: 10.1101/gr.081612.108] [Citation(s) in RCA: 689] [Impact Index Per Article: 40.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
Abstract
We describe a new ab initio algorithm, GeneMark-ES version 2, that identifies protein-coding genes in fungal genomes. The algorithm does not require a predetermined training set to estimate parameters of the underlying hidden Markov model (HMM). Instead, the anonymous genomic sequence in question is used as an input for iterative unsupervised training. The algorithm extends our previously developed method tested on genomes of Arabidopsis thaliana, Caenorhabditis elegans, and Drosophila melanogaster. To better reflect features of fungal gene organization, we enhanced the intron submodel to accommodate sequences with and without branch point sites. This design enables the algorithm to work equally well for species with the kinds of variations in splicing mechanisms seen in the fungal phyla Ascomycota, Basidiomycota, and Zygomycota. Upon self-training, the intron submodel switches on in several steps to reach its full complexity. We demonstrate that the algorithm accuracy, both at the exon and the whole gene level, is favorably compared to the accuracy of gene finders that employ supervised training. Application of the new method to known fungal genomes indicates substantial improvement over existing annotations. By eliminating the effort necessary to build comprehensive training sets, the new algorithm can streamline and accelerate the process of annotation in a large number of fungal genome sequencing projects.
Collapse
|
Research Support, N.I.H., Extramural |
17 |
689 |
7
|
Lomsadze A, Ter-Hovhannisyan V, Chernoff YO, Borodovsky M. Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Res 2005; 33:6494-506. [PMID: 16314312 PMCID: PMC1298918 DOI: 10.1093/nar/gki937] [Citation(s) in RCA: 616] [Impact Index Per Article: 30.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/05/2005] [Revised: 10/12/2005] [Accepted: 10/12/2005] [Indexed: 11/25/2022] Open
Abstract
Finding new protein-coding genes is one of the most important goals of eukaryotic genome sequencing projects. However, genomic organization of novel eukaryotic genomes is diverse and ab initio gene finding tools tuned up for previously studied species are rarely suitable for efficacious gene hunting in DNA sequences of a new genome. Gene identification methods based on cDNA and expressed sequence tag (EST) mapping to genomic DNA or those using alignments to closely related genomes rely either on existence of abundant cDNA and EST data and/or availability on reference genomes. Conventional statistical ab initio methods require large training sets of validated genes for estimating gene model parameters. In practice, neither one of these types of data may be available in sufficient amount until rather late stages of the novel genome sequencing. Nevertheless, we have shown that gene finding in eukaryotic genomes could be carried out in parallel with statistical models estimation directly from yet anonymous genomic DNA. The suggested method of parallelization of gene prediction with the model parameters estimation follows the path of the iterative Viterbi training. Rounds of genomic sequence labeling into coding and non-coding regions are followed by the rounds of model parameters estimation. Several dynamically changing restrictions on the possible range of model parameters are added to filter out fluctuations in the initial steps of the algorithm that could redirect the iteration process away from the biologically relevant point in parameter space. Tests on well-studied eukaryotic genomes have shown that the new method performs comparably or better than conventional methods where the supervised model training precedes the gene prediction step. Several novel genomes have been analyzed and biologically interesting findings are discussed. Thus, a self-training algorithm that had been assumed feasible only for prokaryotic genomes has now been developed for ab initio eukaryotic gene identification.
Collapse
|
Comparative Study |
20 |
616 |
8
|
Abstract
BRAKER is a pipeline for highly accurate and fully automated gene prediction in novel eukaryotic genomes. It combines two major tools: GeneMark-ES/ET and AUGUSTUS. GeneMark-ES/ET learns its parameters from a novel genomic sequence in a fully automated fashion; if available, it uses extrinsic evidence for model refinement. From the protein-coding genes predicted by GeneMark-ES/ET, we select a set for training AUGUSTUS, one of the most accurate gene finding tools that, in contrast to GeneMark-ES/ET, integrates extrinsic evidence already into the gene prediction step. The first published version, BRAKER1, integrated genomic footprints of unassembled RNA-Seq reads into the training as well as into the prediction steps. The pipeline has since been extended to the integration of data on mapped cross-species proteins, and to the usage of heterogeneous extrinsic evidence, both RNA-Seq and protein alignments. In this book chapter, we briefly summarize the pipeline methodology and describe how to apply BRAKER in environments characterized by various combinations of external evidence.
Collapse
|
Research Support, N.I.H., Extramural |
6 |
385 |
9
|
Wu GA, Prochnik S, Jenkins J, Salse J, Hellsten U, Murat F, Perrier X, Ruiz M, Scalabrin S, Terol J, Takita MA, Labadie K, Poulain J, Couloux A, Jabbari K, Cattonaro F, Del Fabbro C, Pinosio S, Zuccolo A, Chapman J, Grimwood J, Tadeo FR, Estornell LH, Muñoz-Sanz JV, Ibanez V, Herrero-Ortega A, Aleza P, Pérez-Pérez J, Ramón D, Brunel D, Luro F, Chen C, Farmerie WG, Desany B, Kodira C, Mohiuddin M, Harkins T, Fredrikson K, Burns P, Lomsadze A, Borodovsky M, Reforgiato G, Freitas-Astúa J, Quetier F, Navarro L, Roose M, Wincker P, Schmutz J, Morgante M, Machado MA, Talon M, Jaillon O, Ollitrault P, Gmitter F, Rokhsar D. Sequencing of diverse mandarin, pummelo and orange genomes reveals complex history of admixture during citrus domestication. Nat Biotechnol 2014; 32:656-62. [PMID: 24908277 PMCID: PMC4113729 DOI: 10.1038/nbt.2906] [Citation(s) in RCA: 362] [Impact Index Per Article: 32.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/09/2013] [Accepted: 04/14/2014] [Indexed: 01/21/2023]
Abstract
Cultivated citrus are selections from, or hybrids of, wild progenitor species whose identities and contributions to citrus domestication remain controversial. Here we sequence and compare citrus genomes--a high-quality reference haploid clementine genome and mandarin, pummelo, sweet-orange and sour-orange genomes--and show that cultivated types derive from two progenitor species. Although cultivated pummelos represent selections from one progenitor species, Citrus maxima, cultivated mandarins are introgressions of C. maxima into the ancestral mandarin species Citrus reticulata. The most widely cultivated citrus, sweet orange, is the offspring of previously admixed individuals, but sour orange is an F1 hybrid of pure C. maxima and C. reticulata parents, thus implying that wild mandarins were part of the early breeding germplasm. A Chinese wild 'mandarin' diverges substantially from C. reticulata, thus suggesting the possibility of other unrecognized wild citrus species. Understanding citrus phylogeny through genome analysis clarifies taxonomic relationships and facilitates sequence-directed genetic improvement.
Collapse
|
Research Support, N.I.H., Extramural |
11 |
362 |
10
|
Lomsadze A, Burns PD, Borodovsky M. Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene finding algorithm. Nucleic Acids Res 2014; 42:e119. [PMID: 24990371 PMCID: PMC4150757 DOI: 10.1093/nar/gku557] [Citation(s) in RCA: 352] [Impact Index Per Article: 32.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/22/2013] [Revised: 06/04/2014] [Accepted: 06/10/2014] [Indexed: 12/02/2022] Open
Abstract
We present a new approach to automatic training of a eukaryotic ab initio gene finding algorithm. With the advent of Next-Generation Sequencing, automatic training has become paramount, allowing genome annotation pipelines to keep pace with the speed of genome sequencing. Earlier we developed GeneMark-ES, currently the only gene finding algorithm for eukaryotic genomes that performs automatic training in unsupervised ab initio mode. The new algorithm, GeneMark-ET augments GeneMark-ES with a novel method that integrates RNA-Seq read alignments into the self-training procedure. Use of 'assembled' RNA-Seq transcripts is far from trivial; significant error rate of assembly was revealed in recent assessments. We demonstrated in computational experiments that the proposed method of incorporation of 'unassembled' RNA-Seq reads improves the accuracy of gene prediction; particularly, for the 1.3 GB genome of Aedes aegypti the mean value of prediction Sensitivity and Specificity at the gene level increased over GeneMark-ES by 24.5%. In the current surge of genomic data when the need for accurate sequence annotation is higher than ever, GeneMark-ET will be a valuable addition to the narrow arsenal of automatic gene prediction tools.
Collapse
|
Research Support, N.I.H., Extramural |
11 |
352 |
11
|
Tang S, Lomsadze A, Borodovsky M. Identification of protein coding regions in RNA transcripts. Nucleic Acids Res 2015; 43:e78. [PMID: 25870408 PMCID: PMC4499116 DOI: 10.1093/nar/gkv227] [Citation(s) in RCA: 274] [Impact Index Per Article: 27.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/19/2014] [Accepted: 03/05/2015] [Indexed: 01/08/2023] Open
Abstract
Massive parallel sequencing of RNA transcripts by next-generation technology (RNA-Seq) generates critically important data for eukaryotic gene discovery. Gene finding in transcripts can be done by statistical (alignment-free) as well as by alignment-based methods. We describe a new tool, GeneMarkS-T, for ab initio identification of protein-coding regions in RNA transcripts. The algorithm parameters are estimated by unsupervised training which makes unnecessary manually curated preparation of training sets. We demonstrate that (i) the unsupervised training is robust with respect to the presence of transcripts assembly errors and (ii) the accuracy of GeneMarkS-T in identifying protein-coding regions and, particularly, in predicting translation initiation sites in modelled as well as in assembled transcripts compares favourably to other existing methods.
Collapse
|
Research Support, N.I.H., Extramural |
10 |
274 |
12
|
Brůna T, Lomsadze A, Borodovsky M. GeneMark-EP+: eukaryotic gene prediction with self-training in the space of genes and proteins. NAR Genom Bioinform 2020; 2:lqaa026. [PMID: 32440658 PMCID: PMC7222226 DOI: 10.1093/nargab/lqaa026] [Citation(s) in RCA: 269] [Impact Index Per Article: 53.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/13/2019] [Revised: 03/10/2020] [Accepted: 05/12/2020] [Indexed: 01/29/2023] Open
Abstract
We have made several steps toward creating a fast and accurate algorithm for gene prediction in eukaryotic genomes. First, we introduced an automated method for efficient ab initio gene finding, GeneMark-ES, with parameters trained in iterative unsupervised mode. Next, in GeneMark-ET we proposed a method of integration of unsupervised training with information on intron positions revealed by mapping short RNA reads. Now we describe GeneMark-EP, a tool that utilizes another source of external information, a protein database, readily available prior to the start of a sequencing project. A new specialized pipeline, ProtHint, initiates massive protein mapping to genome and extracts hints to splice sites and translation start and stop sites of potential genes. GeneMark-EP uses the hints to improve estimation of model parameters as well as to adjust coordinates of predicted genes if they disagree with the most reliable hints (the -EP+ mode). Tests of GeneMark-EP and -EP+ demonstrated improvements in gene prediction accuracy in comparison with GeneMark-ES, while the GeneMark-EP+ showed higher accuracy than GeneMark-ET. We have observed that the most pronounced improvements in gene prediction accuracy happened in large eukaryotic genomes.
Collapse
|
Journal Article |
5 |
269 |
13
|
Blanc G, Agarkova I, Grimwood J, Kuo A, Brueggeman A, Dunigan DD, Gurnon J, Ladunga I, Lindquist E, Lucas S, Pangilinan J, Pröschold T, Salamov A, Schmutz J, Weeks D, Yamada T, Lomsadze A, Borodovsky M, Claverie JM, Grigoriev IV, Van Etten JL. The genome of the polar eukaryotic microalga Coccomyxa subellipsoidea reveals traits of cold adaptation. Genome Biol 2012; 13:R39. [PMID: 22630137 PMCID: PMC3446292 DOI: 10.1186/gb-2012-13-5-r39] [Citation(s) in RCA: 202] [Impact Index Per Article: 15.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2012] [Revised: 05/15/2012] [Accepted: 05/25/2012] [Indexed: 12/27/2022] Open
Abstract
Background Little is known about the mechanisms of adaptation of life to the extreme environmental conditions encountered in polar regions. Here we present the genome sequence of a unicellular green alga from the division chlorophyta, Coccomyxa subellipsoidea C-169, which we will hereafter refer to as C-169. This is the first eukaryotic microorganism from a polar environment to have its genome sequenced. Results The 48.8 Mb genome contained in 20 chromosomes exhibits significant synteny conservation with the chromosomes of its relatives Chlorella variabilis and Chlamydomonas reinhardtii. The order of the genes is highly reshuffled within synteny blocks, suggesting that intra-chromosomal rearrangements were more prevalent than inter-chromosomal rearrangements. Remarkably, Zepp retrotransposons occur in clusters of nested elements with strictly one cluster per chromosome probably residing at the centromere. Several protein families overrepresented in C. subellipsoidae include proteins involved in lipid metabolism, transporters, cellulose synthases and short alcohol dehydrogenases. Conversely, C-169 lacks proteins that exist in all other sequenced chlorophytes, including components of the glycosyl phosphatidyl inositol anchoring system, pyruvate phosphate dikinase and the photosystem 1 reaction center subunit N (PsaN). Conclusions We suggest that some of these gene losses and gains could have contributed to adaptation to low temperatures. Comparison of these genomic features with the adaptive strategies of psychrophilic microbes suggests that prokaryotes and eukaryotes followed comparable evolutionary routes to adapt to cold environments.
Collapse
|
Research Support, U.S. Gov't, Non-P.H.S. |
13 |
202 |
14
|
Wang W, Haberer G, Gundlach H, Gläßer C, Nussbaumer T, Luo MC, Lomsadze A, Borodovsky M, Kerstetter RA, Shanklin J, Byrant DW, Mockler TC, Appenroth KJ, Grimwood J, Jenkins J, Chow J, Choi C, Adam C, Cao XH, Fuchs J, Schubert I, Rokhsar D, Schmutz J, Michael TP, Mayer KFX, Messing J. The Spirodela polyrhiza genome reveals insights into its neotenous reduction fast growth and aquatic lifestyle. Nat Commun 2015; 5:3311. [PMID: 24548928 PMCID: PMC3948053 DOI: 10.1038/ncomms4311] [Citation(s) in RCA: 186] [Impact Index Per Article: 18.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/20/2013] [Accepted: 01/24/2014] [Indexed: 11/30/2022] Open
Abstract
The subfamily of the Lemnoideae belongs to a different order than other monocotyledonous species that have been sequenced and comprises aquatic plants that grow rapidly on the water surface. Here we select Spirodela polyrhiza for whole-genome sequencing. We show that Spirodela has a genome with no signs of recent retrotranspositions but signatures of two ancient whole-genome duplications, possibly 95 million years ago (mya), older than those in Arabidopsis and rice. Its genome has only 19,623 predicted protein-coding genes, which is 28% less than the dicotyledonous Arabidopsis thaliana and 50% less than monocotyledonous rice. We propose that at least in part, the neotenous reduction of these aquatic plants is based on readjusted copy numbers of promoters and repressors of the juvenile-to-adult transition. The Spirodela genome, along with its unique biology and physiology, will stimulate new insights into environmental adaptation, ecology, evolution and plant development, and will be instrumental for future bioenergy applications. Spirodela, or duckweed, is a basal monocotyledonous plant with both pharmaceutical and commercial value. Here, the authors sequence the genome of Spirodela polyrhiza, suggesting its genome has evolved by neotenous reduction and clonal propagation, and provide a platform for future comparative genomic studies in angiosperms.
Collapse
|
Research Support, U.S. Gov't, Non-P.H.S. |
10 |
186 |
15
|
Lomsadze A, Gemayel K, Tang S, Borodovsky M. Modeling leaderless transcription and atypical genes results in more accurate gene prediction in prokaryotes. Genome Res 2018; 28:1079-1089. [PMID: 29773659 PMCID: PMC6028130 DOI: 10.1101/gr.230615.117] [Citation(s) in RCA: 120] [Impact Index Per Article: 17.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/29/2017] [Accepted: 05/16/2018] [Indexed: 11/24/2022]
Abstract
In a conventional view of the prokaryotic genome organization, promoters precede operons and ribosome binding sites (RBSs) with Shine-Dalgarno consensus precede genes. However, recent experimental research suggesting a more diverse view motivated us to develop an algorithm with improved gene-finding accuracy. We describe GeneMarkS-2, an ab initio algorithm that uses a model derived by self-training for finding species-specific (native) genes, along with an array of precomputed "heuristic" models designed to identify harder-to-detect genes (likely horizontally transferred). Importantly, we designed GeneMarkS-2 to identify several types of distinct sequence patterns (signals) involved in gene expression control, among them the patterns characteristic for leaderless transcription as well as noncanonical RBS patterns. To assess the accuracy of GeneMarkS-2, we used genes validated by COG (Clusters of Orthologous Groups) annotation, proteomics experiments, and N-terminal protein sequencing. We observed that GeneMarkS-2 performed better on average in all accuracy measures when compared with the current state-of-the-art gene prediction tools. Furthermore, the screening of ∼5000 representative prokaryotic genomes made by GeneMarkS-2 predicted frequent leaderless transcription in both archaea and bacteria. We also observed that the RBS sites in some species with leadered transcription did not necessarily exhibit the Shine-Dalgarno consensus. The modeling of different types of sequence motifs regulating gene expression prompted a division of prokaryotic genomes into five categories with distinct sequence patterns around the gene starts.
Collapse
|
Research Support, N.I.H., Extramural |
7 |
120 |
16
|
Kattenhorn LM, Mills R, Wagner M, Lomsadze A, Makeev V, Borodovsky M, Ploegh HL, Kessler BM. Identification of proteins associated with murine cytomegalovirus virions. J Virol 2004; 78:11187-97. [PMID: 15452238 PMCID: PMC521832 DOI: 10.1128/jvi.78.20.11187-11197.2004] [Citation(s) in RCA: 118] [Impact Index Per Article: 5.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/07/2023] Open
Abstract
Proteins associated with the murine cytomegalovirus (MCMV) viral particle were identified by a combined approach of proteomic and genomic methods. Purified MCMV virions were dissociated by complete denaturation and subjected to either separation by sodium dodecyl sulfate-polyacrylamide gel electrophoresis and in-gel digestion or treated directly by in-solution tryptic digestion. Peptides were separated by nanoflow liquid chromatography and analyzed by tandem mass spectrometry (LC-MS/MS). The MS/MS spectra obtained were searched against a database of MCMV open reading frames (ORFs) predicted to be protein coding by an MCMV-specific version of the gene prediction algorithm GeneMarkS. We identified 38 proteins from the capsid, tegument, glycoprotein, replication, and immunomodulatory protein families, as well as 20 genes of unknown function. Observed irregularities in coding potential suggested possible sequence errors in the 3'-proximal ends of m20 and M31. These errors were experimentally confirmed by sequencing analysis. The MS data further indicated the presence of peptides derived from the unannotated ORFs ORF(c225441-226898) (m166.5) and ORF(105932-106072). Immunoblot experiments confirmed expression of m166.5 during viral infection.
Collapse
|
Research Support, U.S. Gov't, P.H.S. |
21 |
118 |
17
|
Wóycicki R, Witkowicz J, Gawroński P, Dąbrowska J, Lomsadze A, Pawełkowicz M, Siedlecka E, Yagi K, Pląder W, Seroczyńska A, Śmiech M, Gutman W, Niemirowicz-Szczytt K, Bartoszewski G, Tagashira N, Hoshi Y, Borodovsky M, Karpiński S, Malepszy S, Przybecki Z. The genome sequence of the North-European cucumber (Cucumis sativus L.) unravels evolutionary adaptation mechanisms in plants. PLoS One 2011; 6:e22728. [PMID: 21829493 PMCID: PMC3145757 DOI: 10.1371/journal.pone.0022728] [Citation(s) in RCA: 56] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/18/2010] [Accepted: 07/05/2011] [Indexed: 01/01/2023] Open
Abstract
Cucumber (Cucumis sativus L.), a widely cultivated crop, has originated from Eastern Himalayas and secondary domestication regions includes highly divergent climate conditions e.g. temperate and subtropical. We wanted to uncover adaptive genome differences between the cucumber cultivars and what sort of evolutionary molecular mechanisms regulate genetic adaptation of plants to different ecosystems and organism biodiversity. Here we present the draft genome sequence of the Cucumis sativus genome of the North-European Borszczagowski cultivar (line B10) and comparative genomics studies with the known genomes of: C. sativus (Chinese cultivar – Chinese Long (line 9930)), Arabidopsis thaliana, Populus trichocarpa and Oryza sativa. Cucumber genomes show extensive chromosomal rearrangements, distinct differences in quantity of the particular genes (e.g. involved in photosynthesis, respiration, sugar metabolism, chlorophyll degradation, regulation of gene expression, photooxidative stress tolerance, higher non-optimal temperatures tolerance and ammonium ion assimilation) as well as in distributions of abscisic acid-, dehydration- and ethylene-responsive cis-regulatory elements (CREs) in promoters of orthologous group of genes, which lead to the specific adaptation features. Abscisic acid treatment of non-acclimated Arabidopsis and C. sativus seedlings induced moderate freezing tolerance in Arabidopsis but not in C. sativus. This experiment together with analysis of abscisic acid-specific CRE distributions give a clue why C. sativus is much more susceptible to moderate freezing stresses than A. thaliana. Comparative analysis of all the five genomes showed that, each species and/or cultivars has a specific profile of CRE content in promoters of orthologous genes. Our results constitute the substantial and original resource for the basic and applied research on environmental adaptations of plants, which could facilitate creation of new crops with improved growth and yield in divergent conditions.
Collapse
MESH Headings
- Adaptation, Physiological
- Chromosome Mapping
- Chromosomes, Artificial, Bacterial
- Chromosomes, Plant/genetics
- Cucumis sativus/genetics
- DNA, Plant/genetics
- Evolution, Molecular
- Gene Expression Regulation, Plant
- Genes, Plant
- Genome, Plant
- Polymerase Chain Reaction
- Promoter Regions, Genetic/genetics
- Regulatory Sequences, Nucleic Acid
- Sequence Analysis, DNA
Collapse
|
Research Support, Non-U.S. Gov't |
14 |
56 |
18
|
Gabriel L, Brůna T, Hoff KJ, Ebel M, Lomsadze A, Borodovsky M, Stanke M. BRAKER3: Fully automated genome annotation using RNA-seq and protein evidence with GeneMark-ETP, AUGUSTUS and TSEBRA. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2023.06.10.544449. [PMID: 37398387 PMCID: PMC10312602 DOI: 10.1101/2023.06.10.544449] [Citation(s) in RCA: 47] [Impact Index Per Article: 47.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 07/04/2023]
Abstract
Gene prediction has remained an active area of bioinformatics research for a long time. Still, gene prediction in large eukaryotic genomes presents a challenge that must be addressed by new algorithms. The amount and significance of the evidence available from transcriptomes and proteomes vary across genomes, between genes and even along a single gene. User-friendly and accurate annotation pipelines that can cope with such data heterogeneity are needed. The previously developed annotation pipelines BRAKER1 and BRAKER2 use RNA-seq or protein data, respectively, but not both. A further significant performance improvement was made by the recently released GeneMark-ETP integrating all three data types. We here present the BRAKER3 pipeline that builds on GeneMark-ETP and AUGUSTUS and further improves accuracy using the TSEBRA combiner. BRAKER3 annotates protein-coding genes in eukaryotic genomes using both short-read RNA-seq and a large protein database, along with statistical models learned iteratively and specifically for the target genome. We benchmarked the new pipeline on genomes of 11 species under assumed level of relatedness of the target species proteome to available proteomes. BRAKER3 outperformed BRAKER1 and BRAKER2. The average transcript-level F1-score was increased by ~20 percentage points on average, while the difference was most pronounced for species with large and complex genomes. BRAKER3 also outperformed other existing tools, MAKER2, Funannotate and FINDER. The code of BRAKER3 is available on GitHub and as a ready-to-run Docker container for execution with Docker or Singularity. Overall, BRAKER3 is an accurate, easy-to-use tool for eukaryotic genome annotation.
Collapse
|
Preprint |
1 |
47 |
19
|
Mills R, Rozanov M, Lomsadze A, Tatusova T, Borodovsky M. Improving gene annotation of complete viral genomes. Nucleic Acids Res 2003; 31:7041-55. [PMID: 14627837 PMCID: PMC290248 DOI: 10.1093/nar/gkg878] [Citation(s) in RCA: 39] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/24/2003] [Revised: 08/02/2003] [Accepted: 10/03/2003] [Indexed: 11/18/2022] Open
Abstract
Gene annotation in viruses often relies upon similarity search methods. These methods possess high specificity but some genes may be missed, either those unique to a particular genome or those highly divergent from known homologs. To identify potentially missing viral genes we have analyzed all complete viral genomes currently available in GenBank with a specialized and augmented version of the gene finding program GeneMarkS. In particular, by implementing genome-specific self-training protocols we have better adjusted the GeneMarkS statistical models to sequences of viral genomes. Hundreds of new genes were identified, some in well studied viral genomes. For example, a new gene predicted in the genome of the Epstein-Barr virus was shown to encode a protein similar to alpha-herpesvirus minor tegument protein UL14 with heat shock functions. Convincing evidence of this similarity was obtained after only 12 PSI-BLAST iterations. In another example, several iterations of PSI-BLAST were required to demonstrate that a gene predicted in the genome of Alcelaphine herpesvirus 1 encodes a BALF1-like protein which is thought to be involved in apoptosis regulation and, potentially, carcinogenesis. New predictions were used to refine annotations of viral genomes in the RefSeq collection curated by the National Center for Biotechnology Information. Importantly, even in those cases where no sequence similarities were detected, GeneMarkS significantly reduced the number of primary targets for experimental characterization by identifying the most probable candidate genes. The new genome annotations were stored in VIOLIN, an interactive database which provides access to similarity search tools for up-to-date analysis of predicted viral proteins.
Collapse
|
research-article |
22 |
39 |
20
|
Buti M, Moretto M, Barghini E, Mascagni F, Natali L, Brilli M, Lomsadze A, Sonego P, Giongo L, Alonge M, Velasco R, Varotto C, Šurbanovski N, Borodovsky M, Ward JA, Engelen K, Cavallini A, Cestaro A, Sargent DJ. The genome sequence and transcriptome of Potentilla micrantha and their comparison to Fragaria vesca (the woodland strawberry). Gigascience 2018; 7:1-14. [PMID: 29659812 PMCID: PMC5893959 DOI: 10.1093/gigascience/giy010] [Citation(s) in RCA: 30] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/27/2017] [Accepted: 02/07/2018] [Indexed: 11/20/2022] Open
Abstract
Background The genus Potentilla is closely related to that of Fragaria, the economically important strawberry genus. Potentilla micrantha is a species that does not develop berries but shares numerous morphological and ecological characteristics with Fragaria vesca. These similarities make P. micrantha an attractive choice for comparative genomics studies with F. vesca. Findings In this study, the P. micrantha genome was sequenced and annotated, and RNA-Seq data from the different developmental stages of flowering and fruiting were used to develop a set of gene predictions. A 327 Mbp sequence and annotation of the genome of P. micrantha, spanning 2674 sequence contigs, with an N50 size of 335,712, estimated to cover 80% of the total genome size of the species was developed. The genus Potentilla has a characteristically larger genome size than Fragaria, but the recovered sequence scaffolds were remarkably collinear at the micro-syntenic level with the genome of F. vesca, its closest sequenced relative. A total of 33,602 genes were predicted, and 95.1% of bench-marking universal single-copy orthologous genes were complete within the presented sequence. Thus, we argue that the majority of the gene-rich regions of the genome have been sequenced. Conclusions Comparisons of RNA-Seq data from the stages of floral and fruit development revealed genes differentially expressed between P. micrantha and F. vesca.The data presented are a valuable resource for future studies of berry development in Fragaria and the Rosaceae and they also shed light on the evolution of genome size and organization in this family.
Collapse
|
Research Support, Non-U.S. Gov't |
7 |
30 |
21
|
Kislyuk A, Lomsadze A, Lapidus AL, Borodovsky M. Frameshift detection in prokaryotic genomic sequences. ACTA ACUST UNITED AC 2009; 5:458-77. [PMID: 19640832 DOI: 10.1504/ijbra.2009.027519] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/21/2022]
Abstract
We have developed a new method for frameshift detection, a combination of ab initio and alignment-based algorithms, that can serve as a useful tool for sequencing quality control in the next generation sequencing. We evaluated the method's accuracy on test sets of annotated genomic sequences with artificial frameshifts in protein coding regions. These tests have shown that the new method performs comparably to the earlier developed FrameD. On the sets of sequences produced by 454 pyrosequencing with sequence errors recovered by Sanger re-sequencing the accuracy of the method was shown to hold at the same level.
Collapse
|
Research Support, N.I.H., Extramural |
16 |
12 |
22
|
Abstract
We derive a tight perturbation bound for hidden Markov models. Using this bound, we show that, in many cases, the distribution of a hidden Markov model is considerably more sensitive to perturbations in the emission probabilities than to perturbations in the transition probability matrix and the initial distribution of the underlying Markov chain. Our approach can also be used to assess the sensitivity of other stochastic models, such as mixture processes and semi-Markov processes.
Collapse
|
|
9 |
10 |
23
|
Bruna T, Lomsadze A, Borodovsky M. A new gene finding tool GeneMark-ETP significantly improves the accuracy of automatic annotation of large eukaryotic genomes. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2023.01.13.524024. [PMID: 36711453 PMCID: PMC9882169 DOI: 10.1101/2023.01.13.524024] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 01/17/2023]
Abstract
Large-scale genomic initiatives, such as the Earth BioGenome Project, require efficient methods for eukaryotic genome annotation. Here we present an automatic gene finder, GeneMark-ETP, integrating genomic-, transcriptomic- and protein-derived evidence that has been developed with a focus on large plant and animal genomes. GeneMark-ETP first identifies genomic loci where extrinsic data is sufficient for making gene predictions with 'high confidence'. The genes situated in the genomic space between the high confidence genes are predicted in the next stage. The set of high confidence genes serves as an initial training set for the statistical model. Further on, the model parameters are iteratively updated in the rounds of gene prediction and parameter re-estimation. Upon reaching convergence, GeneMark-ETP makes the final predictions and delivers the whole complement of predicted genes. GeneMark-ETP outperformed gene finders using a single type of extrinsic evidence. Comparisons with gene finders utilizing both transcript- and protein-derived extrinsic evidence, MAKER2, and TSEBRA, demonstrated that GeneMark-ETP delivered state-of-the-art gene prediction accuracy with the margin of outperforming existing approaches increasing in its applications to larger and more complex eukaryotic genomes.
Collapse
|
Preprint |
1 |
10 |
24
|
Houghton KA, Lomsadze A, Park S, Nascimento FS, Barratt J, Arrowood MJ, VanRoey E, Talundzic E, Borodovsky M, Qvarnstrom Y. Development of a workflow for identification of nuclear genotyping markers for Cyclospora cayetanensis. ACTA ACUST UNITED AC 2020; 27:24. [PMID: 32275020 PMCID: PMC7147239 DOI: 10.1051/parasite/2020022] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/27/2019] [Accepted: 04/02/2020] [Indexed: 01/29/2023]
Abstract
Cyclospora cayetanensis is an intestinal parasite responsible for the diarrheal illness, cyclosporiasis. Molecular genotyping, using targeted amplicon sequencing, provides a complementary tool for outbreak investigations, especially when epidemiological data are insufficient for linking cases and identifying clusters. The goal of this study was to identify candidate genotyping markers using a novel workflow for detection of segregating single nucleotide polymorphisms (SNPs) in C. cayetanensis genomes. Four whole C. cayetanensis genomes were compared using this workflow and four candidate markers were selected for evaluation of their genotyping utility by PCR and Sanger sequencing. These four markers covered 13 SNPs and resolved parasites from 57 stool specimens, differentiating C. cayetanensis into 19 new unique genotypes.
Collapse
|
Journal Article |
5 |
8 |
25
|
Brůna T, Aryal R, Dudchenko O, Sargent DJ, Mead D, Buti M, Cavallini A, Hytönen T, Andrés J, Pham M, Weisz D, Mascagni F, Usai G, Natali L, Bassil N, Fernandez GE, Lomsadze A, Armour M, Olukolu B, Poorten T, Britton C, Davik J, Ashrafi H, Aiden EL, Borodovsky M, Worthington M. A chromosome-length genome assembly and annotation of blackberry (Rubus argutus, cv. "Hillquist"). G3 (BETHESDA, MD.) 2023; 13:jkac289. [PMID: 36331334 PMCID: PMC9911083 DOI: 10.1093/g3journal/jkac289] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 05/12/2022] [Accepted: 10/03/2022] [Indexed: 11/06/2022]
Abstract
Blackberries (Rubus spp.) are the fourth most economically important berry crop worldwide. Genome assemblies and annotations have been developed for Rubus species in subgenus Idaeobatus, including black raspberry (R. occidentalis), red raspberry (R. idaeus), and R. chingii, but very few genomic resources exist for blackberries and their relatives in subgenus Rubus. Here we present a chromosome-length assembly and annotation of the diploid blackberry germplasm accession "Hillquist" (R. argutus). "Hillquist" is the only known source of primocane-fruiting (annual-fruiting) in tetraploid fresh-market blackberry breeding programs and is represented in the pedigree of many important cultivars worldwide. The "Hillquist" assembly, generated using Pacific Biosciences long reads scaffolded with high-throughput chromosome conformation capture sequencing, consisted of 298 Mb, of which 270 Mb (90%) was placed on 7 chromosome-length scaffolds with an average length of 38.6 Mb. Approximately 52.8% of the genome was composed of repetitive elements. The genome sequence was highly collinear with a novel maternal haplotype-resolved linkage map of the tetraploid blackberry selection A-2551TN and genome assemblies of R. chingii and red raspberry. A total of 38,503 protein-coding genes were predicted, of which 72% were functionally annotated. Eighteen flowering gene homologs within a previously mapped locus aligning to an 11.2 Mb region on chromosome Ra02 were identified as potential candidate genes for primocane-fruiting. The utility of the "Hillquist" genome has been demonstrated here by the development of the first genotyping-by-sequencing-based linkage map of tetraploid blackberry and the identification of possible candidate genes for primocane-fruiting. This chromosome-length assembly will facilitate future studies in Rubus biology, genetics, and genomics and strengthen applied breeding programs.
Collapse
|
Research Support, N.I.H., Extramural |
2 |
5 |