1
Schultz DT, Eizenga JM, Corbett-Detig RB, Francis WR, Christianson LM, Haddock SH. Conserved novel ORFs in the mitochondrial genome of the ctenophore Beroe forskalii. PeerJ 2020; 8:e8356. [PMID: 32025367 PMCID: PMC6991124 DOI: 10.7717/peerj.8356] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 05/24/2019] [Accepted: 12/04/2019] [Indexed: 11/20/2022]
Abstract
To date, five ctenophore species' mitochondrial genomes have been sequenced, and each contains open reading frames (ORFs) that if translated have no identifiable orthologs. ORFs with no identifiable orthologs are called unidentified reading frames (URFs). If truly protein-coding, ctenophore mitochondrial URFs represent a little understood path in early-diverging metazoan mitochondrial evolution and metabolism. We sequenced and annotated the mitochondrial genomes of three individuals of the beroid ctenophore Beroe forskalii and found that in addition to sharing the same canonical mitochondrial genes as other ctenophores, the B. forskalii mitochondrial genome contains two URFs. These URFs are conserved among the three individuals but not found in other sequenced species. We developed computational tools called pauvre and cuttlery to determine the likelihood that URFs are protein coding. There is evidence that the two URFs are under negative selection, and a novel Bayesian hypothesis test of trinucleotide frequency shows that the URFs are more similar to known coding genes than noncoding intergenic sequence. Protein structure and function prediction of all ctenophore URFs suggests that they all code for transmembrane transport proteins. These findings, along with the presence of URFs in other sequenced ctenophore mitochondrial genomes, suggest that ctenophores may have uncharacterized transmembrane proteins present in their mitochondria.
Affiliation(s)
- Darrin T. Schultz
  - Department of Biomolecular Engineering and Bioinformatics, University of California Santa Cruz, Santa Cruz, CA, USA
  - Monterey Bay Aquarium Research Institute, Moss Landing, CA, USA
- Jordan M. Eizenga
  - Department of Biomolecular Engineering and Bioinformatics, University of California Santa Cruz, Santa Cruz, CA, USA
- Russell B. Corbett-Detig
  - Department of Biomolecular Engineering and Bioinformatics, University of California Santa Cruz, Santa Cruz, CA, USA
- Warren R. Francis
  - Department of Biology, University of Southern Denmark, Odense, Denmark
- Steven H.D. Haddock
  - Monterey Bay Aquarium Research Institute, Moss Landing, CA, USA
  - Department of Ecology and Evolutionary Biology, University of California Santa Cruz, Santa Cruz, CA, USA
2
Goli B, Nair AS. The elusive short gene – an ensemble method for recognition for prokaryotic genome. Biochem Biophys Res Commun 2012; 422:36-41. [DOI: 10.1016/j.bbrc.2012.04.090] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 04/13/2012] [Accepted: 04/17/2012] [Indexed: 10/28/2022]
3
Kulkarni OC, Vigneshwar R, Jayaraman VK, Kulkarni BD. Identification of coding and non-coding sequences using local Holder exponent formalism. Bioinformatics 2005; 21:3818-23. [PMID: 16118261 DOI: 10.1093/bioinformatics/bti639] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.7] [Indexed: 11/13/2022]
Abstract
MOTIVATION: Accurate prediction of genes in genomes has always been a challenging task for bioinformaticians and computational biologists. The discovery of existence of distinct scaling relations in coding and non-coding sequences has led to new perspectives in the understanding of the DNA sequences. This has motivated us to exploit the differences in the local singularity distributions for characterization and classification of coding and non-coding sequences. RESULTS: The local singularity density distribution in the coding and non-coding sequences of four genomes was first estimated using the wavelet transform modulus maxima methodology. Support vector machines classifier was then trained with the extracted features. The trained classifier is able to provide an average test accuracy of 97.7%. The local singularity features in a DNA sequence can be exploited for successful identification of coding and non-coding sequences. CONTACT: Available on request from bd.kulkarni@ncl.res.in.
4
Abstract
Recognition of function of newly sequenced DNA fragments is an important area of computational molecular biology. Here we present an extensive review of methods for prediction of functional sites, tRNA, and protein-coding genes and discuss possible further directions of research in this area.
Affiliation(s)
- M S Gelfand
  - Institute of Protein Research, Russian Academy of Sciences, Pushchino, Moscow region, Russia
5
Abstract
A number of methods for recognizing protein coding genes in DNA sequence have been published over the last 13 years, and new, more comprehensive algorithms, drawing on the repertoire of existing techniques, continue to be developed. To optimize continued development, it is valuable to systematically review and evaluate published techniques. At the core of most gene recognition algorithms is one or more coding measures--functions which produce, given any sample window of sequence, a number or vector intended to measure the degree to which a sample sequence resembles a window of 'typical' exonic DNA. In this paper we review and synthesize the underlying coding measures from published algorithms. A standardized benchmark is described, and each of the measures is evaluated according to this benchmark. Our main conclusion is that a very simple and obvious measure--counting oligomers--is more effective than any of the more sophisticated measures. Different measures contain different information. However there is a great deal of redundancy in the current suite of measures. We show that in future development of gene recognition algorithms, attention can probably be limited to six of the twenty or so measures proposed to date.
Affiliation(s)
- J W Fickett
  - Theoretical Biology and Biophysics Group, Los Alamos National Laboratory, NM 87545
6
Abstract
We have developed an algorithm that automatically and reproducibly identifies potential tRNA genes in genomic DNA sequences, and we present a general strategy for testing the sensitivity of such algorithms. This algorithm is useful for the flagging and characterization of long genomic sequences that have not been experimentally analyzed for identification of functional regions, and for the scanning of nucleotide sequence databases for errors in the sequences and the functional assignments associated with them. In an exhaustive scan of the GenBank database, 97.5% of the 744 known tRNA genes were correctly identified (true-positives), and 42 previously unidentified sequences were predicted to be tRNAs. A detailed analysis of these latter predictions reveals that 16 of the 42 are very similar to known tRNA genes, and we predict that they do, in fact, code for tRNA, yielding a false-positive rate for the algorithm of 0.003%. The new algorithm and testing strategy are a considerable improvement over any previously described strategies for recognizing tRNA genes, and they allow detections of genes (including introns) embedded in long genomic sequences.
Affiliation(s)
- G A Fichant
  - Theoretical Biology and Biophysics Group, Los Alamos National Laboratory, NM 87545
7
Cosmi C, Cuomo V, Ragosta M, Macchiato MF. Characterization of nucleotidic sequences using maximum entropy techniques. J Theor Biol 1990; 147:423-32. [PMID: 2292889 DOI: 10.1016/s0022-5193(05)80497-7] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.4] [Indexed: 12/31/2022]
Abstract
A statistical method for characterizing nucleotidic sequences based on maximum entropy techniques is presented. The method uses only codon usage tables and takes into account the length of sequences, and preserves the information contained in each codon by a punctual index. We present the methodological aspects of the analysis, showing an application relative to nucleotidic sequences of eukaryotes.
Affiliation(s)
- C Cosmi
  - Istituto di Fisica, Facoltà di Ingegneria, Università della Basilicata, Potenza, Italy
8
Abstract
A novel approach to the problem of prediction of protein-coding regions is suggested. This approach combines the site prediction methods to predict splicing sites and the global coding region prediction methods to choose the best variant of spliced mRNA. One of the advantages of the suggested algorithm is that the resulting mRNA or protein sequence may then be immediately analyzed further. The true mRNA either coincides with the predicted one or ranks high in the list of variants. In the latter situation the predicted mRNA usually differs from the true one in only one or two of several exons. The combined approach allows the use of a priori information (e.g. the putative protein length or the number of exons). It is possible to use additional parameters not considered here, such as the preferred lengths of exons and introns, and particularly the preferred position of introns in the reading frame and the preferred codon position of exon termini.
Affiliation(s)
- M S Gelfand
  - Institute of Protein Research, USSR Academy of Sciences, Pushchino, Moscow region
9
Affiliation(s)
- J Kypr
  - Institute of Biophysics, Czechoslovak Academy of Sciences, Brno
10
Tramontano A, Macchiato MF. A transportable interactive package for the statistical analysis and handling of sequence data. Comput Biol Med 1988; 18:113-22. [PMID: 3356143 DOI: 10.1016/0010-4825(88)90037-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.0] [Indexed: 01/05/2023]
Abstract
A real-time, machine-independent software for the analysis and manipulation of sequence data is described. The implemented system allows users to read in sequence data from existing data bases, and to edit, manipulate, analyse and align them. It is easy to use and is suitable for routine investigations and to modify sequences in order to create a user customized data base.
Affiliation(s)
- A Tramontano
  - International Institute of Genetics and Biophysics, C.N.R., Napoli, Italy
11
Charlebois RL, Lam WL, Cline SW, Doolittle WF. Characterization of pHV2 from Halobacterium volcanii and its use in demonstrating transformation of an archaebacterium. Proc Natl Acad Sci U S A 1987; 84:8530-4. [PMID: 2825193 PMCID: PMC299578 DOI: 10.1073/pnas.84.23.8530] [Citation(s) in RCA: 115] [Impact Index Per Article: 3.1] [Indexed: 01/02/2023]
Abstract
We determined the complete nucleotide sequence of the 6354-base-pair plasmid pHV2 of the archaebacterium Halobacterium volcanii. This plasmid is present in approximately six copies per chromosome. We have generated a strain, H. volcanii WFD11, cured of pHV2 by treatment of liquid cultures with ethidium bromide. We describe PEG-mediated transformation of H. volcanii WFD11 with intact pHV2 and with a form of pHV2 marked by a 93-base-pair deletion generated in vitro.
Affiliation(s)
- R L Charlebois
  - Department of Biochemistry, Dalhousie University, Halifax, NS, Canada
12
Moody ME, Fristensky B. Database bias and the identification of protein coding sequences. DNA (MARY ANN LIEBERT, INC.) 1987; 6:493-5. [PMID: 3677996 DOI: 10.1089/dna.1987.6.493] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.0] [Indexed: 01/06/2023]
Abstract
A simple quantitative test for the probability that an open reading frame actually codes for a protein has been described by Tramontano and Macchiato (1986). However, their test is only valid for the special case in which both coding and noncoding sequences are represented equally. We present a generalized adaptation of their method that uses estimates for the relative proportions of coding and noncoding sequences to provide a more accurate prediction.
Affiliation(s)
- M E Moody
  - Department of Pure and Applied Mathematics, Washington State University, Pullman 99164-2930
13
Abstract
We find a region in the non-coding part of bacteriophage lambda genome that codes for the conserved fold which repressors and other proteins use for specific DNA binding. The region is involved in a long open reading frame exceeding one kilobase and is read in the same frame as gene A in the opposite strand. The putative translation product of this open reading frame has a highly ordered secondary structure with a predominance of alpha helices, which is typical of repressors. In addition, codon usage in this frame suggests a protein-coding region. However, there is a TGA stop codon located between the putative gene start point and the region coding for the DNA binding fold. It thus appears that bacteriophage lambda had one more DNA binding protein, perhaps repressor, in the past that was inactivated by a mutation.
14
Kypr J. A part of codon bias in genes protects protein spatial structures from destabilization by random single point mutations. Biochem Biophys Res Commun 1986; 139:1094-7. [PMID: 3767992 DOI: 10.1016/s0006-291x(86)80289-3] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.2] [Indexed: 01/07/2023]
Abstract
Relationships are examined, using a primitive approximation, between the known general codon bias in genes and resistance of protein tertiary structure to destabilization by random single point mutations. A correlation of these two properties is found in the case of the first codon position while the second and third codon positions are evidently used for other purposes. This study suggests a separation of roles of the particular codon positions in the translation of the genetic message.