1
Arruda M, da Silva A, de Assis F. An Adaptive Mapping Method Using Spectral Envelope Approach for DNA Spectral Analysis. Entropy 2022; 24:e24070978. [PMID: 35885202 PMCID: PMC9323741 DOI: 10.3390/e24070978] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2022] [Revised: 07/07/2022] [Accepted: 07/12/2022] [Indexed: 11/16/2022]
Abstract
The digital signal processing approaches were investigated as a preliminary indicator for discriminating between the protein coding and non-coding regions of DNA. This is because a three-base periodicity (TBP) has already been proven to exist in protein-coding regions arising from the length of codons (three nucleic acids). This demonstrates that there is a prominent peak in the energy spectrum of a DNA coding sequence at a normalized frequency of 1/3 cycles/sample (2π/3 rad/sample). However, because DNA sequences are symbolic sequences, these should be mapped into one or more signals such that the hidden information is highlighted. We propose, therefore, two new algorithms for computing adaptive mappings and, by using them, finding periodicities. Both such algorithms are based on the spectral envelope approach. This adaptive approach is essentially important since a single mapping for any DNA sequence may ignore its intrinsic properties. Finally, the improved performance of the new methods is verified by using them with synthetic and real DNA sequences as compared to the classical methods, especially the minimum entropy mapping (MEM) spectrum, which is also an adaptive method. We demonstrated that our method is both more accurate and more responsive than all its counterparts. This is especially important in this application since it reduces the risks of a coding sequence being missed.
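The three-base periodicity peak described in this abstract can be illustrated with a small sketch. It does not reproduce the adaptive mappings proposed by the authors; it assumes the classical fixed mapping of four binary indicator sequences (one per nucleotide) and sums their spectral power at a chosen frequency. The function names `indicator` and `power_at` are hypothetical.

```python
import cmath


def indicator(seq, base):
    """Binary indicator sequence: 1.0 where `base` occurs, else 0.0."""
    return [1.0 if ch == base else 0.0 for ch in seq.upper()]


def power_at(seq, freq):
    """Sum over A, C, G, T of the squared DFT magnitude of each
    indicator sequence, evaluated at `freq` (cycles per sample)."""
    total = 0.0
    for base in "ACGT":
        x = indicator(seq, base)
        coeff = sum(v * cmath.exp(-2j * cmath.pi * freq * n)
                    for n, v in enumerate(x))
        total += abs(coeff) ** 2
    return total


# Toy input (an assumption, not real data): a perfectly period-3 sequence.
toy = "ATG" * 40
print(power_at(toy, 1 / 3), power_at(toy, 0.25))
```

On a real sequence the peak at 1/3 is far less clean than on this toy input, which is why the choice of mapping matters.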
2
SAVMD: An adaptive signal processing method for identifying protein coding regions. Biomed Signal Process Control 2021. [DOI: 10.1016/j.bspc.2021.102998] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022]
3
Tsonis AA, Wang G, Zhang L, Lu W, Kayafas A, Del Rio-Tsonis K. An application of slow feature analysis to the genetic sequences of coronaviruses and influenza viruses. Hum Genomics 2021; 15:26. [PMID: 33962680 PMCID: PMC8103670 DOI: 10.1186/s40246-021-00327-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/03/2021] [Accepted: 04/19/2021] [Indexed: 12/03/2022]
Abstract
Background Mathematical approaches have been for decades used to probe the structure of DNA sequences. This has led to the development of Bioinformatics. In this exploratory work, a novel mathematical method is applied to probe the DNA structure of two related viral families: those of coronaviruses and those of influenza viruses. The coronaviruses are SARS-CoV-2, SARS-CoV-1, and MERS. The influenza viruses include H1N1-1918, H1N1-2009, H2N2-1957, and H3N2-1968. Methods The mathematical method used is the slow feature analysis (SFA), a rather new but promising method to delineate complex structure in DNA sequences. Results The analysis indicates that the DNA sequences exhibit an elaborate and convoluted structure akin to complex networks. We define a measure of complexity and show that each DNA sequence exhibits a certain degree of complexity within itself, while at the same time there exists complex inter-relationships between the sequences within a family and between the two families. From these relationships, we find evidence, especially for the coronavirus family, that increasing complexity in a sequence is associated with higher transmission rate but with lower mortality. Conclusions The complexity measure defined here may hold a promise and could become a useful tool in the prediction of transmission and mortality rates in future new viral strains. Supplementary Information The online version contains supplementary material available at 10.1186/s40246-021-00327-2.
Affiliation(s)
- Anastasios A Tsonis
- Department of Mathematical Sciences, Atmospheric Sciences Group, University of Wisconsin-Milwaukee, Milwaukee, WI, 53201, USA; Hydrologic Research Center, San Diego, CA, 92127, USA
- Geli Wang
- Key Laboratory of Middle Atmosphere and Global Environment Observation (LAGEO), Institute of Atmospheric Physics, Chinese Academy of Sciences, Beijing, 100029, China
- Lvyi Zhang
- Key Laboratory of Middle Atmosphere and Global Environment Observation (LAGEO), Institute of Atmospheric Physics, Chinese Academy of Sciences, Beijing, 100029, China
- Wenxu Lu
- Key Laboratory of Middle Atmosphere and Global Environment Observation (LAGEO), Institute of Atmospheric Physics, Chinese Academy of Sciences, Beijing, 100029, China
- Aristotle Kayafas
- Department of Biology and Center for Visual Sciences, Miami University, Oxford, OH, 45056, USA
- Katia Del Rio-Tsonis
- Department of Biology and Center for Visual Sciences, Miami University, Oxford, OH, 45056, USA
4
Yin C. Latent periodicity-2 in coronavirus SARS-CoV-2 genome: Evolutionary implications. J Theor Biol 2021; 515:110604. [PMID: 33508323 PMCID: PMC7835100 DOI: 10.1016/j.jtbi.2021.110604] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/28/2020] [Revised: 01/02/2021] [Accepted: 01/21/2021] [Indexed: 11/25/2022]
Abstract
The ongoing global pandemic of infection disease COVID-19 caused by the 2019 novel coronavirus (SARS-COV-2, formerly 2019-nCoV) presents critical threats to public health and the economy. The genome of SARS-CoV-2 had been sequenced and structurally annotated, yet little is known of the intrinsic organization and evolution of the genome. To this end, we present a mathematical method for the genomic spectrum, a kind of barcode, of SARS-CoV-2 and common human coronaviruses. The genomic spectrum is constructed according to the periodic distributions of nucleotides and therefore reflects the unique characteristics of the genome. The results demonstrate that coronavirus SARS-CoV-2 exhibits predominant latent periodicity-2 regions of non-structural proteins 3, 4, 5, and 6. Further analysis of the latent periodicity-2 regions suggests that the dinucleotide imbalances are increased during evolution and may confer the evolutionary fitness of the virus. Especially, SARS-CoV-2 isolates have increased latent periodicity-2 and periodicity-3 during COVID-19 pandemic. The special strong periodicity-2 regions and the intensity of periodicity-2 in the SARS-CoV-2 whole genome may become diagnostic and pharmaceutical targets in monitoring and curing the COVID-19 disease.
Affiliation(s)
- Changchuan Yin
- Department of Mathematics, Statistics, and Computer Science, The University of Illinois at Chicago, Chicago, IL 60607-7045, USA
5
Zheng Q, Chen T, Zhou W, Xie L, Su H. Gene prediction by the noise-assisted MEMD and wavelet transform for identifying the protein coding regions. Biocybern Biomed Eng 2021. [DOI: 10.1016/j.bbe.2020.12.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/22/2022]
6
Han S, Liang Y, Ma Q, Xu Y, Zhang Y, Du W, Wang C, Li Y. LncFinder: an integrated platform for long non-coding RNA identification utilizing sequence intrinsic composition, structural information and physicochemical property. Brief Bioinform 2020; 20:2009-2027. [PMID: 30084867 PMCID: PMC6954391 DOI: 10.1093/bib/bby065] [Citation(s) in RCA: 82] [Impact Index Per Article: 20.5] [Received: 05/19/2018] [Revised: 06/20/2018] [Indexed: 12/31/2022]
Abstract
Discovering new long non-coding RNAs (lncRNAs) has been a fundamental step in lncRNA-related research. Nowadays, many machine learning-based tools have been developed for lncRNA identification. However, many methods predict lncRNAs using sequence-derived features alone, which tend to display unstable performances on different species. Moreover, the majority of tools cannot be re-trained or tailored by users and neither can the features be customized or integrated to meet researchers’ requirements. In this study, features extracted from sequence-intrinsic composition, secondary structure and physicochemical property are comprehensively reviewed and evaluated. An integrated platform named LncFinder is also developed to enhance the performance and promote the research of lncRNA identification. LncFinder includes a novel lncRNA predictor using the heterologous features we designed. Experimental results show that our method outperforms several state-of-the-art tools on multiple species with more robust and satisfactory results. Researchers can additionally employ LncFinder to extract various classic features, build classifier with numerous machine learning algorithms and evaluate classifier performance effectively and efficiently. LncFinder can reveal the properties of lncRNA and mRNA from various perspectives and further inspire lncRNA–protein interaction prediction and lncRNA evolution analysis. It is anticipated that LncFinder can significantly facilitate lncRNA-related research, especially for the poorly explored species. LncFinder is released as R package (https://CRAN.R-project.org/package=LncFinder). A web server (http://bmbl.sdstate.edu/lncfinder/) is also developed to maximize its availability.
Affiliation(s)
- Siyu Han
- College of Computer Science and Technology, Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China
- Yanchun Liang
- College of Computer Science and Technology, Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China; Zhuhai Laboratory of Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, Zhuhai College of Jilin University, Zhuhai, China
- Qin Ma
- Bioinformatics and Mathematical Biosciences Lab, Department of Agronomy, Horticulture and Plant Science, South Dakota State University, Brookings, SD, USA; Department of Mathematics and Statistics, South Dakota State University, Brookings, SD, USA
- Yangyi Xu
- College of Computer Science and Technology, Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China
- Yu Zhang
- College of Computer Science and Technology, Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China
- Wei Du
- College of Computer Science and Technology, Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China
- Cankun Wang
- Department of Mathematics and Statistics, South Dakota State University, Brookings, SD, USA
- Ying Li
- College of Computer Science and Technology, Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China
7
Raman Kumar M, Vaegae NK. A new numerical approach for DNA representation using modified Gabor wavelet transform for the identification of protein coding regions. Biocybern Biomed Eng 2020. [DOI: 10.1016/j.bbe.2020.03.007] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 11/24/2022]
8
Michel CJ, Thompson JD. Identification of a circular code periodicity in the bacterial ribosome: origin of codon periodicity in genes? RNA Biol 2020; 17:571-583. [PMID: 31960748 PMCID: PMC8647727 DOI: 10.1080/15476286.2020.1719311] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 12/13/2019] [Revised: 01/10/2020] [Accepted: 01/14/2020] [Indexed: 02/09/2023]
Abstract
Three-base periodicity (TBP), where nucleotides and higher order n-tuples are preferentially spaced by 3, 6, 9, etc. bases, is a well-known intrinsic property of protein-coding DNA sequences. However, its origins are still not fully understood. One hypothesis is that the periodicity reflects a primordial coding system that was used before the emergence of the modern standard genetic code (SGC). Recent evidence suggests that the X circular code, a set of 20 trinucleotides allowing the reading frames in genes to be retrieved locally, represents a possible ancestor of the SGC. Motifs from the X circular code have been found in the reading frame of protein-coding regions in extant organisms from bacteria to eukaryotes, in many transfer RNA (tRNA) genes and in important functional regions of the ribosomal RNA (rRNA), notably in the peptidyl transferase centre and the decoding centre. Here, we have used a powerful correlation function to search for periodicity patterns involving the 20 trinucleotides of the X circular code in a large set of bacterial protein-coding genes, as well as in the translation machinery, including rRNA and tRNA sequences. As might be expected, we found a strong circular code periodicity 0 modulo 3 in the protein-coding genes. More surprisingly, we also identified a similar circular code periodicity in a large region of the 16S rRNA. This region includes the 3' major domain corresponding to the primordial proto-ribosome decoding centre and containing numerous sites that interact with the tRNA and messenger RNA (mRNA) during translation. Furthermore, 3D structural analysis shows that the periodicity region surrounds the mRNA channel that lies between the head and the body of the SSU. Our results support the hypothesis that the X circular code may constitute an ancestral translation code involved in reading frame retrieval and maintenance, traces of which persist in modern mRNA, tRNA and rRNA despite their long evolution and adaptation to the SGC.
Affiliation(s)
- Christian J. Michel
- Department of Computer Science, ICube, CNRS, University of Strasbourg, Strasbourg, France
- Julie D. Thompson
- Department of Computer Science, ICube, CNRS, University of Strasbourg, Strasbourg, France
9
Kar S, Ganguly M, Das S. Using DIT-FFT Algorithm for Identification of Protein Coding Region in Eukaryotic Gene. Biomedical Engineering-Applications Basis Communications 2019. [DOI: 10.4015/s1016237219500029] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 11/07/2022]
Abstract
The new research platform on biomedical engineering by Digital Signal Processing (DSP) is playing a vital role in the prediction of protein coding regions (Exons) from genomic sequences with great accuracy. We can determine the protein coding area in DNA sequences with the help of period-3 property. It has been seen that in order to find out the period-3 property, the DFT algorithm is mostly used but in this paper, we have tested FFT algorithm instead of DFT algorithm. DSP is basically concerned with processing numerical sequences. When digital signal processing used in DNA sequences analysis, it requires conversion of base characters sequence to the numerical version. The numerical representation of DNA sequences strongly impacts the biological properties mirrored through the numerical genre. In this work, the proposed technique based on DIT-FFT algorithm has been used to identify the exonic area with the help of integer value representation for transforming the DNA sequences. Digital filters are used to read out period 3 components from the output spectrum and to eliminate the unwanted high frequency noise from DNA sequences. To overcome background noise means to suppress the non-coding regions, i.e., Introns. Proposed algorithm is tested on four nucleotide sequences having single or multiple numbers of exons.
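The decimation-in-time FFT with an integer mapping, as mentioned in this abstract, can be sketched as follows. The abstract does not state which integer values were used, so `INT_MAP` below is purely an assumption, and `fft_dit` and `period3_bin_power` are hypothetical names. The sketch is a plain recursive radix-2 transform and omits the digital filtering stage. Note that for a power-of-two length N the period-3 frequency N/3 is not an exact bin, so the nearest bin is read instead.

```python
import cmath

# Assumed integer mapping; the abstract does not give the actual values.
INT_MAP = {"A": 1, "C": 2, "G": 3, "T": 4}


def fft_dit(x):
    """Recursive radix-2 decimation-in-time FFT.
    The input length must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft_dit(x[0::2])   # transform of even-indexed samples
    odd = fft_dit(x[1::2])    # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def period3_bin_power(seq):
    """Power in the FFT bin nearest to frequency 1/3, after mean removal."""
    x = [float(INT_MAP[ch]) for ch in seq.upper()]
    mean = sum(x) / len(x)
    spectrum = fft_dit([v - mean for v in x])
    k = round(len(x) / 3)
    return abs(spectrum[k]) ** 2


# Toy input (an assumption): 64 bases of a period-3 pattern.
print(period3_bin_power(("ATG" * 22)[:64]))
```

The FFT gives the same coefficients as a direct DFT at lower cost; the trade-off here is the bin misalignment at 1/3 for power-of-two lengths.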
Affiliation(s)
- Subhajit Kar
- Department of Electronics, West Bengal State University, Barasat, Kolkata 700126, India
- Madhabi Ganguly
- Department of Electronics, West Bengal State University, Barasat, Kolkata 700126, India
- Saptarshi Das
- Department of Electronics, West Bengal State University, Barasat, Kolkata 700126, India
10
Anzalone AV, Zairis S, Lin AJ, Rabadan R, Cornish VW. Interrogation of Eukaryotic Stop Codon Readthrough Signals by in Vitro RNA Selection. Biochemistry 2019; 58:1167-1178. [PMID: 30698415 DOI: 10.1021/acs.biochem.8b01280] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Indexed: 11/30/2022]
Abstract
RNA signals located downstream of stop codons in eukaryotic mRNAs can stimulate high levels of translational readthrough by the ribosome, thereby giving rise to functionally distinct C-terminally extended protein products. Although many readthrough events have been previously discovered in Nature, a broader description of the stimulatory RNA signals would help to identify new reprogramming events in eukaryotic genes and provide insights into the molecular mechanisms of readthrough. Here, we explore the RNA reprogramming landscape by performing in vitro translation selections to enrich RNA readthrough signals de novo from a starting randomized library comprising >10^13 unique sequence variants. Selection products were characterized using high-throughput sequencing, from which we identified primary sequence and secondary structure readthrough features. The activities of readthrough signals, including three novel sequence motifs, were confirmed in cellular reporter assays. Then, we used machine learning and our HTS data to predict readthrough activity from human 3'-untranslated region sequences. This led to the discovery of >1.5% readthrough in four human genes (CDKN2B, LEPROTL1, PVRL3, and SFTA2). Together, our results provide valuable insights into RNA-mediated translation reprogramming, offer tools for readthrough discovery in eukaryotic genes, and present new opportunities to explore the biological consequences of stop codon readthrough in humans.
Affiliation(s)
- Andrew V Anzalone
- Department of Chemistry, Columbia University, New York, New York 10027, United States
- Sakellarios Zairis
- Department of Systems Biology, Columbia University, New York, New York 10032, United States
- Annie J Lin
- Department of Chemistry, Columbia University, New York, New York 10027, United States; Department of Systems Biology, Columbia University, New York, New York 10032, United States
- Raul Rabadan
- Department of Systems Biology, Columbia University, New York, New York 10032, United States
- Virginia W Cornish
- Department of Chemistry, Columbia University, New York, New York 10027, United States; Department of Systems Biology, Columbia University, New York, New York 10032, United States
11
Das L, Nanda S, Das JK. An integrated approach for identification of exon locations using recursive Gauss Newton tuned adaptive Kaiser window. Genomics 2018; 111:284-296. [PMID: 30342085 DOI: 10.1016/j.ygeno.2018.10.008] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Received: 02/01/2018] [Revised: 09/11/2018] [Accepted: 10/11/2018] [Indexed: 11/27/2022]
Abstract
Identification of exon location in a DNA sequence has been considered as the most demanding and challenging research topic in the field of Bioinformatics. This work proposes a robust approach combining the Trigonometric mapping with Adaptive tuned Kaiser Windowing approach for locating the protein coding regions (EXONS) in a genetic sequence. For better convergence as well as improved accuracy, the side lobe height control parameter (β) of Kaiser Window in the proposed algorithm is made adaptive to track the changing dynamics of the genetic sequence. This yields better tracking potential of the anticipated Adaptive Kaiser algorithm as it uses the recursive Gauss Newton tuning which in turn utilizes the covariance of the error signal to tune the β factor which has been shown through numerous simulation results under a variety of practical test conditions. A detailed comparative analysis with the existing mapping schemes, windowing techniques, and other signal processing methods like SVD, AN, DFT, STDFT, WT, and ST has also been included in the paper to focus on the strength and efficiency of the proposed approach. Moreover, some critical performance parameters have been computed using the proposed approach to investigate the effectiveness and robustness of the algorithm. In addition to this, the proposed approach has also been successfully applied on a number of benchmark gene sets like Mus musculus, Homo sapiens, and C. elegans, etc., where the proposed approach revealed efficient prediction of exon location in contrast to the other existing mapping methods.
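The role of the Kaiser window parameter β in this abstract can be illustrated with a short sketch. It keeps β fixed, so it shows only the windowed period-3 measurement and not the recursive Gauss-Newton tuning the authors describe. The names `bessel_i0`, `kaiser` and `windowed_period3` are hypothetical, and the input is assumed to be an already mapped numeric signal.

```python
import cmath
import math


def bessel_i0(x, terms=30):
    """Zeroth-order modified Bessel function via its power series."""
    total, term = 1.0, 1.0
    for k in range(1, terms):
        term *= (x / (2.0 * k)) ** 2
        total += term
    return total


def kaiser(m, beta):
    """Length-m Kaiser window; a larger beta lowers the side lobes
    at the cost of a wider main lobe."""
    if m == 1:
        return [1.0]
    denom = bessel_i0(beta)
    window = []
    for n in range(m):
        r = 2.0 * n / (m - 1) - 1.0
        window.append(bessel_i0(beta * math.sqrt(max(0.0, 1.0 - r * r))) / denom)
    return window


def windowed_period3(signal, m, beta):
    """Power at frequency 1/3 in every length-m Kaiser-windowed frame."""
    w = kaiser(m, beta)
    kernel = [w[n] * cmath.exp(-2j * cmath.pi * n / 3) for n in range(m)]
    return [abs(sum(signal[s + n] * kernel[n] for n in range(m))) ** 2
            for s in range(len(signal) - m + 1)]


# Toy signal (an assumption): a period-3 burst between two flat stretches.
toy = [0.0] * 30 + [1.0, 0.0, 0.0] * 10 + [0.0] * 30
powers = windowed_period3(toy, 15, 4.0)
print(powers.index(max(powers)))
```

An adaptive scheme would update `beta` from frame to frame instead of passing a constant.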
Affiliation(s)
- Lopamudra Das
- School of Electronics Engineering, KIIT University, Bhubaneswar, India
- Sarita Nanda
- School of Electronics Engineering, KIIT University, Bhubaneswar, India
- J K Das
- School of Electronics Engineering, KIIT University, Bhubaneswar, India
12
Zhao J, Wang J, Jiang H. Detecting Periodicities in Eukaryotic Genomes by Ramanujan Fourier Transform. J Comput Biol 2018; 25:963-975. [PMID: 29963923 DOI: 10.1089/cmb.2017.0252] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 11/13/2022]
Abstract
Ramanujan Fourier transform (RFT) nowadays is becoming a popular signal processing method. RFT is used to detect periodicities in exons and introns of eukaryotic genomes in this article. Genomic sequences of nine species were analyzed. The highest peak in the spectrum amplitude corresponding to each exon or intron is regarded as the significant signal. Accordingly, the periodicity corresponding to the significant signal can be also regarded as a valuable periodicity. Exons and introns have different periodic phenomena. The computational results reveal that the 2-, 3-, 4-, and 6-base periodicities of exons and introns are four kinds of important periodicities based on RFT. It is the first time that the 2-base periodicity of introns is discovered through signal processing method. The frequencies of the 2-base periodicity and the 3-base periodicity occurrence are polar opposite between the exons and the introns. With regard to the cyclicality of the Ramanujan sums, which is the base function of the transformation, RFT is suggested for studying the periodic features of dinucleotides, trinucleotides, and q nucleotides.
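The Ramanujan sums that serve as the base functions of the RFT can be sketched in a few lines. The sum c_q(n) adds cos(2πkn/q) over all k in 1..q that are coprime to q, and it is periodic in n with period q. The finite-length coefficient below, with its 1/(φ(q)·N) normalization, follows a commonly used definition and is an assumption about the convention, not necessarily the one used in the article; both function names are hypothetical.

```python
import math


def ramanujan_sum(q, n):
    """Ramanujan sum c_q(n): sum of cos(2*pi*k*n/q) over 1 <= k <= q
    with gcd(k, q) == 1. The result is always an integer."""
    total = sum(math.cos(2.0 * math.pi * k * n / q)
                for k in range(1, q + 1) if math.gcd(k, q) == 1)
    return round(total)


def rft_coefficient(x, q):
    """Finite-length RFT coefficient for period q (assumed normalization):
    sum of x(n) * c_q(n) for n = 1..N, divided by phi(q) * N."""
    phi = ramanujan_sum(q, 0)  # c_q(0) equals Euler's totient phi(q)
    acc = sum(v * ramanujan_sum(q, n) for n, v in enumerate(x, start=1))
    return acc / (phi * len(x))


# Toy signal (an assumption): the period-3 Ramanujan sum itself.
toy = [ramanujan_sum(3, n) for n in range(1, 31)]
print([rft_coefficient(toy, q) for q in (2, 3, 4, 6)])
```

Because the base functions are integer-valued and indexed by period rather than by frequency, a q-base periodicity maps directly to the coefficient for that q.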
Affiliation(s)
- Jian Zhao
- Department of Mathematics, Nanjing Tech University, Nanjing, China; Department of Statistics, Northwestern University, Evanston, Illinois
- Jiasong Wang
- Department of Mathematics, Nanjing University, Nanjing, China
- Hongmei Jiang
- Department of Statistics, Northwestern University, Evanston, Illinois
13
Computational Techniques for a Comprehensive Understanding of Different Genotype-Phenotype Factors in Biological Systems and Their Applications. Synth Biol (Oxf) 2018. [DOI: 10.1007/978-981-10-8693-9_8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
14
Wang Y, Chen X, Sheng Y, Liu Y, Gao S. N6-adenine DNA methylation is associated with the linker DNA of H2A.Z-containing well-positioned nucleosomes in Pol II-transcribed genes in Tetrahymena. Nucleic Acids Res 2017; 45:11594-11606. [PMID: 29036602 PMCID: PMC5714169 DOI: 10.1093/nar/gkx883] [Citation(s) in RCA: 74] [Impact Index Per Article: 10.6] [Received: 05/12/2017] [Revised: 09/12/2017] [Accepted: 09/23/2017] [Indexed: 01/01/2023]
Abstract
DNA N6-methyladenine (6mA) is newly rediscovered as a potential epigenetic mark across a more diverse range of eukaryotes than previously realized. As a unicellular model organism, Tetrahymena thermophila is among the first eukaryotes reported to contain 6mA modification. However, lack of comprehensive information about 6mA distribution hinders further investigations into its function and regulatory mechanism. In this study, we provide the first genome-wide, base pair-resolution map of 6mA in Tetrahymena by applying single-molecule real-time (SMRT) sequencing. We provide evidence that 6mA occurs mostly in the AT motif of the linker DNA regions. More strikingly, these linker DNA regions with 6mA are usually flanked by well-positioned nucleosomes and/or H2A.Z-containing nucleosomes. We also find that 6mA is exclusively associated with RNA polymerase II (Pol II)-transcribed genes, but is not an unambiguous mark for active transcription. These results support that 6mA is an integral part of the chromatin landscape shaped by adenosine triphosphate (ATP)-dependent chromatin remodeling and transcription.
Affiliation(s)
- Yuanyuan Wang
- Institute of Evolution & Marine Biodiversity, Ocean University of China, Qingdao 266003, China
- Xiao Chen
- Institute of Evolution & Marine Biodiversity, Ocean University of China, Qingdao 266003, China
- Yalan Sheng
- Institute of Evolution & Marine Biodiversity, Ocean University of China, Qingdao 266003, China
- Yifan Liu
- Department of Pathology, University of Michigan, Ann Arbor, MI 48109, USA
- Shan Gao
- Institute of Evolution & Marine Biodiversity, Ocean University of China, Qingdao 266003, China
- Laboratory for Marine Biology and Biotechnology, Qingdao National Laboratory for Marine Science and Technology, Qingdao 266003, China
15
Messaoudi I, Elloumi Oueslati A, Lachiri Z. Inferring Helitron Structures from 1D and 2D Representations Based on the Chaos Game Theory. Ing Rech Biomed 2017. [DOI: 10.1016/j.irbm.2017.01.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
16
Morán Losada P, Fischer S, Chouvarine P, Tümmler B. Three-base periodicity of sites of sequence variation in Pseudomonas aeruginosa and Staphylococcus aureus core genomes. FEBS Lett 2016; 590:3538-3543. [PMID: 27664047 DOI: 10.1002/1873-3468.12431] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 07/31/2016] [Revised: 09/08/2016] [Accepted: 09/12/2016] [Indexed: 11/11/2022]
Abstract
The three-base periodicity property is characteristic of protein-coding sequences. Here, we report on three-base periodicity of sequence variation in the core genome of bacteria. Single nucleotide polymorphism (SNP) syntenies were extracted from pairwise genome alignments of 41 Staphylococcus aureus or 20 Pseudomonas aeruginosa strains. The length of fragment pairs with identical nucleotides at all SNP positions showed a length-dependent overrepresentation of multiples of three nucleotides at corresponding codon positions of the AT-rich S. aureus and the GC-rich P. aeruginosa. Three-base SNP periodicity seems to be a characteristic feature of the tightly arranged bacterial core genome.
Affiliation(s)
- Patricia Morán Losada
- Clinical Research Group, 'Molecular Pathology of Cystic Fibrosis and Pseudomonas Genomics', OE 6710, Hannover Medical School, Germany
- Sebastian Fischer
- Clinical Research Group, 'Molecular Pathology of Cystic Fibrosis and Pseudomonas Genomics', OE 6710, Hannover Medical School, Germany
- Philippe Chouvarine
- Clinical Research Group, 'Molecular Pathology of Cystic Fibrosis and Pseudomonas Genomics', OE 6710, Hannover Medical School, Germany
- Burkhard Tümmler
- Clinical Research Group, 'Molecular Pathology of Cystic Fibrosis and Pseudomonas Genomics', OE 6710, Hannover Medical School, Germany; Biomedical Research in Endstage and Obstructive Lung Disease (BREATH), German Center for Lung Research, Hannover, Germany
17
Marhon SA, Kremer SC. Prediction of Protein Coding Regions Using a Wide-Range Wavelet Window Method. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2016; 13:742-753. [PMID: 26415183 DOI: 10.1109/tcbb.2015.2476789] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Indexed: 06/05/2023]
Abstract
Prediction of protein coding regions is an important topic in the field of genomic sequence analysis. Several spectrum-based techniques for the prediction of protein coding regions have been proposed. However, the outstanding issue in most of the proposed techniques is that these techniques depend on an experimentally-selected, predefined value of the window length. In this paper, we propose a new Wide-Range Wavelet Window (WRWW) method for the prediction of protein coding regions. The analysis of the proposed wavelet window shows that its frequency response can adapt its width to accommodate the change in the window length so that it can allow or prevent frequencies other than the basic frequency in the analysis of DNA sequences. This feature makes the proposed window capable of analyzing DNA sequences with a wide range of the window lengths without degradation in the performance. The experimental analysis of applying the WRWW method and other spectrum-based methods to five benchmark datasets has shown that the proposed method outperforms other methods along a wide range of the window lengths. In addition, the experimental analysis has shown that the proposed method is dominant in the prediction of both short and long exons.
18
Zhang X, Shen Z, Zhang G, Shen Y, Chen M, Zhao J, Wu R. Short Exon Detection via Wavelet Transform Modulus Maxima. PLoS One 2016; 11:e0163088. [PMID: 27635656 PMCID: PMC5026382 DOI: 10.1371/journal.pone.0163088] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 06/03/2016] [Accepted: 09/04/2016] [Indexed: 02/05/2023]
Abstract
The detection of short exons is a challenging open problem in the field of bioinformatics. Due to the fact that the weakness of existing model-independent methods lies in their inability to reliably detect small exons, a model-independent method based on the singularity detection with wavelet transform modulus maxima has been developed for detecting short coding sequences (exons) in eukaryotic DNA sequences. In the analysis of our method, the local maxima can capture and characterize singularities of short exons, which helps to yield significant patterns that are rarely observed with the traditional methods. In order to get some information about singularities on the differences between the exon signal and the background noise, the noise level is estimated by filtering the genomic sequence through a notch filter. Meanwhile, a fast method based on a piecewise cubic Hermite interpolating polynomial is applied to reconstruct the wavelet coefficients for improving the computational efficiency. In addition, the output measure of a paired-numerical representation calculated in both forward and reverse directions is used to incorporate a useful DNA structural property. The performances of our approach and other techniques are evaluated on two benchmark data sets. Experimental results demonstrate that the proposed method outperforms all assessed model-independent methods for detecting short exons in terms of evaluation metrics.
Affiliation(s)
- Xiaolei Zhang
- Shantou University Medical College, Shantou, P.R. China
- Zhiwei Shen
- Department of Radiology, Second Affiliated Hospital of Shantou University Medical College, Shantou, P.R. China
- Guishan Zhang
- College of Engineering, Shantou University, Shantou, P.R. China
- Yuanyu Shen
- Department of Radiology, Second Affiliated Hospital of Shantou University Medical College, Shantou, P.R. China
- Miaomiao Chen
- Department of Radiology, Second Affiliated Hospital of Shantou University Medical College, Shantou, P.R. China
- Jiaxiang Zhao
- College of Electronic Information and Optical Engineering, Nankai University, Tianjin, P.R. China
- * E-mail: (JXZ); (RHW)
- Renhua Wu
- Department of Radiology, Second Affiliated Hospital of Shantou University Medical College, Shantou, P.R. China
- * E-mail: (JXZ); (RHW)
|
19
|
Howe ED, Song JS. Categorical spectral analysis of periodicity in human and viral genomes. Nucleic Acids Res 2012; 41:1395-405. [PMID: 23241388 PMCID: PMC3561982 DOI: 10.1093/nar/gks1261] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Indexed: 11/12/2022]
Abstract
Periodicity in nucleotide sequences arises from regular repeating patterns which may reflect important structure and function. Although a three-base periodicity in coding regions has been known for some time and has provided the basis for powerful gene prediction algorithms, its origins are still not fully understood. Here, we show that, contrary to common belief, amino acid (AA) bias and codon usage bias are insufficient to create base-3 periodicity. This article applies the rigorous method of spectral envelope to systematically characterize the contributions of codon bias, AA bias and protein structural motifs to the three-base periodicity of coding sequences. The method is also used to classify CpG islands in the human genome. In addition, we show how spectral envelope can be used to trace the evolution of viral genomes and monitor global sequence changes without having to align to previously known genomes. This approach also detects reassortment events, such as those that led to the 2009 pandemic H1N1 virus.
Affiliation(s)
- Elizabeth D Howe
- Institute for Human Genetics, University of California, San Francisco, 513 Parnassus Avenue, Box 0794, San Francisco, CA 94143-0794, USA
|
20
|
Abstract
Novel methods for identifying a new type of DNA latent periodicity, called latent profile periodicity or latent profility, are used to search for periodic structures in genes. These methods reveal two distinct levels of organization of genetic information encoding. It is shown that latent profility in genes may correlate with specific structural features of their encoded proteins.
Affiliation(s)
- Maria Chaley
- Institute of Mathematical Problems of Biology, Russian Academy of Sciences, Institutskaya st., 4, 142290 Pushchino, Russia.
|
21
|
Smith A, Johnson P. Gene expression in the unicellular eukaryote Trichomonas vaginalis. Res Microbiol 2011; 162:646-54. [DOI: 10.1016/j.resmic.2011.04.007] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Received: 02/17/2011] [Accepted: 03/02/2011] [Indexed: 02/01/2023]
|
22
|
Trotta E. The 3-base periodicity and codon usage of coding sequences are correlated with gene expression at the level of transcription elongation. PLoS One 2011; 6:e21590. [PMID: 21738721 PMCID: PMC3125259 DOI: 10.1371/journal.pone.0021590] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Received: 03/01/2011] [Accepted: 06/03/2011] [Indexed: 11/18/2022]
Abstract
Background: Gene transcription is regulated by DNA transcriptional regulatory elements, promoters and enhancers that are located outside the coding regions. Here, we examine the characteristic 3-base periodicity of the coding sequences and analyse its correlation with the genome-wide transcriptional profile of yeast. Principal Findings: The analysis of coding sequences by a new class of indices proposed here identified two different sources of 3-base periodicity: the codon frequency and the codon sequence. In exponentially growing yeast cells, the codon-frequency component of periodicity accounts for 71.9% of the variability of the cellular mRNA by a strong association with the density of elongating mRNA polymerase II complexes. The mRNA abundance explains most of the correlation between the codon-frequency component of periodicity and protein levels. Furthermore, pyrimidine-ending codons of the four-fold degenerate small amino acids alanine, glycine and valine are associated with genes with double the transcription rate of those associated with purine-ending codons. Conclusions: We demonstrate that the 3-base periodicity of coding sequences is higher than expected by the codon usage frequency (CUF) and that its components, associated with codon bias and amino acid composition, are correlated with gene expression, principally at the level of transcription elongation. This indicates a role of codon sequences in maximising the transcription efficiency in exponentially growing yeast cells. Moreover, the results contrast with the common Darwinian explanation that attributes the codon bias to translational selection by an adjustment of synonymous codon frequencies to the most abundant isoaccepting tRNA. Here, we show that selection on codon bias likely acts at both the transcriptional and translational level and that codon usage and the relative abundance of tRNA could drive each other in order to synergistically optimize the efficiency of gene expression.
Affiliation(s)
- Edoardo Trotta
- Institute of Translational Pharmacology, Consiglio Nazionale delle Ricerche, Roma, Italy.
|
23
|
Marhon SA, Kremer SC. Gene Prediction Based on DNA Spectral Analysis: A Literature Review. J Comput Biol 2011; 18:639-76. [PMID: 21381961 DOI: 10.1089/cmb.2010.0184] [Citation(s) in RCA: 52] [Impact Index Per Article: 4.0] [Indexed: 11/13/2022]
Affiliation(s)
- Sajid A. Marhon
- School of Computer Science, University of Guelph, Guelph, Ontario, Canada
- Stefan C. Kremer
- School of Computer Science, University of Guelph, Guelph, Ontario, Canada
|
24
|
Sahu SS, Panda G. Identification of protein-coding regions in DNA sequences using a time-frequency filtering approach. Genomics Proteomics Bioinformatics 2011; 9:45-55. [PMID: 21641562 PMCID: PMC5054166 DOI: 10.1016/s1672-0229(11)60007-7] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Received: 07/19/2010] [Accepted: 10/31/2010] [Indexed: 11/13/2022]
Abstract
Accurate identification of protein-coding regions (exons) in DNA sequences has been a challenging task in bioinformatics. Particularly the coding regions have a 3-base periodicity, which forms the basis of all exon identification methods. Many signal processing tools and techniques have been applied successfully for the identification task but still improvement in this direction is needed. In this paper, we have introduced a new promising model-independent time-frequency filtering technique based on S-transform for accurate identification of the coding regions. The S-transform is a powerful linear time-frequency representation useful for filtering in time-frequency domain. The potential of the proposed technique has been assessed through simulation study and the results obtained have been compared with the existing methods using standard datasets. The comparative study demonstrates that the proposed method outperforms its counterparts in identifying the coding regions.
Affiliation(s)
- Sitanshu Sekhar Sahu
- Department of Electronics and Communication Engineering, National Institute of Technology, Rourkela, India.
|
25
|
Xu S, Rao N, Chen X, Zhou B. Inferring an organism-specific optimal threshold for predicting protein coding regions in eukaryotes based on a bootstrapping algorithm. Biotechnol Lett 2011; 33:889-96. [DOI: 10.1007/s10529-011-0525-8] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 10/29/2010] [Accepted: 01/06/2011] [Indexed: 11/25/2022]
|
26
|
Wang L, Stein LD. Localizing triplet periodicity in DNA and cDNA sequences. BMC Bioinformatics 2010; 11:550. [PMID: 21059240 PMCID: PMC2992068 DOI: 10.1186/1471-2105-11-550] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.6] [Received: 06/03/2010] [Accepted: 11/08/2010] [Indexed: 01/23/2023]
Abstract
Background: The protein-coding regions (coding exons) of a DNA sequence exhibit a triplet periodicity (TP) due to the fact that coding exons contain a series of three-nucleotide codons that encode specific amino acid residues. Such periodicity is usually not observed in introns and intergenic regions. If a DNA sequence is divided into small segments and a Fourier Transform is applied on each segment, a strong peak at frequency 1/3 is typically observed in the Fourier spectrum of coding segments, but not in non-coding regions. This property has been used in identifying the locations of protein-coding genes in unannotated sequence. The method is fast and requires no training. However, the need to compute the Fourier Transform across a segment (window) of arbitrary size affects the accuracy with which one can localize TP boundaries. Here, we report a technique that provides higher-resolution identification of these boundaries, and use the technique to explore the biological correlates of TP regions in the genome of the model organism C. elegans. Results: Using both simulated TP signals and the real C. elegans sequence F56F11 as an example, we demonstrate that (1) the Modified Wavelet Transform (MWT) can define the boundary of a TP region better than the conventional Short Time Fourier Transform (STFT); (2) the scale parameter (a) of MWT determines the precision of TP boundary localization: bigger values of a give sharper TP boundaries but result in a lower signal to noise ratio; (3) RNA splicing sites have weaker TP signals than coding regions; (4) TP signals in coding regions can be destroyed or recovered by frame-shift mutations; (5) 6 bp periodicities in introns and intergenic regions can generate false positive signals, and these can be removed with 6 bp MWT. Conclusions: MWT can provide more precise TP boundaries than STFT and the boundaries can be further refined by bigger scale MWT. Subtraction of 6 bp periodicity signals reduces the number of false positives. Experimentally-introduced frame-shift mutations help recover TP signals that have been lost by possible ancient frame-shifts. More importantly, the TP signal has the potential to be used to detect the splice junctions in fully spliced mRNA sequence.
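The segment-wise Fourier analysis described in the Background can be sketched as follows. This is only a minimal illustration of the windowed baseline, not the authors' MWT technique. It assumes a binary indicator mapping for each of A, C, G and T (the abstract does not name a mapping), and the function names, window length and step size are hypothetical choices.

```python
import cmath


def period3_power(segment):
    """Sum, over the four bases, of |DFT coefficient at frequency 1/3|^2.

    Each base is mapped to a binary indicator sequence (an assumption of
    this sketch), and only the single frequency 1/3 is evaluated.
    """
    total = 0.0
    for base in "ACGT":
        coeff = sum(
            cmath.exp(-2j * cmath.pi * n / 3)
            for n, symbol in enumerate(segment)
            if symbol == base
        )
        total += abs(coeff) ** 2
    return total


def sliding_period3(sequence, window=351, step=3):
    """Return (start, power) pairs for each window position along the sequence."""
    return [
        (start, period3_power(sequence[start:start + window]))
        for start in range(0, len(sequence) - window + 1, step)
    ]
```

A segment built from a repeated codon scores high under this measure, while a homopolymer run whose length is a multiple of three scores close to zero. The window length controls the trade-off the abstract mentions: longer windows give a cleaner peak but blur the boundaries of the periodic region.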
Affiliation(s)
- Liya Wang
- Cold Spring Harbor Laboratory, Williams #5, Cold Spring Harbor, NY 11724, USA.
|
27
|
Hirayama S, Mizuta S. Significant deviations in the configurations of homologous tandem repeats in prokaryotic genomes. Genomics Proteomics Bioinformatics 2010; 7:163-74. [PMID: 20172489 PMCID: PMC5054416 DOI: 10.1016/s1672-0229(08)60046-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 12/01/2022]
Abstract
We explored the possibility of whole-genome duplication (WGD) in prokaryotic species by performing statistical analyses of the configurations of the central angles between homologous tandem repeats (TRs) on circular chromosomes. First, we detected TRs on the chromosomes and identified equivalent tandem repeat pairs (ETRPs); here, an ETRP is defined as a pair of tandem repeats whose sequences are similar to each other. Then we carried out statistical analyses of the central angle distributions of the detected ETRPs on each circular chromosome by comparing the detected distributions with those generated by null models. In the analyses, we estimated a P value by simulation, using the Kullback-Leibler divergence as a distance measure between two distributions. As a result, the central angle distributions for 8 out of the 203 prokaryotic species showed statistically significant deviations (P<0.05). In particular, we found the characteristic feature of one round of WGD in the Photorhabdus luminescens genome and that of two rounds of WGD in Escherichia coli K12.
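The statistical step mentioned above, a simulated P value with the Kullback-Leibler divergence as the distance between two distributions, can be sketched in a few lines. This is a generic illustration under stated assumptions: the abstract does not describe the binning of the central angles or the null models, so the inputs here are simply two discrete distributions over the same bins, and both function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in nats.

    p and q are discrete distributions over the same bins; q must be
    positive wherever p is positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def empirical_p_value(observed, null_values):
    """Fraction of simulated null statistics at least as large as the observed one."""
    return sum(value >= observed for value in null_values) / len(null_values)
```

In use, `observed` would be the divergence between the detected angle distribution and a reference, and `null_values` the same statistic computed on many distributions drawn from a null model.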
Affiliation(s)
- Shintaro Hirayama
- Graduate School of Science and Technology, Hirosaki University, Hirosaki, Aomori 036-8561, Japan
|
28
|
Sánchez R, Grau R. An algebraic hypothesis about the primeval genetic code architecture. Math Biosci 2009; 221:60-76. [PMID: 19607845 DOI: 10.1016/j.mbs.2009.07.001] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 06/23/2008] [Revised: 06/23/2009] [Accepted: 07/09/2009] [Indexed: 11/26/2022]
Abstract
A plausible architecture of an ancient genetic code is derived from an extended base triplet vector space over the Galois field of the extended base alphabet {D,A,C,G,U}, where symbol D represents one or more hypothetical bases with unspecific pairings. We hypothesized that the high degeneration of a primeval genetic code with five bases and the gradual origin and improvement of a primeval DNA repair system could make possible the transition from ancient to modern genetic codes. Our results suggest that the Watson-Crick base pairings G≡C and A=U and the non-specific base pairing of the hypothetical ancestral base D, used to define the sum and product operations, are enough features to determine the coding constraints of the primeval and the modern genetic code, as well as the transition from the former to the latter. Geometrical and algebraic properties of this vector space reveal that the present codon assignment of the standard genetic code could be induced from a primeval codon assignment. Besides, the Fourier spectrum of the extended DNA genome sequences derived from the multiple sequence alignment suggests that the so-called period-3 property of the present coding DNA sequences could also exist in the ancient coding DNA sequences. The phylogenetic analyses achieved with metrics defined in the N-dimensional vector space (B^3)^N of DNA sequences and with the new evolutionary model presented here also suggest that an ancient DNA coding sequence with five or more bases does not contradict the expected evolutionary history.
Affiliation(s)
- Robersy Sánchez
- Research Institute of Tropical Roots, Tuber Crops and Plantains (INIVIT), Biotechnology Group, Villa Clara, Cuba
|
29
|
Chirila TV, Minamisawa T, Keen I, Shiba K. Effect of Motif-Programmed Artificial Proteins on the Calcium Uptake in a Synthetic Hydrogel. Macromol Biosci 2009; 9:959-67. [DOI: 10.1002/mabi.200900096] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 11/09/2022]
|
30
|
An efficient sliding window strategy for accurate location of eukaryotic protein coding regions. Comput Biol Med 2009; 39:392-5. [DOI: 10.1016/j.compbiomed.2009.01.010] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.2] [Received: 11/07/2007] [Revised: 01/16/2009] [Accepted: 01/28/2009] [Indexed: 11/22/2022]
|
31
|
Chen K, Meng Q, Ma L, Liu Q, Tang P, Chiu C, Hu S, Yu J. A novel DNA sequence periodicity decodes nucleosome positioning. Nucleic Acids Res 2008; 36:6228-36. [PMID: 18829715 PMCID: PMC2577358 DOI: 10.1093/nar/gkn626] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.8] [Indexed: 11/15/2022]
Abstract
There have been two types of well-characterized DNA sequence periodicities; both are found to be associated with important molecular mechanisms. One is a 3-nt periodicity corresponding to codon triplets, the other is a 10.5-nt periodicity related to the structure of DNA helixes. In the process of analyzing the genome and transcriptome of Trichomonas vaginalis, we observed a 120.9-nt periodicity along DNA sequences. Different from the 3- and 10.5-nt periodicities, this novel periodicity originates near the 5′-end of transcripts, extends along the direction of transcription, and weakens gradually along transcripts. As a result, codon usage as well as amino acid composition is constrained by this periodicity. Similar periodicities were also identified in other organisms, but with variable length associated with the length of nucleosome units. We validated this association experimentally in T. vaginalis, and demonstrated that the periodicity manifests nucleotide variations between linker-DNA and wrapping-DNA along nucleosome array. We conclude that this novel DNA sequence periodicity is a signature of nucleosome organization suggesting that nucleosomes are well-positioned with regularity, especially near the 5′-end of transcripts.
Affiliation(s)
- Kaifu Chen
- Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Graduate University of Chinese Academy of Sciences, Beijing, China
|
32
|
Yin C, Yau SST. Prediction of protein coding regions by the 3-base periodicity analysis of a DNA sequence. J Theor Biol 2007; 247:687-94. [PMID: 17509616 DOI: 10.1016/j.jtbi.2007.03.038] [Citation(s) in RCA: 119] [Impact Index Per Article: 7.0] [Received: 10/22/2006] [Revised: 03/24/2007] [Accepted: 03/26/2007] [Indexed: 11/30/2022]
Abstract
With the exponential growth of genomic sequences, there is an increasing demand to accurately identify protein coding regions (exons) from genomic sequences. Despite much progress in the identification of protein coding regions by computational methods during the last two decades, the performance and efficiency of the prediction methods still need to be improved. In addition, it is indispensable to develop different prediction methods, since combining different methods may greatly improve the prediction accuracy. A new method to predict protein coding regions is developed in this paper based on the fact that most exon sequences have a 3-base periodicity, while intron sequences do not have this unique feature. The method computes the 3-base periodicity and the background noise of the stepwise DNA segments of the target DNA sequences using nucleotide distributions in the three codon positions of the DNA sequences. Exon and intron sequences can be identified from trends of the ratio of the 3-base periodicity to the background noise in the DNA sequences. Case studies on genes from different organisms show that this method is an effective approach for exon prediction.
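The link between nucleotide distributions in the three codon positions and the 3-base periodicity can be made concrete with a standard identity: for a binary indicator sequence whose base occurs f1, f2 and f3 times at codon positions 1, 2 and 3, the squared magnitude of the DFT coefficient at frequency 1/3 equals f1² + f2² + f3² − f1·f2 − f1·f3 − f2·f3. The sketch below uses that identity. It is not a reproduction of the paper's statistic: the background-noise term and the ratio used for prediction are not specified in the abstract, and the function names are hypothetical.

```python
def codon_position_counts(segment):
    """Count each base at the three codon positions (index modulo 3)."""
    counts = {base: [0, 0, 0] for base in "ACGT"}
    for index, symbol in enumerate(segment):
        if symbol in counts:
            counts[symbol][index % 3] += 1
    return counts


def period3_power_from_counts(segment):
    """Period-3 spectral power computed from codon-position counts alone.

    Uses |f1 + f2*w + f3*w^2|^2 = f1^2 + f2^2 + f3^2 - f1*f2 - f1*f3 - f2*f3
    with w = exp(-2*pi*i/3), summed over the four bases.
    """
    total = 0
    for f1, f2, f3 in codon_position_counts(segment).values():
        total += f1 * f1 + f2 * f2 + f3 * f3 - f1 * f2 - f1 * f3 - f2 * f3
    return total
```

Because only counts are needed, the measure can be updated incrementally as a segment grows, which is one reason a count-based formulation is convenient for stepwise segments.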
Affiliation(s)
- Changchuan Yin
- Department of Mathematics, Statistics and Computer Science, The University of Illinois at Chicago, M/C 249, Chicago, IL 60607-7045, USA
|
33
|
Larsabal E, Danchin A. Genomes are covered with ubiquitous 11 bp periodic patterns, the "class A flexible patterns". BMC Bioinformatics 2005; 6:206. [PMID: 16120222 PMCID: PMC1242344 DOI: 10.1186/1471-2105-6-206] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.5] [Received: 06/22/2005] [Accepted: 08/24/2005] [Indexed: 11/17/2022]
Abstract
Background: The genomes of prokaryotes and lower eukaryotes display a very strong 11 bp periodic bias in the distribution of their nucleotides. This bias is present throughout a given genome, both in coding and non-coding sequences. Until now this bias remained of unknown origin. Results: Using a technique for analysis of auto-correlations based on linear projection, we identified the sequences responsible for the bias. Prokaryotic and lower eukaryotic genomes are covered with ubiquitous patterns that we termed "class A flexible patterns". Each pattern is composed of up to ten conserved nucleotides or dinucleotides distributed into a discontinuous motif. Each occurrence spans a region up to 50 bp in length. They belong to what we named the "flexible pattern" type, in that there is some limited fluctuation in the distances between the nucleotides composing each occurrence of a given pattern. When taken together, these patterns cover up to half of the genome in the majority of prokaryotes. They generate the previously recognized 11 bp periodic bias. Conclusion: Judging from the structure of the patterns, we suggest that they may define a dense network of protein interaction sites in chromosomes.
Affiliation(s)
- Etienne Larsabal
- Unité de Génétique des Génomes Bactériens, Institut Pasteur, URA CNRS 2171, 28, rue du Docteur Roux, 75724 Paris Cedex 15, France
- Antoine Danchin
- Unité de Génétique des Génomes Bactériens, Institut Pasteur, URA CNRS 2171, 28, rue du Docteur Roux, 75724 Paris Cedex 15, France
|
34
|
Ruvinsky A, Eskesen ST, Eskesen FN, Hurst LD. Can codon usage bias explain intron phase distributions and exon symmetry? J Mol Evol 2005; 60:99-104. [PMID: 15696372 DOI: 10.1007/s00239-004-0032-9] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.2] [Received: 01/28/2004] [Accepted: 08/31/2004] [Indexed: 10/25/2022]
Abstract
More introns exist between codons (phase 0) than between the first and the second bases (phase 1) or between the second and the third base (phase 2) within the codon. Many explanations have been suggested for this excess of phase 0. It has, for example, been argued to reflect an ancient utility for introns in separating exons that code for separate protein modules. There may, however, be a simple, alternative explanation. Introns typically require, for correct splicing, particular nucleotides immediately 5' in exons (typically a G) and immediately 3' in the following exon (also often a G). Introns therefore tend to be found between particular nucleotide pairs (e.g., G|G pairs) in the coding sequence. If, owing to bias in usage of different codons, these pairs are especially common at phase 0, then intron phase biases may have a trivial explanation. Here we take codon usage frequencies for a variety of eukaryotes and use these to generate random sequences. We then ask about the phase of putative intron insertion sites. Importantly, in all simulated data sets intron phase distribution is biased in favor of phase 0. In many cases the bias is of the magnitude observed in real data and can be attributed to codon usage bias. It is also known that exons may carry either the same phase (symmetric) or different phases (asymmetric) at the opposite ends. We simulated a distribution of different types of exons using frequencies of introns observed in real genes assuming random combination of intron phases at the opposite sides of exons. Surprisingly the simulated pattern was quite similar to that observed. In the simulants we typically observe a prevalence of symmetric exons carrying phase 0 at both ends, which is common for eukaryotic genes. However, at least in some species, the extent of the bias in favor of symmetric (0,0) exons is not as great in simulants as in real genes. 
These results emphasize the need to construct a biologically relevant null model of successful intron insertion.
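The simulation idea in this abstract (generate random coding sequences from codon usage frequencies, then ask at which phase putative insertion sites fall) can be sketched as below. This is an illustrative toy, not the authors' procedure: it treats only G|G pairs as putative sites, following the example given in the abstract, and the codon weights passed in are hypothetical placeholders rather than real usage tables.

```python
import random


def sample_coding_sequence(codon_weights, n_codons, seed=0):
    """Draw codons independently according to their usage weights."""
    rng = random.Random(seed)
    codons = list(codon_weights)
    weights = [codon_weights[codon] for codon in codons]
    return "".join(rng.choices(codons, weights=weights, k=n_codons))


def gg_phase_counts(sequence):
    """Count G|G sites by phase.

    Phase 0: the site lies between two codons.
    Phase 1: between the first and second base of a codon.
    Phase 2: between the second and third base of a codon.
    """
    counts = [0, 0, 0]
    for i in range(len(sequence) - 1):
        if sequence[i] == "G" and sequence[i + 1] == "G":
            counts[(i + 1) % 3] += 1
    return counts
```

With a usage table in which G is common at the third and first codon positions, the phase 0 count dominates, which is the kind of bias the abstract attributes to codon usage.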
Affiliation(s)
- A Ruvinsky
- Institute for Genetics and Bioinformatics, University of New England, Armidale 2351, NSW, Australia.
|
35
|
Nikolaou C, Almirantis Y. Mutually symmetric and complementary triplets: differences in their use distinguish systematically between coding and non-coding genomic sequences. J Theor Biol 2003; 223:477-87. [PMID: 12875825 DOI: 10.1016/s0022-5193(03)00123-1] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.7] [Indexed: 11/21/2022]
Abstract
The general property of asymmetry in word use in meaningful texts written in a variety of languages motivates a quantification of the differences in the use of mutually symmetric triplets in genomic sequences. When this is done in the three reading frames, high values found for one of them are used as an indication that the sequence is coding for a protein. Moreover, a similar quantification of the differences in the use of complementary triplets is introduced, again with predictive power for the coding character of a sequence. This method reflects the non-equivalence between the sense and anti-sense strands of a coding segment. In both approaches, "linguistic asymmetry" in coding sequences is related to the form of the genetic code and to the bias in codon usage and amino acid use skews.
Affiliation(s)
- Christoforos Nikolaou
- National Research Center for Physical Sciences Demokritos, Institute of Biology, 15310 Athens, Greece
|
36
|
Kotlar D, Lavner Y. Gene prediction by spectral rotation measure: a new method for identifying protein-coding regions. Genome Res 2003; 13:1930-7. [PMID: 12869578 PMCID: PMC403785 DOI: 10.1101/gr.1261703] [Citation(s) in RCA: 72] [Impact Index Per Article: 3.4] [Received: 02/12/2003] [Accepted: 05/21/2003] [Indexed: 11/24/2022]
Abstract
A new measure for gene prediction in eukaryotes is presented. The measure is based on the Discrete Fourier Transform (DFT) phase at a frequency of 1/3, computed for the four binary sequences for A, T, C, and G. Analysis of all the experimental genes of S. cerevisiae revealed distribution of the phase in a bell-like curve around a central value, in all four nucleotides, whereas the distribution of the phase in the noncoding regions was found to be close to uniform. Similar findings were obtained for other organisms. Several measures based on the phase property are proposed. The measures are computed by clockwise rotation of the vectors, obtained by DFT for each analysis frame, by an angle equal to the corresponding central value. In protein coding regions, this rotation is assumed to closely align all vectors in the complex plane, thereby amplifying the magnitude of the vector sum. In noncoding regions, this operation does not significantly change this magnitude. Computing the measures with one chromosome and applying them on sequences of others reveals improved performance compared with other algorithms that use the 1/3 frequency feature, especially in short exons. The phase property is also used to find the reading frame of the sequence.
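The rotation step described above can be sketched as follows. This is a simplified reading of the abstract, not the published measures: it computes the DFT coefficient at frequency 1/3 for each of the four binary sequences, rotates each vector clockwise by a supplied central phase, and returns the magnitude of the vector sum. The function names are hypothetical, and the `central_phase` values are inputs that would have to be estimated from known genes.

```python
import cmath


def dft_one_third(segment, base):
    """DFT coefficient at frequency 1/3 of the binary indicator sequence for one base."""
    return sum(
        cmath.exp(-2j * cmath.pi * n / 3)
        for n, symbol in enumerate(segment)
        if symbol == base
    )


def rotated_sum_magnitude(segment, central_phase):
    """Rotate each base's vector clockwise by its central phase, then take |sum|.

    central_phase maps each of "A", "C", "G", "T" to an angle in radians.
    """
    total = 0j
    for base in "ACGT":
        rotation = cmath.exp(-1j * central_phase[base])
        total += dft_one_third(segment, base) * rotation
    return abs(total)
```

When the supplied central phases match the actual phases of a segment, the four vectors align and their magnitudes add; with mismatched phases they partly cancel, which is the contrast between coding and noncoding regions that the abstract describes.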
Affiliation(s)
- Daniel Kotlar
- Department of Computer Science, Tel-Hai Academic College, Upper Galilee 12210, Israel
|
37
|
Fukushima A, Ikemura T, Kinouchi M, Oshima T, Kudo Y, Mori H, Kanaya S. Periodicity in prokaryotic and eukaryotic genomes identified by power spectrum analysis. Gene 2002; 300:203-11. [PMID: 12468102 DOI: 10.1016/s0378-1119(02)00850-8] [Citation(s) in RCA: 48] [Impact Index Per Article: 2.2] [Indexed: 11/16/2022]
Abstract
We used a power spectrum method to identify periodic patterns in nucleotide sequence, and characterized nucleotide sequences that confer periodicities to prokaryotic and eukaryotic genomes. A 10-bp periodicity was prevalent in hyperthermophilic bacteria and archaebacteria, and an 11-bp periodicity was prevalent in eubacteria. The 10-bp periodicity was also prevalent in eukaryotes such as the worm Caenorhabditis elegans. Additionally, in the worm genome, a 68-bp periodicity in chromosome I, a 59-bp periodicity in chromosome II, and a 94-bp periodicity in chromosome III were found. In human chromosomes 21 and 22, approximately 167- or 84-bp periodicity was detected along the entire length of these chromosomes. Because 167 bp is identical to the length of DNA that forms two complete helical turns in nucleosome organization, we speculated that the respective sequences may correspond to arrays of a special compact form of nucleosomes clustered in specific regions of the human chromosomes. This periodic element contained a high frequency of TGG. TGG-rich sequences are known to form a specific subset of folded DNA structures, and therefore, the sequences might have potential to form specific higher order structures related to the clustered occurrence of a specific form of the speculated nucleosomes.
Affiliation(s)
- Atsushi Fukushima
- Graduate School of Biological Sciences, Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, Nara 630-0101, Japan
|
38
|
Abstract
Two different views have been proposed for origins of genes (or proteins). One is that primordial genes evolved from random sequences. This view underlies the concept of modern in vitro evolution experiments that functional molecules (even proteins) evolved from random sequence-libraries. On the contrary, the second view holds that "random sequences" would be an unusual state in which to find RNA or DNA, because it is their inherent nature to yield periodic structures during the course of semi-conservative replication. In this second view, the periodicity of DNA (or RNA) is responsible for emergence of primordial genes. Although recent reports on the variety of periodicities present in proteins, genes and genomes are consistent with the second view, it has yet to be experimentally tested. We assessed the significance of periodicities of DNA in the origin of genes by constructing such periodic DNAs. The results showed that periodic DNA produced ordered proteins at very high rates, which is in contrast to the fact that proteins with random sequences lack secondary structures. We concluded that periodicity played a pivotal role in the origin of many genes. The observation should pave the way for new experimental evolution systems for proteins.
Affiliation(s)
- Kiyotaka Shiba
- Department of Protein Engineering, Cancer Institute, Japanese Foundation for Cancer Research, Toshima, Tokyo 170-8455, Japan.
|
39
|
Wang Y, Zhang CT, Dong P. Recognizing shorter coding regions of human genes based on the statistics of stop codons. Biopolymers 2002; 63:207-16. [PMID: 11787008 DOI: 10.1002/bip.10054] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.6] [Indexed: 11/09/2022]
Abstract
With the quick progress of the Human Genome Project, a great amount of uncharacterized DNA sequences needs to be annotated copiously by better algorithms. Recognizing shorter coding sequences of human genes is one of the most important problems in gene recognition, which is not yet completely solved. This paper is devoted to solving the issue using a new method. The distributions of the three stop codons, i.e., TAA, TAG and TGA, in three phases along coding, noncoding, and intergenic sequences are studied in detail. Using the obtained distributions and other coding measures, a new algorithm for the recognition of shorter coding sequences of human genes is developed. The accuracy of the algorithm is tested based on a larger database of human genes. It is found that the average accuracy achieved is as high as 92.1% for the sequences with length of 192 base pairs, which is confirmed by sixfold cross-validation tests. It is hoped that by incorporating the present method with some existing algorithms, the accuracy for identifying human genes from unannotated sequences would be increased.
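As an illustration of the kind of statistic this abstract describes, the following minimal Python sketch counts the three stop codons in each of the three phases of a single strand. It is not the authors' algorithm: the function name `stop_codon_counts_by_phase` and the choice to scan only the given strand are assumptions made for the example.

```python
from collections import Counter

# The three stop codons named in the abstract.
STOP_CODONS = ("TAA", "TAG", "TGA")


def stop_codon_counts_by_phase(seq):
    """Count each stop codon in phases 0, 1 and 2 of the given strand.

    Hypothetical helper for illustration only; the published method
    combines such distributions with other coding measures.
    """
    seq = seq.upper()
    counts = {phase: Counter() for phase in range(3)}
    for phase in range(3):
        # Step through non-overlapping triplets starting at this offset.
        for i in range(phase, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOP_CODONS:
                counts[phase][codon] += 1
    return counts


if __name__ == "__main__":
    result = stop_codon_counts_by_phase("ATGTAAGGCTGATAG")
    for phase, counter in result.items():
        print(phase, dict(counter))
```

A phase that is free of stop codons over a long window is a candidate reading frame, which is presumably why the per-phase distributions are informative for short sequences.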
Affiliation(s)
- Yonghong Wang
- Department of Physics, Tianjin University, Tianjin, 300072, China
|
40
|
Janssen CS, Barrett MP, Lawson D, Quail MA, Harris D, Bowman S, Phillips RS, Turner CM. Gene discovery in Plasmodium chabaudi by genome survey sequencing. Mol Biochem Parasitol 2001; 113:251-60. [PMID: 11295179 DOI: 10.1016/s0166-6851(01)00224-9] [Citation(s) in RCA: 21] [Impact Index Per Article: 0.9] [Indexed: 11/23/2022]
Abstract
The first genome survey sequencing of the rodent malaria parasite Plasmodium chabaudi is presented here. In 766 sequences, 131 putative gene sequences have been identified by sequence similarity database searches. Further, 7 potential gene families, four of which have not previously been described, were discovered. These genes may be important in understanding the biology of malaria, as well as offering potential new drug targets. We have also identified a number of candidate minisatellite sequences that could be helpful in genetic studies. Genome survey sequencing in P. chabaudi is a productive strategy in further developing this in vivo model of malaria, in the context of the malaria genome projects.
Affiliation(s)
- C S Janssen
- Division of Infection & Immunity, IBLS, University of Glasgow, G12 8QQ, Glasgow, UK.
|
41
|
Kawashima T, Amano N, Koike H, Makino S, Higuchi S, Kawashima-Ohya Y, Watanabe K, Yamazaki M, Kanehori K, Kawamoto T, Nunoshiba T, Yamamoto Y, Aramaki H, Makino K, Suzuki M. Archaeal adaptation to higher temperatures revealed by genomic sequence of Thermoplasma volcanium. Proc Natl Acad Sci U S A 2000; 97:14257-62. [PMID: 11121031 PMCID: PMC18905 DOI: 10.1073/pnas.97.26.14257] [Citation(s) in RCA: 150] [Impact Index Per Article: 6.3] [Indexed: 11/18/2022]
Abstract
The complete genomic sequence of the archaeon Thermoplasma volcanium, possessing optimum growth temperature (OGT) of 60 °C, is reported. By systematically comparing this genomic sequence with the other known genomic sequences of archaea, all possessing higher OGT, a number of strong correlations have been identified between characteristics of genomic organization and the OGT. With increasing OGT, in the genomic DNA, frequency of clustering purines and pyrimidines into separate dinucleotides rises (e.g., by often forming AA and TT, whereas avoiding TA and AT). Proteins coded in a genome are divided into two distinct subpopulations possessing isoelectric points in different ranges (i.e., acidic and basic), and with increasing OGT the size of the basic subpopulation becomes larger. At the metabolic level, genes coding for enzymes mediating pathways for synthesizing some coenzymes, such as heme, start missing. These findings provide insights into the design of individual genomic components, as well as principles for coordinating changes in these designs for the adaptation to new environments.
Affiliation(s)
- T Kawashima
- National Institute of Bioscience and Human Technology, Core Research for Evolutional Science and Technology Centre of Structural Biology, 1-1 Higashi, Tsukuba 305-0046, Japan
|
42
|
Jackson JH, George R, Herring PA. Vectors of shannon information from fourier signals characterizing base periodicity in genes and genomes. Biochem Biophys Res Commun 2000; 268:289-92. [PMID: 10679195 DOI: 10.1006/bbrc.2000.2112] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.6] [Indexed: 11/22/2022]
Abstract
Equal Symbol Fourier Transforms (FTES), characterizing nucleotide periodicity, comprise components of 5-D vectors that define base-repeat properties of a genomic sequence. This report describes a conversion of the FTES signals to a common platform of Shannon information content to facilitate comparisons of periodic data with other measures of information for genes and genomes. The autocorrelation used to compute the discrete FTES formed the basis to define repeating bases in terms of conditional probabilities. We derived a vector equation to express the Shannon information content of a sequence in a way that preserves the distinct specificity of base repeat patterns characterized by FTES vectors. We suggest application of such information vectors to study the structure of information in genes, chromosomes, and genomes by χ² comparisons.
Affiliation(s)
- J H Jackson
- Theoretical and Computational Biology Group, Michigan State University, USA.
|
43
|
Abstract
We present a new approach to DNA segmentation into compositionally homogeneous blocks. The Bayesian estimator, which is applicable for both short and long segments, is used to obtain the measure of homogeneity. An exact optimal segmentation is found via the dynamic programming technique. After completion of the segmentation procedure, the sequence composition on different scales can be analyzed with filtration of boundaries via the partition function approach.
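The dynamic programming step mentioned in this abstract can be sketched in Python as follows. This is a simplified stand-in, not the published method: the Bayesian homogeneity measure is replaced here by a plain maximum-likelihood composition cost, and the per-block `penalty` parameter and the function names are assumptions introduced for the example.

```python
import math
from collections import Counter


def segment_cost(counts, length):
    """Negative log-likelihood of a block under its own ML composition.

    Assumption: used in place of the Bayesian estimator from the abstract.
    """
    return -sum(c * math.log(c / length) for c in counts.values() if c > 0)


def segment(seq, penalty=1.0):
    """Return the interior boundaries of the minimum-cost segmentation.

    best[end] holds the lowest total cost of segmenting seq[:end], where
    each block pays its composition cost plus a fixed penalty.
    """
    n = len(seq)
    best = [0.0] + [math.inf] * n
    prev = [0] * (n + 1)
    for end in range(1, n + 1):
        counts = Counter()
        # Grow the candidate last block leftwards, updating counts as we go.
        for start in range(end - 1, -1, -1):
            counts[seq[start]] += 1
            cost = best[start] + segment_cost(counts, end - start) + penalty
            if cost < best[end]:
                best[end], prev[end] = cost, start
    # Walk the back-pointers to recover the block boundaries.
    boundaries = []
    pos = n
    while pos > 0:
        pos = prev[pos]
        if pos > 0:
            boundaries.append(pos)
    return boundaries[::-1]


if __name__ == "__main__":
    print(segment("AAAAAAAACCCCCCCC"))
```

The search is exact for the chosen cost because every possible last block is tried for every prefix; the running time grows quadratically with sequence length, so this form is only practical for short sequences.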
Affiliation(s)
- V E Ramensky
- Engelhardt Institute of Molecular Biology, Vavilova, Russia.
|
44
|
Nishizawa K, Nishizawa M, Kim KS. Tendency for local repetitiveness in amino acid usages in modern proteins. J Mol Biol 1999; 294:937-53. [PMID: 10588898 DOI: 10.1006/jmbi.1999.3275] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.4] [Indexed: 11/22/2022]
Abstract
Systematic analyses of human proteins show that neural and immune system-specific, and therefore relatively "modern", proteins have a tendency for repetitive use of amino acids at a local scale (approximately 1-20 residues), while ancient proteins (human homologues of Escherichia coli proteins) do not. Those protein subsegments which are unique based on homology search account for the repetitiveness. Simulation shows that such repetitiveness can be maintained by frequent duplication on a very short scale (one to two codons) in the presence of substitutive point mutation, while the latter tends to mitigate the repetitiveness. DNA analyses also show the presence of cryptic (i.e. "out of the codon frame") repetitiveness, which cannot fully be explained by features in protein sequences. Simulative modification of the amino acid sequences of immune system-specific proteins estimates that 2.4 duplication events occur during the period equivalent to ten events of substitution mutation. It is also suggested that the repetitiveness leads to longitudinal unevenness within a given peptide domain. Those peptide motifs which contain similarly charged residues are likely to be generated more frequently in the presence of the tendency for repetitiveness than in its absence. Therefore, the neutral propensity of DNA for duplication, which can also tend to generate repetitiveness in amino acid sequences, seems to be manifested primarily when the constraints on amino acid sequences are relatively weak, and yet may be positively contributing to generation of unevenness in modern proteins.
Affiliation(s)
- K Nishizawa
- Department of Biochemistry, Teikyo University School of Medicine, Kaga, Itabashi, Tokyo, 173, Japan.
|
45
|
Tatarenkov A, Sáez AG, Ayala FJ. A compact gene cluster in Drosophila: the unrelated Cs gene is compressed between duplicated amd and Ddc. Gene 1999; 231:111-20. [PMID: 10231575 DOI: 10.1016/s0378-1119(99)00096-7] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.2] [Indexed: 11/22/2022]
Abstract
Cs, a gene with unknown function, and amd and Ddc, which encode decarboxylases, are among the most closely spaced genes in D. melanogaster. Untranslated 3' ends of the convergently transcribed genes Cs and Ddc are known to overlap by 88 bp. A number of questions arise about the organization of this tightly-packed gene region and about the evolution and function of the Cs gene. We have now investigated this three-gene cluster in Scaptodrosophila lebanonensis (which diverged from D. melanogaster 60-65 MYA), as well as in D. melanogaster and D. simulans. Gene order and direction of transcription are the same in all three species. The Cs gene codes, in Scaptodrosophila, for a polypeptide of 544 amino acids; in D. melanogaster, it consists of 504 amino acids, which is twice as long as previously suggested and makes the gene density even more spectacular. The Cs sequences exhibit a higher number of non-synonymous substitutions between species, higher ratios of non-synonymous to synonymous substitutions, and lower codon usage bias than other genes, suggesting that Cs is less functionally constrained than the other genes. This is consistent with the failure to induce phenotypic mutations in D. melanogaster. The function of Cs remains to be identified, but a high degree of similarity indicates that it is homologous to genes coding for a corticosteroid-binding protein in yeast and a polyamine oxidase in maize.
Affiliation(s)
- A Tatarenkov
- Department of Ecology and Evolutionary Biology, University of California, Irvine, CA 92697-2525, USA.
|
46
|
Suckow JM, Amano N, Ohfuku Y, Kakinuma J, Koike H, Suzuki M. A transcription frame-based analysis of the genomic DNA sequence of a hyper-thermophilic archaeon for the identification of genes, pseudo-genes and operon structures. FEBS Lett 1998; 426:86-92. [PMID: 9598984 DOI: 10.1016/s0014-5793(98)00323-8] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.3] [Indexed: 02/07/2023]
Abstract
An algorithm for identifying transcription units, independently regulated genes and operons, and pseudo-genes that are not expected to be expressed, has been developed by combining a system for predicting transcription and translation signals, and a system for scoring the triplet periodicity in ORF candidates. By using the algorithm, the 1.09 Mb sequence that covers approximately 60% of the genome of Pyrococcus sp. OT3 has been analyzed. The identified ORFs show the expected biological and physical characteristics, while the rejected ORF candidates do not. Frequent use of operon structures for transcription, and gene duplication followed by mutation or termination of the duplicated genes, are discussed.
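One common way to score triplet periodicity in an ORF candidate is the spectral power of the four base-indicator sequences at frequency 1/3. The Python sketch below shows that generic measure under stated assumptions; the abstract does not specify the scoring system actually used, so the function name `triplet_periodicity_score` and the normalisation by the squared length are illustrative choices only.

```python
import cmath


def triplet_periodicity_score(seq):
    """Normalised spectral power at period 3, summed over the four bases.

    Hypothetical scoring function for illustration; not taken from the paper.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0
    total = 0.0
    for base in "ACGT":
        # Fourier component of the base's 0/1 indicator sequence at frequency 1/3.
        component = sum(
            cmath.exp(-2j * cmath.pi * pos / 3)
            for pos, symbol in enumerate(seq)
            if symbol == base
        )
        total += abs(component) ** 2
    # Dividing by n squared makes scores comparable across lengths.
    return total / (n * n)


if __name__ == "__main__":
    print(triplet_periodicity_score("ATG" * 10))
    print(triplet_periodicity_score("A" * 30))
```

A sequence whose base usage depends strongly on codon position scores high, while a homopolymer or a sequence with position-independent composition scores close to zero.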
Affiliation(s)
- J M Suckow
- AIST-NIBHT CREST Centre of Structural Biology, Higashi, Tsukuba, Japan
|
47
|
Almirantis Y, Provata A. The "clustered structure" of the purines/pyrimidines distribution in DNA distinguishes systematically between coding and non-coding sequences. Bull Math Biol 1997; 59:975-92. [PMID: 9281907 DOI: 10.1007/bf02460002] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.4] [Indexed: 02/05/2023]
Abstract
A method for measuring the inhomogeneous distribution of purines/pyrimidines in nucleotide sequences is developed. We show that this measure relates to the coding or non-coding character of the considered sequence. Coding sequences present a near-random Pu or Py distribution. This property is shared by both protein-coding DNA and functional RNA-coding DNA. Non-coding sequences present a highly clustered inhomogeneity. We propose the hypothesis, corroborated by appropriate computer simulations, that this is due to the action of various transposition events accumulated over long time periods.
Affiliation(s)
- Y Almirantis
- Institute of Biology, National Research Center for Physical Sciences Demokritos, Athens, Greece
|
48
|
Tsonis AA, Kumar P, Elsner JB, Tsonis PA. Wavelet analysis of DNA sequences. PHYSICAL REVIEW. E, STATISTICAL PHYSICS, PLASMAS, FLUIDS, AND RELATED INTERDISCIPLINARY TOPICS 1996; 53:1828-1834. [PMID: 9964445 DOI: 10.1103/physreve.53.1828] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.0] [Indexed: 11/07/2022]
|