1
|
Arruda M, da Silva A, de Assis F. An Adaptive Mapping Method Using Spectral Envelope Approach for DNA Spectral Analysis. ENTROPY 2022; 24:e24070978. [PMID: 35885202 PMCID: PMC9323741 DOI: 10.3390/e24070978] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2022] [Revised: 07/07/2022] [Accepted: 07/12/2022] [Indexed: 11/16/2022]
Abstract
Digital signal processing approaches have been investigated as a preliminary indicator for discriminating between the protein-coding and non-coding regions of DNA. This is because a three-base periodicity (TBP), arising from the length of codons (three nucleotides), has been shown to exist in protein-coding regions. As a consequence, there is a prominent peak in the energy spectrum of a DNA coding sequence at frequency 1/3 rad/sample. However, because DNA sequences are symbolic sequences, they must be mapped into one or more signals such that the hidden information is highlighted. We therefore propose two new algorithms for computing adaptive mappings and, by using them, finding periodicities. Both algorithms are based on the spectral envelope approach. This adaptive approach is important since a single mapping applied to any DNA sequence may ignore its intrinsic properties. Finally, the improved performance of the new methods is verified on synthetic and real DNA sequences by comparison with classical methods, especially the minimum entropy mapping (MEM) spectrum, which is also an adaptive method. We demonstrated that our method is both more accurate and more responsive than all its counterparts. This is especially important in this application since it reduces the risk of a coding sequence being missed.
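The period-3 peak mentioned in this abstract can be illustrated with a minimal Python sketch. It does not implement the adaptive spectral envelope mappings the authors propose; it assumes the classical fixed mapping of one binary indicator sequence per base, and the function name `period3_power` is hypothetical.

```python
import cmath


def period3_power(seq: str) -> float:
    """Spectral power at period 3, summed over four binary indicator sequences.

    Assumption: each base (A, C, G, T) is mapped to a 0/1 indicator signal,
    a fixed classical mapping, not the adaptive mapping of the paper.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must not be empty")
    total = 0.0
    for base in "ACGT":
        acc = 0j
        for pos, ch in enumerate(seq):
            if ch == base:
                # DFT term evaluated at one third of a cycle per sample.
                acc += cmath.exp(-2j * cmath.pi * pos / 3)
        total += abs(acc) ** 2
    return total / n


if __name__ == "__main__":
    print(period3_power("ATG" * 30))   # strongly period-3 sequence
    print(period3_power("ACGT" * 30))  # period-4 sequence, no period-3 component
```

A sequence built from a repeated codon gives a large value, while a sequence with period four gives a value close to zero, which is the contrast that coding-region detectors exploit.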
|
2
|
Di Gioacchino A, Šulc P, Komarova AV, Greenbaum BD, Monasson R, Cocco S. The Heterogeneous Landscape and Early Evolution of Pathogen-Associated CpG Dinucleotides in SARS-CoV-2. Mol Biol Evol 2021; 38:2428-2445. [PMID: 33555346 PMCID: PMC7928797 DOI: 10.1093/molbev/msab036] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Indexed: 02/06/2023] Open
Abstract
COVID-19 can lead to acute respiratory syndrome, which can be due to dysregulated immune signaling. We analyze the distribution of CpG dinucleotides, a pathogen-associated molecular pattern, in the SARS-CoV-2 genome. We characterize CpG content by a CpG force that accounts for statistical constraints acting on the genome at the nucleotidic and amino acid levels. The CpG force, as the CpG content, is overall low compared with other pathogenic betacoronaviruses; however, it widely fluctuates along the genome, with a particularly low value, comparable with the circulating seasonal HKU1, in the spike coding region and a greater value, comparable with SARS and MERS, in the highly expressed nucleocapsid coding region (N ORF), whose transcripts are relatively abundant in the cytoplasm of infected cells and present in the 3'UTRs of all subgenomic RNA. This dual nature of CpG content could confer to SARS-CoV-2 the ability to avoid triggering pattern recognition receptors upon entry, while eliciting a stronger response during replication. We then investigate the evolution of synonymous mutations since the outbreak of the COVID-19 pandemic, finding a signature of CpG loss in regions with a greater CpG force. Sequence motifs preceding the CpG-loss-associated loci in the N ORF match recently identified binding patterns of the zinc finger antiviral protein. Using a model of the viral gene evolution under human host pressure, we find that synonymous mutations seem driven in the SARS-CoV-2 genome, and particularly in the N ORF, by the viral codon bias, the transition-transversion bias, and the pressure to lower CpG content.
Affiliation(s)
- Andrea Di Gioacchino
  - Laboratoire de Physique de l’Ecole Normale Supérieure, PSL & CNRS UMR8063, Sorbonne Université, Université de Paris, Paris, France
- Petr Šulc
  - School of Molecular Sciences and Center for Molecular Design and Biomimetics, The Biodesign Institute, Arizona State University, Tempe, AZ, USA
- Anastassia V Komarova
  - Molecular Genetics of RNA Viruses, Department of Virology, Institut Pasteur, CNRS UMR-3569, Paris, France
- Benjamin D Greenbaum
  - Computational Oncology, Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Rémi Monasson
  - Laboratoire de Physique de l’Ecole Normale Supérieure, PSL & CNRS UMR8063, Sorbonne Université, Université de Paris, Paris, France
- Simona Cocco
  - Laboratoire de Physique de l’Ecole Normale Supérieure, PSL & CNRS UMR8063, Sorbonne Université, Université de Paris, Paris, France
|
3
|
Abstract
Periodic occurrences of oligonucleotide sequences can impact the physical properties of DNA. For example, DNA bendability is modulated by 10-bp periodic occurrences of WW (W = A/T) dinucleotides. We present periodicDNA, an R package to identify k-mer periodicity and generate continuous tracks of k-mer periodicity over genomic loci of interest, such as regulatory elements. periodicDNA will facilitate investigation and improve understanding of how periodic DNA sequence features impact function.
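The idea of a 10-bp periodicity of WW dinucleotides can be sketched by counting the distances between pairs of WW occurrences. The Python sketch below is an illustration only: it is not the periodicDNA R package and does not reproduce its API, and the function name `ww_distance_counts` and the `max_dist` parameter are hypothetical.

```python
from collections import Counter


def ww_distance_counts(seq: str, max_dist: int = 30) -> Counter:
    """Histogram of distances between pairs of WW (W = A/T) dinucleotides.

    Only distances up to max_dist are counted. A periodic signal shows up
    as excess counts at multiples of the period.
    """
    seq = seq.upper()
    # Start positions of every dinucleotide made of two A/T bases.
    positions = [
        i for i in range(len(seq) - 1)
        if seq[i] in "AT" and seq[i + 1] in "AT"
    ]
    counts: Counter = Counter()
    for idx, p in enumerate(positions):
        for q in positions[idx + 1:]:
            d = q - p
            if d > max_dist:
                break  # positions are sorted, so later ones are farther away
            counts[d] += 1
    return counts


if __name__ == "__main__":
    toy = ("AA" + "GCGCGCGC") * 5  # one WW every 10 bp
    print(sorted(ww_distance_counts(toy).items()))
```

On the toy sequence, counts appear only at distances 10, 20 and 30. On real sequences, the distance histogram would typically be normalized and passed to a spectral method to quantify the strength of the periodicity.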
Affiliation(s)
- Jacques Serizay
  - The Gurdon Institute and Department of Genetics, University of Cambridge, Cambridge, CB2 1QN, UK
- Julie Ahringer
  - The Gurdon Institute and Department of Genetics, University of Cambridge, Cambridge, CB2 1QN, UK
|
4
|
Di Gioacchino A, Šulc P, Komarova AV, Greenbaum BD, Monasson R, Cocco S. The heterogeneous landscape and early evolution of pathogen-associated CpG dinucleotides in SARS-CoV-2. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2020. [PMID: 32511407 DOI: 10.1101/2020.05.06.074039] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/23/2022]
Abstract
COVID-19 can lead to acute respiratory syndrome, which can be due to dysregulated immune signaling. We analyze the distribution of CpG dinucleotides, a pathogen-associated molecular pattern, in the SARS-CoV-2 genome. We find that the CpG content, which we characterize by a force parameter that accounts for statistical constraints acting on the genome at the nucleotidic and amino-acid levels, is, on average, low compared to other pathogenic betacoronaviruses. However, the CpG force widely fluctuates along the genome, with a particularly low value, comparable to the circulating seasonal HKU1, in the spike coding region and a greater value, comparable to SARS and MERS, in the highly expressed nucleocapsid coding region (N ORF), whose transcripts are relatively abundant in the cytoplasm of infected cells and present in the 3'UTRs of all subgenomic RNA. This dual nature of CpG content could confer to SARS-CoV-2 the ability to avoid triggering pattern recognition receptors upon entry, while eliciting a stronger response during replication. We then investigate the evolution of synonymous mutations since the outbreak of the COVID-19 pandemic, finding a signature of CpG loss in regions with a greater CpG force. Sequence motifs preceding the CpG-loss-associated loci in the N ORF match recently identified binding patterns of the Zinc finger Anti-viral Protein. Using a model of the viral gene evolution under human host pressure, we find that synonymous mutations seem driven in the SARS-CoV-2 genome, and particularly in the N ORF, by the viral codon bias, the transition-transversion bias and the pressure to lower CpG content.
|
5
|
Di Gioacchino A, Šulc P, Komarova AV, Greenbaum BD, Monasson R, Cocco S. The Heterogeneous Landscape and Early Evolution of Pathogen-Associated CpG Dinucleotides in SARS-CoV-2. SSRN 2020:3611280. [PMID: 32714120 PMCID: PMC7366803 DOI: 10.2139/ssrn.3611280] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 05/26/2020] [Revised: 05/27/2020] [Indexed: 11/15/2022]
Abstract
SARS-CoV-2 infection can lead to acute respiratory syndrome in patients, which can be due in part to dysregulated immune signalling. We analyze here the occurrences of CpG dinucleotides, which are putative pathogen-associated molecular patterns, along the viral sequence. Carrying out a comparative analysis with other ssRNA viruses and within the Coronaviridae family, we find that the CpG content of SARS-CoV-2, while low compared to other betacoronaviruses, widely fluctuates along its primary sequence. While the CpG relative abundance and its associated CpG force parameter are low for the spike protein (S) and comparable to circulating seasonal coronaviruses such as HKU1, they are much greater and comparable to SARS and MERS for the 3'-end of the viral genome. In particular, the nucleocapsid protein (N), whose transcripts are relatively abundant in the cytoplasm of infected cells and present in the 3'UTRs of all subgenomic RNA, has high CpG content. We speculate that this dual nature of CpG content can confer to SARS-CoV-2 a high ability to both enter the host and trigger pattern recognition receptors (PRRs) in different contexts. We then investigate the evolution of synonymous mutations since the outbreak of the COVID-19 pandemic. Using a new application of selective forces on dinucleotides to estimate context-driven mutational processes, we find that synonymous mutations seem driven both by the viral codon bias and by the high value of the CpG force in the N protein, leading to a loss in CpG content. Sequence motifs preceding these CpG-loss-associated loci match recently identified binding patterns of the Zinc Finger anti-viral Protein (ZAP). Funding: This work was partially supported by the ANR19 Decrypted CE30-0021-01 grants. B.G. was supported by National Institutes of Health grants 7R01AI081848-04, 1R01CA240924-01, a Stand Up to Cancer - Lustgarten Foundation Convergence Dream Team Grant, and The Pershing Square Sohn Prize - Mark Foundation Fellow supported by funding from The Mark Foundation for Cancer Research.
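The CpG relative abundance mentioned in this abstract is commonly computed as an observed-to-expected ratio. The Python sketch below assumes that standard definition, the CG dinucleotide frequency divided by the product of the C and G frequencies. It does not compute the CpG force parameter of the paper, and the function name `cpg_relative_abundance` is hypothetical.

```python
def cpg_relative_abundance(seq: str) -> float:
    """Observed/expected CpG ratio: f(CG) / (f(C) * f(G)).

    Assumption: frequencies are taken over the whole sequence, with the
    dinucleotide frequency normalized by the number of adjacent pairs.
    """
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        raise ValueError("sequence must contain at least two bases")
    f_c = seq.count("C") / n
    f_g = seq.count("G") / n
    if f_c == 0 or f_g == 0:
        return 0.0  # no CpG is possible without both C and G
    # "CG" cannot overlap with itself, so str.count finds every occurrence.
    f_cg = seq.count("CG") / (n - 1)
    return f_cg / (f_c * f_g)


if __name__ == "__main__":
    print(cpg_relative_abundance("ACGT" * 5))  # CpG-rich toy sequence
    print(cpg_relative_abundance("CATG" * 5))  # contains C and G but no CpG
```

A value below 1 indicates CpG depletion relative to the base composition. To follow fluctuations along a genome, the same function could be applied to sliding windows or to individual coding regions.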
Affiliation(s)
- Andrea Di Gioacchino
  - Laboratoire de Physique de l'Ecole Normale Supérieure, PSL & CNRS UMR8063, Sorbonne Université, Université de Paris, F-75005 Paris, France
- Petr Šulc
  - School of Molecular Sciences and Center for Molecular Design and Biomimetics, The Biodesign Institute, Arizona State University, 1001 South McAllister Avenue, Tempe, Arizona 85281, USA
- Anastassia V Komarova
  - Molecular Genetics of RNA viruses, Department of Virology, Institut Pasteur, CNRS UMR-3569, 75015 Paris, France
- Benjamin D Greenbaum
  - Computational Oncology, Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, 1275 York Avenue New York, NY 10065
- Rémi Monasson
  - Laboratoire de Physique de l'Ecole Normale Supérieure, PSL & CNRS UMR8063, Sorbonne Université, Université de Paris, F-75005 Paris, France
- Simona Cocco
  - Laboratoire de Physique de l'Ecole Normale Supérieure, PSL & CNRS UMR8063, Sorbonne Université, Université de Paris, F-75005 Paris, France
|
6
|
Skutkova H, Maderankova D, Sedlar K, Jugas R, Vitek M. A degeneration-reducing criterion for optimal digital mapping of genetic codes. Comput Struct Biotechnol J 2019; 17:406-414. [PMID: 30984363 PMCID: PMC6444178 DOI: 10.1016/j.csbj.2019.03.007] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 09/17/2018] [Revised: 02/07/2019] [Accepted: 03/15/2019] [Indexed: 01/08/2023] Open
Abstract
Bioinformatics may seem to be a scientific field that primarily processes large string datasets, as nucleotides and amino acids are represented with dedicated characters. On the other hand, many computational tasks in bioinformatics are mathematical problems that can be understood as operations on digits. In fact, many computational tasks are solved this way in the background. One of the most widely used digital representations is the mapping of nucleotides and amino acids to the integers 0–3 and 0–20, respectively. The limitation of this mapping appears when the digital signal of nucleotides has to be translated into a digital signal of amino acids, because the genetic code is degenerate. This causes non-monotonicities in the mapping function. Although a map for reducing this undesirable effect has already been proposed, it is defined theoretically and for the standard genetic code only. In this study, we derived a novel optimality criterion for reducing the influence of degeneracy by utilizing a large dataset of real sequences with various genetic codes. As a result, we proposed a new robust global optimal map suitable for any genetic code as well as specialized optimal maps for particular genetic codes. Optimization of 1D numerical representation for DNA to protein translation. Reducing genetic code degeneracy in numerical representation of DNA sequences. More robust numerical conversion used for genomic-proteomic analysis.
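The degeneracy problem described in this abstract can be made concrete with a small Python sketch. It assumes the arbitrary assignment A=0, C=1, G=2, T=3 (the abstract does not say which integer goes with which base) and reads each codon as a base-4 number. The names `NUC`, `encode` and `codon_index` are hypothetical, and the optimized maps proposed in the paper are not reproduced here.

```python
# Assumed integer mapping of nucleotides to 0-3 (one of 24 possible orders).
NUC = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(seq: str) -> list[int]:
    """Map a DNA string to its integer signal."""
    return [NUC[base] for base in seq.upper()]


def codon_index(codon: str) -> int:
    """Read a three-base codon as a base-4 number in the range 0-63."""
    a, b, c = encode(codon)
    return 16 * a + 4 * b + c


# The six serine codons of the standard genetic code.
SERINE = ["TCA", "TCC", "TCG", "TCT", "AGC", "AGT"]

if __name__ == "__main__":
    print(sorted(codon_index(c) for c in SERINE))  # [9, 11, 52, 53, 54, 55]
    print(codon_index("AGA"), codon_index("AGG"))  # arginine codons: 8 10
```

Under this assignment the serine codons fall at 9, 11 and 52 to 55, and the arginine codons AGA and AGG (8 and 10) are interleaved with them, so a simple ordering of codon values cannot map monotonically onto an ordering of amino acids.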
Affiliation(s)
- Helena Skutkova
  - Department of Biomedical Engineering, Brno University of Technology, Technicka 12, 616 00 Brno, Czech Republic
- Denisa Maderankova
  - Department of Biomedical Engineering, Brno University of Technology, Technicka 12, 616 00 Brno, Czech Republic
- Karel Sedlar
  - Department of Biomedical Engineering, Brno University of Technology, Technicka 12, 616 00 Brno, Czech Republic
- Robin Jugas
  - Department of Biomedical Engineering, Brno University of Technology, Technicka 12, 616 00 Brno, Czech Republic
- Martin Vitek
  - Department of Biomedical Engineering, Brno University of Technology, Technicka 12, 616 00 Brno, Czech Republic
|
7
|
Morán Losada P, Fischer S, Chouvarine P, Tümmler B. Three-base periodicity of sites of sequence variation in Pseudomonas aeruginosa and Staphylococcus aureus core genomes. FEBS Lett 2016; 590:3538-3543. [PMID: 27664047 DOI: 10.1002/1873-3468.12431] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 07/31/2016] [Revised: 09/08/2016] [Accepted: 09/12/2016] [Indexed: 11/11/2022]
Abstract
The three-base periodicity property is characteristic of protein-coding sequences. Here, we report on three-base periodicity of sequence variation in the core genome of bacteria. Single nucleotide polymorphism (SNP) syntenies were extracted from pairwise genome alignments of 41 Staphylococcus aureus or 20 Pseudomonas aeruginosa strains. The length of fragment pairs with identical nucleotides at all SNP positions showed a length-dependent overrepresentation of multiples of three nucleotides at corresponding codon positions of the AT-rich S. aureus and the GC-rich P. aeruginosa. Three-base SNP periodicity seems to be a characteristic feature of the tightly arranged bacterial core genome.
Affiliation(s)
- Patricia Morán Losada
  - Clinical Research Group, 'Molecular Pathology of Cystic Fibrosis and Pseudomonas Genomics', OE 6710, Hannover Medical School, Germany
- Sebastian Fischer
  - Clinical Research Group, 'Molecular Pathology of Cystic Fibrosis and Pseudomonas Genomics', OE 6710, Hannover Medical School, Germany
- Philippe Chouvarine
  - Clinical Research Group, 'Molecular Pathology of Cystic Fibrosis and Pseudomonas Genomics', OE 6710, Hannover Medical School, Germany
- Burkhard Tümmler
  - Clinical Research Group, 'Molecular Pathology of Cystic Fibrosis and Pseudomonas Genomics', OE 6710, Hannover Medical School, Germany; Biomedical Research in Endstage and Obstructive Lung Disease (BREATH), German Center for Lung Research, Hannover, Germany
|
8
|
Jin H, Rube HT, Song JS. Categorical spectral analysis of periodicity in nucleosomal DNA. Nucleic Acids Res 2016; 44:2047-57. [PMID: 26893354 PMCID: PMC4797311 DOI: 10.1093/nar/gkw101] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Received: 12/17/2015] [Accepted: 02/09/2016] [Indexed: 12/26/2022] Open
Abstract
DNA helical twist imposes geometric constraints on the location of histone–DNA interaction sites along nucleosomal DNA. Certain 10.5-bp periodic nucleotides in phase with these geometric constraints have been suggested to facilitate nucleosome positioning. However, the extent of nucleotide periodicity in nucleosomal DNA and its significance in directing nucleosome positioning still remain unclear. We clarify these issues by applying categorical spectral analysis to high-resolution nucleosome maps in two yeast species. We find that only a small fraction of nucleosomal sequences contain significant 10.5-bp periodicity. We further develop a spectral decomposition method to show that the previously observed periodicity in aligned nucleosomal sequences mainly results from proper phasing among nucleosomal sequences, and not from a preponderant occurrence of periodicity within individual sequences. Importantly, we show that this phasing may arise from the histones’ proclivity for putting preferred nucleotides at some of the evenly spaced histone–DNA contact points with respect to the dyad axis. We demonstrate that 10.5-bp periodicity, when present, significantly facilitates rotational, but not translational, nucleosome positioning. Finally, although periodicity only moderately affects nucleosome occupancy genome wide, reduced periodicity is an evolutionarily conserved signature of nucleosome-depleted regions around transcription start/termination sites.
Affiliation(s)
- Hu Jin
  - Department of Physics, University of Illinois, Urbana-Champaign, Urbana, IL 61801, USA; Institute for Genomic Biology, University of Illinois, Urbana-Champaign, Urbana, IL 61801, USA
- H Tomas Rube
  - Department of Physics, University of Illinois, Urbana-Champaign, Urbana, IL 61801, USA; Institute for Genomic Biology, University of Illinois, Urbana-Champaign, Urbana, IL 61801, USA
- Jun S Song
  - Department of Physics, University of Illinois, Urbana-Champaign, Urbana, IL 61801, USA; Institute for Genomic Biology, University of Illinois, Urbana-Champaign, Urbana, IL 61801, USA; Department of Bioengineering, University of Illinois, Urbana-Champaign, Urbana, IL 61801, USA
|
9
|
Chaley M, Kutyrkin V. Stochastic model of homogeneous coding and latent periodicity in DNA sequences. J Theor Biol 2016; 390:106-16. [DOI: 10.1016/j.jtbi.2015.11.014] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 05/19/2015] [Revised: 09/18/2015] [Accepted: 11/14/2015] [Indexed: 11/24/2022]
|
10
|
Suvorova YM, Korotkova MA, Korotkov EV. Comparative analysis of periodicity search methods in DNA sequences. Comput Biol Chem 2014; 53 Pt A:43-8. [PMID: 25218218 DOI: 10.1016/j.compbiolchem.2014.08.008] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Accepted: 07/11/2014] [Indexed: 11/30/2022]
Abstract
To determine the periodicity of a DNA sequence, different spectral approaches are applied: discrete Fourier transform (DFT), autocorrelation (CORR), information decomposition (ID), hybrid method (HYB), concept of spectral envelope for spectral analysis (SE), normalized autocorrelation (CORR_N) and profile analysis (PA). In this work, we investigated the possibility of finding the true period length with the methods stated above, depending on the average number of accumulated changes in DNA bases (point mutations, PM). The results show that for short periods (≤4 b.p.), it is possible to use the hybrid method (HYB), which combines properties of autocorrelation, Fourier transform, and information decomposition (ID). For larger period lengths (>4 b.p.) with point mutation (PM) values of 1.0 or more per nucleotide, it is preferable to use the information decomposition method (ID), as the other spectral approaches cannot correctly determine the period length present in the analyzed sequence.
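Of the methods compared in this abstract, the autocorrelation idea is the simplest to illustrate. The Python sketch below is not the CORR implementation evaluated by the authors; it assumes a plain symbol-match autocorrelation, and the function names `match_fraction` and `best_period` and the `max_lag` parameter are hypothetical.

```python
def match_fraction(seq: str, lag: int) -> float:
    """Fraction of positions i for which seq[i] equals seq[i + lag]."""
    seq = seq.upper()
    pairs = len(seq) - lag
    if lag <= 0 or pairs <= 0:
        raise ValueError("lag must be between 1 and len(seq) - 1")
    matches = sum(1 for i in range(pairs) if seq[i] == seq[i + lag])
    return matches / pairs


def best_period(seq: str, max_lag: int = 20) -> int:
    """Smallest lag, up to max_lag, with the highest match fraction."""
    lags = range(1, min(max_lag, len(seq) - 1) + 1)
    # max() returns the first maximal element, so ties go to the smaller lag.
    return max(lags, key=lambda k: match_fraction(seq, k))


if __name__ == "__main__":
    print(best_period("ACGTT" * 10))  # perfect period-5 repeat
```

On a perfect repeat the period is recovered exactly. As the abstract notes, accumulated point mutations degrade simple approaches of this kind, which is the motivation for comparing them with information decomposition.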
Affiliation(s)
- Yulia M Suvorova
  - Centre of Bioengineering Russian Academy of Sciences, Prospect 60-tya Oktyabrya 7/1, Moscow 117312, Russian Federation.
- Maria A Korotkova
  - National Research Nuclear University MEPhI (Moscow Engineering Physics Institute), Kashirskoe Shosse, 31, Moscow 115522, Russian Federation.
- Eugene V Korotkov
  - Centre of Bioengineering Russian Academy of Sciences, Prospect 60-tya Oktyabrya 7/1, Moscow 117312, Russian Federation; National Research Nuclear University MEPhI (Moscow Engineering Physics Institute), Kashirskoe Shosse, 31, Moscow 115522, Russian Federation.
|