1
Górski AZ, Piwowar M. Nucleotide spacing distribution analysis for human genome. Mamm Genome 2021; 32:123-128. [PMID: 33723659 PMCID: PMC8012312 DOI: 10.1007/s00335-021-09865-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 12/15/2020] [Accepted: 03/02/2021] [Indexed: 11/30/2022]
Abstract
The distribution of nucleotide spacing in the human genome was investigated. The frequency of occurrence in the human genome of sequences of different lengths flanked by one type of nucleotide was analyzed, showing that the distribution has no self-similar (fractal) structure. The results nevertheless revealed several characteristic features: (i) the distribution for short-range spacing is quite similar to that of purely stochastic sequences; (ii) the distribution for long-range spacing deviates substantially from the random sequence distribution, showing strong long-range correlations; (iii) the differences between (A, T) and (C, G) nucleotides are quite significant; (iv) the spacing distribution displays tiny oscillations.
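As an illustration of the quantity discussed in this abstract, the following minimal sketch counts the gaps between consecutive occurrences of one nucleotide and gives the geometric expectation for a purely random sequence. The function names and the exact gap convention (number of other symbols between two occurrences) are assumptions made here for illustration; the paper's own definition may differ in detail.

```python
from collections import Counter


def spacing_distribution(seq, base):
    """Count gap lengths between consecutive occurrences of `base`.

    A gap of length k means that k other symbols lie between two
    successive occurrences of `base` (assumed convention).
    """
    counts = Counter()
    last = None  # index of the previous occurrence of `base`
    for i, ch in enumerate(seq):
        if ch == base:
            if last is not None:
                counts[i - last - 1] += 1
            last = i
    return counts


def geometric_expectation(p, n_gaps, k):
    """Expected number of gaps of length k among n_gaps gaps when each
    position independently holds `base` with probability p."""
    return n_gaps * p * (1 - p) ** k
```

Comparing `spacing_distribution(seq, "A")` with `geometric_expectation` over a range of `k` is one simple way to see where an observed distribution departs from the stochastic baseline.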
Affiliation(s)
- Andrzej Z Górski
  - Polish Academy of Sciences, Institute of Nuclear Physics, Radzikowskiego 152 st, 31-342, Kraków, Poland
- Monika Piwowar
  - Jagiellonian University, Collegium Medicum, Kopernika 7E st, 31-034, Kraków, Poland
2
Wang J, Yin C. A Fast Algorithm for Computing the Fourier Spectrum of a Fractional Period. J Comput Biol 2020; 28:269-282. [PMID: 33290131 DOI: 10.1089/cmb.2020.0269] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/13/2022]
Abstract
Directly computing Fourier power spectra at fractional periods of real sequences can be beneficial in many digital signal processing applications. In this article, we present a fast algorithm to compute the fractional Fourier power spectra of real sequences. For a real sequence of length m = nl, we may deduce its congruence derivative sequence of length l. The discrete Fourier transform of the original sequence can be calculated from the discrete Fourier transform of the congruence derivative sequence. From the relation between the discrete Fourier transforms of the two sequences, the special features of the Fourier power spectra at integer and fractional periods of a real sequence can be derived. It has been proved mathematically that, after calculating the Fourier power spectrum (FPS) at an integer period, the Fourier power spectra at the fractional periods related to this integer period can be easily represented by the computed FPS at the integer period. Computational experiments using simulated sinusoidal data and a protein sequence show that the computed results are a kind of Fourier power spectra corresponding to new frequencies that cannot be obtained from the traditional discrete Fourier transform. Therefore, the algorithm would be a new realization method for the discrete Fourier transform of a real sequence.
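To make the notion of a power spectrum at a non-integer period concrete, the sketch below evaluates the Fourier sum directly at frequency 1/period. This is the naive O(m) evaluation per period, not the fast algorithm of the paper; the function name `power_at_period` is a hypothetical one chosen for this example.

```python
import cmath


def power_at_period(x, period):
    """Power |X(f)|^2 of the real sequence x at frequency f = 1/period.

    `period` may be any positive real number, so periods that do not
    correspond to a standard DFT bin can be probed directly.
    """
    total = sum(
        v * cmath.exp(-2j * cmath.pi * k / period) for k, v in enumerate(x)
    )
    return abs(total) ** 2
```

For a sequence that repeats every three samples, such as `[1, 0, 0] * 4`, the power at period 3 is larger than at a nearby fractional period such as 3.6.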
Affiliation(s)
- Jiasong Wang
  - Department of Mathematics, Nanjing University, Nanjing, China
- Changchuan Yin
  - Department of Mathematics, Statistics, and Computer Science, The University of Illinois at Chicago, Chicago, Illinois, USA
3
Suvorova YM, Korotkov EV. New Method for Potential Fusions Detection in Protein-Coding Sequences. J Comput Biol 2019; 26:1253-1261. [PMID: 31211597 DOI: 10.1089/cmb.2019.0122] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/26/2022]
Abstract
Gene fusion is known to be one of the mechanisms of new gene formation. Most bioinformatics methods for studying fused genes are based on sequence similarity search. However, if the ancestral sequences were lost during evolution or changed too much, it is impossible to detect the fusion. Previously, we developed a method of searching for triplet periodicity (TP) change points in protein-coding sequences (CDS) and showed the possible relation of this phenomenon to gene formation as a result of fusion. In this study, we improved the TP change point detection method and studied the genes of six eukaryotic genomes. At a type I error probability of 2%-3%, TP change points were found in 20%-40% of genes. Further analysis showed that about 30% of the TP change points can be explained by amino acid repeats. Another 30% may be fused genes, for which alignments were detected by the BLAST program. We believe that the remaining cases may be fused genes whose ancestral sequences have been lost. The method is more sensitive to TP changes and allowed us to find up to two to three times more cases of significant TP change points than our previous method.
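Triplet periodicity is often summarized as a table of nucleotide counts per codon position. The sketch below builds such a 3x4 table; it only illustrates the underlying representation, under the assumption that the reading frame starts at index 0, and it does not reproduce the authors' change point statistic, which the abstract does not specify. The function name is hypothetical.

```python
def triplet_periodicity_matrix(cds):
    """Return a 3x4 table of counts.

    Rows are codon positions 0..2 (assuming the frame starts at index 0),
    columns are the nucleotides A, C, G, T in that order.
    """
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    matrix = [[0] * 4 for _ in range(3)]
    for i, ch in enumerate(cds.upper()):
        col = index.get(ch)
        if col is not None:  # skip ambiguous symbols such as N
            matrix[i % 3][col] += 1
    return matrix
```

Computing this table separately for the parts of a sequence on either side of a candidate position, and comparing the two tables, is one naive way to look for a change in periodicity.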
Affiliation(s)
- Yulia M Suvorova
  - Federal State Institution "Federal Research Centre "Fundamentals of Biotechnology" of the Russian Academy of Sciences", Moscow, Russian Federation
- Eugene V Korotkov
  - Federal State Institution "Federal Research Centre "Fundamentals of Biotechnology" of the Russian Academy of Sciences", Moscow, Russian Federation
  - Applied Mathematics Department, National Research Nuclear University MEPhI, Moscow, Russian Federation
4
Li J, Zhang L, Li H, Ping Y, Xu Q, Wang R, Tan R, Wang Z, Liu B, Wang Y. Integrated entropy-based approach for analyzing exons and introns in DNA sequences. BMC Bioinformatics 2019; 20:283. [PMID: 31182012 PMCID: PMC6557737 DOI: 10.1186/s12859-019-2772-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 12/29/2022]
Abstract
BACKGROUND Numerous essential algorithms and methods, including entropy-based quantitative methods, have been developed over the last decade to analyze complex DNA sequences. Exons and introns are the most notable components of DNA, and their identification and prediction are always the focus of state-of-the-art research. RESULTS In this study, we designed an integrated entropy-based analysis approach, which involves a modified topological entropy calculation, the genomic signal processing (GSP) method, and singular value decomposition (SVD), to investigate exons and introns in DNA sequences. We optimized and implemented the topological entropy and the generalized topological entropy to calculate the complexity of DNA sequences, highlighting the characteristics of repetitive sequences. By comparing the digitized entropy values of exons and introns, we observed that they are significantly different. After converting DNA data to numerical topological entropy values, we applied the SVD method to effectively investigate exon and intron regions on a single gene sequence. Additionally, several genes across five species were used for exon prediction. CONCLUSIONS Our approach not only helps to explore the complexity of DNA sequences and their functional elements, but also provides an entropy-based GSP method to analyze exon and intron regions. Our work is feasible across different species and extendable to the analysis of other components in both coding and noncoding regions of DNA sequences.
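For orientation, the sketch below implements a commonly used finite-sequence definition of topological entropy for a four-letter alphabet: choose the largest word length n such that 4^n + n - 1 does not exceed the sequence length, count the distinct words of length n in the first 4^n + n - 1 symbols, and take log base 4 of that count divided by n. The paper describes a modified and a generalized variant that the abstract does not spell out, so this is an assumed baseline, not the authors' implementation.

```python
import math


def topological_entropy(seq):
    """Topological entropy of a DNA string under the baseline definition.

    Returns a value in [0, 1]; requires at least 4 symbols.
    """
    if len(seq) < 4:
        raise ValueError("sequence must contain at least 4 symbols")
    n = 1
    # Grow n while the next window size, 4^(n+1) + (n+1) - 1, still fits.
    while 4 ** (n + 1) + n <= len(seq):
        n += 1
    window = 4 ** n + n - 1
    prefix = seq[:window]
    # The prefix contains exactly 4^n overlapping words of length n.
    distinct = {prefix[i:i + n] for i in range(window - n + 1)}
    return math.log(len(distinct), 4) / n
```

Fixing the window at 4^n + n - 1 symbols means the prefix holds exactly 4^n words, so a sequence in which every possible word appears once reaches the maximum value of 1, while a highly repetitive sequence scores close to 0.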
Affiliation(s)
- Junyi Li
  - School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
- Li Zhang
  - School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
- Huinian Li
  - School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
- Yuan Ping
  - School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
- Qingzhe Xu
  - School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
- Rongjie Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, 150001 China
- Renjie Tan
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, 150001 China
- Zhen Wang
  - CAS Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031 China
- Bo Liu
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, 150001 China
- Yadong Wang
  - School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, 150001 China