1
|
Wu P, Nie Z, Huang Z, Zhang X. CircPCBL: Identification of Plant CircRNAs with a CNN-BiGRU-GLT Model. PLANTS (BASEL, SWITZERLAND) 2023; 12:1652. [PMID: 37111874 PMCID: PMC10143888 DOI: 10.3390/plants12081652] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 02/20/2023] [Revised: 04/10/2023] [Accepted: 04/13/2023] [Indexed: 06/19/2023]
Abstract
Circular RNAs (circRNAs), which are produced post-splicing of pre-mRNAs, are strongly linked to the emergence of several tumor types. The initial stage in conducting follow-up studies involves identifying circRNAs. Currently, animals are the primary target of most established circRNA recognition technologies. However, the sequence features of plant circRNAs differ from those of animal circRNAs, making it impossible to detect plant circRNAs. For example, there are non-GT/AG splicing signals at circRNA junction sites and few reverse complementary sequences and repetitive elements in the flanking intron sequences of plant circRNAs. In addition, there have been few studies on circRNAs in plants, and thus it is urgent to create a plant-specific method for identifying circRNAs. In this study, we propose CircPCBL, a deep-learning approach that only uses raw sequences to distinguish between circRNAs found in plants and other lncRNAs. CircPCBL comprises two separate detectors: a CNN-BiGRU detector and a GLT detector. The CNN-BiGRU detector takes in the one-hot encoding of the RNA sequence as the input, while the GLT detector uses k-mer (k = 1 - 4) features. The output matrices of the two submodels are then concatenated and ultimately pass through a fully connected layer to produce the final output. To verify the generalization performance of the model, we evaluated CircPCBL using several datasets, and the results revealed that it had an F1 of 85.40% on the validation dataset composed of six different plants species and 85.88%, 75.87%, and 86.83% on the three cross-species independent test sets composed of Cucumis sativus, Populus trichocarpa, and Gossypium raimondii, respectively. With an accuracy of 90.9% and 90%, respectively, CircPCBL successfully predicted ten of the eleven circRNAs of experimentally reported Poncirus trifoliata and nine of the ten lncRNAs of rice on the real set. CircPCBL could potentially contribute to the identification of circRNAs in plants. In addition, it is remarkable that CircPCBL also achieved an average accuracy of 94.08% on the human datasets, which is also an excellent result, implying its potential application in animal datasets. Ultimately, CircPCBL is available as a web server, from which the data and source code can also be downloaded free of charge.
Collapse
Affiliation(s)
- Pengpeng Wu
- Anhui Province Key Laboratory of Smart Agricultural Technology and Equipment, Anhui Agricultural University, Hefei 230036, China
- School of Life Science, Anhui Agricultural University, Hefei 230036, China
| | - Zhenjun Nie
- Anhui Province Key Laboratory of Smart Agricultural Technology and Equipment, Anhui Agricultural University, Hefei 230036, China
- School of Information and Computer Science, Anhui Agricultural University, Hefei 230036, China
| | - Zhiqiang Huang
- Anhui Province Key Laboratory of Smart Agricultural Technology and Equipment, Anhui Agricultural University, Hefei 230036, China
- School of Information and Computer Science, Anhui Agricultural University, Hefei 230036, China
| | - Xiaodan Zhang
- Anhui Province Key Laboratory of Smart Agricultural Technology and Equipment, Anhui Agricultural University, Hefei 230036, China
- School of Information and Computer Science, Anhui Agricultural University, Hefei 230036, China
| |
Collapse
|
2
|
Wang H, Wang S, Zhang Y, Bi S, Zhu X. A brief review of machine learning methods for RNA methylation sites prediction. Methods 2022; 203:399-421. [DOI: 10.1016/j.ymeth.2022.03.001] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/31/2021] [Revised: 02/15/2022] [Accepted: 03/01/2022] [Indexed: 02/07/2023] Open
|
3
|
Yang Z, Li H, Jia Y, Zheng Y, Meng H, Bao T, Li X, Luo L. Intrinsic laws of k-mer spectra of genome sequences and evolution mechanism of genomes. BMC Evol Biol 2020; 20:157. [PMID: 33228538 PMCID: PMC7684957 DOI: 10.1186/s12862-020-01723-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/04/2020] [Accepted: 11/10/2020] [Indexed: 11/17/2022] Open
Abstract
Background K-mer spectra of DNA sequences contain important information about sequence composition and sequence evolution. We want to reveal the evolution rules of genome sequences by studying the k-mer spectra of genome sequences. Results The intrinsic laws of k-mer spectra of 920 genome sequences from primate to prokaryote were analyzed. We found that there are two types of evolution selection modes in genome sequences, named as CG Independent Selection and TA Independent Selection. There is a mutual inhibition relationship between CG and TA independent selections. We found that the intensity of CG and TA independent selections correlates closely with genome evolution and G + C content of genome sequences. The living habits of species are related closely to the independent selection modes adopted by species genomes. Consequently, we proposed an evolution mechanism of genomes in which the genome evolution is determined by the intensities of the CG and TA independent selections and the mutual inhibition relationship. Besides, by the evolution mechanism of genomes, we speculated the evolution modes of prokaryotes in mild and extreme environments in the anaerobic age and the evolving process of prokaryotes from anaerobic to aerobic environment on earth as well as the originations of different eukaryotes. Conclusion We found that there are two independent selection modes in genome sequences. The evolution of genome sequence is determined by the two independent selection modes and the mutual inhibition relationship between them.
Collapse
Affiliation(s)
- Zhenhua Yang
- Laboratory of Theoretical Biophysics, School of Physical Science & Technology, Inner Mongolia University, Hohhot, 010021, China.,School of Economics and Management, Inner Mongolia University of Science & Technology, Baotou, 014010, China
| | - Hong Li
- Laboratory of Theoretical Biophysics, School of Physical Science & Technology, Inner Mongolia University, Hohhot, 010021, China.
| | - Yun Jia
- College of Science, Inner Mongolia University of Technology, Hohhot, 010051, China
| | - Yan Zheng
- Baotou Medical College, Inner Mongolia University of Science & Technology, Baotou, 014040, China
| | - Hu Meng
- School of Life Science & Technology, Inner Mongolia University of Science & Technology, Baotou, 014010, China
| | - Tonglaga Bao
- Laboratory of Theoretical Biophysics, School of Physical Science & Technology, Inner Mongolia University, Hohhot, 010021, China
| | - Xiaolong Li
- Laboratory of Theoretical Biophysics, School of Physical Science & Technology, Inner Mongolia University, Hohhot, 010021, China
| | - Liaofu Luo
- Laboratory of Theoretical Biophysics, School of Physical Science & Technology, Inner Mongolia University, Hohhot, 010021, China
| |
Collapse
|
4
|
Sun JH, Ai SM, Liu SQ. Methylation-driven model for analysis of dinucleotide evolution in genomes. Theor Biol Med Model 2020; 17:3. [PMID: 32264909 PMCID: PMC7140373 DOI: 10.1186/s12976-020-00122-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/09/2019] [Accepted: 03/10/2020] [Indexed: 11/16/2022] Open
Abstract
Background CpGs, the major methylation sites in vertebrate genomes, exhibit a high mutation rate from the methylated form of CpG to TpG/CpA and, therefore, influence the evolution of genome composition. However, the quantitative effects of CpG to TpG/CpA mutations on the evolution of genome composition in terms of the dinucleotide frequencies/proportions remain poorly understood. Results Based on the neutral theory of molecular evolution, we propose a methylation-driven model (MDM) that allows predicting the changes in frequencies/proportions of the 16 dinucleotides and in the GC content of a genome given the known number of CpG to TpG/CpA mutations. The application of MDM to the 10 published vertebrate genomes shows that, for most of the 16 dinucleotides and the GC content, a good consistency is achieved between the predicted and observed trends of changes in the frequencies and content relative to the assumed initial values, and that the model performs better on the mammalian genomes than it does on the lower-vertebrate genomes. The model’s performance depends on the genome composition characteristics, the assumed initial state of the genome, and the estimated parameters, one or more of which are responsible for the different application effects on the mammalian and lower-vertebrate genomes and for the large deviations of the predicted frequencies of a few dinucleotides from their observed frequencies. Conclusions Despite certain limitations of the current model, the successful application to the higher-vertebrate (mammalian) genomes witnesses its potential for facilitating studies aimed at understanding the role of methylation in driving the evolution of genome dinucleotide composition.
Collapse
Affiliation(s)
- Jian-Hong Sun
- State Key Laboratory for Conservation and Utilization of Bio-Resources in Yunnan & School of Life Sciences, Yunnan University, Kunming, 650091, China.,College of Engineering, Honghe University, Mengzi, 661100, China
| | - Shi-Meng Ai
- Department of Applied Mathematics, Yunnan Agricultural University, Kunming, 650201, China
| | - Shu-Qun Liu
- State Key Laboratory for Conservation and Utilization of Bio-Resources in Yunnan & School of Life Sciences, Yunnan University, Kunming, 650091, China.
| |
Collapse
|
5
|
Zhao X, Zhang Y, Ning Q, Zhang H, Ji J, Yin M. Identifying N6-methyladenosine sites using extreme gradient boosting system optimized by particle swarm optimizer. J Theor Biol 2019; 467:39-47. [DOI: 10.1016/j.jtbi.2019.01.035] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/13/2018] [Revised: 01/04/2019] [Accepted: 01/30/2019] [Indexed: 01/15/2023]
|
6
|
He J, Fang T, Zhang Z, Huang B, Zhu X, Xiong Y. PseUI: Pseudouridine sites identification based on RNA sequence information. BMC Bioinformatics 2018; 19:306. [PMID: 30157750 PMCID: PMC6114832 DOI: 10.1186/s12859-018-2321-0] [Citation(s) in RCA: 80] [Impact Index Per Article: 13.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/19/2018] [Accepted: 08/21/2018] [Indexed: 01/28/2023] Open
Abstract
Background Pseudouridylation is the most prevalent type of posttranscriptional modification in various stable RNAs of all organisms, which significantly affects many cellular processes that are regulated by RNA. Thus, accurate identification of pseudouridine (Ψ) sites in RNA will be of great benefit for understanding these cellular processes. Due to the low efficiency and high cost of current available experimental methods, it is highly desirable to develop computational methods for accurately and efficiently detecting Ψ sites in RNA sequences. However, the predictive accuracy of existing computational methods is not satisfactory and still needs improvement. Results In this study, we developed a new model, PseUI, for Ψ sites identification in three species, which are H. sapiens, S. cerevisiae, and M. musculus. Firstly, five different kinds of features including nucleotide composition (NC), dinucleotide composition (DC), pseudo dinucleotide composition (pseDNC), position-specific nucleotide propensity (PSNP), and position-specific dinucleotide propensity (PSDP) were generated based on RNA segments. Then, a sequential forward feature selection strategy was used to gain an effective feature subset with a compact representation but discriminative prediction power. Based on the selected feature subsets, we built our model by using a support vector machine (SVM). Finally, the generalization of our model was validated by both the jackknife test and independent validation tests on the benchmark datasets. The experimental results showed that our model is more accurate and stable than the previously published models. We have also provided a user-friendly web server for our model at http://zhulab.ahu.edu.cn/PseUI, and a brief instruction for the web server is provided in this paper. By using this instruction, the academic users can conveniently get their desired results without complicated calculations. Conclusion In this study, we proposed a new predictor, PseUI, to detect Ψ sites in RNA sequences. It is shown that our model outperformed the existing state-of-art models. It is expected that our model, PseUI, will become a useful tool for accurate identification of RNA Ψ sites. Electronic supplementary material The online version of this article (10.1186/s12859-018-2321-0) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Jingjing He
- School of Life Sciences, Anhui University, Hefei, 230601, Anhui, China
| | - Ting Fang
- School of Life Sciences, Anhui University, Hefei, 230601, Anhui, China
| | - Zizheng Zhang
- School of Life Sciences, Anhui University, Hefei, 230601, Anhui, China
| | - Bei Huang
- School of Life Sciences, Anhui University, Hefei, 230601, Anhui, China
| | - Xiaolei Zhu
- School of Life Sciences, Anhui University, Hefei, 230601, Anhui, China.
| | - Yi Xiong
- School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, China.
| |
Collapse
|
7
|
Jia Y, Li H, Wang J, Meng H, Yang Z. Spectrum structures and biological functions of 8-mers in the human genome. Genomics 2018. [PMID: 29522801 DOI: 10.1016/j.ygeno.2018.03.006] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
Abstract
The spectra of k-mer frequencies can reveal the structures and evolution of genome sequences. We confirmed that the trimodal spectrum of 8-mers in human genome sequences is distinguished only by CG2, CG1 and CG0 8-mer sets, containing 2,1 or 0 CpG, respectively. This phenomenon is called independent selection law. The three types of CG 8-mers were considered as different functional elements. We conjectured that (1) nucleosome binding motifs are mainly characterized by CG1 8-mers and (2) the core structural units of CpG island sequences are predominantly characterized by CG2 8-mers. To validate our conjectures, nucleosome occupied sequences and CGI sequences were extracted, then the sequence parameters were constructed through the information of the three CG 8-mer sets respectively. ROC analysis showed that CG1 8-mers are more preference in nucleosome occupied segments (AUC > 0.7) and CG2 8-mers are more preference in CGI sequences (AUC > 0.99). This validates our conjecture in principle.
Collapse
Affiliation(s)
- Yun Jia
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China; College of Science, Inner Mongolia University of Technology, Hohhot 010051, China
| | - Hong Li
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China.
| | - Jingfeng Wang
- College of Science, Inner Mongolia University of Technology, Hohhot 010051, China
| | - Hu Meng
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
| | - Zhenhua Yang
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
| |
Collapse
|
8
|
Zheng Y, Li H, Wang Y, Meng H, Zhang Q, Zhao X. Evolutionary mechanism and biological functions of 8-mers containing CG dinucleotide in yeast. Chromosome Res 2017; 25:173-189. [PMID: 28181048 DOI: 10.1007/s10577-017-9554-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2016] [Revised: 12/27/2016] [Accepted: 01/27/2017] [Indexed: 01/01/2023]
Abstract
The rules of k-mer non-random usage and the biological functions are worthy of special attention. Firstly, the article studied human 8-mer spectra and found that only the spectra of cytosine-guanine (CG) dinucleotide classification formed independent unimodal distributions when the 8-mers were classified into three subsets under 16 dinucleotide classifications. Secondly, the distribution rules were reproduced by other seven species including yeast, which showed that the evolution phenomenon had species universality. It followed that we proposed two theoretical conjectures: (1) CG1 motifs (8-mers including 1 CG) are the nucleosome-binding motifs. (2) CG2 motifs (8-mers including two or more than two CG) are the modular units of CpG islands. Our conjectures were confirmed in yeast by the following results: a maximum of average area under the receiver operating characteristic (AUC) resulted from CG1 information during nucleosome core sequences, and linker sequences were distinguished by three CG subsets; there was a one-to-one relationship between abundant CG1 signal regions and histone positions; the sequence changing of squeezed nucleosomes was relevant with the strength of CG1 signals; and the AUC value of 0.986 was based on CG2 information when CpG islands and non-CpG islands were distinguished by the three CG subsets.
Collapse
Affiliation(s)
- Yan Zheng
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, 010021, China
| | - Hong Li
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, 010021, China. .,, No.235, West University Street, Hohhot, Inner Mongolia, China.
| | - Yue Wang
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, 010021, China
| | - Hu Meng
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, 010021, China
| | - Qiang Zhang
- College of Science, Inner Mongolia Agricultural University, Hohhot, 010018, China
| | - Xiaoqing Zhao
- Biotechnology research centre, Inner Mongolia Academy of Agricultural and Animal Husbandry Science, Hohhot, 010021, China
| |
Collapse
|
9
|
Li X, Doukhan P, Feugeas JP. Statistical inference for DNA sequences of promoters: a non-stationary qualitative model. STATISTICS-ABINGDON 2016. [DOI: 10.1080/02331888.2016.1261474] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/20/2022]
Affiliation(s)
- Xiaoyin Li
- Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, USA
| | - Paul Doukhan
- AGM, Department of Mathematics, University of Cergy-Pontoise & IUF, Ile de France, Cergy-Pontoise, France
| | | |
Collapse
|
10
|
Li GQ, Liu Z, Shen HB, Yu DJ. TargetM6A: Identifying N6-Methyladenosine Sites From RNA Sequences via Position-Specific Nucleotide Propensities and a Support Vector Machine. IEEE Trans Nanobioscience 2016; 15:674-682. [DOI: 10.1109/tnb.2016.2599115] [Citation(s) in RCA: 57] [Impact Index Per Article: 7.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/05/2022]
|
11
|
Bonnici V, Manca V. Informational laws of genome structures. Sci Rep 2016; 6:28840. [PMID: 27354155 PMCID: PMC4937431 DOI: 10.1038/srep28840] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/01/2016] [Accepted: 06/09/2016] [Indexed: 01/06/2023] Open
Abstract
In recent years, the analysis of genomes by means of strings of length k occurring in the genomes, called k-mers, has provided important insights into the basic mechanisms and design principles of genome structures. In the present study, we focus on the proper choice of the value of k for applying information theoretic concepts that express intrinsic aspects of genomes. The value k = lg2(n), where n is the genome length, is determined to be the best choice in the definition of some genomic informational indexes that are studied and computed for seventy genomes. These indexes, which are based on information entropies and on suitable comparisons with random genomes, suggest five informational laws, to which all of the considered genomes obey. Moreover, an informational genome complexity measure is proposed, which is a generalized logistic map that balances entropic and anti-entropic components of genomes and is related to their evolutionary dynamics. Finally, applications to computational synthetic biology are briefly outlined.
Collapse
Affiliation(s)
- Vincenzo Bonnici
- University of Verona, Department of Computer Science, University of Verona, Verona 37134, Italy,Center for BioMedical Computing, University of Verona, Verona, 37134, Italy
| | - Vincenzo Manca
- University of Verona, Department of Computer Science, University of Verona, Verona 37134, Italy,Center for BioMedical Computing, University of Verona, Verona, 37134, Italy,
| |
Collapse
|