1
Ghosh S, Pal J, Maji B, Cattani C, Bhattacharya DK. Choice of Metric Divergence in Genome Sequence Comparison. Protein J 2024; 43:259-273. [PMID: 38492188 DOI: 10.1007/s10930-024-10189-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 02/28/2024] [Indexed: 03/18/2024]
Abstract
The paper introduces a novel probability descriptor for genome sequence comparison, employing a generalized form of Jensen-Shannon divergence. This divergence metric stems from a one-parameter family, with the parameter taking fractional values up to a maximum of one-half. Using this metric as a distance measure, a distance matrix is computed for the new probability descriptor, and phylogenetic trees are built from it via the neighbor-joining method. Initial exploration sets the parameter at one-half for various species. To assess the impact of parameter variation, trees are drawn at different parameter values (one-half, one-fourth, one-eighth). However, measurement scales decrease with parameter value increments, with higher similarity accuracy corresponding to lower scale values. Ultimately, the highest accuracy aligns with the maximum parameter value of one-half. Comparative analyses against previous methods, evaluated via Symmetric Distance (SD) values and rationalized perception, consistently favor the present approach's results. Notably, outcomes at the maximum parameter value exhibit the most accuracy, validating the method's efficacy against earlier approaches.
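As a rough illustration of the kind of divergence this abstract describes, the sketch below computes a weighted Jensen-Shannon divergence between two discrete distributions. It assumes one common one-parameter generalization, H(w·P + (1−w)·Q) − w·H(P) − (1−w)·H(Q) with weight w in (0, 1/2]; the abstract does not give the paper's exact formula, so this form may differ from it. The names `weighted_js_divergence` and `base_frequencies` are hypothetical, and plain nucleotide frequencies stand in for the paper's probability descriptor, which is not specified here.

```python
import math


def shannon_entropy(p):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def weighted_js_divergence(p, q, weight=0.5):
    """Weighted Jensen-Shannon divergence (assumed form, see lead-in).

    weight=0.5 gives the standard Jensen-Shannon divergence.
    """
    if not 0 < weight <= 0.5:
        raise ValueError("weight must lie in (0, 0.5]")
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    mixture = [weight * a + (1 - weight) * b for a, b in zip(p, q)]
    return (shannon_entropy(mixture)
            - weight * shannon_entropy(p)
            - (1 - weight) * shannon_entropy(q))


def base_frequencies(seq, alphabet="ACGT"):
    """Stand-in descriptor: relative frequency of each nucleotide."""
    counts = [seq.count(ch) for ch in alphabet]
    total = sum(counts)
    return [c / total for c in counts]


# Compare two toy sequences at the three parameter values named in the abstract.
p = base_frequencies("AACGTTGA")
q = base_frequencies("CCGGTTAC")
for w in (0.5, 0.25, 0.125):
    print(w, weighted_js_divergence(p, q, w))
```

Pairwise values computed this way could fill a distance matrix for a tree-building method such as neighbor joining; that step is omitted here.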
Affiliation(s)
- Soumen Ghosh
- Information Technology, Narula Institute of Technology, Kolkata, West Bengal, India.
- Jayanta Pal
- Computer Science & Engineering, Narula Institute of Technology, Kolkata, West Bengal, India
- Bansibadan Maji
- Electronics & Communication Engineering, National Institute of Technology, Durgapur, West Bengal, India
- Carlo Cattani
- DEIM, University of Tuscia, Largo Dell'Universita, 01100, Viterbo, Italy
2
Orhan ME, Demirci YM, Saçar Demirci MD. NeRNA: A negative data generation framework for machine learning applications of noncoding RNAs. Comput Biol Med 2023; 159:106861. [PMID: 37075604 DOI: 10.1016/j.compbiomed.2023.106861] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2022] [Revised: 02/03/2023] [Accepted: 03/30/2023] [Indexed: 04/21/2023]
Abstract
Many supervised machine-learning-based noncoding RNA (ncRNA) analysis methods have been developed to classify and identify novel sequences. During such analysis, the positive learning datasets usually consist of known examples of ncRNAs, and some of them might even have weak or strong experimental validation. On the contrary, there are neither databases listing the confirmed negative sequences for a specific ncRNA class nor standardized methodologies developed to generate high-quality negative examples. To overcome this challenge, a novel negative data generation method, NeRNA (negative RNA), is developed in this work. NeRNA uses known examples of given ncRNA sequences and their calculated structures for octal representation to create negative sequences in a manner similar to frameshift mutations but without deletion or insertion. NeRNA is tested individually with four different ncRNA datasets including microRNA (miRNA), transfer RNA (tRNA), long noncoding RNA (lncRNA), and circular RNA (circRNA). Furthermore, a species-specific case analysis is performed to demonstrate and compare the performance of NeRNA for miRNA prediction. The results of 1000-fold cross-validation on Decision Tree, Naïve Bayes and Random Forest classifiers, and deep learning algorithms such as Multilayer Perceptron, Convolutional Neural Network, and Simple feedforward Neural Networks indicate that models obtained using NeRNA-generated datasets achieve substantially high prediction performance. NeRNA is released as an easy-to-use, updatable and modifiable KNIME workflow that can be downloaded with example datasets and required extensions. In particular, NeRNA is designed to be a powerful tool for RNA sequence data analysis.
Affiliation(s)
- Mehmet Emin Orhan
- Department of Bioengineering, Graduate School of Engineering and Science, Abdullah Gül University, Kayseri, Turkey
- Yılmaz Mehmet Demirci
- Department of Engineering Science, Faculty of Engineering, Abdullah Gül University, Kayseri, Turkey
3
Ramanathan N, Ramamurthy J, Natarajan G. Numerical Characterization of DNA Sequences for Alignment-free Sequence Comparison - A Review. Comb Chem High Throughput Screen 2021; 25:365-380. [PMID: 34382516 DOI: 10.2174/1386207324666210811101437] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/30/2020] [Revised: 06/16/2021] [Accepted: 06/24/2021] [Indexed: 11/22/2022]
Abstract
BACKGROUND Biological macromolecules, namely DNA, RNA, and protein, have their building blocks organized in a particular sequence, and the sequential arrangement encodes the evolutionary history of the organism (species). Hence, biological sequences have been used for studying evolutionary relationships among species. This is usually carried out by multiple sequence alignment (MSA). Due to certain limitations of MSA, alignment-free sequence comparison methods were developed. The present review is on alignment-free sequence comparison methods carried out using numerical characterization of DNA sequences. DISCUSSION The graphical representation of DNA sequences by chaos game representation and other 2-dimensional and 3-dimensional methods is discussed. The evolution of numerical characterization from the various graphical representations and the application of the DNA invariants thus computed in phylogenetic analysis are presented. The extension of computing molecular descriptors in chemometrics to the calculation of a new set of DNA invariants and their use in alignment-free sequence comparison in an N-dimensional space and construction of phylogenetic trees is also reviewed. CONCLUSION The phylogenetic trees constructed by the alignment-free sequence comparison methods using DNA invariants were found to be better than those constructed using alignment-based tools such as PHYLIP and ClustalW. One of the graphical representation methods is now extended to study viral sequences of infectious diseases for the identification of conserved regions to design peptide-based vaccines by combining numerical characterization and graphical representation.
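The chaos game representation mentioned in this abstract can be sketched in a few lines. The version below assumes the commonly used unit-square layout with A, C, G, T at the corners (0,0), (0,1), (1,1), (1,0) and a start point at the centre; the review may discuss other layouts, and the function name `chaos_game_points` is hypothetical.

```python
# Assumed corner layout for the unit square; other conventions exist.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def chaos_game_points(seq):
    """Return the chaos game trajectory of a DNA sequence.

    Each point is the midpoint between the previous point and the corner
    assigned to the current nucleotide. Symbols outside A/C/G/T are skipped.
    """
    x, y = 0.5, 0.5  # start at the centre of the square
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


# The first point for "A" is halfway between the centre and corner (0, 0).
print(chaos_game_points("AG"))  # [(0.25, 0.25), (0.625, 0.625)]
```

Numerical descriptors such as point counts per grid cell can then be derived from the returned coordinates and compared across sequences without alignment.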
Affiliation(s)
- Natarajan Ramanathan
- Department of Chemistry, Sri Sarada Niketan College for Women, Karur-639005, Tamil Nadu. India
- Jayalakshmi Ramamurthy
- Department of Computer Science, Sri Sarada Niketan College for Women, Karur-639005, Tamil Nadu. India
- Ganapathy Natarajan
- Department of Mechanical Engineering and Industrial Engineering, University of Wisconsin, Platteville, WI 53818. United States
4
Zhang Y, Huang H, Dong X, Fang Y, Wang K, Zhu L, Wang K, Huang T, Yang J. A Dynamic 3D Graphical Representation for RNA Structure Analysis and Its Application in Non-Coding RNA Classification. PLoS One 2016; 11:e0152238. [PMID: 27213271 PMCID: PMC4877074 DOI: 10.1371/journal.pone.0152238] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 12/22/2015] [Accepted: 03/10/2016] [Indexed: 12/21/2022] Open
Abstract
With the development of new technologies in transcriptome and epigenetics, RNAs have been identified to play more and more important roles in life processes. Consequently, various methods have been proposed to assess the biological functions of RNAs and thus classify them functionally, among which comparative study of RNA structures is perhaps the most important one. To measure the structural similarity of RNAs and classify them, we propose a novel three-dimensional (3D) graphical representation of RNA secondary structure, in which an RNA secondary structure is first transformed into a characteristic sequence based on the chemical properties of nucleic acids; a dynamic 3D graph is then constructed for the characteristic sequence; and lastly a numerical characterization of the 3D graph is used to represent the RNA secondary structure. We tested our algorithm on three datasets: (1) Dataset I consisting of nine RNA secondary structures of viruses, (2) Dataset II consisting of complex RNA secondary structures including pseudo-knots, and (3) Dataset III consisting of 18 non-coding RNA families. We also compare our method with nine other existing methods using Datasets II and III. The results demonstrate that our method is better than other methods in similarity measurement and classification of RNA secondary structures.
Affiliation(s)
- Yi Zhang
- Department of Mathematics, Hebei University of Science and Technology, Shijiazhuang, Hebei 050018, People's Republic of China
- Hebei Laboratory of Pharmaceutic Molecular Chemistry, Shijiazhuang, Hebei 050018, People's Republic of China
- * E-mail: (JY); (YZ); (TH)
- Haiyun Huang
- Department of Information Retrieval of Library, Hebei University of Science and Technology, Shijiazhuang, Hebei 050018, People's Republic of China
- Xiaoqing Dong
- Department of Mathematics, Hebei University of Science and Technology, Shijiazhuang, Hebei 050018, People's Republic of China
- Yiliang Fang
- International Travel Healthcare Center, Fuzhou, Fujian 350001, People's Republic of China
- Kejing Wang
- Department of Mathematics, Hebei University of Science and Technology, Shijiazhuang, Hebei 050018, People's Republic of China
- Lijuan Zhu
- Department of Mathematics, Hebei University of Science and Technology, Shijiazhuang, Hebei 050018, People's Republic of China
- Ke Wang
- Department of Mathematics, Hebei University of Science and Technology, Shijiazhuang, Hebei 050018, People's Republic of China
- Tao Huang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, People's Republic of China
- * E-mail: (JY); (YZ); (TH)
- Jialiang Yang
- Department of Mathematics, Hebei University of Science and Technology, Shijiazhuang, Hebei 050018, People's Republic of China
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States of America
- * E-mail: (JY); (YZ); (TH)
5
Multi-scale RNA comparison based on RNA triple vector curve representation. BMC Bioinformatics 2012; 13:280. [PMID: 23110635 PMCID: PMC3599440 DOI: 10.1186/1471-2105-13-280] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 05/10/2012] [Accepted: 10/11/2012] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND In recent years, the important functional roles of RNAs in biological processes have been repeatedly demonstrated. Computing the similarity between two RNAs contributes to a better understanding of the functional relationship between them. But due to the long-range correlations of RNA, many efficient methods of detecting protein similarity do not work well. To understand RNA function comprehensively, a better similarity measure among RNAs should be designed that considers their structural features (base pairs). Current methods for RNA comparison can be generally classified into alignment-based and alignment-free. RESULTS In this paper, we propose a novel wavelet-based method based on RNA triple vector curve representation, named multi-scale RNA comparison. Firstly, we designed a novel numerical representation of RNA secondary structure termed the RNA triple vector curve (TV-Curve). Secondly, we constructed a new similarity metric based on the wavelet decomposition of the TV-Curve of RNA. Finally, we applied our algorithm to the classification of non-coding RNA and RNA mutation analysis. Furthermore, we compared the results to two well-known RNA comparison tools: RNAdistance and RNApdist. The results in this paper show the potential of our method in RNA classification and RNA mutation analysis. CONCLUSION We provide a better visualization and analysis tool named TV-Curve of RNA, especially for long RNA, which can characterize both sequence and structure features. Additionally, based on the TV-Curve representation of RNAs, a multi-scale similarity measure for RNA comparison is proposed, which can capture the local and global differences between the sequence and structure information of RNAs. Compared with the well-known RNA comparison approaches, the proposed method is validated to be outstanding and effective in terms of non-coding RNA classification and RNA mutation analysis.
The numerical experiments show that our proposed method can capture more subtle relationships among RNAs efficiently.
6
Randić M, Zupan J, Balaban AT, Vikić-Topić D, Plavšić D. Graphical Representation of Proteins. Chem Rev 2010; 111:790-862. [PMID: 20939561 DOI: 10.1021/cr800198j] [Citation(s) in RCA: 93] [Impact Index Per Article: 6.6] [Indexed: 11/30/2022]
Affiliation(s)
- Milan Randić
- National Institute of Chemistry, P.O. Box 3430, 1001 Ljubljana, Slovenia; NMR Center, Ruđer Bošković Institute, P.O. Box 180, HR-10002 Zagreb, Croatia; and Texas A&M University at Galveston, Galveston, Texas 77553
- Jure Zupan
- National Institute of Chemistry, P.O. Box 3430, 1001 Ljubljana, Slovenia; NMR Center, Ruđer Bošković Institute, P.O. Box 180, HR-10002 Zagreb, Croatia; and Texas A&M University at Galveston, Galveston, Texas 77553
- Alexandru T. Balaban
- National Institute of Chemistry, P.O. Box 3430, 1001 Ljubljana, Slovenia; NMR Center, Ruđer Bošković Institute, P.O. Box 180, HR-10002 Zagreb, Croatia; and Texas A&M University at Galveston, Galveston, Texas 77553
- Dražen Vikić-Topić
- National Institute of Chemistry, P.O. Box 3430, 1001 Ljubljana, Slovenia; NMR Center, Ruđer Bošković Institute, P.O. Box 180, HR-10002 Zagreb, Croatia; and Texas A&M University at Galveston, Galveston, Texas 77553
- Dejan Plavšić
- National Institute of Chemistry, P.O. Box 3430, 1001 Ljubljana, Slovenia; NMR Center, Ruđer Bošković Institute, P.O. Box 180, HR-10002 Zagreb, Croatia; and Texas A&M University at Galveston, Galveston, Texas 77553
7
Huang W, Zhang J, Wang Y, Huang D. A simple method to analyze the similarity of biological sequences based on the fuzzy theory. J Theor Biol 2010; 265:323-8. [DOI: 10.1016/j.jtbi.2010.05.008] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 11/29/2009] [Revised: 04/01/2010] [Accepted: 05/07/2010] [Indexed: 11/28/2022]
8
Wang S, Tian F, Qiu Y, Liu X. Bilateral similarity function: a novel and universal method for similarity analysis of biological sequences. J Theor Biol 2010; 265:194-201. [PMID: 20399215 DOI: 10.1016/j.jtbi.2010.04.013] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 12/17/2009] [Revised: 04/11/2010] [Accepted: 04/12/2010] [Indexed: 11/26/2022]
Abstract
In this paper, a bilateral similarity function is designed for analyzing the similarities of biological sequences such as DNA, RNA secondary structure, or protein. The defined function can perform comprehensive comparison between sequences remarkably well, both in terms of the Hamming distance of two compared sequences and the corresponding location difference. Compared with the existing methods for similarity analysis, the examination of similarities/dissimilarities illustrates that the proposed method, with a computational complexity of O(N), is effective for these three kinds of biological sequences and is universal across them.
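For readers unfamiliar with the Hamming distance this abstract builds on, here is a minimal O(N) sketch. It is not the paper's bilateral similarity function, whose definition is not given in the abstract; it only shows the two ingredients named there, the mismatch count and the mismatch locations. Both function names are hypothetical.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def mismatch_positions(a, b):
    """Zero-based indices where the sequences differ (one linear pass)."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


print(hamming_distance("ACGT", "ACCA"))    # 2
print(mismatch_positions("ACGT", "ACCA"))  # [2, 3]
```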
Affiliation(s)
- Shiyuan Wang
- College of Communication Engineering, Chongqing University, Chongqing 400044, China.
9
Pérez-Montoto LG, Dea-Ayuela MA, Prado-Prado FJ, Bolas-Fernández F, Ubeira FM, González-Díaz H. Study of peptide fingerprints of parasite proteins and drug-DNA interactions with Markov-Mean-Energy invariants of biopolymer molecular-dynamic lattice networks. POLYMER 2009; 50:3857-3870. [PMID: 32287404 PMCID: PMC7111648 DOI: 10.1016/j.polymer.2009.05.055] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 02/10/2009] [Revised: 05/06/2009] [Accepted: 05/14/2009] [Indexed: 11/26/2022]
Abstract
Since the advent of Molecular Dynamics (MD) in biopolymers science with the study by Karplus et al. on protein dynamics, MD has become by far the foremost well-established computational technique to investigate the structure and function of biomolecules and their respective complexes and interactions. The analysis of the MD trajectories (MDTs) remains, however, the greatest challenge and requires a great deal of insight, experience, and effort. Here, we introduce a new class of invariants for MDTs based on the spatial distribution of Mean-Energy values ξk (L) on a 2D Euclidean space representation of the MDTs. The procedure forces one MD trajectory to fold into a 2D Cartesian coordinate system using a step-by-step procedure driven by simple rules. The ξk (L) values are invariants of a Markov matrix (1 Π), which describes the probabilities of transition between two states in the new 2D space and is associated with a graph representation of MDTs similar to the lattice networks (LNs) of DNA and protein sequences. We also introduce a new algorithm to perform phylogenetic analysis of peptides based on MDTs instead of the sequence of the polypeptide. In a first experiment, we illustrate this algorithm for 35 peptides present on the Peptide Mass Fingerprint (PMF) of a new protein of Leishmania infantum studied in this work. We report, for the first time, 2D Electrophoresis isolation, MALDI TOF Mass Spectroscopy characterization, and MASCOT search results for this PMF. In a second experiment, we construct the LNs for 422 MDTs obtained in DNA-Drug Docking simulations of the interaction of 57 anticancer furocoumarins with a DNA oligonucleotide. We calculated the respective ξk (L) values for all these LNs and used them as inputs to train a new classifier with Accuracy = 85.44% and 84.91% in training and validation, respectively. The new model can be used as a scoring function to guide DNA-Drug Docking studies in drug design of new coumarins for PUVA therapy.
The new phylogenetic analysis algorithms encode information different from sequence similarity and may be used to analyze MDTs obtained in Docking or modeling experiments for any class of biopolymers. The work opens a new perspective on the analysis and applications of MD in polymer sciences.
Affiliation(s)
- Lázaro Guillermo Pérez-Montoto
- Department of Microbiology and Parasitology, Faculty of Pharmacy, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain
- Department of Organic Chemistry, Faculty of Pharmacy, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain
- María Auxiliadora Dea-Ayuela
- Departamento de Atención Sanitaria, Salud Pública y Sanidad Animal, Facultad CC Experimentales y de La Salud, Universidad CEU Cardenal Herrera, 46113 Moncada (Valencia), Spain
- Francisco J Prado-Prado
- Department of Microbiology and Parasitology, Faculty of Pharmacy, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain
- Department of Organic Chemistry, Faculty of Pharmacy, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain
- Florencio M Ubeira
- Department of Microbiology and Parasitology, Faculty of Pharmacy, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain
- Humberto González-Díaz
- Department of Microbiology and Parasitology, Faculty of Pharmacy, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain
10
González-Díaz H, Dea-Ayuela MA, Pérez-Montoto LG, Prado-Prado FJ, Agüero-Chapín G, Bolas-Fernández F, Vazquez-Padrón RI, Ubeira FM. QSAR for RNases and theoretic-experimental study of molecular diversity on peptide mass fingerprints of a new Leishmania infantum protein. Mol Divers 2009; 14:349-69. [PMID: 19578942 PMCID: PMC7088557 DOI: 10.1007/s11030-009-9178-0] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Received: 03/03/2009] [Accepted: 06/13/2009] [Indexed: 11/29/2022]
Abstract
The toxicity and low success of current treatments for Leishmaniosis drive the search for new peptide drugs and/or molecular targets in Leishmania pathogen species (L. infantum and L. major). For example, Ribonucleases (RNases) are enzymes relevant to several biologic processes; therefore, theoretical and experimental study of the molecular diversity of Peptide Mass Fingerprints (PMFs) of RNases is useful for drug design. This study introduces a methodology that combines QSAR models, 2D-Electrophoresis (2D-E), MALDI-TOF Mass Spectroscopy (MS), BLAST alignment, and Molecular Dynamics (MD) to explore PMFs of RNases. We illustrate this approach by investigating for the first time the PMFs of a new protein of L. infantum. Here we report and compare new versus old predictive models for RNases based on Topological Indices (TIs) of Markov Pseudo-Folding Lattices. This group of indices, called Pseudo-folding Lattice 2D-TIs, includes: Spectral moments pi(k)(x,y), Mean Electrostatic potentials xi(k)(x,y), and Entropy measures theta(k)(x,y). The accuracy of the models (training/cross-validation) was as follows: xi(k)(x,y)-model (96.0%/91.7%) > pi(k)(x,y)-model (84.7/83.3) > theta(k)(x,y)-model (66.0/66.7). We also carried out a 2D-E analysis of biological samples of L. infantum promastigotes focusing on a 2D-E gel spot of one unknown protein with M < 20,100 and pI < 7. MASCOT search identified 20 proteins with Mowse score > 30, but none > 52 (threshold value); the highest value, 42, was for a probable DNA-directed RNA polymerase. However, we determined experimentally the sequence of more than 140 peptides. We used QSAR models to predict RNase scores for these peptides and BLAST alignment to confirm some results. We also calculated 3D-folding TIs based on MD experiments and compared 2D versus 3D-TIs on molecular phylogenetic analysis of the molecular diversity of these peptides. This combined strategy may be of interest in drug development or target identification.
Affiliation(s)
- Humberto González-Díaz
- Department of Microbiology and Parasitology, and Department of Organic Chemistry, Faculty of Pharmacy, USC, 15782, Santiago de Compostela, Spain.
11
Quantitative Proteome–Property Relationships (QPPRs). Part 1: Finding biomarkers of organic drugs with mean Markov connectivity indices of spiral networks of blood mass spectra. Bioorg Med Chem 2008; 16:9684-93. [DOI: 10.1016/j.bmc.2008.10.004] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Received: 04/15/2008] [Revised: 09/29/2008] [Accepted: 10/02/2008] [Indexed: 11/22/2022]
12
Dai Q, Wang T. Use of linear regression model to compare RNA secondary structures. J Theor Biol 2008; 253:854-60. [DOI: 10.1016/j.jtbi.2008.04.023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/28/2008] [Revised: 04/17/2008] [Accepted: 04/17/2008] [Indexed: 11/25/2022]
13
Dai Q, Wang TM. Use of statistical measures for analyzing RNA secondary structures. J Comput Chem 2008; 29:1292-305. [PMID: 18172840 DOI: 10.1002/jcc.20891] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/07/2022]
Abstract
With more and more RNA secondary structures accumulated, the need to compare different RNA secondary structures often arises in function prediction and evolutionary analysis. Numerous efficient algorithms have been developed for comparing different RNA secondary structures, but challenges remain. In this article, a new statistical measure, which extends the notion of relative entropy and is based on the proposed stochastic model, is evaluated for RNA secondary structures. The results obtained from several experiments on real datasets have shown the effectiveness of the proposed approach. Moreover, the time complexity of our method compares favorably with that of existing methods that solve similar problems.
Affiliation(s)
- Qi Dai
- Department of Applied Mathematics, Dalian University of Technology, Dalian 116024, People's Republic of China.
14
Dai Q, Liu XQ, Wang TM. C(i,j) matrix: A better numerical characterization for graphical representations of biological sequences. J Theor Biol 2007; 247:103-9. [PMID: 17428502 DOI: 10.1016/j.jtbi.2007.03.002] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Received: 08/27/2006] [Revised: 03/04/2007] [Accepted: 03/05/2007] [Indexed: 11/30/2022]
Abstract
We find that the traditional numerical characterizations of biological sequences, such as the E matrix, D/D matrix, L/L matrix and their "high order" matrices, are limited in how exactly they can characterize biological sequences, yet they are widely used to analyze them. Here, we propose a better numerical characterization for graphical representations of biological sequences, the C(i,j) matrix. It is associated with the curvature of every point and has many advantages: (1) It can characterize the graphical representations for DNA sequences exactly, because it can overcome the limitation of the traditional matrices. (2) If we choose an appropriate fixed point, we can make the elements of the C(i,j) matrix less than or equal to 1.
Affiliation(s)
- Qi Dai
- Department of Applied Mathematics, Dalian University of Technology, Dalian 116024, PR China.
15
González-Díaz H, Agüero-Chapin G, Varona J, Molina R, Delogu G, Santana L, Uriarte E, Podda G. 2D-RNA-coupling numbers: A new computational chemistry approach to link secondary structure topology with biological function. J Comput Chem 2007; 28:1049-56. [PMID: 17279496 DOI: 10.1002/jcc.20576] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.5] [Indexed: 11/06/2022]
Abstract
Methods for predicting protein, DNA, or RNA function and mapping it onto sequence often rely on a bioinformatics alignment approach instead of chemical structure. Consequently, it is interesting to develop computational chemistry approaches based on molecular descriptors. In this sense, many researchers have used sequence-coupling numbers, and our group extended them to 2D protein representations. However, no coupling numbers have been reported for 2D-RNA topology graphs, which are highly branched and contain useful information. Here, we use a computational chemistry scheme: (a) transforming sequences into RNA secondary structures, (b) defining and calculating new 2D-RNA-coupling numbers, (c) seeking a structure-function model, and (d) mapping biological function onto the folded RNA. We studied as an example 1-aminocyclopropane-1-carboxylic acid (ACC) oxidases, known as ACO, which control fruit ripening and are important for the biotechnology industry. First, we calculated tau(k)(2D-RNA) values for a set of 90 folded RNAs, including 28 transcripts of ACO and control sequences. Afterwards, we compared the classification performance of 10 different classifiers implemented in the software WEKA. In particular, the logistic equation ACO = 23.8 · tau(1)(2D-RNA) + 41.4 predicts ACOs with 98.9%, 98.0%, and 97.8% accuracy in training, leave-one-out, and 10-fold cross-validation, respectively. Afterwards, with this equation we predict ACO function for a sequence isolated in this work from Coffea arabica (GenBank accession DQ218452). The tau(1)(2D-RNA) descriptor also compares favorably with other descriptors. This equation allows us to map the codification of ACO activity onto different mRNA topology features. The present computational-chemistry approach is general and could be extended to connect RNA secondary structure topology to other functions.
Affiliation(s)
- Humberto González-Díaz
- Department of Organic Chemistry, University of Santiago de Compostela, Santiago de Compostela 15782, Spain.