1
|
Dingle K, Alaskandarani M, Hamzi B, Louis AA. Exploring Simplicity Bias in 1D Dynamical Systems. ENTROPY (BASEL, SWITZERLAND) 2024; 26:426. [PMID: 38785675 PMCID: PMC11119897 DOI: 10.3390/e26050426] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/23/2024] [Revised: 05/11/2024] [Accepted: 05/13/2024] [Indexed: 05/25/2024]
Abstract
Arguments inspired by algorithmic information theory predict an inverse relation between the probability and complexity of output patterns in a wide range of input-output maps. This phenomenon is known as simplicity bias. By viewing the parameters of dynamical systems as inputs, and the resulting (digitised) trajectories as outputs, we study simplicity bias in the logistic map, Gauss map, sine map, Bernoulli map, and tent map. We find that the logistic map, Gauss map, and sine map all exhibit simplicity bias upon sampling of map initial values and parameter values, but the Bernoulli map and tent map do not. The simplicity bias upper bound on the output pattern probability is used to make a priori predictions regarding the probability of output patterns. In some cases, the predictions are surprisingly accurate, given that almost no details of the underlying dynamical systems are assumed. More generally, we argue that studying probability-complexity relationships may be a useful tool when studying patterns in dynamical systems.
Collapse
Affiliation(s)
- Kamal Dingle
- Centre for Applied Mathematics and Bioinformatics, Department of Mathematics and Natural Sciences, Gulf University for Science and Technology, Hawally 32093, Kuwait;
- Rudolf Peierls Centre for Theoretical Physics, University of Oxford, Parks Road, Oxford OX1 3PU, UK;
| | - Mohammad Alaskandarani
- Centre for Applied Mathematics and Bioinformatics, Department of Mathematics and Natural Sciences, Gulf University for Science and Technology, Hawally 32093, Kuwait;
| | - Boumediene Hamzi
- Department of Computing and Mathematical Sciences, California Institute of Technology, Caltech, CA 91125, USA
- The Alan Turing Institute, London NW1 2DB, UK
| | - Ard A. Louis
- Rudolf Peierls Centre for Theoretical Physics, University of Oxford, Parks Road, Oxford OX1 3PU, UK;
| |
Collapse
|
2
|
Reznikova Z. Information Theory Opens New Dimensions in Experimental Studies of Animal Behaviour and Communication. Animals (Basel) 2023; 13:ani13071174. [PMID: 37048430 PMCID: PMC10093743 DOI: 10.3390/ani13071174] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/27/2023] [Revised: 03/15/2023] [Accepted: 03/24/2023] [Indexed: 03/29/2023] Open
Abstract
Over the last 40–50 years, ethology has become increasingly quantitative and computational. However, when analysing animal behavioural sequences, researchers often need help finding an adequate model to assess certain characteristics of these sequences while using a relatively small number of parameters. In this review, I demonstrate that the information theory approaches based on Shannon entropy and Kolmogorov complexity can furnish effective tools to analyse and compare animal natural behaviours. In addition to a comparative analysis of stereotypic behavioural sequences, information theory can provide ideas for particular experiments on sophisticated animal communications. In particular, it has made it possible to discover the existence of a developed symbolic “language” in leader-scouting ant species based on the ability of these ants to transfer abstract information about remote events.
Collapse
|
3
|
Silva JM, Qi W, Pinho AJ, Pratas D. AlcoR: alignment-free simulation, mapping, and visualization of low-complexity regions in biological data. Gigascience 2022; 12:giad101. [PMID: 38091509 PMCID: PMC10716826 DOI: 10.1093/gigascience/giad101] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/28/2023] [Revised: 09/29/2023] [Accepted: 11/07/2023] [Indexed: 12/18/2023] Open
Abstract
BACKGROUND Low-complexity data analysis is the area that addresses the search and quantification of regions in sequences of elements that contain low-complexity or repetitive elements. For example, these can be tandem repeats, inverted repeats, homopolymer tails, GC-biased regions, similar genes, and hairpins, among many others. Identifying these regions is crucial because of their association with regulatory and structural characteristics. Moreover, their identification provides positional and quantity information where standard assembly methodologies face significant difficulties because of substantial higher depth coverage (mountains), ambiguous read mapping, or where sequencing or reconstruction defects may occur. However, the capability to distinguish low-complexity regions (LCRs) in genomic and proteomic sequences is a challenge that depends on the model's ability to find them automatically. Low-complexity patterns can be implicit through specific or combined sources, such as algorithmic or probabilistic, and recurring to different spatial distances-namely, local, medium, or distant associations. FINDINGS This article addresses the challenge of automatically modeling and distinguishing LCRs, providing a new method and tool (AlcoR) for efficient and accurate segmentation and visualization of these regions in genomic and proteomic sequences. The method enables the use of models with different memories, providing the ability to distinguish local from distant low-complexity patterns. The method is reference and alignment free, providing additional methodologies for testing, including a highly flexible simulation method for generating biological sequences (DNA or protein) with different complexity levels, sequence masking, and a visualization tool for automatic computation of the LCR maps into an ideogram style. We provide illustrative demonstrations using synthetic, nearly synthetic, and natural sequences showing the high efficiency and accuracy of AlcoR. As large-scale results, we use AlcoR to unprecedentedly provide a whole-chromosome low-complexity map of a recent complete human genome and the haplotype-resolved chromosome pairs of a heterozygous diploid African cassava cultivar. CONCLUSIONS The AlcoR method provides the ability of fast sequence characterization through data complexity analysis, ideally for scenarios entangling the presence of new or unknown sequences. AlcoR is implemented in C language using multithreading to increase the computational speed, is flexible for multiple applications, and does not contain external dependencies. The tool accepts any sequence in FASTA format. The source code is freely provided at https://github.com/cobilab/alcor.
Collapse
Affiliation(s)
- Jorge M Silva
- IEETA, Institute of Electronics and Informatics Engineering of Aveiro, and LASI, Intelligent Systems Associate Laboratory, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
- Department of Electronics Telecommunications and Informatics, University of Aveiro, Campus Universitario de Santiago, 3810-193, Aveiro, Portugal
| | - Weihong Qi
- Functional Genomics Center Zurich, ETH Zurich and University of Zurich, Winterthurerstrasse, 190, 8057, Zurich, Switzerland
- SIB, Swiss Institute of Bioinformatics, 1202, Geneva, Switzerland
| | - Armando J Pinho
- IEETA, Institute of Electronics and Informatics Engineering of Aveiro, and LASI, Intelligent Systems Associate Laboratory, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
- Department of Electronics Telecommunications and Informatics, University of Aveiro, Campus Universitario de Santiago, 3810-193, Aveiro, Portugal
| | - Diogo Pratas
- IEETA, Institute of Electronics and Informatics Engineering of Aveiro, and LASI, Intelligent Systems Associate Laboratory, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
- Department of Electronics Telecommunications and Informatics, University of Aveiro, Campus Universitario de Santiago, 3810-193, Aveiro, Portugal
- Department of Virology, University of Helsinki, Haartmaninkatu, 3, 00014 Helsinki, Finland
| |
Collapse
|
4
|
Dingle K, Novev JK, Ahnert SE, Louis AA. Predicting phenotype transition probabilities via conditional algorithmic probability approximations. J R Soc Interface 2022; 19:20220694. [PMID: 36514888 PMCID: PMC9748496 DOI: 10.1098/rsif.2022.0694] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/15/2022] Open
Abstract
Unravelling the structure of genotype-phenotype (GP) maps is an important problem in biology. Recently, arguments inspired by algorithmic information theory (AIT) and Kolmogorov complexity have been invoked to uncover simplicity bias in GP maps, an exponentially decaying upper bound in phenotype probability with the increasing phenotype descriptional complexity. This means that phenotypes with many genotypes assigned via the GP map must be simple, while complex phenotypes must have few genotypes assigned. Here, we use similar arguments to bound the probability P(x → y) that phenotype x, upon random genetic mutation, transitions to phenotype y. The bound is [Formula: see text], where [Formula: see text] is the estimated conditional complexity of y given x, quantifying how much extra information is required to make y given access to x. This upper bound is related to the conditional form of algorithmic probability from AIT. We demonstrate the practical applicability of our derived bound by predicting phenotype transition probabilities (and other related quantities) in simulations of RNA and protein secondary structures. Our work contributes to a general mathematical understanding of GP maps and may facilitate the prediction of transition probabilities directly from examining phenotype themselves, without utilizing detailed knowledge of the GP map.
Collapse
Affiliation(s)
- Kamaludin Dingle
- Department of Chemical Engineering and Biotechnology, Cambridge University, Cambridge CB2 1TN, UK,Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA 91125, USA,Department of Mathematics and Natural Sciences, Centre for Applied Mathematics and Bioinformatics (CAMB), Gulf University for Science and Technology, 32093, Kuwait
| | - Javor K. Novev
- Department of Chemical Engineering and Biotechnology, Cambridge University, Cambridge CB2 1TN, UK
| | - Sebastian E. Ahnert
- Department of Chemical Engineering and Biotechnology, Cambridge University, Cambridge CB2 1TN, UK
| | - Ard A. Louis
- Department of Physics, Rudolf Peierls Centre for Theoretical Physics, Oxford University, Oxford OX1 2JD, UK
| |
Collapse
|
5
|
Comparative Study on Feature Selection in Protein Structure and Function Prediction. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2022; 2022:1650693. [PMID: 36267316 PMCID: PMC9578875 DOI: 10.1155/2022/1650693] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/27/2022] [Accepted: 09/14/2022] [Indexed: 11/18/2022]
Abstract
Many effective methods extract and fuse different protein features to study the relationship between protein sequence, structure, and function, but different methods have preferences in solving the research of protein structure and function, which requires selecting valuable and contributing features to design more effective prediction methods. This work mainly focused on the feature selection methods in the study of protein structure and function, and systematically compared and analyzed the efficiency of different feature selection methods in the prediction of protein structures, protein disorders, protein molecular chaperones, and protein solubility. The results show that the feature selection method based on nonlinear SVM performs best in protein structure prediction, protein solubility prediction, protein molecular chaperone prediction, and protein solubility prediction. After selection, the accuracy of features is improved by 13.16% ~71%, especially the Kmer features and PSSM features of proteins.
Collapse
|
6
|
de la Torre-Abaitua G, Lago-Fernández LF, Arroyo D. A Compression-Based Method for Detecting Anomalies in Textual Data. ENTROPY 2021; 23:e23050618. [PMID: 34065721 PMCID: PMC8156803 DOI: 10.3390/e23050618] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/31/2021] [Revised: 05/12/2021] [Accepted: 05/12/2021] [Indexed: 11/16/2022]
Abstract
Nowadays, information and communications technology systems are fundamental assets of our social and economical model, and thus they should be properly protected against the malicious activity of cybercriminals. Defence mechanisms are generally articulated around tools that trace and store information in several ways, the simplest one being the generation of plain text files coined as security logs. Such log files are usually inspected, in a semi-automatic way, by security analysts to detect events that may affect system integrity, confidentiality and availability. On this basis, we propose a parameter-free method to detect security incidents from structured text regardless its nature. We use the Normalized Compression Distance to obtain a set of features that can be used by a Support Vector Machine to classify events from a heterogeneous cybersecurity environment. In particular, we explore and validate the application of our method in four different cybersecurity domains: HTTP anomaly identification, spam detection, Domain Generation Algorithms tracking and sentiment analysis. The results obtained show the validity and flexibility of our approach in different security scenarios with a low configuration burden.
Collapse
Affiliation(s)
- Gonzalo de la Torre-Abaitua
- Departamento de Ingeniería Informática, Escuela Politécnica Superior, Universidad Autónoma de Madrid, 28049 Madrid, Spain; (G.d.l.T.-A.); (L.F.L.-F.)
| | - Luis Fernando Lago-Fernández
- Departamento de Ingeniería Informática, Escuela Politécnica Superior, Universidad Autónoma de Madrid, 28049 Madrid, Spain; (G.d.l.T.-A.); (L.F.L.-F.)
| | - David Arroyo
- Institute of Physical and Information Technologies (ITEFI), Spanish National Research Council (CSIC), 28006 Madrid, Spain
- Correspondence:
| |
Collapse
|
7
|
Wang Y, Xu Y, Yang Z, Liu X, Dai Q. Using Recursive Feature Selection with Random Forest to Improve Protein Structural Class Prediction for Low-Similarity Sequences. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2021; 2021:5529389. [PMID: 34055035 PMCID: PMC8123985 DOI: 10.1155/2021/5529389] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 02/07/2021] [Accepted: 04/28/2021] [Indexed: 11/20/2022]
Abstract
Many combinations of protein features are used to improve protein structural class prediction, but the information redundancy is often ignored. In order to select the important features with strong classification ability, we proposed a recursive feature selection with random forest to improve protein structural class prediction. We evaluated the proposed method with four experiments and compared it with the available competing prediction methods. The results indicate that the proposed feature selection method effectively improves the efficiency of protein structural class prediction. Only less than 5% features are used, but the prediction accuracy is improved by 4.6-13.3%. We further compared different protein features and found that the predicted secondary structural features achieve the best performance. This understanding can be used to design more powerful prediction methods for the protein structural class.
Collapse
Affiliation(s)
- Yaoxin Wang
- College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, China
| | - Yingjie Xu
- Qixin School, Zhejiang Sci-Tech University, Hangzhou 310018, China
| | - Zhenyu Yang
- College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, China
| | - Xiaoqing Liu
- College of Sciences, Hangzhou Dianzi University, Hangzhou 310018, China
| | - Qi Dai
- College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, China
| |
Collapse
|
8
|
AC2: An Efficient Protein Sequence Compression Tool Using Artificial Neural Networks and Cache-Hash Models. ENTROPY 2021; 23:e23050530. [PMID: 33925812 PMCID: PMC8146440 DOI: 10.3390/e23050530] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 02/01/2021] [Revised: 04/19/2021] [Accepted: 04/22/2021] [Indexed: 12/28/2022]
Abstract
Recently, the scientific community has witnessed a substantial increase in the generation of protein sequence data, triggering emergent challenges of increasing importance, namely efficient storage and improved data analysis. For both applications, data compression is a straightforward solution. However, in the literature, the number of specific protein sequence compressors is relatively low. Moreover, these specialized compressors marginally improve the compression ratio over the best general-purpose compressors. In this paper, we present AC2, a new lossless data compressor for protein (or amino acid) sequences. AC2 uses a neural network to mix experts with a stacked generalization approach and individual cache-hash memory models to the highest-context orders. Compared to the previous compressor (AC), we show gains of 2–9% and 6–7% in reference-free and reference-based modes, respectively. These gains come at the cost of three times slower computations. AC2 also improves memory usage against AC, with requirements about seven times lower, without being affected by the sequences’ input size. As an analysis application, we use AC2 to measure the similarity between each SARS-CoV-2 protein sequence with each viral protein sequence from the whole UniProt database. The results consistently show higher similarity to the pangolin coronavirus, followed by the bat and human coronaviruses, contributing with critical results to a current controversial subject. AC2 is available for free download under GPLv3 license.
Collapse
|
9
|
Agüero-Chapin G, Galpert D, Molina-Ruiz R, Ancede-Gallardo E, Pérez-Machado G, De la Riva GA, Antunes A. Graph Theory-Based Sequence Descriptors as Remote Homology Predictors. Biomolecules 2019; 10:E26. [PMID: 31878100 PMCID: PMC7022958 DOI: 10.3390/biom10010026] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/19/2019] [Revised: 12/16/2019] [Accepted: 12/18/2019] [Indexed: 12/23/2022] Open
Abstract
Alignment-free (AF) methodologies have increased in popularity in the last decades as alternative tools to alignment-based (AB) algorithms for performing comparative sequence analyses. They have been especially useful to detect remote homologs within the twilight zone of highly diverse gene/protein families and superfamilies. The most popular alignment-free methodologies, as well as their applications to classification problems, have been described in previous reviews. Despite a new set of graph theory-derived sequence/structural descriptors that have been gaining relevance in the detection of remote homology, they have been omitted as AF predictors when the topic is addressed. Here, we first go over the most popular AF approaches used for detecting homology signals within the twilight zone and then bring out the state-of-the-art tools encoding graph theory-derived sequence/structure descriptors and their success for identifying remote homologs. We also highlight the tendency of integrating AF features/measures with the AB ones, either into the same prediction model or by assembling the predictions from different algorithms using voting/weighting strategies, for improving the detection of remote signals. Lastly, we briefly discuss the efforts made to scale up AB and AF features/measures for the comparison of multiple genomes and proteomes. Alongside the achieved experiences in remote homology detection by both the most popular AF tools and other less known ones, we provide our own using the graphical-numerical methodologies, MARCH-INSIDE, TI2BioP, and ProtDCal. We also present a new Python-based tool (SeqDivA) with a friendly graphical user interface (GUI) for delimiting the twilight zone by using several similar criteria.
Collapse
Affiliation(s)
- Guillermin Agüero-Chapin
- CIIMAR/CIMAR, Interdisciplinary Centre of Marine and Environmental Research, University of Porto, Terminal de Cruzeiros do Porto de Leixões, Av. General Norton de Matos s/n 4450-208 Porto, Portugal
- Department of Biology, Faculty of Sciences, University of Porto, Rua do Campo Alegre, 4169-007 Porto, Portugal
| | - Deborah Galpert
- Departamento de Ciencia de la Computación. Universidad Central ¨Marta Abreu¨ de Las Villas (UCLV), Santa Clara 54830, Cuba;
| | - Reinaldo Molina-Ruiz
- Centro de Bioactivos Químicos (CBQ), Universidad Central ¨Marta Abreu¨ de Las Villas (UCLV), Santa Clara 54830, Cuba;
| | - Evys Ancede-Gallardo
- Programa de Doctorado en Fisicoquímica Molecular, Facultad de Ciencias Exactas, Universidad Andrés Bello, Av. República 239, Santiago 8370146, Chile;
| | - Gisselle Pérez-Machado
- EpiDisease S.L. Spin-Off of Centro de Investigación Biomédica en Red de Enfermedades Raras (CIBERER), 46980 Valencia, Spain;
| | - Gustavo A. De la Riva
- Laboratorio de Biotecnología Aplicada S. de R.L. de C.V., GRECA Inc., Carretera La Piedad-Carapán, km 3.5, La Piedad, Michoacán 59300, Mexico;
- Tecnológico Nacional de México, Instituto Tecnológico de la Piedad, Av. Ricardo Guzmán Romero, Santa Fe, La Piedad de Cavadas, Michoacán 59370, Mexico
| | - Agostinho Antunes
- CIIMAR/CIMAR, Interdisciplinary Centre of Marine and Environmental Research, University of Porto, Terminal de Cruzeiros do Porto de Leixões, Av. General Norton de Matos s/n 4450-208 Porto, Portugal
- Department of Biology, Faculty of Sciences, University of Porto, Rua do Campo Alegre, 4169-007 Porto, Portugal
| |
Collapse
|
10
|
Filatov G, Bauwens B, Kertész-Farkas A. LZW-Kernel: fast kernel utilizing variable length code blocks from LZW compressors for protein sequence classification. Bioinformatics 2019; 34:3281-3288. [PMID: 29741583 DOI: 10.1093/bioinformatics/bty349] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/15/2018] [Accepted: 05/03/2018] [Indexed: 11/13/2022] Open
Abstract
Motivation Bioinformatics studies often rely on similarity measures between sequence pairs, which often pose a bottleneck in large-scale sequence analysis. Results Here, we present a new convolutional kernel function for protein sequences called the Lempel-Ziv-Welch (LZW)-Kernel. It is based on code words identified with the LZW universal text compressor. The LZW-Kernel is an alignment-free method, it is always symmetric, is positive, always provides 1.0 for self-similarity and it can directly be used with Support Vector Machines (SVMs) in classification problems, contrary to normalized compression distance, which often violates the distance metric properties in practice and requires further techniques to be used with SVMs. The LZW-Kernel is a one-pass algorithm, which makes it particularly plausible for big data applications. Our experimental studies on remote protein homology detection and protein classification tasks reveal that the LZW-Kernel closely approaches the performance of the Local Alignment Kernel (LAK) and the SVM-pairwise method combined with Smith-Waterman (SW) scoring at a fraction of the time. Moreover, the LZW-Kernel outperforms the SVM-pairwise method when combined with Basic Local Alignment Search Tool (BLAST) scores, which indicates that the LZW code words might be a better basis for similarity measures than local alignment approximations found with BLAST. In addition, the LZW-Kernel outperforms n-gram based mismatch kernels, hidden Markov model based SAM and Fisher kernel and protein family based PSI-BLAST, among others. Further advantages include the LZW-Kernel's reliance on a simple idea, its ease of implementation, and its high speed, three times faster than BLAST and several magnitudes faster than SW or LAK in our tests. Availability and implementation LZW-Kernel is implemented as a standalone C code and is a free open-source program distributed under GPLv3 license and can be downloaded from https://github.com/kfattila/LZW-Kernel. Supplementary information Supplementary data are available at Bioinformatics Online.
Collapse
Affiliation(s)
- Gleb Filatov
- Faculty of Computer Science, Department of Data Analysis and Artificial Intelligence, Moscow, Russia
| | - Bruno Bauwens
- Faculty of Computer Science, Department of Big Data and Information Retrieval, Moscow, Russia
| | - Attila Kertész-Farkas
- Faculty of Computer Science, Department of Data Analysis and Artificial Intelligence, Moscow, Russia
| |
Collapse
|
11
|
Monroe TO, Hill MC, Morikawa Y, Leach JP, Heallen T, Cao S, Krijger PHL, de Laat W, Wehrens XHT, Rodney GG, Martin JF. YAP Partially Reprograms Chromatin Accessibility to Directly Induce Adult Cardiogenesis In Vivo. Dev Cell 2019; 48:765-779.e7. [PMID: 30773489 DOI: 10.1016/j.devcel.2019.01.017] [Citation(s) in RCA: 138] [Impact Index Per Article: 27.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/12/2018] [Revised: 12/10/2018] [Accepted: 01/17/2019] [Indexed: 01/22/2023]
Abstract
Specialized adult somatic cells, such as cardiomyocytes (CMs), are highly differentiated with poor renewal capacity, an integral reason underlying organ failure in disease and aging. Among the least renewable cells in the human body, CMs renew approximately 1% annually. Consistent with poor CM turnover, heart failure is the leading cause of death. Here, we show that an active version of the Hippo pathway effector YAP, termed YAP5SA, partially reprograms adult mouse CMs to a more fetal and proliferative state. One week after induction, 19% of CMs that enter S-phase do so twice, CM number increases by 40%, and YAP5SA lineage CMs couple to pre-existing CMs. Genomic studies showed that YAP5SA increases chromatin accessibility and expression of fetal genes, partially reprogramming long-lived somatic cells in vivo to a primitive, fetal-like, and proliferative state.
Collapse
Affiliation(s)
- Tanner O Monroe
- Department of Molecular Physiology and Biophysics, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA
| | - Matthew C Hill
- Program in Developmental Biology, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA
| | - Yuka Morikawa
- Cardiomyocyte Renewal Laboratory, Texas Heart Institute, 6770 Bertner Avenue, Houston, TX 77030, USA
| | - John P Leach
- Department of Molecular Physiology and Biophysics, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA
| | - Todd Heallen
- Cardiomyocyte Renewal Laboratory, Texas Heart Institute, 6770 Bertner Avenue, Houston, TX 77030, USA
| | - Shuyi Cao
- Department of Molecular Physiology and Biophysics, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA; Cardiovascular Research Institute, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA
| | - Peter H L Krijger
- Oncode Institute, Hubrecht Institute-KNAW, Utrecht, the Netherlands; University Medical Center Utrecht, Utrecht, the Netherlands
| | - Wouter de Laat
- Oncode Institute, Hubrecht Institute-KNAW, Utrecht, the Netherlands; University Medical Center Utrecht, Utrecht, the Netherlands
| | - Xander H T Wehrens
- Department of Molecular Physiology and Biophysics, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA; Cardiovascular Research Institute, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA
| | - George G Rodney
- Department of Molecular Physiology and Biophysics, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA; Cardiovascular Research Institute, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA
| | - James F Martin
- Department of Molecular Physiology and Biophysics, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA; Cardiomyocyte Renewal Laboratory, Texas Heart Institute, 6770 Bertner Avenue, Houston, TX 77030, USA; Program in Developmental Biology, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA; Cardiovascular Research Institute, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA.
| |
Collapse
|
12
|
Zhu XJ, Feng CQ, Lai HY, Chen W, Hao L. Predicting protein structural classes for low-similarity sequences by evaluating different features. Knowl Based Syst 2019. [DOI: 10.1016/j.knosys.2018.10.007] [Citation(s) in RCA: 69] [Impact Index Per Article: 13.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/15/2022]
|
13
|
Devine S. Algorithmic Entropy and Landauer's Principle Link Microscopic System Behaviour to the Thermodynamic Entropy. ENTROPY (BASEL, SWITZERLAND) 2018; 20:e20100798. [PMID: 33265885 PMCID: PMC7512359 DOI: 10.3390/e20100798] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/12/2018] [Revised: 10/14/2018] [Accepted: 10/14/2018] [Indexed: 06/12/2023]
Abstract
Algorithmic information theory in conjunction with Landauer's principle can quantify the cost of maintaining a reversible real-world computational system distant from equilibrium. As computational bits are conserved in an isolated reversible system, bit flows can be used to track the way a highly improbable configuration trends toward a highly probable equilibrium configuration. In an isolated reversible system, all microstates within a thermodynamic macrostate have the same algorithmic entropy. However, from a thermodynamic perspective, when these bits primarily specify stored energy states, corresponding to a fluctuation from the most probable set of states, they represent "potential entropy". However, these bits become "realised entropy" when, under the second law of thermodynamics, they become bits specifying the momentum degrees of freedom. The distance of a fluctuation from equilibrium is identified as the number of computational bits that move from stored energy states to momentum states to define a highly probable or typical equilibrium state. When reversibility applies, from Landauer's principle, it costs k B l n 2 T Joules to move a bit within the system from stored energy states to the momentum states.
Collapse
Affiliation(s)
- Sean Devine
- School of Management, Victoria University of Wellington, P.O. Box 600, Wellington 6140, New Zealand
| |
Collapse
|
14
|
Comparison of Compression-Based Measures with Application to the Evolution of Primate Genomes. ENTROPY 2018; 20:e20060393. [PMID: 33265483 PMCID: PMC7512912 DOI: 10.3390/e20060393] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/03/2018] [Revised: 05/16/2018] [Accepted: 05/21/2018] [Indexed: 11/26/2022]
Abstract
An efficient DNA compressor furnishes an approximation to measure and compare information quantities present in, between and across DNA sequences, regardless of the characteristics of the sources. In this paper, we compare directly two information measures, the Normalized Compression Distance (NCD) and the Normalized Relative Compression (NRC). These measures answer different questions; the NCD measures how similar both strings are (in terms of information content) and the NRC (which, in general, is nonsymmetric) indicates the fraction of one of them that cannot be constructed using information from the other one. This leads to the problem of finding out which measure (or question) is more suitable for the answer we need. For computing both, we use a state of the art DNA sequence compressor that we benchmark with some top compressors in different compression modes. Then, we apply the compressor on DNA sequences with different scales and natures, first using synthetic sequences and then on real DNA sequences. The last include mitochondrial DNA (mtDNA), messenger RNA (mRNA) and genomic DNA (gDNA) of seven primates. We provide several insights into evolutionary acceleration rates at different scales, namely, the observation and confirmation across the whole genomes of a higher variation rate of the mtDNA relative to the gDNA. We also show the importance of relative compression for localizing similar information regions using mtDNA.
Collapse
|
15
|
Dingle K, Camargo CQ, Louis AA. Input-output maps are strongly biased towards simple outputs. Nat Commun 2018; 9:761. [PMID: 29472533 PMCID: PMC5823903 DOI: 10.1038/s41467-018-03101-6] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/13/2017] [Accepted: 01/19/2018] [Indexed: 11/09/2022] Open
Abstract
Many systems in nature can be described using discrete input–output maps. Without knowing details about a map, there may seem to be no a priori reason to expect that a randomly chosen input would be more likely to generate one output over another. Here, by extending fundamental results from algorithmic information theory, we show instead that for many real-world maps, the a priori probability P(x) that randomly sampled inputs generate a particular output x decays exponentially with the approximate Kolmogorov complexity \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$\tilde K(x)$$\end{document}K~(x) of that output. These input–output maps are biased towards simplicity. We derive an upper bound P(x) ≲ \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$2^{ - a\tilde K(x) - b}$$\end{document}2-aK~(x)-b, which is tight for most inputs. The constants a and b, as well as many properties of P(x), can be predicted with minimal knowledge of the map. We explore this strong bias towards simple outputs in systems ranging from the folding of RNA secondary structures to systems of coupled ordinary differential equations to a stochastic financial trading model. Algorithmic information theory measures the complexity of strings. Here the authors provide a practical bound on the probability that a randomly generated computer program produces a given output of a given complexity and apply this upper bound to RNA folding and financial trading algorithms.
Collapse
Affiliation(s)
- Kamaludin Dingle
- Rudolf Peierls Centre for Theoretical Physics, University of Oxford, Oxford, OX1 3NP, UK. .,Systems Biology DTC, University of Oxford, Oxford, OX1 3QU, UK. .,International Centre for Applied Mathematics and Computational Bioengineering, Department of Mathematics and Natural Sciences, Gulf University for Science and Technology, P.O. Box 7207, Hawally 32093, Mubarak Al-Abdullah, Kuwait.
| | - Chico Q Camargo
- Rudolf Peierls Centre for Theoretical Physics, University of Oxford, Oxford, OX1 3NP, UK.,Systems Biology DTC, University of Oxford, Oxford, OX1 3QU, UK
| | - Ard A Louis
- Rudolf Peierls Centre for Theoretical Physics, University of Oxford, Oxford, OX1 3NP, UK.
| |
Collapse
|
16
|
Thankachan SV, Apostolico A, Aluru S. A Provably Efficient Algorithm for the k-Mismatch Average Common Substring Problem. J Comput Biol 2016; 23:472-82. [PMID: 27058840 DOI: 10.1089/cmb.2015.0235] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
Alignment-free sequence comparison methods are attracting persistent interest, driven by data-intensive applications in genome-wide molecular taxonomy and phylogenetic reconstruction. Among all the methods based on substring composition, the average common substring (ACS) measure admits a straightforward linear time sequence comparison algorithm, while yielding impressive results in multiple applications. An important direction of this research is to extend the approach to permit a bounded edit/hamming distance between substrings, so as to reflect more accurately the evolutionary process. To date, however, algorithms designed to incorporate k ≥ 1 mismatches have O(n(2)) worst-case time complexity, where n is the total length of the input sequences. On the other hand, accounting for mismatches has shown to lead to much improved classification, while heuristics can improve practical performance. In this article, we close the gap by presenting the first provably efficient algorithm for the k-mismatch average common string (ACSk) problem that takes O(n) space and O(n log(k) n) time in the worst case for any constant k. Our method extends the generalized suffix tree model to incorporate a carefully selected bounded set of perturbed suffixes, and can be applied to other complex approximate sequence matching problems.
Collapse
Affiliation(s)
| | - Alberto Apostolico
- College of Computing, Georgia Institute of Technology , Atlanta, Georgia
| | - Srinivas Aluru
- College of Computing, Georgia Institute of Technology , Atlanta, Georgia
| |
Collapse
|
17
|
Bywater RP. Prediction of protein structural features from sequence data based on Shannon entropy and Kolmogorov complexity. PLoS One 2015; 10:e0119306. [PMID: 25856073 PMCID: PMC4391790 DOI: 10.1371/journal.pone.0119306] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/22/2013] [Accepted: 01/29/2015] [Indexed: 11/21/2022] Open
Abstract
While the genome for a given organism stores the information necessary for the organism to function and flourish it is the proteins that are encoded by the genome that perhaps more than anything else characterize the phenotype for that organism. It is therefore not surprising that one of the many approaches to understanding and predicting protein folding and properties has come from genomics and more specifically from multiple sequence alignments. In this work I explore ways in which data derived from sequence alignment data can be used to investigate in a predictive way three different aspects of protein structure: secondary structures, inter-residue contacts and the dynamics of switching between different states of the protein. In particular the use of Kolmogorov complexity has identified a novel pathway towards achieving these goals.
Collapse
|
18
|
Wang J, Wang C, Cao J, Liu X, Yao Y, Dai Q. Prediction of protein structural classes for low-similarity sequences using reduced PSSM and position-based secondary structural features. Gene 2015; 554:241-8. [DOI: 10.1016/j.gene.2014.10.037] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/13/2014] [Revised: 10/19/2014] [Accepted: 10/22/2014] [Indexed: 10/24/2022]
|
19
|
Struck D, Lawyer G, Ternes AM, Schmit JC, Bercoff DP. COMET: adaptive context-based modeling for ultrafast HIV-1 subtype identification. Nucleic Acids Res 2014; 42:e144. [PMID: 25120265 PMCID: PMC4191385 DOI: 10.1093/nar/gku739] [Citation(s) in RCA: 255] [Impact Index Per Article: 25.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
Viral sequence classification has wide applications in clinical, epidemiological, structural and functional categorization studies. Most existing approaches rely on an initial alignment step followed by classification based on phylogenetic or statistical algorithms. Here we present an ultrafast alignment-free subtyping tool for human immunodeficiency virus type one (HIV-1) adapted from Prediction by Partial Matching compression. This tool, named COMET, was compared to the widely used phylogeny-based REGA and SCUEAL tools using synthetic and clinical HIV data sets (1,090,698 and 10,625 sequences, respectively). COMET's sensitivity and specificity were comparable to or higher than the two other subtyping tools on both data sets for known subtypes. COMET also excelled in detecting and identifying new recombinant forms, a frequent feature of the HIV epidemic. Runtime comparisons showed that COMET was almost as fast as USEARCH. This study demonstrates the advantages of alignment-free classification of viral sequences, which feature high rates of variation, recombination and insertions/deletions. COMET is free to use via an online interface.
Collapse
Affiliation(s)
- Daniel Struck
- Laboratory of Retrovirology, CRP-Santé, 84, Val Fleuri, L-1526, Luxembourg
| | - Glenn Lawyer
- Department of Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Campus E1 4, 66123 Saarbrücken, Germany
| | - Anne-Marie Ternes
- Laboratory of Retrovirology, CRP-Santé, 84, Val Fleuri, L-1526, Luxembourg
| | - Jean-Claude Schmit
- Laboratory of Retrovirology, CRP-Santé, 84, Val Fleuri, L-1526, Luxembourg
| | | |
Collapse
|
20
|
Pinello L, Lo Bosco G, Yuan GC. Applications of alignment-free methods in epigenomics. Brief Bioinform 2014; 15:419-30. [PMID: 24197932 PMCID: PMC4017331 DOI: 10.1093/bib/bbt078] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2013] [Accepted: 10/28/2013] [Indexed: 12/16/2022] Open
Abstract
Epigenetic mechanisms play an important role in the regulation of cell type-specific gene activities, yet how epigenetic patterns are established and maintained remains poorly understood. Recent studies have supported a role of DNA sequences in recruitment of epigenetic regulators. Alignment-free methods have been applied to identify distinct sequence features that are associated with epigenetic patterns and to predict epigenomic profiles. Here, we review recent advances in such applications, including the methods to map DNA sequence to feature space, sequence comparison and prediction models. Computational studies using these methods have provided important insights into the epigenetic regulatory mechanisms.
Collapse
|
21
|
Wang J, Li Y, Liu X, Dai Q, Yao Y, He P. High-accuracy prediction of protein structural classes using PseAA structural properties and secondary structural patterns. Biochimie 2014; 101:104-12. [PMID: 24412731 DOI: 10.1016/j.biochi.2013.12.021] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/22/2013] [Accepted: 12/30/2013] [Indexed: 10/25/2022]
Abstract
Since introduction of PseAAs and functional domains, promising results have been achieved in protein structural class predication, but some challenges still exist in the representation of the PseAA structural correlation and structural domains. This paper proposed a high-accuracy prediction method using novel PseAA structural properties and secondary structural patterns, reflecting the long-range and local structural properties of the PseAAs and certain compact structural domains. The proposed prediction method was tested against the competing prediction methods with four experiments. The experiment results indicate that the proposed method achieved the best performance. Its overall accuracies for datasets 25 PDB, D640, FC699 and 1189 are 88.8%, 90.9%, 96.4% and 87.4%, which are 4.5%, 7.6%, 2% and 3.9% higher than the existing best-performing method. This understanding can be used to guide development of more powerful methods for protein structural class prediction. The software and supplement material are freely available at http://bioinfo.zstu.edu.cn/PseAA-SSP.
Collapse
Affiliation(s)
- Junru Wang
- College of Mechanical Engineering and Automation, Zhejiang Sci-Tech University, Hangzhou 310018, People's Republic of China
| | - Yan Li
- College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, People's Republic of China
| | - Xiaoqing Liu
- College of Sciences, Hangzhou Dianzi University, Hangzhou 310018, People's Republic of China
| | - Qi Dai
- College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, People's Republic of China.
| | - Yuhua Yao
- College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, People's Republic of China
| | - Pingan He
- College of Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, People's Republic of China
| |
Collapse
|
22
|
Giancarlo R, Rombo SE, Utro F. Compressive biological sequence analysis and archival in the era of high-throughput sequencing technologies. Brief Bioinform 2013; 15:390-406. [DOI: 10.1093/bib/bbt088] [Citation(s) in RCA: 41] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/15/2023] Open
|
23
|
Deorowicz S, Grabowski S. Data compression for sequencing data. Algorithms Mol Biol 2013; 8:25. [PMID: 24252160 PMCID: PMC3868316 DOI: 10.1186/1748-7188-8-25] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/25/2013] [Accepted: 09/25/2013] [Indexed: 12/12/2022] Open
Abstract
Post-Sanger sequencing methods produce tons of data, and there is a general
agreement that the challenge to store and process them must be addressed
with data compression. In this review we first answer the question
“why compression” in a quantitative manner. Then we also answer
the questions “what” and “how”, by sketching the
fundamental compression ideas, describing the main sequencing data types and
formats, and comparing the specialized compression algorithms and tools.
Finally, we go back to the question “why compression” and give
other, perhaps surprising answers, demonstrating the pervasiveness of data
compression techniques in computational biology.
Collapse
|
24
|
Schwende I, Pham TD. Pattern recognition and probabilistic measures in alignment-free sequence analysis. Brief Bioinform 2013; 15:354-68. [PMID: 24096012 DOI: 10.1093/bib/bbt070] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/30/2022] Open
Abstract
With the massive production of genomic and proteomic data, the number of available biological sequences in databases has reached a level that is not feasible anymore for exact alignments even when just a fraction of all sequences is used. To overcome this inevitable time complexity, ultrafast alignment-free methods are studied. Within the past two decades, a broad variety of nonalignment methods have been proposed including dissimilarity measures on classical representations of sequences like k-words or Markov models. Furthermore, articles were published that describe distance measures on alternative representations such as compression complexity, spectral time series or chaos game representation. However, alignments are still the standard method for real world applications in biological sequence analysis, and the time efficient alignment-free approaches are usually applied in cases when the accustomed algorithms turn out to fail or be too inconvenient.
Collapse
Affiliation(s)
- Isabel Schwende
- PhD, Aizu Research Cluster for Medical Informatics and Engineering (ARC-Medical), Research Center for Advanced Information Science and Technology (CAIST), The University of Aizu, Aizuwakamatsu, Fukushima 965-8580, Japan.
| | | |
Collapse
|
25
|
Alignment free comparison: k word voting model and its applications. J Theor Biol 2013; 335:276-82. [DOI: 10.1016/j.jtbi.2013.06.037] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/22/2012] [Revised: 04/25/2013] [Accepted: 06/26/2013] [Indexed: 02/06/2023]
|
26
|
Keratin protein property based classification of mammals and non-mammals using machine learning techniques. Comput Biol Med 2013; 43:889-99. [DOI: 10.1016/j.compbiomed.2013.04.007] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/22/2011] [Revised: 04/08/2013] [Accepted: 04/09/2013] [Indexed: 11/21/2022]
|
27
|
Yang L, Zhang X, Wang T, Zhu H. Large local analysis of the unaligned genome and its application. J Comput Biol 2013; 20:19-29. [PMID: 23294269 DOI: 10.1089/cmb.2011.0052] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
We describe a novel method for the local analysis of complete genomes. A local distance measure called LODIST is proposed, which is based on the relationship between the longest common words and the shortest absent words of two genomes we compared. LODIST can perform better than local alignment when the local region is large enough to cover some recombination genes. A distance measure called SILD.k.t with resolution k and step t is derived by the integral LODISTs of whole genomes. It is shown that the algorithm for computing the LODISTs and SILD.k.t is linear, which is fast enough to consider the problem of the genome comparison. We verify this method by recognizing the subtypes of the HIV-1 complete genomes and genome segments.
Collapse
Affiliation(s)
- Lianping Yang
- College of Sciences, Northeastern University, Shenyang, China
| | | | | | | |
Collapse
|
28
|
Dai Q, Li Y, Liu X, Yao Y, Cao Y, He P. Comparison study on statistical features of predicted secondary structures for protein structural class prediction: From content to position. BMC Bioinformatics 2013; 14:152. [PMID: 23641706 PMCID: PMC3652764 DOI: 10.1186/1471-2105-14-152] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/26/2012] [Accepted: 04/03/2013] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Many content-based statistical features of secondary structural elements (CBF-PSSEs) have been proposed and achieved promising results in protein structural class prediction, but until now position distribution of the successive occurrences of an element in predicted secondary structure sequences hasn't been used. It is necessary to extract some appropriate position-based features of the secondary structural elements for prediction task. RESULTS We proposed some position-based features of predicted secondary structural elements (PBF-PSSEs) and assessed their intrinsic ability relative to the available CBF-PSSEs, which not only offers a systematic and quantitative experimental assessment of these statistical features, but also naturally complements the available comparison of the CBF-PSSEs. We also analyzed the performance of the CBF-PSSEs combined with the PBF-PSSE and further constructed a new combined feature set, PBF11CBF-PSSE. Based on these experiments, novel valuable guidelines for the use of PBF-PSSEs and CBF-PSSEs were obtained. CONCLUSIONS PBF-PSSEs and CBF-PSSEs have a compelling impact on protein structural class prediction. When combining with the PBF-PSSE, most of the CBF-PSSEs get a great improvement over the prediction accuracies, so the PBF-PSSEs and the CBF-PSSEs have to work closely so as to make significant and complementary contributions to protein structural class prediction. Besides, the proposed PBF-PSSE's performance is extremely sensitive to the choice of parameter k. In summary, our quantitative analysis verifies that exploring the position information of predicted secondary structural elements is a promising way to improve the abilities of protein structural class prediction.
Collapse
Affiliation(s)
- Qi Dai
- College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou, 310018, China
| | - Yan Li
- College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou, 310018, China
| | - Xiaoqing Liu
- College of Science, Hangzhou Dianzi University, Hangzhou, 310018, China
| | - Yuhua Yao
- College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou, 310018, China
| | - Yunjie Cao
- College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou, 310018, China
| | - Pingan He
- College of Science, Zhejiang Sci-Tech University, Hangzhou, 310018, China
| |
Collapse
|
29
|
La Rosa M, Fiannaca A, Rizzo R, Urso A. Alignment-free analysis of barcode sequences by means of compression-based methods. BMC Bioinformatics 2013; 14 Suppl 7:S4. [PMID: 23815444 PMCID: PMC3633054 DOI: 10.1186/1471-2105-14-s7-s4] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The key idea of DNA barcode initiative is to identify, for each group of species belonging to different kingdoms of life, a short DNA sequence that can act as a true taxon barcode. DNA barcode represents a valuable type of information that can be integrated with ecological, genetic, and morphological data in order to obtain a more consistent taxonomy. Recent studies have shown that, for the animal kingdom, the mitochondrial gene cytochrome c oxidase I (COI), about 650 bp long, can be used as a barcode sequence for identification and taxonomic purposes of animals. In the present work we aims at introducing the use of an alignment-free approach in order to make taxonomic analysis of barcode sequences. Our approach is based on the use of two compression-based versions of non-computable Universal Similarity Metric (USM) class of distances. Our purpose is to justify the employ of USM also for the analysis of short DNA barcode sequences, showing how USM is able to correctly extract taxonomic information among those kind of sequences. RESULTS We downloaded from Barcode of Life Data System (BOLD) database 30 datasets of barcode sequences belonging to different animal species. We built phylogenetic trees of every dataset, according to compression-based and classic evolutionary methods, and compared them in terms of topology preservation. In the experimental tests, we obtained scores with a percentage of similarity between evolutionary and compression-based trees between 80% and 100% for the most of datasets (94%). Moreover we carried out experimental tests using simulated barcode datasets composed of 100, 150, 200 and 500 sequences, each simulation replicated 25-fold. In this case, mean similarity scores between evolutionary and compression-based trees span between 83% and 99% for all simulated datasets. CONCLUSIONS In the present work we aims at introducing the use of an alignment-free approach in order to make taxonomic analysis of barcode sequences. Our approach is based on the use of two compression-based versions of non-computable Universal Similarity Metric (USM) class of distances. This way we demonstrate the reliability of compression-based methods even for the analysis of short barcode sequences. Compression-based methods, with their strong theoretical assumptions, may then represent a valid alignment-free and parameter-free approach for barcode studies.
Collapse
Affiliation(s)
- Massimo La Rosa
- ICAR-CNR, National Research Council of Italy, Viale delle Scienze Ed. 11, 90128, Palermo, Italy.
| | | | | | | |
Collapse
|
30
|
Dufourt J, Pouchin P, Peyret P, Brasset E, Vaury C. NucBase, an easy to use read mapper for small RNAs. Mob DNA 2013; 4:1. [PMID: 23276284 PMCID: PMC3554603 DOI: 10.1186/1759-8753-4-1] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/13/2012] [Accepted: 10/16/2012] [Indexed: 01/05/2023] Open
Abstract
UNLABELLED BACKGROUND High-throughput deep-sequencing technology has generated an unprecedented number of expressed sequence reads that offer the opportunity to get insight into biological systems. Several databases report the sequence of small regulatory RNAs which play a prominent role in the control of transposable elements (TE). However, the huge amount of data reported in these databases remains mostly unexplored because the available tools are hard for biologists to use. RESULTS Here we report NucBase, a new program designed to make an exhaustive search for sequence matches and to align short sequence reads from large nucleic acid databases to genomes or input sequences. NucBase includes a graphical interface which allows biologists to align sequences with ease and immediately visualize matched sequences, their number and their genomic position. NucBase identifies nucleic motives with strict identity to input sequences, and it capably finds candidates with one or several mismatches. It offers the opportunity to identify "core sequences" comprised of a chosen number of consecutive matching nucleotides. This software can be run locally on any Windows, Linux or Mac OS computer with 32-bit architecture compatibility. CONCLUSIONS Since this software is easy to use and can detect reads that were undetected by other software, we believe that it will be useful for biologists involved in the field of TE silencing by small non-coding RNAs. We hope NucBase will be useful for a larger community of researchers, since it makes exploration of small nucleic sequences in any organism much easier.
Collapse
Affiliation(s)
- Jeremy Dufourt
- Clermont Université, Université d’Auvergne, Laboratoire GReD, BP 38, F-63001 Clermont-Ferrand, France
- Inserm U 1103, F-63001, Clermont-Ferrand, France
- CNRS, UMR 6293, F-63001, Clermont-Ferrand, France
| | - Pierre Pouchin
- Clermont Université, Université d’Auvergne, Laboratoire GReD, BP 38, F-63001 Clermont-Ferrand, France
- Inserm U 1103, F-63001, Clermont-Ferrand, France
- CNRS, UMR 6293, F-63001, Clermont-Ferrand, France
- CHRU, F-63001, Clermont-Ferrand, France
| | - Pierre Peyret
- EA4678, Université d’Auvergne, F-63001, Clermont-Ferrand, France
| | - Emilie Brasset
- Clermont Université, Université d’Auvergne, Laboratoire GReD, BP 38, F-63001 Clermont-Ferrand, France
- Inserm U 1103, F-63001, Clermont-Ferrand, France
- CNRS, UMR 6293, F-63001, Clermont-Ferrand, France
| | - Chantal Vaury
- Clermont Université, Université d’Auvergne, Laboratoire GReD, BP 38, F-63001 Clermont-Ferrand, France
- Inserm U 1103, F-63001, Clermont-Ferrand, France
- CNRS, UMR 6293, F-63001, Clermont-Ferrand, France
| |
Collapse
|
31
|
Freschi V, Bogliolo A. A lossy compression technique enabling duplication-aware sequence alignment. Evol Bioinform Online 2012; 8:171-80. [PMID: 22518086 PMCID: PMC3327517 DOI: 10.4137/ebo.s9131] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
In spite of the recognized importance of tandem duplications in genome evolution, commonly adopted sequence comparison algorithms do not take into account complex mutation events involving more than one residue at the time, since they are not compliant with the underlying assumption of statistical independence of adjacent residues. As a consequence, the presence of tandem repeats in sequences under comparison may impair the biological significance of the resulting alignment. Although solutions have been proposed, repeat-aware sequence alignment is still considered to be an open problem and new efficient and effective methods have been advocated. The present paper describes an alternative lossy compression scheme for genomic sequences which iteratively collapses repeats of increasing length. The resulting approximate representations do not contain tandem duplications, while retaining enough information for making their comparison even more significant than the edit distance between the original sequences. This allows us to exploit traditional alignment algorithms directly on the compressed sequences. Results confirm the validity of the proposed approach for the problem of duplication-aware sequence alignment.
Collapse
Affiliation(s)
- Valerio Freschi
- Department of Base Sciences and Fundamentals, University of Urbino, Italy
| | | |
Collapse
|
32
|
Dai Q, Wu L, Li L. Improving protein structural class prediction using novel combined sequence information and predicted secondary structural features. J Comput Chem 2011; 32:3393-8. [DOI: 10.1002/jcc.21918] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/16/2011] [Revised: 06/29/2011] [Accepted: 07/25/2011] [Indexed: 11/07/2022]
|
33
|
Lavesson N, Axelsson S. Similarity assessment for removal of noisy end user license agreements. Knowl Inf Syst 2011. [DOI: 10.1007/s10115-011-0438-9] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
|
34
|
Pinho AJ, Ferreira PJSG, Neves AJR, Bastos CAC. On the representability of complete genomes by multiple competing finite-context (Markov) models. PLoS One 2011; 6:e21588. [PMID: 21738720 PMCID: PMC3128062 DOI: 10.1371/journal.pone.0021588] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/10/2010] [Accepted: 06/06/2011] [Indexed: 11/19/2022] Open
Abstract
A finite-context (Markov) model of order k yields the probability distribution of the next symbol in a sequence of symbols, given the recent past up to depth k. Markov modeling has long been applied to DNA sequences, for example to find gene-coding regions. With the first studies came the discovery that DNA sequences are non-stationary: distinct regions require distinct model orders. Since then, Markov and hidden Markov models have been extensively used to describe the gene structure of prokaryotes and eukaryotes. However, to our knowledge, a comprehensive study about the potential of Markov models to describe complete genomes is still lacking. We address this gap in this paper. Our approach relies on (i) multiple competing Markov models of different orders (ii) careful programming techniques that allow orders as large as sixteen (iii) adequate inverted repeat handling (iv) probability estimates suited to the wide range of context depths used. To measure how well a model fits the data at a particular position in the sequence we use the negative logarithm of the probability estimate at that position. The measure yields information profiles of the sequence, which are of independent interest. The average over the entire sequence, which amounts to the average number of bits per base needed to describe the sequence, is used as a global performance measure. Our main conclusion is that, from the probabilistic or information theoretic point of view and according to this performance measure, multiple competing Markov models explain entire genomes almost as well or even better than state-of-the-art DNA compression methods, such as XM, which rely on very different statistical models. This is surprising, because Markov models are local (short-range), contrasting with the statistical models underlying other methods, where the extensive data repetitions in DNA sequences is explored, and therefore have a non-local character.
Collapse
Affiliation(s)
- Armando J Pinho
- Signal Processing Lab, IEETA/DETI, University of Aveiro, Aveiro, Portugal.
| | | | | | | |
Collapse
|
35
|
Menconi G, Puliti A, Sbrana I, Conti V, Marangoni R. A top-down linguistic approach to the analysis of genomic sequences: The metabotropic glutamate receptors 1 and 5 in human and in mouse as a case study. J Theor Biol 2011; 270:134-42. [DOI: 10.1016/j.jtbi.2010.11.020] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/15/2010] [Revised: 10/18/2010] [Accepted: 11/10/2010] [Indexed: 01/21/2023]
|
36
|
Haubold B, Reed FA, Pfaffelhuber P. Alignment-free estimation of nucleotide diversity. ACTA ACUST UNITED AC 2010; 27:449-55. [PMID: 21156730 DOI: 10.1093/bioinformatics/btq689] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022]
Abstract
MOTIVATION Sequencing capacity is currently growing more rapidly than CPU speed, leading to an analysis bottleneck in many genome projects. Alignment-free sequence analysis methods tend to be more efficient than their alignment-based counterparts. They may, therefore, be important in the long run for keeping sequence analysis abreast with sequencing. RESULTS We derive and implement an alignment-free estimator of the number of pairwise mismatches, . Our implementation of , pim, is based on an enhanced suffix array and inherits the superior time and memory efficiency of this data structure. Simulations demonstrate that is accurate if mutations are distributed randomly along the chromosome. While real data often deviates from this ideal, remains useful for identifying regions of low genetic diversity using a sliding window approach. We demonstrate this by applying it to the complete genomes of 37 strains of Drosophila melanogaster, and to the genomes of two closely related Drosophila species, D.simulans and D.sechellia. In both cases, we detect the diversity minimum and discuss its biological implications.
Collapse
Affiliation(s)
- Bernhard Haubold
- Department of Evolutionary Genetics, Albert-Ludwigs University, Freiburg, Germany
| | | | | |
Collapse
|
37
|
|
38
|
Abstract
The quantitative underpinning of the information content of biosequences represents an elusive goal and yet also an obvious prerequisite to the quantitative modeling and study of biological function and evolution. Several past studies have addressed the question of what distinguishes biosequences from random strings, the latter being clearly unpalatable to the living cell. Such studies typically analyze the organization of biosequences in terms of their constituent characters or substrings and have, in particular, consistently exposed a tenacious lack of compressibility on behalf of biosequences. This article attempts, perhaps for the first time, an assessement of the structure and randomness of polypeptides in terms on newly introduced parameters that relate to the vocabulary of their (suitably constrained) subsequences rather than their substrings. It is shown that such parameters grasp structural/functional information, and are related to each other under a specific set of rules that span biochemically diverse polypeptides. Measures on subsequences separate few amino acid strings from their random permutations, but show that the random permutations of most polypeptides amass along specific linear loci.
Collapse
Affiliation(s)
- Alberto Apostolico
- College of Computing, Georgia Institute of Technology, Atlanta, GA 30318, USA.
| | | |
Collapse
|
39
|
Yang L, Chang G, Zhang X, Wang T. Use of the Burrows–Wheeler similarity distribution to the comparison of the proteins. Amino Acids 2010; 39:887-98. [DOI: 10.1007/s00726-010-0547-x] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/29/2009] [Accepted: 02/25/2010] [Indexed: 10/19/2022]
|
40
|
Galas DJ, Nykter M, Carter GW, Price ND, Shmulevich I. Biological Information as Set-Based Complexity. IEEE TRANSACTIONS ON INFORMATION THEORY 2010; 56:667-677. [PMID: 27857450 PMCID: PMC5110148 DOI: 10.1109/tit.2009.2037046] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/29/2023]
Abstract
It is not obvious what fraction of all the potential information residing in the molecules and structures of living systems is significant or meaningful to the system. Sets of random sequences or identically repeated sequences, for example, would be expected to contribute little or no useful information to a cell. This issue of quantitation of information is important since the ebb and flow of biologically significant information is essential to our quantitative understanding of biological function and evolution. Motivated specifically by these problems of biological information, we propose here a class of measures to quantify the contextual nature of the information in sets of objects, based on Kolmogorov's intrinsic complexity. Such measures discount both random and redundant information and are inherent in that they do not require a defined state space to quantify the information. The maximization of this new measure, which can be formulated in terms of the universal information distance, appears to have several useful and interesting properties, some of which we illustrate with examples.
Collapse
Affiliation(s)
- David J. Galas
- Institute for Systems Biology, Seattle, WA, USA
- Battelle Memorial Institute, Columbus, OH, USA
| | - Matti Nykter
- Institute for Systems Biology, Seattle, WA, USA
- Institute of Signal Processing, Tampere University of Technology,
Tampere, Finland
| | | | | | | |
Collapse
|
41
|
Apostolico A. Maximal Words in Sequence Comparisons Based on Subword Composition. ALGORITHMS AND APPLICATIONS 2010. [DOI: 10.1007/978-3-642-12476-1_2] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/21/2023]
|
42
|
Data Compression Concepts and Algorithms and their Applications to Bioinformatics. ENTROPY 2009; 12:34. [PMID: 20157640 DOI: 10.3390/e12010034] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/24/2023]
Abstract
Data compression at its base is concerned with how information is organized in data. Understanding this organization can lead to efficient ways of representing the information and hence data compression. In this paper we review the ways in which ideas and approaches fundamental to the theory and practice of data compression have been used in the area of bioinformatics. We look at how basic theoretical ideas from data compression, such as the notions of entropy, mutual information, and complexity have been used for analyzing biological sequences in order to discover hidden patterns, infer phylogenetic relationships between organisms and study viral populations. Finally, we look at how inferred grammars for biological sequences have been used to uncover structure in biological sequences.
Collapse
|
43
|
Kemena C, Notredame C. Upcoming challenges for multiple sequence alignment methods in the high-throughput era. Bioinformatics 2009; 25:2455-65. [PMID: 19648142 PMCID: PMC2752613 DOI: 10.1093/bioinformatics/btp452] [Citation(s) in RCA: 150] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/03/2009] [Revised: 06/24/2009] [Accepted: 07/16/2009] [Indexed: 12/22/2022] Open
Abstract
This review focuses on recent trends in multiple sequence alignment tools. It describes the latest algorithmic improvements including the extension of consistency-based methods to the problem of template-based multiple sequence alignments. Some results are presented suggesting that template-based methods are significantly more accurate than simpler alternative methods. The validation of existing methods is also discussed at length with the detailed description of recent results and some suggestions for future validation strategies. The last part of the review addresses future challenges for multiple sequence alignment methods in the genomic era, most notably the need to cope with very large sequences, the need to integrate large amounts of experimental data, the need to accurately align non-coding and non-transcribed sequences and finally, the need to integrate many alternative methods and approaches.
Collapse
Affiliation(s)
- Carsten Kemena
- Centre For Genomic Regulation, Pompeus Fabre University, Carrer del Doctor Aiguader 88, 08003 Barcelona, Spain
| | | |
Collapse
|
44
|
Rodrigo A, Bertels F, Heled J, Noder R, Shearman H, Tsai P. The perils of plenty: what are we going to do with all these genes? Philos Trans R Soc Lond B Biol Sci 2009; 363:3893-902. [PMID: 18852100 DOI: 10.1098/rstb.2008.0173] [Citation(s) in RCA: 66] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
This new century's biology promises more of everything--more genes, more organisms, more species and, in short, more data. The flood of data challenges us to find better and quicker ways to summarize and analyse. Here, we present preliminary results and proofs of concept from three of our research projects that are motivated by our search for solutions to the perils of plenty. First, we discuss how models of evolution can accommodate change to better reflect the dynamics of sequence diversity, particularly when it is becoming a lot easier to obtain sequences at different times and across intervals where the probability of new mutations contributing to this diversity is high. Second, we describe our work on the use of a single locus for species delimitation; this research targets the new DNA-barcoding approach that aims to catalogue the entirety of life. We have developed a single-locus test based on the coalescent that tests the null hypothesis of panmixis. Finally, we discuss new sequencing technologies, the types of data available and the efficacy of alignment-free methods to estimate pairwise distances for phylogenetic analyses.
Collapse
Affiliation(s)
- Allen Rodrigo
- Allan Wilson Centre for Molecular Evolution and Bioinformatics Institute, University of Auckland, Private Bag 92019, Auckland 1142, New Zealand.
| | | | | | | | | | | |
Collapse
|
45
|
Giancarlo R, Scaturro D, Utro F. Textual data compression in computational biology: a synopsis. Bioinformatics 2009; 25:1575-86. [DOI: 10.1093/bioinformatics/btp117] [Citation(s) in RCA: 63] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022] Open
|
46
|
Apostolico A, Denas O. Fast algorithms for computing sequence distances by exhaustive substring composition. Algorithms Mol Biol 2008; 3:13. [PMID: 18957094 PMCID: PMC2615014 DOI: 10.1186/1748-7188-3-13] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/23/2008] [Accepted: 10/28/2008] [Indexed: 11/28/2022] Open
Abstract
The increasing throughput of sequencing raises growing needs for methods of sequence analysis and comparison on a genomic scale, notably, in connection with phylogenetic tree reconstruction. Such needs are hardly fulfilled by the more traditional measures of sequence similarity and distance, like string edit and gene rearrangement, due to a mixture of epistemological and computational problems. Alternative measures, based on the subword composition of sequences, have emerged in recent years and proved to be both fast and effective in a variety of tested cases. The common denominator of such measures is an underlying information theoretic notion of relative compressibility. Their viability depends critically on computational cost. The present paper describes as a paradigm the extension and efficient implementation of one of the methods in this class. The method is based on the comparison of the frequencies of all subwords in the two input sequences, where frequencies are suitably adjusted to take into account the statistical background.
Collapse
Affiliation(s)
- Alberto Apostolico
- Academia Nazionale dei Lincei, Rome, Italy
- Department of Information Engineering, Universitá di Padova, Padova, Italy
- College of Computing, Georgia Institute of Technology, Atlanta, Georgia, USA
| | - Olgert Denas
- College of Computing, Georgia Institute of Technology, Atlanta, Georgia, USA
| |
Collapse
|
47
|
Dai Q, Wang T. Comparison study on k-word statistical measures for protein: from sequence to 'sequence space'. BMC Bioinformatics 2008; 9:394. [PMID: 18811946 PMCID: PMC2571980 DOI: 10.1186/1471-2105-9-394] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/09/2008] [Accepted: 09/23/2008] [Indexed: 11/30/2022] Open
Abstract
Background Many proposed statistical measures can efficiently compare protein sequence to further infer protein structure, function and evolutionary information. They share the same idea of using k-word frequencies of protein sequences. Given a protein sequence, the information on its related protein sequences hasn't been used for protein sequence comparison until now. This paper proposed a scheme to construct protein 'sequence space' which was associated with protein sequences related to the given protein, and the performances of statistical measures were compared when they explored the information on protein 'sequence space' or not. This paper also presented two statistical measures for protein: gre.k (generalized relative entropy) and gsm.k (gapped similarity measure). Results We tested statistical measures based on protein 'sequence space' or not with three data sets. This not only offers the systematic and quantitative experimental assessment of these statistical measures, but also naturally complements the available comparison of statistical measures based on protein sequence. Moreover, we compared our statistical measures with alignment-based measures and the existing statistical measures. The experiments were grouped into two sets. The first one, performed via ROC (Receiver Operating Curve) analysis, aims at assessing the intrinsic ability of the statistical measures to discriminate and classify protein sequences. The second set of the experiments aims at assessing how well our measure does in phylogenetic analysis. Based on the experiments, several conclusions can be drawn and, from them, novel valuable guidelines for the use of protein 'sequence space' and statistical measures were obtained. Conclusion Alignment-based measures have a clear advantage when the data is high redundant. The more efficient statistical measure is the novel gsm.k introduced by this article, the cos.k followed. When the data becomes less redundant, gre.k proposed by us achieves a better performance, but all the other measures perform poorly on classification tasks. Almost all the statistical measures achieve improvement by exploring the information on 'sequence space' as word's length increases, especially for less redundant data. The reasonable results of phylogenetic analysis confirm that Gdis.k based on 'sequence space' is a reliable measure for phylogenetic analysis. In summary, our quantitative analysis verifies that exploring the information on 'sequence space' is a promising way to improve the abilities of statistical measures for protein comparison.
Collapse
Affiliation(s)
- Qi Dai
- Department of Applied Mathematics, Dalian University of Technology, Dalian 116024, PR China.
| | | |
Collapse
|
48
|
Liu L, Wang T. Comparison of TOPS strings based on LZ complexity. J Theor Biol 2008; 251:159-66. [PMID: 18166201 DOI: 10.1016/j.jtbi.2007.11.016] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/24/2007] [Revised: 11/13/2007] [Accepted: 11/13/2007] [Indexed: 10/22/2022]
|
49
|
ProCKSI: a decision support system for Protein (structure) Comparison, Knowledge, Similarity and Information. BMC Bioinformatics 2007; 8:416. [PMID: 17963510 PMCID: PMC2222653 DOI: 10.1186/1471-2105-8-416] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/08/2007] [Accepted: 10/26/2007] [Indexed: 11/19/2022] Open
Abstract
Background We introduce the decision support system for Protein (Structure) Comparison, Knowledge, Similarity and Information (ProCKSI). ProCKSI integrates various protein similarity measures through an easy to use interface that allows the comparison of multiple proteins simultaneously. It employs the Universal Similarity Metric (USM), the Maximum Contact Map Overlap (MaxCMO) of protein structures and other external methods such as the DaliLite and the TM-align methods, the Combinatorial Extension (CE) of the optimal path, and the FAST Align and Search Tool (FAST). Additionally, ProCKSI allows the user to upload a user-defined similarity matrix supplementing the methods mentioned, and computes a similarity consensus in order to provide a rich, integrated, multicriteria view of large datasets of protein structures. Results We present ProCKSI's architecture and workflow describing its intuitive user interface, and show its potential on three distinct test-cases. In the first case, ProCKSI is used to evaluate the results of a previous CASP competition, assessing the similarity of proposed models for given targets where the structures could have a large deviation from one another. To perform this type of comparison reliably, we introduce a new consensus method. The second study deals with the verification of a classification scheme for protein kinases, originally derived by sequence comparison by Hanks and Hunter, but here we use a consensus similarity measure based on structures. In the third experiment using the Rost and Sander dataset (RS126), we investigate how a combination of different sets of similarity measures influences the quality and performance of ProCKSI's new consensus measure. ProCKSI performs well with all three datasets, showing its potential for complex, simultaneous multi-method assessment of structural similarity in large protein datasets. Furthermore, combining different similarity measures is usually more robust than relying on one single, unique measure. Conclusion Based on a diverse set of similarity measures, ProCKSI computes a consensus similarity profile for the entire protein set. All results can be clustered, visualised, analysed and easily compared with each other through a simple and intuitive interface. ProCKSI is publicly available at for academic and non-commercial use.
Collapse
|