1. Bobbo T, Biscarini F, Yaddehige SK, Alberghini L, Rigoni D, Bianchi N, Taccioli C. Machine learning classification of archaea and bacteria identifies novel predictive genomic features. BMC Genomics 2024; 25:955. [PMID: 39402493 PMCID: PMC11472548 DOI: 10.1186/s12864-024-10832-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/30/2024] [Accepted: 09/24/2024] [Indexed: 10/19/2024] Open
Abstract
BACKGROUND: Archaea and Bacteria are distinct domains of life that are adapted to a variety of ecological niches. Several genome-based methods have been developed for their accurate classification, yet many aspects of the specific genomic features that determine these differences are not fully understood. In this study, we used publicly available whole-genome sequences from bacteria (N = 2546) and archaea (N = 109). From these, a set of genomic features (nucleotide frequencies and proportions, coding sequences (CDS), non-coding, ribosomal and transfer RNA genes (ncRNA, rRNA, tRNA), Chargaff's, topological entropy and Shannon's entropy scores) was extracted and used as input data to develop machine learning models for the classification of archaea and bacteria.
RESULTS: The classification accuracy ranged from 0.993 (Random Forest) to 0.998 (Neural Networks). Over the four models, only 11 examples were misclassified, especially those belonging to the minority class (Archaea). From variable importance, tRNA topological and Shannon's entropy, nucleotide frequencies in tRNA, rRNA and ncRNA, CDS, tRNA and rRNA Chargaff's scores have emerged as the top discriminating factors. In particular, tRNA entropy (both topological and Shannon's) was the most important genomic feature for classification, pointing at the complex interactions between the genetic code, tRNAs and the translational machinery.
CONCLUSIONS: tRNA, rRNA and ncRNA genes emerged as the key genomic elements that underpin the classification of archaea and bacteria. In particular, higher nucleotide diversity was found in tRNA from bacteria compared to archaea. The analysis of the few classification errors reflects the complex phylogenetic relationships between bacteria, archaea and eukaryotes.
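As a rough illustration of one of the feature types named in this abstract, the sketch below computes the Shannon entropy of the k-mer distribution of a nucleotide sequence. The function name `shannon_entropy`, the use of overlapping k-mers, and the toy sequences are assumptions made for illustration only; the abstract does not say how the authors defined or normalized their entropy scores.

```python
import math
from collections import Counter


def shannon_entropy(seq: str, k: int = 1) -> float:
    """Shannon entropy (in bits) of the overlapping k-mer distribution of seq.

    Hypothetical helper: not the scoring function used in the cited study.
    """
    seq = seq.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    counts = Counter(kmers)
    total = len(kmers)
    # H = sum over k-mers of p * log2(1 / p), with p = count / total
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Toy inputs (made up): equal base usage versus a single repeated base.
uniform = shannon_entropy("ACGT" * 10)
repetitive = shannon_entropy("AAAAAAAAAA")
```

With `k = 1` the value is bounded by 2 bits for a four-letter alphabet, so a score close to 2 indicates near-equal base usage and a score near 0 indicates a highly repetitive sequence.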
Affiliation(s)
- Tania Bobbo
  - Institute for Biomedical Technologies, National Research Council (CNR), Via Fratelli Cervi 93, Segrate (MI), 20054, Italy
- Filippo Biscarini
  - Institute of Agricultural Biology and Biotechnology, National Research Council (CNR), Via Edoardo Bassini 15, Milano, 20133, Italy
- Sachithra K Yaddehige
  - Department of Animal Medicine, Health and Production, University of Padova, Viale dell'Università 16, Legnaro, 35020, Italy
- Leonardo Alberghini
  - Department of Animal Medicine, Health and Production, University of Padova, Viale dell'Università 16, Legnaro, 35020, Italy
- Davide Rigoni
  - Department of Pharmaceutical and Pharmacological Sciences, University of Padova, Via Francesco Marzolo 5, Padova, 35131, Italy
- Nicoletta Bianchi
  - Department of Translational Medicine, University of Ferrara, Via Luigi Borsari 46, Ferrara, 44121, Italy
- Cristian Taccioli
  - Department of Animal Medicine, Health and Production, University of Padova, Viale dell'Università 16, Legnaro, 35020, Italy
2. Sarkar BK, Bhattacharya M, Agoramoorthy G, Dhama K, Chakraborty C. Entropy-Driven, Integrative Bioinformatics Approaches Reveal the Recent Transmission of the Monkeypox Virus from Nigeria to Multiple Non-African Countries. Mol Biotechnol 2024; 66:2816-2829. [PMID: 37798393 DOI: 10.1007/s12033-023-00889-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/09/2023] [Accepted: 09/06/2023] [Indexed: 10/07/2023]
Abstract
Monkeypox virus (mpox) has affected multiple countries around the globe. This study analyzes how the virus spread globally. It applies entropy-driven bioinformatics in five directions to 60 full-length complete mpox genomes: the topological entropy distribution of the genomes, principal component analysis (PCA), the dissimilarity matrix, entropy-driven phylogenetics, and genome clustering. The topological entropy distribution showed genome positional entropy. We found five clusters of mpox genomes using two-component PCA, while three-component PCA showed the clustering in 3D space. The clustering was further confirmed through the dissimilarity matrix and phylogenetic analysis, which showed the larger size of Cluster 1 and similar sizes for Clusters 2 and 4 as well as Clusters 3 and 5. This agreed with the phylogenetics of the genomes, where Cluster 1 was clearly segregated from the other four clusters. Finally, the study concluded that mpox is likely to have spread from African countries to non-African countries. Overall, the spread and distribution of mpox will shed light on its evolution and pathogenicity and help in adopting preventive measures to stop the spread of the virus.
Affiliation(s)
- Bimal Kumar Sarkar
  - Department of Physics, Adamas University, Kolkata, West Bengal, 700126, India
- Manojit Bhattacharya
  - Department of Zoology, Fakir Mohan University, Vyasa Vihar, Balasore, 756020, Odisha, India
- Kuldeep Dhama
  - Division of Pathology, ICAR-Indian Veterinary Research Institute, Izatnagar, Bareilly, Uttar Pradesh, 243122, India
- Chiranjib Chakraborty
  - Department of Biotechnology, School of Life Science and Biotechnology, Adamas University, Kolkata, West Bengal, 700126, India
3. Lyu C, Chen L, Liu X. Detecting tipping points of complex diseases by network information entropy. Brief Bioinform 2024; 25:bbae311. [PMID: 38960408 PMCID: PMC11221888 DOI: 10.1093/bib/bbae311] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/22/2024] [Revised: 05/30/2024] [Accepted: 06/14/2024] [Indexed: 07/05/2024] Open
Abstract
The progression of complex diseases often involves abrupt and non-linear changes characterized by sudden shifts that trigger critical transformations. Identifying these critical states or tipping points is crucial for understanding disease progression and developing effective interventions. To address this challenge, we have developed a model-free method named Network Information Entropy of Edges (NIEE). Leveraging dynamic network biomarkers, sample-specific networks, and information entropy theories, NIEE can detect critical states or tipping points in diverse data types, including bulk, single-sample expression data. By applying NIEE to real disease datasets, we successfully identified critical predisease stages and tipping points before disease onset. Our findings underscore NIEE's potential to enhance comprehension of complex disease development.
Affiliation(s)
- Chengshang Lyu
  - Key Laboratory of Systems Health Science of Zhejiang Province, School of Life Science, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, 1 Xiangshan Branch Alley, Xihu District, Hangzhou 310024, China
  - Department of Biomedical Sciences, City University of Hong Kong, 31 To Yuen Street, Kowloon Tong, Kowloon, Hong Kong 999077, China
- Lingxi Chen
  - Department of Biomedical Sciences, City University of Hong Kong, 31 To Yuen Street, Kowloon Tong, Kowloon, Hong Kong 999077, China
- Xiaoping Liu
  - Key Laboratory of Systems Health Science of Zhejiang Province, School of Life Science, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, 1 Xiangshan Branch Alley, Xihu District, Hangzhou 310024, China
4. Wesp V, Theißen G, Schuster S. Statistical analysis of synonymous and stop codons in pseudo-random and real sequences as a function of GC content. Sci Rep 2023; 13:22996. [PMID: 38151539 PMCID: PMC10752896 DOI: 10.1038/s41598-023-49626-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/13/2023] [Accepted: 12/10/2023] [Indexed: 12/29/2023] Open
Abstract
Knowledge of the frequencies of synonymous triplets in protein-coding and non-coding DNA stretches can be used in gene finding. These frequencies depend on the GC content of the genome or parts of it. An example of interest is provided by stop codons. This is relevant for the definition of Open Reading Frames. A generic case is provided by pseudo-random sequences, especially when they code for complex proteins or when they are non-coding and not subject to selection pressure. Here, we calculate, for such sequences and for all 25 known genetic codes, the frequency of each amino acid and stop codon based on their set of codons and as a function of GC content. The amino acids can be classified into five groups according to the GC content where their expected frequency reaches its maximum. We determine the overall Shannon information based on groups of synonymous codons and show that it becomes maximum at a percent GC of 43.3% (for the standard code). This is in line with the observation that in most fungi, plants, and animals, this genomic parameter is in the range from 35 to 50%. By analysing natural sequences, we show that there is a clear bias for triplets corresponding to stop codons near the 5'- and 3'-splice sites in the introns of various clades.
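The dependence of stop-codon frequency on GC content in a pseudo-random sequence can be sketched as follows. This is a minimal illustration, assuming independent bases with P(G) = P(C) = gc/2 and P(A) = P(T) = (1 - gc)/2 and the three stop codons of the standard genetic code; the helper names are hypothetical, and the authors' actual model and their treatment of the other genetic codes may differ.

```python
# Stop codons of the standard genetic code.
STANDARD_STOPS = ("TAA", "TAG", "TGA")


def base_probs(gc: float) -> dict:
    """Per-base probabilities for a GC fraction (assumes G = C and A = T)."""
    if not 0.0 <= gc <= 1.0:
        raise ValueError("gc must be between 0 and 1")
    at = 1.0 - gc
    return {"G": gc / 2, "C": gc / 2, "A": at / 2, "T": at / 2}


def codon_prob(codon: str, gc: float) -> float:
    """Probability of one triplet when bases are drawn independently."""
    p = base_probs(gc)
    return p[codon[0]] * p[codon[1]] * p[codon[2]]


def stop_frequency(gc: float) -> float:
    """Expected fraction of triplets that are stop codons at this GC content."""
    return sum(codon_prob(c, gc) for c in STANDARD_STOPS)


# At 50% GC every triplet has probability 1/64, so the three stops sum to 3/64.
half_gc = stop_frequency(0.5)
```

Under these assumptions the sum simplifies to (1 - gc)^2 (1 + gc) / 8, which decreases as GC content rises because all three standard stop codons are AT-rich.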
Affiliation(s)
- Valentin Wesp
  - Department of Bioinformatics, Matthias Schleiden Institute, Friedrich Schiller University Jena, Ernst-Abbe-Platz 2, 07743, Jena, Germany
- Günter Theißen
  - Department of Genetics, Matthias Schleiden Institute, Friedrich Schiller University Jena, Philosophenweg 12, 07743, Jena, Germany
- Stefan Schuster
  - Department of Bioinformatics, Matthias Schleiden Institute, Friedrich Schiller University Jena, Ernst-Abbe-Platz 2, 07743, Jena, Germany
5. Herbert A. The Intransitive Logic of Directed Cycles and Flipons Enhances the Evolution of Molecular Computers by Augmenting the Kolmogorov Complexity of Genomes. Int J Mol Sci 2023; 24:16482. [PMID: 38003672 PMCID: PMC10671625 DOI: 10.3390/ijms242216482] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/18/2023] [Revised: 11/14/2023] [Accepted: 11/14/2023] [Indexed: 11/26/2023] Open
Abstract
Cell responses are usually viewed as transitive events with fixed inputs and outputs that are regulated by feedback loops. In contrast, directed cycles (DCs) have all nodes connected, and the flow is in a single direction. Consequently, DCs can regenerate themselves and implement intransitive logic. DCs are able to couple unrelated chemical reactions to each edge. The output depends upon which node is used as input. DCs can also undergo selection to minimize the loss of thermodynamic entropy while maximizing the gain of information entropy. The intransitive logic underlying DCs enhances their programmability and impacts their evolution. The natural selection of DCs favors the persistence, adaptability, and self-awareness of living organisms and does not depend solely on changes to coding sequences. Rather, the process can be RNA-directed. I use flipons, nucleic acid sequences that change conformation under physiological conditions, as a simple example and then describe more complex DCs. Flipons are often encoded by repeats and greatly increase the Kolmogorov complexity of genomes by adopting alternative structures. Other DCs allow cells to regenerate, recalibrate, reset, repair, and rewrite themselves, going far beyond the capabilities of current computational devices. Unlike Turing machines, cells are not designed to halt but rather to regenerate.
Affiliation(s)
- Alan Herbert
  - InsideOutBio, 42 8th Street, Charlestown, MA 02129, USA
6. Machine Learning Algorithms Highlight tRNA Information Content and Chargaff’s Second Parity Rule Score as Important Features in Discriminating Probiotics from Non-Probiotics. BIOLOGY 2022; 11:biology11071024. [PMID: 36101405 PMCID: PMC9311688 DOI: 10.3390/biology11071024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2022] [Revised: 06/30/2022] [Accepted: 07/04/2022] [Indexed: 11/17/2022]
Abstract
Simple Summary: Probiotics are a group of beneficial microorganisms that are symbionts of the human gut microbiome. The identification of new probiotics is therefore of paramount importance from both public health and commercial perspectives. In this study, we show for the first time that Artificial Intelligence algorithms can identify novel probiotics and also discriminate them from pathogenic organisms in the human gut. We were also able to determine the information content within tRNA sequences as the key genomic features capable of characterizing probiotics.
Abstract: Probiotic bacteria are microorganisms with beneficial effects on human health and are currently used in numerous food supplements. However, no selection process is able to effectively distinguish probiotics from non-probiotic organisms on the basis of their genomic characteristics. In the current study, four Machine Learning algorithms were employed to accurately identify probiotic bacteria based on their DNA characteristics. Although the prediction accuracies of all algorithms were excellent, the Neural Network returned the highest scores in all the evaluation metrics, managing to discriminate probiotics from non-probiotics with an accuracy greater than 90%. Interestingly, our analysis also highlighted the information content of the tRNA sequences as the most important feature in distinguishing the two groups of organisms probably because tRNAs have regulatory functions and might have allowed probiotics to evolve faster in the human gut environment. Through the methodology presented here, it was also possible to identify seven promising new probiotics that have a higher information content in their tRNA sequences compared to non-probiotics. In conclusion, we prove for the first time that Machine Learning methods can discriminate human probiotic from non-probiotic organisms underlining information within tRNA sequences as the most important genomic feature in distinguishing them.
7. Prediction of Intrinsically Disordered Proteins Using Machine Learning Based on Low Complexity Methods. ALGORITHMS 2022. [DOI: 10.3390/a15030086] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
Prediction of intrinsically disordered proteins is an active area of bioinformatics. Because evaluating the disordered regions of protein sequences with experimental methods is costly, we used a low-complexity prediction scheme. In this scheme, sequence complexity is used to calculate five features for each residue of the protein sequence: the Shannon entropy, the topological entropy, the permutation entropy, and the weighted average values of two propensities. Notably, this is the first time that permutation entropy has been applied to protein sequence analysis. In addition, in the data preprocessing stage, an appropriately sized sliding window and a comprehensive oversampling scheme are used to improve the prediction performance of our scheme, and two ensemble learning algorithms are also used to verify the prediction results before and after. The results show that adding permutation entropy improves the performance of the prediction algorithm: the MCC value improves from the original 0.465 to 0.526 in our scheme, proving its universality. Finally, we compare the simulation results of our scheme with those of some existing schemes to demonstrate its effectiveness.
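Permutation entropy, the feature this abstract singles out, can be sketched for a generic numeric series as below. This is only an illustration of the standard ordinal-pattern definition: the function name, the default order of 3, and the toy inputs are assumptions, and the abstract does not state how residues were mapped to numbers or how the sliding window was applied.

```python
import math
from collections import Counter


def permutation_entropy(values, order: int = 3) -> float:
    """Shannon entropy (bits) of the ordinal patterns of length `order`.

    Hypothetical helper; ties are broken by position (stable sort).
    """
    if order < 2:
        raise ValueError("order must be at least 2")
    n_windows = len(values) - order + 1
    if n_windows <= 0:
        return 0.0
    # Each window is reduced to the index order that sorts its values.
    patterns = Counter(
        tuple(sorted(range(order), key=lambda j: values[i + j]))
        for i in range(n_windows)
    )
    return sum(
        (c / n_windows) * math.log2(n_windows / c) for c in patterns.values()
    )


# Toy series (made up): strictly increasing versus rise-then-fall.
monotonic = permutation_entropy([1, 2, 3, 4, 5, 6])
mixed = permutation_entropy([1, 2, 3, 2, 1])
```

A strictly monotonic series contains a single ordinal pattern and scores 0, while the rise-then-fall series contains three equally frequent patterns and scores log2(3) bits.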
8. Meraz M, Vernon-Carter E, Rodriguez E, Alvarez-Ramirez J. A fractal scaling analysis of the SARS-CoV-2 genome sequence. Biomed Signal Process Control 2022; 73:103433. [PMID: 36567677 PMCID: PMC9760973 DOI: 10.1016/j.bspc.2021.103433] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/18/2021] [Revised: 11/04/2021] [Accepted: 11/28/2021] [Indexed: 12/27/2022]
Abstract
An approach based on fractal scaling analysis was used to characterize the organization of the SARS-CoV-2 genome sequence. The method is based on detrended fluctuation analysis (DFA) implemented on a sliding-window scheme to detect variations of long-range correlations over regions of the genome sequence. The nucleotide sequence is mapped to a numerical sequence using four different assignment rules: amino-keto, purine-pyrimidine, hydrogen-bond and hydrophobicity patterns. The originally reported sequence from Wuhan isolates (Wuhan Hu-1) was considered as a reference to contrast the structure of the 2002-2004 SARS-CoV-1 strain. Long-range correlations, quantified in terms of a scaling exponent, depended on both the mapping rule and the sequence region. Deviations from randomness were attributed to serial correlations or anti-correlations, which can be ascribed to ordered regions of the genome sequence. It was found that the Wuhan Hu-1 sequence was more random than the SARS-CoV-1 sequence, which suggests that SARS-CoV-2 possesses a more efficient genomic structure for replication and infection. In general, the virus isolated in the early months of 2020 showed slight correlation differences with the Wuhan Hu-1 sequence. However, early isolates from India and Italy presented visible differences that led to a more ordered sequence organization. It is apparent that the increased sequence order, particularly in the spike region, endowed some early variants with a more efficient mechanism for spreading, replicating and infecting. Overall, the results showed that DFA provides a suitable framework to assess long-term correlations hidden in the internal organization of the SARS-CoV-2 genome sequence.
Affiliation(s)
- M. Meraz
  - Departamento de Biotecnología, Universidad Autónoma Metropolitana-Iztapalapa, Apartado Postal 55-534, Iztapalapa, CDMX 09340, Mexico
- E.J. Vernon-Carter
  - Departamento de Ingenieria de Procesos e Hidraulica, Universidad Autónoma Metropolitana-Iztapalapa, Apartado Postal 55-534, Iztapalapa, CDMX 09340, Mexico
- E. Rodriguez
  - Departamento de Ingenieria Eléctrica y Computacion, Universidad Autónoma Metropolitana-Iztapalapa, Apartado Postal 55-534, Iztapalapa, CDMX 09340, Mexico
- J. Alvarez-Ramirez
  - Departamento de Ingenieria de Procesos e Hidraulica, Universidad Autónoma Metropolitana-Iztapalapa, Apartado Postal 55-534, Iztapalapa, CDMX 09340, Mexico (corresponding author)
9. Bussi Y, Kapon R, Reich Z. Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy. PLoS One 2021; 16:e0258693. [PMID: 34648558 PMCID: PMC8516232 DOI: 10.1371/journal.pone.0258693] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 04/30/2021] [Accepted: 10/02/2021] [Indexed: 12/24/2022] Open
Abstract
Information theoretic approaches are ubiquitous and effective in a wide variety of bioinformatics applications. In comparative genomics, alignment-free methods, based on short DNA words, or k-mers, are particularly powerful. We evaluated the utility of varying k-mer lengths for genome comparisons by analyzing their sequence space coverage of 5805 genomes in the KEGG GENOME database. In subsequent analyses on four k-mer lengths spanning the relevant range (11, 21, 31, 41), hierarchical clustering of 1634 genus-level representative genomes using pairwise 21- and 31-mer Jaccard similarities best recapitulated a phylogenetic/taxonomic tree of life with clear boundaries for superkingdom domains and high subtree similarity for named taxons at lower levels (family through phylum). By analyzing ~14.2M prokaryotic genome comparisons by their lowest-common-ancestor taxon levels, we detected many potential misclassification errors in a curated database, further demonstrating the need for wide-scale adoption of quantitative taxonomic classifications based on whole-genome similarity.
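The pairwise k-mer Jaccard similarity mentioned in this abstract can be illustrated with a minimal sketch. The function names and the toy sequences are assumptions for illustration; the study worked with k = 21 and 31 on whole genomes, and details such as canonical (strand-independent) k-mers or sketching for scale are not covered here.

```python
def kmer_set(seq: str, k: int) -> set:
    """Set of all overlapping k-mers in seq (forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_similarity(seq_a: str, seq_b: str, k: int) -> float:
    """Jaccard index |A & B| / |A | B| of the two k-mer sets."""
    a, b = kmer_set(seq_a, k), kmer_set(seq_b, k)
    union = a | b
    if not union:
        return 0.0  # both sequences shorter than k
    return len(a & b) / len(union)


# Toy example with k = 2: {AC, CG, GT} vs {AC, CG, GA} share 2 of 4 k-mers.
similarity = jaccard_similarity("ACGT", "ACGA", k=2)
```

A matrix of such pairwise values (or 1 minus them, as distances) is the kind of input that hierarchical clustering would consume.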
Affiliation(s)
- Yuval Bussi
  - Department of Biomolecular Sciences, Weizmann Institute of Science, Rehovot, Israel
  - Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel
  - Department of Molecular Cell Biology, Weizmann Institute of Science, Rehovot, Israel
- Ruti Kapon
  - Department of Biomolecular Sciences, Weizmann Institute of Science, Rehovot, Israel
- Ziv Reich
  - Department of Biomolecular Sciences, Weizmann Institute of Science, Rehovot, Israel
10. Information Entropy in Chemistry: An Overview. ENTROPY 2021; 23:e23101240. [PMID: 34681964 PMCID: PMC8534366 DOI: 10.3390/e23101240] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Received: 08/12/2021] [Revised: 09/19/2021] [Accepted: 09/20/2021] [Indexed: 12/20/2022]
Abstract
Basic applications of the information entropy concept to chemical objects are reviewed. These applications deal with quantifying chemical and electronic structures of molecules, signal processing, structural studies on crystals, and molecular ensembles. Recent advances in the mentioned areas make information entropy a central concept in interdisciplinary studies on digitalizing chemical reactions, chemico-information synthesis, crystal engineering, as well as digitally rethinking basic notions of structural chemistry in terms of informatics.
11. Danko D, Bezdan D, Afshin EE, Ahsanuddin S, Bhattacharya C, Butler DJ, Chng KR, Donnellan D, Hecht J, Jackson K, Kuchin K, Karasikov M, Lyons A, Mak L, Meleshko D, Mustafa H, Mutai B, Neches RY, Ng A, Nikolayeva O, Nikolayeva T, Png E, Ryon KA, Sanchez JL, Shaaban H, Sierra MA, Thomas D, Young B, Abudayyeh OO, Alicea J, Bhattacharyya M, Blekhman R, Castro-Nallar E, Cañas AM, Chatziefthimiou AD, Crawford RW, De Filippis F, Deng Y, Desnues C, Dias-Neto E, Dybwad M, Elhaik E, Ercolini D, Frolova A, Gankin D, Gootenberg JS, Graf AB, Green DC, Hajirasouliha I, Hastings JJA, Hernandez M, Iraola G, Jang S, Kahles A, Kelly FJ, Knights K, Kyrpides NC, Łabaj PP, Lee PKH, Leung MHY, Ljungdahl PO, Mason-Buck G, McGrath K, Meydan C, Mongodin EF, Moraes MO, Nagarajan N, Nieto-Caballero M, Noushmehr H, Oliveira M, Ossowski S, Osuolale OO, Özcan O, Paez-Espino D, Rascovan N, Richard H, Rätsch G, Schriml LM, Semmler T, Sezerman OU, Shi L, Shi T, Siam R, Song LH, Suzuki H, Court DS, Tighe SW, Tong X, Udekwu KI, Ugalde JA, Valentine B, Vassilev DI, Vayndorf EM, Velavan TP, Wu J, Zambrano MM, Zhu J, Zhu S, Mason CE. A global metagenomic map of urban microbiomes and antimicrobial resistance. Cell 2021; 184:3376-3393.e17. [PMID: 34043940 PMCID: PMC8238498 DOI: 10.1016/j.cell.2021.05.002] [Citation(s) in RCA: 156] [Impact Index Per Article: 52.0] [Received: 12/02/2020] [Revised: 03/05/2021] [Accepted: 04/29/2021] [Indexed: 01/14/2023]
Abstract
We present a global atlas of 4,728 metagenomic samples from mass-transit systems in 60 cities over 3 years, representing the first systematic, worldwide catalog of the urban microbial ecosystem. This atlas provides an annotated, geospatial profile of microbial strains, functional characteristics, antimicrobial resistance (AMR) markers, and genetic elements, including 10,928 viruses, 1,302 bacteria, 2 archaea, and 838,532 CRISPR arrays not found in reference databases. We identified 4,246 known species of urban microorganisms and a consistent set of 31 species found in 97% of samples that were distinct from human commensal organisms. Profiles of AMR genes varied widely in type and density across cities. Cities showed distinct microbial taxonomic signatures that were driven by climate and geographic differences. These results constitute a high-resolution global metagenomic atlas that enables discovery of organisms and genes, highlights potential public health and forensic applications, and provides a culture-independent view of AMR burden in cities.
Affiliation(s)
- David Danko
- Weill Cornell Medicine, New York, NY, USA; The Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, New York, NY, USA
| | - Daniela Bezdan
- Weill Cornell Medicine, New York, NY, USA; The Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, New York, NY, USA; Institute of Medical Genetics and Applied Genomics, University of Tübingen, Tübingen, Germany; NGS Competence Center Tübingen (NCCT), University of Tübingen, Tübingen, Germany
| | - Evan E Afshin
- Weill Cornell Medicine, New York, NY, USA; The Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, New York, NY, USA
| | | | - Chandrima Bhattacharya
- Weill Cornell Medicine, New York, NY, USA; The Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, New York, NY, USA
| | - Daniel J Butler
- Weill Cornell Medicine, New York, NY, USA; The Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, New York, NY, USA
| | - Kern Rei Chng
- Genome Institute of Singapore, A(∗)STAR, Singapore, Singapore
| | - Daisy Donnellan
- Weill Cornell Medicine, New York, NY, USA; The Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, New York, NY, USA
| | - Jochen Hecht
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
| | - Katelyn Jackson
- Weill Cornell Medicine, New York, NY, USA; The Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, New York, NY, USA
| | - Katerina Kuchin
- Weill Cornell Medicine, New York, NY, USA; The Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, New York, NY, USA
| | - Mikhail Karasikov
- ETH Zurich, Department of Computer Science, Biomedical Informatics Group, Zurich, Switzerland; University Hospital Zurich, Biomedical Informatics Research, Zurich, Switzerland; Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Abigail Lyons
- Weill Cornell Medicine, New York, NY, USA; The Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, New York, NY, USA
| | - Lauren Mak
- Weill Cornell Medicine, New York, NY, USA; The Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, New York, NY, USA
| | - Dmitry Meleshko
- Weill Cornell Medicine, New York, NY, USA; The Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, New York, NY, USA
| | - Harun Mustafa
- ETH Zurich, Department of Computer Science, Biomedical Informatics Group, Zurich, Switzerland; University Hospital Zurich, Biomedical Informatics Research, Zurich, Switzerland; Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Beth Mutai
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Kenya Medical Research Institute - Kisumu, Kisumu, Kenya
| | - Russell Y Neches
- Department of Energy, Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
| | - Amanda Ng
- Genome Institute of Singapore, A(∗)STAR, Singapore, Singapore
| | | | | | - Eileen Png
- Genome Institute of Singapore, A(∗)STAR, Singapore, Singapore
| | - Krista A Ryon
- Weill Cornell Medicine, New York, NY, USA; The Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, New York, NY, USA
| | - Jorge L Sanchez
- Weill Cornell Medicine, New York, NY, USA; The Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, New York, NY, USA
| | - Heba Shaaban
- Weill Cornell Medicine, New York, NY, USA; The Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, New York, NY, USA
| | - Maria A Sierra
- Weill Cornell Medicine, New York, NY, USA; The Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, New York, NY, USA
| | - Dominique Thomas
- Weill Cornell Medicine, New York, NY, USA; The Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, New York, NY, USA
| | - Ben Young
- Weill Cornell Medicine, New York, NY, USA; The Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, New York, NY, USA
| | - Omar O Abudayyeh
- Massachusetts Institute of Technology, McGovern Institute for Brain Research, Cambridge, MA, USA
| | - Josue Alicea
- Weill Cornell Medicine, New York, NY, USA; The Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, New York, NY, USA
| | - Malay Bhattacharyya
- Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India; Centre for Artificial Intelligence and Machine Learning, Indian Statistical Institute, Kolkata, India
| | | | - Eduardo Castro-Nallar
- Universidad Andres Bello, Center for Bioinformatics and Integrative Biology, Facultad de Ciencias de la Vida, Santiago, Chile
| | - Ana M Cañas
- Weill Cornell Medicine, New York, NY, USA; The Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, New York, NY, USA
| | - Aspassia D Chatziefthimiou
- Weill Cornell Medicine, New York, NY, USA; The Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, New York, NY, USA
| | | | - Francesca De Filippis
- Department of Agricultural Sciences, Division of Microbiology, University of Naples Federico II, Naples, Italy; Task Force on Microbiome Studies, University of Naples Federico II, Naples, Italy
| | - Youping Deng
- University of Hawaii John A. Burns School of Medicine, Honolulu, HI, USA
| | - Christelle Desnues
- Aix-Marseille Université, Mediterranean Institute of Oceanology, Université de Toulon, CNRS, IRD, UM 110, Marseille, France
| | - Emmanuel Dias-Neto
- Medical Genomics group, A.C.Camargo Cancer Center, São Paulo - SP, Brazil
| | - Marius Dybwad
- Norwegian Defence Research Establishment FFI, Kjeller, Norway
| | - Eran Elhaik
- Department of Biology, Lund University, Lund, Sweden
| | - Danilo Ercolini
- Department of Agricultural Sciences, Division of Microbiology, University of Naples Federico II, Naples, Italy; Task Force on Microbiome Studies, University of Naples Federico II, Naples, Italy
| | - Alina Frolova
- Institute of Molecular Biology and Genetics of National Academy of Sciences of Ukraine, Kyiv, Ukraine; Kyiv Academic University, Kyiv, Ukraine
| | - Dennis Gankin
- Massachusetts Institute of Technology, McGovern Institute for Brain Research, Cambridge, MA, USA
| | - Jonathan S Gootenberg
- Massachusetts Institute of Technology, McGovern Institute for Brain Research, Cambridge, MA, USA
| | | | - David C Green
- Department of Analytical, Environmental and Forensic Sciences, King's College London, London, UK
| | - Iman Hajirasouliha
- Weill Cornell Medicine, New York, NY, USA; The Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, New York, NY, USA
| | - Jaden J A Hastings
- Weill Cornell Medicine, New York, NY, USA; The Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, New York, NY, USA
| | | | - Gregorio Iraola
- Microbial Genomics Laboratory, Institut Pasteur de Montevideo, Montevideo, Uruguay; Center for Integrative Biology, Universidad Mayor, Santiago de Chile, Santiago, Chile; Wellcome Sanger Institute, Hinxton, UK
| | | | - Andre Kahles
- ETH Zurich, Department of Computer Science, Biomedical Informatics Group, Zurich, Switzerland; Kyiv Academic University, Kyiv, Ukraine; C+, Research Center in Technologies for Society, School of Engineering, Universidad del Desarrollo, Santiago, Chile
| | - Frank J Kelly
- Department of Analytical, Environmental and Forensic Sciences, King's College London, London, UK
| | - Kaymisha Knights
- Weill Cornell Medicine, New York, NY, USA; The Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, New York, NY, USA
| | - Nikos C Kyrpides
- Department of Energy, Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
| | - Paweł P Łabaj
- State Key Laboratory of Genetic Engineering (SKLGE) and MOE Key Laboratory of Contemporary Anthropology, School of Life Sciences, Human Phenome Institute, Fudan University, Shanghai, China; Małopolska Centre of Biotechnology, Jagiellonian University, Kraków, Poland; Boku University Vienna, Vienna, Austria
| | - Patrick K H Lee
- School of Energy and Environment, City University of Hong Kong, Hong Kong SAR, China
| | - Marcus H Y Leung
- School of Energy and Environment, City University of Hong Kong, Hong Kong SAR, China
| | - Per O Ljungdahl
- Department of Molecular Biosciences, The Wenner-Gren Institute, Stockholm University, Stockholm, Sweden
| | - Gabriella Mason-Buck
- Department of Analytical, Environmental and Forensic Sciences, King's College London, London, UK
| | - Ken McGrath
- Microba, 388 Queen St, Brisbane City, QLD 4000, Australia
| | - Cem Meydan
- Weill Cornell Medicine, New York, NY, USA; The Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, New York, NY, USA
| | - Emmanuel F Mongodin
- University of Maryland School of Medicine, Institute for Genome Sciences, Baltimore, MD, USA
| | | | | | | | - Houtan Noushmehr
- University of São Paulo, Ribeirão Preto Medical School, Ribeirão Preto - SP, Brazil
| | - Manuela Oliveira
- Instituto de Patologia e Imunologia Molecular da Universidade do Porto, Porto, Portugal
| | - Stephan Ossowski
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Institute of Medical Genetics and Applied Genomics, University of Tübingen, Tübingen, Germany; NGS Competence Center Tübingen (NCCT), University of Tübingen, Tübingen, Germany
| | - Olayinka O Osuolale
- Applied Environmental Metagenomics and Infectious Diseases Research (AEMIDR), Department of Biological Sciences, Elizade University, Ilara-Mokin, Nigeria
| | - Orhan Özcan
- Acibadem Mehmet Ali Aydınlar University, Istanbul, Turkey
| | - David Paez-Espino
- Department of Energy, Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
| | - Nicolás Rascovan
- Microbial Paleogenomics Unit, Institut Pasteur, CNRS UMR2000, Paris 75015, France
| | - Hugues Richard
- Sorbonne University, Faculty of Science, Institute of Biology Paris-Seine, Laboratory of Computational and Quantitative Biology, Paris, France; Robert Koch Institute, Berlin, Germany
| | - Gunnar Rätsch
- ETH Zurich, Department of Computer Science, Biomedical Informatics Group, Zurich, Switzerland; University Hospital Zurich, Biomedical Informatics Research, Zurich, Switzerland; Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Lynn M Schriml
- University of Maryland School of Medicine, Institute for Genome Sciences, Baltimore, MD, USA
| | | | | | - Leming Shi
- Center for Pharmacogenomics, School of Life Sciences and Shanghai Cancer Center, Fudan University, Shanghai, China; State Key Laboratory of Genetic Engineering (SKLGE) and MOE Key Laboratory of Contemporary Anthropology, School of Life Sciences, Human Phenome Institute, Fudan University, Shanghai, China
| | - Tieliu Shi
- The Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, the Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai, China
| | - Rania Siam
- University of Medicine and Health Sciences, St. Kitts, West Indies and American University in Cairo, Cairo, Egypt
| | - Le Huu Song
- 108 Military Central Hospital, Hanoi, Vietnam; Vietnamese-German Center for Medical Research (VG-CARE), Hanoi, Vietnam
| | | | - Denise Syndercombe Court
- Department of Analytical, Environmental and Forensic Sciences, King's College London, London, UK
| | | | - Xinzhao Tong
- School of Energy and Environment, City University of Hong Kong, Hong Kong SAR, China
| | - Klas I Udekwu
- Department of Molecular Biosciences, The Wenner-Gren Institute, Stockholm University, Stockholm, Sweden; SciLife EVP, Department of Aquatic Sciences Assessment, Swedish University of Agricultural Sciences, Uppsala, Sweden
| | - Juan A Ugalde
- Millennium Initiative for Collaborative Research on Bacterial Resistance, Santiago, Chile; C+, Research Center in Technologies for Society, School of Engineering, Universidad del Desarrollo, Santiago, Chile
| | - Brandon Valentine
- Weill Cornell Medicine, New York, NY, USA; The Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, New York, NY, USA
| | - Dimitar I Vassilev
- Faculty of Mathematics and Informatics, Sofia University "St. Kliment Ohridski," Sofia, Bulgaria
| | - Elena M Vayndorf
- Institute of Arctic Biology, University of Alaska, Fairbanks, Fairbanks, AK, USA
| | - Thirumalaisamy P Velavan
- Institute of Tropical Medicine, Universitätsklinikum Tübingen, Tübingen, Germany; Faculty of Medicine, Duy Tan University, Da Nang, Vietnam
| | - Jun Wu
- The Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, the Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai, China
| | | | - Jifeng Zhu
- Weill Cornell Medicine, New York, NY, USA; The Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, New York, NY, USA
| | - Sibo Zhu
- State Key Laboratory of Genetic Engineering (SKLGE) and MOE Key Laboratory of Contemporary Anthropology, School of Life Sciences, Human Phenome Institute, Fudan University, Shanghai, China; Department of Epidemiology, School of Public Health, Fudan University, Shanghai, China
| | - Christopher E Mason
- Weill Cornell Medicine, New York, NY, USA; The Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, New York, NY, USA; The WorldQuant Initiative for Quantitative Prediction, Weill Cornell Medicine, New York, NY, USA.
|
12
|
Liu R, Yeh YHJ, Varabyou A, Collora JA, Sherrill-Mix S, Talbot CC, Mehta S, Albrecht K, Hao H, Zhang H, Pollack RA, Beg SA, Calvi RM, Hu J, Durand CM, Ambinder RF, Hoh R, Deeks SG, Chiarella J, Spudich S, Douek DC, Bushman FD, Pertea M, Ho YC. Single-cell transcriptional landscapes reveal HIV-1-driven aberrant host gene transcription as a potential therapeutic target. Sci Transl Med 2021; 12:12/543/eaaz0802. [PMID: 32404504 DOI: 10.1126/scitranslmed.aaz0802] [Citation(s) in RCA: 66] [Impact Index Per Article: 22.0] [Received: 08/10/2019] [Revised: 10/29/2019] [Accepted: 04/17/2020] [Indexed: 12/22/2022]
Abstract
Understanding HIV-1-host interactions can identify the cellular environment supporting HIV-1 reactivation and mechanisms of clonal expansion. We developed HIV-1 SortSeq to isolate rare HIV-1-infected cells from virally suppressed, HIV-1-infected individuals upon early latency reversal. Single-cell transcriptome analysis of HIV-1 SortSeq+ cells revealed enrichment of nonsense-mediated RNA decay and viral transcription pathways. HIV-1 SortSeq+ cells up-regulated cellular factors that can support HIV-1 transcription (IMPDH1 and JAK1) or promote cellular survival (IL2 and IKBKB). HIV-1-host RNA landscape analysis at the integration site revealed that HIV-1 drives high aberrant host gene transcription downstream, but not upstream, of the integration site through HIV-1-to-host aberrant splicing, in which HIV-1 RNA splices into the host RNA and aberrantly drives host RNA transcription. HIV-1-induced aberrant transcription was driven by the HIV-1 promoter as shown by CRISPR-dCas9-mediated HIV-1-specific activation and could be suppressed by CRISPR-dCas9-mediated inhibition of HIV-1 5' long terminal repeat. Overall, we identified cellular factors supporting HIV-1 reactivation and HIV-1-driven aberrant host gene transcription as potential therapeutic targets to disrupt HIV-1 persistence.
Affiliation(s)
- Runxia Liu
- Department of Microbial Pathogenesis, Yale University School of Medicine, New Haven, CT 06519, USA
| | - Yang-Hui Jimmy Yeh
- Department of Microbial Pathogenesis, Yale University School of Medicine, New Haven, CT 06519, USA
| | - Ales Varabyou
- Department of Computer Science, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Jack A Collora
- Department of Microbial Pathogenesis, Yale University School of Medicine, New Haven, CT 06519, USA
| | - Scott Sherrill-Mix
- Department of Microbiology, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, USA
| | - C Conover Talbot
- Institute for Basic Biomedical Sciences, Johns Hopkins School of Medicine, Baltimore, MD 21205, USA
| | - Sameet Mehta
- Yale Center for Genome Analysis, Yale University, New Haven, CT 06519, USA
| | - Kristen Albrecht
- Department of Microbial Pathogenesis, Yale University School of Medicine, New Haven, CT 06519, USA
| | - Haiping Hao
- Institute for Basic Biomedical Sciences, Johns Hopkins School of Medicine, Baltimore, MD 21205, USA
| | - Hao Zhang
- Department of Molecular Microbiology and Immunology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205, USA
| | - Ross A Pollack
- Department of Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
| | - Subul A Beg
- Department of Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
| | - Rachela M Calvi
- Department of Neurology, Yale University School of Medicine, New Haven, CT 06519, USA
| | - Jianfei Hu
- Vaccine Research Center, National Institutes of Health, Bethesda, MD 20892, USA
| | - Christine M Durand
- Department of Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
| | - Richard F Ambinder
- Department of Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
| | - Rebecca Hoh
- Department of Medicine, University of California, San Francisco, CA 94110, USA
| | - Steven G Deeks
- Department of Medicine, University of California, San Francisco, CA 94110, USA
| | - Jennifer Chiarella
- Department of Neurology, Yale University School of Medicine, New Haven, CT 06519, USA
| | - Serena Spudich
- Department of Neurology, Yale University School of Medicine, New Haven, CT 06519, USA
| | - Daniel C Douek
- Vaccine Research Center, National Institutes of Health, Bethesda, MD 20892, USA
| | - Frederic D Bushman
- Department of Microbiology, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, USA
| | - Mihaela Pertea
- Department of Computer Science, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD 21218, USA; Department of Biomedical Engineering, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Ya-Chi Ho
- Department of Microbial Pathogenesis, Yale University School of Medicine, New Haven, CT 06519, USA.
|
13
|
Li J, Li H, Ye X, Zhang L, Xu Q, Ping Y, Jing X, Jiang W, Liao Q, Liu B, Wang Y. IIMLP: integrated information-entropy-based method for LncRNA prediction. BMC Bioinformatics 2021; 22:243. [PMID: 33980144 PMCID: PMC8117603 DOI: 10.1186/s12859-020-03884-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/15/2020] [Accepted: 11/17/2020] [Indexed: 11/26/2022] Open
Abstract
BACKGROUND The prediction of long non-coding RNA (lncRNA) has attracted great attention from researchers, as more and more evidence indicates that various complex human diseases are closely related to lncRNAs. In the era of biomedical big data, in addition to the prediction of lncRNAs by biological experimental methods, many computational methods based on machine learning have been proposed to make better use of the sequence resources of lncRNAs. RESULTS We developed an lncRNA prediction method by integrating information-entropy-based features and machine learning algorithms. We calculate generalized topological entropy and generate 6 novel features for lncRNA sequences. Using these 6 features together with other features such as open reading frame, we apply support vector machine, XGBoost and random forest algorithms to distinguish human lncRNAs. We compare our method with one that uses more K-mer features, and the results show that our method achieves a higher area under the curve, up to 99.7905%. CONCLUSIONS We developed an accurate and efficient method that uses novel information entropy features to analyze and classify lncRNAs. Our method is also extendable to research on other functional elements in DNA sequences.
Affiliation(s)
- Junyi Li
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, 518055, Guangdong, China.
| | - Huinian Li
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, 518055, Guangdong, China
| | - Xiao Ye
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, 518055, Guangdong, China
| | - Li Zhang
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, 518055, Guangdong, China
| | - Qingzhe Xu
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, 518055, Guangdong, China
| | - Yuan Ping
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, 518055, Guangdong, China
| | - Xiaozhu Jing
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, 518055, Guangdong, China
| | - Wei Jiang
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, 518055, Guangdong, China
| | - Qing Liao
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, 518055, Guangdong, China
| | - Bo Liu
- Center for Bioinformatics, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, Heilongjiang, China
| | - Yadong Wang
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, 518055, Guangdong, China.
- Center for Bioinformatics, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, Heilongjiang, China.
|
14
|
A novel entropy-based mapping method for determining the protein-protein interactions in viral genomes by using coevolution analysis. Biomed Signal Process Control 2021. [DOI: 10.1016/j.bspc.2020.102359] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/23/2022]
|
15
|
Uncovering patterns of the evolution of genomic sequence entropy and complexity. Mol Genet Genomics 2020; 296:289-298. [PMID: 33252723 DOI: 10.1007/s00438-020-01729-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/19/2019] [Accepted: 09/22/2020] [Indexed: 10/22/2022]
Abstract
The lack of consensus concerning the biological meaning of entropy and complexity of genomes, and the different ways of assessing these quantities, hamper conclusions about the causes of genomic entropy variation among species. This study aims to evaluate the entropy and complexity of genomic sequences of several species without using homologies, to assess relationships among these variables and non-molecular data (e.g., the number of individuals), and to seek a trigger of interspecific genomic entropy variation. The results indicate a relationship among genomic entropy, genome size, genomic complexity, and the number of individuals: species with a small number of individuals harbor large genomes and, hence, low entropy but higher complexity. We defined the complexity of a genome as depending on the entropy of each DNA segment within the genome. Thus, the entropy and complexity of a genome reflect solely its organization. Exons of vertebrates harbor smaller entropies than non-exon regions (likely because of the repeats that accumulated from duplications), whereas other taxonomic groups do not present this pattern. Our findings suggest that a small initial population might have defined current genomic entropy and complexity: present-day genomes are less complex than ancestral ones. Besides, our data disagree with the previously established relationship between phenotype and genomic entropies. Finally, by establishing the relationship of genomic entropy/complexity with the number of individuals and genome size from an evolutionary perspective, ideas concerning genomic variability may emerge.
|
16
|
Cofré R, Maldonado C, Cessac B. Thermodynamic Formalism in Neuronal Dynamics and Spike Train Statistics. ENTROPY (BASEL, SWITZERLAND) 2020; 22:E1330. [PMID: 33266513 PMCID: PMC7712217 DOI: 10.3390/e22111330] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 10/09/2020] [Revised: 11/13/2020] [Accepted: 11/15/2020] [Indexed: 12/04/2022]
Abstract
The Thermodynamic Formalism provides a rigorous mathematical framework for studying quantitative and qualitative aspects of dynamical systems. At its core, there is a variational principle that corresponds, in its simplest form, to the Maximum Entropy principle. It is used as a statistical inference procedure to represent, by specific probability measures (Gibbs measures), the collective behaviour of complex systems. This framework has found applications in different domains of science. In particular, it has been fruitful and influential in neurosciences. In this article, we review how the Thermodynamic Formalism can be exploited in the field of theoretical neuroscience, as a conceptual and operational tool, in order to link the dynamics of interacting neurons and the statistics of action potentials from either experimental data or mathematical models. We comment on perspectives and open problems in theoretical neuroscience that could be addressed within this formalism.
Affiliation(s)
- Rodrigo Cofré
- CIMFAV-Ingemat, Facultad de Ingeniería, Universidad de Valparaíso, Valparaíso 2340000, Chile
| | - Cesar Maldonado
- IPICYT/División de Matemáticas Aplicadas, San Luis Potosí 78216, Mexico
| | - Bruno Cessac
- Inria Biovision team and Neuromod Institute, Université Côte d’Azur, 06901 CEDEX Inria, France
|
17
|
Humphrey S, Kerr A, Rattray M, Dive C, Miller CJ. A model of k-mer surprisal to quantify local sequence information content surrounding splice regions. PeerJ 2020; 8:e10063. [PMID: 33194378 PMCID: PMC7648452 DOI: 10.7717/peerj.10063] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/27/2020] [Accepted: 09/08/2020] [Indexed: 12/22/2022] Open
Abstract
Molecular sequences carry information. Analysis of sequence conservation between homologous loci is a proven approach with which to explore the information content of molecular sequences. This is often done using multiple sequence alignments to support comparisons between homologous loci. These methods therefore rely on sufficient underlying sequence similarity with which to construct a representative alignment. Here we describe a method using a formal metric of information, surprisal, to analyse biological sub-sequences without alignment constraints. We applied our model to the genomes of five different species to reveal similar patterns across a panel of eukaryotes. As the surprisal of a sub-sequence is inversely proportional to its occurrence within the genome, the optimal size of the sub-sequences was selected for each species under consideration. With the model optimized, we found a strong correlation between surprisal and CG dinucleotide usage. The utility of our model was tested by examining the sequences of genes known to undergo splicing. We demonstrate that our model can identify biological features of interest such as known donor and acceptor sites. Analysis across all annotated coding exon junctions in Homo sapiens reveals the information content of coding exons to be greater than the surrounding intron regions, a consequence of increased suppression of the CG dinucleotide in intronic space. Sequences within coding regions proximal to exon junctions exhibited novel patterns within DNA and coding mRNA that are not a function of the encoded amino acid sequence. Our findings are consistent with the presence of secondary information encoding features such as DNA and RNA binding sites, multiplexed through the coding sequence and independent of the information required to define the corresponding amino-acid sequence. 
We conclude that surprisal provides a complementary methodology with which to locate regions of interest in the genome, particularly in situations that lack an appropriate multiple sequence alignment.
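The surprisal idea in this abstract can be illustrated with a minimal sketch. It assumes the standard information-theoretic definition, surprisal = -log2(p), where p is the relative frequency of a k-mer in the genome; the function names `kmer_surprisal` and `surprisal_profile` are hypothetical and this is not the authors' implementation or their procedure for choosing k.

```python
import math
from collections import Counter


def kmer_surprisal(genome: str, k: int) -> dict[str, float]:
    """Return the surprisal in bits, -log2(p), of every k-mer seen in genome."""
    # Count overlapping k-mers across the whole sequence.
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    total = sum(counts.values())
    return {kmer: -math.log2(n / total) for kmer, n in counts.items()}


def surprisal_profile(query: str, table: dict[str, float], k: int) -> list:
    """Look up the surprisal of each overlapping k-mer in query.

    k-mers never seen in the reference table map to None (assumption:
    unseen k-mers are left undefined rather than smoothed).
    """
    return [table.get(query[i:i + k]) for i in range(len(query) - k + 1)]


# Toy example: rarer k-mers carry more bits than frequent ones.
table = kmer_surprisal("ACGTACGTAC", k=2)
profile = surprisal_profile("ACGT", table, k=2)
```

Sliding such a profile across an exon junction is one way to compare local information content on either side of the boundary without an alignment.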
Affiliation(s)
- Sam Humphrey
- CRUK Manchester Institute Cancer Biomarker Centre, The University of Manchester, Manchester, United Kingdom
- CRUK Manchester Institute, CRUK Lung Cancer Centre of Excellence, Manchester, United Kingdom
| | - Alastair Kerr
- CRUK Manchester Institute Cancer Biomarker Centre, The University of Manchester, Manchester, United Kingdom
- CRUK Manchester Institute, CRUK Lung Cancer Centre of Excellence, Manchester, United Kingdom
| | - Magnus Rattray
- Division of Informatics, Imaging and Data Sciences, University of Manchester, Manchester, United Kingdom
| | - Caroline Dive
- CRUK Manchester Institute Cancer Biomarker Centre, The University of Manchester, Manchester, United Kingdom
- CRUK Manchester Institute, CRUK Lung Cancer Centre of Excellence, Manchester, United Kingdom
| | - Crispin J. Miller
- Computational Biology Group, CRUK Beatson Institute, Glasgow, United Kingdom
- Institute of Cancer Sciences, University of Glasgow, Glasgow, United Kingdom
|
18
|
Li J, Zhang L, Li H, Ping Y, Xu Q, Wang R, Tan R, Wang Z, Liu B, Wang Y. Integrated entropy-based approach for analyzing exons and introns in DNA sequences. BMC Bioinformatics 2019; 20:283. [PMID: 31182012 PMCID: PMC6557737 DOI: 10.1186/s12859-019-2772-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 12/29/2022] Open
Abstract
BACKGROUND Numerous essential algorithms and methods, including entropy-based quantitative methods, have been developed to analyze complex DNA sequences since the last decade. Exons and introns are the most notable components of DNA and their identification and prediction are always the focus of state-of-the-art research. RESULTS In this study, we designed an integrated entropy-based analysis approach, which involves modified topological entropy calculation, genomic signal processing (GSP) method and singular value decomposition (SVD), to investigate exons and introns in DNA sequences. We optimized and implemented the topological entropy and the generalized topological entropy to calculate the complexity of DNA sequences, highlighting the characteristics of repetition sequences. By comparing digitalizing entropy values of exons and introns, we observed that they are significantly different. After we converted DNA data to numerical topological entropy value, we applied SVD method to effectively investigate exon and intron regions on a single gene sequence. Additionally, several genes across five species are used for exon predictions. CONCLUSIONS Our approach not only helps to explore the complexity of DNA sequence and its functional elements, but also provides an entropy-based GSP method to analyze exon and intron regions. Our work is feasible across different species and extendable to analyze other components in both coding and noncoding region of DNA sequences.
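For readers unfamiliar with topological entropy, the sketch below shows one common formulation for finite DNA sequences: pick the largest word length n such that 4^n + n - 1 does not exceed the sequence length, take the prefix of exactly that length, and report log4 of the number of distinct n-mers divided by n. This is an assumption about the baseline definition only; the modified and generalized variants described in the abstract are not reproduced here, and the function name is hypothetical.

```python
import math


def topological_entropy(seq: str, alphabet_size: int = 4) -> float:
    """Topological entropy of a finite sequence, in the range [0, 1]."""
    length = len(seq)
    # Largest n with alphabet_size**n + n - 1 <= len(seq).
    n = 0
    while alphabet_size ** (n + 1) + (n + 1) - 1 <= length:
        n += 1
    if n == 0:
        raise ValueError("sequence is too short for word length 1")
    # The prefix holds exactly alphabet_size**n overlapping n-mers, so the
    # distinct-word count can reach its maximum without length bias.
    window = seq[: alphabet_size ** n + n - 1]
    distinct = {window[i:i + n] for i in range(len(window) - n + 1)}
    return math.log(len(distinct), alphabet_size) / n


low = topological_entropy("AAAA")    # a single repeated word
high = topological_entropy("ACGT")   # every 1-mer occurs
```

Highly repetitive sequences score near 0 and sequences that use every possible word score near 1, which is why the measure highlights repeat content.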
Affiliation(s)
- Junyi Li
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
| | - Li Zhang
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
| | - Huinian Li
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
| | - Yuan Ping
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
| | - Qingzhe Xu
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
| | - Rongjie Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, 150001 China
| | - Renjie Tan
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, 150001 China
| | - Zhen Wang
- CAS Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031 China
| | - Bo Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, 150001 China
| | - Yadong Wang
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, 150001 China
|
19
|
Wu C, Chen J, Liu Y, Hu X. Improved Prediction of Regulatory Element Using Hybrid Abelian Complexity Features with DNA Sequences. Int J Mol Sci 2019; 20:ijms20071704. [PMID: 30959806 PMCID: PMC6480087 DOI: 10.3390/ijms20071704] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 02/26/2019] [Revised: 04/01/2019] [Accepted: 04/02/2019] [Indexed: 12/14/2022] Open
Abstract
Deciphering the code of cis-regulatory element (CRE) is one of the core issues of current biology. As an important category of CRE, enhancers play crucial roles in gene transcriptional regulations in a distant manner. Further, the disruption of an enhancer can cause abnormal transcription and, thus, trigger human diseases, which means that its accurate identification is currently of broad interest. Here, we introduce an innovative concept, i.e., abelian complexity function (ACF), which is a more complex extension of the classic subword complexity function, for a new coding of DNA sequences. After feature selection by an upper bound estimation and integration with DNA composition features, we developed an enhancer prediction model with hybrid abelian complexity features (HACF). Compared with existing methods, HACF shows consistently superior performance on three sources of enhancer datasets. We tested the generalization ability of HACF by scanning human chromosome 22 to validate previously reported super-enhancers. Meanwhile, we identified novel candidate enhancers which have supports from enhancer-related ENCODE ChIP-seq signals. In summary, HACF improves current enhancer prediction and may be beneficial for further prioritization of functional noncoding variants.
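The difference between subword complexity and abelian complexity can be shown with a short sketch. It assumes the usual combinatorics-on-words definitions: subword complexity counts distinct length-n substrings, while abelian complexity counts distinct composition (Parikh) vectors, so substrings that are rearrangements of each other are counted once. The function names are hypothetical, and the feature selection and hybrid model of the paper are not reproduced.

```python
from collections import Counter


def subword_complexity(seq: str, n: int) -> int:
    """Number of distinct length-n substrings of seq."""
    return len({seq[i:i + n] for i in range(len(seq) - n + 1)})


def abelian_complexity(seq: str, n: int, alphabet: str = "ACGT") -> int:
    """Number of distinct letter-count vectors among length-n substrings."""
    vectors = set()
    for i in range(len(seq) - n + 1):
        counts = Counter(seq[i:i + n])
        # Counter returns 0 for letters absent from the substring.
        vectors.add(tuple(counts[ch] for ch in alphabet))
    return len(vectors)


# "AC" and "CA" are different subwords but share one composition vector.
sub = subword_complexity("ACCA", 2)
abel = abelian_complexity("ACCA", 2)
```

Evaluating the abelian count for several values of n gives a small numeric vector per sequence, which is the kind of representation a classifier can consume.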
Affiliation(s)
- Chengchao Wu
- College of Informatics, Agricultural Bioinformatics Key Laboratory of Hubei Province, Huazhong Agricultural University, Wuhan 430070, China.
| | - Jin Chen
- College of Science, Huazhong Agricultural University, Wuhan 430070, China.
| | - Yunxia Liu
- College of Informatics, Agricultural Bioinformatics Key Laboratory of Hubei Province, Huazhong Agricultural University, Wuhan 430070, China.
| | - Xuehai Hu
- College of Informatics, Agricultural Bioinformatics Key Laboratory of Hubei Province, Huazhong Agricultural University, Wuhan 430070, China.
|
20
|
Kistler L, Ware R, Smith O, Collins M, Allaby RG. A new model for ancient DNA decay based on paleogenomic meta-analysis. Nucleic Acids Res 2017; 45:6310-6320. [PMID: 28486705 PMCID: PMC5499742 DOI: 10.1093/nar/gkx361] [Citation(s) in RCA: 92] [Impact Index Per Article: 13.1] [Received: 03/21/2017] [Revised: 04/15/2017] [Accepted: 04/20/2017] [Indexed: 01/04/2023] Open
Abstract
The persistence of DNA over archaeological and paleontological timescales in diverse environments has led to a revolutionary body of paleogenomic research, yet the dynamics of DNA degradation are still poorly understood. We analyzed 185 paleogenomic datasets and compared DNA survival with environmental variables and sample ages. We find cytosine deamination follows a conventional thermal age model, but we find no correlation between DNA fragmentation and sample age over the timespans analyzed, even when controlling for environmental variables. We propose a model for ancient DNA decay wherein fragmentation rapidly reaches a threshold, then subsequently slows. The observed loss of DNA over time may be due to a bulk diffusion process in many cases, highlighting the importance of tissues and environments creating effectively closed systems for DNA preservation. This model of DNA degradation is largely based on mammal bone samples due to published genomic dataset availability. Continued refinement to the model to reflect diverse biological systems and tissue types will further improve our understanding of ancient DNA breakdown dynamics.
MESH Headings
- Base Composition
- Base Sequence
- DNA Fragmentation
- DNA, Ancient/analysis
- DNA, Ancient/chemistry
- DNA, Mitochondrial/analysis
- DNA, Mitochondrial/chemistry
- DNA, Mitochondrial/genetics
- DNA, Plant/genetics
- Deamination
- Genome, Human
- Genome, Mitochondrial
- Humans
- Meta-Analysis as Topic
- Models, Chemical
- Paleontology/methods
- Sequence Analysis, DNA
- Thermodynamics
Affiliation(s)
- Logan Kistler
  - School of Life Sciences, University of Warwick, Coventry CV4 7AL, UK
  - Department of Anthropology, National Museum of Natural History, Smithsonian Institution, Washington, DC 20560, USA
- Roselyn Ware
  - School of Life Sciences, University of Warwick, Coventry CV4 7AL, UK
- Oliver Smith
  - School of Life Sciences, University of Warwick, Coventry CV4 7AL, UK
  - Section for Evolutionary Genomics, Centre for GeoGenetics, Natural History Museum of Denmark, University of Copenhagen, 1307 Copenhagen K, Denmark
- Matthew Collins
  - Section for Evolutionary Genomics, Centre for GeoGenetics, Natural History Museum of Denmark, University of Copenhagen, 1307 Copenhagen K, Denmark
  - Department of Archaeology, University of York, PO Box 373, York, UK
- Robin G. Allaby
  - School of Life Sciences, University of Warwick, Coventry CV4 7AL, UK
|
21
|
Abstract
Based on Shannon's theory of communication, the information content of a polymeric macromolecule can be calculated in bits by adding the entropies of its building blocks. Proteins, DNA and RNA are such macromolecules. When only the variation in building blocks is considered as the source of entropy, a protein of a given size appears to carry less information than the coding sequence of the corresponding mRNA. This decrease in information seems contradictory, but the apparent conflict is resolved by treating the conformational variation of proteins as a new variable in the calculation and balancing the approximated entropies of the coding part of the mRNA and of the protein. Because probabilities can change, we also assigned hypothetical probabilities to the conformational states; these represent an uneven distribution of the time spent in each conformation and give the probability of the protein being in any one of the possible conformations. Results obtained with the hypothetical probabilities are in line with experimental values for conformational-state variation in protein populations. This equalization approach has further biological relevance: it compensates for the degeneracy in codon usage during protein translation, and it leads to the conclusion that the alphabet size for proteins is close to optimal for proper protein function within the thermodynamic milieu of the cell. The findings are also discussed in relation to codon bias and have implications for the concept of codon evolution. Ultimately, this work brings together the fields of protein structural studies and molecular protein translation with a novel approach.
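The apparent information gap described in this abstract can be illustrated with a small calculation. The sketch below is not the paper's method: it assumes every symbol is independent and uniformly distributed, ignores stop codons, and uses a hypothetical protein length of 100 residues. The function name `uniform_information_bits` is introduced here for illustration only.

```python
import math

def uniform_information_bits(length: int, alphabet_size: int) -> float:
    """Bits carried by `length` independent symbols drawn uniformly from an alphabet."""
    return length * math.log2(alphabet_size)

# Hypothetical example: a protein of 100 residues and its 300-nucleotide coding sequence.
protein_length = 100
mrna_bits = uniform_information_bits(3 * protein_length, 4)   # three nucleotides per codon
protein_bits = uniform_information_bits(protein_length, 20)   # twenty standard amino acids

# Under these assumptions the coding sequence carries more bits than the protein.
gap_per_residue = (mrna_bits - protein_bits) / protein_length
print(f"mRNA: {mrna_bits:.1f} bits, protein: {protein_bits:.1f} bits, "
      f"gap: {gap_per_residue:.3f} bits per residue")
```

Under the uniform assumption the gap is log2(64) - log2(20) bits per residue, which is the quantity the abstract proposes to balance with conformational entropy.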
Affiliation(s)
- Y Adiguzel
  - Biophysics Department, School of Medicine, Istanbul Kemerburgaz University, Istanbul, Turkey.
|
22
|
|
23
|
Wu C, Yao S, Li X, Chen C, Hu X. Genome-Wide Prediction of DNA Methylation Using DNA Composition and Sequence Complexity in Human. Int J Mol Sci 2017; 18:E420. [PMID: 28212312 PMCID: PMC5343954 DOI: 10.3390/ijms18020420] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 01/03/2017] [Revised: 02/03/2017] [Accepted: 02/08/2017] [Indexed: 02/02/2023] Open
Abstract
DNA methylation plays a significant role in transcriptional regulation by repressing activity. Change of the DNA methylation level is an important factor affecting the expression of target genes and downstream phenotypes. Because current experimental technologies can only assay a small proportion of CpG sites in the human genome, it is urgent to develop reliable computational models for predicting genome-wide DNA methylation. Here, we proposed a novel algorithm that accurately extracted sequence complexity features (seven features) and developed a support-vector-machine-based prediction model with integration of the reported DNA composition features (trinucleotide frequency and GC content, 65 features) by utilizing the methylation profiles of embryonic stem cells in human. The prediction results from 22 human chromosomes with size-varied windows showed that the 600-bp window achieved the best average accuracy of 94.7%. Moreover, comparisons with two existing methods further showed the superiority of our model, and cross-species predictions on mouse data also demonstrated that our model has certain generalization ability. Finally, a statistical test of the experimental data and the predicted data on functional regions annotated by ChromHMM found that six out of 10 regions were consistent, which implies reliable prediction of unassayed CpG sites. Accordingly, we believe that our novel model will be useful and reliable in predicting DNA methylation.
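The 65 DNA composition features mentioned in this abstract (trinucleotide frequency and GC content) can be sketched as 64 trinucleotide frequencies plus one GC fraction. The code below is an illustrative assumption about how such a vector might be built for one window; it is not the authors' implementation, and the name `composition_features` and the handling of ambiguous bases are choices made here.

```python
from itertools import product

# All 64 trinucleotides in a fixed, reproducible order.
TRINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=3)]

def composition_features(window: str) -> list[float]:
    """Return 64 overlapping-trinucleotide frequencies followed by the GC fraction."""
    seq = window.upper()
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    total = 0
    for i in range(len(seq) - 2):
        tri = seq[i:i + 3]
        if tri in counts:          # skip triplets containing N or other symbols
            counts[tri] += 1
            total += 1
    freqs = [counts[t] / total if total else 0.0 for t in TRINUCLEOTIDES]
    acgt = sum(seq.count(base) for base in "ACGT")
    gc = (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0
    return freqs + [gc]

features = composition_features("ACGTACGT")
print(len(features))
```

A vector like this would typically be computed per window (the abstract reports a 600-bp window as best) and passed to a classifier together with the sequence complexity features.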
Affiliation(s)
- Chengchao Wu
  - College of Informatics, Agricultural Bioinformatics Key Laboratory of Hubei Province, Huazhong Agricultural University, Wuhan 430070, China.
- Shixin Yao
  - College of Science, Huazhong Agricultural University, Wuhan 430070, China.
- Xinghao Li
  - College of Science, Huazhong Agricultural University, Wuhan 430070, China.
- Chujia Chen
  - College of Informatics, Agricultural Bioinformatics Key Laboratory of Hubei Province, Huazhong Agricultural University, Wuhan 430070, China.
- Xuehai Hu
  - College of Informatics, Agricultural Bioinformatics Key Laboratory of Hubei Province, Huazhong Agricultural University, Wuhan 430070, China.
|
24
|
Yu Y, Yang L, Liu Z, Zhu C. Gene essentiality prediction based on fractal features and machine learning. Mol Biosyst 2017; 13:577-584. [PMID: 28145541 DOI: 10.1039/c6mb00806b] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Indexed: 01/06/2023]
Abstract
Predicting bacterial essential genes using only fractal features.
Affiliation(s)
- Yongming Yu
  - Department of Biomedical Engineering, Shandong University, Jinan, China
- Licai Yang
  - Department of Biomedical Engineering, Shandong University, Jinan, China
- Zhiping Liu
  - Department of Biomedical Engineering, Shandong University, Jinan, China
- Chuansheng Zhu
  - Department of Hematology, Shandong University Affiliated Qianfoshan Hospital, Jinan, China
|
25
|
Bonnici V, Manca V. Informational laws of genome structures. Sci Rep 2016; 6:28840. [PMID: 27354155 PMCID: PMC4937431 DOI: 10.1038/srep28840] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Received: 02/01/2016] [Accepted: 06/09/2016] [Indexed: 01/06/2023] Open
Abstract
In recent years, the analysis of genomes by means of strings of length k occurring in the genomes, called k-mers, has provided important insights into the basic mechanisms and design principles of genome structures. In the present study, we focus on the proper choice of the value of k for applying information theoretic concepts that express intrinsic aspects of genomes. The value k = lg2(n), where n is the genome length, is determined to be the best choice in the definition of some genomic informational indexes that are studied and computed for seventy genomes. These indexes, which are based on information entropies and on suitable comparisons with random genomes, suggest five informational laws, which all of the considered genomes obey. Moreover, an informational genome complexity measure is proposed, which is a generalized logistic map that balances entropic and anti-entropic components of genomes and is related to their evolutionary dynamics. Finally, applications to computational synthetic biology are briefly outlined.
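As a rough illustration of the k-mer approach in this abstract, the sketch below computes the empirical Shannon entropy of overlapping k-mers and picks k from the genome length. It reads lg2(n) as the base-2 logarithm of n and rounds to the nearest integer; the rounding rule and the function names `kmer_entropy` and `suggested_k` are assumptions made here, not details taken from the paper.

```python
import math
from collections import Counter

def kmer_entropy(genome: str, k: int) -> float:
    """Empirical Shannon entropy, in bits, of the overlapping k-mers of a sequence."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def suggested_k(genome_length: int) -> int:
    """Word length near log2(n); rounding to an integer is an assumption of this sketch."""
    return max(1, round(math.log2(genome_length)))

toy_genome = "ACGT" * 256              # hypothetical 1024-base sequence
k = suggested_k(len(toy_genome))
print(k, kmer_entropy(toy_genome, k))
```

For a real genome the entropy at this k would be compared against that of a random sequence of the same length, which is the kind of comparison the abstract describes.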
Affiliation(s)
- Vincenzo Bonnici
  - Department of Computer Science, University of Verona, Verona 37134, Italy
  - Center for BioMedical Computing, University of Verona, Verona 37134, Italy
- Vincenzo Manca
  - Department of Computer Science, University of Verona, Verona 37134, Italy
  - Center for BioMedical Computing, University of Verona, Verona 37134, Italy
|
26
|
Thomas D, Finan C, Newport MJ, Jones S. DNA entropy reveals a significant difference in complexity between housekeeping and tissue specific gene promoters. Comput Biol Chem 2015; 58:19-24. [PMID: 25988219 DOI: 10.1016/j.compbiolchem.2015.05.001] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 08/01/2014] [Revised: 05/01/2015] [Accepted: 05/01/2015] [Indexed: 10/23/2022]
Abstract
BACKGROUND The complexity of DNA can be quantified using estimates of entropy. Variation in DNA complexity is expected between the promoters of genes with different transcriptional mechanisms; namely housekeeping (HK) and tissue specific (TS). The former are transcribed constitutively to maintain general cellular functions, and the latter are transcribed in restricted tissue and cell types for specific molecular events. It is known that promoter features in the human genome are related to tissue specificity, but this has been difficult to quantify on a genomic scale. If entropy effectively quantifies DNA complexity, calculating the entropies of HK and TS gene promoters as profiles may reveal significant differences. RESULTS Entropy profiles were calculated for a total dataset of 12,003 human gene promoters and for 501 housekeeping (HK) and 587 tissue specific (TS) human gene promoters. The mean profiles show the TS promoters have a significantly lower entropy (p<2.2e-16) than HK gene promoters. The entropy distributions for the 3 datasets show that promoter entropies could be used to identify novel HK genes. CONCLUSION Functional features comprise DNA sequence patterns that are non-random and hence they have lower entropies. The lower entropy of TS gene promoters can be explained by a higher density of positive and negative regulatory elements, required for genes with complex spatial and temporal expression.
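One simple way to picture an entropy profile along a promoter is a sliding-window Shannon entropy over single nucleotides. This is only a sketch under that assumption: the abstract does not state which entropy estimator or window size the authors used, and the names `window_entropy` and `entropy_profile` are hypothetical.

```python
import math
from collections import Counter

def window_entropy(window: str) -> float:
    """Shannon entropy, in bits, of the nucleotide composition of one window."""
    counts = Counter(window)
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def entropy_profile(seq: str, window: int, step: int = 1) -> list[float]:
    """Entropy of each window of length `window`, moving `step` bases at a time."""
    return [
        window_entropy(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]

# Toy sequence: a low-complexity stretch followed by a mixed stretch.
profile = entropy_profile("AAAAACGT", window=4, step=4)
print(profile)
```

Averaging such profiles position by position across a set of promoters would give a mean profile of the kind the abstract compares between housekeeping and tissue specific genes.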
Affiliation(s)
- David Thomas
  - Brighton and Sussex Medical School, University of Sussex, Brighton BN1 9PX, UK
- Chris Finan
  - Brighton and Sussex Medical School, University of Sussex, Brighton BN1 9PX, UK
- Melanie J Newport
  - Brighton and Sussex Medical School, University of Sussex, Brighton BN1 9PX, UK
- Susan Jones
  - The James Hutton Institute, Invergowrie, Dundee DD2 5DA, UK
|
27
|
Downarowicz T, Travisany D, Montecino M, Maass A. Symbolic extensions applied to multiscale structure of genomes. Acta Biotheor 2014; 62:145-69. [PMID: 24728912 PMCID: PMC4012164 DOI: 10.1007/s10441-014-9215-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 06/10/2013] [Accepted: 03/25/2014] [Indexed: 11/27/2022]
Abstract
A genome of a living organism consists of a long string of symbols over a finite alphabet carrying critical information for the organism. This includes its ability to control postnatal growth, homeostasis, adaptation to changes in the surrounding environment, or to biochemically respond at the cellular level to various specific regulatory signals. In this sense, a genome represents a symbolic encoding of a highly organized system of information whose functioning may be revealed as a natural multilayer structure in terms of complexity and prominence. In this paper we use the mathematical theory of symbolic extensions as a framework to shed light onto how this multilayer organization is reflected in the symbolic coding of the genome. The distribution of data in an element of a standard symbolic extension of a dynamical system has a specific form: the symbolic sequence is divided into several subsequences (which we call layers) encoding the dynamics on various "scales". We propose that a similar structure resides within the genomes, building our analogy on some of the most recent findings in the field of regulation of genomic DNA functioning.
Affiliation(s)
- Tomasz Downarowicz
  - Institute of Mathematics and Computer Science, Wroclaw University of Technology, Wybrzeże Wyspiańskiego 27, 50-370 Wrocław, Poland
- Dante Travisany
  - Center for Mathematical Modeling, FONDAP Center for Genome Regulation, University of Chile, Blanco Encalada 2120, Santiago, Chile
- Martin Montecino
  - Faculty of Biological Sciences and Faculty of Medicine, Center for Biomedical Research, FONDAP Center for Genome Regulation, Universidad Andrés Bello, Avenida República 239, Santiago, Chile
- Alejandro Maass
  - Department of Mathematical Engineering, Center for Mathematical Modeling, FONDAP Center for Genome Regulation, University of Chile, Blanco Encalada 2120, Santiago, Chile
|
28
|
Holzinger A, Dehmer M, Jurisica I. Knowledge Discovery and interactive Data Mining in Bioinformatics--State-of-the-Art, future challenges and research directions. BMC Bioinformatics 2014; 15 Suppl 6:I1. [PMID: 25078282 PMCID: PMC4140208 DOI: 10.1186/1471-2105-15-s6-i1] [Citation(s) in RCA: 134] [Impact Index Per Article: 13.4] [Indexed: 01/28/2023] Open
Affiliation(s)
- Andreas Holzinger
  - Research Unit Human-Computer Interaction, Austrian IBM Watson Think Group, Institute for Medical Informatics, Statistics & Documentation, Medical University Graz, Austria
  - Institute of Information Systems and Computer Media, Graz University of Technology, Austria
- Matthias Dehmer
  - Institute for Bioinformatics and Translational Research, UMIT Tyrol, Austria
- Igor Jurisica
  - Departments of Medical Biophysics and Computer Science, University of Toronto, Ontario, Canada
  - Princess Margaret Cancer Centre and Techna Institute for the Advancement of Technology for Health, University Health Network, IBM Life Sciences Discovery Centre, Ontario, Canada
|
29
|
Vinga S. Information theory applications for biological sequence analysis. Brief Bioinform 2014; 15:376-89. [PMID: 24058049 PMCID: PMC7109941 DOI: 10.1093/bib/bbt068] [Citation(s) in RCA: 67] [Impact Index Per Article: 6.7] [Received: 07/11/2013] [Accepted: 08/17/2013] [Indexed: 01/13/2023] Open
Abstract
Information theory (IT) addresses the analysis of communication systems and has been widely applied in molecular biology. In particular, alignment-free sequence analysis and comparison greatly benefited from concepts derived from IT, such as entropy and mutual information. This review covers several aspects of IT applications, ranging from genome global analysis and comparison, including block-entropy estimation and resolution-free metrics based on iterative maps, to local analysis, comprising the classification of motifs, prediction of transcription factor binding sites and sequence characterization based on linguistic complexity and entropic profiles. IT has also been applied to high-level correlations that combine DNA, RNA or protein features with sequence-independent properties, such as gene mapping and phenotype analysis, and has also provided models based on communication systems theory to describe information transmission channels at the cell level and also during evolutionary processes. While not exhaustive, this review attempts to categorize existing methods and to indicate their relation with broader transversal topics such as genomic signatures, data compression and complexity, time series analysis and phylogenetic classification, providing a resource for future developments in this promising area.
Affiliation(s)
- Susana Vinga
  - IDMEC, Instituto Superior Técnico - Universidade de Lisboa (IST-UL), Av. Rovisco Pais, 1049-001 Lisboa, Portugal. Tel.: +351-218419504; Fax: +351-218498097
|
30
|
Jin S, Tan R, Jiang Q, Xu L, Peng J, Wang Y, Wang Y. A generalized topological entropy for analyzing the complexity of DNA sequences. PLoS One 2014; 9:e88519. [PMID: 24533097 PMCID: PMC3922877 DOI: 10.1371/journal.pone.0088519] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Received: 10/11/2013] [Accepted: 01/06/2014] [Indexed: 11/19/2022] Open
Abstract
Topological entropy is one of the most difficult entropies to use for analyzing DNA sequences, due to the finite sample and high-dimensionality problems. In order to overcome these problems, a generalized topological entropy is introduced. The relationship between the topological entropy and the generalized topological entropy is examined, which shows that the topological entropy is a special case of the generalized entropy. As an application, the generalized topological entropy in introns, exons and promoter regions was computed. The results indicate that the entropy of introns is higher than that of exons, and the entropy of exons is higher than that of promoter regions for each chromosome, which suggests that the DNA sequence of promoter regions is more regular than that of exons and introns.
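For orientation, the sketch below computes the ordinary (non-generalized) topological entropy of a finite sequence, using a definition commonly applied to DNA: take the largest word length n such that 4^n + n - 1 does not exceed the sequence length, count the distinct n-letter subwords in the first 4^n + n - 1 letters, and normalize the base-4 logarithm of that count by n. That definition is an assumption brought in here for illustration; the generalized entropy introduced in the paper is not reproduced, and the function name is hypothetical.

```python
import math

def topological_entropy(seq: str, alphabet_size: int = 4) -> float:
    """Finite-sequence topological entropy; ranges from 0 (one repeated word) to 1."""
    # Largest n such that alphabet_size**n + n - 1 <= len(seq).
    n = 0
    while alphabet_size ** (n + 1) + n <= len(seq):
        n += 1
    if n == 0:
        raise ValueError("sequence is too short for any word length")
    # Use only the prefix that contains exactly alphabet_size**n subwords of length n.
    prefix = seq[: alphabet_size ** n + n - 1]
    distinct = {prefix[i:i + n] for i in range(len(prefix) - n + 1)}
    return math.log(len(distinct), alphabet_size) / n

print(topological_entropy("ACGT"), topological_entropy("AAAA"))
```

Truncating to a fixed-length prefix is what makes values comparable across sequences of different lengths, and it also hints at the finite-sample limitation the abstract mentions, since most of a long sequence is ignored.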
Affiliation(s)
- Shuilin Jin
  - Department of Mathematics, Harbin Institute of Technology, Harbin, Heilongjiang, China
- Renjie Tan
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
- Qinghua Jiang
  - School of Life Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
- Li Xu
  - College of Computer Science and Technology, Harbin Engineering University, Harbin, Heilongjiang, China
- Jiajie Peng
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
- Yong Wang
  - Department of Mathematics, Harbin Institute of Technology, Harbin, Heilongjiang, China
- Yadong Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
|
31
|
Coding sequence density estimation via topological pressure. J Math Biol 2014; 70:45-69. [PMID: 24448658 DOI: 10.1007/s00285-014-0754-2] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 04/09/2013] [Revised: 12/31/2013] [Indexed: 10/25/2022]
Abstract
We give a new approach to coding sequence (CDS) density estimation in genomic analysis based on the topological pressure, which we develop from a well-known concept in ergodic theory. Topological pressure measures the 'weighted information content' of a finite word, and incorporates 64 parameters which can be interpreted as a choice of weight for each nucleotide triplet. We train the parameters so that the topological pressure fits the observed coding sequence density on the human genome, and use this to give ab initio predictions of CDS density over windows of size around 66,000 bp on the genomes of Mus musculus, rhesus macaque and Drosophila melanogaster. While the differences between these genomes are too great to expect that training on the human genome could predict, for example, the exact locations of genes, we demonstrate that our method gives reasonable estimates for the 'coarse scale' problem of predicting CDS density. Inspired again by ergodic theory, the weightings of the nucleotide triplets obtained from our training procedure are used to define a probability distribution on finite sequences, which can be used to distinguish between intron and exon sequences from the human genome of lengths between 750 and 5,000 bp. At the end of the paper, we explain the theoretical underpinning for our approach, which is the theory of Thermodynamic Formalism from the dynamical systems literature. Mathematica and MATLAB implementations of our method are available at http://sourceforge.net/projects/topologicalpres/ .
|
32
|
Structural complexity of DNA sequence. Comput Math Methods Med 2013; 2013:628036. [PMID: 23662161 PMCID: PMC3638703 DOI: 10.1155/2013/628036] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 01/10/2013] [Accepted: 03/03/2013] [Indexed: 11/17/2022]
Abstract
In modern bioinformatics, finding an efficient way to locate sequence fragments with biological functions is an important issue. This paper presents a structural approach based on context-free grammars extracted from original DNA or protein sequences. This approach is radically different from statistical methods. Furthermore, this approach is compared with a topological entropy-based method for consistency and difference of the complexity results.
|