1
Hussein M, Andrade dos Ramos Z, Vink MA, Kroon P, Yu Z, Enjuanes L, Zuñiga S, Berkhout B, Herrera-Carrillo E. Efficient CRISPR-Cas13d-Based Antiviral Strategy to Combat SARS-CoV-2. Viruses 2023; 15:v15030686. [PMID: 36992394 PMCID: PMC10051389 DOI: 10.3390/v15030686] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/17/2023] [Revised: 02/27/2023] [Accepted: 02/28/2023] [Indexed: 03/08/2023]
Abstract
The current SARS-CoV-2 pandemic forms a major global health burden. Although protective vaccines are available, concerns remain as new virus variants continue to appear. CRISPR-based gene-editing approaches offer an attractive therapeutic strategy as the CRISPR-RNA (crRNA) can be adjusted rapidly to accommodate a new viral genome sequence. This study aimed at using the RNA-targeting CRISPR-Cas13d system to attack highly conserved sequences in the viral RNA genome, thereby preparing for future zoonotic outbreaks of other coronaviruses. We designed 29 crRNAs targeting highly conserved sequences along the complete SARS-CoV-2 genome. Several crRNAs demonstrated efficient silencing of a reporter with the matching viral target sequence and efficient inhibition of a SARS-CoV-2 replicon. The crRNAs that suppress SARS-CoV-2 were also able to suppress SARS-CoV, thus demonstrating the breadth of this antiviral strategy. Strikingly, we observed that only crRNAs directed against the plus-genomic RNA demonstrated antiviral activity in the replicon assay, in contrast to those that bind the minus-genomic RNA, the replication intermediate. These results point to a major difference in the vulnerability and biology of the +RNA versus −RNA strands of the SARS-CoV-2 genome and provide important insights for the design of RNA-targeting antivirals.
Affiliation(s)
- Mouraya Hussein
  - Laboratory of Experimental Virology, Department of Medical Microbiology, Amsterdam UMC, Academic Medical Center, University of Amsterdam, 1105 AZ Amsterdam, The Netherlands
- Zaria Andrade dos Ramos
  - Laboratory of Experimental Virology, Department of Medical Microbiology, Amsterdam UMC, Academic Medical Center, University of Amsterdam, 1105 AZ Amsterdam, The Netherlands
- Monique A. Vink
  - Laboratory of Experimental Virology, Department of Medical Microbiology, Amsterdam UMC, Academic Medical Center, University of Amsterdam, 1105 AZ Amsterdam, The Netherlands
- Pascal Kroon
  - Laboratory of Experimental Virology, Department of Medical Microbiology, Amsterdam UMC, Academic Medical Center, University of Amsterdam, 1105 AZ Amsterdam, The Netherlands
- Zhenghao Yu
  - Laboratory of Experimental Virology, Department of Medical Microbiology, Amsterdam UMC, Academic Medical Center, University of Amsterdam, 1105 AZ Amsterdam, The Netherlands
- Luis Enjuanes
  - Department of Molecular and Cell Biology, National Center of Biotechnology (CNB-CSIC), Campus Universidad Autónoma de Madrid, 28049 Madrid, Spain
- Sonia Zuñiga
  - Department of Molecular and Cell Biology, National Center of Biotechnology (CNB-CSIC), Campus Universidad Autónoma de Madrid, 28049 Madrid, Spain
- Ben Berkhout
  - Laboratory of Experimental Virology, Department of Medical Microbiology, Amsterdam UMC, Academic Medical Center, University of Amsterdam, 1105 AZ Amsterdam, The Netherlands
- Elena Herrera-Carrillo
  - Laboratory of Experimental Virology, Department of Medical Microbiology, Amsterdam UMC, Academic Medical Center, University of Amsterdam, 1105 AZ Amsterdam, The Netherlands

2
Mesa-Rodríguez A, Gonzalez A, Estevez-Rams E, Valdes-Sosa PA. Cancer Segmentation by Entropic Analysis of Ordered Gene Expression Profiles. ENTROPY (BASEL, SWITZERLAND) 2022; 24:1744. [PMID: 36554151 PMCID: PMC9777913 DOI: 10.3390/e24121744] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/22/2022] [Revised: 11/24/2022] [Accepted: 11/24/2022] [Indexed: 06/17/2023]
Abstract
The availability of massive gene expression data poses the challenge of how to curate, process, and extract useful information. Here, we describe the use of entropic measures as discriminating criteria in cancer using the whole data set of gene expression levels. These methods were applied to classify samples as tumor or normal for 13 types of tumors, with a high success ratio. Ordering gene expression by pathways yields complexity-entropy diagrams. The resulting map allows the clustering of tumor and normal samples, with a high success rate for nine of the thirteen cancer types studied. Further analysis using information distance also shows good discriminating behavior and, more importantly, allows for discriminating between cancer types. Together, our results allow the classification of tissues without the need to identify relevant genes or impose a particular cancer model. The procedure can be extended to classification problems beyond the reported results.
Affiliation(s)
- Ania Mesa-Rodríguez
  - The Clinical Hospital of Chengdu Brain Science Institute, University of Electronic Sciences and Technology of China, Chengdu 610054, China
  - Facultad de Matemática, Universidad de La Habana, San Lazaro y L, La Habana 10400, Cuba
- Augusto Gonzalez
  - The Clinical Hospital of Chengdu Brain Science Institute, University of Electronic Sciences and Technology of China, Chengdu 610054, China
  - Instituto de Cibernética, Matemática y Física, La Habana 10400, Cuba
- Ernesto Estevez-Rams
  - Facultad de Física, Instituto de Ciencias y Tecnología de Materiales (IMRE), Universidad de La Habana, San Lazaro y L, La Habana 10400, Cuba
- Pedro A. Valdes-Sosa
  - The Clinical Hospital of Chengdu Brain Science Institute, University of Electronic Sciences and Technology of China, Chengdu 610054, China
  - Centro de Neurociencias, BioCubaFarma, La Habana 10400, Cuba

3
Hussein M, Andrade dos Ramos Z, Berkhout B, Herrera-Carrillo E. In Silico Prediction and Selection of Target Sequences in the SARS-CoV-2 RNA Genome for an Antiviral Attack. Viruses 2022; 14:v14020385. [PMID: 35215977 PMCID: PMC8880226 DOI: 10.3390/v14020385] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/11/2022] [Revised: 02/07/2022] [Accepted: 02/08/2022] [Indexed: 12/10/2022]
Abstract
The SARS-CoV-2 pandemic has spurred the development of protective vaccines and the search for specific antiviral drugs. Modern molecular biology tools provide alternative methods, such as CRISPR-Cas and RNA interference, that can be adapted as antiviral approaches and contribute to this search. The unique CRISPR-Cas13d system, with the small crRNA guide molecule, mediates a sequence-specific attack on RNA, and can be developed as an anti-coronavirus strategy. We analyzed the SARS-CoV-2 genome to localize the hypothetically best crRNA-annealing sites of 23 nucleotides based on our extensive expertise with sequence-specific antiviral strategies. We considered target sites whose sequence is well-conserved among SARS-CoV-2 isolates. As we should prepare for a potential future outbreak of related viruses, we screened for targets that are conserved between SARS-CoV-2 and SARS-CoV. To further broaden the search, we screened for targets that are conserved between SARS-CoV-2 and the more distantly related MERS-CoV, as well as the four other human coronaviruses (OC43, 229E, NL63, HKU1). Finally, we performed a search for pan-corona target sequences that are conserved among all these coronaviruses that are able to replicate in humans, including the new Omicron variant. This survey may contribute to the design of effective, safe, and escape-proof antiviral strategies to prepare for future pandemics.
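As a rough illustration of screening for conserved target sites, the sketch below scans pre-aligned genome sequences for 23-nucleotide windows that are identical in every sequence. It is not the authors' pipeline: the function name `conserved_windows`, the requirement that the input is already aligned to equal length, and the strict 100% identity criterion are all assumptions made for this example.

```python
def conserved_windows(aligned_seqs, width=23):
    """Return 0-based start positions of windows identical across all sequences.

    Hypothetical helper: assumes the sequences are already aligned to the
    same length and treats any column containing a gap ('-') as not conserved.
    """
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")

    # A column is conserved when every sequence carries the same non-gap base.
    conserved = [
        len({s[i] for s in aligned_seqs}) == 1 and aligned_seqs[0][i] != "-"
        for i in range(length)
    ]

    hits = []
    run = 0  # consecutive conserved columns ending at the current position
    for i, ok in enumerate(conserved):
        run = run + 1 if ok else 0
        if run >= width:
            hits.append(i - width + 1)
    return hits
```

With real data one would likely relax the identity criterion (for example, tolerate rare variants) and then filter the surviving windows by other design rules; those steps are outside this sketch.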
Affiliation(s)
- Ben Berkhout
  - Correspondence: Tel.: +31-20-566-4822
- Elena Herrera-Carrillo
  - Correspondence: Tel.: +31-20-566-4865

4
Antich A, Palacín C, Turon X, Wangensteen OS. DnoisE: distance denoising by entropy. An open-source parallelizable alternative for denoising sequence datasets. PeerJ 2022; 10:e12758. [PMID: 35111399 PMCID: PMC8783565 DOI: 10.7717/peerj.12758] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 07/13/2021] [Accepted: 12/16/2021] [Indexed: 01/07/2023]
Abstract
DNA metabarcoding is broadly used in biodiversity studies encompassing a wide range of organisms. Erroneous amplicons, generated during amplification and sequencing procedures, constitute one of the major sources of concern for the interpretation of metabarcoding results. Several denoising programs have been implemented to detect and eliminate these errors. However, almost all denoising software currently available has been designed to process non-coding ribosomal sequences, most notably prokaryotic 16S rDNA. The growing number of metabarcoding studies using coding markers such as COI or RuBisCO demands a re-assessment and calibration of denoising algorithms. Here we present DnoisE, the first denoising program designed to detect erroneous reads and merge them with the correct ones using information from the natural variability (entropy) associated to each codon position in coding barcodes. We have developed an open-source software using a modified version of the UNOISE algorithm. DnoisE implements different merging procedures as options, and can incorporate codon entropy information either retrieved from the data or supplied by the user. In addition, the algorithm of DnoisE is parallelizable, greatly reducing runtimes on computer clusters. Our program also allows different input file formats, so it can be readily incorporated into existing metabarcoding pipelines.
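To make the notion of per-codon-position variability concrete, here is a minimal sketch that computes the mean Shannon entropy of alignment columns grouped by codon position. It only illustrates the entropy measurement, not how DnoisE uses it; the function names and the assumption that the alignment starts in frame at codon position 1 are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Shannon entropy (bits) of the symbol frequencies in a sequence."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def codon_position_entropy(aligned_seqs):
    """Mean column entropy for codon positions 1, 2 and 3.

    Assumes all sequences have equal length and that column 0 is the
    first position of a codon.
    """
    length = len(aligned_seqs[0])
    sums = [0.0, 0.0, 0.0]
    counts = [0, 0, 0]
    for i in range(length):
        column = [s[i] for s in aligned_seqs]
        sums[i % 3] += shannon_entropy(column)
        counts[i % 3] += 1
    return [s / n if n else 0.0 for s, n in zip(sums, counts)]
```

In protein-coding markers the third codon position is usually the most variable, so one would expect the last value to be the largest on real data.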
Affiliation(s)
- Adrià Antich
  - Department of Marine Ecology, Centre for Advanced Studies of Blanes (CEAB-CSIC), Blanes (Girona), Catalonia, Spain
- Creu Palacín
  - Department of Evolutionary Biology, Ecology and Environmental Sciences and Biodiversity Research Institute (IRBIO), University of Barcelona, Barcelona, Catalonia, Spain
- Xavier Turon
  - Department of Marine Ecology, Centre for Advanced Studies of Blanes (CEAB-CSIC), Blanes (Girona), Catalonia, Spain
- Owen S. Wangensteen
  - Norwegian School of Fishery Science, UiT The Arctic University of Norway, Tromsø, Troms og Finnmark, Norway

5
Bussi Y, Kapon R, Reich Z. Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy. PLoS One 2021; 16:e0258693. [PMID: 34648558 PMCID: PMC8516232 DOI: 10.1371/journal.pone.0258693] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 04/30/2021] [Accepted: 10/02/2021] [Indexed: 12/24/2022]
Abstract
Information theoretic approaches are ubiquitous and effective in a wide variety of bioinformatics applications. In comparative genomics, alignment-free methods, based on short DNA words, or k-mers, are particularly powerful. We evaluated the utility of varying k-mer lengths for genome comparisons by analyzing their sequence space coverage of 5805 genomes in the KEGG GENOME database. In subsequent analyses on four k-mer lengths spanning the relevant range (11, 21, 31, 41), hierarchical clustering of 1634 genus-level representative genomes using pairwise 21- and 31-mer Jaccard similarities best recapitulated a phylogenetic/taxonomic tree of life with clear boundaries for superkingdom domains and high subtree similarity for named taxons at lower levels (family through phylum). By analyzing ~14.2M prokaryotic genome comparisons by their lowest-common-ancestor taxon levels, we detected many potential misclassification errors in a curated database, further demonstrating the need for wide-scale adoption of quantitative taxonomic classifications based on whole-genome similarity.
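The pairwise k-mer Jaccard similarity mentioned above can be sketched in a few lines. This is a simplified illustration, not the authors' code: the function names are hypothetical, k-mers are taken from one strand only (no reverse-complement canonicalization), and whole k-mer sets are held in memory, which would not scale to thousands of genomes without sketching or sampling.

```python
def kmers(seq, k):
    """Set of all overlapping k-mers in seq (single strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(seq_a, seq_b, k=21):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of the two k-mer sets."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

A matrix of such pairwise values (or the distances 1 − J) is the kind of input a hierarchical clustering routine would take.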
Affiliation(s)
- Yuval Bussi
  - Department of Biomolecular Sciences, Weizmann Institute of Science, Rehovot, Israel
  - Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel
  - Department of Molecular Cell Biology, Weizmann Institute of Science, Rehovot, Israel
- Ruti Kapon
  - Department of Biomolecular Sciences, Weizmann Institute of Science, Rehovot, Israel
- Ziv Reich
  - Department of Biomolecular Sciences, Weizmann Institute of Science, Rehovot, Israel

6
Information Entropy in Chemistry: An Overview. ENTROPY 2021; 23:e23101240. [PMID: 34681964 PMCID: PMC8534366 DOI: 10.3390/e23101240] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Received: 08/12/2021] [Revised: 09/19/2021] [Accepted: 09/20/2021] [Indexed: 12/20/2022]
Abstract
Basic applications of the information entropy concept to chemical objects are reviewed. These applications deal with quantifying chemical and electronic structures of molecules, signal processing, structural studies on crystals, and molecular ensembles. Recent advances in the mentioned areas make information entropy a central concept in interdisciplinary studies on digitalizing chemical reactions, chemico-information synthesis, crystal engineering, as well as digitally rethinking basic notions of structural chemistry in terms of informatics.

7
Pasookhush P, Usmani A, Suwannahong K, Palittapongarnpim P, Rukseree K, Ariyachaokun K, Buates S, Siripattanapipong S, Ajawatanawong P. Single-Strand Conformation Polymorphism Fingerprint Method for Dictyostelids. Front Microbiol 2021; 12:708685. [PMID: 34512585 PMCID: PMC8431811 DOI: 10.3389/fmicb.2021.708685] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/14/2021] [Accepted: 07/22/2021] [Indexed: 11/13/2022]
Abstract
Dictyostelid social amoebae are a highly diverse group of eukaryotic soil microbes that are valuable resources for biological research. Genetic diversity study of these organisms solely relies on molecular phylogenetics of the SSU rDNA gene, which is not ideal for large-scale genetic diversity study. Here, we designed a set of PCR–single-strand conformation polymorphism (SSCP) primers and optimized the SSCP fingerprint method for the screening of dictyostelids. The optimized SSCP condition required gel purification of the SSCP amplicons followed by electrophoresis using a 9% polyacrylamide gel under 4°C. We also tested the optimized SSCP procedure with 73 Thai isolates of dictyostelid that had the SSU rDNA gene sequences published. The SSCP fingerprint patterns were related to the genus-level taxonomy of dictyostelids, but the fingerprint dendrogram did not reflect the deep phylogeny. This method is rapid, cost-effective, and suitable for large-scale sample screening as compared with the phylogenetic analysis of the SSU rDNA gene sequences.
Affiliation(s)
- Phongthana Pasookhush
  - Division of Bioinformatics and Data Management for Research, Research Division, Faculty of Medicine Siriraj Hospital, Mahidol University, Bangkok, Thailand
- Asmatullah Usmani
  - Department of Microbiology, Faculty of Science, Mahidol University, Bangkok, Thailand
  - Department of Biology, Faculty of Education, Kandahar University, Kandahar, Afghanistan
- Kowit Suwannahong
  - Department of Environmental Health, Faculty of Public Health, Burapa University, Chonburi, Thailand
- Prasit Palittapongarnpim
  - Department of Microbiology, Faculty of Science, Mahidol University, Bangkok, Thailand
  - National Science and Technology Development Agency (NSTDA), Thailand Science Park, Khlong Nueng, Thailand
- Kamolchanok Rukseree
  - Department of Sciences and Liberal Arts, Mahidol University, Amnatcharoen Campus, Bung, Thailand
- Kanchiyaphat Ariyachaokun
  - Department of Biological Sciences, Faculty of Science, Ubon Ratchathani University, Ubon Ratchathani, Thailand
- Sureemas Buates
  - Department of Microbiology, Faculty of Science, Mahidol University, Bangkok, Thailand
- Pravech Ajawatanawong
  - Division of Bioinformatics and Data Management for Research, Research Division, Faculty of Medicine Siriraj Hospital, Mahidol University, Bangkok, Thailand

8
Lewis RN, Soma M, de Kort SR, Gilman RT. Like Father Like Son: Cultural and Genetic Contributions to Song Inheritance in an Estrildid Finch. Front Psychol 2021; 12:654198. [PMID: 34149539 PMCID: PMC8213215 DOI: 10.3389/fpsyg.2021.654198] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 01/15/2021] [Accepted: 05/05/2021] [Indexed: 11/25/2022]
Abstract
Social learning of vocalizations is integral to song inheritance in oscine passerines. However, other factors, such as genetic inheritance and the developmental environment, can also influence song phenotype. The relative contributions of these factors can have a strong influence on song evolution and may affect important evolutionary processes such as speciation. However, relative contributions are well-described only for a few species and are likely to vary with taxonomy. Using archived song data, we examined patterns of song inheritance in a domestic population of Java sparrows (Lonchura oryzivora), some of which had been cross-fostered. Six-hundred and seventy-six songs from 73 birds were segmented and classified into notes and note subtypes (N = 22,972), for which a range of acoustic features were measured. Overall, we found strong evidence for cultural inheritance of song structure and of the acoustic characteristics of notes; sons’ song syntax and note composition were similar to that of their social fathers and were not influenced by genetic relatedness. For vocal consistency of note subtypes, a measure of vocal performance, there was no apparent evidence of social or genetic inheritance, but both age and developmental environment influenced consistency. These findings suggest that high learning fidelity of song material, i.e., song structure and note characteristics, could allow novel variants to be preserved and accumulate over generations, with implications for evolution and conservation. However, differences in vocal performance do not show strong links to cultural inheritance, instead potentially serving as condition dependent signals.
Affiliation(s)
- Rebecca N Lewis
  - Department of Earth and Environmental Sciences, University of Manchester, Manchester, United Kingdom
  - Chester Zoo, Chester, United Kingdom
- Masayo Soma
  - Department of Biology, Faculty of Science, Hokkaido University, Hokkaido, Japan
- Selvino R de Kort
  - Department of Natural Sciences, Ecology and Environment Research Centre, Manchester Metropolitan University, Manchester, United Kingdom
- R Tucker Gilman
  - Department of Earth and Environmental Sciences, University of Manchester, Manchester, United Kingdom

9
Antich A, Palacin C, Wangensteen OS, Turon X. To denoise or to cluster, that is not the question: optimizing pipelines for COI metabarcoding and metaphylogeography. BMC Bioinformatics 2021; 22:177. [PMID: 33820526 PMCID: PMC8020537 DOI: 10.1186/s12859-021-04115-6] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Received: 01/09/2021] [Accepted: 03/30/2021] [Indexed: 01/04/2023]
Abstract
BACKGROUND The recent blooming of metabarcoding applications to biodiversity studies comes with some relevant methodological debates. One such issue concerns the treatment of reads by denoising or by clustering methods, which have been wrongly presented as alternatives. It has also been suggested that denoised sequence variants should replace clusters as the basic unit of metabarcoding analyses, missing the fact that sequence clusters are a proxy for species-level entities, the basic unit in biodiversity studies. We argue here that methods developed and tested for ribosomal markers have been uncritically applied to highly variable markers such as cytochrome oxidase I (COI) without conceptual or operational (e.g., parameter setting) adjustment. COI has a naturally high intraspecies variability that should be assessed and reported, as it is a source of highly valuable information. We contend that denoising and clustering are not alternatives. Rather, they are complementary and both should be used together in COI metabarcoding pipelines. RESULTS Using a COI dataset from benthic marine communities, we compared two denoising procedures (based on the UNOISE3 and the DADA2 algorithms), set suitable parameters for denoising and clustering, and applied these steps in different orders. Our results indicated that the UNOISE3 algorithm preserved a higher intra-cluster variability. We introduce the program DnoisE to implement the UNOISE3 algorithm taking into account the natural variability (measured as entropy) of each codon position in protein-coding genes. This correction increased the number of sequences retained by 88%. The order of the steps (denoising and clustering) had little influence on the final outcome. CONCLUSIONS We highlight the need for combining denoising and clustering, with adequate choice of stringency parameters, in COI metabarcoding. We present a program that uses the coding properties of this marker to improve the denoising step. We recommend that researchers report their results in terms of both denoised sequences (a proxy for haplotypes) and clusters formed (a proxy for species), and avoid collapsing the sequences of the latter into a single representative. This will allow studies at the cluster (ideally equating species-level diversity) and at the intra-cluster level, and will ease additivity and comparability between studies.
Affiliation(s)
- Adrià Antich
  - Department of Marine Ecology, Centre for Advanced Studies of Blanes (CEAB-CSIC), Blanes (Girona), Catalonia, Spain
- Creu Palacin
  - Department of Evolutionary Biology, Ecology and Environmental Sciences, University of Barcelona and Research Institute of Biodiversity (IRBIO), Barcelona, Catalonia, Spain
- Owen S Wangensteen
  - Norwegian College of Fishery Science, UiT The Arctic University of Norway, Tromsö, Norway
- Xavier Turon
  - Department of Marine Ecology, Centre for Advanced Studies of Blanes (CEAB-CSIC), Blanes (Girona), Catalonia, Spain

10
Górski AZ, Piwowar M. Nucleotide spacing distribution analysis for human genome. Mamm Genome 2021; 32:123-128. [PMID: 33723659 PMCID: PMC8012312 DOI: 10.1007/s00335-021-09865-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/15/2020] [Accepted: 03/02/2021] [Indexed: 11/30/2022]
Abstract
The distribution of nucleotide spacing in the human genome was investigated. An analysis of how frequently sequences of different lengths flanked by one type of nucleotide occur in the human genome showed that the distribution has no self-similar (fractal) structure. The results nevertheless revealed several characteristic features: (i) the distribution for short-range spacing is quite similar to that of purely stochastic sequences; (ii) the distribution for long-range spacing deviates substantially from the random sequence distribution, showing strong long-range correlations; (iii) the differences between (A, T) and (C, G) nucleotides are quite significant; (iv) the spacing distribution displays tiny oscillations.
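A spacing distribution of this kind can be tallied as in the sketch below. The definition used here is an assumption for illustration: spacing is the difference between the positions of consecutive occurrences of the chosen nucleotide, so two adjacent copies have spacing 1. The paper may count the enclosed segment length instead, which would shift every value by one. The function names are hypothetical. For comparison, a purely random sequence in which the nucleotide occurs independently with probability p gives geometrically distributed spacings, P(d) = p(1 − p)^(d − 1).

```python
from collections import Counter


def spacing_distribution(seq, nucleotide):
    """Count distances between consecutive occurrences of one nucleotide.

    Spacing is the position difference, so adjacent copies give spacing 1
    (an assumed convention for this sketch).
    """
    positions = [i for i, base in enumerate(seq) if base == nucleotide]
    return Counter(b - a for a, b in zip(positions, positions[1:]))


def geometric_baseline(p, d):
    """Probability of spacing d in an i.i.d. sequence where the nucleotide
    occurs with probability p at each position."""
    return p * (1 - p) ** (d - 1)
```

Plotting the observed counts against this baseline on a logarithmic scale is one simple way to see where the long-range behaviour departs from the random expectation.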
Affiliation(s)
- Andrzej Z Górski
  - Polish Academy of Sciences, Institute of Nuclear Physics, Radzikowskiego 152 st, 31-342, Kraków, Poland
- Monika Piwowar
  - Jagiellonian University, Collegium Medicum, Kopernika 7E st, 31-034, Kraków, Poland

11
Nykrynova M, Barton V, Sedlar K, Bezdicek M, Lengerova M, Skutkova H. Word Entropy-Based Approach to Detect Highly Variable Genetic Markers for Bacterial Genotyping. Front Microbiol 2021; 12:631605. [PMID: 33613503 PMCID: PMC7886790 DOI: 10.3389/fmicb.2021.631605] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/20/2020] [Accepted: 01/13/2021] [Indexed: 11/13/2022]
Abstract
Genotyping methods are used to distinguish bacterial strains within one species. They make it possible to distinguish bacterial strains on a global scale, between countries, or between local districts in one country. However, highly selected bacterial populations (e.g., local populations in a hospital) are typically closely related and show low diversity. Therefore, currently used typing methods are not able to distinguish individual strains from each other. Here, we present a novel pipeline to detect highly variable genetic segments for genotyping a closely related bacterial population. The method is based on the degree of disorder in the analyzed sequences, which can be represented by sequence entropy. With the identified variable sequences, it is possible to trace transmission routes and sources of highly virulent and multiresistant strains. The proposed method can be used for any bacterial population and, because it covers the whole genome, also examines non-coding regions.
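One way to picture an entropy-based scan for variable segments is shown below: each window of an alignment is treated as a "word", and the entropy of the word frequencies across strains scores how well that window separates them. This is a guess at the general idea rather than the published pipeline; the function names, the fixed window width and step, and the assumption of pre-aligned equal-length genomes are all hypothetical.

```python
import math
from collections import Counter


def word_entropy(words):
    """Shannon entropy (bits) of the distribution of distinct words."""
    counts = Counter(words)
    total = len(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def variable_windows(aligned_genomes, width, step):
    """Score each window by the entropy of its words across strains.

    Returns (start, entropy) pairs sorted from most to least variable.
    Assumes all genomes are aligned to the same length.
    """
    length = len(aligned_genomes[0])
    scores = []
    for start in range(0, length - width + 1, step):
        words = [g[start:start + width] for g in aligned_genomes]
        scores.append((start, word_entropy(words)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

A window in which every strain carries a different word reaches the maximum score of log2(number of strains), while a window identical in all strains scores zero.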
Affiliation(s)
- Marketa Nykrynova
  - Department of Biomedical Engineering, Faculty of Electrical Engineering and Communication, Brno University of Technology, Brno, Czechia
- Vojtech Barton
  - Department of Biomedical Engineering, Faculty of Electrical Engineering and Communication, Brno University of Technology, Brno, Czechia
- Karel Sedlar
  - Department of Biomedical Engineering, Faculty of Electrical Engineering and Communication, Brno University of Technology, Brno, Czechia
- Matej Bezdicek
  - Department of Internal Medicine - Hematology and Oncology, University Hospital Brno, Brno, Czechia
- Martina Lengerova
  - Department of Internal Medicine - Hematology and Oncology, University Hospital Brno, Brno, Czechia
- Helena Skutkova
  - Department of Biomedical Engineering, Faculty of Electrical Engineering and Communication, Brno University of Technology, Brno, Czechia

12
Markić I, Štula M, Zorić M, Stipaničev D. Entropy-Based Approach in Selection Exact String-Matching Algorithms. ENTROPY (BASEL, SWITZERLAND) 2020; 23:E31. [PMID: 33379282 PMCID: PMC7824336 DOI: 10.3390/e23010031] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 11/14/2020] [Revised: 12/19/2020] [Accepted: 12/22/2020] [Indexed: 11/16/2022]
Abstract
The string-matching paradigm is applied in every branch of computer science and in science in general. The plethora of string-matching algorithms makes it hard to choose the best one for any particular case. Expressing, measuring, and testing algorithm efficiency is a challenging task with many potential pitfalls. Algorithm efficiency can be measured based on the usage of different resources. In software engineering, algorithmic productivity is a property of an algorithm execution identified with the computational resources the algorithm consumes. Resource usage in algorithm execution can be determined, and for maximum efficiency the goal is to minimize it. Guided by the fact that standard measures of algorithm efficiency, such as execution time, directly depend on the number of executed actions, and without addressing computer power consumption or memory use, which also depend on the algorithm type and the techniques used in algorithm development, we have developed a methodology that enables researchers to choose an efficient algorithm for a specific domain. The efficiency of string-searching algorithms is usually observed independently of the domain texts being searched. This paper presents the idea that algorithm efficiency depends on the properties of the searched string and of the texts being searched, accompanied by a theoretical analysis of the proposed approach. In the proposed methodology, algorithm efficiency is expressed through a character comparison count metric, a formal quantitative measure independent of implementation subtleties and computer platform differences. The model is developed for a particular problem domain using appropriate domain data (patterns and texts) and ranks algorithms for that domain according to the patterns' entropy. The proposed approach is limited to on-line exact string-matching problems and is based on the information entropy of the search pattern. Meticulous empirical testing illustrates the implementation of the methodology and supports its soundness.
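The two quantities the methodology relies on, the entropy of a search pattern and the number of character comparisons an algorithm performs, can be illustrated with a brute-force matcher. This is only a sketch under stated assumptions: the function names are hypothetical, the entropy is computed over single-character frequencies, and the naive algorithm stands in for whichever exact matchers the paper actually ranks.

```python
import math
from collections import Counter


def pattern_entropy(pattern):
    """Shannon entropy (bits per character) of the pattern's characters."""
    counts = Counter(pattern)
    n = len(pattern)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def naive_search_comparisons(text, pattern):
    """Brute-force exact matching that also counts character comparisons.

    Returns (match_positions, comparison_count). The count does not depend
    on hardware or implementation details, only on the inputs.
    """
    comparisons = 0
    matches = []
    n, m = len(text), len(pattern)
    for start in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1
            if text[start + j] != pattern[j]:
                break
            j += 1
        if j == m:
            matches.append(start)
    return matches, comparisons
```

Instrumenting several algorithms with the same counter and grouping the results by pattern entropy would give the kind of per-domain ranking the abstract describes.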
Affiliation(s)
- Ivan Markić
  - Faculty of Electrical Engineering, Mechanical Engineering and Naval Architecture, University of Split, 21000 Split, Croatia
- Maja Štula
  - Department of Electronics and Computing, Faculty of Electrical Engineering, Mechanical Engineering and Naval Architecture, University of Split, 21000 Split, Croatia
- Marija Zorić
  - IT Department, Faculty of Electrical Engineering, Mechanical Engineering and Naval Architecture, University of Split, 21000 Split, Croatia
- Darko Stipaničev
  - Department of Electronics and Computing, Faculty of Electrical Engineering, Mechanical Engineering and Naval Architecture, University of Split, 21000 Split, Croatia

13
Identification of Regulatory SNPs Associated with Vicine and Convicine Content of Vicia faba Based on Genotyping by Sequencing Data Using Deep Learning. Genes (Basel) 2020; 11:genes11060614. [PMID: 32516876 PMCID: PMC7349281 DOI: 10.3390/genes11060614] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 04/29/2020] [Revised: 05/26/2020] [Accepted: 05/28/2020] [Indexed: 12/15/2022]
Abstract
Faba bean (Vicia faba) is a grain legume, which is globally grown for both human consumption as well as feed for livestock. Despite its agro-ecological importance the usage of Vicia faba is severely hampered by its anti-nutritive seed-compounds vicine and convicine (V+C). The genes responsible for a low V+C content have not yet been identified. In this study, we aim to computationally identify regulatory SNPs (rSNPs), i.e., SNPs in promoter regions of genes that are deemed to govern the V+C content of Vicia faba. For this purpose we first trained a deep learning model with the gene annotations of seven related species of the Leguminosae family. Applying our model, we predicted putative promoters in a partial genome of Vicia faba that we assembled from genotyping-by-sequencing (GBS) data. Exploiting the synteny between Medicago truncatula and Vicia faba, we identified two rSNPs which are statistically significantly associated with V+C content. In particular, the allele substitutions regarding these rSNPs result in dramatic changes of the binding sites of the transcription factors (TFs) MYB4, MYB61, and SQUA. The knowledge about TFs and their rSNPs may enhance our understanding of the regulatory programs controlling V+C content of Vicia faba and could provide new hypotheses for future breeding programs.
|
14
|
Sheng Q, Yu H, Oyebamiji O, Wang J, Chen D, Ness S, Zhao YY, Guo Y. AnnoGen: annotating genome-wide pragmatic features. Bioinformatics 2020; 36:2899-2901. [PMID: 31930398 PMCID: PMC7203733 DOI: 10.1093/bioinformatics/btaa027] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 11/12/2019] [Revised: 12/19/2019] [Accepted: 01/08/2020] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Genome annotation is an important step for all in-depth bioinformatics analysis. It is imperative to augment the quantity and diversity of genome-wide annotation data for the latest reference genome to promote its adoption by ongoing and future impactful studies. RESULTS We developed a Python toolkit, AnnoGen, which for the first time allows the annotation of three pragmatic genomic features for the GRCh38 genome in enormous base-wise quantities. The three features are chemical binding Energy, sequence information Entropy and Homology Score. The Homology Score is an exceptional feature that captures genome-wide homology through single-base-offset tiling windows of 100 contiguous nucleotide bases. AnnoGen can annotate these pragmatic features for user-specified genomic regions of variable size and can optionally compare two parallel sets of genomic regions. AnnoGen offers simple usage modes and a succinct HTML report of informative statistical tables and plots. AVAILABILITY AND IMPLEMENTATION https://github.com/shengqh/annogen.
Affiliation(s)
- Quanhu Sheng
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37232, USA
- Hui Yu
- Department of Internal Medicine, Comprehensive Cancer Center, University of New Mexico, Albuquerque, NM 87109, USA
- Olufunmilola Oyebamiji
- Department of Internal Medicine, Comprehensive Cancer Center, University of New Mexico, Albuquerque, NM 87109, USA
- Jiandong Wang
- Department of Computer Science, University of South Carolina, Columbia, SC 29205, USA
- Danqian Chen
- Key Laboratory of Resource Biology and Biotechnology, Western China School of Life Sciences, Northwest University, Xi'an, Shaanxi, China
- Scott Ness
- Department of Internal Medicine, Comprehensive Cancer Center, University of New Mexico, Albuquerque, NM 87109, USA
- Ying-Yong Zhao
- Key Laboratory of Resource Biology and Biotechnology, Western China School of Life Sciences, Northwest University, Xi'an, Shaanxi, China
- Yan Guo
- Department of Internal Medicine, Comprehensive Cancer Center, University of New Mexico, Albuquerque, NM 87109, USA
|
15
|
Turon X, Antich A, Palacín C, Præbel K, Wangensteen OS. From metabarcoding to metaphylogeography: separating the wheat from the chaff. Ecological Applications 2020; 30:e02036. [PMID: 31709684 PMCID: PMC7078904 DOI: 10.1002/eap.2036] [Citation(s) in RCA: 48] [Impact Index Per Article: 12.0] [Received: 05/13/2019] [Revised: 07/31/2019] [Accepted: 10/03/2019] [Indexed: 05/31/2023]
Abstract
Metabarcoding is by now a well-established method for biodiversity assessment in terrestrial, freshwater, and marine environments. Metabarcoding data sets are usually used for α- and β-diversity estimates, that is, interspecies (or inter-MOTU [molecular operational taxonomic unit]) patterns. However, the use of hypervariable metabarcoding markers may provide an enormous amount of intraspecies (intra-MOTU) information, mostly untapped so far. The use of cytochrome oxidase (COI) amplicons is gaining momentum in metabarcoding studies targeting eukaryote richness. COI has long been the marker of choice in population genetics and phylogeographic studies. Therefore, COI metabarcoding data sets may be used to study intraspecies patterns and phylogeographic features for hundreds of species simultaneously, opening a new field that we suggest naming metaphylogeography. The main challenge for the implementation of this approach is the separation of erroneous sequences from true intra-MOTU variation. Here, we develop a cleaning protocol based on changes in entropy of the different codon positions of the COI sequence, together with co-occurrence patterns of sequences. Using a data set of community DNA from several benthic littoral communities in the Mediterranean and Atlantic seas, we first tested by simulation on a subset of sequences a two-step cleaning approach consisting of a denoising step followed by a minimal abundance filtering. The procedure was then applied to the whole data set. We obtained a total of 563 MOTUs that were usable for phylogeographic inference. We used semiquantitative rank data instead of read abundances to perform AMOVAs and haplotype networks. Genetic variability was mainly concentrated within samples, but with an important between-seas component as well. There were intergroup differences in the amount of variability between and within communities in each sea. For two species, the results could be compared with traditional Sanger sequence data available for the same zones, giving similar patterns. Our study shows that metabarcoding data can be used to infer intra- and interpopulation genetic variability of many species at a time, providing a new method with great potential for basic biogeography, connectivity and dispersal studies, and for the more applied fields of conservation genetics, invasion genetics, and design of protected areas.
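As a rough illustration of the quantity this abstract builds on, the sketch below computes the mean Shannon entropy of alignment columns grouped by codon position. It is not the authors' cleaning protocol: the function names, the assumption that the sequences are aligned, of equal length and in frame (index 0 is codon position 1), and the toy input are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    counts = Counter(column)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def entropy_by_codon_position(seqs):
    """Mean column entropy for codon positions 1, 2 and 3.

    Assumes (hypothetically) that all sequences are aligned, have equal
    length, and start in frame, so index i belongs to codon position i % 3.
    """
    sums = [0.0, 0.0, 0.0]
    n_cols = [0, 0, 0]
    for i in range(len(seqs[0])):
        pos = i % 3
        sums[pos] += column_entropy([s[i] for s in seqs])
        n_cols[pos] += 1
    return [sums[p] / n_cols[p] if n_cols[p] else 0.0 for p in range(3)]


# Toy alignment: only the third position of the second codon varies.
toy = ["ATGGCA", "ATGGCC", "ATGGCG", "ATGGCT"]
per_position = entropy_by_codon_position(toy)
```

In real coding sequences the third codon position is generally expected to carry the most variation, so a sequence set whose entropy is spread evenly over the three positions may hint at a larger share of erroneous reads; how such a signal is turned into a filter is specific to the paper and is not reproduced here.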
Affiliation(s)
- Xavier Turon
- Department of Marine Ecology, Centre for Advanced Studies of Blanes (CEAB, CSIC), Blanes, Catalonia, Spain
- Adrià Antich
- Department of Marine Ecology, Centre for Advanced Studies of Blanes (CEAB, CSIC), Blanes, Catalonia, Spain
- Creu Palacín
- Department of Evolutionary Biology, Ecology and Environmental Sciences, and Institute of Biodiversity Research (IRBio), University of Barcelona, Barcelona, Catalonia, Spain
- Kim Præbel
- Norwegian College of Fishery Science, UiT the Arctic University of Norway, Tromsø, Norway
|
16
|
Czech L, Barbera P, Stamatakis A. Methods for automatic reference trees and multilevel phylogenetic placement. Bioinformatics 2020; 35:1151-1158. [PMID: 30169747 PMCID: PMC6449752 DOI: 10.1093/bioinformatics/bty767] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Received: 04/12/2018] [Revised: 07/24/2018] [Accepted: 08/30/2018] [Indexed: 12/28/2022] Open
Abstract
MOTIVATION In most metagenomic sequencing studies, the initial analysis step consists in assessing the evolutionary provenance of the sequences. Phylogenetic (or Evolutionary) Placement methods can be employed to determine the evolutionary position of sequences with respect to a given reference phylogeny. These placement methods do however face certain limitations: The manual selection of reference sequences is labor-intensive; the computational effort to infer reference phylogenies is substantially larger than for methods that rely on sequence similarity; the number of taxa in the reference phylogeny should be small enough to allow for visually inspecting the results. RESULTS We present algorithms to overcome the above limitations. First, we introduce a method to automatically construct representative sequences from databases to infer reference phylogenies. Second, we present an approach for conducting large-scale phylogenetic placements on nested phylogenies. Third, we describe a preprocessing pipeline that allows for handling huge sequence datasets. Our experiments on empirical data show that our methods substantially accelerate the workflow and yield highly accurate placement results. AVAILABILITY AND IMPLEMENTATION Freely available under GPLv3 at http://github.com/lczech/gappa. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Lucas Czech
- Scientific Computing Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
- Pierre Barbera
- Scientific Computing Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
- Alexandros Stamatakis
- Scientific Computing Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany; Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
|
17
|
Waters NR, Abram F, Brennan F, Holmes A, Pritchard L. riboSeed: leveraging prokaryotic genomic architecture to assemble across ribosomal regions. Nucleic Acids Res 2019; 46:e68. [PMID: 29608703 PMCID: PMC6009695 DOI: 10.1093/nar/gky212] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 11/24/2017] [Accepted: 03/12/2018] [Indexed: 11/12/2022] Open
Abstract
The vast majority of bacterial genome sequencing has been performed using Illumina short reads. Because of the inherent difficulty of resolving repeated regions with short reads alone, only ∼10% of sequencing projects have resulted in a closed genome. The most common repeated regions are those coding for ribosomal operons (rDNAs), which occur in a bacterial genome between 1 and 15 times, and are typically used as sequence markers to classify and identify bacteria. Here, we exploit the genomic context in which rDNAs occur across taxa to improve assembly of these regions relative to de novo sequencing by using the conserved nature of rDNAs across taxa and the uniqueness of their flanking regions within a genome. We describe a method to construct targeted pseudocontigs generated by iteratively assembling reads that map to a reference genome’s rDNAs. These pseudocontigs are then used to more accurately assemble the newly sequenced chromosome. We show that this method, implemented as riboSeed, correctly bridges across adjacent contigs in bacterial genome assembly and, when used in conjunction with other genome polishing tools, can assist in closure of a genome.
Affiliation(s)
- Nicholas R Waters
- Microbiology, School of Natural Sciences, National University of Ireland, Galway, H91 TK33, Ireland; Information and Computational Sciences, James Hutton Institute, Invergowrie, Dundee DD2 5DA, Scotland
- Florence Abram
- Microbiology, School of Natural Sciences, National University of Ireland, Galway, H91 TK33, Ireland
- Fiona Brennan
- Microbiology, School of Natural Sciences, National University of Ireland, Galway, H91 TK33, Ireland; Soil and Environmental Microbiology, Environmental Research Centre, Teagasc, Johnstown Castle, Wexford, Y35 TC97, Ireland
- Ashleigh Holmes
- Cell and Molecular Sciences, James Hutton Institute, Invergowrie, Dundee DD2 5DA, Scotland
- Leighton Pritchard
- Information and Computational Sciences, James Hutton Institute, Invergowrie, Dundee DD2 5DA, Scotland
|
18
|
Li J, Zhang L, Li H, Ping Y, Xu Q, Wang R, Tan R, Wang Z, Liu B, Wang Y. Integrated entropy-based approach for analyzing exons and introns in DNA sequences. BMC Bioinformatics 2019; 20:283. [PMID: 31182012 PMCID: PMC6557737 DOI: 10.1186/s12859-019-2772-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 12/29/2022] Open
Abstract
BACKGROUND Numerous essential algorithms and methods, including entropy-based quantitative methods, have been developed over the last decade to analyze complex DNA sequences. Exons and introns are the most notable components of DNA, and their identification and prediction are a constant focus of state-of-the-art research. RESULTS In this study, we designed an integrated entropy-based analysis approach, which involves a modified topological entropy calculation, a genomic signal processing (GSP) method and singular value decomposition (SVD), to investigate exons and introns in DNA sequences. We optimized and implemented the topological entropy and the generalized topological entropy to calculate the complexity of DNA sequences, highlighting the characteristics of repetitive sequences. By comparing the digitized entropy values of exons and introns, we observed that they are significantly different. After converting DNA data to numerical topological entropy values, we applied the SVD method to investigate exon and intron regions on a single gene sequence. Additionally, several genes across five species were used for exon prediction. CONCLUSIONS Our approach not only helps to explore the complexity of DNA sequences and their functional elements, but also provides an entropy-based GSP method to analyze exon and intron regions. Our work is feasible across different species and extendable to the analysis of other components in both coding and noncoding regions of DNA sequences.
Affiliation(s)
- Junyi Li
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
- Li Zhang
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
- Huinian Li
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
- Yuan Ping
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
- Qingzhe Xu
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
- Rongjie Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, 150001 China
- Renjie Tan
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, 150001 China
- Zhen Wang
- CAS Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031 China
- Bo Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, 150001 China
- Yadong Wang
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, 150001 China
|
19
|
Kycia RA. Landauer's Principle as a Special Case of Galois Connection. Entropy 2018; 20:e20120971. [PMID: 33266695 PMCID: PMC7512571 DOI: 10.3390/e20120971] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 10/14/2018] [Revised: 12/03/2018] [Accepted: 12/12/2018] [Indexed: 11/30/2022]
Abstract
It is demonstrated how to construct a Galois connection between two related systems with entropy. The construction, called the Landauer’s connection, describes coupling between two systems with entropy. It is straightforward and transfers changes in one system to the other one, preserving ordering structure induced by entropy. The Landauer’s connection simplifies the description of the classical Landauer’s principle for computational systems. Categorification and generalization of the Landauer’s principle opens the area of modeling of various systems in presence of entropy in abstract terms.
Affiliation(s)
- Radosław A. Kycia
- Department of Mathematics and Statistics, Masaryk University, Kotlářská 267/2, 611 37 Brno, Czech Republic
- Mathematics and Computer Science, Faculty of Physics, Cracow University of Technology, Warszawska 24, 31-155 Kraków, Poland
|
20
|
Hernández-Orozco S, Kiani NA, Zenil H. Algorithmically probable mutations reproduce aspects of evolution, such as convergence rate, genetic memory and modularity. Royal Society Open Science 2018; 5:180399. [PMID: 30225028 PMCID: PMC6124114 DOI: 10.1098/rsos.180399] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 03/16/2018] [Accepted: 07/20/2018] [Indexed: 05/07/2023]
Abstract
Natural selection explains how life has evolved over millions of years from more primitive forms. The speed at which this happens, however, has sometimes defied formal explanations when based on random (uniformly distributed) mutations. Here, we investigate the application of a simplicity bias based on a natural but algorithmic distribution of mutations (no recombination) in various examples, particularly binary matrices, in order to compare evolutionary convergence rates. Results both on synthetic and on small biological examples indicate an accelerated rate when mutations are not statistically uniform but algorithmically uniform. We show that algorithmic distributions can evolve modularity and genetic memory by preservation of structures when they first occur, sometimes leading to an accelerated production of diversity but also to population extinctions, possibly explaining naturally occurring phenomena such as diversity explosions (e.g. the Cambrian) and massive extinctions (e.g. the End Triassic) whose causes are currently debated. The natural approach introduced here appears to be a better approximation to biological evolution than models based exclusively upon random uniform mutations, and it also approaches a formal version of open-ended evolution based on previous formal results. These results support earlier suggestions that computation may be an equally important driver of evolution. We also show that applying the method to optimization problems, such as those addressed by genetic algorithms, has the potential to accelerate convergence of artificial evolutionary algorithms.
Affiliation(s)
- Santiago Hernández-Orozco
- Posgrado en Ciencia e Ingeniería de la Computación, Universidad Nacional Autónoma de México (UNAM), Mexico
- Algorithmic Dynamics Lab, Unit of Computational Medicine, SciLifeLab, Department of Medicine Solna, Centre for Molecular Medicine, Stockholm, Sweden
- Algorithmic Nature Group, LABORES, Paris, France
- Narsis A. Kiani
- Algorithmic Dynamics Lab, Unit of Computational Medicine, SciLifeLab, Department of Medicine Solna, Centre for Molecular Medicine, Stockholm, Sweden
- Algorithmic Nature Group, LABORES, Paris, France
- Hector Zenil
- Algorithmic Dynamics Lab, Unit of Computational Medicine, SciLifeLab, Department of Medicine Solna, Centre for Molecular Medicine, Stockholm, Sweden
- Algorithmic Nature Group, LABORES, Paris, France
|
21
|
Barbosa VC. Information-theoretic signatures of biodiversity in the barcoding gene. J Theor Biol 2018; 451:111-116. [PMID: 29750998 DOI: 10.1016/j.jtbi.2018.05.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2018] [Revised: 04/30/2018] [Accepted: 05/08/2018] [Indexed: 11/16/2022]
Abstract
Analyzing the information content of DNA, though holding the promise to help quantify how the processes of evolution have led to information gain throughout the ages, has remained an elusive goal. Paradoxically, one of the main reasons for this has been precisely the great diversity of life on the planet: if on the one hand this diversity is a rich source of data for information-content analysis, on the other hand there is so much variation as to make the task unmanageable. During the past decade or so, however, succinct fragments of the COI mitochondrial gene, which is present in all animal phyla and in a few others, have been shown to be useful for species identification through DNA barcoding. A few million such fragments are now publicly available through the BOLD systems initiative, thus providing an unprecedented opportunity for relatively comprehensive information-theoretic analyses of DNA to be attempted. Here we show how a generalized form of total correlation can yield distinctive information-theoretic descriptors of the phyla represented in those fragments. In order to illustrate the potential of this analysis to provide new insight into the evolution of species, we performed principal component analysis on standardized versions of the said descriptors for 23 phyla. Surprisingly, we found that, though based solely on the species represented in the data, the first principal component correlates strongly with the natural logarithm of the number of all known living species for those phyla. The new descriptors thus constitute clear information-theoretic signatures of the processes whereby evolution has given rise to current biodiversity, which suggests their potential usefulness in further related studies.
Affiliation(s)
- Valmir C Barbosa
- Programa de Engenharia de Sistemas e Computação, COPPE, Universidade Federal do Rio de Janeiro, Caixa Postal 68511, Rio de Janeiro, RJ 21941-972, Brazil.
|
22
|
Skene KR. Thermodynamics, ecology and evolutionary biology: A bridge over troubled water or common ground? Acta Oecologica-International Journal of Ecology 2017. [DOI: 10.1016/j.actao.2017.10.010] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Indexed: 12/13/2022]
|
23
|
Gebert D, Hewel C, Rosenkranz D. unitas: the universal tool for annotation of small RNAs. BMC Genomics 2017; 18:644. [PMID: 28830358 PMCID: PMC5567656 DOI: 10.1186/s12864-017-4031-9] [Citation(s) in RCA: 75] [Impact Index Per Article: 10.7] [Received: 03/29/2017] [Accepted: 08/07/2017] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND Next generation sequencing is a key technique in small RNA biology research that has led to the discovery of functionally different classes of small non-coding RNAs in the past years. However, reliable annotation of the extensive amounts of small non-coding RNA data produced by high-throughput sequencing is time-consuming and requires robust bioinformatics expertise. Moreover, existing tools have a number of shortcomings, including a lack of sensitivity under certain conditions and a limited number of supported species or detectable sub-classes of small RNAs. RESULTS Here we introduce unitas, out-of-the-box ready software for complete annotation of small RNA sequence datasets, supporting the wide range of species for which non-coding RNA reference sequences are available in the Ensembl databases (currently more than 800). unitas combines high quality annotation and numerous analysis features in a user-friendly manner. A complete annotation can be started with one simple shell command, making unitas particularly useful for researchers not having access to a bioinformatics facility. Notably, the algorithms implemented in unitas are on par with or even outperform comparable existing tools for small RNA annotation that map to publicly available ncRNA databases. CONCLUSIONS unitas brings together annotation and analysis features that hitherto required the installation of numerous different bioinformatics tools, which can pose a challenge for the non-expert user. With this, unitas overcomes the problem of read normalization. Moreover, the high quality of sequence annotation and analysis, paired with the ease of use, makes unitas a valuable tool for researchers in all fields connected to small RNA biology.
Affiliation(s)
- Daniel Gebert
- Institute of Organismic and Molecular Evolutionary Biology, Anthropology, Johannes Gutenberg University, 55099, Mainz, Germany
- Charlotte Hewel
- Institute of Organismic and Molecular Evolutionary Biology, Anthropology, Johannes Gutenberg University, 55099, Mainz, Germany
- David Rosenkranz
- Institute of Organismic and Molecular Evolutionary Biology, Anthropology, Johannes Gutenberg University, 55099, Mainz, Germany.
|
24
|
Kistler L, Ware R, Smith O, Collins M, Allaby RG. A new model for ancient DNA decay based on paleogenomic meta-analysis. Nucleic Acids Res 2017; 45:6310-6320. [PMID: 28486705 PMCID: PMC5499742 DOI: 10.1093/nar/gkx361] [Citation(s) in RCA: 91] [Impact Index Per Article: 13.0] [Received: 03/21/2017] [Revised: 04/15/2017] [Accepted: 04/20/2017] [Indexed: 01/04/2023] Open
Abstract
The persistence of DNA over archaeological and paleontological timescales in diverse environments has led to a revolutionary body of paleogenomic research, yet the dynamics of DNA degradation are still poorly understood. We analyzed 185 paleogenomic datasets and compared DNA survival with environmental variables and sample ages. We find cytosine deamination follows a conventional thermal age model, but we find no correlation between DNA fragmentation and sample age over the timespans analyzed, even when controlling for environmental variables. We propose a model for ancient DNA decay wherein fragmentation rapidly reaches a threshold, then subsequently slows. The observed loss of DNA over time may be due to a bulk diffusion process in many cases, highlighting the importance of tissues and environments creating effectively closed systems for DNA preservation. This model of DNA degradation is largely based on mammal bone samples due to published genomic dataset availability. Continued refinement to the model to reflect diverse biological systems and tissue types will further improve our understanding of ancient DNA breakdown dynamics.
MESH Headings
- Base Composition
- Base Sequence
- DNA Fragmentation
- DNA, Ancient/analysis
- DNA, Ancient/chemistry
- DNA, Mitochondrial/analysis
- DNA, Mitochondrial/chemistry
- DNA, Mitochondrial/genetics
- DNA, Plant/genetics
- Deamination
- Genome, Human
- Genome, Mitochondrial
- Humans
- Meta-Analysis as Topic
- Models, Chemical
- Paleontology/methods
- Sequence Analysis, DNA
- Thermodynamics
Affiliation(s)
- Logan Kistler
- School of Life Sciences, University of Warwick, Coventry CV4 7AL, UK
- Department of Anthropology, National Museum of Natural History, Smithsonian Institution, Washington, DC 20560, USA
- Roselyn Ware
- School of Life Sciences, University of Warwick, Coventry CV4 7AL, UK
- Oliver Smith
- School of Life Sciences, University of Warwick, Coventry CV4 7AL, UK
- Section for Evolutionary Genomics, Centre for GeoGenetics, Natural History Museum of Denmark, University of Copenhagen, 1307 Copenhagen K, Denmark
- Matthew Collins
- Section for Evolutionary Genomics, Centre for GeoGenetics, Natural History Museum of Denmark, University of Copenhagen, 1307 Copenhagen K, Denmark
- Department of Archaeology, University of York, PO Box 373, York, UK
- Robin G. Allaby
- School of Life Sciences, University of Warwick, Coventry CV4 7AL, UK
|
25
|
Genotypic Complexity of Fisher's Geometric Model. Genetics 2017; 206:1049-1079. [PMID: 28450460 DOI: 10.1534/genetics.116.199497] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.9] [Received: 12/22/2016] [Accepted: 04/15/2017] [Indexed: 01/30/2023] Open
Abstract
Fisher's geometric model was originally introduced to argue that complex adaptations must occur in small steps because of pleiotropic constraints. When supplemented with the assumption of additivity of mutational effects on phenotypic traits, it provides a simple mechanism for the emergence of genotypic epistasis from the nonlinear mapping of phenotypes to fitness. Of particular interest is the occurrence of reciprocal sign epistasis, which is a necessary condition for multipeaked genotypic fitness landscapes. Here we compute the probability that a pair of randomly chosen mutations interacts sign epistatically, which is found to decrease with increasing phenotypic dimension n, and varies nonmonotonically with the distance from the phenotypic optimum. We then derive expressions for the mean number of fitness maxima in genotypic landscapes comprised of all combinations of L random mutations. This number increases exponentially with L, and the corresponding growth rate is used as a measure of the complexity of the landscape. The dependence of the complexity on the model parameters is found to be surprisingly rich, and three distinct phases characterized by different landscape structures are identified. Our analysis shows that the phenotypic dimension, which is often referred to as phenotypic complexity, does not generally correlate with the complexity of fitness landscapes and that even organisms with a single phenotypic trait can have complex landscapes. Our results further inform the interpretation of experiments where the parameters of Fisher's model have been inferred from data, and help to elucidate which features of empirical fitness landscapes can be described by this model.
|
26
|
|
27
|
Wu C, Yao S, Li X, Chen C, Hu X. Genome-Wide Prediction of DNA Methylation Using DNA Composition and Sequence Complexity in Human. Int J Mol Sci 2017; 18:E420. [PMID: 28212312 PMCID: PMC5343954 DOI: 10.3390/ijms18020420] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 01/03/2017] [Revised: 02/03/2017] [Accepted: 02/08/2017] [Indexed: 02/02/2023] Open
Abstract
DNA methylation plays a significant role in transcriptional regulation by repressing activity. Change of the DNA methylation level is an important factor affecting the expression of target genes and downstream phenotypes. Because current experimental technologies can only assay a small proportion of CpG sites in the human genome, it is urgent to develop reliable computational models for predicting genome-wide DNA methylation. Here, we proposed a novel algorithm that accurately extracted sequence complexity features (seven features) and developed a support-vector-machine-based prediction model with integration of the reported DNA composition features (trinucleotide frequency and GC content, 65 features) by utilizing the methylation profiles of embryonic stem cells in human. The prediction results from 22 human chromosomes with size-varied windows showed that the 600-bp window achieved the best average accuracy of 94.7%. Moreover, comparisons with two existing methods further showed the superiority of our model, and cross-species predictions on mouse data also demonstrated that our model has certain generalization ability. Finally, a statistical test of the experimental data and the predicted data on functional regions annotated by ChromHMM found that six out of 10 regions were consistent, which implies reliable prediction of unassayed CpG sites. Accordingly, we believe that our novel model will be useful and reliable in predicting DNA methylation.
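The 65 DNA composition features mentioned in this abstract (trinucleotide frequency and GC content) can be pictured with the minimal sketch below. It is an assumption-laden illustration rather than the authors' pipeline: the function name, the use of overlapping trinucleotides, and the choice to skip trinucleotides containing non-ACGT symbols are hypothetical, and the seven sequence complexity features and the SVM model are not covered.

```python
from itertools import product

# All 64 trinucleotides in a fixed (lexicographic) order.
TRINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=3)]


def composition_features(window):
    """Return 64 trinucleotide frequencies followed by GC content (65 values).

    Hypothetical choices: overlapping trinucleotides are counted, and any
    trinucleotide containing a symbol outside ACGT (e.g. N) is skipped.
    """
    window = window.upper()
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    total = 0
    for i in range(len(window) - 2):
        tri = window[i:i + 3]
        if tri in counts:
            counts[tri] += 1
            total += 1
    freqs = [counts[t] / total if total else 0.0 for t in TRINUCLEOTIDES]
    n_acgt = sum(window.count(base) for base in "ACGT")
    gc = (window.count("G") + window.count("C")) / n_acgt if n_acgt else 0.0
    return freqs + [gc]


# In the setting of the abstract the input would be a window around a CpG
# site (e.g. 600 bp); a short toy string is used here.
features = composition_features("ACGCGT")
```

A vector like this, computed per window, is the kind of input a support vector machine classifier could be trained on.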
Collapse
Affiliation(s)
- Chengchao Wu
- College of Informatics, Agricultural Bioinformatics Key Laboratory of Hubei Province, Huazhong Agricultural University, Wuhan 430070, China.
| | - Shixin Yao
- College of Science, Huazhong Agricultural University, Wuhan 430070, China.
| | - Xinghao Li
- College of Science, Huazhong Agricultural University, Wuhan 430070, China.
| | - Chujia Chen
- College of Informatics, Agricultural Bioinformatics Key Laboratory of Hubei Province, Huazhong Agricultural University, Wuhan 430070, China.
| | - Xuehai Hu
- College of Informatics, Agricultural Bioinformatics Key Laboratory of Hubei Province, Huazhong Agricultural University, Wuhan 430070, China.
| |
|
28
|
Benjamin A, Keten S. Polymer Conjugation as a Strategy for Long-Range Order in Supramolecular Polymers. J Phys Chem B 2016; 120:3425-33. [DOI: 10.1021/acs.jpcb.5b12547] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 11/30/2022]
Affiliation(s)
- Ari Benjamin
- Department of Mechanical Engineering, Northwestern University, 2145 Sheridan Road, Evanston, Illinois 60208-3109, United States
| | - Sinan Keten
- Department of Mechanical Engineering, Northwestern University, 2145 Sheridan Road, Evanston, Illinois 60208-3109, United States
| |
|
29
|
Paci G, Cristadoro G, Monti B, Lenci M, Degli Esposti M, Castellani GC, Remondini D. Characterization of DNA methylation as a function of biological complexity via dinucleotide inter-distances. PHILOSOPHICAL TRANSACTIONS. SERIES A, MATHEMATICAL, PHYSICAL, AND ENGINEERING SCIENCES 2016; 374:rsta.2015.0227. [PMID: 26857665 DOI: 10.1098/rsta.2015.0227] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Accepted: 11/23/2015] [Indexed: 06/05/2023]
Abstract
We perform a statistical study of the distances between successive occurrences of a given dinucleotide in the DNA sequence for a number of organisms of different complexity. Our analysis highlights peculiar features of the CG dinucleotide distribution in mammalian DNA, pointing towards a connection with the role of such dinucleotide in DNA methylation. While the CG distributions of mammals exhibit exponential tails with comparable parameters, the picture for the other organisms studied (e.g. fish, insects, bacteria and viruses) is more heterogeneous, possibly because in these organisms DNA methylation has different functional roles. Our analysis suggests that the distribution of the distances between CG dinucleotides provides useful insights into characterizing and classifying organisms in terms of methylation functionalities.
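The quantity studied here, the distance between successive occurrences of a dinucleotide, can be illustrated with a minimal sketch. The function name `dinucleotide_interdistances` and the choice to measure distance between start positions are assumptions made for illustration only.

```python
def dinucleotide_interdistances(seq: str, dinucleotide: str = "CG") -> list[int]:
    """Distances (in bases) between start positions of successive occurrences."""
    seq = seq.upper()
    positions = []
    start = seq.find(dinucleotide)
    while start != -1:
        positions.append(start)
        # Advance by one so overlapping occurrences (e.g. "AA" in "AAA") are found.
        start = seq.find(dinucleotide, start + 1)
    return [b - a for a, b in zip(positions, positions[1:])]
```

The empirical distribution of these distances (for instance, a histogram on a log scale) is what would then be inspected for an exponential tail.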
Affiliation(s)
- Giulia Paci
- Department of Physics and Astronomy, University of Bologna, Viale B. Pichat 6/2, Bologna 40127, Italy
| | - Giampaolo Cristadoro
- Department of Mathematics, University of Bologna, Piazza di Porta S. Donato 5, Bologna 40126, Italy
| | - Barbara Monti
- Department of Pharmacy and Biotechnology, University of Bologna, Via S. Donato 15, Bologna 40127, Italy
| | - Marco Lenci
- Department of Mathematics, University of Bologna, Piazza di Porta S. Donato 5, Bologna 40126, Italy; Bologna Unit, INFN, Viale B. Pichat 6/2, Bologna 40127, Italy
| | - Mirko Degli Esposti
- Department of Mathematics, University of Bologna, Piazza di Porta S. Donato 5, Bologna 40126, Italy
| | - Gastone C Castellani
- Department of Physics and Astronomy, University of Bologna, Viale B. Pichat 6/2, Bologna 40127, Italy; Bologna Unit, INFN, Viale B. Pichat 6/2, Bologna 40127, Italy
| | - Daniel Remondini
- Department of Physics and Astronomy, University of Bologna, Viale B. Pichat 6/2, Bologna 40127, Italy; Bologna Unit, INFN, Viale B. Pichat 6/2, Bologna 40127, Italy
| |
|
30
|
Characterizing Protease Specificity: How Many Substrates Do We Need? PLoS One 2015; 10:e0142658. [PMID: 26559682 PMCID: PMC4641643 DOI: 10.1371/journal.pone.0142658] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Received: 09/04/2015] [Accepted: 10/26/2015] [Indexed: 12/26/2022] Open
Abstract
Calculation of cleavage entropies allows one to quantify, map and compare protease substrate specificity using an information-entropy-based approach. The metric intrinsically depends on the number of experimentally determined substrates (data points). Thus a statistical analysis of its numerical stability is crucial to estimate the systematic error made by estimating specificity based on a limited number of substrates. In this contribution, we show the mathematical basis for estimating the uncertainty in cleavage entropies. Sets of cleavage entropies are calculated using experimental cleavage data and modeled extreme cases. By analyzing the underlying mathematics and applying statistical tools, a linear dependence of the metric with respect to 1/n was found. This allows us to extrapolate the values to an infinite number of samples and to estimate the errors. Analyzing the errors, a minimum number of 30 substrates was found to be necessary to characterize substrate specificity, in terms of amino acid variability, for a protease (S4-S4’) with an uncertainty of 5 percent. Therefore, we encourage experimental researchers in the protease field to record specificity profiles of novel proteases aiming to identify at least 30 peptide substrates of maximum sequence diversity. We expect a full characterization of protease specificity to be helpful in rationalizing biological functions of proteases and in assisting rational drug design.
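The extrapolation idea described in this abstract (a metric that is linear in 1/n, read off at 1/n = 0) can be sketched as below. The names `subsite_entropy` and `extrapolate_to_infinite_n` are hypothetical, and normalizing the entropy by log(20) is an assumption of this sketch rather than something stated in the abstract.

```python
import math
from collections import Counter


def subsite_entropy(residues: list[str]) -> float:
    """Shannon entropy of residues observed at one subsite.

    Assumption: normalized by log(20) so that 0 means one residue only and
    1 means all 20 amino acids equally frequent.
    """
    counts = Counter(residues)
    n = len(residues)
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(20)


def extrapolate_to_infinite_n(sample_sizes: list[int], entropies: list[float]):
    """Least-squares fit S = intercept + slope * (1/n).

    The intercept is the estimate of the metric for an infinite number of
    substrates; returns (intercept, slope).
    """
    xs = [1.0 / n for n in sample_sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(entropies) / len(entropies)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, entropies))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope
```

In use, the entropy would be recomputed on random subsets of increasing size n, and the fitted intercept compared with the value obtained from the full substrate set.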
|
31
|
Lahens NF, Kavakli IH, Zhang R, Hayer K, Black MB, Dueck H, Pizarro A, Kim J, Irizarry R, Thomas RS, Grant GR, Hogenesch JB. IVT-seq reveals extreme bias in RNA sequencing. Genome Biol 2014; 15:R86. [PMID: 24981968 PMCID: PMC4197826 DOI: 10.1186/gb-2014-15-6-r86] [Citation(s) in RCA: 105] [Impact Index Per Article: 10.5] [Received: 05/20/2014] [Accepted: 06/30/2014] [Indexed: 01/22/2023] Open
Abstract
BACKGROUND RNA-seq is a powerful technique for identifying and quantifying transcription and splicing events, both known and novel. However, given its recent development and the proliferation of library construction methods, understanding the bias it introduces is incomplete but critical to realizing its value. RESULTS We present a method, in vitro transcription sequencing (IVT-seq), for identifying and assessing the technical biases in RNA-seq library generation and sequencing at scale. We created a pool of over 1,000 in vitro transcribed RNAs from a full-length human cDNA library and sequenced them with polyA and total RNA-seq, the most common protocols. Because each cDNA is full length, and we show in vitro transcription is incredibly processive, each base in each transcript should be equivalently represented. However, with common RNA-seq applications and platforms, we find 50% of transcripts have more than two-fold and 10% have more than 10-fold differences in within-transcript sequence coverage. We also find greater than 6% of transcripts have regions of dramatically unpredictable sequencing coverage between samples, confounding accurate determination of their expression. We use a combination of experimental and computational approaches to show rRNA depletion is responsible for the most significant variability in coverage, and several sequence determinants also strongly influence representation. CONCLUSIONS These results show the utility of IVT-seq for promoting better understanding of bias introduced by RNA-seq. We find rRNA depletion is responsible for substantial, unappreciated biases in coverage introduced during library preparation. These biases suggest exon-level expression analysis may be inadvisable, and we recommend caution when interpreting RNA-seq results.
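The "more than two-fold" and "more than 10-fold" statements refer to differences in coverage within a single transcript. One simple way to express such a statistic is sketched here; the max/min ratio and the names `fold_difference` and `fraction_exceeding` are assumptions for illustration, since the abstract does not define the exact measure used.

```python
def fold_difference(coverage: list[int]) -> float:
    """Ratio of the highest to the lowest per-base coverage in one transcript."""
    lowest, highest = min(coverage), max(coverage)
    if lowest == 0:
        # Assumption: an uncovered base next to covered bases counts as infinite fold.
        return float("inf") if highest > 0 else 1.0
    return highest / lowest


def fraction_exceeding(transcripts: dict[str, list[int]], threshold: float) -> float:
    """Fraction of transcripts whose within-transcript fold difference exceeds threshold."""
    flagged = sum(1 for cov in transcripts.values() if fold_difference(cov) > threshold)
    return flagged / len(transcripts)
```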
|
32
|
Clustering of giant virus-DNA based on variations in local entropy. Viruses 2014; 6:2259-67. [PMID: 24887142 PMCID: PMC4074927 DOI: 10.3390/v6062259] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 02/05/2014] [Revised: 05/19/2014] [Accepted: 05/21/2014] [Indexed: 11/17/2022] Open
Abstract
We present a method for clustering genomic sequences based on variations in local entropy. We have analyzed the distributions of the block entropies of viruses and plant genomes. A distinct pattern for viruses and plant genomes is observed. These distributions, which describe the local entropic variability of the genomes, are used for clustering the genomes based on the Jensen-Shannon (JS) distances. The analysis of the JS distances between all genomes that infect the chlorella algae shows the host specificity of the viruses. We illustrate the efficacy of this entropy-based clustering technique by the segregation of plant and virus genomes into separate bins.
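A rough sketch of the pipeline outlined in this abstract: compute a local entropy per block, turn the values into a distribution, and compare two genomes by Jensen-Shannon distance. The block definition (non-overlapping blocks, single-nucleotide composition), the histogram binning, and all function names are assumptions of this sketch; the paper's own block entropy may be defined differently.

```python
import math
from collections import Counter


def block_entropies(seq: str, block_size: int) -> list[float]:
    """Shannon entropy (bits) of nucleotide composition in each non-overlapping block."""
    values = []
    for start in range(0, len(seq) - block_size + 1, block_size):
        counts = Counter(seq[start:start + block_size])
        values.append(-sum((c / block_size) * math.log2(c / block_size)
                           for c in counts.values()))
    return values


def histogram(values: list[float], bins: int, lo: float = 0.0, hi: float = 2.0) -> list[float]:
    """Normalized histogram of entropy values on [lo, hi]."""
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / (hi - lo) * bins), bins - 1)
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]


def js_distance(p: list[float], q: list[float]) -> float:
    """Jensen-Shannon distance: square root of the JS divergence in bits."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return math.sqrt((kl(p, m) + kl(q, m)) / 2)
```

A matrix of pairwise `js_distance` values between genomes could then be fed to any standard clustering routine.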
|
33
|
Vinga S. Information theory applications for biological sequence analysis. Brief Bioinform 2014; 15:376-89. [PMID: 24058049 PMCID: PMC7109941 DOI: 10.1093/bib/bbt068] [Citation(s) in RCA: 67] [Impact Index Per Article: 6.7] [Received: 07/11/2013] [Accepted: 08/17/2013] [Indexed: 01/13/2023] Open
Abstract
Information theory (IT) addresses the analysis of communication systems and has been widely applied in molecular biology. In particular, alignment-free sequence analysis and comparison greatly benefited from concepts derived from IT, such as entropy and mutual information. This review covers several aspects of IT applications, ranging from genome global analysis and comparison, including block-entropy estimation and resolution-free metrics based on iterative maps, to local analysis, comprising the classification of motifs, prediction of transcription factor binding sites and sequence characterization based on linguistic complexity and entropic profiles. IT has also been applied to high-level correlations that combine DNA, RNA or protein features with sequence-independent properties, such as gene mapping and phenotype analysis, and has also provided models based on communication systems theory to describe information transmission channels at the cell level and also during evolutionary processes. While not exhaustive, this review attempts to categorize existing methods and to indicate their relation with broader transversal topics such as genomic signatures, data compression and complexity, time series analysis and phylogenetic classification, providing a resource for future developments in this promising area.
Affiliation(s)
- Susana Vinga
- IDMEC, Instituto Superior Técnico - Universidade de Lisboa (IST-UL), Av. Rovisco Pais, 1049-001 Lisboa, Portugal. Tel.: +351-218419504; Fax: +351-218498097;
| |
|
34
|
Hudson NJ, Porto-Neto LR, Kijas J, McWilliam S, Taft RJ, Reverter A. Information compression exploits patterns of genome composition to discriminate populations and highlight regions of evolutionary interest. BMC Bioinformatics 2014; 15:66. [PMID: 24606587 PMCID: PMC4015654 DOI: 10.1186/1471-2105-15-66] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Received: 12/13/2013] [Accepted: 02/26/2014] [Indexed: 11/20/2022] Open
Abstract
Background Genomic information allows population relatedness to be inferred and selected genes to be identified. Single nucleotide polymorphism microarray (SNP-chip) data, a proxy for genome composition, contains patterns in allele order and proportion. These patterns can be quantified by compression efficiency (CE). In principle, the composition of an entire genome can be represented by a CE number quantifying allele representation and order. Results We applied a compression algorithm (DEFLATE) to genome-wide high-density SNP data from 4,155 human, 1,800 cattle, 1,222 sheep, 81 dogs and 49 mice samples. All human ethnic groups can be clustered by CE and the clusters recover phylogeography based on traditional fixation index (FST) analyses. CE analysis of other mammals results in segregation by breed or species, and is sensitive to admixture and past effective population size. This clustering is a consequence of individual patterns such as runs of homozygosity. Intriguingly, a related approach can also be used to identify genomic loci that show population-specific CE segregation. A high resolution CE ‘sliding window’ scan across the human genome, organised at the population level, revealed genes known to be under evolutionary pressure. These include SLC24A5 (European and Gujarati Indian skin pigmentation), HERC2 (European eye color), LCT (European and Maasai milk digestion) and EDAR (Asian hair thickness). We also identified a set of previously unidentified loci with high population-specific CE scores including the chromatin remodeler SCMH1 in Africans and EDA2R in Asians. Closer inspection reveals that these prioritised genomic regions do not correspond to simple runs of homozygosity but rather compositionally complex regions that are shared by many individuals of a given population. Unlike FST, CE analyses do not require ab initio population comparisons and are amenable to the hemizygous X chromosome. 
Conclusions We conclude with a discussion of the implications of CE for a complex systems science view of genome evolution. CE allows one to clearly visualise the evolution of individual genomes and populations through a formal, mathematically-rigorous information space. Overall, CE makes a set of biological predictions, some of which are unique and await functional validation.
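The abstract above quantifies patterns in SNP data by how well DEFLATE compresses them. A minimal sketch using Python's standard `zlib` module (which implements DEFLATE) is given here. The formula for compression efficiency, 1 minus compressed size over raw size, and the single-character genotype encoding are assumptions of this sketch; the abstract does not give the exact definition of CE.

```python
import zlib


def compression_efficiency(genotypes: str) -> float:
    """Assumed CE: 1 - (compressed size / raw size) for an ASCII genotype string.

    Values close to 1 mean highly regular data (e.g. long runs of homozygosity);
    very short inputs can give negative values because of the zlib header overhead.
    """
    raw = genotypes.encode("ascii")
    compressed = zlib.compress(raw, 9)  # level 9: strongest DEFLATE setting
    return 1.0 - len(compressed) / len(raw)
```

Applied per individual across all SNPs it gives one number per genome, and applied per sliding window it gives a scan along the chromosome.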
Affiliation(s)
| | | | | | | | - Ryan J Taft
- Computational and Systems Biology, CSIRO Animal, Food and Health Sciences, St. Lucia, Brisbane, QLD 4067, Australia.
| | | |
|
35
|
Bakouche N, Vandenbroucke AT, Goubau P, Ruelle J. Study of the HIV-2 Env cytoplasmic tail variability and its impact on Tat, Rev and Nef. PLoS One 2013; 8:e79129. [PMID: 24223892 PMCID: PMC3815105 DOI: 10.1371/journal.pone.0079129] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 06/05/2013] [Accepted: 09/18/2013] [Indexed: 11/24/2022] Open
Abstract
Background The HIV-2 env’s 3’ end encodes the cytoplasmic tail (CT) of the Env protein. This genomic region also encodes the Rev, Tat and Nef proteins in overlapping reading frames. We studied the variability in the CT coding region in 46 clinical specimens and in 2 reference strains by sequencing and by culturing. The aims were to analyse the variability of Env CT and the evolution of proteins expressed from overlapping coding sequences. Results A 70% reduction of the length of the CT region affected the HIV-2 ROD and EHO strains in vitro due to a premature stop codon in the env gene. In clinical samples this was not observed, but the CT length varied due to insertions and deletions. We noted 3 conserved and 3 variable regions in the CT. The conserved regions were those containing residues involved in Env endocytosis, the potential HIV-2 CT region implicated in NF-κB activation and the potential end of the lentiviral lytic peptide one. The variable regions were the potential HIV-2 Kennedy region, the potential lentiviral lytic peptide two and the beginning of the potential lentiviral lytic peptide one. A very hydrophobic region was coded downstream of the premature stop codon observed in vitro, suggesting a membrane-spanning region. Interestingly, the nucleotides that are responsible for the variability of the CT do not impact Rev and Nef. However, in the Kennedy-like coding region variability resulted only from nucleotide changes that impacted Env and Tat together. Conclusion The C-terminal parts of HIV-2 Env, Tat and Rev are subject to major length variations in both clinical samples and cultured strains. The HIV-2 Env CT contains variable and conserved regions. These regions do not affect the Rev and Nef amino acid composition, which evolves independently. In contrast, Tat co-evolves with the Env CT.
Affiliation(s)
- Nordine Bakouche
- Institut de recherche expérimentale et clinique, Université catholique de Louvain, Brussels, Belgium
| | | | - Patrick Goubau
- Institut de recherche expérimentale et clinique, Université catholique de Louvain, Brussels, Belgium
| | - Jean Ruelle
- Institut de recherche expérimentale et clinique, Université catholique de Louvain, Brussels, Belgium
| |
|
36
|
On the fractal geometry of DNA by the binary image analysis. Bull Math Biol 2013; 75:1544-70. [PMID: 23760660 DOI: 10.1007/s11538-013-9859-9] [Citation(s) in RCA: 55] [Impact Index Per Article: 5.0] [Received: 10/09/2012] [Accepted: 05/21/2013] [Indexed: 12/15/2022]
Abstract
The multifractal analysis of binary images of DNA is studied in order to define a methodological approach to the classification of DNA sequences. This method is based on the computation of some multifractality parameters on a suitable binary image of DNA, which takes into account the nucleotide distribution. The binary image of DNA is obtained by a dot-plot (recurrence plot) of the indicator matrix. The fractal geometry of these images is characterized by fractal dimension (FD), lacunarity, and succolarity. These parameters are compared with some other coefficients such as complexity and Shannon information entropy. It will be shown that the complexity parameters are more or less equivalent to FD, while the parameters of multifractality have different values in the sense that sequences with higher FD might have lower lacunarity and/or succolarity. In particular, the genome of Drosophila melanogaster has been considered by focusing on the chromosome 3r, which shows the highest fractality with a corresponding higher level of complexity. We will single out some results on the nucleotide distribution in 3r with respect to complexity and fractality. In particular, we will show that sequences with higher FD also have a higher frequency distribution of guanine, while low FD is characterized by the higher presence of adenine.
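The binary image described here is a dot-plot of the indicator matrix, which marks every pair of positions carrying the same nucleotide. The sketch below builds that matrix and counts occupied boxes at a given box size, the basic ingredient of a box-counting estimate of fractal dimension. Function names and the box-counting details are illustrative assumptions, not the paper's implementation.

```python
def indicator_matrix(seq: str) -> list[list[int]]:
    """Binary dot-plot: entry (i, j) is 1 when positions i and j hold the same base."""
    seq = seq.upper()
    return [[1 if a == b else 0 for b in seq] for a in seq]


def box_count(matrix: list[list[int]], size: int) -> int:
    """Number of size-by-size boxes that contain at least one 1."""
    n = len(matrix)
    occupied = 0
    for r in range(0, n, size):
        for c in range(0, n, size):
            if any(matrix[i][j]
                   for i in range(r, min(r + size, n))
                   for j in range(c, min(c + size, n))):
                occupied += 1
    return occupied
```

A box-counting dimension would be estimated from the slope of log(box_count) against log(1/size) over several box sizes.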
|
37
|
|
38
|
Wei D, Jiang Q, Wei Y, Wang S. A novel hierarchical clustering algorithm for gene sequences. BMC Bioinformatics 2012; 13:174. [PMID: 22823405 PMCID: PMC3443659 DOI: 10.1186/1471-2105-13-174] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.3] [Received: 11/05/2011] [Accepted: 06/30/2012] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Clustering DNA sequences into functional groups is an important problem in bioinformatics. We propose a new alignment-free algorithm, mBKM, based on a new distance measure, DMk, for clustering gene sequences. This method transforms DNA sequences into the feature vectors which contain the occurrence, location and order relation of k-tuples in DNA sequence. Afterwards, a hierarchical procedure is applied to clustering DNA sequences based on the feature vectors. RESULTS The proposed distance measure and clustering method are evaluated by clustering functionally related genes and by phylogenetic analysis. This method is also compared with BlastClust, CD-HIT-EST and some others. The experimental results show our method is effective in classifying DNA sequences with similar biological characteristics and in discovering the underlying relationship among the sequences. CONCLUSIONS We introduced a novel clustering algorithm which is based on a new sequence similarity measure. It is effective in classifying DNA sequences with similar biological characteristics and in discovering the relationship among the sequences.
Affiliation(s)
- Dan Wei
- Cognitive Science Department & Fujian Key Laboratory of the Brain-like Intelligent Systems, Xiamen University, Xiamen, China
| | | | | | | |
|
39
|
Energetic loads and informational entropy during insect metamorphosis: measuring structural variability and self-organization. J Theor Biol 2011; 286:1-12. [PMID: 21756920 DOI: 10.1016/j.jtbi.2011.06.029] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 12/13/2010] [Revised: 06/21/2011] [Accepted: 06/22/2011] [Indexed: 11/23/2022]
Abstract
In this work an information theory approach is presented for measuring structural variability during insect metamorphosis. Following a self-organizational perspective, the underlying assumption is that an insect pupa is a cybernetic bio-system, which displays a homeostatic control during its metamorphosis. The description of structural variability was based on biochemical data (lipids, glycogen, carbohydrates and proteins) analysed at different time intervals during the metamorphosis of Anarsia lineatella Zeller (Lepidoptera: Gelechiidae). Probabilities of biochemical variables were further treated by considering a finite countable set of progressive metamorphosis states having Markov properties at isothermal conditions (25 °C, 16:8h L:D, 65 ± 5%RH). The probabilities of the biochemical variables, as well as the related informational entropies, are affected when the system moves one step forward for each successive state. In most cases, except for protein, there is some observable evidence that histolysis could be related to a decrease in informational entropy H ('disorganization of the system'), followed by a 'stable balance period' during the middle stages of metamorphosis. An initial increase in H is measured at the last stages of metamorphosis, which theoretically correspond to histogenesis ('reorganization of the system'). In this context, the temporal evolution of pupal structural variability was probabilistically quantified according to the classical information theory. The principles of the proposed holistic system are independent of its detailed dynamics and the proposed model can potentially describe part of the observable experimental data during metamorphosis of a holometabolous insect.
|
40
|
Abstract
MOTIVATION Topological entropy has been one of the most difficult to implement of all the entropy-theoretic notions. This is primarily due to finite sample effects and high-dimensionality problems. In particular, topological entropy has been implemented in previous literature to conclude that entropy of exons is higher than of introns, thus implying that exons are more 'random' than introns. RESULTS We define a new approximation to topological entropy free from the aforementioned difficulties. We compute its expected value and apply this definition to the intron and exon regions of the human genome to observe that, as expected, the entropy of introns is significantly higher than that of exons. We also find that introns are less random than expected: their entropy is lower than the computed expected value. We also observe the perplexing phenomenon that introns on chromosome Y have atypically low and bimodal entropy, possibly corresponding to random sequences (high entropy) and sequences that possess hidden structure or function (low entropy). AVAILABILITY A Mathematica implementation is available at http://www.math.psu.edu/koslicki/entropy.nb CONTACT koslicki@math.psu.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
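The abstract does not spell out the approximation. As a hedged illustration only, one finite-sample form of topological entropy for a four-letter alphabet picks the largest word length n such that a prefix of length 4^n + n - 1 fits in the sequence, counts the distinct length-n subwords in that prefix, and takes log base 4 of that count divided by n. Whether this matches the authors' definition in every detail is an assumption; the function name `topological_entropy` is hypothetical.

```python
import math


def topological_entropy(seq: str) -> float:
    """Assumed approximation: log_4(#distinct n-mers in a prefix) / n.

    n is the largest integer with 4**n + n - 1 <= len(seq); the prefix of that
    length contains exactly 4**n windows, so the result lies in [0, 1].
    """
    seq = seq.upper()
    n = 0
    while 4 ** (n + 1) + (n + 1) - 1 <= len(seq):
        n += 1
    if n == 0:
        raise ValueError("sequence must contain at least 4 symbols")
    prefix = seq[: 4 ** n + n - 1]
    subwords = {prefix[i:i + n] for i in range(len(prefix) - n + 1)}
    return math.log(len(subwords), 4) / n
```

Using a fixed prefix length tied to n is what removes the dependence on sequence length, so values for sequences of different lengths become comparable.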
Affiliation(s)
- David Koslicki
- Department of Mathematics, Pennsylvania State University, State College, PA 16801, USA.
| |
|
41
|
Athanasopoulou L, Athanasopoulos S, Karamanos K, Almirantis Y. Scaling properties and fractality in the distribution of coding segments in eukaryotic genomes revealed through a block entropy approach. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2010; 82:051917. [PMID: 21230510 DOI: 10.1103/physreve.82.051917] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 07/28/2010] [Revised: 09/19/2010] [Indexed: 05/30/2023]
Abstract
Statistical methods, including block entropy based approaches, have already been used in the study of long-range features of genomic sequences seen as symbol series, either considering the full alphabet of the four nucleotides or the binary purine or pyrimidine character set. Here we explore the alternation of short protein-coding segments with long noncoding spacers in entire chromosomes, focusing on the scaling properties of block entropy. In previous studies, it has been shown that the sizes of noncoding spacers follow power-law-like distributions in most chromosomes of eukaryotic organisms from distant taxa. We have developed a simple evolutionary model based on well-known molecular events (segmental duplications followed by elimination of most of the duplicated genes) which reproduces the observed linearity in log-log plots. The scaling properties of block entropy H(n) have been studied in several works. Their findings suggest that linearity in semilogarithmic scale characterizes symbol sequences which exhibit fractal properties and long-range order, while this linearity has been shown in the case of the logistic map at the Feigenbaum accumulation point. The present work starts with the observation that the block entropy of the Cantor-like binary symbol series scales in a similar way. Then, we perform the same analysis for the full set of human chromosomes and for several chromosomes of other eukaryotes. A similar but less extended linearity in semilogarithmic scale, indicating fractality, is observed, while randomly formed surrogate sequences clearly lack this type of scaling. Genomic sequences always present entropy values much lower than their random surrogates. Symbol sequences produced by the aforementioned evolutionary model follow the scaling found in genomic sequences, thus corroborating the conjecture that "segmental duplication-gene elimination" dynamics may have contributed to the observed long rangeness in the coding or noncoding alternation in genomes.
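The block entropy H(n) discussed here is the Shannon entropy of the distribution of length-n words in a symbol series (for example, a binary coding/noncoding string). A minimal sketch follows; reading words with a sliding window of step 1 is an assumption of this sketch, and the study may read words differently.

```python
import math
from collections import Counter


def block_entropy(symbols: str, n: int) -> float:
    """Shannon entropy (bits) of the length-n words read with a sliding window."""
    blocks = [symbols[i:i + n] for i in range(len(symbols) - n + 1)]
    counts = Counter(blocks)
    total = len(blocks)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Plotting `block_entropy(series, n)` against log(n) for a range of n is the kind of semilogarithmic view in which the reported linear scaling would be checked.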
|
42
|
Abstract
Although no historical information exists about the Indus civilization (flourished ca. 2600-1900 B.C.), archaeologists have uncovered about 3,800 short samples of a script that was used throughout the civilization. The script remains undeciphered, despite a large number of attempts and claimed decipherments over the past 80 years. Here, we propose the use of probabilistic models to analyze the structure of the Indus script. The goal is to reveal, through probabilistic analysis, syntactic patterns that could point the way to eventual decipherment. We illustrate the approach using a simple Markov chain model to capture sequential dependencies between signs in the Indus script. The trained model allows new sample texts to be generated, revealing recurring patterns of signs that could potentially form functional subunits of a possible underlying language. The model also provides a quantitative way of testing whether a particular string belongs to the putative language as captured by the Markov model. Application of this test to Indus seals found in Mesopotamia and other sites in West Asia reveals that the script may have been used to express different content in these regions. Finally, we show how missing, ambiguous, or unreadable signs on damaged objects can be filled in with most likely predictions from the model. Taken together, our results indicate that the Indus script exhibits rich syntactic structure and the ability to represent diverse content, both of which are suggestive of a linguistic writing system rather than a nonlinguistic symbol system.
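The three uses of the Markov chain mentioned in this abstract (learning sign-to-sign dependencies, scoring a string, and filling in a missing sign) can be sketched with a first-order model. The class name `BigramModel`, the add-one smoothing, and the toy sign labels are assumptions for illustration; the abstract does not describe the estimation details.

```python
import math
from collections import Counter, defaultdict


class BigramModel:
    """First-order Markov chain over signs with additive smoothing (assumed)."""

    def __init__(self, texts, smoothing=1.0):
        self.signs = sorted({sign for text in texts for sign in text})
        self.smoothing = smoothing
        self.counts = defaultdict(Counter)
        for text in texts:
            for prev, nxt in zip(text, text[1:]):
                self.counts[prev][nxt] += 1

    def prob(self, prev, nxt):
        """Smoothed estimate of P(next sign | previous sign)."""
        row = self.counts.get(prev, Counter())
        total = sum(row.values()) + self.smoothing * len(self.signs)
        return (row[nxt] + self.smoothing) / total

    def log_likelihood(self, text):
        """Log-probability of the transitions in a text; higher means more typical."""
        return sum(math.log(self.prob(a, b)) for a, b in zip(text, text[1:]))

    def fill_missing(self, left, right):
        """Most likely sign between two known neighbours."""
        return max(self.signs, key=lambda s: self.prob(left, s) * self.prob(s, right))
```

Comparing `log_likelihood` for texts from different regions against the value typical for the training corpus is one way to frame the membership test the abstract describes.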
|
43
|
Jin NZ, Liu ZX, Qi YJ, Qiu WY. Repeat Sequences and Base Correlations in Human Y Chromosome Palindromes. CHINESE J CHEM PHYS 2009. [DOI: 10.1088/1674-0068/22/03/255-261] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/12/2022]
|
44
|
Giancarlo R, Scaturro D, Utro F. Textual data compression in computational biology: a synopsis. Bioinformatics 2009; 25:1575-86. [DOI: 10.1093/bioinformatics/btp117] [Citation(s) in RCA: 63] [Impact Index Per Article: 4.2] [Indexed: 11/15/2022] Open
|
45
|
Abstract
In his recent book The Mind Doesn't Work That Way, Fodor argues that computational modeling of global cognitive processes, such as abductive everyday reasoning, has not been successful. In this article the problem is analyzed in the framework of algorithmic information theory. It is argued that the failed approaches are characterized by shallow reductionism, which is rejected in favor of deep reductionism and nonreductionism.
|
46
|
Rocha LB, Adam RL, Leite NJ, Metze K, Rossi MA. Shannon's entropy and fractal dimension provide an objective account of bone tissue organization during calvarial bone regeneration. Microsc Res Tech 2008; 71:619-25. [DOI: 10.1002/jemt.20598] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Indexed: 11/08/2022]
|
47
|
Legendre M, Verstrepen KJ. Using the SERV Applet to Detect Tandem Repeats in DNA Sequences and to Predict Their Variability. CSH Protoc 2008; 2008:pdb.ip50. [PMID: 21356663 DOI: 10.1101/pdb.ip50] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/24/2022]
Abstract
INTRODUCTION Tandem repeats (satellite repeats) are short stretches of DNA that are repeated head-to-tail. Tandem repeats mutate at rates that are between 100- and 10,000-fold greater than normal (point) mutation rates in the rest of the genome. As a consequence of these frequent mutation events, "homologous" tandem repeats in closely related species, strains, or even individuals in the same population often contain a different number of repeat units. This heterogeneity is extensively used in today's molecular forensics and genotyping research. However, while all repeats are unstable, precise mutation rates vary greatly between different repeat loci. This implies that not all tandem repeats are suited as markers for forensics, genotyping, or putative hypervariable functional modules. The SERV ("Sequence-Based Estimation of Repeats Variability") applet enables finding repeats in DNA sequences and estimating their variability. Hence, it can be used to select repeats that are suitable markers for genotyping or interesting candidates for functional studies.
Affiliation(s)
- Matthieu Legendre
- FAS Center for Systems Biology, Harvard University, Cambridge, MA 02138, USA
| | | |
|
48
|
Piqueira JRC, Serboncini FA, Monteiro LHA. Biological models: Measuring variability with classical and quantum information. J Theor Biol 2006; 242:309-13. [PMID: 16603194 DOI: 10.1016/j.jtbi.2006.02.019] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Received: 12/21/2005] [Revised: 02/23/2006] [Accepted: 02/27/2006] [Indexed: 11/26/2022]
Abstract
This essay proposes methods to analyse the variability of biological data. The idea is to express the state of a biological system as a linear combination of base states in a Hilbert space. Coefficients of the linear combination can be interpreted as probabilities, and an informational entropy is associated with each state, allowing the definition of a classical variability measure. In addition, state transition matrices can be calculated, and their norms express the dynamics of the system organization and provide a quantum variability measure. As the examples show, the classical measure expresses a structural variability and the quantum measure expresses a functional variability.
Affiliation(s)
- J R C Piqueira
- Departamento de Engenharia de Telecomunicações e Controle, Escola Politécnica da Universidade de São Paulo, Av. Prof. Luciano Gualberto, Travessa 3, n. 158, 05508-900 São Paulo, Brazil.
|
49
|
Larsabal E, Danchin A. Genomes are covered with ubiquitous 11 bp periodic patterns, the "class A flexible patterns". BMC Bioinformatics 2005; 6:206. [PMID: 16120222 PMCID: PMC1242344 DOI: 10.1186/1471-2105-6-206] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.5] [Received: 06/22/2005] [Accepted: 08/24/2005] [Indexed: 11/17/2022] Open
Abstract
Background: The genomes of prokaryotes and lower eukaryotes display a very strong 11 bp periodic bias in the distribution of their nucleotides. This bias is present throughout a given genome, both in coding and non-coding sequences. Until now this bias remained of unknown origin. Results: Using a technique for analysis of auto-correlations based on linear projection, we identified the sequences responsible for the bias. Prokaryotic and lower eukaryotic genomes are covered with ubiquitous patterns that we termed "class A flexible patterns". Each pattern is composed of up to ten conserved nucleotides or dinucleotides distributed into a discontinuous motif. Each occurrence spans a region up to 50 bp in length. They belong to what we named the "flexible pattern" type, in that there is some limited fluctuation in the distances between the nucleotides composing each occurrence of a given pattern. When taken together, these patterns cover up to half of the genome in the majority of prokaryotes. They generate the previously recognized 11 bp periodic bias. Conclusion: Judging from the structure of the patterns, we suggest that they may define a dense network of protein interaction sites in chromosomes.
Affiliation(s)
- Etienne Larsabal
- Unité de Génétique des Génomes Bactériens, Institut Pasteur, URA CNRS 2171, 28, rue du Docteur Roux, 75724 Paris Cedex 15, France
- Antoine Danchin
- Unité de Génétique des Génomes Bactériens, Institut Pasteur, URA CNRS 2171, 28, rue du Docteur Roux, 75724 Paris Cedex 15, France
|
50
|
Nikolaou C, Almirantis Y. “Word” Preference in the Genomic Text and Genome Evolution: Different Modes of n-tuplet Usage in Coding and Noncoding Sequences. J Mol Evol 2005; 61:23-35. [PMID: 16059753 DOI: 10.1007/s00239-004-0209-2] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 07/07/2004] [Accepted: 02/02/2005] [Indexed: 10/25/2022]
Abstract
Extensive work on n-tuplet occurrence in genomic sequences has revealed the correlation of their usage with sequence origin. In parallel, different restrictions exist on the nucleotide composition of coding and noncoding sequences, which may result in distinct modes of n-tuplet usage. The relatively simple approaches described herein focus on such differences. They are based on simple summation measures of n-tuplet frequencies, computed after filtering the background nucleotide composition. One of the main aims of this work is to draw some conclusions on the qualitative differences in the composition of genomic sequences depending on their functionality. Moreover, an evolutionary model is formulated, including simple forms of ubiquitous events of genome dynamics: genomic fusions, genome shuffling due to transpositions, replication slippage, and point mutations. This model is shown to be able to reproduce all the statistical features of genomic sequences discussed herein.
Affiliation(s)
- Christoforos Nikolaou
- Institute of Biology, National Research Center for Physical Sciences Demokritos, 15310, Athens, Greece
|