1
Genomic signatures of high-altitude adaptation and chromosomal polymorphism in geladas. Nat Ecol Evol 2022; 6:630-643. [PMID: 35332281 PMCID: PMC9090980 DOI: 10.1038/s41559-022-01703-4] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 10/18/2021] [Accepted: 02/15/2022] [Indexed: 01/31/2023]
Abstract
Primates have adapted to numerous environments and lifestyles, but very few species are native to high elevations. Here, we investigated high-altitude adaptations in the gelada (Theropithecus gelada), a monkey endemic to the Ethiopian Plateau. We examined genome-wide variation in conjunction with measurements of hematological and morphological traits. Our new gelada reference genome is highly intact and assembled at chromosome-length levels. Unexpectedly, we identified a chromosomal polymorphism in geladas that could potentially contribute to reproductive barriers between populations. Compared to baboons at low altitude, we found that high-altitude geladas exhibit significantly expanded chest circumferences, potentially allowing for greater lung surface area for increased oxygen diffusion. We identified gelada-specific amino acid substitutions in the alpha-chain subunit of adult hemoglobin but found that gelada hemoglobin does not exhibit markedly altered oxygenation properties compared to lowland primates. We also found that geladas at high altitude do not exhibit elevated blood hemoglobin concentrations, in contrast to the normal acclimatization response to hypoxia in lowland primates. The absence of altitude-related polycythemia suggests that geladas are able to sustain adequate tissue-oxygen delivery despite environmental hypoxia. Finally, we identified numerous genes and genomic regions exhibiting accelerated rates of evolution, as well as gene families exhibiting expansions in the gelada lineage, potentially reflecting altitude-related selection. Our findings lend insight into putative mechanisms of high-altitude adaptation while suggesting promising avenues for functional hypoxia research.
2
Barrera-Redondo J, Sánchez-de la Vega G, Aguirre-Liguori JA, Castellanos-Morales G, Gutiérrez-Guerrero YT, Aguirre-Dugua X, Aguirre-Planter E, Tenaillon MI, Lira-Saade R, Eguiarte LE. The domestication of Cucurbita argyrosperma as revealed by the genome of its wild relative. HORTICULTURE RESEARCH 2021; 8:109. [PMID: 33931618 PMCID: PMC8087764 DOI: 10.1038/s41438-021-00544-9] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 11/09/2020] [Revised: 03/03/2021] [Accepted: 03/14/2021] [Indexed: 05/06/2023]
Abstract
Despite their economic importance and well-characterized domestication syndrome, the genomic impact of domestication and the identification of variants underlying the domestication traits in Cucurbita species (pumpkins and squashes) is currently lacking. Cucurbita argyrosperma, also known as cushaw pumpkin or silver-seed gourd, is a Mexican crop consumed primarily for its seeds rather than fruit flesh. This makes it a good model to study Cucurbita domestication, as seeds were an essential component of early Mesoamerican diet and likely the first targets of human-guided selection in pumpkins and squashes. We obtained population-level data using tunable Genotype by Sequencing libraries for 192 individuals of the wild and domesticated subspecies of C. argyrosperma across Mexico. We also assembled the first high-quality wild Cucurbita genome. Comparative genomic analyses revealed several structural variants and presence/absence of genes related to domestication. Our results indicate a monophyletic origin of this domesticated crop in the lowlands of Jalisco. We found evidence of gene flow between the domesticated and wild subspecies, which likely alleviated the effects of the domestication bottleneck. We uncovered candidate domestication genes that are involved in the regulation of growth hormones, plant defense mechanisms, seed development, and germination. The presence of shared selected alleles with the closely related species Cucurbita moschata suggests domestication-related introgression between both taxa.
Affiliation(s)
- Josué Barrera-Redondo
- Departamento de Ecología Evolutiva, Instituto de Ecología, Universidad Nacional Autónoma de México, Circuito Exterior s/n Anexo al Jardín Botánico, 04510, Ciudad de México, México.
- Guillermo Sánchez-de la Vega
- Departamento de Ecología Evolutiva, Instituto de Ecología, Universidad Nacional Autónoma de México, Circuito Exterior s/n Anexo al Jardín Botánico, 04510, Ciudad de México, México
- Jonás A Aguirre-Liguori
- Department of Ecology and Evolutionary Biology, University of California, Irvine, CA, 92697, USA
- Gabriela Castellanos-Morales
- Departamento de Conservación de la Biodiversidad, El Colegio de la Frontera Sur, Villahermosa, Carretera Villahermosa-Reforma km 15.5 Ranchería El Guineo 2ª sección, 86280, Villahermosa, Tabasco, México
- Yocelyn T Gutiérrez-Guerrero
- Departamento de Ecología Evolutiva, Instituto de Ecología, Universidad Nacional Autónoma de México, Circuito Exterior s/n Anexo al Jardín Botánico, 04510, Ciudad de México, México
- Xitlali Aguirre-Dugua
- Departamento de Ecología Evolutiva, Instituto de Ecología, Universidad Nacional Autónoma de México, Circuito Exterior s/n Anexo al Jardín Botánico, 04510, Ciudad de México, México
- Erika Aguirre-Planter
- Departamento de Ecología Evolutiva, Instituto de Ecología, Universidad Nacional Autónoma de México, Circuito Exterior s/n Anexo al Jardín Botánico, 04510, Ciudad de México, México
- Maud I Tenaillon
- Génétique Quantitative et Evolution - Le Moulon, Université Paris-Saclay, Institut National de Recherche pour l'Agriculture, l'Alimentation et l'Environnement, Centre National de la Recherche Scientifique, AgroParisTech, Gif-sur-Yvette, 91190, France
- Rafael Lira-Saade
- UBIPRO, Facultad de Estudios Superiores Iztacala, Universidad Nacional Autónoma de México, Av. de los Barrios #1, Col. Los Reyes Iztacala, Tlalnepantla, Edo. de Mex, 54090, México.
- Luis E Eguiarte
- Departamento de Ecología Evolutiva, Instituto de Ecología, Universidad Nacional Autónoma de México, Circuito Exterior s/n Anexo al Jardín Botánico, 04510, Ciudad de México, México.
3
Silva M, Pratas D, Pinho AJ. Efficient DNA sequence compression with neural networks. Gigascience 2020; 9:giaa119. [PMID: 33179040 PMCID: PMC7657843 DOI: 10.1093/gigascience/giaa119] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 05/26/2020] [Revised: 08/19/2020] [Accepted: 10/02/2020] [Indexed: 12/11/2022]
Abstract
BACKGROUND The increasing production of genomic data has led to an intensified need for models that can cope efficiently with the lossless compression of DNA sequences. Important applications include long-term storage and compression-based data analysis. In the literature, only a few recent articles propose the use of neural networks for DNA sequence compression. However, they fall short when compared with specific DNA compression tools, such as GeCo2. This limitation is due to the absence of models specifically designed for DNA sequences. In this work, we combine the power of neural networks with specific DNA models. For this purpose, we created GeCo3, a new genomic sequence compressor that uses neural networks for mixing multiple context and substitution-tolerant context models. FINDINGS We benchmark GeCo3 as a reference-free DNA compressor in 5 datasets, including a balanced and comprehensive dataset of DNA sequences, the Y-chromosome and human mitogenome, 2 compilations of archaeal and virus genomes, 4 whole genomes, and 2 collections of FASTQ data of a human virome and ancient DNA. GeCo3 achieves a solid improvement in compression over the previous version (GeCo2) of 2.4%, 7.1%, 6.1%, 5.8%, and 6.0%, respectively. To test its performance as a reference-based DNA compressor, we benchmark GeCo3 in 4 datasets constituted by the pairwise compression of the chromosomes of the genomes of several primates. GeCo3 improves the compression by 12.4%, 11.7%, 10.8%, and 10.1% over the state of the art. The cost of this compression improvement is some additional computational time (1.7-3 times slower than GeCo2). The RAM use is constant, and the tool scales efficiently, independently of the sequence size. Overall, these values outperform the state of the art. CONCLUSIONS GeCo3 is a genomic sequence compressor with a neural network mixing approach that provides additional gains over top specific genomic compressors.
The proposed mixing method is portable, requiring only the probabilities of the models as inputs, providing easy adaptation to other data compressors or compression-based data analysis tools. GeCo3 is released under GPLv3 and is available for free download at https://github.com/cobilab/geco3.
Affiliation(s)
- Milton Silva
- Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
- Department of Electronics Telecommunications and Informatics, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
- Diogo Pratas
- Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
- Department of Electronics Telecommunications and Informatics, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
- Department of Virology, University of Helsinki, Haartmaninkatu 3, 00014 Helsinki, Finland
- Armando J Pinho
- Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
- Department of Electronics Telecommunications and Informatics, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
4
Hosseini M, Pratas D, Morgenstern B, Pinho AJ. Smash++: an alignment-free and memory-efficient tool to find genomic rearrangements. Gigascience 2020; 9:giaa048. [PMID: 32432328 PMCID: PMC7238676 DOI: 10.1093/gigascience/giaa048] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 01/11/2020] [Revised: 04/06/2020] [Accepted: 04/20/2020] [Indexed: 12/05/2022]
Abstract
BACKGROUND The development of high-throughput sequencing technologies and, as its result, the production of huge volumes of genomic data, has accelerated biological and medical research and discovery. Study on genomic rearrangements is crucial owing to their role in chromosomal evolution, genetic disorders, and cancer. RESULTS We present Smash++, an alignment-free and memory-efficient tool to find and visualize small- and large-scale genomic rearrangements between 2 DNA sequences. This computational solution extracts information contents of the 2 sequences, exploiting a data compression technique to find rearrangements. We also present Smash++ visualizer, a tool that allows the visualization of the detected rearrangements along with their self- and relative complexity, by generating an SVG (Scalable Vector Graphics) image. CONCLUSIONS Tested on several synthetic and real DNA sequences from bacteria, fungi, Aves, and Mammalia, the proposed tool was able to accurately find genomic rearrangements. The detected regions were in accordance with previous studies, which took alignment-based approaches or performed FISH (fluorescence in situ hybridization) analysis. The maximum peak memory usage among all experiments was ∼1 GB, which makes Smash++ feasible to run on present-day standard computers.
Affiliation(s)
- Morteza Hosseini
- IEETA/DETI, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
- Diogo Pratas
- IEETA/DETI, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
- Department of Virology, University of Helsinki, Haartmaninkatu 3, 00014 Helsinki, Finland
- Burkhard Morgenstern
- Department of Bioinformatics, University of Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
- Göttingen Center of Molecular Biosciences (GZMB), Justus-von-Liebig-Weg 11, 37077 Göttingen, Germany
- Armando J Pinho
- IEETA/DETI, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
5
Liu R, Low WY, Tearle R, Koren S, Ghurye J, Rhie A, Phillippy AM, Rosen BD, Bickhart DM, Smith TPL, Hiendleder S, Williams JL. New insights into mammalian sex chromosome structure and evolution using high-quality sequences from bovine X and Y chromosomes. BMC Genomics 2019; 20:1000. [PMID: 31856728 PMCID: PMC6923926 DOI: 10.1186/s12864-019-6364-z] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Received: 07/26/2019] [Accepted: 12/02/2019] [Indexed: 12/27/2022]
Abstract
BACKGROUND Mammalian X chromosomes are mainly euchromatic with a similar size and structure among species, whereas Y chromosomes are smaller, have undergone substantial evolutionary changes and accumulated male-specific genes and genes involved in sex determination. The pseudoautosomal region (PAR) is conserved on the X and Y and pairs during meiosis. The structure, evolution and function of mammalian sex chromosomes, particularly the Y chromosome, are still poorly understood because few species have high-quality sex chromosome assemblies. RESULTS Here we report the first bovine sex chromosome assemblies that include the complete PAR spanning 6.84 Mb and three Y chromosome X-degenerate (X-d) regions. The PAR comprises 31 genes, including genes that are missing from the X chromosome in current cattle, sheep and goat reference genomes. Twenty-nine PAR genes are single-copy genes and two are multi-copy gene families: OBP, which has 3 copies, and BDA20, which has 4 copies. The Y chromosome X-d1, 2a and 2b regions contain 11, 2 and 2 gametologs, respectively. CONCLUSIONS The ruminant PAR comprises 31 genes and is similar to the PAR of pig and dog but extends further than those of human and horse. Differences in the pseudoautosomal boundaries are consistent with evolutionary divergence times. A Bovidae-specific expansion of members of the lipocalin gene family in the PAR reported here may affect immune-modulation and anti-inflammatory responses in ruminants. Comparison of the X-d regions of Y chromosomes across species revealed that five of the X-Y gametologs, which are known to be global regulators of gene activity and candidate sexual dimorphism genes, are conserved.
Affiliation(s)
- Ruijie Liu
- The Davies Research Centre, School of Animal and Veterinary Sciences, University of Adelaide, Roseworthy, South Australia, Australia
- Wai Yee Low
- The Davies Research Centre, School of Animal and Veterinary Sciences, University of Adelaide, Roseworthy, South Australia, Australia
- Rick Tearle
- The Davies Research Centre, School of Animal and Veterinary Sciences, University of Adelaide, Roseworthy, South Australia, Australia
- Sergey Koren
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, MD, USA
- Jay Ghurye
- Center for Bioinformatics and Computational Biology, Lab 3104A, Biomolecular Science Building, University of Maryland, College Park, MD, USA
- Arang Rhie
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, MD, USA
- Adam M Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, MD, USA
- Benjamin D Rosen
- Animal Genomics and Improvement Laboratory, ARS USDA, Beltsville, MD, USA
- Derek M Bickhart
- Cell Wall Biology and Utilization Laboratory, ARS USDA, Madison, WI, USA
- Stefan Hiendleder
- The Davies Research Centre, School of Animal and Veterinary Sciences, University of Adelaide, Roseworthy, South Australia, Australia
- John L Williams
- The Davies Research Centre, School of Animal and Veterinary Sciences, University of Adelaide, Roseworthy, South Australia, Australia.
6
Hosseini M, Pratas D, Pinho AJ. AC: A Compression Tool for Amino Acid Sequences. Interdiscip Sci 2019; 11:68-76. [PMID: 30721401 DOI: 10.1007/s12539-019-00322-1] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 08/24/2018] [Revised: 01/23/2019] [Accepted: 01/28/2019] [Indexed: 10/27/2022]
7
Diaz-del-Pino S, Rodriguez-Brazzarola P, Perez-Wohlfeil E, Trelles O. Combining Strengths for Multi-genome Visual Analytics Comparison. Bioinform Biol Insights 2019; 13:1177932218825127. [PMID: 30783378 PMCID: PMC6365554 DOI: 10.1177/1177932218825127] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/17/2018] [Accepted: 12/22/2018] [Indexed: 11/25/2022]
Abstract
The eclosion of data acquisition technologies has shifted the bottleneck in molecular biology research from data acquisition to data analysis. Such is the case in Comparative Genomics, where sequence analysis has transitioned from genes to genomes of several orders of magnitude larger. This fact has revealed the need to adapt software to work with huge experiments efficiently and to incorporate new data-analysis strategies to manage results from such studies. In previous works, we presented GECKO, a software to compare large sequences; now we address the representation, browsing, data exploration, and post-processing of the massive amount of information derived from such comparisons. GECKO-MGV is a web-based application organized as client-server architecture. It is aimed at visual analysis of the results from both pairwise and multiple sequences comparison studies combining a set of common commands for image exploration with improved state-of-the-art solutions. In addition, GECKO-MGV integrates different visualization analysis tools while exploiting the concept of layers to display multiple genome comparison datasets. Moreover, the software is endowed with capabilities for contacting external-proprietary and third-party services for further data post-processing and also presents a method to display a timeline of large-scale evolutionary events. As proof-of-concept, we present 2 exercises using bacterial and mammalian genomes which depict the capabilities of GECKO-MGV to perform in-depth, customizable analyses on the fly using web technologies. The first exercise is mainly descriptive and is carried out over bacterial genomes, whereas the second one aims to show the ability to deal with large sequence comparisons. In this case, we display results from the comparison of the first Homo sapiens chromosome against the first 5 chromosomes of Mus musculus.
Affiliation(s)
- Sergio Diaz-del-Pino
- Department of Computer Architecture, University of Málaga and Instituto de Investigación Biomédica de Málaga (IBIMA), Málaga, Spain
- Pablo Rodriguez-Brazzarola
- Department of Computer Architecture, University of Málaga and Instituto de Investigación Biomédica de Málaga (IBIMA), Málaga, Spain
- Esteban Perez-Wohlfeil
- Department of Computer Architecture, University of Málaga and Instituto de Investigación Biomédica de Málaga (IBIMA), Málaga, Spain
- Oswaldo Trelles
- Department of Computer Architecture, University of Málaga and Instituto de Investigación Biomédica de Málaga (IBIMA), Málaga, Spain
8
Bruhn M, Schindler D, Kemter FS, Wiley MR, Chase K, Koroleva GI, Palacios G, Sozhamannan S, Waldminghaus T. Functionality of Two Origins of Replication in Vibrio cholerae Strains With a Single Chromosome. Front Microbiol 2018; 9:2932. [PMID: 30559732 PMCID: PMC6284228 DOI: 10.3389/fmicb.2018.02932] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 09/28/2018] [Accepted: 11/14/2018] [Indexed: 12/16/2022]
Abstract
Chromosomal inheritance in bacteria usually entails bidirectional replication of a single chromosome from a single origin into two copies and subsequent partitioning of one copy each into daughter cells upon cell division. However, the human pathogen Vibrio cholerae and other Vibrionaceae harbor two chromosomes, a large Chr1 and a small Chr2. Chr1 and Chr2 have different origins, an oriC-type origin and a P1 plasmid-type origin, respectively, driving the replication of respective chromosomes. Recently, we described naturally occurring exceptions to the two-chromosome rule of Vibrionaceae: i.e., Chr1 and Chr2 fused single chromosome V. cholerae strains, NSCV1 and NSCV2, in which both origins of replication are present. Using NSCV1 and NSCV2, here we tested whether two types of origins of replication can function simultaneously on the same chromosome or one or the other origin is silenced. We found that in NSCV1, both origins are active whereas in NSCV2 ori2 is silenced despite the fact that it is functional in an isolated context. The ori2 activity appears to be primarily determined by the copy number of the triggering site, crtS, which in turn is determined by its location with respect to ori1 and ori2 on the fused chromosome.
Affiliation(s)
- Matthias Bruhn
- LOEWE Centre for Synthetic Microbiology-SYNMIKRO, Philipps-Universität Marburg, Marburg, Germany
- Daniel Schindler
- Manchester Institute of Biotechnology, The University of Manchester, Manchester, United Kingdom
- Franziska S Kemter
- LOEWE Centre for Synthetic Microbiology-SYNMIKRO, Philipps-Universität Marburg, Marburg, Germany
- Michael R Wiley
- United States Army Medical Research Institute of Infectious Diseases, Frederick, MD, United States
- Kitty Chase
- United States Army Medical Research Institute of Infectious Diseases, Frederick, MD, United States
- Galina I Koroleva
- United States Army Medical Research Institute of Infectious Diseases, Frederick, MD, United States
- Gustavo Palacios
- United States Army Medical Research Institute of Infectious Diseases, Frederick, MD, United States
- Shanmuga Sozhamannan
- Defense Biological Product Assurance Office, Frederick, MD, United States
- The Tauri Group, LLC, Alexandria, VA, United States
- Torsten Waldminghaus
- LOEWE Centre for Synthetic Microbiology-SYNMIKRO, Philipps-Universität Marburg, Marburg, Germany
9
Comparison of Compression-Based Measures with Application to the Evolution of Primate Genomes. ENTROPY 2018; 20:e20060393. [PMID: 33265483 PMCID: PMC7512912 DOI: 10.3390/e20060393] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 03/03/2018] [Revised: 05/16/2018] [Accepted: 05/21/2018] [Indexed: 11/26/2022]
Abstract
An efficient DNA compressor furnishes an approximation to measure and compare information quantities present in, between and across DNA sequences, regardless of the characteristics of the sources. In this paper, we compare directly two information measures, the Normalized Compression Distance (NCD) and the Normalized Relative Compression (NRC). These measures answer different questions; the NCD measures how similar both strings are (in terms of information content) and the NRC (which, in general, is nonsymmetric) indicates the fraction of one of them that cannot be constructed using information from the other one. This leads to the problem of finding out which measure (or question) is more suitable for the answer we need. For computing both, we use a state of the art DNA sequence compressor that we benchmark with some top compressors in different compression modes. Then, we apply the compressor on DNA sequences with different scales and natures, first using synthetic sequences and then on real DNA sequences. The last include mitochondrial DNA (mtDNA), messenger RNA (mRNA) and genomic DNA (gDNA) of seven primates. We provide several insights into evolutionary acceleration rates at different scales, namely, the observation and confirmation across the whole genomes of a higher variation rate of the mtDNA relative to the gDNA. We also show the importance of relative compression for localizing similar information regions using mtDNA.
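The two measures contrasted in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the compressor used in the paper: `zlib` stands in for the DNA compressor when computing the NCD, and the relative compression C(x‖y) is approximated by a hypothetical order-k count model (`relative_bits`, with smoothing parameter `alpha`) whose statistics come only from y. The NRC is normalized here by 2 bits per encoded symbol, assuming a four-letter alphabet.

```python
import math
import zlib
from collections import Counter, defaultdict


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (a stand-in for C(.))."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy, cxy = compressed_size(x), compressed_size(y), compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def relative_bits(x: str, y: str, k: int = 3, alpha: float = 1 / 16) -> float:
    """Bits to encode x[k:] with an order-k model whose counts come only from y.

    Hypothetical helper: a simple smoothed count model, not the paper's compressor.
    """
    counts = defaultdict(Counter)
    for i in range(k, len(y)):
        counts[y[i - k:i]][y[i]] += 1
    empty = Counter()
    bits = 0.0
    for i in range(k, len(x)):
        ctx = counts.get(x[i - k:i], empty)
        # Additive smoothing over the 4-letter DNA alphabet.
        p = (ctx[x[i]] + alpha) / (sum(ctx.values()) + 4 * alpha)
        bits -= math.log2(p)
    return bits


def nrc(x: str, y: str, k: int = 3) -> float:
    """Normalized Relative Compression: C(x||y) / (number of symbols * log2(4))."""
    return relative_bits(x, y, k) / ((len(x) - k) * math.log2(4))
```

With this sketch, `nrc(x, y)` near 0 means x can be built almost entirely from y's statistics, while values near 1 mean y contributes nothing; swapping the arguments generally gives a different value, which is the nonsymmetry the abstract mentions.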
10
Lin J, Wei J, Adjeroh D, Jiang BH, Jiang Y. SSAW: A new sequence similarity analysis method based on the stationary discrete wavelet transform. BMC Bioinformatics 2018; 19:165. [PMID: 29720081 PMCID: PMC5930706 DOI: 10.1186/s12859-018-2155-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 11/19/2017] [Accepted: 04/11/2018] [Indexed: 11/10/2022]
Abstract
BACKGROUND Alignment-free sequence similarity analysis methods often lead to significant savings in computational time over alignment-based counterparts. RESULTS A new alignment-free sequence similarity analysis method, called SSAW, is proposed. SSAW stands for Sequence Similarity Analysis using the Stationary Discrete Wavelet Transform (SDWT). It extracts k-mers from a sequence, then maps each k-mer to a complex number field. Then, the series of complex numbers formed are transformed into feature vectors using the stationary discrete wavelet transform. After these steps, the original sequence is turned into a feature vector with numeric values, which can then be used for clustering and/or classification. CONCLUSIONS Using two different types of applications, namely, clustering and classification, we compared SSAW against state-of-the-art alignment-free sequence analysis methods. SSAW demonstrates competitive or superior performance in terms of standard indicators, such as accuracy, F-score, precision, and recall. The running time was significantly better in most cases. These make SSAW a suitable method for sequence analysis, especially given the rapidly increasing volumes of sequence data required by most modern applications.
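The pipeline outlined in this abstract (k-mers, then complex numbers, then a stationary wavelet transform, then a feature vector) can be sketched as follows. Everything specific here is an assumption made for illustration: the abstract does not give the k-mer-to-complex mapping, so `BASE_VALUE` and the averaging in `kmer_signal` are hypothetical, the wavelet is an undecimated Haar transform with circular boundaries, and `feature_vector` summarizes each band by its mean magnitude. It is not the SSAW implementation.

```python
import math

# Hypothetical base-to-complex mapping; the abstract does not specify one.
BASE_VALUE = {"A": 1 + 0j, "C": 0 + 1j, "G": -1 + 0j, "T": 0 - 1j}


def kmer_signal(seq, k=3):
    """Map every overlapping k-mer to one complex number (mean of its base values)."""
    return [
        sum(BASE_VALUE[b] for b in seq[i:i + k]) / k
        for i in range(len(seq) - k + 1)
    ]


def haar_swt(signal, levels=2):
    """Undecimated (stationary) Haar transform with circular boundaries.

    Returns (final approximation, [detail band per level]); every band keeps
    the length of the input because nothing is downsampled.
    """
    n = len(signal)
    approx = list(signal)
    details = []
    for level in range(levels):
        step = 2 ** level  # the filter is dilated instead of decimating the signal
        shifted = [approx[(i + step) % n] for i in range(n)]
        details.append([(a - b) / math.sqrt(2) for a, b in zip(approx, shifted)])
        approx = [(a + b) / math.sqrt(2) for a, b in zip(approx, shifted)]
    return approx, details


def feature_vector(seq, k=3, levels=2):
    """Fixed-length summary: mean magnitude of each detail band, then the approximation."""
    approx, details = haar_swt(kmer_signal(seq, k), levels)
    bands = details + [approx]
    return [sum(abs(c) for c in band) / len(band) for band in bands]
```

Because the resulting vector has `levels + 1` entries regardless of sequence length, sequences of different lengths become directly comparable, which is what makes such features usable for clustering or classification.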
Affiliation(s)
- Jie Lin
- College of Mathematics and Informatics, Fujian Normal University, Fuzhou, 350108, People's Republic of China
- Jing Wei
- College of Mathematics and Informatics, Fujian Normal University, Fuzhou, 350108, People's Republic of China
- Donald Adjeroh
- Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, 26506, WV, USA
- Bing-Hua Jiang
- Department of Pathology, University of Iowa, Iowa city, 52242, Iowa, USA
- Yue Jiang
- College of Mathematics and Informatics, Fujian Normal University, Fuzhou, 350108, People's Republic of China.
11
Alignment-based and alignment-free methods converge with experimental data on amino acids coded by stop codons at split between nuclear and mitochondrial genetic codes. Biosystems 2018; 167:33-46. [DOI: 10.1016/j.biosystems.2018.03.002] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Received: 01/15/2018] [Revised: 03/18/2018] [Accepted: 03/19/2018] [Indexed: 12/11/2022]
12
Zielezinski A, Vinga S, Almeida J, Karlowski WM. Alignment-free sequence comparison: benefits, applications, and tools. Genome Biol 2017; 18:186. [PMID: 28974235 PMCID: PMC5627421 DOI: 10.1186/s13059-017-1319-7] [Citation(s) in RCA: 259] [Impact Index Per Article: 37.0] [Indexed: 01/30/2023]
Abstract
Alignment-free sequence analyses have been applied to problems ranging from whole-genome phylogeny to the classification of protein families, identification of horizontally transferred genes, and detection of recombined sequences. The strength of these methods makes them particularly useful for next-generation sequencing data processing and analysis. However, many researchers are unclear about how these methods work, how they compare to alignment-based methods, and what their potential is for use for their research. We address these questions and provide a guide to the currently available alignment-free sequence analysis tools.
Affiliation(s)
- Andrzej Zielezinski
- Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University in Poznan, Umultowska 89, 61-614, Poznan, Poland
- Susana Vinga
- IDMEC, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1049-001, Lisbon, Portugal
- Jonas Almeida
- Stony Brook University (SUNY), 101 Nicolls Road, Stony Brook, NY, 11794, USA
- Wojciech M Karlowski
- Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University in Poznan, Umultowska 89, 61-614, Poznan, Poland.
13
Sievers A, Bosiek K, Bisch M, Dreessen C, Riedel J, Froß P, Hausmann M, Hildenbrand G. K-mer Content, Correlation, and Position Analysis of Genome DNA Sequences for the Identification of Function and Evolutionary Features. Genes (Basel) 2017; 8:E122. [PMID: 28422050 PMCID: PMC5406869 DOI: 10.3390/genes8040122] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Received: 02/10/2017] [Revised: 03/24/2017] [Accepted: 04/04/2017] [Indexed: 12/26/2022]
Abstract
In genome analysis, k-mer-based comparison methods have become standard tools. However, even though they are able to deliver reliable results, other algorithms seem to work better in some cases. To improve k-mer-based DNA sequence analysis and comparison, we successfully checked whether adding positional resolution is beneficial for finding and/or comparing interesting organizational structures. A simple but efficient algorithm for extracting and saving local k-mer spectra (frequency distribution of k-mers) was developed and used. The results were analyzed by including positional information based on visualizations as genomic maps and by applying basic vector correlation methods. This analysis was concentrated on small word lengths (1 ≤ k ≤ 4) on relatively small viral genomes of Papillomaviridae and Herpesviridae, while also checking its usability for larger sequences, namely human chromosome 2 and the homologous chromosomes (2A, 2B) of a chimpanzee. Using this alignment-free analysis, several regions with specific characteristics in Papillomaviridae and Herpesviridae formerly identified by independent, mostly alignment-based methods, were confirmed. Correlations between the k-mer content and several genes in these genomes have been found, showing similarities between classified and unclassified viruses, which may be potentially useful for further taxonomic research. Furthermore, unknown k-mer correlations in the genomes of Human Herpesviruses (HHVs), which are probably of major biological function, are found and described. Using the chromosomes of a chimpanzee and human that are currently known, identities between the species on every analyzed chromosome were reproduced. This demonstrates the feasibility of our approach for large data sets of complex genomes. 
Based on these results, we suggest k-mer analysis with positional resolution as a method for closing a gap between the effectiveness of alignment-based methods (like NCBI BLAST) and the high pace of standard k-mer analysis.
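The idea of local k-mer spectra compared by vector correlation can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`kmer_spectrum`, `local_spectra`, `pearson`), the fixed-size sliding window, and the choice of Pearson correlation as the "basic vector correlation method" are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_spectrum(seq, k):
    """Relative frequency of every k-mer over ACGT, in a fixed word order."""
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Words containing other symbols (e.g. N) are ignored in the total.
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]


def local_spectra(seq, k, window, step):
    """Spectra of successive windows, keyed by window start position."""
    return [
        (start, kmer_spectrum(seq[start:start + window], k))
        for start in range(0, len(seq) - window + 1, step)
    ]


def pearson(u, v):
    """Pearson correlation of two equal-length vectors (0.0 if one is constant)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Correlating the spectra of two windows, from the same sequence or from two different genomes, gives a position-resolved similarity score without any alignment. Window size and step are free parameters here; the abstract does not state the values used.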
Affiliation(s)
- Aaron Sievers
  - Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany.
- Katharina Bosiek
  - Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany.
- Marc Bisch
  - Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany.
- Chris Dreessen
  - Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany.
- Jascha Riedel
  - Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany.
- Patrick Froß
  - Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany.
- Michael Hausmann
  - Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany.
- Georg Hildenbrand
  - Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany.
  - Department of Radiation Oncology, Universitätsmedizin Mannheim, Medical Faculty Mannheim, Heidelberg University, Theodor-Kutzer-Ufer 1-3, 68167 Mannheim, Germany.
14
Tavares AHMP, Pinho AJ, Silva RM, Rodrigues JMOS, Bastos CAC, Ferreira PJSG, Afreixo V. DNA word analysis based on the distribution of the distances between symmetric words. Sci Rep 2017; 7:728. [PMID: 28389642 PMCID: PMC5428789 DOI: 10.1038/s41598-017-00646-2] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 08/11/2016] [Accepted: 03/02/2017] [Indexed: 02/01/2023]
Abstract
We address the problem of discovering pairs of symmetric genomic words (i.e., words and the corresponding reversed complements) occurring at distances that are overrepresented. For this purpose, we developed new procedures to identify symmetric word pairs with uncommon empirical distance distribution and with clusters of overrepresented short distances. We speculate that patterns of overrepresentation of short distances between symmetric word pairs may allow the occurrence of non-standard DNA conformations, such as hairpin/cruciform structures. We focused on the human genome, and analysed both the complete genome as well as a version with known repetitive sequences masked out. We reported several well-defined features in the distributions of distances, which can be classified into three different profiles, showing enrichment in distinct distance ranges. We analysed in greater detail certain pairs of symmetric words of length seven, found by our procedure, characterised by the surprising fact that they occur at single distances more frequently than expected.
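To make the notion of a distance distribution between symmetric words concrete, here is a minimal Python sketch. It is an illustration only: the function names are hypothetical, and measuring the distance from each occurrence of a word to the nearest following occurrence of its reversed complement is an assumed definition, which may differ from the procedure used in the paper.

```python
import bisect
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Reversed complement of a DNA word, e.g. AAC -> GTT."""
    return word.translate(COMPLEMENT)[::-1]


def positions(seq, word):
    """Start positions of all (possibly overlapping) occurrences of word."""
    k = len(word)
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] == word]


def symmetric_pair_distances(seq, word):
    """Histogram of distances from each occurrence of word to the nearest
    following occurrence of its reversed complement (assumed definition)."""
    rc_positions = positions(seq, reverse_complement(word))
    histogram = Counter()
    for p in positions(seq, word):
        # First reversed-complement occurrence strictly after position p.
        j = bisect.bisect_right(rc_positions, p)
        if j < len(rc_positions):
            histogram[rc_positions[j] - p] += 1
    return histogram
```

Comparing such a histogram with the distribution expected under a null model (not shown here) is what would flag distances as overrepresented.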
Affiliation(s)
- Ana H M P Tavares
  - Department of Mathematics & CIDMA, University of Aveiro, Aveiro, Portugal.
  - Department of Medical Sciences & iBiMED, University of Aveiro, Aveiro, Portugal.
- Armando J Pinho
  - Department of Electronics, Telecommunications and Informatics, University of Aveiro, Aveiro, Portugal.
  - IEETA, University of Aveiro, Aveiro, Portugal.
- Raquel M Silva
  - Department of Medical Sciences & iBiMED, University of Aveiro, Aveiro, Portugal.
  - IEETA, University of Aveiro, Aveiro, Portugal.
- João M O S Rodrigues
  - Department of Electronics, Telecommunications and Informatics, University of Aveiro, Aveiro, Portugal.
  - IEETA, University of Aveiro, Aveiro, Portugal.
- Carlos A C Bastos
  - Department of Electronics, Telecommunications and Informatics, University of Aveiro, Aveiro, Portugal.
  - IEETA, University of Aveiro, Aveiro, Portugal.
- Paulo J S G Ferreira
  - Department of Electronics, Telecommunications and Informatics, University of Aveiro, Aveiro, Portugal.
  - IEETA, University of Aveiro, Aveiro, Portugal.
- Vera Afreixo
  - Department of Mathematics & CIDMA, University of Aveiro, Aveiro, Portugal.
  - Department of Medical Sciences & iBiMED, University of Aveiro, Aveiro, Portugal.
  - IEETA, University of Aveiro, Aveiro, Portugal.
15