1
Zimnyakov D, Alonova M, Skripal A, Dobdin S, Feodorova V. Quantification of the Diversity in Gene Structures Using the Principles of Polarization Mapping. Curr Issues Mol Biol 2023; 45:1720-1740. [PMID: 36826056 PMCID: PMC9955201 DOI: 10.3390/cimb45020111] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 10/17/2022] [Revised: 02/05/2023] [Accepted: 02/16/2023] [Indexed: 02/22/2023]
Abstract
Results of computational analysis and visualization of differences in gene structures using polarization coding are presented. A two-dimensional phase screen, in which each element corresponds to a specific basic nucleotide (adenine, cytosine, guanine, or thymine), displays the analyzed nucleotide sequence. Readout of the screen with a coherent beam characterized by a given polarization state forms a diffracted light field with a local polarization structure that is unique for the analyzed nucleotide sequence. This unique structure is described by spatial distributions of local values of the Stokes vector components. Analysis of these distributions allows the comparison of nucleotide sequences for different strains of pathogenic microorganisms and frequency analysis of the sequences. The possibilities of this polarization-based technique are illustrated by the model data obtained from a comparative analysis of the spike protein gene sequences for three different model variants (Wuhan, Delta, and Omicron) of the SARS-CoV-2 virus. Various modifications of polarization encoding and analysis of gene structures and a possibility for instrumental implementation of the proposed method are discussed.
Affiliation(s)
- Dmitry Zimnyakov
  - Physics Department, Yury Gagarin State Technical University of Saratov, 77 Polytechnicheskaya St., 410054 Saratov, Russia
  - Precision Mechanics and Control Institute of Russian Academy of Sciences, 24 Rabochaya St., 410024 Saratov, Russia
  - Institute of Physics, Saratov State University, 83 Astrakhanskaya St., 410012 Saratov, Russia
  - Correspondence:
- Marina Alonova
  - Physics Department, Yury Gagarin State Technical University of Saratov, 77 Polytechnicheskaya St., 410054 Saratov, Russia
- Anatoly Skripal
  - Institute of Physics, Saratov State University, 83 Astrakhanskaya St., 410012 Saratov, Russia
- Sergey Dobdin
  - Institute of Physics, Saratov State University, 83 Astrakhanskaya St., 410012 Saratov, Russia
- Valentina Feodorova
  - Institute of Physics, Saratov State University, 83 Astrakhanskaya St., 410012 Saratov, Russia
2
Silva JM, Qi W, Pinho AJ, Pratas D. AlcoR: alignment-free simulation, mapping, and visualization of low-complexity regions in biological data. Gigascience 2022; 12:giad101. [PMID: 38091509 PMCID: PMC10716826 DOI: 10.1093/gigascience/giad101] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2023] [Revised: 09/29/2023] [Accepted: 11/07/2023] [Indexed: 12/18/2023]
Abstract
BACKGROUND Low-complexity data analysis is the area that addresses the search and quantification of regions in sequences of elements that contain low-complexity or repetitive elements. For example, these can be tandem repeats, inverted repeats, homopolymer tails, GC-biased regions, similar genes, and hairpins, among many others. Identifying these regions is crucial because of their association with regulatory and structural characteristics. Moreover, their identification provides positional and quantity information where standard assembly methodologies face significant difficulties because of substantially higher depth coverage (mountains), ambiguous read mapping, or where sequencing or reconstruction defects may occur. However, the capability to distinguish low-complexity regions (LCRs) in genomic and proteomic sequences is a challenge that depends on the model's ability to find them automatically. Low-complexity patterns can be implicit through specific or combined sources, such as algorithmic or probabilistic, and recurring at different spatial distances, namely local, medium, or distant associations. FINDINGS This article addresses the challenge of automatically modeling and distinguishing LCRs, providing a new method and tool (AlcoR) for efficient and accurate segmentation and visualization of these regions in genomic and proteomic sequences. The method enables the use of models with different memories, providing the ability to distinguish local from distant low-complexity patterns. The method is reference and alignment free, providing additional methodologies for testing, including a highly flexible simulation method for generating biological sequences (DNA or protein) with different complexity levels, sequence masking, and a visualization tool for automatic computation of the LCR maps into an ideogram style. We provide illustrative demonstrations using synthetic, nearly synthetic, and natural sequences showing the high efficiency and accuracy of AlcoR.
As large-scale results, we use AlcoR to provide, for the first time, a whole-chromosome low-complexity map of a recent complete human genome and the haplotype-resolved chromosome pairs of a heterozygous diploid African cassava cultivar. CONCLUSIONS The AlcoR method provides the ability of fast sequence characterization through data complexity analysis, ideally for scenarios involving the presence of new or unknown sequences. AlcoR is implemented in C language using multithreading to increase the computational speed, is flexible for multiple applications, and does not contain external dependencies. The tool accepts any sequence in FASTA format. The source code is freely provided at https://github.com/cobilab/alcor.
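The idea of flagging low-complexity regions without a reference or an alignment can be illustrated with a much simpler stand-in than the models the abstract describes. The sketch below is not AlcoR and does not reproduce its method: it uses a sliding-window Shannon entropy, and the function names, window size, and threshold are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

def window_entropy(window: str) -> float:
    """Shannon entropy of the symbol distribution in one window, in bits."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def low_complexity_regions(seq: str, window: int = 20, threshold: float = 1.0):
    """Return merged half-open (start, end) intervals covered by windows
    whose entropy is below the threshold (assumed values, not from the paper)."""
    regions = []
    for i in range(len(seq) - window + 1):
        if window_entropy(seq[i:i + window]) < threshold:
            start, end = i, i + window
            if regions and start <= regions[-1][1]:
                # Overlapping or touching the previous region: extend it.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions

# A homopolymer run embedded between two maximally mixed flanks.
example = "ACGT" * 10 + "A" * 40 + "ACGT" * 10
print(low_complexity_regions(example))
```

A symbol-frequency entropy only sees local composition, so it would miss the distant repeats and inverted repeats that the abstract says models with longer memories are meant to capture.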
Affiliation(s)
- Jorge M Silva
  - IEETA, Institute of Electronics and Informatics Engineering of Aveiro, and LASI, Intelligent Systems Associate Laboratory, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
  - Department of Electronics Telecommunications and Informatics, University of Aveiro, Campus Universitario de Santiago, 3810-193, Aveiro, Portugal
- Weihong Qi
  - Functional Genomics Center Zurich, ETH Zurich and University of Zurich, Winterthurerstrasse, 190, 8057, Zurich, Switzerland
  - SIB, Swiss Institute of Bioinformatics, 1202, Geneva, Switzerland
- Armando J Pinho
  - IEETA, Institute of Electronics and Informatics Engineering of Aveiro, and LASI, Intelligent Systems Associate Laboratory, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
  - Department of Electronics Telecommunications and Informatics, University of Aveiro, Campus Universitario de Santiago, 3810-193, Aveiro, Portugal
- Diogo Pratas
  - IEETA, Institute of Electronics and Informatics Engineering of Aveiro, and LASI, Intelligent Systems Associate Laboratory, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
  - Department of Electronics Telecommunications and Informatics, University of Aveiro, Campus Universitario de Santiago, 3810-193, Aveiro, Portugal
  - Department of Virology, University of Helsinki, Haartmaninkatu, 3, 00014 Helsinki, Finland
3
Silva JM, Pratas D, Caetano T, Matos S. The complexity landscape of viral genomes. Gigascience 2022; 11:6661051. [PMID: 35950839 PMCID: PMC9366995 DOI: 10.1093/gigascience/giac079] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/03/2022] [Revised: 05/25/2022] [Accepted: 07/26/2022] [Indexed: 12/11/2022]
Abstract
BACKGROUND Viruses are among the shortest yet highly abundant species that harbor minimal instructions to infect cells, adapt, multiply, and exist. However, with the current substantial availability of viral genome sequences, the scientific repertory lacks a complexity landscape that automatically illuminates viral genomes' organization, relation, and fundamental characteristics. RESULTS This work provides a comprehensive landscape of the viral genome's complexity (or quantity of information), identifying the most redundant and complex groups regarding their genome sequence while providing their distribution and characteristics at a large and local scale. Moreover, we identify and quantify inverted repeats abundance in viral genomes. For this purpose, we measure the sequence complexity of each available viral genome using data compression, demonstrating that adequate data compressors can efficiently quantify the complexity of viral genome sequences, including subsequences better represented by algorithmic sources (e.g., inverted repeats). Using a state-of-the-art genomic compressor on an extensive viral genomes database, we show that double-stranded DNA viruses are, on average, the most redundant viruses while single-stranded DNA viruses are the least. In contrast, double-stranded RNA viruses show a lower redundancy relative to single-stranded RNA. Furthermore, we extend the ability of data compressors to quantify local complexity (or information content) in viral genomes using complexity profiles, providing, for the first time, a direct complexity analysis of human herpesviruses. We also conceive a features-based classification methodology that can accurately distinguish viral genomes at different taxonomic levels without direct comparisons between sequences. This methodology combines data compression with simple measures such as GC-content percentage and sequence length, followed by machine learning classifiers.
CONCLUSIONS This article presents methodologies and findings that are highly relevant for understanding the patterns of similarity and singularity between viral groups, opening new frontiers for studying viral genomes' organization while depicting the complexity trends and classification components of these genomes at different taxonomic levels. The whole study is supported by an extensive website (https://asilab.github.io/canvas/) for comprehending the viral genome characterization using dynamic and interactive approaches.
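The abstract's feature set (a compression-based complexity measure, GC-content, and sequence length) can be sketched in a few lines. This is only an illustration under stated assumptions: Python's general-purpose `zlib` stands in for the specialized genomic compressor the authors mention, the function names are hypothetical, and the machine learning classifier stage is omitted.

```python
import random
import zlib

def compression_complexity(seq: str) -> float:
    """Compressed size divided by raw size in bytes; lower means more redundant.
    zlib is an assumed stand-in, not the compressor used in the paper."""
    raw = seq.encode("ascii")
    return len(zlib.compress(raw, 9)) / len(raw)

def gc_content(seq: str) -> float:
    """Fraction of G and C symbols in the sequence."""
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s)

def features(seq: str):
    """Feature tuple: (complexity, GC fraction, length)."""
    return (compression_complexity(seq), gc_content(seq), len(seq))

# Two synthetic sequences of equal length: one repetitive, one pseudo-random.
rng = random.Random(0)
repetitive = "ACGT" * 500
shuffled = "".join(rng.choice("ACGT") for _ in range(2000))
print(features(repetitive))
print(features(shuffled))
```

Because each feature is computed per sequence, no pairwise comparison between genomes is needed, which matches the "without direct comparisons between sequences" property stated in the abstract.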
Affiliation(s)
- Jorge Miguel Silva
  - Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
- Diogo Pratas
  - Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
  - Department of Electronics Telecommunications and Informatics, University of Aveiro, Campus Universitario de Santiago, 3810-193 Aveiro, Portugal
  - Department of Virology, University of Helsinki, Haartmaninkatu 3, 00014 Helsinki, Finland
- Tânia Caetano
  - Department of Biology, University of Aveiro, Campus Universitario de Santiago, 3810-193 Aveiro, Portugal
- Sérgio Matos
  - Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
  - Department of Electronics Telecommunications and Informatics, University of Aveiro, Campus Universitario de Santiago, 3810-193 Aveiro, Portugal
4
Pratas D, Toppinen M, Pyöriä L, Hedman K, Sajantila A, Perdomo MF. A hybrid pipeline for reconstruction and analysis of viral genomes at multi-organ level. Gigascience 2020; 9:giaa086. [PMID: 32815536 PMCID: PMC7439602 DOI: 10.1093/gigascience/giaa086] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 01/17/2020] [Revised: 05/25/2020] [Accepted: 07/23/2020] [Indexed: 12/21/2022]
Abstract
BACKGROUND Advances in sequencing technologies have enabled the characterization of multiple microbial and host genomes, opening new frontiers of knowledge while kindling novel applications and research perspectives. Among these is the investigation of the viral communities residing in the human body and their impact on health and disease. To this end, the study of samples from multiple tissues is critical, yet the complexity of such analysis calls for a dedicated pipeline. We provide an automatic and efficient pipeline for identification, assembly, and analysis of viral genomes that combines the DNA sequence data from multiple organs. TRACESPipe relies on cooperation among 3 modalities: compression-based prediction, sequence alignment, and de novo assembly. The pipeline is ultra-fast and provides, additionally, secure transmission and storage of sensitive data. FINDINGS TRACESPipe performed outstandingly when tested on synthetic and ex vivo datasets, identifying and reconstructing all the viral genomes, including those with high levels of single-nucleotide polymorphisms. It also detected minimal levels of genomic variation between different organs. CONCLUSIONS TRACESPipe's unique ability to simultaneously process and analyze samples from different sources enables the evaluation of within-host variability. This opens up the possibility to investigate viral tissue tropism, evolution, fitness, and disease associations. Moreover, additional features such as DNA damage estimation and mitochondrial DNA reconstruction and analysis, as well as exogenous-source controls, expand the utility of this pipeline to other fields such as forensics and ancient DNA studies. TRACESPipe is released under GPLv3 and is available for free download at https://github.com/viromelab/tracespipe.
Affiliation(s)
- Diogo Pratas
  - Department of Virology, University of Helsinki, Haartmaninkatu 3, Helsinki, 00290, Finland
  - Department of Electronics, Telecommunications and Informatics, University of Aveiro, Campus Universitario de Santiago, 3810-193 Aveiro, Portugal
  - Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Campus Universitario de Santiago, 3810-193 Aveiro, Portugal
- Mari Toppinen
  - Department of Virology, University of Helsinki, Haartmaninkatu 3, Helsinki, 00290, Finland
- Lari Pyöriä
  - Department of Virology, University of Helsinki, Haartmaninkatu 3, Helsinki, 00290, Finland
- Klaus Hedman
  - Department of Virology, University of Helsinki, Haartmaninkatu 3, Helsinki, 00290, Finland
  - HUSLAB, Helsinki University Hospital, Topeliuksenkatu 32, 00290 Helsinki, Finland
- Antti Sajantila
  - Department of Forensic Medicine, University of Helsinki, Kytösuontie 11, 00300, Helsinki, Finland
  - Forensic Medicine Unit, Finnish Institute of Health and Welfare, PO Box 30 FI-00271 Helsinki, Finland
- Maria F Perdomo
  - Department of Virology, University of Helsinki, Haartmaninkatu 3, Helsinki, 00290, Finland
5
Pratas D, Hosseini M, Grilo G, Pinho AJ, Silva RM, Caetano T, Carneiro J, Pereira F. Metagenomic Composition Analysis of an Ancient Sequenced Polar Bear Jawbone from Svalbard. Genes (Basel) 2018; 9:E445. [PMID: 30200636 PMCID: PMC6162538 DOI: 10.3390/genes9090445] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 06/29/2018] [Revised: 09/03/2018] [Accepted: 09/03/2018] [Indexed: 12/17/2022]
Abstract
The sequencing of ancient DNA samples provides a novel way to find, characterize, and distinguish exogenous genomes of endogenous targets. After sequencing, computational composition analysis enables filtering of undesired sources in the focal organism, with the purpose of improving the quality of assemblies and subsequent data analysis. More importantly, such analysis allows extinct and extant species to be identified without requiring a specific or new sequencing run. However, the identification of exogenous organisms is a complex task, given the nature and degradation of the samples, and the evident necessity of using efficient computational tools, which rely on algorithms that are both fast and highly sensitive. In this work, we relied on a fast and highly sensitive tool, FALCON-meta, which measures similarity against whole-genome reference databases, to analyse the metagenomic composition of an ancient polar bear (Ursus maritimus) jawbone fossil. The fossil was collected in Svalbard, Norway, and has an estimated age of 110,000 to 130,000 years. The FASTQ samples contained 349 GB of nonamplified shotgun sequencing data. We identified and localized, relative to the FASTQ samples, the genomes with significant similarities to reference microbial genomes, including those of viruses, bacteria, and archaea, and to fungal, mitochondrial, and plastidial sequences. Among other striking features, we found significant similarities between modern-human, some bacterial and viral sequences (contamination) and the organelle sequences of wild carrot and tomato relative to the whole samples. For each exogenous candidate, we ran a damage pattern analysis, which in addition to revealing shallow levels of damage in the plant candidates, identified the source as contamination.
Affiliation(s)
- Diogo Pratas
  - Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, 3810-193 Aveiro, Portugal.
- Morteza Hosseini
  - Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, 3810-193 Aveiro, Portugal.
  - Department of Electronics, Telecommunications and Informatics, University of Aveiro, 3810-193 Aveiro, Portugal.
- Gonçalo Grilo
  - Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, 3810-193 Aveiro, Portugal.
  - Department of Electronics, Telecommunications and Informatics, University of Aveiro, 3810-193 Aveiro, Portugal.
- Armando J Pinho
  - Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, 3810-193 Aveiro, Portugal.
  - Department of Electronics, Telecommunications and Informatics, University of Aveiro, 3810-193 Aveiro, Portugal.
- Raquel M Silva
  - Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, 3810-193 Aveiro, Portugal.
  - Department of Medical Sciences, University of Aveiro, 3810-193 Aveiro, Portugal.
  - Institute for Biomedicine, University of Aveiro, 3810-193 Aveiro, Portugal.
- Tânia Caetano
  - Department of Biology, University of Aveiro, 3810-193 Aveiro, Portugal.
  - Centre for Environmental and Marine Studies, University of Aveiro, 3810-193 Aveiro, Portugal.
- João Carneiro
  - Interdisciplinary Centre of Marine and Environmental Research, University of Porto, 4450-208 Matosinhos, Portugal.
- Filipe Pereira
  - Interdisciplinary Centre of Marine and Environmental Research, University of Porto, 4450-208 Matosinhos, Portugal.
6
7
Pratas D, Silva RM, Pinho AJ, Ferreira PJ. An alignment-free method to find and visualise rearrangements between pairs of DNA sequences. Sci Rep 2015; 5:10203. [PMID: 25984837 PMCID: PMC4434998 DOI: 10.1038/srep10203] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Received: 08/26/2014] [Accepted: 04/07/2015] [Indexed: 12/19/2022]
Abstract
Species evolution is indirectly registered in their genomic structure. The emergence and advances in sequencing technology provided a way to access genome information, namely to identify and study evolutionary macro-events, as well as chromosome alterations for clinical purposes. This paper describes a completely alignment-free computational method, based on a blind unsupervised approach, to detect large-scale and small-scale genomic rearrangements between pairs of DNA sequences. To illustrate the power and usefulness of the method we give complete chromosomal information maps for the pairs human-chimpanzee and human-orangutan. The tool by means of which these results were obtained has been made publicly available and is described in detail.
8
Pratas D, Pinho AJ, Rodrigues JMOS. XS: a FASTQ read simulator. BMC Res Notes 2014; 7:40. [PMID: 24433564 PMCID: PMC3927261 DOI: 10.1186/1756-0500-7-40] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Received: 09/11/2013] [Accepted: 12/18/2013] [Indexed: 12/31/2022]
Abstract
Background The emerging next-generation sequencing (NGS) is bringing, besides the natural huge amounts of data, an avalanche of new specialized tools (for analysis, compression, alignment, among others) and large public and private network infrastructures. Therefore, a direct necessity of specific simulation tools for testing and benchmarking is rising, such as a flexible and portable FASTQ read simulator, without the need of a reference sequence, yet correctly prepared for producing approximately the same characteristics as real data. Findings We present XS, a skilled FASTQ read simulation tool, flexible, portable (does not need a reference sequence) and tunable in terms of sequence complexity. It has several running modes, depending on the time and memory available, and is aimed at testing computing infrastructures, namely cloud computing of large-scale projects, and testing FASTQ compression algorithms. Moreover, XS offers the possibility of simulating the three main FASTQ components individually (headers, DNA sequences and quality-scores). Conclusions XS provides an efficient and convenient method for fast simulation of FASTQ files, such as those from Ion Torrent (currently uncovered by other simulators), Roche-454, Illumina and ABI-SOLiD sequencing machines. This tool is publicly available at http://bioinformatics.ua.pt/software/xs/.
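The three FASTQ components named in the abstract (headers, DNA sequences, and quality scores) map onto the standard four-line FASTQ record. The sketch below is not XS and makes no attempt to mimic real instrument profiles: it draws bases and Phred+33 quality characters uniformly at random, and the function name, header pattern, and quality range are hypothetical.

```python
import random

def simulate_fastq(n_reads: int, read_len: int, seed: int = 0) -> str:
    """Generate reference-free FASTQ text with uniformly random content.
    The header pattern and the 0-40 quality range are assumptions."""
    rng = random.Random(seed)
    records = []
    for i in range(n_reads):
        header = f"@sim_read_{i}"
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        # Phred+33 encoding: quality q is stored as the character chr(33 + q).
        quals = "".join(chr(33 + rng.randint(0, 40)) for _ in range(read_len))
        records.append(f"{header}\n{seq}\n+\n{quals}\n")
    return "".join(records)

print(simulate_fastq(2, 12), end="")
```

Uniform sampling gives maximally complex sequences; a simulator that is tunable in sequence complexity, as the abstract describes, would need a richer generative model than this.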
Affiliation(s)
- Diogo Pratas
  - Signal Processing Lab, IEETA/DETI University of Aveiro, Aveiro 3810-193, Portugal.