1. Islam R, Rahman A. An alignment-free method for detection of missing regions for phylogenetic analysis. Heliyon 2024; 10:e32227. [PMID: 38933968 PMCID: PMC11200290 DOI: 10.1016/j.heliyon.2024.e32227] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/17/2024] [Revised: 05/17/2024] [Accepted: 05/29/2024] [Indexed: 06/28/2024] Open
Abstract
Phylogenetic tree estimation using conventional approaches usually requires pairwise or multiple sequence alignment. However, sequence alignment has difficulties related to scalability and accuracy in case of long sequences such as whole genomes, low sequence identity, and in presence of genomic rearrangements. To address these issues, alignment-free approaches have been proposed. While these methods have demonstrated promising results, many of these lead to errors when regions are missing from the sequences of one or more species that are trivially detected in alignment-based methods. Here, we present an alignment-free method for detecting missing regions in sequences of species for which phylogeny is to be estimated. It is based on counts of k-mers and can be used to filter out k-mers belonging to regions in one species that are missing in one or more of the other species. We perform experiments with real and simulated datasets containing missing regions and find that it can successfully detect a large fraction of such k-mers and can lead to improvements in the estimated phylogenies. Our method can be used in k-mer based alignment-free phylogeny estimation methods to filter out k-mers corresponding to missing regions.
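As a rough illustration of how k-mers can point to regions absent from other species, the sketch below flags long runs of consecutive k-mers in one sequence that occur in none of the other sequences. The run-length heuristic is an assumption made here for illustration and is not the count-based criterion of the paper; the names `kmers`, `absent_runs` and `min_run` are hypothetical.

```python
def kmers(seq, k):
    """Return the k-mers of seq in order of their start position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def absent_runs(query, others, k, min_run):
    """Find runs of at least min_run consecutive query k-mers that are
    absent from every sequence in others.

    Returns (start, end) pairs of k-mer start positions, end exclusive.
    """
    other_kmers = set()
    for seq in others:
        other_kmers.update(kmers(seq, k))

    query_kmers = kmers(query, k)
    runs = []
    start = None  # start index of the current run of absent k-mers
    for i, kmer in enumerate(query_kmers):
        if kmer not in other_kmers:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(query_kmers) - start >= min_run:
        runs.append((start, len(query_kmers)))
    return runs


# Toy example: the query carries a 10-base insert that the other sequence lacks.
other = "A" * 10 + "C" * 10
query = "A" * 10 + "GT" * 5 + "C" * 10
print(absent_runs(query, [other], k=3, min_run=5))  # [(8, 20)]
```

A single substitution disrupts at most k consecutive k-mers, so choosing `min_run` larger than k is one simple way to separate candidate missing regions from point mutations in this sketch.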
Affiliation(s)
- Rubyeat Islam
  - Department of Computer Science and Engineering, Military Institute of Science and Technology, Dhaka, Bangladesh
- Atif Rahman
  - Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh

2. Wang T, Yu ZG, Li J. CGRWDL: alignment-free phylogeny reconstruction method for viruses based on chaos game representation weighted by dynamical language model. Front Microbiol 2024; 15:1339156. [PMID: 38572227 PMCID: PMC10987876 DOI: 10.3389/fmicb.2024.1339156] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2023] [Accepted: 02/23/2024] [Indexed: 04/05/2024] Open
Abstract
Traditional alignment-based methods meet serious challenges in genome sequence comparison and phylogeny reconstruction due to their high computational complexity. Here, we propose a new alignment-free method to analyze the phylogenetic relationships (classification) among species. In our method, the dynamical language (DL) model and the chaos game representation (CGR) method are used to characterize the frequency information and the context information of k-mers in a sequence, respectively. Then for each DNA sequence or protein sequence in a dataset, our method converts the sequence into a feature vector that represents the sequence information based on CGR weighted by the DL model to infer phylogenetic relationships. We name our method CGRWDL. Its performance was tested on both DNA and protein sequences of 8 datasets of viruses to construct the phylogenetic trees. We compared the Robinson-Foulds (RF) distance between the phylogenetic tree constructed by CGRWDL and the reference tree by other advanced methods for each dataset. The results show that the phylogenetic trees constructed by CGRWDL can accurately classify the viruses, and the RF scores between the trees and the reference trees are smaller than that with other methods.
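The Robinson-Foulds comparison mentioned above can be illustrated with a minimal sketch. It assumes trees are given as nested tuples of leaf names (a hypothetical input format chosen here, not the one used by CGRWDL) and counts the non-trivial bipartitions that appear in exactly one of the two unrooted trees.

```python
def leaves(tree):
    """Return the set of leaf names of a tree given as nested tuples."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Return the non-trivial bipartitions of the tree, viewed as unrooted.

    Each bipartition is stored as the side that does not contain a fixed
    reference leaf, so both trees describe the same split the same way.
    """
    all_leaves = leaves(tree)
    reference = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = clade if reference not in clade else all_leaves - clade
        # Skip trivial splits (a single leaf against the rest, or an empty side).
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return clade

    visit(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Unnormalized Robinson-Foulds distance between two trees on the same leaves."""
    if leaves(tree_a) != leaves(tree_b):
        raise ValueError("trees must have identical leaf sets")
    return len(splits(tree_a) ^ splits(tree_b))


reference_tree = ((("A", "B"), "C"), ("D", "E"))
estimated_tree = ((("A", "C"), "B"), ("D", "E"))
print(rf_distance(reference_tree, estimated_tree))  # 2
```

A smaller value means the estimated tree shares more bipartitions with the reference tree; identical topologies give 0.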
Affiliation(s)
- Ting Wang
  - National Center for Applied Mathematics in Hunan, Xiangtan University, Xiangtan, Hunan, China
  - Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan, China
- Zu-Guo Yu
  - National Center for Applied Mathematics in Hunan, Xiangtan University, Xiangtan, Hunan, China
  - Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan, China
- Jinyan Li
  - School of Computer Science and Control Engineering, Shenzhen Institute of Advanced Technology, Shenzhen, Guangdong, China

3. Thind AS, Sinha S. Using Chaos-Game-Representation for Analysing the SARS-CoV-2 Lineages, Newly Emerging Strains and Recombinants. Curr Genomics 2023; 24:187-195. [PMID: 38178984 PMCID: PMC10761335 DOI: 10.2174/0113892029264990231013112156] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2023] [Revised: 08/09/2023] [Accepted: 09/15/2023] [Indexed: 01/06/2024] Open
Abstract
Background: Viruses have high mutation rates, facilitating rapid evolution and the emergence of new species, subspecies, strains and recombinant forms. Accurate classification of these forms is crucial for understanding viral evolution and developing therapeutic applications. Phylogenetic classification is typically performed by analyzing molecular differences at the genomic and sub-genomic levels. This involves aligning homologous proteins or genes. However, there is growing interest in developing alignment-free methods for whole-genome comparisons that are computationally efficient. Methods: Here we elaborate on the Chaos Game Representation (CGR) method, based on concepts of statistical physics and free of sequence alignment assumptions. We adopt the CGR method for classification of the closely related clades/lineages A and B of the SARS-Corona virus 2019 (SARS-CoV-2), which is one of the fastest evolving viruses. Results: Our study shows that the CGR approach can easily yield the SARS-CoV-2 phylogeny from the available whole genomes of lineage A and lineage B sequences. It also shows an accurate classification of eight different strains and the newly evolved XBB variant from its parental strains. Compared to alignment-based methods (Neighbour-Joining and Maximum Likelihood), the CGR method requires low computational resources, is fast and accurate for long sequences, and, being a K-mer based approach, allows simultaneous comparison of a large number of closely-related sequences of different sizes. Further, we developed an R pipeline CGRphylo, available on GitHub, which integrates the CGR module with various other R packages to create phylogenetic trees and visualize them. Conclusion: Our findings demonstrate the efficacy of the CGR method for accurate classification and tracking of rapidly evolving viruses, offering valuable insights into the evolution and emergence of new SARS-CoV-2 strains and recombinants.
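For readers unfamiliar with CGR, the following minimal sketch shows the basic construction: each base moves the current point halfway towards the corner assigned to that base, and counting points on a 2^k by 2^k grid gives k-mer frequencies. The corner assignment and the function names `cgr_points` and `fcgr` are assumptions made here; this is written in Python for illustration and is not the CGRphylo R pipeline.

```python
# Assumed corner layout of the unit square; other layouts are also in use.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the CGR trajectory of a DNA sequence, starting from the centre.

    Characters other than A, C, G, T are skipped for simplicity, which means
    the k-mer correspondence is only exact for unambiguous sequences.
    """
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue
        corner_x, corner_y = CORNERS[base]
        x, y = (x + corner_x) / 2, (y + corner_y) / 2
        points.append((x, y))
    return points


def fcgr(seq, k):
    """Count CGR points on a 2^k x 2^k grid (a frequency CGR).

    After k steps, the cell containing a point is determined by the last k
    bases, so each cell count corresponds to one k-mer count.
    """
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for x, y in cgr_points(seq)[k - 1:]:
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row][col] += 1
    return grid


print(cgr_points("AC"))   # [(0.25, 0.25), (0.125, 0.625)]
print(fcgr("ACGT", 1))    # [[1, 1], [1, 1]]
```

Flattening the grid gives a fixed-length feature vector per genome, so sequences of different lengths can be compared with any vector distance.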
Affiliation(s)
- Amarinder Singh Thind
  - Department of Biological Sciences, Indian Institute of Science Education & Research, Mohali, India
  - Illawarra Shoalhaven Local Health District (ISLHD), NSW Health, Australia
- Somdatta Sinha
  - Department of Biological Sciences, Indian Institute of Science Education & Research, Mohali, India

4. Zhang W, Wang R, Zou X, Gu C, Yang Q, He M, Xiao W, He L, Zhao M, Yu Z. Comparative genomic analysis of alloherpesviruses: Exploring an available genus/species demarcation proposal and method. Virus Res 2023; 334:199163. [PMID: 37364814 PMCID: PMC10410580 DOI: 10.1016/j.virusres.2023.199163] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 03/27/2023] [Revised: 06/22/2023] [Accepted: 06/23/2023] [Indexed: 06/28/2023]
Abstract
The family Alloherpesviridae contains herpesviruses of fish and amphibians. Due to the significant economic losses to aquaculture that herpesviruses can cause, the primary areas of research interest are concerning their pathogenesis and prevention. Despite alloherpesvirus genomic sequences becoming more widely accessible, methods regarding their genus/species classification are still relatively unexplored. In the present study, the phylogenetic relationships between 40 completely sequenced alloherpesviruses were illustrated by the viral proteomic tree (ViPTree), which was divided into three monophyletic groups, namely Cyprinivirus, Ictalurivirus and Batrachovirus. Additionally, average nucleotide identity (ANI) and average amino acid identity (AAI) analyses were performed across all available sequences and clearly displayed species boundaries with the threshold value of ANI/AAI set at 90%. Subsequently, core-pan analysis uncovered 809 orthogroups and 11 core genes shared by all 40 alloherpesvirus genome sequences. For the former, a 15 percent identity depicts a clear genus boundary; for the latter, 8 of them may be qualified for phylogenetic analysis based on amino acid or nucleic acid sequences after being verified using maximum likelihood (ML) or neighbor-joining (NJ) phylogenetic trees. Finally, although the dot plot analysis was valid for the members within Ictalurivirus, it was unsuccessful for Cyprinivirus and Batrachovirus. Taken together, the comparison of individual methodologies provides a wide range of alternatives for alloherpesviruses classification under various circumstances.
Affiliation(s)
- Wenjie Zhang
  - Laboratory Animal Center, Southwest Medical University, Luzhou Sichuan, PR China
- Ran Wang
  - Laboratory Animal Center, Southwest Medical University, Luzhou Sichuan, PR China
- Xiaoxia Zou
  - Suining First People's Hospital, Suining, PR China
- Congwei Gu
  - Laboratory Animal Center, Southwest Medical University, Luzhou Sichuan, PR China; Model Animal and Human Disease Research of Luzhou Key Laboratory, PR China
- Qian Yang
  - Laboratory Animal Center, Southwest Medical University, Luzhou Sichuan, PR China; Model Animal and Human Disease Research of Luzhou Key Laboratory, PR China
- Manli He
  - Laboratory Animal Center, Southwest Medical University, Luzhou Sichuan, PR China; Model Animal and Human Disease Research of Luzhou Key Laboratory, PR China
- Wudian Xiao
  - Laboratory Animal Center, Southwest Medical University, Luzhou Sichuan, PR China; Model Animal and Human Disease Research of Luzhou Key Laboratory, PR China
- Lvqin He
  - Laboratory Animal Center, Southwest Medical University, Luzhou Sichuan, PR China; Model Animal and Human Disease Research of Luzhou Key Laboratory, PR China
- Mingde Zhao
  - Laboratory Animal Center, Southwest Medical University, Luzhou Sichuan, PR China; Model Animal and Human Disease Research of Luzhou Key Laboratory, PR China
- Zehui Yu
  - Laboratory Animal Center, Southwest Medical University, Luzhou Sichuan, PR China; Model Animal and Human Disease Research of Luzhou Key Laboratory, PR China; School of Basic Medical Sciences, Zhejiang University, Hangzhou, PR China

5. van Oers MM, Herniou EA, Jehle JA, Krell PJ, Abd-Alla AMM, Ribeiro BM, Theilmann DA, Hu Z, Harrison RL. Developments in the classification and nomenclature of arthropod-infecting large DNA viruses that contain pif genes. Arch Virol 2023; 168:182. [PMID: 37322175 PMCID: PMC10271883 DOI: 10.1007/s00705-023-05793-8] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Indexed: 06/17/2023]
Abstract
Viruses of four families of arthropod-specific, large dsDNA viruses (the nuclear arthropod large DNA viruses, or NALDVs) possess homologs of genes encoding conserved components involved in the baculovirus primary infection mechanism. The presence of such homologs encoding per os infectivity factors (pif genes), along with their absence from other viruses and the occurrence of other shared characteristics, suggests a common origin for the viruses of these families. Therefore, the class Naldaviricetes was recently established, accommodating these four families. In addition, within this class, the ICTV approved the creation of the order Lefavirales for three of these families, whose members carry homologs of the baculovirus genes that code for components of the viral RNA polymerase, which is responsible for late gene expression. We further established a system for the binomial naming of all virus species in the order Lefavirales, in accordance with a decision by the ICTV in 2019 to move towards a standardized nomenclature for all virus species. The binomial species names for members of the order Lefavirales consist of the name of the genus to which the species belongs (e.g., Alphabaculovirus), followed by a single epithet that refers to the host species from which the virus was originally isolated. The common names of viruses and the abbreviations thereof will not change, as the format of virus names lies outside the remit of the ICTV.
Affiliation(s)
- Monique M van Oers
  - Laboratory of Virology, Wageningen University and Research, Wageningen, the Netherlands
- Elisabeth A Herniou
  - Institut de Recherche sur la Biologie de l'Insecte, UMR 7261, CNRS - University of Tours, 37200, Tours, France
- Johannes A Jehle
  - Institute for Biological Control, Julius Kühn-Institut, 69221, Dossenheim, Germany
- Peter J Krell
  - Department of Molecular and Cellular Biology, University of Guelph, Guelph, N1G 2W1, Canada
- Adly M M Abd-Alla
  - Joint FAO/IAEA Programme of Nuclear Techniques in Food and Agriculture, Vienna International Centre, Vienna, Austria
- Bergmann M Ribeiro
  - Laboratory of Baculovirus, Cell Biology Department, University of Brasília, Brasília, Brazil
- David A Theilmann
  - Summerland Research and Development Centre, Agriculture and Agri-Food Canada, 4200 Highway 97, Box 5000, Summerland, BC, V0H1Z0, Canada
- Zhihong Hu
  - State Key Laboratory of Virology, Wuhan Institute of Virology, Chinese Academy of Sciences, Wuhan, 430071, P. R. China
- Robert L Harrison
  - Invasive Insect Biocontrol and Behavior Laboratory, USDA-ARS, 10300 Baltimore Avenue, Bldg 007 Barc-West, Beltsville, MD, 20705, USA

6. Perico CP, De Pierri CR, Neto GP, Fernandes DR, Pedrosa FO, de Souza EM, Raittz RT. Genomic landscape of the SARS-CoV-2 pandemic in Brazil suggests an external P.1 variant origin. Front Microbiol 2022; 13:1037455. [PMID: 36620039 PMCID: PMC9814972 DOI: 10.3389/fmicb.2022.1037455] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2022] [Accepted: 12/01/2022] [Indexed: 12/24/2022] Open
Abstract
Brazil was the epicenter of worldwide pandemics at the peak of its second wave. The genomic/proteomic perspective of the COVID-19 pandemic in Brazil could provide insights to understand the global pandemics behavior. In this study, we track SARS-CoV-2 molecular information in Brazil using real-time bioinformatics and data science strategies to provide a comparative and evolutive panorama of the lineages in the country. SWeeP vectors represented the Brazilian and worldwide genomic/proteomic data from Global Initiative on Sharing Avian Influenza Data (GISAID) between February 2020 and August 2021. Clusters were analyzed and compared with PANGO lineages. Hierarchical clustering provided phylogenetic and evolutionary analyses of the lineages, and we tracked the P.1 (Gamma) variant origin. The genomic diversity based on Chao's estimation allowed us to compare richness and coverage among Brazilian states and other representative countries. We found that epidemics in Brazil occurred in two moments with different genetic profiles. The P.1 lineages emerged in the second wave, which was more aggressive. We could not trace the origin of P.1 from the variants present in Brazil. Instead, we found evidence pointing to its external source and a possible recombinant event that may relate P.1 to a B.1.1.28 variant subset. We discussed the potential application of the pipeline for emerging variants detection and the PANGO terminology stability over time. The diversity analysis showed that the low coverage and unbalanced sequencing among states in Brazil could have allowed the silent entry and dissemination of P.1 and other dangerous variants. This study may help to understand the development and consequences of variants of concern (VOC) entry.
Affiliation(s)
- Camila P Perico
  - Laboratory of Artificial Intelligence Applied to Bioinformatics, Professional and Technological Education Sector (SEPT), Federal University of Paraná, Curitiba, Brazil
  - Graduate Program in Bioinformatics, Professional and Technological Education Sector (SEPT), Federal University of Paraná, Curitiba, Brazil
- Camilla R De Pierri
  - Laboratory of Artificial Intelligence Applied to Bioinformatics, Professional and Technological Education Sector (SEPT), Federal University of Paraná, Curitiba, Brazil
  - Department of Biochemistry and Molecular Biology, Federal University of Paraná, Curitiba, Brazil
- Giuseppe Pasqualato Neto
  - Laboratory of Artificial Intelligence Applied to Bioinformatics, Professional and Technological Education Sector (SEPT), Federal University of Paraná, Curitiba, Brazil
- Danrley R Fernandes
  - Laboratory of Artificial Intelligence Applied to Bioinformatics, Professional and Technological Education Sector (SEPT), Federal University of Paraná, Curitiba, Brazil
  - Graduate Program in Bioinformatics, Professional and Technological Education Sector (SEPT), Federal University of Paraná, Curitiba, Brazil
- Fabio O Pedrosa
  - Graduate Program in Bioinformatics, Professional and Technological Education Sector (SEPT), Federal University of Paraná, Curitiba, Brazil
  - Department of Biochemistry and Molecular Biology, Federal University of Paraná, Curitiba, Brazil
- Emanuel M de Souza
  - Graduate Program in Bioinformatics, Professional and Technological Education Sector (SEPT), Federal University of Paraná, Curitiba, Brazil
  - Department of Biochemistry and Molecular Biology, Federal University of Paraná, Curitiba, Brazil
- Roberto T Raittz
  - Laboratory of Artificial Intelligence Applied to Bioinformatics, Professional and Technological Education Sector (SEPT), Federal University of Paraná, Curitiba, Brazil
  - Graduate Program in Bioinformatics, Professional and Technological Education Sector (SEPT), Federal University of Paraná, Curitiba, Brazil

7. Rachtman E, Sarmashghi S, Bafna V, Mirarab S. Quantifying the uncertainty of assembly-free genome-wide distance estimates and phylogenetic relationships using subsampling. Cell Syst 2022; 13:817-829.e3. [PMID: 36265468 PMCID: PMC9589918 DOI: 10.1016/j.cels.2022.06.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/24/2021] [Revised: 03/14/2022] [Accepted: 06/28/2022] [Indexed: 01/26/2023]
Abstract
Computing distance between two genomes without alignments or even access to assemblies has many downstream analyses. However, alignment-free methods, including in the fast-growing field of genome skimming, are hampered by a significant methodological gap. While accurate methods (many k-mer-based) for assembly-free distance calculation exist, measuring the uncertainty of estimated distances has not been sufficiently studied. In this paper, we show that bootstrapping, the standard non-parametric method of measuring estimator uncertainty, is not accurate for k-mer-based methods that rely on k-mer frequency profiles. Instead, we propose using subsampling (with no replacement) in combination with a correction step to reduce the variance of the inferred distribution. We show that the distribution of distances using our procedure matches the true uncertainty of the estimator. The resulting phylogenetic support values effectively differentiate between correct and incorrect branches and identify controversial branches that change across alignment-free and alignment-based phylogenies reported in the literature.
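The subsampling idea can be sketched as follows: draw k-mers without replacement from each genome several times and look at the spread of the resulting distances. This is a simplified illustration under assumptions made here (independent subsampling of each genome's distinct k-mers, Jaccard distance as the estimator, hypothetical function names); the raw subsampled distances are biased relative to the full-data estimate, and the correction step described in the abstract is not reproduced.

```python
import random


def kmer_set(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b):
    """Return 1 minus the Jaccard index of two sets (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def subsampled_distances(seq_a, seq_b, k, fraction, replicates, seed=0):
    """Distances recomputed on random subsamples drawn without replacement.

    fraction is the share of distinct k-mers kept from each sequence in
    every replicate; seed makes the draws reproducible.
    """
    rng = random.Random(seed)
    # random.sample needs a sequence, so the sets are sorted into lists.
    pool_a = sorted(kmer_set(seq_a, k))
    pool_b = sorted(kmer_set(seq_b, k))
    if not pool_a or not pool_b:
        raise ValueError("both sequences must be at least k bases long")
    n_a = max(1, round(fraction * len(pool_a)))
    n_b = max(1, round(fraction * len(pool_b)))

    distances = []
    for _ in range(replicates):
        sub_a = set(rng.sample(pool_a, n_a))
        sub_b = set(rng.sample(pool_b, n_b))
        distances.append(jaccard_distance(sub_a, sub_b))
    return distances
```

Unlike bootstrapping, each replicate here contains every k-mer at most once, which is the property the abstract contrasts with resampling with replacement.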
Affiliation(s)
- Eleonora Rachtman
  - Bioinformatics and Systems Biology Graduate Program, UC San Diego, San Diego, CA 92093, USA
- Shahab Sarmashghi
  - Department of Electrical and Computer Engineering, UC San Diego, San Diego, CA 92093, USA
- Vineet Bafna
  - Department of Computer Science and Engineering, UC San Diego, San Diego, CA 92093, USA
- Siavash Mirarab
  - Department of Electrical and Computer Engineering, UC San Diego, San Diego, CA 92093, USA

8. Balaban M, Bristy NA, Faisal A, Bayzid MS, Mirarab S. Genome-wide alignment-free phylogenetic distance estimation under a no strand-bias model. Bioinformatics Advances 2022; 2:vbac055. [PMID: 35992043 PMCID: PMC9383262 DOI: 10.1093/bioadv/vbac055] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/21/2022] [Accepted: 08/09/2022] [Indexed: 01/27/2023]
Abstract
While alignment has been the dominant approach for determining homology prior to phylogenetic inference, alignment-free methods can simplify the analysis, especially when analyzing genome-wide data. Furthermore, alignment-free methods present the only option for emerging forms of data, such as genome skims, which do not permit assembly. Despite the appeal, alignment-free methods have not been competitive with alignment-based methods in terms of accuracy. One limitation of alignment-free methods is their reliance on simplified models of sequence evolution such as Jukes-Cantor. If we can estimate frequencies of base substitutions in an alignment-free setting, we can compute pairwise distances under more complex models. However, since the strand of DNA sequences is unknown for many forms of genome-wide data, which arguably present the best use case for alignment-free methods, the most complex models that one can use are the so-called no strand-bias models. We show how to calculate distances under a four-parameter no strand-bias model called TK4 without relying on alignments or assemblies. The main idea is to replace letters in the input sequences and recompute Jaccard indices between k-mer sets. However, on larger genomes, we also need to compute the number of k-mer mismatches after replacement due to random chance as opposed to homology. We show in simulation that alignment-free distances can be highly accurate when genomes evolve under the assumed models and study the accuracy on assembled and unassembled biological data. Availability and implementation: Our software is available open source at https://github.com/nishatbristy007/NSB. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
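The step of replacing letters and recomputing Jaccard indices can be illustrated with a short sketch. The recoding map used in the example is hypothetical and chosen only to show the mechanics; it is not the replacement scheme of the TK4 method, and the function names are assumptions made here.

```python
def kmer_set(seq, k, recode=None):
    """Return the set of k-mers of seq, optionally after replacing letters.

    recode maps single letters to replacement letters; letters that are not
    in the map are kept unchanged.
    """
    if recode:
        seq = "".join(recode.get(base, base) for base in seq)
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard index |a & b| / |a | b|, defined as 1.0 for two empty sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


seq_1 = "ACGT"
seq_2 = "ACGA"

# Jaccard index on the original alphabet.
print(jaccard(kmer_set(seq_1, 2), kmer_set(seq_2, 2)))  # 0.5

# Jaccard index after a hypothetical recoding that merges T into A.
merge_t_into_a = {"T": "A"}
print(jaccard(kmer_set(seq_1, 2, merge_t_into_a),
              kmer_set(seq_2, 2, merge_t_into_a)))  # 1.0
```

Comparing the index before and after a recoding shows how much of the k-mer mismatch is attributable to the merged letters, which is the kind of signal a substitution-frequency estimate could draw on.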
Affiliation(s)
- Ahnaf Faisal
  - Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh
- Md Shamsuzzoha Bayzid
  - Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh

9. He L, Sun S, Zhang Q, Bao X, Li PK. Alignment-free sequence comparison for virus genomes based on location correlation coefficient. Infection, Genetics and Evolution: Journal of Molecular Epidemiology and Evolutionary Genetics in Infectious Diseases 2021; 96:105106. [PMID: 34626822 PMCID: PMC8493760 DOI: 10.1016/j.meegid.2021.105106] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 06/20/2021] [Revised: 09/08/2021] [Accepted: 10/03/2021] [Indexed: 12/18/2022]
Abstract
Coronaviruses (especially SARS-CoV-2) are characterized by rapid mutation and wide spread. As these characteristics easily lead to global pandemics, studying the evolutionary relationship between viruses is essential for clinical diagnosis. DNA sequencing has played an important role in evolutionary analysis. Recent alignment-free methods can overcome the problems of traditional alignment-based methods, which consume both time and space. This paper proposes a novel alignment-free method called the correlation coefficient feature vector (CCFV), which defines a correlation measure of the L-step delay of a nucleotide location from its location in the original DNA sequence. The numerical feature is a 16×L-dimensional numerical vector describing the distribution characteristics of the nucleotide positions in a DNA sequence. The proposed L-step delay correlation measure is interestingly related to some types of L+1 spaced mers. Unlike traditional gene comparison, our method avoids the computational complexity of multiple sequence alignment, and hence improves the speed of sequence comparison. Our method is applied to evolutionary analysis of the common human viruses including SARS-CoV-2, Dengue virus, Hepatitis B virus, and human rhinovirus and achieves the same or even better results than alignment-based methods. Especially for SARS-CoV-2, our method also confirms that bats are potential intermediate hosts of SARS-CoV-2.
Affiliation(s)
- Lily He
  - School of Science, Beijing University of Civil Engineering and Architecture, Beijing 102616, PR China
- Siyang Sun
  - The High School Affiliated to Renmin University of China, Beijing 100080, PR China
- Qianyue Zhang
  - The High School Affiliated to Renmin University of China, Beijing 100080, PR China
- Xiaona Bao
  - School of Science, Beijing University of Civil Engineering and Architecture, Beijing 102616, PR China
- Peter K Li
  - School of Life Sciences, Tsinghua University, Beijing 100084, PR China

10. Saleh M, Sellyei B, Kovács G, Székely C. Viruses Infecting the European Catfish (Silurus glanis). Viruses 2021; 13:1865. [PMID: 34578446 PMCID: PMC8473376 DOI: 10.3390/v13091865] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 08/16/2021] [Revised: 09/13/2021] [Accepted: 09/15/2021] [Indexed: 12/23/2022] Open
Abstract
In aquaculture, disease management and pathogen control are key for a successful fish farming industry. In past years, European catfish farming has been flourishing. However, devastating fish pathogens, including limiting fish viruses, are considered a big threat to further expansion of the industry. Even though mainly the ranavirus (Iridoviridae) and circovirus (Circoviridae) infections are considered well-described in European catfish, other agents including herpes-, rhabdo- or papillomaviruses are also observed in the tissues of catfish with or without any symptoms. The etiological role of these viruses has been unclear until now. Hence, there is a requisite for more detailed information about the latter and the development of preventive and therapeutic approaches to complete them. In this review, we summarize recent knowledge about viruses that affect the European catfish and describe their origin, distribution, molecular characterisation, and phylogenetic classification. We also highlight the knowledge gaps, which need more in-depth investigations in the future.
Affiliation(s)
- Mona Saleh
  - Clinical Division of Fish Medicine, University of Veterinary Medicine, 1220 Vienna, Austria
- Boglárka Sellyei
  - Fish Pathology and Parasitology Research Team, Veterinary Medical Research Institute, Hungária krt. 21., 1143 Budapest, Hungary
- Gyula Kovács
  - Research Institute for Fisheries and Aquaculture (HAKI), Hungarian University of Agriculture and Life Sciences, Anna-liget utca 35., 5540 Szarvas, Hungary
- Csaba Székely
  - Fish Pathology and Parasitology Research Team, Veterinary Medical Research Institute, Hungária krt. 21., 1143 Budapest, Hungary

11. Mahapatra A, Mukherjee J. Taxonomy classification using genomic footprint of mitochondrial sequences. Comb Chem High Throughput Screen 2021; 25:401-413. [PMID: 34382517 DOI: 10.2174/1386207324666210811102109] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/31/2020] [Revised: 07/07/2021] [Accepted: 07/12/2021] [Indexed: 11/22/2022]
Abstract
Background: Advancement in sequencing technology yields a huge number of genomes of a multitude of organisms on our planet. One of the fundamental tasks for processing and analyzing these sequences is to organize them in the existing taxonomic orders. Method: Recently we proposed a novel approach, GenFooT, of taxonomy classification using the concept of genomic footprint (GFP). The technique is further refined and enhanced in this work, leading to improved accuracies in the task of taxonomic classification on various benchmark datasets. GenFooT maps a genome sequence in a 2D coordinate space and extracts features from that representation. It uses two hyper-parameters, namely block size and number of fragments of genomic sequence, while computing the feature. In this work, we propose an analysis for choosing values of those parameters adaptively from the sequences. The enhanced version of GenFooT is named GenFooT2. Results and Conclusion: We have tested GenFooT2 on ten different biological datasets of genomic sequences of various organisms belonging to different taxonomy ranks. Our experimental results indicate more than 3% improved classification performance of the proposed features with a logistic regression classifier compared with GenFooT. We also performed a statistical test to compare the performance of GenFooT2 with the state-of-the-art methods, including our previous method GenFooT. The experimental results as well as the statistical test show that the performance of the proposed GenFooT2 is significantly better.
Affiliation(s)
- Aritra Mahapatra
  - Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, India
- Jayanta Mukherjee
  - Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, India

12. Pornputtapong N, Acheampong DA, Patumcharoenpol P, Jenjaroenpun P, Wongsurawat T, Jun SR, Yongkiettrakul S, Chokesajjawatee N, Nookaew I. KITSUNE: A Tool for Identifying Empirically Optimal K-mer Length for Alignment-Free Phylogenomic Analysis. Front Bioeng Biotechnol 2020; 8:556413. [PMID: 33072720 PMCID: PMC7538862 DOI: 10.3389/fbioe.2020.556413] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 04/28/2020] [Accepted: 08/24/2020] [Indexed: 12/22/2022] Open
Abstract
Genomic DNA is the best “unique identifier” for organisms. Alignment-free phylogenomic analysis, a simple, fast, and efficient method to compare genome sequences, relies on looking at the distribution of small DNA sequences of a particular length, referred to as k-mers. The k-mer approach has been explored as a basis for sequence analysis applications, including assembly, phylogenetic tree inference, and classification. Although this approach is not novel, selecting the appropriate k-mer length to obtain the optimal resolution is rather arbitrary. However, it is a very important parameter for achieving the appropriate resolution for genome/sequence distances to infer biologically meaningful phylogenetic relationships. Thus, there is a need for a systematic approach to identify the appropriate k-mer length from whole-genome sequences. We present K-mer–length Iterative Selection for UNbiased Ecophylogenomics (KITSUNE), a tool for assessing the empirically optimal k-mer length of any given set of genomes of interest for phylogenomic analysis via a three-step approach based on (1) cumulative relative entropy (CRE), (2) average number of common features (ACF), and (3) observed common features (OCF). Using KITSUNE, we demonstrated the feasibility and reliability of these measurements to obtain empirically optimal k-mer lengths of 11, 17, and ∼34 from large genome datasets of viruses, bacteria, and fungi, respectively. Moreover, we demonstrated a feature of KITSUNE for accurate species identification for the two de novo assembled bacterial genomes derived from error-prone long-read sequences, and for a published yeast genome. In addition, KITSUNE was used to identify the shortest species-specific k-mer accurately identifying viruses. KITSUNE is freely available at https://github.com/natapol/kitsune.
Affiliation(s)
- Natapol Pornputtapong
- Department of Biochemistry and Microbiology, Faculty of Pharmaceutical Sciences, and Research Unit of DNA Barcoding of Thai Medicinal Plants, Chulalongkorn University, Bangkok, Thailand
| | - Daniel A Acheampong
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States.,Joint Graduate Program in Bioinformatics, University of Arkansas at Little Rock and University of Arkansas for Medical Sciences, Little Rock, AR, United States
| | - Preecha Patumcharoenpol
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States
| | - Piroon Jenjaroenpun
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States
| | - Thidathip Wongsurawat
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States
| | - Se-Ran Jun
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States
| | - Suganya Yongkiettrakul
- National Center for Genetic Engineering and Biotechnology, National Science and Technology Development Agency, Pathum Thani, Thailand
| | - Nipa Chokesajjawatee
- National Center for Genetic Engineering and Biotechnology, National Science and Technology Development Agency, Pathum Thani, Thailand
| | - Intawat Nookaew
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States
|
13
|
Wang W, Ren J, Tang K, Dart E, Ignacio-Espinoza JC, Fuhrman JA, Braun J, Sun F, Ahlgren NA. A network-based integrated framework for predicting virus-prokaryote interactions. NAR Genom Bioinform 2020; 2:lqaa044. [PMID: 32626849 PMCID: PMC7324143 DOI: 10.1093/nargab/lqaa044] [Citation(s) in RCA: 50] [Impact Index Per Article: 12.5] [Received: 08/27/2019] [Revised: 03/12/2020] [Accepted: 06/05/2020] [Indexed: 12/12/2022] Open
Abstract
Metagenomic sequencing has greatly enhanced the discovery of viral genomic sequences; however, it remains challenging to identify the host(s) of these new viruses. We developed VirHostMatcher-Net, a flexible, network-based, Markov random field framework for predicting virus–prokaryote interactions using multiple, integrated features: CRISPR sequences and alignment-free similarity measures ($s_2^*$ and WIsH). Evaluation of this method on a benchmark set of 1462 known virus–prokaryote pairs yielded host prediction accuracy of 59% and 86% at the genus and phylum levels, representing 16–27% and 6–10% improvement, respectively, over previous single-feature prediction approaches. We applied our host prediction tool to crAssphage, a human gut phage, and two metagenomic virus datasets: marine viruses and viral contigs recovered from globally distributed, diverse habitats. Host predictions were frequently consistent with those of previous studies, but more importantly, this new tool made many more confident predictions than previous tools, up to nearly 3-fold more (n > 27 000), greatly expanding the diversity of known virus–host interactions.
Affiliation(s)
- Weili Wang
- Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA 90089, USA
| | - Jie Ren
- Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA 90089, USA
| | - Kujin Tang
- Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA 90089, USA
| | - Emily Dart
- Biology Department, Clark University, Worcester, MA 01610, USA
| | | | - Jed A Fuhrman
- Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
| | - Jonathan Braun
- Inflammatory Bowel and Immunobiology Research Institute, Cedars-Sinai Medical Center, Los Angeles, CA 90048, USA
| | - Fengzhu Sun
- Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA 90089, USA
|
14
|
Positional Correlation Natural Vector: A Novel Method for Genome Comparison. Int J Mol Sci 2020; 21:ijms21113859. [PMID: 32485813 PMCID: PMC7312176 DOI: 10.3390/ijms21113859] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/13/2020] [Revised: 05/17/2020] [Accepted: 05/26/2020] [Indexed: 12/17/2022] Open
Abstract
Advances in sequencing technology have made large amounts of biological data available. Evolutionary analysis of data such as DNA sequences is highly important in biological studies. As alignment methods are ineffective for analyzing large-scale data due to their inherently high costs, alignment-free methods have recently attracted attention in the field of bioinformatics. In this paper, we introduce a new positional correlation natural vector (PCNV) method that involves converting a DNA sequence into an 18-dimensional numerical feature vector. Using frequency and position correlation to represent the nucleotide distribution, it is possible to obtain a PCNV for a DNA sequence. This new numerical vector design uses six suitable features to characterize the correlation among nucleotide positions in sequences. PCNV is also very easy to compute and can be used for rapid genome comparison. To test our novel method, we performed phylogenetic analysis with several viral and bacterial genome datasets with PCNV. For comparison, an alignment-based method, Bayesian inference, and two alignment-free methods, feature frequency profile and natural vector, were performed using the same datasets. We found that the PCNV technique is fast and accurate when used for phylogenetic analysis and classification of viruses and bacteria.
|
15
|
De Pierri CR, Voyceik R, Santos de Mattos LGC, Kulik MG, Camargo JO, Repula de Oliveira AM, de Lima Nichio BT, Marchaukoski JN, da Silva Filho AC, Guizelini D, Ortega JM, Pedrosa FO, Raittz RT. SWeeP: representing large biological sequences datasets in compact vectors. Sci Rep 2020; 10:91. [PMID: 31919449 PMCID: PMC6952362 DOI: 10.1038/s41598-019-55627-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 03/21/2019] [Accepted: 12/02/2019] [Indexed: 12/25/2022] Open
Abstract
Vectoral and alignment-free approaches to biological sequence representation have been explored in bioinformatics to efficiently handle big data. Even so, most current methods involve sequence comparisons via alignment-based heuristics and fail when applied to the analysis of large data sets. Here, we present “Spaced Words Projection (SWeeP)”, a method for representing biological sequences using relatively small vectors while preserving intersequence comparability. SWeeP uses spaced-words by scanning the sequences and generating indices to create a higher-dimensional vector that is later projected onto a smaller randomly oriented orthonormal base. We constructed phylogenetic trees for all organisms with mitochondrial and bacterial protein data in the NCBI database. SWeeP quickly built complete and accurate trees for these organisms with low computational cost. We compared SWeeP to other alignment-free methods, and SWeeP was 10 to 100 times faster than the other techniques. A tool to build SWeeP vectors is available at https://sourceforge.net/projects/spacedwordsprojection/.
Affiliation(s)
- Camilla Reginatto De Pierri
- Federal University of Paraná - SEPT, Graduate Program in Bioinformatics, Curitiba, Paraná, Brazil.,Federal University of Paraná, Department of Biochemistry and Molecular Biology, Curitiba, Paraná, Brazil
| | - Ricardo Voyceik
- Federal University of Minas Gerais, Institute of Biological Sciences (ICB), Belo Horizonte, Minas Gerais, Brazil
| | | | - Mariane Gonçalves Kulik
- Federal University of Paraná - SEPT, Graduate Program in Bioinformatics, Curitiba, Paraná, Brazil
| | - Josué Oliveira Camargo
- Federal University of Paraná - SEPT, Graduate Program in Bioinformatics, Curitiba, Paraná, Brazil.,Federal University of Paraná, Department of Biochemistry and Molecular Biology, Curitiba, Paraná, Brazil
| | - Aryel Marlus Repula de Oliveira
- Federal University of Paraná - SEPT, Graduate Program in Bioinformatics, Curitiba, Paraná, Brazil.,Federal University of Paraná, Department of Genetics, Curitiba, Paraná, Brazil
| | - Bruno Thiago de Lima Nichio
- Federal University of Paraná - SEPT, Graduate Program in Bioinformatics, Curitiba, Paraná, Brazil.,Federal University of Paraná, Department of Biochemistry and Molecular Biology, Curitiba, Paraná, Brazil
| | | | - Antonio Camilo da Silva Filho
- Federal University of Paraná - SEPT, Graduate Program in Bioinformatics, Curitiba, Paraná, Brazil.,Federal University of Paraná, Department of Pharmaceutical Sciences, Curitiba, Paraná, Brazil
| | - Dieval Guizelini
- Federal University of Paraná - SEPT, Graduate Program in Bioinformatics, Curitiba, Paraná, Brazil
| | - J Miguel Ortega
- Federal University of Minas Gerais, Institute of Biological Sciences (ICB), Belo Horizonte, Minas Gerais, Brazil
| | - Fabio O Pedrosa
- Federal University of Paraná - SEPT, Graduate Program in Bioinformatics, Curitiba, Paraná, Brazil.,Federal University of Paraná, Department of Biochemistry and Molecular Biology, Curitiba, Paraná, Brazil
| | - Roberto Tadeu Raittz
- Federal University of Paraná - SEPT, Graduate Program in Bioinformatics, Curitiba, Paraná, Brazil.,Federal University of Minas Gerais, Institute of Biological Sciences (ICB), Belo Horizonte, Minas Gerais, Brazil.,Federal University of Paraná, Department of Genetics, Curitiba, Paraná, Brazil.
|
16
|
Maabar M, Davison AJ, Vučak M, Thorburn F, Murcia PR, Gunson R, Palmarini M, Hughes J. DisCVR: Rapid viral diagnosis from high-throughput sequencing data. Virus Evol 2019; 5:vez033. [PMID: 31528358 PMCID: PMC6735924 DOI: 10.1093/ve/vez033] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 11/18/2022] Open
Abstract
High-throughput sequencing (HTS) enables most pathogens in a clinical sample to be detected from a single analysis, thereby providing novel opportunities for diagnosis, surveillance, and epidemiology. However, this powerful technology is difficult to apply in diagnostic laboratories because of its computational and bioinformatic demands. We have developed DisCVR, which detects known human viruses in clinical samples by matching sample k-mers (twenty-two nucleotide sequences) to k-mers from taxonomically labeled viral genomes. DisCVR was validated using published HTS data for eighty-nine clinical samples from adults with upper respiratory tract infections. These samples had been tested for viruses metagenomically and also by real-time polymerase chain reaction assay, which is the standard diagnostic method. DisCVR detected human viruses with high sensitivity (79%) and specificity (100%), and was able to detect mixed infections. Moreover, it produced results comparable to those in a published metagenomic analysis of 177 blood samples from patients in Nigeria. DisCVR has been designed as a user-friendly tool for detecting human viruses from HTS data using computers with limited RAM and processing power, and includes a graphical user interface to help users interpret and validate the output. It is written in Java and is publicly available from http://bioinformatics.cvr.ac.uk/discvr.php.
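The core idea of matching sample k-mers against k-mers from labeled reference genomes can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions and not DisCVR's actual pipeline (which is written in Java): the names `build_index` and `classify_reads`, the toy sequences, and the rule of counting only k-mers that map to a single label are all hypothetical choices made for this sketch.

```python
from collections import Counter

K = 22  # k-mer length stated in the abstract

def kmers(seq, k=K):
    """Return the set of distinct overlapping k-mers in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_index(references, k=K):
    """Map each k-mer to the set of labels whose reference genome contains it."""
    index = {}
    for label, genome in references.items():
        for word in kmers(genome, k):
            index.setdefault(word, set()).add(label)
    return index

def classify_reads(reads, index, k=K):
    """For each read, add one hit per distinct k-mer that maps to exactly one label."""
    hits = Counter()
    for read in reads:
        for word in kmers(read, k):
            labels = index.get(word, ())
            if len(labels) == 1:  # assumption: ignore k-mers shared between labels
                hits[next(iter(labels))] += 1
    return hits

# Hypothetical toy data; k is reduced to 5 so the short strings yield k-mers.
references = {"virusA": "ACGTACGTTGCA", "virusB": "TTTTGGGGCCCCAAAA"}
reads = ["CGTACGTT", "GGGGCCCC", "ACACACAC"]
index = build_index(references, k=5)
print(classify_reads(reads, index, k=5))
```

A dictionary lookup per k-mer keeps the matching step linear in the amount of read data, which fits the abstract's emphasis on modest RAM and processing power.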
Affiliation(s)
- Maha Maabar
- MRC-University of Glasgow Centre for Virus Research, Sir Michael Stoker Building, 464 Bearsden Road, Glasgow G61 1QH, UK
| | - Andrew J Davison
- MRC-University of Glasgow Centre for Virus Research, Sir Michael Stoker Building, 464 Bearsden Road, Glasgow G61 1QH, UK
| | - Matej Vučak
- MRC-University of Glasgow Centre for Virus Research, Sir Michael Stoker Building, 464 Bearsden Road, Glasgow G61 1QH, UK
| | - Fiona Thorburn
- Microbiology Department, Glasgow Royal Infirmary, Glasgow G4 0SF, UK
| | - Pablo R Murcia
- MRC-University of Glasgow Centre for Virus Research, Sir Michael Stoker Building, 464 Bearsden Road, Glasgow G61 1QH, UK
| | - Rory Gunson
- West of Scotland Specialist Virology Centre, Glasgow Royal Infirmary, Glasgow G4 0SF, UK
| | - Massimo Palmarini
- MRC-University of Glasgow Centre for Virus Research, Sir Michael Stoker Building, 464 Bearsden Road, Glasgow G61 1QH, UK
| | - Joseph Hughes
- MRC-University of Glasgow Centre for Virus Research, Sir Michael Stoker Building, 464 Bearsden Road, Glasgow G61 1QH, UK
|
17
|
An open-source k-mer based machine learning tool for fast and accurate subtyping of HIV-1 genomes. PLoS One 2018; 13:e0206409. [PMID: 30427878 PMCID: PMC6235296 DOI: 10.1371/journal.pone.0206409] [Citation(s) in RCA: 36] [Impact Index Per Article: 6.0] [Received: 07/24/2018] [Accepted: 10/14/2018] [Indexed: 01/11/2023] Open
Abstract
For many disease-causing virus species, global diversity is clustered into a taxonomy of subtypes with clinical significance. In particular, the classification of infections among the subtypes of human immunodeficiency virus type 1 (HIV-1) is a routine component of clinical management, and there are now many classification algorithms available for this purpose. Although several of these algorithms are similar in accuracy and speed, the majority are proprietary and require laboratories to transmit HIV-1 sequence data over the network to remote servers. This potentially exposes sensitive patient data to unauthorized access, and makes it impossible to determine how classifications are made and to maintain the data provenance of clinical bioinformatic workflows. We propose an open-source supervised and alignment-free subtyping method (Kameris) that operates on k-mer frequencies in HIV-1 sequences. We performed a detailed study of the accuracy and performance of subtype classification in comparison to four state-of-the-art programs. Based on our testing data set of manually curated real-world HIV-1 sequences (n = 2,784), Kameris obtained an overall accuracy of 97%, which matches or exceeds all other tested software, with a processing rate of over 1,500 sequences per second. Furthermore, our fully standalone general-purpose software provides key advantages in terms of data security and privacy, transparency and reproducibility. Finally, we show that our method is readily adaptable to subtype classification of other viruses including dengue, influenza A, and hepatitis B and C virus.
|
18
|
Ren J, Bai X, Lu YY, Tang K, Wang Y, Reinert G, Sun F. Alignment-Free Sequence Analysis and Applications. Annu Rev Biomed Data Sci 2018; 1:93-114. [PMID: 31828235 PMCID: PMC6905628 DOI: 10.1146/annurev-biodatasci-080917-013431] [Citation(s) in RCA: 58] [Impact Index Per Article: 9.7] [Indexed: 01/06/2023]
Abstract
Genome and metagenome comparisons based on large amounts of next generation sequencing (NGS) data pose significant challenges for alignment-based approaches due to the huge data size and the relatively short length of the reads. Alignment-free approaches based on the counts of word patterns in NGS data do not depend on the complete genome and are generally computationally efficient. Thus, they contribute significantly to genome and metagenome comparison. Recently, novel statistical approaches have been developed for the comparison of both long and shotgun sequences. These approaches have been applied to many problems including the comparison of gene regulatory regions, genome sequences, metagenomes, binning contigs in metagenomic data, identification of virus-host interactions, and detection of horizontal gene transfers. We provide an updated review of these applications and other related developments of word-count based approaches for alignment-free sequence analysis.
Affiliation(s)
- Jie Ren
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
| | - Xin Bai
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China
| | - Yang Young Lu
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
| | - Kujin Tang
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
| | - Ying Wang
- Department of Automation, Xiamen University, Xiamen, Fujian, China
| | - Gesine Reinert
- Department of Statistics, University of Oxford, Oxford, United Kingdom
| | - Fengzhu Sun
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China
|
19
|
Nishimura Y, Yoshida T, Kuronishi M, Uehara H, Ogata H, Goto S. ViPTree: the viral proteomic tree server. Bioinformatics 2018; 33:2379-2380. [PMID: 28379287 DOI: 10.1093/bioinformatics/btx157] [Citation(s) in RCA: 418] [Impact Index Per Article: 69.7] [Received: 12/15/2016] [Accepted: 03/21/2017] [Indexed: 11/12/2022] Open
Abstract
Summary: ViPTree is a web server provided through GenomeNet to generate viral proteomic trees for classification of viruses based on genome-wide similarities. Users can upload viral genomes sequenced either by genomics or metagenomics. ViPTree generates proteomic trees for the uploaded genomes together with flexibly selected reference viral genomes. ViPTree also serves as a platform to visually investigate genomic alignments and automatically annotated gene functions for the uploaded viral genomes, thus providing virus researchers the first choice for classifying and understanding newly sequenced viral genomes. Availability and Implementation: ViPTree is freely available at http://www.genome.jp/viptree. Contact: goto@kuicr.kyoto-u.ac.jp. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yosuke Nishimura
- Institute for Chemical Research, Kyoto University, Uji, Kyoto 611-0011, Japan.,Graduate School of Agriculture, Kyoto University, Kitashirakawa-Oiwake, Sakyo-ku, Kyoto 606-8502, Japan
| | - Takashi Yoshida
- Graduate School of Agriculture, Kyoto University, Kitashirakawa-Oiwake, Sakyo-ku, Kyoto 606-8502, Japan
| | - Megumi Kuronishi
- Institute for Chemical Research, Kyoto University, Uji, Kyoto 611-0011, Japan
| | - Hideya Uehara
- SGI Japan, Ltd., Yebisu Garden Place Tower 31F, 4-20-3 Ebisu, Shibuya-ku, Tokyo 150-6031, Japan
| | - Hiroyuki Ogata
- Institute for Chemical Research, Kyoto University, Uji, Kyoto 611-0011, Japan
| | - Susumu Goto
- Institute for Chemical Research, Kyoto University, Uji, Kyoto 611-0011, Japan
|
20
|
Chinchar V, Waltzek TB, Subramaniam K. Ranaviruses and other members of the family Iridoviridae: Their place in the virosphere. Virology 2017. [DOI: 10.1016/j.virol.2017.06.007] [Citation(s) in RCA: 31] [Impact Index Per Article: 4.4] [Indexed: 10/19/2022]
|
21
|
Abstract
With the sharp increase in biological sequence data, traditional sequence alignment methods have become unsuitable and infeasible. This has motivated a surge of fast alignment-free techniques for sequence analysis. Among these methods, many kinds of feature vector methods have been established and applied to the reconstruction of species phylogeny. The vectors basically consist of typical numerical features for certain biological problems. The features may come from the primary sequences or from the secondary or three-dimensional structures of macromolecules. In this study, we propose a novel numerical vector based only on the primary sequences of organisms to build their phylogeny. Three chemical and physical properties of primary sequences: purine, pyrimidine and keto are also incorporated into the vector. Using each property, we convert the nucleotide sequence into a new sequence consisting of only two kinds of letters. Therefore, three sequences are constructed according to the three properties. For each letter of each sequence we calculate the number of occurrences of the letter, the average position of the letter, and the variation of the position of the letter in the sequence. Tested on several datasets related to mammals, viruses and bacteria, this new tool is fast and accurate for inferring the phylogeny of organisms.
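The three per-letter quantities described above (count, average position, and variation of position on a two-letter recoding) can be sketched as follows. This is a hedged illustration, not the authors' code: the function names, the use of plain population variance for "variation", 1-based positions, and the two example groupings (purines A/G and keto bases G/T) are assumptions made here. The abstract names three properties, but only two groupings are shown because the exact third partition is not spelled out in it.

```python
def recode(seq, group, inside="1", outside="0"):
    """Rewrite seq over two letters: `inside` for bases in `group`, `outside` otherwise."""
    return "".join(inside if base in group else outside for base in seq.upper())

def letter_features(binary_seq, letter):
    """Return (count, mean position, variance of position) for one letter.

    Positions are 1-based; the variance is the population variance
    (an assumption, since the abstract only says 'variation').
    """
    positions = [i for i, c in enumerate(binary_seq, start=1) if c == letter]
    n = len(positions)
    if n == 0:
        return (0, 0.0, 0.0)
    mean = sum(positions) / n
    var = sum((p - mean) ** 2 for p in positions) / n
    return (n, mean, var)

def feature_vector(seq, groups):
    """Concatenate the three features for both letters of each two-letter recoding."""
    vec = []
    for group in groups:
        binary = recode(seq, group)
        for letter in "10":
            vec.extend(letter_features(binary, letter))
    return vec

PURINES = set("AG")  # the remaining bases, C and T, are pyrimidines
KETO = set("GT")     # the remaining bases, A and C, are amino bases

vec = feature_vector("ACGTGA", [PURINES, KETO])
print(len(vec), vec)
```

With three groupings instead of two, the same function would return 3 × 2 × 3 = 18 numbers per sequence, and distances between such vectors could then feed a standard tree-building method.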
|
22
|
Zhang Q, Jun SR, Leuze M, Ussery D, Nookaew I. Viral Phylogenomics Using an Alignment-Free Method: A Three-Step Approach to Determine Optimal Length of k-mer. Sci Rep 2017; 7:40712. [PMID: 28102365 PMCID: PMC5244389 DOI: 10.1038/srep40712] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Received: 07/05/2016] [Accepted: 12/08/2016] [Indexed: 11/25/2022] Open
Abstract
The development of rapid, economical genome sequencing has shed new light on the classification of viruses. As of October 2016, the National Center for Biotechnology Information (NCBI) database contained >2 million viral genome sequences and a reference set of ~4000 viral genome sequences that cover a wide range of known viral families. Whole-genome sequences can be used to improve viral classification and provide insight into the viral "tree of life". However, due to the lack of evolutionary conservation amongst diverse viruses, it is not feasible to build a viral tree of life using traditional phylogenetic methods based on conserved proteins. In this study, we used an alignment-free method that uses k-mers as genomic features for a large-scale comparison of complete viral genomes available in RefSeq. To determine the optimal feature length, k (an essential step in constructing a meaningful dendrogram), we designed a comprehensive strategy that combines three approaches: (1) cumulative relative entropy, (2) average number of common features among genomes, and (3) the Shannon diversity index. This strategy was used to determine k for all 3,905 complete viral genomes in RefSeq. The resulting dendrogram shows consistency with the viral taxonomy of the ICTV and the Baltimore classification of viruses.
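As a small worked example of one of the three criteria, the Shannon diversity index of a k-mer distribution is H = -Σ p_i ln p_i, where p_i is the relative frequency of the i-th distinct k-mer. The sketch below computes it for a range of k on a toy sequence. How the paper applies the index across genomes is not stated in the abstract, so the function name `shannon_index`, the per-sequence application, and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter

def shannon_index(seq, k):
    """Shannon diversity H = -sum(p * ln p) over the k-mer frequencies of one sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Hypothetical toy sequence; real genomes are orders of magnitude longer.
toy = "ACGTACGTTTGACCA"
for k in range(1, 6):
    print(k, round(shannon_index(toy, k), 3))
```

The index is zero when a single k-mer dominates completely and reaches ln(N) when all N observed k-mers are equally frequent, so it summarizes how evenly the feature space is used at a given k.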
Affiliation(s)
- Qian Zhang
- UT-ORNL Graduate School of Genome Science and Technology, University of Tennessee, Knoxville, TN 37996, USA
- Comparative Genomics Group, Biosciences Division, Oak Ridge National Laboratory Oak Ridge, TN 37831 USA
| | - Se-Ran Jun
- Comparative Genomics Group, Biosciences Division, Oak Ridge National Laboratory Oak Ridge, TN 37831 USA
- Department of Biomedical Informatics, College of Medicine, University of Arkansas for Medical Sciences, Little Rock, AR 72205, USA
| | - Michael Leuze
- Joint Institute for Computational Sciences, University of Tennessee, Knoxville, TN 37831, USA
- Computational Biomolecular Modeling and Bioinformatics Group, Computer Science and Mathematics Division, Oak Ridge National Laboratories, Oak Ridge, TN 37831, USA
| | - David Ussery
- Comparative Genomics Group, Biosciences Division, Oak Ridge National Laboratory Oak Ridge, TN 37831 USA
- Department of Biomedical Informatics, College of Medicine, University of Arkansas for Medical Sciences, Little Rock, AR 72205, USA
| | - Intawat Nookaew
- Comparative Genomics Group, Biosciences Division, Oak Ridge National Laboratory Oak Ridge, TN 37831 USA
- Department of Biomedical Informatics, College of Medicine, University of Arkansas for Medical Sciences, Little Rock, AR 72205, USA
|
23
|
Li Y, He L, He RL, Yau SST. Zika and Flaviviruses Phylogeny Based on the Alignment-Free Natural Vector Method. DNA Cell Biol 2016; 36:109-116. [PMID: 27977308 DOI: 10.1089/dna.2016.3532] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Indexed: 11/12/2022] Open
Abstract
Zika virus (ZIKV) is a mosquito-borne flavivirus. It was first isolated from Uganda in 1947 and has become an emergent event since 2007. However, because of the inconsistency of alignment methods, the evolution of ZIKV remains poorly understood. In this study, we first use the complete protein and an alignment-free method to build a phylogenetic tree of 87 Zika strains in which Asian, East African, and West African lineages are characterized. We also use the NS5 protein to construct the genetic relationship among 44 Zika strains. For the first time, these strains are divided into two clades: African 1 and African 2. This result suggests that ZIKV originates from Africa, then spread to Asia, Pacific islands, and throughout the Americas. We also perform the phylogeny analysis for 53 viruses in genus Flavivirus to which ZIKV belongs using complete proteins. Our conclusion is consistent with the classification by the hosts and transmission vectors.
Affiliation(s)
- Yongkun Li
- Department of Mathematical Sciences, Tsinghua University, Beijing, People's Republic of China
| | - Lily He
- Department of Mathematical Sciences, Tsinghua University, Beijing, People's Republic of China
| | - Rong Lucy He
- Department of Biological Sciences, Chicago State University, Chicago, Illinois
| | - Stephen S-T Yau
- Department of Mathematical Sciences, Tsinghua University, Beijing, People's Republic of China
|
24
|
Chen S, Deng LY, Bowman D, Shiau JJH, Wong TY, Madahian B, Lu HHS. Phylogenetic tree construction using trinucleotide usage profile (TUP). BMC Bioinformatics 2016; 17:381. [PMID: 27766939 PMCID: PMC5073869 DOI: 10.1186/s12859-016-1222-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: It has been a challenging task to build a genome-wide phylogenetic tree for a large group of species containing a large number of genes with long nucleotide sequences. The most popular method, called feature frequency profile (FFP-k), finds the frequency distribution for all words of a certain length k over the whole genome sequence using (overlapping) windows of the same length. For a satisfactory result, the recommended word length (k) ranges from 6 to 15 and it may not be a multiple of 3 (codon length). The total number of possible words needed for FFP-k can range from $4^6 = 4096$ to $4^{15}$. RESULTS: We propose a simple improvement over the popular FFP method using only a typical word length of 3. A new method, called Trinucleotide Usage Profile (TUP), is proposed based only on the (relative) frequency distribution using non-overlapping windows of length 3. The total number of possible words needed for TUP is $4^3 = 64$, which is much less than the total count for the recommended optimal "resolution" for FFP. To build a phylogenetic tree, we propose first representing each of the species by a TUP vector and then using an appropriate distance measure between pairs of the TUP vectors for the tree construction. In particular, we propose summarizing a DNA sequence by a matrix of three rows corresponding to three reading frames, recording the frequency distribution of the non-overlapping words of length 3 in each reading frame. We also provide a numerical measure for comparing trees constructed with various methods. CONCLUSIONS: Compared to the FFP method, our empirical study showed that the proposed TUP method is more capable of building phylogenetic trees with stronger biological support. We further provide some justification for this from the information theory viewpoint. Unlike the FFP method, the TUP method takes advantage of the fact that the start of the first reading frame is (usually) known. Without this information, the FFP method could only rely on the frequency distribution of overlapping words, which is the average (or mixture) of the frequency distributions of three possible reading frames. Consequently, we show (from the entropy viewpoint) that the FFP procedure could dilute important gene information and therefore provides less accurate classification.
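The three-row summary described above (one row of 64 relative frequencies per reading frame, using non-overlapping words of length 3) can be sketched in Python as follows. This is an illustrative reading of the abstract, not the authors' implementation: the names `frame_profile` and `tup_matrix`, the alphabetical ordering of the 64 words, and the choice to skip words containing non-ACGT symbols are assumptions made here.

```python
from collections import Counter
from itertools import product

# All 4**3 = 64 possible trinucleotides, in a fixed (alphabetical) order.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
CODON_SET = set(CODONS)

def frame_profile(seq, frame):
    """Relative frequencies of non-overlapping 3-letter words starting at offset `frame`."""
    words = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    counts = Counter(w for w in words if w in CODON_SET)  # skip ambiguous words
    total = sum(counts.values())
    return [counts[c] / total if total else 0.0 for c in CODONS]

def tup_matrix(seq):
    """Return a 3 x 64 matrix: one trinucleotide profile per reading frame."""
    seq = seq.upper()
    return [frame_profile(seq, frame) for frame in range(3)]

# Hypothetical toy sequence.
matrix = tup_matrix("ATGAAATGA")
for frame, row in enumerate(matrix):
    used = {c: round(f, 3) for c, f in zip(CODONS, row) if f > 0}
    print(frame, used)
```

Keeping the three frames as separate rows is what distinguishes this summary from an overlapping-window profile, which would mix the three frame-specific distributions into one.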
Affiliation(s)
- Si Chen
- Key Laboratory of Combinatorial Biosynthesis and Drug Discovery Ministry of Education and School of Pharmaceutical Sciences Wuhan University, Wuhan, China
| | - Lih-Yuan Deng
- Department of Mathematical Sciences, University of Memphis, Memphis, TN, USA
| | - Dale Bowman
- Department of Mathematical Sciences, University of Memphis, Memphis, TN, USA
| | | | - Tit-Yee Wong
- Department of Biological Sciences, University of Memphis, Memphis, TN, USA
| | - Behrouz Madahian
- Department of Mathematical Sciences, University of Memphis, Memphis, TN, USA
|
25
|
Li Y, Tian K, Yin C, He RL, Yau SST. Virus classification in 60-dimensional protein space. Mol Phylogenet Evol 2016; 99:53-62. [PMID: 26988414 DOI: 10.1016/j.ympev.2016.03.009] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Received: 06/14/2015] [Revised: 01/24/2016] [Accepted: 03/10/2016] [Indexed: 10/22/2022]
Abstract
Due to vast sequence divergence among different viral groups, sequence alignment is not directly applicable to genome-wide comparative analysis of viruses. More and more attention has been paid to alignment-free methods for whole genome comparison and phylogenetic tree reconstruction. Among alignment-free methods, the recently proposed "Natural Vector (NV) representation" has successfully been used to study the phylogeny of multi-segmented viruses based on a 12-dimensional genome space derived from the nucleotide sequence structure. But the preference of proteomes over genomes for the determination of viral phylogeny was not deeply investigated. As the translated products of genes, proteins directly form the shape of viral structure and are vital for all metabolic pathways. In this study, using the NV representation of a protein sequence along with the Hausdorff distance suitable to compare point sets, we construct a 60-dimensional protein space to analyze the evolutionary relationships of 4021 viruses by whole-proteomes in the current NCBI Reference Sequence Database (RefSeq). We also take advantage of the previously developed natural graphical representation to recover viral phylogeny. Our results demonstrate that the proposed method is efficient and accurate for classifying viruses. The accuracy rates of our predictions such as for Baltimore II viruses are as high as 95.9% for family labels, 95.7% for subfamily labels and 96.5% for genus labels. Finally, we discover that proteomes lead to better viral classification when reliable protein sequences are abundant. In other cases, the accuracy rates using proteomes are still comparable to that of genomes.
Affiliation(s)
- Yongkun Li
- Department of Mathematical Sciences, Tsinghua University, Beijing 100084, PR China
| | - Kun Tian
- Department of Mathematical Sciences, Tsinghua University, Beijing 100084, PR China
| | - Changchuan Yin
- Department of Mathematics, Statistics and Computer Science, The University of Illinois at Chicago, Chicago, IL 60607-7045, USA
| | - Rong Lucy He
- Department of Biological Sciences, Chicago State University, Chicago, IL 60628, USA
| | - Stephen S-T Yau
- Department of Mathematical Sciences, Tsinghua University, Beijing 100084, PR China.
|
26
|
Overstreet RM, Lotz JM. Host–Symbiont Relationships: Understanding the Change from Guest to Pest. Advances in Environmental Microbiology 2016. [PMCID: PMC7123458 DOI: 10.1007/978-3-319-28170-4_2] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Indexed: 02/03/2023]
|
27
|
Yang WF, Yu ZG, Anh V. Whole genome/proteome based phylogeny reconstruction for prokaryotes using higher order Markov model and chaos game representation. Mol Phylogenet Evol 2015; 96:102-111. [PMID: 26724405 DOI: 10.1016/j.ympev.2015.12.011] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 05/02/2015] [Revised: 12/17/2015] [Accepted: 12/18/2015] [Indexed: 01/18/2023]
Abstract
Traditional methods for sequence comparison and phylogeny reconstruction rely on pairwise and multiple sequence alignments. But alignment cannot be directly applied to whole genome/proteome comparison and phylogenomic studies due to its high computational complexity. Hence alignment-free methods have become popular in recent years. Here we propose a fast alignment-free method for whole genome/proteome comparison and phylogeny reconstruction using higher order Markov model and chaos game representation. In the present method, we use the transition matrices of higher order Markov models to characterize amino acid or DNA sequences for their comparison. The order of the Markov model is uniquely identified by maximizing the average Shannon entropy of conditional probability distributions. Using one-dimensional chaos game representation and linked list, this method can reduce the large memory and time consumption caused by the large-scale conditional probability distributions. To illustrate the effectiveness of our method, we employ it for fast phylogeny reconstruction based on genome/proteome sequences of two species data sets used in previously published papers. Our results demonstrate that the present method is useful and efficient. AVAILABILITY AND IMPLEMENTATION The source codes for our algorithm to get the distance matrix and genome/proteome sequences can be downloaded from ftp://121.199.20.25/. The software Phylip and EvolView we used to construct phylogenetic trees can be referred from their websites.
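The two ingredients named in this abstract, transition probabilities of an order-m Markov model and the average Shannon entropy of the conditional distributions, can be sketched as follows. This is a minimal illustration only: it stores the probabilities in plain dictionaries rather than the chaos game representation and linked list mentioned in the abstract, it averages entropies with equal weight per observed context (an assumption), and all function names are hypothetical.

```python
from collections import Counter, defaultdict
import math

def transition_probs(seq, order):
    """Estimate P(next symbol | preceding `order` symbols) from one sequence."""
    counts = defaultdict(Counter)
    for i in range(len(seq) - order):
        context = seq[i:i + order]
        counts[context][seq[i + order]] += 1
    return {
        ctx: {sym: c / sum(nxt.values()) for sym, c in nxt.items()}
        for ctx, nxt in counts.items()
    }

def average_conditional_entropy(probs):
    """Unweighted mean Shannon entropy (bits) over the observed contexts."""
    if not probs:
        return 0.0
    entropies = [
        -sum(p * math.log2(p) for p in dist.values())
        for dist in probs.values()
    ]
    return sum(entropies) / len(entropies)

# Compare candidate orders on a toy sequence.
toy = "ACGTACGGACGTTACG"
for m in (1, 2, 3):
    print(m, average_conditional_entropy(transition_probs(toy, m)))
```

With real genomes or proteomes the number of contexts grows exponentially with the order, which is the memory problem the abstract says the compact representation addresses.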
Affiliation(s)
- Wei-Feng Yang
- Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, PR China; Department of Mathematics and Physics, Hunan Institute of Engineering, Hunan 411104, PR China.
| | - Zu-Guo Yu
- Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, PR China; School of Mathematical Sciences, Queensland University of Technology, GPO Box 2434, Brisbane, QLD 4001, Australia.
| | - Vo Anh
- School of Mathematical Sciences, Queensland University of Technology, GPO Box 2434, Brisbane, QLD 4001, Australia.
|
28
|
Jun SR, Leuze MR, Nookaew I, Uberbacher EC, Land M, Zhang Q, Wanchai V, Chai J, Nielsen M, Trolle T, Lund O, Buzard GS, Pedersen TD, Wassenaar TM, Ussery DW. Ebolavirus comparative genomics. FEMS Microbiol Rev 2015; 39:764-78. [PMID: 26175035 PMCID: PMC4551310 DOI: 10.1093/femsre/fuv031] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.3] [Accepted: 06/08/2015] [Indexed: 12/17/2022] Open
Abstract
The 2014 Ebola outbreak in West Africa is the largest documented for this virus. To examine the dynamics of this genome, we compare more than 100 currently available ebolavirus genomes to each other and to other viral genomes. Based on oligomer frequency analysis, the family Filoviridae forms a distinct group from all other sequenced viral genomes. All filovirus genomes sequenced to date encode proteins with similar functions and gene order, although there is considerable divergence in sequences between the three genera Ebolavirus, Cuevavirus and Marburgvirus within the family Filoviridae. Whereas all ebolavirus genomes are quite similar (multiple sequences of the same strain are often identical), variation is most common in the intergenic regions and within specific areas of the genes encoding the glycoprotein (GP), nucleoprotein (NP) and polymerase (L). We predict regions that could contain epitope-binding sites, which might be good vaccine targets. This information, combined with glycosylation sites and experimentally determined epitopes, can identify the most promising regions for the development of therapeutic strategies. This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
Affiliation(s)
- Se-Ran Jun
- Comparative Genomics Group, Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA; Joint Institute for Computational Sciences, University of Tennessee, Knoxville, TN 37996, USA
| | - Michael R Leuze
- Computer Science and Mathematics Division, Computer Science Research Group, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
| | - Intawat Nookaew
- Comparative Genomics Group, Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
| | - Edward C Uberbacher
- Comparative Genomics Group, Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
| | - Miriam Land
- Comparative Genomics Group, Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
| | - Qian Zhang
- Comparative Genomics Group, Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA; UT-ORNL Graduate School of Genome Science and Technology, University of Tennessee, Knoxville, TN 37996, USA
| | - Visanu Wanchai
- Comparative Genomics Group, Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
| | - Juanjuan Chai
- Computer Science and Mathematics Division, Computer Science Research Group, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
| | - Morten Nielsen
- Center for Biological Sequence Analysis, Department of Systems Biology, The Technical University of Denmark, Building 208, DK-2800 Lyngby, Denmark; Instituto de Investigaciones Biotecnológicas, Universidad Nacional de San Martín, San Martín, B 1650 HMP, Buenos Aires, Argentina
| | - Thomas Trolle
- Center for Biological Sequence Analysis, Department of Systems Biology, The Technical University of Denmark, Building 208, DK-2800 Lyngby, Denmark
| | - Ole Lund
- Center for Biological Sequence Analysis, Department of Systems Biology, The Technical University of Denmark, Building 208, DK-2800 Lyngby, Denmark
| | | | - Thomas D Pedersen
- Center for Biological Sequence Analysis, Department of Systems Biology, The Technical University of Denmark, Building 208, DK-2800 Lyngby, Denmark; Assays, Cultures and Enzymes Division, Chr. Hansen A/S, Hørsholm, Denmark
| | - Trudy M Wassenaar
- Molecular Microbiology and Genomics Consultants, Tannenstr 7, D-55576 Zotzenheim, Germany
| | - David W Ussery
- Comparative Genomics Group, Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA; UT-ORNL Graduate School of Genome Science and Technology, University of Tennessee, Knoxville, TN 37996, USA; Center for Biological Sequence Analysis, Department of Systems Biology, The Technical University of Denmark, Building 208, DK-2800 Lyngby, Denmark
|
29
|
Xie XH, Yu ZG, Han GS, Yang WF, Anh V. Whole-proteome based phylogenetic tree construction with inter-amino-acid distances and the conditional geometric distribution profiles. Mol Phylogenet Evol 2015; 89:37-45. [PMID: 25882834 DOI: 10.1016/j.ympev.2015.04.008] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 09/18/2014] [Revised: 03/29/2015] [Accepted: 04/06/2015] [Indexed: 11/18/2022]
Abstract
There has been a growing interest in alignment-free methods for whole genome comparison and phylogenomic studies. In this study, we propose an alignment-free method for phylogenetic tree construction using whole-proteome sequences. Based on the inter-amino-acid distances, we first convert the whole-proteome sequences into inter-amino-acid distance vectors, which are called observed inter-amino-acid distance profiles. Then, we propose to use conditional geometric distribution profiles (the distributions of sequences where the amino acids are placed randomly and independently) as the reference distribution profiles. Last, the relative deviation between the observed and reference distribution profiles is used to define a simple metric that reflects the phylogenetic relationships between whole-proteome sequences of different organisms. We name our method inter-amino-acid distances and conditional geometric distribution profiles (IAGDP). We evaluate our method on two data sets: the benchmark dataset including 29 genomes used in previously published papers, and another one including 67 mammal genomes. Our results demonstrate that the new method is useful and efficient.
Affiliation(s)
- Xian-Hua Xie
- Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, PR China; School of Mathematics and Computer Science, Gannan Normal University, Jiangxi 341000, PR China.
| | - Zu-Guo Yu
- Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, PR China; School of Mathematical Sciences, Queensland University of Technology, GPO Box 2434, Brisbane, QLD 4001, Australia.
| | - Guo-Sheng Han
- Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, PR China.
| | - Wei-Feng Yang
- Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, PR China.
| | - Vo Anh
- School of Mathematical Sciences, Queensland University of Technology, GPO Box 2434, Brisbane, QLD 4001, Australia.
|
30
|
Yuan J, Zhu Q, Liu B. Phylogenetic and biological significance of evolutionary elements from metazoan mitochondrial genomes. PLoS One 2014; 9:e84330. [PMID: 24465405 PMCID: PMC3896360 DOI: 10.1371/journal.pone.0084330] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 03/26/2013] [Accepted: 11/14/2013] [Indexed: 12/29/2022] Open
Abstract
The evolutionary history of living species is usually inferred through the phylogenetic analysis of molecular and morphological information using various mathematical models. New challenges in phylogenetic analysis are centered mostly on the search for accurate and efficient methods to handle the huge amounts of sequence data generated from newer genome sequencing. The next major challenge is the determination of relationships between the evolution of structural elements and their functional implementation, which has been largely ignored in previous analyses. Here, we describe the discovery of structural elements in metazoan mitochondrial genomes, termed key K-strings, that can serve as a basis for phylogenetic tree construction. Although comprising only a small fraction (0.73%) of all K-strings, these key K-strings are pivotal to the tree construction because they allow for a significant reduction in the computational time required to construct phylogenetic trees, and more importantly, they significantly improve the results of phylogenetic inference. The trees constructed from the key K-strings were consistent overall with our current view of metazoan phylogeny and exhibited a more rational topology than the trees constructed by using other conventional methods. Surprisingly, the key K-strings tended to accumulate in the conserved regions of the original sequences, most likely due to strong selection pressure. Furthermore, the special structural features of the key K-strings should have some potential applications in the study of the structure-function relationships of proteins and in the determination of the evolutionary trajectory of species. The novelty and potential importance of key K-strings lead us to believe that they are essential evolutionary elements. As such, they may play important roles in the process of species evolution and their physical existence. Further studies could lead to discoveries regarding the relationship between evolution and processes of speciation.
Affiliation(s)
- Jianbo Yuan
- Center of Systematic Genomics, Xinjiang Institute of Ecology and Geography, Chinese Academy of Sciences, Urumqi, Xinjiang, China
- CAS Key Laboratory of Experimental Marine Biology, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, Shandong, China
- Graduate University of Chinese Academy of Sciences, Beijing, China
| | | | - Bin Liu
- Center of Systematic Genomics, Xinjiang Institute of Ecology and Geography, Chinese Academy of Sciences, Urumqi, Xinjiang, China
- CAS Key Laboratory of Experimental Marine Biology, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, Shandong, China
- Graduate University of Chinese Academy of Sciences, Beijing, China
|
31
|
Wang JD. Comparing virus classification using genomic materials according to different taxonomic levels. J Bioinform Comput Biol 2013; 11:1343003. [PMID: 24372032 DOI: 10.1142/s0219720013430038] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 11/18/2022]
Abstract
In this paper, three genomic materials, namely DNA sequences, protein sequences, and regions (domains), are used to compare methods of virus classification. Virus classes (categories) are divided by taxonomic level into three datasets covering 6 orders, 42 families, and 33 genera. To increase the robustness and comparability of the experimental results, only classes that contain at least 10 instances are selected, and each instance must contain at least one region name. Experimental results show that the approach using region names achieved the best accuracies, reaching 99.9%, 97.3%, and 99.0% for 6 orders, 42 families, and 33 genera, respectively. This paper not only involves exhaustive experiments that compare virus classifications using different genomic materials, but also proposes a novel approach to biological classification based on molecular biology instead of traditional morphology.
Affiliation(s)
- Jing-Doo Wang
- Department of Computer Science and Information Engineering, Asia University, No 500, Lioufeng Road Wufeng, Taichung 41354, Taiwan
|
32
|
Song K, Ren J, Reinert G, Deng M, Waterman MS, Sun F. New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing. Brief Bioinform 2013; 15:343-53. [PMID: 24064230 DOI: 10.1093/bib/bbt067] [Citation(s) in RCA: 112] [Impact Index Per Article: 10.2] [Indexed: 02/07/2023] Open
Abstract
With the development of next-generation sequencing (NGS) technologies, a large amount of short read data has been generated. Assembly of these short reads can be challenging for genomes and metagenomes without template sequences, making alignment-based genome sequence comparison difficult. In addition, sequence reads from NGS can come from different regions of various genomes and they may not be alignable. Sequence signature-based methods for genome comparison based on the frequencies of word patterns in genomes and metagenomes can potentially be useful for the analysis of short reads data from NGS. Here we review the recent development of alignment-free genome and metagenome comparison based on the frequencies of word patterns with emphasis on the dissimilarity measures between sequences, the statistical power of these measures when two sequences are related and the applications of these measures to NGS data.
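To make "dissimilarity measures based on the frequencies of word patterns" concrete, the sketch below counts k-mers (words of length k) in two sequences and computes the classical D2 statistic, the sum over words of the product of their counts. The cosine-style normalization that follows is a simple illustrative choice and is not one of the specific normalized variants discussed in the review; the function names are hypothetical.

```python
from collections import Counter
import math

def kmer_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def d2(x, y):
    """D2 statistic: sum over words of the product of the two counts."""
    return sum(c * y[w] for w, c in x.items())

def cosine_dissimilarity(x, y):
    """1 - normalised D2; 0 for identical count profiles (illustrative)."""
    norm = math.sqrt(d2(x, x) * d2(y, y))
    return 1.0 - d2(x, y) / norm if norm else 1.0

a = kmer_counts("ACGTACGT", 2)
b = kmer_counts("ACGAACGA", 2)
print(d2(a, b), cosine_dissimilarity(a, b))
```

Because only word counts are needed, the same functions can be applied to pooled short reads without assembling or aligning them, which is the setting the abstract emphasizes.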
Affiliation(s)
- Kai Song
- Molecular and Computational Biology Program, University of Southern California, 1050 Childs Way, Los Angeles, CA 90089, USA.
|
33
|
Kariithi HM, van Oers MM, Vlak JM, Vreysen MJB, Parker AG, Abd-Alla AMM. Virology, Epidemiology and Pathology of Glossina Hytrosavirus, and Its Control Prospects in Laboratory Colonies of the Tsetse Fly, Glossina pallidipes (Diptera; Glossinidae). Insects 2013; 4:287-319. [PMID: 26462422 PMCID: PMC4553466 DOI: 10.3390/insects4030287] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Received: 05/06/2013] [Revised: 06/13/2013] [Accepted: 06/13/2013] [Indexed: 01/03/2023]
Abstract
The Glossina hytrosavirus (family Hytrosaviridae) is a double-stranded DNA virus with rod-shaped, enveloped virions. Its 190 kbp genome encodes 160 putative open reading frames. The virus replicates in the nucleus, and acquires a fragile envelope in the cell cytoplasm. Glossina hytrosavirus was first isolated from hypertrophied salivary glands of the tsetse fly, Glossina pallidipes Austen (Diptera; Glossinidae) collected in Kenya in 1986. A certain proportion of laboratory G. pallidipes flies infected by Glossina hytrosavirus develop hypertrophied salivary glands and midgut epithelial cells, gonadal anomalies and distorted sex-ratios associated with reduced insemination rates, fecundity and lifespan. These symptoms are rare in wild tsetse populations. In East Africa, G. pallidipes is one of the most important vectors of African trypanosomosis, a debilitating zoonotic disease that afflicts 37 sub-Saharan African countries. There is a large arsenal of control tactics available to manage tsetse flies and the disease they transmit. The sterile insect technique (SIT) is a robust control tactic that has been shown to be effective in eradicating tsetse populations when integrated with other control tactics in an area-wide integrated approach. The SIT requires production of sterile male flies in large production facilities. To supply sufficient numbers of sterile males for the SIT component against G. pallidipes, strategies have to be developed that enable the management of the Glossina hytrosavirus in the colonies. This review provides a historic chronology of the emergence and biogeography of Glossina hytrosavirus, and includes research on the infectomics (defined here as the functional and structural genomics and proteomics) and pathobiology of the virus. Standard operating procedures for viral management in tsetse mass-rearing facilities are proposed and a future outlook is sketched.
Affiliation(s)
- Henry M Kariithi
- Laboratory of Virology, Wageningen University, Droevendaalsesteeg 1, Wageningen 6708 PB, The Netherlands.
- Insect Pest Control Laboratories, Joint FAO/IAEA Programme of Nuclear Techniques in Food and Agriculture, International Atomic Energy Agency, Wagrammer Strasse 5, P.O. Box 100, 1400 Vienna, Austria.
- Biotechnology Centre, Kenya Agricultural Research Institute, Waiyaki Way, P.O. Box 14733-00100, Nairobi, Kenya.
| | - Monique M van Oers
- Laboratory of Virology, Wageningen University, Droevendaalsesteeg 1, Wageningen 6708 PB, The Netherlands.
| | - Just M Vlak
- Laboratory of Virology, Wageningen University, Droevendaalsesteeg 1, Wageningen 6708 PB, The Netherlands.
| | - Marc J B Vreysen
- Insect Pest Control Laboratories, Joint FAO/IAEA Programme of Nuclear Techniques in Food and Agriculture, International Atomic Energy Agency, Wagrammer Strasse 5, P.O. Box 100, 1400 Vienna, Austria.
| | - Andrew G Parker
- Insect Pest Control Laboratories, Joint FAO/IAEA Programme of Nuclear Techniques in Food and Agriculture, International Atomic Energy Agency, Wagrammer Strasse 5, P.O. Box 100, 1400 Vienna, Austria.
| | - Adly M M Abd-Alla
- Insect Pest Control Laboratories, Joint FAO/IAEA Programme of Nuclear Techniques in Food and Agriculture, International Atomic Energy Agency, Wagrammer Strasse 5, P.O. Box 100, 1400 Vienna, Austria.
|
34
|
New insights into the evolution of Entomopoxvirinae from the complete genome sequences of four entomopoxviruses infecting Adoxophyes honmai, Choristoneura biennis, Choristoneura rosaceana, and Mythimna separata. J Virol 2013; 87:7992-8003. [PMID: 23678178 DOI: 10.1128/jvi.00453-13] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.2] [Indexed: 11/20/2022] Open
Abstract
Poxviruses are nucleocytoplasmic large DNA viruses encompassing two subfamilies, the Chordopoxvirinae and the Entomopoxvirinae, infecting vertebrates and insects, respectively. While chordopoxvirus genomics have been widely studied, only two entomopoxvirus (EPV) genomes have been entirely sequenced. We report the genome sequences of four EPVs of the Betaentomopoxvirus genus infecting the Lepidoptera: Adoxophyes honmai EPV (AHEV), Choristoneura biennis EPV (CBEV), Choristoneura rosaceana EPV (CREV), and Mythimna separata EPV (MySEV). The genomes are 80% AT rich, are 228 to 307 kbp long, and contain 247 to 334 open reading frames (ORFs). Most genes are homologous to those of Amsacta moorei entomopoxvirus and encode several protein families repeated in tandem in terminal regions. Some genomes also encode proteins of unknown functions with similarity to those of other insect viruses. Comparative genomic analyses highlight a high colinearity among the lepidopteran EPV genomes and little gene order conservation with other poxvirus genomes. As with previously sequenced EPVs, the genomes include a relatively conserved central region flanked by inverted terminal repeats. Protein clustering identified 104 core EPV genes. Among betaentomopoxviruses, 148 core genes were found in relatively high synteny, pointing to low genomic diversity. Whole-genome and spheroidin gene phylogenetic analyses showed that the lepidopteran EPVs group closely in a monophyletic lineage, corroborating their affiliation with the Betaentomopoxvirus genus as well as a clear division of the EPVs according to the orders of insect hosts (Lepidoptera, Coleoptera, and Orthoptera). This suggests an ancient coevolution of EPVs with their insect hosts and the need to revise the current EPV taxonomy to separate orthopteran EPVs from the lepidopteran-specific betaentomopoxviruses so as to form a new genus.
|
35
|
Abstract
Since the emergence of high-throughput genome sequencing platforms and more recently the next-generation platforms, the genome databases are growing at an astronomical rate. Tremendous efforts have been invested in recent years in understanding intriguing complexities beneath the vast ocean of genomic data. This is apparent in the spurt of computational methods for interpreting these data in the past few years. Genomic data interpretation is notoriously difficult, partly owing to the inherent heterogeneities appearing at different scales. Methods developed to interpret these data often suffer from their inability to adequately measure the underlying heterogeneities and thus lead to confounding results. Here, we present an information entropy-based approach that unravels the distinctive patterns underlying genomic data efficiently and thus is applicable in addressing a variety of biological problems. We show the robustness and consistency of the proposed methodology in addressing three different biological problems of significance—identification of alien DNAs in bacterial genomes, detection of structural variants in cancer cell lines and alignment-free genome comparison.
Affiliation(s)
- Rajeev K Azad
- Department of Biological Sciences, University of Pittsburgh, Pittsburgh, PA 15260, USA.
|
36
|
Phylogeny and evolution of Hytrosaviridae. J Invertebr Pathol 2012; 112 Suppl:S62-7. [PMID: 22841640 DOI: 10.1016/j.jip.2012.07.015] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.0] [Received: 03/01/2012] [Revised: 05/21/2012] [Accepted: 05/22/2012] [Indexed: 11/21/2022]
Abstract
The Hytrosaviridae comprises a family of dsDNA viruses with a circular genome of 120-190 kbp. They are exclusively associated with Diptera, such as the tsetse fly, the house fly and the Narcissus bulb fly. Hytrosaviruses cause a unique pathology including hypertrophy of salivary glands as well as testicular and ovarian malformation. On the other hand, these viruses share a significant number of gene homologues with other dsDNA viruses, especially baculoviruses and nudiviruses. These gene homologues include twelve so-called baculovirus core genes involved in transcription, DNA replication and the infection process. Most strikingly, the Musca domestica salivary gland hypertrophy virus (MdSGHV) encodes a homologue of a polyhedrin/granulin gene of Alpha-, Beta-, Gammabaculoviruses. Hence, it is proposed that hytrosaviruses are phylogenetically related to baculoviruses but evolved in a very close association with their dipteran hosts.
|
37
|
Zhai Z, Reinert G, Song K, Waterman MS, Luan Y, Sun F. Normal and compound Poisson approximations for pattern occurrences in NGS reads. J Comput Biol 2012; 19:839-54. [PMID: 22697250 PMCID: PMC3375642 DOI: 10.1089/cmb.2012.0029] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/12/2022] Open
Abstract
Next generation sequencing (NGS) technologies are now widely used in many biological studies. In NGS, sequence reads are randomly sampled from the genome sequence of interest. Most computational approaches for NGS data first map the reads to the genome and then analyze the data based on the mapped reads. Since many organisms have unknown genome sequences and many reads cannot be uniquely mapped to the genomes even if the genome sequences are known, alternative analytical methods are needed for the study of NGS data. Here we suggest using word patterns to analyze NGS data. Word pattern counting (the study of the probabilistic distribution of the number of occurrences of word patterns in one or multiple long sequences) has played an important role in molecular sequence analysis. However, no studies are available on the distribution of the number of occurrences of word patterns in NGS reads. In this article, we build probabilistic models for the background sequence and the sampling process of the sequence reads from the genome. Based on the models, we provide normal and compound Poisson approximations for the number of occurrences of word patterns from the sequence reads, with bounds on the approximation error. The main challenge is to consider the randomness in generating the long background sequence, as well as in the sampling of the reads using NGS. We show the accuracy of these approximations under a variety of conditions for different patterns with various characteristics. Under realistic assumptions, the compound Poisson approximation seems to outperform the normal approximation in most situations. These approximate distributions can be used to evaluate the statistical significance of the occurrence of patterns from NGS data. The theory and the computational algorithm for calculating the approximate distributions are then used to analyze ChIP-Seq data using transcription factor GABP. 
Software is available online (www-rcf.usc.edu/∼fsun/Programs/NGS_motif_power/NGS_motif_power.html). In addition, Supplementary Material can be found online (www.liebertonline.com/cmb).
Affiliation(s)
- Zhiyuan Zhai
- School of Mathematics, Shandong University, Jinan, Shandong, China
| | - Gesine Reinert
- Department of Statistics, University of Oxford, Oxford, United Kingdom
| | - Kai Song
- School of Mathematics, Peking University, Beijing, China
| | - Michael S. Waterman
- Molecular and Computational Biology, University of Southern California, Los Angeles, California
- TNLIST/Department of Automation, Tsinghua University, Beijing, China
| | - Yihui Luan
- School of Mathematics, Shandong University, Jinan, Shandong, China
| | - Fengzhu Sun
- Molecular and Computational Biology, University of Southern California, Los Angeles, California
- TNLIST/Department of Automation, Tsinghua University, Beijing, China
|
38
|
Göke J, Schulz MH, Lasserre J, Vingron M. Estimation of pairwise sequence similarity of mammalian enhancers with word neighbourhood counts. Bioinformatics 2012; 28:656-63. [PMID: 22247280 PMCID: PMC3289921 DOI: 10.1093/bioinformatics/bts028] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.3] [Indexed: 11/14/2022]
Abstract
MOTIVATION The identity of cells and tissues is to a large degree governed by transcriptional regulation. A major part is accomplished by the combinatorial binding of transcription factors at regulatory sequences, such as enhancers. Even though binding of transcription factors is sequence-specific, estimating the sequence similarity of two functionally similar enhancers is very difficult. However, a similarity measure for regulatory sequences is crucial to detect and understand functional similarities between two enhancers and will facilitate large-scale analyses like clustering, prediction and classification of genome-wide datasets. RESULTS We present the standardized alignment-free sequence similarity measure N2, a flexible framework that is defined for word neighbourhoods. We explore the usefulness of adding reverse complement words as well as words including mismatches into the neighbourhood. On simulated enhancer sequences as well as functional enhancers in mouse development, N2 is shown to outperform previous alignment-free measures. N2 is flexible, faster than competing methods and less susceptible to single sequence noise and the occurrence of repetitive sequences. Experiments on the mouse enhancers reveal that enhancers active in different tissues can be separated by pairwise comparison using N2. CONCLUSION N2 represents an improvement over previous alignment-free similarity measures without compromising speed, which makes it a good candidate for large-scale sequence comparison of regulatory sequences. AVAILABILITY The software is part of the open-source C++ library SeqAn (www.seqan.de) and a compiled version can be downloaded at http://www.seqan.de/projects/alf.html. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jonathan Göke
- Department for Computational Molecular Biology, Max Planck Institute for Molecular Genetics, 14195 Berlin, Germany.
|
39
|
Gao Y, Luo L. Genome-based phylogeny of dsDNA viruses by a novel alignment-free method. Gene 2011; 492:309-14. [PMID: 22100880 DOI: 10.1016/j.gene.2011.11.004] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Received: 06/04/2011] [Revised: 09/19/2011] [Accepted: 11/01/2011] [Indexed: 12/25/2022]
Abstract
Sequence alignment is not directly applicable to whole genome phylogeny since several events such as rearrangements make full length alignments impossible. Here, a novel alignment-free method derived from the standpoint of information theory is proposed and used to construct the whole-genome phylogeny for a population of viruses from 13 viral families comprising 218 dsDNA viruses. The method is based on information correlation (IC) and partial information correlation (PIC). We observe that (i) the IC-PIC tree segregates the population into clades, the membership of each of which is remarkably consistent with biologists' systematics, with only a few exceptions; (ii) the IC-PIC tree reveals potential evolutionary relationships among some viral families; and (iii) the IC-PIC tree predicts the taxonomic positions of certain "unclassified" viruses. Our approach provides a new way for recovering the phylogeny of viruses, and has practical applications in developing alignment-free methods for sequence classification.
Affiliation(s)
- Yang Gao
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
|
40
|
Hanson L, Dishon A, Kotler M. Herpesviruses that infect fish. Viruses 2011; 3:2160-91. [PMID: 22163339 PMCID: PMC3230846 DOI: 10.3390/v3112160] [Citation(s) in RCA: 91] [Impact Index Per Article: 7.0] [Received: 09/15/2011] [Revised: 10/15/2011] [Accepted: 10/22/2011] [Indexed: 11/25/2022] Open
Abstract
Herpesviruses are host-specific pathogens that are widespread among vertebrates. Genome sequence data demonstrate that most herpesviruses of fish and amphibians are grouped together (family Alloherpesviridae) and are distantly related to herpesviruses of reptiles, birds and mammals (family Herpesviridae). Yet, many of the biological processes of members of the order Herpesvirales are similar. Among the conserved characteristics are the virion structure, replication process, the ability to establish long-term latency and the manipulation of the host immune response. Many of the similar processes may be due to convergent evolution. This overview of identified herpesviruses of fish discusses the diseases that alloherpesviruses cause, the biology of these viruses and the host-pathogen interactions. Much of our knowledge on the biology of Alloherpesviridae is derived from research with two species: Ictalurid herpesvirus 1 (channel catfish virus) and Cyprinid herpesvirus 3 (koi herpesvirus).
Affiliation(s)
- Larry Hanson
- Department of Basic Sciences, College of Veterinary Medicine, Mississippi State University, P.O. Box 6100, Starkville, MS 39759, USA
- Arnon Dishon
- KoVax Ltd., P.O. Box 45212, Bynet Build., Har Hotzvim Inds. Pk., Jerusalem 97444, Israel; E-Mail:
- Moshe Kotler
- Department of Pathology, Hadassah Medical School, the Hebrew University, Jerusalem 91120, Israel; E-Mail:
- The Lautenberg Center for General and Tumor Immunology, Hadassah Medical School, the Hebrew University, Jerusalem 91120, Israel
|
41
|
Abstract
To understand how extant viruses interact with their hosts, we need a historical framework of their evolutionary association. Akin to retrovirus or hepadnavirus viral fossils present in eukaryotic genomes, bracoviruses are integrated in braconid wasp genomes and are transmitted by Mendelian inheritance. However, unlike viral genomic fossils, they have retained functional machineries homologous to those of large dsDNA viruses pathogenic to arthropods. Using a phylogenomic approach, we resolved the relationships between bracoviruses and their closest free relatives: baculoviruses and nudiviruses. The phylogeny showed that bracoviruses are nested within the nudivirus clade. Bracoviruses establish a bridge between the virus and animal worlds. Their inclusion in a virus phylogeny allowed us to relate free viruses to fossils. The ages of the wasps were used to calibrate the virus phylogeny. Bayesian analyses revealed that insect dsDNA viruses first evolved at ∼310 Mya in the Paleozoic Era during the Carboniferous Period with the first insects. Furthermore the virus diversification time frame during the Mesozoic Era appears linked to the diversification of insect orders; baculoviruses that infect larvae evolved at the same period as holometabolous insects. These results imply ancient coevolution by resource tracking between several insect dsDNA virus families and their hosts, dating back to 310 Mya.
|
42
|
Liu X, Wan L, Li J, Reinert G, Waterman MS, Sun F. New powerful statistics for alignment-free sequence comparison under a pattern transfer model. J Theor Biol 2011; 284:106-16. [PMID: 21723298 DOI: 10.1016/j.jtbi.2011.06.020] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.8] [Received: 02/16/2011] [Revised: 05/30/2011] [Accepted: 06/17/2011] [Indexed: 12/15/2022]
Abstract
Alignment-free sequence comparison is widely used for comparing gene regulatory regions and for identifying horizontally transferred genes. Recent studies on the power of a widely used alignment-free comparison statistic D2 and its variants D*2 and D(s)2 showed that their power approximates a limit smaller than 1 as the sequence length tends to infinity under a pattern transfer model. We develop new alignment-free statistics based on D2, D*2 and D(s)2 by comparing local sequence pairs and then summing over all the local sequence pairs of certain length. We show that the new statistics are much more powerful than the corresponding statistics and the power tends to 1 as the sequence length tends to infinity under the pattern transfer model.
Affiliation(s)
- Xuemei Liu
- School of Physics, South China University of Technology, Guangzhou, PR China
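As a rough illustration of the idea summarised in the abstract above, the sketch below computes the plain D2 statistic (the sum, over all words of length k, of the product of the word counts in the two sequences) and then sums it over all pairs of local windows. The function names, the window scheme and the absence of any centring or normalisation are assumptions made for illustration; this is not the authors' exact statistic, and the D*2 and D(s)2 variants are not shown.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count overlapping words of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(x: str, y: str, k: int) -> int:
    """Plain D2: sum over words w of (count of w in x) * (count of w in y)."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    return sum(n * cy[w] for w, n in cx.items())


def windows(seq: str, length: int) -> list[str]:
    """All overlapping local windows of the given length."""
    return [seq[i:i + length] for i in range(len(seq) - length + 1)]


def local_d2_sum(x: str, y: str, k: int, window: int) -> int:
    """Hypothetical helper: sum plain D2 over every pair of local windows."""
    return sum(d2(a, b, k) for a in windows(x, window) for b in windows(y, window))
```

For example, `d2("ACGT", "ACGA", 2)` counts the two shared 2-mers `AC` and `CG` once each. The local sum is quadratic in the number of windows, so a practical implementation would likely need incremental count updates rather than recounting each window.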
|
43
|
Nagamine B, Jones L, Tellgren-Roth C, Cavender J, Bratanich AC. A novel gammaherpesvirus isolated from a black-tailed prairie dog (Cynomys ludovicianus). Arch Virol 2011; 156:1835-40. [PMID: 21630099 DOI: 10.1007/s00705-011-1024-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 02/23/2011] [Accepted: 05/06/2011] [Indexed: 11/27/2022]
Abstract
A new gammaherpesvirus, tentatively named cynomys herpesvirus 1 (CynGHV-1), was isolated from a black-tailed prairie dog (Cynomys ludovicianus). CynGHV-1 replicated cytopathogenically to moderate titers in various cell lines. Ten kb of the CynGHV-1 genome was sequenced using degenerate PCR and genomic cloning. Sequence similarities were found to different genes from known gammaherpesviruses. Phylogenetic analysis suggested that CynGHV-1 was in fact a novel virus closely related to representatives of different genera and unclassified members of the subfamily Gammaherpesvirinae. However, CynGHV-1 could not be assigned to any particular genus and therefore remains unclassified.
Affiliation(s)
- Brandy Nagamine
- Department of Veterinary Science, University of Wyoming, Laramie, WY 82070, USA
|
44
|
Abstract
Despite recent advances in our understanding of diverse aspects of virus evolution, particularly on the epidemiological scale, revealing the ultimate origins of viruses has proven to be a more intractable problem. Herein, I review some current ideas on the evolutionary origins of viruses and assess how well these theories accord with what we know about the evolution of contemporary viruses. I note the growing evidence for the theory that viruses arose before the last universal cellular ancestor (LUCA). This ancient origin theory is supported by the presence of capsid architectures that are conserved among diverse RNA and DNA viruses and by the strongly inverse relationship between genome size and mutation rate across all replication systems, such that pre-LUCA genomes were probably both small and highly error prone and hence RNA virus-like. I also highlight the advances that are needed to come to a better understanding of virus origins, most notably the ability to accurately infer deep evolutionary history from the phylogenetic analysis of conserved protein structures.
Affiliation(s)
- Edward C Holmes
- Center for Infectious Disease Dynamics, Department of Biology, The Pennsylvania State University, Mueller Laboratory, University Park, Pennsylvania 16802, USA.
|
45
|
Wang Y, Bininda-Emonds ORP, van Oers MM, Vlak JM, Jehle JA. The genome of Oryctes rhinoceros nudivirus provides novel insight into the evolution of nuclear arthropod-specific large circular double-stranded DNA viruses. Virus Genes 2011; 42:444-56. [DOI: 10.1007/s11262-011-0589-5] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.5] [Received: 11/28/2010] [Accepted: 02/21/2011] [Indexed: 11/29/2022]
|
46
|
Hendrickson RC, Wang C, Hatcher EL, Lefkowitz EJ. Orthopoxvirus genome evolution: the role of gene loss. Viruses 2010; 2:1933-1967. [PMID: 21994715 PMCID: PMC3185746 DOI: 10.3390/v2091933] [Citation(s) in RCA: 146] [Impact Index Per Article: 10.4] [Received: 07/14/2010] [Revised: 08/25/2010] [Accepted: 09/01/2010] [Indexed: 12/26/2022] Open
Abstract
Poxviruses are highly successful pathogens, known to infect a variety of hosts. The family Poxviridae includes Variola virus, the causative agent of smallpox, which has been eradicated as a public health threat but could potentially reemerge as a bioterrorist threat. The risk scenario includes other animal poxviruses and genetically engineered manipulations of poxviruses. Studies of orthologous gene sets have established the evolutionary relationships of members within the Poxviridae family. It is not clear, however, how variations between family members arose in the past, an important issue in understanding how these viruses may vary and possibly produce future threats. Using a newly developed poxvirus-specific tool, we predicted accurate gene sets for viruses with completely sequenced genomes in the genus Orthopoxvirus. Employing sensitive sequence comparison techniques together with comparison of syntenic gene maps, we established the relationships between all viral gene sets. These techniques allowed us to unambiguously identify the gene loss/gain events that have occurred over the course of orthopoxvirus evolution. It is clear that for all existing Orthopoxvirus species, no individual species has acquired protein-coding genes unique to that species. All existing species contain genes that are all present in members of the species Cowpox virus and that cowpox virus strains contain every gene present in any other orthopoxvirus strain. These results support a theory of reductive evolution in which the reduction in size of the core gene set of a putative ancestral virus played a critical role in speciation and confining any newly emerging virus species to a particular environmental (host or tissue) niche.
Affiliation(s)
- Robert Curtis Hendrickson
- Department of Microbiology, University of Alabama at Birmingham, BBRB 276/11, 845 19th St S, Birmingham, AL 35222, USA; E-Mails: (R.C.H.); (E.L.H.)
- Chunlin Wang
- Stanford Genome Technology Center, Stanford University, 855 California Ave, Palo Alto, CA 94304, USA; E-Mail:
- Eneida L. Hatcher
- Department of Microbiology, University of Alabama at Birmingham, BBRB 276/11, 845 19th St S, Birmingham, AL 35222, USA; E-Mails: (R.C.H.); (E.L.H.)
- Elliot J. Lefkowitz
- Department of Microbiology, University of Alabama at Birmingham, BBRB 276/11, 845 19th St S, Birmingham, AL 35222, USA; E-Mails: (R.C.H.); (E.L.H.)
|
47
|
Blanc G, Duncan G, Agarkova I, Borodovsky M, Gurnon J, Kuo A, Lindquist E, Lucas S, Pangilinan J, Polle J, Salamov A, Terry A, Yamada T, Dunigan DD, Grigoriev IV, Claverie JM, Van Etten JL. The Chlorella variabilis NC64A genome reveals adaptation to photosymbiosis, coevolution with viruses, and cryptic sex. The Plant Cell 2010; 22:2943-55. [PMID: 20852019 PMCID: PMC2965543 DOI: 10.1105/tpc.110.076406] [Citation(s) in RCA: 337] [Impact Index Per Article: 24.1] [Received: 05/06/2010] [Revised: 07/15/2010] [Accepted: 09/01/2010] [Indexed: 05/18/2023]
Abstract
Chlorella variabilis NC64A, a unicellular photosynthetic green alga (Trebouxiophyceae), is an intracellular photobiont of Paramecium bursaria and a model system for studying virus/algal interactions. We sequenced its 46-Mb nuclear genome, revealing an expansion of protein families that could have participated in adaptation to symbiosis. NC64A exhibits variations in GC content across its genome that correlate with global expression level, average intron size, and codon usage bias. Although Chlorella species have been assumed to be asexual and nonmotile, the NC64A genome encodes all the known meiosis-specific proteins and a subset of proteins found in flagella. We hypothesize that Chlorella might have retained a flagella-derived structure that could be involved in sexual reproduction. Furthermore, a survey of phytohormone pathways in chlorophyte algae identified algal orthologs of Arabidopsis thaliana genes involved in hormone biosynthesis and signaling, suggesting that these functions were established prior to the evolution of land plants. We show that the ability of Chlorella to produce chitinous cell walls likely resulted from the capture of metabolic genes by horizontal gene transfer from algal viruses, prokaryotes, or fungi. Analysis of the NC64A genome substantially advances our understanding of the green lineage evolution, including the genomic interplay with viruses and symbiosis between eukaryotes.
Affiliation(s)
- Guillaume Blanc
- Centre National de la Recherche Scientifique, Laboratoire Information Génomique et Structurale UPR2589, Aix-Marseille Université, Institut de Microbiologie de la Méditerranée, 13009 Marseille, France.
|
48
|
Eaton HE, Ring BA, Brunetti CR. The genomic diversity and phylogenetic relationship in the family Iridoviridae. Viruses 2010; 2:1458-1475. [PMID: 21994690 PMCID: PMC3185713 DOI: 10.3390/v2071458] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.4] [Received: 06/08/2010] [Revised: 07/12/2010] [Accepted: 07/13/2010] [Indexed: 01/13/2023] Open
Abstract
The Iridoviridae family are large viruses (∼120–200 nm) that contain a linear double-stranded DNA genome. The genomic size of Iridoviridae family members ranges from 105,903 bases encoding 97 open reading frames (ORFs) for frog virus 3 to 212,482 bases encoding 211 ORFs for Chilo iridescent virus. The family Iridoviridae is currently subdivided into five genera: Chloriridovirus, Iridovirus, Lymphocystivirus, Megalocytivirus, and Ranavirus. Iridoviruses have been found to infect invertebrates and poikilothermic vertebrates, including amphibians, reptiles, and fish. With such a diverse array of hosts, there is great diversity in gene content between different genera. To understand the origin of iridoviruses, we explored the phylogenetic relationship between individual iridoviruses and defined the core set of genes shared by all members of the family. To further explore the evolutionary relationships within the Iridoviridae family, repetitive sequences were identified and compared. Each genome was found to contain a set of unique repetitive sequences that could be used in future virus identification. Repeats common to more than one virus were also identified, and changes in copy number between these repeats may provide a simple method to differentiate between very closely related virus strains. The results of this paper will be useful in identifying new iridoviruses and determining their relationship to other members of the family.
Affiliation(s)
- Craig R. Brunetti
- Author to whom correspondence should be addressed; E-Mail: ; Tel.: +1-705-748-1011; Fax: +1-705-748-1205
|
49
|
Entropy and Information Approaches to Genetic Diversity and its Expression: Genomic Geography. Entropy 2010. [DOI: 10.3390/e12071765] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.3] [Indexed: 01/03/2023]
|
50
|
Yu ZG, Chu KH, Li CP, Anh V, Zhou LQ, Wang RW. Whole-proteome phylogeny of large dsDNA viruses and parvoviruses through a composition vector method related to dynamical language model. BMC Evol Biol 2010; 10:192. [PMID: 20565983 PMCID: PMC2898692 DOI: 10.1186/1471-2148-10-192] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.5] [Received: 09/25/2009] [Accepted: 06/22/2010] [Indexed: 11/24/2022] Open
Abstract
BACKGROUND: The vast sequence divergence among different virus groups has presented a great challenge to alignment-based analysis of virus phylogeny. Due to the problems caused by the uncertainty in alignment, existing tools for phylogenetic analysis based on multiple alignment could not be directly applied to the whole-genome comparison and phylogenomic studies of viruses. There has been a growing interest in alignment-free methods for phylogenetic analysis using complete genome data. Among the alignment-free methods, a dynamical language (DL) method proposed by our group has successfully been applied to the phylogenetic analysis of bacteria and chloroplast genomes.
RESULTS: In this paper, the DL method is used to analyze the whole-proteome phylogeny of 124 large dsDNA viruses and 30 parvoviruses, two data sets with a large difference in genome size. The trees from our analyses are in good agreement with the latest classification of large dsDNA viruses and parvoviruses by the International Committee on Taxonomy of Viruses (ICTV).
CONCLUSIONS: The present method provides a new way for recovering the phylogeny of large dsDNA viruses and parvoviruses, and also some insights on the affiliation of a number of unclassified viruses. In comparison, some alignment-free methods such as the CV Tree method can be used for recovering the phylogeny of large dsDNA viruses, but they are not suitable for resolving the phylogeny of parvoviruses with a much smaller genome size.
Affiliation(s)
- Zu-Guo Yu
- School of Mathematical Sciences, Queensland University of Technology, GPO Box 2434, Brisbane, Q 4001, Australia
- School of Mathematics and Computational Science, Xiangtan University, Hunan 411105, China
- Ka Hou Chu
- Department of Biology, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong, China
- Chi Pang Li
- Department of Biology, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong, China
- Vo Anh
- School of Mathematical Sciences, Queensland University of Technology, GPO Box 2434, Brisbane, Q 4001, Australia
- Li-Qian Zhou
- School of Mathematics and Computational Science, Xiangtan University, Hunan 411105, China
- Roger Wei Wang
- Department of Mathematics, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong, China
|