1. van Zyl DJ, Dunaiski M, Tegally H, Baxter C, de Oliveira T, Xavier JS. Alignment-Free Viral Sequence Classification at Scale. bioRxiv 2024:2024.12.10.627186. [PMID: 39713356 PMCID: PMC11661207 DOI: 10.1101/2024.12.10.627186] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/24/2024]
Abstract
Background: The rapid increase in nucleotide sequence data generated by next-generation sequencing (NGS) technologies demands efficient computational tools for sequence comparison. Alignment-based methods, such as BLAST, are increasingly overwhelmed by the scale of contemporary datasets due to their high computational demands for classification. This study evaluates alignment-free (AF) methods as scalable and rapid alternatives for viral sequence classification, focusing on identifying techniques that maintain high accuracy and efficiency when applied to extremely large datasets.
Results: We employed six established AF techniques to extract feature vectors from viral genomes, which were subsequently used to train Random Forest classifiers. Our primary dataset comprises 297,186 SARS-CoV-2 nucleotide sequences, categorized into 3502 distinct lineages. Furthermore, we validated our models using dengue and HIV sequences to demonstrate robustness across different viral datasets. Our AF classifiers achieved 97.8% accuracy on the SARS-CoV-2 test set, and 99.8% and 89.1% accuracy on dengue and HIV test sets, respectively.
Conclusion: Despite the high class dimensionality, we show that word-based AF methods effectively represent viral sequences. Our study highlights the practical advantages of AF techniques, including significantly faster processing compared to alignment-based methods and the ability to classify sequences using modest computational resources.
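As a rough illustration of what a word-based AF feature vector looks like, the sketch below counts overlapping k-mers and normalizes them to frequencies. It is not the authors' pipeline: the function name `kmer_frequencies`, the default `k=3`, and the choice to skip windows containing ambiguous bases are all assumptions made for this example. Vectors of this kind could then be passed to any classifier, such as the Random Forest mentioned in the abstract.

```python
from itertools import product


def kmer_frequencies(sequence, k=3, alphabet="ACGT"):
    """Return the normalized k-mer frequency vector of a nucleotide sequence.

    Hypothetical helper: the vector has len(alphabet)**k entries, ordered
    lexicographically by k-mer.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        # Windows containing a symbol outside the alphabet (e.g. N) are skipped.
        j = index.get(seq[i:i + k])
        if j is not None:
            counts[j] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(kmers)


# A 16-dimensional dinucleotide frequency vector for a toy sequence.
vec = kmer_frequencies("ACGTACGTNACGT", k=2)
```

Because the vector length depends only on k and not on the genome length, sequences of different lengths map to the same feature space without any alignment step.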
2. Wang T, Yu ZG, Li J. CGRWDL: alignment-free phylogeny reconstruction method for viruses based on chaos game representation weighted by dynamical language model. Front Microbiol 2024; 15:1339156. [PMID: 38572227 PMCID: PMC10987876 DOI: 10.3389/fmicb.2024.1339156] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2023] [Accepted: 02/23/2024] [Indexed: 04/05/2024] [Open Access]
Abstract
Traditional alignment-based methods meet serious challenges in genome sequence comparison and phylogeny reconstruction due to their high computational complexity. Here, we propose a new alignment-free method to analyze the phylogenetic relationships (classification) among species. In our method, the dynamical language (DL) model and the chaos game representation (CGR) method are used to characterize the frequency information and the context information of k-mers in a sequence, respectively. Then for each DNA sequence or protein sequence in a dataset, our method converts the sequence into a feature vector that represents the sequence information based on CGR weighted by the DL model to infer phylogenetic relationships. We name our method CGRWDL. Its performance was tested on both DNA and protein sequences of 8 datasets of viruses to construct the phylogenetic trees. For each dataset, we compared the Robinson-Foulds (RF) distance between the tree constructed by CGRWDL and the reference tree with the RF distances obtained by other advanced methods. The results show that the phylogenetic trees constructed by CGRWDL can accurately classify the viruses, and the RF distances between these trees and the reference trees are smaller than those obtained with other methods.
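For readers unfamiliar with CGR, the following minimal sketch shows the plain chaos game representation of a DNA sequence and its k-mer grid counts. It does not include the DL weighting that defines CGRWDL. The corner assignment (A, C, G, T at the four corners of the unit square) is one common convention, and the function names `cgr_points` and `fcgr` are hypothetical.

```python
def cgr_points(sequence):
    """Map a DNA sequence to chaos game representation points in the unit square."""
    # Assumed corner layout; other assignments are also used in the literature.
    corners = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}
    x, y = 0.5, 0.5  # start at the centre of the square
    points = []
    for base in sequence.upper():
        if base not in corners:
            continue  # skip ambiguous symbols in this sketch
        cx, cy = corners[base]
        # Move halfway from the current point toward the corner of this base.
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def fcgr(sequence, k=2):
    """Count CGR points in a 2**k x 2**k grid; each cell corresponds to one k-mer."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for i, (x, y) in enumerate(cgr_points(sequence)):
        if i + 1 >= k:  # a point encodes a full k-mer only after k bases
            grid[int(y * size)][int(x * size)] += 1
    return grid


grid = fcgr("ACGT", k=2)
```

Each grid cell collects the points whose last k bases form the same k-mer, which is why a flattened grid can serve as a fixed-length feature vector.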
Affiliation(s)
- Ting Wang
- National Center for Applied Mathematics in Hunan, Xiangtan University, Xiangtan, Hunan, China
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan, China
- Zu-Guo Yu
- National Center for Applied Mathematics in Hunan, Xiangtan University, Xiangtan, Hunan, China
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan, China
- Jinyan Li
- School of Computer Science and Control Engineering, Shenzhen Institute of Advanced Technology, Shenzhen, Guangdong, China
3. Limaye S, Shelke A, Kale MM, Kulkarni-Kale U, Kuchipudi SV. IDV Typer: An Automated Tool for Lineage Typing of Influenza D Viruses Based on Return Time Distribution. Viruses 2024; 16:373. [PMID: 38543738 PMCID: PMC10976072 DOI: 10.3390/v16030373] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/30/2023] [Revised: 02/24/2024] [Accepted: 02/24/2024] [Indexed: 05/23/2024] [Open Access]
Abstract
Influenza D virus (IDV) is the most recent addition to the Orthomyxoviridae family, and cattle serve as the primary reservoir. IDV has been implicated in Bovine Respiratory Disease Complex (BRDC), and there is serological evidence of human infection with IDV. Evolutionary changes in the IDV genome have resulted in the expansion of genetic diversity and the emergence of multiple lineages that might expand the host tropism and potentially increase the pathogenicity to animals and humans. Therefore, there is an urgent need for automated, accurate and rapid tools for IDV lineage typing. Currently, IDV lineage typing is carried out using BLAST-based searches and alignment-based molecular phylogeny of the hemagglutinin-esterase fusion (HEF) gene sequences, and lineage is assigned to query sequences based on sequence similarity (BLAST search) and proximity to the reference lineages in the tree topology, respectively. To minimize human intervention and lineage typing time, we developed the IDV Typer server, which implements an alignment-free method based on the return time distribution (RTD) of k-mers. Lineages are assigned using HEF gene sequences. The server performs with 100% sensitivity and specificity. The IDV Typer server is the first application of an RTD-based alignment-free method for typing animal viruses.
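The idea of a return time distribution can be sketched as follows: for each k-mer, collect the distances between its successive occurrences and summarize them by mean and standard deviation. This is an illustrative sketch only, not the IDV Typer implementation. Measuring return time as the difference between start positions, using the population standard deviation, and assigning zeros to k-mers that occur fewer than twice are assumptions of this example, as are the function names.

```python
from itertools import product
from statistics import mean, pstdev


def return_times(sequence, kmer):
    """Distances between successive occurrences of a k-mer (start-position differences)."""
    k = len(kmer)
    positions = [i for i in range(len(sequence) - k + 1) if sequence[i:i + k] == kmer]
    return [b - a for a, b in zip(positions, positions[1:])]


def rtd_features(sequence, k=1, alphabet="ACGT"):
    """Return [mean, std] of return times for every k-mer, a vector of size 2 * 4**k."""
    seq = sequence.upper()
    features = []
    for p in product(alphabet, repeat=k):
        rt = return_times(seq, "".join(p))
        # k-mers occurring fewer than twice have no return time; use zeros (assumption).
        features.extend([mean(rt), pstdev(rt)] if rt else [0.0, 0.0])
    return features


features = rtd_features("ACAAC", k=1)
```

Two sequences can then be compared through a distance between their feature vectors, and a query can be assigned the lineage of its nearest reference.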
Affiliation(s)
- Sanket Limaye
- Bioinformatics Centre, Savitribai Phule Pune University (Formerly University of Pune), Pune 411007, India
- Anant Shelke
- Bioinformatics Centre, Savitribai Phule Pune University (Formerly University of Pune), Pune 411007, India
- Mohan M. Kale
- Department of Statistics, Savitribai Phule Pune University (Formerly University of Pune), Pune 411007, India
- Urmila Kulkarni-Kale
- Bioinformatics Centre, Savitribai Phule Pune University (Formerly University of Pune), Pune 411007, India
- Suresh V. Kuchipudi
- Department of Infectious Diseases and Microbiology, University of Pittsburgh School of Public Health, Pittsburgh, PA 15261, USA
4. Dey S, Ghosh P, Das S. Positional difference and Frequency (PdF) based alignment-free technique for genome sequence comparison. J Biomol Struct Dyn 2023; 42:12660-12688. [PMID: 37885236 DOI: 10.1080/07391102.2023.2272748] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2023] [Accepted: 09/19/2023] [Indexed: 10/28/2023]
Abstract
In the field of computational biology, genome sequence comparison among different species is essential and has applications in both the research and scientific fields. Owing to the lengthy processing time and large number of data sets, alignment-based approaches are unsuitable and ineffective. Therefore, alignment-free techniques have gained popularity for obtaining proper sequence clustering and evolutionary relationships among species. In this paper, a complete bipartite graph based Positional difference and Frequency (PdF) vector descriptor is introduced. Two parameters, positional difference and frequency, are applied to the genome sequence to create a 16-dimensional vector descriptor using the di-nucleotide representation of the genome sequence. Subsequently, a distance matrix is calculated to construct the phylogenetic trees for different data sets of mammals and viruses. The achieved outcomes are compared with the phylogenetic trees of earlier methods, viz. the FFP method, the ClustalW method, the MEV method, the PCNV method and the FIS method. In most instances, the proposed method produces more precise outcomes than the preceding techniques and has potential for genome sequence comparison on data sets of both equal and unequal sequence lengths.
Communicated by Ramaswamy H. Sarma.
Affiliation(s)
- Sudeshna Dey
- Computer Science and Engineering, Narula Institute of Technology, Kolkata, India
- Papri Ghosh
- Computer Science and Engineering, Narula Institute of Technology, Kolkata, India
- Subhram Das
- Computer Science and Engineering, Narula Institute of Technology, Kolkata, India
5. Andino R, Kirkegaard K, Macadam A, Racaniello VR, Rosenfeld AB. The Picornaviridae Family: Knowledge Gaps, Animal Models, Countermeasures, and Prototype Pathogens. J Infect Dis 2023; 228:S427-S445. [PMID: 37849401 DOI: 10.1093/infdis/jiac426] [Citation(s) in RCA: 12] [Impact Index Per Article: 12.0] [Indexed: 10/19/2023] [Open Access]
Abstract
Picornaviruses are nonenveloped particles with a single-stranded RNA genome of positive polarity. This virus family includes poliovirus, hepatitis A virus, rhinoviruses, and Coxsackieviruses. Picornaviruses are common human pathogens, and infection can result in a spectrum of serious illnesses, including acute flaccid myelitis, severe respiratory complications, and hand-foot-mouth disease. Despite research on poliovirus establishing many fundamental principles of RNA virus biology and the first transgenic animal model of disease for infection by a human virus, picornaviruses are understudied. Existing knowledge gaps include identification of molecules required for virus entry, understanding cellular and humoral immune responses elicited during virus infection, and establishment of immune-competent animal models of virus pathogenesis. Such knowledge is necessary for development of pan-picornavirus countermeasures. Defining enterovirus A71 and D68, human rhinovirus C, and echovirus 29 as prototype pathogens of this virus family may provide insight into picornavirus biology needed to establish public health strategies necessary for pandemic preparedness.
Affiliation(s)
- Raul Andino
- Department of Microbiology and Immunology, University of California, San Francisco, California, USA
- Karla Kirkegaard
- Department of Microbiology and Immunology, Stanford University School of Medicine, Stanford University, Stanford, California, USA
- Department of Genetics, Stanford University School of Medicine, Stanford University, Stanford, California, USA
- Andrew Macadam
- National Institute for Biological Standards and Control, South Mimms, Hertfordshire, United Kingdom
- Vincent R Racaniello
- Department of Microbiology and Immunology, Vagelos College of Physicians and Surgeons, Columbia University, New York, New York, USA
- Amy B Rosenfeld
- Department of Microbiology and Immunology, Vagelos College of Physicians and Surgeons, Columbia University, New York, New York, USA
- Center for Biologics Evaluation and Research, US Food and Drug Administration, Silver Spring, Maryland, USA
6. Tang R, Yu Z, Li J. KINN: An alignment-free accurate phylogeny reconstruction method based on inner distance distributions of k-mer pairs in biological sequences. Mol Phylogenet Evol 2023; 179:107662. [PMID: 36375789 DOI: 10.1016/j.ympev.2022.107662] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/30/2022] [Revised: 10/10/2022] [Accepted: 11/02/2022] [Indexed: 11/13/2022]
Abstract
Alignment-based methods have faced disadvantages in sequence comparison and phylogeny reconstruction due to their high computational complexity. Alignment-free methods for sequence comparison and phylogeny inference have attracted a great deal of attention in recent years. Here, we explore an alignment-free approach that uses inner distance distributions of k-mer pairs in biological sequences for phylogeny inference. For every sequence in a dataset, our method transforms the sequence into a numeric feature vector consisting of features each representing a specific k-mer pair's contribution to the characterization of the sequentiality uniqueness of the sequence. This newly defined k-mer pair's contribution is an integration of the reverse Kullback-Leibler divergence, pseudo mode and the classic entropy of an inner distance distribution of the k-mer pair in the sequence. Our method has been tested on datasets of complete genome sequences, complete protein sequences, and gene sequences of rRNA of various lengths. Our method achieves the best performance in comparison with state-of-the-art alignment-free methods as measured by the Robinson-Foulds distance between the reference and the constructed phylogeny trees.
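Since several abstracts in this list evaluate trees by the Robinson-Foulds distance, a minimal sketch of that metric may help. It counts the non-trivial leaf bipartitions present in one tree but not the other. The nested-tuple tree encoding and the function names are assumptions of this example, and both trees are assumed to share the same leaf set.

```python
def bipartitions(tree):
    """Return the non-trivial leaf bipartitions of a tree given as nested tuples of leaf names."""
    clades = set()

    def leaves_below(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(leaves_below(child) for child in node))
        clades.add(clade)
        return clade

    all_leaves = leaves_below(tree)
    reference = min(all_leaves)  # fixed leaf used to pick one side of each split
    splits = set()
    for clade in clades:
        # Represent each split by the side that does not contain the reference leaf.
        side = clade if reference not in clade else all_leaves - clade
        if 2 <= len(side) <= len(all_leaves) - 2:  # drop trivial splits
            splits.add(side)
    return splits


def robinson_foulds(tree_a, tree_b):
    """Unnormalized RF distance: size of the symmetric difference of the split sets."""
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


reference_tree = ((("A", "B"), "C"), ("D", "E"))
inferred_tree = ((("A", "C"), "B"), ("D", "E"))
distance = robinson_foulds(reference_tree, inferred_tree)
```

A distance of zero means the two trees have identical topology, so smaller values indicate closer agreement with the reference tree.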
Affiliation(s)
- Runbin Tang
- Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, China; School of Mathematical Sciences, Chongqing Normal University, Chongqing 401331, China
- Zuguo Yu
- Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, China
- Jinyan Li
- Data Science Institute, University of Technology Sydney, Ultimo, NSW 2007, Australia
7. Bohnsack KS, Kaden M, Abel J, Villmann T. Alignment-Free Sequence Comparison: A Systematic Survey From a Machine Learning Perspective. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:119-135. [PMID: 34990369 DOI: 10.1109/tcbb.2022.3140873] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
The large amounts of biological sequence data generated during the last decades, together with algorithmic and hardware improvements, have made it possible to apply machine learning techniques in bioinformatics. While the machine learning community is aware of the necessity to rigorously distinguish data transformation from data comparison and adopt reasonable combinations thereof, this awareness is often lacking in the field of comparative sequence analysis. With growing recognition of the disadvantages of alignments for sequence comparison, typical applications increasingly use so-called alignment-free approaches. In light of this development, we present a conceptual framework for alignment-free sequence comparison, which highlights the delineation of: 1) the sequence data transformation comprising adequate mathematical sequence coding and feature generation, from 2) the subsequent (dis-)similarity evaluation of the transformed data by means of problem-specific but mathematically consistent proximity measures. We consider coding to be an information-loss free data transformation in order to get an appropriate representation, whereas feature generation is inevitably information-lossy with the intention to extract just the task-relevant information. This distinction sheds light on the plethora of methods available and assists in identifying suitable methods in machine learning and data analysis to compare the sequences under these premises.
8. Gwak HJ, Rho M. ViBE: a hierarchical BERT model to identify eukaryotic viruses using metagenome sequencing data. Brief Bioinform 2022; 23:6603436. [PMID: 35667011 DOI: 10.1093/bib/bbac204] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 01/12/2022] [Revised: 05/02/2022] [Accepted: 05/04/2022] [Indexed: 11/13/2022] [Open Access]
Abstract
Viruses are ubiquitous in humans and various environments and mutate continually. Identifying viruses in an environment without cultivation is challenging; however, screening for novel viruses and expanding knowledge of the viral space are essential. Homology-based methods that identify viruses using known viral genomes rely on sequence alignments, making it difficult to capture remote homologs of the known viruses. To accurately capture viral signals from metagenomic samples, models are needed to understand the patterns encoded in the viral genomes. In this study, we developed a hierarchical BERT model named ViBE to detect eukaryotic viruses from metagenome sequencing data and classify them at the order level. We pre-trained ViBE using read-like sequences generated from the virus reference genomes and derived three fine-tuned models that classify paired-end reads to orders for eukaryotic deoxyribonucleic acid viruses and eukaryotic ribonucleic acid viruses. ViBE achieved higher recall than state-of-the-art alignment-based methods while maintaining comparable precision. ViBE outperformed state-of-the-art alignment-free methods for all test cases. The performance of ViBE was also verified using real sequencing datasets, including the vaginal virome.
Affiliation(s)
- Ho-Jin Gwak
- Department of Computer Science, Hanyang University, Seoul, Korea
- Mina Rho
- Department of Computer Science, Hanyang University, Seoul, Korea
- Department of Biomedical Informatics, Hanyang University, Seoul, Korea
9. Hufsky F, Abecasis A, Agudelo-Romero P, Bletsa M, Brown K, Claus C, Deinhardt-Emmer S, Deng L, Friedel CC, Gismondi MI, Kostaki EG, Kühnert D, Kulkarni-Kale U, Metzner KJ, Meyer IM, Miozzi L, Nishimura L, Paraskevopoulou S, Pérez-Cataluña A, Rahlff J, Thomson E, Tumescheit C, van der Hoek L, Van Espen L, Vandamme AM, Zaheri M, Zuckerman N, Marz M. Women in the European Virus Bioinformatics Center. Viruses 2022; 14:1522. [PMID: 35891501 PMCID: PMC9319252 DOI: 10.3390/v14071522] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/16/2022] [Revised: 07/05/2022] [Accepted: 07/07/2022] [Indexed: 02/01/2023] [Open Access]
Abstract
Viruses are the cause of a considerable burden to human, animal and plant health, while on the other hand playing an important role in regulating entire ecosystems. The power of new sequencing technologies combined with new tools for processing "Big Data" offers unprecedented opportunities to answer fundamental questions in virology. Virologists have an urgent need for virus-specific bioinformatics tools. These developments have led to the formation of the European Virus Bioinformatics Center, a network of experts in virology and bioinformatics who are joining forces to enable extensive exchange and collaboration between these research areas. The EVBC strives to provide talented researchers with a supportive environment free of gender bias, but the gender gap in science, especially in math-intensive fields such as computer science, persists. To bring more talented women into research and keep them there, we need to highlight role models to spark their interest, and we need to ensure that female scientists are not kept at lower levels but are given the opportunity to lead the field. Here we showcase the work of the EVBC and highlight the achievements of some outstanding women experts in virology and viral bioinformatics.
Affiliation(s)
- Franziska Hufsky
- European Virus Bioinformatics Center, 07743 Jena, Germany
- RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University Jena, 07743 Jena, Germany
- Ana Abecasis
- European Virus Bioinformatics Center, 07743 Jena, Germany
- Global Health and Tropical Medicine, Institute of Hygiene and Tropical Medicine, New University of Lisbon, 1349-008 Lisbon, Portugal
- Patricia Agudelo-Romero
- European Virus Bioinformatics Center, 07743 Jena, Germany
- Wal-Yan Respiratory Research Centre, Telethon Kids Institute, University of Western Australia, Nedlands, WA 6009, Australia
- Magda Bletsa
- European Virus Bioinformatics Center, 07743 Jena, Germany
- Department of Hygiene, Epidemiology and Medical Statistics, Medical School, National and Kapodistrian University of Athens, 115 27 Athens, Greece
- Department of Microbiology, Immunology and Transplantation, Rega Institute, Katholieke Universiteit Leuven, B-3000 Leuven, Belgium
- Katherine Brown
- European Virus Bioinformatics Center, 07743 Jena, Germany
- Division of Virology, Department of Pathology, University of Cambridge, Cambridge CB2 1TN, UK
- Claudia Claus
- European Virus Bioinformatics Center, 07743 Jena, Germany
- Institute of Medical Microbiology and Virology, Medical Faculty, Leipzig University, 04103 Leipzig, Germany
- Stefanie Deinhardt-Emmer
- European Virus Bioinformatics Center, 07743 Jena, Germany
- Institute of Medical Microbiology, Jena University Hospital, 07747 Jena, Germany
- Li Deng
- European Virus Bioinformatics Center, 07743 Jena, Germany
- Institute of Virology, Helmholtz Centre Munich-German Research Center for Environmental Health, 85764 Neuherberg, Germany
- Microbial Disease Prevention, School of Life Sciences, Technical University of Munich, 85354 Freising, Germany
- Caroline C. Friedel
- European Virus Bioinformatics Center, 07743 Jena, Germany
- Institute of Informatics, Ludwig-Maximilians-Universität München, 80333 Munich, Germany
- María Inés Gismondi
- European Virus Bioinformatics Center, 07743 Jena, Germany
- Institute of Agrobiotechnology and Molecular Biology (IABIMO), National Institute for Agriculture Technology (INTA), National Research Council (CONICET), Hurlingham B1686IGC, Argentina
- Department of Basic Sciences, National University of Luján, Luján B6702MZP, Argentina
- Evangelia Georgia Kostaki
- European Virus Bioinformatics Center, 07743 Jena, Germany
- Department of Hygiene, Epidemiology and Medical Statistics, Medical School, National and Kapodistrian University of Athens, 115 27 Athens, Greece
- Denise Kühnert
- European Virus Bioinformatics Center, 07743 Jena, Germany
- Transmission, Infection, Diversification and Evolution Group, Max Planck Institute for the Science of Human History, 07745 Jena, Germany
- Urmila Kulkarni-Kale
- European Virus Bioinformatics Center, 07743 Jena, Germany
- Bioinformatics Centre, Savitribai Phule Pune University, Pune 411007, India
- Karin J. Metzner
- European Virus Bioinformatics Center, 07743 Jena, Germany
- Department of Infectious Diseases and Hospital Epidemiology, University Hospital Zurich, 8091 Zurich, Switzerland
- Institute of Medical Virology, University of Zurich, 8057 Zurich, Switzerland
- Irmtraud M. Meyer
- European Virus Bioinformatics Center, 07743 Jena, Germany
- Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, 10115 Berlin, Germany
- Institute of Chemistry and Biochemistry, Department of Biology, Chemistry and Pharmacy, Freie Universität Berlin, 14195 Berlin, Germany
- Faculty of Mathematics and Computer Science, Freie Universität Berlin, 14195 Berlin, Germany
- Laura Miozzi
- European Virus Bioinformatics Center, 07743 Jena, Germany
- Institute for Sustainable Plant Protection, National Research Council of Italy, 10135 Torino, Italy
- Luca Nishimura
- European Virus Bioinformatics Center, 07743 Jena, Germany
- Department of Genetics, School of Life Science, The Graduate University for Advanced Studies (SOKENDAI), Mishima 411-8540, Japan
- Human Genetics Laboratory, National Institute of Genetics, Mishima 411-8540, Japan
- Sofia Paraskevopoulou
- European Virus Bioinformatics Center, 07743 Jena, Germany
- Methods Development and Research Infrastructure, Bioinformatics and Systems Biology, Robert Koch Institute, 13353 Berlin, Germany
- Alba Pérez-Cataluña
- European Virus Bioinformatics Center, 07743 Jena, Germany
- VISAFELab, Department of Preservation and Food Safety Technologies, Institute of Agrochemistry and Food Technology, IATA-CSIC, 46980 Valencia, Spain
- Janina Rahlff
- European Virus Bioinformatics Center, 07743 Jena, Germany
- Centre for Ecology and Evolution in Microbial Model Systems (EEMiS), Department of Biology and Environmental Science, Linnaeus University, 391 82 Kalmar, Sweden
- Emma Thomson
- European Virus Bioinformatics Center, 07743 Jena, Germany
- Queen Elizabeth University Hospital, NHS Greater Glasgow and Clyde, Glasgow G51 4TF, UK
- MRC-University of Glasgow Centre for Virus Research, Glasgow G61 1QH, UK
- Charlotte Tumescheit
- European Virus Bioinformatics Center, 07743 Jena, Germany
- School of Biological Sciences, Seoul National University, Seoul 08826, Korea
- Lia van der Hoek
- European Virus Bioinformatics Center, 07743 Jena, Germany
- Laboratory of Experimental Virology, Department of Medical Microbiology and Infection Prevention, Amsterdam UMC, University of Amsterdam, 1012 WX Amsterdam, The Netherlands
- Amsterdam Institute for Infection and Immunity, 1100 DD Amsterdam, The Netherlands
- Lore Van Espen
- European Virus Bioinformatics Center, 07743 Jena, Germany
- Department of Microbiology, Immunology and Transplantation, Rega Institute, Katholieke Universiteit Leuven, B-3000 Leuven, Belgium
- Anne-Mieke Vandamme
- European Virus Bioinformatics Center, 07743 Jena, Germany
- Department of Microbiology, Immunology and Transplantation, Rega Institute, Katholieke Universiteit Leuven, B-3000 Leuven, Belgium
- Global Health and Tropical Medicine, Instituto de Higiene e Medicina Tropical, Universidade Nova de Lisboa, 1349-008 Lisbon, Portugal
- Institute for the Future, Katholieke Universiteit Leuven, B-3000 Leuven, Belgium
| | - Maryam Zaheri
- European Virus Bioinformatics Center, 07743 Jena, Germany; (A.A.); (P.A.-R.); (M.B.); (K.B.); (C.C.); (S.D.-E.); (L.D.); (C.C.F.); (M.I.G.); (E.G.K.); (D.K.); (U.K.-K.); (K.J.M.); (I.M.M.); (L.M.); (L.N.); (S.P.); (A.P.-C.); (J.R.); (E.T.); (C.T.); (L.v.d.H.); (L.V.E.); (A.-M.V.); (M.Z.); (N.Z.)
- Institute of Medical Virology, University of Zurich, 8057 Zurich, Switzerland
| | - Neta Zuckerman
- European Virus Bioinformatics Center, 07743 Jena, Germany; (A.A.); (P.A.-R.); (M.B.); (K.B.); (C.C.); (S.D.-E.); (L.D.); (C.C.F.); (M.I.G.); (E.G.K.); (D.K.); (U.K.-K.); (K.J.M.); (I.M.M.); (L.M.); (L.N.); (S.P.); (A.P.-C.); (J.R.); (E.T.); (C.T.); (L.v.d.H.); (L.V.E.); (A.-M.V.); (M.Z.); (N.Z.)
- Central Virology Laboratory, Public Health Services, Ministry of Health and Sheba Medical Center, Ramat Gan 52621, Israel
| | - Manja Marz
- European Virus Bioinformatics Center, 07743 Jena, Germany; (A.A.); (P.A.-R.); (M.B.); (K.B.); (C.C.); (S.D.-E.); (L.D.); (C.C.F.); (M.I.G.); (E.G.K.); (D.K.); (U.K.-K.); (K.J.M.); (I.M.M.); (L.M.); (L.N.); (S.P.); (A.P.-C.); (J.R.); (E.T.); (C.T.); (L.v.d.H.); (L.V.E.); (A.-M.V.); (M.Z.); (N.Z.)
- RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University Jena, 07743 Jena, Germany
| |
|
10
|
Mane A, Limaye S, Patil L, Kulkarni-Kale U. Genetic variability in minor capsid protein (L2 gene) of human papillomavirus type 16 among Indian women. Med Microbiol Immunol 2022; 211:153-160. [PMID: 35552511 PMCID: PMC9101989 DOI: 10.1007/s00430-022-00739-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/17/2022] [Accepted: 04/21/2022] [Indexed: 11/17/2022]
Abstract
Human papillomavirus type 16 (HPV-16) is the predominant genotype worldwide associated with invasive cervical cancer and hence remains the focus for diagnostic development and vaccine research. L2, the minor capsid protein, forms the packaging unit for the HPV genome along with the L1 protein and is primarily associated with transport of genomic DNA to the nucleus. Unlike L1, L2 is known to elicit cross-neutralizing antibodies and is thus a suitable candidate for pan-HPV prophylactic vaccine development. In the present study, a total of 148 cervical HPV-16 isolates from Indian women were analyzed by PCR-directed sequencing, phylogenetic analysis and in silico immunoinformatics tools to determine the L2 variations that may impact the immune response and oncogenesis. Ninety-one SNPs translating to 35 non-synonymous amino acid substitutions were observed; of these, 16 substitutions are reported in the Indian isolates for the first time. T245A, L266F, S378V and S384A substitutions were significantly associated with high-grade cervical neoplastic status. Multiple substitutions were observed in samples from high-grade cervical neoplastic status as compared to those from normal cervical status (p = 0.027), specifically from the D3 sub-lineage. It was observed that substitution T85A was part of both B and T cell epitopes recognized by MHC-I molecules; T245A was common to B and T cell epitopes recognized by MHC-II molecules; and S122P/A was common to the region recognized by both MHC-I and MHC-II molecules. These findings on L2 protein substitutions have implications for cervical oncogenesis and the design of next-generation L2-based HPV vaccines.
Affiliation(s)
- Arati Mane
- ICMR - National AIDS Research Institute, '73' G Block, MIDC, Bhosari, Pune, 411026, India.
| | - Sanket Limaye
- Savitribai Phule Pune University, Ganeshkhind Road, Pune, 411007, India
| | - Linata Patil
- ICMR - National AIDS Research Institute, '73' G Block, MIDC, Bhosari, Pune, 411026, India
| | | |
|
11
|
Mane A, Limaye S, Patil L, Kulkarni-Kale U. Genetic variations in the long control region of human papillomavirus type 16 isolates from India: implications for cervical carcinogenesis. J Med Microbiol 2022; 71. [PMID: 35040427 DOI: 10.1099/jmm.0.001475] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 01/22/2023]
Abstract
Introduction. Infection with high-risk human papillomavirus (HPV) types, specifically HPV type 16 (HPV16), is considered to be the most important risk factor in the development of cervical intraepithelial neoplasia and cancer. The long control region (LCR) is a noncoding region that comprises approximately 10 % of the HPV genome and contains regulatory elements for viral transcription and replication. Sequence variations in LCR may impact on the replication efficiency and oncogenic potential of the virus. Gap statement. Studies documenting variations in LCR of HPV16 isolates pertaining to cervical neoplastic status in India are limited. Aim. The present study was designed to characterize variations in the LCR of Indian isolates of HPV16 and study their association with cervical disease grades. Methodology. The LCR was amplified and sequenced from HPV16 positive cervical samples belonging to different cervical disease grades. Sequences were aligned to identify variations and potential transcription factor binding sites (TFbs) were predicted using the JASPAR database in addition to phylogenetic studies. Results. Among the 163 HPV16 isolates analysed, 47 different nucleotide variations were detected in the LCR, of which 25 are reported for the first time in Indian isolates. Point mutations were detected in 35/54 (64.8 %) samples with normal cervical status, 44/50 (88 %) samples with low-grade cervical disease and 53/59 (89.8 %) samples with high-grade cervical disease. Variations T6586C, G6657A and T6850G were significantly associated with high-grade cervical status. Thirteen LCR variations were detected in the binding sites for CEBPB, ETS1, JUN, MYB, NFIL3, PHOX2A and SOX9 transcription factors. Conclusion. The present study helped to identify unique variations in the LCRs of HPV16 Indian isolates. The variations in the A4 sub-lineage were significantly associated with high-grade disease status. The isolates belonging to the A4 and D3 sub-lineages harboured mutations in putative TFbs, implying a potential impact on viral replication and progression to cervical cancer.
Affiliation(s)
- Arati Mane
- ICMR-National AIDS Research Institute, 73 G block, MIDC, Bhosari, Pune-411026, India
| | - Sanket Limaye
- Bioinformatics Centre, Savitribai Phule Pune University, Ganeshkhind, Pune-411007, India
| | - Linata Patil
- ICMR-National AIDS Research Institute, 73 G block, MIDC, Bhosari, Pune-411026, India
| | - Urmila Kulkarni-Kale
- Bioinformatics Centre, Savitribai Phule Pune University, Ganeshkhind, Pune-411007, India
| |
|
12
|
Bohnsack KS, Kaden M, Abel J, Saralajew S, Villmann T. The Resolved Mutual Information Function as a Structural Fingerprint of Biomolecular Sequences for Interpretable Machine Learning Classifiers. ENTROPY (BASEL, SWITZERLAND) 2021; 23:1357. [PMID: 34682081 PMCID: PMC8534762 DOI: 10.3390/e23101357] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2021] [Revised: 10/11/2021] [Accepted: 10/14/2021] [Indexed: 11/16/2022]
Abstract
In the present article we propose the application of variants of the mutual information function as characteristic fingerprints of biomolecular sequences for classification analysis. In particular, we consider the resolved mutual information functions based on Shannon-, Rényi-, and Tsallis-entropy. In combination with interpretable machine learning classifier models based on generalized learning vector quantization, a powerful methodology for sequence classification is achieved which allows substantial knowledge extraction in addition to the high classification ability due to the model-inherent robustness. Any potential (slightly) inferior performance of the used classifier is compensated by the additional knowledge provided by interpretable models. This knowledge may assist the user in the analysis and understanding of the used data and considered task. After theoretical justification of the concepts, we demonstrate the approach for various example data sets covering different areas in biomolecular sequence analysis.
Affiliation(s)
- Katrin Sophie Bohnsack
- Saxon Institute for Computational Intelligence and Machine Learning, University of Applied Sciences Mittweida, 09648 Mittweida, Germany; (M.K.); (J.A.)
| | - Marika Kaden
- Saxon Institute for Computational Intelligence and Machine Learning, University of Applied Sciences Mittweida, 09648 Mittweida, Germany; (M.K.); (J.A.)
| | - Julia Abel
- Saxon Institute for Computational Intelligence and Machine Learning, University of Applied Sciences Mittweida, 09648 Mittweida, Germany; (M.K.); (J.A.)
| | - Sascha Saralajew
- Bosch Center for Artificial Intelligence, 71272 Renningen, Germany;
| | - Thomas Villmann
- Saxon Institute for Computational Intelligence and Machine Learning, University of Applied Sciences Mittweida, 09648 Mittweida, Germany; (M.K.); (J.A.)
| |
|
13
|
Qi Z, Wen X. Novel Protein Sequence Comparison Method Based on Transition Probability Graph and Information Entropy. Comb Chem High Throughput Screen 2020; 25:392-400. [PMID: 32875978 DOI: 10.2174/1386207323666200901103001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2020] [Revised: 07/17/2020] [Accepted: 07/17/2020] [Indexed: 11/22/2022]
Abstract
AIM AND OBJECTIVE Sequence analysis is one of the foundations of bioinformatics. It is widely used to find the feature metrics hidden in a sequence. In addition, the graphical representation of biological sequences is an important tool for sequence analysis. This study was undertaken to find a new graphical representation of biosequences. MATERIALS AND METHODS The transition probability is used to describe amino acid combinations of protein sequences. The combinations are composed of amino acids directly adjacent to each other or separated by multiple amino acids. The transition probability graph is built from the transition probabilities of amino acid combinations. Next, a map is defined as a representation from transition probability graph to transition probability vector by k-order transition probability graph. Transition entropy vectors are developed from the transition probability vector and information entropy. Finally, the proposed method is applied to two separate applications: 499 HA genes of H1N1, and 95 coronaviruses. RESULTS By constructing a phylogenetic tree, we find that the results of each application are consistent with other studies. CONCLUSION The graphical representation proposed in this article is a practical and correct method.
Affiliation(s)
- Zhaohui Qi
- College of Information Science and Engineering, Hunan Normal University, Changsha 410081, China
| | - Xinlong Wen
- College of Information Science and Engineering, Hunan Normal University, Changsha 410081, China
| |
|
14
|
ProtPCV: A Fixed Dimensional Numerical Representation of Protein Sequence to Significantly Reduce Sequence Search Time. Interdiscip Sci 2020; 12:276-287. [PMID: 32524529 DOI: 10.1007/s12539-020-00380-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/18/2019] [Revised: 05/19/2020] [Accepted: 06/02/2020] [Indexed: 10/24/2022]
Abstract
Protein sequence is a wealth of experimental information which is yet to be exploited to extract information on protein homologues. It is observed from publications that dynamic programming, heuristics and HMM profile-based alignment techniques, along with the alignment-free techniques, do not directly utilize the ordered profile of physicochemical properties of a protein to identify its homologue. Also, it is found that these works lack crucial benchmarking or validation, in the absence of which their incorporation in search engines may appear questionable. In this direction, this research approach offers a fixed dimensional numerical representation of protein sequences, extending the concept of periodicity count value of nucleotide types (2017) to accommodate Euclidean distance as a direct similarity measure between two proteins. Instead of benchmarking with BLAST and PSI-BLAST only, this new similarity measure was also compared with Needleman-Wunsch and Smith-Waterman. To strengthen the comparison, this work for the first time introduces two novel benchmarking methods based on correlation of "similarity scores" and "proximity of ranked outputs from a standard sequence alignment method" between all possible pairs of search techniques, including the new one presented in this paper. It is found that the novel and unique numerical representation of a protein can reduce the computational complexity of protein sequence search to the tune of O(log(n)). It may also help implement various other similarity-based operations, such as clustering, phylogenetic analysis and classification of proteins on the basis of the properties used to build this numerical representation of protein.
|
15
|
Mitra U, Bhattacharyya B, Mukhopadhyay T. PEER: A direct method for biosequence pattern mining through waits of optimal k-mers. Inf Sci (N Y) 2020. [DOI: 10.1016/j.ins.2019.12.072] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/16/2022]
|
16
|
Mane A, Patil L, Limaye S, Nirmalkar A, Kulkarni‐Kale U. Characterization of major capsid protein (L1) variants of Human papillomavirus type 16 by cervical neoplastic status in Indian women: Phylogenetic and functional analysis. J Med Virol 2020; 92:1303-1308. [DOI: 10.1002/jmv.25675] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 10/04/2019] [Accepted: 01/10/2020] [Indexed: 11/10/2022]
Affiliation(s)
- Arati Mane
- Division of Microbiology, ICMR‐National AIDS Research Institute, Pune, India
| | - Linata Patil
- Division of Microbiology, ICMR‐National AIDS Research Institute, Pune, India
| | - Sanket Limaye
- Bioinformatics Centre, Savitribai Phule Pune University, Pune, India
| | - Amit Nirmalkar
- Division of Data Management, Biostatistics and IT, ICMR‐National AIDS Research Institute, Pune, India
| | | |
|
17
|
Zielezinski A, Girgis HZ, Bernard G, Leimeister CA, Tang K, Dencker T, Lau AK, Röhling S, Choi JJ, Waterman MS, Comin M, Kim SH, Vinga S, Almeida JS, Chan CX, James BT, Sun F, Morgenstern B, Karlowski WM. Benchmarking of alignment-free sequence comparison methods. Genome Biol 2019; 20:144. [PMID: 31345254 PMCID: PMC6659240 DOI: 10.1186/s13059-019-1755-7] [Citation(s) in RCA: 101] [Impact Index Per Article: 20.2] [Received: 05/16/2019] [Accepted: 07/03/2019] [Indexed: 11/22/2022]
Abstract
BACKGROUND Alignment-free (AF) sequence comparison is attracting persistent interest driven by data-intensive applications. Hence, many AF procedures have been proposed in recent years, but a lack of a clearly defined benchmarking consensus hampers their performance assessment. RESULTS Here, we present a community resource (http://afproject.org) to establish standards for comparing alignment-free approaches across different areas of sequence-based research. We characterize 74 AF methods available in 24 software tools for five research applications, namely, protein sequence classification, gene tree inference, regulatory element detection, genome-based phylogenetic inference, and reconstruction of species trees under horizontal gene transfer and recombination events. CONCLUSION The interactive web service allows researchers to explore the performance of alignment-free tools relevant to their data types and analytical goals. It also allows method developers to assess their own algorithms and compare them with current state-of-the-art tools, accelerating the development of new, more accurate AF solutions.
Affiliation(s)
- Andrzej Zielezinski
- Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University Poznan, Uniwersytetu Poznańskiego 6, 61-614, Poznan, Poland
| | - Hani Z Girgis
- Tandy School of Computer Science, The University of Tulsa, 800 South Tucker Drive, Tulsa, OK, 74104, USA
| | | | - Chris-Andre Leimeister
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
| | - Kujin Tang
- Department of Biological Sciences, Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, 90089, USA
| | - Thomas Dencker
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
| | - Anna Katharina Lau
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
| | - Sophie Röhling
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
| | - Jae Jin Choi
- Department of Chemistry, University of California, Berkeley, CA, 94720, USA
- Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
| | - Michael S Waterman
- Department of Biological Sciences, Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, 90089, USA
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, 200433, China
| | - Matteo Comin
- Department of Information Engineering, University of Padova, Padova, Italy
| | - Sung-Hou Kim
- Department of Chemistry, University of California, Berkeley, CA, 94720, USA
- Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
| | - Susana Vinga
- INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1049-001, Lisbon, Portugal
- IDMEC, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1049-001, Lisbon, Portugal
| | - Jonas S Almeida
- Division of Cancer Epidemiology and Genetics (DCEG), National Cancer Institute (NIH/NCI), Bethesda, USA
| | - Cheong Xin Chan
- Institute for Molecular Bioscience, and School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD, 4072, Australia
| | - Benjamin T James
- Tandy School of Computer Science, The University of Tulsa, 800 South Tucker Drive, Tulsa, OK, 74104, USA
| | - Fengzhu Sun
- Department of Biological Sciences, Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, 90089, USA
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, 200433, China
| | - Burkhard Morgenstern
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
| | - Wojciech M Karlowski
- Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University Poznan, Uniwersytetu Poznańskiego 6, 61-614, Poznan, Poland.
| |
|
18
|
Genetic diversity and evolutionary dynamics of dengue isolates from India. Virusdisease 2019; 30:354-359. [PMID: 31803801 DOI: 10.1007/s13337-019-00538-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/06/2019] [Accepted: 06/25/2019] [Indexed: 10/26/2022]
Abstract
Dengue virus (DENV) is a mosquito-borne virus that causes Dengue Haemorrhagic Fever and Dengue Shock Syndrome. It consists of four distinct serotypes (DENV 1-4). DENV 1, 3 and 4 were classified into five genotypes (GI-GV), whereas DENV-2 belongs to American and Cosmopolitan genotypes. Dengue virus is most prevalent in South and Southeast Asia, including India. This study was initiated to study the genetic diversity and evolution among the Dengue isolates in India. Pairwise comparison of amino acid sequences among the serotypes has shown that DENV-3 has less sequence diversity compared to other serotypes having differences in their amino acid numbers. We analyzed 50 Indian strains, and 19 of those strains were identified as recombinant strains using the RDP4 package and were then excluded from further selection analysis. Episodic positive selection of DENV was obtained using MEME with a P value ≤ 5. Positive selection on several codons was used to correlate the genetic diversity between serotypes. This study clearly established that diversity of amino acids and inter-genotypic recombination of strains are the major causes of antigenicity variation and evolution of DENV within India.
|
19
|
Criscuolo A. A fast alignment-free bioinformatics procedure to infer accurate distance-based phylogenetic trees from genome assemblies. RESEARCH IDEAS AND OUTCOMES 2019. [DOI: 10.3897/rio.5.e36178] [Citation(s) in RCA: 36] [Impact Index Per Article: 7.2] [Indexed: 12/26/2022]
Abstract
This paper describes a novel alignment-free distance-based procedure for inferring phylogenetic trees from genome contig sequences using publicly available bioinformatics tools. For each pair of genomes, a dissimilarity measure is first computed and next transformed to obtain an estimation of the number of substitution events that have occurred during their evolution. These pairwise evolutionary distances are then used to infer a phylogenetic tree and assess a confidence support for each internal branch. Analyses of both simulated and real genome datasets show that this bioinformatics procedure allows accurate phylogenetic trees to be reconstructed with fast running times, especially when launched on multiple threads. Implemented in a publicly available script, named JolyTree, this procedure is a useful approach for quickly inferring species trees without the burden and potential biases of multiple sequence alignments.
|
20
|
Lebatteux D, Remita AM, Diallo AB. Toward an Alignment-Free Method for Feature Extraction and Accurate Classification of Viral Sequences. J Comput Biol 2019; 26:519-535. [PMID: 31050550 DOI: 10.1089/cmb.2018.0239] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 12/14/2022]
Abstract
The classification of pathogens in emerging and re-emerging viruses is of major interest in taxonomic studies, functional genomics, host-pathogen interplay, prevention, and disease treatments. It consists of assigning a given sequence to its related group of known sequences sharing similar characteristics and traits. The challenges to such classification could be associated with several virus properties including recombination, mutation rate, multiplicity of motifs, and diversity. In domains such as pathogen monitoring and surveillance, it is important to detect and quantify known and novel taxa without exploiting the full and accurate alignments or virus family profiles. In this study, we propose an alignment-free method, CASTOR-KRFE, to detect discriminating subsequences within known pathogen sequences to accurately classify unknown pathogen sequences. This method includes three major steps: (1) vectorization of known viral genomic sequences based on k-mers to constitute the potential features, (2) an efficient way of pattern extraction and evaluation maximizing classification performance, and (3) prediction of the minimal set of features fitting a given criterion (threshold of performance metric and maximum number of features). We assessed this method through a jackknife data partitioning on a dozen virus data sets, covering the seven major virus groups and including influenza virus, Ebola virus, human immunodeficiency virus 1, hepatitis C virus, hepatitis B virus, and human papillomavirus. CASTOR-KRFE provides a weighted average F-measure >0.96 over a wide range of viruses. Our method also shows better performance on complex virus data sets than multiple subsequences extractor for classification (MISSEL), a subsequence extraction method, and the Discriminative mode of MEME patterns extraction tool.
Affiliation(s)
- Dylan Lebatteux
- Department of Computer Science, Université du Québec à Montréal, Montreal, Canada
| | - Amine M Remita
- Department of Computer Science, Université du Québec à Montréal, Montreal, Canada
| | | |
|
21
|
Saw AK, Raj G, Das M, Talukdar NC, Tripathy BC, Nandi S. Alignment-free method for DNA sequence clustering using Fuzzy integral similarity. Sci Rep 2019; 9:3753. [PMID: 30842590 PMCID: PMC6403383 DOI: 10.1038/s41598-019-40452-6] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 07/31/2018] [Accepted: 01/28/2019] [Indexed: 12/28/2022]
Abstract
The growing amount of sequence data in private and public databases produced by next-generation sequencing poses new challenges because of the limitations of alignment-based methods for sequence comparison. There is therefore a pressing need for faster sequence analysis algorithms. In this study, we developed an alignment-free algorithm for faster sequence analysis. The novelty of our approach is the inclusion of fuzzy integral with Markov chain for sequence analysis in the alignment-free model. The method estimates the parameters of a Markov chain by considering the frequencies of occurrence of all possible nucleotide pairs from each DNA sequence. These estimated Markov chain parameters were used to calculate similarity among all pairwise combinations of DNA sequences based on a fuzzy integral algorithm. This matrix is used as an input for the neighbor program in the PHYLIP package for phylogenetic tree construction. Our method was tested on eight benchmark datasets and on in-house generated datasets (18S rDNA sequences from 11 arbuscular mycorrhizal fungi (AMF) and 16S rDNA sequences of 40 bacterial isolates from plant interior). The results indicate that the fuzzy integral algorithm is an efficient and feasible alignment-free method for sequence analysis on the genomic scale.
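Illustrative sketch (not taken from the paper): the first step described in this abstract, estimating first-order Markov chain parameters from the frequencies of adjacent nucleotide pairs, could look roughly like the following. The function name `transition_matrix`, the row-normalisation, and the handling of ambiguous bases are assumptions made for illustration; the fuzzy integral similarity step is not shown.

```python
def transition_matrix(seq, alphabet="ACGT"):
    """Estimate a first-order Markov transition matrix from one DNA sequence.

    Hypothetical helper: entry [i][j] is the fraction of occurrences of
    alphabet[i] that are immediately followed by alphabet[j].
    """
    seq = seq.upper()
    index = {base: i for i, base in enumerate(alphabet)}
    n = len(alphabet)
    counts = [[0] * n for _ in range(n)]

    # Count every adjacent nucleotide pair; pairs containing symbols
    # outside the alphabet (e.g. N) are skipped (an assumption).
    for a, b in zip(seq, seq[1:]):
        if a in index and b in index:
            counts[index[a]][index[b]] += 1

    # Normalise each row so that it sums to 1 (or stays all-zero if the
    # nucleotide never occurs as the first element of a pair).
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix
```

Each sequence is thereby reduced to a fixed 4 x 4 parameter table regardless of its length, which is what makes pairwise comparison possible without alignment.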
Affiliation(s)
- Ajay Kumar Saw
- Institute of Advanced Study in Science and Technology, Mathematical Sciences Division, Guwahati, 781035, India
| | - Garima Raj
- Institute of Advanced Study in Science and Technology, Life Science Division, Guwahati, 781035, India
| | - Manashi Das
- Institute of Advanced Study in Science and Technology, Life Science Division, Guwahati, 781035, India
| | - Narayan Chandra Talukdar
- Institute of Advanced Study in Science and Technology, Life Science Division, Guwahati, 781035, India
| | | | - Soumyadeep Nandi
- Institute of Advanced Study in Science and Technology, Life Science Division, Guwahati, 781035, India.
| |
|
22
|
An open-source k-mer based machine learning tool for fast and accurate subtyping of HIV-1 genomes. PLoS One 2018; 13:e0206409. [PMID: 30427878 PMCID: PMC6235296 DOI: 10.1371/journal.pone.0206409] [Citation(s) in RCA: 36] [Impact Index Per Article: 6.0] [Received: 07/24/2018] [Accepted: 10/14/2018] [Indexed: 01/11/2023]
Abstract
For many disease-causing virus species, global diversity is clustered into a taxonomy of subtypes with clinical significance. In particular, the classification of infections among the subtypes of human immunodeficiency virus type 1 (HIV-1) is a routine component of clinical management, and there are now many classification algorithms available for this purpose. Although several of these algorithms are similar in accuracy and speed, the majority are proprietary and require laboratories to transmit HIV-1 sequence data over the network to remote servers. This potentially exposes sensitive patient data to unauthorized access, and makes it impossible to determine how classifications are made and to maintain the data provenance of clinical bioinformatic workflows. We propose an open-source supervised and alignment-free subtyping method (Kameris) that operates on k-mer frequencies in HIV-1 sequences. We performed a detailed study of the accuracy and performance of subtype classification in comparison to four state-of-the-art programs. Based on our testing data set of manually curated real-world HIV-1 sequences (n = 2,784), Kameris obtained an overall accuracy of 97%, which matches or exceeds all other tested software, with a processing rate of over 1,500 sequences per second. Furthermore, our fully standalone general-purpose software provides key advantages in terms of data security and privacy, transparency and reproducibility. Finally, we show that our method is readily adaptable to subtype classification of other viruses including dengue, influenza A, and hepatitis B and C virus.
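Illustrative sketch (not the Kameris implementation): a k-mer frequency vector of the kind this abstract refers to can be computed in a few lines. The function name `kmer_frequencies`, the default k = 3, and the choice to ignore k-mers containing ambiguous bases are assumptions made for illustration only.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=3, alphabet="ACGT"):
    """Return the normalised k-mer frequency vector of a sequence.

    Hypothetical helper: the vector has len(alphabet)**k entries, ordered
    lexicographically by k-mer (AAA, AAC, AAG, ...).
    """
    seq = seq.upper()
    # Slide a window of width k over the sequence and count each k-mer.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

    # Enumerate all possible k-mers so every sequence maps to a vector of
    # the same dimension; k-mers with other symbols (e.g. N) are ignored.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]
```

Vectors of this form have a fixed length for a given k, so they can be passed directly to a standard supervised classifier as features.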
|
23
|
Alignment-based and alignment-free methods converge with experimental data on amino acids coded by stop codons at split between nuclear and mitochondrial genetic codes. Biosystems 2018; 167:33-46. [DOI: 10.1016/j.biosystems.2018.03.002] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Received: 01/15/2018] [Revised: 03/18/2018] [Accepted: 03/19/2018] [Indexed: 12/11/2022]
|
24
|
Abstract
With the sharp increase in biological sequence data, traditional sequence alignment methods are becoming unsuitable and infeasible. This has motivated a surge of fast alignment-free techniques for sequence analysis. Among these methods, many kinds of feature vector methods have been established and applied to the reconstruction of species phylogeny. The vectors basically consist of some typical numerical features for certain biological problems. The features may come from the primary sequences, or from the secondary or three-dimensional structures of macromolecules. In this study, we propose a novel numerical vector based only on the primary sequences of organisms to build their phylogeny. Three chemical and physical properties of primary sequences (purine, pyrimidine and keto) are also incorporated into the vector. Using each property, we convert the nucleotide sequence into a new sequence consisting of only two kinds of letters. Therefore, three sequences are constructed according to the three properties. For each letter of each sequence, we calculate the number of occurrences of the letter, the average position of the letter and the variation of the position of the letter in the sequence. Tested on several datasets related to mammals, viruses and bacteria, this new tool is fast and accurate for inferring the phylogeny of organisms.
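Illustrative sketch (not taken from the paper): the feature vector described in this abstract can be approximated as below. The abstract does not spell out the exact two-letter groupings, so the three nucleotide groups used here ("AG", "AC", "CG"), the 1-based positions, the population variance, and the function names are all assumptions made for illustration.

```python
def binary_features(seq, group):
    """Features of one two-letter recoding of a nucleotide sequence.

    Hypothetical helper: bases in `group` form one letter, all other bases
    the second letter. For each letter, return its count, the mean of its
    1-based positions, and the population variance of those positions.
    """
    seq = seq.upper()
    positions = {True: [], False: []}
    for i, base in enumerate(seq, start=1):
        positions[base in group].append(i)

    features = []
    for letter in (True, False):
        pos = positions[letter]
        n = len(pos)
        if n == 0:
            # Letter absent from the sequence: emit zeros (an assumption).
            features.extend([0, 0.0, 0.0])
            continue
        mean = sum(pos) / n
        variance = sum((p - mean) ** 2 for p in pos) / n
        features.extend([n, mean, variance])
    return features


def feature_vector(seq, groups=("AG", "AC", "CG")):
    """Concatenate the features of three recodings into one 18-entry vector."""
    vector = []
    for group in groups:
        vector.extend(binary_features(seq, set(group)))
    return vector
```

Under these assumptions every sequence maps to 3 recodings x 2 letters x 3 statistics = 18 numbers, and a distance between such vectors can then feed a distance-based tree-building method.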
|
25
|
Waman VP, Kale MM, Kulkarni-Kale U. Genetic diversity and evolution of dengue virus serotype 3: A comparative genomics study. Infection, Genetics and Evolution 2017; 49:234-240. [PMID: 28126562 DOI: 10.1016/j.meegid.2017.01.022] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Received: 11/03/2016] [Revised: 01/14/2017] [Accepted: 01/21/2017] [Indexed: 11/29/2022]
Abstract
Dengue virus serotype 3 (DENV-3), one of the four serotypes of Dengue viruses, is geographically diverse. There are five distinct genotypes (I-V) of DENV-3. Emerging strains and lineages of DENV-3 are increasingly being reported. The availability of genomic data for DENV-3 strains provides an opportunity to study its population structure. Complete genome sequences are available for 860 strains of four genotypes (I, II, III and V) isolated worldwide and were analyzed using population genetics and evolutionary approaches to map the landscape of genomic diversity. The DENV-3 population is observed to be stratified into five major subpopulations. Genotypes I and II formed independent subpopulations, while genotype III is subdivided into three subpopulations (GIII-a, GIII-b and GIII-c) and is therefore heterogeneous. The genotype I, II and GIII-a subpopulations comprise Asian strains, whereas GIII-c comprises American strains. The GIII-b subpopulation consists mainly of American strains along with a few strains from Sri Lanka. Genetic admixture is predominantly observed in Sri Lankan strains of genotype III and all strains of genotype V. Inter-genotype recombination was observed to occur in the non-structural region of several Asian strains, whereas the extent of recombination was limited in American strains. Significant positive selection was found to be operational on all genes and observed to be the main driving force of genetic diversity. Positive selection was strongly operational on the branches leading to Asian genotypes and helped to delineate the genetic differences between Asian and American lineages. Thus, inter-genotype recombination, migration and adaptive evolution are the major determinants of evolution of DENV-3.
Affiliation(s)
- Vaishali P Waman
- Bioinformatics Centre, Savitribai Phule Pune University (formerly University of Pune), Pune 411007, Maharashtra, India
| | - Mohan M Kale
- Department of Statistics, Savitribai Phule Pune University (formerly University of Pune), Pune 411007, Maharashtra, India
| | - Urmila Kulkarni-Kale
- Bioinformatics Centre, Savitribai Phule Pune University (formerly University of Pune), Pune 411007, Maharashtra, India.
| |
|
26
|
Waman VP, Kolekar P, Ramtirthkar MR, Kale MM, Kulkarni-Kale U. Analysis of genotype diversity and evolution of Dengue virus serotype 2 using complete genomes. PeerJ 2016; 4:e2326. [PMID: 27635316 PMCID: PMC5012332 DOI: 10.7717/peerj.2326] [Citation(s) in RCA: 44] [Impact Index Per Article: 5.5] [Received: 06/02/2016] [Accepted: 07/14/2016] [Indexed: 01/09/2023] Open
Abstract
BACKGROUND Dengue is one of the most common arboviral diseases prevalent worldwide and is caused by Dengue viruses (genus Flavivirus, family Flaviviridae). There are four serotypes of Dengue Virus (DENV-1 to DENV-4), each of which is further subdivided into distinct genotypes. DENV-2 is frequently associated with severe dengue infections and epidemics. DENV-2 consists of six genotypes such as Asian/American, Asian I, Asian II, Cosmopolitan, American and sylvatic. Comparative genomic study was carried out to infer population structure of DENV-2 and to analyze the role of evolutionary and spatiotemporal factors in emergence of diversifying lineages. METHODS Complete genome sequences of 990 strains of DENV-2 were analyzed using Bayesian-based population genetics and phylogenetic approaches to infer genetically distinct lineages. The role of spatiotemporal factors, genetic recombination and selection pressure in the evolution of DENV-2 is examined using the sequence-based bioinformatics approaches. RESULTS DENV-2 genetic structure is complex and consists of fifteen subpopulations/lineages. The Asian/American genotype is observed to be diversified into seven lineages. The Asian I, Cosmopolitan and sylvatic genotypes were found to be subdivided into two lineages, each. The populations of American and Asian II genotypes were observed to be homogeneous. Significant evidence of episodic positive selection was observed in all the genes, except NS4A. Positive selection operational on a few codons in envelope gene confers antigenic and lineage diversity in the American strains of Asian/American genotype. Selection on codons of non-structural genes was observed to impact diversification of lineages in Asian I, cosmopolitan and sylvatic genotypes. Evidence of intra/inter-genotype recombination was obtained and the uncertainty in classification of recombinant strains was resolved using the population genetics approach. 
DISCUSSION Complete genome-based analysis revealed that the worldwide population of DENV-2 strains is subdivided into fifteen lineages. The population structure of DENV-2 is spatiotemporal and is shaped by episodic positive selection and recombination. Intra-genotype diversity was observed in four genotypes (Asian/American, Asian I, cosmopolitan and sylvatic). Episodic positive selection on envelope and non-structural genes translates into antigenic diversity and appears to be responsible for emergence of strains/lineages in DENV-2 genotypes. Understanding of the genotype diversity and emerging lineages will be useful to design strategies for epidemiological surveillance and vaccine design.
Affiliation(s)
- Vaishali P. Waman
- Bioinformatics Centre, Savitribai Phule Pune University (formerly University of Pune), Pune, Maharashtra, India
| | - Pandurang Kolekar
- Bioinformatics Centre, Savitribai Phule Pune University (formerly University of Pune), Pune, Maharashtra, India
| | - Mukund R. Ramtirthkar
- Department of Statistics, Savitribai Phule Pune University (formerly University of Pune), Pune, Maharashtra, India
| | - Mohan M. Kale
- Department of Statistics, Savitribai Phule Pune University (formerly University of Pune), Pune, Maharashtra, India
| | - Urmila Kulkarni-Kale
- Bioinformatics Centre, Savitribai Phule Pune University (formerly University of Pune), Pune, Maharashtra, India
| |
|
27
|
Waman VP, Kasibhatla SM, Kale MM, Kulkarni-Kale U. Population genomics of dengue virus serotype 4: insights into genetic structure and evolution. Arch Virol 2016; 161:2133-48. [DOI: 10.1007/s00705-016-2886-8] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 11/27/2015] [Accepted: 05/02/2016] [Indexed: 12/30/2022]
|
28
|
Kolekar PS, Waman VP, Kale MM, Kulkarni-Kale U. RV-Typer: A Web Server for Typing of Rhinoviruses Using Alignment-Free Approach. PLoS One 2016; 11:e0149350. [PMID: 26870949 PMCID: PMC4752186 DOI: 10.1371/journal.pone.0149350] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 10/27/2015] [Accepted: 01/29/2016] [Indexed: 11/24/2022] Open
Abstract
Rhinoviruses (RV) are increasingly being reported to cause mild to severe infections of respiratory tract in humans. RV are antigenically the most diverse species of the genus Enterovirus and family Picornaviridae. There are three species of RV (RV-A, -B and -C), with 80, 32 and 55 serotypes/types, respectively. Antigenic variation is the main limiting factor for development of a cross-protective vaccine against RV. Serotyping of Rhinoviruses is carried out using cross-neutralization assays in cell culture. However, these assays become laborious and time-consuming for the large number of strains. Alternatively, serotyping of RV is carried out by alignment-based phylogeny of both protein and nucleotide sequences of VP1. However, serotyping of RV based on alignment-based phylogeny is a multi-step process, which needs to be repeated every time a new isolate is sequenced. In view of the growing need for serotyping of RV, an alignment-free method based on "return time distribution" (RTD) of amino acid residues in VP1 protein has been developed and implemented in the form of a web server titled RV-Typer. RV-Typer accepts nucleotide or protein sequences as an input and computes return times of di-peptides (k = 2) to assign serotypes. The RV-Typer performs with 100% sensitivity and specificity. It is significantly faster than alignment-based methods. The web server is available at http://bioinfo.net.in/RV-Typer/home.html.
Affiliation(s)
- Pandurang S. Kolekar
- Bioinformatics Centre, Savitribai Phule Pune University (formerly University of Pune), Pune, 411 007, India
| | - Vaishali P. Waman
- Bioinformatics Centre, Savitribai Phule Pune University (formerly University of Pune), Pune, 411 007, India
| | - Mohan M. Kale
- Department of Statistics, Savitribai Phule Pune University (formerly University of Pune), Pune, 411 007, India
| | - Urmila Kulkarni-Kale
- Bioinformatics Centre, Savitribai Phule Pune University (formerly University of Pune), Pune, 411 007, India
| |
|
29
|
PCV: An Alignment Free Method for Finding Homologous Nucleotide Sequences and its Application in Phylogenetic Study. Interdiscip Sci 2016; 9:173-183. [PMID: 26825665 DOI: 10.1007/s12539-015-0136-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 08/07/2015] [Revised: 11/03/2015] [Accepted: 12/15/2015] [Indexed: 10/22/2022]
Abstract
Online retrieval of the homologous nucleotide sequences through existing alignment techniques is a common practice against the given database of sequences. The salient point of these techniques is their dependence on local alignment techniques and scoring matrices the reliability of which is limited by computational complexity and accuracy. Toward this direction, this work offers a novel way for numerical representation of genes which can further help in dividing the data space into smaller partitions helping formation of a search tree. In this context, this paper introduces a 36-dimensional Periodicity Count Value (PCV) which is representative of a particular nucleotide sequence and created through adaptation from the concept of stochastic model of Kolekar et al. (American Institute of Physics 1298:307-312, 2010. doi: 10.1063/1.3516320 ). The PCV construct uses information on physicochemical properties of nucleotides and their positional distribution pattern within a gene. It is observed that PCV representation of gene reduces computational cost in the calculation of distances between a pair of genes while being consistent with the existing methods. The validity of PCV-based method was further tested through their use in molecular phylogeny constructs in comparison with that using existing sequence alignment methods.
|
30
|
Progressive alignment of genomic signals by multiple dynamic time warping. J Theor Biol 2015; 385:20-30. [PMID: 26300069 DOI: 10.1016/j.jtbi.2015.08.007] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Received: 11/17/2014] [Revised: 07/21/2015] [Accepted: 08/03/2015] [Indexed: 11/22/2022]
Abstract
This paper presents the utilization of progressive alignment principle for positional adjustment of a set of genomic signals with different lengths. The new method of multiple alignment of signals based on dynamic time warping is tested for the purpose of evaluating the similarity of different length genes in phylogenetic studies. Two sets of phylogenetic markers were used to demonstrate the effectiveness of the evaluation of intraspecies and interspecies genetic variability. The part of the proposed method is modification of pairwise alignment of two signals by dynamic time warping with using correlation in a sliding window. The correlation based dynamic time warping allows more accurate alignment dependent on local homologies in sequences without the need of scoring matrix or evolutionary models, because mutual similarities of residues are included in the numerical code of signals.
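For readers unfamiliar with dynamic time warping, the sketch below shows only the classic pairwise form on two numeric signals, using the absolute difference as local cost. It is not the correlation-based, sliding-window variant or the progressive multiple alignment proposed in the paper; the function name `dtw_distance` is an assumption for this example.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping cost between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: step in a, step in b, step in both.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

Because one sample may be matched to several samples of the other signal, sequences of different lengths can be compared directly, which is the property the abstract relies on for genes of different lengths.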
|
31
|
Sedlar K, Skutkova H, Vitek M, Provaznik I. Set of rules for genomic signal downsampling. Comput Biol Med 2015; 69:308-14. [PMID: 26078051 DOI: 10.1016/j.compbiomed.2015.05.022] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 11/10/2014] [Revised: 05/25/2015] [Accepted: 05/26/2015] [Indexed: 12/14/2022]
Abstract
Comparison and classification of organisms based on molecular data is an important task of computational biology, since at least parts of DNA sequences for many organisms are available. Unfortunately, methods for comparison are computationally very demanding, suitable only for short sequences. In this paper, we focus on the redundancy of genetic information stored in DNA sequences. We proposed rules for downsampling of DNA signals of cumulated phase. According to the length of an original sequence, we are able to significantly reduce the amount of data with only slight loss of original information. Dyadic wavelet transform was chosen for fast downsampling with minimum influence on signal shape carrying the biological information. We proved the usability of such new short signals by measuring percentage deviation of pairs of original and downsampled signals while maintaining spectral power of signals. Minimal loss of biological information was proved by measuring the Robinson-Foulds distance between pairs of phylogenetic trees reconstructed from the original and downsampled signals. The preservation of inter-species and intra-species information makes these signals suitable for fast sequence identification as well as for more detailed phylogeny reconstruction.
Affiliation(s)
- Karel Sedlar
- Department of Biomedical Engineering, Brno University of Technology, Technicka 12, 616 00 Brno, Czech Republic.
| | - Helena Skutkova
- Department of Biomedical Engineering, Brno University of Technology, Technicka 12, 616 00 Brno, Czech Republic.
| | - Martin Vitek
- International Clinical Research Center - Center of Biomedical Engineering, St. Anne's University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic.
| | - Ivo Provaznik
- Department of Biomedical Engineering, Brno University of Technology, Technicka 12, 616 00 Brno, Czech Republic; International Clinical Research Center - Center of Biomedical Engineering, St. Anne's University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic.
| |
|
32
|
Kubicova V, Provaznik I. Use of whole genome DNA spectrograms in bacterial classification. Comput Biol Med 2015; 69:298-307. [PMID: 26004007 DOI: 10.1016/j.compbiomed.2015.04.038] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 11/14/2014] [Revised: 04/03/2015] [Accepted: 04/29/2015] [Indexed: 12/16/2022]
Abstract
A spectrogram reflects the arrangement of nucleotides through the whole chromosome or genome. Our previous study suggested that the spectrogram of whole genome DNA sequences is a suitable tool for the determination of relationships among bacteria. Related bacteria have similar spectrograms, and similarity in spectrograms was measured using a color layout descriptor. Several parameters, such as the mapping of four bases into a spectrogram, the number of considered elements in the color layout descriptor, the color model of the image and the building tree method, can be changed. This study addresses the use of parameter selection to ensure the best classification results. The quality of the classification was measured by the Matthews correlation coefficient (MCC). The proposed method with optimal parameters (called SpectCMP-Spectrogram CoMParison method) achieved an average MCC of 0.73 at the phylum level. The SpectCMP method was also tested at the order level; the average MCC in the classification of class Gammaproteobacteria was 0.76. The success of a classification with respect to the correct phyla was compared to three methods that are used in bacterial phylogeny: the CVTree method, OGTree method and moment vector method. The results show that the SpectCMP method can be used in bacterial classification at various taxonomic levels.
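Since the abstract reports its results as MCC values, a short reminder of how the binary MCC is computed from a confusion matrix may help. This is the standard textbook formula; how the paper averages it over phyla or orders is not stated in the abstract, and returning 0.0 when the denominator vanishes is a convention assumed for this sketch.

```python
from math import sqrt


def mcc(tp, tn, fp, fn):
    """Binary Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: an empty row or column yields 0.0 instead of an error.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

The value ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction), so the reported averages of 0.73 and 0.76 indicate substantial but imperfect agreement.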
Affiliation(s)
- Vladimira Kubicova
- Department of Biomedical Engineering, Brno University of Technology, Technicka 12, Brno 61600, Czech Republic.
| | - Ivo Provaznik
- Department of Biomedical Engineering, Brno University of Technology, Technicka 12, Brno 61600, Czech Republic; International Clinical Research Center-Center of Biomedical Engineering, St. Anne's University Hospital Brno, Pekarska 53, Brno 65691, Czech Republic
| |
|
33
|
Xie XH, Yu ZG, Han GS, Yang WF, Anh V. Whole-proteome based phylogenetic tree construction with inter-amino-acid distances and the conditional geometric distribution profiles. Mol Phylogenet Evol 2015; 89:37-45. [PMID: 25882834 DOI: 10.1016/j.ympev.2015.04.008] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 09/18/2014] [Revised: 03/29/2015] [Accepted: 04/06/2015] [Indexed: 11/18/2022]
Abstract
There has been a growing interest in alignment-free methods for whole genome comparison and phylogenomic studies. In this study, we propose an alignment-free method for phylogenetic tree construction using whole-proteome sequences. Based on the inter-amino-acid distances, we first convert the whole-proteome sequences into inter-amino-acid distance vectors, which are called observed inter-amino-acid distance profiles. Then, we propose to use conditional geometric distribution profiles (the distributions of sequences where the amino acids are placed randomly and independently) as the reference distribution profiles. Last the relative deviation between the observed and reference distribution profiles is used to define a simple metric that reflects the phylogenetic relationships between whole-proteome sequences of different organisms. We name our method inter-amino-acid distances and conditional geometric distribution profiles (IAGDP). We evaluate our method on two data sets: the benchmark dataset including 29 genomes used in previous published papers, and another one including 67 mammal genomes. Our results demonstrate that the new method is useful and efficient.
Affiliation(s)
- Xian-Hua Xie
- Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, PR China; School of Mathematics and Computer Science, Gannan Normal University, Jiangxi 341000, PR China.
| | - Zu-Guo Yu
- Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, PR China; School of Mathematical Sciences, Queensland University of Technology, GPO Box 2434, Brisbane, QLD 4001, Australia.
| | - Guo-Sheng Han
- Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, PR China.
| | - Wei-Feng Yang
- Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, PR China.
| | - Vo Anh
- School of Mathematical Sciences, Queensland University of Technology, GPO Box 2434, Brisbane, QLD 4001, Australia.
| |
|
34
|
A novel k-word relative measure for sequence comparison. Comput Biol Chem 2014; 53PB:331-338. [PMID: 25462340 DOI: 10.1016/j.compbiolchem.2014.10.007] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Received: 04/26/2014] [Revised: 08/10/2014] [Accepted: 10/25/2014] [Indexed: 12/28/2022]
Abstract
In order to extract phylogenetic information from DNA sequences, a new normalized k-word average relative distance is proposed in this paper. The proposed measure was tested by discriminant analysis and phylogenetic analysis. Phylogenetic trees based on the Manhattan distance measure are reconstructed with k ranging from 1 to 12. In addition, a new method is suggested to reduce the matrix dimension, which can greatly reduce the amount of calculation and the running time. The experimental assessment demonstrated that our measure is efficient. Moreover, comparison with the results of other methods shows that our method is feasible and powerful for phylogenetic analysis.
|
35
|
Gupta S, Chavan S, Deobagkar DN, Deobagkar DD. Bio/chemoinformatics in India: an outlook. Brief Bioinform 2014; 16:710-31. [PMID: 25159593 DOI: 10.1093/bib/bbu028] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 05/21/2014] [Accepted: 07/28/2014] [Indexed: 12/25/2022] Open
Abstract
With the advent of significant establishment and development of Internet facilities and computational infrastructure, an overview of bio/chemoinformatics is presented along with its multidisciplinary facts, promises and challenges. The Government of India has paved the way for more profound research in the biological field with the use of computational facilities and schemes/projects to collaborate with scientists from different disciplines. Simultaneously, the growth of available biomedical data has provided fresh insight into the nature of redundant and compensatory data. Today, bioinformatics research in India is characterized by powerful grid computing systems, a great variety of biological questions addressed and close collaborations between scientists and clinicians, with a full spectrum of focuses ranging from database building and methods development to biological discoveries. In fact, this outlook provides a resourceful platform highlighting the funding agencies, institutes and industries working in this direction, which would certainly be of great help to students seeking a career in bioinformatics. Thus, in short, this review highlights the current bio/chemoinformatics trends, education, status, diverse applicability and demands for further development.
|
36
|
Leimeister CA, Boden M, Horwege S, Lindner S, Morgenstern B. Fast alignment-free sequence comparison using spaced-word frequencies. Bioinformatics 2014; 30:1991-9. [PMID: 24700317 PMCID: PMC4080745 DOI: 10.1093/bioinformatics/btu177] [Citation(s) in RCA: 105] [Impact Index Per Article: 10.5] [Indexed: 11/28/2022]
Abstract
Motivation: Alignment-free methods for sequence comparison are increasingly used for genome analysis and phylogeny reconstruction; they circumvent various difficulties of traditional alignment-based approaches. In particular, alignment-free methods are much faster than pairwise or multiple alignments. They are, however, less accurate than methods based on sequence alignment. Most alignment-free approaches work by comparing the word composition of sequences. A well-known problem with these methods is that neighbouring word matches are far from independent. Results: To reduce the statistical dependency between adjacent word matches, we propose to use ‘spaced words’, defined by patterns of ‘match’ and ‘don’t care’ positions, for alignment-free sequence comparison. We describe a fast implementation of this approach using recursive hashing and bit operations, and we show that further improvements can be achieved by using multiple patterns instead of single patterns. To evaluate our approach, we use spaced-word frequencies as a basis for fast phylogeny reconstruction. Using real-world and simulated sequence data, we demonstrate that our multiple-pattern approach produces better phylogenies than approaches relying on contiguous words. Availability and implementation: Our program is freely available at http://spaced.gobics.de/. Contact: chris.leimeister@stud.uni-goettingen.de Supplementary information: Supplementary data are available at Bioinformatics online.
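The idea of a spaced word can be made concrete with a small sketch. Here a pattern is written as a string of `1` (match) and `0` (don't care) characters, and only the characters at match positions are kept from each window. This is a plain string-slicing illustration with a single pattern, not the recursive-hashing, bit-operation implementation described in the abstract; the function name and the pattern encoding are assumptions.

```python
from collections import Counter


def spaced_word_counts(seq, pattern="1101"):
    """Count spaced words of `seq` for one match/don't-care pattern."""
    # Offsets inside the window that contribute to the spaced word.
    match_idx = [i for i, c in enumerate(pattern) if c == "1"]
    span = len(pattern)
    seq = seq.upper()
    return Counter(
        "".join(seq[start + i] for i in match_idx)
        for start in range(len(seq) - span + 1)
    )


counts = spaced_word_counts("ACGTACGT", pattern="101")
```

With the pattern `101`, the windows `ACG` and `AGG` would both yield the spaced word `AG`, so a single substitution at a don't-care position does not change the count. The multiple-pattern approach in the abstract would repeat this for several patterns and combine the resulting frequencies.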
Affiliation(s)
- Chris-Andre Leimeister
- Department of Bioinformatics, University of Göttingen, Institute of Microbiology and Genetics, 37073 Göttingen, Germany and Université d'Évry Val d'Essonne, Laboratoire Statistique et Génome, UMR CNRS 8071, USC INRA, 91037 Évry, France
| | - Marcus Boden
- Department of Bioinformatics, University of Göttingen, Institute of Microbiology and Genetics, 37073 Göttingen, Germany and Université d'Évry Val d'Essonne, Laboratoire Statistique et Génome, UMR CNRS 8071, USC INRA, 91037 Évry, France
| | - Sebastian Horwege
- Department of Bioinformatics, University of Göttingen, Institute of Microbiology and Genetics, 37073 Göttingen, Germany and Université d'Évry Val d'Essonne, Laboratoire Statistique et Génome, UMR CNRS 8071, USC INRA, 91037 Évry, France
| | - Sebastian Lindner
- Department of Bioinformatics, University of Göttingen, Institute of Microbiology and Genetics, 37073 Göttingen, Germany and Université d'Évry Val d'Essonne, Laboratoire Statistique et Génome, UMR CNRS 8071, USC INRA, 91037 Évry, France
| | - Burkhard Morgenstern
- Department of Bioinformatics, University of Göttingen, Institute of Microbiology and Genetics, 37073 Göttingen, Germany and Université d'Évry Val d'Essonne, Laboratoire Statistique et Génome, UMR CNRS 8071, USC INRA, 91037 Évry, France
| |
|
37
|
Kolekar P, Hake N, Kale M, Kulkarni-Kale U. WNV Typer: a server for genotyping of West Nile viruses using an alignment-free method based on a return time distribution. J Virol Methods 2014; 198:41-55. [PMID: 24388930 DOI: 10.1016/j.jviromet.2013.12.012] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 07/16/2013] [Revised: 11/27/2013] [Accepted: 12/17/2013] [Indexed: 01/20/2023]
Abstract
West Nile virus (WNV), genus Flavivirus, family Flaviviridae, is a major cause of viral encephalitis with broad host range and global spread. The virus has undergone a series of evolutionary changes with emergence of various genotypic lineages that are known to differ in type and severity of the diseases caused. Currently, genotyping is carried out using molecular phylogeny of complete coding sequences and genotype is assigned based on proximity to reference genotypes in tree topology. Efficient epidemiological surveillance of WNVs demands development of objective criteria for typing. An alignment-free approach based on return time distribution (RTD) of k-mers has been validated for genotyping of WNVs. The RTDs of complete genome sequences at k=7 were found to be optimum for classification of the known lineages of WNVs as well as for genotyping. It provides a time- and computationally efficient alternative for genome-based annotation of WNV lineages. The development of a WNV Typer server based on RTD is described (http://bioinfo.net.in/wnv/homepage.html). Both the method and the server have 100% sensitivity and specificity.
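To make the notion of a return time concrete, the sketch below records, for each k-mer, the gaps between its successive occurrences and summarizes them by their mean and standard deviation. The abstract does not say how the RTD is summarized or how gaps are measured, so both choices here (mean and population standard deviation, gaps as differences between start positions) are assumptions, as is the function name `return_time_features`.

```python
from collections import defaultdict
from statistics import mean, pstdev


def return_time_features(seq, k=2):
    """Map each recurring k-mer to (mean, std) of its return times."""
    positions = defaultdict(list)
    seq = seq.upper()
    # Record the start position of every overlapping k-mer.
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    features = {}
    for kmer, pos in positions.items():
        # Return time = distance between consecutive occurrences (assumed definition).
        gaps = [b - a for a, b in zip(pos, pos[1:])]
        if gaps:  # k-mers seen only once have no return time
            features[kmer] = (mean(gaps), pstdev(gaps))
    return features
```

For genotyping as described in the abstract, k would be set to 7 and the per-k-mer summaries of a query genome would be compared with those of reference genotypes; the distance and decision rule used by WNV Typer are not given in the abstract.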
Affiliation(s)
| | - Nilesh Hake
- Bioinformatics Centre, University of Pune, Pune 411007, India
| | - Mohan Kale
- Department of Statistics, University of Pune, Pune 411007, India.
| | | |
|
38
|
Abstract
Phylogenetic analysis based on alignment method meets huge challenges when dealing with whole-genome sequences, for example, recombination, shuffling, and rearrangement of sequences. Thus, various alignment-free methods for phylogeny construction have been proposed. However, most of these methods have not been implemented as tools or web servers. Researchers cannot use these methods easily with their data sets. To facilitate the usage of various alignment-free methods, we implemented most of the popular alignment-free methods and constructed a user-friendly web server for alignment-free genome phylogeny (AGP). AGP integrated the phylogenetic tree construction, visualization, and comparison functions together. Both AGP and all source code of the methods are available at http://www.herbbol.org:8000/agp (last accessed February 26, 2013). AGP will facilitate research in the field of whole-genome phylogeny and comparison.
|
39
|
Tretyakov K, Goldberg T, Jin VX, Horton P. Summary of talks and papers at ISCB-Asia/SCCG 2012. BMC Genomics 2013. [PMCID: PMC3639071 DOI: 10.1186/1471-2164-14-s2-i1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2022] Open
Abstract
The second ISCB-Asia conference of the International Society for Computational Biology took place December 17-19, 2012, in Shenzhen, China. The conference was co-hosted by BGI as the first Shenzhen Conference on Computational Genomics (SCCG).
45 talks were presented at ISCB-Asia/SCCG 2012. The topics covered included software tools, reproducible computing, next-generation sequencing data analysis, transcription and mRNA regulation, protein structure and function, cancer genomics and personalized medicine. Nine of the proceedings track talks are included as full papers in this supplement.
In this report we first give a short overview of the conference by listing some statistics and visualizing the talk abstracts as word clouds. Then we group the talks by topic and briefly summarize each one, providing references to related publications whenever possible. Finally, we close with a few comments on the success of this conference.
|