1
Wang T, Yu ZG, Li J. CGRWDL: alignment-free phylogeny reconstruction method for viruses based on chaos game representation weighted by dynamical language model. Front Microbiol 2024; 15:1339156. [PMID: 38572227 PMCID: PMC10987876 DOI: 10.3389/fmicb.2024.1339156] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2023] [Accepted: 02/23/2024] [Indexed: 04/05/2024]
Abstract
Traditional alignment-based methods face serious challenges in genome sequence comparison and phylogeny reconstruction because of their high computational complexity. Here, we propose a new alignment-free method to analyze the phylogenetic relationships (classification) among species. In our method, the dynamical language (DL) model and the chaos game representation (CGR) method are used to characterize the frequency information and the context information of k-mers in a sequence, respectively. For each DNA or protein sequence in a dataset, our method converts the sequence into a feature vector, based on CGR weighted by the DL model, that represents the sequence information and is used to infer phylogenetic relationships. We name our method CGRWDL. Its performance was tested by constructing phylogenetic trees from both DNA and protein sequences in 8 virus datasets. For each dataset, we compared the Robinson-Foulds (RF) distance between the tree constructed by CGRWDL and the reference tree with the corresponding distances obtained by other advanced methods. The results show that the phylogenetic trees constructed by CGRWDL can accurately classify the viruses, and that the RF distances between these trees and the reference trees are smaller than those obtained with other methods.
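As a rough illustration of the two ingredients this abstract names, the sketch below computes standard chaos game representation points for a DNA sequence and plain k-mer frequencies. It is not the CGRWDL weighting itself, which the abstract does not specify. The corner assignment and the function names `cgr_points` and `kmer_frequencies` are assumptions made for this example.

```python
from collections import Counter

# Corner assignment is one common convention, assumed here (not taken from the paper).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the chaos game representation points of a DNA sequence.

    Starting from the centre of the unit square, each base moves the
    current point halfway towards that base's corner.
    """
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def kmer_frequencies(seq, k):
    """Return relative frequencies of the k-mers made only of A, C, G, T."""
    seq = seq.upper()
    kmers = (seq[i:i + k] for i in range(len(seq) - k + 1))
    counts = Counter(km for km in kmers if all(b in CORNERS for b in km))
    total = sum(counts.values())
    return {km: c / total for km, c in counts.items()}
```

For example, `cgr_points("AG")` first moves halfway towards the A corner and then halfway towards the G corner, while `kmer_frequencies("ACGT", 2)` assigns equal weight to `AC`, `CG` and `GT`.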
Affiliation(s)
- Ting Wang
  - National Center for Applied Mathematics in Hunan, Xiangtan University, Xiangtan, Hunan, China
  - Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan, China
- Zu-Guo Yu
  - National Center for Applied Mathematics in Hunan, Xiangtan University, Xiangtan, Hunan, China
  - Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan, China
- Jinyan Li
  - School of Computer Science and Control Engineering, Shenzhen Institute of Advanced Technology, Shenzhen, Guangdong, China
2
Tang R, Yu Z, Li J. KINN: An alignment-free accurate phylogeny reconstruction method based on inner distance distributions of k-mer pairs in biological sequences. Mol Phylogenet Evol 2023; 179:107662. [PMID: 36375789 DOI: 10.1016/j.ympev.2022.107662] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/30/2022] [Revised: 10/10/2022] [Accepted: 11/02/2022] [Indexed: 11/13/2022]
Abstract
Alignment-based methods are at a disadvantage in sequence comparison and phylogeny reconstruction because of their high computational complexity. Alignment-free methods for sequence comparison and phylogeny inference have attracted a great deal of attention in recent years. Here, we explore an alignment-free approach that uses inner distance distributions of k-mer pairs in biological sequences for phylogeny inference. For every sequence in a dataset, our method transforms the sequence into a numeric feature vector whose features each represent a specific k-mer pair's contribution to the characterization of the sequentiality uniqueness of the sequence. This newly defined k-mer pair's contribution integrates the reverse Kullback-Leibler divergence, the pseudo mode and the classic entropy of an inner distance distribution of the k-mer pair in the sequence. Our method has been tested on datasets of complete genome sequences, complete protein sequences, and rRNA gene sequences of various lengths. Our method achieves the best performance in comparison with state-of-the-art alignment-free methods, as measured by the Robinson-Foulds distance between the reference and the constructed phylogenetic trees.
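The Robinson-Foulds distance used as the evaluation measure here counts the bipartitions (splits) present in one tree but not in the other. The sketch below is a minimal, unnormalized version for trees written as nested tuples of leaf names and treated as unrooted; the tuple encoding and the function names are assumptions for illustration, not the tooling used in the paper.

```python
def leaf_set(tree):
    """Return the set of leaf names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Return the non-trivial splits of the tree, treated as unrooted.

    Each split is stored as the side that does NOT contain a fixed
    anchor leaf, so the same split always has the same representation.
    """
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if anchor in clade else clade
        # Keep only splits with at least two leaves on both sides.
        if 2 <= len(side) <= len(all_leaves) - 2:
            found.add(side)
        return clade

    visit(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Return the number of splits found in exactly one of the two trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must share the same leaf set")
    return len(splits(tree_a) ^ splits(tree_b))
```

For two four-leaf trees with different topologies, such as `(("A", "B"), ("C", "D"))` and `(("A", "C"), ("B", "D"))`, each tree has one split the other lacks, so the distance is 2; identical trees give 0.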
Affiliation(s)
- Runbin Tang
  - Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, China; School of Mathematical Sciences, Chongqing Normal University, Chongqing 401331, China
- Zuguo Yu
  - Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, China
- Jinyan Li
  - Data Science Institute, University of Technology Sydney, Ultimo, NSW 2007, Australia
3
Redrado S, Esteban P, Domingo MP, Lopez C, Rezusta A, Ramirez-Labrada A, Arias M, Pardo J, Galvez EM. Integration of In Silico and In Vitro Analysis of Gliotoxin Production Reveals a Narrow Range of Producing Fungal Species. J Fungi (Basel) 2022; 8:jof8040361. [PMID: 35448592 PMCID: PMC9030297 DOI: 10.3390/jof8040361] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 02/25/2022] [Revised: 03/28/2022] [Accepted: 03/29/2022] [Indexed: 02/06/2023]
Abstract
Gliotoxin is a fungal secondary metabolite with an impact on health and agriculture, since it might act as a virulence factor and contaminate human and animal food. Homologous gliotoxin (GT) gene clusters are spread across a number of fungal species, although whether these species produce GT or other related epipolythiodioxopiperazines (ETPs) remains unclear. Using bioinformatic tools, we have identified homologous gli gene clusters similar to the A. fumigatus GT gene cluster in several fungal species. The in silico study led to in vitro confirmation of GT and bisdethiobis(methylthio)gliotoxin (bmGT) production in fungal strain cultures by HPLC detection. Although we selected the most similar homologous gli gene clusters in 20 different species, GT and bmGT were only detected in section Fumigati species and in a Trichoderma virens Q strain. Our results suggest that in silico gli homology analyses in different fungal strains to predict GT production might only be informative when accompanied by analysis of mycotoxin production in cell cultures.
Affiliation(s)
- Sergio Redrado
  - Instituto de Carboquímica ICB-CSIC, 50018 Zaragoza, Spain; (S.R.); (M.P.D.)
- Patricia Esteban
  - Biomedical Research Centre of Aragon (CIBA), Fundacion Instituto de Investigacion Sanitaria Aragon (IIS Aragon), 50009 Zaragoza, Spain; (P.E.); (A.R.-L.); (M.A.); (J.P.)
- Concepción Lopez
  - Department of Microbiology, Hospital Universitario Miguel Servet, IIS Aragón, 50009 Zaragoza, Spain; (C.L.); (A.R.)
- Antonio Rezusta
  - Department of Microbiology, Hospital Universitario Miguel Servet, IIS Aragón, 50009 Zaragoza, Spain; (C.L.); (A.R.)
- Ariel Ramirez-Labrada
  - Biomedical Research Centre of Aragon (CIBA), Fundacion Instituto de Investigacion Sanitaria Aragon (IIS Aragon), 50009 Zaragoza, Spain; (P.E.); (A.R.-L.); (M.A.); (J.P.)
- Maykel Arias
  - Biomedical Research Centre of Aragon (CIBA), Fundacion Instituto de Investigacion Sanitaria Aragon (IIS Aragon), 50009 Zaragoza, Spain; (P.E.); (A.R.-L.); (M.A.); (J.P.)
- Julián Pardo
  - Biomedical Research Centre of Aragon (CIBA), Fundacion Instituto de Investigacion Sanitaria Aragon (IIS Aragon), 50009 Zaragoza, Spain; (P.E.); (A.R.-L.); (M.A.); (J.P.)
  - Department of Microbiology, Pediatrics, Radiology and Public Health, University of Zaragoza, 50009 Zaragoza, Spain
  - Aragon I+D Foundation (ARAID), 50018 Zaragoza, Spain
- Eva M. Galvez
  - Instituto de Carboquímica ICB-CSIC, 50018 Zaragoza, Spain; (S.R.); (M.P.D.)
  - Correspondence:
4
Phylogenetic Analysis of HIV-1 Genomes Based on the Position-Weighted K-mers Method. Entropy 2020; 22:e22020255. [PMID: 33286029 PMCID: PMC7516702 DOI: 10.3390/e22020255] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 01/17/2020] [Revised: 02/07/2020] [Accepted: 02/20/2020] [Indexed: 12/31/2022]
Abstract
HIV-1 viruses, which are predominant in the family of HIV viruses, have strong pathogenicity and infectivity. They can evolve into many different variants in a very short time. In this study, we propose a new and effective alignment-free method for the phylogenetic analysis of HIV-1 viruses using complete genome sequences. Our method combines the position distribution information and the counts of the k-mers together. We also propose a metric to determine the optimal k value. We name our method the Position-Weighted k-mers (PWkmer) method. Validation and comparison with the Robinson-Foulds distance method and the modified bootstrap method on a benchmark dataset show that our method is reliable for the phylogenetic analysis of HIV-1 viruses. PWkmer can resolve within-group variations for different known subtypes of Group M of HIV-1 viruses. This method is simple and computationally fast for whole genome phylogenetic analysis.
5
Phylogenetic analysis of protein sequences based on a novel k-mer natural vector method. Genomics 2018; 111:1298-1305. [PMID: 30195069 DOI: 10.1016/j.ygeno.2018.08.010] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 07/25/2018] [Revised: 08/19/2018] [Accepted: 08/27/2018] [Indexed: 11/22/2022]
Abstract
Based on the k-mer model for protein sequences, a novel k-mer natural vector method is proposed to characterize the features of k-mers in a protein sequence, taking into account both the numbers and the distributions of k-mers. It is proved that the relationship between a protein sequence and its k-mer natural vector is one-to-one. Phylogenetic analysis of protein sequences can therefore be performed easily, without requiring evolutionary models or human intervention. In addition, there is no established criterion for choosing a suitable k, and k strongly influences both the results and the computational complexity. In this paper, a compound k-mer natural vector is used to quantify each protein sequence. The results obtained from phylogenetic analysis of three protein datasets demonstrate that our new method can precisely describe the evolutionary relationships of proteins and greatly improve computational efficiency.
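To make "numbers and distributions of k-mers" concrete, the sketch below records, for every k-mer observed in a sequence, its count, its mean start position and a scaled second central moment of its positions. This follows the general shape of natural-vector features; the exact normalization used in the paper is not given in the abstract, so the scaling by count times number of k-mer positions, and the function name `kmer_natural_vector`, are assumptions.

```python
from collections import defaultdict


def kmer_natural_vector(seq, k):
    """Map each observed k-mer to (count, mean position, scaled 2nd moment).

    Positions are 1-based start positions of the k-mer in the sequence.
    The second central moment is divided by count * total, where total is
    the number of k-mer positions (an assumed normalization).
    """
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i + 1)

    total = len(seq) - k + 1
    features = {}
    for kmer, pos in positions.items():
        n = len(pos)
        mu = sum(pos) / n
        d2 = sum((p - mu) ** 2 for p in pos) / (n * total)
        features[kmer] = (n, mu, d2)
    return features
```

Only observed k-mers appear in the result. To compare sequences, the features would still need to be laid out in a fixed k-mer order, with zeros for absent k-mers.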
6
He L, Li Y, He RL, Yau SST. A novel alignment-free vector method to cluster protein sequences. J Theor Biol 2017; 427:41-52. [DOI: 10.1016/j.jtbi.2017.06.002] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 02/17/2017] [Revised: 05/04/2017] [Accepted: 06/02/2017] [Indexed: 11/29/2022]
7
Li Y, Tian K, Yin C, He RL, Yau SST. Virus classification in 60-dimensional protein space. Mol Phylogenet Evol 2016; 99:53-62. [PMID: 26988414 DOI: 10.1016/j.ympev.2016.03.009] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Received: 06/14/2015] [Revised: 01/24/2016] [Accepted: 03/10/2016] [Indexed: 10/22/2022]
Abstract
Due to vast sequence divergence among different viral groups, sequence alignment is not directly applicable to genome-wide comparative analysis of viruses. More and more attention has been paid to alignment-free methods for whole genome comparison and phylogenetic tree reconstruction. Among alignment-free methods, the recently proposed "Natural Vector (NV) representation" has successfully been used to study the phylogeny of multi-segmented viruses based on a 12-dimensional genome space derived from the nucleotide sequence structure. However, whether proteomes are preferable to genomes for determining viral phylogeny has not been investigated in depth. As the translated products of genes, proteins directly form the shape of the viral structure and are vital for all metabolic pathways. In this study, using the NV representation of a protein sequence along with the Hausdorff distance, which is suitable for comparing point sets, we construct a 60-dimensional protein space to analyze the evolutionary relationships of 4021 viruses by whole proteomes in the current NCBI Reference Sequence Database (RefSeq). We also take advantage of the previously developed natural graphical representation to recover viral phylogeny. Our results demonstrate that the proposed method is efficient and accurate for classifying viruses. The accuracy rates of our predictions, for example for Baltimore II viruses, are as high as 95.9% for family labels, 95.7% for subfamily labels and 96.5% for genus labels. Finally, we discover that proteomes lead to better viral classification when reliable protein sequences are abundant. In other cases, the accuracy rates using proteomes are still comparable to those obtained using genomes.
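Because a virus with several proteins becomes a set of points in the 60-dimensional space, two viruses are compared as point sets. The sketch below shows the standard Hausdorff distance between two finite point sets under the Euclidean metric; the function name `hausdorff` and the use of plain tuples are assumptions for illustration, and the points can have any dimension.

```python
import math


def hausdorff(set_a, set_b):
    """Return the Hausdorff distance between two non-empty point sets.

    Each point is a sequence of coordinates; distances are Euclidean.
    """
    def directed(xs, ys):
        # Largest distance from a point in xs to its nearest point in ys.
        return max(min(math.dist(x, y) for y in ys) for x in xs)

    return max(directed(set_a, set_b), directed(set_b, set_a))
```

With `[(0, 0), (1, 0)]` and `[(0, 0), (3, 0)]` the two directed distances are 1 and 2, so the Hausdorff distance is 2. An empty set raises `ValueError`, since the distance is undefined there.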
Affiliation(s)
- Yongkun Li
  - Department of Mathematical Sciences, Tsinghua University, Beijing 100084, PR China
- Kun Tian
  - Department of Mathematical Sciences, Tsinghua University, Beijing 100084, PR China
- Changchuan Yin
  - Department of Mathematics, Statistics and Computer Science, The University of Illinois at Chicago, Chicago, IL 60607-7045, USA
- Rong Lucy He
  - Department of Biological Sciences, Chicago State University, Chicago, IL 60628, USA
- Stephen S-T Yau
  - Department of Mathematical Sciences, Tsinghua University, Beijing 100084, PR China
8
Yang WF, Yu ZG, Anh V. Whole genome/proteome based phylogeny reconstruction for prokaryotes using higher order Markov model and chaos game representation. Mol Phylogenet Evol 2015; 96:102-111. [PMID: 26724405 DOI: 10.1016/j.ympev.2015.12.011] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 05/02/2015] [Revised: 12/17/2015] [Accepted: 12/18/2015] [Indexed: 01/18/2023]
Abstract
Traditional methods for sequence comparison and phylogeny reconstruction rely on pairwise and multiple sequence alignments. However, alignment cannot be directly applied to whole genome/proteome comparison and phylogenomic studies because of its high computational complexity. Hence alignment-free methods have become popular in recent years. Here we propose a fast alignment-free method for whole genome/proteome comparison and phylogeny reconstruction using a higher order Markov model and chaos game representation. In the present method, we use the transition matrices of higher order Markov models to characterize amino acid or DNA sequences for their comparison. The order of the Markov model is uniquely identified by maximizing the average Shannon entropy of the conditional probability distributions. Using one-dimensional chaos game representation and a linked list, this method reduces the large memory and time consumption caused by the large-scale conditional probability distributions. To illustrate the effectiveness of our method, we employ it for fast phylogeny reconstruction based on the genome/proteome sequences of two species data sets used in previously published papers. Our results demonstrate that the present method is useful and efficient. Availability and implementation: The source code for our algorithm, which computes the distance matrix, and the genome/proteome sequences can be downloaded from ftp://121.199.20.25/. The software packages Phylip and EvolView, which we used to construct phylogenetic trees, are available from their websites.
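The order-selection idea in this abstract can be sketched by estimating the conditional distributions of an order-m Markov model from a sequence and averaging their Shannon entropies. This is a simplified illustration: it uses ordinary dictionaries rather than the one-dimensional chaos game representation and linked list described above, and the unweighted average over observed contexts is an assumption, since the abstract does not state how the average is taken.

```python
from collections import Counter, defaultdict
import math


def transition_probabilities(seq, order):
    """Estimate P(next symbol | previous `order` symbols) from one sequence."""
    counts = defaultdict(Counter)
    for i in range(len(seq) - order):
        context = seq[i:i + order]
        counts[context][seq[i + order]] += 1
    return {
        ctx: {sym: c / sum(nxt.values()) for sym, c in nxt.items()}
        for ctx, nxt in counts.items()
    }


def average_conditional_entropy(probs):
    """Return the mean Shannon entropy (in bits) over all observed contexts.

    The unweighted mean is an assumption made for this sketch.
    """
    entropies = [
        -sum(p * math.log2(p) for p in dist.values())
        for dist in probs.values()
    ]
    return sum(entropies) / len(entropies)
```

Candidate orders could then be compared by calling `average_conditional_entropy(transition_probabilities(seq, m))` for several values of `m`. A sequence shorter than or equal to `order` yields no contexts, in which case the average is undefined and the second function raises `ZeroDivisionError`.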
Affiliation(s)
- Wei-Feng Yang
  - Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, PR China; Department of Mathematics and Physics, Hunan Institute of Engineering, Hunan 411104, PR China
- Zu-Guo Yu
  - Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, PR China; School of Mathematical Sciences, Queensland University of Technology, GPO Box 2434, Brisbane, QLD 4001, Australia
- Vo Anh
  - School of Mathematical Sciences, Queensland University of Technology, GPO Box 2434, Brisbane, QLD 4001, Australia