1
|
Wei ZG, Chen X, Zhang XD, Zhang H, Fan XG, Gao HY, Liu F, Qian Y. Comparison of Methods for Biological Sequence Clustering. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:2874-2888. [PMID: 37028305 DOI: 10.1109/tcbb.2023.3253138] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/19/2023]
Abstract
Recent advances in sequencing technology have considerably promoted genomics research by making high-throughput sequencing economical. This advance has produced a huge amount of sequencing data. Clustering analysis is a powerful way to study and probe large-scale sequence data, and a number of clustering methods have been developed in the last decade. Despite the numerous comparison studies that have been published, we noticed that they have two main limitations: only traditional alignment-based clustering methods are compared, and the evaluation metrics rely heavily on labeled sequence data. In this study, we present a comprehensive benchmark study for sequence clustering methods. Specifically, i) alignment-based clustering algorithms, including classical (e.g., CD-HIT, UCLUST, VSEARCH) and recently proposed methods (e.g., MMseqs2, Linclust, edClust), are assessed; ii) two alignment-free methods (LZW-Kernel and Mash) are included for comparison with alignment-based methods; and iii) different evaluation measures based on the true labels (supervised metrics) and the input data itself (unsupervised metrics) are applied to quantify their clustering results. The aims of this study are to help biological analysts choose a reasonable clustering algorithm for processing their collected sequences and, furthermore, to motivate algorithm designers to develop more efficient sequence clustering approaches.
|
2
|
4D-Dynamic Representation of DNA/RNA Sequences: Studies on Genetic Diversity of Echinococcus multilocularis in Red Foxes in Poland. LIFE (BASEL, SWITZERLAND) 2022; 12:life12060877. [PMID: 35743908 PMCID: PMC9227292 DOI: 10.3390/life12060877] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/25/2022] [Revised: 05/20/2022] [Accepted: 06/08/2022] [Indexed: 11/17/2022]
Abstract
The 4D-Dynamic Representation of DNA/RNA Sequences, an alignment-free bioinformatics method recently developed by us, has been used to study the genetic diversity of Echinococcus multilocularis in red foxes in Poland. Sequences of three mitochondrial genes, i.e., NADH dehydrogenase subunit 2 (nad2), cytochrome b (cob), and cytochrome c oxidase subunit 1 (cox1), are analyzed. The sequences are represented by sets of material points in a 4D space, i.e., 4D-dynamic graphs. As a visualization of the sequences, projections of the graphs into 3D space are shown. The differences between 3D graphs corresponding to European, Asian, and American haplotypes are small. Numerical characteristics (sequence descriptors) applied in the studies can recognize the differences. The concept of creating descriptors of 4D-dynamic graphs has been borrowed from classical dynamics; these are coordinates of the centers of mass and moments of inertia of 4D-dynamic graphs. Based on these descriptors, classification maps are constructed. The concentrations of points in the maps indicate one Polish haplotype (EmPL9) of Asian origin.
|
3
|
Kieft K, Anantharaman K. Virus genomics: what is being overlooked? Curr Opin Virol 2022; 53:101200. [DOI: 10.1016/j.coviro.2022.101200] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 09/17/2021] [Revised: 12/21/2021] [Accepted: 01/03/2022] [Indexed: 01/05/2023]
|
4
|
Mahapatra A, Mukherjee J. Taxonomy classification using genomic footprint of mitochondrial sequences. Comb Chem High Throughput Screen 2021; 25:401-413. [PMID: 34382517 DOI: 10.2174/1386207324666210811102109] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/31/2020] [Revised: 07/07/2021] [Accepted: 07/12/2021] [Indexed: 11/22/2022]
Abstract
BACKGROUND: Advances in sequencing technology yield a huge number of genomes from a multitude of organisms on our planet. One of the fundamental tasks in processing and analyzing these sequences is to organize them into the existing taxonomic orders. METHOD: We recently proposed a novel approach, GenFooT, for taxonomy classification using the concept of genomic footprint (GFP). The technique is further refined and enhanced in this work, leading to improved accuracy in taxonomic classification on various benchmark datasets. GenFooT maps a genome sequence into a 2D coordinate space and extracts features from that representation. It uses two hyper-parameters when computing the features, namely the block size and the number of fragments of the genomic sequence. In this work, we propose an analysis for choosing the values of those parameters adaptively from the sequences. The enhanced version of GenFooT is named GenFooT2. RESULTS AND CONCLUSION: We evaluated GenFooT2 on ten different biological datasets of genomic sequences of various organisms belonging to different taxonomic ranks. Our experimental results indicate that, with a logistic regression classifier, the proposed features improve classification performance by more than 3% over GenFooT. We also performed a statistical test to compare the performance of GenFooT2 with state-of-the-art methods, including our previous method GenFooT. The experimental results and the statistical test show that the performance of the proposed GenFooT2 is significantly better.
Affiliation(s)
- Aritra Mahapatra: Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, India
- Jayanta Mukherjee: Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, India
|
5
|
Can wood-decaying urban macrofungi be identified by using fuzzy interference system? An example in Central European Ganoderma species. Sci Rep 2021; 11:13222. [PMID: 34168175 PMCID: PMC8225830 DOI: 10.1038/s41598-021-92237-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/04/2021] [Accepted: 05/28/2021] [Indexed: 11/20/2022]
Abstract
Ganoderma is a cosmopolitan genus of wood-decaying basidiomycetous macrofungi that can rot the roots and/or lower trunk. On standing trees, their presence often indicates that a hazard assessment may be necessary. These bracket fungi are commonly known for the crust-like upper surfaces of their basidiocarps and the formation of white rot. Six species occur in central European urban habitats. Several of them, such as Ganoderma adspersum, G. applanatum, G. resinaceum and G. pfeifferi, are the most hazardous fungi, causing extensive horizontal stem decay in urban trees. Therefore, their early identification is crucial for the correct management of trees. In this paper, a fast technique is tested for the determination of phytopathologically important urban macrofungi using a fuzzy interference system of Sugeno type based on 13 selected traits of 72 basidiocarps of six Ganoderma species, and it is compared to ITS sequence-based determination. Basidiocarp features were processed for the following situations. First, the FIS of Sugeno 2 type (without basidiospore sizes) was used, and 57 Ganoderma basidiocarps (79.17%) were correctly determined. Determination success increased to 96.61% after selecting basidiocarps with critical values (15 basidiocarps). These undeterminable basidiocarps must be analyzed by molecular methods. For the case in which the basidiospore sizes of some basidiocarps were known, a combination of Sugeno 1 (31 basidiocarps with known basidiospore size) and Sugeno 2 (41 basidiocarps with unknown basidiospore size) was used, and 84.72% of Ganoderma basidiocarps were correctly identified. Determination success increased to 96.83% after selecting basidiocarps with critical values (11 basidiocarps).
|
6
|
Koulouras G, Frith MC. Significant non-existence of sequences in genomes and proteomes. Nucleic Acids Res 2021; 49:3139-3155. [PMID: 33693858 PMCID: PMC8034619 DOI: 10.1093/nar/gkab139] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 01/04/2021] [Revised: 02/11/2021] [Accepted: 02/25/2021] [Indexed: 12/22/2022]
Abstract
Minimal absent words (MAWs) are minimal-length oligomers absent from a genome or proteome. Although some artificially synthesized MAWs have deleterious effects, there is still a lack of a strategy for the classification of non-occurring sequences as potentially malicious or benign. In this work, by using Markovian models with multiple-testing correction, we reveal significant absent oligomers, which are statistically expected to exist. This suggests that their absence is due to negative selection. We survey genomes and proteomes covering the diversity of life and find thousands of significant absent sequences. Common significant MAWs are often mono- or dinucleotide tracts, or palindromic. Significant viral MAWs are often restriction sites and may indicate unknown restriction motifs. Surprisingly, significant mammal genome MAWs are often present, but rare, in other mammals, suggesting that they are suppressed but not completely forbidden. Significant human MAWs are frequently present in prokaryotes, suggesting immune function, but rarely present in human viruses, indicating viral mimicry of the host. More than one-fourth of human proteins are one substitution away from containing a significant MAW, with the majority of replacements being predicted harmful. We provide a web-based, interactive database of significant MAWs across genomes and proteomes.
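To make the notion of a minimal absent word concrete, the following is a minimal brute-force sketch. It uses the common definition that a word is a MAW when it does not occur in the sequence but its longest proper prefix and suffix both do. The function name `minimal_absent_words` and the `max_len` cutoff are illustrative assumptions, and the statistical significance testing described in the abstract (Markovian models with multiple-testing correction) is not attempted here.

```python
from itertools import product


def minimal_absent_words(seq, max_len, alphabet="ACGT"):
    """Brute-force MAWs of length 2..max_len (hypothetical helper, not the authors' code).

    A word counts as a MAW when it is absent from seq while both the word
    without its last letter and the word without its first letter occur in seq.
    """
    # Every substring of seq up to max_len letters long.
    present = {
        seq[i:i + k]
        for k in range(1, max_len + 1)
        for i in range(len(seq) - k + 1)
    }
    maws = []
    for k in range(2, max_len + 1):
        for letters in product(alphabet, repeat=k):
            word = "".join(letters)
            if (word not in present
                    and word[:-1] in present
                    and word[1:] in present):
                maws.append(word)
    return maws


# Toy two-letter alphabet so the result is easy to check by hand.
print(minimal_absent_words("AAC", max_len=3, alphabet="AC"))
```

Enumerating every candidate word is exponential in `max_len`, so this is only suitable for illustrating the definition on short sequences.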
Affiliation(s)
- Grigorios Koulouras: Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-3-26 Aomi, Koto-ku, Tokyo 135-0064, Japan
- Martin C Frith: Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-3-26 Aomi, Koto-ku, Tokyo 135-0064, Japan; Graduate School of Frontier Sciences, University of Tokyo, Kashiwa, Chiba, Japan; Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), AIST, Shinjuku-ku, Tokyo, Japan
|
7
|
Bielińska-Wąż D, Wąż P. Non-standard bioinformatics characterization of SARS-CoV-2. Comput Biol Med 2021; 131:104247. [PMID: 33611129 PMCID: PMC7966820 DOI: 10.1016/j.compbiomed.2021.104247] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 12/20/2020] [Revised: 01/22/2021] [Accepted: 01/26/2021] [Indexed: 12/16/2022]
Abstract
A non-standard bioinformatics method, 4D-Dynamic Representation of DNA/RNA Sequences, aimed at analyzing the information available in nucleotide databases, has been formulated. The sequences are represented by sets of "material points" in a 4D space, i.e., 4D-dynamic graphs. The graphs representing the sequences are treated as "rigid bodies" and characterized by values analogous to those used in classical dynamics. As graphical representations of the sequences, the projections of the graphs into 2D and 3D spaces are used. The method has been applied to an analysis of the complete genome sequences of the 2019 novel coronavirus. As a result, 2D and 3D classification maps are obtained. The coordinate axes in the maps correspond to the values derived from the exact formulas characterizing the graphs: the coordinates of the centers of mass and the 4D moments of inertia. The points in the maps represent sequences, and their coordinates are used as the classifiers. The main result of this work has been derived from the 3D classification maps. The distribution of clusters of points that emerged in these maps supports the hypothesis that SARS-CoV-2 may have originated in bats and in pangolins. Pilot calculations for Zika virus sequence data prove that the proposed approach is also applicable to a description of the time evolution of genome sequences of viruses.
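The descriptors mentioned above (centers of mass and moments of inertia of a 4D point set) can be sketched as follows. The abstract does not give the mapping from nucleotides to 4D steps or the masses of the points, so this sketch assumes one unit axis per nucleotide, a cumulative walk, and unit masses; `UNIT`, `dynamic_graph`, `center_of_mass`, and `inertia_tensor` are hypothetical names, and the published method may define these quantities differently.

```python
# Assumed mapping: each nucleotide steps along its own axis in 4D.
UNIT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
    "U": (0, 0, 0, 1),  # RNA sequences reuse the T axis
}


def dynamic_graph(seq):
    """Return the walk of 4D points visited while reading seq."""
    pos = [0, 0, 0, 0]
    points = []
    for base in seq.upper():
        pos = [p + s for p, s in zip(pos, UNIT[base])]
        points.append(tuple(pos))
    return points


def center_of_mass(points):
    """Center of mass of unit-mass points."""
    n = len(points)
    return tuple(sum(p[j] for p in points) / n for j in range(4))


def inertia_tensor(points):
    """4x4 tensor I_jk = sum(r^2 * delta_jk - x_j * x_k) about the center of mass."""
    mu = center_of_mass(points)
    tensor = [[0.0] * 4 for _ in range(4)]
    for p in points:
        d = [p[j] - mu[j] for j in range(4)]
        r2 = sum(x * x for x in d)
        for j in range(4):
            for k in range(4):
                tensor[j][k] += (r2 if j == k else 0.0) - d[j] * d[k]
    return tensor


graph = dynamic_graph("ACGT")
print(center_of_mass(graph))
print(inertia_tensor(graph))
```

Under these assumptions, each sequence reduces to a handful of numbers (center-of-mass coordinates and tensor entries) that could serve as axes of a classification map.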
Affiliation(s)
- Dorota Bielińska-Wąż (corresponding author): Department of Radiological Informatics and Statistics, Medical University of Gdańsk, 80-210, Gdańsk, Poland
- Piotr Wąż: Department of Nuclear Medicine, Medical University of Gdańsk, 80-210, Gdańsk, Poland
|
8
|
High-Throughput Genotyping Technologies in Plant Taxonomy. Methods Mol Biol 2021; 2222:149-166. [PMID: 33301093 DOI: 10.1007/978-1-0716-0997-2_9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/25/2023]
Abstract
Molecular markers provide researchers with a powerful tool for variation analysis between plant genomes. They are heritable and widely distributed across the genome and for this reason have many applications in plant taxonomy and genotyping. Over the last decade, molecular marker technology has developed rapidly and is now a crucial component for genetic linkage analysis, trait mapping, diversity analysis, and association studies. This chapter focuses on molecular marker discovery, its application, and future perspectives for plant genotyping through pangenome assemblies. Included are descriptions of automated methods for genome and sequence distance estimation, genome contaminant analysis in sequence reads, genome structural variation, and SNP discovery methods.
|
9
|
Mapping sequence to feature vector using numerical representation of codons targeted to amino acids for alignment-free sequence analysis. Gene 2020; 766:145096. [PMID: 32919006 DOI: 10.1016/j.gene.2020.145096] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 04/25/2020] [Revised: 08/16/2020] [Accepted: 08/24/2020] [Indexed: 12/17/2022]
Abstract
Phylogenetic analysis based on sequence similarity and targeted to real biological taxa is a major challenge. In this paper, we propose a novel alignment-free method, CoFASA (Codon Feature based Amino acid Sequence Analyser), for similarity analysis of nucleotide sequences. First, we assign numerical weights to the four nucleotides. We then calculate a score for each codon based on the numerical values of its constituent nucleotides, termed the degree of the codon. Accordingly, we obtain the degree of each amino acid from the degrees of the codons that encode it. Using the degrees of the twenty amino acids and their relative abundance within a given sequence, we generate 20-dimensional features for every coding DNA sequence or protein sequence. We use these features to perform phylogenetic analysis of the set of candidate sequences. For experimental assessment, we use multiple protein sequences derived from Beta-globin (BG), NADH dehydrogenase subunit 5 (ND5), Transferrins (TFs), and Xylanases, as well as low identity (<40%) and high identity (⩾40%) protein sequences (encompassing 533 and 1064 protein families). We compare our results with sixteen (16) well-known methods, including both alignment-based and alignment-free methods. Various assessment indices are used for performance analysis, such as the Pearson correlation coefficient, RF (Robinson-Foulds) distance, and ROC score. CoFASA shows results very similar to those of alignment-based methods (ClustalW, ClustalΩ, MAFFT, and MUSCLE). Further, CoFASA performs better than well-known alignment-free methods, including LZW-Kernel, jD2Stat, FFP, spaced, and AFKS-D2s, in predicting taxonomic relationships among candidate taxa. Overall, we observe that the features derived by CoFASA are very useful in separating the sequences according to their taxonomic labels. Our method is cost-effective and, at the same time, produces consistent and satisfactory outcomes.
|
10
|
Qi Z, Wen X. Novel Protein Sequence Comparison Method Based on Transition Probability Graph and Information Entropy. Comb Chem High Throughput Screen 2020; 25:392-400. [PMID: 32875978 DOI: 10.2174/1386207323666200901103001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2020] [Revised: 07/17/2020] [Accepted: 07/17/2020] [Indexed: 11/22/2022]
Abstract
AIM AND OBJECTIVE: Sequence analysis is one of the foundations of bioinformatics. It is widely used to find the feature metrics hidden in a sequence. In addition, the graphical representation of biological sequences is an important tool for sequence analysis. This study was undertaken to find a new graphical representation of biosequences. MATERIALS AND METHODS: The transition probability is used to describe amino acid combinations in protein sequences. The combinations are composed of amino acids directly adjacent to each other or separated by multiple amino acids. The transition probability graph is built from the transition probabilities of amino acid combinations. Next, a map from the transition probability graph to a transition probability vector is defined using the k-order transition probability graph. Transition entropy vectors are developed from the transition probability vector and information entropy. Finally, the proposed method is applied to two separate datasets: 499 HA genes of H1N1 and 95 coronaviruses. RESULTS: By constructing a phylogenetic tree, we find that the results for each application are consistent with other studies. CONCLUSION: The graphical representation proposed in this article is a practical and correct method.
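One way to read the transition-probability and entropy steps above is sketched below. It assumes that the order k means pairing each residue with the residue k positions later, and that the entropy is the Shannon entropy of each residue's outgoing transition distribution; both are assumptions, since the abstract does not give the exact formulas. The function names are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2


def transition_probabilities(seq, k=1):
    """Estimate P(b | a) for residue pairs (a, b) that are k positions apart."""
    counts = defaultdict(Counter)
    for a, b in zip(seq, seq[k:]):
        counts[a][b] += 1
    return {
        a: {b: n / sum(row.values()) for b, n in row.items()}
        for a, row in counts.items()
    }


def transition_entropies(seq, k=1):
    """Shannon entropy (in bits) of each residue's outgoing transitions."""
    probs = transition_probabilities(seq, k)
    return {
        a: -sum(p * log2(p) for p in row.values())
        for a, row in probs.items()
    }


# Toy sequence: "A" is followed by "A" once and by "B" once, so its entropy is 1 bit.
print(transition_probabilities("AABA", k=1))
print(transition_entropies("AABA", k=1))
```

Collecting these entropies over a fixed residue order and several values of `k` would give a fixed-length vector per sequence, which is the kind of input a distance-based phylogenetic tree needs.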
Affiliation(s)
- Zhaohui Qi: College of Information Science and Engineering, Hunan Normal University, Changsha 410081, China
- Xinlong Wen: College of Information Science and Engineering, Hunan Normal University, Changsha 410081, China
|
11
|
Borrayo E, May-Canche I, Paredes O, Morales JA, Romo-Vázquez R, Vélez-Pérez H. Whole-Genome k-mer Topic Modeling Associates Bacterial Families. Genes (Basel) 2020; 11:genes11020197. [PMID: 32075081 PMCID: PMC7074292 DOI: 10.3390/genes11020197] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 01/07/2020] [Revised: 02/07/2020] [Accepted: 02/09/2020] [Indexed: 11/16/2022]
Abstract
Alignment-free k-mer-based algorithms in whole genome sequence comparisons remain an ongoing challenge. Here, we explore the possibility of using Topic Modeling for organism whole-genome comparisons. We analyzed 30 complete genomes from three bacterial families by topic modeling. For this, each genome was considered as a document and 13-mer nucleotide representations as words. Latent Dirichlet allocation was used as the probabilistic modeling of the corpus. We were able to identify the topic distribution among analyzed genomes, which is highly consistent with traditional hierarchical classification. It is possible that topic modeling may be applied to establish relationships between a genome's composition and biological phenomena.
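The "genome as document, k-mer as word" framing can be illustrated with a small sketch that builds a document-term count matrix from overlapping k-mers. The helper names `kmer_document` and `document_term_matrix` are hypothetical, overlapping k-mers with a stride of one are an assumption, and the latent Dirichlet allocation step itself is not shown; a count matrix of this shape is simply the usual input to a topic-model implementation.

```python
from collections import Counter


def kmer_document(genome, k=13):
    """Count the overlapping k-mers ("words") of one genome ("document")."""
    genome = genome.upper()
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def document_term_matrix(genomes, k=13):
    """Return (vocabulary, rows): one row of k-mer counts per genome."""
    docs = {name: kmer_document(seq, k) for name, seq in genomes.items()}
    vocab = sorted(set().union(*docs.values()))
    rows = [[docs[name][word] for word in vocab] for name in docs]
    return vocab, rows


# Toy input with k=2 so the matrix is small enough to read.
vocab, rows = document_term_matrix({"g1": "ACGA", "g2": "CGCG"}, k=2)
print(vocab)
print(rows)
```

For real genomes with k = 13 the vocabulary can reach 4^13 distinct words, so a sparse representation would be preferable to the dense rows used here.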
Affiliation(s)
- Ernesto Borrayo: Electronics Department, CUCEI, Universidad de Guadalajara, Jalisco 44100, Mexico
- Isaias May-Canche: Computer Sciences Department, CUCEI, Universidad de Guadalajara, Jalisco 44100, Mexico; Instituto Tecnológico de Chetumal, Quintana Roo 77000, Mexico
- Omar Paredes: Computer Sciences Department, CUCEI, Universidad de Guadalajara, Jalisco 44100, Mexico
- J. Alejandro Morales: Computer Sciences Department, CUCEI, Universidad de Guadalajara, Jalisco 44100, Mexico
- Rebeca Romo-Vázquez: Computer Sciences Department, CUCEI, Universidad de Guadalajara, Jalisco 44100, Mexico
- Hugo Vélez-Pérez: Computer Sciences Department, CUCEI, Universidad de Guadalajara, Jalisco 44100, Mexico
|