Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Deng M, Yu C, Liang Q, He RL, Yau SST. A novel method of characterizing genetic sequences: genome space with biological distance and applications. PLoS One 2011;6:e17293. [PMID: 21399690 PMCID: PMC3047556 DOI: 10.1371/journal.pone.0017293] [Citation(s) in RCA: 69] [Impact Index Per Article: 5.3] [Received: 09/27/2010] [Accepted: 01/28/2011] [Indexed: 11/18/2022] Open

Cited by Other Article(s)

Guan M, Sun N, Yau SST. Geometric analysis of SARS-CoV-2 variants. Gene 2024;909:148291. [PMID: 38417688 DOI: 10.1016/j.gene.2024.148291] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2023] [Revised: 01/23/2024] [Accepted: 02/14/2024] [Indexed: 03/01/2024]

Yu H, Yau SST. Automated recognition of chromosome fusion using an alignment-free natural vector method. Front Genet 2024;15:1364951. [PMID: 38572414 PMCID: PMC10987741 DOI: 10.3389/fgene.2024.1364951] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/03/2024] [Accepted: 03/06/2024] [Indexed: 04/05/2024] Open

Cahuantzi R, Lythgoe KA, Hall I, Pellis L, House T. Unsupervised identification of significant lineages of SARS-CoV-2 through scalable machine learning methods. Proc Natl Acad Sci U S A 2024;121:e2317284121. [PMID: 38478692 PMCID: PMC10962941 DOI: 10.1073/pnas.2317284121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2023] [Accepted: 02/05/2024] [Indexed: 03/21/2024] Open

Zulfiqar H, Guo Z, Ahmad RM, Ahmed Z, Cai P, Chen X, Zhang Y, Lin H, Shi Z. Deep-STP: a deep learning-based approach to predict snake toxin proteins by using word embeddings. Front Med (Lausanne) 2024;10:1291352. [PMID: 38298505 PMCID: PMC10829051 DOI: 10.3389/fmed.2023.1291352] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/09/2023] [Accepted: 12/26/2023] [Indexed: 02/02/2024] Open

Dey S, Das S, Bhattacharya DK. Biochemical Property Based Positional Matrix: A New Approach Towards Genome Sequence Comparison. J Mol Evol 2023;91:93-131. [PMID: 36587178 PMCID: PMC9805373 DOI: 10.1007/s00239-022-10082-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/08/2022] [Accepted: 12/01/2022] [Indexed: 01/01/2023]

Sun N, Yau SST. In-depth investigation of the point mutation pattern of HIV-1. Front Cell Infect Microbiol 2022;12:1033481. [PMID: 36457853 PMCID: PMC9705751 DOI: 10.3389/fcimb.2022.1033481] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/31/2022] [Accepted: 10/25/2022] [Indexed: 04/29/2024] Open

Liu J, Xia KL, Wu J, Yau SST, Wei GW. Biomolecular Topology: Modelling and Analysis. Acta Mathematica Sinica, English Series 2022;38:1901-1938. [PMID: 36407804 PMCID: PMC9640850 DOI: 10.1007/s10114-022-2326-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/27/2022] [Accepted: 07/12/2022] [Indexed: 05/25/2023]

Shi H, Zhang S, Li X. R5hmCFDV: computational identification of RNA 5-hydroxymethylcytosine based on deep feature fusion and deep voting. Brief Bioinform 2022;23:6658858. [PMID: 35945157 DOI: 10.1093/bib/bbac341] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2022] [Revised: 07/17/2022] [Accepted: 07/25/2022] [Indexed: 11/13/2022] Open

Abstract

RNA 5-hydroxymethylcytosine (5hmC) is an RNA modification related to the life activities of many organisms, and studying its distribution is important for revealing its biological function. High-throughput sequencing has previously been used to identify 5hmC, but it is expensive and inefficient, so machine learning is used to identify 5hmC sites instead. Here, we design a model called R5hmCFDV, which consists of three parts: feature representation, feature fusion and classification. (i) Pseudo dinucleotide composition, dinucleotide binary profile and frequency, natural vector and physicochemical property are used to extract features from four aspects: nucleotide composition, coding, natural language and physical and chemical properties. (ii) To strengthen the relevance of the features, we construct a novel feature fusion method. First, an attention mechanism processes the four single features, which are then stitched together and fed to a convolution layer. The output is then processed by BiGRU and BiLSTM, respectively. Finally, the features from these two parts are fused by a multiply function. (iii) We design a deep voting algorithm for classification by imitating the soft voting mechanism in a Python package. The base classifiers are a deep neural network (DNN), a convolutional neural network (CNN) and an improved gated recurrent unit (GRU). Following the principle of soft voting, a weight is assigned to the predicted probability of each of the three classifiers; the predicted probabilities are multiplied by their weights and then summed to obtain the final prediction. We evaluate the model with 10-fold cross-validation, and the evaluation indicators are significantly improved. The prediction accuracy on the two datasets reaches 95.41% and 93.50%, respectively, which demonstrates the competitiveness and generalization performance of our model.
In addition, all datasets and source code can be found at https://github.com/HongyanShi026/R5hmCFDV.
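
The weighted soft voting step in (iii) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `soft_vote`, the example probabilities, the weights, the weight normalization and the 0.5 decision cutoff are all assumptions made for illustration, since the abstract does not report them.

```python
def soft_vote(probabilities, weights):
    """Weighted soft voting: sum of weight * probability, with weights normalized to 1."""
    if len(probabilities) != len(weights):
        raise ValueError("need exactly one weight per classifier")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * p for w, p in zip(weights, probabilities)) / total


# Hypothetical positive-class probabilities from the DNN, CNN and GRU base classifiers.
probs = [0.9, 0.6, 0.3]
# Hypothetical weights; the abstract does not state the values actually used.
weights = [0.5, 0.3, 0.2]

score = soft_vote(probs, weights)
label = int(score >= 0.5)  # the 0.5 cutoff is an assumption
print(round(score, 2), label)
```

Normalizing by the weight sum keeps the result a valid probability even when the supplied weights do not add up to one.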

Protein-protein interaction and non-interaction predictions using gene sequence natural vector. Commun Biol 2022;5:652. [PMID: 35780196 PMCID: PMC9250521 DOI: 10.1038/s42003-022-03617-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 02/04/2022] [Accepted: 06/21/2022] [Indexed: 12/02/2022] Open

Sun N, Zhao X, Yau SST. An efficient numerical representation of genome sequence: natural vector with covariance component. PeerJ 2022;10:e13544. [PMID: 35729905 PMCID: PMC9206847 DOI: 10.7717/peerj.13544] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/16/2021] [Accepted: 05/16/2022] [Indexed: 01/17/2023] Open

Ren R, Yin C, Yau SST. kmer2vec: A Novel Method for Comparing DNA Sequences by word2vec Embedding. J Comput Biol 2022;29:1001-1021. [PMID: 35593919 DOI: 10.1089/cmb.2021.0536] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022] Open

Abstract

The comparison of DNA sequences is of great significance in genomic analysis. Although the traditional multiple sequence alignment (MSA) method is widely used for evolutionary analysis, optimally aligning k sequences becomes computationally intractable as k increases because of the intrinsic computational complexity of MSA. Numerous k-mer alignment-free methods have been proposed, but they may not truly capture the contextual structures of the sequences. In this study, we present a novel k-mer contextual alignment-free method (called kmer2vec), in which the sequence k-mers are semantically embedded as vectors using word2vec, an essential technique in natural language processing. The method thus converts each DNA/RNA sequence into a point in the high-dimensional word2vec space and compares sequences in that space. Because the word2vec vectors are trained from the contextual relationships of k-mers in the genomes, the method may extract valuable structural information from the sequences and properly reflect the relationships among them. The proposed method is optimized over the word2vec training parameters and verified in the phylogenetic analysis of large whole genomes, including coronavirus and bacterial genomes. The results demonstrate the effectiveness of the method for phylogenetic tree construction and species clustering. The method runs much faster than MSA, and the phylogenetic relationships it constructs are more accurate than those of the conventional k-mer alignment-free method. This approach can therefore provide new perspectives for phylogeny and evolution and make it possible to analyze large genomes. In addition, we discuss special parameterization in the construction of the k-mer word2vec embedding. Combining kmer2vec with clustering methods also yields an effective tool for rapid SARS-CoV-2 typing.
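
The idea of mapping a sequence to a point in an embedding space can be sketched as follows. This is a toy illustration, not kmer2vec itself: the embedding table `toy_table` is hand-written rather than trained with word2vec, and averaging the k-mer vectors and comparing points by Euclidean distance are assumptions, since the abstract does not say how k-mer vectors are aggregated or compared.

```python
import math


def kmers(seq, k):
    """Return the overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def embed_sequence(seq, k, table):
    """Average the vectors of all k-mers found in the table (aggregation is an assumption)."""
    dim = len(next(iter(table.values())))
    total = [0.0] * dim
    count = 0
    for kmer in kmers(seq, k):
        vec = table.get(kmer)
        if vec is None:
            continue  # skip k-mers that have no embedding
        for j, x in enumerate(vec):
            total[j] += x
        count += 1
    if count == 0:
        raise ValueError("no k-mer of the sequence has an embedding")
    return [x / count for x in total]


# Hypothetical 2-dimensional embeddings; a real table would come from word2vec training.
toy_table = {"AC": [1.0, 0.0], "CG": [0.0, 1.0], "GT": [1.0, 1.0]}

point_a = embed_sequence("ACGT", 2, toy_table)
point_b = embed_sequence("ACAC", 2, toy_table)
distance = math.dist(point_a, point_b)  # Euclidean distance between the two points
print(point_a, point_b, distance)
```

A pairwise distance matrix built this way could then feed a standard tree-building or clustering routine.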

Dong R, Pei S, Guan M, Yau SC, Yin C, He RL, Yau SST. Full Chromosomal Relationships Between Populations and the Origin of Humans. Front Genet 2022;12:828805. [PMID: 35186019 PMCID: PMC8847220 DOI: 10.3389/fgene.2021.828805] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/04/2021] [Accepted: 12/22/2021] [Indexed: 11/23/2022] Open

A k-mer based approach for classifying viruses without taxonomy identifies viral associations in human autism and plant microbiomes. Comput Struct Biotechnol J 2021;19:5911-5919. [PMID: 34849195 PMCID: PMC8605058 DOI: 10.1016/j.csbj.2021.10.029] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 07/07/2021] [Revised: 10/17/2021] [Accepted: 10/20/2021] [Indexed: 12/25/2022] Open

Abstract

Viruses are underrepresented taxa in the study and identification of microbiome constituents; however, they play an essential role in health, microbiome regulation, and transfer of genetic material. Only a few thousand viruses have been isolated, sequenced, and assigned a taxonomy, which limits the ability to identify and quantify viruses in the microbiome. Additionally, the vast diversity of viruses represents a challenge for classification, not only in constructing a viral taxonomy, but also in identifying similarities between a virus's genotype and its phenotype. However, the diversity of viral sequences can be leveraged to classify their sequences in metagenomic and metatranscriptomic samples, even if they do not have a taxonomy. To identify and quantify viruses in transcriptomic and genomic samples, we developed a dynamic programming algorithm for creating a classification tree out of 715,672 metagenome viruses. To create the classification tree, we clustered proportional similarity scores generated from the k-mer profiles of each of the metagenome viruses to create a database of metagenomic viruses. The resulting Kraken2-compatible database of the metagenomic viruses can be found at https://www.osti.gov/biblio/1615774. We then integrated the viral classification database with databases created with genomes from NCBI for use with ParaKraken (a parallelized version of Kraken provided in Supplemental Zip 1), a metagenomic/transcriptomic classifier. To illustrate the breadth of our utility for classifying metagenome viruses, we analyzed data from a plant metagenome study, identifying genotype-specific and compartment-specific differences between two Populus genotypes in three different compartments. We also identified a significant increase in abundance of eight viral sequences in post mortem brains in a human metatranscriptome study comparing Autism Spectrum Disorder patients and controls.
We also show the potential accuracy for classifying viruses by utilizing both the JGI and NCBI viral databases to identify the uniqueness of viral sequences. Finally, we validate the accuracy of viral classification with NCBI databases containing viruses with taxonomy to identify pathogenic viruses in known COVID-19 and cassava brown streak virus infection samples. Our method represents the necessary first step in better understanding the role of viruses in the microbiome by allowing for a more complete identification of sequences without taxonomy. Better classification of viruses will improve the identification of associations between viruses and their hosts as well as between viruses and other microbiome members. Despite the lack of taxonomy, this database of metagenomic viruses can be used with any tool that utilizes a taxonomy, such as Kraken, for accurate classification of viruses.
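
As a rough illustration of scoring similarity between k-mer profiles, the sketch below builds relative-frequency profiles and compares them. The abstract does not define its proportional similarity score, so the formula used here (the sum over shared k-mers of the smaller relative frequency) is an assumption, and the function names and example sequences are hypothetical.

```python
from collections import Counter


def kmer_profile(seq, k):
    """Relative frequency of each overlapping k-mer in the sequence."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def proportional_similarity(p, q):
    """Sum of the smaller relative frequency over shared k-mers (assumed definition)."""
    return sum(min(p[kmer], q[kmer]) for kmer in p.keys() & q.keys())


# Hypothetical toy sequences, far shorter than real viral genomes.
a = kmer_profile("ACGTACGT", 2)
b = kmer_profile("ACGTTTTT", 2)
print(proportional_similarity(a, a), proportional_similarity(a, b))
```

Under this definition the score lies between 0 (no shared k-mers) and 1 (identical profiles), so one minus the score could serve as a distance for clustering.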

He L, Sun S, Zhang Q, Bao X, Li PK. Alignment-free sequence comparison for virus genomes based on location correlation coefficient. Infection, Genetics and Evolution: Journal of Molecular Epidemiology and Evolutionary Genetics in Infectious Diseases 2021;96:105106. [PMID: 34626822 PMCID: PMC8493760 DOI: 10.1016/j.meegid.2021.105106] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 06/20/2021] [Revised: 09/08/2021] [Accepted: 10/03/2021] [Indexed: 12/18/2022]

Bohnsack KS, Kaden M, Abel J, Saralajew S, Villmann T. The Resolved Mutual Information Function as a Structural Fingerprint of Biomolecular Sequences for Interpretable Machine Learning Classifiers. Entropy (Basel, Switzerland) 2021;23:1357. [PMID: 34682081 PMCID: PMC8534762 DOI: 10.3390/e23101357] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2021] [Revised: 10/11/2021] [Accepted: 10/14/2021] [Indexed: 11/16/2022]

Jiao X, Pei S, Sun Z, Kang J, Yau SST. Determination of the nucleotide or amino acid composition of genome or protein sequences by using natural vector method and convex hull principle. Fundamental Research 2021. [DOI: 10.1016/j.fmre.2021.08.010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022] Open

Sun N, Pei S, He L, Yin C, He RL, Yau SST. Geometric construction of viral genome space and its applications. Comput Struct Biotechnol J 2021;19:4226-4234. [PMID: 34429843 PMCID: PMC8353408 DOI: 10.1016/j.csbj.2021.07.028] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/25/2021] [Revised: 07/24/2021] [Accepted: 07/24/2021] [Indexed: 11/25/2022] Open

Kaden M, Bohnsack KS, Weber M, Kudła M, Gutowska K, Blazewicz J, Villmann T. Learning vector quantization as an interpretable classifier for the detection of SARS-CoV-2 types based on their RNA sequences. Neural Comput Appl 2021;34:67-78. [PMID: 33935376 PMCID: PMC8076884 DOI: 10.1007/s00521-021-06018-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 07/13/2020] [Accepted: 04/07/2021] [Indexed: 02/06/2023]

Pei S, Yau SST. Analysis of the Genomic Distance Between Bat Coronavirus RaTG13 and SARS-CoV-2 Reveals Multiple Origins of COVID-19. Acta Mathematica Scientia = Shu Xue Wu Li Xue Bao 2021;41:1017-1022. [PMID: 33897081 PMCID: PMC8054123 DOI: 10.1007/s10473-021-0323-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 02/25/2021] [Revised: 03/10/2021] [Indexed: 05/29/2023]

Lv H, Dao FY, Zulfiqar H, Su W, Ding H, Liu L, Lin H. A sequence-based deep learning approach to predict CTCF-mediated chromatin loop. Brief Bioinform 2021;22:6149346. [PMID: 33634313 DOI: 10.1093/bib/bbab031] [Citation(s) in RCA: 26] [Impact Index Per Article: 8.7] [Received: 10/26/2020] [Revised: 12/01/2020] [Accepted: 01/21/2021] [Indexed: 12/13/2022] Open

iBLP: An XGBoost-Based Predictor for Identifying Bioluminescent Proteins. Computational and Mathematical Methods in Medicine 2021;2021:6664362. [PMID: 33505515 PMCID: PMC7808816 DOI: 10.1155/2021/6664362] [Citation(s) in RCA: 38] [Impact Index Per Article: 12.7] [Received: 11/21/2020] [Revised: 12/13/2020] [Accepted: 12/28/2020] [Indexed: 02/07/2023]

Sun Z, Pei S, He RL, Yau SST. A novel numerical representation for proteins: Three-dimensional Chaos Game Representation and its Extended Natural Vector. Comput Struct Biotechnol J 2020;18:1904-1913. [PMID: 32774785 PMCID: PMC7390779 DOI: 10.1016/j.csbj.2020.07.004] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 04/28/2020] [Revised: 07/04/2020] [Accepted: 07/05/2020] [Indexed: 12/16/2022] Open

Dong R, Pei S, Yin C, He RL, Yau SST. Analysis of the Hosts and Transmission Paths of SARS-CoV-2 in the COVID-19 Outbreak. Genes (Basel) 2020;11:E637. [PMID: 32526937 PMCID: PMC7349679 DOI: 10.3390/genes11060637] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 03/29/2020] [Revised: 05/30/2020] [Accepted: 06/03/2020] [Indexed: 12/11/2022] Open

Positional Correlation Natural Vector: A Novel Method for Genome Comparison. Int J Mol Sci 2020;21:ijms21113859. [PMID: 32485813 PMCID: PMC7312176 DOI: 10.3390/ijms21113859] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/13/2020] [Revised: 05/17/2020] [Accepted: 05/26/2020] [Indexed: 12/17/2022] Open

Das S, Das A, Mondal B, Dey N, Bhattacharya DK, Tibarewala DN. Genome sequence comparison under a new form of tri-nucleotide representation based on bio-chemical properties of nucleotides. Gene 2019;730:144257. [PMID: 31759983 DOI: 10.1016/j.gene.2019.144257] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 07/29/2019] [Revised: 11/01/2019] [Accepted: 11/05/2019] [Indexed: 10/25/2022]

Lv H, Dao FY, Guan ZX, Zhang D, Tan JX, Zhang Y, Chen W, Lin H. iDNA6mA-Rice: A Computational Tool for Detecting N6-Methyladenine Sites in Rice. Front Genet 2019;10:793. [PMID: 31552096 PMCID: PMC6746913 DOI: 10.3389/fgene.2019.00793] [Citation(s) in RCA: 47] [Impact Index Per Article: 9.4] [Received: 06/13/2019] [Accepted: 07/26/2019] [Indexed: 01/08/2023] Open

Pei S, Dong R, He RL, Yau SST. Large-Scale Genome Comparison Based on Cumulative Fourier Power and Phase Spectra: Central Moment and Covariance Vector. Comput Struct Biotechnol J 2019;17:982-994. [PMID: 31384399 PMCID: PMC6661692 DOI: 10.1016/j.csbj.2019.07.003] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 05/01/2019] [Revised: 06/24/2019] [Accepted: 07/10/2019] [Indexed: 01/04/2023] Open

Criscuolo A. A fast alignment-free bioinformatics procedure to infer accurate distance-based phylogenetic trees from genome assemblies. Research Ideas and Outcomes 2019. [DOI: 10.3897/rio.5.e36178] [Citation(s) in RCA: 36] [Impact Index Per Article: 7.2] [Indexed: 12/26/2022] Open

Lebatteux D, Remita AM, Diallo AB. Toward an Alignment-Free Method for Feature Extraction and Accurate Classification of Viral Sequences. J Comput Biol 2019;26:519-535. [PMID: 31050550 DOI: 10.1089/cmb.2018.0239] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 12/14/2022] Open

Abstract

The classification of pathogens in emerging and re-emerging viruses is of major interest in taxonomic studies, functional genomics, host-pathogen interplay, prevention, and disease treatments. It consists of assigning a given sequence to its related group of known sequences sharing similar characteristics and traits. The challenges to such classification could be associated with several virus properties, including recombination, mutation rate, multiplicity of motifs, and diversity. In domains such as pathogen monitoring and surveillance, it is important to detect and quantify known and novel taxa without relying on full and accurate alignments or virus family profiles. In this study, we propose an alignment-free method, CASTOR-KRFE, to detect discriminating subsequences within known pathogen sequences in order to accurately classify unknown pathogen sequences. This method includes three major steps: (1) vectorization of known viral genomic sequences based on k-mers to constitute the potential features, (2) efficient pattern extraction and evaluation that maximizes classification performance, and (3) prediction of the minimal set of features fitting a given criterion (threshold of performance metric and maximum number of features). We assessed this method through a jackknife data partitioning on a dozen virus data sets, covering the seven major virus groups and including influenza virus, Ebola virus, human immunodeficiency virus 1, hepatitis C virus, hepatitis B virus, and human papillomavirus. CASTOR-KRFE provides a weighted average F-measure >0.96 over a wide range of viruses. Our method also shows better performance on complex virus data sets than multiple subsequences extractor for classification (MISSEL), a subsequence extraction method, and the Discriminative mode of the MEME pattern extraction tool.
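
For readers unfamiliar with the reported metric, a weighted average F-measure can be computed as in the sketch below. It assumes the common convention of weighting each class's F1 score by its share of the true labels, which the abstract does not spell out; the function name and the example labels are hypothetical.

```python
from collections import Counter


def weighted_f_measure(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's share of true labels."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        precision = tp / predicted if predicted else 0.0
        recall = tp / n
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n / total) * f1
    return score


# Hypothetical labels for four sequences from two virus classes.
y_true = ["HIV", "HIV", "HBV", "HBV"]
y_pred = ["HIV", "HIV", "HBV", "HIV"]
print(weighted_f_measure(y_true, y_pred))
```

Weighting by class support keeps a large class from being underrepresented in the average when the data sets are imbalanced.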

Dong R, He L, He RL, Yau SST. A Novel Approach to Clustering Genome Sequences Using Inter-nucleotide Covariance. Front Genet 2019;10:234. [PMID: 31024610 PMCID: PMC6465635 DOI: 10.3389/fgene.2019.00234] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 09/07/2018] [Accepted: 03/04/2019] [Indexed: 11/30/2022] Open

Huang HH, Girimurugan SB. Discrete Wavelet Packet Transform Based Discriminant Analysis for Whole Genome Sequences. Stat Appl Genet Mol Biol 2019;18:/j/sagmb.ahead-of-print/sagmb-2018-0045/sagmb-2018-0045.xml. [PMID: 30772870 DOI: 10.1515/sagmb-2018-0045] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]

Wang Y, Tian K, Yau SST. Protein Sequence Classification Using Natural Vector and Convex Hull Method. J Comput Biol 2019;26:315-321. [PMID: 30762422 DOI: 10.1089/cmb.2018.0216] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 12/14/2022] Open

Tian K, Zhao X, Zhang Y, Yau S. Comparing protein structures and inferring functions with a novel three-dimensional Yau-Hausdorff method. J Biomol Struct Dyn 2018;37:4151-4160. [PMID: 30518311 DOI: 10.1080/07391102.2018.1540359] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 01/14/2023]

Huang HH, Hao S, Alarcon S, Yang J. Comparisons of classification methods for viral genomes and protein families using alignment-free vectorization. Stat Appl Genet Mol Biol 2018;17:sagmb-2018-0004. [PMID: 29959888 DOI: 10.1515/sagmb-2018-0004] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/15/2022]

Adetiba E, Olugbara OO, Taiwo TB, Adebiyi MO, Badejo JA, Akanle MB, Matthews VO. Alignment-Free Z-Curve Genomic Cepstral Coefficients and Machine Learning for Classification of Viruses. Bioinformatics and Biomedical Engineering 2018. [PMCID: PMC7120486 DOI: 10.1007/978-3-319-78723-7_25] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/11/2022]

Dong R, Zheng H, Tian K, Yau SC, Mao W, Yu W, Yin C, Yu C, He RL, Yang J, Yau SS. Virus Database and Online Inquiry System Based on Natural Vectors. Evol Bioinform Online 2017;13:1176934317746667. [PMID: 29308007 PMCID: PMC5751915 DOI: 10.1177/1176934317746667] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 07/30/2017] [Accepted: 10/05/2017] [Indexed: 01/09/2023] Open

Zhao X, Tian K, He RL, Yau SST. Establishing the phylogeny of Prochlorococcus with a new alignment-free method. Ecol Evol 2017;7:11057-11065. [PMID: 29299281 PMCID: PMC5743538 DOI: 10.1002/ece3.3535] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 06/07/2017] [Revised: 09/04/2017] [Accepted: 09/14/2017] [Indexed: 11/11/2022] Open

A novel fast vector method for genetic sequence comparison. Sci Rep 2017;7:12226. [PMID: 28939913 PMCID: PMC5610321 DOI: 10.1038/s41598-017-12493-2] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Received: 03/23/2017] [Accepted: 09/07/2017] [Indexed: 11/08/2022] Open

He L, Li Y, He RL, Yau SST. A novel alignment-free vector method to cluster protein sequences. J Theor Biol 2017;427:41-52. [DOI: 10.1016/j.jtbi.2017.06.002] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 02/17/2017] [Revised: 05/04/2017] [Accepted: 06/02/2017] [Indexed: 11/29/2022]

Yu C, Arcos-Burgos M, Licinio J, Wong ML. A latent genetic subtype of major depression identified by whole-exome genotyping data in a Mexican-American cohort. Transl Psychiatry 2017;7:e1134. [PMID: 28509902 PMCID: PMC5534938 DOI: 10.1038/tp.2017.102] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Received: 11/09/2016] [Revised: 04/04/2017] [Accepted: 04/10/2017] [Indexed: 02/07/2023] Open

Yu C, Baune BT, Licinio J, Wong ML. A novel strategy for clustering major depression individuals using whole-genome sequencing variant data. Sci Rep 2017;7:44389. [PMID: 28287625 PMCID: PMC5347377 DOI: 10.1038/srep44389] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Received: 11/02/2016] [Accepted: 02/07/2017] [Indexed: 12/01/2022] Open

Li Y, He L, He RL, Yau SST. Zika and Flaviviruses Phylogeny Based on the Alignment-Free Natural Vector Method. DNA Cell Biol 2016;36:109-116. [PMID: 27977308 DOI: 10.1089/dna.2016.3532] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Indexed: 11/12/2022] Open

Hernandez T, Yang J. Descriptive Statistics of the Genome: Phylogenetic Classification of Viruses. J Comput Biol 2016;23:810-820. [DOI: 10.1089/cmb.2013.0132] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/12/2022] Open

Huang HH, Yu C. Clustering DNA sequences using the out-of-place measure with reduced n-grams. J Theor Biol 2016;406:61-72. [PMID: 27375217 DOI: 10.1016/j.jtbi.2016.06.029] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 12/10/2015] [Revised: 05/18/2016] [Accepted: 06/21/2016] [Indexed: 11/25/2022]

Huang HH. An ensemble distance measure of k-mer and Natural Vector for the phylogenetic analysis of multiple-segmented viruses. J Theor Biol 2016;398:136-144. [DOI: 10.1016/j.jtbi.2016.03.004] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 02/02/2016] [Revised: 02/25/2016] [Accepted: 03/02/2016] [Indexed: 11/29/2022]

Li Y, Tian K, Yin C, He RL, Yau SST. Virus classification in 60-dimensional protein space. Mol Phylogenet Evol 2016;99:53-62. [PMID: 26988414 DOI: 10.1016/j.ympev.2016.03.009] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Received: 06/14/2015] [Revised: 01/24/2016] [Accepted: 03/10/2016] [Indexed: 10/22/2022]

Zhao X, Wan X, He RL, Yau SST. A new method for studying the evolutionary origin of the SAR11 clade marine bacteria. Mol Phylogenet Evol 2016;98:271-279. [PMID: 26926946 DOI: 10.1016/j.ympev.2016.02.015] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 10/03/2015] [Revised: 02/18/2016] [Accepted: 02/18/2016] [Indexed: 12/14/2022]

Development of self-compressing BLSOM for comprehensive analysis of big sequence data. BioMed Research International 2015;2015:506052. [PMID: 26495297 PMCID: PMC4606171 DOI: 10.1155/2015/506052] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 03/27/2015] [Revised: 06/25/2015] [Accepted: 07/12/2015] [Indexed: 11/17/2022]

Tian K, Yang X, Kong Q, Yin C, He RL, Yau SST. Two Dimensional Yau-Hausdorff Distance with Applications on Comparison of DNA and Protein Sequences. PLoS One 2015;10:e0136577. [PMID: 26384293 PMCID: PMC4575136 DOI: 10.1371/journal.pone.0136577] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Received: 03/18/2015] [Accepted: 08/05/2015] [Indexed: 11/20/2022] Open

Progressive alignment of genomic signals by multiple dynamic time warping. J Theor Biol 2015;385:20-30. [PMID: 26300069 DOI: 10.1016/j.jtbi.2015.08.007] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Received: 11/17/2014] [Revised: 07/21/2015] [Accepted: 08/03/2015] [Indexed: 11/22/2022]