Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Wei D, Jiang Q, Wei Y, Wang S. A novel hierarchical clustering algorithm for gene sequences. BMC Bioinformatics 2012;13:174. [PMID: 22823405 PMCID: PMC3443659 DOI: 10.1186/1471-2105-13-174] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.3] [Received: 11/05/2011] [Accepted: 06/30/2012] [Indexed: 11/10/2022] Open

Cited by Other Article(s)

Li X, Wei Z, Hu Y, Zhu X. GraphNABP: Identifying nucleic acid-binding proteins with protein graphs and protein language models. Int J Biol Macromol 2024;280:135599. [PMID: 39276905 DOI: 10.1016/j.ijbiomac.2024.135599] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/26/2024] [Revised: 09/11/2024] [Accepted: 09/11/2024] [Indexed: 09/17/2024]

Bajić V, Schulmann VH, Nowick K. mtDNA "nomenclutter" and its consequences on the interpretation of genetic data. BMC Ecol Evol 2024;24:110. [PMID: 39160470 PMCID: PMC11331612 DOI: 10.1186/s12862-024-02288-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2024] [Accepted: 07/11/2024] [Indexed: 08/21/2024] Open

Abstract

Population-based studies of human mitochondrial genetic diversity often require the classification of mitochondrial DNA (mtDNA) haplotypes into more than 5400 described haplogroups, and further grouping those into hierarchically higher haplogroups. Such secondary haplogroup groupings (e.g., "macro-haplogroups") vary across studies, as they depend on the sample quality, technical factors of haplogroup calling, the aims of the study, and the researchers' understanding of the mtDNA haplogroup nomenclature. Retention of historical nomenclature coupled with a growing number of newly described mtDNA lineages results in increasingly complex and inconsistent nomenclature that does not reflect phylogeny well. This "clutter" leaves room for grouping errors and inconsistencies across scientific publications, especially when the haplogroup names are used as a proxy for secondary groupings, and represents a source for scientific misinterpretation. Here we explore the effects of phylogenetically insensitive secondary mtDNA haplogroup groupings, and the lack of standardized secondary haplogroup groupings on downstream analyses and interpretation of genetic data. We demonstrate that frequency-based analyses produce inconsistent results when different secondary mtDNA groupings are applied, and thus allow for vastly different interpretations of the same genetic data. The lack of guidelines and recommendations on how to choose appropriate secondary haplogroup groupings presents an issue for the interpretation of results, as well as their comparison and reproducibility across studies. To reduce biases originating from arbitrarily defined secondary nomenclature-based groupings, we suggest that future updates of mtDNA phylogenies aimed for the use in mtDNA haplogroup nomenclature should also provide well-defined and standardized sets of phylogenetically meaningful algorithm-based secondary haplogroup groupings such as "macro-haplogroups", "meso-haplogroups", and "micro-haplogroups". 
Ideally, each of the secondary haplogroup grouping levels should be informative about different human population history events. Those phylogenetically informative levels of haplogroup groupings can be easily defined using TreeCluster, and then implemented into haplogroup callers such as HaploGrep3. This would foster reproducibility across studies, provide a grouping standard for population-based studies, and reduce errors associated with haplogroup nomenclatures in future studies.

Kong J, Yao Z, Chen J, Zhao Q, Li T, Dong M, Bai Y, Liu Y, Lin Z, Xie Q, Zhang X. Comparative Transcriptome Analysis Unveils Regulatory Factors Influencing Fatty Liver Development in Lion-Head Geese under High-Intake Feeding Compared to Normal Feeding. Vet Sci 2024;11:366. [PMID: 39195820 PMCID: PMC11359645 DOI: 10.3390/vetsci11080366] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/19/2024] [Revised: 07/13/2024] [Accepted: 08/01/2024] [Indexed: 08/29/2024] Open

Abstract

The lion-head goose is the only large goose species in China, and it is one of the largest goose species in the world. Lion-head geese have a strong tolerance for massive energy intake and show a priority of fat accumulation in liver tissue through special feeding. Therefore, the aim of this study was to investigate the impact of high feed intake compared to normal feeding conditions on the transcriptome changes associated with fatty liver development in lion-head geese. In this study, 20 healthy adult lion-head geese were randomly assigned to a control group (CONTROL, n = 10) and high-intake-fed group (CASE, n = 10). After 38 d of treatment, all geese were sacrificed, and liver samples were collected. Three geese were randomly selected from the CONTROL and CASE groups, respectively, to perform whole-transcriptome analysis to analyze the key regulatory genes. We identified 716 differentially expressed mRNAs, 145 differentially expressed circRNAs, and 39 differentially expressed lncRNAs, including upregulated and downregulated genes. GO enrichment analysis showed that these genes were significantly enriched in molecular function. The node degree analysis and centrality metrics of the mRNA-lncRNA-circRNA triple regulatory network indicate the presence of crucial functional nodes in the network. We identified differentially expressed genes, including HSPB9, Pgk1, Hsp70, ME2, malic enzyme, HSP90, FADS1, transferrin, FABP, PKM2, Serpin2, and PKS, and we additionally confirmed the accuracy of sequencing at the RNA level. In this study, we studied for the first time the important differential genes that regulate fatty liver in high-intake feeding of the lion-head goose. In summary, these differentially expressed genes may play important roles in fatty liver development in the lion-head goose, and the functions and mechanisms should be investigated in future studies.

Affiliation(s)

Jie Kong State Key Laboratory of Swine and Poultry Breeding Industry & Heyuan Branch, Guangdong Provincial Laboratory of Lingnan Modern Agricultural Science and Technology, College of Animal Science, South China Agricultural University, Guangzhou 510642, China; (J.K.); (Z.Y.); (Q.Z.); (T.L.); (M.D.); (Y.B.) Guangdong Provincial Key Lab of AgroAnimal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou 510642, China Guangdong Engineering Research Center for Vector Vaccine of Animal Virus, Guangzhou 510642, China Zhongshan Innovation Center, South China Agricultural University, Zhongshan 528400, China
Ziqi Yao State Key Laboratory of Swine and Poultry Breeding Industry & Heyuan Branch, Guangdong Provincial Laboratory of Lingnan Modern Agricultural Science and Technology, College of Animal Science, South China Agricultural University, Guangzhou 510642, China; (J.K.); (Z.Y.); (Q.Z.); (T.L.); (M.D.); (Y.B.) Guangdong Provincial Key Lab of AgroAnimal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou 510642, China Guangdong Engineering Research Center for Vector Vaccine of Animal Virus, Guangzhou 510642, China Zhongshan Innovation Center, South China Agricultural University, Zhongshan 528400, China
Junpeng Chen Shantou Baisha Research Institute of Original Species of Poultry and Stock, Shantou 515000, China; (J.C.); (Z.L.)
Qiqi Zhao State Key Laboratory of Swine and Poultry Breeding Industry & Heyuan Branch, Guangdong Provincial Laboratory of Lingnan Modern Agricultural Science and Technology, College of Animal Science, South China Agricultural University, Guangzhou 510642, China; (J.K.); (Z.Y.); (Q.Z.); (T.L.); (M.D.); (Y.B.) Guangdong Provincial Key Lab of AgroAnimal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou 510642, China Guangdong Engineering Research Center for Vector Vaccine of Animal Virus, Guangzhou 510642, China Zhongshan Innovation Center, South China Agricultural University, Zhongshan 528400, China
Tong Li State Key Laboratory of Swine and Poultry Breeding Industry & Heyuan Branch, Guangdong Provincial Laboratory of Lingnan Modern Agricultural Science and Technology, College of Animal Science, South China Agricultural University, Guangzhou 510642, China; (J.K.); (Z.Y.); (Q.Z.); (T.L.); (M.D.); (Y.B.) Guangdong Provincial Key Lab of AgroAnimal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou 510642, China Guangdong Engineering Research Center for Vector Vaccine of Animal Virus, Guangzhou 510642, China Zhongshan Innovation Center, South China Agricultural University, Zhongshan 528400, China
Mengyue Dong State Key Laboratory of Swine and Poultry Breeding Industry & Heyuan Branch, Guangdong Provincial Laboratory of Lingnan Modern Agricultural Science and Technology, College of Animal Science, South China Agricultural University, Guangzhou 510642, China; (J.K.); (Z.Y.); (Q.Z.); (T.L.); (M.D.); (Y.B.) Guangdong Provincial Key Lab of AgroAnimal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou 510642, China Guangdong Engineering Research Center for Vector Vaccine of Animal Virus, Guangzhou 510642, China Zhongshan Innovation Center, South China Agricultural University, Zhongshan 528400, China
Yuhang Bai State Key Laboratory of Swine and Poultry Breeding Industry & Heyuan Branch, Guangdong Provincial Laboratory of Lingnan Modern Agricultural Science and Technology, College of Animal Science, South China Agricultural University, Guangzhou 510642, China; (J.K.); (Z.Y.); (Q.Z.); (T.L.); (M.D.); (Y.B.) Guangdong Provincial Key Lab of AgroAnimal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou 510642, China Guangdong Engineering Research Center for Vector Vaccine of Animal Virus, Guangzhou 510642, China Zhongshan Innovation Center, South China Agricultural University, Zhongshan 528400, China
Yuanjia Liu College of Coastal Agricultural Sciences, Guangdong Ocean University, Zhanjiang 524088, China;
Zhenping Lin Shantou Baisha Research Institute of Original Species of Poultry and Stock, Shantou 515000, China; (J.C.); (Z.L.)
Qingmei Xie State Key Laboratory of Swine and Poultry Breeding Industry & Heyuan Branch, Guangdong Provincial Laboratory of Lingnan Modern Agricultural Science and Technology, College of Animal Science, South China Agricultural University, Guangzhou 510642, China; (J.K.); (Z.Y.); (Q.Z.); (T.L.); (M.D.); (Y.B.) Guangdong Provincial Key Lab of AgroAnimal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou 510642, China Guangdong Engineering Research Center for Vector Vaccine of Animal Virus, Guangzhou 510642, China Zhongshan Innovation Center, South China Agricultural University, Zhongshan 528400, China
Xinheng Zhang State Key Laboratory of Swine and Poultry Breeding Industry & Heyuan Branch, Guangdong Provincial Laboratory of Lingnan Modern Agricultural Science and Technology, College of Animal Science, South China Agricultural University, Guangzhou 510642, China; (J.K.); (Z.Y.); (Q.Z.); (T.L.); (M.D.); (Y.B.) Guangdong Provincial Key Lab of AgroAnimal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou 510642, China Guangdong Engineering Research Center for Vector Vaccine of Animal Virus, Guangzhou 510642, China Zhongshan Innovation Center, South China Agricultural University, Zhongshan 528400, China

Wright E. Accurately clustering biological sequences in linear time by relatedness sorting. Nat Commun 2024;15:3047. [PMID: 38589369 PMCID: PMC11001989 DOI: 10.1038/s41467-024-47371-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2022] [Accepted: 03/28/2024] [Indexed: 04/10/2024] Open

KafiKang M, Abeysiriwardana C, Singh VK, Young Koh C, Prichard J, Mor SK, Hendawi A. Analysis of Emerging Variants of Turkey Reovirus using Machine Learning. Brief Bioinform 2024;25:bbae224. [PMID: 38752857 PMCID: PMC11097603 DOI: 10.1093/bib/bbae224] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2023] [Revised: 03/26/2024] [Accepted: 04/25/2024] [Indexed: 05/19/2024] Open

Abstract

Avian reoviruses continue to cause disease in turkeys with varied pathogenicity and tissue tropism. Turkey enteric reovirus has been identified as a causative agent of enteritis or inapparent infections in turkeys. The new emerging variants of turkey reovirus, tentatively named turkey arthritis reovirus (TARV) and turkey hepatitis reovirus (THRV), are linked to tenosynovitis/arthritis and hepatitis, respectively. Turkey arthritis and hepatitis reoviruses are causing significant economic losses to the turkey industry. These infections can lead to poor weight gain, uneven growth, poor feed conversion, increased morbidity and mortality and reduced marketability of commercial turkeys. To combat these issues, detecting and classifying the types of reoviruses in turkey populations is essential. This research aims to employ clustering methods, specifically K-means and Hierarchical clustering, to differentiate three types of turkey reoviruses and identify novel emerging variants. Additionally, it focuses on classifying variants of turkey reoviruses by leveraging various machine learning algorithms such as Support Vector Machines, Naive Bayes, Random Forest, Decision Tree, and deep learning algorithms, including convolutional neural networks (CNNs). The experiments use real turkey reovirus sequence data, allowing for robust analysis and evaluation of the proposed methods. The results indicate that machine learning methods achieve an average accuracy of 92%, F1-Macro of 93% and F1-Weighted of 92% scores in classifying reovirus types. In contrast, the CNN model demonstrates an average accuracy of 85%, F1-Macro of 71% and F1-Weighted of 84% scores in the same classification task. The superior performance of the machine learning classifiers provides valuable insights into reovirus evolution and mutation, aiding in detecting emerging variants of pathogenic TARVs and THRVs.

Jamialahmadi H, Khalili-Tanha G, Nazari E, Rezaei-Tavirani M. Artificial intelligence and bioinformatics: a journey from traditional techniques to smart approaches. GASTROENTEROLOGY AND HEPATOLOGY FROM BED TO BENCH 2024;17:241-252. [PMID: 39308539 PMCID: PMC11413381 DOI: 10.22037/ghfbb.v17i3.2977] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/07/2024] [Accepted: 05/11/2024] [Indexed: 09/25/2024]

Zhang X, Zhang H, Wang Z, Ma X, Luo J, Zhu Y. PWSC: a novel clustering method based on polynomial weight-adjusted sparse clustering for sparse biomedical data and its application in cancer subtyping. BMC Bioinformatics 2023;24:490. [PMID: 38129803 PMCID: PMC10740247 DOI: 10.1186/s12859-023-05595-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2023] [Accepted: 12/04/2023] [Indexed: 12/23/2023] Open

Abstract

BACKGROUND

Clustering analysis is widely used to interpret biomedical data and uncover new knowledge and patterns. However, conventional clustering methods are not effective when dealing with sparse biomedical data. To overcome this limitation, we propose a hierarchical clustering method called polynomial weight-adjusted sparse clustering (PWSC).

RESULTS

The PWSC algorithm adjusts feature weights using a polynomial function, redefines the distances between samples, and performs hierarchical clustering analysis based on these adjusted distances. Additionally, we incorporate a consensus clustering approach to determine the optimal number of classifications. This consensus approach utilizes relative change in the cumulative distribution function to identify the best number of clusters, resulting in more stable clustering results. Leveraging the PWSC algorithm, we successfully classified a cohort of gastric cancer patients, enabling categorization of patients carrying different types of altered genes. Further evaluation using Entropy showed a significant improvement (p = 2.905e-05), while using the Calinski-Harabasz index demonstrates a remarkable 100% improvement in the quality of the best classification compared to conventional algorithms. Similarly, significantly increased entropy (p = 0.0336) and comparable CHI were observed when classifying another colorectal cancer cohort with microbial abundance. The above attempts in cancer subtyping demonstrate that PWSC is highly applicable to different types of biomedical data. To facilitate its application, we have developed a user-friendly tool that implements the PWSC algorithm, which can be accessed at http://pwsc.aiyimed.com/ .

CONCLUSIONS

PWSC addresses the limitations of conventional approaches when clustering sparse biomedical data. By adjusting feature weights and employing consensus clustering, we achieve improved clustering results compared to conventional methods. The PWSC algorithm provides a valuable tool for researchers in the field, enabling more accurate and stable clustering analysis. Its application can enhance our understanding of complex biological systems and contribute to advancements in various biomedical disciplines.
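The pipeline described in the abstract above (weight features with a polynomial function, redefine sample distances, then cluster hierarchically) can be sketched roughly as follows. This is only an illustration of the general idea and not the published PWSC method: the weighting rule in `polynomial_weights` (fraction of non-zero entries raised to `degree`), the weighted Euclidean distance, the average-linkage merge rule, and all function names are assumptions made for the example, and the consensus step for choosing the number of clusters is omitted.

```python
import math


def polynomial_weights(data, degree=2):
    """Hypothetical weighting rule (an assumption, not the PWSC formula):
    each feature gets (fraction of samples with a non-zero value) ** degree."""
    n_samples = len(data)
    n_features = len(data[0])
    return [
        (sum(1 for row in data if row[j] != 0) / n_samples) ** degree
        for j in range(n_features)
    ]


def weighted_distance(x, y, weights):
    """Euclidean distance with per-feature weights."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))


def average_linkage(data, weights, n_clusters):
    """Naive agglomerative clustering: repeatedly merge the two clusters
    with the smallest average pairwise weighted distance."""
    clusters = [[i] for i in range(len(data))]

    def linkage(c1, c2):
        total = sum(
            weighted_distance(data[i], data[j], weights) for i in c1 for j in c2
        )
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = (
            (a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        a, b = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


if __name__ == "__main__":
    # Toy sparse matrix: rows are samples, columns are features.
    samples = [[1, 0, 0], [1, 1, 0], [0, 0, 5], [0, 0, 6]]
    w = polynomial_weights(samples)
    print(average_linkage(samples, w, n_clusters=2))
```

The naive merge loop is cubic in the number of samples and is meant for reading, not for real cohorts; a library implementation of hierarchical clustering would normally be used on a precomputed distance matrix.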

Han R, Qi J, Xue Y, Sun X, Zhang F, Gao X, Li G. HycDemux: a hybrid unsupervised approach for accurate barcoded sample demultiplexing in nanopore sequencing. Genome Biol 2023;24:222. [PMID: 37798751 PMCID: PMC10552309 DOI: 10.1186/s13059-023-03053-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/09/2022] [Accepted: 09/08/2023] [Indexed: 10/07/2023] Open

Luan T, Muralidharan HS, Alshehri M, Mittra I, Pop M. SCRAPT: an iterative algorithm for clustering large 16S rRNA gene data sets. Nucleic Acids Res 2023;51:e46. [PMID: 36912074 PMCID: PMC10164572 DOI: 10.1093/nar/gkad158] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2022] [Revised: 02/01/2023] [Accepted: 02/28/2023] [Indexed: 03/14/2023] Open

Benchmarking machine learning robustness in Covid-19 genome sequence classification. Sci Rep 2023;13:4154. [PMID: 36914815 PMCID: PMC10010240 DOI: 10.1038/s41598-023-31368-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/17/2022] [Accepted: 03/10/2023] [Indexed: 03/16/2023] Open

Semi-supervised and un-supervised clustering: A review and experimental evaluation. INFORM SYST 2023. [DOI: 10.1016/j.is.2023.102178] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/19/2023]

Bohnsack KS, Kaden M, Abel J, Villmann T. Alignment-Free Sequence Comparison: A Systematic Survey From a Machine Learning Perspective. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023;20:119-135. [PMID: 34990369 DOI: 10.1109/tcbb.2022.3140873] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]

Dall'Alba G, Casa PL, Abreu FPD, Notari DL, de Avila E Silva S. A Survey of Biological Data in a Big Data Perspective. BIG DATA 2022;10:279-297. [PMID: 35394342 DOI: 10.1089/big.2020.0383] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/14/2023]

Lombardo SD, Wangsaputra IF, Menche J, Stevens A. Network Approaches for Charting the Transcriptomic and Epigenetic Landscape of the Developmental Origins of Health and Disease. Genes (Basel) 2022;13:764. [PMID: 35627149 PMCID: PMC9141211 DOI: 10.3390/genes13050764] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/04/2022] [Revised: 04/04/2022] [Accepted: 04/13/2022] [Indexed: 02/04/2023] Open

Chiu JKH, Ong RTH. Clustering biological sequences with dynamic sequence similarity threshold. BMC Bioinformatics 2022;23:108. [PMID: 35354426 PMCID: PMC8969259 DOI: 10.1186/s12859-022-04643-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/26/2021] [Accepted: 03/02/2022] [Indexed: 11/10/2022] Open

Gharavi E, Gu A, Zheng G, Smith JP, Cho HJ, Zhang A, Brown DE, Sheffield NC. Embeddings of genomic region sets capture rich biological associations in lower dimensions. Bioinformatics 2021;37:4299-4306. [PMID: 34156475 PMCID: PMC8652032 DOI: 10.1093/bioinformatics/btab439] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 11/18/2020] [Revised: 06/07/2021] [Accepted: 06/15/2021] [Indexed: 11/12/2022] Open

Bohnsack KS, Kaden M, Abel J, Saralajew S, Villmann T. The Resolved Mutual Information Function as a Structural Fingerprint of Biomolecular Sequences for Interpretable Machine Learning Classifiers. ENTROPY (BASEL, SWITZERLAND) 2021;23:1357. [PMID: 34682081 PMCID: PMC8534762 DOI: 10.3390/e23101357] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2021] [Revised: 10/11/2021] [Accepted: 10/14/2021] [Indexed: 11/16/2022]

Huang YS, Cheng WC, Lin CY. Androgenic Sensitivities and Ovarian Gene Expression Profiles Prior to Treatment in Japanese Eel (Anguilla japonica). MARINE BIOTECHNOLOGY (NEW YORK, N.Y.) 2021;23:430-444. [PMID: 34191211 DOI: 10.1007/s10126-021-10035-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/11/2020] [Accepted: 04/28/2021] [Indexed: 06/13/2023]

Bhattacharyya B, Mitra U, Bhattacharyya R. Tandem repeat interval pattern identifies animal taxa. Bioinformatics 2021;37:2250-2258. [PMID: 33677492 DOI: 10.1093/bioinformatics/btab124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2019] [Revised: 12/11/2020] [Accepted: 02/22/2021] [Indexed: 11/14/2022] Open

Macari G, Toti D, Pasquadibisceglie A, Polticelli F. DockingApp RF: A State-of-the-Art Novel Scoring Function for Molecular Docking in a User-Friendly Interface to AutoDock Vina. Int J Mol Sci 2020;21:ijms21249548. [PMID: 33333976 PMCID: PMC7765429 DOI: 10.3390/ijms21249548] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 11/26/2020] [Revised: 12/11/2020] [Accepted: 12/11/2020] [Indexed: 11/28/2022] Open

Paul T, Vainio S, Roning J. Clustering and classification of virus sequence through music communication protocol and wavelet transform. Genomics 2020;113:778-784. [PMID: 33069829 PMCID: PMC7561519 DOI: 10.1016/j.ygeno.2020.10.009] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 08/02/2020] [Accepted: 10/13/2020] [Indexed: 01/19/2023]

Yang A, Zhang W, Wang J, Yang K, Han Y, Zhang L. Review on the Application of Machine Learning Algorithms in the Sequence Data Mining of DNA. Front Bioeng Biotechnol 2020;8:1032. [PMID: 33015010 PMCID: PMC7498545 DOI: 10.3389/fbioe.2020.01032] [Citation(s) in RCA: 36] [Impact Index Per Article: 9.0] [Received: 05/22/2020] [Accepted: 08/10/2020] [Indexed: 11/13/2022] Open

Le Goallec A, Tierney BT, Luber JM, Cofer EM, Kostic AD, Patel CJ. A systematic machine learning and data type comparison yields metagenomic predictors of infant age, sex, breastfeeding, antibiotic usage, country of origin, and delivery type. PLoS Comput Biol 2020;16:e1007895. [PMID: 32392251 PMCID: PMC7241849 DOI: 10.1371/journal.pcbi.1007895] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 10/31/2019] [Revised: 05/21/2020] [Accepted: 04/21/2020] [Indexed: 12/31/2022] Open

Luczak BB, James BT, Girgis HZ. A survey and evaluations of histogram-based statistics in alignment-free sequence comparison. Brief Bioinform 2020;20:1222-1237. [PMID: 29220512 PMCID: PMC6781583 DOI: 10.1093/bib/bbx161] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Received: 08/15/2017] [Revised: 10/13/2017] [Indexed: 11/29/2022] Open

Abstract

Motivation

Since the dawn of the bioinformatics field, sequence alignment scores have been the main method for comparing sequences. However, alignment algorithms are quadratic, requiring long execution time. As alternatives, scientists have developed tens of alignment-free statistics for measuring the similarity between two sequences.

Results

We surveyed tens of alignment-free k-mer statistics. Additionally, we evaluated 33 statistics and multiplicative combinations between the statistics and/or their squares. These statistics are calculated on two k-mer histograms representing two sequences. Our evaluations using global alignment scores revealed that the majority of the statistics are sensitive and capable of finding similar sequences to a query sequence. Therefore, any of these statistics can filter out dissimilar sequences quickly. Further, we observed that multiplicative combinations of the statistics are highly correlated with the identity score. Furthermore, combinations involving sequence length difference or Earth Mover’s distance, which takes the length difference into account, are always among the highest correlated paired statistics with identity scores. Similarly, paired statistics including length difference or Earth Mover’s distance are among the best performers in finding the K-closest sequences. Interestingly, similar performance can be obtained using histograms of shorter words, resulting in reducing the memory requirement and increasing the speed remarkably. Moreover, we found that simple single statistics are sufficient for processing next-generation sequencing reads and for applications relying on local alignment. Finally, we measured the time requirement of each statistic. The survey and the evaluations will help scientists with identifying efficient alternatives to the costly alignment algorithm, saving thousands of computational hours.

Availability

The source code of the benchmarking tool is available as Supplementary Materials.
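As a rough illustration of what the abstract above means by statistics "calculated on two k-mer histograms", the sketch below builds k-mer count histograms for two DNA strings and computes two common histogram statistics plus one paired statistic. It is not the authors' benchmarking tool: the function names, the choice of Manhattan distance and cosine similarity, and in particular the `1 + length difference` factor in `paired_statistic` (added here so that equal-length sequences do not collapse to zero) are assumptions made for the example.

```python
from collections import Counter
from itertools import product
import math


def kmer_histogram(seq, k=3, alphabet="ACGT"):
    """Count overlapping k-mers and return counts in a fixed k-mer order.
    Windows containing symbols outside the alphabet are ignored."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts["".join(p)] for p in product(alphabet, repeat=k)]


def manhattan(h1, h2):
    """Sum of absolute count differences between two histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def cosine_similarity(h1, h2):
    """Cosine of the angle between two histograms (0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(h1, h2))
    norm = math.sqrt(sum(a * a for a in h1)) * math.sqrt(sum(b * b for b in h2))
    return dot / norm if norm else 0.0


def paired_statistic(s1, s2, k=3):
    """Illustrative multiplicative combination (an assumption):
    Manhattan distance scaled by (1 + sequence length difference)."""
    h1, h2 = kmer_histogram(s1, k), kmer_histogram(s2, k)
    return manhattan(h1, h2) * (1 + abs(len(s1) - len(s2)))


if __name__ == "__main__":
    a, b = "ACGTACGTAC", "ACGTTCGTAC"
    ha, hb = kmer_histogram(a), kmer_histogram(b)
    print(manhattan(ha, hb), cosine_similarity(ha, hb), paired_statistic(a, b))
```

Building a histogram is linear in sequence length and comparing two histograms costs at most 4^k operations, which is why such statistics are attractive as a fast filter before quadratic alignment. A smaller `k` shrinks the histogram, in line with the abstract's remark about shorter words.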

James BT, Luczak BB, Girgis HZ. MeShClust: an intelligent tool for clustering DNA sequences. Nucleic Acids Res 2019;46:e83. [PMID: 29718317 PMCID: PMC6101578 DOI: 10.1093/nar/gky315] [Citation(s) in RCA: 42] [Impact Index Per Article: 8.4] [Received: 02/04/2018] [Accepted: 04/13/2018] [Indexed: 11/13/2022] Open

Chen W, Li W, Huang G, Flavel M. The Applications of Clustering Methods in Predicting Protein Functions. CURR PROTEOMICS 2019. [DOI: 10.2174/1570164616666181212114612] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/22/2022]

Meher PK, Sahu TK, Gahoi S, Satpathy S, Rao AR. Evaluating the performance of sequence encoding schemes and machine learning methods for splice sites recognition. Gene 2019;705:113-126. [PMID: 31009682 DOI: 10.1016/j.gene.2019.04.047] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 12/20/2018] [Revised: 03/27/2019] [Accepted: 04/17/2019] [Indexed: 02/02/2023]

Abstract

Identification of splice sites is essential for predicting gene structure. Machine learning-based approaches (MLAs) have been reported to be more successful than rule-based methods for identifying splice sites. However, nucleotide strings must be transformed into numeric features through sequence encoding before they can be used as input to MLAs. In this study, we evaluated the performance of 8 sequence encoding schemes, i.e., Bayes kernel, density and sparse (DS), distribution of tri-nucleotide and 1st order Markov model (DM), frequency difference distance measure (FDDM), paired-nucleotide frequency difference between true and false sites (FDTF), 1st order Markov model (MM1), combination of 1st and 2nd order Markov models (MM1 + MM2) and 2nd order Markov model (MM2), for predicting donor and acceptor splice sites using 5 supervised learning methods (ANN, Bagging, Boosting, RF and SVM). The encoding schemes and machine learning methods were first evaluated in 4 species, i.e., A. thaliana, C. elegans, D. melanogaster and H. sapiens, and the performances were then validated in another four species, i.e., Ciona intestinalis, Dictyostelium discoideum, Phaeodactylum tricornutum and Trypanosoma brucei. In terms of ROC (receiver operating characteristic) and PR (precision-recall) curves, the FDTF encoding approach achieved the highest accuracy, followed by either MM2 or FDDM. Further, SVM achieved the highest accuracy (in terms of ROC and PR curves), followed by RF, across encoding schemes and species. In terms of prediction accuracy across species, the SVM-FDTF combination outperformed the other combinations of classifiers and encoding schemes. Further, splice site prediction accuracies were higher for species with low intron density. To the best of our knowledge, this is the first comprehensive evaluation of sequence encoding schemes for splice site prediction. We have also developed an R package, EncDNA (https://cran.r-project.org/web/packages/EncDNA/index.html), for encoding splice site motifs with different encoding schemes, which is expected to supplement existing nucleotide sequence encoding approaches. This study is expected to be useful to computational biologists for predicting different functional elements in genomic DNA.
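The encoding step described in the abstract above, turning a nucleotide string into numeric features before it reaches a classifier, can be illustrated with the simplest scheme, a sparse (one-hot) encoding. This is a minimal sketch and not the EncDNA implementation: the function name `sparse_encode`, the A/C/G/T bit order, and the all-zero treatment of ambiguous bases are assumptions made for illustration.

```python
from typing import List

# Assumed convention for this sketch: bit order A, C, G, T.
NUCLEOTIDES = "ACGT"


def sparse_encode(seq: str) -> List[int]:
    """One-hot ("sparse") encode a DNA window into a flat 0/1 vector.

    The result has length 4 * len(seq): four indicator bits per position.
    """
    vector: List[int] = []
    for base in seq.upper():
        # Bases outside A/C/G/T (e.g. N) become four zeros; this is an
        # assumption of the sketch, not a rule taken from the paper.
        vector.extend(1 if base == ref else 0 for ref in NUCLEOTIDES)
    return vector


if __name__ == "__main__":
    # A hypothetical 6-nt window around a donor site.
    print(sparse_encode("GTAAGT"))
```

A fixed-length vector of this kind is what a supervised learner such as an SVM or random forest would take as one input row; the other schemes named in the abstract replace the indicator bits with frequency- or probability-based values.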


Dall'Alba G, Casa PL, Notari DL, Adami AG, Echeverrigaray S, de Avila E Silva S. Analysis of the nucleotide content of Escherichia coli promoter sequences related to the alternative sigma factors. J Mol Recognit 2018;32:e2770. [PMID: 30458580 DOI: 10.1002/jmr.2770] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/08/2018] [Revised: 10/23/2018] [Accepted: 10/24/2018] [Indexed: 01/26/2023]

Silveira MC, Azevedo da Silva R, Faria da Mota F, Catanho M, Jardim R, R Guimarães AC, de Miranda AB. Systematic Identification and Classification of β-Lactamases Based on Sequence Similarity Criteria: β-Lactamase Annotation. Evol Bioinform Online 2018;14:1176934318797351. [PMID: 30210232 PMCID: PMC6131288 DOI: 10.1177/1176934318797351] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/15/2018] [Accepted: 08/08/2018] [Indexed: 12/11/2022] Open

Sriwanna K, Boongoen T, Iam-On N. Graph clustering-based discretization approach to microarray data. Knowl Inf Syst 2018. [DOI: 10.1007/s10115-018-1249-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/28/2022]

Lin J, Adjeroh DA, Jiang BH, Jiang Y. K2 and K2*: efficient alignment-free sequence similarity measurement based on Kendall statistics. Bioinformatics 2018;34:1682-1689. [PMID: 29253072 PMCID: PMC6355110 DOI: 10.1093/bioinformatics/btx809] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/08/2017] [Revised: 12/11/2017] [Accepted: 12/14/2017] [Indexed: 11/13/2022] Open

Lin J, Wei J, Adjeroh D, Jiang BH, Jiang Y. SSAW: A new sequence similarity analysis method based on the stationary discrete wavelet transform. BMC Bioinformatics 2018;19:165. [PMID: 29720081 PMCID: PMC5930706 DOI: 10.1186/s12859-018-2155-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/19/2017] [Accepted: 04/11/2018] [Indexed: 11/10/2022] Open

Guo G, Chen L, Ye Y, Jiang Q. Cluster Validation Method for Determining the Number of Clusters in Categorical Sequences. IEEE Trans Neural Netw Learn Syst 2017;28:2936-2948. [PMID: 28114078 DOI: 10.1109/tnnls.2016.2608354] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]

Yuan L, Wang W, Chen L. Two-stage pruning method for gram-based categorical sequence clustering. Int J Mach Learn Cybern 2017. [DOI: 10.1007/s13042-017-0744-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]

Zielezinski A, Vinga S, Almeida J, Karlowski WM. Alignment-free sequence comparison: benefits, applications, and tools. Genome Biol 2017;18:186. [PMID: 28974235 PMCID: PMC5627421 DOI: 10.1186/s13059-017-1319-7] [Citation(s) in RCA: 248] [Impact Index Per Article: 35.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/30/2023] Open

Amiri S, Dinov ID. Comparison of genomic data via statistical distribution. J Theor Biol 2016;407:318-327. [PMID: 27460589 PMCID: PMC5361063 DOI: 10.1016/j.jtbi.2016.07.032] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/07/2016] [Revised: 06/22/2016] [Accepted: 07/20/2016] [Indexed: 11/28/2022]

Taylor WR. Reduction, alignment and visualisation of large diverse sequence families. BMC Bioinformatics 2016;17:300. [PMID: 27484804 PMCID: PMC4971687 DOI: 10.1186/s12859-016-1059-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/11/2016] [Accepted: 04/21/2016] [Indexed: 11/10/2022] Open

Dubey AK, Gupta U, Jain S. Analysis of k-means clustering approach on the breast cancer Wisconsin dataset. Int J Comput Assist Radiol Surg 2016;11:2033-2047. [PMID: 27311823 DOI: 10.1007/s11548-016-1437-9] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/15/2016] [Accepted: 05/27/2016] [Indexed: 11/25/2022]

Abstract

PURPOSE

Breast cancer is one of the most common cancers worldwide and is found most frequently in women. Early detection of breast cancer makes a cure possible; therefore, a large number of studies are under way to identify methods that can detect breast cancer in its early stages. This study aimed to determine the effects of the k-means clustering algorithm with different computation measures, such as centroid, distance, split method, epoch, attribute, and iteration, and to identify the combination of measures with the potential for high clustering accuracy.

METHODS

The k-means algorithm was used to evaluate the impact of centroid initialization, distance measures, and split methods on clustering. The experiments were performed using the breast cancer Wisconsin (BCW) diagnostic dataset. Foggy and random centroids were used for centroid initialization. For the foggy centroid, the first centroid was calculated from random values. For the random centroid, the initial centroid was taken as (0, 0).

RESULTS

The results were obtained by employing the k-means algorithm and are discussed for different cases with variable parameters. The calculations were based on the centroid (foggy/random), distance (Euclidean/Manhattan/Pearson), split (simple/variance), threshold (constant epoch/same centroid), attribute (2-9), and iteration (4-10). An average positive prediction accuracy of approximately 92% was obtained with this approach. Better results were found for the same centroid and the highest variance. The results achieved using the Euclidean and Manhattan distances were better than those using the Pearson correlation.

CONCLUSIONS

The findings of this work provide an extensive understanding of the computational parameters that can be used with k-means. The results indicate that k-means has the potential to classify the BCW dataset.
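The parameters compared in the abstract above (distance measure, iteration cap, and a "same centroid" stopping rule) can be made concrete with a small k-means sketch. It is illustrative only and does not reproduce the authors' setup: the function names are hypothetical, the initial centroids are simply sampled from the data rather than using the foggy or (0, 0) initialization described in the paper, and the Pearson distance and split methods are omitted.

```python
import random
from typing import Callable, List, Sequence, Tuple

Point = Sequence[float]


def euclidean(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def manhattan(a: Point, b: Point) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


def kmeans(
    points: List[Point],
    k: int,
    distance: Callable[[Point, Point], float] = euclidean,
    max_iter: int = 10,
    seed: int = 0,
) -> Tuple[List[int], List[List[float]]]:
    """Plain k-means with a pluggable distance for the assignment step."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(list(points), k)]
    labels = [0] * len(points)
    for _ in range(max_iter):  # iteration cap ("constant epoch")
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: distance(p, centroids[c])) for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    [sum(col) / len(members) for col in zip(*members)]
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:  # "same centroid" stopping rule
            break
        centroids = new_centroids
    return labels, centroids


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (10, 10), (10, 11)]
    for dist in (euclidean, manhattan):
        print(dist.__name__, kmeans(data, 2, distance=dist))
```

Passing `distance=manhattan` changes only the assignment step; the centroid update still uses the arithmetic mean, which is a simplification of this sketch.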


Ahmad M, Jung LT, Bhuiyan MAA. On fuzzy semantic similarity measure for DNA coding. Comput Biol Med 2015;69:144-51. [PMID: 26773936 DOI: 10.1016/j.compbiomed.2015.12.017] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/14/2015] [Revised: 12/22/2015] [Accepted: 12/23/2015] [Indexed: 11/28/2022]

Romanel A, Lago S, Prandi D, Sboner A, Demichelis F. ASEQ: fast allele-specific studies from next-generation sequencing data. BMC Med Genomics 2015;8:9. [PMID: 25889339 PMCID: PMC4363342 DOI: 10.1186/s12920-015-0084-2] [Citation(s) in RCA: 48] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/17/2014] [Accepted: 02/12/2015] [Indexed: 11/17/2022] Open

Schmidt TSB, Matias Rodrigues JF, von Mering C. Limits to robustness and reproducibility in the demarcation of operational taxonomic units. Environ Microbiol 2014;17:1689-706. [PMID: 25156547 DOI: 10.1111/1462-2920.12610] [Citation(s) in RCA: 69] [Impact Index Per Article: 6.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/25/2014] [Accepted: 08/21/2014] [Indexed: 11/27/2022]

Bao J, Yuan R, Bao Z. An improved alignment-free model for DNA sequence similarity metric. BMC Bioinformatics 2014;15:321. [PMID: 25261973 PMCID: PMC4261891 DOI: 10.1186/1471-2105-15-321] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/09/2013] [Accepted: 09/23/2014] [Indexed: 11/23/2022] Open

Abstract

BACKGROUND

DNA clustering is an important technique for automatically finding the inherent relationships among large numbers of DNA sequences, but its quality can still be improved greatly. The DNA sequence similarity metric is one of the key components of clustering. The alignment-free methodology is a very popular way to calculate DNA sequence similarity. It normally converts a sequence into a feature space based on the probability distribution of words rather than matching strings directly. Existing alignment-free models, e.g. k-tuple, employ only word frequency information and ignore many types of useful information contained in the DNA sequence, such as the classification and position of nucleotide bases. Better data mining results are expected when compounded information is used. Therefore, we present a new alignment-free model that employs compounded information to improve the DNA clustering quality.

RESULTS

This paper proposes a Category-Position-Frequency (CPF) model, which utilizes the word frequency, position and classification information of nucleotide bases from DNA sequences. The CPF model converts a DNA sequence into three sequences according to the categories of nucleotide bases, and then yields a 12-dimensional feature vector. The feature values are computed by an entropy-based model that takes both local word frequency and position information into account. We conduct DNA clustering experiments on several datasets and compare the results with those of several mainstream alignment-free models, including k-tuple, DMk, TSM, AMI and CV. The experiments show that the CPF model is superior to the other models in terms of clustering results and optimal settings.

CONCLUSIONS

The following conclusions can be drawn from the experiments. (1) The hybrid information model is better than the model based on word frequency only. (2) For DNA sequences of no more than 5000 characters, the preferred sliding window size for CPF is two, which greatly benefits system performance. (3) The CPF model achieves efficient, stable performance and generalizes broadly.
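For context, the k-tuple baseline that the abstract above compares against can be sketched as follows: each sequence is mapped to a vector of k-word frequencies, and two sequences are compared by a distance between those vectors. This is a minimal illustration of the word-frequency idea only, not the CPF model; the function names and the choice of Euclidean distance are assumptions.

```python
from collections import Counter
from itertools import product
from math import sqrt
from typing import List


def ktuple_frequencies(seq: str, k: int = 2) -> List[float]:
    """Relative frequencies of all 4**k DNA words, in lexicographic A/C/G/T order."""
    seq = seq.upper()
    words = ["".join(w) for w in product("ACGT", repeat=k)]
    # Count overlapping words with a sliding window of width k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Words containing characters outside A/C/G/T are ignored.
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]


def euclidean_distance(u: List[float], v: List[float]) -> float:
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    a = ktuple_frequencies("ACGTACGTAC", k=2)
    b = ktuple_frequencies("ACGTTTTTAC", k=2)
    print(euclidean_distance(a, b))
```

A pairwise distance matrix built this way can be passed to any standard clustering routine; as the abstract notes, such vectors discard position and base-category information.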


Vinga S. Information theory applications for biological sequence analysis. Brief Bioinform 2014;15:376-89. [PMID: 24058049 PMCID: PMC7109941 DOI: 10.1093/bib/bbt068] [Citation(s) in RCA: 67] [Impact Index Per Article: 6.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/11/2013] [Accepted: 08/17/2013] [Indexed: 01/13/2023] Open