1
Li X, Wei Z, Hu Y, Zhu X. GraphNABP: Identifying nucleic acid-binding proteins with protein graphs and protein language models. Int J Biol Macromol 2024; 280:135599. [PMID: 39276905 DOI: 10.1016/j.ijbiomac.2024.135599] [Received: 06/26/2024] [Revised: 09/11/2024] [Accepted: 09/11/2024] [Indexed: 09/17/2024]
Abstract
The computational identification of nucleic acid-binding proteins (NABPs) is of great significance for understanding the mechanisms of these biological activities and for drug discovery. Although a number of sequence-based methods have been proposed to predict NABPs and have achieved promising performance, structural information is often overlooked. Moreover, the power of popular protein language models (pLMs) has seldom been harnessed for predicting NABPs. In this study, we propose a novel framework called GraphNABP to predict NABPs by integrating sequence and predicted 3D structure information. Specifically, sequence embeddings and protein molecular graphs were first obtained from the ProtT5 protein language model and predicted 3D structures, respectively. Then, graph attention (GAT) and bidirectional long short-term memory (BiLSTM) neural networks were used to enhance feature representations. Finally, a fully connected layer was used to predict NABPs. To the best of our knowledge, this is the first study to integrate AlphaFold and protein language models for the prediction of NABPs. The performance on multiple independent test sets indicates that GraphNABP outperforms other state-of-the-art methods. Our results demonstrate the effectiveness of pLM embeddings and structural information for NABP prediction. The code and data used in this study are available at https://github.com/lixiangli01/GraphNABP.
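As a rough illustration of how a protein graph might be derived from a predicted 3D structure, the sketch below connects residues whose coordinates lie within a distance cutoff. The residue-level representation, the 8 Å cutoff, the function name `contact_edges`, and the toy coordinates are all assumptions made for illustration; the abstract does not specify how GraphNABP builds its graphs.

```python
import math

def contact_edges(coords, cutoff=8.0):
    """Return undirected edges (i, j) between points closer than `cutoff`.

    `coords` is a list of (x, y, z) tuples, one per residue (hypothetical
    input; e.g. one representative atom per residue of a predicted structure).
    """
    edges = []
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                edges.append((i, j))
    return edges

# Toy chain of four residues; the last one is far from the others.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (30.0, 0.0, 0.0)]
toy_edges = contact_edges(toy_coords)
```

An edge list of this kind is the usual input to graph neural network layers such as graph attention, with per-residue embeddings attached as node features.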
Affiliation(s)
- Xiang Li
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui 230036, China
- Zhuoyu Wei
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui 230036, China
- Yueran Hu
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui 230036, China
- Xiaolei Zhu
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui 230036, China.
2
Bajić V, Schulmann VH, Nowick K. mtDNA "nomenclutter" and its consequences on the interpretation of genetic data. BMC Ecol Evol 2024; 24:110. [PMID: 39160470 PMCID: PMC11331612 DOI: 10.1186/s12862-024-02288-1] [Received: 07/03/2024] [Accepted: 07/11/2024] [Indexed: 08/21/2024] Open
Abstract
Population-based studies of human mitochondrial genetic diversity often require the classification of mitochondrial DNA (mtDNA) haplotypes into more than 5400 described haplogroups, and further grouping those into hierarchically higher haplogroups. Such secondary haplogroup groupings (e.g., "macro-haplogroups") vary across studies, as they depend on the sample quality, technical factors of haplogroup calling, the aims of the study, and the researchers' understanding of the mtDNA haplogroup nomenclature. Retention of historical nomenclature coupled with a growing number of newly described mtDNA lineages results in increasingly complex and inconsistent nomenclature that does not reflect phylogeny well. This "clutter" leaves room for grouping errors and inconsistencies across scientific publications, especially when the haplogroup names are used as a proxy for secondary groupings, and represents a source for scientific misinterpretation. Here we explore the effects of phylogenetically insensitive secondary mtDNA haplogroup groupings, and the lack of standardized secondary haplogroup groupings on downstream analyses and interpretation of genetic data. We demonstrate that frequency-based analyses produce inconsistent results when different secondary mtDNA groupings are applied, and thus allow for vastly different interpretations of the same genetic data. The lack of guidelines and recommendations on how to choose appropriate secondary haplogroup groupings presents an issue for the interpretation of results, as well as their comparison and reproducibility across studies. To reduce biases originating from arbitrarily defined secondary nomenclature-based groupings, we suggest that future updates of mtDNA phylogenies aimed for the use in mtDNA haplogroup nomenclature should also provide well-defined and standardized sets of phylogenetically meaningful algorithm-based secondary haplogroup groupings such as "macro-haplogroups", "meso-haplogroups", and "micro-haplogroups". 
Ideally, each of the secondary haplogroup grouping levels should be informative about different human population history events. Those phylogenetically informative levels of haplogroup groupings can be easily defined using TreeCluster, and then implemented into haplogroup callers such as HaploGrep3. This would foster reproducibility across studies, provide a grouping standard for population-based studies, and reduce errors associated with haplogroup nomenclatures in future studies.
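The effect described above can be illustrated with a toy calculation in which the same haplogroup calls yield different frequency tables under two secondary groupings. The sample calls and both grouping rules are hypothetical; the second rule folds haplogroup K into U, reflecting that K is a subclade of U8 even though its name suggests a separate lineage.

```python
from collections import Counter

# Hypothetical haplogroup calls for six samples.
calls = ["H1", "H2", "U5", "K1", "K1", "U5"]

def by_first_letter(haplogroup):
    """Nomenclature-based grouping: use the leading letter of the name."""
    return haplogroup[0]

def by_phylogeny(haplogroup):
    """Phylogeny-aware grouping (toy rule): place K within U."""
    letter = haplogroup[0]
    return "U" if letter == "K" else letter

def frequencies(calls, grouping):
    """Relative frequency of each secondary group among the calls."""
    counts = Counter(grouping(h) for h in calls)
    total = len(calls)
    return {group: n / total for group, n in counts.items()}

naive = frequencies(calls, by_first_letter)   # H, U and K each at one third
aware = frequencies(calls, by_phylogeny)      # H at one third, U at two thirds
```

The underlying data are identical in both cases; only the grouping rule changes the apparent frequency of U.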
Affiliation(s)
- Vladimir Bajić
- Human Biology and Primate Evolution, Freie Universität Berlin, Berlin, Germany.
- Katja Nowick
- Human Biology and Primate Evolution, Freie Universität Berlin, Berlin, Germany.
3
Kong J, Yao Z, Chen J, Zhao Q, Li T, Dong M, Bai Y, Liu Y, Lin Z, Xie Q, Zhang X. Comparative Transcriptome Analysis Unveils Regulatory Factors Influencing Fatty Liver Development in Lion-Head Geese under High-Intake Feeding Compared to Normal Feeding. Vet Sci 2024; 11:366. [PMID: 39195820 PMCID: PMC11359645 DOI: 10.3390/vetsci11080366] [Received: 05/19/2024] [Revised: 07/13/2024] [Accepted: 08/01/2024] [Indexed: 08/29/2024] Open
Abstract
The lion-head goose is the only large goose species in China, and it is one of the largest goose species in the world. Lion-head geese have a strong tolerance for massive energy intake and show a priority of fat accumulation in liver tissue through special feeding. Therefore, the aim of this study was to investigate the impact of high feed intake compared to normal feeding conditions on the transcriptome changes associated with fatty liver development in lion-head geese. In this study, 20 healthy adult lion-head geese were randomly assigned to a control group (CONTROL, n = 10) and high-intake-fed group (CASE, n = 10). After 38 d of treatment, all geese were sacrificed, and liver samples were collected. Three geese were randomly selected from the CONTROL and CASE groups, respectively, to perform whole-transcriptome analysis to analyze the key regulatory genes. We identified 716 differentially expressed mRNAs, 145 differentially expressed circRNAs, and 39 differentially expressed lncRNAs, including upregulated and downregulated genes. GO enrichment analysis showed that these genes were significantly enriched in molecular function. The node degree analysis and centrality metrics of the mRNA-lncRNA-circRNA triple regulatory network indicate the presence of crucial functional nodes in the network. We identified differentially expressed genes, including HSPB9, Pgk1, Hsp70, ME2, malic enzyme, HSP90, FADS1, transferrin, FABP, PKM2, Serpin2, and PKS, and we additionally confirmed the accuracy of sequencing at the RNA level. In this study, we studied for the first time the important differential genes that regulate fatty liver in high-intake feeding of the lion-head goose. In summary, these differentially expressed genes may play important roles in fatty liver development in the lion-head goose, and the functions and mechanisms should be investigated in future studies.
Affiliation(s)
- Jie Kong
- State Key Laboratory of Swine and Poultry Breeding Industry & Heyuan Branch, Guangdong Provincial Laboratory of Lingnan Modern Agricultural Science and Technology, College of Animal Science, South China Agricultural University, Guangzhou 510642, China; (J.K.); (Z.Y.); (Q.Z.); (T.L.); (M.D.); (Y.B.)
- Guangdong Provincial Key Lab of AgroAnimal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou 510642, China
- Guangdong Engineering Research Center for Vector Vaccine of Animal Virus, Guangzhou 510642, China
- Zhongshan Innovation Center, South China Agricultural University, Zhongshan 528400, China
- Ziqi Yao
- State Key Laboratory of Swine and Poultry Breeding Industry & Heyuan Branch, Guangdong Provincial Laboratory of Lingnan Modern Agricultural Science and Technology, College of Animal Science, South China Agricultural University, Guangzhou 510642, China; (J.K.); (Z.Y.); (Q.Z.); (T.L.); (M.D.); (Y.B.)
- Guangdong Provincial Key Lab of AgroAnimal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou 510642, China
- Guangdong Engineering Research Center for Vector Vaccine of Animal Virus, Guangzhou 510642, China
- Zhongshan Innovation Center, South China Agricultural University, Zhongshan 528400, China
- Junpeng Chen
- Shantou Baisha Research Institute of Original Species of Poultry and Stock, Shantou 515000, China; (J.C.); (Z.L.)
- Qiqi Zhao
- State Key Laboratory of Swine and Poultry Breeding Industry & Heyuan Branch, Guangdong Provincial Laboratory of Lingnan Modern Agricultural Science and Technology, College of Animal Science, South China Agricultural University, Guangzhou 510642, China; (J.K.); (Z.Y.); (Q.Z.); (T.L.); (M.D.); (Y.B.)
- Guangdong Provincial Key Lab of AgroAnimal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou 510642, China
- Guangdong Engineering Research Center for Vector Vaccine of Animal Virus, Guangzhou 510642, China
- Zhongshan Innovation Center, South China Agricultural University, Zhongshan 528400, China
- Tong Li
- State Key Laboratory of Swine and Poultry Breeding Industry & Heyuan Branch, Guangdong Provincial Laboratory of Lingnan Modern Agricultural Science and Technology, College of Animal Science, South China Agricultural University, Guangzhou 510642, China; (J.K.); (Z.Y.); (Q.Z.); (T.L.); (M.D.); (Y.B.)
- Guangdong Provincial Key Lab of AgroAnimal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou 510642, China
- Guangdong Engineering Research Center for Vector Vaccine of Animal Virus, Guangzhou 510642, China
- Zhongshan Innovation Center, South China Agricultural University, Zhongshan 528400, China
- Mengyue Dong
- State Key Laboratory of Swine and Poultry Breeding Industry & Heyuan Branch, Guangdong Provincial Laboratory of Lingnan Modern Agricultural Science and Technology, College of Animal Science, South China Agricultural University, Guangzhou 510642, China; (J.K.); (Z.Y.); (Q.Z.); (T.L.); (M.D.); (Y.B.)
- Guangdong Provincial Key Lab of AgroAnimal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou 510642, China
- Guangdong Engineering Research Center for Vector Vaccine of Animal Virus, Guangzhou 510642, China
- Zhongshan Innovation Center, South China Agricultural University, Zhongshan 528400, China
- Yuhang Bai
- State Key Laboratory of Swine and Poultry Breeding Industry & Heyuan Branch, Guangdong Provincial Laboratory of Lingnan Modern Agricultural Science and Technology, College of Animal Science, South China Agricultural University, Guangzhou 510642, China; (J.K.); (Z.Y.); (Q.Z.); (T.L.); (M.D.); (Y.B.)
- Guangdong Provincial Key Lab of AgroAnimal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou 510642, China
- Guangdong Engineering Research Center for Vector Vaccine of Animal Virus, Guangzhou 510642, China
- Zhongshan Innovation Center, South China Agricultural University, Zhongshan 528400, China
- Yuanjia Liu
- College of Coastal Agricultural Sciences, Guangdong Ocean University, Zhanjiang 524088, China;
- Zhenping Lin
- Shantou Baisha Research Institute of Original Species of Poultry and Stock, Shantou 515000, China; (J.C.); (Z.L.)
- Qingmei Xie
- State Key Laboratory of Swine and Poultry Breeding Industry & Heyuan Branch, Guangdong Provincial Laboratory of Lingnan Modern Agricultural Science and Technology, College of Animal Science, South China Agricultural University, Guangzhou 510642, China; (J.K.); (Z.Y.); (Q.Z.); (T.L.); (M.D.); (Y.B.)
- Guangdong Provincial Key Lab of AgroAnimal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou 510642, China
- Guangdong Engineering Research Center for Vector Vaccine of Animal Virus, Guangzhou 510642, China
- Zhongshan Innovation Center, South China Agricultural University, Zhongshan 528400, China
- Xinheng Zhang
- State Key Laboratory of Swine and Poultry Breeding Industry & Heyuan Branch, Guangdong Provincial Laboratory of Lingnan Modern Agricultural Science and Technology, College of Animal Science, South China Agricultural University, Guangzhou 510642, China; (J.K.); (Z.Y.); (Q.Z.); (T.L.); (M.D.); (Y.B.)
- Guangdong Provincial Key Lab of AgroAnimal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou 510642, China
- Guangdong Engineering Research Center for Vector Vaccine of Animal Virus, Guangzhou 510642, China
- Zhongshan Innovation Center, South China Agricultural University, Zhongshan 528400, China
4
Wright E. Accurately clustering biological sequences in linear time by relatedness sorting. Nat Commun 2024; 15:3047. [PMID: 38589369 PMCID: PMC11001989 DOI: 10.1038/s41467-024-47371-9] [Received: 10/27/2022] [Accepted: 03/28/2024] [Indexed: 04/10/2024] Open
Abstract
Clustering biological sequences into similar groups is an increasingly important task as the number of available sequences continues to grow exponentially. Search-based approaches to clustering scale super-linearly with the number of input sequences, making it impractical to cluster very large sets of sequences. Approaches to clustering sequences in linear time currently lack the accuracy of super-linear approaches. Here, I set out to develop and characterize a strategy for clustering with linear time complexity that retains the accuracy of less scalable approaches. The resulting algorithm, named Clusterize, sorts sequences by relatedness to linearize the clustering problem. Clusterize produces clusters with accuracy rivaling popular programs (CD-HIT, MMseqs2, and UCLUST) but exhibits linear asymptotic scalability. Clusterize generates higher accuracy and oftentimes much larger clusters than Linclust, a fast linear time clustering algorithm. I demonstrate the utility of Clusterize by accurately solving different clustering problems involving millions of nucleotide or protein sequences.
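The general idea of sorting sequences so that clustering needs only local comparisons can be sketched as follows. This is not the Clusterize algorithm itself: the k-mer Jaccard similarity, the choice of the first sequence as the sorting reference, the `threshold` and `window` parameters, and all function names are assumptions chosen to keep the example small.

```python
def kmers(seq, k=3):
    """Set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two k-mer sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def sorted_sweep_cluster(seqs, k=3, threshold=0.5, window=2):
    """Sort by similarity to a reference, then compare each sequence only
    with the representatives of the `window` most recent clusters."""
    profiles = [kmers(s, k) for s in seqs]
    reference = profiles[0]
    order = sorted(range(len(seqs)), key=lambda i: -jaccard(profiles[i], reference))
    labels = [None] * len(seqs)
    representatives = []  # (cluster id, k-mer profile), most recent last
    for i in order:
        for cluster_id, rep in reversed(representatives[-window:]):
            if jaccard(profiles[i], rep) >= threshold:
                labels[i] = cluster_id
                break
        else:
            # No recent representative is similar enough: start a new cluster.
            labels[i] = len(representatives)
            representatives.append((len(representatives), profiles[i]))
    return labels

toy_seqs = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "TTTTGGGGCA"]
toy_labels = sorted_sweep_cluster(toy_seqs)
```

Because each sequence is compared with at most `window` representatives after the sort, the sweep itself performs a number of comparisons proportional to the number of sequences.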
Affiliation(s)
- Erik Wright
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA.
- Center for Evolutionary Biology and Medicine, Pittsburgh, PA, USA.
5
KafiKang M, Abeysiriwardana C, Singh VK, Young Koh C, Prichard J, Mor SK, Hendawi A. Analysis of Emerging Variants of Turkey Reovirus using Machine Learning. Brief Bioinform 2024; 25:bbae224. [PMID: 38752857 PMCID: PMC11097603 DOI: 10.1093/bib/bbae224] [Received: 10/27/2023] [Revised: 03/26/2024] [Accepted: 04/25/2024] [Indexed: 05/19/2024] Open
Abstract
Avian reoviruses continue to cause disease in turkeys with varied pathogenicity and tissue tropism. Turkey enteric reovirus has been identified as a causative agent of enteritis or inapparent infections in turkeys. The new emerging variants of turkey reovirus, tentatively named turkey arthritis reovirus (TARV) and turkey hepatitis reovirus (THRV), are linked to tenosynovitis/arthritis and hepatitis, respectively. Turkey arthritis and hepatitis reoviruses are causing significant economic losses to the turkey industry. These infections can lead to poor weight gain, uneven growth, poor feed conversion, increased morbidity and mortality and reduced marketability of commercial turkeys. To combat these issues, detecting and classifying the types of reoviruses in turkey populations is essential. This research aims to employ clustering methods, specifically K-means and Hierarchical clustering, to differentiate three types of turkey reoviruses and identify novel emerging variants. Additionally, it focuses on classifying variants of turkey reoviruses by leveraging various machine learning algorithms such as Support Vector Machines, Naive Bayes, Random Forest, Decision Tree, and deep learning algorithms, including convolutional neural networks (CNNs). The experiments use real turkey reovirus sequence data, allowing for robust analysis and evaluation of the proposed methods. The results indicate that machine learning methods achieve an average accuracy of 92%, F1-Macro of 93% and F1-Weighted of 92% scores in classifying reovirus types. In contrast, the CNN model demonstrates an average accuracy of 85%, F1-Macro of 71% and F1-Weighted of 84% scores in the same classification task. The superior performance of the machine learning classifiers provides valuable insights into reovirus evolution and mutation, aiding in detecting emerging variants of pathogenic TARVs and THRVs.
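The difference between the F1-Macro and F1-Weighted scores quoted above can be made concrete with a small calculation. The labels reuse the TARV and THRV abbreviations from the abstract, but the true and predicted label lists are invented toy data, and the function names are hypothetical.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """F1 score for each class, computed from TP, FP and FN counts."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)

def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 scores weighted by class support in y_true."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)

# Toy data: four TARV samples, two THRV samples, one THRV misclassified.
y_true = ["TARV", "TARV", "TARV", "TARV", "THRV", "THRV"]
y_pred = ["TARV", "TARV", "TARV", "TARV", "TARV", "THRV"]
macro = macro_f1(y_true, y_pred)        # 7/9
weighted = weighted_f1(y_true, y_pred)  # 22/27
```

In this toy case the macro score is lower than the weighted score because the smaller class is predicted less well. A similar gap between the two metrics is what one would expect when rarer classes are harder to classify, although the abstract does not report class sizes.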
Affiliation(s)
- Maryam KafiKang
- Computer Science Department, University of Rhode Island, Kingston, 02881, RI, USA
- Vikash K Singh
- Department of Veterinary Population Medicine, and Veterinary Diagnostic Laboratory, University of Minnesota, Saint Paul, 55108, MN, USA
- Chan Young Koh
- Computer Science Department, University of Rhode Island, Kingston, 02881, RI, USA
- Janet Prichard
- Department of Information Systems and Analytics, Bryant University, Smithfield, 02917, RI, USA
- Sunil K Mor
- Department of Veterinary Population Medicine, and Veterinary Diagnostic Laboratory, University of Minnesota, Saint Paul, 55108, MN, USA
- Department of Veterinary and Biomedical Sciences and Animal Disease Research & Diagnostic Laboratory, South Dakota State University, Brookings, 57007, SD, USA
- Abdeltawab Hendawi
- Computer Science Department, University of Rhode Island, Kingston, 02881, RI, USA
- Faculty of Computers and Artificial Intelligence, Cairo University, Giza, Egypt
6
Jamialahmadi H, Khalili-Tanha G, Nazari E, Rezaei-Tavirani M. Artificial intelligence and bioinformatics: a journey from traditional techniques to smart approaches. Gastroenterology and Hepatology from Bed to Bench 2024; 17:241-252. [PMID: 39308539 PMCID: PMC11413381 DOI: 10.22037/ghfbb.v17i3.2977] [Received: 04/07/2024] [Accepted: 05/11/2024] [Indexed: 09/25/2024]
Abstract
The incorporation of AI models into bioinformatics has brought about a revolutionary era in the analysis and interpretation of biological data. This mini-review offers a succinct overview of the indispensable role AI plays in the convergence of computational techniques and biological research. The search strategy followed PRISMA guidelines, encompassing databases such as PubMed, Embase, and Google Scholar to include studies published between 2018 and 2024, utilizing specific keywords. We explored the diverse applications of AI methodologies, including machine learning (ML), deep learning (DL), and natural language processing (NLP), across various domains of bioinformatics. These domains encompass genome sequencing, protein structure prediction, drug discovery, systems biology, personalized medicine, imaging, signal processing, and text mining. AI algorithms have exhibited remarkable efficacy in tackling intricate biological challenges, spanning from genome sequencing to protein structure prediction, and from drug discovery to personalized medicine. In conclusion, this study scrutinizes the evolving landscape of AI-driven tools and algorithms, emphasizing their pivotal role in expediting research, facilitating data interpretation, and catalyzing innovations in biomedical sciences.
Affiliation(s)
- Hamid Jamialahmadi
- Department of Medical Genetics and Molecular Medicine, School of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran
- These authors equally contributed to this study as the first authors.
- Ghazaleh Khalili-Tanha
- Department of Medical Genetics and Molecular Medicine, School of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran
- These authors equally contributed to this study as the first authors.
- Elham Nazari
- Proteomics Research Center, Faculty of Paramedical Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran
- Mostafa Rezaei-Tavirani
- Proteomics Research Center, Faculty of Paramedical Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran
7
Zhang X, Zhang H, Wang Z, Ma X, Luo J, Zhu Y. PWSC: a novel clustering method based on polynomial weight-adjusted sparse clustering for sparse biomedical data and its application in cancer subtyping. BMC Bioinformatics 2023; 24:490. [PMID: 38129803 PMCID: PMC10740247 DOI: 10.1186/s12859-023-05595-4] [Received: 06/07/2023] [Accepted: 12/04/2023] [Indexed: 12/23/2023] Open
Abstract
BACKGROUND Clustering analysis is widely used to interpret biomedical data and uncover new knowledge and patterns. However, conventional clustering methods are not effective when dealing with sparse biomedical data. To overcome this limitation, we propose a hierarchical clustering method called polynomial weight-adjusted sparse clustering (PWSC). RESULTS The PWSC algorithm adjusts feature weights using a polynomial function, redefines the distances between samples, and performs hierarchical clustering analysis based on these adjusted distances. Additionally, we incorporate a consensus clustering approach to determine the optimal number of classifications. This consensus approach utilizes relative change in the cumulative distribution function to identify the best number of clusters, resulting in more stable clustering results. Leveraging the PWSC algorithm, we successfully classified a cohort of gastric cancer patients, enabling categorization of patients carrying different types of altered genes. Further evaluation using entropy showed a significant improvement (p = 2.905e-05), while the Calinski-Harabasz index (CHI) showed a 100% improvement in the quality of the best classification compared to conventional algorithms. Similarly, significantly increased entropy (p = 0.0336) and comparable CHI were observed when classifying another colorectal cancer cohort with microbial abundance. These attempts at cancer subtyping demonstrate that PWSC is highly applicable to different types of biomedical data. To facilitate its application, we have developed a user-friendly tool that implements the PWSC algorithm, which can be accessed at http://pwsc.aiyimed.com/. CONCLUSIONS PWSC addresses the limitations of conventional approaches when clustering sparse biomedical data. By adjusting feature weights and employing consensus clustering, we achieve improved clustering results compared to conventional methods. 
The PWSC algorithm provides a valuable tool for researchers in the field, enabling more accurate and stable clustering analysis. Its application can enhance our understanding of complex biological systems and contribute to advancements in various biomedical disciplines.
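The three steps named in the abstract (polynomial feature weights, redefined distances, hierarchical clustering) can be sketched in a few lines. The abstract does not give the polynomial or state what it is applied to, so weighting each feature by a power of its non-zero fraction is purely an assumption here, as are the `degree` parameter, the average-linkage rule, and every function name.

```python
import math

def feature_weights(X, degree=2):
    """Assumed weighting: (fraction of non-zero entries per feature) ** degree."""
    n = len(X)
    return [(sum(1 for row in X if row[j] != 0) / n) ** degree
            for j in range(len(X[0]))]

def weighted_distance(a, b, w):
    """Euclidean distance with per-feature weights."""
    return math.sqrt(sum(wj * (x - y) ** 2 for x, y, wj in zip(a, b, w)))

def agglomerate(X, w, n_clusters):
    """Naive average-linkage agglomerative clustering on weighted distances."""
    clusters = [[i] for i in range(len(X))]

    def linkage(c1, c2):
        total = sum(weighted_distance(X[i], X[j], w) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = ((p, q) for p in range(len(clusters))
                 for q in range(p + 1, len(clusters)))
        a, b = min(pairs, key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Toy sparse matrix: rows are samples, columns are altered genes (0/1).
X = [[1, 0, 0], [1, 0, 1], [0, 1, 0], [0, 1, 0]]
w = feature_weights(X)            # [0.25, 0.25, 0.0625]
groups = agglomerate(X, w, n_clusters=2)
```

The consensus step for choosing the number of clusters is not shown; `n_clusters` is simply passed in.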
Affiliation(s)
- Xiaomeng Zhang
- Department of Nephrology, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430030, Hubei Province, China
- Hongtao Zhang
- School of Mathematics and Statistics, Wuhan University, Wuhan, 430070, Hubei Province, China
- Zhihao Wang
- School of Mathematics and Statistics, Wuhan University, Wuhan, 430070, Hubei Province, China
- Xiaofei Ma
- School of Mathematics and Statistics, Wuhan University, Wuhan, 430070, Hubei Province, China
- Jiancheng Luo
- School of Mathematics and Statistics, Wuhan University, Wuhan, 430070, Hubei Province, China.
- Yingying Zhu
- Department of Oncology, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430030, Hubei Province, China.
8
Han R, Qi J, Xue Y, Sun X, Zhang F, Gao X, Li G. HycDemux: a hybrid unsupervised approach for accurate barcoded sample demultiplexing in nanopore sequencing. Genome Biol 2023; 24:222. [PMID: 37798751 PMCID: PMC10552309 DOI: 10.1186/s13059-023-03053-1] [Received: 01/09/2022] [Accepted: 09/08/2023] [Indexed: 10/07/2023] Open
Abstract
DNA barcodes enable Oxford Nanopore sequencing to sequence multiple barcoded DNA samples on a single flow cell. DNA sequences with the same barcode need to be grouped together through demultiplexing. As the number of samples increases, accurate demultiplexing becomes difficult. We introduce HycDemux, which incorporates a GPU-parallelized hybrid clustering algorithm that uses nanopore signals and DNA sequences for accurate data clustering, alongside a voting-based module to finalize the demultiplexing results. Comprehensive experiments demonstrate that our approach outperforms unsupervised tools in short sequence fragment clustering and performs more robustly than current state-of-the-art demultiplexing tools for complex multi-sample sequencing data.
Affiliation(s)
- Renmin Han
- Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, 266237, China
- Junhai Qi
- Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, 266237, China
- BioMap Research, California, USA
- Yang Xue
- Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, 266237, China
- Xiujuan Sun
- High Performance Computer Research Center, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China
- Fa Zhang
- School of Medical Technology, Beijing Institute of Technology, Beijing, 100085, China.
- Xin Gao
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal, 23955, Saudi Arabia.
- Guojun Li
- Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, 266237, China.
9
Luan T, Muralidharan HS, Alshehri M, Mittra I, Pop M. SCRAPT: an iterative algorithm for clustering large 16S rRNA gene data sets. Nucleic Acids Res 2023; 51:e46. [PMID: 36912074 PMCID: PMC10164572 DOI: 10.1093/nar/gkad158] [Received: 11/08/2022] [Revised: 02/01/2023] [Accepted: 02/28/2023] [Indexed: 03/14/2023] Open
Abstract
16S rRNA gene sequence clustering is an important tool in characterizing the diversity of microbial communities. As 16S rRNA gene data sets are growing in size, existing sequence clustering algorithms increasingly become an analytical bottleneck. Part of this bottleneck is due to the substantial computational cost expended on small clusters and singleton sequences. We propose an iterative sampling-based 16S rRNA gene sequence clustering approach that targets the largest clusters in the data set, allowing users to stop the clustering process when sufficient clusters are available for the specific analysis being targeted. We describe a probabilistic analysis of the iterative clustering process that supports the intuition that the clustering process identifies the larger clusters in the data set first. Using real data sets of 16S rRNA gene sequences, we show that the iterative algorithm, coupled with an adaptive sampling process and a mode-shifting strategy for identifying cluster representatives, substantially speeds up the clustering process while being effective at capturing the large clusters in the data set. The experiments also show that SCRAPT (Sample, Cluster, Recruit, AdaPt and iTerate) is able to produce operational taxonomic units that are less fragmented than popular tools: UCLUST, CD-HIT and DNACLUST. The algorithm is implemented in the open-source package SCRAPT. The source code used to generate the results presented in this paper is available at https://github.com/hsmurali/SCRAPT.
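The sample, cluster, recruit and iterate loop described above can be sketched as follows. This is a simplified illustration, not the SCRAPT implementation: the adaptive sampling and mode-shifting steps are omitted, the Hamming-fraction distance assumes equal-length sequences, and the parameter and function names are hypothetical.

```python
import random

def hamming_fraction(a, b):
    """Fraction of mismatching positions (assumes equal-length sequences)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def iterative_cluster(seqs, sample_size, max_dist, max_iters, seed=0):
    """Repeatedly sample, pick cluster centers in the sample, then recruit
    all remaining sequences that fall within `max_dist` of a center."""
    rng = random.Random(seed)
    remaining = list(range(len(seqs)))
    clusters = []
    for _ in range(max_iters):
        if not remaining:
            break
        sample = rng.sample(remaining, min(sample_size, len(remaining)))
        # Cluster the sample greedily: a sequence becomes a center
        # unless it is already close to an existing center.
        centers = []
        for i in sample:
            if not any(hamming_fraction(seqs[i], seqs[c]) <= max_dist for c in centers):
                centers.append(i)
        # Recruit every remaining sequence to the first center it matches.
        members = {c: [] for c in centers}
        leftover = []
        for i in remaining:
            home = next((c for c in centers
                         if hamming_fraction(seqs[i], seqs[c]) <= max_dist), None)
            if home is None:
                leftover.append(i)
            else:
                members[home].append(i)
        clusters.extend(members.values())
        remaining = leftover
    return clusters, remaining

toy_seqs = ["AAAA", "AAAT", "CCCC", "CCCG", "GGGG"]
toy_clusters, toy_unassigned = iterative_cluster(
    toy_seqs, sample_size=5, max_dist=0.25, max_iters=3)
```

Since a random sample is more likely to contain members of large clusters, early iterations tend to recover the large clusters first, and a caller can stop by lowering `max_iters` once enough clusters are available.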
Affiliation(s)
- Tu Luan
- Department of Computer Science, University of Maryland, College Park, 20742 MD, USA
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA
- Harihara Subrahmaniam Muralidharan
- Department of Computer Science, University of Maryland, College Park, 20742 MD, USA
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA
- Marwan Alshehri
- Department of Computer Science, University of Maryland, College Park, 20742 MD, USA
- Ipsa Mittra
- Department of Computer Science, University of Maryland, College Park, 20742 MD, USA
- Mihai Pop
- Department of Computer Science, University of Maryland, College Park, 20742 MD, USA
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA
10
Benchmarking machine learning robustness in Covid-19 genome sequence classification. Sci Rep 2023; 13:4154. [PMID: 36914815 PMCID: PMC10010240 DOI: 10.1038/s41598-023-31368-3] [Received: 11/17/2022] [Accepted: 03/10/2023] [Indexed: 03/16/2023] Open
Abstract
The rapid spread of the COVID-19 pandemic has resulted in an unprecedented amount of sequence data of the SARS-CoV-2 genome: millions of sequences and counting. This amount of data, while being orders of magnitude beyond the capacity of traditional approaches to understanding the diversity, dynamics, and evolution of viruses, is nonetheless a rich resource for machine learning (ML) approaches as alternatives for extracting such important information from these data. It is hence of utmost importance to design a framework for testing and benchmarking the robustness of these ML models. This paper makes the first effort (to our knowledge) to benchmark the robustness of ML models by simulating biological sequences with errors. In this paper, we introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio. We show from experiments on a wide array of ML models that some simulation-based approaches with different perturbation budgets are more robust (and accurate) than others for specific embedding methods to certain noise simulations on the input sequences. Our benchmarking framework may assist researchers in properly assessing different ML models and help them understand the behavior of the SARS-CoV-2 virus or avoid possible future pandemics.
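A minimal example of perturbing a sequence to simulate sequencing errors is shown below. It introduces random substitutions only, at a uniform per-base rate; this is an assumption made for brevity and does not reproduce the platform-specific error profiles the abstract refers to. The function name and parameters are hypothetical.

```python
import random

def perturb(seq, error_rate, seed=0, alphabet="ACGT"):
    """Replace each base with a different base with probability `error_rate`."""
    rng = random.Random(seed)
    out = []
    for base in seq:
        if rng.random() < error_rate:
            # Substitute with one of the other symbols in the alphabet.
            out.append(rng.choice([b for b in alphabet if b != base]))
        else:
            out.append(base)
    return "".join(out)

original = "ACGTACGTACGTACGT"
unchanged = perturb(original, error_rate=0.0)
noisy = perturb(original, error_rate=0.1)
fully_mutated = perturb(original, error_rate=1.0)
```

A robustness benchmark of the kind described would then compare a classifier's predictions on `original` and on perturbed copies at several error rates.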
|
11
|
Semi-supervised and un-supervised clustering: A review and experimental evaluation. INFORM SYST 2023. [DOI: 10.1016/j.is.2023.102178] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/19/2023]
|
12
|
Bohnsack KS, Kaden M, Abel J, Villmann T. Alignment-Free Sequence Comparison: A Systematic Survey From a Machine Learning Perspective. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:119-135. [PMID: 34990369 DOI: 10.1109/tcbb.2022.3140873] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
The encounter of large amounts of biological sequence data generated during the last decades and the algorithmic and hardware improvements have offered the possibility to apply machine learning techniques in bioinformatics. While the machine learning community is aware of the necessity to rigorously distinguish data transformation from data comparison and adopt reasonable combinations thereof, this awareness is often lacking in the field of comparative sequence analysis. With realization of the disadvantages of alignments for sequence comparison, some typical applications increasingly use so-called alignment-free approaches. In light of this development, we present a conceptual framework for alignment-free sequence comparison, which highlights the delineation of: 1) the sequence data transformation comprising adequate mathematical sequence coding and feature generation, from 2) the subsequent (dis-)similarity evaluation of the transformed data by means of problem-specific but mathematically consistent proximity measures. We consider coding to be an information-loss free data transformation in order to get an appropriate representation, whereas feature generation is inevitably information-lossy with the intention to extract just the task-relevant information. This distinction sheds light on the plethora of methods available and assists in identifying suitable methods in machine learning and data analysis to compare the sequences under these premises.
|
13
|
Dall'Alba G, Casa PL, Abreu FPD, Notari DL, de Avila E Silva S. A Survey of Biological Data in a Big Data Perspective. BIG DATA 2022; 10:279-297. [PMID: 35394342 DOI: 10.1089/big.2020.0383] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/14/2023]
Abstract
The amount of available data is continuously growing. This phenomenon promotes a new concept, named big data. The highlight technologies related to big data are cloud computing (infrastructure) and Not Only SQL (NoSQL; data storage). In addition, for data analysis, machine learning algorithms such as decision trees, support vector machines, artificial neural networks, and clustering techniques present promising results. In a biological context, big data has many applications due to the large number of biological databases available. Some limitations of biological big data are related to the inherent features of these data, such as high degrees of complexity and heterogeneity, since biological systems provide information from an atomic level to interactions between organisms or their environment. Such characteristics make most bioinformatic-based applications difficult to build, configure, and maintain. Although the rise of big data is relatively recent, it has contributed to a better understanding of the underlying mechanisms of life. The main goal of this article is to provide a concise and reliable survey of the application of big data-related technologies in biology. As such, some fundamental concepts of information technology, including storage resources, analysis, and data sharing, are described along with their relation to biological data.
Affiliation(s)
- Gabriel Dall'Alba
- Computational Biology and Bioinformatics Laboratory, Biotechnology Institute, Department of Life Sciences, University of Caxias do Sul, Caxias do Sul, Brazil
- Genome Science and Technology Program, Faculty of Science, The University of British Columbia, Vancouver, Canada
| | - Pedro Lenz Casa
- Computational Biology and Bioinformatics Laboratory, Biotechnology Institute, Department of Life Sciences, University of Caxias do Sul, Caxias do Sul, Brazil
| | - Fernanda Pessi de Abreu
- Computational Biology and Bioinformatics Laboratory, Biotechnology Institute, Department of Life Sciences, University of Caxias do Sul, Caxias do Sul, Brazil
| | - Daniel Luis Notari
- Computational Biology and Bioinformatics Laboratory, Biotechnology Institute, Department of Life Sciences, University of Caxias do Sul, Caxias do Sul, Brazil
| | - Scheila de Avila E Silva
- Computational Biology and Bioinformatics Laboratory, Biotechnology Institute, Department of Life Sciences, University of Caxias do Sul, Caxias do Sul, Brazil
| |
|
14
|
Lombardo SD, Wangsaputra IF, Menche J, Stevens A. Network Approaches for Charting the Transcriptomic and Epigenetic Landscape of the Developmental Origins of Health and Disease. Genes (Basel) 2022; 13:764. [PMID: 35627149 PMCID: PMC9141211 DOI: 10.3390/genes13050764] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/04/2022] [Revised: 04/04/2022] [Accepted: 04/13/2022] [Indexed: 02/04/2023] Open
Abstract
The early developmental phase is of critical importance for human health and disease later in life. To decipher the molecular mechanisms at play, current biomedical research is increasingly relying on large quantities of diverse omics data. The integration and interpretation of the different datasets pose a critical challenge towards the holistic understanding of the complex biological processes that are involved in early development. In this review, we outline the major transcriptomic and epigenetic processes and the respective datasets that are most relevant for studying the periconceptional period. We cover both basic data processing and analysis steps, as well as more advanced data integration methods. A particular focus is given to network-based methods. Finally, we review the medical applications of such integrative analyses.
Affiliation(s)
- Salvo Danilo Lombardo
- Max Perutz Labs, Department of Structural and Computational Biology, University of Vienna, 1030 Vienna, Austria;
- CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences, 1030 Vienna, Austria
| | - Ivan Fernando Wangsaputra
- Maternal and Fetal Health Research Group, Division of Developmental Biology and Medicine, Faculty of Biology, Medicine and Health, University of Manchester, Manchester M13 9WL, UK;
| | - Jörg Menche
- Max Perutz Labs, Department of Structural and Computational Biology, University of Vienna, 1030 Vienna, Austria;
- CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences, 1030 Vienna, Austria
- Faculty of Mathematics, University of Vienna, 1030 Vienna, Austria
| | - Adam Stevens
- Maternal and Fetal Health Research Group, Division of Developmental Biology and Medicine, Faculty of Biology, Medicine and Health, University of Manchester, Manchester M13 9WL, UK;
| |
|
15
|
Chiu JKH, Ong RTH. Clustering biological sequences with dynamic sequence similarity threshold. BMC Bioinformatics 2022; 23:108. [PMID: 35354426 PMCID: PMC8969259 DOI: 10.1186/s12859-022-04643-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/26/2021] [Accepted: 03/02/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Biological sequence clustering is a complicated data clustering problem owing to the high computation costs incurred for pairwise sequence distance calculations through sequence alignments, as well as difficulties in determining parameters for deriving robust clusters. While current approaches are successful in reducing the number of sequence alignments performed, the generated clusters are based on a single sequence identity threshold applied to every cluster. Poor choices of this identity threshold would thus lead to low-quality clusters. There is, however, little support provided to users in selecting thresholds that are well matched with the input sequences. RESULTS We present a novel sequence clustering approach called ALFATClust that exploits rapid pairwise alignment-free sequence distance calculations and graph community detection for cluster generation. Instead of a single threshold applied to every generated cluster, ALFATClust is capable of dynamically determining the cut-off threshold for each individual cluster by considering both cluster separation and intra-cluster sequence similarity. Benchmarking analysis shows that ALFATClust generally outperforms existing approaches by simultaneously maintaining cluster robustness and substantial cluster separation for the benchmark datasets. The software also provides an evaluation report for verifying the quality of the non-singleton clusters obtained. CONCLUSIONS ALFATClust is able to generate sequence clusters having high intra-cluster sequence similarity and substantial separation between clusters without requiring users to decide precise similarity cut-off thresholds.
Affiliation(s)
- Jimmy Ka Ho Chiu
- Saw Swee Hock School of Public Health, National University of Singapore and National University Health System, Singapore, 117549, Singapore
| | - Rick Twee-Hee Ong
- Saw Swee Hock School of Public Health, National University of Singapore and National University Health System, Singapore, 117549, Singapore.
| |
|
16
|
Gharavi E, Gu A, Zheng G, Smith JP, Cho HJ, Zhang A, Brown DE, Sheffield NC. Embeddings of genomic region sets capture rich biological associations in lower dimensions. Bioinformatics 2021; 37:4299-4306. [PMID: 34156475 PMCID: PMC8652032 DOI: 10.1093/bioinformatics/btab439] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/18/2020] [Revised: 06/07/2021] [Accepted: 06/15/2021] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Genomic region sets summarize functional genomics data and define locations of interest in the genome such as regulatory regions or transcription factor binding sites. The number of publicly available region sets has increased dramatically, leading to challenges in data analysis. RESULTS We propose a new method to represent genomic region sets as vectors, or embeddings, using an adapted word2vec approach. We compared our approach to two simpler methods based on interval unions or term frequency-inverse document frequency and evaluated the methods in three ways: First, by classifying the cell line, antibody or tissue type of the region set; second, by assessing whether similarity among embeddings can reflect simulated random perturbations of genomic regions; and third, by testing robustness of the proposed representations to different signal thresholds for calling peaks. Our word2vec-based region set embeddings reduce dimensionality from more than a hundred thousand to 100 without significant loss in classification performance. The vector representation could identify cell line, antibody and tissue type with over 90% accuracy. We also found that the vectors could quantitatively summarize simulated random perturbations to region sets and are more robust to subsampling the data derived from different peak calling thresholds. Our evaluations demonstrate that the vectors retain useful biological information in relatively lower-dimensional spaces. We propose that vector representation of region sets is a promising approach for efficient analysis of genomic region data. AVAILABILITY AND IMPLEMENTATION https://github.com/databio/regionset-embedding. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Erfaneh Gharavi
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA 22903, USA
- School of Data Science, University of Virginia, Charlottesville, VA 22903, USA
| | - Aaron Gu
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA 22903, USA
- Department of Computer Science, University of Virginia, Charlottesville, VA 22903, USA
| | - Guangtao Zheng
- Department of Computer Science, University of Virginia, Charlottesville, VA 22903, USA
| | - Jason P Smith
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA 22903, USA
- Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA 22903, USA
| | - Hyun Jae Cho
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA 22903, USA
- Department of Computer Science, University of Virginia, Charlottesville, VA 22903, USA
| | - Aidong Zhang
- Department of Computer Science, University of Virginia, Charlottesville, VA 22903, USA
| | - Donald E Brown
- School of Data Science, University of Virginia, Charlottesville, VA 22903, USA
| | - Nathan C Sheffield
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA 22903, USA
- School of Data Science, University of Virginia, Charlottesville, VA 22903, USA
- Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA 22903, USA
- Department of Public Health Sciences, University of Virginia, Charlottesville, VA 22903, USA
- Department of Biomedical Engineering, University of Virginia, Charlottesville, VA 22903, USA
| |
|
17
|
Bohnsack KS, Kaden M, Abel J, Saralajew S, Villmann T. The Resolved Mutual Information Function as a Structural Fingerprint of Biomolecular Sequences for Interpretable Machine Learning Classifiers. ENTROPY (BASEL, SWITZERLAND) 2021; 23:1357. [PMID: 34682081 PMCID: PMC8534762 DOI: 10.3390/e23101357] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 09/19/2021] [Revised: 10/11/2021] [Accepted: 10/14/2021] [Indexed: 11/16/2022]
Abstract
In the present article we propose the application of variants of the mutual information function as characteristic fingerprints of biomolecular sequences for classification analysis. In particular, we consider the resolved mutual information functions based on Shannon-, Rényi-, and Tsallis-entropy. In combination with interpretable machine learning classifier models based on generalized learning vector quantization, a powerful methodology for sequence classification is achieved which allows substantial knowledge extraction in addition to the high classification ability due to the model-inherent robustness. Any potential (slightly) inferior performance of the used classifier is compensated by the additional knowledge provided by interpretable models. This knowledge may assist the user in the analysis and understanding of the used data and considered task. After theoretical justification of the concepts, we demonstrate the approach for various example data sets covering different areas in biomolecular sequence analysis.
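For orientation, the plain Shannon mutual information function at lag k measures how much the symbol at position i tells us about the symbol at position i + k. The sketch below computes it from empirical pair frequencies. It is a minimal illustration under assumptions made here (the function name `mutual_information`, base-2 logarithms, and marginals taken from the same pairs), and it does not implement the resolved, Rényi, or Tsallis variants discussed in the abstract.

```python
import math
from collections import Counter


def mutual_information(seq, lag):
    """Shannon mutual information (in bits) between symbols `lag` positions apart."""
    # All symbol pairs (seq[i], seq[i + lag]) observed in the sequence.
    pairs = [(seq[i], seq[i + lag]) for i in range(len(seq) - lag)]
    total = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)    # marginal of the first symbol
    right = Counter(b for _, b in pairs)   # marginal of the second symbol
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / total
        p_a = left[a] / total
        p_b = right[b] / total
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# A fingerprint vector: one value per lag, usable as classifier input.
periodic = "ACGT" * 25
fingerprint = [mutual_information(periodic, k) for k in range(1, 9)]
```

For the strictly periodic toy sequence above, lags that are multiples of the period give the maximum value of 2 bits, because each symbol fully determines its partner.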
Affiliation(s)
- Katrin Sophie Bohnsack
- Saxon Institute for Computational Intelligence and Machine Learning, University of Applied Sciences Mittweida, 09648 Mittweida, Germany; (M.K.); (J.A.)
| | - Marika Kaden
- Saxon Institute for Computational Intelligence and Machine Learning, University of Applied Sciences Mittweida, 09648 Mittweida, Germany; (M.K.); (J.A.)
| | - Julia Abel
- Saxon Institute for Computational Intelligence and Machine Learning, University of Applied Sciences Mittweida, 09648 Mittweida, Germany; (M.K.); (J.A.)
| | - Sascha Saralajew
- Bosch Center for Artificial Intelligence, 71272 Renningen, Germany;
| | - Thomas Villmann
- Saxon Institute for Computational Intelligence and Machine Learning, University of Applied Sciences Mittweida, 09648 Mittweida, Germany; (M.K.); (J.A.)
| |
|
18
|
Huang YS, Cheng WC, Lin CY. Androgenic Sensitivities and Ovarian Gene Expression Profiles Prior to Treatment in Japanese Eel (Anguilla japonica). MARINE BIOTECHNOLOGY (NEW YORK, N.Y.) 2021; 23:430-444. [PMID: 34191211 DOI: 10.1007/s10126-021-10035-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/11/2020] [Accepted: 04/28/2021] [Indexed: 06/13/2023]
Abstract
Androgens stimulate ovarian development in eels. Our previous report indicated a correlation between the initial (debut) ovarian status (determined by kernel density estimation (KDE), presented as a probability density of oocyte size) and the consequence of 17MT treatment (change in ovary). The initial ovarian status appeared to be an important factor influencing ovarian androgenic sensitivity. We postulated that the sensitivities of initial ovaries are correlated with their gene expression profiles. Japanese eels underwent operation to sample the initial ovarian tissues, and the samples were stored in liquid nitrogen. Using high-throughput next-generation sequencing (NGS) technology, ovarian transcriptomic data were mined and analyzed based on functional gene classification with cutoff-based differentially expressed genes (DEGs); the ovarian status was transformed into gene expression profiles globally or was represented by a gene list. Our results also implied that the initial ovary might be an important factor influencing the outcomes of 17MT treatments, and the genes related to neuronal activities or neurogenesis seemed to play an essential role in the positive effect.
Affiliation(s)
- Yung-Sen Huang
- Department of Life Science, National University of Kaohsiung, No. 700 Kaohsiung University Road, Nan Tzu Dist, 811, Kaohsiung, Taiwan.
| | - Wen-Chih Cheng
- Institute of Information Science, Academia Sinica, No. 128 Academia Road, Section 2, Nankang Dist., 115, Taipei, Taiwan
| | - Chung-Yen Lin
- Institute of Information Science, Academia Sinica, No. 128 Academia Road, Section 2, Nankang Dist., 115, Taipei, Taiwan
| |
|
19
|
Bhattacharyya B, Mitra U, Bhattacharyya R. Tandem repeat interval pattern identifies animal taxa. Bioinformatics 2021; 37:2250-2258. [PMID: 33677492 DOI: 10.1093/bioinformatics/btab124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/22/2019] [Revised: 12/11/2020] [Accepted: 02/22/2021] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION We discover that maximality of information content among intervals of Tandem Repeats (TRs) in animal genome segregates over taxa such that taxa identification becomes swift and accurate. Successive TRs of a motif occur at intervals over the sequence, forming a trail of TRs of the motif across the genome. We present a method, Tandem Repeat Information Mining (TRIM), that mines the 4^k TR trails of all k-length motifs from a whole genome sequence and extracts the information content within intervals of the trails. TRIM vector formed from the ordered set of interval entropies becomes instrumental for genome segregation. RESULTS Reconstruction of correct phylogeny for animals from whole genome sequences proves precision of TRIM. Identification of animal taxa by TRIM vector upon feature selection is the most significant achievement. These suggest Tandem Repeat Interval Pattern (TRIP) is a taxa-specific constitutional characteristic in animal genome. AVAILABILITY Source and executable code of TRIM along with usage manual are made available at https://github.com/BB-BiG/TRIM. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Balaram Bhattacharyya
- Department of Computer and System Sciences, Visva-Bharati University, Santiniketan, 731235
| | - Uddalak Mitra
- Department of Computer and System Sciences, Visva-Bharati University, Santiniketan, 731235
| | | |
|
20
|
Macari G, Toti D, Pasquadibisceglie A, Polticelli F. DockingApp RF: A State-of-the-Art Novel Scoring Function for Molecular Docking in a User-Friendly Interface to AutoDock Vina. Int J Mol Sci 2020; 21:ijms21249548. [PMID: 33333976 PMCID: PMC7765429 DOI: 10.3390/ijms21249548] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/26/2020] [Revised: 12/11/2020] [Accepted: 12/11/2020] [Indexed: 11/28/2022] Open
Abstract
Motivation: Bringing a new drug to the market is expensive and time-consuming. To cut the costs and time, computer-aided drug design (CADD) approaches have been increasingly included in the drug discovery pipeline. However, although traditional docking tools show a good conformational space sampling ability, they are still unable to produce accurate binding affinity predictions. This work presents a novel scoring function for molecular docking seamlessly integrated into DockingApp, a user-friendly graphical interface for AutoDock Vina. The proposed function is based on a random forest model and a selection of specific features to overcome the existing limits of Vina’s original scoring mechanism. A novel version of DockingApp, named DockingApp RF, has been developed to host the proposed scoring function and to automate the rescoring procedure of the output of AutoDock Vina, even for nonexpert users. Results: By coupling intermolecular interaction, solvent accessible surface area features and Vina’s energy terms, DockingApp RF’s new scoring function is able to improve the binding affinity prediction of AutoDock Vina. Furthermore, comparison tests carried out on the CASF-2013 and CASF-2016 datasets demonstrate that DockingApp RF’s performance is comparable to other state-of-the-art machine-learning- and deep-learning-based scoring functions. The new scoring function thus represents a significant advancement in terms of the reliability and effectiveness of docking compared to AutoDock Vina’s scoring function. At the same time, the characteristics that made DockingApp appealing to a wide range of users are retained in this new version and have been complemented with additional features.
Affiliation(s)
- Gabriele Macari
- Department of Sciences, Roma Tre University, 00146 Rome, Italy; (G.M.); (A.P.)
| | - Daniele Toti
- Faculty of Mathematical, Physical and Natural Sciences, Catholic University of the Sacred Heart, 25121 Brescia, Italy;
| | | | - Fabio Polticelli
- Department of Sciences, Roma Tre University, 00146 Rome, Italy; (G.M.); (A.P.)
- National Institute of Nuclear Physics, Roma Tre Section, 00146 Rome, Italy
- Correspondence:
| |
|
21
|
Paul T, Vainio S, Roning J. Clustering and classification of virus sequence through music communication protocol and wavelet transform. Genomics 2020; 113:778-784. [PMID: 33069829 PMCID: PMC7561519 DOI: 10.1016/j.ygeno.2020.10.009] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/02/2020] [Accepted: 10/13/2020] [Indexed: 01/19/2023]
Abstract
The coronavirus pandemic became a major risk in global public health. The outbreak is caused by SARS-CoV-2, a member of the coronavirus family. Though the images of the virus are familiar to us, in the present study, an attempt is made to hear the coronavirus by translating its spike protein into audio sequences. The musical features such as pitch, timbre, volume and duration are mapped based on the coronavirus protein sequence. Three different viruses, Influenza, Ebola and Coronavirus, were studied and compared through their auditory virus sequences by implementing the Haar wavelet transform. Sonification of the coronavirus helps in understanding the protein structures by enhancing hidden features. Further, it makes a clear difference in the representation of coronavirus compared with other viruses, which will help in various research works related to virus sequence. This evolves as a simplified and novel way of representing the conventional computational methods.
Affiliation(s)
- Tirthankar Paul
- InfoTech Oulu, Biomimetics and Intelligent Systems Group (BISG), Faculty of Information Technology and Electrical Engineering, University of Oulu, Oulu, Finland.
| | - Seppo Vainio
- InfoTech Oulu, Faculty of Biochemistry and Molecular Medicine, Biocenter Oulu, Laboratory of Development Biology, University of Oulu, Oulu, Finland.
| | - Juha Roning
- InfoTech Oulu, Biomimetics and Intelligent Systems Group (BISG), Faculty of Information Technology and Electrical Engineering, University of Oulu, Oulu, Finland.
| |
|
22
|
Yang A, Zhang W, Wang J, Yang K, Han Y, Zhang L. Review on the Application of Machine Learning Algorithms in the Sequence Data Mining of DNA. Front Bioeng Biotechnol 2020; 8:1032. [PMID: 33015010 PMCID: PMC7498545 DOI: 10.3389/fbioe.2020.01032] [Citation(s) in RCA: 36] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/22/2020] [Accepted: 08/10/2020] [Indexed: 11/13/2022] Open
Abstract
Deoxyribonucleic acid (DNA) is a biological macromolecule. Its main function is information storage. At present, the advancement of sequencing technology has caused DNA sequence data to grow at an explosive rate, which has also pushed the study of DNA sequences into the wave of big data. Moreover, machine learning is a powerful technique for analyzing large-scale data and learns spontaneously to gain knowledge. It has been widely used in DNA sequence data analysis and has produced many research achievements. Firstly, the review introduces the development process of sequencing technology and expounds on the concepts of DNA sequence data structure and sequence similarity. Then we analyze the basic process of data mining, summarize several major machine learning algorithms, and put forward the challenges faced by machine learning algorithms in the mining of biological sequence data and possible solutions in the future. Then we review four typical applications of machine learning in DNA sequence data: DNA sequence alignment, DNA sequence classification, DNA sequence clustering, and DNA pattern mining. We analyze their corresponding biological application background and significance, and systematically summarize the development and potential problems in the field of DNA sequence data mining in recent years. Finally, we summarize the content of the review and look ahead to some research directions for the next step.
Affiliation(s)
- Aimin Yang
- College of Science, North China University of Science and Technology, Tangshan, China
| | - Wei Zhang
- College of Science, North China University of Science and Technology, Tangshan, China
| | - Jiahao Wang
- College of Science, North China University of Science and Technology, Tangshan, China
| | - Ke Yang
- College of Yi Sheng, North China University of Science and Technology, Tangshan, China
| | - Yang Han
- College of Science, North China University of Science and Technology, Tangshan, China
| | - Limin Zhang
- Mathematics and Computer Department, Hengshui University, Hengshui, China
| |
|
23
|
Le Goallec A, Tierney BT, Luber JM, Cofer EM, Kostic AD, Patel CJ. A systematic machine learning and data type comparison yields metagenomic predictors of infant age, sex, breastfeeding, antibiotic usage, country of origin, and delivery type. PLoS Comput Biol 2020; 16:e1007895. [PMID: 32392251 PMCID: PMC7241849 DOI: 10.1371/journal.pcbi.1007895] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2019] [Revised: 05/21/2020] [Accepted: 04/21/2020] [Indexed: 12/31/2022] Open
Abstract
The microbiome is a new frontier for building predictors of human phenotypes. However, machine learning in the microbiome is fraught with issues of reproducibility, driven in large part by the wide range of analytic models and metagenomic data types available. We aimed to build robust metagenomic predictors of host phenotype by comparing prediction performances and biological interpretation across 8 machine learning methods and 4 different types of metagenomic data. Using 1,570 samples from 300 infants, we fit 7,865 models for 6 host phenotypes. We demonstrate the dependence of accuracy on algorithm choice and feature definition in microbiome data and propose a framework for building microbiome-derived indicators of host phenotype. We additionally identify biological features predictive of age, sex, breastfeeding status, historical antibiotic usage, country of origin, and delivery type. Our complete results can be viewed at http://apps.chiragjpgroup.org/ubiome_predictions/.
Affiliation(s)
- Alan Le Goallec
- Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, United States of America
- Department of Systems Biology, Harvard University, Cambridge, Massachusetts, United States of America
| | - Braden T. Tierney
- Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, United States of America
- Section on Pathophysiology and Molecular Pharmacology, Joslin Diabetes Center, Boston, Massachusetts, United States of America
- Section on Islet Cell and Regenerative Biology, Joslin Diabetes Center, Boston, Massachusetts, United States of America
- Department of Microbiology and Immunobiology, Harvard Medical School, Boston, Massachusetts, United States of America
| | - Jacob M. Luber
- Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, United States of America
| | - Evan M. Cofer
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
| | - Aleksandar D. Kostic
- Section on Pathophysiology and Molecular Pharmacology, Joslin Diabetes Center, Boston, Massachusetts, United States of America
- Section on Islet Cell and Regenerative Biology, Joslin Diabetes Center, Boston, Massachusetts, United States of America
- Department of Microbiology and Immunobiology, Harvard Medical School, Boston, Massachusetts, United States of America
- * E-mail: (ADK); (CJP)
| | - Chirag J. Patel
- Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, United States of America
- * E-mail: (ADK); (CJP)
| |
|
24
|
Luczak BB, James BT, Girgis HZ. A survey and evaluations of histogram-based statistics in alignment-free sequence comparison. Brief Bioinform 2020; 20:1222-1237. [PMID: 29220512 PMCID: PMC6781583 DOI: 10.1093/bib/bbx161] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2017] [Revised: 10/13/2017] [Indexed: 11/29/2022] Open
Abstract
Motivation: Since the dawn of the bioinformatics field, sequence alignment scores have been the main method for comparing sequences. However, alignment algorithms are quadratic, requiring long execution time. As alternatives, scientists have developed tens of alignment-free statistics for measuring the similarity between two sequences. Results: We surveyed tens of alignment-free k-mer statistics. Additionally, we evaluated 33 statistics and multiplicative combinations between the statistics and/or their squares. These statistics are calculated on two k-mer histograms representing two sequences. Our evaluations using global alignment scores revealed that the majority of the statistics are sensitive and capable of finding similar sequences to a query sequence. Therefore, any of these statistics can filter out dissimilar sequences quickly. Further, we observed that multiplicative combinations of the statistics are highly correlated with the identity score. Furthermore, combinations involving sequence length difference or Earth Mover’s distance, which takes the length difference into account, are always among the highest correlated paired statistics with identity scores. Similarly, paired statistics including length difference or Earth Mover’s distance are among the best performers in finding the K-closest sequences. Interestingly, similar performance can be obtained using histograms of shorter words, resulting in reducing the memory requirement and increasing the speed remarkably. Moreover, we found that simple single statistics are sufficient for processing next-generation sequencing reads and for applications relying on local alignment. Finally, we measured the time requirement of each statistic. The survey and the evaluations will help scientists with identifying efficient alternatives to the costly alignment algorithm, saving thousands of computational hours. Availability: The source code of the benchmarking tool is available as Supplementary Materials.
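To make the idea of statistics computed on two k-mer histograms concrete, the sketch below builds a fixed-order k-mer count vector for each sequence and compares the vectors with two generic measures. The choice of Manhattan distance and cosine similarity is an assumption made here for illustration; the abstract itself names only Earth Mover's distance and length difference, and the helper names are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def kmer_histogram(seq, k, alphabet="ACGT"):
    """Count every k-mer of `seq`, returned in a fixed lexicographic order."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts["".join(kmer)] for kmer in product(alphabet, repeat=k)]


def manhattan(h1, h2):
    """Sum of absolute count differences between two histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def cosine_similarity(h1, h2):
    """Cosine of the angle between two histograms (0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(h1, h2))
    norm = math.sqrt(sum(a * a for a in h1)) * math.sqrt(sum(b * b for b in h2))
    return dot / norm if norm else 0.0


h_query = kmer_histogram("ACGT", 2)
h_other = kmer_histogram("AAAA", 2)
distance = manhattan(h_query, h_other)
similarity = cosine_similarity(h_query, h_query)
```

Because both steps are linear in sequence length for fixed k, such statistics can screen out dissimilar sequences before any quadratic alignment is attempted.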
Affiliation(s)
- Hani Z Girgis
- Corresponding author. Hani Z. Girgis, Tandy School of Computer Science, The University of Tulsa, 800 South Tucker Drive, Tulsa, OK 74104, USA.
|
25
|
James BT, Luczak BB, Girgis HZ. MeShClust: an intelligent tool for clustering DNA sequences. Nucleic Acids Res 2019; 46:e83. [PMID: 29718317 PMCID: PMC6101578 DOI: 10.1093/nar/gky315] [Citation(s) in RCA: 42] [Impact Index Per Article: 8.4] [Received: 02/04/2018] [Accepted: 04/13/2018] [Indexed: 11/13/2022] Open
Abstract
Sequence clustering is a fundamental step in analyzing DNA sequences. Widely used software tools for sequence clustering utilize greedy approaches that are not guaranteed to produce the best results. These tools are sensitive to one parameter that determines the similarity among sequences in a cluster. Oftentimes, a biologist may not know the exact sequence similarity. Therefore, clusters produced by these tools are unlikely to match the real clusters comprising the data if the provided parameter is inaccurate. To overcome this limitation, we adapted the mean shift algorithm, an unsupervised machine-learning algorithm, which has been used successfully thousands of times in fields such as image processing and computer vision. The theory behind the mean shift algorithm, unlike the greedy approaches, guarantees convergence to the modes, i.e. cluster centers. Here we describe the first application of the mean shift algorithm to clustering DNA sequences. MeShClust is one of few applications of the mean shift algorithm in bioinformatics. Further, we applied supervised machine learning to predict the identity score produced by global alignment using alignment-free methods. We demonstrate MeShClust's ability to cluster DNA sequences with high accuracy even when the sequence similarity parameter provided by the user is not very accurate.
Affiliation(s)
- Benjamin T James
- Bioinformatics Toolsmith Laboratory, Tandy School of Computer Science, University of Tulsa, 800 South Tucker Drive, Tulsa, OK 74104, USA; Mathematics Department, University of Tulsa, 800 South Tucker Drive, Tulsa, OK 74104, USA
- Brian B Luczak
- Bioinformatics Toolsmith Laboratory, Tandy School of Computer Science, University of Tulsa, 800 South Tucker Drive, Tulsa, OK 74104, USA; Mathematics Department, University of Tulsa, 800 South Tucker Drive, Tulsa, OK 74104, USA
- Hani Z Girgis
- Bioinformatics Toolsmith Laboratory, Tandy School of Computer Science, University of Tulsa, 800 South Tucker Drive, Tulsa, OK 74104, USA
|
26
|
Chen W, Li W, Huang G, Flavel M. The Applications of Clustering Methods in Predicting Protein Functions. Curr Proteomics 2019. [DOI: 10.2174/1570164616666181212114612] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/22/2022]
Abstract
Background: The understanding of protein function is essential to the study of biological processes. However, the prediction of protein function has been a difficult task for bioinformatics to overcome. This has resulted in many scholars focusing on the development of computational methods to address this problem. Objective: In this review, we introduce the recently developed computational methods of protein function prediction and assess the validity of these methods. We then introduce the applications of clustering methods in predicting protein functions.
Affiliation(s)
- Weiyang Chen
- College of Information, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China
- Weiwei Li
- College of Information, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China
- Guohua Huang
- College of Information Engineering, Shaoyang University, Shaoyang, Hunan 422000, China
- Matthew Flavel
- School of Life Sciences, La Trobe University, Bundoora, Vic 3083, Australia
|
27
|
Meher PK, Sahu TK, Gahoi S, Satpathy S, Rao AR. Evaluating the performance of sequence encoding schemes and machine learning methods for splice sites recognition. Gene 2019; 705:113-126. [PMID: 31009682 DOI: 10.1016/j.gene.2019.04.047] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 12/20/2018] [Revised: 03/27/2019] [Accepted: 04/17/2019] [Indexed: 02/02/2023]
Abstract
Identification of splice sites is imperative for prediction of gene structure. Machine learning-based approaches (MLAs) have been reported to be more successful than the rule-based methods for identification of splice sites. However, the strings of alphabets should be transformed into numeric features through sequence encoding before using them as input in MLAs. In this study, we evaluated the performances of 8 different sequence encoding schemes i.e., Bayes kernel, density and sparse (DS), distribution of tri-nucleotide and 1st order Markov model (DM), frequency difference distance measure (FDDM), paired-nucleotide frequency difference between true and false sites (FDTF), 1st order Markov model (MM1), combination of both 1st and 2nd order Markov model (MM1 + MM2) and 2nd order Markov model (MM2) in respect of predicting donor and acceptor splice sites using 5 supervised learning methods (ANN, Bagging, Boosting, RF and SVM). The encoding schemes and machine learning methods were first evaluated in 4 species i.e., A. thaliana, C. elegans, D. melanogaster and H. sapiens, and then performances were validated with another four species i.e., Ciona intestinalis, Dictyostelium discoideum, Phaeodactylum tricornutum and Trypanosoma brucei. In terms of ROC (receiver-operating-characteristics) and PR (precision-recall) curves, FDTF encoding approach achieved higher accuracy followed by either MM2 or FDDM. Further, SVM was found to achieve higher accuracy (in terms of ROC and PR curves) followed by RF across encoding schemes and species. In terms of prediction accuracy across species, the SVM-FDTF combination performed better than other combinations of classifiers and encoding schemes. Further, splice site prediction accuracies were observed to be higher for the species with low intron density. To our limited knowledge, this is the first attempt as far as comprehensive evaluation of sequence encoding schemes for prediction of splice sites is concerned.
We have also developed an R-package EncDNA (https://cran.r-project.org/web/packages/EncDNA/index.html) for encoding of splice site motifs with different encoding schemes, which is expected to supplement the existing nucleotide sequence encoding approaches. This study is believed to be useful for the computational biologists for predicting different functional elements on the genomic DNA.
Affiliation(s)
- Prabina Kumar Meher
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India.
- Tanmaya Kumar Sahu
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India
- Shachi Gahoi
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India
- Subhrajit Satpathy
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India
|
28
|
Dall'Alba G, Casa PL, Notari DL, Adami AG, Echeverrigaray S, de Avila E Silva S. Analysis of the nucleotide content of Escherichia coli promoter sequences related to the alternative sigma factors. J Mol Recognit 2018; 32:e2770. [PMID: 30458580 DOI: 10.1002/jmr.2770] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 08/08/2018] [Revised: 10/23/2018] [Accepted: 10/24/2018] [Indexed: 01/26/2023]
Abstract
Promoters are DNA sequences located upstream of the transcription start site of genes. In bacteria, the RNA polymerase enzyme requires additional subunits, called sigma factors (σ) to begin specific gene transcription in distinct environmental conditions. Currently, promoter prediction still poses many challenges due to the characteristics of these sequences. In this paper, the nucleotide content of Escherichia coli promoter sequences, related to five alternative σ factors, was analyzed by a machine learning technique in order to provide profiles according to the σ factor which recognizes them. For this, the clustering technique was applied since it is a viable method for finding hidden patterns on a data set. As a result, 20 groups of sequences were formed, and, aided by the Weblogo tool, it was possible to determine sequence profiles. These found patterns should be considered for implementing computational prediction tools. In addition, evidence was found of an overlap between the functions of the genes regulated by different σ factors, suggesting that DNA structural properties are also essential parameters for further studies.
Affiliation(s)
- Gabriel Dall'Alba
- Department of Life Sciences, Universidade de Caxias do Sul, Caxias do Sul, Rio Grande do Sul, Brazil
- Pedro Lenz Casa
- Department of Life Sciences, Universidade de Caxias do Sul, Caxias do Sul, Rio Grande do Sul, Brazil
- Daniel Luis Notari
- Department of Exact Sciences, Universidade de Caxias do Sul, Caxias do Sul, Rio Grande do Sul, Brazil
- Andre Gustavo Adami
- Department of Exact Sciences, Universidade de Caxias do Sul, Caxias do Sul, Rio Grande do Sul, Brazil
- Sergio Echeverrigaray
- Department of Life Sciences, Universidade de Caxias do Sul, Caxias do Sul, Rio Grande do Sul, Brazil
- Scheila de Avila E Silva
- Department of Exact Sciences, Universidade de Caxias do Sul, Caxias do Sul, Rio Grande do Sul, Brazil
|
29
|
Silveira MC, Azevedo da Silva R, Faria da Mota F, Catanho M, Jardim R, R Guimarães AC, de Miranda AB. Systematic Identification and Classification of β-Lactamases Based on Sequence Similarity Criteria: β-Lactamase Annotation. Evol Bioinform Online 2018; 14:1176934318797351. [PMID: 30210232 PMCID: PMC6131288 DOI: 10.1177/1176934318797351] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 03/15/2018] [Accepted: 08/08/2018] [Indexed: 12/11/2022] Open
Abstract
β-lactamases, the enzymes responsible for resistance to β-lactam antibiotics, are widespread among prokaryotic genera. However, current β-lactamase classification schemes do not represent their present diversity. Here, we propose a workflow to identify and classify β-lactamases. Initially, a set of curated sequences was used as a model for the construction of profile Hidden Markov Models (HMMs), specific for each β-lactamase class. An extensive, nonredundant set of β-lactamase sequences was constructed from 7 different resistance protein databases to test the methodology. The profile HMMs were improved for their specificity and sensitivity and then applied to fully assembled genomes. Five hierarchical classification levels are described, and a new class of β-lactamases with fused domains is proposed. Our profile HMMs provide a better annotation of β-lactamases, with classes and subclasses defined by objective criteria such as sequence similarity. This classification offers a solid base for studies on the diversity, dispersion, prevalence, and evolution of the different classes and subclasses of this critical enzymatic activity.
Affiliation(s)
- Melise Chaves Silveira
- Laboratório de Biologia Computacional e Sistemas, Instituto Oswaldo Cruz, Fiocruz, Rio de Janeiro, Brazil
- Rangeline Azevedo da Silva
- Laboratório de Biologia Computacional e Sistemas, Instituto Oswaldo Cruz, Fiocruz, Rio de Janeiro, Brazil
- Fábio Faria da Mota
- Laboratório de Biologia Computacional e Sistemas, Instituto Oswaldo Cruz, Fiocruz, Rio de Janeiro, Brazil
- Marcos Catanho
- Laboratório de Genômica Funcional e Bioinformática, Instituto Oswaldo Cruz, Fiocruz, Rio de Janeiro, Brazil
- Rodrigo Jardim
- Laboratório de Biologia Computacional e Sistemas, Instituto Oswaldo Cruz, Fiocruz, Rio de Janeiro, Brazil
- Ana Carolina R Guimarães
- Laboratório de Genômica Funcional e Bioinformática, Instituto Oswaldo Cruz, Fiocruz, Rio de Janeiro, Brazil
- Antonio B de Miranda
- Laboratório de Biologia Computacional e Sistemas, Instituto Oswaldo Cruz, Fiocruz, Rio de Janeiro, Brazil
|
30
|
Sriwanna K, Boongoen T, Iam-On N. Graph clustering-based discretization approach to microarray data. Knowl Inf Syst 2018. [DOI: 10.1007/s10115-018-1249-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/28/2022]
|
31
|
Lin J, Adjeroh DA, Jiang BH, Jiang Y. K2 and K2*: efficient alignment-free sequence similarity measurement based on Kendall statistics. Bioinformatics 2018; 34:1682-1689. [PMID: 29253072 PMCID: PMC6355110 DOI: 10.1093/bioinformatics/btx809] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 09/08/2017] [Revised: 12/11/2017] [Accepted: 12/14/2017] [Indexed: 11/13/2022] Open
Abstract
Motivation: Alignment-free sequence comparison methods can compute the pairwise similarity between a huge number of sequences much faster than sequence-alignment based methods. Results: We propose a new non-parametric alignment-free sequence comparison method, called K2, based on the Kendall statistics. Compared to the other state-of-the-art alignment-free comparison methods, K2 demonstrates competitive performance in generating the phylogenetic tree, in evaluating functionally related regulatory sequences, and in computing the edit distance (similarity/dissimilarity) between sequences. Furthermore, the K2 approach is much faster than the other methods. An improved method, K2*, is also proposed, which is able to determine the appropriate algorithmic parameter (length) automatically, without first considering different values. Comparative analysis with the state-of-the-art alignment-free sequence similarity methods demonstrates the superiority of the proposed approaches, especially with increasing sequence length, or increasing dataset sizes. Availability and implementation: The K2 and K2* approaches are implemented in the R language as a package and are freely available for open access (http://community.wvu.edu/daadjeroh/projects/K2/K2_1.0.tar.gz). Contact: yueljiang@163.com. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jie Lin
- Department of Software Engineering, College of Mathematics and Informatics, Fujian Normal University, Fuzhou, China
- Donald A Adjeroh
- Department of Computer Science & Electrical Engineering, West Virginia University, Morgantown, WV, USA
- Bing-Hua Jiang
- Department of Pathology, Carver College of Medicine, The University of Iowa, Iowa City, IA, USA
- Yue Jiang
- Department of Software Engineering, College of Mathematics and Informatics, Fujian Normal University, Fuzhou, China
|
32
|
Lin J, Wei J, Adjeroh D, Jiang BH, Jiang Y. SSAW: A new sequence similarity analysis method based on the stationary discrete wavelet transform. BMC Bioinformatics 2018; 19:165. [PMID: 29720081 PMCID: PMC5930706 DOI: 10.1186/s12859-018-2155-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 11/19/2017] [Accepted: 04/11/2018] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Alignment-free sequence similarity analysis methods often lead to significant savings in computational time over alignment-based counterparts. RESULTS: A new alignment-free sequence similarity analysis method, called SSAW, is proposed. SSAW stands for Sequence Similarity Analysis using the Stationary Discrete Wavelet Transform (SDWT). It extracts k-mers from a sequence, then maps each k-mer to a complex number field. Then, the series of complex numbers formed are transformed into feature vectors using the stationary discrete wavelet transform. After these steps, the original sequence is turned into a feature vector with numeric values, which can then be used for clustering and/or classification. CONCLUSIONS: Using two different types of applications, namely, clustering and classification, we compared SSAW against the state-of-the-art alignment-free sequence analysis methods. SSAW demonstrates competitive or superior performance in terms of standard indicators, such as accuracy, F-score, precision, and recall. The running time was significantly better in most cases. These make SSAW a suitable method for sequence analysis, especially given the rapidly increasing volumes of sequence data required by most modern applications.
Affiliation(s)
- Jie Lin
- College of Mathematics and Informatics, Fujian Normal University, Fuzhou, 350108, People's Republic of China
- Jing Wei
- College of Mathematics and Informatics, Fujian Normal University, Fuzhou, 350108, People's Republic of China
- Donald Adjeroh
- Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, 26506, WV, USA
- Bing-Hua Jiang
- Department of Pathology, University of Iowa, Iowa City, 52242, Iowa, USA
- Yue Jiang
- College of Mathematics and Informatics, Fujian Normal University, Fuzhou, 350108, People's Republic of China.
|
33
|
Guo G, Chen L, Ye Y, Jiang Q. Cluster Validation Method for Determining the Number of Clusters in Categorical Sequences. IEEE Transactions on Neural Networks and Learning Systems 2017; 28:2936-2948. [PMID: 28114078 DOI: 10.1109/tnnls.2016.2608354] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Indexed: 06/06/2023]
Abstract
Cluster validation, which is the process of evaluating the quality of clustering results, plays an important role for practical machine learning systems. Categorical sequences, such as biological sequences in computational biology, have become common in real-world applications. Different from previous studies, which mainly focused on attribute-value data, in this paper, we work on the cluster validation problem for categorical sequences. The evaluation of sequences clustering is currently difficult due to the lack of an internal validation criterion defined with regard to the structural features hidden in sequences. To solve this problem, in this paper, a novel cluster validity index (CVI) is proposed as a function of clustering, with the intracluster structural compactness and intercluster structural separation linearly combined to measure the quality of sequence clusters. A partition-based algorithm for robust clustering of categorical sequences is also proposed, which provides the new measure with high-quality clustering results by the deterministic initialization and the elimination of noise clusters using an information theoretic method. The new clustering algorithm and the CVI are then assembled within the common model selection procedure to determine the number of clusters in categorical sequence sets. A case study on commonly used protein sequences and the experimental results on some real-world sequence sets from different domains are given to demonstrate the performance of the proposed method.
|
34
|
Yuan L, Wang W, Chen L. Two-stage pruning method for gram-based categorical sequence clustering. Int J Mach Learn Cyb 2017. [DOI: 10.1007/s13042-017-0744-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/25/2022]
|
35
|
Zielezinski A, Vinga S, Almeida J, Karlowski WM. Alignment-free sequence comparison: benefits, applications, and tools. Genome Biol 2017; 18:186. [PMID: 28974235 PMCID: PMC5627421 DOI: 10.1186/s13059-017-1319-7] [Citation(s) in RCA: 248] [Impact Index Per Article: 35.4] [Indexed: 01/30/2023] Open
Abstract
Alignment-free sequence analyses have been applied to problems ranging from whole-genome phylogeny to the classification of protein families, identification of horizontally transferred genes, and detection of recombined sequences. The strength of these methods makes them particularly useful for next-generation sequencing data processing and analysis. However, many researchers are unclear about how these methods work, how they compare to alignment-based methods, and what their potential is for use for their research. We address these questions and provide a guide to the currently available alignment-free sequence analysis tools.
Affiliation(s)
- Andrzej Zielezinski
- Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University in Poznan, Umultowska 89, 61-614, Poznan, Poland
- Susana Vinga
- IDMEC, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1049-001, Lisbon, Portugal
- Jonas Almeida
- Stony Brook University (SUNY), 101 Nicolls Road, Stony Brook, NY, 11794, USA
- Wojciech M Karlowski
- Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University in Poznan, Umultowska 89, 61-614, Poznan, Poland.
|
36
|
Amiri S, Dinov ID. Comparison of genomic data via statistical distribution. J Theor Biol 2016; 407:318-327. [PMID: 27460589 PMCID: PMC5361063 DOI: 10.1016/j.jtbi.2016.07.032] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 04/07/2016] [Revised: 06/22/2016] [Accepted: 07/20/2016] [Indexed: 11/28/2022]
Abstract
Sequence comparison has become an essential tool in bioinformatics, because highly homologous sequences usually imply significant functional or structural similarity. Traditional sequence analysis techniques are based on preprocessing and alignment, which facilitate measuring and quantitative characterization of genetic differences, variability and complexity. However, recent developments of next generation and whole genome sequencing technologies give rise to new challenges that are related to measuring similarity and capturing rearrangements of large segments contained in the genome. This work is devoted to illustrating different methods recently introduced for quantifying sequence distances and variability. Most of the alignment-free methods rely on counting words, which are small contiguous fragments of the genome. Our approach considers the locations of nucleotides in the sequences and relies more on appropriate statistical distributions. The results of this technique for comparing sequences, by extracting information and comparing matching fidelity and location regularization information, are very encouraging, specifically to classify mutation sequences.
Affiliation(s)
- Saeid Amiri
- University of Wisconsin-Green Bay, Department of Natural and Applied Sciences, Green Bay, WI, USA.
- Ivo D Dinov
- Statistics Online Computational Resource (SOCR), Michigan Institute for Data Science (MIDAS), School of Nursing, University of Michigan, Ann Arbor, MI 49109, USA.
|
37
|
Taylor WR. Reduction, alignment and visualisation of large diverse sequence families. BMC Bioinformatics 2016; 17:300. [PMID: 27484804 PMCID: PMC4971687 DOI: 10.1186/s12859-016-1059-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 03/11/2016] [Accepted: 04/21/2016] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Current volumes of sequence data can lead to large numbers of hits identified on a search, typically in the range of 10s to 100s of thousands. It is often quite difficult to tell from these raw results whether the search has been a success or has picked up sequences with little or no relationship to the query. The best approach to this problem is to cluster and align the resulting families; however, existing methods concentrate on fast clustering and either do not align the sequences or only perform a limited alignment. RESULTS: A method (MULSEL) is presented that combines fast peptide-based pre-sorting with a following cascade of mini-alignments, each of which is generated with a robust profile/profile method. From these mini-alignments, a representative sequence is selected, based on a variety of intrinsic and user-specified criteria that are combined to produce the sequence collection for the next cycle of alignment. For moderate sized sequence collections (10s of thousands) the method executes on a laptop computer within seconds or minutes. CONCLUSIONS: MULSEL bridges a gap between fast clustering methods and slower multiple sequence alignment methods and provides a seamless transition from one to the other. Furthermore, it presents the resulting reduced family in a graphical manner that makes it clear if family members have been misaligned or if there are sequences present that appear inconsistent.
|
38
|
Dubey AK, Gupta U, Jain S. Analysis of k-means clustering approach on the breast cancer Wisconsin dataset. Int J Comput Assist Radiol Surg 2016; 11:2033-2047. [PMID: 27311823 DOI: 10.1007/s11548-016-1437-9] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.9] [Received: 02/15/2016] [Accepted: 05/27/2016] [Indexed: 11/25/2022]
Abstract
PURPOSE: Breast cancer is one of the most common cancers found worldwide and most frequently found in women. An early detection of breast cancer provides the possibility of its cure; therefore, a large number of studies are currently going on to identify methods that can detect breast cancer in its early stages. This study aimed to find the effects of the k-means clustering algorithm with different computation measures like centroid, distance, split method, epoch, attribute, and iteration and to identify the combination of measures that has the potential for highly accurate clustering. METHODS: The k-means algorithm was used to evaluate the impact of clustering using centroid initialization, distance measures, and split methods. The experiments were performed using the breast cancer Wisconsin (BCW) diagnostic dataset. Foggy and random centroids were used for the centroid initialization. In foggy centroid, based on random values, the first centroid was calculated. For random centroid, the initial centroid was considered as (0, 0). RESULTS: The results were obtained by employing the k-means algorithm and are discussed with different cases considering variable parameters. The calculations were based on the centroid (foggy/random), distance (Euclidean/Manhattan/Pearson), split (simple/variance), threshold (constant epoch/same centroid), attribute (2-9), and iteration (4-10). Approximately 92% average positive prediction accuracy was obtained with this approach. Better results were found for the same centroid and the highest variance. The results achieved using Euclidean and Manhattan were better than the Pearson correlation. CONCLUSIONS: The findings of this work provided extensive understanding of the computational parameters that can be used with k-means. The results indicated that k-means has a potential to classify the BCW dataset.
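To make the role of the distance measure concrete, the following Python sketch runs a plain k-means loop with a pluggable distance function. It is an assumption-laden illustration, not the study's implementation: the deterministic initialization (first k points), the stop-when-assignments-repeat rule, and all function names are hypothetical, and the mean-based centroid update is kept for both distances for simplicity.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def kmeans(points, k, distance=euclidean, max_iter=10):
    """Assign points to k clusters; returns (labels, centroids)."""
    # Assumption: use the first k points as initial centroids (deterministic).
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so centroids are stable
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (10, 10), (10, 11)]
    for dist in (euclidean, manhattan):
        print(dist.__name__, kmeans(data, 2, distance=dist))
```

Passing the distance as a parameter makes it straightforward to compare measures on the same data while holding initialization and stopping rule fixed.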
Affiliation(s)
- Ashutosh Kumar Dubey
- JK Lakshmipat University, Near Mahindra SEZ, P.O. Mahapura Ajmer Road, Jaipur, Rajasthan, 302 026, India.
- Umesh Gupta
- JK Lakshmipat University, Near Mahindra SEZ, P.O. Mahapura Ajmer Road, Jaipur, Rajasthan, 302 026, India
- Sonal Jain
- JK Lakshmipat University, Near Mahindra SEZ, P.O. Mahapura Ajmer Road, Jaipur, Rajasthan, 302 026, India
|
39
|
Ahmad M, Jung LT, Bhuiyan MAA. On fuzzy semantic similarity measure for DNA coding. Comput Biol Med 2015; 69:144-51. [PMID: 26773936 DOI: 10.1016/j.compbiomed.2015.12.017] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 08/14/2015] [Revised: 12/22/2015] [Accepted: 12/23/2015] [Indexed: 11/28/2022]
Abstract
A coding measure scheme numerically translates the DNA sequence to a time domain signal for protein coding regions identification. A number of coding measure schemes based on numerology, geometry, fixed mapping, statistical characteristics and chemical attributes of nucleotides have been proposed in recent decades. Such coding measure schemes lack the biologically meaningful aspects of nucleotide data and hence do not significantly discriminate coding regions from non-coding regions. This paper presents a novel fuzzy semantic similarity measure (FSSM) coding scheme centering on FSSM codons' clustering and genetic code context of nucleotides. Certain natural characteristics of nucleotides i.e. appearance as a unique combination of triplets, preserving special structure and occurrence, and ability to own and share density distributions in codons have been exploited in FSSM. The nucleotides' fuzzy behaviors, semantic similarities and defuzzification based on the center of gravity of nucleotides revealed a strong correlation between nucleotides in codons. The proposed FSSM coding scheme attains a significant enhancement in coding regions identification i.e. 36-133% as compared to other existing coding measure schemes tested over more than 250 benchmarked and randomly taken DNA datasets of different organisms.
Affiliation(s)
- Muneer Ahmad
- College of Computer Sciences, King Faisal University, Saudi Arabia.
- Low Tang Jung
- Department of Computer Sciences, University Technology PETRONAS, Malaysia.
|
40
|
Romanel A, Lago S, Prandi D, Sboner A, Demichelis F. ASEQ: fast allele-specific studies from next-generation sequencing data. BMC Med Genomics 2015; 8:9. [PMID: 25889339 PMCID: PMC4363342 DOI: 10.1186/s12920-015-0084-2] [Citation(s) in RCA: 48] [Impact Index Per Article: 5.3] [Received: 11/17/2014] [Accepted: 02/12/2015] [Indexed: 11/17/2022] Open
Abstract
Background: Single base level information from next-generation sequencing (NGS) allows for the quantitative assessment of biological phenomena such as mosaicism or allele-specific features in healthy and diseased cells. Such studies often present with computationally challenging burdens that hinder genome-wide investigations across large datasets that are now becoming available through the 1,000 Genomes Project and The Cancer Genome Atlas (TCGA) initiatives. Results: We present ASEQ, a tool to perform gene-level allele-specific expression (ASE) analysis from paired genomic and transcriptomic NGS data without requiring paternal and maternal genome data. ASEQ offers an easy-to-use set of modes that transparently to the user takes full advantage of a built-in fast computational engine. We report its performances on a set of 20 individuals from the 1,000 Genomes Project and show its detection power on imprinted genes. Next we demonstrate high level of ASE calls concordance when comparing it to AlleleSeq and MBASED tools. Finally, using a prostate cancer dataset we report on a higher fraction of ASE genes with respect to healthy individuals and show allele-specific events nominated by ASEQ in genes that are implicated in the disease. Conclusions: ASEQ can be used to rapidly and reliably screen large NGS datasets for the identification of allele specific features. It can be integrated in any NGS pipeline and runs on computer systems with multiple CPUs, CPUs with multiple cores or across clusters of machines. Electronic supplementary material: The online version of this article (doi:10.1186/s12920-015-0084-2) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Alessandro Romanel
- Centre for Integrative Biology (CIBIO), University of Trento, Trento, Italy.
- Sara Lago
- Centre for Integrative Biology (CIBIO), University of Trento, Trento, Italy.
- Davide Prandi
- Centre for Integrative Biology (CIBIO), University of Trento, Trento, Italy.
- Andrea Sboner
- Department of Pathology and Laboratory Medicine, Weill Cornell Medical College, New York, USA; Institute for Computational Biomedicine, Weill Cornell Medical College, New York, USA; Institute for Precision Medicine, Weill Cornell Medical College & New York Presbyterian Hospital, New York, USA.
- Francesca Demichelis
- Centre for Integrative Biology (CIBIO), University of Trento, Trento, Italy; Institute for Computational Biomedicine, Weill Cornell Medical College, New York, USA; Institute for Precision Medicine, Weill Cornell Medical College & New York Presbyterian Hospital, New York, USA.
|
41
|
Schmidt TSB, Matias Rodrigues JF, von Mering C. Limits to robustness and reproducibility in the demarcation of operational taxonomic units. Environ Microbiol 2014; 17:1689-706. [PMID: 25156547 DOI: 10.1111/1462-2920.12610] [Citation(s) in RCA: 69] [Impact Index Per Article: 6.9] [Received: 04/25/2014] [Accepted: 08/21/2014] [Indexed: 11/27/2022]
Abstract
The demarcation of operational taxonomic units (OTUs) from complex sequence data sets is a key step in contemporary studies of microbial ecology. However, as biologically motivated 'optimal' OTU-binning algorithms remain elusive, many conceptually distinct approaches continue to be used. Using a global data set of 887 870 bacterial 16S rRNA gene sequences, we objectively quantified biases introduced by several widely employed sequence clustering algorithms. We found that OTU-binning methods often provided surprisingly non-equivalent partitions of identical data sets, notably when clustering to the same nominal similarity thresholds; and we quantified the resulting impact on ecological data description for a well-defined human skin microbiome data set. We observed that some methods were very robust to varying clustering thresholds, while others were found to be highly susceptible even to slight threshold variations. Moreover, we comprehensively quantified the impact of the choice of 16S rRNA gene subregion, as well as of data set scope and context on algorithm performance. Our findings may contribute to an enhanced comparability of results across sequence-processing pipelines, and we arrive at recommendations towards higher levels of standardization in established workflows.
Affiliation(s)
- Thomas S B Schmidt
- Institute for Molecular Life Sciences and Swiss Institute of Bioinformatics, University of Zurich, Winterthurerstrasse 190, Zürich, 8057, Switzerland
|
42
|
Bao J, Yuan R, Bao Z. An improved alignment-free model for DNA sequence similarity metric. BMC Bioinformatics 2014; 15:321. [PMID: 25261973 PMCID: PMC4261891 DOI: 10.1186/1471-2105-15-321] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Received: 12/09/2013] [Accepted: 09/23/2014] [Indexed: 11/23/2022] Open
Abstract
BACKGROUND: DNA clustering is an important technique for automatically finding the inherent relationships among large numbers of DNA sequences, but clustering quality can still be improved greatly. The DNA sequence similarity metric is one of the key points of clustering. The alignment-free methodology is a very popular way to calculate DNA sequence similarity. It normally converts a sequence into a feature space based on the probability distribution of words rather than directly matching strings. Existing alignment-free models, e.g. k-tuple, employ only word frequency information and ignore many types of useful information contained in the DNA sequence, such as the classification of nucleotide bases and their positions. It is believed that better data mining results can be achieved with compounded information. Therefore, we present a new alignment-free model that employs compounded information to improve DNA clustering quality. RESULTS: This paper proposes a Category-Position-Frequency (CPF) model, which utilizes the word frequency, position and classification information of nucleotide bases from DNA sequences. The CPF model converts a DNA sequence into three sequences according to the categories of nucleotide bases, and then yields a 12-dimension feature vector. The feature values are computed by an entropy-based model that takes both local word frequency and position information into account. We conduct DNA clustering experiments on several datasets and compare CPF with mainstream alignment-free models, including k-tuple, DMk, TSM, AMI and CV. The experiments show that the CPF model is superior to the other models in terms of clustering results and optimal settings. CONCLUSIONS: The following conclusions can be drawn from the experiments. (1) The hybrid information model is better than the model based on word frequency only. (2) For DNA sequences of no more than 5000 characters, the preferred sliding-window size for CPF is two, which provides a great advantage in system performance. (3) The CPF model achieves efficient, stable performance and broad generalization.
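As a rough illustration of the ideas named in this abstract, the Python sketch below builds a k-tuple word-frequency vector (the baseline the abstract mentions) and a 12-dimension frequency vector obtained by recoding the sequence under three base classifications. The three classifications used here (purine/pyrimidine, amino/keto, weak/strong) and all function names are assumptions chosen for illustration; the abstract does not state them. The sketch uses plain word frequencies only and does not reproduce the entropy-based, position-aware feature values of the CPF model.

```python
from collections import Counter
from itertools import product
import math


def ktuple_vector(seq, k=2, alphabet="ACGT"):
    """Relative frequency of every length-k word, in a fixed word order."""
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Words containing symbols outside the alphabet are ignored.
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical base classifications (an assumption, not taken from the abstract).
CATEGORIES = {
    "purine_pyrimidine": {"A": "R", "G": "R", "C": "Y", "T": "Y"},
    "amino_keto": {"A": "M", "C": "M", "G": "K", "T": "K"},
    "weak_strong": {"A": "W", "T": "W", "C": "S", "G": "S"},
}


def category_vector(seq, k=2):
    """Concatenate word frequencies of the three recoded sequences.

    With k=2 each two-letter alphabet gives 4 words, so the result has
    3 * 4 = 12 entries. Assumes seq contains only A, C, G and T.
    """
    vec = []
    for mapping in CATEGORIES.values():
        reduced = "".join(mapping[base] for base in seq)
        alphabet = "".join(sorted(set(mapping.values())))
        vec.extend(ktuple_vector(reduced, k, alphabet))
    return vec
```

Two sequences can then be compared with, for example, `euclidean(category_vector(a), category_vector(b))`; a smaller distance indicates more similar word usage under this simplified representation.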
Affiliation(s)
- Junpeng Bao
- Department of Computer Science and Technology, Xi’an Jiaotong University, West Xianning Road, 710049 Xi’an, P.R. China
- Ruiyu Yuan
- Department of Computer Science and Technology, Xi’an Jiaotong University, West Xianning Road, 710049 Xi’an, P.R. China
- Zhe Bao
- Department of Computer Science and Technology, Xi’an Jiaotong University, West Xianning Road, 710049 Xi’an, P.R. China
|
43
|
Vinga S. Information theory applications for biological sequence analysis. Brief Bioinform 2014; 15:376-89. [PMID: 24058049 PMCID: PMC7109941 DOI: 10.1093/bib/bbt068] [Citation(s) in RCA: 67] [Impact Index Per Article: 6.7] [Received: 07/11/2013] [Accepted: 08/17/2013] [Indexed: 01/13/2023] Open
Abstract
Information theory (IT) addresses the analysis of communication systems and has been widely applied in molecular biology. In particular, alignment-free sequence analysis and comparison greatly benefited from concepts derived from IT, such as entropy and mutual information. This review covers several aspects of IT applications, ranging from genome global analysis and comparison, including block-entropy estimation and resolution-free metrics based on iterative maps, to local analysis, comprising the classification of motifs, prediction of transcription factor binding sites and sequence characterization based on linguistic complexity and entropic profiles. IT has also been applied to high-level correlations that combine DNA, RNA or protein features with sequence-independent properties, such as gene mapping and phenotype analysis, and has also provided models based on communication systems theory to describe information transmission channels at the cell level and also during evolutionary processes. While not exhaustive, this review attempts to categorize existing methods and to indicate their relation with broader transversal topics such as genomic signatures, data compression and complexity, time series analysis and phylogenetic classification, providing a resource for future developments in this promising area.
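The two concepts singled out in this abstract, entropy and mutual information, can be computed for a sequence in a few lines. The sketch below is a generic illustration and not a method taken from the review: it estimates Shannon entropy from symbol counts and the mutual information between symbols separated by a fixed lag, using plug-in (empirical frequency) estimates. The function names and the `lag` parameter are assumptions made for this example.

```python
from collections import Counter
import math


def shannon_entropy(symbols):
    """Plug-in Shannon entropy in bits: H = -sum(p * log2(p))."""
    counts = Counter(symbols)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mutual_information(seq, lag=1):
    """Plug-in mutual information (bits) between seq[i] and seq[i + lag].

    I(X;Y) = sum over (x, y) of p(x, y) * log2(p(x, y) / (p(x) * p(y))),
    with all probabilities estimated from the observed pairs.
    """
    pairs = [(seq[i], seq[i + lag]) for i in range(len(seq) - lag)]
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(x for x, _ in pairs)
    right = Counter(y for _, y in pairs)
    # p(x,y) / (p(x) p(y)) simplifies to c * n / (left[x] * right[y]).
    return sum(
        (c / n) * math.log2(c * n / (left[x] * right[y]))
        for (x, y), c in joint.items()
    )
```

A sequence that uses all four bases equally often has an entropy of 2 bits per symbol, while a strictly alternating sequence such as `"ACACACAC"` has high lag-1 mutual information because each symbol determines the next. Plug-in estimates are biased for short sequences, so values from small samples should be read with caution.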
Affiliation(s)
- Susana Vinga
- IDMEC, Instituto Superior Técnico - Universidade de Lisboa (IST-UL), Av. Rovisco Pais, 1049-001 Lisboa, Portugal. Tel.: +351-218419504; Fax: +351-218498097.
|