1. Zheng Z, Zhu M, Zhang J, Liu X, Hou L, Liu W, Yuan S, Luo C, Yao X, Liu J, Yang Y. A sequence-aware merger of genomic structural variations at population scale. Nat Commun 2024; 15:960. [PMID: 38307885 PMCID: PMC10837428 DOI: 10.1038/s41467-024-45244-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2023] [Accepted: 01/18/2024] [Indexed: 02/04/2024] Open
Abstract
Merging structural variations (SVs) at the population level presents a significant challenge, yet it is essential for conducting comprehensive genotypic analyses, especially in the era of pangenomics. Here, we introduce PanPop, a tool that utilizes an advanced sequence-aware SV merging algorithm to efficiently merge SVs of various types. We demonstrate that PanPop can merge and optimize the majority of multiallelic SVs into informative biallelic variants. We show its superior precision and lower rates of missing data compared to alternative software solutions. Our approach not only enables the filtering of SVs by leveraging multiple SV callers for enhanced accuracy but also facilitates the accurate merging of large-scale population SVs. These capabilities of PanPop will help to accelerate future SV-related studies.
Affiliation(s)
- Zeyu Zheng
  - State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou, China
- Mingjia Zhu
  - State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou, China
- Jin Zhang
  - State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou, China
- Xinfeng Liu
  - State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou, China
- Liqiang Hou
  - State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou, China
- Wenyu Liu
  - State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou, China
- Shuai Yuan
  - State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou, China
- Changhong Luo
  - State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou, China
- Xinhao Yao
  - State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou, China
- Jianquan Liu
  - State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou, China.
- Yongzhi Yang
  - State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou, China.
2. Niu M, Wang C, Chen Y, Zou Q, Xu L. Identification, characterization and expression analysis of circRNA encoded by SARS-CoV-1 and SARS-CoV-2. Brief Bioinform 2024; 25:bbad537. [PMID: 38279648 PMCID: PMC10818166 DOI: 10.1093/bib/bbad537] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/16/2023] [Revised: 12/12/2023] [Accepted: 12/22/2023] [Indexed: 01/28/2024] Open
Abstract
Virus-encoded circular RNA (circRNA) participates in the immune response to viral infection, affects the human immune system, and can be used as a target for precision therapy and as a tumor biomarker. The coronaviruses SARS-CoV-1 and SARS-CoV-2 (SARS-CoV-1/2) that have emerged in recent years are highly contagious and have high mortality rates. Little is known about the circRNA encoded by SARS-CoV-1/2. Therefore, this study explores whether SARS-CoV-1/2 encodes circRNA and, if so, the characteristics and functions of that circRNA. Based on RNA-seq data of SARS-CoV-1 and SARS-CoV-2 infections, we used circRNA identification tools (circRNA_finder, find_circ and CIRI2) to identify circRNAs. We identified 151 circRNAs encoded by SARS-CoV-1 and 470 encoded by SARS-CoV-2, which suggests that SARS-CoV-2 has a more prominent circRNA-encoding ability than SARS-CoV-1. Expression analysis showed that only a few circRNAs encoded by SARS-CoV-1/2 were highly expressed, and that the positive strand produced more abundant circRNAs. Then, based on the identified SARS-CoV-1/2-encoded circRNAs, we performed circRNA identification and characterization using the previously developed CirRNAPL. Finally, target gene prediction and functional enrichment analysis were performed. It was found that viral circRNA is closely related to cancer and has a potential role in regulating host cell functions. This study examined the characteristics and functions of viral circRNA encoded by the coronaviruses SARS-CoV-1/2, providing a valuable resource for further research on the function and molecular mechanism of coronavirus circRNA.
Affiliation(s)
- Mengting Niu
  - School of Electronic and Communication Engineering, Shenzhen Polytechnic University, Shenzhen 518055, China
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Chunyu Wang
  - Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang 150000, China
- Yaojia Chen
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, No.4 Block 2 North Jianshe Road, Chengdu 610054, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, No.4 Block 2 North Jianshe Road, Chengdu 610054, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Lei Xu
  - School of Electronic and Communication Engineering, Shenzhen Polytechnic University, Shenzhen 518055, China
3. Zulfiqar H, Guo Z, Ahmad RM, Ahmed Z, Cai P, Chen X, Zhang Y, Lin H, Shi Z. Deep-STP: a deep learning-based approach to predict snake toxin proteins by using word embeddings. Front Med (Lausanne) 2024; 10:1291352. [PMID: 38298505 PMCID: PMC10829051 DOI: 10.3389/fmed.2023.1291352] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/09/2023] [Accepted: 12/26/2023] [Indexed: 02/02/2024] Open
Abstract
Snake venom contains many toxic proteins that can destroy the circulatory system or nervous system of prey. Studies have found that these snake venom proteins have the potential to treat cardiovascular and nervous system diseases. Therefore, the study of snake venom proteins is conducive to the development of related drugs. Research technologies based on traditional biochemistry can accurately identify these proteins, but the experiments are costly and time-consuming. Artificial intelligence technology provides a new, computational strategy for large-scale screening of snake venom proteins. In this paper, we developed a sequence-based computational method to recognize snake toxin proteins. Specifically, we utilized three different feature descriptors, namely g-gap, natural vector and word2vec, to encode snake toxin protein sequences. Analysis of variance (ANOVA) and the gradient-boosting decision tree algorithm (GBDT), combined with incremental feature selection (IFS), were used to optimize the features, and the optimized features were then input into the deep learning model for training. The results show that our model achieves an accuracy of 82.00% in 10-fold cross-validation. The model was further verified on independent data, where the accuracy reaches 81.14%, which demonstrates that our model has excellent prediction performance and robustness.
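As a rough illustration of the g-gap descriptor named in the abstract, the sketch below counts residue pairs separated by g intervening positions and normalizes the counts into a 400-dimensional frequency vector. The function name, the restriction to the 20 standard amino acids, and the normalization are assumptions made for illustration; the abstract does not give the authors' exact definition.

```python
from itertools import product

# Assumption: only the 20 standard amino acids are counted.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def g_gap_dipeptide_composition(seq, g=0):
    """Return the frequencies of residue pairs separated by g positions.

    Hypothetical helper: the output has 20 * 20 = 400 entries, ordered
    AA, AC, AD, ..., YY. Pairs containing non-standard symbols are skipped.
    """
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    total = 0
    for i in range(len(seq) - g - 1):
        pair = seq[i] + seq[i + g + 1]
        if pair in counts:
            counts[pair] += 1
            total += 1
    # Normalize to frequencies; an empty or too-short sequence gives zeros.
    return [counts[p] / total if total else 0.0 for p in pairs]


features = g_gap_dipeptide_composition("ACDAC", g=1)
```

With g = 0 this reduces to ordinary adjacent dipeptide composition; larger values of g capture longer-range pair correlations at the same feature dimension.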
Affiliation(s)
- Hasan Zulfiqar
  - Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China, Huzhou, Zhejiang, China
- Zhiling Guo
  - Beidahuang Industry Group General Hospital, Harbin, China
- Ramala Masood Ahmad
  - Department of Plant Breeding and Genetics, University of Agriculture Faisalabad, Faisalabad, Pakistan
- Zahoor Ahmed
  - Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China, Huzhou, Zhejiang, China
- Peiling Cai
  - School of Basic Medical Sciences, Chengdu University, Chengdu, China
- Xiang Chen
  - Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China, Huzhou, Zhejiang, China
- Yang Zhang
  - Innovative Institute of Chinese Medicine and Pharmacy, Academy for Interdiscipline, Chengdu University of Traditional Chinese Medicine, Chengdu, China
- Hao Lin
  - Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China, Huzhou, Zhejiang, China
- Zheng Shi
  - Clinical Genetics Laboratory, Clinical Medical College & Affiliated Hospital, Chengdu University, Chengdu, China
4. Pang Y, Liu B. DisoFLAG: accurate prediction of protein intrinsic disorder and its functions using graph-based interaction protein language model. BMC Biol 2024; 22:3. [PMID: 38166858 PMCID: PMC10762911 DOI: 10.1186/s12915-023-01803-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/25/2023] [Accepted: 12/15/2023] [Indexed: 01/05/2024] Open
Abstract
Intrinsically disordered proteins and regions (IDPs/IDRs) are functionally important proteins and regions that exist as highly dynamic conformations under natural physiological conditions. IDPs/IDRs exhibit a broad range of molecular functions, and their functions involve binding interactions with partners and remaining native structural flexibility. The rapid increase in the number of proteins in sequence databases and the diversity of disordered functions challenge existing computational methods for predicting protein intrinsic disorder and disordered functions. A disordered region interacts with different partners to perform multiple functions, and these disordered functions exhibit different dependencies and correlations. In this study, we introduce DisoFLAG, a computational method that leverages a graph-based interaction protein language model (GiPLM) for jointly predicting disorder and its multiple potential functions. GiPLM integrates protein semantic information based on pre-trained protein language models into graph-based interaction units to enhance the correlation of the semantic representation of multiple disordered functions. The DisoFLAG predictor takes amino acid sequences as the only inputs and provides predictions of intrinsic disorder and six disordered functions for proteins, including protein-binding, DNA-binding, RNA-binding, ion-binding, lipid-binding, and flexible linker. We evaluated the predictive performance of DisoFLAG following the Critical Assessment of protein Intrinsic Disorder (CAID) experiments, and the results demonstrated that DisoFLAG offers accurate and comprehensive predictions of disordered functions, extending the current coverage of computationally predicted disordered function categories. The standalone package and web server of DisoFLAG have been established to provide accurate prediction tools for intrinsic disorders and their associated functions.
Affiliation(s)
- Yihe Pang
  - School of Computer Science and Technology, Beijing Institute of Technology, No. 5, South Zhongguancun Street, Beijing, Haidian District, 100081, China
- Bin Liu
  - School of Computer Science and Technology, Beijing Institute of Technology, No. 5, South Zhongguancun Street, Beijing, Haidian District, 100081, China.
  - Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, No. 5, South Zhongguancun Street, Beijing, Haidian District, 100081, China.
5. Zhang P, Liu H, Wei Y, Zhai Y, Tian Q, Zou Q. FMAlign2: a novel fast multiple nucleotide sequence alignment method for ultralong datasets. Bioinformatics 2024; 40:btae014. [PMID: 38200554 PMCID: PMC10809904 DOI: 10.1093/bioinformatics/btae014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2023] [Revised: 12/27/2023] [Accepted: 01/09/2024] [Indexed: 01/12/2024] Open
Abstract
MOTIVATION In bioinformatics, multiple sequence alignment (MSA) is a crucial task. However, conventional methods often struggle with aligning ultralong sequences. To address this issue, researchers have designed MSA methods rooted in a vertical division strategy, which segments sequence data for parallel alignment. A prime example of this approach is FMAlign, which utilizes the FM-index to extract common seeds and segment the sequences accordingly. RESULTS FMAlign2 leverages the suffix array to identify maximal exact matches, redefining the approach of FMAlign from searching for global chains to searching for partial chains. By using a vertical division strategy, a large-scale problem is decomposed into manageable tasks, enabling parallel execution of subMSA. Furthermore, sequence-profile alignment and refinement are incorporated to concatenate subsets, yielding the final result. Compared to FMAlign, FMAlign2 markedly improves the segmentation of sequences and significantly reduces the running time while maintaining accuracy, especially on ultralong datasets. Importantly, FMAlign2 enhances existing MSA methods by conferring the capability to handle sequences reaching billions in length within an acceptable time frame. AVAILABILITY AND IMPLEMENTATION Source code and datasets are available at https://github.com/malabz/FMAlign2 and https://zenodo.org/records/10435770.
Affiliation(s)
- Pinglu Zhang
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, Sichuan, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003, Zhejiang, China
- Huan Liu
  - School of Computer Science and Technology, Southwest University of Science and Technology, Mianyang 621010, Sichuan, China
- Yanming Wei
  - School of Computer Science and Technology, Xidian University, Xi’an 710071, Shaanxi, China
- Yixiao Zhai
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, Sichuan, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003, Zhejiang, China
- Qinzhong Tian
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, Sichuan, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003, Zhejiang, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, Sichuan, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003, Zhejiang, China
6. Mahmoud MAB. Classification of DNA Sequence Based on a Non-gradient Algorithm: Pseudoinverse Learners. Methods Mol Biol 2024; 2744:359-373. [PMID: 38683331 DOI: 10.1007/978-1-0716-3581-0_23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/01/2024]
Abstract
This chapter proposes a prototype-based classification approach for analyzing DNA barcodes that uses a spectral representation of DNA sequences and a non-gradient neural network. Biological sequences can be viewed as high-dimensional data whose dimension is not fixed but corresponds to the length of the sequence. Numerical encoding, through computational procedures such as one-hot encoding (OHE), plays an important role in DNA sequence evaluation. However, the OHE method has some disadvantages: (1) it does not add any details that could result in an additional predictive variable, and (2) if the variable has many classes, OHE significantly expands the feature space. To address these shortcomings, this chapter proposes a computationally efficient framework for classifying DNA sequences of living organisms in the image domain. The proposed strategy uses a multilayer perceptron trained by a pseudoinverse learning autoencoder (PILAE) algorithm. The learning control parameters and the number of hidden layers do not have to be specified during the PILAE training process. As a result, the PILAE classifier outperforms other deep neural network (DNN) strategies such as the VGG-16 and Xception models.
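To make the feature-space point about one-hot encoding concrete, here is a minimal sketch of OHE for a DNA sequence. The four-letter alphabet, the all-zero row for unknown symbols, and the function name are assumptions for illustration and are not taken from the chapter.

```python
# Assumption: a four-letter DNA alphabet; anything else (e.g. N) is unknown.
ALPHABET = "ACGT"


def one_hot_encode(seq):
    """Encode a DNA string as a list of rows, one row of 4 indicators per base."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = []
    for base in seq.upper():
        row = [0] * len(ALPHABET)
        if base in index:  # unknown symbols are left as an all-zero row
            row[index[base]] = 1
        encoded.append(row)
    return encoded


matrix = one_hot_encode("ACGTN")
```

A sequence of length L becomes an L x 4 matrix, so the feature count grows with both the sequence length and the alphabet size, which is the expansion the abstract refers to.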
Affiliation(s)
- Mohammed A B Mahmoud
  - Faculty of Computer Science, October University for Modern Sciences and Arts, Cairo, Egypt.
7. Chen J, Chao J, Liu H, Yang F, Zou Q, Tang F. WMSA 2: a multiple DNA/RNA sequence alignment tool implemented with accurate progressive mode and a fast win-win mode combining the center star and progressive strategies. Brief Bioinform 2023:7169135. [PMID: 37200156 DOI: 10.1093/bib/bbad190] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/10/2023] [Revised: 04/09/2023] [Accepted: 04/26/2023] [Indexed: 05/20/2023] Open
Abstract
Multiple sequence alignment is widely used for sequence analysis, such as identifying important sites and phylogenetic analysis. Traditional methods, such as progressive alignment, are time-consuming. To address this issue, we introduce StarTree, a novel method that quickly constructs a guide tree by combining sequence clustering and hierarchical clustering. Furthermore, we develop a new heuristic similar-region detection algorithm using the FM-index and apply k-banded dynamic programming to the profile alignment. We also introduce a win-win alignment algorithm that applies the center star strategy within the clusters to speed up the alignment process, then uses the progressive strategy to align the center-aligned profiles, guaranteeing the final alignment's accuracy. We present WMSA 2 based on these improvements and compare its speed and accuracy with other popular methods. The results show that the guide tree built by the StarTree clustering method leads to better accuracy than PartTree while consuming less time and memory than the UPGMA and mBed methods on datasets with thousands of sequences. On simulated datasets, WMSA 2 consumes less time and memory while ranking at the top in Q and TC scores. On real datasets, WMSA 2 remains more time- and memory-efficient and ranks at the top on the average sum-of-pairs score. For the alignment of 1 million SARS-CoV-2 genomes, the win-win mode of WMSA 2 took significantly less time than the former version. The source code and data are available at https://github.com/malabz/WMSA2.
Affiliation(s)
- Juntao Chen
  - Quzhou People's Hospital, Quzhou Affiliated Hospital of Wenzhou Medical University, Quzhou, China, Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China, and the Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Jiannan Chao
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China, and the Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Huan Liu
  - School of Computer Science and Technology, Southwest University of Science and Technology, Mianyang, China
- Fenglong Yang
  - Department of Bioinformatics, Fujian Key Laboratory of Medical Bioinformatics, School of Medical Technology and Engineering, Fujian Medical University, Fuzhou, China
- Quan Zou
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China and the Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Furong Tang
  - Quzhou People's Hospital, Quzhou Affiliated Hospital of Wenzhou Medical University, Quzhou, China, and Department of Basic Medical Sciences, School of Medicine, Tsinghua University, Beijing, China
8. Wei Y, Zou Q, Tang F, Yu L. WMSA: a novel method for multiple sequence alignment of DNA sequences. Bioinformatics 2022; 38:5019-5025. [PMID: 36179076 DOI: 10.1093/bioinformatics/btac658] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 07/11/2022] [Revised: 08/30/2022] [Accepted: 09/29/2022] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION Multiple sequence alignment (MSA) is a fundamental problem in bioinformatics. The quality of the alignment affects downstream analysis. MAFFT adopts the Fast Fourier Transform method to search for homologous segments and uses them as anchors to divide the sequences, then aligns only the segments, which saves time and memory without overly reducing alignment quality. However, MAFFT becomes slow when the dataset is large. RESULTS We developed WMSA, a software tool that uses the divide-and-conquer method to split the sequences into clusters, aligns those clusters into profiles with the center star strategy and then makes a progressive profile-profile alignment. The alignment is conducted by the compiled algorithms of MAFFT and K-Band with multithreaded parallelism. Our method balances time, space and quality and performs better than MAFFT in test experiments on highly conserved datasets. AVAILABILITY AND IMPLEMENTATION Source code is freely available at https://github.com/malabz/WMSA/, which is implemented in C/C++ and supported on Linux, and datasets are available at https://github.com/malabz/WMSA-dataset. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
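The center star strategy mentioned in the abstract starts by choosing one sequence in a cluster as the center, typically the one closest to all the others. The sketch below illustrates only that selection step, using plain edit distance as the similarity measure. Both function names and the use of edit distance are assumptions for illustration; the abstract does not say how WMSA scores candidate centers.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def choose_center(seqs):
    """Return the index of the sequence with the smallest summed distance."""
    totals = [sum(edit_distance(s, t) for t in seqs) for s in seqs]
    return min(range(len(seqs)), key=totals.__getitem__)


cluster = ["ACGT", "ACGA", "TCGT", "ACGTT"]
center_index = choose_center(cluster)
```

In a full center star alignment, every other sequence would then be aligned pairwise to the chosen center and the pairwise alignments merged; this exhaustive selection costs a quadratic number of distance computations, which is one reason large datasets are first split into clusters.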
Affiliation(s)
- Yanming Wei
  - School of Computer Science and Technology, Xidian University, Xi'an, Shaanxi 710071, China
- Quan Zou
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang 324003, China
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan 610054, China
- Furong Tang
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang 324003, China
- Liang Yu
  - School of Computer Science and Technology, Xidian University, Xi'an, Shaanxi 710071, China
9. Zhang Y, Zhang Q, Liu Y, Lin M, Ding C. Multiple Sequence Alignment based on deep Q Network with negative feedback policy. Comput Biol Chem 2022; 101:107780. [DOI: 10.1016/j.compbiolchem.2022.107780] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2022] [Revised: 09/27/2022] [Accepted: 10/18/2022] [Indexed: 11/28/2022]
10. Tang F, Chao J, Wei Y, Yang F, Zhai Y, Xu L, Zou Q. HAlign 3: fast multiple alignment of ultra-large numbers of similar DNA/RNA sequences. Mol Biol Evol 2022; 39:6653123. [PMID: 35915051 PMCID: PMC9372455 DOI: 10.1093/molbev/msac166] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Indexed: 11/21/2022] Open
Abstract
HAlign is a cross-platform program that performs multiple sequence alignments based on the center star strategy. Here we present two major updates of HAlign 3, which helped improve the time efficiency and the alignment quality, and made HAlign 3 a specialized program to process ultra-large numbers of similar DNA/RNA sequences, such as closely related viral or prokaryotic genomes. HAlign 3 can be easily installed via the Anaconda and Java release package on macOS, Linux, Windows subsystem for Linux, and Windows systems, and the source code is available on GitHub (https://github.com/malabz/HAlign-3).
Affiliation(s)
- Furong Tang
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
  - School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, Guangdong, China
- Jiannan Chao
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Yanming Wei
  - School of Computer Science and Technology, Xidian University, Xi'an, China
- Fenglong Yang
  - School of Medical Technology and Engineering, Fujian Medical University, Fuzhou, China
- Yixiao Zhai
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Lei Xu
  - School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, Guangdong, China
- Quan Zou
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
11. Karabacak M, Ozkara BB, Mordag S, Bisdas S. Deep learning for prediction of isocitrate dehydrogenase mutation in gliomas: a critical approach, systematic review and meta-analysis of the diagnostic test performance using a Bayesian approach. Quant Imaging Med Surg 2022; 12:4033-4046. [PMID: 35919062 PMCID: PMC9338374 DOI: 10.21037/qims-22-34] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/13/2022] [Accepted: 05/25/2022] [Indexed: 11/08/2022]
Abstract
Background Conventionally, identifying isocitrate dehydrogenase (IDH) mutation in gliomas is based on histopathological analysis of tissue specimens acquired via stereotactic biopsy or definitive resection. Accurate pre-treatment prediction of IDH mutation status using magnetic resonance imaging (MRI) can guide clinical decision-making. We aim to evaluate the diagnostic performance of deep learning (DL) to determine IDH mutation status in gliomas. Methods A systematic search of Cochrane Library, Web of Science, Medline, and Scopus was conducted to identify relevant publications until August 1, 2021. Articles were included if all the following criteria were met: (I) patients with histopathologically confirmed World Health Organization (WHO) grade II, III, or IV gliomas; (II) histopathological examination with the IDH mutation; (III) DL was used to predict the IDH mutation status; (IV) sufficient data for reconstruction of confusion matrices in terms of the diagnostic performance of the DL algorithms; and (V) original research articles. Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) and Checklist for Artificial Intelligence in Medical Imaging (CLAIM) was used to assess the studies' quality. Bayes theorem was utilized to calculate the posttest probability. Results Four studies with a total of 1,295 patients were included. In the training set, the pooled sensitivity, specificity, and area under the summary receiver operating characteristic (SROC) curve were 93.9%, 90.9% and 0.958, respectively. In the validation set, the pooled sensitivity, specificity, and area under the SROC curve were 90.8%, 85.5% and 0.939, respectively. With a known pretest probability of 80.2%, the Bayes theorem yielded a posttest probability of 97.6% and 96.0% for a positive test and 27.0% and 30.6% for a negative test for training sets and validation sets, respectively. 
Discussion This is the first meta-analysis that summarizes the diagnostic performance of DL in predicting IDH mutation status in gliomas via the Bayes theorem. DL algorithms demonstrate excellent diagnostic performance in predicting IDH mutation in gliomas. Radiomic features associated with IDH mutation, and its underlying pathophysiology extracted from advanced MRI may improve prediction probability. However, more studies are required to optimize and increase its reliability. Limitations include obtaining some data via email and lack of training and test sets statistics.
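The posttest probabilities in this abstract come from applying Bayes' theorem to a pretest probability and a test's likelihood ratios. The sketch below shows that calculation from the pooled sensitivity and specificity point estimates. The function name is hypothetical, and plugging in point estimates this way is a simplification: the results land close to, but not exactly on, the reported figures, presumably because the meta-analysis derived its likelihood ratios from the pooled model rather than from rounded point estimates.

```python
def posttest_probability(pretest, sensitivity, specificity, positive=True):
    """Posttest probability of disease via the odds form of Bayes' theorem."""
    if positive:
        likelihood_ratio = sensitivity / (1.0 - specificity)  # LR+
    else:
        likelihood_ratio = (1.0 - sensitivity) / specificity  # LR-
    pretest_odds = pretest / (1.0 - pretest)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1.0 + posttest_odds)


# Pretest probability and pooled estimates as stated in the abstract.
PRETEST = 0.802
train_pos = posttest_probability(PRETEST, 0.939, 0.909, positive=True)
train_neg = posttest_probability(PRETEST, 0.939, 0.909, positive=False)
valid_pos = posttest_probability(PRETEST, 0.908, 0.855, positive=True)
valid_neg = posttest_probability(PRETEST, 0.908, 0.855, positive=False)
```

Note that with a pretest probability this high, even a negative result leaves a substantial posttest probability of IDH mutation, which is the pattern the abstract reports.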
Affiliation(s)
- Mert Karabacak
  - Cerrahpasa Faculty of Medicine, Istanbul University-Cerrahpasa, Cerrahpasa, Istanbul, Turkey
- Burak Berksu Ozkara
  - Cerrahpasa Faculty of Medicine, Istanbul University-Cerrahpasa, Cerrahpasa, Istanbul, Turkey
- Seren Mordag
  - Faculty of Medicine, Hacettepe University, Sihhiye, Ankara, Turkey
- Sotirios Bisdas
  - Lysholm Department of Neuroradiology, National Hospital for Neurology and Neurosurgery, London, UK
12. Pang Y, Liu B. SelfAT-Fold: Protein Fold Recognition Based on Residue-Based and Motif-Based Self-Attention Networks. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:1861-1869. [PMID: 33090951 DOI: 10.1109/tcbb.2020.3031888] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 06/11/2023]
Abstract
Protein fold recognition is a fundamental and crucial step of tertiary structure determination. In this regard, several computational predictors have been proposed. Recently, predictive performance has been markedly improved by fold-specific features generated by deep learning techniques. However, these methods fail to measure the global associations among residues or motifs along the protein sequences. Furthermore, these deep learning techniques are often treated as black boxes without interpretability. Inspired by the similarities between protein sequences and natural language sentences, we applied the self-attention mechanism from the field of natural language processing (NLP) to protein fold recognition. The motif-based self-attention network (MSAN) and the residue-based self-attention network (RSAN) were constructed based on a training set to capture the global associations among the structure motifs and residues along the protein sequences, respectively. The fold-specific attention features trained and generated from the training set were then combined with Support Vector Machines (SVMs) to predict the samples in the widely used LE benchmark dataset, which is fully independent from the training set. Experimental results showed that the two proposed SelfAT-Fold predictors outperformed 34 existing state-of-the-art computational predictors. The two SelfAT-Fold predictors were further tested on an independent dataset, SCOP_TEST, where they achieved stable performance. Furthermore, the fold-specific attention features can be used to analyse the characteristics of protein folds. The trained models and data of SelfAT-Fold can be downloaded from http://bliulab.net/selfAT_fold/.
|
13
|
Chao J, Tang F, Xu L. Developments in Algorithms for Sequence Alignment: A Review. Biomolecules 2022; 12:biom12040546. [PMID: 35454135 PMCID: PMC9024764 DOI: 10.3390/biom12040546] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/11/2022] [Revised: 03/29/2022] [Accepted: 03/31/2022] [Indexed: 01/27/2023] Open
Abstract
The continuous development of sequencing technologies has enabled researchers to obtain large amounts of biological sequence data, and this has resulted in increasing demands for software that can perform sequence alignment quickly and accurately. A number of algorithms and tools for sequence alignment have been designed to meet the various needs of biologists. Here, the ideas that prevail in the research of sequence alignment and some quality estimation methods for multiple sequence alignment tools are summarized.
Affiliation(s)
- Jiannan Chao
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China;
| | - Furong Tang
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003, China;
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen 518055, China
| | - Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen 518055, China
- Correspondence:
|
14
|
Zhang H, Zou Q, Ju Y, Song C, Chen D. Distance-based support vector machine to predict DNA N6-methyladenine modification. Curr Bioinform 2022. [DOI: 10.2174/1574893617666220404145517] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
Abstract
Background:
DNA N6-methyladenine plays an important role in the restriction-modification system to isolate invasion from adventive DNA. The shortcomings of the high time-consumption and high costs of experimental methods have been exposed, and some computational methods have emerged. The support vector machine theory has received extensive attention in the bioinformatics field due to its solid theoretical foundation and many good characteristics.
Objective:
General machine learning methods include an important step of extracting features. This study omits that step and instead uses an easy-to-obtain sequence distance matrix to obtain better results.
Method:
First, sequence alignment was used to obtain the similarity matrix. Then a novel transformation turned the similarity matrix into a distance matrix. Next, the similarity-distance matrix was made positive semi-definite so that it could be used as a kernel matrix. Finally, the LIBSVM software was applied to solve the support vector machine.
Results:
Five-fold cross-validation of this model on rice and mouse data achieved excellent accuracy rates of 92.04% and 96.51%, respectively. This shows that the DB-SVM method has clear advantages over traditional machine learning methods. Meanwhile, this model achieved accuracies of 0.943, 0.982 and 0.818, Matthews correlation coefficients of 0.944, 0.982 and 0.838, and F1 scores of 0.942, 0.982 and 0.840 for the rice, M. musculus and cross-species genome datasets, respectively.
Conclusion:
These outcomes show that this model outperforms iIM-CNN and csDMA, which represent the latest research on DNA 6mA, in the prediction of DNA 6mA modification.
Affiliation(s)
- Haoyu Zhang
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610051, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610051, China
| | - Ying Ju
- School of Informatics, Xiamen University, Xiamen 361005, China
| | - Chenggang Song
- Beidahuang Industry Group General Hospital, Harbin 150001, China
| | - Dong Chen
- College of Electrical and Information Engineering, Quzhou University, Quzhou 324000, China
|
15
|
Wang Z, Tan J, Long Y, Liu Y, Lei W, Cai J, Yang Y, Liu Z. SaAlign: Multiple DNA/RNA Sequence Alignment and Phylogenetic Tree Construction Tool for Ultra-large Datasets and Ultra-long Sequences Based on Suffix Array. Comput Struct Biotechnol J 2022; 20:1487-1493. [PMID: 35422971 PMCID: PMC8976100 DOI: 10.1016/j.csbj.2022.03.018] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/05/2021] [Revised: 03/09/2022] [Accepted: 03/19/2022] [Indexed: 11/03/2022] Open
Abstract
Multiple DNA/RNA sequence alignment is an important fundamental tool in bioinformatics, especially for phylogenetic tree construction. With DNA-sequencing improvements, the amount of bioinformatics data is constantly increasing, and tools need to be updated continually. Mitochondrial genome analyses of multiple individuals and species require bioinformatics software; therefore, the performance of such software needs to be optimized. To improve the alignment of ultra-large datasets and ultra-long sequences, we optimized a dynamic programming algorithm using longest common substring methods. Ultra-large test DNA datasets, containing sequences of different lengths, some over 300 kb (kilobase), revealed that the Multiple DNA/RNA Sequence Alignment Tool Based on Suffix Tree (SaAlign) saved time and computational space. It outperformed existing tools, including MAFFT and HAlign-II. For mitochondrial genome datasets having limited numbers of sequences, MAFFT performed the required tasks, but it could not handle ultra-large mitochondrial genome datasets because of a core dump error. We implemented a multiple DNA/RNA sequence alignment tool based on the Center Star strategy and used a suffix array algorithm to optimize space and time efficiency. Nowadays, whole-genome research and NGS technology are becoming more popular, and it is necessary to save computational resources for laboratories. This software is of great significance in these respects, especially in the study of the whole-mitochondrial genome of plants.
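The link between a suffix array and the longest common substring can be illustrated with a minimal Python sketch. This is not the SaAlign implementation: the function name `longest_common_substring`, the `#` separator and the naive sorted-suffix construction are assumptions made for brevity, and a real tool would use a linear-time construction.

```python
def longest_common_substring(a, b):
    """Longest substring shared by a and b, found through a suffix array.

    Illustrative sketch only: the separator '#' is assumed to occur in
    neither sequence, and the suffix array is built by plain sorting.
    """
    sep = "#"
    text = a + sep + b
    # Naive suffix array: start positions sorted by the suffix they begin.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    best = ""
    for x, y in zip(sa, sa[1:]):
        # Only neighbouring suffixes that start in different sequences matter.
        if (x < len(a)) == (y < len(a)):
            continue
        # Length of the common prefix, never running across the separator.
        k = 0
        while (x + k < len(text) and y + k < len(text)
               and text[x + k] == text[y + k] and text[x + k] != sep):
            k += 1
        if k > len(best):
            best = text[x:x + k]
    return best


if __name__ == "__main__":
    print(longest_common_substring("ACGTTGCA", "TTGCACGT"))
```

Shared substrings found this way are the kind of segment that can serve as an anchor, leaving only the regions between anchors for dynamic programming.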
|
16
|
Chen J, Li F, Wang M, Li J, Marquez-Lago TT, Leier A, Revote J, Li S, Liu Q, Song J. BigFiRSt: A Software Program Using Big Data Technique for Mining Simple Sequence Repeats From Large-Scale Sequencing Data. Front Big Data 2022; 4:727216. [PMID: 35118375 PMCID: PMC8805145 DOI: 10.3389/fdata.2021.727216] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2021] [Accepted: 12/13/2021] [Indexed: 11/22/2022] Open
Abstract
Background Simple Sequence Repeats (SSRs) are short tandem repeats of nucleotide sequences. It has been shown that SSRs are associated with human diseases and are of medical relevance. Accordingly, a variety of computational methods have been proposed to mine SSRs from genomes. Conventional methods rely on a high-quality complete genome to identify SSRs. However, the sequenced genome often misses several highly repetitive regions. Moreover, many non-model species have no complete genome. With the recent advances of next-generation sequencing (NGS) techniques, large-scale sequence reads for any species can be rapidly generated. In this context, a number of methods have been proposed to identify thousands of SSR loci within large amounts of reads for non-model species. As the most commonly used NGS platforms (e.g., the Illumina platform) generally provide short paired-end reads, merging overlapping paired-end reads has become a common step prior to the identification of SSR loci. This poses a big data analysis challenge for traditional stand-alone tools that merge short read pairs and identify SSRs from large-scale data. Results In this study, we present a new Hadoop-based software program, termed BigFiRSt, to address this problem using cutting-edge big data technology. BigFiRSt consists of two major modules, BigFLASH and BigPERF, implemented based on two state-of-the-art stand-alone tools, FLASH and PERF, respectively. BigFLASH and BigPERF address the problems of merging short read pairs and mining SSRs, respectively, in a big data manner. Comprehensive benchmarking experiments show that BigFiRSt can dramatically reduce the execution times of fast read pairs merging and SSRs mining from very large-scale DNA sequence data. Conclusions The excellent performance of BigFiRSt is mainly due to the Big Data Hadoop technology, which merges read pairs and mines SSRs in parallel and distributed computing on clusters. We anticipate BigFiRSt will be a valuable tool in the coming biological Big Data era.
Affiliation(s)
- Jinxiang Chen
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
| | - Fuyi Li
- Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia
- Monash Centre for Data Science, Monash University, Melbourne, VIC, Australia
- Department of Microbiology and Immunity, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, VIC, Australia
| | - Miao Wang
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
| | - Junlong Li
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
| | - Tatiana T. Marquez-Lago
- Department of Genetics, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
- Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
| | - André Leier
- Department of Genetics, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
- Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
| | - Jerico Revote
- Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia
| | - Shuqin Li
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
| | - Quanzhong Liu
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
| | - Jiangning Song
- Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia
- Monash Centre for Data Science, Monash University, Melbourne, VIC, Australia
- *Correspondence: Jiangning Song
|
17
|
Wang Y, Wu J, Yan J, Guo M, Xu L, Hou L, Zou Q. Comparative genome analysis of plant ascomycete fungal pathogens with different lifestyles reveals distinctive virulence strategies. BMC Genomics 2022; 23:34. [PMID: 34996360 PMCID: PMC8740420 DOI: 10.1186/s12864-021-08165-1] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 09/14/2021] [Accepted: 11/10/2021] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND Pathogens have evolved diverse lifestyles and adopted pivotal new roles in both natural ecosystems and human environments. However, the molecular mechanisms underlying their adaptation to new lifestyles are obscure. Comparative genomics was adopted to determine distinct strategies of plant ascomycete fungal pathogens with different lifestyles and to elucidate their distinctive virulence strategies. RESULTS We found that plant ascomycete biotrophs exhibited lower gene gain and loss events and loss of CAZyme-encoding genes involved in plant cell wall degradation and biosynthesis gene clusters for the production of secondary metabolites in the genome. Comparison with the candidate effectome detected distinctive variations between plant biotrophic pathogens and other groups (including human, necrotrophic and hemibiotrophic pathogens). The results revealed the biotroph-specific and lifestyle-conserved candidate effector families. These data have been configured in web-based genome browser applications for public display ( http://lab.malab.cn/soft/PFPG ). This resource allows researchers to profile the genome, proteome, secretome and effectome of plant fungal pathogens. CONCLUSIONS Our findings demonstrated different genome evolution strategies of plant fungal pathogens with different lifestyles and explored their lifestyle-conserved and specific candidate effectors. It will provide a new basis for discovering the novel effectors and their pathogenic mechanisms.
Affiliation(s)
- Yansu Wang
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, 518000, Shenzhen, P. R. China
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, 610054, Chengdu, P. R. China
| | - Jie Wu
- Key Laboratory of Biotechnology for Medicinal Plants of Jiangsu Province, School of Life Sciences, Jiangsu Normal University, 221116, Xuzhou, P. R. China
| | - Jiacheng Yan
- Key Laboratory of Biotechnology for Medicinal Plants of Jiangsu Province, School of Life Sciences, Jiangsu Normal University, 221116, Xuzhou, P. R. China
| | - Ming Guo
- Department of Agronomy and Horticulture, University of Nebraska, Lincoln, USA
| | - Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, 518000, Shenzhen, P. R. China
| | - Liping Hou
- Beidahuang Industry Group General Hospital, Harbin, China.
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, 610054, Chengdu, P. R. China.
- State Key Laboratory of Tree Genetics and Breeding, Northeast Forestry University, Harbin, China.
|
18
|
Abstract
Background:
Therapeutic peptide prediction is critical for drug development and therapy. Researchers have been studying this essential task, developing several computational methods to identify different therapeutic peptide types.
Objective:
Most predictors are specific to certain peptide types. Developing methods that predict multiple peptide types remains a challenging problem. Moreover, it is still challenging to combine different features for therapeutic peptide prediction.
Method:
In this paper, we propose a new ensemble method, TP-MV, for general therapeutic peptide recognition. TP-MV is developed using the stacking framework in conjunction with KNN, SVM, ET, RF, and XGB. TP-MV then constructs a multi-view learning model as the meta-classifier to extract the discriminative features for different peptides.
Results:
In the experiment, the proposed method outperforms the other existing methods on the benchmark datasets, indicating that the proposed method has the ability to predict multiple therapeutic peptides simultaneously.
Conclusion:
TP-MV is a useful tool for predicting therapeutic peptides.
Affiliation(s)
- Ke Yan
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
| | - Hongwu Lv
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
| | - Yichen Guo
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
| | - Jie Wen
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China
| | - Bin Liu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
|
19
|
Liu H, Zou Q, Xu Y. A novel fast multiple nucleotide sequence alignment method based on FM-index. Brief Bioinform 2021; 23:6458932. [PMID: 34893794 DOI: 10.1093/bib/bbab519] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 07/20/2021] [Revised: 10/19/2021] [Accepted: 11/14/2021] [Indexed: 11/13/2022] Open
Abstract
Multiple sequence alignment (MSA) is fundamental to many biological applications. However, most classical MSA algorithms have difficulty handling large-scale multiple sequences, especially long sequences. Therefore, some recent aligners adopt an efficient divide-and-conquer strategy to divide long sequences into several short sub-sequences. Selecting the common segments (i.e. anchors) for division of sequences is critical, as it directly affects the accuracy and time cost. We therefore propose a novel algorithm, FMAlign, to improve the performance of multiple nucleotide sequence alignment. We use the FM-index to extract long common segments at a low cost rather than using a space-consuming hash table. Moreover, after finding the longer optimal common segments, the sequences are divided by the longer common segments. FMAlign has been tested on virus and bacteria genome and human mitochondrial genome datasets, and compared with existing MSA methods such as MAFFT, HAlign and FAME. The experiments show that our method outperforms the existing methods in terms of running time, and has high accuracy on long sequence sets. All the results demonstrate that our method is applicable to large-scale nucleotide sequences in terms of sequence length and sequence number. The source code and related data are available at https://github.com/iliuh/FMAlign.
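How an FM-index locates a segment without a hash table can be shown with a small Python sketch of the textbook backward search. It is not taken from the FMAlign source: the names `build_fm_index` and `find_all`, the `$` sentinel and the naive suffix sorting are assumptions chosen to keep the example short.

```python
def build_fm_index(text):
    """Build a toy FM-index: suffix array, C table and Occ table.

    Illustrative sketch only: '$' is assumed to be absent from the input
    and to sort before every sequence character.
    """
    text += "$"
    # Naive suffix array; adequate for a demonstration, not for genomes.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # c_table[c] = number of characters in text strictly smaller than c.
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += text.count(ch)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return sa, c_table, occ


def find_all(pattern, sa, c_table, occ):
    """Return the sorted start positions of pattern via backward search."""
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


if __name__ == "__main__":
    index = build_fm_index("ACGTACGTTACG")
    print(find_all("ACG", *index))
```

Each pattern character costs a constant number of table lookups, so the search time depends on the pattern length rather than on the length of the indexed sequence.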
Affiliation(s)
- Huan Liu
- School of Computer Science, University of Science and Technology of China and Key Laboratory on High Performance Computing of Anhui, China
| | - Quan Zou
- Institute of basic and Frontier Sciences, University of Electronic Science and Technology of China and Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Chengdu, Sichuan, China
| | - Yun Xu
- School of Computer Science, University of Science and Technology of China and Key Laboratory on High Performance Computing of Anhui, China
|
20
|
Yan K, Wen J, Xu Y, Liu B. Protein Fold Recognition Based on Auto-Weighted Multi-View Graph Embedding Learning Model. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:2682-2691. [PMID: 32356759 DOI: 10.1109/tcbb.2020.2991268] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 06/11/2023]
Abstract
Protein fold recognition is critical for studies of protein structure prediction and drug design. Several methods have been proposed to obtain discriminative features from the protein sequences for fold recognition. However, ensemble methods that combine the various features to improve predictive performance remain a challenging problem. In this study, we proposed two novel algorithms: AWMG and EMfold. AWMG used a novel predictor based on the multi-view learning framework for fold recognition. Each view was treated as the intermediate representation of the corresponding data source of proteins, including the evolutionary information and the retrieval information. AWMG calculated the auto-weight for each view and constructed the latent subspace which contains the common information shared by different views. The marginalized constraint was employed to enlarge the margins between different folds, improving the predictive performance of AWMG. Furthermore, we proposed a novel ensemble method called EMfold, which combines two complementary methods, AWMG and DeepSS. The latter method was a template-based algorithm using the SPARKS-X and DeepFR programs. EMfold integrated the advantages of template-based assignment and machine learning classifier. Experimental results on the two widely used datasets (LE and YK) showed that the proposed methods outperformed some state-of-the-art methods, indicating that AWMG and EMfold are useful tools for protein fold recognition.
|
21
|
Yan K, Wen J, Liu JX, Xu Y, Liu B. Protein Fold Recognition by Combining Support Vector Machines and Pairwise Sequence Similarity Scores. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:2008-2016. [PMID: 31940548 DOI: 10.1109/tcbb.2020.2966450] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Indexed: 06/10/2023]
Abstract
Protein fold recognition is one of the most essential steps for protein structure prediction, aiming to classify proteins into known protein folds. There are two main computational approaches: one is the template-based method based on the alignment scores between query-template protein pairs and the other is the machine learning method based on the feature representation and classifier. These two approaches have their own advantages and disadvantages. Can we combine these methods to establish more accurate predictors for protein fold recognition? In this study, we made an initial attempt and proposed two novel algorithms: TSVM-fold and ESVM-fold. TSVM-fold was based on the Support Vector Machines (SVMs), which utilizes a set of pairwise sequence similarity scores generated by three complementary template-based methods, including HHblits, SPARKS-X, and DeepFR. These scores measured the global relationships between query sequences and templates. The comprehensive features of the attributes of the sequences were fed into the SVMs for the prediction. Then the TSVM-fold was further combined with the HHblits algorithm so as to improve its generalization ability. The combined method is called ESVM-fold. Experimental results in two rigorous benchmark datasets (LE and YK datasets) showed that the proposed methods outperform some state-of-the-art methods, indicating that the TSVM-fold and ESVM-fold are efficient predictors for protein fold recognition.
|
22
|
Lu D, Zhang Y, Zhang L, Wang H, Weng W, Li L, Cai H. Methods of privacy-preserving genomic sequencing data alignments. Brief Bioinform 2021; 22:6279828. [PMID: 34021302 DOI: 10.1093/bib/bbab151] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/05/2020] [Revised: 03/10/2021] [Accepted: 03/30/2021] [Indexed: 11/14/2022] Open
Abstract
Genomic data alignment, a fundamental operation in sequencing, can be utilized to map reads into a reference sequence, query on a genomic database and perform genetic tests. However, with the reduction of sequencing cost and the accumulation of genome data, privacy-preserving genomic sequencing data alignment is becoming unprecedentedly important. In this paper, we present a comprehensive review of secure genomic data comparison schemes. We discuss the privacy threats, including adversaries and privacy attacks. The attacks can be categorized into inference, membership, identity tracing and completion attacks and have been applied to obtaining the genomic privacy information. We classify the state-of-the-art genomic privacy-preserving alignment methods into three different scenarios: large-scale reads mapping, encrypted genomic datasets querying and genetic testing to ease privacy threats. A comprehensive analysis of these approaches has been carried out to evaluate the computation and communication complexity as well as the privacy requirements. The survey provides the researchers with the current trends and the insights on the significance and challenges of privacy issues in genomic data alignment.
Affiliation(s)
- Dandan Lu
- School of Computer Science and Engineering, South China University of Technology, Guangzhou, 510006, China
| | - Yue Zhang
- School of Computer Science, Guangdong Polytechnic Normal University, Guangzhou, 510006, China
| | - Ling Zhang
- Department of Radiology, Sun Yat-sen University Cancer Center; State Key Laboratory of Oncology in South China; Collaborative Innovation Center for Cancer Medicine, 651 Dongfeng East Road, Guangzhou, P. R. China, 510060
| | - Haiyan Wang
- School of Computer Science and Engineering, South China University of Technology, Guangzhou, 510006, China
| | - Wanlin Weng
- School of Computer Science and Engineering, South China University of Technology, Guangzhou, 510006, China
| | - Li Li
- School of Computer Science and Engineering, South China University of Technology, Guangzhou, 510006, China
| | - Hongmin Cai
- School of Computer Science and Engineering, South China University of Technology, Guangzhou, 510006, China
|
23
|
Zou Y, Zhu Y, Li Y, Wu FX, Wang J. Parallel computing for genome sequence processing. Brief Bioinform 2021; 22:6210355. [PMID: 33822883 DOI: 10.1093/bib/bbab070] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/04/2020] [Revised: 01/26/2021] [Accepted: 02/10/2021] [Indexed: 01/08/2023] Open
Abstract
The rapid increase of genome data brought by gene sequencing technologies poses a massive challenge to data processing. To solve the problems caused by enormous data and complex computing requirements, researchers have proposed many methods and tools which can be divided into three types: big data storage, efficient algorithm design and parallel computing. The purpose of this review is to investigate popular parallel programming technologies for genome sequence processing. Three common parallel computing models are introduced according to their hardware architectures; each is classified into two or three types and further analyzed in terms of its features. Then, parallel computing for genome sequence processing is discussed with four common applications: genome sequence alignment, single nucleotide polymorphism calling, genome sequence preprocessing, and pattern detection and searching. For each kind of application, its background is first introduced, and then a list of tools or algorithms is summarized in terms of principle, hardware platform and computing efficiency. The programming model of each hardware and application provides a reference for researchers to choose high-performance computing tools. Finally, we discuss the limitations and future trends of parallel computing technologies.
Affiliation(s)
- You Zou
- Hunan Provincial Key Lab of Bioinformatics, School of Computer Science and Engineering at Central South University, Changsha, China
| | - Yuejie Zhu
- Hunan Provincial Key Lab of Bioinformatics, School of Computer Science and Engineering at Central South University, Changsha, China
| | - Yaohang Li
- Computer Science at Old Dominion University, USA
| | - Fang-Xiang Wu
- College of Engineering and the Department of Computer Science at the University of Saskatchewan, Saskatoon, Canada
| | - Jianxin Wang
- School of Computer Science and Engineering at Central South University, Changsha, Hunan, China
|
24
|
Shi H, Shi H, Xu S. Efficient Multiple Sequences Alignment Algorithm Generation via Components Assembly Under PAR Framework. Front Genet 2021; 11:628175. [PMID: 33613626 PMCID: PMC7890700 DOI: 10.3389/fgene.2020.628175] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2020] [Accepted: 12/29/2020] [Indexed: 11/25/2022] Open
Abstract
As a key algorithm in bioinformatics, the sequence alignment algorithm is widely used in sequence similarity analysis and genome sequence database search. Existing research focuses mainly on specific steps of the algorithm or on specific problems, and lacks a high-level abstract domain algorithm framework. Multiple sequence alignment algorithms are more complex, redundant, and difficult to understand; it is not easy for users to select the appropriate algorithm, and some computing errors may occur. Based on our constructed pairwise sequence alignment algorithm component library and the convenient software platform PAR, a few expansion domain components are developed for the multiple sequence alignment application domain. With these, a specific multiple sequence alignment algorithm can be designed, and its corresponding program (i.e., a C++/Java/Python program) can be generated efficiently, which improves the development efficiency of complex algorithms as well as the accuracy of sequence alignment calculation. A star alignment algorithm is designed and generated to demonstrate the development process.
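The first step of a star alignment, choosing the center sequence from pairwise comparisons, can be sketched in Python as follows. This is a generic textbook illustration, not code generated by the PAR platform: the function names `edit_distance` and `choose_center` are hypothetical, and unit-cost edit distance stands in for whatever pairwise scoring a real component library would supply.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


def choose_center(seqs):
    """Index of the sequence with the smallest total distance to all others."""
    totals = [sum(edit_distance(s, t) for t in seqs) for s in seqs]
    return min(range(len(seqs)), key=totals.__getitem__)


if __name__ == "__main__":
    reads = ["ACGT", "AGGT", "ACGA", "TTTT"]
    print(choose_center(reads))
```

In a full star alignment the remaining sequences would then be aligned pairwise to the chosen center and the pairwise alignments merged; that merging step is omitted here.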
Affiliation(s)
- Haipeng Shi
- School of Information Management, Jiangxi University of Finance and Economics, Nanchang, China; School of Software, Jiangxi Normal University, Nanchang, China
| | - Haihe Shi
- School of Computer and Information Engineering, Jiangxi Normal University, Nanchang, China
| | - Shenghua Xu
- School of Information Management, Jiangxi University of Finance and Economics, Nanchang, China
|
25
|
Chen L, Li J, Chang M. Cancer Diagnosis and Disease Gene Identification via Statistical Machine Learning. Curr Bioinform 2021. [DOI: 10.2174/1574893615666200207094947] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Indexed: 11/22/2022]
Abstract
Diagnosing cancer and identifying the disease gene by using DNA microarray gene expression data are hot topics in current bioinformatics. This paper is devoted to the latest developments in cancer diagnosis and gene selection via statistical machine learning. A support vector machine is first introduced for binary cancer diagnosis. Then, the 1-norm support vector machine, doubly regularized support vector machine, adaptive huberized support vector machine and other extensions are presented to improve the performance of gene selection. Lasso, elastic net, partly adaptive elastic net, group lasso, sparse group lasso, adaptive sparse group lasso and other sparse regression methods are also introduced for performing simultaneous binary cancer classification and gene selection. In addition to introducing three strategies for reducing multiclass to binary, methods of directly considering all classes of data in a learning model (multi-class support vector, sparse multinomial regression, adaptive multinomial regression and so on) are presented for performing multiple cancer diagnosis. Limitations and promising directions are also discussed.
Affiliation(s)
- Liuyuan Chen
- Henan Engineering Laboratory for Big Data Statistical Analysis and Optimal Control, College of Mathematics and Information Science, Henan Normal University, Xinxiang, 453007, China
| | - Juntao Li
- Henan Engineering Laboratory for Big Data Statistical Analysis and Optimal Control, College of Mathematics and Information Science, Henan Normal University, Xinxiang, 453007, China
| | - Mingming Chang
- Henan Engineering Laboratory for Big Data Statistical Analysis and Optimal Control, College of Mathematics and Information Science, Henan Normal University, Xinxiang, 453007, China
|
26
|
Li F, Luo M, Zhou W, Li J, Jin X, Xu Z, Juan L, Zhang Z, Li Y, Liu R, Li Y, Xu C, Ma K, Cao H, Wang J, Wang P, Bu Z, Jiang Q. Single cell RNA and immune repertoire profiling of COVID-19 patients reveal novel neutralizing antibody. Protein Cell 2020; 12:751-755. [PMID: 33237441 PMCID: PMC7686823 DOI: 10.1007/s13238-020-00807-6] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Accepted: 10/29/2020] [Indexed: 11/27/2022] Open
Affiliation(s)
- Fang Li
  - State Key Laboratory of Veterinary Biotechnology, Harbin Veterinary Research Institute, Chinese Academy of Agricultural Sciences, Harbin, 150001, China
- Meng Luo
  - School of Life Science and Technology, Harbin Institute of Technology, Harbin, 150000, China
- Wenyang Zhou
  - School of Life Science and Technology, Harbin Institute of Technology, Harbin, 150000, China
- Jinliang Li
  - Harbin Sixth Hospital, Harbin, 150000, China
- Xiyun Jin
  - School of Life Science and Technology, Harbin Institute of Technology, Harbin, 150000, China
- Zhaochun Xu
  - School of Life Science and Technology, Harbin Institute of Technology, Harbin, 150000, China
- Liran Juan
  - School of Life Science and Technology, Harbin Institute of Technology, Harbin, 150000, China
- Zheng Zhang
  - Harbin Sixth Hospital, Harbin, 150000, China
- Yuou Li
  - Harbin Sixth Hospital, Harbin, 150000, China
- Renqiang Liu
  - State Key Laboratory of Veterinary Biotechnology, Harbin Veterinary Research Institute, Chinese Academy of Agricultural Sciences, Harbin, 150001, China
- Yiqun Li
  - School of Life Science and Technology, Harbin Institute of Technology, Harbin, 150000, China
- Chang Xu
  - School of Life Science and Technology, Harbin Institute of Technology, Harbin, 150000, China
- Kexin Ma
  - School of Life Science and Technology, Harbin Institute of Technology, Harbin, 150000, China
- Huimin Cao
  - School of Life Science and Technology, Harbin Institute of Technology, Harbin, 150000, China
- Pingping Wang
  - School of Life Science and Technology, Harbin Institute of Technology, Harbin, 150000, China
- Zhigao Bu
  - State Key Laboratory of Veterinary Biotechnology, Harbin Veterinary Research Institute, Chinese Academy of Agricultural Sciences, Harbin, 150001, China
- Qinghua Jiang
  - School of Life Science and Technology, Harbin Institute of Technology, Harbin, 150000, China
  - Key Laboratory of Biological Big Data (Harbin Institute of Technology), Ministry of Education, Harbin, 150001, China
27
|
|
28
Zhan Q, Fu Y, Jiang Q, Liu B, Peng J, Wang Y. SpliVert: A Protein Multiple Sequence Alignment Refinement Method Based on Splitting-Splicing Vertically. Protein Pept Lett 2020; 27:295-302. [PMID: 31385760 DOI: 10.2174/0929866526666190806143959] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 04/06/2019] [Revised: 04/26/2019] [Accepted: 06/14/2019] [Indexed: 11/22/2022]
Abstract
BACKGROUND Multiple Sequence Alignment (MSA) is a fundamental task in bioinformatics and is required for many biological analysis tasks. The more accurate the alignments are, the more credible the downstream analyses. Most protein MSA algorithms refine an alignment by dividing it horizontally into two groups and then realigning the two groups. However, this strategy does not consider that different regions of the sequences have different conservation; this property may lead to incorrect residue-residue or residue-gap pairs, which cannot be corrected by this strategy. OBJECTIVE In this article, our motivation is to develop a novel refinement method based on splitting-splicing vertically. METHODS Here, we present a novel refinement method based on splitting-splicing vertically, called SpliVert. For an alignment, we split it vertically into 3 parts, remove the gap characters in the middle, realign the middle part alone, and splice the realigned middle part with the other two initial pieces to obtain a refined alignment. In the realignment procedure of our method, the aligner focuses only on a certain part, ignoring the disturbance of the other parts, which could help fix the incorrect pairs. RESULTS We tested our refinement strategy for 2 leading MSA tools on 3 standard benchmarks, according to the commonly used average SP (and TC) score. The results show that given appropriate proportions to split the initial alignment, the average scores are increased comparably or slightly after using our method. We also compared the alignments refined by our method with alignments directly refined by the original alignment tools. The results suggest that using our SpliVert method to refine alignments can also outperform direct use of the original alignment tools. CONCLUSION The results reveal that splitting vertically and realigning part of the alignment is a good strategy for the refinement of protein multiple sequence alignments.
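The split, realign, and splice steps described in this abstract can be sketched roughly as follows. This is an illustrative reading of the abstract only, not the SpliVert implementation: the function names, the fractional split points, and the `pad_realign` stand-in (which merely left-justifies sequences instead of calling a real MSA tool) are all assumptions made for the example.

```python
from typing import Callable, List

Alignment = List[str]  # equal-length rows; '-' marks a gap


def pad_realign(seqs: List[str]) -> Alignment:
    """Hypothetical stand-in aligner: left-justify and pad with gaps.

    A real refinement run would call an MSA tool here; this placeholder
    only keeps the sketch runnable.
    """
    width = max((len(s) for s in seqs), default=0)
    return [s.ljust(width, "-") for s in seqs]


def split_splice_refine(
    aln: Alignment,
    left_frac: float,
    right_frac: float,
    realign: Callable[[List[str]], Alignment] = pad_realign,
) -> Alignment:
    """Split an alignment vertically into 3 parts, realign the middle, splice."""
    if not aln:
        return []
    n_cols = len(aln[0])
    if any(len(row) != n_cols for row in aln):
        raise ValueError("all alignment rows must have the same length")
    if not 0.0 <= left_frac <= right_frac <= 1.0:
        raise ValueError("need 0 <= left_frac <= right_frac <= 1")

    lo = int(n_cols * left_frac)   # first column of the middle part
    hi = int(n_cols * right_frac)  # one past the last middle column

    left = [row[:lo] for row in aln]
    right = [row[hi:] for row in aln]
    # Remove gap characters from the middle part before realigning it alone.
    middle = [row[lo:hi].replace("-", "") for row in aln]
    new_middle = realign(middle)

    # Splice the realigned middle back between the untouched outer parts.
    return [l + m + r for l, m, r in zip(left, new_middle, right)]


example = ["AC-GT-AA", "ACG-TTAA", "AC--T-AA"]
refined = split_splice_refine(example, 0.25, 0.75)
```

Because only the middle columns are realigned, the residues of every sequence are preserved and the outer parts are returned unchanged; which split proportions work well is something the abstract says depends on the alignment.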
Affiliation(s)
- Qing Zhan
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Yilei Fu
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Qinghua Jiang
  - School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
- Bo Liu
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Jiajie Peng
  - School of Computer Science, Northwestern Polytechnical University, Xi'an, China
- Yadong Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
29
Naznooshsadat E, Elham P, Ali SZ. FAME: fast and memory efficient multiple sequences alignment tool through compatible chain of roots. Bioinformatics 2020; 36:3662-3668. [PMID: 32170927 DOI: 10.1093/bioinformatics/btaa175] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 10/25/2019] [Revised: 02/10/2020] [Accepted: 03/12/2020] [Indexed: 11/13/2022]
Abstract
MOTIVATION Multiple sequence alignment (MSA) is an important and challenging problem in computational biology. Most of the existing methods can only produce short multiple alignments in an acceptable time. When researchers work with genome-size sequences in multiple alignments, the process requires huge processing space and time. Accordingly, a method that can align genome-size sequences rapidly and precisely has a great effect, especially on the analysis of very long alignments. Herein, we propose an efficient method, called FAME, which vertically divides sequences at the places where they have common areas; these are then arranged in consecutive order. These common areas are shifted and placed under each other, and the subsequences between them are aligned using any existing MSA tool. RESULTS The results demonstrate that the combination of FAME and the MSA methods, deploying minimizers, can be executed on a personal computer and can finely align long sequences with a much higher sum-of-pair (SP) score compared to the standalone MSA tools. As we select genomic datasets with longer length, the SP score of the combinatorial methods gradually improves. The calculated computational complexity of the methods supports the results in that combining FAME and the MSA tools leads to at least four times faster execution on the datasets. AVAILABILITY AND IMPLEMENTATION The source code and all datasets and run-parameters are freely accessible at http://github.com/naznoosh/msa. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Etminan Naznooshsadat
  - Department of Computer Engineering, Shiraz Branch, Islamic Azad University, Shiraz, Iran
- Parvinnia Elham
  - Department of Computer Engineering, Shiraz Branch, Islamic Azad University, Shiraz, Iran
- Sharifi-Zarchi Ali
  - Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
30
Libin PJK, Deforche K, Abecasis AB, Theys K. VIRULIGN: fast codon-correct alignment and annotation of viral genomes. Bioinformatics 2020; 35:1763-1765. [PMID: 30295730 PMCID: PMC6513156 DOI: 10.1093/bioinformatics/bty851] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Received: 05/03/2018] [Revised: 09/24/2018] [Accepted: 10/05/2018] [Indexed: 12/11/2022]
Abstract
Summary Virus sequence data are an essential resource for reconstructing spatiotemporal dynamics of viral spread as well as to inform treatment and prevention strategies. However, the potential benefit of these applications critically depends on accurate and correctly annotated alignments of genetically heterogeneous data. VIRULIGN was built for fast codon-correct alignments of large datasets, with standardized and formalized genome annotation and various alignment export formats. Availability and implementation VIRULIGN is freely available at https://github.com/rega-cev/virulign as an open source software project. Supplementary information Supplementary data is available at Bioinformatics online.
Affiliation(s)
- Pieter J K Libin
  - KU Leuven, Rega Institute for Medical, Laboratorium of Clinical and Evolutionary Virology, Leuven, Belgium
  - Artificial Intelligence Lab, Department of Computer Science, Vrije Universiteit Brussel, Brussels, Belgium
- Ana B Abecasis
  - Center for Global Health and Tropical Medicine, Institute for Hygiene and Tropical Medicine, Lisboa, Portugal
- Kristof Theys
  - KU Leuven, Rega Institute for Medical, Laboratorium of Clinical and Evolutionary Virology, Leuven, Belgium
31
Fang C, Jia Y, Hu L, Lu Y, Wang H. IMPContact: An Interhelical Residue Contact Prediction Method. BIOMED RESEARCH INTERNATIONAL 2020; 2020:4569037. [PMID: 32309431 PMCID: PMC7140131 DOI: 10.1155/2020/4569037] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2020] [Accepted: 03/09/2020] [Indexed: 11/17/2022]
Abstract
As an important category of proteins, alpha-helix transmembrane proteins (αTMPs) play an important role in various biological activities. Because the solved αTMP structures are inadequate, predicting the residue contacts among the transmembrane segments of an αTMP provides a basis for determining the protein fold, which can be used to further discover more protein functions. A few efforts have been devoted to predicting interhelical residue contacts using machine learning methods based on prior knowledge of transmembrane protein structure. However, improving the prediction accuracy remains a challenge, while deep learning provides an opportunity to utilize the structural knowledge from a different perspective. For this purpose, we proposed a novel αTMP residue-residue contact prediction method, IMPContact, in which a convolutional neural network (CNN) was applied to recognize interhelical contacts in a TMP using its specific structural features. Four sequence-based TMP-specific features were selected to describe a pair of residues, namely, evolutionary covariation, predicted topology structure, residue relative position, and evolutionary conservation. An up-to-date dataset was used to train and test IMPContact; our method achieved better performance compared to peer methods. In the case studies, interhelical residue contacts (IHRCs) in the regular transmembrane helices were better predicted than those in the irregular ones.
Affiliation(s)
- Chao Fang
  - School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
- Yajie Jia
  - School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
  - Institute of Computational Biology, Northeast Normal University, Changchun 130117, China
- Lihong Hu
  - School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
- Yinghua Lu
  - School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
  - Department of Computer Science, College of Humanities & Sciences of Northeast Normal University, Changchun 130117, China
- Han Wang
  - School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
  - Institute of Computational Biology, Northeast Normal University, Changchun 130117, China
  - Department of Computer Science, College of Humanities & Sciences of Northeast Normal University, Changchun 130117, China
32
Shi H, Zhang X. Component-Based Design and Assembly of Heuristic Multiple Sequence Alignment Algorithms. Front Genet 2020; 11:105. [PMID: 32174970 PMCID: PMC7056898 DOI: 10.3389/fgene.2020.00105] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 11/13/2019] [Accepted: 01/29/2020] [Indexed: 11/24/2022]
Abstract
In recent years, there has been an explosive increase in the amount of bioinformatics data produced, but data are not information. The purpose of bioinformatics research is to obtain information with biological significance from large amounts of data. Multiple sequence alignment is widely used in sequence homology detection, protein secondary and tertiary structure prediction, phylogenetic tree analysis, and other fields. Existing research mainly focuses on the specific steps of the algorithm or on specific problems, and there is a lack of high-level abstract domain algorithm frameworks. As a result, multiple sequence alignment algorithms are complex, redundant, and difficult to understand, and it is not easy for users to select the appropriate algorithm, which may lead to computing errors. Here, through in-depth study and analysis of the heuristic multiple sequence alignment algorithm (HMSAA) domain, a domain-feature model and an interactive model of HMSAA components have been established according to the generative programming method. With the support of the PAR (partition and recur) platform, the HMSAA algorithm component library is formalized and a specific alignment algorithm is assembled, thus improving the reliability of algorithm assembly. This work provides a valuable theoretical reference for the applications of other biological sequence analysis algorithms.
Affiliation(s)
- Haihe Shi
  - School of Computer and Information Engineering, Jiangxi Normal University, Nanchang, China
- Xuchu Zhang
  - School of Computer and Information Engineering, Jiangxi Normal University, Nanchang, China
33
Mrozek D, Kwiendacz J, Malysiak-Mrozek B. Protein Construction-Based Data Partitioning Scheme for Alignment of Protein Macromolecular Structures Through Distributed Querying in Federated Databases. IEEE Trans Nanobioscience 2020; 19:102-116. [DOI: 10.1109/tnb.2019.2930494] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/08/2022]
34
Al-Eitan L, Saadeh H, Alnaamneh A, Darabseh S, Al-Sarhan N, Alzihlif M, Hakooz N, Ivanova E, Kelsey G, Dajani R. The genetic landscape of Arab Population, Chechens and Circassians subpopulations from Jordan through HV1 and HV2 regions of mtDNA. Gene 2019; 729:144314. [PMID: 31884104 DOI: 10.1016/j.gene.2019.144314] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 09/21/2019] [Revised: 12/18/2019] [Accepted: 12/20/2019] [Indexed: 11/29/2022]
Abstract
Mitochondrial DNA (mtDNA) is widely used in several fields including medical genetics, forensic science, genetic genealogy, and evolutionary anthropology. In this study, mtDNA haplotype diversity was determined for 293 unrelated subjects from the Jordanian population (Circassians, Chechens, and the original inhabitants of Jordan). A total of 102 haplotypes were identified and analyzed among the populations to describe the maternal lineage landscape. Our results revealed significant differences in the distribution of mtDNA haplotype frequencies among the three populations. We also constructed mitochondrial haplotype classification trees for the three populations to determine the phylogenetic relationship of mtDNA haplotype variants, and we observed clear differences in the distribution of maternal genetic ancestries, especially between the Arab and the minority ethnic populations. To our knowledge, this study is the first to characterize mitochondrial haplotypes and haplotype distributions in a population-based sample from the Jordanian population. It provides a powerful reference for future studies investigating the contribution of mtDNA variation to human health and disease and studying population history and evolution by comparing the mtDNA haplotypes to other populations.
Affiliation(s)
- Laith Al-Eitan
  - Department of Applied Biological Sciences, Jordan University of Science and Technology, Irbid 22110, Jordan
  - Department of Biotechnology and Genetic Engineering, Jordan University of Science and Technology, Irbid 22110, Jordan
- Heba Saadeh
  - Computer Science Department, The University of Jordan, Amman 11942, Jordan
- Adan Alnaamneh
  - Department of Applied Biological Sciences, Jordan University of Science and Technology, Irbid 22110, Jordan
- Salma Darabseh
  - Department of Applied Biological Sciences, Jordan University of Science and Technology, Irbid 22110, Jordan
- Na'meh Al-Sarhan
  - Department of Applied Biological Sciences, Jordan University of Science and Technology, Irbid 22110, Jordan
- Malek Alzihlif
  - Department of Pharmacology, Faculty of Medicine, The University of Jordan, Amman 11942, Jordan
- Nancy Hakooz
  - Department of Biopharmaceutics and Clinical Pharmacy, School of Pharmacy, The University of Jordan, Amman 11942, Jordan
- Elena Ivanova
  - Epigenetics Programme, The Babraham Institute, Cambridge, CB22 3AT, UK
- Gavin Kelsey
  - Epigenetics Programme, The Babraham Institute, Cambridge, CB22 3AT, UK
  - The Centre for Trophoblast Research, University of Cambridge, CB2 3EG, UK
- Rana Dajani
  - Department of Biology and Biotechnology, The Hashemite University, Zarqa 13133, Jordan
  - Radcliffe Institute for Advanced Studies, Harvard University, Cambridge, MA 02138, USA
  - Jepson School of Leadership, Richmond University, 221 Richmond Way, Richmond, VA 23173, USA
35
Deng L, Zhong G, Liu C, Luo J, Liu H. MADOKA: an ultra-fast approach for large-scale protein structure similarity searching. BMC Bioinformatics 2019; 20:662. [PMID: 31870277 PMCID: PMC6929402 DOI: 10.1186/s12859-019-3235-1] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 11/11/2019] [Accepted: 11/14/2019] [Indexed: 01/22/2023]
Abstract
Background Protein comparative analysis and similarity searches play essential roles in structural bioinformatics. A couple of algorithms for protein structure alignments have been developed in recent years. However, given the rapid growth of protein structure data, improving overall comparison performance and running efficiency with massive sequences is still challenging. Results Here, we propose MADOKA, an ultra-fast approach for massive structural neighbor searching using a novel two-phase algorithm. Initially, we apply a fast alignment between pairwise structures. Then, we employ a score to select pairs with more similarity to carry out a more accurate fragment-based residue-level alignment. MADOKA performs about 6–100 times faster than existing methods, including TM-align and SAL, in massive alignments. Moreover, the quality of structural alignment of MADOKA is better than the existing algorithms in terms of TM-score and number of aligned residues. We also develop a web server to search structural neighbors in the PDB database (about 360,000 protein chains in total), as well as additional features such as 3D structure alignment visualization. The MADOKA web server is freely available at: http://madoka.denglab.org/ Conclusions MADOKA is an efficient approach to search for protein structure similarity. In addition, we provide a parallel implementation of MADOKA which exploits the power of multi-core CPUs.
Affiliation(s)
- Lei Deng
  - School of Computer Science and Engineering, Central South University, Changsha, 410075, China
- Guolun Zhong
  - School of Computer Science and Engineering, Central South University, Changsha, 410075, China
- Chenzhe Liu
  - School of Computer Science and Engineering, Central South University, Changsha, 410075, China
- Judong Luo
  - Department of Radiation Oncology, the Affiliated Changzhou No.2 People's Hospital of Nanjing Medical University, Changzhou, China
- Hui Liu
  - Lab of Information Management, Changzhou University, Changzhou, 213164, China
36
Liu B, Zhu Y, Yan K. Fold-LTR-TCP: protein fold recognition based on triadic closure principle. Brief Bioinform 2019; 21:2185-2193. [DOI: 10.1093/bib/bbz139] [Citation(s) in RCA: 50] [Impact Index Per Article: 10.0] [Received: 09/03/2019] [Revised: 10/01/2019] [Accepted: 10/09/2019] [Indexed: 11/13/2022]
Abstract
As an important task in protein structure and function studies, protein fold recognition has attracted more and more attention. The existing computational predictors in this field treat this task as a multi-classification problem, ignoring the relationship among proteins in the dataset. However, previous studies showed that their relationship is critical for protein homology analysis. In this study, the protein fold recognition is treated as an information retrieval task. The Learning to Rank model (LTR) was employed to retrieve the query protein against the template proteins to find the template proteins in the same fold with the query protein in a supervised manner. The triadic closure principle (TCP) was performed on the ranking list generated by the LTR to improve its accuracy by considering the relationship among the query protein and the template proteins in the ranking list. Finally, a predictor called Fold-LTR-TCP was proposed. The rigorous test on the LE benchmark dataset showed that the Fold-LTR-TCP predictor achieved an accuracy of 73.2%, outperforming all the other competing methods.
Affiliation(s)
- Bin Liu
  - School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
  - Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China
- Yulin Zhu
  - School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
- Ke Yan
  - School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
37
Shi L, Wang Z. Computational Strategies for Scalable Genomics Analysis. Genes (Basel) 2019; 10:E1017. [PMID: 31817630 PMCID: PMC6947637 DOI: 10.3390/genes10121017] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 10/10/2019] [Revised: 12/01/2019] [Accepted: 12/03/2019] [Indexed: 12/14/2022]
Abstract
The revolution in next-generation DNA sequencing technologies is leading to explosive data growth in genomics, posing a significant challenge to the computing infrastructure and software algorithms for genomics analysis. Various big data technologies have been explored to scale up/out current bioinformatics solutions to mine the big genomics data. In this review, we survey some of these exciting developments in the applications of parallel distributed computing and special hardware to genomics. We comment on the pros and cons of each strategy in the context of ease of development, robustness, scalability, and efficiency. Although this review is written for an audience from the genomics and bioinformatics fields, it may also be informative for the audience of computer science with interests in genomics applications.
Affiliation(s)
- Lizhen Shi
  - Department of Computer Science, Florida State University, Tallahassee, FL 32304, USA
- Zhong Wang
  - US Department of Energy, Joint Genome Institute, Walnut Creek, CA 94598, USA
  - Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
  - School of Natural Sciences, University of California at Merced, Merced, CA 95343, USA
38
Li CC, Liu B. MotifCNN-fold: protein fold recognition based on fold-specific features extracted by motif-based convolutional neural networks. Brief Bioinform 2019; 21:2133-2141. [PMID: 31774907 DOI: 10.1093/bib/bbz133] [Citation(s) in RCA: 51] [Impact Index Per Article: 10.2] [Received: 07/19/2019] [Revised: 09/16/2019] [Accepted: 09/17/2019] [Indexed: 12/31/2022]
Abstract
Protein fold recognition is one of the most critical tasks to explore the structures and functions of the proteins based on their primary sequence information. The existing protein fold recognition approaches rely on features reflecting the characteristics of protein folds. However, the feature extraction methods are still the bottleneck of the performance improvement of these methods. In this paper, we proposed two new feature extraction methods called MotifCNN and MotifDCNN to extract more discriminative fold-specific features based on structural motif kernels to construct the motif-based convolutional neural networks (CNNs). The pairwise sequence similarity scores calculated based on fold-specific features are then fed into support vector machines to construct the predictor for fold recognition, and a predictor called MotifCNN-fold has been proposed. Experimental results on the benchmark dataset showed that MotifCNN-fold obviously outperformed all the other competing methods. In particular, the fold-specific features extracted by MotifCNN and MotifDCNN are more discriminative than the fold-specific features extracted by other deep learning techniques, indicating that incorporating the structural motifs into the CNN is able to capture the characteristics of protein folds.
Affiliation(s)
- Chen-Chen Li
  - School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
  - School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
- Bin Liu
  - School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
  - Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China
39
Zhang W, Lin W, Zhang D, Wang S, Shi J, Niu Y. Recent Advances in the Machine Learning-Based Drug-Target Interaction Prediction. Curr Drug Metab 2019; 20:194-202. [PMID: 30129407 DOI: 10.2174/1389200219666180821094047] [Citation(s) in RCA: 32] [Impact Index Per Article: 6.4] [Received: 10/16/2017] [Revised: 01/18/2018] [Accepted: 03/19/2018] [Indexed: 12/28/2022]
Abstract
BACKGROUND The identification of drug-target interactions is a crucial issue in drug discovery. In recent years, researchers have made great efforts in drug-target interaction prediction and have developed databases, software and computational methods. RESULTS In this paper, we review the recent advances in machine learning-based drug-target interaction prediction. First, we briefly introduce the datasets and data, and summarize features for drugs and targets which can be extracted from different data. Since drug-drug similarity and target-target similarity are important for many machine learning prediction models, we introduce how to calculate similarities based on data or features. Different machine learning-based drug-target interaction prediction methods can be proposed by using different features or information. Thus, we summarize, analyze and compare different machine learning-based prediction methods. CONCLUSION This study provides a guide to the development of computational methods for drug-target interaction prediction.
Affiliation(s)
- Wen Zhang
  - School of Computer Science, Wuhan University, Wuhan 430072, China
- Weiran Lin
  - School of Computer Science, Wuhan University, Wuhan 430072, China
- Ding Zhang
  - School of Computer Science, Wuhan University, Wuhan 430072, China
- Siman Wang
  - School of Computer Science, Wuhan University, Wuhan 430072, China
- Jingwen Shi
  - School of Mathematics and Statistics, Wuhan University, Wuhan 430072, China
- Yanqing Niu
  - School of Mathematics and Statistics, South-Central University for Nationalities, Wuhan 430074, China
40
Bai N, Tang S, Yu C, Fu H, Wang C, Chen X. GMSA: A Data Sharing System for Multiple Sequence Alignment Across Multiple Users. Curr Bioinform 2019. [DOI: 10.2174/1574893614666190111160101] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
Abstract
Background:
In recent years, the rapid growth of biological datasets in Bioinformatics
has made the computation of Multiple Sequence Alignment (MSA) become extremely slow. Using
the GPU to accelerate MSA has shown to be an effective approach. Moreover, there is a trend that
many bioinformatic researchers or institutes setup a shared server for remote users to submit MSA
jobs via provided web-pages or tools.
Objective:
Given the fact that different MSA jobs submitted by users often process similar datasets,
there can be an opportunity for users to share their computation results between each other,
which can avoid the redundant computation and thereby reduce the overall computing time. Furthermore,
in the heterogeneous CPU/GPU platform, many existing applications assign their computation
on GPU devices only, which leads to a waste of the CPU resources. Co-run computation
can increase the utilization of computing resources on both CPUs and GPUs by dispatching workloads
onto them simultaneously.
Methods:
In this paper, we propose an efficient MSA system called GMSA for multi-users on
shared heterogeneous CPU/GPU platforms. To accelerate the computation of jobs from multiple
users, data sharing is considered in GMSA due to the fact that different MSA jobs often have a
percentage of the same data and tasks. Additionally, we also propose a scheduling strategy based
on the similarity in datasets or tasks between MSA jobs. Furthermore, co-run computation model
is adopted to take full use of both CPUs and GPUs.
Results:
We use four protein datasets which were redesigned according to different similarity. We
compare GMSA with ClustalW and CUDA-ClustalW in multiple users scenarios. Experiments results
showed that GMSA can achieve a speedup of up to 32X.
Conclusion:
GMSA is a system designed for accelerating the computation of MSA jobs with
shared input datasets on heterogeneous CPU/GPU platforms. In this system, a strategy was proposed
and implemented to find the common datasets among jobs submitted by multiple users, and
a scheduling algorithm based on it was presented. To utilize the overall resources of both CPU and
GPU, GMSA employs the co-run computation model. Results showed that it can speed up the total
computation of jobs efficiently.
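The result-sharing idea in this abstract can be illustrated with a small sketch. It is not GMSA's implementation: the `SharedPairCache` class, the `run_job` helper, and the placeholder `pair_distance` score are all hypothetical names introduced here, and the sketch assumes that the shared unit of work is a pairwise sequence comparison. It only shows how a second job that overlaps with a first one can reuse cached results instead of recomputing them.

```python
from itertools import combinations


def pair_distance(a: str, b: str) -> int:
    """Placeholder pairwise score (an assumption, not a real MSA distance):
    mismatches over the shared prefix plus the length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


class SharedPairCache:
    """Hypothetical cache shared by all jobs on the server."""

    def __init__(self):
        self._results = {}
        self.computed = 0  # pairs that had to be computed
        self.reused = 0    # pairs served from an earlier job

    def distance(self, a: str, b: str) -> int:
        # Order the pair so (a, b) and (b, a) map to the same cache entry.
        key = (a, b) if a <= b else (b, a)
        if key in self._results:
            self.reused += 1
        else:
            self._results[key] = pair_distance(*key)
            self.computed += 1
        return self._results[key]


def run_job(sequences, cache):
    """Compute all pairwise scores for one user's job through the shared cache."""
    return {(a, b): cache.distance(a, b) for a, b in combinations(sequences, 2)}


cache = SharedPairCache()
job_a = ["MKTAYIA", "MKTAYLA", "MKSAYIA"]
job_b = ["MKTAYIA", "MKTAYLA", "MRTAYIA"]  # shares two sequences with job_a
res_a = run_job(job_a, cache)
res_b = run_job(job_b, cache)
print(cache.computed, cache.reused)
```

In this toy run the second job reuses the one pair it has in common with the first; the larger the overlap between jobs, the more work is skipped.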
Affiliation(s)
- Na Bai
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, P.O. Box 300350, China
| | - Shanjiang Tang
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, P.O. Box 300350, China
| | - Ce Yu
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, P.O. Box 300350, China
| | - Hao Fu
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, P.O. Box 300350, China
| | - Chen Wang
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana-Champaign, Illinois, United States
| | - Xi Chen
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, P.O. Box 300350, China
|
41
|
Liu B, Li S. ProtDet-CCH: Protein Remote Homology Detection by Combining Long Short-Term Memory and Ranking Methods. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:1203-1210. [PMID: 29993950 DOI: 10.1109/tcbb.2018.2789880] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Indexed: 06/08/2023]
Abstract
As one of the most challenging tasks in sequence analysis, protein remote homology detection has been extensively studied. Methods based on discriminative models and ranking approaches have achieved state-of-the-art performance, and these two kinds of methods are complementary. In this study, three LSTM models were applied to construct predictors for protein remote homology detection: ULSTM, BLSTM, and CNN-BLSTM. They are able to automatically extract local and global sequence order information. Combined with PSSMs, CNN-BLSTM achieved the best performance among the three LSTM-based models; we named this method CNN-BLSTM-PSSM. Finally, a new method called ProtDet-CCH was proposed by combining CNN-BLSTM-PSSM with the ranking method HHblits. Tested on a widely used SCOP benchmark dataset, ProtDet-CCH achieved an ROC score of 0.998 and an ROC50 score of 0.982, significantly outperforming other existing state-of-the-art methods. Experimental results on two updated SCOPe independent datasets showed that ProtDet-CCH can achieve stable performance. Furthermore, our method can provide useful insights for studying the features and motifs of protein families and superfamilies. It is anticipated that ProtDet-CCH will become a very useful tool for protein remote homology detection.
|
42
|
Zhang J, Zhang Y, Ma Z. In silico Prediction of Human Secretory Proteins in Plasma Based on Discrete Firefly Optimization and Application to Cancer Biomarkers Identification. Front Genet 2019; 10:542. [PMID: 31244885 PMCID: PMC6563772 DOI: 10.3389/fgene.2019.00542] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 04/04/2019] [Accepted: 05/21/2019] [Indexed: 12/20/2022] Open
Abstract
Early control and prevention of cancer contribute to effective interventions and cancer therapies. Secretory proteins, one of the richest sources of biomarkers, have proved important as molecular signposts of the physiological state of a cell. In this work, we aim to propose a high-throughput proteomic technology platform to facilitate early cancer detection by means of biomarkers that are secreted into the bloodstream. We compile a new benchmark dataset of human secretory proteins in plasma. A series of sequence-derived features, which have been shown to be involved in the structure and function of secretory proteins, are collected to mathematically encode these proteins. Considering the influence of potentially irrelevant or redundant features, we introduce a discrete firefly optimization algorithm to perform feature selection. We evaluate and compare the proposed method, SCRIP (Secretory proteins in plasma), with state-of-the-art approaches on benchmark datasets and independent testing datasets. SCRIP achieves average AUC values of 0.876 and 0.844 in five-fold cross-validation and the independent test, respectively. In addition, we test SCRIP on proteins in four types of cancer tissues and successfully detect 66 to 77% of potential cancer biomarkers.
Affiliation(s)
- Jian Zhang
- School of Computer and Information Technology, Xinyang Normal University, Xinyang, China
- Henan Key Laboratory of Education Big Data Analysis and Application, Xinyang, China
| | - Yu Zhang
- Information Engineering College, Huanghuai University, Zhumadian, China
- Henan Key Laboratory of Smart Lighting, Zhumadian, China
| | - Zhiqiang Ma
- Department of Computer Science, College of Humanities & Sciences of Northeast Normal University, Changchun, China
|
43
|
Vineetha V, Biji CL, Nair AS. SPARK-MSNA: Efficient algorithm on Apache Spark for aligning multiple similar DNA/RNA sequences with supervised learning. Sci Rep 2019; 9:6631. [PMID: 31036850 PMCID: PMC6488671 DOI: 10.1038/s41598-019-42966-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 07/04/2018] [Accepted: 04/12/2019] [Indexed: 11/09/2022] Open
Abstract
Multiple sequence alignment (MSA) is an integral part of molecular biology. However, handling a massive number of large sequences is still a bottleneck for most state-of-the-art software tools. Knowledge-driven algorithms that utilize features of the input sequences, such as the high similarity of DNA sequences, can help to improve the efficiency of DNA MSA and assist in phylogenetic tree construction, comparative genomics, etc. This article showcases the benefit of utilizing similarity features while performing the alignment. The algorithm uses a suffix tree to identify common substrings and a modified Needleman-Wunsch algorithm for pairwise alignments. To improve the efficiency of pairwise alignments, a knowledge base is created and supervised learning with a nearest neighbor algorithm is used to guide the alignment. The algorithm provides linear complexity, O(m), compared with O(m^2). Compared with state-of-the-art algorithms (e.g., HAlign II), SPARK-MSNA provided a 50% improvement in memory utilization when processing human mitochondrial genomes (mt genomes, 100x, 1.1 GB), with better alignment accuracy in terms of average SP score and comparable execution time. The algorithm is implemented on the big data framework Apache Spark in order to improve scalability. The source code and test data are available at: https://sourceforge.net/projects/spark-msna/ .
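For orientation, the quadratic baseline mentioned in this abstract can be sketched as the textbook Needleman-Wunsch global alignment score. This is the standard, unmodified dynamic program, not the modified variant used by SPARK-MSNA; the function name and the scoring parameters (`match=1`, `mismatch=-1`, `gap=-1`) are assumptions chosen for illustration.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Global alignment score via the classic O(len(a) * len(b)) recurrence.

    Only two rows of the score matrix are kept, so memory is O(len(b)).
    """
    # Row 0: aligning the empty prefix of `a` against prefixes of `b`.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of `a` against the empty string
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in `b`
            left = curr[j - 1] + gap  # gap in `a`
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("ACGT", "AGT"))
```

Every cell of the matrix is visited once, which is where the quadratic cost comes from when both sequences have length m; the abstract's claimed gain comes from avoiding this full table for highly similar sequences.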
Affiliation(s)
- V Vineetha
- Department of Computational Biology and Bioinformatics, University of Kerala, Thiruvananthapuram, Kerala, India.
| | - C L Biji
- Department of Computational Biology and Bioinformatics, University of Kerala, Thiruvananthapuram, Kerala, India
| | - Achuthsankar S Nair
- Department of Computational Biology and Bioinformatics, University of Kerala, Thiruvananthapuram, Kerala, India
|
44
|
Liu L, Wang H. The Recent Applications and Developments of Bioinformatics and Omics Technologies in Traditional Chinese Medicine. Curr Bioinform 2019. [DOI: 10.2174/1574893614666190102125403] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 12/18/2022]
Abstract
Background:
Traditional Chinese Medicine (TCM) is widely utilized as complementary health care in China. Although it has been practiced for nearly 2000 years, its acceptance is still hindered by conventional scientific research methodology. Identifying the molecular mechanisms, targets and bioactive components in TCM is a critical step in the modernization of TCM because of the complexity and uniqueness of the TCM system. With recent advances in computational approaches and high-throughput technologies, it has become possible to understand potential TCM mechanisms at the molecular and systems level and to evaluate the effectiveness and toxicity of TCM treatments. Bioinformatics, which has emerged as an interdisciplinary approach owing to the explosion of omics data and the development of computer science, is gaining considerable attention as a way to uncover the in-depth molecular mechanisms of TCM. Systems biology, based on omics techniques, opens up a new perspective that enables us to investigate the holistic modulation effect on the body.
Objective:
This review aims to sum up recent efforts to apply bioinformatics and omics techniques in TCM research, including Systems biology, Metabolomics, Proteomics, Genomics and Transcriptomics.
Conclusion:
Overall, bioinformatics tools combined with omics techniques have been extensively used to provide scientific support for the ancient practice of TCM and to make it more international through the acquisition, storage and analysis of biomedical data.
Affiliation(s)
- Lin Liu
- Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin 14195, Germany
| | - Hao Wang
- Institute of Chemistry and Biochemistry, Freie Universität Berlin, Berlin 14195, Germany
|
45
|
PVTree: A Sequential Pattern Mining Method for Alignment Independent Phylogeny Reconstruction. Genes (Basel) 2019; 10:genes10020073. [PMID: 30678245 PMCID: PMC6410268 DOI: 10.3390/genes10020073] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 11/27/2018] [Revised: 01/04/2019] [Accepted: 01/14/2019] [Indexed: 11/21/2022] Open
Abstract
A phylogenetic tree is essential for understanding evolution, and it is usually constructed through multiple sequence alignment, which suffers from a heavy computational burden and requires sophisticated parameter tuning. Recently, alignment-free methods based on k-mer profiles or common substrings have provided alternative ways to construct phylogenetic trees. However, most of these methods ignore the global similarities between sequences or some specific valuable features, e.g., frequent patterns over all datasets. To make further improvements, we propose an alignment-free algorithm based on sequential pattern mining, where each sequence is converted into a binary representation of the sequential patterns among sequences. The phylogenetic tree is then constructed by clustering a distance matrix calculated from the pattern vectors. To increase accuracy for highly divergent sequences, we consider pattern weights and filter redundant sub-patterns. Both simulated and real data demonstrate that our method outperforms other alignment-free methods, especially for large sequence sets with low similarity.
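The general alignment-free pipeline described here (binary feature vectors per sequence, then a distance matrix to be clustered) can be sketched as follows. As a simplification, this sketch uses plain k-mer presence as the binary features rather than mined sequential patterns, and Jaccard distance as the metric; both choices, along with the function names, are assumptions for illustration and are not taken from PVTree.

```python
def kmer_set(seq: str, k: int = 3) -> set:
    """Binary k-mer profile: the set of k-mers present in the sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(x: set, y: set) -> float:
    """1 - |intersection| / |union|; two empty profiles count as identical."""
    union = x | y
    if not union:
        return 0.0
    return 1.0 - len(x & y) / len(union)


def distance_matrix(seqs, k: int = 3):
    """Pairwise distances between the binary profiles of all sequences."""
    profiles = [kmer_set(s, k) for s in seqs]
    return [[jaccard_distance(p, q) for q in profiles] for p in profiles]


seqs = ["ACGTACGT", "ACGTACGA", "TTTTGGGG"]
dm = distance_matrix(seqs)
for row in dm:
    print([round(d, 2) for d in row])
```

The resulting matrix could then be passed to any distance-based tree builder (for example, neighbor joining or hierarchical clustering) to obtain a tree without computing an alignment.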
|
46
|
Abstract
Background:
Protein sequence alignment analyses have become a crucial step in many bioinformatics studies during the past decades. Multiple sequence alignment (MSA) and pairwise sequence alignment (PSA) are the two major approaches to sequence alignment. Earlier benchmark studies revealed drawbacks of MSA methods on nucleotide sequence alignments. To test whether similar drawbacks also influence protein sequence alignment analyses, we propose a new benchmark framework for protein clustering based on cluster validity. This new framework directly reflects the biological ground truth of the application scenarios that adopt sequence alignments, and evaluates alignment quality according to the achievement of the biological goal rather than by comparison at the sequence level only, which averts the biases introduced by alignment scores or manual alignment templates. In contrast to earlier studies, we calculate the cluster validity score based on sequence distances instead of clustering results. This strategy avoids the influence of different clustering methods and thus makes the results more dependable.
Results:
PSA methods performed better than MSA methods on most of the BAliBASE benchmark datasets. Analyses on the 80 re-sampled benchmark datasets, constructed by randomly choosing 90% of each dataset 10 times, showed similar results.
Conclusions:
These results validate that the drawbacks of MSA methods revealed at the nucleotide level also exist in protein sequence alignment analyses and affect the accuracy of results.
Electronic supplementary material:
The online version of this article (10.1186/s12859-018-2524-4) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Yingying Wang
- Research Center for Biomedical Information Technology, Shenzhen Institutes of Advanced Technologies, Chinese Academy of Sciences, Shenzhen, China
| | - Hongyan Wu
- Research Center for Biomedical Information Technology, Shenzhen Institutes of Advanced Technologies, Chinese Academy of Sciences, Shenzhen, China.
| | - Yunpeng Cai
- Research Center for Biomedical Information Technology, Shenzhen Institutes of Advanced Technologies, Chinese Academy of Sciences, Shenzhen, China.
|
47
|
Prediction of GluN2B-CT 1290-1310/DAPK1 Interaction by Protein-Peptide Docking and Molecular Dynamics Simulation. Molecules 2018; 23:molecules23113018. [PMID: 30463177 PMCID: PMC6278559 DOI: 10.3390/molecules23113018] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 08/20/2018] [Revised: 11/04/2018] [Accepted: 11/06/2018] [Indexed: 02/08/2023] Open
Abstract
The interaction of death-associated protein kinase 1 (DAPK1) with the C-terminus of the 2B subunit (GluN2B) of the N-methyl-D-aspartate receptor (NMDAR) plays a critical role in the pathophysiology of depression and is considered a potential target for the structure-based discovery of new antidepressants. However, the 3D structure of C-terminus residues 1290-1310 of GluN2B (GluN2B-CT1290-1310) remains elusive, and the interaction between GluN2B-CT1290-1310 and DAPK1 is unknown. In this study, the mechanism of interaction between DAPK1 and GluN2B-CT1290-1310 was predicted by computational simulation methods, including protein-peptide docking and molecular dynamics (MD) simulation. Based on the equilibrated MD trajectory, the total binding free energy between GluN2B-CT1290-1310 and DAPK1 was computed by the molecular mechanics/generalized Born surface area (MM/GBSA) approach. The simulation results showed that hydrophobic, van der Waals, and electrostatic interactions are responsible for the binding of GluN2B-CT1290-1310/DAPK1. Moreover, through per-residue free energy decomposition and in silico alanine scanning analysis, hotspot residues at the interface between GluN2B-CT1290-1310 and DAPK1 were identified. In conclusion, this work predicted the binding mode and quantitatively characterized the protein-peptide interface, which will aid in the discovery of novel drugs targeting the GluN2B-CT1290-1310 and DAPK1 interface.
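For readers unfamiliar with MM/GBSA, the binding free energy it estimates is usually written in the following general form. This is the standard textbook decomposition, given here as background; the abstract does not state which terms the authors included (the entropy term in particular is often omitted), and the subscripts are labels chosen for this example.

```latex
\begin{align}
\Delta G_{\mathrm{bind}} &= G_{\mathrm{complex}} - G_{\mathrm{receptor}} - G_{\mathrm{peptide}} \\
G &\approx E_{\mathrm{MM}} + G_{\mathrm{GB}} + G_{\mathrm{SA}} - TS
\end{align}
```

Here $E_{\mathrm{MM}}$ is the molecular mechanics energy (bonded, van der Waals, and electrostatic terms), $G_{\mathrm{GB}}$ is the polar solvation free energy from the generalized Born model, $G_{\mathrm{SA}}$ is the nonpolar solvation term estimated from the solvent-accessible surface area, and $-TS$ is the entropic contribution. Each quantity is averaged over snapshots of the MD trajectory.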
|
48
|
Zou Q, Lin G, Jiang X, Liu X, Zeng X. Sequence clustering in bioinformatics: an empirical study. Brief Bioinform 2018; 21:1-10. [PMID: 30239587 DOI: 10.1093/bib/bby090] [Citation(s) in RCA: 72] [Impact Index Per Article: 12.0] [Received: 07/03/2018] [Revised: 08/18/2018] [Accepted: 08/18/2018] [Indexed: 12/13/2022] Open
Abstract
Sequence clustering is a basic bioinformatics task that is attracting renewed attention with the development of metagenomics and microbiomics. The latest sequencing techniques have decreased costs and, as a result, massive amounts of DNA/RNA sequences are being produced. The challenge is to cluster the sequence data using stable, quick and accurate methods. For microbiome sequencing data, 16S ribosomal RNA operational taxonomic units are typically used. However, there is often a gap between algorithm developers and bioinformatics users. Different software tools can produce diverse results, and users can find them difficult to analyze. Understanding the different clustering mechanisms is crucial to understanding the results that they produce. In this review, we selected several popular clustering tools, briefly explained their key computing principles, analyzed their characteristics and compared them using two independent benchmark datasets. Our aim is to assist bioinformatics users in employing suitable clustering tools effectively to analyze big sequencing data. Related data, code and software tools are accessible at http://lab.malab.cn/~lg/clustering/.
Affiliation(s)
- Quan Zou
- Tianjin University; University of Electronic Science and Technology of China
|
49
|
Guo R, Zhao Y, Zou Q, Fang X, Peng S. Bioinformatics applications on Apache Spark. Gigascience 2018; 7:5067872. [PMID: 30101283 PMCID: PMC6113509 DOI: 10.1093/gigascience/giy098] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Received: 04/11/2018] [Accepted: 07/28/2018] [Indexed: 11/13/2022] Open
Abstract
With the rapid development of next-generation sequencing technology, ever-increasing quantities of genomic data pose a tremendous challenge to data processing. Therefore, there is an urgent need for highly scalable and powerful computational systems. Among the state-of-the-art parallel computing platforms, Apache Spark is a fast, general-purpose, in-memory, iterative computing framework for large-scale data processing that ensures high fault tolerance and high scalability by introducing the resilient distributed dataset abstraction. In terms of performance, Spark can be up to 100 times faster than Hadoop for in-memory access and 10 times faster for disk access. Moreover, it provides advanced application programming interfaces in Java, Scala, Python, and R. It also supports several advanced components, including Spark SQL for structured data processing, MLlib for machine learning, GraphX for graph computation, and Spark Streaming for stream computing. We surveyed Spark-based applications used in next-generation sequencing and other biological domains, such as epigenetics, phylogeny, and drug discovery. The results of this survey are used to provide a comprehensive guideline allowing bioinformatics researchers to apply Spark in their own fields.
Affiliation(s)
- Runxin Guo
- College of Computer, National University of Defense Technology, No.109, Deya Road, Kaifu District, Changsha, 410073, China
| | - Yi Zhao
- Institute of Computing Technology, Chinese Academy of Sciences, No.6, South Road of the Academy of Sciences, Haidian District, Beijing, 100190, China
| | - Quan Zou
- School of Computer Science and Technology, No.135, Yaguan Road, Jinnan District, Tianjin University, Tianjin, 300050, China
| | - Xiaodong Fang
- BGI Genomics, BGI-Shenzhen, No.21, Mingzhu Road, Yantian District, Shenzhen, 518083, China
| | - Shaoliang Peng
- College of Computer, National University of Defense Technology, No.109, Deya Road, Kaifu District, Changsha, 410073, China; College of Computer Science and Electronic Engineering & National Supercomputer Centre in Changsha, Hunan University, No.252, Shannan Road, Yuelu District, Changsha, 410082, China
|
50
|
Liang S, Zhang R, Liang D, Song T, Ai T, Xia C, Xia L, Wang Y. Multimodal 3D DenseNet for IDH Genotype Prediction in Gliomas. Genes (Basel) 2018; 9:E382. [PMID: 30061525 PMCID: PMC6115744 DOI: 10.3390/genes9080382] [Citation(s) in RCA: 65] [Impact Index Per Article: 10.8] [Received: 05/26/2018] [Revised: 07/06/2018] [Accepted: 07/16/2018] [Indexed: 01/08/2023] Open
Abstract
Non-invasive prediction of isocitrate dehydrogenase (IDH) genotype plays an important role in glioma diagnosis and prognosis. Recent research has shown that radiology images can be a potential tool for genotype prediction, and that fusion of multi-modality data by deep learning methods can provide complementary information to further enhance prediction accuracy. However, there is still no effective deep learning architecture for predicting IDH genotype from three-dimensional (3D) multimodal medical images. In this paper, we proposed a novel multimodal 3D DenseNet (M3D-DenseNet) model to predict IDH genotypes with multimodal magnetic resonance imaging (MRI) data. To evaluate its performance, we conducted experiments on the BRATS-2017 and The Cancer Genome Atlas breast invasive carcinoma (TCGA-BRCA) datasets, which provided image data as input and gene mutation information as the target, respectively. We achieved 84.6% accuracy (area under the curve (AUC) = 85.7%) on the validation dataset. To evaluate its generalizability, we applied transfer learning techniques to predict World Health Organization (WHO) grade status, which also achieved a high accuracy of 91.4% (AUC = 94.8%) on the validation dataset. With automatic feature extraction, effectiveness, and high generalizability, M3D-DenseNet can serve as a useful method for other multimodal radiogenomics problems and has the potential to be applied in clinical decision making.
Affiliation(s)
- Sen Liang
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, and College of Computer Science and Technology, Jilin University, Changchun 130012, China.
| | - Rongguo Zhang
- Advanced Institute, Infervision, Beijing 100000, China.
| | - Dayang Liang
- School of Mechatronics Engineering, Nanchang University, Nanchang 330031, China.
| | - Tianci Song
- Advanced Institute, Infervision, Beijing 100000, China.
| | - Tao Ai
- Department of Radiology, Tongji Hospital, Wuhan 430030, China.
| | - Chen Xia
- Advanced Institute, Infervision, Beijing 100000, China.
| | - Liming Xia
- Department of Radiology, Tongji Hospital, Wuhan 430030, China.
| | - Yan Wang
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, and College of Computer Science and Technology, Jilin University, Changchun 130012, China.
- Cancer Systems Biology Center, China-Japan Union Hospital, Jilin University, Changchun 130033, China.
|