1
|
Harihar B, Saravanan KM, Gromiha MM, Selvaraj S. Importance of Inter-residue Contacts for Understanding Protein Folding and Unfolding Rates, Remote Homology, and Drug Design. Mol Biotechnol 2024:10.1007/s12033-024-01119-4. [PMID: 38498284 DOI: 10.1007/s12033-024-01119-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/16/2023] [Accepted: 02/10/2024] [Indexed: 03/20/2024]
Abstract
Inter-residue interactions in protein structures provide valuable insights into protein folding and stability. Understanding these interactions can be helpful in many crucial applications, including rational design of therapeutic small molecules and biologics, locating functional protein sites, and predicting protein-protein and protein-ligand interactions. The process of developing machine learning models incorporating inter-residue interactions has been improved recently. This review highlights the theoretical models incorporating inter-residue interactions in predicting folding and unfolding rates of proteins. Utilizing contact maps to depict inter-residue interactions aids researchers in developing computer models for detecting remote homologs and interface residues within protein-protein complexes which, in turn, enhances our knowledge of the relationship between sequence and structure of proteins. Further, the application of contact maps derived from inter-residue interactions is highlighted in the field of drug discovery. Overall, this review presents an extensive assessment of the significant models that use inter-residue interactions to investigate folding rates, unfolding rates, remote homology, and drug development, providing potential future advancements in constructing efficient computational models in structural biology.
Collapse
Affiliation(s)
- Balasubramanian Harihar
- Department of Bioinformatics, School of Life Sciences, Bharathidasan University, Tiruchirappalli, Tamil Nadu, 620024, India
- Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology Madras, Chennai, Tamil Nadu, 600036, India
| | - Konda Mani Saravanan
- Department of Bioinformatics, School of Life Sciences, Bharathidasan University, Tiruchirappalli, Tamil Nadu, 620024, India
- Department of Biotechnology, Bharath Institute of Higher Education and Research, Chennai, Tamil Nadu, 600073, India
| | - Michael M Gromiha
- Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology Madras, Chennai, Tamil Nadu, 600036, India
| | - Samuel Selvaraj
- Department of Bioinformatics, School of Life Sciences, Bharathidasan University, Tiruchirappalli, Tamil Nadu, 620024, India.
| |
Collapse
|
2
|
Guo J. Improving structure-based protein-ligand affinity prediction by graph representation learning and ensemble learning. PLoS One 2024; 19:e0296676. [PMID: 38232063 DOI: 10.1371/journal.pone.0296676] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/06/2023] [Accepted: 12/15/2023] [Indexed: 01/19/2024] Open
Abstract
Predicting protein-ligand binding affinity presents a viable solution for accelerating the discovery of new lead compounds. The recent widespread application of machine learning approaches, especially graph neural networks, has brought new advancements in this field. However, some existing structure-based methods treat protein macromolecules and ligand small molecules in the same way and ignore the data heterogeneity, potentially leading to incomplete exploration of the biochemical information of ligands. In this work, we propose LGN, a graph neural network-based fusion model with extra ligand feature extraction to effectively capture local features and global features within the protein-ligand complex, and make use of interaction fingerprints. By combining the ligand-based features and interaction fingerprints, LGN achieves Pearson correlation coefficients of up to 0.842 on the PDBbind 2016 core set, compared to 0.807 when using the features of complex graphs alone. Finally, we verify the rationalization and generalization of our model through comprehensive experiments. We also compare our model with state-of-the-art baseline methods, which validates the superiority of our model. To reduce the impact of data similarity, we increase the robustness of the model by incorporating ensemble learning.
Collapse
Affiliation(s)
- Jia Guo
- Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences, Beijing, P.R. China
- Chongqing School, University of Chinese Academy of Sciences, Chongqing, China
| |
Collapse
|
3
|
Qin X, Zhang L, Liu M, Xu Z, Liu G. ASFold-DNN: Protein Fold Recognition Based on Evolutionary Features With Variable Parameters Using Full Connected Neural Network. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:2712-2722. [PMID: 34133282 DOI: 10.1109/tcbb.2021.3089168] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
Protein fold recognition contribute to comprehend the function of proteins, which is of great help to the gene therapy of diseases and the development of new drugs. Researchers have been working in this direction and have made considerable achievements, but challenges still exist on low sequence similarity datasets. In this study, we propose the ASFold-DNN framework for protein fold recognition research. Above all, four groups of evolutionary features are extracted from the primary structures of proteins, and a preliminary selection of variable parameter is made for two groups of features including ACC _HMM and SXG _HMM, respectively. Then several feature selection algorithms are selected for comparison and the best feature selection scheme is obtained by changing their internal threshold values. Finally, multiple hyper-parameters of Full Connected Neural Network are fully optimized to construct the best model. DD, EDD and TG datasets with low sequence similarities are chosen to evaluate the performance of the models constructed by the framework, and the final prediction accuracy are 85.28, 95.00 and 88.84 percent, respectively. Furthermore, the ASTRAL186 and LE datasets are introduced to further verify the generalization ability of our proposed framework. Comprehensive experimental results prove that the ASFold-DNN framework is more prominent than the state-of-the-art studies on protein fold recognition. The source code and data of ASFold-DNN can be downloaded from https://github.com/Bioinformatics-Laboratory/project/tree/master/ASFold.
Collapse
|
4
|
Genetic Mining of Newly Isolated Salmophages for Phage Therapy. Int J Mol Sci 2022; 23:ijms23168917. [PMID: 36012174 PMCID: PMC9409062 DOI: 10.3390/ijms23168917] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/20/2022] [Revised: 07/29/2022] [Accepted: 08/07/2022] [Indexed: 11/16/2022] Open
Abstract
Salmonella enterica, a Gram-negative zoonotic bacterium, is mainly a food-borne pathogen and the main cause of diarrhea in humans worldwide. The main reservoirs are found in poultry farms, but they are also found in wild birds. The development of antibiotic resistance in S. enterica species raises concerns about the future of efficient therapies against this pathogen and revives the interest in bacteriophages as a useful therapy against bacterial infections. Here, we aimed to decipher and functionally annotate 10 new Salmonella phage genomes isolated in Spain in the light of phage therapy. We designed a bioinformatic pipeline using available building blocks to de novo assemble genomes and perform syntaxic annotation. We then used genome-wide analyses for taxonomic annotation enabled by vContact2 and VICTOR. We were also particularly interested in improving functional annotation using remote homologies detection and comparisons with the recently published phage-specific PHROG protein database. Finally, we searched for useful functions for phage therapy, such as systems encoded by the phage to circumvent cellular defenses with a particular focus on anti-CRISPR proteins. We, thus, were able to genetically characterize nine virulent phages and one temperate phage and identify putative functions relevant to the formulation of phage cocktails for Salmonella biocontrol.
Collapse
|
5
|
Pang Y, Liu B. SelfAT-Fold: Protein Fold Recognition Based on Residue-Based and Motif-Based Self-Attention Networks. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:1861-1869. [PMID: 33090951 DOI: 10.1109/tcbb.2020.3031888] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
The protein fold recognition is a fundamental and crucial step of tertiary structure determination. In this regard, several computational predictors have been proposed. Recently, the predictive performance has been obviously improved by the fold-specific features generated by deep learning techniques. However, these methods failed to measure the global associations among residues or motifs along the protein sequences. Furthermore, these deep learning techniques are often treated as black boxes without interpretability. Inspired by the similarities between protein sequences and natural language sentences, we applied the self-attention mechanism derived from natural language processing (NLP) field to protein fold recognition. The motif-based self-attention network (MSAN) and the residue-based self-attention network (RSAN) were constructed based on a training set to capture the global associations among the structure motifs and residues along the protein sequences, respectively. The fold-specific attention features trained and generated from the training set were then combined with Support Vector Machines (SVMs) to predict the samples in the widely used LE benchmark dataset, which is fully independent from the training set. Experimental results showed that the proposed two SelfAT-Fold predictors outperformed 34 existing state-of-the-art computational predictors. The two SelfAT-Fold predictors were further tested on an independent dataset SCOP_TEST, and they can achieve stable performance. Furthermore, the fold-specific attention features can be used to analyse the characteristics of protein folds. The trained models and data of SelfAT-Fold can be downloaded from http://bliulab.net/selfAT_fold/.
Collapse
|
6
|
Yan K, Wen J, Xu Y, Liu B. Protein Fold Recognition Based on Auto-Weighted Multi-View Graph Embedding Learning Model. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:2682-2691. [PMID: 32356759 DOI: 10.1109/tcbb.2020.2991268] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
Protein fold recognition is critical for studies of the protein structure prediction and drug design. Several methods have been proposed to obtain discriminative features from the protein sequences for fold recognition. However, the ensemble methods that combine the various features to improve predictive performance remain the challenge problems. In this study, we proposed two novel algorithms: AWMG and EMfold. AWMG used a novel predictor based on the multi-view learning framework for fold recognition. Each view was treated as the intermediate representation of the corresponding data source of proteins, including the evolutionary information and the retrieval information. AWMG calculated the auto-weight for each view respectively and constructed the latent subspace which contains the common information shared by different views. The marginalized constraint was employed to enlarge the margins between different folds, improving the predictive performance of AWMG. Furthermore, we proposed a novel ensemble method called EMfold, which combines two complementary methods AWMG and DeepSS. The later method was a template-based algorithm using the SPARKS-X and DeepFR programs. EMfold integrated the advantages of template-based assignment and machine learning classifier. Experimental results on the two widely datasets (LE and YK) showed that the proposed methods outperformed some state-of-the-art methods, indicating that AWMG and EMfold are useful tools for protein fold recognition.
Collapse
|
7
|
Yan K, Wen J, Liu JX, Xu Y, Liu B. Protein Fold Recognition by Combining Support Vector Machines and Pairwise Sequence Similarity Scores. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:2008-2016. [PMID: 31940548 DOI: 10.1109/tcbb.2020.2966450] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
Protein fold recognition is one of the most essential steps for protein structure prediction, aiming to classify proteins into known protein folds. There are two main computational approaches: one is the template-based method based on the alignment scores between query-template protein pairs and the other is the machine learning method based on the feature representation and classifier. These two approaches have their own advantages and disadvantages. Can we combine these methods to establish more accurate predictors for protein fold recognition? In this study, we made an initial attempt and proposed two novel algorithms: TSVM-fold and ESVM-fold. TSVM-fold was based on the Support Vector Machines (SVMs), which utilizes a set of pairwise sequence similarity scores generated by three complementary template-based methods, including HHblits, SPARKS-X, and DeepFR. These scores measured the global relationships between query sequences and templates. The comprehensive features of the attributes of the sequences were fed into the SVMs for the prediction. Then the TSVM-fold was further combined with the HHblits algorithm so as to improve its generalization ability. The combined method is called ESVM-fold. Experimental results in two rigorous benchmark datasets (LE and YK datasets) showed that the proposed methods outperform some state-of-the-art methods, indicating that the TSVM-fold and ESVM-fold are efficient predictors for protein fold recognition.
Collapse
|
8
|
Zhu W, Jiang Y, Liu JS, Deng K. Partition–Mallows Model and Its Inference for Rank Aggregation. J Am Stat Assoc 2021. [DOI: 10.1080/01621459.2021.1930547] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/21/2022]
Affiliation(s)
- Wanchuang Zhu
- Center for Statistical Science & Department of Industrial Engineering, Tsinghua University, Beijing, China
| | - Yingkai Jiang
- Yau Mathematical Sciences Center & Department of Mathematical Sciences, Tsinghua University, Beijing, China
| | - Jun S. Liu
- Department of Statistics, Harvard University, Cambridge, MA
| | - Ke Deng
- Center for Statistical Science & Department of Industrial Engineering, Tsinghua University, Beijing, China
| |
Collapse
|
9
|
Shao J, Yan K, Liu B. FoldRec-C2C: protein fold recognition by combining cluster-to-cluster model and protein similarity network. Brief Bioinform 2021; 22:5873289. [PMID: 32685972 PMCID: PMC7454262 DOI: 10.1093/bib/bbaa144] [Citation(s) in RCA: 44] [Impact Index Per Article: 14.7] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/07/2020] [Revised: 05/26/2020] [Accepted: 06/11/2020] [Indexed: 12/27/2022] Open
Abstract
As a key for studying the protein structures, protein fold recognition is playing an important role in predicting the protein structures associated with COVID-19 and other important structures. However, the existing computational predictors only focus on the protein pairwise similarity or the similarity between two groups of proteins from 2-folds. However, the homology relationship among proteins is in a hierarchical structure. The global protein similarity network will contribute to the performance improvement. In this study, we proposed a predictor called FoldRec-C2C to globally incorporate the interactions among proteins into the prediction. For the FoldRec-C2C predictor, protein fold recognition problem is treated as an information retrieval task in nature language processing. The initial ranking results were generated by a surprised ranking algorithm Learning to Rank, and then three re-ranking algorithms were performed on the ranking lists to adjust the results globally based on the protein similarity network, including seq-to-seq model, seq-to-cluster model and cluster-to-cluster model (C2C). When tested on a widely used and rigorous benchmark dataset LINDAHL dataset, FoldRec-C2C outperforms other 34 state-of-the-art methods in this field. The source code and data of FoldRec-C2C can be downloaded from http://bliulab.net/FoldRec-C2C/download.
Collapse
Affiliation(s)
- Jiangyi Shao
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
| | - Ke Yan
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China
| | - Bin Liu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
| |
Collapse
|
10
|
Protein Structure Prediction: Conventional and Deep Learning Perspectives. Protein J 2021; 40:522-544. [PMID: 34050498 DOI: 10.1007/s10930-021-10003-y] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 05/21/2021] [Indexed: 10/21/2022]
Abstract
Protein structure prediction is a way to bridge the sequence-structure gap, one of the main challenges in computational biology and chemistry. Predicting any protein's accurate structure is of paramount importance for the scientific community, as these structures govern their function. Moreover, this is one of the complicated optimization problems that computational biologists have ever faced. Experimental protein structure determination methods include X-ray crystallography, Nuclear Magnetic Resonance Spectroscopy and Electron Microscopy. All of these are tedious and time-consuming procedures that require expertise. To make the process less cumbersome, scientists use predictive tools as part of computational methods, using data consolidated in the protein repositories. In recent years, machine learning approaches have raised the interest of the structure prediction community. Most of the machine learning approaches for protein structure prediction are centred on co-evolution based methods. The accuracy of these approaches depends on the number of homologous protein sequences available in the databases. The prediction problem becomes challenging for many proteins, especially those without enough sequence homologs. Deep learning methods allow for the extraction of intricate features from protein sequence data without making any intuitions. Accurately predicted protein structures are employed for drug discovery, antibody designs, understanding protein-protein interactions, and interactions with other molecules. This article provides a review of conventional and deep learning approaches in protein structure prediction. We conclude this review by outlining a few publicly available datasets and deep learning architectures currently employed for protein structure prediction tasks.
Collapse
|
11
|
Ru X, Ye X, Sakurai T, Zou Q. Application of learning to rank in bioinformatics tasks. Brief Bioinform 2021; 22:6102666. [PMID: 33454758 DOI: 10.1093/bib/bbaa394] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/03/2020] [Revised: 11/09/2020] [Accepted: 11/24/2020] [Indexed: 12/17/2022] Open
Abstract
Over the past decades, learning to rank (LTR) algorithms have been gradually applied to bioinformatics. Such methods have shown significant advantages in multiple research tasks in this field. Therefore, it is necessary to summarize and discuss the application of these algorithms so that these algorithms are convenient and contribute to bioinformatics. In this paper, the characteristics of LTR algorithms and their strengths over other types of algorithms are analyzed based on the application of multiple perspectives in bioinformatics. Finally, the paper further discusses the shortcomings of the LTR algorithms, the methods and means to better use the algorithms and some open problems that currently exist.
Collapse
Affiliation(s)
| | - Xiucai Ye
- Department of Computer Science and Center for Artificial Intelligence Research (C-AIR), University of Tsukuba
| | | | - Quan Zou
- University of Electronic Science and Technology of China
| |
Collapse
|
12
|
|
13
|
Shah AA, Khan YD. Identification of 4-carboxyglutamate residue sites based on position based statistical feature and multiple classification. Sci Rep 2020; 10:16913. [PMID: 33037248 PMCID: PMC7547663 DOI: 10.1038/s41598-020-73107-y] [Citation(s) in RCA: 26] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/08/2020] [Accepted: 08/20/2020] [Indexed: 11/08/2022] Open
Abstract
Glutamic acid is an alpha-amino acid used by all living beings in protein biosynthesis. One of the important glutamic acid modifications is post-translationally modified 4-carboxyglutamate. It has a significant role in blood coagulation. 4-carboxyglumates are required for the binding of calcium ions. On the contrary, this modification can also cause different diseases such as bone resorption, osteoporosis, papilloma, and plaque atherosclerosis. Considering its importance, it is necessary to predict the occurrence of glutamic acid carboxylation in amino acid stretches. As there is no computational based prediction model available to identify 4-carboxyglutamate modification, this study is, therefore, designed to predict 4-carboxyglutamate sites with a less computational cost. A machine learning model is devised with a Multilayered Perceptron (MLP) classifier using Chou's 5-step rule. It may help in learning statistical moments and based on this learning, the prediction is to be made accurately either it is 4-carboxyglutamate residue site or detected residue site having no 4-carboxyglutamate. Prediction accuracy of the proposed model is 94% using an independent set test, while obtained prediction accuracy is 99% by self-consistency tests.
Collapse
Affiliation(s)
- Asghar Ali Shah
- Department of Computer Sciences, Bahria University Lahore Campus, Lahore, 25000, Pakistan.
| | | |
Collapse
|
14
|
Shao J, Liu B. ProtFold-DFG: protein fold recognition by combining Directed Fusion Graph and PageRank algorithm. Brief Bioinform 2020; 22:5901980. [PMID: 32892224 DOI: 10.1093/bib/bbaa192] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/25/2020] [Revised: 07/16/2020] [Accepted: 07/28/2020] [Indexed: 12/27/2022] Open
Abstract
As one of the most important tasks in protein structure prediction, protein fold recognition has attracted more and more attention. In this regard, some computational predictors have been proposed with the development of machine learning and artificial intelligence techniques. However, these existing computational methods are still suffering from some disadvantages. In this regard, we propose a new network-based predictor called ProtFold-DFG for protein fold recognition. We propose the Directed Fusion Graph (DFG) to fuse the ranking lists generated by different methods, which employs the transitive closure to incorporate more relationships among proteins and uses the KL divergence to calculate the relationship between two proteins so as to improve its generalization ability. Finally, the PageRank algorithm is performed on the DFG to accurately recognize the protein folds by considering the global interactions among proteins in the DFG. Tested on a widely used and rigorous benchmark data set, LINDAHL dataset, experimental results show that the ProtFold-DFG outperforms the other 35 competing methods, indicating that ProtFold-DFG will be a useful method for protein fold recognition. The source code and data of ProtFold-DFG can be downloaded from http://bliulab.net/ProtFold-DFG/download.
Collapse
Affiliation(s)
- Jiangyi Shao
- School of Computer Science and Technology, Beijing Institute of Technology, China
| | - Bin Liu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
| |
Collapse
|
15
|
miRNALoc: predicting miRNA subcellular localizations based on principal component scores of physico-chemical properties and pseudo compositions of di-nucleotides. Sci Rep 2020; 10:14557. [PMID: 32884018 PMCID: PMC7471944 DOI: 10.1038/s41598-020-71381-4] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/13/2020] [Accepted: 07/07/2020] [Indexed: 12/20/2022] Open
Abstract
MicroRNAs (miRNAs) are one kind of non-coding RNA, play vital role in regulating several physiological and developmental processes. Subcellular localization of miRNAs and their abundance in the native cell are central for maintaining physiological homeostasis. Besides, RNA silencing activity of miRNAs is also influenced by their localization and stability. Thus, development of computational method for subcellular localization prediction of miRNAs is desired. In this work, we have proposed a computational method for predicting subcellular localizations of miRNAs based on principal component scores of thermodynamic, structural properties and pseudo compositions of di-nucleotides. Prediction accuracy was analyzed following fivefold cross validation, where ~ 63–71% of AUC-ROC and ~ 69–76% of AUC-PR were observed. While evaluated with independent test set, > 50% localizations were found to be correctly predicted. Besides, the developed computational model achieved higher accuracy than the existing methods. A user-friendly prediction server “miRNALoc” is freely accessible at https://cabgrid.res.in:8080/mirnaloc/, by which the user can predict localizations of miRNAs.
Collapse
|
16
|
PredDBP-Stack: Prediction of DNA-Binding Proteins from HMM Profiles using a Stacked Ensemble Method. BIOMED RESEARCH INTERNATIONAL 2020; 2020:7297631. [PMID: 32352006 PMCID: PMC7174956 DOI: 10.1155/2020/7297631] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/02/2020] [Accepted: 04/01/2020] [Indexed: 12/02/2022]
Abstract
DNA-binding proteins (DBPs) play vital roles in all aspects of genetic activities. However, the identification of DBPs by using wet-lab experimental approaches is often time-consuming and laborious. In this study, we develop a novel computational method, called PredDBP-Stack, to predict DBPs solely based on protein sequences. First, amino acid composition (AAC) and transition probability composition (TPC) extracted from the hidden markov model (HMM) profile are adopted to represent a protein. Next, we establish a stacked ensemble model to identify DBPs, which involves two stages of learning. In the first stage, the four base classifiers are trained with the features of HMM-based compositions. In the second stage, the prediction probabilities of these base classifiers are used as inputs to the meta-classifier to perform the final prediction of DBPs. Based on the PDB1075 benchmark dataset, we conduct a jackknife cross validation with the proposed PredDBP-Stack predictor and obtain a balanced sensitivity and specificity of 92.47% and 92.36%, respectively. This outcome outperforms most of the existing classifiers. Furthermore, our method also achieves superior performance and model robustness on the PDB186 independent dataset. This demonstrates that the PredDBP-Stack is an effective classifier for accurately identifying DBPs based on protein sequence information alone.
Collapse
|
17
|
Ilyas M, Irfan M, Mahmood T, Hussain H, Latif-ur-Rehman, Naeem I, Khaliq-ur-Rahman. Analysis of Germin-like Protein Genes (OsGLPs) Family in Rice Using Various In silico Approaches. Curr Bioinform 2020. [DOI: 10.2174/1574893614666190722165130] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/27/2022]
Abstract
Background:
Germin-like Proteins (GLPs) play an important role in various stresses.
Rice contains 43 GLPs, among which many remain functionally unexplored. The computational
analysis will provide significant insight into their function.
Objective:
To find various structural properties, functional importance, phylogeny and expression
pattern of all OsGLPs using various bioinformatics tools.
Methods:
Physiochemical properties, sub-cellular localization, domain composition, Nglycosylation
and Phosphorylation sites, and 3D structural models of the OsGLPs were predicted
using various bioinformatics tools. Functional analysis was carried out with the Search Tool for
the Retrieval of Interacting Genes/Proteins (STRING) and Blast2GO servers. The expression
profile of the OsGLPs was predicted by retrieving the data for expression values from tissuespecific
and hormonal stressed array libraries of RiceXPro. Their phylogenetic relationship was
computed using Molecular and Evolutionary Genetic Analysis (MEGA6) tool.
Results:
Most of the OsGLPs are stable in the cellular environment with a prominent expression in
the extracellular region (57%) and plasma membrane (33%). Besides, 3 basic cupin domains, 7
more were reported, among which NTTNKVGSNVTLINV, FLLAALLALASWQAI, and
MASSSF were common to 99% of the sequences, related to bacterial pathogenicity, peroxidase
activity, and peptide signal activity, respectively. Structurally, OsGLPs are similar but functionally
they are diverse with novel enzymatic activities of oxalate decarboxylase, lyase, peroxidase, and
oxidoreductase. Expression analysis revealed prominent activities in the root, endosperm, and
leaves. OsGLPs were strongly expressed by abscisic acid, auxin, gibberellin, cytokinin, and
brassinosteroid. Phylogenetically they showed polyphyletic origin with a narrow genetic
background of 0.05%. OsGLPs of chromosome 3, 8, and 12 are functionally more important due to
their defensive role against various stresses through co-expression strategy.
Conclusion:
The analysis will help to utilize OsGLPs in future food programs.
Collapse
Affiliation(s)
- Muhammad Ilyas
- Department of Botany, University of Swabi, Swabi-23561, Khyber Pakhtunkhwa, Pakistan
| | - Muhammad Irfan
- Department of Botany, University of Swabi, Swabi-23561, Khyber Pakhtunkhwa, Pakistan
| | - Tariq Mahmood
- Department of Botany, Faculty of Biological Sciences, Quaid-I-Azam University, Islamabad 45320, Pakistan
| | - Hazrat Hussain
- Department of Biotechnology, University of Swabi, Swabi-23561, Khyber Pakhtunkhwa, Pakistan
| | - Latif-ur-Rehman
- Department of Biotechnology, University of Swabi, Swabi-23561, Khyber Pakhtunkhwa, Pakistan
| | - Ijaz Naeem
- Department of Biotechnology, University of Swabi, Swabi-23561, Khyber Pakhtunkhwa, Pakistan
| | - Khaliq-ur-Rahman
- Department of Chemistry, University of Swabi, Swabi-23561, Khyber Pakhtunkhwa, Pakistan
| |
Collapse
|
18
|
Shao YT, Liu XX, Lu Z, Chou KC. pLoc_Deep-mHum: Predict Subcellular Localization of Human Proteins by Deep Learning. ACTA ACUST UNITED AC 2020. [DOI: 10.4236/ns.2020.127042] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]
|
19
|
Patil K, Chouhan U. Relevance of Machine Learning Techniques and Various Protein Features in Protein Fold Classification: A Review. Curr Bioinform 2019. [DOI: 10.2174/1574893614666190204154038] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/11/2022]
Abstract
Background:
Protein fold prediction is a fundamental step in Structural Bioinformatics.
The tertiary structure of a protein determines its function and to predict its tertiary structure, fold
prediction serves an important role. Protein fold is simply the arrangement of the secondary
structure elements relative to each other in space. A number of studies have been carried out till
date by different research groups working worldwide in this field by using the combination of
different benchmark datasets, different types of descriptors, features and classification techniques.
Objective:
In this study, we have tried to put all these contributions together, analyze their study
and to compare different techniques used by them.
Methods:
Different features are derived from protein sequence, its secondary structure, different
physicochemical properties of amino acids, domain composition, Position Specific Scoring Matrix,
profile and threading techniques.
Conclusion:
Combination of these different features can improve classification accuracy to a
large extent. With the help of this survey, one can know the most suitable feature/attribute set and
classification technique for this multi-class protein fold classification problem.
Collapse
Affiliation(s)
- Komal Patil
- Department of Mathematics, Maulana Azad National Institute of Technology (MANIT), Bhopal, 462003 M.P, India
| | - Usha Chouhan
- Department of Mathematics, Maulana Azad National Institute of Technology (MANIT), Bhopal, 462003 M.P, India
| |
Collapse
|
20
|
Chou KC. Impacts of Pseudo Amino Acid Components and 5-steps Rule to Proteomics and Proteome Analysis. Curr Top Med Chem 2019; 19:2283-2300. [DOI: 10.2174/1568026619666191018100141] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/07/2019] [Revised: 08/18/2019] [Accepted: 08/26/2019] [Indexed: 01/27/2023]
Abstract
Stimulated by the 5-steps rule during the last decade or so, computational proteomics has achieved remarkable progresses in the following three areas: (1) protein structural class prediction; (2) protein subcellular location prediction; (3) post-translational modification (PTM) site prediction. The results obtained by these predictions are very useful not only for an in-depth study of the functions of proteins and their biological processes in a cell, but also for developing novel drugs against major diseases such as cancers, Alzheimer’s, and Parkinson’s. Moreover, since the targets to be predicted may have the multi-label feature, two sets of metrics are introduced: one is for inspecting the global prediction quality, while the other for the local prediction quality. All the predictors covered in this review have a userfriendly web-server, through which the majority of experimental scientists can easily obtain their desired data without the need to go through the complicated mathematics.
Collapse
Affiliation(s)
- Kuo-Chen Chou
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, 610054, China
| |
Collapse
|
21
|
Li CC, Liu B. MotifCNN-fold: protein fold recognition based on fold-specific features extracted by motif-based convolutional neural networks. Brief Bioinform 2019; 21:2133-2141. [PMID: 31774907 DOI: 10.1093/bib/bbz133] [Citation(s) in RCA: 51] [Impact Index Per Article: 10.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/19/2019] [Revised: 09/16/2019] [Accepted: 09/17/2019] [Indexed: 12/31/2022] Open
Abstract
Protein fold recognition is one of the most critical tasks to explore the structures and functions of the proteins based on their primary sequence information. The existing protein fold recognition approaches rely on features reflecting the characteristics of protein folds. However, the feature extraction methods are still the bottleneck of the performance improvement of these methods. In this paper, we proposed two new feature extraction methods called MotifCNN and MotifDCNN to extract more discriminative fold-specific features based on structural motif kernels to construct the motif-based convolutional neural networks (CNNs). The pairwise sequence similarity scores calculated based on fold-specific features are then fed into support vector machines to construct the predictor for fold recognition, and a predictor called MotifCNN-fold has been proposed. Experimental results on the benchmark dataset showed that MotifCNN-fold obviously outperformed all the other competing methods. In particular, the fold-specific features extracted by MotifCNN and MotifDCNN are more discriminative than the fold-specific features extracted by other deep learning techniques, indicating that incorporating the structural motifs into the CNN is able to capture the characteristics of protein folds.
Collapse
Affiliation(s)
- Chen-Chen Li
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China.,School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
| | - Bin Liu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China.,Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China
| |
Collapse
|
22
|
Liu B, Li CC, Yan K. DeepSVM-fold: protein fold recognition by combining support vector machines and pairwise sequence similarity scores generated by deep learning networks. Brief Bioinform 2019; 21:1733-1741. [DOI: 10.1093/bib/bbz098] [Citation(s) in RCA: 106] [Impact Index Per Article: 21.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/23/2019] [Revised: 06/27/2019] [Accepted: 07/06/2019] [Indexed: 12/30/2022] Open
Abstract
Abstract
Protein fold recognition is critical for studying the structures and functions of proteins. The existing protein fold recognition approaches failed to efficiently calculate the pairwise sequence similarity scores of the proteins in the same fold sharing low sequence similarities. Furthermore, the existing feature vectorization strategies are not able to measure the global relationships among proteins from different protein folds. In this article, we proposed a new computational predictor called DeepSVM-fold for protein fold recognition by introducing a new feature vector based on the pairwise sequence similarity scores calculated from the fold-specific features extracted by deep learning networks. The feature vectors are then fed into a support vector machine to construct the predictor. Experimental results on the benchmark dataset (LE) show that DeepSVM-fold obviously outperforms all the other competing methods.
Collapse
Affiliation(s)
- Bin Liu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
- Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China
| | - Chen-Chen Li
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
| | - Ke Yan
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
| |
Collapse
|
23
|
Chou KC. Advances in Predicting Subcellular Localization of Multi-label Proteins and its Implication for Developing Multi-target Drugs. Curr Med Chem 2019; 26:4918-4943. [PMID: 31060481 DOI: 10.2174/0929867326666190507082559] [Citation(s) in RCA: 78] [Impact Index Per Article: 15.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/28/2018] [Revised: 01/29/2019] [Accepted: 01/31/2019] [Indexed: 12/16/2022]
Abstract
The smallest unit of life is a cell, which contains numerous protein molecules. Most
of the functions critical to the cell’s survival are performed by these proteins located in its different
organelles, usually called ‘‘subcellular locations”. Information of subcellular localization
for a protein can provide useful clues about its function. To reveal the intricate pathways at the
cellular level, knowledge of the subcellular localization of proteins in a cell is prerequisite.
Therefore, one of the fundamental goals in molecular cell biology and proteomics is to determine
the subcellular locations of proteins in an entire cell. It is also indispensable for prioritizing
and selecting the right targets for drug development. Unfortunately, it is both timeconsuming
and costly to determine the subcellular locations of proteins purely based on experiments.
With the avalanche of protein sequences generated in the post-genomic age, it is highly
desired to develop computational methods for rapidly and effectively identifying the subcellular
locations of uncharacterized proteins based on their sequences information alone. Actually,
considerable progresses have been achieved in this regard. This review is focused on those
methods, which have the capacity to deal with multi-label proteins that may simultaneously
exist in two or more subcellular location sites. Protein molecules with this kind of characteristic
are vitally important for finding multi-target drugs, a current hot trend in drug development.
Focused in this review are also those methods that have use-friendly web-servers established so
that the majority of experimental scientists can use them to get the desired results without the
need to go through the detailed mathematics involved.
Collapse
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA 02478, United States
| |
Collapse
|
24
|
Su ZD, Huang Y, Zhang ZY, Zhao YW, Wang D, Chen W, Chou KC, Lin H. iLoc-lncRNA: predict the subcellular location of lncRNAs by incorporating octamer composition into general PseKNC. Bioinformatics 2019; 34:4196-4204. [PMID: 29931187 DOI: 10.1093/bioinformatics/bty508] [Citation(s) in RCA: 124] [Impact Index Per Article: 24.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/02/2018] [Accepted: 06/19/2018] [Indexed: 12/20/2022] Open
Abstract
Motivation Long non-coding RNAs (lncRNAs) are a class of RNA molecules with more than 200 nucleotides. They have important functions in cell development and metabolism, such as genetic markers, genome rearrangements, chromatin modifications, cell cycle regulation, transcription and translation. Their functions are generally closely related to their localization in the cell. Therefore, knowledge about their subcellular locations can provide very useful clues or preliminary insight into their biological functions. Although biochemical experiments could determine the localization of lncRNAs in a cell, they are both time-consuming and expensive. Therefore, it is highly desirable to develop bioinformatics tools for fast and effective identification of their subcellular locations. Results We developed a sequence-based bioinformatics tool called 'iLoc-lncRNA' to predict the subcellular locations of LncRNAs by incorporating the 8-tuple nucleotide features into the general PseKNC (Pseudo K-tuple Nucleotide Composition) via the binomial distribution approach. Rigorous jackknife tests have shown that the overall accuracy achieved by the new predictor on a stringent benchmark dataset is 86.72%, which is over 20% higher than that by the existing state-of-the-art predictor evaluated on the same tests. Availability and implementation A user-friendly webserver has been established at http://lin-group.cn/server/iLoc-LncRNA, by which users can easily obtain their desired results. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Zhen-Dong Su
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Yan Huang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
| | - Zhao-Yue Zhang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Ya-Wei Zhao
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Dong Wang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China.,College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
| | - Wei Chen
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China.,Department of Physics, School of Sciences, and Center for Genomics and Computational Biology, North China University of Science and Technology, Tangshan, China.,Gordon Life Science Institute, Boston, MA, USA
| | - Kuo-Chen Chou
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China.,Gordon Life Science Institute, Boston, MA, USA
| | - Hao Lin
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China.,Gordon Life Science Institute, Boston, MA, USA
| |
Collapse
|
25
|
Abstract
The smallest unit of life is a cell, which contains numerous protein molecules. Most
of the functions critical to the cell’s survival are performed by these proteins located in its different
organelles, usually called ‘‘subcellular locations”. Information of subcellular localization
for a protein can provide useful clues about its function. To reveal the intricate pathways at the
cellular level, knowledge of the subcellular localization of proteins in a cell is prerequisite.
Therefore, one of the fundamental goals in molecular cell biology and proteomics is to determine
the subcellular locations of proteins in an entire cell. It is also indispensable for prioritizing
and selecting the right targets for drug development. Unfortunately, it is both timeconsuming
and costly to determine the subcellular locations of proteins purely based on experiments.
With the avalanche of protein sequences generated in the post-genomic age, it is highly
desired to develop computational methods for rapidly and effectively identifying the subcellular
locations of uncharacterized proteins based on their sequences information alone. Actually,
considerable progresses have been achieved in this regard. This review is focused on those
methods, which have the capacity to deal with multi-label proteins that may simultaneously
exist in two or more subcellular location sites. Protein molecules with this kind of characteristic
are vitally important for finding multi-target drugs, a current hot trend in drug development.
Focused in this review are also those methods that have use-friendly web-servers established so
that the majority of experimental scientists can use them to get the desired results without the
need to go through the detailed mathematics involved.
Collapse
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA 02478, United States
| |
Collapse
|
26
|
Liu B, Li K, Huang DS, Chou KC. iEnhancer-EL: identifying enhancers and their strength with ensemble learning approach. Bioinformatics 2019; 34:3835-3842. [PMID: 29878118 DOI: 10.1093/bioinformatics/bty458] [Citation(s) in RCA: 130] [Impact Index Per Article: 26.0] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/23/2018] [Accepted: 06/06/2018] [Indexed: 11/14/2022] Open
Abstract
Motivation Identification of enhancers and their strength is important because they play a critical role in controlling gene expression. Although some bioinformatics tools were developed, they are limited in discriminating enhancers from non-enhancers only. Recently, a two-layer predictor called 'iEnhancer-2L' was developed that can be used to predict the enhancer's strength as well. However, its prediction quality needs further improvement to enhance the practical application value. Results A new predictor called 'iEnhancer-EL' was proposed that contains two layer predictors: the first one (for identifying enhancers) is formed by fusing an array of six key individual classifiers, and the second one (for their strength) formed by fusing an array of ten key individual classifiers. All these key classifiers were selected from 171 elementary classifiers formed by SVM (Support Vector Machine) based on kmer, subsequence profile and PseKNC (Pseudo K-tuple Nucleotide Composition), respectively. Rigorous cross-validations have indicated that the proposed predictor is remarkably superior to the existing state-of-the-art one in this area. Availability and implementation A web server for the iEnhancer-EL has been established at http://bioinformatics.hitsz.edu.cn/iEnhancer-EL/, by which users can easily get their desired results without the need to go through the mathematical details. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Bin Liu
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China.,Gordon Life Science Institute, Belmont, MA, USA
| | - Kai Li
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
| | - De-Shuang Huang
- Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai, China
| | - Kuo-Chen Chou
- Gordon Life Science Institute, Belmont, MA, USA.,Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| |
Collapse
|
27
|
Zhuang YY, Liu HJ, Song X, Ju Y, Peng H. A Linear Regression Predictor for Identifying N 6-Methyladenosine Sites Using Frequent Gapped K-mer Pattern. MOLECULAR THERAPY. NUCLEIC ACIDS 2019; 18:673-680. [PMID: 31707204 PMCID: PMC6849367 DOI: 10.1016/j.omtn.2019.10.001] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/09/2019] [Revised: 08/19/2019] [Accepted: 10/03/2019] [Indexed: 01/07/2023]
Abstract
N6-methyladenosine (m6A) is one of the most common and abundant modifications in RNA, which is related to many biological processes in humans. Abnormal RNA modifications are often associated with a series of diseases, including tumors, neurogenic diseases, and embryonic retardation. Therefore, identifying m6A sites is of paramount importance in the post-genomic age. Although many lab-based methods have been proposed to annotate m6A sites, they are time consuming and cost ineffective. In view of the drawbacks of the intrinsic methods in RNA sequence recognition, computational methods are suggested as a supplement to identify m6A sites. In this study, we develop a novel feature extraction algorithm based on the frequent gapped k-mer pattern (FGKP) and apply the linear regression to construct the prediction model. The new predictor is used to identify m6A sites in the Saccharomyces cerevisiae database. It has been shown by the 10-fold cross-validation that the performance is better than that of recent methods. Comparative results indicate that our model has great potential to become a useful and effective tool for genome analysis and gain more insights for locating m6A sites.
Collapse
Affiliation(s)
- Y Y Zhuang
- School of Informatics, Xiamen University, Xiamen 361005, China
| | - H J Liu
- College of Information Technology and Computer Science, University of the Cordilleras, Baguio 2600, Philippines
| | - X Song
- School of Computer and Information Technology, Nanyang Normal University, Nanyang 473000, China.
| | - Y Ju
- School of Informatics, Xiamen University, Xiamen 361005, China
| | - H Peng
- School of Informatics, Xiamen University, Xiamen 361005, China
| |
Collapse
|
28
|
Liu B, Chen S, Yan K, Weng F. iRO-PsekGCC: Identify DNA Replication Origins Based on Pseudo k-Tuple GC Composition. Front Genet 2019; 10:842. [PMID: 31620165 PMCID: PMC6759546 DOI: 10.3389/fgene.2019.00842] [Citation(s) in RCA: 29] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/05/2019] [Accepted: 08/13/2019] [Indexed: 11/22/2022] Open
Abstract
Identification of replication origins is playing a key role in understanding the mechanism of DNA replication. This task is of great significance in DNA sequence analysis. Because of its importance, some computational approaches have been introduced. Among these predictors, the iRO-3wPseKNC predictor is the first discriminative method that is able to correctly identify the entire replication origins. For further improving its predictive performance, we proposed the Pseudo k-tuple GC Composition (PsekGCC) approach to capture the "GC asymmetry bias" of yeast species by considering both the GC skew and the sequence order effects of k-tuple GC Composition (k-GCC) in this study. Based on PseKGCC, we proposed a new predictor called iRO-PsekGCC to identify the DNA replication origins. Rigorous jackknife test on two yeast species benchmark datasets (Saccharomyces cerevisiae, Pichia pastoris) indicated that iRO-PsekGCC outperformed iRO-3wPseKNC. It can be anticipated that iRO-PsekGCC will be a useful tool for DNA replication origin identification. Availability and implementation: The web-server for the iRO-PsekGCC predictor was established, and it can be accessed at http://bliulab.net/iRO-PsekGCC/.
Collapse
Affiliation(s)
- Bin Liu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
- Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China
| | - Shengyu Chen
- School of Informatics, Computing, and Engineering, Indiana University Bloomington, Bloomington, IN, United States
| | - Ke Yan
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
| | - Fan Weng
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
| |
Collapse
|
29
|
Peng LX, Liu XH, Lu B, Liao SM, Zhou F, Huang JM, Chen D, Troy FA, Zhou GP, Huang RB. The Inhibition of Polysialyltranseferase ST8SiaIV Through Heparin Binding to Polysialyltransferase Domain (PSTD). Med Chem 2019; 15:486-495. [PMID: 30569872 DOI: 10.2174/1573406415666181218101623] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/23/2018] [Revised: 10/23/2018] [Accepted: 12/12/2018] [Indexed: 11/22/2022]
Abstract
BACKGROUND The polysialic acid (polySia) is a unique carbohydrate polymer produced on the surface Of Neuronal Cell Adhesion Molecule (NCAM) in a number of cancer cells, and strongly correlates with the migration and invasion of tumor cells and with aggressive, metastatic disease and poor clinical prognosis in the clinic. Its synthesis is catalyzed by two polysialyltransferases (polySTs), ST8SiaIV (PST) and ST8SiaII (STX). Selective inhibition of polySTs, therefore, presents a therapeutic opportunity to inhibit tumor invasion and metastasis due to NCAM polysialylation. Heparin has been found to be effective in inhibiting the ST8Sia IV activity, but no clear molecular rationale. It has been found that polysialyltransferase domain (PSTD) in polyST plays a significant role in influencing polyST activity, and thus it is critical for NCAM polysialylation based on the previous studies. OBJECTIVE To determine whether the three different types of heparin (unfractionated hepain (UFH), low molecular heparin (LMWH) and heparin tetrasaccharide (DP4)) is bound to the PSTD; and if so, what are the critical residues of the PSTD for these binding complexes? METHODS Fluorescence quenching analysis, the Circular Dichroism (CD) spectroscopy, and NMR spectroscopy were used to determine and analyze interactions of PSTD-UFH, PSTD-LMWH, and PSTD-DP4. RESULTS The fluorescence quenching analysis indicates that the PSTD-UFH binding is the strongest and the PSTD-DP4 binding is the weakest among these three types of the binding; the CD spectra showed that mainly the PSTD-heparin interactions caused a reduction in signal intensity but not marked decrease in α-helix content; the NMR data of the PSTD-DP4 and the PSTDLMWH interactions showed that the different types of heparin shared 12 common binding sites at N247, V251, R252, T253, S257, R265, Y267, W268, L269, V273, I275, and K276, which were mainly distributed in the long α-helix of the PSTD and the short 3-residue loop of the C-terminal PSTD. In addition, three residues K246, K250 and A254 were bound to the LMWH, but not to DP4. This suggests that the PSTD-LMWH binding is stronger than the PSTD-DP4 binding, and the LMWH is a more effective inhibitor than DP4. CONCLUSION The findings in the present study demonstrate that PSTD domain is a potential target of heparin and may provide new insights into the molecular rationale of heparin-inhibiting NCAM polysialylation.
Collapse
Affiliation(s)
- Li-Xin Peng
- Life Science and Technology College, Guangxi University, Nanning, Guangxi, 530004 China; 2Institute of Biophysics, Chinese Academy of Sciences, Beijing, China.,National Engineering Research Center for Non-food Biorefinery, Guangxi Academy of Sciences, 98 Daling Road, Nanning, Guangxi 530007, China
| | - Xue-Hui Liu
- Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
| | - Bo Lu
- National Engineering Research Center for Non-food Biorefinery, Guangxi Academy of Sciences, 98 Daling Road, Nanning, Guangxi 530007, China
| | - Si-Ming Liao
- National Engineering Research Center for Non-food Biorefinery, Guangxi Academy of Sciences, 98 Daling Road, Nanning, Guangxi 530007, China
| | - Feng Zhou
- National Engineering Research Center for Non-food Biorefinery, Guangxi Academy of Sciences, 98 Daling Road, Nanning, Guangxi 530007, China
| | - Ji-Min Huang
- National Engineering Research Center for Non-food Biorefinery, Guangxi Academy of Sciences, 98 Daling Road, Nanning, Guangxi 530007, China
| | - Dong Chen
- National Engineering Research Center for Non-food Biorefinery, Guangxi Academy of Sciences, 98 Daling Road, Nanning, Guangxi 530007, China
| | - Frederic A Troy
- Department of Biochemistry and Molecular Medicine, University of California School of Medicine, Davis, CL, United States
| | - Guo-Ping Zhou
- National Engineering Research Center for Non-food Biorefinery, Guangxi Academy of Sciences, 98 Daling Road, Nanning, Guangxi 530007, China.,Gordon Life Science Institute, 53 South Cottage Road Belmont, MA 02478, United States
| | - Ri-Bo Huang
- Life Science and Technology College, Guangxi University, Nanning, Guangxi, 530004 China; 2Institute of Biophysics, Chinese Academy of Sciences, Beijing, China.,National Engineering Research Center for Non-food Biorefinery, Guangxi Academy of Sciences, 98 Daling Road, Nanning, Guangxi 530007, China
| |
Collapse
|
30
|
de Lima Nichio BT, de Oliveira AMR, de Pierri CR, Santos LGC, Lejambre AQ, Vialle RA, da Rocha Coimbra NA, Guizelini D, Marchaukoski JN, de Oliveira Pedrosa F, Raittz RT. RAFTS 3G: an efficient and versatile clustering software to analyses in large protein datasets. BMC Bioinformatics 2019; 20:392. [PMID: 31307371 PMCID: PMC6631606 DOI: 10.1186/s12859-019-2973-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/16/2018] [Accepted: 06/28/2019] [Indexed: 12/19/2022] Open
Abstract
BACKGROUND Clustering methods are essential to partitioning biological samples being useful to minimize the information complexity in large datasets. Tools in this context usually generates data with greed algorithms that solves some Data Mining difficulties which can degrade biological relevant information during the clustering process. The lack of standardization of metrics and consistent bases also raises questions about the clustering efficiency of some methods. Benchmarks are needed to explore the full potential of clustering methods - in which alignment-free methods stand out - and the good choice of dataset makes it essentials. RESULTS Here we present a new approach to Data Mining in large protein sequences datasets, the Rapid Alignment Free Tool for Sequences Similarity Search to Groups (RAFTS3G), a method to clustering aiming of losing less biological information in the processes of generation groups. The strategy developed in our algorithm is optimized to be more astringent which reflects increase in accuracy and sensitivity in the generation of clusters in a wide range of similarity. RAFTS3G is the better choice compared to three main methods when the user wants more reliable result even ignoring the ideal threshold to clustering. CONCLUSION In general, RAFTS3G is able to group up to millions of biological sequences into large datasets, which is a remarkable option of efficiency in clustering. RAFTS3G compared to other "standard-gold" methods in the clustering of large biological data maintains the balance between the reduction of biological information redundancy and the creation of consistent groups. We bring the binary search concept applied to grouped sequences which shows maintaining sensitivity/accuracy relation and up to minimize the time of data generated with RAFTS3G process.
Collapse
Affiliation(s)
- Bruno Thiago de Lima Nichio
- Laboratory of Bioinformatics, Professional and Technical Education Sector from the Federal University of Paraná, Curitiba, PR Brazil
- Department of Biochemistry, Biological Sciences Sector – Federal University of Paraná (UFPR), Curitiba, PR Brazil
| | - Aryel Marlus Repula de Oliveira
- Laboratory of Bioinformatics, Professional and Technical Education Sector from the Federal University of Paraná, Curitiba, PR Brazil
| | - Camilla Reginatto de Pierri
- Laboratory of Bioinformatics, Professional and Technical Education Sector from the Federal University of Paraná, Curitiba, PR Brazil
- Department of Biochemistry, Biological Sciences Sector – Federal University of Paraná (UFPR), Curitiba, PR Brazil
| | - Leticia Graziela Costa Santos
- Laboratory of Bioinformatics, Professional and Technical Education Sector from the Federal University of Paraná, Curitiba, PR Brazil
| | - Alexandre Quadros Lejambre
- Laboratory of Bioinformatics, Professional and Technical Education Sector from the Federal University of Paraná, Curitiba, PR Brazil
| | - Ricardo Assunção Vialle
- Laboratory of Bioinformatics, Professional and Technical Education Sector from the Federal University of Paraná, Curitiba, PR Brazil
| | - Nilson Antônio da Rocha Coimbra
- Laboratory of Bioinformatics, Professional and Technical Education Sector from the Federal University of Paraná, Curitiba, PR Brazil
| | - Dieval Guizelini
- Laboratory of Bioinformatics, Professional and Technical Education Sector from the Federal University of Paraná, Curitiba, PR Brazil
| | - Jeroniza Nunes Marchaukoski
- Laboratory of Bioinformatics, Professional and Technical Education Sector from the Federal University of Paraná, Curitiba, PR Brazil
| | - Fabio de Oliveira Pedrosa
- Laboratory of Bioinformatics, Professional and Technical Education Sector from the Federal University of Paraná, Curitiba, PR Brazil
- Department of Biochemistry, Biological Sciences Sector – Federal University of Paraná (UFPR), Curitiba, PR Brazil
| | - Roberto Tadeu Raittz
- Laboratory of Bioinformatics, Professional and Technical Education Sector from the Federal University of Paraná, Curitiba, PR Brazil
| |
Collapse
|
31
|
Xiao X, Cheng X, Chen G, Mao Q, Chou KC. pLoc_bal-mVirus: Predict Subcellular Localization of Multi-Label Virus Proteins by Chou's General PseAAC and IHTS Treatment to Balance Training Dataset. Med Chem 2019; 15:496-509. [DOI: 10.2174/1573406415666181217114710] [Citation(s) in RCA: 44] [Impact Index Per Article: 8.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/20/2018] [Revised: 10/23/2018] [Accepted: 12/12/2018] [Indexed: 12/17/2022]
Abstract
Background/Objective:Knowledge of protein subcellular localization is vitally important for both basic research and drug development. Facing the avalanche of protein sequences emerging in the post-genomic age, it is urgent to develop computational tools for timely and effectively identifying their subcellular localization based on the sequence information alone. Recently, a predictor called “pLoc-mVirus” was developed for identifying the subcellular localization of virus proteins. Its performance is overwhelmingly better than that of the other predictors for the same purpose, particularly in dealing with multi-label systems in which some proteins, known as “multiplex proteins”, may simultaneously occur in, or move between two or more subcellular location sites. Despite the fact that it is indeed a very powerful predictor, more efforts are definitely needed to further improve it. This is because pLoc-mVirus was trained by an extremely skewed dataset in which some subset was over 10 times the size of the other subsets. Accordingly, it cannot avoid the biased consequence caused by such an uneven training dataset.Methods:Using the Chou's general PseAAC (Pseudo Amino Acid Composition) approach and the IHTS (Inserting Hypothetical Training Samples) treatment to balance out the training dataset, we have developed a new predictor called “pLoc_bal-mVirus” for predicting the subcellular localization of multi-label virus proteins.Results:Cross-validation tests on exactly the same experiment-confirmed dataset have indicated that the proposed new predictor is remarkably superior to pLoc-mVirus, the existing state-of-theart predictor for the same purpose.Conclusion:Its user-friendly web-server is available at http://www.jci-bioinfo.cn/pLoc_balmVirus/, by which the majority of experimental scientists can easily get their desired results without the need to go through the detailed complicated mathematics. Accordingly, pLoc_bal-mVirus will become a very useful tool for designing multi-target drugs and in-depth understanding of the biological process in a cell.
Collapse
Affiliation(s)
- Xuan Xiao
- Gordon Life Science Institute, Boston, MA 02478, United States
| | - Xiang Cheng
- Gordon Life Science Institute, Boston, MA 02478, United States
| | - Genqiang Chen
- College of Chemistry, Chemical Engineering and Biotechnology, Donghua University, Shanghai 201620, China
| | - Qi Mao
- College of Information Science and Technology, Donghua University, Shanghai, China
| | - Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA 02478, United States
| |
Collapse
|
32
|
Liu B, Li S. ProtDet-CCH: Protein Remote Homology Detection by Combining Long Short-Term Memory and Ranking Methods. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:1203-1210. [PMID: 29993950 DOI: 10.1109/tcbb.2018.2789880] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
As one of the most challenging tasks in sequence analysis, protein remote homology detection has been extensively studied. Methods based on discriminative models and ranking approaches have achieved the state-of-the-art performance, and these two kinds of methods are complementary. In this study, three LSTM models have been applied to construct the predictors for protein remote homology detection, including ULSTM, BLSTM, and CNN-BLSTM. They are able to automatically extract the local and global sequence order information. Combined with PSSMs, the CNN-BLSTM achieved the best performance among the three LSTM-based models. We named this method as CNN-BLSTM-PSSM. Finally, a new method called ProtDet-CCH was proposed by combining CNN-BLSTM-PSSM and a ranking method HHblits. Tested on a widely used SCOP benchmark dataset, ProtDet-CCH achieved an ROC score of 0.998, and an ROC50 score of 0.982, significantly outperforming other existing state-of-the-art methods. Experimental results on two updated SCOPe independent datasets showed that ProtDet-CCH can achieve stable performance. Furthermore, our method can provide useful insights for studying the features and motifs of protein families and superfamilies. It is anticipated that ProtDet-CCH will become a very useful tool for protein remote homology detection.
Collapse
|
33
|
Hassan S, Sudhakar V, Nancy Mary MB, Babu R, Doble M, Dadar M, Hanna LE. Computational approach identifies protein off-targets for Isoniazid-NAD adduct: hypothesizing a possible drug resistance mechanism in Mycobacterium tuberculosis. J Biomol Struct Dyn 2019; 38:1697-1710. [PMID: 31094664 DOI: 10.1080/07391102.2019.1615987] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/01/2023]
Abstract
Isoniazid is an important antitubercular molecule identified as a drug of choice in tuberculosis treatment. As such, INH is an inactive prodrug; it acquires an active conformation by forming an adduct with NAD. The adduct targets inhA protein, a reductase responsible for fatty acid chain elongation in the cell wall of Mycobacterium tuberculosis. Resistance to INH is majorly contributed by mutations in inhA, katG and geneic and non-geneic regions associated with efflux genes. Despite being widespread, the mechanism of resistance remains unknown in ∼15% of INH-resistant strains. Studies report that an intracellular increase in NADH concentration prevents inhA inhibition, leading to INH resistance. In the pursuit of finding possible resistance mechanisms, we set out to find NAD binding proteins to explore similarities in structure and NAD binding property of these proteins with that of inhA. We identified 172 NAD binding proteins, of which 53 were identified to have sequence or structural similarity to inhA. By performing docking analysis on selected proteins, we identified INH-adduct to have good binding affinity despite very minimal structural similarity to inhA. This analysis was further supported by principal component analysis, which identified 65 proteins with NAD binding conformation similar to that of inhA. These findings prompt us to hypothesize that upon exposure to INH, bacteria tries to reduce inhA susceptibility by inducing expression of these NAD binding proteins through increase in NADH concentration. This in turn favours off-target binding and leads to decreased binding and potency of INH, thus contributing indirectly to INH resistance.Communicated by Ramaswamy H. Sarma.
Collapse
Affiliation(s)
- Sameer Hassan
- Department of Biological and Environmental Sciences, University of Gothenburg, Gothenburg, Sweden.,Department of HIV, National Institute for Research in Tuberculosis, Chennai, Tamil Nadu, India
| | - Vaishnavi Sudhakar
- Department of HIV, National Institute for Research in Tuberculosis, Chennai, Tamil Nadu, India
| | - M Benita Nancy Mary
- Department of HIV, National Institute for Research in Tuberculosis, Chennai, Tamil Nadu, India
| | - Rajeshwari Babu
- Department of HIV, National Institute for Research in Tuberculosis, Chennai, Tamil Nadu, India
| | - Mukesh Doble
- Department of Biotechnology, Indian Institute of Technology, Chennai, Tamil Nadu, India
| | - Maryam Dadar
- Education and Extension Organization, Razi Vaccine and Serum Research Institute, Agricultural Research, Karaj, Iran
| | - Luke Elizabeth Hanna
- Department of HIV, National Institute for Research in Tuberculosis, Chennai, Tamil Nadu, India
| |
Collapse
|
34
|
Abstract
Background:DNA-binding proteins, binding to DNA, widely exist in living cells, participating in many cell activities. They can participate some DNA-related cell activities, for instance DNA replication, transcription, recombination, and DNA repair.Objective:Given the importance of DNA-binding proteins, studies for predicting the DNA-binding proteins have been a popular issue over the past decades. In this article, we review current machine-learning methods which research on the prediction of DNA-binding proteins through feature representation methods, classifiers, measurements, dataset and existing web server.Method:The prediction methods of DNA-binding protein can be divided into two types, based on amino acid composition and based on protein structure. In this article, we accord to the two types methods to introduce the application of machine learning in DNA-binding proteins prediction.Results:Machine learning plays an important role in the classification of DNA-binding proteins, and the result is better. The best ACC is above 80%.Conclusion:Machine learning can be widely used in many aspects of biological information, especially in protein classification. Some issues should be considered in future work. First, the relationship between the number of features and performance must be explored. Second, many features are used to predict DNA-binding proteins and propose solutions for high-dimensional spaces.
Collapse
Affiliation(s)
- Kaiyang Qu
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Leyi Wei
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Quan Zou
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| |
Collapse
|
35
|
Yang W, Zhu XJ, Huang J, Ding H, Lin H. A Brief Survey of Machine Learning Methods in Protein Sub-Golgi Localization. Curr Bioinform 2019. [DOI: 10.2174/1574893613666181113131415] [Citation(s) in RCA: 111] [Impact Index Per Article: 22.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/08/2023]
Abstract
Background:The location of proteins in a cell can provide important clues to their functions in various biological processes. Thus, the application of machine learning method in the prediction of protein subcellular localization has become a hotspot in bioinformatics. As one of key organelles, the Golgi apparatus is in charge of protein storage, package, and distribution.Objective:The identification of protein location in Golgi apparatus will provide in-depth insights into their functions. Thus, the machine learning-based method of predicting protein location in Golgi apparatus has been extensively explored. The development of protein sub-Golgi apparatus localization prediction should be reviewed for providing a whole background for the fields.Method:The benchmark dataset, feature extraction, machine learning method and published results were summarized.Results:We briefly introduced the recent progresses in protein sub-Golgi apparatus localization prediction using machine learning methods and discussed their advantages and disadvantages.Conclusion:We pointed out the perspective of machine learning methods in protein sub-Golgi localization prediction.
Collapse
Affiliation(s)
- Wuritu Yang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, Sichuan, 610054, China
| | - Xiao-Juan Zhu
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, Sichuan, 610054, China
| | - Jian Huang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, Sichuan, 610054, China
| | - Hui Ding
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, Sichuan, 610054, China
| | - Hao Lin
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, Sichuan, 610054, China
| |
Collapse
|
36
|
Liu B, Chen J, Guo M, Wang X. Protein Remote Homology Detection and Fold Recognition Based on Sequence-Order Frequency Matrix. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:292-300. [PMID: 29990004 DOI: 10.1109/tcbb.2017.2765331] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Protein remote homology detection and fold recognition are two critical tasks for the studies of protein structures and functions. Currently, the profile-based methods achieve the state-of-the-art performance in these fields. However, the widely used sequence profiles, like position-specific frequency matrix (PSFM) and position-specific scoring matrix (PSSM), ignore the sequence-order effects along protein sequence. In this study, we have proposed a novel profile, called sequence-order frequency matrix (SOFM), to extract the sequence-order information of neighboring residues from multiple sequence alignment (MSA). Combined with two profile feature extraction approaches, top-n-grams and the Smith-Waterman algorithm, the SOFMs are applied to protein remote homology detection and fold recognition, and two predictors called SOFM-Top and SOFM-SW are proposed. Experimental results show that SOFM contains more information content than other profiles, and these two predictors outperform other state-of-the-art methods. It is anticipated that SOFM will become a very useful profile in the studies of protein structures and functions.
Collapse
|
37
|
Zhang S, Lin J, Su L, Zhou Z. pDHS-DSET: Prediction of DNase I hypersensitive sites in plant genome using DS evidence theory. Anal Biochem 2019; 564-565:54-63. [DOI: 10.1016/j.ab.2018.10.018] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/23/2018] [Revised: 10/10/2018] [Accepted: 10/15/2018] [Indexed: 10/28/2022]
|
38
|
Liu B, Jiang S, Zou Q. HITS-PR-HHblits: protein remote homology detection by combining PageRank and Hyperlink-Induced Topic Search. Brief Bioinform 2018; 21:298-308. [PMID: 30403770 DOI: 10.1093/bib/bby104] [Citation(s) in RCA: 46] [Impact Index Per Article: 7.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2018] [Revised: 10/03/2018] [Accepted: 10/04/2018] [Indexed: 11/12/2022] Open
Abstract
As one of the most important fundamental problems in protein sequence analysis, protein remote homology detection is critical for both theoretical research (protein structure and function studies) and real world applications (drug design). Although several computational predictors have been proposed, their detection performance is still limited. In this study, we treat protein remote homology detection as a document retrieval task, where the proteins are considered as documents and its aim is to find the highly related documents with the query documents in a database. A protein similarity network was constructed based on the true labels of proteins in the database, and the query proteins were then connected into the network based on the similarity scores calculated by three ranking methods, including PSI-BLAST, Hmmer and HHblits. The PageRank algorithm and Hyperlink-Induced Topic Search (HITS) algorithm were respectively performed on this network to move the homologous proteins of query proteins to the neighbors of the query proteins in the network. Finally, PageRank and HITS algorithms were combined, and a predictor called HITS-PR-HHblits was proposed to further improve the predictive performance. Tested on the SCOP and SCOPe benchmark datasets, the experimental results showed that the proposed protocols outperformed other state-of-the-art methods. For the convenience of the most experimental scientists, a web server for HITS-PR-HHblits was established at http://bioinformatics.hitsz.edu.cn/HITS-PR-HHblits, by which the users can easily get the results without the need to go through the mathematical details. The HITS-PR-HHblits predictor is a protocol for protein remote homology detection using different sets of programs, which will become a very useful computational tool for proteome analysis.
Collapse
Affiliation(s)
- Bin Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
| | - Shuangyan Jiang
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| |
Collapse
|
39
|
Chen W, Ding H, Zhou X, Lin H, Chou KC. iRNA(m6A)-PseDNC: Identifying N 6-methyladenosine sites using pseudo dinucleotide composition. Anal Biochem 2018; 561-562:59-65. [PMID: 30201554 DOI: 10.1016/j.ab.2018.09.002] [Citation(s) in RCA: 126] [Impact Index Per Article: 21.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/26/2018] [Revised: 08/31/2018] [Accepted: 09/03/2018] [Indexed: 01/28/2023]
Abstract
As a prevalent post-transcriptional modification, N6-methyladenosine (m6A) plays key roles in a series of biological processes. Although experimental technologies have been developed and applied to identify m6A sites, they are still cost-ineffective for transcriptome-wide detections of m6A. As good complements to the experimental techniques, some computational methods have been proposed to identify m6A sites. However, their performance remains unsatisfactory. In this study, we firstly proposed an Euclidean distance based method to construct a high quality benchmark dataset. By encoding the RNA sequences using pseudo nucleotide composition, a new predictor called iRNA(m6A)-PseDNC was developed to identify m6A sites in the Saccharomyces cerevisiae genome. It has been demonstrated by the 10-fold cross validation test that the performance of iRNA(m6A)-PseDNC is superior to the existing methods. Meanwhile, for the convenience of most experimental scientists, established at the site http://lin-group.cn/server/iRNA(m6A)-PseDNC.php is its web-server, by which users can easily get their desired results without need to go through the detailed mathematics. It is anticipated that iRNA(m6A)-PseDNC will become a useful high throughput tool for identifying m6A sites in the S. cerevisiae genome.
Collapse
Affiliation(s)
- Wei Chen
- School of Sciences, Center for Genomics and Computational Biology, North China University of Science and Technology, Tangshan, 063000, China; Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu, 611730, China; Gordon Life Science Institute, Boston, MA, 02478, USA.
| | - Hui Ding
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, 610054, China.
| | - Xu Zhou
- School of Sciences, Center for Genomics and Computational Biology, North China University of Science and Technology, Tangshan, 063000, China.
| | - Hao Lin
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, 610054, China; Gordon Life Science Institute, Boston, MA, 02478, USA.
| | - Kuo-Chen Chou
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, 610054, China; Gordon Life Science Institute, Boston, MA, 02478, USA.
| |
Collapse
|
40
|
Khan YD, Rasool N, Hussain W, Khan SA, Chou KC. iPhosT-PseAAC: Identify phosphothreonine sites by incorporating sequence statistical moments into PseAAC. Anal Biochem 2018; 550:109-116. [DOI: 10.1016/j.ab.2018.04.021] [Citation(s) in RCA: 72] [Impact Index Per Article: 12.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/28/2018] [Revised: 04/19/2018] [Accepted: 04/21/2018] [Indexed: 01/29/2023]
|
41
|
Zhang LQ, Li QZ. Estimating the effects of transcription factors binding and histone modifications on gene expression levels in human cells. Oncotarget 2018; 8:40090-40103. [PMID: 28454114 PMCID: PMC5522221 DOI: 10.18632/oncotarget.16988] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/15/2016] [Accepted: 03/11/2017] [Indexed: 12/22/2022] Open
Abstract
Transcription factors and histone modifications are vital for the regulation of gene expression. Hence, to estimate the effects of transcription factors binding and histone modifications on gene expression, we construct a statistical model for the genome-wide 15 transcription factors binding data, 10 histone modifications profiles and DNase-I hypersensitivity data in three mammalian. Remarkably, our results show POLR2A and H3K36me3 can highly and consistently predict gene expression in three cell lines. And H3K4me3, H3K27me3 and H3K9ac are more reliable predictors than other histone modifications in human embryonic stem cells. Moreover, genome-wide statistical redundancies exist within and between transcription factors and histone modifications, and these phenomena may be caused by the regulation mechanism. In further study, we find that even though transcription factors and histone modifications offer similar effects on expression levels of genome-wide genes, the effects of transcription factors and histone modifications on predictive abilities are different for genes in independent biological processes.
Collapse
Affiliation(s)
- Lu-Qiang Zhang
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, China
| | - Qian-Zhong Li
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, China
| |
Collapse
|
42
|
Prediction of the aquatic toxicity of aromatic compounds to tetrahymena pyriformis through support vector regression. Oncotarget 2018; 8:49359-49369. [PMID: 28467816 PMCID: PMC5564774 DOI: 10.18632/oncotarget.17210] [Citation(s) in RCA: 39] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/10/2017] [Accepted: 03/30/2017] [Indexed: 01/24/2023] Open
Abstract
Toxicity evaluation is an extremely important process during drug development. It is usually initiated by experiments on animals, which is time-consuming and costly. To speed up such a process, a quantitative structure-activity relationship (QSAR) study was performed to develop a computational model for correlating the structures of 581 aromatic compounds with their aquatic toxicity to tetrahymena pyriformis. A set of 68 molecular descriptors derived solely from the structures of the aromatic compounds were calculated based on Gaussian 03, HyperChem 7.5, and TSAR V3.3. A comprehensive feature selection method, minimum Redundancy Maximum Relevance (mRMR)-genetic algorithm (GA)-support vector regression (SVR) method, was applied to select the best descriptor subset in QSAR analysis. The SVR method was employed to model the toxicity potency from a training set of 500 compounds. Five-fold cross-validation method was used to optimize the parameters of SVR model. The new SVR model was tested on an independent dataset of 81 compounds. Both high internal consistent and external predictive rates were obtained, indicating the SVR model is very promising to become an effective tool for fast detecting the toxicity.
Collapse
|
43
|
iRNAm5C-PseDNC: identifying RNA 5-methylcytosine sites by incorporating physical-chemical properties into pseudo dinucleotide composition. Oncotarget 2018; 8:41178-41188. [PMID: 28476023 PMCID: PMC5522291 DOI: 10.18632/oncotarget.17104] [Citation(s) in RCA: 146] [Impact Index Per Article: 24.3] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/18/2017] [Accepted: 03/15/2017] [Indexed: 01/24/2023] Open
Abstract
Occurring at cytosine (C) of RNA, 5-methylcytosine (m5C) is an important post-transcriptional modification (PTCM). The modification plays significant roles in biological processes by regulating RNA metabolism in both eukaryotes and prokaryotes. It may also, however, cause cancers and other major diseases. Given an uncharacterized RNA sequence that contains many C residues, can we identify which one of them can be of m5C modification, and which one cannot? It is no doubt a crucial problem, particularly with the explosive growth of RNA sequences in the postgenomic age. Unfortunately, so far no user-friendly web-server whatsoever has been developed to address such a problem. To meet the increasingly high demand from most experimental scientists working in the area of drug development, we have developed a new predictor called iRNAm5C-PseDNC by incorporating ten types of physical-chemical properties into pseudo dinucleotide composition via the auto/cross-covariance approach. Rigorous jackknife tests show that its anticipated accuracy is quite high. For most experimental scientists’ convenience, a user-friendly web-server for the predictor has been provided at http://www.jci-bioinfo.cn/iRNAm5C-PseDNC along with a step-by-step user guide, by which users can easily obtain their desired results without the need to go through the complicated mathematical equations involved. It has not escaped our notice that the approach presented here can also be used to deal with many other problems in genome analysis.
Collapse
|
44
|
Chen W, Feng P, Yang H, Ding H, Lin H, Chou KC. iRNA-AI: identifying the adenosine to inosine editing sites in RNA sequences. Oncotarget 2018; 8:4208-4217. [PMID: 27926534 PMCID: PMC5354824 DOI: 10.18632/oncotarget.13758] [Citation(s) in RCA: 199] [Impact Index Per Article: 33.2] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/04/2016] [Accepted: 11/23/2016] [Indexed: 01/14/2023] Open
Abstract
Catalyzed by adenosine deaminase (ADAR), the adenosine to inosine (A-to-I) editing in RNA is not only involved in various important biological processes, but also closely associated with a series of major diseases. Therefore, knowledge about the A-to-I editing sites in RNA is crucially important for both basic research and drug development. Given an uncharacterized RNA sequence that contains many adenosine (A) residues, can we identify which one of them can be of A-to-I editing, and which one cannot? Unfortunately, so far no computational method whatsoever has been developed to address such an important problem based on the RNA sequence information alone. To fill this empty area, we have proposed a predictor called iRNA-AI by incorporating the chemical properties of nucleotides and their sliding occurrence density distribution along a RNA sequence into the general form of pseudo nucleotide composition (PseKNC). It has been shown by the rigorous jackknife test and independent dataset test that the performance of the proposed predictor is quite promising. For the convenience of most experimental scientists, a user-friendly web-server for iRNA-AI has been established at http://lin.uestc.edu.cn/server/iRNA-AI/, by which users can easily get their desired results without the need to go through the mathematical details.
Collapse
Affiliation(s)
- Wei Chen
- Department of Physics, School of Sciences, and Center for Genomics and Computational Biology, North China University of Science and Technology, Tangshan, Tangshan, China.,Gordon Life Science Institute, Belmont, Massachusetts, United States of America
| | - Pengmian Feng
- Hebei Province Key Laboratory of Occupational Health and Safety for Coal Industry, School of Public Health, North China University of Science and Technology, Tangshan, China
| | - Hui Yang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Hui Ding
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Hao Lin
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China.,Gordon Life Science Institute, Belmont, Massachusetts, United States of America
| | - Kuo-Chen Chou
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China.,Gordon Life Science Institute, Belmont, Massachusetts, United States of America
| |
Collapse
|
45
|
Mehrotra P, Ramakrishnan G, Dhandapani G, Srinivasan N, Madanan MG. Comparison of Leptospira interrogans and Leptospira biflexa genomes: analysis of potential leptospiral-host interactions. MOLECULAR BIOSYSTEMS 2018; 13:883-891. [PMID: 28294222 DOI: 10.1039/c6mb00856a] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/21/2022]
Abstract
Leptospirosis, a potentially life-threatening disease, remains the most widespread zoonosis caused by pathogenic species of Leptospira. The pathogenic spirochaete, Leptospira interrogans, is characterized by its ability to permeate human host tissues rapidly and colonize multiple organs in the host. In spite of the efforts taken to comprehend the pathophysiology of the pathogen and the heterogeneity posed by L. interrogans, the current knowledge on the mechanism of pathogenesis is modest. In an attempt to contribute towards the same, we demonstrate the use of an established structure-based protocol coupled with information on subcellular localization of proteins and their tissue-specificity, in recognizing a set of 49 biologically feasible interactions potentially mediated by proteins of L. interrogans in humans. We have also presented means to adjudge the physicochemical viability of the predicted host-pathogen interactions, for selected cases, in terms of interaction energies and geometric shape complementarity of the interacting proteins. Comparative analyses of proteins of L. interrogans and the saprophytic spirochaete, Leptospira biflexa, and their predicted involvement in interactions with human hosts, aided in underpinning the functional relevance of leptospiral-host protein-protein interactions specific to L. interrogans as well as those specific to L. biflexa. Our study presents characteristics of the pathogenic L. interrogans that are predicted to facilitate its ability to persist in human hosts.
Collapse
Affiliation(s)
- Prachi Mehrotra
- Indian Institute of Science Mathematics Initiative, Indian Institute of Science, Bangalore 560012, India
| | | | | | | | | |
Collapse
|
46
|
pLoc-mEuk: Predict subcellular localization of multi-label eukaryotic proteins by extracting the key GO information into general PseAAC. Genomics 2018; 110:50-58. [DOI: 10.1016/j.ygeno.2017.08.005] [Citation(s) in RCA: 180] [Impact Index Per Article: 30.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/06/2017] [Revised: 08/10/2017] [Accepted: 08/11/2017] [Indexed: 11/22/2022]
|
47
|
Abstract
BACKGROUND Building the evolutionary trees for massive unaligned DNA sequences is challenging and crucial. However, reconstructing evolutionary tree for ultra-large sequences is hard. Massive multiple sequence alignment is also challenging and time/space consuming. Hadoop and Spark are developed recently, which bring spring light for the classical computational biology problems. In this paper, we tried to solve the multiple sequence alignment and evolutionary reconstruction in parallel. RESULTS HPTree, which is developed in this paper, can deal with big DNA sequence files quickly. It works well on the >1GB files, and gets better performance than other evolutionary reconstruction tools. Users could use HPTree for reonstructing evolutioanry trees on the computer clusters or cloud platform (eg. Amazon Cloud). HPTree could help on population evolution research and metagenomics analysis. CONCLUSIONS In this paper, we employ the Hadoop and Spark platform and design an evolutionary tree reconstruction software tool for unaligned massive DNA sequences. Clustering and multiple sequence alignment are done in parallel. Neighbour-joining model was employed for the evolutionary tree building. We opened our software together with source codes via http://lab.malab.cn/soft/HPtree/ .
Collapse
Affiliation(s)
- Quan Zou
- School of Computer Science and Technology, Tianjin University, Tianjin, People's Republic of China
- Guangdong Province Key Laboratory of Popular High Performance Computers, Shenzhen University, Shenzhen, China
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, China
| | - Shixiang Wan
- School of Computer Science and Technology, Tianjin University, Tianjin, People's Republic of China
| | - Xiangxiang Zeng
- Department of Computer Science, Xiamen University, Xiamen, China.
| | - Zhanshan Sam Ma
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, China.
| |
Collapse
|
48
|
Cheng X, Xiao X, Chou KC. pLoc-mHum: predict subcellular localization of multi-location human proteins via general PseAAC to winnow out the crucial GO information. Bioinformatics 2017; 34:1448-1456. [DOI: 10.1093/bioinformatics/btx711] [Citation(s) in RCA: 127] [Impact Index Per Article: 18.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/28/2017] [Accepted: 10/31/2017] [Indexed: 01/19/2023] Open
Affiliation(s)
- Xiang Cheng
- Computer Science, Jingdezhen Ceramic Institute, Jingdezhen, China
- Computational Biology, Gordon Life Science Institute, Boston, MA, USA
| | - Xuan Xiao
- Computer Science, Jingdezhen Ceramic Institute, Jingdezhen, China
- Computational Biology, Gordon Life Science Institute, Boston, MA, USA
| | - Kuo-Chen Chou
- Computer Science, Jingdezhen Ceramic Institute, Jingdezhen, China
- Computational Biology, Gordon Life Science Institute, Boston, MA, USA
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| |
Collapse
|
49
|
Li S, Chen J, Liu B. Protein remote homology detection based on bidirectional long short-term memory. BMC Bioinformatics 2017; 18:443. [PMID: 29017445 PMCID: PMC5634958 DOI: 10.1186/s12859-017-1842-2] [Citation(s) in RCA: 41] [Impact Index Per Article: 5.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/19/2017] [Accepted: 09/21/2017] [Indexed: 01/05/2023] Open
Abstract
BACKGROUND Protein remote homology detection plays a vital role in studies of protein structures and functions. Almost all of the traditional machine leaning methods require fixed length features to represent the protein sequences. However, it is never an easy task to extract the discriminative features with limited knowledge of proteins. On the other hand, deep learning technique has demonstrated its advantage in automatically learning representations. It is worthwhile to explore the applications of deep learning techniques to the protein remote homology detection. RESULTS In this study, we employ the Bidirectional Long Short-Term Memory (BLSTM) to learn effective features from pseudo proteins, also propose a predictor called ProDec-BLSTM: it includes input layer, bidirectional LSTM, time distributed dense layer and output layer. This neural network can automatically extract the discriminative features by using bidirectional LSTM and the time distributed dense layer. CONCLUSION Experimental results on a widely-used benchmark dataset show that ProDec-BLSTM outperforms other related methods in terms of both the mean ROC and mean ROC50 scores. This promising result shows that ProDec-BLSTM is a useful tool for protein remote homology detection. Furthermore, the hidden patterns learnt by ProDec-BLSTM can be interpreted and visualized, and therefore, additional useful information can be obtained.
Collapse
Affiliation(s)
- Shumin Li
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, HIT Campus Shenzhen University Town, Xili, Shenzhen, 518055, China
| | - Junjie Chen
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, HIT Campus Shenzhen University Town, Xili, Shenzhen, 518055, China
| | - Bin Liu
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, HIT Campus Shenzhen University Town, Xili, Shenzhen, 518055, China.
| |
Collapse
|
50
|
pLoc-mVirus: Predict subcellular localization of multi-location virus proteins via incorporating the optimal GO information into general PseAAC. Gene 2017; 628:315-321. [DOI: 10.1016/j.gene.2017.07.036] [Citation(s) in RCA: 135] [Impact Index Per Article: 19.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/21/2017] [Revised: 07/08/2017] [Accepted: 07/11/2017] [Indexed: 12/25/2022]
|