1
Dixson JD, Azad RK. Physicochemical Evaluation of Remote Homology in the Twilight Zone. Proteins 2024. [PMID: 39219099 DOI: 10.1002/prot.26742] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/05/2024] [Accepted: 08/13/2024] [Indexed: 09/04/2024]
Abstract
A fundamental problem in the field of protein evolutionary biology is determining the degree and nature of evolutionary relatedness among homologous proteins that have diverged to a point where they share less than 30% amino acid identity yet retain similar structures and/or functions. Such proteins are said to lie within the "Twilight Zone" of amino acid identity. Many researchers have leveraged experimentally determined structures in the quest to classify proteins in the Twilight Zone. Such endeavors can be highly time consuming and prohibitively expensive for large-scale analyses. Motivated by this problem, here we use molecular weight-hydrophobicity physicochemical dynamic time warping (MWHP DTW) to quantify similarity of simulated and real-world homologous protein domains. MWHP DTW is a physicochemical method requiring only the amino acid sequence to quantify similarity of related proteins and is particularly useful in determining similarity within the Twilight Zone due to its resilience to primary sequence substitution saturation. This is a step forward in determination of the relatedness among Twilight Zone proteins and most notably allows for the discrimination of random similarity and true homology in the 0%-20% identity range. This method was previously presented expeditiously just after the outbreak of COVID-19 because it was able to functionally cluster ACE2-binding betacoronavirus receptor binding domains (RBDs), a task that has been elusive using standard techniques. Here we show that one reason that MWHP DTW is an effective technique for comparisons within the Twilight Zone is because it can uncover hidden homology by exploiting physicochemical conservation, a problem that protein sequence alignment algorithms are inherently incapable of addressing within the Twilight Zone. Further, we present an extended definition of the Twilight Zone that incorporates the dynamic relationship between structural, physicochemical, and sequence-based metrics.
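As a rough illustration of the dynamic time warping idea named in this abstract, the sketch below encodes each sequence as a numeric physicochemical profile and computes a classic DTW distance between two profiles. It is not the authors' MWHP DTW implementation: the abstract does not describe the encoding, so the use of a single Kyte-Doolittle hydropathy scale, the absolute-difference local cost, and the function names `encode` and `dtw_distance` are all assumptions made for this example.

```python
# Kyte-Doolittle hydropathy values, used here only as an illustrative scale.
# The actual MWHP encoding (molecular weight plus hydrophobicity) is not
# specified in the abstract and is NOT reproduced here.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(seq):
    """Map an amino acid sequence to a list of hydropathy values (hypothetical helper)."""
    return [KD[residue] for residue in seq.upper()]


def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance between two numeric series."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] also aligned with an earlier element of b
                cost[i][j - 1],      # b[j-1] also aligned with an earlier element of a
                cost[i - 1][j - 1],  # one-to-one step
            )
    return cost[n][m]


if __name__ == "__main__":
    # Two short, made-up peptide strings; smaller values mean more similar profiles.
    print(dtw_distance(encode("MKTAYIAK"), encode("MKSAYLAK")))
```

Because DTW compares profiles rather than residue identities, two sequences that share few identical residues can still score as similar when the substituted residues have similar physicochemical values, which is the general property the abstract relies on.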
Affiliation(s)
- Jamie Dennis Dixson
  - Department of Biological Sciences, University of North Texas, Denton, Texas, USA
- Rajeev Kumar Azad
  - Department of Biological Sciences, University of North Texas, Denton, Texas, USA
  - BioDiscovery Institute, University of North Texas, Denton, Texas, USA
2
Kumar R, Yadav G, Kuddus M, Ashraf GM, Singh R. Unlocking the microbial studies through computational approaches: how far have we reached? Environmental Science and Pollution Research International 2023; 30:48929-48947. [PMID: 36920617 PMCID: PMC10016191 DOI: 10.1007/s11356-023-26220-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 03/23/2022] [Accepted: 02/24/2023] [Indexed: 04/16/2023]
Abstract
The metagenomics approach accelerated the study of genetic information from uncultured microbes and complex microbial communities. In silico research also facilitated an understanding of protein-DNA interactions, protein-protein interactions, docking between proteins and phyto/biochemicals for drug design, and modeling of the 3D structure of proteins. These in silico approaches provided insight into analyzing pathogenic and nonpathogenic strains that helped in the identification of probable genes for vaccines and antimicrobial agents and in comparing whole-genome sequences to study microbial evolution. Artificial intelligence, more precisely machine learning (ML) and deep learning (DL), has proven to be a promising approach in the field of microbiology to handle, analyze, and utilize the large data sets that are generated through nucleic acid sequencing and proteomics. This enabled the understanding of the functional and taxonomic diversity of microorganisms. ML and DL have been used in the prediction and forecasting of diseases and applied to trace environmental contaminants and environmental quality. This review presents an in-depth analysis of recent applications of in silico approaches in microbial genomics, proteomics, functional diversity, vaccine development, and drug design.
Affiliation(s)
- Rajnish Kumar
  - Amity Institute of Biotechnology, Amity University Uttar Pradesh Lucknow Campus, Lucknow, Uttar Pradesh, India
  - Department of Veterinary Medicine and Surgery, College of Veterinary Medicine, University of Missouri, Columbia, MO, USA
- Garima Yadav
  - Amity Institute of Biotechnology, Amity University Uttar Pradesh Lucknow Campus, Lucknow, Uttar Pradesh, India
- Mohammed Kuddus
  - Department of Biochemistry, College of Medicine, University of Hail, Hail, Saudi Arabia
- Ghulam Md Ashraf
  - Department of Medical Laboratory Sciences, College of Health Sciences, and Sharjah Institute for Medical Research, University of Sharjah, Sharjah, 27272, United Arab Emirates
- Rachana Singh
  - Amity Institute of Biotechnology, Amity University Uttar Pradesh Lucknow Campus, Lucknow, Uttar Pradesh, India
3
Li F, Guo X, Xiang D, Pitt ME, Bainomugisa A, Coin LJ. Computational analysis and prediction of PE_PGRS proteins using machine learning. Comput Struct Biotechnol J 2022; 20:662-674. [PMID: 35140886 PMCID: PMC8804200 DOI: 10.1016/j.csbj.2022.01.019] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 10/21/2021] [Revised: 01/09/2022] [Accepted: 01/18/2022] [Indexed: 12/18/2022]
Abstract
Approximately 10% of the Mycobacterium tuberculosis genome comprises two families of genes that are poorly characterised due to their high GC content and highly repetitive nature. The largest sub-group, the proline-glutamic acid polymorphic guanine-cytosine-rich sequence (PE_PGRS) family, is thought to be involved in host response and disease pathogenicity. Due to their high genetic variability and complexity of analysis, they are typically disregarded for further research in genomic studies. There are currently limited online resources and homology-based computational tools that can identify and analyse PE_PGRS proteins. In addition, they are computationally intensive and time-consuming, and lack sensitivity. Therefore, computational methods that can rapidly and accurately identify PE_PGRS proteins are valuable to facilitate the functional elucidation of the PE_PGRS family proteins. In this study, we developed the first machine learning-based bioinformatics approach, termed PEPPER, to allow users to identify PE_PGRS proteins rapidly and accurately. PEPPER was built upon a comprehensive evaluation of 13 popular machine learning algorithms with various sequence and physicochemical features. Empirical studies demonstrated that PEPPER achieved significantly better performance than alignment-based approaches, BLASTP and PHMMER, in both prediction accuracy and speed. PEPPER is anticipated to facilitate community-wide efforts to conduct high-throughput identification and analysis of PE_PGRS proteins.
Affiliation(s)
- Fuyi Li
  - Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, VIC 3000, Australia
- Xudong Guo
  - School of Information Engineering, Ningxia University, Yinchuan, Ningxia 750021, China
- Dongxu Xiang
  - Faculty of Engineering and Information Technology, The University of Melbourne, VIC 3000, Australia
- Miranda E. Pitt
  - Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, VIC 3000, Australia
- Lachlan J.M. Coin
  - Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, VIC 3000, Australia
4
Qian Y, Meng H, Lu W, Liao Z, Ding Y, Wu H. Identification of DNA-Binding Proteins via Hypergraph Based Laplacian Support Vector Machine. Curr Bioinform 2022. [DOI: 10.2174/1574893616666210806091922] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Indexed: 11/22/2022]
Abstract
Background:
The identification of DNA-binding proteins (DBPs) is an important research field. Experiment-based methods for detecting DBPs are time-consuming and labor-intensive.
Objective:
To solve the problem of large-scale DBP identification, some machine learning methods have been proposed. However, these methods have insufficient predictive accuracy. Our aim is to develop a sequence-based machine learning model to predict DBPs.
Methods:
In our study, we extracted six types of features (NMBAC, GE, MCD, PSSM-AB, PSSM-DWT, and PsePSSM) from protein sequences. We used Multiple Kernel Learning based on the Hilbert-Schmidt Independence Criterion (MKL-HSIC) to estimate the optimal kernel. Then, we constructed a hypergraph model to describe the relationship between labeled and unlabeled samples. Finally, a Laplacian Support Vector Machine (LapSVM) was employed to train the predictive model. Our method was tested on the PDB186, PDB1075, PDB2272 and PDB14189 data sets.
Result:
Compared with other methods, our model achieved the best results on benchmark data sets.
Conclusion:
Accuracies of 87.1% and 74.2% were achieved on PDB186 (independent test of PDB1075) and PDB2272 (independent test of PDB14189), respectively.
Affiliation(s)
- Yuqing Qian
  - School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, P.R. China
- Hao Meng
  - School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, P.R. China
- Weizhong Lu
  - School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, P.R. China
- Zhijun Liao
  - Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Fujian Medical University, Fuzhou, P.R. China
- Yijie Ding
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, P.R. China
- Hongjie Wu
  - School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, P.R. China
5
Guo X, Zhou W, Yu Y, Cai Y, Zhang Y, Du A, Lu Q, Ding Y, Li C. Multiple Laplacian Regularized RBF Neural Network for Assessing Dry Weight of Patients With End-Stage Renal Disease. Front Physiol 2021; 12:790086. [PMID: 34966294 PMCID: PMC8711098 DOI: 10.3389/fphys.2021.790086] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/25/2021] [Accepted: 11/17/2021] [Indexed: 11/28/2022]
Abstract
Dry weight (DW) is an important dialysis index for patients with end-stage renal disease and can guide clinical hemodialysis. Brain natriuretic peptide, chest computed tomography images, ultrasound, and bioelectrical impedance analysis are key indicators (multisource information) for assessing DW. With these approaches, DW is assessed by a trial-and-error method (the traditional measurement method), and assessment by a clinician is time-consuming. In this study, we developed a method based on artificial intelligence technology to estimate patient DW. Based on the conventional radial basis function neural network (RBFN), we propose a multiple Laplacian-regularized RBFN (MLapRBFN) model to predict patient DW. Compared with other models and a body composition monitor, our method achieves the lowest root mean square error (1.3226). In Bland-Altman analysis, MLapRBFN has the fewest samples outside the agreement interval (17 samples). MLapRBFN integrates multiple Laplacian regularization terms and employs an efficient iterative algorithm to solve the model. The proportion of samples outside the agreement interval is 3.57%, which is lower than 5%. Therefore, our method can be tentatively applied for clinical evaluation of DW in hemodialysis patients.
Affiliation(s)
- Xiaoyi Guo
  - Hemodialysis Center, The Affiliated Wuxi People's Hospital of Nanjing Medical University, Wuxi, China
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Wei Zhou
  - Hemodialysis Center, The Affiliated Wuxi People's Hospital of Nanjing Medical University, Wuxi, China
- Yan Yu
  - Hemodialysis Center, The Affiliated Wuxi People's Hospital of Nanjing Medical University, Wuxi, China
- Yinghua Cai
  - Department of Nursing, The Affiliated Wuxi People's Hospital of Nanjing Medical University, Wuxi, China
- Yuan Zhang
  - Hemodialysis Center, The Affiliated Wuxi People's Hospital of Nanjing Medical University, Wuxi, China
- Aiyan Du
  - Hemodialysis Center, The Affiliated Wuxi People's Hospital of Nanjing Medical University, Wuxi, China
- Qun Lu
  - Department of Nursing, The Affiliated Wuxi People's Hospital of Nanjing Medical University, Wuxi, China
- Yijie Ding
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Chao Li
  - General Hospital of Heilongjiang Province Land Reclamation Bureau, Harbin, China
6
Yan K, Wen J, Xu Y, Liu B. Protein Fold Recognition Based on Auto-Weighted Multi-View Graph Embedding Learning Model. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:2682-2691. [PMID: 32356759 DOI: 10.1109/tcbb.2020.2991268] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 06/11/2023]
Abstract
Protein fold recognition is critical for studies of protein structure prediction and drug design. Several methods have been proposed to obtain discriminative features from protein sequences for fold recognition. However, building ensemble methods that combine the various features to improve predictive performance remains a challenging problem. In this study, we proposed two novel algorithms: AWMG and EMfold. AWMG used a novel predictor based on the multi-view learning framework for fold recognition. Each view was treated as the intermediate representation of the corresponding data source of proteins, including the evolutionary information and the retrieval information. AWMG calculated the auto-weight for each view and constructed the latent subspace, which contains the common information shared by the different views. The marginalized constraint was employed to enlarge the margins between different folds, improving the predictive performance of AWMG. Furthermore, we proposed a novel ensemble method called EMfold, which combines two complementary methods, AWMG and DeepSS. The latter method was a template-based algorithm using the SPARKS-X and DeepFR programs. EMfold integrated the advantages of template-based assignment and a machine learning classifier. Experimental results on two widely used datasets (LE and YK) showed that the proposed methods outperformed some state-of-the-art methods, indicating that AWMG and EMfold are useful tools for protein fold recognition.
7
Zhu Q, Yang J, Xu B, Hou Z, Sun L, Zhang D. Multimodal Brain Network Jointly Construction and Fusion for Diagnosis of Epilepsy. Front Neurosci 2021; 15:734711. [PMID: 34658773 PMCID: PMC8511490 DOI: 10.3389/fnins.2021.734711] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2021] [Accepted: 08/10/2021] [Indexed: 11/24/2022]
Abstract
Brain network analysis has proved to be one of the most effective methods in brain disease diagnosis. In order to construct discriminative brain networks and improve the performance of disease diagnosis, many machine learning-based methods have been proposed. Recent studies show that combining functional and structural brain networks is more effective than using only single-modality data. However, most existing multi-modal brain network analysis methods construct the functional and structural networks separately, which makes it difficult to embed the complementary information of the different modalities. To address this issue, we propose a unified brain network construction algorithm, which jointly learns from both functional and structural data and effectively fuses the connectivity and node features to improve classification. First, we conduct space alignment and brain network construction under a unified framework, and then build the correlation model among all brain regions with functional data by low-rank representation so that the global brain region correlation can be captured. Simultaneously, the local manifold with structural data is embedded into this model to preserve the local structural information. Second, the PageRank algorithm is adaptively used to evaluate the significance of different brain regions, in which the interaction of multiple brain regions is considered. Finally, a multi-kernel strategy is utilized to solve the data heterogeneity problem and merge the connectivity as well as node information for classification. We apply the proposed method to the diagnosis of epilepsy, and the experimental results show that our method can achieve a promising performance.
Affiliation(s)
- Qi Zhu
  - College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China
- Jing Yang
  - College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China
- Bingliang Xu
  - College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China
- Zhenghua Hou
  - Department of Psychosomatics and Psychiatry, Affiliated Zhongda Hospital, School of Medicine, Southeast University, Nanjing, China
- Liang Sun
  - College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China
- Daoqiang Zhang
  - College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China
8
Yan K, Wen J, Liu JX, Xu Y, Liu B. Protein Fold Recognition by Combining Support Vector Machines and Pairwise Sequence Similarity Scores. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:2008-2016. [PMID: 31940548 DOI: 10.1109/tcbb.2020.2966450] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Indexed: 06/10/2023]
Abstract
Protein fold recognition is one of the most essential steps for protein structure prediction, aiming to classify proteins into known protein folds. There are two main computational approaches: one is the template-based method based on the alignment scores between query-template protein pairs and the other is the machine learning method based on the feature representation and classifier. These two approaches have their own advantages and disadvantages. Can we combine these methods to establish more accurate predictors for protein fold recognition? In this study, we made an initial attempt and proposed two novel algorithms: TSVM-fold and ESVM-fold. TSVM-fold was based on the Support Vector Machines (SVMs), which utilizes a set of pairwise sequence similarity scores generated by three complementary template-based methods, including HHblits, SPARKS-X, and DeepFR. These scores measured the global relationships between query sequences and templates. The comprehensive features of the attributes of the sequences were fed into the SVMs for the prediction. Then the TSVM-fold was further combined with the HHblits algorithm so as to improve its generalization ability. The combined method is called ESVM-fold. Experimental results in two rigorous benchmark datasets (LE and YK datasets) showed that the proposed methods outperform some state-of-the-art methods, indicating that the TSVM-fold and ESVM-fold are efficient predictors for protein fold recognition.
9
Shao J, Chen J, Liu B. ProtRe-CN: Protein Remote Homology Detection by Combining Classification Methods and Network Methods via Learning to Rank. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; PP:1-1. [PMID: 34460380 DOI: 10.1109/tcbb.2021.3108168] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/13/2023]
Abstract
Protein remote homology detection is one of the fundamental research tasks for downstream analysis (i.e., protein structure and function prediction). Many advanced methods with complementary detection abilities have been proposed from different perspectives, such as the classification method, the network method, and the ranking method. A framework integrating these heterogeneous methods is urgently needed to reduce the false positive rate and predictive bias. We propose a novel ranking method called ProtRe-CN by fusing the classification methods and network methods via Learning to Rank. Experimental results on the benchmark dataset and the independent dataset show that ProtRe-CN outperforms other existing state-of-the-art predictors. ProtRe-CN improves detection performance by combining the heterogeneous methods to correct the false positives in the ranking list. The web server of ProtRe-CN can be accessed at http://bliulab.net/ProtRe-CN.
10
Su R, Hu J, Zou Q, Manavalan B, Wei L. Empirical comparison and analysis of web-based cell-penetrating peptide prediction tools. Brief Bioinform 2021; 21:408-420. [PMID: 30649170 DOI: 10.1093/bib/bby124] [Citation(s) in RCA: 107] [Impact Index Per Article: 35.7] [Received: 10/17/2018] [Revised: 11/30/2018] [Accepted: 11/30/2018] [Indexed: 12/16/2022]
Abstract
Cell-penetrating peptides (CPPs) facilitate the delivery of therapeutically relevant molecules, including DNA, proteins and oligonucleotides, into cells both in vitro and in vivo. This unique ability opens up the possibility of using CPPs for therapeutic delivery, with potential applications in clinical therapy. Over the last few decades, a number of machine learning (ML)-based prediction tools have been developed, and some of them are freely available as web portals. However, the predictions produced by various tools are difficult to quantify and compare. In particular, there is no systematic comparison of the performance of the web-based prediction tools, especially in practical applications. In this work, we provide a comprehensive review of the biological importance of CPPs, CPP databases and existing ML-based methods for CPP prediction. To evaluate current prediction tools, we conducted a comparative study and analyzed a total of 12 models from 6 publicly available CPP prediction tools on 2 benchmark validation sets of CPPs and non-CPPs. Our benchmarking results demonstrated that a model from KELM-CPPpred, namely KELM-hybrid-AAC, showed a significant improvement in overall performance when compared to the other 11 prediction models. Moreover, through a length-dependency analysis, we find that existing prediction tools tend to predict CPPs and non-CPPs with lengths of 20-25 residues more accurately than peptides in other length ranges.
Affiliation(s)
- Ran Su
  - College of Intelligence and Computing, Tianjin University, Tianjin, China
- Jie Hu
  - College of Intelligence and Computing, Tianjin University, Tianjin, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Leyi Wei
  - College of Intelligence and Computing, Tianjin University, Tianjin, China
11
Shen Z, Liu T, Xu T. Accurate Identification of Antioxidant Proteins Based on a Combination of Machine Learning Techniques and Hidden Markov Model Profiles. Computational and Mathematical Methods in Medicine 2021; 2021:5770981. [PMID: 34413898 PMCID: PMC8369162 DOI: 10.1155/2021/5770981] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/21/2021] [Revised: 07/15/2021] [Accepted: 07/26/2021] [Indexed: 01/19/2023]
Abstract
Antioxidant proteins (AOPs) play important roles in the management and prevention of several human diseases due to their ability to neutralize excess free radicals. However, the identification of AOPs by using wet-lab experimental techniques is often time-consuming and expensive. In this study, we proposed an accurate computational model, called AOP-HMM, to predict AOPs by extracting discriminatory evolutionary features from hidden Markov model (HMM) profiles. First, auto cross-covariance (ACC) variables were applied to transform the HMM profiles into fixed-length feature vectors. Then, we performed the analysis of variance (ANOVA) method to reduce the dimensionality of the raw feature space. Finally, a support vector machine (SVM) classifier was adopted to conduct the prediction of AOPs. To comprehensively evaluate the performance of the proposed AOP-HMM model, the 10-fold cross-validation (CV), the jackknife CV, and the independent test were carried out on two widely used benchmark datasets. The experimental results demonstrated that AOP-HMM outperformed most of the existing methods and could be used to quickly annotate AOPs and guide the experimental process.
Affiliation(s)
- Zhehan Shen
  - College of Information Technology, Shanghai Ocean University, Shanghai 201306, China
- Taigang Liu
  - College of Information Technology, Shanghai Ocean University, Shanghai 201306, China
- Ting Xu
  - College of Information Technology, Shanghai Ocean University, Shanghai 201306, China
12
Li Y, Pu F, Wang J, Zhou Z, Zhang C, He F, Ma Z, Zhang J. Machine Learning Methods in Prediction of Protein Palmitoylation Sites: A Brief Review. Curr Pharm Des 2021; 27:2189-2198. [PMID: 33183190 DOI: 10.2174/1381612826666201112142826] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 05/13/2020] [Accepted: 07/27/2020] [Indexed: 11/22/2022]
Abstract
Protein palmitoylation is a fundamental and reversible post-translational lipid modification that involves a series of biological processes. Although a large number of experimental studies have explored the molecular mechanism behind the palmitoylation process, computational methods have attracted much attention for their good performance in predicting palmitoylation sites compared with expensive and time-consuming biochemical experiments. The prediction of protein palmitoylation sites helps to reveal the underlying biological mechanism. Therefore, research on the application of machine learning methods to predict palmitoylation sites has become a hot topic in bioinformatics and has promoted development in related fields. In this review, we briefly introduce recent developments in predicting protein palmitoylation sites using machine learning-based methods and discuss their benefits and drawbacks. A perspective on machine learning-based methods for predicting palmitoylation sites is also provided. We hope the review can serve as a guide in related fields.
Affiliation(s)
- Yanwen Li
  - School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
- Feng Pu
  - School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
- Jingru Wang
  - School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
- Zhiguo Zhou
  - School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
- Chunhua Zhang
  - School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
- Fei He
  - School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
- Zhiqiang Ma
  - School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
- Jingbo Zhang
  - School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
13
Zhang J, Chen Q, Liu B. DeepDRBP-2L: A New Genome Annotation Predictor for Identifying DNA-Binding Proteins and RNA-Binding Proteins Using Convolutional Neural Network and Long Short-Term Memory. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:1451-1463. [PMID: 31722485 DOI: 10.1109/tcbb.2019.2952338] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Indexed: 06/10/2023]
Abstract
DNA-binding proteins (DBPs) and RNA-binding proteins (RBPs) are two kinds of crucial proteins, which are associated with various cellular activities and some important diseases. Accurate identification of DBPs and RBPs facilitates both theoretical research and real-world applications. Existing sequence-based DBP predictors can accurately identify DBPs but incorrectly predict many RBPs as DBPs, and vice versa, resulting in low prediction precision. Moreover, some proteins (DRBPs) interacting with both DNA and RNA play important roles in gene expression and cannot be identified by existing computational methods. In this study, a two-level predictor named DeepDRBP-2L was proposed by combining a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM). It is the first computational method that is able to identify DBPs, RBPs and DRBPs. Rigorous cross-validations and independent tests showed that DeepDRBP-2L is able to overcome the shortcomings of the existing methods and can go one step further to identify DRBPs. Application of DeepDRBP-2L to the tomato genome further demonstrated its performance. The webserver of DeepDRBP-2L is freely available at http://bliulab.net/DeepDRBP-2L.
14
Jin X, Liao Q, Liu B. S2L-PSIBLAST: a supervised two-layer search framework based on PSI-BLAST for protein remote homology detection. Bioinformatics 2021; 37:4321-4327. [PMID: 34170287 DOI: 10.1093/bioinformatics/btab472] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 12/21/2020] [Revised: 05/29/2021] [Accepted: 06/24/2021] [Indexed: 01/26/2023]
Abstract
MOTIVATION: Protein remote homology detection is a challenging task in the study of protein evolutionary relationships. PSI-BLAST is an important and fundamental search method for detecting homologous proteins. Although many improved versions of PSI-BLAST have been proposed, their performance is limited by the search processes of PSI-BLAST. RESULTS: To further improve the performance of PSI-BLAST for protein remote homology detection, a supervised two-layer search framework based on PSI-BLAST (S2L-PSIBLAST) is proposed. S2L-PSIBLAST consists of a two-level search: the first-level search provides high-quality search results by using the SMI-BLAST framework and a double-link strategy to filter out non-homologous protein sequences; the second-level search detects more homologous proteins by profile-link similarity, and more accurate ranking lists for the detected protein sequences are obtained by a learning to rank strategy. Experimental results on the updated version of the Structural Classification of Proteins-extended benchmark dataset show that S2L-PSIBLAST not only clearly improves the performance of PSI-BLAST, but also achieves better performance than two improved versions of PSI-BLAST: DELTA-BLAST and PSI-BLASTexB. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Xiaopeng Jin: School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
- Qing Liao: School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
- Bin Liu: School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China; School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China; Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China
|
15
|
Zou Y, Wu H, Guo X, Peng L, Ding Y, Tang J, Guo F. MK-FSVM-SVDD: A Multiple Kernel-based Fuzzy SVM Model for Predicting DNA-binding Proteins via Support Vector Data Description. Curr Bioinform 2021. [DOI: 10.2174/1574893615999200607173829] [Citation(s) in RCA: 67] [Impact Index Per Article: 22.3] [Indexed: 11/22/2022]
Abstract
Background:
Detecting DNA-binding proteins (DBPs) with biological and chemical methods is time-consuming and expensive.
Objective:
In recent years, the rise of computational biology methods based on Machine Learning (ML) has greatly improved the detection efficiency of DBPs.
Method:
In this study, the Multiple Kernel-based Fuzzy SVM Model with Support Vector Data Description (MK-FSVM-SVDD) is proposed to predict DBPs. First, six features are extracted from the protein sequence. Second, multiple kernels are constructed from these sequence features. Then, the kernels are integrated by Centered Kernel Alignment-based Multiple Kernel Learning (CKA-MKL). Next, fuzzy membership scores of the training samples are calculated with Support Vector Data Description (SVDD). Finally, an FSVM is trained and employed to detect new DBPs.
Results:
Our model is evaluated on several benchmark datasets. Compared with other methods, MK-FSVM-SVDD achieves the best Matthews Correlation Coefficient (MCC) on PDB186 (0.7250) and PDB2272 (0.5476).
Conclusion:
We conclude that MK-FSVM-SVDD is more suitable than a common SVM as the classifier for identifying DNA-binding proteins.
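The kernel-integration step in this abstract relies on centered kernel alignment. As a rough illustration only, the sketch below computes the centered alignment between two kernel matrices in plain Python. The function names and the toy labels are assumptions made for this example; the abstract does not describe the authors' implementation, and a full CKA-MKL procedure would additionally solve for kernel weights.

```python
import math

def center_kernel(K):
    """Return the centered kernel HKH, where H = I - (1/n) * ones."""
    n = len(K)
    row_means = [sum(row) / n for row in K]
    col_means = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_means) / n
    return [[K[i][j] - row_means[i] - col_means[j] + total_mean
             for j in range(n)] for i in range(n)]

def frobenius_inner(A, B):
    """Frobenius inner product of two equally sized matrices."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def centered_alignment(K1, K2):
    """Centered alignment: <K1c, K2c>_F / (||K1c||_F * ||K2c||_F)."""
    K1c, K2c = center_kernel(K1), center_kernel(K2)
    denom = math.sqrt(frobenius_inner(K1c, K1c) * frobenius_inner(K2c, K2c))
    return frobenius_inner(K1c, K2c) / denom if denom else 0.0

def ideal_kernel(labels):
    """Target kernel y * y^T built from +1/-1 class labels."""
    return [[float(a * b) for b in labels] for a in labels]

# Hypothetical toy data: four samples, two per class.
labels = [1, 1, -1, -1]
target = ideal_kernel(labels)
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
print(centered_alignment(target, target))    # a kernel aligns perfectly with itself
print(centered_alignment(identity, target))  # an uninformative kernel aligns less well
```

A kernel whose alignment with the label-derived target is higher would typically receive a larger weight when the kernels are combined.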
Affiliation(s)
- Yi Zou: School of Internet of Things Engineering, Jiangnan University, Wuxi, 214122, China
- Hongjie Wu: School of Electronic and Information Engineering, Suzhou University of Science and Technology, No. 1 Kerui Road, 215009, Suzhou, China
- Xiaoyi Guo: Hemodialysis Center, The Affiliated Wuxi People's Hospital of Nanjing Medical University, 214000, Wuxi, China
- Li Peng: School of Internet of Things Engineering, Jiangnan University, Wuxi, 214122, China
- Yijie Ding: School of Electronic and Information Engineering, Suzhou University of Science and Technology, No. 1 Kerui Road, 215009, Suzhou, China
- Jijun Tang: School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, 300350, Tianjin, China
- Fei Guo: School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, 300350, Tianjin, China
|
16
|
Guo X, Zhou W, Shi B, Wang X, Du A, Ding Y, Tang J, Guo F. An Efficient Multiple Kernel Support Vector Regression Model for Assessing Dry Weight of Hemodialysis Patients. Curr Bioinform 2021. [DOI: 10.2174/1574893615999200614172536] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Indexed: 11/22/2022]
Abstract
Background:
Dry Weight (DW) is the lowest weight after dialysis, and patients with lower weight usually have symptoms of hypotension and shock. Several clinical approaches have been presented to assess the dry weight of hemodialysis patients. However, these traditional methods all depend on special instruments and professional technicians.
Objective:
To avoid this limitation, we need a machine-independent way to assess dry weight. We therefore collected clinical data on influencing characteristics and constructed a Machine Learning-based (ML) model to predict the dry weight of hemodialysis patients.
Methods:
In this paper, demographic data, anthropometric measurements, and Bioimpedance spectroscopy (BIS) data were collected for 476 hemodialysis patients. Among these, the patients' age, sex, Body Mass Index (BMI), Blood Pressure (BP), Heart Rate (HR) and Years of Dialysis (YD) were closely related to their dry weight. All of these relevant data were entered into the regression equation. A Multiple Kernel Support Vector Regression model based on Maximizing the Average Similarity (MKSVRMAS) was proposed to predict the dry weight of hemodialysis patients.
Result:
The experimental results show that dry weight is positively correlated with BMI and HR, and negatively correlated with age, sex, systolic blood pressure, diastolic blood pressure and hemodialysis time. Moreover, the Root Mean Square Error (RMSE) of our model was 1.3817.
Conclusion:
Our proposed model could serve as a viable alternative for dry weight estimation of hemodialysis patients, thus providing a new way for clinical practice.
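The result above is reported as a Root Mean Square Error. For readers unfamiliar with the metric, here is a minimal sketch of how RMSE is computed from paired reference and predicted values. The numbers in the usage lines are hypothetical and are not taken from the study.

```python
import math

def rmse(y_true, y_pred):
    """Root Mean Square Error between reference and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical reference and predicted dry weights for three patients.
reference = [60.0, 70.0, 55.5]
predicted = [61.0, 68.0, 55.5]
print(round(rmse(reference, predicted), 4))
```

RMSE is expressed in the same unit as the predicted quantity and penalizes large individual errors more heavily than the mean absolute error does.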
Affiliation(s)
- Xiaoyi Guo: Hemodialysis Center, The Affiliated Wuxi People's Hospital of Nanjing Medical University, 214000, Wuxi, China
- Wei Zhou: Hemodialysis Center, The Affiliated Wuxi People's Hospital of Nanjing Medical University, 214000, Wuxi, China
- Bin Shi: Hemodialysis Center, Northern Jiangsu People's Hospital, 225001, Yangzhou, China
- Xiaohua Wang: Department of Urology, the First Affiliated Hospital of Soochow University, 215006, Suzhou, China
- Aiyan Du: Hemodialysis Center, The Affiliated Wuxi People's Hospital of Nanjing Medical University, 214000, Wuxi, China
- Yijie Ding: School of Electronic and Information Engineering, Suzhou University of Science and Technology, 215009, Suzhou, China
- Jijun Tang: School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, 300350, Tianjin, China
- Fei Guo: School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, 300350, Tianjin, China
|
17
|
Ullah A, Wang B, Sheng J, Long J, Khan N, Sun Z. Identification of nodes influence based on global structure model in complex networks. Sci Rep 2021; 11:6173. [PMID: 33731720 PMCID: PMC7969936 DOI: 10.1038/s41598-021-84684-x] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 11/23/2020] [Accepted: 02/12/2021] [Indexed: 01/31/2023]
Abstract
Identification of influential nodes in complex networks is challenging due to the large scale of the data and networks and the frequently changing behavior of current topologies. Application scenarios such as disease transmission and immunization, software virus infection and disinfection, increased product exposure and rumor suppression are domains in which identification of influential nodes in the corresponding networks is crucial. Though many approaches have been proposed to address these challenges, most of the relevant research concentrates on single and limited aspects of the problem. Therefore, we propose the Global Structure Model (GSM) for influential node identification, which considers a node's self-influence and emphasizes its global influence in the network. We applied GSM and used the Susceptible Infected Recovered model to evaluate its efficiency. Moreover, various standard algorithms such as Betweenness Centrality, Profit Leader, H-Index, Closeness Centrality, Hyperlink Induced Topic Search, Improved K-shell Hybrid, Density Centrality, Extended Cluster Coefficient Ranking Measure, and Gravity Index Centrality are employed as baseline benchmarks to evaluate the performance of GSM. We used seven real-world and two synthetic multi-typed complex networks along with different well-known datasets for the experiments. Analysis of the results indicates that GSM outperformed the baseline algorithms in identifying influential nodes.
Affiliation(s)
- Aman Ullah: School of Computer Science and Engineering, Central South University, Changsha, 410083 China (GRID grid.216417.7; ISNI 0000 0001 0379 7164)
- Bin Wang: School of Computer Science and Engineering, Central South University, Changsha, 410083 China (GRID grid.216417.7; ISNI 0000 0001 0379 7164)
- JinFang Sheng: School of Computer Science and Engineering, Central South University, Changsha, 410083 China (GRID grid.216417.7; ISNI 0000 0001 0379 7164)
- Jun Long: School of Computer Science and Engineering, Central South University, Changsha, 410083 China (GRID grid.216417.7; ISNI 0000 0001 0379 7164); Big Data Institute, Central South University, Changsha, 410083 China (GRID grid.216417.7; ISNI 0000 0001 0379 7164)
- Nasrullah Khan: College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, 210016 China (GRID grid.64938.30; ISNI 0000 0000 9558 9911); Department of Computer Science, COMSATS University Islamabad, Vehari Campus, Vehari, 61100 Pakistan (GRID grid.418920.6; ISNI 0000 0004 0607 0704)
- ZeJun Sun: School of Information Engineering, Pingdingshan University, Pingdingshan, Henan China (GRID grid.449268.5; ISNI 0000 0004 1797 3968)
|
18
|
He S, Guo F, Zou Q, Ding H. MRMD2.0: A Python Tool for Machine Learning with Feature Ranking and Reduction. Curr Bioinform 2021. [DOI: 10.2174/1574893615999200503030350] [Citation(s) in RCA: 101] [Impact Index Per Article: 33.7] [Indexed: 11/22/2022]
Abstract
Aims:
The study aims to find a way to reduce the dimensionality of a dataset.
Background:
Dimensionality reduction is a key issue in the machine learning process. It not only improves prediction performance but can also point to the intrinsic features and help to explore the biological meaning of the machine learning "black box".
Objective:
A variety of feature selection algorithms are used to select data features to achieve dimensionality reduction.
Methods:
First, MRMD2.0 integrated 7 popular feature ranking algorithms with a PageRank strategy. Second, the optimized dimensionality was detected with a forward adding strategy.
Result:
We have achieved good results in our experiments.
Conclusion:
Several works have been tested with MRMD2.0, and it performed well. It can also draw performance curves as a function of the feature dimensionality. If users want to sacrifice accuracy for fewer features, they can select the dimensionality from the performance curves.
Other:
We developed user-friendly Python tools together with a web server. Users can upload files in csv, arff or libsvm format, and the web server then helps to rank features and find the optimized dimensionality.
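The forward adding strategy mentioned in the Methods can be pictured as follows: given features already sorted by importance, grow the feature set one feature at a time, score each prefix, and keep the best one. The sketch below is a generic illustration under that assumption and is not MRMD2.0's actual code. The `evaluate` callback and the toy scores are hypothetical stand-ins for a cross-validated classifier score.

```python
def forward_adding(ranked_features, evaluate):
    """Score every prefix of a ranked feature list and keep the best one.

    ranked_features: feature names sorted from most to least important.
    evaluate: callable taking a list of features and returning a score
              (higher is better), e.g. cross-validated accuracy.
    Returns the selected features and the (size, score) performance curve.
    """
    best_score = float("-inf")
    best_size = 0
    curve = []
    for size in range(1, len(ranked_features) + 1):
        score = evaluate(ranked_features[:size])
        curve.append((size, score))
        # Strict comparison: on ties the smaller feature set is kept.
        if score > best_score:
            best_score, best_size = score, size
    return ranked_features[:best_size], curve

# Hypothetical scores indexed by subset size, standing in for a real model.
toy_scores = {1: 0.70, 2: 0.82, 3: 0.82, 4: 0.79}
selected, curve = forward_adding(
    ["f3", "f1", "f2", "f4"],
    lambda subset: toy_scores[len(subset)],
)
print(selected)
print(curve)
```

The returned curve corresponds to the performance curve described in the Conclusion: a user willing to trade accuracy for fewer features can read a smaller size directly off it.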
Affiliation(s)
- Shida He: College of Intelligence and Computing, Tianjin University, Tianjin, China
- Fei Guo: College of Intelligence and Computing, Tianjin University, Tianjin, China
- Quan Zou: Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Hui Ding: Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
|
19
|
Assessing Dry Weight of Hemodialysis Patients via Sparse Laplacian Regularized RVFL Neural Network with L2,1-Norm. BIOMED RESEARCH INTERNATIONAL 2021; 2021:6627650. [PMID: 33628794 PMCID: PMC7880720 DOI: 10.1155/2021/6627650] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/01/2021] [Revised: 01/21/2021] [Accepted: 01/25/2021] [Indexed: 11/28/2022]
Abstract
Dry weight is the normal weight of hemodialysis patients after hemodialysis. If too much water is removed during hemodialysis, the patient will experience hypotension and shock symptoms. Therefore, the correct assessment of the patient's dry weight is clinically important. Existing assessment methods all rely on professional instruments and technicians and are time-consuming and labor-intensive. To avoid this limitation, we apply machine learning methods to patient data. This study collected demographic and anthropometric data of 476 hemodialysis patients, including age, gender, blood pressure (BP), body mass index (BMI), years of dialysis (YD), and heart rate (HR). Building on previous work, we propose a Sparse Laplacian regularized Random Vector Functional Link (SLapRVFL) neural network model. To evaluate the prediction performance of the model, we compare SLapRVFL with the Body Composition Monitor (BCM) instrument and with other models. The Root Mean Square Error (RMSE) of SLapRVFL is 1.3136, which is better than that of the other methods. The SLapRVFL neural network model could be a viable alternative for dry weight assessment.
|
20
|
Xu L, Liang G, Chen B, Tan X, Xiang H, Liao C. A Computational Method for the Identification of Endolysins and Autolysins. Protein Pept Lett 2020; 27:329-336. [PMID: 31577192 DOI: 10.2174/0929866526666191002104735] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 06/10/2019] [Revised: 06/27/2019] [Accepted: 09/03/2019] [Indexed: 12/21/2022]
Abstract
BACKGROUND Cell lytic enzymes are highly evolved proteins that can destroy the cell structure and kill bacteria. Compared with antibiotics, cell lytic enzymes do not cause serious drug resistance problems in pathogenic bacteria. Thus, the study of cell wall lytic enzymes aims at finding an efficient way of curing bacterial infections. With antibiotics, by contrast, the problem of drug resistance is becoming more serious. Cell lytic enzymes are therefore a good choice for curing bacterial infections. Cell lytic enzymes include endolysins and autolysins, and the difference between them is the purpose of breaking the cell wall. Identifying the type of a cell lytic enzyme is meaningful for the study of cell wall enzymes. OBJECTIVE In this article, our motivation is to predict the type of cell lytic enzyme. Cell lytic enzymes help to kill bacteria, so it is meaningful to study their type. However, detecting the type of a cell lytic enzyme by experimental methods is time consuming. Thus, an efficient computational method for predicting the type of cell lytic enzyme is proposed in our work. METHODS We propose a computational method for the prediction of endolysins and autolysins. First, a data set containing 27 endolysins and 41 autolysins is built. Then each protein is represented by its tripeptide composition. The features with larger confidence degree are selected. Finally, a classifier based on a support vector machine is trained on the labeled vectors. The learned classifier is used to predict the type of cell lytic enzyme. RESULTS Following the proposed method, the experimental results show that the overall accuracy can attain 97.06% when 44 features are selected. Compared with Ding's method, our method improves the overall accuracy by nearly 4.5% ((97.06-92.9)/92.9%). The performance of our proposed method is stable when the number of selected features is between 40 and 70. The overall accuracy of the tripeptide optimal feature set is 94.12%, and the overall accuracy of Chou's amphiphilic PseAAC method is 76.2%. The experimental results also demonstrate that the overall accuracy is improved by nearly 18% when using the tripeptide optimal feature set. CONCLUSION The paper proposed an efficient method for identifying endolysins and autolysins. In this paper, a support vector machine is used to predict the type of cell lytic enzyme. The experimental results show that the overall accuracy of the proposed method is 94.12%, which is better than some existing methods. In conclusion, the selected 44 features can improve the overall accuracy of identifying the type of cell lytic enzyme. The support vector machine performs better than other classifiers when using the selected feature set on the benchmark data set.
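The representation used above, tripeptide composition, turns a protein sequence into a fixed-length vector of 20^3 = 8000 frequencies, one per possible tripeptide. The sketch below shows one common way to compute it. The function name, the handling of non-standard residues, and the example sequence are assumptions made for illustration, not details from the paper. The last lines re-check the relative improvement quoted in the abstract.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TRIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=3)]
TRIPEPTIDE_INDEX = {t: i for i, t in enumerate(TRIPEPTIDES)}

def tripeptide_composition(sequence):
    """Return an 8000-dimensional vector of tripeptide frequencies.

    Overlapping windows of length 3 are counted; windows containing a
    non-standard residue are skipped (an assumption of this sketch).
    """
    seq = sequence.upper()
    counts = [0] * len(TRIPEPTIDES)
    total = 0
    for i in range(len(seq) - 2):
        idx = TRIPEPTIDE_INDEX.get(seq[i:i + 3])
        if idx is not None:
            counts[idx] += 1
            total += 1
    if total == 0:
        return [0.0] * len(TRIPEPTIDES)
    return [c / total for c in counts]

# Hypothetical toy sequence; real inputs would be full protein sequences.
vector = tripeptide_composition("ACDACD")
print(len(vector), vector[TRIPEPTIDE_INDEX["ACD"]])

# Relative improvement quoted in the abstract: (97.06 - 92.9) / 92.9.
relative_gain_percent = (97.06 - 92.9) / 92.9 * 100
print(round(relative_gain_percent, 2))
```

With only 68 proteins and 8000 raw features, selecting a small subset (44 features in the abstract) is what makes training a classifier on this representation practical.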
Affiliation(s)
- Lei Xu: School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
- Guangmin Liang: School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
- Baowen Chen: School of Software, Shenzhen Institute of Information Technology, Shenzhen, China
- Xu Tan: School of Software, Shenzhen Institute of Information Technology, Shenzhen, China
- Huaikun Xiang: School of Automotive and Transportation Engineering, Shenzhen Polytechnic, Shenzhen, China
- Changrui Liao: Key Laboratory of Optoelectronic Devices and Systems of Ministry of Education and Guangdong Province, College of Optoelectronic Engineering, Shenzhen University, Shenzhen, China
|
21
|
Makigaki S, Ishida T. Sequence alignment generation using intermediate sequence search for homology modeling. Comput Struct Biotechnol J 2020; 18:2043-2050. [PMID: 32802276 PMCID: PMC7415839 DOI: 10.1016/j.csbj.2020.07.012] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 03/02/2020] [Revised: 07/15/2020] [Accepted: 07/15/2020] [Indexed: 11/25/2022]
Abstract
Protein tertiary structure is important information in various areas of biological research, however, the experimental cost associated with structure determination is high, and computational prediction methods have been developed to facilitate a more economical approach. Currently, template-based modeling methods are considered to be the most practical because the resulting predicted structures are often accurate, provided an appropriate template protein is available. During the first stage of template-based modeling, sensitive homology detection is essential for accurate structure prediction. However, sufficient structural models cannot always be obtained due to a lack of quality in the sequence alignment generated by a homology detection program. Therefore, an automated method that detects remote homologs accurately and generates appropriate alignments for accurate structure prediction is needed. In this paper, we propose an algorithm for suitable alignment generation using an intermediate sequence search for use with template-based modeling. We used intermediate sequence search for remote homology detection and intermediate sequences for alignment generation of remote homologs. We then evaluated the proposed method by comparing the sensitivity and selectivity of homology detection. Furthermore, based on the accuracy of the predicted structure model, we verify the accuracy of the alignments generated by our method. We demonstrate that our method generates more appropriate alignments for template-based modeling, especially for remote homologs. All source codes are available at https://github.com/shuichiro-makigaki/agora.
Affiliation(s)
- Shuichiro Makigaki: Department of Computer Science, School of Computing, Tokyo Institute of Technology Ookayama, Meguro-ku, Tokyo 152-8550, Japan
- Takashi Ishida: Department of Computer Science, School of Computing, Tokyo Institute of Technology Ookayama, Meguro-ku, Tokyo 152-8550, Japan
|
22
|
Gu X, Chen Z, Wang D. Prediction of G Protein-Coupled Receptors With CTDC Extraction and MRMD2.0 Dimension-Reduction Methods. Front Bioeng Biotechnol 2020; 8:635. [PMID: 32671038 PMCID: PMC7329982 DOI: 10.3389/fbioe.2020.00635] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 04/21/2020] [Accepted: 05/26/2020] [Indexed: 11/13/2022]
Abstract
The G Protein-Coupled Receptor (GPCR) family consists of more than 800 different members. In this article, we use the physicochemical properties encoded by Composition, Transition, Distribution (CTD) descriptors to represent GPCRs. The MRMD2.0 dimensionality reduction method filters out redundant physicochemical features of GPCRs. The resulting coordinates are plotted with Matplotlib to distinguish GPCRs from other protein sequences. The plots show a clear separation, with a well-defined boundary between the two classes. The experimental results show that our method can predict GPCRs.
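The composition part of the CTD descriptor (CTDC) assigns each residue to one of three classes for a given physicochemical property and reports the fraction of the sequence in each class. The sketch below illustrates this for a single property. The three-class hydrophobicity grouping shown is a commonly used one and is an assumption here, since the abstract does not list the groupings; the function name and example sequence are likewise hypothetical.

```python
# Assumed three-class hydrophobicity grouping (not stated in the abstract).
HYDROPHOBICITY_GROUPS = {
    "polar": "RKEDQN",
    "neutral": "GASTPHY",
    "hydrophobic": "CLVIMFW",
}

def ctd_composition(sequence, groups=HYDROPHOBICITY_GROUPS):
    """Fraction of residues falling into each class of one property.

    The denominator is the full sequence length, so residues outside
    every class (e.g. 'X') lower all three fractions.
    """
    seq = sequence.upper()
    length = len(seq)
    if length == 0:
        return {name: 0.0 for name in groups}
    return {
        name: sum(seq.count(residue) for residue in members) / length
        for name, members in groups.items()
    }

# Hypothetical toy sequence with two residues from each class.
print(ctd_composition("RKGACL"))
```

A full CTDC feature vector would repeat this for several properties and concatenate the resulting fractions before any dimensionality reduction is applied.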
Affiliation(s)
- Xingyue Gu: Institute of Computing Science and Technology, Guangzhou University, Guangzhou, China
- Zhihua Chen: Institute of Computing Science and Technology, Guangzhou University, Guangzhou, China
- Donghua Wang: Department of General Surgery, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
|
23
|
A Novel Triple Matrix Factorization Method for Detecting Drug-Side Effect Association Based on Kernel Target Alignment. BIOMED RESEARCH INTERNATIONAL 2020; 2020:4675395. [PMID: 32596314 PMCID: PMC7275954 DOI: 10.1155/2020/4675395] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 03/15/2020] [Accepted: 04/08/2020] [Indexed: 01/01/2023]
Abstract
All drugs usually have side effects, which endanger the health of patients. To identify potential side effects of drugs, biological and pharmacological experiments are done but are expensive and time-consuming. So, computation-based methods have been developed to accurately and quickly predict side effects. To predict potential associations between drugs and side effects, we propose a novel method called the Triple Matrix Factorization- (TMF-) based model. TMF is built by the biprojection matrix and latent feature of kernels, which is based on Low Rank Approximation (LRA). LRA could construct a lower rank matrix to approximate the original matrix, which not only retains the characteristics of the original matrix but also reduces the storage space and computational complexity of the data. To fuse multivariate information, multiple kernel matrices are constructed and integrated via Kernel Target Alignment-based Multiple Kernel Learning (KTA-MKL) in drug and side effect space, respectively. Compared with other methods, our model achieves better performance on three benchmark datasets. The values of the Area Under the Precision-Recall curve (AUPR) are 0.677, 0.685, and 0.680 on three datasets, respectively.
|
24
|
Hou R, Wang L, Wu YJ. Predicting ATP-Binding Cassette Transporters Using the Random Forest Method. Front Genet 2020; 11:156. [PMID: 32269586 PMCID: PMC7109328 DOI: 10.3389/fgene.2020.00156] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 12/26/2019] [Accepted: 02/11/2020] [Indexed: 12/21/2022]
Abstract
ATP-binding cassette (ABC) proteins play important roles in a wide variety of species. These proteins are involved in absorbing nutrients, exporting toxic substances, and regulating potassium channels, and they contribute to drug resistance in cancer cells. Therefore, the identification of ABC transporters is an urgent task. The present study used 188D as the feature extraction method, which is based on sequence information and physicochemical properties. We also visualized the feature extracted by t-Distributed Stochastic Neighbor Embedding (t-SNE). The sample based on the features extracted by 188D may be separated. Further, random forest (RF) is an efficient classifier to identify proteins. Under the 10-fold cross-validation of the model proposed here for a training set, the average accuracy rate of 10 training sets was 89.54%. We obtained values of 0.87 for specificity, 0.92 for sensitivity, and 0.79 for MCC. In the testing set, the accuracy achieved was 89%. These results suggest that the model combining 188D with RF is an optimal tool to identify ABC transporters.
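The specificity, sensitivity, accuracy and MCC values quoted above all derive from the four cells of a binary confusion matrix. The sketch below shows the standard formulas. The counts in the usage lines are hypothetical, chosen only so that sensitivity and specificity equal 0.92 and 0.87 under an assumed split of 100 positives and 100 negatives; they are not the study's data.

```python
import math

def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy and MCC from confusion counts."""
    def safe_div(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "sensitivity": safe_div(tp, tp + fn),   # true positive rate
        "specificity": safe_div(tn, tn + fp),   # true negative rate
        "accuracy": safe_div(tp + tn, tp + fn + tn + fp),
        "mcc": safe_div(tp * tn - fp * fn, mcc_denominator),
    }

# Hypothetical counts: 100 positives and 100 negatives.
metrics = binary_metrics(tp=92, fn=8, tn=87, fp=13)
for name, value in metrics.items():
    print(name, round(value, 4))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which makes it more informative when the two classes are imbalanced.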
Affiliation(s)
- Ruiyan Hou: Laboratory of Molecular Toxicology, State Key Laboratory of Integrated Management of Pest Insects and Rodents, Institute of Zoology, Chinese Academy of Sciences, Beijing, China; College of Life Science, University of Chinese Academy of Sciences, Beijing, China
- Lida Wang: Department of Scientific Research, General Hospital of Heilongjiang Province Land Reclamation Bureau, Harbin, China
- Yi-Jun Wu: Laboratory of Molecular Toxicology, State Key Laboratory of Integrated Management of Pest Insects and Rodents, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
|
25
|
Meng C, Zhang J, Ye X, Guo F, Zou Q. Review and comparative analysis of machine learning-based phage virion protein identification methods. BIOCHIMICA ET BIOPHYSICA ACTA-PROTEINS AND PROTEOMICS 2020; 1868:140406. [PMID: 32135196 DOI: 10.1016/j.bbapap.2020.140406] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 01/01/2020] [Revised: 02/14/2020] [Accepted: 02/27/2020] [Indexed: 02/01/2023]
Abstract
Phage virion protein (PVP) identification plays a key role in elucidating relationships between phages and hosts. Moreover, PVP identification can facilitate the design of related biochemical entities. Recently, several machine learning approaches have emerged for this purpose and have shown their potential. In this study, the proposed PVP identifiers are systematically reviewed, and the related algorithms and tools are comprehensively analyzed. We summarized the common framework of these PVP identifiers and constructed our own novel identifiers based upon the framework. Furthermore, we focus on a performance comparison of all PVP identifiers by using a training dataset and an independent dataset. Highlighting the pros and cons of these identifiers demonstrates that g-gap DPC (dipeptide composition) features are capable of representing the characteristics of PVPs. Moreover, SVM (support vector machine) proves to be the more effective classifier for distinguishing PVPs from non-PVPs.
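The g-gap dipeptide composition highlighted above counts pairs of residues separated by g intervening positions, so g = 0 gives ordinary adjacent dipeptides. The sketch below is one straightforward way to compute it. The function name, the skipping of non-standard residues, and the example sequence are assumptions made for illustration rather than details taken from the reviewed tools.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = [a + b for a, b in product(AMINO_ACIDS, repeat=2)]

def g_gap_dpc(sequence, g=0):
    """Frequencies of residue pairs separated by g intervening residues.

    Returns a dict with 400 entries. Pairs containing a non-standard
    residue are skipped (an assumption of this sketch).
    """
    if g < 0:
        raise ValueError("g must be non-negative")
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - g - 1):
        pair = seq[i] + seq[i + g + 1]
        if pair in counts:
            counts[pair] += 1
            total += 1
    return {pair: (count / total if total else 0.0)
            for pair, count in counts.items()}

# Hypothetical toy sequence.
adjacent = g_gap_dpc("ACAC", g=0)   # pairs: AC, CA, AC
one_gap = g_gap_dpc("ACAC", g=1)    # pairs: AA, CC
print(adjacent["AC"], adjacent["CA"])
print(one_gap["AA"], one_gap["CC"])
```

Varying g lets the 400-dimensional feature vector capture correlations between residues that are near each other in the sequence but not adjacent.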
Affiliation(s)
- Chaolu Meng: College of Intelligence and Computing, Tianjin University, Tianjin, China; College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, China
- Jun Zhang: Rehabilitation Department, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
- Xiucai Ye: Department of Computer Science, University of Tsukuba, Tsukuba, Science City, Japan
- Fei Guo: College of Intelligence and Computing, Tianjin University, Tianjin, China
- Quan Zou: Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China; Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
|
26
|
Identification of prokaryotic promoters and their strength by integrating heterogeneous features. Genomics 2020; 112:1396-1403. [DOI: 10.1016/j.ygeno.2019.08.009] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Received: 06/21/2019] [Revised: 07/31/2019] [Accepted: 08/14/2019] [Indexed: 12/21/2022]
|
27
|
Liu Y, Wang X, Liu B. RFPR-IDP: reduce the false positive rates for intrinsically disordered protein and region prediction by incorporating both fully ordered proteins and disordered proteins. Brief Bioinform 2020; 22:2000-2011. [PMID: 32112084 PMCID: PMC7986600 DOI: 10.1093/bib/bbaa018] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Indexed: 12/15/2022]
Abstract
As an important type of protein, intrinsically disordered proteins/regions (IDPs/IDRs) are related to many crucial biological functions. Accurate prediction of IDPs/IDRs is beneficial to the prediction of protein structures and functions. Most of the existing methods ignore fully ordered proteins without IDRs during training and testing. As a result, the corresponding predictors tend to predict fully ordered proteins as disordered proteins. Unfortunately, these methods were only evaluated on datasets consisting of disordered proteins with no or only a few fully ordered proteins, and therefore this problem has escaped the attention of researchers. However, most newly sequenced proteins are fully ordered proteins in nature, so these predictors fail to accurately predict ordered and disordered proteins in real-world applications. In this regard, we propose a new method called RFPR-IDP, trained with both fully ordered proteins and disordered proteins and constructed by combining a convolutional neural network (CNN) and bidirectional long short-term memory (BiLSTM). The experimental results show that although the existing predictors perform well for predicting disordered proteins, they tend to predict fully ordered proteins as disordered proteins. In contrast, the RFPR-IDP predictor can correctly predict fully ordered proteins and outperforms 10 other state-of-the-art methods when evaluated on a test dataset with both fully ordered proteins and disordered proteins. The web server and datasets of RFPR-IDP are freely available at http://bliulab.net/RFPR-IDP/server.
Affiliation(s)
- Yumeng Liu: School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
- Xiaolong Wang: School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
- Bin Liu: School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China; School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China; Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing 100081, China
|
28
|
Ru X, Wang L, Li L, Ding H, Ye X, Zou Q. Exploration of the correlation between GPCRs and drugs based on a learning to rank algorithm. Comput Biol Med 2020; 119:103660. [PMID: 32090901 DOI: 10.1016/j.compbiomed.2020.103660] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 12/18/2019] [Revised: 02/04/2020] [Accepted: 02/12/2020] [Indexed: 02/01/2023]
Abstract
Exploring protein-drug correlations can not only solve the problem of selecting candidate compounds but also address related problems such as drug repositioning and finding potential drug targets. Therefore, many researchers have proposed different machine learning methods for predicting protein-drug correlations. However, many existing models simply divide the protein-drug relationship into related or unrelated categories and do not explore in depth the most relevant target (or drug) for a given drug (or target). To solve this problem, this paper applies the ranking concept to the prediction of GPCR (G Protein-Coupled Receptor)-drug correlations. This study uses two different types of data sets to explore the candidate compound and potential target problems, and good results were achieved on both. In addition, this study found that the family to which a protein belongs is not an inherent factor affecting the ranking of GPCR-drug correlations; however, if a drug affects other family members of a protein, then that protein is likely to be a potential target of the drug. This study showed that the learning to rank algorithm is a good tool for exploring protein-drug correlations.
Affiliation(s)
- Xiaoqing Ru
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China; School of Information and Electrical Engineering, Hebei University of Engineering, Handan, China
- Lida Wang
  - Scientific Research Department, Heilongjiang Agricultural Reclamation General Hospital, Harbin, China
- Lihong Li
  - School of Information and Electrical Engineering, Hebei University of Engineering, Handan, China
- Hui Ding
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Xiucai Ye
  - Department of Computer Science, University of Tsukuba, Tsukuba Science City, Japan
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China; Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
29
Song X, Zhuang Y, Lan Y, Lin Y, Min X. Comprehensive Review and Comparison for Anticancer Peptides Identification Models. Curr Protein Pept Sci 2020; 22:CPPS-EPUB-103745. [PMID: 31957608 DOI: 10.2174/1389203721666200117162958] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 04/30/2019] [Revised: 05/16/2019] [Accepted: 05/30/2019] [Indexed: 11/22/2022]
Abstract
Anticancer peptides (ACPs) eliminate pathogenic bacteria and kill tumor cells while showing no hemolysis and causing no damage to normal human cells. This unique ability opens up the possibility of using ACPs for therapeutic delivery and in clinical therapy. Identifying ACPs is one of the most fundamental and central problems in new antitumor drug research. During the past decades, a number of machine learning-based prediction tools have been developed for this important task. However, the predictions produced by the various tools are difficult to quantify and compare. Therefore, in this article, we provide a comprehensive review of existing machine learning methods for ACP prediction and a fair comparison of the predictors. To evaluate current prediction tools, we conducted a comparative study and analyzed the existing ACP predictors from 10 published studies. The comparative results suggest that a Support Vector Machine-based model with combined features provided a significant improvement in overall performance compared with prediction models based on other machine learning methods.
30
Liu B, Zhu Y, Yan K. Fold-LTR-TCP: protein fold recognition based on triadic closure principle. Brief Bioinform 2019; 21:2185-2193. [DOI: 10.1093/bib/bbz139] [Citation(s) in RCA: 50] [Impact Index Per Article: 10.0] [Received: 09/03/2019] [Revised: 10/01/2019] [Accepted: 10/09/2019] [Indexed: 11/13/2022] Open
Abstract
As an important task in protein structure and function studies, protein fold recognition has attracted more and more attention. The existing computational predictors in this field treat this task as a multi-classification problem, ignoring the relationship among proteins in the dataset. However, previous studies showed that their relationship is critical for protein homology analysis. In this study, the protein fold recognition is treated as an information retrieval task. The Learning to Rank model (LTR) was employed to retrieve the query protein against the template proteins to find the template proteins in the same fold with the query protein in a supervised manner. The triadic closure principle (TCP) was performed on the ranking list generated by the LTR to improve its accuracy by considering the relationship among the query protein and the template proteins in the ranking list. Finally, a predictor called Fold-LTR-TCP was proposed. The rigorous test on the LE benchmark dataset showed that the Fold-LTR-TCP predictor achieved an accuracy of 73.2%, outperforming all the other competing methods.
Affiliation(s)
- Bin Liu
  - School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
  - Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China
- Yulin Zhu
  - School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
- Ke Yan
  - School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
31
Wu Z, Liao Q, Liu B. A comprehensive review and evaluation of computational methods for identifying protein complexes from protein–protein interaction networks. Brief Bioinform 2019; 21:1531-1548. [DOI: 10.1093/bib/bbz085] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Received: 04/08/2019] [Revised: 06/17/2019] [Accepted: 06/17/2019] [Indexed: 02/04/2023] Open
Abstract
Protein complexes are the fundamental units for many cellular processes. Identifying protein complexes accurately is critical for understanding the functions and organizations of cells. With the increment of genome-scale protein–protein interaction (PPI) data for different species, various computational methods focus on identifying protein complexes from PPI networks. In this article, we give a comprehensive and updated review on the state-of-the-art computational methods in the field of protein complex identification, especially focusing on the newly developed approaches. The computational methods are organized into three categories, including cluster-quality-based methods, node-affinity-based methods and ensemble clustering methods. Furthermore, the advantages and disadvantages of different methods are discussed, and then, the performance of 17 state-of-the-art methods is evaluated on two widely used benchmark data sets. Finally, the bottleneck problems and their potential solutions in this important field are discussed.
Affiliation(s)
- Zhourun Wu
  - School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China
- Qing Liao
  - School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China
- Bin Liu
  - School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
  - Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China
32
Meng C, Jin S, Wang L, Guo F, Zou Q. AOPs-SVM: A Sequence-Based Classifier of Antioxidant Proteins Using a Support Vector Machine. Front Bioeng Biotechnol 2019; 7:224. [PMID: 31620433 PMCID: PMC6759716 DOI: 10.3389/fbioe.2019.00224] [Citation(s) in RCA: 49] [Impact Index Per Article: 9.8] [Received: 07/21/2019] [Accepted: 09/03/2019] [Indexed: 01/03/2023] Open
Abstract
Antioxidant proteins play important roles in countering oxidative damage in organisms. Because it is time-consuming and has a high cost, the accurate identification of antioxidant proteins using biological experiments is a challenging task. For these reasons, we proposed a model using machine-learning algorithms that we named AOPs-SVM, which was developed based on sequence features and a support vector machine. Using a testing dataset, we conducted a jackknife cross-validation test with the proposed AOPs-SVM classifier and obtained 0.68 in sensitivity, 0.985 in specificity, 0.942 in average accuracy, 0.741 in MCC, and 0.832 in AUC. This outperformed existing classifiers. The experiment results demonstrate that the AOPs-SVM is an effective classifier and contributes to the research related to antioxidant proteins. A web server was built at http://server.malab.cn/AOPs-SVM/index.jsp to provide open access.
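For readers less familiar with the metrics quoted in this abstract, the following is a minimal sketch of how sensitivity, specificity, accuracy, and MCC are conventionally computed from a binary confusion matrix. The function name and the example counts are hypothetical and are not taken from the AOPs-SVM paper; the abstract's "average accuracy" may be defined differently from the plain accuracy shown here.

```python
import math


def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts chosen only for illustration (not the paper's data).
metrics = binary_metrics(tp=68, fn=32, tn=197, fp=3)
print(metrics)
```

Reporting MCC alongside sensitivity and specificity is useful when the positive and negative classes are unbalanced, since accuracy alone can look high even when few positives are recovered.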
Affiliation(s)
- Chaolu Meng
  - College of Intelligence and Computing, Tianjin University, Tianjin, China; College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, China
- Shunshan Jin
  - Department of Neurology, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
- Lei Wang
  - College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, China
- Fei Guo
  - College of Intelligence and Computing, Tianjin University, Tianjin, China
- Quan Zou
  - College of Intelligence and Computing, Tianjin University, Tianjin, China; Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China; Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
33
FKRR-MVSF: A Fuzzy Kernel Ridge Regression Model for Identifying DNA-Binding Proteins by Multi-View Sequence Features via Chou's Five-Step Rule. Int J Mol Sci 2019; 20:ijms20174175. [PMID: 31454964 PMCID: PMC6747228 DOI: 10.3390/ijms20174175] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 07/30/2019] [Revised: 08/10/2019] [Accepted: 08/19/2019] [Indexed: 12/22/2022] Open
Abstract
DNA-binding proteins play an important role in cell metabolism. In biological laboratories, methods for detecting DNA-binding proteins include yeast one-hybrid methods, bacterial singles, X-ray crystallography and others, but these methods involve a lot of labor, material and time. In recent years, many computational approaches have been proposed to detect DNA-binding proteins. In this paper, a machine learning-based method called the Fuzzy Kernel Ridge Regression model based on Multi-View Sequence Features (FKRR-MVSF) is proposed to identify DNA-binding proteins. First, multi-view sequence features are extracted from protein sequences. Next, a Multiple Kernel Learning (MKL) algorithm is employed to combine the multiple features. Finally, a Fuzzy Kernel Ridge Regression (FKRR) model is built to detect DNA-binding proteins. Compared with other methods, our model achieves good results. Our method obtains accuracies of 83.26% and 81.72% on two benchmark datasets (PDB1075 and PDB186), respectively.
34
Identification of Intrinsically Disordered Proteins and Regions by Length-Dependent Predictors Based on Conditional Random Fields. MOLECULAR THERAPY-NUCLEIC ACIDS 2019; 17:396-404. [PMID: 31307006 PMCID: PMC6626971 DOI: 10.1016/j.omtn.2019.06.004] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 02/25/2019] [Revised: 06/06/2019] [Accepted: 06/07/2019] [Indexed: 01/24/2023]
Abstract
Accurate identification of intrinsically disordered proteins/regions (IDPs/IDRs) is critical for predicting protein structure and function. Previous studies have shown that IDRs of different lengths have different characteristics, and several classification-based predictors have been proposed for predicting different types of IDRs. Compared with these classification-based predictors, the previously proposed predictor IDP-CRF exhibits state-of-the-art performance for predicting IDPs/IDRs, which is a sequence labeling model based on conditional random fields (CRFs). Motivated by these methods, we propose a predictor called IDP-FSP, which is an ensemble of three CRF-based predictors called IDP-FSP-L, IDP-FSP-S, and IDP-FSP-G. These three predictors are specially designed to predict long, short, and generic disordered regions, respectively, and they are constructed based on different features. To the best of our knowledge, IDP-FSP is the first predictor that combines a sequence labeling algorithm with IDRs of different lengths. Experimental results using two independent test datasets show that IDP-FSP achieves better or at least comparable predictive performance with 26 existing state-of-the-art methods in this field, proving the effectiveness of IDP-FSP.
35
Wei H, Liu B. iCircDA-MF: identification of circRNA-disease associations based on matrix factorization. Brief Bioinform 2019; 21:1356-1367. [DOI: 10.1093/bib/bbz057] [Citation(s) in RCA: 68] [Impact Index Per Article: 13.6] [Received: 01/31/2019] [Revised: 03/13/2019] [Accepted: 04/17/2019] [Indexed: 12/19/2022] Open
Abstract
Circular RNAs (circRNAs) are a group of newly discovered non-coding RNAs with a closed-loop structure, which play critical roles in various biological processes. Identifying associations between circRNAs and diseases is critical for exploring complex disease mechanisms and facilitating disease-targeted therapy. Although several computational predictors have been proposed, their performance is still limited. In this study, a novel computational method called iCircDA-MF is proposed. Because experimentally validated circRNA-disease associations are very limited, the potential circRNA-disease associations are calculated based on the circRNA similarity and disease similarity extracted from the disease semantic information and the known circRNA-gene, gene-disease and circRNA-disease associations. The circRNA-disease interaction profiles are then updated using the neighbour interaction profiles so as to correct false negative associations. Finally, matrix factorization is performed on the updated circRNA-disease interaction profiles to predict circRNA-disease associations. The experimental results on a widely used benchmark dataset showed that iCircDA-MF outperforms other state-of-the-art predictors and can identify new circRNA-disease associations effectively.
Affiliation(s)
- Hang Wei
  - School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China
- Bin Liu
  - School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China
  - School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
36
Ru X, Li L, Zou Q. Incorporating Distance-Based Top-n-gram and Random Forest To Identify Electron Transport Proteins. J Proteome Res 2019; 18:2931-2939. [DOI: 10.1021/acs.jproteome.9b00250] [Citation(s) in RCA: 70] [Impact Index Per Article: 14.0] [Indexed: 02/01/2023]
Affiliation(s)
- Xiaoqing Ru
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
  - School of Information and Electrical Engineering, Hebei University of Engineering, Handan, China
- Lihong Li
  - School of Information and Electrical Engineering, Hebei University of Engineering, Handan, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
37
Shen Z, Lin Y, Zou Q. Transcription factors–DNA interactions in rice: identification and verification. Brief Bioinform 2019; 21:946-956. [DOI: 10.1093/bib/bbz045] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 01/30/2019] [Revised: 03/25/2019] [Accepted: 03/25/2019] [Indexed: 01/08/2023] Open
Abstract
The completion of the rice genome sequence paved the way for rice functional genomics research. Additionally, the functional characterization of transcription factors is currently a popular and crucial objective among researchers. Transcription factors are one of the groups of proteins that bind to either enhancer or promoter regions of genes to regulate expression. On the basis of several typical examples of transcription factor analyses, we herein summarize selected research strategies and methods and introduce their advantages and disadvantages. This review may provide some theoretical and technical guidelines for future investigations of transcription factors, which may be helpful to develop new rice varieties with ideal traits.
Affiliation(s)
- Zijie Shen
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Yuan Lin
  - Department of System Integration, Sparebanken Vest, Bergen, Norway
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
38
Han K, Wang M, Zhang L, Wang Y, Guo M, Zhao M, Zhao Q, Zhang Y, Zeng N, Wang C. Predicting Ion Channels Genes and Their Types With Machine Learning Techniques. Front Genet 2019; 10:399. [PMID: 31130983 PMCID: PMC6510169 DOI: 10.3389/fgene.2019.00399] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 02/14/2019] [Accepted: 04/12/2019] [Indexed: 02/01/2023] Open
Abstract
Motivation: The number of ion channels is increasing rapidly. As many of them are associated with diseases, they are the targets of more than 700 drugs. The discovery of new ion channels is facilitated by computational methods that predict ion channels and their types from protein sequences. Methods: We used the SVMProt and the k-skip-n-gram methods to extract the feature vectors of ion channels, and obtained 188- and 400-dimensional features, respectively. The 188- and 400-dimensional features were combined to obtain 588-dimensional features. We then employed the maximum-relevance-maximum-distance method to reduce the dimensions of the 588-dimensional features. Finally, the support vector machine and random forest methods were used to build the prediction models to evaluate the classification effect. Results: Different methods were employed to extract various feature vectors, and after effective dimensionality reduction, different classifiers were used to classify the ion channels. We extracted the ion channel data from the Universal Protein Resource (UniProt, http://www.uniprot.org/) and Ligand-Gated Ion Channel databases (http://www.ebi.ac.uk/compneur-srv/LGICdb/LGICdb.php), and then verified the performance of the classifiers after screening. The findings of this study could inform the research and development of drugs.
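The 400-dimensional k-skip-n-gram features mentioned in this abstract can be illustrated for the bigram case (n = 2), where the 20 standard amino acids give 20 × 20 = 400 possible residue pairs. The sketch below is an assumption about one common formulation (normalized counts of residue pairs separated by at most k skipped positions); the function name, the default k, and the normalization are hypothetical and may differ from what the authors implemented.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def k_skip_bigram_features(seq: str, k: int = 1) -> list:
    """Return a 400-dimensional vector of normalized k-skip bigram counts.

    A pair (seq[i], seq[j]) is counted when 0 <= j - i - 1 <= k, i.e. when
    at most k residues are skipped between the two positions. Pairs that
    contain a non-standard residue are ignored.
    """
    index = {a + b: n for n, (a, b) in enumerate(product(AMINO_ACIDS, repeat=2))}
    counts = [0.0] * len(index)
    total = 0
    for i, first in enumerate(seq):
        for skipped in range(k + 1):
            j = i + 1 + skipped
            if j >= len(seq):
                break
            pair = first + seq[j]
            if pair in index:
                counts[index[pair]] += 1
                total += 1
    # Normalize so that sequences of different lengths are comparable.
    return [c / total for c in counts] if total else counts


features = k_skip_bigram_features("ACDA", k=1)
print(len(features))
```

A fixed-length vector of this kind can be concatenated with other descriptors before dimensionality reduction and classification, which is the general pipeline shape the abstract describes.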
Affiliation(s)
- Ke Han
  - School of Computer and Information Engineering, Harbin University of Commerce, Harbin, China
  - Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, China
- Miao Wang
  - Life Sciences and Environmental Sciences Development Center, Harbin University of Commerce, Harbin, China
- Lei Zhang
  - Life Sciences and Environmental Sciences Development Center, Harbin University of Commerce, Harbin, China
- Ying Wang
  - School of Computer and Information Engineering, Harbin University of Commerce, Harbin, China
- Mian Guo
  - Department of Neurosurgery, The Second Affiliated Hospital of Harbin Medical University, Harbin, China
- Ming Zhao
  - School of Computer and Information Engineering, Harbin University of Commerce, Harbin, China
  - Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, China
- Qian Zhao
  - School of Computer and Information Engineering, Harbin University of Commerce, Harbin, China
  - Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, China
- Yu Zhang
  - School of Computer and Information Engineering, Harbin University of Commerce, Harbin, China
  - Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, China
- Nianyin Zeng
  - Department of Instrumental and Electrical Engineering, Xiamen University, Xiamen, China
- Chunyu Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
39
Qu K, Guo F, Liu X, Lin Y, Zou Q. Application of Machine Learning in Microbiology. Front Microbiol 2019; 10:827. [PMID: 31057526 PMCID: PMC6482238 DOI: 10.3389/fmicb.2019.00827] [Citation(s) in RCA: 95] [Impact Index Per Article: 19.0] [Received: 01/31/2019] [Accepted: 04/01/2019] [Indexed: 02/01/2023] Open
Abstract
Microorganisms are ubiquitous and closely related to people's daily lives. Since they were first discovered in the 19th century, researchers have shown great interest in microorganisms. Microorganisms have traditionally been studied through cultivation, but this method is expensive and time consuming, and it cannot keep pace with the development of high-throughput sequencing technology. To deal with this problem, machine learning (ML) methods have been widely applied to the field of microbiology. Literature reviews have shown that ML can be used in many aspects of microbiology research, especially classification problems, and for exploring the interaction between microorganisms and the surrounding environment. In this study, we summarize the application of ML in microbiology.
Affiliation(s)
- Kaiyang Qu
  - College of Intelligence and Computing, Tianjin University, Tianjin, China
- Fei Guo
  - College of Intelligence and Computing, Tianjin University, Tianjin, China
- Xiangrong Liu
  - School of Information Science and Technology, Xiamen University, Xiamen, China
- Yuan Lin
  - School of Information Science and Technology, Xiamen University, Xiamen, China
  - Department of System Integration, Sparebanken Vest, Bergen, Norway
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
40
Ru X, Li L, Wang C. Identification of Phage Viral Proteins With Hybrid Sequence Features. Front Microbiol 2019; 10:507. [PMID: 30972038 PMCID: PMC6443926 DOI: 10.3389/fmicb.2019.00507] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 12/24/2018] [Accepted: 02/27/2019] [Indexed: 02/01/2023] Open
Abstract
The uniqueness of bacteriophages plays an important role in bioinformatics research. In real applications, the function of the bacteriophage virion proteins is the main area of interest. Therefore, it is very important to distinguish bacteriophage virion proteins from non-virion proteins accurately. Extracting comprehensive and effective sequence features from proteins plays a vital role in protein classification. To represent protein information more fully, this paper combines the features extracted by a feature representation algorithm based on sequence information (CCPA) with those extracted by a feature representation algorithm based on sequence and structure information. After feature extraction, the Max-Relevance-Max-Distance (MRMD) algorithm is used to select the optimal feature set, which has the strongest correlation with the class labels and low redundancy between features. Given the randomness of the samples selected by the random forest classification algorithm and of the features used to produce each node variable, a random forest method is employed to perform 10-fold cross-validation on the bacteriophage protein classification. The accuracy of this model is as high as 93.5% in the classification of phage proteins in this study. This study also found that, among the eight physicochemical properties considered, the charge property has the greatest impact on the classification of bacteriophage proteins. These results indicate that the model discussed in this paper is an important tool in bacteriophage protein research.
Affiliation(s)
- Xiaoqing Ru
  - School of Information and Electrical Engineering, Hebei University of Engineering, Handan, China
- Lihong Li
  - School of Information and Electrical Engineering, Hebei University of Engineering, Handan, China
- Chunyu Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
41
Su R, Liu X, Wei L. MinE-RFE: determine the optimal subset from RFE by minimizing the subset-accuracy–defined energy. Brief Bioinform 2019; 21:687-698. [DOI: 10.1093/bib/bbz021] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 11/24/2018] [Revised: 01/24/2019] [Accepted: 02/02/2019] [Indexed: 01/18/2023] Open
Abstract
Recursive feature elimination (RFE), one of the most popular feature selection algorithms, has been extensively applied in bioinformatics. During training, a group of candidate subsets is generated by iteratively eliminating the least important features from the original features. However, how to determine the optimal subset among them remains unclear. In most current studies, either overall accuracy or subset size (SS) is used to select the most predictive features. Which of the two to use, or whether to use both, and how they affect the prediction performance are still open questions. In this study, we proposed MinE-RFE, a novel RFE-based feature selection approach that fully considers the effect of both factors. The subset decision problem was mapped into subset-accuracy space and became an energy-minimization problem. We also provided a mathematical description of the relationship between the overall accuracy and SS using Gaussian Mixture Models together with spline fitting. In addition, we comprehensively reviewed a variety of state-of-the-art applications of RFE in bioinformatics. We compared their approaches to deciding the final subset from all the candidate subsets with MinE-RFE on diverse bioinformatics data sets. We also compared MinE-RFE with some widely used feature selection algorithms. The comparative results demonstrate that the proposed approach exhibits the best performance among all the approaches. To facilitate the use of MinE-RFE, we further established a user-friendly web server implementing the proposed approach, which is accessible at http://qgking.wicp.net/MinE/. We expect this web server will be a useful tool for the research community.
Affiliation(s)
- Ran Su
  - School of Computer Software, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Xinyi Liu
  - School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Leyi Wei
  - School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
42
Xu L, Liang G, Liao C, Chen GD, Chang CC. k-Skip-n-Gram-RF: A Random Forest Based Method for Alzheimer's Disease Protein Identification. Front Genet 2019; 10:33. [PMID: 30809242 PMCID: PMC6379451 DOI: 10.3389/fgene.2019.00033] [Citation(s) in RCA: 53] [Impact Index Per Article: 10.6] [Received: 10/10/2018] [Accepted: 01/17/2019] [Indexed: 11/18/2022] Open
Abstract
In this paper, a computational method based on machine learning for identifying Alzheimer's disease genes is proposed. Most existing machine learning-based methods predict Alzheimer's disease genes using structural magnetic resonance imaging (MRI). Most of these methods have attained acceptable results, but they are expensive and time consuming. Thus, we propose a computational method for identifying Alzheimer's disease genes that uses the sequence information of proteins and classifies the feature vectors with a random forest. In the proposed method, the protein information of each gene is extracted as adaptive k-skip-n-gram features. The experimental results demonstrate that the proposed method attains an accuracy of 85.5% on the selected UniProt dataset.
Affiliation(s)
- Lei Xu
  - School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
- Guangmin Liang
  - School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
- Changrui Liao
  - Key Laboratory of Optoelectronic Devices and Systems of Ministry of Education and Guangdong Province, College of Optoelectronic Engineering, Shenzhen University, Shenzhen, China
- Gin-Den Chen
  - Department of Obstetrics and Gynecology, Chung Shan Medical University Hospital, Taichung, Taiwan
- Chi-Chang Chang
  - School of Medical Informatics, Chung Shan Medical University, Taichung, Taiwan
  - IT Office, Chung Shan Medical University Hospital, Taichung, Taiwan
43
Yan K, Fang X, Xu Y, Liu B. Protein fold recognition based on multi-view modeling. Bioinformatics 2019; 35:2982-2990. [DOI: 10.1093/bioinformatics/btz040] [Citation(s) in RCA: 51] [Impact Index Per Article: 10.2] [Received: 07/06/2018] [Revised: 12/29/2018] [Accepted: 01/16/2019] [Indexed: 12/22/2022] Open
Abstract
Motivation
Protein fold recognition has attracted increasing attention because it is critical for studies of the 3D structures of proteins and drug design. Researchers have been extensively studying this important task, and several features with high discriminative power have been proposed. However, the development of methods that efficiently combine these features to improve the predictive performance remains a challenging problem.
Results
In this study, we proposed two algorithms: MV-fold and MT-fold. MV-fold is a new computational predictor based on the multi-view learning model for fold recognition. Different features of proteins, including evolutionary information, secondary structure information and physicochemical properties, were treated as different views of the proteins. These different views constituted the latent space. The ε-dragging technique was employed to enlarge the margins between different protein folds, improving the predictive performance of MV-fold. Then, MV-fold was combined with two template-based methods: HHblits and HMMER. The resulting ensemble method, called MT-fold, incorporates the advantages of both discriminative methods and template-based methods. Experimental results on five widely used benchmark datasets (DD, RDD, EDD, TG and LE) showed that the proposed methods outperformed some state-of-the-art methods in this field, indicating that MV-fold and MT-fold are useful computational tools for protein fold recognition and protein homology detection and would be efficient tools for protein sequence analysis. Finally, we constructed an updated and rigorous benchmark dataset based on SCOPe (version 2.07) to fairly evaluate the performance of the proposed method, and our method achieved stable performance on this new dataset. We expect this new benchmark dataset to become widely used for fairly evaluating the performance of different fold recognition methods.
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ke Yan
  - School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China
- Xiaozhao Fang
  - School of Computer Science and Technology, Guangdong University of Technology, Guangzhou, China
- Yong Xu
  - School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China
- Bin Liu
  - School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China
  - School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China