51
|
Lv H, Dao FY, Guan ZX, Zhang D, Tan JX, Zhang Y, Chen W, Lin H. iDNA6mA-Rice: A Computational Tool for Detecting N6-Methyladenine Sites in Rice. Front Genet 2019; 10:793. [PMID: 31552096 PMCID: PMC6746913 DOI: 10.3389/fgene.2019.00793] [Citation(s) in RCA: 47] [Impact Index Per Article: 9.4] [Received: 06/13/2019] [Accepted: 07/26/2019] [Indexed: 01/08/2023]
Abstract
DNA N6-methyladenine (6mA) is a dominant form of DNA modification and is involved in many biological functions. The accurate genome-wide identification of 6mA sites may increase understanding of its biological functions. Experimental methods for 6mA detection in eukaryotic genomes are laborious and expensive. Therefore, it is necessary to develop computational methods to identify 6mA sites on a genomic scale, especially for plant genomes. Based on this consideration, the study aims to develop a machine learning-based method of predicting 6mA sites in the rice genome. We initially used mono-nucleotide binary encoding to formulate positive and negative samples. Subsequently, the Random Forest machine learning algorithm was used to classify and identify 6mA sites. Our proposed method produced an area under the receiver operating characteristic curve of 0.964 with an overall accuracy of 0.917, as indicated by the fivefold cross-validation test. Furthermore, an independent dataset was established to assess the generalization ability of our method. On this dataset, an area under the receiver operating characteristic curve of 0.981 was obtained, suggesting that the proposed method performs well in predicting 6mA sites in the rice genome. For the convenience of retrieving 6mA sites, on the basis of the computational method, we built a freely accessible web server named iDNA6mA-Rice at http://lin-group.cn/server/iDNA6mA-Rice.
Affiliation(s)
- Hao Lv
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Fu-Ying Dao
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Zheng-Xing Guan
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Dan Zhang
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Jiu-Xin Tan
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Yong Zhang
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Wei Chen
  - Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu, China
- Hao Lin
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
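The iDNA6mA-Rice abstract in entry 51 mentions mono-nucleotide binary encoding of sequence samples. The sketch below shows one common way such an encoding can be written; the 4-bit mapping, the function name `binary_encode`, and the all-zero treatment of unknown bases are assumptions made for illustration and are not taken from the paper.

```python
# Illustrative sketch only: the bit assignment and the function name are
# assumptions, not the published iDNA6mA-Rice implementation.
BINARY_CODE = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def binary_encode(seq):
    """Map a DNA string to a flat 0/1 vector, four bits per nucleotide."""
    vector = []
    for base in seq.upper():
        # Unknown symbols (for example N) become an all-zero block (assumption).
        vector.extend(BINARY_CODE.get(base, (0, 0, 0, 0)))
    return vector


if __name__ == "__main__":
    print(binary_encode("ACGT"))
```

A vector built this way has length 4 × L for a window of L nucleotides and could then be passed to a classifier such as a random forest.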
|
52
|
Chou KC. Proposing Pseudo Amino Acid Components is an Important Milestone for Proteome and Genome Analyses. Int J Pept Res Ther 2019. [DOI: 10.1007/s10989-019-09910-7] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Indexed: 12/28/2022]
|
53
|
Zhang J, Peng W, Wang L. LeNup: learning nucleosome positioning from DNA sequences with improved convolutional neural networks. Bioinformatics 2019; 34:1705-1712. [PMID: 29329398 PMCID: PMC5946947 DOI: 10.1093/bioinformatics/bty003] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Received: 05/26/2017] [Accepted: 01/09/2018] [Indexed: 11/17/2022]
Abstract
Motivation: Nucleosome positioning plays significant roles in proper genome packing and its accessibility to execute transcription regulation. Despite a multitude of nucleosome positioning resources available online, including experimental datasets of genome-wide nucleosome occupancy profiles and computational tools for analyzing these data, the complex language of eukaryotic nucleosome positioning remains incompletely understood. Results: Here, we address this challenge using an approach based on a state-of-the-art machine learning method. We present a novel convolutional neural network (CNN) to understand nucleosome positioning. We combined Inception-like networks with a gating mechanism for the response of multiple patterns and long-term association in DNA sequences. We developed the open-source package LeNup based on the CNN to predict nucleosome positioning in Homo sapiens, Caenorhabditis elegans, Drosophila melanogaster as well as Saccharomyces cerevisiae genomes. We trained LeNup on four benchmark datasets. LeNup achieved greater predictive accuracy than previously published methods. Availability and implementation: LeNup is freely available as Python and Lua script source code under a BSD style license from https://github.com/biomedBit/LeNup. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Juhua Zhang
  - Department of Biomedical Engineering; Key Laboratory of Convergence Medical Engineering System and Healthcare Technology of the Ministry of Industry and Information Technology, School of Life Science, Beijing Institute of Technology, Beijing 100081, China
- Lei Wang
  - Department of Biomedical Engineering
|
54
|
|
55
|
Lin H, Liang ZY, Tang H, Chen W. Identifying Sigma70 Promoters with Novel Pseudo Nucleotide Composition. IEEE/ACM Trans Comput Biol Bioinform 2019; 16:1316-1321. [PMID: 28186907 DOI: 10.1109/tcbb.2017.2666141] [Citation(s) in RCA: 95] [Impact Index Per Article: 19.0] [Indexed: 06/06/2023]
Abstract
Promoters are DNA regulatory elements located directly upstream or at the 5' end of the transcription initiation site (TSS), which are in charge of gene transcription initiation. With the completion of a large number of microorganism genomes, it is urgent to predict promoters accurately in bacteria using computational methods. In this work, a sequence-based predictor named "iPro70-PseZNC" was designed for identifying sigma70 promoters in prokaryotes. In the predictor, the samples of DNA sequences are formulated by a novel pseudo nucleotide composition, called PseZNC, into which the multi-window Z-curve composition and six local DNA structural properties are incorporated. In the 5-fold cross-validation, an area under the receiver operating characteristic curve of 0.909 was obtained on our benchmark dataset, indicating that the proposed predictor is promising and will provide an important guide in this area. Further studies showed that the performance of PseZNC is better than that of the multi-window Z-curve composition. For the convenience of researchers, a user-friendly online service was established and is freely accessible at http://lin.uestc.edu.cn/server/iPro70-PseZNC. The PseZNC approach can also be extended to other DNA-related problems.
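The iPro70-PseZNC abstract in entry 55 builds on Z-curve composition. As a rough illustration, the three standard Z-curve components can be computed from the nucleotide frequencies of a single window as sketched below. The function name `z_curve_components` is hypothetical, and the sketch does not reproduce the paper's multi-window scheme or its six local structural properties.

```python
# Illustrative sketch only: frequency-based Z-curve components for one window.
# The function name is hypothetical; this is not the PseZNC feature set itself.
def z_curve_components(seq):
    """Return (x, y, z) Z-curve components from mono-nucleotide frequencies."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must not be empty")
    freq = {base: seq.count(base) / n for base in "ACGT"}
    x = (freq["A"] + freq["G"]) - (freq["C"] + freq["T"])  # purine vs. pyrimidine
    y = (freq["A"] + freq["C"]) - (freq["G"] + freq["T"])  # amino vs. keto
    z = (freq["A"] + freq["T"]) - (freq["G"] + freq["C"])  # weak vs. strong H-bond
    return (x, y, z)


if __name__ == "__main__":
    print(z_curve_components("TATAAT"))
```

Applying the same function to several sub-windows of a sequence and concatenating the results would be one plausible way to obtain a multi-window feature vector, though the exact windows used in the paper are not stated in the abstract.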
|
56
|
Wei L, Xing P, Shi G, Ji Z, Zou Q. Fast Prediction of Protein Methylation Sites Using a Sequence-Based Feature Selection Technique. IEEE/ACM Trans Comput Biol Bioinform 2019; 16:1264-1273. [PMID: 28222000 DOI: 10.1109/tcbb.2017.2670558] [Citation(s) in RCA: 124] [Impact Index Per Article: 24.8] [Indexed: 05/09/2023]
Abstract
Protein methylation, an important post-translational modification, plays crucial roles in many cellular processes. The accurate prediction of protein methylation sites is fundamentally important for revealing the molecular mechanisms undergoing methylation. In recent years, computational prediction based on machine learning algorithms has emerged as a powerful and robust approach for identifying methylation sites, and much progress has been made in predictive performance improvement. However, the predictive performance of existing methods is not satisfactory in terms of overall accuracy. Motivated by this, we propose a novel random-forest-based predictor called MePred-RF, integrating several discriminative sequence-based feature descriptors and improving feature representation capability using a powerful feature selection technique. Importantly, unlike other methods based on multiple, complex information inputs, our proposed MePred-RF is based on sequence information alone. Comparative studies on benchmark datasets via vigorous jackknife tests indicate that our proposed MePred-RF method remarkably outperforms other state-of-the-art predictors, leading by a 4.5 percent average in terms of overall accuracy. A user-friendly webserver that implements the proposed method has been established for researchers' convenience, and is now freely available for public use through http://server.malab.cn/MePred-RF. We anticipate our research tool to be useful for the large-scale prediction and analysis of protein methylation sites.
|
57
|
Zhao W, Li GP, Wang J, Zhou YK, Gao Y, Du PF. Predicting protein sub-Golgi locations by combining functional domain enrichment scores with pseudo-amino acid compositions. J Theor Biol 2019; 473:38-43. [DOI: 10.1016/j.jtbi.2019.04.025] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 02/01/2019] [Revised: 04/22/2019] [Accepted: 04/29/2019] [Indexed: 12/11/2022]
|
58
|
Lin Y, Cai Y, Liu J, Lin C, Liu X. An advanced approach to identify antimicrobial peptides and their function types for penaeus through machine learning strategies. BMC Bioinformatics 2019; 20:291. [PMID: 31182007 PMCID: PMC6557738 DOI: 10.1186/s12859-019-2766-9] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Indexed: 11/13/2022]
Abstract
Background: Antimicrobial peptides (AMPs) are essential components of the innate immune system and can protect the host from various pathogenic bacteria. The marine environment is known to be one of the richest sources for AMPs. Effective usage of AMPs and their derivatives can greatly improve the immunity and breeding survival rate of aquatic products. It is highly desirable to develop computational tools for rapidly and accurately identifying AMPs and their functional types, for the purpose of helping design new and more effective antimicrobial agents. Results: In this study, we made an attempt to develop an advanced machine learning based computational approach, MAMPs-Pred, for identification of AMPs and their function types. Initially, SVM-prot 188-D features were extracted and subsequently used as input to a two-layer multi-label classifier. Specifically, the first layer identifies whether a peptide is an AMP by applying an RF classifier, and the second layer addresses the multi-type problem by identifying the activities or function types of AMPs by applying PS-RF and LC-RF classifiers. To benchmark the method, MAMPs-Pred was also compared with the best-performing existing methods in the literature and showed improved identification accuracy. Conclusions: The results reported in this study indicate that the MAMPs-Pred method achieves high performance for identifying AMPs and their functional types. The proposed approach is believed to supplement the tools and techniques that have been developed in the past for predicting AMPs and their function types.
Affiliation(s)
- Yuan Lin
  - Department of Computer Science, School of Information Science and Technology, Xiamen University, Xiamen, 361005, China; Sparebanken Vest, Jonsvollsgaten 2, 5011 Bergen, Bergen, 5058, Norway
- Yinyin Cai
  - Department of Computer Science, School of Information Science and Technology, Xiamen University, Xiamen, 361005, China
- Juan Liu
  - Department of Instrumental and Electrical Engineering, School of Aerospace Engineering, Xiamen University, Xiamen, 361005, China
- Chen Lin
  - Department of Computer Science, School of Information Science and Technology, Xiamen University, Xiamen, 361005, China
- Xiangrong Liu
  - Department of Computer Science, School of Information Science and Technology, Xiamen University, Xiamen, 361005, China
|
59
|
Liang Y, Zhang S. iDHS-DMCAC: identifying DNase I hypersensitive sites with balanced dinucleotide-based detrending moving-average cross-correlation coefficient. SAR QSAR Environ Res 2019; 30:429-445. [PMID: 31117818 DOI: 10.1080/1062936x.2019.1615546] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 05/15/2023]
Abstract
DNase I hypersensitive sites (DHSs) are associated with regulatory DNA elements, so a good understanding of them is important for both biomedical research and the discovery of new drugs. Traditional experimental methods for detecting DHSs are laborious, time-consuming and inaccurate. More importantly, with the avalanche of genome sequences in the postgenomic age, it is highly essential to develop cost-effective computational approaches to identify DHSs. In this paper, we develop a statistical feature extraction model using the detrended moving-average cross-correlation (DMCA) coefficient descriptor based on a dinucleotide property matrix generated from 15 DNA dinucleotide properties, and this model is named iDHS-DMCAC. A 105-dimensional feature vector is constructed for a certain window on the two class-imbalanced benchmark datasets, with over-sampling and support vector machine algorithms. Rigorous cross-validations indicate that our predictor remarkably outperforms the existing models in both accuracy and stability. We anticipate that iDHS-DMCAC will become a very useful high-throughput tool, or at the very least, a complementary tool to the existing methods of identifying DNase I hypersensitive sites. The datasets and source codes of the proposed model are freely available at https://github.com/shengli0201/Datasets.
Affiliation(s)
- Y Liang
  - School of Science, Xi'an Polytechnic University, Xi'an, P. R. China
- S Zhang
  - School of Mathematics and Statistics, Xidian University, Xi'an, P. R. China
|
60
|
Wei HH, Yang W, Tang H, Lin H. The Development of Machine Learning Methods in Cell-Penetrating Peptides Identification: A Brief Review. Curr Drug Metab 2019; 20:217-223. [DOI: 10.2174/1389200219666181010114750] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 01/19/2018] [Revised: 05/21/2018] [Accepted: 08/02/2018] [Indexed: 11/22/2022]
Abstract
Background: Cell-penetrating peptides (CPPs) are important short peptides that facilitate cellular intake or uptake of various molecules. CPPs can transport drug molecules through the plasma membrane and send these molecules to different cellular organelles. Thus, CPP identification and related mechanisms have been extensively explored. In order to reveal the penetration mechanisms of a large number of CPPs, it is necessary to develop convenient and fast methods for CPP identification. Methods: Biochemical experiments can provide precise details for accurately identifying CPPs, but these methods are expensive and laborious. To overcome these disadvantages, several computational methods have been developed to identify CPPs. We reviewed the development of machine learning methods in CPP identification. This review provides insight into CPP identification. Results: We summarized the machine learning-based CPP identification methods and compared the construction strategies of 11 different computational methods. Furthermore, we pointed out the limitations and difficulties in predicting CPPs. Conclusion: In this review, the latest studies on CPP identification using machine learning methods were reported. We also discussed the future development direction of CPP recognition with computational methods.
Affiliation(s)
- Huan-Huan Wei
  - Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Wuritu Yang
  - Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Hua Tang
  - Department of Pathophysiology, Southwest Medical University, Luzhou, China
- Hao Lin
  - Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
|
61
|
Zhang L, Kong L. iRSpot-PDI: Identification of recombination spots by incorporating dinucleotide property diversity information into Chou's pseudo components. Genomics 2019; 111:457-464. [DOI: 10.1016/j.ygeno.2018.03.003] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Received: 10/22/2017] [Revised: 02/27/2018] [Accepted: 03/03/2018] [Indexed: 12/11/2022]
|
62
|
Chen Z, Zhao P, Li F, Marquez-Lago TT, Leier A, Revote J, Zhu Y, Powell DR, Akutsu T, Webb GI, Chou KC, Smith AI, Daly RJ, Li J, Song J. iLearn: an integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data. Brief Bioinform 2019; 21:1047-1057. [DOI: 10.1093/bib/bbz041] [Citation(s) in RCA: 189] [Impact Index Per Article: 37.8] [Received: 01/20/2019] [Revised: 02/28/2019] [Accepted: 03/13/2019] [Indexed: 12/13/2022]
Abstract
With the explosive growth of biological sequences generated in the post-genomic era, one of the most challenging problems in bioinformatics and computational biology is to computationally characterize sequences, structures and functions in an efficient, accurate and high-throughput manner. A number of online web servers and stand-alone tools have been developed to address this to date; however, all these tools have their limitations and drawbacks in terms of their effectiveness, user-friendliness and capacity. Here, we present iLearn, a comprehensive and versatile Python-based toolkit, integrating the functionality of feature extraction, clustering, normalization, selection, dimensionality reduction, predictor construction, best descriptor/model selection, ensemble learning and results visualization for DNA, RNA and protein sequences. iLearn was designed for users that only want to upload their data set and select the functions they need calculated from it, while all necessary procedures and optimal settings are completed automatically by the software. iLearn includes a variety of descriptors for DNA, RNA and proteins, and four feature output formats are supported so as to facilitate direct output usage or communication with other computational tools. In total, iLearn encompasses 16 different types of feature clustering, selection, normalization and dimensionality reduction algorithms, and five commonly used machine-learning algorithms, thereby greatly facilitating feature analysis and predictor construction. iLearn is made freely available via an online web server and a stand-alone toolkit.
Affiliation(s)
- Zhen Chen
  - School of Basic Medical Science, Qingdao University, 38 Dengzhou Road, Qingdao, 266021, Shandong, China
- Pei Zhao
  - State Key Laboratory of Cotton Biology, Institute of Cotton Research of Chinese Academy of Agricultural Sciences (CAAS), Anyang, 455000, China
- Fuyi Li
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Tatiana T Marquez-Lago
  - Department of Genetics, School of Medicine, University of Alabama at Birmingham, USA
  - Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, AL, USA
- André Leier
  - Department of Genetics, School of Medicine, University of Alabama at Birmingham, USA
  - Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, AL, USA
- Jerico Revote
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Yan Zhu
  - Biomedicine Discovery Institute and Department of Microbiology, Monash University, Melbourne, VIC 3800, Australia
- David R Powell
  - Biomedicine Discovery Institute and Department of Microbiology, Monash University, Melbourne, VIC 3800, Australia
- Tatsuya Akutsu
  - Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto 611-0011, Japan
- Geoffrey I Webb
  - Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
- Kuo-Chen Chou
  - Gordon Life Science Institute, Boston, MA 02478, USA
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, 610054, China
- A Ian Smith
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
  - ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, Melbourne, VIC 3800, Australia
- Roger J Daly
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Jian Li
  - Biomedicine Discovery Institute and Department of Microbiology, Monash University, Melbourne, VIC 3800, Australia
- Jiangning Song
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
  - Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
  - ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, Melbourne, VIC 3800, Australia
|
63
|
Tahir M, Tayara H, Chong KT. iPseU-CNN: Identifying RNA Pseudouridine Sites Using Convolutional Neural Networks. Mol Ther Nucleic Acids 2019; 16:463-470. [PMID: 31048185 PMCID: PMC6488737 DOI: 10.1016/j.omtn.2019.03.010] [Citation(s) in RCA: 49] [Impact Index Per Article: 9.8] [Received: 02/07/2019] [Revised: 03/29/2019] [Accepted: 03/29/2019] [Indexed: 12/15/2022]
Abstract
Pseudouridine is the most prevalent RNA modification and has been found in both eukaryotes and prokaryotes. Currently, pseudouridine has been demonstrated in several kinds of RNAs, such as small nuclear RNA, rRNA, tRNA, mRNA, and small nucleolar RNA. Therefore, its significance to academic research and drug development is understandable. Through biochemical experiments, the pseudouridine site identification has produced good outcomes, but these lab exploratory methods and biochemical processes are expensive and time consuming. Therefore, it is important to introduce efficient methods for identification of pseudouridine sites. In this study, an intelligent method for pseudouridine sites using the deep-learning approach was developed. The proposed prediction model is called iPseU-CNN (identifying pseudouridine by convolutional neural networks). The existing methods used handcrafted features and machine-learning approaches to identify pseudouridine sites. However, the proposed predictor extracts the features of the pseudouridine sites automatically using a convolution neural network model. The iPseU-CNN model yields better outcomes than the current state-of-the-art models in all evaluation parameters. It is thus highly projected that the iPseU-CNN predictor will become a helpful tool for academic research on pseudouridine site prediction of RNA, as well as in drug discovery.
Affiliation(s)
- Muhammad Tahir
  - Department of Electronics and Information Engineering, Chonbuk National University, Jeonju 54896, South Korea; Department of Computer Science, Abdul Wali Khan University, Mardan 23200, Pakistan
- Hilal Tayara
  - Department of Electronics and Information Engineering, Chonbuk National University, Jeonju 54896, South Korea
- Kil To Chong
  - Advanced Electronics and Information Research Center, Chonbuk National University, Jeonju 54896, South Korea
|
64
|
Yonge F, Weixia X. Identification of Mitochondrial Proteins of Malaria Parasite Adding the New Parameter. Lett Org Chem 2019. [DOI: 10.2174/1570178615666180608100348] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/22/2022]
Abstract
Malaria is a serious infectious disease caused by Plasmodium falciparum (P. falciparum). Mitochondrial proteins of P. falciparum are regarded as effective drug targets against malaria. Thus, it is necessary to accurately identify mitochondrial proteins of the malaria parasite. Many algorithms have been proposed for the prediction of mitochondrial proteins of the malaria parasite and have yielded good results. However, the parameters used by these methods were primarily based on amino acid sequences. In this study, we added a novel parameter for predicting mitochondrial proteins of the malaria parasite based on protein secondary structure. Firstly, we extracted three feature parameters, namely, three kinds of protein secondary structure compositions (3PSS), 20 amino acid compositions (20AAC) and 400 dipeptide compositions (400DC), and used the analysis of variance (ANOVA) to screen the 400 dipeptides. Secondly, we adopted these features to predict mitochondrial proteins of the malaria parasite by using a support vector machine (SVM). Finally, we found that 1) adding the feature of protein secondary structure (3PSS) can indeed improve the prediction accuracy, which demonstrates that protein secondary structure is a valid feature in the prediction of mitochondrial proteins of the malaria parasite; 2) feature combination can improve the prediction results, and feature selection can reduce the dimension and simplify the calculation. We achieved a sensitivity (Sn) of 98.16%, a specificity (Sp) of 97.64% and an overall accuracy (Acc) of 97.88%, with a Matthews correlation coefficient (MCC) of 0.957, by using 3PSS + 20AAC + 34DC as features in 15-fold cross-validation. This result is compared with that of similar work on the same dataset, showing the superiority of our work.
Affiliation(s)
- Feng Yonge
  - College of Science, Inner Mongolia Agriculture University, Hohhot 010018, China
- Xie Weixia
  - College of Science, Inner Mongolia Agriculture University, Hohhot 010018, China
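The abstract in entry 64 reports sensitivity (Sn), specificity (Sp), overall accuracy (Acc) and the Matthews correlation coefficient (MCC). The sketch below shows how these four measures are conventionally computed from confusion-matrix counts. The function name `classification_metrics` and the example counts are made up for illustration, and returning 0.0 for an undefined ratio is a convention chosen here rather than anything stated in the source.

```python
import math


# Illustrative sketch only: standard definitions of Sn, Sp, Acc and MCC.
# The function name and the example counts below are hypothetical.
def classification_metrics(tp, tn, fp, fn):
    """Compute Sn, Sp, Acc and MCC from confusion-matrix counts."""

    def ratio(numerator, denominator):
        # Undefined ratios are reported as 0.0 (an assumed convention).
        return numerator / denominator if denominator else 0.0

    sn = ratio(tp, tp + fn)                      # sensitivity (recall)
    sp = ratio(tn, tn + fp)                      # specificity
    acc = ratio(tp + tn, tp + tn + fp + fn)      # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ratio(tp * tn - fp * fn, denom)        # Matthews correlation coefficient
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


if __name__ == "__main__":
    # Made-up counts, not data from any paper listed here.
    print(classification_metrics(tp=40, tn=30, fp=20, fn=10))
```

Because MCC uses all four counts, it remains informative on imbalanced datasets where accuracy alone can look deceptively high.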
|
65
|
Wu J, Mai G, Deng B, Younseo J, Du D, Chen F, Ma Q. Quantitative Structure-activity Relationship of Acetylcholinesterase Inhibitors based on mRMR Combined with Support Vector Regression. Lett Org Chem 2019. [DOI: 10.2174/1570178615666181008125341] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
Abstract
In this work, support vector regression (SVR), an effective machine learning method proposed by Vapnik, was applied to establish a QSAR model for a series of acetylcholinesterase inhibitors (AChEI). Fourteen descriptors were selected for constructing the SVR model by using the mRMR-Forward feature selection method. The parameters (ε, C) were tuned by leave-one-out cross-validation (LOOCV), which was used to judge the predictive power of different models. After optimization, one optimal SVR-QSAR model was attained, and the mean relative error (MRE) of LOOCV using SVR was 1.72%. In this model, LogP negatively affected the activity, whereas Refractivity and Water Accessible Surface Area positively affected it.
Affiliation(s)
- Jiaxiang Wu
  - Shanghai Key Laboratory of Bio-Crops, College of Life Science, Shanghai University, Shanghai, China
- Guozhao Mai
  - Department of Rehabilitation Medicine, The People's Hospital of Heshan, Guangdong, China
- Bowen Deng
  - Shanghai Key Laboratory of Bio-Crops, College of Life Science, Shanghai University, Shanghai, China
- Jeong Younseo
  - Center for Bioinformatics and Computational Biology, Pai Chai University, Daejeon, South Korea
- Dongsu Du
  - Shanghai Key Laboratory of Bio-Crops, College of Life Science, Shanghai University, Shanghai, China
- Fuxue Chen
  - Shanghai Key Laboratory of Bio-Crops, College of Life Science, Shanghai University, Shanghai, China
- Qiaorong Ma
  - Department of Clinical Laboratory, Minzu Hospital of Guangxi Zhuang Autonomous Region, Affiliated Minzu Hospital of Guangxi Medical University, Nanning, Guangxi, China
|
66
|
Yang Q, Jia C, Li T. Prediction of aptamer-protein interacting pairs based on sparse autoencoder feature extraction and an ensemble classifier. Math Biosci 2019; 311:103-108. [PMID: 30880100 DOI: 10.1016/j.mbs.2019.01.009] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 10/10/2018] [Revised: 01/29/2019] [Accepted: 01/29/2019] [Indexed: 10/27/2022]
Abstract
Aptamer-protein interacting pairs play important roles in physiological functions and structural characterization. Identifying aptamer-protein interacting pairs is challenging and limited, despite the many applications of aptamers. Therefore, it is vital to construct a high-performance prediction model for identifying aptamer-target interacting pairs. In this study, a novel ensemble method is presented to predict aptamer-protein interacting pairs by integrating sequence characteristics derived from aptamers and the target proteins. The features extracted for aptamers were the compositions of amino acids and pseudo K-tuple nucleotides. In addition, a sparse autoencoder was used to characterize features for the target protein sequences. To remove redundant features, gradient boosting decision tree (GBDT) and incremental feature selection (IFS) methods were used to obtain the optimum combination of sequence characteristics. Based on 616 selected features, an ensemble of three sub-support vector machine (SVM) classifiers was used to construct our prediction model. Evaluated on an independent dataset, our predictor obtained an accuracy of 75.7%, a Matthews correlation coefficient of 0.478, and a Youden's index of 0.538, which were superior to the values reached using other existing predictors. The results show that our model can be used to distinguish novel aptamer-protein interacting pairs and to reveal the interrelation between aptamers and proteins.
Collapse
Affiliation(s)
- Qing Yang
- Institute of Environmental Systems Biology, College of Environmental and Engineering, Dalian Maritime University, No. 1 Linghai Road, Dalian 116026, China
- Cangzhi Jia
- School of Science, Dalian Maritime University, No. 1 Linghai Road, Dalian 116026, China
- Taoying Li
- Department of Maritime Economics and Management, Dalian Maritime University, No. 1 Linghai Road, Dalian 116026, China.
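The three evaluation measures quoted in the abstract above (accuracy, Matthews correlation coefficient, and Youden's index) all follow from a binary confusion matrix. The following is a minimal sketch of their standard definitions, not the authors' code; the function name `binary_metrics` and the example counts are illustrative assumptions and are not taken from the paper.

```python
import math

def binary_metrics(tp, fn, fp, tn):
    """Standard binary-classification metrics from confusion-matrix counts.

    Assumes both classes are present (tp + fn > 0 and tn + fp > 0).
    """
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    youden = sensitivity + specificity - 1  # Youden's index J
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "youden": youden,
        "mcc": mcc,
    }

# Hypothetical counts, chosen only to exercise the formulas.
m = binary_metrics(tp=40, fn=10, fp=20, tn=30)
print(m)
```

Unlike accuracy, both MCC and Youden's index account for performance on the positive and negative classes separately, which is presumably why they are reported alongside accuracy here.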
|
67
|
Abstract
Background: DNA-binding proteins, which bind to DNA, exist widely in living cells and participate in many DNA-related cell activities, for instance DNA replication, transcription, recombination, and DNA repair. Objective: Given the importance of DNA-binding proteins, their prediction has been a popular research topic over the past decades. In this article, we review current machine-learning methods for predicting DNA-binding proteins in terms of feature representation methods, classifiers, measurements, datasets, and existing web servers. Method: Prediction methods for DNA-binding proteins can be divided into two types: those based on amino acid composition and those based on protein structure. In this article, we follow these two types to introduce the application of machine learning to DNA-binding protein prediction. Results: Machine learning plays an important role in the classification of DNA-binding proteins and gives good results; the best ACC is above 80%. Conclusion: Machine learning can be widely used in many aspects of biological information, especially in protein classification. Some issues should be considered in future work. First, the relationship between the number of features and performance must be explored. Second, because many features are used to predict DNA-binding proteins, solutions for high-dimensional feature spaces should be proposed.
Affiliation(s)
- Kaiyang Qu
- College of Intelligence and Computing, Tianjin University, Tianjin, China
- Leyi Wei
- College of Intelligence and Computing, Tianjin University, Tianjin, China
- Quan Zou
- College of Intelligence and Computing, Tianjin University, Tianjin, China
|
68
|
Yang W, Zhu XJ, Huang J, Ding H, Lin H. A Brief Survey of Machine Learning Methods in Protein Sub-Golgi Localization. Curr Bioinform 2019. [DOI: 10.2174/1574893613666181113131415] [Citation(s) in RCA: 111] [Impact Index Per Article: 22.2] [Indexed: 01/08/2023]
Abstract
Background: The location of proteins in a cell can provide important clues to their functions in various biological processes. Thus, the application of machine learning methods to the prediction of protein subcellular localization has become a hotspot in bioinformatics. As one of the key organelles, the Golgi apparatus is in charge of protein storage, packaging, and distribution. Objective: The identification of protein location in the Golgi apparatus will provide in-depth insights into protein functions. Thus, machine learning-based methods of predicting protein location in the Golgi apparatus have been extensively explored. The development of protein sub-Golgi apparatus localization prediction should be reviewed to provide a complete background for the field. Method: The benchmark datasets, feature extraction, machine learning methods and published results were summarized. Results: We briefly introduced the recent progress in protein sub-Golgi apparatus localization prediction using machine learning methods and discussed their advantages and disadvantages. Conclusion: We pointed out the perspectives of machine learning methods in protein sub-Golgi localization prediction.
Affiliation(s)
- Wuritu Yang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, Sichuan, 610054, China
- Xiao-Juan Zhu
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, Sichuan, 610054, China
- Jian Huang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, Sichuan, 610054, China
- Hui Ding
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, Sichuan, 610054, China
- Hao Lin
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, Sichuan, 610054, China
|
69
|
Zhang J, Liu B. A Review on the Recent Developments of Sequence-based Protein Feature Extraction Methods. Curr Bioinform 2019. [DOI: 10.2174/1574893614666181212102749] [Citation(s) in RCA: 96] [Impact Index Per Article: 19.2] [Indexed: 11/22/2022]
Abstract
Background: Proteins play a crucial role in life activities, such as catalyzing metabolic reactions, DNA replication, responding to stimuli, etc. Identification of protein structures and functions is critical for both basic research and applications. Because the traditional experiments for studying the structures and functions of proteins are expensive and time consuming, computational approaches are highly desired. A key for computational methods is how to efficiently extract the features from the protein sequences. During the last decade, many powerful feature extraction algorithms have been proposed, significantly promoting the development of the studies of protein structures and functions. Objective: To help researchers catch up with the recent developments in this important field, an updated review is given in this study, focusing on the sequence-based feature extraction of proteins. Method: These sequence-based features of proteins were grouped into three categories: composition-based features, autocorrelation-based features and profile-based features. The detailed information of features in each group was introduced, and their advantages and disadvantages were discussed. Besides, some useful tools for generating these features are also introduced. Results: Generally, autocorrelation-based features outperform composition-based features, and profile-based features outperform autocorrelation-based features. The reason is that profile-based features consider the evolutionary information, which is useful for identification of protein structures and functions. However, profile-based features are more time consuming, because the multiple sequence alignment process is required. Conclusion: In this study, some recently proposed sequence-based features were introduced and discussed, such as basic k-mers, PseAAC, auto-cross covariance, top-n-gram etc. These features did make great contributions to the developments of protein sequence analysis. Future studies can focus on exploring the combinations of these features. Besides, techniques from other fields, such as signal processing, natural language processing (NLP), image processing etc., would also contribute to this important field, because natural languages (such as English) and protein sequences share some similarities. Therefore, the proteins can be treated as documents, and the features, such as k-mers, top-n-grams, motifs, can be treated as the words in the languages. Techniques from these fields will give some new ideas and strategies for extracting the features from proteins.
Affiliation(s)
- Jun Zhang
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, HIT Campus Shenzhen University Town, Xili, Shenzhen, Guangdong 518055, China
- Bin Liu
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, HIT Campus Shenzhen University Town, Xili, Shenzhen, Guangdong 518055, China
|
70
|
Liu S, Lu M, Li H, Zuo Y. Prediction of Gene Expression Patterns With Generalized Linear Regression Model. Front Genet 2019; 10:120. [PMID: 30886626 PMCID: PMC6409355 DOI: 10.3389/fgene.2019.00120] [Citation(s) in RCA: 132] [Impact Index Per Article: 26.4] [Received: 11/06/2018] [Accepted: 02/04/2019] [Indexed: 01/10/2023] Open
Abstract
Cell reprogramming has played important roles in medical science, such as tissue repair, organ reconstruction, disease treatment, new drug development, and new species breeding. Oct4, a core pluripotency factor, has especially played a key role in somatic cell reprogramming through transcriptional control and affects the expression level of genes by its combination intensity. However, the quantitative relationship between Oct4 combination intensity and target gene expression is still not clear. Therefore, firstly, a generalized linear regression method was constructed to predict gene expression values in promoter regions affected by Oct4 combination intensity. Training data, including Oct4 combination intensity and target gene expression, were from promoter regions of genes with different cell development stages. Additionally, the quantitative relationship between gene expression and Oct4 combination intensity was analyzed with the proposed model. Then, the quantitative relationship between gene expression and Oct4 combination intensity at each stage of cell development was classified into high and low levels. Experimental analysis showed that the combination height of Oct4-inhibited gene expression decremented by a temporal exponential value, whereas the combination width of Oct4-promoted gene expression incremented by a temporal logarithmic value. Experimental results showed that the proposed method can achieve goodness of fit with high confidence.
Affiliation(s)
- Shuai Liu
- College of Information Science and Engineering, Hunan Normal University, Changsha, China
- College of Computer Science, Inner Mongolia University, Hohhot, China
- Mengye Lu
- College of Computer Science, Inner Mongolia University, Hohhot, China
- Hanshuang Li
- College of Life Sciences, Inner Mongolia University, Hohhot, China
- The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Inner Mongolia University, Hohhot, China
- Yongchun Zuo
- College of Life Sciences, Inner Mongolia University, Hohhot, China
- The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Inner Mongolia University, Hohhot, China
|
71
|
Kabir M, Ahmad S, Iqbal M, Hayat M. iNR-2L: A two-level sequence-based predictor developed via Chou's 5-steps rule and general PseAAC for identifying nuclear receptors and their families. Genomics 2019; 112:276-285. [PMID: 30779939 DOI: 10.1016/j.ygeno.2019.02.006] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Received: 12/02/2018] [Revised: 01/09/2019] [Accepted: 02/07/2019] [Indexed: 12/25/2022]
Abstract
Nuclear receptor proteins (NRPs) perform a vital role in regulating gene expression. With the rapid growth of NRPs in the post-genomic era, it is highly desirable to identify NRPs and their sub-families accurately from their primary sequences. Several conventional methods have been used for discrimination of NRPs and their sub-families, but did not achieve considerable results. Consequently, a new two-level computational model, "iNR-2L", is developed. Two discrete methods, namely Dipeptide Composition and Tripeptide Composition, were used to formulate NRP sequences. Further, both descriptor spaces were merged to construct a hybrid space. Furthermore, the minimum redundancy and maximum relevance feature selection technique was employed in order to select salient features as well as reduce the noise and redundancy. The experimental outcomes showed that the proposed model iNR-2L achieved outstanding results. It is anticipated that the proposed computational model might be a practical and effective tool for academia and the research community.
Affiliation(s)
- Muhammad Kabir
- Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan; School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China.
- Saeed Ahmad
- Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan; School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China
- Muhammad Iqbal
- Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan
- Maqsood Hayat
- Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan.
|
72
|
HRGPred: Prediction of herbicide resistant genes with k-mer nucleotide compositional features and support vector machine. Sci Rep 2019; 9:778. [PMID: 30692561 PMCID: PMC6349872 DOI: 10.1038/s41598-018-37309-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 06/18/2018] [Accepted: 12/03/2018] [Indexed: 02/07/2023] Open
Abstract
Herbicide resistance (HR) is a major concern for agricultural producers as well as environmentalists. Resistance to commonly used herbicides is conferred by mutation(s) in the genes encoding herbicide target sites/proteins (GETS). Identification of these genes through wet-lab experiments is time consuming and expensive. Thus, a supervised learning-based computational model has been proposed in this study, which is the first of its kind for the prediction of seven classes of GETS. The cDNA sequences of the genes were initially transformed into numeric features based on the k-mer compositions and then supplied as input to the support vector machine. In the proposed SVM-based model, the prediction occurs in two stages: a binary classifier in the first stage discriminates the genes involved in conferring resistance to herbicides from other genes, and a multi-class classifier in the second stage assigns the herbicide resistant genes predicted in the first stage to one of the seven resistance classes. Overall classification accuracies were observed to be ~89% and >97% for binary and multi-class classifications, respectively. The proposed model achieved higher accuracy than the homology-based algorithms, viz. BLAST and Hidden Markov Model. Besides, the developed computational model achieved ~87% accuracy when tested with an independent dataset. An online prediction server, HRGPred (http://cabgrid.res.in:8080/hrgpred), has also been established to facilitate the prediction of GETS by the scientific community.
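The k-mer composition step described in the abstract above turns a nucleotide sequence into a fixed-length numeric vector. The following is a minimal sketch of one common way to compute it (normalized overlapping k-mer frequencies); the function name `kmer_composition`, the choice of k, and the handling of ambiguous bases are assumptions for illustration, not details confirmed by the HRGPred paper.

```python
from itertools import product

def kmer_composition(seq, k=3, alphabet="ACGT"):
    """Return the relative frequency of every possible k-mer in seq.

    Windows containing characters outside the alphabet (e.g. 'N') are
    skipped; this is an assumption, not a documented HRGPred rule.
    """
    seq = seq.upper()
    # Fixed feature order: all |alphabet|**k k-mers, lexicographic.
    counts = {"".join(p): 0 for p in product(alphabet, repeat=k)}
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
            total += 1
    return {kmer: (c / total if total else 0.0) for kmer, c in counts.items()}

# Toy sequence, for illustration only.
features = kmer_composition("ACGTAC", k=2)
print(len(features), features["AC"])
```

Because the dictionary always holds all 4^k keys in the same order, its values can be passed directly as a feature vector to a classifier such as an SVM.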
|
73
|
Jia J, Li X, Qiu W, Xiao X, Chou KC. iPPI-PseAAC(CGR): Identify protein-protein interactions by incorporating chaos game representation into PseAAC. J Theor Biol 2019; 460:195-203. [DOI: 10.1016/j.jtbi.2018.10.021] [Citation(s) in RCA: 78] [Impact Index Per Article: 15.6] [Received: 08/05/2018] [Revised: 09/16/2018] [Accepted: 10/08/2018] [Indexed: 01/11/2023]
|
74
|
Dao FY, Lv H, Wang F, Ding H. Recent Advances on the Machine Learning Methods in Identifying DNA Replication Origins in Eukaryotic Genomics. Front Genet 2018; 9:613. [PMID: 30619452 PMCID: PMC6295579 DOI: 10.3389/fgene.2018.00613] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 10/21/2018] [Accepted: 11/21/2018] [Indexed: 01/01/2023] Open
Abstract
The initiation sites of DNA replication are called origins of replication (ORIs), which are regulated by a set of regulatory proteins and play important roles in the basic biochemical processes during cell growth and division in all living organisms. Therefore, the study of ORIs is essential for understanding the cell-division cycle and gene expression regulation so that scholars can develop new strategies against genetic diseases by using the knowledge of DNA replication. Thus, the accurate identification of ORIs will provide key clues for DNA replication research and clinical medicine. Although conventional experiments can provide accurate results, they are time-consuming and costly. In contrast, bioinformatics-based methods can overcome these shortcomings. In particular, with the rapid accumulation of DNA sequences in the post-genomic era, it is highly desirable to develop high-throughput tools to identify ORIs based on sequence information. In this review, we summarize the current progress in computational prediction of eukaryotic ORIs, including the collection of benchmark datasets, the application of machine learning-based techniques, the results obtained by these methods, and the construction of web servers. Finally, we give future perspectives on ORI prediction. The review provides readers with a complete background of ORI prediction based on machine learning methods, which will be helpful for researchers studying DNA replication in depth and drug therapy for genetic defects.
Affiliation(s)
- Fu-Ying Dao
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Hao Lv
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Fang Wang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Hui Ding
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
|
75
|
Xiao X, Xu ZC, Qiu WR, Wang P, Ge HT, Chou KC. iPSW(2L)-PseKNC: A two-layer predictor for identifying promoters and their strength by hybrid features via pseudo K-tuple nucleotide composition. Genomics 2018; 111:1785-1793. [PMID: 30529532 DOI: 10.1016/j.ygeno.2018.12.001] [Citation(s) in RCA: 44] [Impact Index Per Article: 7.3] [Received: 10/31/2018] [Revised: 11/20/2018] [Accepted: 12/04/2018] [Indexed: 12/20/2022]
Abstract
The promoter is a regulatory DNA region about 81-1000 base pairs long, usually located near the transcription start site (TSS) upstream of a given gene. By combining with certain proteins called transcription factors, the promoter provides the starting point for regulated gene transcription, and hence plays a vitally important role in gene transcriptional regulation. With the explosive growth of DNA sequences in the post-genomic age, it has become an urgent challenge to develop computational methods for effectively identifying promoters because the information thus obtained is very useful for both basic research and drug development. Although some prediction methods were developed in this regard, most of them were limited to merely identifying whether a query DNA sequence is a promoter or not. However, based on their distinct strength levels for transcriptional activation and expression, promoters should be divided into two categories: strong and weak types. Here a new two-layer predictor, called "iPSW(2L)-PseKNC", was developed by fusing the physicochemical properties of nucleotides and their nucleotide density into PseKNC (pseudo K-tuple nucleotide composition). Its 1st layer serves to predict whether a query DNA sequence sample is a promoter or not, while its 2nd layer is able to predict the strength of promoters. It has been observed through rigorous cross-validations that the 1st-layer sub-predictor is remarkably superior to the existing state-of-the-art predictors in identifying promoters and non-promoters, and that the 2nd-layer sub-predictor can do what is beyond the reach of the existing predictors. Moreover, the web server for iPSW(2L)-PseKNC has been established at http://www.jci-bioinfo.cn/iPSW(2L)-PseKNC, by which the majority of experimental scientists can easily get the results they need.
Affiliation(s)
- Xuan Xiao
- Computer Department, Jingdezhen Ceramic Institute, Jingdezhen, China; The Gordon Life Science Institute, Boston, MA 02478, USA.
- Zhao-Chun Xu
- Computer Department, Jingdezhen Ceramic Institute, Jingdezhen, China.
- Wang-Ren Qiu
- Computer Department, Jingdezhen Ceramic Institute, Jingdezhen, China; The Gordon Life Science Institute, Boston, MA 02478, USA
- Peng Wang
- Computer Department, Jingdezhen Ceramic Institute, Jingdezhen, China
- Hui-Ting Ge
- Computer Department, Jingdezhen Ceramic Institute, Jingdezhen, China
- Kuo-Chen Chou
- The Gordon Life Science Institute, Boston, MA 02478, USA; Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China.
|
76
|
Xu L, Liang G, Liao C, Chen GD, Chang CC. An Efficient Classifier for Alzheimer's Disease Genes Identification. Molecules 2018; 23:3140. [PMID: 30501121 PMCID: PMC6321377 DOI: 10.3390/molecules23123140] [Citation(s) in RCA: 59] [Impact Index Per Article: 9.8] [Received: 10/24/2018] [Revised: 11/17/2018] [Accepted: 11/19/2018] [Indexed: 11/16/2022] Open
Abstract
Alzheimer’s disease (AD) is considered to be one of the 10 key diseases leading to death in humans. AD is considered the main cause of brain degeneration, and will lead to dementia. It is beneficial for affected patients to be diagnosed with the disease at an early stage so that efforts to manage the patient can begin as soon as possible. Most existing protocols diagnose AD by way of magnetic resonance imaging (MRI). However, because the size of the images produced is large, existing techniques that employ MRI technology are expensive and time-consuming to perform. With this in mind, in the current study, AD is predicted instead by the use of a support vector machine (SVM) method based on gene-coding protein sequence information. In our proposed method, the frequency of two consecutive amino acids is used to describe the sequence information. The experimental results show that the accuracy of the proposed method for identifying AD is 85.7%, and that the sequence information of gene-coding proteins can be used to predict AD.
Affiliation(s)
- Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen 518055, China.
- Guangmin Liang
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen 518055, China.
- Changrui Liao
- Key Laboratory of Optoelectronic Devices and Systems of Ministry of Education and Guangdong Province, College of Optoelectronic Engineering, Shenzhen University, Shenzhen 518060, China.
- Gin-Den Chen
- Department of Obstetrics and Gynecology, Chung Shan Medical University Hospital, Taichung 40201, Taiwan.
- Chi-Chang Chang
- School of Medical Informatics, Chung Shan Medical University, Taichung 40201, Taiwan.
- IT Office, Chung Shan Medical University Hospital, Taichung 40201, Taiwan.
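The abstract above describes each protein by "the frequency of two consecutive amino acids", which is usually called dipeptide composition and yields a 400-dimensional vector. The following is a minimal sketch under that reading; the function name `dipeptide_composition`, the normalization by the number of valid windows, and the skipping of non-standard residues are assumptions, not details confirmed by the paper.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def dipeptide_composition(protein):
    """Relative frequency of each of the 400 possible adjacent residue pairs.

    Pairs containing a non-standard residue (e.g. 'X') are skipped;
    this is an illustrative assumption.
    """
    protein = protein.upper()
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    total = 0
    for i in range(len(protein) - 1):
        pair = protein[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    return {pair: (c / total if total else 0.0) for pair, c in counts.items()}

# Toy peptide, for illustration only.
vec = dipeptide_composition("MKKL")
print(len(vec), vec["KK"])
```

The resulting values, taken in the dictionary's fixed key order, form the kind of fixed-length input an SVM requires regardless of protein length.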
|
77
|
Di Gangi M, Lo Bosco G, Rizzo R. Deep learning architectures for prediction of nucleosome positioning from sequences data. BMC Bioinformatics 2018; 19:418. [PMID: 30453896 PMCID: PMC6245688 DOI: 10.1186/s12859-018-2386-9] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Indexed: 01/17/2023] Open
Abstract
Background: Nucleosomes are DNA-histone complexes, each wrapping about 150 base pairs of double-stranded DNA. They are fundamental to one of the primary functions of chromatin, i.e. packing the DNA into the nucleus of eukaryotic cells. Several biological studies have shown that nucleosome positioning influences the regulation of cell type-specific gene activities. Moreover, computational studies have shown evidence of sequence specificity concerning the DNA fragment wrapped into nucleosomes, clearly underlined by the organization of particular DNA substrings. As the main consequence, the identification of nucleosomes on a genomic scale has been successfully performed by computational methods using a sequence features representation. Results: In this work, we propose a deep learning model for nucleosome identification. Our model stacks convolutional layers and Long Short-Term Memory layers to automatically extract features from short- and long-range dependencies in a sequence. Using this model we are able to avoid the feature extraction and selection steps while improving the classification performance. Conclusions: Results computed on eleven data sets of five different organisms, from yeast to human, show the superiority of the proposed method with respect to the state of the art recently presented in the literature.
Affiliation(s)
- Mattia Di Gangi
- Fondazione Bruno Kessler, Via Sommarive, 18, Trento, 38123, Italy; ICT International Doctoral School, Via Sommarive, 9, Trento, 38123, Italy
- Giosuè Lo Bosco
- Dipartimento di Matematica e Informatica, Università degli studi di Palermo, Via Archirafi, 34, Palermo, 90123, Italy; Dipartimento di Scienze per l'Innovazione tecnologica, Istituto Euro-Mediterraneo di Scienza e Tecnologia, Via Michele Miraglia, 20, Palermo, 90139, Italy.
- Riccardo Rizzo
- CNR-ICAR, National Research Council of Italy, Via Ugo La Malfa, 153, Palermo, 90146, Italy
|
78
|
Dao FY, Lv H, Wang F, Feng CQ, Ding H, Chen W, Lin H. Identify origin of replication in Saccharomyces cerevisiae using two-step feature selection technique. Bioinformatics 2018; 35:2075-2083. [DOI: 10.1093/bioinformatics/bty943] [Citation(s) in RCA: 147] [Impact Index Per Article: 24.5] [Received: 07/30/2018] [Revised: 11/06/2018] [Accepted: 11/13/2018] [Indexed: 02/07/2023] Open
Affiliation(s)
- Fu-Ying Dao
- Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Hao Lv
- Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Fang Wang
- Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Chao-Qin Feng
- Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Hui Ding
- Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Wei Chen
- Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu, China
- Hao Lin
- Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
|
79
|
|
80
|
Ariaeenejad S, Mousivand M, Moradi Dezfouli P, Hashemi M, Kavousi K, Hosseini Salekdeh G. A computational method for prediction of xylanase enzymes activity in strains of Bacillus subtilis based on pseudo amino acid composition features. PLoS One 2018; 13:e0205796. [PMID: 30346964 PMCID: PMC6197662 DOI: 10.1371/journal.pone.0205796] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 08/22/2017] [Accepted: 10/02/2018] [Indexed: 01/09/2023] Open
Abstract
Xylanases are hydrolytic enzymes which based on physicochemical properties, structure, mode of action and substrate specificities are classified into various glycoside hydrolase (GH) families. The purpose of this study is to show that the activity of the members of the xylanase family in the specified pH and temperature conditions can be computationally predicted. The proposed computational regression model was trained and tested with the Pseudo Amino Acid Composition (PseAAC) features extracted solely from the amino acid sequences of enzymes. The xylanases with experimentally determined activities were used as the training dataset to adjust the model parameters. To develop the model, 41 strains of Bacillus subtilis isolated from field soil were screened. From them, 28 strains with the highest halo diameter were selected for further studies. The performance of the model for prediction of xylanase activity was evaluated in three different temperature and pH conditions using stratified cross-validation and jackknife methods. The trained model can be used for determining the activity of newly found xylanases in the specified condition. Such computational models help to scale down the experimental costs and save time by identifying enzymes with appropriate activity for scientific and industrial usage. Our methodology for activity prediction of xylanase enzymes can be potentially applied to the members of the other enzyme families. The availability of sufficient experimental data in specified pH and temperature conditions is a prerequisite for training the learning model and to achieve high accuracy.
Affiliation(s)
- Shohreh Ariaeenejad
- Department of Systems Biology, Agricultural Biotechnology Research Institute of Iran (ABRII), Agricultural Research Education and Extension Organization (AREO), Karaj, Iran
| | - Maryam Mousivand
- Department of Microbial Biotechnology, Agricultural Biotechnology Research Institute of Iran (ABRII), Agricultural Research Education and Extension Organization (AREO), Karaj, Iran
| | - Parinaz Moradi Dezfouli
- Department of Microbial Biotechnology, Agricultural Biotechnology Research Institute of Iran (ABRII), Agricultural Research Education and Extension Organization (AREO), Karaj, Iran
| | - Maryam Hashemi
- Department of Microbial Biotechnology, Agricultural Biotechnology Research Institute of Iran (ABRII), Agricultural Research Education and Extension Organization (AREO), Karaj, Iran
| | - Kaveh Kavousi
- Institute of Biochemistry and Biophysics (IBB), University of Tehran, Tehran, Iran
| | - Ghasem Hosseini Salekdeh
- Department of Systems Biology, Agricultural Biotechnology Research Institute of Iran (ABRII), Agricultural Research Education and Extension Organization (AREO), Karaj, Iran
| |
|
81
|
Tahir M, Hayat M, Khan SA. iNuc-ext-PseTNC: an efficient ensemble model for identification of nucleosome positioning by extending the concept of Chou's PseAAC to pseudo-tri-nucleotide composition. Mol Genet Genomics 2018; 294:199-210. [PMID: 30291426 DOI: 10.1007/s00438-018-1498-2] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Received: 04/30/2018] [Accepted: 09/28/2018] [Indexed: 10/28/2022]
Abstract
The nucleosome is a central element of eukaryotic chromatin and is composed of histone proteins and DNA molecules. It plays vital roles in many eukaryotic intra-nuclear processes, for instance, chromatin structure formation and transcriptional regulation. Identifying nucleosome positioning in the wet lab is difficult, so attention has turned towards accurate, intelligent, automated prediction. In this regard, a novel intelligent automated model, "iNuc-ext-PseTNC", is developed to identify nucleosome positioning in genomes accurately. In this predictor, DNA sequences are mathematically represented by two different discrete feature extraction techniques, namely pseudo-tri-nucleotide composition (PseTNC) and pseudo-di-nucleotide composition. Several contemporary machine learning algorithms were examined. Further, the predictions of the individual classifiers were integrated through an evolutionary genetic algorithm. The success rates of the ensemble model are higher than those of the individual classifiers. Analysis of the prediction results shows that the iNuc-ext-PseTNC model performed best in combination with the PseTNC feature space, with accuracies of 94.3%, 93.14%, and 88.60% using a six-fold cross-validation test on the three benchmark datasets S1, S2, and S3, respectively. These outcomes show that the iNuc-ext-PseTNC model compares favorably with the existing methods reported in the literature so far. The proposed model may be a fruitful and practical tool for basic academic research.
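As an illustration of the kind of discrete sequence encoding this abstract refers to, the following is a minimal sketch of plain tri-nucleotide composition. It covers only the basic k-mer frequency part and omits the pseudo (sequence-order correlation) components of PseTNC; the function name `kmer_composition` is hypothetical and is not taken from the paper.

```python
from itertools import product

def kmer_composition(seq, k=3):
    """Return normalized k-mer frequencies in lexicographic ACGT order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing ambiguous bases such as N
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]

# A 10-nt toy sequence yields a 64-dimensional feature vector for k=3.
features = kmer_composition("ACGTACGTAC")
```

A vector like this could be passed to any standard classifier; the ensemble and genetic-algorithm steps described in the abstract are not sketched here.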
Affiliation(s)
- Muhammad Tahir
- Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, KP, Pakistan
| | - Maqsood Hayat
- Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, KP, Pakistan.
| | - Sher Afzal Khan
- Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, KP, Pakistan
| |
|
82
|
Contreras-Torres E. Predicting structural classes of proteins by incorporating their global and local physicochemical and conformational properties into general Chou's PseAAC. J Theor Biol 2018; 454:139-145. [DOI: 10.1016/j.jtbi.2018.05.033] [Citation(s) in RCA: 50] [Impact Index Per Article: 8.3] [Received: 04/25/2018] [Revised: 05/23/2018] [Accepted: 05/28/2018] [Indexed: 11/24/2022]
|
83
|
Chen W, Ding H, Zhou X, Lin H, Chou KC. iRNA(m6A)-PseDNC: Identifying N6-methyladenosine sites using pseudo dinucleotide composition. Anal Biochem 2018; 561-562:59-65. [PMID: 30201554 DOI: 10.1016/j.ab.2018.09.002] [Citation(s) in RCA: 126] [Impact Index Per Article: 21.0] [Received: 07/26/2018] [Revised: 08/31/2018] [Accepted: 09/03/2018] [Indexed: 01/28/2023]
Abstract
As a prevalent post-transcriptional modification, N6-methyladenosine (m6A) plays key roles in a series of biological processes. Although experimental technologies have been developed and applied to identify m6A sites, they are still not cost-effective for transcriptome-wide detection of m6A. As good complements to the experimental techniques, some computational methods have been proposed to identify m6A sites. However, their performance remains unsatisfactory. In this study, we first proposed a Euclidean distance-based method to construct a high-quality benchmark dataset. By encoding the RNA sequences using pseudo nucleotide composition, a new predictor called iRNA(m6A)-PseDNC was developed to identify m6A sites in the Saccharomyces cerevisiae genome. The 10-fold cross-validation test demonstrated that the performance of iRNA(m6A)-PseDNC is superior to that of the existing methods. Meanwhile, for the convenience of most experimental scientists, a web server has been established at http://lin-group.cn/server/iRNA(m6A)-PseDNC.php, by which users can easily get their desired results without needing to go through the detailed mathematics. It is anticipated that iRNA(m6A)-PseDNC will become a useful high-throughput tool for identifying m6A sites in the S. cerevisiae genome.
Affiliation(s)
- Wei Chen
- School of Sciences, Center for Genomics and Computational Biology, North China University of Science and Technology, Tangshan, 063000, China; Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu, 611730, China; Gordon Life Science Institute, Boston, MA, 02478, USA.
| | - Hui Ding
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, 610054, China.
| | - Xu Zhou
- School of Sciences, Center for Genomics and Computational Biology, North China University of Science and Technology, Tangshan, 063000, China.
| | - Hao Lin
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, 610054, China; Gordon Life Science Institute, Boston, MA, 02478, USA.
| | - Kuo-Chen Chou
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, 610054, China; Gordon Life Science Institute, Boston, MA, 02478, USA.
| |
|
84
|
Rahman MS, Aktar U, Jani MR, Shatabda S. iPro70-FMWin: identifying Sigma70 promoters using multiple windowing and minimal features. Mol Genet Genomics 2018; 294:69-84. [PMID: 30187132 DOI: 10.1007/s00438-018-1487-5] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Received: 07/13/2018] [Accepted: 08/29/2018] [Indexed: 01/16/2023]
Abstract
In bacterial DNA, there are specific sequences of nucleotides called promoters that can bind to the RNA polymerase. Sigma70 (σ70) is one of the most important promoter sequences due to its presence in most of the DNA regulatory functions. In this paper, we identify the most effective and optimal sequence-based features for the prediction of σ70 promoter sequences in a bacterial genome. We used both short-range and long-range DNA sequences in our proposed method. A very small number of effective features are selected from a large number of extracted features using multiple windows of different sizes within the DNA sequences. We call our prediction method iPro70-FMWin and, for the convenience of researchers, have made it freely accessible online via a web application at http://ipro70.pythonanywhere.com/server. We tested our method using a standard benchmark dataset. In the experiments, iPro70-FMWin achieved an area under the receiver operating characteristic curve of 0.959 and an accuracy of 90.57%, significantly outperforming the state-of-the-art predictors.
Affiliation(s)
- Md Siddiqur Rahman
- Department of Computer Science and Engineering, United International University, Madani Avenue, Satarkul, Badda, Dhaka, 1212, Bangladesh
| | - Usma Aktar
- Department of Computer Science and Engineering, United International University, Madani Avenue, Satarkul, Badda, Dhaka, 1212, Bangladesh
| | - Md Rafsan Jani
- Department of Computer Science and Engineering, United International University, Madani Avenue, Satarkul, Badda, Dhaka, 1212, Bangladesh
| | - Swakkhar Shatabda
- Department of Computer Science and Engineering, United International University, Madani Avenue, Satarkul, Badda, Dhaka, 1212, Bangladesh.
| |
|
85
|
Tan JX, Dao FY, Lv H, Feng PM, Ding H. Identifying Phage Virion Proteins by Using Two-Step Feature Selection Methods. Molecules 2018; 23:molecules23082000. [PMID: 30103458 PMCID: PMC6222849 DOI: 10.3390/molecules23082000] [Citation(s) in RCA: 34] [Impact Index Per Article: 5.7] [Received: 07/13/2018] [Revised: 07/30/2018] [Accepted: 08/08/2018] [Indexed: 12/31/2022] Open
Abstract
Accurate identification of phage virion proteins is not only a key step for understanding their function but also helpful for further understanding the lysis mechanism of the bacterial cell. Since traditional experimental methods for identifying phage virion proteins are time-consuming and costly, there is an urgent need to apply machine learning methods to identify them accurately and efficiently. In this work, a support vector machine (SVM) based method was proposed by mixing multiple sets of optimal g-gap dipeptide compositions. The analysis of variance (ANOVA) and the minimal-redundancy-maximal-relevance (mRMR) method with incremental feature selection (IFS) were applied to single out the optimal feature set. In the five-fold cross-validation test, the proposed method achieved an overall accuracy of 87.95%. We believe that the proposed method will become an efficient and powerful tool for scientists studying phage virion proteins.
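The g-gap dipeptide composition mentioned in this abstract can be sketched as follows. This is a minimal illustration that assumes the usual reading of the term (frequencies of residue pairs separated by g intervening residues, giving 400 values per g); the function name and the handling of non-standard residues are assumptions, not details from the paper.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def g_gap_dipeptide_composition(seq, g=0):
    """Frequencies of residue pairs separated by g intervening residues (400 values)."""
    seq = seq.upper()
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    total = 0
    for i in range(len(seq) - g - 1):
        pair = seq[i] + seq[i + g + 1]
        if pair in counts:  # ignore pairs with non-standard residues (assumption)
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(pairs)
    return [counts[p] / total for p in pairs]

# With g=1 the toy sequence "ACDA" contributes the pairs A.D and C.A.
vec = g_gap_dipeptide_composition("ACDA", g=1)
```

With g=0 this reduces to the ordinary adjacent dipeptide composition; vectors for several values of g could be concatenated before feature selection.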
Affiliation(s)
- Jiu-Xin Tan
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China.
| | - Fu-Ying Dao
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China.
| | - Hao Lv
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China.
| | - Peng-Mian Feng
- Hebei Province Key Laboratory of Occupational Health and Safety for Coal Industry, School of Public Health, North China University of Science and Technology, Tangshan 063000, China.
| | - Hui Ding
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China.
| |
|
86
|
Liu G, Liu GJ, Tan JX, Lin H. DNA physical properties outperform sequence compositional information in classifying nucleosome-enriched and -depleted regions. Genomics 2018; 111:1167-1175. [PMID: 30055231 DOI: 10.1016/j.ygeno.2018.07.013] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 03/05/2018] [Revised: 07/07/2018] [Accepted: 07/15/2018] [Indexed: 12/15/2022]
Abstract
The nucleosome is the fundamental structural unit of eukaryotic chromatin and plays an essential role in the epigenetic regulation of cellular processes, such as DNA replication, recombination, and transcription. Hence, it is important to identify nucleosome positions in the genome. Our previous model based on DNA deformation energy, in which a set of DNA physical descriptors was used, performed well in predicting nucleosome dyad positions and occupancy. In this study, we established a machine-learning model for predicting nucleosome occupancy in order to further verify the physical descriptors. Results showed that (1) our model outperformed several other sequence compositional information-based models, indicating a stronger dependence of nucleosome positioning on DNA physical properties; (2) nucleosome-enriched and -depleted regions have distinct features in terms of DNA physical descriptors like sequence-dependent flexibility and equilibrium structure parameters; (3) gene transcription start sites and termination sites can be well characterized with the distribution patterns of the physical descriptors, indicating the regulatory role of DNA physical properties in gene transcription. In addition, we developed a web server for the model, which is freely accessible at http://lin-group.cn/server/iNuc-force/.
Affiliation(s)
- Guoqing Liu
- The School of Life Science and Technology, Inner Mongolia University of Science and Technology, Baotou 014010, China.
| | - Guo-Jun Liu
- School of Natural Sciences and Mathematics, Ural Federal University, Ekaterinburg 620000, Russia
| | - Jiu-Xin Tan
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Hao Lin
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China.
| |
|
87
|
He W, Jia C, Zou Q. 4mCPred: machine learning methods for DNA N4-methylcytosine sites prediction. Bioinformatics 2018; 35:593-601. [DOI: 10.1093/bioinformatics/bty668] [Citation(s) in RCA: 103] [Impact Index Per Article: 17.2] [Received: 05/21/2018] [Revised: 07/07/2018] [Accepted: 07/24/2018] [Indexed: 11/14/2022] Open
Affiliation(s)
- Wenying He
- School of Computer Science and Technology, Tianjin University, Tianjin, China
| | - Cangzhi Jia
- Department of Mathematics, Dalian Maritime University, Dalian, China
| | - Quan Zou
- School of Computer Science and Technology, Tianjin University, Tianjin, China
| |
|
88
|
Al Maruf MA, Shatabda S. iRSpot-SF: Prediction of recombination hotspots by incorporating sequence based features into Chou's Pseudo components. Genomics 2018; 111:966-972. [PMID: 29935224 DOI: 10.1016/j.ygeno.2018.06.003] [Citation(s) in RCA: 30] [Impact Index Per Article: 5.0] [Received: 04/04/2018] [Revised: 06/09/2018] [Accepted: 06/13/2018] [Indexed: 11/28/2022]
Abstract
Recombination hotspots in a genome are unevenly distributed. Hotspots are regions in a genome that show higher rates of meiotic recombination. Computational methods for recombination hotspot prediction often use sophisticated features that are derived from physico-chemical or structure-based properties of nucleotides. In this paper, we propose iRSpot-SF, which uses sequence-based features that are computationally cheap to generate. Four feature groups are used in our method: k-mer composition, gapped k-mer composition, TF-IDF of k-mers, and reverse complement k-mer composition. We used recursive feature elimination to select the top 17 features for hotspot prediction. Our analysis shows the superiority of the gapped k-mer composition and reverse complement k-mer composition features over the others. We used an SVM with an RBF kernel as the classification algorithm and tested our algorithm on standard benchmark datasets. Compared to other methods, iRSpot-SF produces significantly better results in terms of accuracy, Matthews correlation coefficient, and sensitivity, which are 84.58%, 0.6941, and 84.57%, respectively. We have made our method readily available as a Python-based tool, with the datasets and source code available at: https://github.com/abdlmaruf/iRSpot-SF. A web application based on iRSpot-SF is freely available at: http://irspot.pythonanywhere.com/server.html.
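Two of the feature groups named in this abstract can be illustrated with a short sketch. The definitions below are assumptions about how such features are commonly computed (merging each k-mer with its reverse complement, and counting base pairs separated by a fixed gap); the actual feature definitions live in the iRSpot-SF repository and may differ. The function names are hypothetical.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(kmer):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return kmer.translate(COMPLEMENT)[::-1]

def rc_kmer_counts(seq, k=2):
    """Count k-mers, merging each k-mer with its reverse complement."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            # Use the lexicographically smaller of the pair as the shared key.
            counts[min(kmer, reverse_complement(kmer))] += 1
    return counts

def gapped_kmer_counts(seq, gap=1):
    """Count pairs of bases separated by `gap` positions, keyed like 'A_G'."""
    counts = Counter()
    for i in range(len(seq) - gap - 1):
        a, b = seq[i], seq[i + gap + 1]
        if a in "ACGT" and b in "ACGT":
            counts[a + "_" * gap + b] += 1
    return counts

rc = rc_kmer_counts("ACGT", k=2)
gapped = gapped_kmer_counts("ACGT", gap=1)
```

Raw counts are returned here; dividing by the number of windows would give compositions comparable across sequences of different lengths.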
Affiliation(s)
- Md Abdullah Al Maruf
- Department of Computer Science and Engineering, United International University, Madani Avenue, Satarkul, Badda, Dhaka 1212, Bangladesh
| | - Swakkhar Shatabda
- Department of Computer Science and Engineering, United International University, Madani Avenue, Satarkul, Badda, Dhaka 1212, Bangladesh.
| |
|
89
|
Xu L, Liang G, Shi S, Liao C. SeqSVM: A Sequence-Based Support Vector Machine Method for Identifying Antioxidant Proteins. Int J Mol Sci 2018; 19:ijms19061773. [PMID: 29914044 PMCID: PMC6032279 DOI: 10.3390/ijms19061773] [Citation(s) in RCA: 71] [Impact Index Per Article: 11.8] [Received: 05/21/2018] [Revised: 06/10/2018] [Accepted: 06/11/2018] [Indexed: 12/20/2022] Open
Abstract
Antioxidant proteins can be beneficial in disease prevention, and increasing attention has been paid to their functionality. Identifying antioxidant proteins is therefore an important task. In our work, we propose a computational method, called SeqSVM, for predicting antioxidant proteins based on their primary sequence features. Redundant features are removed using the max-relevance-max-distance method. Finally, the antioxidant proteins are identified by a support vector machine (SVM). The experimental results demonstrated that our method performs better than existing methods, with an overall accuracy of 89.46%. Although the proposed computational method can attain an encouraging classification result, the predicted results still need to be verified by biochemical approaches, such as wet biochemistry and molecular biology techniques.
Affiliation(s)
- Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen 518060, China.
| | - Guangmin Liang
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen 518060, China.
| | - Shuhua Shi
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen 518060, China.
| | - Changrui Liao
- Key Laboratory of Optoelectronic Devices and Systems of Ministry of Education and Guangdong Province, College of Optoelectronic Engineering, Shenzhen University, Shenzhen 518060, China.
| |
|
90
|
The implication of DNA bending energy for nucleosome positioning and sliding. Sci Rep 2018; 8:8853. [PMID: 29891930 PMCID: PMC5995830 DOI: 10.1038/s41598-018-27247-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 11/13/2017] [Accepted: 05/24/2018] [Indexed: 11/24/2022] Open
Abstract
The nucleosome not only directly affects cellular processes, such as DNA replication, recombination, and transcription, but also serves as a fundamentally important target of epigenetic modifications. Our previous study indicated that the bending property of DNA is important in nucleosome formation, particularly in predicting the dyad positions of nucleosomes on a DNA segment. Here, we investigated in depth the role of bending energy in nucleosome positioning and sliding to decipher the sequence-directed mechanism. The results show that bending energy is a good physical index for predicting the free energy in the process of nucleosome reconstitution in vitro. Our data also imply that at least 20% of the nucleosomes in budding yeast do not adopt canonical positioning, in which the underlying sequences wrapped around histones are structurally symmetric. We also revealed distinct patterns of bending energy profile for distinctly organized chromatin structures, such as well-positioned nucleosomes, fuzzy nucleosomes, and linker regions, and discussed nucleosome sliding in terms of bending energy. We proposed that the stability of a nucleosome is positively correlated with the strength of the bending anisotropy of the DNA segment, and that both the accessibility and the directionality of nucleosome sliding are likely to be modulated by diverse patterns of DNA bending energy profile.
|
91
|
Lim LWK, Chung HH, Chong YL, Lee NK. A survey of recently emerged genome-wide computational enhancer predictor tools. Comput Biol Chem 2018; 74:132-141. [DOI: 10.1016/j.compbiolchem.2018.03.019] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Received: 12/13/2017] [Revised: 03/13/2018] [Accepted: 03/13/2018] [Indexed: 12/19/2022]
|
92
|
Zhang M, Xu Y, Li L, Liu Z, Yang X, Yu DJ. Accurate RNA 5-methylcytosine site prediction based on heuristic physical-chemical properties reduction and classifier ensemble. Anal Biochem 2018; 550:41-48. [DOI: 10.1016/j.ab.2018.03.027] [Citation(s) in RCA: 36] [Impact Index Per Article: 6.0] [Received: 12/25/2017] [Revised: 03/27/2018] [Accepted: 03/28/2018] [Indexed: 11/25/2022]
|
93
|
Meher PK, Sahu TK, Mohanty J, Gahoi S, Purru S, Grover M, Rao AR. nifPred: Proteome-Wide Identification and Categorization of Nitrogen-Fixation Proteins of Diaztrophs Based on Composition-Transition-Distribution Features Using Support Vector Machine. Front Microbiol 2018; 9:1100. [PMID: 29896173 PMCID: PMC5986947 DOI: 10.3389/fmicb.2018.01100] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 12/04/2017] [Accepted: 05/08/2018] [Indexed: 11/13/2022] Open
Abstract
As inorganic nitrogen compounds are essential for the basic building blocks of life (e.g., nucleotides and amino acids), the role of biological nitrogen fixation (BNF) is indispensable. All nitrogen-fixing microbes rely on the same nitrogenase enzyme for nitrogen reduction, which is in fact an enzyme complex encoded by as many as 20 genes. However, the occurrence of six genes, viz., nifB, nifD, nifE, nifH, nifK, and nifN, has been proposed to be essential for a functional nitrogenase enzyme. Therefore, identification of these genes is important for understanding the mechanism of BNF as well as for exploring the possibilities of improving BNF from an agricultural sustainability point of view. Further, though computational tools are available for the annotation and phylogenetic analysis of nifH gene sequences alone, to the best of our knowledge no tool is available for the computational prediction of the above-mentioned six categories of nitrogen-fixation (nif) genes or proteins. Thus, we proposed an approach, the first of its kind, for the computational identification of nif proteins encoded by the six categories of nif genes. Sequence-derived features were employed to map the input sequences into vectors of numeric observations that were subsequently fed to the support vector machine as input. Two types of classifier were constructed: (i) a binary classifier for classification of nif and non-nitrogen-fixation (non-nif) proteins, and (ii) a multi-class classifier for classification of the six categories of nif proteins. Higher accuracies were observed for the combination of the composition-transition-distribution (CTD) feature set and the radial kernel than for the other feature-kernel combinations. The overall accuracies were >90% in both binary and multi-class classification. The developed approach further achieved >92% accuracy when evaluated with blind (independent) test datasets. It also produced higher accuracy in identifying nif proteins when evaluated using proteome-wide datasets of several species. Furthermore, we established a prediction server, nifPred (http://webapp.cabgrid.res.in/nifPred), to assist the scientific community in the proteome-wide identification of the six categories of nif proteins. The source code of nifPred is also available at https://github.com/PrabinaMeher/nifPred. The developed web server is expected to supplement transcriptional profiling and comparative genomics studies for the identification and functional annotation of genes related to BNF.
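To give a rough idea of what composition-transition-distribution (CTD) features look like, here is a minimal sketch of the composition and transition parts only. The three-class hydrophobicity grouping below is an assumption chosen for illustration; the abstract does not state which residue properties or groupings nifPred uses, and the distribution descriptors are omitted.

```python
# Assumed three-class grouping by hydrophobicity (illustrative only):
# class 1 = polar, class 2 = neutral, class 3 = hydrophobic.
GROUPS = {"1": "RKEDQN", "2": "GASTPHY", "3": "CLVIMFW"}
CLASS_OF = {aa: label for label, members in GROUPS.items() for aa in members}

def ctd_composition_transition(seq):
    """Return [C1, C2, C3, T12, T13, T23] for a protein sequence."""
    encoded = [CLASS_OF[aa] for aa in seq.upper() if aa in CLASS_OF]
    if len(encoded) < 2:
        raise ValueError("need at least two standard residues")
    n = len(encoded)
    # Composition: fraction of residues falling in each class.
    composition = [encoded.count(c) / n for c in "123"]
    # Transition: fraction of adjacent pairs that switch between two classes.
    pairs = list(zip(encoded, encoded[1:]))
    transition = [
        sum(1 for a, b in pairs if {a, b} == {x, y}) / len(pairs)
        for x, y in (("1", "2"), ("1", "3"), ("2", "3"))
    ]
    return composition + transition

ctd = ctd_composition_transition("RGCR")
```

In a fuller CTD encoding the same computation would be repeated for several residue properties and extended with distribution descriptors before being passed to a classifier.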
Affiliation(s)
- Prabina K Meher
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
| | - Tanmaya K Sahu
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
| | - Jyotilipsa Mohanty
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India; Department of Bioinformatics, Orissa University of Agriculture and Technology, Bhubaneswar, India
| | - Shachi Gahoi
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
| | - Supriya Purru
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
| | - Monendra Grover
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
| | - Atmakuri R Rao
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
| |
|
94
|
Yang H, Qiu WR, Liu G, Guo FB, Chen W, Chou KC, Lin H. iRSpot-Pse6NC: Identifying recombination spots in Saccharomyces cerevisiae by incorporating hexamer composition into general PseKNC. Int J Biol Sci 2018; 14:883-891. [PMID: 29989083 PMCID: PMC6036749 DOI: 10.7150/ijbs.24616] [Citation(s) in RCA: 135] [Impact Index Per Article: 22.5] [Received: 12/28/2017] [Accepted: 02/04/2018] [Indexed: 02/06/2023] Open
Abstract
Meiotic recombination is caused by meiotic double-strand DNA breaks. In some regions the frequency of DNA recombination is relatively higher, while in other regions it is lower: the former is usually called a "recombination hotspot" and the latter a "recombination coldspot". Information about the hot and cold spots may provide important clues for understanding the mechanism of genome evolution. Therefore, it is important to accurately predict these spots. In this study, we rebuilt the benchmark dataset by unifying its samples to the same length (131 bp). On this foundation, and using an SVM (Support Vector Machine) classifier, a new predictor called "iRSpot-Pse6NC" was developed by incorporating the key hexamer features into the general PseKNC (Pseudo K-tuple Nucleotide Composition) via the binomial distribution approach. Rigorous cross-validation showed that the proposed predictor is superior to its counterparts in overall accuracy, stability, sensitivity, and specificity. For the convenience of most experimental scientists, the web server for iRSpot-Pse6NC has been established at http://lin-group.cn/server/iRSpot-Pse6NC, by which users can easily obtain their desired results without needing to go through the detailed mathematical equations involved.
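The abstract names a binomial distribution approach for picking key hexamers but does not spell it out. The sketch below shows one common formulation, stated here as an assumption: for a k-mer seen `n_total` times overall and `n_in_class` times in one class, with `q` the fraction of all k-mer occurrences that fall in that class, a small upper-tail probability suggests the k-mer is enriched in the class. The function and parameter names are hypothetical.

```python
from math import comb

def binomial_tail(n_total, n_in_class, q):
    """P(X >= n_in_class) for X ~ Binomial(n_total, q)."""
    return sum(
        comb(n_total, m) * q**m * (1 - q) ** (n_total - m)
        for m in range(n_in_class, n_total + 1)
    )

def confidence_level(n_total, n_in_class, q):
    """1 - tail probability; values near 1 suggest enrichment in the class."""
    return 1.0 - binomial_tail(n_total, n_in_class, q)

# A hexamer seen 10 times overall, 9 of them in hotspots, with q = 0.5.
cl = confidence_level(10, 9, 0.5)
```

Features could then be ranked by this confidence level and the top-ranked hexamers kept; the threshold and ranking procedure used by iRSpot-Pse6NC are not given in the abstract.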
Affiliation(s)
- Hui Yang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Wang-Ren Qiu
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China; Computer Department, Jingdezhen Ceramic Institute, Jingdezhen, 333403, China
| | - Guoqing Liu
- School of Life Science and Technology, Inner Mongolia University of Science and Technology, Baotou, 014010, China
| | - Feng-Biao Guo
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Wei Chen
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China; Department of Physics, School of Sciences, and Center for Genomics and Computational Biology, North China University of Science and Technology, Tangshan 063000, China; Gordon Life Science Institute, Boston, MA 02478, USA
| | - Kuo-Chen Chou
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China; Gordon Life Science Institute, Boston, MA 02478, USA
| | - Hao Lin
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China; Gordon Life Science Institute, Boston, MA 02478, USA
| |
|
95
|
Tang H, Zhao YW, Zou P, Zhang CM, Chen R, Huang P, Lin H. HBPred: a tool to identify growth hormone-binding proteins. Int J Biol Sci 2018; 14:957-964. [PMID: 29989085 PMCID: PMC6036759 DOI: 10.7150/ijbs.24174] [Citation(s) in RCA: 136] [Impact Index Per Article: 22.7] [Received: 12/04/2017] [Accepted: 01/15/2018] [Indexed: 12/19/2022] Open
Abstract
Hormone-binding protein (HBP) is a kind of soluble carrier protein that can selectively and non-covalently interact with hormones. HBP plays an important role in growth, but its function is still unclear. Correct recognition of HBPs is the first step towards further studying their function and understanding their biological processes. However, it is difficult to correctly recognize HBPs among the growing number of proteins through traditional biochemical experiments because of the high experimental cost and long experimental period. To overcome these disadvantages, we designed a computational method for identifying HBPs accurately. First, we collected HBP data from UniProt to establish a high-quality benchmark dataset. Based on the dataset, the dipeptide composition was extracted from HBP residue sequences. To find the optimal features that provide key clues for HBP identification, the analysis of variance (ANOVA) was performed for feature ranking. The optimal features were selected through the incremental feature selection strategy. Subsequently, the features were input into a support vector machine (SVM) to construct the prediction model. Jackknife cross-validation results showed that 88.6% of HBPs and 81.3% of non-HBPs were correctly recognized, suggesting that our proposed model was powerful. This study provides a new strategy to identify HBPs. Moreover, based on the proposed model, we established a web server called HBPred, which can be freely accessed at http://lin-group.cn/server/HBPred.
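The ANOVA-based feature ranking described in this abstract can be sketched with the standard one-way ANOVA F statistic (between-class variance over within-class variance) computed per feature. This is a minimal illustration under that assumption; the function names are hypothetical, and the incremental feature selection loop that would retrain a classifier on growing prefixes of the ranking is not shown.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for one feature; `groups` holds one list of values per class."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means)) / (k - 1)
    within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g) / (n - k)
    # A zero within-class variance is treated as perfectly separating (assumption).
    return float("inf") if within == 0 else between / within

def rank_features(pos, neg):
    """Rank feature indices by descending F; `pos` and `neg` are lists of feature vectors."""
    n_features = len(pos[0])
    scores = [
        anova_f([[v[j] for v in pos], [v[j] for v in neg]])
        for j in range(n_features)
    ]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)

# Feature 0 separates the toy classes well; feature 1 does not.
order = rank_features(
    pos=[[1, 0.5], [2, 0.4], [3, 0.6]],
    neg=[[4, 0.5], [5, 0.6], [6, 0.4]],
)
```

Incremental feature selection would then add features in this order one at a time and keep the subset with the best cross-validated accuracy.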
Affiliation(s)
- Hua Tang
  - Department of Pathophysiology, Southwest Medical University, Luzhou 646000, China
- Ya-Wei Zhao
  - Key Laboratory for NeuroInformation of Ministry of Education, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Ping Zou
  - Department of Pathophysiology, Southwest Medical University, Luzhou 646000, China
- Chun-Mei Zhang
  - Department of Pathophysiology, Southwest Medical University, Luzhou 646000, China
- Rong Chen
  - Department of Pathophysiology, Southwest Medical University, Luzhou 646000, China
- Po Huang
  - Department of Pathophysiology, Southwest Medical University, Luzhou 646000, China
- Hao Lin
  - Key Laboratory for NeuroInformation of Ministry of Education, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
|
96
|
Zhang J, Feng P, Lin H, Chen W. Identifying RNA N6-Methyladenosine Sites in Escherichia coli Genome. Front Microbiol 2018; 9:955. [PMID: 29867860 PMCID: PMC5960707 DOI: 10.3389/fmicb.2018.00955] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Received: 03/13/2018] [Accepted: 04/24/2018] [Indexed: 12/20/2022] Open
Abstract
N6-methyladenosine (m6A) plays important roles in a range of biological and physiological processes. Accurate identification of m6A sites is especially helpful for understanding their biological functions. Because wet-lab techniques are still expensive and time-consuming, computational methods for identifying m6A sites from primary RNA sequences are urgently needed. Although some computational methods for identifying m6A sites exist, none is available for detecting m6A sites in microbial genomes. In this study, we developed a computational method for identifying m6A sites in the Escherichia coli genome. The accuracies obtained by the proposed method are >90% in both the 10-fold cross-validation test and the independent dataset test, indicating that the method has high potential to become a useful tool for identifying m6A sites in microbial genomes.
Affiliation(s)
- Jidong Zhang
  - Department of Immunology, Zunyi Medical College, Zunyi, China
- Pengmian Feng
  - Hebei Province Key Laboratory of Occupational Health and Safety for Coal Industry, School of Public Health, North China University of Science and Technology, Tangshan, China
- Hao Lin
  - Key Laboratory for Neuro-Information of Ministry of Education, Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Wei Chen
  - Key Laboratory for Neuro-Information of Ministry of Education, Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China; Department of Physics, Center for Genomics and Computational Biology, School of Sciences, North China University of Science and Technology, Tangshan, China
|
97
|
Zhang S, Zhuang W, Xu Z. Prediction of DNase I hypersensitive sites in plant genome using multiple modes of pseudo components. Anal Biochem 2018; 549:149-156. [DOI: 10.1016/j.ab.2018.03.025] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 02/10/2018] [Revised: 03/23/2018] [Accepted: 03/27/2018] [Indexed: 12/25/2022]
|
98
|
Sabooh MF, Iqbal N, Khan M, Khan M, Maqbool HF. Identifying 5-methylcytosine sites in RNA sequence using composite encoding feature into Chou's PseKNC. J Theor Biol 2018; 452:1-9. [PMID: 29727634 DOI: 10.1016/j.jtbi.2018.04.037] [Citation(s) in RCA: 74] [Impact Index Per Article: 12.3] [Received: 02/03/2018] [Revised: 04/24/2018] [Accepted: 04/27/2018] [Indexed: 02/02/2023]
Abstract
This study examines an accurate and efficient computational method for identifying 5-methylcytosine sites in RNA. The occurrence of 5-methylcytosine (m5C) plays a vital role in a number of biological processes. To better understand its biological functions and mechanisms, m5C sites in RNA must be recognized precisely. Laboratory techniques are available to identify m5C sites in RNA, but they require a lot of time and resources. This study develops a new computational method for extracting features from RNA sequences. First, the RNA sequence is encoded as a composite feature vector, and the minimum-redundancy-maximum-relevance algorithm is used to select discriminative features. Second, classification is performed by a support vector machine evaluated with the jackknife cross-validation test. The suggested method efficiently distinguishes m5C sites from non-m5C sites, with a reported outcome of 93.33%, a sensitivity of 90.0, and a specificity of 96.66 on benchmark datasets. The results show that the proposed algorithm achieves strong identification performance compared with existing computational techniques. This study extends knowledge about the sites at which RNA modification occurs, which paves the way for a better understanding of its biological uses and mechanisms.
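The jackknife (leave-one-out) test and the sensitivity and specificity figures mentioned in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: a simple nearest-centroid rule stands in for the support vector machine, labels are assumed to be 1 (m5C site) and 0 (non-m5C site), both classes are assumed to be present, and the names `nearest_centroid_predict` and `jackknife` are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, query):
    """Stand-in classifier (NOT the SVM from the abstract): closest class mean wins."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(query, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def jackknife(features, labels, predict=nearest_centroid_predict):
    """Hold out each sample once, train on the rest, and tally the confusion matrix."""
    tp = tn = fp = fn = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        guess = predict(train_x, train_y, features[i])
        if labels[i] == 1:
            if guess == 1:
                tp += 1
            else:
                fn += 1
        else:
            if guess == 0:
                tn += 1
            else:
                fp += 1
    return {
        "sensitivity": tp / (tp + fn),  # fraction of true sites recovered
        "specificity": tn / (tn + fp),  # fraction of non-sites rejected
        "accuracy": (tp + tn) / len(features),
    }
```

Any classifier with the same `predict(train_x, train_y, query)` shape can be passed in place of the stand-in, which is how an SVM would be evaluated under the same protocol.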
Affiliation(s)
- M Fazli Sabooh
  - Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan
- Nadeem Iqbal
  - Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan
- Mukhtaj Khan
  - Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan
- Muslim Khan
  - Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan
- H F Maqbool
  - University of Engineering & Technology Lahore, Pakistan
|
99
|
Muthu Krishnan S. Using Chou's general PseAAC to analyze the evolutionary relationship of receptor associated proteins (RAP) with various folding patterns of protein domains. J Theor Biol 2018; 445:62-74. [DOI: 10.1016/j.jtbi.2018.02.008] [Citation(s) in RCA: 59] [Impact Index Per Article: 9.8] [Received: 11/29/2017] [Revised: 01/24/2018] [Accepted: 02/12/2018] [Indexed: 01/31/2023]
|
100
|
Prediction of the aquatic toxicity of aromatic compounds to tetrahymena pyriformis through support vector regression. Oncotarget 2018; 8:49359-49369. [PMID: 28467816 PMCID: PMC5564774 DOI: 10.18632/oncotarget.17210] [Citation(s) in RCA: 39] [Impact Index Per Article: 6.5] [Received: 03/10/2017] [Accepted: 03/30/2017] [Indexed: 01/24/2023] Open
Abstract
Toxicity evaluation is an extremely important process during drug development. It usually begins with experiments on animals, which are time-consuming and costly. To speed up this process, a quantitative structure-activity relationship (QSAR) study was performed to develop a computational model correlating the structures of 581 aromatic compounds with their aquatic toxicity to Tetrahymena pyriformis. A set of 68 molecular descriptors derived solely from the structures of the aromatic compounds was calculated using Gaussian 03, HyperChem 7.5, and TSAR V3.3. A comprehensive feature selection method combining minimum Redundancy Maximum Relevance (mRMR), a genetic algorithm (GA), and support vector regression (SVR) was applied to select the best descriptor subset for the QSAR analysis. The SVR method was employed to model toxicity potency from a training set of 500 compounds. Five-fold cross-validation was used to optimize the parameters of the SVR model. The new SVR model was tested on an independent dataset of 81 compounds. Both high internal consistency and high external predictive rates were obtained, indicating that the SVR model is a very promising tool for fast toxicity detection.
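The five-fold cross-validation used here for parameter optimization can be sketched in plain Python. The sketch only shows how 500 training compounds would be partitioned and how candidate parameter settings could be compared; the SVR fitting itself is left to a caller-supplied `score_fn`, which is a hypothetical placeholder, as are the names `k_fold_indices` and `select_parameters`.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (training, validation) index lists for k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield training, validation


def select_parameters(candidates, score_fn, n_samples, k=5):
    """Return the candidate with the best mean validation score.

    score_fn(params, training, validation) is a hypothetical callback that
    would fit a model (for example an SVR) on the training indices and
    return a score on the validation indices, higher being better.
    """
    best_params, best_score = None, float("-inf")
    for params in candidates:
        scores = [score_fn(params, tr, va)
                  for tr, va in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params
```

With 500 training compounds and `k=5`, each round trains on 400 compounds and validates on the remaining 100; the 81 independent compounds would be held back until the parameters are fixed.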
|