151
|
Olyaee MH, Yaghoubi A, Yaghoobi M. Predicting protein structural classes based on complex networks and recurrence analysis. J Theor Biol 2016; 404:375-382. [DOI: 10.1016/j.jtbi.2016.06.018] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Received: 02/17/2016] [Revised: 05/25/2016] [Accepted: 06/15/2016] [Indexed: 11/24/2022]
|
152
|
Lu Y, Gan Y, Guan J, Zhou S. An integrative analysis of nucleosome occupancy and positioning using diverse sequence dependent properties. Neurocomputing 2016. [DOI: 10.1016/j.neucom.2015.11.107] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
|
153
|
Muthu Krishnan S. Classify vertebrate hemoglobin proteins by incorporating the evolutionary information into the general PseAAC with the hybrid approach. J Theor Biol 2016; 409:27-37. [PMID: 27575465 DOI: 10.1016/j.jtbi.2016.08.027] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 06/17/2016] [Revised: 08/11/2016] [Accepted: 08/16/2016] [Indexed: 01/26/2023]
Abstract
Hemoglobin is an oxygen-binding protein present in all kingdoms of life, from prokaryotes to eukaryotes, and is well established in vertebrates. An attempt was made to classify vertebrate hemoglobin (VerHb) proteins by animal class, based on evolutionary profiles incorporated into the general pseudo amino acid composition (PseAAC) and on a hybrid approach. Support vector machines (SVMs) were used to develop all models, and the prediction results were further compared according to animal class. The performance of the approaches was estimated using five-fold cross-validation. Prediction performance was further investigated with receiver operating characteristic (ROC) and prediction score graphs. Prediction accuracy (ACC), sensitivity (SN) and specificity (SP) were examined across threshold levels to find the most accurate predictions. Based on this approach, a web tool has been developed for identifying VerHb proteins.
Affiliation(s)
- S Muthu Krishnan
- CSIR - Institute of Microbial Technology (IMTECH), Sector-39A, Chandigarh, India.
|
154
|
Awazu A. Prediction of nucleosome positioning by the incorporation of frequencies and distributions of three different nucleotide segment lengths into a general pseudo k-tuple nucleotide composition. Bioinformatics 2016; 33:42-48. [PMID: 27563027 PMCID: PMC5860184 DOI: 10.1093/bioinformatics/btw562] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Received: 06/16/2016] [Revised: 08/02/2016] [Accepted: 08/19/2016] [Indexed: 11/13/2022] Open
Abstract
Motivation: Nucleosome positioning plays important roles in many eukaryotic intranuclear processes, such as transcriptional regulation and chromatin structure formation. Investigating the rules of nucleosome positioning provides a deeper understanding of these intracellular processes. Results: Nucleosome positioning prediction was performed using a model consisting of three types of variables characterizing a DNA sequence: the number of five-nucleotide sequences, the number of three-nucleotide combinations in one period of a helix, and mono- and di-nucleotide distributions in DNA fragments. Using recently proposed stringent benchmark datasets with low biases for Saccharomyces cerevisiae, Homo sapiens, Caenorhabditis elegans and Drosophila melanogaster, the present model was shown to have better prediction performance than recently proposed predictors. The model was also able to display the common and organism-dependent factors that affect nucleosome-forming and nucleosome-inhibiting sequences. Therefore, the predictors developed here can accurately predict nucleosome positioning and help determine the key factors influencing this process. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Akinori Awazu
- Department of Mathematical and Life Sciences; Research Center for Mathematics on Chromatin Live Dynamics, Hiroshima University, Kagami-yama 1-3-1, Higashi-Hiroshima, 739-8526, Japan
|
155
|
He B, Kang J, Ru B, Ding H, Zhou P, Huang J. SABinder: A Web Service for Predicting Streptavidin-Binding Peptides. Biomed Res Int 2016; 2016:9175143. [PMID: 27610387 PMCID: PMC5005764 DOI: 10.1155/2016/9175143] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.5] [Received: 06/08/2016] [Accepted: 07/27/2016] [Indexed: 11/17/2022]
Abstract
Streptavidin is sometimes used as the intended target to screen phage-displayed combinatorial peptide libraries for streptavidin-binding peptides (SBPs). More often in the biopanning system, however, streptavidin is just a commonly used anchoring molecule that can efficiently capture the biotinylated target. In this case, SBPs creeping into the biopanning results are not desired binders but target-unrelated peptides (TUP). Taking them as intended binders may mislead subsequent studies. Therefore, it is important to find if a peptide is likely to be an SBP when streptavidin is either the intended target or just the anchoring molecule. In this paper, we describe an SVM-based ensemble predictor called SABinder. It is the first predictor for SBP. The model was built with the feature of optimized dipeptide composition. It was observed that 89.20% (MCC = 0.78; AUC = 0.93; permutation test, p < 0.001) of peptides were correctly classified. As a web server, SABinder is freely accessible. The tool provides a highly efficient way to exclude potential SBP when they are TUP or to facilitate identification of possibly new SBP when they are the desired binders. In either case, it will be helpful and can benefit related scientific community.
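The abstract above names dipeptide composition as the feature type behind SABinder. As a rough illustration only, the sketch below computes a plain (unoptimized) 400-dimensional dipeptide composition vector. The function name, the example peptide, and the handling of non-standard residues are assumptions made for this sketch; the feature optimization and SVM ensemble described in the abstract are not reproduced.

```python
from itertools import product

# The 20 standard amino acids and all 400 ordered residue pairs.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(peptide):
    """Return the 400 dipeptide frequencies of a peptide, in DIPEPTIDES order.

    Hypothetical helper: overlapping residue pairs are counted and divided
    by the number of counted pairs. Pairs that contain a non-standard
    residue are skipped (an assumption of this sketch).
    """
    peptide = peptide.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(peptide) - 1):
        pair = peptide[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


# Arbitrary example peptide, chosen only for illustration.
features = dipeptide_composition("HPQFHPQ")
```

A vector like `features` could then be passed to any standard classifier; which dipeptides are retained after optimization is specific to the paper and not shown here.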
Affiliation(s)
- Bifang He
- Key Laboratory for Neuroinformation of Ministry of Education, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Juanjuan Kang
- Key Laboratory for Neuroinformation of Ministry of Education, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Beibei Ru
- Key Laboratory for Neuroinformation of Ministry of Education, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Hui Ding
- Key Laboratory for Neuroinformation of Ministry of Education, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Peng Zhou
- Key Laboratory for Neuroinformation of Ministry of Education, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Jian Huang
- Key Laboratory for Neuroinformation of Ministry of Education, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
|
156
|
Liu B, Wang S, Long R, Chou KC. iRSpot-EL: identify recombination spots with an ensemble learning approach. Bioinformatics 2016; 33:35-41. [DOI: 10.1093/bioinformatics/btw539] [Citation(s) in RCA: 268] [Impact Index Per Article: 33.5] [Received: 04/01/2016] [Revised: 08/01/2016] [Accepted: 08/11/2016] [Indexed: 11/13/2022] Open
|
157
|
Ali F, Hayat M. Machine learning approaches for discrimination of Extracellular Matrix proteins using hybrid feature space. J Theor Biol 2016; 403:30-37. [DOI: 10.1016/j.jtbi.2016.05.011] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Received: 12/25/2015] [Revised: 05/02/2016] [Accepted: 05/03/2016] [Indexed: 01/12/2023]
|
158
|
Tang H, Su ZD, Wei HH, Chen W, Lin H. Prediction of cell-penetrating peptides with feature selection techniques. Biochem Biophys Res Commun 2016; 477:150-154. [DOI: 10.1016/j.bbrc.2016.06.035] [Citation(s) in RCA: 56] [Impact Index Per Article: 7.0] [Received: 05/30/2016] [Accepted: 06/08/2016] [Indexed: 01/04/2023]
|
159
|
Dong J, Yao ZJ, Wen M, Zhu MF, Wang NN, Miao HY, Lu AP, Zeng WB, Cao DS. BioTriangle: a web-accessible platform for generating various molecular representations for chemicals, proteins, DNAs/RNAs and their interactions. J Cheminform 2016; 8:34. [PMID: 27330567 PMCID: PMC4915156 DOI: 10.1186/s13321-016-0146-2] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.3] [Received: 03/16/2016] [Accepted: 06/14/2016] [Indexed: 12/18/2022] Open
Abstract
BACKGROUND: Increasing evidence from network biology indicates that most cellular components exert their functions through interactions with other cellular components, such as proteins, DNAs, RNAs and small molecules. The rapidly increasing amount of publicly available data in biology and chemistry enables researchers to revisit interaction problems by systematic integration and analysis of heterogeneous data. Some tools have been developed to represent these components. However, they have limitations and focus only on the analysis of either small molecules, proteins, or DNAs/RNAs. To the best of our knowledge, there is still a lack of freely available, easy-to-use and integrated platforms for generating molecular descriptors of DNAs/RNAs, proteins, small molecules and their interactions. RESULTS: Herein, we developed a comprehensive molecular representation platform, called BioTriangle, to emphasize the integration of cheminformatics and bioinformatics into a molecular informatics platform for computational biology study. It contains a feature-rich toolkit used for the characterization of various biological molecules and complex interaction samples, including chemicals, proteins, DNAs/RNAs and even their interactions. By using BioTriangle, users can conveniently run a full pipeline, from obtaining molecular data and computing molecular representations to constructing machine learning models. CONCLUSION: BioTriangle provides a user-friendly interface to conveniently calculate various features of biological molecules and complex interaction samples. The computing tasks can be submitted and performed simply in a browser without any sophisticated installation and configuration process. BioTriangle is freely available at http://biotriangle.scbdd.com. Graphical abstract: An overview of BioTriangle, a platform for generating various molecular representations for chemicals, proteins, DNAs/RNAs and their interactions.
Affiliation(s)
- Jie Dong
- School of Pharmaceutical Sciences, Central South University, Changsha, People's Republic of China
- Zhi-Jiang Yao
- College of Chemistry and Chemical Engineering, Central South University, Changsha, People's Republic of China
- Ming Wen
- College of Chemistry and Chemical Engineering, Central South University, Changsha, People's Republic of China
- Min-Feng Zhu
- School of Mathematics and Statistics, Central South University, Changsha, People's Republic of China
- Ning-Ning Wang
- School of Pharmaceutical Sciences, Central South University, Changsha, People's Republic of China
- Hong-Yu Miao
- School of Public Health, University of Texas Health Science Center, Houston, TX USA
- Ai-Ping Lu
- Institute for Advancing Translational Medicine in Bone and Joint Diseases, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong, SAR People's Republic of China
- Wen-Bin Zeng
- School of Pharmaceutical Sciences, Central South University, Changsha, People's Republic of China
- Dong-Sheng Cao
- School of Pharmaceutical Sciences, Central South University, Changsha, People's Republic of China; Institute for Advancing Translational Medicine in Bone and Joint Diseases, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong, SAR People's Republic of China
|
160
|
Improving N(6)-methyladenosine site prediction with heuristic selection of nucleotide physical-chemical properties. Anal Biochem 2016; 508:104-13. [PMID: 27293216 DOI: 10.1016/j.ab.2016.06.001] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.1] [Received: 03/27/2016] [Revised: 05/31/2016] [Accepted: 06/01/2016] [Indexed: 12/28/2022]
Abstract
N(6)-methyladenosine (m(6)A) is one of the most common and abundant post-transcriptional RNA modifications found in viruses and most eukaryotes. m(6)A plays an essential role in many vital biological processes to regulate gene expression. Because of its widespread distribution across the genomes, the identification of m(6)A sites from RNA sequences is of significant importance for better understanding the regulatory mechanism of m(6)A. Although progress has been achieved in m(6)A site prediction, challenges remain. This article aims to further improve the performance of m(6)A site prediction by introducing a new heuristic nucleotide physical-chemical property selection (HPCS) algorithm. The proposed HPCS algorithm can effectively extract an optimized subset of nucleotide physical-chemical properties under the prescribed feature representation for encoding an RNA sequence into a feature vector. We demonstrate the efficacy of the proposed HPCS algorithm under different feature representations, including pseudo dinucleotide composition (PseDNC), auto-covariance (AC), and cross-covariance (CC). Based on the proposed HPCS algorithm, we implemented an m(6)A site predictor, called M6A-HPCS, which is freely available at http://csbio.njust.edu.cn/bioinf/M6A-HPCS. Experimental results over rigorous jackknife tests on benchmark datasets demonstrated that the proposed M6A-HPCS achieves higher success rates and outperforms existing state-of-the-art sequence-based m(6)A site predictors.
|
161
|
Huang HH. An ensemble distance measure of k-mer and Natural Vector for the phylogenetic analysis of multiple-segmented viruses. J Theor Biol 2016; 398:136-44. [DOI: 10.1016/j.jtbi.2016.03.004] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 02/02/2016] [Revised: 02/25/2016] [Accepted: 03/02/2016] [Indexed: 11/29/2022]
|
162
|
Prediction of aptamer-protein interacting pairs using an ensemble classifier in combination with various protein sequence attributes. BMC Bioinformatics 2016; 17:225. [PMID: 27245069 PMCID: PMC4888498 DOI: 10.1186/s12859-016-1087-5] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Received: 01/07/2016] [Accepted: 05/17/2016] [Indexed: 02/05/2023] Open
Abstract
Background: Aptamer-protein interacting pairs have a variety of physiological functions and therapeutic potential in organisms. Rapid and effective prediction of aptamer-protein interacting pairs is important for designing aptamers that bind to proteins of interest, which will give insight into the mechanisms of aptamer-protein interactions and support the development of aptamer-based therapies. Results: In this study, an ensemble method is presented to predict aptamer-protein interacting pairs with hybrid features. The features for aptamers are extracted from Pseudo K-tuple Nucleotide Composition (PseKNC), while the features for proteins incorporate Discrete Cosine Transformation (DCT), disorder information, and bi-gram Position Specific Scoring Matrix (PSSM). We investigate the predictive capabilities of various feature spaces. The proposed ensemble method obtains its best performance, a Youden’s Index of 0.380 by 10-fold cross validation, using the hybrid feature space of PseKNC, DCT, bi-gram PSSM, and disorder information. The Relief-Incremental Feature Selection (IFS) method is adopted to obtain the optimal feature set. Based on the optimal feature set, the proposed method achieves a balanced performance with a sensitivity of 0.753 and a specificity of 0.725 on the training dataset, which indicates that this method can solve the imbalanced data problem effectively. To evaluate the prediction performance objectively, an independent testing dataset is used. Encouragingly, our proposed method performs better than a previous study, with a sensitivity of 0.738 and a Youden’s Index of 0.451. Conclusions: These results suggest that the proposed method can be a potential candidate for aptamer-protein interacting pair prediction, which may contribute to finding novel aptamer-protein interacting pairs and understanding the relationship between aptamers and proteins.
Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1087-5) contains supplementary material, which is available to authorized users.
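The abstract above reports Youden's Index alongside sensitivity and specificity. Assuming the standard definition (J = sensitivity + specificity - 1), the short sketch below relates the reported figures to one another. The function name is hypothetical, and the implied specificity for the independent test set is derived arithmetic under that assumption, not a value stated in the abstract.

```python
def youden_index(sensitivity, specificity):
    """Youden's J statistic under its standard definition."""
    return sensitivity + specificity - 1.0


# Training-set figures quoted in the abstract (after feature selection).
training_j = youden_index(0.753, 0.725)  # about 0.478

# The independent test reports sensitivity 0.738 and J = 0.451; under the
# same definition the implied specificity is J - sensitivity + 1.
implied_specificity = 0.451 - 0.738 + 1.0  # about 0.713
```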
|
163
|
Wuyun Q, Zheng W, Zhang Y, Ruan J, Hu G. Improved Species-Specific Lysine Acetylation Site Prediction Based on a Large Variety of Features Set. PLoS One 2016; 11:e0155370. [PMID: 27183223 PMCID: PMC4868276 DOI: 10.1371/journal.pone.0155370] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Received: 01/29/2016] [Accepted: 04/27/2016] [Indexed: 12/21/2022] Open
Abstract
Lysine acetylation is a major post-translational modification. It plays a vital role in numerous essential biological processes, such as gene expression and metabolism, and is related to some human diseases. To fully understand the regulatory mechanism of acetylation, identifying acetylation sites is the first and most important step. However, experimental identification of protein acetylation sites is often time-consuming and expensive, so alternative computational methods are necessary. Here, we developed a novel tool, KA-predictor, to predict species-specific lysine acetylation sites based on a support vector machine (SVM) classifier. We incorporated different types of features and applied efficient feature selection to each type to form the final optimal feature set for model learning. Our predictor was highly competitive for the majority of species when compared with other methods. Feature contribution analysis indicated that HSE features, which were introduced here for the first time for lysine acetylation prediction, significantly improved the predictive performance. In particular, we constructed a highly accurate structure dataset of H. sapiens from PDB to analyze the structural properties around lysine acetylation sites. Our datasets and a user-friendly local tool of KA-predictor are freely available at http://sourceforge.net/p/ka-predictor.
Affiliation(s)
- Qiqige Wuyun
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin, China, 300071
- Wei Zheng
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin, China, 300071
- Yanping Zhang
- Department of Mathematics, School of Science, Hebei University of Engineering, Handan, China, 056038
- Jishou Ruan
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin, China, 300071
- State Key Laboratory of Medicinal Chemical Biology, Nankai University, Tianjin, China, 300071
- Gang Hu
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin, China, 300071
|
164
|
Chen J, Liu B, Huang D. Protein Remote Homology Detection Based on an Ensemble Learning Approach. Biomed Res Int 2016; 2016:5813645. [PMID: 27294123 PMCID: PMC4875977 DOI: 10.1155/2016/5813645] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 01/29/2016] [Accepted: 02/21/2016] [Indexed: 12/15/2022]
Abstract
Protein remote homology detection is one of the central problems in bioinformatics. Although some computational methods have been proposed, the problem is still far from being solved. In this paper, an ensemble classifier for protein remote homology detection, called SVM-Ensemble, was proposed with a weighted voting strategy. SVM-Ensemble combined three basic classifiers based on different feature spaces, including Kmer, ACC, and SC-PseAAC. These features consider the characteristics of proteins from various perspectives, incorporating both the sequence composition and the sequence-order information along the protein sequences. Experimental results on a widely used benchmark dataset showed that the proposed SVM-Ensemble can obviously improve the predictive performance for the protein remote homology detection. Moreover, it achieved the best performance and outperformed other state-of-the-art methods.
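The abstract above describes combining three base classifiers through weighted voting. The sketch below shows one generic way such a vote could be computed from signed decision scores. The scores, the weights, and the zero decision threshold are all hypothetical values for illustration; the abstract does not state how SVM-Ensemble derives its weights.

```python
def weighted_vote(scores, weights):
    """Combine per-classifier decision scores into one weighted score.

    scores and weights are dicts keyed by classifier name. A positive
    combined score is read as "homologous" (an assumption of this sketch).
    """
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same classifiers")
    total_weight = sum(weights.values())
    combined = sum(weights[name] * scores[name] for name in scores) / total_weight
    return combined, combined > 0


# Hypothetical decision scores from the three feature spaces in the abstract.
scores = {"Kmer": 0.8, "ACC": -0.2, "SC-PseAAC": 0.5}
# Hypothetical weights, e.g. proportional to each model's validation performance.
weights = {"Kmer": 0.3, "ACC": 0.3, "SC-PseAAC": 0.4}

combined, is_homolog = weighted_vote(scores, weights)
```

Here the ACC-based classifier disagrees with the other two, but its negative score is outweighed, so the combined decision stays positive.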
Affiliation(s)
- Junjie Chen
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China
- Bingquan Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
- Dong Huang
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China
- Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China
|
165
|
Chen W, Tang H, Lin H. MethyRNA: a web server for identification of N6-methyladenosine sites. J Biomol Struct Dyn 2016; 35:683-687. [DOI: 10.1080/07391102.2016.1157761] [Citation(s) in RCA: 74] [Impact Index Per Article: 9.3] [Indexed: 10/22/2022]
Affiliation(s)
- Wei Chen
- Department of Physics, School of Sciences, Center for Genomics and Computational Biology, North China University of Science and Technology, Tangshan 063009, China
- Hua Tang
- Department of Pathophysiology, Sichuan Medical University, Luzhou 646000, China
- Hao Lin
- Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
|
166
|
Prediction of Golgi-resident protein types using general form of Chou's pseudo-amino acid compositions: Approaches with minimal redundancy maximal relevance feature selection. J Theor Biol 2016; 402:38-44. [PMID: 27155042 DOI: 10.1016/j.jtbi.2016.04.032] [Citation(s) in RCA: 44] [Impact Index Per Article: 5.5] [Received: 03/19/2016] [Revised: 04/19/2016] [Accepted: 04/26/2016] [Indexed: 11/20/2022]
Abstract
Recently, several efforts have been made to predict Golgi-resident proteins. However, identifying the type of a Golgi-resident protein remains a challenging task. Precise prediction of the type of a Golgi-resident protein plays a key role in understanding its molecular functions in various biological processes. In this paper, we propose using a mutual-information-based feature selection scheme with the general form of Chou's pseudo-amino acid compositions to predict Golgi-resident protein types. Position-specific physicochemical properties were applied in Chou's pseudo-amino acid compositions. We achieved 91.24% prediction accuracy in a jackknife test with 49 selected features, the best performance among all current predictors. This result indicates that our computational model can be useful in identifying Golgi-resident protein types.
|
167
|
Iqbal M, Hayat M. "iSS-Hyb-mRMR": Identification of splicing sites using hybrid space of pseudo trinucleotide and pseudo tetranucleotide composition. Comput Methods Programs Biomed 2016; 128:1-11. [PMID: 27040827 DOI: 10.1016/j.cmpb.2016.02.006] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 08/24/2015] [Accepted: 02/16/2016] [Indexed: 06/05/2023]
Abstract
BACKGROUND AND OBJECTIVES: Gene splicing is a vital source of protein diversity. Precise removal of introns and joining of exons is a prominent task in eukaryotic gene expression, as exons are usually interrupted by introns. Identifying splicing sites through experimental techniques is complicated and time-consuming. With the avalanche of genome sequences generated in the post-genomic age, it remains a complicated and challenging task to develop an automatic, robust and reliable computational method for fast and effective identification of splicing sites. METHODS: In this study, a hybrid model, "iSS-Hyb-mRMR", is proposed for quick and accurate identification of splicing sites. Two sample representation methods, namely pseudo trinucleotide composition (PseTNC) and pseudo tetranucleotide composition (PseTetraNC), were used to extract numerical descriptors from DNA sequences. The hybrid model was developed by concatenating PseTNC and PseTetraNC. To select highly discriminative features, the minimum redundancy maximum relevance algorithm was applied to the hybrid feature space. The performance of these feature representation methods was tested using various classification algorithms, including K-nearest neighbor, probabilistic neural network, general regression neural network, and fitting network. The jackknife test was used to evaluate performance on two benchmark datasets, S1 and S2. RESULTS: The predictor proposed in the current study achieved an accuracy of 93.26%, sensitivity of 88.77%, and specificity of 97.78% for S1, and an accuracy of 94.12%, sensitivity of 87.14%, and specificity of 98.64% for S2. CONCLUSION: The performance of the proposed model is higher than that of the existing methods in the literature so far, and the model should be useful for studying the mechanism of RNA splicing and in related research.
Affiliation(s)
- Muhammad Iqbal
- Department of Computer Science, Abdul Wali Khan University, Mardan, Pakistan
- Maqsood Hayat
- Department of Computer Science, Abdul Wali Khan University, Mardan, Pakistan.
|
168
|
Liu B, Wang S, Dong Q, Li S, Liu X. Identification of DNA-binding proteins by combining auto-cross covariance transformation and ensemble learning. IEEE Trans Nanobioscience 2016; 15:328-334. [PMID: 28113908 DOI: 10.1109/tnb.2016.2555951] [Citation(s) in RCA: 65] [Impact Index Per Article: 8.1] [Indexed: 11/10/2022]
Abstract
DNA-binding proteins play a pivotal role in various intra- and extra-cellular activities ranging from DNA replication to gene expression control. With the rapid development of next-generation sequencing techniques, the number of protein sequences is increasing at an unprecedented rate. Thus it is necessary to develop computational methods that identify DNA-binding proteins based only on protein sequence information. In this study, a novel method called iDNA-KACC is presented, which combines the Support Vector Machine (SVM) and the auto-cross covariance transformation. The protein sequences are first converted into a profile-based protein representation, and then converted into a series of fixed-length vectors by the auto-cross covariance transformation with Kmer composition. The sequence order effect can be effectively captured by this scheme. These vectors are then fed into the SVM to discriminate the DNA-binding proteins from the non-DNA-binding ones. iDNA-KACC achieves an overall accuracy of 75.16% and a Matthews correlation coefficient of 0.5 in a rigorous jackknife test. Its performance is further improved by employing an ensemble learning approach, and the improved predictor is called iDNA-KACC-EL. Experimental results on an independent dataset show that iDNA-KACC-EL outperforms all the other state-of-the-art predictors, indicating that it would be a useful computational tool for DNA-binding protein identification.
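The auto-cross covariance (ACC) transformation mentioned above turns a variable-length profile into a fixed-length vector. The sketch below uses the commonly cited formulation, in which each feature is the lagged covariance between two profile columns averaged over sequence positions; it is an assumption that iDNA-KACC uses exactly this form, and the combination with Kmer composition is not reproduced. The function name and the toy profile are hypothetical.

```python
def acc_features(profile, max_lag):
    """Auto- and cross-covariance features of a per-residue profile.

    profile: list of rows, one per residue, each a list of column scores
             (for a PSSM there would be 20 columns).
    Returns max_lag * n_cols**2 values. Entries with a == b are
    auto-covariances; entries with a != b are cross-covariances.
    """
    length = len(profile)
    n_cols = len(profile[0])
    if not 0 < max_lag < length:
        raise ValueError("max_lag must be between 1 and len(profile) - 1")
    # Column means over the whole sequence.
    means = [sum(row[c] for row in profile) / length for c in range(n_cols)]
    features = []
    for lag in range(1, max_lag + 1):
        for a in range(n_cols):
            for b in range(n_cols):
                total = sum(
                    (profile[j][a] - means[a]) * (profile[j + lag][b] - means[b])
                    for j in range(length - lag)
                )
                features.append(total / (length - lag))
    return features


# Toy 4-residue, 2-column profile (made-up numbers, not a real PSSM).
toy_profile = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]]
toy_features = acc_features(toy_profile, max_lag=1)
```

Because the output length depends only on `max_lag` and the number of columns, sequences of different lengths map to vectors of the same size, which is what allows them to be fed to an SVM.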
|
169
|
Che Y, Ju Y, Xuan P, Long R, Xing F. Identification of Multi-Functional Enzyme with Multi-Label Classifier. PLoS One 2016; 11:e0153503. [PMID: 27078147 PMCID: PMC4831692 DOI: 10.1371/journal.pone.0153503] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Received: 02/22/2016] [Accepted: 03/30/2016] [Indexed: 11/23/2022] Open
Abstract
Enzymes are important and effective biological catalyst proteins participating in almost all active cell processes. Identification of multi-functional enzymes is essential in understanding the function of enzymes. Machine learning methods perform better in protein structure and function prediction than traditional biological wet experiments. Thus, in this study, we explore an efficient and effective machine learning method to categorize enzymes according to their function. Multi-functional enzymes are predicted with a special machine learning strategy, namely, multi-label classifier. Sequence features are extracted from a position-specific scoring matrix with autocross-covariance transformation. Experiment results show that the proposed method obtains an accuracy rate of 94.1% in classifying six main functional classes through five cross-validation tests and outperforms state-of-the-art methods. In addition, 91.25% accuracy is achieved in multi-functional enzyme prediction, which is often ignored in other enzyme function prediction studies. The online prediction server and datasets can be accessed from the link http://server.malab.cn/MEC/.
Affiliation(s)
- Yuxin Che
- School of Information Science and Technology, Xiamen University, Xiamen, Fujian 361005, China
- Ying Ju
- School of Information Science and Technology, Xiamen University, Xiamen, Fujian 361005, China
- Ping Xuan
- School of Computer Science and Technology, Heilongjiang University, Harbin 150080, China
- Ren Long
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China
- Fei Xing
- School of Aerospace Engineering, Xiamen University, Xiamen, Fujian 361005, China
|
170
|
Liu G, Xing Y, Zhao H, Wang J, Shang Y, Cai L. A deformation energy-based model for predicting nucleosome dyads and occupancy. Sci Rep 2016; 6:24133. [PMID: 27053067 PMCID: PMC4823781 DOI: 10.1038/srep24133] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 01/14/2016] [Accepted: 03/21/2016] [Indexed: 12/14/2022] Open
Abstract
The nucleosome plays an essential role in various cellular processes, such as DNA replication, recombination, and transcription. Hence, it is important to decode the mechanism of nucleosome positioning and identify nucleosome positions in the genome. In this paper, we present a model for predicting nucleosome positioning based on DNA deformation, in which both bending and shearing of the nucleosomal DNA are considered. The model successfully predicted the dyad positions of nucleosomes assembled in vitro and the in vitro map of nucleosomes in Saccharomyces cerevisiae. Applying the model to Caenorhabditis elegans and Drosophila melanogaster, we achieved satisfactory results. Our data also show that the shearing energy of nucleosomal DNA outperforms the bending energy in nucleosome occupancy prediction, and that the ability to predict nucleosome dyad positions is attributed to the bending energy, which is associated with the rotational positioning of nucleosomes.
Affiliation(s)
- Guoqing Liu: The Institute of Bioengineering and Technology, Inner Mongolia University of Science and Technology, Baotou, 014010, China; Computational Systems Biology Lab, Department of Biochemistry and Molecular Biology, Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA
- Yongqiang Xing: The Institute of Bioengineering and Technology, Inner Mongolia University of Science and Technology, Baotou, 014010, China
- Hongyu Zhao: The Institute of Bioengineering and Technology, Inner Mongolia University of Science and Technology, Baotou, 014010, China
- Jianying Wang: The Institute of Bioengineering and Technology, Inner Mongolia University of Science and Technology, Baotou, 014010, China; State Key Laboratory for Utilization of Bayan Obo Multi-Metallic Resources, Inner Mongolia University of Science and Technology, Baotou, 014010, China
- Yu Shang: Computational Systems Biology Lab, Department of Biochemistry and Molecular Biology, Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA; College of Computer Science and Technology, Jilin University, Changchun, Jilin 130021, China
- Lu Cai: The Institute of Bioengineering and Technology, Inner Mongolia University of Science and Technology, Baotou, 014010, China
|
171
|
Vázquez-Prieto S, Paniagua E, Ubeira FM, González-Díaz H. QSPR-Perturbation Models for the Prediction of B-Epitopes from Immune Epitope Database: A Potentially Valuable Route for Predicting “In Silico” New Optimal Peptide Sequences and/or Boundary Conditions for Vaccine Development. Int J Pept Res Ther 2016. [DOI: 10.1007/s10989-016-9524-x] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Indexed: 11/29/2022]
|
172
|
Liu Z, Xiao X, Yu DJ, Jia J, Qiu WR, Chou KC. pRNAm-PC: Predicting N6-methyladenosine sites in RNA sequences via physical–chemical properties. Anal Biochem 2016; 497:60-7. [DOI: 10.1016/j.ab.2015.12.017] [Citation(s) in RCA: 225] [Impact Index Per Article: 28.1] [Received: 10/30/2015] [Revised: 12/02/2015] [Accepted: 12/23/2015] [Indexed: 11/28/2022]
|
173
|
Lyons J, Paliwal KK, Dehzangi A, Heffernan R, Tsunoda T, Sharma A. Protein fold recognition using HMM–HMM alignment and dynamic programming. J Theor Biol 2016; 393:67-74. [DOI: 10.1016/j.jtbi.2015.12.018] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Received: 10/26/2015] [Revised: 12/17/2015] [Accepted: 12/18/2015] [Indexed: 10/22/2022]
|
174
|
Chen W, Feng P, Ding H, Lin H, Chou KC. Using deformation energy to analyze nucleosome positioning in genomes. Genomics 2016; 107:69-75. [DOI: 10.1016/j.ygeno.2015.12.005] [Citation(s) in RCA: 87] [Impact Index Per Article: 10.9] [Received: 10/12/2015] [Revised: 12/06/2015] [Accepted: 12/22/2015] [Indexed: 12/28/2022]
|
175
|
Ding H, Liang ZY, Guo FB, Huang J, Chen W, Lin H. Predicting bacteriophage proteins located in host cell with feature selection technique. Comput Biol Med 2016; 71:156-61. [PMID: 26945463 DOI: 10.1016/j.compbiomed.2016.02.012] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 12/16/2015] [Revised: 02/18/2016] [Accepted: 02/18/2016] [Indexed: 10/22/2022]
Abstract
A bacteriophage is a virus that can infect a bacterium. The fate of an infected bacterium is determined by the bacteriophage proteins located in the host cell. Thus, reliably identifying bacteriophage proteins located in the host cell is extremely important for understanding their functions and discovering potential anti-bacterial drugs. In this paper, a computational method was developed to recognize bacteriophage proteins located in host cells based only on their amino acid sequences. The analysis of variance (ANOVA) combined with incremental feature selection (IFS) was proposed to optimize the feature set. Using jackknife cross-validation, our method can discriminate between bacteriophage proteins located in a host cell and bacteriophage proteins not located in a host cell with a maximum overall accuracy of 84.2%, and can further classify bacteriophage proteins located in the host cell cytoplasm and in host cell membranes with a maximum overall accuracy of 92.4%. To enhance the practical value of the method, we built a web server called PHPred (http://lin.uestc.edu.cn/server/PHPred). We believe that PHPred will become a powerful tool to study bacteriophage proteins located in host cells and to guide related drug discovery.
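The ANOVA-based ranking step described in the abstract above can be sketched as follows. This is only an illustration under stated assumptions: it uses the textbook one-way ANOVA F statistic (between-class mean square over within-class mean square), and the function names and toy data layout are hypothetical rather than taken from the paper.

```python
from typing import List, Sequence


def anova_f_score(groups: Sequence[Sequence[float]]) -> float:
    """One-way ANOVA F statistic for a single feature (hypothetical helper).

    `groups` holds the feature values of each class in a separate sequence.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two classes and more samples than classes")

    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of class means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variation of samples around their own class mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within if ms_within > 0 else float("inf")


def rank_features(samples: List[List[float]], labels: List[int]) -> List[int]:
    """Return feature indices ordered from highest to lowest F statistic."""
    classes = sorted(set(labels))
    scores = []
    for f in range(len(samples[0])):
        groups = [[s[f] for s, y in zip(samples, labels) if y == c] for c in classes]
        scores.append(anova_f_score(groups))
    return sorted(range(len(scores)), key=lambda f: scores[f], reverse=True)
```

An incremental feature selection loop would then add features one at a time in this ranked order, re-evaluate a classifier after each addition, and keep the subset with the best cross-validated accuracy; that loop is omitted here because it depends on the chosen classifier.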
Affiliation(s)
- Hui Ding: Key Laboratory for NeuroInformation of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology and Center for Information in Biomedicine, University of Electronic Science and Technology of China, Chengdu 610054, China
- Zhi-Yong Liang: Key Laboratory for NeuroInformation of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology and Center for Information in Biomedicine, University of Electronic Science and Technology of China, Chengdu 610054, China
- Feng-Biao Guo: Key Laboratory for NeuroInformation of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology and Center for Information in Biomedicine, University of Electronic Science and Technology of China, Chengdu 610054, China
- Jian Huang: Key Laboratory for NeuroInformation of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology and Center for Information in Biomedicine, University of Electronic Science and Technology of China, Chengdu 610054, China
- Wei Chen: Key Laboratory for NeuroInformation of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology and Center for Information in Biomedicine, University of Electronic Science and Technology of China, Chengdu 610054, China; Department of Physics, School of Sciences, and Center for Genomics and Computational Biology, North China University of Science and Technology, Tangshan 063000, China
- Hao Lin: Key Laboratory for NeuroInformation of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology and Center for Information in Biomedicine, University of Electronic Science and Technology of China, Chengdu 610054, China
|
176
|
Yang L, Wang S, Zhou M, Chen X, Zuo Y, Sun D, Lv Y. Comparative analysis of housekeeping and tissue-selective genes in human based on network topologies and biological properties. Mol Genet Genomics 2016; 291:1227-41. [PMID: 26897376 DOI: 10.1007/s00438-016-1178-z] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 10/26/2015] [Accepted: 01/26/2016] [Indexed: 01/14/2023]
Abstract
Housekeeping genes are genes that are turned on most of the time in almost every tissue to maintain cellular functions. Tissue-selective genes are predominantly expressed in one or a few biologically relevant tissue types. Benefitting from the massive gene expression microarray data obtained over the past decades, the properties of housekeeping and tissue-selective genes can now be investigated on a large scale. In this study, we analyzed the topological properties of housekeeping and tissue-selective genes in the protein-protein interaction (PPI) network. Furthermore, we compared the biological properties and amino acid usage between these two gene groups. The results indicated that there were significant differences in topological properties between housekeeping and tissue-selective genes in the PPI network, and that housekeeping genes had higher centrality properties and may play important roles in the complex biological network environment. We also found significant differences in multiple biological properties and in the composition of many amino acids. Functional gene enrichment and subcellular localization analyses were also performed to characterize housekeeping and tissue-selective genes. The results indicated that the two gene groups showed significantly different enrichment in drug targets, disease genes and toxin targets, and were located in different subcellular compartments. Finally, the discrimination between the properties of the two gene groups was measured by the F-score, and expression stage was the most discriminative of all properties. These findings may elucidate the biological mechanisms underlying housekeeping and tissue-selective genes and may contribute to better annotation of housekeeping and tissue-selective genes in other organisms.
Affiliation(s)
- Lei Yang: College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150081, China
- Shiyuan Wang: College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150081, China
- Meng Zhou: College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150081, China
- Xiaowen Chen: College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150081, China
- Yongchun Zuo: The National Research Center for Animal Transgenic Biotechnology, Inner Mongolia University, Hohhot, 010021, China
- Dianjun Sun: Center for Endemic Disease Control, Chinese Center for Disease Control and Prevention, Harbin Medical University, Harbin, 150081, China
- Yingli Lv: College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150081, China
|
177
|
Ramirez-Malule H, Restrepo A, Cardona W, Junne S, Neubauer P, Rios-Estepa R. Inversion of the stereochemical configuration (3S, 5S)-clavaminic acid into (3R, 5R)-clavulanic acid: A computationally-assisted approach based on experimental evidence. J Theor Biol 2016; 395:40-50. [PMID: 26835563 DOI: 10.1016/j.jtbi.2016.01.028] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 09/15/2015] [Revised: 01/19/2016] [Accepted: 01/21/2016] [Indexed: 11/17/2022]
Abstract
Clavulanic acid (CA), a potent inhibitor of β-lactamase enzymes, is produced by Streptomyces clavuligerus (Sc) cultivation processes, for which low yields are commonly obtained. Improved knowledge of the clavam biosynthetic pathway, especially the steps involved in the inversion of the 3S-5S into the 3R-5R stereochemical configuration, would help to eventually identify bottlenecks in the pathway. In this work, we studied the role of acetate in CA biosynthesis by a combined continuous culture and computational simulation approach. From this we derived a new model for the synthesis of N-acetyl-glycyl-clavaminic acid (NAG-clavam) by Sc. Acetylated compounds, such as NAG-clavam and N-acetyl-clavaminic acid, have been reported in the clavam pathway. Although the acetyl group is present in the β-lactam intermediate NAG-clavam, it is unknown how this group is incorporated. Hence, considering the experimentally proven accumulation of acetate during CA biosynthesis and the presence of an acetyl group in the NAG-clavam structure, a computational evaluation of the tentative formation of NAG-clavam was performed to provide further understanding. The proposed reaction mechanism consists of two steps: first, acetate reacts with ATP to produce a reactive acylphosphate intermediate; second, a direct nucleophilic attack of the terminal amino group of N-glycyl-clavaminic acid on the carbonyl carbon of the acylphosphate intermediate leads to a tetrahedral intermediate, which collapses and produces ADP and N-acetyl-glycyl-clavaminic acid. The calculations suggest that, for the proposed reaction mechanism, the first step, in which acetate and ATP are involved, proceeds to completion without the direct action of an enzyme. For this step, the computed activation energy was ≅2.82 kcal/mol while the reaction energy was ≅2.38 kcal/mol. As this is an endothermic chemical process with a relatively small activation energy, the reaction rate should be considerably high. The calculations offered in this work should not be considered as a definite characterization of the potential energy surface for the reaction between acetate and ATP, but rather as a first approximation that provides valuable insight about the reaction mechanism. Finally, a complete route for the inversion of the stereochemical configuration from (3S, 5S)-clavaminic acid into (3R, 5R)-clavulanic acid is proposed, including a novel alternative for the double epimerization using proline racemase and NAG-clavam formation.
Affiliation(s)
- Howard Ramirez-Malule: Grupo de Bioprocesos, Departamento de Ingeniería Química, Universidad de Antioquia UdeA, Calle 70 No. 52-21, Medellín, Colombia
- Albeiro Restrepo: Grupo de Química Física Teórica, Instituto de Química, Universidad de Antioquia UdeA, Calle 70 No. 52-21, Medellín, Colombia
- Wilson Cardona: Grupo de Química de Plantas Colombianas, Instituto de Química, Universidad de Antioquia UdeA, Calle 70 No. 52-21, Medellín, Colombia
- Stefan Junne: Chair of Bioprocess Engineering, Department of Biotechnology, Technische Universität Berlin, Ackerstr. 76, ACK 24, 13355 Berlin, Germany
- Peter Neubauer: Chair of Bioprocess Engineering, Department of Biotechnology, Technische Universität Berlin, Ackerstr. 76, ACK 24, 13355 Berlin, Germany
- Rigoberto Rios-Estepa: Grupo de Bioprocesos, Departamento de Ingeniería Química, Universidad de Antioquia UdeA, Calle 70 No. 52-21, Medellín, Colombia
|
178
|
In-silico analysis of gymnemagenin from Gymnema sylvestre (Retz.) R.Br. with targets related to diabetes. J Theor Biol 2016; 391:95-101. [PMID: 26711684 DOI: 10.1016/j.jtbi.2015.12.004] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 06/25/2015] [Revised: 11/12/2015] [Accepted: 12/10/2015] [Indexed: 11/23/2022]
|
179
|
An estimator for local analysis of genome based on the minimal absent word. J Theor Biol 2016; 395:23-30. [PMID: 26829314 DOI: 10.1016/j.jtbi.2016.01.023] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 10/12/2015] [Revised: 01/17/2016] [Accepted: 01/19/2016] [Indexed: 11/22/2022]
Abstract
This study presents an alternative alignment-free relative feature analysis method based on the minimal absent word, which has potential advantages over the local alignment method in local analysis. A smooth local-analysis curve and a similarity distribution are constructed for fast, efficient, and visual comparison. Moreover, when multi-sequence comparison is needed, the local-analysis curves can highlight zones of interest.
|
180
|
Predicting DNA Methylation State of CpG Dinucleotide Using Genome Topological Features and Deep Networks. Sci Rep 2016; 6:19598. [PMID: 26797014 PMCID: PMC4726425 DOI: 10.1038/srep19598] [Citation(s) in RCA: 55] [Impact Index Per Article: 6.9] [Received: 06/10/2015] [Accepted: 12/14/2015] [Indexed: 11/09/2022] Open
Abstract
The hypo- or hyper-methylation of the human genome is one of the epigenetic features of leukemia. However, experimental approaches have only determined the methylation state of a small portion of the human genome. We developed deep learning based (stacked denoising autoencoders, or SdAs) software named “DeepMethyl” to predict the methylation state of DNA CpG dinucleotides using features inferred from three-dimensional genome topology (based on Hi-C) and DNA sequence patterns. We used the experimental data from immortalised myelogenous leukemia (K562) and healthy lymphoblastoid (GM12878) cell lines to train the learning models and assess prediction performance. We have tested various SdA architectures with different configurations of hidden layer(s) and amount of pre-training data and compared the performance of deep networks relative to support vector machines (SVMs). Using the methylation states of sequentially neighboring regions as one of the learning features, an SdA achieved a blind test accuracy of 89.7% for GM12878 and 88.6% for K562. When the methylation states of sequentially neighboring regions are unknown, the accuracies are 84.82% for GM12878 and 72.01% for K562. We also analyzed the contribution of genome topological features inferred from Hi-C. DeepMethyl can be accessed at http://dna.cs.usm.edu/deepmethyl/.
|
181
|
iPPBS-Opt: A Sequence-Based Ensemble Classifier for Identifying Protein-Protein Binding Sites by Optimizing Imbalanced Training Datasets. Molecules 2016; 21:E95. [PMID: 26797600 PMCID: PMC6274413 DOI: 10.3390/molecules21010095] [Citation(s) in RCA: 136] [Impact Index Per Article: 17.0] [Received: 11/18/2015] [Revised: 12/18/2015] [Accepted: 01/07/2016] [Indexed: 12/25/2022] Open
Abstract
Knowledge of protein-protein interactions and their binding sites is indispensable for in-depth understanding of the networks in living cells. With the avalanche of protein sequences generated in the postgenomic age, it is critical to develop computational methods for identifying, in a timely fashion, the protein-protein binding sites (PPBSs) based on the sequence information alone, because the information obtained in this way can be used for both biomedical research and drug development. To address this challenge, we have proposed a new predictor, called iPPBS-Opt, in which we have used: (1) the K-Nearest Neighbors Cleaning (KNNC) and Inserting Hypothetical Training Samples (IHTS) treatments to optimize the training dataset; (2) the ensemble voting approach to select the most relevant features; and (3) the stationary wavelet transform to formulate the statistical samples. Cross-validation tests targeting the experiment-confirmed results have demonstrated that the new predictor is very promising, implying that the aforementioned practices are indeed very effective. In particular, the approach of using wavelets to express protein/peptide sequences might be the key to grasping the problem's essence, fully consistent with the findings that many important biological functions of proteins can be elucidated with their low-frequency internal motions. To maximize the convenience of most experimental scientists, we have provided a step-by-step guide on how to use the predictor's web server (http://www.jci-bioinfo.cn/iPPBS-Opt) to get the desired results without the need to go through the complicated mathematical equations involved.
|
182
|
Ranganarayanan P, Thanigesan N, Ananth V, Jayaraman VK, Ramakrishnan V. Identification of Glucose-Binding Pockets in Human Serum Albumin Using Support Vector Machine and Molecular Dynamics Simulations. IEEE/ACM Trans Comput Biol Bioinform 2016; 13:148-157. [PMID: 26886739 DOI: 10.1109/tcbb.2015.2415806] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 06/05/2023]
Abstract
Human Serum Albumin (HSA) has been suggested to be an alternate biomarker to the existing Hemoglobin-A1c (HbA1c) marker for glycemic monitoring. Development and usage of HSA as an alternate biomarker requires the identification of glycation sites, or equivalently, glucose-binding pockets. In this work, we combine molecular dynamics simulations of HSA and the state-of-the-art machine learning method Support Vector Machine (SVM) to predict glucose-binding pockets in HSA. The SVM uses the three-dimensional arrangement of atoms and their chemical properties to predict the glucose-binding ability of a pocket. Feature selection reveals that the arrangement of atoms and their chemical properties within the first 4 Å from the centroid of the pocket play an important role in the binding of glucose. With a 10-fold cross-validation accuracy of 84 percent, our SVM model reveals seven new potential glucose-binding sites in HSA, of which two are exposed only during the dynamics of HSA. The predictions are further corroborated using docking studies. These findings can complement studies directed towards the development of HSA as an alternate biomarker for glycemic monitoring.
|
183
|
Feng P, Ding H, Chen W, Lin H. Identifying RNA 5-methylcytosine sites via pseudo nucleotide compositions. Mol Biosyst 2016; 12:3307-3311. [DOI: 10.1039/c6mb00471g] [Citation(s) in RCA: 44] [Impact Index Per Article: 5.5] [Indexed: 12/28/2022]
Abstract
RNA 5-methylcytosine (m5C), whose formation is catalyzed by RNA methyltransferases, has been found in organisms from archaea to eukaryotes.
Affiliation(s)
- Pengmian Feng: School of Public Health, North China University of Science and Technology, Tangshan, China
- Hui Ding: Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Wei Chen: Department of Physics, School of Sciences, Center for Genomics and Computational Biology, North China University of Science and Technology, Tangshan 063009
- Hao Lin: Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
|
184
|
Tang H, Chen W, Lin H. Identification of immunoglobulins using Chou's pseudo amino acid composition with feature selection technique. Mol Biosyst 2016; 12:1269-75. [DOI: 10.1039/c5mb00883b] [Citation(s) in RCA: 147] [Impact Index Per Article: 18.4] [Indexed: 01/14/2023]
Abstract
Immunoglobulins, also called antibodies, are a group of cell surface proteins which are produced by the immune system in response to the presence of a foreign substance (called antigen).
Affiliation(s)
- Hua Tang: Department of Pathophysiology, Sichuan Medical University, Luzhou 646000, China
- Wei Chen: Department of Physics, School of Sciences, Center for Genomics and Computational Biology, North China University of Science and Technology, Tangshan 063009
- Hao Lin: Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
|
185
|
Jiao YS, Du PF. Predicting Golgi-resident protein types using pseudo amino acid compositions: Approaches with positional specific physicochemical properties. J Theor Biol 2015; 391:35-42. [PMID: 26702543 DOI: 10.1016/j.jtbi.2015.11.009] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.3] [Received: 08/11/2015] [Revised: 11/17/2015] [Accepted: 11/19/2015] [Indexed: 11/24/2022]
Abstract
Knowing the type of a Golgi-resident protein is an important step in understanding its molecular functions as well as its role in biological processes. In this paper, we developed a novel computational method to predict Golgi-resident protein types using positional specific physicochemical properties and analysis of variance based feature selection methods. Our method achieved 86.9% prediction accuracy in leave-one-out cross-validations with only 59 features. Our method has the potential to be applied in predicting a wide range of protein attributes.
Affiliation(s)
- Ya-Sen Jiao: School of Computer Science and Technology, Tianjin University, Tianjin 300072, China
- Pu-Feng Du: School of Computer Science and Technology, Tianjin University, Tianjin 300072, China
|
186
|
Predicting cancerlectins by the optimal g-gap dipeptides. Sci Rep 2015; 5:16964. [PMID: 26648527 PMCID: PMC4673586 DOI: 10.1038/srep16964] [Citation(s) in RCA: 46] [Impact Index Per Article: 5.1] [Received: 05/30/2015] [Accepted: 10/22/2015] [Indexed: 12/14/2022] Open
Abstract
The cancerlectin plays a key role in the process of tumor cell differentiation. Thus, fully understanding the function of cancerlectins is important because it sheds light on future directions for cancer therapy. However, traditional wet-lab experimental methods are costly and time-consuming. It is highly desirable to develop an effective and efficient computational tool to identify cancerlectins. In this study, we developed a sequence-based method to discriminate between cancerlectins and non-cancerlectins. The analysis of variance (ANOVA) was used to choose the optimal feature set derived from the g-gap dipeptide composition. The jackknife cross-validated results showed that the proposed method achieved an accuracy of 75.19%, which is superior to other published methods. For the convenience of other researchers, an online web-server, CaLecPred, was established and can be freely accessed at http://lin.uestc.edu.cn/server/CalecPred. We believe that CaLecPred is a powerful tool to study cancerlectins and to guide the related experimental validations.
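The g-gap dipeptide composition named in the abstract above can be illustrated with a short sketch. It assumes the usual reading of the term: the frequency of each ordered pair of residues separated by g intervening positions, giving a 400-dimensional vector for the 20 standard amino acids. The function name and the choice to normalise by the number of valid pairs (skipping non-standard residues) are assumptions for illustration.

```python
from itertools import product
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs, in a fixed order so vectors are comparable.
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def g_gap_dipeptide_composition(sequence: str, gap: int) -> List[float]:
    """Frequencies of residue pairs separated by `gap` positions (hypothetical helper).

    gap = 0 gives the ordinary adjacent dipeptide composition.
    """
    if gap < 0:
        raise ValueError("gap must be non-negative")

    counts = dict.fromkeys(PAIRS, 0)
    total = 0
    for i in range(len(sequence) - gap - 1):
        pair = sequence[i] + sequence[i + gap + 1]
        if pair in counts:  # ignore pairs containing non-standard residues
            counts[pair] += 1
            total += 1

    if total == 0:
        return [0.0] * len(PAIRS)
    return [counts[p] / total for p in PAIRS]
```

Computing this vector for several values of the gap and ranking the resulting features (for example with an ANOVA F statistic) is one way to search for the gap that best separates the two classes.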
|
187
|
Heras J, Domínguez C, Mata E, Pascual V. Surveying and benchmarking techniques to analyse DNA gel fingerprint images. Brief Bioinform 2015; 17:912-925. [PMID: 26634918 DOI: 10.1093/bib/bbv102] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/17/2015] [Revised: 10/20/2015] [Indexed: 11/13/2022] Open
Abstract
DNA fingerprinting is a genetic typing technique that allows the analysis of the genomic relatedness between samples, and the comparison of DNA patterns. The analysis of DNA gel fingerprint images usually consists of five consecutive steps: image pre-processing, lane segmentation, band detection, normalization and fingerprint comparison. In this article, we first survey the main methods that have been applied in the literature in each of these stages. Second, we focus on lane-segmentation and band-detection algorithms (the steps that usually require user intervention) and identify the seven core algorithms used for both tasks. Subsequently, we present a benchmark that includes a data set of images, the gold standards associated with those images and the tools to measure the performance of lane-segmentation and band-detection algorithms. Finally, we implement the core algorithms used both for lane segmentation and band detection, and evaluate their performance using our benchmark. We conclude that the average profile algorithm is the best starting point for lane segmentation and band detection.
|
188
|
Sharma R, Dehzangi A, Lyons J, Paliwal K, Tsunoda T, Sharma A. Predict Gram-Positive and Gram-Negative Subcellular Localization via Incorporating Evolutionary Information and Physicochemical Features Into Chou's General PseAAC. IEEE Trans Nanobioscience 2015; 14:915-26. [DOI: 10.1109/tnb.2015.2500186] [Citation(s) in RCA: 71] [Impact Index Per Article: 7.9] [Indexed: 11/08/2022]
|
189
|
Chen W, Feng P, Ding H, Lin H, Chou KC. iRNA-Methyl: Identifying N6-methyladenosine sites using pseudo nucleotide composition. Anal Biochem 2015; 490:26-33. [DOI: 10.1016/j.ab.2015.08.021] [Citation(s) in RCA: 254] [Impact Index Per Article: 28.2] [Received: 07/13/2015] [Revised: 08/13/2015] [Accepted: 08/16/2015] [Indexed: 10/23/2022]
|
190
|
Chen FS, Jiang ZR. Prediction of drug’s Anatomical Therapeutic Chemical (ATC) code by integrating drug–domain network. J Biomed Inform 2015; 58:80-88. [DOI: 10.1016/j.jbi.2015.09.016] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 05/19/2015] [Revised: 09/14/2015] [Accepted: 09/22/2015] [Indexed: 10/22/2022]
|
191
|
Isami S, Sakamoto N, Nishimori H, Awazu A. Simple Elastic Network Models for Exhaustive Analysis of Long Double-Stranded DNA Dynamics with Sequence Geometry Dependence. PLoS One 2015; 10:e0143760. [PMID: 26624614 PMCID: PMC4666469 DOI: 10.1371/journal.pone.0143760] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 07/23/2015] [Accepted: 11/09/2015] [Indexed: 11/19/2022] Open
Abstract
Simple elastic network models of DNA were developed to reveal the structure-dynamics relationships for several nucleotide sequences. First, we propose a simple all-atom elastic network model of DNA that can explain the profiles of temperature factors for several crystal structures of DNA. Second, we propose a coarse-grained elastic network model of DNA, where each nucleotide is described by only one node. This model could effectively reproduce the detailed dynamics obtained with the all-atom elastic network model according to the sequence-dependent geometry. Through normal-mode analysis for the coarse-grained elastic network model, we exhaustively analyzed the dynamic features of a large number of long DNA sequences, approximately 150 bp in length. These analyses revealed positive correlations between the nucleosome-forming abilities and the inter-strand fluctuation strength of double-stranded DNA for several DNA sequences.
Affiliation(s)
- Shuhei Isami: Department of Mathematical and Life Sciences, Hiroshima University, Kagami-yama 1-3-1, Higashi-Hiroshima 739-8526, Japan
- Naoaki Sakamoto: Department of Mathematical and Life Sciences, Hiroshima University, Kagami-yama 1-3-1, Higashi-Hiroshima 739-8526, Japan; Research Center for Mathematics on Chromatin Live Dynamics, Hiroshima University, Kagami-yama 1-3-1, Higashi-Hiroshima 739-8526, Japan
- Hiraku Nishimori: Department of Mathematical and Life Sciences, Hiroshima University, Kagami-yama 1-3-1, Higashi-Hiroshima 739-8526, Japan; Research Center for Mathematics on Chromatin Live Dynamics, Hiroshima University, Kagami-yama 1-3-1, Higashi-Hiroshima 739-8526, Japan
- Akinori Awazu: Department of Mathematical and Life Sciences, Hiroshima University, Kagami-yama 1-3-1, Higashi-Hiroshima 739-8526, Japan; Research Center for Mathematics on Chromatin Live Dynamics, Hiroshima University, Kagami-yama 1-3-1, Higashi-Hiroshima 739-8526, Japan
|
192
|
Prediction of Protein–Protein Interaction Sites with Machine-Learning-Based Data-Cleaning and Post-Filtering Procedures. J Membr Biol 2015; 249:141-53. [DOI: 10.1007/s00232-015-9856-z] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.2] [Received: 08/07/2015] [Accepted: 11/03/2015] [Indexed: 12/12/2022]
|
193
|
Ju Z, Cao JZ, Gu H. iLM-2L: A two-level predictor for identifying protein lysine methylation sites and their methylation degrees by incorporating K-gap amino acid pairs into Chou's general PseAAC. J Theor Biol 2015; 385:50-7. [DOI: 10.1016/j.jtbi.2015.07.030] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 05/15/2015] [Revised: 07/06/2015] [Accepted: 07/23/2015] [Indexed: 10/23/2022]
|
194
|
Liu B, Fang L, Wang S, Wang X, Li H, Chou KC. Identification of microRNA precursor with the degenerate K-tuple or Kmer strategy. J Theor Biol 2015; 385:153-9. [DOI: 10.1016/j.jtbi.2015.08.025] [Citation(s) in RCA: 131] [Impact Index Per Article: 14.6] [Received: 07/16/2015] [Revised: 08/21/2015] [Accepted: 08/24/2015] [Indexed: 10/23/2022]
|
195
|
Jia J, Liu Z, Xiao X, Liu B, Chou KC. Identification of protein-protein binding sites by incorporating the physicochemical properties and stationary wavelet transforms into pseudo amino acid composition. J Biomol Struct Dyn 2015; 34:1946-61. [PMID: 26375780 DOI: 10.1080/07391102.2015.1095116] [Citation(s) in RCA: 88] [Impact Index Per Article: 9.8] [Indexed: 10/23/2022]
Abstract
With the explosive growth of protein sequences entering protein data banks in the post-genomic era, there is a strong demand for automated methods that rapidly and effectively identify protein-protein binding sites (PPBSs) from sequence information alone. To address this problem, we proposed a predictor called iPPBS-PseAAC, in which each amino acid residue site of the proteins concerned was treated as a 15-tuple peptide segment generated by sliding a window along the protein chain with its center aligned with the target residue. The working peptide segment is then formulated by a general form of pseudo amino acid composition via the following procedure: (1) it is converted into a numerical series via the physicochemical properties of amino acids; (2) the numerical series is subsequently converted into a 20-D feature vector by means of the stationary wavelet transform technique. The prediction engine is a two-layer ensemble classifier formed from many individual "Random Forest" classifiers, with the first layer voting out the best training data-set from many bootstrap systems and the second layer voting out the most relevant one from seven physicochemical properties. Cross-validation tests indicate that the new predictor is very promising, meaning that many important features, which are deeply hidden in complicated protein sequences, can be extracted via the wavelet transform approach. This is consistent with the fact that many important biological functions of proteins can be elucidated from their low-frequency internal motions. The web server of iPPBS-PseAAC is accessible at http://www.jci-bioinfo.cn/iPPBS-PseAAC, by which users can easily acquire their desired results without the need to follow the complicated mathematical equations involved.
Affiliation(s)
- Jianhua Jia
  - Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333403, China
- Zi Liu
  - Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333403, China
- Xuan Xiao
  - Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333403, China
  - Gordon Life Science Institute, Boston, MA 02478, USA
- Bingxiang Liu
  - Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333403, China
- Kuo-Chen Chou
  - Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah 21589, Saudi Arabia
  - Gordon Life Science Institute, Boston, MA 02478, USA
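The sliding-window step described in the abstract above can be sketched roughly as follows. This is only an illustration, not the authors' implementation: the function name `sliding_windows`, the example sequence, and the padding of chain ends with a dummy residue `X` are assumptions, since the abstract does not say how residues near the termini are handled.

```python
def sliding_windows(sequence, width=15, pad="X"):
    """Return one fixed-width segment per residue, centered on that residue.

    Assumption: chain ends are padded with a dummy residue (`pad`) so that
    every position, including the termini, gets a full-width segment.
    """
    if width % 2 == 0:
        raise ValueError("width must be odd so the window has a center")
    half = width // 2
    padded = pad * half + sequence + pad * half
    # Segment i starts at padded index i, so its center is sequence[i].
    return [padded[i:i + width] for i in range(len(sequence))]


# Hypothetical example sequence (22 residues).
example = "MKTAYIAKQRQISFVKSHFSRQ"
segments = sliding_windows(example)
```

Each entry of `segments` is a 15-residue string whose middle character (index 7) is the target residue; such a segment would then be converted to a numerical series before any further feature extraction.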
|
196
|
Liu B, Fang L, Long R, Lan X, Chou KC. iEnhancer-2L: a two-layer predictor for identifying enhancers and their strength by pseudo k-tuple nucleotide composition. Bioinformatics 2015; 32:362-9. [PMID: 26476782 DOI: 10.1093/bioinformatics/btv604] [Citation(s) in RCA: 266] [Impact Index Per Article: 29.6] [Received: 09/25/2015] [Accepted: 10/12/2015] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION: Enhancers are short regulatory DNA elements. They can be bound by proteins (activators) to activate transcription of a gene, and hence play a critical role in promoting gene transcription in eukaryotes. With the avalanche of DNA sequences generated in the post-genomic age, it is a challenging task to develop computational methods for the timely identification of enhancers from extremely complicated DNA sequences. Although some efforts have been made in this regard, they were limited to identifying whether a query DNA element is an enhancer or not. According to their distinct levels of biological activity and regulatory effects on target genes, however, enhancers should be further classified into strong and weak ones.
RESULTS: In view of this, a two-layer predictor called 'iEnhancer-2L' was proposed by formulating DNA elements with the 'pseudo k-tuple nucleotide composition', into which six DNA local parameters were incorporated. To the best of our knowledge, it is the first computational predictor ever established for identifying not only enhancers, but also their strength. Rigorous cross-validation tests have indicated that iEnhancer-2L holds very high potential to become a useful tool for genome analysis.
AVAILABILITY AND IMPLEMENTATION: For the convenience of most experimental scientists, a web server for the two-layer predictor was established at http://bioinformatics.hitsz.edu.cn/iEnhancer-2L/, by which users can easily get their desired results without the need to go through the mathematical details.
CONTACT: bliu@gordonlifescience.org, bliu@insun.hit.edu.cn, xlan@stanford.edu, kcchou@gordonlifescience.org
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Bin Liu
  - School of Computer Science and Technology, Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China
  - Computational Biology, Gordon Life Science Institute, Belmont, MA 02478, USA
- Ren Long
  - School of Computer Science and Technology
- Xun Lan
  - Department of Genetics, Stanford University, Stanford, CA 94305, USA
- Kuo-Chen Chou
  - Computational Biology, Gordon Life Science Institute, Belmont, MA 02478, USA
  - Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah 21589, Saudi Arabia
|
197
|
Zhang J, Wang G, Feng J, Zhang L, Li J. Identifying ion channel genes related to cardiomyopathy using a novel decision forest strategy. Molecular BioSystems 2015; 10:2407-14. [PMID: 24977958 DOI: 10.1039/c4mb00193a] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 12/31/2022]
Abstract
Ion channels perform many crucial functions in life. Their dysfunction may lead to a number of diseases, such as arrhythmia and beta cell dysfunction. In this study, we first selected the ion channel gene expression profiles using a dimensionality reduction method. We then applied a novel decision forest strategy to mine cardiomyopathy-related ion channel genes. The newly proposed measure Zi integrates the height of the decision trees with the frequency at which a gene is located in the tree, and it achieved a much higher feature selection ability. As a result, 26 cardiomyopathy-related ion channel genes were identified, whose Zi values were higher than the threshold Z*. Furthermore, most of these genes had been reported to be related to cardiomyopathies. In conclusion, our proposed decision forest strategy had a better classification performance. Our result can provide a theoretical basis for cardiovascular researchers.
Affiliation(s)
- Jie Zhang
  - Department of Prevention, Tongji University School of Medicine, Shanghai, China.
|
198
|
Protein cold adaptation: Role of physico-chemical parameters in adaptation of proteins to low temperatures. J Theor Biol 2015; 383:130-7. [DOI: 10.1016/j.jtbi.2015.07.013] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 03/18/2015] [Revised: 06/21/2015] [Accepted: 07/16/2015] [Indexed: 11/21/2022]
|
199
|
YongE F, GaoShan K. Identify Beta-Hairpin Motifs with Quadratic Discriminant Algorithm Based on the Chemical Shifts. PLoS One 2015; 10:e0139280. [PMID: 26422468 PMCID: PMC4589334 DOI: 10.1371/journal.pone.0139280] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 05/12/2015] [Accepted: 09/09/2015] [Indexed: 01/13/2023] Open
Abstract
Successful prediction of the beta-hairpin motif will be helpful for understanding fold recognition. Some algorithms have been proposed for the prediction of beta-hairpin motifs. However, the parameters used by these methods were primarily based on amino acid sequences. Here, we propose a novel model for predicting beta-hairpin structure based on the chemical shift. First, we analyzed the statistical distribution of the chemical shifts of six nuclei in non-beta-hairpin and beta-hairpin motifs. Second, we used these chemical shifts as features, combined with three algorithms, to predict beta-hairpin structure. Finally, the best prediction was achieved with the quadratic discriminant analysis algorithm: a sensitivity of 92%, a specificity of 94%, and a Matthews correlation coefficient of 0.85 in three-fold cross-validation, which is clearly superior to the same method applied to the 20 amino acid compositions. Our findings show that the chemical shift is an effective parameter for beta-hairpin prediction and suggest that quadratic discriminant analysis is a powerful algorithm for the prediction of beta-hairpins.
Affiliation(s)
- Feng YongE
  - College of Science, Inner Mongolia Agriculture University, Hohhot, PR China
- Kou GaoShan
  - College of Science, Inner Mongolia Agriculture University, Hohhot, PR China
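For readers unfamiliar with the three metrics quoted in the abstract above, the sketch below shows how sensitivity, specificity and the Matthews correlation coefficient are computed from a binary confusion matrix. The function name `binary_metrics` and the counts in the example are hypothetical and are not taken from the paper.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity and Matthews correlation coefficient (MCC).

    Assumes both classes are non-empty, i.e. tp + fn > 0 and tn + fp > 0.
    """
    sensitivity = tp / (tp + fn)   # fraction of true positives recovered
    specificity = tn / (tn + fp)   # fraction of true negatives recovered
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the MCC is set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc


# Hypothetical counts: 100 hairpins and 100 non-hairpins.
sn, sp, mcc = binary_metrics(tp=92, fn=8, tn=94, fp=6)
```

Unlike sensitivity and specificity, the MCC uses all four cells of the confusion matrix, which is why it is often reported alongside them for unbalanced datasets.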
|
200
|
Rare k-mer DNA: Identification of sequence motifs and prediction of CpG island and promoter. J Theor Biol 2015; 387:88-100. [PMID: 26427337 DOI: 10.1016/j.jtbi.2015.09.014] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 01/01/2015] [Revised: 09/10/2015] [Accepted: 09/15/2015] [Indexed: 12/20/2022]
Abstract
Empirical analysis of k-mer DNA has proven to be an effective tool for finding unique patterns in DNA sequences, which can lead to the discovery of potential sequence motifs. In an extensive study of empirical k-mer DNA in hundreds of organisms, the researchers found that unique multi-modal k-mer spectra occur only in the genomes of organisms from the tetrapod clade, which includes all mammals. The multi-modality is caused by the formation of the two lowest modes, and the k-mers under these modes are referred to as rare k-mers. The suppression of the two lowest modes (the rare k-mers) can be attributed to the CG dinucleotides they contain. In addition, the rare k-mers are selectively distributed in certain genomic features: CpG islands (CGI), promoters, 5' UTRs, and exons. We correlated the rare k-mers with hundreds of annotated features using several bioinformatic tools, performed further intrinsic rare k-mer analyses within the correlated features, and modeled the elucidated rare k-mer clustering feature into a classifier to predict the correlated CGI and promoter features. Our correlation results show that rare k-mers are highly associated with several annotated features: CGI, promoter, 5' UTR, and open chromatin regions. Our intrinsic results show that rare k-mers have several unique topological, compositional, and clustering properties in CGI and promoter features. Finally, the performance of our RWC (rare-word clustering) method in predicting the CGI and promoter features ranked among the top three in eight of the CGI and promoter evaluations across the eight benchmarked datasets.
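The notion of a k-mer spectrum used in the abstract above can be illustrated with a minimal Python sketch. It counts overlapping k-mers in a toy sequence, builds the spectrum (how many distinct k-mers occur a given number of times), and lists the k-mers that contain a CG dinucleotide. The function names and the toy sequence are assumptions for illustration; this is not the RWC method itself.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every overlapping k-mer in the sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_spectrum(counts):
    """Map an occurrence count to the number of distinct k-mers with that count."""
    return Counter(counts.values())


# Toy sequence; real spectra are computed over whole genomes.
counts = kmer_counts("ACGTACGTTTACG", 3)
spectrum = kmer_spectrum(counts)
cg_kmers = sorted(kmer for kmer in counts if "CG" in kmer)
```

On a whole genome, plotting `spectrum` (count on the x-axis, number of distinct k-mers on the y-axis) gives the kind of curve whose modes the abstract refers to.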
|