1
Mu Z, Yu T, Liu X, Zheng H, Wei L, Liu J. FEGS: a novel feature extraction model for protein sequences and its applications. BMC Bioinformatics 2021; 22:297. [PMID: 34078264 PMCID: PMC8172329 DOI: 10.1186/s12859-021-04223-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 04/25/2021] [Accepted: 05/28/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Feature extraction of protein sequences is widely used in various research areas related to protein analysis, such as protein similarity analysis and prediction of protein functions or interactions. RESULTS: In this study, we introduce FEGS (Feature Extraction based on Graphical and Statistical features), a novel feature extraction model of protein sequences, by developing a new technique for graphical representation of protein sequences based on the physicochemical properties of amino acids and effectively employing the statistical features of protein sequences. By fusing the graphical and statistical features, FEGS transforms a protein sequence into a 578-dimensional numerical vector. When FEGS is applied to phylogenetic analysis on five protein sequence data sets, its performance is notably better than all of the other compared methods. CONCLUSION: The FEGS method is carefully designed and practically powerful for extracting features of protein sequences. The current version of FEGS is developed to be user-friendly and is expected to play a crucial role in related studies of protein sequence analysis.
Affiliation(s)
- Zengchao Mu
  - School of Mathematics and Statistics, Shandong University, Weihai, 264209, China
- Ting Yu
  - Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, 266237, China
- Xiaoping Liu
  - Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Beijing, China
- Hongyu Zheng
  - Department of Radiation Oncology, Qilu Hospital, Cheeloo College of Medicine, Shandong University, Jinan, 250012, China
- Leyi Wei
  - School of Software, Shandong University, Jinan, China.
- Juntao Liu
  - School of Mathematics and Statistics, Shandong University, Weihai, 264209, China.
2
Qi Z, Wen X. Novel Protein Sequence Comparison Method Based on Transition Probability Graph and Information Entropy. Comb Chem High Throughput Screen 2020; 25:392-400. [PMID: 32875978 DOI: 10.2174/1386207323666200901103001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2020] [Revised: 07/17/2020] [Accepted: 07/17/2020] [Indexed: 11/22/2022]
Abstract
AIM AND OBJECTIVE: Sequence analysis is one of the foundations of bioinformatics. It is widely used to find the feature metrics hidden in a sequence. In addition, the graphical representation of biological sequences is an important tool for sequence analysis. This study sets out to find a new graphical representation of biosequences. MATERIALS AND METHODS: The transition probability is used to describe amino acid combinations of protein sequences. The combinations are composed of amino acids directly adjacent to each other or separated by multiple amino acids. The transition probability graph is built from the transition probabilities of amino acid combinations. Next, a map from the transition probability graph to a transition probability vector is defined by the k-order transition probability graph. Transition entropy vectors are developed from the transition probability vector and information entropy. Finally, the proposed method is applied to two separate data sets: 499 HA genes of H1N1 and 95 coronaviruses. RESULTS: By constructing a phylogenetic tree, we find that the results of each application are consistent with other studies. CONCLUSION: The graphical representation proposed in this article is a practical and correct method.
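Note: the abstract above does not give the exact construction, so the following is only a minimal sketch of the general idea, assuming that a "transition" is the residue pair `(seq[i], seq[i + k])` and that the entropy is the Shannon entropy of each residue's outgoing transition distribution. The function names and the gap parameter `k` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def transition_probabilities(seq, k=1):
    """Estimate P(b | a) from residue pairs (seq[i], seq[i + k]).

    k = 1 pairs directly adjacent residues; k > 1 pairs residues
    separated by k - 1 intervening positions (an assumed reading).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    pair_counts = Counter(zip(seq, seq[k:]))
    totals = Counter()
    for (a, _b), n in pair_counts.items():
        totals[a] += n
    probs = defaultdict(dict)
    for (a, b), n in pair_counts.items():
        probs[a][b] = n / totals[a]
    return dict(probs)


def transition_entropy(probs):
    """Shannon entropy (in bits) of each residue's outgoing transitions."""
    return {
        a: -sum(p * math.log2(p) for p in row.values())
        for a, row in probs.items()
    }


probs = transition_probabilities("ACACAD", k=1)
entropy = transition_entropy(probs)
```

Repeating the calculation for several values of `k` and concatenating the entropies would give one fixed-length vector per sequence, which is one plausible way to obtain a comparable feature vector under these assumptions.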
Affiliation(s)
- Zhaohui Qi
  - College of Information Science and Engineering, Hunan Normal University, Changsha 410081, China
- Xinlong Wen
  - College of Information Science and Engineering, Hunan Normal University, Changsha 410081, China
3
A Statistical Similarity/Dissimilarity Analysis of Protein Sequences Based on a Novel Group Representative Vector. BIOMED RESEARCH INTERNATIONAL 2019; 2019:8702968. [PMID: 31205946 PMCID: PMC6530227 DOI: 10.1155/2019/8702968] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 01/06/2019] [Revised: 03/13/2019] [Accepted: 04/07/2019] [Indexed: 11/30/2022]
Abstract
Similarity/dissimilarity analysis is a key way of understanding the biology of an organism by knowing the origin of new genes/sequences. Sequence data are grouped in terms of biological relationships. The number of sequences in any group is liable to grow every day. All present alignment-free methods demonstrate the utility of their approaches by producing a similarity/dissimilarity matrix. Although this matrix is clear, it measures the degree of similarity among sequences individually. In our work, a representative of each of three groups of protein sequences is introduced. A similarity/dissimilarity vector is evaluated instead of the ordinary similarity/dissimilarity matrix based on the group representative. The approach is applied to three selected groups of protein sequences: beta globin, NADH dehydrogenase subunit 5 (ND5), and spike protein sequences. A cross-grouping comparison is produced to ensure the singularity of each group. A qualitative comparison between our approach, previous articles, and the phylogenetic tree of these protein sequences proved the utility of our approach.
4
Abo-Elkhier MM, Abd Elwahaab MA, Abo El Maaty MI. Measuring Similarity among Protein Sequences Using a New Descriptor. BIOMED RESEARCH INTERNATIONAL 2019; 2019:2796971. [PMID: 31886192 PMCID: PMC6893242 DOI: 10.1155/2019/2796971] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 07/06/2019] [Revised: 09/03/2019] [Accepted: 10/28/2019] [Indexed: 12/01/2022]
Abstract
The comparison of protein sequences according to similarity is a fundamental aspect of today's biomedical research. With the development of sequencing technologies, the number of protein sequences in public databases is increasing exponentially. Well-known sequence comparison methods are alignment based. They generally give excellent results when the sequences under study are closely related, but they are time consuming. Herein, a new alignment-free method is introduced. Our technique depends on a new graphical representation and descriptor. The graphical representation of a protein sequence is a simple way to visualize protein sequences. The descriptor compresses the primary sequence into a single vector composed of only two values. Our approach gives good results with both short and long sequences in little computation time. It is applied to nine beta globin, nine ND5 (NADH dehydrogenase subunit 5), and 24 spike protein sequences. Correlation and significance analyses are also introduced to compare our similarity/dissimilarity results with other approaches' results and with sequence homology.
Affiliation(s)
- Mervat M. Abo-Elkhier
  - Department of Engineering Mathematics and Physics, Faculty of Engineering, Mansoura University, Mansoura 35516, Egypt
- Marwa A. Abd Elwahaab
  - Department of Engineering Mathematics and Physics, Faculty of Engineering, Mansoura University, Mansoura 35516, Egypt
- Moheb I. Abo El Maaty
  - Department of Engineering Mathematics and Physics, Faculty of Engineering, Mansoura University, Mansoura 35516, Egypt
5
Mu Z, Yu T, Qi E, Liu J, Li G. DCGR: feature extractions from protein sequences based on CGR via remodeling multiple information. BMC Bioinformatics 2019; 20:351. [PMID: 31221087 PMCID: PMC6587251 DOI: 10.1186/s12859-019-2943-x] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 02/01/2019] [Accepted: 06/10/2019] [Indexed: 12/01/2022] Open
Abstract
BACKGROUND: Protein feature extraction plays an important role in the areas of similarity analysis of protein sequences and prediction of protein structures, functions and interactions. The feature extraction based on graphical representation is one of the most effective and efficient ways. However, most existing methods suffer limitations from their method design. RESULTS: We introduce DCGR, a novel method for extracting features from protein sequences based on the chaos game representation, which is developed by constructing CGR curves of protein sequences according to physicochemical properties of amino acids, followed by converting the CGR curves into multi-dimensional feature vectors by using the distributions of points in CGR images. Tested on five data sets, DCGR was significantly superior to the state-of-the-art feature extraction methods. CONCLUSION: The DCGR is practically powerful for extracting effective features from protein sequences, and therefore important in similarity analysis of protein sequences, study of protein-protein interactions and prediction of protein functions. It is freely available at https://sourceforge.net/projects/transcriptomeassembly/files/Feature%20Extraction.
Affiliation(s)
- Zengchao Mu
  - School of Mathematics, Shandong University, Jinan, 250100 Shandong Province China
- Ting Yu
  - School of Mathematics, Shandong University, Jinan, 250100 Shandong Province China
- Enfeng Qi
  - College of Mathematics and Statistics, Guangxi Normal University, Guilin, 541001 China
- Juntao Liu
  - School of Mathematics, Shandong University, Jinan, 250100 Shandong Province China
- Guojun Li
  - School of Mathematics, Shandong University, Jinan, 250100 Shandong Province China
6
Identifying anticancer peptides by using a generalized chaos game representation. J Math Biol 2018; 78:441-463. [PMID: 30291366 DOI: 10.1007/s00285-018-1279-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 09/25/2017] [Revised: 08/01/2018] [Indexed: 10/28/2022]
Abstract
We generalize chaos game representation (CGR) to higher dimensional spaces while maintaining its bijection, keeping the method sufficiently representative and mathematically rigorous compared to previous attempts. We first state and prove the asymptotic property of CGR and our generalized chaos game representation (GCGR) method. The prediction follows that the dissimilarity of sequences which possess identical subsequences but distinct positions would be lowered exponentially by the length of the identical subsequence; this effect was taking place unbeknownst to researchers. By shining a spotlight on it now, we show the effect fundamentally supports (G)CGR as a similarity measure or feature extraction technique. We develop two feature extraction techniques: GCGR-Centroid and GCGR-Variance. We use the GCGR-Centroid to analyze the similarity between protein sequences by using the datasets 9 ND5, 24 TF and 50 beta-globin proteins. We obtain results consistent with previous studies, which proves the significance thereof. Finally, by utilizing support vector machines, we train the anticancer peptide prediction model by using both GCGR-Centroid and GCGR-Variance, and achieve a significantly higher prediction performance on 3 well-studied anticancer peptide datasets.
7
Qi ZH, Li KC, Ma JL, Yao YH, Liu LY. Novel Method of 3-Dimensional Graphical Representation for Proteins and Its Application. Evol Bioinform Online 2018; 14:1176934318777755. [PMID: 29977111 PMCID: PMC6024350 DOI: 10.1177/1176934318777755] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 12/14/2017] [Accepted: 04/09/2018] [Indexed: 11/16/2022] Open
Abstract
In this article, we propose a 3-dimensional graphical representation of protein sequences based on 10 physicochemical properties of 20 amino acids and the BLOSUM62 matrix. It contains evolutionary information and provides intuitive visualization. To further analyze the similarity of proteins, we extract a specific vector from the graphical representation curve. The vector is used to calculate the similarity distance between 2 protein sequences. To prove the effectiveness of our approach, we apply it to 3 real data sets. The results are consistent with the known evolution fact and show that our method is effective in phylogenetic analysis.
Affiliation(s)
- Zhao-Hui Qi
  - School of Information Science and Technology, Shijiazhuang Tiedao University, Shijiazhuang, Republic of China
- Ke-Cheng Li
  - School of Information Science and Technology, Shijiazhuang Tiedao University, Shijiazhuang, Republic of China
- Jin-Long Ma
  - School of Information Science and Technology, Shijiazhuang Tiedao University, Shijiazhuang, Republic of China
- Yu-Hua Yao
  - School of Mathematics and Statistics, Hainan Normal University, Haikou, Republic of China
- Ling-Yun Liu
  - School of Information Science and Technology, Shijiazhuang Tiedao University, Shijiazhuang, Republic of China
8
gDNA-Prot: Predict DNA-binding proteins by employing support vector machine and a novel numerical characterization of protein sequence. J Theor Biol 2016; 406:8-16. [PMID: 27378005 DOI: 10.1016/j.jtbi.2016.06.002] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 01/27/2016] [Revised: 05/19/2016] [Accepted: 06/01/2016] [Indexed: 11/24/2022]
Abstract
DNA-binding proteins are functional proteins in cells that play an important role in various essential biological activities. An effective and fast computational method, gDNA-Prot, is proposed in this paper to predict DNA-binding proteins. It is a DNA-binding predictor that combines the support vector machine classifier with a novel kind of feature called graphical representation. The DNA-binding protein sequence information was described with the 20 probabilities of amino acids and the 23 new numerical graphical representation features of a protein sequence, based on 23 physicochemical properties of 20 amino acids. Principal Components Analysis (PCA) was employed as the feature selection method for removing irrelevant features and reducing redundant features. The Sigmoid function and Min-max normalization methods for PCA were applied to accelerate the training speed and obtain higher accuracy. Experiments demonstrated that Principal Components Analysis with the Sigmoid function generated the best performance. The gDNA-Prot method was also compared with DNAbinder, iDNA-Prot and DNA-Prot. The results suggested that gDNA-Prot outperformed DNAbinder and iDNA-Prot. Although DNA-Prot outperformed gDNA-Prot, gDNA-Prot was faster and more convenient for predicting DNA-binding proteins. Additionally, the proposed gDNA-Prot method is available at http://sourceforge.net/projects/gdnaprot.
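Note: the two scaling steps named in the abstract above are standard transforms. The sketch below shows their textbook forms applied to one feature column; it is an illustration only, and how gDNA-Prot actually combines them with PCA is not stated in the abstract. The function names are hypothetical.

```python
import math


def min_max_normalize(values):
    """Rescale values linearly into [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def sigmoid_scale(values):
    """Map each value into (0, 1) with the logistic function.

    The two branches avoid overflow in math.exp for large |v|.
    """
    scaled = []
    for v in values:
        if v >= 0:
            scaled.append(1.0 / (1.0 + math.exp(-v)))
        else:
            e = math.exp(v)
            scaled.append(e / (1.0 + e))
    return scaled


column = [2.0, 4.0, 6.0]
min_max_column = min_max_normalize(column)
sigmoid_column = sigmoid_scale([0.0, 1000.0, -1000.0])
```

Min-max scaling keeps the relative spacing of values, while the logistic transform compresses outliers toward 0 or 1, so the two can lead to different principal components for the same data.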
9
El-Lakkani A, Lashin M. An efficient method for measuring the similarity of protein sequences. SAR AND QSAR IN ENVIRONMENTAL RESEARCH 2016; 27:363-370. [PMID: 27103219 DOI: 10.1080/1062936x.2016.1174735] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2016] [Accepted: 04/01/2016] [Indexed: 06/05/2023]
Abstract
An accurate numerical descriptor for protein sequences is introduced. It is basically the set of every three successive amino acids in the sequence (a triplet), taken from left to right, together with the distances between each two successive amino acids in the triplet, such that the sum of these distances does not exceed 8. This numerical descriptor combines two features: the amino acid composition and the position of each amino acid relative to the other nearby amino acids. This numerical descriptor is used to measure the similarity between protein sequences in three sets: NADH dehydrogenase subunit 5 (ND5) proteins of different species, 24 transferrin proteins from vertebrates and 12 proteins of baculoviruses. High correlation coefficient values between our results and the results of the ClustalW program are obtained. These values are higher than the values obtained in many other related works.
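Note: one possible reading of the triplet descriptor in the abstract above can be sketched as follows. It assumes that "distance" means the difference between sequence positions (so adjacent residues are at distance 1) and that triplets are simply counted; the paper may define either differently. The function name and the `max_total_gap` parameter are hypothetical.

```python
from collections import Counter


def gapped_triplets(seq, max_total_gap=8):
    """Count triplets (a, d1, b, d2, c) with d1 + d2 <= max_total_gap.

    a, b, c are residues at positions i < j < k, d1 = j - i and
    d2 = k - j are assumed to be index differences.
    """
    counts = Counter()
    n = len(seq)
    for i in range(n):
        for d1 in range(1, max_total_gap):
            j = i + d1
            if j >= n:
                break
            for d2 in range(1, max_total_gap - d1 + 1):
                k = j + d2
                if k >= n:
                    break
                counts[(seq[i], d1, seq[j], d2, seq[k])] += 1
    return counts


small = gapped_triplets("ABC")
wider = gapped_triplets("ABCD", max_total_gap=3)
```

Two sequences could then be compared through their triplet counts, for example by counting shared keys, which captures both composition and local relative position under these assumptions.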
Affiliation(s)
- A El-Lakkani
  - Faculty of Science, Department of Biophysics, Cairo University, Giza, Egypt
- M Lashin
  - Faculty of Science, Department of Biophysics, Cairo University, Giza, Egypt
10
20D-dynamic representation of protein sequences. Genomics 2016; 107:16-23. [DOI: 10.1016/j.ygeno.2015.12.003] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Received: 09/10/2015] [Revised: 12/10/2015] [Accepted: 12/14/2015] [Indexed: 11/23/2022]
11
Novel numerical characterization of protein sequences based on individual amino acid and its application. BIOMED RESEARCH INTERNATIONAL 2015; 2015:909567. [PMID: 25705698 PMCID: PMC4332462 DOI: 10.1155/2015/909567] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 10/14/2014] [Revised: 12/18/2014] [Accepted: 01/12/2015] [Indexed: 11/22/2022]
Abstract
The hydrophobicity and hydrophilicity of amino acids play a very important role in protein folding and its interaction with the environment and other molecules, as well as its catalytic mechanism. Based on the two physicochemical indexes, a 2D graphical representation of protein sequences is introduced; meanwhile, a new numerical characteristic has been proposed to compute the distance of different sequences for analysis of sequence similarity/dissimilarity on the basis of this graphical representation. Furthermore, we apply the new distance in the similarities/dissimilarities of ND5 proteins of nine species and predict the four major classes based on the dataset containing 639 domains. The results show that the method is simple and effective.
12
El-Lakkani A, Mahran H. An efficient numerical method for protein sequences similarity analysis based on a new two-dimensional graphical representation. SAR AND QSAR IN ENVIRONMENTAL RESEARCH 2015; 26:125-137. [PMID: 25650529 DOI: 10.1080/1062936x.2014.995700] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 06/04/2023]
Abstract
A new two-dimensional graphical representation of protein sequences is introduced. Twenty concentric evenly spaced circles divided by n radial lines into equal divisions are selected to represent any protein sequence of length n. Each circle represents one of the different 20 amino acids, and each radial line represents a single amino acid of the protein sequence. An efficient numerical method based on the graph is proposed to measure the similarity between two protein sequences. To prove the accuracy of our approach, the method is applied to NADH dehydrogenase subunit 5 (ND5) proteins of nine different species and 24 transferrin sequences from vertebrates. High values of correlation coefficient between our results and the results of ClustalW are obtained (approximately perfect correlations). These values are higher than the values obtained in many other related works.
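Note: the geometry described in the abstract above can be sketched in a few lines. The sketch assumes circle radii 1 to 20, an alphabetical assignment of amino acids to circles, and radial lines at angles 2πi/n; the abstract does not state the actual ordering or spacing, so `AMINO_ACIDS` and `circle_points` are hypothetical.

```python
import math

# Assumed assignment of the 20 amino acids to circles 1..20 (alphabetical
# by one-letter code); the ordering used in the paper is not given here.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def circle_points(seq):
    """Place residue i on radial line i (angle 2*pi*i/n) at the radius of
    the circle assigned to its amino acid."""
    n = len(seq)
    points = []
    for i, residue in enumerate(seq):
        radius = AMINO_ACIDS.index(residue) + 1
        angle = 2 * math.pi * i / n
        points.append((radius * math.cos(angle), radius * math.sin(angle)))
    return points


points = circle_points("ACDE")
```

Each sequence of length n therefore becomes n points, one per radial line, whose radii encode the residue identities.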
Affiliation(s)
- A El-Lakkani
  - Department of Biophysics, Faculty of Science, Cairo University, Giza, Egypt
13
Xu SC, Li Z, Zhang SP, Hu JL. Primary structure similarity analysis of proteins sequences by a new graphical representation. SAR AND QSAR IN ENVIRONMENTAL RESEARCH 2014; 25:791-803. [PMID: 25242152 DOI: 10.1080/1062936x.2014.955055] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 06/03/2023]
Abstract
A new graphical description of the primary structure of protein sequences is introduced. First, a three-dimensional space discrete point set of a protein sequence is created based on the three main physicochemical properties of the amino acids. Secondly, a continuous cubic B-spline curve interpolating the amino acid points is constructed to represent the shape of the protein sequence. Then the geometric properties (curvature and torsion) of the continuous curve are extracted for the purpose of analyzing the similarity between protein sequences. Finally, an improved Canberra distance comparison is introduced for the similarity analysis of protein sequences with different lengths. Experimental results show that our method is effective for the similarity comparison of protein sequences.
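Note: for reference, the standard Canberra distance between two equal-length vectors is the sum of |x_i - y_i| / (|x_i| + |y_i|). The sketch below implements only this standard form; the improved variant for sequences of different lengths mentioned in the abstract above is not described there and is not reproduced. The function name is hypothetical.

```python
def canberra_distance(x, y):
    """Standard Canberra distance; terms with a zero denominator are skipped."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    total = 0.0
    for a, b in zip(x, y):
        denominator = abs(a) + abs(b)
        if denominator:
            total += abs(a - b) / denominator
    return total


same = canberra_distance([1.0, 2.0], [1.0, 2.0])
scaled = canberra_distance([1.0, 2.0], [3.0, 6.0])
```

Because each term is divided by the magnitudes of its own components, the measure weights small-valued features as heavily as large ones, which suits descriptors such as curvature and torsion that live on different scales.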
Affiliation(s)
- S C Xu
  - College of Science, Zhejiang Sci-Tech University, Hangzhou, China
14
Li J, Koehl P. 3D representations of amino acids-applications to protein sequence comparison and classification. Comput Struct Biotechnol J 2014; 11:47-58. [PMID: 25379143 PMCID: PMC4212284 DOI: 10.1016/j.csbj.2014.09.001] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Indexed: 11/26/2022] Open
Abstract
The amino acid sequence of a protein is the key to understanding its structure and ultimately its function in the cell. This paper addresses the fundamental issue of encoding amino acids in ways that the representation of such a protein sequence facilitates the decoding of its information content. We show that a feature-based representation in a three-dimensional (3D) space derived from amino acid substitution matrices provides an adequate representation that can be used for direct comparison of protein sequences based on geometry. We measure the performance of such a representation in the context of the protein structural fold prediction problem. We compare the results of classifying different sets of proteins belonging to distinct structural folds against classifications of the same proteins obtained from sequence alone or directly from structural information. We find that sequence alone performs poorly as a structure classifier. We show in contrast that the use of the three dimensional representation of the sequences significantly improves the classification accuracy. We conclude with a discussion of the current limitations of such a representation and with a description of potential improvements.
Affiliation(s)
- Jie Li
  - Genome Center, University of California, Davis, 451 Health Sciences Drive, Davis, CA 95616, United States
- Patrice Koehl
  - Department of Computer Science and Genome Center, University of California, Davis, One Shields Ave, Davis, CA 95616, United States
15
Yao Y, Yan S, Han J, Dai Q, He PA. A novel descriptor of protein sequences and its application. J Theor Biol 2014; 347:109-17. [PMID: 24412564 DOI: 10.1016/j.jtbi.2014.01.001] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Received: 04/06/2013] [Revised: 11/10/2013] [Accepted: 01/01/2014] [Indexed: 02/05/2023]
Abstract
In this paper, a dynamic 3-D graphical representation of protein sequences is introduced based on three physical-chemical properties of amino acids. The coordinates of the graph have direct biological significance, which could reflect the innate structure of the proteins. The information of principal moments of inertia and range of axis coordinate are extracted as a novel mixed descriptor and proposed for the comparison of protein primary sequences. Meanwhile, the Euclidean distance of the normalized descriptor vectors which avoid the influence of the difference in length of protein sequences under consideration is employed as a quantitative measurement of the similarity of proteins. Finally, we take the nine ND5 (NADH dehydrogenase subunit 5) proteins for example and illustrate the effectiveness of our approach.
Affiliation(s)
- Yuhua Yao
  - College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, People's Republic of China.
- Shoujiang Yan
  - College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, People's Republic of China
- Jianning Han
  - College of Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, People's Republic of China
- Qi Dai
  - College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, People's Republic of China
- Ping-an He
  - College of Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, People's Republic of China
16
El-Lakkani A, El-Sherif S. Similarity analysis of protein sequences based on 2D and 3D amino acid adjacency matrices. Chem Phys Lett 2013. [DOI: 10.1016/j.cplett.2013.10.032] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 10/26/2022]
17
Zhang YP, Ruan JS, He PA. Analyzes of the similarities of protein sequences based on the pseudo amino acid composition. Chem Phys Lett 2013. [DOI: 10.1016/j.cplett.2013.10.076] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 10/26/2022]
18
Alignment free comparison: k word voting model and its applications. J Theor Biol 2013; 335:276-82. [DOI: 10.1016/j.jtbi.2013.06.037] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 12/22/2012] [Revised: 04/25/2013] [Accepted: 06/26/2013] [Indexed: 02/06/2023]
19
Huang DS, Yu HJ. Normalized feature vectors: a novel alignment-free sequence comparison method based on the numbers of adjacent amino acids. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2013; 10:457-467. [PMID: 23929869 DOI: 10.1109/tcbb.2013.10] [Citation(s) in RCA: 57] [Impact Index Per Article: 5.2] [Indexed: 06/02/2023]
Abstract
Based on all kinds of adjacent amino acids (AAA), we map each protein primary sequence into a 400 × (L-1) matrix M. In addition, we further derive a normalized 400-tuple mathematical descriptor D, which is extracted from the primary protein sequences via singular value decomposition (SVD) of the matrix. The obtained 400-D normalized feature vectors (NFVs) further facilitate our quantitative analysis of protein sequences. Using the normalized representation of the primary protein sequences, we analyze the similarity for different sequences upon two data sets: 1) ND5 sequences from nine species and 2) transferrin sequences of 24 vertebrates. We also compared the results in this study with those from other related works. These two experiments illustrate that our proposed NFV-AAA approach does perform well in the field of similarity analysis of sequence.
Affiliation(s)
- De-Shuang Huang
  - School of Electronics and Information Engineering, Tongji University, 4800 Caoan Road, Shanghai 201804, China
20
He PA, Li D, Zhang Y, Wang X, Yao Y. A 3D graphical representation of protein sequences based on the Gray code. J Theor Biol 2012; 304:81-7. [PMID: 22554947 DOI: 10.1016/j.jtbi.2012.03.023] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.1] [Received: 07/20/2011] [Revised: 02/17/2012] [Accepted: 03/17/2012] [Indexed: 11/18/2022]
Abstract
Based on the order of 6-bit binary Gray code, a cyclic order of 20 amino acids is introduced. A novel 3D graphical representation of protein sequences is proposed according to the CGR of DNA sequences. Furthermore, the mathematical descriptor is suggested to characterize the graphical representation curve. The efficiency of our approach can be illustrated by performing the comparison of similarities/dissimilarities among sequences of the ND5 proteins of nine different species. With the correlation and significance analysis, the comparisons of both our results and results of other graphical representation with the ClustalW's results can show the utility of our approach.
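Note: the abstract above builds on the chaos game representation (CGR) of DNA sequences. As background, the classic two-dimensional DNA CGR can be sketched as below, using the common convention of corners A (0, 0), C (0, 1), G (1, 1), T (1, 0) and a start at the centre of the unit square. This is not the 3D protein construction of the paper, whose details are not given in the abstract; the names `CORNERS` and `cgr_points` are hypothetical.

```python
# Corner assignment follows the common CGR convention for DNA.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(dna):
    """Return the CGR trajectory: each new point lies halfway between the
    previous point and the corner of the current nucleotide."""
    x, y = 0.5, 0.5  # start at the centre of the unit square
    points = []
    for base in dna:
        corner_x, corner_y = CORNERS[base]
        x, y = (x + corner_x) / 2, (y + corner_y) / 2
        points.append((x, y))
    return points


trajectory = cgr_points("AG")
```

Extending the idea to proteins requires a rule for placing 20 amino acids instead of 4 nucleotides, which is where an ordering such as the Gray-code-based cyclic order would come in.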
Affiliation(s)
- Ping-an He
  - College of Science, Zhejiang Sci-Tech University, Hangzhou 310018, PR China.
21
Yu HJ, Huang DS. Novel 20-D descriptors of protein sequences and it’s applications in similarity analysis. Chem Phys Lett 2012. [DOI: 10.1016/j.cplett.2012.02.030] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.5] [Indexed: 11/28/2022]
22
Yang L, Zhang X, Zhu H. Alignment free comparison: similarity distribution between the DNA primary sequences based on the shortest absent word. J Theor Biol 2011; 295:125-31. [PMID: 22138094 DOI: 10.1016/j.jtbi.2011.11.021] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 08/09/2011] [Revised: 11/18/2011] [Accepted: 11/19/2011] [Indexed: 11/15/2022]
Abstract
This work proposes an alignment-free comparison model for DNA primary sequences. In this paper, we treat the double strands of the DNA rather than a single strand. We define the shortest absent word of the double strands between the DNA sequences, and some properties are studied to speed up the algorithm for searching the shortest absent word. We present a novel model for comparison, in which the similarity distribution is introduced to describe the similarity between the sequences. A distance measure is deduced based on the Shannon entropy and is used in phylogenetic analysis. Some experiments show that our model performs well in the field of sequence analysis.
Affiliation(s)
- Lianping Yang
  - College of Sciences, Northeastern University, Shenyang, China
23
Wang C, Xi L, Li S, Liu H, Yao X. A sequence-based computational model for the prediction of the solvent accessible surface area for α-helix and β-barrel transmembrane residues. J Comput Chem 2011; 33:11-7. [DOI: 10.1002/jcc.21936] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Received: 01/28/2011] [Revised: 07/21/2011] [Accepted: 08/09/2011] [Indexed: 11/10/2022]
24
Bielińska-Wąż D. Graphical and numerical representations of DNA sequences: statistical aspects of similarity. JOURNAL OF MATHEMATICAL CHEMISTRY 2011; 49:2345. [PMID: 32214591 PMCID: PMC7087963 DOI: 10.1007/s10910-011-9890-8] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.6] [Received: 02/18/2011] [Accepted: 07/22/2011] [Indexed: 05/10/2023]
Abstract
New approaches aiming at a detailed similarity/dissimilarity analysis of DNA sequences are formulated. Several corrections that enrich the information which may be derived from the alignment methods are proposed. The corrections take into account the distributions along the sequences of the aligned bases (neglected in the standard alignment methods). As a consequence, different aspects of similarity, for example asymmetry of the gene structure, may be studied either using new similarity measures associated with the four-component spectral representation of the DNA sequences or using alignment methods with the corrections introduced in this paper. The corrections to the alignment methods and the statistical distribution moment-based descriptors derived from the four-component spectral representation of the DNA sequences are applied to similarity/dissimilarity studies of the β-globin gene across species. The studies are supplemented by detailed similarity studies for histone H1 and H4 coding sequences. The data are described according to the latest version of the EMBL database. The work is supplemented by a concise review of the state-of-the-art graphical representations of DNA sequences.
Affiliation(s)
- Dorota Bielińska-Wąż
- Instytut Fizyki, Uniwersytet Mikołaja Kopernika, Grudziądzka 5, 87-100 Toruń, Poland
|
25
|
Liao B, Liao B, Lu X, Cao Z. A novel graphical representation of protein sequences and its application. J Comput Chem 2011; 32:2539-44. [PMID: 21638292 DOI: 10.1002/jcc.21833] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Received: 01/21/2011] [Revised: 03/22/2011] [Accepted: 04/13/2011] [Indexed: 11/08/2022]
Abstract
On the basis of information on the evolution of the 20 amino acids and their physicochemical characteristics, we propose a new two-dimensional (2D) graphical representation of protein sequences in this article. In this representation, we use 2D data to represent three-dimensional information constructed from the amino acids' evolution index, the class of each amino acid based on physicochemical characteristics, and the order in which the amino acids appear in the protein sequence. Then, using the discrete Fourier transform, sequence signals of different lengths are transformed to the frequency domain, in which all sequences have the same length. The new method is used to analyze protein sequence similarity and to predict protein structural class. The experiments indicate that our method is effective and useful.
Affiliation(s)
- Bo Liao
- College of Information Science and Technology, Hunan University, Changsha, Hunan, China.
|
26
|
Abstract
Up to now, various approaches for phylogenetic analysis have been developed. Almost all of them focus on analyzing nucleic acid sequences or protein primary sequences. In this paper, we propose a new sequence distance for efficient reconstruction of phylogenetic trees, based on the distribution of the lengths of common sub-sequences between two sequences. We describe some applications of this method, which not only show its validity but also suggest a number of novel phylogenetic insights.
Affiliation(s)
- Guisong Chang
- School of Mathematical Sciences, Dalian University of Technology, 116024 Dalian, People's Republic of China.
|
27
|
Bielińska-Wąż D, Subramaniam S. Classification studies based on a spectral representation of DNA. J Theor Biol 2010; 266:667-74. [DOI: 10.1016/j.jtbi.2010.07.038] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 03/15/2010] [Revised: 06/29/2010] [Accepted: 07/28/2010] [Indexed: 12/01/2022]
|
28
|
Liao B, Liao B, Sun X, Zeng Q. A novel method for similarity analysis and protein sub-cellular localization prediction. Bioinformatics 2010; 26:2678-83. [PMID: 20826879 DOI: 10.1093/bioinformatics/btq521] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.6] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Biological sequences are regarded as an important object of study by many biologists, because they contain a large amount of biological information that is helpful for studies of cells, DNA and proteins. Currently, many researchers use methods based on protein sequences for function classification and for prediction of sub-cellular location, structure and functional sites, including some machine-learning methods. The purpose of this article is to find a new method of sequence analysis that is simpler and more effective. RESULTS According to the nature of the 64 genetic codes, we propose a simple and intuitive 2D graphical representation of protein sequences. Based on this representation, we give a new Euclidean-distance method to compute the distance between different sequences for the analysis of sequence similarity. This approach retains more sequence information. A typical phylogenetic tree constructed with this method demonstrates the effectiveness of our approach. Finally, we use this sequence-similarity-analysis method to predict protein sub-cellular localization on two commonly used datasets. The results show that the method is reasonable.
Affiliation(s)
- Bo Liao
- School of computer and communication, Hunan University, Changsha, Hunan, China.
|