1
|
Pal J, Ghosh S, Maji B, Bhattacharya DK. Use of 2D FFT and DTW in Protein Sequence Comparison. Protein J 2024; 43:1-11. [PMID: 37848727 DOI: 10.1007/s10930-023-10160-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 09/20/2023] [Indexed: 10/19/2023]
Abstract
Protein sequence comparison remains challenging for researchers because of its computational complexity: proteins are built from 20 amino acids, compared with only four nucleotides in genome sequences. Further, protein sequences from different species differ in length, which poses an additional challenge for developing methods, especially alignment-free methods, to compare them. In this work, an efficient technique for comparing protein sequences is developed using a graphical representation. First, the 20 amino acids are classified into four groups by polar class, narrowing the representational range from 20 to 4. Then a unit vector technique based on a two-quadrant Cartesian system is proposed to provide a new two-dimensional graphical representation of the protein sequence. Two approaches are then proposed to cope with the varying lengths of protein sequences from various species: one uses Dynamic Time Warping (DTW), while the other uses a two-dimensional Fast Fourier Transform (2D FFT). Next, the effectiveness of these two techniques is analyzed using two evaluation criteria: quantitative measures based on symmetric distance (SD), and computational speed. An analysis is performed on five data sets of 9 ND4, 9 ND5, 9 ND6, 12 Baculovirus, and 24 TF proteins under the two methods. The FFT-based method is found to produce the same results as DTW in less computational time. The results of the proposed method agree with the known biological reference. Further, the present method produces better clustering than existing ones.
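As a rough illustration of how DTW copes with curves of unequal length, here is a minimal Python sketch. It is not the authors' implementation: the function name `dtw_distance`, the Euclidean local cost, and the toy 2D curves are assumptions made for this example, and the paper's amino acid grouping and unit-vector encoding are not reproduced.

```python
import math

def dtw_distance(curve_a, curve_b):
    """Classic O(len(a) * len(b)) dynamic time warping between two 2D point sequences."""
    n, m = len(curve_a), len(curve_b)
    inf = float("inf")
    # cost[i][j]: minimal accumulated cost of aligning the first i points
    # of curve_a with the first j points of curve_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Assumed local cost: Euclidean distance between the two points.
            local = math.dist(curve_a[i - 1], curve_b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # repeat a point of curve_b
                                     cost[i][j - 1],      # repeat a point of curve_a
                                     cost[i - 1][j - 1])  # advance both curves
    return cost[n][m]

# Hypothetical toy curves: the second is the first with two points repeated.
short_curve = [(0.0, 0.0), (1.0, 1.0), (2.0, 1.0)]
long_curve = [(0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (2.0, 1.0), (2.0, 1.0)]
print(dtw_distance(short_curve, long_curve))
```

Because warping may repeat points, the two toy curves align at zero cost even though their lengths differ, which is the property that makes DTW usable on sequences of different lengths.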
Affiliation(s)
- Jayanta Pal
  - Department of ECE, National Institute of Technology, Durgapur, India
  - Department of CSE, Narula Institute of Technology, Kolkata, India
- Soumen Ghosh
  - Department of ECE, National Institute of Technology, Durgapur, India
- Bansibadan Maji
  - Department of ECE, National Institute of Technology, Durgapur, India
|
2
|
Li W, Yang L, Qiu Y, Yuan Y, Li X, Meng Z. FFP: joint Fast Fourier transform and fractal dimension in amino acid property-aware phylogenetic analysis. BMC Bioinformatics 2022; 23:347. [PMID: 35986255 PMCID: PMC9392226 DOI: 10.1186/s12859-022-04889-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/30/2022] [Accepted: 08/11/2022] [Indexed: 11/10/2022]
Abstract
Background
Amino acid property-aware phylogenetic analysis (APPA) refers to the phylogenetic analysis method based on amino acid property encoding, which is used for understanding and inferring evolutionary relationships between species from the molecular perspective. Fast Fourier transform (FFT) and Higuchi’s fractal dimension (HFD) have excellent performance in describing sequences’ structural and complexity information for APPA. However, with the exponential growth of protein sequence data, it is very important to develop a reliable APPA method for protein sequence analysis.
Results
Consequently, we propose a new method named FFP, which combines FFT and HFD. Firstly, FFP encodes protein sequences on the basis of an important physicochemical property of amino acids, the dissociation constant, which determines the acidity and basicity of protein molecules. Secondly, FFT and HFD are used to generate the feature vectors of the encoded sequences, after which the distance matrix, which describes the degree of similarity between species, is calculated from the cosine function. The smaller the distance between two species, the more similar they are. Finally, the phylogenetic tree is constructed. When FFP is tested for phylogenetic analysis on four groups of protein sequences, the results are clearly better than those of the compared methods, with the highest accuracy exceeding 97%.
Conclusion
FFP has higher accuracy in APPA and multi-sequence alignment. It can also measure protein sequence similarity effectively, and it is hoped to play a role in APPA-related research.
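The two building blocks named in this abstract, Higuchi's fractal dimension and a cosine-based distance, can be sketched in plain Python as follows. This is an illustrative sketch, not the FFP code: the function names, the default `k_max=5`, and the choice of distance as one minus cosine similarity are assumptions, and the dissociation-constant encoding and FFT step are not shown.

```python
import math

def higuchi_fd(series, k_max=5):
    """Higuchi fractal dimension: slope of ln L(k) against ln(1/k)."""
    n = len(series)
    if k_max < 2 or n < 2 * k_max:
        raise ValueError("need k_max >= 2 and len(series) >= 2 * k_max")
    log_inv_k, log_length = [], []
    for k in range(1, k_max + 1):
        curve_lengths = []
        for m in range(k):  # m is the 0-based start offset of the sub-series
            steps = (n - 1 - m) // k
            total = sum(abs(series[m + i * k] - series[m + (i - 1) * k])
                        for i in range(1, steps + 1))
            # Normalise so that sub-series of different lengths are comparable.
            curve_lengths.append(total * (n - 1) / (steps * k) / k)
        log_inv_k.append(math.log(1.0 / k))
        # Note: a constant series gives length 0 and math.log raises ValueError.
        log_length.append(math.log(sum(curve_lengths) / k))
    mean_x = sum(log_inv_k) / k_max
    mean_y = sum(log_length) / k_max
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_inv_k, log_length))
    denominator = sum((x - mean_x) ** 2 for x in log_inv_k)
    return numerator / denominator  # least-squares slope

def cosine_distance(u, v):
    """Assumed distance: 1 minus the cosine similarity of two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

straight_line = [float(i) for i in range(20)]
print(higuchi_fd(straight_line))              # a straight line has dimension close to 1
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```

A smaller `cosine_distance` means more similar feature vectors, which matches the abstract's statement that a smaller distance indicates more similar species.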
|
3
|
Li W, Yang L, Meng Z, Qiu Y, Wang PSP, Li X. Phylogenetic Analysis: A Novel Method of Protein Sequence Similarity Analysis. INT J PATTERN RECOGN 2022. [DOI: 10.1142/s0218001422580071] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
Protein sequence similarity analysis (PSSA) is a significant task in bioinformatics, which can obtain information about unknown sequences such as protein structures and homology relationships. A protein sequence is a series of amino acids with rich physical and chemical properties, namely the basic structure of a protein. However, sequence similarity analysis and phylogenetic analysis between different species with complex amino acid sequences is a challenging problem. In this paper, nine properties of amino acids were considered and each sequence was converted into numerical values by principal component analysis (PCA); with the Haar wavelet transform and the Higuchi fractal dimension (HFD), a new feature vector was constructed to represent the sequence; the Spearman distance was selected to calculate the distance matrix, and the phylogenetic tree was constructed. Two representative sets of protein sequences (9 ND5 (NADH dehydrogenase 5) and 8 ND6 (NADH dehydrogenase 6)) were selected for similarity analysis and phylogenetic analysis, and the results were compared with MEGA software and other existing methods. Extensive results show that our method outperforms the compared methods and gives results consistent with known facts.
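For readers unfamiliar with the Haar wavelet transform mentioned above, the following Python sketch shows one decomposition level on a numeric series. It is a generic illustration, not the paper's pipeline: the function name `haar_step` and the padding rule for odd-length input are assumptions, and the PCA encoding, HFD, and Spearman distance steps are not shown.

```python
import math

def haar_step(signal):
    """One level of the orthonormal Haar transform: returns (approximation, detail)."""
    values = list(signal)
    if len(values) % 2:
        # Assumption: pad odd-length input by repeating the last sample.
        values.append(values[-1])
    scale = math.sqrt(2.0)
    # Pairwise scaled sums keep the coarse shape; pairwise differences keep local changes.
    approximation = [(values[i] + values[i + 1]) / scale for i in range(0, len(values), 2)]
    detail = [(values[i] - values[i + 1]) / scale for i in range(0, len(values), 2)]
    return approximation, detail

# Hypothetical numeric series standing in for a PCA-encoded protein sequence.
approximation, detail = haar_step([4.0, 4.0, 2.0, 0.0])
print(approximation, detail)
```

Applying `haar_step` repeatedly to the approximation halves the length each time, which is one way such a transform can summarise sequences of different lengths at a coarse scale.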
Affiliation(s)
- Wei Li
  - School of Computer, Electronics and Information, Guangxi University, Nanning, P. R. China
- Lina Yang
  - School of Computer, Electronics and Information, Guangxi University, Nanning, P. R. China
- Zuqiang Meng
  - School of Computer, Electronics and Information, Guangxi University, Nanning, P. R. China
- Yu Qiu
  - School of Computer, Electronics and Information, Guangxi University, Nanning, P. R. China
- Xichun Li
  - Guangxi Normal University for Nationalities, Chongzuo 532200, China
|
4
|
Li C, Dai Q, He PA. A time series representation of protein sequences for similarity comparison. J Theor Biol 2022; 538:111039. [DOI: 10.1016/j.jtbi.2022.111039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/25/2021] [Revised: 01/18/2022] [Accepted: 01/20/2022] [Indexed: 10/19/2022]
|
5
|
Mu Z, Yu T, Liu X, Zheng H, Wei L, Liu J. FEGS: a novel feature extraction model for protein sequences and its applications. BMC Bioinformatics 2021; 22:297. [PMID: 34078264 PMCID: PMC8172329 DOI: 10.1186/s12859-021-04223-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 04/25/2021] [Accepted: 05/28/2021] [Indexed: 11/10/2022]
Abstract
BACKGROUND Feature extraction of protein sequences is widely used in various research areas related to protein analysis, such as protein similarity analysis and prediction of protein functions or interactions. RESULTS In this study, we introduce FEGS (Feature Extraction based on Graphical and Statistical features), a novel feature extraction model of protein sequences, by developing a new technique for graphical representation of protein sequences based on the physicochemical properties of amino acids and effectively employing the statistical features of protein sequences. By fusing the graphical and statistical features, FEGS transforms a protein sequence into a 578-dimensional numerical vector. When FEGS is applied to phylogenetic analysis on five protein sequence data sets, its performance is notably better than that of all the other compared methods. CONCLUSION The carefully designed FEGS method is practically powerful for extracting features of protein sequences. The current version of FEGS is developed to be user-friendly and is expected to play a crucial role in related studies of protein sequence analysis.
Affiliation(s)
- Zengchao Mu
  - School of Mathematics and Statistics, Shandong University, Weihai, 264209, China
- Ting Yu
  - Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, 266237, China
- Xiaoping Liu
  - Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Beijing, China
- Hongyu Zheng
  - Department of Radiation Oncology, Qilu Hospital, Cheeloo College of Medicine, Shandong University, Jinan, 250012, China
- Leyi Wei
  - School of Software, Shandong University, Jinan, China
- Juntao Liu
  - School of Mathematics and Statistics, Shandong University, Weihai, 264209, China
|
6
|
Mu Z, Yu T, Qi E, Liu J, Li G. DCGR: feature extractions from protein sequences based on CGR via remodeling multiple information. BMC Bioinformatics 2019; 20:351. [PMID: 31221087 PMCID: PMC6587251 DOI: 10.1186/s12859-019-2943-x] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 02/01/2019] [Accepted: 06/10/2019] [Indexed: 12/01/2022]
Abstract
BACKGROUND Protein feature extraction plays an important role in the areas of similarity analysis of protein sequences and prediction of protein structures, functions and interactions. The feature extraction based on graphical representation is one of the most effective and efficient ways. However, most existing methods suffer limitations from their method design. RESULTS We introduce DCGR, a novel method for extracting features from protein sequences based on the chaos game representation, which is developed by constructing CGR curves of protein sequences according to physicochemical properties of amino acids, followed by converting the CGR curves into multi-dimensional feature vectors by using the distributions of points in CGR images. Tested on five data sets, DCGR was significantly superior to the state-of-the-art feature extraction methods. CONCLUSION The DCGR is practically powerful for extracting effective features from protein sequences, and therefore important in similarity analysis of protein sequences, study of protein-protein interactions and prediction of protein functions. It is freely available at https://sourceforge.net/projects/transcriptomeassembly/files/Feature%20Extraction .
Affiliation(s)
- Zengchao Mu
  - School of Mathematics, Shandong University, Jinan, 250100 Shandong Province China
- Ting Yu
  - School of Mathematics, Shandong University, Jinan, 250100 Shandong Province China
- Enfeng Qi
  - College of Mathematics and Statistics, Guangxi Normal University, Guilin, 541001 China
- Juntao Liu
  - School of Mathematics, Shandong University, Jinan, 250100 Shandong Province China
- Guojun Li
  - School of Mathematics, Shandong University, Jinan, 250100 Shandong Province China
|
7
|
Li C, Zhao J, Wang C, Yao Y. Protein Sequence Comparison and DNA-binding Protein Identification with Generalized PseAAC and Graphical Representation. Comb Chem High Throughput Screen 2019; 21:100-110. [PMID: 29380690 PMCID: PMC5930480 DOI: 10.2174/1386207321666180130100838] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 11/22/2017] [Revised: 01/24/2018] [Accepted: 01/26/2018] [Indexed: 11/22/2022]
Abstract
AIM AND OBJECTIVE The rapid increase in the amount of protein sequence data available leads to an urgent need for novel computational algorithms to analyze and compare these sequences. This study is undertaken to develop an efficient computational approach for timely encoding protein sequences and extracting the hidden information. METHODS Based on two physicochemical properties of amino acids, a protein primary sequence was converted into a three-letter sequence, and then a graph without loops and multiple edges and its geometric line adjacency matrix were obtained. A generalized PseAAC (pseudo amino acid composition) model was thus constructed to characterize a protein sequence numerically. RESULTS By using the proposed mathematical descriptor of a protein sequence, similarity comparisons among β-globin proteins of 17 species and 72 spike proteins of coronaviruses were made, respectively. The resulting clusters agreed well with the established taxonomic groups. In addition, a generalized PseAAC based SVM (support vector machine) model was developed to identify DNA-binding proteins. Experiment results showed that our method performed better than DNAbinder, DNA-Prot, iDNA-Prot and enDNA-Prot by 3.29-10.44% in terms of ACC, 0.056-0.206 in terms of MCC, and 1.45-15.76% in terms of F1M. When the benchmark dataset was expanded with negative samples, the presented approach outperformed the four previous methods with improvement in the range of 2.49-19.12% in terms of ACC, 0.05-0.32 in terms of MCC, and 3.82- 33.85% in terms of F1M. CONCLUSION These results suggested that the generalized PseAAC model was very efficient for comparison and analysis of protein sequences, and very competitive in identifying DNA-binding proteins.
Affiliation(s)
- Chun Li
  - School of Mathematics and Statistics, Hainan Normal University, Haikou 571158, China
  - Department of Mathematics, Bohai University, Jinzhou 121013, China
  - Research Institute of Food Science, Bohai University, Jinzhou 121013, China
- Jialing Zhao
  - Department of Mathematics, Bohai University, Jinzhou 121013, China
- Changzhong Wang
  - Department of Mathematics, Bohai University, Jinzhou 121013, China
- Yuhua Yao
  - School of Mathematics and Statistics, Hainan Normal University, Haikou 571158, China
|
8
|
Mo Z, Zhu W, Sun Y, Xiang Q, Zheng M, Chen M, Li Z. One novel representation of DNA sequence based on the global and local position information. Sci Rep 2018; 8:7592. [PMID: 29765099 PMCID: PMC5953932 DOI: 10.1038/s41598-018-26005-3] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 01/11/2018] [Accepted: 04/27/2018] [Indexed: 11/28/2022]
Abstract
A novel representation of DNA sequences that combines the global and local position information of the original sequence is proposed to distinguish different species. First, to exploit global information, a graphical representation of the DNA sequence is formulated along the curve of a Fermat spiral. Then, to capture the local characteristics of the DNA sequence, each point on the Fermat spiral is assigned a mass based on the relationships among four neighboring nucleotides. The normalized moments of inertia of the Fermat spiral curve composed of these points with mass are calculated as the numerical description of the corresponding DNA sequence, using the first exons of beta-globin genes. With the Euclidean distance as the measure between numerical descriptions, the resulting similarities between species demonstrate the performance of the proposed method.
Affiliation(s)
- Zhiyi Mo
  - School of Information and Electronic Engineering, Wuzhou University, Wuzhou, China
- Wen Zhu
  - College of Computer Science and Electronic Engineering, Hunan University, Hunan, China
- Yi Sun
  - College of Computer Science and Electronic Engineering, Hunan University, Hunan, China
- Qilin Xiang
  - College of Computer Science and Electronic Engineering, Hunan University, Hunan, China
- Ming Zheng
  - School of Information and Electronic Engineering, Wuzhou University, Wuzhou, China
- Min Chen
  - College of Computer and Information Science, Hunan Institute of Technology, Hengyang, China
- Zejun Li
  - College of Computer and Information Science, Hunan Institute of Technology, Hengyang, China
|
9
|
Abstract
Advances in sequencing technologies have led to a rapid increase in the number and diversity of biological sequences, which has facilitated developments in sequence research. In this paper, we present a new method for analyzing protein sequence similarity. We calculated the spectral radii of 20 amino acids (AAs) and put forward a novel 2-D graphical representation of protein sequences. To characterize protein sequences numerically, three groups of features were extracted, related to the statistical measurements, dynamics measurements, and fluctuation complexity of the sequences. With the obtained feature vector, two models utilizing Gaussian kernel similarity and cosine similarity were built to measure the similarity between sequences. We applied our method to analyze the similarities/dissimilarities of four data sets. Both proposed models gave consistent results, with improvements over those obtained by the ClustalW analysis. The novel approach we present in this study may therefore benefit protein research in medical and scientific fields.
|
10
|
Jalilvand A, Akbari B, Zare Mirakabad F. S-FLN: A sequence-based hierarchical approach for functional linkage network construction. J Theor Biol 2018; 437:149-162. [PMID: 29080781 DOI: 10.1016/j.jtbi.2017.10.021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/31/2016] [Revised: 07/27/2017] [Accepted: 10/18/2017] [Indexed: 11/24/2022]
Abstract
Functional linkage network (FLN) construction is a primary and important step in drug discovery and disease gene prioritization methods. Several methods have been introduced to construct an FLN based on the integration of various biological data. Although there are impressive ideas behind these methods, they suffer from the low quality of the biological data. In this paper, a hierarchical sequence-based approach is proposed to construct an FLN. The proposed approach, denoted S-FLN (Sequence-based Functional Linkage Network), uses protein sequences as the primary data in three main steps. Firstly, the physicochemical properties of amino acids are employed to describe the functionality of proteins. As protein sequences are more comprehensive and accurate primary data, more reliable relations are achieved. Secondly, seven different descriptor methods are used to extract feature vectors from the protein sequences. Using different descriptor methods yields diverse ensemble learners in the next step. Finally, a two-layer ensemble learning structure is proposed to calculate the score of protein pairs. The proposed approach has been evaluated using two biological datasets, S. cerevisiae and H. pylori, and resulted in 93.9% and 91.15% precision rates, respectively. The results of various experiments indicate the efficiency and validity of the proposed approach.
Affiliation(s)
- A Jalilvand
  - Department of Electronic and Computer Engineering, Tarbiat Modares University, Tehran, Iran
- B Akbari
  - Department of Electronic and Computer Engineering, Tarbiat Modares University, Tehran, Iran
- F Zare Mirakabad
  - Department of Mathematics and Computer Science, Amirkabir University of Technology, Tehran, Iran
|
11
|
Computational Techniques for a Comprehensive Understanding of Different Genotype-Phenotype Factors in Biological Systems and Their Applications. Synth Biol (Oxf) 2018. [DOI: 10.1007/978-981-10-8693-9_8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
|
12
|
Bielińska-Wąż D, Wąż P. Spectral-dynamic representation of DNA sequences. J Biomed Inform 2017; 72:1-7. [PMID: 28587890 DOI: 10.1016/j.jbi.2017.06.001] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Received: 02/13/2017] [Revised: 05/03/2017] [Accepted: 06/01/2017] [Indexed: 11/25/2022]
Abstract
A graphical representation of DNA sequences in which the distribution of a particular base B=A,C,G,T is represented by a set of discrete lines has been formulated. The methodology of this approach has been borrowed from two areas of physics: spectroscopy and dynamics. Consequently, the set of discrete lines is referred to as the B-spectrum. Next, the B-spectrum is transformed to a rigid body composed of material points. In this way a dynamic representation of the DNA sequence has been obtained. The centers of mass of these rigid bodies, divided by their moments of inertia, have been taken as the descriptors of the spectra and, thus, of the DNA sequences. The performance of this method on a standard set of data commonly applied by authors introducing new approaches to bioinformatics (the first exons of β-globin genes of different species) proved to be very good.
Affiliation(s)
- Dorota Bielińska-Wąż
  - Department of Radiological Informatics and Statistics, Medical University of Gdańsk, Tuwima 15, 80-210 Gdańsk, Poland
- Piotr Wąż
  - Department of Nuclear Medicine, Medical University of Gdańsk, Tuwima 15, 80-210 Gdańsk, Poland
|
13
|
Yang L, Zhang W. A Multiresolution Graphical Representation for Similarity Relationship and Multiresolution Clustering for Biological Sequences. J Comput Biol 2017; 24:299-310. [DOI: 10.1089/cmb.2016.0030] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 11/13/2022]
Affiliation(s)
- Lianping Yang
  - College of Sciences, Northeastern University, Shenyang, China
- Weilin Zhang
  - Department of Mathematics, New York University Shanghai, Shanghai, China
|
14
|
PING PENGYAO, ZHU XIANYOU, WANG LEI. SIMILARITIES/DISSIMILARITIES ANALYSIS OF PROTEIN SEQUENCES BASED ON PCA-FFT. J BIOL SYST 2017. [DOI: 10.1142/s0218339017500024] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Indexed: 11/18/2022]
Abstract
In this paper, a novel method for analyzing the similarity/dissimilarity of protein sequences based on Principal Component Analysis-Fast Fourier Transformation (PCA-FFT) is proposed, in which PCA is used to transform protein sequences into time series and the FFT is used to analyze the time series as signals. To test the effectiveness of the proposed method, it is applied to analyze the similarity/dissimilarity of 16 different ND5 protein sequences and 29 different spike protein sequences, respectively. Furthermore, a correlation analysis is presented for comparison with other methods, and the simulation results show that the method performs better than some existing methods in terms of computational complexity and recognition degree.
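To make the "time series as signal" idea concrete, the sketch below computes the power spectrum of a numeric series with a naive discrete Fourier transform in plain Python. It is only an illustration: the function name `power_spectrum` and the toy series are assumptions, the PCA encoding step is not shown, and a real analysis would normally call an FFT routine such as `numpy.fft.fft` instead of this O(N^2) loop.

```python
import cmath

def power_spectrum(series):
    """Naive O(N^2) discrete Fourier transform; returns |X[k]|^2 for k = 0..N-1."""
    n = len(series)
    spectrum = []
    for k in range(n):
        # X[k] = sum over t of x[t] * exp(-2*pi*i*k*t/N)
        coefficient = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                          for t, x in enumerate(series))
        spectrum.append(abs(coefficient) ** 2)
    return spectrum

# Hypothetical series: a constant signal and a signal alternating every sample.
print(power_spectrum([1.0, 1.0, 1.0, 1.0]))    # energy concentrated at k = 0
print(power_spectrum([1.0, -1.0, 1.0, -1.0]))  # energy concentrated at k = 2
```

Comparing spectra rather than raw series is what lets a Fourier-based method summarise periodic structure independently of where it occurs in the sequence.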
Affiliation(s)
- PENGYAO PING
  - College of Information Engineering, Xiangtan University, Yuhu District, Xiangtan, Hunan 411105, P. R. China
  - Key Laboratory of Intelligent Computing and Information Processing, P. R. China
- XIANYOU ZHU
  - Department of Computer Science, Hengyang Normal University, 421008, P. R. China
- LEI WANG
  - College of Information Engineering, Xiangtan University, Yuhu District, Xiangtan, Hunan 411105, P. R. China
  - Key Laboratory of Intelligent Computing and Information Processing, P. R. China
|
15
|
Li Y, Song T, Yang J, Zhang Y, Yang J. An Alignment-Free Algorithm in Comparing the Similarity of Protein Sequences Based on Pseudo-Markov Transition Probabilities among Amino Acids. PLoS One 2016; 11:e0167430. [PMID: 27918587 PMCID: PMC5137889 DOI: 10.1371/journal.pone.0167430] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 08/30/2016] [Accepted: 11/14/2016] [Indexed: 11/30/2022]
Abstract
In this paper, we propose a novel alignment-free method for comparing the similarity of protein sequences. We first encode a protein sequence into a 440-dimensional feature vector consisting of a 400-dimensional Pseudo-Markov transition probability vector among the 20 amino acids, a 20-dimensional content ratio vector, and a 20-dimensional position ratio vector of the amino acids in the sequence. By evaluating the Euclidean distances among the representing vectors, we compare the similarity of protein sequences. We then apply this method to the ND5 dataset, consisting of the ND5 protein sequences of 9 species, and to the F10 and G11 datasets, representing two of the xylanase-containing glycoside hydrolase families, i.e., families 10 and 11. As a result, our method achieves a correlation coefficient of 0.962 with the canonical protein sequence aligner ClustalW on the ND5 dataset, much higher than those of five other popular alignment-free methods. In addition, we successfully separate the xylanase sequences in the F10 family and the G11 family and illustrate that the F10 family is more heat stable than the G11 family, consistent with a few previous studies. Moreover, we prove mathematically an identity involving the Pseudo-Markov transition probability vector and the amino acid content ratio vector.
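The first two parts of such a feature vector can be sketched in Python as below. This is a hedged approximation, not the published algorithm: the abstract does not define the Pseudo-Markov probabilities or the position ratio, so the sketch uses plain first-order transition frequencies and content ratios (420 values) and leaves the position ratio out. The function name, the alphabetical amino acid order, and the toy sequence are assumptions.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumption: alphabetical one-letter codes

def transition_and_content_features(sequence):
    """400 first-order transition frequencies followed by 20 content ratios.

    Assumes a non-empty sequence; residues outside AMINO_ACIDS are ignored
    in the features but still count toward the sequence length.
    """
    pair_counts = Counter(zip(sequence, sequence[1:]))  # adjacent residue pairs
    first_counts = Counter(sequence[:-1])               # residues that have a successor
    content = Counter(sequence)
    transitions = [
        pair_counts[(a, b)] / first_counts[a] if first_counts[a] else 0.0
        for a in AMINO_ACIDS
        for b in AMINO_ACIDS
    ]
    ratios = [content[a] / len(sequence) for a in AMINO_ACIDS]
    return transitions + ratios

features = transition_and_content_features("AAC")  # hypothetical toy sequence
print(len(features), features[0], features[1], features[400])
```

Two sequences of any length map to vectors of the same size, so they can be compared directly with a Euclidean distance, which is the comparison step the abstract describes.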
Affiliation(s)
- Yushuang Li
  - School of Science, Yanshan University, Qinhuangdao, China
- Tian Song
  - School of Science, Yanshan University, Qinhuangdao, China
- Jiasheng Yang
  - Department of Civil and Environmental Engineering, National University of Singapore, Singapore
- Yi Zhang
  - Department of Mathematics, Hebei University of Science and Technology, Shijiazhuang, Hebei, China
- Jialiang Yang
  - School of Mathematics and Information Science, Henan Polytechnic University, Henan, China
|
16
|
An estimator for local analysis of genome based on the minimal absent word. J Theor Biol 2016; 395:23-30. [PMID: 26829314 DOI: 10.1016/j.jtbi.2016.01.023] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 10/12/2015] [Revised: 01/17/2016] [Accepted: 01/19/2016] [Indexed: 11/22/2022]
Abstract
This study presents an alternative alignment-free relative feature analysis method based on the minimal absent word, which has potential advantages over the local alignment method in local analysis. Smooth-local-analysis-curve and similarity-distribution are constructed for a fast, efficient, and visual comparison. Moreover, when the multi-sequence-comparison is needed, the local-analysis-curves can illustrate some interesting zones.
|
17
|
20D-dynamic representation of protein sequences. Genomics 2016; 107:16-23. [DOI: 10.1016/j.ygeno.2015.12.003] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Received: 09/10/2015] [Revised: 12/10/2015] [Accepted: 12/14/2015] [Indexed: 11/23/2022]
|
18
|
Sedlar K, Skutkova H, Vitek M, Provaznik I. Set of rules for genomic signal downsampling. Comput Biol Med 2015; 69:308-14. [PMID: 26078051 DOI: 10.1016/j.compbiomed.2015.05.022] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 11/10/2014] [Revised: 05/25/2015] [Accepted: 05/26/2015] [Indexed: 12/14/2022]
Abstract
Comparison and classification of organisms based on molecular data is an important task of computational biology, since at least parts of DNA sequences for many organisms are available. Unfortunately, methods for comparison are computationally very demanding, suitable only for short sequences. In this paper, we focus on the redundancy of genetic information stored in DNA sequences. We proposed rules for downsampling of DNA signals of cumulated phase. According to the length of an original sequence, we are able to significantly reduce the amount of data with only slight loss of original information. Dyadic wavelet transform was chosen for fast downsampling with minimum influence on signal shape carrying the biological information. We proved the usability of such new short signals by measuring percentage deviation of pairs of original and downsampled signals while maintaining spectral power of signals. Minimal loss of biological information was proved by measuring the Robinson-Foulds distance between pairs of phylogenetic trees reconstructed from the original and downsampled signals. The preservation of inter-species and intra-species information makes these signals suitable for fast sequence identification as well as for more detailed phylogeny reconstruction.
Affiliation(s)
- Karel Sedlar
  - Department of Biomedical Engineering, Brno University of Technology, Technicka 12, 616 00 Brno, Czech Republic
- Helena Skutkova
  - Department of Biomedical Engineering, Brno University of Technology, Technicka 12, 616 00 Brno, Czech Republic
- Martin Vitek
  - International Clinical Research Center - Center of Biomedical Engineering, St. Anne's University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
- Ivo Provaznik
  - Department of Biomedical Engineering, Brno University of Technology, Technicka 12, 616 00 Brno, Czech Republic
  - International Clinical Research Center - Center of Biomedical Engineering, St. Anne's University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
|
19
|
Yu JF, Dou XH, Wang HB, Sun X, Zhao HY, Wang JH. A Novel Cylindrical Representation for Characterizing Intrinsic Properties of Protein Sequences. J Chem Inf Model 2015; 55:1261-70. [PMID: 25945398 DOI: 10.1021/ci500577m] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 11/28/2022]
Abstract
The composition and sequence order of amino acid residues are the two most important characteristics for describing a protein sequence. Graphical representations facilitate visualization of biological sequences and produce biologically useful numerical descriptors. In this paper, we propose a novel cylindrical representation by placing the 20 amino acid residue types in a circle and sequence positions along the z axis. This representation allows visualization of the composition and sequence order of amino acids at the same time. Ten numerical descriptors and one weighted numerical descriptor have been developed to quantitatively describe intrinsic properties of protein sequences on the basis of the cylindrical model. Their application to similarity/dissimilarity analysis of nine ND5 proteins indicated that these numerical descriptors are more effective than several classical numerical matrices. Thus, the cylindrical representation obtained here provides a new useful tool for visualizing and characterizing protein sequences. An online server is available at http://biophy.dzu.edu.cn:8080/CNumD/input.jsp .
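The geometric idea of the cylindrical representation can be sketched in a few lines of Python. This is an assumption-laden illustration rather than the published model: the abstract does not give the order of residues around the circle, so alphabetical order is used here, and the function name, unit radius, and 1-based z coordinate are likewise hypothetical. The numerical descriptors derived from the model are not shown.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumption: alphabetical order around the circle

def cylindrical_points(sequence, radius=1.0):
    """Map each residue to (x, y, z): angle from residue type, z from sequence position."""
    step = 2.0 * math.pi / len(AMINO_ACIDS)  # 20 equally spaced angles
    points = []
    for position, residue in enumerate(sequence, start=1):
        # str.index raises ValueError for a residue outside the 20 standard types.
        angle = AMINO_ACIDS.index(residue) * step
        points.append((radius * math.cos(angle),
                       radius * math.sin(angle),
                       float(position)))
    return points

for point in cylindrical_points("AGA"):  # hypothetical toy sequence
    print(point)
```

Residue type determines where a point sits on the circle and position determines its height, so composition and order are both visible in one curve, as the abstract describes.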
Affiliation(s)
- Jia-Feng Yu
- †Shandong Provincial Key Laboratory of Functional Macromolecular Biophysics, Institute of Biophysics, Dezhou University, Dezhou 253023, China; ‡State Key Laboratory of Bioelectronics, Southeast University, Nanjing 210096, China
| | - Xiang-Hua Dou
- †Shandong Provincial Key Laboratory of Functional Macromolecular Biophysics, Institute of Biophysics, Dezhou University, Dezhou 253023, China
| | - Hong-Bo Wang
- †Shandong Provincial Key Laboratory of Functional Macromolecular Biophysics, Institute of Biophysics, Dezhou University, Dezhou 253023, China
| | - Xiao Sun
- ‡State Key Laboratory of Bioelectronics, Southeast University, Nanjing 210096, China
| | - Hui-Ying Zhao
- §Department of Genetics and Computational Biology, QIMR Berghofer Medical Research Institute, Brisbane, Queensland 4000, Australia
| | - Ji-Hua Wang
- †Shandong Provincial Key Laboratory of Functional Macromolecular Biophysics, Institute of Biophysics, Dezhou University, Dezhou 253023, China; ∥College of Physics and Electronic Information, Dezhou University, Dezhou 253023, China
| |
|
20
|
Qi ZH, Jin MZ, Li SL, Feng J. A protein mapping method based on physicochemical properties and dimension reduction. Comput Biol Med 2014; 57:1-7. [PMID: 25486446 DOI: 10.1016/j.compbiomed.2014.11.012] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Received: 07/24/2014] [Revised: 11/15/2014] [Accepted: 11/19/2014] [Indexed: 01/11/2023]
Abstract
BACKGROUND: The graphical mapping of a protein sequence is more difficult than that of a DNA sequence because there are twenty amino acids with complicated physicochemical properties. Nevertheless, graphical mapping of protein sequences has attracted many researchers, who have proposed mapping methods based on several physicochemical properties. In this article, a simple and effective new mapping method for protein sequences is developed by considering additional physicochemical properties. METHODS: Based on 12 major physicochemical properties of amino acids and the PCA method, we propose a simple and intuitive 2D graphical mapping method for protein sequences. Next, we extract a 20D vector from the graphical mapping, which is used to characterize a protein sequence. RESULTS: The proposed graphical mapping has three important properties: it is one-to-one, it contains no circuits, and it visualizes well. The mapping also contains more physicochemical information. The proposed method is applied to two separate applications, and the results illustrate its utility. DISCUSSION: To validate the proposed method, we first compare the protein sequences of nine ND6 proteins. The similarity/dissimilarity matrix for the nine ND6 proteins correctly reveals their evolutionary relationship. Next, we apply the method to cluster analysis of HA genes of influenza A (H1N1) isolates. The results are consistent with the known evolution of the H1N1 virus. These two applications further illustrate the utility of the proposed method.
Affiliation(s)
- Zhao-Hui Qi
- College of Information Science and Technology, Shijiazhuang Tiedao University, Shijiazhuang, Hebei, 050043, People's Republic of China.
| | - Meng-Zhe Jin
- College of Information Science and Technology, Shijiazhuang Tiedao University, Shijiazhuang, Hebei, 050043, People's Republic of China
| | - Su-Li Li
- College of Information Science and Technology, Shijiazhuang Tiedao University, Shijiazhuang, Hebei, 050043, People's Republic of China
| | - Jun Feng
- College of Information Science and Technology, Shijiazhuang Tiedao University, Shijiazhuang, Hebei, 050043, People's Republic of China
| |
|