1
|
Li W, Yang L, Qiu Y, Yuan Y, Li X, Meng Z. FFP: joint Fast Fourier transform and fractal dimension in amino acid property-aware phylogenetic analysis. BMC Bioinformatics 2022; 23:347. [PMID: 35986255 PMCID: PMC9392226 DOI: 10.1186/s12859-022-04889-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/30/2022] [Accepted: 08/11/2022] [Indexed: 11/10/2022] Open
Abstract
Abstract
Background
Amino acid property-aware phylogenetic analysis (APPA) refers to the phylogenetic analysis method based on amino acid property encoding, which is used for understanding and inferring evolutionary relationships between species from the molecular perspective. Fast Fourier transform (FFT) and Higuchi’s fractal dimension (HFD) have excellent performance in describing sequences’ structural and complexity information for APPA. However, with the exponential growth of protein sequence data, it is very important to develop a reliable APPA method for protein sequence analysis.
Results
Consequently, we propose a new method named FFP, it joints FFT and HFD. Firstly, FFP is used to encode protein sequences on the basis of the important physicochemical properties of amino acids, the dissociation constant, which determines acidity and basicity of protein molecules. Secondly, FFT and HFD are used to generate the feature vectors of encoded sequences, whereafter, the distance matrix is calculated from the cosine function, which describes the degree of similarity between species. The smaller the distance between them, the more similar they are. Finally, the phylogenetic tree is constructed. When FFP is tested for phylogenetic analysis on four groups of protein sequences, the results are obviously better than other comparisons, with the highest accuracy up to more than 97%.
Conclusion
FFP has higher accuracy in APPA and multi-sequence alignment. It also can measure the protein sequence similarity effectively. And it is hoped to play a role in APPA’s related research.
Collapse
|
2
|
Rout RK, Umer S, Sheikh S, Sindhwani S, Pati S. EightyDVec: a method for protein sequence similarity analysis using physicochemical properties of amino acids. COMPUTER METHODS IN BIOMECHANICS AND BIOMEDICAL ENGINEERING: IMAGING & VISUALIZATION 2022. [DOI: 10.1080/21681163.2021.1956369] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/20/2022]
Affiliation(s)
- Ranjeet Kumar Rout
- Computer Science & Engineering, National Institute of Technology Srinagar, Hazratbal, India
| | - Saiyed Umer
- Computer Science & Engineering, Aliah University, West Bengal, India
| | - Sabha Sheikh
- Computer Science & Engineering, National Institute of Technology Srinagar, Hazratbal, India
| | - Sanchit Sindhwani
- , DR. B. R. Ambedkar National Institute of Technology, Jalandhar, Punjab, India
| | - Smitarani Pati
- , DR. B. R. Ambedkar National Institute of Technology, Jalandhar, Punjab, India
| |
Collapse
|
3
|
Mapping sequence to feature vector using numerical representation of codons targeted to amino acids for alignment-free sequence analysis. Gene 2020; 766:145096. [PMID: 32919006 DOI: 10.1016/j.gene.2020.145096] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/25/2020] [Revised: 08/16/2020] [Accepted: 08/24/2020] [Indexed: 12/17/2022]
Abstract
The phylogenetic analysis based on sequence similarity targeted to real biological taxa is one of the major challenging tasks. In this paper, we propose a novel alignment-free method, CoFASA (Codon Feature based Amino acid Sequence Analyser), for similarity analysis of nucleotide sequences. At first, we assign numerical weights to the four nucleotides. We then calculate a score of each codon based on the numerical value of the constituent nucleotides, termed as degree of codons. Accordingly, we obtain the degree of each amino acid based on the degree of codons targeted towards a specific amino acid. Utilizing the degree of twenty amino acids and their relative abundance within a given sequence, we generate 20-dimensional features for every coding DNA sequence or protein sequence. We use the features for performing phylogenetic analysis of the set of candidate sequences. We use multiple protein sequences derived from Beta-globin (BG), NADH dehydrogenase subunit 5 (ND5), Transferrins (TFs), Xylanases, low identity (<40%) and high identity (⩾40%) protein sequences (encompassing 533 and 1064 protein families) for experimental assessments. We compare our results with sixteen (16) well-known methods, including both alignment-based and alignment-free methods. Various assessment indices are used, such as the Pearson correlation coefficient, RF (Robinson-Foulds) distance and ROC score for performance analysis. While comparing the performance of CoFASA with alignment-based methods (ClustalW, ClustalΩ, MAFFT, and MUSCLE), it shows very similar results. Further, CoFASA shows better performance in comparison to well-known alignment-free methods, including LZW-Kernal, jD2Stat, FFP, spaced, and AFKS-D2s in predicting taxonomic relationship among candidate taxa. Overall, we observe that the features derived by CoFASA are very much useful in isolating the sequences according to their taxonomic labels. While our method is cost-effective, at the same time, produces consistent and satisfactory outcomes.
Collapse
|
4
|
A Statistical Similarity/Dissimilarity Analysis of Protein Sequences Based on a Novel Group Representative Vector. BIOMED RESEARCH INTERNATIONAL 2019; 2019:8702968. [PMID: 31205946 PMCID: PMC6530227 DOI: 10.1155/2019/8702968] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 01/06/2019] [Revised: 03/13/2019] [Accepted: 04/07/2019] [Indexed: 11/30/2022]
Abstract
Similarity/dissimilarity analysis is a key way of understanding the biology of an organism by knowing the origin of the new genes/sequences. Sequence data are grouped in terms of biological relationships. The number of sequences related to any group is susceptible to be increased every day. All the present alignment-free methods approve the utility of their approaches by producing a similarity/dissimilarity matrix. Although this matrix is clear, it measures the degree of similarity among sequences individually. In our work, a representative of each of three groups of protein sequences is introduced. A similarity/dissimilarity vector is evaluated instead of the ordinary similarity/dissimilarity matrix based on the group representative. The approach is applied on three selected groups of protein sequences: beta globin, NADH dehydrogenase subunit 5 (ND5), and spike protein sequences. A cross-grouping comparison is produced to ensure the singularity of each group. A qualitative comparison between our approach, previous articles, and the phylogenetic tree of these protein sequences proved the utility of our approach.
Collapse
|
5
|
Abo-Elkhier MM, Abd Elwahaab MA, Abo El Maaty MI. Measuring Similarity among Protein Sequences Using a New Descriptor. BIOMED RESEARCH INTERNATIONAL 2019; 2019:2796971. [PMID: 31886192 PMCID: PMC6893242 DOI: 10.1155/2019/2796971] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/06/2019] [Revised: 09/03/2019] [Accepted: 10/28/2019] [Indexed: 12/01/2022]
Abstract
The comparison of protein sequences according to similarity is a fundamental aspect of today's biomedical research. With the developments of sequencing technologies, a large number of protein sequences increase exponentially in the public databases. Famous sequences' comparison methods are alignment based. They generally give excellent results when the sequences under study are closely related and they are time consuming. Herein, a new alignment-free method is introduced. Our technique depends on a new graphical representation and descriptor. The graphical representation of protein sequence is a simple way to visualize protein sequences. The descriptor compresses the primary sequence into a single vector composed of only two values. Our approach gives good results with both short and long sequences within a little computation time. It is applied on nine beta globin, nine ND5 (NADH dehydrogenase subunit 5), and 24 spike protein sequences. Correlation and significance analyses are also introduced to compare our similarity/dissimilarity results with others' approaches, results, and sequence homology.
Collapse
Affiliation(s)
- Mervat M. Abo-Elkhier
- Department of Engineering Mathematics and Physics, Faculty of Engineering, Mansoura University, Mansoura 35516, Egypt
| | - Marwa A. Abd Elwahaab
- Department of Engineering Mathematics and Physics, Faculty of Engineering, Mansoura University, Mansoura 35516, Egypt
| | - Moheb I. Abo El Maaty
- Department of Engineering Mathematics and Physics, Faculty of Engineering, Mansoura University, Mansoura 35516, Egypt
| |
Collapse
|
6
|
Saw AK, Tripathy BC, Nandi S. Alignment-free similarity analysis for protein sequences based on fuzzy integral. Sci Rep 2019; 9:2775. [PMID: 30808983 PMCID: PMC6391537 DOI: 10.1038/s41598-019-39477-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/12/2018] [Accepted: 01/15/2019] [Indexed: 12/12/2022] Open
Abstract
Sequence comparison is an essential part of modern molecular biology research. In this study, we estimated the parameters of Markov chain by considering the frequencies of occurrence of the all possible amino acid pairs from each alignment-free protein sequence. These estimated Markov chain parameters were used to calculate similarity between two protein sequences based on a fuzzy integral algorithm. For validation, our result was compared with both alignment-based (ClustalW) and alignment-free methods on six benchmark datasets. The results indicate that our developed algorithm has a better clustering performance for protein sequence comparison.
Collapse
Affiliation(s)
- Ajay Kumar Saw
- Institute of Advanced Study in Science and Technology, Mathematical Sciences Division, Guwahati, 781035, India
| | | | - Soumyadeep Nandi
- Institute of Advanced Study in Science and Technology, Life Science Division, Guwahati, 781035, India.
| |
Collapse
|
7
|
Pal J, Ghosh S, Maji B, Bhattacharya DK. WITHDRAWN: A Novel Way of Comparing Protein Sequences Represented Under Physio-Chemical Properties of their Amino Acids. Comput Biol Chem 2017. [DOI: 10.1016/j.compbiolchem.2017.04.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/19/2022]
|
8
|
PING PENGYAO, ZHU XIANYOU, WANG LEI. SIMILARITIES/DISSIMILARITIES ANALYSIS OF PROTEIN SEQUENCES BASED ON PCA-FFT. J BIOL SYST 2017. [DOI: 10.1142/s0218339017500024] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
In this paper, a novel method to analyze the similarity/dissimilarity of protein sequences based on Principal Component Analysis-Fast Fourier Transformation (PCA-FFT) is proposed, in which, the PCA is utilized to transform protein sequences into time series and the FFT is utilized to analyze the time series while considering them as signals. To test the effectiveness of our newly proposed method, it is applied to analyze the similarity/dissimilarity of 16 different ND5 protein sequences and 29 different spike protein sequences, respectively. Furthermore, the correlation analysis is presented for comparing with others methods, and the simulation results show that it has better performances in the aspects of computation complexity and recognition degree than some existing methods.
Collapse
Affiliation(s)
- PENGYAO PING
- College of Information Engineering, Xiangtan University, Yuhu District, Xiangtan, Hunan 411105, P. R. China
- Key Laboratory of Intelligent Computing and Information Processing, P. R. China
| | - XIANYOU ZHU
- Department of Computer Science, Hengyang Normal University, 421008, P. R. China
| | - LEI WANG
- College of Information Engineering, Xiangtan University, Yuhu District, Xiangtan, Hunan 411105, P. R. China
- Key Laboratory of Intelligent Computing and Information Processing, P. R. China
| |
Collapse
|
9
|
Hou W, Pan Q, Peng Q, He M. A new method to analyze protein sequence similarity using Dynamic Time Warping. Genomics 2016; 109:123-130. [PMID: 27974244 PMCID: PMC7125777 DOI: 10.1016/j.ygeno.2016.12.002] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/12/2016] [Revised: 12/06/2016] [Accepted: 12/10/2016] [Indexed: 12/05/2022]
Abstract
Sequences similarity analysis is one of the major topics in bioinformatics. It helps researchers to reveal evolution relationships of different species. In this paper, we outline a new method to analyze the similarity of proteins by Discrete Fourier Transform (DFT) and Dynamic Time Warping (DTW). The original symbol sequences are converted to numerical sequences according to their physico-chemical properties. We obtain the power spectra of sequences from DFT and extend the spectra to the same length to calculate the distance between different sequences by DTW. Our method is tested in different datasets and the results are compared with that of other software algorithms. In the comparison we find our scheme could amend some wrong classifications appear in other software. The comparison shows our approach is reasonable and effective. We propose a novel method to extract the features of the sequences based on physicochemical property of proteins. We apply the Discrete Fourier Transform (DFT) and Dynamic Time Warping (DTW) to analyze the similarity of proteins. Different datasets are used to prove our model's effectiveness.
Collapse
Affiliation(s)
- Wenbing Hou
- School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, PR China
| | - Qiuhui Pan
- School of Innovation and Entrepreneurship, Dalian University of Technology, Dalian 116024, PR China; School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, PR China
| | - Qianying Peng
- Department of Academics, Dalian Naval Academy, Dalian 116001, PR China
| | - Mingfeng He
- School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, PR China.
| |
Collapse
|
10
|
Li Y, Song T, Yang J, Zhang Y, Yang J. An Alignment-Free Algorithm in Comparing the Similarity of Protein Sequences Based on Pseudo-Markov Transition Probabilities among Amino Acids. PLoS One 2016; 11:e0167430. [PMID: 27918587 PMCID: PMC5137889 DOI: 10.1371/journal.pone.0167430] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/30/2016] [Accepted: 11/14/2016] [Indexed: 11/30/2022] Open
Abstract
In this paper, we have proposed a novel alignment-free method for comparing the similarity of protein sequences. We first encode a protein sequence into a 440 dimensional feature vector consisting of a 400 dimensional Pseudo-Markov transition probability vector among the 20 amino acids, a 20 dimensional content ratio vector, and a 20 dimensional position ratio vector of the amino acids in the sequence. By evaluating the Euclidean distances among the representing vectors, we compare the similarity of protein sequences. We then apply this method into the ND5 dataset consisting of the ND5 protein sequences of 9 species, and the F10 and G11 datasets representing two of the xylanases containing glycoside hydrolase families, i.e., families 10 and 11. As a result, our method achieves a correlation coefficient of 0.962 with the canonical protein sequence aligner ClustalW in the ND5 dataset, much higher than those of other 5 popular alignment-free methods. In addition, we successfully separate the xylanases sequences in the F10 family and the G11 family and illustrate that the F10 family is more heat stable than the G11 family, consistent with a few previous studies. Moreover, we prove mathematically an identity equation involving the Pseudo-Markov transition probability vector and the amino acids content ratio vector.
Collapse
Affiliation(s)
- Yushuang Li
- School of Science, Yanshan University, Qinhuangdao, China
| | - Tian Song
- School of Science, Yanshan University, Qinhuangdao, China
| | - Jiasheng Yang
- Department of Civil and Environmental Engineering, National Universality of Singapore, Singapore
| | - Yi Zhang
- Department of Mathematics, Hebei University of Science and Technology, Shijiazhuang, Hebei, China
| | - Jialiang Yang
- School of Mathematics and Information Science, Henan Polytechnic University, Henan, China
| |
Collapse
|
11
|
El-Lakkani A, Lashin M. An efficient method for measuring the similarity of protein sequences. SAR AND QSAR IN ENVIRONMENTAL RESEARCH 2016; 27:363-370. [PMID: 27103219 DOI: 10.1080/1062936x.2016.1174735] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/28/2016] [Accepted: 04/01/2016] [Indexed: 06/05/2023]
Abstract
An accurate numerical descriptor for protein sequence is introduced. It is basically a set of each three successive amino acids in the sequence (triplet), starting from left to right, in addition to the distances between each two successive amino acids in the triplet such that the summation of these distances does not exceed 8. This numerical descriptor combines two features the amino acid composition and the position of each amino acid relative to the other nearby amino acids. This numerical descriptor is used to measure the similarity between protein sequences in three sets: NADH dehydrogenase subunit 5 (ND5) proteins of different species, 24 transferrin proteins from vertebrates and 12 proteins of baculoviruses. High correlation coefficient values between our results and the results of ClustalW program are obtained. These values are higher than the values obtained in many other related works.
Collapse
Affiliation(s)
- A El-Lakkani
- a Faculty of Science, Department of Biophysics , Cairo University , Giza , Egypt
| | - M Lashin
- a Faculty of Science, Department of Biophysics , Cairo University , Giza , Egypt
| |
Collapse
|
12
|
Randić M, Pisanski T. Protein alignment: Exact versus approximate. An illustration. J Comput Chem 2015; 36:1069-74. [DOI: 10.1002/jcc.23892] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/06/2015] [Revised: 02/10/2015] [Accepted: 02/16/2015] [Indexed: 11/11/2022]
Affiliation(s)
- Milan Randić
- National Institute of Chemistry; Laboratory of Chemometrics; Ljubljana Slovenia
| | - Tomaž Pisanski
- Insitute of Mathematics, Physics, and Mechanics, and Mathematics Dept.; University of Ljubljana; SI-1000 Ljubljana Slovenia
- Department of Mathematics; Primorska University; 6000 Koper Slovenia
| |
Collapse
|
13
|
ADLD: a novel graphical representation of protein sequences and its application. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2014; 2014:959753. [PMID: 25530796 PMCID: PMC4230005 DOI: 10.1155/2014/959753] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/14/2014] [Accepted: 09/25/2014] [Indexed: 11/30/2022]
Abstract
To facilitate the intuitional analysis of protein sequences, a novel graphical representation of protein sequences called ADLD (Alignment Diagonal Line Diagram) is introduced in this paper first, and then a new ADLD based method is proposed and utilized to analyze the similarity/dissimilarity of protein sequences. Comparing with existing methods, our ADLD based method is proved to be effective in the similarity/dissimilarity analysis of protein sequences and have the merits of good intuition, visuality, and simplicity. The examinations of the similarities/dissimilarities for both the 16 different ND5 proteins and the 29 different spike proteins illustrate the utility of our ADLD based approach.
Collapse
|