1
Mallikarjuna T, Thummadi NB, Vindal V, Manimaran P. Prioritizing cervical cancer candidate genes using chaos game and fractal-based time series approach. Theory Biosci 2024; 143:183-193. [PMID: 38807013 DOI: 10.1007/s12064-024-00418-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/17/2023] [Accepted: 05/14/2024] [Indexed: 05/30/2024]
Abstract
Cervical cancer is one of the most severe threats to women worldwide and ranks fourth in lethality. An estimated 604,127 cervical cancer cases were reported globally in 2020. With advancements in high-throughput technologies and bioinformatics, several cervical cancer candidate genes have been proposed for better therapeutic strategies. In this paper, we prioritize the candidate genes involved in cervical cancer progression through a fractal time series-based cross-correlation approach. We apply chaos game representation theory combined with a two-dimensional multifractal detrended cross-correlation approach to the known and candidate genes involved in cervical cancer progression. We obtained 16 candidate genes that showed cross-correlation with known cancer genes. Functional enrichment analysis of the candidate genes shows that they are associated with the GO biological process terms cell-cell junction assembly, cell-cell junction organization, regulation of cell shape, cortical actin cytoskeleton organization, and actomyosin structure organization. KEGG pathway analysis revealed the genes' roles in the Rap1, ErbB, MAPK, PI3K-Akt, and mTOR signaling pathways, and in acute myeloid leukemia, chronic myeloid leukemia, breast cancer, thyroid cancer, bladder cancer, and gastric cancer. Further, we performed survival analysis and prioritized six genes (CDH2, PAIP1, BRAF, EPB41L3, OSMR, and RUNX1) as potential candidate genes for cervical cancer that have a crucial role in tumor progression. We found this integrative approach to be an efficient tool that offers a new way to prioritize candidate genes, which could then be evaluated experimentally for validation. We suggest it may also be useful in analyzing nucleotide and protein sequences for clustering, classification, class affiliation, etc.
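The chaos game mapping that this abstract builds on can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the corner layout (A, C, G, T on the unit square) is an assumed common convention, and the function names `cgr_points` and `cgr_series` are hypothetical. The multifractal detrended cross-correlation step is not shown.

```python
# Illustrative chaos game representation (CGR) of a DNA sequence.
# The corner layout below is an assumed convention, not taken from the paper.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the list of CGR points for seq, starting from the centre of the square."""
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0  # move halfway towards the corner
        points.append((x, y))
    return points


def cgr_series(seq):
    """Split the CGR points into an x series and a y series for time-series analysis."""
    pts = cgr_points(seq)
    return [p[0] for p in pts], [p[1] for p in pts]
```

Each base moves the current point halfway towards its corner, so the position after several steps encodes the recent sequence context. The two coordinate series are one plausible input for a fractal time-series method.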
Affiliation(s)
- T Mallikarjuna
  - Department of Biotechnology and Bioinformatics, School of Life Sciences, University of Hyderabad, Gachibowli, Hyderabad, 500046, India
- N B Thummadi
  - Department of Animal Biology, School of Life Sciences, University of Hyderabad, Gachibowli, Hyderabad, 500046, India
- Vaibhav Vindal
  - Department of Biotechnology and Bioinformatics, School of Life Sciences, University of Hyderabad, Gachibowli, Hyderabad, 500046, India
- P Manimaran
  - School of Physics, University of Hyderabad, Gachibowli, Hyderabad, Telangana, 500046, India
2
Ghosh S, Pal J, Cattani C, Maji B, Bhattacharya DK. Protein sequence comparison based on representation on a finite dimensional unit hypercube. J Biomol Struct Dyn 2024; 42:6425-6439. [PMID: 37837426 DOI: 10.1080/07391102.2023.2268719] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 05/16/2023] [Accepted: 07/01/2023] [Indexed: 10/16/2023]
Abstract
Numerous techniques are used to compare protein sequences based on the values of the physicochemical properties of amino acids. In this work, a non-binary representation of protein sequences based on a single physical/chemical property value is obtained on a 20 × 20-dimensional unit hypercube. The represented vector, expressed in matrix form, is taken as the descriptor. The generalized NTV metric, an extension of the NTV metric used for polynucleotide space, is taken as the distance measure. Based on this distance measure, a distance matrix is obtained for protein sequence comparison. Using this distance matrix, phylogenetic trees are drawn with Molecular Evolutionary Genetics Analysis 11 (MEGA11) software using the neighbor-joining method. The data sets used in this work are 9-ND4, 9-ND5, 9-ND6, 24 TF-LF proteins, 27 different viruses and 127 proteins from the protein kinase C (PKC) family. Two sets of phylogenetic trees are obtained, one based on the property value of polarity and the other on the property value of molecular weight. They are found to be exactly the same. Similar results also hold for representations based on other single property values. The present trees are individually tested for efficiency based on the criteria of rationalized perception and computational time. The results of the present method are compared with those obtained earlier by other methods on the same protein sequences using the assessment criteria of symmetric distance (SD), correlation coefficient, and rationalized perception. In all cases, the present results are found to be better than those of the other methods under comparison. Communicated by Ramaswamy H. Sarma.
Affiliation(s)
- Soumen Ghosh
  - Electronics & Communication Engineering, National Institute of Technology, Durgapur, West Bengal, India
  - Information Technology, Narula Institute of Technology, Kolkata, West Bengal, India
- Jayanta Pal
  - Computer Science & Engineering, Narula Institute of Technology, Kolkata, West Bengal, India
- Carlo Cattani
  - DEIM, University of Tuscia, Largo dell'Universita, Viterbo, Italy
- Bansibadan Maji
  - Electronics & Communication Engineering, National Institute of Technology, Durgapur, West Bengal, India
3
|
Fahmy AM, Hammad MS, Mabrouk MS, Al-Atabany WI. On leveraging self-supervised learning for accurate HCV genotyping. Sci Rep 2024; 14:15463. [PMID: 38965254 PMCID: PMC11224313 DOI: 10.1038/s41598-024-64209-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/11/2024] [Accepted: 06/06/2024] [Indexed: 07/06/2024] Open
Abstract
Hepatitis C virus (HCV) is a major global health concern, affecting millions of individuals worldwide. While existing literature predominantly focuses on disease classification using clinical data, there is a critical research gap concerning HCV genotyping based on genomic sequences. Accurate HCV genotyping is essential for patient management and treatment decisions. While neural models excel at capturing complex patterns, they still face challenges, such as data scarcity, that are common in computational genomics. To overcome these challenges, this paper introduces an advanced deep learning approach for HCV genotyping based on the graphical representation of nucleotide sequences that outperforms classical approaches. Notably, it is effective for both partial and complete HCV genomes and addresses challenges associated with imbalanced datasets. In this work, ten HCV genotypes (1a, 1b, 2a, 2b, 2c, 3a, 3b, 4, 5, and 6) were used in the analysis. This study uses Chaos Game Representation for 2D mapping of genomic sequences and self-supervised learning with a convolutional autoencoder for deep feature extraction, resulting in outstanding performance for HCV genotyping compared to various machine learning and deep learning models. This baseline provides a benchmark against which the performance of the proposed approach and other models can be evaluated. The experimental results show a classification accuracy of over 99%, outperforming traditional deep learning models. This performance demonstrates the capability of the proposed model to accurately identify HCV genotypes in both partial and complete sequences and to deal with data scarcity for certain genotypes. The results of the proposed model are compared to the NCBI genotyping tool.
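A 2D mapping of this kind is often turned into a fixed-size image before it is fed to a convolutional model. The sketch below is one assumed way to do that, not the paper's implementation: it bins chaos game points into a 2^k × 2^k grid and scales the counts to [0, 1]. The corner layout and the name `fcgr_matrix` are illustrative assumptions, and the autoencoder itself is not shown.

```python
def fcgr_matrix(seq, k=3):
    """Count CGR points into a 2**k x 2**k grid and scale the counts to [0, 1]."""
    # Assumed corner layout; other conventions exist.
    corners = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    x, y = 0.5, 0.5
    valid = 0  # number of consecutive unambiguous bases seen so far
    for base in seq.upper():
        if base not in corners:
            x, y, valid = 0.5, 0.5, 0  # restart after an ambiguous symbol
            continue
        cx, cy = corners[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        valid += 1
        if valid >= k:  # the grid cell is fixed by the last k bases
            col = min(int(x * size), size - 1)
            row = min(int(y * size), size - 1)
            grid[row][col] += 1
    peak = max(max(r) for r in grid)
    if peak == 0:
        return [[0.0] * size for _ in range(size)]
    return [[c / peak for c in r] for r in grid]
```

Because the grid size depends only on `k`, partial and complete genomes map to images of the same shape, which is what makes this representation convenient for image-based models.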
Affiliation(s)
- Ahmed M Fahmy
  - Computer Science program, School of Information Technology and Computer Science (ITCS), Nile University, Sheikh Zayed City, Egypt
- Muhammed S Hammad
  - Biomedical Engineering Department, Faculty of Engineering, Helwan University, Cairo, Egypt
- Mai S Mabrouk
  - Biomedical informatics program, School of Information Technology and Computer Science (ITCS), Nile University, Sheikh Zayed City, Egypt
- Walid I Al-Atabany
  - Biomedical informatics program, School of Information Technology and Computer Science (ITCS), Nile University, Sheikh Zayed City, Egypt
4
Wang T, Yu ZG, Li J. CGRWDL: alignment-free phylogeny reconstruction method for viruses based on chaos game representation weighted by dynamical language model. Front Microbiol 2024; 15:1339156. [PMID: 38572227 PMCID: PMC10987876 DOI: 10.3389/fmicb.2024.1339156] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2023] [Accepted: 02/23/2024] [Indexed: 04/05/2024] Open
Abstract
Traditional alignment-based methods face serious challenges in genome sequence comparison and phylogeny reconstruction due to their high computational complexity. Here, we propose a new alignment-free method to analyze the phylogenetic relationships (classification) among species. In our method, the dynamical language (DL) model and the chaos game representation (CGR) method are used to characterize the frequency information and the context information of k-mers in a sequence, respectively. Then, for each DNA or protein sequence in a dataset, our method converts the sequence into a feature vector that represents the sequence information based on CGR weighted by the DL model, and uses these vectors to infer phylogenetic relationships. We name our method CGRWDL. Its performance was tested on both DNA and protein sequences from 8 virus datasets by constructing phylogenetic trees. For each dataset, we compared the Robinson-Foulds (RF) distance between the reference tree and the tree constructed by CGRWDL with that obtained by other advanced methods. The results show that the phylogenetic trees constructed by CGRWDL can accurately classify the viruses, and that the RF scores between these trees and the reference trees are smaller than those obtained with other methods.
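The Robinson-Foulds comparison used for evaluation can be illustrated with a small sketch. It assumes rooted trees written as nested tuples of leaf names and counts clades present in one tree but not the other; the standard unrooted definition uses bipartitions instead, and the paper does not state which tool computed its scores. The names `clades` and `rf_distance` are hypothetical.

```python
def clades(tree):
    """Return (set of non-trivial clades, set of all leaves) for a nested-tuple tree."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    all_leaves = leaves(tree)
    found.discard(all_leaves)  # the root clade is shared by every tree on these leaves
    return found, all_leaves


def rf_distance(tree_a, tree_b):
    """Rooted Robinson-Foulds distance: size of the symmetric difference of the clade sets."""
    clades_a, leaves_a = clades(tree_a)
    clades_b, leaves_b = clades(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must share the same leaf set")
    return len(clades_a ^ clades_b)
```

A distance of zero means the two trees contain exactly the same groupings; larger values mean more groupings disagree with the reference.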
Affiliation(s)
- Ting Wang
  - National Center for Applied Mathematics in Hunan, Xiangtan University, Xiangtan, Hunan, China
  - Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan, China
- Zu-Guo Yu
  - National Center for Applied Mathematics in Hunan, Xiangtan University, Xiangtan, Hunan, China
  - Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan, China
- Jinyan Li
  - School of Computer Science and Control Engineering, Shenzhen Institute of Advanced Technology, Shenzhen, Guangdong, China
5
Pal J, Ghosh S, Maji B, Bhattacharya DK. Use of 2D FFT and DTW in Protein Sequence Comparison. Protein J 2024; 43:1-11. [PMID: 37848727 DOI: 10.1007/s10930-023-10160-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 09/20/2023] [Indexed: 10/19/2023]
Abstract
Protein sequence comparison remains challenging for researchers because of the computational complexity that comes from having 20 amino acids, compared with only four nucleotides in genome sequences. Further, protein sequences of different species have different lengths, which poses additional challenges for researchers developing methods, especially alignment-free methods, to compare protein sequences. In this work, an efficient technique to compare protein sequences is developed using a graphical representation. First, a classified grouping of the 20 amino acids with a cardinality of 4, based on polar class, is used to narrow the representational range from 20 to 4. Then a unit vector technique based on a two-quadrant Cartesian system is proposed to provide a new two-dimensional graphical representation of the protein sequence. Two approaches are then proposed to cope with the varying lengths of protein sequences from various species: one uses Dynamic Time Warping (DTW), while the other uses a two-dimensional Fast Fourier Transform (2D FFT). Next, the effectiveness of these two techniques is analyzed using two evaluation criteria: quantitative measures based on symmetric distance (SD), and computational speed. An analysis is performed on five data sets of 9 ND4, 9 ND5, 9 ND6, 12 Baculovirus, and 24 TF proteins under the two methods. It is found that the FFT-based method produces the same results as DTW but in less computational time. It is also found that the result of the proposed method agrees with the known biological reference. Further, the present method produces better clustering than the existing ones.
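Dynamic Time Warping handles unequal lengths by letting one element of a series align with several elements of the other. A minimal sketch of the classic dynamic-programming form is given below, assuming one-dimensional numeric series and an absolute-difference local cost; the paper works on a two-dimensional graphical representation, so this illustrates the technique rather than the authors' exact procedure.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences of any lengths."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the cheapest alignment of the first i items of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]
```

The table has (n + 1) × (m + 1) entries, so the cost grows with the product of the two lengths. That quadratic growth is consistent with the abstract's observation that the FFT-based route takes less computational time.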
Affiliation(s)
- Jayanta Pal
  - Department of ECE, National Institute of Technology, Durgapur, India
  - Department of CSE, Narula Institute of Technology, Kolkata, India
- Soumen Ghosh
  - Department of ECE, National Institute of Technology, Durgapur, India
- Bansibadan Maji
  - Department of ECE, National Institute of Technology, Durgapur, India
6
Navish AA, Uthayakumar R. A comparative study on structural proteins of viruses that belong to the identical family. THE EUROPEAN PHYSICAL JOURNAL. SPECIAL TOPICS 2023; 232:1-10. [PMID: 36846473 PMCID: PMC9936937 DOI: 10.1140/epjs/s11734-023-00791-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/08/2022] [Accepted: 01/25/2023] [Indexed: 06/18/2023]
Abstract
Recent studies have focused on the similarity between SARS Cov-2 and various viruses from the Coronaviridae family (such as MERS Cov, SARS Cov and Bat Cov RaTG13) to uncover the mystery of SARS Cov-2. Specifically, some studies identified that SARS Cov-2 is more closely related to Bat Cov RaTG13 (a SARS-related coronavirus found in bats) than to the other viruses in that family. These studies mainly use biological techniques to show the similarity between SARS Cov-2 and other viruses. Examining proteins is not easy for researchers who are not biologists. To address this, the protein has to be converted to a known format that is easy to understand. Consequently, this study uses viral structural proteins to analyse the relationship between SARS Cov-2 and the other coronaviruses with the help of mathematical and statistical parameters, and explores various graph representations of the MERS Cov, SARS Cov, Bat Cov RaTG13 and SARS Cov-2 structural proteins, such as the zig-zag curve, Protein Contact Map (PCM) and Chaos Game Representation (CGR). Though these graph representations are visually similar, slight variations between the graphs reflect structural and functional differences. Thus, we use the fractal dimension to observe these minor changes. Depending on the nature of the graph, we employ different types of fractal dimension, namely mass dimension and box dimension. Furthermore, we perform similarity tests with normalized cross-correlation and cosine similarity to assess the comparability of the PCM and CGR graphs. The acquired CCn values are close to the sequence identity between SARS Cov-2 and MERS Cov, SARS Cov, and Bat Cov RaTG13.
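The box dimension mentioned here is usually estimated by counting occupied grid cells at several resolutions and fitting a line on a log-log plot. The sketch below assumes points already scaled to the unit square and an arbitrary illustrative choice of grid resolutions. It is a generic estimator, not the procedure the authors report.

```python
import math


def box_counting_dimension(points, scales=(2, 4, 8, 16, 32)):
    """Estimate the box dimension of a non-empty list of (x, y) points in the unit square.

    For each n in scales the square is cut into n x n boxes; the slope of
    log(occupied boxes) against log(n) is the estimate.
    """
    xs, ys = [], []
    for n in scales:
        boxes = {(min(int(x * n), n - 1), min(int(y * n), n - 1)) for x, y in points}
        xs.append(math.log(n))
        ys.append(math.log(len(boxes)))
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # ordinary least-squares slope
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den
```

Points along a straight line give a slope near 1 and points filling the square give a slope near 2. Values in between signal the kind of irregular structure that makes the fractal dimension sensitive to small differences between graphs.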
Affiliation(s)
- A. A. Navish
  - Department of Mathematics, The Gandhigram Rural Institute-Deemed to be University, Gandhigram, Dindigul, 624 302 Tamil Nadu India
- R. Uthayakumar
  - Department of Mathematics, The Gandhigram Rural Institute-Deemed to be University, Gandhigram, Dindigul, 624 302 Tamil Nadu India
7
Li W, Yang L, Qiu Y, Yuan Y, Li X, Meng Z. FFP: joint Fast Fourier transform and fractal dimension in amino acid property-aware phylogenetic analysis. BMC Bioinformatics 2022; 23:347. [PMID: 35986255 PMCID: PMC9392226 DOI: 10.1186/s12859-022-04889-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/30/2022] [Accepted: 08/11/2022] [Indexed: 11/10/2022] Open
Abstract
Background
Amino acid property-aware phylogenetic analysis (APPA) refers to the phylogenetic analysis method based on amino acid property encoding, which is used for understanding and inferring evolutionary relationships between species from the molecular perspective. Fast Fourier transform (FFT) and Higuchi’s fractal dimension (HFD) have excellent performance in describing sequences’ structural and complexity information for APPA. However, with the exponential growth of protein sequence data, it is very important to develop a reliable APPA method for protein sequence analysis.
Results
Consequently, we propose a new method named FFP, which combines FFT and HFD. Firstly, FFP encodes protein sequences on the basis of an important physicochemical property of amino acids, the dissociation constant, which determines the acidity and basicity of protein molecules. Secondly, FFT and HFD are used to generate the feature vectors of the encoded sequences, after which the distance matrix is calculated from the cosine function, which describes the degree of similarity between species. The smaller the distance between two species, the more similar they are. Finally, the phylogenetic tree is constructed. When FFP is tested for phylogenetic analysis on four groups of protein sequences, the results are clearly better than those of the compared methods, with the highest accuracy exceeding 97%.
Conclusion
FFP has higher accuracy in APPA and multi-sequence alignment. It can also measure protein sequence similarity effectively, and it is expected to be useful in APPA-related research.
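The distance-matrix step described under Results can be sketched as below. The abstract only says the matrix is calculated from the cosine function, so the specific form used here (one minus cosine similarity) is an assumption, as are the function names. The FFT and HFD feature extraction is not shown.

```python
import math


def cosine_distance(u, v):
    """One minus the cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(vectors):
    """Symmetric pairwise cosine-distance matrix with a zero diagonal."""
    n = len(vectors)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = cosine_distance(vectors[i], vectors[j])
    return dist
```

A matrix of this shape is the usual input to distance-based tree builders such as neighbor joining. Cosine distance ignores vector magnitude, so two sequences whose features differ only by a constant factor are treated as identical.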
8
Liang Y, Yang S, Zheng L, Wang H, Zhou J, Huang S, Yang L, Zuo Y. Research progress of reduced amino acid alphabets in protein analysis and prediction. Comput Struct Biotechnol J 2022; 20:3503-3510. [PMID: 35860409 PMCID: PMC9284397 DOI: 10.1016/j.csbj.2022.07.001] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 03/14/2022] [Revised: 06/30/2022] [Accepted: 07/01/2022] [Indexed: 11/29/2022] Open
Abstract
A comprehensive summary of the literature on the reduced amino acid alphabets. A systematic review of the development history of reduced amino acid alphabets. Rich application cases of amino acid reduction alphabets are described in the article. A detailed analysis of the properties and uses of the reduced amino acid alphabets.
Proteins are the executors of cellular physiological activities, and accurate elucidation of their structure and function is crucial for the refined mapping of proteins. As a feature engineering method, the reduction of amino acid composition is not only an important method for protein structure and function analysis, but also opens a broad horizon for the complex field of machine learning. Representing sequences with fewer amino acid types greatly reduces the dimensional complexity and noise of traditional feature engineering, and provides more interpretable predictive models that allow machine learning to capture key features. In this paper, we systematically review the strategies and methods of reduced amino acid (RAA) alphabets, and summarize the main research on them in protein sequence alignment, functional classification, and prediction of structural properties. Finally, we give a comprehensive analysis of 672 RAA alphabets from 74 reduction methods.
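The idea of representing sequences with fewer amino acid types can be sketched as follows. The four-group scheme below is a textbook-style side-chain classification chosen only for illustration. It is an assumption, not an alphabet taken from the review, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product

# Illustrative four-group reduction by side-chain class (assumed, not from the review).
GROUPS = {
    "n": "AVLIPFWMG",  # nonpolar
    "p": "STCYNQ",     # polar, uncharged
    "a": "DE",         # acidic
    "b": "KRH",        # basic
}
REDUCE = {aa: label for label, members in GROUPS.items() for aa in members}


def reduce_sequence(seq):
    """Rewrite a protein sequence in the reduced alphabet, dropping unknown residues."""
    return "".join(REDUCE[aa] for aa in seq.upper() if aa in REDUCE)


def npeptide_composition(seq, n=3):
    """Frequency vector of all length-n words over the reduced alphabet."""
    reduced = reduce_sequence(seq)
    words = ["".join(w) for w in product(sorted(GROUPS), repeat=n)]
    counts = Counter(reduced[i:i + n] for i in range(len(reduced) - n + 1))
    total = max(len(reduced) - n + 1, 0)
    return [counts[w] / total if total else 0.0 for w in words]
```

With four groups, a tripeptide composition has 4^3 = 64 entries instead of the 20^3 = 8000 needed for the full alphabet. This is the dimensionality reduction the abstract refers to.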
Affiliation(s)
- Yuchao Liang
  - State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Siqi Yang
  - State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Lei Zheng
  - State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Hao Wang
  - State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Jian Zhou
  - State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Shenghui Huang
  - State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Lei Yang
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
  - Corresponding author
- Yongchun Zuo
  - State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
  - Corresponding author
9
Ren Y, Chakraborty T, Doijad S, Falgenhauer L, Falgenhauer J, Goesmann A, Hauschild AC, Schwengers O, Heider D. Prediction of antimicrobial resistance based on whole-genome sequencing and machine learning. Bioinformatics 2021; 38:325-334. [PMID: 34613360 PMCID: PMC8722762 DOI: 10.1093/bioinformatics/btab681] [Citation(s) in RCA: 44] [Impact Index Per Article: 14.7] [Received: 06/29/2021] [Revised: 08/27/2021] [Accepted: 09/24/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Antimicrobial resistance (AMR) is one of the biggest global problems threatening human and animal health. Rapid and accurate AMR diagnostic methods are thus urgently needed. However, traditional antimicrobial susceptibility testing (AST) is time-consuming, low throughput and viable only for cultivable bacteria. Machine learning methods may pave the way for automated AMR prediction based on genomic data of the bacteria. However, a comparison of different machine learning methods for AMR prediction based on different encodings and whole-genome sequencing data, without prior knowledge, remains to be done. RESULTS In this study, we evaluated logistic regression (LR), support vector machine (SVM), random forest (RF) and convolutional neural network (CNN) for the prediction of AMR for the antibiotics ciprofloxacin, cefotaxime, ceftazidime and gentamicin. We demonstrate that these models can effectively predict AMR with label encoding, one-hot encoding and frequency matrix chaos game representation (FCGR encoding) on whole-genome sequencing data. We trained these models on a large AMR dataset and evaluated them on an independent public dataset. Generally, RFs and CNNs perform better than LR and SVM, with AUCs up to 0.96. Furthermore, we were able to identify mutations that are associated with AMR for each antibiotic. AVAILABILITY AND IMPLEMENTATION Source code for data preparation and model training is provided on GitHub (https://github.com/YunxiaoRen/ML-iAMR). SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
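Two of the encodings named here, label encoding and one-hot encoding, can be illustrated with a short sketch. It assumes a plain per-base encoding of a nucleotide string with hypothetical function names. The paper's own preprocessing of whole-genome variant data may differ, and the linked repository is the authoritative source for it.

```python
ALPHABET = "ACGT"


def label_encode(seq):
    """Map each base to an integer index; unknown symbols become -1."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    return [index.get(base, -1) for base in seq.upper()]


def one_hot_encode(seq):
    """Map each base to a 4-element indicator vector; unknown symbols become all zeros."""
    rows = []
    for code in label_encode(seq):
        row = [0] * len(ALPHABET)
        if code >= 0:
            row[code] = 1
        rows.append(row)
    return rows
```

Label encoding is compact but imposes an artificial ordering on the bases. One-hot encoding avoids that ordering, but the feature width is four times larger.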
Affiliation(s)
- Yunxiao Ren
  - Department of Data Science in Biomedicine, Faculty of Mathematics and Computer Science, Philipps-University of Marburg, Marburg 35032, Germany
- Trinad Chakraborty
  - Institute of Medical Microbiology, Justus Liebig University Giessen, Giessen 35392, Germany
  - German Center for Infection Research, Partner site Giessen-Marburg-Langen, Giessen 35392, Germany
- Swapnil Doijad
  - Institute of Medical Microbiology, Justus Liebig University Giessen, Giessen 35392, Germany
  - German Center for Infection Research, Partner site Giessen-Marburg-Langen, Giessen 35392, Germany
- Linda Falgenhauer
  - German Center for Infection Research, Partner site Giessen-Marburg-Langen, Giessen 35392, Germany
  - Institute of Hygiene and Environmental Medicine, Justus Liebig University Giessen, Giessen 35392, Germany
  - Hessisches universitäres Kompetenzzentrum Krankenhaushygiene, Giessen 35392, Germany
- Jane Falgenhauer
  - Institute of Medical Microbiology, Justus Liebig University Giessen, Giessen 35392, Germany
  - German Center for Infection Research, Partner site Giessen-Marburg-Langen, Giessen 35392, Germany
- Alexander Goesmann
  - German Center for Infection Research, Partner site Giessen-Marburg-Langen, Giessen 35392, Germany
  - Department of Bioinformatics and Systems Biology, Justus Liebig University Giessen, Giessen 35392, Germany
- Anne-Christin Hauschild
  - Department of Data Science in Biomedicine, Faculty of Mathematics and Computer Science, Philipps-University of Marburg, Marburg 35032, Germany
- Oliver Schwengers
  - German Center for Infection Research, Partner site Giessen-Marburg-Langen, Giessen 35392, Germany
  - Department of Bioinformatics and Systems Biology, Justus Liebig University Giessen, Giessen 35392, Germany
10
Mu Z, Yu T, Liu X, Zheng H, Wei L, Liu J. FEGS: a novel feature extraction model for protein sequences and its applications. BMC Bioinformatics 2021; 22:297. [PMID: 34078264 PMCID: PMC8172329 DOI: 10.1186/s12859-021-04223-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 04/25/2021] [Accepted: 05/28/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Feature extraction of protein sequences is widely used in various research areas related to protein analysis, such as protein similarity analysis and prediction of protein functions or interactions. RESULTS In this study, we introduce FEGS (Feature Extraction based on Graphical and Statistical features), a novel feature extraction model for protein sequences. It is built on a new technique for graphical representation of protein sequences based on the physicochemical properties of amino acids, combined with the statistical features of protein sequences. By fusing the graphical and statistical features, FEGS transforms a protein sequence into a 578-dimensional numerical vector. When FEGS is applied to phylogenetic analysis on five protein sequence data sets, its performance is notably better than that of all the other compared methods. CONCLUSION The FEGS method is carefully designed and practically powerful for extracting features of protein sequences. The current version of FEGS is developed to be user-friendly and is expected to play a crucial role in related studies of protein sequence analysis.
Affiliation(s)
- Zengchao Mu
  - School of Mathematics and Statistics, Shandong University, Weihai, 264209, China
- Ting Yu
  - Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, 266237, China
- Xiaoping Liu
  - Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Beijing, China
- Hongyu Zheng
  - Department of Radiation Oncology, Qilu Hospital, Cheeloo College of Medicine, Shandong University, Jinan, 250012, China
- Leyi Wei
  - School of Software, Shandong University, Jinan, China
- Juntao Liu
  - School of Mathematics and Statistics, Shandong University, Weihai, 264209, China
11
Sun Z, Huang S, Zheng L, Liang P, Yang W, Zuo Y. ICTC-RAAC: An improved web predictor for identifying the types of ion channel-targeted conotoxins by using reduced amino acid cluster descriptors. Comput Biol Chem 2020; 89:107371. [PMID: 32950852 DOI: 10.1016/j.compbiolchem.2020.107371] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 07/07/2020] [Revised: 09/01/2020] [Accepted: 09/02/2020] [Indexed: 12/27/2022]
Abstract
Conotoxins are small peptide toxins that are rich in disulfide bonds and have a unique diversity of sequences. It is important to correctly identify the types of ion channel-targeted conotoxins because they are considered optimal pharmacological candidates in drug design, owing to their ability to bind specifically to ion channels and interfere with neural transmission. Compared with other feature extraction methods, the reduced amino acid cluster (RAAC) performs better at simplifying protein complexity and identifying functionally conserved regions. Thus, in our study, 673 RAACs generated from 74 types of reduced amino acid alphabet were comprehensively assessed to establish a state-of-the-art predictor for ion channel-targeted conotoxins. The results showed that Type 20, Cluster 9 (T = 20, C = 9) in the tripeptide composition (N = 3) achieved the best accuracy, 89.3%; this cluster is based on the variance maximization algorithm for amino acid reduction. Further, ANOVA with incremental feature selection (IFS) was used for feature selection to improve prediction performance. Finally, the cross-validation results showed that the best overall accuracy was 96.4%, 1.8% higher than the best accuracy of previous studies. Based on the proposed predictor, a user-friendly webserver was established and can be accessed at http://bioinfor.imu.edu.cn/ictcraac.
Affiliation(s)
- Zijie Sun
  - State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, 010070, China; School of Mathematical Sciences, Inner Mongolia University, Hohhot, 010021, China
- Shenghui Huang
  - State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, 010070, China
- Lei Zheng
  - State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, 010070, China
- Pengfei Liang
  - State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, 010070, China
- Wuritu Yang
  - State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, 010070, China
- Yongchun Zuo
  - State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, 010070, China
12
Sun Z, Pei S, He RL, Yau SST. A novel numerical representation for proteins: Three-dimensional Chaos Game Representation and its Extended Natural Vector. Comput Struct Biotechnol J 2020; 18:1904-1913. [PMID: 32774785 PMCID: PMC7390779 DOI: 10.1016/j.csbj.2020.07.004] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 04/28/2020] [Revised: 07/04/2020] [Accepted: 07/05/2020] [Indexed: 12/16/2022] Open
Abstract
Chaos Game Representation (CGR) was first proposed as an image representation method for DNA and has since been extended to other biological macromolecules. Compared with the CGR images of DNA, where DNA sequences are converted into a series of points in the unit square, the existing CGR images of proteins are less elegant geometrically, and the implications of the distribution of points in the CGR image are less obvious. In this study, by naturally distributing the twenty amino acids on the vertices of a regular dodecahedron, we introduce a novel three-dimensional image representation of protein sequences with the CGR method. We also associate each CGR image with a vector in high-dimensional Euclidean space, called the extended natural vector (ENV), in order to analyze the information contained in the CGR images. Based on the results of protein classification and phylogenetic analysis, our method could serve as a precise method to discover biological relationships between proteins.
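The geometric idea, twenty amino acids on the twenty vertices of a regular dodecahedron, can be sketched as follows. Two parts are placeholders rather than the paper's choices: the alphabetical assignment of residues to vertices and the contraction `ratio` of 0.5. The extended natural vector is not shown.

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # golden ratio


def dodecahedron_vertices():
    """The 20 vertices of a regular dodecahedron centred at the origin."""
    verts = [(x, y, z) for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)]
    for a in (-1 / PHI, 1 / PHI):
        for b in (-PHI, PHI):
            verts += [(0, a, b), (a, b, 0), (b, 0, a)]
    return verts


AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Placeholder assignment: alphabetical order, not the ordering used in the paper.
VERTEX = dict(zip(AMINO_ACIDS, dodecahedron_vertices()))


def cgr3d_points(seq, ratio=0.5):
    """Chaos game in 3-D: move a fraction `ratio` of the way towards each residue's vertex."""
    point = (0.0, 0.0, 0.0)
    points = []
    for aa in seq.upper():
        if aa not in VERTEX:
            continue  # skip non-standard residues
        target = VERTEX[aa]
        point = tuple(p + ratio * (t - p) for p, t in zip(point, target))
        points.append(point)
    return points
```

A regular dodecahedron has exactly twenty vertices, all at the same distance from the centre, so no amino acid is geometrically privileged. That symmetry is presumably what the abstract means by a natural distribution.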
Affiliation(s)
- Zeju Sun
- Department of Mathematical Sciences, Tsinghua University, Beijing, PR China
| | - Shaojun Pei
- Department of Mathematical Sciences, Tsinghua University, Beijing, PR China
| | - Rong Lucy He
- Department of Biological Sciences, Chicago State University, Chicago, IL 60628, USA
| | - Stephen S-T Yau
- Department of Mathematical Sciences, Tsinghua University, Beijing, PR China
|
13
|
Anitas EM. Small-Angle Scattering and Multifractal Analysis of DNA Sequences. Int J Mol Sci 2020; 21:ijms21134651. [PMID: 32629908 PMCID: PMC7369734 DOI: 10.3390/ijms21134651] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 06/06/2020] [Revised: 06/28/2020] [Accepted: 06/28/2020] [Indexed: 12/26/2022] Open
Abstract
The arrangement of A, C, G and T nucleotides in large DNA sequences of many prokaryotic and eukaryotic cells exhibits long-range correlations with fractal properties. Chaos game representation (CGR) of such DNA sequences, followed by a multifractal analysis, is a useful way to analyze the corresponding scaling properties. This approach provides a powerful visualization method to characterize their spatial inhomogeneity, and allows discrimination between mono- and multifractal distributions. However, in some cases, two different arbitrary point distributions may generate indistinguishable multifractal spectra. By using a new model based on multiplicative deterministic cascades, it is shown here that the small-angle scattering (SAS) formalism can be used to address this issue and to extract additional structural information. It is shown that the box-counting dimension given by multifractal spectra can be recovered from the scattering exponent of the SAS intensity in the fractal region. This approach is illustrated for point distributions of CGR data corresponding to Escherichia coli, Phospholamban and Mouse mitochondrial DNA, and it is shown that for the latter two cases, SAS allows extraction of the fractal iteration number and the scaling factor corresponding to the "ACGT" square, or recovery of the number of bases. The results are compared with a model based on multiplicative deterministic cascades, and with one that takes into account the existence of forbidden sequences in DNA. This allows a classification of the DNA sequences in terms of random and deterministic fractal structures emerging in CGR.
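As a rough illustration of the CGR step this abstract builds on, the sketch below maps a DNA string to points in the unit square by repeatedly moving halfway toward the corner assigned to each base. It is a minimal sketch, not the author's code: the function name `cgr_points`, the starting point (0.5, 0.5), and the corner assignment are assumptions, since the abstract only mentions an "ACGT" square.

```python
# Corner assignment is an assumption (one common convention); the abstract
# does not say which base sits on which corner of the "ACGT" square.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the list of CGR points (x, y) for a DNA string.

    Starting from the centre of the unit square, each base moves the
    current point halfway toward that base's corner. Symbols other than
    A, C, G, T (for example N) are skipped.
    """
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        corner = CORNERS.get(base)
        if corner is None:
            continue
        x = (x + corner[0]) / 2.0
        y = (y + corner[1]) / 2.0
        points.append((x, y))
    return points
```

The resulting point cloud is the object on which a multifractal or scattering analysis would then be run.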
Affiliation(s)
- Eugen Mircea Anitas
- Joint Institute for Nuclear Research, Dubna 141980, Russia;
- Horia Hulubei, National Institute of Physics and Nuclear Engineering, 077125 Bucharest-Magurele, Romania
|
14
|
Jiang S, Li BG, Yu ZG, Wang F, Anh V, Zhou Y. Multifractal temporally weighted detrended cross-correlation analysis of multivariate time series. Chaos (Woodbury, N.Y.) 2020; 30:023134. [PMID: 32113234 DOI: 10.1063/1.5129574] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/01/2019] [Accepted: 02/04/2020] [Indexed: 06/10/2023]
Abstract
Fractal and multifractal properties of various systems have been studied extensively. In this paper, first, the multivariate multifractal detrend cross-correlation analysis (MMXDFA) is proposed to investigate the multifractal features in multivariate time series. MMXDFA may produce oscillations in the fluctuation function and spurious cross correlations. In order to overcome these problems, we then propose the multivariate multifractal temporally weighted detrended cross-correlation analysis (MMTWXDFA). In relation to the multivariate detrended cross-correlation analysis and multifractal temporally weighted detrended cross-correlation analysis, an innovation of MMTWXDFA is the application of the signed Manhattan distance to calculate the local detrended covariance function. To evaluate the performance of the MMXDFA and MMTWXDFA methods, we apply them on some artificially generated multivariate series. Several numerical tests demonstrate that both methods can identify their fractality, but MMTWXDFA can detect long-range cross correlations and simultaneously quantify the levels of cross correlation between two multivariate series more accurately.
Affiliation(s)
- Shan Jiang
- Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan 411105, China
| | - Bao-Gen Li
- Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan 411105, China
| | - Zu-Guo Yu
- Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan 411105, China
| | - Fang Wang
- College of Information and Science Technology, Hunan Agricultural University, Changsha, Hunan 410128, China
| | - Vo Anh
- Faculty of Science, Engineering and Technology, Swinburne University of Technology, PO Box 218, Hawthorn, Victoria 3122, Australia
| | - Yu Zhou
- Institute of Future Cities and Department of Geography and Resource Management, The Chinese University of Hong Kong, Shatin, Hong Kong, China
|
15
|
Zuo Y, Chang Y, Huang S, Zheng L, Yang L, Cao G. iDEF-PseRAAC: Identifying the Defensin Peptide by Using Reduced Amino Acid Composition Descriptor. Evol Bioinform Online 2019; 15:1176934319867088. [PMID: 31391777 PMCID: PMC6669840 DOI: 10.1177/1176934319867088] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 06/18/2019] [Accepted: 07/08/2019] [Indexed: 11/18/2022] Open
Abstract
Defensins, one of the major classes of host defense peptides, play a significant role in innate immunity and are highly evolved in almost all living organisms. Developing high-throughput computational methods can help in accurately designing drugs or medical means to defend against pathogens. To take up this challenge, an up-to-date server based on a rigorous benchmark dataset, referred to as iDEF-PseRAAC, was designed in this study for predicting the defensin family. By extracting primary sequence compositions based on different types of reduced amino acid alphabets, the best overall accuracy of the selected feature subset reached 92.38%. We therefore conclude that the information provided by the many types of amino acid reduction offers an efficient and rational methodology for defensin identification. A free online server is available for academic users at http://bioinfor.imu.edu.cn/idpf. We expect that iDEF-PseRAAC may be a promising tool for the functional annotation of defensin proteins.
Affiliation(s)
- Yongchun Zuo
- College of Veterinary Medicine, Inner Mongolia Agricultural University, Hohhot, China; State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
| | - Yu Chang
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
| | - Shenghui Huang
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
| | - Lei Zheng
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
| | - Lei Yang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
| | - Guifang Cao
- College of Veterinary Medicine, Inner Mongolia Agricultural University, Hohhot, China
|
16
|
Deep learning on chaos game representation for proteins. Bioinformatics 2019; 36:272-279. [DOI: 10.1093/bioinformatics/btz493] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Received: 03/13/2019] [Revised: 05/29/2019] [Accepted: 06/14/2019] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Classification of protein sequences is one big task in bioinformatics and has many applications. Different machine learning methods exist and are applied to these problems, such as support vector machines (SVM), random forests (RF) and neural networks (NN). All of these methods have in common that protein sequences have to be made machine-readable and comparable in the first step, for which different encodings exist. These encodings are typically based on physical or chemical properties of the sequence. However, due to the outstanding performance of deep neural networks (DNN) on image recognition, we used frequency matrix chaos game representation (FCGR) for encoding of protein sequences into images. In this study, we compare the performance of SVMs, RFs and DNNs trained on FCGR-encoded protein sequences. While the original chaos game representation (CGR) has been used mainly for genome sequence encoding and classification, we modified it to work also for protein sequences, resulting in n-flakes representation, an image with several icosagons. RESULTS We could show that all applied machine learning techniques (RF, SVM and DNN) show promising results compared to the state-of-the-art methods on our benchmark datasets, with DNNs outperforming the other methods, and that FCGR is a promising new encoding method for protein sequences. AVAILABILITY AND IMPLEMENTATION https://cran.r-project.org/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
|
17
|
Mu Z, Yu T, Qi E, Liu J, Li G. DCGR: feature extractions from protein sequences based on CGR via remodeling multiple information. BMC Bioinformatics 2019; 20:351. [PMID: 31221087 PMCID: PMC6587251 DOI: 10.1186/s12859-019-2943-x] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 02/01/2019] [Accepted: 06/10/2019] [Indexed: 12/01/2022] Open
Abstract
BACKGROUND Protein feature extraction plays an important role in the areas of similarity analysis of protein sequences and prediction of protein structures, functions and interactions. The feature extraction based on graphical representation is one of the most effective and efficient ways. However, most existing methods suffer limitations from their method design. RESULTS We introduce DCGR, a novel method for extracting features from protein sequences based on the chaos game representation, which is developed by constructing CGR curves of protein sequences according to physicochemical properties of amino acids, followed by converting the CGR curves into multi-dimensional feature vectors by using the distributions of points in CGR images. Tested on five data sets, DCGR was significantly superior to the state-of-the-art feature extraction methods. CONCLUSION The DCGR is practically powerful for extracting effective features from protein sequences, and therefore important in similarity analysis of protein sequences, study of protein-protein interactions and prediction of protein functions. It is freely available at https://sourceforge.net/projects/transcriptomeassembly/files/Feature%20Extraction .
Affiliation(s)
- Zengchao Mu
- School of Mathematics, Shandong University, Jinan, 250100 Shandong Province China
| | - Ting Yu
- School of Mathematics, Shandong University, Jinan, 250100 Shandong Province China
| | - Enfeng Qi
- College of Mathematics and Statistics, Guangxi Normal University, Guilin, 541001 China
| | - Juntao Liu
- School of Mathematics, Shandong University, Jinan, 250100 Shandong Province China
| | - Guojun Li
- School of Mathematics, Shandong University, Jinan, 250100 Shandong Province China
|
18
|
Li C, Zhao J, Wang C, Yao Y. Protein Sequence Comparison and DNA-binding Protein Identification with Generalized PseAAC and Graphical Representation. Comb Chem High Throughput Screen 2019; 21:100-110. [PMID: 29380690 PMCID: PMC5930480 DOI: 10.2174/1386207321666180130100838] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 11/22/2017] [Revised: 01/24/2018] [Accepted: 01/26/2018] [Indexed: 11/22/2022]
Abstract
AIM AND OBJECTIVE The rapid increase in the amount of protein sequence data available leads to an urgent need for novel computational algorithms to analyze and compare these sequences. This study is undertaken to develop an efficient computational approach for timely encoding of protein sequences and extraction of the hidden information. METHODS Based on two physicochemical properties of amino acids, a protein primary sequence was converted into a three-letter sequence, and then a graph without loops and multiple edges and its geometric line adjacency matrix were obtained. A generalized PseAAC (pseudo amino acid composition) model was thus constructed to characterize a protein sequence numerically. RESULTS By using the proposed mathematical descriptor of a protein sequence, similarity comparisons among β-globin proteins of 17 species and 72 spike proteins of coronaviruses were made, respectively. The resulting clusters agreed well with the established taxonomic groups. In addition, a generalized PseAAC based SVM (support vector machine) model was developed to identify DNA-binding proteins. Experimental results showed that our method performed better than DNAbinder, DNA-Prot, iDNA-Prot and enDNA-Prot by 3.29-10.44% in terms of ACC, 0.056-0.206 in terms of MCC, and 1.45-15.76% in terms of F1M. When the benchmark dataset was expanded with negative samples, the presented approach outperformed the four previous methods with improvement in the range of 2.49-19.12% in terms of ACC, 0.05-0.32 in terms of MCC, and 3.82-33.85% in terms of F1M. CONCLUSION These results suggested that the generalized PseAAC model was very efficient for comparison and analysis of protein sequences, and very competitive in identifying DNA-binding proteins.
Affiliation(s)
- Chun Li
- School of Mathematics and Statistics, Hainan Normal University, Haikou 571158, China; Department of Mathematics, Bohai University, Jinzhou 121013, China; Research Institute of Food Science, Bohai University, Jinzhou 121013, China
| | - Jialing Zhao
- Department of Mathematics, Bohai University, Jinzhou 121013, China
| | - Changzhong Wang
- Department of Mathematics, Bohai University, Jinzhou 121013, China
| | - Yuhua Yao
- School of Mathematics and Statistics, Hainan Normal University, Haikou 571158, China
|
19
|
Wang H, Wu P. Prediction of RNA-protein interactions using conjoint triad feature and chaos game representation. Bioengineered 2019; 9:242-251. [PMID: 30117758 PMCID: PMC6984769 DOI: 10.1080/21655979.2018.1470721] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 01/10/2023] Open
Abstract
RNA-protein interactions (RPIs) play a very important role in a wide range of post-transcriptional regulations, and identifying whether a given RNA-protein pair can form interactions or not is a vital prerequisite for dissecting the regulatory mechanisms of functional RNAs. Currently, expensive and time-consuming biological assays can only determine a very small portion of all RPIs, which calls for computational approaches to help biologists efficiently and correctly find candidate RPIs. Here, we integrated a successful computing algorithm, conjoint triad feature (CTF), and another method, chaos game representation (CGR), for representing RNA-protein pairs and by doing so developed a prediction model based on these representations and random forest (RF) classifiers. When testing two benchmark datasets, RPI369 and RPI2241, the combined method (CTF+CGR) showed some superiority compared with four existing tools. Especially on RPI2241, the CTF+CGR method improved prediction accuracy (ACC) from 0.91 (the best record of all published works) to 0.95. When independently testing a newly constructed dataset, RPI1449, which only contained experimentally validated RPIs released between 2014 and 2016, our method still showed some generalization capability with an ACC of 0.75. Accordingly, we believe that our hybrid CTF+CGR method will be an important tool for predicting RPIs in the future.
Affiliation(s)
- Hongchu Wang
- Department of Mathematics, South China Normal University, Guangzhou, P.R. of China
| | - Pengfei Wu
- College of Informatics, Huazhong Agricultural University, Wuhan, P.R. of China
|
20
|
Identifying anticancer peptides by using a generalized chaos game representation. J Math Biol 2018; 78:441-463. [PMID: 30291366 DOI: 10.1007/s00285-018-1279-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 09/25/2017] [Revised: 08/01/2018] [Indexed: 10/28/2022]
Abstract
We generalize chaos game representation (CGR) to higher dimensional spaces while maintaining its bijection, keeping the method sufficiently representative and mathematically rigorous compared to previous attempts. We first state and prove the asymptotic property of CGR and our generalized chaos game representation (GCGR) method. It follows that the dissimilarity of sequences which possess identical subsequences at distinct positions is lowered exponentially with the length of the identical subsequence; this effect was taking place unbeknownst to researchers. By shining a spotlight on it now, we show that the effect fundamentally supports (G)CGR as a similarity measure or feature extraction technique. We develop two feature extraction techniques: GCGR-Centroid and GCGR-Variance. We use GCGR-Centroid to analyze the similarity between protein sequences on the datasets of 9 ND5, 24 TF and 50 beta-globin proteins. We obtain results consistent with previous studies, which supports the significance of the approach. Finally, by utilizing support vector machines, we train an anticancer peptide prediction model using both GCGR-Centroid and GCGR-Variance, and achieve a significantly higher prediction performance on 3 well-studied anticancer peptide datasets.
|
21
|
Duarte-Sanchez JE, Velasco-Medina J, Moreno PA. Hardware Accelerator for the Multifractal Analysis of DNA Sequences. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2018; 15:1611-1624. [PMID: 28749355 DOI: 10.1109/tcbb.2017.2731339] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/07/2023]
Abstract
Multifractal analysis has made it possible to quantify the genetic variability and non-linear stability along the human genome sequence. It has implications for explaining several genetic diseases caused by chromosome abnormalities, among other genetic particularities. The multifractal analysis of a genome is carried out by dividing the complete DNA sequence into smaller fragments and calculating the generalized dimension spectrum of each fragment using the chaos game representation and the box-counting method. This is a time-consuming process because it involves processing large data sets in floating-point representation. In order to reduce the computation time, we designed an application-specific processor, here called the multifractal processor, which is based on our proposed hardware-oriented algorithm for efficiently calculating the generalized dimension spectrum of DNA sequences. The multifractal processor was implemented on a low-cost SoC-FPGA and was verified by processing a complete human genome. The execution time and numeric results of the multifractal processor were compared with those of the software implementation executed on a 20-core workstation, achieving a speedup of 2.6x and an average error of 0.0003 percent.
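The box-counting estimate of the generalized dimension spectrum mentioned in this abstract can be sketched in software as follows. This is an illustrative sketch under stated assumptions, not the hardware-oriented algorithm of the paper: `points` is assumed to be a list of (x, y) CGR coordinates in the unit square, the grid sizes `ks` are arbitrary, and all function names are hypothetical. For each grid of 2^k by 2^k boxes it computes the box probabilities p_i, then estimates D_q as the slope of log(sum p_i^q)/(q - 1) against log(box size), with the usual entropy form sum p_i log p_i for q = 1.

```python
import math
from collections import Counter


def box_probabilities(points, k):
    """Fraction of points falling in each occupied box of a 2**k by 2**k grid."""
    n = 2 ** k
    counts = Counter(
        (min(int(x * n), n - 1), min(int(y * n), n - 1)) for x, y in points
    )
    return [c / len(points) for c in counts.values()]


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = sum((a - mx) ** 2 for a in xs)
    return num / den


def generalized_dimension(points, q, ks=(1, 2, 3, 4)):
    """Estimate D_q by regressing the q-th order box statistic on log(box size)."""
    log_eps = [-k * math.log(2.0) for k in ks]  # box size is 2**-k
    stats = []
    for k in ks:
        p = box_probabilities(points, k)
        if q == 1:
            # Information dimension: limit of the general formula as q -> 1.
            stats.append(sum(pi * math.log(pi) for pi in p))
        else:
            stats.append(math.log(sum(pi ** q for pi in p)) / (q - 1))
    return ols_slope(log_eps, stats)
```

Evaluating `generalized_dimension` over a range of q values for each genome fragment gives one spectrum per fragment; a point set that fills the square uniformly yields D_q close to 2 for every q.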
|
22
|
Wei YL, Yu ZG, Zou HL, Anh V. Multifractal temporally weighted detrended cross-correlation analysis to quantify power-law cross-correlation and its application to stock markets. Chaos (Woodbury, N.Y.) 2017; 27:063111. [PMID: 28679233 DOI: 10.1063/1.4985637] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 06/07/2023]
Abstract
A new method, multifractal temporally weighted detrended cross-correlation analysis (MF-TWXDFA), is proposed to investigate multifractal cross-correlations in this paper. This new method is based on multifractal temporally weighted detrended fluctuation analysis and multifractal cross-correlation analysis (MFCCA). An innovation of the method is applying geographically weighted regression to estimate local trends in the nonstationary time series. We also take into consideration the sign of the fluctuations in computing the corresponding detrended cross-covariance function. To test the performance of the MF-TWXDFA algorithm, we apply it and the MFCCA method to simulated and actual series. Numerical tests on artificially simulated series demonstrate that our method can accurately detect long-range cross-correlations for two simultaneously recorded series. To further show the utility of MF-TWXDFA, we apply it to time series from stock markets and find that power-law cross-correlation between stock returns is significantly multifractal. A new coefficient, the MF-TWXDFA cross-correlation coefficient, is also defined to quantify the levels of cross-correlation between two time series.
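For orientation, the sketch below shows the plain detrended cross-correlation scheme that methods of this family extend. It is deliberately a simplification and not MF-TWXDFA itself: local trends are removed with an ordinary straight-line fit per window (not the temporally or geographically weighted regression of the paper), and the absolute value of each local covariance is taken, so the sign handling described in the abstract is not reproduced. All function names and the choice of scales are assumptions for illustration, and q must be positive here.

```python
import math


def profile(series):
    """Cumulative sum of the mean-removed series."""
    mean = sum(series) / len(series)
    out, total = [], 0.0
    for value in series:
        total += value - mean
        out.append(total)
    return out


def linear_residuals(segment):
    """Residuals of a least-squares straight-line fit over t = 0..n-1."""
    n = len(segment)
    t_mean = (n - 1) / 2.0
    y_mean = sum(segment) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(segment))
    slope = sxy / sxx
    return [y - (y_mean + slope * (t - t_mean)) for t, y in enumerate(segment)]


def local_covariances(x, y, s):
    """Detrended covariance of the two profiles in each window of length s."""
    px, py = profile(x), profile(y)
    covs = []
    for start in range(0, len(px) - s + 1, s):
        rx = linear_residuals(px[start:start + s])
        ry = linear_residuals(py[start:start + s])
        covs.append(sum(a * b for a, b in zip(rx, ry)) / s)
    return covs


def fluctuation(x, y, s, q=2.0):
    """q-th order fluctuation function F_q(s) for q > 0 (absolute-value form)."""
    covs = [abs(c) for c in local_covariances(x, y, s)]
    return (sum(c ** (q / 2.0) for c in covs) / len(covs)) ** (1.0 / q)


def scaling_exponent(x, y, scales, q=2.0):
    """Slope of log F_q(s) against log s over the given window lengths."""
    lx = [math.log(s) for s in scales]
    ly = [math.log(fluctuation(x, y, s, q)) for s in scales]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    return num / sum((a - mx) ** 2 for a in lx)
```

A scaling exponent that varies with q indicates multifractal cross-correlation. When both inputs are the same series, the procedure reduces to ordinary detrended fluctuation analysis.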
Affiliation(s)
- Yun-Lan Wei
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education and Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Xiangtan, Hunan 411105, China
| | - Zu-Guo Yu
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education and Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Xiangtan, Hunan 411105, China
| | - Hai-Long Zou
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education and Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Xiangtan, Hunan 411105, China
| | - Vo Anh
- School of Mathematical Sciences, Queensland University of Technology, GPO Box 2434, Brisbane Q4001, Australia
|
23
|
Fractal and multifractal analyses of bipartite networks. Sci Rep 2017; 7:45588. [PMID: 28361962 PMCID: PMC5374526 DOI: 10.1038/srep45588] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.0] [Received: 12/28/2016] [Accepted: 02/27/2017] [Indexed: 11/29/2022] Open
Abstract
Bipartite networks have attracted considerable interest in various fields. Fractality and multifractality of unipartite (classical) networks have been studied in recent years, but no work has yet studied these properties in bipartite networks. In this paper, we try to unfold the self-similarity structure of bipartite networks by performing fractal and multifractal analyses for a variety of real-world bipartite network data sets and models. First, we find fractality in some bipartite networks, including the CiteULike, Netflix, MovieLens (ml-20m) and Delicious data sets and the (u, v)-flower model. Meanwhile, we observe shifted power-law or exponential behavior in several other networks. We then focus on the multifractal properties of bipartite networks. Our results indicate that multifractality exists in those bipartite networks possessing fractality. To capture the inherent attribute of a bipartite network with two different types of nodes, we assign different weights to the nodes of different classes, and show the existence of multifractality in these node-weighted bipartite networks. In addition, for the data sets with ratings, we modify two existing algorithms for fractal and multifractal analyses of edge-weighted unipartite networks to study the self-similarity of the corresponding edge-weighted bipartite networks. The results show that our modified algorithms are feasible and can effectively uncover the self-similarity structure of these edge-weighted bipartite networks and their corresponding node-weighted versions.
|
24
|
Yu Y, Yang L, Liu Z, Zhu C. Gene essentiality prediction based on fractal features and machine learning. Molecular BioSystems 2017; 13:577-584. [PMID: 28145541 DOI: 10.1039/c6mb00806b] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Indexed: 01/06/2023]
Abstract
Predicting bacterial essential genes using only fractal features.
Affiliation(s)
- Yongming Yu
- Department of Biomedical Engineering, Shandong University, Jinan, China
| | - Licai Yang
- Department of Biomedical Engineering, Shandong University, Jinan, China
| | - Zhiping Liu
- Department of Biomedical Engineering, Shandong University, Jinan, China
| | - Chuansheng Zhu
- Department of Hematology, Shandong University Affiliated Qianfoshan Hospital, Jinan, China
|
25
|
Karamichalis R, Kari L, Konstantinidis S, Kopecki S, Solis-Reyes S. Additive methods for genomic signatures. BMC Bioinformatics 2016; 17:313. [PMID: 27549194 PMCID: PMC4994249 DOI: 10.1186/s12859-016-1157-8] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Received: 05/13/2016] [Accepted: 07/19/2016] [Indexed: 01/09/2023] Open
Abstract
Background Studies exploring the potential of Chaos Game Representations (CGR) of genomic sequences to act as “genomic signatures” (to be species- and genome-specific) showed that CGR patterns of nuclear and organellar DNA sequences of the same organism can be very different. While the hypothesis that CGRs of mitochondrial DNA sequences can act as genomic signatures was validated for a snapshot of all sequenced mitochondrial genomes available in the NCBI GenBank sequence database, to our knowledge no such extensive analysis of CGRs of nuclear DNA sequences exists to date. Results We analyzed an extensive dataset, totalling 1.45 gigabase pairs, of nuclear/nucleoid genomic sequences (nDNA) from 42 different organisms, spanning all major kingdoms of life. Our computational experiments indicate that CGR signatures of nDNA of two different origins cannot always be differentiated, especially if they originate from closely-related species such as H. sapiens and P. troglodytes or E. coli and E. fergusonii. To address this issue, we propose the general concept of additive DNA signature of a set (collection) of DNA sequences. One particular instance, the composite DNA signature, combines information from nDNA fragments and organellar (mitochondrial, chloroplast, or plasmid) genomes. We demonstrate that, in this dataset, composite DNA signatures originating from two different organisms can be differentiated in all cases, including those where the use of CGR signatures of nDNA failed or was inconclusive. Another instance, the assembled DNA signature, combines information from many short DNA subfragments (e.g., 100 basepairs) of a given DNA fragment, to produce its signature. We show that an assembled DNA signature has the same distinguishing power as a conventionally computed CGR signature, while using shorter contiguous sequences and potentially less sequence information. 
Conclusions Our results suggest that, while CGR signatures of nDNA cannot always play the role of genomic signatures, composite and assembled DNA signatures (separately or in combination) could potentially be used instead. Such additive signatures could be used, e.g., with raw unassembled next-generation sequencing (NGS) read data, when high-quality sequencing data is not available, or to complement information obtained by other methods of species identification or classification. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1157-8) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Rallis Karamichalis
- Department of Computer Science, University of Western Ontario, London ON, N6A 5B7, Canada
| | - Lila Kari
- School of Computing Science, University of Waterloo, Waterloo, ON, N2L 3G1, Canada; Department of Computer Science, University of Western Ontario, London ON, N6A 5B7, Canada.
| | - Stavros Konstantinidis
- Department of Mathematics and Computing Science, Saint Mary's University, Halifax NS, Canada
| | - Steffen Kopecki
- Department of Computer Science, University of Western Ontario, London ON, N6A 5B7, Canada; Department of Mathematics and Computing Science, Saint Mary's University, Halifax NS, Canada
| | - Stephen Solis-Reyes
- Department of Computer Science, University of Western Ontario, London ON, N6A 5B7, Canada
|
26
|
Zhang L, Kong L, Han X, Lv J. Structural class prediction of protein using novel feature extraction method from chaos game representation of predicted secondary structure. J Theor Biol 2016; 400:1-10. [DOI: 10.1016/j.jtbi.2016.04.011] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.6] [Received: 01/11/2016] [Revised: 03/18/2016] [Accepted: 04/08/2016] [Indexed: 11/30/2022]
|
27
|
Song YQ, Liu JL, Yu ZG, Li BG. Multifractal analysis of weighted networks by a modified sandbox algorithm. Sci Rep 2015; 5:17628. [PMID: 26634304 PMCID: PMC4669438 DOI: 10.1038/srep17628] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.3] [Received: 06/16/2015] [Accepted: 11/03/2015] [Indexed: 01/13/2023] Open
Abstract
Complex networks have attracted growing attention in many fields. As a generalization of fractal analysis, multifractal analysis (MFA) is a useful way to systematically describe the spatial heterogeneity of both theoretical and experimental fractal patterns. Some algorithms for MFA of unweighted complex networks have been proposed in the past few years, including the sandbox (SB) algorithm recently employed by our group. In this paper, a modified SB algorithm (which we call the SBw algorithm) is proposed for MFA of weighted networks. First, we use the SBw algorithm to study the multifractal property of two families of weighted fractal networks (WFNs): "Sierpinski" WFNs and "Cantor dust" WFNs. We also discuss how the fractal dimension and generalized fractal dimensions change with the edge-weights of the WFN. From the comparison between the theoretical and numerical fractal dimensions of these networks, we find that the proposed SBw algorithm is efficient and feasible for MFA of weighted networks. Then, we apply the SBw algorithm to study multifractal properties of some real weighted networks, namely collaboration networks. It is found that multifractality exists in these weighted networks and is affected by their edge-weights.
Affiliation(s)
- Yu-Qin Song
  - Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan 411105, China
  - College of Science, Hunan University of Technology, Zhuzhou, Hunan 412007, China
- Jin-Long Liu
  - Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan 411105, China
- Zu-Guo Yu
  - Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan 411105, China
  - School of Mathematical Sciences, Queensland University of Technology, Brisbane, Q4001, Australia
- Bao-Gen Li
  - Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan 411105, China
|
28
|
Wang H, Hu X. Accurate prediction of nuclear receptors with conjoint triad feature. BMC Bioinformatics 2015; 16:402. [PMID: 26630876 PMCID: PMC4668603 DOI: 10.1186/s12859-015-0828-1] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Received: 08/27/2015] [Accepted: 11/17/2015] [Indexed: 11/26/2022] Open
Abstract
Background Nuclear receptors (NRs) form a large family of ligand-inducible transcription factors that regulate gene expressions involved in numerous physiological phenomena, such as embryogenesis, homeostasis, cell growth and death. These nuclear receptors-related pathways are important targets of marketed drugs. Therefore, the design of a reliable computational model for predicting NRs from amino acid sequence has now been a significant biomedical problem. Results Conjoint triad feature (CTF) mainly considers neighbor relationships in protein sequences by encoding each protein sequence using the triad (continuous three amino acids) frequency distribution extracted from a 7-letter reduced alphabet. In addition, chaos game representation (CGR) can investigate the patterns hidden in protein sequences and visually reveal previously unknown structure. In this paper, three methods, CTF, CGR, amino acid composition (AAC), are applied to formulate the protein samples. By considering different combinations of three methods, we study seven groups of features, and each group is evaluated by the 10-fold cross-validation test. Meanwhile, a new non-redundant dataset containing 474 NR sequences and 500 non-NR sequences is built based on the latest NucleaRDB database. Comparing the results of numerical experiments, the group of combined features with CTF and AAC gets the best result with the accuracy of 96.30 % for identifying NRs from non-NRs. Moreover, if it is classified as a NR, it will be further put into the second level, which will classify a NR into one of the eight main subfamilies. At the second level, the group of combined features with CTF and AAC also gets the best accuracy of 94.73 %. 
Subsequently, the proposed predictor is compared with two existing methods, and the comparisons show that the accuracies of two levels significantly increase to 98.79 % (NR-2L: 92.56 %; iNR-PhysChem: 98.18 %; the first level) and 93.71 % (NR-2L: 88.68 %; iNR-PhysChem: 92.45 %; the second level) with the introduction of our CTF-based method. Finally, each component of CTF features is analyzed via the statistical significant test, and a simplified model only with the resulting top-50 significant features achieves accuracy of 95.28 %. Conclusions The experimental results demonstrate that our CTF-based method is an effective way for predicting nuclear receptor proteins. Furthermore, the top-50 significant features obtained from the statistical significant test are considered as the “intrinsic features” in predicting NRs based on the analysis of relative importance.
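As an illustrative aside, the triad encoding described in this abstract can be sketched as follows. This is a minimal sketch, not the authors' implementation: the abstract does not list the 7-letter reduced alphabet, so the grouping below (a commonly used one for conjoint triads) is an assumption, as are the function name `conjoint_triad` and the choice to normalize by the total triad count.

```python
# Assumed 7-class reduced alphabet; the abstract does not specify the grouping.
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_CLASS = {aa: idx for idx, group in enumerate(CLASSES) for aa in group}


def conjoint_triad(seq):
    """Return a 343-dimensional (7 * 7 * 7) vector of triad frequencies.

    Each window of three consecutive residues is mapped to a triple of
    class indices; windows containing an unknown residue are skipped.
    """
    codes = [AA_TO_CLASS.get(aa) for aa in seq.upper()]
    counts = [0] * 343
    for i in range(len(codes) - 2):
        a, b, c = codes[i], codes[i + 1], codes[i + 2]
        if a is None or b is None or c is None:
            continue
        counts[a * 49 + b * 7 + c] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * 343
    return [n / total for n in counts]


if __name__ == "__main__":
    vec = conjoint_triad("MKVLAAGIRDE")
    print(len(vec), sum(vec))
```

The resulting vector could then be concatenated with other descriptors, such as amino acid composition, before being passed to a classifier.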
Affiliation(s)
- Hongchu Wang
  - Department of Mathematics, South China Normal University, Guangzhou, 510631, P.R. of China
- Xuehai Hu
  - College of Informatics, Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, 430070, P.R. of China.
|
29
|
Yu JF, Dou XH, Wang HB, Sun X, Zhao HY, Wang JH. A Novel Cylindrical Representation for Characterizing Intrinsic Properties of Protein Sequences. J Chem Inf Model 2015; 55:1261-70. [PMID: 25945398 DOI: 10.1021/ci500577m] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 11/28/2022]
Abstract
The composition and sequence order of amino acid residues are the two most important characteristics to describe a protein sequence. Graphical representations facilitate visualization of biological sequences and produce biologically useful numerical descriptors. In this paper, we propose a novel cylindrical representation by placing the 20 amino acid residue types in a circle and sequence positions along the z axis. This representation allows visualization of the composition and sequence order of amino acids at the same time. Ten numerical descriptors and one weighted numerical descriptor have been developed to quantitatively describe intrinsic properties of protein sequences on the basis of the cylindrical model. Their applications to similarity/dissimilarity analysis of nine ND5 proteins indicated that these numerical descriptors are more effective than several classical numerical matrices. Thus, the cylindrical representation obtained here provides a new useful tool for visualizing and characterizing protein sequences. An online server is available at http://biophy.dzu.edu.cn:8080/CNumD/input.jsp .
Affiliation(s)
- Jia-Feng Yu
  - Shandong Provincial Key Laboratory of Functional Macromolecular Biophysics, Institute of Biophysics, Dezhou University, Dezhou 253023, China
  - State Key Laboratory of Bioelectronics, Southeast University, Nanjing 210096, China
- Xiang-Hua Dou
  - Shandong Provincial Key Laboratory of Functional Macromolecular Biophysics, Institute of Biophysics, Dezhou University, Dezhou 253023, China
- Hong-Bo Wang
  - Shandong Provincial Key Laboratory of Functional Macromolecular Biophysics, Institute of Biophysics, Dezhou University, Dezhou 253023, China
- Xiao Sun
  - State Key Laboratory of Bioelectronics, Southeast University, Nanjing 210096, China
- Hui-Ying Zhao
  - Department of Genetics and Computational Biology, QIMR Berghofer Medical Research Institute, Brisbane, Queensland 4000, Australia
- Ji-Hua Wang
  - Shandong Provincial Key Laboratory of Functional Macromolecular Biophysics, Institute of Biophysics, Dezhou University, Dezhou 253023, China
  - College of Physics and Electronic Information, Dezhou University, Dezhou 253023, China
|
30
|
Xie XH, Yu ZG, Han GS, Yang WF, Anh V. Whole-proteome based phylogenetic tree construction with inter-amino-acid distances and the conditional geometric distribution profiles. Mol Phylogenet Evol 2015; 89:37-45. [PMID: 25882834 DOI: 10.1016/j.ympev.2015.04.008] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 09/18/2014] [Revised: 03/29/2015] [Accepted: 04/06/2015] [Indexed: 11/18/2022]
Abstract
There has been a growing interest in alignment-free methods for whole genome comparison and phylogenomic studies. In this study, we propose an alignment-free method for phylogenetic tree construction using whole-proteome sequences. Based on the inter-amino-acid distances, we first convert the whole-proteome sequences into inter-amino-acid distance vectors, which are called observed inter-amino-acid distance profiles. Then, we propose to use conditional geometric distribution profiles (the distributions of sequences where the amino acids are placed randomly and independently) as the reference distribution profiles. Last the relative deviation between the observed and reference distribution profiles is used to define a simple metric that reflects the phylogenetic relationships between whole-proteome sequences of different organisms. We name our method inter-amino-acid distances and conditional geometric distribution profiles (IAGDP). We evaluate our method on two data sets: the benchmark dataset including 29 genomes used in previous published papers, and another one including 67 mammal genomes. Our results demonstrate that the new method is useful and efficient.
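To make the idea of comparing observed inter-amino-acid distances against a random-placement reference more concrete, a simplified sketch is given below. It is not the IAGDP method itself: the truncation at `max_d`, the renormalized geometric reference, and all function names are assumptions made for illustration, and the abstract does not give the exact form of the metric.

```python
from collections import Counter


def distance_profile(seq, residue, max_d=10):
    """Observed distribution of gaps between consecutive occurrences of
    `residue`, restricted to gaps of at most `max_d`."""
    positions = [i for i, aa in enumerate(seq) if aa == residue]
    gaps = [b - a for a, b in zip(positions, positions[1:]) if b - a <= max_d]
    counts = Counter(gaps)
    n = len(gaps)
    return [counts[d] / n if n else 0.0 for d in range(1, max_d + 1)]


def geometric_reference(p, max_d=10):
    """Geometric gap distribution for independent random placement with
    per-position probability p, conditioned on the gap being <= max_d."""
    raw = [p * (1 - p) ** (d - 1) for d in range(1, max_d + 1)]
    norm = sum(raw)
    return [r / norm for r in raw]


def relative_deviation(seq, residue, max_d=10):
    """Per-distance relative deviation (observed - reference) / reference."""
    p = seq.count(residue) / len(seq)
    if not 0.0 < p < 1.0:
        raise ValueError("residue frequency must be strictly between 0 and 1")
    observed = distance_profile(seq, residue, max_d)
    reference = geometric_reference(p, max_d)
    return [(o - r) / r for o, r in zip(observed, reference)]


if __name__ == "__main__":
    print(relative_deviation("MKALKKAGKLLKAK", "K", max_d=5))
```

Concatenating such deviation vectors over all 20 residue types would give one profile per proteome, which could then be compared with any vector distance.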
Affiliation(s)
- Xian-Hua Xie
  - Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, PR China; School of Mathematics and Computer Science, Gannan Normal University, Jiangxi 341000, PR China.
- Zu-Guo Yu
  - Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, PR China; School of Mathematical Sciences, Queensland University of Technology, GPO Box 2434, Brisbane, QLD 4001, Australia.
- Guo-Sheng Han
  - Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, PR China.
- Wei-Feng Yang
  - Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, PR China.
- Vo Anh
  - School of Mathematical Sciences, Queensland University of Technology, GPO Box 2434, Brisbane, QLD 4001, Australia.
|
31
|
Tanchotsrinon W, Lursinsap C, Poovorawan Y. A high performance prediction of HPV genotypes by Chaos game representation and singular value decomposition. BMC Bioinformatics 2015; 16:71. [PMID: 25880169 PMCID: PMC4375884 DOI: 10.1186/s12859-015-0493-4] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Received: 04/23/2014] [Accepted: 02/06/2015] [Indexed: 02/02/2023] Open
Abstract
BACKGROUND Human Papillomavirus (HPV) genotyping is an important approach to fight cervical cancer due to the relevant information regarding risk stratification for diagnosis and the better understanding of the relationship of HPV with carcinogenesis. This paper proposed two new feature extraction techniques, i.e. ChaosCentroid and ChaosFrequency, for predicting HPV genotypes associated with the cancer. The additional diversified 12 HPV genotypes, i.e. types 6, 11, 16, 18, 31, 33, 35, 45, 52, 53, 58, and 66, were studied in this paper. In our proposed techniques, a partitioned Chaos Game Representation (CGR) is deployed to represent HPV genomes. ChaosCentroid captures the structure of sequences in terms of centroid of each sub-region with Euclidean distances among the centroids and the center of CGR as the relations of all sub-regions. ChaosFrequency extracts the statistical distribution of mono-, di-, or higher order nucleotides along HPV genomes and forms a matrix of frequency of dots in each sub-region. For performance evaluation, four different types of classifiers, i.e. Multi-layer Perceptron, Radial Basis Function, K-Nearest Neighbor, and Fuzzy K-Nearest Neighbor Techniques were deployed, and our best results from each classifier were compared with the NCBI genotyping tool. RESULTS The experimental results obtained by four different classifiers are in the same trend. ChaosCentroid gave considerably higher performance than ChaosFrequency when the input length is one but it was moderately lower than ChaosFrequency when the input length is two. Both proposed techniques yielded almost or exactly the best performance when the input length is more than three. But there is no significance between our proposed techniques and the comparative alignment method. CONCLUSIONS Our proposed alignment-free and scale-independent method can successfully transform HPV genomes with 7,000 - 10,000 base pairs into features of 1 - 11 dimensions. 
This signifies that our ChaosCentroid and ChaosFrequency can be served as the effective feature extraction techniques for predicting the HPV genotypes.
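For readers unfamiliar with the chaos game representation that this entry builds on, a minimal sketch of the standard construction for nucleotide sequences follows, together with a count of dots per sub-region of a partitioned unit square. The corner assignment (A, C, G, T at the four corners) is the usual convention and is an assumption here, as are the function names; this is not the ChaosCentroid or ChaosFrequency code from the paper.

```python
# Assumed corner convention for the unit square; papers differ on this.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Map a nucleotide sequence to CGR points.

    Starting from the centre of the unit square, each point is the midpoint
    between the previous point and the corner of the current nucleotide.
    Symbols other than A, C, G, T are skipped (an assumption).
    """
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def grid_counts(points, k):
    """Count CGR dots in each cell of a 2^k x 2^k partition of the square.

    Rows are indexed by y and columns by x, both starting at 0.
    """
    n = 2 ** k
    grid = [[0] * n for _ in range(n)]
    for x, y in points:
        col = min(int(x * n), n - 1)
        row = min(int(y * n), n - 1)
        grid[row][col] += 1
    return grid


if __name__ == "__main__":
    pts = cgr_points("ACGTTGCAAC")
    for row in grid_counts(pts, 2):
        print(row)
```

Flattening the count grid gives a fixed-length feature vector whose size depends only on `k`, not on the genome length.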
Affiliation(s)
- Watcharaporn Tanchotsrinon
  - Advanced Virtual and Intelligent Computing Research Center (AVIC), Department of Mathematics and Computer Science, Faculty of Science, Chulalongkorn University, Phayathai Road, Bangkok, Thailand.
- Chidchanok Lursinsap
  - Advanced Virtual and Intelligent Computing Research Center (AVIC), Department of Mathematics and Computer Science, Faculty of Science, Chulalongkorn University, Phayathai Road, Bangkok, Thailand.
- Yong Poovorawan
  - Center of Excellence in Clinical Virology, Department of Pediatrics, Faculty of Medicine, Chulalongkorn University, Phayathai Road, Bangkok, Thailand.
|
32
|
Liu JL, Yu ZG, Anh V. Determination of multifractal dimensions of complex networks by means of the sandbox algorithm. Chaos 2015; 25:023103. [PMID: 25725639 DOI: 10.1063/1.4907557] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Indexed: 06/04/2023]
Abstract
Complex networks have attracted much attention in diverse areas of science and technology. Multifractal analysis (MFA) is a useful way to systematically describe the spatial heterogeneity of both theoretical and experimental fractal patterns. In this paper, we employ the sandbox (SB) algorithm proposed by Tél et al. (Physica A 159, 155-166 (1989)), for MFA of complex networks. First, we compare the SB algorithm with two existing algorithms of MFA for complex networks: the compact-box-burning algorithm proposed by Furuya and Yakubo (Phys. Rev. E 84, 036118 (2011)), and the improved box-counting algorithm proposed by Li et al. (J. Stat. Mech.: Theor. Exp. 2014, P02020 (2014)) by calculating the mass exponents τ(q) of some deterministic model networks. We make a detailed comparison between the numerical and theoretical results of these model networks. The comparison results show that the SB algorithm is the most effective and feasible algorithm to calculate the mass exponents τ(q) and to explore the multifractal behavior of complex networks. Then, we apply the SB algorithm to study the multifractal property of some classic model networks, such as scale-free networks, small-world networks, and random networks. Our results show that multifractality exists in scale-free networks, that of small-world networks is not obvious, and it almost does not exist in random networks.
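The sandbox idea mentioned in this abstract can be illustrated with a short sketch for unweighted networks: for a set of centre nodes, count the nodes within shortest-path radius r, average the (q-1)-th power of the relative mass, and estimate the generalized dimension D(q) from the slope of the log-log relation; the mass exponent then follows as τ(q) = (q-1)D(q). This is a simplified sketch under stated assumptions, not the authors' code: the adjacency-dictionary input, the caller-supplied centres (the published procedure samples them at random), the plain least-squares fit, and the function names are all assumptions.

```python
import math
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path (hop) distances from `source` in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def sandbox_dimension(adj, centers, radii, q):
    """Estimate D(q) for q != 1 from sandbox masses around `centers`."""
    if q == 1:
        raise ValueError("q = 1 needs a separate (logarithmic) treatment")
    n = len(adj)
    dist_maps = [bfs_distances(adj, c) for c in centers]
    xs, ys = [], []
    for r in radii:
        # Mass of each sandbox: number of nodes within distance r of its centre.
        masses = [sum(1 for d in dm.values() if d <= r) for dm in dist_maps]
        avg = sum((m / n) ** (q - 1) for m in masses) / len(masses)
        xs.append(math.log(r))
        ys.append(math.log(avg))
    # Ordinary least-squares slope of ys against xs.
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return (sxy / sxx) / (q - 1)


if __name__ == "__main__":
    # A path graph behaves like a one-dimensional object.
    n = 201
    path = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}
    print(sandbox_dimension(path, centers=[100], radii=range(5, 51), q=2))
```

Because the diameter of the network is constant, fitting against ln r gives the same slope as fitting against ln(r/d), which is why the diameter does not appear in the sketch.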
Affiliation(s)
- Jin-Long Liu
  - Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan 411105, China
- Zu-Guo Yu
  - Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan 411105, China
- Vo Anh
  - School of Mathematical Sciences, Queensland University of Technology, GPO Box 2434, Brisbane Q4001, Australia
|
33
|
Liu JL, Yu ZG, Anh V. Topological properties and fractal analysis of a recurrence network constructed from fractional Brownian motions. Phys Rev E Stat Nonlin Soft Matter Phys 2014; 89:032814. [PMID: 24730906 DOI: 10.1103/physreve.89.032814] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 09/07/2013] [Indexed: 06/03/2023]
Abstract
Many studies have shown that we can gain additional information on time series by investigating their accompanying complex networks. In this work, we investigate the fundamental topological and fractal properties of recurrence networks constructed from fractional Brownian motions (FBMs). First, our results indicate that the constructed recurrence networks have exponential degree distributions; the average degree exponent 〈λ〉 increases first and then decreases with the increase of Hurst index H of the associated FBMs; the relationship between H and 〈λ〉 can be represented by a cubic polynomial function. We next focus on the motif rank distribution of recurrence networks, so that we can better understand networks at the local structure level. We find the interesting superfamily phenomenon, i.e., the recurrence networks with the same motif rank pattern being grouped into two superfamilies. Last, we numerically analyze the fractal and multifractal properties of recurrence networks. We find that the average fractal dimension 〈dB〉 of recurrence networks decreases with the Hurst index H of the associated FBMs, and their dependence approximately satisfies the linear formula 〈dB〉≈2-H, which means that the fractal dimension of the associated recurrence network is close to that of the graph of the FBM. Moreover, our numerical results of multifractal analysis show that the multifractality exists in these recurrence networks, and the multifractality of these networks becomes stronger at first and then weaker when the Hurst index of the associated time series becomes larger from 0.4 to 0.95. In particular, the recurrence network with the Hurst index H=0.5 possesses the strongest multifractality. In addition, the dependence relationships of the average information dimension 〈D(1)〉 and the average correlation dimension 〈D(2)〉 on the Hurst index H can also be fitted well with linear functions. 
Our results strongly suggest that the recurrence network inherits the basic characteristic and the fractal nature of the associated FBM series.
Affiliation(s)
- Jin-Long Liu
  - Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan 411105, China
- Zu-Guo Yu
  - Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan 411105, China and School of Mathematical Sciences, Queensland University of Technology, GPO Box 2434, Brisbane, Q4001, Australia
- Vo Anh
  - School of Mathematical Sciences, Queensland University of Technology, GPO Box 2434, Brisbane, Q4001, Australia
|
34
|
Han GS, Yu ZG, Anh V. A two-stage SVM method to predict membrane protein types by incorporating amino acid classifications and physicochemical properties into a general form of Chou's PseAAC. J Theor Biol 2013; 344:31-9. [PMID: 24316387 DOI: 10.1016/j.jtbi.2013.11.017] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.5] [Received: 08/16/2013] [Revised: 10/16/2013] [Accepted: 11/24/2013] [Indexed: 01/12/2023]
Abstract
Membrane proteins play important roles in many biochemical processes and are also attractive targets of drug discovery for various diseases. The elucidation of membrane protein types provides clues for understanding the structure and function of proteins. Recently we developed a novel system for predicting protein subnuclear localizations. In this paper, we propose a simplified version of our system for predicting membrane protein types directly from primary protein structures, which incorporates amino acid classifications and physicochemical properties into a general form of pseudo-amino acid composition. In this simplified system, we will design a two-stage multi-class support vector machine combined with a two-step optimal feature selection process, which proves very effective in our experiments. The performance of the present method is evaluated on two benchmark datasets consisting of five types of membrane proteins. The overall accuracies of prediction for five types are 93.25% and 96.61% via the jackknife test and independent dataset test, respectively. These results indicate that our method is effective and valuable for predicting membrane protein types. A web server for the proposed method is available at http://www.juemengt.com/jcc/memty_page.php.
Affiliation(s)
- Guo-Sheng Han
  - School of Mathematics and Computational Science, Xiangtan University, Hunan 411105, China
- Zu-Guo Yu
  - School of Mathematics and Computational Science, Xiangtan University, Hunan 411105, China; School of Mathematical Science, Queensland University of Technology, GPO Box 2434, Brisbane Q 4001, Australia.
- Vo Anh
  - School of Mathematical Science, Queensland University of Technology, GPO Box 2434, Brisbane Q 4001, Australia
|
35
|
Predicting DNA binding proteins using support vector machine with hybrid fractal features. J Theor Biol 2013; 343:186-92. [PMID: 24189096 DOI: 10.1016/j.jtbi.2013.10.009] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Received: 05/14/2013] [Revised: 08/12/2013] [Accepted: 10/17/2013] [Indexed: 11/20/2022]
Abstract
DNA-binding proteins play a vitally important role in many biological processes. Prediction of DNA-binding proteins from amino acid sequence is a significant but not fairly resolved scientific problem. Chaos game representation (CGR) investigates the patterns hidden in protein sequences, and visually reveals previously unknown structure. Fractal dimensions (FD) are good tools to measure sizes of complex, highly irregular geometric objects. In order to extract the intrinsic correlation with DNA-binding property from protein sequences, CGR algorithm, fractal dimension and amino acid composition are applied to formulate the numerical features of protein samples in this paper. Seven groups of features are extracted, which can be computed directly from the primary sequence, and each group is evaluated by the 10-fold cross-validation test and Jackknife test. Comparing the results of numerical experiments, the group of amino acid composition and fractal dimension (21-dimension vector) gets the best result, the average accuracy is 81.82% and average Matthew's correlation coefficient (MCC) is 0.6017. This resulting predictor is also compared with existing method DNA-Prot and shows better performances.
|
36
|
Schwende I, Pham TD. Pattern recognition and probabilistic measures in alignment-free sequence analysis. Brief Bioinform 2013; 15:354-68. [PMID: 24096012 DOI: 10.1093/bib/bbt070] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Indexed: 12/30/2022] Open
Abstract
With the massive production of genomic and proteomic data, the number of available biological sequences in databases has reached a level that is not feasible anymore for exact alignments even when just a fraction of all sequences is used. To overcome this inevitable time complexity, ultrafast alignment-free methods are studied. Within the past two decades, a broad variety of nonalignment methods have been proposed including dissimilarity measures on classical representations of sequences like k-words or Markov models. Furthermore, articles were published that describe distance measures on alternative representations such as compression complexity, spectral time series or chaos game representation. However, alignments are still the standard method for real world applications in biological sequence analysis, and the time efficient alignment-free approaches are usually applied in cases when the accustomed algorithms turn out to fail or be too inconvenient.
Affiliation(s)
- Isabel Schwende
  - PhD, Aizu Research Cluster for Medical Informatics and Engineering (ARC-Medical), Research Center for Advanced Information Science and Technology (CAIST), The University of Aizu, Aizuwakamatsu, Fukushima 965-8580, Japan.
|
37
|
Abo-Elkhier M. Similarity/dissimilarity analysis of protein sequences using the spatial median as a descriptor. 2012. [DOI: 10.4236/jbpc.2012.32016] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 11/20/2022]
|
38
|
Moreno PA, Vélez PE, Martínez E, Garreta LE, Díaz N, Amador S, Tischer I, Gutiérrez JM, Naik AK, Tobar F, García F. The human genome: a multifractal analysis. BMC Genomics 2011; 12:506. [PMID: 21999602 PMCID: PMC3277318 DOI: 10.1186/1471-2164-12-506] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.0] [Received: 04/13/2011] [Accepted: 10/14/2011] [Indexed: 01/26/2023] Open
Abstract
BACKGROUND Several studies have shown that genomes can be studied via a multifractal formalism. Recently, we used a multifractal approach to study the genetic information content of the Caenorhabditis elegans genome. Here we investigate the possibility that the human genome shows a similar behavior to that observed in the nematode. RESULTS We report here multifractality in the human genome sequence. This behavior correlates strongly on the presence of Alu elements and to a lesser extent on CpG islands and (G+C) content. In contrast, no or low relationship was found for LINE, MIR, MER, LTRs elements and DNA regions poor in genetic information. Gene function, cluster of orthologous genes, metabolic pathways, and exons tended to increase their frequencies with ranges of multifractality and large gene families were located in genomic regions with varied multifractality. Additionally, a multifractal map and classification for human chromosomes are proposed. CONCLUSIONS Based on these findings, we propose a descriptive non-linear model for the structure of the human genome, with some biological implications. This model reveals 1) a multifractal regionalization where many regions coexist that are far from equilibrium and 2) this non-linear organization has significant molecular and medical genetic implications for understanding the role of Alu elements in genome stability and structure of the human genome. Given the role of Alu sequences in gene regulation, genetic diseases, human genetic diversity, adaptation and phylogenetic analyses, these quantifications are especially useful.
Affiliation(s)
- Pedro A Moreno
  - Escuela de Ingeniería de Sistemas y Computación, Universidad del Valle, Santiago de Cali, Colombia
- Patricia E Vélez
  - Profesora del Departamento de Biología, FACNED, Universidad del Cauca, Popayán, Colombia
  - Escuela de Ciencias Básicas. Facultad de Salud, Universidad del Valle, Santiago de Cali, Colombia
- Ember Martínez
  - Departamento de Sistemas, Universidad del Cauca, Popayán, Colombia
- Luis E Garreta
  - Escuela de Ingeniería de Sistemas y Computación, Universidad del Valle, Santiago de Cali, Colombia
- Néstor Díaz
  - Departamento de Sistemas, Universidad del Cauca, Popayán, Colombia
- Siler Amador
  - Departamento de Sistemas, Universidad del Cauca, Popayán, Colombia
- Irene Tischer
  - Escuela de Ingeniería de Sistemas y Computación, Universidad del Valle, Santiago de Cali, Colombia
- José M Gutiérrez
  - Instituto de Física de Cantabria, Universidad de Cantabria-CSIC, Santander, España
- Fabián Tobar
  - Escuela de Ciencias Básicas. Facultad de Salud, Universidad del Valle, Santiago de Cali, Colombia
- Felipe García
  - Escuela de Ciencias Básicas. Facultad de Salud, Universidad del Valle, Santiago de Cali, Colombia
|
39
|
Lu JL, Hu XH, Hu DG. A new hybrid fractal algorithm for predicting thermophilic nucleotide sequences. J Theor Biol 2011; 293:74-81. [PMID: 22001320 DOI: 10.1016/j.jtbi.2011.09.028] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 07/10/2011] [Revised: 09/23/2011] [Accepted: 09/26/2011] [Indexed: 01/20/2023]
Abstract
Knowledge of thermophilic mechanisms about some organisms whose optimum growth temperature (OGT) ranges from 50 to 80 degree plays a major role in helping design stable proteins. How to predict a DNA sequence to be thermophilic is a long but not fairly resolved problem. Chaos game representation (CGR) can investigate the patterns hiding in DNA sequences, and can visually reveal previously unknown structure. Fractal dimensions are good tools to measure sizes of complex, highly irregular geometric objects. In this paper, we convert every DNA sequence into a high dimensional vector by CGR algorithm and fractal dimension, and then predict the DNA sequence thermostability by these fractal features and support vector machine (SVM). We have conducted experiments on three groups: 17-dimensional vector, 65-dimensional vector, and 257-dimensional vector. Each group is evaluated by the 10-fold cross-validation test. For the results, the group of 257-dimensional vector gets the best results: the average accuracy is 0.9456 and average MCC is 0.8878. The results are also compared with the previous work with single CGR features. The comparison shows the high effectiveness of the new hybrid fractal algorithm.
Affiliation(s)
- Jin-Long Lu
  - College of Science, Huazhong Agricultural University, Wuhan, PR China
|
40
|
Randić M, Zupan J, Balaban AT, Vikić-Topić D, Plavšić D. Graphical Representation of Proteins. Chem Rev 2010; 111:790-862. [PMID: 20939561 DOI: 10.1021/cr800198j] [Citation(s) in RCA: 93] [Impact Index Per Article: 6.6] [Indexed: 11/30/2022]
Affiliation(s)
- Milan Randić
- National Institute of Chemistry, P.O. Box 3430, 1001 Ljubljana, Slovenia; NMR Center, Ruđer Bošković Institute, P.O. Box 180, HR-10002 Zagreb, Croatia; and Texas A&M University at Galveston, Galveston, Texas 77553
- Jure Zupan
- National Institute of Chemistry, P.O. Box 3430, 1001 Ljubljana, Slovenia; NMR Center, Ruđer Bošković Institute, P.O. Box 180, HR-10002 Zagreb, Croatia; and Texas A&M University at Galveston, Galveston, Texas 77553
- Alexandru T. Balaban
- National Institute of Chemistry, P.O. Box 3430, 1001 Ljubljana, Slovenia; NMR Center, Ruđer Bošković Institute, P.O. Box 180, HR-10002 Zagreb, Croatia; and Texas A&M University at Galveston, Galveston, Texas 77553
- Dražen Vikić-Topić
- National Institute of Chemistry, P.O. Box 3430, 1001 Ljubljana, Slovenia; NMR Center, Ruđer Bošković Institute, P.O. Box 180, HR-10002 Zagreb, Croatia; and Texas A&M University at Galveston, Galveston, Texas 77553
- Dejan Plavšić
- National Institute of Chemistry, P.O. Box 3430, 1001 Ljubljana, Slovenia; NMR Center, Ruđer Bošković Institute, P.O. Box 180, HR-10002 Zagreb, Croatia; and Texas A&M University at Galveston, Galveston, Texas 77553
|
41
|
Stan C, Cristescu CP, Scarlat EI. Similarity analysis for DNA sequences based on chaos game representation. Case study: the albumin. J Theor Biol 2010; 267:513-8. [PMID: 20869369 DOI: 10.1016/j.jtbi.2010.09.027] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.4] [Received: 06/11/2010] [Revised: 09/16/2010] [Accepted: 09/17/2010] [Indexed: 11/16/2022]
Abstract
Using chaos game representation, we introduce a novel and straightforward method for identifying similarities and dissimilarities between DNA sequences of the same type from different organisms. A matrix is associated with each CGR pattern, and the similarities result from comparing the matrices of the sequences of interest. Three different methods of analyzing the resulting difference matrix are considered: a 3-dimensional representation giving both local and global information, a numerical characterization based on an n-letter word similarity measure, and a statistical evaluation. The method is illustrated by applying it to albumin nucleotide sequences from eight mammalian species, taking human albumin as the reference.
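The matrix comparison described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the corner layout, the normalisation by word count, and the summary score (sum of absolute differences) are choices made here, and the names `cgr_matrix`, `difference_matrix` and `total_difference` are hypothetical; the abstract does not give the paper's exact n-letter word similarity measure.

```python
# Assumed corner convention: (column bit, row bit) for each base.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def cgr_matrix(seq, k):
    """Relative frequencies of k-letter words, arranged on the 2^k x 2^k CGR grid."""
    n = 2 ** k
    counts = [[0] * n for _ in range(n)]
    seq = seq.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if any(base not in CORNERS for base in word):
            continue  # skip words containing ambiguous symbols
        col = row = 0
        # In CGR the last base of a word selects the coarsest quadrant,
        # so it contributes the most significant bit of the cell index.
        for base in reversed(word):
            cx, cy = CORNERS[base]
            col = col * 2 + cx
            row = row * 2 + cy
        counts[row][col] += 1
        total += 1
    if total == 0:
        return [[0.0] * n for _ in range(n)]
    return [[c / total for c in r] for r in counts]


def difference_matrix(m1, m2):
    """Element-wise difference between two CGR matrices of the same size."""
    return [[a - b for a, b in zip(r1, r2)] for r1, r2 in zip(m1, m2)]


def total_difference(m1, m2):
    """One possible scalar summary: the sum of absolute cell differences."""
    return sum(abs(v) for row in difference_matrix(m1, m2) for v in row)


if __name__ == "__main__":
    reference = cgr_matrix("ATGAAGTGGGTAACCTTTATTTCC", 2)
    other = cgr_matrix("ATGAAGTGGGTGACTTTTATTTCC", 2)
    print(total_difference(reference, other))
```

With normalised matrices the summary score lies between 0 (identical word usage) and 2 (no shared words), which makes scores comparable across sequences of different lengths.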
Affiliation(s)
- Cristina Stan
- Department of Physics I, Faculty of Applied Sciences, Politehnica University of Bucharest, 313 Splaiul Independentei, RO-060042, Bucharest, Romania.
|
42
|
He P. A new graphical representation of similarity/dissimilarity studies of protein sequences. SAR and QSAR in Environmental Research 2010; 21:571-580. [PMID: 20818589 DOI: 10.1080/1062936x.2010.510481] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 05/29/2023]
Abstract
Based on chaos game representation, a two-dimensional graphical representation of protein sequences is described in which the 20 amino acids are arranged in a cyclic order derived from the PAM250 substitution matrix. A numerical characterisation has been developed as a descriptor to compare protein sequences. Finally, an example is given in which the NADH dehydrogenase subunit 5 (ND5) protein sequences of nine species are compared.
Affiliation(s)
- P He
- College of Science, Zhejiang Sci-Tech University, Hangzhou, China.
|
43
|
Maaty MIAE, Abo-Elkhier MM, Elwahaab MAA. Representation of protein sequences on latitude-like circles and longitude-like semi-circles. Chem Phys Lett 2010. [DOI: 10.1016/j.cplett.2010.05.039] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/19/2022]
|
44
|
Cork DJ, Lembark S, Tovanabutra S, Robb ML, Kim JH. W-curve alignments for HIV-1 genomic comparisons. PLoS One 2010; 5:e10829. [PMID: 20532248 PMCID: PMC2879897 DOI: 10.1371/journal.pone.0010829] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 01/14/2010] [Accepted: 04/16/2010] [Indexed: 11/19/2022] Open
Abstract
BACKGROUND: The W-curve was originally developed as a graphical visualization technique for viewing DNA and RNA sequences. Its ability to render features of DNA also makes it suitable for computational studies. Its main advantage in this area is that it uses a single-pass algorithm for comparing the sequences. Avoiding recursion during sequence alignments offers advantages for speed and in-process resources. The graphical technique also allows multiple models of comparison to be used, depending on the nucleotide patterns embedded in similar whole genomic sequences. The W-curve approach allows us to compare large numbers of samples quickly. METHOD: We are currently tuning the algorithm to accommodate quirks specific to HIV-1 genomic sequences so that it can be used to aid diagnostic and vaccine efforts. Tracking the molecular evolution of the virus has been greatly hampered by gap-associated problems predominantly embedded within the envelope gene of the virus. Gaps and hypermutation of the virus slow conventional string-based alignments of the whole genome. This paper describes the W-curve algorithm itself and how we have adapted it for comparison of similar HIV-1 genomes. A tree-building method is developed with the W-curve that uses a novel cylindrical coordinate distance method and a gap analysis method. HIV-1 C2-V5 env sequence regions from a mother/infant cohort study are used in the comparison. FINDINGS: The output distance matrix and neighbor results produced by the W-curve are functionally equivalent to those from Clustal for C2-V5 sequences in the mother/infant pairs infected with CRF01_AE. CONCLUSIONS: Significant potential exists for using this method in place of conventional string-based alignment of HIV-1 genomes, such as Clustal X. With W-curve heuristic alignment, it may be possible to obtain clinically useful results in a short time, short enough to affect clinical choices for acute treatment.
A description of the W-curve generation process, including a comparison technique of aligning extremes of the curves to effectively phase-shift them past the HIV-1 gap problem, is presented. Besides yielding similar neighbor-joining phenogram topologies, most Mother and Infant C2-V5 sequences in the cohort pairs geometrically map closest to each other, indicating that W-curve heuristics overcame any gap problem.
Affiliation(s)
- Douglas J Cork
- United States Military HIV Research Program, Rockville, Maryland, USA.
|
45
|
Vélez PE, Garreta LE, Martínez E, Díaz N, Amador S, Tischer I, Gutiérrez JM, Moreno PA. The Caenorhabditis elegans genome: a multifractal analysis. Genetics and Molecular Research 2010; 9:949-65. [PMID: 20506082 DOI: 10.4238/vol9-2gmr756] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 11/03/2022]
Abstract
The Caenorhabditis elegans genome has several regular and irregular characteristics in its nucleotide composition; these are observed within and between chromosomes. To study these particularities, we carried out a multifractal analysis, which requires a large number of exponents to characterize scaling properties. We looked for a relationship between the genetic information content of the chromosomes and multifractal parameters and found less multifractality compared to the human genome. Differences in multifractality among chromosomes and in regions of chromosomes, and two group averages of chromosome regions were observed. All these differences were mainly dependent on differences in the contents of repetitive DNA. Based on these properties, we propose a nonlinear model for the structure of the C. elegans genome, with some biological implications. These results suggest that examining differences in multifractality is a viable approach for measuring local variations of genomic information contents along chromosomes. This approach could be extended to other genomes in order to characterize structural and functional regions of chromosomes.
Affiliation(s)
- P E Vélez
- Departamento de Biología, Universidad del Cauca, Popayán, Colombia
|
46
|
Proper distance metrics for phylogenetic analysis using complete genomes without sequence alignment. Int J Mol Sci 2010; 11:1141-54. [PMID: 20480005 PMCID: PMC2869232 DOI: 10.3390/ijms11031141] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.5] [Received: 02/04/2010] [Accepted: 03/03/2010] [Indexed: 11/29/2022] Open
Abstract
A shortcoming of most alignment-free correlation distance methods based on composition vectors, developed for phylogenetic analysis using complete genomes, is that the “distances” are not proper distance metrics in the strict mathematical sense. In this paper we propose two new correlation-related distance metrics to replace the old one in our dynamical language approach. Four genome datasets are employed to evaluate the effects of this replacement from a biological point of view. We find that the two proper distance metrics yield trees whose topologies are the same as or similar to those obtained with the old “distance”, and that these trees agree with the tree of life based on 16S rRNA in a majority of the basic branches. Hence the two proper correlation-related distance metrics proposed here improve our dynamical language approach for phylogenetic analysis.
|
47
|
Yang JY, Peng ZL, Chen X. Prediction of protein structural classes for low-homology sequences based on predicted secondary structure. BMC Bioinformatics 2010; 11 Suppl 1:S9. [PMID: 20122246 PMCID: PMC3009544 DOI: 10.1186/1471-2105-11-s1-s9] [Citation(s) in RCA: 67] [Impact Index Per Article: 4.8] [Indexed: 11/13/2022] Open
Abstract
Background: Prediction of protein structural classes (α, β, α + β and α/β) from amino acid sequences is of great importance, as it benefits the study of protein function, regulation and interactions. Many methods have been developed for high-homology protein sequences, and their prediction accuracies can reach up to 90%. However, for low-homology sequences whose average pairwise sequence identity lies between 20% and 40%, they perform relatively poorly, often yielding prediction accuracies below 60%. Results: We propose a new method to predict protein structural classes on the basis of features extracted from the predicted secondary structures of proteins rather than directly from their amino acid sequences. It first uses PSIPRED to predict the secondary structure of each protein sequence. Then, the chaos game representation is employed to represent the predicted secondary structure as two time series, from which we generate a comprehensive set of 24 features using recurrence quantification analysis, K-string based information entropy and segment-based analysis. The resulting feature vectors are finally fed into a simple yet powerful Fisher's discriminant algorithm for the prediction of protein structural classes. We tested the proposed method on three low-homology benchmark datasets and achieved overall prediction accuracies of 82.9%, 83.1% and 81.3%, respectively. Comparisons with ten existing methods showed that our method consistently performs better on all the tested datasets, with overall accuracy improvements ranging from 2.3% to 27.5%. A web server that implements the proposed method is freely available at http://www1.spms.ntu.edu.sg/~chenxin/RKS_PPSC/.
Conclusion: The high prediction accuracy achieved by our proposed method is attributed to the design of a comprehensive feature set on the predicted secondary structure sequences, which is capable of characterizing the sequence order information, local interactions of the secondary structural elements, and spatial arrangements of α helices and β strands. Thus, it is a valuable method for predicting protein structural classes, particularly for low-homology amino acid sequences.
Affiliation(s)
- Jian-Yi Yang
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, 21 Nanyang Link, Singapore.
|
48
|
Yang JY, Peng ZL, Yu ZG, Zhang RJ, Anh V, Wang D. Prediction of protein structural classes by recurrence quantification analysis based on chaos game representation. J Theor Biol 2009; 257:618-26. [DOI: 10.1016/j.jtbi.2008.12.027] [Citation(s) in RCA: 92] [Impact Index Per Article: 6.1] [Received: 08/21/2008] [Revised: 11/07/2008] [Accepted: 12/19/2008] [Indexed: 11/17/2022]
|
49
|
Almeida JS, Vinga S. Biological sequences as pictures: a generic two dimensional solution for iterated maps. BMC Bioinformatics 2009; 10:100. [PMID: 19335894 PMCID: PMC2678093 DOI: 10.1186/1471-2105-10-100] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Received: 05/28/2008] [Accepted: 03/31/2009] [Indexed: 11/24/2022] Open
Abstract
BACKGROUND: Representing symbolic sequences graphically using iterated maps has enjoyed an enduring popularity since it was first proposed by Jeffrey (1990) as chaos game representation (CGR). The usefulness of this representation goes beyond the convenience of a scale-independent representation. It provides a variable memory length representation of transition. This includes the representation of succession with non-integer order, which comes with the promise of generalizing Markovian formalisms. The original proposal targeted genomic sequences only, but since then several generalizations have been proposed, many specifically designed to handle protein data. RESULTS: The challenge of a general solution is that of deriving a bijective transformation of symbolic sequences into bi-dimensional planes. More specifically, it requires the regular fractal nesting of polygons. A first attempt at a general solution was proposed by Fiser (1994), using non-overlapping circles that contain the polygons. This was used as a starting point to identify a more efficient solution in which the encapsulating circles can overlap without the same happening for the sequence maps, which are circumscribed to fractal polygon domains. CONCLUSION: We identified the optimal inscribed packing solution for iterated maps of any biological sequence, indeed of any symbolic sequence. The new solution maintains the prized bijective mapping property and includes the Sierpinski triangle and the CGR square as particular solutions of the more encompassing formulation.
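The bijective mapping property mentioned in this abstract can be demonstrated for the ordinary four-corner CGR square, which the abstract names as one particular case. The sketch below does not implement the generalized polygon packing of the paper; the corner layout and the function names `encode` and `decode` are assumptions for illustration. Exact rational arithmetic is used so that the round trip does not depend on floating-point precision.

```python
from fractions import Fraction

# Assumed corner convention for the CGR square.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}
BASES = {corner: base for base, corner in CORNERS.items()}
HALF = Fraction(1, 2)


def encode(seq):
    """Map a whole sequence to the single CGR point reached after its last base."""
    x, y = HALF, HALF
    for base in seq:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
    return x, y


def decode(point, length):
    """Recover the sequence from its final CGR point by inverting the map."""
    x, y = point
    recovered = []
    for _ in range(length):
        # The quadrant that holds the point identifies the most recent base.
        cx = 1 if x > HALF else 0
        cy = 1 if y > HALF else 0
        recovered.append(BASES[(cx, cy)])
        # Undo one halving step to obtain the previous point.
        x, y = 2 * x - cx, 2 * y - cy
    return "".join(reversed(recovered))


if __name__ == "__main__":
    point = encode("GATTACA")
    print(point, decode(point, len("GATTACA")))
```

The sequence length is passed to `decode` explicitly here; with the starting point fixed at the centre, decoding could instead stop when the point returns to (1/2, 1/2).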
Affiliation(s)
- Jonas S Almeida
- Dept Bioinformatics and Computational Biology, University of Texas MD Anderson Cancer Center, Houston, Texas, USA
- Susana Vinga
- Instituto de Engenharia de Sistemas e Computadores: Investigação e Desenvolvimento (INESC-ID), R. Alves Redol 9, 1000-029 Lisboa, Portugal
- Dept Bioestatística e Informática, Faculdade de Ciências Médicas – Universidade Nova de Lisboa (FCM/UNL), Campo Mártires da Pátria 130, 1169-056 Lisboa, Portugal
|
50
|
Dea-Ayuela MA, Pérez-Castillo Y, Meneses-Marcel A, Ubeira FM, Bolas-Fernández F, Chou KC, González-Díaz H. HP-Lattice QSAR for dynein proteins: experimental proteomics (2D-electrophoresis, mass spectrometry) and theoretic study of a Leishmania infantum sequence. Bioorg Med Chem 2008; 16:7770-6. [PMID: 18662882 DOI: 10.1016/j.bmc.2008.07.023] [Citation(s) in RCA: 48] [Impact Index Per Article: 3.0] [Received: 01/25/2008] [Revised: 06/23/2008] [Accepted: 07/02/2008] [Indexed: 10/21/2022]
Abstract
The toxicity and inefficacy of current organic drugs against leishmaniasis justify research projects to find new molecular targets in Leishmania species, including Leishmania infantum (L. infantum) and Leishmania major (L. major), both important pathogens. In this sense, quantitative structure-activity relationship (QSAR) methods, which are very useful in bioorganic and medicinal chemistry for discovering small-sized drugs, may help to identify not only new drugs but also new drug targets if we apply them to proteins. Dyneins are important proteins of these parasites, governing fundamental processes such as cilia and flagella motion, nuclear migration, organization of the mitotic spindle, and chromosome separation during mitosis. However, despite the interest in them as potential drug targets, so far there has been no report on dyneins using QSAR techniques. To the best of our knowledge, we report here the first QSAR for dynein proteins. We used as input the spectral moments of a Markov matrix associated with the HP-Lattice Network of the protein sequence. The data contain 411 protein sequences of different species, selected by ClustalX, to develop a QSAR that correctly discriminates on average between 92.75% and 92.51% of dyneins and other proteins in four different training and cross-validation datasets. We also report a combined experimental and theoretical study of a new dynein sequence in order to illustrate, with a practical example, the utility of the model in searching for potential drug targets. First, we carried out a 2D-electrophoresis analysis of L. infantum biological samples. Next, we excised from the 2D-E gels one spot of interest belonging to an unknown protein or protein fragment in the region M<20,200 and pI<4. We used the MASCOT search engine to find proteins in the L. major database with the highest similarity score to the MS of the protein isolated from L. infantum.
We used the QSAR model to predict the new sequence as a dynein with a probability of 99.99% without relying upon alignment. In order to confirm the previous function annotation, we predicted the sequences as dynein with BLAST and the omniBLAST tools (96% alignment similarity to dyneins of other species). Using this combined strategy, we have successfully identified an L. infantum protein containing a dynein heavy chain and illustrated the potential use of the QSAR model as a complement to alignment tools.
|