1
|
Ghosh S, Pal J, Cattani C, Maji B, Bhattacharya DK. Protein sequence comparison based on representation on a finite dimensional unit hypercube. J Biomol Struct Dyn 2024; 42:6425-6439. [PMID: 37837426 DOI: 10.1080/07391102.2023.2268719] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 05/16/2023] [Accepted: 07/01/2023] [Indexed: 10/16/2023]
Abstract
Numerous techniques are used to compare protein sequences based on the values of the physicochemical properties of amino acids. In this work, a non-binary representation of protein sequences, based on a single physical/chemical property value, is obtained on a 20 × 20-dimensional unit hypercube. The represented vector expressed in matrix form is taken as the descriptor. The generalized NTV metric, which is an extension of the NTV metric used for polynucleotide space, is taken as the distance measure. Based on this distance measure, a distance matrix is obtained for protein sequence comparison. Using this distance matrix, phylogenetic trees are drawn with the Molecular Evolutionary Genetics Analysis 11 (MEGA11) software, applying the neighbor-joining method. The data sets used in this work are 9-ND4, 9-ND5, 9-ND6, 24 TF-LF proteins, 27 different viruses and 127 proteins from the protein kinase C (PKC) family. Two sets of phylogenetic trees are obtained, one based on the property value of polarity and the other based on the property value of molecular weight. They are found to be exactly the same. Similar results also hold for representations based on other single property values. The present trees are individually tested for efficiency based on the criteria of rationalized perception and computational time. The results of the present method are compared with those obtained earlier by other methods on the same protein sequences using the assessment criteria of Symmetric distance (SD), Correlation coefficient, and Rationalized perception. In all cases, the present results are found to be better than the results of the other methods under comparison. Communicated by Ramaswamy H. Sarma.
Affiliation(s)
- Soumen Ghosh
  - Electronics & Communication Engineering, National Institute of Technology, Durgapur, West Bengal, India
  - Information Technology, Narula Institute of Technology, Kolkata, West Bengal, India
- Jayanta Pal
  - Computer Science & Engineering, Narula Institute of Technology, Kolkata, West Bengal, India
- Carlo Cattani
  - DEIM, University of Tuscia, Largo dell'Universita, Viterbo, Italy
- Bansibadan Maji
  - Electronics & Communication Engineering, National Institute of Technology, Durgapur, West Bengal, India
|
2
|
Guan M, Sun N, Yau SST. Geometric analysis of SARS-CoV-2 variants. Gene 2024; 909:148291. [PMID: 38417688 DOI: 10.1016/j.gene.2024.148291] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2023] [Revised: 01/23/2024] [Accepted: 02/14/2024] [Indexed: 03/01/2024]
Abstract
SARS-CoV-2, which causes a severe respiratory disease, has been prevalent around the world since its first discovery in 2019. As a single-stranded RNA virus, it has a high mutation rate, which makes its variants manifold and gives some of them high pathogenicity, such as the Omicron variant, currently the most prevalent. Research on the relationships among these SARS-CoV-2 variants, especially on their differences, is a hot issue. In this study, we constructed a geometric space to represent all SARS-CoV-2 sequences of different variants. An alignment-free method, the natural vector method, was used to establish the genome space. The genome space of SARS-CoV-2 was constructed based on the 24-dimensional natural vector, and the appropriate metric was determined by performing phylogenetic analyses. Phylogenetic trees of different lineages constructed under the selected natural vector and metric coincided with the lineage naming standards, meaning that lineages with the same alphabetical prefix cluster together in the phylogenetic trees. Furthermore, the relationships between the various GISAID clades as depicted by the natural graph primarily matched the description provided in the GISAID clade naming. The validity of our geometric space was demonstrated by these phylogenetic analysis results. In this research, we therefore constructed a geometric space for the genomes of the novel coronavirus SARS-CoV-2, which allows us to compare the different variants. Our geometric space is valuable for resolving issues inside the virus.
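The abstract above does not list the components of its 24-dimensional natural vector. As a rough illustration only, the sketch below computes the commonly described 12-dimensional form for a nucleotide sequence: for each of A, C, G, T, the count, the mean position, and the normalized second central moment of the positions. The function name `natural_vector`, the 1-based positions, and the normalization by `n_k * n` are assumptions taken from the usual textbook definition, not details confirmed by this entry; higher-order moments would be one way to reach more dimensions.

```python
def natural_vector(seq, alphabet="ACGT"):
    """Return [counts..., mean positions..., second moments...] for seq.

    Hypothetical helper: positions are 1-based and the second central
    moment is divided by (n_k * n), where n is the sequence length and
    n_k the count of letter k. Letters that never occur contribute zeros.
    """
    seq = seq.upper()
    n = len(seq)
    positions = {ch: [] for ch in alphabet}
    for i, ch in enumerate(seq, start=1):
        if ch in positions:          # ignore ambiguous symbols such as N
            positions[ch].append(i)

    counts, means, moments = [], [], []
    for ch in alphabet:
        pos = positions[ch]
        n_k = len(pos)
        if n_k == 0:
            counts.append(0.0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu = sum(pos) / n_k
        d2 = sum((p - mu) ** 2 for p in pos) / (n_k * n)
        counts.append(float(n_k))
        means.append(mu)
        moments.append(d2)
    return counts + means + moments


if __name__ == "__main__":
    print(natural_vector("ACGTTGCA"))
```

Two sequences can then be compared by any vector distance on these outputs; which metric suits SARS-CoV-2 lineages is exactly what the study says it determined empirically.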
Affiliation(s)
- Mengcen Guan
  - Department of Mathematical Sciences, Tsinghua University, Beijing, China.
- Nan Sun
  - Department of Mathematical Sciences, Tsinghua University, Beijing, China.
- Stephen S-T Yau
  - Department of Mathematical Sciences, Tsinghua University, Beijing, China; Yanqi Lake Beijing Institute of Mathematical Sciences and Applications, Beijing, China.
|
3
|
Cahuantzi R, Lythgoe KA, Hall I, Pellis L, House T. Unsupervised identification of significant lineages of SARS-CoV-2 through scalable machine learning methods. Proc Natl Acad Sci U S A 2024; 121:e2317284121. [PMID: 38478692 PMCID: PMC10962941 DOI: 10.1073/pnas.2317284121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2023] [Accepted: 02/05/2024] [Indexed: 03/21/2024] Open
Abstract
Since its emergence in late 2019, SARS-CoV-2 has diversified into a large number of lineages and caused multiple waves of infection globally. Novel lineages have the potential to spread rapidly and internationally if they have higher intrinsic transmissibility and/or can evade host immune responses, as has been seen with the Alpha, Delta, and Omicron variants of concern. They can also cause increased mortality and morbidity if they have increased virulence, as was seen for Alpha and Delta. Phylogenetic methods provide the "gold standard" for representing the global diversity of SARS-CoV-2 and to identify newly emerging lineages. However, these methods are computationally expensive, struggle when datasets get too large, and require manual curation to designate new lineages. These challenges provide a motivation to develop complementary methods that can incorporate all of the genetic data available without down-sampling to extract meaningful information rapidly and with minimal curation. In this paper, we demonstrate the utility of using algorithmic approaches based on word-statistics to represent whole sequences, bringing speed, scalability, and interpretability to the construction of genetic topologies. While not serving as a substitute for current phylogenetic analyses, the proposed methods can be used as a complementary, and fully automatable, approach to identify and confirm new emerging variants.
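The abstract above speaks of "word-statistics" without naming the statistic used. One common choice, assumed here purely for illustration, is the normalized k-mer frequency vector, which maps a sequence of any length to a fixed-length numeric vector without alignment. The function name `kmer_frequencies` and the default `k=3` are hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=3, alphabet="ACGT"):
    """Return the frequency of every k-letter word over `alphabet` in seq.

    Words containing symbols outside the alphabet (for example N) are
    skipped, and frequencies are normalized over the remaining words.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]


if __name__ == "__main__":
    vec = kmer_frequencies("ACGTACGTTGCA", k=2)
    print(len(vec), sum(vec))
```

Because each sequence is processed independently and in a single pass, this kind of representation scales to very large collections, which is the property the abstract emphasizes.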
Affiliation(s)
- Roberto Cahuantzi
  - Department of Mathematics, The University of Manchester, Manchester M13 9PL, United Kingdom
  - United Kingdom Health Security Agency, University of Oxford, Oxford OX3 7LF, United Kingdom
- Katrina A. Lythgoe
  - Department of Biology, University of Oxford, Oxford OX1 3SZ, United Kingdom
  - Big Data Institute, University of Oxford, Oxford OX3 7LF, United Kingdom
  - Pandemic Sciences Institute, University of Oxford, Oxford OX3 7LF, United Kingdom
- Ian Hall
  - Department of Mathematics, The University of Manchester, Manchester M13 9PL, United Kingdom
- Lorenzo Pellis
  - Department of Mathematics, The University of Manchester, Manchester M13 9PL, United Kingdom
- Thomas House
  - Department of Mathematics, The University of Manchester, Manchester M13 9PL, United Kingdom
|
4
|
Sykes J, Holland BR, Charleston MA. A review of visualisations of protein fold networks and their relationship with sequence and function. Biol Rev Camb Philos Soc 2023; 98:243-262. [PMID: 36210328 PMCID: PMC10092621 DOI: 10.1111/brv.12905] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 12/30/2021] [Revised: 09/08/2022] [Accepted: 09/09/2022] [Indexed: 01/12/2023]
Abstract
Proteins form arguably the most significant link between genotype and phenotype. Understanding the relationship between protein sequence and structure, and applying this knowledge to predict function, is difficult. One way to investigate these relationships is by considering the space of protein folds and how one might move from fold to fold through similarity, or potential evolutionary relationships. The many individual characterisations of fold space presented in the literature can tell us a lot about how well the current Protein Data Bank represents protein fold space, how convergence and divergence may affect protein evolution, how proteins affect the whole of which they are part, and how proteins themselves function. A synthesis of these different approaches and viewpoints seems the most likely way to further our knowledge of protein structure evolution and thus, facilitate improved protein structure design and prediction.
Affiliation(s)
- Janan Sykes
  - School of Natural Sciences, University of Tasmania, Private Bag 37, Hobart, Tasmania, 7001, Australia
- Barbara R Holland
  - School of Natural Sciences, University of Tasmania, Private Bag 37, Hobart, Tasmania, 7001, Australia
- Michael A Charleston
  - School of Natural Sciences, University of Tasmania, Private Bag 37, Hobart, Tasmania, 7001, Australia
|
5
|
Apache Spark-based scalable feature extraction approaches for protein sequence and their clustering performance analysis. International Journal of Data Science and Analytics 2023. [DOI: 10.1007/s41060-022-00381-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/04/2023]
|
6
|
Kumar N, Acharya V. Machine intelligence-driven framework for optimized hit selection in virtual screening. J Cheminform 2022; 14:48. [PMID: 35869511 PMCID: PMC9306080 DOI: 10.1186/s13321-022-00630-7] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 11/19/2021] [Accepted: 07/05/2022] [Indexed: 11/10/2022] Open
Abstract
Virtual screening (VS) aids in prioritizing unknown bio-interactions between compounds and protein targets for empirical drug discovery. In a standard VS exercise, roughly 10% of top-ranked molecules exhibit activity when examined in biochemical assays, which accounts for many false positive hits, making it an arduous task. Attempts for conquering false-hit rates were developed through either ligand-based or structure-based VS separately; however, nonetheless performed remarkably well. Here, we present an advanced VS framework, the automated hit identification and optimization tool (A-HIOT), which comprises a chemical space-driven stacked ensemble for identification and protein space-driven deep learning architectures for optimization of an array of specific hits for fixed protein receptors. A-HIOT implements numerous open-source algorithms intended to integrate chemical and protein space, leading to high-quality predictions. The optimized hits are the selective molecules that we retrieve after extreme refinement through the chemical space and protein space modules of A-HIOT. Using CXC chemokine receptor 4 (CXCR4), we demonstrated the superior performance of A-HIOT for hit molecule identification and optimization, with tenfold cross-validation accuracies of 94.8% and 81.9%, respectively. In comparison with other machine learning algorithms, A-HIOT achieved higher accuracies of 96.2% for hit identification and 89.9% for hit optimization on independent benchmark datasets for CXCR4, and 86.8% for hit identification and 90.2% for hit optimization on an independent test dataset for the androgen receptor (AR), which shows its generalizability and robustness. In conclusion, the advantageous features embedded in A-HIOT make it a reliable approach for bridging the long-standing gap between ligand-based and structure-based VS in finding optimized hits for the desired receptor. The complete resource (framework) code is available at https://gitlab.com/neeraj-24/A-HIOT.
|
7
|
Sun N, Pei S, He L, Yin C, He RL, Yau SST. Geometric construction of viral genome space and its applications. Comput Struct Biotechnol J 2021; 19:4226-4234. [PMID: 34429843 PMCID: PMC8353408 DOI: 10.1016/j.csbj.2021.07.028] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/25/2021] [Revised: 07/24/2021] [Accepted: 07/24/2021] [Indexed: 11/25/2022] Open
Abstract
The first construction of viral genome space. The first demonstration of the convex hull principle of genomes. The first definition of a natural metric to describe the geometry of genome space.
Understanding the relationships between genomic sequences is essential to the classification and characterization of living beings. The classes and characteristics of an organism can be identified in the corresponding genome space. In the genome space, the natural metric is important to describe the distribution of genomes. Therefore, the similarity of two biological sequences can be measured. Here, we report that all of the viral genomes are in 32-dimensional Euclidean space, in which the natural metric is the weighted summation of Euclidean distance of k-mer natural vectors. The classification of viral genomes in the constructed genome space further proves the convex hull principle of taxonomy, which states that convex hulls of different families are mutually disjoint. This study provides a novel geometric perspective to describe the genome sequences.
Affiliation(s)
- Nan Sun
  - Department of Mathematical Sciences, Tsinghua University, Beijing 100084, PR China
- Shaojun Pei
  - Department of Mathematical Sciences, Tsinghua University, Beijing 100084, PR China
- Lily He
  - Department of Mathematics, School of Science, Beijing University of Civil Engineering and Architecture, Beijing, PR China
- Changchuan Yin
  - Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago, Chicago, IL 60628, USA
- Rong Lucy He
  - Department of Biological Sciences, Chicago State University, Chicago, IL 60628, USA
- Stephen S-T Yau
  - Department of Mathematical Sciences, Tsinghua University, Beijing 100084, PR China; Yanqi Lake Beijing Institute of Mathematical Sciences and Applications, Beijing 101408, China
|
8
|
Mu Z, Yu T, Liu X, Zheng H, Wei L, Liu J. FEGS: a novel feature extraction model for protein sequences and its applications. BMC Bioinformatics 2021; 22:297. [PMID: 34078264 PMCID: PMC8172329 DOI: 10.1186/s12859-021-04223-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 04/25/2021] [Accepted: 05/28/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Feature extraction of protein sequences is widely used in various research areas related to protein analysis, such as protein similarity analysis and prediction of protein functions or interactions. RESULTS In this study, we introduce FEGS (Feature Extraction based on Graphical and Statistical features), a novel feature extraction model of protein sequences, by developing a new technique for graphical representation of protein sequences based on the physicochemical properties of amino acids and effectively employing the statistical features of protein sequences. By fusing the graphical and statistical features, FEGS transforms a protein sequence into a 578-dimensional numerical vector. When FEGS is applied to phylogenetic analysis on five protein sequence data sets, its performance is notably better than all of the other compared methods. CONCLUSION The FEGS method is carefully designed, which is practically powerful for extracting features of protein sequences. The current version of FEGS is developed to be user-friendly and is expected to play a crucial role in the related studies of protein sequence analyses.
Affiliation(s)
- Zengchao Mu
  - School of Mathematics and Statistics, Shandong University, Weihai, 264209, China
- Ting Yu
  - Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, 266237, China
- Xiaoping Liu
  - Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Beijing, China
- Hongyu Zheng
  - Department of Radiation Oncology, Qilu Hospital, Cheeloo College of Medicine, Shandong University, Jinan, 250012, China
- Leyi Wei
  - School of Software, Shandong University, Jinan, China.
- Juntao Liu
  - School of Mathematics and Statistics, Shandong University, Weihai, 264209, China.
|
9
|
Wan X, Tan X. A protein structural study based on the centrality analysis of protein sequence feature networks. PLoS One 2021; 16:e0248861. [PMID: 33780482 PMCID: PMC8006989 DOI: 10.1371/journal.pone.0248861] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/27/2020] [Accepted: 03/05/2021] [Indexed: 11/19/2022] Open
Abstract
In this paper, we use network approaches to analyze the relations between protein sequence features for the top hierarchical classes of CATH and SCOP. We use fundamental connectivity measures such as correlation (CR), normalized mutual information rate (nMIR), and transfer entropy (TE) to analyze the pairwise relationships between the protein sequence features, and use centrality measures to analyze weighted networks constructed from the relationship matrices. In the centrality analysis, we find both commonalities and differences between the different protein 3D structural classes. Results show that all top hierarchical classes of CATH and SCOP present strong non-deterministic interactions for the composition and arrangement features of Cysteine (C), Methionine (M), and Tryptophan (W), and also for the arrangement features of Histidine (H). The different protein 3D structural classes present different preferences in terms of their centrality distributions and significant features.
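The abstract above does not say which centrality measures were applied. As a minimal sketch under that caveat, the example below builds a weighted network from absolute Pearson correlations between feature columns and scores each feature by its strength (the sum of its edge weights), which is one of the simplest centrality measures for weighted networks. The helper names `pearson`, `correlation_network`, and `strength_centrality` are hypothetical, and correlation stands in for the other connectivity measures (nMIR, TE) that the paper also uses.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if degenerate)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def correlation_network(features):
    """features maps a feature name to its values, one value per protein.

    Returns the names and a symmetric weight matrix of |correlation|,
    with zeros on the diagonal (no self-loops).
    """
    names = list(features)
    weights = [
        [0.0 if a == b else abs(pearson(features[a], features[b])) for b in names]
        for a in names
    ]
    return names, weights


def strength_centrality(names, weights):
    """Strength of each node: the sum of the weights of its edges."""
    return {name: sum(row) for name, row in zip(names, weights)}


if __name__ == "__main__":
    toy = {"n_C": [1, 2, 3, 4], "mu_C": [2, 4, 6, 9], "n_W": [4, 3, 1, 1]}
    names, w = correlation_network(toy)
    print(strength_centrality(names, w))
```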
Affiliation(s)
- Xiaogeng Wan
  - College of Mathematics and Physics, Beijing University of Chemical Technology, Beijing, China
- Xinying Tan
  - The Fourth Center of PLA General Hospital, Beijing, China
|
10
|
Wan X, Tan X. A Simple Protein Evolutionary Classification Method Based on the Mutual Relations Between Protein Sequences. Curr Bioinform 2021. [DOI: 10.2174/1574893615666200305090055] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/22/2022]
Abstract
Background: Proteins are important organic molecules in living systems and vary in sequence, structure, and function. Protein evolutionary classification is a popular research topic in computational bioinformatics, and many studies have used protein sequence information to classify the evolutionary relationships of proteins. As the amount of protein sequence data increases, efficient computational tools are needed to perform protein evolutionary classification with high accuracy in the big data paradigm.
Methods: In this study, we propose a new, simple, and efficient computational approach that uses normalized mutual information rates to compute the relationships between protein sequences. We then use the “distances” defined on these relationships to perform the evolutionary classification of proteins. The new method is computationally efficient, model-free, and unsupervised, and does not require training data when performing classifications.
Result: Simulation studies on various examples demonstrate the efficiency of the new method. We use precision-recall curves to compare the efficiency of our new method with that of traditional methods. The results show that the new method outperforms the traditional methods in most cases when performing evolutionary classifications.
Conclusion: The new method is simple and proved to be efficient in protein evolutionary classification, which is useful for future evolutionary analysis, particularly in the big data paradigm.
Affiliation(s)
- Xiaogeng Wan
  - Department of Mathematics, College of Mathematics and Physics, Beijing University of Chemical Technology, Beijing, 100029, China
- Xinying Tan
  - The Fourth Center of PLA General Hospital, Beijing, 100037, China
|
11
|
Multivariate Chemometrics as a Strategy to Predict the Allergenic Nature of Food Proteins. Symmetry (Basel) 2020. [DOI: 10.3390/sym12101616] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/17/2022] Open
Abstract
The purpose of the present study is to develop a simple method for the classification of food proteins with respect to their allergenicity. The methods applied to solve the problem are well-known multivariate statistical approaches (hierarchical and non-hierarchical cluster analysis, two-way clustering, principal components and factor analysis), which form a substantial part of modern exploratory data analysis (chemometrics). The methods were applied to a data set consisting of 18 food proteins (allergenic and non-allergenic). The results obtained convincingly showed that a successful separation of the two types of food proteins could be easily achieved with the selection of simple and accessible physicochemical and structural descriptors. The results from the present study could be of significant importance for distinguishing allergenic from non-allergenic food proteins without engaging complicated software methods and resources. The present study corresponds entirely to the concept of the journal and of the Special Issue on the search for advanced chemometric strategies for solving structural problems of biomolecules.
|
12
|
Ensemble Learning Prediction of Drug-Target Interactions Using GIST Descriptor Extracted from PSSM-Based Evolutionary Information. Biomed Research International 2020; 2020:4516250. [PMID: 32908888 PMCID: PMC7463380 DOI: 10.1155/2020/4516250] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 06/24/2020] [Revised: 08/02/2020] [Accepted: 08/10/2020] [Indexed: 12/02/2022]
Abstract
Identifying drug-target interactions (DTIs) plays an essential role in new drug development. However, knowledge of DTIs is still limited, and a significant number of DTI pairs remain unknown. Moreover, traditional experimental methods have inevitable disadvantages, such as high cost and long duration. Therefore, developing computational methods for predicting DTIs is attracting more and more attention. In this study, we report a novel computational approach for predicting DTIs using the GIST feature, the position-specific scoring matrix (PSSM), and rotation forest (RF). Specifically, each target protein is first converted into a PSSM to retain evolutionary information. Then, the GIST feature is extracted from the PSSM, and substructure fingerprint information is adopted to extract the features of the drug. Finally, the protein and drug features are combined to form a new drug-target pair, which is employed as the input feature for the RF classifier. In the experiment, the proposed method achieves high average accuracies of 89.25%, 85.93%, 82.36%, and 73.89% on the enzyme, ion channel, G protein-coupled receptor (GPCR), and nuclear receptor datasets, respectively. To further evaluate the prediction performance of the proposed method, we compare it with the state-of-the-art support vector machine (SVM) classifier on the same gold standard dataset. These promising results illustrate that the proposed method is more effective and stable than other methods. We expect the proposed method to be a useful tool for predicting large-scale DTIs.
|
13
|
Sun Z, Pei S, He RL, Yau SST. A novel numerical representation for proteins: Three-dimensional Chaos Game Representation and its Extended Natural Vector. Comput Struct Biotechnol J 2020; 18:1904-1913. [PMID: 32774785 PMCID: PMC7390779 DOI: 10.1016/j.csbj.2020.07.004] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 04/28/2020] [Revised: 07/04/2020] [Accepted: 07/05/2020] [Indexed: 12/16/2022] Open
Abstract
Chaos Game Representation (CGR) was first proposed as an image representation method for DNA and has been extended to other biological macromolecules. Compared with the CGR images of DNA, where DNA sequences are converted into a series of points in the unit square, the existing CGR images of proteins are not so elegant in geometry, and the implications of the distribution of points in the CGR image are not so obvious. In this study, by naturally distributing the twenty amino acids on the vertices of a regular dodecahedron, we introduce a novel three-dimensional image representation of protein sequences with the CGR method. We also associate each CGR image with a vector in high-dimensional Euclidean space, called the extended natural vector (ENV), in order to analyze the information contained in the CGR images. Based on the results of protein classification and phylogenetic analysis, our method could serve as a precise method to discover biological relationships between proteins.
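To make the construction above concrete, the following sketch places the twenty amino acids on the twenty vertices of a regular dodecahedron and iterates the usual chaos game rule, moving a fixed fraction of the way from the current point toward the vertex of the next residue. Several details are assumptions rather than the authors' choices: the order in which residues are assigned to vertices (alphabetical here, whereas the paper describes a "natural" distribution not reproduced in this entry), the starting point at the center, and the contraction ratio of 0.5 borrowed from the classical DNA version.

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # golden ratio


def dodecahedron_vertices():
    """The 20 vertices of a regular dodecahedron centered at the origin."""
    verts = [(x, y, z) for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)]
    for a in (-1 / PHI, 1 / PHI):
        for b in (-PHI, PHI):
            # cyclic permutations of (0, +-1/phi, +-phi)
            verts += [(0, a, b), (a, b, 0), (b, 0, a)]
    return verts


# Hypothetical assignment: residues in alphabetical order of one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
VERTEX = dict(zip(AMINO_ACIDS, dodecahedron_vertices()))


def cgr_3d(seq, ratio=0.5):
    """Return the list of 3D CGR points for a protein sequence.

    Each point lies `ratio` of the way from the previous point to the
    vertex of the current residue; unknown symbols are skipped.
    """
    point = (0.0, 0.0, 0.0)
    points = []
    for aa in seq.upper():
        vertex = VERTEX.get(aa)
        if vertex is None:
            continue
        point = tuple(p + ratio * (v - p) for p, v in zip(point, vertex))
        points.append(point)
    return points


if __name__ == "__main__":
    for p in cgr_3d("MKVLA"):
        print(p)
```

The resulting point cloud is what a descriptor such as the extended natural vector would then summarize; that step is not sketched here because its definition is not given in this entry.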
Affiliation(s)
- Zeju Sun
  - Department of Mathematical Sciences, Tsinghua University, Beijing, PR China
- Shaojun Pei
  - Department of Mathematical Sciences, Tsinghua University, Beijing, PR China
- Rong Lucy He
  - Department of Biological Sciences, Chicago State University, Chicago, IL 60628, USA
- Stephen S-T Yau
  - Department of Mathematical Sciences, Tsinghua University, Beijing, PR China
|
14
|
Wan X, Tan X. A study on separation of the protein structural types in amino acid sequence feature spaces. PLoS One 2019; 14:e0226768. [PMID: 31869390 PMCID: PMC6927603 DOI: 10.1371/journal.pone.0226768] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 07/16/2019] [Accepted: 12/03/2019] [Indexed: 11/23/2022] Open
Abstract
Proteins are diverse in their sequences, structures and functions, so it is important to study the relations among them. In this paper, we survey the relations between protein sequences and their structures. We use the natural vector (NV) and the averaged property factor (APF) features to represent protein sequences as feature vectors, and use the multi-class MSE and the convex hull methods to separate proteins of different structural classes into different regions. We found that proteins from different structural classes are separable by hyperplanes and convex hulls in the natural vector feature space, where the feature vectors of different structural classes are separated into disjoint regions or convex hulls in the high-dimensional feature spaces. The natural vector outperforms the averaged property factor method in identifying the structures, and the convex hull method outperforms the multi-class MSE in separating the feature points. These outcomes support a strong connection between protein sequences and their structures, and may imply that the amino acid composition and sequence arrangement represented by the natural vectors have a greater influence on structure than the averaged physical property factors of the amino acids.
Affiliation(s)
- Xiaogeng Wan
  - College of Mathematics and Physics, Beijing University of Chemical Technology, Beijing, China
- Xinying Tan
  - The Fourth Center of PLA General Hospital, Beijing, China
|
15
|
Zhao X, Tian K, He RL, Yau SST. Convex hull principle for classification and phylogeny of eukaryotic proteins. Genomics 2019; 111:1777-1784. [DOI: 10.1016/j.ygeno.2018.11.033] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 09/02/2018] [Revised: 11/25/2018] [Accepted: 11/30/2018] [Indexed: 12/11/2022]
|
16
|
Li Y, Huang YA, You ZH, Li LP, Wang Z. Drug-Target Interaction Prediction Based on Drug Fingerprint Information and Protein Sequence. Molecules 2019; 24:2999. [PMID: 31430892 PMCID: PMC6719962 DOI: 10.3390/molecules24162999] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 07/11/2019] [Revised: 08/13/2019] [Accepted: 08/14/2019] [Indexed: 01/09/2023] Open
Abstract
The identification of drug-target interactions (DTIs) is a critical step in drug development. Experimental methods that are based on clinical trials to discover DTIs are time-consuming, expensive, and challenging. Therefore, as complementary to it, developing new computational methods for predicting novel DTI is of great significance with regards to saving cost and shortening the development period. In this paper, we present a novel computational model for predicting DTIs, which uses the sequence information of proteins and a rotation forest classifier. Specifically, all of the target protein sequences are first converted to a position-specific scoring matrix (PSSM) to retain evolutionary information. We then use local phase quantization (LPQ) descriptors to extract evolutionary information in the PSSM. On the other hand, substructure fingerprint information is utilized to extract the features of the drug. We finally combine the features of drugs and protein together to represent features of each drug-target pair and use a rotation forest classifier to calculate the scores of interaction possibility, for a global DTI prediction. The experimental results indicate that the proposed model is effective, achieving average accuracies of 89.15%, 86.01%, 82.20%, and 71.67% on four datasets (i.e., enzyme, ion channel, G protein-coupled receptors (GPCR), and nuclear receptor), respectively. In addition, we compared the prediction performance of the rotation forest classifier with another popular classifier, support vector machine, on the same dataset. Several types of methods previously proposed are also implemented on the same datasets for performance comparison. The comparison results demonstrate the superiority of the proposed method to the others. We anticipate that the proposed method can be used as an effective tool for predicting drug-target interactions on a large scale, given the information of protein sequences and drug fingerprints.
Affiliation(s)
- Yang Li
- School of Information Engineering, Xijing University, Xi'an 710123, China
| | - Yu-An Huang
- School of Information Engineering, Xijing University, Xi'an 710123, China.
| | - Zhu-Hong You
- School of Information Engineering, Xijing University, Xi'an 710123, China.
| | - Li-Ping Li
- School of Information Engineering, Xijing University, Xi'an 710123, China
| | - Zheng Wang
- School of Information Engineering, Xijing University, Xi'an 710123, China
| |
Collapse
|
17
|
Tian K, Zhao X, Zhang Y, Yau S. Comparing protein structures and inferring functions with a novel three-dimensional Yau-Hausdorff method. J Biomol Struct Dyn 2018; 37:4151-4160. [PMID: 30518311 DOI: 10.1080/07391102.2018.1540359] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 01/14/2023]
Abstract
Structures and functions of proteins play various essential roles in biological processes. The functions of newly discovered proteins can be predicted by comparing their structures with those of proteins of known function. Many approaches have been proposed for measuring protein structure similarity, such as the template-modeling (TM)-score method, the GRaphlet (GR)-Align method, and the commonly used root-mean-square deviation (RMSD) measures. However, alignment-based comparison of protein structures is time-consuming on large datasets, and its accuracy still has room for improvement. In this study, we introduce a new three-dimensional (3D) Yau-Hausdorff distance between any two 3D objects. In particular, the 3D Yau-Hausdorff distance can be used to measure the similarity or dissimilarity of two proteins of any size, and it does not require aligning and superimposing the two structures. We apply structural similarity to study functional similarity and perform phylogenetic analysis on several datasets. The results show that the 3D Yau-Hausdorff distance could serve as a more precise and effective method for discovering biological relationships between proteins than other structure comparison methods. Communicated by Ramaswamy H. Sarma.
Affiliation(s)
- Kun Tian
  - Department of Mathematical Sciences, Tsinghua University, Beijing, P.R. China
- Xin Zhao
  - Department of Mathematical Sciences, Tsinghua University, Beijing, P.R. China
- Yuning Zhang
  - School of Life Sciences, Tsinghua University, Beijing, P.R. China
- Stephen Yau
  - Department of Mathematical Sciences, Tsinghua University, Beijing, P.R. China
|
18
|
Tian K, Zhao X, Yau SST. Convex hull analysis of evolutionary and phylogenetic relationships between biological groups. J Theor Biol 2018; 456:34-40. [DOI: 10.1016/j.jtbi.2018.07.035] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 06/26/2018] [Revised: 07/23/2018] [Accepted: 07/25/2018] [Indexed: 11/28/2022]
|
19
|
Phylogenetic analysis of protein sequences based on a novel k-mer natural vector method. Genomics 2018; 111:1298-1305. [PMID: 30195069 DOI: 10.1016/j.ygeno.2018.08.010] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 07/25/2018] [Revised: 08/19/2018] [Accepted: 08/27/2018] [Indexed: 11/22/2022]
Abstract
Based on the k-mer model for protein sequences, a novel k-mer natural vector method is proposed to characterize the features of k-mers in a protein sequence, in which the numbers and distributions of k-mers are considered. It is proved that the relationship between a protein sequence and its k-mer natural vector is one-to-one. Phylogenetic analysis of protein sequences can therefore be performed easily without requiring evolutionary models or human intervention. In addition, there is no criterion for choosing a suitable k, and k has a great influence on the results as well as on computational complexity. In this paper, a compound k-mer natural vector is used to quantify each protein sequence. The results obtained from phylogenetic analysis of three protein datasets demonstrate that the new method can precisely describe the evolutionary relationships of proteins and greatly improve computational efficiency.
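The idea of describing each k-mer by its number and its distribution along the sequence can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `kmer_natural_vector` is hypothetical, and the three descriptors per k-mer (count, mean 1-based start position, and a second central moment scaled by the count and the number of k-mer positions) follow the commonly used natural-vector form, which is an assumption here because the abstract does not give the formula.

```python
from collections import defaultdict


def kmer_natural_vector(seq, k):
    """Map each k-mer to (count, mean position, scaled second central moment).

    The descriptor choice and the scaling are assumptions for illustration.
    """
    positions = defaultdict(list)
    total = len(seq) - k + 1  # number of k-mer start positions
    for i in range(total):
        positions[seq[i:i + k]].append(i + 1)  # 1-based start position

    features = {}
    for kmer, pos in positions.items():
        n = len(pos)
        mean = sum(pos) / n
        # Spread of the occurrences around their mean position.
        d2 = sum((p - mean) ** 2 for p in pos) / (n * total)
        features[kmer] = (n, mean, d2)
    return features


# Toy sequence, not real protein data.
vec = kmer_natural_vector("ABAB", 2)
print(vec["AB"])  # count, mean position, scaled second moment for "AB"
```

Concatenating such descriptors for several values of k would be one way to build a compound vector; how the paper combines them is not stated in the abstract.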
|
20
|
Dong R, Zhu Z, Yin C, He RL, Yau SST. A new method to cluster genomes based on cumulative Fourier power spectrum. Gene 2018; 673:239-250. [PMID: 29935353 DOI: 10.1016/j.gene.2018.06.042] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 03/07/2018] [Revised: 06/12/2018] [Accepted: 06/14/2018] [Indexed: 11/27/2022]
Abstract
Analyzing phylogenetic relationships using mathematical methods has always been of importance in bioinformatics. Quantitative research may interpret the raw biological data in a precise way. Multiple Sequence Alignment (MSA) is used frequently to analyze biological evolution, but it is very time-consuming. When the scale of the data is large, alignment methods cannot finish the calculation in a reasonable time. Therefore, we present a new method that uses moments of the cumulative Fourier power spectrum to cluster DNA sequences. Each sequence is translated into a vector in Euclidean space. Distances between the vectors reflect the relationships between sequences. The mapping between the spectra and the moment vector is one-to-one, which means that no information in the power spectra is lost during the calculation. We cluster and classify several datasets, including Influenza A, primate, and human rhinovirus (HRV) datasets, to build phylogenetic trees. Results show that the proposed cumulative Fourier power spectrum method is much faster and more accurate than MSA and another alignment-free method known as k-mer. The research provides new insights into the study of phylogeny, evolution, and efficient DNA comparison algorithms for large genomes. The computer programs of the cumulative Fourier power spectrum are available at GitHub (https://github.com/YaulabTsinghua/cumulative-Fourier-power-spectrum).
Affiliation(s)
- Rui Dong
  - Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
- Ziyue Zhu
  - Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
- Changchuan Yin
  - Department of Mathematics, Statistics and Computer Science, University of Illinois at Chicago, IL 60607, USA
- Rong L He
  - Department of Biological Sciences, Chicago State University, Chicago, IL 60628, USA
- Stephen S-T Yau
  - Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
|
21
|
Huang G, Li J, Zhao C. Computational Prediction and Analysis of Associations between Small Molecules and Binding-Associated S-Nitrosylation Sites. Molecules 2018; 23:molecules23040954. [PMID: 29671802 PMCID: PMC6017196 DOI: 10.3390/molecules23040954] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 02/24/2018] [Revised: 03/30/2018] [Accepted: 04/09/2018] [Indexed: 01/12/2023] Open
Abstract
Interactions between drugs and proteins occupy a central position during the process of drug discovery and development. Numerous methods have recently been developed for identifying drug–target interactions, but few have been devoted to finding interactions between post-translationally modified proteins and drugs. We presented a machine learning-based method for identifying associations between small molecules and binding-associated S-nitrosylated (SNO-) proteins. Namely, small molecules were encoded by molecular fingerprint, SNO-proteins were encoded by the information entropy-based method, and the random forest was used to train a classifier. Ten-fold and leave-one-out cross validations achieved, respectively, 0.7235 and 0.7490 of the area under a receiver operating characteristic curve. Computational analysis of similarity suggested that SNO-proteins associated with the same drug shared statistically significant similarity, and vice versa. This method and finding are useful to identify drug–SNO associations and further facilitate the discovery and development of SNO-associated drugs.
Affiliation(s)
- Guohua Huang
  - Provincial Key Laboratory of Informational Service for Rural Area of Southwestern Hunan, Shaoyang University, Shaoyang 422000, China
  - College of Information Engineering, Shaoyang University, Shaoyang 422000, China
- Jincheng Li
  - Provincial Key Laboratory of Informational Service for Rural Area of Southwestern Hunan, Shaoyang University, Shaoyang 422000, China
  - College of Information Engineering, Shaoyang University, Shaoyang 422000, China
- Chenglin Zhao
  - Provincial Key Laboratory of Informational Service for Rural Area of Southwestern Hunan, Shaoyang University, Shaoyang 422000, China
  - College of Information Engineering, Shaoyang University, Shaoyang 422000, China
|
22
|
Dong R, Zheng H, Tian K, Yau SC, Mao W, Yu W, Yin C, Yu C, He RL, Yang J, Yau SS. Virus Database and Online Inquiry System Based on Natural Vectors. Evol Bioinform Online 2017; 13:1176934317746667. [PMID: 29308007 PMCID: PMC5751915 DOI: 10.1177/1176934317746667] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 07/30/2017] [Accepted: 10/05/2017] [Indexed: 01/09/2023] Open
Abstract
We construct a virus database called VirusDB (http://yaulab.math.tsinghua.edu.cn/VirusDB/) and an online inquiry system to serve people who are interested in viral classification and prediction. The database stores all viral genomes, their corresponding natural vectors, and the classification information of the single/multiple-segmented viral reference sequences downloaded from the National Center for Biotechnology Information. The online inquiry system computes natural vectors and their distances for submitted genomes, provides an online interface for accessing and using the database for viral classification and prediction, and runs back-end processes for automatic and manual updating of the database content to synchronize with GenBank. Genome data submitted in FASTA format are processed, and the prediction results, with the 5 closest neighbors and their classifications, are returned by email. Considering the one-to-one correspondence between sequence and natural vector, its time efficiency, and its high accuracy, the natural vector is a significant advance over alignment methods, which makes VirusDB a useful database for further research.
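The query step described above, returning the closest neighbors of a submitted genome by vector distance, can be sketched as follows. This is only an illustration under stated assumptions: the function name `closest_neighbors`, the in-memory dictionary standing in for the database, and the use of Euclidean distance are all hypothetical choices, not the VirusDB implementation.

```python
import math


def closest_neighbors(query, database, k=5):
    """Return the k (label, distance) pairs nearest to `query`.

    `database` maps a label to a precomputed feature vector of the same
    length as `query`. Euclidean distance is an assumption for illustration.
    """
    scored = [(label, math.dist(query, vec)) for label, vec in database.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Toy vectors with made-up labels, not real natural vectors.
toy_db = {"ref_a": (0.0, 0.0), "ref_b": (3.0, 4.0), "ref_c": (1.0, 0.0)}
for label, dist in closest_neighbors((0.0, 0.0), toy_db, k=2):
    print(label, dist)
```

A predicted classification could then be read from the labels of the returned neighbors, for example by majority vote; the abstract does not say which rule the system applies.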
Affiliation(s)
- Rui Dong
  - Department of Mathematical Sciences, Tsinghua University, Beijing, China
- Hui Zheng
  - Department of Mathematics, Statistics, and Computer Science, The University of Illinois at Chicago, Chicago, IL, USA
- Kun Tian
  - Department of Mathematical Sciences, Tsinghua University, Beijing, China
- Shek-Chung Yau
  - Information Technology Services Center, The Hong Kong University of Science and Technology, Kowloon, Hong Kong
- Weiguang Mao
  - Department of Mathematical Sciences, Tsinghua University, Beijing, China
- Wenping Yu
  - College of Computer and Control Engineering, Nankai University, Tianjin, China
- Changchuan Yin
  - Department of Mathematics, Statistics, and Computer Science, The University of Illinois at Chicago, Chicago, IL, USA
- Chenglong Yu
  - Mind and Brain Theme, South Australian Health and Medical Research Institute, North Terrace, Adelaide, SA, Australia
  - School of Medicine, Flinders University, Adelaide, SA, Australia
- Rong Lucy He
  - Department of Biological Sciences, Chicago State University, Chicago, IL, USA
- Jie Yang
  - Department of Mathematics, Statistics, and Computer Science, The University of Illinois at Chicago, Chicago, IL, USA
- Stephen St Yau
  - Department of Mathematical Sciences, Tsinghua University, Beijing, China
|
23
|
Zhao X, Tian K, He RL, Yau SST. Establishing the phylogeny of Prochlorococcus with a new alignment-free method. Ecol Evol 2017; 7:11057-11065. [PMID: 29299281 PMCID: PMC5743538 DOI: 10.1002/ece3.3535] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 06/07/2017] [Revised: 09/04/2017] [Accepted: 09/14/2017] [Indexed: 11/11/2022] Open
Abstract
Prochlorococcus marinus, one of the most abundant marine cyanobacteria in the global ocean, is classified into low-light (LL) and high-light (HL) adapted ecotypes. These two ecotypes differ in their ecophysiological characteristics, especially in whether they are adapted for growth at high-light or low-light intensities. However, some evolutionary relationships in the Prochlorococcus phylogeny remain to be resolved, such as whether the strains SS120 and MIT9211 form a monophyletic group. We use the Natural Vector (NV) method to represent the sequences in order to identify the phylogeny of Prochlorococcus. The natural vector method is alignment-free and makes no model assumptions. This study adds the covariances of amino acids in a protein sequence to the natural vector method. Based on these new natural vectors, we can compute the Hausdorff distance between two clades, which represents their dissimilarity. This method enables us to systematically analyze both the dataset of ribosomal proteomes and the dataset of 16S-23S rRNA sequences in order to reconstruct the phylogeny of Prochlorococcus. Furthermore, we apply classification to inspect the relationship between SS120 and MIT9211. From the reconstructed phylogenetic trees and classification results, we may conclude that SS120 does not cluster with MIT9211. This study demonstrates a new method for performing phylogenetic analysis. The results confirm that these two strains do not form a monophyletic clade in the phylogeny of Prochlorococcus.
Affiliation(s)
- Xin Zhao
  - Department of Mathematical Sciences, Tsinghua University, Beijing, China
- Kun Tian
  - Department of Mathematical Sciences, Tsinghua University, Beijing, China
- Rong L He
  - Department of Biological Sciences, Chicago State University, Chicago, IL, USA
- Stephen S-T Yau
  - Department of Mathematical Sciences, Tsinghua University, Beijing, China
|
24
|
Yu C, Arcos-Burgos M, Licinio J, Wong ML. A latent genetic subtype of major depression identified by whole-exome genotyping data in a Mexican-American cohort. Transl Psychiatry 2017; 7:e1134. [PMID: 28509902 PMCID: PMC5534938 DOI: 10.1038/tp.2017.102] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Received: 11/09/2016] [Revised: 04/04/2017] [Accepted: 04/10/2017] [Indexed: 02/07/2023] Open
Abstract
Identifying data-driven subtypes of major depressive disorder (MDD) is an important topic of psychiatric research. Currently, MDD subtypes are based on clinically defined depression symptom patterns. Although a few data-driven attempts have been made to identify more homogenous subgroups within MDD, other studies have not focused on using human genetic data for MDD subtyping. Here we used a computational strategy to identify MDD subtypes based on single-nucleotide polymorphism genotyping data from MDD cases and controls using Hamming distance and cluster analysis. We examined a cohort of Mexican-American participants from Los Angeles, including MDD patients (n=203) and healthy controls (n=196). The results in cluster trees indicate that a significant latent subtype exists in the Mexican-American MDD group. The individuals in this hidden subtype have increased common genetic substrates related to major depression and they also have more anxiety and less middle insomnia, depersonalization and derealisation, and paranoid symptoms. Advances in this line of research to validate this strategy in other patient groups of different ethnicities will have the potential to eventually be translated to clinical practice, with the tantalising possibility that in the future it may be possible to refine MDD diagnosis based on genetic data.
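The distance step named in the abstract above, Hamming distance between genotype profiles, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, genotypes are assumed to be coded as equal-length lists of 0/1/2 values, and the toy data are invented. The study's actual coding and clustering procedure are not given in the abstract.

```python
def hamming(a, b):
    """Number of positions at which two equal-length genotype vectors differ."""
    if len(a) != len(b):
        raise ValueError("genotype vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(genotypes):
    """Symmetric matrix of pairwise Hamming distances between individuals."""
    n = len(genotypes)
    return [[hamming(genotypes[i], genotypes[j]) for j in range(n)]
            for i in range(n)]


# Toy genotype vectors (assumed 0/1/2 coding), not study data.
toy = [
    [0, 1, 2, 0],
    [0, 1, 1, 0],
    [2, 2, 2, 2],
]
for row in distance_matrix(toy):
    print(row)
```

A matrix like this could be passed to any hierarchical clustering routine to draw a cluster tree of cases and controls.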
Affiliation(s)
- C Yu
  - Mind and Brain Theme, South Australian Health and Medical Research Institute, Adelaide, SA, Australia
  - School of Medicine, Flinders University, Bedford Park, Adelaide, SA, Australia
- M Arcos-Burgos
  - Department of Genome Sciences, John Curtin School of Medical Research, Australian National University, Canberra, ACT, Australia
  - University of Rosario International Institute of Translational Medicine, Bogota, Colombia
- J Licinio
  - Mind and Brain Theme, South Australian Health and Medical Research Institute, Adelaide, SA, Australia
  - School of Medicine, Flinders University, Bedford Park, Adelaide, SA, Australia
  - South Ural State University Biomedical School, Chelyabinsk, Russia
- M-L Wong
  - Mind and Brain Theme, South Australian Health and Medical Research Institute, Adelaide, SA, Australia
  - School of Medicine, Flinders University, Bedford Park, Adelaide, SA, Australia
|
25
|
Xiong D, Zeng J, Gong H. A deep learning framework for improving long-range residue–residue contact prediction using a hierarchical strategy. Bioinformatics 2017; 33:2675-2683. [DOI: 10.1093/bioinformatics/btx296] [Citation(s) in RCA: 36] [Impact Index Per Article: 5.1] [Received: 10/18/2016] [Accepted: 05/02/2017] [Indexed: 12/31/2022] Open
Affiliation(s)
- Dapeng Xiong
  - MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing, China
  - Beijing Innovation Center of Structural Biology, Tsinghua University, Beijing, China
- Jianyang Zeng
  - Beijing Innovation Center of Structural Biology, Tsinghua University, Beijing, China
  - Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
- Haipeng Gong
  - MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing, China
  - Beijing Innovation Center of Structural Biology, Tsinghua University, Beijing, China
|
26
|
Wan X, Zhao X, Yau SST. An information-based network approach for protein classification. PLoS One 2017; 12:e0174386. [PMID: 28350835 PMCID: PMC5370107 DOI: 10.1371/journal.pone.0174386] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 11/23/2016] [Accepted: 03/08/2017] [Indexed: 11/25/2022] Open
Abstract
Protein classification is one of the critical problems in bioinformatics. Early studies used geometric distances and phylogenetic trees to classify proteins. These methods use binary trees to present protein classification. In this paper, we propose a new protein classification method, in which information theory and network theory are used to classify the multivariate relationships of proteins. In this study, the protein universe is modeled as an undirected network, where proteins are classified according to their connections. Our method is unsupervised, multivariate, and alignment-free. It can be applied to the classification of both protein sequences and structures. Nine examples are used to demonstrate the efficiency of our new method.
Affiliation(s)
- Xiaogeng Wan
  - Department of Mathematical Sciences, Tsinghua University, Beijing, China
- Xin Zhao
  - Department of Mathematical Sciences, Tsinghua University, Beijing, China
- Stephen S. T. Yau
  - Department of Mathematical Sciences, Tsinghua University, Beijing, China
|
27
|
Yu C, Baune BT, Licinio J, Wong ML. A novel strategy for clustering major depression individuals using whole-genome sequencing variant data. Sci Rep 2017; 7:44389. [PMID: 28287625 PMCID: PMC5347377 DOI: 10.1038/srep44389] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Received: 11/02/2016] [Accepted: 02/07/2017] [Indexed: 12/01/2022] Open
Abstract
Major depressive disorder (MDD) is highly prevalent, resulting in an exceedingly high disease burden. The identification of generic risk factors could lead to advances in prevention and therapeutics. Current approaches examine genotyping data to identify specific variations between cases and controls. Compared to genotyping, whole-genome sequencing (WGS) allows for the detection of private mutations. In this proof-of-concept study, we establish a conceptually novel computational approach that clusters subjects based on the entirety of their WGS data. Those clusters predicted MDD diagnosis. This strategy yielded encouraging results, showing that depressed Mexican-American participants were grouped closer together; in contrast, ethnically matched controls grouped away from MDD patients. This implies that, within the same ancestry, the WGS data of an individual can be used to check whether this individual is within or closer to MDD subjects or to controls. We propose a novel strategy to apply WGS data to clinical medicine by facilitating diagnosis through genetic clustering. Further studies utilising our method should examine larger WGS datasets in other ethnic groups.
Affiliation(s)
- Chenglong Yu
  - Mind and Brain Theme, South Australian Health and Medical Research Institute, North Terrace, Adelaide, SA 5000, Australia
  - School of Medicine, Flinders University, Bedford Park, SA 5042, Australia
- Bernhard T. Baune
  - Discipline of Psychiatry, School of Medicine, University of Adelaide, Adelaide, SA 5005, Australia
- Julio Licinio
  - Mind and Brain Theme, South Australian Health and Medical Research Institute, North Terrace, Adelaide, SA 5000, Australia
  - School of Medicine, Flinders University, Bedford Park, SA 5042, Australia
- Ma-Li Wong
  - Mind and Brain Theme, South Australian Health and Medical Research Institute, North Terrace, Adelaide, SA 5000, Australia
  - School of Medicine, Flinders University, Bedford Park, SA 5042, Australia
|
28
|
Hou W, Pan Q, Peng Q, He M. A new method to analyze protein sequence similarity using Dynamic Time Warping. Genomics 2016; 109:123-130. [PMID: 27974244 PMCID: PMC7125777 DOI: 10.1016/j.ygeno.2016.12.002] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Received: 08/12/2016] [Revised: 12/06/2016] [Accepted: 12/10/2016] [Indexed: 12/05/2022]
Abstract
Sequence similarity analysis is one of the major topics in bioinformatics. It helps researchers to reveal the evolutionary relationships of different species. In this paper, we outline a new method to analyze the similarity of proteins using the Discrete Fourier Transform (DFT) and Dynamic Time Warping (DTW). The original symbol sequences are converted to numerical sequences according to their physicochemical properties. We obtain the power spectra of the sequences from the DFT and extend the spectra to the same length to calculate the distance between different sequences by DTW. Our method is tested on different datasets, and the results are compared with those of other software algorithms. In the comparison, we find that our scheme corrects some misclassifications produced by other software. The comparison shows that our approach is reasonable and effective. We propose a novel method to extract the features of the sequences based on the physicochemical properties of proteins. We apply the Discrete Fourier Transform (DFT) and Dynamic Time Warping (DTW) to analyze the similarity of proteins. Different datasets are used to demonstrate our model's effectiveness.
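The pipeline outlined in this abstract (numerical encoding, DFT power spectrum, DTW distance) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the input is assumed to be already converted to numbers, the step that extends spectra to a common length is omitted because DTW can align sequences of different lengths, and the absolute difference is an assumed local cost.

```python
import cmath


def power_spectrum(x):
    """Squared magnitudes of the discrete Fourier transform of x (direct O(n^2) sum)."""
    n = len(x)
    spectrum = []
    for k in range(n):
        coeff = sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def dtw(a, b):
    """Classic dynamic time warping distance with |a_i - b_j| as the local cost."""
    inf = float("inf")
    table = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    table[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            table[i][j] = cost + min(table[i - 1][j], table[i][j - 1], table[i - 1][j - 1])
    return table[-1][-1]


# Toy numerical sequences standing in for property-encoded proteins.
seq_a = [1.0, 2.0, 3.0, 2.0]
seq_b = [1.0, 2.0, 2.0, 3.0, 2.0]
print(dtw(power_spectrum(seq_a), power_spectrum(seq_b)))
```

The direct DFT sum keeps the sketch dependency-free; for long sequences an FFT routine would be the usual replacement.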
Affiliation(s)
- Wenbing Hou
  - School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, PR China
- Qiuhui Pan
  - School of Innovation and Entrepreneurship, Dalian University of Technology, Dalian 116024, PR China
  - School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, PR China
- Qianying Peng
  - Department of Academics, Dalian Naval Academy, Dalian 116001, PR China
- Mingfeng He
  - School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, PR China
|
29
|
Zhao X, Wan X, He RL, Yau SST. A new method for studying the evolutionary origin of the SAR11 clade marine bacteria. Mol Phylogenet Evol 2016; 98:271-9. [PMID: 26926946 DOI: 10.1016/j.ympev.2016.02.015] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 10/03/2015] [Revised: 02/18/2016] [Accepted: 02/18/2016] [Indexed: 12/14/2022]
Abstract
The free-living SAR11 clade is a globally abundant group of oceanic Alphaproteobacteria, with small genome sizes and rich genomic A+T content. However, the taxonomy of SAR11 has become controversial recently. Some researchers argue that the position of SAR11 is a sister group to Rickettsiales. Other researchers advocate that SAR11 is located within free-living lineages of Alphaproteobacteria. Here, we use the natural vector representation method to identify the evolutionary origin of the SAR11 clade. This alignment-free method does not depend on any model assumptions. With this approach, the correspondence between proteome sequences and their natural vectors is one-to-one. After fixing a set of proteins, each bacterium is represented by a set of vectors. The Hausdorff distance is then used to compute the dissimilarity distance between two bacteria. The phylogenetic tree can be reconstructed based on these distances. Using our method, we systematically analyze four data sets of alphaproteobacterial proteomes in order to reconstruct the phylogeny of Alphaproteobacteria. From this we can see that the phylogenetic position of the SAR11 group is within a group of other free-living lineages of Alphaproteobacteria.
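When each organism is represented by a set of vectors, as described above, the Hausdorff distance between two such sets can be computed as in the sketch below. The function name `hausdorff` and the choice of Euclidean distance between individual vectors are assumptions for illustration; the abstract does not state which underlying point distance is used.

```python
import math


def hausdorff(set_a, set_b):
    """Hausdorff distance between two non-empty collections of equal-length vectors."""

    def directed(xs, ys):
        # Largest distance from a point in xs to its nearest point in ys.
        return max(min(math.dist(x, y) for y in ys) for x in xs)

    return max(directed(set_a, set_b), directed(set_b, set_a))


# Toy 2D vectors standing in for two organisms' natural vectors.
organism_1 = [(0.0, 0.0), (1.0, 0.0)]
organism_2 = [(0.0, 0.0), (4.0, 0.0)]
print(hausdorff(organism_1, organism_2))
```

Taking the maximum over both directions is what makes the measure symmetric, so the resulting pairwise values can fill a distance matrix for tree reconstruction.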
Affiliation(s)
- Xin Zhao
  - Department of Mathematical Sciences, Tsinghua University, Beijing 100084, PR China
- Xiaogeng Wan
  - Department of Mathematical Sciences, Tsinghua University, Beijing 100084, PR China
- Rong L He
  - Department of Biological Sciences, Chicago State University, Chicago, IL 60628, USA
- Stephen S-T Yau
  - Department of Mathematical Sciences, Tsinghua University, Beijing 100084, PR China
|
30
|
Tian K, Yang X, Kong Q, Yin C, He RL, Yau SST. Two Dimensional Yau-Hausdorff Distance with Applications on Comparison of DNA and Protein Sequences. PLoS One 2015; 10:e0136577. [PMID: 26384293 PMCID: PMC4575136 DOI: 10.1371/journal.pone.0136577] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Received: 03/18/2015] [Accepted: 08/05/2015] [Indexed: 11/20/2022] Open
Abstract
Comparing DNA or protein sequences plays an important role in the functional analysis of genomes. Although many methods are available for sequence comparison, few retain the information content of sequences. We propose a new approach, the Yau-Hausdorff method, which considers all translations and rotations when seeking the best match of graphical curves of DNA or protein sequences. The complexity of this method is lower than that of any other two-dimensional minimum Hausdorff algorithm. The Yau-Hausdorff method can be used for measuring the similarity of DNA sequences based on two important tools: the Yau-Hausdorff distance and the graphical representation of DNA sequences. The graphical representations of DNA sequences conserve all sequence information, and the Yau-Hausdorff distance is mathematically proved to be a true metric. Therefore, the proposed distance can precisely measure the similarity of DNA sequences. The phylogenetic analyses of DNA sequences by the Yau-Hausdorff distance show the accuracy and stability of our approach in similarity comparison of DNA or protein sequences. This study demonstrates that the Yau-Hausdorff distance is a natural metric for DNA and protein sequences with a high level of stability. The approach can also be applied to similarity analysis of protein sequences by graphical representations, as well as to general two-dimensional shape matching.
Affiliation(s)
- Kun Tian
  - Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
- Xiaoqian Yang
  - Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
- Qin Kong
  - Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
- Changchuan Yin
  - Department of Mathematics, Statistics and Computer Science, The University of Illinois at Chicago, Chicago, IL 60607-7045, United States of America
- Rong L He
  - Department of Biological Sciences, Chicago State University, Chicago, IL 60628, United States of America
- Stephen S-T Yau
  - Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
|
31
|
Yau SST, Mao WG, Benson M, He RL. Distinguishing proteins from arbitrary amino acid sequences. Sci Rep 2015; 5:7972. [PMID: 25609314 PMCID: PMC4302309 DOI: 10.1038/srep07972] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 10/21/2014] [Accepted: 12/09/2014] [Indexed: 01/26/2023] Open
Abstract
What kinds of amino acid sequences could possibly be protein sequences? From all existing databases that we can find, known proteins are only a small fraction of all possible combinations of amino acids. Beginning with Sanger's first detailed determination of a protein sequence in 1952, previous studies have focused on describing the structure of existing protein sequences in order to construct the protein universe. No one, however, has developed a criterion for determining whether an arbitrary amino acid sequence can be a protein. Here we show that when the collection of arbitrary amino acid sequences is viewed in an appropriate geometric context, the protein sequences cluster together. This leads to a new computational test, described here, that has proved to be remarkably accurate at determining whether an arbitrary amino acid sequence can be a protein. Moreover, if the results of this test indicate that the sequence can be a protein, and it is indeed a protein sequence, then its identity as a protein sequence is uniquely defined. We anticipate that our computational test will be useful for those who are attempting to complete the job of discovering all proteins, or constructing the protein universe.
Affiliation(s)
- Stephen S-T Yau
  - Department of Mathematical Sciences, Tsinghua University, Beijing, 100084, China
- Wei-Guang Mao
  - Department of Mathematical Sciences, Tsinghua University, Beijing, 100084, China
- Max Benson
  - Department of Computer Science, Seattle Pacific University, Seattle, WA 98119, USA
- Rong Lucy He
  - Department of Biological Sciences, Chicago State University, Chicago, IL 60628, USA
|
32
|
Li J, Koehl P. 3D representations of amino acids-applications to protein sequence comparison and classification. Comput Struct Biotechnol J 2014; 11:47-58. [PMID: 25379143 PMCID: PMC4212284 DOI: 10.1016/j.csbj.2014.09.001] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Indexed: 11/26/2022] Open
Abstract
The amino acid sequence of a protein is the key to understanding its structure and ultimately its function in the cell. This paper addresses the fundamental issue of encoding amino acids in such a way that the representation of a protein sequence facilitates the decoding of its information content. We show that a feature-based representation in a three-dimensional (3D) space derived from amino acid substitution matrices provides an adequate representation that can be used for direct comparison of protein sequences based on geometry. We measure the performance of such a representation in the context of the protein structural fold prediction problem. We compare the results of classifying different sets of proteins belonging to distinct structural folds against classifications of the same proteins obtained from sequence alone or directly from structural information. We find that sequence alone performs poorly as a structure classifier. We show, in contrast, that the use of the three-dimensional representation of the sequences significantly improves the classification accuracy. We conclude with a discussion of the current limitations of such a representation and with a description of potential improvements.
Affiliation(s)
- Jie Li: Genome Center, University of California, Davis, 451 Health Sciences Drive, Davis, CA 95616, United States
- Patrice Koehl: Department of Computer Science and Genome Center, University of California, Davis, One Shields Ave, Davis, CA 95616, United States
|
33
|
DFA7, a new method to distinguish between intron-containing and intronless genes. PLoS One 2014; 9:e101363. [PMID: 25036549 PMCID: PMC4103774 DOI: 10.1371/journal.pone.0101363] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Received: 02/02/2014] [Accepted: 06/05/2014] [Indexed: 11/23/2022] Open
Abstract
Intron-containing and intronless genes have different biological properties and statistical characteristics. Here we propose a new computational method to distinguish between intron-containing and intronless gene sequences. Seven feature parameters based on detrended fluctuation analysis (DFA) are used, so that a 7-dimensional feature vector can be computed for any given gene sequence to be discriminated. Furthermore, a support vector machine (SVM) classifier with a Gaussian radial basis kernel function is applied to this feature space to classify the genes as intron-containing or intronless. We investigate the performance of the proposed method in comparison with other state-of-the-art algorithms on biological datasets. The experimental results show that our new method significantly improves the accuracy over those existing techniques.
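For readers unfamiliar with detrended fluctuation analysis, the following is a minimal sketch of the generic first-order DFA scaling exponent for a numeric series. It is not the paper's seven-parameter feature set, which the abstract does not spell out. The function names `linear_fit` and `dfa_exponent`, the default window sizes, and the assumption that the gene sequence has already been mapped to numbers are all illustrative choices made here.

```python
import math
import random


def linear_fit(xs, ys):
    """Return the least-squares slope and intercept of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def dfa_exponent(signal, window_sizes=(8, 16, 32, 64, 128)):
    """Estimate the DFA scaling exponent of a numeric series.

    Hypothetical helper: window sizes and linear (order-1) detrending
    are assumptions, not values taken from the cited paper.
    A constant signal is not supported (its fluctuation is zero).
    """
    # Step 1: integrate the mean-centred series to obtain the profile.
    mean = sum(signal) / len(signal)
    profile = []
    running = 0.0
    for value in signal:
        running += value - mean
        profile.append(running)

    log_sizes, log_flucts = [], []
    for w in window_sizes:
        n_windows = len(profile) // w
        if n_windows < 1:
            raise ValueError("signal is shorter than a window")
        t = list(range(w))
        mean_sq = 0.0
        # Step 2: remove a linear trend in each non-overlapping window
        # and accumulate the mean squared residual.
        for j in range(n_windows):
            segment = profile[j * w:(j + 1) * w]
            slope, intercept = linear_fit(t, segment)
            mean_sq += sum(
                (y - (slope * ti + intercept)) ** 2
                for ti, y in zip(t, segment)
            ) / w
        # Step 3: record log window size and log RMS fluctuation.
        log_sizes.append(math.log(w))
        log_flucts.append(0.5 * math.log(mean_sq / n_windows))

    # Step 4: the exponent is the slope of the log-log relationship.
    alpha, _ = linear_fit(log_sizes, log_flucts)
    return alpha


if __name__ == "__main__":
    random.seed(0)
    noise = [random.gauss(0.0, 1.0) for _ in range(4096)]
    print("white-noise exponent:", dfa_exponent(noise))
```

Uncorrelated noise is expected to give an exponent near 0.5, and its running sum (a random walk) one near 1.5. In a classification setting, quantities of this kind would be assembled into a feature vector and passed to a classifier such as an SVM.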
|
34
|
K-mer natural vector and its application to the phylogenetic analysis of genetic sequences. Gene 2014; 546:25-34. [PMID: 24858075 DOI: 10.1016/j.gene.2014.05.043] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.5] [Received: 02/08/2014] [Revised: 05/04/2014] [Accepted: 05/20/2014] [Indexed: 11/21/2022]
Abstract
Based on the well-known k-mer model, we propose a k-mer natural vector model for representing a genetic sequence based on the numbers and distributions of k-mers in the sequence. We show that there exists a one-to-one correspondence between a genetic sequence and its associated k-mer natural vector. The k-mer natural vector method can be easily and quickly used to perform phylogenetic analysis of genetic sequences without requiring evolutionary models or human intervention. Whole or partial genomes can be handled more effectively with our proposed method. It is applied to the phylogenetic analysis of genetic sequences, and the results obtained demonstrate that the k-mer natural vector method is a very powerful tool for analysing and annotating genetic sequences and determining evolutionary relationships, in terms of both accuracy and efficiency.
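To make "numbers and distributions of k-mers" concrete, here is a minimal sketch of one way such a vector could be computed. For each k-mer it records the occurrence count, the mean start position, and a normalized second central moment of the positions. The function name `kmer_natural_vector`, the 1-based positions, and the normalization by count times number of k-mer positions are assumptions made for illustration; the abstract does not give the paper's exact definition.

```python
from itertools import product


def kmer_natural_vector(seq, k=2, alphabet="ACGT"):
    """Return [count, mean position, normalized 2nd moment] per k-mer.

    Hypothetical helper: k-mers are listed in sorted order, positions
    are 1-based start indices, and k-mers containing symbols outside
    the alphabet are skipped. These conventions are assumptions.
    """
    seq = seq.upper()
    total = len(seq) - k + 1  # number of k-mer start positions
    if total <= 0:
        raise ValueError("sequence is shorter than k")

    # Collect the start positions of every possible k-mer.
    positions = {"".join(p): [] for p in product(alphabet, repeat=k)}
    for i in range(total):
        kmer = seq[i:i + k]
        if kmer in positions:
            positions[kmer].append(i + 1)

    vector = []
    for kmer in sorted(positions):
        pos = positions[kmer]
        n = len(pos)
        if n == 0:
            # Absent k-mers contribute zeros for all three components.
            vector.extend([0, 0.0, 0.0])
            continue
        mu = sum(pos) / n
        d2 = sum((p - mu) ** 2 for p in pos) / (n * total)
        vector.extend([n, mu, d2])
    return vector


if __name__ == "__main__":
    v = kmer_natural_vector("AACA", k=1)
    print(v[:3])  # count, mean position, and moment for "A"
```

Two sequences can then be compared by a distance between their vectors (for example, Euclidean distance), and the resulting distance matrix can be passed to a tree-building method without a sequence alignment step.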
|
35
|
APSLAP: an adaptive boosting technique for predicting subcellular localization of apoptosis protein. Acta Biotheor 2013; 61:481-97. [PMID: 23982307 DOI: 10.1007/s10441-013-9197-1] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Received: 01/30/2013] [Accepted: 08/16/2013] [Indexed: 01/09/2023]
Abstract
Apoptotic proteins play key roles in understanding the mechanism of programmed cell death. Knowledge about the subcellular localization of apoptotic proteins is useful for understanding the mechanism of programmed cell death, determining the functional characterization of the protein, screening candidates in drug design, and selecting proteins for relevant studies. It has also been claimed that the information required for determining the subcellular localization of a protein resides in its amino acid sequence. In this work, a new biological feature, class pattern frequency of physiochemical descriptor, was used together with the amino acid composition, protein similarity measure, CTD (composition, translation, and distribution) of physiochemical descriptors, and sequence similarity to predict the subcellular localization of apoptosis proteins. AdaBoost with Random Forest as the weak learner was designed for the five modules, and prediction is made by a weighted voting system. A benchmark dataset of 317 apoptosis proteins was subjected to prediction by our system, and the accuracy was found to be 100.0 %, 92.4 %, and 90.1 % for the self-consistency test, jack-knife test, and tenfold cross-validation test, respectively, which is 0.9 % higher than that of other existing methods. Besides this, prediction on the independent datasets (N151 and ZW98) resulted in accuracies of 90.7 % and 87.7 %, respectively. These results show that the protein feature represented by a combined feature vector, along with the AdaBoost algorithm, performs well in predicting the subcellular localization of apoptosis proteins. The user-friendly web interface "APSLAP" has been constructed and is freely available at http://apslap.bicpu.edu.in, and it is anticipated that this tool will play a significant role in reliably determining the specific role of apoptosis proteins.
|
36
|
Yu C, He RL, Yau SST. Protein sequence comparison based on K-string dictionary. Gene 2013; 529:250-6. [DOI: 10.1016/j.gene.2013.07.092] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.6] [Received: 12/08/2012] [Revised: 06/14/2013] [Accepted: 07/25/2013] [Indexed: 11/30/2022]
|