1
Mahmud S, Guo Z, Quadir F, Liu J, Cheng J. Multi-head attention-based U-Nets for predicting protein domain boundaries using 1D sequence features and 2D distance maps. BMC Bioinformatics 2022; 23:283. [PMID: 35854211 PMCID: PMC9295499 DOI: 10.1186/s12859-022-04829-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/11/2022] [Accepted: 07/08/2022] [Indexed: 01/25/2023]
Abstract
Information about the domain architecture of proteins is useful for studying protein structure and function. However, accurate prediction of protein domain boundaries (i.e., sequence regions separating two domains) from sequence remains a significant challenge. In this work, we develop a deep learning method based on multi-head U-Nets (called DistDom) to predict protein domain boundaries using 1D sequence features and a predicted 2D inter-residue distance map as input. The 1D features contain the evolutionary and physicochemical information of protein sequences, whereas the 2D distance map contains structural information that was rarely used in domain boundary prediction before. The 1D and 2D features are processed by 1D and 2D U-Nets, respectively, to generate hidden features. The hidden features are then used by multi-head attention to predict the probability of each residue of a protein being in a domain boundary, leveraging both local and global information in the features. The residue-level domain boundary predictions can be used to classify proteins as single-domain or multi-domain proteins. DistDom classifies the CASP14 single-domain and multi-domain targets with an accuracy of 75.9%, 13.28% more accurate than the state-of-the-art method. Tested on the CASP14 multi-domain protein targets with expert-annotated domain boundaries, the average per-target F1 score of the domain boundary prediction by DistDom is 0.263, 29.56% higher than the state-of-the-art method.
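The step from residue-level boundary probabilities to a single-domain or multi-domain call can be sketched as follows. This is a minimal illustration, not DistDom's published procedure: the threshold of 0.5, the `min_len` parameter, and both function names are assumptions introduced here.

```python
def boundary_regions(probs, threshold=0.5, min_len=1):
    """Return inclusive (start, end) index pairs of runs with prob >= threshold.

    threshold and min_len are illustrative assumptions, not published values.
    """
    regions = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = i  # a new candidate boundary run begins here
        elif start is not None:
            if i - start >= min_len:
                regions.append((start, i - 1))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(probs) - start >= min_len:
        regions.append((start, len(probs) - 1))
    return regions


def is_multi_domain(probs, threshold=0.5, min_len=1):
    """Call a protein multi-domain if at least one boundary region is found."""
    return len(boundary_regions(probs, threshold, min_len)) > 0
```

With this rule, a protein whose probabilities never reach the threshold is called single-domain; raising `min_len` suppresses isolated high-probability residues.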
Affiliation(s)
- Sajid Mahmud
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, USA
- Zhiye Guo
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, USA
- Farhan Quadir
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, USA
- Jian Liu
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, USA
- Jianlin Cheng
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, USA
2
Mulnaes D, Golchin P, Koenig F, Gohlke H. TopDomain: Exhaustive Protein Domain Boundary Metaprediction Combining Multisource Information and Deep Learning. J Chem Theory Comput 2021; 17:4599-4613. [PMID: 34161735 DOI: 10.1021/acs.jctc.1c00129] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 12/14/2022]
Abstract
Protein domains are independent, functional, and stable structural units of proteins. Accurate protein domain boundary prediction plays an important role in understanding protein structure and evolution, as well as for protein structure prediction. Current domain boundary prediction methods differ in terms of boundary definition, methodology, and training databases resulting in disparate performance for different proteins. We developed TopDomain, an exhaustive metapredictor, that uses deep neural networks to combine multisource information from sequence- and homology-based features of over 50 primary predictors. For this purpose, we developed a new domain boundary data set termed the TopDomain data set, in which the true annotations are informed by SCOPe annotations, structural domain parsers, human inspection, and deep learning. We benchmark TopDomain against 2484 targets with 3354 boundaries from the TopDomain test set and achieve F1 scores of 78.4% and 73.8% for multidomain boundary prediction within ±20 residues and ±10 residues of the true boundary, respectively. When examined on targets from CASP11-13 competitions, TopDomain achieves F1 scores of 47.5% and 42.8% for multidomain proteins. TopDomain significantly outperforms 15 widely used, state-of-the-art ab initio and homology-based domain boundary predictors. Finally, we implemented TopDomainTMC, which accurately predicts whether domain parsing is necessary for the target protein.
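The tolerance-based scoring mentioned above (a predicted boundary counts as correct if it lies within ±20 or ±10 residues of a true boundary) can be sketched as below. The one-to-one greedy matching of sorted positions and the function name `boundary_f1` are assumptions made for illustration; the abstract does not specify the exact matching protocol.

```python
def boundary_f1(pred, true, tol=20):
    """F1 score for boundary positions, matching each true boundary to at
    most one predicted boundary within +/- tol residues (assumed protocol)."""
    pred = sorted(pred)
    true = sorted(true)
    i = j = tp = 0
    # Two-pointer sweep over both sorted lists.
    while i < len(pred) and j < len(true):
        if abs(pred[i] - true[j]) <= tol:
            tp += 1
            i += 1
            j += 1
        elif pred[i] < true[j]:
            i += 1  # this prediction is too far left to match anything
        else:
            j += 1  # this true boundary was missed
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Tightening `tol` from 20 to 10 can only keep or lower the score for a given prediction, which is consistent with the lower F1 reported at ±10 residues.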
Affiliation(s)
- Daniel Mulnaes
- Institut für Pharmazeutische und Medizinische Chemie, Heinrich-Heine-Universität Düsseldorf, Universitätsstr. 1, 40225 Düsseldorf, Germany
- Pegah Golchin
- Institut für Pharmazeutische und Medizinische Chemie, Heinrich-Heine-Universität Düsseldorf, Universitätsstr. 1, 40225 Düsseldorf, Germany
- Filip Koenig
- Institut für Pharmazeutische und Medizinische Chemie, Heinrich-Heine-Universität Düsseldorf, Universitätsstr. 1, 40225 Düsseldorf, Germany
- Holger Gohlke
- Institut für Pharmazeutische und Medizinische Chemie, Heinrich-Heine-Universität Düsseldorf, Universitätsstr. 1, 40225 Düsseldorf, Germany; John von Neumann Institute for Computing (NIC), Jülich Supercomputing Centre (JSC), Institute of Biological Information Processing (IBI-7: Structural Biochemistry) & Institute of Bio- and Geosciences (IBG-4: Bioinformatics), Forschungszentrum Jülich GmbH, 52425 Jülich, Germany
3
Wang Y, Zhang H, Zhong H, Xue Z. Protein domain identification methods and online resources. Comput Struct Biotechnol J 2021; 19:1145-1153. [PMID: 33680357 PMCID: PMC7895673 DOI: 10.1016/j.csbj.2021.01.041] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 11/08/2020] [Revised: 01/25/2021] [Accepted: 01/26/2021] [Indexed: 01/03/2023]
Abstract
Protein domains are the basic units of proteins that can fold, function, and evolve independently. Knowledge of protein domains is critical for protein classification, understanding their biological functions, annotating their evolutionary mechanisms and protein design. Thus, over the past two decades, a number of protein domain identification approaches have been developed, and a variety of protein domain databases have also been constructed. This review divides protein domain prediction methods into two categories, namely sequence-based and structure-based. These methods are introduced in detail, and their advantages and limitations are compared. Furthermore, this review also provides a comprehensive overview of popular online protein domain sequence and structure databases. Finally, we discuss potential improvements of these prediction methods.
Affiliation(s)
- Yan Wang
- Institute of Medical Artificial Intelligence, Binzhou Medical College, Yantai, Shandong 264003, China
- School of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
- Hang Zhang
- School of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
- Haolin Zhong
- School of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
- Zhidong Xue
- School of Software Engineering, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
4
Shi Q, Chen W, Huang S, Jin F, Dong Y, Wang Y, Xue Z. DNN-Dom: predicting protein domain boundary from sequence alone by deep neural network. Bioinformatics 2020; 35:5128-5136. [PMID: 31197306 DOI: 10.1093/bioinformatics/btz464] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 02/18/2019] [Revised: 05/07/2019] [Accepted: 06/05/2019] [Indexed: 11/13/2022]
Abstract
MOTIVATION: Accurate delineation of protein domain boundaries plays an important role in protein engineering and structure prediction. Although machine-learning methods are widely used to predict domain boundaries, these approaches often ignore long-range interactions among residues, which have been proven to improve prediction performance. However, how to simultaneously model local and global interactions to further improve domain boundary prediction is still a challenging problem. RESULTS: This article employs a hybrid deep learning method that combines convolutional neural network and gated recurrent unit models for domain boundary prediction. It not only captures the local and non-local interactions, but also fuses these features for prediction. Additionally, we adopt a balanced Random Forest for classification to deal with the high imbalance of samples and the high dimensionality of the deep features. Experimental results show that our proposed approach (DNN-Dom) outperforms existing machine-learning-based methods for boundary prediction. We expect that DNN-Dom can be useful for assisting protein structure and function prediction. AVAILABILITY AND IMPLEMENTATION: The method is available as DNN-Dom Server at http://isyslab.info/DNN-Dom/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Qiang Shi
- School of Software Engineering and College of Life Science & Technology, Huazhong University of Science and Technology, Wuhan 430074, China
- Weiya Chen
- School of Software Engineering and College of Life Science & Technology, Huazhong University of Science and Technology, Wuhan 430074, China
- Siqi Huang
- School of Software Engineering and College of Life Science & Technology, Huazhong University of Science and Technology, Wuhan 430074, China
- Fanglin Jin
- School of Software Engineering and College of Life Science & Technology, Huazhong University of Science and Technology, Wuhan 430074, China
- Yinghao Dong
- School of Software Engineering and College of Life Science & Technology, Huazhong University of Science and Technology, Wuhan 430074, China
- Yan Wang
- School of Software Engineering and College of Life Science & Technology, Huazhong University of Science and Technology, Wuhan 430074, China
- Zhidong Xue
- School of Software Engineering and College of Life Science & Technology, Huazhong University of Science and Technology, Wuhan 430074, China
5
Hong SH, Joo K, Lee J. ConDo: protein domain boundary prediction using coevolutionary information. Bioinformatics 2020; 35:2411-2417. [PMID: 30500873 DOI: 10.1093/bioinformatics/bty973] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 02/16/2018] [Revised: 11/15/2018] [Accepted: 11/29/2018] [Indexed: 11/13/2022]
Abstract
MOTIVATION: Domain boundary prediction is one of the most important problems in the study of protein structure and function. Many sequence-based domain boundary prediction methods are either template-based or machine learning (ML) based. ML-based methods often perform poorly because they use only local (i.e. short-range) features. These conventional features, such as sequence profiles, secondary structures and solvent accessibilities, are typically restricted to within 20 residues of the domain boundary candidate. RESULTS: To improve the performance of ML-based methods, we developed a new protein domain boundary prediction method (ConDo) that utilizes novel long-range features such as coevolutionary information, in addition to the aforementioned local window features, as inputs for ML. Toward this purpose, two types of coevolutionary information were extracted from multiple sequence alignments using direct coupling analysis: (i) partially aligned sequences, and (ii) correlated mutation information. Both the partially aligned sequence information and the modularity of residue-residue couplings possess long-range correlation information. AVAILABILITY AND IMPLEMENTATION: https://github.com/gicsaw/ConDo.git. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Keehyoung Joo
- Center for Advanced Computation, Korea Institute for Advanced Study, Korea
- Jooyoung Lee
- School of Computational Sciences; Center for Advanced Computation, Korea Institute for Advanced Study, Korea
6
Chen G, Chen J, Liu H, Chen S, Zhang Y, Li P, Thierry-Mieg D, Thierry-Mieg J, Mattes W, Ning B, Shi T. Comprehensive Identification and Characterization of Human Secretome Based on Integrative Proteomic and Transcriptomic Data. Front Cell Dev Biol 2019; 7:299. [PMID: 31824949 PMCID: PMC6881247 DOI: 10.3389/fcell.2019.00299] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 08/27/2019] [Accepted: 11/07/2019] [Indexed: 12/25/2022]
Abstract
Secreted proteins (SPs) play important roles in diverse biological processes; however, a comprehensive and high-quality list of human SPs is still lacking. Here we identified 6,943 high-confidence human SPs (3,522 of them novel) based on 330,427 human proteins derived from the UniProt, Ensembl, AceView, and RefSeq databases. Notably, 6,267 of 6,943 (90.3%) SPs are supported by evidence from a large amount of mass spectrometry (MS) and RNA-seq data. We found that the SPs were broadly expressed in diverse tissues as well as human body fluids, and a significant portion of them exhibited tissue-specific expression. Moreover, we identified 14 cancer-specific SPs whose expression levels were significantly associated with patient survival in eight different tumors; these could be potential prognostic biomarkers. Strikingly, 89.21% of the 6,943 SPs (including 2,927 novel SPs) contain known protein domains. The novel SPs were mainly enriched in known immunity-related domains, such as the Immunoglobulin V-set and C1-set domains. We also constructed a user-friendly and freely accessible database, SPRomeDB (www.unimd.org/SPRomeDB), to catalog these SPs. Our comprehensive SP identification and characterization provide insights into the human secretome and a valuable resource for future research.
Affiliation(s)
- Geng Chen
- The Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, The Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai, China
- Jiwei Chen
- The Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, The Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai, China
- Huanlong Liu
- The Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, The Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai, China
- Shuangguan Chen
- The Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, The Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai, China
- Yang Zhang
- The Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, The Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai, China
- Peng Li
- The Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, The Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai, China
- Danielle Thierry-Mieg
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, United States
- Jean Thierry-Mieg
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, United States
- William Mattes
- National Center for Toxicological Research, Food and Drug Administration, Jefferson City, AR, United States
- Baitang Ning
- National Center for Toxicological Research, Food and Drug Administration, Jefferson City, AR, United States
- Tieliu Shi
- The Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, The Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai, China
7
Wang Y, Wang J, Li R, Shi Q, Xue Z, Zhang Y. ThreaDomEx: a unified platform for predicting continuous and discontinuous protein domains by multiple-threading and segment assembly. Nucleic Acids Res 2019; 45:W400-W407. [PMID: 28498994 PMCID: PMC5793814 DOI: 10.1093/nar/gkx410] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 03/03/2017] [Accepted: 04/28/2017] [Indexed: 12/21/2022]
Abstract
We develop a hierarchical pipeline, ThreaDomEx, for both continuous domain (CD) and discontinuous domain (DCD) structure predictions. Starting from a query sequence, ThreaDomEx first threads it through the PDB to identify multiple structure templates, from which a profile of domain conservation score (DC-score) is derived for domain-segment assignment. To further detect DCDs that consist of separated segments along the sequence, a boundary-clustering algorithm is used to refine the DCD-linker locations. When the templates do not contain DCDs, a domain-segment assembly process, guided by symmetry comparison, is applied for further DCD detection. ThreaDomEx was tested on a set of 1111 proteins and achieved a normalized domain overlap score of 89.3% compared to experimental data, which is significantly higher than other state-of-the-art methods. It also recalls 26.7% of DCDs with 72.7% precision on the proteins for which threading failed to detect any DCDs. The server provides facilities for users to interactively refine the domain models by adjusting the DC-score threshold, deleting and adding domain linkers, and assembling domain segments. These facilities are particularly helpful for hard targets, for which current methods have low accuracy but human-expert knowledge and experimental insights can be used to refine models. The ThreaDomEx server is available at http://zhanglab.ccmb.med.umich.edu/ThreaDomEx.
Affiliation(s)
- Yan Wang
- Key Laboratory of Molecular Biophysics of the Ministry of Education, School of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China; Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
- Jian Wang
- Key Laboratory of Molecular Biophysics of the Ministry of Education, School of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
- Ruiming Li
- School of Software, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
- Qiang Shi
- School of Software, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
- Zhidong Xue
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA; School of Software, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
- Yang Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA; Department of Biological Chemistry, University of Michigan, Ann Arbor, MI 48109, USA
8
Abstract
Codon usage depends on mutation bias, tRNA-mediated selection, and the need for high efficiency and accuracy in translation. One codon in a synonymous codon family is often strongly over-used, especially in highly expressed genes, which often leads to a high dN/dS ratio because dS is very small. Many different codon usage indices have been proposed to measure codon usage and codon adaptation. Sense codons can be misread by release factors and stop codons by tRNAs, which also contributes to codon usage in rare cases. This chapter outlines the conceptual framework of codon evolution, illustrates codon-specific and gene-specific codon usage indices, and presents their applications. A new index of codon adaptation that accounts for background mutation bias (Index of Translation Elongation) is presented and contrasted with the codon adaptation index (CAI), which does not consider background mutation bias. The two indices are used to re-analyze data from a recent paper claiming that translation elongation efficiency matters little in protein production. The reanalysis disproves the claim.
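For readers unfamiliar with CAI, the following minimal sketch computes it in the classical way: each codon's relative adaptiveness is its count in a reference set of highly expressed genes divided by the count of the most frequent synonymous codon, and CAI is the geometric mean of these values over a gene. The three-family codon table, the 0.5 pseudo-count for unseen codons, and all function names are simplifying assumptions made here; the Index of Translation Elongation described in the chapter is not implemented.

```python
import math
from collections import Counter

# Toy subset of the standard genetic code: three two-codon families.
FAMILIES = {
    "K": ["AAA", "AAG"],
    "E": ["GAA", "GAG"],
    "F": ["TTT", "TTC"],
}
CODON_TO_AA = {c: aa for aa, fam in FAMILIES.items() for c in fam}


def codons(seq):
    """Split a coding sequence into complete codons."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_seqs):
    """w = codon count / count of the most-used synonymous codon."""
    counts = Counter(c for s in reference_seqs for c in codons(s)
                     if c in CODON_TO_AA)
    w = {}
    for fam in FAMILIES.values():
        top = max(counts[c] for c in fam)
        if top == 0:
            continue  # family absent from the reference set
        for c in fam:
            # Assumed pseudo-count of 0.5 keeps log(w) finite for unseen codons.
            w[c] = max(counts[c], 0.5) / top
    return w


def cai(seq, w):
    """Geometric mean of w over the codons of seq that have a w value."""
    logs = [math.log(w[c]) for c in codons(seq) if c in w]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))
```

A gene built only from the preferred codons of the reference set scores 1.0; every use of a less-preferred synonymous codon pulls the geometric mean down.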
9
Richa T, Ide S, Suzuki R, Ebina T, Kuroda Y. Fast H-DROP: A thirty times accelerated version of H-DROP for interactive SVM-based prediction of helical domain linkers. J Comput Aided Mol Des 2016; 31:237-244. [PMID: 28028736 DOI: 10.1007/s10822-016-9999-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/24/2016] [Accepted: 12/10/2016] [Indexed: 10/20/2022]
Abstract
Efficient and rapid prediction of domain regions from amino acid sequence information alone is often required for swift structural and functional characterization of large multi-domain proteins. Here we introduce Fast H-DROP, a thirty times accelerated version of our previously reported H-DROP (Helical Domain linker pRediction using OPtimal features), which is unique in specifically predicting helical domain linkers (boundaries). Fast H-DROP, analogously to H-DROP, uses optimum features selected from a set of 3000 features by combining a random forest and a stepwise feature selection protocol. We reduced the computational time from 8.5 min per sequence in H-DROP to 14 s per sequence in Fast H-DROP on an 8 Xeon processor Linux server by using SWISS-PROT instead of the GenBank non-redundant (nr) database for generating the PSSMs. The sensitivity and precision of Fast H-DROP assessed by cross-validation were 33.7% and 36.2%, respectively, which were merely ~2% lower than those of H-DROP. The reduced computational time of Fast H-DROP, achieved without materially affecting prediction performance, makes it more interactive and user-friendly. Fast H-DROP and H-DROP are freely available from http://domserv.lab.tuat.ac.jp/.
Affiliation(s)
- Tambi Richa
- Department of Biotechnology and Life Science, Tokyo University of Agriculture and Technology, 12-24-16 Nakamachi, Koganei-shi, Tokyo, 184-8588, Japan
- Soichiro Ide
- Department of Biotechnology and Life Science, Tokyo University of Agriculture and Technology, 12-24-16 Nakamachi, Koganei-shi, Tokyo, 184-8588, Japan
- Ryosuke Suzuki
- Department of Biotechnology and Life Science, Tokyo University of Agriculture and Technology, 12-24-16 Nakamachi, Koganei-shi, Tokyo, 184-8588, Japan
- Teppei Ebina
- Department of Biotechnology and Life Science, Tokyo University of Agriculture and Technology, 12-24-16 Nakamachi, Koganei-shi, Tokyo, 184-8588, Japan; Department of Physiology, Graduate school of Medicine, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-0033, Japan
- Yutaka Kuroda
- Department of Biotechnology and Life Science, Tokyo University of Agriculture and Technology, 12-24-16 Nakamachi, Koganei-shi, Tokyo, 184-8588, Japan
10
Chatterjee P, Basu S, Zubek J, Kundu M, Nasipuri M, Plewczynski D. PDP-CON: prediction of domain/linker residues in protein sequences using a consensus approach. J Mol Model 2016; 22:72. [PMID: 26969678 PMCID: PMC4788683 DOI: 10.1007/s00894-016-2933-0] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 09/30/2015] [Accepted: 02/17/2016] [Indexed: 01/04/2023]
Abstract
The prediction of domain/linker residues in protein sequences is a crucial task in the functional classification of proteins, homology-based protein structure prediction, and high-throughput structural genomics. In this work, a novel consensus-based machine-learning technique was applied for residue-level prediction of the domain/linker annotations in protein sequences using ordered/disordered regions along protein chains and a set of physicochemical properties. Six different classifiers (decision tree, Gaussian naïve Bayes, linear discriminant analysis, support vector machine, random forest, and multilayer perceptron) were exhaustively explored for the residue-level prediction of domain/linker regions. The protein sequences from the curated CATH database were used for training and cross-validation experiments. Test results obtained by applying the developed PDP-CON tool to the mutually exclusive, independent proteins of the CASP-8, CASP-9, and CASP-10 databases are reported. An n-star quality consensus approach was used to combine the results yielded by different classifiers. The average PDP-CON accuracy and F-measure values for the CASP targets were found to be 0.86 and 0.91, respectively. The dataset, source code, and all supplementary materials for this work are available at https://cmaterju.org/cmaterbioinfo/ for noncommercial use.
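One plausible reading of an n-star consensus is that a residue is labeled as linker only when at least n of the classifiers agree. The sketch below implements that reading; the interpretation, the 1/0 label encoding, and the function name `n_star_consensus` are assumptions made for illustration and are not taken from the PDP-CON source code.

```python
def n_star_consensus(predictions, n):
    """Combine per-residue labels from several classifiers.

    predictions: list of equal-length label lists, one per classifier,
                 with 1 = linker and 0 = domain (assumed encoding).
    n:           minimum number of agreeing classifiers for a linker call.
    """
    if not predictions:
        raise ValueError("need at least one classifier")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all classifiers must label the same residues")
    if not 1 <= n <= len(predictions):
        raise ValueError("n must be between 1 and the number of classifiers")
    # zip(*predictions) yields, for each residue, the votes of all classifiers.
    return [1 if sum(votes) >= n else 0 for votes in zip(*predictions)]
```

A small n favors sensitivity (any single classifier can trigger a linker call), while n equal to the number of classifiers demands unanimity and favors precision.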
Affiliation(s)
- Piyali Chatterjee
- Department of Computer Science and Engineering, Netaji Subhash Engineering College, Garia, Kolkata, 700152, India
- Subhadip Basu
- Department of Computer Science and Engineering, Jadavpur University, Kolkata, 700032, India
- Julian Zubek
- Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland; Center of New Technologies, University of Warsaw, Banacha 2c, 02-097, Warsaw, Poland
- Mahantapas Kundu
- Department of Computer Science and Engineering, Jadavpur University, Kolkata, 700032, India
- Mita Nasipuri
- Department of Computer Science and Engineering, Jadavpur University, Kolkata, 700032, India
- Dariusz Plewczynski
- Center of New Technologies, University of Warsaw, Banacha 2c, 02-097, Warsaw, Poland; Faculty of Pharmacy, Medical University of Warsaw, Warsaw, Poland
11
Mabrouk M, Werner T, Schneider M, Putz I, Brock O. Analysis of free modeling predictions by RBO aleph in CASP11. Proteins 2015; 84 Suppl 1:87-104. [PMID: 26492194 DOI: 10.1002/prot.24950] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 06/01/2015] [Revised: 09/28/2015] [Accepted: 10/19/2015] [Indexed: 12/15/2022]
Abstract
The CASP experiment is a biennial benchmark for assessing protein structure prediction methods. In CASP11, RBO Aleph ranked as one of the top-performing automated servers in the free modeling category. This category consists of targets for which structural templates are not easily retrievable. We analyze the performance of RBO Aleph and show that its success in CASP was a result of its ab initio structure prediction protocol. A detailed analysis of this protocol demonstrates that two components unique to our method greatly contributed to prediction quality: residue-residue contact prediction by EPC-map and contact-guided conformational space search by model-based search (MBS). Interestingly, our analysis also points to a possible fundamental problem in evaluating the performance of protein structure prediction methods: improvements in components of the method do not necessarily lead to improvements of the entire method. This suggests that these components interact in ways that are poorly understood. This problem, if confirmed, represents a significant obstacle to community-wide progress. Proteins 2016; 84(Suppl 1):87-104. © 2015 Wiley Periodicals, Inc.
Affiliation(s)
- Mahmoud Mabrouk
- Department of Electrical Engineering and Computer Science, Robotics and Biology Laboratory, Technische Universität Berlin, Berlin, 10587, Germany
- Tim Werner
- Department of Electrical Engineering and Computer Science, Robotics and Biology Laboratory, Technische Universität Berlin, Berlin, 10587, Germany
- Michael Schneider
- Department of Electrical Engineering and Computer Science, Robotics and Biology Laboratory, Technische Universität Berlin, Berlin, 10587, Germany
- Ines Putz
- Department of Electrical Engineering and Computer Science, Robotics and Biology Laboratory, Technische Universität Berlin, Berlin, 10587, Germany
- Oliver Brock
- Department of Electrical Engineering and Computer Science, Robotics and Biology Laboratory, Technische Universität Berlin, Berlin, 10587, Germany
12
Xue Z, Jang R, Govindarajoo B, Huang Y, Wang Y. Extending Protein Domain Boundary Predictors to Detect Discontinuous Domains. PLoS One 2015; 10:e0141541. [PMID: 26502173 PMCID: PMC4621036 DOI: 10.1371/journal.pone.0141541] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 07/03/2015] [Accepted: 10/10/2015] [Indexed: 11/18/2022]
Abstract
A variety of protein domain predictors have been developed in recent years to predict protein domain boundaries, but most of them cannot predict discontinuous domains. Considering that nearly 40% of multidomain proteins contain one or more discontinuous domains, we have developed DomEx to enable domain boundary predictors to detect discontinuous domains by assembling continuous domain segments. Discontinuous domains are predicted by matching the sequence profile of concatenated continuous domain segments with the profiles from a single-domain library derived from SCOP, CATH, and Pfam. The matches are then filtered by similarity to library templates, a symmetric index score and a profile-profile alignment score. DomEx recalled 32.3% of discontinuous domains with 86.5% precision when tested on 97 non-homologous protein chains containing 58 continuous and 99 discontinuous domains, in which the predicted domain segments are within ±20 residues of the boundary definitions in CATH 3.5. Compared with our recently developed predictor, ThreaDom, which is the state-of-the-art tool for detecting discontinuous domains, DomEx recalled 26.7% of discontinuous domains with 72.7% precision in a benchmark with 29 discontinuous-domain chains, where ThreaDom failed to predict any discontinuous domains. Furthermore, combined with ThreaDom, the method ranked number one among 10 predictors. The source code and datasets are available at https://github.com/xuezhidong/DomEx.
Affiliation(s)
- Zhidong Xue
- School of Software Engineering, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
- * E-mail: (ZX); (YW)
- Richard Jang
- School of Software Engineering, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, 48109, United States of America
- Brandon Govindarajoo
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, 48109, United States of America
- Yichu Huang
- School of Software Engineering, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
- Yan Wang
- School of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
- * E-mail: (ZX); (YW)
13
|
Mabrouk M, Putz I, Werner T, Schneider M, Neeb M, Bartels P, Brock O. RBO Aleph: leveraging novel information sources for protein structure prediction. Nucleic Acids Res 2015; 43:W343-8. [PMID: 25897112 PMCID: PMC4489312 DOI: 10.1093/nar/gkv357] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 02/24/2015] [Accepted: 04/03/2015] [Indexed: 02/02/2023] Open
Abstract
RBO Aleph is a novel protein structure prediction web server for template-based modeling, protein contact prediction and ab initio structure prediction. The server has a strong emphasis on modeling difficult protein targets for which templates cannot be detected. RBO Aleph's unique features are (i) the use of combined evolutionary and physicochemical information to perform residue–residue contact prediction and (ii) leveraging this contact information effectively in conformational space search. RBO Aleph emerged as one of the leading approaches to ab initio protein structure prediction and contact prediction during the most recent Critical Assessment of Protein Structure Prediction experiment (CASP11, 2014). In addition to RBO Aleph's main focus on ab initio modeling, the server also provides state-of-the-art template-based modeling services. Based on template availability, RBO Aleph switches automatically between template-based modeling and ab initio prediction based on the target protein sequence, facilitating use especially for non-expert users. The RBO Aleph web server offers a range of tools for visualization and data analysis, such as the visualization of predicted models, predicted contacts and the estimated prediction error along the model's backbone. The server is accessible at http://compbio.robotics.tu-berlin.de/rbo_aleph/.
Affiliation(s)
- Mahmoud Mabrouk
- Robotics and Biology Laboratory, Department of Electrical Engineering and Computer Science, Technische Universität Berlin, Marchstraße 23, 10587 Berlin, Germany
- Ines Putz
- Robotics and Biology Laboratory, Department of Electrical Engineering and Computer Science, Technische Universität Berlin, Marchstraße 23, 10587 Berlin, Germany
- Tim Werner
- Robotics and Biology Laboratory, Department of Electrical Engineering and Computer Science, Technische Universität Berlin, Marchstraße 23, 10587 Berlin, Germany
- Michael Schneider
- Robotics and Biology Laboratory, Department of Electrical Engineering and Computer Science, Technische Universität Berlin, Marchstraße 23, 10587 Berlin, Germany
- Moritz Neeb
- Robotics and Biology Laboratory, Department of Electrical Engineering and Computer Science, Technische Universität Berlin, Marchstraße 23, 10587 Berlin, Germany
- Philipp Bartels
- Robotics and Biology Laboratory, Department of Electrical Engineering and Computer Science, Technische Universität Berlin, Marchstraße 23, 10587 Berlin, Germany
- Oliver Brock
- Robotics and Biology Laboratory, Department of Electrical Engineering and Computer Science, Technische Universität Berlin, Marchstraße 23, 10587 Berlin, Germany
|
14
|
Shatnawi M, Zaki N. Inter-domain linker prediction using amino acid compositional index. Comput Biol Chem 2015; 55:23-30. [DOI: 10.1016/j.compbiolchem.2015.01.006] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 05/26/2014] [Revised: 01/22/2015] [Accepted: 01/22/2015] [Indexed: 10/24/2022]
|
15
|
PDP-RF: Protein Domain Boundary Prediction Using Random Forest Classifier. Lecture Notes in Computer Science 2015. [DOI: 10.1007/978-3-319-19941-2_42] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 01/03/2023]
|
16
|
Shatnawi M, Zaki N, Yoo PD. Protein inter-domain linker prediction using Random Forest and amino acid physiochemical properties. BMC Bioinformatics 2014; 15 Suppl 16:S8. [PMID: 25521329 PMCID: PMC4290662 DOI: 10.1186/1471-2105-15-s16-s8] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Protein chains are generally long and consist of multiple domains. Domains are distinct structural units of a protein that can evolve and function independently. The accurate prediction of protein domain linkers and boundaries is often regarded as the initial step of protein tertiary structure and function predictions. Such information not only enhances protein-targeted drug development but also reduces the experimental cost of protein analysis by allowing researchers to work on a set of smaller and independent units. In this study, we propose a novel and accurate domain-linker prediction approach based on protein primary structure information only. We utilize a nature-inspired machine-learning model called Random Forest along with a novel domain-linker profile that contains physiochemical and domain-linker information of amino acid sequences. RESULTS The proposed approach was tested on two well-known benchmark protein datasets and achieved 68% sensitivity and 99% precision, which is better than any existing protein domain-linker predictor. Without applying any data balancing technique such as class weighting and data re-sampling, the proposed approach is able to accurately classify inter-domain linkers from highly imbalanced datasets. CONCLUSION Our experimental results prove that the proposed approach is useful for domain-linker identification in highly imbalanced single- and multi-domain proteins.
|
17
|
H-DROP: an SVM based helical domain linker predictor trained with features optimized by combining random forest and stepwise selection. J Comput Aided Mol Des 2014; 28:831-9. [PMID: 24965847 DOI: 10.1007/s10822-014-9763-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 01/23/2014] [Accepted: 06/09/2014] [Indexed: 10/25/2022]
Abstract
Domain linker prediction is attracting much interest as it can help identify novel domains suitable for high-throughput proteomics analysis. Here, we report H-DROP, an SVM-based Helical Domain linker pRediction using OPtimal features. H-DROP is, to the best of our knowledge, the first predictor for specifically and effectively identifying helical linkers. This was made possible first because a large training dataset became available from IS-Dom, and second because we selected a small number of optimal features from a huge number of potential ones. The training helical linker dataset, which included 261 helical linkers, was constructed by detecting helical residues at the boundary regions of two independent structural domains listed in our previously reported IS-Dom dataset. 45 optimal feature candidates were selected from 3,000 features by random forest, which were further reduced to 26 optimal features by stepwise selection. The prediction sensitivity and precision of H-DROP were 35.2% and 38.8%, respectively. These values were over 10.7% higher than those of control methods including our previously developed DROP, which is a coil linker predictor, and PPRODO, which is trained with undifferentiated domain boundary sequences. Overall, these results indicate that helical linkers can be predicted from sequence information alone by using a strictly curated training dataset for helical linkers and a carefully selected set of optimal features. H-DROP is available at http://domserv.lab.tuat.ac.jp.
|
18
|
Xue Z, Xu D, Wang Y, Zhang Y. ThreaDom: extracting protein domain boundary information from multiple threading alignments. Bioinformatics 2013; 29:i247-56. [PMID: 23812990 PMCID: PMC3694664 DOI: 10.1093/bioinformatics/btt209] [Citation(s) in RCA: 59] [Impact Index Per Article: 5.4] [Indexed: 11/15/2022] Open
Abstract
Motivation: Protein domains are subunits that can fold and evolve independently. Identification of domain boundary locations is often the first step in protein folding and function annotations. Most of the current methods deduce domain boundaries by sequence-based analysis, which has low accuracy. There is no efficient method for predicting discontinuous domains that consist of segments from separated sequence regions. As template-based methods are most efficient for protein 3D structure modeling, combining multiple threading alignment information should increase the accuracy and reliability of computational domain predictions. Results: We developed a new protein domain predictor, ThreaDom, which deduces domain boundary locations based on multiple threading alignments. The core of the method development is the derivation of a domain conservation score that combines information from template domain structures and terminal and internal alignment gaps. Tested on 630 non-redundant sequences, without using homologous templates, ThreaDom generates correct single- and multi-domain classifications in 81% of cases, where 78% have the domain linker assigned within ±20 residues. In a second test on 486 proteins with discontinuous domains, ThreaDom achieves an average precision of 84% and recall of 65% in domain boundary prediction. Finally, ThreaDom was examined on 56 targets from CASP8 and had a domain overlap rate of 73%, 87% and 85% with the target for Free Modeling, Hard multiple-domain and discontinuous domain proteins, respectively, which are significantly higher than those of most domain predictors in CASP8. Similar results were achieved on the targets from the most recent CASP9 and CASP10 experiments. Availability: http://zhanglab.ccmb.med.umich.edu/ThreaDom/. Contact: zhng@umich.edu Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Zhidong Xue
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
|
19
|
PPM-Dom: A novel method for domain position prediction. Comput Biol Chem 2013; 47:8-15. [DOI: 10.1016/j.compbiolchem.2013.06.002] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 04/07/2013] [Revised: 06/05/2013] [Accepted: 06/05/2013] [Indexed: 02/05/2023]
|
20
|
Zhang XY, Lu LJ, Song Q, Yang QQ, Li DP, Sun JM, Li TH, Cong PS. DomHR: accurately identifying domain boundaries in proteins using a hinge region strategy. PLoS One 2013; 8:e60559. [PMID: 23593247 PMCID: PMC3623903 DOI: 10.1371/journal.pone.0060559] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 11/03/2012] [Accepted: 02/27/2013] [Indexed: 11/18/2022] Open
Abstract
Motivation The precise prediction of protein domains, which are the structural, functional and evolutionary units of proteins, has been a research focus in recent years. Although many methods have been presented for predicting protein domains and boundaries, the accuracy of predictions could be improved. Results In this study we present a novel approach, DomHR, which is an accurate predictor of protein domain boundaries based on a creative hinge region strategy. A hinge region was defined as a segment of amino acids that covers part of a domain region and a boundary region. We developed a strategy to construct profiles of domain-hinge-boundary (DHB) features generated by sequence-domain/hinge/boundary alignment against a database of known domain structures. The DHB features had three elements: normalized domain, hinge, and boundary probabilities. The DHB features were used as input to identify domain boundaries in a sequence. DomHR used a nonredundant dataset as the training set, the DHB and predicted shape string as features, and a conditional random field as the classification algorithm. In predicted hinge regions, a residue was determined to be a domain or a boundary according to a decision threshold. After decision thresholds were optimized, DomHR was evaluated by cross-validation, large-scale prediction, independent test and CASP (Critical Assessment of Techniques for Protein Structure Prediction) tests. All results confirmed that DomHR outperformed other well-established, publicly available domain boundary predictors for prediction accuracy. Availability The DomHR is available at http://cal.tongji.edu.cn/domain/.
Affiliation(s)
- Xiao-yan Zhang
- Department of Chemistry, Tongji University, Shanghai, China
- Long-jian Lu
- Department of Chemistry, Tongji University, Shanghai, China
- Qi Song
- Department of Chemistry, Tongji University, Shanghai, China
- Qian-qian Yang
- Department of Chemistry, Tongji University, Shanghai, China
- Da-peng Li
- Department of Chemistry, Tongji University, Shanghai, China
- Jiang-ming Sun
- Department of Chemistry, Tongji University, Shanghai, China
- Tong-hua Li
- Department of Chemistry, Tongji University, Shanghai, China
- * E-mail: (T-HL); (P-SC) (PC)
- Pei-sheng Cong
- Department of Chemistry, Tongji University, Shanghai, China
- * E-mail: (T-HL); (P-SC) (PC)
|
21
|
Xia X. Position weight matrix, Gibbs sampler, and the associated significance tests in motif characterization and prediction. Scientifica 2012; 2012:917540. [PMID: 24278755 PMCID: PMC3820676 DOI: 10.6064/2012/917540] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.8] [Received: 08/22/2012] [Accepted: 10/11/2012] [Indexed: 05/31/2023]
Abstract
Position weight matrix (PWM) is not only one of the most widely used bioinformatic methods, but also a key component in more advanced computational algorithms (e.g., Gibbs sampler) for characterizing and discovering motifs in nucleotide or amino acid sequences. However, few generally applicable statistical tests are available for evaluating the significance of site patterns, PWM, and PWM scores (PWMS) of putative motifs. Statistical significance tests of the PWM output, that is, site-specific frequencies, PWM itself, and PWMS, are in disparate sources and have never been collected in a single paper, with the consequence that many implementations of PWM do not include any significance test. Here I review PWM-based methods used in motif characterization and prediction (including a detailed illustration of the Gibbs sampler for de novo motif discovery), present statistical and probabilistic rationales behind statistical significance tests relevant to PWM, and illustrate their application with real data. The multiple comparison problem associated with the test of site-specific frequencies is best handled by false discovery rate methods. The test of PWM, due to the use of pseudocounts, is best done by resampling methods. The test of individual PWMS for each sequence segment should be based on the extreme value distribution.
Affiliation(s)
- Xuhua Xia
- Department of Biology, University of Ottawa, 30 Marie Curie, Ottawa, ON, Canada K1N 6N5
|
22
|
Cheng J, Li J, Wang Z, Eickholt J, Deng X. The MULTICOM toolbox for protein structure prediction. BMC Bioinformatics 2012; 13:65. [PMID: 22545707 PMCID: PMC3495398 DOI: 10.1186/1471-2105-13-65] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.0] [Received: 01/20/2012] [Accepted: 04/30/2012] [Indexed: 12/31/2022] Open
Abstract
Background As genome sequencing is becoming routine in biomedical research, the total number of protein sequences is increasing exponentially, recently reaching over 108 million. However, only a tiny portion of these proteins (i.e. ~75,000 or < 0.07%) have solved tertiary structures determined by experimental techniques. The gap between protein sequence and structure continues to enlarge rapidly as the throughput of genome sequencing techniques is much higher than that of protein structure determination techniques. Computational software tools for predicting protein structure and structural features from protein sequences are crucial to make use of this vast repository of protein resources. Results To meet the need, we have developed a comprehensive MULTICOM toolbox consisting of a set of protein structure and structural feature prediction tools. These tools include secondary structure prediction, solvent accessibility prediction, disorder region prediction, domain boundary prediction, contact map prediction, disulfide bond prediction, beta-sheet topology prediction, fold recognition, multiple template combination and alignment, template-based tertiary structure modeling, protein model quality assessment, and mutation stability prediction. Conclusions These tools have been rigorously tested by many users in the last several years and/or during the last three rounds of the Critical Assessment of Techniques for Protein Structure Prediction (CASP7-9) from 2006 to 2010, achieving state-of-the-art or near performance. In order to facilitate bioinformatics research and technological development in the field, we have made the MULTICOM toolbox freely available as web services and/or software packages for academic use and scientific research. It is available at http://sysbio.rnet.missouri.edu/multicom_toolbox/.
Affiliation(s)
- Jianlin Cheng
- Department of Computer Science, University of Missouri-Columbia, Columbia, MO 65211, USA.
|
23
|
Eickholt J, Deng X, Cheng J. DoBo: Protein domain boundary prediction by integrating evolutionary signals and machine learning. BMC Bioinformatics 2011; 12:43. [PMID: 21284866 PMCID: PMC3036623 DOI: 10.1186/1471-2105-12-43] [Citation(s) in RCA: 47] [Impact Index Per Article: 3.6] [Received: 10/15/2010] [Accepted: 02/01/2011] [Indexed: 11/17/2022] Open
Abstract
Background Accurate identification of protein domain boundaries is useful for protein structure determination and prediction. However, predicting protein domain boundaries from a sequence is still very challenging and largely unsolved. Results We developed a new method to integrate the classification power of machine learning with evolutionary signals embedded in protein families in order to improve protein domain boundary prediction. The method first extracts putative domain boundary signals from a multiple sequence alignment between a query sequence and its homologs. The putative sites are then classified and scored by support vector machines in conjunction with input features such as sequence profiles, secondary structures, solvent accessibilities around the sites and their positions. The method was evaluated on a domain benchmark by 10-fold cross-validation and 60% of true domain boundaries can be recalled at a precision of 60%. The trade-off between the precision and recall can be adjusted according to specific needs by using different decision thresholds on the domain boundary scores assigned by the support vector machines. Conclusions The good prediction accuracy and the flexibility of selecting domain boundary sites at different precision and recall values make our method a useful tool for protein structure determination and modelling. The method is available at http://sysbio.rnet.missouri.edu/dobo/.
Affiliation(s)
- Jesse Eickholt
- Department of Computer Science, University of Missouri, Columbia, MO 65211, USA
|
24
|
Ebina T, Toh H, Kuroda Y. DROP: an SVM domain linker predictor trained with optimal features selected by random forest. Bioinformatics 2010; 27:487-94. [PMID: 21169376 DOI: 10.1093/bioinformatics/btq700] [Citation(s) in RCA: 55] [Impact Index Per Article: 3.9] [Indexed: 11/13/2022]
Abstract
MOTIVATION Biologically important proteins are often large, multidomain proteins, which are difficult to characterize by high-throughput experimental methods. Efficient domain/boundary predictions are thus increasingly required in diverse areas of proteomics research for computationally dissecting proteins into readily analyzable domains. RESULTS We constructed a support vector machine (SVM)-based domain linker predictor, DROP (Domain linker pRediction using OPtimal features), which was trained with 25 optimal features. The optimal combination of features was identified from a set of 3000 features using a random forest algorithm complemented with a stepwise feature selection. DROP demonstrated a prediction sensitivity and precision of 41.3% and 49.4%, respectively. These values were over 19.9% higher than those of control SVM predictors trained with non-optimized features, strongly suggesting the efficiency of our feature selection method. In addition, the mean NDO-Score of DROP for predicting novel domains in seven CASP8 FM multidomain proteins was 0.760, which was higher than that of any of the 12 published CASP8 DP servers. Overall, these results indicate that the SVM prediction of domain linkers can be improved by identifying optimal features that best distinguish linker from non-linker regions.
Affiliation(s)
- Teppei Ebina
- Department of Biotechnology and Life Science, Tokyo University of Agriculture and Technology, Koganei-shi, Tokyo 184-8588, Japan
|
25
|
Yoo PD, Shwen Ho Y, Ng J, Charleston M, Saksena NK, Yang P, Zomaya AY. Hierarchical kernel mixture models for the prediction of AIDS disease progression using HIV structural gp120 profiles. BMC Genomics 2010; 11 Suppl 4:S22. [PMID: 21143806 PMCID: PMC3005921 DOI: 10.1186/1471-2164-11-s4-s22] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 11/23/2022] Open
Abstract
Changes to the glycosylation profile on HIV gp120 can influence viral pathogenesis and alter AIDS disease progression. The characterization of glycosylation differences at the sequence level is inadequate as the placement of carbohydrates is structurally complex. However, no structural framework is available to date for the study of HIV disease progression. In this study, we propose a novel machine-learning based framework for the prediction of AIDS disease progression in three stages (RP, SP, and LTNP) using the HIV structural gp120 profile. This new intelligent framework proves to be accurate and provides an important benchmark for predicting AIDS disease progression computationally. The model is trained using a novel HIV gp120 glycosylation structural profile to detect possible stages of AIDS disease progression for the target sequences of HIV+ individuals. The performance of the proposed model was compared to seven existing different machine-learning models on newly proposed gp120-Benchmark_1 dataset in terms of error-rate (MSE), accuracy (CCI), stability (STD), and complexity (TBM). The novel framework showed better predictive performance with 67.82% CCI, 30.21 MSE, 0.8 STD, and 2.62 TBM on the three stages of AIDS disease progression of 50 HIV+ individuals. This framework is an invaluable bioinformatics tool that will be useful to the clinical assessment of viral pathogenesis.
Affiliation(s)
- Paul D Yoo
- Centre for Distributed and High Performance Computing, University of Sydney, NSW, Australia
|
26
|
Gorbalenya AE, Lieutaud P, Harris MR, Coutard B, Canard B, Kleywegt GJ, Kravchenko AA, Samborskiy DV, Sidorov IA, Leontovich AM, Jones TA. Practical application of bioinformatics by the multidisciplinary VIZIER consortium. Antiviral Res 2010; 87:95-110. [PMID: 20153379 PMCID: PMC7172516 DOI: 10.1016/j.antiviral.2010.02.005] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.8] [Received: 12/22/2009] [Revised: 02/03/2010] [Accepted: 02/04/2010] [Indexed: 01/03/2023]
Abstract
This review focuses on bioinformatics technologies employed by the EU-sponsored multidisciplinary VIZIER consortium (Comparative Structural Genomics of Viral Enzymes Involved in Replication, FP6 PROJECT: 2004-511960, active from 1 November 2004 to 30 April 2009), to achieve its goals. From the management of the information flow of the project, to bioinformatics-mediated selection of RNA viruses and prediction of protein targets, to the analysis of 3D protein structures and antiviral compounds, these technologies provided a communication framework and integrated solutions for steady and timely advancement of the project. RNA viruses form a large class of major pathogens that affect humans and domestic animals. Such RNA viruses as HIV, Influenza virus and Hepatitis C virus are of prime medical concern today, but the identities of viruses that will threaten human population tomorrow are far from certain. To contain outbreaks of common or newly emerging infections, prototype drugs against viruses representing the Virus Universe must be developed. This concept was championed by the VIZIER project which brought together experts in diverse fields to produce a concerted and sustained effort for identifying and validating targets for antivirus therapy in dozens of RNA virus lineages.
Affiliation(s)
- Alexander E. Gorbalenya
- Molecular Virology Laboratory, Department of Medical Microbiology, Center for Infectious Diseases, Leiden University Medical Center, P.O. Box 9600, E4-P, 2300 RC Leiden, The Netherlands
- A.N. Belozersky Institute of Physico-Chemical Biology, Moscow State University, Moscow 119899, Russia
- Philippe Lieutaud
- Laboratoire Architecture et Fonction des Macromolécules Biologiques, UMR 6098, AFMB-CNRS-ESIL, Case 925, 163 Avenue de Luminy, 13288 Marseille, France
- Mark R. Harris
- Department of Cell and Molecular Biology, Uppsala University, Biomedical Center, Box 596, SE-751 24 Uppsala, Sweden
- Bruno Coutard
- Laboratoire Architecture et Fonction des Macromolécules Biologiques, UMR 6098, AFMB-CNRS-ESIL, Case 925, 163 Avenue de Luminy, 13288 Marseille, France
- Bruno Canard
- Laboratoire Architecture et Fonction des Macromolécules Biologiques, UMR 6098, AFMB-CNRS-ESIL, Case 925, 163 Avenue de Luminy, 13288 Marseille, France
- Gerard J. Kleywegt
- Department of Cell and Molecular Biology, Uppsala University, Biomedical Center, Box 596, SE-751 24 Uppsala, Sweden
- Alexander A. Kravchenko
- A.N. Belozersky Institute of Physico-Chemical Biology, Moscow State University, Moscow 119899, Russia
- Dmitry V. Samborskiy
- Molecular Virology Laboratory, Department of Medical Microbiology, Center for Infectious Diseases, Leiden University Medical Center, P.O. Box 9600, E4-P, 2300 RC Leiden, The Netherlands
- Igor A. Sidorov
- Molecular Virology Laboratory, Department of Medical Microbiology, Center for Infectious Diseases, Leiden University Medical Center, P.O. Box 9600, E4-P, 2300 RC Leiden, The Netherlands
- Andrey M. Leontovich
- A.N. Belozersky Institute of Physico-Chemical Biology, Moscow State University, Moscow 119899, Russia
- T. Alwyn Jones
- Department of Cell and Molecular Biology, Uppsala University, Biomedical Center, Box 596, SE-751 24 Uppsala, Sweden
|
27
|
Chen P, Liu C, Burge L, Li J, Mohammad M, Southerland W, Gloster C, Wang B. DomSVR: domain boundary prediction with support vector regression from sequence information alone. Amino Acids 2010; 39:713-26. [PMID: 20165918 DOI: 10.1007/s00726-010-0506-6] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.4] [Received: 09/23/2009] [Accepted: 01/25/2010] [Indexed: 11/24/2022]
Abstract
Protein domains are the fundamental structural and functional units of proteins. Information about protein domain boundaries is helpful in understanding the evolution, structures and functions of proteins, and also plays an important role in protein classification. In this paper, we propose a support vector regression-based method to address the problem of protein domain boundary identification based on novel input profiles extracted from the AAindex database. Our method achieves an average sensitivity of approximately 36.5% and an average specificity of approximately 81% for multi-domain protein chains, which is overall better than the performance of published approaches for identifying domain boundaries. Because our method uses sequence information alone, it is also simpler and faster.
Affiliation(s)
- Peng Chen
- Department of Systems and Computer Science, Howard University, 2400 Sixth Street, NW, Washington, DC 20059, USA.
|
28
|
Yoo PD, Zhou BB, Zomaya AY. A modular kernel approach for integrative analysis of protein domain boundaries. BMC Genomics 2009; 10 Suppl 3:S21. [PMID: 19958485 PMCID: PMC2788374 DOI: 10.1186/1471-2164-10-s3-s21] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 12/25/2022] Open
Abstract
BACKGROUND In this paper, we introduce a novel inter-range interaction integrated approach for protein domain boundary prediction. It involves (1) the design of a modular kernel algorithm, which is able to effectively exploit the information of non-local interactions in amino acids, and (2) the development of a novel profile that can provide suitable information to the algorithm. One of the key features of this profiling technique is the use of multiple structural alignments of remote homologues to create an extended sequence profile that combines the structural information with suitable chemical information that plays an important role in protein stability. This profile can capture the sequence characteristics of an entire structural superfamily and extend a range of profiles generated from sequence similarity alone. RESULTS Our novel profile, which combines homology information with hydrophobicity from the SARAH1 scale, was successful in providing more structural and chemical information. In addition, the modular approach adopted in our algorithm proved to be effective in capturing information from non-local interactions. Our approach achieved 82.1%, 50.9% and 31.5% accuracies for one-domain, two-domain, and three- and more domain proteins respectively. CONCLUSION The experimental results in this study are encouraging; however, more work is needed to extend it to a broader range of applications. We are currently developing a novel interactive (human in the loop) profiling that can provide information from more distantly related homology. This approach will further enhance the current study.
Collapse
Affiliation(s)
- Paul D Yoo
- Advanced Networks Research Group, School of Information Technologies (J12), the University of Sydney, NSW 2006, Australia.
|
29
|
Xue B, Faraggi E, Zhou Y. Predicting residue-residue contact maps by a two-layer, integrated neural-network method. Proteins 2009; 76:176-83. [PMID: 19137600 DOI: 10.1002/prot.22329] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.3] [Indexed: 11/07/2022]
Abstract
A neural network method (SPINE-2D) is introduced to provide a sequence-based prediction of residue-residue contact maps. This method is built on the success of SPINE in predicting secondary structure, residue solvent accessibility, and backbone torsion angles via large-scale training with overfit protection and a two-layer neural network. SPINE-2D achieved a 10-fold cross-validated accuracy of 47% (+/-2%) for top L/5 predicted contacts between two residues with sequence separation of six or more and an accuracy of 24 +/- 1% for nonlocal contacts with sequence separation of 24 residues or more. The accuracies of 23% and 26% for nonlocal contact predictions are achieved for two independent datasets of 500 proteins and 82 CASP 7 targets, respectively. A comparison with other methods indicates that SPINE-2D is among the most accurate methods for contact-map prediction. SPINE-2D is available as a webserver at http://sparks.informatics.iupui.edu.
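The top L/5 accuracy quoted in this abstract can be illustrated with a small sketch. It assumes one common convention: rank the residue pairs whose sequence separation meets a cutoff by predicted score, keep the best L/5, and report the fraction that are true contacts. The function name, the dictionary-based input format and the toy data are hypothetical and are not taken from SPINE-2D.

```python
def top_fraction_precision(scores, contacts, length, min_sep=6, divisor=5):
    """Fraction of the top length//divisor predicted pairs that are true contacts.

    scores   -- dict mapping (i, j) with i < j to a predicted contact score
    contacts -- set of (i, j) pairs that are in contact in the real structure
    Only pairs with sequence separation j - i >= min_sep are ranked.
    """
    ranked = sorted(
        (pair for pair in scores if pair[1] - pair[0] >= min_sep),
        key=lambda pair: scores[pair],
        reverse=True,
    )
    top = ranked[: max(1, length // divisor)]
    if not top:
        return 0.0
    return sum(pair in contacts for pair in top) / len(top)


# Toy 10-residue example (hypothetical data): 10 // 5 = 2 pairs are evaluated.
toy_scores = {(0, 9): 0.9, (1, 8): 0.8, (0, 1): 1.0, (2, 9): 0.1}
toy_contacts = {(0, 9), (0, 1)}
precision = top_fraction_precision(toy_scores, toy_contacts, length=10)
print(precision)
```

In the toy data the highest-scoring pair (0, 1) is ignored because its separation is below the cutoff, so only (0, 9) and (1, 8) are evaluated.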
Affiliation(s)
- Bin Xue
- Indiana University School of Informatics, Indiana University-Purdue University, Indianapolis, Indiana 46202, USA
|
30
|
Walsh I, Martin AJM, Mooney C, Rubagotti E, Vullo A, Pollastri G. Ab initio and homology based prediction of protein domains by recursive neural networks. BMC Bioinformatics 2009; 10:195. [PMID: 19558651 PMCID: PMC2711945 DOI: 10.1186/1471-2105-10-195] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Received: 10/17/2008] [Accepted: 06/26/2009] [Indexed: 11/10/2022] Open
Abstract
Background Proteins, especially larger ones, are often composed of individual evolutionary units, domains, which have their own function and structural fold. Predicting domains is an important intermediate step in protein analyses, including the prediction of protein structures. Results We describe novel systems for the prediction of protein domain boundaries powered by Recursive Neural Networks. The systems rely on a combination of primary sequence and evolutionary information, predictions of structural features such as secondary structure, solvent accessibility and residue contact maps, and structural templates, both annotated for domains (from the SCOP dataset) and unannotated (from the PDB). We gauge the contribution of contact maps, and PDB and SCOP templates independently and for different ranges of template quality. We find that accurately predicted contact maps are informative for the prediction of domain boundaries, while the same is not true for contact maps predicted ab initio. We also find that gap information from PDB templates is informative, but, not surprisingly, less than SCOP annotations. We test both systems trained on templates of all qualities, and systems trained only on templates of marginal similarity to the query (less than 25% sequence identity). While the first batch of systems produces near perfect predictions in the presence of fair to good templates, the second batch outperforms or matches ab initio predictors down to essentially any level of template quality. We test all systems in 5-fold cross-validation on a large non-redundant set of multi-domain and single domain proteins. The final predictors are state-of-the-art, with a template-less prediction boundary recall of 50.8% (precision 38.7%) within ± 20 residues and a single domain recall of 80.3% (precision 78.1%).
The SCOP-based predictors achieve a boundary recall of 74% (precision 77.1%) again within ± 20 residues, and classify single domain proteins as such in over 85% of cases, when we allow a mix of bad and good quality templates. If we only allow marginal templates (max 25% sequence identity to the query) the scores remain high, with boundary recall and precision of 59% and 66.3%, and 80% of all single domain proteins predicted correctly. Conclusion The systems presented here may prove useful in large-scale annotation of protein domains in proteins of unknown structure. The methods are available as public web servers at the address: and we plan on running them on a multi-genomic scale and make the results public in the near future.
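The boundary recall and precision "within ± 20 residues" reported in this abstract can be sketched as follows. The abstract does not give the exact matching rules, so this is a minimal illustration under an assumed convention: a predicted boundary counts as correct if it lies within the tolerance of any true boundary, and a true boundary counts as recovered if any prediction lies within the tolerance. The function name and the example positions are hypothetical.

```python
def boundary_precision_recall(predicted, true, tolerance=20):
    """Precision and recall of predicted boundary positions within a tolerance.

    predicted, true -- lists of residue indices marking domain boundaries
    """
    correct_predictions = [
        p for p in predicted if any(abs(p - t) <= tolerance for t in true)
    ]
    recovered_boundaries = [
        t for t in true if any(abs(p - t) <= tolerance for p in predicted)
    ]
    precision = len(correct_predictions) / len(predicted) if predicted else 0.0
    recall = len(recovered_boundaries) / len(true) if true else 0.0
    return precision, recall


# Hypothetical example: three predicted boundaries, two true boundaries.
precision, recall = boundary_precision_recall([100, 250, 400], [110, 300])
print(precision, recall)
```

Here only the prediction at residue 100 falls within 20 residues of a true boundary (110), so one of three predictions is correct and one of two true boundaries is recovered.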
Affiliation(s)
- Ian Walsh
- School of Computer Science and Informatics, University College Dublin, Belfield, Dublin 4, Ireland.
|
31
|
Ebina T, Toh H, Kuroda Y. Loop-length-dependent SVM prediction of domain linkers for high-throughput structural proteomics. Biopolymers 2009; 92:1-8. [PMID: 18844295 DOI: 10.1002/bip.21105] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.9] [Indexed: 11/11/2022]
Abstract
The prediction of structural domains in novel protein sequences is becoming of practical importance. One important area of application is the development of computer-aided techniques for identifying, at a low cost, novel protein domain targets for large-scale functional and structural proteomics. Here, we report a loop-length-dependent support vector machine (SVM) prediction of domain linkers, which are loops separating two structural domains. (DLP-SVM is freely available at: http://www.tuat.ac.jp/~domserv/cgi-bin/DLP-SVM.cgi.) We constructed three loop-length-dependent SVM predictors of domain linkers (SVM-All, SVM-Long and SVM-Short), and also built SVM-Joint, which combines the results of SVM-Short and SVM-Long into a single consolidated prediction. The performances of SVM-Joint were, in most aspects, the highest, with a sensitivity of 59.7% and a specificity of 43.6%, which indicated that the specificity and the sensitivity were improved by over 2 and 3% respectively, when loop-length-dependent characteristics were taken into account. Furthermore, the sensitivity and specificity of SVM-Joint were, respectively, 37.6 and 17.4% higher than those of a random guess, and also superior to those of previously reported domain linker predictors. These results indicate that SVMs can be used to predict domain linkers, and that loop-length-dependent characteristics are useful for improving SVM prediction performances.
Affiliation(s)
- Teppei Ebina
- Department of Biotechnology and Life Science, Tokyo University of Agriculture and Technology, 12-24-16 Naka-machi, Koganei-shi, Tokyo 184-8588, Japan
|
32
|
Bondugula R, Lee MS, Wallqvist A. FIEFDom: a transparent domain boundary recognition system using a fuzzy mean operator. Nucleic Acids Res 2008; 37:452-62. [PMID: 19056827 PMCID: PMC2632928 DOI: 10.1093/nar/gkn944] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.4] [Indexed: 11/12/2022] Open
Abstract
Protein domain prediction is often the preliminary step in both experimental and computational protein research. Here we present a new method to predict the domain boundaries of a multidomain protein from its amino acid sequence using a fuzzy mean operator. Using the nr-sequence database together with a reference protein set (RPS) containing known domain boundaries, the operator is used to assign a likelihood value for each residue of the query sequence as belonging to a domain boundary. This procedure robustly identifies contiguous boundary regions. For a dataset with a maximum sequence identity of 30%, the average domain prediction accuracy of our method is 97% for one domain proteins and 58% for multidomain proteins. The presented model is capable of using new sequence/structure information without re-parameterization after each RPS update. When tested on a current database using a four year old RPS and on a database that contains different domain definitions than those used to train the models, our method consistently yielded the same accuracy while two other published methods did not. A comparison with other domain prediction methods used in the CASP7 competition indicates that our method performs better than existing sequence-based methods.
Affiliation(s)
- Rajkumar Bondugula
- Biotechnology HPC Software Applications Institute, Telemedicine and Advanced Technology Research Center, U.S. Army Medical Research and Materiel Command, Fort Detrick, MD 21702, USA.
|
33
|
Wu Y, Dousis AD, Chen M, Li J, Ma J. OPUS-Dom: applying the folding-based method VECFOLD to determine protein domain boundaries. J Mol Biol 2008; 385:1314-29. [PMID: 19026662 DOI: 10.1016/j.jmb.2008.10.093] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Received: 08/27/2008] [Revised: 10/29/2008] [Accepted: 10/31/2008] [Indexed: 10/21/2022]
Abstract
In this article, we present a de novo method for predicting protein domain boundaries, called OPUS-Dom. The core of the method is a novel coarse-grained folding method, VECFOLD, which constructs low-resolution structural models from a target sequence by folding a chain of vectors representing the predicted secondary-structure elements. OPUS-Dom generates a large ensemble of folded structure decoys by VECFOLD and labels the domain boundaries of each decoy by a domain parsing algorithm. Consensus domain boundaries are then derived from the statistical distribution of the putative boundaries and three empirical sequence-based domain profiles. OPUS-Dom generally outperformed several state-of-the-art domain prediction algorithms over various benchmark protein sets. Even though each VECFOLD-generated structure contains large errors, collectively these structures provide a more robust delineation of domain boundaries. The success of OPUS-Dom suggests that the arrangement of protein domains is more a consequence of limited coordination patterns per domain arising from tertiary packing of secondary-structure segments, rather than sequence-specific constraints.
Affiliation(s)
- Yinghao Wu
- Department of Bioengineering, Rice University, Houston, TX 77005, USA
|
34
|
Yoo PD, Sikder AR, Taheri J, Zhou BB, Zomaya AY. DomNet: Protein Domain Boundary Prediction Using Enhanced General Regression Network and New Profiles. IEEE Trans Nanobioscience 2008; 7:172-81. [DOI: 10.1109/tnb.2008.2000747] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.6] [Indexed: 11/09/2022]
|
35
|
Yoo PD, Sikder AR, Zhou BB, Zomaya AY. Improved general regression network for protein domain boundary prediction. BMC Bioinformatics 2008; 9 Suppl 1:S12. [PMID: 18315843 PMCID: PMC2259413 DOI: 10.1186/1471-2105-9-s1-s12] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.0] [Indexed: 11/16/2022] Open
Abstract
Background Protein domains present some of the most useful information that can be used to understand protein structure and functions. Recent research on protein domain boundary prediction has been mainly based on widely known machine learning techniques, such as Artificial Neural Networks and Support Vector Machines. In this study, we propose a new machine learning model (IGRN) that can achieve accurate and reliable classification, with significantly reduced computations. The IGRN was trained using a PSSM (Position Specific Scoring Matrix), secondary structure, solvent accessibility information and inter-domain linker index to detect possible domain boundaries for a target sequence. Results The proposed model achieved average prediction accuracy of 67% on the Benchmark_2 dataset for domain boundary identification in multi-domain proteins and showed superior predictive performance and generalisation ability among the most widely used neural network models. With the CASP7 benchmark dataset, it also demonstrated comparable performance to existing domain boundary predictors such as DOMpro, DomPred, DomSSEA, DomCut and DomainDiscovery with 70.10% prediction accuracy. Conclusion The performance of the proposed model compares favourably with that of other existing machine learning based methods as well as widely known domain boundary predictors on two benchmark datasets, and the model excels in the identification of domain boundaries in terms of model bias, generalisation and computational requirements.
Affiliation(s)
- Paul D Yoo
- Advanced Networks Research Group, School of Information Technologies (J12), The University of Sydney, NSW 2006, Australia.
|
36
|
Manjasetty BA, Turnbull AP, Panjikar S, Büssow K, Chance MR. Automated technologies and novel techniques to accelerate protein crystallography for structural genomics. Proteomics 2008; 8:612-25. [PMID: 18210369 DOI: 10.1002/pmic.200700687] [Citation(s) in RCA: 48] [Impact Index Per Article: 3.0] [Indexed: 11/05/2022]
Abstract
The sequence infrastructure that has arisen through large-scale genomic projects dedicated to protein analysis, has provided a wealth of information and brought together scientists and institutions from all over the world. As a consequence, the development of novel technologies and methodologies in proteomics research is helping to unravel the biochemical and physiological mechanisms of complex multivariate diseases at both a functional and molecular level. In the late sixties, when X-ray crystallography had just been established, the idea of determining protein structure on an almost universal basis was akin to an impossible dream or a miracle. Yet only forty years after, automated protein structure determination platforms have been established. The widespread use of robotics in protein crystallography has had a huge impact at every stage of the pipeline from protein cloning, over-expression, purification, crystallization, data collection, structure solution, refinement, validation and data management- all of which have become more or less automated with minimal human intervention necessary. Here, recent advances in protein crystal structure analysis in the context of structural genomics will be discussed. In addition, this review aims to give an overview of recent developments in high throughput instrumentation, and technologies and strategies to accelerate protein structure/function analysis.
Affiliation(s)
- Babu A Manjasetty
- Case Center for Synchrotron Biosciences, National Synchrotron Light Source, Brookhaven National Laboratory, Upton, NY11973, USA.
|
37
|
Ye L, Liu T, Wu Z, Zhou R. Sequence-based protein domain boundary prediction using BP neural network with various property profiles. Proteins 2008; 71:300-7. [PMID: 17932915 DOI: 10.1002/prot.21745] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 11/06/2022]
Abstract
Given the rapid growth in the number of sequences without known structures, it is becoming increasingly important to not only accurately define protein structural domains but also predict domain boundaries from the amino-acid sequence alone. In this article, we present a Back-Propagation (BP) neural network method using 9 different sequence profiles, based on chemical, physical, and statistical properties, to predict the domain boundary of two-domain proteins from one dimensional sequences. We have achieved an accuracy of 69% with a 10-fold cross validation on a 238 nonredundant two-domain protein dataset that we built based on a common set from both SCOP and CATH classifications. The method has also been applied to a larger third-party dataset with 522 proteins; and an accuracy of 62% has been achieved. Our prediction results on both datasets are found to be significantly better than those from some other methods, such as DomCut and DGS on the same datasets, and also comparable to that from the PPRODO method, upon which the larger dataset was based. Our cross validation results are also noticeably better than previous ones from other BP neural network methods, probably because we have used more property descriptors with significantly more training nodes in our neural network. The integration with PPRODO method also indicates that the information obtained from our current approach is complementary to that available through multiple sequence alignments. Moreover, the relative importance of each property profile has been analyzed in detail.
Affiliation(s)
- Lei Ye
- Department of Computer Science, Zhejiang University, Hangzhou, China
|
38
|
Tress M, Cheng J, Baldi P, Joo K, Lee J, Seo JH, Lee J, Baker D, Chivian D, Kim D, Ezkurdia I. Assessment of predictions submitted for the CASP7 domain prediction category. Proteins 2008; 69 Suppl 8:137-51. [PMID: 17680686 DOI: 10.1002/prot.21675] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.1] [Indexed: 11/07/2022]
Abstract
This paper details the assessment process and evaluation results for the Critical Assessment of Protein Structure Prediction (CASP7) domain prediction category. Domain predictions were assessed using the Normalized Domain Overlap score introduced in CASP6 and the accuracy of prediction of domain break points. The results of the analysis clearly demonstrate that the best methods are able to make consistently reliable predictions when the target has a structural template, although they are less good when the domain break occurs in a region not covered by a template. The conditions of the experiment meant that it was impossible to draw any conclusions about domain prediction for free modeling targets and it was also difficult to draw many distinctions between the best groups. Two thirds of the targets submitted were single domains and hence regarded as easy to predict. Even those targets defined as having multiple domains always had at least one domain with a similar template structure.
Affiliation(s)
- Michael Tress
- Structural and Biological Computation Programme, Spanish National Cancer Research Centre, Madrid, Spain.
|
39
|
Gimona M. Protein Linguistics and the Modular Code of the Cytoskeleton. BIOSEMIOTICS 2008:189-206. [DOI: 10.1007/978-1-4020-6340-4_8] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 09/02/2023]
|
40
|
Zhou H, Xue B, Zhou Y. DDOMAIN: Dividing structures into domains using a normalized domain-domain interaction profile. Protein Sci 2007; 16:947-55. [PMID: 17456745 PMCID: PMC2206635 DOI: 10.1110/ps.062597307] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.6] [Indexed: 10/23/2022]
Abstract
Dividing protein structures into domains is proven useful for more accurate structural and functional characterization of proteins. Here, we develop a method, called DDOMAIN, that divides structure into DOMAINs using a normalized contact-based domain-domain interaction profile. Results of DDOMAIN are compared to AUTHORS annotations (domain definitions are given by the authors who solved protein structures), as well as to popular SCOP and CATH annotations by human experts and automatic programs. DDOMAIN's automatic annotations are most consistent with the AUTHORS annotations (90% agreement in number of domains and 88% agreement in both number of domains and at least 85% overlap in domain assignment of residues) if its three adjustable parameters are trained by the AUTHORS annotations. By comparison, the agreement is 83% (81% with at least 85% overlap criterion) between SCOP-trained DDOMAIN and SCOP annotations and 77% (73%) between CATH-trained DDOMAIN and CATH annotations. The agreement between DDOMAIN and AUTHORS annotations goes beyond single-domain proteins (97%, 82%, and 56% for single-, two-, and three-domain proteins, respectively). For an "easy" data set of proteins whose CATH and SCOP annotations agree with each other in number of domains, the agreement is 90% (89%) between "easy-set"-trained DDOMAIN and CATH/SCOP annotations. The consistency between SCOP-trained DDOMAIN and SCOP annotations is superior to two other recently developed, SCOP-trained, automatic methods PDP (protein domain parser), and DomainParser 2. We also tested a simple consensus method made of PDP, DomainParser 2, and DDOMAIN and a different version of DDOMAIN based on a more sophisticated statistical energy function. The DDOMAIN server and its executable are available in the services section on http://sparks.informatics.iupui.edu.
Affiliation(s)
- Hongyi Zhou
- Howard Hughes Medical Institute Center for Single Molecule Biophysics, Department of Physiology and Biophysics, State University of New York at Buffalo, Buffalo, New York 14214, USA
|
41
|
Abstract
Protein domain prediction is important for protein structure prediction, structure determination, function annotation, mutagenesis analysis and protein engineering. Here we describe an accurate protein domain prediction server (DOMAC) combining both template-based and ab initio methods. The preliminary version of the server was ranked among the top domain prediction servers in the seventh edition of Critical Assessment of Techniques for Protein Structure Prediction (CASP7), 2006. DOMAC server and datasets are available at: http://www.bioinfotool.org/domac.html
Affiliation(s)
- Jianlin Cheng
- School of Electrical Engineering and Computer Science, University of Central Florida, Orlando, FL 32816, USA.
|
42
|
Sikder AR, Zomaya AY. Improving the performance of DomainDiscovery of protein domain boundary assignment using inter-domain linker index. BMC Bioinformatics 2006; 7 Suppl 5:S6. [PMID: 17254311 PMCID: PMC1764483 DOI: 10.1186/1471-2105-7-s5-s6] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.3] [Indexed: 11/30/2022] Open
Abstract
Background Knowledge of protein domain boundaries is critical for the characterisation and understanding of protein function. The ability to identify domains without the knowledge of the structure – by using sequence information only – is an essential step in many types of protein analyses. In the present study, we demonstrate that the performance of DomainDiscovery is improved significantly by including the inter-domain linker index value for domain identification from sequence-based information. Improved DomainDiscovery uses a Support Vector Machine (SVM) approach and a unique training dataset built on the principle of consensus among experts in defining domains in protein structure. The SVM was trained using a PSSM (Position Specific Scoring Matrix), secondary structure, solvent accessibility information and inter-domain linker index to detect possible domain boundaries for a target sequence. Results Improved DomainDiscovery is compared with other methods by benchmarking against a structurally non-redundant dataset and also CASP5 targets. Improved DomainDiscovery achieves 70% accuracy for domain boundary identification in multi-domain proteins. Conclusion Improved DomainDiscovery compares favourably to the performance of other methods and excels in the identification of domain boundaries for multi-domain proteins as a result of introducing support vector machine with benchmark_2 dataset.
Affiliation(s)
- Abdur R Sikder
- Advanced Networks Research Group, School of Information Technologies, J12, University of Sydney, NSW 2006, Australia
- Albert Y Zomaya
- Advanced Networks Research Group, School of Information Technologies, J12, University of Sydney, NSW 2006, Australia
|
43
|
Joshi RR, Samant VV. Bayesian data mining of protein domains gives an efficient predictive algorithm and new insight. J Mol Model 2006; 13:275-82. [PMID: 17028865 DOI: 10.1007/s00894-006-0141-z] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 03/31/2006] [Accepted: 07/28/2006] [Indexed: 11/29/2022]
Abstract
Identification of structural domains in uncharacterized protein sequences is important in the prediction of protein tertiary folds and functional sites, and hence in designing biologically active molecules. We present a new predictive computational method of classifying a protein into single, two continuous or two discontinuous domains using Bayesian Data Mining. The algorithm requires only the primary sequence and computer-predicted secondary structure. It incorporates correlation patterns between certain 3-dimensional motifs and some local helical folds found conserved in the vicinity of protein domains with high statistical confidence. The prediction of domain-class by this computationally simple and fast method shows good accuracy: average accuracies are 83.3% for single domain, 60% for two continuous and 65.7% for two discontinuous domain proteins. Experiments on the large validation sample show its performance to be significantly better than that of DGS and DomSSEA. Computations of Bayesian probabilities show important features in terms of correlation of certain conserved patterns of secondary folds and tertiary motifs and give new insight. Applications for improved accuracy of predicting domain boundary points relevant to protein structural and functional modeling are also highlighted.
Affiliation(s)
- Rajani R Joshi
- Department of Mathematics, Indian Institute of Technology Bombay, Powai, Mumbai, 400 076, India.
|
44
|
Dong Q, Wang X, Lin L. Novel knowledge-based mean force potential at the profile level. BMC Bioinformatics 2006; 7:324. [PMID: 16803615 PMCID: PMC1534065 DOI: 10.1186/1471-2105-7-324] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.2] [Received: 03/19/2006] [Accepted: 06/27/2006] [Indexed: 11/10/2022] Open
Abstract
Background The development and testing of functions for the modeling of protein energetics is an important part of current research aimed at understanding protein structure and function. Knowledge-based mean force potentials are derived from statistical analyses of interacting groups in experimentally determined protein structures. Current knowledge-based mean force potentials are developed at the atom or amino acid level. The evolutionary information contained in the profiles is not investigated. Based on these observations, a class of novel knowledge-based mean force potentials at the profile level has been presented, which uses the evolutionary information of profiles for developing more powerful statistical potentials. Results The frequency profiles are directly calculated from the multiple sequence alignments output by PSI-BLAST and converted into binary profiles with a probability threshold. As a result, the protein sequences are represented as sequences of binary profiles rather than sequences of amino acids. Similar to the knowledge-based potentials at the residue level, a class of novel potentials at the profile level is introduced. We develop four types of profile-level statistical potentials including distance-dependent, contact, Φ/Ψ dihedral angle and accessible surface statistical potentials. These potentials are first evaluated by the fold assessment between the correct and incorrect models generated by comparative modeling from our own and other groups. They are then used to recognize the native structures from well-constructed decoy sets. Experimental results show that all the knowledge-based mean force potentials at the profile level outperform those at the residue level. Significant improvements are obtained for the distance-dependent and accessible surface potentials (5–6%). The contact and Φ/Ψ dihedral angle potential only get a slight improvement (1–2%).
Decoy set evaluation results show that the distance-dependent profile-level potentials even outperform other atom-level potentials. We also demonstrate that profile-level statistical potentials can improve the performance of threading. Conclusion The knowledge-based mean force potentials at the profile level can provide better discriminatory ability than those at the residue level, so they will be useful for protein structure prediction and model refinement.
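The conversion of frequency profiles into binary profiles mentioned in this abstract can be sketched as below. This is an illustration only: the abstract does not state the threshold value, so the 0.1 used here is an arbitrary placeholder, and the function names, the amino-acid ordering and the toy profile are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed column order of the profile


def to_binary_profile(freq_profile, threshold=0.1):
    """Turn each 20-value frequency column into a 0/1 tuple.

    A position is set to 1 when the amino-acid frequency exceeds the
    threshold (placeholder value, not taken from the paper).
    """
    return [
        tuple(1 if freq > threshold else 0 for freq in column)
        for column in freq_profile
    ]


def profile_label(binary_column):
    """Readable label for a binary column: the amino acids whose bit is set."""
    return "".join(aa for aa, bit in zip(AMINO_ACIDS, binary_column) if bit)


# Toy two-position profile (hypothetical frequencies).
col1 = [0.0] * 20
col1[AMINO_ACIDS.index("A")] = 0.5
col1[AMINO_ACIDS.index("L")] = 0.3
col1[AMINO_ACIDS.index("V")] = 0.2
col2 = [0.0] * 20
col2[AMINO_ACIDS.index("G")] = 0.95
col2[AMINO_ACIDS.index("S")] = 0.05

labels = [profile_label(col) for col in to_binary_profile([col1, col2])]
print(labels)
```

Under these assumptions each sequence position is represented by the set of amino acids that occur above the threshold rather than by a single residue, which is the sense in which a sequence becomes a sequence of binary profiles.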
Affiliation(s)
- Qiwen Dong
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, PR China
- Xiaolong Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, PR China
- Lei Lin
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, PR China
|
45
|
Miyazaki S, Kuroda Y, Yokoyama S. Identification of putative domain linkers by a neural network - application to a large sequence database. BMC Bioinformatics 2006; 7:323. [PMID: 16800897 PMCID: PMC1538634 DOI: 10.1186/1471-2105-7-323] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Received: 02/24/2006] [Accepted: 06/27/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The reliable dissection of large proteins into structural domains represents an important issue for structural genomics/proteomics projects. To provide a practical approach to this issue, we tested the ability of a neural network to identify domain linkers from the SWISSPROT database (101602 sequences). RESULTS Our search detected 3009 putative domain linkers adjacent to or overlapping with domains, as defined by sequence similarity to either Protein Data Bank (PDB) or Conserved Domain Database (CDD) sequences. Among these putative linkers, 75% were "correctly" located within 20 residues of a domain terminus, and the remaining 25% were found in the middle of a domain, and probably represented failed predictions. Moreover, our neural network predicted 5124 putative domain linkers in structurally un-annotated regions without sequence similarity to PDB or CDD sequences, which suggests the possible existence of novel structural domains. As a comparison, we performed the same analysis by identifying low-complexity regions (LCR), which are known to encode unstructured polypeptide segments, and observed that the fraction of LCRs that correlate with domain termini is similar to that of domain linkers. However, domain linkers and LCRs appeared to identify different types of domain boundary regions, as only 32% of the putative domain linkers overlapped with LCRs. CONCLUSION Overall, our study indicates that the two methods detect independent and complementary regions, and that the combination of these methods can substantially improve the sensitivity of the domain boundary prediction. This finding should enable the identification of novel structural domains, yielding new targets for large scale protein analyses.
Affiliation(s)
- Satoshi Miyazaki
- Department of Biophysics and Biochemistry, Graduate School of Science, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-0033, Japan
- RIKEN Genomic Sciences Center, 1-7-22, Suehiro-cho, Tsurumi, Yokohama 230-0045, Japan
- Yutaka Kuroda
- Department of Biotechnology and Life Science, Graduate School of Technology, Tokyo University of Agriculture and Technology, 2-24-16, Nakamachi, Koganei, 184-8588, Tokyo, Japan
- Shigeyuki Yokoyama
- Department of Biophysics and Biochemistry, Graduate School of Science, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-0033, Japan
- RIKEN Genomic Sciences Center, 1-7-22, Suehiro-cho, Tsurumi, Yokohama 230-0045, Japan
|
46
|
Abstract
We present an analysis of the domain boundary prediction, a new category, in the sixth community-wide experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP6). There were 1011 predictions submitted for 63 targets. Each prediction was compared to the set of domains defined manually by visual inspection of the experimental structure. The comparison was scored using a new domain prediction scoring scheme. As the definition of a domain is subjective, many targets were assigned alternate definitions. For such targets, each prediction was compared with all different definitions and the best score was chosen. The predictors found it difficult to accurately predict domain boundaries when the target protein contained many domains or domains made of multiple sequence segments. The CBRC-DR (P0536) and Sternberg (P0237) groups were the most successful among human experts, while Baker-Rossettadom (P0353) and Baker-Robetta-Ginzu (P0421) did well among servers.
Affiliation(s)
- Chin-Hsien Tai
- Laboratory of Molecular Biology, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, Maryland, USA
|
47
|
Joshi RR, Samant VV. Fast prediction of protein domain boundaries using conserved local patterns. J Mol Model 2006; 12:943-52. [PMID: 16649034 DOI: 10.1007/s00894-006-0116-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 12/30/2005] [Accepted: 03/09/2006] [Indexed: 10/24/2022]
Abstract
We have found certain conserved motifs and secondary structural patterns present in the vicinity of interior domain boundary points (dbps) by a data-driven approach without any a priori constraint on the type and number of such features, and without any requirement of sequence homology. We have used these motifs and patterns to rerank the solutions obtained by the well-known domain guess by size (DGS) algorithm. We predict, overall, five solutions. The average accuracy of overall (i.e., top five) predictions by our method [domain boundary prediction using conserved patterns (DPCP)] has improved the average accuracy of the top five solutions of DGS from 71.74 to 82.88 %, in the case of two-continuous-domain proteins, and from 21.38 to 80.56 %, for two-discontinuous-domain proteins. Considering only the top solution, the gains in accuracy are from 0 to 72.74 % for two-continuous-domain proteins with chain lengths up to 300 residues, and from 0 to 62.85 % for those with up to 400 residues. In the case of discontinuous domains, top_min solutions (the minimum number of solutions required for predicting all dbps of a protein) of DPCP improve the average accuracy of DGS prediction from 12.5 to 76.3 % in proteins with chain lengths up to 300 residues, and from 13.33 to 70.84 % for proteins with up to 400 residues. In our validation experiments, the performance of DPCP was also found to be superior to that of domain identification from secondary structure element alignment (DomSSEA), the best method reported so far for efficient prediction of domain boundaries using predicted secondary structure. The average accuracies of the topmost solution of DomSSEA are 61 and 52 % for proteins with up to 300 residues and 400, respectively, in the case of continuous domains; the corresponding accuracies for the discontinuous case are 28 and 21 %.
Affiliation(s)
- Rajani R Joshi
- Department of Mathematics, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India.
|
48
|
Dong Q, Wang X, Lin L, Xu Z. Domain boundary prediction based on profile domain linker propensity index. Comput Biol Chem 2006; 30:127-33. [PMID: 16531120 DOI: 10.1016/j.compbiolchem.2006.01.001] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.8] [Received: 10/23/2005] [Revised: 12/29/2005] [Accepted: 01/08/2006] [Indexed: 11/19/2022]
Abstract
Successful prediction of protein domain boundaries provides valuable information not only for the computational structure prediction of multi-domain proteins but also for experimental structure determination. In this work, a novel index at the profile level is presented, namely, the profile domain linker propensity index (PDLI), which uses the evolutionary information of profiles for domain linker prediction. The frequency profiles are calculated directly from the multiple sequence alignments output by PSI-BLAST and converted into binary profiles with a probability threshold. PDLI is then obtained from the frequencies of binary profiles in domain linkers as compared to those in domains. A smooth and normalized numeric profile is generated for any amino acid sequence, from which the domain linkers can be predicted. Testing on the Structural Classification of Proteins (SCOP) database and CASP6 targets shows that PDLI outperforms other indexes at the amino acid level.
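The pipeline this abstract describes (binarize a frequency profile, compare pattern frequencies in linkers and domains, then smooth and normalize) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the default threshold of 0.1, the pseudocount, the log-ratio form of the index, the moving-average smoothing, and the min-max normalization are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import math
from collections import Counter

def binarize(freq_row, threshold=0.1):
    """Turn one row of residue frequencies into a hashable binary pattern.
    The threshold value is an assumption, not taken from the paper."""
    return tuple(1 if f > threshold else 0 for f in freq_row)

def train_index(profiles, linker_masks, threshold=0.1, pseudocount=1.0):
    """Count binary patterns at linker and domain positions and return an
    assumed log-ratio propensity (positive means linker-like)."""
    linker_counts, domain_counts = Counter(), Counter()
    for profile, mask in zip(profiles, linker_masks):
        for row, is_linker in zip(profile, mask):
            pattern = binarize(row, threshold)
            (linker_counts if is_linker else domain_counts)[pattern] += 1
    patterns = set(linker_counts) | set(domain_counts)
    # Pseudocounts keep the ratio finite for patterns seen in only one class.
    n_linker = sum(linker_counts.values()) + pseudocount * len(patterns)
    n_domain = sum(domain_counts.values()) + pseudocount * len(patterns)
    return {
        p: math.log(((linker_counts[p] + pseudocount) / n_linker)
                    / ((domain_counts[p] + pseudocount) / n_domain))
        for p in patterns
    }

def score_sequence(profile, index, threshold=0.1, window=3):
    """Look up each position's pattern, smooth with a moving average,
    and min-max normalize the result to [0, 1]."""
    raw = [index.get(binarize(row, threshold), 0.0) for row in profile]
    if not raw:
        return []
    half = window // 2
    smooth = []
    for i in range(len(raw)):
        chunk = raw[max(0, i - half): i + half + 1]
        smooth.append(sum(chunk) / len(chunk))
    lo, hi = min(smooth), max(smooth)
    if hi == lo:
        return [0.0] * len(smooth)
    return [(s - lo) / (hi - lo) for s in smooth]
```

In this sketch, positions whose normalized score is near 1 would be the candidate linker residues; a real profile would have 20 frequency columns per position, derived from a PSI-BLAST alignment.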
Affiliation(s)
- Qiwen Dong
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, PR China.
|
49
|
Hondoh T, Kato A, Yokoyama S, Kuroda Y. Computer-aided NMR assay for detecting natively folded structural domains. Protein Sci 2006; 15:871-83. [PMID: 16522794 PMCID: PMC2242495 DOI: 10.1110/ps.051880406] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.1] [Indexed: 10/24/2022]
Abstract
Structural genomics projects require strategies for rapidly recognizing protein sequences appropriate for routine structure determination. For large proteins, this strategy includes the dissection of proteins into structural domains that form stable native structures. However, protein dissection essentially remains an empirical and often a tedious process. Here, we describe a simple strategy for rapidly identifying structural domains and assessing their structures. This approach combines the computational prediction of sequence regions corresponding to putative domains with an experimental assessment of their structures and stabilities by NMR and biochemical methods. We tested this approach with nine putative domains predicted from a set of 108 Thermus thermophilus HB8 sequences using PASS, a domain prediction program we previously reported. To facilitate the experimental assessment of the domain structures, we developed a generic 6-hour His-tag-based purification protocol, which enables the sample quality evaluation of a putative structural domain in a single day. As a result, we observed that half of the predicted structural domains were indeed natively folded, as judged by their HSQC spectra. Furthermore, two of the natively folded domains were novel, without related sequences classified in the Pfam and SMART databases, which is a significant result with regard to the ability of structural genomics projects to uniformly cover the protein fold space.
Affiliation(s)
- Takayuki Hondoh
- Protein Research Group, RIKEN Genomic Sciences Center, Tsurumi, Yokohama 230-0045, Japan
|
50
|
Abstract
The correspondence between biology and linguistics at the level of sequence and lexical inventories, and of structure and syntax, has fuelled attempts to describe genome structure by the rules of formal linguistics. But how can we define protein linguistic rules? And how could compositional semantics improve our understanding of protein organization and functional plasticity?
Affiliation(s)
- Mario Gimona
- Consorzio Mario Negri Sud, Marie Curie Unit of Actin Cytoskeleton Regulation, Department of Cell Biology and Oncology, Via Nazionale 8A, 66030 Santa Maria Imbaro, Italy.
|