1
Lee T, Lee S, Kang M, Kim S. Deep hierarchical embedding for simultaneous modeling of GPCR proteins in a unified metric space. Sci Rep 2021; 11:9543. [PMID: 33953216 PMCID: PMC8100104 DOI: 10.1038/s41598-021-88623-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/24/2021] [Accepted: 04/13/2021] [Indexed: 11/23/2022] Open
Abstract
GPCR proteins belong to diverse families of proteins that are defined at multiple hierarchical levels. Inspecting relationships between GPCR proteins on the hierarchical structure is important, since characteristics of a protein can be inferred from proteins with similar hierarchical information. However, modeling of GPCR families has been performed separately for each of the family, subfamily, and sub-subfamily levels. Relationships between GPCR proteins are ignored in these approaches, as they process the information in the proteins with several disconnected models. In this study, we propose DeepHier, a deep learning model to simultaneously learn representations of the GPCR family hierarchy from the protein sequences with a single unified model. A novel loss term based on metric learning is introduced to incorporate hierarchical relations between proteins. We tested our approach using a public GPCR sequence dataset. Metric distances in the deep feature space corresponded to the hierarchical family relation between GPCR proteins. Furthermore, we demonstrated that downstream tasks, like phylogenetic reconstruction and motif discovery, are feasible in the constructed embedding space. These results show that hierarchical relations between sequences were successfully captured in both technical and biological aspects.
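The abstract does not state the form of the loss term, so the following is only a rough illustration of how a metric-learning loss can encode a family hierarchy; it is not DeepHier's actual loss. It assumes labels are (family, subfamily, sub-subfamily) tuples and uses a triplet-style hinge whose margin grows with how much further apart the negative is in the hierarchy than the positive. The function names `levels_apart` and `hierarchical_triplet_loss`, the margin scheme, and the example labels are all hypothetical.

```python
import math


def levels_apart(label_a, label_b):
    """Count hierarchy levels below the deepest level the two labels share.

    Labels are tuples such as (family, subfamily, sub_subfamily).
    Identical labels give 0; labels in different families give len(label).
    """
    shared = 0
    for a, b in zip(label_a, label_b):
        if a != b:
            break
        shared += 1
    return len(label_a) - shared


def euclidean(u, v):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def hierarchical_triplet_loss(anchor, positive, negative,
                              y_anchor, y_pos, y_neg, base_margin=1.0):
    """Hinge loss asking the negative to sit farther from the anchor than the
    positive, by a margin proportional to their difference in hierarchy depth.

    Assumption: the negative must be strictly further from the anchor in the
    hierarchy than the positive, otherwise the triplet is rejected.
    """
    gap = levels_apart(y_anchor, y_neg) - levels_apart(y_anchor, y_pos)
    if gap <= 0:
        raise ValueError("negative must be further away in the hierarchy")
    margin = base_margin * gap
    return max(0.0, euclidean(anchor, positive)
               - euclidean(anchor, negative) + margin)
```

Under this scheme a negative from a different family is pushed further from the anchor than one that differs only at the sub-subfamily level, which is one way embedding distances could come to mirror the hierarchy.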
Affiliation(s)
- Taeheon Lee
- Looxid Labs, Seoul, 06628, Republic of Korea
- Sangseon Lee
- BK21 FOUR Intelligence Computing, Seoul National University, Seoul, 08826, Republic of Korea
- Minji Kang
- Department of Computer Science, Stanford University, Stanford, CA, 94305, USA
- Sun Kim
- Bioinformatics Institute, Seoul National University, Seoul, 08826, Republic of Korea; Department of Computer Science and Engineering, Seoul National University, Seoul, 08826, Republic of Korea; Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, 08826, Republic of Korea; Institute of Engineering Research, Seoul National University, Seoul, 08826, Republic of Korea.
2
Shkurin A, Vellido A. Using random forests for assistance in the curation of G-protein coupled receptor databases. Biomed Eng Online 2017; 16:75. [PMID: 28830426 PMCID: PMC5568607 DOI: 10.1186/s12938-017-0357-4] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Biology is experiencing a gradual but fast transformation from a laboratory-centred science towards a data-centred one. As such, it requires robust data engineering and the use of quantitative data analysis methods as part of database curation. This paper focuses on G protein-coupled receptors, a large and heterogeneous super-family of cell membrane proteins of interest to biology in general. One of its families, Class C, is of particular interest to pharmacology and drug design. This family is quite heterogeneous on its own, and the discrimination of its several sub-families is a challenging problem. In the absence of known crystal structure, such discrimination must rely on their primary amino acid sequences.
METHODS: We are interested not as much in achieving maximum sub-family discrimination accuracy using quantitative methods, but in exploring sequence misclassification behavior. Specifically, we are interested in isolating those sequences showing consistent misclassification, that is, sequences that are very often misclassified and almost always to the same wrong sub-family. Random forests are used for this analysis due to their ensemble nature, which makes them naturally suited to gauge the consistency of misclassification. This consistency is here defined through the voting scheme of their base tree classifiers.
RESULTS: Detailed consistency results for the random forest ensemble classification were obtained for all receptors and for all data transformations of their unaligned primary sequences. Shortlists of the most consistently misclassified receptors for each subfamily and transformation, as well as an overall shortlist including those cases that were consistently misclassified across transformations, were obtained. The latter should be referred to experts for further investigation as a data curation task.
CONCLUSION: The automatic discrimination of the Class C sub-families of G protein-coupled receptors from their unaligned primary sequences shows clear limits. This study has investigated in some detail the consistency of their misclassification using random forest ensemble classifiers. Different sub-families have been shown to display very different discrimination consistency behaviors. The individual identification of consistently misclassified sequences should provide a tool for quality control to GPCR database curators.
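The idea of gauging misclassification consistency from the votes of the base trees can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the per-tree votes are assumed to have been collected already, consistency is taken to be the share of all trees voting for the single most popular wrong sub-family, and the function names, the 0.8 threshold, and the sub-family labels in the usage note are hypothetical.

```python
from collections import Counter


def misclassification_consistency(tree_votes, true_label):
    """Return (most_voted_wrong_label, share_of_all_trees_voting_for_it).

    tree_votes is the list of labels predicted by each base tree for one
    sequence. Returns (None, 0.0) when no tree votes for a wrong label.
    """
    wrong = Counter(v for v in tree_votes if v != true_label)
    if not wrong:
        return None, 0.0
    label, count = wrong.most_common(1)[0]
    return label, count / len(tree_votes)


def shortlist(votes_by_sequence, true_labels, threshold=0.8):
    """List sequences whose trees agree on the same wrong label at least
    `threshold` of the time, most consistent first.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    flagged = []
    for seq_id, votes in votes_by_sequence.items():
        label, share = misclassification_consistency(votes, true_labels[seq_id])
        if label is not None and share >= threshold:
            flagged.append((seq_id, label, share))
    return sorted(flagged, key=lambda item: item[2], reverse=True)
```

For instance, a sequence labelled `"mGlu"` for which nine of ten trees vote `"CaSR"` would be flagged with a share of 0.9, while a sequence whose wrong votes are split across several sub-families would not reach the threshold.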
Affiliation(s)
- Aleksei Shkurin
- Department of Computer Science, Universitat Politècnica de Catalunya, C. Jordi Girona, 1-3, 08034, Barcelona, Spain; Technology, Communication and Transport Department, Mikkeli University of Applied Sciences, Patteristonkatu 3, 50100, Mikkeli, Finland
- Alfredo Vellido
- Department of Computer Science, Universitat Politècnica de Catalunya, C. Jordi Girona, 1-3, 08034, Barcelona, Spain.
3
Maiti A, Ghorai S, Mukherjee A. A multi-fold string kernel for sequence classification. Annual International Conference of the IEEE Engineering in Medicine and Biology Society 2015; 2015:6469-6472. [PMID: 26737774 DOI: 10.1109/embc.2015.7319874] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/05/2023]
Abstract
A novel framework is proposed to classify biological sequences using a kernel. It considers the topological information along with the primary structural information. The widely used string kernel for sequence classification does not take into account the structural information which might be available for biological sequences. The proposed kernels incorporate the additional structural information and thus make the features more informative.
4
Sahin ME, Can T, Son CD. GPCRsort-responding to the next generation sequencing data challenge: prediction of G protein-coupled receptor classes using only structural region lengths. OMICS: A Journal of Integrative Biology 2014; 18:636-44. [PMID: 25133496 DOI: 10.1089/omi.2014.0073] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/12/2022]
Abstract
Next generation sequencing (NGS) and the attendant data deluge are increasingly impacting molecular life sciences research. Chief among the challenges and opportunities is to enhance our ability to classify molecular target data into meaningful and cohesive systematic nomenclature. In this vein, the G protein-coupled receptors (GPCRs) are the largest and most divergent receptor family that plays a crucial role in a host of pathophysiological pathways. For the pharmaceutical industry, GPCRs are a major drug target and it is estimated that 60%-70% of all medicines in development today target GPCRs. Hence, they require an efficient and rapid classification to group the members according to their functions. In addition to NGS and the Big Data challenge we currently face, a growing number of orphan GPCRs further demands novel, rapid, and accurate classification of the receptors, since the current classification tools are inadequate and slow. This study presents the development of a new classification tool for GPCRs using the structural features derived from their primary sequences: GPCRsort. Comparison experiments with the current known GPCR classification techniques showed that GPCRsort is able to rapidly (in the order of minutes) classify uncharacterized GPCRs with 97.3% accuracy, whereas the best available technique's accuracy is 90.7%. GPCRsort is available in the public domain for postgenomics life scientists engaged in GPCR research with NGS: http://bioserver.ceng.metu.edu.tr/GPCRSort.
Affiliation(s)
- Mehmet Emre Sahin
- Department of Computer Engineering, Middle East Technical University, Ankara, Turkey
5
Bioinformatics tools for predicting GPCR gene functions. Advances in Experimental Medicine and Biology 2014; 796:205-24. [PMID: 24158807 DOI: 10.1007/978-94-007-7423-0_10] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 02/12/2023]
Abstract
The automatic classification of GPCRs by bioinformatics methodology can provide functional information for new GPCRs in the whole 'GPCR proteome', and this information is important for the development of novel drugs. Since the GPCR proteome is classified hierarchically, general approaches to GPCR function prediction are based on hierarchical classification. Various computational tools have been developed to predict GPCR functions; those tools use not simple sequence searches but more powerful methods, such as alignment-free methods, statistical model methods, and machine learning methods used in protein sequence analysis, based on learning datasets. The first stage of hierarchical function prediction involves the discrimination of GPCRs from non-GPCRs, and the second stage involves the classification of the predicted GPCR candidates into family, subfamily, and sub-subfamily levels. Then, further classification is performed according to their protein-protein interaction type: binding G-protein type, oligomerized partner type, etc. Those methods have achieved predictive accuracies of around 90%. Finally, I describe future research directions for bioinformatics techniques in GPCR function prediction.
6
Heifetz A, Barker O, Verquin G, Wimmer N, Meutermans W, Pal S, Law RJ, Whittaker M. Fighting obesity with a sugar-based library: discovery of novel MCH-1R antagonists by a new computational-VAST approach for exploration of GPCR binding sites. J Chem Inf Model 2013; 53:1084-99. [PMID: 23590178 DOI: 10.1021/ci4000882] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Indexed: 11/30/2022]
Abstract
Obesity is an increasingly common disease. While antagonism of the melanin-concentrating hormone-1 receptor (MCH-1R) has been widely reported as a promising therapeutic avenue for obesity treatment, no MCH-1R antagonists have reached the market. Discovery and optimization of new chemical matter targeting MCH-1R is hindered by reduced HTS success rates and a lack of structural information about the MCH-1R binding site. X-ray crystallography and NMR, the major experimental sources of structural information, are very slow processes for membrane proteins and are not currently feasible for every GPCR or GPCR-ligand complex. This situation significantly limits the ability of these methods to impact the drug discovery process for GPCR targets in "real-time", and hence, there is an urgent need for other practical and cost-efficient alternatives. We present here a conceptually pioneering approach that integrates GPCR modeling with design, synthesis, and screening of a diverse library of sugar-based compounds from the VAST technology (versatile assembly on stable templates) to provide structural insights on the MCH-1R binding site. This approach creates a cost-efficient new avenue for structure-based drug discovery (SBDD) against GPCR targets. In our work, a primary VAST hit was used to construct a high-quality MCH-1R model. Following model validation, a structure-based virtual screen yielded a 14% hit rate and 10 novel chemotypes of potent MCH-1R antagonists, including EOAI3367472 (IC50 = 131 nM) and EOAI3367474 (IC50 = 213 nM).
Affiliation(s)
- Alexander Heifetz
- Evotec (UK), Ltd., Milton Park, Abingdon, Oxfordshire, United Kingdom.
7
Cárdenas MI, Vellido A, Olier I, Rovira X, Giraldo J. Kernel Generative Topographic Mapping of Protein Sequences. Bioinformatics 2013. [DOI: 10.4018/978-1-4666-3604-0.ch044] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022] Open
Abstract
The world of pharmacology is becoming increasingly dependent on the advances in the fields of genomics and proteomics. The –omics sciences bring about the challenge of how to deal with the large amounts of complex data they generate from an intelligent data analysis perspective. In this chapter, the authors focus on the analysis of a specific type of proteins, the G protein-coupled receptors, which are the target for over 15% of current drugs. They describe a kernel method of the manifold learning family for the analysis of protein amino acid symbolic sequences. This method sheds light on the structure of protein subfamilies, while providing an intuitive visualization of such structure.
8
Heifetz A, Morris GB, Biggin PC, Barker O, Fryatt T, Bentley J, Hallett D, Manikowski D, Pal S, Reifegerste R, Slack M, Law R. Study of Human Orexin-1 and -2 G-Protein-Coupled Receptors with Novel and Published Antagonists by Modeling, Molecular Dynamics Simulations, and Site-Directed Mutagenesis. Biochemistry 2012; 51:3178-97. [DOI: 10.1021/bi300136h] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.8] [Indexed: 01/12/2023]
Affiliation(s)
- Alexander Heifetz
- Evotec (U.K.) Ltd., 114 Milton Park, Abingdon, Oxfordshire OX14 4SA, United Kingdom
- G. Benjamin Morris
- Department of Biochemistry, University of Oxford, South Parks Road, Oxford OX1 3QU, United Kingdom
- Philip C. Biggin
- Department of Biochemistry, University of Oxford, South Parks Road, Oxford OX1 3QU, United Kingdom
- Oliver Barker
- Evotec (U.K.) Ltd., 114 Milton Park, Abingdon, Oxfordshire OX14 4SA, United Kingdom
- Tara Fryatt
- Evotec (U.K.) Ltd., 114 Milton Park, Abingdon, Oxfordshire OX14 4SA, United Kingdom
- Jonathan Bentley
- Evotec (U.K.) Ltd., 114 Milton Park, Abingdon, Oxfordshire OX14 4SA, United Kingdom
- David Hallett
- Evotec (U.K.) Ltd., 114 Milton Park, Abingdon, Oxfordshire OX14 4SA, United Kingdom
- Sandeep Pal
- Evotec (U.K.) Ltd., 114 Milton Park, Abingdon, Oxfordshire OX14 4SA, United Kingdom
- Rita Reifegerste
- Evotec AG, Manfred Eigen Campus, Essener Bogen 7, 22419 Hamburg, Germany
- Mark Slack
- Evotec AG, Manfred Eigen Campus, Essener Bogen 7, 22419 Hamburg, Germany
- Richard Law
- Evotec (U.K.) Ltd., 114 Milton Park, Abingdon, Oxfordshire OX14 4SA, United Kingdom