1
Oliveira LS, Reyes A, Dutilh BE, Gruber A. Rational Design of Profile HMMs for Sensitive and Specific Sequence Detection with Case Studies Applied to Viruses, Bacteriophages, and Casposons. Viruses 2023; 15:519. [PMID: 36851733 PMCID: PMC9966878 DOI: 10.3390/v15020519] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/29/2022] [Revised: 02/01/2023] [Accepted: 02/09/2023] [Indexed: 02/15/2023] Open
Abstract
Profile hidden Markov models (HMMs) are a powerful way of modeling biological sequence diversity and constitute a very sensitive approach to detecting divergent sequences. Here, we report the development of protocols for the rational design of profile HMMs. These methods were implemented on TABAJARA, a program that can be used to either detect all biological sequences of a group or discriminate specific groups of sequences. By calculating position-specific information scores along a multiple sequence alignment, TABAJARA automatically identifies the most informative sequence motifs and uses them to construct profile HMMs. As a proof-of-principle, we applied TABAJARA to generate profile HMMs for the detection and classification of two viral groups presenting different evolutionary rates: bacteriophages of the Microviridae family and viruses of the Flavivirus genus. We obtained conserved models for the generic detection of any Microviridae or Flavivirus sequence, and profile HMMs that can specifically discriminate Microviridae subfamilies or Flavivirus species. In another application, we constructed Cas1 endonuclease-derived profile HMMs that can discriminate CRISPRs and casposons, two evolutionarily related transposable elements. We believe that the protocols described here, and implemented on TABAJARA, constitute a generic toolbox for generating profile HMMs for the highly sensitive and specific detection of sequence classes.
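The abstract above mentions position-specific information scores calculated along a multiple sequence alignment. As a rough illustration only, the sketch below computes a standard per-column Shannon information content in bits. This is an assumption about what such a score could look like, not TABAJARA's actual scoring scheme; the function name `column_information`, its parameters, and the toy alignment are hypothetical.

```python
import math
from collections import Counter


def column_information(alignment, alphabet_size=20, gap_chars="-."):
    """Per-column information content (bits) of equal-length aligned sequences.

    Assumption: information = log2(alphabet_size) - Shannon entropy of the
    residues in the column, with gap characters ignored. This is a generic
    textbook measure, not necessarily the score used by TABAJARA.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    max_bits = math.log2(alphabet_size)
    scores = []
    for i in range(length):
        # Collect non-gap residues observed at this alignment position.
        residues = [seq[i].upper() for seq in alignment if seq[i] not in gap_chars]
        if not residues:
            scores.append(0.0)  # an all-gap column carries no information
            continue
        total = len(residues)
        entropy = -sum(
            (n / total) * math.log2(n / total) for n in Counter(residues).values()
        )
        scores.append(max_bits - entropy)
    return scores


# Toy nucleotide alignment (hypothetical data), so alphabet_size is 4.
toy_alignment = ["ACGT", "ACGA", "ACCA", "AC-A"]
toy_scores = column_information(toy_alignment, alphabet_size=4)
```

Fully conserved columns reach the maximum of log2(alphabet_size) bits, so a sliding window over such scores is one plausible way to locate the most informative regions of an alignment before building a model from them.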
Affiliation(s)
- Liliane S. Oliveira
  - Department of Parasitology, Instituto de Ciências Biomédicas, Universidade de São Paulo, São Paulo 05508-000, SP, Brazil
- Alejandro Reyes
  - Max Planck Tandem Group in Computational Biology, Department of Biological Sciences, Universidad de los Andes, Bogotá 111711, Colombia
  - The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, Saint Louis, MO 63108, USA
- Bas E. Dutilh
  - Institute of Biodiversity, Faculty of Biological Sciences, Cluster of Excellence Balance of the Microverse, Friedrich-Schiller-University Jena, 07743 Jena, Germany
  - Theoretical Biology and Bioinformatics, Science for Life, Utrecht University, Padualaan 8, 3584 CH Utrecht, The Netherlands
  - European Virus Bioinformatics Center, Leutragraben 1, 07743 Jena, Germany
- Arthur Gruber
  - Department of Parasitology, Instituto de Ciências Biomédicas, Universidade de São Paulo, São Paulo 05508-000, SP, Brazil
  - European Virus Bioinformatics Center, Leutragraben 1, 07743 Jena, Germany
2
Doğan T, Akhan Güzelcan E, Baumann M, Koyas A, Atas H, Baxendale IR, Martin M, Cetin-Atalay R. Protein domain-based prediction of drug/compound-target interactions and experimental validation on LIM kinases. PLoS Comput Biol 2021; 17:e1009171. [PMID: 34843456 PMCID: PMC8659301 DOI: 10.1371/journal.pcbi.1009171] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 06/06/2021] [Revised: 12/09/2021] [Accepted: 11/09/2021] [Indexed: 12/23/2022] Open
Abstract
Predictive approaches such as virtual screening have been used in drug discovery with the objective of reducing developmental time and costs. Current machine learning and network-based approaches have issues related to generalization, usability, or model interpretability, especially due to the complexity of target proteins' structure/function, and bias in system training datasets. Here, we propose a new method "DRUIDom" (DRUg Interacting Domain prediction) to identify bio-interactions between drug candidate compounds and targets by utilizing the domain modularity of proteins, to overcome problems associated with current approaches. DRUIDom is composed of two methodological steps. First, ligands/compounds are statistically mapped to structural domains of their target proteins, with the aim of identifying their interactions. As such, other proteins containing the same mapped domain or domain pair become new candidate targets for the corresponding compounds. Next, a million-scale dataset of small molecule compounds, including those mapped to domains in the previous step, is clustered based on molecular similarity, and the domain associations are propagated to other compounds within the same clusters. Experimentally verified bioactivity data points, obtained from public databases, are meticulously filtered to construct datasets of active/interacting and inactive/non-interacting drug/compound-target pairs (~2.9M data points), and used as training data for calculating parameters of compound-domain mappings, which led to 27,032 high-confidence associations between 250 domains and 8,165 compounds, and a finalized output of ~5 million new compound-protein interactions. DRUIDom is experimentally validated by syntheses and bioactivity analyses of compounds predicted to target LIM-kinase proteins, which play critical roles in the regulation of cell motility, cell cycle progression, and differentiation through actin filament dynamics. We showed that LIMK-inhibitor-2 and its derivatives significantly block cancer cell migration through inhibition of LIMK phosphorylation and the downstream protein cofilin. One of the derivative compounds (LIMKi-2d) was identified as a promising candidate due to its action on resistant Mahlavu liver cancer cells. The results demonstrated that DRUIDom can be exploited to identify drug candidate compounds for intended targets and to predict new target proteins based on the defined compound-domain relationships. Datasets, results, and the source code of DRUIDom are fully available at: https://github.com/cansyl/DRUIDom.
Affiliation(s)
- Tunca Doğan
  - Department of Computer Engineering, Hacettepe University, Ankara, Turkey
  - Institute of Informatics, Hacettepe University, Ankara, Turkey
  - CanSyL, Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
- Ece Akhan Güzelcan
  - CanSyL, Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
  - Center for Genomics and Rare Diseases & Biobank for Rare Diseases, Hacettepe University, Ankara, Turkey
- Marcus Baumann
  - School of Chemistry, University College Dublin, Dublin, Ireland
- Altay Koyas
  - CanSyL, Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
- Heval Atas
  - CanSyL, Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
- Ian R. Baxendale
  - Department of Chemistry, University of Durham, Durham, United Kingdom
- Maria Martin
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, United Kingdom
- Rengul Cetin-Atalay
  - CanSyL, Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
  - Section of Pulmonary and Critical Care Medicine, University of Chicago, Chicago, Illinois, United States of America
3
Lima I, Cino EA. Sequence similarity in 3D for comparison of protein families. J Mol Graph Model 2021; 106:107906. [PMID: 33848948 DOI: 10.1016/j.jmgm.2021.107906] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/16/2020] [Revised: 03/18/2021] [Accepted: 03/18/2021] [Indexed: 11/26/2022]
Abstract
Homologous proteins are often compared by pairwise sequence alignment, and structure superposition if the atomic coordinates are available. Unification of sequence and structure data is an important task in structural biology. Here, we present the Sequence Similarity 3D (SS3D) method of integrating sequence and structure information. SS3D is a distance and substitution matrix-based method for straightforward visualization of regions of similarity and difference between homologous proteins. This work details the SS3D approach, and demonstrates its utility through case studies comparing members of several protein families. The examples show that SS3D can effectively highlight biologically important regions of similarity and dissimilarity. We anticipate that the method will be useful for numerous structural biology applications, including, but not limited to, studies of binding specificity, structure-function relationships, and evolutionary pathways. SS3D is available with a manual and tutorial at https://github.com/0x462e41/SS3D/.
Affiliation(s)
- Igor Lima
  - Department of Biochemistry and Immunology, Federal University of Minas Gerais, Belo Horizonte, 31270-901, Brazil
- Elio A Cino
  - Department of Biochemistry and Immunology, Federal University of Minas Gerais, Belo Horizonte, 31270-901, Brazil
4
Bojar D, Powers RK, Camacho DM, Collins JJ. Deep-Learning Resources for Studying Glycan-Mediated Host-Microbe Interactions. Cell Host Microbe 2021; 29:132-144.e3. [DOI: 10.1016/j.chom.2020.10.004] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 06/29/2020] [Revised: 09/09/2020] [Accepted: 10/08/2020] [Indexed: 02/07/2023]
5
Pan L, Guo Q, Chai S, Cheng Y, Ruan M, Ye Q, Wang R, Yao Z, Zhou G, Li Z, Deng M, Jin F, Liu L, Wan H. Evolutionary Conservation and Expression Patterns of Neutral/Alkaline Invertases in Solanum. Biomolecules 2019; 9:biom9120763. [PMID: 31766568 PMCID: PMC6995568 DOI: 10.3390/biom9120763] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 09/29/2019] [Revised: 11/15/2019] [Accepted: 11/20/2019] [Indexed: 01/22/2023] Open
Abstract
The invertase gene family in plants is composed of two subfamilies of enzymes, namely, acid and neutral/alkaline invertases (cytosolic invertase, CIN). Both can irreversibly cleave sucrose into fructose and glucose, which are thought to play key roles in carbon metabolism and plant growth. CINs are widely found in plants, but little has been reported about this family. In this paper, a comparative genomic approach was used to analyze the CIN gene family in Solanum, including Solanum tuberosum, Solanum lycopersicum, Solanum pennellii, Solanum pimpinellifolium, and Solanum melongena. A total of 40 CINs were identified in the five Solanum species, and their sequence features, phylogenetic relationships, motif compositions, gene structures, collinear relationships, and expression profiles were further analyzed. Sequence analysis revealed a remarkable conservation of CINs in sequence length, gene number, and molecular weight. The four previously verified amino acid residues (D188, E414, Arg430, and Ser547) were also observed in 39 out of 40 CINs in our study, indicating that they are deeply conserved. In our phylogenetic tree, the CIN gene family could be divided into groups α and β, with α further subdivided into subgroups α1 and α2. More remarkably, each species has an average of four CINs in the α and β groups. Marked interspecies conservation and collinearity of CINs were also revealed by chromosome mapping. Exon-intron configuration and conserved motifs were consistent within each of the α and β groups on the basis of in silico analysis. Expression analysis indicated that CINs were constitutively expressed and share similar expression profiles in all tested samples from S. tuberosum and S. lycopersicum. In addition, the responses of CIN genes in tomato and potato to abiotic and biotic stresses and to phytohormones were also examined. Overall, CINs in Solanum were encoded by a small and highly conserved gene family, possibly reflecting structural and functional conservation in Solanum. These results lay the foundation for further functional characterization of CIN genes and are also significant for understanding the evolution of the CIN gene family in Solanum.
Affiliation(s)
- Luzhao Pan
  - College of Horticulture and Gardening, Yangtze University, Jingzhou 434025, China
  - State Key Laboratory for Managing Biotic and Chemical Threats to the Quality and Safety of Agro-Products, Institute of Vegetables, Zhejiang Academy of Agricultural Sciences, Hangzhou 310021, China
- Qinwei Guo
  - Quzhou Academy of Agricultural Sciences, Quzhou 324000, Zhejiang, China
- Songlin Chai
  - College of Horticulture and Gardening, Yangtze University, Jingzhou 434025, China
  - State Key Laboratory for Managing Biotic and Chemical Threats to the Quality and Safety of Agro-Products, Institute of Vegetables, Zhejiang Academy of Agricultural Sciences, Hangzhou 310021, China
- Yuan Cheng
  - State Key Laboratory for Managing Biotic and Chemical Threats to the Quality and Safety of Agro-Products, Institute of Vegetables, Zhejiang Academy of Agricultural Sciences, Hangzhou 310021, China
- Meiying Ruan
  - State Key Laboratory for Managing Biotic and Chemical Threats to the Quality and Safety of Agro-Products, Institute of Vegetables, Zhejiang Academy of Agricultural Sciences, Hangzhou 310021, China
- Qingjing Ye
  - State Key Laboratory for Managing Biotic and Chemical Threats to the Quality and Safety of Agro-Products, Institute of Vegetables, Zhejiang Academy of Agricultural Sciences, Hangzhou 310021, China
- Rongqing Wang
  - State Key Laboratory for Managing Biotic and Chemical Threats to the Quality and Safety of Agro-Products, Institute of Vegetables, Zhejiang Academy of Agricultural Sciences, Hangzhou 310021, China
- Zhuping Yao
  - State Key Laboratory for Managing Biotic and Chemical Threats to the Quality and Safety of Agro-Products, Institute of Vegetables, Zhejiang Academy of Agricultural Sciences, Hangzhou 310021, China
- Guozhi Zhou
  - State Key Laboratory for Managing Biotic and Chemical Threats to the Quality and Safety of Agro-Products, Institute of Vegetables, Zhejiang Academy of Agricultural Sciences, Hangzhou 310021, China
- Zhimiao Li
  - State Key Laboratory for Managing Biotic and Chemical Threats to the Quality and Safety of Agro-Products, Institute of Vegetables, Zhejiang Academy of Agricultural Sciences, Hangzhou 310021, China
- Minghua Deng
  - College of Horticulture and Landscape, Yunnan Agricultural University, Kunming 650201, China
- Fengmei Jin
  - Tianjin Research Center of Agricultural Biotechnology, Tianjin 300192, China
- Lecheng Liu
  - College of Horticulture and Gardening, Yangtze University, Jingzhou 434025, China
- Hongjian Wan
  - State Key Laboratory for Managing Biotic and Chemical Threats to the Quality and Safety of Agro-Products, Institute of Vegetables, Zhejiang Academy of Agricultural Sciences, Hangzhou 310021, China
  - China-Australia Research Centre for Crop Improvement, Zhejiang Academy of Agricultural Sciences, Hangzhou 310021, China
  - Correspondence: Tel.: +86-571-86407677; Fax: +86-571-86400997
6
Doğan T, MacDougall A, Saidi R, Poggioli D, Bateman A, O'Donovan C, Martin MJ. UniProt-DAAC: domain architecture alignment and classification, a new method for automatic functional annotation in UniProtKB. Bioinformatics 2016; 32:2264-71. [PMID: 27153729 PMCID: PMC4965628 DOI: 10.1093/bioinformatics/btw114] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Received: 08/29/2015] [Revised: 01/22/2016] [Accepted: 02/25/2016] [Indexed: 11/17/2022] Open
Abstract
MOTIVATION: Similarity-based methods have been widely used to infer the properties of genes and gene products containing little or no experimental annotation. New approaches that overcome the limitations of methods that rely solely upon sequence similarity are attracting increased attention. One of these novel approaches is to use the organization of the structural domains in proteins.
RESULTS: We propose a method for the automatic annotation of protein sequences in the UniProt Knowledgebase (UniProtKB) by comparing their domain architectures, classifying proteins based on the similarities, and propagating functional annotation. The performance of this method was measured through a cross-validation analysis using the Gene Ontology (GO) annotation of a subset of UniProtKB/Swiss-Prot. The results demonstrate the effectiveness of this approach in detecting functional similarity, with an average F-score of 0.85. Applying the method to nearly 55.3 million uncharacterized proteins in UniProtKB/TrEMBL resulted in 44 818 178 GO term predictions for 12 172 114 proteins. Of these predictions, 22% were for 2 812 016 previously non-annotated protein entries, indicating the significance of the value added by this approach.
AVAILABILITY AND IMPLEMENTATION: The results of the method are available at: ftp://ftp.ebi.ac.uk/pub/contrib/martin/DAAC/
CONTACT: tdogan@ebi.ac.uk
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Tunca Doğan
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, UK
- Alistair MacDougall
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, UK
- Rabie Saidi
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, UK
- Diego Poggioli
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, UK
- Alex Bateman
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, UK
- Claire O'Donovan
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, UK
- Maria J Martin
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, UK