1
Dhondge H, Chauvot de Beauchêne I, Devignes MD. CroMaSt: a workflow for assessing protein domain classification by cross-mapping of structural instances between domain databases and structural alignment. Bioinformatics Advances 2023; 3:vbad081. [PMID: 37431435 PMCID: PMC10329740 DOI: 10.1093/bioadv/vbad081] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/03/2023] [Revised: 06/16/2023] [Accepted: 06/26/2023] [Indexed: 07/12/2023]
Abstract
Motivation: Protein domains can be viewed as building blocks, essential for understanding structure-function relationships in proteins. However, each domain database classifies protein domains using its own methodology. Thus, in many cases, domain models and boundaries differ from one domain database to the other, raising the question of domain definition and enumeration of true domain instances.
Results: We propose an automated iterative workflow to assess protein domain classification by cross-mapping domain structural instances between domain databases and by evaluating structural alignments. CroMaSt (for Cross-Mapper of domain Structural instances) will classify all experimental structural instances of a given domain type into four different categories ('Core', 'True', 'Domain-like' and 'Failed'). CroMaSt is developed in Common Workflow Language and takes advantage of two well-known domain databases with wide coverage: Pfam and CATH. It uses the Kpax structural alignment tool with expert-adjusted parameters. CroMaSt was tested with the RNA Recognition Motif domain type and identifies 962 'True' and 541 'Domain-like' structural instances for this domain type. This method solves a crucial issue in domain-centric research and can generate essential information that could be used for synthetic biology and machine-learning approaches to protein domain engineering.
Availability and implementation: The workflow and the Results archive for the CroMaSt runs presented in this article are available from WorkflowHub (doi: 10.48546/workflowhub.workflow.390.2).
Supplementary information: Supplementary data are available at Bioinformatics Advances online.
2
Kolodny R, Nepomnyachiy S, Tawfik DS, Ben-Tal N. Bridging Themes: Short Protein Segments Found in Different Architectures. Mol Biol Evol 2021; 38:2191-2208. [PMID: 33502503 PMCID: PMC8136508 DOI: 10.1093/molbev/msab017] [Citation(s) in RCA: 27] [Impact Index Per Article: 9.0] [Indexed: 12/20/2022]
Abstract
The vast majority of theoretically possible polypeptide chains do not fold, let alone confer function. Hence, protein evolution from preexisting building blocks has clear potential advantages over ab initio emergence from random sequences. In support of this view, sequence similarities between different proteins are generally indicative of common ancestry, and we collectively refer to such homologous sequences as "themes." At the domain level, sequence homology is routinely detected. However, short themes which are segments, or fragments of intact domains, are particularly interesting because they may provide hints about the emergence of domains, as opposed to divergence of preexisting domains, or their mixing-and-matching to form multi-domain proteins. Here we identified 525 representative short themes, comprising 20-80 residues that are unexpectedly shared between domains considered to have emerged independently. Among these "bridging themes" are ones shared between the most ancient domains, for example, Rossmann, P-loop NTPase, TIM-barrel, flavodoxin, and ferredoxin-like. We elaborate on several particularly interesting cases, where the bridging themes mediate ligand binding. Ligand binding may have contributed to the stability and the plasticity of these building blocks, and to their ability to invade preexisting domains or serve as starting points for completely new domains.
Affiliation(s)
- Rachel Kolodny
  - Department of Computer Science, University of Haifa, Haifa, Israel
- Dan S Tawfik
  - Department of Biomolecular Sciences, Weizmann Institute of Science, Rehovot, Israel
- Nir Ben-Tal
  - George S. Wise Faculty of Life Sciences, Department of Biochemistry and Molecular Biology, Tel Aviv University, Tel Aviv, Israel
3
Searching protein space for ancient sub-domain segments. Curr Opin Struct Biol 2021; 68:105-112. [PMID: 33476896 DOI: 10.1016/j.sbi.2020.11.006] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 10/11/2020] [Accepted: 11/29/2020] [Indexed: 01/08/2023]
Abstract
Evolutionary processes that formed the current protein universe left their traces, among them homologous segments that recur, or are 'reused,' in multiple proteins. These reused segments, called 'themes,' can be found at various scales, the best known of which is the domain. Yet, recent studies have begun to focus on the evolutionary insights that can be derived from sub-domain-scale themes, which are candidates for traces of more ancient events. Characterizing these may provide clues to the emergence of domains. Particularly interesting are themes that are reused across dissimilar contexts, that is, where the rest of the protein domain differs. We survey computational studies identifying reused themes within different contexts at the sub-domain level.
4
Karimi S, Ahmadi M, Goudarzi F, Ferdousi R. A computational model for GPCR-ligand interaction prediction. J Integr Bioinform 2020; 18:155-165. [PMID: 34171942 PMCID: PMC7790179 DOI: 10.1515/jib-2019-0084] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/30/2019] [Accepted: 11/25/2020] [Indexed: 11/25/2022]
Abstract
G protein-coupled receptors (GPCRs) play an essential role in critical human activities, and they are considered targets for a wide range of drugs. Accordingly, GPCRs are a major focus of pharmaceutical research and have been investigated extensively. Experimental laboratory research is very costly in terms of time and expense, and accordingly, there is a marked tendency to use computational methods as an alternative. In this study, a prediction model based on machine learning (ML) approaches was developed to predict GPCR-ligand interactions. Decision tree (DT), random forest (RF), multilayer perceptron (MLP), support vector machine (SVM), and Naive Bayes (NB) were the algorithms investigated. After several optimization steps, the receiver operating characteristic (ROC) values for the DT, RF, MLP, SVM, and NB algorithms were 95.2, 98.1, 96.3, 95.5, and 97.3, respectively. Accordingly, the final model was based on the RF algorithm. Compared with other computational studies, the current study focused on interactions of a specific and important type of protein (GPCRs) and examined different types of sequence-based features to obtain more accurate results. Drug science researchers could widely use the prediction model developed in this study. The developed predictor was applied to 16,132 GPCR-ligand pairs and predicted about 6778 potential interactions.
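The model-selection step reported in this abstract can be illustrated with a minimal Python sketch. It only ranks the five ROC values quoted above and picks the highest; the dictionary layout and the helper name `best_model` are illustrative assumptions, not part of the authors' pipeline.

```python
# ROC values (in percent) as quoted in the abstract above.
roc_scores = {"DT": 95.2, "RF": 98.1, "MLP": 96.3, "SVM": 95.5, "NB": 97.3}


def best_model(scores):
    """Return the (name, score) pair with the highest score.

    Hypothetical helper for illustration; not taken from the paper.
    """
    return max(scores.items(), key=lambda item: item[1])


best_name, best_score = best_model(roc_scores)

# Algorithms ordered from highest to lowest ROC value.
ranking = sorted(roc_scores, key=roc_scores.get, reverse=True)

print(best_name, best_score)
print(ranking)
```

Under these numbers the top-ranked algorithm is RF, which matches the choice stated in the abstract.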
Affiliation(s)
- Shiva Karimi
  - Health Information Management Department, Paramedical School, Kermanshah University of Medical Sciences, Kermanshah, Iran
- Maryam Ahmadi
  - Department of Health Information Management, School of Management and Medical Information Sciences, Iran University of Medical Sciences, Tehran, Iran
- Farjam Goudarzi
  - Regenerative Medicine Research Center, Kermanshah University of Medical Sciences, Kermanshah, Iran
- Reza Ferdousi
  - Department of Health Information Technology, School of Management and Medical Informatics, Tabriz University of Medical Sciences, Tabriz, Iran
5
Navigating Among Known Structures in Protein Space. Methods Mol Biol 2018. [PMID: 30298400 DOI: 10.1007/978-1-4939-8736-8_12] [Citation(s) in RCA: 0] [Impact Index Per Article: 0]
Abstract
Present-day protein space is the result of 3.7 billion years of evolution, constrained by the underlying physicochemical qualities of the proteins. It is difficult to differentiate between evolutionary traces and effects of physicochemical constraints. Nonetheless, as a rule of thumb, instances of structural reuse, or focusing on structural similarity, are likely attributable to physicochemical constraints, whereas sequence reuse, or focusing on sequence similarity, may be more indicative of evolutionary relationships. Both types of relationships have been studied and can provide meaningful insights to protein biophysics and evolution, which in turn can lead to better algorithms for protein search, annotation, and maybe even design. In broad strokes, studies of protein space vary in the entities they represent, the similarity measure comparing these entities, and the representation used. The entities can be, for example, protein chains, domains, supra-domains, or smaller protein sub-parts denoted themes. The measures of similarity between the entities can be based on sequence, structure, function, or any combination of these. The representation can be global, encompassing the whole space, or local, focusing on a particular region surrounding protein(s) of interest. Global representations include lists of grouped proteins, protein networks, and maps. Networks are the abstraction that is derived most directly from the similarity data: each node is the protein entity (e.g., a domain), and edges connect similar domains. Selecting the entities, the similarity measure, and the abstraction are three intertwined decisions: the similarity measures allow us to identify the entities, and the selection of entities influences what is a meaningful similarity measure. Similarly, we seek entities that are related to each other in a way for which a simple representation describes their relationships succinctly and accurately.
This chapter will cover studies that rely on different entities, similarity measures, and a range of representations to better understand protein structure space. Scholars may use publicly available navigators offering a global representation, and in particular the hierarchical classifications SCOP, CATH, and ECOD, or a local representation, which encompasses structural alignment algorithms. Alternatively, scholars can configure their own navigator using existing tools. To demonstrate this DIY (do it yourself) approach for navigating in protein space, we investigate substrate-binding proteins. By presenting sequence similarities among this large and diverse protein family as a network, we can infer that one member (pdb ID 4ntl; of yet unknown function) may bind methionine and suggest a putative binding mechanism.
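The network abstraction described in this abstract (nodes are protein entities, edges connect similar ones) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the entity names `domA` to `domD`, the similarity scores, and the 0.5 cutoff are all invented for the example, and the chapter does not prescribe this particular data structure.

```python
def build_similarity_network(similarities, threshold):
    """Build an undirected graph as a dict mapping node -> set of neighbours.

    `similarities` maps a pair of entity names to a similarity score.
    Every entity becomes a node; an edge is added only when the score
    reaches `threshold`. Names and scores are hypothetical.
    """
    graph = {}
    for (a, b), score in similarities.items():
        # Register both entities as nodes even if no edge is added.
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if score >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Toy pairwise similarity scores between four hypothetical domains.
toy_similarities = {
    ("domA", "domB"): 0.9,
    ("domB", "domC"): 0.7,
    ("domA", "domC"): 0.2,
    ("domC", "domD"): 0.1,
}

network = build_similarity_network(toy_similarities, threshold=0.5)

for node in sorted(network):
    print(node, sorted(network[node]))
```

In this toy graph `domD` stays isolated because none of its scores reach the cutoff, which shows how the choice of similarity measure and threshold shapes the resulting network.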
6
Evans SM, Adcox HE, VieBrock L, Green RS, Luce-Fedrow A, Chattopadhyay S, Jiang J, Marconi RT, Paris D, Richards AL, Carlyon JA. Outer Membrane Protein A Conservation among Orientia tsutsugamushi Isolates Suggests Its Potential as a Protective Antigen and Diagnostic Target. Trop Med Infect Dis 2018; 3:E63. [PMID: 30274459 PMCID: PMC6073748 DOI: 10.3390/tropicalmed3020063] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 04/11/2018] [Revised: 05/31/2018] [Accepted: 06/04/2018] [Indexed: 01/28/2023]
Abstract
Scrub typhus threatens one billion people in the Asia-Pacific area and cases have emerged outside this region. It is caused by infection with any of the multitude of strains of the bacterium Orientia tsutsugamushi. A vaccine that affords heterologous protection and a commercially-available molecular diagnostic assay are lacking. Herein, we determined that the nucleotide and translated amino acid sequences of outer membrane protein A (OmpA) are highly conserved among 51 O. tsutsugamushi isolates. Molecular modeling revealed the predicted tertiary structure of O. tsutsugamushi OmpA to be very similar to that of the phylogenetically-related pathogen, Anaplasma phagocytophilum, including the location of a helix that contains residues functionally essential for A. phagocytophilum infection. PCR primers were developed that amplified ompA DNA from all O. tsutsugamushi strains, but not from negative control bacteria. Using these primers in quantitative PCR enabled sensitive detection and quantitation of O. tsutsugamushi ompA DNA from organs and blood of mice that had been experimentally infected with the Karp or Gilliam strains. The high degree of OmpA conservation among O. tsutsugamushi strains evidences its potential to serve as a molecular diagnostic target and justifies its consideration as a candidate for developing a broadly-protective scrub typhus vaccine.
Affiliation(s)
- Sean M Evans
  - Department of Microbiology and Immunology, Virginia Commonwealth University Medical Center, School of Medicine, Richmond, VA 23298, USA.
- Haley E Adcox
  - Department of Microbiology and Immunology, Virginia Commonwealth University Medical Center, School of Medicine, Richmond, VA 23298, USA.
- Lauren VieBrock
  - Department of Microbiology and Immunology, Virginia Commonwealth University Medical Center, School of Medicine, Richmond, VA 23298, USA.
- Ryan S Green
  - Department of Microbiology and Immunology, Virginia Commonwealth University Medical Center, School of Medicine, Richmond, VA 23298, USA.
- Alison Luce-Fedrow
  - Viral and Rickettsial Diseases Department, Naval Medical Research Center, Silver Spring, MD 20910, USA.
  - Department of Biology, Shippensburg University, Shippensburg, PA 17257, USA.
- Suschsmita Chattopadhyay
  - Viral and Rickettsial Diseases Department, Naval Medical Research Center, Silver Spring, MD 20910, USA.
- Ju Jiang
  - Viral and Rickettsial Diseases Department, Naval Medical Research Center, Silver Spring, MD 20910, USA.
- Richard T Marconi
  - Department of Microbiology and Immunology, Virginia Commonwealth University Medical Center, School of Medicine, Richmond, VA 23298, USA.
- Daniel Paris
  - Department of Medicine, Swiss Tropical and Public Health Institute, 4051 Basel, Switzerland.
- Allen L Richards
  - Viral and Rickettsial Diseases Department, Naval Medical Research Center, Silver Spring, MD 20910, USA.
  - Department of Preventive Medicine and Biostatistics, Uniformed Services University of the Health Sciences, Bethesda, MD 20814, USA.
- Jason A Carlyon
  - Department of Microbiology and Immunology, Virginia Commonwealth University Medical Center, School of Medicine, Richmond, VA 23298, USA.
7
Complex evolutionary footprints revealed in an analysis of reused protein segments of diverse lengths. Proc Natl Acad Sci U S A 2017; 114:11703-11708. [PMID: 29078314 PMCID: PMC5676897 DOI: 10.1073/pnas.1707642114] [Citation(s) in RCA: 55] [Impact Index Per Article: 7.9] [Indexed: 01/13/2023]
Abstract
We question a central paradigm: namely, that the protein domain is the “atomic unit” of evolution. In conflict with the current textbook view, our results unequivocally show that duplication of protein segments happens both above and below the domain level among amino acid segments of diverse lengths. Indeed, we show that significant evolutionary information is lost when the protein is approached as a string of domains. Our finer-grained approach reveals a far more complicated picture, where reused segments often intertwine and overlap with each other. Our results are consistent with a recursive model of evolution, in which segments of various lengths, typically smaller than domains, “hop” between environments. The fit segments remain, leaving traces that can still be detected. Proteins share similar segments with one another. Such “reused parts”—which have been successfully incorporated into other proteins—are likely to offer an evolutionary advantage over de novo evolved segments, as most of the latter will not even have the capacity to fold. To systematically explore the evolutionary traces of segment “reuse” across proteins, we developed an automated methodology that identifies reused segments from protein alignments. We search for “themes”—segments of at least 35 residues of similar sequence and structure—reused within representative sets of 15,016 domains [Evolutionary Classification of Protein Domains (ECOD) database] or 20,398 chains [Protein Data Bank (PDB)]. We observe that theme reuse is highly prevalent and that reuse is more extensive when the length threshold for identifying a theme is lower. Structural domains, the best characterized form of reuse in proteins, are just one of many complex and intertwined evolutionary traces. Others include long themes shared among a few proteins, which encompass and overlap with shorter themes that recur in numerous proteins. 
The observed complexity is consistent with evolution by duplication and divergence, and some of the themes might include descendants of ancestral segments. The observed recursive footprints, where the same amino acid can simultaneously participate in several intertwined themes, could be a useful concept for protein design. Data are available at http://trachel-srv.cs.haifa.ac.il/rachel/ppi/themes/.
8
9
Postic G, Ghouzam Y, Chebrek R, Gelly JC. An ambiguity principle for assigning protein structural domains. Science Advances 2017; 3:e1600552. [PMID: 28097215 PMCID: PMC5235333 DOI: 10.1126/sciadv.1600552] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.1] [Received: 03/18/2016] [Accepted: 11/28/2016] [Indexed: 05/20/2023]
Abstract
Ambiguity is the quality of being open to several interpretations. For an image, it arises when the contained elements can be delimited in two or more distinct ways, which may cause confusion. We postulate that it also applies to the analysis of protein three-dimensional structure, which consists in dividing the molecule into subunits called domains. Because different definitions of what constitutes a domain can be used to partition a given structure, the same protein may have different but equally valid domain annotations. However, knowledge and experience generally displace our ability to accept more than one way to decompose the structure of an object, in this case a protein. This human bias in structure analysis is particularly harmful because it leads to ignoring potential avenues of research. We present an automated method capable of producing multiple alternative decompositions of protein structure (web server and source code available at www.dsimb.inserm.fr/sword/). Our innovative algorithm assigns structural domains through the hierarchical merging of protein units, which are evolutionarily preserved substructures that describe protein architecture at an intermediate level, between domain and secondary structure. To validate the use of these protein units for decomposing protein structures into domains, we set up an extensive benchmark made of expert annotations of structural domains and including state-of-the-art domain parsing algorithms. The relevance of our "multipartitioning" approach is shown through numerous examples of applications covering protein function, evolution, folding, and structure prediction. Finally, we introduce a measure for the structural ambiguity of protein molecules.
Affiliation(s)
- Guillaume Postic
  - INSERM U1134, Paris, France
  - Université Paris Diderot, Sorbonne Paris Cité, UMR_S 1134, Paris, France
  - Institut National de la Transfusion Sanguine, Paris, France
  - Laboratory of Excellence GR-Ex, Paris, France
  - Corresponding author. (G.P.); (J.-C.G.)
- Yassine Ghouzam
  - INSERM U1134, Paris, France
  - Université Paris Diderot, Sorbonne Paris Cité, UMR_S 1134, Paris, France
  - Institut National de la Transfusion Sanguine, Paris, France
  - Laboratory of Excellence GR-Ex, Paris, France
- Romain Chebrek
  - INSERM U1134, Paris, France
  - Université Paris Diderot, Sorbonne Paris Cité, UMR_S 1134, Paris, France
  - Institut National de la Transfusion Sanguine, Paris, France
  - Laboratory of Excellence GR-Ex, Paris, France
- Jean-Christophe Gelly
  - INSERM U1134, Paris, France
  - Université Paris Diderot, Sorbonne Paris Cité, UMR_S 1134, Paris, France
  - Institut National de la Transfusion Sanguine, Paris, France
  - Laboratory of Excellence GR-Ex, Paris, France
  - Corresponding author. (G.P.); (J.-C.G.)
10
Berezovsky IN, Guarnera E, Zheng Z. Basic units of protein structure, folding, and function. Progress in Biophysics and Molecular Biology 2016; 128:85-99. [PMID: 27697476 DOI: 10.1016/j.pbiomolbio.2016.09.009] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.0] [Received: 07/29/2016] [Revised: 09/05/2016] [Accepted: 09/26/2016] [Indexed: 10/20/2022]
Abstract
Study of the hierarchy of domain structure with alternative sets of domains and analysis of discontinuous domains, consisting of remote segments of the polypeptide chain, raised a question about the minimal structural unit of the protein domain. The hypothesis on the decisive role of the polypeptide backbone in determining the elementary units of globular proteins has led to the discovery of closed loops. It is reviewed here how closed loops form the loop-n-lock structure of proteins, providing the foundation for stability and designability of protein folds/domains and underlying their co-translational folding. Simplified protein sequences are considered here with the aim to explore the basic principles that presumably dominated the folding and stability of proteins in the early stages of structural evolution. Elementary functional loops (EFLs), closed loops with one or few catalytic residues, are, in turn, units of the protein function. They are apparent descendants of the prebiotic ring-like peptides, which gave rise to the first functional folds/domains being fused in the beginning of the evolution of protein structure. It is also shown how evolutionary relations between protein functional superfamilies and folds delineated with the help of EFLs can contribute to establishing the rules for design of desired enzymatic functions. Generalized descriptors of the elementary functions are proposed to be used as basic units in future computational design.
Affiliation(s)
- Igor N Berezovsky
  - Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, 138671, Singapore; Department of Biological Sciences (DBS), National University of Singapore (NUS), 8 Medical Drive, 117579, Singapore.
- Enrico Guarnera
  - Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, 138671, Singapore
- Zejun Zheng
  - Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, 138671, Singapore