1
Mier P, Andrade-Navarro MA. The nucleotide landscape of polyXY regions. Comput Struct Biotechnol J 2023; 21:5408-5412. [PMID: 38022702 PMCID: PMC10652141 DOI: 10.1016/j.csbj.2023.10.054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2023] [Revised: 10/30/2023] [Accepted: 10/30/2023] [Indexed: 12/01/2023] Open
Abstract
PolyXY regions are compositionally biased regions composed of two different amino acids. They are classified according to the arrangement of the two amino acid types 'X' and 'Y' into direpeats (composed of alternating amino acids, e.g. 'XYXYXY'), joined (composed of two consecutive stretches of each amino acid, e.g. 'XXXYYY') and shuffled (other arrangements, e.g., 'XYXXYY'). They have been characterized at the amino acid level in all domains of life, and are described as often found within intrinsically disordered regions. Since DNA replication slippage has been proposed as a driver of repeat variation, and given that some polyXY have a repetitive nature, we hypothesized that characterizing the nucleotide coding of various types of polyXY could give hints about their origin and evolution. To test this, we obtained all polyXY regions in the human transcriptome, categorized them, and studied their coding nucleotide sequences. We observed that polyXY exacerbates the codon biases, and that the similarity between the X and Y codons is higher than in the background proteome. Our results support a general mechanism of emergence and evolution of polyXY from single-codon polyX. PolyXY are revealed as hotspots for replication slippage, particularly those composed of repeats: joined and direpeat polyXY. Inter-conversion to shuffled polyXY disrupts nucleotide repeats and restricts further evolution by replication slippage, a mechanism that we previously observed in polyX. Our results shed light on polyXY composition and should simplify the determination of their functions.
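The three arrangement classes described in this abstract can be illustrated with a minimal sketch. It assumes the region has already been extracted and contains exactly two residue types; the function name `classify_polyxy` is hypothetical, and the rule used here (counting residue-type switches) is only one simple reading of the definitions, not the authors' actual procedure.

```python
def classify_polyxy(region: str) -> str:
    """Classify a two-residue region as 'direpeat', 'joined' or 'shuffled'.

    Hypothetical helper: the rule below is an assumption based on the
    examples in the abstract, not the published pipeline.
    """
    if len(set(region)) != 2:
        raise ValueError("a polyXY region must contain exactly two amino acid types")
    # Count adjacent positions where the residue type changes.
    switches = sum(a != b for a, b in zip(region, region[1:]))
    if switches == len(region) - 1:
        return "direpeat"   # strictly alternating, e.g. XYXYXY
    if switches == 1:
        return "joined"     # one stretch of X followed by one stretch of Y
    return "shuffled"       # any other arrangement, e.g. XYXXYY


for example in ("XYXYXY", "XXXYYY", "XYXXYY"):
    print(example, classify_polyxy(example))
```

Note that a two-residue string such as 'XY' satisfies both the alternating and the joined rule; this sketch reports it as a direpeat because that check runs first.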
Affiliation(s)
- Pablo Mier
- Institute of Organismic and Molecular Evolution, Faculty of Biology, Johannes Gutenberg University Mainz, Hanns-Dieter-Hüsch-Weg 15, 55128 Mainz, Germany
- Miguel A. Andrade-Navarro
- Institute of Organismic and Molecular Evolution, Faculty of Biology, Johannes Gutenberg University Mainz, Hanns-Dieter-Hüsch-Weg 15, 55128 Mainz, Germany
2
Orlov YL, Orlova NG. Bioinformatics tools for the sequence complexity estimates. Biophys Rev 2023; 15:1367-1378. [PMID: 37974990 PMCID: PMC10643780 DOI: 10.1007/s12551-023-01140-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 08/15/2023] [Accepted: 09/01/2023] [Indexed: 11/19/2023] Open
Abstract
We review current methods and bioinformatics tools for estimating text complexity (information and entropy measures). The search for DNA regions with extreme statistical characteristics, such as low complexity regions, is important for biophysical models of chromosome function and gene transcription regulation at genome scale. We discuss complexity profiling for segmentation and delineation of genome sequences, the search for genome repeats and transposable elements, and applications to next-generation sequencing reads. We review the complexity methods and new application fields: analysis of mutation hotspot loci, analysis of short sequencing reads with quality control, and alignment-free genome comparisons. The algorithms implementing various numerical measures of text complexity, including combinatorial and linguistic measures, were developed before the genome sequencing era. A series of tools for estimating sequence complexity use compression approaches, mainly modifications of Lempel-Ziv compression. Most of the tools are available online, providing large-scale services for whole-genome analysis. Novel machine learning applications for the classification of complete genome sequences also include sequence compression and complexity algorithms. We present a comparison of the complexity methods on different sequence sets and their applications to the analysis of gene transcription regulatory regions. Furthermore, we discuss approaches to and applications of sequence complexity for proteins. Complexity measures for amino acid sequences can be calculated by the same entropy and compression-based algorithms, but the functional and evolutionary roles of low complexity regions in proteins have specific features that differ from those in DNA. The tools for protein sequence complexity are aimed at protein structural constraints. It has been shown that low complexity regions in protein sequences are conserved in evolution and have important biological and structural functions.
Finally, we summarize recent findings in large scale genome complexity comparison and applications for coronavirus genome analysis.
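As an illustration of the entropy measures and complexity profiling mentioned in this abstract, the following is a minimal sketch of a sliding-window Shannon entropy profile. The function names and the window size are assumptions chosen for the example; the tools covered by the review use their own measures and parameters.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (in bits) of the symbol composition of seq."""
    n = len(seq)
    counts = Counter(seq)
    # Sum p * log2(1/p) over the observed symbols.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def entropy_profile(seq: str, window: int = 10) -> list[float]:
    """Entropy of every window of the given length (illustrative default)."""
    return [
        shannon_entropy(seq[i:i + window])
        for i in range(len(seq) - window + 1)
    ]


# Windows covering the homopolymer run score lower than the mixed tail.
print(entropy_profile("AAAAAAAAACGTACGT", window=8))
```

Low values in the profile flag candidate low complexity regions; compression-based measures such as Lempel-Ziv variants would replace `shannon_entropy` in the same windowed scheme.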
Affiliation(s)
- Yuriy L. Orlov
- The Digital Health Institute, I.M. Sechenov First Moscow State Medical University of the Russian Ministry of Health (Sechenov University), Moscow, 119991 Russia
- Institute of Cytology and Genetics SB RAS, 630090 Novosibirsk, Russia
- Agrarian and Technological Institute, Peoples’ Friendship University of Russia, 117198 Moscow, Russia
- Nina G. Orlova
- Department of Mathematics, Financial University under the Government of the Russian Federation, Moscow, 125167 Russia
3
Mier P, Andrade-Navarro MA. Evolutionary Study of Protein Short Tandem Repeats in Protein Families. Biomolecules 2023; 13:1116. [PMID: 37509152 PMCID: PMC10377733 DOI: 10.3390/biom13071116] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/25/2023] [Revised: 07/06/2023] [Accepted: 07/12/2023] [Indexed: 07/30/2023] Open
Abstract
Tandem repeats in proteins are patterns of residues repeated directly adjacent to each other. The evolution of these repeats can be assessed by using groups of homologous sequences, which can help point to events of unit duplication or deletion. High pressure in a protein family for variation of a given type of repeat might point to its function. Here, we propose the analysis of protein families to calculate protein short tandem repeats (pSTRs) in each protein sequence and assess their variability within the family in terms of number of units. To facilitate this analysis, we developed the pSTR tool, a method to analyze the evolution of protein short tandem repeats in a given protein family by pairwise comparisons between evolutionarily related protein sequences. We evaluated pSTR unit number variation in protein families of 12 complete metazoan proteomes. We hypothesize that families with more dynamic ensembles of repeats could reflect particular roles of these repeats in processes that require more adaptability.
Affiliation(s)
- Pablo Mier
- Faculty of Biology, Institute of Organismic and Molecular Evolution, Johannes Gutenberg University Mainz, 55128 Mainz, Germany
- Miguel A Andrade-Navarro
- Faculty of Biology, Institute of Organismic and Molecular Evolution, Johannes Gutenberg University Mainz, 55128 Mainz, Germany
4
Cermakova K, Hodges HC. Interaction modules that impart specificity to disordered protein. Trends Biochem Sci 2023; 48:477-490. [PMID: 36754681 PMCID: PMC10106370 DOI: 10.1016/j.tibs.2023.01.004] [Citation(s) in RCA: 19] [Impact Index Per Article: 19.0] [Received: 09/01/2022] [Revised: 01/09/2023] [Accepted: 01/12/2023] [Indexed: 02/09/2023]
Abstract
Intrinsically disordered regions (IDRs) are especially enriched among proteins that regulate chromatin and transcription. As a result, mechanisms that influence specificity of IDR-driven interactions have emerged as exciting unresolved issues for understanding gene regulation. We review the molecular elements frequently found within IDRs that confer regulatory specificity. In particular, we summarize the differing roles of disordered low-complexity regions (LCRs) and short linear motifs (SLiMs) towards selective nuclear regulation. Examination of IDR-driven interactions highlights SLiMs as organizers of selectivity, with widespread roles in gene regulation and integration of cellular signals. Analysis of recurrent interactions between SLiMs and folded domains suggests diverse avenues for SLiMs to influence phase-separated condensates and highlights opportunities to manipulate these interactions for control of biological activity.
Affiliation(s)
- Katerina Cermakova
- Department of Molecular and Cellular Biology, Center for Precision Environmental Health, Baylor College of Medicine, Houston, TX, USA
- H Courtney Hodges
- Department of Molecular and Cellular Biology, Center for Precision Environmental Health, Baylor College of Medicine, Houston, TX, USA; Dan L. Duncan Comprehensive Cancer Center, Baylor College of Medicine, Houston, TX, USA; Department of Bioengineering, Rice University, Houston, TX, USA; Center for Cancer Epigenetics, The University of Texas MD Anderson Cancer Center, Houston, TX, USA.
5
Erdozain S, Barrionuevo E, Ripoll L, Mier P, Andrade-Navarro MA. Protein repeats evolve and emerge in giant viruses. J Struct Biol 2023; 215:107962. [PMID: 37031868 DOI: 10.1016/j.jsb.2023.107962] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2022] [Revised: 03/21/2023] [Accepted: 04/04/2023] [Indexed: 04/11/2023]
Abstract
Nucleocytoplasmic large DNA viruses (NCLDVs or giant viruses) stand out because of their relatively large genomes encoding hundreds of proteins. These species give us an unprecedented opportunity to study the emergence and evolution of repeats in protein sequences. On the one hand, as viruses, these species have a restricted set of functions, which can help us better define the functional landscape of repeats. On the other hand, given the particular use of the genetic machinery of the host, it is worth asking whether this allows the variations of genetic material that lead to repeats in non-viral species. To support research in the characterization of repeat protein evolution and function, we present here an analysis focused on the repeat proteins of giant viruses, namely tandem repeats (TRs), short repeats (SRs), and homorepeats (polyX). Proteins with large and short repeats are not very frequent in non-eukaryotic organisms because of the difficulties that their folding may entail; however, their presence in giant viruses highlights their advantage for performance in the protein environment of the eukaryotic host. The heterogeneous content of these TRs, SRs and polyX in some viruses hints at diverse needs. Comparisons to homologs suggest that the mechanisms that generate these repeats are extensively used by some of these viruses, and also point to their capacity to adopt genes with repeats. Giant viruses could be very good models for the study of the emergence and evolution of protein repeats.
Affiliation(s)
- Sofía Erdozain
- Instituto de Biotecnología y Biología Molecular, Departamento de Ciencias Biológicas, Facultad de Ciencias Exactas, Universidad Nacional de La Plata, Argentina
- Emilia Barrionuevo
- Laboratory of Bioactive Research and Development, Faculty of Exact Sciences, National University of La Plata, Argentina
- Lucas Ripoll
- Laboratory of Genetic Engineering, Cell, and Molecular Biology, National University of Quilmes, Argentina
- Pablo Mier
- Faculty of Biology, Johannes Gutenberg University of Mainz, 55128 Mainz, Germany
6
Shukla S, Lazarchuk P, Pavlova MN, Sidorova JM. Genome-wide survey of D/E repeats in human proteins uncovers their instability and aids in identifying their role in the chromatin regulator ATAD2. iScience 2022; 25:105464. [PMCID: PMC9672403 DOI: 10.1016/j.isci.2022.105464] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/04/2022] [Revised: 08/03/2022] [Accepted: 10/26/2022] [Indexed: 11/15/2022] Open
Abstract
D/E repeats are stretches of aspartic and/or glutamic acid residues found in over 150 human proteins. We examined genomic stability of D/E repeats and functional characteristics of D/E repeat-containing proteins vis-à-vis the proteins with poly-Q or poly-A repeats, which are known to undergo pathologic expansions. Mining of tumor sequencing data revealed that D/E repeat-coding regions are similar to those coding poly-Qs and poly-As in increased incidence of trinucleotide insertions/deletions but differ in types and incidence of substitutions. D/E repeat-containing proteins preferentially function in chromatin metabolism and are the more likely to be nuclear and interact with core histones, the longer their repeats are. One of the longest D/E repeats of unknown function is in ATAD2, a bromodomain family ATPase frequently overexpressed in tumors. We demonstrate that D/E repeat deletion in ATAD2 suppresses its binding to nascent and mature chromatin and to the constitutive pericentromeric heterochromatin, where ATAD2 represses satellite transcription.
- Many human proteins contain runs of aspartic/glutamic acid residues (D/E repeats)
- D/E repeats show increased incidence of in-frame insertions/deletions in tumors
- Nuclear and histone-interacting proteins often have long D/E repeats
- D/E repeat of the oncogene ATAD2 controls its binding to pericentric chromatin
Affiliation(s)
- Shalabh Shukla
- Department of Laboratory Medicine and Pathology, University of Washington, 1959 NE Pacific St., Box 357705, Seattle, WA 98195, USA
- Pavlo Lazarchuk
- Department of Laboratory Medicine and Pathology, University of Washington, 1959 NE Pacific St., Box 357705, Seattle, WA 98195, USA
- Maria N. Pavlova
- Department of Laboratory Medicine and Pathology, University of Washington, 1959 NE Pacific St., Box 357705, Seattle, WA 98195, USA
- Julia M. Sidorova
- Department of Laboratory Medicine and Pathology, University of Washington, 1959 NE Pacific St., Box 357705, Seattle, WA 98195, USA
- Corresponding author
7
Kastano K, Mier P, Dosztányi Z, Promponas VJ, Andrade-Navarro MA. Functional Tuning of Intrinsically Disordered Regions in Human Proteins by Composition Bias. Biomolecules 2022; 12:biom12101486. [PMID: 36291695 PMCID: PMC9599065 DOI: 10.3390/biom12101486] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/19/2022] [Revised: 09/30/2022] [Accepted: 10/11/2022] [Indexed: 11/16/2022] Open
Abstract
Intrinsically disordered regions (IDRs) in protein sequences are flexible, have low structural constraints and as a result have faster rates of evolution. This lack of evolutionary conservation greatly limits the use of sequence homology for the classification and functional assessment of IDRs, as opposed to globular domains. The study of IDRs requires other properties for their classification and functional prediction. While composition bias is not a necessary property of IDRs, compositionally biased regions (CBRs) have been noted as a frequent part of IDRs. We hypothesized that to characterize IDRs, it could be helpful to study their overlap with particular types of CBRs. Here, we evaluate this overlap in the human proteome. A total of 2/3 of residues in IDRs overlap CBRs. Considering CBRs enriched in one type of amino acid, we can distinguish CBRs that tend to be fully included within long IDRs (R, H, N, D, P, G), from those that partially overlap shorter IDRs (S, E, K, T), and others that tend to overlap IDR terminals (Q, A). CBRs overlap IDRs more often in nuclear proteins and in proteins involved in liquid-liquid phase separation (LLPS). Study of protein interaction networks reveals the enrichment of CBRs in IDRs by tandem repetition of short linear motifs (rich in S or P), and the existence of E-rich polar regions that could support specific protein interactions with non-specific interactions. Our results open ways to pin down the function of IDRs from their partial compositional biases.
Affiliation(s)
- Kristina Kastano
- Institute of Organismic and Molecular Evolution, Faculty of Biology, Johannes Gutenberg University, Biozentrum I, Hanns-Dieter-Hüsch-Weg 15, 55128 Mainz, Germany
- Pablo Mier
- Institute of Organismic and Molecular Evolution, Faculty of Biology, Johannes Gutenberg University, Biozentrum I, Hanns-Dieter-Hüsch-Weg 15, 55128 Mainz, Germany
- Zsuzsanna Dosztányi
- Department of Biochemistry, ELTE Eötvös Loránd University, Pázmány Péter stny 1/c, H-1117 Budapest, Hungary
- Vasilis J. Promponas
- Bioinformatics Research Laboratory, Department of Biological Sciences, University of Cyprus, 1678 Nicosia, Cyprus
- Miguel A. Andrade-Navarro
- Institute of Organismic and Molecular Evolution, Faculty of Biology, Johannes Gutenberg University, Biozentrum I, Hanns-Dieter-Hüsch-Weg 15, 55128 Mainz, Germany
- Correspondence:
8
Basu S, Bahadur RP. Conservation and coevolution determine evolvability of different classes of disordered residues in human intrinsically disordered proteins. Proteins 2021; 90:632-644. [PMID: 34626492 DOI: 10.1002/prot.26261] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 04/06/2021] [Revised: 10/07/2021] [Accepted: 10/07/2021] [Indexed: 12/19/2022]
Abstract
Structure, function, and evolution are interdependent properties of proteins. Diversity of protein functions arising from structural variations is a potential driving force behind protein evolvability. Intrinsically disordered proteins or regions (IDPs or IDRs) lack well-defined structure under normal physiological conditions, yet they are highly functional. Increased occurrence of IDPs in eukaryotes compared to prokaryotes indicates a strong correlation between protein evolution and disorderedness. IDPs generally have a higher evolutionary rate than globular proteins. Structural pliability allows IDPs to accommodate multiple mutations without affecting their functional potential. Nevertheless, how evolutionary signals vary between different classes of disordered residues (DRs) in IDPs is poorly understood. This study addresses variation of evolutionary behavior in terms of residue conservation and intra-protein coevolution among structural and functional classes of DRs in IDPs. Analyses are performed on 579 human IDPs, which are classified based on length of IDRs, interacting partners and functional classes. We find short IDRs are less conserved than long IDRs or full IDPs. Functional classes which require flexibility and specificity to perform their activity evolve comparatively slower than others. Disorder promoting amino acids evolve faster than order promoting amino acids. Pro, Gly, Ile, and Phe have a unique coevolving nature, which further emphasizes their roles in IDPs. This study sheds light on evolutionary footprints in different classes of DRs from human IDPs and enhances our understanding of the structural and functional potential of IDPs.
Affiliation(s)
- Sushmita Basu
- Computational Structural Biology Lab, Department of Biotechnology, Indian Institute of Technology Kharagpur, Kharagpur, India
- Ranjit Prasad Bahadur
- Computational Structural Biology Lab, Department of Biotechnology, Indian Institute of Technology Kharagpur, Kharagpur, India
9
Rudenko V, Korotkov E. Search for Highly Divergent Tandem Repeats in Amino Acid Sequences. Int J Mol Sci 2021; 22:ijms22137096. [PMID: 34281150 PMCID: PMC8269118 DOI: 10.3390/ijms22137096] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/25/2021] [Revised: 06/25/2021] [Accepted: 06/28/2021] [Indexed: 11/29/2022] Open
Abstract
We report a Method to Search for Highly Divergent Tandem Repeats (MSHDTR) in protein sequences which considers pairwise correlations between adjacent residues. MSHDTR was compared with some previously developed methods for searching for tandem repeats (TRs) in amino acid sequences, such as T-REKS and XSTREAM, which focus on the identification of TRs with significant sequence similarity, whereas MSHDTR detects repeats that significantly diverged during evolution, accumulating deletions, insertions, and substitutions. The application of MSHDTR to a search of the Swiss-Prot databank revealed over 15 thousand TR-containing amino acid sequences that were difficult to find using the other methods. Among the detected TRs, the most representative were those with consensus lengths of two and seven residues; these TRs were subjected to cluster analysis and the classes of patterns were identified. All TRs detected in this study have been combined into a databank accessible over the WWW.
Affiliation(s)
- Valentina Rudenko
- Center of Bioengineering Research Center of Biotechnology RAS, 119071 Moscow, Russia
- Correspondence: ; Tel.: +7-926-7248271
- Eugene Korotkov
- Center of Bioengineering Research Center of Biotechnology RAS, 119071 Moscow, Russia
- Moscow Engineering Physics Institute, National Research Nuclear University MEPhI, 115409 Moscow, Russia
10
Mier P, Andrade-Navarro MA. Assessing the low complexity of protein sequences via the low complexity triangle. PLoS One 2020; 15:e0239154. [PMID: 33378336 PMCID: PMC7773278 DOI: 10.1371/journal.pone.0239154] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 04/24/2020] [Accepted: 08/31/2020] [Indexed: 11/24/2022] Open
Abstract
Background: Proteins with low complexity regions (LCRs) have atypical sequence and structural features. Their amino acid composition varies from the expected, determined proteome-wise, and they do not follow the rules of structural folding that prevail in globular regions. One way to characterize these regions is by assessing the repeatability of a sequence, that is, calculating the local propensity of a region to be part of a repeat. Results: We combine two local measures of low complexity, repeatability (using the RES algorithm) and fraction of the most frequent amino acid, to evaluate different proteomes, datasets of protein regions with specific features, and individual cases of proteins with extreme compositions. We apply a representation called ‘low complexity triangle’ as a proof-of-concept to represent the low complexity measured values. Results show that proteomes have distinct signatures in the low complexity triangle, and that these signatures are associated with complexity features of the sequences. We developed a web tool called LCT (http://cbdm-01.zdv.uni-mainz.de/~munoz/lct/) to allow users to calculate the low complexity triangle of a given protein or region of interest. Conclusions: The low complexity triangle proves to be a suitable procedure to represent the general low complexity of a sequence or protein dataset. Homorepeats, direpeats, compositionally biased regions and globular regions occupy characteristic positions in the triangle. The described pipeline can be used to characterize LCRs and may help in quantifying the content of degenerated tandem repeats in proteins and proteomes.
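One of the two measures named in this abstract, the fraction of the most frequent amino acid, is simple enough to sketch. The example below is an assumption-laden illustration only: the function name `top_residue_fraction` is hypothetical, and the repeatability measure (the RES algorithm) is not reproduced here.

```python
from collections import Counter


def top_residue_fraction(region: str) -> float:
    """Fraction of positions occupied by the most frequent residue.

    Hypothetical helper illustrating one axis of the low complexity
    triangle; not taken from the LCT tool itself.
    """
    if not region:
        raise ValueError("region must not be empty")
    # most_common(1) returns [(residue, count)] for the top residue.
    top_count = Counter(region).most_common(1)[0][1]
    return top_count / len(region)


# A homorepeat, a direpeat and a fully mixed stretch.
for region in ("QQQQQQ", "QAQAQA", "ACDE"):
    print(region, top_residue_fraction(region))
```

A homorepeat reaches the maximum value of 1.0 and a perfect direpeat sits at 0.5, which is consistent with the abstract's statement that these region types occupy characteristic positions.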
Affiliation(s)
- Pablo Mier
- Faculty of Biology, Institute of Organismic and Molecular Evolution, Johannes Gutenberg University Mainz, Mainz, Germany
- * E-mail:
- Miguel A. Andrade-Navarro
- Faculty of Biology, Institute of Organismic and Molecular Evolution, Johannes Gutenberg University Mainz, Mainz, Germany
11
Evolutionary Study of Disorder in Protein Sequences. Biomolecules 2020; 10:biom10101413. [PMID: 33036302 PMCID: PMC7650552 DOI: 10.3390/biom10101413] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 09/02/2020] [Revised: 09/29/2020] [Accepted: 10/03/2020] [Indexed: 12/14/2022] Open
Abstract
Intrinsically disordered proteins (IDPs) contain regions lacking intrinsic globular structure (intrinsically disordered regions, IDRs). IDPs are present across the tree of life, with great variability of IDR type and frequency even between closely related taxa. To investigate the function of IDRs, we evaluated and compared the distribution of disorder content in 10,695 reference proteomes, confirming its high variability and finding certain correlation along the Euteleostomi (bony vertebrates) lineage to number of cell types. We used the comparison of orthologs to study the function of disorder related to increase in cell types, observing that multiple interacting subunits of protein complexes might gain IDRs in evolution, thus stressing the function of IDRs in modulating protein-protein interactions, particularly in the cell nucleus. Interestingly, the conservation of local compositional biases of IDPs follows residue-type specific patterns, with E- and K-rich regions being evolutionarily stable and Q- and A-rich regions being more dynamic. We provide a framework for targeted evolutionary studies of the emergence of IDRs. We believe that, given the large variability of IDR distributions in different species, studies using this evolutionary perspective are required.
12
Lobanov MY, Likhachev IV, Galzitskaya OV. Disordered Residues and Patterns in the Protein Data Bank. Molecules 2020; 25:molecules25071522. [PMID: 32230759 PMCID: PMC7180803 DOI: 10.3390/molecules25071522] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 03/06/2020] [Revised: 03/24/2020] [Accepted: 03/25/2020] [Indexed: 01/05/2023] Open
Abstract
We created a new library of disordered patterns and disordered residues in the Protein Data Bank (PDB). To obtain such datasets, we clustered the PDB and obtained the groups of chains with different identities and marked disordered residues. We elaborated a new procedure for finding disordered patterns and created a new version of the library. This library includes three sets of patterns: unique patterns, patterns consisting of two kinds of amino acids, and homo-repeats. Using this database, the user can: (1) find homologues in the entire Protein Data Bank; (2) perform a statistical analysis of disordered residues in protein structures; (3) search for disordered patterns and homo-repeats; (4) search for disordered regions in different chains of the same protein; (5) download clusters of protein chains with different identity from our database and library of disordered patterns; and (6) observe 3D structure interactively using MView. A new library of disordered patterns will help improve the accuracy of predictions for residues that will be structured or unstructured in a given region.
Affiliation(s)
- Mikhail Yu. Lobanov
- Institute of Protein Research, Russian Academy of Sciences, Pushchino, 142290 Moscow, Russia
- Ilya V. Likhachev
- Institute of Protein Research, Russian Academy of Sciences, Pushchino, 142290 Moscow, Russia
- Institute of Mathematical Problems of Biology, Keldysh Institute of Applied Mathematics, Russian Academy of Sciences, Vitkevicha str.1, Pushchino, 142290 Moscow, Russia
- Oxana V. Galzitskaya
- Institute of Protein Research, Russian Academy of Sciences, Pushchino, 142290 Moscow, Russia
- Institute of Theoretical and Experimental Biophysics, Russian Academy of Sciences, Pushchino, 142290 Moscow, Russia
- Correspondence: ; Tel.: +7-903-675-0156