1
Mier P, Andrade-Navarro MA. The nucleotide landscape of polyXY regions. Comput Struct Biotechnol J 2023; 21:5408-5412. [PMID: 38022702; PMCID: PMC10652141; DOI: 10.1016/j.csbj.2023.10.054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2023] [Revised: 10/30/2023] [Accepted: 10/30/2023] [Indexed: 12/01/2023] Open
Abstract
PolyXY regions are compositionally biased regions composed of two different amino acids. They are classified according to the arrangement of the two amino acid types 'X' and 'Y' into direpeats (composed of alternating amino acids, e.g. 'XYXYXY'), joined (composed of two consecutive stretches of each amino acid, e.g. 'XXXYYY') and shuffled (other arrangements, e.g., 'XYXXYY'). They have been characterized at the amino acid level in all domains of life, and are described as often found within intrinsically disordered regions. Since DNA replication slippage has been proposed as a driver of repeat variation, and given that some polyXY have a repetitive nature, we hypothesized that characterizing the nucleotide coding of various types of polyXY could give hints about their origin and evolution. To test this, we obtained all polyXY regions in the human transcriptome, categorized them, and studied their coding nucleotide sequences. We observed that polyXY exacerbates the codon biases, and that the similarity between the X and Y codons is higher than in the background proteome. Our results support a general mechanism of emergence and evolution of polyXY from single-codon polyX. PolyXY are revealed as hotspots for replication slippage, particularly those composed of repeats: joined and direpeat polyXY. Inter-conversion to shuffled polyXY disrupts nucleotide repeats and restricts further evolution by replication slippage, a mechanism that we previously observed in polyX. Our results shed light on polyXY composition and should simplify the determination of their functions.
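The three arrangements described in this abstract can be illustrated with a minimal Python sketch. It assumes the region has already been detected and consists of exactly two residue types; the function name `classify_polyxy` is hypothetical, and any length or purity thresholds used by the authors are not given in the abstract and are not modelled here.

```python
def classify_polyxy(region: str) -> str:
    """Classify a two-residue region as 'direpeat', 'joined' or 'shuffled'.

    Illustrative only: follows the definitions quoted in the abstract,
    not the authors' actual detection pipeline.
    """
    if len(set(region)) != 2:
        raise ValueError("expected a region made of exactly two residue types")

    # Count positions where the residue differs from its left neighbour.
    changes = sum(a != b for a, b in zip(region, region[1:]))

    # Direpeat: strict alternation, so every neighbouring pair differs.
    if changes == len(region) - 1:
        return "direpeat"
    # Joined: one stretch of each residue, so exactly one change point.
    if changes == 1:
        return "joined"
    # Anything else mixes the two residues irregularly.
    return "shuffled"


if __name__ == "__main__":
    for example in ("XYXYXY", "XXXYYY", "XYXXYY"):
        print(example, classify_polyxy(example))
```

A two-residue string such as 'XY' satisfies both the alternation and the single-change-point rule; this sketch checks alternation first and reports it as a direpeat, which is an arbitrary choice.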
Affiliation(s)
- Pablo Mier
  - Institute of Organismic and Molecular Evolution, Faculty of Biology, Johannes Gutenberg University Mainz, Hanns-Dieter-Hüsch-Weg 15, 55128 Mainz, Germany
- Miguel A. Andrade-Navarro
  - Institute of Organismic and Molecular Evolution, Faculty of Biology, Johannes Gutenberg University Mainz, Hanns-Dieter-Hüsch-Weg 15, 55128 Mainz, Germany
2
Abstract
We review current methods and bioinformatics tools for estimating text complexity (information and entropy measures). The search for DNA regions with extreme statistical characteristics, such as low complexity regions, is important for biophysical models of chromosome function and gene transcription regulation at the genome scale. We discuss complexity profiling for the segmentation and delineation of genome sequences, the search for genome repeats and transposable elements, and applications to next-generation sequencing reads. We review complexity methods and new fields of application: analysis of mutation hotspot loci, analysis of short sequencing reads with quality control, and alignment-free genome comparisons. Algorithms implementing various numerical measures of text complexity, including combinatorial and linguistic measures, were developed before the genome sequencing era. A series of tools estimate sequence complexity using compression approaches, mainly modifications of Lempel-Ziv compression. Most of the tools are available online and provide large-scale services for whole-genome analysis. Novel machine learning applications for the classification of complete genome sequences also include sequence compression and complexity algorithms. We present a comparison of the complexity methods on different sequence sets and their application to the analysis of gene transcription regulatory regions. Furthermore, we discuss approaches to and applications of sequence complexity for proteins. Complexity measures for amino acid sequences can be calculated with the same entropy- and compression-based algorithms, but the functional and evolutionary roles of low complexity regions in proteins have specific features that differ from those in DNA. The tools for protein sequence complexity are aimed at protein structural constraints. It has been shown that low complexity regions in protein sequences are conserved in evolution and have important biological and structural functions. Finally, we summarize recent findings in large-scale genome complexity comparison and applications to coronavirus genome analysis.
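As a rough illustration of the two families of measures this abstract mentions (entropy and compression), the following Python sketch computes a sliding-window Shannon entropy profile and a compression ratio. The function names and the default window size are assumptions made for the example. `zlib` implements DEFLATE (LZ77 plus Huffman coding), so it stands in here for the Lempel-Ziv modifications used by the reviewed tools, which are not specified in the abstract.

```python
import math
import zlib
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy of the symbol distribution in seq, in bits per symbol."""
    n = len(seq)
    counts = Counter(seq)
    # Sum of p * log2(1/p) over observed symbols; 0 for a single-symbol run.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def entropy_profile(seq: str, window: int = 12, step: int = 1) -> list[float]:
    """Entropy of each window along seq; low values flag low complexity."""
    return [
        shannon_entropy(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]


def compression_ratio(seq: str) -> float:
    """Compressed size divided by raw size; lower means more repetitive."""
    if not seq:
        raise ValueError("empty sequence")
    raw = seq.encode("ascii")
    return len(zlib.compress(raw, 9)) / len(raw)


if __name__ == "__main__":
    dna = "ACGTACGGTCAAAAAAAAAAAAAAAAGCTTAGC"
    print([round(h, 2) for h in entropy_profile(dna, window=8)])
```

The compression ratio is only meaningful for sequences long enough that the fixed zlib header and checksum overhead is negligible.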
Affiliation(s)
- Yuriy L. Orlov
  - The Digital Health Institute, I.M. Sechenov First Moscow State Medical University of the Russian Ministry of Health (Sechenov University), Moscow, 119991 Russia
  - Institute of Cytology and Genetics SB RAS, 630090 Novosibirsk, Russia
  - Agrarian and Technological Institute, Peoples’ Friendship University of Russia, 117198 Moscow, Russia
- Nina G. Orlova
  - Department of Mathematics, Financial University under the Government of the Russian Federation, Moscow, 125167 Russia
3
Cappannini A, Forcelloni S, Giansanti A. Evolutionary pressures and codon bias in low complexity regions of plasmodia. Genetica 2021; 149:217-237. [PMID: 34254217; DOI: 10.1007/s10709-021-00126-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/04/2020] [Accepted: 06/30/2021] [Indexed: 11/25/2022]
Abstract
The biological meaning of low complexity regions in the proteins of Plasmodium species is a topic of discussion in evolutionary biology. There is a debate between selectionists and neutralists, who respectively do and do not attribute to low complexity regions an effect on the fitness of these parasites. In this work, we comparatively study 22 Plasmodium species to understand whether their low complexity regions undergo a neutral or, rather, a selective and species-dependent evolution. The focus is on the connection between the codon repertoire of the genetic coding sequences and the occurrence of low complexity regions in the corresponding proteins. The first part of the work concerns the correlation between the length of plasmodial proteins and their propensity to embed low complexity regions. Relative synonymous codon usage, entropy, and other indicators reveal that the incidence of low complexity regions and their codon bias are species-specific and subject to selective evolutionary pressure. We also observed that protein length, a relaxed selective pressure, and a broad repertoire of codons in proteins are strongly correlated with the occurrence of low complexity regions. Overall, it seems plausible that the codon bias of low complexity regions contributes to functional innovation and codon bias enhancement of proteins on which Plasmodium species rest as successful evolutionary parasites.
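Relative synonymous codon usage (RSCU), one of the indicators named in this abstract, is conventionally defined as a codon's observed count divided by the count expected if all synonymous codons for that amino acid were used equally. The Python sketch below is a minimal illustration under that standard definition; it assumes an uppercase, in-frame DNA coding sequence and the standard genetic code, and the function name `rscu` is hypothetical rather than taken from the authors' code.

```python
from collections import Counter, defaultdict

# Standard genetic code, built from the usual TCAG-ordered amino acid string.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: _AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(_BASES)
    for j, b in enumerate(_BASES)
    for k, c in enumerate(_BASES)
}


def rscu(cds: str) -> dict[str, float]:
    """RSCU per codon for an in-frame, uppercase DNA coding sequence.

    RSCU = observed count * number of synonymous codons / total count for
    that amino acid. Amino acids absent from the sequence and stop codons
    are skipped.
    """
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))

    synonyms = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        synonyms[aa].append(codon)

    values = {}
    for aa, codons in synonyms.items():
        total = sum(counts[c] for c in codons)
        if aa == "*" or total == 0:
            continue
        for codon in codons:
            values[codon] = counts[codon] * len(codons) / total
    return values


if __name__ == "__main__":
    print(rscu("GCTGCTGCTGCC"))  # four alanine codons, three of them GCT
```

A value of 1.0 means the codon is used as often as expected under equal usage; values above 1.0 indicate a preferred codon.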
Affiliation(s)
- Andrea Cappannini
  - Department of Physics, Sapienza, University of Rome, P.le A. Moro 5, 00185, Roma, Italy
- Sergio Forcelloni
  - Max Planck Institute of Biochemistry, 82152, Martinsried, Germany
  - Department of Chemistry, Technical University of Munich, 85748, Garching, Germany
- Andrea Giansanti
  - Department of Physics, Sapienza, University of Rome, P.le A. Moro 5, 00185, Roma, Italy
  - Istituto Nazionale di Fisica Nucleare, INFN, Roma1 section, 00185, Roma, Italy
4
Kamel M, Mier P, Tari A, Andrade-Navarro MA. Repeatability in protein sequences. J Struct Biol 2019; 208:86-91. [PMID: 31408700; DOI: 10.1016/j.jsb.2019.08.003] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 04/03/2019] [Revised: 08/06/2019] [Accepted: 08/08/2019] [Indexed: 02/07/2023]
Abstract
Low complexity regions (LCRs) in protein sequences have special properties that are very different from those of globular proteins. The rules that define secondary structure elements do not apply when the distribution of amino acids becomes biased. While there is a tendency towards structural disorder in LCRs, various examples, and particularly homorepeats of single amino acids, suggest that very short repeats could adopt structures that are very difficult to predict. These structures are possibly variable and dependent on the context of intra- or inter-molecular interactions. In general, short repeats in LCRs can induce structure. This could explain the observation that very short (non-perfect) repeats are widespread and that many define regions with a function in protein interactions. For these reasons, we have developed an algorithm to quickly analyze local repeatability along protein sequences, that is, how close a protein fragment is to a perfect repeat. Using this algorithm we identified that the proteins of the yeast Saccharomyces cerevisiae are depleted in short repeats (approximate or not) of odd length, while the human proteins are not, that the fish Danio rerio has many proteins with repeats of length two, and that the plant Arabidopsis thaliana has an unusually large number of repeats of length seven. Our method (REpeatability Scanner, RES, accessible at http://cbdm-01.zdv.uni-mainz.de/~munoz/res/) makes it possible to find regions with approximate short repeats in protein sequences, and helps to characterize the variable use of LCRs and compositional bias in different organisms.
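The idea of measuring how close a fragment is to a perfect repeat can be sketched in Python as follows. This is not the RES algorithm, whose scoring is not described in the abstract; it is one simple stand-in that scores a fragment by the fraction of positions matching the residue one period downstream. The names `repeat_score` and `best_period` are hypothetical.

```python
def repeat_score(fragment: str, period: int) -> float:
    """Fraction of positions i where fragment[i] == fragment[i + period].

    1.0 means the fragment is a perfect repeat with that unit length;
    lower values indicate an approximate repeat or none at all.
    """
    if not 0 < period < len(fragment):
        raise ValueError("period must be between 1 and len(fragment) - 1")
    pairs = len(fragment) - period
    matches = sum(fragment[i] == fragment[i + period] for i in range(pairs))
    return matches / pairs


def best_period(fragment: str, max_period: int = 10) -> tuple[int, float]:
    """Unit length with the highest score (smallest wins on ties)."""
    limit = min(max_period, len(fragment) - 1)
    scores = {p: repeat_score(fragment, p) for p in range(1, limit + 1)}
    period = max(scores, key=scores.get)
    return period, scores[period]


if __name__ == "__main__":
    print(best_period("QAQAQAQAQS"))
```

Sliding such a score along a protein with a fixed window would give a local profile, in the spirit of the scanner described above.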
5
Kebede AM, Tadesse FG, Feleke AD, Golassa L, Gadisa E. Effect of low complexity regions within the PvMSP3α block II on the tertiary structure of the protein and implications to immune escape mechanisms. BMC Struct Biol 2019; 19:6. [PMID: 30917807; DOI: 10.1186/s12900-019-0104-0] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 01/09/2019] [Accepted: 03/07/2019] [Indexed: 01/24/2023]
Abstract
Background: Plasmodium vivax merozoite surface protein 3α (PvMSP3α) is a promising vaccine candidate which has shown strong association with immunogenicity and protectiveness. Its use is, however, complicated by evolutionary plasticity features which enhance immune evasion. Low complexity regions (LCRs) provide plasticity in surface proteins of Plasmodium species, but their implications for vaccine design remain unexplored. Here, population genetic, comparative phylogenetic and structural biology analyses were performed on the gene encoding PvMSP3α.
Results: Three LCRs were found in PvMSP3α block II. Both the predicted tertiary structure of the protein and the phylogenetic trees based on this region were influenced by the presence of the LCRs. B cell epitopes were mainly located within or adjacent to the LCRs. In addition, a repeat motif mimicking one of the B cell epitopes was found within the PvMSP3α block II low complexity region. This particular B cell epitope also featured rampant alanine substitutions which might impair antibody binding.
Conclusion: The findings indicate that PvMSP3α block II possesses LCRs which might confer a strong phenotypic plasticity. The phenomenon of phenotypic plasticity and the implications of LCRs in malaria immunology in general, and in vaccine candidate genes in particular, merit further exploration.
6
Kumari B, Kumar R, Chauhan V, Kumar M. Comparative functional analysis of proteins containing low-complexity predicted amyloid regions. PeerJ 2018; 6:e5823. [PMID: 30397544; PMCID: PMC6214233; DOI: 10.7717/peerj.5823] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 03/31/2018] [Accepted: 09/25/2018] [Indexed: 11/20/2022] Open
Abstract
Background: In both prokaryotic and eukaryotic proteins, repeated occurrences of a single amino acid or a small group of amino acids are found. These regions are termed low complexity regions (LCRs). It has been observed that amino acid bias in LCRs is directly linked to their uncontrolled expansion and amyloid formation. However, a comparative analysis of the behavior of LCRs based on their constituent amino acids and their association with amyloidogenic propensity is not available.
Methods: First, we grouped all LCRs on the basis of their composition: homo-polymers, positively charged amino acids, negatively charged amino acids, polar amino acids and hydrophobic amino acids. We analyzed the compositional pattern of LCRs in each group and their propensity to form amyloids. The functional characteristics of proteins containing different groups of LCRs were explored using DAVID. In addition, we analyzed the classes, pathways and functions of human proteins that form amyloids in LCRs.
Results: Among homopolymeric LCRs, the most common were Gln repeats. LCRs composed of repeats of Met and aromatic amino acids were amongst the least frequent. The results revealed that LCRs composed of negatively charged and polar amino acids were more common than LCRs formed by positively charged and hydrophobic amino acids. We also noted that proteins with LCRs were generally involved in transcription, but those with Gly repeats were associated with translational activities. Our analysis suggests that proteins in which the LCR is composed of hydrophobic residues are more prone to amyloid formation. We also found that the human proteins with amyloid-forming LCRs were generally involved in binding and catalytic activity.
Discussion: The presented analysis summarizes the most common and least frequent LCRs in proteins. Our results show that although repeats of Gln are the most abundant, Asn repeats make the longest stretches of low complexity. The results showed that the potential of LCRs to form amyloids varies with their amino acid composition.
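The composition-based grouping described in the Methods can be illustrated with a short Python sketch. The five group names come from the abstract, but the residue-to-class assignment and the majority rule below are assumptions made for this example only; the authors' actual grouping criteria are not stated in the abstract, and the function name `lcr_group` is hypothetical.

```python
# ASSUMPTION: one common textbook partition of the 20 amino acids.
# The class of some residues (e.g. G, Y, H, C) varies between sources.
RESIDUE_CLASSES = {
    "positively charged": set("KRH"),
    "negatively charged": set("DE"),
    "polar": set("STNQCY"),
    "hydrophobic": set("AVLIMFWPG"),
}


def lcr_group(lcr: str) -> str:
    """Assign an LCR to a composition group by its most frequent class.

    A region made of a single residue type is a homopolymer; otherwise the
    class contributing the most residues wins (first listed wins on ties).
    """
    if not lcr:
        raise ValueError("empty region")
    if len(set(lcr)) == 1:
        return "homopolymer"
    tallies = {
        name: sum(residue in members for residue in lcr)
        for name, members in RESIDUE_CLASSES.items()
    }
    return max(tallies, key=tallies.get)


if __name__ == "__main__":
    for region in ("QQQQQQQQ", "DEDEEDDE", "KRKRKKRH", "SSTNSQTS", "LLVALAIL"):
        print(region, "->", lcr_group(region))
```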
Affiliation(s)
- Bandana Kumari
  - Department of Biophysics, University of Delhi South Campus, New Delhi, India
- Ravindra Kumar
  - Department of Biophysics, University of Delhi South Campus, New Delhi, India
- Vipin Chauhan
  - Department of Genetics, University of Delhi South Campus, New Delhi, India
  - Current affiliation: Centre for Neuroscience, Indian Institute of Science, Bangalore, India
- Manish Kumar
  - Department of Biophysics, University of Delhi South Campus, New Delhi, India
7
Velasco AM, Becerra A, Hernández-Morales R, Delaye L, Jiménez-Corona ME, Ponce-de-Leon S, Lazcano A. Low complexity regions (LCRs) contribute to the hypervariability of the HIV-1 gp120 protein. J Theor Biol 2013; 338:80-86. [PMID: 24021867; DOI: 10.1016/j.jtbi.2013.08.039] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 01/10/2013] [Revised: 08/01/2013] [Accepted: 08/31/2013] [Indexed: 01/27/2023]
Abstract
Low complexity regions (LCRs) are sequences of nucleic acids or proteins defined by a compositional bias. Their occurrence has been confirmed in sequences of the three cellular lineages (Bacteria, Archaea and Eucarya), and has also been reported in viral genomes. We present here the results of a detailed computer analysis of the LCRs present in the HIV-1 glycoprotein 120 (gp120) encoded by the viral gene env. The analysis was performed using a sample of 3637 Env polyprotein sequences derived from 4117 completely sequenced and translated HIV-1 genomes available in public databases as of December 2012. We have identified 1229 LCRs located in four different regions of the gp120 protein that correspond to four of the five regions that have been identified as hypervariable (V1, V2, V4 and V5). The remaining 29 LCRs are found in the signal peptide and in the conserved regions C2, C3, C4 and C5. No LCR has been identified in the hypervariable region V3. The LCRs detected in the V1, V2, V4, and V5 hypervariable regions exhibit a high Asn content, very likely corresponding to glycosylation sites that may contribute to the retroviral ability to evade the immune system. In sharp contrast with what is observed in gp120 proteins lacking LCRs, the glycosylation sites present in LCRs tend to be clustered towards the center of the region, forming well-defined islands. The results presented here suggest that LCRs represent a hitherto undescribed source of genomic variability in lentiviruses, and that these repeats may represent an important source of antigenic variation in HIV-1 populations. The results reported here may exemplify the evolutionary processes that may have increased the size of primitive cellular RNA genomes, and the role of LCRs as a source of raw material during the evolutionary acquisition of new functions.
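The link between Asn-rich LCRs and glycosylation sites can be made concrete with a short Python sketch that locates potential N-linked glycosylation sequons, using the widely used consensus N-X-S/T with X being any residue except Pro. How the authors identified glycosylation sites is not stated in the abstract, so this is only an assumed approach; the function name `sequon_positions` is hypothetical.

```python
import re

# Consensus sequon for potential N-linked glycosylation: Asn, any residue
# except Pro, then Ser or Thr. The lookahead lets overlapping sites
# (e.g. "NNTS") be reported individually.
_SEQUON = re.compile(r"(?=N[^P][ST])")


def sequon_positions(protein: str) -> list[int]:
    """0-based positions of the Asn in each potential N-glycosylation sequon."""
    return [match.start() for match in _SEQUON.finditer(protein)]


def relative_positions(region: str) -> list[float]:
    """Sequon positions scaled to 0..1 along the region (0.5 is the center)."""
    if len(region) < 2:
        return []
    return [pos / (len(region) - 1) for pos in sequon_positions(region)]


if __name__ == "__main__":
    print(sequon_positions("ANNTSNPSNGT"))
```

Applied to an LCR, `relative_positions` gives a quick view of whether predicted sites sit near the middle of the region, which is the kind of clustering the abstract reports. A matching sequon indicates only a potential site, not confirmed glycosylation.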
Affiliation(s)
- Ana María Velasco
  - Facultad de Ciencias, UNAM, Ciudad Universitaria, Apdo. Postal 70-407, México D. F. 04510, Mexico
  - Laboratorios de Biológicos y Reactivos de México, Amores 1240, Colonia Del Valle, México D. F. 03100, Mexico