1
Mizuno Y, Nakasone W, Nakamura M, Otaki JM. In Silico and In Vitro Evaluation of the Molecular Mimicry of the SARS-CoV-2 Spike Protein by Common Short Constituent Sequences (cSCSs) in the Human Proteome: Toward Safer Epitope Design for Vaccine Development. Vaccines (Basel) 2024; 12:539. [PMID: 38793790; PMCID: PMC11125730; DOI: 10.3390/vaccines12050539] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/30/2024] [Revised: 05/12/2024] [Accepted: 05/12/2024] [Indexed: 05/26/2024] [Open Access]
Abstract
Spike protein sequences in SARS-CoV-2 have been employed for vaccine epitopes, but many short constituent sequences (SCSs) in the spike protein are present in the human proteome, suggesting that some anti-spike antibodies induced by infection or vaccination may be autoantibodies against human proteins. To evaluate this possibility of "molecular mimicry" in silico and in vitro, we exhaustively identified common SCSs (cSCSs) found both in spike and human proteins bioinformatically. The commonality of SCSs between the two systems seemed to be coincidental, and only some cSCSs were likely to be relevant to potential self-epitopes based on three-dimensional information. Among three antibodies raised against cSCS-containing spike peptides, only the antibody against EPLDVL showed high affinity for the spike protein and reacted with an EPLDVL-containing peptide from the human unc-80 homolog protein. Western blot analysis revealed that this antibody also reacted with several human proteins expressed mainly in the small intestine, ovary, and stomach. Taken together, these results showed that most cSCSs are likely incapable of inducing autoantibodies but that at least EPLDVL functions as a self-epitope, suggesting a serious possibility of infection-induced or vaccine-induced autoantibodies in humans. High-risk cSCSs, including EPLDVL, should be excluded from vaccine epitopes to prevent potential autoimmune disorders.
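The exhaustive search for common SCSs described in this abstract can be pictured as a set intersection over fixed-length substrings. The following is only a minimal sketch under stated assumptions: the function names `kmers` and `common_scs`, the window length `k=6`, and the toy sequences are all hypothetical illustrations and are not taken from the paper, whose actual pipeline may differ.

```python
def kmers(seq: str, k: int) -> set:
    """Return the set of all length-k substrings (SCSs) of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def common_scs(query: str, proteome: dict, k: int = 6) -> dict:
    """Map each k-mer shared with `query` to the proteome entries containing it.

    `proteome` is a dict of {protein_name: sequence}; both arguments are
    hypothetical inputs chosen for illustration.
    """
    query_kmers = kmers(query, k)
    hits = {}
    for name, seq in proteome.items():
        for scs in query_kmers & kmers(seq, k):
            hits.setdefault(scs, []).append(name)
    return hits


# Toy data only: these are made-up fragments, not real spike or human sequences.
spike_fragment = "AAEPLDVLGG"
toy_proteome = {"protA": "MKEPLDVLTT", "protB": "MKKKKKKK"}
result = common_scs(spike_fragment, toy_proteome, k=6)
print(result)
```

On real data, the proteome dictionary would be built from a FASTA file, and a sequence match alone would still need the structural and experimental checks the abstract describes.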
Affiliation(s)
- Yuya Mizuno: The BCPH Unit of Molecular Physiology, Department of Chemistry, Biology and Marine Science, Faculty of Science, University of the Ryukyus, Senbaru, Nishihara 903-0213, Okinawa, Japan
- Wataru Nakasone: Computer Science and Intelligent Systems Unit, Department of Engineering, Faculty of Engineering, University of the Ryukyus, Senbaru, Nishihara 903-0213, Okinawa, Japan
- Morikazu Nakamura: Computer Science and Intelligent Systems Unit, Department of Engineering, Faculty of Engineering, University of the Ryukyus, Senbaru, Nishihara 903-0213, Okinawa, Japan
- Joji M. Otaki: The BCPH Unit of Molecular Physiology, Department of Chemistry, Biology and Marine Science, Faculty of Science, University of the Ryukyus, Senbaru, Nishihara 903-0213, Okinawa, Japan

2
Algorithmically-guided discovery of viral epitopes via linguistic parsing: Problem formulation and solving by soft computing. Appl Soft Comput 2022. [DOI: 10.1016/j.asoc.2022.109509] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]

3
Jiang L, Wang D, Xu D. A Pretrained ELECTRA Model for Kinase-Specific Phosphorylation Site Prediction. Methods Mol Biol 2022; 2499:105-124. [PMID: 35696076; DOI: 10.1007/978-1-0716-2317-6_4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
Phosphorylation plays a vital role in signal transduction and the cell cycle. Identifying and understanding phosphorylation through machine-learning methods has a long history. However, existing methods learn representations of a protein sequence segment only from the labeled dataset itself, which could result in biased or incomplete features, especially for kinase-specific phosphorylation site prediction, in which training data are typically sparse. To learn a comprehensive contextual representation of a protein sequence segment for kinase-specific phosphorylation site prediction, we pretrained our model on over 24 million unlabeled sequence fragments using ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately). The pretrained model was applied to kinase-specific site prediction for the kinases CDK, PKA, CK2, MAPK, and PKC. The pretrained ELECTRA model achieves a 9.02% improvement over BERT and an 11.10% improvement over MusiteDeep in the area under the precision-recall curve on the benchmark data.
Affiliation(s)
- Lei Jiang: Department of Electrical Engineering and Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
- Duolin Wang: Department of Electrical Engineering and Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
- Dong Xu: Department of Electrical Engineering and Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, USA

4
Caetano-Anollés G. The Compressed Vocabulary of Microbial Life. Front Microbiol 2021; 12:655990. [PMID: 34305827; PMCID: PMC8292947; DOI: 10.3389/fmicb.2021.655990] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 01/19/2021] [Accepted: 04/27/2021] [Indexed: 12/22/2022] [Open Access]
Abstract
Communication is an undisputed central activity of life that requires an evolving molecular language. It conveys meaning through messages and vocabularies. Here, I explore the existence of a growing vocabulary in the molecules and molecular functions of the microbial world. There are clear correspondences between the lexicon, syntax, semantics, and pragmatics of language organization and the module, structure, function, and fitness paradigms of molecular biology. These correspondences are constrained by universal laws and engineering principles. Macromolecular structure, for example, follows quantitative linguistic patterns arising from statistical laws that are likely universal, including Zipf's law, a special case of the scale-free distribution; Heaps' law, describing sublinear growth typical of economies of scale; and the Menzerath-Altmann law, which imposes size-dependent patterns of decreasing returns. Trade-off solutions between principles of economy, flexibility, and robustness define a "triangle of persistence" describing the impact of the environment on a biological system. The pragmatic landscape of the triangle interfaces with the syntax and semantics of molecular languages, which together with comparative and evolutionary genomic data can explain global patterns of diversification of cellular life. The vocabularies of proteins (proteomes) and functions (functionomes) revealed a significant universal lexical core supporting a universal common ancestor, an ancestral evolutionary link between Bacteria and Eukarya, and distinct reductive evolutionary strategies of language compression in Archaea and Bacteria. A "causal" word cloud strategy inspired by the dependency grammar paradigm used in catenae unfolded the evolution of lexical units associated with Gene Ontology terms at different levels of ontological abstraction. While Archaea holds the smallest, oldest, and most homogeneous vocabulary of all superkingdoms, Bacteria heterogeneously apportions a more complex vocabulary, and Eukarya pushes functional innovation through mechanisms of flexibility and robustness.
Affiliation(s)
- Gustavo Caetano-Anollés: Evolutionary Bioinformatics Laboratory, Department of Crop Sciences, and C. R. Woese Institute for Genomic Biology, University of Illinois, Urbana, IL, United States

5
Endo S, Motomura K, Tsuhako M, Kakazu Y, Nakamura M, Otaki JM. Search for Human-Specific Proteins Based on Availability Scores of Short Constituent Sequences: Identification of a WRWSH Protein in Human Testis. Comput Biol Chem 2020. [DOI: 10.5772/intechopen.89653] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/08/2022]
Abstract
Little is known about protein sequences unique to humans. Here, we performed alignment-free sequence comparisons based on the availability (frequency bias) of short constituent amino acid (aa) sequences (SCSs) in proteins to search for human-specific proteins. Focusing on 5-aa SCSs (pentats), exhaustive comparisons of availability scores between the human proteome and nine other mammalian proteomes in the nonredundant (nr) database identified a candidate protein containing WRWSH, here called FAM75, as human-specific. Examination of various human genome sequences revealed that FAM75 had genomic DNA sequences for either WRWSH or WRWSR due to a single nucleotide polymorphism (SNP). FAM75 and its related protein FAM205A were found to be produced through alternative splicing. The FAM75 transcript was found only in humans, but the FAM205A transcript was also present in other mammals. In humans, both FAM75 and FAM205A were expressed specifically in testis at the mRNA level, and they were immunohistochemically located in cells in seminiferous ducts and in acrosomes in spermatids at the protein level, suggesting their possible function in sperm development and fertilization. This study highlights a practical application of SCS-based methods for protein searches and suggests possible contributions of SNP variants and alternative splicing of FAM75 to human evolution.

6
Öztürk H, Özgür A, Schwaller P, Laino T, Ozkirimli E. Exploring chemical space using natural language processing methodologies for drug discovery. Drug Discov Today 2020; 25:689-705. [DOI: 10.1016/j.drudis.2020.01.020] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Received: 09/17/2019] [Revised: 12/20/2019] [Accepted: 01/28/2020] [Indexed: 01/06/2023]

7
Cardoso MH, Orozco RQ, Rezende SB, Rodrigues G, Oshiro KGN, Cândido ES, Franco OL. Computer-Aided Design of Antimicrobial Peptides: Are We Generating Effective Drug Candidates? Front Microbiol 2020; 10:3097. [PMID: 32038544; PMCID: PMC6987251; DOI: 10.3389/fmicb.2019.03097] [Citation(s) in RCA: 108] [Impact Index Per Article: 27.0] [Received: 09/28/2019] [Accepted: 12/20/2019] [Indexed: 11/16/2022] [Open Access]
Abstract
Antimicrobial peptides (AMPs), especially antibacterial peptides, have been widely investigated as potential alternatives to antibiotic-based therapies. Indeed, naturally occurring and synthetic AMPs have shown promising results against a series of clinically relevant bacteria. Even so, this class of antimicrobials has repeatedly failed at some point in clinical trials, highlighting the importance of AMP optimization. In this context, the computer-aided design of AMPs has put together crucial information on chemical parameters and bioactivities in AMP sequences, thus providing modes of prediction to evaluate the antibacterial potential of a candidate sequence before synthesis. Quantitative structure-activity relationship (QSAR) computational models, for instance, have greatly contributed to AMP sequence optimization aimed at improved biological activities. In addition to machine-learning methods, de novo design, linguistic models, pattern insertion methods, and genetic algorithms have shown the potential to boost the automated design of AMPs. However, how successful have these approaches been in generating effective antibacterial drug candidates? Bearing this in mind, this review will focus on the main computational strategies that have generated AMPs with promising activities against pathogenic bacteria, as well as anti-infective potential in different animal models, including sepsis and cutaneous infections. Moreover, we will point out recent studies on the computer-aided design of antibiofilm peptides. As expected from automated design strategies, diverse candidate sequences with different structural arrangements have been generated and deposited in databases. We will, therefore, also discuss the structural diversity that has been engendered.
Affiliation(s)
- Marlon H Cardoso: S-Inova Biotech, Programa de Pós-Graduação em Biotecnologia, Universidade Católica Dom Bosco, Campo Grande, Brazil; Centro de Análises Proteômicas e Bioquímicas, Pós-Graduação em Ciências Genômicas e Biotecnologia, Universidade Católica de Brasília, Brasília, Brazil
- Raquel Q Orozco: S-Inova Biotech, Programa de Pós-Graduação em Biotecnologia, Universidade Católica Dom Bosco, Campo Grande, Brazil; Instituto de Ciências Biológicas, Departamento de Biologia, Programa de Pós-Graduação em Ciências Biológicas (Imunologia/Genética e Biotecnologia), Universidade Federal de Juiz de Fora, Juiz de Fora, Brazil
- Samilla B Rezende: S-Inova Biotech, Programa de Pós-Graduação em Biotecnologia, Universidade Católica Dom Bosco, Campo Grande, Brazil
- Gisele Rodrigues: Centro de Análises Proteômicas e Bioquímicas, Pós-Graduação em Ciências Genômicas e Biotecnologia, Universidade Católica de Brasília, Brasília, Brazil
- Karen G N Oshiro: S-Inova Biotech, Programa de Pós-Graduação em Biotecnologia, Universidade Católica Dom Bosco, Campo Grande, Brazil; Programa de Pós-Graduação em Patologia Molecular, Faculdade de Medicina, Universidade de Brasília, Brasília, Brazil
- Elizabete S Cândido: S-Inova Biotech, Programa de Pós-Graduação em Biotecnologia, Universidade Católica Dom Bosco, Campo Grande, Brazil; Centro de Análises Proteômicas e Bioquímicas, Pós-Graduação em Ciências Genômicas e Biotecnologia, Universidade Católica de Brasília, Brasília, Brazil
- Octávio L Franco: S-Inova Biotech, Programa de Pós-Graduação em Biotecnologia, Universidade Católica Dom Bosco, Campo Grande, Brazil; Centro de Análises Proteômicas e Bioquímicas, Pós-Graduação em Ciências Genômicas e Biotecnologia, Universidade Católica de Brasília, Brasília, Brazil; Instituto de Ciências Biológicas, Departamento de Biologia, Programa de Pós-Graduação em Ciências Biológicas (Imunologia/Genética e Biotecnologia), Universidade Federal de Juiz de Fora, Juiz de Fora, Brazil; Programa de Pós-Graduação em Patologia Molecular, Faculdade de Medicina, Universidade de Brasília, Brasília, Brazil

8
Lee M, Kang YS, Seok J. The estimation of probability distribution for factor variables with many categorical values. PLoS One 2018; 13:e0202547. [PMID: 30142178; PMCID: PMC6108477; DOI: 10.1371/journal.pone.0202547] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 09/05/2017] [Accepted: 08/06/2018] [Indexed: 11/18/2022] [Open Access]
Abstract
With recent developments of data technology in biomedicine, factor data such as diagnosis codes and genomic features, which can have tens to hundreds of discrete and unorderable categorical values, have emerged. Although it is considered a fundamental problem in statistical analysis, the estimation of probability distributions for such factor variables has not been studied much, because previous studies have mainly focused on continuous variables and on discrete factor variables with a few categories, such as sex and race. In this work, we propose a nonparametric Bayesian procedure to estimate the probability distribution of factors with many categories. The proposed method was demonstrated through simulation studies under various conditions and showed significant improvements in estimation error over conventional methods. In addition, the method was applied to the analysis of diagnosis data of intensive care unit patients and generated interesting medical hypotheses. The overall results indicate that the proposed method will be useful in the analysis of biomedical factor data.
Affiliation(s)
- Minhyeok Lee: School of Electrical Engineering, Korea University, Seongbuk-gu, Seoul, South Korea
- Yeong Seon Kang: Department of Business Administration, University of Seoul, Dongdaemun-gu, Seoul, South Korea
- Junhee Seok: School of Electrical Engineering, Korea University, Seongbuk-gu, Seoul, South Korea

9
Konopka BM, Marciniak M, Dyrka W. Quantiprot - a Python package for quantitative analysis of protein sequences. BMC Bioinformatics 2017; 18:339. [PMID: 28716000; PMCID: PMC5512976; DOI: 10.1186/s12859-017-1751-4] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 03/29/2017] [Accepted: 07/05/2017] [Indexed: 11/17/2022] [Open Access]
Abstract
Background: The field of protein sequence analysis is dominated by tools rooted in substitution matrices and alignments. A complementary approach is provided by methods of quantitative characterization. A major advantage of the approach is that quantitative properties define a multidimensional solution space, where sequences can be related to each other and differences can be meaningfully interpreted. Results: Quantiprot is a software package in Python, which provides a simple and consistent interface to multiple methods for quantitative characterization of protein sequences. The package can be used to calculate dozens of characteristics directly from sequences or using physico-chemical properties of amino acids. Besides basic measures, Quantiprot performs quantitative analysis of recurrence and determinism in the sequence, calculates the distribution of n-grams, and computes the Zipf’s law coefficient. Conclusions: We propose three main fields of application of the Quantiprot package. First, quantitative characteristics can be used in alignment-free similarity searches, and in clustering of large and/or divergent sequence sets. Second, a feature space defined by quantitative properties can be used in comparative studies of protein families and organisms. Third, the feature space can be used for evaluating generative models, where a large number of sequences generated by the model can be compared to actually observed sequences.
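To make the n-gram distribution and Zipf's law coefficient mentioned above concrete, here is a minimal plain-Python sketch. It deliberately does not use the Quantiprot API; the helper names `ngram_counts` and `zipf_slope` are hypothetical, and the least-squares fit of log frequency against log rank is one common way to estimate the coefficient, assumed here rather than taken from the package.

```python
import math
from collections import Counter


def ngram_counts(seq: str, n: int) -> Counter:
    """Count overlapping n-grams in a sequence."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def zipf_slope(counts: Counter) -> float:
    """Least-squares slope of log(frequency) versus log(rank).

    A slope near -1 corresponds to the classic Zipf distribution.
    """
    freqs = sorted(counts.values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct n-grams to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy counts whose frequencies are exactly proportional to 1/rank.
toy_counts = Counter({"AA": 6, "AB": 3, "BA": 2})
print(zipf_slope(toy_counts))
```

For real proteins, the counts would come from `ngram_counts` applied to each sequence or to a pooled sequence set, and the fitted slope is sensitive to how the low-frequency tail is handled.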
Affiliation(s)
- Bogumił M Konopka: Katedra Inżynierii Biomedycznej, Wydział Podstawowych Problemów Techniki, Politechnika Wrocławska, Wybrzeże Wyspiańskiego 27, Wroclaw, 50-370, Poland
- Marta Marciniak: Katedra Inżynierii Biomedycznej, Wydział Podstawowych Problemów Techniki, Politechnika Wrocławska, Wybrzeże Wyspiańskiego 27, Wroclaw, 50-370, Poland
- Witold Dyrka: Katedra Inżynierii Biomedycznej, Wydział Podstawowych Problemów Techniki, Politechnika Wrocławska, Wybrzeże Wyspiańskiego 27, Wroclaw, 50-370, Poland

10
Abstract
Self/non-self-discrimination by vertebrate immune systems is based on the recognition of the presence of peptides in proteins of a parasite that are not contained in the proteins of a host. Therefore, a reduction of the number of 'words' in its own peptide vocabulary could be an efficient evolutionary strategy of parasites for escaping recognition. Here, we compared peptide vocabularies of 30 endoparasitic and 17 free-living unicellular organisms and also eight multicellular parasitic and 16 multicellular free-living organisms. We found that both unicellular and multicellular parasites used a significantly lower number of different pentapeptides than free-living controls. Impoverished pentapeptide vocabularies in parasites were observed across all five clades that contain both parasitic and free-living species. The effect of parasitism on the number of peptides used in an organism's proteins is larger than the effects of all other studied factors, including the size of the proteome, the number of encoded proteins, etc. This decrease of pentapeptide diversity was partly compensated for by an increased number of hexapeptides. Our results support the hypothesis of parasitism-associated reduction of peptide vocabulary and suggest that T-cell receptors mostly recognize the five-amino-acid-long part of peptides that are presented in the groove of major histocompatibility complex molecules.
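The "peptide vocabulary" compared in this abstract is, in essence, the set of distinct fixed-length peptides found across an organism's proteins. The sketch below shows one way such a vocabulary might be counted; the function name `peptide_vocabulary` and the toy sequences are hypothetical, and the study's own normalization for proteome size is not reproduced here.

```python
def peptide_vocabulary(proteins: list, k: int = 5) -> set:
    """Collect every distinct length-k peptide occurring in a list of sequences."""
    vocab = set()
    for seq in proteins:
        vocab.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    return vocab


# Toy proteomes only: made-up sequences, not real organisms.
toy_parasite = ["AAAAAA", "AAAAAC"]
toy_free_living = ["ACDEFGH", "HIKLMNP"]

parasite_vocab = peptide_vocabulary(toy_parasite)
free_vocab = peptide_vocabulary(toy_free_living)
print(len(parasite_vocab), len(free_vocab))
```

Raw vocabulary sizes grow with proteome size, so a fair comparison between organisms needs some control for the total number of residues, which this sketch omits.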

11
Asgari E, Mofrad MRK. Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics. PLoS One 2015; 10:e0141287. [PMID: 26555596; PMCID: PMC4640716; DOI: 10.1371/journal.pone.0141287] [Citation(s) in RCA: 349] [Impact Index Per Article: 38.8] [Received: 07/09/2015] [Accepted: 10/05/2015] [Indexed: 12/22/2022] [Open Access]
Abstract
We introduce a new representation and feature extraction method for biological sequences. Named bio-vectors (BioVec) to refer to biological sequences in general with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences, this representation can be widely used in applications of deep learning in proteomics and genomics. In the present paper, we focus on protein-vectors that can be utilized in a wide array of bioinformatics investigations such as family classification, protein visualization, structure prediction, disordered protein identification, and protein-protein interaction prediction. In this method, we adopt artificial neural network approaches and represent a protein sequence with a single dense n-dimensional vector. To evaluate this method, we apply it in classification of 324,018 protein sequences obtained from Swiss-Prot belonging to 7,027 protein families, where an average family classification accuracy of 93%±0.06% is obtained, outperforming existing family classification methods. In addition, we use ProtVec representation to predict disordered proteins from structured proteins. Two databases of disordered sequences are used: the DisProt database as well as a database featuring the disordered regions of nucleoporins rich with phenylalanine-glycine repeats (FG-Nups). Using support vector machine classifiers, FG-Nup sequences are distinguished from structured protein sequences found in Protein Data Bank (PDB) with a 99.8% accuracy, and unstructured DisProt sequences are differentiated from structured DisProt sequences with 100.0% accuracy. These results indicate that by only providing sequence data for various proteins into this model, accurate information about protein structure can be determined. Importantly, this model needs to be trained only once and can then be applied to extract a comprehensive set of information regarding proteins of interest. 
Moreover, this representation can be considered as pre-training for various applications of deep learning in bioinformatics. The related data is available at Life Language Processing Website: http://llp.berkeley.edu and Harvard Dataverse: http://dx.doi.org/10.7910/DVN/JMFHTN.
Affiliation(s)
- Ehsaneddin Asgari: Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, California 94720, United States of America
- Mohammad R. K. Mofrad: Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, California 94720, United States of America; Physical Biosciences Division, Lawrence Berkeley National Lab, Berkeley, California 94720, United States of America

12
Hatton L, Warr G. Protein structure and evolution: are they constrained globally by a principle derived from information theory? PLoS One 2015; 10:e0125663. [PMID: 25970335; PMCID: PMC4429977; DOI: 10.1371/journal.pone.0125663] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 12/08/2014] [Accepted: 02/19/2015] [Indexed: 01/01/2023] [Open Access]
Abstract
That the physicochemical properties of amino acids constrain the structure, function and evolution of proteins is not in doubt. However, principles derived from information theory may also set bounds on the structure (and thus also the evolution) of proteins. Here we analyze the global properties of the full set of proteins in release 13-11 of the SwissProt database, showing by experimental test of predictions from information theory that their collective structure exhibits properties that are consistent with their being guided by a conservation principle. This principle (Conservation of Information) defines the global properties of systems composed of discrete components each of which is in turn assembled from discrete smaller pieces. In the system of proteins, each protein is a component, and each protein is assembled from amino acids. Central to this principle is the inter-relationship of the unique amino acid count and total length of a protein and its implications for both average protein length and occurrence of proteins with specific unique amino acid counts. The unique amino acid count is simply the number of distinct amino acids (including those that are post-translationally modified) that occur in a protein, and is independent of the number of times that the particular amino acid occurs in the sequence. Conservation of Information does not operate at the local level (it is independent of the physicochemical properties of the amino acids) where the influences of natural selection are manifest in the variety of protein structure and function that is well understood. Rather, this analysis implies that Conservation of Information would define the global bounds within which the whole system of proteins is constrained; thus it appears to be acting to constrain evolution at a level different from natural selection, a conclusion that appears counter-intuitive but is supported by the studies described herein.
Affiliation(s)
- Leslie Hatton: Faculty of Science, Engineering and Computing, Kingston University, London, UK
- Gregory Warr: Medical University of South Carolina, Charleston, South Carolina, USA

13
Motomura K, Nakamura M, Otaki JM. A frequency-based linguistic approach to protein decoding and design: Simple concepts, diverse applications, and the SCS Package. Comput Struct Biotechnol J 2013; 5:e201302010. [PMID: 24688703; PMCID: PMC3962227; DOI: 10.5936/csbj.201302010] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 10/24/2012] [Revised: 02/07/2013] [Accepted: 02/08/2013] [Indexed: 11/23/2022] [Open Access]
Abstract
Protein structure and function information is coded in amino acid sequences. However, the relationship between primary sequences and three-dimensional structures and functions remains enigmatic. Our approach to this fundamental biochemistry problem is based on the frequencies of short constituent sequences (SCSs) or words. A protein amino acid sequence is considered analogous to an English sentence, where SCSs are equivalent to words. Availability scores, which are defined as real SCS frequencies in the non-redundant amino acid database relative to their probabilistically expected frequencies, demonstrate the biological usage bias of SCSs. As a result, this frequency-based linguistic approach is expected to have diverse applications, such as secondary structure specifications by structure-specific SCSs and immunological adjuvants with rare or non-existent SCSs. Linguistic similarities (e.g., wide ranges of scale-free distributions) and dissimilarities (e.g., behaviors of low-rank samples) between proteins and the natural English language have been revealed in the rank-frequency relationships of SCSs or words. We have developed a web server, the SCS Package, which contains five applications for analyzing protein sequences based on the linguistic concept. These tools have the potential to assist researchers in deciphering structurally and functionally important protein sites, species-specific sequences, and functional relationships between SCSs. The SCS Package also provides researchers with a tool to construct amino acid sequences de novo based on the idiomatic usage of SCSs.
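The availability score described above compares how often an SCS actually occurs with how often it would be expected by chance. The sketch below illustrates that idea only; it assumes a simple observed/expected ratio and an independent-residue background model, both of which are assumptions made for illustration. The SCS Package's exact formula and its use of the nr database are not reproduced, and the function name `availability` is hypothetical.

```python
from collections import Counter
from math import prod


def availability(scs: str, sequences: list) -> float:
    """Observed count of `scs` divided by its expected count.

    Assumption: the expected count treats residues as independent, using
    single-residue frequencies measured on the same sequences.
    """
    k = len(scs)
    residue_counts = Counter("".join(sequences))
    total_residues = sum(residue_counts.values())
    # Number of windows of length k across all sequences.
    positions = sum(max(len(s) - k + 1, 0) for s in sequences)
    observed = sum(
        1
        for s in sequences
        for i in range(len(s) - k + 1)
        if s[i:i + k] == scs
    )
    expected = positions * prod(residue_counts[a] / total_residues for a in scs)
    if expected == 0:
        return float("nan")  # SCS contains a residue absent from the data
    return observed / expected


# Toy database: a single made-up sequence.
score = availability("AB", ["ABAB"])
print(score)
```

Under this ratio form, values above 1 would indicate an SCS used more often than chance predicts and values below 1 an SCS that is avoided, which matches the "usage bias" reading in the abstract.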
Affiliation(s)
- Kenta Motomura: The BCPH Unit of Molecular Physiology, Department of Chemistry, Biology and Marine Science, University of the Ryukyus, Senbaru, Nishihara, Okinawa 903-0213, Japan; Department of Information Science, University of the Ryukyus, Senbaru, Nishihara, Okinawa 903-0213, Japan
- Morikazu Nakamura: Department of Information Science, University of the Ryukyus, Senbaru, Nishihara, Okinawa 903-0213, Japan
- Joji M Otaki: The BCPH Unit of Molecular Physiology, Department of Chemistry, Biology and Marine Science, University of the Ryukyus, Senbaru, Nishihara, Okinawa 903-0213, Japan