1. Moeckel C, Mareboina M, Konnaris MA, Chan CS, Mouratidis I, Montgomery A, Chantzi N, Pavlopoulos GA, Georgakopoulos-Soares I. A survey of k-mer methods and applications in bioinformatics. Comput Struct Biotechnol J 2024; 23:2289-2303. [PMID: 38840832 PMCID: PMC11152613 DOI: 10.1016/j.csbj.2024.05.025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2024] [Revised: 05/14/2024] [Accepted: 05/15/2024] [Indexed: 06/07/2024]
Abstract
The rapid progression of genomics and proteomics has been driven by the advent of advanced sequencing technologies, large, diverse, and readily available omics datasets, and the evolution of computational data processing capabilities. The vast amount of data generated by these advancements necessitates efficient algorithms to extract meaningful information. K-mers serve as a valuable tool when working with large sequencing datasets, offering several advantages in computational speed and memory efficiency and carrying the potential for intrinsic biological functionality. This review provides an overview of the methods, applications, and significance of k-mers in genomic and proteomic data analyses, as well as the utility of absent sequences, including nullomers and nullpeptides, in disease detection, vaccine development, therapeutics, and forensic science. Therefore, the review highlights the pivotal role of k-mers in addressing current genomic and proteomic problems and underscores their potential for future breakthroughs in research.
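As a rough illustration of the two ideas this abstract names, counting k-mers and listing absent sequences (nullomers), the following minimal Python sketch may help. The function names, the toy sequence, and the four-letter DNA alphabet are assumptions made for the example and are not taken from the review.

```python
from collections import Counter
from itertools import product


def count_kmers(sequence, k):
    """Count every overlapping k-mer in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def absent_kmers(sequence, k, alphabet="ACGT"):
    """Return, in sorted order, the k-mers over `alphabet` that never occur."""
    observed = count_kmers(sequence, k)
    candidates = ("".join(p) for p in product(alphabet, repeat=k))
    return sorted(word for word in candidates if word not in observed)


# Toy input (hypothetical, not real genomic data).
toy = "ACGTACGTTGCA"
counts = count_kmers(toy, 2)
missing = absent_kmers(toy, 2)
```

Enumerating all candidates is only practical for small k, because the candidate space grows as the alphabet size raised to the power k; dedicated k-mer counting tools use more compact data structures.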
Affiliation(s)
- Camille Moeckel
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Manvita Mareboina
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Maxwell A. Konnaris
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Candace S.Y. Chan
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
- Ioannis Mouratidis
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
  - Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
- Austin Montgomery
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Nikol Chantzi
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Ilias Georgakopoulos-Soares
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
  - Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
2. Mizuno Y, Nakasone W, Nakamura M, Otaki JM. In Silico and In Vitro Evaluation of the Molecular Mimicry of the SARS-CoV-2 Spike Protein by Common Short Constituent Sequences (cSCSs) in the Human Proteome: Toward Safer Epitope Design for Vaccine Development. Vaccines (Basel) 2024; 12:539. [PMID: 38793790 PMCID: PMC11125730 DOI: 10.3390/vaccines12050539] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/30/2024] [Revised: 05/12/2024] [Accepted: 05/12/2024] [Indexed: 05/26/2024]
Abstract
Spike protein sequences in SARS-CoV-2 have been employed for vaccine epitopes, but many short constituent sequences (SCSs) in the spike protein are present in the human proteome, suggesting that some anti-spike antibodies induced by infection or vaccination may be autoantibodies against human proteins. To evaluate this possibility of "molecular mimicry" in silico and in vitro, we exhaustively identified common SCSs (cSCSs) found both in spike and human proteins bioinformatically. The commonality of SCSs between the two systems seemed to be coincidental, and only some cSCSs were likely to be relevant to potential self-epitopes based on three-dimensional information. Among three antibodies raised against cSCS-containing spike peptides, only the antibody against EPLDVL showed high affinity for the spike protein and reacted with an EPLDVL-containing peptide from the human unc-80 homolog protein. Western blot analysis revealed that this antibody also reacted with several human proteins expressed mainly in the small intestine, ovary, and stomach. Taken together, these results showed that most cSCSs are likely incapable of inducing autoantibodies but that at least EPLDVL functions as a self-epitope, suggesting a serious possibility of infection-induced or vaccine-induced autoantibodies in humans. High-risk cSCSs, including EPLDVL, should be excluded from vaccine epitopes to prevent potential autoimmune disorders.
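The exhaustive search for common SCSs described above can be pictured as a set intersection over fixed-length windows. The sketch below is one possible reading, not the authors' pipeline: the helper names, the window length of 6, and the toy sequences are all hypothetical, with only the hexamer EPLDVL taken from the abstract.

```python
def scs_set(sequence, k):
    """Collect the distinct k-residue windows (SCSs) of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def common_scs(query, proteome, k):
    """Map each SCS of `query` to the names of proteome entries that contain it."""
    query_words = scs_set(query, k)
    hits = {}
    for name, sequence in proteome.items():
        for word in query_words & scs_set(sequence, k):
            hits.setdefault(word, []).append(name)
    return hits


# Made-up sequences for illustration only; they are not real protein records.
toy_query = "MKEPLDVLQ"
toy_proteome = {"toy_protein_1": "GGEPLDVLAA", "toy_protein_2": "AAAAAAAA"}
hits = common_scs(toy_query, toy_proteome, 6)
```

A shared window found this way only flags a candidate; as the abstract notes, structural and experimental evidence is still needed to decide whether it acts as a self-epitope.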
Affiliation(s)
- Yuya Mizuno
  - The BCPH Unit of Molecular Physiology, Department of Chemistry, Biology and Marine Science, Faculty of Science, University of the Ryukyus, Senbaru, Nishihara 903-0213, Okinawa, Japan
- Wataru Nakasone
  - Computer Science and Intelligent Systems Unit, Department of Engineering, Faculty of Engineering, University of the Ryukyus, Senbaru, Nishihara 903-0213, Okinawa, Japan
- Morikazu Nakamura
  - Computer Science and Intelligent Systems Unit, Department of Engineering, Faculty of Engineering, University of the Ryukyus, Senbaru, Nishihara 903-0213, Okinawa, Japan
- Joji M. Otaki
  - The BCPH Unit of Molecular Physiology, Department of Chemistry, Biology and Marine Science, Faculty of Science, University of the Ryukyus, Senbaru, Nishihara 903-0213, Okinawa, Japan
3. Endo S, Motomura K, Tsuhako M, Kakazu Y, Nakamura M, Otaki JM. Search for Human-Specific Proteins Based on Availability Scores of Short Constituent Sequences: Identification of a WRWSH Protein in Human Testis. Comput Biol Chem 2020. [DOI: 10.5772/intechopen.89653] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/08/2022]
Abstract
Little is known about protein sequences unique in humans. Here, we performed alignment-free sequence comparisons based on the availability (frequency bias) of short constituent amino acid (aa) sequences (SCSs) in proteins to search for human-specific proteins. Focusing on 5-aa SCSs (pentats), exhaustive comparisons of availability scores among the human proteome and nine other mammalian proteomes in the nonredundant (nr) database identified a candidate protein containing WRWSH, here called FAM75, as human-specific. Examination of various human genome sequences revealed that FAM75 had genomic DNA sequences for either WRWSH or WRWSR due to a single nucleotide polymorphism (SNP). FAM75 and its related protein FAM205A were found to be produced through alternative splicing. The FAM75 transcript was found only in humans, but the FAM205A transcript was also present in other mammals. In humans, both FAM75 and FAM205A were expressed specifically in testis at the mRNA level, and they were immunohistochemically located in cells in seminiferous ducts and in acrosomes in spermatids at the protein level, suggesting their possible function in sperm development and fertilization. This study highlights a practical application of SCS-based methods for protein searches and suggests possible contributions of SNP variants and alternative splicing of FAM75 to human evolution.
4. Enhancing the Immune Response of a Nicotine Vaccine with Synthetic Small "Non-Natural" Peptides. Molecules 2020; 25:1290. [PMID: 32178357 PMCID: PMC7143940 DOI: 10.3390/molecules25061290] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 02/13/2020] [Revised: 03/05/2020] [Accepted: 03/07/2020] [Indexed: 11/16/2022]
Abstract
The addictive nature of nicotine is likely the most significant reason for the continued prevalence of tobacco smoking despite the widespread reports of its negative health effects. Nicotine vaccines are an alternative to the currently available smoking cessation treatments, which have limited efficacy. However, the nicotine hapten is non-immunogenic, and successful vaccine formulations to treat nicotine addiction require both effective adjuvants and delivery systems. The immunomodulatory properties of short, non-natural peptide sequences not found in human systems and their ability to improve vaccine efficacy continue to be reported. The aim of this study was to determine if small “non-natural peptides,” as part of a conjugate nicotine vaccine, could improve immune responses. Four peptides were synthesized via solid phase methodology, purified, and characterized. Ex vivo plasma stability studies using RP-HPLC confirmed that the peptides were not subject to proteolytic degradation. The peptides were formulated into conjugate nicotine vaccine candidates along with a bacterial derived adjuvant vaccine delivery system and chitosan as a stabilizing compound. Formulations were tested in vitro in a dendritic cell line to determine the combination that would elicit the greatest IL-1β response using ELISAs. Three of the peptides were able to enhance the cytokine response above that induced by the adjuvant delivery system alone. In vivo vaccination studies in BALB/c mice demonstrated that the best immune response, as measured by nicotine-specific antibody levels, was elicited from the conjugate vaccine structure, which included the peptide, as well as the other components. Isotype analyses highlighted that the peptide was able to shift immune response toward being more humorally dominant. Overall, the results have implications for the use of non-natural peptides as adjuvants not only for the development of a nicotine vaccine but also for use with other addictive substances and conventional vaccination targets as well.
5. Global pentapeptide statistics are far away from expected distributions. Sci Rep 2018; 8:15178. [PMID: 30310110 PMCID: PMC6181984 DOI: 10.1038/s41598-018-33433-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 06/23/2017] [Accepted: 09/26/2018] [Indexed: 11/08/2022]
Abstract
The relationships between polypeptide composition, sequence, structure and function have been puzzling biologists ever since first protein sequences were determined. Here, we study the statistics of occurrence of all possible pentapeptide sequences in known proteins. To compensate for the non-uniform distribution of individual amino acid residues in protein sequences, we investigate separately all possible permutations of every given amino acid composition. For the majority of permutation groups we find that pentapeptide occurrences deviate strongly from the expected binomial distributions, and that the observed distributions are also characterized by high numbers of outlier sequences. An analysis of identified outliers shows they often contain known motifs and rare amino acids, suggesting that they represent important functional elements. We further compare the pentapeptide composition of regions known to correspond to protein domains with that of non-domain regions. We find that a substantial number of pentapeptides is clearly strongly favored in protein domains. Finally, we show that over-represented pentapeptides are significantly related to known functional motifs and to predicted ancient structural peptides.
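The idea of comparing pentapeptides only against permutations of the same amino acid composition can be sketched by keying each observed word on its sorted residues. This is a simplified illustration under assumptions: the function name and toy input are hypothetical, only observed words are grouped, and the comparison against an expected binomial distribution is not reproduced here.

```python
from collections import Counter, defaultdict


def permutation_groups(sequences, k=5):
    """Group observed k-mers by amino acid composition (the sorted residues)."""
    counts = Counter(
        s[i:i + k] for s in sequences for i in range(len(s) - k + 1)
    )
    groups = defaultdict(dict)
    for word, n in counts.items():
        groups["".join(sorted(word))][word] = n
    return dict(groups)


# Toy input with k=2 so the groups are easy to inspect by hand.
groups = permutation_groups(["ABCAB"], k=2)
```

Within a group every word has the same single-residue composition, so differences in counts cannot be attributed to the non-uniform abundance of individual amino acids.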
6. Neighbor effect and local conformation in protein structures. Amino Acids 2017; 49:1641-1646. [DOI: 10.1007/s00726-017-2463-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 01/25/2017] [Accepted: 07/06/2017] [Indexed: 11/26/2022]
7. Amino acid sequence repertoire of the bacterial proteome and the occurrence of untranslatable sequences. Proc Natl Acad Sci U S A 2016; 113:7166-70. [PMID: 27307442 DOI: 10.1073/pnas.1606518113] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Indexed: 11/18/2022]
Abstract
Bioinformatic analysis of Escherichia coli proteomes revealed that all possible amino acid triplet sequences occur at their expected frequencies, with four exceptions. Two of the four underrepresented sequences (URSs) were shown to interfere with translation in vivo and in vitro. Enlarging the URS by a single amino acid resulted in increased translational inhibition. Single-molecule methods revealed stalling of translation at the entrance of the peptide exit tunnel of the ribosome, adjacent to ribosomal nucleotides A2062 and U2585. Interaction with these same ribosomal residues is involved in regulation of translation by longer, naturally occurring protein sequences. The E. coli exit tunnel has evidently evolved to minimize interaction with the exit tunnel and maximize the sequence diversity of the proteome, although allowing some interactions for regulatory purposes. Bioinformatic analysis of the human proteome revealed no underrepresented triplet sequences, possibly reflecting an absence of regulation by interaction with the exit tunnel.
8. Motomura K, Nakamura M, Otaki JM. A frequency-based linguistic approach to protein decoding and design: Simple concepts, diverse applications, and the SCS Package. Comput Struct Biotechnol J 2013; 5:e201302010. [PMID: 24688703 PMCID: PMC3962227 DOI: 10.5936/csbj.201302010] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 10/24/2012] [Revised: 02/07/2013] [Accepted: 02/08/2013] [Indexed: 11/23/2022]
Abstract
Protein structure and function information is coded in amino acid sequences. However, the relationship between primary sequences and three-dimensional structures and functions remains enigmatic. Our approach to this fundamental biochemistry problem is based on the frequencies of short constituent sequences (SCSs) or words. A protein amino acid sequence is considered analogous to an English sentence, where SCSs are equivalent to words. Availability scores, which are defined as real SCS frequencies in the non-redundant amino acid database relative to their probabilistically expected frequencies, demonstrate the biological usage bias of SCSs. As a result, this frequency-based linguistic approach is expected to have diverse applications, such as secondary structure specifications by structure-specific SCSs and immunological adjuvants with rare or non-existent SCSs. Linguistic similarities (e.g., wide ranges of scale-free distributions) and dissimilarities (e.g., behaviors of low-rank samples) between proteins and the natural English language have been revealed in the rank-frequency relationships of SCSs or words. We have developed a web server, the SCS Package, which contains five applications for analyzing protein sequences based on the linguistic concept. These tools have the potential to assist researchers in deciphering structurally and functionally important protein sites, species-specific sequences, and functional relationships between SCSs. The SCS Package also provides researchers with a tool to construct amino acid sequences de novo based on the idiomatic usage of SCSs.
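The availability score defined above, an observed SCS frequency relative to its probabilistically expected frequency, can be sketched as follows. The abstract does not spell out how the expected frequency is computed, so this example assumes it is the product of the single-residue frequencies in the same sequence set (an independence model); the function name and toy input are likewise hypothetical, and this is not the SCS Package implementation.

```python
from collections import Counter
from math import prod


def availability(sequences, k):
    """Observed/expected frequency ratio for each k-residue SCS that occurs."""
    residues = Counter("".join(sequences))
    total_residues = sum(residues.values())
    freq = {aa: n / total_residues for aa, n in residues.items()}

    words = Counter(
        s[i:i + k] for s in sequences for i in range(len(s) - k + 1)
    )
    total_words = sum(words.values())

    scores = {}
    for word, n in words.items():
        observed = n / total_words
        # Assumption: residues are independent, so the expected frequency
        # is the product of the single-residue frequencies.
        expected = prod(freq[aa] for aa in word)
        scores[word] = observed / expected
    return scores


scores = availability(["ABAB"], 2)
```

Under this model a score above 1 marks an SCS used more often than composition alone predicts and a score below 1 marks one used less often; SCSs that never occur are simply missing from the returned dictionary.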
Affiliation(s)
- Kenta Motomura
  - The BCPH Unit of Molecular Physiology, Department of Chemistry, Biology and Marine Science, University of the Ryukyus, Senbaru, Nishihara, Okinawa 903-0213, Japan
  - Department of Information Science, University of the Ryukyus, Senbaru, Nishihara, Okinawa 903-0213, Japan
- Morikazu Nakamura
  - Department of Information Science, University of the Ryukyus, Senbaru, Nishihara, Okinawa 903-0213, Japan
- Joji M Otaki
  - The BCPH Unit of Molecular Physiology, Department of Chemistry, Biology and Marine Science, University of the Ryukyus, Senbaru, Nishihara, Okinawa 903-0213, Japan
9. Motomura K, Fujita T, Tsutsumi M, Kikuzato S, Nakamura M, Otaki JM. Word decoding of protein amino acid sequences with availability analysis: a linguistic approach. PLoS One 2012; 7:e50039. [PMID: 23185527 PMCID: PMC3503725 DOI: 10.1371/journal.pone.0050039] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Received: 07/02/2012] [Accepted: 10/15/2012] [Indexed: 11/19/2022]
Abstract
The amino acid sequences of proteins determine their three-dimensional structures and functions. However, how sequence information is related to structures and functions is still enigmatic. In this study, we show that at least a part of the sequence information can be extracted by treating amino acid sequences of proteins as a collection of English words, based on a working hypothesis that amino acid sequences of proteins are composed of short constituent amino acid sequences (SCSs) or "words". We first confirmed that the English language highly likely follows Zipf's law, a special case of power law. We found that the rank-frequency plot of SCSs in proteins exhibits a similar distribution when low-rank tails are excluded. In comparison with natural English and "compressed" English without spaces between words, amino acid sequences of proteins show larger linear ranges and smaller exponents with heavier low-rank tails, demonstrating that the SCS distribution in proteins is largely scale-free. A distribution pattern of SCSs in proteins is similar among species, but species-specific features are also present. Based on the availability scores of SCSs, we found that sequence motifs are enriched in high-availability sites (i.e., "key words") and vice versa. In fact, the highest availability peak within a given protein sequence often directly corresponds to a sequence motif. The amino acid composition of high-availability sites within motifs is different from that of entire motifs and all protein sequences, suggesting the possible functional importance of specific SCSs and their compositional amino acids within motifs. We anticipate that our availability-based word decoding approach is complementary to sequence alignment approaches in predicting functionally important sites of unknown proteins from their amino acid sequences.
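A rank-frequency relationship of the kind discussed above can be examined by sorting SCS counts in descending order and fitting a straight line to log frequency against log rank; under Zipf's law the slope is close to -1. The sketch below uses an ordinary least-squares fit over all ranks, which is an assumption made for illustration: the abstract does not state the fitting procedure, and the function names are hypothetical.

```python
from collections import Counter
from math import log


def rank_frequency(sequences, k):
    """Return SCS counts sorted from most to least frequent (rank 1 first)."""
    words = Counter(
        s[i:i + k] for s in sequences for i in range(len(s) - k + 1)
    )
    return sorted(words.values(), reverse=True)


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [log(f) for f in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic counts that follow frequency = 120 / rank exactly.
slope = loglog_slope([120, 60, 40, 30])
```

The fit needs at least two distinct ranks. To mirror the exclusion of low-rank tails mentioned in the abstract, the frequency list could be truncated before fitting.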
Affiliation(s)
- Kenta Motomura
  - The BCPH Unit of Molecular Physiology, Department of Chemistry, Biology and Marine Science, University of the Ryukyus, Nishihara, Okinawa, Japan
  - Department of Information Science, University of the Ryukyus, Nishihara, Okinawa, Japan
- Tomohiro Fujita
  - The BCPH Unit of Molecular Physiology, Department of Chemistry, Biology and Marine Science, University of the Ryukyus, Nishihara, Okinawa, Japan
- Motosuke Tsutsumi
  - The BCPH Unit of Molecular Physiology, Department of Chemistry, Biology and Marine Science, University of the Ryukyus, Nishihara, Okinawa, Japan
- Satsuki Kikuzato
  - The BCPH Unit of Molecular Physiology, Department of Chemistry, Biology and Marine Science, University of the Ryukyus, Nishihara, Okinawa, Japan
- Morikazu Nakamura
  - Department of Information Science, University of the Ryukyus, Nishihara, Okinawa, Japan
- Joji M. Otaki
  - The BCPH Unit of Molecular Physiology, Department of Chemistry, Biology and Marine Science, University of the Ryukyus, Nishihara, Okinawa, Japan
10. Pentamers not found in the universal proteome can enhance antigen specific immune responses and adjuvant vaccines. PLoS One 2012; 7:e43802. [PMID: 22937099 PMCID: PMC3427150 DOI: 10.1371/journal.pone.0043802] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.9] [Received: 04/16/2012] [Accepted: 07/26/2012] [Indexed: 12/22/2022]
Abstract
Certain short peptides do not occur in humans and are rare or non-existent in the universal proteome. Antigens that contain rare amino acid sequences are in general highly immunogenic and may activate different arms of the immune system. We first generated a list of rare, semi-common, and common 5-mer peptides using bioinformatics tools to analyze the UniProtKB database. Experimental observations indicated that rare and semi-common 5-mers generated stronger cellular responses in comparison with common-occurring sequences. We hypothesized that the biological process responsible for this enhanced immunogenicity could be used to positively modulate immune responses with potential application for vaccine development. Initially, twelve rare 5-mers, 9-mers, and 13-mers were incorporated in frame at the end of an H5N1 hemagglutinin (HA) antigen and expressed from a DNA vaccine. The presence of some 5-mer peptides induced improved immune responses. Adding one 5-mer peptide exogenously also offered improved clinical outcome and/or survival against a lethal H5N1 or H1N1 influenza virus challenge in BALB/c mice and ferrets, respectively. Interestingly, enhanced anti-HBsAg antibody production by up to 25-fold in combination with a commercial Hepatitis B vaccine (Engerix-B, GSK) was also observed in BALB/c mice. Mechanistically, NK cell activation and dependency was observed with enhancing peptides ex vivo and in NK-depleted mice. Overall, the data suggest that rare or non-existent oligopeptides can be developed as immunomodulators and supports the further evaluation of some 5-mer peptides as potential vaccine adjuvants.
11. Tsutsumi M, Otaki JM. Parallel and antiparallel β-strands differ in amino acid composition and availability of short constituent sequences. J Chem Inf Model 2011; 51:1457-64. [PMID: 21520893 DOI: 10.1021/ci200027d] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Indexed: 11/28/2022]
Abstract
One of the important secondary structures in proteins is the β-strand. However, due to its complexity, it is less characterized than helical structures. Using the 1641 representative three-dimensional protein structure data from the Protein Data Bank, we characterized β-strand structures based on strand length and amino acid composition, focusing on differences between parallel and antiparallel β-strands. Antiparallel strands were more frequent and slightly longer than parallel strands. Overall, the majority of β-sheets were antiparallel sheets; however, mixed sheets were reasonably abundant, and parallel sheets were relatively rare. Notably, the nonpolar, aliphatic hydrocarbon amino acids, valine, isoleucine, and leucine were observed at a high frequency in both strands but were more abundant in parallel than in antiparallel strands. The relative amino acid occurrence in β-sheets, especially in parallel strands, was highly correlated with amino acid hydrophobicity. This correlation was not observed in α-helices and 3(10)-helices. In addition, we examined the frequency of 400 amino acid doublets and 8000 amino acid triplets in β-strands based on availability, a measurement of the relative counts of the doublets and triplets. We identified some triplets that were specifically found in either parallel or antiparallel strands. We further identified "zero-count triplets" which did not occur in either parallel or antiparallel strands, despite the fact that they were probabilistically supposed to occur several times. Taken together, the present study revealed essential features of β-strand structures and the differences between parallel and antiparallel β-strands, which can potentially be applied to the secondary structure prediction and the functional design of protein sequences in the future.
Affiliation(s)
- Motosuke Tsutsumi
  - The BCPH Unit of Molecular Physiology, Department of Chemistry, Biology and Marine Science, University of the Ryukyus, Nishihara, Okinawa, Japan
12. Ung P, Winkler DA. Tripeptide Motifs in Biology: Targets for Peptidomimetic Design. J Med Chem 2011; 54:1111-25. [DOI: 10.1021/jm1012984] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.8] [Indexed: 11/29/2022]
Affiliation(s)
- Phuc Ung
  - CSIRO Materials Science and Engineering, Bag 10, Clayton South MDC 3169, Australia
  - Monash Institute of Pharmaceutical Science, Parkville 3152, Australia
- David A. Winkler
  - CSIRO Materials Science and Engineering, Bag 10, Clayton South MDC 3169, Australia
  - Monash Institute of Pharmaceutical Science, Parkville 3152, Australia
13. Otaki JM, Tsutsumi M, Gotoh T, Yamamoto H. Secondary structure characterization based on amino acid composition and availability in proteins. J Chem Inf Model 2010; 50:690-700. [PMID: 20210310 DOI: 10.1021/ci900452z] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.8] [Indexed: 11/30/2022]
Abstract
The importance of thorough analyses of the secondary structures in proteins as basic structural units cannot be overemphasized. Although recent computational methods have achieved reasonably high accuracy for predicting secondary structures from amino acid sequences, a simple and fundamental empirical approach to characterize the amino acid composition of secondary structures was performed mainly in the 1970s, with a small number of analyzed structures. To extend this classical approach using a large number of analyzed structures, here we characterized the amino acid sequences of secondary structures (12 154 alpha-helix units, 4592 3(10)-helix units, 16 787 beta-strand units, and 30 811 "other" units), using the representative three-dimensional protein structure records (1641 protein chains) from the Protein Data Bank. We first examined the length and the amino acid compositions of secondary structures, including rank order differences and assignment relationships among amino acids. These compositional results were largely, but not entirely, consistent with the previous studies. In addition, we examined the frequency of 400 amino acid doublets and 8000 triplets in secondary structures based on their relative counts, termed the availability. We identified not only some triplets that were specific to a certain secondary structure but also so-called zero-count triplets, which did not occur in a given secondary structure at all, even though they were probabilistically predicted to occur several times. Taken together, the present study revealed essential features of secondary structures and suggests potential applications in the secondary structure prediction and the functional design of protein sequences.
Affiliation(s)
- Joji M Otaki
  - The BCPH Unit of Molecular Physiology, Department of Chemistry, Biology, and Marine Science, University of the Ryukyus, Nishihara, Okinawa 903-0213, Japan.