Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Daeyaert F, Moereels H, Lewi PJ. Classification and identification of proteins by means of common and specific amino acid n-tuples in unaligned sequences. Comput Methods Programs Biomed 1998;56:221-233. [PMID: 9725648 DOI: 10.1016/s0169-2607(98)00031-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.2] [Indexed: 05/22/2023]

Cited by Other Article(s)

Mizuno Y, Nakasone W, Nakamura M, Otaki JM. In Silico and In Vitro Evaluation of the Molecular Mimicry of the SARS-CoV-2 Spike Protein by Common Short Constituent Sequences (cSCSs) in the Human Proteome: Toward Safer Epitope Design for Vaccine Development. Vaccines (Basel) 2024;12:539. [PMID: 38793790 PMCID: PMC11125730 DOI: 10.3390/vaccines12050539] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/30/2024] [Revised: 05/12/2024] [Accepted: 05/12/2024] [Indexed: 05/26/2024]

Endo S, Motomura K, Tsuhako M, Kakazu Y, Nakamura M, Otaki JM. Search for Human-Specific Proteins Based on Availability Scores of Short Constituent Sequences: Identification of a WRWSH Protein in Human Testis. Comput Biol Chem 2020. [DOI: 10.5772/intechopen.89653] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/08/2022]

Osmanbeyoglu HU, Ganapathiraju MK. N-gram analysis of 970 microbial organisms reveals presence of biological language models. BMC Bioinformatics 2011;12:12. [PMID: 21219653 PMCID: PMC3027111 DOI: 10.1186/1471-2105-12-12] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Received: 05/18/2010] [Accepted: 01/10/2011] [Indexed: 11/29/2022]

Abstract

Background

It has been suggested previously that genome and proteome sequences show characteristics typical of natural-language texts such as "signature-style" word usage indicative of authors or topics, and that the algorithms originally developed for natural language processing may therefore be applied to genome sequences to draw biologically relevant conclusions. Following this approach of 'biological language modeling', statistical n-gram analysis has been applied for comparative analysis of whole proteome sequences of 44 organisms. It has been shown that a few particular amino acid n-grams are found in abundance in one organism but occurring very rarely in other organisms, thereby serving as genome signatures. At that time proteomes of only 44 organisms were available, thereby limiting the generalization of this hypothesis. Today nearly 1,000 genome sequences and corresponding translated sequences are available, making it feasible to test the existence of biological language models over the evolutionary tree.

Results

We studied whole proteome sequences of 970 microbial organisms using n-gram frequencies and cross-perplexity, employing the Biological Language Modeling Toolkit and the Patternix Revelio toolkit. Genus-specific signatures were observed even in a simple unigram distribution. By taking the statistical n-gram model of one organism as a reference and computing the cross-perplexity of all other microbial proteomes against it, cross-perplexity was found to be predictive of branch distance in the phylogenetic tree. For example, a 4-gram model from the proteome of Shigella flexneri 2a, which belongs to the class Gammaproteobacteria, showed a self-perplexity of 15.34, while the cross-perplexity of other organisms was in the range of 15.59 to 29.5 and was proportional to their branching distance in the evolutionary tree from S. flexneri. The organisms of this genus, which happen to be pathotypes of E. coli, also have the closest perplexity values to E. coli.

Conclusion

Whole proteome sequences of microbial organisms have been shown to contain particular n-gram sequences in abundance in one organism but occurring very rarely in other organisms, thereby serving as proteome signatures. Further it has also been shown that perplexity, a statistical measure of similarity of n-gram composition, can be used to predict evolutionary distance within a genus in the phylogenetic tree.
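The cross-perplexity comparison described in this abstract can be illustrated with a minimal Python sketch. It is not the Biological Language Modeling Toolkit or Patternix Revelio; the function names, the add-one smoothing, and the toy sequences are assumptions chosen only to show the idea: train an n-gram model on a reference proteome, then score the reference itself (self-perplexity) and another proteome (cross-perplexity) against that model.

```python
import math
from collections import Counter

# The 20 standard amino acids; used only to size the smoothing term.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def train_ngram_model(sequences, n):
    """Count n-grams and their (n-1)-residue contexts over protein sequences."""
    ngram_counts = Counter()
    context_counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            gram = seq[i:i + n]
            ngram_counts[gram] += 1
            context_counts[gram[:-1]] += 1
    return ngram_counts, context_counts


def perplexity(model, sequences, n, alphabet_size=len(AMINO_ACIDS)):
    """Perplexity of `sequences` under `model`, with add-one smoothing (an assumption)."""
    ngram_counts, context_counts = model
    log_prob_sum = 0.0
    total = 0
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            gram = seq[i:i + n]
            # Smoothed conditional probability of the last residue given its context.
            p = (ngram_counts[gram] + 1) / (context_counts[gram[:-1]] + alphabet_size)
            log_prob_sum += math.log2(p)
            total += 1
    if total == 0:
        raise ValueError("no n-grams to score")
    return 2 ** (-log_prob_sum / total)


# Toy data (made up for illustration, not real proteomes).
reference = ["MKVLAAGIVGMKVLAAGIVG"]
other = ["WWWWCCCCHHHHWWWWCCCC"]

model = train_ngram_model(reference, n=2)
self_pp = perplexity(model, reference, n=2)
cross_pp = perplexity(model, other, n=2)
print(f"self-perplexity:  {self_pp:.2f}")
print(f"cross-perplexity: {cross_pp:.2f}")
```

A lower value means the scored sequences look more like the reference. In this sketch the unrelated toy sequence scores higher than the reference does against its own model, which is the pattern the abstract reports across increasing phylogenetic distance.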

Pavlović-Lažetić GM, Mitić NS, Beljanski MV. n-Gram characterization of genomic islands in bacterial genomes. Comput Methods Programs Biomed 2009;93:241-56. [PMID: 19101056 PMCID: PMC7185697 DOI: 10.1016/j.cmpb.2008.10.014] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 04/20/2008] [Revised: 09/10/2008] [Accepted: 10/21/2008] [Indexed: 05/27/2023]

Mitić NS, Pavlović-Lažetić GM, Beljanski MV. Could n-gram analysis contribute to genomic island determination? J Biomed Inform 2008;41:936-43. [DOI: 10.1016/j.jbi.2008.03.007] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 09/17/2007] [Revised: 03/13/2008] [Accepted: 03/13/2008] [Indexed: 11/28/2022]

Otaki JM, Gotoh T, Yamamoto H. Potential implications of availability of short amino acid sequences in proteins: an old and new approach to protein decoding and design. Biotechnol Annu Rev 2008;14:109-41. [PMID: 18606361 DOI: 10.1016/s1387-2656(08)00004-5] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Indexed: 12/13/2022]

Abstract

The three-dimensional structure of a protein molecule is primarily determined by its amino acid sequence, and thus the elucidation of general rules embedded in amino acid sequences is of great importance in protein science and engineering. To extract valuable information from sequences, we propose an analytical method in which a protein sequence is considered to be constructed by serial superimpositions of short amino acid sequences of n amino acid sets, especially triplets (3-aa sets). Using the comprehensive nonredundant protein database, we first examined the "availability" of all possible combinatorial sets of 8,000 triplet species. The availability score was mathematically defined as an indicator of the relative "preference" or "avoidance" for a given short constituent sequence to be used in a protein chain. Availability scores of real proteins were clearly biased against those of randomly generated proteins. We found many triplet species that occurred in the database more than expected or less than expected. Such bias extended to longer sets, and we found that some species of pentats (5-aa sets) that occurred reasonably frequently in the randomly generated protein population did not occur at all in any real proteins known today. The availability score was dependent on species, potentially serving as a phylogenetic indicator. Furthermore, we suggest possibilities of various biotechnological applications of characteristic short sequences, such as human-specific and pathogen-specific short sequences obtained from availability analysis. The availability score was also dependent on secondary structures, potentially serving as a structural indicator. Availability analysis on triplets may be combined with a comprehensive data collection on the φ and ψ peptide-bond angles of the amino acid at the center of each triplet, i.e., a collection of Ramachandran plots for each triplet. These triplet characters, together with other physicochemical data, will provide us with basic information on the relationship between protein sequence and structure, by which structure prediction and engineering may be greatly facilitated. Availability analysis may also be useful in identifying word processing units in amino acid sequences based on an analogy to natural languages. Together with other approaches, availability analysis will elucidate general rules hidden in the primary sequences and eventually contribute to rebuilding the paradigm of protein science.
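The abstract states that the availability score is mathematically defined but does not give the formula, so the Python sketch below does not reproduce it. As a stand-in, it computes a simple observed/expected ratio for each triplet, where the expected count assumes residues occur independently at their overall frequencies. The function name and the toy input are hypothetical; the sketch only illustrates what "occurred more than expected or less than expected" can mean in practice.

```python
from collections import Counter


def triplet_observed_vs_expected(sequences):
    """Return {triplet: observed / expected} for every triplet seen in `sequences`.

    Expected counts assume independent residues (an assumption of this sketch,
    not the availability score defined by the cited authors).
    """
    residue_counts = Counter()
    triplet_counts = Counter()
    for seq in sequences:
        residue_counts.update(seq)
        for i in range(len(seq) - 2):
            triplet_counts[seq[i:i + 3]] += 1

    total_residues = sum(residue_counts.values())
    total_triplets = sum(triplet_counts.values())
    freq = {aa: count / total_residues for aa, count in residue_counts.items()}

    ratios = {}
    for triplet, observed in triplet_counts.items():
        expected = total_triplets * freq[triplet[0]] * freq[triplet[1]] * freq[triplet[2]]
        ratios[triplet] = observed / expected
    return ratios


# Toy input (made up for illustration).
ratios = triplet_observed_vs_expected(["AAAB"])
for triplet, ratio in sorted(ratios.items()):
    print(triplet, round(ratio, 3))
```

Ratios above 1 mark triplets used more often than the independence model predicts, and ratios below 1 mark triplets used less often. Triplets that never occur are absent from the result and would need separate handling in a real analysis.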

Radomski JP, Slonimski PP. Primary sequences of proteins from complete genomes display a singular periodicity: Alignment-free N-gram analysis. C R Biol 2007;330:33-48. [PMID: 17241946 DOI: 10.1016/j.crvi.2006.11.001] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Received: 10/19/2006] [Accepted: 11/07/2006] [Indexed: 11/25/2022]

Qian B, Soyer OS, Neubig RR, Goldstein RA. Depicting a protein's two faces: GPCR classification by phylogenetic tree-based HMMs. FEBS Lett 2003;554:95-9. [PMID: 14596921 DOI: 10.1016/s0014-5793(03)01112-8] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.0] [Indexed: 11/29/2022]