Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Motomura K, Fujita T, Tsutsumi M, Kikuzato S, Nakamura M, Otaki JM. Word decoding of protein amino Acid sequences with availability analysis: a linguistic approach. PLoS One 2012;7:e50039. [PMID: 23185527 PMCID: PMC3503725 DOI: 10.1371/journal.pone.0050039] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/02/2012] [Accepted: 10/15/2012] [Indexed: 11/19/2022] Open

For:	Motomura K, Fujita T, Tsutsumi M, Kikuzato S, Nakamura M, Otaki JM. Word decoding of protein amino Acid sequences with availability analysis: a linguistic approach. PLoS One 2012;7:e50039. [PMID: 23185527 PMCID: PMC3503725 DOI: 10.1371/journal.pone.0050039] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/02/2012] [Accepted: 10/15/2012] [Indexed: 11/19/2022] Open

Number

Cited by Other Article(s)

Mizuno Y, Nakasone W, Nakamura M, Otaki JM. In Silico and In Vitro Evaluation of the Molecular Mimicry of the SARS-CoV-2 Spike Protein by Common Short Constituent Sequences (cSCSs) in the Human Proteome: Toward Safer Epitope Design for Vaccine Development. Vaccines (Basel) 2024;12:539. [PMID: 38793790 PMCID: PMC11125730 DOI: 10.3390/vaccines12050539] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2024] [Revised: 05/12/2024] [Accepted: 05/12/2024] [Indexed: 05/26/2024] Open

Algorithmically-guided discovery of viral epitopes via linguistic parsing: Problem formulation and solving by soft computing. Appl Soft Comput 2022. [DOI: 10.1016/j.asoc.2022.109509] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]

Jiang L, Wang D, Xu D. A Pretrained ELECTRA Model for Kinase-Specific Phosphorylation Site Prediction. Methods Mol Biol 2022;2499:105-124. [PMID: 35696076 DOI: 10.1007/978-1-0716-2317-6_4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/15/2023]

Caetano-Anollés G. The Compressed Vocabulary of Microbial Life. Front Microbiol 2021;12:655990. [PMID: 34305827 PMCID: PMC8292947 DOI: 10.3389/fmicb.2021.655990] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/19/2021] [Accepted: 04/27/2021] [Indexed: 12/22/2022] Open

Abstract

Communication is an undisputed central activity of life that requires an evolving molecular language. It conveys meaning through messages and vocabularies. Here, I explore the existence of a growing vocabulary in the molecules and molecular functions of the microbial world. There are clear correspondences between the lexicon, syntax, semantics, and pragmatics of language organization and the module, structure, function, and fitness paradigms of molecular biology. These correspondences are constrained by universal laws and engineering principles. Macromolecular structure, for example, follows quantitative linguistic patterns arising from statistical laws that are likely universal, including the Zipf's law, a special case of the scale-free distribution, the Heaps' law describing sublinear growth typical of economies of scales, and the Menzerath-Altmann's law, which imposes size-dependent patterns of decreasing returns. Trade-off solutions between principles of economy, flexibility, and robustness define a "triangle of persistence" describing the impact of the environment on a biological system. The pragmatic landscape of the triangle interfaces with the syntax and semantics of molecular languages, which together with comparative and evolutionary genomic data can explain global patterns of diversification of cellular life. The vocabularies of proteins (proteomes) and functions (functionomes) revealed a significant universal lexical core supporting a universal common ancestor, an ancestral evolutionary link between Bacteria and Eukarya, and distinct reductive evolutionary strategies of language compression in Archaea and Bacteria. A "causal" word cloud strategy inspired by the dependency grammar paradigm used in catenae unfolded the evolution of lexical units associated with Gene Ontology terms at different levels of ontological abstraction. While Archaea holds the smallest, oldest, and most homogeneous vocabulary of all superkingdoms, Bacteria heterogeneously apportions a more complex vocabulary, and Eukarya pushes functional innovation through mechanisms of flexibility and robustness.

Collapse

Endo S, Motomura K, Tsuhako M, Kakazu Y, Nakamura M, M. Otaki J. Search for Human-Specific Proteins Based on Availability Scores of Short Constituent Sequences: Identification of a WRWSH Protein in Human Testis. Comput Biol Chem 2020. [DOI: 10.5772/intechopen.89653] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]

Öztürk H, Özgür A, Schwaller P, Laino T, Ozkirimli E. Exploring chemical space using natural language processing methodologies for drug discovery. Drug Discov Today 2020;25:689-705. [DOI: 10.1016/j.drudis.2020.01.020] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/17/2019] [Revised: 12/20/2019] [Accepted: 01/28/2020] [Indexed: 01/06/2023]

Cardoso MH, Orozco RQ, Rezende SB, Rodrigues G, Oshiro KGN, Cândido ES, Franco OL. Computer-Aided Design of Antimicrobial Peptides: Are We Generating Effective Drug Candidates? Front Microbiol 2020;10:3097. [PMID: 32038544 PMCID: PMC6987251 DOI: 10.3389/fmicb.2019.03097] [Citation(s) in RCA: 108] [Impact Index Per Article: 27.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/28/2019] [Accepted: 12/20/2019] [Indexed: 11/16/2022] Open

Affiliation(s)

Marlon H Cardoso S-Inova Biotech, Programa de Pós-Graduação em Biotecnologia, Universidade Católica Dom Bosco, Campo Grande, Brazil.,Centro de Análises Proteômicas e Bioquímicas, Pós-Graduação em Ciências Genômicas e Biotecnologia, Universidade Católica de Brasília, Brasília, Brazil
Raquel Q Orozco S-Inova Biotech, Programa de Pós-Graduação em Biotecnologia, Universidade Católica Dom Bosco, Campo Grande, Brazil.,Instituto de Ciências Biológicas, Departamento de Biologia, Programa de Pós-Graduação em Ciências Biológicas (Imunologia/Genética e Biotecnologia), Universidade Federal de Juiz de Fora, Juiz de Fora, Brazil
Samilla B Rezende S-Inova Biotech, Programa de Pós-Graduação em Biotecnologia, Universidade Católica Dom Bosco, Campo Grande, Brazil
Gisele Rodrigues Centro de Análises Proteômicas e Bioquímicas, Pós-Graduação em Ciências Genômicas e Biotecnologia, Universidade Católica de Brasília, Brasília, Brazil
Karen G N Oshiro S-Inova Biotech, Programa de Pós-Graduação em Biotecnologia, Universidade Católica Dom Bosco, Campo Grande, Brazil.,Programa de Pós-Graduação em Patologia Molecular, Faculdade de Medicina, Universidade de Brasília, Brasília, Brazil
Elizabete S Cândido S-Inova Biotech, Programa de Pós-Graduação em Biotecnologia, Universidade Católica Dom Bosco, Campo Grande, Brazil.,Centro de Análises Proteômicas e Bioquímicas, Pós-Graduação em Ciências Genômicas e Biotecnologia, Universidade Católica de Brasília, Brasília, Brazil
Octávio L Franco S-Inova Biotech, Programa de Pós-Graduação em Biotecnologia, Universidade Católica Dom Bosco, Campo Grande, Brazil.,Centro de Análises Proteômicas e Bioquímicas, Pós-Graduação em Ciências Genômicas e Biotecnologia, Universidade Católica de Brasília, Brasília, Brazil.,Instituto de Ciências Biológicas, Departamento de Biologia, Programa de Pós-Graduação em Ciências Biológicas (Imunologia/Genética e Biotecnologia), Universidade Federal de Juiz de Fora, Juiz de Fora, Brazil.,Programa de Pós-Graduação em Patologia Molecular, Faculdade de Medicina, Universidade de Brasília, Brasília, Brazil

Collapse

Lee M, Kang YS, Seok J. The estimation of probability distribution for factor variables with many categorical values. PLoS One 2018;13:e0202547. [PMID: 30142178 PMCID: PMC6108477 DOI: 10.1371/journal.pone.0202547] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/05/2017] [Accepted: 08/06/2018] [Indexed: 11/18/2022] Open

Konopka BM, Marciniak M, Dyrka W. Quantiprot - a Python package for quantitative analysis of protein sequences. BMC Bioinformatics 2017;18:339. [PMID: 28716000 PMCID: PMC5512976 DOI: 10.1186/s12859-017-1751-4] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/29/2017] [Accepted: 07/05/2017] [Indexed: 11/17/2022] Open

Parasitism as the main factor shaping peptide vocabularies in current organisms. Parasitology 2017;144:975-983. [PMID: 28241893 DOI: 10.1017/s0031182017000191] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]

Asgari E, Mofrad MRK. Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics. PLoS One 2015;10:e0141287. [PMID: 26555596 PMCID: PMC4640716 DOI: 10.1371/journal.pone.0141287] [Citation(s) in RCA: 349] [Impact Index Per Article: 38.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/09/2015] [Accepted: 10/05/2015] [Indexed: 12/22/2022] Open

Abstract

We introduce a new representation and feature extraction method for biological sequences. Named bio-vectors (BioVec) to refer to biological sequences in general with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences, this representation can be widely used in applications of deep learning in proteomics and genomics. In the present paper, we focus on protein-vectors that can be utilized in a wide array of bioinformatics investigations such as family classification, protein visualization, structure prediction, disordered protein identification, and protein-protein interaction prediction. In this method, we adopt artificial neural network approaches and represent a protein sequence with a single dense n-dimensional vector. To evaluate this method, we apply it in classification of 324,018 protein sequences obtained from Swiss-Prot belonging to 7,027 protein families, where an average family classification accuracy of 93%±0.06% is obtained, outperforming existing family classification methods. In addition, we use ProtVec representation to predict disordered proteins from structured proteins. Two databases of disordered sequences are used: the DisProt database as well as a database featuring the disordered regions of nucleoporins rich with phenylalanine-glycine repeats (FG-Nups). Using support vector machine classifiers, FG-Nup sequences are distinguished from structured protein sequences found in Protein Data Bank (PDB) with a 99.8% accuracy, and unstructured DisProt sequences are differentiated from structured DisProt sequences with 100.0% accuracy. These results indicate that by only providing sequence data for various proteins into this model, accurate information about protein structure can be determined. Importantly, this model needs to be trained only once and can then be applied to extract a comprehensive set of information regarding proteins of interest. Moreover, this representation can be considered as pre-training for various applications of deep learning in bioinformatics. The related data is available at Life Language Processing Website: http://llp.berkeley.edu and Harvard Dataverse: http://dx.doi.org/10.7910/DVN/JMFHTN.

Collapse

Hatton L, Warr G. Protein structure and evolution: are they constrained globally by a principle derived from information theory? PLoS One 2015;10:e0125663. [PMID: 25970335 PMCID: PMC4429977 DOI: 10.1371/journal.pone.0125663] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/08/2014] [Accepted: 02/19/2015] [Indexed: 01/01/2023] Open

Abstract

That the physicochemical properties of amino acids constrain the structure, function and evolution of proteins is not in doubt. However, principles derived from information theory may also set bounds on the structure (and thus also the evolution) of proteins. Here we analyze the global properties of the full set of proteins in release 13-11 of the SwissProt database, showing by experimental test of predictions from information theory that their collective structure exhibits properties that are consistent with their being guided by a conservation principle. This principle (Conservation of Information) defines the global properties of systems composed of discrete components each of which is in turn assembled from discrete smaller pieces. In the system of proteins, each protein is a component, and each protein is assembled from amino acids. Central to this principle is the inter-relationship of the unique amino acid count and total length of a protein and its implications for both average protein length and occurrence of proteins with specific unique amino acid counts. The unique amino acid count is simply the number of distinct amino acids (including those that are post-translationally modified) that occur in a protein, and is independent of the number of times that the particular amino acid occurs in the sequence. Conservation of Information does not operate at the local level (it is independent of the physicochemical properties of the amino acids) where the influences of natural selection are manifest in the variety of protein structure and function that is well understood. Rather, this analysis implies that Conservation of Information would define the global bounds within which the whole system of proteins is constrained; thus it appears to be acting to constrain evolution at a level different from natural selection, a conclusion that appears counter-intuitive but is supported by the studies described herein.

Collapse

Motomura K, Nakamura M, Otaki JM. A frequency-based linguistic approach to protein decoding and design: Simple concepts, diverse applications, and the SCS Package. Comput Struct Biotechnol J 2013;5:e201302010. [PMID: 24688703 PMCID: PMC3962227 DOI: 10.5936/csbj.201302010] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/24/2012] [Revised: 02/07/2013] [Accepted: 02/08/2013] [Indexed: 11/23/2022] Open