Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: List JM, Pathmanathan JS, Lopez P, Bapteste E. Unity and disunity in evolutionary sciences: process-based analogies open common research avenues for biology and linguistics. Biol Direct 2016;11:39. [PMID: 27544206 PMCID: PMC4992195 DOI: 10.1186/s13062-016-0145-2] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 05/22/2016] [Accepted: 08/06/2016] [Indexed: 11/13/2022]

Cited by Other Article(s)

List JM. Open Problems in Computational Historical Linguistics. Open Research Europe 2024;3:201. [PMID: 38357681 PMCID: PMC10864822 DOI: 10.12688/openreseurope.16804.1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 05/30/2024] [Indexed: 02/16/2024]

Dotan E, Jaschek G, Pupko T, Belinkov Y. Effect of tokenization on transformers for biological sequences. Bioinformatics 2024;40:btae196. [PMID: 38608190 PMCID: PMC11055402 DOI: 10.1093/bioinformatics/btae196] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2023] [Revised: 02/20/2024] [Accepted: 04/11/2024] [Indexed: 04/14/2024]

Abstract

MOTIVATION

Deep-learning models are transforming biological research, including many bioinformatics and comparative genomics algorithms, such as sequence alignments, phylogenetic tree inference, and automatic classification of protein functions. Among these deep-learning algorithms, models for processing natural languages, developed in the natural language processing (NLP) community, were recently applied to biological sequences. However, biological sequences are different from natural languages, such as English and French, in which segmentation of the text into separate words is relatively straightforward. Moreover, biological sequences are characterized by extremely long sentences, which hamper their processing by current machine-learning models, notably the transformer architecture. In NLP, one of the first processing steps is to transform the raw text into a list of tokens. Deep-learning applications to biological sequence data mostly segment proteins and DNA into single characters. In this work, we study the effect of alternative tokenization algorithms on eight different tasks in biology, from predicting the function of proteins and their stability, through nucleotide sequence alignment, to classifying proteins into specific families.

RESULTS

We demonstrate that applying alternative tokenization algorithms can increase accuracy and, at the same time, substantially reduce the input length compared to the trivial tokenizer in which each character is a token. Furthermore, applying these tokenization algorithms allows interpreting trained models, taking into account dependencies among positions. Finally, we trained these tokenizers on a large dataset of protein sequences containing more than 400 billion amino acids, which resulted in a more than 3-fold decrease in the number of tokens. We then tested these tokenizers trained on large-scale data on the above specific tasks and showed that for some tasks it is highly beneficial to train database-specific tokenizers. Our study suggests that tokenizers are likely to be a critical component in future deep-network analysis of biological sequence data.

AVAILABILITY AND IMPLEMENTATION

Code, data, and trained tokenizers are available on https://github.com/technion-cs-nlp/BiologicalTokenizers.
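
The contrast drawn in this abstract between a character-level tokenizer and a learned one can be illustrated with a minimal sketch. It assumes byte-pair encoding (BPE) as the alternative algorithm; the abstract does not say which algorithms were evaluated, so BPE is used here only as a familiar example. The function names (train_bpe, apply_merge, tokenize) and the toy sequences are hypothetical and are not taken from the authors' repository.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_bpe(sequences, num_merges):
    """Learn up to `num_merges` merge rules, starting from single characters."""
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs across the whole corpus.
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        best_pair, count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so further merges would not help
        merges.append(best_pair)
        corpus = [apply_merge(tokens, best_pair) for tokens in corpus]
    return merges


def tokenize(seq, merges):
    """Tokenize a sequence by replaying the learned merges in training order."""
    tokens = list(seq)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


# Toy protein-like sequences (hypothetical data, for illustration only).
training = ["MKTAYIAKQR", "MKTAYLAKQR", "MKTWYIAKQR"]
merges = train_bpe(training, num_merges=5)
for seq in training:
    print(seq, len(seq), "characters ->", len(tokenize(seq, merges)), "tokens")
```

Each merge replaces a frequent adjacent pair with a single token, so the token count can only stay the same or shrink relative to the one-character-per-token baseline. How much it shrinks on real data depends on the corpus and vocabulary size.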

Yang S, Sun X, Jin L, Zhang M. Inferring language dispersal patterns with velocity field estimation. Nat Commun 2024;15:190. [PMID: 38167834 PMCID: PMC10761963 DOI: 10.1038/s41467-023-44430-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/05/2023] [Accepted: 12/11/2023] [Indexed: 01/05/2024]

Ladoukakis ED, Michelioudakis D, Anagnostopoulou E. Toward an evolutionary framework for language variation and change. Bioessays 2022;44:e2100216. [PMID: 34985776 DOI: 10.1002/bies.202100216] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2021] [Revised: 12/13/2021] [Accepted: 12/15/2021] [Indexed: 11/05/2022]

Evans CL, Greenhill SJ, Watts J, List JM, Botero CA, Gray RD, Kirby KR. The uses and abuses of tree thinking in cultural evolution. Philos Trans R Soc Lond B Biol Sci 2021;376:20200056. [PMID: 33993767 PMCID: PMC8126464 DOI: 10.1098/rstb.2020.0056] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Accepted: 02/11/2021] [Indexed: 11/13/2022]

Browne C. AI for Ancient Games: Report on the Digital Ludeme Project. Künstliche Intelligenz 2020;34:89-93. [PMID: 32382215 PMCID: PMC7194251 DOI: 10.1007/s13218-019-00600-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/05/2018] [Accepted: 06/22/2019] [Indexed: 11/24/2022]

Rzymski C, Tresoldi T, Greenhill SJ, Wu MS, Schweikhard NE, Koptjevskaja-Tamm M, Gast V, Bodt TA, Hantgan A, Kaiping GA, Chang S, Lai Y, Morozova N, Arjava H, Hübler N, Koile E, Pepper S, Proos M, Van Epps B, Blanco I, Hundt C, Monakhov S, Pianykh K, Ramesh S, Gray RD, Forkel R, List JM. The Database of Cross-Linguistic Colexifications, reproducible analysis of cross-linguistic polysemies. Sci Data 2020;7:13. [PMID: 31932593 PMCID: PMC6957499 DOI: 10.1038/s41597-019-0341-x] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 07/30/2019] [Accepted: 11/29/2019] [Indexed: 11/09/2022]

Affiliation(s)

Christoph Rzymski: Department of Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Jena, Germany
Tiago Tresoldi: Department of Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Jena, Germany
Simon J Greenhill: Department of Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Jena, Germany; ARC Centre of Excellence for the Dynamics of Language, Australian National University, Canberra, Australia
Mei-Shin Wu: Department of Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Jena, Germany
Nathanael E Schweikhard: Department of Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Jena, Germany
Maria Koptjevskaja-Tamm: Stockholm University, Stockholm, Sweden
Volker Gast: Friedrich Schiller University, Jena, Germany
Timotheus A Bodt: SOAS, London, UK
Abbie Hantgan: CNRS LLACAN, Paris, France
Gereon A Kaiping: University of Leiden, Leiden, Netherlands
Sophie Chang: Independent English-Chinese Translator and linguistic researcher, Taipei, Taiwan
Yunfan Lai: Department of Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Jena, Germany
Natalia Morozova: Department of Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Jena, Germany
Heini Arjava: University of Helsinki, Helsinki, Finland
Nataliia Hübler: Department of Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Jena, Germany
Ezequiel Koile: Department of Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Jena, Germany
Steve Pepper: University of Oslo, Oslo, Norway
Mariann Proos: University of Tartu, Tartu, Estonia
Briana Van Epps: Lund University, Lund, Sweden
Ingrid Blanco: Friedrich Schiller University, Jena, Germany
Carolin Hundt: Friedrich Schiller University, Jena, Germany
Sergei Monakhov: Friedrich Schiller University, Jena, Germany
Kristina Pianykh: Friedrich Schiller University, Jena, Germany
Sallona Ramesh: Friedrich Schiller University, Jena, Germany
Russell D Gray: Department of Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Jena, Germany
Robert Forkel: Department of Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Jena, Germany
Johann-Mattis List: Department of Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Jena, Germany

Grammar of protein domain architectures. Proc Natl Acad Sci U S A 2019;116:3636-3645. [PMID: 30733291 PMCID: PMC6397568 DOI: 10.1073/pnas.1814684116] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Indexed: 01/03/2023]

Abstract

Genomes appear similar to natural language texts, and protein domains can be treated as analogs of words. To investigate the linguistic properties of genomes further, we calculated the complexity of the “protein languages” in all major branches of life and identified a nearly universal value of information gain associated with the transition from a random domain arrangement to the current protein domain architecture. An exploration of the evolutionary relationship of the protein languages identified the domain combinations that discriminate between the major branches of cellular life. We conclude that there exists a “quasi-universal grammar” of protein domains and that the nearly constant information gain we identified corresponds to the minimal complexity required to maintain a functional cell.

From an abstract, informational perspective, protein domains appear analogous to words in natural languages in which the rules of word association are dictated by linguistic rules, or grammar. Such rules exist for protein domains as well, because only a small fraction of all possible domain combinations is viable in evolution. We employ a popular linguistic technique, n-gram analysis, to probe the “proteome grammar”—that is, the rules of association of domains that generate various domain architectures of proteins. Comparison of the complexity measures of “protein languages” in major branches of life shows that the relative entropy difference (information gain) between the observed domain architectures and random domain combinations is highly conserved in evolution and is close to being a universal constant, at ∼1.2 bits. Substantial deviations from this constant are observed in only two major groups of organisms: a subset of Archaea that appears to be cells simplified to the limit, and animals that display extreme complexity. We also identify the n-grams that represent signatures of the major branches of cellular life. The results of this analysis bolster the analogy between genomes and natural language and show that a “quasi-universal grammar” underlies the evolution of domain architectures in all divisions of cellular life. The nearly universal value of information gain by the domain architectures could reflect the minimum complexity of signal processing that is required to maintain a functioning cell.
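
One possible reading of the "information gain" described above can be sketched in a few lines. The sketch treats each protein as an ordered list of domain names, counts adjacent domain pairs (bigrams), and computes the relative entropy in bits between the observed bigram distribution and an independence model in which the two positions are combined at random. This is an assumption made for illustration: the abstract does not specify the n-gram order, the null model, or any corrections the authors used, and the function name and toy architectures are hypothetical.

```python
import math
from collections import Counter


def bigram_information_gain(architectures):
    """Relative entropy (bits) of observed domain bigrams vs. random pairing.

    `architectures` is an iterable of domain-name lists, one per protein.
    The null model multiplies the marginal frequencies of the first and
    second bigram positions, so the result is never negative.
    """
    bigrams = Counter()
    for arch in architectures:
        bigrams.update(zip(arch, arch[1:]))
    total = sum(bigrams.values())
    if total == 0:
        return 0.0  # no multi-domain proteins, nothing to measure

    # Marginal counts for the first and second position of each bigram.
    first, second = Counter(), Counter()
    for (a, b), n in bigrams.items():
        first[a] += n
        second[b] += n

    gain = 0.0
    for (a, b), n in bigrams.items():
        p = n / total                                   # observed probability
        q = (first[a] / total) * (second[b] / total)    # random-pairing model
        gain += p * math.log2(p / q)
    return gain


# Toy architectures (hypothetical data, for illustration only).
proteins = [
    ["SH3", "SH2", "Kinase"],
    ["SH3", "SH2", "Kinase"],
    ["PH", "Kinase"],
    ["SH2", "PH"],
]
print(round(bigram_information_gain(proteins), 3), "bits")
```

A value of zero means the domains pair up as if at random; larger values mean the observed architectures are more constrained than random combination would predict.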

List JM, Greenhill SJ, Gray RD. The Potential of Automatic Word Comparison for Historical Linguistics. PLoS One 2017;12:e0170046. [PMID: 28129337 PMCID: PMC5271327 DOI: 10.1371/journal.pone.0170046] [Citation(s) in RCA: 38] [Impact Index Per Article: 5.4] [Received: 10/18/2016] [Accepted: 12/28/2016] [Indexed: 11/19/2022]

List JM. Cultural Phylogenetics: Concepts and Applications in Archaeology. — Edited by Larissa Mendoza Straffon. Syst Biol 2016. [DOI: 10.1093/sysbio/syw085] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/14/2022]