1
List JM. Open Problems in Computational Historical Linguistics. Open Research Europe 2024; 3:201. [PMID: 38357681 PMCID: PMC10864822 DOI: 10.12688/openreseurope.16804.1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 05/30/2024] [Indexed: 02/16/2024]
Abstract
Problems constitute the starting point of all scientific research. The essay reflects on the different kinds of problems that scientists address in their research and discusses a list of 10 problems for the field of computational historical linguistics, that was proposed throughout 2019 in a series of blog posts (see http://phylonetworks.blogspot.com/). In contrast to problems identified in different contexts, these problems were considered to be solvable, but no solution could be proposed back then. By discussing the problems in the light of developments that have been made in the field during the past five years, a modified list is proposed that takes new insights into account but also finds that the majority of the problems has not yet been solved.
Affiliation(s)
- Johann-Mattis List
- Chair of Multilingual Computational Linguistics, University of Passau, Passau, Bavaria, 94032, Germany
- Department of Linguistic and Cultural Evolution, Max Planck Institute for Evolutionary Anthropology, Leipzig, 04103, Germany
2
Dotan E, Jaschek G, Pupko T, Belinkov Y. Effect of tokenization on transformers for biological sequences. Bioinformatics 2024; 40:btae196. [PMID: 38608190 PMCID: PMC11055402 DOI: 10.1093/bioinformatics/btae196] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2023] [Revised: 02/20/2024] [Accepted: 04/11/2024] [Indexed: 04/14/2024]
Abstract
MOTIVATION: Deep-learning models are transforming biological research, including many bioinformatics and comparative genomics algorithms, such as sequence alignments, phylogenetic tree inference, and automatic classification of protein functions. Among these deep-learning algorithms, models for processing natural languages, developed in the natural language processing (NLP) community, were recently applied to biological sequences. However, biological sequences are different from natural languages, such as English, and French, in which segmentation of the text to separate words is relatively straightforward. Moreover, biological sequences are characterized by extremely long sentences, which hamper their processing by current machine-learning models, notably the transformer architecture. In NLP, one of the first processing steps is to transform the raw text to a list of tokens. Deep-learning applications to biological sequence data mostly segment proteins and DNA to single characters. In this work, we study the effect of alternative tokenization algorithms on eight different tasks in biology, from predicting the function of proteins and their stability, through nucleotide sequence alignment, to classifying proteins to specific families.
RESULTS: We demonstrate that applying alternative tokenization algorithms can increase accuracy and at the same time, substantially reduce the input length compared to the trivial tokenizer in which each character is a token. Furthermore, applying these tokenization algorithms allows interpreting trained models, taking into account dependencies among positions. Finally, we trained these tokenizers on a large dataset of protein sequences containing more than 400 billion amino acids, which resulted in over a 3-fold decrease in the number of tokens. We then tested these tokenizers trained on large-scale data on the above specific tasks and showed that for some tasks it is highly beneficial to train database-specific tokenizers. Our study suggests that tokenizers are likely to be a critical component in future deep-network analysis of biological sequence data.
AVAILABILITY AND IMPLEMENTATION: Code, data, and trained tokenizers are available on https://github.com/technion-cs-nlp/BiologicalTokenizers.
Affiliation(s)
- Edo Dotan
- The Henry and Marilyn Taub Faculty of Computer Science, Technion – Israel Institute of Technology, Haifa 3200003, Israel
- The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
- Gal Jaschek
- Department of Genetics, Yale University School of Medicine, New Haven, CT 06510, United States
- Tal Pupko
- The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
- Yonatan Belinkov
- The Henry and Marilyn Taub Faculty of Computer Science, Technion – Israel Institute of Technology, Haifa 3200003, Israel
3
Yang S, Sun X, Jin L, Zhang M. Inferring language dispersal patterns with velocity field estimation. Nat Commun 2024; 15:190. [PMID: 38167834 PMCID: PMC10761963 DOI: 10.1038/s41467-023-44430-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/05/2023] [Accepted: 12/11/2023] [Indexed: 01/05/2024]
Abstract
Reconstructing the spatial evolution of languages can deepen our understanding of the demic diffusion and cultural spread. However, the phylogeographic approach that is frequently used to infer language dispersal patterns has limitations, primarily because the phylogenetic tree cannot fully explain the language evolution induced by the horizontal contact among languages, such as borrowing and areal diffusion. Here, we introduce the language velocity field estimation, which does not rely on the phylogenetic tree, to infer language dispersal trajectories and centre. Its effectiveness and robustness are verified through both simulated and empirical validations. Using language velocity field estimation, we infer the dispersal patterns of four agricultural language families and groups, encompassing approximately 700 language samples. Our results show that the dispersal trajectories of these languages are primarily compatible with population movement routes inferred from ancient DNA and archaeological materials, and their dispersal centres are geographically proximate to ancient homelands of agricultural or Neolithic cultures. Our findings highlight that the agricultural languages dispersed alongside the demic diffusions and cultural spreads during the past 10,000 years. We expect that language velocity field estimation could aid the spatial analysis of language evolution and further branch out into the studies of demographic and cultural dynamics.
Affiliation(s)
- Sizhe Yang
- State Key Laboratory of Genetic Engineering, Center for Evolutionary Biology, and Collaborative Innovation Center for Genetics and Development, School of Life Sciences, Fudan University, Shanghai, 200438, China
- Xiaoru Sun
- Human Phenome Institute, Fudan University, Shanghai, 200438, China
- Ministry of Education Key Laboratory of Contemporary Anthropology, Department of Anthropology and Human Genetics, School of Life Sciences, Fudan University, Shanghai, 200438, China
- Li Jin
- State Key Laboratory of Genetic Engineering, Center for Evolutionary Biology, and Collaborative Innovation Center for Genetics and Development, School of Life Sciences, Fudan University, Shanghai, 200438, China
- Human Phenome Institute, Fudan University, Shanghai, 200438, China
- Menghan Zhang
- Institute of Modern Languages and Linguistics, Fudan University, Shanghai, 200433, China
- Research Institute of Intelligent Complex Systems, Fudan University, Shanghai, 200433, China
4
Ladoukakis ED, Michelioudakis D, Anagnostopoulou E. Toward an evolutionary framework for language variation and change. Bioessays 2022; 44:e2100216. [PMID: 34985776 DOI: 10.1002/bies.202100216] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2021] [Revised: 12/13/2021] [Accepted: 12/15/2021] [Indexed: 11/05/2022]
Abstract
In this paper, we identify the parallels and the differences between language and life as evolvable systems in pursuit of a framework that will investigate language change from the perspective of a general theory of evolution. Despite the consensus that languages change similarly to species, as reflected in the construction of language trees, the field has mainly applied biological techniques to specific problems of historical linguistics and has not systematically engaged in disentangling the basic concepts (population, reproductive unit, inheritance, etc.) and the core processes underlying evolutionary theory, namely mutation, selection, drift, and migration, as applied to language. We develop such a proposal. Treating language as an evolvable system places previous studies in a novel perspective, as it offers an elegant unifying framework that can accommodate current knowledge, utilize the rich theoretical framework of evolutionary biology, and synthesize many independent strands of inquiry, initiating a whole new research program.
Affiliation(s)
- Dimitris Michelioudakis
- Department of Linguistics, School of Philology, Aristotle University of Thessaloniki, Thessaloniki, Greece
- Elena Anagnostopoulou
- Department of Philology, Division of Linguistics, University of Crete, Rethymnon, Greece
5
Evans CL, Greenhill SJ, Watts J, List JM, Botero CA, Gray RD, Kirby KR. The uses and abuses of tree thinking in cultural evolution. Philos Trans R Soc Lond B Biol Sci 2021; 376:20200056. [PMID: 33993767 PMCID: PMC8126464 DOI: 10.1098/rstb.2020.0056] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Accepted: 02/11/2021] [Indexed: 11/13/2022]
Abstract
Modern phylogenetic methods are increasingly being used to address questions about macro-level patterns in cultural evolution. These methods can illuminate the unobservable histories of cultural traits and identify the evolutionary drivers of trait change over time, but their application is not without pitfalls. Here, we outline the current scope of research in cultural tree thinking, highlighting a toolkit of best practices to navigate and avoid the pitfalls and 'abuses' associated with their application. We emphasize two principles that support the appropriate application of phylogenetic methodologies in cross-cultural research: researchers should (1) draw on multiple lines of evidence when deciding if and which types of phylogenetic methods and models are suitable for their cross-cultural data, and (2) carefully consider how different cultural traits might have different evolutionary histories across space and time. When used appropriately phylogenetic methods can provide powerful insights into the processes of evolutionary change that have shaped the broad patterns of human history. This article is part of the theme issue 'Foundations of cultural evolution'.
Affiliation(s)
- Cara L. Evans
- Department of Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Jena 07745, Germany
- Simon J. Greenhill
- Department of Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Jena 07745, Germany
- ARC Centre of Excellence for the Dynamics of Language, ANU College of Asia and the Pacific, Australian National University, Canberra 2700, Australia
- Joseph Watts
- Religion Programme, University of Otago, Dunedin 9016, New Zealand
- Centre for Research on Evolution, Belief and Behaviour, University of Otago, Dunedin 9016, New Zealand
- Johann-Mattis List
- Department of Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Jena 07745, Germany
- Carlos A. Botero
- Department of Biology, Washington University in St Louis, St Louis, MO 63130, USA
- Russell D. Gray
- Department of Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Jena 07745, Germany
- School of Psychology, University of Auckland, Auckland 1010, New Zealand
- Kathryn R. Kirby
- Department of Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Jena 07745, Germany
- Department of Ecology and Evolutionary Biology, University of Toronto, Toronto, ON, Canada M5S 3B2
6
Browne C. AI for Ancient Games: Report on the Digital Ludeme Project. Künstliche Intelligenz 2020; 34:89-93. [PMID: 32382215 PMCID: PMC7194251 DOI: 10.1007/s13218-019-00600-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/05/2018] [Accepted: 06/22/2019] [Indexed: 11/24/2022]
Abstract
This report summarises the Digital Ludeme Project, a recently launched 5-year research project being conducted at Maastricht University. This computational study of the world’s traditional strategy games seeks to improve our understanding of early games, their development, and their role in the spread of related mathematical ideas throughout recorded human history.
Affiliation(s)
- Cameron Browne
- Department of Data Science and Knowledge Engineering (DKE), Maastricht University, Bouillonstraat 8-10, 6211 LH Maastricht, The Netherlands
7
Rzymski C, Tresoldi T, Greenhill SJ, Wu MS, Schweikhard NE, Koptjevskaja-Tamm M, Gast V, Bodt TA, Hantgan A, Kaiping GA, Chang S, Lai Y, Morozova N, Arjava H, Hübler N, Koile E, Pepper S, Proos M, Van Epps B, Blanco I, Hundt C, Monakhov S, Pianykh K, Ramesh S, Gray RD, Forkel R, List JM. The Database of Cross-Linguistic Colexifications, reproducible analysis of cross-linguistic polysemies. Sci Data 2020; 7:13. [PMID: 31932593 PMCID: PMC6957499 DOI: 10.1038/s41597-019-0341-x] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 07/30/2019] [Accepted: 11/29/2019] [Indexed: 11/09/2022]
Abstract
Advances in computer-assisted linguistic research have been greatly influential in reshaping linguistic research. With the increasing availability of interconnected datasets created and curated by researchers, more and more interwoven questions can now be investigated. Such advances, however, are bringing high requirements in terms of rigorousness for preparing and curating datasets. Here we present CLICS, a Database of Cross-Linguistic Colexifications (CLICS). CLICS tackles interconnected interdisciplinary research questions about the colexification of words across semantic categories in the world's languages, and show-cases best practices for preparing data for cross-linguistic research. This is done by addressing shortcomings of an earlier version of the database, CLICS2, and by supplying an updated version with CLICS3, which massively increases the size and scope of the project. We provide tools and guidelines for this purpose and discuss insights resulting from organizing student tasks for database updates.
Affiliation(s)
- Christoph Rzymski
- Department of Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Jena, Germany
- Tiago Tresoldi
- Department of Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Jena, Germany
- Simon J Greenhill
- Department of Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Jena, Germany
- ARC Centre of Excellence for the Dynamics of Language, Australian National University, Canberra, Australia
- Mei-Shin Wu
- Department of Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Jena, Germany
- Nathanael E Schweikhard
- Department of Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Jena, Germany
- Volker Gast
- Friedrich Schiller University, Jena, Germany
- Sophie Chang
- Independent English-Chinese Translator and linguistic researcher, Taipei, Taiwan
- Yunfan Lai
- Department of Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Jena, Germany
- Natalia Morozova
- Department of Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Jena, Germany
- Nataliia Hübler
- Department of Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Jena, Germany
- Ezequiel Koile
- Department of Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Jena, Germany
- Russell D Gray
- Department of Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Jena, Germany
- Robert Forkel
- Department of Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Jena, Germany
- Johann-Mattis List
- Department of Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Jena, Germany
8
Abstract
Genomes appear similar to natural language texts, and protein domains can be treated as analogs of words. To investigate the linguistic properties of genomes further, we calculated the complexity of the “protein languages” in all major branches of life and identified a nearly universal value of information gain associated with the transition from a random domain arrangement to the current protein domain architecture. An exploration of the evolutionary relationship of the protein languages identified the domain combinations that discriminate between the major branches of cellular life. We conclude that there exists a “quasi-universal grammar” of protein domains and that the nearly constant information gain we identified corresponds to the minimal complexity required to maintain a functional cell. From an abstract, informational perspective, protein domains appear analogous to words in natural languages in which the rules of word association are dictated by linguistic rules, or grammar. Such rules exist for protein domains as well, because only a small fraction of all possible domain combinations is viable in evolution. We employ a popular linguistic technique, n-gram analysis, to probe the “proteome grammar”—that is, the rules of association of domains that generate various domain architectures of proteins. Comparison of the complexity measures of “protein languages” in major branches of life shows that the relative entropy difference (information gain) between the observed domain architectures and random domain combinations is highly conserved in evolution and is close to being a universal constant, at ∼1.2 bits. Substantial deviations from this constant are observed in only two major groups of organisms: a subset of Archaea that appears to be cells simplified to the limit, and animals that display extreme complexity. We also identify the n-grams that represent signatures of the major branches of cellular life. 
The results of this analysis bolster the analogy between genomes and natural language and show that a “quasi-universal grammar” underlies the evolution of domain architectures in all divisions of cellular life. The nearly universal value of information gain by the domain architectures could reflect the minimum complexity of signal processing that is required to maintain a functioning cell.
9
List JM, Greenhill SJ, Gray RD. The Potential of Automatic Word Comparison for Historical Linguistics. PLoS One 2017; 12:e0170046. [PMID: 28129337 PMCID: PMC5271327 DOI: 10.1371/journal.pone.0170046] [Citation(s) in RCA: 38] [Impact Index Per Article: 5.4] [Received: 10/18/2016] [Accepted: 12/28/2016] [Indexed: 11/19/2022]
Abstract
The amount of data from languages spoken all over the world is rapidly increasing. Traditional manual methods in historical linguistics need to face the challenges brought by this influx of data. Automatic approaches to word comparison could provide invaluable help to pre-analyze data which can be later enhanced by experts. In this way, computational approaches can take care of the repetitive and schematic tasks leaving experts to concentrate on answering interesting questions. Here we test the potential of automatic methods to detect etymologically related words (cognates) in cross-linguistic data. Using a newly compiled database of expert cognate judgments across five different language families, we compare how well different automatic approaches distinguish related from unrelated words. Our results show that automatic methods can identify cognates with a very high degree of accuracy, reaching 89% for the best-performing method Infomap. We identify the specific strengths and weaknesses of these different methods and point to major challenges for future approaches. Current automatic approaches for cognate detection, although not perfect, could become an important component of future research in historical linguistics.
Affiliation(s)
- Johann-Mattis List
- Centre des Recherches Linguistiques sur l’Asie Orientale, École des Hautes Études en Sciences Sociales, 2 Rue de Lille, 75007 Paris, France
- Simon J. Greenhill
- Department for Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Kahlaische Straße 10, 07743, Jena, Germany
- ARC Centre of Excellence for the Dynamics of Language, Australian National University, Canberra, 2600, Australia
- Russell D. Gray
- Department for Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Kahlaische Straße 10, 07743, Jena, Germany
10
List JM. Cultural Phylogenetics: Concepts and Applications in Archaeology. Edited by Larissa Mendoza Straffon. Syst Biol 2016. [DOI: 10.1093/sysbio/syw085] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/14/2022]