1
Hendrix P, Sun CC, Brighton H, Bender A. On the Connection Between Language Change and Language Processing. Cogn Sci 2023; 47:e13384. [PMID: 38071744 DOI: 10.1111/cogs.13384] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/28/2021] [Revised: 10/22/2023] [Accepted: 11/06/2023] [Indexed: 12/18/2023]
Abstract
Previous studies provided evidence for a connection between language processing and language change. We add to these studies with an exploration of the influence of lexical-distributional properties of words in orthographic space, semantic space, and the mapping between orthographic and semantic space on the probability of lexical extinction. Through a binomial linear regression analysis, we investigated the probability of lexical extinction by the first decade of the twenty-first century (2000s) for words that existed in the first decade of the nineteenth century (1800s) in eight data sets for five languages: English, French, German, Italian, and Spanish. The binomial linear regression analysis revealed that words that are more similar in form to other words are less likely to disappear from a language. By contrast, words that are more similar in meaning to other words are more likely to become extinct. In addition, a more consistent mapping between form and meaning protects a word from lexical extinction. A nonlinear time-to-event analysis furthermore revealed that the position of a word in orthographic and semantic space continues to influence the probability of it disappearing from a language for at least 200 years. Effects of the lexical-distributional properties of words under investigation here have been reported in the language processing literature as well. The results reported here, therefore, fit well with a usage-based approach to language change, which holds that language change is at least to some extent connected to cognitive mechanisms in the human brain.
Affiliation(s)
- Peter Hendrix
- Department of Cognitive Science and Artificial Intelligence, Tilburg University
- Ching Chu Sun
- Department of General Linguistics, Tübingen University
- Henry Brighton
- Department of Cognitive Science and Artificial Intelligence, Tilburg University
- Andreas Bender
- Department of Statistics, Ludwig-Maximilians-University Munich
2
Albuquerque UP, Cantalice AS, Oliveira ES, de Moura JMB, dos Santos RKS, da Silva RH, Brito-Júnior VM, Ferreira-Júnior WS. Exploring Large Digital Bodies for the Study of Human Behavior. Evolutionary Psychological Science 2023; 9:1-10. [PMID: 37362224 PMCID: PMC10203656 DOI: 10.1007/s40806-023-00363-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2023] [Revised: 04/03/2023] [Accepted: 04/04/2023] [Indexed: 06/28/2023]
Abstract
Internet access has become a fundamental component of contemporary society, with major impacts in many areas that offer opportunities for new research insights. The search and deposition of information in digital media form large sets of data known as digital corpora, which can be used to generate structured data, representing repositories of knowledge and evidence of human culture. This information offers opportunities for scientific investigations that contribute to the understanding of human behavior on a large scale, reaching human populations/individuals that would normally be difficult to access. These tools can help access social and cultural varieties worldwide. In this article, we briefly review the potential of these corpora in the study of human behavior. Therefore, we propose Culturomics of Human Behavior as an approach to understand, explain, and predict human behavior using digital corpora.
Affiliation(s)
- Ulysses Paulino Albuquerque
- Laboratório de Ecologia e Evolução de Sistemas Socioecológicos (LEA), Departamento de Botânica, Universidade Federal de Pernambuco, Av. Prof. Moraes Rego, Cidade Universitária, 1235, 50670-901 Recife, Pernambuco, Brazil
- Anibal Silva Cantalice
- Laboratório de Ecologia e Evolução de Sistemas Socioecológicos (LEA), Departamento de Botânica, Universidade Federal de Pernambuco, Av. Prof. Moraes Rego, Cidade Universitária, 1235, 50670-901 Recife, Pernambuco, Brazil
- Edwine Soares Oliveira
- Laboratório de Ecologia e Evolução de Sistemas Socioecológicos (LEA), Departamento de Botânica, Universidade Federal de Pernambuco, Av. Prof. Moraes Rego, Cidade Universitária, 1235, 50670-901 Recife, Pernambuco, Brazil
- Joelson Moreno Brito de Moura
- Instituto de Estudos do Xingu (IEX), Av. Norte Sul, Universidade Federal do Sul e Sudeste do Pará, Loteamento Cidade Nova, Lote N. 1, Qd 15, Setor 15, São Félix do Xingu, Brazil
- Rayane Karoline Silva dos Santos
- Laboratório de Ecologia e Evolução de Sistemas Socioecológicos (LEA), Departamento de Botânica, Universidade Federal de Pernambuco, Av. Prof. Moraes Rego, Cidade Universitária, 1235, 50670-901 Recife, Pernambuco, Brazil
- Risoneide Henriques da Silva
- Laboratório de Ecologia e Evolução de Sistemas Socioecológicos (LEA), Departamento de Botânica, Universidade Federal de Pernambuco, Av. Prof. Moraes Rego, Cidade Universitária, 1235, 50670-901 Recife, Pernambuco, Brazil
- Valdir Moura Brito-Júnior
- Laboratório de Ecologia e Evolução de Sistemas Socioecológicos (LEA), Departamento de Botânica, Universidade Federal de Pernambuco, Av. Prof. Moraes Rego, Cidade Universitária, 1235, 50670-901 Recife, Pernambuco, Brazil
- Washington Soares Ferreira-Júnior
- Laboratório de Investigações Bioculturais no Semiárido, Universidade de Pernambuco, Campus Petrolina, BR203, Km 2, S/N, 56328-903 Petrolina, Pernambuco, Brazil
3
Staples TL. Expansion and evolution of the R programming language. Royal Society Open Science 2023; 10:221550. [PMID: 37063989 PMCID: PMC10090872 DOI: 10.1098/rsos.221550] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2022] [Accepted: 03/23/2023] [Indexed: 06/19/2023]
Abstract
Languages change over time, driven by creation of new words and cultural pressure to optimize communication. Programming languages resemble written language but communicate primarily with computer hardware rather than a human audience. I tested whether there were detectable changes over time in use of R, a mature, open-source programming language used for scientific computing. Across 393 142 GitHub repositories published between 2014 and 2021, I extracted 143 409 288 R functions, programming 'verbs', pairing linguistic and ecological analyses to detect change to diversity and composition of functions used over time. I found the number of R functions in use increased and underwent substantial change, driven primarily by the popularity of the 'tidyverse' collection of community-written extensions. I provide evidence that users can change the nature of programming languages, with patterns that match known processes from natural languages and genetic evolution. In R, there appear to be selective pressures for increased analytic complexity and R functions in decline that are not yet extinct (extinction debts). R's evolution towards the tidyverse may also represent the start of a division into two distinct dialects, which may impact the readability and continuity of analytic and scientific inquiries codified in R, as well as the language's future.
Affiliation(s)
- Timothy L. Staples
- School of Biological Sciences, The University of Queensland, Building 60, St Lucia, Queensland 4072, Australia
4
Meylan SC, Griffiths TL. The Challenges of Large-Scale, Web-Based Language Datasets: Word Length and Predictability Revisited. Cogn Sci 2021; 45:e12983. [PMID: 34170030 DOI: 10.1111/cogs.12983] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/02/2019] [Revised: 03/16/2021] [Accepted: 04/07/2021] [Indexed: 11/28/2022]
Abstract
Language research has come to rely heavily on large-scale, web-based datasets. These datasets can present significant methodological challenges, requiring researchers to make a number of decisions about how they are collected, represented, and analyzed. These decisions often concern long-standing challenges in corpus-based language research, including determining what counts as a word, deciding which words should be analyzed, and matching sets of words across languages. We illustrate these challenges by revisiting "Word lengths are optimized for efficient communication" (Piantadosi, Tily, & Gibson, 2011), which found that word lengths in 11 languages are more strongly correlated with their average predictability (or average information content) than their frequency. Using what we argue to be best practices for large-scale corpus analyses, we find significantly attenuated support for this result and demonstrate that a stronger relationship obtains between word frequency and length for a majority of the languages in the sample. We consider the implications of the results for language research more broadly and provide several recommendations to researchers regarding best practices.
Affiliation(s)
- Stephan C Meylan
- Department of Brain and Cognitive Science, Massachusetts Institute of Technology
- Department of Psychology and Neuroscience, Duke University
5
Degaetano-Ortlieb S, Säily T, Bizzoni Y. Registerial Adaptation vs. Innovation Across Situational Contexts: 18th Century Women in Transition. Front Artif Intell 2021; 4:609970. [PMID: 34151252 PMCID: PMC8208492 DOI: 10.3389/frai.2021.609970] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/24/2020] [Accepted: 04/19/2021] [Indexed: 11/13/2022] [Open Access]
Abstract
Endeavors to computationally model language variation and change are ever increasing. While analyses of recent diachronic trends are frequently conducted, long-term trends accounting for sociolinguistic variation are less well-studied. Our work sheds light on the temporal dynamics of language use of British 18th century women as a group in transition across two situational contexts. Our findings reveal that in formal contexts women adapt to register conventions, while in informal contexts they act as innovators of change in language use influencing others. While adopted from other disciplines, our methods inform (historical) sociolinguistic work in novel ways. These methods include diachronic periodization by Kullback-Leibler divergence to determine periods of change and relevant features of variation, and event cascades as influencer models.
Affiliation(s)
- Tanja Säily
- Department of Languages, Faculty of Arts, University of Helsinki, Helsinki, Finland
- Yuri Bizzoni
- Department of Language Science and Technology, Saarland University, Saarbrücken, Germany
6
Bizzoni Y, Degaetano-Ortlieb S, Fankhauser P, Teich E. Linguistic Variation and Change in 250 Years of English Scientific Writing: A Data-Driven Approach. Front Artif Intell 2020; 3:73. [PMID: 33733190 PMCID: PMC7861277 DOI: 10.3389/frai.2020.00073] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 02/03/2020] [Accepted: 08/07/2020] [Indexed: 11/13/2022] [Open Access]
Abstract
We trace the evolution of Scientific English through the Late Modern period to modern time on the basis of a comprehensive corpus composed of the Transactions and Proceedings of the Royal Society of London, the first and longest-running English scientific journal established in 1665. Specifically, we explore the linguistic imprints of specialization and diversification in the science domain which accumulate in the formation of "scientific language" and field-specific sublanguages/registers (chemistry, biology etc.). We pursue an exploratory, data-driven approach using state-of-the-art computational language models and combine them with selected information-theoretic measures (entropy, relative entropy) for comparing models along relevant dimensions of variation (time, register). Focusing on selected linguistic variables (lexis, grammar), we show how we deploy computational language models for capturing linguistic variation and change and discuss benefits and limitations.
Affiliation(s)
- Yuri Bizzoni
- Language Science and Technology, Saarland University, Saarbrücken, Germany
- Peter Fankhauser
- Digital Linguistics, Institut für Deutsche Sprache, Mannheim, Germany
- Elke Teich
- Language Science and Technology, Saarland University, Saarbrücken, Germany
7
A Standardized Project Gutenberg Corpus for Statistical Analysis of Natural Language and Quantitative Linguistics. Entropy 2020; 22:e22010126. [PMID: 33285901 PMCID: PMC7516435 DOI: 10.3390/e22010126] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 11/29/2019] [Revised: 01/15/2020] [Accepted: 01/16/2020] [Indexed: 11/16/2022]
Abstract
The use of Project Gutenberg (PG) as a text corpus has been extremely popular in statistical analysis of language for more than 25 years. However, in contrast to other major linguistic datasets of similar importance, no consensual full version of PG exists to date. In fact, most PG studies so far either consider only a small number of manually selected books, leading to potentially biased subsets, or employ vastly different pre-processing strategies (often specified in insufficient detail), raising concerns regarding the reproducibility of published results. In order to address these shortcomings, here we present the Standardized Project Gutenberg Corpus (SPGC), an open science approach to a curated version of the complete PG data containing more than 50,000 books and more than 3 × 10⁹ word-tokens. Using different sources of annotated metadata, we not only provide a broad characterization of the content of PG, but also show different examples highlighting the potential of SPGC for investigating language variability across time, subjects, and authors. We publish our methodology in detail, the code to download and process the data, as well as the obtained corpus itself on three different levels of granularity (raw text, timeseries of word tokens, and counts of words). In this way, we provide a reproducible, pre-processed, full-size version of Project Gutenberg as a new scientific resource for corpus linguistics, natural language processing, and information retrieval.
8
Studying Lexical Dynamics and Language Change via Generalized Entropies: The Problem of Sample Size. Entropy 2019; 21:e21050464. [PMID: 33267178 PMCID: PMC7514953 DOI: 10.3390/e21050464] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 04/09/2019] [Revised: 04/24/2019] [Accepted: 04/30/2019] [Indexed: 12/03/2022]
Abstract
Recently, it was demonstrated that generalized entropies of order α offer novel and important opportunities to quantify the similarity of symbol sequences where α is a free parameter. Varying this parameter makes it possible to magnify differences between different texts at specific scales of the corresponding word frequency spectrum. For the analysis of the statistical properties of natural languages, this is especially interesting, because textual data are characterized by Zipf’s law, i.e., there are very few word types that occur very often (e.g., function words expressing grammatical relationships) and many word types with a very low frequency (e.g., content words carrying most of the meaning of a sentence). Here, this approach is systematically and empirically studied by analyzing the lexical dynamics of the German weekly news magazine Der Spiegel (consisting of approximately 365,000 articles and 237,000,000 words that were published between 1947 and 2017). We show that, analogous to most other measures in quantitative linguistics, similarity measures based on generalized entropies depend heavily on the sample size (i.e., text length). We argue that this makes it difficult to quantify lexical dynamics and language change and show that standard sampling approaches do not solve this problem. We discuss the consequences of the results for the statistical analysis of languages.
9
The Entropy of Words—Learnability and Expressivity across More than 1000 Languages. Entropy 2017. [DOI: 10.3390/e19060275] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Indexed: 11/16/2022]
10
Chen H, Liang J, Liu H. How Does Word Length Evolve in Written Chinese? PLoS One 2015; 10:e0138567. [PMID: 26384237 PMCID: PMC4575206 DOI: 10.1371/journal.pone.0138567] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 07/03/2015] [Accepted: 08/31/2015] [Indexed: 11/19/2022] [Open Access]
Abstract
We present substantial evidence that word length can be an essential lexical structural feature for word evolution in written Chinese. The data used in this study are diachronic Chinese short narrative texts with a time span of over 2000 years. We show that the increase of word length is an essential regularity in word evolution. On the one hand, word frequency is found to depend on word length, and their relation is in line with the power law function y = ax^{-b}. On the other hand, our deeper analyses show that the increase of word length results in the simplification of characters for balance in written Chinese. Moreover, the correspondence between written and spoken Chinese is discussed. We conclude that the disyllabic trend may account for the increase of word length, and its impacts can be explained by "the principle of least effort".
Affiliation(s)
- Heng Chen
- Center for the Study of Language and Cognition, Zhejiang University, Hangzhou, CN-310028, China
- Junying Liang
- Department of Linguistics, Zhejiang University, Hangzhou, CN-310058, China
- Haitao Liu
- Department of Linguistics, Zhejiang University, Hangzhou, CN-310058, China
- Ningbo Institute of Technology, Zhejiang University, Ningbo, CN-315100, China
11
Cocho G, Flores J, Gershenson C, Pineda C, Sánchez S. Rank diversity of languages: generic behavior in computational linguistics. PLoS One 2015; 10:e0121898. [PMID: 25849150 PMCID: PMC4388647 DOI: 10.1371/journal.pone.0121898] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Received: 10/14/2014] [Accepted: 02/05/2015] [Indexed: 11/19/2022] [Open Access]
Abstract
Statistical studies of languages have focused on the rank-frequency distribution of words. Instead, we introduce here a measure of how word ranks change in time and call this distribution rank diversity. We calculate this diversity for books published in six European languages since 1800, and find that it follows a universal lognormal distribution. Based on the mean and standard deviation associated with the lognormal distribution, we define three different word regimes of languages: "heads" consist of words whose rank hardly changes in time, "bodies" are words of general use, while "tails" consist of context-specific words whose rank varies considerably in time. The heads and bodies reflect the size of language cores identified by linguists for basic communication. We propose a Gaussian random walk model which reproduces the rank variation of words in time and thus the diversity. Rank diversity of words can be understood as the result of random variations in rank, where the size of the variation depends on the rank itself. We find that the core size is similar for all languages studied.
Affiliation(s)
- Germinal Cocho
- Instituto de Física, Universidad Nacional Autónoma de México, Mexico City, Mexico
- Centro de Ciencias de la Complejidad, Universidad Nacional Autónoma de México, Mexico City, Mexico
- Jorge Flores
- Instituto de Física, Universidad Nacional Autónoma de México, Mexico City, Mexico
- Carlos Gershenson
- Centro de Ciencias de la Complejidad, Universidad Nacional Autónoma de México, Mexico City, Mexico
- Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Universidad Nacional Autónoma de México, Mexico City, Mexico
- Carlos Pineda
- Instituto de Física, Universidad Nacional Autónoma de México, Mexico City, Mexico
- Sergio Sánchez
- Facultad de Ciencias, Universidad Nacional Autónoma de México, Mexico City, Mexico