1
Hendrix P, Sun CC, Brighton H, Bender A. On the Connection Between Language Change and Language Processing. Cogn Sci 2023; 47:e13384. [PMID: 38071744 DOI: 10.1111/cogs.13384] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/28/2021] [Revised: 10/22/2023] [Accepted: 11/06/2023] [Indexed: 12/18/2023]
Abstract
Previous studies provided evidence for a connection between language processing and language change. We add to these studies with an exploration of the influence of lexical-distributional properties of words in orthographic space, semantic space, and the mapping between orthographic and semantic space on the probability of lexical extinction. Through a binomial linear regression analysis, we investigated the probability of lexical extinction by the first decade of the twenty-first century (2000s) for words that existed in the first decade of the nineteenth century (1800s) in eight data sets for five languages: English, French, German, Italian, and Spanish. The binomial linear regression analysis revealed that words that are more similar in form to other words are less likely to disappear from a language. By contrast, words that are more similar in meaning to other words are more likely to become extinct. In addition, a more consistent mapping between form and meaning protects a word from lexical extinction. A nonlinear time-to-event analysis furthermore revealed that the position of a word in orthographic and semantic space continues to influence the probability of it disappearing from a language for at least 200 years. Effects of the lexical-distributional properties of words under investigation here have been reported in the language processing literature as well. The results reported here, therefore, fit well with a usage-based approach to language change, which holds that language change is at least to some extent connected to cognitive mechanisms in the human brain.
Affiliation(s)
- Peter Hendrix
  - Department of Cognitive Science and Artificial Intelligence, Tilburg University
- Ching Chu Sun
  - Department of General Linguistics, Tübingen University
- Henry Brighton
  - Department of Cognitive Science and Artificial Intelligence, Tilburg University
- Andreas Bender
  - Department of Statistics, Ludwig-Maximilians-University Munich
2
Albuquerque UP, Cantalice AS, Oliveira ES, de Moura JMB, dos Santos RKS, da Silva RH, Brito-Júnior VM, Ferreira-Júnior WS. Exploring Large Digital Bodies for the Study of Human Behavior. Evolutionary Psychological Science 2023; 9:1-10. [PMID: 37362224 PMCID: PMC10203656 DOI: 10.1007/s40806-023-00363-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2023] [Revised: 04/03/2023] [Accepted: 04/04/2023] [Indexed: 06/28/2023]
Abstract
Internet access has become a fundamental component of contemporary society, with major impacts in many areas that offer opportunities for new research insights. The search and deposition of information in digital media form large sets of data known as digital corpora, which can be used to generate structured data, representing repositories of knowledge and evidence of human culture. This information offers opportunities for scientific investigations that contribute to the understanding of human behavior on a large scale, reaching human populations/individuals that would normally be difficult to access. These tools can help access social and cultural varieties worldwide. In this article, we briefly review the potential of these corpora in the study of human behavior. Therefore, we propose Culturomics of Human Behavior as an approach to understand, explain, and predict human behavior using digital corpora.
Affiliation(s)
- Ulysses Paulino Albuquerque
  - Laboratório de Ecologia e Evolução de Sistemas Socioecológicos (LEA), Departamento de Botânica, Universidade Federal de Pernambuco, Av. Prof. Moraes Rego, Cidade Universitária, 1235, 50670-901 Recife, Pernambuco, Brazil
- Anibal Silva Cantalice
  - Laboratório de Ecologia e Evolução de Sistemas Socioecológicos (LEA), Departamento de Botânica, Universidade Federal de Pernambuco, Av. Prof. Moraes Rego, Cidade Universitária, 1235, 50670-901 Recife, Pernambuco, Brazil
- Edwine Soares Oliveira
  - Laboratório de Ecologia e Evolução de Sistemas Socioecológicos (LEA), Departamento de Botânica, Universidade Federal de Pernambuco, Av. Prof. Moraes Rego, Cidade Universitária, 1235, 50670-901 Recife, Pernambuco, Brazil
- Joelson Moreno Brito de Moura
  - Instituto de Estudos do Xingu (IEX), Av. Norte Sul, Universidade Federal do Sul e Sudeste do Pará, Loteamento Cidade Nova, Lote N. 1, Qd 15, Setor 15, São Félix do Xingu, Brazil
- Rayane Karoline Silva dos Santos
  - Laboratório de Ecologia e Evolução de Sistemas Socioecológicos (LEA), Departamento de Botânica, Universidade Federal de Pernambuco, Av. Prof. Moraes Rego, Cidade Universitária, 1235, 50670-901 Recife, Pernambuco, Brazil
- Risoneide Henriques da Silva
  - Laboratório de Ecologia e Evolução de Sistemas Socioecológicos (LEA), Departamento de Botânica, Universidade Federal de Pernambuco, Av. Prof. Moraes Rego, Cidade Universitária, 1235, 50670-901 Recife, Pernambuco, Brazil
- Valdir Moura Brito-Júnior
  - Laboratório de Ecologia e Evolução de Sistemas Socioecológicos (LEA), Departamento de Botânica, Universidade Federal de Pernambuco, Av. Prof. Moraes Rego, Cidade Universitária, 1235, 50670-901 Recife, Pernambuco, Brazil
- Washington Soares Ferreira-Júnior
  - Laboratório de Investigações Bioculturais no Semiárido, Universidade de Pernambuco, Campus Petrolina, BR203, Km 2, S/N, 56328-903 Petrolina, Pernambuco, Brazil
3
Staples TL. Expansion and evolution of the R programming language. Royal Society Open Science 2023; 10:221550. [PMID: 37063989 PMCID: PMC10090872 DOI: 10.1098/rsos.221550] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2022] [Accepted: 03/23/2023] [Indexed: 06/19/2023]
Abstract
Languages change over time, driven by creation of new words and cultural pressure to optimize communication. Programming languages resemble written language but communicate primarily with computer hardware rather than a human audience. I tested whether there were detectable changes over time in use of R, a mature, open-source programming language used for scientific computing. Across 393 142 GitHub repositories published between 2014 and 2021, I extracted 143 409 288 R functions, programming 'verbs', pairing linguistic and ecological analyses to detect change to diversity and composition of functions used over time. I found the number of R functions in use increased and underwent substantial change, driven primarily by the popularity of the 'tidyverse' collection of community-written extensions. I provide evidence that users can change the nature of programming languages, with patterns that match known processes from natural languages and genetic evolution. In R, there appear to be selective pressures for increased analytic complexity and R functions in decline that are not yet extinct (extinction debts). R's evolution towards the tidyverse may also represent the start of a division into two distinct dialects, which may impact the readability and continuity of analytic and scientific inquiries codified in R, as well as the language's future.
Affiliation(s)
- Timothy L. Staples
  - School of Biological Sciences, The University of Queensland, Building 60, St Lucia, Queensland 4072, Australia
4
Workman TE, Goulet JL, Brandt CA, Lindemann L, Skanderson M, Warren AR, Eleazer JR, Kronk C, Gordon KS, Pratt-Chapman M, Zeng-Treitler Q. Temporal and Geographic Patterns of Documentation of Sexual Orientation and Gender Identity Keywords in Clinical Notes. Med Care 2023; 61:130-136. [PMID: 36511399 PMCID: PMC9931630 DOI: 10.1097/mlr.0000000000001803] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 12/15/2022]
Abstract
OBJECTIVE: Disclosure of sexual orientation and gender identity correlates with better outcomes, yet data may not be available in structured fields in electronic health record data. To gain greater insight into the care of sexual and gender-diverse patients in the Veterans Health Administration (VHA), we examined the documentation patterns of sexual orientation and gender identity through extraction and analyses of data contained in unstructured electronic health record clinical notes.
METHODS: Salient terms were identified through authoritative vocabularies, the research team's expertise, and the frequency and consistency of their use in VHA clinical notes. Term frequencies were extracted from VHA clinical notes recorded from 2000 to 2018. Temporal analyses assessed usage changes in normalized frequencies as compared with nonclinical use, relative growth rates, and geographic variations.
RESULTS: Over time most terms increased in use, similar to Google ngram data, especially after the repeal of the "Don't Ask Don't Tell" military policy in 2010. For most terms, the usage adoption consistency also increased by the study's end. Aggregated use of all terms increased throughout the United States.
CONCLUSION: Term usage trends may provide a view of evolving care in a temporal continuum of changing policy. These findings may be useful for policies and interventions geared toward sexual and gender-diverse individuals. Despite the lack of structured data, the documentation of sexual orientation and gender identity terms is increasing in clinical notes.
Affiliation(s)
- Terri Elizabeth Workman
  - Biomedical Informatics Center, The George Washington University, Washington, DC
  - Washington DC VA Medical Center, Washington, DC
- Joseph L. Goulet
  - Department of Emergency Medicine, Yale School of Medicine, New Haven, CT
  - VA Connecticut Healthcare System, West Haven, CT
- Cynthia A. Brandt
  - Department of Emergency Medicine, Yale School of Medicine, New Haven, CT
  - VA Connecticut Healthcare System, West Haven, CT
- Luke Lindemann
  - VA Connecticut Healthcare System, West Haven, CT
  - Department of Psychology, Yale University, New Haven, CT
- Jacob R. Eleazer
  - VA Connecticut Healthcare System PRIME Center, West Haven, CT
  - Department of Psychiatry, Yale School of Medicine, New Haven, CT
- Kirsha S. Gordon
  - VA Connecticut Healthcare System, West Haven, CT
  - Yale School of Medicine, New Haven, CT
- Qing Zeng-Treitler
  - Biomedical Informatics Center, The George Washington University, Washington, DC
  - Washington DC VA Medical Center, Washington, DC
5
Decolonizing the Ourang-Outang. Int J Primatol 2022. [DOI: 10.1007/s10764-022-00345-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/14/2022]
6
Holdaway C, Piantadosi ST. Stochastic Time-Series Analyses Highlight the Day-To-Day Dynamics of Lexical Frequencies. Cogn Sci 2022; 46:e13215. [PMID: 36515373 DOI: 10.1111/cogs.13215] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2021] [Revised: 08/25/2022] [Accepted: 10/09/2022] [Indexed: 12/15/2022]
Abstract
Standard models in quantitative linguistics assume that word usage follows a fixed frequency distribution, often Zipf's law or a close relative. This view, however, does not capture the near daily variations in topics of conversation, nor the short-term dynamics of language change. In order to understand the dynamics of human language use, we present a corpus of daily word frequency variation scraped from online news sources every 20 min for more than 2 years. We construct a simple time-varying model with a latent state, which is observed via word frequency counts. We use Bayesian techniques to infer the parameters of this model for 20,000 words, allowing us to convert complex word-frequency trajectories into low-dimensional parameters in word usage. By analyzing the inferred parameters of this model, we quantify the relative mobility and drift of words on a day-to-day basis, while accounting for sampling error. We quantify this variation and show evidence against "rich-get-richer" models of word use, which have been previously hypothesized to explain statistical patterns in language.
7
Abstract
Increasing evidence demonstrates that in many places language coexistence has become ubiquitous and essential for supporting linguistic and cultural diversity, along with the financial and economic benefits associated with it. The competitive evolution among multiple languages determines the outcome: coexistence, decline, or extinction. Here, we extend the Abrams-Strogatz model of language competition to multiple languages and then validate it by analyzing the behavioral transitions of language usage over recent decades in Singapore and Hong Kong. In each case, we estimate from data the model parameters that measure each language's utility for its speakers and the strength of two biases: the majority preference for their language, and the minority aversion to it. The values of these two biases decide which language is the fastest growing in the competition and what the stable state of the system would be. We also study the system's convergence time to stable states and discover the existence of tipping points with multiple attractors. Moreover, the critical slowdown of convergence to the stable fractions of language users appears near and peaks at the tipping points, signaling when the system approaches them. Our analysis furthers our understanding of the evolution of various languages and the role of tipping points in behavioral transitions. These insights may help to protect languages from extinction and retain linguistic and cultural diversity.
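As a reading aid for the abstract above, the following is a minimal sketch of the original two-language Abrams-Strogatz dynamics that the paper takes as its starting point. It does not reproduce the multi-language extension or the majority/minority bias parameters described in the abstract. The function names, the exponent default `a=1.31`, the rate constant `c`, and the Euler step size are illustrative assumptions.

```python
# Two-language Abrams-Strogatz baseline (NOT the paper's multi-language model).
# x: fraction of the population speaking language X (the rest speak Y).
# s: relative status of language X, between 0 and 1.
# a, c: shape exponent and rate constant; the defaults are illustrative assumptions.

def abrams_strogatz_rate(x, s, a=1.31, c=1.0):
    """Return dx/dt = (1 - x) * c * x**a * s - x * c * (1 - x)**a * (1 - s)."""
    gain = (1.0 - x) * c * (x ** a) * s          # Y speakers switching to X
    loss = x * c * ((1.0 - x) ** a) * (1.0 - s)  # X speakers switching to Y
    return gain - loss


def simulate(x0, s, a=1.31, c=1.0, dt=0.01, steps=20000):
    """Integrate the rate equation with a plain forward Euler scheme."""
    x = x0
    for _ in range(steps):
        x += dt * abrams_strogatz_rate(x, s, a, c)
        x = min(1.0, max(0.0, x))  # keep the fraction inside [0, 1]
    return x


if __name__ == "__main__":
    # The higher-status language takes over when both start at equal shares.
    print(simulate(x0=0.5, s=0.6))
    print(simulate(x0=0.5, s=0.4))
```

In this baseline with `a > 1`, coexistence is unstable and one language eventually dominates. Stable coexistence and tipping points of the kind the abstract reports would require the additional terms of the extended model.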
8
Watanabe H. Empirical observations of ultraslow diffusion driven by the fractional dynamics in languages. Phys Rev E 2018; 98:012308. [PMID: 30110851 DOI: 10.1103/physreve.98.012308] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 02/27/2018] [Indexed: 06/08/2023]
Abstract
Ultraslow diffusion (i.e., logarithmic diffusion) has been extensively studied theoretically but has hardly been observed empirically. In this paper, first, we find the ultraslow-like diffusion of the time series of word counts of already popular words by analyzing three different nationwide language databases: (i) newspaper articles (Japanese), (ii) blog articles (Japanese), and (iii) page views of Wikipedia (English, French, Chinese, and Japanese). Second, we use theoretical analysis to show that this diffusion is basically explained by the random walk model with the power-law forgetting with the exponent β≈0.5, which is related to the fractional Langevin equation. The exponent β characterizes the speed of forgetting and β≈0.5 corresponds to (i) the border (or thresholds) between the stationary and the nonstationary and (ii) the right-in-the-middle dynamics between the IID noise for β=1 and the normal random walk for β=0. Third, the generative model of the time series of word counts of already popular words, which is a kind of Poisson process with the Poisson parameter sampled by the above-mentioned random walk model, can almost reproduce not only the empirical mean-squared displacement but also the power spectrum density and the probability density function.
Affiliation(s)
- Hayafumi Watanabe
  - Risk Analysis Research Center, The Institute of Statistical Mathematics, 10-3 Midori-cho, Tachikawa, Tokyo 190-8562, Japan; Joint Support-Center for Data Science Research, The Research Organization of Information and Systems, 10-3 Midori-cho, Tachikawa, Tokyo 190-8562, Japan; and Hottolink, Inc., 6 Yonbancho Chiyoda-ku, Tokyo 102-0081, Japan
9
Menezes T, Roth C. Natural Scales in Geographical Patterns. Sci Rep 2017; 7:45823. [PMID: 28374825 PMCID: PMC5379183 DOI: 10.1038/srep45823] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 10/06/2016] [Accepted: 03/06/2017] [Indexed: 12/01/2022] Open
Abstract
Human mobility is known to be distributed across several orders of magnitude of physical distances, which makes it generally difficult to endogenously find or define typical and meaningful scales. Relevant analyses, from movements to geographical partitions, seem to be relative to some ad-hoc scale, or no scale at all. Relying on geotagged data collected from photo-sharing social media, we apply community detection to movement networks constrained by increasing percentiles of the distance distribution. Using a simple parameter-free discontinuity detection algorithm, we discover clear phase transitions in the community partition space. The detection of these phases constitutes the first objective method of characterising endogenous, natural scales of human movement. Our study covers nine regions, ranging from cities to countries of various sizes and a transnational area. For all regions, the number of natural scales is remarkably low (2 or 3). Further, our results hint at scale-related behaviours rather than scale-related users. The partitions of the natural scales allow us to draw discrete multi-scale geographical boundaries, potentially capable of providing key insights in fields such as epidemiology or cultural contagion where the introduction of spatial boundaries is pivotal.
Affiliation(s)
- Telmo Menezes
  - Centre Marc Bloch Berlin e.V., Friedrichstr. 191, 10117 Berlin, Germany
- Camille Roth
  - Centre Marc Bloch Berlin e.V., Friedrichstr. 191, 10117 Berlin, Germany
  - Sciences Po, médialab, 84 rue de Grenelle, 75007 Paris, France
  - Centre National de la Recherche Scientifique, France
10
Model of the Dynamic Construction Process of Texts and Scaling Laws of Words Organization in Language Systems. PLoS One 2016; 11:e0168971. [PMID: 28006026 PMCID: PMC5179102 DOI: 10.1371/journal.pone.0168971] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 09/20/2016] [Accepted: 12/11/2016] [Indexed: 11/19/2022] Open
Abstract
Scaling laws characterize diverse complex systems in a broad range of fields, including physics, biology, finance, and social science. The human language is another example of a complex system of words organization. Studies on written texts have shown that scaling laws characterize the occurrence frequency of words, words rank, and the growth of distinct words with increasing text length. However, these studies have mainly concentrated on the western linguistic systems, and the laws that govern the lexical organization, structure and dynamics of the Chinese language remain not well understood. Here we study a database of Chinese and English language books. We report that three distinct scaling laws characterize words organization in the Chinese language. We find that these scaling laws have different exponents and crossover behaviors compared to English texts, indicating different words organization and dynamics of words in the process of text growth. We propose a stochastic feedback model of words organization and text growth, which successfully accounts for the empirically observed scaling laws with their corresponding scaling exponents and characteristic crossover regimes. Further, by varying key model parameters, we reproduce differences in the organization and scaling laws of words between the Chinese and English language. We also identify functional relationships between model parameters and the empirically observed scaling exponents, thus providing new insights into the words organization and growth dynamics in the Chinese and English language.
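The quantities behind the scaling laws named in the abstract above (word frequency against rank, and the number of distinct words against text length) can be measured from a token list with a few lines of code. The sketch below only computes these empirical curves. It does not implement the stochastic feedback model proposed in the paper, and the function names are hypothetical.

```python
from collections import Counter


def vocabulary_growth(tokens):
    """Number of distinct words seen after each token (a Heaps-style growth curve)."""
    seen = set()
    growth = []
    for word in tokens:
        seen.add(word)
        growth.append(len(seen))
    return growth


def rank_frequency(tokens):
    """(rank, count) pairs with rank 1 assigned to the most frequent word."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(counts, start=1))


if __name__ == "__main__":
    tokens = "the cat saw the dog and the cat".split()
    print(vocabulary_growth(tokens))
    print(rank_frequency(tokens))
```

Fitting exponents or locating crossover regimes would then be done on log-log versions of these curves. That step is left out here because the abstract does not specify the fitting procedure.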
11
Tanaka-Ishii K, Bunde A. Long-Range Memory in Literary Texts: On the Universal Clustering of the Rare Words. PLoS One 2016; 11:e0164658. [PMID: 27893737 PMCID: PMC5125566 DOI: 10.1371/journal.pone.0164658] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 07/06/2016] [Accepted: 09/28/2016] [Indexed: 11/25/2022] Open
Abstract
A fundamental problem in linguistics is how literary texts can be quantified mathematically. It is well known that the frequency of a (rare) word in a text is roughly inversely proportional to its rank (Zipf’s law). Here we address the complementary question of whether the rhythm of the text, characterized by the arrangement of the rare words in the text, can also be quantified mathematically in a similarly basic way. To this end, we consider representative classic single-authored texts from England/Ireland, France, Germany, China, and Japan. In each text, we classify each word by its rank. We focus on the rare words with ranks above some threshold $Q$ and study the lengths of the (return) intervals between them. We find that for all texts considered, the probability $S_Q(r)$ that the length of an interval exceeds $r$ follows a perfect Weibull function, $S_Q(r) = \exp(-b(\beta)\, r^{\beta})$, with $\beta$ around 0.7. The return intervals themselves are arranged in a long-range correlated self-similar fashion, where the autocorrelation function $C_Q(s)$ of the intervals follows a power law, $C_Q(s) \sim s^{-\gamma}$, with an exponent $\gamma$ between 0.14 and 0.48. We show that these features lead to a pronounced clustering of the rare words in the text.
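The return-interval construction described in the abstract above can be sketched as follows: rank words by frequency, treat words with rank above a threshold as rare, and measure the gaps between successive rare-word positions. This is a minimal illustration under stated assumptions, not the authors' code. The function names are hypothetical, ties in frequency are broken by first appearance, and the Weibull parameters `b` and `beta` are supplied by the caller rather than fitted.

```python
import math
from collections import Counter


def return_intervals(tokens, q):
    """Gaps (in tokens) between successive occurrences of words with rank > q."""
    ranked = [word for word, _ in Counter(tokens).most_common()]  # rank 1 first
    rare = set(ranked[q:])
    positions = [i for i, word in enumerate(tokens) if word in rare]
    return [later - earlier for earlier, later in zip(positions, positions[1:])]


def empirical_survival(intervals, r):
    """Fraction of intervals strictly longer than r (requires a non-empty list)."""
    return sum(1 for length in intervals if length > r) / len(intervals)


def weibull_survival(r, b, beta):
    """Stretched-exponential form exp(-b * r**beta) quoted in the abstract."""
    return math.exp(-b * r ** beta)


if __name__ == "__main__":
    tokens = "a b a c a b a d a b c".split()
    gaps = return_intervals(tokens, q=2)
    print(gaps, empirical_survival(gaps, 3))
```

Comparing `empirical_survival` with `weibull_survival` over a range of `r` is one way to inspect the fit. Estimating the autocorrelation exponent would need a much longer text than this toy input.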
Affiliation(s)
- Kumiko Tanaka-Ishii
  - The University of Tokyo, Research Center for Advanced Science and Technology, Tokyo, 153-8904, Japan
- Armin Bunde
  - Universität Giessen, Institut für Theoretische Physik, Giessen, 35392, Germany
12
A triple helix model of medical innovation: Supply, demand, and technological capabilities in terms of Medical Subject Headings. Research Policy 2016. [DOI: 10.1016/j.respol.2015.12.004] [Citation(s) in RCA: 52] [Impact Index Per Article: 6.5] [Indexed: 11/20/2022]
13
Letchford A, Preis T, Moat HS. Quantifying the Search Behaviour of Different Demographics Using Google Correlate. PLoS One 2016; 11:e0149025. [PMID: 26910464 PMCID: PMC4766235 DOI: 10.1371/journal.pone.0149025] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 11/04/2015] [Accepted: 01/26/2016] [Indexed: 11/18/2022] Open
Abstract
Vast records of our everyday interests and concerns are being generated by our frequent interactions with the Internet. Here, we investigate how the searches of Google users vary across U.S. states with different birth rates and infant mortality rates. We find that users in states with higher birth rates search for more information about pregnancy, while those in states with lower birth rates search for more information about cats. Similarly, we find that users in states with higher infant mortality rates search for more information about credit, loans and diseases. Our results provide evidence that Internet search data could offer new insight into the concerns of different demographics.
Affiliation(s)
- Adrian Letchford
  - Data Science Lab, Behavioural Science, Warwick Business School, University of Warwick, CV4 7AL, Coventry, United Kingdom
- Tobias Preis
  - Data Science Lab, Behavioural Science, Warwick Business School, University of Warwick, CV4 7AL, Coventry, United Kingdom
- Helen Susannah Moat
  - Data Science Lab, Behavioural Science, Warwick Business School, University of Warwick, CV4 7AL, Coventry, United Kingdom
14
15
Culturomics as a data playground for tests of selection: Mathematical approaches to detecting selection in word use. J Theor Biol 2016; 405:140-9. [PMID: 26802483 DOI: 10.1016/j.jtbi.2015.12.012] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 07/07/2015] [Revised: 12/01/2015] [Accepted: 12/28/2015] [Indexed: 11/23/2022]
Abstract
In biological evolution traits may rise and fall in frequency due to genetic drift, where variant frequencies change by chance, or by selection where advantageous variants will rise in frequency. The neutral model of evolution, first developed by Kimura in the 1960s, has become the standard against which selection is detected. While the balance between these two important forces - drift and selection - has been well established in biology there are other domains where the contribution of these processes is still coming together. Although the idea of natural selection has been applied to the cultural domain since the time of Darwin, it has proven more challenging to positively identify cultural traits under selection both because of a lack of established tests for selection and a lack of large cultural data sets. However, in recent years with the accumulation of large cultural data sets many cultural features from pre-history pottery to modern baby names have been shown to evolve according to the neutral theory. But there is accumulating empirical evidence from cultural processes suggesting that the neutral theory alone cannot account for all features of the data. As such, there has been a renewed interest in determining whether there is selection amidst drift. Here we analyze a subset of English word frequencies, and determine whether frequency change reveals processes of selection. Inspired by the Moran and Wright-Fisher models in population genetics, we developed a neutral model of word frequency variation to assess when linguistic data appears to depart from neutral evolution. As such, our model represents a possible "test for selection" in the linguistic domain. We explore how the distribution of word use has changed for sets of words in English for more than 100 years (1901-2008) as expressed in vocabulary usage in published books, made available by Google Ngram. When comparing empirical word frequency changes to our neutral model we find pervasive and systematic departures from neutrality.
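To make the idea of a neutral baseline concrete, here is a minimal sketch of generic Wright-Fisher resampling applied to word counts: each generation, the same number of tokens is redrawn with probabilities equal to the current relative frequencies, so frequencies change by drift alone. This is the textbook scheme, not the specific neutral model developed in the paper. The function names, the fixed population size, and the seed are assumptions.

```python
import random
from collections import Counter


def wright_fisher_step(counts, rng):
    """Resample sum(counts) tokens in proportion to current word frequencies."""
    words = list(counts)
    weights = [counts[word] for word in words]
    total = sum(weights)
    sample = rng.choices(words, weights=weights, k=total)
    return Counter(sample)


def simulate_neutral(counts, generations, seed=0):
    """Return the list of word-count snapshots, starting with the initial counts."""
    rng = random.Random(seed)
    history = [Counter(counts)]
    for _ in range(generations):
        history.append(wright_fisher_step(history[-1], rng))
    return history


if __name__ == "__main__":
    start = {"the": 50, "of": 30, "whilst": 20}
    for snapshot in simulate_neutral(start, generations=5):
        print(dict(snapshot))
```

Repeating such simulations many times gives a distribution of frequency changes expected under drift, against which observed changes can be compared. A word that drops to zero never returns in this sketch, since no innovation term is included.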
16
Chatterjee A, Ghosh A, Chakrabarti BK. Universality of Citation Distributions for Academic Institutions and Journals. PLoS One 2016; 11:e0146762. [PMID: 26751563 PMCID: PMC4709109 DOI: 10.1371/journal.pone.0146762] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Received: 10/28/2014] [Accepted: 12/22/2015] [Indexed: 11/19/2022] Open
Abstract
Citations measure the importance of a publication, and may serve as a proxy for its popularity and the quality of its contents. Here we study the distributions of citations to publications from individual academic institutions for a single year. The average number of citations varies widely between institutions across the world, but the probability distributions of citations for individual institutions can be rescaled to a common form by scaling the citations by the average number of citations for that institution. We find this feature seems to be universal for a broad selection of institutions irrespective of the average number of citations per article. A similar analysis for citations to publications in a particular journal in a single year reveals similar results. We find high absolute inequality for both these sets, Gini coefficients being around 0.66 and 0.58 for institutions and journals respectively. We also find that the top 25% of the articles hold about 75% of the total citations for institutions and the top 29% of the articles hold about 71% of the total citations for journals.
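The three quantities named in the abstract above (citations rescaled by the institutional mean, the Gini coefficient, and the share of citations held by the top fraction of articles) are simple to compute from a list of per-article citation counts. The sketch below shows one standard way to do so. The function names and the rounding rule for the top fraction are assumptions, and no data from the paper is reproduced.

```python
def rescale_by_mean(values):
    """Divide each citation count by the mean count of the set."""
    mean = sum(values) / len(values)
    return [value / mean for value in values]


def gini(values):
    """Gini coefficient from sorted values: 2*sum(i*x_i)/(n*sum(x)) - (n+1)/n."""
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    weighted = sum((i + 1) * x for i, x in enumerate(ordered))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


def top_share(values, fraction):
    """Share of the total held by the top `fraction` of items (at least one item)."""
    ordered = sorted(values, reverse=True)
    k = max(1, round(fraction * len(ordered)))
    return sum(ordered[:k]) / sum(ordered)


if __name__ == "__main__":
    citations = [0, 1, 1, 2, 3, 5, 8, 40]
    print(gini(citations), top_share(citations, 0.25))
```

Because the Gini coefficient is invariant under multiplication by a constant, it is unchanged by `rescale_by_mean`, which is why rescaled distributions from different institutions can be compared on inequality directly.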
Affiliation(s)
- Arnab Chatterjee
  - Condensed Matter Physics Division, Saha Institute of Nuclear Physics, 1/AF Bidhannagar, Kolkata 700064, India
- Asim Ghosh
  - Condensed Matter Physics Division, Saha Institute of Nuclear Physics, 1/AF Bidhannagar, Kolkata 700064, India
  - Department of Computer Science, Aalto University School of Science, P.O. Box 15400, FI-00076 AALTO, Finland
- Bikas K. Chakrabarti
  - Condensed Matter Physics Division, Saha Institute of Nuclear Physics, 1/AF Bidhannagar, Kolkata 700064, India
  - Economic Research Unit, Indian Statistical Institute, 203 B. T. Road, Kolkata 700108, India
17
Zambrano E, Hernando A, Fernández Bariviera A, Hernando R, Plastino A. Thermodynamics of firms' growth. J R Soc Interface 2015; 12:20150789. [PMID: 26510828 PMCID: PMC4685849 DOI: 10.1098/rsif.2015.0789] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 09/02/2015] [Accepted: 10/05/2015] [Indexed: 11/12/2022] Open
Abstract
The distribution of firms' growth and firms' sizes is a topic under intense scrutiny. In this paper, we show that a thermodynamic model based on the maximum entropy principle, with dynamical prior information, can be constructed that adequately describes the dynamics and distribution of firms' growth. Our theoretical framework is tested against a comprehensive database of Spanish firms, which covers, to a very large extent, Spain's economic activity, with a total of 1,155,142 firms evolving along a full decade. We show that the empirical exponent of Pareto's law, a rule often observed in the rank distribution of large-size firms, is explained by the capacity of economic system for creating/destroying firms, and that can be used to measure the health of a capitalist-based economy. Indeed, our model predicts that when the exponent is larger than 1, creation of firms is favoured; when it is smaller than 1, destruction of firms is favoured instead; and when it equals 1 (matching Zipf's law), the system is in a full macroeconomic equilibrium, entailing 'free' creation and/or destruction of firms. For medium and smaller firm sizes, the dynamical regime changes, the whole distribution can no longer be fitted to a single simple analytical form and numerical prediction is required. Our model constitutes the basis for a full predictive framework regarding the economic evolution of an ensemble of firms. Such a structure can be potentially used to develop simulations and test hypothetical scenarios, such as economic crisis or the response to specific policy measures.
Affiliation(s)
- Eduardo Zambrano
  - Social Thermodynamics Applied Research (SThAR), EPFL Innovation Park, Bâtiment C, 1015 Lausanne, Switzerland
- Alberto Hernando
  - Social Thermodynamics Applied Research (SThAR), EPFL Innovation Park, Bâtiment C, 1015 Lausanne, Switzerland
- Ricardo Hernando
  - Social Thermodynamics Applied Research (SThAR), EPFL Innovation Park, Bâtiment C, 1015 Lausanne, Switzerland
- Angelo Plastino
  - National University of La Plata, Physics Institute (IFLP-CCT-CONICET) C.C.737, 1900 La Plata, Argentina
18
|
Bochkarev V, Solovyev V, Wichmann S. Universals versus historical contingencies in lexical evolution. J R Soc Interface 2015; 11:20140841. [PMID: 25274040 DOI: 10.1098/rsif.2014.0841] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
The frequency with which we use different words changes all the time, and every so often, a new lexical item is invented or another one ceases to be used. Beyond a small sample of lexical items whose properties are well studied, little is known about the dynamics of lexical evolution. How do the lexical inventories of languages, viewed as entire systems, evolve? Is the rate of evolution of the lexicon contingent upon historical factors or is it driven by regularities, perhaps to do with universals of cognition and social interaction? We address these questions using the Google Books N-Gram Corpus as a source of data and relative entropy as a measure of changes in the frequency distributions of words. It turns out that there are both universals and historical contingencies at work. Across several languages, we observe similar rates of change, but only at timescales of at least around five decades. At shorter timescales, the rate of change is highly variable and differs between languages. Major societal transformations as well as catastrophic events such as wars lead to increased change in frequency distributions, whereas stability in society has a dampening effect on lexical evolution.
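The relative entropy mentioned in this abstract can be illustrated with a minimal sketch. It assumes the measure is the Kullback-Leibler divergence between two word-frequency distributions; the function name `relative_entropy`, the additive smoothing constant `eps`, and the toy counts are illustrative assumptions, not details taken from the paper.

```python
import math


def relative_entropy(p_counts, q_counts, eps=1e-9):
    """Kullback-Leibler divergence D(P || Q) in bits between two word-count tables.

    eps is a small additive smoothing term (an assumption of this sketch) so that
    words missing from one period do not produce log(0) or division by zero.
    """
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + eps * len(vocab)
    q_total = sum(q_counts.values()) + eps * len(vocab)
    divergence = 0.0
    for word in vocab:
        p = (p_counts.get(word, 0) + eps) / p_total
        q = (q_counts.get(word, 0) + eps) / q_total
        divergence += p * math.log2(p / q)
    return divergence


# Hypothetical word counts for two periods.
earlier = {"thee": 40, "you": 60, "carriage": 20}
later = {"thee": 2, "you": 98, "car": 30}

print(relative_entropy(earlier, later))    # larger when the distributions differ more
print(relative_entropy(earlier, earlier))  # zero for identical distributions
```

The divergence is asymmetric, so which period is treated as the reference distribution is a choice the analyst has to make.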
Affiliation(s)
- V Bochkarev
  - Kazan Federal University, Kremlevskaya Street 18, 420000 Kazan, Russia
- V Solovyev
  - Kazan Federal University, Kremlevskaya Street 18, 420000 Kazan, Russia
- S Wichmann
  - Kazan Federal University, Kremlevskaya Street 18, 420000 Kazan, Russia; Max Planck Institute for Evolutionary Anthropology, Deutscher Platz 6, 04103 Leipzig, Germany
19
Pechenick EA, Danforth CM, Dodds PS. Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution. PLoS One 2015; 10:e0137041. [PMID: 26445406 PMCID: PMC4596490 DOI: 10.1371/journal.pone.0137041] [Citation(s) in RCA: 63] [Impact Index Per Article: 7.0] [Received: 01/06/2015] [Accepted: 07/02/2015] [Indexed: 12/03/2022] Open
Abstract
It is tempting to treat frequency trends from the Google Books data sets as indicators of the “true” popularity of various words and phrases. Doing so allows us to draw quantitatively strong conclusions about the evolution of cultural perception of a given topic, such as time or gender. However, the Google Books corpus suffers from a number of limitations which make it an obscure mask of cultural popularity. A primary issue is that the corpus is in effect a library, containing one of each book. A single, prolific author is thereby able to noticeably insert new phrases into the Google Books lexicon, whether the author is widely read or not. With this understood, the Google Books corpus remains an important data set to be considered more lexicon-like than text-like. Here, we show that a distinct problematic feature arises from the inclusion of scientific texts, which have become an increasingly substantive portion of the corpus throughout the 1900s. The result is a surge of phrases typical to academic articles but less common in general, such as references to time in the form of citations. We use information theoretic methods to highlight these dynamics by examining and comparing major contributions via a divergence measure of English data sets between decades in the period 1800–2000. We find that only the English Fiction data set from the second version of the corpus is not heavily affected by professional texts. Overall, our findings call into question the vast majority of existing claims drawn from the Google Books corpus, and point to the need to fully characterize the dynamics of the corpus before using these data sets to draw broad conclusions about cultural and linguistic evolution.
Affiliation(s)
- Eitan Adam Pechenick
  - Department of Mathematics and Statistics, University of Vermont, Burlington, Vermont, United States of America
  - Center for Complex Systems, University of Vermont, Burlington, Vermont, United States of America
  - Computational Story Lab, University of Vermont, Burlington, Vermont, United States of America
  - Vermont Advanced Computing Core, University of Vermont, Burlington, Vermont, United States of America
  - * E-mail: (EAP); (PSD)
- Christopher M. Danforth
  - Department of Mathematics and Statistics, University of Vermont, Burlington, Vermont, United States of America
  - Center for Complex Systems, University of Vermont, Burlington, Vermont, United States of America
  - Computational Story Lab, University of Vermont, Burlington, Vermont, United States of America
  - Vermont Advanced Computing Core, University of Vermont, Burlington, Vermont, United States of America
- Peter Sheridan Dodds
  - Department of Mathematics and Statistics, University of Vermont, Burlington, Vermont, United States of America
  - Center for Complex Systems, University of Vermont, Burlington, Vermont, United States of America
  - Computational Story Lab, University of Vermont, Burlington, Vermont, United States of America
  - Vermont Advanced Computing Core, University of Vermont, Burlington, Vermont, United States of America
  - * E-mail: (EAP); (PSD)
20
Chen H, Liang J, Liu H. How Does Word Length Evolve in Written Chinese? PLoS One 2015; 10:e0138567. [PMID: 26384237 PMCID: PMC4575206 DOI: 10.1371/journal.pone.0138567] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 07/03/2015] [Accepted: 08/31/2015] [Indexed: 11/19/2022] Open
Abstract
We present substantial evidence that word length can be an essential lexical structural feature for word evolution in written Chinese. The data used in this study are diachronic Chinese short narrative texts with a time span of over 2000 years. We show that the increase of word length is an essential regularity in word evolution. On the one hand, word frequency is found to depend on word length, and their relation is in line with the power-law function $y = ax^{-b}$. On the other hand, our deeper analyses show that the increase of word length results in the simplification of characters for balance in written Chinese. Moreover, the correspondence between written and spoken Chinese is discussed. We conclude that the disyllabic trend may account for the increase of word length, and its impacts can be explained in terms of "the principle of least effort".
Affiliation(s)
- Heng Chen
  - Center for the Study of Language and Cognition, Zhejiang University, Hangzhou, CN-310028, China
- Junying Liang
  - Department of Linguistics, Zhejiang University, Hangzhou, CN-310058, China
- Haitao Liu
  - Department of Linguistics, Zhejiang University, Hangzhou, CN-310058, China
  - Ningbo Institute of Technology, Zhejiang University, Ningbo, CN-315100, China
21
Letchford A, Moat HS, Preis T. The advantage of short paper titles. ROYAL SOCIETY OPEN SCIENCE 2015; 2:150266. [PMID: 26361556 PMCID: PMC4555861 DOI: 10.1098/rsos.150266] [Citation(s) in RCA: 64] [Impact Index Per Article: 7.1] [Received: 06/26/2015] [Accepted: 07/27/2015] [Indexed: 05/31/2023]
Abstract
Vast numbers of scientific articles are published each year, some of which attract considerable attention, and some of which go almost unnoticed. Here, we investigate whether any of this variance can be explained by a simple metric of one aspect of the paper's presentation: the length of its title. Our analysis provides evidence that journals which publish papers with shorter titles receive more citations per paper. These results are consistent with the intriguing hypothesis that papers with shorter titles may be easier to understand, and hence attract more citations.
22
Hernández DG, Zanette DH, Samengo I. Information-theoretical analysis of the statistical dependencies among three variables: Applications to written language. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2015; 92:022813. [PMID: 26382460 DOI: 10.1103/physreve.92.022813] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 04/30/2015] [Indexed: 06/05/2023]
Abstract
We develop the information-theoretical concepts required to study the statistical dependencies among three variables. Some of these dependencies are pure triple interactions, in the sense that they cannot be explained in terms of a combination of pairwise correlations. We derive bounds for triple dependencies, and characterize the shape of the joint probability distribution of three binary variables with high triple interaction. The analysis also allows us to quantify the amount of redundancy in the mutual information between pairs of variables, and to assess whether the information between two variables is or is not mediated by a third variable. These concepts are applied to the analysis of written texts. We find that the probability that a given word is found in a particular location within the text is modulated not only by the presence or absence of other nearby words, but also by the presence or absence of nearby pairs of words. We identify the words enclosing the key semantic concepts of the text, the triplets of words with high pairwise and triple interactions, and the words that mediate the pairwise interactions between other words.
Affiliation(s)
- Damián G Hernández
  - Centro Atómico Bariloche and Instituto Balseiro, (8400) San Carlos de Bariloche, Argentina
- Damián H Zanette
  - Centro Atómico Bariloche and Instituto Balseiro, (8400) San Carlos de Bariloche, Argentina
- Inés Samengo
  - Centro Atómico Bariloche and Instituto Balseiro, (8400) San Carlos de Bariloche, Argentina
23
Zipf's word frequency law in natural language: a critical review and future directions. Psychon Bull Rev 2015; 21:1112-30. [PMID: 24664880 DOI: 10.3758/s13423-014-0585-6] [Citation(s) in RCA: 150] [Impact Index Per Article: 16.7] [Indexed: 11/08/2022]
Abstract
The frequency distribution of words has been a key object of study in statistical linguistics for the past 70 years. This distribution approximately follows a simple mathematical form known as Zipf's law. This article first shows that human language has a highly complex, reliable structure in the frequency distribution over and above this classic law, although prior data visualization methods have obscured this fact. A number of empirical phenomena related to word frequencies are then reviewed. These facts are chosen to be informative about the mechanisms giving rise to Zipf's law and are then used to evaluate many of the theoretical explanations of Zipf's law in language. No prior account straightforwardly explains all the basic facts or is supported with independent evaluation of its underlying assumptions. To make progress at understanding why language obeys Zipf's law, studies must seek evidence beyond the law itself, testing assumptions and evaluating novel predictions with new, independent data.
24
Abstract
Taylor's law (TL) states that the variance V of a nonnegative random variable is a power function of its mean M; i.e., $V = aM^{b}$. TL has been verified extensively in ecology, where it applies to population abundance, physics, and other natural sciences. Its ubiquitous empirical verification suggests a context-independent mechanism. Sample exponents b measured empirically via the scaling of sample mean and variance typically cluster around the value b = 2. Some theoretical models of population growth, however, predict a broad range of values for the population exponent b pertaining to the mean and variance of population density, depending on details of the growth process. Is the widely reported sample exponent b ≃ 2 the result of ecological processes or could it be a statistical artifact? Here, we apply large deviations theory and finite-sample arguments to show exactly that in a broad class of growth models the sample exponent is b ≃ 2 regardless of the underlying population exponent. We derive a generalized TL in terms of sample and population exponents $b_{jk}$ for the scaling of the kth vs. the jth cumulants. The sample exponent $b_{jk}$ depends predictably on the number of samples and for finite samples we obtain $b_{jk} \simeq k/j$ asymptotically in time, a prediction that we verify in two empirical examples. Thus, the sample exponent b ≃ 2 may indeed be a statistical artifact and not dependent on population dynamics under conditions that we specify exactly. Given the broad class of models investigated, our results apply to many fields where TL is used although inadequately understood.
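The empirical procedure this abstract refers to, estimating the sample exponent b from the scaling of sample mean and variance, can be sketched as an ordinary least-squares fit of log variance against log mean. The function name `taylor_exponent` and the toy data are illustrative assumptions; the abstract does not specify the estimator the authors used.

```python
import math
import statistics


def taylor_exponent(samples):
    """Fit V = a * M**b by least squares on (log M, log V), one point per sample.

    Each sample needs a positive mean and a positive variance so that the
    logarithms are defined.
    """
    log_mean = [math.log(statistics.mean(s)) for s in samples]
    log_var = [math.log(statistics.variance(s)) for s in samples]
    n = len(samples)
    mean_x = sum(log_mean) / n
    mean_y = sum(log_var) / n
    sxx = sum((x - mean_x) ** 2 for x in log_mean)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_mean, log_var))
    b = sxy / sxx                       # slope of the log-log line
    a = math.exp(mean_y - b * mean_x)   # intercept, back on the original scale
    return a, b


# Toy data: one base sample rescaled by 1, 2 and 4. Rescaling by c multiplies
# the mean by c and the variance by c**2, so the fitted exponent is close to 2.
groups = [[c * x for x in (1.0, 2.0, 3.0)] for c in (1.0, 2.0, 4.0)]
a, b = taylor_exponent(groups)
print(a, b)
```

The toy data are constructed so that the variance scales exactly with the square of the mean, which makes the fit easy to check by hand; real abundance data would scatter around the fitted line.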
25
Botta F, Moat HS, Preis T. Quantifying crowd size with mobile phone and Twitter data. ROYAL SOCIETY OPEN SCIENCE 2015. [PMID: 26064667 DOI: 10.5061/dryad.1rk60] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/15/2023]
Abstract
Being able to infer the number of people in a specific area is of extreme importance for the avoidance of crowd disasters and to facilitate emergency evacuations. Here, using a football stadium and an airport as case studies, we present evidence of a strong relationship between the number of people in restricted areas and activity recorded by mobile phone providers and the online service Twitter. Our findings suggest that data generated through our interactions with mobile phone networks and the Internet may allow us to gain valuable measurements of the current state of society.
Affiliation(s)
- Federico Botta
  - Centre for Complexity Science, Warwick Business School, University of Warwick, Coventry CV4 7AL, UK
  - Data Science Lab, Behavioural Science, Warwick Business School, University of Warwick, Coventry CV4 7AL, UK
- Helen Susannah Moat
  - Data Science Lab, Behavioural Science, Warwick Business School, University of Warwick, Coventry CV4 7AL, UK
- Tobias Preis
  - Data Science Lab, Behavioural Science, Warwick Business School, University of Warwick, Coventry CV4 7AL, UK
26
Botta F, Moat HS, Preis T. Quantifying crowd size with mobile phone and Twitter data. ROYAL SOCIETY OPEN SCIENCE 2015; 2:150162. [PMID: 26064667 PMCID: PMC4453255 DOI: 10.1098/rsos.150162] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.3] [Received: 04/21/2015] [Accepted: 05/01/2015] [Indexed: 06/01/2023]
Abstract
Being able to infer the number of people in a specific area is of extreme importance for the avoidance of crowd disasters and to facilitate emergency evacuations. Here, using a football stadium and an airport as case studies, we present evidence of a strong relationship between the number of people in restricted areas and activity recorded by mobile phone providers and the online service Twitter. Our findings suggest that data generated through our interactions with mobile phone networks and the Internet may allow us to gain valuable measurements of the current state of society.
Affiliation(s)
- Federico Botta
  - Centre for Complexity Science, Warwick Business School, University of Warwick, Coventry CV4 7AL, UK
  - Data Science Lab, Behavioural Science, Warwick Business School, University of Warwick, Coventry CV4 7AL, UK
- Helen Susannah Moat
  - Data Science Lab, Behavioural Science, Warwick Business School, University of Warwick, Coventry CV4 7AL, UK
- Tobias Preis
  - Data Science Lab, Behavioural Science, Warwick Business School, University of Warwick, Coventry CV4 7AL, UK
27
Cocho G, Flores J, Gershenson C, Pineda C, Sánchez S. Rank diversity of languages: generic behavior in computational linguistics. PLoS One 2015; 10:e0121898. [PMID: 25849150 PMCID: PMC4388647 DOI: 10.1371/journal.pone.0121898] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Received: 10/14/2014] [Accepted: 02/05/2015] [Indexed: 11/19/2022] Open
Abstract
Statistical studies of languages have focused on the rank-frequency distribution of words. Instead, we introduce here a measure of how word ranks change in time and call this distribution rank diversity. We calculate this diversity for books published in six European languages since 1800, and find that it follows a universal lognormal distribution. Based on the mean and standard deviation associated with the lognormal distribution, we define three different word regimes of languages: "heads" consist of words which almost do not change their rank in time, "bodies" are words of general use, while "tails" are comprised by context-specific words and vary their rank considerably in time. The heads and bodies reflect the size of language cores identified by linguists for basic communication. We propose a Gaussian random walk model which reproduces the rank variation of words in time and thus the diversity. Rank diversity of words can be understood as the result of random variations in rank, where the size of the variation depends on the rank itself. We find that the core size is similar for all languages studied.
Affiliation(s)
- Germinal Cocho
  - Instituto de Física, Universidad Nacional Autónoma de México, Mexico City, Mexico
  - Centro de Ciencias de la Complejidad, Universidad Nacional Autónoma de México, Mexico City, Mexico
- Jorge Flores
  - Instituto de Física, Universidad Nacional Autónoma de México, Mexico City, Mexico
- Carlos Gershenson
  - Centro de Ciencias de la Complejidad, Universidad Nacional Autónoma de México, Mexico City, Mexico
  - Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Universidad Nacional Autónoma de México, Mexico City, Mexico
- Carlos Pineda
  - Instituto de Física, Universidad Nacional Autónoma de México, Mexico City, Mexico
- Sergio Sánchez
  - Facultad de Ciencias, Universidad Nacional Autónoma de México, Mexico City, Mexico
28
Fushing H, Chen C, Hsieh YC, Farrell P. Lewis Carroll's Doublets net of English words: network heterogeneity in a complex system. PLoS One 2014; 9:e114177. [PMID: 25517974 PMCID: PMC4269387 DOI: 10.1371/journal.pone.0114177] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 05/22/2014] [Accepted: 10/30/2014] [Indexed: 11/19/2022] Open
Abstract
Lewis Carroll's English word game Doublets is represented as a system of networks with each node being an English word and each connectivity edge confirming that its two ending words are equal in letter length, but different by exactly one letter. We show that this system, which we call the Doublets net, constitutes a complex body of linguistic knowledge concerning English word structure that has computable multiscale features. Distributed morphological, phonological and orthographic constraints and the language's local redundancy are seen at the node level. Phonological communities are seen at the network level. And a balancing act between the language's global efficiency and redundancy is seen at the system level. We develop a new measure of intrinsic node-to-node distance and a computational algorithm, called community geometry, which reveal the implicit multiscale structure within binary networks. Because the Doublets net is a modular complex cognitive system, the community geometry and computable multi-scale structural information may provide a foundation for understanding computational learning in many systems whose network structure has yet to be fully analyzed.
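The edge rule described in this abstract (two words are linked when they have the same length and differ in exactly one letter) can be sketched as follows. The function names `differ_by_one` and `doublets_net` and the five-word toy vocabulary are illustrative assumptions; the sketch covers only the network construction, not the authors' community geometry algorithm.

```python
from itertools import combinations


def differ_by_one(a, b):
    """True when a and b have equal length and differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def doublets_net(words):
    """Build an undirected adjacency map: word -> set of neighbouring words."""
    vocab = sorted(set(words))
    net = {word: set() for word in vocab}
    for a, b in combinations(vocab, 2):
        if differ_by_one(a, b):
            net[a].add(b)
            net[b].add(a)
    return net


net = doublets_net(["cat", "cot", "cog", "dog", "door"])
for word, neighbours in net.items():
    print(word, sorted(neighbours))
```

Comparing every pair of words is quadratic in vocabulary size, which is fine for a toy list; a full dictionary would more plausibly be bucketed by word length and by wildcard patterns such as `c_t` first.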
Affiliation(s)
- Hsieh Fushing
  - Department of Statistics, University of California Davis, Davis, California, United States of America
- Chen Chen
  - Department of Statistics, University of California Davis, Davis, California, United States of America
- Patrick Farrell
  - Department of Linguistics, University of California Davis, Davis, California, United States of America
29
Internal and external dynamics in language: evidence from verb regularity in a historical corpus of English. PLoS One 2014; 9:e102882. [PMID: 25084006 PMCID: PMC4118841 DOI: 10.1371/journal.pone.0102882] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Received: 05/13/2014] [Accepted: 06/24/2014] [Indexed: 11/19/2022] Open
Abstract
Human languages are rule governed, but almost invariably these rules have exceptions in the form of irregularities. Since rules in language are efficient and productive, the persistence of irregularity is an anomaly. How does irregularity linger in the face of internal (endogenous) and external (exogenous) pressures to conform to a rule? Here we address this problem by taking a detailed look at simple past tense verbs in the Corpus of Historical American English. The data show that the language is open, with many new verbs entering. At the same time, existing verbs might tend to regularize or irregularize as a consequence of internal dynamics, but overall, the amount of irregularity sustained by the language stays roughly constant over time. Despite continuous vocabulary growth, and presumably, an attendant increase in expressive power, there is no corresponding growth in irregularity. We analyze the set of irregulars, showing they may adhere to a set of minority rules, allowing for increased stability of irregularity over time. These findings contribute to the debate on how language systems become rule governed, and how and why they sustain exceptions to rules, providing insight into the interplay between the emergence and maintenance of rules and exceptions in language.
30
Antonio Cordón-García J, Linder D, Gómez-Díaz R, Alonso-Arévalo J. E-Book publishing in Spain. ELECTRONIC LIBRARY 2014. [DOI: 10.1108/el-12-2012-0155] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/17/2022]
Abstract
Purpose
– The aim of the present paper is to show how electronic publishing has transformed the business model of publishing houses in Spain in such a way that two models currently coexist. The specificities of each of these models were studied and the consequences of each model for the future of electronic publishing in Spain were analysed.
Design/methodology/approach
– The first stage of this study consisted of locating studies that would allow the authors to obtain useful indicators and statistical data regarding publication in Spain. The second stage of this study consisted of extracting from the sources cited above all data relevant to the study. To wit, these were the number of electronic books published, the major publishing houses offering electronic publications, the major platforms currently selling electronic books, presently available electronic reading devices, the rates of reading on all devices, reading rates itemized by age and educational background and general tendencies in digital publishing and e-reading.
Findings
– There are traditional publishers of mostly paper-based volumes, whose business models are based on having large catalogues of titles and large print-runs, though print-runs are increasingly smaller and bookseller returns increasingly larger. Intermediary agents operating under this model, for instance booksellers, are subject to ever-greater economic pressures, especially in the current crisis.
Originality/value
– In the study that follows, the authors attempt to analyse the characteristics behind these changes and learn to what extent these changes will affect the future models of publication and reading in Spain.
31
Zamparo M, Baldovin F, Caraglio M, Stella AL. Scaling symmetry, renormalization, and time series modeling: the case of financial assets dynamics. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2013; 88:062808. [PMID: 24483512 DOI: 10.1103/physreve.88.062808] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 05/13/2013] [Indexed: 06/03/2023]
Abstract
We present and discuss a stochastic model of financial assets dynamics based on the idea of an inverse renormalization group strategy. With this strategy we construct the multivariate distributions of elementary returns based on the scaling with time of the probability density of their aggregates. In its simplest version the model is the product of an endogenous autoregressive component and a random rescaling factor designed to embody also exogenous influences. Mathematical properties like increments' stationarity and ergodicity can be proven. Thanks to the relatively low number of parameters, model calibration can be conveniently based on a method of moments, as exemplified in the case of historical data of the S&P500 index. The calibrated model accounts very well for many stylized facts, like volatility clustering, power-law decay of the volatility autocorrelation function, and multiscaling with time of the aggregated return distribution. In agreement with empirical evidence in finance, the dynamics is not invariant under time reversal, and, with suitable generalizations, skewness of the return distribution and leverage effects can be included. The analytical tractability of the model opens interesting perspectives for applications, for instance, in terms of obtaining closed formulas for derivative pricing. Further important features are the possibility of making contact, in certain limits, with autoregressive models widely used in finance and the possibility of partially resolving the long- and short-memory components of the volatility, with consistent results when applied to historical series.
Affiliation(s)
- Fulvio Baldovin
  - Dipartimento di Fisica e Astronomia, Sezione INFN, Università di Padova, Via Marzolo 8, I-35131 Padova, Italy
- Michele Caraglio
  - Dipartimento di Fisica e Astronomia, Sezione INFN, Università di Padova, Via Marzolo 8, I-35131 Padova, Italy
- Attilio L Stella
  - Dipartimento di Fisica e Astronomia, Sezione INFN, Università di Padova, Via Marzolo 8, I-35131 Padova, Italy
32
Quantitative analysis of the evolution of novelty in cinema through crowdsourced keywords. Sci Rep 2013; 3:2758. [PMID: 24067890 PMCID: PMC3783974 DOI: 10.1038/srep02758] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Received: 04/03/2013] [Accepted: 08/28/2013] [Indexed: 11/23/2022] Open
Abstract
The generation of novelty is central to any creative endeavor. Novelty generation and the relationship between novelty and individual hedonic value have long been subjects of study in social psychology. However, few studies have utilized large-scale datasets to quantitatively investigate these issues. Here we consider the domain of American cinema and explore these questions using a database of films spanning a 70 year period. We use crowdsourced keywords from the Internet Movie Database as a window into the contents of films, and prescribe novelty scores for each film based on occurrence probabilities of individual keywords and keyword-pairs. These scores provide revealing insights into the dynamics of novelty in cinema. We investigate how novelty influences the revenue generated by a film, and find a relationship that resembles the Wundt-Berlyne curve. We also study the statistics of keyword occurrence and the aggregate distribution of keywords over a 100 year period.
33
Efficient learning strategy of Chinese characters based on network approach. PLoS One 2013; 8:e69745. [PMID: 23990887 PMCID: PMC3749196 DOI: 10.1371/journal.pone.0069745] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 05/09/2013] [Accepted: 06/12/2013] [Indexed: 11/19/2022] Open
Abstract
We develop an efficient learning strategy of Chinese characters based on the network of the hierarchical structural relations between Chinese characters. A more efficient strategy is that of learning the same number of useful Chinese characters in less effort or time. We construct a node-weighted network of Chinese characters, where character usage frequencies are used as node weights. Using this hierarchical node-weighted network, we propose a new learning method, the distributed node weight (DNW) strategy, which is based on a new measure of nodes' importance that considers both the weight of the nodes and its location in the network hierarchical structure. Chinese character learning strategies, particularly their learning order, are analyzed as dynamical processes over the network. We compare the efficiency of three theoretical learning methods and two commonly used methods from mainstream Chinese textbooks, one for Chinese elementary school students and the other for students learning Chinese as a second language. We find that the DNW method significantly outperforms the others, implying that the efficiency of current learning methods of major textbooks can be greatly improved.
34
Alves LGA, Ribeiro HV, Lenzi EK, Mendes RS. Distance to the scaling law: a useful approach for unveiling relationships between crime and urban metrics. PLoS One 2013; 8:e69580. [PMID: 23940525 PMCID: PMC3734155 DOI: 10.1371/journal.pone.0069580] [Citation(s) in RCA: 59] [Impact Index Per Article: 5.4] [Received: 05/23/2013] [Accepted: 06/13/2013] [Indexed: 11/24/2022] Open
Abstract
We report on a quantitative analysis of relationships between the number of homicides, population size and ten other urban metrics. By using data from Brazilian cities, we show that well-defined average scaling laws with the population size emerge when investigating the relations between population and number of homicides as well as population and urban metrics. We also show that the fluctuations around the scaling laws are log-normally distributed, which enabled us to model these scaling laws by a stochastic-like equation driven by a multiplicative and log-normally distributed noise. Because of the scaling laws, we argue that it is better to employ logarithms in order to describe the number of homicides as a function of the urban metrics via regression analysis. In addition to the regression analysis, we propose an approach to correlate crime and urban metrics via the evaluation of the distance between the actual value of the number of homicides (as well as the value of the urban metrics) and the value that is expected by the scaling law with the population size. This approach has proved to be robust and useful for unveiling relationships/behaviors that were not properly carried out by the regression analysis, such as (i) the non-explanatory potential of the elderly population when the number of homicides is much above or much below the scaling law, (ii) the fact that unemployment has explanatory potential only when the number of homicides is considerably larger than that expected by the power law, and (iii) a gender difference in number of homicides, where cities with female population below the scaling law are characterized by a number of homicides above the power law.
Affiliation(s)
- Luiz G. A. Alves
- Departamento de Física and National Institute of Science and Technology for Complex Systems, Universidade Estadual de Maringá, Maringá, Brazil
- Haroldo V. Ribeiro
- Departamento de Física and National Institute of Science and Technology for Complex Systems, Universidade Estadual de Maringá, Maringá, Brazil
- Ervin K. Lenzi
- Departamento de Física and National Institute of Science and Technology for Complex Systems, Universidade Estadual de Maringá, Maringá, Brazil
- Renio S. Mendes
- Departamento de Física and National Institute of Science and Technology for Complex Systems, Universidade Estadual de Maringá, Maringá, Brazil
|
35
|
Probing the statistical properties of unknown texts: application to the Voynich Manuscript. PLoS One 2013; 8:e67310. [PMID: 23844002 PMCID: PMC3699599 DOI: 10.1371/journal.pone.0067310] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.5] [Received: 03/07/2013] [Accepted: 05/17/2013] [Indexed: 11/19/2022] Open
Abstract
While the use of statistical physics methods to analyze large corpora has been useful to unveil many patterns in texts, no comprehensive investigation has been performed on the interdependence between syntactic and semantic factors. In this study we propose a framework for determining whether a text (e.g., written in an unknown alphabet) is compatible with a natural language and to which language it could belong. The approach is based on three types of statistical measurements, i.e. obtained from first-order statistics of word properties in a text, from the topology of complex networks representing texts, and from intermittency concepts where text is treated as a time series. Comparative experiments were performed with the New Testament in 15 different languages and with distinct books in English and Portuguese in order to quantify the dependency of the different measurements on the language and on the story being told in the book. The metrics found to be informative in distinguishing real texts from their shuffled versions include assortativity, degree and selectivity of words. As an illustration, we analyze an undeciphered medieval manuscript known as the Voynich Manuscript. We show that it is mostly compatible with natural languages and incompatible with random texts. We also obtain candidates for keywords of the Voynich Manuscript which could be helpful in the effort of deciphering it. Because we were able to identify statistical measurements that are more dependent on the syntax than on the semantics, the framework may also serve for text analysis in language-dependent applications.
|
36
|
Reisenauer R, Smith K, Blythe RA. Stochastic dynamics of lexicon learning in an uncertain and nonuniform world. Physical Review Letters 2013; 110:258701. [PMID: 23829764 DOI: 10.1103/physrevlett.110.258701] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 02/22/2013] [Indexed: 06/02/2023]
Abstract
We study the time taken by a language learner to correctly identify the meaning of all words in a lexicon under conditions where many plausible meanings can be inferred whenever a word is uttered. We show that the most basic form of cross-situational learning--whereby information from multiple episodes is combined to eliminate incorrect meanings--can perform badly when words are learned independently and meanings are drawn from a nonuniform distribution. If learners further assume that no two words share a common meaning, we find a phase transition between a maximally efficient learning regime, where the learning time is reduced to the shortest it can possibly be, and a partially efficient regime where incorrect candidate meanings for words persist at late times. We obtain exact results for the word-learning process through an equivalence to a statistical mechanical problem of enumerating loops in the space of word-meaning mappings.
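The basic cross-situational learning scheme mentioned in this abstract, together with the additional assumption that no two words share a meaning, can be sketched in Python as follows. This is a toy illustration under stated assumptions: the function names, the episode format (a word paired with a set of plausible meanings), and the simple elimination loop are hypothetical and are not the authors' model or analysis.

```python
def cross_situational_learn(episodes):
    """Intersect the candidate meanings seen across episodes for each word.

    `episodes` is an iterable of (word, candidate_meanings) pairs, where
    candidate_meanings holds every meaning that is plausible in that episode.
    """
    candidates = {}
    for word, context in episodes:
        meanings = set(context)
        if word in candidates:
            # Keep only meanings that were plausible in every episode so far.
            candidates[word] &= meanings
        else:
            candidates[word] = meanings
    return candidates


def apply_mutual_exclusivity(candidates):
    """Assume no two words share a meaning: once a word is resolved to a
    single meaning, drop that meaning from every still-ambiguous word."""
    resolved = {word: set(meanings) for word, meanings in candidates.items()}
    changed = True
    while changed:
        changed = False
        for word, meanings in resolved.items():
            if len(meanings) != 1:
                continue
            (meaning,) = meanings
            for other, other_meanings in resolved.items():
                if other != word and len(other_meanings) > 1 and meaning in other_meanings:
                    other_meanings.discard(meaning)
                    changed = True
    return resolved


# Toy episodes (invented for illustration).
episodes = [
    ("dog", {"DOG", "BALL"}),
    ("dog", {"DOG", "TREE"}),
    ("ball", {"BALL", "DOG"}),
]
learned = cross_situational_learn(episodes)
exclusive = apply_mutual_exclusivity(learned)
```

In this toy run, "ball" stays ambiguous between BALL and DOG when words are learned independently, and is resolved only once the meaning already claimed by "dog" is excluded.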
Affiliation(s)
- Rainer Reisenauer
- Physik-Department, Technische Universität München, James-Franck-Strasse 1, 85748 Garching, Germany
|
37
|
A statistical physics view of pitch fluctuations in the classical music from Bach to Chopin: evidence for scaling. PLoS One 2013; 8:e58710. [PMID: 23544047 PMCID: PMC3609771 DOI: 10.1371/journal.pone.0058710] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.0] [Received: 11/08/2012] [Accepted: 02/08/2013] [Indexed: 11/20/2022] Open
Abstract
Because classical music has greatly affected our life and culture over its long history, it has attracted extensive attention from researchers seeking to understand the laws behind it. Based on statistical physics, here we use a different method to investigate classical music, namely, by analyzing cumulative distribution functions (CDFs) and autocorrelation functions of pitch fluctuations in compositions. We analyze 1,876 compositions of five representative classical music composers across 164 years, from Bach, to Mozart, to Beethoven, to Mendelssohn, and to Chopin. We report that the biggest pitch fluctuations of a composer gradually increase as time evolves from Bach's time to Mendelssohn's and Chopin's time. In particular, for the compositions of a composer, the positive and negative tails of a CDF of pitch fluctuations are not only distributed as power laws (with the scale-free property), but are also symmetric (namely, the probability of a treble following a bass and that of a bass following a treble are basically the same for each composer). The power-law exponent decreases as time elapses. Further, we also calculate the autocorrelation function of the pitch fluctuation. The autocorrelation function shows a power-law distribution for each composer. Notably, the power-law exponents vary with the composers, indicating their different levels of long-range correlation of notes. This work not only suggests a way to understand and develop music from a viewpoint of statistical physics, but also enriches the realm of traditional statistical physics by analyzing music.
|
38
|
Deviation of Zipf's and Heaps' Laws in human languages with limited dictionary sizes. Sci Rep 2013; 3:1082. [PMID: 23378896 PMCID: PMC3558701 DOI: 10.1038/srep01082] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.8] [Received: 06/11/2012] [Accepted: 12/21/2012] [Indexed: 11/16/2022] Open
Abstract
Zipf's law on word frequency and Heaps' law on the growth of distinct words are observed in the Indo-European language family, but they do not hold for languages like Chinese, Japanese and Korean. These languages consist of characters, and are of very limited dictionary sizes. Extensive experiments show that: (i) The character frequency distribution follows a power law with exponent close to one, at which the corresponding Zipf's exponent diverges. Indeed, the character frequency decays exponentially in the Zipf plot. (ii) The number of distinct characters grows with the text length in three stages: It grows linearly in the beginning, then turns to a logarithmic form, and eventually saturates. A theoretical model for the writing process is proposed, which embodies the rich-get-richer mechanism and the effects of limited dictionary size. Experiments, simulations and analytical solutions agree well with each other. This work refines the understanding of Zipf's and Heaps' laws in human language systems.
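The two quantities discussed in this abstract, the rank-frequency (Zipf) curve and the growth of distinct tokens with text length (Heaps curve), can be computed for any token sequence with a short Python sketch such as the one below. The function names and the toy input are assumptions for illustration; the sketch only produces the raw curves and does not reproduce the paper's model or fits.

```python
from collections import Counter


def zipf_rank_frequency(tokens):
    """Return (rank, frequency) pairs, most frequent token first."""
    counts = Counter(tokens)
    frequencies = sorted(counts.values(), reverse=True)
    return list(enumerate(frequencies, start=1))


def heaps_growth(tokens):
    """Return the number of distinct tokens seen after each position."""
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth


# Toy input: characters of a short string stand in for a character-based text.
tokens = list("abracadabra")
rank_freq = zipf_rank_frequency(tokens)
distinct_growth = heaps_growth(tokens)
```

With a small, closed inventory of tokens, the distinct-token count necessarily levels off once every token has appeared, which is the saturation stage the abstract describes for character-based writing systems.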
|
39
|
Petersen AM, Tenenbaum JN, Havlin S, Stanley HE, Perc M. Languages cool as they expand: allometric scaling and the decreasing need for new words. Sci Rep 2012; 2:943. [PMID: 23230508 PMCID: PMC3517984 DOI: 10.1038/srep00943] [Citation(s) in RCA: 142] [Impact Index Per Article: 11.8] [Received: 10/16/2012] [Accepted: 10/24/2012] [Indexed: 11/23/2022] Open
Abstract
We analyze the occurrence frequencies of over 15 million words recorded in millions of books published during the past two centuries in seven different languages. For all languages and chronological subsets of the data we confirm that two scaling regimes characterize the word frequency distributions, with only the more common words obeying the classic Zipf law. Using corpora of unprecedented size, we test the allometric scaling relation between the corpus size and the vocabulary size of growing languages to demonstrate a decreasing marginal need for new words, a feature that is likely related to the underlying correlations between words. We calculate the annual growth fluctuations of word use, which show a decreasing trend as the corpus size increases, indicating a slowdown in linguistic evolution following language expansion. This "cooling pattern" forms the basis of a third statistical regularity, which, unlike the Zipf and Heaps laws, is dynamical in nature.
Affiliation(s)
- Alexander M. Petersen
- Laboratory for the Analysis of Complex Economic Systems, IMT Lucca Institute for Advanced Studies, Lucca 55100, Italy
- Joel N. Tenenbaum
- Center for Polymer Studies and Department of Physics, Boston University, Boston, Massachusetts 02215, USA
- Operations and Technology Management, School of Management, Boston University, Boston, Massachusetts 02215, USA
- Shlomo Havlin
- Minerva Center and Department of Physics, Bar-Ilan University, Ramat-Gan 52900, Israel
- H. Eugene Stanley
- Center for Polymer Studies and Department of Physics, Boston University, Boston, Massachusetts 02215, USA
- Matjaž Perc
- Department of Physics, Faculty of Natural Sciences and Mathematics, University of Maribor, Koroška cesta 160, SI-2000 Maribor, Slovenia
|
40
|
Perc M. Evolution of the most common English words and phrases over the centuries. J R Soc Interface 2012; 9:3323-8. [PMID: 22832364 PMCID: PMC3481586 DOI: 10.1098/rsif.2012.0491] [Citation(s) in RCA: 75] [Impact Index Per Article: 6.3] [Received: 06/19/2012] [Accepted: 07/02/2012] [Indexed: 12/04/2022] Open
Abstract
By determining the most common English words and phrases since the beginning of the sixteenth century, we obtain a unique large-scale view of the evolution of written text. We find that the most common words and phrases in any given year had a much shorter popularity lifespan in the sixteenth century than they had in the twentieth century. By measuring how their usage propagated across the years, we show that for the past two centuries, the process has been governed by linear preferential attachment. Along with the steady growth of the English lexicon, this provides an empirical explanation for the ubiquity of Zipf's law in language statistics and confirms that writing, although undoubtedly an expression of art and skill, is not immune to the same influences of self-organization that are known to regulate processes as diverse as the making of new friends and World Wide Web growth.
Affiliation(s)
- Matjaž Perc
- Faculty of Natural Sciences and Mathematics, University of Maribor, Koroška cesta 160, 2000 Maribor, Slovenia
|
41
|
Bentley RA, Garnett P, O'Brien MJ, Brock WA. Word diffusion and climate science. PLoS One 2012; 7:e47966. [PMID: 23144839 PMCID: PMC3492395 DOI: 10.1371/journal.pone.0047966] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.0] [Received: 03/16/2012] [Accepted: 09/24/2012] [Indexed: 11/18/2022] Open
Abstract
As public and political debates often demonstrate, a substantial disjoint can exist between the findings of science and the impact it has on the public. Using climate-change science as a case example, we reconsider the role of scientists in the information-dissemination process, our hypothesis being that important keywords used in climate science follow "boom and bust" fashion cycles in public usage. Representing this public usage through extraordinary new data on word frequencies in books published up to the year 2008, we show that a classic two-parameter social-diffusion model closely fits the comings and goings of many keywords over generational or longer time scales. We suggest that the fashions of word usage contribute an empirical, possibly regular, correlate to the impact of climate science on society.
Affiliation(s)
- R Alexander Bentley
- Department of Archaeology and Anthropology, University of Bristol, Bristol, United Kingdom
|
42
|
Phillis CC, O’Regan SM, Green SJ, Bruce JE, Anderson SC, Linton JN, Favaro B. Multiple pathways to conservation success. Conserv Lett 2012. [DOI: 10.1111/j.1755-263x.2012.00294.x] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.4] [Indexed: 11/29/2022] Open
|