1
|
Li W, Almirantis Y, Provata A. Range-limited Heaps' law for functional DNA words in the human genome. J Theor Biol 2024; 592:111878. [PMID: 38901778 DOI: 10.1016/j.jtbi.2024.111878] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/12/2023] [Revised: 05/31/2024] [Accepted: 06/10/2024] [Indexed: 06/22/2024]
Abstract
Heaps' or Herdan-Heaps' law is a linguistic law describing the relationship between the vocabulary/dictionary size (type) and word counts (token) to be a power-law function. Its existence in genomes with certain definition of DNA words is unclear partly because the dictionary size in genome could be much smaller than that in a human language. We define a DNA word as a coding region in a genome that codes for a protein domain. Using human chromosomes and chromosome arms as individual samples, we establish the existence of Heaps' law in the human genome within limited range. Our definition of words in a genomic or proteomic context is different from other definitions such as over-represented k-mers which are much shorter in length. Although an approximate power-law distribution of protein domain sizes due to gene duplication and the related Zipf's law is well known, their translation to the Heaps' law in DNA words is not automatic. Several other animal genomes are shown herein also to exhibit range-limited Heaps' law with our definition of DNA words, though with various exponents. When tokens were randomly sampled and sample sizes reach to the maximum level, a deviation from the Heaps' law was observed, but a quadratic regression in log-log type-token plot fits the data perfectly. Investigation of type-token plot and its regression coefficients could provide an alternative narrative of reusage and redundancy of protein domains as well as creation of new protein domains from a linguistic perspective.
Collapse
Affiliation(s)
- Wentian Li
- Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY, USA(1); The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institutes for Medical Research, Northwell Health, Manhasset, NY, USA.
| | - Yannis Almirantis
- Theoretical Biology and Computational Genomics Laboratory, Institute of Bioscience and Applications, National Center for Scientific Research "Demokritos", 15341 Athens, Greece
| | - Astero Provata
- Statistical Mechanics and Dynamical Systems Laboratory, Institute of Nanoscience and Nanotechnology, National Center for Scientific Research "Demokritos", 15341 Athens, Greece
| |
Collapse
|
2
|
Holdaway C, Piantadosi ST. Stochastic Time-Series Analyses Highlight the Day-To-Day Dynamics of Lexical Frequencies. Cogn Sci 2022; 46:e13215. [PMID: 36515373 DOI: 10.1111/cogs.13215] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2021] [Revised: 08/25/2022] [Accepted: 10/09/2022] [Indexed: 12/15/2022]
Abstract
Standard models in quantitative linguistics assume that word usage follows a fixed frequency distribution, often Zipf's law or a close relative. This view, however, does not capture the near daily variations in topics of conversation, nor the short-term dynamics of language change. In order to understand the dynamics of human language use, we present a corpus of daily word frequency variation scraped from online news sources every 20 min for more than 2 years. We construct a simple time-varying model with a latent state, which is observed via word frequency counts. We use Bayesian techniques to infer the parameters of this model for 20,000 words, allowing us to convert complex word-frequency trajectories into low-dimensional parameters in word usage. By analyzing the inferred parameters of this model, we quantify the relative mobility and drift of words on a day-to-day basis, while accounting for sampling error. We quantify this variation and show evidence against "rich-get-richer" models of word use, which have been previously hypothesized to explain statistical patterns in language.
Collapse
|
3
|
Aletti G, Crimaldi I. Twitter as an innovation process with damping effect. Sci Rep 2021; 11:21243. [PMID: 34711859 PMCID: PMC8553952 DOI: 10.1038/s41598-021-00378-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/29/2021] [Accepted: 10/11/2021] [Indexed: 11/23/2022] Open
Abstract
In the existing literature about innovation processes, the proposed models often satisfy the Heaps' law, regarding the rate at which novelties appear, and the Zipf's law, that states a power law behavior for the frequency distribution of the elements. However, there are empirical cases far from showing a pure power law behavior and such a deviation is mostly present for elements with high frequencies. We explain this phenomenon by means of a suitable "damping" effect in the probability of a repetition of an old element. We introduce an extremely general model, whose key element is the update function, that can be suitably chosen in order to reproduce the behaviour exhibited by the empirical data. In particular, we explicit the update function for some Twitter data sets and show great performances with respect to Heaps' law and, above all, with respect to the fitting of the frequency-rank plots for low and high frequencies. Moreover, we also give other examples of update functions, that are able to reproduce the behaviors empirically observed in other contexts.
Collapse
Affiliation(s)
- Giacomo Aletti
- ADAMSS Center, Università degli Studi di Milano, Milan, Italy.
| | | |
Collapse
|
4
|
Mazzarisi O, de Azevedo-Lopes A, Arenzon JJ, Corberi F. Maximal Diversity and Zipf's Law. PHYSICAL REVIEW LETTERS 2021; 127:128301. [PMID: 34597111 DOI: 10.1103/physrevlett.127.128301] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/16/2021] [Revised: 06/18/2021] [Accepted: 08/12/2021] [Indexed: 06/13/2023]
Abstract
Zipf's law describes the empirical size distribution of the components of many systems in natural and social sciences and humanities. We show, by solving a statistical model, that Zipf's law co-occurs with the maximization of the diversity of the component sizes. The law ruling the increase of such diversity with the total dimension of the system is derived and its relation with Heaps's law is discussed. As an example, we show that our analytical results compare very well with linguistics and population datasets.
Collapse
Affiliation(s)
- Onofrio Mazzarisi
- Max Planck Institute for Mathematics in the Sciences, Inselstrae 22, 04103 Leipzig, Germany
- Dipartimento di Fisica "E. R. Caianiello," Università di Salerno, via Giovanni Paolo II 132, 84084 Fisciano (SA), Italy
| | - Amanda de Azevedo-Lopes
- Instituto de Física, Universidade Federal do Rio Grande do Sul, CP 15051, 91501-970 Porto Alegre RS, Brazil
| | - Jeferson J Arenzon
- Instituto de Física, Universidade Federal do Rio Grande do Sul, CP 15051, 91501-970 Porto Alegre RS, Brazil
- Instituto Nacional de Ciência e Tecnologia-Sistemas Complexos, 22290-180 Rio de Janeiro RJ, Brazil
| | - Federico Corberi
- Dipartimento di Fisica "E. R. Caianiello," Università di Salerno, via Giovanni Paolo II 132, 84084 Fisciano (SA), Italy
- INFN, Gruppo Collegato di Salerno, and CNISM, Unità di Salerno, Università di Salerno, via Giovanni Paolo II 132, 84084 Fisciano (SA), Italy
| |
Collapse
|
5
|
Caetano-Anollés G. The Compressed Vocabulary of Microbial Life. Front Microbiol 2021; 12:655990. [PMID: 34305827 PMCID: PMC8292947 DOI: 10.3389/fmicb.2021.655990] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/19/2021] [Accepted: 04/27/2021] [Indexed: 12/22/2022] Open
Abstract
Communication is an undisputed central activity of life that requires an evolving molecular language. It conveys meaning through messages and vocabularies. Here, I explore the existence of a growing vocabulary in the molecules and molecular functions of the microbial world. There are clear correspondences between the lexicon, syntax, semantics, and pragmatics of language organization and the module, structure, function, and fitness paradigms of molecular biology. These correspondences are constrained by universal laws and engineering principles. Macromolecular structure, for example, follows quantitative linguistic patterns arising from statistical laws that are likely universal, including the Zipf's law, a special case of the scale-free distribution, the Heaps' law describing sublinear growth typical of economies of scales, and the Menzerath-Altmann's law, which imposes size-dependent patterns of decreasing returns. Trade-off solutions between principles of economy, flexibility, and robustness define a "triangle of persistence" describing the impact of the environment on a biological system. The pragmatic landscape of the triangle interfaces with the syntax and semantics of molecular languages, which together with comparative and evolutionary genomic data can explain global patterns of diversification of cellular life. The vocabularies of proteins (proteomes) and functions (functionomes) revealed a significant universal lexical core supporting a universal common ancestor, an ancestral evolutionary link between Bacteria and Eukarya, and distinct reductive evolutionary strategies of language compression in Archaea and Bacteria. A "causal" word cloud strategy inspired by the dependency grammar paradigm used in catenae unfolded the evolution of lexical units associated with Gene Ontology terms at different levels of ontological abstraction. While Archaea holds the smallest, oldest, and most homogeneous vocabulary of all superkingdoms, Bacteria heterogeneously apportions a more complex vocabulary, and Eukarya pushes functional innovation through mechanisms of flexibility and robustness.
Collapse
Affiliation(s)
- Gustavo Caetano-Anollés
- Evolutionary Bioinformatics Laboratory, Department of Crop Sciences, and C. R. Woese Institute for Genomic Biology, University of Illinois, Urbana, IL, United States
| |
Collapse
|
6
|
Chacoma A, Zanette DH. Heaps' Law and Heaps functions in tagged texts: evidences of their linguistic relevance. ROYAL SOCIETY OPEN SCIENCE 2020; 7:200008. [PMID: 32269820 PMCID: PMC7137977 DOI: 10.1098/rsos.200008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 01/14/2020] [Accepted: 02/21/2020] [Indexed: 06/11/2023]
Abstract
We study the relationship between vocabulary size and text length in a corpus of 75 literary works in English, authored by six writers, distinguishing between the contributions of three grammatical classes (or 'tags,' namely, nouns, verbs and others), and analyse the progressive appearance of new words of each tag along each individual text. We find that, as prescribed by Heaps' Law, vocabulary sizes and text lengths follow a well-defined power-law relation. Meanwhile, the appearance of new words in each text does not obey a power law, and is on the whole well described by the average of random shufflings of the text. Deviations from this average, however, are statistically significant and show systematic trends across the corpus. Specifically, we find that the appearance of new words along each text is predominantly retarded with respect to the average of random shufflings. Moreover, different tags add systematically distinct contributions to this tendency, with verbs and others being respectively more and less retarded than the mean trend, and nouns following instead the overall mean. These statistical systematicities are likely to point to the existence of linguistically relevant information stored in the different variants of Heaps' Law, a feature that is still in need of extensive assessment.
Collapse
Affiliation(s)
- A. Chacoma
- Instituto de Física Enrique Gaviola, Consejo Nacional de Investigaciones Científicas y Técnicas and Universidad Nacional de Córdoba, Ciudad Universitaria, 5000 Córdoba, Pcia. de Córdoba, Argentina
| | - D. H. Zanette
- Centro Atómico Bariloche and Instituto Balseiro, Comisión Nacional de Energía Atómica and Universidad Nacional de Cuyo, Consejo Nacional de Investigaciones Científicas y Técnicas, Av. Bustillo 9500, 8400 San Carlos de Bariloche, Pcia. de Río Negro, Argentina
| |
Collapse
|
7
|
Iacopini I, Milojević S, Latora V. Network Dynamics of Innovation Processes. PHYSICAL REVIEW LETTERS 2018; 120:048301. [PMID: 29437427 DOI: 10.1103/physrevlett.120.048301] [Citation(s) in RCA: 39] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/19/2017] [Revised: 12/04/2017] [Indexed: 06/08/2023]
Abstract
We introduce a model for the emergence of innovations, in which cognitive processes are described as random walks on the network of links among ideas or concepts, and an innovation corresponds to the first visit of a node. The transition matrix of the random walk depends on the network weights, while in turn the weight of an edge is reinforced by the passage of a walker. The presence of the network naturally accounts for the mechanism of the "adjacent possible," and the model reproduces both the rate at which novelties emerge and the correlations among them observed empirically. We show this by using synthetic networks and by studying real data sets on the growth of knowledge in different scientific disciplines. Edge-reinforced random walks on complex topologies offer a new modeling framework for the dynamics of correlated novelties and are another example of coevolution of processes and networks.
Collapse
Affiliation(s)
- Iacopo Iacopini
- School of Mathematical Sciences, Queen Mary University of London, London E1 4NS, United Kingdom
- The Alan Turing Institute, The British Library, London NW1 2DB, United Kingdom
| | - Staša Milojević
- Center for Complex Networks and Systems Research, School of Informatics and Computing, Indiana University, Bloomington, Indiana 47408, USA
| | - Vito Latora
- School of Mathematical Sciences, Queen Mary University of London, London E1 4NS, United Kingdom
- Dipartimento di Fisica ed Astronomia, Università di Catania and INFN, I-95123 Catania, Italy
| |
Collapse
|
8
|
Nasir A, Kim KM, Caetano-Anollés G. Phylogenetic Tracings of Proteome Size Support the Gradual Accretion of Protein Structural Domains and the Early Origin of Viruses from Primordial Cells. Front Microbiol 2017; 8:1178. [PMID: 28690608 PMCID: PMC5481351 DOI: 10.3389/fmicb.2017.01178] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/27/2017] [Accepted: 06/09/2017] [Indexed: 01/05/2023] Open
Abstract
Untangling the origin and evolution of viruses remains a challenging proposition. We recently studied the global distribution of protein domain structures in thousands of completely sequenced viral and cellular proteomes with comparative genomics, phylogenomics, and multidimensional scaling methods. A tree of life describing the evolution of proteomes revealed viruses emerging from the base of the tree as a fourth supergroup of life. A tree of domains indicated an early origin of modern viral lineages from ancient cells that co-existed with the cellular ancestors. However, it was recently argued that the rooting of our trees and the basal placement of viruses was artifactually induced by small genome (proteome) size. Here we show that these claims arise from misunderstanding and misinterpretations of cladistic methodology. Trees are reconstructed unrooted, and thus, their topologies cannot be distorted a posteriori by the rooting methodology. Tracing proteome size in trees and multidimensional views of evolutionary relationships as well as tests of leaf stability and exclusion/inclusion of taxa demonstrated that the smallest proteomes were neither attracted toward the root nor caused any topological distortions of the trees. Simulations confirmed that taxa clustering patterns were independent of proteome size and were determined by the presence of known evolutionary relatives in data matrices, highlighting the need for broader taxon sampling in phylogeny reconstruction. Instead, phylogenetic tracings of proteome size revealed a slowdown in innovation of the structural domain vocabulary and four regimes of allometric scaling that reflected a Heaps law. These regimes explained increasing economies of scale in the evolutionary growth and accretion of kernel proteome repertoires of viruses and cellular organisms that resemble growth of human languages with limited vocabulary sizes. Results reconcile dynamic and static views of domain frequency distributions that are consistent with the axiom of spatiotemporal continuity that is tenet of evolutionary thinking.
Collapse
Affiliation(s)
- Arshan Nasir
- Department of Biosciences, COMSATS Institute of Information TechnologyIslamabad, Pakistan
- Evolutionary Bioinformatics Laboratory, Department of Crop Sciences, University of Illinois at Urbana-ChampaignUrbana, IL, United States
| | - Kyung Mo Kim
- Division of Polar Life Sciences, Korea Polar Research InstituteIncheon, South Korea
| | - Gustavo Caetano-Anollés
- Evolutionary Bioinformatics Laboratory, Department of Crop Sciences, University of Illinois at Urbana-ChampaignUrbana, IL, United States
| |
Collapse
|
9
|
Torre IG, Luque B, Lacasa L, Luque J, Hernández-Fernández A. Emergence of linguistic laws in human voice. Sci Rep 2017; 7:43862. [PMID: 28272418 PMCID: PMC5341060 DOI: 10.1038/srep43862] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/13/2016] [Accepted: 01/31/2017] [Indexed: 11/08/2022] Open
Abstract
Linguistic laws constitute one of the quantitative cornerstones of modern cognitive sciences and have been routinely investigated in written corpora, or in the equivalent transcription of oral corpora. This means that inferences of statistical patterns of language in acoustics are biased by the arbitrary, language-dependent segmentation of the signal, and virtually precludes the possibility of making comparative studies between human voice and other animal communication systems. Here we bridge this gap by proposing a method that allows to measure such patterns in acoustic signals of arbitrary origin, without needs to have access to the language corpus underneath. The method has been applied to sixteen different human languages, recovering successfully some well-known laws of human communication at timescales even below the phoneme and finding yet another link between complexity and criticality in a biological system. These methods further pave the way for new comparative studies in animal communication or the analysis of signals of unknown code.
Collapse
Affiliation(s)
- Iván González Torre
- Department of Applied Mathematics and Statistics, EIAE, Technical University of Madrid, Plaza Cardenal Cisneros, 28040, Madrid, Spain
| | - Bartolo Luque
- Department of Applied Mathematics and Statistics, EIAE, Technical University of Madrid, Plaza Cardenal Cisneros, 28040, Madrid, Spain
- School of Mathematical Sciences, Queen Mary University of London, Mile End Road, E14NS, London, UK
| | - Lucas Lacasa
- School of Mathematical Sciences, Queen Mary University of London, Mile End Road, E14NS, London, UK
| | - Jordi Luque
- Telefonica Research, Edificio Telefonica-Diagonal 00, Barcelona, Spain
| | - Antoni Hernández-Fernández
- Complexity and Quantitative Linguistics Lab, Laboratory for Relational Algorithmics, Complexity and Learning (LARCA), Institut de Ciències de l’Educació, Universitat Politècnica de Catalunya, Barcelona, Spain
| |
Collapse
|
10
|
Model of the Dynamic Construction Process of Texts and Scaling Laws of Words Organization in Language Systems. PLoS One 2016; 11:e0168971. [PMID: 28006026 PMCID: PMC5179102 DOI: 10.1371/journal.pone.0168971] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/20/2016] [Accepted: 12/11/2016] [Indexed: 11/19/2022] Open
Abstract
Scaling laws characterize diverse complex systems in a broad range of fields, including physics, biology, finance, and social science. The human language is another example of a complex system of words organization. Studies on written texts have shown that scaling laws characterize the occurrence frequency of words, words rank, and the growth of distinct words with increasing text length. However, these studies have mainly concentrated on the western linguistic systems, and the laws that govern the lexical organization, structure and dynamics of the Chinese language remain not well understood. Here we study a database of Chinese and English language books. We report that three distinct scaling laws characterize words organization in the Chinese language. We find that these scaling laws have different exponents and crossover behaviors compared to English texts, indicating different words organization and dynamics of words in the process of text growth. We propose a stochastic feedback model of words organization and text growth, which successfully accounts for the empirically observed scaling laws with their corresponding scaling exponents and characteristic crossover regimes. Further, by varying key model parameters, we reproduce differences in the organization and scaling laws of words between the Chinese and English language. We also identify functional relationships between model parameters and the empirically observed scaling exponents, thus providing new insights into the words organization and growth dynamics in the Chinese and English language.
Collapse
|
11
|
Carpena P, Bernaola-Galván PA, Carretero-Campos C, Coronado AV. Probability distribution of intersymbol distances in random symbolic sequences: Applications to improving detection of keywords in texts and of amino acid clustering in proteins. Phys Rev E 2016; 94:052302. [PMID: 27967154 DOI: 10.1103/physreve.94.052302] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/17/2016] [Indexed: 11/07/2022]
Abstract
Symbolic sequences have been extensively investigated in the past few years within the framework of statistical physics. Paradigmatic examples of such sequences are written texts, and deoxyribonucleic acid (DNA) and protein sequences. In these examples, the spatial distribution of a given symbol (a word, a DNA motif, an amino acid) is a key property usually related to the symbol importance in the sequence: The more uneven and far from random the symbol distribution, the higher the relevance of the symbol to the sequence. Thus, many techniques of analysis measure in some way the deviation of the symbol spatial distribution with respect to the random expectation. The problem is then to know the spatial distribution corresponding to randomness, which is typically considered to be either the geometric or the exponential distribution. However, these distributions are only valid for very large symbolic sequences and for many occurrences of the analyzed symbol. Here, we obtain analytically the exact, randomly expected spatial distribution valid for any sequence length and any symbol frequency, and we study its main properties. The knowledge of the distribution allows us to define a measure able to properly quantify the deviation from randomness of the symbol distribution, especially for short sequences and low symbol frequency. We apply the measure to the problem of keyword detection in written texts and to study amino acid clustering in protein sequences. In texts, we show how the results improve with respect to previous methods when short texts are analyzed. In proteins, which are typically short, we show how the measure quantifies unambiguously the amino acid clustering and characterize its spatial distribution.
Collapse
Affiliation(s)
- Pedro Carpena
- Departamento de Física Aplicada II, E.T.S.I. de Telecomunicación, Universidad de Málaga, 29071, Málaga, Spain
| | - Pedro A Bernaola-Galván
- Departamento de Física Aplicada II, E.T.S.I. de Telecomunicación, Universidad de Málaga, 29071, Málaga, Spain
| | - Concepción Carretero-Campos
- Departamento de Física Aplicada II, E.T.S.I. de Telecomunicación, Universidad de Málaga, 29071, Málaga, Spain
| | - Ana V Coronado
- Departamento de Física Aplicada II, E.T.S.I. de Telecomunicación, Universidad de Málaga, 29071, Málaga, Spain
| |
Collapse
|
12
|
Fraaije JGEM, van Male J, Becherer P, Serral Gracià R. Coarse-Grained Models for Automated Fragmentation and Parametrization of Molecular Databases. J Chem Inf Model 2016; 56:2361-2377. [DOI: 10.1021/acs.jcim.6b00003] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Affiliation(s)
- Johannes G. E. M. Fraaije
- Leiden
Institute of Chemistry, Leiden University, Einsteinweg 55, 2300 RA Leiden, The Netherlands
- Culgi BV, Galileiweg 8, 2333 BD Leiden, The Netherlands
| | - Jan van Male
- Culgi BV, Galileiweg 8, 2333 BD Leiden, The Netherlands
| | - Paul Becherer
- Culgi BV, Galileiweg 8, 2333 BD Leiden, The Netherlands
| | | |
Collapse
|
13
|
Abstract
Statistical methods have been widely employed to study the fundamental properties of language. In recent years, methods from complex and dynamical systems proved useful to create several language models. Despite the large amount of studies devoted to represent texts with physical models, only a limited number of studies have shown how the properties of the underlying physical systems can be employed to improve the performance of natural language processing tasks. In this paper, I address this problem by devising complex networks methods that are able to improve the performance of current statistical methods. Using a fuzzy classification strategy, I show that the topological properties extracted from texts complement the traditional textual description. In several cases, the performance obtained with hybrid approaches outperformed the results obtained when only traditional or networked methods were used. Because the proposed model is generic, the framework devised here could be straightforwardly used to study similar textual applications where the topology plays a pivotal role in the description of the interacting agents.
Collapse
Affiliation(s)
- Diego Raphael Amancio
- Institute of Mathematical and Computer Sciences, University of São Paulo, São Carlos, São Paulo, Brazil
- * E-mail:
| |
Collapse
|
14
|
Yan X, Minnhagen P. Maximum entropy, word-frequency, Chinese characters, and multiple meanings. PLoS One 2015; 10:e0125592. [PMID: 25955175 PMCID: PMC4425542 DOI: 10.1371/journal.pone.0125592] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/13/2014] [Accepted: 03/16/2015] [Indexed: 11/23/2022] Open
Abstract
The word-frequency distribution of a text written by an author is well accounted for by a maximum entropy distribution, the RGF (random group formation)-prediction. The RGF-distribution is completely determined by the a priori values of the total number of words in the text (M), the number of distinct words (N) and the number of repetitions of the most common word (kmax). It is here shown that this maximum entropy prediction also describes a text written in Chinese characters. In particular it is shown that although the same Chinese text written in words and Chinese characters have quite differently shaped distributions, they are nevertheless both well predicted by their respective three a priori characteristic values. It is pointed out that this is analogous to the change in the shape of the distribution when translating a given text to another language. Another consequence of the RGF-prediction is that taking a part of a long text will change the input parameters (M, N, kmax) and consequently also the shape of the frequency distribution. This is explicitly confirmed for texts written in Chinese characters. Since the RGF-prediction has no system-specific information beyond the three a priori values (M, N, kmax), any specific language characteristic has to be sought in systematic deviations from the RGF-prediction and the measured frequencies. One such systematic deviation is identified and, through a statistical information theoretical argument and an extended RGF-model, it is proposed that this deviation is caused by multiple meanings of Chinese characters. The effect is stronger for Chinese characters than for Chinese words. The relation between Zipf’s law, the Simon-model for texts and the present results are discussed.
Collapse
Affiliation(s)
- Xiaoyong Yan
- Systems Science Institute, Beijing Jiaotong University, Beijing 100044, China
- Big Data Research Center, University of Electronic Science and Technology of China, Chengdu 611731, China
| | - Petter Minnhagen
- IceLab, Department of Physics, Umeå University, 901 87 Umeå, Sweden
- * E-mail:
| |
Collapse
|
15
|
Wang L, Li X. Spatial epidemiology of networked metapopulation: an overview. CHINESE SCIENCE BULLETIN-CHINESE 2014; 59:3511-3522. [PMID: 32214746 PMCID: PMC7088704 DOI: 10.1007/s11434-014-0499-8] [Citation(s) in RCA: 87] [Impact Index Per Article: 8.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/25/2013] [Accepted: 03/21/2014] [Indexed: 12/05/2022]
Abstract
An emerging disease is one infectious epidemic caused by a newly transmissible pathogen, which has either appeared for the first time or already existed in human populations, having the capacity to increase rapidly in incidence as well as geographic range. Adapting to human immune system, emerging diseases may trigger large-scale pandemic spreading, such as the transnational spreading of SARS, the global outbreak of A(H1N1), and the recent potential invasion of avian influenza A(H7N9). To study the dynamics mediating the transmission of emerging diseases, spatial epidemiology of networked metapopulation provides a valuable modeling framework, which takes spatially distributed factors into consideration. This review elaborates the latest progresses on the spatial metapopulation dynamics, discusses empirical and theoretical findings that verify the validity of networked metapopulations, and the sketches application in evaluating the effectiveness of disease intervention strategies as well.
Collapse
Affiliation(s)
- Lin Wang
- 1Adaptive Networks and Control Laboratory, Department of Electronic Engineering, Fudan University, Shanghai, 200433 China
- 2Centre for Chaos and Complex Networks, Department of Electronic Engineering, City University of Hong Kong, Hong Kong SAR, China
- 3Present Address: School of Public Health, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
| | - Xiang Li
- 1Adaptive Networks and Control Laboratory, Department of Electronic Engineering, Fudan University, Shanghai, 200433 China
| |
Collapse
|
16
|
Diversity of individual mobility patterns and emergence of aggregated scaling laws. Sci Rep 2014; 3:2678. [PMID: 24045416 PMCID: PMC3776193 DOI: 10.1038/srep02678] [Citation(s) in RCA: 108] [Impact Index Per Article: 10.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/04/2013] [Accepted: 08/28/2013] [Indexed: 12/04/2022] Open
Abstract
Uncovering human mobility patterns is of fundamental importance to the understanding of epidemic spreading, urban transportation and other socioeconomic dynamics embodying spatiality and human travel. According to the direct travel diaries of volunteers, we show the absence of scaling properties in the displacement distribution at the individual level,while the aggregated displacement distribution follows a power law with an exponential cutoff. Given the constraint on total travelling cost, this aggregated scaling law can be analytically predicted by the mixture nature of human travel under the principle of maximum entropy. A direct corollary of such theory is that the displacement distribution of a single mode of transportation should follow an exponential law, which also gets supportive evidences in known data. We thus conclude that the travelling cost shapes the displacement distribution at the aggregated level.
Collapse
|
17
|
Huang J. A common construction pattern of English words and Chinese characters. PLoS One 2013; 8:e74515. [PMID: 24023946 PMCID: PMC3759465 DOI: 10.1371/journal.pone.0074515] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/05/2013] [Accepted: 08/05/2013] [Indexed: 12/04/2022] Open
Abstract
Rankings are ubiquitous around the world. Here I investigate spatial ranking patterns of English Words and Chinese Characters, and reveal a common construction pattern related to phase separation. In detail, I analyze a list of different words in the English language, and find that the frequency of the number of letters per word linearly or nonlinearly decays over its rank in the frequency table. I interpret the linearly decaying area as a linear phase that covers 96.4% words, which is in sharp contrast to a nonlinear phase (representing the nonlinearly decaying area) that covers the remaining 3.6% words. Amazingly, the phase separation phenomenon with the same two percentages of 96.4% and 3.6% holds also for the relation between strokes and characters in the Chinese language although English and Chinese are two distinctly different language systems. The common construction pattern originates from the log-normal distributions of frequencies of words or characters, which can be understood by the joint effect of both the Weber-Fechner law in psychophysics and the principle of maximum entropy in information theory.
Collapse
Affiliation(s)
- Jiping Huang
- Department of Physics and State Key Laboratory of Surface Physics, Fudan University, Shanghai, China
- * E-mail:
| |
Collapse
|
18
|
Efficient learning strategy of Chinese characters based on network approach. PLoS One 2013; 8:e69745. [PMID: 23990887 PMCID: PMC3749196 DOI: 10.1371/journal.pone.0069745] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/09/2013] [Accepted: 06/12/2013] [Indexed: 11/19/2022] Open
Abstract
We develop an efficient learning strategy of Chinese characters based on the network of the hierarchical structural relations between Chinese characters. A more efficient strategy is that of learning the same number of useful Chinese characters in less effort or time. We construct a node-weighted network of Chinese characters, where character usage frequencies are used as node weights. Using this hierarchical node-weighted network, we propose a new learning method, the distributed node weight (DNW) strategy, which is based on a new measure of nodes' importance that considers both the weight of the nodes and its location in the network hierarchical structure. Chinese character learning strategies, particularly their learning order, are analyzed as dynamical processes over the network. We compare the efficiency of three theoretical learning methods and two commonly used methods from mainstream Chinese textbooks, one for Chinese elementary school students and the other for students learning Chinese as a second language. We find that the DNW method significantly outperforms the others, implying that the efficiency of current learning methods of major textbooks can be greatly improved.
Collapse
|
19
|
A bio-inspired methodology of identifying influential nodes in complex networks. PLoS One 2013; 8:e66732. [PMID: 23799129 PMCID: PMC3682958 DOI: 10.1371/journal.pone.0066732] [Citation(s) in RCA: 57] [Impact Index Per Article: 5.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/06/2013] [Accepted: 05/10/2013] [Indexed: 11/19/2022] Open
Abstract
How to identify influential nodes is a key issue in complex networks. The degree centrality is simple, but is incapable to reflect the global characteristics of networks. Betweenness centrality and closeness centrality do not consider the location of nodes in the networks, and semi-local centrality, leaderRank and pageRank approaches can be only applied in unweighted networks. In this paper, a bio-inspired centrality measure model is proposed, which combines the Physarum centrality with the K-shell index obtained by K-shell decomposition analysis, to identify influential nodes in weighted networks. Then, we use the Susceptible-Infected (SI) model to evaluate the performance. Examples and applications are given to demonstrate the adaptivity and efficiency of the proposed method. In addition, the results are compared with existing methods.
Collapse
|