1
|
Automatic Keyphrase Extraction from Scientific Chinese Medical Abstracts Based on Character-Level Sequence Labeling. JOURNAL OF DATA AND INFORMATION SCIENCE 2021. [DOI: 10.2478/jdis-2021-0013] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/21/2022] Open
Abstract
Purpose
Automatic keyphrase extraction (AKE) is an important task for grasping the main points of a text. In this paper, we combine the benefits of the sequence labeling formulation with a pretrained language model to propose an automatic keyphrase extraction model for Chinese scientific research.
Design/methodology/approach
We treat AKE from Chinese text as a character-level sequence labeling task, which avoids the segmentation errors of Chinese tokenizers, and we initialize our model with the pretrained language model BERT, released by Google in 2018. We collect data from the Chinese Science Citation Database and construct a large-scale dataset from the medical domain, containing 100,000 abstracts as the training set, 6,000 abstracts as the development set, and 3,094 abstracts as the test set. As baselines, we use unsupervised keyphrase extraction methods, including term frequency (TF), TF-IDF, and TextRank, and supervised machine learning methods, including Conditional Random Field (CRF), Bidirectional Long Short-Term Memory network (BiLSTM), and BiLSTM-CRF. Experiments are designed to compare word-level and character-level sequence labeling approaches on supervised machine learning models and BERT-based models.
Findings
Compared with character-level BiLSTM-CRF, the best baseline model with an F1 score of 50.16%, our character-level sequence labeling model based on BERT obtains an F1 score of 59.80%, an absolute improvement of 9.64 percentage points.
Research limitations
We consider only the automatic keyphrase extraction task, not the keyphrase generation task, so only keyphrases that occur in the given text can be extracted. In addition, our proposed dataset is not suitable for dealing with nested keyphrases.
Practical implications
We make our character-level IOB-format dataset for Chinese Automatic Keyphrase Extraction from scientific Chinese medical abstracts (CAKE) publicly available for the benefit of the research community at: https://github.com/possible1402/Dataset-For-Chinese-Medical-Keyphrase-Extraction.
Originality/value
Through comparative experiments, our study demonstrates that the character-level formulation is better suited to the Chinese automatic keyphrase extraction task under the general trend of pretrained language models. Our proposed dataset also provides a unified method for model evaluation and can promote the development of Chinese automatic keyphrase extraction to some extent.
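The character-level IOB formulation described in this abstract can be illustrated with a minimal sketch. It assumes gold keyphrases are matched by exact substring search and that overlapping (nested) matches are skipped, in line with the stated limitation; the function name `char_level_iob` and the sample sentence are hypothetical and are not taken from the CAKE dataset.

```python
def char_level_iob(text, keyphrases):
    """Tag every character of `text` as B (begin), I (inside), or O (outside)."""
    tags = ["O"] * len(text)
    # Longest phrases first, so a shorter phrase cannot claim part of a longer one.
    for phrase in sorted(set(keyphrases), key=len, reverse=True):
        if not phrase:
            continue
        start = text.find(phrase)
        while start != -1:
            end = start + len(phrase)
            # Skip spans that overlap an already tagged keyphrase (no nesting).
            if all(tag == "O" for tag in tags[start:end]):
                tags[start] = "B"
                for i in range(start + 1, end):
                    tags[i] = "I"
            start = text.find(phrase, start + 1)
    return tags


# Hypothetical example: no word segmentation is needed, each character is a token.
text = "糖尿病患者的血糖控制"
tags = char_level_iob(text, ["糖尿病", "血糖"])
for ch, tag in zip(text, tags):
    print(ch, tag)
```

Because every character is its own token, a tokenizer that splits a keyphrase in the wrong place cannot affect the labels, which is the motivation the abstract gives for the character-level formulation.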
|
2
|
Moghaddasi H, Rezaei S, Darooneh AH, Heshmati E, Khalifeh K. A comparative analysis of dipeptides distribution in eukaryotes and prokaryotes by statistical mechanics. PHYSICA A: STATISTICAL MECHANICS AND ITS APPLICATIONS 2020; 555:124567. [DOI: 10.1016/j.physa.2020.124567] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/01/2023]
|
3
|
|
4
|
Malik Z, Hashmi K, Najmi E, Rezgui A. Wisdom extraction in knowledge-based information systems. JOURNAL OF KNOWLEDGE MANAGEMENT 2019. [DOI: 10.1108/jkm-05-2018-0288] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 11/17/2022]
Abstract
Purpose
This paper aims to provide a number of distinct approaches towards this goal, i.e. to translate the information contained in the repositories into knowledge. For centuries, humans have gathered and generated data to study the different phenomena around them. Consequently, there are a variety of information repositories available in many different fields of study. However, the ability to access, integrate and properly interpret the relevant data sets in these repositories has mainly been limited by their ever expanding volumes. The goal of translating the available data to knowledge, eventually leading to wisdom, requires an understanding of the relations, ordering and associations among the data sets.
Design/methodology/approach
While the existing information repositories are rich in content, there are no easy means of understanding the relevance or influence of the different facts contained therein. Therefore, the interest of the general populace in terms of prioritizing some data items (or facts) over others is usually lost. In this paper, the goal is to provide approaches for transforming the available facts in the information repositories to wisdom. The authors target the lack of order in the facts presented in the repositories to create a hierarchical distribution based on the common understanding, expectations, opinions and judgments of the different users.
Findings
The authors present multiple approaches to extract and order the facts related to each concept, using both automatic and semi-automatic methods. The experiments show that the results of these approaches are similar and very close to the instinctive ordering of facts by users.
Originality/value
The authors believe that the work presented in this paper, with some additions, can be a feasible step to convert the available knowledge to wisdom and a step towards the future of online information systems.
|
5
|
Najafi E, Darooneh AH. Long range dependence in texts: A method for quantifying coherence of text. Knowl Based Syst 2017. [DOI: 10.1016/j.knosys.2017.06.032] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 10/19/2022]
|
6
|
Distinguishing Functional DNA Words; A Method for Measuring Clustering Levels. Sci Rep 2017; 7:41543. [PMID: 28128320 PMCID: PMC5269680 DOI: 10.1038/srep41543] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 07/06/2016] [Accepted: 12/22/2016] [Indexed: 01/03/2023] Open
Abstract
Functional DNA sub-sequences and genome elements are spatially clustered throughout the genome, just as keywords are in literary texts. Therefore, some of the methods for ranking words in texts can also be used to compare different DNA sub-sequences. By analogy with literary texts, we claim here that the distribution of distances between successive sub-sequences (words) is q-exponential, the distribution function of non-extensive statistical mechanics. Thus the q-parameter can be used as a measure of word clustering levels. Here, we analyzed the distribution of distances between consecutive occurrences of the 16 possible dinucleotides in human chromosomes to obtain their corresponding q-parameters. We found that CG, a biologically important two-letter word because of its methylation, has the highest clustering level. This finding shows the predictive ability of the method in biology. We also proposed that chromosome 18, with the largest value of the q-parameter for gene promoters, is more sensitive to diet and lifestyle. We extended our study to compare the genomes of some selected organisms and concluded that the clustering level of CGs increases in higher evolutionary organisms compared to lower ones.
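The quantities this abstract works with can be sketched as follows. The first function collects the distances between consecutive occurrences of a word such as a dinucleotide; the second evaluates the standard q-exponential decay, which reduces to the ordinary exponential at q = 1. The names `occurrence_gaps` and `q_exp_decay`, the rate parameter `beta`, and the toy sequence are assumptions for illustration; how the authors fit q to real chromosomes is not stated in the abstract and is not reproduced here.

```python
import math


def occurrence_gaps(sequence, word):
    """Distances between consecutive (possibly overlapping) occurrences of `word`."""
    positions = []
    start = sequence.find(word)
    while start != -1:
        positions.append(start)
        start = sequence.find(word, start + 1)
    return [b - a for a, b in zip(positions, positions[1:])]


def q_exp_decay(distance, q, beta):
    """q-exponential decay e_q(-beta * distance) for distance >= 0 and beta > 0."""
    if abs(q - 1.0) < 1e-12:
        return math.exp(-beta * distance)  # ordinary exponential limit
    base = 1.0 - (1.0 - q) * beta * distance
    if base <= 0.0:
        return 0.0  # cutoff, reachable only when q < 1
    return base ** (1.0 / (1.0 - q))


gaps = occurrence_gaps("ACGTTCGACGCG", "CG")
print(gaps)
print(q_exp_decay(1.0, q=2.0, beta=1.0))
```

For q > 1 the decay has a heavier tail than the exponential, so long gaps between occurrences remain comparatively likely, which is what a clustered word produces.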
|
7
|
Striking similarities between publications from China describing single gene knockdown experiments in human cancer cell lines. Scientometrics 2016. [DOI: 10.1007/s11192-016-2209-6] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Indexed: 12/17/2022]
|
8
|
Model of the Dynamic Construction Process of Texts and Scaling Laws of Words Organization in Language Systems. PLoS One 2016; 11:e0168971. [PMID: 28006026 PMCID: PMC5179102 DOI: 10.1371/journal.pone.0168971] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 09/20/2016] [Accepted: 12/11/2016] [Indexed: 11/19/2022] Open
Abstract
Scaling laws characterize diverse complex systems in a broad range of fields, including physics, biology, finance, and social science. Human language is another example of a complex system, one of word organization. Studies on written texts have shown that scaling laws characterize the occurrence frequency of words, word rank, and the growth of distinct words with increasing text length. However, these studies have mainly concentrated on Western linguistic systems, and the laws that govern the lexical organization, structure and dynamics of the Chinese language remain poorly understood. Here we study a database of Chinese and English language books. We report that three distinct scaling laws characterize word organization in the Chinese language. We find that these scaling laws have different exponents and crossover behaviors compared to English texts, indicating different word organization and word dynamics in the process of text growth. We propose a stochastic feedback model of word organization and text growth, which successfully accounts for the empirically observed scaling laws with their corresponding scaling exponents and characteristic crossover regimes. Further, by varying key model parameters, we reproduce differences in the organization and scaling laws of words between the Chinese and English languages. We also identify functional relationships between model parameters and the empirically observed scaling exponents, thus providing new insights into word organization and growth dynamics in the Chinese and English languages.
|
9
|
Carpena P, Bernaola-Galván PA, Carretero-Campos C, Coronado AV. Probability distribution of intersymbol distances in random symbolic sequences: Applications to improving detection of keywords in texts and of amino acid clustering in proteins. Phys Rev E 2016; 94:052302. [PMID: 27967154 DOI: 10.1103/physreve.94.052302] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 06/17/2016] [Indexed: 11/07/2022]
Abstract
Symbolic sequences have been extensively investigated in the past few years within the framework of statistical physics. Paradigmatic examples of such sequences are written texts, and deoxyribonucleic acid (DNA) and protein sequences. In these examples, the spatial distribution of a given symbol (a word, a DNA motif, an amino acid) is a key property usually related to the symbol importance in the sequence: The more uneven and far from random the symbol distribution, the higher the relevance of the symbol to the sequence. Thus, many techniques of analysis measure in some way the deviation of the symbol spatial distribution with respect to the random expectation. The problem is then to know the spatial distribution corresponding to randomness, which is typically considered to be either the geometric or the exponential distribution. However, these distributions are only valid for very large symbolic sequences and for many occurrences of the analyzed symbol. Here, we obtain analytically the exact, randomly expected spatial distribution valid for any sequence length and any symbol frequency, and we study its main properties. The knowledge of the distribution allows us to define a measure able to properly quantify the deviation from randomness of the symbol distribution, especially for short sequences and low symbol frequency. We apply the measure to the problem of keyword detection in written texts and to study amino acid clustering in protein sequences. In texts, we show how the results improve with respect to previous methods when short texts are analyzed. In proteins, which are typically short, we show how the measure quantifies unambiguously the amino acid clustering and characterize its spatial distribution.
Affiliation(s)
- Pedro Carpena
- Departamento de Física Aplicada II, E.T.S.I. de Telecomunicación, Universidad de Málaga, 29071, Málaga, Spain
- Pedro A Bernaola-Galván
- Departamento de Física Aplicada II, E.T.S.I. de Telecomunicación, Universidad de Málaga, 29071, Málaga, Spain
- Concepción Carretero-Campos
- Departamento de Física Aplicada II, E.T.S.I. de Telecomunicación, Universidad de Málaga, 29071, Málaga, Spain
- Ana V Coronado
- Departamento de Física Aplicada II, E.T.S.I. de Telecomunicación, Universidad de Málaga, 29071, Málaga, Spain
|
10
|
Li Z, Cao H, Cui Y, Zhang Y. Extracting DNA words based on the sequence features: non-uniform distribution and integrity. Theor Biol Med Model 2016; 13:2. [PMID: 26811154 PMCID: PMC4727310 DOI: 10.1186/s12976-016-0028-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 10/31/2015] [Accepted: 01/14/2016] [Indexed: 12/02/2022] Open
Abstract
BACKGROUND
A DNA sequence can be viewed as an unknown language with words as its functional units. Given that most sequence alignment algorithms, such as the motif discovery algorithms, depend on the quality of background information about sequences, it is necessary to develop an ab initio algorithm for extracting the "words" based only on the DNA sequences.
METHODS
We considered that non-uniform distribution and integrity were two important features of a word, based on which we developed an ab initio algorithm to extract "DNA words" that have potential functional meaning. A Kolmogorov-Smirnov test was used to test consistency with a uniform distribution in DNA sequences, and the integrity was judged by the sequence and position alignment. Two random base sequences were adopted as negative controls, and an English book was used as a positive control to verify our algorithm. We applied our algorithm to the genomes of Saccharomyces cerevisiae and 10 strains of Escherichia coli to show the utility of the methods.
RESULTS
The results provide strong evidence that the algorithm is a promising tool for building a DNA dictionary ab initio.
CONCLUSIONS
Our method provides a fast way for large scale screening of important DNA elements and offers potential insights into the understanding of a genome.
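The uniformity check mentioned in this abstract can be sketched with a one-sample Kolmogorov-Smirnov statistic computed against the uniform distribution on [0, 1]. This is a generic illustration, not the authors' pipeline: the function names are hypothetical, positions are simply rescaled by the sequence length, and the threshold 1.36/sqrt(n) is the usual large-sample approximation for the 5% level, which is rough for small n.

```python
import math


def ks_uniform_statistic(positions, length):
    """KS distance between rescaled word positions and the uniform CDF on [0, 1]."""
    if not positions:
        raise ValueError("at least one occurrence is required")
    xs = sorted(p / length for p in positions)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        # Compare the empirical CDF just after and just before each point.
        d = max(d, i / n - x, x - (i - 1) / n)
    return d


def looks_nonuniform(positions, length):
    """True when the statistic exceeds the approximate 5% critical value (assumption)."""
    critical = 1.36 / math.sqrt(len(positions))
    return ks_uniform_statistic(positions, length) > critical


even = list(range(5, 100, 10))   # occurrences spread evenly over 100 positions
clustered = list(range(10))      # occurrences packed at the start
print(ks_uniform_statistic(even, 100), looks_nonuniform(even, 100))
print(ks_uniform_statistic(clustered, 100), looks_nonuniform(clustered, 100))
```

A candidate word whose positions fail the uniformity check is unevenly distributed, which is one of the two features the abstract uses to identify a "DNA word".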
Affiliation(s)
- Zhi Li
- Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, 030001, China.
- Hongyan Cao
- Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, 030001, China.
- Yuehua Cui
- Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, 030001, China.
- Department of Statistics and Probability, Michigan State University, East Lansing, MI, 48824, USA.
- Yanbo Zhang
- Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, 030001, China.
|
11
|
Abstract
Statistical methods have been widely employed to study the fundamental properties of language. In recent years, methods from complex and dynamical systems have proved useful for creating several language models. Despite the large number of studies devoted to representing texts with physical models, only a limited number have shown how the properties of the underlying physical systems can be employed to improve the performance of natural language processing tasks. In this paper, I address this problem by devising complex network methods that are able to improve the performance of current statistical methods. Using a fuzzy classification strategy, I show that the topological properties extracted from texts complement the traditional textual description. In several cases, hybrid approaches outperformed the results obtained when only traditional or networked methods were used. Because the proposed model is generic, the framework devised here could be straightforwardly used to study similar textual applications where the topology plays a pivotal role in the description of the interacting agents.
Affiliation(s)
- Diego Raphael Amancio
- Institute of Mathematical and Computer Sciences, University of São Paulo, São Carlos, São Paulo, Brazil
|
12
|
Najafi E, Darooneh AH. The Fractal Patterns of Words in a Text: A Method for Automatic Keyword Extraction. PLoS One 2015; 10:e0130617. [PMID: 26091207 PMCID: PMC4474631 DOI: 10.1371/journal.pone.0130617] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Received: 06/25/2014] [Accepted: 05/21/2015] [Indexed: 11/18/2022] Open
Abstract
A text can be considered as a one-dimensional array of words. The locations of each word type in this array form a fractal pattern with a certain fractal dimension. We observe that important words responsible for conveying the meaning of a text have dimensions considerably different from one, while the fractal dimensions of unimportant words are close to one. We introduce an index quantifying the importance of the words in a given text using their fractal dimensions and then rank them according to their importance. This index measures the difference between the fractal pattern of a word in the original text relative to a shuffled version. Because the shuffled text is meaningless (i.e., words have no importance), the difference between the original and shuffled text can be used to ascertain the degree of fractality. The degree of fractality may be used for automatic keyword detection. Words with a degree of fractality higher than a threshold value are assumed to be the retrieved keywords of the text. We measure the efficiency of our method for keyword extraction by comparing it with two other well-known methods of automatic keyword extraction.
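The notion of a fractal dimension for word positions can be illustrated with a generic box-counting estimate: cover the text with boxes of a given size, count the boxes that contain the word, and take the slope of log(count) against log(1/size). This is a sketch under stated assumptions only. The function name, the box sizes, and the synthetic position sets are hypothetical, and the paper's own index (the degree of fractality relative to a shuffled text) is not reproduced.

```python
import math


def box_counting_dimension(positions, box_sizes):
    """Least-squares slope of log(occupied boxes) versus log(1 / box size)."""
    if not positions or len(box_sizes) < 2:
        raise ValueError("need occurrences and at least two box sizes")
    xs = [math.log(1.0 / size) for size in box_sizes]
    # A box is occupied if at least one occurrence falls inside it.
    ys = [math.log(len({p // size for p in positions})) for size in box_sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


def cantor_positions(levels):
    """Positions in [0, 3**levels) whose base-3 digits are only 0 or 2."""
    positions = [0]
    for level in range(levels):
        positions += [p + 2 * 3 ** level for p in positions]
    return sorted(positions)


everywhere = list(range(243))       # a word occurring at every position
clustered = cantor_positions(5)     # a strongly clustered, self-similar pattern
sizes = [1, 3, 9, 27]
print(box_counting_dimension(everywhere, sizes))
print(box_counting_dimension(clustered, sizes))
```

The evenly spread pattern gives a dimension of one, while the clustered pattern gives a clearly smaller value, matching the abstract's observation that important words have dimensions considerably different from one.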
Affiliation(s)
- Elham Najafi
- Department of Physics, University of Zanjan, Zanjan, Iran
|
13
|
Bao J, Yuan R, Bao Z. An improved alignment-free model for DNA sequence similarity metric. BMC Bioinformatics 2014; 15:321. [PMID: 25261973 PMCID: PMC4261891 DOI: 10.1186/1471-2105-15-321] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Received: 12/09/2013] [Accepted: 09/23/2014] [Indexed: 11/23/2022] Open
Abstract
BACKGROUND
DNA clustering is an important technique for automatically finding the inherent relationships in large collections of DNA sequences, but clustering quality can still be improved greatly. The DNA sequence similarity metric is one of the key points of clustering. The alignment-free methodology is a very popular way to calculate DNA sequence similarity. It normally converts a sequence into a feature space based on the words' probability distribution rather than directly matching strings. Existing alignment-free models, e.g. k-tuple, merely employ word frequency information and ignore many types of useful information contained in the DNA sequence, such as classifications of nucleotide bases, position and the like. It is believed that better data mining results can be achieved with compounded information. Therefore, we present a new alignment-free model that employs compounded information to improve the DNA clustering quality.
RESULTS
This paper proposes a Category-Position-Frequency (CPF) model, which utilizes the word frequency, position and classification information of nucleotide bases from DNA sequences. The CPF model converts a DNA sequence into three sequences according to the categories of nucleotide bases, and then yields a 12-dimension feature vector. The feature values are computed by an entropy-based model that takes both local word frequency and position information into account. We conduct DNA clustering experiments on several datasets and compare with some mainstream alignment-free models for evaluation, including k-tuple, DMk, TSM, AMI and CV. The experiments show that the CPF model is superior to other models in terms of the clustering results and optimal settings.
CONCLUSIONS
The following conclusions can be drawn from the experiments. (1) The hybrid information model is better than the model based on word frequency only. (2) For DNA sequences of no more than 5000 characters, the preferred sliding-window size for CPF is two, which provides a great advantage in system performance. (3) The CPF model is able to obtain an efficient, stable performance and broad generalization.
Affiliation(s)
- Junpeng Bao
- Department of Computer Science and Technology Xi’an Jiaotong University, West Xianning Road, 710049 Xi’an, P.R. China
- Ruiyu Yuan
- Department of Computer Science and Technology Xi’an Jiaotong University, West Xianning Road, 710049 Xi’an, P.R. China
- Zhe Bao
- Department of Computer Science and Technology Xi’an Jiaotong University, West Xianning Road, 710049 Xi’an, P.R. China
|
14
|
Dios F, Barturen G, Lebrón R, Rueda A, Hackenberg M, Oliver JL. DNA clustering and genome complexity. Comput Biol Chem 2014; 53 Pt A:71-8. [PMID: 25182383 DOI: 10.1016/j.compbiolchem.2014.08.011] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Accepted: 07/11/2014] [Indexed: 01/08/2023]
Abstract
Early global measures of genome complexity (power spectra, the analysis of fluctuations in DNA walks or compositional segmentation) uncovered a high degree of complexity in eukaryotic genome sequences. The main evolutionary mechanisms leading to increases in genome complexity (i.e. gene duplication and transposon proliferation) can all potentially produce increases in DNA clustering. To quantify such clustering and provide a genome-wide description of the formed clusters, we developed GenomeCluster, an algorithm able to detect clusters of any genome element identified by chromosome coordinates. We obtained a detailed description of clusters for ten categories of human genome elements, including functional (genes, exons, introns), regulatory (CpG islands, TFBSs, enhancers), variant (SNPs) and repeat (Alus, LINE1) elements, as well as DNase hypersensitivity sites. For each category, we located their clusters in the human genome, quantified cluster length and composition, and estimated the clustering level as the proportion of clustered genome elements. On average, we found 27% of elements in clusters, although considerable variation occurs among different categories. Genes form the lowest number of clusters, but these are the longest ones, both in bp and in the average number of components, while the shortest clusters are formed by SNPs. Functional and regulatory elements (genes, CpG islands, TFBSs, enhancers) show the highest clustering level, as compared to DNase sites, repeats (Alus, LINE1) or SNPs. Many of the genome elements we analyzed are known to be composed of clusters of low-level entities. In addition, we found here that the clusters generated by GenomeCluster can in turn be clustered into high-level super-clusters. The observation of 'clusters-within-clusters' parallels the 'domains within domains' phenomenon previously detected through global statistical methods in eukaryotic sequences, and reveals a complex human genome landscape dominated by hierarchical clustering.
Affiliation(s)
- Francisco Dios
- Dpto. de Genética, Facultad de Ciencias, Universidad de Granada, 18071 Granada, Spain; Lab. de Bioinformática, Inst. de Biotecnología, Centro de Investigación Biomédica, 18100 Granada, Spain
- Guillermo Barturen
- Dpto. de Genética, Facultad de Ciencias, Universidad de Granada, 18071 Granada, Spain; Lab. de Bioinformática, Inst. de Biotecnología, Centro de Investigación Biomédica, 18100 Granada, Spain
- Ricardo Lebrón
- Dpto. de Genética, Facultad de Ciencias, Universidad de Granada, 18071 Granada, Spain; Lab. de Bioinformática, Inst. de Biotecnología, Centro de Investigación Biomédica, 18100 Granada, Spain
- Antonio Rueda
- Plataforma Andaluza de Genómica y Bioinformática (GBPA), Edificio INSUR, Calle Albert Einstein, 41092 Sevilla, Spain
- Michael Hackenberg
- Dpto. de Genética, Facultad de Ciencias, Universidad de Granada, 18071 Granada, Spain; Lab. de Bioinformática, Inst. de Biotecnología, Centro de Investigación Biomédica, 18100 Granada, Spain
- José L Oliver
- Dpto. de Genética, Facultad de Ciencias, Universidad de Granada, 18071 Granada, Spain; Lab. de Bioinformática, Inst. de Biotecnología, Centro de Investigación Biomédica, 18100 Granada, Spain.
|
15
|
Allahverdyan AE, Deng W, Wang QA. Explaining Zipf's law via a mental lexicon. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2013; 88:062804. [PMID: 24483508 DOI: 10.1103/physreve.88.062804] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 12/21/2012] [Revised: 05/22/2013] [Indexed: 06/03/2023]
Abstract
Zipf's law is the major regularity of statistical linguistics that has served as a prototype for rank-frequency relations and scaling laws in natural sciences. Here we show that Zipf's law (together with its applicability to a single text and its generalizations to high and low frequencies, including hapax legomena) can be derived by assuming that the words are drawn into the text with random probabilities. Their a priori density relates, via Bayesian statistics, to the mental lexicon of the author who produced the text.
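The rank-frequency relation that this abstract starts from can be sketched in a few lines: count word frequencies, sort them by rank, and estimate the exponent as the negative slope of log(frequency) against log(rank). This only illustrates the empirical law itself, where an exponent near one is the classic Zipf form; it does not implement the authors' derivation from a mental lexicon. The function names and the synthetic word list are hypothetical.

```python
import math
from collections import Counter


def rank_frequency(words):
    """Word frequencies sorted from most to least frequent (rank 1 first)."""
    return sorted(Counter(words).values(), reverse=True)


def zipf_exponent(frequencies):
    """Negative least-squares slope of log(frequency) versus log(rank)."""
    if len(frequencies) < 2:
        raise ValueError("need at least two distinct words")
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return -numerator / denominator


# Synthetic text whose frequencies follow 60 / rank exactly.
words = ["a"] * 60 + ["b"] * 30 + ["c"] * 20 + ["d"] * 15 + ["e"] * 12
frequencies = rank_frequency(words)
print(frequencies, zipf_exponent(frequencies))
```

On real texts the fit is usually restricted to an intermediate range of ranks, since the highest and lowest frequencies (including hapax legomena) deviate from a pure power law, which is the regime the abstract's generalizations address.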
Affiliation(s)
- Armen E Allahverdyan
- Laboratoire de Physique Statistique et Systèmes Complexes, ISMANS, 44 ave. Bartholdi, 72000 Le Mans, France and Yerevan Physics Institute, Alikhanian Brothers Street 2, Yerevan 375036, Armenia
- Weibing Deng
- Laboratoire de Physique Statistique et Systèmes Complexes, ISMANS, 44 ave. Bartholdi, 72000 Le Mans, France and IMMM, UMR CNRS 6283, Université du Maine, 72085 Le Mans, France and Complexity Science Center and Institute of Particle Physics, Hua-Zhong Normal University, Wuhan 430079, China
- Q A Wang
- Laboratoire de Physique Statistique et Systèmes Complexes, ISMANS, 44 ave. Bartholdi, 72000 Le Mans, France and IMMM, UMR CNRS 6283, Université du Maine, 72085 Le Mans, France
|
16
|
Keywords and Co-Occurrence Patterns in the Voynich Manuscript: An Information-Theoretic Analysis. PLoS One 2013; 8:e66344. [PMID: 23805215 PMCID: PMC3689824 DOI: 10.1371/journal.pone.0066344] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.0] [Received: 02/07/2013] [Accepted: 05/08/2013] [Indexed: 11/23/2022] Open
Abstract
The Voynich manuscript has so far remained a mystery for linguists and cryptologists. While the text, written on medieval parchment using an unknown script system, shows basic statistical patterns that resemble those of real languages, there are features that have suggested to some researchers that the manuscript was a forgery intended as a hoax. Here we analyse the long-range structure of the manuscript using methods from information theory. We show that the Voynich manuscript presents a complex organization in the distribution of words that is compatible with those found in real language sequences. We are also able to extract some of the most significant semantic word-networks in the text. These results, together with some previously known statistical features of the Voynich manuscript, support the presence of a genuine message inside the book.
|
17
|
Bernaola-Galván P, Oliver J, Hackenberg M, Coronado A, Ivanov P, Carpena P. Segmentation of time series with long-range fractal correlations. THE EUROPEAN PHYSICAL JOURNAL. B 2012; 85:211. [PMID: 23645997 PMCID: PMC3643524 DOI: 10.1140/epjb/e2012-20969-5] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 06/02/2023]
Abstract
Segmentation is a standard method of data analysis to identify change-points dividing a nonstationary time series into homogeneous segments. However, for long-range fractal correlated series, most of the segmentation techniques detect spurious change-points which are simply due to the heterogeneities induced by the correlations and not to real nonstationarities. To avoid this oversegmentation, we present a segmentation algorithm which takes as a reference for homogeneity, instead of a random i.i.d. series, a correlated series modeled by a fractional noise with the same degree of correlations as the series to be segmented. We apply our algorithm to artificial series with long-range correlations and show that it systematically detects only the change-points produced by real nonstationarities and not those created by the correlations of the signal. Further, we apply the method to the sequence of the long arm of human chromosome 21, which is known to have long-range fractal correlations. We obtain only three segments that clearly correspond to the three regions of different G + C composition revealed by means of a multi-scale wavelet plot. Similar results have been obtained when segmenting all human chromosome sequences, showing the existence of previously unknown huge compositional superstructures in the human genome.
Affiliation(s)
- J.L. Oliver
- Dpto. de Genética, Inst. de Biotecnología, Universidad de Granada, 18071 Granada, Spain
- M. Hackenberg
- Dpto. de Genética, Inst. de Biotecnología, Universidad de Granada, 18071 Granada, Spain
- A.V. Coronado
- Dpto. de Física Aplicada II, Universidad de Málaga, 29071 Málaga, Spain
- P.Ch. Ivanov
- Harvard Medical School, Division of Sleep Medicine, Brigham & Women’s Hospital, 02115 Boston, MA, USA
- Department of Physics and Center for Polymer Studies, Boston University, 2215 Boston, MA, USA
- Institute of Solid State Physics, Bulgarian Academy of Sciences, 1784 Sofia, Bulgaria
- P. Carpena
- Dpto. de Física Aplicada II, Universidad de Málaga, 29071 Málaga, Spain
|
18
|
Clustering of DNA words and biological function: A proof of principle. J Theor Biol 2012; 297:127-36. [DOI: 10.1016/j.jtbi.2011.12.024] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.8] [Received: 10/17/2011] [Revised: 12/20/2011] [Accepted: 12/21/2011] [Indexed: 02/08/2023]
|
19
|
Fridman M, Pugatch R, Nixon M, Friesem AA, Davidson N. Measuring maximal eigenvalue distribution of Wishart random matrices with coupled lasers. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2012; 85:020101. [PMID: 22463135 DOI: 10.1103/physreve.85.020101] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 12/16/2011] [Indexed: 05/31/2023]
Abstract
We determined the probability distribution of the combined output power from 25 coupled fiber lasers and show that it agrees well with the Tracy-Widom and Majumdar-Vergassola distributions of the largest eigenvalue of Wishart random matrices with no fitting parameters. This was achieved with 500,000 measurements of the combined output power from the fiber lasers, that continuously changes with variations of the fiber lasers lengths. We show experimentally that for small deviations of the combined output power over its mean value the Tracy-Widom distribution is correct, while for large deviations the Majumdar-Vergassola distribution is correct.
Affiliation(s)
- Moti Fridman
- Weizmann Institute of Science, Department of Physics of Complex Systems, Rehovot 76100, Israel
|
20
|
Frahm KM, Shepelyansky DL. Poincaré recurrences of DNA sequences. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2012; 85:016214. [PMID: 22400650 DOI: 10.1103/physreve.85.016214] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 09/02/2011] [Revised: 12/01/2011] [Indexed: 05/31/2023]
Abstract
We analyze the statistical properties of Poincaré recurrences of Homo sapiens, mammalian, and other DNA sequences taken from the Ensembl Genome database with up to 15 billion base pairs. We show that the probability of Poincaré recurrences decays algebraically with the Poincaré exponent β≈4 even if the oscillatory dependence is well pronounced. The correlations between recurrences decay with an exponent ν≈0.6 that leads to an anomalous superdiffusive walk. However, for Homo sapiens sequences, with the largest available statistics, the diffusion coefficient converges to a finite value on distances larger than one million base pairs. We argue that the approach based on Poincaré recurrences determines new proximity features between different species and sheds new light on their evolutionary history.
Affiliation(s)
- K M Frahm
- Laboratoire de Physique Théorique du CNRS, IRSAMC, Université de Toulouse, UPS, F-31062 Toulouse, France
|
21
|
Mehri A, Darooneh AH. Keyword extraction by nonextensivity measure. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2011; 83:056106. [PMID: 21728604 DOI: 10.1103/physreve.83.056106] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 08/09/2010] [Indexed: 05/31/2023]
Abstract
The presence of a long-range correlation in the spatial distribution of a relevant word type, in spite of random occurrences of an irrelevant word type, is an important feature of human-written texts. We classify the correlation between the occurrences of words by nonextensive statistical mechanics for the word-ranking process. In particular, we look at the nonextensivity parameter as an alternative metric to measure the spatial correlation in the text, from which the words may be ranked in terms of this measure. Finally, we compare different methods for keyword extraction.
Affiliation(s)
- Ali Mehri
- Department of Physics, Zanjan University, Zanjan, Iran.
|
22
|
Hackenberg M, Carpena P, Bernaola-Galván P, Barturen G, Alganza ÁM, Oliver JL. WordCluster: detecting clusters of DNA words and genomic elements. Algorithms Mol Biol 2011; 6:2. [PMID: 21261981 PMCID: PMC3037320 DOI: 10.1186/1748-7188-6-2] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.1] [Received: 08/30/2010] [Accepted: 01/24/2011] [Indexed: 01/26/2023] Open
Abstract
Background
Many k-mers (or DNA words) and genomic elements are known to be spatially clustered in the genome. Well-established examples are genes, TFBSs, CpG dinucleotides, microRNA genes and ultra-conserved non-coding regions. Currently, no algorithm exists to find these clusters in a statistically comprehensible way. The detection of clustering often relies on densities and sliding-window approaches or arbitrarily chosen distance thresholds.
Results
We introduce here an algorithm to detect clusters of DNA words (k-mers), or any other genomic element, based on the distance between consecutive copies and an assigned statistical significance. We implemented the method in a web server connected to a MySQL backend, which also determines the co-localization with gene annotations. We demonstrate the usefulness of this approach by detecting the clusters of CAG/CTG (cytosine contexts that can be methylated in undifferentiated cells), showing that the degree of methylation varies drastically between the inside and outside of the clusters. As another example, we used WordCluster to search for statistically significant clusters of olfactory receptor (OR) genes in the human genome.
Conclusions
WordCluster seems to predict biologically meaningful clusters of DNA words (k-mers) and genomic entities. The implementation of the method as a web server is available at http://bioinfo2.ugr.es/wordCluster/wordCluster.php, including additional features such as the detection of co-localization with gene regions and the annotation enrichment tool for functional analysis of overlapped genes.
|
23
|
Lü L, Zhang ZK, Zhou T. Zipf's law leads to Heaps' law: analyzing their relation in finite-size systems. PLoS One 2010; 5:e14139. [PMID: 21152034 PMCID: PMC2996287 DOI: 10.1371/journal.pone.0014139] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.4] [Received: 02/21/2010] [Accepted: 10/20/2010] [Indexed: 11/18/2022] Open
Abstract
Background
Zipf's law and Heaps' law are observed in disparate complex systems. Of particular interest, these two laws often appear together. Many theoretical models and analyses have been performed to understand their co-occurrence in real systems, but a clear picture of their relation is still lacking.
Methodology/Principal Findings
We show that Heaps' law can be considered a derivative phenomenon if the system obeys Zipf's law. Furthermore, we refine the known approximate solution for the Heaps' exponent given the Zipf's exponent. We show that the approximate solution is indeed an asymptotic solution for infinite systems, while in a finite-size system the Heaps' exponent is sensitive to the system size. Extensive empirical analysis on tens of disparate systems demonstrates that our refined results better capture the relation between the Zipf's and Heaps' exponents.
Conclusions/Significance
The present analysis provides a clear picture of the relation between Zipf's law and Heaps' law without the help of any specific stochastic model: Heaps' law is indeed a derivative phenomenon of Zipf's law. The presented numerical method gives a considerably better estimate of the Heaps' exponent given the Zipf's exponent and the system size. Our analysis provides some insights into and implications for real complex systems. For example, one can naturally obtain a better explanation of the accelerated growth of scale-free networks.
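As a rough numerical illustration of the relation described in this abstract (not the authors' code, and not their refined formula), the sketch below draws tokens from a Zipf rank distribution and tracks how the number of distinct words grows with text length. The function names, the parameter values, and the two-point log-log slope estimate are all assumptions made for this example. For a Zipf exponent α above 1, the commonly cited asymptotic Heaps exponent is 1/α; the abstract notes that finite systems deviate from that value.

```python
import math
import random
from itertools import accumulate


def distinct_word_curve(alpha, vocab_size, n_tokens, seed=0):
    """Return N(t), the number of distinct words seen after t tokens,
    for tokens sampled i.i.d. with P(rank r) proportional to r**(-alpha)."""
    rng = random.Random(seed)
    # Cumulative (unnormalized) Zipf weights over ranks 1..vocab_size.
    cum_weights = list(accumulate(r ** -alpha for r in range(1, vocab_size + 1)))
    tokens = rng.choices(range(vocab_size), cum_weights=cum_weights, k=n_tokens)
    seen = set()
    curve = []
    for tok in tokens:
        seen.add(tok)
        curve.append(len(seen))
    return curve


def loglog_slope(curve, t1, t2):
    """Two-point estimate of the Heaps exponent: slope of log N(t) vs log t
    between text lengths t1 and t2 (1-based token counts)."""
    n1, n2 = curve[t1 - 1], curve[t2 - 1]
    return (math.log(n2) - math.log(n1)) / (math.log(t2) - math.log(t1))


if __name__ == "__main__":
    alpha = 1.5  # hypothetical Zipf exponent chosen for illustration
    curve = distinct_word_curve(alpha, vocab_size=50_000, n_tokens=20_000, seed=1)
    print("estimated Heaps exponent:", round(loglog_slope(curve, 1_000, 20_000), 3))
    print("asymptotic value 1/alpha:", round(1 / alpha, 3))
```

Changing `vocab_size` or `n_tokens` in this sketch is one way to observe the size dependence of the measured exponent that the abstract emphasizes.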
Affiliation(s)
- Linyuan Lü
- Web Sciences Center, University of Electronic Science and Technology of China, Chengdu, People's Republic of China
- Department of Physics, University of Fribourg, Fribourg, Switzerland
- Zi-Ke Zhang
- Department of Physics, University of Fribourg, Fribourg, Switzerland
- Tao Zhou
- Web Sciences Center, University of Electronic Science and Technology of China, Chengdu, People's Republic of China
- Department of Physics, University of Fribourg, Fribourg, Switzerland
- Department of Modern Physics, University of Science and Technology of China, Hefei, People's Republic of China
|
24
|
|
25
|
Abstract
Structural 3D motifs in RNA play an important role in RNA stability and function. Previous studies have focused on the characterization and discovery of 3D motifs in RNA secondary and tertiary structures. However, statistical analyses of the distribution of 3D motifs along the RNA appear to be lacking. Herein, we present a novel strategy for evaluating the distribution of 3D motifs along the RNA chain and identifying those motifs whose distributions are significantly non-random. Applying it to the X-ray structure of the large ribosomal subunit from Haloarcula marismortui, we found that helical motifs cluster together along the chain and in the 3D structure, whereas the known tetraloops tend to be sequentially and spatially dispersed. That the distribution of key structural motifs such as tetraloops differs significantly from a random one suggests that our method could also be used to detect novel 3D motifs of any size in sufficiently long/large RNA structures. The motif distribution type can help in the prediction and design of 3D structures of large RNA molecules.
Affiliation(s)
- Karen Sargsyan
- Institute of Biomedical Sciences, Academia Sinica, Taipei 115, Taiwan
|