Reference Citation Analysis

For: Carpena P, Bernaola-Galván P, Hackenberg M, Coronado AV, Oliver JL. Level statistics of words: finding keywords in literary texts and symbolic sequences. Phys Rev E Stat Nonlin Soft Matter Phys 2009;79:035102. [PMID: 19392005 DOI: 10.1103/physreve.79.035102] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.3] [Received: 05/21/2008] [Indexed: 05/27/2023]

Cited by Other Article(s)

Automatic Keyphrase Extraction from Scientific Chinese Medical Abstracts Based on Character-Level Sequence Labeling. Journal of Data and Information Science 2021. [DOI: 10.2478/jdis-2021-0013] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/21/2022] Open

Abstract

PURPOSE

Automatic keyphrase extraction (AKE) is an important task for grasping the main points of a text. In this paper, we combine the benefits of the sequence labeling formulation and a pretrained language model to propose an automatic keyphrase extraction model for Chinese scientific research.

DESIGN/METHODOLOGY/APPROACH

We regard AKE from Chinese text as a character-level sequence labeling task to avoid the segmentation errors of Chinese tokenizers, and we initialize our model with the pretrained language model BERT, which was released by Google in 2018. We collect data from the Chinese Science Citation Database and construct a large-scale dataset from the medical domain, which contains 100,000 abstracts as the training set, 6,000 abstracts as the development set and 3,094 abstracts as the test set. We use unsupervised keyphrase extraction methods, including term frequency (TF), TF-IDF and TextRank, and supervised machine learning methods, including Conditional Random Field (CRF), Bidirectional Long Short-Term Memory Network (BiLSTM) and BiLSTM-CRF, as baselines. Experiments are designed to compare word-level and character-level sequence labeling approaches on supervised machine learning models and BERT-based models.

FINDINGS

Compared with character-level BiLSTM-CRF, the best baseline model with an F1 score of 50.16%, our character-level sequence labeling model based on BERT obtains an F1 score of 59.80%, an absolute improvement of 9.64 percentage points.

RESEARCH LIMITATIONS

We consider only the automatic keyphrase extraction task rather than the keyphrase generation task, so only keyphrases that occur in the given text can be extracted. In addition, our proposed dataset is not suitable for dealing with nested keyphrases.

PRACTICAL IMPLICATIONS

We make our character-level IOB format dataset of Chinese Automatic Keyphrase Extraction from scientific Chinese medical abstracts (CAKE) publicly available for the benefit of the research community at https://github.com/possible1402/Dataset-For-Chinese-Medical-Keyphrase-Extraction.

ORIGINALITY/VALUE

By designing comparative experiments, our study demonstrates that the character-level formulation is more suitable for the Chinese automatic keyphrase extraction task under the general trend of pretrained language models. Our proposed dataset provides a unified method for model evaluation and can promote the development of Chinese automatic keyphrase extraction to some extent.
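
The character-level formulation in the abstract above labels every character of a text rather than every segmented word. A minimal Python sketch of producing such B/I/O labels from a known keyphrase list is given here for illustration. The function name `char_iob_tags`, the longest-first matching rule, and the sample text are assumptions made for this sketch; it is not the procedure used to build the CAKE dataset.

```python
def char_iob_tags(text, keyphrases):
    """Label each character of `text` as B (begin), I (inside) or O (outside).

    Hypothetical helper for illustration only: matches keyphrases by exact
    substring search and never labels nested or overlapping spans.
    """
    tags = ["O"] * len(text)
    # Longest phrases first, so a shorter phrase cannot claim part of a longer one.
    for phrase in sorted(keyphrases, key=len, reverse=True):
        if not phrase:
            continue
        start = text.find(phrase)
        while start != -1:
            end = start + len(phrase)
            # Only label spans that do not overlap an already-labelled keyphrase.
            if all(tag == "O" for tag in tags[start:end]):
                tags[start] = "B"
                for i in range(start + 1, end):
                    tags[i] = "I"
            start = text.find(phrase, start + 1)
    return tags


if __name__ == "__main__":
    sample = "肺癌的治疗"  # assumed sample text: "treatment of lung cancer"
    for ch, tag in zip(sample, char_iob_tags(sample, ["肺癌"])):
        print(ch, tag)
```

Because labels attach to characters, no Chinese word segmenter is needed before tagging, which is the motivation the abstract gives for the character-level formulation.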

Moghaddasi H, Rezaei S, Darooneh AH, Heshmati E, Khalifeh K. A comparative analysis of dipeptides distribution in eukaryotes and prokaryotes by statistical mechanics. Physica A: Statistical Mechanics and its Applications 2020;555:124567. [DOI: 10.1016/j.physa.2020.124567] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/01/2023]

sCAKE: Semantic Connectivity Aware Keyword Extraction. Inf Sci (N Y) 2019. [DOI: 10.1016/j.ins.2018.10.034] [Citation(s) in RCA: 36] [Impact Index Per Article: 7.2] [Indexed: 11/23/2022]

Malik Z, Hashmi K, Najmi E, Rezgui A. Wisdom extraction in knowledge-based information systems. Journal of Knowledge Management 2019. [DOI: 10.1108/jkm-05-2018-0288] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 11/17/2022]

Abstract

PURPOSE

This paper aims to provide a number of distinct approaches towards this goal, i.e. to translate the information contained in the repositories into knowledge. For centuries, humans have gathered and generated data to study the different phenomena around them. Consequently, there are a variety of information repositories available in many different fields of study. However, the ability to access, integrate and properly interpret the relevant data sets in these repositories has mainly been limited by their ever expanding volumes. The goal of translating the available data to knowledge, eventually leading to wisdom, requires an understanding of the relations, ordering and associations among the data sets.

DESIGN/METHODOLOGY/APPROACH

While the existing information repositories are rich in content, there are no easy means of understanding the relevance or influence of the different facts contained therein. Therefore, the interest of the general populace in terms of prioritizing some data items (or facts) over others is usually lost. In this paper, the goal is to provide approaches for transforming the available facts in the information repositories to wisdom. The authors target the lack of order in the facts presented in the repositories to create a hierarchical distribution based on the common understanding, expectations, opinions and judgments of the different users.

FINDINGS

The authors present multiple approaches to extract and order the facts related to each concept, using both automatic and semi-automatic methods. The experiments show that the results of these approaches are similar and very close to the instinctive ordering of facts by users.

ORIGINALITY/VALUE

The authors believe that the work presented in this paper, with some additions, can be a feasible step to convert the available knowledge to wisdom and a step towards the future of online information systems.

Najafi E, Darooneh AH. Long range dependence in texts: A method for quantifying coherence of text. Knowl Based Syst 2017. [DOI: 10.1016/j.knosys.2017.06.032] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 10/19/2022]

Distinguishing Functional DNA Words; A Method for Measuring Clustering Levels. Sci Rep 2017;7:41543. [PMID: 28128320 PMCID: PMC5269680 DOI: 10.1038/srep41543] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 07/06/2016] [Accepted: 12/22/2016] [Indexed: 01/03/2023] Open

Striking similarities between publications from China describing single gene knockdown experiments in human cancer cell lines. Scientometrics 2016. [DOI: 10.1007/s11192-016-2209-6] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Indexed: 12/17/2022]

Model of the Dynamic Construction Process of Texts and Scaling Laws of Words Organization in Language Systems. PLoS One 2016;11:e0168971. [PMID: 28006026 PMCID: PMC5179102 DOI: 10.1371/journal.pone.0168971] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 09/20/2016] [Accepted: 12/11/2016] [Indexed: 11/19/2022] Open

Carpena P, Bernaola-Galván PA, Carretero-Campos C, Coronado AV. Probability distribution of intersymbol distances in random symbolic sequences: Applications to improving detection of keywords in texts and of amino acid clustering in proteins. Phys Rev E 2016;94:052302. [PMID: 27967154 DOI: 10.1103/physreve.94.052302] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 06/17/2016] [Indexed: 11/07/2022]

Li Z, Cao H, Cui Y, Zhang Y. Extracting DNA words based on the sequence features: non-uniform distribution and integrity. Theor Biol Med Model 2016;13:2. [PMID: 26811154 PMCID: PMC4727310 DOI: 10.1186/s12976-016-0028-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 10/31/2015] [Accepted: 01/14/2016] [Indexed: 12/02/2022] Open

Amancio DR. A Complex Network Approach to Stylometry. PLoS One 2015;10:e0136076. [PMID: 26313921 PMCID: PMC4552030 DOI: 10.1371/journal.pone.0136076] [Citation(s) in RCA: 54] [Impact Index Per Article: 6.0] [Received: 05/27/2015] [Accepted: 07/22/2015] [Indexed: 11/18/2022] Open

Najafi E, Darooneh AH. The Fractal Patterns of Words in a Text: A Method for Automatic Keyword Extraction. PLoS One 2015;10:e0130617. [PMID: 26091207 PMCID: PMC4474631 DOI: 10.1371/journal.pone.0130617] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Received: 06/25/2014] [Accepted: 05/21/2015] [Indexed: 11/18/2022] Open

Bao J, Yuan R, Bao Z. An improved alignment-free model for DNA sequence similarity metric. BMC Bioinformatics 2014;15:321. [PMID: 25261973 PMCID: PMC4261891 DOI: 10.1186/1471-2105-15-321] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Received: 12/09/2013] [Accepted: 09/23/2014] [Indexed: 11/23/2022] Open

Abstract

BACKGROUND

DNA clustering is an important technology for automatically finding the inherent relationships in large collections of DNA sequences, but clustering quality can still be improved greatly. The DNA sequence similarity metric is one of the key points of clustering. The alignment-free methodology is a very popular way to calculate DNA sequence similarity. It normally converts a sequence into a feature space based on the probability distribution of words rather than directly matching strings. Existing alignment-free models, e.g. k-tuple, merely employ word frequency information and ignore many types of useful information contained in the DNA sequence, such as classifications of nucleotide bases, position and the like. It is believed that better data mining results can be achieved with compounded information. Therefore, we present a new alignment-free model that employs compounded information to improve the DNA clustering quality.

RESULTS

This paper proposes a Category-Position-Frequency (CPF) model, which utilizes the word frequency, position and classification information of nucleotide bases from DNA sequences. The CPF model converts a DNA sequence into three sequences according to the categories of nucleotide bases, and then yields a 12-dimensional feature vector. The feature values are computed by an entropy-based model that takes both local word frequency and position information into account. We conduct DNA clustering experiments on several datasets and compare with some mainstream alignment-free models for evaluation, including k-tuple, DMk, TSM, AMI and CV. The experiments show that the CPF model is superior to the other models in terms of the clustering results and optimal settings.

CONCLUSIONS

The following conclusions can be drawn from the experiments. (1) The hybrid information model is better than the model based on word frequency only. (2) For DNA sequences of no more than 5000 characters, the preferred sliding-window size for CPF is two, which provides a great advantage in promoting system performance. (3) The CPF model is able to obtain efficient, stable performance and broad generalization.
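
For context, the k-tuple baseline named in this abstract represents a sequence purely by the frequencies of its length-k words. A minimal Python sketch of such a frequency vector is given here for illustration. The function name `ktuple_frequencies` and its defaults are assumptions made for this sketch; it shows the frequency-only baseline idea and is not an implementation of the CPF model.

```python
from collections import Counter
from itertools import product


def ktuple_frequencies(seq, k=2, alphabet="ACGT"):
    """Return the relative frequency of every length-k word over `alphabet`.

    Hypothetical helper for illustration only. Words are counted with a
    sliding window of step 1; windows containing symbols outside `alphabet`
    (for example N) are ignored when normalizing.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Fixed word order (lexicographic over the alphabet) so vectors are comparable.
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]


if __name__ == "__main__":
    vec = ktuple_frequencies("ACGTACGTAC", k=2)
    print(len(vec), round(sum(vec), 6))
```

Two sequences can then be compared by any vector distance between their frequency vectors, without aligning them. Such a vector discards where each word occurs, which is the limitation the abstract says the CPF model addresses.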

Dios F, Barturen G, Lebrón R, Rueda A, Hackenberg M, Oliver JL. DNA clustering and genome complexity. Comput Biol Chem 2014;53 Pt A:71-8. [PMID: 25182383 DOI: 10.1016/j.compbiolchem.2014.08.011] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Accepted: 07/11/2014] [Indexed: 01/08/2023]

Abstract

Early global measures of genome complexity (power spectra, the analysis of fluctuations in DNA walks or compositional segmentation) uncovered a high degree of complexity in eukaryotic genome sequences. The main evolutionary mechanisms leading to increases in genome complexity (i.e. gene duplication and transposon proliferation) can all potentially produce increases in DNA clustering. To quantify such clustering and provide a genome-wide description of the formed clusters, we developed GenomeCluster, an algorithm able to detect clusters of any genome element identified by chromosome coordinates. We obtained a detailed description of clusters for ten categories of human genome elements, including functional (genes, exons, introns), regulatory (CpG islands, TFBSs, enhancers), variant (SNPs) and repeat (Alus, LINE1) elements, as well as DNase hypersensitivity sites. For each category, we located their clusters in the human genome, then quantified cluster length and composition, and estimated the clustering level as the proportion of clustered genome elements. On average, we found 27% of elements in clusters, although considerable variation occurs among different categories. Genes form the lowest number of clusters, but these are the longest ones, both in bp and the average number of components, while the shortest clusters are formed by SNPs. Functional and regulatory elements (genes, CpG islands, TFBSs, enhancers) show the highest clustering level, as compared to DNase sites, repeats (Alus, LINE1) or SNPs. Many of the genome elements we analyzed are known to be composed of clusters of low-level entities. In addition, we found here that the clusters generated by GenomeCluster can be in turn clustered into high-level super-clusters. The observation of 'clusters-within-clusters' parallels the 'domains within domains' phenomenon previously detected through global statistical methods in eukaryotic sequences, and reveals a complex human genome landscape dominated by hierarchical clustering.

Allahverdyan AE, Deng W, Wang QA. Explaining Zipf's law via a mental lexicon. Phys Rev E Stat Nonlin Soft Matter Phys 2013;88:062804. [PMID: 24483508 DOI: 10.1103/physreve.88.062804] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 12/21/2012] [Revised: 05/22/2013] [Indexed: 06/03/2023]

Keywords and Co-Occurrence Patterns in the Voynich Manuscript: An Information-Theoretic Analysis. PLoS One 2013;8:e66344. [PMID: 23805215 PMCID: PMC3689824 DOI: 10.1371/journal.pone.0066344] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.0] [Received: 02/07/2013] [Accepted: 05/08/2013] [Indexed: 11/23/2022] Open

Bernaola-Galván P, Oliver J, Hackenberg M, Coronado A, Ivanov P, Carpena P. Segmentation of time series with long-range fractal correlations. Eur Phys J B 2012;85:211. [PMID: 23645997 PMCID: PMC3643524 DOI: 10.1140/epjb/e2012-20969-5] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 06/02/2023]

Clustering of DNA words and biological function: A proof of principle. J Theor Biol 2012;297:127-36. [DOI: 10.1016/j.jtbi.2011.12.024] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.8] [Received: 10/17/2011] [Revised: 12/20/2011] [Accepted: 12/21/2011] [Indexed: 02/08/2023]

Fridman M, Pugatch R, Nixon M, Friesem AA, Davidson N. Measuring maximal eigenvalue distribution of Wishart random matrices with coupled lasers. Phys Rev E Stat Nonlin Soft Matter Phys 2012;85:020101. [PMID: 22463135 DOI: 10.1103/physreve.85.020101] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 12/16/2011] [Indexed: 05/31/2023]

Frahm KM, Shepelyansky DL. Poincaré recurrences of DNA sequences. Phys Rev E Stat Nonlin Soft Matter Phys 2012;85:016214. [PMID: 22400650 DOI: 10.1103/physreve.85.016214] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 09/02/2011] [Revised: 12/01/2011] [Indexed: 05/31/2023]

Mehri A, Darooneh AH. Keyword extraction by nonextensivity measure. Phys Rev E Stat Nonlin Soft Matter Phys 2011;83:056106. [PMID: 21728604 DOI: 10.1103/physreve.83.056106] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 08/09/2010] [Indexed: 05/31/2023]

Hackenberg M, Carpena P, Bernaola-Galván P, Barturen G, Alganza ÁM, Oliver JL. WordCluster: detecting clusters of DNA words and genomic elements. Algorithms Mol Biol 2011;6:2. [PMID: 21261981 PMCID: PMC3037320 DOI: 10.1186/1748-7188-6-2] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.1] [Received: 08/30/2010] [Accepted: 01/24/2011] [Indexed: 01/26/2023] Open

Lü L, Zhang ZK, Zhou T. Zipf's law leads to Heaps' law: analyzing their relation in finite-size systems. PLoS One 2010;5:e14139. [PMID: 21152034 PMCID: PMC2996287 DOI: 10.1371/journal.pone.0014139] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.4] [Received: 02/21/2010] [Accepted: 10/20/2010] [Indexed: 11/18/2022] Open

Fitting Ranked Linguistic Data with Two-Parameter Functions. Entropy 2010. [DOI: 10.3390/e12071743] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.9] [Indexed: 11/16/2022]

Sargsyan K, Lim C. Arrangement of 3D structural motifs in ribosomal RNA. Nucleic Acids Res 2010;38:3512-22. [PMID: 20159997 PMCID: PMC2887949 DOI: 10.1093/nar/gkq074] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 11/30/2022] Open