1
Taher Harikandeh SR, Aliakbary S, Taheri S. An embedding approach for analyzing the evolution of research topics with a case study on computer science subdomains. Scientometrics 2023; 128:1567-1582. [PMID: 36743778 PMCID: PMC9886542 DOI: 10.1007/s11192-023-04642-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/14/2022] [Accepted: 01/19/2023] [Indexed: 02/02/2023]
Abstract
The study of topic evolution aims to analyze the behavior of different research fields by utilizing various features such as the relationships between articles. In recent years, many published papers consider more than one field of study, which has led to a significant increase in the number of inter-field and interdisciplinary articles. Therefore, we can analyze the similarity/dissimilarity and convergence/divergence of research fields based on topic analysis of the published papers. Our research intends to create a methodology for studying the evolution of the research fields. In this paper, we propose an embedding approach for modeling each research topic as a multidimensional vector. Using this model, we measure the topics' distances over the years and investigate how topics evolve over time. The proposed similarity metric showed many advantages over other alternatives (such as Jaccard similarity) and it resulted in better stability and accuracy. As a case study, we applied the proposed method to subsets of computer science for experimental purposes, and the results were quite comprehensible and coherent.
Affiliation(s)
- Seyyed Reza Taher Harikandeh
- Faculty of Computer Science and Engineering, Shahid Beheshti University, Tehran, Iran
- Sadegh Aliakbary
- Faculty of Computer Science and Engineering, Shahid Beheshti University, Tehran, Iran
- Soroush Taheri
- Faculty of Computer Science and Engineering, Shahid Beheshti University, Tehran, Iran
2
Krenn M, Zeilinger A. Predicting research trends with semantic and neural networks with an application in quantum physics. Proc Natl Acad Sci U S A 2020; 117:1910-1916. [PMID: 31937664 PMCID: PMC6994972 DOI: 10.1073/pnas.1914370116] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Indexed: 01/08/2023] Open
Abstract
The vast and growing number of publications in all disciplines of science cannot be comprehended by a single human researcher. As a consequence, researchers have to specialize in narrow subdisciplines, which makes it challenging to uncover scientific connections beyond the own field of research. Thus, access to structured knowledge from a large corpus of publications could help push the frontiers of science. Here, we demonstrate a method to build a semantic network from published scientific literature, which we call SemNet. We use SemNet to predict future trends in research and to inspire personalized and surprising seeds of ideas in science. We apply it in the discipline of quantum physics, which has seen an unprecedented growth of activity in recent years. In SemNet, scientific knowledge is represented as an evolving network using the content of 750,000 scientific papers published since 1919. The nodes of the network correspond to physical concepts, and links between two nodes are drawn when two concepts are concurrently studied in research articles. We identify influential and prize-winning research topics from the past inside SemNet, thus confirming that it stores useful semantic knowledge. We train a neural network using states of SemNet of the past to predict future developments in quantum physics and confirm high-quality predictions using historic data. Using network theoretical tools, we can suggest personalized, out-of-the-box ideas by identifying pairs of concepts, which have unique and extremal semantic network properties. Finally, we consider possible future developments and implications of our findings.
Affiliation(s)
- Mario Krenn
- Faculty of Physics, Vienna Center for Quantum Science & Technology, University of Vienna, 1090 Vienna, Austria
- Institute for Quantum Optics and Quantum Information, Austrian Academy of Sciences, 1090 Vienna, Austria
- Department of Chemistry, University of Toronto, Toronto, ON M5S 3H6, Canada
- Department of Computer Science, University of Toronto, Toronto, ON M5T 3A1, Canada
- Vector Institute for Artificial Intelligence, Toronto, ON M5G 1M1, Canada
- Anton Zeilinger
- Faculty of Physics, Vienna Center for Quantum Science & Technology, University of Vienna, 1090 Vienna, Austria
- Institute for Quantum Optics and Quantum Information, Austrian Academy of Sciences, 1090 Vienna, Austria
3
Tarasova OA, Biziukova NY, Filimonov DA, Poroikov VV, Nicklaus MC. Data Mining Approach for Extraction of Useful Information About Biologically Active Compounds from Publications. J Chem Inf Model 2019; 59:3635-3644. [PMID: 31453694 DOI: 10.1021/acs.jcim.9b00164] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Indexed: 02/07/2023]
Abstract
A lot of high quality data on the biological activity of chemical compounds are required throughout the whole drug discovery process: from development of computational models of the structure-activity relationship to experimental testing of lead compounds and their validation in clinics. Currently, a large amount of such data is available from databases, scientific publications, and patents. Biological data are characterized by incompleteness, uncertainty, and low reproducibility. Despite the existence of free and commercially available databases of biological activities of compounds, they usually lack unambiguous information about peculiarities of biological assays. On the other hand, scientific papers are the primary source of new data disclosed to the scientific community for the first time. In this study, we have developed and validated a data-mining approach for extraction of text fragments containing description of bioassays. We have used this approach to evaluate compounds and their biological activity reported in scientific publications. We have found that categorization of papers into relevant and irrelevant may be performed based on the machine-learning analysis of the abstracts. Text fragments extracted from the full texts of publications allow their further partitioning into several classes according to the peculiarities of bioassays. We demonstrate the applicability of our approach to the comparison of the endpoint values of biological activity and cytotoxicity of reference compounds.
Affiliation(s)
- Olga A Tarasova
- Department of Bioinformatics, Institute of Biomedical Chemistry, 10 Building 8, Pogodinskaya Street, Moscow 119121, Russia
- Nadezhda Yu Biziukova
- Department of Bioinformatics, Institute of Biomedical Chemistry, 10 Building 8, Pogodinskaya Street, Moscow 119121, Russia
- Dmitry A Filimonov
- Department of Bioinformatics, Institute of Biomedical Chemistry, 10 Building 8, Pogodinskaya Street, Moscow 119121, Russia
- Vladimir V Poroikov
- Department of Bioinformatics, Institute of Biomedical Chemistry, 10 Building 8, Pogodinskaya Street, Moscow 119121, Russia
- Marc C Nicklaus
- Computer-Aided Drug Design Group, Chemical Biology Laboratory, Center for Cancer Research, National Cancer Institute, Frederick, Maryland 21702, United States
4
Sybrandt J, Shtutman M, Safro I. MOLIERE: Automatic Biomedical Hypothesis Generation System. KDD: Proceedings. International Conference on Knowledge Discovery & Data Mining 2017; 2017:1633-1642. [PMID: 29430330 PMCID: PMC5804740 DOI: 10.1145/3097983.3098057] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Indexed: 12/11/2022]
Abstract
Hypothesis generation is becoming a crucial time-saving technique which allows biomedical researchers to quickly discover implicit connections between important concepts. Typically, these systems operate on domain-specific fractions of public medical data. MOLIERE, in contrast, utilizes information from over 24.5 million documents. At the heart of our approach lies a multi-modal and multi-relational network of biomedical objects extracted from several heterogeneous datasets from the National Center for Biotechnology Information (NCBI). These objects include but are not limited to scientific papers, keywords, genes, proteins, diseases, and diagnoses. We model hypotheses using Latent Dirichlet Allocation applied on abstracts found near shortest paths discovered within this network, and demonstrate the effectiveness of MOLIERE by performing hypothesis generation on historical data. Our network, implementation, and resulting data are all publicly available for the broad scientific community.
Affiliation(s)
- Michael Shtutman
- University of South Carolina, Drug Discovery and Biomedical Sciences, Columbia SC, USA
- Ilya Safro
- Clemson University, School of Computing, Clemson SC, USA
5
Abbe A, Grouin C, Zweigenbaum P, Falissard B. Text mining applications in psychiatry: a systematic literature review. Int J Methods Psychiatr Res 2016; 25:86-100. [PMID: 26184780 PMCID: PMC6877250 DOI: 10.1002/mpr.1481] [Citation(s) in RCA: 59] [Impact Index Per Article: 7.4] [Received: 03/05/2014] [Revised: 01/21/2015] [Accepted: 04/09/2015] [Indexed: 11/08/2022] Open
Abstract
The expansion of biomedical literature is creating the need for efficient tools to keep pace with increasing volumes of information. Text mining (TM) approaches are becoming essential to facilitate the automated extraction of useful biomedical information from unstructured text. We reviewed the applications of TM in psychiatry, and explored its advantages and limitations. A systematic review of the literature was carried out using the CINAHL, Medline, EMBASE, PsycINFO and Cochrane databases. In this review, 1103 papers were screened, and 38 were included as applications of TM in psychiatric research. Using TM and content analysis, we identified four major areas of application: (1) Psychopathology (i.e. observational studies focusing on mental illnesses), (2) the Patient perspective (i.e. patients' thoughts and opinions), (3) Medical records (i.e. safety issues, quality of care and description of treatments), and (4) Medical literature (i.e. identification of new scientific information in the literature). The information sources were qualitative studies, Internet postings, medical records and biomedical literature. Our work demonstrates that TM can contribute to complex research tasks in psychiatry. We discuss the benefits, limits, and further applications of this tool in the future. Copyright © 2015 John Wiley & Sons, Ltd.
Affiliation(s)
- Adeline Abbe
- Inserm, U669, Paris, France; University Paris-Sud and University Paris Descartes, UMR-S0669, Paris, France
- Bruno Falissard
- Inserm, U669, Paris, France; University Paris-Sud and University Paris Descartes, UMR-S0669, Paris, France
6
Grady CR, Knepper MA, Burg MB, Ferraris JD. Database of osmoregulated proteins in mammalian cells. Physiol Rep 2014; 2:e12180. [PMID: 25355853 PMCID: PMC4254105 DOI: 10.14814/phy2.12180] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Received: 08/08/2014] [Revised: 09/15/2014] [Accepted: 09/29/2014] [Indexed: 11/24/2022] Open
Abstract
Biological information, even in highly specialized fields, is increasing at a volume that no single investigator can assimilate. The existence of this vast knowledge base creates the need for specialized computer databases to store and selectively sort the information. We have developed a manually curated database of the effects of hypertonicity on target proteins. Effects include changes in mRNA abundance and protein abundance, activity, phosphorylation state, binding, and cellular compartment. The biological information used in this database was derived from three research approaches: transcriptomic, proteomic, and reductionist (hypothesis-driven). The data are presented in the form of grammatical triplets consisting of subject, verb phrase, and object. The purpose of this format is to allow the data to be read from left to right as an English sentence. It is readable either by humans or by computers using natural language processing algorithms. An example of a data entry reads "Hypertonicity increases activity of ABL1 in HEK293." This database was created to provide access to a wealth of information on the effects of hypertonicity in a format that can be selectively sorted.
Affiliation(s)
- Cameron R. Grady
- Systems Biology Center, National Heart, Lung and Blood Institute, National Institutes of Health, Bethesda, Maryland, USA
- Mark A. Knepper
- Systems Biology Center, National Heart, Lung and Blood Institute, National Institutes of Health, Bethesda, Maryland, USA
- Maurice B. Burg
- Systems Biology Center, National Heart, Lung and Blood Institute, National Institutes of Health, Bethesda, Maryland, USA
- Joan D. Ferraris
- Systems Biology Center, National Heart, Lung and Blood Institute, National Institutes of Health, Bethesda, Maryland, USA
7
Sanghi A, Zaringhalam M, Corcoran CC, Saeed F, Hoffert JD, Sandoval P, Pisitkun T, Knepper MA. A knowledge base of vasopressin actions in the kidney. Am J Physiol Renal Physiol 2014; 307:F747-55. [PMID: 25056354 DOI: 10.1152/ajprenal.00012.2014] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Indexed: 11/22/2022] Open
Abstract
Biological information is growing at a rapid pace, making it difficult for individual investigators to be familiar with all information that is relevant to their own research. Computers are beginning to be used to extract and curate biological information; however, the complexity of human language used in research papers continues to be a critical barrier to full automation of knowledge extraction. Here, we report a manually curated knowledge base of vasopressin actions in renal epithelial cells that is designed to be readable either by humans or by computer programs using natural language processing algorithms. The knowledge base consists of three related databases accessible at https://helixweb.nih.gov/ESBL/TinyUrls/Vaso_portal.html. One of the component databases reports vasopressin actions on individual proteins expressed in renal epithelia, including effects on phosphorylation, protein abundances, protein translocation from one subcellular compartment to another, protein-protein binding interactions, etc. The second database reports vasopressin actions on physiological measures in renal epithelia, and the third reports specific mRNA species whose abundances change in response to vasopressin. We illustrate the application of the knowledge base by using it to generate a protein kinase network that connects vasopressin binding in collecting duct cells to physiological effects to regulate the water channel protein aquaporin-2.
Affiliation(s)
- Akshay Sanghi
- Epithelial Systems Biology Laboratory, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, Maryland
- Matthew Zaringhalam
- Epithelial Systems Biology Laboratory, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, Maryland
- Callan C Corcoran
- Epithelial Systems Biology Laboratory, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, Maryland
- Fahad Saeed
- Departments of Electrical and Computer Engineering and Computer Science, Western Michigan University, Kalamazoo, Michigan
- Jason D Hoffert
- Epithelial Systems Biology Laboratory, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, Maryland
- Pablo Sandoval
- Epithelial Systems Biology Laboratory, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, Maryland
- Trairak Pisitkun
- Faculty of Medicine, Chulalongkorn University, Bangkok, Thailand
- Mark A Knepper
- Epithelial Systems Biology Laboratory, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, Maryland
8
McDermott JE, Wang J, Mitchell H, Webb-Robertson BJ, Hafen R, Ramey J, Rodland KD. Challenges in Biomarker Discovery: Combining Expert Insights with Statistical Analysis of Complex Omics Data. Expert Opin Med Diagn 2012; 7:37-51. [PMID: 23335946 DOI: 10.1517/17530059.2012.718329] [Citation(s) in RCA: 121] [Impact Index Per Article: 10.1] [Indexed: 12/22/2022]
Abstract
INTRODUCTION: The advent of high throughput technologies capable of comprehensive analysis of genes, transcripts, proteins and other significant biological molecules has provided an unprecedented opportunity for the identification of molecular markers of disease processes. However, it has simultaneously complicated the problem of extracting meaningful molecular signatures of biological processes from these complex datasets. The process of biomarker discovery and characterization provides opportunities for more sophisticated approaches to integrating purely statistical and expert knowledge-based approaches. AREAS COVERED: In this review we will present examples of current practices for biomarker discovery from complex omic datasets and the challenges that have been encountered in deriving valid and useful signatures of disease. We will then present a high-level review of data-driven (statistical) and knowledge-based methods applied to biomarker discovery, highlighting some current efforts to combine the two distinct approaches. EXPERT OPINION: Effective, reproducible and objective tools for combining data-driven and knowledge-based approaches to identify predictive signatures of disease are key to future success in the biomarker field. We will describe our recommendations for possible approaches to this problem including metrics for the evaluation of biomarkers.
9
Bellazzi R, Diomidous M, Sarkar IN, Takabayashi K, Ziegler A, McCray AT. Data analysis and data mining: current issues in biomedical informatics. Methods Inf Med 2012; 50:536-44. [PMID: 22146916 DOI: 10.3414/me11-06-0002] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.3] [Indexed: 12/23/2022]
Abstract
BACKGROUND: Medicine and biomedical sciences have become data-intensive fields, which, at the same time, enable the application of data-driven approaches and require sophisticated data analysis and data mining methods. Biomedical informatics provides a proper interdisciplinary context to integrate data and knowledge when processing available information, with the aim of giving effective decision-making support in clinics and translational research. OBJECTIVES: To reflect on different perspectives related to the role of data analysis and data mining in biomedical informatics. METHODS: On the occasion of the 50th year of Methods of Information in Medicine a symposium was organized, which reflected on opportunities, challenges and priorities of organizing, representing and analysing data, information and knowledge in biomedicine and health care. The contributions of experts with a variety of backgrounds in the area of biomedical data analysis have been collected as one outcome of this symposium, in order to provide a broad, though coherent, overview of some of the most interesting aspects of the field. RESULTS: The paper presents sections on data accumulation and data-driven approaches in medical informatics, data and knowledge integration, statistical issues for the evaluation of data mining models, translational bioinformatics and bioinformatics aspects of genetic epidemiology. CONCLUSIONS: Biomedical informatics represents a natural framework to properly and effectively apply data analysis and data mining methods in a decision-making context. In the future, it will be necessary to preserve the inclusive nature of the field and to foster an increasing sharing of data and methods between researchers.
Affiliation(s)
- R Bellazzi
- University of Pavia, Dipartimento di Informatica e Sistemistica, Via Ferrata 1, 27100 Pavia (PV), Italy
10
Hallinan J. Data mining for microbiologists. J Microbiol Methods 2012. [DOI: 10.1016/b978-0-08-099387-4.00002-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/12/2023]