1
Koptelov M, Holveck M, Cremilleux B, Reynaud J, Roche M, Teisseire M. A manually annotated corpus in French for the study of urbanization and the natural risk prevention. Sci Data 2023; 10:818. [PMID: 37993460 PMCID: PMC10665325 DOI: 10.1038/s41597-023-02705-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2023] [Accepted: 10/31/2023] [Indexed: 11/24/2023] [Open access]
Abstract
Land artificialization is a serious problem of civilization. Urban planning and natural risk management are aimed to improve it. In France, these practices operate the Local Land Plans (PLU - Plan Local d'Urbanisme) and the Natural risk prevention plans (PPRn - Plan de Prévention des Risques naturels) containing land use rules. To facilitate automatic extraction of the rules, we manually annotated a number of those documents concerning Montpellier, a rapidly evolving agglomeration exposed to natural risks. We defined a format for labeled examples in which each entry includes title and subtitle. In addition, we proposed a hierarchical representation of class labels to generalize the use of our corpus. Our corpus, consisting of 1934 textual segments, each of which labeled by one of the 4 classes (Verifiable, Non-verifiable, Informative and Not pertinent) is the first corpus in the French language in the fields of urban planning and natural risk management. Along with presenting the corpus, we tested a state-of-the-art approach for text classification to demonstrate its usability for automatic rule extraction.
Affiliation(s)
- Maksim Koptelov
  - UNICAEN, ENSICAEN, CNRS - UMR GREYC, 14000, Caen, France
  - INRAE, F-34398, Montpellier, France
  - UMR TETIS, Univ. Montpellier, AgroParisTech, CIRAD, CNRS, INRAE, Montpellier, 34090, France
- Mathieu Roche
  - UMR TETIS, Univ. Montpellier, AgroParisTech, CIRAD, CNRS, INRAE, Montpellier, 34090, France
  - French Agricultural Research for Development (CIRAD), Montpellier, France
- Maguelonne Teisseire
  - INRAE, F-34398, Montpellier, France
  - UMR TETIS, Univ. Montpellier, AgroParisTech, CIRAD, CNRS, INRAE, Montpellier, 34090, France
2
Zhu M, Cole JM. PDFDataExtractor: A Tool for Reading Scientific Text and Interpreting Metadata from the Typeset Literature in the Portable Document Format. J Chem Inf Model 2022; 62:1633-1643. [PMID: 35349259 PMCID: PMC9049592 DOI: 10.1021/acs.jcim.1c01198] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/28/2022]
Abstract
The layout of portable document format (PDF) files is constant to any screen, and the metadata therein are latent, compared to mark-up languages such as HTML and XML. No semantic tags are usually provided, and a PDF file is not designed to be edited or its data interpreted by software. However, data held in PDF files need to be extracted in order to comply with open-source data requirements that are now government-regulated. In the chemical domain, related chemical and property data also need to be found, and their correlations need to be exploited to enable data science in areas such as data-driven materials discovery. Such relationships may be realized using text-mining software such as the “chemistry-aware” natural-language-processing tool, ChemDataExtractor; however, this tool has limited data-extraction capabilities from PDF files. This study presents the PDFDataExtractor tool, which can act as a plug-in to ChemDataExtractor. It outperforms other PDF-extraction tools for the chemical literature by coupling its functionalities to the chemical-named entity-recognition capabilities of ChemDataExtractor. The intrinsic PDF-reading abilities of ChemDataExtractor are much improved. The system features a template-based architecture. This enables semantic information to be extracted from the PDF files of scientific articles in order to reconstruct the logical structure of articles. While other existing PDF-extracting tools focus on quantity mining, this template-based system is more focused on quality mining on different layouts. PDFDataExtractor outputs information in JSON and plain text, including the metadata of a PDF file, such as paper title, authors, affiliation, email, abstract, keywords, journal, year, document object identifier (DOI), reference, and issue number. With a self-created evaluation article set, PDFDataExtractor achieved promising precision for all key assessed metadata areas of the document text.
Affiliation(s)
- Miao Zhu
  - Cavendish Laboratory, Department of Physics, University of Cambridge, J. J. Thomson Avenue, Cambridge CB3 0HE, U.K.
- Jacqueline M Cole
  - Cavendish Laboratory, Department of Physics, University of Cambridge, J. J. Thomson Avenue, Cambridge CB3 0HE, U.K.
  - ISIS Neutron and Muon Source, STFC Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0QX, U.K.
  - Department of Chemical Engineering and Biotechnology, University of Cambridge, West Cambridge Site, Philippa Fawcett Drive, Cambridge CB3 0AS, U.K.
3
Automatic classification of literature in systematic reviews on food safety using machine learning. Curr Res Food Sci 2022; 5:84-95. [PMID: 35024621 PMCID: PMC8728304 DOI: 10.1016/j.crfs.2021.12.010] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 06/11/2021] [Revised: 12/10/2021] [Accepted: 12/19/2021] [Indexed: 12/02/2022] [Open access]
Abstract
Systematic reviews are used to collect relevant literature to answer a research question in a way that is clear, thorough, unbiased and reproducible. They are implemented as a standard method in the domain of food safety to obtain a literature overview on the state-of-the-art research related to food safety topics of interest. A disadvantage to systematic reviews, however, is that this process is time-consuming and requires expert domain knowledge. The work reported here aims to reduce the time needed by an expert to screen all possible relevant articles by applying machine learning techniques to classify the articles automatically as either relevant or not relevant. Eight different machine learning algorithms and ensembles of all combinations of these algorithms were tested on two different systematic reviews on food safety (i.e. chemical hazards in cereals and leafy greens). The results showed that the best performance was obtained by an ensemble of naive Bayes and a support vector machine, resulting in an average decrease of 32.8% in the amount of articles the expert has to read and an average decrease in irrelevant articles of 57.8% while keeping 95% of the relevant articles. It was concluded that automatic classification of the literature in a systematic literature review can support experts in their task and save valuable time without compromising the quality of the review.
4
Rajkumar S, Muthukumar S, Aparna S. S., Gladston A. An Improved Text Extraction Approach With Auto Encoder for Creating Your Own Audiobook. International Journal of Information Retrieval Research 2022. [DOI: 10.4018/ijirr.289570] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/08/2022]
Abstract
As we all know, listening makes learning easier and interesting than reading. An audiobook is a software that converts text to speech. Though this sounds good, the audiobooks available in the market are not free and feasible for everyone. Added to this, we find that these audiobooks are only meant for fictional stories, novels or comics. A comprehensive review of the available literature shows that very little intensive work was done for image to speech conversion. In this paper, we employ various strategies for the entire process. As an initial step, deep learning techniques are constructed to denoise the images that are fed to the system. This is followed by text extraction with the help of OCR engines. Additional improvements are made to improve the quality of text extraction and post processing spell check mechanism are incorporated for this purpose. Our result analysis demonstrates that with denoising and spell checking, our model has achieved an accuracy of 98.11% when compared to 84.02% without any denoising or spell check mechanism.
5
Roldán-Molina GR, Ruano-Ordás D, Basto-Fernandes V, Méndez JR. An ontology knowledge inspection methodology for quality assessment and continuous improvement. Data Knowl Eng 2021. [DOI: 10.1016/j.datak.2021.101889] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 10/21/2022]
6
Kononova O, He T, Huo H, Trewartha A, Olivetti EA, Ceder G. Opportunities and challenges of text mining in materials research. iScience 2021; 24:102155. [PMID: 33665573 PMCID: PMC7905448 DOI: 10.1016/j.isci.2021.102155] [Citation(s) in RCA: 33] [Impact Index Per Article: 11.0] [Indexed: 12/19/2022] [Open access]
Abstract
Research publications are the major repository of scientific knowledge. However, their unstructured and highly heterogenous format creates a significant obstacle to large-scale analysis of the information contained within. Recent progress in natural language processing (NLP) has provided a variety of tools for high-quality information extraction from unstructured text. These tools are primarily trained on non-technical text and struggle to produce accurate results when applied to scientific text, involving specific technical terminology. During the last years, significant efforts in information retrieval have been made for biomedical and biochemical publications. For materials science, text mining (TM) methodology is still at the dawn of its development. In this review, we survey the recent progress in creating and applying TM and NLP approaches to materials science field. This review is directed at the broad class of researchers aiming to learn the fundamentals of TM as applied to the materials science publications.
Affiliation(s)
- Olga Kononova
  - Department of Materials Science & Engineering, University of California, Berkeley, CA 94720, USA
  - Materials Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Tanjin He
  - Department of Materials Science & Engineering, University of California, Berkeley, CA 94720, USA
  - Materials Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Haoyan Huo
  - Department of Materials Science & Engineering, University of California, Berkeley, CA 94720, USA
  - Materials Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Amalie Trewartha
  - Materials Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Elsa A. Olivetti
  - Department of Materials Science & Engineering, MIT, Cambridge, MA 02139, USA
- Gerbrand Ceder
  - Department of Materials Science & Engineering, University of California, Berkeley, CA 94720, USA
  - Materials Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
7
Alpizar-Chacon I, Sosnovsky S. Knowledge models from PDF textbooks. New Rev Hypermedia M 2021. [DOI: 10.1080/13614568.2021.1889692] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/22/2022]
8
Arnaboldi V, Raciti D, Van Auken K, Chan JN, Müller HM, Sternberg PW. Text mining meets community curation: a newly designed curation platform to improve author experience and participation at WormBase. Database (Oxford) 2020; 2020:baaa006. [PMID: 32185395 PMCID: PMC7078066 DOI: 10.1093/database/baaa006] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 10/29/2019] [Revised: 01/08/2020] [Accepted: 01/14/2020] [Indexed: 01/17/2023]
Abstract
Biological knowledgebases rely on expert biocuration of the research literature to maintain up-to-date collections of data organized in machine-readable form. To enter information into knowledgebases, curators need to follow three steps: (i) identify papers containing relevant data, a process called triaging; (ii) recognize named entities; and (iii) extract and curate data in accordance with the underlying data models. WormBase (WB), the authoritative repository for research data on Caenorhabditis elegans and other nematodes, uses text mining (TM) to semi-automate its curation pipeline. In addition, WB engages its community, via an Author First Pass (AFP) system, to help recognize entities and classify data types in their recently published papers. In this paper, we present a new WB AFP system that combines TM and AFP into a single application to enhance community curation. The system employs string-searching algorithms and statistical methods (e.g. support vector machines (SVMs)) to extract biological entities and classify data types, and it presents the results to authors in a web form where they validate the extracted information, rather than enter it de novo as the previous form required. With this new system, we lessen the burden for authors, while at the same time receive valuable feedback on the performance of our TM tools. The new user interface also links out to specific structured data submission forms, e.g. for phenotype or expression pattern data, giving the authors the opportunity to contribute a more detailed curation that can be incorporated into WB with minimal curator review. Our approach is generalizable and could be applied to additional knowledgebases that would like to engage their user community in assisting with the curation. In the five months succeeding the launch of the new system, the response rate has been comparable with that of the previous AFP version, but the quality and quantity of the data received has greatly improved.
Affiliation(s)
- Valerio Arnaboldi
  - Division of Biology and Biological Engineering 156–29, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
- Daniela Raciti
  - Division of Biology and Biological Engineering 156–29, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
- Kimberly Van Auken
  - Division of Biology and Biological Engineering 156–29, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
- Juancarlos N Chan
  - Division of Biology and Biological Engineering 156–29, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
- Hans-Michael Müller
  - Division of Biology and Biological Engineering 156–29, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
- Paul W Sternberg
  - Division of Biology and Biological Engineering 156–29, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
9
Abstract
PubMed contains more than 27 million documents, and this number is growing at an estimated 4% per year. Even within specialized topics, it is no longer possible for a researcher to read any field in its entirety, and thus nobody has a complete picture of the scientific knowledge in any given field at any time. Text mining provides a means to automatically read this corpus and to extract the relations found therein as structured information. Having data in a structured format is a huge boon for computational efforts to access, cross reference, and mine the data stored therein. This is increasingly useful as biological research is becoming more focused on systems and multi-omics integration. This chapter provides an overview of the steps that are required for text mining: tokenization, named entity recognition, normalization, event extraction, and benchmarking. It discusses a variety of approaches to these tasks and then goes into detail on how to prepare data for use specifically with the JensenLab tagger. This software uses a dictionary-based approach and provides the text mining evidence for STRING and several other databases.
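As a rough illustration of the dictionary-based approach mentioned in this abstract, the sketch below chains tokenization, dictionary lookup (named entity recognition), and normalization to an identifier. It is not the JensenLab tagger or its API: the `DICTIONARY` contents, the identifiers, and the function names `tokenize` and `tag` are all hypothetical and chosen only for this example.

```python
import re

# Hypothetical dictionary: lower-cased surface form -> normalized identifier.
# Names and identifiers are invented for illustration only.
DICTIONARY = {
    "tp53": "GENE:0001",
    "p53": "GENE:0001",
    "tumor protein p53": "GENE:0001",
    "breast cancer": "DISEASE:0042",
}

# Longest dictionary entry, measured in words, bounds the lookup window.
MAX_WORDS = max(len(name.split()) for name in DICTIONARY)


def tokenize(text):
    """Return (token, start, end) triples for runs of word characters."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]


def tag(text):
    """Greedy longest-match lookup of token windows in the dictionary.

    Returns (surface text, start offset, end offset, identifier) tuples.
    """
    tokens = tokenize(text)
    matches = []
    i = 0
    while i < len(tokens):
        hit = None
        # Try the widest window first so multi-word names win over sub-parts.
        for n in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            start = tokens[i][1]
            end = tokens[i + n - 1][2]
            candidate = " ".join(tok.lower() for tok, _, _ in tokens[i:i + n])
            if candidate in DICTIONARY:
                hit = (text[start:end], start, end, DICTIONARY[candidate])
                i += n
                break
        if hit:
            matches.append(hit)
        else:
            i += 1
    return matches


print(tag("Mutations in TP53 are common in breast cancer."))
# [('TP53', 13, 17, 'GENE:0001'), ('breast cancer', 32, 45, 'DISEASE:0042')]
```

Mapping several surface forms to one identifier is what the abstract calls normalization; event extraction and benchmarking are outside the scope of this sketch.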
Affiliation(s)
- Helen V Cook
  - School of Clinical Medicine, University of Cambridge, Cambridge, UK
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Lars Juhl Jensen
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
10
Jiang X, Ringwald M, Blake JA, Arighi C, Zhang G, Shatkay H. An effective biomedical document classification scheme in support of biocuration: addressing class imbalance. Database (Oxford) 2019; 2019:baz045. [PMID: 31032839 PMCID: PMC6482935 DOI: 10.1093/database/baz045] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 10/31/2018] [Revised: 02/26/2019] [Accepted: 03/18/2019] [Indexed: 01/01/2023]
Abstract
Published literature is an important source of knowledge supporting biomedical research. Given the large and increasing number of publications, automated document classification plays an important role in biomedical research. Effective biomedical document classifiers are especially needed for bio-databases, in which the information stems from many thousands of biomedical publications that curators must read in detail and annotate. In addition, biomedical document classification often amounts to identifying a small subset of relevant publications within a much larger collection of available documents. As such, addressing class imbalance is essential to a practical classifier. We present here an effective classification scheme for automatically identifying papers among a large pool of biomedical publications that contain information relevant to a specific topic, which the curators are interested in annotating. The proposed scheme is based on a meta-classification framework using cluster-based under-sampling combined with named-entity recognition and statistical feature selection strategies. We examined the performance of our method over a large imbalanced data set that was originally manually curated by the Jackson Laboratory's Gene Expression Database (GXD). The set consists of more than 90 000 PubMed abstracts, of which about 13 000 documents are labeled as relevant to GXD while the others are not relevant. Our results, 0.72 precision, 0.80 recall and 0.75 f-measure, demonstrate that our proposed classification scheme effectively categorizes such a large data set in the face of data imbalance.
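The precision, recall and F-measure quoted in this abstract are the usual metrics for imbalanced classification, where plain accuracy can mislead: taking the abstract's approximate figures of 13 000 relevant documents out of 90 000, a classifier that labels everything as not relevant would be right about 86% of the time while finding nothing. The sketch below shows how the three metrics follow from confusion counts. The function name and the counts in the usage line are hypothetical and are not taken from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion counts.

    tp: relevant documents correctly flagged
    fp: irrelevant documents wrongly flagged
    fn: relevant documents missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for illustration only.
p, r, f = precision_recall_f1(tp=80, fp=20, fn=20)
print(p, r, f)
```

True negatives do not appear in any of the three formulas, which is why these metrics remain informative when the irrelevant class dominates the collection.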
Affiliation(s)
- Xiangying Jiang
  - Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA
- Judith A Blake
  - The Jackson Laboratory, 600 Main St., Bar Harbor, ME, USA
- Cecilia Arighi
  - Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA
  - Center of Bioinformatics and Computational Biology, Delaware Biotechnology Institute, Newark, DE, USA
- Gongbo Zhang
  - Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA
- Hagit Shatkay
  - Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA
  - Center of Bioinformatics and Computational Biology, Delaware Biotechnology Institute, Newark, DE, USA
11
Westergaard D, Stærfeldt HH, Tønsberg C, Jensen LJ, Brunak S. A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts. PLoS Comput Biol 2018; 14:e1005962. [PMID: 29447159 PMCID: PMC5831415 DOI: 10.1371/journal.pcbi.1005962] [Citation(s) in RCA: 87] [Impact Index Per Article: 14.5] [Received: 07/05/2017] [Revised: 02/28/2018] [Accepted: 01/05/2018] [Indexed: 12/21/2022] [Open access]
Abstract
Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, due to their availability. Here we present an analysis of 15 million English scientific full-text articles published during the period 1823-2016. We describe the development in article length and publication sub-topics during these nearly 250 years. We showcase the potential of text mining by extracting published protein-protein, disease-gene, and protein subcellular associations using a named entity recognition system, and quantitatively report on their accuracy using gold standard benchmark data sets. We subsequently compare the findings to corresponding results obtained on 16.5 million abstracts included in MEDLINE and show that text mining of full-text articles consistently outperforms using abstracts only.
Affiliation(s)
- David Westergaard
  - Center for Biological Sequence Analysis, Department of Bio and Health Informatics, Technical University of Denmark, Lyngby, Denmark
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Hans-Henrik Stærfeldt
  - Center for Biological Sequence Analysis, Department of Bio and Health Informatics, Technical University of Denmark, Lyngby, Denmark
- Christian Tønsberg
  - Office for Innovation and Sector Services, Technical Information Center of Denmark, Technical University of Denmark, Lyngby, Denmark
- Lars Juhl Jensen (corresponding author)
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Søren Brunak (corresponding author)
  - Center for Biological Sequence Analysis, Department of Bio and Health Informatics, Technical University of Denmark, Lyngby, Denmark
12
Salloum SA, Al-Emran M, Monem AA, Shaalan K. Using Text Mining Techniques for Extracting Information from Research Articles. Intelligent Natural Language Processing: Trends and Applications 2018:373-397. [DOI: 10.1007/978-3-319-67056-0_18] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 09/01/2023]
13
Krallinger M, Rabal O, Lourenço A, Oyarzabal J, Valencia A. Information Retrieval and Text Mining Technologies for Chemistry. Chem Rev 2017; 117:7673-7761. [PMID: 28475312 DOI: 10.1021/acs.chemrev.6b00851] [Citation(s) in RCA: 111] [Impact Index Per Article: 15.9] [Indexed: 01/03/2023]
Abstract
Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
Affiliation(s)
- Martin Krallinger
  - Structural Computational Biology Group, Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre, C/Melchor Fernández Almagro 3, Madrid E-28029, Spain
- Obdulia Rabal
  - Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Avenida Pio XII 55, Pamplona E-31008, Spain
- Anália Lourenço
  - ESEI - Department of Computer Science, University of Vigo, Edificio Politécnico, Campus Universitario As Lagoas s/n, Ourense E-32004, Spain
  - Centro de Investigaciones Biomédicas (Centro Singular de Investigación de Galicia), Campus Universitario Lagoas-Marcosende, Vigo E-36310, Spain
  - CEB-Centre of Biological Engineering, University of Minho, Campus de Gualtar, Braga 4710-057, Portugal
- Julen Oyarzabal
  - Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Avenida Pio XII 55, Pamplona E-31008, Spain
- Alfonso Valencia
  - Life Science Department, Barcelona Supercomputing Centre (BSC-CNS), C/Jordi Girona, 29-31, Barcelona E-08034, Spain
  - Joint BSC-IRB-CRG Program in Computational Biology, Parc Científic de Barcelona, C/ Baldiri Reixac 10, Barcelona E-08028, Spain
  - Institució Catalana de Recerca i Estudis Avançats (ICREA), Passeig de Lluís Companys 23, Barcelona E-08010, Spain
14
Bimba AT, Idris N, Al-Hunaiyyan A, Mahmud RB, Abdelaziz A, Khan S, Chang V. Towards knowledge modeling and manipulation technologies: A survey. International Journal of Information Management 2016. [DOI: 10.1016/j.ijinfomgt.2016.05.022] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.0] [Indexed: 11/25/2022]
15
Ahmad R, Afzal MT, Qadir MA. Information Extraction from PDF Sources Based on Rule-Based System Using Integrated Formats. Semantic Web Challenges 2016. [DOI: 10.1007/978-3-319-46565-4_23] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Indexed: 12/16/2022]
16
Ahmed Z, Dandekar T. MSL: Facilitating automatic and physical analysis of published scientific literature in PDF format. F1000Res 2015; 4:1453. [PMID: 29721305 PMCID: PMC5897790 DOI: 10.12688/f1000research.7329.3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Accepted: 03/26/2018] [Indexed: 01/12/2023] [Open access]
Abstract
Published scientific literature contains millions of figures, including information about the results obtained from different scientific experiments e.g. PCR-ELISA data, microarray analysis, gel electrophoresis, mass spectrometry data, DNA/RNA sequencing, diagnostic imaging (CT/MRI and ultrasound scans), and medicinal imaging like electroencephalography (EEG), magnetoencephalography (MEG), echocardiography (ECG), positron-emission tomography (PET) images. The importance of biomedical figures has been widely recognized in scientific and medicine communities, as they play a vital role in providing major original data, experimental and computational results in concise form. One major challenge for implementing a system for scientific literature analysis is extracting and analyzing text and figures from published PDF files by physical and logical document analysis. Here we present a product line architecture based bioinformatics tool ‘Mining Scientific Literature (MSL)’, which supports the extraction of text and images by interpreting all kinds of published PDF files using advanced data mining and image processing techniques. It provides modules for the marginalization of extracted text based on different coordinates and keywords, visualization of extracted figures and extraction of embedded text from all kinds of biological and biomedical figures using applied Optical Character Recognition (OCR). Moreover, for further analysis and usage, it generates the system’s output in different formats including text, PDF, XML and image files. Hence, MSL is an easy to install and use analysis tool to interpret published scientific literature in PDF format.
Affiliation(s)
- Zeeshan Ahmed
  - Genetics and Genome Sciences, School of Medicine, University of Connecticut Health Center, Farmington, CT, 06032, USA
  - Institute for Systems Genomics, University of Connecticut Health Center, Farmington, CT, 06032, USA
- Thomas Dandekar
  - Department of Bioinformatics, Biocenter, University of Wuerzburg, Wuerzburg, 97074, Germany
17
Ronzano F, Saggion H. Dr. Inventor Framework: Extracting Structured Information from Scientific Publications. Discovery Science 2015. [DOI: 10.1007/978-3-319-24282-8_18] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Indexed: 12/13/2022]
18
Klampfl S, Granitzer M, Jack K, Kern R. Unsupervised document structure analysis of digital scientific articles. International Journal on Digital Libraries 2014. [DOI: 10.1007/s00799-014-0115-1] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Indexed: 10/25/2022]
19
Tripathy SJ, Savitskaya J, Burton SD, Urban NN, Gerkin RC. NeuroElectro: a window to the world's neuron electrophysiology data. Front Neuroinform 2014; 8:40. [PMID: 24808858 PMCID: PMC4010726 DOI: 10.3389/fninf.2014.00040] [Citation(s) in RCA: 65] [Impact Index Per Article: 6.5] [Received: 12/02/2013] [Accepted: 03/27/2014] [Indexed: 11/25/2022] [Open access]
Abstract
The behavior of neural circuits is determined largely by the electrophysiological properties of the neurons they contain. Understanding the relationships of these properties requires the ability to first identify and catalog each property. However, information about such properties is largely locked away in decades of closed-access journal articles with heterogeneous conventions for reporting results, making it difficult to utilize the underlying data. We solve this problem through the NeuroElectro project: a Python library, RESTful API, and web application (at http://neuroelectro.org) for the extraction, visualization, and summarization of published data on neurons' electrophysiological properties. Information is organized both by neuron type (using neuron definitions provided by NeuroLex) and by electrophysiological property (using a newly developed ontology). We describe the techniques and challenges associated with the automated extraction of tabular electrophysiological data and methodological metadata from journal articles. We further discuss strategies for how to best combine, normalize and organize data across these heterogeneous sources. NeuroElectro is a valuable resource for experimental physiologists attempting to supplement their own data, for computational modelers looking to constrain their model parameters, and for theoreticians searching for undiscovered relationships among neurons and their properties.
Affiliation(s)
- Shreejoy J. Tripathy
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA, USA
- Center for the Neural Basis of Cognition, Carnegie Mellon University, Pittsburgh, PA, USA
- Judith Savitskaya
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA, USA
- Shawn D. Burton
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA, USA
- Center for the Neural Basis of Cognition, Carnegie Mellon University, Pittsburgh, PA, USA
- Nathaniel N. Urban
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA, USA
- Center for the Neural Basis of Cognition, Carnegie Mellon University, Pittsburgh, PA, USA
- Richard C. Gerkin
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA, USA
- Center for the Neural Basis of Cognition, Carnegie Mellon University, Pittsburgh, PA, USA
|
20
|
Dunn AG, Coiera E, Mandl KD. Is Biblioleaks inevitable? J Med Internet Res 2014; 16:e112. [PMID: 24755534 PMCID: PMC4019771 DOI: 10.2196/jmir.3331] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Received: 02/16/2014] [Revised: 04/03/2014] [Accepted: 04/13/2014] [Indexed: 11/13/2022] Open
Abstract
In 2014, the vast majority of published biomedical research is still hidden behind paywalls rather than open access. For more than a decade, similar restrictions over other digitally available content have engendered illegal activity. Music file sharing became rampant in the late 1990s as communities formed around new ways to share. The frequency and scale of cyber-attacks against commercial and government interests has increased dramatically. Massive troves of classified government documents have become public through the actions of a few. Yet we have not seen significant growth in the illegal sharing of peer-reviewed academic articles. Should we truly expect that biomedical publishing is somehow at less risk than other content-generating industries? What of the larger threat—a “Biblioleaks” event—a database breach and public leak of the substantial archives of biomedical literature? As the expectation that all research should be available to everyone becomes the norm for a younger generation of researchers and the broader community, the motivations for such a leak are likely to grow. We explore the feasibility and consequences of a Biblioleaks event for researchers, journals, publishers, and the broader communities of doctors and the patients they serve.
Affiliation(s)
- Adam G Dunn
- Centre for Health Informatics, Australian Institute of Health Innovation, The University of New South Wales, Sydney, Australia.
|
21
|
Klampfl S, Kern R. An Unsupervised Machine Learning Approach to Body Text and Table of Contents Extraction from Digital Scientific Articles. Research and Advanced Technology for Digital Libraries 2013. [DOI: 10.1007/978-3-642-40501-3_15] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Indexed: 12/12/2022]
|
22
|
Abstract
The wealth and diversity of neuroscience research are inherent characteristics of the discipline that can give rise to some complications. As the field continues to expand, we generate a great deal of data about all aspects, and from multiple perspectives, of the brain, its chemistry, biology, and how these affect behavior. The vast majority of research scientists cannot afford to spend their time combing the literature to find every article related to their research, nor do they wish to spend time adjusting their neuroanatomical vocabulary to communicate with other subdomains in the neurosciences. As such, there has been a recent increase in the amount of informatics research devoted to developing digital resources for neuroscience research. Neuroinformatics is concerned with the development of computational tools to further our understanding of the brain and to make sense of the vast amount of information that neuroscientists generate (French & Pavlidis, 2007). Many of these tools are related to the use of textual data. Here, we review some of the recent developments for better using the vast amount of textual information generated in neuroscience research and publication and suggest several use cases that will demonstrate how bench neuroscientists can take advantage of the resources that are available.
Affiliation(s)
- Kyle H Ambert
- Department of Medical Informatics & Clinical Epidemiology, Oregon Health & Science University, Portland, OR, USA.
|