Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Ramakrishnan C, Patnia A, Hovy E, Burns GA. Layout-aware text extraction from full-text PDF of scientific articles. Source Code Biol Med 2012;7:7. [PMID: 22640904 DOI: 10.1186/1751-0473-7-7] [Citation(s) in RCA: 52] [Impact Index Per Article: 4.3] [Received: 04/24/2012] [Accepted: 05/28/2012] [Indexed: 11/17/2022]

Cited by Other Article(s)

Koptelov M, Holveck M, Cremilleux B, Reynaud J, Roche M, Teisseire M. A manually annotated corpus in French for the study of urbanization and the natural risk prevention. Sci Data 2023;10:818. [PMID: 37993460 PMCID: PMC10665325 DOI: 10.1038/s41597-023-02705-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2023] [Accepted: 10/31/2023] [Indexed: 11/24/2023] Open

Zhu M, Cole JM. PDFDataExtractor: A Tool for Reading Scientific Text and Interpreting Metadata from the Typeset Literature in the Portable Document Format. J Chem Inf Model 2022;62:1633-1643. [PMID: 35349259 PMCID: PMC9049592 DOI: 10.1021/acs.jcim.1c01198] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/28/2022]

Abstract

The layout of portable document format (PDF) files is constant to any screen, and the metadata therein are latent, compared to mark-up languages such as HTML and XML. No semantic tags are usually provided, and a PDF file is not designed to be edited or its data interpreted by software. However, data held in PDF files need to be extracted in order to comply with open-source data requirements that are now government-regulated. In the chemical domain, related chemical and property data also need to be found, and their correlations need to be exploited to enable data science in areas such as data-driven materials discovery. Such relationships may be realized using text-mining software such as the “chemistry-aware” natural-language-processing tool, ChemDataExtractor; however, this tool has limited data-extraction capabilities from PDF files. This study presents the PDFDataExtractor tool, which can act as a plug-in to ChemDataExtractor. It outperforms other PDF-extraction tools for the chemical literature by coupling its functionalities to the chemical-named entity-recognition capabilities of ChemDataExtractor. The intrinsic PDF-reading abilities of ChemDataExtractor are much improved. The system features a template-based architecture. This enables semantic information to be extracted from the PDF files of scientific articles in order to reconstruct the logical structure of articles. While other existing PDF-extracting tools focus on quantity mining, this template-based system is more focused on quality mining on different layouts. PDFDataExtractor outputs information in JSON and plain text, including the metadata of a PDF file, such as paper title, authors, affiliation, email, abstract, keywords, journal, year, document object identifier (DOI), reference, and issue number. With a self-created evaluation article set, PDFDataExtractor achieved promising precision for all key assessed metadata areas of the document text.

Automatic classification of literature in systematic reviews on food safety using machine learning. Curr Res Food Sci 2022;5:84-95. [PMID: 35024621 PMCID: PMC8728304 DOI: 10.1016/j.crfs.2021.12.010] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 06/11/2021] [Revised: 12/10/2021] [Accepted: 12/19/2021] [Indexed: 12/02/2022] Open

Rajkumar S, Muthukumar S, Aparna SS, Gladston A. An Improved Text Extraction Approach With Auto Encoder for Creating Your Own Audiobook. INTERNATIONAL JOURNAL OF INFORMATION RETRIEVAL RESEARCH 2022. [DOI: 10.4018/ijirr.289570] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/08/2022]

Roldán-Molina GR, Ruano-Ordás D, Basto-Fernandes V, Méndez JR. An ontology knowledge inspection methodology for quality assessment and continuous improvement. DATA KNOWL ENG 2021. [DOI: 10.1016/j.datak.2021.101889] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 10/21/2022]

Kononova O, He T, Huo H, Trewartha A, Olivetti EA, Ceder G. Opportunities and challenges of text mining in materials research. iScience 2021;24:102155. [PMID: 33665573 PMCID: PMC7905448 DOI: 10.1016/j.isci.2021.102155] [Citation(s) in RCA: 33] [Impact Index Per Article: 11.0] [Indexed: 12/19/2022] Open

Alpizar-Chacon I, Sosnovsky S. Knowledge models from PDF textbooks. NEW REV HYPERMEDIA M 2021. [DOI: 10.1080/13614568.2021.1889692] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/22/2022]

Arnaboldi V, Raciti D, Van Auken K, Chan JN, Müller HM, Sternberg PW. Text mining meets community curation: a newly designed curation platform to improve author experience and participation at WormBase. Database (Oxford) 2020;2020:baaa006. [PMID: 32185395 PMCID: PMC7078066 DOI: 10.1093/database/baaa006] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 10/29/2019] [Revised: 01/08/2020] [Accepted: 01/14/2020] [Indexed: 01/17/2023]

Abstract

Biological knowledgebases rely on expert biocuration of the research literature to maintain up-to-date collections of data organized in machine-readable form. To enter information into knowledgebases, curators need to follow three steps: (i) identify papers containing relevant data, a process called triaging; (ii) recognize named entities; and (iii) extract and curate data in accordance with the underlying data models. WormBase (WB), the authoritative repository for research data on Caenorhabditis elegans and other nematodes, uses text mining (TM) to semi-automate its curation pipeline. In addition, WB engages its community, via an Author First Pass (AFP) system, to help recognize entities and classify data types in their recently published papers. In this paper, we present a new WB AFP system that combines TM and AFP into a single application to enhance community curation. The system employs string-searching algorithms and statistical methods (e.g. support vector machines (SVMs)) to extract biological entities and classify data types, and it presents the results to authors in a web form where they validate the extracted information, rather than enter it de novo as the previous form required. With this new system, we lessen the burden for authors, while at the same time receive valuable feedback on the performance of our TM tools. The new user interface also links out to specific structured data submission forms, e.g. for phenotype or expression pattern data, giving the authors the opportunity to contribute a more detailed curation that can be incorporated into WB with minimal curator review. Our approach is generalizable and could be applied to additional knowledgebases that would like to engage their user community in assisting with the curation. In the five months succeeding the launch of the new system, the response rate has been comparable with that of the previous AFP version, but the quality and quantity of the data received has greatly improved.

Cook HV, Jensen LJ. A Guide to Dictionary-Based Text Mining. Methods Mol Biol 2019;1939:73-89. [PMID: 30848457 DOI: 10.1007/978-1-4939-9089-4_5] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Indexed: 12/13/2022]

Jiang X, Ringwald M, Blake JA, Arighi C, Zhang G, Shatkay H. An effective biomedical document classification scheme in support of biocuration: addressing class imbalance. Database (Oxford) 2019;2019:baz045. [PMID: 31032839 PMCID: PMC6482935 DOI: 10.1093/database/baz045] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 10/31/2018] [Revised: 02/26/2019] [Accepted: 03/18/2019] [Indexed: 01/01/2023]

Westergaard D, Stærfeldt HH, Tønsberg C, Jensen LJ, Brunak S. A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts. PLoS Comput Biol 2018;14:e1005962. [PMID: 29447159 PMCID: PMC5831415 DOI: 10.1371/journal.pcbi.1005962] [Citation(s) in RCA: 87] [Impact Index Per Article: 14.5] [Received: 07/05/2017] [Revised: 02/28/2018] [Accepted: 01/05/2018] [Indexed: 12/21/2022] Open

Salloum SA, Al-Emran M, Monem AA, Shaalan K. Using Text Mining Techniques for Extracting Information from Research Articles. INTELLIGENT NATURAL LANGUAGE PROCESSING: TRENDS AND APPLICATIONS 2018:373-397. [DOI: 10.1007/978-3-319-67056-0_18] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 09/01/2023]

Krallinger M, Rabal O, Lourenço A, Oyarzabal J, Valencia A. Information Retrieval and Text Mining Technologies for Chemistry. Chem Rev 2017;117:7673-7761. [PMID: 28475312 DOI: 10.1021/acs.chemrev.6b00851] [Citation(s) in RCA: 111] [Impact Index Per Article: 15.9] [Indexed: 01/03/2023]

Bimba AT, Idris N, Al-Hunaiyyan A, Mahmud RB, Abdelaziz A, Khan S, Chang V. Towards knowledge modeling and manipulation technologies: A survey. INTERNATIONAL JOURNAL OF INFORMATION MANAGEMENT 2016. [DOI: 10.1016/j.ijinfomgt.2016.05.022] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.0] [Indexed: 11/25/2022]

Ahmad R, Afzal MT, Qadir MA. Information Extraction from PDF Sources Based on Rule-Based System Using Integrated Formats. SEMANTIC WEB CHALLENGES 2016. [DOI: 10.1007/978-3-319-46565-4_23] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Indexed: 12/16/2022]

Ahmed Z, Dandekar T. MSL: Facilitating automatic and physical analysis of published scientific literature in PDF format. F1000Res 2015;4:1453. [PMID: 29721305 PMCID: PMC5897790 DOI: 10.12688/f1000research.7329.3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Accepted: 03/26/2018] [Indexed: 01/12/2023] Open

Ronzano F, Saggion H. Dr. Inventor Framework: Extracting Structured Information from Scientific Publications. DISCOVERY SCIENCE 2015. [DOI: 10.1007/978-3-319-24282-8_18] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Indexed: 12/13/2022]

Klampfl S, Granitzer M, Jack K, Kern R. Unsupervised document structure analysis of digital scientific articles. INTERNATIONAL JOURNAL ON DIGITAL LIBRARIES 2014. [DOI: 10.1007/s00799-014-0115-1] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Indexed: 10/25/2022]

Tripathy SJ, Savitskaya J, Burton SD, Urban NN, Gerkin RC. NeuroElectro: a window to the world's neuron electrophysiology data. Front Neuroinform 2014;8:40. [PMID: 24808858 PMCID: PMC4010726 DOI: 10.3389/fninf.2014.00040] [Citation(s) in RCA: 65] [Impact Index Per Article: 6.5] [Received: 12/02/2013] [Accepted: 03/27/2014] [Indexed: 11/25/2022] Open

Dunn AG, Coiera E, Mandl KD. Is Biblioleaks inevitable? J Med Internet Res 2014;16:e112. [PMID: 24755534 PMCID: PMC4019771 DOI: 10.2196/jmir.3331] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Received: 02/16/2014] [Revised: 04/03/2014] [Accepted: 04/13/2014] [Indexed: 11/13/2022] Open

Klampfl S, Kern R. An Unsupervised Machine Learning Approach to Body Text and Table of Contents Extraction from Digital Scientific Articles. RESEARCH AND ADVANCED TECHNOLOGY FOR DIGITAL LIBRARIES 2013. [DOI: 10.1007/978-3-642-40501-3_15] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Indexed: 12/12/2022]

Ambert KH, Cohen AM. Text-mining and neuroscience. INTERNATIONAL REVIEW OF NEUROBIOLOGY 2012. [PMID: 23195123 DOI: 10.1016/b978-0-12-388408-4.00006-x] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 01/29/2023]