Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Poulter GL, Rubin DL, Altman RB, Seoighe C. MScanner: a classifier for retrieving Medline citations. BMC Bioinformatics 2008;9:108. [PMID: 18284683 PMCID: PMC2263023 DOI: 10.1186/1471-2105-9-108] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.8] [Received: 09/07/2007] [Accepted: 02/19/2008] [Indexed: 11/24/2022] Open

Cited by Other Article(s)

Zhang L, Lu W, Chen H, Huang Y, Cheng Q. A comparative evaluation of biomedical similar article recommendation. J Biomed Inform 2022;131:104106. [PMID: 35661818 DOI: 10.1016/j.jbi.2022.104106] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/10/2021] [Revised: 05/27/2022] [Accepted: 05/28/2022] [Indexed: 11/28/2022]

Abstract

BACKGROUND

Biomedical sciences, with their focus on human health and disease, have attracted unprecedented attention in the 21st century. The proliferation of biomedical sciences has also led to a large number of scientific articles being produced, which makes it difficult for biomedical researchers to find relevant articles and hinders the dissemination of valuable discoveries. To bridge this gap, the research community has initiated the article recommendation task, with the aim of recommending articles to biomedical researchers automatically based on their research interests. Over the past two decades, many recommendation methods have been developed. However, an algorithm-level comparison and rigorous evaluation of the most important methods on a shared dataset is still lacking.

METHOD

In this study, we first investigate 15 methods for automated article recommendation in the biomedical domain. We then conduct an empirical evaluation of the 15 methods, including six term-based methods, two word embedding methods, three sentence embedding methods, two document embedding methods, and two BERT-based methods. These methods are evaluated in two scenarios: article-oriented recommenders and user-oriented recommenders, with two publicly available datasets: TREC 2005 Genomics and RELISH, respectively.

RESULTS

Our experimental results show that the text representation models BERT and BioSenVec outperform many existing recommendation methods (e.g., BM25, PMRA, XPRC) and web-based recommendation systems (e.g., MScanner, MedlineRanker, BioReader) on both datasets regarding most of the evaluation metrics, and fine-tuning can improve the performance of the BERT-based methods.

CONCLUSIONS

Our comparison study is useful for researchers and practitioners in selecting the best modeling strategies for building article recommendation systems in the biomedical domain. The code and datasets are publicly available.

Simon C, Davidsen K, Hansen C, Seymour E, Barnkob MB, Olsen LR. BioReader: a text mining tool for performing classification of biomedical literature. BMC Bioinformatics 2019;19:57. [PMID: 30717659 PMCID: PMC7394276 DOI: 10.1186/s12859-019-2607-x] [Citation(s) in RCA: 29] [Impact Index Per Article: 5.8] [Received: 05/25/2018] [Accepted: 01/04/2019] [Indexed: 02/01/2023] Open

Abstract

Background

Scientific data and research results are being published at an unprecedented rate. Many database curators and researchers utilize data and information from the primary literature to populate databases, form hypotheses, or as the basis for analyses or validation of results. These efforts largely rely on manual literature surveys for collection of these data, and while querying the vast amounts of literature using keywords is enabled by repositories such as PubMed, filtering relevant articles from such query results can be a non-trivial and highly time-consuming task.

Results

We here present a tool that enables users to perform classification of scientific literature by text mining-based classification of article abstracts. BioReader (Biomedical Research Article Distiller) is trained by uploading article corpora for two training categories - e.g. one positive and one negative for content of interest - as well as one corpus of abstracts to be classified and/or a search string to query PubMed for articles. The corpora are submitted as lists of PubMed IDs and the abstracts are automatically downloaded from PubMed, preprocessed, and the unclassified corpus is classified using the best performing classification algorithm out of ten implemented algorithms.

Conclusion

BioReader supports data and information collection by implementing text mining-based classification of primary biomedical literature in a web interface, thus enabling curators and researchers to take advantage of the vast amounts of data and information in the published literature. BioReader outperforms existing tools with similar functionalities and expands the features used for mining literature in database curation efforts. The tool is freely available as a web service at http://www.cbs.dtu.dk/services/BioReader

Electronic supplementary material

The online version of this article (10.1186/s12859-019-2607-x) contains supplementary material, which is available to authorized users.
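
The workflow described in the BioReader abstract above (train on one positive and one negative corpus, then label an unclassified corpus) can be illustrated with a minimal sketch. This is not BioReader's implementation: the abstract does not name its ten algorithms, so a small multinomial naive Bayes classifier is swapped in here as an assumption, and the function names and toy abstracts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(positive_docs, negative_docs):
    """Count token frequencies for each of the two training categories."""
    counts = {"pos": Counter(), "neg": Counter()}
    for doc in positive_docs:
        counts["pos"].update(tokenize(doc))
    for doc in negative_docs:
        counts["neg"].update(tokenize(doc))
    vocab = set(counts["pos"]) | set(counts["neg"])
    priors = {"pos": len(positive_docs), "neg": len(negative_docs)}
    return counts, vocab, priors


def classify(text, model):
    """Return 'pos' or 'neg' using naive Bayes with add-one smoothing."""
    counts, vocab, priors = model
    total_docs = priors["pos"] + priors["neg"]
    scores = {}
    for label in ("pos", "neg"):
        total_tokens = sum(counts[label].values())
        score = math.log(priors[label] / total_docs)
        for tok in tokenize(text):
            if tok in vocab:  # tokens never seen in training are ignored
                score += math.log(
                    (counts[label][tok] + 1) / (total_tokens + len(vocab))
                )
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy corpora standing in for downloaded PubMed abstracts.
positive = [
    "T cell epitope binding to MHC molecules",
    "Epitope prediction for MHC class I binding",
]
negative = [
    "Crop yield under drought stress",
    "Soil nitrogen and wheat yield",
]
model = train(positive, negative)
label_a = classify("MHC binding of a novel epitope", model)
label_b = classify("Wheat yield in dry soil", model)
```

A real tool would fetch abstracts by PubMed ID and compare several algorithms by cross-validation before choosing one; both steps are omitted here.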

Brown P, Zhou Y. Large expert-curated database for benchmarking document similarity detection in biomedical literature search. Database (Oxford) 2019;2019:baz085. [PMID: 33326193 PMCID: PMC7291946 DOI: 10.1093/database/baz085] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 03/11/2019] [Revised: 05/15/2019] [Accepted: 05/31/2019] [Indexed: 02/07/2023]

Abstract

Document recommendation systems for locating relevant literature have mostly relied on methods developed a decade ago. This is largely due to the lack of a large offline gold-standard benchmark of relevant documents that cover a variety of research fields such that newly developed literature search techniques can be compared, improved and translated into practice. To overcome this bottleneck, we have established the RElevant LIterature SearcH consortium consisting of more than 1500 scientists from 84 countries, who have collectively annotated the relevance of over 180 000 PubMed-listed articles with regard to their respective seed (input) article/s. The majority of annotations were contributed by highly experienced, original authors of the seed articles. The collected data cover 76% of all unique PubMed Medical Subject Headings descriptors. No systematic biases were observed across different experience levels, research fields or time spent on annotations. More importantly, annotations of the same document pairs contributed by different scientists were highly concordant. We further show that the three representative baseline methods used to generate recommended articles for evaluation (Okapi Best Matching 25, Term Frequency-Inverse Document Frequency and PubMed Related Articles) had similar overall performances. Additionally, we found that these methods each tend to produce distinct collections of recommended articles, suggesting that a hybrid method may be required to completely capture all relevant articles. The established database server located at https://relishdb.ict.griffith.edu.au is freely available for the downloading of annotation data and the blind testing of new methods. We expect that this benchmark will be useful for stimulating the development of new powerful techniques for title and title/abstract-based search engines for relevant articles in biomedical research.
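
One of the baselines named in the abstract above, Term Frequency-Inverse Document Frequency, together with the observation that different methods return distinct article sets, can be sketched as follows. This is a minimal illustration, not the RELISH code: the smoothed IDF formula, the function names, and the toy titles are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF (an assumption): terms found in every document
    # still keep a small positive weight.
    idf = {tok: math.log((1 + n) / (1 + d)) + 1 for tok, d in df.items()}
    return [
        {tok: tf * idf[tok] for tok, tf in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(seed, candidates):
    """Return candidate indices, most similar to the seed first."""
    vectors = tfidf_vectors([seed] + candidates)
    seed_vec, cand_vecs = vectors[0], vectors[1:]
    scores = [cosine(seed_vec, v) for v in cand_vecs]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


def jaccard(a, b):
    """Overlap between two recommendation lists (1.0 means identical sets)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


# Hypothetical seed title and candidate titles.
seed = "classifier for retrieving Medline citations"
candidates = [
    "soil nitrogen and wheat yield",
    "ranking Medline citations with a text classifier",
    "protein structure prediction",
]
ranking = rank_candidates(seed, candidates)
overlap = jaccard([1, 2, 3], [2, 3, 4])
```

A low Jaccard overlap between the top-ranked lists of two methods is one simple way to quantify how distinct their recommendations are.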

Krallinger M, Rabal O, Lourenço A, Oyarzabal J, Valencia A. Information Retrieval and Text Mining Technologies for Chemistry. Chem Rev 2017;117:7673-7761. [PMID: 28475312 DOI: 10.1021/acs.chemrev.6b00851] [Citation(s) in RCA: 111] [Impact Index Per Article: 15.9] [Indexed: 01/03/2023]

Ahmed Z, Zeeshan S, Dandekar T. Mining biomedical images towards valuable information retrieval in biomedical and life sciences. Database (Oxford) 2016;2016:baw118. [PMID: 27538578 PMCID: PMC4990152 DOI: 10.1093/database/baw118] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 12/14/2015] [Revised: 06/07/2016] [Accepted: 07/19/2016] [Indexed: 12/22/2022]

Wei W, Marmor R, Singh S, Wang S, Demner-Fushman D, Kuo TT, Hsu CN, Ohno-Machado L. Finding Related Publications: Extending the Set of Terms Used to Assess Article Similarity. AMIA Joint Summits on Translational Science Proceedings. AMIA Joint Summits on Translational Science 2016;2016:225-34. [PMID: 27570676 PMCID: PMC5001748] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]

Ji Y, Ying H, Tran J, Dews P, Massanari RM. Integrating unified medical language system and association mining techniques into relevance feedback for biomedical literature search. BMC Bioinformatics 2016;17 Suppl 9:264. [PMID: 27453982 PMCID: PMC4959361 DOI: 10.1186/s12859-016-1129-z] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Indexed: 11/14/2022] Open

Ahmed Z, Dandekar T. MSL: Facilitating automatic and physical analysis of published scientific literature in PDF format. F1000Res 2015;4:1453. [PMID: 29721305 PMCID: PMC5897790 DOI: 10.12688/f1000research.7329.3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Accepted: 03/26/2018] [Indexed: 01/12/2023] Open

French L, Liu P, Marais O, Koreman T, Tseng L, Lai A, Pavlidis P. Text mining for neuroanatomy using WhiteText with an updated corpus and a new web application. Front Neuroinform 2015;9:13. [PMID: 26052282 PMCID: PMC4439553 DOI: 10.3389/fninf.2015.00013] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Received: 08/26/2014] [Accepted: 05/07/2015] [Indexed: 11/13/2022] Open

Hariri N, Ravandi SN. Comparing the Precision of Information Retrieval of MeSH-Controlled Vocabulary Search Method and a Visual Method in the Medline Medical Database. Electron Physician 2015;6:832-7. [PMID: 25763155 PMCID: PMC4324271 DOI: 10.14661/2014.832-837] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/22/2013] [Revised: 05/19/2013] [Accepted: 05/08/2014] [Indexed: 11/06/2022] Open

Abstract

BACKGROUND

Medline is one of the most important databases in the biomedical field. One of the most important hosts for Medline is Elton B. Stephens Co. (EBSCO), which offers different search methods that can be used based on the needs of the users. Visual search and MeSH-controlled search methods are among the most common methods. The goal of this research was to compare the precision of the retrieved sources in the EBSCO Medline database using MeSH-controlled and visual search methods.

METHODS

This research was a semi-empirical study. In training workshops held in 2012, 70 higher-education students from different educational departments of Kashan University of Medical Sciences were taught the MeSH-controlled and visual search methods. The precision of 300 searches made by these students was then calculated using the Best Precision, Useful Precision, and Objective Precision formulas and analyzed in SPSS software using the independent-samples t-test, and the values obtained with the three precision formulas were studied for the two search methods.

RESULTS

The mean precision of the visual method was greater than that of the MeSH-controlled search for all three types of precision, i.e. Best Precision, Useful Precision, and Objective Precision, and their mean precisions were significantly different (P < 0.001). Sixty-five percent of the researchers indicated that, although the visual method was better than the controlled method, the control of keywords in the controlled method resulted in finding more appropriate keywords for the searches. Fifty-three percent of the participants in the research also mentioned that the use of the combination of the two methods produced better results.

CONCLUSION

For users, it is more appropriate to use a natural-language-based method, such as the visual method, in the EBSCO Medline host than to use the controlled method, which requires users to use special keywords. The potential reason for their preference was that the visual method allowed them more freedom of action.

Papanikolaou N, Pavlopoulos GA, Pafilis E, Theodosiou T, Schneider R, Satagopam VP, Ouzounis CA, Eliopoulos AG, Promponas VJ, Iliopoulos I. BioTextQuest(+): a knowledge integration platform for literature mining and concept discovery. Bioinformatics 2014;30:3249-56. [PMID: 25100685 DOI: 10.1093/bioinformatics/btu524] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Indexed: 01/09/2023]

Abstract

SUMMARY

The iterative process of finding relevant information in biomedical literature and performing bioinformatics analyses might result in an endless loop for an inexperienced user, considering the exponential growth of scientific corpora and the plethora of tools designed to mine PubMed® and related biological databases. Herein, we describe BioTextQuest(+), a web-based interactive knowledge exploration platform with significant advances to its predecessor (BioTextQuest), aiming to bridge processes such as bioentity recognition, functional annotation, document clustering and data integration towards literature mining and concept discovery. BioTextQuest(+) enables PubMed and OMIM querying, retrieval of abstracts related to a targeted request and optimal detection of genes, proteins, molecular functions, pathways and biological processes within the retrieved documents. The front-end interface facilitates the browsing of document clustering per subject, the analysis of term co-occurrence, the generation of tag clouds containing highly represented terms per cluster and at-a-glance popup windows with information about relevant genes and proteins. Moreover, to support experimental research, BioTextQuest(+) addresses integration of its primary functionality with biological repositories and software tools able to deliver further bioinformatics services. The Google-like interface extends beyond simple use by offering a range of advanced parameterization for expert users. We demonstrate the functionality of BioTextQuest(+) through several exemplary research scenarios including author disambiguation, functional term enrichment, knowledge acquisition and concept discovery linking major human diseases, such as obesity and ageing.

AVAILABILITY

The service is accessible at http://bioinformatics.med.uoc.gr/biotextquest.

CONTACT

g.pavlopoulos@gmail.com or georgios.pavlopoulos@esat.kuleuven.be

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Affiliation(s)

Nikolas Papanikolaou, Georgios A Pavlopoulos, Evangelos Pafilis, Theodosios Theodosiou, Reinhard Schneider, Venkata P Satagopam, Christos A Ouzounis, Aristides G Eliopoulos, Vasilis J Promponas, Ioannis Iliopoulos: Division of Basic Sciences, University of Crete, Medical School, Heraklion 71110, Greece, Institute of Marine Biology, Biotechnology and Aquaculture (IMBBC), Hellenic Centre for Marine Research (HCMR), Heraklion, Greece, Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Campus Belval, 7, avenue des Hauts-Fourneaux, L-4362 Esch sur Alzette, Luxembourg, Biological Computation & Process Laboratory (BCPL), Chemical Process & Energy Resources Institute (CPERI), Centre for Research & Technology Hellas (CERTH), PO Box 361, GR-57001 Thessalonica, Greece, Donnelly Centre for Cellular & Biomolecular Research, University of Toronto, Toronto, Ontario, Canada, Institute of Molecular Biology and Biotechnology, Foundation for Research and Technology Hellas, 70013 Heraklion, Crete, Greece and Department of Biological Sciences, Bioinformatics Research Laboratory, University of Cyprus, PO Box 20537, CY 1678, Nicosia, Cyprus

How to learn about gene function: text-mining or ontologies? Methods 2014;74:3-15. [PMID: 25088781 DOI: 10.1016/j.ymeth.2014.07.004] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Received: 04/03/2014] [Revised: 07/01/2014] [Accepted: 07/09/2014] [Indexed: 12/31/2022] Open

Abstract

As the amount of genome information increases rapidly, there is a correspondingly greater need for methods that provide accurate and automated annotation of gene function. For example, many high-throughput technologies (e.g., next-generation sequencing) are being used today to generate lists of genes associated with specific conditions. However, their functional interpretation remains a challenge, and many tools exist that try to characterize the function of gene lists. Such systems typically rely on enrichment analysis and aim to give quick insight into the underlying biology by presenting it in the form of a summary report. While the load of annotation may be alleviated by such computational approaches, the main challenge in modern annotation remains to develop a systems form of analysis in which a pipeline can effectively analyze gene lists quickly and identify aggregated annotations through computerized resources. In this article we survey some of the many such tools and methods that have been developed to automatically interpret the biological functions underlying gene lists. We overview current functional annotation aspects from the perspective of their epistemology (i.e., the underlying theories used to organize information about gene function into a body of verified and documented knowledge) and find that most of the currently used functional annotation methods fall broadly into one of two categories: they are based either on 'known' formally-structured ontology annotations created by 'experts' (e.g., the GO terms used to describe the function of Entrez Gene entries), or, perhaps more adventurously, on annotations inferred from literature (e.g., many text-mining methods use computer-aided reasoning to acquire knowledge represented in natural languages). Overall, however, deriving detailed and accurate insight from such gene lists remains a challenging task, and improved methods are called for. In particular, future methods need to (1) provide more holistic insight into the underlying molecular systems; (2) provide better follow-up experimental testing and treatment options; and (3) better manage gene lists derived from organisms that are not well-studied. We discuss some promising approaches that may help achieve these advances, especially the use of extended dictionaries of biomedical concepts and molecular mechanisms, as well as greater use of annotation benchmarks.

Keilwagen J, Grosse I, Grau J. Area under precision-recall curves for weighted and unweighted data. PLoS One 2014;9:e92209. [PMID: 24651729 PMCID: PMC3961324 DOI: 10.1371/journal.pone.0092209] [Citation(s) in RCA: 80] [Impact Index Per Article: 8.0] [Received: 10/09/2013] [Accepted: 02/20/2014] [Indexed: 12/25/2022] Open

Khare R, Leaman R, Lu Z. Accessing biomedical literature in the current information landscape. Methods Mol Biol 2014;1159:11-31. [PMID: 24788259 PMCID: PMC4593617 DOI: 10.1007/978-1-4939-0709-0_2] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.7] [Indexed: 01/22/2023]

Pavlopoulos GA, Promponas VJ, Ouzounis CA, Iliopoulos I. Biological information extraction and co-occurrence analysis. Methods Mol Biol 2014;1159:77-92. [PMID: 24788262 DOI: 10.1007/978-1-4939-0709-0_5] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/03/2022]

Yepes AJJ, Mork JG, Demner-Fushman D, Aronson AR. Comparison and combination of several MeSH indexing approaches. AMIA Annu Symp Proc 2013;2013:709-718. [PMID: 24551371 PMCID: PMC3900212] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Subscribe] [Scholar Register] [Indexed: 06/03/2023]

Ross MK, Lin KW, Truong K, Kumar A, Conway M. Text Categorization of Heart, Lung, and Blood Studies in the Database of Genotypes and Phenotypes (dbGaP) Utilizing n-grams and Metadata Features. Biomed Inform Insights 2013;6:35-45. [PMID: 23926434 PMCID: PMC3728208 DOI: 10.4137/bii.s11987] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/05/2022]

Jimeno-Yepes AJ, Plaza L, Mork JG, Aronson AR, Díaz A. MeSH indexing based on automatically generated summaries. BMC Bioinformatics 2013;14:208. [PMID: 23802936 PMCID: PMC3706357 DOI: 10.1186/1471-2105-14-208] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/12/2012] [Accepted: 06/18/2013] [Indexed: 11/21/2022] Open

Abstract

BACKGROUND

MEDLINE citations are manually indexed at the U.S. National Library of Medicine (NLM) using the Medical Subject Headings (MeSH) controlled vocabulary as a reference. For this task, the human indexers read the full text of the article. Due to the growth of MEDLINE, the NLM Indexing Initiative explores indexing methodologies that can support the task of the indexers. Medical Text Indexer (MTI) is a tool developed by the NLM Indexing Initiative to provide MeSH indexing recommendations to indexers. Currently, the input to MTI consists of MEDLINE citations (title and abstract only). Previous work has shown that using full text as input to MTI increases recall but decreases precision sharply. We propose using summaries generated automatically from the full text as the input to MTI for the task of suggesting MeSH headings to indexers. Summaries distill the most salient information from the full text, which might increase the coverage of automatic indexing approaches based on MEDLINE. We hypothesize that, if the results were good enough, manual indexers could use automatic summaries instead of the full texts, along with the recommendations of MTI, to speed up the process while maintaining high-quality indexing results.

RESULTS

We have generated summaries of different lengths using two different summarizers, and evaluated the MTI indexing on the summaries using different algorithms: MTI, individual MTI components, and machine learning. The results are compared to those of full text articles and MEDLINE citations. Our results show that automatically generated summaries achieve similar recall but higher precision compared to full text articles. Compared to MEDLINE citations, summaries achieve higher recall but lower precision.

CONCLUSIONS

Our results show that automatic summaries produce better indexing than full text articles. Summaries produce similar recall to full text but much better precision, which seems to indicate that automatic summaries can efficiently capture the most important contents within the original articles. The combination of MEDLINE citations and automatically generated summaries could improve the recommendations suggested by MTI. On the other hand, indexing performance might be dependent on the MeSH heading being indexed. Summarization techniques could thus be considered as a feature selection algorithm that might have to be tuned individually for each MeSH heading.

Ortuño FM, Rojas I, Andrade-Navarro MA, Fontaine JF. Using cited references to improve the retrieval of related biomedical documents. BMC Bioinformatics 2013;14:113. [PMID: 23537461 PMCID: PMC3618341 DOI: 10.1186/1471-2105-14-113] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/25/2012] [Accepted: 03/18/2013] [Indexed: 11/10/2022] Open

Abstract

BACKGROUND

A common need among scientists reading a biomedical abstract is to search bibliographic databases for topic-related documents. Such a query is challenging because a single abstract carries little information, whereas classification-based retrieval algorithms are optimally trained with large sets of relevant documents. As a solution to this problem, we propose a query expansion method that extends the information related to a manuscript using its cited references.

RESULTS

Data on cited references and text sections in 249,108 full-text biomedical articles was extracted from the Open Access subset of the PubMed Central® database (PMC-OA). Of the five standard sections of a scientific article, the Introduction and Discussion sections contained most of the citations (mean = 10.2 and 9.9 citations, respectively). A large proportion of articles (98.4%) and their cited references (79.5%) were indexed in the PubMed® database. Using the MedlineRanker abstract classification tool, cited references allowed accurate retrieval of the citing document in a test set of 10,000 documents and also of documents related to six biomedical topics defined by particular MeSH® terms from the entire PMC-OA (p-value<0.01). Classification performance was sensitive to the topic and also to the text sections from which the references were selected. Classifiers trained on the baseline (i.e., only text from the query document and not from the references) were outperformed in almost all the cases. Best performance was often obtained when using all cited references, though using the references from Introduction and Discussion sections led to similarly good results. This query expansion method performed significantly better than pseudo relevance feedback in 4 out of 6 topics.

CONCLUSIONS

The retrieval of documents related to a single document can be significantly improved by using the references cited by this document (p-value<0.01). Using references from Introduction and Discussion performs almost as well as using all references, which might be useful for methods that require reduced datasets due to computational limitations. Cited references from particular sections might not be appropriate for all topics. Our method could be a better alternative to pseudo relevance feedback though it is limited by full text availability.

Efficient semantic network construction with application to PubMed search. Knowl Based Syst 2013. [DOI: 10.1016/j.knosys.2012.10.019] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]

French L, Lane S, Xu L, Siu C, Kwok C, Chen Y, Krebs C, Pavlidis P. Application and evaluation of automated methods to extract neuroanatomical connectivity statements from free text. Bioinformatics 2012;28:2963-70. [PMID: 22954628 PMCID: PMC3496336 DOI: 10.1093/bioinformatics/bts542] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/01/2023] Open

Caipirini: using gene sets to rank literature. BioData Min 2012;5:1. [PMID: 22297131 PMCID: PMC3307494 DOI: 10.1186/1756-0381-5-1] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/12/2010] [Accepted: 02/01/2012] [Indexed: 11/10/2022] Open

Seymour E, Damle R, Sette A, Peters B. Cost sensitive hierarchical document classification to triage PubMed abstracts for manual curation. BMC Bioinformatics 2011;12:482. [PMID: 22182279 PMCID: PMC3314711 DOI: 10.1186/1471-2105-12-482] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/26/2011] [Accepted: 12/19/2011] [Indexed: 11/10/2022] Open

Abstract

BACKGROUND

The Immune Epitope Database (IEDB) project manually curates information from published journal articles that describe immune epitopes derived from a wide variety of organisms and associated with different diseases. In the past, abstracts of scientific articles were retrieved by broad keyword queries of PubMed, and were classified as relevant (curatable) or irrelevant (not curatable) to the scope of the database by a Naïve Bayes classifier. The curatable abstracts were subsequently manually classified into categories corresponding to different disease domains. Over the past four years, we have examined how to further improve this approach in order to enhance classification performance and to reduce the need for manual intervention.

RESULTS

Utilizing 89,884 abstracts classified by a domain expert as curatable or uncuratable, we found that an SVM classifier outperformed the previously used Naïve Bayes classifier for curatability predictions, with AUCs of 0.899 and 0.854, respectively. Next, using a non-hierarchical and a hierarchical application of SVM classifiers trained on 22,833 curatable abstracts manually classified into three levels of disease-specific categories, we demonstrated that a hierarchical application of SVM classifiers outperformed non-hierarchical SVM classifiers for categorization. Finally, to optimize the hierarchical SVM classifiers' error profile for the curation process, cost sensitivity functions were developed to avoid serious misclassifications. We tested our design on a benchmark dataset of 1,388 references and achieved an overall category prediction accuracy of 94.4%, 93.9%, and 82.1% at the three levels of categorization, respectively.
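
As a reading aid for the AUC figures quoted above, the following minimal Python sketch computes the area under the ROC curve as the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are hypothetical and are not taken from the IEDB study; this illustrates the metric only, not the authors' evaluation code.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0/1 class labels (1 = relevant/curatable).
    scores: classifier scores, higher meaning "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0      # positive ranked above negative
            elif p == q:
                wins += 0.5      # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): two relevant and two irrelevant abstracts.
example_auc = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.2, 0.1])
```

The quadratic pairwise loop is chosen for clarity; a rank-based computation gives the same value more efficiently on large collections.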

CONCLUSIONS

A hierarchical application of SVM algorithms with cost sensitive output weighting enabled high quality reference classification with few serious misclassifications. This enabled us to significantly reduce the manual component of abstract categorization. Our findings are relevant to other databases that are developing their own document classifier schema and the datasets we make available provide large scale real-life benchmark sets for method developers.

Jimeno-Yepes A, Wilkowski B, Mork JG, Van Lenten E, Fushman DD, Aronson AR. A bottom-up approach to MEDLINE indexing recommendations. AMIA Annu Symp Proc 2011;2011:1583-1592. [PMID: 22195224 PMCID: PMC3243198] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Subscribe] [Scholar Register] [Indexed: 05/31/2023]

Krallinger M, Vazquez M, Leitner F, Salgado D, Chatr-aryamontri A, Winter A, Perfetto L, Briganti L, Licata L, Iannuccelli M, Castagnoli L, Cesareni G, Tyers M, Schneider G, Rinaldi F, Leaman R, Gonzalez G, Matos S, Kim S, Wilbur WJ, Rocha L, Shatkay H, Tendulkar AV, Agarwal S, Liu F, Wang X, Rak R, Noto K, Elkan C, Lu Z, Dogan RI, Fontaine JF, Andrade-Navarro MA, Valencia A. The Protein-Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text. BMC Bioinformatics 2011;12 Suppl 8:S3. [PMID: 22151929 PMCID: PMC3269938 DOI: 10.1186/1471-2105-12-s8-s3] [Citation(s) in RCA: 107] [Impact Index Per Article: 8.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022] Open

Abstract

BACKGROUND

Determining the usefulness of biomedical text mining systems requires realistic task definitions and data selection criteria without artificial constraints, and measuring performance aspects that go beyond traditional metrics. The BioCreative III Protein-Protein Interaction (PPI) tasks were motivated by such considerations, trying to address aspects including how the end user would oversee the generated output, for instance by providing ranked results, textual evidence for human interpretation or measuring time savings by using automated systems. Detecting articles describing complex biological events like PPIs was addressed in the Article Classification Task (ACT), where participants were asked to implement tools for detecting PPI-describing abstracts. For this purpose, the BCIII-ACT corpus was provided, which includes training, development and test sets of over 12,000 PPI-relevant and non-relevant PubMed abstracts labeled manually by domain experts, along with the recorded human classification times. The Interaction Method Task (IMT) went beyond abstracts and required mining for associations between more than 3,500 full text articles and interaction detection method ontology concepts that had been applied to detect the PPIs reported in them.

RESULTS

A total of 11 teams participated in at least one of the two PPI tasks (10 in ACT and 8 in the IMT) and a total of 62 persons were involved either as participants or in preparing data sets/evaluating these tasks. Per task, each team was allowed to submit five runs offline and another five online via the BioCreative Meta-Server. From the 52 runs submitted for the ACT, the highest Matthews correlation coefficient (MCC) score measured was 0.55 at an accuracy of 89% and the best AUC iP/R was 68%. Most ACT teams explored machine learning methods; some of them also used lexical resources like MeSH terms, PSI-MI concepts or particular lists of verbs and nouns, and some integrated NER approaches. For the IMT, a total of 42 runs were evaluated by comparing systems against manually generated annotations done by curators from the BioGRID and MINT databases. The highest AUC iP/R achieved by any run was 53%, the best MCC score 0.55. In case of competitive systems with an acceptable recall (above 35%) the macro-averaged precision ranged between 50% and 80%, with a maximum F-Score of 55%.
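
For readers unfamiliar with the MCC metric reported above, the short Python sketch below computes it from the four confusion-matrix counts. The function name and the example counts are hypothetical and are not drawn from the BioCreative III evaluation; returning 0.0 when the denominator is zero is a common convention assumed here, not something stated in the abstract.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Ranges from -1 (total disagreement) through 0 (chance level)
    to +1 (perfect prediction).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: undefined cases are reported as 0.0.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for an imbalanced article-classification run.
example_mcc = matthews_corrcoef(tp=60, tn=800, fp=40, fn=100)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is why it is often preferred for the kind of class-imbalanced collections described in this abstract.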

CONCLUSIONS

The results of the ACT task of BioCreative III indicate that classification of large unbalanced article collections reflecting the real class imbalance is still challenging. Nevertheless, text-mining tools that report ranked lists of relevant articles for manual selection can potentially reduce the time needed to identify half of the relevant articles to less than 1/4 of the time when compared to unranked results. Detecting associations between full text articles and interaction detection method PSI-MI terms (IMT) is more difficult than might be anticipated. This is due to the variability of method term mentions, errors resulting from pre-processing of articles provided as PDF files, and the heterogeneity and different granularity of method term concepts encountered in the ontology. However, combining the sophisticated techniques developed by the participants with supporting evidence strings derived from the articles for human interpretation could result in practical modules for biological annotation workflows.

Affiliation(s)

Martin Krallinger: Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
Miguel Vazquez: Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
Florian Leitner: Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
David Salgado: Australian Regenerative Medicine Institute, Monash University, Australia
Andrew Chatr-aryamontri: School of Biological Sciences, University of Edinburgh, Edinburgh, UK
Andrew Winter: School of Biological Sciences, University of Edinburgh, Edinburgh, UK
Livia Perfetto: Department of Biology, University of Rome Tor Vergata, Rome, Italy
Leonardo Briganti: Department of Biology, University of Rome Tor Vergata, Rome, Italy
Luana Licata: Department of Biology, University of Rome Tor Vergata, Rome, Italy
Marta Iannuccelli: Department of Biology, University of Rome Tor Vergata, Rome, Italy
Luisa Castagnoli: Department of Biology, University of Rome Tor Vergata, Rome, Italy
Gianni Cesareni: Department of Biology, University of Rome Tor Vergata, Rome, Italy; IRCCS, Fondazione Santa Lucia, Rome, Italy
Mike Tyers: School of Biological Sciences, University of Edinburgh, Edinburgh, UK
Gerold Schneider: Institute of Computational Linguistics, University of Zurich, Zurich, Switzerland
Fabio Rinaldi: Institute of Computational Linguistics, University of Zurich, Zurich, Switzerland
Robert Leaman: School of Computing, Informatics and Decision Systems Engineering, Arizona State University, Tempe, Arizona, USA
Graciela Gonzalez: Department of Biomedical Informatics, Arizona State University, Tempe, Arizona, USA
Sergio Matos: Institute of Electronics and Telematics Engineering of Aveiro, University of Aveiro, Campus Universitario de Santiago, 3810-193 Aveiro, Portugal
Sun Kim: National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, Maryland, 20894, USA
W John Wilbur: National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, Maryland, 20894, USA
Luis Rocha: School of Informatics and Computing, Indiana University, 919 E. 10th St, Bloomington, IN 47408, USA
Hagit Shatkay: Department of Computer and Information Sciences, University of Delaware, Newark, DE 19716, USA
Ashish V Tendulkar: Department of Computer Science and Engineering, IIT Madras, Chennai-600 036, India
Shashank Agarwal: Medical Informatics, University of Wisconsin-Milwaukee, Milwaukee, Wisconsin, USA
Feifan Liu: Medical Informatics, University of Wisconsin-Milwaukee, Milwaukee, Wisconsin, USA
Xinglong Wang: National Centre for Text Mining and School of Computer Science, University of Manchester, Manchester, UK
Rafal Rak: National Centre for Text Mining and School of Computer Science, University of Manchester, Manchester, UK
Keith Noto: Department of Computer Science, Tufts University, 161 College Ave, Medford, MA 02155, USA
Charles Elkan: Department of Computer Science and Engineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
Zhiyong Lu: National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, Maryland, 20894, USA
Rezarta Islamaj Dogan: National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, Maryland, 20894, USA
Jean-Fred Fontaine: Computational Biology and Data Mining Group, Max-Delbrück-Centrum für Molekulare Medizin, Robert-Rössle-Str. 10, 13125 Berlin, Germany
Miguel A Andrade-Navarro: Computational Biology and Data Mining Group, Max-Delbrück-Centrum für Molekulare Medizin, Robert-Rössle-Str. 10, 13125 Berlin, Germany
Alfonso Valencia: Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre (CNIO), Madrid, Spain

Garten Y, Coulet A, Altman RB. Recent progress in automatically extracting information from the pharmacogenomic literature. Pharmacogenomics 2011;11:1467-89. [PMID: 21047206 DOI: 10.2217/pgs.10.136] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/21/2022] Open

MeSHy: Mining unanticipated PubMed information using frequencies of occurrences and concurrences of MeSH terms. J Biomed Inform 2011;44:919-26. [PMID: 21684350 DOI: 10.1016/j.jbi.2011.05.009] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/03/2010] [Revised: 05/29/2011] [Accepted: 05/31/2011] [Indexed: 11/22/2022]

Clustering more than two million biomedical publications: comparing the accuracies of nine text-based similarity approaches. PLoS One 2011;6:e18029. [PMID: 21437291 PMCID: PMC3060097 DOI: 10.1371/journal.pone.0018029] [Citation(s) in RCA: 175] [Impact Index Per Article: 13.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/29/2010] [Accepted: 02/18/2011] [Indexed: 11/19/2022] Open

Abstract

Background

We investigate the accuracy of different similarity approaches for clustering over two million biomedical documents. Clustering large sets of text documents is important for a variety of information needs and applications such as collection management and navigation, summary and analysis. The few comparisons of clustering results from different similarity approaches have focused on small literature sets and have given conflicting results. Our study was designed to seek a robust answer to the question of which similarity approach would generate the most coherent clusters of a biomedical literature set of over two million documents.

Methodology

We used a corpus of 2.15 million recent (2004-2008) records from MEDLINE, and generated nine different document-document similarity matrices from information extracted from their bibliographic records, including titles, abstracts and subject headings. The nine approaches combined five different analytical techniques with two data sources. The five analytical techniques are cosine similarity using term frequency-inverse document frequency vectors (tf-idf cosine), latent semantic analysis (LSA), topic modeling, and two Poisson-based language models – BM25 and PMRA (PubMed Related Articles). The two data sources were a) MeSH subject headings, and b) words from titles and abstracts. Each similarity matrix was filtered to keep the top-n highest similarities per document and then clustered using a combination of graph layout and average-link clustering. Cluster results from the nine similarity approaches were compared using (1) within-cluster textual coherence based on the Jensen-Shannon divergence, and (2) two concentration measures based on grant-to-article linkages indexed in MEDLINE.
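
To make the first of these techniques concrete, the following Python sketch illustrates tf-idf cosine similarity followed by top-n filtering on a toy corpus. It is an illustration under stated assumptions, not the study's implementation: the function names, the raw-count tf with `log(N/df)` idf weighting, the tie-breaking rule and the three example documents are all hypothetical choices, and the abstract does not specify the exact weighting scheme used.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.

    Assumed weighting: raw term count times log(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append(
            {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_n_similar(docs, n_keep):
    """For each document index, keep its n_keep most similar other documents."""
    vecs = tfidf_vectors(docs)
    result = {}
    for i, u in enumerate(vecs):
        sims = [(cosine(u, v), j) for j, v in enumerate(vecs) if j != i]
        # Highest similarity first; ties broken by lower document index.
        sims.sort(key=lambda pair: (-pair[0], pair[1]))
        result[i] = [(j, s) for s, j in sims[:n_keep]]
    return result


# Toy corpus (hypothetical): token lists standing in for titles and abstracts.
toy_docs = [
    ["gene", "expression", "cancer"],
    ["gene", "expression", "tumor"],
    ["protein", "folding", "kinetics"],
]
toy_top = top_n_similar(toy_docs, n_keep=1)
```

This all-pairs version is quadratic in the number of documents and is meant only to show the idea; a corpus of the size described here would need sparse-matrix or index-based methods.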

Conclusions

PubMed's own related article approach (PMRA) generated the most coherent and most concentrated cluster solution of the nine text-based similarity approaches tested, followed closely by the BM25 approach using titles and abstracts. Approaches using only MeSH subject headings were not competitive with those based on titles and abstracts.

Lu Z. PubMed and beyond: a survey of web tools for searching biomedical literature. Database (Oxford) 2011;2011:baq036. [PMID: 21245076 PMCID: PMC3025693 DOI: 10.1093/database/baq036] [Citation(s) in RCA: 222] [Impact Index Per Article: 17.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/23/2022]

Enabling multi-level relevance feedback on PubMed by integrating rank learning into DBMS. BMC Bioinformatics 2010;11 Suppl 2:S6. [PMID: 20406504 PMCID: PMC3165966 DOI: 10.1186/1471-2105-11-s2-s6] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/02/2022] Open

Garten Y, Altman RB. Teaching computers to read the pharmacogenomics literature ... so you don't have to. Pharmacogenomics 2010;11:515-8. [PMID: 20350132 PMCID: PMC3478760 DOI: 10.2217/pgs.10.48] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/21/2022] Open

Garten Y, Tatonetti NP, Altman RB. Improving the prediction of pharmacogenes using text-derived drug-gene relationships. Pac Symp Biocomput 2009:305-14. [PMID: 19908383 DOI: 10.1142/9789814295291_0033] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/06/2023]

Sangkuhl K, Berlin DS, Altman RB, Klein TE. PharmGKB: understanding the effects of individual genetic variants. Drug Metab Rev 2009;40:539-51. [PMID: 18949600 DOI: 10.1080/03602530802413338] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/22/2023]

Fontaine JF, Barbosa-Silva A, Schaefer M, Huska MR, Muro EM, Andrade-Navarro MA. MedlineRanker: flexible ranking of biomedical literature. Nucleic Acids Res 2009;37:W141-6. [PMID: 19429696 PMCID: PMC2703945 DOI: 10.1093/nar/gkp353] [Citation(s) in RCA: 91] [Impact Index Per Article: 6.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open

Krallinger M, Rojas AM, Valencia A. Creating reference datasets for systems biology applications using text mining. Ann N Y Acad Sci 2009;1158:14-28. [PMID: 19348628 DOI: 10.1111/j.1749-6632.2008.03750.x] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/01/2023]

Krallinger M, Valencia A, Hirschman L. Linking genes to literature: text mining, information extraction, and retrieval applications for biology. Genome Biol 2008;9 Suppl 2:S8. [PMID: 18834499 PMCID: PMC2559992 DOI: 10.1186/gb-2008-9-s2-s8] [Citation(s) in RCA: 145] [Impact Index Per Article: 9.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open

Agarwal P, Searls DB. Literature mining in support of drug discovery. Brief Bioinform 2008;9:479-92. [DOI: 10.1093/bib/bbn035] [Citation(s) in RCA: 56] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022] Open