1
Oliveira Dos Santos Á, Sergio da Silva E, Machado Couto L, Valadares Labanca Reis G, Silva Belo V. The use of artificial intelligence for automating or semi-automating biomedical literature analyses: a scoping review. J Biomed Inform 2023; 142:104389. [PMID: 37187321] [DOI: 10.1016/j.jbi.2023.104389]
Abstract
OBJECTIVE Evidence-based medicine (EBM) is a decision-making process based on the conscious and judicious use of the best available scientific evidence. However, the exponential growth in the amount of available information likely exceeds the capacity of human-only analysis. In this context, artificial intelligence (AI) and its branches, such as machine learning (ML), can be used to support human efforts in analyzing the literature to foster EBM. The present scoping review examined the use of AI in automating biomedical literature surveys and analyses, with a view to establishing the state of the art and identifying knowledge gaps. MATERIALS AND METHODS Comprehensive searches of the main databases were performed for articles published up to June 2022, and studies were selected according to inclusion and exclusion criteria. Data were extracted from the included articles and the findings were categorized. RESULTS Of the 12,145 records retrieved from the databases, 273 were included in the review. Classification of the studies according to the use of AI in evaluating the biomedical literature revealed three main application groups: assembly of scientific evidence (n=127; 47%), mining the biomedical literature (n=112; 41%), and quality analysis (n=34; 12%). Most studies addressed the preparation of systematic reviews, while articles focusing on the development of guidelines and evidence synthesis were the least frequent. The biggest knowledge gap was identified within the quality analysis group, particularly regarding methods and tools that assess the strength of recommendation and consistency of evidence.
CONCLUSION Our review shows that, despite significant progress in the automation of biomedical literature surveys and analyses in recent years, intense research is needed to fill knowledge gaps on more difficult aspects of ML, deep learning and natural language processing, and to consolidate the use of automation by end-users (biomedical researchers and healthcare professionals).
Affiliation(s)
- Eduardo Sergio da Silva
- Federal University of São João del-Rei, Campus Centro-Oeste Dona Lindu, Divinópolis, Minas Gerais, Brazil.
- Letícia Machado Couto
- Federal University of São João del-Rei, Campus Centro-Oeste Dona Lindu, Divinópolis, Minas Gerais, Brazil.
- Vinícius Silva Belo
- Federal University of São João del-Rei, Campus Centro-Oeste Dona Lindu, Divinópolis, Minas Gerais, Brazil.
2
Jung E, Jain H, Sinha AP, Gaudioso C. Building a specialized lexicon for breast cancer clinical trial subject eligibility analysis. Health Informatics J 2021; 27:1460458221989392. [PMID: 33535885] [DOI: 10.1177/1460458221989392]
Abstract
A natural language processing (NLP) application requires sophisticated lexical resources to support its processing goals. Different solutions, such as dictionary lookup and MetaMap, have been proposed in the healthcare informatics literature to identify disease terms consisting of more than one word (multi-gram disease named entities). Although much work has been done on the identification of protein and gene named entities in the biomedical field, little research has addressed the recognition and resolution of terminologies in clinical trial subject eligibility analysis. In this study, we develop a specialized lexicon for improving NLP and text mining analysis in the breast cancer domain, and evaluate it by comparison with the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT). We use a hybrid methodology that combines the knowledge of domain experts, terms from multiple online dictionaries, and text mined from sample clinical trials. Our methodology introduces 4243 unique lexicon items, which increase bigram entity matches by 38.6% and trigram entity matches by 41%. The lexicon, which adds a significant number of new terms, is very useful for automatically matching patients to clinical trials based on eligibility criteria. Beyond clinical trial matching, the specialized lexicon developed in this study could serve as a foundation for future healthcare text mining applications.
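To illustrate the kind of multi-gram matching such a lexicon supports, here is a minimal sketch of a greedy longest-match lookup; the lexicon entries and sentence are invented for illustration and are not taken from the study's actual resource:

```python
# Minimal longest-match lexicon lookup for multi-gram entity matching.
# The lexicon entries below are illustrative, not from the study's lexicon.
LEXICON = {
    ("breast", "cancer"),
    ("invasive", "ductal", "carcinoma"),
    ("tamoxifen",),
}
MAX_N = max(len(term) for term in LEXICON)

def match_entities(tokens):
    """Greedy longest-match scan: prefer trigrams over bigrams over unigrams."""
    i, found = 0, []
    while i < len(tokens):
        for n in range(min(MAX_N, len(tokens) - i), 0, -1):
            gram = tuple(w.lower() for w in tokens[i:i + n])
            if gram in LEXICON:
                found.append(" ".join(gram))
                i += n
                break
        else:
            i += 1
    return found

print(match_entities("History of invasive ductal carcinoma , no breast cancer recurrence".split()))
```

Because the scan tries longer n-grams first, "invasive ductal carcinoma" is matched as one trigram entity rather than leaving "carcinoma" unrecognized, which is where the reported bigram/trigram match gains come from.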
Affiliation(s)
- Euisung Jung
- Information Operations and Technology Management, John B. and Lillian E. Neff College of Business and Innovation, The University of Toledo, USA
- Hemant Jain
- Gary W. Rollins College of Business, The University of Tennessee at Chattanooga, USA
- Atish P Sinha
- Lubar School of Business, University of Wisconsin-Milwaukee, USA
3
Rinaldi F, Lithgow O, Gama-Castro S, Solano H, Lopez A, Muñiz Rascado LJ, Ishida-Gutiérrez C, Méndez-Cruz CF, Collado-Vides J. Strategies towards digital and semi-automated curation in RegulonDB. Database (Oxford) 2017; 2017:3074784. [PMID: 28365731] [PMCID: PMC5467564] [DOI: 10.1093/database/bax012]
Abstract
Experimentally generated biological information needs to be organized and structured in order to become meaningful knowledge. However, the rate at which new information is being published makes manual curation increasingly unable to cope. Devising new curation strategies that leverage data mining and text analysis is therefore a promising avenue for helping life science databases cope with the deluge of novel information. In this article, we describe the integration of text mining technologies into the curation pipeline of the RegulonDB database, and discuss how the process can enhance the productivity of the curators.
Specifically, a named entity recognition approach is used to pre-annotate terms referring to a set of domain entities that are potentially relevant for the curation process. The annotated documents are presented to the curator, who, thanks to a custom-designed interface, can select sentences containing specific types of entities, thus restricting the amount of text that needs to be inspected. Additionally, a module capable of computing semantic similarity between sentences across the entire collection of articles to be curated is being integrated into the system. We tested the module using three sets of scientific articles and six domain experts. All these improvements are gradually enabling us to obtain a high-throughput curation process with the same quality as manual curation.
Affiliation(s)
- Fabio Rinaldi
- Swiss Institute of Bioinformatics, and Institute of Computational Linguistics, University of Zurich, Andreasstrasse 14, 8050 Zürich
- Institute of Computational Linguistics, University of Zurich, Andreasstrasse 15, Zurich 8050, Switzerland
- Oscar Lithgow, Socorro Gama-Castro, Hilda Solano, Alejandra Lopez, Luis José Muñiz Rascado, Cecilia Ishida-Gutiérrez, Carlos-Francisco Méndez-Cruz, Julio Collado-Vides
- Swiss Institute of Bioinformatics, and Institute of Computational Linguistics, University of Zurich, Andreasstrasse 14, 8050 Zürich
4
Spasić I, Greenwood M, Preece A, Francis N, Elwyn G. FlexiTerm: a flexible term recognition method. J Biomed Semantics 2013; 4:27. [PMID: 24112363] [PMCID: PMC3853334] [DOI: 10.1186/2041-1480-4-27]
Abstract
BACKGROUND The increasing amount of textual information in biomedicine requires effective term recognition methods to identify textual representations of domain-specific concepts as the first step toward automating its semantic interpretation. Dictionary look-up approaches may not always be suitable for dynamic domains such as biomedicine, or for newly emerging types of media such as patient blogs, the main obstacles being the use of non-standardised terminology and a high degree of term variation. RESULTS In this paper, we describe FlexiTerm, a method for automatic term recognition from a domain-specific corpus, and evaluate its performance against five manually annotated corpora. FlexiTerm performs term recognition in two steps: linguistic filtering selects term candidates, followed by calculation of termhood, a frequency-based measure used as evidence to qualify a candidate as a term. To improve the quality of termhood calculation, which may be affected by term variation, FlexiTerm uses a range of methods to neutralise the main sources of variation in biomedical terms. It manages syntactic variation by processing candidates using a bag-of-words approach. Orthographic and morphological variations are dealt with using stemming in combination with lexical and phonetic similarity measures. The method was evaluated on five biomedical corpora. The highest values for precision (94.56%), recall (71.31%) and F-measure (81.31%) were achieved on a corpus of clinical notes. CONCLUSIONS FlexiTerm is an open-source software tool for automatic term recognition. It incorporates a simple term variant normalisation method that proved to be more robust than the baseline against less formally structured texts, such as those found in patient blogs or medical notes. The software can be downloaded freely at http://www.cs.cf.ac.uk/flexiterm.
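The abstract does not give FlexiTerm's exact termhood formula; the following sketch combines a C-value-style frequency score with the bag-of-words normalisation described above. The stop-word list and example terms are illustrative assumptions, and the published FlexiTerm formula additionally discounts nested candidates:

```python
import math
from collections import Counter

def normalise(candidate: str) -> frozenset:
    """Bag-of-words normalisation: word order is ignored, so syntactic
    variants like 'cancer of the breast' and 'breast cancer' conflate.
    (A tiny illustrative stop-word list stands in for linguistic filtering.)"""
    stop = {"of", "the", "a"}
    return frozenset(w.lower() for w in candidate.split() if w not in stop)

def termhood(candidates: list[str]) -> dict:
    """C-value-style termhood: log2 of candidate length times the frequency
    of the normalised form. Illustrative only, not FlexiTerm's exact score."""
    freq = Counter(normalise(c) for c in candidates)
    return {
        " ".join(sorted(bag)): math.log2(max(len(bag), 2)) * n
        for bag, n in freq.items()
    }

scores = termhood([
    "breast cancer", "cancer of the breast", "breast cancer",
    "anterior cruciate ligament",
])
print(scores)
```

Note how the two surface forms of "breast cancer" pool their frequencies into a single termhood score, which is exactly the benefit of variant normalisation the abstract describes.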
Affiliation(s)
- Irena Spasić
- School of Computer Science & Informatics, Cardiff University, Queen's Buildings, 5 The Parade, Cardiff, UK
- Mark Greenwood
- School of Computer Science & Informatics, Cardiff University, Queen's Buildings, 5 The Parade, Cardiff, UK
- Alun Preece
- School of Computer Science & Informatics, Cardiff University, Queen's Buildings, 5 The Parade, Cardiff, UK
- Nick Francis
- The Cochrane Institute for Primary Care and Public Health, Cardiff University, Heath Park, Cardiff, UK
- Glyn Elwyn
- The Cochrane Institute for Primary Care and Public Health, Cardiff University, Heath Park, Cardiff, UK
- Dartmouth Center for Health Care Delivery Science, Dartmouth College, Hanover, NH, USA
5
Rebholz-Schuhmann D, Kim JH, Yan Y, Dixit A, Friteyre C, Hoehndorf R, Backofen R, Lewin I. Evaluation and cross-comparison of lexical entities of biological interest (LexEBI). PLoS One 2013; 8:e75185. [PMID: 24124474] [PMCID: PMC3790750] [DOI: 10.1371/journal.pone.0075185]
Abstract
MOTIVATION Biomedical entities, their identifiers and names, are essential in the representation of biomedical facts and knowledge. In the same way, the complete set of biomedical and chemical terms, i.e. the biomedical "term space" (the "Lexeome"), forms a key resource for achieving the full integration of the scientific literature with biomedical data resources: any identified named entity can immediately be normalized to the correct database entry. This goal not only requires awareness of all existing terms, but would also profit from knowledge of all their senses and their semantic interpretation (ambiguities, nestedness). RESULT This study compiles a resource of lexical terms of biomedical interest in a standard format (called "LexEBI"), determines the overall number of terms, their reuse in different resources and the nestedness of terms. LexEBI comprises references for protein and gene entries and their term variants, and chemical entities, amongst other terms. In addition, disease terms have been identified from Medline and PubMed Central and added to LexEBI. Our analysis demonstrates that the baseforms of terms from the different semantic types show little polysemous use. Nonetheless, the term variants of protein and gene names (PGNs) frequently contain species mentions, which should have been avoided according to protein annotation guidelines. Furthermore, the protein and gene entities as well as the chemical entities both comprise enzymes, leading to hierarchical polysemy, and a large portion of PGNs make reference to a chemical entity. Altogether, according to our analysis based on the Medline distribution, 401,869 unique PGNs in the documents contain a reference to 25,022 chemical entities, 3,125 disease terms or 1,576 species mentions. CONCLUSION LexEBI delivers the complete biomedical and chemical Lexeome in a standardized representation (http://www.ebi.ac.uk/Rebholz-srv/LexEBI/). The resource provides the disease terms as open-source content and fully interlinks terms across resources.
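Nestedness, one of the term-space properties the study quantifies, can be sketched as a check for terms contained inside longer terms; the helper name and example terms here are hypothetical, not entries from LexEBI:

```python
def nested_terms(lexicon: set[str]) -> dict[str, list[str]]:
    """For each term, list the longer lexicon terms that contain it as a
    contiguous token subsequence. Example terms are illustrative only."""
    def tokens(t):
        return t.lower().split()
    out = {}
    for short in lexicon:
        s = tokens(short)
        hosts = [
            long for long in lexicon
            if long != short and len(tokens(long)) > len(s)
            and any(tokens(long)[i:i + len(s)] == s
                    for i in range(len(tokens(long)) - len(s) + 1))
        ]
        if hosts:
            out[short] = sorted(hosts)
    return out

print(nested_terms({"kinase", "protein kinase", "protein kinase C", "insulin"}))
```

Nested terms like "kinase" inside "protein kinase C" are exactly the cases where a naive longest-string match would silently swallow the shorter sense, which is why the resource tracks them explicitly.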
Affiliation(s)
- Dietrich Rebholz-Schuhmann
- Department of Computational Linguistics, University of Zürich, Zürich, Switzerland
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
- Jee-Hyub Kim
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
- Ying Yan
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
- Abhishek Dixit
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
- Caroline Friteyre
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
- Robert Hoehndorf
- Department of Genetics, University of Cambridge, Downing Street, Cambridge, United Kingdom
- Rolf Backofen
- Albert-Ludwigs-University Freiburg, Fahnenbergplatz, Freiburg, Germany
- Ian Lewin
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
6
Rinaldi F, Clematide S, Hafner S, Schneider G, Grigonyte G, Romacker M, Vachon T. Using the OntoGene pipeline for the triage task of BioCreative 2012. Database (Oxford) 2013; 2013:bas053. [PMID: 23396322] [PMCID: PMC3568389] [DOI: 10.1093/database/bas053]
Abstract
In this article, we describe the architecture of the OntoGene Relation mining pipeline and its application in the triage task of BioCreative 2012. The aim of the task is to support the triage of abstracts relevant to the process of curation of the Comparative Toxicogenomics Database. We use a conventional information retrieval system (Lucene) to provide a baseline ranking, which we then combine with information provided by our relation mining system, in order to achieve an optimized ranking. Our approach additionally delivers domain entities mentioned in each input document as well as candidate relationships, both ranked according to a confidence score computed by the system. This information is presented to the user through an advanced interface aimed at supporting the process of interactive curation. Thanks, in particular, to the high-quality entity recognition, the OntoGene system achieved the best overall results in the task.
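The abstract does not state how the Lucene baseline and the relation-mining confidence scores are combined; a simple linear interpolation (the `alpha` weight, its 0.6 value, and the document IDs are illustrative assumptions) conveys the idea of such an optimized re-ranking:

```python
def combined_ranking(baseline: dict[str, float],
                     relation_conf: dict[str, float],
                     alpha: float = 0.6) -> list[str]:
    """Re-rank document IDs by a linear interpolation of a retrieval
    baseline score and a relation-mining confidence. Both score sets are
    max-normalised first so they are on comparable scales; alpha is a
    tunable mixing weight (0.6 is an arbitrary illustrative value)."""
    def norm(scores):
        hi = max(scores.values()) or 1.0
        return {d: s / hi for d, s in scores.items()}
    b, r = norm(baseline), norm(relation_conf)
    docs = set(b) | set(r)
    return sorted(
        docs,
        key=lambda d: alpha * b.get(d, 0.0) + (1 - alpha) * r.get(d, 0.0),
        reverse=True,
    )

# Hypothetical scores: Lucene retrieval scores vs. relation-mining confidence.
baseline = {"PMID:1": 7.2, "PMID:2": 6.8, "PMID:3": 1.1}
relation_conf = {"PMID:2": 0.9, "PMID:3": 0.8, "PMID:1": 0.1}
print(combined_ranking(baseline, relation_conf))
```

In this toy example the relation-mining evidence promotes PMID:2 above the baseline's top hit, which is the kind of ranking improvement the triage system aims for.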
Affiliation(s)
- Fabio Rinaldi
- Institute of Computational Linguistics, University of Zurich, Binzmühlestrasse 14, Zurich 8050, Switzerland.
7
Abstract
Background One of the key pieces of information that biomedical text mining systems are expected to extract from the literature is interactions among different types of biomedical entities (proteins, genes, diseases, drugs, etc.). Several large resources of curated relations between biomedical entities are currently available, such as the Pharmacogenomics Knowledge Base (PharmGKB) and the Comparative Toxicogenomics Database (CTD). Biomedical text mining systems, in particular those that deal with the extraction of relationships among entities, could make better use of this wealth of already curated material. Results We propose a simple and effective method based on logistic regression (also known as maximum entropy modeling) for an optimized ranking of relation candidates utilizing curated abstracts. Furthermore, we examine the effects and difficulties of using widely available metadata (i.e. MeSH terms and chemical substance index terms) for relation extraction. Cross-validation experiments result in an improvement of the ranking quality in terms of AUCiP/R by 39% (PharmGKB) and 116% (CTD) against a frequency-based baseline of 0.39 (PharmGKB) and 0.21 (CTD). For the TAP-10 metric, we achieve an improvement of 53% (PharmGKB) and 134% (CTD) against the same baseline system (0.21 PharmGKB and 0.15 CTD). Conclusions Our experiments with the PharmGKB and CTD databases show a strong positive effect of ranking relation candidates using the vast amount of curated relations covered by currently available knowledge databases. The tasks of concept identification and candidate relation generation profit from the adaptation to previously curated material. This presents an effective and practical method, suitable for conservative extension and re-validation of biomedical relations from texts, that has been successfully used for curation experiments with the PharmGKB and CTD databases.
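As a rough sketch of the approach, a hand-rolled logistic regression can rank relation candidates from binary metadata features; the feature definitions, training labels, and candidate names below are all hypothetical, not the paper's actual features or data:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.5, epochs=2000):
    """Plain stochastic-gradient-descent logistic regression (maximum
    entropy with binary features). Illustrative stand-in for the model."""
    w = [0.0] * (len(X[0]) + 1)          # last weight is the bias
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi + [1.0])))
            g = p - yi                   # gradient of the log loss
            w = [wj - lr * g * xj for wj, xj in zip(w, xi + [1.0])]
    return w

def score(w, xi):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, xi + [1.0])))

# Hypothetical binary features per relation candidate:
# [entity pair co-occurs in title, MeSH term overlap, chemical index match]
X = [[1, 1, 1], [1, 0, 1], [0, 1, 0], [0, 0, 0], [1, 1, 0], [0, 0, 1]]
y = [1, 1, 0, 0, 1, 0]                   # 1 = relation was curated
w = train_logreg(X, y)

candidates = {"geneA-drugB": [1, 1, 0], "geneC-drugD": [0, 0, 1]}
ranked = sorted(candidates, key=lambda c: score(w, candidates[c]), reverse=True)
print(ranked)
```

The curated relations play the role of training labels, so the ranker learns which metadata features correlate with relations that curators actually accepted, mirroring how the method exploits existing knowledge bases.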
8
Rinaldi F, Schneider G, Clematide S. Relation mining experiments in the pharmacogenomics domain. J Biomed Inform 2012; 45:851-61. [PMID: 22580177] [DOI: 10.1016/j.jbi.2012.04.014]
Abstract
The mutual interactions among genes, diseases, and drugs are at the heart of biomedical research, and are especially important for the pharmaceutical industry. The recent trend towards personalized medicine makes it increasingly relevant to be able to tailor drugs to specific genetic makeups. The Pharmacogenetics and Pharmacogenomics Knowledge Base (PharmGKB) aims at capturing relevant information about such interactions from several sources, including curation of the biomedical literature. Advanced text mining tools that can support the process of manual curation are increasingly necessary in order to cope with the deluge of newly published results. However, effective evaluation of those tools requires the availability of manually curated data as a gold standard. In this paper we discuss how the existing PharmGKB database can be used for such an evaluation task, in a way similar to the use of gold-standard data derived from protein-protein interaction databases in one of the recent BioCreative shared tasks. Additionally, we present our own considerations and results on the feasibility and difficulty of such a task.
Affiliation(s)
- Fabio Rinaldi
- Institute of Computational Linguistics, University of Zurich, Binzmühlestrasse 14, 8050 Zürich, Switzerland.