1
Yin Y, Kim H, Xiao X, Wei CH, Kang J, Lu Z, Xu H, Fang M, Chen Q. Augmenting biomedical named entity recognition with general-domain resources. J Biomed Inform 2024; 159:104731. [PMID: 39368529 DOI: 10.1016/j.jbi.2024.104731] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2024] [Revised: 09/05/2024] [Accepted: 09/27/2024] [Indexed: 10/07/2024]
Abstract
OBJECTIVE: Training a neural network-based biomedical named entity recognition (BioNER) model usually requires extensive and costly human annotations. While several studies have employed multi-task learning with multiple BioNER datasets to reduce human effort, this approach does not consistently yield performance improvements and may introduce label ambiguity in different biomedical corpora. We aim to tackle these challenges through transfer learning from easily accessible resources with fewer concept overlaps with biomedical datasets.
METHODS: We proposed GERBERA, a simple-yet-effective method that utilized general-domain NER datasets for training. We performed multi-task learning to train a pre-trained biomedical language model with both the target BioNER dataset and the general-domain dataset. Subsequently, we fine-tuned the models specifically for the BioNER dataset.
RESULTS: We systematically evaluated GERBERA on five datasets of eight entity types, collectively consisting of 81,410 instances. Despite using fewer biomedical resources, our models demonstrated superior performance compared to baseline models trained with additional BioNER datasets. Specifically, our models consistently outperformed the baseline models in six out of eight entity types, achieving an average improvement of 0.9% over the best baseline performance across eight entities. Our method was especially effective in amplifying performance on BioNER datasets characterized by limited data, with a 4.7% improvement in F1 scores on the JNLPBA-RNA dataset.
CONCLUSION: This study introduces a new training method that leverages cost-effective general-domain NER datasets to augment BioNER models. This approach significantly improves BioNER model performance, making it a valuable asset for scenarios with scarce or costly biomedical datasets. We make data, code, and models publicly available via https://github.com/qingyu-qc/bioner_gerbera.
Affiliation(s)
- Yu Yin
  - Department of Computer Science, University of Liverpool, Liverpool L69 3DR, United Kingdom
- Hyunjae Kim
  - Department of Computer Science, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea
- Xiao Xiao
  - Department of Computer Science, University of Liverpool, Liverpool L69 3DR, United Kingdom
- Chih Hsuan Wei
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 0894, United States of America
- Jaewoo Kang
  - Department of Computer Science, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea
- Zhiyong Lu
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 0894, United States of America
- Hua Xu
  - Department of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT, 06510, United States of America
- Meng Fang
  - Department of Computer Science, University of Liverpool, Liverpool L69 3DR, United Kingdom
- Qingyu Chen
  - Department of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT, 06510, United States of America
2
Farrell MJ, Le Guillarme N, Brierley L, Hunter B, Scheepens D, Willoughby A, Yates A, Mideo N. The changing landscape of text mining: a review of approaches for ecology and evolution. Proc Biol Sci 2024; 291:20240423. [PMID: 39082244 PMCID: PMC11289731 DOI: 10.1098/rspb.2024.0423] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2024] [Revised: 06/20/2024] [Accepted: 06/20/2024] [Indexed: 08/02/2024] Open
Abstract
In ecology and evolutionary biology, the synthesis and modelling of data from published literature are commonly used to generate insights and test theories across systems. However, the tasks of searching, screening, and extracting data from literature are often arduous. Researchers may manually process hundreds to thousands of articles for systematic reviews, meta-analyses, and compiling synthetic datasets. As relevant articles expand to tens or hundreds of thousands, computer-based approaches can increase the efficiency, transparency and reproducibility of literature-based research. Methods available for text mining are rapidly changing owing to developments in machine learning-based language models. We review the growing landscape of approaches, mapping them onto three broad paradigms (frequency-based approaches, traditional Natural Language Processing and deep learning-based language models). This serves as an entry point to learn foundational and cutting-edge concepts, vocabularies, and methods to foster integration of these tools into ecological and evolutionary research. We cover approaches for modelling ecological texts, generating training data, developing custom models and interacting with large language models and discuss challenges and possible solutions to implementing these methods in ecology and evolution.
Affiliation(s)
- Maxwell J. Farrell
  - Department of Ecology & Evolutionary Biology, University of Toronto, Toronto, Ontario, Canada
  - School of Biodiversity, One Health & Veterinary Medicine, University of Glasgow, Glasgow, UK
  - MRC-University of Glasgow Centre for Virus Research, Glasgow, UK
- Nicolas Le Guillarme
  - Université Grenoble Alpes, CNRS, LECA, Laboratoire d'Ecologie Alpine, Grenoble, France
- Liam Brierley
  - MRC-University of Glasgow Centre for Virus Research, Glasgow, UK
  - Department of Health Data Science, University of Liverpool, Liverpool, UK
- Bronwen Hunter
  - School of Life Sciences, University of Sussex, Brighton, UK
- Daan Scheepens
  - Division of Biosciences, University College London, London, UK
- Andrew Yates
  - Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands
- Nicole Mideo
  - Department of Ecology & Evolutionary Biology, University of Toronto, Toronto, Ontario, Canada
3
Nédellec C, Sauvion C, Bossy R, Borovikova M, Deléger L. TaeC: A manually annotated text dataset for trait and phenotype extraction and entity linking in wheat breeding literature. PLoS One 2024; 19:e0305475. [PMID: 38870159 PMCID: PMC11175518 DOI: 10.1371/journal.pone.0305475] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/16/2023] [Accepted: 05/31/2024] [Indexed: 06/15/2024] Open
Abstract
Wheat varieties show a large diversity of traits and phenotypes. Linking them to genetic variability is essential for shorter and more efficient wheat breeding programs. A growing number of plant molecular information networks provide interlinked interoperable data to support the discovery of gene-phenotype interactions. A large body of scientific literature and observational data obtained in-field and under controlled conditions document wheat breeding experiments. The cross-referencing of this complementary information is essential. Text from databases and scientific publications has been identified early on as a relevant source of information. However, the wide variety of terms used to refer to traits and phenotype values makes it difficult to find and cross-reference the textual information, e.g. simple dictionary lookup methods miss relevant terms. Corpora with manually annotated examples are thus needed to evaluate and train textual information extraction methods. While several corpora contain annotations of human and animal phenotypes, no corpus is available for plant traits. This hinders the evaluation of text mining-based crop knowledge graphs (e.g. AgroLD, KnetMiner, WheatIS-FAIDARE) and limits the ability to train machine learning methods and improve the quality of information. The Triticum aestivum trait Corpus is a new gold standard for traits and phenotypes of wheat. It consists of 528 PubMed references that are fully annotated by trait, phenotype, and species. We address the interoperability challenge of crossing sparse assay data and publications by using the Wheat Trait and Phenotype Ontology to normalize trait mentions and the species taxonomy of the National Center for Biotechnology Information to normalize species. The paper describes the construction of the corpus. 
A study of the performance of state-of-the-art language models for both named entity recognition and linking tasks trained on the corpus shows that it is suitable for training and evaluation. This corpus is currently the most comprehensive manually annotated corpus for natural language processing studies on crop phenotype information from the literature.
Affiliation(s)
- Claire Nédellec
  - Université Paris-Saclay, INRAE, MaIAGE, Jouy-en-Josas, France
- Clara Sauvion
  - Université Paris-Saclay, INRAE, MaIAGE, Jouy-en-Josas, France
- Robert Bossy
  - Université Paris-Saclay, INRAE, MaIAGE, Jouy-en-Josas, France
- Mariya Borovikova
  - Université Paris-Saclay, INRAE, MaIAGE, Jouy-en-Josas, France
  - TETIS, Univ. Montpellier, AgroParisTech, CIRAD, CNRS, INRAE, Montpellier, France
- Louise Deléger
  - Université Paris-Saclay, INRAE, MaIAGE, Jouy-en-Josas, France
4
Gabud R, Lapitan P, Mariano V, Mendoza E, Pampolina N, Clariño MAA, Batista-Navarro R. Unsupervised literature mining approaches for extracting relationships pertaining to habitats and reproductive conditions of plant species. Front Artif Intell 2024; 7:1371411. [PMID: 38845683 PMCID: PMC11153722 DOI: 10.3389/frai.2024.1371411] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2024] [Accepted: 05/10/2024] [Indexed: 06/09/2024] Open
Abstract
Introduction: Fine-grained, descriptive information on habitats and reproductive conditions of plant species is crucial in forest restoration and rehabilitation efforts. Precise timing of fruit collection and knowledge of species' habitat preferences and reproductive status are necessary, especially for tropical plant species that have short-lived recalcitrant seeds and those that exhibit complex reproductive patterns, e.g., species with supra-annual mass flowering events that may occur at irregular intervals. Understanding plant regeneration, as part of planning for effective reforestation, can be aided by providing access to structured information, e.g., in knowledge bases, that spans years if not decades and covers a wide range of geographic locations. The content of such a resource can be enriched with literature-derived information on species' time-sensitive reproductive conditions and location-specific habitats.
Methods: We sought to develop unsupervised approaches to extract relationships pertaining to habitats and their locations, and reproductive conditions of plant species and corresponding temporal information. Firstly, we handcrafted rules for a traditional rule-based pattern matching approach. We then developed a relation extraction approach building upon transformer models, i.e., the Text-to-Text Transfer Transformer (T5), casting the relation extraction problem as a question answering and natural language inference task. We then proposed a novel unsupervised hybrid approach that combines our rule-based and transformer-based approaches.
Results: Evaluation of our hybrid approach on an annotated corpus of biodiversity-focused documents demonstrated an improvement of up to 15 percentage points in recall and the best performance over solely rule-based and transformer-based methods, with F1-scores ranging from 89.61% to 96.75% for reproductive condition - temporal expression relations, and from 85.39% to 89.90% for habitat - geographic location relations. Our work shows that even without training models on any domain-specific labeled dataset, we are able to extract relationships between biodiversity concepts from literature with satisfactory performance.
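The abstract does not say how the rule-based and transformer-based outputs are combined. As a minimal sketch, assuming the simplest possible combination (a set union of the relations found by each extractor, which can only raise recall relative to either one alone), the idea might look like the following. The regular expression, the `stub_model` stand-in for a T5 question-answering call, and all function names are hypothetical illustrations and are not taken from the paper.

```python
import re
from typing import Callable, Set, Tuple

# (reproductive condition, temporal expression)
Relation = Tuple[str, str]

MONTHS = (
    r"(?:January|February|March|April|May|June|July|"
    r"August|September|October|November|December)"
)

# Hypothetical handcrafted rule: a condition keyword followed, within the
# same sentence, by a preposition and a month name.
RULE = re.compile(
    rf"\b(flowering|fruiting)\b[^.]*?\b(?:in|during|from)\s+({MONTHS})",
    re.IGNORECASE,
)


def rule_based(sentence: str) -> Set[Relation]:
    """Return relations matched by the handcrafted pattern."""
    return {(m.group(1).lower(), m.group(2)) for m in RULE.finditer(sentence)}


def stub_model(sentence: str) -> Set[Relation]:
    """Stand-in for a transformer QA call; returns a canned answer.

    A real system would instead ask a model a question such as
    'When does the species flower?' and parse its answer.
    """
    if "blooms" in sentence and "April" in sentence:
        return {("flowering", "April")}
    return set()


def hybrid(sentence: str,
           model_extract: Callable[[str], Set[Relation]]) -> Set[Relation]:
    """Assumed combination strategy: union of both extractors' outputs."""
    return rule_based(sentence) | set(model_extract(sentence))


if __name__ == "__main__":
    for text in ("Flowering occurs in March.", "The tree blooms every April."):
        print(text, "->", hybrid(text, stub_model))
```

In this toy setup the rule catches the explicit "Flowering occurs in March" phrasing, while the paraphrase "blooms every April" is only recovered by the model stand-in. A union trades some precision for recall, so a real system would likely need filtering or confidence thresholds on top.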
Affiliation(s)
- Roselyn Gabud
  - Department of Computer Science, College of Engineering, University of the Philippines Diliman, Quezon City, Philippines
  - Institute of Computer Science, College of Arts and Sciences, University of the Philippines Los Baños, Laguna, Philippines
- Portia Lapitan
  - Department of Forest Biological Sciences, College of Forestry and Natural Resources, University of the Philippines Los Baños, Laguna, Philippines
- Vladimir Mariano
  - Young Southeast Asian Leaders Initiative (YSEALI) Academy, Fulbright University Vietnam, Ho Chi Minh City, Vietnam
- Eduardo Mendoza
  - Institute of Computer Science, College of Arts and Sciences, University of the Philippines Los Baños, Laguna, Philippines
  - Mathematics and Statistics Department, De la Salle University, Manila, Philippines
  - Center for Natural Science and Environmental Research, De la Salle University, Manila, Philippines
  - Max Planck Institute of Biochemistry, Munich, Germany
- Nelson Pampolina
  - Department of Forest Biological Sciences, College of Forestry and Natural Resources, University of the Philippines Los Baños, Laguna, Philippines
- Maria Art Antonette Clariño
  - Institute of Computer Science, College of Arts and Sciences, University of the Philippines Los Baños, Laguna, Philippines
- Riza Batista-Navarro
  - Institute of Computer Science, College of Arts and Sciences, University of the Philippines Los Baños, Laguna, Philippines
  - Department of Computer Science, University of Manchester, Manchester, United Kingdom
5
Gougherty AV, Clipp HL. Testing the reliability of an AI-based large language model to extract ecological information from the scientific literature. NPJ Biodiversity 2024; 3:13. [PMID: 39242700 PMCID: PMC11332232 DOI: 10.1038/s44185-024-00043-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 12/08/2023] [Accepted: 03/08/2024] [Indexed: 09/09/2024]
Abstract
Artificial intelligence-based large language models (LLMs) have the potential to substantially improve the efficiency and scale of ecological research, but their propensity for delivering incorrect information raises significant concern about their usefulness in their current state. Here, we formally test how quickly and accurately an LLM performs in comparison to a human reviewer when tasked with extracting various types of ecological data from the scientific literature. We found the LLM was able to extract relevant data over 50 times faster than the reviewer and had very high accuracy (>90%) in extracting discrete and categorical data, but it performed poorly when extracting certain quantitative data. Our case study shows that LLMs offer great potential for generating large ecological databases at unprecedented speed and scale, but additional quality assurance steps are required to ensure data integrity.
Affiliation(s)
- Hannah L Clipp
  - USDA Forest Service Northern Research Station, Delaware, OH, USA
6
'Small Data' for big insights in ecology. Trends Ecol Evol 2023:S0169-5347(23)00019-8. [PMID: 36797167 DOI: 10.1016/j.tree.2023.01.015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2022] [Revised: 01/18/2023] [Accepted: 01/25/2023] [Indexed: 02/17/2023]
Abstract
Big Data science has significantly furthered our understanding of complex systems by harnessing large volumes of data, generated at high velocity and in great variety. However, there is a risk that Big Data collection is prioritised to the detriment of 'Small Data' (data with few observations). This poses a particular risk to ecology where Small Data abounds. Machine learning experts are increasingly looking to Small Data to drive the next generation of innovation, leading to development in methods for Small Data such as transfer learning, knowledge graphs, and synthetic data. Meanwhile, meta-analysis and causal reasoning approaches are evolving to provide new insights from Small Data. These advances should add value to high-quality Small Data catalysing future insights for ecology.
7
Jimeno Yepes AJ, Verspoor K. Classifying literature mentions of biological pathogens as experimentally studied using natural language processing. J Biomed Semantics 2023; 14:1. [PMID: 36721225 PMCID: PMC9889128 DOI: 10.1186/s13326-023-00282-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/25/2022] [Accepted: 01/17/2023] [Indexed: 02/02/2023] Open
Abstract
BACKGROUND: Information pertaining to mechanisms, management and treatment of disease-causing pathogens including viruses and bacteria is readily available from research publications indexed in MEDLINE. However, identifying the literature that specifically characterises these pathogens and their properties based on experimental research, important for understanding of the molecular basis of diseases caused by these agents, requires sifting through a large number of articles to exclude incidental mentions of the pathogens, or references to pathogens in other non-experimental contexts such as public health.
OBJECTIVE: In this work, we lay the foundations for the development of automatic methods for characterising mentions of pathogens in scientific literature, focusing on the task of identifying research that involves the experimental study of a pathogen. There are no manually annotated pathogen corpora available for this purpose, while such resources are necessary to support the development of machine learning-based models. We therefore aim to fill this gap, producing a large data set automatically from MEDLINE under some simplifying assumptions for the task definition, and using it to explore automatic methods that specifically support the detection of experimentally studied pathogen mentions in research publications.
METHODS: We developed a pathogen mention characterisation literature data set (READBiomed-Pathogens) automatically using NCBI resources, which we make available. Resources such as the NCBI Taxonomy, MeSH and GenBank can be used effectively to identify relevant literature about experimentally researched pathogens, more specifically using MeSH to link to MEDLINE citations including titles and abstracts with experimentally researched pathogens. We experiment with several machine learning-based natural language processing (NLP) algorithms leveraging this data set as training data, to model the task of detecting papers that specifically describe experimental study of a pathogen.
RESULTS: We show that our data set READBiomed-Pathogens can be used to explore natural language processing configurations for experimental pathogen mention characterisation. READBiomed-Pathogens includes citations related to organisms including bacteria, viruses, and a small number of toxins and other disease-causing agents.
CONCLUSIONS: We studied the characterisation of experimentally studied pathogens in scientific literature, developing several natural language processing methods supported by an automatically developed data set. As a core contribution of the work, we presented a methodology to automatically construct a data set for pathogen identification using existing biomedical resources. The data set and the annotation code are made publicly available. Performance of the pathogen mention identification and characterisation algorithms was additionally evaluated on a small manually annotated data set, which shows that the data set we generated allows pathogens of interest to be characterised.
TRIAL REGISTRATION: N/A.
Affiliation(s)
- Antonio Jose Jimeno Yepes
  - School of Computing Technologies, RMIT University, Melbourne, Australia
  - School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
- Karin Verspoor
  - School of Computing Technologies, RMIT University, Melbourne, Australia
  - School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
8
Koerich G, Fraser CI, Lee CK, Morgan FJ, Tonkin JD. Forecasting the future of life in Antarctica. Trends Ecol Evol 2023; 38:24-34. [PMID: 35934551 DOI: 10.1016/j.tree.2022.07.009] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 05/23/2022] [Revised: 07/12/2022] [Accepted: 07/15/2022] [Indexed: 12/24/2022]
Abstract
Antarctic ecosystems are under increasing anthropogenic pressure, but efforts to predict the responses of Antarctic biodiversity to environmental change are hindered by considerable data challenges. Here, we illustrate how novel data capture technologies provide exciting opportunities to sample Antarctic biodiversity at wider spatiotemporal scales. Data integration frameworks, such as point process and hierarchical models, can mitigate weaknesses in individual data sets, improving confidence in their predictions. Increasing process knowledge in models is imperative to achieving improved forecasts of Antarctic biodiversity, which can be attained for data-limited species using hybrid modelling frameworks. Leveraging these state-of-the-art tools will help to overcome many of the data scarcity challenges presented by the remoteness of Antarctica, enabling more robust forecasts both near- and long-term.
Affiliation(s)
- Gabrielle Koerich
  - School of Biological Sciences, University of Canterbury, Private Bag 4800, Christchurch 8140, New Zealand
- Ceridwen I Fraser
  - Department of Marine Science, University of Otago, PO Box 56, Dunedin 9054, New Zealand
- Charles K Lee
  - International Centre for Terrestrial Antarctic Research, School of Science, University of Waikato, Private Bag 3105, Hamilton 3240, New Zealand
- Fraser J Morgan
  - Manaaki Whenua - Landcare Research, Auckland 1072, New Zealand; Te Pūnaha Matatini, Centre of Research Excellence in Complex Systems, Auckland, New Zealand
- Jonathan D Tonkin
  - School of Biological Sciences, University of Canterbury, Private Bag 4800, Christchurch 8140, New Zealand; Te Pūnaha Matatini, Centre of Research Excellence in Complex Systems, Auckland, New Zealand; Bioprotection Aotearoa, Centre of Research Excellence, Canterbury, New Zealand
9
Rogers AD, Appeltans W, Assis J, Ballance LT, Cury P, Duarte C, Favoretto F, Hynes LA, Kumagai JA, Lovelock CE, Miloslavich P, Niamir A, Obura D, O'Leary BC, Ramirez-Llodra E, Reygondeau G, Roberts C, Sadovy Y, Steeds O, Sutton T, Tittensor DP, Velarde E, Woodall L, Aburto-Oropeza O. Discovering marine biodiversity in the 21st century. Advances in Marine Biology 2022; 93:23-115. [PMID: 36435592 DOI: 10.1016/bs.amb.2022.09.002] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/16/2023]
Abstract
We review the current knowledge of the biodiversity of the ocean as well as the levels of decline and threat for species and habitats. The lack of understanding of the distribution of life in the ocean is identified as a significant barrier to restoring its biodiversity and health. We explore why the science of taxonomy has failed to deliver knowledge of what species are present in the ocean, how they are distributed and how they are responding to global and regional to local anthropogenic pressures. This failure prevents nations from meeting their international commitments to conserve marine biodiversity, with the result that investment in taxonomy has declined in many countries. We explore a range of new technologies and approaches for discovery of marine species and their detection and monitoring. These include imaging methods, molecular approaches, active and passive acoustics, the use of interconnected databases and citizen science. Whilst no one method is suitable for discovering or detecting all groups of organisms, many are complementary and have been combined to give a more complete picture of biodiversity in marine ecosystems. We conclude that integrated approaches represent the best way forwards for accelerating species discovery, description and biodiversity assessment. Examples of integrated taxonomic approaches are identified from terrestrial ecosystems. Such integrated taxonomic approaches require the adoption of cybertaxonomy approaches and will be boosted by new autonomous sampling platforms and development of machine-speed exchange of digital information between databases.
Affiliation(s)
- Alex D Rogers
  - REV Ocean, Lysaker, Norway; Nekton Foundation, Begbroke Science Park, Oxford, United Kingdom
- Ward Appeltans
  - Intergovernmental Oceanographic Commission of UNESCO, Oostende, Belgium
- Jorge Assis
  - Centre of Marine Sciences, University of Algarve, Faro, Portugal
- Lisa T Ballance
  - Marine Mammal Institute, Oregon State University, Newport, OR, United States
- Carlos Duarte
  - King Abdullah University of Science and Technology (KAUST), Red Sea Research Center (RSRC) and Computational Bioscience Research Center (CBRC), Thuwal, Kingdom of Saudi Arabia
- Fabio Favoretto
  - Autonomous University of Baja California Sur, La Paz, Baja California Sur, Mexico
- Lisa A Hynes
  - Nekton Foundation, Begbroke Science Park, Oxford, United Kingdom
- Joy A Kumagai
  - Senckenberg Biodiversity and Climate Research Institute, Frankfurt am Main, Germany
- Catherine E Lovelock
  - School of Biological Sciences, The University of Queensland, St Lucia, QLD, Australia
- Patricia Miloslavich
  - Scientific Committee on Oceanic Research (SCOR), College of Earth, Ocean and Environment, University of Delaware, Newark, DE, United States; Departamento de Estudios Ambientales, Universidad Simón Bolívar, Venezuela & Scientific Committee for Oceanic Research (SCOR), Newark, DE, United States
- Aidin Niamir
  - Senckenberg Biodiversity and Climate Research Institute, Frankfurt am Main, Germany
- Bethan C O'Leary
  - Centre for Ecology & Conservation, College of Life and Environmental Sciences, University of Exeter, Penryn, United Kingdom; Department of Environment and Geography, University of York, York, United Kingdom
- Eva Ramirez-Llodra
  - REV Ocean, Lysaker, Norway; Nekton Foundation, Begbroke Science Park, Oxford, United Kingdom
- Gabriel Reygondeau
  - Yale Center for Biodiversity Movement and Global Change, Yale University, New Haven, CT, United States; Nippon Foundation-Nereus Program, Institute for the Oceans and Fisheries, University of British Columbia, Vancouver, BC, Canada
- Callum Roberts
  - Centre for Ecology & Conservation, College of Life and Environmental Sciences, University of Exeter, Penryn, United Kingdom
- Yvonne Sadovy
  - School of Biological Sciences, Swire Institute of Marine Science, The University of Hong Kong, Hong Kong
- Oliver Steeds
  - Nekton Foundation, Begbroke Science Park, Oxford, United Kingdom
- Tracey Sutton
  - Nova Southeastern University, Halmos College of Natural Sciences and Oceanography, Dania Beach, FL, United States
- Enriqueta Velarde
  - Instituto de Ciencias Marinas y Pesquerías, Universidad Veracruzana, Veracruz, Mexico
- Lucy Woodall
  - Nekton Foundation, Begbroke Science Park, Oxford, United Kingdom; Department of Zoology, University of Oxford, Oxford, United Kingdom
10
Farrell MJ, Brierley L, Willoughby A, Yates A, Mideo N. Past and future uses of text mining in ecology and evolution. Proc Biol Sci 2022; 289:20212721. [PMID: 35582795 PMCID: PMC9114983 DOI: 10.1098/rspb.2021.2721] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 01/04/2023] Open
Abstract
Ecology and evolutionary biology, like other scientific fields, are experiencing an exponential growth of academic manuscripts. As domain knowledge accumulates, scientists will need new computational approaches for identifying relevant literature to read and include in formal literature reviews and meta-analyses. Importantly, these approaches can also facilitate automated, large-scale data synthesis tasks and build structured databases from the information in the texts of primary journal articles, books, grey literature, and websites. The increasing availability of digital text, computational resources, and machine-learning based language models have led to a revolution in text analysis and natural language processing (NLP) in recent years. NLP has been widely adopted across the biomedical sciences but is rarely used in ecology and evolutionary biology. Applying computational tools from text mining and NLP will increase the efficiency of data synthesis, improve the reproducibility of literature reviews, formalize analyses of research biases and knowledge gaps, and promote data-driven discovery of patterns across ecology and evolutionary biology. Here we present recent use cases from ecology and evolution, and discuss future applications, limitations and ethical issues.
Affiliation(s)
- Maxwell J. Farrell
  - Department of Ecology and Evolutionary Biology, University of Toronto, Toronto, Canada
- Liam Brierley
  - Department of Health Data Science, University of Liverpool, Liverpool, UK
- Anna Willoughby
  - Odum School of Ecology, University of Georgia, Athens, GA, USA
  - Center for the Ecology of Infectious Diseases, University of Georgia, Athens, GA, USA
- Andrew Yates
  - University of Amsterdam, Amsterdam, The Netherlands
- Nicole Mideo
  - Department of Ecology and Evolutionary Biology, University of Toronto, Toronto, Canada