1
Nédellec C, Sauvion C, Bossy R, Borovikova M, Deléger L. TaeC: A manually annotated text dataset for trait and phenotype extraction and entity linking in wheat breeding literature. PLoS One 2024; 19:e0305475. [PMID: 38870159 PMCID: PMC11175518 DOI: 10.1371/journal.pone.0305475] [Received: 11/16/2023] [Accepted: 05/31/2024] [Indexed: 06/15/2024]
Abstract
Wheat varieties show a large diversity of traits and phenotypes. Linking them to genetic variability is essential for shorter and more efficient wheat breeding programs. A growing number of plant molecular information networks provide interlinked interoperable data to support the discovery of gene-phenotype interactions. A large body of scientific literature and observational data obtained in-field and under controlled conditions document wheat breeding experiments. The cross-referencing of this complementary information is essential. Text from databases and scientific publications has been identified early on as a relevant source of information. However, the wide variety of terms used to refer to traits and phenotype values makes it difficult to find and cross-reference the textual information, e.g. simple dictionary lookup methods miss relevant terms. Corpora with manually annotated examples are thus needed to evaluate and train textual information extraction methods. While several corpora contain annotations of human and animal phenotypes, no corpus is available for plant traits. This hinders the evaluation of text mining-based crop knowledge graphs (e.g. AgroLD, KnetMiner, WheatIS-FAIDARE) and limits the ability to train machine learning methods and improve the quality of information. The Triticum aestivum trait Corpus is a new gold standard for traits and phenotypes of wheat. It consists of 528 PubMed references that are fully annotated by trait, phenotype, and species. We address the interoperability challenge of crossing sparse assay data and publications by using the Wheat Trait and Phenotype Ontology to normalize trait mentions and the species taxonomy of the National Center for Biotechnology Information to normalize species. The paper describes the construction of the corpus. 
A study of the performance of state-of-the-art language models for both named entity recognition and linking tasks trained on the corpus shows that it is suitable for training and evaluation. This corpus is currently the most comprehensive manually annotated corpus for natural language processing studies on crop phenotype information from the literature.
Affiliation(s)
- Claire Nédellec
- Université Paris-Saclay, INRAE, MaIAGE, Jouy-en-Josas, France
- Clara Sauvion
- Université Paris-Saclay, INRAE, MaIAGE, Jouy-en-Josas, France
- Robert Bossy
- Université Paris-Saclay, INRAE, MaIAGE, Jouy-en-Josas, France
- Mariya Borovikova
- Université Paris-Saclay, INRAE, MaIAGE, Jouy-en-Josas, France
- TETIS, Univ. Montpellier, AgroParisTech, CIRAD, CNRS, INRAE, Montpellier, France
- Louise Deléger
- Université Paris-Saclay, INRAE, MaIAGE, Jouy-en-Josas, France
2
Zhong Z, Li G, Xu Z, Zeng H, Teng J, Feng X, Diao S, Gao Y, Li J, Zhang Z. Evaluating three strategies of genome-wide association analysis for integrating data from multiple populations. Anim Genet 2024; 55:265-276. [PMID: 38185881 DOI: 10.1111/age.13394] [Received: 07/28/2023] [Revised: 11/24/2023] [Accepted: 12/21/2023] [Indexed: 01/09/2024]
Abstract
In livestock, genome-wide association studies (GWAS) are usually conducted in a single population (single-GWAS) with limited sample size and detection power. To enhance the detection power of GWAS, meta-analysis of GWAS (meta-GWAS) and mega-analysis of GWAS (mega-GWAS) have been proposed to integrate data from multiple populations at the level of summary statistics or individual data, respectively. However, there is a lack of comparison for these different strategies, which makes it difficult to guide the best practice of GWAS integrating data from multiple study populations. To maximize the comparison of different association analysis strategies across multiple populations, we conducted single-GWAS, meta-GWAS, and mega-GWAS for the backfat thickness of 100 kg (BFT_100) and days to 100 kg (DAYS_100) within each of the three commercial pig breeds (Duroc, Yorkshire, and Landrace). Based on controlling the genome inflation factor to one, we calculated corrected p-values (pC). In Yorkshire, with the largest sample size, mega-GWAS, meta-GWAS and single-GWAS detected 149, 38 and 20 significant SNPs (pC < 1E-5) associated with BFT_100, as well as 26, four, and one QTL, respectively. Among them, pC of SNPs from mega-GWAS was the lowest, followed by meta-GWAS and single-GWAS. The correlation of pC among the three GWAS strategies ranged from 0.60 to 0.75 and the correlation of SNP effect values between meta-GWAS and mega-GWAS was 0.74, all showing good agreement. Collectively, even though there are differences in the integration of individual data or summary statistics, integrating data from multiple populations is an effective means of genetic argument for complex traits, especially mega-GWAS versus single-GWAS.
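The abstract does not say how the corrected p-values were computed. As an illustration only, the sketch below uses the standard genomic-control approach: convert each p-value to a 1-df chi-square statistic, estimate the inflation factor as the median statistic divided by the theoretical median, and rescale so the factor becomes one. The function names are hypothetical, and the assumption that the authors used this exact procedure is not confirmed by the source.

```python
import math
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364231195724


def p_to_chi2(p):
    """Convert a two-sided p-value in (0, 1] to a 1-df chi-square statistic."""
    z = NormalDist().inv_cdf(p / 2.0)  # lower-tail quantile, so z <= 0
    return z * z


def chi2_to_p(chi2):
    """Convert a 1-df chi-square statistic back to a two-sided p-value."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def genomic_control(p_values):
    """Return (inflation factor, corrected p-values).

    Assumption: statistics are always divided by the inflation factor,
    so the corrected set has an inflation factor of one.
    """
    chi2 = [p_to_chi2(p) for p in p_values]
    lam = median(chi2) / CHI2_1DF_MEDIAN
    corrected = [chi2_to_p(c / lam) for c in chi2]
    return lam, corrected


if __name__ == "__main__":
    # Made-up p-values for illustration; not data from the study.
    lam, corrected = genomic_control([0.001, 0.01, 0.05, 0.2, 0.5])
    print(lam, corrected)
```

In practice many pipelines apply the rescaling only when the estimated factor exceeds one; the unconditional version here simply mirrors the abstract's wording about controlling the factor to one.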
Affiliation(s)
- Zhanming Zhong
- National Engineering Research Center for Breeding Swine Industry, Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou, China
- Guangzhen Li
- National Engineering Research Center for Breeding Swine Industry, Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou, China
- Zhiting Xu
- National Engineering Research Center for Breeding Swine Industry, Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou, China
- Haonan Zeng
- National Engineering Research Center for Breeding Swine Industry, Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou, China
- Jinyan Teng
- National Engineering Research Center for Breeding Swine Industry, Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou, China
- Xueyan Feng
- National Engineering Research Center for Breeding Swine Industry, Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou, China
- Shuqi Diao
- National Engineering Research Center for Breeding Swine Industry, Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou, China
- Yahui Gao
- National Engineering Research Center for Breeding Swine Industry, Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou, China
- Jiaqi Li
- National Engineering Research Center for Breeding Swine Industry, Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou, China
- Zhe Zhang
- National Engineering Research Center for Breeding Swine Industry, Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou, China