Biscarini F, Nazzicari N, Broccanello C, Stevanato P, Marini S. "Noisy beets": impact of phenotyping errors on genomic predictions for binary traits in Beta vulgaris. Plant Methods 2016; 12:36. [PMID: 27437026 PMCID: PMC4949885 DOI: 10.1186/s13007-016-0136-4]
[Received: 04/01/2016] [Accepted: 07/06/2016] [Indexed: 06/06/2023]
Abstract
BACKGROUND
Noise (errors) in scientific data is endemic and may have a detrimental effect on statistical analyses and experimental results. The effects of noisy data have been assessed in genome-wide association studies for case-control experiments in human medicine. Little is known, however, about the impact of noisy data on genomic predictions, a statistical application widely used in plant and animal breeding.
RESULTS
In this study, the sensitivity to noise in the data of five classification methods was investigated: K-nearest neighbours (KNN), random forest (RF), ridge logistic regression (LR), and support vector machines (SVM) with linear or radial basis function kernels. A sugar beet population of 123 plants phenotyped for a binary trait and genotyped for 192 SNP (single nucleotide polymorphism) markers was used. Labels (0/1 phenotype) were randomly sampled to generate noise. Starting from the base scenario without errors in the labels, increasing proportions of noisy labels, up to 50 %, were generated and introduced in the data.
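The noise-injection step described above can be illustrated with a minimal Python sketch. It assumes that "noise" means flipping the 0/1 label of a randomly chosen subset of plants; the abstract only says labels were randomly sampled, so the exact procedure used in the study may differ. The function name `inject_label_noise`, the seed values, and the example labels are hypothetical and are not taken from the article.

```python
import random


def inject_label_noise(labels, proportion, seed=0):
    """Return a copy of `labels` with a given proportion of 0/1 labels flipped.

    Assumption: noise is modelled as label flipping on a random subset.
    """
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must be between 0 and 1")
    rng = random.Random(seed)
    n_noisy = round(proportion * len(labels))
    # Choose which samples receive an erroneous label (without replacement).
    noisy_idx = rng.sample(range(len(labels)), n_noisy)
    noisy = list(labels)
    for i in noisy_idx:
        noisy[i] = 1 - noisy[i]  # flip 0 -> 1 or 1 -> 0
    return noisy


# Made-up binary phenotypes for 123 plants (illustrative values only).
rng = random.Random(42)
labels = [rng.randint(0, 1) for _ in range(123)]

for proportion in (0.0, 0.1, 0.2, 0.5):
    noisy = inject_label_noise(labels, proportion)
    changed = sum(a != b for a, b in zip(labels, noisy))
    print(f"{proportion:.0%} noise: {changed} of {len(labels)} labels changed")
```

Each noisy label vector would then be used to train the classifiers, while the error-free scenario serves as the baseline.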
CONCLUSIONS
Local classification methods (KNN and RF) showed higher tolerance to noisy labels than methods that leverage global data properties (LR and the two SVM models). In particular, KNN outperformed all other classifiers, with AUC (area under the ROC curve) higher than 0.95 with up to 20 % noisy labels. The runner-up method, RF, had an AUC of 0.941 with 20 % noise.
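For readers unfamiliar with the AUC metric reported above, the following Python sketch computes it from its rank-based definition: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `auc` and the example scores are hypothetical; this is not necessarily how the authors computed their values.

```python
def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Made-up labels and classifier scores (illustrative values only).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(y_true, scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes, which gives context for the values of 0.95 and 0.941 quoted in the abstract.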