Li MD, Huang ZR, Shan QY, Chen SL, Zhang N, Hu HT, Wang W. Performance and comparison of artificial intelligence and human experts in the detection and classification of colonic polyps. BMC Gastroenterol 2022;22:517. [PMID: 36513975 PMCID: PMC9749329 DOI: 10.1186/s12876-022-02605-2]
[Received: 06/11/2021] [Accepted: 12/05/2022] [Indexed: 12/15/2022]
Abstract
OBJECTIVE
The main aim of this study was to analyze the performance of different artificial intelligence (AI) models in endoscopic colonic polyp detection and classification and to compare them with physicians of differing experience.
METHODS
We searched PubMed, EMBASE, Cochrane, and the citation index of conference proceedings for studies on colonoscopy, colonic polyps, artificial intelligence, machine learning, and deep learning published before May 2020. Study quality was assessed using the QUADAS-2 criteria for diagnostic test quality evaluation. Random-effects models were fitted using Meta-DISC 1.4 and RevMan 5.3.
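For readers unfamiliar with random-effects pooling, the following is a minimal Python sketch of one common approach: the DerSimonian-Laird estimator applied to logit-transformed proportions such as per-study sensitivity. It is an illustration only and is not necessarily the procedure that Meta-DISC or RevMan applies; the function names and the counts in the usage line are hypothetical and do not come from the included studies.

```python
import math

def dersimonian_laird(effects, variances):
    """Pool per-study effects with the DerSimonian-Laird random-effects estimator."""
    k = len(effects)
    w = [1.0 / v for v in variances]            # inverse-variance (fixed-effect) weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    # Cochran's Q measures heterogeneity around the fixed-effect estimate
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw
    # Between-study variance, truncated at zero
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_re = [1.0 / (v + tau2) for v in variances]  # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    return pooled, se, tau2

def pooled_proportion(events, totals):
    """Pool proportions on the logit scale; returns (estimate, ci_low, ci_high, tau2)."""
    effects, variances = [], []
    for e, n in zip(events, totals):
        a = e + 0.5          # 0.5 continuity correction (an assumption of this sketch)
        b = n - e + 0.5
        effects.append(math.log(a / b))
        variances.append(1.0 / a + 1.0 / b)
    pooled, se, tau2 = dersimonian_laird(effects, variances)

    def inv_logit(x):
        return 1.0 / (1.0 + math.exp(-x))

    return (inv_logit(pooled),
            inv_logit(pooled - 1.96 * se),
            inv_logit(pooled + 1.96 * se),
            tau2)

# Hypothetical per-study counts: true positives and total polyp-positive cases
estimate, ci_low, ci_high, tau2 = pooled_proportion([90, 45, 170], [100, 50, 200])
```

The same function can be applied to true negatives over polyp-negative cases to pool specificity. A bivariate model would additionally account for the correlation between sensitivity and specificity, which this sketch ignores.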
RESULTS
A total of 16 studies were included in the meta-analysis. Only one study (1/16) presented externally validated results. The areas under the curve (AUC) of the AI group, expert group, and non-expert group for detection and classification of colonic polyps were 0.940, 0.918, and 0.871, respectively. The AI group had slightly lower pooled specificity than the expert group (79% vs. 86%, P < 0.05) but higher pooled sensitivity (88% vs. 80%, P < 0.05). Likewise, the non-experts had lower pooled specificity in polyp recognition than the experts (81% vs. 86%, P < 0.05) and higher pooled sensitivity (85% vs. 80%, P < 0.05).
CONCLUSION
The performance of AI in polyp detection and classification is similar to that of human experts, with high sensitivity and moderate specificity. Different tasks may have an impact on the performance of deep learning models and human experts, especially in terms of sensitivity and specificity.