Ann Intern Med: Sensitivity and specificity of using the GPT-3.5 Turbo model for title and abstract screening in systematic reviews and meta-analyses
This article was translated and compiled by 小咖机器人.
Journal: Ann Intern Med
Original article: https://doi.org/10.7326/M23-3389
The abstract is as follows:
Background
Despite the exponential growth of the scientific literature, systematic reviews are still performed manually.
Objective
To investigate the sensitivity and specificity of GPT-3.5 Turbo, from OpenAI, acting as a single reviewer for title and abstract screening in systematic reviews.
Design
Diagnostic test accuracy study.
Setting
Unannotated bibliographic databases from 5 systematic reviews, representing 22 665 citations.
Participants
None.
Measurements
A generic prompt framework was designed to instruct GPT to perform title and abstract screening. The model's output was compared with the authors' decisions under 2 rules. The first rule balanced sensitivity and specificity, for example, so that the model could act as a second reviewer. The second rule optimized sensitivity, for example, to reduce the number of citations that need to be screened manually.
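As an illustration of how such screening can be automated, the following is a minimal sketch of asking GPT-3.5 Turbo to vote on a single citation through the OpenAI Chat Completions API. The prompt wording, inclusion criteria, and function name are hypothetical placeholders, not the authors' actual prompt framework.

```python
# Minimal sketch (not the authors' prompt framework): ask GPT-3.5 Turbo to vote
# on one citation. Requires the openai>=1.0 Python SDK and an API key in the
# OPENAI_API_KEY environment variable. The criteria text is a placeholder.
from openai import OpenAI

client = OpenAI()

def screen_citation(title: str, abstract: str, criteria: str) -> str:
    """Return the model's INCLUDE/EXCLUDE vote for one title and abstract."""
    prompt = (
        "You are screening citations for a systematic review.\n"
        f"Inclusion criteria:\n{criteria}\n\n"
        f"Title: {title}\n"
        f"Abstract: {abstract}\n\n"
        "Answer with a single word: INCLUDE or EXCLUDE."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # keep the vote as reproducible as possible
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip().upper()
```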
Results
Under the balanced rule, sensitivities ranged from 81.1% to 96.5% and specificities from 25.8% to 80.4%. Across all reviews, GPT identified 7 of 708 citations (1%) missed by humans that should have been included after full-text screening, at the cost of 10 279 of 22 665 false-positive recommendations (45.3%) that would require reconciliation during the screening process. Under the sensitive rule, sensitivities ranged from 94.6% to 99.8% and specificities from 2.2% to 46.6%. Limiting manual screening to citations not ruled out by GPT could reduce the number of citations to screen by between 127 of 6334 (2%) and 1851 of 4077 (45.4%), depending on the review, at the cost of missing 0 to 1 of 26 citations (3.8%) at the full-text level.
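For clarity, sensitivity and specificity here follow the usual diagnostic-test-accuracy definitions, with the authors' screening decisions as the reference standard: sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The sketch below (variable names are hypothetical) shows this calculation applied to paired GPT and human votes.

```python
# Sketch of the diagnostic-test-accuracy calculation: GPT votes are compared
# with human screening decisions, which serve as the reference standard.
def sensitivity_specificity(gpt_includes: list[bool], human_includes: list[bool]) -> tuple[float, float]:
    pairs = list(zip(gpt_includes, human_includes))
    tp = sum(g and h for g, h in pairs)            # both include
    fn = sum((not g) and h for g, h in pairs)      # GPT misses a citation humans included
    tn = sum((not g) and (not h) for g, h in pairs)
    fp = sum(g and (not h) for g, h in pairs)      # extra citations to reconcile
    return tp / (tp + fn), tn / (tn + fp)
```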
Limitations
Time needed to fine-tune the prompt. The retrospective nature of the study, a convenience sample of 5 systematic reviews, and GPT performance that is sensitive to prompt development and to time.
Conclusion
The GPT-3.5 Turbo model may be used as a second reviewer for title and abstract screening, at the cost of additional work to reconcile the added false positives. It also showed potential for reducing the number of citations before human screening, at the cost of missing some citations at the full-text level.
Primary Funding Source
None.
The English original is as follows:
Abstract
BACKGROUND Systematic reviews are performed manually despite the exponential growth of scientific literature.
OBJECTIVE To investigate the sensitivity and specificity of GPT-3.5 Turbo, from OpenAI, as a single reviewer, for title and abstract screening in systematic reviews.
DESIGN Diagnostic test accuracy study.
SETTING Unannotated bibliographic databases from 5 systematic reviews representing 22 665 citations.
PARTICIPANTS None.
MEASUREMENTS A generic prompt framework to instruct GPT to perform title and abstract screening was designed. The output of the model was compared with decisions from authors under 2 rules. The first rule balanced sensitivity and specificity, for example, to act as a second reviewer. The second rule optimized sensitivity, for example, to reduce the number of citations to be manually screened.
RESULTS Under the balanced rule, sensitivities ranged from 81.1% to 96.5% and specificities ranged from 25.8% to 80.4%. Across all reviews, GPT identified 7 of 708 citations (1%) missed by humans that should have been included after full-text screening at the cost of 10 279 of 22 665 false-positive recommendations (45.3%) that would require reconciliation during the screening process. Under the sensitive rule, sensitivities ranged from 94.6% to 99.8% and specificities ranged from 2.2% to 46.6%. Limiting manual screening to citations not ruled out by GPT could reduce the number of citations to screen from 127 of 6334 (2%) to 1851 of 4077 (45.4%), at the cost of missing from 0 to 1 of 26 citations (3.8%) at the full-text level.
LIMITATIONS Time needed to fine-tune prompt. Retrospective nature of the study, convenient sample of 5 systematic reviews, and GPT performance sensitive to prompt development and time.
CONCLUSION The GPT-3.5 Turbo model may be used as a second reviewer for title and abstract screening, at the cost of additional work to reconcile added false positives. It also showed potential to reduce the number of citations before screening by humans, at the cost of missing some citations at the full-text level.
PRIMARY FUNDING SOURCE None.
