Ann Intern Med: Sensitivity and Specificity of Using the GPT-3.5 Turbo Model for Title and Abstract Screening in Systematic Reviews and Meta-analyses

2024-05-23 Source: Ann Intern Med

This article was translated and compiled by the 小咖 bot.

Journal: Ann Intern Med

Original article: https://doi.org/10.7326/M23-3389

The abstract, as translated on this page, reads as follows:

Background

Systematic reviews are still performed manually despite the exponential growth of the scientific literature.

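Objective

To investigate the sensitivity and specificity of GPT-3.5 Turbo, from OpenAI, as a single reviewer for title and abstract screening in systematic reviews.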

Design

Diagnostic test accuracy study.

Setting

Unannotated bibliographic databases from 5 systematic reviews, representing 22 665 citations.

Participants

None.

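Measurements

A generic prompt framework was designed to instruct GPT to perform title and abstract screening. The model's output was compared with the authors' decisions under 2 rules. The first rule balanced sensitivity and specificity, for example, so that the model could act as a second reviewer. The second rule optimized sensitivity, for example, to reduce the number of citations that must be screened manually.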

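Results

Under the balanced rule, sensitivities ranged from 81.1% to 96.5% and specificities ranged from 25.8% to 80.4%. Across all reviews, GPT identified 7 of 708 citations (1%) that humans had missed and that should have been included after full-text screening, at the cost of 10 279 false-positive recommendations out of 22 665 (45.3%) that would need reconciliation during the screening process. Under the sensitive rule, sensitivities ranged from 94.6% to 99.8% and specificities ranged from 2.2% to 46.6%. Limiting manual screening to citations not ruled out by GPT could reduce the number of citations to screen by anywhere from 127 of 6334 (2%) to 1851 of 4077 (45.4%), at the cost of missing 0 to 1 of 26 citations (3.8%) at the full-text level.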

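Limitations

Time needed to fine-tune the prompt. Retrospective nature of the study, a convenience sample of 5 systematic reviews, and GPT performance that is sensitive to prompt development and to time.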

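Conclusion

The GPT-3.5 Turbo model may be used as a second reviewer for title and abstract screening, at the cost of additional work to reconcile the added false positives. It also showed potential to reduce the number of citations before human screening, at the cost of missing some citations at the full-text level.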

Primary funding source

None.

The original English text is as follows:

Abstract

BACKGROUND  Systematic reviews are performed manually despite the exponential growth of scientific literature.

OBJECTIVE  To investigate the sensitivity and specificity of GPT-3.5 Turbo, from OpenAI, as a single reviewer, for title and abstract screening in systematic reviews.

DESIGN  Diagnostic test accuracy study.

SETTING  Unannotated bibliographic databases from 5 systematic reviews representing 22 665 citations.

PARTICIPANTS  None.

MEASUREMENTS  A generic prompt framework to instruct GPT to perform title and abstract screening was designed. The output of the model was compared with decisions from authors under 2 rules. The first rule balanced sensitivity and specificity, for example, to act as a second reviewer. The second rule optimized sensitivity, for example, to reduce the number of citations to be manually screened.

RESULTS  Under the balanced rule, sensitivities ranged from 81.1% to 96.5% and specificities ranged from 25.8% to 80.4%. Across all reviews, GPT identified 7 of 708 citations (1%) missed by humans that should have been included after full-text screening at the cost of 10 279 of 22 665 false-positive recommendations (45.3%) that would require reconciliation during the screening process. Under the sensitive rule, sensitivities ranged from 94.6% to 99.8% and specificities ranged from 2.2% to 46.6%. Limiting manual screening to citations not ruled out by GPT could reduce the number of citations to screen from 127 of 6334 (2%) to 1851 of 4077 (45.4%), at the cost of missing from 0 to 1 of 26 citations (3.8%) at the full-text level.
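The proportions quoted in the results can be re-derived from the raw counts. The snippet below is only an illustrative sketch, not the authors' analysis code: the helper names `pct`, `sensitivity`, and `specificity` are assumptions introduced here, and the confusion-matrix counts in the last two lines are hypothetical, since the abstract reports only ranges for sensitivity and specificity.

```python
def pct(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage (unrounded)."""
    return 100.0 * part / whole


def sensitivity(tp: int, fn: int) -> float:
    """Share of truly relevant citations that the screener includes."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Share of truly irrelevant citations that the screener excludes."""
    return tn / (tn + fp)


# Counts taken directly from the abstract: (numerator, denominator).
reported_counts = {
    "false-positive recommendations": (10279, 22665),
    "citations missed by humans but flagged by GPT": (7, 708),
    "smallest reduction in manual screening": (127, 6334),
    "largest reduction in manual screening": (1851, 4077),
    "worst-case citations missed at full text": (1, 26),
}

for label, (part, whole) in reported_counts.items():
    print(f"{label}: {part}/{whole} = {pct(part, whole):.2f}%")

# Hypothetical confusion-matrix counts, for illustration only.
print(f"sensitivity = {sensitivity(tp=95, fn=5):.3f}")
print(f"specificity = {specificity(tn=40, fp=60):.3f}")
```

Printing two decimals shows how each figure was rounded. The first ratio evaluates to roughly 45.35%, which the abstract reports as 45.3%.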

LIMITATIONS  Time needed to fine-tune prompt. Retrospective nature of the study, convenient sample of 5 systematic reviews, and GPT performance sensitive to prompt development and time.

CONCLUSION  The GPT-3.5 Turbo model may be used as a second reviewer for title and abstract screening, at the cost of additional work to reconcile added false positives. It also showed potential to reduce the number of citations before screening by humans, at the cost of missing some citations at the full-text level.

PRIMARY FUNDING SOURCE  None.
