HealthBench: Evaluating Large Language Models Towards Improved Human Health

Source: 泰然健康網 | Date: August 21, 2025, 01:04

Authors: Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, Johannes Heidecke, Karan Singhal

Abstract: We present HealthBench, an open-source benchmark measuring the performance and safety of large language models in healthcare. HealthBench consists of 5,000 multi-turn conversations between a model and an individual user or healthcare professional. Responses are evaluated using conversation-specific rubrics created by 262 physicians. Unlike previous multiple-choice or short-answer benchmarks, HealthBench enables realistic, open-ended evaluation through 48,562 unique rubric criteria spanning several health contexts (e.g., emergencies, transforming clinical data, global health) and behavioral dimensions (e.g., accuracy, instruction following, communication). HealthBench performance over the last two years reflects steady initial progress (compare GPT-3.5 Turbo's 16% to GPT-4o's 32%) and more rapid recent improvements (o3 scores 60%). Smaller models have especially improved: GPT-4.1 nano outperforms GPT-4o and is 25 times cheaper. We additionally release two HealthBench variations: HealthBench Consensus, which includes 34 particularly important dimensions of model behavior validated via physician consensus, and HealthBench Hard, where the current top score is 32%. We hope that HealthBench grounds progress towards model development and applications that benefit human health.

Comments: Blog: this https URL Code: this https URL

Subjects: Computation and Language (cs.CL)

Cite as: arXiv:2505.08775 [cs.CL] (or arXiv:2505.08775v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2505.08775

arXiv-issued DOI via DataCite

Submission history

From: Karan Singhal
[v1] Tue, 13 May 2025 17:53:59 UTC (7,398 KB)

URL: HealthBench: Evaluating Large Language Models Towards Improved Human Health http://m.u1s5d6.cn/newsview1706458.html
