HealthBench: Evaluating Large Language Models Towards Improved Human Health
Authors: Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, Johannes Heidecke, Karan Singhal
Abstract: We present HealthBench, an open-source benchmark measuring the performance and safety of large language models in healthcare. HealthBench consists of 5,000 multi-turn conversations between a model and an individual user or healthcare professional. Responses are evaluated using conversation-specific rubrics created by 262 physicians. Unlike previous multiple-choice or short-answer benchmarks, HealthBench enables realistic, open-ended evaluation through 48,562 unique rubric criteria spanning several health contexts (e.g., emergencies, transforming clinical data, global health) and behavioral dimensions (e.g., accuracy, instruction following, communication). HealthBench performance over the last two years reflects steady initial progress (compare GPT-3.5 Turbo's 16% to GPT-4o's 32%) and more rapid recent improvements (o3 scores 60%). Smaller models have especially improved: GPT-4.1 nano outperforms GPT-4o and is 25 times cheaper. We additionally release two HealthBench variations: HealthBench Consensus, which includes 34 particularly important dimensions of model behavior validated via physician consensus, and HealthBench Hard, where the current top score is 32%. We hope that HealthBench grounds progress towards model development and applications that benefit human health.
Comments: Blog: this https URL; Code: this https URL
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2505.08775 [cs.CL] (or arXiv:2505.08775v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2505.08775 (arXiv-issued DOI via DataCite)
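The rubric-based evaluation described in the abstract can be illustrated with a small sketch. The abstract does not state the scoring formula, so everything below is an assumption made for illustration: the `Criterion` type, the signed point weights, the normalization by the maximum achievable positive points, and the clipping to the range 0 to 1 are hypothetical choices, not a description of the released HealthBench code.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric criterion for a single conversation (hypothetical shape)."""
    description: str  # what the grader checks for in the response
    points: int       # positive = desirable behavior, negative = undesirable
    met: bool         # grader's judgment: did the response meet this criterion?


def rubric_score(criteria: list[Criterion]) -> float:
    """Points earned divided by maximum achievable points, clipped to [0, 1].

    Assumed scoring rule: met criteria contribute their (signed) points,
    and the denominator counts only positively weighted criteria.
    """
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positively weighted criterion")
    earned = sum(c.points for c in criteria if c.met)
    return min(1.0, max(0.0, earned / max_points))


# Illustrative rubric with invented criteria and weights.
example = [
    Criterion("Advises seeking emergency care", 10, True),
    Criterion("Asks about symptom duration", 5, False),
    Criterion("Recommends a specific dose without needed context", -8, True),
]
score = rubric_score(example)  # (10 - 8) / 15
```

Under these assumptions, a benchmark-level figure such as the percentages quoted in the abstract would correspond to an average of per-conversation scores, though the abstract itself does not confirm the aggregation method.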
Submission history
From: Karan Singhal
[v1] Tue, 13 May 2025 17:53:59 UTC (7,398 KB)