Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/43057
Full metadata record
DC Field: Value
dc.contributor.author: D'Anna, Gennaro
dc.contributor.author: VAN CAUTER, Sofie
dc.contributor.author: Thurnher, Majda
dc.contributor.author: Van Goethem, Johan
dc.contributor.author: Haller, Sven
dc.date.accessioned: 2024-06-05T06:44:39Z
dc.date.available: 2024-06-05T06:44:39Z
dc.date.issued: 2024
dc.date.submitted: 2024-06-05T06:40:13Z
dc.identifier.citation: NEURORADIOLOGY,
dc.identifier.uri: http://hdl.handle.net/1942/43057
dc.description.abstract: We compared different LLMs, notably chatGPT 3.5, GPT4, and Google Bard, and tested whether their performance differs across subspeciality domains by having them take examinations from four courses of the European Society of Neuroradiology (ESNR): anatomy/embryology, neuro-oncology, head and neck, and pediatrics. Written ESNR exams were used as input data: anatomy/embryology (30 questions), neuro-oncology (50 questions), head and neck (50 questions), and pediatrics (50 questions). All exams together, and each exam separately, were presented to the three LLMs: chatGPT 3.5, GPT4, and Google Bard. Statistical analyses included a group-wise Friedman test followed by pair-wise Wilcoxon tests with multiple comparison correction. Overall, there was a significant difference between the three LLMs (p < 0.0001), with GPT4 having the highest accuracy (70%), followed by chatGPT 3.5 (54%) and Google Bard (36%). Pair-wise comparisons showed significant differences between chatGPT 3.5 and GPT4 (p < 0.0001), chatGPT 3.5 and Bard (p < 0.0023), and GPT4 and Bard (p < 0.0001). Per-subspeciality analyses showed the largest gap between the best LLM (GPT4, 70%) and the worst LLM (Google Bard, 24%) in the head and neck exam, while the difference was least pronounced in neuro-oncology (GPT4, 62% vs Google Bard, 48%). We observed significant differences in the performance of the three LLMs when taking official exams organized by the ESNR. Overall, GPT4 performed best and Google Bard worst; the gap varied by subspeciality and was most pronounced in head and neck. (An illustrative sketch of the statistical pipeline, under stated assumptions, follows the metadata record below.)
dc.description.sponsorship: We would like to thank Sara Fullone of the European Society of Neuroradiology Central Office for the support provided.
dc.language.iso: en
dc.publisher: SPRINGER
dc.rights: The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2024
dc.subject.other: AI
dc.subject.other: LLM
dc.subject.other: chatGPT
dc.subject.other: Neuroradiology
dc.subject.other: GPT4
dc.title: Can large language models pass official high-grade exams of the European Society of Neuroradiology courses? A direct comparison between OpenAI chatGPT 3.5, OpenAI GPT4 and Google Bard
dc.type: Journal Contribution
local.format.pages: 6
local.bibliographicCitation.jcat: A1
dc.description.notes: D'Anna, G (corresponding author), ASST Ovest Milanese, Neuroimaging Unit, Legnano, Milan, Italy.
dc.description.notes: gennaro.danna@gmail.com
local.publisher.place: ONE NEW YORK PLAZA, SUITE 4600, NEW YORK, NY, UNITED STATES
local.type.refereed: Refereed
local.type.specified: Article
local.bibliographicCitation.status: Early view
dc.identifier.doi: 10.1007/s00234-024-03371-6
dc.identifier.pmid: 38705899
dc.identifier.isi: 001216135700001
dc.contributor.orcid: D'Anna, Gennaro/0000-0001-9890-9359; Haller, Sven/0000-0001-7433-0203
local.provider.type: wosris
local.description.affiliation: [D'Anna, Gennaro] ASST Ovest Milanese, Neuroimaging Unit, Legnano, Milan, Italy.
local.description.affiliation: [Van Cauter, Sofie] Ziekenhuis Oost Limburg, Dept Med Imaging, Genk, Belgium.
local.description.affiliation: [Van Cauter, Sofie] Hasselt Univ, Dept Med & Life Sci, Hasselt, Belgium.
local.description.affiliation: [Thurnher, Majda] Med Univ Vienna, Dept Biomed Imaging & Image Guided Therapy, Vienna, Austria.
local.description.affiliation: [Van Goethem, Johan] VITAZ, Dept Med & Mol Imaging, St Niklaas, Belgium.
local.description.affiliation: [Van Goethem, Johan] Univ Hosp Antwerp, Dept Radiol, Antwerp, Belgium.
local.description.affiliation: [Haller, Sven] CIMC Ctr Imagerie Med Cornavin, Geneva, Switzerland.
local.description.affiliation: [Haller, Sven] Uppsala Univ, Dept Surg Sci, Radiol, Uppsala, Sweden.
local.description.affiliation: [Haller, Sven] Univ Geneva, Fac Med, Geneva, Switzerland.
local.description.affiliation: [Haller, Sven] Capital Med Univ, Beijing Tiantan Hosp, Dept Radiol, Beijing, Peoples R China.
local.uhasselt.international: yes
item.fulltext: With Fulltext
item.contributor: D'Anna, Gennaro
item.contributor: VAN CAUTER, Sofie
item.contributor: Thurnher, Majda
item.contributor: Van Goethem, Johan
item.contributor: Haller, Sven
item.accessRights: Restricted Access
item.fullcitation: D'Anna, Gennaro; VAN CAUTER, Sofie; Thurnher, Majda; Van Goethem, Johan & Haller, Sven (2024) Can large language models pass official high-grade exams of the European Society of Neuroradiology courses? A direct comparison between OpenAI chatGPT 3.5, OpenAI GPT4 and Google Bard. In: NEURORADIOLOGY,.
crisitem.journal.issn: 0028-3940
crisitem.journal.eissn: 1432-1920
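The abstract above describes a group-wise Friedman test followed by pair-wise Wilcoxon tests with multiple comparison correction. The Python sketch below shows how such an analysis could be run on per-question correctness data. It is illustrative only: the 0/1 score vectors are randomly generated placeholders, not the study's data, and the Bonferroni correction is an assumed choice, since the abstract does not name the correction method used.

```python
# Minimal sketch of the reported statistical pipeline (assumed, not the
# authors' actual code): per-question correctness (1/0) for each of the
# three LLMs, compared with a group-wise Friedman test and pair-wise
# Wilcoxon signed-rank tests with Bonferroni correction.
from itertools import combinations

import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(seed=0)
n_questions = 180  # 30 anatomy/embryology + 50 neuro-oncology + 50 head & neck + 50 pediatrics

# Placeholder binary correctness vectors (1 = question answered correctly);
# real data would come from grading each model's exam answers.
scores = {
    "GPT4": rng.binomial(1, 0.70, n_questions),
    "chatGPT 3.5": rng.binomial(1, 0.54, n_questions),
    "Google Bard": rng.binomial(1, 0.36, n_questions),
}

# Group-wise Friedman test across the three paired samples.
stat, p = friedmanchisquare(*scores.values())
print(f"Friedman test: statistic={stat:.2f}, p={p:.4g}")

# Pair-wise Wilcoxon signed-rank tests, Bonferroni-corrected for 3 comparisons.
pairs = list(combinations(scores, 2))
for name_a, name_b in pairs:
    _, p_raw = wilcoxon(scores[name_a], scores[name_b])
    p_corrected = min(p_raw * len(pairs), 1.0)
    print(f"{name_a} vs {name_b}: corrected p={p_corrected:.4g}")
```

On real data the structure would be the same: one paired correctness vector per model, a single group-wise Friedman test, then corrected pair-wise follow-up tests once the Friedman test indicates an overall difference.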
Appears in Collections:Research publications
Files in This Item:
File: Can large language models pass official high-grade exams of the European Society .pdf (Restricted Access)
Description: Early view
Size: 950.95 kB
Format: Adobe PDF