Background
This study evaluated the diagnostic accuracy and consistency of ChatGPT-4o for salivary gland disorders compared with those of experienced clinicians.
Methods
Eighty anonymized salivary gland cases from peer-reviewed reports were evaluated by ChatGPT-4o using standardized multimodal prompts and by three oral medicine specialists who provided Top-5 differentials. The primary outcome was diagnostic accuracy at the most likely diagnosis (Top-1), within the top three (Top-3), and within the top five (Top-5) differential diagnoses, with agreement measured by Cohen’s kappa and subgroup analyses by gland type, imaging, and case difficulty.
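As an illustration of the outcome measures named above, the following minimal sketch shows how Top-k accuracy and Cohen's kappa could be computed. The function names (`top_k_accuracy`, `cohens_kappa`) and the four toy cases are hypothetical and are not taken from the study's data or analysis code; kappa is computed here on binary correct/incorrect labels, which is an assumption about how agreement was scored.

```python
from collections import Counter


def top_k_accuracy(ranked_lists, truths, k):
    """Fraction of cases whose reference diagnosis is among the first k differentials."""
    hits = sum(truth in ranked[:k] for ranked, truth in zip(ranked_lists, truths))
    return hits / len(truths)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same cases."""
    n = len(rater_a)
    # Observed agreement: share of cases where both raters give the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (observed - expected) / (1 - expected)


# Hypothetical toy data (illustration only, not study data).
differentials = [
    ["pleomorphic adenoma", "Warthin tumor", "mucocele"],
    ["sialolithiasis", "sialadenitis", "mucocele"],
    ["mucocele", "ranula", "lipoma"],
    ["Warthin tumor", "pleomorphic adenoma", "lymphoma"],
]
reference = ["pleomorphic adenoma", "sialadenitis", "Sjogren syndrome", "Warthin tumor"]

top1 = top_k_accuracy(differentials, reference, k=1)
top3 = top_k_accuracy(differentials, reference, k=3)

# 1 = correct at Top-1, 0 = incorrect, for a model and one expert (hypothetical).
model_correct = [1, 1, 0, 1]
expert_correct = [1, 0, 0, 1]
kappa = cohens_kappa(model_correct, expert_correct)

print(top1, top3, kappa)
```

With real data, each ranked list would hold up to five differentials per case, and the same function would be called with k = 1, 3, and 5.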
Results
ChatGPT showed perfect sensitivity (100%) at Top-3 and Top-5, with 86.67% at Top-1. Experts surpassed ChatGPT at Top-5 (77.5% vs. 67.5%, p |