Performance of Large Language Models on Iran’s Medical Informatics Graduate Entrance Exams
DOI:
https://doi.org/10.22034/TJT.3.1.73
Keywords:
Educational Measurement, Large Language Models, Generative Artificial Intelligence, Medical Informatics, Distance Education
Abstract
Large language models (LLMs) with advanced natural language processing capabilities are increasingly used in medical education. This study evaluated and compared the accuracy of four LLMs (GPT-4o, o3-mini, Gemini, and Copilot) in answering questions from Iran’s master’s and doctoral entrance exams in medical informatics. Multiple-choice questions from the 2024 exams (116 for the master’s program and 96 for the doctoral program) were submitted to the free versions of the models using uniform prompts. Responses were compared with the official answer key to measure accuracy, uncertainty, and error rates. Statistical analyses included chi-square tests and logistic regression. At the master’s level, o3-mini performed best and Copilot was weakest, though the differences were not statistically significant (p = 0.2088). At the doctoral level, GPT-4o and o3-mini outperformed Gemini, which had 79.44% accuracy and a 21.55% error rate, with significant differences (p < 0.001). Model performance also varied across specialized subjects. These results indicate that LLM performance depends on model type, educational level, and the nature of the content, providing a foundation for more accurate assessments and targeted AI applications in specialized education.
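The scoring and comparison steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' analysis code. The model names, answer key, and responses are hypothetical toy data, the helper names `score_responses` and `chi_square_statistic` are introduced here for illustration, and treating a missing answer (`None`) as "uncertain" is an assumption about how uncertainty was recorded. The sketch computes only the Pearson chi-square statistic and its degrees of freedom; a p-value would additionally require a chi-square distribution function from a statistics library.

```python
from collections import Counter


def score_responses(responses, answer_key):
    """Count correct, incorrect, and uncertain (unanswered) responses."""
    counts = Counter()
    for given, expected in zip(responses, answer_key):
        if given is None:  # assumption: None marks a question the model did not answer
            counts["uncertain"] += 1
        elif given == expected:
            counts["correct"] += 1
        else:
            counts["incorrect"] += 1
    return counts


def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c count table.

    Assumes every row total and column total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical toy data: six questions and two unnamed models.
answer_key = ["a", "c", "b", "d", "a", "b"]
model_responses = {
    "model_x": ["a", "c", "b", "d", "a", "c"],
    "model_y": ["a", "b", None, "d", "c", "c"],
}

scores = {name: score_responses(r, answer_key) for name, r in model_responses.items()}
accuracy = {name: s["correct"] / len(answer_key) for name, s in scores.items()}

# Contingency table: one row per model, columns = (correct, not correct).
table = [[s["correct"], len(answer_key) - s["correct"]] for s in scores.values()]
statistic, dof = chi_square_statistic(table)

for name in scores:
    print(name, dict(scores[name]), f"accuracy={accuracy[name]:.2%}")
print(f"chi-square={statistic:.4f}, dof={dof}")
```

With only six questions the expected cell counts are far too small for the chi-square approximation to be reliable. The toy data shows the mechanics only; exams with 116 and 96 questions, as in the study, give much larger cell counts.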
