Performance of Large Language Models on Iran’s Medical Informatics Graduate Entrance Exams

Zeinab Ghaffari; Masoumeh Khedri; Ali Mohammad Hadianfard

doi:10.22034/TJT.3.1.73

Authors

Zeinab Ghaffari, Urmia University of Medical Sciences
Masoumeh Khedri, Smart University of Medical Sciences, Tehran, Iran; Ahvaz Jundishapur University of Medical Sciences
Ali Mohammad Hadianfard, Ahvaz Jundishapur University of Medical Sciences

Keywords:

Educational Measurement, Large Language Models, Generative Artificial Intelligence, Medical Informatics, Distance Education

Abstract

Large language models (LLMs) with advanced natural language processing capabilities are increasingly used in medical education. This study evaluated and compared the accuracy of four LLMs (GPT-4o, o3-mini, Gemini, and Copilot) in answering questions from Iran’s master’s and doctoral entrance exams in medical informatics. Multiple-choice questions from the 2024 exams, 116 for master’s and 96 for doctoral programs, were submitted to the free versions of the models using uniform prompts. Responses were compared with the official answer key to measure accuracy, uncertainty, and error rates. Statistical analyses included chi-square tests and logistic regression. At the master’s level, o3-mini performed best and Copilot was weakest, though the differences were not significant (p = 0.2088). At the doctoral level, GPT-4o and o3-mini outperformed Gemini, which had 79.44% accuracy and a 21.55% error rate, with significant differences (p < 0.001). Model performance varied across specialized subjects. These results indicate that LLM performance depends on model type, educational level, and the nature of the content, providing a foundation for more accurate assessments and targeted AI applications in specialized education.
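The abstract names chi-square tests as one of the analyses used to compare models. As a rough illustration only, the sketch below runs a chi-square test of independence on a models-by-outcome table of correct and incorrect answers. The counts are hypothetical placeholders (each row sums to 116, the number of master's-level questions) and are not the study's data; the function names `chi2_sf` and `chi_square_independence` and the labels `model_A` to `model_D` are assumptions introduced here, and the authors' actual procedure may have differed.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: finite Poisson-type sum.
        terms = sum(half ** i / math.factorial(i) for i in range(df // 2))
        return math.exp(-half) * terms
    # Odd df: complementary error function plus a finite correction sum.
    terms = sum(
        half ** (i - 0.5) / math.gamma(i + 0.5)
        for i in range(1, (df - 1) // 2 + 1)
    )
    return math.erfc(math.sqrt(half)) + math.exp(-half) * terms


def chi_square_independence(table):
    """Pearson chi-square test on a table given as a list of rows of counts.

    Returns (statistic, degrees of freedom, p-value).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df, chi2_sf(stat, df)


# HYPOTHETICAL counts of (correct, incorrect) answers per model; not from the study.
hypothetical_counts = {
    "model_A": (95, 21),
    "model_B": (98, 18),
    "model_C": (92, 24),
    "model_D": (88, 28),
}

stat, df, p_value = chi_square_independence(
    [list(counts) for counts in hypothetical_counts.values()]
)
print(f"chi-square = {stat:.3f}, df = {df}, p = {p_value:.4f}")
```

A p-value above the chosen significance level (commonly 0.05) would be read as no detectable difference in accuracy between the models, which is how the abstract reports the master's-level comparison.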
