Abstract
This systematic review synthesizes existing reporting guidelines for large language models (LLMs) in healthcare research and evaluates how well they address gaps in transparency, reproducibility, and clinical applicability. We systematically searched the PubMed database for studies on reporting guidelines for LLMs used in healthcare research and included 18 studies. These studies primarily aimed to develop or evaluate reporting frameworks that improve transparency, reproducibility, and methodological rigor in LLM applications, and several focused on creating structured reporting checklists. The Chatbot Assessment Reporting Tool (CHART) was developed across multiple studies. Similarly, TRIPOD-LLM extended the TRIPOD+AI framework with 19 main items and 50 subitems, emphasizing modular reporting for diverse LLM tasks. While existing reporting guidelines represent an important advance toward standardizing LLM research, their long-term impact will depend on broad adoption and iterative refinement to keep pace with the evolving challenges of artificial intelligence.
References
Bedi S, Liu Y, Orr-Ewing L, Dash D, Koyejo S, Callahan A, Fries JA, Wornow M, Swaminathan A, Lehmann LS, Hong HJ, Kashyap M, Chaurasia AR, Shah NR, Singh K, Tazbaz T, Milstein A, Pfeffer MA, Shah NH. Testing and Evaluation of Health Care Applications of Large Language Models: A Systematic Review. JAMA. 2025.
Zhang L, Zhao Q, Zhang D, Song M, Zhang Y, Wang X. Application of large language models in healthcare: A bibliometric analysis. Digital Health. 2025.
Gallifant J, Afshar M, Ameen S, Aphinyanaphongs Y, Chen S, Cacciamani G, Demner-Fushman D, Dligach D, Daneshjou R, Fernandes C, Hansen LH, Landman A, Lehmann L, McCoy LG, Miller T, Moreno A, Munch N, Restrepo D, Savova G, Umeton R, Gichoya JW, Collins GS, Moons KGM, Celi LA, Bitterman DS. The TRIPOD-LLM reporting guideline for studies using large language models. Nature Medicine. 2025.
Gallifant J, Afshar M, Ameen S, Aphinyanaphongs Y, Chen S, Cacciamani G, Demner-Fushman D, Dligach D, Daneshjou R, Fernandes C, Hansen LH, Landman A, Lehmann L, McCoy LG, Miller T, Moreno A, Munch N, Restrepo D, Savova G, Umeton R, Gichoya JW, Collins GS, Moons KGM, Celi LA, Bitterman DS. The TRIPOD-LLM Statement: A Targeted Guideline For Reporting Large Language Models Use. medRxiv: the preprint server for health sciences. 2024.
Fareed M, Fatima M, Uddin J, Ahmed A, Sattar MA. A systematic review of ethical considerations of large language models in healthcare and medicine. Frontiers in Digital Health. 2025. doi: 10.3389/fdgth.2025.1653631
Iqbal U, Tanweer A, Rahmanti AR, Greenfield D, Lee LT, Li YJ. Impact of large language model (ChatGPT) in healthcare: an umbrella review and evidence synthesis. Journal of Biomedical Science. 2025.
Guo Z, Lai A, Thygesen JH, Farrington J, Keen T, Li K. Large Language Models for Mental Health Applications: Systematic Review. JMIR Mental Health. 2024.
Hobensack M, von Gerich H, Vyas P, Withall J, Peltonen LM, Block LJ, Davies S, Chan R, Van Bulck L, Cho H, Paquin R, Mitchell J, Topaz M, Song J. A rapid review on current and potential uses of large language models in nursing. International Journal of Nursing Studies. 2024.

This work is licensed under a Creative Commons Attribution 4.0 International License.
Copyright (c) 2025 Saif Aldeen Alryalat, Iyad Sultan

