Large Language Models for Thematic Analysis in Healthcare Research: A Blinded Mixed-Methods Comparison with Human Analysts
Abstract
Large language models (LLMs) are increasingly used for qualitative thematic analysis, yet evidence on their performance with focus-group data, where polyvocality and context complicate coding, remains limited. Given the growing role of such models in thematic analysis, methodological frameworks are needed that enable systematic, metric-based comparisons between human and model-based analyses. We conducted a blinded mixed-methods comparison of two general-purpose LLMs (ChatGPT-5 and Claude 4 Sonnet), an LLM-based qualitative coding application (QualiGPT), and blinded human analysts on an in-person focus-group transcript informing an AI-enabled digital health proposal. We evaluated deductive coding using a 10-code, 6-theme codebook against an expert consensus adjudication; inductive coding with a structured Likert-scale comparison to a reference-standard set of inductive themes generated by expert consensus; and manual quote verification of LLM segments to define LLM hallucination (evidence absent or non-supportive) and error rate (including partial matches and speaker-coded segments). In deductive coding against the expert consensus adjudication, LLMs yielded a mean agreement of 93.5% (95% CI 92.5-94.5) with κ = 0.34 (95% CI 0.26-0.40); blinded human coders achieved 92.7% (95% CI 91.6-93.9) agreement with κ = 0.34 (95% CI 0.26-0.41). Mean Gwet's AC1 was 0.92 (95% CI 0.90-0.93) for the blinded human analysis and 0.93 (95% CI 0.92-0.94) for the LLM-assisted deductive analysis, reflecting high agreement despite the low overall code prevalence (7.8%, SD = 3.2%). Only one model achieved non-inferiority in inductive analysis of the transcript (p = 0.043). The strict hallucination rate in inductive analysis was 1.2% (SD = 2.1%). LLMs were non-inferior to human analysts for deductive coding of the focus-group data, with variable performance in inductive analysis. Low hallucination rates but substantial comprehensive error rates indicate that LLMs can augment qualitative analysis but require human verification.
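The divergence between κ and AC1 in these results reflects the well-known kappa paradox: when a code is rare, Cohen's kappa assumes a high rate of chance agreement and is deflated even when raw agreement exceeds 90%, whereas Gwet's AC1 is robust to low prevalence. A minimal Python sketch, using two hypothetical raters and invented binary coding decisions (not the study's data or analysis code), illustrates the effect:

# Minimal sketch (illustration only, not the authors' analysis code):
# why Cohen's kappa is depressed while Gwet's AC1 stays high when the
# coded category is rare, as with the ~8% code prevalence reported above.

def percent_agreement(a, b):
    """Proportion of items on which the two raters give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa for two binary raters, chance-corrected via marginals."""
    pa = percent_agreement(a, b)
    p1, p2 = sum(a) / len(a), sum(b) / len(b)     # each rater's "present" rate
    pe = p1 * p2 + (1 - p1) * (1 - p2)            # expected chance agreement
    return (pa - pe) / (1 - pe)

def gwets_ac1(a, b):
    """Gwet's AC1 for two binary raters; chance term uses mean prevalence."""
    pa = percent_agreement(a, b)
    pi = (sum(a) / len(a) + sum(b) / len(b)) / 2  # mean "present" rate
    pe = 2 * pi * (1 - pi)                        # AC1 chance-agreement term
    return (pa - pe) / (1 - pe)

# Hypothetical data: 100 binary segment-code decisions per rater, 8%
# prevalence for each rater, 10 disagreements (numbers are invented).
rater1 = [1] * 8 + [0] * 92
rater2 = [1] * 3 + [0] * 5 + [1] * 5 + [0] * 87

print(f"agreement = {percent_agreement(rater1, rater2):.3f}")  # 0.900
print(f"kappa     = {cohens_kappa(rater1, rater2):.3f}")       # ~0.321
print(f"AC1       = {gwets_ac1(rater1, rater2):.3f}")          # ~0.883

Even with 90% raw agreement, kappa lands near 0.32 while AC1 stays near 0.88, mirroring the pattern in the reported results.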
Cite This Paper
Hill, C., Dahil, A., Simpson, G., Hardisty, D., Keast, J., Pinn, C. K., & Dambha-Miller, H. (2025). Large Language Models for Thematic Analysis in Healthcare Research: A Blinded Mixed-Methods Comparison with Human Analysts. arXiv preprint arXiv:10.64898/2025.12.25.25343031.
Hill, C., Dahil, A., Simpson, G., Hardisty, D., Keast, J., Pinn, C. K., and Dambha-Miller, H. "Large Language Models for Thematic Analysis in Healthcare Research: A Blinded Mixed-Methods Comparison with Human Analysts." arXiv preprint arXiv:10.64898/2025.12.25.25343031 (2025).