SAT-536 - Run-to-Run and Cross-Model Variability of Contemporary Large Language Models in Type 2 Diabetes Pharmacotherapy: Implications for Clinical Decision Support
Computer Scientist, PATH Decision Support Software LLC, Celebration, Florida, United States
Background: Large language models (LLMs) are proposed for clinical decision support in endocrinology, but their stochastic generation raises a reliability question that has not been quantified in type 2 diabetes (T2D) pharmacotherapy: how much does one model's recommendation vary across runs, and how much do different contemporary models disagree on the same patient?

Methods: We evaluated 9 contemporary LLMs (Claude Haiku 4.5, Sonnet 4.6, Opus 4.6; Gemini 3 Flash, 3.1 Flash-Lite, 3.1 Pro; GPT-5-mini, GPT-5.2, o3) on 7 standardized synthetic adult T2D vignettes at temperatures T=0.0, 0.7, and 1.0, with up to 30 trials per cell (5,528 runs), where a cell is one model-vignette-temperature combination. Predicted absolute A1c reduction and first-line agent (mapped to a 12-class taxonomy) were extracted from each response using two independent parsers. Primary outcomes were within-cell coefficient of variation (CV) of A1c reduction, modal drug-class agreement, and cross-model dispersion at fixed temperature. We bootstrapped 95% CIs (2,000 resamples), decomposed A1c variance into between- and within-model components per vignette by temperature, and computed Fleiss' kappa for cross-model drug-class agreement.

Results: Claude Sonnet 4.6 returned dict-stringified outputs (parser agreement 40.6% vs 100% for the other 8 models) and was analyzed separately. Across 168 non-Sonnet cells, median within-cell CV for A1c reduction was 9.5% (95% CI 8.1-10.5); 45.8% of cells exceeded CV 10% and 7.1% exceeded CV 20%. CV rose with temperature (mean 8.7%, 11.8%, and 12.7% at T=0.0, 0.7, and 1.0, respectively). Determinism at T=0 was model-specific: mean CV at T=0 ranged from 2.4% (Gemini 3.1 Flash-Lite) to 18.4% (o3); 5 of 9 models exceeded 10%. Variance decomposition showed that between-model variance dominated within-model variance on 5 of 7 vignettes (ICC 0.73-0.89 at T=0): on ambiguous cases, the choice of LLM contributed more to the variance in the A1c estimate than run-to-run noise did. Cross-model Fleiss' kappa for first-line drug class was 0.36 across 21 vignette-by-temperature strata (fair agreement). For the most ambiguous vignette, A1c estimates spanned 0.6-2.0% with cross-model CV 39.9%.

Conclusions: Contemporary LLMs show clinically meaningful run-to-run and cross-model variability in T2D pharmacotherapy; on ambiguous cases, between-model disagreement exceeds run-to-run noise. Apparent determinism at T=0 is model-dependent. Broad deployment of LLMs as the primary therapeutic recommendation engine warrants caution. A hybrid design, in which deterministic, guideline-anchored logic generates recommendations while an LLM handles explanation and patient-facing translation under physician oversight, merits head-to-head evaluation.
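The summary statistics named in the Methods (within-cell CV, a bootstrapped 95% CI for the median CV, and Fleiss' kappa) can be illustrated with a minimal Python sketch. This is not the authors' analysis code: the function names, the percentile form of the bootstrap, the use of the sample standard deviation, and the treatment of each vignette-by-temperature stratum as a "subject" rated once by each model are all assumptions made for illustration, and any input values supplied to these functions would be hypothetical rather than study data.

```python
import random
import statistics
from collections import Counter


def coefficient_of_variation(values):
    """CV in percent for one cell: sample SD divided by the mean, times 100.

    Assumption: sample (n-1) SD; the abstract does not say which SD was used.
    """
    mean = statistics.fmean(values)
    return statistics.stdev(values) / mean * 100.0


def bootstrap_median_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the median of `values` (e.g. per-cell CVs).

    Assumption: a plain percentile bootstrap; the abstract only states
    "95% CIs (2,000 resamples)".
    """
    rng = random.Random(seed)
    n = len(values)
    medians = sorted(
        statistics.median(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = medians[int((alpha / 2) * n_resamples)]
    upper = medians[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def fleiss_kappa(ratings):
    """Fleiss' kappa for categorical ratings.

    `ratings` is a list of subjects; each subject is a list holding one
    category label per rater. Every subject needs the same number of raters.
    Hypothetical mapping: subject = vignette-by-temperature stratum,
    rater = model, label = that model's drug class for the stratum.
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("each subject needs the same number (>= 2) of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for subject in ratings:
        counts = Counter(subject)
        category_totals.update(counts)
        # Proportion of agreeing rater pairs for this subject.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )

    observed = agreement_sum / n_subjects
    total_ratings = n_subjects * n_raters
    expected = sum((t / total_ratings) ** 2 for t in category_totals.values())
    if expected == 1.0:
        return float("nan")  # only one category ever used: kappa is undefined
    return (observed - expected) / (1.0 - expected)
```

In this sketch, `coefficient_of_variation` would be applied to the A1c-reduction values of a single model-vignette-temperature cell, `bootstrap_median_ci` to the resulting list of per-cell CVs, and `fleiss_kappa` to one list of model labels per stratum.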
*Unless otherwise noted, all abstracts presented at ENDO must not be released to the press or the public until the date and time of presentation. For oral presentations, the abstracts are embargoed until the session begins. The Endocrine Society reserves the right to lift the embargo on specific abstracts that are selected for promotion prior to or during ENDO.*