Cross-Continental Transfer of Perioperative Mortality Prediction: Intraoperative Features Generalize Where Preoperative Features Fail
Abstract
Background: Machine learning models for perioperative mortality prediction show strong internal discrimination, yet external validation, particularly across continents, remains rare. Whether intraoperative vital sign features, which improve internal performance by 5-10%, transfer across populations is unknown. Furthermore, aggregate discrimination metrics may overstate clinical utility through Simpson's paradox if models separate risk strata without discriminating within them. Methods: We conducted a bidirectional cross-continental external validation study using publicly available datasets from Korea (INSPIRE, n=127,413 surgeries, 1,387 deaths) and the United States (MOVER, n=57,545 surgeries, 823 deaths). Eight machine learning models (XGBoost and logistic regression with preoperative only or preoperative plus intraoperative features) were trained on each dataset and validated on the other. We assessed discrimination (AUC-ROC), calibration, clinical utility (decision curve analysis), and within-stratum performance to detect Simpson's paradox. Results: All models achieved clinically useful external discrimination (AUC >0.70). The best performing model (XGB-INS-B) achieved external AUC of 0.895, representing a 4.1% improvement over internal performance. Intraoperative models showed higher mean external AUC than preoperative models (0.82 versus 0.79). Models trained on the diverse Korean population showed 2.6-fold less degradation than those trained on the concentrated US high-acuity population (5.4% versus 13.9%). Critically, preoperative models exhibited Simpson's paradox: one model achieved acceptable overall AUC (0.756) while performing at near-random levels within both ASA strata (0.597 and 0.584). Intraoperative models maintained within-stratum discrimination (0.71-0.74). All models required Platt scaling recalibration. Conclusions: Intraoperative vital sign features provide population-independent prognostic information that transfers across continents, while preoperative models achieve apparent discrimination through risk stratum separation. For global deployment of perioperative prediction models, real-time physiological monitoring should be prioritized, and stratified validation is essential to detect Simpson's paradox.
Links & Resources
Authors
Cite This Paper
S., P. D. (2025). Cross-Continental Transfer of Perioperative Mortality Prediction: Intraoperative Features Generalize Where Preoperative Features Fail. arXiv preprint arXiv:10.64898/2025.12.28.25343118.
Purkayastha, D. S.. "Cross-Continental Transfer of Perioperative Mortality Prediction: Intraoperative Features Generalize Where Preoperative Features Fail." arXiv preprint arXiv:10.64898/2025.12.28.25343118 (2025).