An Imbalance-Robust Evaluation Framework for Extreme Risk Forecasts
Abstract
Evaluating rare-event forecasts is challenging because standard metrics collapse as event prevalence declines. Measures such as F1-score, AUPRC, MCC, and accuracy induce degenerate thresholds -- converging to zero or one -- and their values become dominated by class imbalance rather than tail discrimination. We develop a family of rare-event-stable (RES) metrics whose optimal thresholds remain strictly interior as the event probability approaches zero, ensuring coherent decision rules under extreme rarity. Simulations spanning event probabilities from 0.01 down to one in a million show that RES metrics maintain stable thresholds, consistent model rankings, and near-complete prevalence invariance, whereas traditional metrics exhibit statistically significant threshold drift and structural collapse. A credit-default application confirms these results: RES metrics yield interpretable probability-of-default cutoffs (4-9%) and remain robust under subsampling, while classical metrics fail operationally. The RES framework provides a principled, prevalence-invariant basis for evaluating extreme-risk forecasts.
Summary
This paper addresses the challenge of evaluating forecasts for rare events, where standard performance metrics such as F1-score, AUPRC, MCC, and accuracy break down under class imbalance. The core problem is that these traditional metrics induce degenerate thresholds (approaching 0 or 1) as event prevalence decreases, making them unreliable for decision-making in extreme-risk scenarios.

The authors develop a family of "Rare-Event-Stable" (RES) metrics designed to maintain stable, interior optimal thresholds even as the event probability approaches zero. Their approach begins with a theoretical analysis of the shortcomings of traditional metrics, followed by the formulation of RES metrics around a prevalence-invariant policy parameter α, which reflects the institution's tolerance for false positives relative to false negatives. Simulations across event probabilities from 0.01 down to one in a million demonstrate the stability of RES metrics, in contrast to the threshold drift and structural collapse exhibited by traditional metrics. Finally, an application to credit-default forecasting shows that RES metrics produce interpretable probability-of-default cutoffs and remain robust under subsampling, while classical metrics fail operationally.

The key finding is that RES metrics offer a principled, prevalence-invariant basis for evaluating extreme-risk forecasts, ensuring coherent decision rules even when events are extraordinarily rare. This matters because it gives practitioners a reliable tool for evaluating and comparing models in high-stakes, rare-event domains where traditional metrics can lead to misleading conclusions and poor operational decisions. The separation of institutional preferences (encoded in α) from data-driven thresholds is a significant contribution.
Key Insights
- Traditional metrics like F1-score, MCC, and Balanced Accuracy exhibit threshold drift and structural collapse as event prevalence declines because their implicit marginal trade-offs depend on the event prevalence π. As π approaches 0, these metrics overweight the penalty for false positives, forcing optimal thresholds toward extreme values.
- RES metrics maintain stable, interior optimal thresholds by embedding a trade-off between the true positive rate (TPR) and the false positive rate (FPR) that does not depend on prevalence: the marginal rate comparison must converge to a finite, non-zero limit as π → 0.
- The RES metric family is characterized by M_RE(δ) = TPR(δ) / (α · FPR(δ) + (1 − α)), where α is a policy parameter encoding the institution's tolerance for false positives relative to false negatives.
- Simulations show that RES metrics maintain stable thresholds and consistent model rankings across event probabilities from 0.01 down to one in a million, whereas traditional metrics exhibit statistically significant threshold drift (p < 0.001).
- In a credit-default application, RES metrics yield interpretable probability-of-default cutoffs (4-9%) and remain robust under subsampling, while classical metrics fail operationally.
- The Monotone Likelihood Ratio (MLR) property is assumed to guarantee uniqueness of the optimal threshold δ*, but the rare-event stability of RES metrics does not depend on uniqueness: even if the likelihood ratio Λ(δ) oscillates, the stability condition (Equation 3) ensures that all local maxima of the RES metric remain strictly interior as π → 0.
- The paper introduces three structural principles that RES metrics must obey: balanced asymptotics, tail sensitivity, and prevalence invariance.
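The M_RE formula above can be computed directly from empirical TPR and FPR. The following sketch scans candidate thresholds on synthetic scores and locates the RES-optimal cutoff; the Gaussian score model, sample size, and the choice α = 0.9 are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def res_metric(y_true, scores, threshold, alpha):
    """Rare-event-stable metric M_RE = TPR / (alpha * FPR + (1 - alpha)).

    alpha in (0, 1] encodes tolerance for false positives relative to
    false negatives, following the paper's characterization of M_RE.
    """
    pred = scores >= threshold
    pos = y_true == 1
    tpr = pred[pos].mean() if pos.any() else 0.0
    fpr = pred[~pos].mean() if (~pos).any() else 0.0
    return tpr / (alpha * fpr + (1.0 - alpha))

rng = np.random.default_rng(0)
n, pi = 200_000, 1e-3                          # rare events: 0.1% prevalence
y = (rng.random(n) < pi).astype(int)
# Toy score model (an assumption): events are shifted upward by 2 std devs
scores = rng.normal(0.0, 1.0, n) + 2.0 * y

alpha = 0.9
thresholds = np.linspace(0.0, 4.0, 81)
vals = [res_metric(y, scores, t, alpha) for t in thresholds]
t_star = thresholds[int(np.argmax(vals))]
print(f"RES-optimal threshold: {t_star:.2f}")  # strictly interior, not 0 or 4
```

Note that the optimum stays away from the endpoints of the threshold grid even at low prevalence, which is the interior-threshold behavior the insight describes.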
Practical Implications
- The RES framework provides a more reliable and interpretable basis for evaluating forecasts in rare-event domains such as credit risk, fraud detection, anomaly detection, and safety engineering.
- Practitioners can use RES metrics to select models and set decision thresholds that are robust to changes in event prevalence, leading to more consistent and defensible operational decisions.
- The policy parameter α lets institutions encode their tolerance for false positives relative to false negatives, so that evaluation metrics align with their specific risk preferences and operational constraints.
- Future research could explore non-linear extensions of the RES family to accommodate varying risk aversion in the extreme tail, and calibration procedures could be developed to map historical operating policies, capacity constraints, and explicit loss structures into a transparent, interpretable choice of α.
- The RES framework opens opportunities to re-evaluate existing models and alarm systems in rare-event domains using a more principled, prevalence-invariant approach.
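The prevalence-invariance claim can be checked with a small numerical experiment. The sketch below uses closed-form TPR/FPR under an assumed Gaussian score model (events ~ N(2, 1), non-events ~ N(0, 1), α = 0.9, all illustrative choices) and compares the F1-optimal threshold with the RES-optimal threshold as prevalence π shrinks:

```python
import numpy as np
from math import erf, sqrt

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def rates(t, mu=2.0):
    """Class-conditional score model (an assumption for illustration):
    event scores ~ N(mu, 1), non-event scores ~ N(0, 1)."""
    return 1.0 - Phi(t - mu), 1.0 - Phi(t)    # (TPR, FPR)

alpha = 0.9                                    # illustrative policy parameter
thresholds = np.linspace(-2.0, 6.0, 801)
optima = {}
for pi in (1e-2, 1e-4, 1e-6):
    f1_vals, res_vals = [], []
    for t in thresholds:
        tpr, fpr = rates(t)
        flagged = pi * tpr + (1.0 - pi) * fpr  # P(alarm raised)
        prec = pi * tpr / flagged if flagged > 0 else 0.0
        f1_vals.append(2 * prec * tpr / (prec + tpr) if prec + tpr > 0 else 0.0)
        res_vals.append(tpr / (alpha * fpr + (1.0 - alpha)))
    optima[pi] = (thresholds[np.argmax(f1_vals)], thresholds[np.argmax(res_vals)])
    print(f"pi={pi:.0e}: F1 threshold={optima[pi][0]:.2f}, "
          f"RES threshold={optima[pi][1]:.2f}")
```

Under this model the F1-optimal threshold climbs toward the tail as π falls, while the RES-optimal threshold is identical at every prevalence, since M_RE is a function of TPR and FPR alone.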