Threshold¶
Stage 5: Threshold calibration — permutation test, bootstrap, multiple testing correction.
Provides three functions:

- permutation_test: per-feature p-values via permutation test.
- bootstrap_threshold: per-feature absolute thresholds via bootstrap.
- correct_pvalues: multiple testing correction (Bonferroni / BH).
Key invariant: the SHAP transformation σ_j is computed ONCE from D_ref (Stage 3) and reused for every permutation / bootstrap draw. SHAP is never recomputed inside the permutation loop.
permutation_test¶
permutation_test(X_ref: DataFrame, X_mon: DataFrame, bucket_sets: dict[str, BucketSet], order: int = 1, n_permutations: int = 1000, max_samples: int | None = None, rng: Generator | None = None) -> dict[str, float]
Compute per-feature p-values via permutation test.
Under H₀ (no drift), reference and monitoring samples are exchangeable. We pool them, draw random splits, and compute SWIFT scores under the null.
The pre-computed SHAP transformation σ_j (stored in bucket_sets) is applied identically to every permutation — it is NOT recomputed.
The p-value uses the conservative formula: p_j = (1 + #{b : SWIFT_j^(b) ≥ SWIFT_j^obs}) / (1 + B)
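The permutation scheme and the conservative p-value formula above can be sketched for a single feature as follows. This is a minimal illustration, not the library implementation: the `statistic` callable stands in for the per-feature SWIFT score (which in the real code applies the pre-computed SHAP weights σ_j), and `permutation_pvalue` is a hypothetical name.

```python
import numpy as np

def permutation_pvalue(x_ref, x_mon, statistic, n_permutations=1000, rng=None):
    """Conservative permutation p-value for one feature.

    `statistic` stands in for the SWIFT score; any function of
    (ref, mon) arrays works here. The real implementation applies the
    frozen SHAP weights sigma_j rather than recomputing them per draw.
    """
    rng = np.random.default_rng(rng)
    observed = statistic(x_ref, x_mon)
    pooled = np.concatenate([x_ref, x_mon])
    n_ref = len(x_ref)
    count = 0
    for _ in range(n_permutations):
        perm = rng.permutation(pooled)          # exchangeable under H0
        null_stat = statistic(perm[:n_ref], perm[n_ref:])
        if null_stat >= observed:
            count += 1
    # conservative formula: p = (1 + #{b : null >= obs}) / (1 + B)
    return (1 + count) / (1 + n_permutations)

# Example with an obvious mean shift and a mean-difference statistic
data_rng = np.random.default_rng(0)
x_ref = data_rng.normal(0.0, 1.0, 500)
x_mon = data_rng.normal(2.0, 1.0, 500)
stat = lambda a, b: abs(a.mean() - b.mean())
p = permutation_pvalue(x_ref, x_mon, stat, n_permutations=200, rng=1)
```

Note that the formula's `1 +` terms guarantee p > 0, so p-values always lie in (0, 1] as documented below.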
When max_samples is set and the pooled data exceeds it, both reference and monitoring data are randomly subsampled (preserving the ref/mon ratio) before running the permutation loop. This provides a significant speedup on large datasets with negligible impact on statistical power.
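The ratio-preserving split can be sketched as below. `subsample_sizes` is a hypothetical helper for illustration; the library's internal naming and exact rounding may differ.

```python
def subsample_sizes(n_ref: int, n_mon: int, max_samples: int) -> tuple[int, int]:
    """Split max_samples between ref and mon, preserving their ratio.

    Returns the target sample sizes; the actual rows would then be
    drawn without replacement (e.g. via rng.choice) from each set.
    """
    total = n_ref + n_mon
    if total <= max_samples:
        return n_ref, n_mon          # pool fits: no subsampling
    frac = max_samples / total
    new_ref = max(1, round(n_ref * frac))
    new_mon = max(1, max_samples - new_ref)
    return new_ref, new_mon
```

For example, with n_ref=90,000, n_mon=10,000 and max_samples=20,000, this yields (18,000, 2,000), keeping the 9:1 ref/mon ratio.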
Args:

- X_ref: Reference DataFrame (n_ref × p).
- X_mon: Monitoring DataFrame (n_mon × p).
- bucket_sets: Dict of feature → BucketSet with mean_shap populated.
- order: Wasserstein order (1 or 2).
- n_permutations: Number of permutations B.
- max_samples: Maximum total pool size. If the pool (n_ref + n_mon) exceeds this, subsample proportionally. None = no limit.
- rng: Random number generator for reproducibility.
Returns: Dict of feature_name → p-value ∈ (0, 1].
Source code in src/swift/threshold.py
bootstrap_threshold¶
bootstrap_threshold(X_ref: DataFrame, bucket_sets: dict[str, BucketSet], n_mon: int, order: int = 1, alpha: float = 0.05, n_bootstrap: int = 1000, rng: Generator | None = None) -> dict[str, float]
Compute per-feature absolute thresholds via bootstrap.
Draws bootstrap samples of size n_mon from X_ref, computes SWIFT scores against X_ref, and returns the (1 − α) quantile as the threshold for each feature.
This provides a principled alternative to PSI's ad-hoc 0.10 / 0.25.
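The bootstrap procedure just described can be sketched for one feature. This is a simplified stand-in, not the library code: `statistic` replaces the feature's SWIFT score, and `bootstrap_threshold_1d` is a hypothetical name operating on a single array rather than a DataFrame with bucket sets.

```python
import numpy as np

def bootstrap_threshold_1d(x_ref, statistic, n_mon, alpha=0.05,
                           n_bootstrap=1000, rng=None):
    """Bootstrap an absolute drift threshold for a single feature.

    Under the no-drift null, the score is computed between x_ref and a
    size-n_mon resample of x_ref; the (1 - alpha) quantile of these
    null scores becomes the threshold.
    """
    rng = np.random.default_rng(rng)
    scores = np.empty(n_bootstrap)
    for b in range(n_bootstrap):
        boot = rng.choice(x_ref, size=n_mon, replace=True)
        scores[b] = statistic(x_ref, boot)
    return float(np.quantile(scores, 1 - alpha))

data_rng = np.random.default_rng(0)
x_ref = data_rng.normal(0.0, 1.0, 2000)
stat = lambda a, b: abs(a.mean() - b.mean())
thr = bootstrap_threshold_1d(x_ref, stat, n_mon=500, n_bootstrap=200, rng=1)
```

A monitoring sample whose score exceeds `thr` is flagged as drifted; undrifted samples fall below it with probability roughly 1 − α.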
Args:

- X_ref: Reference DataFrame (n_ref × p).
- bucket_sets: Dict of feature → BucketSet with mean_shap populated.
- n_mon: Size of monitoring sample (bootstrap sample size).
- order: Wasserstein order (1 or 2).
- alpha: Significance level (threshold is the (1 − α) quantile).
- n_bootstrap: Number of bootstrap iterations B.
- rng: Random number generator for reproducibility.
Returns: Dict of feature_name → threshold (non-negative float).
Source code in src/swift/threshold.py
correct_pvalues¶
correct_pvalues(pvalues: dict[str, float], method: CorrectionMethod, alpha: float = 0.05) -> dict[str, bool]
Apply multiple testing correction and return rejection decisions.
Bonferroni: reject if p_j < α / p (controls FWER). Benjamini-Hochberg: sort the p-values, find the largest k s.t. p_(k) < k·α / p, then reject all with rank ≤ k (controls FDR).
Uses strict inequality (p < threshold) for both methods.
Args:

- pvalues: Dict of feature_name → p-value.
- method: CorrectionMethod.BONFERRONI or CorrectionMethod.BH.
- alpha: Significance level.
Returns: Dict of feature_name → bool (True = reject H₀ / flagged as drifted).
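Both correction rules can be sketched in a few lines. This standalone version uses a plain string for the method where the real function takes a CorrectionMethod enum, and it applies the strict inequality documented above for both rules.

```python
import numpy as np

def correct_pvalues_sketch(pvalues: dict[str, float],
                           method: str = "bh",
                           alpha: float = 0.05) -> dict[str, bool]:
    """Bonferroni / Benjamini-Hochberg rejection decisions (sketch)."""
    names = list(pvalues)
    p = np.array([pvalues[k] for k in names])
    m = len(p)
    if method == "bonferroni":
        reject = p < alpha / m                       # p_j < alpha / m
    else:
        order = np.argsort(p)
        ranked = p[order]
        # largest k with p_(k) < k * alpha / m (strict inequality)
        below = ranked < np.arange(1, m + 1) * alpha / m
        reject = np.zeros(m, dtype=bool)
        if below.any():
            k = int(np.max(np.nonzero(below)[0]))    # 0-based rank
            reject[order[: k + 1]] = True            # reject ranks <= k
    return dict(zip(names, reject.tolist()))

pvals = {"age": 0.001, "income": 0.02, "zip": 0.4, "score": 0.8}
bonf = correct_pvalues_sketch(pvals, "bonferroni")
bh = correct_pvalues_sketch(pvals, "bh")
```

Here Bonferroni flags only "age" (0.001 < 0.05/4), while BH additionally flags "income" (0.02 < 2·0.05/4), illustrating why BH is less conservative at the cost of controlling FDR rather than FWER.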