Normalization
normalization
Stage 3: SHAP normalization — bucket-level mean SHAP and feature transformation.
Computes the mean SHAP value per bucket on the reference sample, then provides a transformation function that maps any feature value to its bucket's mean SHAP (the "SHAP normalization" step).
For empty buckets (no reference observations): creates synthetic observations by sampling real rows and placing the feature value inside the empty bucket, then computes SHAP on those synthetic observations.
compute_bucket_shap
compute_bucket_shap(bucket_sets: dict[str, BucketSet], X_ref: DataFrame, shap_values: ndarray, model: object | None = None, n_synthetic: int = 10, rng: Generator | None = None) -> dict[str, BucketSet]
Compute mean SHAP per bucket for all features.
For each feature j and bucket k, computes: mean_shap_j^k = mean(shap_j(x_i) for all i where x_ij in bucket k)
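The per-bucket mean above can be sketched for a single numeric feature as follows. This is a minimal illustration, not the library's implementation: the helper name `mean_shap_per_bucket`, the `edges` argument, and the right-open bucket convention are assumptions made for the example, and NaN handling is left out.

```python
import numpy as np


def mean_shap_per_bucket(x, shap_j, edges):
    """Mean SHAP per bucket for one numeric feature (hypothetical helper).

    x      : 1-D feature values, assumed free of NaN in this sketch
    shap_j : 1-D SHAP values for the same feature, aligned with x
    edges  : sorted interior cut points; bucket k is assumed to cover
             [edges[k-1], edges[k]), with open-ended first and last buckets
    """
    x = np.asarray(x, dtype=float)
    shap_j = np.asarray(shap_j, dtype=float)
    n_buckets = len(edges) + 1

    # Bucket index of every observation.
    idx = np.searchsorted(edges, x, side="right")

    # Per-bucket sum of SHAP values and per-bucket observation count.
    sums = np.bincount(idx, weights=shap_j, minlength=n_buckets)
    counts = np.bincount(idx, minlength=n_buckets)

    # Empty buckets are left as NaN here so the caller can decide how to fill them.
    means = np.full(n_buckets, np.nan)
    nonempty = counts > 0
    means[nonempty] = sums[nonempty] / counts[nonempty]
    return means, counts
```

Buckets with a count of zero come back as NaN in this sketch, which makes them easy to detect before applying an empty-bucket fallback.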
If a bucket has zero observations in X_ref:
- If model is provided: create n_synthetic observations by sampling real rows and setting the feature value to fall in the empty bucket, then compute SHAP on those synthetic observations.
- If model is None: assign mean_shap = 0.0 with a warning.
Args:
    bucket_sets: Dict of feature_name -> BucketSet (from build_all_buckets).
    X_ref: Reference DataFrame (n_ref x p).
    shap_values: SHAP values array of shape (n_ref, p).
    model: Trained model for computing SHAP on synthetic observations. If None, empty buckets get mean_shap = 0.0.
    n_synthetic: Number of synthetic observations to create for empty buckets.
    rng: Random number generator for reproducibility.
Returns: Dict of feature_name -> BucketSet with mean_shap populated on each Bucket.
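The empty-bucket fallback described above could look roughly like the sketch below. Several pieces are assumptions for illustration only: the helper name `synthetic_bucket_shap`, the `shap_fn` callable (standing in for whatever computes SHAP from the trained model), finite bucket bounds `low` and `high`, and drawing the replacement feature value uniformly inside the bucket. The documentation does not specify how the value is placed.

```python
import numpy as np


def synthetic_bucket_shap(X_ref, col, low, high, shap_fn, n_synthetic=10, rng=None):
    """Estimate the mean SHAP of column `col` for an empty bucket [low, high).

    shap_fn is a hypothetical callable mapping an (n, p) array to an
    (n, p) array of SHAP values; it is not part of the documented API.
    """
    rng = np.random.default_rng() if rng is None else rng
    X_ref = np.asarray(X_ref, dtype=float)

    # Sample real rows (with replacement) so the other features stay realistic.
    rows = rng.integers(0, X_ref.shape[0], size=n_synthetic)
    X_syn = X_ref[rows].copy()

    # Assumption: place the target feature uniformly inside the empty bucket.
    X_syn[:, col] = rng.uniform(low, high, size=n_synthetic)

    # Average the SHAP values of the target feature over the synthetic rows.
    return float(shap_fn(X_syn)[:, col].mean())
```

Passing a seeded `np.random.default_rng(...)` as `rng` makes the sampled rows and the placed values reproducible, which mirrors the purpose of the `rng` argument in the signature above.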
Source code in src/swift/normalization.py
transform_feature
transform_feature(values: ndarray | Series, bucket_set: BucketSet) -> ndarray
Map feature values to their bucket's mean SHAP value.
This is the SHAP transformation sigma_j defined in the paper: sigma_j(x_ij) = mean_shap_j^{bucket(x_ij)}
Uses vectorized numpy operations for performance on large arrays:
- Identifies NaN positions and maps them to the null bucket.
- For numeric values, uses np.searchsorted on sorted decision points to assign bucket indices in O(n log k) time.
- Falls back to element-wise assignment for categorical buckets.
Args:
    values: 1-D array or Series of feature values (may contain NaN).
    bucket_set: BucketSet with mean_shap populated on each bucket.
Returns: 1-D numpy array of transformed values (same length as input).
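For numeric features, the NaN handling and np.searchsorted lookup described above can be sketched as follows. The function operates on plain arrays rather than a BucketSet; the name `transform_numeric` and the parameters `edges`, `bucket_means`, and `null_mean` are hypothetical, and the right-open boundary convention is an assumption.

```python
import numpy as np


def transform_numeric(values, edges, bucket_means, null_mean=0.0):
    """Map numeric values to their bucket's mean SHAP (hypothetical helper).

    edges        : sorted interior decision points (length k - 1)
    bucket_means : mean SHAP per bucket (length k)
    null_mean    : mean SHAP assigned to NaN inputs (the null bucket)
    """
    values = np.asarray(values, dtype=float)
    bucket_means = np.asarray(bucket_means, dtype=float)
    out = np.empty(values.shape, dtype=float)

    # NaN positions go to the null bucket.
    is_nan = np.isnan(values)
    out[is_nan] = null_mean

    # Binary search assigns a bucket index to each remaining value.
    idx = np.searchsorted(edges, values[~is_nan], side="right")
    out[~is_nan] = bucket_means[idx]
    return out
```

With `side="right"`, a value equal to a decision point falls into the bucket to its right; the actual boundary rule depends on how the buckets were built.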
Source code in src/swift/normalization.py