What if data goes missing or my model has drifted?

Two failure modes are invisible at training time but matter the moment the model is in production: a whole concept's data going missing (vendor outage, upstream feed cutoff, geography change), and concept-level contributions shifting across periods even when overall metrics look stable.

When to use this

  • Before promoting a model to production, to write the operational-risk row "what if concept X goes missing?".
  • After a vendor outage or feed change, to quantify what the model actually lost.
  • Quarterly (or whenever the model is rescored on fresh data), to monitor concept-level drift before the global AUC tells you something is wrong.

The two views

Function: auc_drop + auc_drop_map
  Returns: per-concept AUC drop (or any metric drop) under ablation
  Use for: "What does the model lose if concept X disappears?"

Function: attribution_drift + concept_drift_lines + concept_drift_sunburst
  Returns: per-(period, concept) mean|SHAP|; line + delta-sunburst views
  Use for: "Across periods, where is the concept-level drift?"

Ablation: three strategies

auc_drop answers "how much performance do I lose if a whole concept's data goes missing?" with three different definitions of "missing", each trading fidelity against cost.

permutation (default)

For each concept, shuffle the values of its features across rows and re-score on the held-out set, repeated n_repeats times.

  • Cheap — no retraining.
  • Model-agnostic — works for any model with predict_proba / decision_function.
  • The model still expects the features to be there; it just sees decorrelated values. This is an upper bound on "what would happen if the data became uninformative", not "what would happen if the column was deleted".
auc_drop(graph, model, X_test, y_test,
         feature_names=X_test.columns.tolist(),
         strategy="permutation",
         n_repeats=10, random_state=42)

Output columns: auc_drop_mean, auc_drop_std (across repeats), ablated_score_mean, baseline_score, feature_count, strategy.
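
The idea behind the permutation strategy can be sketched in a few lines of NumPy. This is not the package's implementation: the helper names (auc, permutation_drop), the stand-in scoring function and the toy concepts are all hypothetical, and the sketch assumes that a concept's columns are shuffled together with one shared row permutation per repeat.

```python
import numpy as np

def auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly; assumes no ties."""
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    return float((pos[:, None] > neg[None, :]).mean())

def permutation_drop(score_fn, X, y, cols, n_repeats=10, seed=42):
    """Mean and std of the AUC drop when the columns in `cols` are row-shuffled."""
    rng = np.random.default_rng(seed)
    baseline = auc(y, score_fn(X))
    drops = []
    for _ in range(n_repeats):
        Xp = X.copy()
        perm = rng.permutation(len(X))
        # Assumption: one shared permutation for the whole concept, so the
        # concept's columns stay consistent with each other but lose their
        # link to the target and to the other concepts.
        Xp[:, cols] = X[perm][:, cols]
        drops.append(baseline - auc(y, score_fn(Xp)))
    return float(np.mean(drops)), float(np.std(drops))

# Toy data: columns 0 and 1 carry signal, column 2 is ignored by the model.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=500) > 0).astype(int)
score_fn = lambda data: data[:, 0] + data[:, 1]   # stand-in for a fitted model

concepts = {"signal": [0, 1], "unused": [2]}      # hypothetical concept -> column indices
drops = {name: permutation_drop(score_fn, X, y, cols) for name, cols in concepts.items()}
```

The model is never refit: the "unused" concept shows a drop of exactly zero because the scoring function never reads its column, while the "signal" concept loses most of its discriminative power.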

retrain

For each concept, drop its features from the training set, retrain, score on the held-out set with the same columns dropped.

  • Most faithful to "this column will not exist in production".
  • Most expensive: one retrain per concept.
  • Requires a user-supplied train_fn(X_train, y_train) -> fitted_estimator callable so the package can rebuild the model exactly the way you want.
def train_lgb(X, y):
    m = lgb.LGBMClassifier(n_estimators=200, random_state=42, verbose=-1)
    m.fit(X, y)
    return m

auc_drop(graph, model, X_test, y_test,
         feature_names=X_test.columns.tolist(),
         strategy="retrain",
         train_fn=train_lgb,
         X_train=X_train, y_train=y_train)

auc_drop_std is NaN (single retrain, no spread across repeats).
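
To see why retraining can give a different answer from shuffling, here is a minimal NumPy sketch of the drop-columns-and-refit loop. The helper names (auc, train_linear, retrain_drop) are hypothetical, the least-squares scorer stands in for whatever train_fn you supply, and the callable it returns is a simplification of a fitted estimator.

```python
import numpy as np

def auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly; assumes no ties."""
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    return float((pos[:, None] > neg[None, :]).mean())

def train_linear(X, y):
    """Hypothetical train_fn: least-squares linear scorer, returned as a callable."""
    A = np.column_stack([X, np.ones(len(X))])
    w, *_ = np.linalg.lstsq(A, y.astype(float), rcond=None)
    return lambda X_new: np.column_stack([X_new, np.ones(len(X_new))]) @ w

def retrain_drop(train_fn, X_train, y_train, X_test, y_test, cols):
    """AUC drop after removing `cols` from both train and test and refitting."""
    keep = [j for j in range(X_train.shape[1]) if j not in cols]
    full = train_fn(X_train, y_train)
    ablated = train_fn(X_train[:, keep], y_train)
    return auc(y_test, full(X_test)) - auc(y_test, ablated(X_test[:, keep]))

# Toy data: column 1 is a noisy copy of column 0, column 2 is unrelated noise.
rng = np.random.default_rng(0)
n = 2000
x0 = rng.normal(size=n)
X = np.column_stack([x0, x0 + 0.1 * rng.normal(size=n), rng.normal(size=n)])
y = (x0 + 0.3 * rng.normal(size=n) > 0).astype(int)
X_train, X_test, y_train, y_test = X[:1000], X[1000:], y[:1000], y[1000:]

drop_primary = retrain_drop(train_linear, X_train, y_train, X_test, y_test, cols=[0])
drop_both = retrain_drop(train_linear, X_train, y_train, X_test, y_test, cols=[0, 1])
```

Dropping column 0 alone costs very little here because the refit model falls back on its near-duplicate; dropping both removes the signal entirely. A refit can recover from redundancy in a way that a frozen model cannot, which is what makes this strategy the closest to "the column no longer exists".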

shap_marginal

Cheapest. Subtract the concept's SHAP contributions from the prediction logits, apply the link function, re-score.

  • Almost free once you have SHAP values.
  • An approximation under SHAP additivity. Treats each concept's contribution as independently subtractable, which is exact only for additive models.
  • Useful as a sanity check or for fast iteration; not for regulatory submissions.
auc_drop(graph, model, X_test, y_test,
         feature_names=X_test.columns.tolist(),
         strategy="shap_marginal",
         shap_values=shap_values,
         base_predictions=p_test)
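
The subtraction step can be illustrated on a model where it is exact. The sketch below assumes a logistic link and a purely linear logit, for which the SHAP value of feature j under feature independence is w_j * (x_j - mean(x_j)). The helper names are hypothetical, and the sketch takes logits directly rather than converting from probabilities.

```python
import numpy as np

def auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly; assumes no ties."""
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    return float((pos[:, None] > neg[None, :]).mean())

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def shap_marginal_drop(y, logits, shap_values, cols):
    """AUC drop after subtracting the concept's SHAP contributions in logit space."""
    ablated_logits = logits - shap_values[:, cols].sum(axis=1)
    # The link matters for metrics such as log loss; AUC only uses the ranking.
    return auc(y, sigmoid(logits)) - auc(y, sigmoid(ablated_logits))

# Toy linear-logit model: column 2 has zero weight.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
w = np.array([2.0, 1.0, 0.0])
logits = X @ w
y = (logits + rng.normal(size=1000) > 0).astype(int)

# SHAP values of a linear model under feature independence.
shap_values = w * (X - X.mean(axis=0))

drop_strong = shap_marginal_drop(y, logits, shap_values, cols=[0])
drop_unused = shap_marginal_drop(y, logits, shap_values, cols=[2])
```

For a tree ensemble or any model with interactions, the same subtraction is only an approximation, which is why this strategy is best treated as a sanity check.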

Picking a strategy

Use case and recommended strategy:

  • Quick iteration, "is this branch important?": permutation
  • Sanity check, you already have SHAP values: shap_marginal
  • Regulatory submission, "what happens if Equifax goes down?": retrain
  • Comparison study (run all three, look for disagreements): all three with the same graph

A common pattern: run permutation first to identify the top-k concepts, then retrain only on those k.

Custom scoring metrics

auc_drop accepts any sklearn-compatible scoring callable via the metric argument.

from sklearn.metrics import log_loss

auc_drop(..., metric="roc_auc")                  # built-in
auc_drop(..., metric=lambda y, p: -log_loss(y, p))  # any callable

The function name (auc_drop) is historical; the implementation is metric-agnostic.

What skip_root does

The root concept covers every feature; ablating it means "the model gets pure noise / nothing", which by definition tanks the score.

skip_root=True (the default) leaves the root row in the DataFrame with auc_drop_mean = NaN so it does not pollute the colour scale of auc_drop_map. The structural columns (feature_count, etc.) are still populated — required so the parent sunburst sector is not smaller than its children.

Realism — does this scenario actually happen?

auc_drop tells you what would happen if a concept went missing. To answer how often that actually happens, pair with joint_missing_rate:

import pandas as pd

drop = auc_drop(graph, model, X_test, y_test,
                feature_names=X_test.columns.tolist(),
                strategy="permutation").reset_index()
jmr  = joint_missing_rate(graph, X_test).reset_index()

risk = (drop[["path", "auc_drop_mean"]]
        .merge(jmr[["path", "joint_missing_rate"]], on="path")
        .assign(realism_weighted=lambda d:
                d["auc_drop_mean"] * d["joint_missing_rate"]))

The package deliberately does not bake realism_weighted into auc_drop (see decision D2 in the roadmap decision log) — different teams want different fusion rules, and the join is one line.

Figure: AUC drop per concept (permutation)

Drift across periods

attribution_drift aggregates SHAP per (period, concept) and returns a long-form DataFrame the drift plots consume. Periods can be anything categorical: calendar quarters, model versions, geographies, a before/after-launch flag.

from concept_graph_xai import (
    attribution_drift, concept_drift_lines, concept_drift_sunburst,
)

dr = attribution_drift(graph, feature_names, shap_values,
                       periods="period_label", X=X)

# Line chart — "did each concept's contribution stay stable?"
concept_drift_lines(dr, max_concepts=8).show()

# Delta sunburst — "where in the tree is the drift concentrated?"
concept_drift_sunburst(graph, dr,
                       baseline="2024Q1", target="2024Q4").show()
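
The aggregation that feeds these plots can be sketched with pandas. This is an illustration rather than the package's code: the column names, the concept names, and the choice to sum absolute SHAP values over a concept's features before averaging over rows are all assumptions.

```python
import numpy as np
import pandas as pd

def concept_mean_abs_shap(shap_values, periods, concepts):
    """Long-form frame with one row per (period, concept).

    Assumption: a row's concept contribution is the sum of |SHAP| over the
    concept's features, and the reported value is its mean over the period.
    """
    periods = pd.Series(periods)
    rows = []
    for period in periods.unique():
        mask = (periods == period).to_numpy()
        for name, cols in concepts.items():
            value = np.abs(shap_values[mask][:, cols]).sum(axis=1).mean()
            rows.append({"period": period, "concept": name,
                         "mean_abs_shap": float(value)})
    return pd.DataFrame(rows)

# Toy SHAP matrix: 4 rows, 3 features, two periods.
toy_shap = np.array([
    [1.0, -1.0, 0.5],
    [2.0, 0.0, -0.5],
    [0.0, 3.0, 1.5],
    [-1.0, 1.0, 0.5],
])
toy_periods = ["2024Q1", "2024Q1", "2024Q4", "2024Q4"]
toy_concepts = {"bureau": [0, 1], "income": [2]}   # hypothetical concept -> column indices

toy_dr = concept_mean_abs_shap(toy_shap, toy_periods, toy_concepts)
```

In this toy frame, "bureau" moves from 2.0 to 2.5 and "income" from 0.5 to 1.0 between the two periods.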

Figure: Concept drift across periods (lines)

Figure: Concept SHAP drift delta sunburst

Reading the lines

One line per concept, x = period in stable order, y = mean|SHAP| for that concept in that period. Lines are sorted by their maximum across periods so the most prominent concepts are listed first; max_concepts= caps the chart at the top K to avoid spaghetti.

Reading the delta sunburst

Sector area = the concept's total contribution. Colour = signed change, target - baseline (diverging palette). Red sectors became more important; blue sectors became less important. Pair with the line chart: the sunburst shows where the drift is, the lines show how steadily it developed.
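
The colour value is a plain difference between two periods. A small pandas sketch, assuming a long-form frame with hypothetical column names period, concept and mean_abs_shap:

```python
import pandas as pd

# Hypothetical long-form drift frame; the real column names may differ.
drift_frame = pd.DataFrame({
    "period": ["2024Q1", "2024Q1", "2024Q4", "2024Q4"],
    "concept": ["bureau", "income", "bureau", "income"],
    "mean_abs_shap": [2.0, 0.5, 2.5, 0.25],
})

def signed_delta(frame, baseline, target):
    """Per-concept signed change: target minus baseline."""
    wide = frame.pivot(index="concept", columns="period", values="mean_abs_shap")
    return (wide[target] - wide[baseline]).rename("delta")

delta = signed_delta(drift_frame, baseline="2024Q1", target="2024Q4")
```

Here "bureau" has a positive delta (red, more important in the target period) and "income" a negative one (blue).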

What to do with the answer

  1. Define a retraining trigger in concept terms (e.g. "retrain if any top-3 concept's mean|SHAP| moves by more than X across two consecutive quarters"). This is what the drift sunburst is for.
  2. Cross-reference drift with the disparity heatmap from fairness: drift that lands in protected-group concepts is the most urgent kind.
  3. Quote ablation numbers weighted by joint-missing-rate when reporting them to operational risk; the raw AUC drop is a "what-if", the realism-weighted version is an "expected loss".
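
The retraining trigger from step 1 can be written as a small check. This is one possible reading of the rule, not a package feature: it ranks concepts by their maximum mean|SHAP| across periods and flags any top-k concept whose value changes by more than a threshold between two consecutive periods. The function name, concept names and numbers are hypothetical.

```python
def concepts_triggering_retrain(series_by_concept, threshold, top_k=3):
    """Return the top-k concepts whose mean|SHAP| jumps by more than `threshold`.

    series_by_concept maps concept -> list of mean|SHAP| values in
    chronological period order.
    """
    ranked = sorted(series_by_concept,
                    key=lambda c: max(series_by_concept[c]),
                    reverse=True)
    flagged = []
    for concept in ranked[:top_k]:
        values = series_by_concept[concept]
        # Absolute change between each pair of consecutive periods.
        jumps = [abs(b - a) for a, b in zip(values, values[1:])]
        if jumps and max(jumps) > threshold:
            flagged.append(concept)
    return flagged

history = {
    "bureau": [2.0, 2.1, 2.5, 2.6],
    "income": [0.5, 0.5, 0.6, 0.6],
    "geo": [1.0, 1.0, 1.05, 1.0],
    "device": [0.1, 0.9, 0.1, 0.1],
}

flagged_top3 = concepts_triggering_retrain(history, threshold=0.3, top_k=3)
flagged_top2 = concepts_triggering_retrain(history, threshold=0.3, top_k=2)
```

Whether the threshold should be absolute or relative, and whether a single jump is enough, is a policy choice the rule needs to state explicitly.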

Common pitfalls

  • Permutation can show negative AUC drops on noise. If auc_drop_mean is negative for some concept, the permutation improved the score, which usually means the feature was injecting noise the model latched onto. Worth investigating, not a bug.
  • Retrain identity. The retrain strategy will rebuild the model exactly as train_fn defines. If your training pipeline includes preprocessing (e.g. categorical encoders that learned categories from the training set), wrap that inside train_fn so the ablation actually reflects retraining-from-scratch.
  • Drift period order. Cast the periods column to a pd.CategoricalDtype with an explicit, ordered category list so that periods sort chronologically. First-seen ordering mis-sorts periods if the rows are shuffled.
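
For the period-order pitfall, a minimal pandas sketch of the cast follows. The column name period_label and the labels are made up for illustration; the labels are chosen so that neither first-seen nor alphabetical order is chronological.

```python
import pandas as pd

labels = ["Q2-2024", "Q4-2023", "Q1-2024", "Q4-2023"]   # shuffled rows
chronological = ["Q4-2023", "Q1-2024", "Q2-2024"]

# Explicit, ordered categories make the intended order part of the column.
period_dtype = pd.CategoricalDtype(categories=chronological, ordered=True)

df = pd.DataFrame({"period_label": labels})
df["period_label"] = df["period_label"].astype(period_dtype)

# First-seen order depends on how the rows happen to be arranged.
first_seen = list(dict.fromkeys(labels))

# Sorting the categorical column follows the declared category order.
sorted_labels = df["period_label"].sort_values().astype(str).tolist()
```

With the cast in place, sorting and grouping follow the declared order regardless of row order.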