Clustered Influence Functions
Cache curvature once, answer subset queries fast.
An amortized subset oracle for high-query influence workloads.
Department of Artificial Intelligence, Eötvös Loránd University, Budapest
TL;DR. Standard influence functions need one expensive inverse-curvature solve per subset query, so they break down on high-query workloads (large-$K$ cross-validation, repeated resampling, what-if analysis). CiF builds the cache once — by clustering training gradients and solving a damped Gauss–Newton system only for cluster means — and then answers any subset query by counting cluster memberships and linearly recombining cached responses. Per-query cost becomes constant; the curvature solve is paid once for the whole workload.
01Problem
Influence functions estimate the counterfactual effect of removing a training subset $S$ via the linearised parameter update
For a single query this is fine. For a workload of $Q$ subset queries — a $K$-fold cross-validation, an unlearning triage, a sweep over candidate poisoned groups — the cost scales as $Q\cdot t_{\mathrm{IHVP}}$, which becomes prohibitive in exactly the regimes where influence is most useful.
02Method
The observation is structural: every subset query enters $\tilde H^{-1}$ only through the gradient sum $\sum_{i\in S} g_i$. That is the operand. CiF compresses the operand space rather than the curvature operator, and so retains exact damped-GGN semantics.
Offline (build once)
Sketch per-example gradients with a Johnson–Lindenstrauss map $\Pi$, cluster the $N$ sketches into $C$ groups, take the full-space mean of each cluster, and solve the damped GGN system $\tilde H v_c = \mu_c$ once per cluster. Cache $\{v_c\}_{c=1}^C \subset \mathbb{R}^p$.
Online (reuse many)
For any subset $S$, count how many of its members fall in each cluster, $m_c(S)$, and recombine:
For fixed evaluation sets, precompute $\alpha_c(V) = g_V^\top v_c$ once and the per-query cost drops to $O(|S| + C)$ scalar work.
Break-even is structural: CiF beats per-query IF as soon as $Q > C$. Once the cache is built, the marginal cost of additional queries effectively vanishes.
03Headline Result
Fold-level agreement with per-query IF, on a combined $K\!\in\!\{50, 100\}$ workload (150 fold queries total). CiF builds its cache once and reuses it across both fold counts.
| Model | $C$ | IF time | CiF time | Speedup | $\rho$ (K=50) | $\rho$ (K=100) |
|---|---|---|---|---|---|---|
| LeNet | 10 | 32.1 h | 2.1 h | 15.3× | 0.934 | 0.940 |
| AlexNet | 10 | 63.3 h | 5.4 h | 11.7× | 0.987 | 0.992 |
| ResNet-18 | 50 | 75.0 h | 25.4 h | 3.0× | 0.946 | 0.941 |
04Theory
Approximation error against per-query IF decomposes cleanly into a geometry term and a numerics term:
Theorem 3.1 (informal). $\lambda$ is the GGN damping; $\mathrm{tr}(\bar\Sigma)$ the within-cluster scatter; $\rho_c$ the relative CG residual.
This is also a usage rubric. Tighter fidelity needed? Increase $C$ (smaller $\mathrm{tr}(\bar\Sigma)$) or tighten the CG tolerance (smaller $\rho_c$). Faster cache build? Decrease $C$ or loosen $\rho_c$. The trade-off is exposed by two knobs and bounded in closed form.
05Is CiF the right tool for you?
Yes:
- Many subset queries against a fixed model snapshot ($Q \gg C$).
- Large-$K$ cross-validation, repeated resampling, fold sweeps.
- Group-removal triage and unlearning candidates.
- Workloads where the curvature proxy is reused over a long horizon.
No:
- One-off queries — a single IHVP suffices.
- Rapidly-changing models where cache invalidation dominates amortization.
- Cases where gradient heterogeneity is so high that no coarse partition tracks IF.
- Memory-tight settings where storing $C$ vectors in $\mathbb{R}^p$ is infeasible.
06Poster
ICML 2026 poster. Click for full size.
07Cite
@inproceedings{bado2026clustered
title = {Clustered Influence Functions},
author = {Bad{\'o}, Mikl{\'o}s M{\'a}t{\'e} and Fenech, Kristian},
booktitle = {Proceedings of the 43rd International Conference on
Machine Learning (ICML)},
year = {2026}
}