Accepted · ICML 2026 · Scalable Algorithms · July 2026

Clustered Influence Functions

Cache curvature once, answer subset queries fast.
An amortized subset oracle for high-query influence workloads.

Department of Artificial Intelligence, Eötvös Loránd University, Budapest

TL;DR. Standard influence functions need one expensive inverse-curvature solve per subset query, so they break down on high-query workloads (large-$K$ cross-validation, repeated resampling, what-if analysis). CiF builds the cache once — by clustering training gradients and solving a damped Gauss–Newton system only for cluster means — and then answers any subset query by counting cluster memberships and linearly recombining cached responses. Per-query cost becomes constant; the curvature solve is paid once for the whole workload.

01Problem

Influence functions estimate the counterfactual effect of removing a training subset $S$ via the linearised parameter update

$$\Delta\theta_S \;\approx\; \tfrac{1}{N}\,\tilde H^{-1}\!\sum_{i\in S} g_i.$$

For a single query this is fine. For a workload of $Q$ subset queries — a $K$-fold cross-validation, an unlearning triage, a sweep over candidate poisoned groups — the cost scales as $Q\cdot t_{\mathrm{IHVP}}$, which becomes prohibitive in exactly the regimes where influence is most useful.

02Method

The observation is structural: every subset query enters $\tilde H^{-1}$ only through the gradient sum $\sum_{i\in S} g_i$. That is the operand. CiF compresses the operand space rather than the curvature operator, and so retains exact damped-GGN semantics.

Offline (build once)

Sketch per-example gradients with a Johnson–Lindenstrauss map $\Pi$, cluster the $N$ sketches into $C$ groups, take the full-space mean of each cluster, and solve the damped GGN system $\tilde H v_c = \mu_c$ once per cluster. Cache $\{v_c\}_{c=1}^C \subset \mathbb{R}^p$.

Online (reuse many)

For any subset $S$, count how many of its members fall in each cluster, $m_c(S)$, and recombine:

$$\widehat{\Delta\theta}_S \;=\; \frac{1}{N}\sum_{c=1}^{C} m_c(S)\, v_c.$$

For fixed evaluation sets, precompute $\alpha_c(V) = g_V^\top v_c$ once and the per-query cost drops to $O(|S| + C)$ scalar work.

Break-even is structural: CiF beats per-query IF as soon as $Q > C$. Once the cache is built, the marginal cost of additional queries effectively vanishes.

03Headline Result

Fold-level agreement with per-query IF, on a combined $K\!\in\!\{50, 100\}$ workload (150 fold queries total). CiF builds its cache once and reuses it across both fold counts.

CIFAR-10. Speedup is IF runtime divided by CiF runtime; $\rho$ is Spearman rank correlation with per-query IF (95% bootstrap CI in the paper).
Model	$C$	IF time	CiF time	Speedup	$\rho$ (K=50)	$\rho$ (K=100)
LeNet	10	32.1 h	2.1 h	15.3×	0.934	0.940
AlexNet	10	63.3 h	5.4 h	11.7×	0.987	0.992
ResNet-18	50	75.0 h	25.4 h	3.0×	0.946	0.941

04Theory

Approximation error against per-query IF decomposes cleanly into a geometry term and a numerics term:

$$\mathbb{E}\bigl\|\Delta\theta_S - \widehat{\Delta\theta}_S\bigr\|_2 \;\le\; \underbrace{\frac{1}{\lambda N}\sqrt{\frac{m(N-m)}{N-1}\,\mathrm{tr}(\bar\Sigma)}}_{\text{clustering scatter}} \;+\; \underbrace{\frac{m}{\lambda N}\sum_c \frac{n_c}{N}\,\rho_c\|\mu_c\|_2}_{\text{solver residual}}.$$

Theorem 3.1 (informal). $\lambda$ is the GGN damping; $\mathrm{tr}(\bar\Sigma)$ the within-cluster scatter; $\rho_c$ the relative CG residual.

This is also a usage rubric. Tighter fidelity needed? Increase $C$ (smaller $\mathrm{tr}(\bar\Sigma)$) or tighten the CG tolerance (smaller $\rho_c$). Faster cache build? Decrease $C$ or loosen $\rho_c$. The trade-off is exposed by two knobs and bounded in closed form.

05Is CiF the right tool for you?

Yes:

Many subset queries against a fixed model snapshot ($Q \gg C$).
Large-$K$ cross-validation, repeated resampling, fold sweeps.
Group-removal triage and unlearning candidates.
Workloads where the curvature proxy is reused over a long horizon.

No:

One-off queries — a single IHVP suffices.
Rapidly-changing models where cache invalidation dominates amortization.
Cases where gradient heterogeneity is so high that no coarse partition tracks IF.
Memory-tight settings where storing $C$ vectors in $\mathbb{R}^p$ is infeasible.

06Poster

ICML 2026 poster. Click for full size.

ICML 2026 poster for Clustered Influence Functions

Poster presented at ICML 2026

07Cite

@inproceedings{bado2026clustered
  title     = {Clustered Influence Functions},
  author    = {Bad{\'o}, Mikl{\'o}s M{\'a}t{\'e} and Fenech, Kristian},
  booktitle = {Proceedings of the 43rd International Conference on
               Machine Learning (ICML)},
  year      = {2026}
}